
Boolean equivalent of pandas to_numeric()

I am searching for the boolean equivalent of pandas to_numeric(). I want the function to convert a column to True/False/NaN if at all possible, and throw an error if not.

My motivation is that I need to automatically recognize and convert boolean columns in a dataset with ~1000 columns. I can do a similar thing with floats/integers using the following code:

df = df_raw.apply(pd.to_numeric, errors='ignore')
Selah asked Nov 22 '25 09:11


2 Answers

Since pd.to_numeric is primarily used to convert strings to numeric values, I'm going to work under the assumption that you want to convert strings of literal boolean values.

Consider the dataframe df

import pandas as pd

df = pd.DataFrame([
        ['1', None, 'True'],
        ['False', 2, True]
    ])

print(df)

       0    1     2
0      1  NaN  True
1  False  2.0  True

My Choice
This is what I'd propose. Further below, I break it down in an attempt to explain what is going on.

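from ast import literal_eval

import numpy as np

def try_eval2(x):
    # Parse strings such as 'True' or '1' into Python values;
    # strings that cannot be parsed become NaN.
    if type(x) is str:
        try:
            x = literal_eval(x)
        except Exception:
            x = np.nan

    # Keep only genuine booleans; everything else becomes NaN.
    if type(x) is not bool:
        x = np.nan

    return x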

vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)

       0    1     2
0    NaN  NaN  True
1  False  NaN  True

Timing
My proposed solution is fast on this small frame: 149 µs, versus about 1.3 ms for the two alternatives below.

%%timeit
vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
10000 loops, best of 3: 149 µs per loop

%timeit df.astype(str).applymap(to_boolean)
1000 loops, best of 3: 1.28 ms per loop

%timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1000 loops, best of 3: 1.27 ms per loop

Explanation

Step 1
Now I'll create a simple function using ast.literal_eval to convert strings to values

from ast import literal_eval

def try_eval(x):
    try:
        x = literal_eval(x)
    except:
        pass
    return x
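
To see what this does to individual values, here is a small self-contained sketch. It repeats try_eval from above, with `except Exception` swapped in for the bare `except` (my change, not part of the original), and runs it on a few sample inputs that I picked for illustration.

```python
from ast import literal_eval


def try_eval(x):
    # Same idea as above: parse if possible, otherwise return the input unchanged.
    try:
        return literal_eval(x)
    except Exception:
        return x


# Strings that are valid Python literals get parsed; everything else
# (non-literal strings, numbers, None) passes through untouched.
samples = ['True', 'False', '1', 'hello', 2.0, None]
results = [try_eval(s) for s in samples]
print(results)
# [True, False, 1, 'hello', 2.0, None]
```

Note that '1' becomes the integer 1, not True, which is why a separate type check is still needed afterwards.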

Step 2
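Use applymap with my new function. The printed frame looks the same, but the strings have been parsed into Python values (for example, 'False' is now the boolean False).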

d1 = df.applymap(try_eval)
print(d1)

       0    1     2
0      1  NaN  True
1  False  2.0  True

Step 3
Use where and applymap again to find where values are actually bool

d2 = d1.where(d1.applymap(type).eq(bool))
print(d2)

       0   1     2
0    NaN NaN  True
1  False NaN  True

Step 4
You can drop columns with all NaN

print(d2.dropna(axis=1, how='all'))

       0     2
0    NaN  True
1  False  True
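
The question also asks for an error when a column cannot be converted, whereas the steps above silently produce NaN. Here is a minimal sketch of a stricter, column-wise variant. The names to_bool_series, convert_bool_columns and BOOL_LITERALS are hypothetical helpers I made up for illustration, not pandas functions, and the sketch assumes unique column names and that only the strings 'True' and 'False' count as boolean literals.

```python
import numpy as np
import pandas as pd

# Assumed set of accepted string literals; extend it for your data.
BOOL_LITERALS = {'True': True, 'False': False}


def _is_missing(x):
    # None, pd.NA, or a float NaN (NaN is the only float not equal to itself).
    return x is None or x is pd.NA or (isinstance(x, float) and x != x)


def to_bool_series(col):
    """Hypothetical helper: return col as True/False/NaN or raise ValueError."""
    out = []
    for x in col:
        if isinstance(x, (bool, np.bool_)):
            out.append(bool(x))
        elif isinstance(x, str) and x in BOOL_LITERALS:
            out.append(BOOL_LITERALS[x])
        elif _is_missing(x):
            out.append(np.nan)
        else:
            raise ValueError(f'cannot convert {x!r} to boolean')
    return pd.Series(out, index=col.index, name=col.name, dtype=object)


def convert_bool_columns(df):
    """Convert every column that is entirely boolean-like; leave the rest as is."""
    converted = {}
    for name in df.columns:
        try:
            converted[name] = to_bool_series(df[name])
        except ValueError:
            converted[name] = df[name]
    return pd.DataFrame(converted, index=df.index)


# Made-up sample data: 'a' and 'b' are boolean-like, 'c' is not.
df = pd.DataFrame({
    'a': ['True', 'False', None],
    'b': [True, 'False', 'True'],
    'c': ['1', 'x', 'True'],
}, dtype=object)

result = convert_bool_columns(df)
print(result)
```

Calling to_bool_series directly gives the "convert or raise" behavior, while convert_bool_columns plays the role of the errors='ignore' loop from the question by leaving unconvertible columns untouched.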
piRSquared answered Nov 23 '25 23:11


Use replace to turn the strings 'True' and 'False' into booleans, then where to set every value that is not a boolean to NaN:

df = df.replace({'True':True,'False':False})
df = df.where(df.applymap(type) == bool)
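
Here is a self-contained sketch of the same two steps on piRSquared's sample. Two things are assumptions on my part: the frame is built with dtype=object so every column can hold mixed values, and the type test uses apply with Series.map instead of applymap, which recent pandas versions deprecate in favor of DataFrame.map.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([
        ['1', None, 'True'],
        ['False', 2, True]
    ], dtype=object)

# Map the literal strings to real booleans; other values are left alone.
df = df.replace({'True': True, 'False': False})

# True where the cell holds a boolean, False everywhere else.
is_bool = df.apply(lambda col: col.map(lambda x: isinstance(x, (bool, np.bool_))))

# Keep the booleans and set every other cell to NaN.
result = df.where(is_bool)
print(result)
```

The isinstance check also accepts NumPy booleans, in case pandas stores a fully converted column with a bool dtype.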

Old solution (very slow):

If the frame already contains some booleans, first cast everything to strings with astype(str), then use applymap with a custom function built on ast.literal_eval to convert:

from ast import literal_eval

def to_boolean(x):
    try:
        x = literal_eval(x)
        if type(x) == bool:
            return x
        else:
            return np.nan
    except:
        x = np.nan
    return x

print (df.astype(str).applymap(to_boolean))
# using the sample from piRSquared
       0   1     2
0    NaN NaN  True
1  False NaN  True

Timings:

In [76]: %timeit (jez(df))
1 loop, best of 3: 488 ms per loop

In [77]: %timeit (jez2(df))
1 loop, best of 3: 527 ms per loop

#piRSquared fastest solution
In [78]: %timeit (pir(df))
1 loop, best of 3: 5.42 s per loop

#maxu solution
In [79]: %timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1 loop, best of 3: 1.88 s per loop

#jezrael old solution
In [80]: %timeit df.astype(str).applymap(to_boolean)
1 loop, best of 3: 13.3 s per loop

Code for timings:

from ast import literal_eval

import numpy as np
import pandas as pd

df = pd.DataFrame([
        ['True', False, '1', 0, None, 5.2],
        ['False', True, '0', 1, 's', np.nan]])

#[20000 rows x 60 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df = pd.concat([df]*10, axis=1).reset_index(drop=True)
df.columns = pd.RangeIndex(len(df.columns))
#print (df)

def to_boolean(x):
    try:
        x = literal_eval(x)
        if type(x) == bool:
            return x
        else:
            return np.nan
    except:
        x = np.nan
    return x


def try_eval2(x):
    if type(x) is str:
        try:
            x = literal_eval(x)
        except:
            x = np.nan

    if type(x) is not bool:
        x = np.nan

    return x

def pir(df):
    vals = df.values
    v = vals.ravel()
    a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
    df2 = pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
    return (df2)

def jez(df):
    df = df.replace({'True':True,'False':False})
    df = df.where(df.applymap(type) == bool)
    return (df)

def jez2(df):
    df = df.replace({'True':True,'False':False})
    df = df.where(df.applymap(type).eq(bool))
    return (df)
jezrael answered Nov 23 '25 23:11