I am searching for the boolean equivalent of pandas to_numeric() I want the function to convert a column to True/False/nan if at all possible, and if not throw an error.
My motivation is that I need to auto recognize and convert boolean columns in a dataset with ~1000 columns. I can do a similar things with floats/integers using the following code:
df = df_raw.apply(pd.to_numeric, errors='ignore')
Since pd.to_numeric is primarily used to convert strings to numeric values, I'm going to work under the assumption that you want to convert strings of literal boolean values.
Consider the dataframe df
df = pd.DataFrame([
['1', None, 'True'],
['False', 2, True]
])
print(df)
0 1 2
0 1 NaN True
1 False 2.0 True
My Choice
This is what I'd propose. Further below, I break it down in an attempt to explain what is going on.
def try_eval2(x):
if type(x) is str:
try:
x = literal_eval(x)
except:
x = np.nan
if type(x) is not bool:
x = np.nan
return x
vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
0 1 2
0 NaN NaN True
1 False NaN True
Timing
You'll notice that my proposed solution is pretty fast
%%timeit
vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
10000 loops, best of 3: 149 µs per loop
%timeit df.astype(str).applymap(to_boolean)
1000 loops, best of 3: 1.28 ms per loop
%timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1000 loops, best of 3: 1.27 ms per loop
Step 1
Now I'll create a simple function using ast.literal_eval to convert strings to values
from ast import literal_eval
def try_eval(x):
try:
x = literal_eval(x)
except:
pass
return x
Step 2
applymap with my new function. It's going to look the same!
d1 = df.applymap(try_eval)
print(d1)
0 1 2
0 1 NaN True
1 False 2.0 True
Step 3
Use where and applymap again to find where values are actually bool
d2 = d1.where(d1.applymap(type).eq(bool))
print(d2)
0 1 2
0 NaN NaN True
1 False NaN True
Step 4
You can drop columns with all NaN
print(d2.dropna(1, 'all'))
0 2
0 NaN True
1 False True
You need replace with where where replace to NaN all not boolean:
df = df.replace({'True':True,'False':False})
df = df.where(df.applymap(type) == bool)
Old solution (very slow):
You can astype to strings if some boolean in df, applymap with custom function and ast.literal_eval for converting:
from ast import literal_eval
def to_boolean(x):
try:
x = literal_eval(x)
if type(x) == bool:
return x
else:
return np.nan
except:
x = np.nan
return x
print (df.astype(str).applymap(to_boolean))
#with borrowing sample from piRSquared
0 1 2
0 NaN NaN True
1 False NaN True
Timings:
In [76]: %timeit (jez(df))
1 loop, best of 3: 488 ms per loop
In [77]: %timeit (jez2(df))
1 loop, best of 3: 527 ms per loop
#piRSquared fastest solution
In [78]: %timeit (pir(df))
1 loop, best of 3: 5.42 s per loop
#maxu solution
In [79]: %timeit df.astype(str).stack().map({'True':True, 'False':False}).unstack()
1 loop, best of 3: 1.88 s per loop
#jezrael ols solution
In [80]: %timeit df.astype(str).applymap(to_boolean)
1 loop, best of 3: 13.3 s per loop
Code for timings:
df = pd.DataFrame([
['True', False, '1', 0, None, 5.2],
['False', True, '0', 1, 's', np.nan]])
#[20000 rows x 60 columns]
df = pd.concat([df]*10000).reset_index(drop=True)
df = pd.concat([df]*10, axis=1).reset_index(drop=True)
df.columns = pd.RangeIndex(len(df.columns))
#print (df)
def to_boolean(x):
try:
x = literal_eval(x)
if type(x) == bool:
return x
else:
return np.nan
except:
x = np.nan
return x
def try_eval2(x):
if type(x) is str:
try:
x = literal_eval(x)
except:
x = np.nan
if type(x) is not bool:
x = np.nan
return x
def pir(df):
vals = df.values
v = vals.ravel()
a = np.array([try_eval2(x) for x in v.tolist()], dtype=object)
df2 = pd.DataFrame(a.reshape(vals.shape), df.index, df.columns)
return (df2)
def jez(df):
df = df.replace({'True':True,'False':False})
df = df.where(df.applymap(type) == bool)
return (df)
def jez2(df):
df = df.replace({'True':True,'False':False})
df = df.where(df.applymap(type).eq(bool))
return (df)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With