I have a dataframe looking like this:
id k1 k2 same
1 re_setup oo_setup true
2 oo_setup oo_setup true
3 alerting bounce false
4 bounce re_oversetup false
5 re_oversetup alerting false
6 alerting_s re_setup false
7 re_oversetup oo_setup true
8 alerting bounce false
So, I need to classify rows by whether the string 'setup' is contained in both key columns or not.
And the expected output would be:
id k1 k2 same
1 re_setup oo_setup true
2 oo_setup oo_setup true
3 alerting bounce false
4 bounce re_oversetup false
5 re_oversetup alerting false
6 alerting_s re_setup false
7 re_oversetup oo_setup true
8 alerting bounce false
I've tried something like this, but as I expected, I get an error when selecting multiple columns:
data['same'] = data[data['k1', 'k2'].str.contains('setup')==True]
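As an aside (not from the original post): `data['k1', 'k2']` looks up a single column labelled with the tuple `('k1', 'k2')`, which raises a KeyError, and even selecting both columns with a list yields a DataFrame, which has no `.str` accessor. A minimal sketch of both failure modes, using made-up sample values:

```python
import pandas as pd

data = pd.DataFrame({
    "id": [1, 2],
    "k1": ["re_setup", "alerting"],
    "k2": ["oo_setup", "bounce"],
})

# data['k1', 'k2'] looks up ONE column labelled with the tuple
# ('k1', 'k2'), which does not exist -> KeyError
try:
    data["k1", "k2"]
except KeyError as exc:
    err = exc

# A list selects both columns, but the result is a DataFrame,
# and DataFrames have no .str accessor (only Series do):
has_str = hasattr(data[["k1", "k2"]], "str")
print(type(err), has_str)
```

This is why the answers below either apply `str.contains` per column or check element-wise.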
I think you need apply with str.contains, because str.contains works only with a Series (one column):
print (data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')))
k1 k2
0 True True
1 True True
2 False False
3 False True
4 True False
5 False True
6 True True
7 False False
Then add DataFrame.all to check whether all values per row are True:
data['same'] = data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).all(axis=1)
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_oversetup False
4 5 re_oversetup alerting False
5 6 alerting_s re_setup False
6 7 re_oversetup oo_setup True
7 8 alerting bounce False
Or DataFrame.any to check for at least one True per row:
data['same'] = data[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(axis=1)
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_oversetup True
4 5 re_oversetup alerting True
5 6 alerting_s re_setup True
6 7 re_oversetup oo_setup True
7 8 alerting bounce False
Another solution with applymap for an element-wise check:
data['same'] = data[['k1', 'k2']].applymap(lambda x: 'setup' in x).all(axis=1)
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_oversetup False
4 5 re_oversetup alerting False
5 6 alerting_s re_setup False
6 7 re_oversetup oo_setup True
7 8 alerting bounce False
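Note that on recent pandas (2.1+), applymap is deprecated and renamed to DataFrame.map; a sketch of the same element-wise check that works on both old and new versions:

```python
import pandas as pd

data = pd.DataFrame({
    "k1": ["re_setup", "oo_setup", "alerting"],
    "k2": ["oo_setup", "oo_setup", "bounce"],
})

cols = data[["k1", "k2"]]
# DataFrame.map (pandas >= 2.1) replaces the deprecated applymap;
# fall back to applymap on older pandas versions.
elementwise = cols.map if hasattr(cols, "map") else cols.applymap
data["same"] = elementwise(lambda x: "setup" in x).all(axis=1)
print(data["same"].tolist())  # [True, True, False]
```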
If there are only 2 columns, simply chain conditions with & (like all) or | (like any):
data['same'] = data['k1'].str.contains('setup') & data['k2'].str.contains('setup')
print (data)
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_oversetup False
4 5 re_oversetup alerting False
5 6 alerting_s re_setup False
6 7 re_oversetup oo_setup True
7 8 alerting bounce False
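One caveat not covered above: if k1 or k2 can contain missing values, str.contains returns NaN for them and the result is no longer a clean boolean column; the `na` parameter of str.contains controls this. A sketch, assuming a hypothetical frame with one missing value:

```python
import pandas as pd

data = pd.DataFrame({"k1": ["re_setup", None], "k2": ["oo_setup", "bounce"]})

# Without na=, str.contains propagates the missing value as NaN,
# so the combined result is not a clean boolean column.
# na=False treats missing values as "does not contain 'setup'".
data["same"] = (
    data["k1"].str.contains("setup", na=False)
    & data["k2"].str.contains("setup", na=False)
)
print(data["same"].tolist())  # [True, False]
```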
Here's another generic reduce operation, without needing apply:
In [114]: np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
Out[114]: array([ True, True, False, True, True, True, True, False], dtype=bool)
Details:
In [115]: df['same'] = np.logical_or.reduce(
[df[c].str.contains('setup') for c in ['k1', 'k2']])
In [116]: df
Out[116]:
id k1 k2 same
0 1 re_setup oo_setup True
1 2 oo_setup oo_setup True
2 3 alerting bounce False
3 4 bounce re_oversetup True
4 5 re_oversetup alerting True
5 6 alerting_s re_setup True
6 7 re_oversetup oo_setup True
7 8 alerting bounce False
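Since the question asks for the all-columns case, the same reduce pattern presumably works with np.logical_and (an adaptation of the answer above, not part of it):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "k1": ["re_setup", "oo_setup", "alerting", "bounce"],
    "k2": ["oo_setup", "oo_setup", "bounce", "re_oversetup"],
})

# logical_and.reduce: True only where every listed column matches
df["same"] = np.logical_and.reduce(
    [df[c].str.contains("setup") for c in ["k1", "k2"]]
)
print(df["same"].tolist())  # [True, True, False, False]
```

This scales to any number of columns by extending the list in the comprehension.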
Timings
Small
In [111]: df.shape
Out[111]: (8, 4)
In [108]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
1000 loops, best of 3: 421 µs per loop
In [109]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
1000 loops, best of 3: 2.01 ms per loop
Large
In [110]: df.shape
Out[110]: (40000, 4)
In [112]: %timeit np.logical_or.reduce([df[c].str.contains('setup') for c in ['k1', 'k2']])
10 loops, best of 3: 59.5 ms per loop
In [113]: %timeit df[['k1', 'k2']].apply(lambda x: x.str.contains('setup')).any(1)
10 loops, best of 3: 88.4 ms per loop
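The large frame is presumably the 8-row sample tiled; outside IPython the comparison can be sketched with timeit (absolute numbers will differ by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

# Tile a small sample frame up to 40000 rows, mimicking the benchmark above
df = pd.DataFrame({
    "k1": ["re_setup", "alerting"] * 4,
    "k2": ["oo_setup", "bounce"] * 4,
})
big = pd.concat([df] * 5000, ignore_index=True)

reduce_time = timeit.timeit(
    lambda: np.logical_or.reduce(
        [big[c].str.contains("setup") for c in ["k1", "k2"]]
    ),
    number=5,
)
apply_time = timeit.timeit(
    lambda: big[["k1", "k2"]].apply(lambda x: x.str.contains("setup")).any(axis=1),
    number=5,
)
print(f"reduce: {reduce_time:.3f}s  apply: {apply_time:.3f}s")
```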