I have a column in a DataFrame (read in from a CSV) that contains comma-separated values. I'd like to split this column into multiple columns.
The problem is an old one and has been discussed here as well, but there is one peculiarity: an entry may contain from 0 to n comma-separated values. An example:
df.head():
i: vals | sth_else
---------------------
1: a,b,c | ba
2: a,d | be
3: | bi
4: e,a,c | bo
5: e | bu
I'd like the following output (or similar, e.g. True/False):
i : a | b | c | d | e | sth_else
-----------------------------------
1: 1 | 1 | 1 | 0 | 0 | ba
2: 1 | 0 | 0 | 1 | 0 | be
3: 0 | 0 | 0 | 0 | 0 | bi
4: 1 | 0 | 1 | 0 | 1 | bo
5: 0 | 0 | 0 | 0 | 1 | bu
I'm currently experimenting with the Series.str.split and Series.to_dict functions, but without any satisfactory results (it always ends in a ValueError: arrays must all be same length :).
Also, I always try to find elegant solutions which are still easily understandable when looked at after a couple of months ;). In any case, suggestions are highly appreciated!
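For reference, a minimal sketch of the kind of thing I'm trying (the split lists end up with different lengths, which seems to be exactly what the DataFrame constructor complains about; the error wording may vary slightly by pandas version):
import pandas as pd

s = pd.Series(['a,b,c', 'a,d', 'e,a,c', 'e'])
split = s.str.split(',')       # each row becomes a list, but of varying length
pd.DataFrame(split.to_dict())  # ValueError: arrays must all be same length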
Here is the dummy.csv for testing.
vals;sth_else
a,b,c;ba
a,d;be
;bi
e,a,c;bo
e;bu
import pandas as pd
from StringIO import StringIO  # py2.7 used here
# from io import StringIO      # if you are on py3.x
# data
# ==================================================================
csv_buffer = 'vals;sth_else\na,b,c;ba\na,d;be\n;bi\ne,a,c;bo\ne;bu'
df = pd.read_csv(StringIO(csv_buffer), sep=';')
Out[58]:
vals sth_else
0 a,b,c ba
1 a,d be
2 NaN bi
3 e,a,c bo
4 e bu
# processing
# ==================================================================
def func(group):
    # each group is a single row; split its comma-separated string into parts
    return pd.Series(group.vals.str.split(',').values[0], name='vals')
ser = df.groupby(level=0).apply(func)
Out[60]:
0 0 a
1 b
2 c
1 0 a
1 d
2 0 NaN
3 0 e
1 a
2 c
4 0 e
Name: vals, dtype: object
# use get_dummies, then aggregate each of the columns a b c d e with max (max is always 1 within a group here)
pd.get_dummies(ser)
Out[85]:
a b c d e
0 0 1 0 0 0 0
1 0 1 0 0 0
2 0 0 1 0 0
1 0 1 0 0 0 0
1 0 0 0 1 0
2 0 0 0 0 0 0
3 0 0 0 0 0 1
1 1 0 0 0 0
2 0 0 1 0 0
4 0 0 0 0 0 1
# group by the outer index level [0,1,2,3,4] and reduce each inner group from multiple rows to one row
df_dummies = pd.get_dummies(ser).groupby(level=0).apply(lambda group: group.max())
Out[64]:
a b c d e
0 1 1 1 0 0
1 1 0 0 1 0
2 0 0 0 0 0
3 1 0 1 0 1
4 0 0 0 0 1
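Side note: since max is itself a built-in GroupBy reduction, the lambda can be dropped; this is equivalent and a bit shorter:
df_dummies = pd.get_dummies(ser).groupby(level=0).max()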
df_dummies['sth_else'] = df.sth_else
Out[67]:
a b c d e sth_else
0 1 1 1 0 0 ba
1 1 0 0 1 0 be
2 0 0 0 0 0 bi
3 1 0 1 0 1 bo
4 0 0 0 0 1 bu
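By the way: if your pandas version already ships Series.str.get_dummies, the whole split / stack / max pipeline above collapses into essentially one call, and NaN rows come out as all zeros automatically. A minimal sketch:
df_dummies = df.vals.str.get_dummies(sep=',')
df_dummies['sth_else'] = df.sth_else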
This is very similar to another question asked today. As I said there, there may be a simple, elegant pandas way to do this, but I also find it convenient to simply create a new data frame and populate it by iterating over the original one, in the following fashion:
# import and create your data
import pandas as pd
DF = pd.DataFrame({'vals' : ['a,b,c', 'a,d', '', 'e,a,c', 'e'],
                   'other': ['ba', 'be', 'bi', 'bo', 'bu']},
                  dtype=str)
Now create a new data frame with the other column from DF as the index, and columns drawn from the unique characters found in the vals column of DF:
New_DF = pd.DataFrame({col: 0 for col in
                       set([letter for letter in ''.join(DF.vals.values)
                            if letter.isalpha()])},
                      index=DF.other)
In [51]: New_DF
Out[51]:
a b c d e
other
ba 0 0 0 0 0
be 0 0 0 0 0
bi 0 0 0 0 0
bo 0 0 0 0 0
bu 0 0 0 0 0
Now simply iterate over the index of New_DF, slice the original DF at that value, and iterate over the columns to see whether each one appears in the relevant string:
for ind in New_DF.index:
    relevant_string = str(DF[DF.other == ind].vals.values)
    for col in New_DF.columns:
        if col in relevant_string:
            New_DF.loc[ind, col] += 1
The output looks like this:
In [54]: New_DF
Out[54]:
a b c d e
other
ba 1 1 1 0 0
be 1 0 0 1 0
bi 0 0 0 0 0
bo 1 0 1 0 1
bu 0 0 0 0 1
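One caveat: col in relevant_string is a substring test, which works here only because every token is a single character. If tokens could be longer than one character (say 'ab' vs 'a'), splitting on the comma and testing set membership is safer; a sketch of that variant, assuming vals holds plain strings as above (you would also need to build the columns from whole tokens rather than single letters):
for ind in New_DF.index:
    tokens = set(DF[DF.other == ind].vals.values[0].split(','))
    for col in New_DF.columns:
        if col in tokens:
            New_DF.loc[ind, col] = 1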