
How to handle inconsistent columns of CSV

Tags: python, pandas, csv

My CSV data looks like this:

ID;name;info
1;ABC;text1
2;DEF;text2;text3
3;GHI;text4;
4;JKL;text5;text6;text7

There are 3 named columns. Any additional unnamed fields belong to the last column (info), and the number of those extra fields is not known in advance.

Using df = pd.read_csv(filename, delimiter=";", dtype=object) raises an "Error tokenizing data. C error..." because the rows have an irregular number of fields.

Is it possible to merge the trailing fields into a single column containing a list, to get the result below?

ID;name;info
1;ABC;[text1]
2;DEF;[text2, text3]
3;GHI;[text4]
4;JKL;[text5, text6, text7]
Laura asked Nov 23 '25 11:11


1 Answer

Here is a general approach: count the delimiters in the header row, then split each line at most that many times, so that any extra fields stay together in the last column and can be turned into a list:

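import pandas as pd

# The default "," separator does not match, so each line lands in one string column.
data = pd.read_csv("text.csv")
n_sep = data.columns[0].count(";")        # number of separators in the header
headers = data.columns.str.split(";")[0]  # ['ID', 'name', 'info']

# Split at most n_sep times so the extra fields remain inside 'info'.
data[headers] = data.iloc[:, 0].str.split(";", n=n_sep, expand=True)
# Drop the raw column and split 'info' into a list.
data = data.iloc[:, 1:].assign(info=data['info'].str.split(";"))
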
  ID name                   info
0  1  ABC                [text1]
1  2  DEF         [text2, text3]
2  3  GHI              [text4, ]
3  4  JKL  [text5, text6, text7]
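
Note that row 2 in the output above ends with an empty string, because its line has a trailing ";". If that is not wanted, one possible alternative is to parse the file with the standard-library csv module and drop empty fields while building the list. The following is only a sketch under a few assumptions: the helper name read_ragged_csv is made up for illustration, the sample data is inlined through io.StringIO instead of being read from a file, and empty fields are assumed to carry no meaning.

```python
import csv
import io


def read_ragged_csv(handle, delimiter=";"):
    """Hypothetical helper: fold all trailing fields into the last header column."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    n_fixed = len(header) - 1  # columns that come before the list-valued one
    rows = []
    for fields in reader:
        if not fields:  # skip blank lines
            continue
        row = dict(zip(header[:n_fixed], fields[:n_fixed]))
        # Everything past the fixed columns goes into the last column;
        # empty strings produced by a trailing delimiter are dropped.
        row[header[-1]] = [f for f in fields[n_fixed:] if f]
        rows.append(row)
    return rows


# Sample data copied from the question.
sample = io.StringIO(
    "ID;name;info\n"
    "1;ABC;text1\n"
    "2;DEF;text2;text3\n"
    "3;GHI;text4;\n"
    "4;JKL;text5;text6;text7\n"
)
rows = read_ragged_csv(sample)
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to pd.DataFrame(rows) to get the same three columns, with info holding a list per row. For a real file, open it with open(filename, newline="") and pass the handle in place of the StringIO object.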
Erfan answered Nov 26 '25 00:11


