I have a text file whose lines have the following form (line 1, for example):
id1-a1-b1-c1
I want to load it into a DataFrame using pandas, with the ids as the index, the column names 'A', 'B', 'C', and the corresponding ai, bi, ci as the values.
In the end I want the DataFrame to look like:
    'A'   'B'  'C'
id1  a1    b1   c1
id2  a2    b2   c2
...   ...   ...  ...
I may want to read in chunks if the file is large, but let's assume I read it all at once:
import pandas as pd

with open('file.txt') as f:
    table = pd.read_table(f, sep='-', index_col=0, header=None, lineterminator='\n')
and rename the columns
table.columns = ['A','B','C']
My current output is something like:
    'A'   'B'  'C'
0
id1  a1    b1   c1
id2  a2    b2   c2
...   ...   ...  ...
There is an extra row (the one labelled 0) that I can't explain.
Thanks
EDIT
When I add the following argument to the read_table call
chunksize=20
and after doing:
for chunk in table:
    print(chunk)
I get the following error:
pandas.parser.CParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.
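One likely cause, assuming the for loop ran after the with block had already exited: with chunksize set, read_table returns a lazy reader instead of a DataFrame, and it only pulls data from the file handle while you iterate. If the handle is closed by then, the read fails. A minimal sketch that consumes the reader while the file is still open (the temporary file and its 50 sample rows are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# Write a small stand-in for file.txt (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w") as f:
    for i in range(1, 51):
        f.write(f"id{i}-a{i}-b{i}-c{i}\n")

chunks = []
with open(path) as f:
    # With chunksize, this returns an iterator of DataFrames, not a DataFrame.
    reader = pd.read_table(f, sep="-", header=None, index_col=0, chunksize=20)
    # Iterate inside the with block, while the file handle is still open.
    for chunk in reader:
        chunk.columns = ["A", "B", "C"]
        chunks.append(chunk)

print([len(c) for c in chunks])
```

Passing the path itself ('file.txt') instead of an open handle should also work, since pandas then manages the file on its own.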
To set a column as the index of a DataFrame, call DataFrame.set_index() with the column label as its argument. To build a MultiIndex from several columns, pass a list of column labels instead.
For example, to make the column "Year" the index, write df.set_index("Year"). By default set_index() returns the modified DataFrame rather than changing df in place, so assign the result to a variable.
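As a small illustration of set_index() with one column and with a list of columns, here is a sketch on a made-up frame shaped like the data in the question (the column name "id" is an assumption, not something from the original file):

```python
import pandas as pd

# Hypothetical frame mirroring the question's data, with the ids as a regular column.
df = pd.DataFrame(
    {
        "id": ["id1", "id2"],
        "A": ["a1", "a2"],
        "B": ["b1", "b2"],
        "C": ["c1", "c2"],
    }
)

# Single column as index; df itself is left unchanged.
indexed = df.set_index("id")

# A list of columns produces a MultiIndex.
multi = df.set_index(["id", "A"])

print(indexed)
print(multi)
```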
If you know the column names before the file is read, pass them as a list via the names parameter of read_table:
with open('file.txt') as f:
    table = pd.read_table(f, sep='-', index_col=0, header=None, names=['A','B','C'],
                          lineterminator='\n')
Which outputs:
      A   B   C
id1  a1  b1  c1
id2  a2  b2  c2
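As for the extra row labelled 0 in the question: it is not a data row. With header=None the columns get the positional labels 0, 1, 2, 3, and index_col=0 turns column 0 into the index while keeping 0 as the index name, which pandas prints on its own line under the header. A minimal sketch of this, using an in-memory two-line sample in place of file.txt:

```python
import io

import pandas as pd

# In-memory stand-in for file.txt (sample rows taken from the question).
data = "id1-a1-b1-c1\nid2-a2-b2-c2\n"

# Same call as in the question: column 0 becomes the index and keeps the name 0.
table = pd.read_table(io.StringIO(data), sep="-", index_col=0, header=None)
table.columns = ["A", "B", "C"]
index_name_before = table.index.name
print(table)  # shows the extra line carrying the index name

# Clearing the index name removes that line from the printed output.
table.index.name = None
print(table)
```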