I have a text file that holds the result of an operation. The data is displayed in a human-readable format (like a table). How do I parse this data so that I can build a data structure, such as a dictionary, from it?
An example of the unstructured data is shown below.
===============================================================
Title
===============================================================
Header Header Header Header Header Header
1 2 3 4 5 6
---------------------------------------------------------------
1 Yes No 6 0001 0002 True
2 No Yes 7 0003 0004 False
3 Yes No 6 0001 0001 True
4 Yes No 6 0001 0004 False
4 No No 4 0004 0004 True
5 Yes No 2 0001 0001 True
6 Yes No 1 0001 0001 False
7 No No 2 0004 0004 True
The data in the above example is neither tab-separated nor comma-separated. It always has a header, and the columns beneath the header may or may not contain values.
I have tried basic parsing techniques such as regex and conditional checks, but I need a more robust way to parse this data, because the example shown above is not the only format in which the data gets rendered.
Update 1: There are many cases apart from the example shown, such as additional columns, or a single cell holding more than one value (displayed visually on the next row, although it belongs to the previous row).
Is there any Python library to solve this problem?
Can machine learning techniques help with this problem without parsing? If yes, what type of problem would it be (classification, regression, clustering)?
===============================================================
Title
===============================================================
Header Key_1 Header Header Header Header
1 Key_2 3 4 5 6
---------------------------------------------------------------
1 Value1 No 6 0001 0002 True
Value2
2 Value1 Yes 7 0003 0004 False
Value2
3 Value1 No 6 0001 0001 True
Value2
4 Value1 No 6 0001 0004 False
Value2
5 Value1 No 4 0004 0004 True
Value2
6 Value1 No 2 0001 0001 True
Value2
7 Value1 No 1 0001 0001 False
Value2
8 Value1 No 2 0004 0004 True
Value2
Update 2: Another example of what it might look like, in which a single cell holds more than one value (displayed visually on the next row, although it belongs to the previous row).
Say your example is saved as 'sample.txt'.
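import pandas as pd

# Skip the title block (lines 0-3) and the dashed rule (line 5) so that line 4 becomes the header.
# Columns are split on runs of two or more whitespace characters; a regex separator needs the python engine.
df = pd.read_table('sample.txt', skiprows=[0, 1, 2, 3, 5], delimiter=r'\s\s+', engine='python')
print(df)
print(df.shape)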
1 2 3 4 5 6
0 1 Yes No 6 0001 0002 True
1 2 No Yes 7 0003 0004 False
2 3 Yes No 6 0001 0001 True
3 4 Yes No 6 0001 0004 False
4 4 No No 4 0004 0004 True
5 5 Yes No 2 0001 0001 True
6 6 Yes No 1 0001 0001 False
7 7 No No 2 0004 0004 True
(8, 6)
You can change the data types, of course. Check the many parameters of pd.read_table(). There are also reader functions for xlsx, csv, html, sql, json, hdf, and even the clipboard.
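As a rough sketch of the data-type point, and of the dictionaries the question asks for: passing dtype=str keeps zero-padded values such as 0001 as text, and DataFrame.to_dict(orient="records") turns each row into a dictionary. The inline sample and its column spacing (two or more spaces between columns) are assumptions made for illustration; the real file may be laid out differently.

```python
import io

import pandas as pd

# Hypothetical two-row sample; columns are assumed to be separated by
# at least two spaces, so "Yes No" stays together as a single cell.
sample = """\
===============================================================
Title
===============================================================
Header  Header  Header  Header  Header  Header
1       2       3       4       5       6
---------------------------------------------------------------
1       Yes No  6       0001    0002    True
2       No Yes  7       0003    0004    False
"""

df = pd.read_table(
    io.StringIO(sample),
    skiprows=[0, 1, 2, 3, 5],  # title block and dashed rule
    sep=r"\s\s+",              # two or more whitespace characters
    engine="python",           # regex separators need the python engine
    dtype=str,                 # keep every cell as text, e.g. "0001"
)

# One dictionary per row, keyed by the header labels.
records = df.to_dict(orient="records")
print(records[0])
```

If numeric columns are wanted afterwards, individual columns can be converted with pd.to_numeric or astype.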
Welcome to pandas...
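For the layout in Update 2, where a cell continues on the next line, one possible approach is to pre-process the lines in plain Python before (or instead of) handing them to pandas. The sketch below rests on assumptions that are not confirmed by the question: continuation lines start with whitespace, they only carry extra values for one known column (index 1 here), and columns are separated by two or more spaces. The helper name parse_rows and the reduced four-column sample are hypothetical.

```python
import re

# Hypothetical, simplified sample with a multi-valued second column.
sample = """\
===============================================================
Title
===============================================================
Header  Key_1   Header  Header
1       Key_2   3       4
---------------------------------------------------------------
1       Value1  No      0001
        Value2
2       Value1  Yes     0003
        Value2
"""

MULTI_COL = 1  # assumed index of the column that may span several lines


def parse_rows(text):
    """Return data rows, folding indented continuation lines into the previous row."""
    lines = text.splitlines()
    # Data rows are assumed to start right after the dashed rule.
    start = next(i for i, line in enumerate(lines) if line.startswith("---")) + 1
    rows = []
    for line in lines[start:]:
        if not line.strip():
            continue
        cells = re.split(r"\s{2,}", line.strip())
        if line[0].isspace() and rows:
            # Continuation line: append its values to the multi-valued cell.
            rows[-1][MULTI_COL].extend(cells)
        else:
            cells[MULTI_COL] = [cells[MULTI_COL]]
            rows.append(cells)
    return rows


rows = parse_rows(sample)
# Dictionary keyed by the first column of each row.
table = {row[0]: row[1:] for row in rows}
print(table)
```

If the continuation values can appear in more than one column, the character offsets of the header columns would have to be used instead of a fixed index.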