I have a csv file in which the first line reads something like the following:
Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]...
and so on. What I'd like to do is create filtered value such that the first column is
Ref
and it takes all the values after the # sign like 001,50,51,52... The second column name is ID and it takes the value after = like XYZ_0,M3_0,M3_1,M3_2,M3_3... And finally make a third column which takes all the values present in the square brackets like 1234,112,154,254,...
header_pattern = r'Pyscip_V(\d+\.\d+) Ref:'
version_match = re.search(header_pattern, first_line.iloc[0, 0])
version_number = version_match.group(1) if version_match else ''
matches = re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]', first_line.iloc[0, 0])
parsed_df = []
for match in matches:
row_dict = {
'Ref': match[0] if match[0] else '',
'ID': match[1] if match[1] else '',
'Ser_No': match[2] if match[2] else ''
}
parsed_df.append(row_dict)
new_df = pd.DataFrame(parsed_df)
However, I only get enpty dataframe. What seems to be the problem here?
Edit: the data from 3rd row looks like the following:
ID Date XYZ_0 M3_0 M3_1 M3_2
1 22.12.2023 12.6 0.5 1.2 2.3
The expected outcome is
Ref ID Num
001 XYZ_0 1234
50 M3_0 112
51 M3_1 154
I would open the csv file, extract the first line and process it, only after read the rest of the CSV with pandas.
For that, your original approach and regex are fine.
import re
import pandas as pd
with open('my_csv.csv') as f:
first_line = next(f)
header_df = pd.DataFrame(re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]',
first_line),
columns=['Ref', 'ID', 'Num'])
data_df = pd.read_csv(f, sep=r'\s+')
print(header_df)
print(data_df)
Output:
# header_df
Ref ID Num
0 001 XYZ_0 1234
1 50 M3_0 112
2 51 M3_1 154
3 52 M3_2 254
# data_df
ID Date XYZ_0 M3_0 M3_1 M3_2
0 1 22.12.2023 12.6 0.5 1.2 2.3
You can try (regex101 link):
import re
import pandas as pd
text = """\
Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]
"""
pat = r"#(\d+)=([^[]+)\[([^\]]+)\]"
data = []
for line in text.splitlines():
for a, b, c in re.findall(pat, line):
data.append((a, b, c))
df = pd.DataFrame(data, columns=["Ref", "ID", "Num"])
print(df)
Prints:
Ref ID Num
0 001 XYZ_0 1234
1 50 M3_0 112
2 51 M3_1 154
3 52 M3_2 254
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With