Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

create filtered value using pandas

Tags:

python

pandas

I have a csv file in which the first line reads something like the following:

Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]...

and so on. What I'd like to do is create filtered value such that the first column is Ref and it takes all the values after the # sign like 001,50,51,52... The second column name is ID and it takes the value after = like XYZ_0,M3_0,M3_1,M3_2,M3_3... And finally make a third column which takes all the values present in the square brackets like 1234,112,154,254,...

header_pattern = r'Pyscip_V(\d+\.\d+) Ref:'

version_match = re.search(header_pattern, first_line.iloc[0, 0])
version_number = version_match.group(1) if version_match else ''

matches = re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]', first_line.iloc[0, 0])

parsed_df = []
for match in matches:
    row_dict = {
        'Ref': match[0] if match[0] else '',
        'ID': match[1] if match[1] else '',
        'Ser_No': match[2] if match[2] else ''
    }
    parsed_df.append(row_dict)

new_df = pd.DataFrame(parsed_df)


However, I only get enpty dataframe. What seems to be the problem here?

Edit: the data from 3rd row looks like the following:

ID  Date    XYZ_0  M3_0   M3_1  M3_2    
1   22.12.2023  12.6  0.5 1.2   2.3

The expected outcome is

Ref ID Num
001 XYZ_0 1234
50 M3_0 112
51 M3_1 154
like image 647
dnem Avatar asked Nov 28 '25 18:11

dnem


2 Answers

I would open the csv file, extract the first line and process it, only after read the rest of the CSV with pandas.

For that, your original approach and regex are fine.

import re
import pandas as pd

with open('my_csv.csv') as f:
    first_line = next(f)
    header_df = pd.DataFrame(re.findall(r'#(\d+)=(\w+_\d)\[([\d]+)\]',
                                        first_line),
                             columns=['Ref', 'ID', 'Num'])
    
    data_df = pd.read_csv(f, sep=r'\s+')

print(header_df)
print(data_df)

Output:

# header_df
   Ref     ID   Num
0  001  XYZ_0  1234
1   50   M3_0   112
2   51   M3_1   154
3   52   M3_2   254

# data_df
   ID        Date  XYZ_0  M3_0  M3_1  M3_2
0   1  22.12.2023   12.6   0.5   1.2   2.3
like image 182
mozway Avatar answered Dec 01 '25 07:12

mozway


You can try (regex101 link):

import re
import pandas as pd

text = """\
Pyscip_V1.11 Ref: #001=XYZ_0[1234] #50=M3_0[112] #51=M3_1[154] #52=M3_2[254]
"""

pat = r"#(\d+)=([^[]+)\[([^\]]+)\]"

data = []
for line in text.splitlines():
    for a, b, c in re.findall(pat, line):
        data.append((a, b, c))

df = pd.DataFrame(data, columns=["Ref", "ID", "Num"])
print(df)

Prints:

   Ref     ID   Num
0  001  XYZ_0  1234
1   50   M3_0   112
2   51   M3_1   154
3   52   M3_2   254
like image 35
Andrej Kesely Avatar answered Dec 01 '25 06:12

Andrej Kesely