Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

while condition or extractall with regex or other to handle new data

I have a dataset read from a file that I THOUGHT looked like this throughout (500-600 lines total):

0,['' '']
1,['Size' 'S']
2,['Energy (kJ)' '1644']
3,['Protein (g)' '20.9']
4,['Carbohydrates (g)' '33.6']
5,['Sugars (g)' '1.8']
6,['Total Fat (g)' '18.7']
7,['Saturated Fat' '4.9']
8,['Trans Fat (g)' '0']
9,['Dietary Fibre (g)' '5.2']
10,['Sodium (mg)' '845']
11,['Serving Size (g)' '180']

So I used this code to generate the dataframe I needed:

with open("dataset.txt", 'r') as infile:
    l = [x.replace(']', ',').replace("[", '').replace('"', '').replace('\n', '').strip().split(',') for x in infile]
    df = pd.DataFrame(l)
    df['A'] = list(range(len(df.index)))
    del df[2]
    df.rename(columns={1: 'nutrient'}, inplace=True)
    df[['amount_S']] = df['nutrient'].str.extract(pat=r'(?:\'\s\')(S|\d+\.\d+)', expand=True).fillna(0)
    df['nutrient'] = df['nutrient'].str.replace(pat=r'\'\s\'S|\d+',repl ='')
    df['nutrient'] = df['nutrient'].str.replace('\'', repl = '')
    df['nutrient'] = df['nutrient'].str.replace('.', repl='')

The dataframe output from the code looks like this (ready to join with another dataset and pivot):

             nutrient   A  amount_S
0                       0        0
1                Size   1        S
2        Energy (kJ)    2        0
3        Protein (g)    3     20.9
4  Carbohydrates (g)    4     33.6
5         Sugars (g)    5      1.8
6      Total Fat (g)    6     18.7
7      Saturated Fat    7      4.9
8      Trans Fat (g)    8        0
9  Dietary Fibre (g)    9      5.2
10        Sodium (mg)   10        0
11   Serving Size (g)   11        0

Now I have discovered that I also have some entries in my file that look like this:

0,['' '' '' '']
1,['Size' 'S' 'Size' 'M']
2,['Energy (kJ)' '351' 'Energy (kJ)' '617']
3,['Protein (g)' '2.3' 'Protein (g)' '4']
4,['Carbohydrates (g)' '15.4' 'Carbohydrates (g)' '26.9']
5,['Sugars (g)' '1.9' 'Sugars (g)' '3.3']
6,['Total Fat (g)' '0.6' 'Total Fat (g)' '1']
7,['Saturated Fat' '0.1' 'Saturated Fat' '0.1']
8,['Trans Fat (g)' '0' 'Trans Fat (g)' '0']
9,['Dietary Fibre (g)' '1.9' 'Dietary Fibre (g)' '3.4']
10,['Sodium (mg)' '2' 'Sodium (mg)' '4']
11,['Serving Size (g)' '75' 'Serving Size (g)' '125']
0,['' '' '' '' '' '' '' '']
1,['Size' 'S' 'Size' 'M' 'Size' 'L' 'Size' 'XL']
2,"['Energy (kJ)' '1431' 'Energy (kJ)' '2030' 'Energy (kJ)' '2863' 'Energy (kJ)' '3383']"
3,"['Protein (g)' '5.7' 'Protein (g)' '8.1' 'Protein (g)' '11.4' 'Protein (g)' '13.5']"
4,"['Carbohydrates (g)' '41.5' 'Carbohydrates (g)' '58.8' 'Carbohydrates (g)' '82.9' 'Carbohydrates (g)' '98']"
5,"['Sugars (g)' '1.2' 'Sugars (g)' '1.7' 'Sugars (g)' '2.4' 'Sugars (g)' '2.9']"
6,"['Total Fat (g)' '17.9' 'Total Fat (g)' '25.4' 'Total Fat (g)' '35.9' 'Total Fat (g)' '42.4']"
7,"['Saturated Fat' '7.9' 'Saturated Fat' '11.2' 'Saturated Fat' '15.8' 'Saturated Fat' '18.7']"
8,"['Trans Fat (g)' '0' 'Trans Fat (g)' '0' 'Trans Fat (g)' '0' 'Trans Fat (g)' '0']"
9,"['Dietary Fibre (g)' '3.7' 'Dietary Fibre (g)' '5.3' 'Dietary Fibre (g)' '7.5' 'Dietary Fibre (g)' '8.8']"
10,"['Sodium (mg)' '305' 'Sodium (mg)' '432.1' 'Sodium (mg)' '609' 'Sodium (mg)' '720']"
11,"['Serving Size (g)' '110' 'Serving Size (g)' '156' 'Serving Size (g)' '220' 'Serving Size (g)' '260']"

I would like to move the numerical data to new columns (amount_M, amount_L, amount_XL). The "nutrient" column does not need to be repeated. What is the best way to go about taking care of these cases?

like image 464
ChiChi Avatar asked Dec 22 '25 10:12

ChiChi


1 Answers

Use:

import ast
# convert output fo 2 column dataframe
df = pd.read_csv('file5.csv', names=['a','b'])
#add comma to ' ', convert each row to lists 
df['b'] = df['b'].str.replace("'\s+'", "','").apply(ast.literal_eval)
#remove rows with 0 in a column
df =  df[df['a'] != 0]
#print (df)

fin = {}
#create dictionary of dataframes - groupby by helper Series - 
# necessary first value 1 for distinguish groups in a column
for i, x in dict(tuple(df.groupby(df['a'].eq(1).cumsum().sub(1)))).items():
#    print (x)

    #create DataFrame with column b, first row is header
    df2 = pd.DataFrame(x.b.values.tolist()[1:], columns=x.b.iloc[0])
    #remove duplicates columns names
    df2 = df2.loc[:, ~df2.columns.duplicated()]
#    print (df2)
    #convert output to dictionary (if necessary)
    fin[i] = df2

print (fin[0])
                Size     S     M
0        Energy (kJ)   351   617
1        Protein (g)   2.3     4
2  Carbohydrates (g)  15.4  26.9
3         Sugars (g)   1.9   3.3
4      Total Fat (g)   0.6     1
5      Saturated Fat   0.1   0.1
6      Trans Fat (g)     0     0
7  Dietary Fibre (g)   1.9   3.4
8        Sodium (mg)     2     4
9   Serving Size (g)    75   125

print (fin[1])
                Size     S      M     L    XL
0        Energy (kJ)  1431   2030  2863  3383
1        Protein (g)   5.7    8.1  11.4  13.5
2  Carbohydrates (g)  41.5   58.8  82.9    98
3         Sugars (g)   1.2    1.7   2.4   2.9
4      Total Fat (g)  17.9   25.4  35.9  42.4
5      Saturated Fat   7.9   11.2  15.8  18.7
6      Trans Fat (g)     0      0     0     0
7  Dietary Fibre (g)   3.7    5.3   7.5   8.8
8        Sodium (mg)   305  432.1   609   720
9   Serving Size (g)   110    156   220   260
like image 137
jezrael Avatar answered Dec 24 '25 01:12

jezrael



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!