Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: merge dataframes and keep all values in cells if not identical

So I'm trying to merge multiple excel files. Each file will have different dimensions. Some files may have identical column names with either data being NULL, same or different. The script I wrote merges multiple files with different dimensions and removes duplicated columns with the last value being dropped in the final column cell. However, I'm trying to concat values, if not equal, so that users can manually go through duped data in excel.

EXAMPLE: User 1 has age = 24 in df table and age = 27 in df1. I'm trying to get both values in that cell in the final consolidated output.

INPUT: df

user age team
1 24 x
2 56 y
3 32 z
df = pd.DataFrame({'user': ['1', '2', '3'],
                    'age': [24,56,32],
                    'team': [x,y,z]})

df1

user age name
1 27 Ronald
2 NaN Eugene
4 44 Jeff
5 61 Britney
df = pd.DataFrame({'user': ['1','2','4','5'],
                    'age': [27,NaN,44,61],
                    'name': ['Ronald','Eugene','Jeff','Britney']})

EXPECTED OUTPUT:

CASES:

  1. two identical values: keep one

  2. one value is NaN: keep non NaN value

  3. two different values: concat with delimiter so it can be review later. I will highlight it.

user age team name
1 24 27
2 56 y Eugene
3 32 z NaN
4 44 NaN Jeff
5 61 NaN Britney

Here's what I have so far. User drop files in specified folder then loop thru all excel files. First loop will append data into df dataframe, every next loop is merge. Issue is, I'm getting values (if not null) from last loop ONLY.

df = pd.DataFrame()

for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
        print(excel_files)

        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
            df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)

That's what the OUTPUT looks like

user age team name
1 27 x Ronald
2 56 y Eugene
3 32 z NaN
4 44 NaN Jeff
5 61 NaN Britney

Tried looping thru the columns and then concat. I can see combined values in df[new_col] but it fails to update df dataframe and final output shows NaN.

df = pd.DataFrame()

for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
        #df1.set_index('uid',inplace=True)
        print(excel_files)
        #print(df1)
        #print(df1.dtypes)

        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
            #df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)

            cols_to_remove = df.columns
            for column in cols_to_remove:
                if "_dupe" in column:
                    new_col = str(column).replace('_dupe','')

                    df[new_col] = df[new_col].str.cat(df[column],sep='||')
                    print('New Values: ',df[new_col])
                    df.pop(column)

Any help will be appreciated. Thanks Raf

like image 374
niebyl2002 Avatar asked Nov 22 '25 15:11

niebyl2002


1 Answers

I would merge, then apply a groupby.agg on columns:

merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))

out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg('last')
)

Output:

  user   age  team     name
0    1  27.0     x   Ronald
1    2  56.0     y   Eugene
2    3  32.0     z     None
3    4  44.0  None     Jeff
4    5  61.0  None  Britney

Alterntive output:

out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg(lambda g: g.agg(lambda s: '|'.join(s.dropna().unique().astype(str)), axis=1))
)

Output:

  user        age team     name
0    1  24.0|27.0    x   Ronald
1    2       56.0    y   Eugene
2    3       32.0    z         
3    4       44.0          Jeff
4    5       61.0       Britney
like image 193
mozway Avatar answered Nov 25 '25 05:11

mozway