Python: merge dataframes and keep all values in cells if not identical

Question

So I'm trying to merge multiple excel files. Each file will have different dimensions. Some files may have identical column names with either data being NULL, same or different. The script I wrote merges multiple files with different dimensions and removes duplicated columns with the last value being dropped in the final column cell. However, I'm trying to concat values, if not equal, so that users can manually go through duped data in excel.

EXAMPLE: User 1 has age = 24 in df table and age = 27 in df1. I'm trying to get both values in that cell in the final consolidated output.

INPUT: df

user	age	team
1	24	x
2	56	y
3	32	z

df = pd.DataFrame({'user': ['1', '2', '3'],
                    'age': [24,56,32],
                    'team': [x,y,z]})

df1

user	age	name
1	27	Ronald
2	NaN	Eugene
4	44	Jeff
5	61	Britney

df = pd.DataFrame({'user': ['1','2','4','5'],
                    'age': [27,NaN,44,61],
                    'name': ['Ronald','Eugene','Jeff','Britney']})

EXPECTED OUTPUT:

CASES:

two identical values: keep one
one value is NaN: keep non NaN value
two different values: concat with delimiter so it can be review later. I will highlight it.

user	age	team	name
1	24		27
2	56	y	Eugene
3	32	z	NaN
4	44	NaN	Jeff
5	61	NaN	Britney

Here's what I have so far. User drop files in specified folder then loop thru all excel files. First loop will append data into df dataframe, every next loop is merge. Issue is, I'm getting values (if not null) from last loop ONLY.

df = pd.DataFrame()

for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
        print(excel_files)

        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
            df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)

That's what the OUTPUT looks like

user	age	team	name
1	27	x	Ronald
2	56	y	Eugene
3	32	z	NaN
4	44	NaN	Jeff
5	61	NaN	Britney

Tried looping thru the columns and then concat. I can see combined values in df[new_col] but it fails to update df dataframe and final output shows NaN.

df = pd.DataFrame()

for excel_files in FILELIST:
    if excel_files.endswith(".xlsx"):
        df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
        #df1.set_index('uid',inplace=True)
        print(excel_files)
        #print(df1)
        #print(df1.dtypes)

        if df.empty:
            df = df.append(df1)
        else:
            df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
            #df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)

            cols_to_remove = df.columns
            for column in cols_to_remove:
                if "_dupe" in column:
                    new_col = str(column).replace('_dupe','')

                    df[new_col] = df[new_col].str.cat(df[column],sep='||')
                    print('New Values: ',df[new_col])
                    df.pop(column)

Any help will be appreciated. Thanks Raf

mozway · Accepted Answer

I would merge, then apply a groupby.agg on columns:

merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))

out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg('last')
)

Output:

  user   age  team     name
0    1  27.0     x   Ronald
1    2  56.0     y   Eugene
2    3  32.0     z     None
3    4  44.0  None     Jeff
4    5  61.0  None  Britney

Alterntive output:

out = (merged
 .groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
 .agg(lambda g: g.agg(lambda s: '|'.join(s.dropna().unique().astype(str)), axis=1))
)

Output:

  user        age team     name
0    1  24.0|27.0    x   Ronald
1    2       56.0    y   Eugene
2    3       32.0    z         
3    4       44.0          Jeff
4    5       61.0       Britney

Python: merge dataframes and keep all values in cells if not identical

Tags:

python

merge

concatenation

pandas

dataframe

niebyl2002

1 Answers

mozway

Recent Activity

Donate For Us

Python: merge dataframes and keep all values in cells if not identical

Tags:

python

merge

concatenation

pandas

dataframe

niebyl2002

1 Answers

mozway

Related questions

Recent Activity

Donate For Us