So I'm trying to merge multiple excel files. Each file will have different dimensions. Some files may have identical column names with either data being NULL, same or different. The script I wrote merges multiple files with different dimensions and removes duplicated columns with the last value being dropped in the final column cell. However, I'm trying to concat values, if not equal, so that users can manually go through duped data in excel.
EXAMPLE: User 1 has age = 24 in df table and age = 27 in df1. I'm trying to get both values in that cell in the final consolidated output.
INPUT: df
| user | age | team |
|---|---|---|
| 1 | 24 | x |
| 2 | 56 | y |
| 3 | 32 | z |
df = pd.DataFrame({'user': ['1', '2', '3'],
'age': [24,56,32],
'team': [x,y,z]})
df1
| user | age | name |
|---|---|---|
| 1 | 27 | Ronald |
| 2 | NaN | Eugene |
| 4 | 44 | Jeff |
| 5 | 61 | Britney |
df = pd.DataFrame({'user': ['1','2','4','5'],
'age': [27,NaN,44,61],
'name': ['Ronald','Eugene','Jeff','Britney']})
EXPECTED OUTPUT:
CASES:
two identical values: keep one
one value is NaN: keep non NaN value
two different values: concat with delimiter so it can be review later. I will highlight it.
| user | age | team | name |
|---|---|---|---|
| 1 | 24 | 27 | |
| 2 | 56 | y | Eugene |
| 3 | 32 | z | NaN |
| 4 | 44 | NaN | Jeff |
| 5 | 61 | NaN | Britney |
Here's what I have so far. User drop files in specified folder then loop thru all excel files. First loop will append data into df dataframe, every next loop is merge. Issue is, I'm getting values (if not null) from last loop ONLY.
df = pd.DataFrame()
for excel_files in FILELIST:
if excel_files.endswith(".xlsx"):
df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
print(excel_files)
if df.empty:
df = df.append(df1)
else:
df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)
That's what the OUTPUT looks like
| user | age | team | name |
|---|---|---|---|
| 1 | 27 | x | Ronald |
| 2 | 56 | y | Eugene |
| 3 | 32 | z | NaN |
| 4 | 44 | NaN | Jeff |
| 5 | 61 | NaN | Britney |
Tried looping thru the columns and then concat. I can see combined values in df[new_col] but it fails to update df dataframe and final output shows NaN.
df = pd.DataFrame()
for excel_files in FILELIST:
if excel_files.endswith(".xlsx"):
df1 = pd.read_excel(FILEPATH_INPUT+excel_files, dtype=str)
#df1.set_index('uid',inplace=True)
print(excel_files)
#print(df1)
#print(df1.dtypes)
if df.empty:
df = df.append(df1)
else:
df = pd.merge(df,df1,on=UNIQUE_KEY,how=JOIN_METHOD,suffixes=('','_dupe'))
#df.drop([column for column in df.columns if '_dupe' in column],axis=1, inplace=True)
cols_to_remove = df.columns
for column in cols_to_remove:
if "_dupe" in column:
new_col = str(column).replace('_dupe','')
df[new_col] = df[new_col].str.cat(df[column],sep='||')
print('New Values: ',df[new_col])
df.pop(column)
Any help will be appreciated. Thanks Raf
I would merge, then apply a groupby.agg on columns:
merged = df.merge(df1, on='user', how='outer', suffixes=('', '_dupe'))
out = (merged
.groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
.agg('last')
)
Output:
user age team name
0 1 27.0 x Ronald
1 2 56.0 y Eugene
2 3 32.0 z None
3 4 44.0 None Jeff
4 5 61.0 None Britney
Alterntive output:
out = (merged
.groupby(merged.columns.str.replace('_dupe', ''), sort=False, axis=1)
.agg(lambda g: g.agg(lambda s: '|'.join(s.dropna().unique().astype(str)), axis=1))
)
Output:
user age team name
0 1 24.0|27.0 x Ronald
1 2 56.0 y Eugene
2 3 32.0 z
3 4 44.0 Jeff
4 5 61.0 Britney
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With