I have a dataset that looks a little like this:
ID Name Address Zip Cost
1 Bob the Builder 123 Main St 12345
1 Bob the Builder $99,999.99
2 Bob the Builder 123 Sub St 54321 $74,483.01
3 Nigerian Prince Area 51 33333 $999,999.99
3 Pinhead Larry Las Vegas 31333 $11.00
4 Fox Mulder Area 51 $0.99
where missing data is okay, unless it's obvious that they can be merged. What I mean by that is instead of the dataset above, I want to merge the rows where both the ID and Name are the same, and the other features can fill in each other's blanks. For example, the dataset above would become:
ID Name Address Zip Cost
1 Bob the Builder 123 Main St 12345 $99,999.99
2 Bob the Builder 123 Sub St 54321 $74,483.01
3 Nigerian Prince Area 51 33333 $999,999.99
3 Pinhead Larry Las Vegas 31333 $11.00
4 Fox Mulder Area 51 $0.99
I've thought about using df.groupby(["ID", "Name"]) and then concatenating the strings since the missing values are empty strings, but got no luck with it.
The data has been scraped off websites, so they've had to go through a lot of cleaning to end up here. I can't think of an elegant way of figuring this out!
This only works if rows we are potentially merging are next to each other.
setup
df = pd.DataFrame(dict(
ID=[1, 1, 2, 3, 3, 4],
Name=['Bob the Builder'] * 3 + ['Nigerian Prince', 'Pinhead Larry', 'Fox Mulder'],
Address=['123 Main St', '', '123 Sub St', 'Area 51', 'Las Vegas', 'Area 51'],
Zip=['12345', '', '54321', '33333', '31333', ''],
Cost=['', '$99,999.99', '$74,483.01', '$999.999.99', '$11.00', '$0.99']
))[['ID', 'Name', 'Address', 'Zip', 'Cost']]
fill up missing
replace('', np.nan) then forward fill then back fill
df_ = df.replace('', np.nan).ffill().bfill()
concat
take last row of filled up df_ if its a duplicate row
take non filled up df if not duplicated
pd.concat([
df_[df_.duplicated()],
df.loc[df_.drop_duplicates(keep=False).index]
])

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With