The following is my data:
name id junk date time value value2
abc 1 1 1/1/2017 18:07:54 5 10
abc 1 2 1/1/2017 19:07:54 10 15
abc 2 3 2/1/2017 20:07:54 15 20
abc 2 4 2/1/2017 21:07:54 20 25
def 3 5 3/1/2017 22:07:54 25 30
def 3 6 3/1/2017 23:07:54 30 35
def 4 7 4/1/2017 12:07:54 35 40
def 4 8 4/1/2017 13:07:54 40 45
I want to remove the duplicates based on three columns, name
, id
and date
and take the first value. I tried the following command:
data.drop_duplicates(subset=['name', 'id', 'date'],keep = 'first')
I also want to group these three columns and take the sum of value
and value2
column and I tried following column:
data[['name', 'id', 'date', 'value']].groupby(['name', 'id', 'date']).sum()
data[['name', 'id', 'date', 'value2']].groupby(['name', 'id', 'date']).sum()
Now I want to join all the three data frames and take the columns. I am thinking there should be a better way to do this? The following is the output I am looking for:
name id junk date time value value2
abc 1 1 1/1/2017 18:07:54 15 25
abc 2 3 2/1/2017 20:07:54 35 45
def 3 5 3/1/2017 22:07:54 55 65
def 4 7 4/1/2017 12:07:54 75 85
Where I want to consider to remove duplicates based on name
, id
and date
column, take the first value of junk
and time
columns and also add the value
and value2
columns.
Can anybody help me in doing this?
You need groupby
with agg
:
df = df.groupby(['name', 'id', 'date'])
.agg({'value':'sum', 'value2':'sum', 'time':'first', 'junk':'first'})
.reset_index()
print (df)
name id date value2 time junk value
0 abc 1 1/1/2017 25 18:07:54 1 15
1 abc 2 2/1/2017 45 20:07:54 3 35
2 def 3 3/1/2017 65 22:07:54 5 55
3 def 4 4/1/2017 85 12:07:54 7 755
Dynamic solution:
g_cols = ['name','id','date']
sum_cols = ['value','value2']
#remove columns in groupby and for sum
cols = df.columns[~df.columns.isin(sum_cols + g_cols)]
print (cols)
Index(['junk', 'time'], dtype='object')
#dict comprehension for sum columns
d_sum = {col:'sum' for col in sum_cols}
#dict comprehension for first columns
d = {col:'first' for col in cols}
#add dicts together
d.update(d_sum)
print (d)
{'value2': 'sum', 'time': 'first', 'junk': 'first', 'value': 'sum'}
df = df.groupby(g_cols).agg(d).reset_index()
print (df)
name id date value2 time junk value
0 abc 1 1/1/2017 25 18:07:54 1 15
1 abc 2 2/1/2017 45 20:07:54 3 35
2 def 3 3/1/2017 65 22:07:54 5 55
3 def 4 4/1/2017 85 12:07:54 7 75
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With