Remove duplicates with few columns and sum the other columns

Question

The following is my data:

name    id  junk    date    time      value   value2
abc      1   1    1/1/2017  18:07:54    5       10
abc      1   2    1/1/2017  19:07:54    10      15
abc      2   3    2/1/2017  20:07:54    15      20
abc      2   4    2/1/2017  21:07:54    20      25
def      3   5    3/1/2017  22:07:54    25      30
def      3   6    3/1/2017  23:07:54    30      35
def      4   7    4/1/2017  12:07:54    35      40 
def      4   8    4/1/2017  13:07:54    40      45

I want to remove the duplicates based on three columns, name, id and date and take the first value. I tried the following command:

data.drop_duplicates(subset=['name', 'id', 'date'],keep = 'first')

I also want to group these three columns and take the sum of value and value2 column and I tried following column:

data[['name', 'id', 'date', 'value']].groupby(['name', 'id', 'date']).sum()
data[['name', 'id', 'date', 'value2']].groupby(['name', 'id', 'date']).sum()

Now I want to join all the three data frames and take the columns. I am thinking there should be a better way to do this? The following is the output I am looking for:

name    id  junk    date    time      value   value2
abc      1   1    1/1/2017  18:07:54    15      25
abc      2   3    2/1/2017  20:07:54    35      45
def      3   5    3/1/2017  22:07:54    55      65
def      4   7    4/1/2017  12:07:54    75      85

Where I want to consider to remove duplicates based on name, id and date column, take the first value of junk and time columns and also add the value and value2 columns.

Can anybody help me in doing this?

jezrael · Accepted Answer

You need groupby with agg:

df = df.groupby(['name', 'id', 'date'])
       .agg({'value':'sum', 'value2':'sum', 'time':'first', 'junk':'first'})
       .reset_index()
print (df)
  name  id      date  value2      time  junk  value
0  abc   1  1/1/2017      25  18:07:54     1     15
1  abc   2  2/1/2017      45  20:07:54     3     35
2  def   3  3/1/2017      65  22:07:54     5     55
3  def   4  4/1/2017      85  12:07:54     7     755

Dynamic solution:

g_cols = ['name','id','date']
sum_cols = ['value','value2']

#remove columns in groupby and for sum
cols = df.columns[~df.columns.isin(sum_cols + g_cols)]
print (cols)
Index(['junk', 'time'], dtype='object')

#dict comprehension for sum columns
d_sum = {col:'sum' for col in sum_cols}
#dict comprehension for first columns
d = {col:'first' for col in cols}
#add dicts together
d.update(d_sum)
print (d)
{'value2': 'sum', 'time': 'first', 'junk': 'first', 'value': 'sum'}

df = df.groupby(g_cols).agg(d).reset_index()
print (df)
 name  id      date  value2      time  junk  value
0  abc   1  1/1/2017      25  18:07:54     1     15
1  abc   2  2/1/2017      45  20:07:54     3     35
2  def   3  3/1/2017      65  22:07:54     5     55
3  def   4  4/1/2017      85  12:07:54     7     75

Remove duplicates with few columns and sum the other columns

Tags:

python

python-3.x

pandas

aggregate

python-2.7

haimen

1 Answers

jezrael

Recent Activity

Donate For Us

Remove duplicates with few columns and sum the other columns

Tags:

python

python-3.x

pandas

aggregate

python-2.7

haimen

1 Answers

jezrael

Related questions

Recent Activity

Donate For Us