Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Dataframe manipulation: "explode rows" on new dataframe with repeated indices

I have two dataframes say df1 and df2, for example

import pandas as pd
col_1= ["A", ["B","C"], ["A","C","D"], "D"]
col_id = [1,2,3,4]
col_2 = [1,2,2,3,3,4,4]
d1  = {'ID': [1,2,3,4], 'Labels': col_1}
d2  = {'ID': col_2, }
d_2_get = {'ID': col_2, "Labels": ["A", "B", "C", "A", "C", "D", np.nan] }
df1 = pd.DataFrame(data=d1)
df2 = pd.DataFrame(data=d2)
df_2_get = pd.DataFrame(data=d_2_get)

df1 looking like

    ID  col2
0   1   A
1   2   [B, C]
2   3   [A, C, D]
3   4   D

and df2 looking like

    ID
0   1
1   2
2   2
3   3
4   3
5   4
6   4

I want to add a column Labels to df2, taken from df1, in such a way that:

  • for index i, start with the first value in df1
  • if the new row in df2["ID"] has a repeated entry, get the next value in df1, if it exists. If not, set NaN.

Given df1 and df2, the output should look like df_2_get below

    ID  Labels
0   1   A
1   2   B
2   2   C
3   3   A
4   3   C
5   4   D
6   4   NaN

My current clumsy attempt is below,

from collections import Counter
def list_flattener(list_of_lists):
    return [item for row in list_of_lists for item in row]

def my_dataframe_filler(df1, df2):
    list_2_fill = []
    repeats = dict(Counter(df2["ID"]))
    for k in repeats.keys():
        available_labels_list = df1[df1["ID"]==k]["Labels"].tolist()
        available_labels_list+=[[np.nan]*10]
        available_labels_list = list_flattener(available_labels_list)
        list_2_fill+=available_labels_list[:repeats[k]]
    return list_2_fill

and then use as

df2["Labels"] = my_dataframe_filler(df1, df2)

but I would like to learn how a pandas black belt would handle the problem, thanks

like image 545
user37292 Avatar asked Sep 16 '25 23:09

user37292


1 Answers

IIUC, you could explode and perform a merge after deduplication with groupby.cumcount:

out = (df2
    .assign(n=df2.groupby('ID').cumcount())
    .merge(df1.explode('Labels').assign(n=lambda x: x.groupby('ID').cumcount()),
           on=['ID', 'n'], how='left'
          )
    #.drop(columns='n')
)

Output:

   ID  n Labels
0   1  0      A
1   2  0      B
2   2  1      C
3   3  0      A
4   3  1      C
5   4  0      D
6   4  1    NaN

Alternatively, a pure python approach using iterators, map and next:

# for each list, build an iterator
d = dict(zip(df1['ID'], map(iter, df1['Labels'])))

# take the appropriate list and get the next available item
# default to None if exhausted
df2['Labels'] = df2['ID'].map(lambda x: next(d[x], None))

Output:

   ID Labels
0   1      A
1   2      B
2   2      C
3   3      A
4   3      C
5   4      D
6   4   None
like image 163
mozway Avatar answered Sep 19 '25 16:09

mozway