Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to flatten delimited strings in a list-column of a Pandas DataFrame?

I have a pandas DataFrame with one list-column, say:

df = pd.DataFrame({"pairs": [["A|B", "B|C", "C|D", "D|F"], ["A|D", "D|F", "F|G", "G|D"], ["C|D", "D|X"]]})

The lists in the pairs column always contain consecutive pairs, where the 2 elements of the pairs are separated by a |. I want to kind of "flatten" the lists in these columns, by instead of storing the pairs, now storing the elements of the pairs in the same order. So the desired DataFrame looks like:

elements
[A, B, C, D, F]
[A, D, F, G, D]
[C, D, X]

(Edited: I also would like to have elements occur multiple times in a resulting list, like D in the second row)

This looks so simple and I can't believe there wouldn't be an efficient solution to this, but so far I have failed to find any python method that could help me

like image 289
Peter Avatar asked Oct 20 '25 16:10

Peter


1 Answers

You can use set-comprehension:

df["elements"] = df["pairs"].apply(
    lambda x: {ww for w in x for ww in w.split("|")}
)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  {B, C, D, A, F}
1       [A|D, D|F, F|G]     {G, D, F, A}
2            [C|D, D|X]        {X, C, D}

If you want lists:

df["elements"] = df["pairs"].apply(
    lambda x: list({ww for w in x for ww in w.split("|")})
)
print(df)
                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [D, F, A, C, B]
1       [A|D, D|F, F|G]     [G, D, A, F]
2            [C|D, D|X]        [X, D, C]

EDIT: To mantain order:

def fn(x):
    seen = set()
    out = []
    for v in x:
        for w in v.split("|"):
            if not w in seen:
                seen.add(w)
                out.append(w)
    return out


df["elements"] = df["pairs"].apply(fn)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [A, B, C, D, F]
1  [A|D, D|F, F|G, G|D]     [A, D, F, G]
2            [C|D, D|X]        [C, D, X]


EDIT: To keep multiple elements + order:

from itertools import groupby, chain

def fn(x):
    return [v for v, _ in groupby(chain.from_iterable(v.split("|") for v in x))]

df["elements"] = df["pairs"].apply(fn)
print(df)

Prints:

                  pairs         elements
0  [A|B, B|C, C|D, D|F]  [A, B, C, D, F]
1  [A|D, D|F, F|G, G|D]  [A, D, F, G, D]
2            [C|D, D|X]        [C, D, X]
like image 91
Andrej Kesely Avatar answered Oct 23 '25 05:10

Andrej Kesely



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!