I have a column of a pandas dataframe that I would like to split and expand into a new dataframe based on the second instance of a delimiter. I was splitting based on the last instance of the delimiter, but unfortunately there are a handful of instances in ~80k rows that have 4 '_' instead of 3.
For example, I have a dataframe with multiple columns where the one I would like to split into a new dataframe looks like the following:
df.head()
gene
0 NM_000000_foo_blabla
1 NM_000001_bar
and I want to split & expand it such that it separates to this:
(Desired)
df2.head()
col1 col2
0 NM_000000 foo_bar
1 NM_000001 foo
In using my current code:
df2 = df['gene'].str.rsplit('_', 1, expand=True).rename(lambda x: f'col{x + 1}', axis=1)
I get this:
(Actual)
df2.head()
col1 col2
0 NM_000000_foo bar
1 NM_000001 foo
Is there a simple way to achieve this my modifying the line of code I'm already using? I tried playing with the number of splits in rsplit but couldn't achieve the result I was looking for. Thanks!
Since your data seems to be fairly well defined, you can extract on the second instance of the delimiter using a regular expression.
df['gene'].str.extract(r'(?:[^_]+_){2}(.*)')
0
0 foo_blabla
1 bar
You can generalize this to be any delimiter, and match it any number of times using a simple function:
def build_regex(delimiter, num_matches=1):
return rf'(?:[^{delimiter}]+{delimiter}){{{num_matches}}}(.*)'
>>> build_regex('_', 2)
'(?:[^_]+_){2}(.*)'
>>> df['gene'].str.extract(build_regex('_', 2))
0
0 foo_blabla
1 bar
>>> df['gene'].str.extract(build_regex('_', 3))
0
0 blabla
1 NaN
Regex Explanation
(?: # non capture group
[^_]+ # match anything but _ one or more times
_ # match _
){2} # match this group 2 times
( # start of capture group 1
.* # match anything greedily
) # end of matching group 1
If there wasn't guaranteed to be text before either of the first two delimiters, you can also make the not assertion match 0 or more times:
(?:[^_]*_){2}(.*)
Just replace 2nd '_' by your custom deliminator and split on it
df.gene.str.replace(r'([^_]+_[^_]+)_', r'\1|').str.split('|', expand=True)
Out[488]:
0 1
0 NM_000000 foo_blabla
1 NM_000001 bar
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With