Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas how to extract specific strings from dataframe

I have the following dataframe:

d = {'sample1':['REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG), REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)', 'REC(CHR=4,,POS=23332,,REF=A,,ALT=G,BAND=C4T)','REC(CHR=8,,POS=3335332,,REF=G,,ALT=A,BAND=AA4T)'], 'sample2':['REC(CHR=2,,POS=34545432,,REF=T,,ALT=A,,BAND=ARG)','REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)','REC(CHR=8,,POS=445332,,REF=G,,ALT=C,BAND=33T)'], 'sample3':['REC(CHR=2,,POS=87532,,REF=A,ALT=C,,BAND=1243D)','REC(CHR=4,,POS=2453344432,,REF=C,,ALT=T,BAND=EE3)','REC(CHR=8,,POS=23245332,,REF=T,,ALT=A,BAND=AA4T)'], 'sample4':['REC(CHR=2,,POS=4347532,,REF=T,,ALT=G,,BAND=GM34), REC(CHR=2,,POS=4323432,,REF=A,,ALT=T,,BAND=GMA34), REC(CHR=2,,POS=44423432,,REF=G,,ALT=T,,BAND=GSSMA34)','REC(CHR=4,,POS=225332,,REF=G,,ALT=A,BAND=EER4T)','REC(CHR=8,,POS=245332,,REF=A,,ALT=C,BAND=AA4T)']}
df1 = pd.DataFrame(d, index=['PP25','COX4','P53'])

What I'm trying to do is extract the POS, REF and ALT information i.e POS=4323432 and create another dataframe. The original file is much larger but I'm pretty sure the data in the columns of the original file are not strings.

I have tried the following:

cols = df1.select_dtypes('object').columns
df1[cols] = df1[cols].apply(lambda x: x.astype(str))
df1 = frame.apply(lambda x: x.str.extract('POS=, REF=, ALT='))

But can't seem to get it.

Desired output:

         POS        REF   ALT
PP25     345432     G    A
PP25     245332     T    GA
PP25     34545432   T    A
PP25     87532      A    C
PP25     4347532    T    G
PP25     4323432    A    T
PP25     44423432   G    T
COX4     23332      A    G
COX4     45332      G    GAGG
COX4     2453344432 C    T
COX4     225332     G    A
P53      3335332    G    A
P53      445332     G    C
P53      23245332   T    A
P53      245332     A    C


thanks!

like image 473
Chip Avatar asked Oct 21 '25 03:10

Chip


1 Answers

stack, split, explode and use str.extract with a short regex:

out = (df1.stack()
          .str.split(',\s+(?=REC)').explode()
          .str.extract(r'POS=(\d+).*REF=([ACGT]).*ALT=([ACGT])')
       )

Alternative with named capturing groups and dropping the second level:

out = (df1.stack()
          .str.split(',\s+(?=REC)').explode()
          .str.extract(r'POS=(?P<POS>\d+).*REF=(?P<REF>[ACGT]).*ALT=(?P<ALT>[ACGT])')
          .droplevel(1)
       )

NB. I assumed you only want to match A/T/G/C for REF and ALT, if you have more characters you can add them in the group.

output:

             POS REF ALT
PP25      345432   G   A
PP25      245332   T   G
PP25    34545432   T   A
PP25       87532   A   C
PP25     4347532   T   G
PP25     4323432   A   T
PP25    44423432   G   T
COX4       23332   A   G
COX4       45332   G   G
COX4  2453344432   C   T
COX4      225332   G   A
P53      3335332   G   A
P53       445332   G   C
P53     23245332   T   A
P53       245332   A   C

If the fields are not always in the same order (POS->REF->ALT), you'll have to use extractall and groupby.agg:

(df1
 .stack().str.split(',\s+(?=REC)').explode()
 .str.extractall(r'POS=(?P<POS>\d+)|REF=(?P<REF>[ACGT])|ALT=(?P<ALT>[ACGT])')
 .groupby(level=[0,1], sort=False).first()
 .droplevel(1)
)

reproducibility test:

import pandas as pd
d = {'sample1':['REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG), REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)', 'REC(CHR=4,,POS=23332,,REF=A,,ALT=G,BAND=C4T)','REC(CHR=8,,POS=3335332,,REF=G,,ALT=A,BAND=AA4T)'], 'sample2':['REC(CHR=2,,POS=34545432,,REF=T,,ALT=A,,BAND=ARG)','REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)','REC(CHR=8,,POS=445332,,REF=G,,ALT=C,BAND=33T)'], 'sample3':['REC(CHR=2,,POS=87532,,REF=A,ALT=C,,BAND=1243D)','REC(CHR=4,,POS=2453344432,,REF=C,,ALT=T,BAND=EE3)','REC(CHR=8,,POS=23245332,,REF=T,,ALT=A,BAND=AA4T)'], 'sample4':['REC(CHR=2,,POS=4347532,,REF=T,,ALT=G,,BAND=GM34), REC(CHR=2,,POS=4323432,,REF=A,,ALT=T,,BAND=GMA34), REC(CHR=2,,POS=44423432,,REF=G,,ALT=T,,BAND=GSSMA34)','REC(CHR=4,,POS=225332,,REF=G,,ALT=A,BAND=EER4T)','REC(CHR=8,,POS=245332,,REF=A,,ALT=C,BAND=AA4T)']}
df1 = pd.DataFrame(d, index=['PP25','COX4','P53'])
(df1.stack()
    .str.split(',\s+(?=REC)').explode()
    .str.extract(r'POS=(?P<POS>\d+).*REF=(?P<REF>[ACGT]).*ALT=(?P<ALT>[ACGT])')
    .droplevel(1)
)
like image 194
mozway Avatar answered Oct 23 '25 18:10

mozway



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!