I have the following dataframe:
d = {'sample1':['REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG), REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)', 'REC(CHR=4,,POS=23332,,REF=A,,ALT=G,BAND=C4T)','REC(CHR=8,,POS=3335332,,REF=G,,ALT=A,BAND=AA4T)'], 'sample2':['REC(CHR=2,,POS=34545432,,REF=T,,ALT=A,,BAND=ARG)','REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)','REC(CHR=8,,POS=445332,,REF=G,,ALT=C,BAND=33T)'], 'sample3':['REC(CHR=2,,POS=87532,,REF=A,ALT=C,,BAND=1243D)','REC(CHR=4,,POS=2453344432,,REF=C,,ALT=T,BAND=EE3)','REC(CHR=8,,POS=23245332,,REF=T,,ALT=A,BAND=AA4T)'], 'sample4':['REC(CHR=2,,POS=4347532,,REF=T,,ALT=G,,BAND=GM34), REC(CHR=2,,POS=4323432,,REF=A,,ALT=T,,BAND=GMA34), REC(CHR=2,,POS=44423432,,REF=G,,ALT=T,,BAND=GSSMA34)','REC(CHR=4,,POS=225332,,REF=G,,ALT=A,BAND=EER4T)','REC(CHR=8,,POS=245332,,REF=A,,ALT=C,BAND=AA4T)']}
df1 = pd.DataFrame(d, index=['PP25','COX4','P53'])
What I'm trying to do is extract the POS, REF and ALT information i.e POS=4323432 and create another dataframe. The original file is much larger but I'm pretty sure the data in the columns of the original file are not strings.
I have tried the following:
cols = df1.select_dtypes('object').columns
df1[cols] = df1[cols].apply(lambda x: x.astype(str))
df1 = frame.apply(lambda x: x.str.extract('POS=, REF=, ALT='))
But can't seem to get it.
Desired output:
POS REF ALT
PP25 345432 G A
PP25 245332 T GA
PP25 34545432 T A
PP25 87532 A C
PP25 4347532 T G
PP25 4323432 A T
PP25 44423432 G T
COX4 23332 A G
COX4 45332 G GAGG
COX4 2453344432 C T
COX4 225332 G A
P53 3335332 G A
P53 445332 G C
P53 23245332 T A
P53 245332 A C
thanks!
stack, split, explode and use str.extract with a short regex:
out = (df1.stack()
.str.split(',\s+(?=REC)').explode()
.str.extract(r'POS=(\d+).*REF=([ACGT]).*ALT=([ACGT])')
)
Alternative with named capturing groups and dropping the second level:
out = (df1.stack()
.str.split(',\s+(?=REC)').explode()
.str.extract(r'POS=(?P<POS>\d+).*REF=(?P<REF>[ACGT]).*ALT=(?P<ALT>[ACGT])')
.droplevel(1)
)
NB. I assumed you only want to match A/T/G/C for REF and ALT, if you have more characters you can add them in the group.
output:
POS REF ALT
PP25 345432 G A
PP25 245332 T G
PP25 34545432 T A
PP25 87532 A C
PP25 4347532 T G
PP25 4323432 A T
PP25 44423432 G T
COX4 23332 A G
COX4 45332 G G
COX4 2453344432 C T
COX4 225332 G A
P53 3335332 G A
P53 445332 G C
P53 23245332 T A
P53 245332 A C
If the fields are not always in the same order (POS->REF->ALT), you'll have to use extractall and groupby.agg:
(df1
.stack().str.split(',\s+(?=REC)').explode()
.str.extractall(r'POS=(?P<POS>\d+)|REF=(?P<REF>[ACGT])|ALT=(?P<ALT>[ACGT])')
.groupby(level=[0,1], sort=False).first()
.droplevel(1)
)
reproducibility test:
import pandas as pd
d = {'sample1':['REC(CHR=2,,POS=345432,,REF=G,ALT=A,,BAND=ARG), REC(CHR=2,,POS=245332,,REF=T,,ALT=GA,BAND=AA4T)', 'REC(CHR=4,,POS=23332,,REF=A,,ALT=G,BAND=C4T)','REC(CHR=8,,POS=3335332,,REF=G,,ALT=A,BAND=AA4T)'], 'sample2':['REC(CHR=2,,POS=34545432,,REF=T,,ALT=A,,BAND=ARG)','REC(CHR=4,,POS=45332,,REF=G,,ALT=GAGG,BAND=AA4SST)','REC(CHR=8,,POS=445332,,REF=G,,ALT=C,BAND=33T)'], 'sample3':['REC(CHR=2,,POS=87532,,REF=A,ALT=C,,BAND=1243D)','REC(CHR=4,,POS=2453344432,,REF=C,,ALT=T,BAND=EE3)','REC(CHR=8,,POS=23245332,,REF=T,,ALT=A,BAND=AA4T)'], 'sample4':['REC(CHR=2,,POS=4347532,,REF=T,,ALT=G,,BAND=GM34), REC(CHR=2,,POS=4323432,,REF=A,,ALT=T,,BAND=GMA34), REC(CHR=2,,POS=44423432,,REF=G,,ALT=T,,BAND=GSSMA34)','REC(CHR=4,,POS=225332,,REF=G,,ALT=A,BAND=EER4T)','REC(CHR=8,,POS=245332,,REF=A,,ALT=C,BAND=AA4T)']}
df1 = pd.DataFrame(d, index=['PP25','COX4','P53'])
(df1.stack()
.str.split(',\s+(?=REC)').explode()
.str.extract(r'POS=(?P<POS>\d+).*REF=(?P<REF>[ACGT]).*ALT=(?P<ALT>[ACGT])')
.droplevel(1)
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With