I have a pandas dataframe such as:
    Species   Pathway    Number of Gene Families
1   uniSU2    ABC        1.0
2   uniSU2    Wzy        11.0
3   uniSU2    Synthase   2.0
4   n116      Wzy        0.0
5   n116      ABC        4.0
7   n116      Synthase   14.0
8   Aullax    ABC        9.0
9   Aullax    Synthase   1.0
10  Aullax    Wzy        2.0
11  Criepi    Wzy        0.0
12  Criepi    ABC        2.0
13  Criepi    Synthase   3.0
I want to select the Species (1st column) that have all three possible pathways: ABC, Wzy and Synthase (2nd column). In other words, the Number of Gene Families (3rd column) must be positive (> 0) for each of the three pathways: ABC > 0, Wzy > 0 and Synthase > 0.
The expected result for this subset of my dataframe would be:
Species
uniSU2
Aullax
I think this gets me halfway:
geneCount_stacked.loc[geneCount_stacked['Number of Gene Families'] > 0, ['Species','Pathway']]
But I can't work out how to move forward from here.
Many thanks in advance!
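Try this (here df stands for your dataframe, geneCount_stacked in the question). It groups the rows by Species and keeps a species only if its group contains all three pathways and every count in the group is positive: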
res = pd.DataFrame({'Species': [x for x, y in df.groupby('Species') if len({'ABC', 'Wzy', 'Synthase'} & set(y.Pathway)) == 3 and all(y['Number of Gene Families'] > 0)]})
Output
Species
0 Aullax
1 uniSU2
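If you would rather avoid the Python-level loop over groups, here is a minimal vectorized sketch of the same idea: filter to positive counts first, then count the distinct pathways left per species. The dataframe is rebuilt from the sample in the question (assuming the "Aulax" row is meant to be "Aullax"), and the names required, positive, n_pathways and selected are only illustrative. Note one difference from the one-liner above: this version only looks at the three required pathways, so a zero count for some other pathway would not exclude a species.

```python
import pandas as pd

# Sample data rebuilt from the question (assumes "Aulax" is a typo for "Aullax")
df = pd.DataFrame({
    "Species": ["uniSU2"] * 3 + ["n116"] * 3 + ["Aullax"] * 3 + ["Criepi"] * 3,
    "Pathway": ["ABC", "Wzy", "Synthase",
                "Wzy", "ABC", "Synthase",
                "ABC", "Synthase", "Wzy",
                "Wzy", "ABC", "Synthase"],
    "Number of Gene Families": [1.0, 11.0, 2.0,
                                0.0, 4.0, 14.0,
                                9.0, 1.0, 2.0,
                                0.0, 2.0, 3.0],
})

required = {"ABC", "Wzy", "Synthase"}

# Keep only rows for the required pathways that have a positive count
positive = df[df["Pathway"].isin(required) & (df["Number of Gene Families"] > 0)]

# Count how many distinct required pathways remain for each species
n_pathways = positive.groupby("Species")["Pathway"].nunique()

# A species qualifies only if all required pathways survived the filter
selected = n_pathways[n_pathways == len(required)].index.tolist()
print(selected)  # ['Aullax', 'uniSU2']
```

Wrap the list with pd.DataFrame({'Species': selected}) if you need the result as a dataframe, as in the answer above.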