I have a massive dataframe. The dataframe has column patient.drug. This column contains list of dictionaries as its elements. I want to filter out all the rows that contain 'NIFEDIPINE' word in patient.drug column.
The dataframe is very large. Here is a sample of it.
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
so far, I have tried
df[df['patient.drug'].str.contains('NIFEDIPINE')]
but it is giving me an error.
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n ...\n nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],\n dtype='float64', length=12000)] are in the [columns]"
I have also tried using in operator and iterating over rows.
lst=[]
for i in range(len(df)):
if 'NIFEDIPINE' in df.loc[i, "patirnt.drug"]:
lst.append(i)
print(lst)
Which is also causing an error. What should I do to get it right?
After reproducing your Data,
>>> df
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
4 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
While using Your code:
>>> df[df['patient.drug'].str.contains('NIFEDIPINE')]
Error:
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan], dtype='float64')] are in the [columns]"
>>> df[df['patient.drug'].astype('str').str.contains('NIFEDIPINE')]
patient.drug
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
4 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
Note:
This is raising issue due to indexer check in the pandas indexer.py section, which is as follows:
--> pandas/core/indexing.py
# Count missing values:
missing_mask = indexer < 0
missing = (missing_mask).sum()
if missing:
if missing == len(indexer):
axis_name = self.obj._get_axis_name(axis)
raise KeyError(f"None of [{key}] are in the [{axis_name}]")
# We (temporarily) allow for some missing keys with .loc, except in
# some cases (e.g. setting) in which "raise_missing" will be False
Suppose you have this layout of column:
Search string 'NIFEDIPINE' found on the 2nd and 4th entries:
data = {'patient.drug':
[[{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}],
[{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}],
[{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}],
[{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}],
]
}
df = pd.DataFrame(data)
patient.drug
0 [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}] <=== keyword here
2 [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}] <=== keyword here
(Layout digged out from your previous questions)
Solution:
[Updated for 1) support more than one dict in a list and 2) partial string match].
Use: .loc + .explode() + .apply():
keyword = 'NIFEDIPINE'
df.loc[df['patient.drug'].explode().apply(lambda d: keyword in ' '.join(d.values())).any(level=0)]
Result:
Rows with keyword string 'NIFEDIPINE' correctly extracted and displayed:
patient.drug
1 [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
3 [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With