Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Filter out rows of a dataframe containing a specific string

I have a massive dataframe. The dataframe has column patient.drug. This column contains list of dictionaries as its elements. I want to filter out all the rows that contain 'NIFEDIPINE' word in patient.drug column.

The dataframe is very large. Here is a sample of it.

                                                         patient.drug
0                       [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1                       [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]      
2                       [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3                       [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]      

so far, I have tried

df[df['patient.drug'].str.contains('NIFEDIPINE')]

but it is giving me an error.

 raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n              ...\n              nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],\n             dtype='float64', length=12000)] are in the [columns]"

I have also tried using in operator and iterating over rows.

lst=[]
for i in range(len(df)):
    if 'NIFEDIPINE' in df.loc[i, "patirnt.drug"]:
        lst.append(i)
print(lst)

Which is also causing an error. What should I do to get it right?

like image 909
user16309118 Avatar asked Jan 17 '26 12:01

user16309118


2 Answers

After reproducing your Data,

>>> df
                                                         patient.drug
0  [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1  [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2  [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3  [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
4  [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]

While using Your code:

>>> df[df['patient.drug'].str.contains('NIFEDIPINE')]

Error:

    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan], dtype='float64')] are in the [columns]"

Solution:

    >>> df[df['patient.drug'].astype('str').str.contains('NIFEDIPINE')]
                                                        patient.drug
1  [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
4  [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]

Note:

This is raising issue due to indexer check in the pandas indexer.py section, which is as follows:

--> pandas/core/indexing.py

# Count missing values:
missing_mask = indexer < 0
missing = (missing_mask).sum()

if missing:
    if missing == len(indexer):
        axis_name = self.obj._get_axis_name(axis)
        raise KeyError(f"None of [{key}] are in the [{axis_name}]")

    # We (temporarily) allow for some missing keys with .loc, except in
    # some cases (e.g. setting) in which "raise_missing" will be False
like image 93
Karn Kumar Avatar answered Jan 20 '26 04:01

Karn Kumar


Suppose you have this layout of column:

Search string 'NIFEDIPINE' found on the 2nd and 4th entries:

data = {'patient.drug': 
     [[{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}],
      [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}],
      [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}],
      [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}],
     ]
}
df = pd.DataFrame(data)

                                                         patient.drug
0       [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1   [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]      <=== keyword here
2  [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3   [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]      <=== keyword here

(Layout digged out from your previous questions)

Solution:

[Updated for 1) support more than one dict in a list and 2) partial string match].

Use: .loc + .explode() + .apply():

keyword = 'NIFEDIPINE'
df.loc[df['patient.drug'].explode().apply(lambda d: keyword in ' '.join(d.values())).any(level=0)]

Result:

Rows with keyword string 'NIFEDIPINE' correctly extracted and displayed:

                                                        patient.drug
1  [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
3  [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]
like image 23
SeaBean Avatar answered Jan 20 '26 04:01

SeaBean