I have a dataframe that looks like this:
parent region
estid
1 NaN A
2 NaN B
3 1.0 A
4 1.0 B
5 2.0 C
6 2.0 C
7 8.0 A
What I want is to create an extra column containing the region of the parent, defaulting to None if the parent is not found in the data, e.g.:
parent region parent_region
estid
1 NaN A None
2 NaN B None
3 1.0 A A
4 1.0 B A
5 2.0 C B
6 2.0 C B
7 8.0 A None
The following returns the correct result:
df["parent_region"] = df.apply(lambda x : df.loc[x["parent"]]["region"] if not math.isnan(x["parent"]) and x["parent"] in df.index else None, axis = 1)
But I'm very scared of inefficiencies, given that my dataframe has 168 million rows. Is there a better way to do it? I looked at lookup and get but I can't quite figure out how to work with IDs that can be NaN or not present in the dataframe.
For instance, I thought this could work: df.lookup(df["region"], df["parent"]), but it doesn't like very much null keys. df.get("region") does not return the parent's region, but the column itself, so it doesn't do what I want.
You can use Series.map method which functions similar to a dictionary. The values from parent column and region column serve as the keys and values constituting it. The mapping takes place if they share a common index between them.
Additionally, na_action=ignore could be used to speedup this mapping process, as all NaNs present in these columns would be ignored altogether and simply propagated.
Lastly, the missing values must be replaced with None using Series.replace method.
df["parent_region"] = df.parent.map(df.region, na_action='ignore').replace({np.NaN:None})
Out[121]:
estid
1 None
2 None
3 A
4 A
5 B
6 B
7 None
Name: parent_region, dtype: object
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With