First of all I am sorry if this question is already answered clearly. I have seen there are very similar answers, but I couldn't use it. So my problem is to match between two sets of columns (UsedFName==FName and UsedLName==LName) and then fill the Usedid Column with the ids from 'id' column when it fully matches.
So here is a toy data set
>> df
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain NaN
1 Nadia Alam 2002 Tanvir Hossain NaN
2 Pia Naime 2003 Tanvir Hossain NaN
3 Koethe Talukdar 2004 Koethe Talukdar NaN
4 Manual Hausman 2005 Koethe Talukdar NaN
5 Constantine Pape NaN Max Weber NaN
6 Andreas Kai 2006 Max Weber NaN
7 Max Weber 2007 Manual Hausman NaN
8 Weber Mac 2008 Manual Hausman NaN
9 Plank Ingo 2009 Manual Hausman NaN
10 Tanvir Hossain 2001 Pia Naime NaN
11 Weber Mac 2008 Pia Naime NaN
12 Manual Hausman 2005 Tanvir Hossain NaN
13 Max Weber 2007 Tanvir Hossain NaN
14 Nadia Alam 2002 Manual Hausman NaN
15 Weber Mac 2008 Manual Hausman NaN
16 Pia Naime 2003 Koethe Talukdar NaN
17 Pia Naime 2003 Koethe Talukdar NaN
18 Constantine Pape NaN Koethe Talukdar NaN
19 Koethe Talukdar 2004 Koethe Talukdar NaN
20 Koethe Talukdar 2005 Manual Hausman NaN
21 NaN NaN NaN Manual Hausman NaN
22 NaN NaN NaN Manual Hausman NaN
23 NaN NaN NaN Manual Hausman NaN
24 NaN NaN NaN Manual Hausman NaN
25 NaN NaN NaN Manual Hausman NaN
26 NaN NaN NaN Manual Hausman NaN
27 NaN NaN NaN Manual Hausman NaN
This is the output
>>> df
FName LName id UsedFName UsedLName Usedid
0 Tanvir Hossain 2001 Tanvir Hossain 2001
1 Nadia Alam 2002 Tanvir Hossain 2001
2 Pia Naime 2003 Tanvir Hossain 2001
3 Koethe Talukdar 2004 Koethe Talukdar 2005
4 Manual Hausman 2005 Koethe Talukdar 2005
5 Constantine Pape NaN Max Weber 2007
6 Andreas Kai 2006 Max Weber 2007
7 Max Weber 2007 Manual Hausman 2005
8 Weber Mac 2008 Manual Hausman 2005
9 Plank Ingo 2009 Manual Hausman 2005
10 Tanvir Hossain 2001 Pia Naime 2003
11 Weber Mac 2008 Pia Naime 2003
12 Manual Hausman 2005 Tanvir Hossain 2001
13 Max Weber 2007 Tanvir Hossain 2001
14 Nadia Alam 2002 Manual Hausman 2005
15 Weber Mac 2008 Manual Hausman 2005
16 Pia Naime 2003 Koethe Talukdar 2005
17 Pia Naime 2003 Koethe Talukdar 2005
18 Constantine Pape NaN Koethe Talukdar 2005
19 Koethe Talukdar 2004 Koethe Talukdar 2005
20 Koethe Talukdar 2005 Manual Hausman 2005
21 NaN NaN NaN Manual Hausman 2005
22 NaN NaN NaN Manual Hausman 2005
23 NaN NaN NaN Manual Hausman 2005
24 NaN NaN NaN Manual Hausman 2005
25 NaN NaN NaN Manual Hausman 2005
26 NaN NaN NaN Manual Hausman 2005
27 NaN NaN NaN Manual Hausman 2005
Actually I was able to do it using nested for loops, here is the code:
for i in df['UsedFName'].index:
for j in df['FName'].index:
if df['UsedFName'][i]==df['FName'][j] & df['UsedLName'][i]==df['LName'][j]:
df.ix[i,'Usedid'] = df.ix[j,'id']
But using nested for loops here is computationally very expensive. I have a huge data set. Is it possible to use it without nested loops? Is there any simple Pythonic ways or Pandas/Numpy ways that I can use here?
Many thanks in advance for the help...looking forward to learn Python.
You will have to think of a more pandaesque may to add the hashing logic, but this matches your expected output and is a lot more efficient, all you want is to use the ids matching UsedFName and "UsedLName" to FName and LNames:
import pandas as pd
# Create dict where each key is tuple -> (FName,Lname)
# with the corresponding id as the value
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))
# Do a lookup in d using a tuple -> (UsedFName, UsedLName) to get the correct id for each pairing
df["Usedid"] = [d[(f, l)] for f,l in zip(df["UsedFName"], df["UsedLName"])]
print(df["Usedid"])
Output:
0 2001
1 2001
2 2001
3 2005
4 2005
5 2007
6 2007
7 2005
8 2005
9 2005
10 2003
11 2003
12 2001
13 2001
14 2005
15 2005
16 2005
17 2005
18 2005
19 2005
20 2005
21 2005
22 2005
23 2005
24 2005
25 2005
26 2005
27 2005
Name: Useid, dtype: float64
If some names may not be used you can use a default value using dict.get.
This is faster than the suggested groupby:
In [12]: %%timeit
df = pd.read_csv("in.csv")
ids = df.groupby(['FName', 'LName']).id.apply(list)
df.Usedid = df.apply(lambda x: int(ids[x.UsedFName, x.UsedLName][-1]), axis=1)
....:
100 loops, best of 3: 5.27 ms per loop
In [13]: %%timeit
df = pd.read_csv("in.csv")
d = dict(zip(((f, l) for f, l in zip(df["FName"], df["LName"])), df["id"]))
df["Usedid"] = [d[(f, l)] for f, l in zip(df["UsedFName"], df["UsedLName"])]
....:
1000 loops, best of 3: 1.41 ms per loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With