I am working with pandas dataframes which contain arrays inside the dataframe elements. I'm trying to "apply" a function to these elements, and then return an array. But I have some very inconsistent behavior. The function runs okay the first few times, but then it fails. Here is my code:
import pandas as pd
import numpy as np
def g(x):  # Function fails if I omit the .tolist()
    return (np.concatenate([x['B'][1:], x['C'][1:]])).tolist()
df = pd.DataFrame({'A' : (1,2,3), \
'B': (np.array([0,1,2,3]),np.array([3,4,5,6]),np.array([6,7,8,9])), \
'C': (np.array([0,1,2,3]),np.array([2,9,6,9]),np.array([2,4,6,7]))})
# Before we start
print(df)
print("B is type: ", type(df.loc[0,'B']))
# First time
df['G'] = df.apply(g, axis=1)
print("G is type: ", type(df.loc[0,'G']))
# Second time
df['H'] = df.apply(g, axis=1)
print("H is type: ", type(df.loc[0,'H']))
# Third time
df['I'] = df.apply(g, axis=1)
print("I is type: ", type(df.loc[0,'I']))
# Fourth time - this one fails for me
df['J'] = df.apply(g, axis=1)
print("J is type: ", type(df.loc[0,'J']))
# Fifth time
df['K'] = df.apply(g, axis=1)
print("K is type: ", type(df.loc[0,'K']))
The code runs fine for me, up to the line df['J'], where it fails. The output is like this:
A B C
0 1 [0, 1, 2, 3] [0, 1, 2, 3]
1 2 [3, 4, 5, 6] [2, 9, 6, 9]
2 3 [6, 7, 8, 9] [2, 4, 6, 7]
B is type: <class 'numpy.ndarray'>
G is type: <class 'list'>
H is type: <class 'list'>
I is type: <class 'list'>
Then there is a big long error message which finishes with "ValueError: Wrong number of items passed 6, placement implies 1", and there is also a "KeyError: 'J'" in there too.
The crazy thing is that the function runs fine the first few times. My questions are:
1. Why does it fail at df['J']?
2. How can I get g(x) to return a numpy array rather than a list? If I leave out the .tolist() it gives me an error.
Any help would be hugely appreciated! I've spent 2 days trying to understand what is going on here.
P.S. I haven't explained why I am using arrays inside dataframe elements, but I can explain if you think it would help.
Between successive calls to apply with g, your dataframe changes (each call adds a column), so it is not really a surprise that pandas does not react the same way every time. If you only need to apply it to columns B and C, I suggest you write:
df['J'] = df[['B','C']].apply(g, axis=1)
print("J is type: ", type(df.loc[0,'J']))
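This way it works fine (but once again, it only takes columns B and C into account).
As for the error: according to Ians, it seems to be because apply returns a DataFrame instead of a Series once the list returned by g has as many items as the dataframe has columns. Here g returns 6 items, and after G, H and I are added the dataframe has 6 columns, so the result can no longer be assigned to df['J']. That fits the message "Wrong number of items passed 6, placement implies 1".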
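For the second question (getting numpy arrays rather than lists), one option is to skip apply altogether and build the column from a plain Python loop over B and C, so pandas never has to guess the shape of the result. This is only a minimal sketch under that assumption: g_pair is a hypothetical helper that does the same slicing as g but takes the two arrays directly, and the behavior of apply itself may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': (1, 2, 3),
    'B': (np.array([0, 1, 2, 3]), np.array([3, 4, 5, 6]), np.array([6, 7, 8, 9])),
    'C': (np.array([0, 1, 2, 3]), np.array([2, 9, 6, 9]), np.array([2, 4, 6, 7])),
})

def g_pair(b, c):
    # Hypothetical helper: same logic as g, but returns a numpy array
    return np.concatenate([b[1:], c[1:]])

# Repeat the assignment five times, as in the question
for name in ['G', 'H', 'I', 'J', 'K']:
    # Each element of the new column is one numpy array (object dtype column)
    values = [g_pair(b, c) for b, c in zip(df['B'], df['C'])]
    df[name] = pd.Series(values, index=df.index)
    print(name, "is type: ", type(df.loc[0, name]))
```

Because the loop only reads B and C, the number of columns already in the dataframe has no influence on the result.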