Writing variable sized arrays to Pandas cells

Question

I have a large data set and I want to do a convolution calculation using multiple rows that match a criterion. I need to calculate a vector for each row first, and I thought it would be more efficient to store each vector in a dataframe column so I could avoid a for loop when I do the convolution. Trouble is, the vectors are variable length and I can't figure out how to do it.

Here's a summary of my data:

Date        State  Alloc  P
2012-01-01  AK     3      0.5
2012-01-01  AL     4      0.3
…

Each state has a different Alloc and P value. There’s a row for every date and state and my dataframe is over 15,000 rows long.

For each entry, I want a vector that looks like this:

[P, np.zeros(Alloc), 1-P]

I can't figure out how to set a new column like this. I've tried statements like:

df['Test'] = [df['P'], np.zeros(df['Alloc']), 1 - df['P']]

but they don't work.

Does anyone have any ideas?

Thanks ☺

piRSquared · Accepted Answer

Try:

def get_vec(x):
    return [x.P] + np.zeros(x['Alloc']).tolist() + [1 - x.P]

df.apply(get_vec, axis=1)

0         [0.5, 0.0, 0.0, 0.0, 0.5]
1    [0.3, 0.0, 0.0, 0.0, 0.0, 0.7]
dtype: object

df['Test'] = df.apply(get_vec, axis=1)
df
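For a version that runs on its own, here is a minimal sketch using the two sample rows from the question and assuming a reasonably recent pandas. The frame construction and the int() cast are additions for illustration, not part of the original answer; the cast guards against Alloc arriving as a float when a row holds only numeric columns.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question; the real frame has 15,000+ rows.
df = pd.DataFrame({
    "Date": ["2012-01-01", "2012-01-01"],
    "State": ["AK", "AL"],
    "Alloc": [3, 4],
    "P": [0.5, 0.3],
})

def get_vec(row):
    # int() is an added safeguard: np.zeros rejects a float length.
    zeros = np.zeros(int(row["Alloc"])).tolist()
    return [row["P"]] + zeros + [1 - row["P"]]

# Each cell of the new column holds a plain Python list of length Alloc + 2.
df["Test"] = df.apply(get_vec, axis=1)
print(df["Test"].map(len).tolist())
```

The column ends up with dtype object, so later vectorised arithmetic on it is not available; each cell has to be unpacked before the convolution.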


Mike Woodward · Answer

So here's the answer. piRSquared was almost right, but not quite. There are several parts here.

The apply method partially works. It passes a row to the function and you can do a calculation as shown above. The problem is, you get a "ValueError: Shape of passed values is..." error message. The number of columns returned doesn't match the number of columns in the dataframe. My guess is this is because the return value is a list and Pandas isn't interpreting the result correctly.

The workaround is to do the apply on a single column. This single column should contain the P value and Alloc value. Here are the steps:

Create the merged column:

df['temp'] = df[['P','Alloc']].values.tolist()

Write a function:

def array_p(x):
    return [x[0]] + [0]*int(x[1]) + [1 - x[0]]

(int is needed because the merged column holds Alloc as a float after .values.tolist(), and a list can only be repeated an integer number of times. I didn't need np.zeros.)

Apply the function:

df['Array'] = df['temp'].apply(array_p)
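Put together on the two sample values from the question, the three steps can be sketched as below. The frame construction is an assumption for illustration, and df is used throughout as the frame that holds the temp column. The first print shows why the int() cast matters: selecting P and Alloc together upcasts Alloc to float.

```python
import pandas as pd

# Sample values from the question (assumed here for illustration).
df = pd.DataFrame({"Alloc": [3, 4], "P": [0.5, 0.3]})

# Step 1: merge P and Alloc into one column of [P, Alloc] pairs.
# .values on mixed int/float columns yields a float array,
# so each Alloc comes back as a float.
df["temp"] = df[["P", "Alloc"]].values.tolist()
print(df["temp"].tolist())

# Step 2: build [P, 0, ..., 0, 1 - P] from one pair.
def array_p(pair):
    p, alloc = pair
    return [p] + [0] * int(alloc) + [1 - p]

# Step 3: apply the function to the single merged column.
df["Array"] = df["temp"].apply(array_p)
print(df["Array"].map(len).tolist())
```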

This works, but obviously involves more steps than it should. If anyone can provide a better answer, I'd love to hear it.
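One shorter route, offered as a sketch rather than a definitive answer, is to skip apply and the helper column and build the lists with a comprehension over the two columns. The sample frame below is assumed from the question's data, and Vec is a hypothetical column name.

```python
import pandas as pd

# Sample values from the question (assumed here for illustration).
df = pd.DataFrame({"Alloc": [3, 4], "P": [0.5, 0.3]})

# zip walks the two columns in step, so each column keeps its own dtype
# and no temporary merged column is needed. int() is a defensive cast
# in case Alloc is stored as a float.
df["Vec"] = [
    [p] + [0.0] * int(alloc) + [1 - p]
    for p, alloc in zip(df["P"], df["Alloc"])
]
print(df["Vec"].map(len).tolist())
```

This is still a Python-level loop under the hood, so it avoids the extra steps rather than the iteration itself.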
