
Writing variable sized arrays to Pandas cells

Tags: python, pandas

I have a large data set and I want to do a convolution calculation using multiple rows that match a criterion. I need to calculate a vector for each row first, and I thought it would be more efficient to store each vector in a dataframe column so I could avoid a for loop when I do the convolution. The trouble is that the vectors are variable length and I can't figure out how to store them.

Here's a summary of my data:

Date        State  Alloc P
2012-01-01  AK     3     0.5
2012-01-01  AL     4     0.3
…

Each state has a different Alloc and P value. There’s a row for every date and state and my dataframe is over 15,000 rows long.

For each entry, I want a vector that looks like this:

[P, np.zeros(Alloc), 1-P]

I can't figure out how to set a new column like this. I've tried statements like:

df['Test'] = [df['P'], np.zeros(df['Alloc']), 1 - df['P']]

but they don't work.

Does anyone have any ideas?

Thanks ☺

asked Dec 07 '25 03:12

Mike Woodward


2 Answers

Try:

import numpy as np

def get_vec(x):
    # int() guards against Alloc arriving as a float
    return [x['P']] + np.zeros(int(x['Alloc'])).tolist() + [1 - x['P']]

df.apply(get_vec, axis=1)

0         [0.5, 0.0, 0.0, 0.0, 0.5]
1    [0.3, 0.0, 0.0, 0.0, 0.0, 0.7]
dtype: object

df['Test'] = df.apply(get_vec, axis=1)
df
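
As a quick check, here is a minimal self-contained sketch of the same idea, assuming a frame built from only the two sample rows in the question. The int() cast and the length check are additions for illustration, not part of the original snippet.

```python
import numpy as np
import pandas as pd

# Only the two sample rows from the question; the real frame is much larger.
df = pd.DataFrame({
    "Date": ["2012-01-01", "2012-01-01"],
    "State": ["AK", "AL"],
    "Alloc": [3, 4],
    "P": [0.5, 0.3],
})

def get_vec(row):
    # int() is an added safeguard in case Alloc arrives as a float.
    return [row["P"]] + np.zeros(int(row["Alloc"])).tolist() + [1 - row["P"]]

# Each cell of the new column holds one Python list.
df["Test"] = df.apply(get_vec, axis=1)

# Every vector should have Alloc + 2 entries: P, the zeros, and 1 - P.
lengths = df["Test"].map(len)
print(lengths.tolist())
```

The resulting column has dtype object, so each cell can hold a list of a different length.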


answered Dec 08 '25 17:12

piRSquared


So here's the answer. piRSquared was almost right, but not quite. There are several parts here.

The apply method partially works. It passes a row to the function and you can do a calculation as shown above. The problem is that I get a "ValueError: Shape of passed values is..." error message: the number of columns returned doesn't match the number of columns in the dataframe. My guess is that this happens because the return value is a list and Pandas isn't interpreting the result as a single cell value.

The workaround is to do the apply on a single column. This single column should contain the P value and Alloc value. Here are the steps:

Create the merged column:

df['temp'] = df[['P','Alloc']].values.tolist()

Write a function:

def array_p(x):
    return [x[0]] + [0]*int(x[1]) + [1 - x[0]]

(int is needed because merging the two columns with .values converts Alloc to a float. I didn't need np.zeros.)

Apply the function:

df['Array'] = df['temp'].apply(array_p)

This works, but obviously involves more steps than it should. If anyone can provide a better answer, I'd love to hear it.
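
One possibly shorter route, offered only as a sketch: build the lists with a comprehension over the two columns, which skips the temporary merged column. The two-row frame and the name vectors are assumptions for illustration.

```python
import pandas as pd

# Two sample rows only; column names follow the question.
df = pd.DataFrame({"Alloc": [3, 4], "P": [0.5, 0.3]})

# Pair up P and Alloc row by row and build each variable-length list directly.
vectors = [
    [float(p)] + [0] * int(alloc) + [1 - float(p)]
    for p, alloc in zip(df["P"], df["Alloc"])
]

# Wrapping in a Series aligned to df.index stores one list per cell.
df["Array"] = pd.Series(vectors, index=df.index)
print(df["Array"].tolist())
```

This still loops in Python under the hood, so it is unlikely to be faster than apply by a large margin, but it avoids the extra column.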

answered Dec 08 '25 16:12

Mike Woodward


