Best way to generate new rows based on existing rows in pandas from a big file

I have a CSV with around 8 million rows, something like this:

a b c
0 2 3

and I want to generate new rows from it based on the second and third values, so I get:

a b c
0 2 3
0 3 3
0 4 3
0 5 3

which is basically just iterating through every row (in this example, one row) and creating new rows with a value of b+i, where i ranges from 0 up to the value of c, including c itself. The c column is irrelevant after the rows have been generated. The problem is that the file has millions of rows, and expanding them can generate many more, so how can I do this efficiently? (Loops are too slow for that amount of data.) Thanks.
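For reference, a minimal sketch of the loop-based approach I am trying to avoid, assuming a small hypothetical `df` built from the example above:

```python
import pandas as pd

# hypothetical tiny frame reproducing the example above
df = pd.DataFrame({"a": [0], "b": [2], "c": [3]})

# the straightforward row-by-row expansion (far too slow at 8 million rows)
rows = []
for _, row in df.iterrows():
    for i in range(row["c"] + 1):  # i = 0 .. c inclusive
        rows.append({"a": row["a"], "b": row["b"] + i, "c": row["c"]})

expanded = pd.DataFrame(rows)
print(expanded)
#    a  b  c
# 0  0  2  3
# 1  0  3  3
# 2  0  4  3
# 3  0  5  3
```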

secret asked Dec 20 '25 20:12
1 Answer

You can reindex on the repeated index:

# repeat each row c+1 times (one copy per offset 0, 1, ..., c)
out = df.loc[df.index.repeat(df['c']+1)]
# add the per-row offset 0..c to column b
out['b'] += out.groupby(level=0).cumcount()
print(out)

Output (reset index if you want):

   a  b  c
0  0  2  3
0  0  3  3
0  0  4  3
0  0  5  3
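Since you mentioned c is irrelevant after the expansion, a short follow-up (assuming the `out` frame from above) to drop it and get a fresh index:

```python
out = out.drop(columns="c").reset_index(drop=True)
```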

Note that since you blow your data up by the c column and you already have 8 million rows, the resulting dataframe may be too big to fit in memory on its own.
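If memory is a concern, one way around it is to read and expand the CSV in chunks, writing each expanded chunk out as you go. A sketch, assuming hypothetical file names "input.csv" and "expanded.csv" and columns a, b, c:

```python
import pandas as pd

# process the input in chunks so the expanded rows never all live in memory at once
for i, chunk in enumerate(pd.read_csv("input.csv", chunksize=1_000_000)):
    out = chunk.loc[chunk.index.repeat(chunk["c"] + 1)].copy()
    out["b"] += out.groupby(level=0).cumcount()
    out = out.drop(columns="c")
    # write the header only for the first chunk, then append
    out.to_csv("expanded.csv", mode="w" if i == 0 else "a",
               header=(i == 0), index=False)
```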

Quang Hoang answered Dec 23 '25 09:12


