Split large Dataframe into smaller equal dataframes

I have a large time-series dataframe. I would like to write a function that will arbitrarily split this large dataframe into N contiguous subperiods as new dataframes so that analysis may easily be done on each smaller dataframe.

I have this line of code that splits the large dataframe into even subperiods. I need a function that will output these split dataframes.

np.array_split(df, n) #n = arbitrary amount of new dataframes

I would like each new dataframe to be labeled 1, 2, 3, 4, etc. for the subperiod it represents, so the function should return N dataframes labeled according to their temporal order in the initial large dataframe.

df before the function applied
 1    43.91 -0.041619
 2    43.39  0.011913
 3    45.56 -0.048801
 4    45.43  0.002857
 5    45.33  0.002204
 6    45.68 -0.007692
 7    46.37 -0.014992
 8    48.04 -0.035381
 9    48.38 -0.007053

3 new df's after function split applied 
df1
 1    43.91 -0.041619
 2    43.39  0.011913
 3    45.56 -0.048801
df2
 4    45.43  0.002857
 5    45.33  0.002204
 6    45.68 -0.007692
df3
 7    46.37 -0.014992
 8    48.04 -0.035381
 9    48.38 -0.007053

Please let me know if clarification is needed for any aspects. Thanks for the time!

hkml asked Oct 12 '25 14:10


2 Answers

I can't tell from your description whether you are aware that np.array_split already returns a list of n objects. If you only need a few objects, you could assign them manually, for example:

df1, df2, df3 = np.array_split(df, 3)

This assigns each piece to these variables in order. Otherwise, you could assign the whole list of pieces to a single variable:

split_df = np.array_split(df, 3)
len(split_df)
# 3

Then loop over this one variable and do your analysis per piece. I would personally choose the latter.

for sub_df in split_df:
    print(type(sub_df))

This prints <class 'pandas.core.frame.DataFrame'> three times.
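
As a minimal sketch of the "one variable, then loop" approach with the 1, 2, 3 labels the question asks for: the helper name split_contiguous and the column names a and b are assumptions made for illustration. It splits the row positions and slices with iloc, which sidesteps any dependence on how NumPy treats DataFrame inputs across versions.

```python
import numpy as np
import pandas as pd


def split_contiguous(df, n):
    """Return a list of n contiguous row-slices of df, in original order.

    Hypothetical helper: it splits the row positions with np.array_split
    and then selects those rows by position with iloc.
    """
    position_chunks = np.array_split(np.arange(len(df)), n)
    return [df.iloc[chunk] for chunk in position_chunks]


# Sample data from the question; the column names are assumed.
df = pd.DataFrame(
    {
        "a": [43.91, 43.39, 45.56, 45.43, 45.33, 45.68, 46.37, 48.04, 48.38],
        "b": [-0.041619, 0.011913, -0.048801, 0.002857, 0.002204,
              -0.007692, -0.014992, -0.035381, -0.007053],
    },
    index=range(1, 10),
)

sub_dfs = split_contiguous(df, 3)

# enumerate(..., start=1) supplies the 1-based subperiod label.
for label, sub_df in enumerate(sub_dfs, start=1):
    print(label, len(sub_df), sub_df["a"].mean())
```

The label lives only in the loop variable here, so no extra named variables are needed.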

Ronny Efronny answered Oct 16 '25 05:10


Use:

print (df)
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
4  45.43  0.002857
5  45.33  0.002204
6  45.68 -0.007692
7  46.37 -0.014992
8  48.04 -0.035381
9  48.38 -0.007053


import numpy as np

def split(df, n_chunks=30):
    # n_chunks is the number of pieces, not the number of rows per piece
    return np.array_split(df, n_chunks)

Creating separate variables dynamically is possible, but not recommended:

for i, g in enumerate(split(df, 3), 1):
    globals()['df{}'.format(i)] =  g
print (df1)
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801

It is better to select each DataFrame by indexing into the returned list:

dfs = split(df, 3)
print (dfs[0])
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
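
When the number of rows is not divisible by the number of pieces, the pieces cannot all be equal. The sketch below shows how np.array_split distributes the remainder: the first len % n pieces get one extra row. The helper name chunk_sizes is made up for illustration.

```python
import numpy as np


def chunk_sizes(n_rows, n_chunks):
    """Sizes of the pieces np.array_split produces for n_rows rows.

    Hypothetical helper, used only to inspect the split sizes.
    """
    return [len(c) for c in np.array_split(np.arange(n_rows), n_chunks)]


print(chunk_sizes(9, 3))   # [3, 3, 3]
print(chunk_sizes(9, 4))   # [3, 2, 2, 2]
print(chunk_sizes(10, 3))  # [4, 3, 3]
```

So dfs[0] may be one row longer than later pieces, which matters if the analysis assumes equal-length subperiods.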

It is also possible to build a dictionary, but in my opinion that is overcomplicated:

def split1(df, n_chunks=30):
    # keys are 'df_1', 'df_2', ...; n_chunks is the number of pieces
    return {'df_{}'.format(i): g
            for i, g in enumerate(np.array_split(df, n_chunks), 1)}

dfs = split1(df, 3)
print (dfs)
{'df_1':        a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801, 'df_2':        a         b
4  45.43  0.002857
5  45.33  0.002204
6  45.68 -0.007692, 'df_3':        a         b
7  46.37 -0.014992
8  48.04 -0.035381
9  48.38 -0.007053}

print (dfs['df_1'])
       a         b
1  43.91 -0.041619
2  43.39  0.011913
3  45.56 -0.048801
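
If the goal is only to run the same analysis on every subperiod, a different technique from the ones above may be simpler: keep one DataFrame, add a label column, and use groupby. This is a sketch under assumptions; the helper add_subperiod_labels and the column name subperiod are hypothetical, and the labels follow the same piece sizes np.array_split would produce.

```python
import numpy as np
import pandas as pd


def add_subperiod_labels(df, n, column="subperiod"):
    """Return a copy of df with a 1-based contiguous subperiod label per row.

    Hypothetical helper: piece sizes match np.array_split on the row positions.
    """
    sizes = [len(c) for c in np.array_split(np.arange(len(df)), n)]
    labels = np.repeat(np.arange(1, n + 1), sizes)
    return df.assign(**{column: labels})


df = pd.DataFrame(
    {
        "a": [43.91, 43.39, 45.56, 45.43, 45.33, 45.68, 46.37, 48.04, 48.38],
        "b": [-0.041619, 0.011913, -0.048801, 0.002857, 0.002204,
              -0.007692, -0.014992, -0.035381, -0.007053],
    },
    index=range(1, 10),
)

labeled = add_subperiod_labels(df, 3)

# One aggregation call covers every subperiod.
summary = labeled.groupby("subperiod")["a"].agg(["mean", "min", "max"])
print(summary)
```

A single subperiod can still be pulled out when needed with labeled[labeled["subperiod"] == 2].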

jezrael answered Oct 16 '25 06:10


