I have a large time-series dataframe. I would like to write a function that will arbitrarily split this large dataframe into N contiguous subperiods as new dataframes so that analysis may easily be done on each smaller dataframe.
I have this line of code that splits the large dataframe into even subperiods. I need a function that will output these split dataframes.
np.array_split(df, n) #n = arbitrary amount of new dataframes
I would like each new dataframe to be labeled 1, 2, 3, 4, etc. for the subperiod it represents. The function should therefore return N dataframes, labeled according to their temporal order within the initial large dataframe.
df before the function is applied
1 43.91 -0.041619
2 43.39 0.011913
3 45.56 -0.048801
4 45.43 0.002857
5 45.33 0.002204
6 45.68 -0.007692
7 46.37 -0.014992
8 48.04 -0.035381
9 48.38 -0.007053
3 new df's after the split function is applied
df1
1 43.91 -0.041619
2 43.39 0.011913
3 45.56 -0.048801
df2
4 45.43 0.002857
5 45.33 0.002204
6 45.68 -0.007692
df3
7 46.37 -0.014992
8 48.04 -0.035381
9 48.38 -0.007053
Please let me know if clarification is needed for any aspects. Thanks for the time!
I don't know from your description if you are aware that np.array_split outputs n objects. If it's only a few objects you could manually assign them, for example:
df1, df2, df3 = np.array_split(df, 3)
This would assign each subarray to these variables in order. Otherwise you could assign the whole list of subarrays to a single variable:
split_df = np.array_split(df, 3)
len(split_df)
# 3
then loop over this one variable and do your analysis per subarray. I would personally choose the latter.
for chunk in split_df:
    print(type(chunk))
This prints <class 'pandas.core.frame.DataFrame'> three times.
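To make the "loop over one variable" approach concrete, here is a minimal sketch rather than a definitive implementation. The helper name split_contiguous and the column names "a" and "b" are assumptions for illustration. It splits the row positions and slices with iloc instead of passing the DataFrame to np.array_split directly, since recent pandas releases may emit a deprecation warning for the direct call.

```python
import numpy as np
import pandas as pd

def split_contiguous(df, n):
    """Return a list of n contiguous, row-wise pieces of df (hypothetical helper)."""
    # Split the row positions, then slice with iloc so each piece stays a DataFrame.
    position_chunks = np.array_split(np.arange(len(df)), n)
    return [df.iloc[positions] for positions in position_chunks]

# Sample data from the question; the column names "a" and "b" are assumed.
df = pd.DataFrame(
    {
        "a": [43.91, 43.39, 45.56, 45.43, 45.33, 45.68, 46.37, 48.04, 48.38],
        "b": [-0.041619, 0.011913, -0.048801, 0.002857, 0.002204,
              -0.007692, -0.014992, -0.035381, -0.007053],
    },
    index=range(1, 10),
)

chunks = split_contiguous(df, 3)

# enumerate(..., start=1) supplies the 1, 2, 3, ... subperiod labels.
summary = {label: chunk["a"].mean() for label, chunk in enumerate(chunks, start=1)}
print(summary)
```

The labels come from the position in the list, so no separately named variables are needed; the mean of column "a" is only a stand-in for whatever per-subperiod analysis you have in mind.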
Use:
print (df)
a b
1 43.91 -0.041619
2 43.39 0.011913
3 45.56 -0.048801
4 45.43 0.002857
5 45.33 0.002204
6 45.68 -0.007692
7 46.37 -0.014992
8 48.04 -0.035381
9 48.38 -0.007053
import numpy as np
def split(df, n_chunks=30):
    # n_chunks is the number of pieces to return, not the number of rows per piece
    return np.array_split(df, n_chunks)
Creating separately named variables (df1, df2, ...) is possible, but not recommended:
for i, g in enumerate(split(df, 3), 1):
    globals()['df{}'.format(i)] = g
print (df1)
a b
1 43.91 -0.041619
2 43.39 0.011913
3 45.56 -0.048801
It is better to select each DataFrame by indexing:
dfs = split(df, 3)
print (dfs[0])
a b
1 43.91 -0.041619
2 43.39 0.011913
3 45.56 -0.048801
It is also possible to create a dictionary, but in my opinion that is overcomplicated:
def split1(df, n_chunks=30):
    # keys are 'df_1', 'df_2', ... in the original row order
    return {'df_{}'.format(i): g
            for i, g in enumerate(np.array_split(df, n_chunks), 1)}
dfs = split1(df, 3)
print (dfs)
{'df_1': a b
1 43.91 -0.041619
2 43.39 0.011913
3 45.56 -0.048801, 'df_2': a b
4 45.43 0.002857
5 45.33 0.002204
6 45.68 -0.007692, 'df_3': a b
7 46.37 -0.014992
8 48.04 -0.035381
9 48.38 -0.007053}
print (dfs['df_1'])
a b
1 43.91 -0.041619
2 43.39 0.011913
3 45.56 -0.048801
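If a dictionary is used anyway, plain integer keys match the 1, 2, 3, ... labels asked for in the question. The following is a rough sketch under assumptions: split_labeled is a hypothetical helper name and the ten-row demo frame is made up, chosen only to show what happens when the rows do not divide evenly.

```python
import numpy as np
import pandas as pd

def split_labeled(df, n_chunks):
    """Map subperiod number (1..n_chunks) to its contiguous slice of df."""
    # Split row positions so the result does not depend on how NumPy treats DataFrames.
    position_chunks = np.array_split(np.arange(len(df)), n_chunks)
    return {label: df.iloc[positions]
            for label, positions in enumerate(position_chunks, start=1)}

# Ten made-up rows, so a split into 3 cannot be even.
demo = pd.DataFrame({"a": np.arange(10.0)}, index=range(1, 11))

labeled = split_labeled(demo, 3)
sizes = {label: len(chunk) for label, chunk in labeled.items()}
print(sizes)  # {1: 4, 2: 3, 3: 3}
```

np.array_split gives the extra rows to the earliest pieces, so subperiod 1 gets four rows here and the others get three. labeled[2] then returns the second subperiod directly.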