import pandas as pd

idx_level_0 = pd.date_range('2020-01-01', '2020-04-01', freq='M')
idx_level_1 = pd.date_range('2020-04-01', '2020-07-01', freq='M')
idx_dates = pd.MultiIndex.from_product([idx_level_0, idx_level_1], names=['Event_Date', 'Observation_Date'])
ser_info_dated = pd.Series(range(len(idx_level_0) * len(idx_level_1)), index=idx_dates, name='Some_Values') / 33
list_levels_dates = sorted(set(idx_level_0) | set(idx_level_1))
dict_to_numbers = dict(zip(list_levels_dates, range(len(list_levels_dates))))
df_info_numbered = ser_info_dated.reset_index().replace({'Event_Date': dict_to_numbers, 'Observation_Date': dict_to_numbers})
df_info_downcasted = df_info_numbered.copy()
df_info_downcasted[['Event_Date', 'Observation_Date']] = df_info_downcasted[['Event_Date', 'Observation_Date']].astype('int16')
It seems to be a success:
print('df_info_downcasted column types:\n', df_info_downcasted.dtypes)
prints the following:
df_info_downcasted column types:
Event_Date int16
Observation_Date int16
Some_Values float64
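As an aside, instead of hard-coding 'int16', pd.to_numeric with downcast='integer' picks the smallest integer dtype that can hold the values. A minimal sketch on a toy frame shaped like df_info_numbered:

```python
import pandas as pd

df = pd.DataFrame({'Event_Date': [0, 0, 1, 1, 2, 2],
                   'Observation_Date': [3, 4, 3, 4, 3, 4]})
# downcast='integer' chooses the smallest signed integer dtype that fits;
# for values in 0..5 that is int8, even smaller than int16
df_small = df.apply(pd.to_numeric, downcast='integer')
print(df_small.dtypes)
```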
ser_info_downcasted = df_info_downcasted.set_index(['Event_Date', 'Observation_Date']).squeeze()
print('ser_info_downcasted index level 0 type: ', ser_info_downcasted.index.levels[0].dtype)
print('ser_info_downcasted index level 1 type: ', ser_info_downcasted.index.levels[1].dtype)
ser_info_downcasted index level 0 type: int64
ser_info_downcasted index level 1 type: int64
ser_info_astyped = ser_info_downcasted.copy()
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[0].astype('int16'), level=0)
ser_info_astyped.index = ser_info_astyped.index.set_levels(ser_info_astyped.index.levels[1].astype('int16'), level=1)
print('ser_info_astyped index level 0 type: ', ser_info_astyped.index.levels[0].dtype)
print('ser_info_astyped index level 1 type: ', ser_info_astyped.index.levels[1].dtype)
ser_info_astyped index level 0 type: int64
ser_info_astyped index level 1 type: int64
TL;DR: pandas converts integer indexes to 64-bit values; your best chance to shrink the file is compressed HDF serialization.
Pandas does not seem to support an int16 dtype as an index.
Int64Index is a fundamental basic index in pandas. This is an immutable array implementing an ordered, sliceable set. (source)
This is further reinforced in pandas.Index.astype:
Note that any signed integer dtype is treated as 'int64', and any unsigned integer dtype is treated as 'uint64', regardless of the size. (source)
So essentially our int16 values get cast to int64 when set as an index. I have not worked with HDF5 before, but I tried to see what can be done to minimize the file size.
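The coercion is easy to check on a bare Index (a minimal sketch; note that pandas 2.0 removed Int64Index, so on recent versions the small dtype is actually preserved):

```python
import numpy as np
import pandas as pd

# Build an index from explicitly int16 data
idx = pd.Index(np.array([3, 4, 5], dtype='int16'))
# pandas < 2.0 widens this to an Int64Index; pandas 2.0+ keeps int16
print(idx.dtype)
assert idx.dtype.kind == 'i'  # an integer index either way
```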
Looking at memory allocation
>>> print(ser_info_dated)
... Event_Date Observation_Date
... 2020-01-31 2020-04-30 0.000000
... 2020-05-31 0.030303
... 2020-06-30 0.060606
... 2020-02-29 2020-04-30 0.090909
... 2020-05-31 0.121212
... 2020-06-30 0.151515
... 2020-03-31 2020-04-30 0.181818
... 2020-05-31 0.212121
... 2020-06-30 0.242424
>>> print(ser_info_dated.memory_usage(index=True, deep=True))
... 478 # memory usage in bytes
vs
>>> print(df_info_downcasted)
... Event_Date Observation_Date Some_Values
... 0 0 3 0.000000
... 1 0 4 0.030303
... 2 0 5 0.060606
... 3 1 3 0.090909
... 4 1 4 0.121212
... 5 1 5 0.151515
... 6 2 3 0.181818
... 7 2 4 0.212121
... 8 2 5 0.242424
>>> print(df_info_downcasted.memory_usage(index=True, deep=True))
... Index 128
... Event_Date 18
... Observation_Date 18
... Some_Values 72
... dtype: int64
>>> print(df_info_downcasted.info())
... <class 'pandas.core.frame.DataFrame'>
... RangeIndex: 9 entries, 0 to 8
... Data columns (total 3 columns):
... # Column Non-Null Count Dtype
... --- ------ -------------- -----
... 0 Event_Date 9 non-null int16
... 1 Observation_Date 9 non-null int16
... 2 Some_Values 9 non-null float64
... dtypes: float64(1), int16(2)
... memory usage: 236.0 bytes
We can see that most of the memory is used in the index. When saving as HDF5 it does not seem to matter, though. (I changed the freq to '12H' for testing purposes, to get enough rows for the file sizes to be meaningful.)
>>> ser_info_dated.to_hdf("ser.h5", "ser")
>>> print(f"{os.path.getsize('ser.h5')/1000} kb")
... 412.628 kb
vs
>>> df_info_downcasted.to_hdf("down.h5", "down")
>>> print(f"{os.path.getsize('down.h5')/1000} kb")
... 413.204 kb
I do not know how you save HDF5 files, but in pandas there is a compression argument, complevel, which can be useful.
>>> ser_info_dated.to_hdf("ser.h5", "ser", complevel=9)
>>> print(f"{os.path.getsize('ser.h5')/1000} kb")
... 155.452 kb
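If HDF5/pytables is not an option, pandas can also write compressed pickles out of the box, which gives a comparable size reduction. A sketch with a made-up series (the file names are arbitrary):

```python
import os

import pandas as pd

ser = pd.Series(range(100_000), name='Some_Values') / 33
ser.to_pickle('ser_plain.pkl')                          # uncompressed
ser.to_pickle('ser_small.pkl.gz', compression='gzip')   # gzip-compressed
print(f"{os.path.getsize('ser_plain.pkl') / 1000} kb")
print(f"{os.path.getsize('ser_small.pkl.gz') / 1000} kb")
os.remove('ser_plain.pkl')
os.remove('ser_small.pkl.gz')
```

With the default compression='infer', pandas would also pick gzip from the '.gz' suffix automatically.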