I am generating a large dataset for a machine learning application: a numpy array with shape (N, X, Y), where N is the number of samples, X is the input of a sample, and Y is the target of a sample. I want to save this array in the .npy format. N is very large, so the final dataset is 10+ GB. This means I cannot create the whole dataset in memory and then save it, as that would exhaust my RAM.
Is it possible instead to write batches of n samples to this file iteratively? For example, I would like to append batches of 256 samples (shape (256, X, Y)) to the file, one batch at a time.
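Conceptually, the pattern I am after looks like this (generate_batch is just a stand-in for my real data generation code):

import numpy as np

def generate_batch(n, X=3, Y=2):
    # stand-in for the real sample generator
    return np.random.random((n, X, Y))

for _ in range(100):
    batch = generate_batch(256)  # shape (256, X, Y)
    # append `batch` to 'dataset.npy' here -- but how?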
Here is a solution based on numpy's implementation of save that writes a standard .npy file, including shape and dtype information:
import numpy as np
import numpy.lib as npl

a = np.random.random((30, 3, 2))
a1 = a[:10]
a2 = a[10:]
filename = 'out.npy'

with open(filename, 'wb+') as f:
    # Write a header describing only the first batch; it serves as a
    # placeholder and is overwritten with the final shape below.
    header = npl.format.header_data_from_array_1_0(a1)
    npl.format.write_array_header_1_0(f, header)
    # Append the raw bytes of each batch after the header.
    a1.tofile(f)
    a2.tofile(f)
    # Seek back to the start of the file and rewrite the header with
    # the total number of samples.
    f.seek(0)
    header['shape'] = (len(a1) + len(a2), *header['shape'][1:])
    npl.format.write_array_header_1_0(f, header)

assert (np.load(filename) == a).all()
This works for C-contiguous arrays that do not contain Python objects. Note that rewriting the header in place relies on the final header occupying the same number of bytes as the placeholder; numpy pads .npy headers (to a multiple of 64 bytes in recent versions), so this holds as long as the longer final shape string does not push the header across a padding boundary.
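Building on the same idea, here is a minimal sketch of an incremental writer for the batched use case from the question; generate_batch, the batch count, and the sample sizes are illustrative assumptions, not part of the original code:

import numpy as np
import numpy.lib as npl

def generate_batch(n, X=3, Y=2):
    # placeholder for the real sample generator
    return np.random.random((n, X, Y))

filename = 'dataset.npy'
num_batches, batch_size = 100, 256

with open(filename, 'wb+') as f:
    total = 0
    header = None
    for _ in range(num_batches):
        # the technique assumes C-contiguous data, so enforce it
        batch = np.ascontiguousarray(generate_batch(batch_size))
        if header is None:
            # write a placeholder header sized from the first batch
            header = npl.format.header_data_from_array_1_0(batch)
            npl.format.write_array_header_1_0(f, header)
        batch.tofile(f)
        total += len(batch)
    # patch the header with the final sample count
    f.seek(0)
    header['shape'] = (total, *header['shape'][1:])
    npl.format.write_array_header_1_0(f, header)

assert np.load(filename, mmap_mode='r').shape == (num_batches * batch_size, 3, 2)

The first batch fixes the dtype and the per-sample shape; every later batch must match them. Loading with mmap_mode='r' lets you inspect the result without pulling the whole file into memory.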