I am trying to import a 1.25 GB dataset into Python using dask.array.
The file is a 1312x2500x196 array of uint16 values, which I need to convert to a float32 array for later processing.
I have managed to stitch together this dask array in uint16; however, when I try to convert to float32 I get a memory error.
No matter what I do to the chunk size, I always get a memory error.
I create the array by concatenating it in slabs of 100 lines (breaking the 2500 dimension up into pieces of 100 lines each). Since dask can't natively read .RAW imaging files, I use numpy.memmap() to read the file and then build the dask array from those memmaps.
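For reference, each 100-line slab is read roughly like this (a minimal sketch; the filename argument and the assumption that the file has no header are placeholders):

import numpy as np

SLAB_SHAPE = (1312, 100, 196)                 # 100 lines of the 2500-line axis
SLAB_BYTES = int(np.prod(SLAB_SHAPE)) * 2    # uint16 = 2 bytes per element

def read_slab(filename, i):
    # memory-map the i-th slab of the .RAW file without loading it into RAM
    return np.memmap(filename, dtype=np.uint16, mode="r",
                     shape=SLAB_SHAPE, offset=i * SLAB_BYTES)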
Below I will supply an "as short as possible" code snippet.
I have tried two methods:
1) Create the full uint16 array and then try to convert to float32:
(note: each Memmap is a 1312x100x196 array, and i ranges from 0 to 24, one iteration per 100-line slab)
# OldArray starts as the first slab; Memmap holds the current 1312x100x196 slab
for i in range(lines):
    NewArray = da.concatenate([OldArray, Memmap], axis=1)  # stack along the 2500-line axis
    OldArray = NewArray
return NewArray
and then I use (FinalArray being the NewArray returned above):

Float32Array = FinalArray.map_blocks(lambda block: block * 1., dtype=np.float32)
2) Convert each memmap slab to float32 before concatenating:
for i in range(lines):
    # np.float32(Memmap) eagerly copies the whole slab into RAM as a float32 array
    NewArray = da.concatenate([OldArray, np.float32(Memmap)], axis=1)
    OldArray = NewArray
return NewArray
Both methods result in a memory error.
Is there any reason for this?
I read that dask.array is capable of handling computations on datasets of up to 100 GB.
I tried every chunk size, from as small as 10x10x10 up to a single line.
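For scale (a quick check from the shapes above):

>>> 1312 * 2500 * 196 * 2 / 1e9   # full uint16 array, in GB
1.28576
>>> 1312 * 2500 * 196 * 4 / 1e9   # full float32 array, in GB
2.57152

So even though the uint16 file is about 1.25 GB, the float32 version is roughly 2.6 GB.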
You can create a dask.array from a NumPy memmap array directly with the da.from_array function:

import dask.array as da

x = load_memmap_numpy_array_from_raw_file(filename)
d = da.from_array(x, chunks=...)
You can change the dtype with the astype method:

d = d.astype(np.float32)
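Putting this together for your file, here is a minimal sketch; the filename "data.raw", the absence of a file header, and the 25 back-to-back slabs are my assumptions:

import numpy as np
import dask.array as da

SLAB_SHAPE = (1312, 100, 196)                 # one 100-line slab
SLAB_BYTES = int(np.prod(SLAB_SHAPE)) * 2    # uint16 = 2 bytes per element

# memory-map each slab (no data is read yet) and wrap it as a lazy dask array
slabs = [
    da.from_array(
        np.memmap("data.raw", dtype=np.uint16, mode="r",
                  shape=SLAB_SHAPE, offset=i * SLAB_BYTES),
        chunks=SLAB_SHAPE,
    )
    for i in range(25)
]

d = da.concatenate(slabs, axis=1)   # lazy 1312 x 2500 x 196 uint16 array
d = d.astype(np.float32)            # still lazy: nothing is read or cast yet

# a reduction streams through the chunks one at a time
print(d.mean().compute())

Only call .compute() on the full float32 array if you have roughly 2.6 GB of RAM free; otherwise keep the computation lazy or stream the result to disk (for example with da.to_hdf5).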