
Efficient way to resize a numpy array or an h5py dataset?

Tags:

python

numpy

h5py

I want to understand the effect of the resize() function on a numpy array versus an h5py dataset. In my application, I read a text file line by line, parse the data, and write it into an HDF5 file. What would be a good way to implement this? Should I add each new row to a numpy array and keep resizing the array (growing the axis), eventually writing the complete array into an h5py dataset? Or should I add each new row directly to the h5py dataset, resizing the dataset in memory as I go? How does resize() affect performance if I keep resizing after each row? Or should I resize after every 100 or 1000 rows?

There can be around 200,000 lines in each dataset.

Any help is appreciated.

Alok asked Oct 28 '25 03:10



1 Answer

I think resize() copies all the data in the array, so it is slow if you call it repeatedly.

If you want to append data to the array continuously, you can create a large array first and use indexing to copy the data into it.
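A minimal sketch of that preallocation idea, assuming you know an upper bound on the number of rows (the question mentions around 200,000). The names parse_line, load_rows, max_rows and n_cols are made up for illustration, and the parser just assumes whitespace-separated floats:

```python
import numpy as np

def parse_line(line):
    # Hypothetical parser: one row of whitespace-separated floats per line.
    return [float(x) for x in line.split()]

def load_rows(lines, max_rows, n_cols):
    # Allocate once, up front, so no per-row resize() is needed.
    out = np.empty((max_rows, n_cols), dtype=np.float64)
    n = 0
    for line in lines:
        out[n] = parse_line(line)  # write the row in place by index
        n += 1
    return out[:n]  # view of the filled part; unused rows are dropped

rows = load_rows(["0 1 2", "3 4 5"], max_rows=200_000, n_cols=3)
print(rows.shape)  # (2, 3)
```

If the input has more than max_rows lines, this raises an IndexError, so pick the bound generously or count the lines first.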

Or you can use the array object from the array module, which is a dynamic array that behaves like a list. After appending all the data to the array object, you can convert it to an ndarray. Here is an example:

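import array
import numpy as np

a = array.array("d")  # growable buffer; typecode "d" is a C double (float64)
a.extend([0, 1, 2])   # append one row of three values
a.extend([3, 4, 5])
# Reinterpret the buffer as a 2-D ndarray without copying the data.
b = np.frombuffer(a, dtype=np.float64).reshape(-1, 3)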
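If the rows end up in an h5py dataset anyway, the same "don't resize per row" idea might look roughly like the sketch below: collect rows in a list and resize the dataset once per batch. The helper write_in_batches, the dataset name "data", the file name and the batch size of 1000 are all assumptions for illustration, not something measured. The file is opened with the in-memory core driver only so the example leaves nothing on disk.

```python
import h5py
import numpy as np

def write_in_batches(h5file, rows, n_cols, batch_size=1000):
    # maxshape=(None, n_cols) makes the first axis resizable.
    dset = h5file.create_dataset(
        "data", shape=(0, n_cols), maxshape=(None, n_cols),
        dtype="f8", chunks=True,
    )
    batch = []

    def flush():
        if not batch:
            return
        start = dset.shape[0]
        dset.resize(start + len(batch), axis=0)  # one resize per batch
        dset[start:] = np.asarray(batch, dtype="f8")
        batch.clear()

    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final, possibly partial, batch
    return dset

# In-memory file: core driver with no backing store.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    dset = write_in_batches(f, ([i, i + 1, i + 2] for i in range(2500)), n_cols=3)
    final_shape = dset.shape
    last_row = dset[2499].tolist()
print(final_shape)  # (2500, 3)
```

With about 200,000 lines, a batch of 1000 means roughly 200 resizes instead of 200,000. Whether 100 or 1000 is the better batch size is something you would have to time on your own data.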
HYRY answered Oct 30 '25 14:10
