Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python h5py: can I store a dataset which different columns have different types?

Tags:

python

hdf5

h5py

Suppose I have a table which has many columns, only a few columns is float type, others are small integers, for example:

col1, col2, col3, col4
1.31   1      2     3
2.33   3      5     4
...

How can I store this effectively, suppose I use np.float32 for this dataset, the storage is wasted, because other columns only have a small integer, they don't need so much space. If I use np.int16, the float column is not exact, which also what I wanted. Therefore how do I deal with the situation like this?

Suppose I also have a string column, which make me more confused, how should I store the data?

col1, col2, col3, col4, col5
1.31   1      2     3    "a"
2.33   3      5     4    "b"
...

Edit:

To make things simpler, lets suppose the string column has fix length strings only, for example, length of 3.

like image 704
an offer can't refuse Avatar asked Oct 19 '25 04:10

an offer can't refuse


1 Answers

I'm going to demonstrate the structured array approach:

I'm guessing you are starting with a csv file 'table'. If not it's still the easiest way to turn your sample into an array:

In [40]: txt = '''col1, col2, col3, col4, col5
    ...: 1.31   1      2     3    "a"
    ...: 2.33   3      5     4    "b"
    ...: '''


In [42]: data = np.genfromtxt(txt.splitlines(), names=True, dtype=None, encoding=None)

In [43]: data
Out[43]: 
array([(1.31, 1, 2, 3, '"a"'), (2.33, 3, 5, 4, '"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])

With these parameters, genfromtxt takes care of creating a structured array. Note it is a 1d array with 5 fields. Fields dtype are determined from the data.

In [44]: import h5py
...

In [46]: f = h5py.File('struct.h5', 'w')

In [48]: ds = f.create_dataset('data',data=data)
...
TypeError: No conversion path for dtype: dtype('<U3')

But h5py has problems saving the unicode strings (default for py3). There may be ways around that, but here it will be simpler to convert the string dtype to bytestrings. Besides, that'll be more compact.

To convert that, I'll make a new dtype, and use astype. Alternatively I could specify the dtypes in the genfromtxt call.

In [49]: data.dtype
Out[49]: dtype([('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', '<U3')])

In [50]: data.dtype.descr
Out[50]: 
[('col1', '<f8'),
 ('col2', '<i8'),
 ('col3', '<i8'),
 ('col4', '<i8'),
 ('col5', '<U3')]

In [51]: dt1 = data.dtype.descr

In [52]: dt1[-1] = ('col5', 'S3')

In [53]: data.astype(dt1)
Out[53]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])

Now it saves the array without problem:

In [54]: data1 = data.astype(dt1)

In [55]: data1
Out[55]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])

In [56]: ds = f.create_dataset('data',data=data1)

In [57]: ds
Out[57]: <HDF5 dataset "data": shape (2,), type "|V35">

In [58]: ds[:]
Out[58]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i8'), ('col3', '<i8'), ('col4', '<i8'), ('col5', 'S3')])

I could make further modifications, shortening one or more of the int fields:

In [60]: dt1[1] = ('col2','i2')    
In [61]: dt1[2] = ('col3','i2')

In [62]: dt1
Out[62]: 
[('col1', '<f8'),
 ('col2', 'i2'),
 ('col3', 'i2'),
 ('col4', '<i8'),
 ('col5', 'S3')]

In [63]: data1 = data.astype(dt1)

In [64]: data1
Out[64]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])

In [65]: ds1 = f.create_dataset('data1',data=data1)

ds1 has a more compact storage, 'V23' vs 'V35'

In [67]: ds1
Out[67]: <HDF5 dataset "data1": shape (2,), type "|V23">

In [68]: ds1[:]
Out[68]: 
array([(1.31, 1, 2, 3, b'"a"'), (2.33, 3, 5, 4, b'"b"')],
      dtype=[('col1', '<f8'), ('col2', '<i2'), ('col3', '<i2'), ('col4', '<i8'), ('col5', 'S3')])
like image 133
hpaulj Avatar answered Oct 20 '25 17:10

hpaulj



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!