I'm reading in a large (~10 GB) HDF5 table with pandas.read_hdf, using iterator=True so that I can read it in chunks (e.g., chunksize=100000 rows at a time).
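For reference, here's roughly how I'm opening the file (the path and key are placeholders):
import pandas as pd

# iterator=True plus chunksize makes read_hdf return an iterator over DataFrame chunks
data = pd.read_hdf('/path/to/file.h5', key='table_key', iterator=True, chunksize=100000)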
How do I get a list of all the column names or 'keys'?
Also, how come there is no get_chunk method analogous to the one on the reader returned by pandas.read_table? Is directly iterating over the chunks ("for chunk in data:") the only way, with no way to access an arbitrary numbered chunk at will ("data[300]")?
Edit:
Looks like I can access the column names with a loop that breaks after accessing the first chunk:
for i, v in enumerate(data):
    if i != 0:
        break
    colnames = v.columns
But my second question still remains: is there no way to access an individual chunk directly on the iterator returned by read_hdf (e.g., mimicking the get_chunk method of read_table, or with a dict-like lookup, data[0]), instead of the weird single-iteration for loop above?
Have you tried loading the HDF5 file as an HDFStore? That would allow you to use the HDFStore.select method, which may do what you want (seeking to a row range, etc.). You can also use select to operate on only a subset of columns (see the sketch at the end). To me it just looks like it provides more flexibility than the read_hdf function. The following might help, as long as you know the structure of your HDF5 file:
import pandas as pd

store = pd.HDFStore('/path/to/file', 'r')

# column names: read a single row and inspect its columns
colnames = store.select('table_key', stop=1).columns

# iterate over table chunks
chunksize = 100000
chunks = store.select('table_key', chunksize=chunksize)
for chunk in chunks:
    ...  # process each chunk (a DataFrame) here

# select 1 specific chunk as an iterator over that row range
chunksize = 100000
start, stop = 300 * chunksize, 301 * chunksize
this_chunk = store.select('table_key', start=start, stop=stop, iterator=True)
do_work(this_chunk)

store.close()
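If you really want random access to a numbered chunk (like data[300] in the question), you can wrap the start/stop arithmetic in a small helper. A minimal sketch, where read_chunk and the chunk size are my own placeholders:
import pandas as pd

def read_chunk(store, key, chunk_index, chunksize=100000):
    # hypothetical helper: chunk N covers rows [N * chunksize, (N + 1) * chunksize)
    start = chunk_index * chunksize
    return store.select(key, start=start, stop=start + chunksize)

with pd.HDFStore('/path/to/file', 'r') as store:
    chunk_300 = read_chunk(store, 'table_key', 300)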
Note that you can also open an HDFStore as a context manager, e.g.,
with pd.HDFStore('/path/to/file', 'r') as store:
    ...  # the store is closed automatically when the block exits
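And, as mentioned above, select can restrict the read to a subset of columns (assuming the data was written in table format). Extending the snippet above, with col_a and col_b standing in for real column names:
with pd.HDFStore('/path/to/file', 'r') as store:
    # columns= requires the data to be stored in table (not fixed) format
    subset = store.select('table_key', columns=['col_a', 'col_b'], start=0, stop=100000)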