I'm reading in a large (~10 GB) HDF5 table with pandas.read_hdf, using iterator=True so that I can read it in chunks (e.g., chunksize=100000 rows at a time).
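For reference, here's roughly how I'm opening the file (the path and key are placeholders):
import pandas as pd

# iterator=True plus chunksize makes read_hdf return an iterator over DataFrame chunks
data = pd.read_hdf('/path/to/file.h5', key='table_key', iterator=True, chunksize=100000)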
How do I get a list of all the column names or 'keys'?
Also, how come there is no get_chunk method analogous to the one on the reader returned by pandas.read_table? Is directly iterating over the chunks ("for chunk in data:") the only way, with no way to access an arbitrary numbered chunk at will ("data[300]")?
Edit:
Looks like I can access the column names with a loop that breaks after accessing the first chunk:
for i, v in enumerate(data):
    if i != 0:
        break
    colnames = v.columns
But my second question still remains: is there no way to access an individual chunk directly on the iterator returned by read_hdf (e.g., mimicking the get_chunk method of read_table, or with a dict-like lookup, data[0]), instead of the weird single-iteration for loop above?
Have you tried loading the HDF5 file as an HDFStore? That would allow you to use the HDFStore.select method, which may do what you want (seeking to a row range, etc.). You can also use select to operate on only a subset of columns (see the sketch at the end). To me it just looks like it provides more flexibility than the read_hdf function. The following might help, as long as you know the structure of your HDF5 file:
import pandas as pd

store = pd.HDFStore('/path/to/file', 'r')

# column names: read a single row and inspect its columns
colnames = store.select('table_key', stop=1).columns

# iterate over table chunks
chunksize = 100000
chunks = store.select('table_key', chunksize=chunksize)
for chunk in chunks:
    ...  # process each chunk (a DataFrame) here

# select 1 specific chunk as an iterator over that row range
chunksize = 100000
start, stop = 300 * chunksize, 301 * chunksize
this_chunk = store.select('table_key', start=start, stop=stop, iterator=True)
do_work(this_chunk)

store.close()
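If you really want random access to a numbered chunk (like data[300] in the question), you can wrap the start/stop arithmetic in a small helper. A minimal sketch, where read_chunk and the chunk size are my own placeholders:
import pandas as pd

def read_chunk(store, key, chunk_index, chunksize=100000):
    # hypothetical helper: chunk N covers rows [N * chunksize, (N + 1) * chunksize)
    start = chunk_index * chunksize
    return store.select(key, start=start, stop=start + chunksize)

with pd.HDFStore('/path/to/file', 'r') as store:
    chunk_300 = read_chunk(store, 'table_key', 300)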
Note that you can also open an HDFStore as a context manager, e.g.,
with pd.HDFStore('/path/to/file', 'r') as store:
    ...  # the store is closed automatically when the block exits
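And, as mentioned above, select can restrict the read to a subset of columns (assuming the data was written in table format). Extending the snippet above, with col_a and col_b standing in for real column names:
with pd.HDFStore('/path/to/file', 'r') as store:
    # columns= requires the data to be stored in table (not fixed) format
    subset = store.select('table_key', columns=['col_a', 'col_b'], start=0, stop=100000)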