
Reading custom file format to Dask dataframe

I have a huge text file in a custom format (too large to load into a single pandas DataFrame) that I want to read into a Dask DataFrame. I wrote a generator that reads and parses the data in chunks, yielding pandas DataFrames. I want to load these pandas DataFrames into a Dask DataFrame and perform operations on the result (creating calculated columns, extracting parts of the DataFrame, plotting, etc.). I tried using Dask Bag but couldn't get it to work, so I decided to write the chunks into an HDFStore and then use Dask to read from the HDF5 file. This worked well when I was doing it all on my own computer. Code below.

cc = read_custom("demo.xyz", chunks=1000) # Generator of pandas dataframes
from pandas import HDFStore
s = HDFStore("demo.h5")
for c in cc:
    s.append("data", c, format='t', append=True)
s.close()

import dask.dataframe as dd
ddf = dd.read_hdf("demo.h5", "data", chunksize=100000)
seqv = (
    (
        (ddf.sxx - ddf.syy) ** 2
        + (ddf.syy - ddf.szz) ** 2
        + (ddf.szz - ddf.sxx) ** 2
        + 6 * (ddf.sxy ** 2 + ddf.syz ** 2 + ddf.sxz ** 2)
    )
    / 2
) ** 0.5
seqv.compute()
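For reference, the expression above is the von Mises equivalent stress computed from the six stress components. A plain-Python version of the same formula (handy for sanity-checking a few rows without Dask) looks like:

```python
def von_mises(sxx, syy, szz, sxy, syz, sxz):
    # Von Mises equivalent stress from the six stress tensor components;
    # term-for-term the same expression as the Dask version above.
    return (((sxx - syy) ** 2
             + (syy - szz) ** 2
             + (szz - sxx) ** 2
             + 6 * (sxy ** 2 + syz ** 2 + sxz ** 2)) / 2) ** 0.5
```

For pure uniaxial tension (only sxx nonzero) this reduces to sxx itself, which is a quick way to check the implementation.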

Since the last compute was slow, I decided to distribute it over a few systems on my LAN: I started a scheduler on my machine and a couple of workers on the other systems, and fired up a Client as below.
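The scheduler and workers can be started with the command-line tools that ship with dask.distributed, roughly like this (the address is a placeholder matching the Client snippet below):

```shell
# On the scheduler machine (listens on port 8786 by default):
dask-scheduler

# On each worker machine, pointing at the scheduler's address:
dask-worker tcp://mysystemip:8786
```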

from dask.distributed import Client
client = Client('mysystemip:8786')  # Connection to the scheduler is established fine

Then I read in the Dask DataFrame as before. However, I got the error below when I executed seqv.compute().

HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 509, in H5Fopen
    unable to open file
  File "H5Fint.c", line 1400, in H5F__open
    unable to open file
  File "H5Fint.c", line 1615, in H5F_open
    unable to lock the file
  File "H5FD.c", line 1640, in H5FD_lock
    driver lock request failed
  File "H5FDsec2.c", line 941, in H5FD_sec2_lock
    unable to lock file, errno = 11, error message = 'Resource temporarily unavailable'

End of HDF5 error back trace

Unable to open/create file 'demo.h5'

I have made sure that all workers have access to the demo.h5 file. I also tried passing lock=False to read_hdf, but got the same error.

Is this possible to do at all? Maybe I should try another file format? I guess writing each pandas DataFrame to a separate file might work, but I'm trying to avoid that (I don't even want an intermediate HDF5 file). Before I go down that route, though, I'd like to know if there is a better approach to the problem.

Thanks for any suggestions!

asked Oct 16 '25 by najeem


1 Answer

If you want to read data from a custom format in a text file I recommend using the dask.bytes.read_bytes function. It returns a small sample of the data plus, for each input file, a list of delayed objects, each of which points to a block of bytes from that file. With the delimiter argument those blocks are split cleanly on line boundaries by default.

Something like this might work:

import dask
import dask.bytes
import dask.dataframe
import pandas

def parse_bytes(b: bytes) -> pandas.DataFrame:
    ...  # parse one block of raw bytes into a pandas DataFrame

# read_bytes returns (sample, blocks); blocks holds one list of
# delayed byte-blocks per input file, hence the blocks[0]
sample, blocks = dask.bytes.read_bytes("my-file.txt", delimiter=b"\n")
dataframes = [dask.delayed(parse_bytes)(block) for block in blocks[0]]
df = dask.dataframe.from_delayed(dataframes)
  • https://docs.dask.org/en/latest/remote-data-services.html#dask.bytes.read_bytes
  • https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_delayed
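The parse_bytes callback is whatever turns one newline-delimited block of raw bytes into a DataFrame. A stdlib-only sketch of that parsing step (the whitespace-separated numeric format and the column names are assumptions here; a real implementation would build a pandas.DataFrame from the parsed rows instead of a list of dicts):

```python
import io

def parse_block(b: bytes) -> list:
    # Parse one block of whitespace-separated numeric text into rows.
    # Column names are hypothetical; swap in your format's real fields.
    names = ["sxx", "syy", "szz", "sxy", "syz", "sxz"]
    rows = []
    for line in io.StringIO(b.decode("utf-8")):
        if not line.strip():
            continue  # skip blank lines at block boundaries
        values = [float(v) for v in line.split()]
        rows.append(dict(zip(names, values)))
    return rows
```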
answered Oct 18 '25 by MRocklin


