
dask: How to read CSV files into a DataFrame from Microsoft Azure Blob

S3Fs is a Pythonic file interface to S3; does Dask have any comparable Pythonic interface to Azure Storage Blob? The Python SDKs for Azure Storage Blob provide ways to read and write blobs, but the interface requires downloading the file from the cloud to the local machine. I am looking for a solution that reads the blob as a stream or a string, so that Dask can read it in parallel without persisting anything to local disk.
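
For reference, this is the s3fs-backed pattern the question alludes to (a minimal sketch; the bucket and path are hypothetical, and s3fs must be installed):

import dask.dataframe as dd

# With s3fs installed, Dask resolves s3:// URLs natively and reads the
# matching objects in parallel, directly from object storage.
df = dd.read_csv('s3://mybucket/path/to/*.csv')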

asked Dec 05 '25 by Charles Selvaraj

1 Answer

I have just pushed new code here: https://github.com/dask/dask-adlfs
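
One way to install, assuming conda is available for the dependencies (the requirement names come from the repository; installing straight from GitHub with pip is an assumption on my part):

conda install dask cffi oauthlib
pip install git+https://github.com/dask/dask-adlfs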

You may pip-install from that location as shown above, although you may be best served by conda-installing the requirements (dask, cffi, oauthlib) beforehand. In a Python session, doing import dask_adlfs is enough to register the backend with Dask, so that thereafter you can use Azure URLs with Dask functions like:

import dask.dataframe as dd

# storage_options forwards the Azure service-principal credentials
# to the dask-adlfs backend (the values here are placeholders)
df = dd.read_csv('adl://mystore/path/to/*.csv', storage_options={
    'tenant_id': 'mytenant',
    'client_id': 'myclient',
    'client_secret': 'mysecret'})
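
Note that read_csv is lazy: the blobs are only read when a computation is triggered. A brief usage sketch (the calls are standard Dask; the data itself is hypothetical):

df.head()           # reads a small sample from the first matching blob
pdf = df.compute()  # reads all matching blobs in parallel into a pandas DataFrame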

Since this code is totally brand new and untested, expect rough edges. With luck, you can help iron out those edges.

answered Dec 07 '25 by mdurant


