I'm trying to load JSON files into a dask dataframe:

import glob
import dask.dataframe as dd

files = glob.glob('**/*.json', recursive=True)
df = dd.read_json(files, lines=False)
There are some missing values in the data, and some of the files have extra columns. Is there a way to specify a column list, so that all possible columns exist in the concatenated dask dataframe? Additionally, shouldn't it be able to handle missing values? I get the following error when trying to compute the df:
ValueError: Metadata mismatch found in `from_delayed`.
Partition type: `DataFrame`
+-----------------+-------+----------+
| Column | Found | Expected |
+-----------------+-------+----------+
| x22 | - | float64 |
| x21 | - | object |
| x20 | - | float64 |
| x19 | - | float64 |
| x18 | - | object |
| x17 | - | float64 |
| x16 | - | object |
| x15 | - | object |
| x14 | - | object |
| x13 | - | object |
| x12 | - | object |
| x11 | - | object |
| x10 | - | object |
| x9 | - | float64 |
| x8 | - | object |
| x7 | - | object |
| x6 | - | object |
| x5 | - | int64 |
| x4 | - | object |
| x3 | - | float64 |
| x2 | - | object |
| x1 | - | object |
+-----------------+-------+----------+
read_json() is new and tested for the "common" case of homogeneous data. It could, like read_csv, be extended to cope with column selection and data type coercion fairly easily. I note that the pandas function allows the passing of a dtype= parameter.
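For reference, a minimal sketch of what that dtype= option looks like in plain pandas (the file name is a placeholder; the column names and dtypes are taken from the error table above):

import pandas as pd

# pandas.read_json accepts a dtype mapping, e.g. to coerce individual columns
pdf = pd.read_json('one_file.json', dtype={'x3': 'float64', 'x5': 'int64'})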
This is not an answer, but perhaps you would be interested in submitting a PR at the repo? The specific code lives in the file dask.dataframe.io.json.
I bumped into a similar problem and came up with another solution:
import dask.dataframe as dd
import pandas as pd

def read_data(path, **kwargs):
    # build the meta: an empty pandas dataframe with the expected columns and dtypes
    meta = dd.read_json(path, **kwargs).head(0)
    # edit the meta dataframe here so it matches what json_engine() returns

    def json_engine(*args, **kwargs):
        df = pd.read_json(*args, **kwargs)
        # add or drop the necessary columns here
        return df

    return dd.read_json(path, meta=meta, engine=json_engine, **kwargs)
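Hypothetical usage with the file list from the question would then be:

df = read_data(files, lines=False)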
So the idea of this solution is that you do two things:

1. If the error says a column is expected by meta but was not found in a partition:

| Column | Found | Expected |
| x22 | - | object |

In this case you simply drop this column from meta and in your json_engine() wrapper.

2. If, on the contrary, a column is found in the data but is not expected by meta, you add the necessary columns to meta with the necessary types (BTW, meta is just an empty pandas dataframe in this case), and you also add those columns as empty in your json_engine() wrapper if necessary (see the sketch below).
Also look at the proposal in the comments to the https://stackoverflow.com/a/50929229/2727308 answer: to use dask.bag instead.
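A rough sketch of that dask.bag route (assuming each file holds either a single JSON object or a list of objects; load() and ALL_COLUMNS are hypothetical helpers, and files is the glob list from the question):

import json
import dask.bag as db

ALL_COLUMNS = ['x1', 'x2', 'x3']  # placeholder full column list

def load(path):
    # parse one whole file and normalize every record to the full column list
    with open(path) as f:
        data = json.load(f)
    records = data if isinstance(data, list) else [data]
    return [{col: rec.get(col) for col in ALL_COLUMNS} for rec in records]

bag = db.from_sequence(files).map(load).flatten()
df = bag.to_dataframe()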