 

Dask read_json metadata mismatch

Tags:

dask

I'm trying to load JSON files into a Dask dataframe:

import glob
import dask.dataframe as dd

files = glob.glob('**/*.json', recursive=True)
df = dd.read_json(files, lines=False)

There are some missing values in the data, and some of the files have extra columns. Is there a way to specify a column list, so that all possible columns exist in the concatenated Dask dataframe? Also, can it handle missing values? I get the following error when trying to compute the df:

ValueError: Metadata mismatch found in `from_delayed`.

Partition type: `DataFrame`
+-----------------+-------+----------+
| Column          | Found | Expected |
+-----------------+-------+----------+
| x22             | -     | float64  |
| x21             | -     | object   |
| x20             | -     | float64  |
| x19             | -     | float64  |
| x18             | -     | object   |
| x17             | -     | float64  |
| x16             | -     | object   |
| x15             | -     | object   |
| x14             | -     | object   |
| x13             | -     | object   |
| x12             | -     | object   |
| x11             | -     | object   |
| x10             | -     | object   |
| x9              | -     | float64  |
| x8              | -     | object   |
| x7              | -     | object   |
| x6              | -     | object   |
| x5              | -     | int64    |
| x4              | -     | object   |
| x3              | -     | float64  |
| x2              | -     | object   |
| x1              | -     | object   |
+-----------------+-------+----------+
Asked by Maria


2 Answers

read_json() is new and tested for the "common" case of homogeneous data. It could, like read_csv, be extended fairly easily to cope with column selection and data type coercion. I note that the pandas function allows passing a dtype= parameter (sketched below).
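For reference, a minimal sketch of that pandas-level parameter, assuming a hypothetical file name and using column names from the error table above:

import pandas as pd

# pandas read_json (not yet dask's) accepts an explicit dtype mapping,
# so columns with missing values can be coerced to a known type up front
df = pd.read_json('part-0001.json', dtype={'x3': 'float64', 'x5': 'int64'})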

This is not an answer, but perhaps you would be interested in submitting a PR at the repo? The specific code lives in the module dask.dataframe.io.json.

Answered by mdurant


I bumped into a similar problem and came up with another solution:

import pandas as pd
import dask.dataframe as dd

def read_data(path, **kwargs):
    # infer the schema from the data, keeping only the empty frame as meta
    meta = dd.read_json(path, **kwargs).head(0)
    # edit the meta dataframe here to match what json_engine() returns

    def json_engine(*args, **kwargs):
        df = pd.read_json(*args, **kwargs)
        # add or drop columns here so every partition matches meta
        return df

    return dd.read_json(path, meta=meta, engine=json_engine, **kwargs)

The idea of this solution is to do two things:

  1. Edit meta as you see fit (for example, removing a column from it which you don't need).
  2. Wrap the JSON engine function, dropping/adding the necessary columns so meta matches what the function returns.

Examples:

  1. You have one particular irrelevant column which causes your code to fail with an error like:
+-----------------+-------+----------+
| Column          | Found | Expected |
+-----------------+-------+----------+
| x22             | -     | object   |
+-----------------+-------+----------+

In this case you simply drop this column from meta and in your json_engine() wrapper, as sketched below.
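A sketch of that case, assuming x22 (a hypothetical name taken from the error tables) is the irrelevant column and a hypothetical glob for the path:

import pandas as pd
import dask.dataframe as dd

path = 'data/*.json'  # hypothetical glob

# drop the unwanted column from meta
meta = dd.read_json(path).head(0).drop(columns=['x22'], errors='ignore')

def json_engine(*args, **kwargs):
    # and drop it from every partition so each one matches meta
    return pd.read_json(*args, **kwargs).drop(columns=['x22'], errors='ignore')

df = dd.read_json(path, meta=meta, engine=json_engine)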

  2. You have some relevant columns which are reported missing for some partitions. In this case you get an error similar to the one in the question.

In this case you add the necessary columns to meta with the necessary types (meta is just an empty pandas DataFrame in this case), and you also add those columns as empty in your json_engine() wrapper if necessary, as sketched below.
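A sketch of that case, again with the hypothetical column x22 and its expected dtype taken from the error table:

import pandas as pd
import dask.dataframe as dd

path = 'data/*.json'  # hypothetical glob

meta = dd.read_json(path).head(0)
if 'x22' not in meta.columns:
    meta['x22'] = pd.Series(dtype='float64')  # expected dtype

def json_engine(*args, **kwargs):
    df = pd.read_json(*args, **kwargs)
    if 'x22' not in df.columns:
        # add the missing column as empty so the partition matches meta
        df['x22'] = pd.Series(dtype='float64', index=df.index)
    return df

df = dd.read_json(path, meta=meta, engine=json_engine)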

Also see the proposal in the comments on https://stackoverflow.com/a/50929229/2727308: use dask.bag instead.
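A rough sketch of that bag approach, assuming (as implied by lines=False in the question) that each file holds one JSON document:

import glob
import json
import dask.bag as db

def load(path):
    with open(path) as f:
        return json.load(f)  # one record per file, under this assumption

files = glob.glob('**/*.json', recursive=True)
bag = db.from_sequence(files).map(load)
# records can be normalized with bag.map(...) (missing keys filled,
# extras dropped) before the schema is fixed here
df = bag.to_dataframe()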

Answered by featuredpeow