Reason I felt this was not a duplicate of this question: the answer there relies on explicit exprs (a hand-written parsing statement), so - again - the author expects knowledge of the schema ex-ante and is not inferring it.

Requirements:
- Ex-ante, I do not know what the JSON schema is, so I need to infer it. spark.read.json seems the best fit for inferring a schema, but all the examples I came across load the JSON from files; in my use case the JSON is contained within a dataframe column.
- I am agnostic to the source file type (tested here with parquet and csv). However, the source dataframe schema is, and will be, well structured. For my use case the JSON is contained within a source dataframe column called 'fields'.
- The resulting dataframe should link back to the primary key of the source dataframe ('id' in my example).
The key turned out to be in the Spark source code: path, when passed to spark.read.json, may be an "RDD of Strings storing json objects".
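For instance, a quick sanity check (the JSON strings here are invented purely for illustration):

spark.read.json(spark.sparkContext.parallelize([
    '{"a": 1, "b": "x"}',
    '{"a": 2, "b": "y"}',
])).printSchema()
# root
#  |-- a: long (nullable = true)
#  |-- b: string (nullable = true)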
Here's the source dataframe schema:
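Only the 'id' and 'fields' columns matter for what follows, so here is a minimal stand-in built with invented values rather than my real data:

df = spark.createDataFrame(
    [("1", '{"a": 1, "b": "x"}'),
     ("2", '{"a": 2, "b": "y"}')],
    ["id", "fields"],
)
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- fields: string (nullable = true)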
The code I came up with was:
import json

def inject_id(row):
    # Copy the source row's primary key into the parsed JSON so the
    # result can be joined back to the source dataframe.
    js = json.loads(row['fields'])
    js['id'] = row['id']
    return json.dumps(js)

json_df = spark.read.json(df.rdd.map(inject_id))
json_df then had a schema as such:
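The exact layout depends on what the JSON in 'fields' contains; on the stand-in data above it comes out as:

json_df.printSchema()
# root
#  |-- a: long (nullable = true)
#  |-- b: string (nullable = true)
#  |-- id: string (nullable = true)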
Note that I did not test this with a more nested structure, but I believe it will support anything that spark.read.json supports.
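A quick way to check is to feed it an invented nested document; the schema that spark.read.json infers should look like the commented output below:

nested = spark.createDataFrame(
    [("1", '{"a": {"b": [1, 2], "c": "x"}}')],
    ["id", "fields"],
)
spark.read.json(nested.rdd.map(inject_id)).printSchema()
# Expected:
# root
#  |-- a: struct (nullable = true)
#  |    |-- b: array (nullable = true)
#  |    |    |-- element: long (containsNull = true)
#  |    |-- c: string (nullable = true)
#  |-- id: string (nullable = true)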