I have a dataframe with a string column that is valid JSON. I'd like to transform the contents of this column into a file that is a JSON array where each row becomes an object. I'm looking for a good way to implement this.
In Python, I'd use Polars to do this:

import polars as pl
import json

df = df.with_columns(
    pl.col("my_str_column").str.json_decode()
)
column_as_list = df.select(pl.col("my_str_column")).to_series().to_list()
with open("target_file.json", "w") as w:
    w.write(json.dumps(column_as_list))
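For reference, the target file format can be sketched with the standard library alone; the rows list below is a hypothetical stand-in for the column's contents:

```python
import json

# Stand-in for the string column: each entry is a valid JSON object.
rows = ['{"a": 1}', '{"a": 2}', '{"a": 3}']

# Parse each string, then serialize the whole list as one JSON array.
parsed = [json.loads(s) for s in rows]
with open("target_file.json", "w") as f:
    json.dump(parsed, f)
```

This produces a file containing a single JSON array with one object per row.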
This approach might not be optimal, but it works fine. Now I'd like to translate it into polars-rs. Here's how far I've gotten:
df = df
    .lazy()
    .with_column(col("my_str_column").str().json_decode(None, Some(100)))
    .collect()?;
This turns df.select("my_str_column") into a column with a struct datatype.
I then tried to convert it to a Series and write it to a file (inspired by these SO posts: How to write a vector to a json file, How to get a Vec from polars Series or ChunkedArray).
let binding: Series = df.select(["my_str_column"])?.pop().unwrap();
// let test: Vec<_> = binding.struct_().iter().collect(); // this would panic
So I removed the json_decode() call, leaving the column type as str, and used a snippet I found here (How do I change the string element of vector to json in Rust) to implement this:
use serde_json::Value;

// hacky way to convert the desired column into a series
let binding = df.select(["my_str_column"])?.pop().unwrap();
let series_collect: Vec<&str> = binding.str().unwrap().into_no_null_iter().collect();

// convert the input into a Vec<Value> by parsing each string
let elems = series_collect
    .into_iter()
    .map(serde_json::from_str)
    .collect::<Result<Vec<Value>, _>>()
    .unwrap();
let json_val = Value::Array(elems);
// now write json_val to disk
use std::fs::File;
use std::io::{BufWriter, Write};

let mut path = std::env::current_dir()?;
path.push("target_file.json");
let file = File::create(path)?;
let mut writer = BufWriter::new(file);
serde_json::to_writer(&mut writer, &json_val).unwrap();
writer.flush()?;
This works, but I'm not sure if it's a good way to write a column to a JSON file. Any help to improve this code would be appreciated.
As for your py-polars approach, you could .unnest() and use .write_json() instead.
e.g. with row_oriented=True
df = pl.DataFrame({
    "my_str_column": [
        """{"a": 1}""",
        """{"a": 2}""",
        """{"a": 3}""",
    ]
})
(df.select(pl.col("my_str_column").str.json_decode())
   .unnest("my_str_column")
   .write_json(row_oriented=True)
)
'[{"a":1},{"a":2},{"a":3}]'
We can see how write_json() is implemented here:
JsonWriter::new(file)
    .with_json_format(JsonFormat::Json)
    .finish(&mut self.df)
Meaning you should be able to:

let mut out = df
    .lazy()
    .with_column(...)
    .unnest(&["my_str_column"])
    .collect()?;

JsonWriter::new(writer)
    .with_json_format(JsonFormat::Json)
    .finish(&mut out)?;