I'm trying to read a JSON file via apache beam in python and apply some data quality rules over it. Currently I'm using beam.io.ReadFromText to read each json line and use some functions to modify the data. What would be a better way to read the JSON data and modify them ?
(p
| 'Getdata' >> beam.io.ReadFromText(input)
| 'filter_name' >> beam.FlatMap(lambda line: dq_name(line))
| 'filter_phone' >> beam.FlatMap(lambda line: dq_phone(line))
| 'filter_zip' >> beam.FlatMap(lambda line: dq_zip(line))
| 'filter_address' >> beam.FlatMap(lambda line: dq_city(line))
| 'filter_website' >> beam.FlatMap(lambda line: dq_website(line))
| 'write' >> beam.io.WriteToText(output_prefix) )
Note: I'm fairly new to this, I'm sorry if my current approach looks too shoddy of stupid.
You are approaching Apache Beam (Dataflow) from the wrong direction.
You are trying to read a line and then apply transforms to this line one at a time.
Instead, you need to look at Beam being a parallel processor. You will read in all the lines ReadFromText()
and then apply transforms to each line in parallel.
Look into the function beam.ParDo()
. This will allow you to create a class that can process each line of your JSON file. Your code would then have major steps like ReadFromText()
, ParDo(MyJsonProcessor())
, WriteToText()
.
Remember that your JSON will need to be Newline Delimited JSON. http://ndjson.org/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With