I have the following code:
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = (
    bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()  # Waits for job to complete.
(Note that I'm acquiring my table object elsewhere.)
This works, and dumps the requested table into GCS as newline-delimited JSON. However, some of the columns in the table are nullable, and some of them do contain null values. In the interest of consistency across all of our data, I'd like to preserve the null values in the JSON results. Is there any way to do this without having to use Avro?
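To illustrate with a hypothetical schema: the JSON extract simply omits columns that are NULL, so a row whose optional_col is NULL currently comes out as

{"id": 1, "name": "foo"}

when what I'd like to get is

{"id": 1, "name": "foo", "optional_col": null}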
This post here: Big Query table extract in JSON, preserve nulls? ... suggests actually querying the table. I don't think that's an option for me, as the tables I'm extracting contain millions of rows apiece. One I'm looking at contains nearly 100M rows and weighs in at more than 25 GB of data. But I haven't been able to find a way to set up the extract job to preserve nulls.
I think the best way to do this is to run a query job first. Here is code that does that:
import time  # used to poll the query job below

# Step 1: rewrite each row as a JSON string; TO_JSON_STRING keeps NULL fields.
job_config = bigquery.QueryJobConfig()
gcs_filename = 'file_with_nulls*.json.gzip'

# Temporary destination table for the query results.
table_ref = client.dataset(dataset_id).table('my_null_table')
job_config.destination = table_ref
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Start the query, passing in the extra configuration.
query_job = client.query(
    """#standardSQL
    SELECT TO_JSON_STRING(t) AS json FROM `project.dataset.table` AS t;""",
    location='US',
    job_config=job_config)

# Poll until the destination table has been successfully written.
while not query_job.done():
    time.sleep(1)
print("query completed")
# Step 2: extract the single-column table as gzipped text.
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = bigquery.DestinationFormat.CSV
job_config.print_header = False  # no header row, just the JSON lines

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()
print("extract completed")
After it's all done, you can delete the temporary table you created in step 1. The cost is very low if you clean up quickly: storage runs about $20 per TB per month, so even a full TB kept for one hour is only about $20/30/24 ≈ 3 cents, and 25 GB is a small fraction of that.
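A minimal cleanup sketch, reusing the client and table_ref from the code above:

# Delete the temporary table once the extract has finished.
client.delete_table(table_ref)  # API request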