
BigQuery Python API: Preserve null fields during extract_table job

I have the following code:

job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = (
    bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

(Note that I'm acquiring my table object elsewhere.)

This works, and dumps the requested table into GCS as newline-delimited JSON. However, some of the columns in the table are nullable, and some of them do contain null values. In the interests of consistency across all of our data, I'd like to preserve the null fields in the JSON results. Is there any way to do this without having to use Avro?

This post here: Big Query table extract in JSON, preserve nulls? ... suggests actually querying the table. I don't think that this is an option for me, as the tables I'm extracting contain millions of rows apiece. One I'm looking at contains nearly 100M rows and weighs in at more than 25GB of data. But I haven't been able to find a way to configure the extract job to preserve nulls.
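To make the goal concrete: the JSON extract drops keys whose values are NULL, whereas the desired output keeps them as explicit `null` fields. A minimal local illustration of the target shape, using Python's `json` module (the `id`/`name` column names are made up for the example):

```python
import json

# What a JSON extract of a row with a NULL "name" column looks like:
extracted = '{"id": 1}'  # the null field is dropped entirely

# What we want instead, with the null field kept explicitly:
desired = '{"id": 1, "name": null}'

# Python's json module shows the target shape: None round-trips as null
print(json.dumps({"id": 1, "name": None}))  # {"id": 1, "name": null}
```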

Gideon asked Apr 25 '26 22:04
1 Answer

I think the best way to do this is to run a query job first:

  1. Run a query job over the table you want to extract, writing the rows as JSON strings into a temporary table
  2. Run the extract as CSV without headers

Here is code that does this:

import time

from google.cloud import bigquery

job_config = bigquery.QueryJobConfig()
# wildcard URI lets BigQuery shard exports larger than 1 GB
gcs_filename = 'file_with_nulls*.json.gzip'

table_ref = client.dataset(dataset_id).table('my_null_table')
job_config.destination = table_ref

job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Start the query, passing in the extra configuration.
query_job = client.query(
    """#standardSql
    select TO_JSON_STRING(t) AS json from `project.dataset.table` as t ;""",
    location='US',
    job_config=job_config)

while not query_job.done():
    time.sleep(1)

# the temp table now holds one JSON string per row
print("query completed")
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = (
    bigquery.DestinationFormat.CSV)
job_config.print_header = False

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()
print("extract completed")

After it's all done, you can delete the temp table you created in step 1. The cost will be very low if you do it quickly: storage is about $20 per TB per month, so even a full TB kept for one hour is only 20/30/24 ≈ 3 cents, and 25GB is a small fraction of that.
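The back-of-envelope cost estimate and the cleanup step can be sketched as follows (the $20/TB/month price is taken from the estimate above; the `delete_table` call assumes an authenticated `client` and the same `dataset_id` as in the snippet):

```python
# Storage cost of keeping the temp table around, at ~$20 per TB per month
per_tb_per_month = 20.0
hours_per_month = 30 * 24

cost_1tb_one_hour = per_tb_per_month / hours_per_month   # ~$0.028, i.e. ~3 cents
cost_25gb_one_hour = cost_1tb_one_hour * 25 / 1024       # well under a cent

# Hypothetical cleanup of the temp table from step 1 (requires an
# authenticated `client`; commented out so the sketch runs standalone):
# client.delete_table(client.dataset(dataset_id).table('my_null_table'))
```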

Alexey Maloletkin answered Apr 28 '26 12:04

