I have the following code:
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = (
    bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()  # Waits for job to complete.
(Note that I'm acquiring my table object elsewhere.)
This works, and dumps the requested table into GCS as newline-delimited JSON. However, some of the columns in the table are nullable, and some of them do contain null values. In the interest of consistency across all of our data, I'd like to preserve the null values in the JSON results. Is there any way to do this without having to use Avro?
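To illustrate with a hypothetical schema: the JSON extract simply omits columns that are NULL, so a row whose optional_col is NULL currently comes out as

{"id": 1, "name": "foo"}

when what I'd like to get is

{"id": 1, "name": "foo", "optional_col": null}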
This post here: Big Query table extract in JSON, preserve nulls? ... suggests actually querying the table. I don't think that's an option for me, as the tables I'm extracting contain millions of rows apiece. One I'm looking at contains nearly 100M rows and weighs in at more than 25 GB of data. But I haven't been able to find a way to set up the extract job to preserve nulls.
I think the best way to do this is to run a query job first. Here is code that does that:
import time  # used to poll the query job below

# Step 1: rewrite each row as a JSON string; TO_JSON_STRING keeps NULL fields.
job_config = bigquery.QueryJobConfig()
gcs_filename = 'file_with_nulls*.json.gzip'

# Temporary destination table for the query results.
table_ref = client.dataset(dataset_id).table('my_null_table')
job_config.destination = table_ref
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Start the query, passing in the extra configuration.
query_job = client.query(
    """#standardSQL
    SELECT TO_JSON_STRING(t) AS json FROM `project.dataset.table` AS t;""",
    location='US',
    job_config=job_config)

# Poll until the destination table has been successfully written.
while not query_job.done():
    time.sleep(1)
print("query completed")
# Step 2: extract the single-column table as gzipped text.
job_config = bigquery.ExtractJobConfig()
job_config.compression = bigquery.Compression.GZIP
job_config.destination_format = bigquery.DestinationFormat.CSV
job_config.print_header = False  # no header row, just the JSON lines

destination_uri = 'gs://{}/{}'.format(bucket_name, gcs_filename)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,
    location='US')  # API request
extract_job.result()
print("extract completed")
After it's all done, you can delete the temporary table you created in step 1. The cost is very low if you clean up quickly: storage runs about $20 per TB per month, so even a full TB kept for one hour is only about $20/30/24 ≈ 3 cents, and 25 GB is a small fraction of that.
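A minimal cleanup sketch, reusing the client and table_ref from the code above:

# Delete the temporary table once the extract has finished.
client.delete_table(table_ref)  # API request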