I am using the Google BigQuery Python API to automatically load a dataset from a GCS bucket into BigQuery on a daily basis. I specified a schema for this dataset and it worked until one day I got this error:
ERROR - {u'state': u'DONE', u'errors': [{u'reason': u'invalid', u'message': u'Invalid argument: [email protected]', u'location': u'File: 0 / Line:1283011 / Field:44'},
{u'reason': u'invalid', u'message': u'Invalid argument: [email protected]', u'location': u'File: 0 / Line:1338016 / Field:44'}, {u'reason': u'invalid', u'message': u'Too many errors encountered. Limit is: 0.'}], u'errorResult': {u'reason': u'invalid', u'message': u'Too many errors encountered. Limit is: 0.'}}
My question is: how can I tell BigQuery to automatically drop lines that don't fit the schema?
If you are using the bq command line client, the option is --max_bad_records. This tells BQ how many bad records to ignore before failing a load job. The default is zero. I don't recall whether setting this to a positive value lets bad rows into the data or only ignores them and suppresses the error. You might want to test it manually.
From: https://cloud.google.com/bigquery/bq-command-line-tool
--max_bad_records
Type: integer
Description: The maximum number of bad rows to skip before the load job is aborted and no updates are performed. If this value is larger than 0, the job will succeed as long as the number of bad records does not exceed this value. This is useful if you would like to load files that may have bad records. The default value for this parameter is 0 (all rows are required to be valid).
Valid Values: Any integer
Usage: bq load --max_bad_records=3 <destination_table> <data_source_uri> [<table_schema>]
In the Python API this field is called maxBadRecords. You can search for maxBadRecords in the Python BigQuery API repo on GitHub and you'll see it in the jobs API configuration for a load job.
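For illustration, here is a minimal sketch of setting maxBadRecords in a load job submitted through the google-api-python-client jobs API; the project, bucket, dataset, table, and schema fields are all placeholders you would replace with your own:

from googleapiclient.discovery import build

# Hypothetical project/bucket/dataset/table names for illustration.
PROJECT_ID = "my-project"

# Assumes default application credentials are available.
service = build("bigquery", "v2")

job_body = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://my-bucket/my-file.csv"],
            "destinationTable": {
                "projectId": PROJECT_ID,
                "datasetId": "my_dataset",
                "tableId": "my_table",
            },
            "schema": {
                "fields": [
                    {"name": "email", "type": "STRING"},
                    # ... the rest of your schema fields
                ]
            },
            # Skip up to 1000 rows that don't match the schema
            # instead of failing the whole load job.
            "maxBadRecords": 1000,
        }
    }
}

job = service.jobs().insert(projectId=PROJECT_ID, body=job_body).execute()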
Sorry, I can't comment on Paul's answer as I have less than 50 reputation.
Re: "whether setting this positive lets bad rows into the data or only ignores and suppresses the error"
The bad rows will be ignored and will not go into the loaded data.
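If you want to verify how many rows were skipped, the completed load job's statistics include a badRecords count in the REST response. A short sketch using the same jobs API as above (the job ID is a placeholder):

# Fetch the finished job; "my-load-job-id" is a hypothetical job ID.
job = service.jobs().get(projectId=PROJECT_ID, jobId="my-load-job-id").execute()

# badRecords is returned as a string-encoded integer in the REST response.
skipped = int(job["statistics"]["load"].get("badRecords", "0"))
print("Rows skipped:", skipped)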