Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery external tables with python

How can i create external tables (federated data source) in BigQuery using python (google-cloud-bigquery)?

I know you can use bq commands like this, but that is not how i want to do it:

bq mk --external_table_definition=path/to/json tablename
bq update tablename path/to/schemafile

with external_table_definition as:
{
  "autodetect": true,
  "maxBadRecords": 9999999,
  "csvOptions": {
    "skipLeadingRows": 1
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://bucketname/file_*.csv"
  ]
}

and a schemafile like this:

[
  {
    "mode": "NULLABLE", 
    "name": "mycolumn1", 
    "type": "INTEGER"
  }, 
  {
    "mode": "NULLABLE", 
    "name": "mycolumn2", 
    "type": "STRING"
  }, 
  {
    "mode": "NULLABLE", 
    "name": "mycolumn3", 
    "type": "STRING"
  }
]

Thank you for your help! Lars

like image 511
lars Avatar asked Oct 26 '25 04:10

lars


2 Answers

table_id = 'table1'

table = bigquery.Table(dataset_ref.table(table_id), schema=schema)
external_config = bigquery.ExternalConfig('CSV')
external_config = {
  "autodetect": true,
  "options": {
    "skip_leading_rows": 1
  },
  "source_uris": [
    "gs://bucketname/file_*.csv"
  ]
}
table.external_data_configuration = external_config
table = client.create_table(table) 

Schema Format is :

schema = [
    bigquery.SchemaField(name='mycolumn1', field_type='INTEGER', is_nullable=True),
    bigquery.SchemaField(name='mycolumn2', field_type='STRING', is_nullable=True),
    bigquery.SchemaField(name='mycolumn3', field_type='STRING', is_nullable=True),
]
like image 172
Nihal Sangeeth Avatar answered Oct 28 '25 19:10

Nihal Sangeeth


I know this is well after the question has been asked and answered, but the above accepted answer does not work. I attempted to do the same thing you are describing and additionally trying to use the same approach to update an existing external table who added some new columns. This would be the correct snippet to use assuming you have that JSON file stored somewhere like /tmp/schema.json

[
  {
    "mode": "NULLABLE", 
    "name": "mycolumn1", 
    "type": "INTEGER"
  }, 
  {
    "mode": "NULLABLE", 
    "name": "mycolumn2", 
    "type": "STRING"
  }, 
  {
    "mode": "NULLABLE", 
    "name": "mycolumn3", 
    "type": "STRING"
  }
]

You should simply need to have the following if you already have the API representation of the options you want to add to the external table.

from google.cloud import bigquery

client = bigquery.Client()

# dataset must exist first
dataset_name = 'some_dataset'
dataset_ref = client.dataset(dataset_name)

table_name = 'tablename'

# Or wherever your json schema lives
schema = client.schema_from_json('/tmp/schema.json')

external_table_options = {
  "autodetect": True,
  "maxBadRecords": 9999999,
  "csvOptions": {
    "skipLeadingRows": 1
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://bucketname/file_*.csv"
  ]
}

external_config = client.ExternalConfig.from_api_repr(external_table_options)

table = bigquery.Table(dataset_ref.table(table_name), schema=schema)
table.external_data_configuration = external_config

client.create_table(
    table,

    # Now you can create the table safely with this option
    # so that it does not fail if the table already exists
    exists_od=True

)

# And if you seek to update the table's schema and/or its
# external options through the same script then use
client.update_table(
    table,

    # As a side note, this portion of the code had me confounded for hours.
    # I could not for the life of me figure our that "fields" did not point
    # to the table's columns, but pointed to the `google.cloud.bigquery.Table`
    # object's attributes. IMHO, the naming of this parameter is horrible
    # given "fields" are already a thing (i.e. `SchemaField`s).
    fields=['schema', 'external_data_configuration'])
)

In addition to setting the external table configuration using the API representation, you can set all of the same attributes by calling the names of those attributes on the bigquery.ExternalConfig object itself. So this would be another approach surrounding just the external_config portion of the code above.

external_config = bigquery.ExternalConfig('CSV')
external_config.autodetect = True
external_config.max_bad_records = 9999999
external_config.options.skip_leading_rows = 1
external_config.source_uris = ["gs://bucketname/file_*.csv"]

I must again however raise some frustration with the Google documentation. The bigquery.ExternalConfig.options attribute claims that it can be set with a dictionary

>>> from google.cloud import bigquery
>>> help(bigquery.ExternalConfig.options)
Help on property:

    Optional[Dict[str, Any]]: Source-specific options.

but that is completely false. As you can see above the python object attribute names and the API representation names of those same attributes are slightly different. Either way you try it though, if you had a dict of the source-specific options (e.g. CSVOptions, GoogleSheetsOptions, BigTableOptions, etc...) and attempted to pass that dict as the options attribute, it laughs in your face and says mean things like this.

>>> from google.cloud import bigquery
>>> external_config = bigquery.ExternalConfig('CSV')
>>> options = {'skip_leading_rows': 1}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: cant set attribute
>>> options = {'skipLeadingRows': 1}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: cant set attribute
>>> options = {'CSVOptions': {'skip_leading_rows': 1}}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: cant set attribute
>>> options = {'CSVOptions': {'skipLeadingRows': 1}}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: cant set attribute

The workaround was iterating over the options dict and using the __setattr__() method on the options which worked well for me. Pick you favorite approach from above. I have tested all of this code and will be using it for some time.

like image 22
Roman Czerwinski Avatar answered Oct 28 '25 20:10

Roman Czerwinski