Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to sort a large table in BigQuery?

I have about 4GB data stored in Google's BigQuery in the following format:

   uuid    |   entity_name    |    property    |    value   
---------------------------------------------------------------
  abc      |   Person         |   first_name   |  John
  def      |   Person         |   age          |  45
  abc      |   Person         |   age          |  26
  def      |   Person         |   first_name   |  Mary
...

I want to get paginated results sorted by uuid. However, as per documentation, it is not possible to use ORDER BY or GROUP BY when the flag "allowLargeResults" is set to true. And of course querying such large table requires that. Is there a workaround for this scenario? I attempted to do client-side sorting however after successfully fetching the first few pages it raises an error "An existing connection was forcibly closed by the remote host".

Here's my query job:

query = 'SELECT * FROM [Users.Events] ORDER BY uuid'

query_request = {
    'jobReference': {
        'projectId': project_id,
        'job_id': str(uuid.uuid4())
    },
    'configuration': {
        'query': {
            'query': query,
            'priority': 'BATCH' if BATCH_QUERY else 'INTERACTIVE',
            'allowLargeResults' : True,
            'destinationTable': {
                'projectId': project_id,
                'datasetId': 'CrunchBase',
                'tableId': 'AllProperties_query'
            },
            'createDisposition': 'CREATE_IF_NEEDED',
            'writeDisposition': 'WRITE_TRUNCATE',
        }
    }
}

query_job = service.jobs().insert(
    projectId=project_id,
    body=query_request).execute(num_retries=2)

poll_job(service, query_job)

Result:

RuntimeError: {u'reason': u'resourcesExceeded', u'message': u'Resources exceeded during query execution.', u'location': u'query'}

EDIT: Attempting to sort within partition

I can potentially solve the problem if I figure out how to partition by entity_name and sort by uuid, however the following query does not work:

SELECT
  uuid, entity_name, property, value
OVER
  (PARTITION BY entity_name ORDER BY uuid) AS entities
FROM [CrunchBase.AllProperties];

Results:

Query Failed
Error: Missing function in Analytic Expression at: 1.15 - 1.70
like image 951
ksiomelo Avatar asked Sep 05 '25 03:09

ksiomelo


1 Answers

To answer the question in your edit, you need to actually specify an analytic function to apply over that ordered partition. Since you just want the current value for each row, you can use lead(x, 0).

For your query, you'd write something like this:

SELECT
  uuid, entity_name,
  LEAD(property, 0) OVER (PARTITION BY entity_name ORDER BY uuid) AS cur_property,
  LEAD(value, 0) OVER (PARTITION BY entity_name ORDER BY uuid) AS cur_value,
FROM [CrunchBase.AllProperties]
like image 179
Jeremy Condit Avatar answered Sep 07 '25 21:09

Jeremy Condit