My test data consists of 27,768,767 rows. My schema includes a "message" column of type string; the lengths of these strings vary but are generally a couple of hundred characters. There is also a user_id column of type int. Here are two queries that both return 0 rows (the WHERE clauses match nothing in my data). To my surprise, however, they both report 4.69 GB processed.
SELECT * FROM logtesting.logs WHERE user_id=1;
Query complete (1.7s elapsed, 4.69 GB processed)
SELECT * FROM logtesting.logs WHERE message CONTAINS 'this string never appears';
Query complete (2.1s elapsed, 4.69 GB processed)
Since ints are stored in 8 bytes, I would have expected the data processed by the former (user_id) query to be something like 213 MB (28 million rows * 8 bytes per user_id). The latter (message) query is harder to estimate since the strings vary in length, but I would expect it to be several times greater.
Is my understanding of how BigQuery calculates query costs wrong?
No matter what you do, BigQuery will need to scan all the rows in your table (though not necessarily all the columns), so it's normal that both queries report the same amount of data processed: the table hasn't changed between them. The WHERE clause only means the rows won't be RETURNED; they still need to be processed.
The only way to lower the amount of data processed is to not select all of your columns. BigQuery is column-based, so if you don't need all of your attributes, don't return them all (they won't be processed either). THIS is what will lower your cost :)
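For example, selecting only the column you filter on (using the table and column names from your question) should bring the bytes processed down to roughly the size of the user_id column alone, since only the columns referenced anywhere in the query are read:

SELECT user_id FROM logtesting.logs WHERE user_id=1;

Here the message column is never touched, so those few-hundred-character strings no longer count toward the processed bytes.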
Historically, "SELECT *" wasn't even supported, precisely so that people wouldn't find this out the hard way.