
Firebase Cohorts in BigQuery

I am trying to replicate Firebase Cohorts using BigQuery. I tried the query from this post: Firebase exported to BigQuery: retention cohorts query, but the results I get don't make much sense.

I managed to get user counts for period_lag 0 that are similar to what I see in Firebase, but the rest of the numbers don't look right:

Results: [screenshot of the query results; image not available]

One of the period_lag values is missing (I only see 0, 1 and 3, but no 2), and the user counts for each lag period don't look right either. I would expect to see something like this:

Firebase Cohort: [screenshot of the Firebase cohort report; image not available]

I'm pretty sure that the issue is in how I replaced the parameters in the original query with those from Firebase. Here are the bits that I have updated in the original query:

#standardSQL
WITH activities AS (
  SELECT answers.user_dim.app_info.app_instance_id AS id,
    FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
  FROM `dataset.app_events_*` AS answers
  JOIN `dataset.app_events_*` AS questions
  ON questions.user_dim.app_info.app_instance_id = answers.user_dim.app_info.app_instance_id
 -- WHERE CONCAT('|', questions.tags, '|') LIKE '%|google-bigquery|%' 

(...)

WHERE cohorts_size.cohort >= FORMAT_DATE('%Y-%m', DATE('2017-11-01'))
ORDER BY cohort, period_lag, period_label

So I'm using user_dim.first_open_timestamp_micros instead of create_date and user_dim.app_info.app_instance_id instead of id and parent_id. Any idea what I'm doing wrong?

Kelly P, asked Nov 22 '25 17:11



1 Answer

I think there is a misunderstanding about which data should go into the activities table. Let me state the differences between the case in the StackOverflow question you linked and the case you are trying to reproduce:

  • In the other question, answers.creation_date is a date value that is not fixed and can take different values for a single user. The same user can post two answers on two different dates, so you end up with several activities entries for that user, like: {[ID:user1, date:2018-01],[ID:user1, date:2018-02],[ID:user2, date:2018-01]}.
  • In your question, answers.user_dim.first_open_timestamp_micros is a date value that is fixed in the past because, as stated in the documentation, that field is "The time (in microseconds) at which the user first opened the app." Each user has a single value, so you will only have one activities entry per user, like: {[ID:user1, date:2018-01],[ID:user2, date:2018-02],[ID:user3, date:2018-01]}.

I think that is why you are not getting information about the lagged retention of users: you are not recording each time a user accesses the application, only the first time they did.
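One way to check this reasoning against your own export is to count, per app instance, how many distinct monthly periods each definition produces. This is only a diagnostic sketch: it reuses the `dataset.app_events_*` placeholder from the question, and the aliases t, e, first_open_periods and event_periods are names chosen here for illustration.

```sql
#standardSQL
-- Diagnostic sketch: compare how many distinct months each definition yields per app instance.
SELECT
  t.user_dim.app_info.app_instance_id AS id,
  -- Months derived from the first-open timestamp (expected to be a single month per id)
  COUNT(DISTINCT FORMAT_DATE('%Y-%m',
    DATE(TIMESTAMP_MICROS(t.user_dim.first_open_timestamp_micros)))) AS first_open_periods,
  -- Months derived from the individual events logged by that id
  COUNT(DISTINCT FORMAT_DATE('%Y-%m',
    DATE(TIMESTAMP_MICROS(e.timestamp_micros)))) AS event_periods
FROM `dataset.app_events_*` AS t,
  UNNEST(t.event_dim) AS e  -- event_dim is a repeated field, so it is flattened here
GROUP BY id
ORDER BY event_periods DESC
LIMIT 20
```

If the explanation above holds, first_open_periods should be 1 for nearly every id, while event_periods should grow with the number of months in which the app instance was active.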

Instead of answers.user_dim.first_open_timestamp_micros, you should use another of the fields listed in the documentation mentioned above, possibly event_dim.date or event_dim.timestamp_micros. Keep in mind that these fields describe an event rather than a user, and that event_dim is a repeated field, so you will need some pre-processing first, such as flattening the events and reducing them to one row per user and period. For testing purposes, you can use some of the publicly available BigQuery exports for Firebase.
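As a rough sketch of that pre-processing, the activities table could be built from event timestamps, with a cohort derived from each user's first active month. This is not the query from the linked post. It assumes the `dataset.app_events_*` placeholder from the question, monthly periods, and a cohort defined as the first month in which an app instance logged any event (Firebase's own report may define cohorts differently). The names cohorts, cohort_month and active_users are introduced here for illustration.

```sql
#standardSQL
WITH activities AS (
  -- One row per (app instance, month with at least one logged event)
  SELECT
    t.user_dim.app_info.app_instance_id AS id,
    DATE_TRUNC(DATE(TIMESTAMP_MICROS(e.timestamp_micros)), MONTH) AS period
  FROM `dataset.app_events_*` AS t,
    UNNEST(t.event_dim) AS e
  GROUP BY id, period
),
cohorts AS (
  -- Assumption: a user's cohort is the first month in which they were active
  SELECT id, MIN(period) AS cohort_month
  FROM activities
  GROUP BY id
)
SELECT
  FORMAT_DATE('%Y-%m', c.cohort_month) AS cohort,
  -- Whole months elapsed between the cohort month and the activity month
  DATE_DIFF(a.period, c.cohort_month, MONTH) AS period_lag,
  COUNT(DISTINCT a.id) AS active_users
FROM activities AS a
JOIN cohorts AS c USING (id)
GROUP BY cohort, period_lag
ORDER BY cohort, period_lag
```

Because activities now holds one row per active month rather than one row per user, a user who returns in later months contributes to period_lag 1, 2, 3 and so on, which is what the retention table needs.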


Finally, as a side note, joining the table with itself is pointless here, so your edited Standard SQL query would be better written as:

#standardSQL
WITH activities AS (
  SELECT answers.user_dim.app_info.app_instance_id AS id,
    FORMAT_DATE('%Y-%m', DATE(TIMESTAMP_MICROS(answers.user_dim.first_open_timestamp_micros))) AS period
  FROM `dataset.app_events_*` AS answers
  GROUP BY id, period
dsesto, answered Nov 24 '25 06:11
