Mixpanel BigQuery Export Design
This guide describes how Mixpanel exports your data to a Google BigQuery (opens in a new tab) dataset.
There are currently two ways to export mixpanel data into big.
- Exporting into Customer managed BigQuery (recommended)
- Exporting into Mixpanel managed BigQuery
As part of the export pipeline, a new dataset
mixpanel_nessie_day_partitioned_<PROJECT_ID> will be created if the customer chose to export into Mixpanel managed BigQuery or a dataset should be created with appropriate permissions on customer-managed BigQuery and provide us with a name. We also apply transformation rules to make the data compatible with data warehouses.
For user profile and identity mappings tables, we create a new table with a random suffix every time and then will update the
mp_identity_mappings views accordingly to use the latest table. You should always use the views and should refrain from using the actual tables as we don't delete the old tables immediately and you may be using an old table.
The data in the tables is partitioned based on
_PARTITIONTIME pseudo column (opens in a new tab) and in project timezone.
Note: TIMEPARITIONING shouldn't be updated on the table. It will fail your export jobs. Create a new table/view from this table for custom paritioning.
You can query data with a single table schema or with a multiple table schema in BigQuery. To get more information about the table schemas, please see Schema.
To query a single table schema, use this snippet.
SELECT count(*) FROM mixpanel_nessie_day_partitioned_<PROJECT_ID>.mp_master_event WHERE mp_event_name = “<CLEANED_EVENT_NAME>”
To query a multiple table schema, use this snippet.
SELECT count(*) FROM mixpanel_nessie_day_partitioned_<PROJECT_ID>.<CLEANED_EVENT_NAME>
CLEANED_EVENT_NAME is the transformed event name based on transformation rules.
You will need this if you suspect the export process is not exporting all the events you want. As the tables are partitions using
_PARTITIONTIME pseudo column (opens in a new tab) and in project timezone, you can use to following query to get the number of events per day in an easy and fast way:
SELECT _PARTITIONTIME AS pt, COUNT(*) FROM `mixpanel_nessie_day_partitioned_<PROJECT_ID>.mp_master_event` WHERE DATE(_PARTITIONTIME) <= "2021-12-03" AND DATE(_PARTITIONTIME) >= "2021-12-01" GROUP BY pt
This example returns the number of events in each day in project timezone for a monoschema export pipeline and an example daterange. You can adjust the query for multischema by putting the right table name in the query.
When using the ID mappings table, you should use the resolved
distinct_id in place of the non-resolved
distinct_id whenever present. If there is no resolved
distinct_id, you can then use the
distinct_id from the existing people or events table.
Below is an example SQL query that references the ID mapping table to count number of events in a specific date range for each unique user in San Francisco :
SELECT CASE WHEN m.resolved_distinct_id IS NOT NULL THEN m.resolved_distinct_id WHEN m.resolved_distinct_id IS NULL THEN e.distinct_id END as resolved_distinct_id, COUNT(*) AS count FROM mixpanel_nessie_day_partitioned_<PROJECT_ID>.mp_master_event e FULL OUTER JOIN mixpanel_nessie_day_partitioned_<PROJECT_ID>.mp_identity_mappings_data_view m ON e.distinct_id = m.distinct_id AND mp_city="San Francisco" AND DATE(e._PARTITIONTIME) <= "2021-12-03" AND DATE(e._PARTITIONTIME) >= "2021-12-01" GROUP BY resolved_distinct_id LIMIT 100
Counting number of times a user has done a specific behavior is also possible by adding more filters on event properties. You can adjust the query for multischema by putting the right table name in the query.
We recommend exporting Mixpanel data into customer-managed BigQuery, for this the customer needs to follow these steps.
Create a dataset in their BigQuery
Give Mixpanel the necessary permissions to export into this dataset.
We need two permissions to manage the dataset.
BigQuery Job User
- Go to IAM &Admin in your Google Cloud Console.
- Click + ADD to add principals
- Add new principle "email@example.com" and set role as "BigQuery Job User", and save.
**BigQuery Data Owner ** ]
- Go to BigQuery in your Google cloud Console.
- Open the dataset you want mixpanel to export to.
- Click on sharing and permissions in the drop down.
- In the Data Permissions window click on Add Principal
- Add new principle "firstname.lastname@example.org" and set role as "BigQuery Data Owner", and save.
- You need to pass this dataset and gcp project id as params when you create your pipeline (opens in a new tab)
This is not a recommended approach anymore. But if you choose to export into Mixpanel manged BigQuery then you must provide a Google group email address to use the BigQuery export when you create your pipeline (opens in a new tab). Mixpanel exports transformed data into BigQuery at a specified interval.
Note: Mixpanel creates a dataset in its own BigQuery instance and gives "View" access to the account(s) provided at the time of creating the pipeline. As a result, there is no storage cost associated with exporting data to BigQuery, but standard compute costs will occur when querying on this data.\n\nIf you choose to copy the dataset into your own BigQuery instance, you will additionally start accruing storage costs from BigQuery.