How Deduplication Works
Mixpanel deduplicates events using a combination of four key event properties:
- Event Name (event)
- Distinct ID (distinct_id)
- Timestamp (time)
- Insert ID ($insert_id)
$insert_id should be a randomly generated, unique value for each event to ensure proper deduplication. If $insert_id values are reused, distinct events may be unintentionally deduplicated.
Only the four key event properties listed above are used for deduplication. Additional event properties are not considered by the deduplication mechanism. For example, if two events share the same Event Name, Distinct ID, Timestamp, and Insert ID but have different $city values, they are still considered duplicate events.
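The rule above can be illustrated with a minimal sketch. The helper `dedup_key` and the sample values are hypothetical and exist only for this illustration. They are not part of any Mixpanel SDK, and the comparison is a simplified model of the rule rather than Mixpanel's actual implementation.

```python
def dedup_key(event: dict) -> tuple:
    """Return the four values that the text above names as the deduplication key."""
    props = event["properties"]
    return (
        event["event"],
        props["distinct_id"],
        props["time"],
        props["$insert_id"],
    )


# Two sample events that differ only in $city (all values are made up).
first = {
    "event": "Signed up",
    "properties": {
        "distinct_id": "user-123",
        "time": 1700000000,
        "$insert_id": "5d958f87-542d-4c10-9422-0ed75893dc81",
        "$city": "Paris",
    },
}
second = {
    "event": "Signed up",
    "properties": {
        "distinct_id": "user-123",
        "time": 1700000000,
        "$insert_id": "5d958f87-542d-4c10-9422-0ed75893dc81",
        "$city": "Berlin",
    },
}

# $city is not part of the key, so the two events count as duplicates.
is_duplicate = dedup_key(first) == dedup_key(second)
print(is_duplicate)  # True
```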
Deduplication Example
Deduplication occurs when a subset of the event data (event name, distinct_id, timestamp, $insert_id) is identical. Other event properties are not considered.
Required Event Object attributes

| Event Object property | Type | Description |
|---|---|---|
| event | String required | A name for the event. For example, “Signed up”, or “Uploaded Photo”. |
| properties | Object required | |
| properties.distinct_id | String required | The value of distinct_id will be treated as a string, and used to uniquely identify a user associated with your event. If you provide a distinct_id property with your events, you can track a given user through funnels and distinguish unique users for retention analyses. You should always send the same distinct_id when an event is triggered by the same user. |
| properties.token | String required | The Mixpanel token associated with your project. You can find your Mixpanel token in the project settings dialog in the Mixpanel app. Events without a valid token will be ignored. |
| properties.time | String required | The time an event occurred. If present, the value should be a unix timestamp (seconds since midnight, January 1st, 1970 - UTC). If this property is not included in your request, Mixpanel will use the time the event arrives at the server. |
| properties.$insert_id | String required | A unique UUID tied to exactly one occurrence of an event. |
$insert_id is checked for duplication after being minimized to the following shape:
If an event is sent without an $insert_id, one will be generated for it. However, it will not qualify for the deduplication process.
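As a rough sketch, an event object with the required attributes from the table above could be assembled as follows. The function name `build_event` and the token placeholder `YOUR_PROJECT_TOKEN` are assumptions made for this example. The code only builds the payload and does not send it anywhere.

```python
import json
import time
import uuid


def build_event(name: str, distinct_id: str, token: str, **extra) -> dict:
    """Build an event dict with the required attributes (hypothetical helper)."""
    properties = dict(extra)  # optional properties such as $city
    properties.update(
        {
            "distinct_id": str(distinct_id),
            "token": token,
            # Unix timestamp in seconds, as described in the table.
            "time": int(time.time()),
            # A fresh random UUID for this one occurrence of the event.
            "$insert_id": str(uuid.uuid4()),
        }
    )
    return {"event": name, "properties": properties}


event = build_event("Signed up", "user-123", "YOUR_PROJECT_TOKEN", plan="free")
print(json.dumps(event, indent=2))
```

Because $insert_id is tied to one occurrence of an event, a reasonable pattern is to build the payload once and resend that same payload on retry, rather than calling the helper again and generating a new ID.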
Deduplication Mechanisms
Mixpanel uses two main deduplication processes:
Query-Time Deduplication
- When: Happens immediately when you query data in the Mixpanel UI.
- How: If multiple events share the same event name, distinct_id, timestamp, and $insert_id, in most cases only the most recent version of the event (based on API ingestion time) is shown in reports. Note that Mixpanel does not guarantee upsert behavior.
- Scope: This deduplication is visible in the Mixpanel UI and reports, but not in raw data exports. Raw event exports contain all events as they were ingested, without any deduplication.
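The query-time behavior described above can be modeled with a small simulation. This is a simplified sketch, not Mixpanel's implementation: the function `query_time_view` and the `ingested_at` field are hypothetical names invented for the example, and the sketch always keeps the latest copy even though Mixpanel does not guarantee that outcome.

```python
def query_time_view(raw_events: list) -> list:
    """Collapse events sharing all four key properties, keeping the latest-ingested copy."""
    latest = {}
    for e in raw_events:
        key = (e["event"], e["distinct_id"], e["time"], e["$insert_id"])
        # ingested_at is a made-up field standing in for API ingestion time.
        if key not in latest or e["ingested_at"] >= latest[key]["ingested_at"]:
            latest[key] = e
    return list(latest.values())


raw = [
    {"event": "Purchase", "distinct_id": "u1", "time": 1700000000,
     "$insert_id": "a1", "plan": "free", "ingested_at": 10},
    {"event": "Purchase", "distinct_id": "u1", "time": 1700000000,
     "$insert_id": "a1", "plan": "pro", "ingested_at": 20},
    {"event": "Purchase", "distinct_id": "u2", "time": 1700000000,
     "$insert_id": "b2", "plan": "free", "ingested_at": 15},
]

view = query_time_view(raw)
print(len(view))  # 2: what a report would count in this model
print(len(raw))   # 3: the raw data still holds every ingested copy
```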
Compaction-Time Deduplication
- When: Runs periodically in the backend, typically after a few hours and again after about 20 days, once data ingestion for a day is complete.
- How: During compaction, Mixpanel scans for events with the same event name, distinct_id, and $insert_id (timestamp does not need to match exactly, just the same calendar day).
- Scope: This process helps reduce the storage of duplicate events and may affect event counts if duplicates were present with different timestamps.
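The difference from the four-property key can be sketched as follows. This is an illustrative model only. It assumes that "same calendar day" is evaluated in UTC and that the first copy encountered is the one kept. The text above specifies neither detail, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def compaction_key(e: dict) -> tuple:
    """Key on event name, distinct_id, $insert_id, and the calendar day of the timestamp."""
    # Assumption: the calendar day is taken in UTC.
    day = datetime.fromtimestamp(e["time"], tz=timezone.utc).date()
    return (e["event"], e["distinct_id"], e["$insert_id"], day)


def compact(events: list) -> list:
    """Keep one event per compaction key (here the first seen, an arbitrary choice)."""
    kept = {}
    for e in events:
        kept.setdefault(compaction_key(e), e)
    return list(kept.values())


base = 1700000000  # 2023-11-14 22:13:20 UTC
events = [
    {"event": "Purchase", "distinct_id": "u1", "$insert_id": "a1", "time": base},
    # Same day, timestamp differs by one minute: collapsed by compaction.
    {"event": "Purchase", "distinct_id": "u1", "$insert_id": "a1", "time": base + 60},
    # Exactly one day later: a different calendar day, so it is kept.
    {"event": "Purchase", "distinct_id": "u1", "$insert_id": "a1", "time": base + 86400},
]

print(len(compact(events)))  # 2
```

In this model the first two events would not be collapsed by a strict four-property comparison because their timestamps differ, which is why compaction may change event counts.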