Experiments: Measure the impact of A/B testing
The Experiment Report is a separately priced product. It is currently only offered to those on the Enterprise Plan. See our pricing page for more details.
Overview
The Experiment report analyzes how one variant impacts your metrics versus other variant(s), helping you decide which variant should be rolled out more broadly. To access Experiments, click on the Experiments tab in the navigation panel, or Create New > Experiment.
Building an Experiment Report
Step 1: Select an Experiment
Click ‘New Experiment’ from the Experiment report menu and select your experiment. Any experiment started in the last 30 days is automatically detected and populated in the dropdown. To analyze an experiment that started more than 30 days ago, hard-code the experiment name.
NOTE: Only experiments tracked via exposure events, i.e., $experiment_started, can be analyzed in the Experiment report. Read more on how to track experiments here.
Step 2: Choose the ‘Control’ Variant
Select the ‘Variant’ that represents your control. All other variants are compared to the control, i.e., how much better they perform vs. the control variant.
Step 3: Choose Success Metrics
Choose the primary metrics of success for the experiment. You can either choose from saved Mixpanel metrics or create a new metric using the query panel. You can also add secondary and guardrail metrics as required.
Step 4: Select the Test Duration
Enter either the sample size (the number of users to be exposed to the experiment) or the minimum number of days you want the experiment to run. This determines the test duration. Once the sample size is reached or the days have elapsed, you can conclusively read the experiment results and make a decision.
Step 5: Confirm other Default Configurations
Mixpanel sets the default configurations below. Modify them as needed for the experiment:
- Experiment Model type: Sequential
- Confidence Threshold: 95%
- Experiment Start Date: Date of the first user exposed to the experiment
Reading Experiment Report Results
The Experiments report identifies significant differences between the Control and Variant groups. Every metric has two key attributes:
- p-value: shows whether the variant’s delta impact vs. the control is statistically significant
- lift: the variant’s delta impact on the metric vs. the control
Metric rows in the table are highlighted when a difference is calculated with high confidence, specifically when the result clears the confidence threshold you set during experiment configuration:
- Positive differences, where the variant value is higher than control, are highlighted in green
- Negative differences, where the variant value is lower than control, are highlighted in red
- Statistically insignificant results remain gray
How do you read statistical significance?
The main reason to look at statistical significance (p-value) is to gain confidence in what the results mean for a broader rollout.
Max significance level (p-value) = (1 − CI) / 2, where CI = confidence threshold
For example, with a 95% confidence threshold, max p = (1 − 0.95) / 2 = 0.025.
So, if an experiment’s results show
- p ≤ 0.025: results are statistically significant for this metric, i.e., you can be 95% confident in the lift seen if the change is rolled out to all users.
- p > 0.025: results are not statistically significant for this metric, i.e., you cannot be very confident in the results if the change is rolled out broadly.
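As a minimal sketch of this threshold check (not Mixpanel’s internal code; the function name and defaults here are illustrative):

```javascript
// Minimal sketch: decide significance from a p-value, given the
// confidence threshold configured in Step 5 (default 95%).
function isSignificant(pValue, confidenceThreshold = 0.95) {
  const maxPValue = (1 - confidenceThreshold) / 2; // e.g. 0.025 for 95%
  return pValue <= maxPValue;
}

console.log(isSignificant(0.01)); // true  -> statistically significant
console.log(isSignificant(0.04)); // false -> not significant
```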
How do you read lift?
Lift is the percentage difference between the control and variant(s) metrics.
Lift, mean, and variance are calculated differently based on the type of metric being analyzed. We categorize metrics into 3 types:
- Numeric - any metrics that involve numeric property math (sum, average, etc)
- Binomial - any metric that has a true or false outcome (unique users, funnel conversions, retention)
- Rate - any metric that can be conceptualized as a rate (funnel conversion rate, total events/experiment, etc)
The ‘group rate’ is calculated differently depending on the type of metric.
- For numeric & binomial metrics: the group rate is the metric value normalized by the number of users exposed in that group.
NOTE: Normalizing by the number of users exposed helps you understand the possible impact on every user exposed to the experiment.
- For rate metrics: the group rate is the same as the metric for the users in the group. Example: if calculating a funnel conversion rate, the group rate is the overall conversion rate of the funnel for users in the group.
NOTE: Conversion rates are already normalized, so no further normalization is done.
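To make the arithmetic concrete, here is an illustrative sketch of the group-rate normalization and lift calculation described above (Mixpanel computes these internally; the function names and numbers are hypothetical):

```javascript
// Group rate for numeric & binomial metrics: normalize the metric
// value by the number of users exposed in the group.
function groupRate(metricValue, usersExposed) {
  return metricValue / usersExposed;
}

// Lift: the percentage difference between the variant and control rates.
function lift(variantRate, controlRate) {
  return ((variantRate - controlRate) / controlRate) * 100;
}

// Example: 1,200 conversions from 10,000 exposed control users vs.
// 1,380 conversions from 10,000 exposed variant users.
const control = groupRate(1200, 10000); // 0.12
const variant = groupRate(1380, 10000); // 0.138
console.log(lift(variant, control));    // ~15 -> a ~15% lift over control
```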
When do we say the Experiment is ready to review?
Once the ‘Test Duration’ set up during configuration is complete, we show a banner that says “Experiment is ready to review”.
Test Duration can be either of two options:
- Sample size to be exposed
- Number of days you’d like to run the experiment
NOTE: If you are using the ‘sequential’ testing experiment model type, you can always peek at the results sooner. Learn more about sequential testing here.
Diagnosing experiments further in regular Mixpanel reports
Click ‘Analyze’ on a metric to dive deeper into the results. This will open a normal Mixpanel insights report for the time range being analyzed with the experiment breakdown applied. This allows you to view users, view replays, or apply additional breakdowns to further analyze the results.
You can also add the experiment breakdowns and filters directly in a report via the Experiments tab in the query builder. This lets you do on-the-fly analysis with the experiment groups. Under the hood, the experiment breakdown and filter works the same as the Experiment report.
Looking under the hood - How does the analysis engine work?
The Experiment report’s behavior is powered by borrowed properties.
For every user event, we identify whether the event was performed after the user was exposed to an experiment. If it was, we borrow the variant details from the tracked $experiment_started event to attribute the event to the proper variant.
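Conceptually, the attribution works like the sketch below (a simplified illustration only; the real engine runs inside Mixpanel’s query layer, and all names here are hypothetical):

```javascript
// Attribute an event to a variant by "borrowing" the variant details
// from the user's most recent prior $experiment_started event.
function attributeEvent(event, exposureEvents) {
  const prior = exposureEvents.filter(
    (e) => e.userId === event.userId && e.time <= event.time
  );
  if (prior.length === 0) return null; // event happened before any exposure
  const latest = prior.sort((a, b) => b.time - a.time)[0];
  return { experiment: latest.experiment, variant: latest.variant };
}
```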
FAQs
If a user switches variants mid-experiment, how do we calculate the impact on metrics?
We break the user and their associated behavior into fractional parts for analysis. The initial behavior counts towards the first variant; once the variant changes, the rest of the behavior counts towards the new variant.
If a user is part of multiple experiments, how do we calculate the impact of a single experiment?
We consider the user’s complete behavior for every experiment they are a part of.
We believe this still gives accurate results for a particular experiment because users are randomly allocated, so there should be enough similar users (i.e., users who are part of multiple experiments) across both the control and variants of a particular experiment.
For what time duration do we associate an exposed user’s behavior with an experiment’s metrics?
After experiment exposure, we consider a user’s behavior ‘exposed’ to the experiment for a maximum of 90 days.
Adding Experiments to an Implementation
Mixpanel experiment analysis works based on exposure events. To use the Experiment report, you must send your exposure events in the following format:
Event Name: “$experiment_started”
Event Properties:
- “Experiment name” - the name of the experiment to which the user has been exposed
- “Variant name” - the name of the variant into which the user was bucketed, for that experiment
An example track call would look like this:
```javascript
mixpanel.track('$experiment_started', {'Experiment name': 'Test', 'Variant name': 'v1'})
```
You can specify the event and properties that should be used as the exposure event, experiment name, and variant in your project settings, in the Overview tab under ‘Experiment Event Settings’. This lets you use an experiment event you’re already tracking, for example via a 3rd-party feature-flagging tool. Note: only string properties should be used for the ‘Name’ and ‘Variant’.
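For example, a custom exposure event from a 3rd-party feature-flagging tool might look like the sketch below (the event and property names are hypothetical; you would map them under ‘Experiment Event Settings’):

```javascript
// A custom exposure event; map 'experiment_key' as the experiment 'Name'
// and 'variant_key' as the 'Variant' in Experiment Event Settings.
mixpanel.track('Experiment Viewed', {
  experiment_key: 'new-onboarding', // must be a string property
  variant_key: 'treatment',         // must be a string property
});
```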
When to track an exposure event?
- An exposure event only needs to be sent the first time a user is exposed to an experiment, as long as the user stays in their initially bucketed variant. Exposure events don’t have to be re-sent in subsequent sessions.
- If a user is part of multiple experiments, send a corresponding exposure event for each experiment.
- Send the exposure event only when a user is actually exposed, not at the start of a session. For example, if you want to run an experiment on the payment page of a ride-sharing app, you only care about users who open the app, book a ride, and reach the payment page. Users who only open the app and do other activities shouldn’t be counted in the sample size, so the exposure event should ideally be tracked only once the payment page is reached (see the sketch after this list).
- Send exposure details, not the assignment. For example, you begin an experiment on 1st Aug, and 1M users are ‘assigned’ to the control and variants. You do not want to send an ‘exposure’ event for all these users right away, as they have only been assigned to the experiment. Some users may be exposed on 4th Aug and others on 8th Aug, so track $experiment_started at the moment of exposure for accurate analysis.
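As an illustration of the ride-sharing example above, here is a sketch of tracking exposure at the payment page rather than at assignment time (the experiment name, variant values, and getAssignedVariant helper are hypothetical):

```javascript
// Track the exposure only when the user actually reaches the payment
// page, not when they are assigned to the experiment.
function onPaymentPageViewed(userId) {
  // getAssignedVariant stands in for your feature-flagging tool's lookup.
  const variant = getAssignedVariant(userId, 'Payment Page Test');
  mixpanel.track('$experiment_started', {
    'Experiment name': 'Payment Page Test',
    'Variant name': variant, // e.g. 'control' or 'v1'
  });
}
```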
Experiment Pricing
The Experiment Report is a separately priced product offered to organizations on the Enterprise Plan. Please contact us for more details.
Pricing Unit
Experimentation is priced based on MEUs - Monthly Experiment Users. Only users exposed to an experiment in a month are counted towards this tally.
FAQ
How are MEUs different than MTUs (Monthly Tracked Users)?
MTUs count any user who tracked an event to the project in the calendar month. MEUs are a subset of MTUs: only users who tracked an exposure event (i.e., $experiment_started) in the calendar month.
How can I estimate MEUs?
If you actively run experiments, you can look at the number of monthly users exposed to an experiment. Note that the MEU calculation differs if users are, on average, exposed to 30 or more experiments in a month.
If you are not running experiments yet, below are some rough estimates of MEUs based on the number of MTUs tracked to the project.
MTU bucket | Estimated MEU (% of MTU)
---|---
Small (< 100k) | 50-100%
Medium (100k - 1M) | 40-75%
Large (1M - 10M) | 25-60%
Very large (10M - 100M) | 20-50%
100M+ | 10-25%
Does it matter how many experiments a user is exposed to within the month?
We’ve accounted for an MEU to be exposed to up to 30 experiments per month. If the average number of experiment exposure events per MEU is over 30, MEUs are calculated as the total number of exposure events divided by 30.
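For example, if 100,000 users each track an average of 60 exposure events in a month (6,000,000 exposure events in total), MEUs = 6,000,000 / 30 = 200,000 rather than 100,000.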
What happens if I go over my purchased MEU bucket?
You can continue using the Mixpanel Experiment report, but you will be charged a higher rate for the overage.
Can I analyze experiments prior to the purchase date?
No. You can only analyze experiments starting from your experimentation purchase date, i.e., your experiment’s start date cannot be earlier than the purchase date.
But I am already paying for exposure events in my regular plan. Am I getting double-charged?
If you buy the Experimentation offering, we waive the charge for exposure events in your regular Mixpanel plan. You only get charged for the exposure events via the MEU calculation.
How can I monitor my account’s MEU consumption?
You can see your experiment MEU usage by going to Organization settings > Plan Details & Billing.
References
Experiment Model Types
- Sequential
Allows you to detect lift and conclude experiments quickly, but may fail to reach significance for very small lifts.
When to use? If you’re looking for large-impact changes, like 10%+ lifts.
- Frequentist
Capable of detecting smaller lifts, but requires you to run the experiment for its full duration. You’re discouraged from preemptively making decisions before the test duration is complete.
When to use? If you’re looking for impact as tiny as 1%, i.e., super low lifts.
Experiment metric types
- Primary Metrics: Main goals you’re trying to improve. These are metrics used to determine if the experiment succeeded. Examples: revenue, conversion rates, ARPU.
- Guardrail Metrics: These are other important metrics that you want to ensure haven’t been negatively affected while focusing on the primary metrics. Examples: CSAT, churn rate.
- Secondary Metrics: These provide a deeper understanding of how users interact with your changes, i.e., they help explain the “why” behind changes in the primary metric. Examples: time spent, number of pages visited, or specific user actions.
Post Experiment Analysis Decision
Once the experiment is ready to review, you can choose to ‘End Analysis’. You can then log a decision, visible to all users, based on the experiment outcome:
- Ship Variant (any of the variants): You had a statistically significant result and decided to ship a variant to all users. NOTE: Shipping a variant here only logs the decision; it does not actually roll out the feature flag unless you are using Mixpanel feature flags (in beta today).
- Ship None: You may not have had any statistically significant results, or, even with statistically significant results, the lift is not sufficient to warrant a change in user experience. You decide not to ship the change.
- Defer Decision: You may have a direction you want to go, but need to sync with other stakeholders before confirming. Defer the decision and come back at a later date to log the final decision.
Experiment Management
You can manage all your experiments via the Experiments Home tab. You can customize which columns you’d like to see.