Technical
7 min read

How to Run Reliable Firebase A/B Tests

Firebase counts users as 'in variant B' when the variant never actually reached their device. Here's the proxy-parameter setup that gives you a cohort you can defend.

Sabri Karagonen

Data & Product


Your Firebase A/B test has been running for two weeks and the dashboard says thousands of users got variant B. Pick a handful of them and walk through their session events. You'll find users whose entire session ran the variant A code path while the user property still says variant B.

Firebase records the assignment in its own logs regardless of whether the variant value ever made it onto the phone.

Where the Variant Gets Lost

When a user opens your app, the SDK asks Google for the latest Remote Config and the server picks a variant. The trip from Google's servers to your user's screen is where things break:

  • The fetch fails because the network is slow
  • The fetch is throttled (default minimum interval is 12 hours)
  • The user closes the app before the response arrives
  • The user is on an old build that doesn't know about the experiment
  • Activation happens after the screen already rendered with cached defaults

None of those stop Firebase from recording the assignment server-side. Your dashboard counts the user, the firebase_exp_5 property gets stamped, and from the analytics side everything looks normal. The variant value just never made it to the phone. The user kept running on the old code path, or never even reached the screen where the variant would matter.

The Firebase A/B Testing dashboard counts assignment. It does not count delivery.

The bias only goes one way. When users labeled variant B are actually running the variant A code path, their data lands in the wrong bucket and pulls both groups' numbers closer together. Lifts that should have shown up never do. You ship "no significant difference" for a test where variant B actually beat control, and you never find out.

How to Fix It

Firebase A/B Testing is still the right tool for most teams. It's free, the SDK is reliable, the audience targeting is flexible, and the integration with GA4 and BigQuery is hard to beat. Unless you're at a scale where you can justify your own experimentation infrastructure, Firebase plus this fix beats every alternative. The bug shows up on every Firebase A/B test you run, but it's a setup problem, not a reason to switch tools.

There's no SQL clever enough to fix this in the warehouse. The fix has to happen earlier, in the Firebase setup and in the client code. Stop trusting Firebase's experiment tracking entirely: the client emits its own custom event whenever a real assignment lands on the device, and the warehouse analysis runs on that event instead of on firebase_exp_5.

1. Create the Experiment and the Proxy Parameter

The order matters here. Do it in this sequence:

a. Create the A/B test in the Firebase console first. Save it as a draft, don't start it yet. You can skip the variants for now, just get the shell created. Firebase will assign the experiment an auto-generated ID, but the console doesn't display it anywhere — you have to grab it out of the address bar. Open the draft and the URL will look like .../experiments/results/42. That trailing number is the ID you use everywhere downstream. Note it down.

b. Create the proxy parameter. In Remote Config, add a Number parameter named exp_<id>, using that auto-generated ID. So if Firebase assigned 42 to your experiment, the parameter is exp_42. Set the default to -1. We deliberately pick a value different from 0, the SDK's built-in default for a long, and the next paragraph explains why.

c. Go back to the experiment and link the parameter. Attach exp_<id> to the experiment and set variant values: 1 for control, 2 for the first variant, 3 for the second, and so on.

Anything below 1 means no real bucket landed on the device. There are two ways that can happen, and the value tells you which: -1 means a fetch went through but the user wasn't bucketed (audience miss, experiment ended, conditions didn't match), and 0 means the SDK never got a successful fetch for this key, so you're seeing the type's built-in default for a long. Both get filtered out at log time, but you can tell them apart in DebugView when you're debugging why a specific user isn't in the cohort.
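
For that kind of debugging, a tiny client-side helper can print the same distinction before you ever open DebugView. This is a sketch, not part of the Firebase SDK; the class and method names are made up.

using Firebase.RemoteConfig;
using UnityEngine;

public static class ExperimentDebug
{
    // Hypothetical helper: classifies a proxy parameter so you can see at a
    // glance why a device is or isn't in the cohort.
    public static void LogProxyState(string key)
    {
        var value = FirebaseRemoteConfig.DefaultInstance.GetValue(key);
        var group = value.LongValue;

        if (value.Source != ValueSource.RemoteValue)
            Debug.Log($"{key}: no fetched value on this device (Source={value.Source}, value={group})");
        else if (group < 1)
            Debug.Log($"{key}: fetch succeeded but user not bucketed (value={group})");
        else
            Debug.Log($"{key}: assigned to group {group}");
    }
}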

Why use Firebase's auto-generated experiment ID instead of inventing your own? Because Firebase already stamps that same ID into firebase_exp_<id> as a user property. When you cross-reference between your abtest_start events and anything else that uses that ID, the numbers line up without a hand-maintained mapping.

Don't point the experiment at paywall_layout or whatever your feature parameter is. Keep them completely separate. The feature parameter gets read all over the place during rendering, which is exactly what causes the timing bugs. The proxy gets read in one place only, and only to tell you the assignment landed.

2. Fire the Event After a Successful Fetch

After FetchAndActivateAsync() completes, walk every Remote Config value. Any key that starts with exp_ is one of your proxy parameters, so emit abtest_start for it. The naming convention does the registration — you don't need to maintain a client-side list of experiment IDs.

// Requires: using Firebase.Analytics; using Firebase.Extensions; using Firebase.RemoteConfig;
var remoteConfig = FirebaseRemoteConfig.DefaultInstance;
remoteConfig.FetchAndActivateAsync().ContinueWithOnMainThread(task => {
    // Bail out unless the fetch completed and new values were activated.
    if (task.IsFaulted || task.IsCanceled || !task.Result) return;

    foreach (var kvp in remoteConfig.AllValues) {
        // Only the proxy parameters, by naming convention.
        if (!kvp.Key.StartsWith("exp_")) continue;
        // Only values that actually came down from the server.
        if (kvp.Value.Source != ValueSource.RemoteValue) continue;

        var group = (int)kvp.Value.LongValue;
        // -1 or 0 means no real bucket landed on the device; skip.
        if (group < 1) continue;

        var expId = int.Parse(kvp.Key.Substring(4));
        FirebaseAnalytics.LogEvent("abtest_start",
            new Parameter("exp_id", expId),
            new Parameter("exp_group", group));
    }
});

The Source check is what makes this whole thing work. Source tells you where the value came from. Three options: StaticValue (the SDK has nothing for this key), DefaultValue (your in-app default is what you got back), or RemoteValue (an actual fetch from Firebase delivered this). We log abtest_start only when Source is RemoteValue, because that's the only case where the bucket assignment actually came down from the server. Drop the check and you go straight back to the original bug — logging users who got an in-app default, not a real assignment.

The group < 1 guard is the safety net in case a remote value comes down but it's actually the parameter's default (e.g., the user is in the audience but not in the experiment). Anything below 1 means "no real bucket," so we skip.

All abtest_start proves is that the assignment landed on the device. It doesn't prove the user saw the treatment, which is what your feature events are for. The cohort you analyze is the overlap of both.

3. Use an Event, Not a User Property

Firebase auto-stamps firebase_exp_<id> as a user property whenever an experiment activates. Don't try to set your own user property to mirror it, and don't lean on the auto-stamped one either.

GA4 caps you at 25 custom user properties total. Every experiment you run burns a slot, and the slots are sticky — clearing old ones takes manual work, so your runway shrinks every quarter.

User properties also bloat your warehouse: every event you send carries the current user-properties payload along with it, which means more bytes scanned on every BigQuery query and a bigger bill at the end of the month.

A custom event is timestamped and immutable. One event name, abtest_start, covers every experiment you'll ever run, parameterized by exp_id. Add ten more experiments next quarter and the schema stays the same.

If your analysis ever needs to know what a user experienced and when, you want events, not user properties.

4. Keep a Registry (Optional)

This one's nice to have, not load-bearing. Set up a Google Sheet or a CSV in your repo with one row per experiment, tracking:

  • exp_id: the integer used in the proxy parameter
  • Name: human-readable, like paywall_v3 or feed_ranking_test
  • Hypothesis: what you expect the variant to do
  • Variants: what 1, 2, 3 actually mean
  • Status: running, ended, paused
  • Start date and end date
  • Link to the ticket, doc, or design

It's the lookup table that maps numeric exp_group values back to what each experiment was about. You can live without it for the first few experiments. Once you're past five or so, you'll wish you had it: querying WHERE exp_id = 17 in BigQuery six months from now without a registry means digging through Slack and Firebase console history to remember what 17 even was.
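
If you do load that sheet into BigQuery, the registry turns the "what was 17?" question into a join. The table name and columns below are assumptions (nothing here is created by Firebase), and the assignments table it joins against is the one built later in this post.

-- Sketch: assumes the registry lives at `your-project.your_dataset.abtest_registry`
-- with columns exp_id, name, variants.
SELECT
  r.name,
  a.exp_group,
  COUNT(*) AS users
FROM `your-project.your_dataset.abtest_assignments` AS a
JOIN `your-project.your_dataset.abtest_registry` AS r
  ON r.exp_id = a.exp_id
WHERE a.exp_id = 17
GROUP BY 1, 2
ORDER BY 2;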

5. The Repeatable Recipe

Every new experiment, every time:

  1. Create the A/B test shell in the Firebase console as a draft, and grab the auto-assigned experiment ID out of the URL
  2. Create an exp_<id> Number parameter in Remote Config with default -1
  3. Go back to the experiment, link the parameter, set variant values (1 for control, 2, 3, ... for variants)
  4. Start the experiment from the Firebase console
  5. (Optional) Add a row to your registry with the ID, hypothesis, and variant meanings

No client deploy is needed per experiment; the exp_ prefix convention takes care of that. Once you've been through the flow once, each new experiment costs maybe ten minutes.

Common Pitfalls

A couple of things bite people the first time through.

The big one is firing abtest_start outside the fetch-completion callback. If your code emits it from Awake, Start, on a timer, or anywhere unrelated to the fetch callback, the values it reads will be stale or default. Fire it only from inside the callback that confirms the fetch succeeded.

The other one is app-version targeting. Set the experiment's audience in the Firebase console to require the minimum app version that actually has the variant rendering code. Otherwise older builds get bucketed into a variant they can't render, and you'll be cleaning those users out at analysis time anyway.

Build the Assignments Table

With abtest_start flowing into your GA4 BigQuery export, the next step is collapsing those events down to one row per user per experiment. That's the table the rest of your analysis joins against.

CREATE OR REPLACE TABLE `your-project.your_dataset.abtest_assignments` AS
WITH
abtest_start_events AS
(
  SELECT
    user_pseudo_id,
    event_timestamp,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'exp_id') AS exp_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'exp_group') AS exp_group
  FROM `your-project.analytics_NNNNNN.events_*`
  WHERE event_name = 'abtest_start'
)
SELECT
  user_pseudo_id,
  exp_id,
  -- Earliest assignment wins: this is the anchor for forward-window analysis.
  TIMESTAMP_MICROS(MIN(event_timestamp)) AS assignment_ts,
  ANY_VALUE(exp_group) AS exp_group,
  -- Flag users whose bucket changed at some point; drop them at analysis time.
  COUNT(DISTINCT exp_group) > 1 AS has_multiple_assignments
FROM abtest_start_events
WHERE exp_group > 0
GROUP BY 1, 2;

When you analyze an A/B test, drop users with has_multiple_assignments = TRUE from the cohort — they got rebucketed at some point and their data is split across variants. But also track what percent of users get the flag: that ratio is a system-health signal. A healthy setup sits near zero, and if it starts creeping up something is rebucketing users it shouldn't.
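
Here's one way to watch that ratio, sketched against the table above; how you slice it and what threshold you alert on are up to you.

-- Rebucketing rate per experiment: the share of assigned users whose bucket
-- changed at least once. A healthy setup sits near zero.
SELECT
  exp_id,
  COUNTIF(has_multiple_assignments) AS rebucketed_users,
  COUNT(*) AS assigned_users,
  ROUND(COUNTIF(has_multiple_assignments) / COUNT(*), 4) AS rebucket_rate
FROM `your-project.your_dataset.abtest_assignments`
GROUP BY exp_id
ORDER BY rebucket_rate DESC;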

assignment_ts is what you anchor forward-window analysis on. Anything that happens before that timestamp doesn't count toward the user's bucket.
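
A minimal sketch of what that anchoring looks like in practice; the purchase event is just a placeholder for whatever conversion you care about, and the full analysis is Part 2's job.

-- Only events at or after a user's assignment count toward their bucket.
SELECT
  a.exp_id,
  a.exp_group,
  COUNT(*) AS post_assignment_purchases
FROM `your-project.your_dataset.abtest_assignments` AS a
JOIN `your-project.analytics_NNNNNN.events_*` AS e
  ON e.user_pseudo_id = a.user_pseudo_id
  AND TIMESTAMP_MICROS(e.event_timestamp) >= a.assignment_ts
WHERE e.event_name = 'purchase'  -- placeholder conversion event
  AND a.has_multiple_assignments = FALSE
GROUP BY 1, 2;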

What You Get

A cohort definition you can defend in a meeting, and an end to the confusing conversation with stakeholders about why the Firebase dashboard and the BigQuery numbers don't agree. Your source of truth shifts to BigQuery, built on abtest_start events.

A More Advanced Setup

The per-fetch abtest_start pattern fires every time a successful fetch lands on the device. For most apps that's fine: the warehouse dedupes by user_pseudo_id and exp_id, and the cost is negligible. For high-frequency apps, the duplicate volume starts to add up.

If you'd rather fire abtest_start once per assignment instead of once per fetch, store the last-seen exp_group per exp_id locally (PlayerPrefs or equivalent) and only emit when the value changes. You can also emit abtest_end when an experiment drops out of the Remote Config response, and abtest_change when the bucket flips mid-flight. That last one is the real-time rebucketing alert the warehouse approach can only flag after the fact.
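
A minimal sketch of that dedupe, assuming Unity's PlayerPrefs as the local store. The key format, the previous_group parameter, and the abtest_change event name are choices of this sketch, not anything Firebase defines. Call it from the same exp_ loop shown earlier in place of the direct LogEvent; abtest_end would be handled separately by diffing the stored keys against the fetched ones.

using Firebase.Analytics;
using UnityEngine;

public static class ExperimentEvents
{
    // Sketch: emit abtest_start only when a bucket is seen for the first time,
    // and abtest_change when it flips after a first assignment.
    public static void LogIfChanged(int expId, int group)
    {
        var prefsKey = $"abtest_group_{expId}";   // hypothetical key format
        var lastGroup = PlayerPrefs.GetInt(prefsKey, -1);

        if (group == lastGroup) return;           // already reported, nothing new

        PlayerPrefs.SetInt(prefsKey, group);
        PlayerPrefs.Save();

        if (lastGroup < 1)
        {
            // First real assignment seen on this device.
            FirebaseAnalytics.LogEvent("abtest_start",
                new Parameter("exp_id", expId),
                new Parameter("exp_group", group));
        }
        else
        {
            // Bucket flipped mid-flight: the real-time rebucketing alert.
            FirebaseAnalytics.LogEvent("abtest_change",
                new Parameter("exp_id", expId),
                new Parameter("exp_group", group),
                new Parameter("previous_group", lastGroup));
        }
    }
}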

Coming in Part 2

Forward-window conversion analysis on the abtest_assignments table, plus the sanity checks that catch broken instrumentation before it ruins three weeks of data.