
Deterministic A/B Test Bucketing

Make A/B test bucketing a pure function of (salt, user_id) so iOS, Android, web, and BigQuery all derive the same variant for the same user, and so you can preview cohort balance in the warehouse before launch.

Sabri Karagonen

Data & Product


Most A/B test platforms flip a coin per user, store the assignment, and read it back forever. The variant for a given user becomes whatever the central record says it is, which means you trust that the write landed, that two devices for the same user didn't race the first call, and that the assignment was retrievable before the session needed it. When any of those slips, the platform's analytics still attribute the session to whatever it eventually decided, so the dashboard shows users in "variant B" whose session actually ran variant A.

Deterministic bucketing replaces the stored assignment with a pure function. Given an experiment salt and a user ID, the variant is hash(salt + "_" + user_id) mod 100, computed on the spot. The salt and the variant sizes still come from a remote config the client fetches at startup, but once those land, every surface that knows the user's ID — iOS, Android, web, backend services, BigQuery queries against your warehouse — derives the same variant without consulting a central store, and without a record that can drift from reality.

That gives you four things a random-and-store platform can't:

  • Reproducibility. The same user always lands in the same bucket. Lose the assignment log, recompute it. Want to re-examine a historical experiment under a different cut of users? Re-run the function over the events table.
  • Cross-platform consistency. The user gets the same variant whether they open the iOS app, the Android app, or the web app, because every client computes from the same inputs.
  • Pre-launch cohort previews. With the assignment as a pure function, you can run it over your existing user table before the experiment ships and check that the cohorts are balanced on the metrics you care about. More on this below.
  • Warehouse parity. Analysts can derive the variant directly in SQL instead of joining to whatever the assignment service logged, so analysis stops depending on a separate system being healthy.

The cost is that you can't change a single user's assignment from a console mid-experiment. The right way to pause or roll back is the same channel that delivers the salt: flip a kill-switch flag in remote config that disables the variant code paths, or push sizes of [100, 0] to send everyone to control. Both take effect on the next config refresh.

The Recipe

The function takes three inputs and returns a variant index:

  • user_id: a stable identifier for the user (account ID, not install ID; more on that below).
  • salt: a string unique to this experiment, typically the experiment ID.
  • sizes: an array of integer percentages that must sum to 100. [50, 50] is a 50/50 split, [33, 33, 34] is three-way, [5, 95] is a 5/95.

It hashes salt + "_" + user_id with SHA-256, takes the first 8 bytes as a big-endian unsigned integer, and reduces it mod 100. That gives you a number from 0 to 99. Then it walks the cumulative sum of sizes and returns the index of the first band the bucket falls under.

If sizes doesn't sum to 100, the function raises, which is the cheapest way to catch the most common bug in this pattern: someone writes [33, 33, 33], the cumulative bands cover 0 to 98, and 1% of users silently end up in no variant at all.

The Function

Same algorithm in three places, so use whichever matches where you're calling it from; the SQL version comes first, with a Python sketch after it. SQL is the simplest and probably the first one you'll reach for when poking around in the warehouse.

The bucket function is hosted at `bruin-fn.us.sha256_bucket`, with eu and other regional copies for non-US datasets (BigQuery requires the function and the data to share a location). It walks the first eight bytes of SHA256(s) with the modular-accumulation pattern (bucket = (bucket * 256 + byte) MOD m) the Python and C# code use, so the value matches bit-for-bit.

DECLARE salt  STRING       DEFAULT 'test_1b';
DECLARE sizes ARRAY<INT64> DEFAULT [50, 50];

-- Validate at query time, before any bucketing runs.
SELECT IF(
  (SELECT SUM(s) FROM UNNEST(sizes) AS s) != 100,
  ERROR(FORMAT('bucket sizes must sum to 100, got %d',
    (SELECT SUM(s) FROM UNNEST(sizes) AS s))),
  NULL);

WITH
bands AS (
  SELECT idx AS variant, SUM(s) OVER (ORDER BY idx) AS upper
  FROM UNNEST(sizes) AS s WITH OFFSET idx
),
bucketed AS (
  SELECT
    user_id,
    `bruin-fn.us.sha256_bucket`(CONCAT(salt, '_', user_id), 100) AS bucket
  FROM `proj.ds.users`
)
SELECT
  b.user_id,
  b.bucket,
  (SELECT MIN(variant) FROM bands WHERE b.bucket < bands.upper) AS variant
FROM bucketed b;

ERROR() aborts the query with a custom message, so the sum-to-100 check can't be skipped.
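
For the client or backend side, here's a minimal Python sketch of the same algorithm. It uses the modular-accumulation pattern described above so the result matches the SQL bit-for-bit, and its assign(user_id, salt, sizes) shape matches the calls later in this post.

import hashlib

def assign(user_id: str, salt: str, sizes: list[int]) -> int:
    """Variant index for (salt, user_id) under the given percentage split."""
    if sum(sizes) != 100:
        raise ValueError(f"bucket sizes must sum to 100, got {sum(sizes)}")
    digest = hashlib.sha256(f"{salt}_{user_id}".encode("utf-8")).digest()
    # First 8 bytes as a big-endian unsigned integer, reduced mod 100,
    # written as modular accumulation so ports match bit-for-bit.
    bucket = 0
    for byte in digest[:8]:
        bucket = (bucket * 256 + byte) % 100
    # Walk the cumulative bands; return the first one the bucket falls under.
    upper = 0
    for variant, size in enumerate(sizes):
        upper += size
        if bucket < upper:
            return variant
    raise AssertionError("unreachable when sizes sum to 100")

Call it as assign('42', 'test_1b', [50, 50]); it returns the same index every time, on every machine.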

Why SHA-256, and What About MD5

For bucketing, MD5 is fine. The "MD5 is broken" warning is about adversarial collision resistance, where someone constructs two inputs that hash to the same digest, which doesn't apply when you're hashing your own user IDs and there's no attacker. The bucket distribution you get out of MD5, SHA-256, FarmHash, MurmurHash, or FNV is uniform to several decimal places when there are only 100 buckets.
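
If you want to see that for yourself, push a million synthetic IDs through the Python assign above and look at the split (the salt is hypothetical):

from collections import Counter

# One million synthetic users through a 50/50 split; any unbiased hash
# puts each arm within a fraction of a percent of 500,000.
counts = Counter(assign(f"user_{i}", "demo_salt", [50, 50]) for i in range(1_000_000))
print(counts)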

I went with SHA-256 in the article for one boring reason: it's in every language's standard library, and a future engineer reading the code in review won't ask "wait, isn't MD5 deprecated?" If the hash actually shows up in a profile somewhere (you'd need to be in a serious hot path), MurmurHash3 is what GrowthBook, PlanOut, and Optimizely use, since it's five to ten times faster, also unbiased, and packaged in every language.

What matters is parity, not cryptographic strength: whatever function you pick, write it once and replicate it identically everywhere it runs.

Splits, Ramps, and Holdouts

The array shape covers every common case:

  • 50/50: assign(uid, salt, [50, 50])
  • 35/65: assign(uid, salt, [35, 65])
  • Three-way: assign(uid, salt, [33, 33, 34]). Someone has to take the spare percent, since [33, 33, 33] sums to 99 and the function refuses.
  • Holdout: assign(uid, salt, [90, 5, 5]). 90% on control, 5% to each of two variants.

Ramping from 5% to 50% over a week is an array edit: [95, 5], then [80, 20], then [50, 50]. Don't change the salt while you're ramping: that re-randomizes everyone, so a user who saw the variant on Monday might not see it on Tuesday, and their conversion data ends up split across both groups. Pick the salt once, change only the array.
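
One detail that makes the two-arm ramp safe: the bands are cumulative, so growing the last band only annexes buckets from control, and nobody who has seen the variant gets pulled back. You can watch it with the Python sketch above (salt and user are hypothetical):

for sizes in ([95, 5], [80, 20], [50, 50]):
    print(sizes, assign("user_42", "checkout_test", sizes))
# A user whose bucket is >= 95 prints variant index 1 at every step;
# ramping up only ever moves control users into the variant.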

Salt Rotation

Same (salt, user_id) always gives the same bucket, forever, and that property has two consequences for how you pick salts.

Between experiments you want a fresh salt per test. Otherwise a user who lands in the high-percentile bucket of one experiment lands in the high-percentile bucket of every other experiment too, since the same salt gives the same user the same bucket everywhere it's used. Use the experiment ID as the salt and you get independence for free.

When you actually want correlation, usually for mutual exclusion (where you don't want the same user to land in the variant of two overlapping tests), reuse the salt on purpose and design the variant ranges so they don't overlap. That's the foundation of layered experimentation, and it's a separate post.
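
As a taste of the shape (hypothetical salt and sizes): one shared salt, disjoint bands, and each test's variant claims its own band.

# Index 0 is test A's variant, 1 is test B's variant, 2 is neither.
# Because both tests share a salt, no user can land in both variants.
arm = assign("user_42", "spring_promos", [10, 10, 80])
in_test_a_variant = arm == 0
in_test_b_variant = arm == 1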

Salt Selection: Catching Pre-Test Bias Before You Ship

This is the part you can't do on top of a server-driven framework, and to me it's the strongest argument for rolling your own.

When the bucket is a deterministic function of the user ID, you can simulate the experiment on your existing users before the test runs. Pick a salt, apply the assignment to your current user table, and compare the cohorts on the metrics you care about: DAU, ARPU, retention, whatever the experiment is supposed to move.

If the variant cohort skews 8% heavier on revenue before the experiment even runs, your power calculation is fiction. Pick a different salt and recheck.

-- Sanity-check a candidate salt against the metrics you care about.
DECLARE candidate_salt STRING DEFAULT 'test_1b_v1';

WITH bucketed AS (
  SELECT
    user_id,
    `bruin-fn.us.sha256_bucket`(CONCAT(candidate_salt, '_', user_id), 100) AS bucket
  FROM `proj.ds.users`
),
labeled AS (
  SELECT u.*, IF(b.bucket < 50, 'control', 'variant') AS arm
  FROM `proj.ds.user_stats` u
  JOIN bucketed b USING (user_id)
)
SELECT
  arm,
  COUNT(*)                  AS users,
  AVG(revenue_30d)          AS avg_revenue_30d,
  AVG(sessions_30d)         AS avg_sessions_30d,
  AVG(CAST(d7_retained AS INT64)) AS d7_retention
FROM labeled
GROUP BY arm
ORDER BY arm;

Run it for two or three candidate salts and pick the one whose cohorts are closest on the metrics that matter for your test. The formal name for this technique is rerandomization, and clinical-trial statisticians have been doing it for decades. With deterministic bucketing, it costs you ten minutes per experiment.
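
In code, the salt search is one short loop. A sketch, where load_user_stats is a hypothetical stand-in for pulling (user_id, revenue_30d) rows out of the warehouse:

def revenue_gap(salt: str, users: list[tuple[str, float]]) -> float:
    """Absolute gap in mean 30-day revenue between the arms of a 50/50 split."""
    arms = {0: [], 1: []}
    for user_id, revenue in users:
        arms[assign(user_id, salt, [50, 50])].append(revenue)
    means = [sum(vals) / len(vals) for vals in arms.values()]
    return abs(means[0] - means[1])

users = load_user_stats()  # hypothetical: [(user_id, revenue_30d), ...]
for candidate in ["test_1b_v1", "test_1b_v2", "test_1b_v3"]:
    print(candidate, revenue_gap(candidate, users))
# Ship with the salt whose arms sit closest together.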

A server-side framework can't do that, since the assignment doesn't exist until the user opens the app and there's no way to preview the cohorts. Deterministic bucketing puts the assignment in your hands ahead of time, for every user you've ever seen.

Same Experience Across Devices

If the user_id you hash is your account ID (not the install ID, not firebase_pseudo_id), the same user gets the same variant on iOS, Android, web, and any future platform you ship. The bucket survives reinstalls and stays the same when the user moves between phone and tablet. Firebase Remote Config tracks per install, so it can't promise either of those.

The catch is that you need an account ID before the bucketing decision. For pre-login flows you fall back to install ID and accept that reinstalls flip the bucket, which is usually fine since pre-login experiments are short and the skew from reinstalls is small.
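
In client code that's one guard before the call. A sketch; the helper and IDs are hypothetical:

def bucketing_id(account_id: str | None, install_id: str) -> str:
    """Prefer the stable account ID; pre-login, fall back to the install ID
    and accept that a reinstall can flip the bucket."""
    return account_id if account_id is not None else install_id

variant = assign(bucketing_id(None, "install_8f3a"), "onboarding_test", [50, 50])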

Where This Stops Working

If you don't have a stable user ID at the point you bucket, this approach doesn't fully work. Anonymous installs can use install ID, but reinstalls re-bucket those users. For experiments that are entirely pre-login and need stickiness across reinstalls, you need a different approach.

You also lose the console-driven audience targeting that hosted platforms give you, where you'd say "iOS 15+, US-only, premium tier" through a UI. Audience conditions move into the client gating that calls assign, or into warehouse-side cohort filters at analysis time.
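
Client gating then looks like an ordinary conditional in front of assign. A sketch; the audience fields are hypothetical:

def maybe_assign(user: dict) -> int | None:
    """Bucket only users inside the audience; everyone else stays out entirely."""
    in_audience = (
        user["os_major"] >= 15
        and user["country"] == "US"
        and user["tier"] == "premium"
    )
    if not in_audience:
        return None  # outside the audience: excluded, not control
    return assign(user["id"], "paywall_test", [50, 50])

print(maybe_assign({"id": "42", "os_major": 16, "country": "US", "tier": "premium"}))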

Client and warehouse have to agree bit-for-bit. A different hash, encoding, or modulo convention will silently desync. Write the function once, document it, and add a parity test that runs the same (salt, user_id, sizes) triple through every implementation and asserts equal results.
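
The parity test itself is small. A sketch, where bucketing_vectors.json is a hypothetical fixture file generated once from the reference implementation and replayed by every port's test suite:

import json

with open("bucketing_vectors.json") as f:
    vectors = json.load(f)  # [{"salt", "user_id", "sizes", "variant"}, ...]

for v in vectors:
    got = assign(v["user_id"], v["salt"], v["sizes"])
    assert got == v["variant"], f"parity break on {v}: got {got}"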

What You Get

You end up with a bucket function that runs anywhere, gives the same answer in every environment, and lets you preview the cohorts before the test starts. For an early-stage app, that's the whole experimentation stack: thirty lines of code per language, plus one column in the events table.