2.2.1 Hypotheses, arms, units, exposure
A good hypothesis is not "treatment will be better than control".
You want:
- Change: What exactly are we modifying?
- Target behavior: What user behavior should move and in what direction?
- Mechanism: Why should this change move that behavior?
If the mechanism is fuzzy, your experiment is a lottery ticket.
Arms and randomization unit
- Control and treatment arms
- Multi‑arm tests are tempting. They also dilute power and increase multiple‑testing headaches.
- Default to 2–3 arms unless you have the traffic and the discipline to handle more.
- Randomization unit
- Users, sessions, devices, accounts, geos, clusters.
- In networked settings (social graph, marketplace), "user‑level" randomization often lies to you. Consider cluster or geo‑level tests to avoid spillovers.
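Whatever unit you pick, assignment should be deterministic so a unit always lands in the same arm. A minimal sketch of hash-based bucketing (the experiment name and IDs here are hypothetical, not from the source):

```python
import hashlib

def assign_arm(unit_id: str, experiment: str,
               arms=("control", "treatment")) -> str:
    """Deterministically assign a randomization unit to an arm.

    Hashing experiment + unit ID gives a stable, uniform bucket:
    the same unit always gets the same arm, and different
    experiments get independent assignments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# User-level: hash the user ID directly.
user_arm = assign_arm("user_12345", "delivery_fee_test")

# Cluster-level: hash the cluster (e.g. a city), so every user in it
# shares an arm and within-cluster spillovers stay inside one arm.
city_arm = assign_arm("city_042", "delivery_fee_test")
```

Switching the unit is then just switching which ID you hash, which is why the cluster vs. user decision belongs in the design doc, not the implementation.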
Exposure / triggering point
You usually choose between:
- Top‑of‑funnel randomization
- Pros: captures full‑funnel and long‑range effects.
- Cons: heavy traffic cost, long time to significance, many never actually hit the feature.
- Bottom‑of‑funnel randomization
- Pros: cleaner read on direct impact, faster, smaller sample.
- Cons: blind to acquisition, discovery, and early‑funnel effects; selection bias risks.
Staff mindset: be explicit about what effects you are deliberately blind to.
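One way to quantify the traffic cost of top-of-funnel randomization: if only a fraction of randomized users ever reach the feature, the intent-to-treat effect is diluted by that fraction, so the sample needed to detect it grows roughly with the inverse square of the trigger rate. A rough sketch of that rule of thumb:

```python
def diluted_sample_multiplier(trigger_rate: float) -> float:
    """Rough sample-size penalty for randomizing above the trigger point.

    If a fraction `trigger_rate` of randomized users actually hit the
    feature, the measured (intent-to-treat) effect shrinks by that
    fraction, and required sample size scales as 1 / effect**2,
    hence roughly 1 / trigger_rate**2.
    """
    return 1.0 / trigger_rate ** 2

# If only 10% of randomized users reach the feature, top-of-funnel
# randomization needs on the order of 100x the sample.
multiplier = diluted_sample_multiplier(0.10)
```

This is an approximation (it ignores variance differences between triggered and untriggered users), but it is usually enough to settle the funnel-placement argument at design review.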
2.2.2 Sample size, MDE, power, duration
You should not approve experiments that "run until the graph looks good".
- Minimum Detectable Effect (MDE) should come from economics and opportunity cost, not aesthetics.
- Significance (α) and power (1−β) are trade‑offs you choose, not magic constants.
- For binary vs. continuous outcomes, use the appropriate formula (two-proportion test vs. two-sample t-test) and build in:
- Expected degraded data (logging bugs, bots, spam).
- Reasonable safety margin.
On duration:
- Cover at least one full weekly cycle unless you have a very good reason not to.
- Watch for holidays, campaigns, and other known shocks.
- For features with learning curves (novelty effects, habit formation), short tests are systematically biased: optimistic or pessimistic depending on the direction of the effect.
If the sample size required is unrealistic, the answer is not "run it anyway". The answer is "this is not worth testing in this form" or "we need a better metric / design".
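The binary-outcome case can be sketched with the standard two-proportion approximation; the function name, default values, and the degraded-data inflation factor here are illustrative choices, not prescribed by the source:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.90,
                        degraded_data: float = 0.05) -> int:
    """Approximate per-arm sample size for a two-proportion z-test.

    p_base: baseline conversion rate; mde_rel: relative MDE
    (e.g. 0.01 for a 1% relative lift). The result is inflated
    to cover an expected fraction of degraded data (bots,
    logging bugs, spam).
    """
    p_treat = p_base * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    n = (z_alpha + z_beta) ** 2 * variance / (p_treat - p_base) ** 2
    return math.ceil(n / (1 - degraded_data))

# 20% baseline conversion, 1% relative MDE, alpha=0.05, power=0.90:
# the answer lands in the high hundreds of thousands per arm.
n = sample_size_per_arm(0.20, 0.01)
```

Note how sharply the number falls as the MDE grows: the same setup with a 5% relative MDE needs roughly 25x fewer users, which is exactly the lever the case study below pulls.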
Case study: Pricing test with impossible MDE
A marketplace team proposes a test to increase delivery fees by $0.50 in a small city.
- They want to detect a 1% change in completed orders with 95% confidence and 90% power.
- Traffic in that city is so low that the required duration works out to 11 months.
You walk them through the math and the economics:
- A 1% movement is below the level that would change any strategic decision.
- Eleven months is far too long to hold a city in a potentially worse state.
Outcome: the team either (a) increases MDE to a meaningful level and runs a shorter, bolder test across multiple cities, or (b) treats this as a business decision and ships with monitoring instead. You avoid a zombie experiment.
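The duration math behind a conversation like this is short. The traffic and sample-size numbers below are hypothetical, chosen only to illustrate how a required sample and low city traffic combine into an 11-month-scale duration:

```python
import math

# Illustrative inputs (not from the source): suppose the power
# calculation demands ~850,000 users per arm, and the city sees
# ~36,000 eligible users per week.
n_per_arm = 850_000
weekly_eligible_users = 36_000

# Two arms, user-level randomization: total sample / weekly traffic,
# rounded up to whole weeks.
weeks = math.ceil(2 * n_per_arm / weekly_eligible_users)
months = weeks / 4.345  # average weeks per month
```

When this arithmetic lands around a year, the design review conversation should shift from "how do we run this?" to "should this be a test at all?".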
How to talk through this in an interview
- Mid‑level answer: Show you can compute or reason about sample size and conclude that the proposed test is under‑powered.
- Senior answer: Tie MDE back to business impact and opportunity cost, then describe how you redesigned the test (bigger MDE, more markets, or no test) to make it useful.
- Staff answer: Elevate to a pattern: you introduced a standard MDE/traffic sanity check into the experimentation process so that low‑leverage, under‑powered tests stop at design review instead of burning real users.