6.1 Multiple testing, SRM, and platform‑level checks
Give smart people an experiment dashboard and no guardrails and they will eventually "discover" a spurious win.
Mitigations:
- Pre‑declare primary metrics and main slices.
- Apply multiple-comparison corrections (Bonferroni, Benjamini–Hochberg FDR) when many decisions hang on many tests.
- Treat endless slicing as hypothesis generation, not proof.
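To make the corrections concrete, here is a minimal pure-Python sketch of both Bonferroni and Benjamini–Hochberg on a hypothetical batch of p-values; the values themselves are made up for illustration.

```python
# Hypothetical p-values from 8 metric/slice combinations in one experiment.
p_values = [0.001, 0.012, 0.034, 0.04, 0.21, 0.38, 0.55, 0.74]
alpha = 0.05
m = len(p_values)

# Bonferroni: control the family-wise error rate by testing each
# p-value against alpha / m. Conservative when m is large.
bonferroni_significant = [p < alpha / m for p in p_values]

# Benjamini-Hochberg: control the false discovery rate. Sort the
# p-values, find the largest rank k with p_(k) <= (k / m) * alpha,
# and reject the k smallest p-values.
ranked = sorted(enumerate(p_values), key=lambda kv: kv[1])
cutoff_rank = 0
for rank, (_, p) in enumerate(ranked, start=1):
    if p <= rank / m * alpha:
        cutoff_rank = rank
bh_significant = [False] * m
for rank, (idx, _) in enumerate(ranked, start=1):
    if rank <= cutoff_rank:
        bh_significant[idx] = True

print("survive Bonferroni:", sum(bonferroni_significant))
print("survive BH at FDR 0.05:", sum(bh_significant))
```

Note how BH admits more "wins" than Bonferroni on the same inputs; that trade-off (power vs. strictness) is exactly the judgment call you should be able to articulate.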
Case study: The "miracle" uplift in one tiny segment
A signup funnel experiment shows no overall lift. A teammate excitedly points out a +12% uplift for "new users on iOS, in one specific country, on Tuesdays".
You:
- Count how many segments were checked in total.
- Show that, given that many looks, a "significant" winner by chance is almost guaranteed.
- Reframe the finding as a hypothesis: "maybe this country is more sensitive to the new flow".
- Propose a follow‑up experiment specifically powered for that country.
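The "almost guaranteed by chance" claim is easy to quantify. The segment counts below are illustrative assumptions, not figures from the case study, and the calculation assumes independent tests for simplicity.

```python
# Illustrative: suppose the team sliced by 3 platforms x 10 countries
# x 7 days-of-week = 210 segments, each tested at alpha = 0.05.
alpha = 0.05
n_segments = 3 * 10 * 7

# Under a global null (no real effect anywhere) and independent tests,
# the chance of at least one spurious "significant" segment is:
p_at_least_one = 1 - (1 - alpha) ** n_segments
print(f"P(at least one spurious winner) = {p_at_least_one:.5f}")
```

With 210 looks, a false "winner" is a near certainty, which is why the +12% Tuesday-iOS finding is a hypothesis, not a result.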
Outcome: you prevent shipping based on noise while still capturing a potentially real signal for future testing.
6.1.1 Sample Ratio Mismatch (SRM)
If your 50/50 split is actually 57/43, that is a bug, not random chance.
Mitigations:
- Automate SRM checks and block decisioning when they fail.
- Debug exposure logic, filters, and eligibility before arguing about lift.
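An automated SRM check is usually just a chi-square goodness-of-fit test on assignment counts. Here is a minimal sketch for a planned 50/50 split, using the 57/43 proportions from above with made-up absolute counts; 3.841 is the chi-square critical value at alpha = 0.05 with one degree of freedom.

```python
# Observed assignment counts (illustrative numbers, 57/43 proportions).
control_n, treatment_n = 57_000, 43_000
total = control_n + treatment_n
expected = total / 2  # planned 50/50 split

# Chi-square goodness-of-fit statistic against the planned split.
chi_sq = sum(
    (observed - expected) ** 2 / expected
    for observed in (control_n, treatment_n)
)

# 3.841 = chi-square critical value, df = 1, alpha = 0.05.
srm_detected = chi_sq > 3.841
print(f"chi-square = {chi_sq:.1f}, SRM detected: {srm_detected}")
```

In practice teams run this with a much stricter alpha (e.g. 0.001) because the check fires on every experiment, every day; the point is that the check is cheap and should gate any discussion of lift.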
Case study: Catching a bucketing bug before it ships
A search ranking experiment shows a huge win: +8% click‑through.
Before celebrating, your SRM dashboard lights up: treatment has only 42% of traffic.
You dig into the logs and find:
- The bucketing system assigns variants before a spam filter runs.
- Many spammy users are more likely to end up in control due to an unrelated rule.
Outcome: you declare the results invalid, fix the exposure bug, and rerun. The true effect is a modest +1.5% lift, still worth shipping—but now everyone trusts the number.
How to talk through this in an interview
- Mid‑level answer: Explain you monitored sample split and invalidated results when SRM appeared.
- Senior answer: Walk through how SRM indicated a deeper bucketing issue, how you debugged it, and how that changed your view of the original "win".
- Staff answer: Emphasize that you built SRM checks into the platform and set a norm that no experiment is interpreted without passing them, raising the overall quality and trust in experimental results.
6.2 When not to experiment (and what to do instead)
An underrated superpower is the ability to say, "We are not running an experiment here, and that is the right call."
Common cases:
- The decision is low stakes and reversible → ship and monitor.
- Traffic is scarce and the feature is speculative → do qualitative research first.
- Randomization is infeasible or unethical → use observational causal tools instead.
- The cost of delaying a decision is higher than the value of clean causality.
Your job is to protect experimentation from being used as a universal hammer. Every A/B test has an opportunity cost.
Case study: Copy tweaks vs. research on a broken onboarding
Activation for a B2B SaaS product is poor. A team wants to run a series of micro‑experiments on button copy and icon styles.
You step back:
- Activation is fundamentally broken; users do not understand the value proposition.
- Traffic is limited; dozens of small tests will burn months.
You recommend:
- Skip experiments on copy tweaks for now.
- Run qualitative user interviews and usability tests to identify core confusion points.
- Make bigger, more principled onboarding changes, then A/B test that when you have a strong hypothesis.
Outcome: the team finds that the real issue is account setup complexity, not wording. A redesigned flow plus clearer value messaging moves activation far more than a year of micro‑tests ever would.
How to talk through this in an interview
- Mid‑level answer: Say you sometimes chose qualitative research over experimentation when traffic was low and the problem was unclear.
- Senior answer: Lay out a clear decision framework (stakes, reversibility, traffic, clarity of hypothesis) and explain how it led you to prioritize research first, then a bigger test.
- Staff answer: Position this as changing how the org thinks: you pushed teams away from "A/B everything" toward a more principled evidence ladder (qual → quasi‑exp → RCT), which improved both speed and impact.
6.3 Turning results into decisions and institutional memory
An experiment with a pretty dashboard but no decision or learning is a failure, regardless of p‑values.
At Staff level you should insist on:
- Clear interpretation
  - "What did we learn about user behavior and the system?" not just "metric up/down".
  - How does this update our beliefs and roadmap?
- Decision and rationale
  - Ship, iterate, kill, or rerun, with reasons written down.
  - Include trade-offs across metrics and stakeholders.
- Post-experiment analysis that goes beyond the headline
  - Segment results to understand heterogeneity, but do not oversell weak signals.
  - Capture hypotheses for the next experiment.
- Documentation that future you will thank you for
  - Link feature, metrics, design choices, and results in one place.
  - Make it searchable so future teams do not re-run the same failed ideas.
Case study: Building an experiment playbook from a failed feature
A new social sharing feature fails to move engagement and slightly harms retention for new users.
Instead of quietly sunsetting it, you:
- Write a short post‑mortem: original hypothesis, design, results, and what surprised you.
- Document that social proof works only for certain cohorts and that extra sharing prompts annoy new users.
- Extract reusable patterns: better guardrails for attention‑grabbing features, standard cohort slices, and messaging guidelines.
Outcome: future teams avoid repeating the same idea in a slightly different form. Your org’s experimentation quality ratchets upward because you turned one failed test into shared knowledge.
How to talk through this in an interview
- Mid‑level answer: Explain that you documented failed experiments so others could learn from them.
- Senior answer: Describe the specific reusable patterns you extracted (e.g., which cohorts like social proof, which guardrails you now monitor) and how that influenced later launches.
- Staff answer: Emphasize that you turned isolated experiments into a playbook and knowledge base, shifting experimentation from isolated events to a compounding learning system for the whole organization.