Running a CRO program for agency clients (not just random A/B tests)
One-off A/B tests are a hobby; a CRO program is a service. The research-hypothesis-test-iterate loop, how to prioritise, the statistical traps, and the traffic threshold below which you shouldn't test at all.
TL;DR
Conversion rate optimisation (CRO) done well isn't a pile of random A/B tests — it's a program: a repeating loop of research → hypothesis → prioritise → test → analyse → iterate that compounds learnings over time. The research comes from quantitative data (GA4) plus qualitative tools; the tests are proper A/B experiments with enough sample size and duration to reach significance — and you must resist the cardinal sins: stopping early, peeking, and testing trivial changes. The hard prerequisite everyone underestimates: you need enough traffic to reach statistical significance, and conversion tracking that's actually correct — you can't measure a lift you can't track. Below: the loop, prioritisation, the stats traps, and when not to test.
CRO is one of the most valuable services an agency can sell — improving the conversion rate multiplies the return on every other channel's spend — and one of the most commonly done badly. "We A/B tested the button colour and it didn't move" is not CRO; it's noise. A real program is a system, and the system is what produces compounding results worth paying for.
The loop
- Research. Find where to focus, with evidence. Quantitative: GA4 funnels, drop-off pages, high-traffic/low-conversion pages. Qualitative: heatmaps and recordings to see the friction. Never start from "I have an idea for the homepage."
- Hypothesis. Frame it testably: "Because [evidence], we believe [change] will [effect] for [audience], measured by [metric]." A hypothesis without evidence is a guess.
- Prioritise. You'll have more ideas than traffic to test them. Use a framework such as ICE (Impact × Confidence × Ease) or PIE (Potential, Importance, Ease) to rank them, so you test the high-leverage hypotheses first instead of whatever's loudest.
- Test. Run a clean A/B experiment (below). One change per test where possible, so you know what caused the result.
- Analyse + iterate. Significant win → ship it and build on it. No effect or a loss → that's a learning, not a failure; feed it back into research. The compounding is in the iteration.
The program is the loop running continuously, with a documented backlog and a record of what you've learned — that's the asset a one-off test never builds.
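As a rough illustration of the prioritise step, here is a minimal sketch of an ICE-ranked backlog. The `Hypothesis` class, the 1-10 scales, and the three backlog entries are all hypothetical; the only thing taken from the loop above is the Impact × Confidence × Ease product.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    impact: int      # 1-10: expected effect on the target metric (assumed scale)
    confidence: int  # 1-10: strength of the supporting evidence (assumed scale)
    ease: int        # 1-10: how cheap it is to build and run (assumed scale)

    @property
    def ice(self) -> int:
        # ICE as a product: Impact x Confidence x Ease
        return self.impact * self.confidence * self.ease


def prioritise(backlog: list[Hypothesis]) -> list[Hypothesis]:
    """Return the backlog sorted from highest to lowest ICE score."""
    return sorted(backlog, key=lambda h: h.ice, reverse=True)


# Hypothetical backlog entries, for illustration only
backlog = [
    Hypothesis("Change CTA button colour", impact=2, confidence=3, ease=10),
    Hypothesis("Cut checkout form from 9 fields to 5", impact=8, confidence=7, ease=5),
    Hypothesis("Rewrite pricing-page offer", impact=9, confidence=5, ease=4),
]

for h in prioritise(backlog):
    print(f"{h.ice:>4}  {h.name}")
```

The scores are subjective, so the value is less in the exact numbers than in forcing every idea through the same three questions before it gets traffic.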
The statistical traps that invalidate results
- Stopping early / peeking. Checking the test daily and calling it the moment it looks significant is the #1 way to ship false wins — early "significance" is usually noise. Decide the sample size and duration up front, and wait for it.
- Too-small sample. Significance needs volume. A test that "won" on 40 conversions per variant proved nothing.
- Too-short duration. Run at least one to two full business cycles (usually 2+ weeks) so weekday/weekend and pay-cycle effects wash out.
- Testing trivia. Button colours rarely move revenue. Test things tied to real friction (the offer, the form, the page structure), because that's where the evidence pointed in the research step.
- Multiple-variant fishing. Testing ten variants at once and crowning the winner is how randomness wins. Be disciplined.
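To make the small-sample and multiple-variant traps concrete, here is a sketch of a pooled two-proportion z-test using only the Python standard library, plus a Bonferroni-adjusted threshold for tests with several variants. The function names and the visitor and conversion counts are hypothetical, and a z-test with a Bonferroni correction is one common choice, not the only valid analysis.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Stricter per-comparison threshold when several variants face one control."""
    return alpha / n_comparisons


# Hypothetical result: 40 vs 52 conversions on 2,000 visitors each,
# which looks like a 30% relative lift on a dashboard.
p = two_proportion_p_value(40, 2000, 52, 2000)
print(f"p-value: {p:.3f}")
print("significant at 0.05:", p < 0.05)

# With ten challengers against one control, each comparison needs p < 0.005
print("per-comparison alpha:", bonferroni_alpha(0.05, 10))
```

With these made-up numbers the p-value comes out around 0.2, so an apparent 30% "lift" is well within what chance alone produces. Note that this test is only valid at the sample size you fixed up front; running it every day and stopping at the first p < 0.05 is exactly the peeking problem.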
When NOT to test (the honest part)
A/B testing has a hard traffic floor. If a page doesn't get enough conversions, you will never reach significance, and any "winner" is a coin flip. For low-traffic clients:
- Don't run A/B tests — you can't power them. Be honest with the client instead of selling theatre.
- Use qualitative + best practices instead: fix the obvious friction that recordings reveal, apply proven UX patterns, and measure the before/after at the account level rather than via an A/B split.
- Pool where valid — sometimes testing at a higher level (a whole funnel, across similar pages) gathers enough volume.
Knowing when testing is inappropriate is part of the expertise; running underpowered tests to look rigorous is the opposite.
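One way to check the traffic floor before promising a test is a standard sample-size estimate for comparing two conversion rates. The sketch below assumes a two-sided test at 5% significance and 80% power; the baseline rate, the target lift, the weekly traffic, and the `MAX_WEEKS` cut-off are hypothetical inputs you would replace with the client's own numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def weeks_to_run(n_per_variant: int, weekly_visitors: int, variants: int = 2) -> int:
    """Whole weeks needed if traffic is split evenly across variants."""
    return ceil(n_per_variant * variants / weekly_visitors)


MAX_WEEKS = 8  # assumption: the longest test this hypothetical client will tolerate

# Hypothetical page: 2% baseline conversion rate, hoping to detect a 20% relative lift
n = sample_size_per_variant(baseline=0.02, relative_lift=0.20)
weeks = weeks_to_run(n, weekly_visitors=3000)
print(f"Need about {n:,} visitors per variant, roughly {weeks} weeks")
print("Test it" if weeks <= MAX_WEEKS else "Don't A/B test this page; use qualitative fixes")
```

With these inputs the requirement is roughly 21,000 visitors per variant, which is months of traffic at 3,000 visitors a week. That is the conversation to have with the client before the test, not after it.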
The foundation: you can't optimise what you can't measure
A CRO program sits entirely on top of correct conversion tracking. If the conversion you're optimising fires inaccurately (double-counted, missing on mobile, or landing in GA4's Unassigned channel group), your test results come from a broken instrument, and you'll ship "wins" that aren't real. Verify the conversion tracking before you run a single test, and keep it in the measurement plan. This is the unglamorous prerequisite that separates a CRO program that compounds from one that generates confident nonsense.
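A simple pre-test audit is to reconcile tracked conversions against a source of truth. The sketch below assumes you can export a transaction ID per tracked conversion from your analytics tool and the set of real order IDs from the client's backend; both exports, the function name, and the sample IDs are hypothetical.

```python
from collections import Counter


def audit_conversions(tracked_ids: list[str], backend_ids: set[str]) -> dict[str, float]:
    """Compare tracked conversion events with backend orders."""
    counts = Counter(tracked_ids)
    # Every event beyond the first for the same ID is a double count
    duplicates = sum(c - 1 for c in counts.values())
    # Real orders that analytics never recorded
    missing = backend_ids - set(counts)
    return {
        "duplicate_rate": duplicates / len(tracked_ids) if tracked_ids else 0.0,
        "missing_rate": len(missing) / len(backend_ids) if backend_ids else 0.0,
    }


# Hypothetical exports: T2 fired twice, T4 never fired
tracked = ["T1", "T2", "T2", "T3", "T5"]
backend = {"T1", "T2", "T3", "T4", "T5"}

report = audit_conversions(tracked, backend)
print(report)
```

If either rate is more than a rounding error, fix the tracking first; segmenting the missing orders by device is a quick way to spot a conversion that fails on mobile.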
Where this fits
CRO is a productisable, high-margin service — but it depends on a research loop, clean tracking, and disciplined statistics, all of which are easy to let slide. Phloz keeps the measurement foundation legible — each client's conversions, events, and qualitative tools modeled and health-checked on the tracking-infrastructure map — so your experiments rest on data you've verified. The CRM for CRO agencies and pricing pages cover the workflow — but the takeaway is the shift from tests to a program: research-driven, prioritised, properly powered, and built on tracking you trust.