carl-gustav.dev

Notes on systems, language, and craft.


The practitioner's case for running marketing experiments

Marketing should work like a lab, not a production line. A minimum viable test framework, what's cheap to test, what's expensive, and why the organizational prerequisite is harder than the statistics.

I have sat in a lot of marketing meetings where a quarter’s plan was being assembled out of last quarter’s attribution. The slide deck looks fine. The numbers add up. Channel A produced this many leads at this cost; channel B produced fewer leads at a slightly worse cost; therefore next quarter we shift more spend to channel A. Decision made, calendar booked, agency briefed.

The thing that nobody says in that meeting is that the attribution model under the slides was never tested against an alternative attribution model. The channel mix was never tested against a different channel mix. The audience definitions were never tested against tighter or looser ones. The creative was never tested against itself. The whole quarter’s plan rests on a tower of plausible-looking inferences from data that was never made to defend itself.

I am not arguing that marketing should be run by statisticians. I am arguing that a marketing function which does not run experiments is a marketing function that cannot tell its winning hypotheses from its expensive ones, and a leadership team that asks for forecasts from such a function is asking for fiction with a confidence interval drawn on by hand.

Why the lab metaphor is the right one

Product teams have spent the last decade internalising the idea that you do not ship a feature without instrumenting it, and you do not call a feature successful until usage data says it is. The same teams will then commission a quarter of paid-media spend on the basis of a vendor’s deck and a gut feel. The bifurcation is strange. The same company that demands a controlled rollout for a button colour will hand over six figures of spend to a channel whose contribution has never been isolated from baseline.

A marketing function that has internalised experimentation does not look like a research lab. It looks like a production team with a small running queue of cheap tests, a calendar that protects them from being interrupted, and a habit of writing the decision rule down before the test starts. The lab metaphor is for the discipline, not for the aesthetic. Nobody is wearing a white coat.

The minimum viable test

A test that is worth running has five parts written down before it begins. Skip any one of them and what you have is a campaign you are hoping will tell you something.

  1. Hypothesis. Not “I think the new landing page will perform better.” That is a wish. “I think the new landing page will convert demo requests at a rate 20 percent higher than the current page because the headline now names the buyer’s role instead of describing the product.” That is a hypothesis. It has a direction, a magnitude, and a stated mechanism.
  2. Surface. Where the test runs. Which page, which audience, which placement. Stated in enough detail that a stranger could find it.
  3. Metric. One primary. At most two guardrail metrics. If the primary metric is leads, the guardrail metrics protect against lead-quality regressions and against cannibalising other parts of the funnel.
  4. Duration. Fixed in advance, from the sample size the test needs, not open-ended. “Until we hit significance” is how teams end up reading the dashboard daily and stopping the test the morning the curve looks good.
  5. Decision rule. Written before the test starts. If the primary metric improves by at least X with the guardrails intact, we ship the variant. Otherwise we keep the control. Written in plain language, not in p-values.

That last item is the one most teams skip. A test without a decision rule is not a test. It is a thing that produces a chart, after which somebody senior decides what the chart means.
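
One way to make the five parts hard to skip is to put them in a record that refuses to exist without them. What follows is a minimal sketch in Python, not a prescribed tool: the class name `ExperimentPlan`, its field names, the example surface and dates, and the 10 percent threshold are all hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class ExperimentPlan:
    """The five parts of a test, fixed before the variant goes live (hypothetical structure)."""
    hypothesis: str                     # direction, magnitude, and mechanism
    surface: str                        # page, audience, placement
    primary_metric: str                 # exactly one
    guardrail_metrics: Tuple[str, ...]  # at most two
    start: date
    end: date                           # set in advance, never "until significance"
    min_relative_lift: float            # the "X" in the decision rule

    def __post_init__(self):
        if len(self.guardrail_metrics) > 2:
            raise ValueError("at most two guardrail metrics")
        if self.end <= self.start:
            raise ValueError("end date must be after start date")

    def decide(self, control_rate, variant_rate, guardrails_intact, today):
        """Apply the pre-written decision rule; refuse to decide before the end date."""
        if today < self.end:
            return "keep running"
        lift = (variant_rate - control_rate) / control_rate
        if lift >= self.min_relative_lift and guardrails_intact:
            return "ship variant"
        return "keep control"


# Illustrative values only.
plan = ExperimentPlan(
    hypothesis="Naming the buyer's role in the headline lifts demo requests by 20 percent",
    surface="demo landing page, paid-search traffic",
    primary_metric="demo_request_rate",
    guardrail_metrics=("lead_quality", "pricing_page_visits"),
    start=date(2025, 3, 3),
    end=date(2025, 4, 28),
    min_relative_lift=0.10,
)
print(plan.decide(0.020, 0.023, guardrails_intact=True, today=date(2025, 4, 28)))
```

The decision comes back in plain language, and it cannot come back at all before the end date, which is the point of writing the rule down first.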

What is cheap to test

Anything that lives on a page, in an ad set, or inside a sequence and can be served to half the audience without the other half noticing. Landing-page copy. Ad creative. Subject lines. CTA placement. Form length. Pricing-page layout. Audience definitions inside paid platforms. Bid strategies inside auction-based channels. Email cadence. Sequencing of touches in a nurture flow.

These tests are cheap because the cost of being wrong is small (a week of suboptimal performance) and the infrastructure to run them is already inside the tools you use. Google Optimize was the free option until Google retired it in September 2023. Optimizely and VWO charge meaningful money, but the kind of money any team running paid acquisition can defend. Inside Meta (Facebook) Ads Manager or Google Ads (formerly AdWords), A/B testing audiences and creatives is a built-in feature that most teams use as a reporting tab rather than as a test framework.

If you are running a small B2B SaaS site with five thousand weekly visitors and a two-percent conversion rate, that is a hundred conversions a week. Detecting a twenty-percent relative effect on that base, with traffic split evenly between control and variant, takes about seven to nine weeks at standard settings (80 percent power, 5 percent significance), with a one-tailed test at the short end of that range and a two-tailed test at the long end. That is a real budget for time, but it is the budget that lets a small team, running tests in parallel on surfaces that do not share traffic, accumulate two to four legitimate tests per quarter, which over a year is enough to substantially reshape what the function does.
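
The arithmetic behind a duration estimate like this can be sketched in a few lines. This is a rough version of what a power calculator does, assuming a two-proportion z-test with the normal approximation and an even traffic split. The function name `sample_size_per_arm` and its defaults (80 percent power, 5 percent significance) are my assumptions for illustration, not a fixed standard.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80, two_tailed=True):
    """Visitors needed in EACH arm to detect the lift (two-proportion z-test, normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z = NormalDist()
    # A two-tailed test splits alpha across both directions; a one-tailed test does not.
    z_alpha = z.inv_cdf(1 - alpha / 2) if two_tailed else z.inv_cdf(1 - alpha)
    z_beta = z.inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


weekly_visitors = 5000  # the example site from the paragraph above
for two_tailed in (True, False):
    n = sample_size_per_arm(0.02, 0.20, two_tailed=two_tailed)
    weeks = 2 * n / weekly_visitors  # two arms share the weekly traffic
    label = "two-tailed" if two_tailed else "one-tailed"
    print(f"{label}: {n:,} visitors per arm, about {weeks:.1f} weeks")
```

Under these assumptions the two-tailed case needs roughly 21,000 visitors per arm. The result is very sensitive to the baseline rate and to the effect size you commit to detecting, which is why the calculation belongs before the test and not after it.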

What is expensive to test

Anything that requires the audience to be exposed to the experiment for months before behaviour changes. Brand campaigns. Long-funnel B2B nurture sequences where the conversion event is six to nine months after the first touch. Anything that involves changing the attribution model itself. Anything that touches the senior end of a buyer’s journey where the sample sizes per cohort are small.

The honest answer for these is that you do not test them in the same way. You run them as longitudinal cohort studies, with the understanding that the inference is weaker. Or you do not test them at all, and you make explicit to leadership that this part of the budget is being spent on faith and accumulated pattern-matching rather than on isolated causal evidence. There is no shame in being explicit about that. There is significant shame in pretending an attribution model has resolved a causal question it has not.

The statistical layer

You need enough to not embarrass yourself. You do not need enough to publish a paper. Specifically:

  • You need to know how big a sample you need before the test starts. A power calculator does this in thirty seconds. Run it.
  • You need to know that early-stopping on a test you have been watching daily inflates the false-positive rate. Stop the test on the date you set, not the morning the curve crosses.
  • You need to know that statistical significance is not the same thing as practical significance. A two-percent lift can clear the significance bar on a large enough sample and still not be worth acting on if the metric swings ten percent week-over-week.
  • You need to know what a one-tailed versus two-tailed test commits you to, and you need to commit before the test runs.

That is the syllabus. There are weekend courses that cover it in four hours. Anybody on a marketing team can learn it without becoming a statistician.
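
The early-stopping point is the one people tend not to believe until they see it. Here is a small simulation sketch, assuming A/A tests (both arms share the same true conversion rate, so every "significant" result is a false positive) and a pooled two-proportion z-test read once a day. The traffic figures, the 10 percent rate, the 14-day window, and the function names are illustrative assumptions.

```python
import random
from math import sqrt


def z_score(conv_a, conv_b, n):
    """Two-proportion z statistic for equal arm sizes n, using the pooled rate."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def simulate(runs=500, days=14, daily_per_arm=200, rate=0.10, z_crit=1.96, seed=1):
    """Return (false-positive rate if you stop at the first crossing, rate if you read once at the end)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        crossed = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(daily_per_arm))
            conv_b += sum(rng.random() < rate for _ in range(daily_per_arm))
            n += daily_per_arm
            if abs(z_score(conv_a, conv_b, n)) > z_crit:
                crossed = True  # a daily peeker would have stopped and declared a winner here
        peeking_hits += crossed
        fixed_hits += abs(z_score(conv_a, conv_b, n)) > z_crit
    return peeking_hits / runs, fixed_hits / runs


peeking_rate, fixed_rate = simulate()
print(f"false-positive rate, stop at first crossing: {peeking_rate:.1%}")
print(f"false-positive rate, read once on end date:  {fixed_rate:.1%}")
```

The single read on the end date should land near the nominal 5 percent, and the stop-at-first-crossing rate should come out well above it. Nothing about the data differs between the two numbers; only the stopping habit does.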

The organisational prerequisite

Half your tests will fail. By which I mean half your hypotheses will turn out to be wrong, the variant will perform worse than control, and the test will have produced exactly one useful signal: do not do that thing.

This is fine in the data. It is rarely fine in the meeting. The marketer who proposed the test will feel like they wasted four weeks. The senior person who approved it will look for a reason to call the test design wrong rather than the hypothesis wrong. The agency partner will quietly stop suggesting tests because the failure rate makes them look bad.

The organisational work — and it is the harder work — is to build a culture in which a test that disproves a hypothesis is a successful test. The team that runs four tests of which three fail has saved itself from doing three things that did not work. The team that runs zero tests is still doing those three things, paying for them, and reporting on them as if they were working.

This requires a leader who can sit through a results review where most of the slides say “no significant effect, recommend rolling back” and treat that as forward motion. If your leadership cannot do that, your experimentation programme will quietly die in its third quarter, regardless of how good the test design is.

A test cadence

Once the framework is in place, the discipline is not about any single test. It is about cadence.

A small team running marketing experiments well has a weekly stand-up where running tests are reviewed against their guardrails (not against their primary metrics, which nobody looks at until the end date). It has a monthly review where completed tests are written up in a one-page format that includes the original hypothesis, the result, the decision, and what the team learned that did not depend on the test outcome. It has a quarterly planning session where the test backlog is groomed against the largest open questions in the marketing function.

This cadence is what separates a team that runs the occasional A/B test from a team that has internalised experimentation. The single test is a tactic. The cadence is the discipline.

Closing

The reason marketing has historically resisted this is not that marketers are unintelligent or unrigorous. It is that the feedback loops in marketing are long and noisy, and the temptation to substitute attribution storytelling for causal evidence is enormous, and the political cost of running a test that disproves a senior person’s pet idea is non-trivial.

Those are real obstacles. They are not arguments against the practice. They are arguments for building the practice carefully, starting with cheap tests, accumulating a body of internal evidence, and earning the right to ask harder questions later.

If you are running a marketing function right now and you do not have a test cadence, the place to start is one cheap test next week, written up the way I described, with the decision rule in writing before the variant goes live. The first one will feel like overhead. The fifth one will feel like the way you work.

What is the most expensive untested assumption in your current plan?

Written by Carl-Gustav Öberg

I'm Carl-Gustav Öberg, founder of Forge Nord. I build AI systems, run infrastructure, and write about what I learn along the way.
