What are we trying to figure out?
When we run an A/B test (like comparing static and interactive emails), we want to know: is one version truly better, or is the difference just random luck?
The setup
Let’s say we test two groups:

| Group | Deliveries | Conversions | Conversion Rate (CVR) |
|---|---|---|---|
| A | 10,057 | 10 | 0.10% |
| B | 10,003 | 8 | 0.08% |
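
As a quick sanity check, here is a minimal Python sketch (the variable names are illustrative only, not from any particular tool) that reproduces the conversion rates in the table:

```python
# Example data from the table above
deliveries_a, conversions_a = 10_057, 10
deliveries_b, conversions_b = 10_003, 8

# Conversion rate = conversions / deliveries (a proportion; shown as % in the table)
cvr_a = conversions_a / deliveries_a
cvr_b = conversions_b / deliveries_b

print(f"CVR A: {cvr_a:.2%}")  # -> 0.10%
print(f"CVR B: {cvr_b:.2%}")  # -> 0.08%
```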
How to answer this question
We use a simple method called a Z-test to calculate a P-value. Here’s the step-by-step:

1. Pooled Conversion Rate
Combine both groups into one average baseline (with $x$ = conversions and $n$ = deliveries):

$$p_{\text{pool}} = \frac{x_A + x_B}{n_A + n_B}$$

For our example:

$$p_{\text{pool}} = \frac{10 + 8}{10{,}057 + 10{,}003} = \frac{18}{20{,}060} \approx 0.000897 \;(\approx 0.09\%)$$

2. Standard Error (SE)
This tells us how much natural randomness we expect between two similar-sized groups:

$$SE = \sqrt{p_{\text{pool}}\,(1 - p_{\text{pool}})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$

For our example:

$$SE = \sqrt{0.000897 \cdot (1 - 0.000897) \cdot \left(\frac{1}{10{,}057} + \frac{1}{10{,}003}\right)} \approx 0.00042$$

3. Z-Score
The Z-score shows how far the difference between the two groups is from zero, measured in standard errors (zero being what we’d expect if there were no real difference). We calculate it by dividing the difference in conversion rates by the standard error, i.e. our expected random variation from sampling:

$$z = \frac{CVR_A - CVR_B}{SE}$$

For our example:

$$z = \frac{0.000994 - 0.000800}{0.00042} \approx 0.46$$

In simple terms, it answers: “How unusual is this difference if A and B were actually performing the same?”
- If the Z-score is close to 0, the difference is small compared to the noise — probably just random chance.
- If the Z-score is 2 or more in absolute value, the difference is much larger than the noise: likely not random, and therefore statistically significant.
For this, we assume a standard normal distribution, i.e. about 68% of all values fall within 1 standard error of zero ($-1 \le z \le +1$) and about 95% within 2 standard errors ($-2 \le z \le +2$).
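
Putting steps 1–3 together, here is a small Python sketch of the same calculation for our example numbers (a minimal illustration, not taken from any particular analytics tool):

```python
import math

# Example data from the table above
deliveries_a, conversions_a = 10_057, 10
deliveries_b, conversions_b = 10_003, 8

cvr_a = conversions_a / deliveries_a
cvr_b = conversions_b / deliveries_b

# Step 1: pooled conversion rate (both groups combined into one baseline)
p_pool = (conversions_a + conversions_b) / (deliveries_a + deliveries_b)

# Step 2: standard error of the difference between the two proportions
se = math.sqrt(p_pool * (1 - p_pool) * (1 / deliveries_a + 1 / deliveries_b))

# Step 3: z-score = observed difference / expected random variation
z = (cvr_a - cvr_b) / se

print(f"Pooled CVR: {p_pool:.4%}")  # ~0.0897%
print(f"SE:         {se:.5f}")      # ~0.00042
print(f"Z-score:    {z:.2f}")       # ~0.46
```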

4. P-Value (Two-Tailed Test)
The p-value tells us how likely we’d see a difference this large (or larger) between A and B if they were actually identical, i.e. if the difference was purely random:

$$p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)$$

where $\Phi(|z|)$ is the cumulative distribution function of the standard normal distribution (the probability that a value is less than $|z|$). For our example:

$$p\text{-value} = 2\,\bigl(1 - \Phi(0.46)\bigr) = 2 \cdot (1 - 0.677) \approx 0.65$$

Put simply, it answers: “If A and B truly perform the same, what are the odds I’d see this difference by chance?”
- If the p-value is high (e.g. 0.65), a difference at least this large would show up around 65% of the time purely by chance, so the result is not statistically significant.
- If the p-value is low (typically below 0.05), a difference this large would occur by chance less than 5% of the time, so the result is statistically significant.
| P-Value | What this tells us |
|---|---|
| < 0.05 | Likely not random. Statistically significant ✅ |
| > 0.05 | Could be just noise. Not significant ❌ |
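
Continuing the sketch above, step 4 only needs the standard normal CDF, which is available in Python’s standard library (the hard-coded z of 0.46 is the value computed in step 3):

```python
from statistics import NormalDist

z = 0.46  # z-score from step 3 above

# Step 4: two-tailed p-value = 2 * (1 - Phi(|z|))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"p-value: {p_value:.2f}")  # ~0.65
print("statistically significant" if p_value < 0.05 else "not significant")  # not significant
```

In practice you rarely hand-roll this; libraries such as statsmodels offer a `proportions_ztest` helper that performs the same pooled two-proportion Z-test.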
What does this mean?
If your test result is significant, you can say: “Yes, B performs better than A, and it’s not just random.”
If it’s not significant, you can only say: “The difference could just be luck; we can’t be sure yet.”
Our p-value of 0.65 is > 0.05, so the difference could easily be just luck; we can’t be sure yet whether version A is really better than version B.
Important notes
- Small numbers (like 10 vs 8 conversions) make it hard to trust the results.
- Try to reach at least 100+ conversions per component before acting.
- Statistical significance ≠ practical significance. A 0.02% uplift might be real… but still not meaningful for your company depending on your overall sales volume.
