What are we trying to figure out?
When we run an A/B test (like comparing static and interactive emails), we want to know: is one version truly better, or is the difference just random luck?
The setup
Let’s say we test two groups:

| Group | Deliveries | Conversions | Conversion Rate (CVR) |
|---|---|---|---|
| A | 10,057 | 10 | 0.10% |
| B | 10,003 | 8 | 0.08% |
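
As a quick sanity check, here is a minimal Python sketch (the variable names are illustrative only, not from any particular tool) that reproduces the conversion rates in the table:

```python
# Example data from the table above
deliveries_a, conversions_a = 10_057, 10
deliveries_b, conversions_b = 10_003, 8

# Conversion rate = conversions / deliveries (a proportion; shown as % in the table)
cvr_a = conversions_a / deliveries_a
cvr_b = conversions_b / deliveries_b

print(f"CVR A: {cvr_a:.2%}")  # -> 0.10%
print(f"CVR B: {cvr_b:.2%}")  # -> 0.08%
```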
How to answer this question
We use a simple method called a Z-test to calculate a P-value. Here’s the step-by-step:

1. Pooled Conversion Rate
Combine both groups into one average baseline (with $x$ = conversions and $n$ = deliveries):

$$p_{\text{pool}} = \frac{x_A + x_B}{n_A + n_B}$$

For our example:

$$p_{\text{pool}} = \frac{10 + 8}{10{,}057 + 10{,}003} = \frac{18}{20{,}060} \approx 0.000897 \;(\approx 0.09\%)$$

2. Standard Error (SE)
This tells us how much natural randomness we expect between two similar-sized groups:

$$SE = \sqrt{p_{\text{pool}}\,(1 - p_{\text{pool}})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$

For our example:

$$SE = \sqrt{0.000897 \cdot (1 - 0.000897) \cdot \left(\frac{1}{10{,}057} + \frac{1}{10{,}003}\right)} \approx 0.00042$$

3. Z-Score
The Z-score shows how far the difference between the two groups is from zero, measured in standard errors (zero being what we’d expect if there were no real difference). We calculate it by dividing the difference in conversion rates by the standard error, i.e. our expected random variation from sampling:

$$z = \frac{CVR_A - CVR_B}{SE}$$

For our example:

$$z = \frac{0.000994 - 0.000800}{0.00042} \approx 0.46$$

In simple terms, it answers: “How unusual is this difference if A and B were actually performing the same?”
- If the Z-score is close to 0, the difference is small compared to the noise — probably just random chance.
- If the Z-score is 2 or more in absolute value, the difference is much larger than the noise: likely not random, and therefore statistically significant.
For this, we assume a standard normal distribution, i.e. about 68% of all values fall within 1 standard error of zero ($-1 \le z \le +1$) and about 95% within 2 standard errors ($-2 \le z \le +2$).
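
Putting steps 1–3 together, here is a small Python sketch of the same calculation for our example numbers (a minimal illustration, not taken from any particular analytics tool):

```python
import math

# Example data from the table above
deliveries_a, conversions_a = 10_057, 10
deliveries_b, conversions_b = 10_003, 8

cvr_a = conversions_a / deliveries_a
cvr_b = conversions_b / deliveries_b

# Step 1: pooled conversion rate (both groups combined into one baseline)
p_pool = (conversions_a + conversions_b) / (deliveries_a + deliveries_b)

# Step 2: standard error of the difference between the two proportions
se = math.sqrt(p_pool * (1 - p_pool) * (1 / deliveries_a + 1 / deliveries_b))

# Step 3: z-score = observed difference / expected random variation
z = (cvr_a - cvr_b) / se

print(f"Pooled CVR: {p_pool:.4%}")  # ~0.0897%
print(f"SE:         {se:.5f}")      # ~0.00042
print(f"Z-score:    {z:.2f}")       # ~0.46
```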

4. P-Value (Two-Tailed Test)
The p-value tells us how likely we’d see a difference this large (or larger) between A and B if they were actually identical, i.e. if the difference was purely random:

$$p\text{-value} = 2\,\bigl(1 - \Phi(|z|)\bigr)$$

where $\Phi(|z|)$ is the cumulative distribution function of the standard normal distribution (the probability that a value is less than $|z|$). For our example:

$$p\text{-value} = 2\,\bigl(1 - \Phi(0.46)\bigr) = 2 \cdot (1 - 0.677) \approx 0.65$$

Put simply, it answers: “If A and B truly perform the same, what are the odds I’d see this difference by chance?”
- If the p-value is high (e.g. 0.65), a difference at least this large would show up around 65% of the time purely by chance, so the result is not statistically significant.
- If the p-value is low (typically below 0.05), a difference this large would occur by chance less than 5% of the time, so the result is statistically significant.
| P-Value | What this tells us |
|---|---|
| < 0.05 | Likely not random. Statistically significant ✅ |
| > 0.05 | Could be just noise. Not significant ❌ |
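
Continuing the sketch above, step 4 only needs the standard normal CDF, which is available in Python’s standard library (the hard-coded z of 0.46 is the value computed in step 3):

```python
from statistics import NormalDist

z = 0.46  # z-score from step 3 above

# Step 4: two-tailed p-value = 2 * (1 - Phi(|z|))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"p-value: {p_value:.2f}")  # ~0.65
print("statistically significant" if p_value < 0.05 else "not significant")  # not significant
```

In practice you rarely hand-roll this; libraries such as statsmodels offer a `proportions_ztest` helper that performs the same pooled two-proportion Z-test.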
What does this mean?
If your test result is significant, you can say: “Yes, B performs better than A, and it’s not just random.”
If it’s not significant, you can only say: “The difference could just be luck; we can’t be sure yet.”
Our p-value of 0.65 is > 0.05, so the difference could easily be just luck; we can’t be sure yet whether version A is really better than version B.
Important notes
- Small numbers (like 10 vs 8 conversions) make it hard to trust the results.
- Try to reach at least 100+ conversions per component before acting.
- Statistical significance ≠ practical significance. A 0.02% uplift might be real… but still not meaningful for your company depending on your overall sales volume.
