What are we trying to figure out?
When we run an A/B test — like comparing static and interactive emails — we want to know: is the difference we see real, or just random chance? That’s where statistical significance comes in.

The setup
Let’s say we test two groups:

| Group | Deliveries | Conversions | Conversion Rate (CVR) |
|---|---|---|---|
| A | 10,057 | 10 | 0.10% |
| B | 10,003 | 8 | 0.08% |
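The numbers in the table can be reproduced with a few lines of Python (variable names here are our own, not from the original):

```python
# Example data from the table above
n_a, c_a = 10_057, 10   # deliveries and conversions for group A
n_b, c_b = 10_003, 8    # deliveries and conversions for group B

cvr_a = c_a / n_a  # ≈ 0.10%
cvr_b = c_b / n_b  # ≈ 0.08%
print(f"CVR A: {cvr_a:.2%}, CVR B: {cvr_b:.2%}")
```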
How to answer this question
We use a simple method called a Z-test to calculate a p-value. Here’s the step-by-step:

1. Pooled Conversion Rate
Combine both groups into one average baseline:

$$p = \frac{c_A + c_B}{n_A + n_B}$$

For our example:

$$p = \frac{10 + 8}{10{,}057 + 10{,}003} = \frac{18}{20{,}060} \approx 0.0897\%$$

2. Standard Error (SE)
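As a quick sketch of this step in Python (numbers taken from the example above):

```python
# Pooled conversion rate: combine both groups into one baseline
n_a, c_a = 10_057, 10
n_b, c_b = 10_003, 8

p_pooled = (c_a + c_b) / (n_a + n_b)
print(f"Pooled CVR: {p_pooled:.4%}")  # ≈ 0.0897%
```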
This tells us how much natural randomness we expect between two similar-sized groups:

$$SE = \sqrt{p(1 - p)\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$

For our example:

$$SE = \sqrt{0.000897 \cdot (1 - 0.000897) \cdot \left(\frac{1}{10{,}057} + \frac{1}{10{,}003}\right)} \approx 0.0423\%$$

3. Z-Score
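The same step in Python, continuing from the pooled rate:

```python
import math

# Standard error of the difference between two proportions,
# using the pooled conversion rate as the baseline
n_a, n_b = 10_057, 10_003
p_pooled = 18 / (n_a + n_b)  # pooled rate from the previous step

se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
print(f"SE: {se:.4%}")  # ≈ 0.0423%
```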
The Z-score shows how far the difference between the two groups is from zero, measured in standard errors (zero being what we’d expect if there were no real difference). We calculate it by dividing the difference in conversion rates by the standard error — our expected random variation from sampling:

$$z = \frac{p_A - p_B}{SE}$$

For our example:

$$z = \frac{0.0994\% - 0.0800\%}{0.0423\%} \approx 0.46$$

In simple terms, it answers: “How unusual is this difference if A and B were actually performing the same?”
- If the Z-score is close to 0, the difference is small compared to the noise — probably just random chance.
- If the Z-score is 2 or more, the difference is much larger than the noise — likely not random, and therefore statistically significant.
For this, we assume a standard normal distribution: about 68% of all values fall between $z=-1$ and $z=+1$, and about 95% between $z=-2$ and $z=+2$.
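Putting the first three steps together in Python:

```python
import math

n_a, c_a = 10_057, 10
n_b, c_b = 10_003, 8

p_pooled = (c_a + c_b) / (n_a + n_b)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))

# Z-score: difference in conversion rates, in units of standard error
z = (c_a / n_a - c_b / n_b) / se
print(f"z: {z:.2f}")  # ≈ 0.46, well below the ~2 needed for significance
```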

4. P-Value (Two-Tailed Test)
The p-value tells us how likely we’d see a difference this large (or larger) between A and B if they were actually identical — that is, if the difference was purely random:

$$p\text{-value} = 2\,(1 - \Phi(|z|))$$

where $\Phi(|z|)$ is the cumulative distribution function of the standard normal distribution (the probability that a value is less than $|z|$). Put simply, it answers: “If A and B performed the same, how often would chance alone produce a gap this big?”

Here’s how to interpret it:

- If the p-value is high (e.g. 0.65), a difference this large would show up often by chance alone — not statistically significant.
- If the p-value is low (typically below 0.05), a difference this large would occur by chance less than 5% of the time — statistically significant.
| P-Value | What this tells us |
|---|---|
| < 0.05 | Likely not random. Statistically significant ✅ |
| > 0.05 | Could be just noise. Not significant ❌ |
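The p-value step can be computed with only the standard library, since the normal CDF can be written in terms of the error function (the helper name `normal_cdf` is our own):

```python
import math

def normal_cdf(x):
    # Standard normal CDF via the error function (no SciPy needed)
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

z = 0.46  # z-score from the previous step
p_value = 2 * (1 - normal_cdf(abs(z)))  # two-tailed test
print(f"p-value: {p_value:.2f}")  # ≈ 0.65
```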
What does this mean?
If your test result is significant, you can say: “The difference is probably real — it’s unlikely to be just luck.” If it’s not significant, you say: “The difference could just be luck — we can’t be sure yet.”
Our p-value of 0.65 is > 0.05, so the difference is very likely just luck — we can’t be sure yet whether version A is better than version B.
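The whole procedure fits in one small function. A minimal sketch (the function name and threshold check are our own, assuming the 0.05 cutoff used above):

```python
import math

def ab_test_p_value(n_a, c_a, n_b, c_b):
    """Two-tailed Z-test p-value for the difference in conversion rates."""
    p = (c_a + c_b) / (n_a + n_b)                      # 1. pooled rate
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # 2. standard error
    z = (c_a / n_a - c_b / n_b) / se                   # 3. z-score
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # 4. p-value

p_value = ab_test_p_value(10_057, 10, 10_003, 8)
print(f"p-value: {p_value:.2f}")  # ≈ 0.65
print("significant" if p_value < 0.05 else "not significant")
```

Running it on our example reproduces the p-value of 0.65, confirming the difference is not significant at the 0.05 level.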
