The Complete Guide to A/B Testing Sample Size (2026)

Learn how to calculate the perfect sample size for statistically significant A/B test results. Includes free calculator, formulas, real examples, and best practices.

Dr. Sarah Mitchell
Statistical Analysis Lead
Feb 10, 2026

Running A/B tests without the right sample size is like flipping a coin twice and declaring the result meaningful. You need enough data to confidently know if your variation actually performs better—or if you're just seeing random noise.

This complete guide will teach you everything you need to know about sample size calculation, from basic concepts to advanced techniques. By the end, you'll be able to confidently determine exactly how many visitors you need for reliable test results.

Try Our Free Sample Size Calculator →

Why Sample Size Matters

The Cost of Getting It Wrong

Too Small: Stop your test too early, and you might:

  • Implement a "winning" variation that actually performs worse
  • Miss detecting a real improvement
  • Waste development time on false positives
  • Make decisions based on random chance, not real data

Too Large: Run your test too long, and you'll:

  • Waste weeks waiting for unnecessary data
  • Delay shipping improvements
  • Incur opportunity costs
  • Lose to faster competitors

The sweet spot? Just enough samples to detect meaningful differences with statistical confidence.

Key Statistical Concepts

Statistical Significance (α)

Statistical significance is your acceptable risk of a false positive—declaring a winner when there isn't one.

Standard: 95% confidence (α = 0.05)

This means:

  • If there is truly no difference, you'll wrongly declare a winner about 5% of the time
  • It does not mean a 95% probability that any observed difference is real
  • Industry standard for most A/B tests

When to adjust:

  • 99% confidence (α = 0.01): Critical changes (checkout flow, pricing)
  • 90% confidence (α = 0.10): Exploratory tests, smaller decisions

Higher confidence requires larger sample sizes but reduces false positives.

Statistical Power (1-β)

Power is your ability to detect a real difference when one exists—avoiding false negatives.

Standard: 80% power (β = 0.20)

This means:

  • 80% chance of detecting a real improvement
  • 20% risk of missing a genuine effect
  • Widely accepted in scientific research

When to increase:

  • 90% power: Important business decisions
  • 95% power: Mission-critical changes

Higher power requires more samples but ensures you don't miss real improvements.

Minimum Detectable Effect (MDE)

MDE is the smallest improvement you want to detect. It's often the most overlooked—and most important—input.

Setting realistic MDE:

  • 10-20% relative improvement: Most common and achievable
  • 5-10%: Requires very large samples
  • 20%+: Easier to detect, might miss smaller wins

Example:

  • Baseline: 2.5% conversion rate
  • 20% relative improvement: 3.0% (2.5% × 1.20)
  • Absolute improvement: 0.5 percentage points
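
The relative-to-absolute conversion above is easy to get backwards, so here is a minimal sketch of the arithmetic. The function name `mde_targets` is made up for illustration.

```python
def mde_targets(baseline: float, relative_mde: float) -> tuple[float, float]:
    """Return (target_rate, absolute_difference) for a relative MDE."""
    target = baseline * (1 + relative_mde)
    return target, target - baseline


# Example from the text: 2.5% baseline, 20% relative improvement.
target, absolute = mde_targets(0.025, 0.20)
print(f"target rate: {target:.2%}, absolute lift: {absolute * 100:.2f} percentage points")
```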

Be realistic. Planning around 50%+ improvements yields small sample sizes that are underpowered for the 5-20% lifts most winning tests actually deliver.

Baseline Conversion Rate

Your current conversion rate determines required sample size. Lower baselines need more samples.

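Why it matters (at 95% confidence and 80% power, using the formula in the next section):

  • 1% → 1.2% (20% improvement): ~38,800 samples per variation
  • 5% → 6% (20% improvement): ~7,450 samples per variation
  • 10% → 12% (20% improvement): ~3,530 samples per variation

At lower conversion rates the same relative lift is a smaller absolute difference, and the noise is larger relative to it, so you need larger samples to detect changes confidently.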
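
To see how the baseline drives the requirement, the sketch below sweeps three baselines at a fixed 20% relative MDE. It assumes the two-proportion formula given in the next section, with z-scores taken from Python's standard library rather than a rounded table, so results may differ slightly from hand calculations. The function name `sample_size` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variation sample size for a two-sided test on conversion rates."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    delta = baseline * relative_mde  # absolute difference to detect
    return ceil((z_alpha + z_beta) ** 2 * 2 * baseline * (1 - baseline) / delta ** 2)


for baseline in (0.01, 0.05, 0.10):
    n = sample_size(baseline, 0.20)
    print(f"{baseline:.0%} baseline -> {n:,} samples per variation")
```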

The Sample Size Formula

Here's the complete formula for calculating sample size:

n = [Z_α/2 + Z_β]² × 2p(1-p) / (p₁ - p₂)²

Where:

  • n = Required sample size per variation
  • Z_α/2 = Z-score for significance level (1.96 for 95% confidence)
  • Z_β = Z-score for power (0.84 for 80% power)
  • p = Baseline conversion rate
  • p₁ - p₂ = Absolute difference between the two conversion rates (the MDE in absolute terms)

Z-Score Table:

Confidence | α    | Z_α/2
90%        | 0.10 | 1.645
95%        | 0.05 | 1.960
99%        | 0.01 | 2.576

Power | β    | Z_β
70%   | 0.30 | 0.524
80%   | 0.20 | 0.842
90%   | 0.10 | 1.282
95%   | 0.05 | 1.645
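
If you need a z-score that is not in the table, it can be computed from the inverse normal CDF. This is a small sketch using Python's standard-library `statistics.NormalDist`. The helper names are illustrative, and the confidence helper assumes a two-sided test.

```python
from statistics import NormalDist


def z_for_confidence(confidence: float) -> float:
    """Two-sided critical value Z_alpha/2 for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def z_for_power(power: float) -> float:
    """Z_beta for a given power (1 - beta)."""
    return NormalDist().inv_cdf(power)


for confidence in (0.90, 0.95, 0.99):
    print(f"confidence {confidence:.0%}: Z = {z_for_confidence(confidence):.3f}")
for power in (0.70, 0.80, 0.90, 0.95):
    print(f"power {power:.0%}: Z = {z_for_power(power):.3f}")
```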

Example Calculation

Inputs:

  • Baseline conversion: 2.5%
  • MDE: 20% relative (target rate 3.0%, an absolute lift of 0.5 percentage points)
  • Confidence: 95% (Z = 1.96)
  • Power: 80% (Z = 0.84)

Calculation:

p = 0.025
p₁ - p₂ = 0.005
n = [1.96 + 0.84]² × 2(0.025)(0.975) / (0.005)²
n = [2.8]² × 0.04875 / 0.000025
n = 7.84 × 0.04875 / 0.000025
n = 15,288 samples per variation

Total needed: 30,576 samples (both variations)
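
The hand calculation above can be checked with a few lines of code. This sketch uses the same rounded z-scores as the example (1.96 and 0.84). The function name is illustrative.

```python
from math import ceil


def sample_size_per_variation(baseline, absolute_mde, z_alpha=1.96, z_beta=0.84):
    """n = (Z_alpha/2 + Z_beta)^2 * 2p(1-p) / (p1 - p2)^2, rounded up."""
    n = (z_alpha + z_beta) ** 2 * 2 * baseline * (1 - baseline) / absolute_mde ** 2
    # Round first so floating-point noise cannot push the ceiling up by one.
    return ceil(round(n, 6))


n = sample_size_per_variation(baseline=0.025, absolute_mde=0.005)
print(f"{n:,} per variation, {2 * n:,} total")
```

With unrounded z-scores the result comes out slightly higher, so small differences between calculators are expected.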

Or just use our calculator and skip the math: Sample Size Calculator →

Step-by-Step: Calculate Your Sample Size

Step 1: Measure Baseline Conversion Rate

Use at least 1 week of historical data:

  • Account for day-of-week patterns
  • Avoid holiday periods (unless testing seasonal changes)
  • Ensure traffic is stable and representative

Example: 250 conversions / 10,000 visitors = 2.5% baseline

Step 2: Set Your Minimum Detectable Effect

Ask yourself: "What improvement would be worth implementing?"

Guidelines:

  • New test hypothesis: Start with 10-20% relative improvement
  • Refinement test: Can detect smaller (5-10%)
  • Revolutionary change: Might see 30%+ improvement

Example: 20% improvement on 2.5% = 3.0% target (0.5pp absolute)

Step 3: Choose Significance Level

Standard: 95% confidence

Adjust based on risk tolerance:

  • Critical path (checkout, pricing): 99%
  • Standard test (copy, design): 95%
  • Exploration (minor tweaks): 90%

Step 4: Set Statistical Power

Standard: 80% power

Increase for important tests:

  • Business-critical decisions: 90%
  • Major investments: 95%

Higher power = more samples but less risk of missing real effects.

Step 5: Account for Multiple Variations

Testing more than 2 variations? Apply Bonferroni correction to maintain overall significance:

Adjusted α = α / number of comparisons

Example: 3 variations (A vs B, A vs C, B vs C = 3 comparisons)

  • Original α: 0.05
  • Adjusted α: 0.05 / 3 = 0.0167
  • Use 98.3% confidence instead of 95%

This increases required sample size but prevents false positives from multiple testing.
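
A minimal sketch of the adjustment, assuming every pair of variations is compared (the k(k-1)/2 rule) and a two-sided test. The function name `bonferroni` is illustrative.

```python
from math import comb
from statistics import NormalDist


def bonferroni(alpha: float, variations: int) -> tuple[int, float, float]:
    """Return (comparisons, adjusted_alpha, z_alpha_over_2) for all pairwise tests."""
    comparisons = comb(variations, 2)  # k(k-1)/2 pairwise comparisons
    adjusted_alpha = alpha / comparisons
    z = NormalDist().inv_cdf(1 - adjusted_alpha / 2)
    return comparisons, adjusted_alpha, z


comparisons, adjusted_alpha, z = bonferroni(alpha=0.05, variations=3)
print(f"{comparisons} comparisons, adjusted alpha {adjusted_alpha:.4f}, Z = {z:.3f}")
```

The returned z-score replaces Z_α/2 in the sample size formula.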

Step 6: Calculate Test Duration

Test Duration = (Sample Size × Variations) / Daily Traffic

Example:

  • Sample size needed: 15,288 per variation
  • Variations: 2
  • Daily traffic: 500 visitors
  • Duration: (15,288 × 2) / 500 ≈ 61.2 days, which rounds up to 63 days (9 full weeks)

Important: Round up and run for complete weeks (14, 21, 28 days) to account for weekly patterns.
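
The duration rule, including the round-up to whole weeks, can be sketched as follows. The function name `duration_days` is illustrative, and it assumes traffic is split evenly across variations.

```python
from math import ceil


def duration_days(sample_per_variation: int, variations: int, daily_traffic: int) -> tuple[int, int]:
    """Return (raw days needed, days rounded up to complete weeks)."""
    raw_days = ceil(sample_per_variation * variations / daily_traffic)
    full_week_days = ceil(raw_days / 7) * 7
    return raw_days, full_week_days


raw, rounded = duration_days(sample_per_variation=15_288, variations=2, daily_traffic=500)
print(f"{raw} days of traffic needed, run for {rounded} days ({rounded // 7} weeks)")
```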

Real-World Examples

Example 1: E-commerce Product Page

Scenario: Testing new product page layout

Inputs:

  • Current conversion: 3.2%
  • MDE: 15% relative improvement (3.68%)
  • Confidence: 95%
  • Power: 80%
  • Variations: 2 (A/B test)
  • Daily traffic: 800 visitors

Results:

  • Sample size: 21,081 per variation
  • Total needed: 42,162 visitors
  • Test duration: 53 days (at 800/day)

Decision: Run test for 8 full weeks (56 days).

Example 2: SaaS Signup Flow

Scenario: Streamlining multi-step signup

Inputs:

  • Current conversion: 8.5%
  • MDE: 12% relative improvement (9.52%)
  • Confidence: 95%
  • Power: 90% (important change)
  • Variations: 2
  • Daily traffic: 300 visitors

Results:

  • Sample size: 15,715 per variation
  • Total needed: 31,430 visitors
  • Test duration: 105 days

Decision: Run for 15 full weeks, or raise the MDE if that is too long to wait.

Example 3: Email Subject Line

Scenario: A/B/C test of 3 subject lines

Inputs:

  • Current open rate: 22%
  • MDE: 10% relative (24.2%)
  • Confidence: 95%
  • Power: 80%
  • Variations: 3 (A/B/C)
  • List size: 50,000

With Bonferroni correction:

  • Adjusted confidence: 98.3% (α = 0.0167, Z_α/2 ≈ 2.394)
  • Sample size: ≈7,430 per variation
  • Total needed: ≈22,300 emails

Decision: Send to a random subset of about 22,500, analyze after 24 hours.

Special Considerations

Multiple Variations (Bonferroni Correction)

When testing 3+ variations, you make multiple comparisons:

  • 3 variations = 3 comparisons (A vs B, A vs C, B vs C)
  • 4 variations = 6 comparisons
  • 5 variations = 10 comparisons

Formula: k(k-1)/2 where k = number of variations

Each comparison increases false positive risk. Bonferroni correction divides your significance level by the number of comparisons.

Impact: Testing 4 variations instead of 2 can increase the required sample size per variation by 50-70%, and that larger sample is needed in twice as many arms.
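
To get a feel for the size of this effect, the sketch below compares the per-variation requirement under Bonferroni correction with a plain two-variation test. It assumes all pairwise comparisons, a two-sided test, and the same baseline and MDE in both cases, so only the z-score term changes. The function name is illustrative.

```python
from math import comb
from statistics import NormalDist


def per_variation_inflation(variations: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Ratio of per-variation sample size versus an uncorrected A/B test."""
    nd = NormalDist()
    z_beta = nd.inv_cdf(power)
    base = (nd.inv_cdf(1 - alpha / 2) + z_beta) ** 2
    adjusted_alpha = alpha / comb(variations, 2)
    corrected = (nd.inv_cdf(1 - adjusted_alpha / 2) + z_beta) ** 2
    return corrected / base


for k in (2, 3, 4, 5):
    print(f"{k} variations: {per_variation_inflation(k):.2f}x samples per variation")
```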

Recommendation: Limit to 2-3 variations per test. Run sequential tests instead of testing everything at once.

Low-Traffic Websites

Problem: Required sample size might mean 6+ month test duration.

Solutions:

1. Increase MDE: Accept detecting larger effects only

  • Instead of 10% improvement, test for 25%
  • Reduces sample size by ~85%

2. Test higher-traffic pages:

  • Homepage instead of specific product page
  • Signup flow instead of settings page

3. Combine traffic sources:

  • Test across multiple similar pages
  • Aggregate data carefully

4. Use sequential testing:

  • Check results at predetermined intervals
  • Stop early if strong signal emerges
  • Requires specialized statistical methods

5. Accept longer test durations:

  • Ensure test runs full weeks (1-2 minimum)
  • Monitor for external changes during test
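
The saving from a larger MDE (solution 1) follows from the formula: with everything else fixed, sample size scales with 1 / MDE². A minimal sketch, with an illustrative function name:

```python
def sample_size_ratio(old_mde: float, new_mde: float) -> float:
    """How the required sample size changes when only the MDE changes."""
    return (old_mde / new_mde) ** 2


ratio = sample_size_ratio(old_mde=0.10, new_mde=0.25)
print(f"new requirement is {ratio:.0%} of the old one ({1 - ratio:.0%} fewer samples)")
```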

Multi-Page Funnels

Testing a funnel with multiple steps requires a different approach:

Problem: Conversion rate decreases at each step

  • Landing page: 10,000 visitors
  • Step 2: 3,000 (30% conversion)
  • Step 3: 1,200 (40% of step 2)
  • Purchase: 300 (25% of step 3)
  • Overall: 3% conversion

Solution: Calculate based on the final conversion rate (3%), not individual steps.

Because the sample size counts visitors entering the funnel, a low end-to-end rate means you need far more top-of-funnel traffic than any single step's conversion rate would suggest.
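
Here is a sketch of the funnel arithmetic using the step rates from the example. The 20% relative MDE is an assumption added for illustration, and the z-scores are the rounded 95% / 80% values used earlier in this guide.

```python
from math import ceil, prod

step_rates = [0.30, 0.40, 0.25]   # landing -> step 2 -> step 3 -> purchase
overall_rate = prod(step_rates)   # end-to-end conversion rate

# Visitors remaining after each step, starting from 10,000 landings.
counts = [10_000]
for rate in step_rates:
    counts.append(round(counts[-1] * rate))

relative_mde = 0.20               # assumed for illustration
delta = overall_rate * relative_mde
n = ceil((1.96 + 0.84) ** 2 * 2 * overall_rate * (1 - overall_rate) / delta ** 2)

print(f"overall conversion: {overall_rate:.1%}, funnel counts: {counts}")
print(f"visitors entering the funnel per variation: {n:,}")
```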

Mobile vs. Desktop

Different devices often have different conversion rates and user behavior:

Approach 1: Segment analysis

  • Run test on all traffic
  • Analyze mobile and desktop separately post-test
  • Requires larger sample size

Approach 2: Separate tests

  • Run mobile-specific and desktop-specific tests
  • Allows device-optimized variations
  • Doubles traffic requirements

Recommendation: Start with a combined test and segment the analysis afterward. Run separate tests only if device behavior differs dramatically.

Common Mistakes to Avoid

1. The Peeking Problem

Mistake: Checking results daily and stopping when p-value < 0.05

Why it's wrong: The more you check, the higher your false positive rate. You'll eventually see p < 0.05 by random chance.

Fix:

  • Commit to sample size upfront
  • Check results only at predetermined milestones
  • Use sequential testing methods if you must peek
  • Wait for full sample size before declaring winner
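
The inflation from peeking is easy to demonstrate with a simulation. The sketch below runs A/A tests (both arms share the same true rate, so every "significant" result is a false positive) and compares stopping at the first look with p < 0.05 against checking only once at the end. All parameters (number of looks, traffic per look, 5% conversion rate) are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate(n_tests=400, looks=10, visitors_per_look=200, rate=0.05, alpha=0.05, seed=1):
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _test in range(n_tests):
        conv_a = conv_b = n = 0
        significant_at_any_look = False
        for _look in range(looks):
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                significant_at_any_look = True
        peeking_hits += significant_at_any_look   # would have stopped at some look
        final_hits += p < alpha                   # only the final look counts
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, fixed_rate = simulate()
print(f"false positives with peeking: {peeking_rate:.1%}, with one final check: {fixed_rate:.1%}")
```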

2. Ignoring Weekly Patterns

Mistake: Running test for exactly 10 days or stopping mid-week

Why it's wrong:

  • Weekend traffic behaves differently than weekday
  • Your test might catch 2 Saturdays but only 1 Friday
  • Creates sampling bias

Fix:

  • Always run tests in complete weeks (7, 14, 21, 28 days minimum)
  • Start tests on the same day of week you plan to end them
  • Minimum 1-2 full weeks even if you hit sample size sooner

3. Testing Too Many Variations

Mistake: Running A/B/C/D/E test with 5 variations

Why it's wrong:

  • Requires Bonferroni correction
  • Massively increases sample size (often 3-4x)
  • Splits traffic too thinly
  • Each variation gets fewer samples

Fix:

  • Limit to 2-3 variations maximum
  • Run sequential tests instead
  • Focus on testing one hypothesis at a time
  • Use multivariate testing only when necessary and with huge traffic

4. Setting Unrealistic MDE

Mistake: Expecting 50%+ improvements as minimum

Why it's wrong:

  • Makes tests too easy to "pass"
  • Requires tiny sample sizes
  • Might miss smaller but still valuable 10-15% wins

Fix:

  • Be realistic: most winning tests improve 5-20%
  • Set MDE based on business value, not wishful thinking
  • Consider: would a 10% improvement be worth implementing?

5. Stopping Tests Early

Mistake: Stopping at 80% of required sample because results look good

Why it's wrong:

  • Underpowered test
  • Higher false positive rate
  • Results likely to regress to mean

Fix:

  • Commit to full sample size before starting
  • Set calendar reminders for proper end date
  • If you must end early, acknowledge increased error risk
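
To quantify the loss of power from a short sample, the sample size formula can be rearranged to give the power achieved at a given n. This is a sketch under the same normal approximation. The function name is illustrative, and the inputs are the 2.5% baseline example from earlier.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(n: int, baseline: float, absolute_mde: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test with n samples per variation."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    standard_error = sqrt(2 * baseline * (1 - baseline) / n)
    return nd.cdf(absolute_mde / standard_error - z_alpha)


full = achieved_power(15_288, baseline=0.025, absolute_mde=0.005)
early = achieved_power(int(0.8 * 15_288), baseline=0.025, absolute_mde=0.005)
print(f"power at full sample: {full:.0%}, at 80% of the sample: {early:.0%}")
```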

Sample Size Benchmarks by Industry

Based on analysis of thousands of A/B tests:

E-commerce

Metric              | Typical Range | Required Sample | Test Duration*
Product page CR     | 2-5%          | 10,000-20,000   | 14-30 days
Add to cart rate    | 10-20%        | 3,000-6,000     | 7-14 days
Checkout completion | 40-60%        | 1,000-2,000     | 3-7 days

*Assuming 1,000 visitors/day

SaaS

Metric            | Typical Range | Required Sample | Test Duration*
Signup conversion | 5-15%         | 3,000-8,000     | 10-20 days
Free to paid      | 2-5%          | 10,000-20,000   | 30-60 days
Feature adoption  | 20-40%        | 1,500-3,000     | 5-10 days

*Assuming 500 visitors/day

B2B/Lead Gen

Metric           | Typical Range | Required Sample | Test Duration*
Form submission  | 3-8%          | 5,000-15,000    | 20-40 days
Demo request     | 1-3%          | 15,000-40,000   | 60-120 days
Content download | 10-25%        | 2,000-5,000     | 7-15 days

*Assuming 300 visitors/day

Media/Publishing

Metric             | Typical Range | Required Sample | Test Duration*
Click-through rate | 5-15%         | 3,000-8,000     | 2-5 days
Email signup       | 2-6%          | 8,000-20,000    | 5-12 days
Video completion   | 30-60%        | 1,000-2,000     | 1-3 days

*Assuming 2,500 visitors/day

Note: These are starting points. Your actual requirements depend on your specific MDE, confidence, and power settings.

Advanced Topics

Sequential Testing (SPRT)

What it is: Statistical method allowing you to peek at results and stop early while maintaining proper error rates.

How it works:

  • Set error boundaries that tighten over time
  • Can stop when result crosses boundary
  • Maintains alpha and beta levels

Benefits:

  • Can reduce test duration by 20-50%
  • Safe to monitor continuously
  • Faster decisions on clear winners/losers

Tradeoffs:

  • More complex to implement
  • May run longer if results are ambiguous
  • Requires specialized tools

Tools: Optimizely, VWO, and custom implementations
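
For intuition, here is a sketch of the classic Wald SPRT on a single stream of conversions, testing a baseline rate p0 against a target rate p1. It is a simplified one-sample illustration, not the method any particular platform implements, and the rates and error levels are assumptions taken from this guide's running example.

```python
from math import log


def sprt(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for Bernoulli data. Returns (decision, observations used)."""
    upper = log((1 - beta) / alpha)   # cross this: evidence favors p1
    lower = log(beta / (1 - alpha))   # cross this: evidence favors p0
    llr = 0.0                         # running log-likelihood ratio
    used = 0
    for used, converted in enumerate(observations, start=1):
        llr += log(p1 / p0) if converted else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", used
        if llr <= lower:
            return "accept_h0", used
    return "continue", used


# Toy inputs: a run of conversions, a long run of non-conversions, and too little data.
print(sprt([1] * 50, p0=0.025, p1=0.03))
print(sprt([0] * 1000, p0=0.025, p1=0.03))
print(sprt([0] * 10, p0=0.025, p1=0.03))
```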

Bayesian A/B Testing

Difference from frequentist: Instead of "Is there a difference?", asks "What's the probability variation B is better?"

Benefits:

  • More intuitive interpretation
  • Can incorporate prior knowledge
  • Continuous monitoring fits the framework more naturally, though your stopping rule still affects how often you are wrong
  • Direct probability statements

Tradeoffs:

  • Requires setting priors
  • Can be seen as subjective
  • Less standardized than frequentist

Sample sizes: Generally similar to frequentist, sometimes smaller with strong priors.
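
A minimal sketch of the "probability B is better" calculation, assuming a uniform Beta(1, 1) prior on each conversion rate and Monte Carlo sampling from the posteriors. The counts are hypothetical, and the function name is illustrative.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Estimate P(rate_B > rate_A) from Beta posteriors with a uniform prior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical results: 250/10,000 conversions for A, 300/10,000 for B.
probability = prob_b_beats_a(250, 10_000, 300, 10_000)
print(f"P(B is better than A) is roughly {probability:.1%}")
```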

Multi-Armed Bandit

What it is: Adaptive testing that automatically shifts traffic to better-performing variations.

How it works:

  • Starts with even split
  • Gradually allocates more traffic to winners
  • Minimizes exposure to losers

When to use:

  • Testing many variations (3+)
  • Cost of showing losing variation is high
  • Traffic is very high
  • Willing to sacrifice some statistical rigor for practical gains

Not recommended for:

  • Low traffic sites
  • Testing two variations
  • When you need definitive answers
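
One common way to implement this kind of adaptive allocation is Thompson sampling, sketched below. It is only one of several bandit algorithms, and the true conversion rates here are simulated values chosen for illustration.

```python
import random


def thompson_sampling(true_rates, visitors=5_000, seed=0):
    """Simulate Thompson sampling. Returns the number of visitors sent to each arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(visitors):
        # Draw a plausible rate for each arm from its Beta posterior, pick the best draw.
        samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


pulls = thompson_sampling(true_rates=[0.04, 0.12])
print(f"visitors per arm: {pulls}")
```

Early on, both arms receive similar traffic. As evidence accumulates, most visitors go to the stronger arm, which is why the weaker arm ends up with too little data for a precise estimate.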

Tools & Resources

Sample Size Calculators

Our recommendation: WMMW Sample Size Calculator

  • Clean interface
  • All standard options
  • Power analysis charts
  • Test duration estimates
  • Multiple variation support

Alternative options:

  • Evan Miller's calculator (simple, accurate)
  • Optimizely's calculator (built-in if using their platform)
  • Google Optimize calculator (deprecated but still referenced)

A/B Testing Platforms

Enterprise:

  • Optimizely: Full-featured, expensive, great for large teams
  • VWO: Mid-market, good features, reasonable pricing
  • Adobe Target: Enterprise-only, deep Analytics integration

Mid-Market:

  • Convert: Privacy-focused, good for GDPR compliance
  • AB Tasty: French company, strong in Europe
  • Kameleoon: AI-powered, advanced segmentation

Budget/Small Teams:

  • Google Optimize (discontinued)
  • Microsoft Clarity + custom implementation
  • Open source: GrowthBook, Unleash

Statistical Resources

Books:

  • "Trustworthy Online Controlled Experiments" by Kohavi, Tang, Xu
  • "A/B Testing: The Most Powerful Way to Turn Clicks Into Customers" by Siroker & Koomen

Online Courses:

  • Udacity: A/B Testing by Google
  • CXL: Advanced A/B Testing & Experimentation

Conclusion

Sample size calculation isn't guesswork—it's statistics. And getting it right is the difference between making data-driven decisions and random ones.

Key Takeaways:

  1. Don't wing it: Calculate required sample size before starting any test
  2. Be realistic: Set achievable MDEs (10-20% improvements)
  3. Commit to the number: Don't stop early or peek continuously
  4. Account for patterns: Run tests in complete weeks (minimum 1-2)
  5. Use tools: Let calculators do the math for you

Ready to calculate your sample size?

Use Our Free Sample Size Calculator →

It takes 30 seconds and ensures your next test produces reliable, actionable results.

Stop guessing. Start testing with confidence.


Questions? Drop a comment below or contact our analytics team.
