A/B Testing Exercises in R: 20 Real-World Practice Problems
Twenty hands-on A/B testing exercises in R covering proportion tests, t-tests, power, sample size, lift estimation, multiple comparisons, sequential peeking, Sample Ratio Mismatch, and A/A diagnostics. Solutions are hidden behind reveal blocks so you can try each problem first.
Section 1. Conversion rates and the two-proportion test (4 problems)
Exercise 1.1: Compute control and treatment conversion rates from raw assignments
Task: Given an inline tibble experiment of 10 user assignments and binary conversion flags (0 or 1), compute the conversion rate per variant. Save a tibble with columns variant, users, conversions, cvr to ex_1_1 and verify both variants have the expected user counts before computing rates.
Expected result:
#> # A tibble: 2 x 4
#> variant users conversions cvr
#> <chr> <int> <dbl> <dbl>
#> 1 control 5 1 0.2
#> 2 treatment 5 2 0.4
Difficulty: Beginner
Each variant needs three numbers - how many users, how many converted, and the share that converted - so the ten rows have to collapse down to one row per variant.
Group by variant, then summarise with n() for users, sum(converted) for conversions, and mean(converted) for the rate.
Click to reveal solution
Explanation: Aggregating by variant with n(), sum(), and mean() gives the three numbers every A/B report starts with: sample size, conversion count, and conversion rate. The mean() shortcut works because converted is coded 0/1, so the average equals the proportion. Always print user counts alongside rates so a stakeholder can sanity-check sample sizes before reading the rate.
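A minimal sketch of the aggregation described above. The ten rows below are hypothetical data chosen to agree with the expected result (the exercise's original inline tibble is not reproduced here), and the sketch assumes the dplyr and tibble packages are installed.

```r
library(dplyr)
library(tibble)

# Hypothetical assignments: 5 users per arm, 1 control and 2 treatment conversions
experiment <- tibble(
  variant   = rep(c("control", "treatment"), each = 5),
  converted = c(0, 1, 0, 0, 0,  1, 0, 1, 0, 0)
)

ex_1_1 <- experiment |>
  group_by(variant) |>
  summarise(
    users       = n(),             # sample size per arm
    conversions = sum(converted),  # count of 1s
    cvr         = mean(converted)  # mean of a 0/1 flag is the proportion
  )

# Verify arm sizes before reading the rates
stopifnot(all(ex_1_1$users == 5L))
ex_1_1
```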
Exercise 1.2: Run a two-proportion test on a landing-page experiment
Task: The growth team at a SaaS company ran a landing-page test: 4,800 of 50,000 control visitors converted and 5,250 of 50,000 treatment visitors converted. Use prop.test() with correct = FALSE (classic z-test) to test whether the conversion rates differ at the 5% level. Save the full htest object to ex_1_2 and report the p-value.
Expected result:
#>
#> 2-sample test for equality of proportions without continuity
#> correction
#>
#> data: c(4800, 5250) out of c(50000, 50000)
#> X-squared = 22.401, df = 1, p-value = 2.213e-06
#> alternative hypothesis: two.sided
#> 95 percent confidence interval:
#> -0.012726602 -0.005273398
#> sample estimates:
#> prop 1 prop 2
#> 0.096 0.105
Difficulty: Beginner
A binary converted/not-converted outcome compared across two groups calls for a test on proportions rather than means.
Pass x = c(4800, 5250) and n = c(50000, 50000) and set correct = FALSE to get the classic z-test.
Click to reveal solution
Explanation: prop.test() is the default tool for binary outcomes: it pools the proportions under the null and computes a chi-square statistic equivalent to a two-sided z-test. Setting correct = FALSE matches the textbook z-test; the default Yates continuity correction is conservative and rarely matters for the large samples typical of online experiments. With a p-value of about 2e-06 and a CI that excludes zero, the difference between the arms is statistically detectable.
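The equivalence between the chi-square statistic and the pooled z-test can be checked by hand. The sketch below is one way to do it in base R; the helper names p_hat, p_pool, and z are illustrative and not part of the exercise.

```r
x <- c(4800, 5250)    # conversions: control, treatment
n <- c(50000, 50000)  # visitors per arm

ex_1_2 <- prop.test(x = x, n = n, correct = FALSE)

# Pooled two-proportion z statistic computed manually
p_hat  <- x / n
p_pool <- sum(x) / sum(n)
z <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# The chi-square statistic should equal z squared
c(chi_sq = unname(ex_1_2$statistic), z_squared = z^2)
ex_1_2$p.value
```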
Exercise 1.3: Chi-square test on a 2-by-2 outcome table
Task: Build a 2-by-2 contingency table of variant (control, treatment) versus outcome (converted, not_converted) using the counts 4800/45200 and 5250/44750. Pass the matrix to chisq.test() and save the htest object to ex_1_3. Confirm the test statistic matches the prop.test from Exercise 1.2.
Expected result:
#>
#> Pearson's Chi-squared test
#>
#> data: m
#> X-squared = 22.401, df = 1, p-value = 2.213e-06
Difficulty: Intermediate
Counts arranged as a two-row, two-column grid let you test whether variant and outcome are associated.
Build the grid with matrix(..., nrow = 2), then pass it to chisq.test() with correct = FALSE so the statistic matches Exercise 1.2.
Click to reveal solution
Explanation: A 2-by-2 chi-square on counts is algebraically identical to the two-proportion z-test in Exercise 1.2: same statistic (22.401), same p-value, same conclusion. Use the matrix form when you already have counts in a cross-tab (e.g., from xtabs() or count() |> pivot_wider()); use prop.test() when you have raw numerators and denominators. The trap to avoid: forgetting correct = FALSE if you want to match the z-test exactly.
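A sketch of the cross-tab route, with the comparison against prop.test() made explicit. The dimnames are illustrative labels; matrix() fills column by column, so the first two numbers form the converted column.

```r
m <- matrix(
  c(4800, 5250,     # converted:     control, treatment
    45200, 44750),  # not converted: control, treatment
  nrow = 2,
  dimnames = list(
    variant = c("control", "treatment"),
    outcome = c("converted", "not_converted")
  )
)

ex_1_3 <- chisq.test(m, correct = FALSE)

# Same data expressed as successes out of row totals
ref <- prop.test(x = m[, "converted"], n = rowSums(m), correct = FALSE)

all.equal(unname(ex_1_3$statistic), unname(ref$statistic))
```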
Exercise 1.4: One-sided test for a directional product claim
Task: A product manager wants a one-sided 95% test of whether the treatment (5,250 of 50,000) is higher than control (4,800 of 50,000): the redesign was launched specifically to lift signups and a non-inferiority result is not actionable. Run prop.test() with alternative = "greater" and save to ex_1_4. Report whether the PM can claim treatment is better.
Expected result:
#>
#> 2-sample test for equality of proportions without continuity
#> correction
#>
#> data: c(5250, 4800) out of c(50000, 50000)
#> X-squared = 22.401, df = 1, p-value = 1.107e-06
#> alternative hypothesis: greater
#> 95 percent confidence interval:
#> 0.005872537 1.000000000
#> sample estimates:
#> prop 1 prop 2
#> 0.105 0.096
Difficulty: Intermediate
When only an improvement in one direction is actionable, the test should look in just that single direction.
Call prop.test() with alternative = "greater" and list the treatment counts first in x and n.
Click to reveal solution
Explanation: The one-sided p-value is exactly half the two-sided p-value when the direction matches the data, so about 2.2e-06 becomes about 1.1e-06. The PM can confidently say the treatment is higher (p well below 0.05). Two cautions: the directional choice must be pre-registered before peeking, otherwise running both directions and picking the smaller p-value silently doubles your false-positive rate. Also note that prop.test() compares prop 1 against prop 2 in the order you pass x and n, so list the treatment first when asking for greater.
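The halving rule and the argument-order caution can both be seen in a few lines. This is a sketch; the object names two_sided and wrong_order are illustrative.

```r
x <- c(5250, 4800)    # treatment first, then control
n <- c(50000, 50000)

ex_1_4    <- prop.test(x, n, alternative = "greater", correct = FALSE)
two_sided <- prop.test(x, n, correct = FALSE)

# One-sided p-value is half the two-sided one when the data agree with the direction
c(one_sided = ex_1_4$p.value, half_two_sided = two_sided$p.value / 2)

# Same alternative with control listed first now asks "is control greater?"
wrong_order <- prop.test(rev(x), n, alternative = "greater", correct = FALSE)
wrong_order$p.value  # close to 1, not close to 0
```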
Section 2. Continuous metrics with t-tests (3 problems)
Exercise 2.1: Welch t-test on average order value
Task: An e-commerce checkout test produced these per-user order totals (in dollars). Build two vectors aov_control and aov_treatment, run a Welch two-sample t-test on whether mean AOV differs, save the htest result to ex_2_1, and report the 95% CI for the mean difference.
Expected result:
#>
#> Welch Two Sample t-test
#>
#> data: aov_control and aov_treatment
#> t = -2.10, df = 17.95, p-value = 0.04999
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -19.9402 -0.0598
#> sample estimates:
#> mean of x mean of y
#> 58.5 68.5
Difficulty: Intermediate
Two sets of per-user dollar amounts call for a comparison of average values between the groups.
Pass aov_control and aov_treatment to t.test(); its default already performs the Welch (unequal-variance) version.
Click to reveal solution
Explanation: t.test() defaults to Welch's t-test, which does NOT assume equal variances and is the right choice for nearly every A/B test on revenue or session metrics. The CI for the mean difference (control minus treatment) excludes zero by a hair and the p-value is just under 0.05. With small samples (n=10 per arm) the interval is wide, so even a "significant" result like this is fragile: changing a single observation could flip it. That fragility is why you size experiments before launching them, which is the topic of the next section.
Exercise 2.2: Compare Welch vs pooled variance assumptions
Task: Re-run the AOV comparison from Exercise 2.1, but this time pass var.equal = TRUE to assume equal variances (the classic Student's t-test). Save the htest object to ex_2_2 and compare the degrees of freedom and p-value to the Welch version.
Expected result:
#>
#> Two Sample t-test
#>
#> data: aov_control and aov_treatment
#> t = -2.10, df = 18, p-value = 0.04994
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -19.9394 -0.0606
#> sample estimates:
#> mean of x mean of y
#> 58.5 68.5
Difficulty: Intermediate
The classic Student version of the test assumes both groups share the same spread.
Re-run t.test(aov_control, aov_treatment) adding var.equal = TRUE.
Click to reveal solution
Explanation: With nearly identical sample sizes and similar variances, Welch and pooled produce almost the same answer: df 18.0 vs 17.95, p-values rounding to 0.05. Welch only differs meaningfully when variances are unequal AND group sizes are unequal. The cost of using Welch when variances ARE equal is essentially zero (slightly less power, never wrong), but the cost of pooling when variances are unequal is an inflated false-positive rate. Default to Welch unless you have a strong reason.
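A side-by-side sketch of the two tests. The order totals below are hypothetical stand-ins (the exercise's original vectors are not reproduced here), so the numbers will not match the expected output above; the point is the comparison of degrees of freedom and p-values.

```r
# Hypothetical per-user order totals in dollars, 10 users per arm
aov_control   <- c(42, 55, 61, 48, 70, 66, 52, 59, 63, 69)
aov_treatment <- c(58, 72, 65, 80, 61, 74, 69, 77, 55, 74)

welch  <- t.test(aov_control, aov_treatment)                    # default: Welch
pooled <- t.test(aov_control, aov_treatment, var.equal = TRUE)  # Student

data.frame(
  test    = c("Welch", "pooled"),
  df      = unname(c(welch$parameter, pooled$parameter)),
  t       = unname(c(welch$statistic, pooled$statistic)),
  p_value = c(welch$p.value, pooled$p.value)
)
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ, which is why the p-values stay so close.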
Exercise 2.3: Log-transformed t-test for skewed revenue
Task: A growth analyst has revenue-per-user data for two variants. Revenue is heavily right-skewed, so a raw t-test on means is misleading. Construct skewed lognormal samples (n=200 each, log-mean differing by 0.1), apply log1p() to each value, run a Welch t-test on the log-transformed values, and save the htest to ex_2_3. The hypothesis test is on log-scale means, which for lognormal data corresponds approximately to testing the ratio of medians on the original scale (approximately, because log1p() adds 1 to each value before taking logs).
Expected result:
#>
#> Welch Two Sample t-test
#>
#> data: log1p(rev_control) and log1p(rev_treatment)
#> t = -1.06, df = 397.97, p-value = 0.2920
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.2880634 0.0869039
#> sample estimates:
#> mean of x mean of y
#> 2.831 2.932
Difficulty: Advanced
A long right tail breaks the symmetry the test expects, so the values need reshaping before they are compared.
Wrap each vector in log1p() and feed the transformed values to t.test().
Click to reveal solution
Explanation: Raw revenue distributions almost always have a long right tail that violates the normality assumption of the t-test and inflates variance. Taking log1p() (which is log(1 + x) and handles zero values safely) symmetrizes the distribution and the t-test on log-scale means becomes a test on geometric means, often the more meaningful summary for revenue. A common alternative is the Mann-Whitney U test via wilcox.test(), but the log-t approach is preferred when you care about quantifying the effect size as a multiplicative lift.
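A sketch of the log-scale test and the multiplicative reading of its result. The seed, meanlog, and sdlog values are assumptions made for illustration, so the output will differ from the expected result shown above.

```r
set.seed(42)  # assumed seed, for reproducibility only

# Assumed lognormal parameters: log-means 3.0 and 3.1, common sdlog of 1
rev_control   <- rlnorm(200, meanlog = 3.0, sdlog = 1)
rev_treatment <- rlnorm(200, meanlog = 3.1, sdlog = 1)

ex_2_3 <- t.test(log1p(rev_control), log1p(rev_treatment))

# Back-transform: treatment / control ratio on the (1 + revenue) scale.
# conf.int is for control minus treatment, so negate before exponentiating.
ratio_est <- unname(exp(diff(ex_2_3$estimate)))
ratio_ci  <- rev(exp(-ex_2_3$conf.int))

round(c(ratio = ratio_est, lower = ratio_ci[1], upper = ratio_ci[2]), 3)
```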
Section 3. Sample size and power (3 problems)
Exercise 3.1: Plan an experiment with power.prop.test
Task: A product manager wants to detect an absolute lift from a 10% baseline conversion rate to 11% (one percentage point) with 80% power at a two-sided 5% significance level. Use power.prop.test() to compute the required sample size per group, save the result object to ex_3_1, and report ex_3_1$n rounded up to a whole number.
Expected result:
#>
#> Two-sample comparison of proportions power calculation
#>
#> n = 14751
#> p1 = 0.10
#> p2 = 0.11
#> sig.level = 0.05
#> power = 0.80
#> alternative = two.sided
#>
#> NOTE: n is number in *each* group
Difficulty: Intermediate
Sizing an experiment means solving for the user count given a baseline rate, a target rate, the desired power, and the significance level.
Call power.prop.test() with p1 = 0.10, p2 = 0.11, power = 0.80, sig.level = 0.05, then read $n.
Click to reveal solution
Explanation: The output n is per group, so the total user count is twice that. A useful rule of thumb falls out: smaller minimum detectable effects (MDE) need quadratically more users, so halving the MDE from 1pp to 0.5pp would require roughly four times the sample. Always pre-compute this BEFORE launching, not after a marketing campaign drives unplanned traffic. Leave any one of n, p2, or power as NULL to solve for it; the function fills in the missing slot.
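The quadratic rule of thumb can be checked directly by sizing the same test for half the effect. A sketch, with n_1pp and n_half_pp as illustrative names:

```r
ex_3_1 <- power.prop.test(p1 = 0.10, p2 = 0.11, power = 0.80, sig.level = 0.05)
n_1pp  <- ceiling(ex_3_1$n)  # per-group sample size for a 1pp lift

# Halve the minimum detectable effect: 10% -> 10.5%
n_half_pp <- ceiling(
  power.prop.test(p1 = 0.10, p2 = 0.105, power = 0.80, sig.level = 0.05)$n
)

c(n_1pp = n_1pp, n_half_pp = n_half_pp, ratio = n_half_pp / n_1pp)
```

The ratio comes out close to, but not exactly, four because the variance term also shifts slightly with p2.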
Exercise 3.2: Sample size with pwr::pwr.2p.test using effect size h
Task: Use pwr::pwr.2p.test() from the pwr package to compute the sample size per group needed to detect Cohen's h = 0.05 at 80% power, two-sided, 5% significance. Save the result to ex_3_2 and contrast ex_3_2$n with the answer from Exercise 3.1.
Expected result:
#>
#> Difference of proportion power calculation for binomial distribution (arcsine transformation)
#>
#> h = 0.05
#> n = 6279
#> sig.level = 0.05
#> power = 0.8
#> alternative = two.sided
#>
#> NOTE: same sample sizes
Difficulty: Intermediate
Some power tools describe the effect as a single standardized number instead of two raw proportions.
Use pwr::pwr.2p.test() with h = 0.05, power = 0.80, sig.level = 0.05, alternative = "two.sided".
Click to reveal solution
Explanation: pwr.2p.test() parameterizes the effect by Cohen's h (an arcsine-transformed difference) rather than two raw proportions. For h = 0.05 the function returns about 6,279 per group, well under half of Exercise 3.1's roughly 14,750, because h = 0.05 is a larger effect than the 10%-to-11% jump (Cohen's h for that comparison is only about 0.033 in magnitude). The arcsine transform stabilizes variance across the proportion scale, which is why pwr uses it. Convert between formulations with pwr::ES.h(p1, p2).
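For intuition, the per-group sample size for a given h can be approximated in base R with the usual normal-approximation formula, n = 2 * ((z_alpha/2 + z_power) / h)^2. The sketch below assumes this approximation (it ignores the negligible opposite tail) and only calls pwr if the package happens to be installed.

```r
h <- 0.05; alpha <- 0.05; power <- 0.80

# Normal-approximation sample size per group for effect size h
n_closed <- 2 * ((qnorm(1 - alpha / 2) + qnorm(power)) / h)^2

# Same formula applied to the 10% -> 11% comparison from Exercise 3.1
h_1pp <- 2 * asin(sqrt(0.11)) - 2 * asin(sqrt(0.10))
n_1pp <- 2 * ((qnorm(1 - alpha / 2) + qnorm(power)) / h_1pp)^2

round(c(n_for_h_0.05 = n_closed, h_1pp = h_1pp, n_for_1pp = n_1pp), 4)

# Optional cross-check against the pwr package
if (requireNamespace("pwr", quietly = TRUE)) {
  ex_3_2 <- pwr::pwr.2p.test(h = h, power = power, sig.level = alpha,
                             alternative = "two.sided")
  print(c(pwr = ex_3_2$n, closed_form = n_closed))
}
```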
Exercise 3.3: Compute the minimum detectable effect under a fixed budget
Task: The growth team only has 8,000 visitors per arm available before a marketing window closes. Holding power at 80%, two-sided 5% alpha, and a baseline p1 of 0.10, solve for the smallest detectable p2 using power.prop.test() with n = 8000 and p2 = NULL. Save the result to ex_3_3 and report the MDE on the absolute and relative scales.
Expected result:
#>
#> Two-sample comparison of proportions power calculation
#>
#> n = 8000
#> p1 = 0.10
#> p2 = 0.1136
#> sig.level = 0.05
#> power = 0.80
#>
#> Absolute MDE: 0.0136 (1.36 percentage points)
#> Relative MDE: 13.6%
Difficulty: Advanced
With the user count fixed, flip the usual question and solve instead for the smallest effect you could still detect.
Call power.prop.test() with n = 8000, p1 = 0.10, power = 0.80 and leave p2 = NULL so it solves for that slot.
Click to reveal solution
Explanation: Solving for p2 with fixed n flips the usual workflow: instead of "how many users do I need?", you ask "what is the smallest effect I can plausibly detect with what I have?" The honest answer for 8,000 per arm is a 13.6% relative lift, which lets the PM decide whether the test is worth running. If the realistic business effect is a 3% relative lift, this experiment is underpowered and should be redesigned (a longer runtime, fewer variants splitting the traffic, or a more sensitive primary metric).
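A sketch of the fixed-budget calculation, reporting the MDE on both scales. The names abs_mde and rel_mde are illustrative.

```r
# p2 is left at its default (NULL), so power.prop.test() solves for it
ex_3_3 <- power.prop.test(n = 8000, p1 = 0.10, power = 0.80, sig.level = 0.05)

abs_mde <- ex_3_3$p2 - ex_3_3$p1  # absolute lift, in proportion units
rel_mde <- abs_mde / ex_3_3$p1    # relative lift against the baseline

round(c(p2 = ex_3_3$p2, abs_mde = abs_mde, rel_mde = rel_mde), 4)
```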
Section 4. Effect size, lift, and confidence intervals (3 problems)
Exercise 4.1: 95% confidence interval for the difference in proportions
Task: Manually build the 95% confidence interval for the difference p_treatment - p_control using the normal approximation: SE = sqrt(p1(1-p1)/n1 + p2(1-p2)/n2), CI = (p2 - p1) +/- 1.96 * SE. Use the values from Exercise 1.2 (4800/50000 and 5250/50000). Save a length-2 numeric vector c(lower, upper) to ex_4_1.
Expected result:
#> [1] 0.005273 0.012727
Difficulty: Intermediate
A confidence interval for a difference is the difference itself, plus and minus a margin built from the combined uncertainty of both estimates.
Compute the standard error with sqrt() using the SE formula, then add c(-1.96, 1.96) * se to the difference in proportions.
Click to reveal solution
Explanation: This is the classic Wald interval for two proportions and matches the CI that prop.test() reports with correct = FALSE (up to the sign, which depends on which proportion you subtract, and the rounding of 1.96). The interval lies entirely above zero so the lift is statistically detectable. For very small or very large proportions (p < 0.05 or p > 0.95), the Wald approximation is poor; for a single proportion, the Wilson score interval from prop.test() with correct = FALSE is the better default, whereas binom.test() returns the exact Clopper-Pearson interval. Reporting the CI alongside the p-value is far more informative for stakeholders than the p-value alone.
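The manual interval and its agreement with prop.test() can be sketched as follows. Listing the treatment first in the reference call makes the signs line up; the small remaining gap comes from using 1.96 instead of qnorm(0.975).

```r
x <- c(control = 4800, treatment = 5250)
n <- c(50000, 50000)
p <- x / n

# Unpooled standard error of the difference in proportions
se <- sqrt(sum(p * (1 - p) / n))

ex_4_1 <- unname(p["treatment"] - p["control"]) + c(-1.96, 1.96) * se
ex_4_1

# Reference: treatment minus control from prop.test without continuity correction
ref <- prop.test(rev(x), rev(n), correct = FALSE)$conf.int
as.numeric(ref)
```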
Exercise 4.2: Bootstrap a relative-lift confidence interval for revenue
Task: An analyst needs a 95% CI for the relative lift in mean revenue (mean(treatment) / mean(control) - 1), not the absolute difference, because leadership reports lift in percent terms. Using the rev_control and rev_treatment vectors from Exercise 2.3, write a function that resamples each group with replacement and computes the relative lift, run 2000 bootstrap replicates, and save the percentile CI to ex_4_2 as a length-2 vector.
Expected result:
#> [1] -0.18 0.43
Difficulty: Advanced
When you cannot assume a tidy distribution, resampling the observed data many times builds the interval empirically.
Resample each group with sample(x, length(x), replace = TRUE), compute the relative lift across many replicates, then take quantile(boots, c(0.025, 0.975)).
Click to reveal solution
Explanation: A bootstrap percentile CI makes no normality assumption, which matters because revenue is heavily skewed and the t-test's symmetric CI on the raw scale would be misleading. The CI here spans negative to positive, so the experiment cannot rule out either a loss or a sizeable win: under-powered. Two practical notes: use replicate(R, ...) or vectorize with matrix sampling for speed on large data, and prefer BCa CIs (boot::boot.ci(type = "bca")) over plain percentiles when the bootstrap distribution is skewed or biased.
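A minimal percentile-bootstrap sketch. The revenue vectors are regenerated here with assumed lognormal parameters and an assumed seed so the block runs on its own; the resulting interval will therefore differ from the expected result above. The helper boot_lift is an illustrative name.

```r
set.seed(42)  # assumed seed
rev_control   <- rlnorm(200, meanlog = 3.0, sdlog = 1)  # assumed parameters
rev_treatment <- rlnorm(200, meanlog = 3.1, sdlog = 1)

# One bootstrap replicate: resample each arm independently, then compute lift
boot_lift <- function(ctrl, trt) {
  ctrl_star <- sample(ctrl, length(ctrl), replace = TRUE)
  trt_star  <- sample(trt,  length(trt),  replace = TRUE)
  mean(trt_star) / mean(ctrl_star) - 1
}

boots  <- replicate(2000, boot_lift(rev_control, rev_treatment))
ex_4_2 <- unname(quantile(boots, c(0.025, 0.975)))
round(ex_4_2, 2)
```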
Exercise 4.3: Cohen's h effect size for two proportions
Task: Use pwr::ES.h() to compute Cohen's h for the conversion rates 0.10 and 0.11 (a 1pp absolute lift on a 10% baseline). Save the scalar to ex_4_3 and round to 4 decimals.
Expected result:
#> [1] -0.0326
Difficulty: Beginner
A baseline-free measure puts a one-point lift onto a standardized scale you can compare across experiments.
Call pwr::ES.h(0.10, 0.11) and wrap the result in round(..., 4).
Click to reveal solution
Explanation: Cohen's h transforms two proportions onto an arcsine scale where the SD is approximately constant, then takes the difference: h = 2 * (asin(sqrt(p1)) - asin(sqrt(p2))). The sign only encodes direction: it is negative here because the first argument (0.10) is smaller than the second (0.11). Conventional thresholds for the magnitude: 0.2 small, 0.5 medium, 0.8 large. A magnitude of 0.033 is tiny, which is why Exercise 3.1 needed nearly 30,000 total users: the smaller the effect on the arcsine scale, the more samples you need. Use ES.h() whenever you need a baseline-independent way to compare experiments.
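The formula in the explanation is short enough to write out in base R. The function name cohens_h is illustrative; the optional check against pwr::ES.h() only runs if pwr is installed.

```r
# Cohen's h: difference of arcsine-transformed proportions
cohens_h <- function(p1, p2) 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

ex_4_3 <- round(cohens_h(0.10, 0.11), 4)
ex_4_3

# Swapping the arguments flips the sign but not the magnitude
cohens_h(0.11, 0.10)

if (requireNamespace("pwr", quietly = TRUE)) {
  print(all.equal(cohens_h(0.10, 0.11), pwr::ES.h(0.10, 0.11)))
}
```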
Section 5. Multiple variants and multiple comparisons (3 problems)
Exercise 5.1: Pairwise proportion tests with Bonferroni correction
Task: A PM ran a four-variant test (A, B, C, D) on a landing page. Use pairwise.prop.test() with p.adjust.method = "bonferroni" on the counts c(480, 525, 540, 460) of c(5000, 5000, 5000, 5000) to obtain a matrix of adjusted p-values. Save the htest object to ex_5_1 and identify which pair has the smallest adjusted p-value.
Expected result:
#>
#> Pairwise comparisons using Pairwise comparison of proportions
#>
#> data: c(480, 525, 540, 460) out of c(5000, 5000, 5000, 5000)
#>
#> A B C
#> B 0.860 - -
#> C 0.307 1.000 -
#> D 1.000 0.190 0.051
#>
#> P value adjustment method: bonferroni
Difficulty: Intermediate
Four variants mean many head-to-head tests, and each extra comparison inflates the chance of a false win unless the p-values are adjusted.
Use pairwise.prop.test() with the four counts, the four sample sizes, and p.adjust.method = "bonferroni".
Click to reveal solution
Explanation: With 4 variants there are 6 pairwise tests, so Bonferroni multiplies each raw p-value by 6 and caps at 1.0. C vs D has the smallest adjusted p-value, about 0.051, which narrowly misses the 0.05 threshold, so no pair survives the correction. (pairwise.prop.test() applies the continuity correction by default; passing correct = FALSE brings C vs D down to roughly 0.046, a reminder of how fragile a borderline result is.) Reporting the unadjusted p-values from 6 separate prop.test() calls would inflate the family-wise error rate well above 5%. Pick the comparison method to match your goal: Bonferroni for strong control of family-wise error, BH (the next exercise) for control of false discovery rate when you have many comparisons and are tolerant of a few false positives.
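A sketch of the pairwise comparison alongside the unadjusted p-values, so the effect of the multiplier is visible. Naming the count vector A to D is an illustrative choice for labelling the output.

```r
x <- c(A = 480, B = 525, C = 540, D = 460)  # conversions per variant
n <- rep(5000, 4)                           # visitors per variant

ex_5_1 <- pairwise.prop.test(x, n, p.adjust.method = "bonferroni")
raw    <- pairwise.prop.test(x, n, p.adjust.method = "none")

n_tests <- choose(length(x), 2)  # number of pairwise comparisons

ex_5_1$p.value  # adjusted: raw p-value times n_tests, capped at 1
raw$p.value     # unadjusted, for comparison

# Row/column position of the smallest adjusted p-value
which(ex_5_1$p.value == min(ex_5_1$p.value, na.rm = TRUE), arr.ind = TRUE)
```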
Exercise 5.2: Benjamini-Hochberg FDR correction with p.adjust
Task: You ran 10 simultaneous A/B tests across product surfaces and obtained these raw two-sided p-values. Apply Benjamini-Hochberg correction with p.adjust(method = "BH") and save the adjusted p-values to ex_5_2. Identify how many are below the 0.05 threshold after adjustment versus before.
Expected result:
#> [1] 0.0100 0.0200 0.0300 0.0500 0.1000 0.1500 0.2500 0.5000 0.7000 0.9000
#> raw < 0.05: 4
#> adj < 0.05: 3
Difficulty: Intermediate
Running ten tests at once needs the raw p-values rescaled to control the share of discoveries that are false.
Apply p.adjust() to raw_p with method = "BH".
Click to reveal solution
Explanation: BH controls the expected proportion of false discoveries among rejections, which is usually what you want when running many parallel tests: you tolerate a few false positives in exchange for higher power than Bonferroni. Bonferroni would inflate raw 0.020 to 0.20 and reject only the first two tests, while BH keeps three discoveries. Use BH for screening (which features are worth deeper analysis?), Bonferroni for confirmatory comparisons where any false positive is expensive (regulatory submission, public claims).
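The BH adjustment can be reproduced by hand to see what p.adjust() is doing. The raw_p values below are taken from the raw column of the Exercise 5.3 table; the manual version assumes raw_p is already sorted in ascending order.

```r
raw_p <- c(0.001, 0.004, 0.009, 0.020, 0.050, 0.090, 0.175, 0.400, 0.630, 0.900)

ex_5_2 <- p.adjust(raw_p, method = "BH")

# Manual BH for sorted p-values: scale by m / rank, then enforce
# monotonicity by taking a running minimum from the largest p-value down
m <- length(raw_p)
manual_bh <- rev(cummin(rev(raw_p * m / seq_len(m))))

ex_5_2
cat("raw < 0.05:", sum(raw_p < 0.05), "\n")
cat("adj < 0.05:", sum(ex_5_2 < 0.05), "\n")
```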
Exercise 5.3: Holm versus Bonferroni adjusted p-values
Task: Apply both p.adjust(method = "bonferroni") and p.adjust(method = "holm") to the same raw_p vector from Exercise 5.2. Save a tibble with columns raw, bonferroni, holm (each rounded to 4 decimals) to ex_5_3 and compare which method is uniformly more powerful.
Expected result:
#> # A tibble: 10 x 3
#> raw bonferroni holm
#> <dbl> <dbl> <dbl>
#> 1 0.001 0.01 0.01
#> 2 0.004 0.04 0.036
#> 3 0.009 0.09 0.072
#> 4 0.02 0.2 0.14
#> 5 0.05 0.5 0.3
#> 6 0.09 0.9 0.45
#> 7 0.175 1 0.7
#> 8 0.4 1 1
#> 9 0.63 1 1
#> 10 0.9 1 1
Difficulty: Intermediate
Two corrections that control the same error rate can still differ in how much they shrink each individual p-value.
Call p.adjust() on raw_p twice - once with method = "bonferroni", once with method = "holm" - and assemble both columns into a tibble.
Click to reveal solution
Explanation: Holm (step-down) is uniformly at least as powerful as Bonferroni while controlling the same family-wise error rate, so there is no reason to prefer Bonferroni over Holm for confirmatory comparisons. Bonferroni multiplies every p-value by m (the number of tests); Holm sorts p-values and uses the multiplier m - rank + 1, which is smaller for all but the smallest p-value. Default to method = "holm" for FWER control and method = "BH" for FDR control: that pair covers most A/B testing needs.
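A sketch of the Holm multiplier written out by hand, next to the two p.adjust() calls. It uses a base data.frame rather than a tibble to stay dependency-free, and the manual version assumes raw_p is sorted in ascending order.

```r
raw_p <- c(0.001, 0.004, 0.009, 0.020, 0.050, 0.090, 0.175, 0.400, 0.630, 0.900)
m     <- length(raw_p)

bonferroni <- p.adjust(raw_p, method = "bonferroni")
holm       <- p.adjust(raw_p, method = "holm")

# Manual Holm for sorted p-values: multiplier m - rank + 1,
# running maximum to keep the sequence monotone, capped at 1
holm_manual <- pmin(1, cummax(raw_p * (m - seq_len(m) + 1)))

ex_5_3 <- data.frame(
  raw        = raw_p,
  bonferroni = round(bonferroni, 4),
  holm       = round(holm, 4)
)
ex_5_3
```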
Section 6. Peeking, sequential checks, and experiment hygiene (4 problems)
Exercise 6.1: Visualize day-by-day cumulative conversion rates
Task: A PM is tempted to peek at the experiment every day. Build a tibble of 14 days of simulated cumulative successes and trials per variant where both true rates are equal to 0.10 (a true null). Compute cumulative conversion rates each day and plot cvr over day, one line per variant. Save the ggplot object to ex_6_1.
Expected result:
#> A ggplot with x = day, y = cvr, color = variant.
#> Two lines that drift and cross over the first few days,
#> stabilizing near 0.10 by day 14. No persistent gap exists
#> because the data generating process is a true null.
Difficulty: Intermediate
To show drift over time you first need a running total of successes and of trials, then a separate line for each variant.
Per group, take cumsum() of success and trials to get the daily rate, then build the chart with ggplot() and geom_line(), mapping color = variant.
Click to reveal solution
Explanation: Even when both variants have identical true rates, early days show wide gaps that close as sample size grows. This is exactly the trap PMs fall into when peeking: they see a gap on day 3, conclude treatment is winning, and stop. The fix is either to commit to a fixed-horizon analysis (run for the pre-computed sample size, then look once) or to use a sequential procedure that adjusts for repeated looks (alpha spending, Bayesian bandits). The chart is a great visual aid for explaining why peeking is dangerous.
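A sketch of the simulation behind such a chart. The seed and the daily traffic of 500 users per arm are assumptions for illustration; the plot itself is only built if ggplot2 is installed.

```r
set.seed(1)      # assumed seed
days    <- 14
daily_n <- 500   # assumed daily users per arm

# Simulate daily successes under a true null (both rates 0.10),
# then accumulate successes and trials to get the running rate
sim <- do.call(rbind, lapply(c("control", "treatment"), function(v) {
  trials  <- rep(daily_n, days)
  success <- rbinom(days, size = trials, prob = 0.10)
  data.frame(
    variant = v,
    day     = seq_len(days),
    cvr     = cumsum(success) / cumsum(trials)
  )
}))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  ex_6_1 <- ggplot2::ggplot(sim, ggplot2::aes(x = day, y = cvr, color = variant)) +
    ggplot2::geom_line()
}

head(sim)
```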
Exercise 6.2: Bonferroni-adjusted alpha for repeated daily looks
Task: A team plans to peek at their experiment once per day for 7 days and stop early if any look hits significance. Compute the Bonferroni-adjusted per-look alpha needed to keep family-wise alpha at 0.05 across 7 looks. Save the scalar to ex_6_2.
Expected result:
#> [1] 0.007143
Difficulty: Beginner
Looking at the experiment many times means splitting one total error budget across all of those looks.
Divide the family-wise alpha 0.05 by the number of looks, 7.
Click to reveal solution
Explanation: A naive 7-look procedure has an effective alpha far above 0.05. If the looks were independent it would be 1 - 0.95^7, roughly 0.30; daily looks at cumulative data are positively correlated, so the true figure is lower but still several times the nominal level, and a "significant" finding mid-experiment deserves little trust. Bonferroni-adjusting to per-look alpha 0.0071 controls the family-wise error at 0.05 conservatively. Better procedures (Pocock, O'Brien-Fleming, mSPRT) spend alpha unevenly across looks for higher overall power, but Bonferroni is the right starting point if you have to invent a rule under time pressure.
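The arithmetic can be laid out in a few lines. The independence assumption in the two family-wise figures is a simplification, as cumulative looks are correlated in practice.

```r
looks <- 7
alpha <- 0.05

ex_6_2 <- alpha / looks  # Bonferroni per-look threshold
round(ex_6_2, 6)

# Family-wise error if the 7 looks were independent (a simplification)
naive_fwer <- 1 - (1 - alpha)^looks   # every look tested at 0.05
bonf_fwer  <- 1 - (1 - ex_6_2)^looks  # every look tested at alpha / 7

round(c(naive = naive_fwer, bonferroni = bonf_fwer), 4)
```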
Exercise 6.3: Sample Ratio Mismatch (SRM) chi-square test
Task: A platform engineer suspects a bucketing bug: the assignment split was meant to be 50/50 but the observed counts are 24,200 control and 25,800 treatment over 50,000 users. Run a chi-square goodness-of-fit test against the expected 25,000/25,000 split using chisq.test() and save the htest object to ex_6_3. A p-value below 0.001 typically triggers shutting the experiment down for investigation.
Expected result:
#>
#> Chi-squared test for given probabilities
#>
#> data: c(24200, 25800)
#> X-squared = 51.2, df = 1, p-value = 8.342e-13
Difficulty: Intermediate
Checking whether an observed split matches an intended split is a test of counts against expected probabilities.
Pass the observed counts c(24200, 25800) to chisq.test() with p = c(0.5, 0.5).
Click to reveal solution
Explanation: Sample Ratio Mismatch is the most common operational bug in production experimentation: a 50/50 randomizer that bucketed at 48.4/51.6 is wildly off and the p-value confirms it isn't sampling noise. Real causes include bot filtering that strips one arm asymmetrically, opt-in flows that gate the treatment, or assignment code that runs after a pre-treatment redirect. Any A/B test result with SRM detected is invalid: do not patch the analysis, fix the root cause and rerun. A common dashboard threshold is p < 0.001.
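A sketch of the SRM check with the dashboard threshold applied. The flag name srm_flag is illustrative.

```r
observed <- c(control = 24200, treatment = 25800)

# Goodness-of-fit test against the intended 50/50 split
ex_6_3 <- chisq.test(observed, p = c(0.5, 0.5))

observed / sum(observed)  # realized split
ex_6_3$statistic          # (observed - expected)^2 / expected, summed over arms

# Common dashboard rule: flag the experiment when p < 0.001
srm_flag <- ex_6_3$p.value < 0.001
srm_flag
```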
Exercise 6.4: Simulate an A/A test to verify the false-positive rate
Task: Run 1,000 simulated A/A experiments where both arms draw from the same Binomial(5000, 0.10). For each replicate, run a two-sided prop.test() (no continuity correction) and record whether p < 0.05. Save the empirical false-positive rate (a scalar between 0 and 1) to ex_6_4. With 1000 reps and true null, you should see roughly 0.05.
Expected result:
#> [1] 0.046
Difficulty: Advanced
To check a test's false-positive rate, run it many times on data where no real difference exists and tally how often it still flags significance.
Use replicate() to repeatedly draw counts with rbinom() and run prop.test(..., correct = FALSE), then take mean(p < 0.05).
Click to reveal solution
Explanation: A correctly calibrated test rejects the null at the nominal alpha rate when the null is TRUE. Here the empirical false-positive rate is 0.046, within Monte Carlo error of the theoretical 0.05. Running an A/A simulation on your real production pipeline (using actual traffic split, not just rbinom) is the highest-value sanity check before any A/B program launches: if you observe inflated false positives, your test is either using the wrong statistical formula or your randomizer is biased. Bookmark this as a regression test.
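A sketch of the A/A simulation. The seed is an assumption, so the exact rate will differ from the expected result above while staying near 0.05.

```r
set.seed(123)  # assumed seed

# Each replicate: two arms drawn from the same Binomial(5000, 0.10),
# tested with a two-sided proportion test without continuity correction
aa_p <- replicate(1000, {
  x <- rbinom(2, size = 5000, prob = 0.10)
  prop.test(x, c(5000, 5000), correct = FALSE)$p.value
})

ex_6_4 <- mean(aa_p < 0.05)  # empirical false-positive rate
ex_6_4
```

With 1,000 replicates the Monte Carlo standard error is about 0.007, so values between roughly 0.035 and 0.065 are unremarkable.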
What to do next
- Revisit the parent tutorial A/B Testing in R for the full theory behind these exercises, including assumptions, when each test is appropriate, and pitfalls in production deployment.
- Continue with Hypothesis Testing Exercises in R for broader practice on z, t, chi-square, and non-parametric tests.
- Deepen your power-analysis intuition with Power Analysis Exercises in R, which extends the sample-size workflow to regression and ANOVA settings.
- For the multiple-comparison machinery, work through Multiple Comparison Exercises in R which covers Tukey, Dunnett, and Games-Howell beyond pairwise prop tests.