Multiple Testing Exercises in R: 18 Solved p.adjust() Problems
Eighteen practice problems that build a working command of multiple testing in R: hand calculations against p.adjust(), family-wise error rate control with Bonferroni and Holm, false discovery rate control with Benjamini-Hochberg and Benjamini-Yekutieli, pairwise tests on real datasets, and genomics-scale simulations. Every solution is hidden until you click and includes a one-paragraph explanation of why the method behaves the way it does.
Section 1. Hand calculations vs p.adjust() (3 problems)
Exercise 1.1: Reproduce Bonferroni adjusted p-values by hand
Task: Take the raw p-value vector c(0.001, 0.01, 0.04, 0.20, 0.50). Compute the Bonferroni adjusted values by hand using the rule min(1, p * m) where m is the number of tests, then verify your manual vector exactly matches p.adjust(..., method = "bonferroni"). Save the manual result to ex_1_1.
Expected result:
#> [1] 0.005 0.050 0.200 1.000 1.000
#> identical(ex_1_1, p.adjust(pvec, "bonferroni")):
#> [1] TRUE
Difficulty: Beginner
The Bonferroni rule scales every raw p-value by the total number of tests, then keeps the result a valid probability by never letting it exceed 1.
Use length() to get the count of tests m, then pmin(pvec * m, 1) to multiply and cap elementwise.
Click to reveal solution
Explanation: Bonferroni is the simplest correction in the family: multiply every raw p-value by the number of tests m, then cap at 1. The cap matters because 0.50 * 5 = 2.5 is not a valid probability. pmin() does the elementwise floor against 1. Bonferroni controls FWER (the probability of any false positive) at the chosen alpha, but pays for that guarantee with low power once m grows: with 1000 tests, a raw p of 0.0001 becomes 0.10 and stops looking interesting.
Exercise 1.2: Reproduce Holm step-down adjustment by hand
Task: For the same raw vector c(0.001, 0.01, 0.04, 0.20, 0.50), compute Holm adjusted values: sort ascending, multiply the k-th smallest by (m - k + 1), cap at 1, then enforce monotonic non-decreasing (cummax). Save the result in original order to ex_1_2 and verify it matches p.adjust(..., method = "holm").
Expected result:
#> [1] 0.005 0.040 0.120 0.400 0.500
#> identical(ex_1_2, p.adjust(pvec, "holm")):
#> [1] TRUE
Difficulty: Beginner
Holm ranks the p-values from smallest to largest, applies a shrinking multiplier as the rank grows, and forbids a later rank from looking more significant than an earlier one.
Sort with order(), scale the sorted values by m - seq_along(sorted) + 1, cap with pmin(..., 1), then enforce monotonicity with cummax() before restoring the original order.
Click to reveal solution
Explanation: Holm is a step-down version of Bonferroni: the smallest raw p is compared to alpha/m, the next to alpha/(m-1), and so on. The adjusted form rescales each by its rank-dependent multiplier. The cummax step is the crucial bit: once you fail to reject at rank k, every larger rank is automatically not-rejected, which is enforced by making the adjusted vector non-decreasing. Holm dominates Bonferroni: it controls the same error rate but rejects at least as many hypotheses.
Exercise 1.3: Reproduce Benjamini-Hochberg adjusted p-values by hand
Task: For the raw vector c(0.001, 0.01, 0.04, 0.20, 0.50) compute BH adjusted p-values: sort ascending, scale the k-th sorted value by m / k, cap at 1, then enforce monotonic non-decreasing from largest to smallest via reverse cumulative minimum. Save the result in original order to ex_1_3 and verify against p.adjust(..., method = "BH").
Expected result:
#> [1] 0.00500000 0.02500000 0.06666667 0.25000000 0.50000000
#> identical(ex_1_3, p.adjust(pvec, "BH")):
#> [1] TRUE
Difficulty: Beginner
BH scales each ranked p-value by the number of tests divided by its rank, a gentler factor than Bonferroni's, then enforces a non-decreasing adjusted sequence.
After sorting, multiply by m / seq_along(sorted), cap with pmin(), and make it monotonic with rev(cummin(rev(...))) before reordering.
Click to reveal solution
Explanation: Benjamini-Hochberg targets a different error rate than Bonferroni or Holm: it controls the expected proportion of false rejections among discoveries (the FDR), not the chance of any false rejection (FWER). The step-up scaling p * m / rank is less aggressive than Bonferroni's p * m, which is why BH typically rejects more hypotheses at the same threshold. The reverse cumulative minimum ensures the adjusted sequence is non-decreasing in the original p-value ordering: if a small p was rejected, all larger ones above it stay coherent.
Section 2. Bonferroni and Holm on real datasets (3 problems)
Exercise 2.1: Run Bonferroni-corrected pairwise t-tests on PlantGrowth
Task: The agronomy team running the PlantGrowth trial wants to know which of the three groups (ctrl, trt1, trt2) differ in mean dried weight, with family-wise error rate controlled at 0.05. Call pairwise.t.test() with p.adjust.method = "bonferroni" and save the result object to ex_2_1. Read off which pairs survive at alpha = 0.05.
Expected result:
#> Pairwise comparisons using t tests with pooled SD
#> data: PlantGrowth$weight and PlantGrowth$group
#> ctrl trt1
#> trt1 0.583 -
#> trt2 0.087 0.009
#> P value adjustment method: bonferroni
Difficulty: Intermediate
A single call can run every group-versus-group t-test at once and apply a family-wise correction across the whole set.
Call pairwise.t.test() on PlantGrowth$weight and PlantGrowth$group with p.adjust.method = "bonferroni".
Click to reveal solution
Explanation: Only the trt1 vs trt2 contrast survives at FWER 0.05 (adjusted p = 0.009). The unadjusted p-value for that pair is roughly 0.003, and Bonferroni multiplies by 3 (the number of pairwise comparisons), giving 0.009. The marginal ctrl vs trt2 contrast moves from an unadjusted 0.029 up to an adjusted 0.087 and now fails the cutoff. This is the classic Bonferroni trade-off: tight error control at the cost of declaring fewer differences, even when one of those differences looks clinically interesting.
Exercise 2.2: Apply Holm correction to iris species comparisons
Task: A botanist studying iris wants to know whether the three species differ in Sepal.Width, with multiplicity controlled via Holm so that more pairs can survive than Bonferroni would allow. Run pairwise.t.test() on iris$Sepal.Width grouped by iris$Species, set p.adjust.method = "holm", and save the result to ex_2_2.
Expected result:
#> Pairwise comparisons using t tests with pooled SD
#> data: iris$Sepal.Width and iris$Species
#> setosa versicolor
#> versicolor < 2e-16 -
#> virginica < 2e-16 0.0018
#> P value adjustment method: holm
Difficulty: Intermediate
The same all-pairs t-test routine accepts different multiplicity corrections; a step-down one lets more pairs survive than the simplest cap.
Call pairwise.t.test() on iris$Sepal.Width by iris$Species with p.adjust.method = "holm".
Click to reveal solution
Explanation: All three pairs come up significant at alpha 0.05 after Holm correction. The setosa contrasts are wildly far from zero; the interesting case is versicolor vs virginica at 0.0018. Under Bonferroni that same contrast would be roughly 0.0018, since with m=3 the largest sorted p is multiplied by 1 under Holm but by 3 under Bonferroni, then the cummax kicks in only for the smaller ones. For this dataset both methods give nearly identical decisions; the gap widens as m grows.
Exercise 2.3: Count discoveries: Bonferroni versus Holm side by side
Task: A data analyst preparing a multiple-testing report needs to show stakeholders that Holm never rejects fewer hypotheses than Bonferroni. Take the p-value vector c(0.001, 0.008, 0.012, 0.030, 0.045, 0.10, 0.30, 0.50), compute both adjusted vectors via p.adjust(), count rejections at alpha = 0.05 for each, and save a 3-column data frame (raw, bonferroni, holm) to ex_2_3.
Expected result:
#> ex_2_3:
#> raw bonferroni holm
#> 1 0.001 0.008 0.008
#> 2 0.008 0.064 0.056
#> 3 0.012 0.096 0.072
#> 4 0.030 0.240 0.150
#> 5 0.045 0.360 0.180
#> 6 0.100 0.800 0.300
#> 7 0.300 1.000 0.600
#> 8 0.500 1.000 0.600
#> rejections at 0.05: bonf=1 holm=1
Difficulty: Intermediate
Run both corrections on the same raw vector and lay raw, family-wise-capped, and step-down values side by side to compare rejection counts.
Build a data.frame() whose columns are raw, p.adjust(pvec, "bonferroni"), and p.adjust(pvec, "holm").
Click to reveal solution
Explanation: Both methods reject the same single hypothesis here because the gap between the smallest p (0.001) and the next (0.008) is large enough that Holm's step-down advantage does not flip a decision. Look at row 2: Bonferroni multiplies by 8 giving 0.064, while Holm multiplies by 7 giving 0.056. Neither clears 0.05. The Holm advantage shows up when several raw p-values cluster just below alpha/m: Holm rescales the smallest by m, the next by m-1, and so on, eventually rejecting a hypothesis Bonferroni would miss.
Section 3. False discovery rate methods (3 problems)
Exercise 3.1: Count BH discoveries at multiple FDR thresholds
Task: A biostatistician screening 500 candidate features wants to know how the discovery count grows with the FDR threshold. Generate 500 p-values with set.seed(7) such that 50 follow runif(50, 0, 0.01) (real signals) and 450 follow runif(450, 0, 1) (nulls). Apply BH adjustment and count discoveries at FDR = 0.01, 0.05, and 0.10. Save the discovery counts as a named vector to ex_3_1.
Expected result:
#> ex_3_1:
#> fdr01 fdr05 fdr10
#> 49 49 54
Difficulty: Intermediate
Adjust all the p-values once for the false discovery rate, then count how many clear each threshold you care about.
Apply p.adjust(pvals, "BH"), then use sum(adj < t) for each threshold inside a named c().
Click to reveal solution
Explanation: Almost all 50 planted signals are recovered even at the strict FDR 0.01 threshold because they were drawn from Uniform(0, 0.01). Relaxing the threshold to 0.10 lets in a handful of additional discoveries, which by the FDR contract may include some false positives but no more than 10% of the rejected set in expectation. This is the practical mental model: BH does not promise that every discovery is real, it promises that the false-positive share among discoveries stays controlled.
Exercise 3.2: Use Benjamini-Yekutieli when test dependence is unknown
Task: The same biostatistician suspects the 500 p-values from Exercise 3.1 may come from correlated tests (shared biological pathways), which violates BH's positive-dependence assumption. Re-run the adjustment using BY (method = "BY"), count discoveries at FDR = 0.05, and save a 2-element named vector with both BH and BY counts to ex_3_2.
Expected result:
#> ex_3_2:
#> BH BY
#> 49 38
Difficulty: Intermediate
When tests may be correlated, a more conservative false-discovery correction trades some discoveries for validity under unknown dependence.
Compare sum(p.adjust(pvals, "BH") < 0.05) against sum(p.adjust(pvals, "BY") < 0.05) inside a named vector.
Click to reveal solution
Explanation: Benjamini-Yekutieli scales BH by a factor of sum(1/k) for k = 1..m, which for m = 500 is roughly 6.79. So a raw p-value needs to be about seven times smaller to cross the same threshold under BY. The reward is FDR control under arbitrary dependence, including negative correlations. Most genomics pipelines stick with BH because the positive-dependence assumption holds well enough; switch to BY when you truly do not know the dependency structure or have evidence of antagonistic effects.
Exercise 3.3: Compare adjusted p-values across all six p.adjust methods
Task: Build a tidy comparison table for the raw vector c(0.0005, 0.005, 0.015, 0.03, 0.06, 0.10, 0.25, 0.50, 0.80, 0.95) showing raw alongside adjusted values from every method p.adjust() supports: bonferroni, holm, hochberg, hommel, BH, BY. Round to 4 decimal places and save the data frame to ex_3_3.
Expected result:
#> ex_3_3 (first 6 rows):
#> raw bonferroni holm hochberg hommel BH BY
#> 1 0.0005 0.005 0.0050 0.0050 0.0050 0.0050 0.0147
#> 2 0.0050 0.050 0.0450 0.0450 0.0450 0.0250 0.0732
#> 3 0.0150 0.150 0.1200 0.1200 0.1050 0.0500 0.1465
#> 4 0.0300 0.300 0.2100 0.2100 0.1800 0.0750 0.2197
#> 5 0.0600 0.600 0.3600 0.3600 0.3000 0.1200 0.3515
#> 6 0.1000 1.000 0.5000 0.5000 0.4000 0.1667 0.4882
Difficulty: Intermediate
Every correction method can be applied to the same raw vector and stacked into one table to expose the conservativeness ranking.
Loop the method names through p.adjust() with sapply(), wrap each in round(..., 4), and assemble with data.frame().
Click to reveal solution
Explanation: Read across each row to see the conservativeness ranking: Bonferroni and BY are the strictest, BH is the loosest, and Holm/Hochberg/Hommel sit between. Hommel and Hochberg are uniformly less conservative than Holm under positive dependence assumptions, so they are reasonable swap-ins when you want a touch more power but still need FWER control. Use Bonferroni when a single false positive is unacceptable, BH when controlling the false-discovery fraction is the operational goal, and BY when you genuinely cannot characterize the dependency structure.
Section 4. Pairwise tests on real datasets (3 problems)
Exercise 4.1: Bonferroni-corrected pairwise t-tests across ChickWeight diets
Task: A nutrition study runs the ChickWeight experiment with four diets and wants to know, at the final time point (Time == 21), which pairs of diets differ in mean weight after FWER correction. Filter to Time == 21, run pairwise.t.test() with Bonferroni adjustment over weight by Diet, and save the result to ex_4_1.
Expected result:
#> Pairwise comparisons using t tests with pooled SD
#> data: cw21$weight and cw21$Diet
#> 1 2 3
#> 2 0.245 - -
#> 3 0.014 1.000 -
#> 4 0.057 1.000 1.000
#> P value adjustment method: bonferroni
Difficulty: Intermediate
Narrow the data to the final time point first, then run every diet-versus-diet comparison with a family-wise correction.
On the filtered cw21, call pairwise.t.test() on weight by Diet with p.adjust.method = "bonferroni".
Click to reveal solution
Explanation: Diet 3 differs significantly from Diet 1 (adjusted p = 0.014), while every other pair fails to clear 0.05. With four diets you get six comparisons, so Bonferroni multiplies each raw p by 6. The diet-4-vs-diet-1 contrast (raw p around 0.0095) gets pushed up to roughly 0.057 and just misses the cutoff. This is the classic complaint about Bonferroni in small-sample biology: a borderline-interesting finding becomes formally non-significant. Switching to Holm or to a BH-based FDR control would change the qualitative story.
Exercise 4.2: Run nonparametric pairwise Wilcoxon tests on iris
Task: A field biologist questions whether iris$Petal.Length is normally distributed within each species, so they prefer a rank-based comparison. Run pairwise.wilcox.test() on iris$Petal.Length grouped by iris$Species, use Holm for the family-wise correction, and save the result object to ex_4_2. Suppress exact-p warnings with exact = FALSE.
Expected result:
#> Pairwise comparisons using Wilcoxon rank sum test with continuity correction
#> data: iris$Petal.Length and iris$Species
#> setosa versicolor
#> versicolor < 2e-16 -
#> virginica < 2e-16 < 2e-16
#> P value adjustment method: holm
Difficulty: Intermediate
A rank-based pairwise comparison sidesteps the normality assumption while still accepting the same multiplicity corrections.
Call pairwise.wilcox.test() on iris$Petal.Length by iris$Species with p.adjust.method = "holm" and exact = FALSE.
Click to reveal solution
Explanation: All three pairs are crushingly significant because petal length separates the three iris species almost perfectly. The point of the exercise is the engineering: pairwise.wilcox.test() takes the same correction-method dispatch as pairwise.t.test(), so swapping a parametric test for a rank-based one does not change how you reason about multiplicity. Set exact = FALSE to get the asymptotic normal approximation when groups have ties, which suppresses the noisy warnings without changing the substantive conclusion at these sample sizes.
Exercise 4.3: Cross-check TukeyHSD against Bonferroni on PlantGrowth
Task: Fit a one-way ANOVA aov(weight ~ group, data = PlantGrowth) and pass it to TukeyHSD(). Then run the same three pairwise t-tests with p.adjust.method = "bonferroni". Build a two-column data frame tukey_p and bonf_p in matching order and save to ex_4_3.
Expected result:
#> ex_4_3:
#> tukey_p bonf_p
#> trt1-ctrl 0.391 0.583
#> trt2-ctrl 0.198 0.087
#> trt2-trt1 0.012 0.009
Difficulty: Intermediate
Compare the ANOVA-specific all-pairs procedure against a generic correction by lining up their adjusted p-values in matching pair order.
Pull TukeyHSD(fit)$group[, "p adj"] and the $p.value matrix from pairwise.t.test(..., "bonferroni"), then combine them with data.frame().
Click to reveal solution
Explanation: TukeyHSD uses the studentized range distribution and is the proper FWER-controlling tool for ANOVA all-pairs contrasts, while Bonferroni is generic and ignorant of the joint distribution. Notice they disagree on the trt2-ctrl pair: Tukey gives 0.198 but Bonferroni gives 0.087, because Tukey is more conservative for balanced designs with equal variances. Bonferroni overcorrects when contrasts are correlated (here they share the same pooled sigma estimate). For an ANOVA setup, prefer TukeyHSD; for arbitrary heterogeneous tests across the analysis pipeline, fall back to Bonferroni or Holm.
Section 5. Simulation and genomics-scale testing (3 problems)
Exercise 5.1: Quantify Type I error inflation across 1000 null tests
Task: Simulate 1000 independent t-tests under the null hypothesis: at each iteration, draw two samples of size 30 from rnorm(30) and store t.test(x, y)$p.value. Use set.seed(11). Count how many of the 1000 unadjusted p-values fall below 0.05, then how many of the Bonferroni-adjusted ones do. Save both counts as a named vector to ex_5_1.
Expected result:
#> ex_5_1:
#> raw bonferroni
#> 54 0
Difficulty: Advanced
Under the null nothing is real, so every rejection is a false positive; count them before and after a family-wise correction.
Use sum(pvals < 0.05) and sum(p.adjust(pvals, "bonferroni") < 0.05) inside a named c().
Click to reveal solution
Explanation: Under the null, p-values are uniformly distributed on [0, 1], so the unadjusted count near 50 (about 5% of 1000) is exactly what theory predicts. Without correction, running 1000 tests guarantees you will declare roughly 50 false discoveries even when nothing is real. Bonferroni multiplies every p-value by 1000, so a raw p of 0.0001 is needed to clear the threshold. The simulation gives you a vivid mechanical proof that family-wise error control collapses without adjustment as m grows.
Exercise 5.2: Estimate power and false-discovery share with planted signals
Task: Simulate 2000 tests of which 200 carry a real signal: for the first 200 draw t.test(rnorm(30, mean = 0.8), rnorm(30))$p.value, for the remaining 1800 draw t.test(rnorm(30), rnorm(30))$p.value. Use set.seed(13). Apply BH at FDR 0.05 and report (a) discoveries, (b) true positives among them, (c) realized false-discovery proportion. Save these three numbers as a named vector to ex_5_2.
Expected result:
#> ex_5_2:
#> discoveries true_positives fdp
#> 129.000 125.000 0.031
Difficulty: Advanced
With a known truth vector you can split the discoveries into true and false ones and measure the realized false-discovery share.
Form discov <- p.adjust(pvals, "BH") < 0.05, then take sum(discov), sum(discov & truth), and mean(!truth[discov]).
Click to reveal solution
Explanation: The realized false-discovery proportion (fdp) is 0.031, well below the nominal 0.05 target, because BH controls FDR in expectation; any single run can come in under or over. Discoveries (129) are well below the 200 true signals because effect size 0.8 with n = 30 yields modest power per test. Repeating the simulation thousands of times and averaging the fdp values would land very close to 0.05; that ensemble-average behavior is exactly what FDR control means. Single-run variability is normal.
Exercise 5.3: Build a volcano-style annotation table with adjusted p-values
Task: Simulate a differential expression result for 500 genes: set.seed(17), draw log2fc <- rnorm(500, sd = 0.7), and draw p-values such that the first 80 genes have runif(80, 0, 0.005) (true hits) and the rest have runif(420, 0, 1). Build a tibble with columns gene, log2fc, pvalue, padj (BH), is_hit (TRUE if padj < 0.05 & abs(log2fc) > 1). Save to ex_5_3 and report the number of hits.
Expected result:
#> ex_5_3 (first 4 rows):
#> # A tibble: 500 x 5
#> gene log2fc pvalue padj is_hit
#> <chr> <dbl> <dbl> <dbl> <lgl>
#> 1 gene001 -0.61 0.000242 0.00151 FALSE
#> 2 gene002 1.45 0.000893 0.00558 TRUE
#> 3 gene003 0.21 0.00345 0.0215 FALSE
#> 4 gene004 -1.12 0.00125 0.00781 TRUE
#> ...
#> # 496 more rows hidden
#> hits flagged: 23
Difficulty: Advanced
A volcano-style hit needs both statistical significance from the adjusted p-value and a large enough effect size.
Build a tibble() with a padj column from p.adjust(pvalue, "BH") and an is_hit column testing padj < 0.05 & abs(log2fc) > 1.
Click to reveal solution
Explanation: This is the canonical volcano-plot filter: significance from the adjusted p-value and biological relevance from the fold change. Either filter alone is a bad call: a gene with a tiny padj but |log2fc| < 0.1 is statistically real but biologically trivial, while a gene with a huge fold change but padj > 0.05 is likely chance. The intersection picks out the upper-corner cloud you would label on a volcano plot. The number of hits depends on how many genes carry both a strong effect and a low p-value, which is much smaller than the 80 planted signals because most of them have small fold changes.
Section 6. Power, comparison, and reporting (3 problems)
Exercise 6.1: Estimate empirical FDR for Bonferroni and BH via repeated simulation
Task: Wrap the planted-signal simulation from Exercise 5.2 in a 200-iteration loop. At each iteration regenerate 200 alternatives and 1800 nulls, compute the realized FDP for both Bonferroni (alpha = 0.05) and BH (FDR = 0.05). Average the FDPs across iterations and save the two means as a named vector to ex_6_1. Use set.seed(19) at the start.
Expected result:
#> ex_6_1:
#> bonferroni BH
#> 0.0001 0.0489
Difficulty: Advanced
Running the planted-signal experiment many times and averaging the false-discovery proportions reveals each method's long-run behavior.
Call replicate(200, one_iter()), then average across runs with rowMeans() wrapped in round().
Click to reveal solution
Explanation: Bonferroni's empirical FDP is essentially zero, far below its FWER target of 0.05, because controlling family-wise error is much stricter than controlling FDR. BH's empirical FDP averages right near the nominal 0.05 target, exactly as the theory predicts under independence. The catch is in power: in a parallel simulation you would find BH discovers many more true signals than Bonferroni. The right method for your study depends on which kind of mistake is more costly: a single false rejection (use Bonferroni or Holm) or a high proportion of false rejections among discoveries (use BH).
Exercise 6.2: Compare power of Bonferroni and BH across signal strengths
Task: Sweep the effect size from 0.2 to 1.2 in steps of 0.2 (six effect sizes total). For each effect size, simulate m = 500 tests with 50 carrying that effect and 450 being null, then count discoveries under Bonferroni (alpha = 0.05) and BH (FDR = 0.05). Use set.seed(23). Save a data frame with columns effect, bonf, bh to ex_6_2.
Expected result:
#> ex_6_2:
#> effect bonf bh
#> 1 0.2 0 0
#> 2 0.4 0 4
#> 3 0.6 1 15
#> 4 0.8 6 30
#> 5 1.0 18 42
#> 6 1.2 33 48
Difficulty: Advanced
Sweep the effect size across a grid and record how many discoveries each correction yields at every strength.
Apply sim_one over seq(0.2, 1.2, by = 0.2) with sapply(), transpose with t(), and bind into a data.frame().
Click to reveal solution
Explanation: BH consistently finds more true signals than Bonferroni at every non-trivial effect size, which is the reason it dominates genomics and high-throughput screening pipelines. At effect = 0.6, BH finds 15 signals versus Bonferroni's 1; at effect = 1.0, the gap is 42 vs 18. The cost is that BH allows up to 5% false discoveries while Bonferroni essentially allows none. If the experiment is exploratory and discoveries will be followed up in confirmatory studies, BH's higher discovery rate is the right trade. For confirmatory analyses where every reported result must be airtight, stay with Bonferroni or Holm.
Exercise 6.3: Produce a tidy adjusted-p reporting table from a fitted lm
Task: Fit lm(mpg ~ cyl + disp + hp + drat + wt + qsec, data = mtcars) and extract the coefficient table. Build a tibble with columns term, pvalue, plus three adjusted columns bonf, holm, BH, then add a logical keep column flagging rows where any of the three adjusted p-values is below 0.05. Save to ex_6_3.
Expected result:
#> ex_6_3:
#> # A tibble: 7 x 6
#> term pvalue bonf holm BH keep
#> <chr> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1 (Intercept) 0.150 1 1 0.226 FALSE
#> 2 cyl 0.510 1 1 0.595 FALSE
#> 3 disp 0.461 1 1 0.595 FALSE
#> 4 hp 0.335 1 1 0.586 FALSE
#> 5 drat 0.502 1 1 0.595 FALSE
#> 6 wt 0.018 0.126 0.108 0.0630 FALSE
#> 7 qsec 0.0775 0.543 0.465 0.181 FALSE
Difficulty: Intermediate
Each coefficient in a multi-predictor model is its own test, so the p-values from the fit summary deserve a multiplicity correction.
Pull summary(fit)$coefficients[, "Pr(>|t|)"], build a tibble() with p.adjust() columns for each method, and add keep via mutate().
Click to reveal solution
Explanation: After adjusting for the seven simultaneous coefficient tests, no term survives at alpha 0.05 under any method, even though wt has a raw p of 0.018 that looks compelling in isolation. This is the multiple-comparison reality of multi-predictor regression: every additional term in the model is another implicit test, and reporting coefficient p-values without correction overstates the evidence. The right move in a confirmatory study is to pre-register which terms you care about, then either correct only across those or use a hierarchical-testing scheme that respects model structure.
What to do next
You have walked through Bonferroni and Holm for family-wise control, BH and BY for false discovery rate control, and simulation-based diagnostics for both. Continue with these companion tutorials:
- Multiple Comparisons in R: the parent tutorial covering the underlying theory and decision rules for picking a correction method.
- ANOVA in R: pre-cursor for understanding why pairwise contrasts after a significant overall F-test still need multiplicity correction.
- Hypothesis Testing in R: the foundation behind every p-value you adjust here.
- Practice ANOVA Exercises in R: companion practice problems for the omnibus tests that feed into pairwise corrections.
r-statistics.co · Verifiable credential · Public URL
This document certifies mastery of
Multiple Testing (18 problems) Mastery
Every certificate has a public verification URL that proves the holder passed the assessment. Anyone with the link can confirm the recipient and date.
168 learners have earned this certificate