T-Test Exercises in R: 20 Real-World Practice Problems
Twenty t-test problems covering one-sample, two-sample (Welch and Student), paired, and one-sided variants with assumption checks, effect sizes, power calculations, and end-to-end stakeholder workflows. Every solution is hidden until you click; verify against the Expected result block before peeking.
The datasets used across this hub: iris, mtcars, ToothGrowth, PlantGrowth, ChickWeight, plus inline tibbles where a domain scenario calls for one. Throughout, save each answer to ex_<section>_<problem> so you can sanity-check against the Expected result before revealing the solution.
Section 1. One-sample t-tests (4 problems)
Exercise 1.1: Test whether iris Sepal.Length mean equals 5.85
Task: The botany lab claims the global mean sepal length across iris species is 5.85 cm. Using the built-in iris dataset, run a two-sided one-sample t-test of Sepal.Length against the null mean 5.85 and save the htest object to ex_1_1. Report whether the p-value rejects the null at alpha 0.05.
Expected result:
#> One Sample t-test
#>
#> data: iris$Sepal.Length
#> t = -0.098603, df = 149, p-value = 0.9216
#> alternative hypothesis: true mean is not equal to 5.85
#> 95 percent confidence interval:
#> 5.709732 5.976934
#> sample mean
#> 5.843333
Difficulty: Beginner
A claimed population mean is something you supply from outside the data, never something you estimate from the sample itself.
Feed the vector and the claimed value as mu to t.test(); the two-sided test is the default, so no alternative is needed.
Click to reveal solution
Explanation: t.test(x, mu = 5.85) defaults to a two-sided test against the supplied null mean. The sample mean 5.843 sits only 0.007 below the claimed value, against a standard error of about 0.068, so the p-value 0.92 is far above 0.05 and the data do not contradict the lab's claim. The 95 percent CI [5.71, 5.98] contains 5.85, which is the same conclusion expressed as an interval. Always pull mu from the scientific claim, never from the data itself.
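A minimal sketch of one way to run this test in base R. It only assumes the built-in iris data and the ex_1_1 naming convention described at the top of the page.

```r
# Two-sided one-sample t-test of Sepal.Length against the claimed mean.
# mu comes from the lab's claim, not from the data.
ex_1_1 <- t.test(iris$Sepal.Length, mu = 5.85)

ex_1_1$p.value          # compare against alpha = 0.05
ex_1_1$p.value < 0.05   # FALSE means "do not reject the null"

# Same decision expressed through the interval: does the CI cover 5.85?
ex_1_1$conf.int[1] <= 5.85 && 5.85 <= ex_1_1$conf.int[2]
```

Printing ex_1_1 itself gives the full console block to compare with the Expected result.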
Exercise 1.2: One-sided test that mtcars mpg exceeds 18
Task: A fuel-economy reviewer wants evidence that the average car in mtcars gets more than 18 mpg. Run a one-sided (greater) one-sample t-test of mtcars$mpg against mu = 18 and save the htest object to ex_1_2. Confirm whether the lower CI bound stays above 18.
Expected result:
#> One Sample t-test
#>
#> data: mtcars$mpg
#> t = 2.4286, df = 31, p-value = 0.01054
#> alternative hypothesis: true mean is greater than 18
#> 95 percent confidence interval:
#> 18.40632 Inf
#> sample mean
#> 20.09062
Difficulty: Intermediate
When you have a directional prediction, the test should put all its attention on one tail instead of splitting it across both.
Set alternative = "greater" in t.test() alongside mu = 18.
Click to reveal solution
Explanation: Setting alternative = "greater" halves the p-value compared to the two-sided test only when the sample mean is on the predicted side. The CI becomes one-sided: [18.41, Inf), and because its finite endpoint exceeds 18, the test rejects at alpha 0.05. The common mistake is using "greater" when the sample mean is below the null; in that case R still reports a finite p but it will be near 1, not near 0.
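The halving rule described above can be checked directly. This is a small sketch, not the hidden solution; the variable names p_two, p_greater, and p_less are made up for illustration.

```r
mpg <- mtcars$mpg

p_two     <- t.test(mpg, mu = 18)$p.value                           # two-sided
p_greater <- t.test(mpg, mu = 18, alternative = "greater")$p.value  # predicted side
p_less    <- t.test(mpg, mu = 18, alternative = "less")$p.value     # wrong side

c(two_sided = p_two, greater = p_greater, less = p_less)

# The sample mean is above 18, so "greater" is exactly half of the
# two-sided p-value and "less" is its complement (close to 1).
all.equal(p_greater, p_two / 2)
all.equal(p_less, 1 - p_greater)
```

The "less" result shows what happens when the direction is specified against the data: the test still runs, but the p-value lands near 1.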
Exercise 1.3: Manufacturing QA against a 10 mm bolt spec
Task: A factory specification requires bolts to average 10 mm. Quality control measures 12 bolts and finds the lengths shown below. Run a two-sided one-sample t-test against mu = 10 and save the htest object to ex_1_3. Decide whether the line should be paused (reject at alpha 0.01).
Expected result:
#> One Sample t-test
#>
#> data: bolt_lengths
#> t = 1.2456, df = 11, p-value = 0.2389
#> alternative hypothesis: true mean is not equal to 10
#> 95 percent confidence interval:
#> 9.990078 10.033255
#> sample mean
#> 10.01167
Difficulty: Intermediate
Compare the measured average against the engineering target without assuming which direction any drift would take.
Call t.test() on bolt_lengths with mu = 10 and leave alternative at its two-sided default.
Click to reveal solution
Explanation: A p-value of 0.24 is nowhere near 0.01, so the line is statistically on-spec. Small samples (n = 12) have low power, so a non-rejection is not proof of compliance; it just means this evidence is insufficient to flag drift. For ongoing monitoring, an SPC chart with control limits is more useful than a single t-test because it visualizes trend, not just a single snapshot.
Exercise 1.4: Extract the 95% CI from a t-test object
Task: You ran the test in Exercise 1.1 and now need just the 95 percent confidence interval as a length-two numeric vector for a downstream report. Pull conf.int directly off the htest object and strip the attribute. Save the resulting unnamed numeric vector to ex_1_4.
Expected result:
#> [1] 5.709732 5.976934
Difficulty: Intermediate
A test result is a named list, so the interval you want is just one element you can pull out directly.
Index $conf.int off ex_1_1 and wrap it in as.numeric() to strip the trailing attribute.
Click to reveal solution
Explanation: t.test() returns an htest list with $conf.int, $estimate, $statistic, $p.value, and $parameter (df). Wrapping in as.numeric() drops the conf.level attribute that tags along, which matters when you later paste the values into a report or pass them to a function that errors on attributes. For one-sided tests one endpoint will be -Inf or Inf, which as.numeric preserves.
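As a rough illustration of the attribute being stripped, the sketch below recomputes the Exercise 1.1 object so that it runs on its own.

```r
ex_1_1 <- t.test(iris$Sepal.Length, mu = 5.85)

attributes(ex_1_1$conf.int)            # a conf.level attribute rides along

ex_1_4 <- as.numeric(ex_1_1$conf.int)  # plain length-two numeric vector
attributes(ex_1_4)                     # NULL: the attribute is gone
ex_1_4
```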
Section 2. Two-sample tests: Welch and Student (4 problems)
Exercise 2.1: Compare Petal.Length between setosa and versicolor
Task: Use the iris dataset to compare Petal.Length between species setosa and versicolor. Run a two-sample (Welch) t-test using the formula interface and filter out the virginica rows before testing. Save the htest object to ex_2_1.
Expected result:
#> Welch Two Sample t-test
#>
#> data: Petal.Length by Species
#> t = -39.493, df = 62.14, p-value < 2.2e-16
#> alternative hypothesis: true difference in means between group setosa and group versicolor is not equal to 0
#> 95 percent confidence interval:
#> -2.939618 -2.656382
#> mean in group setosa mean in group versicolor
#> 1.462 4.260
Difficulty: Beginner
Trim the dataset down to just the two groups you want to compare before running any two-group test.
Use subset() to keep the two species, droplevels() to clear the unused factor level, then t.test(Petal.Length ~ Species, data = ...).
Click to reveal solution
Explanation: Formula Petal.Length ~ Species is the cleanest two-sample syntax when data live in a data frame. t.test() defaults to var.equal = FALSE (Welch), which is the right default because equal variances are rarely true and Welch loses almost nothing when they happen to be equal. Dropping the unused virginica level with droplevels() keeps the grouping factor at exactly two levels, which is what a two-sample test requires, and stops the empty level from resurfacing in later tables or plots.
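One possible way to wire the hint together in base R. The intermediate name two_species is an assumption made for this sketch.

```r
# Keep only the two species being compared and drop the now-empty level.
two_species <- droplevels(subset(iris, Species %in% c("setosa", "versicolor")))
levels(two_species$Species)

# Formula interface; var.equal = FALSE (Welch) is the default.
ex_2_1 <- t.test(Petal.Length ~ Species, data = two_species)
ex_2_1
```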
Exercise 2.2: Welch vs Student on ToothGrowth supplement groups
Task: Using ToothGrowth, compare tooth length len between supplements OJ and VC two ways: first with the default Welch correction, then with var.equal = TRUE (Student). Store the two htest objects in a named list ex_2_2 with elements welch and student, then compare the two p-values.
Expected result:
#> $welch
#> Welch Two Sample t-test
#> ...
#> t = 1.9153, df = 55.309, p-value = 0.06063
#>
#> $student
#> Two Sample t-test
#> ...
#> t = 1.9153, df = 58, p-value = 0.06039
Difficulty: Intermediate
The only thing separating the two variants is whether the test is allowed to pool the two group variances into one.
Run t.test(len ~ supp, data = ToothGrowth) twice, the second call adding var.equal = TRUE, and wrap both in a named list().
Click to reveal solution
Explanation: When sample sizes are equal and variances similar, Welch and Student give nearly identical p-values, as here (0.061 vs 0.060). With equal group sizes the two t statistics are identical; only the degrees of freedom differ. The Welch df is fractional because the Welch-Satterthwaite approximation shrinks the df whenever the sample variances differ, even slightly. Practical guidance: use Welch by default and only use Student if you have a strong prior reason to assume equal variances, since Student becomes anti-conservative when the smaller group has the larger variance.
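A short sketch of the side-by-side comparison, following the list structure the task asks for.

```r
ex_2_2 <- list(
  welch   = t.test(len ~ supp, data = ToothGrowth),                    # separate variances
  student = t.test(len ~ supp, data = ToothGrowth, var.equal = TRUE)   # pooled variance
)

# Pull the pieces that differ between the two variants.
sapply(ex_2_2, function(h) c(t = unname(h$statistic),
                             df = unname(h$parameter),
                             p = h$p.value))
```

Only the df row and, through it, the p row should differ between the two columns.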
Exercise 2.3: Two-sided test that PlantGrowth trt1 differs from ctrl
Task: Using PlantGrowth, test whether the treatment trt1 group has a different mean weight from the ctrl group. Filter to just those two groups, then run a two-sided Welch t-test. Save the htest to ex_2_3, then report whether the result is significant at alpha 0.10.
Expected result:
#> Welch Two Sample t-test
#>
#> data: weight by group
#> t = 1.1913, df = 16.524, p-value = 0.2504
#> alternative hypothesis: true difference in means between group ctrl and group trt1 is not equal to 0
#> 95 percent confidence interval:
#> -0.2875162 1.0295162
#> mean in group ctrl mean in group trt1
#> 5.032 4.661
Difficulty: Intermediate
Keep only the control and treatment-one rows before comparing their means.
Use subset() to filter group to c("ctrl", "trt1"), then droplevels(), then t.test(weight ~ group, data = ...).
Click to reveal solution
Explanation: p = 0.25 is well above 0.10, so trt1 is not detectably different from control. The CI [-0.29, 1.03] crosses zero, confirming the same conclusion. If you suspected trt1 reduces weight, a one-sided test (alternative = "greater" for ctrl) might be motivated by prior science, but the convention is to pre-register one-sided tests; otherwise reviewers will assume you fished for the smaller p-value.
Exercise 2.4: Welch holds up under unequal sample sizes
Task: Construct two simulated groups, group A with 50 observations and group B with only 6 observations, both from normal populations with different variances. Run a Welch t-test on the combined data using the formula interface. Save the htest object to ex_2_4 and note the fractional degrees of freedom.
Expected result:
#> Welch Two Sample t-test
#>
#> data: y by grp
#> t = -0.81935, df = 5.1456, p-value = 0.4493
#> alternative hypothesis: true difference in means between group A and group B is not equal to 0
#> 95 percent confidence interval:
#> -5.494091 2.756478
#> mean in group A mean in group B
#> 9.842977 11.211784
Difficulty: Advanced
Lopsided group sizes need no special handling; the default two-sample test already corrects for them on its own.
Call t.test(y ~ grp, data = unequal_df) and read the fractional df off the returned object.
Click to reveal solution
Explanation: Welch df collapses toward the smaller group's n - 1 whenever that group also has the bigger variance, which is the exact scenario that makes Student's pooled-variance test wildly anti-conservative. Here df = 5.1, not the n1 + n2 - 2 = 54 you would get from Student. Running Student here would push the Type I error rate well above the nominal 5 percent; how far depends on the variance ratio. Lesson: do not pool variances by default; Welch makes the whole question moot.
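The anti-conservative behaviour can be seen in a small Monte Carlo sketch. The seed, the group standard deviations (1 and 5), the common mean of 10, and the number of replications are all assumptions chosen for illustration; they are not the values behind the Expected result above.

```r
set.seed(1)             # arbitrary seed, for reproducibility only
n_a <- 50; sd_a <- 1    # large group, small spread (assumed values)
n_b <- 6;  sd_b <- 5    # small group, large spread (assumed values)
n_sim <- 2000

# Both groups share the same true mean, so every rejection is a false positive.
p_vals <- replicate(n_sim, {
  a <- rnorm(n_a, mean = 10, sd = sd_a)
  b <- rnorm(n_b, mean = 10, sd = sd_b)
  c(student = t.test(a, b, var.equal = TRUE)$p.value,
    welch   = t.test(a, b)$p.value)
})

# Empirical Type I error rate at alpha = 0.05 for each variant.
type1 <- rowMeans(p_vals < 0.05)
type1
```

Under these assumed settings the Student rate should land far above 0.05 while the Welch rate stays close to it.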
Section 3. Paired t-tests (3 problems)
Exercise 3.1: Weight-loss before and after, paired design
Task: A nutrition study weighs 10 participants before and after a 12-week program. The two vectors are paired by subject id, so use paired = TRUE. Run a two-sided paired t-test and save the htest object to ex_3_1.
Expected result:
#> Paired t-test
#>
#> data: before and after
#> t = 8.1029, df = 9, p-value = 1.987e-05
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> 1.541544 2.658456
#> mean difference
#> 2.1
Difficulty: Beginner
Each subject contributes two readings that belong together, so the test should analyze the within-subject change, not two separate samples.
Pass both vectors to t.test() with paired = TRUE.
Click to reveal solution
Explanation: A paired t-test is mathematically a one-sample t-test on the within-subject differences (before minus after) against mu = 0. The order of arguments only flips the sign of the t-statistic and CI, not the p-value. If you forgot paired = TRUE here, R would run an independent two-sample test and produce a far larger p-value because between-subject variance swamps the consistent 2-3 kg drop. The pairing is what gives the test its power.
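The equivalence with a one-sample test on the differences is easy to verify. The weights below are hypothetical values invented for this sketch; they are not the exercise data and will not reproduce the Expected result.

```r
# Hypothetical before/after weights (kg) for 10 participants.
before <- c(82.1, 90.4, 77.8, 95.0, 88.3, 79.6, 101.2, 85.7, 92.9, 80.5)
after  <- c(80.2, 88.0, 76.1, 92.4, 86.5, 77.9,  98.3, 83.9, 90.6, 78.8)

paired     <- t.test(before, after, paired = TRUE)
one_sample <- t.test(before - after, mu = 0)       # same test, written by hand
swapped    <- t.test(after, before, paired = TRUE) # argument order reversed
unpaired   <- t.test(before, after)                # ignores the pairing

all.equal(unname(paired$statistic), unname(one_sample$statistic))
all.equal(unname(swapped$statistic), -unname(paired$statistic))
c(paired = paired$p.value, unpaired = unpaired$p.value)
```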
Exercise 3.2: Paired test on ChickWeight, day 0 vs day 21
Task: Using ChickWeight, build a paired comparison of weight at Time == 0 versus Time == 21 for the same chick. Reshape so each chick contributes one before and one after value, drop chicks missing either time point, and run a paired t-test. Save the htest object to ex_3_2.
Expected result:
#> Paired t-test
#>
#> data: pair_wide$t0 and pair_wide$t21
#> t = -20.611, df = 44, p-value < 2.2e-16
#> alternative hypothesis: true mean difference is not equal to 0
#> 95 percent confidence interval:
#> -147.0541 -120.7237
#> mean difference
#> -133.8889
Difficulty: Advanced
Pairing only works once each chick sits on one row with its two time points side by side, and any chick missing either point must be dropped.
Reshape with pivot_wider() keyed on Chick, filter() out the NA rows, then call t.test(..., paired = TRUE).
Click to reveal solution
Explanation: Real paired studies almost always lose some subjects to follow-up, so filter(!is.na(t0), !is.na(t21)) is mandatory before pairing. Reshaping with pivot_wider() turns a long-format panel into a one-row-per-chick wide table, which is the shape paired = TRUE expects. The huge effect (mean diff = 134 g) is unsurprising biologically. R prints any p-value below machine epsilon as < 2.2e-16; in a report, write p < 0.001 rather than quoting that floor.
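The hint uses pivot_wider(); as a dependency-free stand-in, the sketch below gets the same one-row-per-chick shape with base merge(), whose inner join drops chicks missing either time point. The names cw, day0, and day21 are assumptions for this example.

```r
cw <- as.data.frame(ChickWeight)   # strip the groupedData classes

day0  <- cw[cw$Time == 0,  c("Chick", "weight")]
day21 <- cw[cw$Time == 21, c("Chick", "weight")]
names(day0)[2]  <- "t0"
names(day21)[2] <- "t21"

# merge() keeps only chicks present in BOTH tables, so incomplete chicks
# are dropped without an explicit NA filter.
pair_wide <- merge(day0, day21, by = "Chick")
nrow(pair_wide)

ex_3_2 <- t.test(pair_wide$t0, pair_wide$t21, paired = TRUE)
ex_3_2
```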
Exercise 3.3: When paired beats independent: pre vs post BP
Task: A clinic measures systolic blood pressure on 8 patients before and after a new med. Run the same data both ways: once as a paired t-test, once as a two-sample t-test (which ignores the pairing). Save a named list with paired and unpaired htest objects to ex_3_3 and compare the two p-values.
Expected result:
#> paired unpaired
#> 4.0863e-09 1.7126e-02
Difficulty: Intermediate
Run the identical numbers through a design that respects the within-patient link and one that ignores it entirely.
Build a list() holding t.test(pre, post, paired = TRUE) and a plain t.test(pre, post).
Click to reveal solution
Explanation: Both tests agree on direction, but the paired p-value is more than six orders of magnitude smaller. Why: most of the variance in BP comes from between-patient differences (some run high, some run low). The paired test eliminates that nuisance variance by analyzing within-patient changes. Forgetting to pair when your design IS paired is the single most common t-test error in practice; the diagnostic is "is the same unit (person, chick, plot) measured twice?" If yes, pair.
Section 4. Assumptions, robustness & alternatives (3 problems)
Exercise 4.1: Shapiro-Wilk normality check before a t-test
Task: Before trusting the Welch t-test in Exercise 2.3, run Shapiro-Wilk normality tests on the weight values within each PlantGrowth group (ctrl and trt1). Save a named list with the two htest objects to ex_4_1 and report whether either group rejects normality at alpha 0.05.
Expected result:
#> $ctrl
#> Shapiro-Wilk normality test
#> data: weight[group == "ctrl"]
#> W = 0.95682, p-value = 0.7475
#>
#> $trt1
#> Shapiro-Wilk normality test
#> data: weight[group == "trt1"]
#> W = 0.9304, p-value = 0.4519
Difficulty: Intermediate
Check the distribution shape one group at a time before trusting any mean-based comparison.
Call shapiro.test() on the weight values of each group and store the two results in a named list().
Click to reveal solution
Explanation: Neither group rejects normality (p = 0.75 and 0.45), so the t-test in 2.3 is safe to interpret. Shapiro-Wilk is sensitive at moderate n (10 to 50) and underpowered below n = 8, so a non-rejection is only weak evidence of normality at very small sample sizes. A pragmatic complement is to look at a Q-Q plot directly with qqnorm(); for n >= 30 you can also lean on the Central Limit Theorem, since the t-statistic is robust to mild non-normality.
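A compact sketch of the per-group check, using with() so the data labels match the Expected result.

```r
ex_4_1 <- with(PlantGrowth, list(
  ctrl = shapiro.test(weight[group == "ctrl"]),
  trt1 = shapiro.test(weight[group == "trt1"])
))

# One p-value per group; values above 0.05 mean normality is not rejected.
sapply(ex_4_1, function(h) h$p.value)
```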
Exercise 4.2: Levene's test for equal variances
Task: Before deciding between Welch and Student on the ToothGrowth supplement comparison, run Levene's test for homogeneity of variance using car::leveneTest with center = "median". Save the resulting ANOVA-style object to ex_4_2 and read off the p-value to confirm which test variant to prefer.
Expected result:
#> Levene's Test for Homogeneity of Variance (center = "median")
#> Df F value Pr(>F)
#> group 1 1.2136 0.2752
#> 58
Difficulty: Intermediate
Before choosing between pooled and separate variances, test whether the two spreads are even comparable.
Call leveneTest(len ~ supp, data = ToothGrowth, center = "median").
Click to reveal solution
Explanation: Levene's test does not reject equal variances (p = 0.28). However, running a variance pre-test and then picking the t-test variant based on its outcome is a known pitfall: the two-stage procedure no longer holds the nominal Type I error rate of the second-stage t-test. The cleaner modern advice is to skip Levene entirely and always use Welch, which costs essentially nothing in power when variances are equal and recovers the right alpha when they are not. Setting center = "median" gives the Brown-Forsythe variant, which is more robust than the classic mean-centred Levene test.
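To show what the median-centred test computes, here is a hand-rolled base R sketch: a one-way ANOVA on absolute deviations from each group's median. It is an illustration of the technique, not a replacement for car::leveneTest; the names abs_dev and bf are made up.

```r
# Absolute deviation of each observation from its own group's median.
abs_dev <- with(ToothGrowth, abs(len - ave(len, supp, FUN = median)))

# One-way ANOVA on those deviations: this is the Brown-Forsythe test.
bf <- anova(lm(abs_dev ~ supp, data = ToothGrowth))
bf

bf$`Pr(>F)`[1]   # p-value for equal spread across the two supplements
```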
Exercise 4.3: Wilcoxon rank-sum as a robust alternative
Task: Suppose skew makes a two-sample t-test on mtcars$mpg between am == 0 and am == 1 look fragile to outliers. Run a Mann-Whitney / Wilcoxon rank-sum test using wilcox.test() with the formula interface and save the htest object to ex_4_3, then compare its p-value to the Welch t-test.
Expected result:
#> Wilcoxon rank sum test with continuity correction
#>
#> data: mpg by am
#> W = 42, p-value = 0.001871
#> alternative hypothesis: true location shift is not equal to 0
Difficulty: Intermediate
When outliers make a mean-based test shaky, switch to a procedure that compares ranks rather than averages.
Call wilcox.test(mpg ~ am, data = mtcars) using the formula interface.
Click to reveal solution
Explanation: Both tests reject at alpha 0.01, but Wilcoxon is testing a location shift in ranks rather than a difference of means. Use the rank-sum when (1) the data are clearly skewed and your sample is too small for the CLT to bail you out, or (2) the response is ordinal. Note that the W printed by wilcox.test() is the Mann-Whitney U statistic for the first group, that is, that group's rank sum minus n1(n1+1)/2, not the raw rank sum. The continuity correction only applies when R uses the normal approximation (for example when ties rule out an exact p-value) and can be turned off with correct = FALSE.
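A sketch comparing the two tests and rebuilding W from ranks. Passing exact = FALSE is a choice made here to request the normal approximation explicitly, which mpg needs anyway because it contains ties; u_manual and n0 are illustrative names.

```r
# Rank-sum test; exact = FALSE asks for the normal approximation up front.
ex_4_3 <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
welch  <- t.test(mpg ~ am, data = mtcars)

c(wilcoxon = ex_4_3$p.value, welch = welch$p.value)

# Rebuild W by hand: rank sum of the first group (am == 0) minus n0(n0+1)/2.
r  <- rank(mtcars$mpg)
n0 <- sum(mtcars$am == 0)
u_manual <- sum(r[mtcars$am == 0]) - n0 * (n0 + 1) / 2
c(W = unname(ex_4_3$statistic), manual = u_manual)
```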
Section 5. Effect sizes, CIs & power (3 problems)
Exercise 5.1: Cohen's d for the iris Petal.Length comparison
Task: The htest object from Exercise 2.1 gives a p-value but no standardized effect size. Compute Cohen's d for Petal.Length between setosa and versicolor using the pooled standard deviation, by hand: numerator is the difference in means, denominator is the pooled SD. Save the scalar to ex_5_1.
Expected result:
#> [1] -7.898544
Difficulty: Intermediate
A standardized effect size rescales the raw gap between means by the typical spread shared by the two groups.
Build the pooled SD from each group's var() and sample size, then divide the difference in means by it.
Click to reveal solution
Explanation: Cohen's d expresses the mean difference in pooled-SD units, so d = -7.9 means the two species centroids are almost eight standard deviations apart on petal length, which is enormous by Cohen's rules of thumb (d = 0.2 small, 0.5 medium, 0.8 large). With equal group sizes d also follows from the t statistic in Exercise 2.1: d = t * sqrt(2/n) = -39.493 * sqrt(2/50), or about -7.90. The pooled SD denominator is the conventional choice; using the SD of a single group (typically the control) gives Glass's delta instead. Hedges' g keeps the pooled SD but multiplies d by the correction factor 1 - 3/(4(n1+n2)-9), which reduces small-sample bias.
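A by-hand sketch of the calculation, with a cross-check against the pooled-variance t statistic. The names pooled_sd, hedges_g, and t_pooled are assumptions introduced for this example.

```r
setosa     <- iris$Petal.Length[iris$Species == "setosa"]
versicolor <- iris$Petal.Length[iris$Species == "versicolor"]
n1 <- length(setosa)
n2 <- length(versicolor)

# Pooled SD: df-weighted average of the two variances, then square root.
pooled_sd <- sqrt(((n1 - 1) * var(setosa) + (n2 - 1) * var(versicolor)) /
                  (n1 + n2 - 2))

ex_5_1 <- (mean(setosa) - mean(versicolor)) / pooled_sd
ex_5_1

# Cross-check: d equals the pooled-variance t statistic times sqrt(1/n1 + 1/n2).
t_pooled <- unname(t.test(setosa, versicolor, var.equal = TRUE)$statistic)
all.equal(ex_5_1, t_pooled * sqrt(1 / n1 + 1 / n2))

# Hedges' g: same d with the small-sample correction factor applied.
hedges_g <- ex_5_1 * (1 - 3 / (4 * (n1 + n2) - 9))
hedges_g
```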
Exercise 5.2: Power calculation for a planned two-sample study
Task: A clinical team is planning a two-arm trial expecting a medium effect (Cohen's d = 0.5). They want 80 percent power at alpha 0.05 (two-sided). Use pwr::pwr.t.test() to compute the required sample size per group. Save the returned power.htest object to ex_5_2 and read off n.
Expected result:
#> Two-sample t test power calculation
#>
#> n = 63.76561
#> d = 0.5
#> sig.level = 0.05
#> power = 0.8
#> alternative = two.sided
#>
#> NOTE: n is number in *each* group
Difficulty: Intermediate
Fix the expected effect, the alpha, and the power you want, and the only unknown left is how many subjects each arm needs.
Call pwr.t.test() with d, sig.level, power, and type = "two.sample", leaving n unspecified.
Click to reveal solution
Explanation: The result n = 63.77 means round UP to 64 per group, or 128 total. Always round up: rounding down would leave you short of 80 percent power. Note that type = "two.sample" is already the default, but spelling it out prevents bugs when colleagues read your code. For unequal group sizes use pwr.t2n.test() and supply n1 and n2 directly. To bracket sensitivity, recompute at d = 0.4 and 0.6 to show how n explodes for smaller anticipated effects.
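If the pwr package is not available, base R's power.t.test() answers the same planning question. This is a swapped-in stand-in rather than the exercise's pwr call: with sd = 1, delta plays the role of Cohen's d, and the fractional n can differ from pwr's in the later decimals. The names plan_base and n_by_d are illustrative.

```r
# Solve for n per group: leave n out, supply effect, alpha, and power.
plan_base <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
                          type = "two.sample", alternative = "two.sided")
plan_base$n
ceiling(plan_base$n)   # always round up to a whole subject per group

# Sensitivity: required n per group for a smaller and a larger assumed effect.
n_by_d <- sapply(c(0.4, 0.5, 0.6), function(d) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05, power = 0.80)$n)
})
names(n_by_d) <- c("d=0.4", "d=0.5", "d=0.6")
n_by_d
```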
Exercise 5.3: Post-hoc power from an observed effect
Task: Using the ToothGrowth supplement comparison (Welch from Exercise 2.2), compute the observed Cohen's d, then plug it into pwr.t.test() with n = 30 per group to estimate the achieved power. Save the power.htest object to ex_5_3 and note that the test was underpowered.
Expected result:
#> Two-sample t test power calculation
#>
#> n = 30
#> d = 0.4946386
#> sig.level = 0.05
#> power = 0.4753406
#> alternative = two.sided
Difficulty: Advanced
First measure the effect the study actually observed, then ask how likely a design that size was to detect it.
Compute the observed Cohen's d from the pooled SD, then pass it plus n = 30 to pwr.t.test() with power left out.
Click to reveal solution
Explanation: Observed power = 0.48 goes hand in hand with the p-value (0.06) sitting just above 0.05: the design had less than a coin-flip chance of detecting an effect the size of the one it observed. Many journals now ask authors NOT to report post-hoc power because, for a given test, it is a one-to-one function of the p-value and offers no extra information. The legitimate use is exactly what we did here: to justify a larger replication study, not to retroactively explain a non-significant result.
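A rough sketch of the two steps using base R's power.t.test() as a stand-in for pwr.t.test(), so the printed object will be laid out differently from the Expected result. The names d_obs and achieved are assumptions for this example.

```r
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

# Equal group sizes (30 each), so the pooled variance is the plain average.
pooled_sd <- sqrt((var(oj) + var(vc)) / 2)
d_obs <- (mean(oj) - mean(vc)) / pooled_sd
d_obs

# Solve for power: supply n and the effect, leave power out.
achieved <- power.t.test(n = 30, delta = d_obs, sd = 1, sig.level = 0.05)
achieved$power
```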
Section 6. End-to-end domain workflows (3 problems)
Exercise 6.1: Clinical trial paired BP with assumption check and CI
Task: A cardiology PI needs a one-paragraph summary of a paired BP trial: mean change, 95 percent CI, p-value, and a normality check on the differences. Build a small workflow that runs shapiro.test() on the differences, runs the paired t-test, and returns a named list with shapiro, ttest, and diff_mean. Save the list to ex_6_1.
Expected result:
#> $shapiro$p.value : ~ 0.71
#> $ttest$p.value : ~ 1.0e-8
#> $diff_mean : -8.7
Difficulty: Advanced
Screen the within-subject differences for normality, not the raw before and after columns, which can each be non-normal on their own.
Compute the difference vector, then build a list() of shapiro.test() on it, t.test(..., paired = TRUE), and mean().
Click to reveal solution
Explanation: A clinical workflow is rarely "run one test." You sequence a normality screen on the DIFFERENCES (not the raw before/after, which can each be non-normal even when their pairwise differences are normal), then the paired test, then a CI. Wrapping all three in a list keeps everything together for downstream kable() or gtsummary rendering. The CI for the mean change is the natural way to communicate effect magnitude to clinicians; p-values alone are not actionable in a clinical setting.
Exercise 6.2: A/B test on per-user spend with unequal allocation
Task: A growth analyst runs an A/B test with 75 percent traffic on control and 25 percent on treatment. The per-user 30-day spend is shown below as two vectors. Run a Welch t-test plus a one-sided variant testing the marketing hypothesis that treatment lifts spend. Save a named list ex_6_2 with elements twoSided and oneSided.
Expected result:
#> $twoSided$p.value : ~ 0.085
#> $oneSided$p.value : ~ 0.042
Difficulty: Advanced
Report both the neutral two-tailed result and the directional one the marketing hypothesis actually predicted.
Build a list() with t.test(spend_trt, spend_ctrl) and the same call plus alternative = "greater".
Click to reveal solution
Explanation: Two practical points for A/B tests. First, gamma-distributed revenue violates normality; the CLT rescues a t-test at n = 750 and 250, but for n < ~50 per arm you should use a Mann-Whitney or a permutation test instead. Second, the one-sided test halves the two-sided p-value only when the observed sign matches the hypothesis. Pre-register one-sidedness; post-hoc switching is p-hacking. Many companies report both two-sided p-values and a separate predicted direction in dashboards.
Exercise 6.3: A reusable APA-style reporter function
Task: Write a function apa_t that takes an htest object (from t.test) and returns a one-line APA-formatted character string with t, df, p, and 95 percent CI. Save the function itself to ex_6_3 and demonstrate it on the htest from Exercise 1.1.
Expected result:
#> [1] "t(149) = -0.10, p = 0.922, 95% CI [5.71, 5.98]"
Difficulty: Beginner
Pull the reporting pieces straight off the test object and stitch them into a single formatted line.
Write a function(h) that reads h$parameter, h$statistic, h$p.value, and h$conf.int, then assembles them with sprintf().
Click to reveal solution
Explanation: Reporters that wrap repeated stats output pay for themselves after the third paste. APA style asks for italic t and df; in markdown you would wrap the t and df in underscores. For p-values below 0.001 the convention is to print "p < 0.001" rather than the exact tiny value, which a robust function would handle with if (h$p.value < 0.001) "p < .001" else sprintf("p = %.3f", h$p.value). Pairing this with broom::tidy() gives you a tibble row per test, ideal for batch reporting.
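One way such a reporter might look, including the small-p branch mentioned above. This is a sketch under two assumptions: the htest carries a two-sided interval at the default 95 percent level, and plain text without italics is acceptable.

```r
apa_t <- function(h) {
  # Switch to the "p < .001" convention for very small p-values.
  p_txt <- if (h$p.value < 0.001) "p < .001" else sprintf("p = %.3f", h$p.value)

  # round() keeps whole df as "149" and Welch df as e.g. "62.14".
  df_txt <- format(round(unname(h$parameter), 2))

  sprintf("t(%s) = %.2f, %s, 95%% CI [%.2f, %.2f]",
          df_txt, h$statistic, p_txt, h$conf.int[1], h$conf.int[2])
}

ex_6_3 <- apa_t
ex_6_3(t.test(iris$Sepal.Length, mu = 5.85))
```

Because the interval label is hard-coded, a more general version could read the level from attr(h$conf.int, "conf.level") instead.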
What to do next
- t-Tests in R is the conceptual companion: every variant, when to use it, and the decision rule for picking between Welch, Student, and paired.
- ANOVA Exercises in R extends two-group comparisons to three or more groups when a t-test is no longer the right tool.
- Power Analysis Exercises in R drills deeper into sample size planning beyond the two problems here.
- Linear Regression Exercises in R is the next step once you start adding covariates to a two-group comparison.