Correlation Exercises in R: 20 Practice Problems
Twenty practice problems on correlation in R covering Pearson, Spearman, Kendall, significance testing with cor.test, partial correlation, bootstrap confidence intervals, correlation matrices, and visual diagnostics. Solutions are hidden behind a reveal so you can attempt every problem first, then check your approach.
Section 1. Basic Pearson correlation (4 problems)
Exercise 1.1: Compute the Pearson correlation between weight and mpg
Task: Use base R cor() on the mtcars dataset to compute the Pearson correlation coefficient between wt (weight) and mpg (miles per gallon). Save the single numeric value to ex_1_1 and print it.
Expected result:
#> [1] -0.8676594
Difficulty: Beginner
Correlation measures how strongly two numeric columns move together; you only need the single coefficient for these two columns.
Pass the two columns as the first two arguments; the Pearson method is the default, so no extra argument is required.
Click to reveal solution
Explanation: cor(x, y) returns a single Pearson coefficient when both arguments are numeric vectors. The strong negative value near -0.87 says that heavier cars deliver markedly fewer miles per gallon, which matches physical intuition. If you pass cor(mtcars$wt, mtcars$mpg, method = "pearson") explicitly you get the same number; "pearson" is the default.
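One possible solution, sketched with base R and the built-in mtcars data only (the original hidden solution may differ in detail):

```r
# Pearson is the default method, so no method argument is needed
ex_1_1 <- cor(mtcars$wt, mtcars$mpg)
print(ex_1_1)

# Naming the method explicitly returns the identical value
explicit <- cor(mtcars$wt, mtcars$mpg, method = "pearson")
identical(ex_1_1, explicit)
```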
Exercise 1.2: Compute correlation when one column has missing values
Task: A reporting analyst is auditing airquality and notices that Ozone and Solar.R both contain NA. Compute the Pearson correlation between Ozone and Solar.R using only the rows that have non-missing values in both columns and save the result to ex_1_2.
Expected result:
#> [1] 0.3483417
Difficulty: Intermediate
Missing values block the computation, so you must tell the function to ignore rows where either column is empty.
Add the use = "complete.obs" argument so NAs are dropped across both columns together before the coefficient is computed.
Click to reveal solution
Explanation: The use argument controls NA handling. "complete.obs" drops any row where either input is missing before computing the coefficient; "pairwise.complete.obs" does the same pair by pair, which matters when you pass a matrix or data frame with more than two columns. Without use, cor() returns NA whenever any value is missing. A common mistake is calling na.omit() on one column at a time, which misaligns the two vectors and yields either an error (unequal lengths) or a wrong number.
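A minimal sketch of one way to solve this, together with a manual check that filters both columns with the same row mask (the names ok and manual are illustrative):

```r
# Drop rows where either column is NA, then correlate
ex_1_2 <- cor(airquality$Ozone, airquality$Solar.R, use = "complete.obs")
print(ex_1_2)

# Default handling (use = "everything") propagates the missing values
no_use <- cor(airquality$Ozone, airquality$Solar.R)

# Manual equivalent: one logical mask applied to BOTH columns keeps rows aligned
ok <- complete.cases(airquality$Ozone, airquality$Solar.R)
manual <- cor(airquality$Ozone[ok], airquality$Solar.R[ok])
all.equal(ex_1_2, manual)
```

Applying a single mask to both vectors is what keeps the pairs aligned; filtering each column separately would not.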
Exercise 1.3: Contrast correlation with covariance on the same pair
Task: Using mtcars$hp and mtcars$mpg, compute both the covariance and the Pearson correlation, then assemble them into a named numeric vector with elements covariance and correlation. Save the vector to ex_1_3 so the analyst can see how scale-dependence differs between the two measures.
Expected result:
#> covariance correlation
#> -320.732056 -0.776168
Difficulty: Intermediate
You need two measures of the same pair, one scale-dependent and one standardized, bundled together under names.
Combine cov() and cor() inside a single c() call, giving the elements names covariance and correlation.
Click to reveal solution
Explanation: Correlation is covariance rescaled by the product of the standard deviations: $r = \mathrm{cov}(x, y) / (s_x s_y)$. That standardization is exactly what makes correlation comparable across pairs and unit systems; covariance changes if you switch horsepower to kilowatts, but correlation does not. When you only care about strength and direction of a linear association, use correlation; covariance matters when you need the unstandardized magnitude (for example, portfolio variance from a covariance matrix).
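The sketch below is one possible solution, followed by a check of the rescaling identity and of the unit-change claim. The 0.7457 factor is an approximate horsepower-to-kilowatt conversion used only for illustration.

```r
x <- mtcars$hp
y <- mtcars$mpg

ex_1_3 <- c(covariance = cov(x, y), correlation = cor(x, y))
print(ex_1_3)

# Correlation is covariance divided by the product of the standard deviations
r_from_cov <- cov(x, y) / (sd(x) * sd(y))

# Changing units rescales the covariance but leaves the correlation alone
x_kw <- x * 0.7457   # approximate hp -> kW conversion (illustrative)
unit_check <- c(cov_kw = cov(x_kw, y), cor_kw = cor(x_kw, y))
print(unit_check)
```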
Exercise 1.4: Implement Pearson correlation from scratch
Task: Without calling cor() or cov(), implement Pearson correlation between two numeric vectors x and y using only mean(), sum(), and sqrt(). Apply your formula to mtcars$disp and mtcars$mpg and save the result to ex_1_4. Verify it agrees with cor() to within 1e-12.
Expected result:
#> [1] -0.8475514
#> agrees with cor(): TRUE
Difficulty: Intermediate
Center both vectors on their averages, then divide the summed cross-product by the geometric mean of the two summed squared deviations.
Build the numerator with sum((x - mean(x)) * (y - mean(y))) and the denominator with sqrt() of the product of the two squared-deviation sums.
Click to reveal solution
Explanation: Pearson's formula is the centered cross-product divided by the geometric mean of the centered sums of squares. Re-deriving it once cements two ideas: the numerator equals $(n-1)\,\mathrm{cov}(x, y)$ and the denominator equals $(n-1) s_x s_y$, so the $n-1$ factors cancel and you recover $r = \mathrm{cov}(x, y) / (s_x s_y)$. Real codebases never reimplement this, but writing it once helps you reason about why cor() is unchanged when you shift either variable or rescale it by a positive constant (a negative multiplier flips the sign).
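A from-scratch sketch that follows the hint, using only mean(), sum(), and sqrt(). The function name pearson_manual is hypothetical.

```r
# Centered cross-product over the geometric mean of the centered sums of squares
pearson_manual <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

ex_1_4 <- pearson_manual(mtcars$disp, mtcars$mpg)
print(ex_1_4)

agrees <- abs(ex_1_4 - cor(mtcars$disp, mtcars$mpg)) < 1e-12
cat("agrees with cor():", agrees, "\n")
```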
Section 2. Visualizing correlations (3 problems)
Exercise 2.1: Scatter plot with the correlation annotated in the title
Task: A junior analyst wants a publication-ready chart that shows the relationship between wt and mpg in mtcars. Build a ggplot scatter plot with a linear smoother (no confidence ribbon) and embed the rounded Pearson correlation in the plot title. Save the ggplot object to ex_2_1.
Expected result:
# scatter of wt vs mpg with geom_smooth(method = "lm", se = FALSE)
# title reads: "Weight vs MPG (r = -0.87)"
# x axis: Weight (1000 lbs), y axis: Miles per gallon
Difficulty: Intermediate
Build a scatter plot, add a straight-line trend without its uncertainty band, and place the rounded coefficient into the title text.
Combine geom_point() with geom_smooth(method = "lm", se = FALSE), then assemble the title string with paste0() inside labs().
Click to reveal solution
Explanation: Annotating the correlation directly in the title removes the back-and-forth between the chart and a separate stats table. Two small choices matter: se = FALSE keeps the ribbon off (it has nothing to do with $r$ and clutters the plot), and round(r, 2) matches the precision a reader can actually distinguish visually. For an in-panel annotation rather than a title, swap labs(title = ...) for annotate("text", x = 5, y = 30, label = paste("r =", r)).
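One way the plot could be built, assuming the ggplot2 package is installed. The explicit formula = y ~ x only silences the smoother's informational message.

```r
library(ggplot2)

r <- cor(mtcars$wt, mtcars$mpg)

ex_2_1 <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # straight-line trend without the confidence ribbon
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  labs(
    title = paste0("Weight vs MPG (r = ", round(r, 2), ")"),
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

print(ex_2_1)
```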
Exercise 2.2: Build a pairs plot for four numeric variables
Task: A product analyst is exploring relationships across mtcars and wants a quick all-pairs scatter matrix limited to mpg, hp, wt, and qsec. Save the four-column subset to ex_2_2, then produce a base R pairs plot from it. Because pairs() draws as a side effect and returns NULL invisibly, the saved object is the data subset, not the plot output.
Expected result:
# 4x4 scatter matrix in the plotting window
# diagonal: variable names mpg, hp, wt, qsec
# off-diagonal: scatter plots, e.g. mpg vs hp top right
Difficulty: Intermediate
You want an all-pairs scatter grid restricted to four chosen columns, with the saved object being the data subset itself.
Subset the four columns into ex_2_2 first, then draw the grid by calling pairs() on that subset.
Click to reveal solution
Explanation: A pairs plot is the fastest way to spot non-linear, monotonic-but-curved, or grouped relationships before you ever compute a coefficient. The diagonal labels each column; the off-diagonal panels show every pairwise scatter. If you see a clear curve (for example hp vs mpg bending), a Pearson coefficient will understate the association and you should switch to Spearman or transform a variable. GGally::ggpairs() adds densities and correlations to the same grid if you prefer the ggplot2 look.
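A short base R sketch of one possible solution. The second call, which overlays a smoother via the built-in panel.smooth panel function, is an optional extra rather than part of the task.

```r
# Save the data subset first, then draw from it
ex_2_2 <- mtcars[, c("mpg", "hp", "wt", "qsec")]

# pairs() is called for its side effect (the plot)
pairs(ex_2_2)

# Optional: add a smoother to each panel to make curvature easier to spot
pairs(ex_2_2, panel = panel.smooth)
```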
Exercise 2.3: Build a correlation heatmap with hierarchical clustering
Task: A risk team is preparing a one-page summary of how the eleven numeric columns of mtcars co-move. Compute the full Pearson correlation matrix and pass it to ggcorrplot::ggcorrplot() with hierarchical clustering, lower triangle only, and the coefficient values printed in each tile. Save the ggplot object to ex_2_3.
Expected result:
# triangular correlation heatmap
# variables reordered by hclust so visually related blocks sit together
# each tile shows a coefficient rounded to 2 decimals
# with the default palette: red = positive, blue = negative, white = near zero
Difficulty: Advanced
Compute the full correlation matrix first, then render it as a clustered triangular heatmap with the coefficient values shown in each tile.
Pass the matrix to ggcorrplot() with hc.order = TRUE, type = "lower", and lab = TRUE.
Click to reveal solution
Explanation: hc.order = TRUE reorders rows and columns using hierarchical clustering on the correlation distance, so highly correlated blocks visually cluster together. That clustering is what turns a heatmap from "pretty colors" into a real diagnostic for finding latent factor structure. type = "lower" halves the visual load since the matrix is symmetric. For very wide matrices (20+ variables), turn off lab to avoid label collisions; for narrow ones, keeping the numbers is worth the extra ink.
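A minimal sketch of the call described in the hint, assuming the ggcorrplot package (and its ggplot2 dependency) is installed:

```r
library(ggcorrplot)

# Full 11-by-11 Pearson matrix
cm <- cor(mtcars)

ex_2_3 <- ggcorrplot(
  cm,
  hc.order = TRUE,   # reorder variables by hierarchical clustering
  type = "lower",    # show the lower triangle only
  lab = TRUE         # print the coefficient in each tile
)

print(ex_2_3)
```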
Section 3. Correlation matrices (3 problems)
Exercise 3.1: Compute the full correlation matrix of mtcars
Task: Compute the full pairwise Pearson correlation matrix across every column of mtcars. Save the 11-by-11 matrix to ex_3_1. Round to two decimal places when you print it so the output stays readable in a console.
Expected result:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> mpg 1.00 -0.85 -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55
#> cyl -0.85 1.00 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53
#> disp -0.85 0.90 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39
#> hp -0.78 0.83 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75
#> ...
#> # 7 more rows hidden
Difficulty: Beginner
Feeding an entire dataset, rather than two single columns, returns every pairwise coefficient at once.
Call cor() on the whole mtcars data frame to get the 11-by-11 matrix.
Click to reveal solution
Explanation: When cor() receives a data frame or matrix, it returns the full symmetric matrix of pairwise correlations. The diagonal is always 1 (every variable correlates perfectly with itself). Use round() purely for display; the underlying matrix keeps full precision. If any column is non-numeric, cor() errors; pre-filter with mtcars |> select(where(is.numeric)) in mixed-type frames.
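One possible solution, plus a base R way to pre-filter a mixed-type frame (iris is used here only because it contains a factor column):

```r
ex_3_1 <- cor(mtcars)

# Round for display only; ex_3_1 keeps full precision
print(round(ex_3_1, 2))

# Mixed-type frame: keep the numeric columns before calling cor()
iris_num <- iris[vapply(iris, is.numeric, logical(1))]
iris_cor <- cor(iris_num)
```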
Exercise 3.2: Filter strong correlations from a matrix
Task: A factor research team needs to flag every variable pair in mtcars whose absolute Pearson correlation exceeds 0.8 (excluding self-pairs). Return a tibble with columns var1, var2, and r, ordered by descending absolute correlation. Save the result to ex_3_2.
Expected result:
#> # A tibble: 7 x 3
#> var1 var2 r
#> <chr> <chr> <dbl>
#> 1 disp cyl 0.902
#> 2 wt disp 0.888
#> 3 wt mpg -0.868
#> 4 cyl mpg -0.852
#> 5 disp mpg -0.848
#> ...
#> # 2 more rows hidden
Difficulty: Intermediate
Drop the symmetric duplicates and self-pairs, reshape the matrix into long rows, then keep only the strong pairs sorted by magnitude.
Blank the upper triangle with upper.tri(), melt via as.table() then as.data.frame(), and apply filter(abs(r) > 0.8) with arrange(desc(abs(r))).
Click to reveal solution
Explanation: Setting the upper triangle and diagonal to NA is a clean way to avoid double-counting symmetric pairs and self-correlations. as.table() then as.data.frame() melts the matrix into long form, which is much easier to filter and sort than a wide square matrix. The pattern generalizes to any pairwise-association screen: collinearity audits, gene-gene co-expression, asset-return clustering.
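The sketch below follows the same melt-and-filter idea but stays in base R, so it returns a plain data frame instead of a tibble; wrapping the result in tibble::as_tibble() would give the requested class if the tibble package is available.

```r
cm <- cor(mtcars)

# Keep each pair once and drop self-pairs
cm[upper.tri(cm, diag = TRUE)] <- NA

# Melt to long form: one row per (row variable, column variable) cell
long <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
names(long) <- c("var1", "var2", "r")

# Keep strong pairs and sort by absolute correlation
strong <- long[!is.na(long$r) & abs(long$r) > 0.8, ]
ex_3_2 <- strong[order(-abs(strong$r)), ]
rownames(ex_3_2) <- NULL

print(ex_3_2)
```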
Exercise 3.3: Compute cross-correlations between two variable groups
Task: Treat mtcars columns mpg, hp, wt, qsec as performance variables and cyl, gear, carb, am as design variables. Compute the 4-by-4 matrix of Pearson correlations between every performance column and every design column. Save the rectangular matrix to ex_3_3.
Expected result:
#> cyl gear carb am
#> mpg -0.852 0.480 -0.551 0.600
#> hp 0.832 -0.126 0.750 -0.243
#> wt 0.782 -0.583 0.428 -0.692
#> qsec -0.591 -0.213 -0.656 -0.230
Difficulty: Intermediate
Splitting the columns into two groups and correlating one group against the other produces a rectangular, not square, matrix.
Subset the two column sets into separate data frames, then pass both to cor(x, y) as the two arguments.
Click to reveal solution
Explanation: When you pass two data frames to cor(x, y), it returns a rectangular matrix with rows from x and columns from y. That avoids computing within-group correlations you do not care about. The pattern is heavily used in canonical correlation setups, between-block PLS, and any screen where you want to see how one block of variables relates to another (for example, sensor channels vs. operating-condition tags).
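A minimal sketch of one possible solution; perf and design are illustrative names for the two column blocks.

```r
perf   <- mtcars[, c("mpg", "hp", "wt", "qsec")]
design <- mtcars[, c("cyl", "gear", "carb", "am")]

# Rows come from the first argument, columns from the second
ex_3_3 <- cor(perf, design)
print(round(ex_3_3, 3))

# Same numbers as the matching block of the full matrix
full_block <- cor(mtcars)[names(perf), names(design)]
all.equal(ex_3_3, full_block)
```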
Section 4. Rank-based correlation: Spearman and Kendall (3 problems)
Exercise 4.1: Compare Pearson and Spearman on a non-linear monotonic pair
Task: Build a vector x <- 1:30 and a non-linear but monotonically increasing y <- exp(x / 10). Compute both the Pearson and Spearman correlations and assemble them into a named vector ex_4_1. The result will show why Spearman is the right tool when the relationship is monotonic but not linear.
Expected result:
#> pearson spearman
#> 0.9364 1.0000
Difficulty: Intermediate
One coefficient reacts to curvature while the other cares only about rank order; compute both on the same pair to see the contrast.
Call cor() twice with method = "pearson" and method = "spearman", bundled into a named c().
Click to reveal solution
Explanation: Pearson measures linear association, so any curvature drags the coefficient below 1 even when the relationship is perfectly monotonic. Spearman correlates the ranks of x and y, and because exponentiation preserves order, ranks agree exactly: Spearman = 1. Switching to Spearman is the fix whenever a scatter plot shows a clean curve rather than a straight line, or when one variable is on a log/exponential scale.
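One possible solution, with two extra checks that illustrate the explanation: Spearman is Pearson computed on ranks, and taking logs straightens this particular curve.

```r
x <- 1:30
y <- exp(x / 10)

ex_4_1 <- c(
  pearson  = cor(x, y, method = "pearson"),
  spearman = cor(x, y, method = "spearman")
)
print(ex_4_1)

# Spearman is Pearson applied to the ranks
on_ranks <- cor(rank(x), rank(y))

# Undoing the exponential makes the relationship exactly linear
on_log <- cor(x, log(y))
```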
Exercise 4.2: Compute Kendall's tau on small ordinal data
Task: Two judges score eight figure-skating routines on integer scales. Build the inline vectors judge_a <- c(7, 4, 9, 6, 8, 3, 5, 2) and judge_b <- c(6, 5, 9, 7, 8, 2, 4, 3) and compute Kendall's tau between them. Save the single scalar to ex_4_2.
Expected result:
#> [1] 0.7857143
Difficulty: Intermediate
This coefficient is built by counting how often the two rankings agree versus disagree on the order of pairs.
Call cor() on the two judge vectors with method = "kendall".
Click to reveal solution
Explanation: Kendall's tau counts concordant minus discordant pairs and divides by the total number of pairs, so it has a direct probabilistic interpretation: if you pick two routines at random, $(1 + \tau)/2$ is the probability the two judges agree on their relative order. Here 25 of the 28 pairs are concordant and 3 are discordant, so $\tau = 22/28 \approx 0.79$, which says the two judges agree on the relative order in roughly 89 percent of pair comparisons. With small $n$ and ordinal data (rankings, Likert ratings), Kendall is generally considered more robust to ties and outliers than Spearman.
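The sketch below computes tau with cor() and then recounts the concordant and discordant pairs by hand. The pair-counting helper variables are illustrative, and the simple formula applies because this data has no ties.

```r
judge_a <- c(7, 4, 9, 6, 8, 3, 5, 2)
judge_b <- c(6, 5, 9, 7, 8, 2, 4, 3)

ex_4_2 <- cor(judge_a, judge_b, method = "kendall")
print(ex_4_2)

# Enumerate every pair of routines (columns of a 2-row index matrix)
idx <- combn(length(judge_a), 2)

# +1 when both judges order the pair the same way, -1 when they disagree
agree <- sign(judge_a[idx[1, ]] - judge_a[idx[2, ]]) *
         sign(judge_b[idx[1, ]] - judge_b[idx[2, ]])

concordant <- sum(agree > 0)
discordant <- sum(agree < 0)
tau_manual <- (concordant - discordant) / ncol(idx)

# Share of pairs on which the judges agree
p_agree <- (1 + ex_4_2) / 2
```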
Exercise 4.3: Compare three correlation methods on the iris numeric block
Task: A botanist is choosing between Pearson, Spearman, and Kendall for reporting the association between Sepal.Length and Petal.Length across all 150 iris rows. Build a tibble named ex_4_3 with columns method and r holding all three coefficients so the team can compare them at a glance.
Expected result:
#> # A tibble: 3 x 2
#> method r
#> <chr> <dbl>
#> 1 pearson 0.872
#> 2 spearman 0.882
#> 3 kendall 0.719
Difficulty: Intermediate
Compute the same association three different ways and lay the results out as a two-column table of method name and coefficient.
Build a tibble() with a method column and an r column holding three cor() calls that vary the method argument.
Click to reveal solution
Explanation: Spearman and Pearson agree closely here because the relationship in iris is roughly linear; Kendall's tau is on a different scale (it counts pair concordances), so its absolute value is typically smaller than Spearman's for the same data, even when both indicate the same direction. Report Pearson when you have ruled out non-linearity and outliers, Spearman when the relationship is monotonic but not linear, and Kendall when ties are common or sample size is small.
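A base R sketch of the comparison table. It builds a plain data frame rather than a tibble to avoid a package dependency; tibble::tibble() accepts the same two columns if you want the requested class.

```r
method_names <- c("pearson", "spearman", "kendall")

ex_4_3 <- data.frame(
  method = method_names,
  # one cor() call per method; USE.NAMES = FALSE keeps row names clean
  r = vapply(
    method_names,
    function(m) cor(iris$Sepal.Length, iris$Petal.Length, method = m),
    numeric(1),
    USE.NAMES = FALSE
  )
)

print(ex_4_3)
```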
Section 5. Testing correlation significance (3 problems)
Exercise 5.1: Run cor.test and extract p-value and estimate
Task: A quality team wants a quick significance check on the correlation between Sepal.Length and Petal.Width in iris. Run cor.test() and pull out a named numeric vector with elements estimate and p_value. Save the vector to ex_5_1 so it slots straight into a status table.
Expected result:
#> estimate p_value
#> 8.179411e-01 2.325498e-37
Difficulty: Beginner
Run the significance test once, then pick the coefficient and the p-value out of the resulting object by name.
Save cor.test() to a variable and read $estimate and $p.value into a named vector.
Click to reveal solution
Explanation: cor.test() returns an htest object with $estimate, $p.value, $conf.int, and $statistic. Pulling fields by name is more robust than parsing the printed output, especially when you are piping results into a downstream report or dashboard. The vanishingly small p-value here is unsurprising: with $n = 150$ and an estimate near 0.82, the test has overwhelming power. Always look at the estimate alongside the p-value, since a tiny p-value with a tiny estimate just means the sample was large.
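One possible solution. The unname() call is a small design choice: $estimate carries its own name ("cor"), which would otherwise turn the element name into estimate.cor.

```r
ct <- cor.test(iris$Sepal.Length, iris$Petal.Width)

ex_5_1 <- c(
  estimate = unname(ct$estimate),  # drop the inner "cor" name
  p_value  = ct$p.value
)
print(ex_5_1)

# The object is a list, so its fields can be listed by name
names(ct)
```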
Exercise 5.2: Pull the 95 percent confidence interval for a correlation
Task: Using the same Sepal.Length vs Petal.Width pair in iris, compute the 95 percent confidence interval for Pearson's $r$ via cor.test() and save it as a length-2 numeric vector with names lower and upper to ex_5_2.
Expected result:
#> lower upper
#> 0.7568971 0.8648361
Difficulty: Intermediate
The same significance test also reports an interval estimate for the true correlation; you just need to extract and label it.
Read $conf.int from the cor.test() object and name its two elements lower and upper with setNames().
Click to reveal solution
Explanation: cor.test() builds the CI by applying Fisher's z transformation, computing the standard CI on the z scale, and back-transforming via $\tanh$. The CI shrinks fast as $n$ grows; a sample of 150 produces a band only about 0.11 wide. Report the CI rather than just the point estimate when you want to communicate uncertainty, especially for moderate sample sizes where a 0.30 correlation might come with a CI that brushes zero.
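A minimal sketch of the extraction. as.numeric() strips the conf.level attribute that cor.test() attaches to the interval.

```r
ct <- cor.test(iris$Sepal.Length, iris$Petal.Width, conf.level = 0.95)

ex_5_2 <- setNames(as.numeric(ct$conf.int), c("lower", "upper"))
print(ex_5_2)

# Width of the interval on the correlation scale
ci_width <- unname(ex_5_2["upper"] - ex_5_2["lower"])
```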
Exercise 5.3: Derive a Fisher z confidence interval by hand
Task: Given a Pearson correlation $r = 0.6$ from a sample of $n = 30$ paired observations, derive the 95 percent confidence interval for the population correlation by applying Fisher's z transformation, working on the z scale, and back-transforming. Save the length-2 numeric vector c(lower, upper) to ex_5_3 without calling cor.test().
Expected result:
#> [1] 0.3058421 0.7895902
Difficulty: Advanced
Move the correlation onto a scale where its standard error is stable, build a symmetric interval there, then move back.
Use atanh() for the transform, a standard error of 1 / sqrt(n - 3), qnorm(0.975) as the multiplier, and tanh() to back-transform.
Click to reveal solution
Explanation: Fisher's transformation $z = \tanh^{-1}(r)$ stabilizes the variance of the sampling distribution of $r$: on the $z$ scale the standard error is approximately $1/\sqrt{n-3}$, regardless of the true correlation. You build the symmetric CI in $z$, then back-transform with $\tanh$ to land in the $[-1, 1]$ scale. This is exactly what cor.test() does internally; running it by hand once cements why the CI is asymmetric in the original scale (wider on the side closer to 0).
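The three steps (transform, build the interval, back-transform) can be sketched directly from the hint. The last line shows the asymmetry on the original scale.

```r
r <- 0.6
n <- 30

z    <- atanh(r)            # Fisher z transform
se   <- 1 / sqrt(n - 3)     # approximate standard error on the z scale
z_ci <- z + c(-1, 1) * qnorm(0.975) * se

ex_5_3 <- tanh(z_ci)        # back to the correlation scale
print(ex_5_3)

# Distance from r to each limit: the side nearer zero is wider
gaps <- c(below = r - ex_5_3[1], above = ex_5_3[2] - r)
print(gaps)
```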
Section 6. Advanced: partial, robust, and bootstrap (4 problems)
Exercise 6.1: Compute partial correlation controlling for a third variable
Task: Heavier cars also tend to have larger engines, so the raw correlation between wt and mpg confounds engine size. Compute the partial correlation between wt and mpg in mtcars while controlling for disp (displacement). Use ppcor::pcor.test() and save a named vector with estimate and p_value to ex_6_1.
Expected result:
#> estimate p_value
#> -0.4714 0.0074
Difficulty: Advanced
Strip the influence of the third variable out of both columns before measuring how the leftover parts relate.
Call ppcor::pcor.test() with the two variables of interest plus the control variable, then pull $estimate and $p.value.
Click to reveal solution
Explanation: Partial correlation residualizes both wt and mpg against disp and then correlates the residuals. The raw Pearson $r$ between wt and mpg is about -0.87; the partial $r$ controlling for disp shrinks to about -0.47, which says a meaningful chunk of the apparent weight effect overlaps with displacement. Equivalently, you could fit lm(wt ~ disp) and lm(mpg ~ disp), correlate the residuals, and reproduce the same number; ppcor::pcor.test() just adds the inference.
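The residual route can be sketched in base R without the ppcor package. The p-value here uses a t reference with $n - 2 - k$ degrees of freedom for $k$ control variables, which is the usual test for a partial correlation; treat its agreement with ppcor::pcor.test() as an assumption to verify rather than a guarantee.

```r
# Remove the linear influence of disp from both variables
res_wt  <- resid(lm(wt ~ disp, data = mtcars))
res_mpg <- resid(lm(mpg ~ disp, data = mtcars))

# Partial correlation = correlation of the residuals
r_partial <- cor(res_wt, res_mpg)

# t test with n - 2 - k degrees of freedom (k = 1 control variable)
n <- nrow(mtcars)
k <- 1
t_stat  <- r_partial * sqrt((n - 2 - k) / (1 - r_partial^2))
p_value <- 2 * pt(-abs(t_stat), df = n - 2 - k)

ex_6_1 <- c(estimate = r_partial, p_value = p_value)
print(ex_6_1)
```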
Exercise 6.2: Bootstrap a 95 percent CI for Pearson r
Task: A reviewer asks for a non-parametric 95 percent CI on the correlation between wt and mpg in mtcars, in case Fisher's z assumptions are shaky. Draw 2000 bootstrap resamples (with replacement, same size as the original), compute Pearson $r$ on each, take the 2.5 and 97.5 percentiles, and save the length-2 numeric vector c(lower, upper) to ex_6_2. Set set.seed(42) first for reproducibility.
Expected result:
#> 2.5% 97.5%
#> -0.9296055 -0.7639175
Difficulty: Advanced
Resample the paired rows many times, recompute the coefficient on each draw, and read the interval off the spread of those values.
Use replicate() around sample.int(n, replace = TRUE) to index rows, then quantile() at probs c(0.025, 0.975).
Click to reveal solution
Explanation: The percentile bootstrap avoids the bivariate-normal assumption behind Fisher's z: you resample the rows (not the columns) to keep $x$ and $y$ paired, recompute $r$ on each resample, and read off quantiles. When the Fisher's z CI and the bootstrap CI roughly agree, you can report either; when they disagree, check the data for outliers or heavy tails, and prefer the bootstrap if the normality assumption looks doubtful. For correlations near $\pm 1$ both intervals will be tight. Keep in mind that with a sample as small as $n = 32$ the percentile interval is itself only approximate.
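A minimal percentile-bootstrap sketch following the hint. The exact limits depend on the seed and on R's random number generator settings, so no output is shown here.

```r
set.seed(42)
n <- nrow(mtcars)

# Resample row indices so each (wt, mpg) pair stays together
boot_r <- replicate(2000, {
  idx <- sample.int(n, replace = TRUE)
  cor(mtcars$wt[idx], mtcars$mpg[idx])
})

ex_6_2 <- quantile(boot_r, probs = c(0.025, 0.975))
print(ex_6_2)
```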
Exercise 6.3: Compare two correlations from independent samples
Task: A team is checking whether the wt vs mpg correlation differs between automatic (am == 0) and manual (am == 1) cars in mtcars. Compute Pearson $r$ in each subgroup, then apply Fisher's z test for two independent correlations and return a named vector with r_auto, r_manual, and p_value. Save it to ex_6_3.
Expected result:
#> r_auto r_manual p_value
#> -0.6975356 -0.8915209 0.1577874
Difficulty: Advanced
Compute the coefficient within each subgroup, then test whether the gap between them is larger than sampling noise.
Transform each r with atanh(), divide the difference by sqrt(1/(n1-3) + 1/(n2-3)), and convert to a p-value with pnorm().
Click to reveal solution
Explanation: Comparing two correlations from independent samples is a Fisher-z test: transform each $r$, divide the difference by $\sqrt{1/(n_1-3) + 1/(n_2-3)}$, and use a normal reference. The two subgroup correlations look numerically different (-0.70 vs -0.89), but the test p-value of about 0.16 says that gap is well within sampling noise given small group sizes (19 and 13). The general lesson: eyeballed differences between subgroup correlations need a test, especially when subgroups are small.
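The test described above can be sketched in a few lines of base R; auto and manual are illustrative names for the two subsets.

```r
auto   <- mtcars[mtcars$am == 0, ]
manual <- mtcars[mtcars$am == 1, ]

r_auto   <- cor(auto$wt, auto$mpg)
r_manual <- cor(manual$wt, manual$mpg)

# Difference of Fisher z values over its standard error
se_diff <- sqrt(1 / (nrow(auto) - 3) + 1 / (nrow(manual) - 3))
z_stat  <- (atanh(r_auto) - atanh(r_manual)) / se_diff

# Two-sided p-value from the standard normal
p_value <- 2 * pnorm(-abs(z_stat))

ex_6_3 <- c(r_auto = r_auto, r_manual = r_manual, p_value = p_value)
print(ex_6_3)
```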
Exercise 6.4: Detect non-linear association where Pearson is near zero
Task: Build x <- seq(-3, 3, length.out = 200) and y <- x^2 + rnorm(200, sd = 0.1) after setting set.seed(7). Compute both Pearson and Spearman correlation between x and y, then the Pearson correlation between abs(x) and y to detect the symmetric quadratic association. Save a named vector with pearson, spearman, and r_on_abs_x to ex_6_4.
Expected result:
#> pearson spearman r_on_abs_x
#> 0.003204528 -0.020610028 0.998790907
Difficulty: Intermediate
A symmetric curve hides from ordinary correlation, so also relate the response to the magnitude of the predictor instead of its signed value.
Compute cor() with method = "pearson" and "spearman", then a third cor() call that uses abs(x).
Click to reveal solution
Explanation: Pearson and Spearman both miss this association entirely because the relationship is symmetric (large $|x|$ produces large $y$, regardless of sign). Always look at a scatter plot before declaring "no correlation"; near-zero $r$ often hides U-shapes, V-shapes, or other symmetric patterns. Once you spot the symmetry, correlating $y$ with $|x|$ or $x^2$, or fitting lm(y ~ poly(x, 2)), recovers the real signal; monotone transforms of $y$ alone, such as $\log y$ or $\sqrt{y}$, do not help because the symmetry in $x$ remains. This is one of the most common ways correlation analysis goes wrong in practice.
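One possible solution, with a quadratic fit added as a cross-check. Exact values depend on the seed and the random number generator, so none are shown.

```r
set.seed(7)
x <- seq(-3, 3, length.out = 200)
y <- x^2 + rnorm(200, sd = 0.1)

ex_6_4 <- c(
  pearson    = cor(x, y, method = "pearson"),
  spearman   = cor(x, y, method = "spearman"),
  r_on_abs_x = cor(abs(x), y)   # relate y to the magnitude of x
)
print(ex_6_4)

# A quadratic model captures the curve that the coefficients above miss
r2_quadratic <- summary(lm(y ~ poly(x, 2)))$r.squared
```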
What to do next
- Correlation in R: the parent tutorial covering the full theory and the cor.test workflow.
- Linear Regression Exercises in R: the natural next step after correlation diagnostics.
- Hypothesis Testing Exercises in R: generalizes the cor.test idea to t-tests, proportion tests, and beyond.
- Visualization Exercises in R: practice the scatter, pairs, and heatmap idioms used here.