Cross Validation Exercises in R: 20 Real-World Practice Problems
Twenty publication-grade problems on k-fold, LOOCV, repeated, stratified, group, time-series, and nested cross-validation in R. Solutions are hidden behind a click and use both caret and rsample for resampling.
Section 1. CV Foundations (4 problems)
Exercise 1.1: Build a manual 5-fold index vector for a small dataset
Task: A new analyst is learning resampling and wants to assign every row of mtcars (32 rows) to one of 5 folds so each fold has 6 or 7 rows. Use sample() on rep(1:5, length.out = nrow(mtcars)) with set.seed(1) to produce a length-32 integer vector and save it to ex_1_1.
Expected result:
#> ex_1_1
#> [1] 1 4 1 1 5 3 2 4 5 5 4 2 2 3 1 3 5 2 1 1 2 4 4 5 3 3 4 5 3 1 2 2
#> table(ex_1_1)
#> 1 2 3 4 5
#> 7 7 6 6 6
Difficulty: Beginner
A balanced fold vector recycles the fold IDs in order, then a random permutation decides which row lands in which fold.
Build the ordered IDs with rep(1:5, length.out = nrow(mtcars)) and shuffle them with sample() after calling set.seed(1).
Explanation: rep(1:5, length.out = n) builds a balanced fold vector that recycles the IDs in order; sample() then permutes it so the folds are random but still balanced. Using sample(1:5, 32, replace = TRUE) would be wrong: it draws independently, so fold sizes drift away from n/k and one fold can easily be twice the size of another. Always pair fold construction with set.seed() for reproducibility.
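The construction described above can be sketched in a few lines of base R. The `unbalanced` vector is a hypothetical name added here only to contrast the two approaches; the printed values depend on your R version's sampling algorithm.

```r
# Balanced fold IDs: recycle 1:5 to length 32, then shuffle the positions.
set.seed(1)
n <- nrow(mtcars)
ex_1_1 <- sample(rep(1:5, length.out = n))
table(ex_1_1)      # fold sizes are fixed by rep(): two folds of 7, three of 6

# Contrast (not recommended): independent draws let fold sizes drift.
set.seed(1)
unbalanced <- sample(1:5, n, replace = TRUE)
table(unbalanced)
```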
Exercise 1.2: Compute a holdout RMSE for a linear model on mtcars
Task: Split mtcars 70/30 into train and test using caret::createDataPartition() with set.seed(7) on the mpg outcome. Fit lm(mpg ~ wt + hp) on the training set, predict on the test set, then compute root mean squared error. Save the scalar RMSE to ex_1_2.
Expected result:
#> ex_1_2
#> [1] 3.012
Difficulty: Beginner
A holdout estimate fits the model on one chunk of rows and scores it on the rows it never saw during training.
Get the training rows from createDataPartition(mtcars$mpg, p = 0.7, list = FALSE), then predict() on the test set and take sqrt(mean((actual - preds)^2)).
Explanation: A single holdout split gives one estimate of generalization error and is fast, but the variance is huge on a 32-row dataset. The exact RMSE you observe depends entirely on which 23 rows ended up in train. That is why k-fold CV (next section) averages over multiple splits: the mean over folds is what you trust, not any single split. createDataPartition stratifies on the outcome quantiles, which keeps the train/test mpg distributions similar.
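A minimal sketch of the holdout workflow, assuming caret is installed. The object names `train_idx`, `train_set`, and `test_set` are illustrative, and the RMSE you get may differ from the expected result depending on package and R versions.

```r
library(caret)

set.seed(7)
# Row indices for the training partition, stratified on mpg quantile groups.
train_idx <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
train_set <- mtcars[train_idx, ]
test_set  <- mtcars[-train_idx, ]

# Fit on the training rows only, then score on rows the model never saw.
fit   <- lm(mpg ~ wt + hp, data = train_set)
preds <- predict(fit, newdata = test_set)
ex_1_2 <- sqrt(mean((test_set$mpg - preds)^2))
ex_1_2
```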
Exercise 1.3: Hand-rolled LOOCV for a linear model
Task: The instructor wants every student to implement leave-one-out CV from scratch before reaching for a package. Loop over the 32 rows of mtcars, hold each row out as the test set, refit lm(mpg ~ wt + hp) on the other 31, and store the prediction error. Save the LOOCV mean squared error scalar to ex_1_3.
Expected result:
#> ex_1_3
#> [1] 6.972
Difficulty: Intermediate
Hold out one row at a time, refit on all the others, and accumulate that row's squared prediction error.
Loop with for (i in seq_len(n)), fit on mtcars[-i, ], predict mtcars[i, , drop = FALSE], then take mean() of the stored squared errors.
Explanation: LOOCV uses every row as a test point exactly once, so there is no randomness and no seed is needed: it always returns the same number. The cost is n model fits. For ordinary least squares there is a closed-form shortcut using the hat matrix that avoids the loop entirely; running through for here is a warm-up so the k-fold exercises in Section 2 are just a generalization of the same skeleton.
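One way to write the loop, followed by a check against the hat-matrix shortcut mentioned above. The shortcut relies on the standard OLS identity that the leave-one-out residual equals e_i / (1 - h_ii); `loocv_shortcut` is a name introduced here for illustration.

```r
n <- nrow(mtcars)
sq_err <- numeric(n)

for (i in seq_len(n)) {
  # Refit on the 31 remaining rows, then predict the single held-out row.
  fit_i  <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  pred_i <- predict(fit_i, newdata = mtcars[i, , drop = FALSE])
  sq_err[i] <- (mtcars$mpg[i] - pred_i)^2
}
ex_1_3 <- mean(sq_err)

# Closed form: one fit on all rows, residuals rescaled by 1 - leverage.
full_fit <- lm(mpg ~ wt + hp, data = mtcars)
loocv_shortcut <- mean((residuals(full_fit) / (1 - hatvalues(full_fit)))^2)

all.equal(ex_1_3, loocv_shortcut)
```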
Exercise 1.4: Convert the LOOCV result to RMSE and compare to holdout
Task: Take the LOOCV MSE from the previous exercise (recompute it inside), convert it to RMSE by taking the square root, then build a 2-row tibble with columns method and rmse showing the LOOCV result alongside the holdout RMSE from Exercise 1.2 (use the value 3.012 directly). Save the tibble to ex_1_4.
Expected result:
#> ex_1_4
#> # A tibble: 2 x 2
#> method rmse
#> <chr> <dbl>
#> 1 loocv 2.64
#> 2 holdout 3.01
Difficulty: Intermediate
Root mean squared error is just the square root of mean squared error; lay both methods out side by side for comparison.
Recompute the LOOCV loop, apply sqrt() to its MSE, then assemble a tibble() with method and rmse columns.
Explanation: LOOCV fits on 31 rows every time instead of the roughly 23 rows the holdout model saw, so its training set is closer to the full dataset and its bias is smaller. On a 32-row dataset that bias dominates: the holdout estimate of 3.01 is biased upward because its fit saw only about 70% of the data. The trade-off LOOCV pays is higher variance: each of the 32 fits sees almost the same data, so the 32 errors are correlated. K-fold with k = 5 or k = 10 is the practical compromise.
Section 2. K-fold and LOOCV by hand (3 problems)
Exercise 2.1: Compute 5-fold CV RMSE for a linear model on mtcars
Task: Using the fold vector from Exercise 1.1 (rebuild it with the same seed inside this exercise), loop over folds 1 through 5. For each fold, train lm(mpg ~ wt + hp) on the other 4 folds and predict the held-out fold. Compute the RMSE across all held-out predictions in one shot. Save the scalar to ex_2_1.
Expected result:
#> ex_2_1
#> [1] 2.752
Difficulty: Intermediate
Train on every fold but one, predict the held-out fold, and pool all out-of-fold predictions before scoring once.
Loop for (k in 1:5), fit on mtcars[folds != k, ], fill preds[folds == k], then take sqrt(mean((mtcars$mpg - preds)^2)).
Explanation: Pooling all 32 out-of-fold predictions into a single RMSE is the standard "stacked" version of k-fold and is what most papers report. The alternative is computing RMSE inside each fold and averaging the 5 values. That gives a slightly different number for two reasons: the folds have unequal sizes, so the unweighted mean over-weights the small folds, and the mean of square roots is not the square root of the mean. The stacked RMSE is the square root of the fold-size-weighted mean of the per-fold MSEs, so it is the safer default.
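The sketch below computes the pooled RMSE and, for comparison, the per-fold quantities discussed above. The names `fold_mse`, `weighted_rmse`, and `mean_of_fold_rmse` are illustrative additions, not part of the exercise.

```r
set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(mtcars)))

preds    <- numeric(nrow(mtcars))
fold_mse <- numeric(5)

for (k in 1:5) {
  held_out <- folds == k
  fit_k <- lm(mpg ~ wt + hp, data = mtcars[!held_out, ])
  preds[held_out] <- predict(fit_k, newdata = mtcars[held_out, ])
  fold_mse[k] <- mean((mtcars$mpg[held_out] - preds[held_out])^2)
}

# Pooled ("stacked") RMSE over all 32 out-of-fold predictions.
ex_2_1 <- sqrt(mean((mtcars$mpg - preds)^2))

# Same number, rebuilt as a fold-size-weighted mean of per-fold MSEs.
fold_n <- as.integer(table(folds))
weighted_rmse <- sqrt(sum(fold_n * fold_mse) / sum(fold_n))

# The alternative: unweighted mean of per-fold RMSEs (usually slightly different).
mean_of_fold_rmse <- mean(sqrt(fold_mse))
c(pooled = ex_2_1, weighted = weighted_rmse, fold_mean = mean_of_fold_rmse)
```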
Exercise 2.2: Repeated 5-fold CV with 10 repeats by hand
Task: Wrap the previous loop in an outer loop over 10 repeats: each repeat reshuffles the fold assignment using a different seed (1 through 10), then computes one stacked 5-fold RMSE per repeat. Return a length-10 numeric vector of RMSEs. Save it to ex_2_2.
Expected result:
#> ex_2_2
#> [1] 2.752 2.812 2.910 2.681 2.847 2.793 2.851 2.690 2.778 2.756
#> mean(ex_2_2)
#> [1] 2.787
Difficulty: Intermediate
Wrap the single-pass fold loop in an outer loop so each repeat reshuffles the folds and records its own RMSE.
Loop for (rep in 1:10), call set.seed(rep) then rebuild folds with sample(), storing each sqrt(mean(...)) into a length-10 vector.
Explanation: Repeated k-fold reduces the variance of the CV estimate by averaging over the random fold assignment. Look at the spread of the 10 values: roughly 2.68 to 2.91. Any one fold split could mislead you by 0.1 RMSE, which is more than the gap between many candidate models. The conventional recipe is 5-fold or 10-fold, repeated 5 to 10 times. For small datasets like mtcars, repeats matter most; for large datasets a single 10-fold is usually enough.
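A compact way to express the repeat loop is to move the single-pass k-fold into a helper and call it once per seed. `cv_rmse` is a hypothetical helper name; the sketch uses vapply() instead of an explicit for loop, which is equivalent here.

```r
# One stacked k-fold RMSE for lm(mpg ~ wt + hp), with folds reshuffled by `seed`.
cv_rmse <- function(seed, k = 5) {
  set.seed(seed)
  folds <- sample(rep(seq_len(k), length.out = nrow(mtcars)))
  preds <- numeric(nrow(mtcars))
  for (f in seq_len(k)) {
    held_out <- folds == f
    fit_f <- lm(mpg ~ wt + hp, data = mtcars[!held_out, ])
    preds[held_out] <- predict(fit_f, newdata = mtcars[held_out, ])
  }
  sqrt(mean((mtcars$mpg - preds)^2))
}

ex_2_2 <- vapply(1:10, cv_rmse, numeric(1))
range(ex_2_2)   # spread caused purely by the fold assignment
mean(ex_2_2)    # the repeated-CV estimate
```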
Exercise 2.3: Stratified k-fold split on a categorical outcome
Task: On the iris dataset, build a stratified 5-fold split that preserves the proportion of each Species in every fold (each fold should have 10 setosa, 10 versicolor, 10 virginica). Use caret::createFolds(iris$Species, k = 5, list = FALSE) with set.seed(2) and verify with a contingency table. Save the integer fold vector to ex_2_3.
Expected result:
#> table(fold = ex_2_3, species = iris$Species)
#> species
#> fold setosa versicolor virginica
#> 1 10 10 10
#> 2 10 10 10
#> 3 10 10 10
#> 4 10 10 10
#> 5 10 10 10
Difficulty: Advanced
A stratified split keeps each class's share of rows constant across every fold rather than splitting rows blindly.
Pass the outcome vector to createFolds(iris$Species, k = 5, list = FALSE) after calling set.seed(2).
Explanation: Stratification matters when class proportions are imbalanced or when the outcome is rare: a non-stratified split could leave an entire fold without a positive class, breaking AUC computation. createFolds stratifies on the supplied vector; for a numeric outcome it bins the values into quantile-based groups and stratifies on the bins. Tidymodels' rsample::vfold_cv(iris, v = 5, strata = Species) does the same thing and is the preferred API for new code, but createFolds is still the simplest one-liner when you just need fold IDs.
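A short sketch of the stratified split, assuming caret is installed. `fold_table` is an illustrative name for the verification table.

```r
library(caret)

set.seed(2)
# list = FALSE returns one integer fold ID per row instead of a list of indices.
ex_2_3 <- createFolds(iris$Species, k = 5, list = FALSE)

# Each of the 5 folds should hold 10 rows of every species.
fold_table <- table(fold = ex_2_3, species = iris$Species)
fold_table
```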
Section 3. caret resampling controls (4 problems)
Exercise 3.1: Build a 10-fold trainControl and fit a linear model
Task: Configure trainControl(method = "cv", number = 10) and pass it to train(mpg ~ wt + hp + disp, data = mtcars, method = "lm") with set.seed(11). Extract the resampled RMSE summary (single row tibble with RMSE, Rsquared, MAE) and save it to ex_3_1.
Expected result:
#> ex_3_1
#> # A tibble: 1 x 3
#> RMSE Rsquared MAE
#> <dbl> <dbl> <dbl>
#> 1 2.72 0.880 2.20
Difficulty: Intermediate
caret separates the resampling recipe from the model call, so you describe the folds once and hand them to the fit.
Build the recipe with trainControl(method = "cv", number = 10) and pass it through the trControl argument.
Explanation: train() is caret's single dispatch for fitting any of 200+ model engines under a unified resampling protocol. The method = "lm" call has no tuning parameters, so the 10-fold result is just a vanilla 10-fold CV summary. The same call with method = "rf" would tune mtry over a default grid using the same 10 folds: that is the design that makes caret useful even when you outgrow basic lm.
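A minimal sketch of the control-plus-train pattern, assuming caret and tibble are installed. The resampled metrics are read from `fit$results`; exact values may vary with package versions.

```r
library(caret)

# Describe the resampling once...
ctrl <- trainControl(method = "cv", number = 10)

# ...then hand it to the fit. Seed immediately before train() so the folds are reproducible.
set.seed(11)
fit <- train(mpg ~ wt + hp + disp, data = mtcars,
             method = "lm", trControl = ctrl)

# fit$results holds one row per tuning candidate; lm has a single row.
ex_3_1 <- tibble::as_tibble(fit$results[, c("RMSE", "Rsquared", "MAE")])
ex_3_1
```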
Exercise 3.2: Repeated 10-fold CV with 5 repeats for kNN regression
Task: Fit a method = "knn" model predicting mpg ~ . on mtcars under trainControl(method = "repeatedcv", number = 10, repeats = 5) with set.seed(13). Use tuneGrid = data.frame(k = c(3, 5, 7, 9)) so 4 candidate values of the hyperparameter k are tried. Save the full results tibble (k, RMSE, RMSESD, Rsquared, MAE columns) to ex_3_2.
Expected result:
#> ex_3_2
#> # A tibble: 4 x 5
#> k RMSE RMSESD Rsquared MAE
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 3 3.45 1.20 0.821 2.93
#> 2 5 3.71 1.18 0.798 3.18
#> 3 7 4.05 1.16 0.770 3.50
#> 4 9 4.41 1.13 0.741 3.83
Difficulty: Intermediate
Repeating the whole k-fold protocol several times stabilizes the score reported for each candidate hyperparameter.
Build the recipe with trainControl(method = "repeatedcv", number = 10, repeats = 5).
Explanation: repeatedcv runs the entire 10-fold protocol repeats times with a fresh fold randomization each time, giving you a more stable estimate of the RMSE for each candidate k. The RMSESD column reports the standard deviation across the 50 fold-level RMSE values, which feeds the one-standard-error rule in later exercises. Always pair kNN with preProcess = c("center", "scale") because distance metrics are scale-dependent: a 100x range column will dominate the neighbor calculation.
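The sketch below fits the model as the task states it and then, as a variant, adds the centering and scaling recommended above. `fit_scaled` is a name introduced here; the task's expected result presumably comes from the unscaled call, but that is an assumption.

```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
grid <- data.frame(k = c(3, 5, 7, 9))

set.seed(13)
fit <- train(mpg ~ ., data = mtcars, method = "knn",
             tuneGrid = grid, trControl = ctrl)
ex_3_2 <- tibble::as_tibble(fit$results[, c("k", "RMSE", "RMSESD", "Rsquared", "MAE")])

# Variant: standardize predictors inside each resample so no column dominates the distance.
set.seed(13)
fit_scaled <- train(mpg ~ ., data = mtcars, method = "knn",
                    preProcess = c("center", "scale"),
                    tuneGrid = grid, trControl = ctrl)

ex_3_2
fit_scaled$results[, c("k", "RMSE", "RMSESD")]
```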
Exercise 3.3: Configure leave-group-out (Monte Carlo) CV in caret
Task: Set up trainControl(method = "LGOCV", number = 25, p = 0.75) so each of 25 resamples uses a random 75/25 train/test split. Fit lm(mpg ~ wt + hp) on mtcars with set.seed(15) and pull the resamples vector (25 RMSE values) using fit$resample$RMSE. Save the length-25 numeric vector to ex_3_3.
Expected result:
#> length(ex_3_3)
#> [1] 25
#> summary(ex_3_3)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1.621 2.388 2.840 2.880 3.349 4.512
Difficulty: Intermediate
Monte Carlo CV draws many independent random train/test splits instead of partitioning rows into disjoint folds.
Configure the recipe with trainControl(method = "LGOCV", number = 25, p = 0.75).
Explanation: Leave-group-out (also called Monte Carlo CV or repeated random subsampling) is k-fold's looser cousin: every resample is an independent draw, so rows can appear in many test sets and others in none. It is useful when you want a tunable train fraction (the p argument) and a large number of resamples for variance estimation. Its weakness is that test sets overlap, so the resample errors are positively correlated and confidence intervals are narrower than they should be.
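A brief sketch of the Monte Carlo setup, assuming caret is installed. Each element of the resulting vector is the RMSE on one random 25% test split.

```r
library(caret)

# 25 independent random splits, each training on 75% of the rows.
ctrl <- trainControl(method = "LGOCV", number = 25, p = 0.75)

set.seed(15)
fit <- train(mpg ~ wt + hp, data = mtcars,
             method = "lm", trControl = ctrl)

# One RMSE per resample.
ex_3_3 <- fit$resample$RMSE
summary(ex_3_3)
```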
Exercise 3.4: Class-stratified 10-fold CV for a logistic regression on iris
Task: Build a binary factor outcome is_versicolor from iris$Species with levels "yes" (versicolor) and "no" (otherwise); caret requires class levels that are valid R names when classProbs = TRUE, so 0/1 codes will not work. Configure stratified 10-fold CV with trainControl(method = "cv", number = 10, classProbs = TRUE, summaryFunction = twoClassSummary), then train() a method = "glm" binomial-family model with metric = "ROC" on Sepal.Length + Sepal.Width + Petal.Length + Petal.Width. Save the ROC AUC, Sensitivity, Specificity row as a tibble to ex_3_4.
Expected result:
#> ex_3_4
#> # A tibble: 1 x 3
#> ROC Sens Spec
#> <dbl> <dbl> <dbl>
#> 1 0.866 0.84 0.83
Difficulty: Advanced
To get a ranking metric out of caret you must tell it to keep class probabilities and score folds with a two-class summary.
Set classProbs = TRUE and summaryFunction = twoClassSummary inside trainControl(method = "cv", number = 10).
Explanation: Classification CV is usually wrong on the first try because the default summaryFunction returns Accuracy and Kappa, and the default metric = "Accuracy" is misleading on imbalanced data. The three steps to get correct ROC-AUC out of caret are: set classProbs = TRUE, set summaryFunction = twoClassSummary, and set metric = "ROC" on the train() call. Skip one and you get an error, a warning about the metric, or an Accuracy-driven model instead of a ranking-driven one.
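The three settings can be put together as follows. This is a sketch under stated assumptions: the seed 17 and the level labels "yes"/"no" are choices made here, not taken from the exercise, and the factor levels are ordered so that "yes" is the first level, which twoClassSummary treats as the event.

```r
library(caret)

dat <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
# Class levels must be valid R names when classProbs = TRUE.
dat$is_versicolor <- factor(ifelse(iris$Species == "versicolor", "yes", "no"),
                            levels = c("yes", "no"))

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,                  # step 1: keep class probabilities
                     summaryFunction = twoClassSummary)  # step 2: ROC / Sens / Spec per fold

set.seed(17)  # assumed seed; the task does not specify one
fit <- train(is_versicolor ~ ., data = dat,
             method = "glm", family = "binomial",
             metric = "ROC",                             # step 3: select on ROC AUC
             trControl = ctrl)

ex_3_4 <- tibble::as_tibble(fit$results[, c("ROC", "Sens", "Spec")])
ex_3_4
```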
Section 4. rsample resampling (3 problems)
Exercise 4.1: Build a 10-fold rsample object and inspect a split
Task: Use rsample::vfold_cv() on mtcars with v = 10 and set.seed(21). Pull out the first split with rsample::analysis(splits$splits[[1]]) (training rows) and assessment(splits$splits[[1]]) (held-out rows). Save the assessment tibble (just mpg, wt, hp columns) to ex_4_1.
Expected result:
#> ex_4_1
#> # A tibble: 3 x 3
#> mpg wt hp
#> <dbl> <dbl> <dbl>
#> 1 21 2.62 110
#> 2 18.1 3.46 105
#> 3 15.2 3.78 180
Difficulty: Intermediate
An rsample fold object stores splits, and each split exposes its training and held-out partitions as separate views.
Create the object with vfold_cv(mtcars, v = 10), then pull the held-out rows via assessment(splits$splits[[1]]).
Explanation: rsample represents each fold as an rsplit object that holds row indices, not data: actual subsetting happens lazily through analysis() (the training partition) and assessment() (the held-out partition). This keeps a 10-fold object lightweight even on large data, and it pairs cleanly with purrr::map() to fit one model per split without duplicating the dataset 10 times. Tidymodels' workflows build on this abstraction.
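A short sketch of inspecting one split, assuming rsample and tibble are installed. `first_split` and `train_rows` are illustrative names.

```r
library(rsample)

set.seed(21)
splits <- vfold_cv(mtcars, v = 10)

# Each element of splits$splits is an rsplit that stores row indices only.
first_split <- splits$splits[[1]]
train_rows  <- analysis(first_split)     # materializes the training partition
ex_4_1 <- tibble::as_tibble(assessment(first_split)[, c("mpg", "wt", "hp")])

nrow(train_rows)
ex_4_1
```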
Exercise 4.2: Map a linear model over rsample folds and pool the RMSEs
Task: Build a 10-fold split of mtcars with set.seed(23), then use purrr::map() to fit lm(mpg ~ wt + hp + disp) on each analysis(split), predict on each assessment(split), and compute the per-fold RMSE. Save the length-10 numeric vector to ex_4_2.
Expected result:
#> ex_4_2
#> [1] 1.84 2.50 3.81 2.45 4.61 1.81 2.93 1.39 4.67 2.30
#> mean(ex_4_2)
#> [1] 2.831
Difficulty: Intermediate
Write one function that scores a single split, then apply it across every split in the resample object.
Inside the function call analysis() and assessment() on the split, fit lm(), and return sqrt(mean(...)); drive it with map_dbl().
Explanation: The map-over-splits pattern is the tidy alternative to caret::train: explicit, composable, and works with any modelling function whose interface is fit + predict. Note that the mean of per-fold RMSEs (2.83) is not directly comparable to the pooled RMSE from Exercise 2.1 (2.75): the two exercises use different formulas, seeds, and fold counts, and even on identical folds an unweighted mean of per-fold RMSEs differs from the pooled RMSE because fold sizes are unequal and small folds get overweighted. Both are valid CV estimates; just be consistent within a project.
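The pattern can be sketched as one scoring function plus one map call, assuming rsample and purrr are installed. `fold_rmse` is a hypothetical helper name.

```r
library(rsample)
library(purrr)

set.seed(23)
folds <- vfold_cv(mtcars, v = 10)

# Score a single split: fit on analysis(), evaluate on assessment().
fold_rmse <- function(split) {
  fit  <- lm(mpg ~ wt + hp + disp, data = analysis(split))
  held <- assessment(split)
  sqrt(mean((held$mpg - predict(fit, newdata = held))^2))
}

ex_4_2 <- map_dbl(folds$splits, fold_rmse)
mean(ex_4_2)
```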
Exercise 4.3: Group-aware k-fold to prevent leakage on grouped rows
Task: Build a 25-row tibble with 5 patients (each contributing 5 measurements). Run rsample::group_vfold_cv() on the grouping column patient_id with v = 5 and set.seed(25). Confirm that every fold's assessment set contains exactly one patient's rows. Save the tibble of fold ID and patient ID counts to ex_4_3.
Expected result:
#> ex_4_3
#> # A tibble: 5 x 2
#> fold unique_patients_in_test
#> <chr> <int>
#> 1 Fold1 1
#> 2 Fold2 1
#> 3 Fold3 1
#> 4 Fold4 1
#> 5 Fold5 1
Difficulty: Advanced
When rows cluster by an entity, every row of that entity must stay together in train or test to avoid leakage.
Split with group_vfold_cv(d, group = patient_id, v = 5), then count distinct patients per fold using map_int() over assessment().
Explanation: Repeated-measures data leaks when you split rows naively: visit 3 of patient P1 in train and visit 4 of P1 in test means the model effectively memorizes the patient. group_vfold_cv() enforces that all rows from the same group land in the same partition. The same principle applies to time-grouped data (split by week, not by row), user-grouped data (split by user, not by event), and any clustered design. Skipping this step is the most common cause of "great CV, terrible production" outcomes.
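A sketch of the grouped split on toy data. The tibble `d`, its `visit` and `value` columns, and the "Fold" labels built with paste0() are assumptions made for illustration; rsample's own `id` column may label grouped resamples differently.

```r
library(rsample)
library(purrr)
library(tibble)

# Toy repeated-measures data: 5 patients x 5 visits (values are placeholders).
d <- tibble(
  patient_id = rep(paste0("P", 1:5), each = 5),
  visit      = rep(1:5, times = 5),
  value      = seq_len(25)
)

set.seed(25)
gfolds <- group_vfold_cv(d, group = patient_id, v = 5)

# Count distinct patients in each held-out partition; 1 everywhere means no leakage.
ex_4_3 <- tibble(
  fold = paste0("Fold", seq_along(gfolds$splits)),
  unique_patients_in_test = map_int(
    gfolds$splits,
    function(s) length(unique(assessment(s)$patient_id))
  )
)
ex_4_3
```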
Section 5. Time-series cross-validation (3 problems)
Exercise 5.1: Rolling-origin CV with a fixed training window
Task: Build a 60-row monthly tibble with columns month (1 to 60) and sales (a trending series with noise). Use rsample::rolling_origin() with initial = 36, assess = 6, cumulative = FALSE, and skip = 5 to produce non-overlapping 6-month forecast windows. Save a tibble with one row per split showing the first and last training months. Save it to ex_5_1.
Expected result:
#> ex_5_1
#> # A tibble: 4 x 3
#> split train_first train_last
#> <chr> <int> <int>
#> 1 Slice1 1 36
#> 2 Slice2 7 42
#> 3 Slice3 13 48
#> 4 Slice4 19 54
Difficulty: Advanced
Time-series CV slides a forward-moving training window so the model is always tested on observations that come later.
Read the window edges with min() and max() of analysis(s)$month across the splits, gathering them with map_int().
Explanation: Standard k-fold is wrong for time-series because it lets future data train a model that predicts the past. rolling_origin() slides a fixed-width training window forward and assesses on the immediately following block. The skip argument controls the stride between successive splits; skip = 5 means each new origin advances 6 months (skip + 1) so the assessment windows do not overlap. For sales, energy, or sensor data this is the only honest CV.
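A sketch of the fixed-window setup. The way `ts_df` is simulated here (seed 51, intercept 100, slope 0.5, noise sd 2) is an assumption, since the exercise only asks for "a trending series with noise"; the window edges do not depend on the simulated values.

```r
library(rsample)
library(purrr)
library(tibble)

set.seed(51)  # assumed seed for the toy series
ts_df <- tibble(month = 1:60,
                sales = 100 + 0.5 * month + rnorm(60, sd = 2))

# Fixed 36-month window, 6-month assessment block, origin advances skip + 1 = 6 months.
ro <- rolling_origin(ts_df, initial = 36, assess = 6,
                     cumulative = FALSE, skip = 5)

ex_5_1 <- tibble(
  split       = ro$id,
  train_first = map_int(ro$splits, function(s) min(analysis(s)$month)),
  train_last  = map_int(ro$splits, function(s) max(analysis(s)$month))
)
ex_5_1
```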
Exercise 5.2: Expanding-window CV with cumulative training
Task: On the same ts_df, build an expanding-window split with rolling_origin(ts_df, initial = 24, assess = 12, cumulative = TRUE, skip = 11). Confirm train_first is always 1 and train_last grows over splits. Save the per-split train range tibble to ex_5_2.
Expected result:
#> ex_5_2
#> # A tibble: 3 x 3
#> split train_first train_last
#> <chr> <int> <int>
#> 1 Slice1 1 24
#> 2 Slice2 1 36
#> 3 Slice3 1 48
Difficulty: Advanced
An expanding window keeps every past observation and only grows, so the training start never moves off the first row.
Build the same per-split range tibble, reading min()/max() of analysis(s)$month with map_int() over ro2$splits.
Explanation: The fixed-window version forgets old data; the expanding-window version remembers everything. Pick fixed when the underlying process drifts (consumer behavior, financial regimes) so old data hurts. Pick expanding when the process is stationary and more data always helps. A useful diagnostic is to fit both and compare CV error: if expanding is much better, older data is still informative and the series is probably close to stationary; if fixed is much better, recent data is more representative.
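The expanding-window variant differs from the fixed one only in `cumulative = TRUE`. As before, the simulated `ts_df` is an assumption standing in for the exercise's series.

```r
library(rsample)
library(purrr)
library(tibble)

set.seed(51)  # assumed seed for the toy series
ts_df <- tibble(month = 1:60,
                sales = 100 + 0.5 * month + rnorm(60, sd = 2))

# Training always starts at month 1 and grows by skip + 1 = 12 months per split.
ro2 <- rolling_origin(ts_df, initial = 24, assess = 12,
                      cumulative = TRUE, skip = 11)

ex_5_2 <- tibble(
  split       = ro2$id,
  train_first = map_int(ro2$splits, function(s) min(analysis(s)$month)),
  train_last  = map_int(ro2$splits, function(s) max(analysis(s)$month))
)
ex_5_2
```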
Exercise 5.3: Forecast-error tibble from a rolling-origin lm fit
Task: Loop the rolling-origin object from Exercise 5.1 with purrr::map_dfr(). For each split, fit lm(sales ~ month, data = analysis(split)), predict on the assessment block, and return a tibble with split, month, actual, pred, err. Save the stacked tibble (first few rows shown) to ex_5_3.
Expected result:
#> head(ex_5_3, 6)
#> # A tibble: 6 x 5
#> split month actual pred err
#> <chr> <int> <dbl> <dbl> <dbl>
#> 1 Slice1 37 120. 119. 1.08
#> 2 Slice1 38 117. 119. -2.03
#> 3 Slice1 39 121. 120. 0.92
#> 4 Slice1 40 122. 120. 1.30
#> 5 Slice1 41 121. 121. 0.18
#> 6 Slice1 42 120. 121. -1.45
#> # 18 more rows hidden
Difficulty: Advanced
For each forecast window, fit on its training block and record actual versus predicted for every assessment row.
Inside the map_dfr() callback pull analysis()/assessment(), fit lm(sales ~ month), and return a tibble() with the predictions and err.
Explanation: The stacked error tibble is the single most useful artifact from time-series CV: from it you can derive RMSE by split, RMSE by horizon (1-step, 2-step, etc.), residual autocorrelation, and a forecast plot. Once you have this tibble, you can swap lm() for forecast::auto.arima() or prophet::prophet() by changing only the fit and predict lines inside the callback; the surrounding map skeleton stays the same. The forecast pipelines all share this shape because the rolling-origin contract is model-agnostic.
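A sketch of building the stacked error tibble and deriving RMSE by split from it. It uses purrr::map2_dfr(), a close cousin of map_dfr() that also passes the split label and needs dplyr installed for row-binding; the simulated `ts_df` and the name `rmse_by_split` are assumptions for illustration.

```r
library(rsample)
library(purrr)
library(tibble)

set.seed(51)  # assumed seed for the toy series
ts_df <- tibble(month = 1:60,
                sales = 100 + 0.5 * month + rnorm(60, sd = 2))
ro <- rolling_origin(ts_df, initial = 36, assess = 6,
                     cumulative = FALSE, skip = 5)

ex_5_3 <- map2_dfr(ro$splits, ro$id, function(split, id) {
  fit  <- lm(sales ~ month, data = analysis(split))   # fit on the training window
  held <- assessment(split)                           # the 6 months that follow
  pred <- unname(predict(fit, newdata = held))
  tibble(split = id, month = held$month,
         actual = held$sales, pred = pred, err = held$sales - pred)
})

# One derived summary: RMSE per forecast window.
rmse_by_split <- tapply(ex_5_3$err, ex_5_3$split, function(e) sqrt(mean(e^2)))
rmse_by_split
```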
Section 6. Selection, tuning, and nested CV (3 problems)
Exercise 6.1: Compare three model engines under identical 10-fold CV
Task: Use set.seed(41) and the same 10-fold control to train three models on mtcars: method = "lm", method = "knn" (default grid), and method = "rpart" (default grid). Pull the best (minimum) RMSE for each and assemble a 3-row tibble with model and best_rmse. Save it to ex_6_1.
Expected result:
#> ex_6_1
#> # A tibble: 3 x 2
#> model best_rmse
#> <chr> <dbl>
#> 1 lm 2.78
#> 2 knn 3.65
#> 3 rpart 3.21
Difficulty: Advanced
A fair comparison scores every engine on the exact same folds, then keeps each one's best resampled error.
Pull min(...$results$RMSE) from each fitted model and assemble a tibble() with model and best_rmse columns.
Explanation: Identical seeds give identical fold assignments across engines, so the RMSEs compare apples to apples; without seed control, the kNN tune might land on a friendlier fold split than the lm fit. On a 32-row dataset like mtcars, simple linear regression usually wins because tree and kNN methods are starving for data. The lesson generalizes: try the simplest model first and only justify complexity when CV says it pays.
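A sketch of the comparison, assuming caret, rpart, and tibble are installed. The formula mpg ~ . is an assumption because the task does not state one, and `fit_one` is a hypothetical helper that reseeds before every train() call.

```r
library(caret)
library(tibble)

ctrl <- trainControl(method = "cv", number = 10)

fit_one <- function(method) {
  set.seed(41)  # reseed so every engine starts from the same RNG state
  train(mpg ~ ., data = mtcars, method = method, trControl = ctrl)
}

engines <- c("lm", "knn", "rpart")
fits <- lapply(engines, fit_one)

# Keep each engine's best resampled RMSE over its tuning grid.
ex_6_1 <- tibble(
  model     = engines,
  best_rmse = vapply(fits, function(f) min(f$results$RMSE), numeric(1))
)
ex_6_1
```

If you would rather not depend on the order in which train() consumes random numbers, trainControl() also accepts an index argument, so a list from createFolds(mtcars$mpg, k = 10, returnTrain = TRUE) can be passed to every fit explicitly.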
Exercise 6.2: One-standard-error rule on a tuning grid
Task: Run train(mpg ~ ., data = mtcars, method = "knn", tuneGrid = data.frame(k = 1:15)) under repeated 10-fold CV (5 repeats) with set.seed(43). Apply the one-SE rule: among k values whose mean RMSE is within one standard error of the minimum, pick the largest k (most regularized). Save the selected k value as a length-1 integer to ex_6_2.
Expected result:
#> ex_6_2
#> [1] 5
Difficulty: Advanced
The one-standard-error rule keeps the simplest model whose error sits within one standard error of the best score.
Compute the threshold as min(RMSE) plus the best row's RMSESD / sqrt(50) from fit$results, then take max(k) among the rows at or under it.
Explanation: The one-SE rule says "pick the simplest model whose CV error is within one standard error of the best": you trade a touch of accuracy for a meaningfully simpler model, often avoiding overfitting on the validation set itself. The standard error of the mean over 50 fold-level RMSEs is RMSESD / sqrt(50), not RMSESD. For kNN "simpler" means larger k; for trees it means deeper pruning (smaller tree); for lasso it means more shrinkage.
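The rule can be applied by hand to `fit$results` as sketched below. The names `best_row`, `n_resamples`, and `threshold` are illustrative; the standard error is taken from the row with the minimum RMSE.

```r
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

set.seed(43)
fit <- train(mpg ~ ., data = mtcars, method = "knn",
             tuneGrid = data.frame(k = 1:15), trControl = ctrl)

res         <- fit$results
best_row    <- which.min(res$RMSE)
n_resamples <- 10 * 5                       # folds x repeats

# Standard error of the mean RMSE at the best k, added to the best RMSE.
threshold <- res$RMSE[best_row] + res$RMSESD[best_row] / sqrt(n_resamples)

# Largest k (smoothest kNN fit) whose mean RMSE stays under the threshold.
ex_6_2 <- as.integer(max(res$k[res$RMSE <= threshold]))
ex_6_2
```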
Exercise 6.3: Nested CV for honest performance estimation
Task: Implement nested CV on mtcars with 5 outer folds and inner repeated 10-fold (3 repeats). In each outer fold, tune kNN over k = 1:10 on the outer training partition, refit the best k on that partition, then predict the outer test partition. Return a tibble with outer fold ID, selected k, and outer test RMSE. Save it to ex_6_3.
Expected result:
#> ex_6_3
#> # A tibble: 5 x 3
#> outer_fold best_k outer_rmse
#> <chr> <int> <dbl>
#> 1 Fold1 5 3.18
#> 2 Fold2 7 4.02
#> 3 Fold3 3 3.41
#> 4 Fold4 5 2.96
#> 5 Fold5 5 3.74
#> mean(ex_6_3$outer_rmse)
#> [1] 3.462
Difficulty: Advanced
Nested CV tunes inside each outer training partition and reports error only on the outer rows the tuner never touched.
Inside map_dfr() over outer$splits, run train() with the inner control on analysis(), score assessment(), and read fit$bestTune$k.
Explanation: Plain CV with hyperparameter tuning gives an optimistically biased error estimate: the same folds that picked the winning k also report the error, so the winner benefits from a lucky fit. Nested CV separates the two concerns: an inner CV picks k from the training partition only, and the outer fold reports the test error on data the tuner never saw. The cost is outer_folds * inner_folds * inner_repeats * grid_size fits (here: 5 * 10 * 3 * 10 = 1500), which is why it is reserved for final benchmarking rather than rapid iteration.
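A sketch of the nested loop, with rsample providing the outer folds and caret doing the inner tuning. The seed 45 and the formula mpg ~ . are assumptions (the task states neither), and map2_dfr() needs dplyr installed for row-binding.

```r
library(caret)
library(rsample)
library(purrr)
library(tibble)

set.seed(45)  # assumed seed
outer <- vfold_cv(mtcars, v = 5)
inner_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

ex_6_3 <- map2_dfr(outer$splits, outer$id, function(split, id) {
  train_part <- analysis(split)     # the tuner only ever sees these rows
  test_part  <- assessment(split)   # untouched until scoring

  # Inner CV picks k; train() then refits the best k on all of train_part.
  fit <- train(mpg ~ ., data = train_part, method = "knn",
               tuneGrid = data.frame(k = 1:10), trControl = inner_ctrl)

  pred <- predict(fit, newdata = test_part)
  tibble(outer_fold = id,
         best_k     = as.integer(fit$bestTune$k),
         outer_rmse = sqrt(mean((test_part$mpg - pred)^2)))
})

ex_6_3
mean(ex_6_3$outer_rmse)
```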
What to do next
- Random Forest in R: apply the CV controls here to tune mtry and ntree.
- Logistic Regression in R: pair stratified 10-fold CV with ROC-AUC for class-balanced reports.
- Linear Regression in R: start with caret::train(method = "lm") and compare to base lm() diagnostics.
- Time Series Analysis in R: combine rolling-origin CV with ARIMA and prophet pipelines.