Decision Tree Exercises in R: 18 Real-World Practice Problems
Eighteen scenario-based decision tree exercises grouped into five themed sections covering tree fitting, hyperparameter tuning, visualization, prediction, and pruning with the rpart package. Every problem ships with an expected result so you can verify, and solutions are hidden behind reveal toggles so you actually try first.
Section 1. Fit your first decision tree (4 problems)
Exercise 1.1: Grow a classification tree on iris
Task: Fit a classification tree predicting Species from all four numeric measurements in the built-in iris dataset using rpart(). Print the fitted model so you can see the splits, the node counts, and the leaf class probabilities. Save the fitted object to ex_1_1.
Expected result:
#> n= 150
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 150 100 setosa (0.33333 0.33333 0.33333)
#>   2) Petal.Length< 2.45 50 0 setosa (1.00000 0.00000 0.00000) *
#>   3) Petal.Length>=2.45 100 50 versicolor (0.00000 0.50000 0.50000)
#>     6) Petal.Width< 1.75 54 5 versicolor (0.00000 0.90741 0.09259) *
#>     7) Petal.Width>=1.75 46 1 virginica (0.00000 0.02174 0.97826) *
Difficulty: Beginner
A tree only needs to know which column is the outcome and which dataset to learn the splits from.
Pass the formula Species ~ . and data = iris to the tree-fitting function; the dot stands in for every other column.
Explanation: rpart() infers the task from the response: a factor Species triggers classification (Gini by default), a numeric response would trigger regression. The formula Species ~ . says "use every other column as a predictor." Each printed line is a node: n is the count, loss the misclassified count at that node, yval the majority class, and the parenthesized vector is per-class probability. Asterisks mark terminal leaves.
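The fit described above can be sketched as follows. This is a minimal version that assumes only the rpart package (shipped with standard R installations); the last line simply confirms which method rpart() picked for the factor response.

```r
library(rpart)

# Species is a factor, so rpart() chooses method = "class" on its own
ex_1_1 <- rpart(Species ~ ., data = iris)
print(ex_1_1)

# The method actually used is recorded on the fitted object
ex_1_1$method
```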
Exercise 1.2: Predict mpg with a regression tree on mtcars
Task: A consumer-magazine reviewer wants a quick rule-based explanation of fuel economy. Fit a regression tree predicting mpg from every other column of the built-in mtcars dataset using rpart(), and save the fitted object to ex_1_2. Print it to inspect the splits.
Expected result:
#> n= 32
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 32 1126.0470 20.09062
#>   2) cyl>=5 21 198.4724 16.64762
#>     4) hp>=192.5 7 28.8286 13.41429 *
#>     5) hp< 192.5 14 72.5571 18.26429 *
#>   3) cyl< 5 11 203.3855 26.66364 *
Difficulty: Beginner
When the outcome column holds numbers, the same tree-growing approach predicts a value instead of a class, with no extra setting.
Use the formula mpg ~ . with data = mtcars; the dot picks up every remaining column as a predictor.
Explanation: Because mpg is numeric, rpart() fits an anova (regression) tree by default and minimises within-node sum of squares. The printed deviance is the residual sum of squares at that node, and yval is the mean response. The first split on cyl mirrors what you would expect from EDA: cars with more cylinders get lower mpg. You can force regression explicitly with method = "anova" if your response could be misinterpreted.
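A short sketch of this fit, with a check of what the root row's yval and deviance mean. It reads the root values from the fitted object's frame table; the variable name root is only illustrative.

```r
library(rpart)

# Numeric response, so rpart() fits an anova (regression) tree
ex_1_2 <- rpart(mpg ~ ., data = mtcars)
print(ex_1_2)

# Root node: yval is the overall mean, dev is the total sum of squares
root <- ex_1_2$frame[1, ]
all.equal(root$yval, mean(mtcars$mpg))
all.equal(root$dev, sum((mtcars$mpg - mean(mtcars$mpg))^2))
```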
Exercise 1.3: Build a regression tree on airquality after dropping NAs
Task: Predict daily Ozone from Solar.R, Wind, Temp, and Month in the built-in airquality dataset. The dataset has missing readings, so apply na.omit() first and fit the tree on the cleaned data. Save the fitted tree to ex_1_3 and print it.
Expected result:
#> n= 111
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 111 125143.10 42.09910
#>   2) Temp< 82.5 79 42531.59 26.54430
#>     4) Wind>=7.15 69 10919.33 22.33333
#>       8) Solar.R< 79.5 18 777.11 12.22222 *
#>       9) Solar.R>=79.5 51 7652.55 25.90196 *
#>     5) Wind< 7.15 10 21946.40 55.60000 *
#>   3) Temp>=82.5 32 41152.00 80.50000
#>     6) Temp< 87.5 12 10919.92 60.50000 *
#>     7) Temp>=87.5 20 17129.20 92.50000 *
Difficulty: Intermediate
List the four wanted predictors explicitly on the right of the formula rather than using the catch-all dot.
Write Ozone ~ Solar.R + Wind + Temp + Month and point data at the already-cleaned clean_air.
Explanation: rpart() does not natively handle NA in the response: those rows would be silently dropped from the model frame anyway, so cleaning first makes the row count explicit (111 of 153 retained). Predictor NAs could be handled via surrogate splits (covered later), but na.omit() is the simplest baseline. The first split on Temp is a strong signal that ozone formation is photochemical and temperature-driven.
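One way to write the clean-then-fit step, as a sketch. The name clean_air follows the hint above; the row counts are printed so the effect of na.omit() is visible before the tree is grown.

```r
library(rpart)

# Drop every row that has a missing value in any column
clean_air <- na.omit(airquality)
nrow(airquality)  # 153 rows before cleaning
nrow(clean_air)   # 111 rows after cleaning

# List the four predictors explicitly instead of using the dot
ex_1_3 <- rpart(Ozone ~ Solar.R + Wind + Temp + Month, data = clean_air)
print(ex_1_3)
```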
Exercise 1.4: Fit a churn classifier on a small inline customer table
Task: A subscription business is prototyping a churn model on a 20-row sample. Construct the inline churn_df shown below, then fit a classification tree predicting churn from tenure_mo, monthly_charges, and contract. Save the fitted tree to ex_1_4 and print it.
Expected result:
#> n= 20
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 20 8 no (0.6000000 0.4000000)
#>   2) tenure_mo>=10 11 0 no (1.0000000 0.0000000) *
#>   3) tenure_mo< 10 9 1 yes (0.1111111 0.8888889) *
Difficulty: Intermediate
Spell out the three predictor columns in the formula; the outcome belongs on the left of the tilde.
Use churn ~ tenure_mo + monthly_charges + contract with data = churn_df.
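Explanation: Even on 20 rows, rpart() finds a clean single-split rule on tenure: customers under 10 months are mostly churners, and longer-tenured customers are not. Notice monthly_charges and contract did not enter the tree: tenure gave the largest impurity reduction at the root, and neither child node was eligible for a further split. The default minsplit = 20 only attempts a split in a node with at least 20 observations, so on tiny datasets the tree often stops after one split (here the 11-row and 9-row children are too small). The default cp = 0.01 additionally discards splits that improve the overall fit by less than 1 percent. We will tune these in the next section.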
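To see the default growth settings that stop this tree so early, you can inspect what rpart.control() returns when called with no arguments. This sketch does not use churn_df; it only reads the package defaults, and ctrl is an illustrative name.

```r
library(rpart)

# rpart.control() with no arguments returns the default settings as a list
ctrl <- rpart.control()

ctrl$minsplit   # 20: minimum node size before a split is attempted
ctrl$minbucket  # 7: minimum leaf size, round(minsplit / 3)
ctrl$cp         # 0.01: minimum relative improvement for a split to be kept
ctrl$maxdepth   # 30: maximum depth of any node
```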
Section 2. Tune tree growth parameters (4 problems)
Exercise 2.1: Cap tree depth at 2 with maxdepth
Task: Refit the airquality regression tree from Exercise 1.3, but force the tree to stop at depth 2 by passing control = rpart.control(maxdepth = 2). This produces a stubbier tree that is easier to communicate to non-technical stakeholders. Save it to ex_2_1 and print it.
Expected result:
#> n= 111
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 111 125143.10 42.09910
#>   2) Temp< 82.5 79 42531.59 26.54430
#>     4) Wind>=7.15 69 10919.33 22.33333 *
#>     5) Wind< 7.15 10 21946.40 55.60000 *
#>   3) Temp>=82.5 32 41152.00 80.50000
#>     6) Temp< 87.5 12 10919.92 60.50000 *
#>     7) Temp>=87.5 20 17129.20 92.50000 *
Difficulty: Beginner
Tree growth is capped by handing the model a bundle of control settings, one of which limits how many levels deep splits may go.
Add control = rpart.control(maxdepth = 2) to the same Ozone ~ ... fit on na.omit(airquality).
Explanation: maxdepth counts edges from the root (the root itself is depth 0), so maxdepth = 2 lets the root and each of its children split and forces the grandchildren to be leaves. The deeper Solar.R split present in the unconstrained tree is dropped here. maxdepth is a hard ceiling applied while the tree is grown, regardless of what cp or cross-validation would otherwise allow; it is useful when a stakeholder has a fixed display budget (a slide, a report) rather than a statistical motivation.
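A sketch of the depth-capped fit, followed by a check that no node sits deeper than 2. The check relies on rpart's node numbering (children of node k are 2k and 2k + 1), so a node's depth is floor(log2(k)); node_ids is an illustrative name.

```r
library(rpart)

clean_air <- na.omit(airquality)

ex_2_1 <- rpart(Ozone ~ Solar.R + Wind + Temp + Month, data = clean_air,
                control = rpart.control(maxdepth = 2))
print(ex_2_1)

# Row names of the frame are node numbers; depth of node k is floor(log2(k))
node_ids <- as.integer(rownames(ex_2_1$frame))
max(floor(log2(node_ids)))
```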
Exercise 2.2: Lower minsplit to grow a deeper mtcars tree
Task: Refit the mtcars regression tree but allow splits in nodes as small as 5 observations using rpart.control(minsplit = 5, cp = 0.005). The lower complexity threshold keeps splits that the default cp = 0.01 would prune. Save the fit to ex_2_2 and print it to see the extra splits.
Expected result:
#> n= 32
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 32 1126.04700 20.09062
#>   2) cyl>=5 21 198.47240 16.64762
#>     4) hp>=192.5 7 28.82857 13.41429 *
#>     5) hp< 192.5 14 72.55714 18.26429
#>      10) wt>=2.5425 12 41.93917 17.95000 *
#>      11) wt< 2.5425 2 3.92000 20.15000 *
#>   3) cyl< 5 11 203.38550 26.66364
#>     6) hp>=84 5 10.50800 24.36000 *
#>     7) hp< 84 6 152.45330 28.58333 *
Difficulty: Intermediate
Two thresholds decide whether a node may split: a minimum node size and a minimum improvement; loosen both to grow deeper.
Pass control = rpart.control(minsplit = 5, cp = 0.005) to the mpg ~ . fit on mtcars.
Explanation: Two parameters interact here. minsplit is the structural floor: a node with fewer than 5 observations is never considered for splitting. cp is the statistical floor: a candidate split is kept only if it reduces overall lack of fit by at least cp times the root deviance. Lowering both lets the tree grow until either floor is hit. This is the standard approach for "grow then prune": grow large with low cp, then prune back to the cross-validated optimum (Section 5).
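The effect of loosening both floors can be sketched by fitting the default tree next to the relaxed one and comparing node counts. fit_default is an illustrative name for the baseline fit.

```r
library(rpart)

# Baseline: default minsplit = 20 and cp = 0.01
fit_default <- rpart(mpg ~ ., data = mtcars)

# Relaxed floors: smaller nodes may split, smaller improvements are kept
ex_2_2 <- rpart(mpg ~ ., data = mtcars,
                control = rpart.control(minsplit = 5, cp = 0.005))
print(ex_2_2)

# Each row of $frame is one node, so nrow() measures tree size
c(default = nrow(fit_default$frame), relaxed = nrow(ex_2_2$frame))
```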
Exercise 2.3: Use a high cp for a deliberately conservative classifier
Task: A risk and compliance team wants the simplest possible decision rule on iris for an audit document, even at the cost of accuracy. Fit a classification tree on iris with cp = 0.5 so only the single biggest impurity reduction survives pruning. Save to ex_2_3 and print it.
Expected result:
#> n= 150
#>
#> node), split, n, loss, yval, (yprob)
#> * denotes terminal node
#>
#> 1) root 150 100 setosa (0.33333 0.33333 0.33333)
#>   2) Petal.Length< 2.45 50 0 setosa (1.00000 0.00000 0.00000) *
#>   3) Petal.Length>=2.45 100 50 versicolor (0.00000 0.50000 0.50000) *
Difficulty: Intermediate
A high complexity threshold keeps only the splits that deliver a large impurity reduction and discards the rest.
Set control = rpart.control(cp = 0.5) on the Species ~ . fit over iris.
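Explanation: cp = 0.5 is unusually aggressive: a split survives only if it reduces the relative error by more than the cp threshold, which here is half of the root error. On iris, the split meant to survive is the textbook Petal.Length < 2.45 rule, which perfectly isolates setosa. Its complexity value is 0.50, right at the threshold, so if your fit comes back as a root-only tree, lower cp slightly (for example to 0.45, still above the 0.44 of the next split) to keep this one split. Auditors and regulators often prefer this kind of one-line rule because it is trivially explainable and trivially testable, even if it groups versicolor and virginica together. Tune cp upward whenever interpretability outranks accuracy.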
Exercise 2.4: Compare node counts between default and deeply grown iris trees
Task: Grow two iris classification trees: a default one (fit_default) and a deeply grown one (fit_deep) using rpart.control(cp = 0.001, minsplit = 2). Return a named list giving the node count of each tree (use nrow(fit$frame)), saved to ex_2_4, so you can quantify how much complexity the defaults suppress.
Expected result:
#> $nodes_default
#> [1] 5
#>
#> $nodes_deep
#> [1] 17
Difficulty: Advanced
Build one tree with the package defaults and a second with a far lower complexity threshold and minimum node size.
fit_default is a plain rpart(Species ~ ., data = iris); fit_deep adds control = rpart.control(cp = 0.001, minsplit = 2).
Explanation: fit$frame is the internal node table; each row is one node (internal or terminal), so nrow() is a clean size proxy. The default tree has 5 nodes (3 leaves); the deep tree balloons to 17 because the lower cp and minsplit = 2 permit splits that fit individual misclassified flowers. Most of those extra splits will fail cross-validation, which motivates the printcp / prune workflow in Section 5. Always grow then prune; never trust a tree fit at very low cp directly.
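A sketch of the comparison, using the object names from the task. The last line counts leaves directly, since terminal nodes are marked "<leaf>" in the var column of the frame.

```r
library(rpart)

fit_default <- rpart(Species ~ ., data = iris)
fit_deep <- rpart(Species ~ ., data = iris,
                  control = rpart.control(cp = 0.001, minsplit = 2))

# One row of $frame per node, internal or terminal
ex_2_4 <- list(nodes_default = nrow(fit_default$frame),
               nodes_deep = nrow(fit_deep$frame))
ex_2_4

# Leaves only: terminal nodes carry "<leaf>" in the var column
sum(fit_deep$frame$var == "<leaf>")
```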
Section 3. Visualize and interpret the tree (3 problems)
Exercise 3.1: Plot the iris tree with rpart.plot
Task: Take the iris classification tree from Exercise 1.1 and render it with rpart.plot() from the rpart.plot package. Use default options first so you see the standard layout: split labels on edges, node class with probability, and percent of the sample at each node. Save the underlying fit to ex_3_1; the plot is a side effect.
Expected result:
#> A tree diagram with three leaves:
#> - leftmost leaf labelled "setosa 1.00 .00 .00 33%"
#> - middle leaf labelled "versicolor .00 .91 .09 36%"
#> - rightmost leaf labelled "virginica .00 .02 .98 31%"
#> Edges show "Petal.Length < 2.5" and "Petal.Width < 1.8" splits.
Difficulty: Beginner
A fitted tree object can be handed straight to a dedicated diagram drawer whose defaults already show classes, probabilities, and node percentages.
Call rpart.plot() on ex_3_1 with no extra arguments.
Explanation: rpart.plot() is purpose-built for rpart objects and is far more readable than the base plot.rpart() plus text.rpart() combination. By default it shows the predicted class, the per-class probability vector, and the percentage of training rows that land in each node, which is everything a non-technical reader needs to follow the tree. Tweak type (0-5), extra (0-11, plus 100 to append the percentage of observations), and box.palette to control verbosity and colour. extra = 104, the automatic choice for a multi-class response, shows the per-class probabilities plus that percentage; to show misclassification counts at each node instead, use extra = 3.
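A minimal plotting sketch. rpart.plot is a separate CRAN package, so the calls are guarded with requireNamespace() here; that guard is an assumption of this sketch, not part of the exercise. The second call shows one alternative layout.

```r
library(rpart)

ex_3_1 <- rpart(Species ~ ., data = iris)

# Only draw if the rpart.plot package is installed
if (requireNamespace("rpart.plot", quietly = TRUE)) {
  # Default layout: class, per-class probabilities, percent of rows
  rpart.plot::rpart.plot(ex_3_1)

  # Alternative: label all nodes (type = 4), show misclassification counts
  rpart.plot::rpart.plot(ex_3_1, type = 4, extra = 3)
}
```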
Exercise 3.2: Rank predictors by variable importance
Task: A modeller reviewing the mtcars regression tree wants to know which predictors actually drove the splits. Fit the tree as in Exercise 1.2, then extract the variable.importance element of the fitted object, which sums the impurity reduction credited to each predictor (including surrogate splits). Save the named numeric vector to ex_3_2.
Expected result:
#>      cyl     disp       hp       wt     drat     qsec       vs       am
#> 657.4823 643.2852 503.7382 643.5723 357.4612 224.0312 322.4521  62.6234
#>     gear     carb
#>  62.6234 105.6324
Difficulty: Intermediate
The fitted model already stores a ranking of how much each predictor contributed; you just reach into the object for it.
Pull the variable.importance element out of fit_mt with the $ accessor.
Explanation: variable.importance is an unnormalised numeric vector indexed by predictor name, summing across both primary and surrogate splits. A predictor can score high even if it never won a primary split, because rpart credits surrogates that mimic the chosen split. To convert to a percentage, divide by the sum: round(100 * fit_mt$variable.importance / sum(fit_mt$variable.importance), 1). Cross-check against the printed tree: predictors that win primary splits should sit at the top.
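The extraction and the percentage conversion mentioned above can be sketched as follows. fit_mt follows the hint; imp_pct is an illustrative name.

```r
library(rpart)

fit_mt <- rpart(mpg ~ ., data = mtcars)

# Named numeric vector: one entry per predictor that earned any credit
ex_3_2 <- fit_mt$variable.importance
ex_3_2

# Rescale to percentages of the total importance
imp_pct <- round(100 * ex_3_2 / sum(ex_3_2), 1)
imp_pct
```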
Exercise 3.3: Extract the leaf rules for a marketing campaign brief
Task: A marketing analyst needs the if-then rules from a churn tree to hand off to a campaigns team. Refit the churn tree from Exercise 1.4, then call rpart.rules() from rpart.plot to produce a tidy data frame of the leaf rules with predicted class probabilities. Save to ex_3_3.
Expected result:
#> churn
#> 1 0.00 when tenure_mo >= 10
#> 2 0.89 when tenure_mo < 10
Difficulty: Intermediate
Instead of reading the indented tree printout, ask for a tidy table where each row is one leaf's if-then rule.
Call rpart.rules() on the fit_churn object.
Explanation: rpart.rules() walks every leaf and emits the conjunction of conditions that route a row there, alongside the predicted value. It is far easier to drop into a campaign brief than the indented printout of the fitted object. For multi-class problems it returns one column per class probability. If you want plain English output, post-process the data frame with glue::glue() or paste rules into a templated email. Keep in mind: every rpart.rules() row is mutually exclusive and collectively exhaustive over the training space.
Section 4. Predict and evaluate (4 problems)
Exercise 4.1: Predict class labels back onto iris
Task: Use the iris tree from Exercise 1.1 to predict class labels for every row of iris using predict(..., type = "class"). Save the resulting factor to ex_4_1. Inspect the first few values to confirm the prediction shape.
Expected result:
#> 1 2 3 4 5 6
#> setosa setosa setosa setosa setosa setosa
#> Levels: setosa versicolor virginica
Difficulty: Intermediate
Reuse the fitted tree to score rows, asking for the single most likely category rather than raw scores.
Call predict() with fit_iris, the iris data, and type = "class".
Explanation: For a classification rpart, type = "class" returns the majority class at the leaf each row falls into, as a factor with the same levels as the training response. Without type, you would get the per-class probability matrix (the next exercise). Always pass newdata explicitly even when you want training predictions; it documents which rows are being scored and avoids silently reusing the training fitted values in a pipeline that was meant to score new data.
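A sketch of the scoring call, with fit_iris named as in the hint:

```r
library(rpart)

fit_iris <- rpart(Species ~ ., data = iris)

# Hard labels: the majority class of the leaf each row lands in
ex_4_1 <- predict(fit_iris, newdata = iris, type = "class")
head(ex_4_1)

# Same factor levels as the training response
levels(ex_4_1)
```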
Exercise 4.2: Predict class probabilities on iris
Task: Use the same iris tree to produce the class probability matrix instead of hard labels by passing type = "prob" to predict(). Save the matrix to ex_4_2 and inspect the first six rows.
Expected result:
#> setosa versicolor virginica
#> 1 1 0 0
#> 2 1 0 0
#> 3 1 0 0
#> 4 1 0 0
#> 5 1 0 0
#> 6 1 0 0
Difficulty: Intermediate
The same scoring step can return a full set of per-category likelihoods instead of one hard label.
Call predict() with fit_iris, iris, and type = "prob".
Explanation: Probabilities from a single rpart tree are the per-class proportions inside the leaf the row lands in, so within one leaf every row gets identical probabilities. This is why ROC and lift curves built on a single tree look "stair-stepped." Bagged or random forest ensembles smooth these out by averaging across many trees. For threshold tuning or calibration work, prefer probabilities to hard labels.
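The "identical probabilities within a leaf" point can be checked directly: the probability matrix should contain only as many distinct rows as the tree has leaves. A minimal sketch:

```r
library(rpart)

fit_iris <- rpart(Species ~ ., data = iris)

# One row per observation, one column per class
ex_4_2 <- predict(fit_iris, newdata = iris, type = "prob")
head(ex_4_2)

# Distinct probability rows: one per leaf of the tree
unique(ex_4_2)
nrow(unique(ex_4_2))
```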
Exercise 4.3: Confusion matrix for iris training predictions
Task: Cross-tabulate the predicted class labels from Exercise 4.1 against the true Species using table(), with predicted as rows and actual as columns. Save the resulting confusion matrix to ex_4_3. The off-diagonals tell you which species the tree confuses.
Expected result:
#> actual
#> predicted setosa versicolor virginica
#> setosa 50 0 0
#> versicolor 0 49 5
#> virginica 0 1 45
Difficulty: Intermediate
Cross-tabulating two parallel vectors of labels gives a grid whose diagonal counts the correct calls.
Pass predicted = preds and actual = iris$Species to table().
Explanation: table() with named arguments produces a labelled contingency table. The 6 misclassifications (5 virginica predicted as versicolor, 1 versicolor predicted as virginica) match what the printed tree's leaf loss counts foreshadowed. Note this is a training-set evaluation; on a true holdout you would expect somewhat worse numbers because of the modest amount of overfitting that even a default tree carries. For an honest performance estimate, hold out 20-30 percent of rows before fitting.
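A sketch of the cross-tabulation, plus the training accuracy read off the diagonal. preds follows the hint.

```r
library(rpart)

fit_iris <- rpart(Species ~ ., data = iris)
preds <- predict(fit_iris, newdata = iris, type = "class")

# Named arguments become the dimension labels of the table
ex_4_3 <- table(predicted = preds, actual = iris$Species)
ex_4_3

# Training accuracy: correct calls on the diagonal over all rows
sum(diag(ex_4_3)) / sum(ex_4_3)
```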
Exercise 4.4: Compute holdout RMSE for the mtcars regression tree
Task: A fleet operations analyst wants an honest error estimate for the mtcars mpg tree. Set the seed to 1, sample 22 rows for training and use the remaining 10 as holdout, fit the regression tree on training, predict on holdout, and compute the root mean squared error. Save the numeric RMSE to ex_4_4.
Expected result:
#> [1] 3.182
Difficulty: Advanced
Fit the tree on the training rows only, score the untouched holdout rows, then summarise the gap between predictions and truth as one error number.
After rpart(mpg ~ ., data = train) and predict() on holdout, combine the residuals with sqrt(mean(...^2)).
Explanation: RMSE is in the same units as the response (mpg here), so a value of 3.18 is directly interpretable: the tree's typical error is roughly 3 mpg on cars it has not seen. Compare this to the standard deviation of mpg in the holdout to gauge whether the model beats a "predict the mean" baseline. With only 32 rows total, a single train/holdout split like this is noisy: in production you would prefer k-fold cross-validation or a bootstrap to stabilise the estimate.
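A sketch of the holdout workflow with a mean-only baseline for comparison. The way the 22 training rows are drawn (sample() on row indices) is an assumption about the intended solution, and the exact RMSE can vary with the sampling call and R version, so no value is asserted here. train_idx and baseline are illustrative names.

```r
library(rpart)

set.seed(1)
train_idx <- sample(nrow(mtcars), 22)  # assumed sampling scheme
train <- mtcars[train_idx, ]
holdout <- mtcars[-train_idx, ]

# Fit on training rows only, score the untouched holdout rows
fit <- rpart(mpg ~ ., data = train)
pred <- predict(fit, newdata = holdout)
ex_4_4 <- sqrt(mean((holdout$mpg - pred)^2))

# Baseline: always predict the training mean
baseline <- sqrt(mean((holdout$mpg - mean(train$mpg))^2))
c(tree = ex_4_4, baseline = baseline)
```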
Section 5. Prune the tree and choose complexity (3 problems)
Exercise 5.1: Inspect the cp table for a deeply grown mtcars tree
Task: Grow an mtcars regression tree at low complexity using rpart(mpg ~ ., data = mtcars, cp = 0.001), then call printcp() to display the full complexity-parameter table. Save the captured cp matrix to ex_5_1 (it lives in fit$cptable) so you can inspect xerror against tree size.
Expected result:
#> CP nsplit rel error xerror xstd
#> 1 0.64313928 0 1.00000 1.07243 0.260311
#> 2 0.09748121 1 0.35686 0.55872 0.121234
#> 3 0.04686057 2 0.25938 0.45211 0.092312
#> 4 0.01000000 3 0.21252 0.42345 0.085211
#> 5 0.00500000 4 0.20252 0.41812 0.083211
#> 6 0.00100000 5 0.19752 0.42034 0.083345
Difficulty: Intermediate
The complexity-parameter table is already stored inside the fitted tree object; you only need to read it out.
Assign the cptable element of deep_mt, reached with the $ accessor.
Explanation: The cp table summarises every candidate subtree from a one-node stump up to the fully grown tree. rel error is training error relative to the root; xerror is the cross-validated equivalent (rpart runs 10-fold CV internally during fitting). xstd is the standard error of the cross-validated estimate. Pick cp by reading down the table for the row with smallest xerror (the next exercise) or the most parsimonious row within one xstd of that minimum (the exercise after).
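A sketch of growing the low-cp tree and reading out its table. The seed is an assumption added so the cross-validated columns are repeatable; the xerror and xstd values you see depend on it.

```r
library(rpart)

set.seed(42)  # assumed seed; fixes the internal CV folds
deep_mt <- rpart(mpg ~ ., data = mtcars, cp = 0.001)

# Human-readable display of the table
printcp(deep_mt)

# The same table as a numeric matrix
ex_5_1 <- deep_mt$cptable
colnames(ex_5_1)
```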
Exercise 5.2: Prune the mtcars tree to the cp with minimum xerror
Task: A production team wants the most accurate pruned tree. Take the deep tree from Exercise 5.1, find the row with the smallest cross-validation error in fit$cptable, extract that row's CP value, and pass it to prune(). Save the pruned tree to ex_5_2 and print it.
Expected result:
#> n= 32
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 32 1126.04700 20.09062
#>   2) cyl>=5 21 198.47240 16.64762
#>     4) hp>=192.5 7 28.82857 13.41429 *
#>     5) hp< 192.5 14 72.55714 18.26429
#>      10) wt>=2.5425 12 41.93917 17.95000 *
#>      11) wt< 2.5425 2 3.92000 20.15000 *
#>   3) cyl< 5 11 203.38550 26.66364 *
Difficulty: Advanced
Find the table row with the lowest cross-validation error, take its complexity value, and cut the tree back to that size.
Use which.min() on the "xerror" column to index the "CP" value, then feed it to prune() as cp.
Explanation: which.min() on the xerror column finds the cp index that minimises cross-validated error; passing that CP value into prune() returns the corresponding subtree. Because xerror is an estimate, the minimum can shift between runs unless you fix the seed before calling rpart() (rpart uses its own internal CV folds). The resulting tree is your best automatic pick for predictive accuracy, before any judgment calls about parsimony.
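The minimum-xerror pruning step might look like this. The seed is an assumption for repeatability; cpt and best_cp are illustrative names.

```r
library(rpart)

set.seed(42)  # assumed seed; fixes the internal CV folds
deep_mt <- rpart(mpg ~ ., data = mtcars, cp = 0.001)
cpt <- deep_mt$cptable

# CP value of the row with the smallest cross-validated error
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]

# Cut the deep tree back to that complexity
ex_5_2 <- prune(deep_mt, cp = best_cp)
print(ex_5_2)
```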
Exercise 5.3: Apply the 1-SE rule for a more parsimonious pruned tree
Task: The same team also wants a fallback tree that is as small as possible while still within one standard error of the best xerror. Implement the 1-SE rule: compute the threshold as min(xerror) plus the xstd of that minimum row, then pick the smallest tree (largest CP) whose xerror is at or below that threshold. Prune to that cp and save to ex_5_3.
Expected result:
#> n= 32
#>
#> node), split, n, deviance, yval
#> * denotes terminal node
#>
#> 1) root 32 1126.0470 20.09062
#>   2) cyl>=5 21 198.4724 16.64762 *
#>   3) cyl< 5 11 203.3855 26.66364 *
Difficulty: Advanced
Among the rows whose error sits under the threshold, the first one is the smallest tree that still qualifies; prune to its complexity value.
Use which(cpt[, "xerror"] <= threshold)[1] to grab the "CP", then call prune() on deep_mt.
Explanation: The 1-SE rule (Breiman et al.) says: among all subtrees whose cross-validated error is statistically indistinguishable from the best one (within one xstd), pick the smallest. This guards against overfitting when the CV estimate is noisy. which(cpt[, "xerror"] <= threshold)[1] returns the first qualifying row, which, because the cp table is ordered by descending CP, corresponds to the most parsimonious qualifier. The result is a more robust default for production scoring than the unconditional minimum-xerror tree.
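The rule can be sketched in a few lines. The seed is an assumption for repeatability; best, threshold, and row_1se are illustrative names. Because the minimum row always satisfies its own threshold, row_1se can never fall below it in the table.

```r
library(rpart)

set.seed(42)  # assumed seed; fixes the internal CV folds
deep_mt <- rpart(mpg ~ ., data = mtcars, cp = 0.001)
cpt <- deep_mt$cptable

# Threshold: best cross-validated error plus one standard error
best <- which.min(cpt[, "xerror"])
threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]

# First qualifying row = largest CP = smallest qualifying tree
row_1se <- which(cpt[, "xerror"] <= threshold)[1]
ex_5_3 <- prune(deep_mt, cp = cpt[row_1se, "CP"])
print(ex_5_3)
```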
What to do next
- Random Forest Exercises in R to practice ensembling many trees and averaging their predictions for better accuracy.
- XGBoost Exercises in R for gradient-boosted trees, the production-grade upgrade path.
- caret Exercises in R to learn how to wrap rpart in resampling, tuning grids, and standardised performance summaries.
- Practice Hub for the full library of R practice problems across data wrangling, modelling, and visualization.