dplyr Exercises in R: 50 Real-World Practice Problems
Fifty scenario-based dplyr exercises grouped into six themed sections covering filtering, mutating, summarising, joins, window functions, and multi-step pipelines. Every problem ships with an expected result so you can verify, and solutions are hidden behind reveal toggles so you actually try first.
Section 1. Filter and select (8 problems)
Exercise 1.1: Find every fuel-sipping four-cylinder car
Task: A consumer-magazine reviewer is compiling a short list of fuel-efficient compacts. From the built-in mtcars dataset, return only rows where cyl == 4 and mpg > 25, keeping just the car name (from rowname) and mpg. Save to ex_1_1.
Expected result:
#> car mpg
#> 1 Fiat 128 32.4
#> 2 Honda Civic 30.4
#> 3 Toyota Corolla 33.9
#> 4 Fiat X1-9 27.3
#> 5 Porsche 914-2 26.0
#> 6 Lotus Europa 30.4
Difficulty: Beginner
Two conditions must both hold at once. filter() accepts several conditions, separated by a comma.
After filter(cyl == 4, mpg > 25), add select(car, mpg) to keep just those two columns.
Click to reveal solution
Explanation: Passing two conditions to filter() separated by a comma is equivalent to & (logical AND). rownames_to_column() from tibble lifts the car names out of the row index so they survive the pipeline; base mtcars stores them as rownames(mtcars) which filter() cannot reference directly.
Exercise 1.2: Filter by a vector of named models
Task: Marketing has flagged three specific models for a teardown report: "Mazda RX4", "Honda Civic", and "Toyota Corolla". Return the rows of mtcars whose name matches any of those three using %in%, keeping all columns. Save to ex_1_2.
Expected result:
#> car mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
#> 2 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 3 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Difficulty: Beginner
You are matching each row against a known set of names. R has an operator for set membership.
Use filter(car %in% target) to keep every row whose name is in the vector.
Click to reveal solution
Explanation: %in% returns a logical vector the same length as car, marking each element TRUE if it appears anywhere in target. It is the idiomatic way to filter against a known set, and is far more readable than chaining several car == "..." clauses joined by |. Negate with !(car %in% target).
Exercise 1.3: Drop rows with missing Ozone readings
Task: A climatologist is preparing the airquality dataset for a regression on Ozone and needs to drop every row where Ozone is NA. Use filter() with is.na() (or !is.na()) to keep only complete-ozone rows and save the cleaned tibble to ex_1_3.
Expected result:
#> Ozone Solar.R Wind Temp Month Day
#> 1 41 190 7.4 67 5 1
#> 2 36 118 8.0 72 5 2
#> 3 12 149 12.6 74 5 3
#> 4 18 313 11.5 62 5 4
#> ...
#> # 116 rows total after dropping NA Ozone
Difficulty: Beginner
NA cannot be tested with ==; you need a dedicated function for detecting missing values.
filter(!is.na(Ozone)) keeps the rows where Ozone is present.
Click to reveal solution
Explanation: Direct comparison Ozone == NA always returns NA (never TRUE), so filter(Ozone == NA) returns zero rows silently. is.na() is the only correct test for missingness; negating it with ! is the standard dplyr idiom for dropping NAs in one column. For multiple columns, use filter(if_all(c(Ozone, Solar.R), \(x) !is.na(x))).
Exercise 1.4: Pick measurement columns with starts_with
Task: From the iris dataset, keep only the four numeric measurement columns whose names begin with "Sepal" or "Petal", plus the Species column. Use select() with starts_with() helpers instead of typing each name. Save the result to ex_1_4.
Expected result:
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> ...
#> # 147 more rows hidden
Difficulty: Intermediate
You can pick columns by a shared name prefix instead of spelling each one out.
Pass starts_with("Sepal") and starts_with("Petal") to select(), then add Species.
Click to reveal solution
Explanation: Tidyselect helpers like starts_with(), ends_with(), contains(), and matches() (regex) make column picks robust to schema drift: if a new "Sepal.Area" column appears tomorrow, it joins automatically. Listing Species after the helpers fixes its position at the end, which is useful when the rest of the pipeline expects measurements first.
Exercise 1.5: Keep diamonds priced above the median
Task: A jeweller wants the upper half of inventory by price. From diamonds, filter rows where price is strictly greater than the median price across the full dataset, then keep just carat, cut, and price. Save to ex_1_5 and confirm the row count.
Expected result:
#> # A tibble: 26,966 x 3
#> carat cut price
#> <dbl> <ord> <int>
#> 1 1.17 Very Good 2774
#> 2 1.01 Premium 2781
#> 3 1.01 Fair 2788
#> 4 1.01 Premium 2788
#> ...
#> nrow(ex_1_5)
#> [1] 26966
Difficulty: Intermediate
Work out the cutoff value from the column itself, then keep the rows that beat it.
Pass price > median(price) to filter(), then select(carat, cut, price).
Click to reveal solution
Explanation: median(price) is evaluated once at the start of filter() because dplyr passes the entire column vector. Strictly greater (>) excludes the median itself, which is why the count is slightly under half of 53,940. Use >= if you want the median row included. Because there is no group_by(), the median is computed across the whole dataset.
Exercise 1.6: Find heavy-duty trucks or sporty coupes
Task: From mpg, return rows where class is either "pickup" or "2seater" using a single filter() call with %in%. Keep only manufacturer, model, class, and cty. Save the result to ex_1_6 for a downstream comparison plot.
Expected result:
#> # A tibble: 38 x 4
#> manufacturer model class cty
#> <chr> <chr> <chr> <int>
#> 1 chevrolet corvette 2seater 16
#> 2 chevrolet corvette 2seater 15
#> 3 chevrolet corvette 2seater 16
#> 4 chevrolet corvette 2seater 15
#> ...
#> # 34 more rows hidden
Difficulty: Intermediate
You are testing each row for membership in a small set of allowed categories.
Use filter(class %in% c("pickup", "2seater")), then select() the four columns.
Click to reveal solution
Explanation: %in% collapses what would otherwise be class == "pickup" | class == "2seater". As the set of allowed values grows, %in% stays one short line while | chains become unreadable. The vector passed to %in% can be constructed elsewhere (e.g., read from a config file), which makes the pipeline declarative.
Exercise 1.7: Drop unwanted columns with negative select
Task: A reporting team wants mtcars without the four engine geometry columns disp, drat, qsec, and carb. Use select() with negative selection (the !c(...) form) so that adding new columns later still keeps them automatically. Save to ex_1_7.
Expected result:
#> mpg cyl hp wt vs am gear
#> Mazda RX4 21.0 6 110 2.620 0 1 4
#> Mazda RX4 Wag 21.0 6 110 2.875 0 1 4
#> Datsun 710 22.8 4 93 2.320 1 1 4
#> ...
#> # 29 more rows hidden
Difficulty: Intermediate
Name the few columns you want gone rather than the many you want kept.
Wrap the drop list in select(!c(...)), listing disp, drat, qsec, carb.
Click to reveal solution
Explanation: Negative select with !c(...) (or the older -c(...)) is the right tool when the keep-list is longer than the drop-list. It is also robust to schema drift: any new column appearing in mtcars later automatically passes through, instead of being silently dropped as it would be with a positive select(mpg, cyl, hp, wt, vs, am, gear).
Exercise 1.8: Drop rows missing ALL listed measurements
Task: In airquality, several rows have multiple missing measurements. Keep only rows where at least one of Ozone, Solar.R, or Wind is non-NA (i.e. drop rows missing all three). Use if_all() inside filter() and save to ex_1_8.
Expected result:
#> nrow(airquality)
#> [1] 153
#> nrow(ex_1_8)
#> [1] 153 # no row in airquality is missing ALL three; sanity check
#> # repeat with mtcars-style synthetic NAs to see the helper in action
Difficulty: Advanced
A row should be dropped only when every listed measurement is missing at once.
Negate if_all(c(Ozone, Solar.R, Wind), is.na) inside filter().
Click to reveal solution
Explanation: if_all() returns TRUE only when the predicate (is.na) holds across every named column in a row. Negating with ! keeps rows where it is NOT true that all three are missing, which is the standard "drop fully empty rows" pattern. Use if_any() for "missing in at least one"; the two read as quantifiers on columns.
Section 2. Arrange, mutate, and transform (8 problems)
Exercise 2.1: Bin diamonds into price tiers with case_when
Task: A jeweller preparing a quarterly sale wants to bucket diamonds into three tiers: "budget" (price < 1000), "mid" (1000 to 4999), and "premium" (>= 5000). Add a tier column with case_when() and save the augmented tibble to ex_2_1.
Expected result:
#> count(ex_2_1, tier)
#> # A tibble: 3 x 2
#> tier n
#> <chr> <int>
#> 1 budget 14524
#> 2 mid 28966
#> 3 premium 10450
Difficulty: Intermediate
You need three price buckets, and each row should land in the first range it matches.
Add a tier column in mutate() with case_when(), ordering the cutoffs and ending with a TRUE catch-all.
Click to reveal solution
Explanation: case_when() reads top to bottom; the first matching condition wins, so the second branch only catches prices in [1000, 4999]. The trailing TRUE ~ ... is the catch-all default; without it, prices at or above 5000 would become NA. This is much cleaner than nested if_else() for three or more buckets.
Exercise 2.2: Compute price-per-carat and rank
Task: A pricing analyst wants a "value per carat" view. Add ppc = price / carat to diamonds, then keep only carat, cut, price, ppc, and arrange descending by ppc. Save the top of the table to ex_2_2 and inspect the first six rows.
Expected result:
#> # A tibble: 6 x 4
#> carat cut price ppc
#> <dbl> <ord> <int> <dbl>
#> 1 1.07 Ideal 18279 17083.
#> 2 1.03 Ideal 17590 17078.
#> 3 1.04 Ideal 17729 17047.
#> 4 1.08 Ideal 18394 17031.
#> 5 1.05 Ideal 17869 17018.
#> 6 1.06 Ideal 18028 17008.
Difficulty: Intermediate
Derive the new ratio first, then reshape the columns and sort by it.
Use mutate(ppc = price / carat), then select() the four columns, then arrange(desc(ppc)).
Click to reveal solution
Explanation: mutate() adds the derived column without dropping existing ones, then select() reshapes to the four columns of interest. arrange(desc(ppc)) puts the most expensive-per-carat stones first; desc() is preferred over -ppc because it works on character and factor columns too. The actual numbers depend on the exact rounding of diamonds.
Exercise 2.3: Flag automatic vs manual with if_else
Task: A used-car aggregator needs a human-readable transmission label. In mtcars, the am column is 0/1; add a new column transmission that says "manual" when am == 1 and "automatic" otherwise, using if_else(). Save the modified data to ex_2_3.
Expected result:
#> count(ex_2_3, transmission)
#> transmission n
#> 1 automatic 19
#> 2 manual 13
Difficulty: Beginner
This is a two-outcome label driven by a single 0/1 condition.
Add transmission in mutate() with if_else(am == 1, "manual", "automatic").
Click to reveal solution
Explanation: if_else() from dplyr is type-strict (both branches must return the same type) and NA-aware (carries NA through the predicate), unlike base ifelse() which can silently coerce. For binary recodes from 0/1 numerics, this is the standard idiom. For three or more outcomes, switch to case_when().
Exercise 2.4: Round every numeric column with across
Task: Apply round(x, 2) to every numeric column of iris in a single call, leaving the Species factor untouched. Use mutate() with across() and where(is.numeric) so the code does not list column names. Save the rounded tibble to ex_2_4.
Expected result:
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> ...
#> # 147 more rows hidden
Difficulty: Intermediate
Apply the same rounding to a whole class of columns without naming any of them.
Inside mutate(), use across(where(is.numeric), \(x) round(x, 2)).
Click to reveal solution
Explanation: across() applies a function to many columns inside a single mutate(); where(is.numeric) is a tidyselect predicate that picks columns by class. The anonymous-function shorthand \(x) round(x, 2) is preferred over the older ~ round(.x, 2) because it scales to multiple arguments and matches base R syntax.
Exercise 2.5: Convert engine columns to factors
Task: For mtcars, the engine count cyl, gear count gear, and carburetor count carb are stored as numerics but are really categorical. Convert all three to factors in one mutate() using across(), leaving every other column unchanged. Save to ex_2_5.
Expected result:
#> sapply(ex_2_5[c("cyl","gear","carb")], class)
#> cyl gear carb
#> "factor" "factor" "factor"
Difficulty: Intermediate
Three named columns need the same type conversion in a single sweep.
Inside mutate(), use across(c(cyl, gear, carb), as.factor).
Click to reveal solution
Explanation: across(c(...), fn) passes the function bare (no anonymous wrapper needed) because as.factor() takes a single vector. This pattern lets you batch-coerce many columns and is the dplyr replacement for mutate_at() (now superseded). Categorical conversion early in the pipeline keeps later models (lm, aov) from treating the codes as continuous.
Exercise 2.6: Sort cars by mpg then hp
Task: A reviewer wants mtcars sorted with the most fuel-efficient cars first, breaking ties by lowest horsepower. Use arrange() with descending mpg and ascending hp, keeping all columns. Save the sorted tibble to ex_2_6 and inspect the head.
Expected result:
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> ...
Difficulty: Beginner
One sort key sets the main order while a second one breaks the ties.
Call arrange(desc(mpg), hp) so mpg leads and ascending hp resolves ties.
Click to reveal solution
Explanation: arrange() evaluates its sort keys left to right, so desc(mpg) is primary and hp is the tie-break. Notice the two Honda Civic / Lotus Europa rows both at 30.4 mpg: the one with lower hp (52 vs 113) appears first because of the secondary hp key. Use arrange(across(everything())) to sort on every column at once.
Exercise 2.7: Coalesce two columns into one
Task: Build an inline tibble of customer contacts where some have email filled and others have phone filled but not both. Add a primary_contact column equal to email when present, otherwise phone, using coalesce(). Save the augmented tibble to ex_2_7.
Expected result:
#> # A tibble: 4 x 4
#> id email phone primary_contact
#> <chr> <chr> <chr> <chr>
#> 1 A ada@example.com NA ada@example.com
#> 2 B NA 555-0102 555-0102
#> 3 C carol@example.com NA carol@example.com
#> 4 D NA 555-0104 555-0104
Difficulty: Intermediate
For each row, take the first value that is actually present out of the two options.
Add primary_contact in mutate() with coalesce(email, phone).
Click to reveal solution
Explanation: coalesce() walks its arguments left to right and returns the first non-NA value per row, much like SQL's COALESCE. It is the cleanest way to express "use A if present, else fall back to B (else C, etc.)" and beats nested if_else() chains. If every input is NA for a row, the result is NA.
Exercise 2.8: Re-code cut quality with case_match
Task: A summary table for executives wants diamonds$cut collapsed into two labels: "premium" for "Ideal" or "Premium", and "other" for the remaining three levels. Use case_match() (a case_when() shortcut for value re-coding) and save to ex_2_8.
Expected result:
#> count(ex_2_8, cut_group)
#> cut_group n
#> 1 other 22198
#> 2 premium 31742
Difficulty: Advanced
You are mapping a few specific category values onto two new labels.
Use case_match() inside mutate(), sending c("Ideal", "Premium") to "premium" and using .default for the rest.
Click to reveal solution
Explanation: case_match() is designed for value-to-value mapping, removing the noisy cut == "Ideal" | cut == "Premium" syntax that case_when() would need. The .default argument is the catch-all (replacing TRUE ~ ...). Coercing cut to character avoids surprises with the ordered factor levels that diamonds ships with.
Section 3. Summarise and group_by (10 problems)
Exercise 3.1: Mean and median mpg by cylinder count
Task: A reviewer wants headline fuel-economy numbers per engine size. Group mtcars by cyl and compute the mean and median of mpg for each group, naming the columns mean_mpg and median_mpg. Save the three-row summary to ex_3_1.
Expected result:
#> # A tibble: 3 x 3
#> cyl mean_mpg median_mpg
#> <dbl> <dbl> <dbl>
#> 1 4 26.7 26
#> 2 6 19.7 19.7
#> 3 8 15.1 15.2
Difficulty: Beginner
Split the data by engine size, then collapse each part down to one summary row.
After group_by(cyl), use summarise(mean_mpg = mean(mpg), median_mpg = median(mpg)).
Click to reveal solution
Explanation: group_by() partitions the data; summarise() then collapses each partition to one row, with each named expression becoming a new column. The result is automatically ungrouped at the last grouping variable, which is why downstream code rarely needs an explicit ungroup() after a single group_by().
Exercise 3.2: Count rows per cut quality
Task: A reporting analyst needs the size of every cut segment in diamonds. Use the count() shortcut to return a two-column tibble of cut level and row count, sorted from largest segment to smallest. Save the result to ex_3_2.
Expected result:
#> # A tibble: 5 x 2
#> cut n
#> <ord> <int>
#> 1 Ideal 21551
#> 2 Premium 13791
#> 3 Very Good 12082
#> 4 Good 4906
#> 5 Fair 1610
Difficulty: Beginner
You want each category's size, ranked from the biggest segment to the smallest.
Use count(cut, sort = TRUE).
Click to reveal solution
Explanation: count(col) is shorthand for group_by(col) |> summarise(n = n()). The sort = TRUE argument arranges descending by n, which is exactly what you want for a ranked segment view. For multiple grouping columns, pass them all (count(cut, color)); for weighted counts, pass wt = some_col.
Exercise 3.3: Summarise every numeric column at once
Task: Get the per-species mean of every numeric column in iris in a single call. Group by Species, then use summarise() with across(where(is.numeric), mean) so adding a new measurement column would just appear automatically. Save to ex_3_3.
Expected result:
#> # A tibble: 3 x 5
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 3.43 1.46 0.246
#> 2 versicolor 5.94 2.77 4.26 1.33
#> 3 virginica 6.59 2.97 5.55 2.03
Difficulty: Intermediate
Average every measurement column per group without listing the columns by hand.
After group_by(Species), use summarise(across(where(is.numeric), mean)).
Click to reveal solution
Explanation: across() inside summarise() produces one summary column per input column (here, four numerics give four summary outputs). For multiple summary functions per column, pass a named list: across(where(is.numeric), list(mean = mean, sd = sd)) produces Sepal.Length_mean, Sepal.Length_sd, and so on.
Exercise 3.4: Mean price by cut and color
Task: A merchandiser wants a two-way breakdown. Group diamonds by cut and color, summarise to mean price and n = n(), keep only segments with more than 1000 rows, and save the surviving rows to ex_3_4. The result should have 22 rows.
Expected result:
#> # A tibble: 22 x 4
#> cut color mean_price n
#> <ord> <ord> <dbl> <int>
#> 1 Ideal D 2629. 2834
#> 2 Ideal E 2598. 3903
#> 3 Ideal F 3375. 3826
#> 4 Ideal G 3721. 4884
#> ...
#> # 18 more rows hidden
Difficulty: Intermediate
Summarise on a two-variable grouping first, then drop the small segments.
Group by cut, color, summarise(mean_price = mean(price), n = n(), .groups = "drop"), then filter(n > 1000).
Click to reveal solution
Explanation: summarise() with two grouping columns produces one row per (cut, color) combination. Passing .groups = "drop" strips the residual grouping so the downstream filter() runs on a regular tibble. Without it, filter() would still work, but later operations could be silently scoped to remaining groups.
Exercise 3.5: Weighted mean of mpg by cylinder
Task: Use car weight wt to compute the weighted mean of mpg per cylinder in mtcars, so heavier cars count for more. Group by cyl and call weighted.mean(mpg, w = wt) inside summarise(). Save to ex_3_5 and compare against a plain mean column called naive_mean.
Expected result:
#> # A tibble: 3 x 3
#> cyl wmean_mpg naive_mean
#> <dbl> <dbl> <dbl>
#> 1 4 26.2 26.7
#> 2 6 19.7 19.7
#> 3 8 15.0 15.1
Difficulty: Advanced
A plain average treats every car equally; here heavier cars should pull harder.
Inside summarise(), use weighted.mean(mpg, w = wt) next to a plain mean(mpg).
Click to reveal solution
Explanation: Weighted means matter whenever observations are unequal: heavy cars likely use more fuel and should pull the average down, which is why wmean_mpg is a touch lower than naive_mean for cyl == 4. dplyr does not have a dedicated weighted-mean verb, but weighted.mean() from base R drops straight into summarise().
Exercise 3.6: Conditional sum inside summarise
Task: From mpg, count how many city-mpg miles come from cars with class == "suv" versus other classes, expressed as one row per manufacturer. Use summarise() with sum(cty[class == "suv"]) and sum(cty[class != "suv"]). Save to ex_3_6.
Expected result:
#> # A tibble: 15 x 3
#> manufacturer suv_cty other_cty
#> <chr> <int> <int>
#> 1 audi 0 259
#> 2 chevrolet 55 119
#> 3 dodge 104 250
#> 4 ford 126 108
#> ...
#> # 11 more rows hidden
Difficulty: Intermediate
Within one summary you can total just the rows that meet a condition by indexing the vector.
After group_by(manufacturer), use sum(cty[class == "suv"]) and sum(cty[class != "suv"]) in summarise().
Click to reveal solution
Explanation: Inline subsetting (cty[class == "suv"]) is dplyr-friendly: each summary expression sees the group's columns as plain vectors, so any base R indexing works. This avoids splitting the pipeline into two filter() branches and joining them back. Numbers depend on the exact mpg shipped with ggplot2.
Exercise 3.7: Per-group quantiles of price
Task: A pricing team wants the 25th, 50th, and 75th percentiles of price for each cut in diamonds. Inside summarise() use quantile(price, probs = c(0.25, 0.5, 0.75)) and unnest the named vector into three columns with as_tibble_row(). Save to ex_3_7.
Expected result:
#> # A tibble: 5 x 4
#> cut q25 q50 q75
#> <ord> <dbl> <dbl> <dbl>
#> 1 Fair 942 3282 5206.
#> 2 Good 912 3050. 5028
#> 3 Very Good 912 2648 5373
#> 4 Premium 1046 3185 6296
#> 5 Ideal 878 1810. 4678.
Difficulty: Advanced
Each summary cell must hold a single number, so ask for one percentile at a time.
After group_by(cut), call quantile(price, 0.25), quantile(price, 0.50) and quantile(price, 0.75) as three columns.
Click to reveal solution
Explanation: summarise() requires each expression to return a single value, so calling quantile() separately for each probability is the simplest path. The alternative, returning a length-3 vector, used to require reframe() (added in dplyr 1.1) which lets a summary expression produce multiple rows per group. For wide output, the three-call form is more readable.
Exercise 3.8: Group share as a percent of total
Task: A merchandiser wants the share of inventory per cut in diamonds, expressed as a percent rounded to one decimal. Compute n = n() per group, then add a share_pct column using n / sum(n) * 100. Save the result to ex_3_8.
Expected result:
#> # A tibble: 5 x 3
#> cut n share_pct
#> <ord> <int> <dbl>
#> 1 Fair 1610 3
#> 2 Good 4906 9.1
#> 3 Very Good 12082 22.4
#> 4 Premium 13791 25.6
#> 5 Ideal 21551 40
Difficulty: Intermediate
Get the group counts first, then express each one as a fraction of their total.
After count(cut), use mutate(share_pct = round(n / sum(n) * 100, 1)).
Click to reveal solution
Explanation: count() returns one row per cut and a column n. Because the resulting tibble has no remaining grouping (count auto-drops one level), sum(n) is the grand total across all cuts, which gives the share-of-whole. To get share within a different parent group, wrap the divisor in a grouped sum(n) via group_by() before the mutate().
Exercise 3.9: Compare mean price across cuts in wide format
Task: Build a one-row wide table showing the mean price for each cut level of diamonds as its own column (Fair, Good, Very_Good, Premium, Ideal). Use summarise() + pivot_wider() from tidyr. Save the single-row result to ex_3_9.
Expected result:
#> # A tibble: 1 x 5
#> Fair Good `Very Good` Premium Ideal
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4359. 3929. 3982. 4584. 3458.
Difficulty: Intermediate
Compute the summary in long form, then spread one column out into headers.
After group_by(cut) and summarise(mean_price = mean(price)), use pivot_wider(names_from = cut, values_from = mean_price).
Click to reveal solution
Explanation: Wide format is the right shape for side-by-side comparison and exec dashboards. The summary stays long while computing (group_by + summarise), then pivot_wider() reshapes once: names_from becomes the new column headers, values_from becomes their cell values. Use pivot_longer() to go the other way.
Exercise 3.10: Use the .by argument shortcut
Task: Repeat the mean-and-median-mpg-by-cyl summary from Exercise 3.1, but skip group_by() and use the newer .by argument inside summarise() (introduced in dplyr 1.1). Save the result to ex_3_10 and confirm it matches ex_3_1.
Expected result:
#> # A tibble: 3 x 3
#> cyl mean_mpg median_mpg
#> <dbl> <dbl> <dbl>
#> 1 6 19.7 19.7
#> 2 4 26.7 26
#> 3 8 15.1 15.2
Difficulty: Intermediate
You can attach the grouping to a single verb instead of declaring it as a separate step.
Inside summarise(), pass .by = cyl rather than calling group_by() first.
Click to reveal solution
Explanation: .by is a per-call grouping argument that does not persist after the verb finishes, so you never need an explicit ungroup(). It is preferable when grouping applies to just one verb. Row order in the output follows the order of first appearance of cyl in the data, not sorted; pass .by = cyl with a downstream arrange(cyl) for a sorted view.
Section 4. Joins and combining tables (8 problems)
Exercise 4.1: Inner join customers with their orders
Task: A finance team needs every customer-order pair where both sides exist. Build two inline tibbles, customers (id, name) and orders (cust_id, amount), and inner-join them on the customer key. Save the resulting joined tibble to ex_4_1.
Expected result:
#> # A tibble: 3 x 3
#> id name amount
#> <int> <chr> <dbl>
#> 1 1 Ada 100
#> 2 2 Bob 50
#> 3 2 Bob 75
Difficulty: Beginner
Keep only the customer-order pairs that have a match on both sides.
Use inner_join(orders, by = c("id" = "cust_id")).
Click to reveal solution
Explanation: inner_join() keeps only rows that match on both sides, so customer 3 (Carol, no order) and order from cust_id 4 (no matching customer) both drop. The named vector c("id" = "cust_id") tells dplyr the left key is id and the right key is cust_id. The newer alternative is join_by(id == cust_id).
Exercise 4.2: Left join preserves all customers
Task: Using the same customers and orders tibbles, return one row per customer with their order amount (or NA if they never bought anything). Use left_join() so Carol is retained. Save to ex_4_2 and notice the NA row.
Expected result:
#> # A tibble: 4 x 3
#> id name amount
#> <int> <chr> <dbl>
#> 1 1 Ada 100
#> 2 2 Bob 50
#> 3 2 Bob 75
#> 4 3 Carol NA
Difficulty: Intermediate
Every customer must survive the join, even one who never placed an order.
Use left_join(orders, by = c("id" = "cust_id")).
Click to reveal solution
Explanation: left_join() keeps every row from the left side regardless of whether a match exists on the right; unmatched rows get NA in the right-side columns. Bob has two orders so he appears twice (left rows can be duplicated by the right). This is the right join when the left table is your "spine" of entities you must report on.
Exercise 4.3: Find customers without orders via anti_join
Task: From the inline customers and orders tibbles, return the customers who have never placed an order. Use anti_join() so the result is a normal customer tibble without any order columns. Save to ex_4_3 and confirm it has one row.
Expected result:
#> # A tibble: 1 x 2
#> id name
#> <int> <chr>
#> 1 3 Carol
Difficulty: Intermediate
You want the customers who have no counterpart at all in the orders table.
Use anti_join(orders, by = c("id" = "cust_id")).
Click to reveal solution
Explanation: Filtering joins (anti_join, semi_join) only return columns from the left table and never duplicate rows. anti_join() keeps rows that have no match on the right, the inverse of semi_join(). It is the cleanest way to ask "who is missing from B?" without a noisy left_join() |> filter(is.na(amount)) pattern.
Exercise 4.4: Keep customers who DID place orders via semi_join
Task: Using the same two inline tibbles, return the customer rows for those who have at least one order, without inflating the result by the order count (semi_join, not inner_join). Save to ex_4_4 and confirm two rows.
Expected result:
#> # A tibble: 2 x 2
#> id name
#> <int> <chr>
#> 1 1 Ada
#> 2 2 Bob
Difficulty: Intermediate
Filter customers by membership in the orders table, without copying order columns or duplicating rows.
Use semi_join(orders, by = c("id" = "cust_id")).
Click to reveal solution
Explanation: semi_join() is to inner_join() what anti_join() is to left_join() |> filter(is.na(...)): a filtering-only variant. It keeps at most one copy of each left row and never adds right-side columns. Reach for it when you want to subset by membership in another table without materialising extra columns.
Exercise 4.5: Full join exposes both-sides gaps
Task: Combine the inline customers and orders tibbles with full_join() so the result contains every customer AND every order, with NAs filling unmatched cells on either side. Save to ex_4_5 and confirm five rows.
Expected result:
#> # A tibble: 5 x 4
#> id name cust_id amount
#> <int> <chr> <dbl> <dbl>
#> 1 1 Ada 1 100
#> 2 2 Bob 2 50
#> 3 2 Bob 2 75
#> 4 3 Carol NA NA
#> 5 NA NA 4 200
Difficulty: Intermediate
Keep every row from both tables, letting NAs fill the gaps on whichever side is missing.
Use full_join(orders, by = c("id" = "cust_id"), keep = TRUE).
Click to reveal solution
Explanation: full_join() is the union of left_join() and right_join() and is invaluable when you suspect both sides have orphan records. The keep = TRUE argument preserves both join keys as separate columns so you can see exactly which side a row came from. Drop keep and the keys merge into one column.
Exercise 4.6: Join on two key columns
Task: Inline two tibbles staffing and shifts keyed on both date and store. Inner-join them on the composite key so each shift gets the correct staff count for that date and store. Save the joined result to ex_4_6.
Expected result:
#> # A tibble: 3 x 4
#> date store staff_count shift_label
#> <chr> <chr> <dbl> <chr>
#> 1 2026-01-02 A 5 morning
#> 2 2026-01-02 B 3 evening
#> 3 2026-01-03 A 4 morning
Difficulty: Intermediate
The match is correct only when both key fields agree at the same time.
Use inner_join(shifts, by = c("date", "store")).
Click to reveal solution
Explanation: Passing a character vector to by joins on multiple columns by name (same name on both sides). If the columns were named differently, you would use by = c("date" = "shift_date", "store" = "shift_store"). Composite keys are how you avoid Cartesian explosions: every (date, store) pair is meant to be unique on at least one side.
Exercise 4.7: Inequality join with join_by
Task: A risk team wants to attach a tax rate from a brackets tibble (min_income, max_income, rate) to a small payroll tibble. Use join_by() with the inequality operators >= and < so each salary maps to the bracket it falls into. Save the augmented payroll to ex_4_7.
Expected result:
#> # A tibble: 3 x 3
#> employee salary rate
#> <chr> <dbl> <dbl>
#> 1 Ada 45000 0.15
#> 2 Bob 85000 0.22
#> 3 Carol 200000 0.32
Difficulty: Advanced
Each salary belongs to the one bracket whose lower and upper bounds straddle it.
Use inner_join(brackets, by = join_by(salary >= min_income, salary < max_income)), then select() the columns.
Click to reveal solution
Explanation: join_by() (dplyr 1.1+) supports inequalities, rolling joins, and overlap joins that the older by = character syntax could not express. Here each salary is matched to the unique bracket whose min_income <= salary < max_income. Without inequality joins, this would require a manual crossing() followed by filter(), a quadratic blow-up on real data.
Exercise 4.8: Stack two datasets with bind_rows and .id
Task: Append the first three rows of mtcars and the first three rows of mpg's shared columns into one tall tibble, tagging each row with a source column equal to "mtcars" or "mpg". Use bind_rows() with the .id argument. Save to ex_4_8.
Expected result:
#> # A tibble: 6 x 3
#> source cyl hp
#> <chr> <dbl> <dbl>
#> 1 mtcars 6 110
#> 2 mtcars 6 110
#> 3 mtcars 4 93
#> 4 mpg 4 NA
#> 5 mpg 4 NA
#> 6 mpg 4 NA
Difficulty: Intermediate
Stack the two tables on top of each other and tag every row with where it came from.
Call bind_rows(mtcars = a, mpg = b, .id = "source").
Click to reveal solution
Explanation: bind_rows() aligns columns by name, filling missing ones with NA, which is why hp is NA for the mpg rows. The .id argument creates a new column whose values come from the names you passed (here "mtcars" and "mpg"), perfect for tagging row provenance when you union datasets from different sources.
Section 5. Window functions and ranking (8 problems)
Exercise 5.1: Rank cars by mpg within cylinder
Task: A reviewer wants the rank order of mpg within each cyl group of mtcars, with rank 1 going to the highest mpg. Use group_by(cyl) and mutate(rank_mpg = row_number(desc(mpg))). Keep car, cyl, mpg, rank_mpg and save to ex_5_1.
Expected result:
#> head(filter(ex_5_1, cyl == 4))
#> # A tibble: 6 x 4
#> car cyl mpg rank_mpg
#> <chr> <dbl> <dbl> <int>
#> 1 Toyota Corolla 4 33.9 1
#> 2 Fiat 128 4 32.4 2
#> 3 Honda Civic 4 30.4 3
#> 4 Lotus Europa 4 30.4 4
#> 5 Fiat X1-9 4 27.3 5
#> 6 Porsche 914-2 4 26.0 6
Difficulty: Intermediate
Rank the rows within each group, with the highest value taking position one.
After group_by(cyl), use mutate(rank_mpg = row_number(desc(mpg))), then select() the four columns.
Click to reveal solution
Explanation: row_number() breaks ties by the order of appearance in the data, giving every row a distinct integer rank. Wrapping the input in desc() reverses the direction so 1 is "best". Use min_rank() if you want tied values to share a rank with gaps, or dense_rank() for ties without gaps.
Exercise 5.2: Period-over-period change with lag
Task: A growth team wants the month-over-month change in unemploy from the economics dataset. Compute delta = unemploy - lag(unemploy) after sorting by date, keep date, unemploy, delta, and drop the first NA row. Save to ex_5_2.
Expected result:
#> # A tibble: 6 x 3
#> date unemploy delta
#> <date> <dbl> <dbl>
#> 1 1967-08-01 2945 62
#> 2 1967-09-01 2958 13
#> 3 1967-10-01 3143 185
#> 4 1967-11-01 3066 -77
#> 5 1967-12-01 3018 -48
#> 6 1968-01-01 2878 -140
Difficulty: Intermediate
Each row's change compares it against the previous period, so fix the order before anything else.
After arrange(date), use mutate(delta = unemploy - lag(unemploy)), then drop the NA row with filter().
Click to reveal solution
Explanation: lag(x) shifts the vector down by one, so the first row becomes NA (no prior period). Always arrange() before lag()/lead() because they trust the row order they see. Use lag(x, n = 12) for year-over-year on monthly data, or pass default = ... to control how the first n rows fill.
Exercise 5.3: Cumulative sum of price within cut
Task: Within each cut of diamonds, compute the running sum of price ordered by carat ascending. Keep cut, carat, price, and the new cum_price column. Save the result to ex_5_3 and inspect a small head per group.
Expected result:
#> # A tibble: 6 x 4
#> # Groups: cut [1]
#> cut carat price cum_price
#> <ord> <dbl> <int> <int>
#> 1 Fair 0.22 337 337
#> 2 Fair 0.25 336 673
#> 3 Fair 0.25 631 1304
#> 4 Fair 0.27 460 1764
#> 5 Fair 0.27 573 2337
#> 6 Fair 0.27 626 2963
Difficulty: Intermediate
A running total should restart at each group and follow the row order you set.
After arrange(cut, carat) and group_by(cut), use mutate(cum_price = cumsum(price)).
Click to reveal solution
Explanation: cumsum() is a window function: it returns a vector the same length as its input. Inside a group_by(), it restarts at each group, which is the entire point. arrange() before group_by() controls the order within each window; without it, the running sum would follow whatever order the source happened to be in.
Exercise 5.4: Top 3 priciest stones per cut
Task: From diamonds, return the three priciest stones per cut level using slice_max(). Keep all columns and order each within-cut block from most to least expensive. Save to ex_5_4 and confirm 15 rows.
Expected result:
#> # A tibble: 15 x 10
#> carat cut color clarity depth table price x y z
#> <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 2.01 Fair G SI1 70.6 64 18574 7.43 6.64 4.69
#> 2 2.02 Fair H SI2 64.5 57 18565 8.00 7.95 5.14
#> 3 2.01 Fair G SI1 62.8 63 18557 7.91 7.85 4.94
#> ...
#> # 12 more rows hidden
Difficulty: Intermediate
Within each group, keep only the handful of rows at the top of one column.
After group_by(cut), use slice_max(price, n = 3, with_ties = FALSE).
Click to reveal solution
Explanation: slice_max(price, n = 3) keeps the top three rows by price within each group. with_ties = FALSE forces exactly three rows even when there is a tie at the cutoff, which is what you want when downstream code expects a fixed count. The sibling slice_min() does the inverse, and slice_sample(n = 3) gives random picks.
Exercise 5.5: Deduplicate keeping the latest record
Task: Inline a tibble events of (id, ts, value) where some ids repeat across timestamps. Keep only the most recent row per id using arrange(desc(ts)) |> group_by(id) |> slice_head(n = 1). Save the deduplicated tibble to ex_5_5.
Expected result:
#> # A tibble: 3 x 3
#> id ts value
#> <int> <chr> <dbl>
#> 1 1 2026-04-03 30
#> 2 2 2026-04-02 25
#> 3 3 2026-04-04 50
Difficulty: Intermediate
Order the rows so the newest record sits first, then take just the top one per id.
Use arrange(desc(ts)), group_by(id), then slice_head(n = 1).
Click to reveal solution
Explanation: "Most recent per id" is a window-level filter: rank by ts within id, keep rank 1. The arrange(desc(ts)) puts the newest row first within each group, and slice_head(n = 1) takes it. The same idea generalises to "first event per session", "best score per player", and so on.
Exercise 5.6: dense_rank versus row_number
Task: In mpg, add two columns ranking cars by hwy mileage descending across the whole dataset: rn = row_number(desc(hwy)) (unique) and dr = dense_rank(desc(hwy)) (ties share a rank, no gaps). Save the augmented tibble to ex_5_6 and inspect a few tied rows.
Expected result:
#> # A tibble: 6 x 4
#> manufacturer hwy rn dr
#> <chr> <int> <int> <int>
#> 1 volkswagen 44 1 1
#> 2 volkswagen 44 2 1
#> 3 toyota 44 3 1
#> 4 honda 41 4 2
#> 5 volkswagen 41 5 2
#> 6 toyota 37 6 3
Difficulty: Advanced
One ranking gives every row a unique slot; the other lets ties share a slot with no gaps after them.
Inside mutate(), set rn = row_number(desc(hwy)) and dr = dense_rank(desc(hwy)).
Click to reveal solution
Explanation: row_number() always returns a unique integer per row (ties broken by source order). dense_rank() gives equal values the same rank with no gap to the next, so three rows tied at hwy=44 all get dr=1 and the next distinct value (hwy=41) is dr=2. min_rank() is the third flavour: equal values share, but the next rank skips ahead.
Exercise 5.7: Three-period moving average with lag
Task: From economics, compute a 3-month centred moving average of unemploy using base mean() over lag(x,1), x, lead(x,1). Sort by date first, keep date, unemploy, and the new column ma3, dropping the first and last rows. Save the result to ex_5_7.
Expected result:
#> # A tibble: 6 x 3
#> date unemploy ma3
#> <date> <dbl> <dbl>
#> 1 1967-08-01 2945 2929.
#> 2 1967-09-01 2958 3015.
#> 3 1967-10-01 3143 3056.
#> 4 1967-11-01 3066 3076.
#> 5 1967-12-01 3018 2987.
#> 6 1968-01-01 2878 2926.
Difficulty: Advanced
A centred average needs the current value plus one neighbour on each side of it.
After arrange(date), use mutate(ma3 = (lag(unemploy) + unemploy + lead(unemploy)) / 3), then drop the NA rows.
Click to reveal solution
Explanation: A centred 3-period average uses one observation on each side of the current row, which is why lag() and lead() appear together. The first and last rows of the series have no neighbour on one side, so ma3 is NA there and the filter() drops them. For longer windows, the slider package offers slide_dbl().
Exercise 5.8: First and last value per group
Task: A retention analyst wants the first and last unemploy reading for each year in economics. Sort by date, add a year column, and within each year use first() and last() to capture the opening and closing values. Save the summary to ex_5_8.
Expected result:
#> # A tibble: 6 x 3
#> year first last
#> <dbl> <dbl> <dbl>
#> 1 1967 2944 3018
#> 2 1968 2878 2685
#> 3 1969 2718 2884
#> 4 1970 3201 4988
#> 5 1971 4903 4715
#> 6 1972 4471 4191
Difficulty: Intermediate
Within each year you want the opening and closing reading, which depends on row order.
After adding a year column and group_by(year), use summarise(first = first(unemploy), last = last(unemploy)).
Click to reveal solution
Explanation: first() and last() are dplyr helpers that respect the current row order, so arrange() controls what counts as "first". They are clearer than head(x, 1) and tail(x, 1) inside summarise() because they return a single scalar by design. Use nth(x, n) for arbitrary positions.
Section 6. Multi-step pipelines and real scenarios (8 problems)
Exercise 6.1: Cohort retention from a tiny event log
Task: Build an inline event log with one row per (user, day_visited). For each signup_day cohort, compute the share of users who came back at least once on a later day. Save the cohort-level retention rate to ex_6_1 as a percentage rounded to one decimal.
Expected result:
#> # A tibble: 2 x 3
#> signup_day n_users retention_pct
#> <chr> <int> <dbl>
#> 1 2026-01-01 3 66.7
#> 2 2026-01-02 2 50
Difficulty: Advanced
First reduce each user to whether they ever returned, then average that flag per cohort.
Group by signup_day, user and summarise(returned = any(day_visited > signup_day)), then group by signup_day and take mean(returned).
Click to reveal solution
Explanation: The pipeline runs in two grouping passes: first collapse each (cohort, user) to a single retention flag with any(), then average the flag within each cohort to get the rate. Doing it in one pass would over-count users with many return visits. mean() of a logical vector is the share that is TRUE.
Exercise 6.2: Remove price outliers within each cut
Task: From diamonds, remove the price outliers within each cut using the 1.5 * IQR rule. Compute per-cut quartiles, then filter to rows whose price is inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Save the cleaned data to ex_6_2 and report row count change.
Expected result:
#> nrow(diamonds)
#> [1] 53940
#> nrow(ex_6_2)
#> [1] 50225 # roughly: dataset minus per-cut high-price outliers
Difficulty: Advanced
Outlier limits must be computed per group, then applied row by row inside that same group.
After group_by(cut), mutate() the quantile(price, 0.25) and quantile(price, 0.75) bounds, then filter() on the 1.5*IQR range.
Click to reveal solution
Explanation: Per-group outlier removal needs per-group thresholds, which is why mutate() (not summarise()) is used: it broadcasts the group quantiles to every row in the group. After the filter, the helper columns are dropped. Tukey's 1.5 * IQR is conservative for skewed data like prices; for log-scale data, do the same logic on log(price).
Exercise 6.3: Find groups with monotonically increasing values
Task: Inline a tibble metrics with (segment, week, value). Return only the segments whose value strictly increases week-over-week across all four weeks. Use group_by() and a logical reduction inside summarise(). Save the surviving segment names to ex_6_3.
Expected result:
#> # A tibble: 1 x 1
#> segment
#> <chr>
#> 1 alpha
Difficulty: Advanced
A segment qualifies only if every week-to-week step moves upward.
Inside a per-segment summarise(), test all(diff(value) > 0), then filter() the TRUE segments.
Click to reveal solution
Explanation: diff() returns the period-over-period gaps; all(diff(value) > 0) is TRUE only when every gap is positive, i.e. the series is strictly increasing. Wrapping this in a per-group summarise() reduces every segment to a single TRUE/FALSE, then a downstream filter() keeps the qualifying ones. >= instead of > would allow plateaus.
Exercise 6.4: Largest single-month jump in unemployment
Task: In economics, find the date with the largest single-month increase in unemploy, plus the size of the jump. Use lag() to compute month-over-month change, then slice_max() to pull the row. Save the one-row tibble to ex_6_4.
Expected result:
#> # A tibble: 1 x 3
#> date unemploy jump
#> <date> <dbl> <dbl>
#> 1 2008-12-01 11400 584
Difficulty: Advanced
Build the period-over-period changes first, then pull the single biggest one.
After arrange(date) and mutate(jump = unemploy - lag(unemploy)), use slice_max(jump, n = 1, with_ties = FALSE).
Click to reveal solution
Explanation: The combined pattern of lag() for differences and slice_max() for extremes is one of dplyr's most useful idioms for time series anomaly detection. arrange(date) is essential before lag(). with_ties = FALSE keeps the output to exactly one row when several months happen to share the maximum jump.
Exercise 6.5: Pareto 80/20 of inventory value
Task: From diamonds, find the smallest set of rows (ranked by price descending) whose cumulative price reaches 80% of total inventory value. Add a running share and filter up to (and including) the row that crosses 0.8. Save the qualifying rows to ex_6_5.
Expected result:
#> nrow(ex_6_5)
#> [1] 32115 # ~60% of rows account for the top 80% of value
#> sum(ex_6_5$price) / sum(diamonds$price)
#> [1] 0.8000003
Difficulty: Advanced
Rank rows by value, accumulate their share of the total, and stop at the row that crosses 80%.
Use cumsum() to build cum_share, then filter(cum_share <= 0.80 | lag(cum_share, default = 0) <= 0.80).
Click to reveal solution
Explanation: The Pareto cut is "biggest first until cumulative share crosses 80%", which requires including the row that crosses, not just the rows below it. The lag(cum_share, default = 0) <= 0.80 clause adds back the boundary-crosser. The exact count depends on price distribution but typically the top 20% to 60% of rows hold 80% of value.
Exercise 6.6: Z-score within groups and flag outliers
Task: Within each cyl group of mtcars, compute the z-score of hp and flag rows whose absolute z-score exceeds 1.5 as outliers. Add columns z_hp and is_outlier, keep car, cyl, hp, z_hp, is_outlier. Save the result to ex_6_6.
Expected result:
#> filter(ex_6_6, is_outlier)
#> # A tibble: 4 x 5
#> car cyl hp z_hp is_outlier
#> <chr> <dbl> <dbl> <dbl> <lgl>
#> 1 Lotus Europa 4 113 1.97 TRUE
#> 2 Ferrari Dino 6 175 1.59 TRUE
#> 3 Maserati Bora 8 335 2.06 TRUE
#> 4 Cadillac Fleetwood 8 205 -1.51 TRUE
Difficulty: Intermediate
Standardise the values within each group so they are comparable, then flag the extreme ones.
After group_by(cyl), mutate() z_hp = (hp - mean(hp)) / sd(hp) and is_outlier = abs(z_hp) > 1.5.
Click to reveal solution
Explanation: Group-wise standardisation puts each hp on a comparable scale per cylinder class, which matters because a 175-hp 6-cylinder is unusual but a 175-hp 8-cylinder is not. The threshold 1.5 (sigma) is a soft outlier rule; 2 or 3 is stricter. Keep mutate() (not summarise()) so the original rows are preserved.
Exercise 6.7: Pivot a price summary into a wide cut by color table
Task: Group diamonds by cut and color, compute mean price, then pivot to a wide layout where rows are cut and columns are color. Round values to the nearest integer. Save the resulting 5x8 tibble to ex_6_7.
Expected result:
#> # A tibble: 5 x 8
#> cut D E F G H I J
#> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Fair 4292 3683 3827 4239 5136 4685 4976
#> 2 Good 3405 3424 3496 4123 4276 5079 4574
#> ...
#> # 3 more rows hidden
Difficulty: Intermediate
Build the two-variable summary in long form, then spread one variable across the columns.
After group_by(cut, color) and summarise(mean_price = round(mean(price))), use pivot_wider(names_from = color, values_from = mean_price).
Click to reveal solution
Explanation: A two-way summary table is the canonical use of summarise() + pivot_wider(). The result has one row per cut and one column per color, with cell values from mean_price. If a (cut, color) combination had no rows, the cell would default to NA; pass values_fill = 0 to override.
Exercise 6.8: Detect duplicates by a subset of columns
Task: Inline a tibble txns (id, account, amount, ts) where some rows are exact duplicates on (account, amount) within the same minute. Flag those duplicates with add_count() and keep only the duplicate rows for review. Save the suspicious rows to ex_6_8.
Expected result:
#> # A tibble: 4 x 5
#> id account amount ts dup_n
#> <int> <chr> <dbl> <chr> <int>
#> 1 1 A 100 2026-04-01 10:00 2
#> 2 2 A 100 2026-04-01 10:00 2
#> 3 4 B 50 2026-04-01 11:00 2
#> 4 5 B 50 2026-04-01 11:00 2
Difficulty: Intermediate
Count how many rows share the same key, then keep only those that are not alone.
Use add_count(account, amount, name = "dup_n"), then filter(dup_n > 1).
Click to reveal solution
Explanation: add_count() is count() followed by a join: it adds a column with the group size to every original row, so you can filter by it without losing the original columns. Filtering on dup_n > 1 keeps only rows that belong to a duplicate group. For exact whole-row duplicates use distinct() (drop) or janitor::get_dupes() (inspect).
What to do next
- Data Wrangling With dplyr: the core tutorial that introduces every verb used here.
- dplyr Joins Exercises: drill specifically on join types, key mismatches, and inequality joins.
- dplyr Group By Exercises: deeper practice on grouped summaries,
.by, and grouped windows. - dplyr Window Functions Exercises: focused practice on
lag,lead, ranking, and running calculations.
r-statistics.co · Verifiable credential · Public URL
This document certifies mastery of
dplyr Mastery
Every certificate has a public verification URL that proves the holder passed the assessment. Anyone with the link can confirm the recipient and date.
855 learners have earned this certificate