dplyr Window Functions Exercises in R: 25 Practice Problems
Twenty-five practice problems on dplyr window functions: lead and lag, cumulative operations, ranking, ntile bucketing, positional selectors, cumulative filters, and within-group windows. Each problem ships with a stated task, an expected result, and a hidden solution with an explanation. Built-in datasets only.
Section 1. lead and lag offsets (5 problems)
Exercise 1.1: Compute the previous value of a vector with lag
Task: Use lag() to add a column called prev containing the previous value of x for a small tibble built from x = c(10, 12, 15, 14, 18). The first row's previous value should be NA because nothing precedes it. Save the resulting tibble to ex_1_1.
Expected result:
#> # A tibble: 5 x 2
#> x prev
#> <dbl> <dbl>
#> 1 10 NA
#> 2 12 10
#> 3 15 12
#> 4 14 15
#> 5 18 14
Difficulty: Beginner
Each row needs the value sitting in the row directly above it, and the first row stays empty because nothing comes before it.
Inside mutate(), create the prev column by calling lag(x).
Click to reveal solution
Explanation: lag(x) shifts the vector down by one position and pads the head with NA because the first row has no predecessor. The default offset is n = 1 and you can override it (for example lag(x, n = 2) looks two rows back). Pair lag() with mutate() to create lookback columns on time ordered data; the boundary NA is intentional and lets downstream is.na() filters skip the first observation cleanly.
Exercise 1.2: Peek at the next value with lead
Task: Use lead() to add a column called nxt containing the next value of x for the same tibble built from x = c(10, 12, 15, 14, 18). The last row's next value should be NA because nothing follows it. Save the result to ex_1_2.
Expected result:
#> # A tibble: 5 x 2
#> x nxt
#> <dbl> <dbl>
#> 1 10 12
#> 2 12 15
#> 3 15 14
#> 4 14 18
#> 5 18 NA
Difficulty: Beginner
Each row needs the value sitting in the row directly below it, and the last row stays empty because nothing comes after it.
Inside mutate(), create the nxt column by calling lead(x).
Click to reveal solution
Explanation: lead(x) is the mirror image of lag(x): it shifts the vector up by one position and pads the tail with NA. Use lead() when you need to compare a row with what comes next, for example to mark transitions like state changes, end of session, or upcoming spikes. Both functions accept a default argument that replaces the boundary NA with any sentinel value you prefer.
Exercise 1.3: Day over day Temp change in airquality
Task: Compute the day over day change in Temp from the airquality dataset. Add a column temp_change defined as Temp minus lag(Temp). Keep only Month, Day, Temp, and temp_change, then arrange by Month and Day so the lag respects time order. Save the result to ex_1_3.
Expected result:
#> # A tibble: 153 x 4
#> Month Day Temp temp_change
#> <int> <int> <int> <int>
#> 1 5 1 67 NA
#> 2 5 2 72 5
#> 3 5 3 74 2
#> 4 5 4 62 -12
#> 5 5 5 56 -6
#> ...
#> # 148 more rows hidden
Difficulty: Intermediate
A day-over-day change is this day's reading minus the reading recorded the day before it.
In mutate(), compute temp_change as Temp - lag(Temp), then select() the four columns to keep.
Click to reveal solution
Explanation: arrange() before lag() is critical because lag respects row order, not date order. Subtracting lag(Temp) produces a delta whose sign reveals direction (positive for warming, negative for cooling). The first row is NA because there is no prior observation. This same pattern computes first differences for any time series before fitting AR models or stationarity tests.
Exercise 1.4: Per chick weight gain between visits
Task: A nutrition researcher running a feed trial wants the weight gain a chick puts on between consecutive visits in the ChickWeight dataset. Group by Chick, arrange by Time, then add a gain column computed as weight - lag(weight). Save the result to ex_1_4.
Expected result:
#> # A tibble: 578 x 5
#> # Groups: Chick [50]
#> weight Time Chick Diet gain
#> <dbl> <dbl> <ord> <fct> <dbl>
#> 1 42 0 1 1 NA
#> 2 51 2 1 1 9
#> 3 59 4 1 1 8
#> 4 64 6 1 1 5
#> 5 76 8 1 1 12
#> ...
#> # 573 more rows hidden
Difficulty: Intermediate
The lookback must reset at each chick so one chick's first visit never borrows the previous chick's last weight.
group_by(Chick), then arrange(Time, .by_group = TRUE), then mutate() a gain column from weight - lag(weight).
Click to reveal solution
Explanation: group_by(Chick) makes lag() respect chick boundaries so the lag at Time 0 of a new chick is NA instead of leaking the previous chick's last weight. arrange(Time, .by_group = TRUE) keeps within group ordering correct after the grouping is applied. This pattern is the backbone of longitudinal change scores and pre or post comparisons.
Exercise 1.5: Week over week Ozone comparison
Task: Compare each day's Ozone reading in airquality with the reading from exactly seven days earlier. Arrange by Month then Day, then add a column ozone_7day_lag computed as lag(Ozone, n = 7). Keep Month, Day, Ozone, and ozone_7day_lag. Save to ex_1_5.
Expected result:
#> # A tibble: 153 x 4
#> Month Day Ozone ozone_7day_lag
#> <int> <int> <int> <int>
#> 1 5 1 41 NA
#> 2 5 2 36 NA
#> ...
#> 8 5 8 19 41
#> 9 5 9 8 36
#> ...
#> # 144 more rows hidden
Difficulty: Intermediate
Comparing a day with one exactly seven days earlier means looking back seven positions instead of just one.
Pass n = 7 to lag(Ozone) inside mutate(), then select() the four columns to keep.
Click to reveal solution
Explanation: Passing n = 7 to lag() shifts the vector down by seven positions, useful for week over week comparisons in daily data. The first seven rows become NA because there is no value seven days before them. For more flexible periods, prefer named arguments (lag(x, n = 7, default = NA)) so the intent is explicit and the boundary behavior is documented.
Section 2. Cumulative operations (4 problems)
Exercise 2.1: Running total with cumsum
Task: Compute the running total of a small daily series. Build a tibble with daily = c(4, 7, 3, 8, 5), then add a total column using cumsum(daily) so each row holds the accumulated sum up to that point. Save the result to ex_2_1.
Expected result:
#> # A tibble: 5 x 2
#> daily total
#> <dbl> <dbl>
#> 1 4 4
#> 2 7 11
#> 3 3 14
#> 4 8 22
#> 5 5 27
Difficulty: Beginner
A running total carries forward the accumulated sum of every value up to and including the current row.
Inside mutate(), build the total column by applying cumsum(daily).
Click to reveal solution
Explanation: cumsum() returns a vector of the same length where each element is the running sum up to and including that index. The output's last value equals sum(daily). Unlike sum(), cumsum() preserves row alignment so it slots into mutate() directly. Sister functions cumprod(), cummax(), cummin(), and cummean() follow the same shape contract.
Exercise 2.2: Running unemployment total from economics
Task: An analyst tracking cumulative joblessness over the post war period wants a running total of monthly unemploy from the economics dataset. Add a cum_unemp column with cumsum(unemploy) and keep only date, unemploy, and cum_unemp. Save the result to ex_2_2.
Expected result:
#> # A tibble: 574 x 3
#> date unemploy cum_unemp
#> <date> <dbl> <dbl>
#> 1 1967-07-01 2944 2944
#> 2 1967-08-01 2945 5889
#> 3 1967-09-01 2958 8847
#> 4 1967-10-01 3143 11990
#> 5 1967-11-01 3066 15056
#> ...
#> # 569 more rows hidden
Difficulty: Intermediate
A running total stacks each month's value on top of every month that came before it.
In mutate(), use cumsum(unemploy) for cum_unemp, then select() the three columns.
Click to reveal solution
Explanation: economics is already sorted by date, so cumsum() accumulates in calendar order without an explicit arrange(). When the underlying order is unknown, sort first. Cumulative totals are a building block for dashboards: divide cum_unemp by cum_pop for a running unemployment fraction, or take the period over period delta of the cumulative series to recover smoothed monthly flow.
Exercise 2.3: Per manufacturer running max city mpg
Task: A fleet analyst comparing models within each manufacturer in mpg wants a running maximum of cty (city miles per gallon) as cars appear in row order. Group by manufacturer, then add a running_max_cty column with cummax(cty). Save the augmented tibble to ex_2_3.
Expected result:
#> # A tibble: 234 x 4
#> # Groups: manufacturer [15]
#> manufacturer model cty running_max_cty
#> <chr> <chr> <int> <int>
#> 1 audi a4 18 18
#> 2 audi a4 21 21
#> 3 audi a4 20 21
#> 4 audi a4 21 21
#> 5 audi a4 16 21
#> ...
#> # 229 more rows hidden
Difficulty: Intermediate
A running maximum reports the largest value seen so far, and it should start over for each manufacturer.
group_by(manufacturer), then mutate() a running_max_cty column with cummax(cty).
Click to reveal solution
Explanation: cummax() returns a non decreasing vector where each element is the maximum seen so far. Combined with group_by(), the running max restarts at each manufacturer. This is a classic "best ever to date" idiom and shows up in finance (drawdown calculations) and sports (record scores by season). Switch to cummin() for the worst case mirror.
Exercise 2.4: Monthly running mean of Temp
Task: Compute a within month running mean of Temp in airquality so the cumulative mean restarts at the first day of each month. Group by Month, arrange by Day, then add a cum_mean_temp column with cummean(Temp). Save the result to ex_2_4.
Expected result:
#> # A tibble: 153 x 4
#> # Groups: Month [5]
#> Month Day Temp cum_mean_temp
#> <int> <int> <int> <dbl>
#> 1 5 1 67 67
#> 2 5 2 72 69.5
#> 3 5 3 74 71
#> 4 5 4 62 68.8
#> 5 5 5 56 66.2
#> ...
#> # 148 more rows hidden
Difficulty: Advanced
A running mean averages every value up to the current row, and it should restart on the first day of each month.
group_by(Month), arrange(Day, .by_group = TRUE), then mutate() cum_mean_temp with cummean(Temp).
Click to reveal solution
Explanation: cummean() divides cumsum(x) by seq_along(x), producing a smoothed view that gives early observations weight equal to later ones within the group. The grouped version restarts at each month boundary because window functions in dplyr respect group structure. Compare to a fixed lookback rolling mean (zoo::rollmean) when you need a constant window length instead of an expanding one.
Section 3. Ranking functions (5 problems)
Exercise 3.1: Rank mtcars rows by mpg
Task: Sort mtcars in descending order by mpg and add a rank column using row_number() so the most fuel efficient car receives rank 1. Convert row names into a model column first so the result reads cleanly. Keep only model, mpg, and rank, then save to ex_3_1.
Expected result:
#> # A tibble: 32 x 3
#> model mpg rank
#> <chr> <dbl> <int>
#> 1 Toyota Corolla 33.9 1
#> 2 Fiat 128 32.4 2
#> 3 Honda Civic 30.4 3
#> 4 Lotus Europa 30.4 4
#> 5 Fiat X1-9 27.3 5
#> ...
#> # 27 more rows hidden
Difficulty: Beginner
Once the rows are sorted from best to worst, each row's position in that order is its rank.
arrange(desc(mpg)), then mutate() a rank column with row_number().
Click to reveal solution
Explanation: row_number() always returns a unique integer for every row, breaking ties by row position. That makes it the right choice when you want a strict 1, 2, 3 ordering with no duplicates, for example to take the top N per group. For tie aware rankings, switch to min_rank() or dense_rank(). The two Honda Civic and Lotus Europa cars share 30.4 mpg but get distinct row numbers here.
Exercise 3.2: Rank diamonds within each cut by price
Task: A jeweller compiling cut specific price guides wants diamonds ranked from cheapest to most expensive within each cut. Group by cut, then add a price_rank column with min_rank(price). Keep cut, price, and price_rank, then arrange by cut and price_rank. Save to ex_3_2.
Expected result:
#> # A tibble: 53,940 x 3
#> # Groups: cut [5]
#> cut price price_rank
#> <ord> <int> <int>
#> 1 Fair 337 1
#> 2 Fair 345 2
#> 3 Fair 357 3
#> 4 Good 327 1
#> 5 Good 335 2
#> ...
#> # 53,935 more rows hidden
Difficulty: Intermediate
The ranking should restart inside each cut so a stone's position reflects its own cut, not the whole dataset.
group_by(cut), then mutate() a price_rank column with min_rank(price).
Click to reveal solution
Explanation: min_rank() assigns tied values the same rank and leaves gaps after the tie, mimicking the ranking convention used in track and field. Combining it with group_by(cut) restarts the ranking at each cut, so a row's price_rank is its position inside its cut, not across the full dataset. Use dense_rank() when you would rather avoid gaps after ties.
Exercise 3.3: Compare min_rank, dense_rank, and row_number on ties
Task: A code reviewer comparing the three rank flavors needs a tibble built from x = c(10, 10, 12, 15, 15, 20) with three columns called min, dense, and row showing min_rank(x), dense_rank(x), and row_number(x) respectively. Save the comparison tibble to ex_3_3.
Expected result:
#> # A tibble: 6 x 4
#> x min dense row
#> <dbl> <int> <int> <int>
#> 1 10 1 1 1
#> 2 10 1 1 2
#> 3 12 3 2 3
#> 4 15 4 3 4
#> 5 15 4 3 5
#> 6 20 6 4 6
Difficulty: Intermediate
The three ranking styles only disagree on how they handle tied values, so place all three side by side.
In a single mutate(), set min = min_rank(x), dense = dense_rank(x), and row = row_number(x).
Click to reveal solution
Explanation: The three functions only disagree at ties. min_rank shares the rank but leaves a gap (1, 1, 3). dense_rank shares the rank without gaps (1, 1, 2). row_number breaks ties by position so every row gets a unique integer. Pick min_rank for sports style rankings, dense_rank for cardinality preserving bins, and row_number when you need exactly one winner.
Exercise 3.4: Percent rank diamonds prices
Task: Use percent_rank(price) on the full diamonds dataset to assign each diamond a 0 to 1 score reflecting its position in the sorted price distribution. Keep price and the new pct_rank column, then arrange ascending by price. Save to ex_3_4.
Expected result:
#> # A tibble: 53,940 x 2
#> price pct_rank
#> <int> <dbl>
#> 1 326 0
#> 2 326 0
#> 3 327 0.0000371
#> 4 334 0.0000557
#> 5 335 0.0000742
#> ...
#> # 53,935 more rows hidden
Difficulty: Intermediate
A 0-to-1 score places each value by where it sits along the sorted distribution, from smallest to largest.
Inside mutate(), build pct_rank by applying percent_rank(price).
Click to reveal solution
Explanation: percent_rank() returns (min_rank(x) - 1) / (n - 1), mapping the minimum value to 0 and the maximum to 1 exactly. The output is closed on both ends, unlike cume_dist() which is closed only on the upper end. Use percent ranks to compare distributions of different sizes side by side, or to derive a normalised score column that downstream features can consume.
Exercise 3.5: Top 3 chicks per diet by final weight
Task: A trial coordinator preparing a feed efficacy report wants the top three chicks per Diet based on their weight at the final visit (Time == 21) in ChickWeight. Filter to that timepoint, group by Diet, rank descending by weight with row_number(), then keep ranks 1 through 3 in each diet. Save to ex_3_5.
Expected result:
#> # A tibble: 12 x 4
#> # Groups: Diet [4]
#> Time weight Chick Diet
#> <dbl> <dbl> <ord> <fct>
#> 1 21 305 7 1
#> 2 21 303 5 1
#> 3 21 275 4 1
#> 4 21 331 22 2
#> 5 21 295 17 2
#> ...
#> # 7 more rows hidden
Difficulty: Advanced
Within each diet, order chicks from heaviest down and then keep only the first three of each group.
group_by(Diet), arrange(desc(weight), .by_group = TRUE), mutate() a rank with row_number(), then filter(rank <= 3).
Click to reveal solution
Explanation: row_number() after arrange(desc(weight)) produces a unique 1 through N ranking per group, so rank <= 3 always yields exactly three rows even when weights tie. Compare with min_rank(desc(weight)) <= 3, which can return more than three rows when two chicks share third place. slice_max(weight, n = 3, with_ties = FALSE) is a shorter equivalent worth knowing.
Section 4. ntile bucketing (3 problems)
Exercise 4.1: Quartile buckets of city mpg
Task: Bucket cty (city miles per gallon) from the mpg dataset into four roughly equal sized quartiles using ntile(cty, 4). Keep only manufacturer, model, cty, and the new cty_quartile column. Save the resulting tibble to ex_4_1.
Expected result:
#> # A tibble: 234 x 4
#> manufacturer model cty cty_quartile
#> <chr> <chr> <int> <int>
#> 1 audi a4 18 2
#> 2 audi a4 21 3
#> 3 audi a4 20 3
#> 4 audi a4 21 3
#> ...
#> # 230 more rows hidden
Difficulty: Beginner
Splitting into four equally sized groups assigns every row to one of four buckets based on row count.
Inside mutate(), build the cty_quartile column with ntile(cty, 4).
Click to reveal solution
Explanation: ntile() splits the input into n buckets of as equal size as possible by row count, not by value range. Two cars with the same cty may end up in different buckets when the bucket boundary cuts through tied values; this is a feature, not a bug, because it guarantees balanced bucket sizes. For value range buckets instead, use cut() from base R or cut_number() from ggplot2.
Exercise 4.2: Decile diamond prices within each cut
Task: A pricing analyst wants per cut decile labels on diamonds so a stone in the top 10 percent of "Premium" is comparable to the top 10 percent of "Ideal". Group by cut, then add a price_decile column using ntile(price, 10) so each cut gets its own decile grid. Save to ex_4_2.
Expected result:
#> # A tibble: 53,940 x 4
#> # Groups: cut [5]
#> carat cut price price_decile
#> <dbl> <ord> <int> <int>
#> 1 0.23 Ideal 326 1
#> 2 0.21 Premium 326 1
#> 3 0.23 Good 327 1
#> 4 0.29 Premium 334 1
#> ...
#> # 53,936 more rows hidden
Difficulty: Intermediate
Each cut needs its own ten-bucket grid, so the bucketing has to restart for every cut.
group_by(cut), then mutate() a price_decile column with ntile(price, 10).
Click to reveal solution
Explanation: Grouping before ntile() restarts the decile count inside each cut, so decile 10 within "Fair" and decile 10 within "Ideal" describe different absolute prices. This relative bucketing is the right tool when cohorts differ in scale but you want comparable rank positions. Without group_by(), the deciles would mix all 53,940 stones into one global grid.
Exercise 4.3: Compare ntile to cut_number for Temp deciles
Task: A statistician comparing two decile algorithms wants both ntile(Temp, 10) and cut_number(Temp, 10) applied to airquality$Temp in the same tibble. The first returns integer bucket IDs, the second returns factor intervals. Keep Temp, ntile_bucket, and cut_number_bucket, then save to ex_4_3.
Expected result:
#> # A tibble: 153 x 3
#> Temp ntile_bucket cut_number_bucket
#> <int> <int> <fct>
#> 1 67 3 (65,68]
#> 2 72 5 (71,74]
#> 3 74 6 (71,74]
#> 4 62 1 [56,67]
#> ...
#> # 149 more rows hidden
Difficulty: Advanced
Two ways of slicing into ten groups: one returns plain integer bucket IDs, the other returns labelled intervals.
In one mutate(), set ntile_bucket = ntile(Temp, 10) and cut_number_bucket = cut_number(Temp, 10).
Click to reveal solution
Explanation: ntile() returns integer IDs and aims for equal sized buckets even when that means splitting ties across buckets. cut_number() from ggplot2 returns factor intervals with explicit boundary labels, which makes the bucket meaning self documenting on plot axes. The two often disagree at tied values: the factor labels show identical intervals for the same value while ntile may place rows in adjacent integer buckets.
Section 5. Positional selectors (4 problems)
Exercise 5.1: First and last chick weight per Diet
Task: A study coordinator summarising chick growth wants, for each Diet in ChickWeight, the first and the last weight recorded across all chicks when sorted by Time. Group by Diet, arrange by Time, then summarise using first(weight) and last(weight) into columns called first_w and last_w. Save the result to ex_5_1.
Expected result:
#> # A tibble: 4 x 3
#> Diet first_w last_w
#> <fct> <dbl> <dbl>
#> 1 1 42 205
#> 2 2 40 214
#> 3 3 39 270
#> 4 4 42 237
Difficulty: Intermediate
For each diet you want the earliest and the latest weight once the rows sit in time order.
group_by(Diet), arrange(Time, .by_group = TRUE), then summarise() with first(weight) and last(weight).
Click to reveal solution
Explanation: first() and last() are alignment friendly versions of head(x, 1) and tail(x, 1): they return a scalar even when the input is empty (giving the optional default instead of an error). Because they respect row order, arrange() before them is mandatory. Both also accept an order_by argument so you can specify the ordering inline without a separate arrange() step.
Exercise 5.2: Third highest mpg with nth
Task: Use nth() on the mtcars$mpg column to extract the third highest mpg value in the dataset. Sort the vector with sort(..., decreasing = TRUE) first so position three corresponds to the third best, not the third row. The result should be a single number. Save it to ex_5_2.
Expected result:
#> [1] 30.4
Difficulty: Intermediate
The third highest value is whatever lands in position three once the values are ordered from largest down.
sort(decreasing = TRUE) the vector, then pull position three with nth(3).
Click to reveal solution
Explanation: nth(x, n) returns the n th element of x and falls back to a default argument when n is out of bounds, instead of throwing the cryptic subscript error you get from x[n]. Sorting beforehand is the cleanest way to express positional intent; alternatively, pass order_by = -x to fold sorting into nth() directly: nth(mpg, 3, order_by = -mpg).
Exercise 5.3: Each chick's heaviest visit time
Task: A researcher analysing growth trajectories wants, for each chick in ChickWeight, the Time at which that chick was heaviest. Group by Chick, then summarise with Time[which.max(weight)] into a column called peak_time. Save the result to ex_5_3.
Expected result:
#> # A tibble: 50 x 2
#> Chick peak_time
#> <ord> <dbl>
#> 1 18 2
#> 2 16 8
#> 3 15 12
#> 4 13 12
#> 5 9 21
#> ...
#> # 45 more rows hidden
Difficulty: Intermediate
For each chick you need the time value sitting on the row where weight reaches its largest.
group_by(Chick), then summarise() peak_time as Time[which.max(weight)].
Click to reveal solution
Explanation: which.max() returns the position of the first maximum, which combined with bracket indexing of Time recovers the value of any sibling column at the row of the peak. The dplyr equivalent using window functions is slice_max(weight, n = 1, with_ties = FALSE), which keeps the full row instead of a single column and is the right call when you need more than one column from the peak observation.
Exercise 5.4: First and last model alphabetically per manufacturer
Task: A catalog editor preparing a press kit needs, for each manufacturer in mpg, the first and last model name when models are sorted alphabetically. Group by manufacturer, arrange by model, then summarise with first(model) and last(model) into first_model and last_model. Save to ex_5_4.
Expected result:
#> # A tibble: 15 x 3
#> manufacturer first_model last_model
#> <chr> <chr> <chr>
#> 1 audi a4 a6 quattro
#> 2 chevrolet c1500 suburban malibu
#> 3 dodge caravan 2wd ram 1500 pickup
#> 4 ford expedition 2wd mustang
#> ...
#> # 11 more rows hidden
Difficulty: Advanced
For each manufacturer you want the alphabetically earliest and the alphabetically latest model name.
group_by(manufacturer), arrange(model, .by_group = TRUE), then summarise() with first(model) and last(model).
Click to reveal solution
Explanation: The same row may appear repeatedly in mpg because each model has several trim variants; first() and last() after arrange(model) therefore pick the alphabetic extremes of the unique models reduced to text. If you need the alphabetically first distinct model only (no duplicates), insert distinct(model) before the summarise(). The .by_group = TRUE argument ensures arrange respects group boundaries.
Section 6. Cumulative filters (2 problems)
Exercise 6.1: Leading run of windy days
Task: A flight ops planner wants to know how long the initial run of "windy" days lasts at the start of airquality, where windy is defined as Wind > 7. Add a column windy_run using cumall(Wind > 7) so the value stays TRUE while every preceding day was windy and flips to FALSE as soon as the first calm day appears. Keep Day, Wind, and windy_run. Save to ex_6_1.
Expected result:
#> # A tibble: 153 x 3
#> Day Wind windy_run
#> <int> <dbl> <lgl>
#> 1 1 7.4 TRUE
#> 2 2 8 TRUE
#> 3 3 12.6 TRUE
#> 4 4 11.5 TRUE
#> 5 5 14.3 TRUE
#> 6 6 14.9 TRUE
#> 7 7 8.6 TRUE
#> 8 8 13.8 TRUE
#> ...
#> # 145 more rows hidden
Difficulty: Intermediate
The flag should stay on only while every day so far has been windy, then switch off permanently at the first calm day.
Inside mutate(), build the windy_run column by applying cumall(Wind > 7).
Click to reveal solution
Explanation: cumall(p) is TRUE at position i if every earlier value of the predicate p was also TRUE; once p becomes FALSE for any prior row, the column locks at FALSE forever. The mirror function cumall(! p) flags the leading run where the predicate was never satisfied. Both are useful for early stopping logic such as "valid until the first error" or "still within SLA".
Exercise 6.2: First fraud alert onwards with cumany
Task: A fraud team wants every transaction from the first triggered alert onwards flagged as "post incident". Given a small tibble with alert = c(FALSE, FALSE, TRUE, FALSE, FALSE), add a column post_incident using cumany(alert) so the value flips to TRUE at the first TRUE and stays that way. Save the result to ex_6_2.
Expected result:
#> # A tibble: 5 x 2
#> alert post_incident
#> <lgl> <lgl>
#> 1 FALSE FALSE
#> 2 FALSE FALSE
#> 3 TRUE TRUE
#> 4 FALSE TRUE
#> 5 FALSE TRUE
Difficulty: Advanced
The flag should switch on at the first alert and then stay on for every row that follows.
Inside mutate(), build the post_incident column by applying cumany(alert).
Click to reveal solution
Explanation: cumany(p) is the cumulative OR of a logical vector: once any earlier value is TRUE, the rest of the column is TRUE. This is the "sticky flag" idiom and is exactly what you need to mark rows as having entered some absorbing state (after first error, after first conversion, after first churn event). Pair it with group_by(account_id) to make the flag account scoped.
Section 7. Within group windows (2 problems)
Exercise 7.1: Weight as a multiple of starting weight per chick
Task: A growth biologist normalises chick weights to each animal's starting weight. For each Chick in ChickWeight, divide every weight by that chick's first(weight) after arranging by Time. Add a numeric multiplier column called pct_of_start so the first row of each group is exactly 1. Save the augmented tibble to ex_7_1.
Expected result:
#> # A tibble: 578 x 5
#> # Groups: Chick [50]
#> weight Time Chick Diet pct_of_start
#> <dbl> <dbl> <ord> <fct> <dbl>
#> 1 42 0 1 1 1
#> 2 51 2 1 1 1.21
#> 3 59 4 1 1 1.40
#> 4 64 6 1 1 1.52
#> ...
#> # 574 more rows hidden
Difficulty: Advanced
Normalising to the starting weight means dividing every weight by that chick's earliest recorded weight.
group_by(Chick), arrange(Time, .by_group = TRUE), then mutate() pct_of_start as weight / first(weight).
Click to reveal solution
Explanation: first(weight) inside mutate() after grouping returns the chick's earliest weight, recycled across that chick's rows. Dividing produces a normalised trajectory that lets you compare growth across animals with different birth weights on the same scale. This per subject normalisation is also called baseline scaling and is standard practice in longitudinal study reporting before fitting mixed effects models.
Exercise 7.2: Mark new price highs per cut with cummax and lag
Task: A pricing analyst building a "new record" indicator wants, within each cut of diamonds, a flag marking every row whose price strictly exceeds the running max of all prior rows in that group. Group by cut, compute prior_max as lag(cummax(price)), then derive a logical new_high column. Save the augmented tibble to ex_7_2.
Expected result:
#> # A tibble: 53,940 x 4
#> # Groups: cut [5]
#> cut price prior_max new_high
#> <ord> <int> <int> <lgl>
#> 1 Ideal 326 NA NA
#> 2 Premium 326 NA NA
#> 3 Good 327 NA NA
#> 4 Premium 334 326 TRUE
#> ...
#> # 53,936 more rows hidden
Difficulty: Advanced
Compare each price against the highest price seen strictly before it within the same cut.
In mutate(), set prior_max = lag(cummax(price)) and then new_high = price > prior_max.
Click to reveal solution
Explanation: cummax(price) gives the running maximum including the current row; lag() then shifts it down so each row is compared against the maximum seen strictly before it. Combining the two window functions creates a "high water mark" indicator without writing a loop. The first row of every group is NA for prior_max, which propagates to new_high; replace with FALSE using coalesce(price > prior_max, FALSE) if you need a strict logical column.
What to do next
- dplyr Joins Exercises in R: twenty five join problems covering inner, left, semi, and anti.
- dplyr Group By Exercises in R: grouped summaries, the
.byshortcut, and aggregation patterns. - Data Wrangling With dplyr: full reference tutorial covering the core verbs end to end.
r-statistics.co · Verifiable credential · Public URL
This document certifies mastery of
dplyr Window Functions Mastery
Every certificate has a public verification URL that proves the holder passed the assessment. Anyone with the link can confirm the recipient and date.
145 learners have earned this certificate