dplyr Joins Exercises in R: 25 Practice Problems
Twenty-five practice problems on dplyr joins covering mutating joins (left, right, inner, full, nest), filtering joins (semi, anti), multi-key joins, the modern join_by() helper, inequality joins, rolling joins, NA handling, self-joins, and full end-to-end reconciliation workflows. Solutions are hidden behind a click so you can attempt each problem first.
Note: orders deliberately contains two orphan rows (customer_id 7 and 9 do not exist in customers). Several exercises rely on those gaps to demonstrate how joins handle unmatched keys.
Section 1. Mutating joins (5 problems)
Exercise 1.1: Augment every customer with their order history using left_join
Task: A customer success manager wants a single roster showing every customer plus their order history, including customers who never placed an order. Use left_join() to attach orders to customers on the matching customer_id, keeping every customer row even when no orders exist. Save the result to ex_1_1.
Expected result:
#> # A tibble: 11 x 7
#> customer_id name segment signup order_id order_date amount
#> <int> <chr> <chr> <date> <int> <date> <dbl>
#> 1 1 Alice Pro 2024-01-15 101 2024-04-02 50
#> 2 1 Alice Pro 2024-01-15 103 2024-05-10 30
#> 3 1 Alice Pro 2024-01-15 107 2024-08-12 45
#> 4 2 Bob Free 2024-02-08 102 2024-04-15 80
#> 5 3 Carol Pro 2024-03-22 104 2024-05-22 65
#> 6 4 Dan Free 2024-05-01 108 2024-09-09 75
#> 7 5 Eve Pro 2024-06-18 106 2024-07-04 200
#> 8 5 Eve Pro 2024-06-18 109 2024-10-21 90
#> 9 6 Frank Free 2024-09-04 NA NA NA
#> # 2 more rows where customer 1 also matches (order 101, 103, 107)
Difficulty: Beginner
Decide which table must keep every one of its rows even when the other side has no match, and put that table first.
Use a mutating join that preserves all left-side rows, joining on customer_id via the by argument.
Click to reveal solution
Explanation: left_join() keeps every row of the LEFT table and pulls matching columns from the RIGHT. When the right side has no match (Frank, customer 6, never ordered), the new columns become NA. Use this whenever the left table is the entity you must preserve completely. inner_join() would silently drop Frank, which is the most common bug in customer reports.
Exercise 1.2: Combine orders and products with inner_join
Task: Use inner_join() to combine orders and products so every row carries both the order details and the product details. Add a product_id column to orders first using c(1, 2, 1, 3, 4, 1, 2, 3, 4, 1), then inner-join on product_id. Save the result to ex_1_2.
Expected result:
#> # A tibble: 10 x 7
#> order_id customer_id order_date amount product_id product category
#> <int> <int> <date> <dbl> <dbl> <chr> <chr>
#> 1 101 1 2024-04-02 50 1 Widget Tools
#> 2 102 2 2024-04-15 80 2 Gadget Tools
#> 3 103 1 2024-05-10 30 1 Widget Tools
#> 4 104 3 2024-05-22 65 3 Sprocket Parts
#> 5 105 7 2024-06-01 120 4 Gizmo Tools
#> 6 106 5 2024-07-04 200 1 Widget Tools
#> # 4 more rows
Difficulty: Beginner
You first need a shared key column on the orders side, then a join that keeps only rows present in both tables.
Add product_id with mutate(), then inner_join() the products table on product_id.
Click to reveal solution
Explanation: inner_join() keeps only rows that exist in BOTH tables. Because every assigned product_id (1-4) exists in products, no rows are dropped here. If you had typed 99 for any order, that row would vanish silently. Use inner_join() when the join key represents a foreign-key constraint you trust; switch to left_join() when you must preserve every fact-table row regardless of dimension coverage.
Exercise 1.3: Reconcile the universe of customer ids with full_join
Task: An auditor reconciling two ledgers wants to see every customer_id that appears in EITHER customers or orders, including the orphan ids 7 and 9 in orders that never registered as customers. Use full_join() to combine both tables and save the unioned result to ex_1_3.
Expected result:
#> # A tibble: 12 x 7
#> customer_id name segment signup order_id order_date amount
#> <int> <chr> <chr> <date> <int> <date> <dbl>
#> 1 1 Alice Pro 2024-01-15 101 2024-04-02 50
#> ...
#> 10 6 Frank Free 2024-09-04 NA NA NA
#> 11 7 NA NA NA 105 2024-06-01 120
#> 12 9 NA NA NA 110 2024-11-30 150
Difficulty: Intermediate
You want every key that appears in either table, so neither side should be allowed to drop rows.
Use full_join() on customer_id to get the union of both keysets.
Click to reveal solution
Explanation: full_join() returns the union of both keysets; anything missing on a side becomes NA. This is the right tool for reconciliation: rows where the LEFT columns are NA flag orphan orders (data integrity bugs); rows where the RIGHT columns are NA flag dormant customers. An inner_join would mask both signals and a left_join would only catch one of the two.
Exercise 1.4: Preserve the orders table with right_join
Task: A reporting analyst wants every order enriched with customer metadata, but rows for orphan customer_id values 7 and 9 must still appear with NA name. Use right_join() from customers to orders so that the right-hand orders table is the one preserved in full. Save to ex_1_4.
Expected result:
#> # A tibble: 10 x 7
#> customer_id name segment signup order_id order_date amount
#> <int> <chr> <chr> <date> <int> <date> <dbl>
#> 1 1 Alice Pro 2024-01-15 101 2024-04-02 50
#> ...
#> 9 7 NA NA NA 105 2024-06-01 120
#> 10 9 NA NA NA 110 2024-11-30 150
Difficulty: Intermediate
This time the orders table is the one that must survive in full, even though it sits on the right.
Call right_join() from customers to orders on customer_id so every order row is kept.
Click to reveal solution
Explanation: right_join(x, y) preserves every row of y. It is the mirror of left_join(). In real pipelines most people prefer the left form because it reads naturally with the pipe operator: orders |> left_join(customers, ...) is the same result with the order of arguments flipped. Pick whichever orientation puts your fact table on the left.
Exercise 1.5: Nest each customer's orders into a list-column with nest_join
Task: A junior analyst wants every customer row to carry a nested tibble of that customer's orders, instead of duplicating the customer columns across multiple rows. Use nest_join() to attach a list-column called orders (one tibble per customer) to customers. Save the nested table to ex_1_5.
Expected result:
#> # A tibble: 6 x 5
#> customer_id name segment signup orders
#> <int> <chr> <chr> <date> <list>
#> 1 1 Alice Pro 2024-01-15 <tibble [3 x 3]>
#> 2 2 Bob Free 2024-02-08 <tibble [1 x 3]>
#> 3 3 Carol Pro 2024-03-22 <tibble [1 x 3]>
#> 4 4 Dan Free 2024-05-01 <tibble [1 x 3]>
#> 5 5 Eve Pro 2024-06-18 <tibble [2 x 3]>
#> 6 6 Frank Free 2024-09-04 <tibble [0 x 3]>
Difficulty: Intermediate
You want one row per customer carrying a bundled sub-table of their orders, not duplicated customer rows.
Use nest_join() on customer_id to produce the orders list-column.
Click to reveal solution
Explanation: nest_join() keeps every left-side row exactly once and adds a list-column where each entry is a tibble of matching rows from the right table. This shape is ideal for one-row-per-entity reports or for downstream purrr::map() work where you fit a model per customer. Frank gets an empty 0-row tibble rather than NA, so subsequent map calls return clean empty results instead of erroring.
Section 2. Filtering joins (3 problems)
Exercise 2.1: Find customers who placed at least one order using semi_join
Task: The marketing team is preparing a thank-you email blast and wants the list of customers who have placed AT LEAST one order. Use semi_join() to filter customers down to those whose customer_id appears in orders, without bringing in any of the order columns. Save the resulting roster to ex_2_1.
Expected result:
#> # A tibble: 5 x 4
#> customer_id name segment signup
#> <int> <chr> <chr> <date>
#> 1 1 Alice Pro 2024-01-15
#> 2 2 Bob Free 2024-02-08
#> 3 3 Carol Pro 2024-03-22
#> 4 4 Dan Free 2024-05-01
#> 5 5 Eve Pro 2024-06-18
Difficulty: Intermediate
You want to filter customers down to those with a match, without pulling in any order columns.
Use semi_join() of customers against orders on customer_id.
Click to reveal solution
Explanation: semi_join() is a filtering join: it keeps only rows of the left table that match in the right and never adds new columns. It is the dplyr equivalent of customers |> filter(customer_id %in% orders$customer_id) but is faster on large tables and reads better with multi-column keys. Frank, who never ordered, is correctly excluded.
Exercise 2.2: Detect orphan orders with anti_join
Task: The data team is hunting orphan records where an order references a customer_id that does not exist in customers. Use anti_join() to return only the orders whose customer_id is NOT in the customers table. Save the offending rows to ex_2_2 so the team can file a data-quality ticket.
Expected result:
#> # A tibble: 2 x 4
#> order_id customer_id order_date amount
#> <int> <int> <date> <dbl>
#> 1 105 7 2024-06-01 120
#> 2 110 9 2024-11-30 150
Difficulty: Intermediate
You want to keep only the order rows that have no matching customer at all.
Use anti_join() of orders against customers on customer_id.
Click to reveal solution
Explanation: anti_join() is the inverse of semi_join(): it keeps only rows with NO match. This is the canonical pattern for data-quality reports. These two orders point to customer_id values that were never created, which signals either a deletion bug, a data-load race, or stale FK references. The fix is to investigate; using inner_join() to silently drop them would only hide the problem.
Exercise 2.3: Flag dormant customers with anti_join after a date filter
Task: A churn analyst wants to flag customers who have NOT placed any order in the last 90 days as of 2024-12-01. First filter orders to those after 2024-09-02, then use anti_join() against the recent set so that only dormant customers remain. Save the dormant roster to ex_2_3.
Expected result:
#> # A tibble: 4 x 4
#> customer_id name segment signup
#> <int> <chr> <chr> <date>
#> 1 1 Alice Pro 2024-01-15
#> 2 2 Bob Free 2024-02-08
#> 3 3 Carol Pro 2024-03-22
#> 4 6 Frank Free 2024-09-04
Difficulty: Advanced
First define what "recent" means as a separate set, then keep customers who are absent from it.
Use filter(order_date >= as.Date("2024-09-02")) to build the recent set, then anti_join() on customer_id.
Click to reveal solution
Explanation: Combine filter() with anti_join() to express "in A, not in B-after-filter". This is the canonical churn-flag pattern: define what "active" means in B, then anti-join to get the inactive set. Note that semi_join() would give the OPPOSITE result, the active customers. Always state the question to yourself in plain language before choosing semi vs anti, since the two are easy to swap by accident.
Section 3. Multi-key joins and renamed columns (4 problems)
Exercise 3.1: Join two tables whose key columns have different names
Task: Joining tables whose key columns have different names is the most common dplyr-joins gotcha. Rename customers$customer_id to id, then use inner_join() with by = c("customer_id" = "id") to merge orders with the renamed customers table. The result should keep the left side's column name. Save to ex_3_1.
Expected result:
#> # A tibble: 8 x 7
#> order_id customer_id order_date amount name segment signup
#> <int> <int> <date> <dbl> <chr> <chr> <date>
#> 1 101 1 2024-04-02 50 Alice Pro 2024-01-15
#> 2 102 2 2024-04-15 80 Bob Free 2024-02-08
#> 3 103 1 2024-05-10 30 Alice Pro 2024-01-15
#> 4 104 3 2024-05-22 65 Carol Pro 2024-03-22
#> # 4 more rows; orphan orders 105 and 110 are dropped
Difficulty: Beginner
When the key columns are named differently on each side, the join needs to be told which name maps to which.
Use rename(id = customer_id) on customers, then inner_join() with by = c("customer_id" = "id").
Click to reveal solution
Explanation: The named vector c("customer_id" = "id") reads as "left side's customer_id matches right side's id". The result keeps the LEFT side's column name. The modern join_by(customer_id == id) does the same thing more readably. Either form avoids the noisy alternative of renaming one side first just so the names line up.
Exercise 3.2: Join on a composite key of two columns
Task: Build two inline tibbles sales and targets keyed jointly on (region, quarter), then left_join() them on BOTH columns at once by passing by = c("region", "quarter"). Compute a gap column as actual - target so the result reads as a variance report. Save to ex_3_2.
Expected result:
#> # A tibble: 6 x 5
#> region quarter actual target gap
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 West Q1 120 100 20
#> 2 West Q2 140 150 -10
#> 3 East Q1 200 180 20
#> 4 East Q2 210 220 -10
#> 5 South Q1 90 100 -10
#> 6 South Q2 110 100 10
Difficulty: Intermediate
A row is only the same record when both key columns agree, so the match must use the pair together.
Pass by = c("region", "quarter") to left_join(), then add mutate(gap = actual - target).
Click to reveal solution
Explanation: Passing a character vector to by joins on every column listed, in the order given. If you forget one and join only on region, you get a many-to-many explosion (each region matches all four quarters). dplyr 1.1+ now warns about unintended many-to-many joins; previously this was a silent foot-gun that produced quietly inflated revenue numbers in dashboards.
Exercise 3.3: Use the modern join_by helper for the same composite key
Task: A reporting analyst prefers the modern join_by() syntax over the older string-vector by. Re-do the sales-versus-targets join from Exercise 3.2, but this time pass by = join_by(region, quarter) to inner_join(). Add the same gap column and save to ex_3_3.
Expected result:
#> # A tibble: 6 x 5
#> region quarter actual target gap
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 West Q1 120 100 20
#> 2 West Q2 140 150 -10
#> 3 East Q1 200 180 20
#> 4 East Q2 210 220 -10
#> 5 South Q1 90 100 -10
#> 6 South Q2 110 100 10
Difficulty: Intermediate
This is the same composite-key match as before, just expressed with the modern unquoted key syntax.
Pass by = join_by(region, quarter) to inner_join(), then add the same mutate(gap = actual - target).
Click to reveal solution
Explanation: join_by() (added in dplyr 1.1.0) is the recommended modern form for join keys. It accepts bare column names without quotes, supports inequalities like >=, and exposes helpers such as between() and closest(). For plain equality joins it is a stylistic choice; for non-equi or rolling joins it is the only option. Adopting it everywhere keeps the codebase consistent.
Exercise 3.4: Combine an equality with closest() for an SCD-Type-2 lookup
Task: A trading desk wants every order joined to the SAME customer's segment AS OF the order date, not the current segment. Build inline segment_history (each customer can have multiple historical segment changes with effective_date), then use inner_join() with join_by(customer_id, closest(order_date >= effective_date)) so each order picks its segment as of the order date. Save to ex_3_4.
Expected result:
#> # A tibble: 8 x 6
#> order_id customer_id order_date amount effective_date segment_at
#> <int> <int> <date> <dbl> <date> <chr>
#> 1 101 1 2024-04-02 50 2024-01-15 Free
#> 2 102 2 2024-04-15 80 2024-02-08 Free
#> 3 103 1 2024-05-10 30 2024-01-15 Free
#> 4 104 3 2024-05-22 65 2024-03-22 Pro
#> 5 106 5 2024-07-04 200 2024-06-18 Free
#> 6 107 1 2024-08-12 45 2024-06-01 Pro
#> 7 108 4 2024-09-09 75 2024-05-01 Free
#> 8 109 5 2024-10-21 90 2024-08-01 Pro
Difficulty: Advanced
Each order needs the same customer plus the single history version that was effective at its date.
Combine an equality with a closest-match inequality: join_by(customer_id, closest(order_date >= effective_date)) inside inner_join().
Click to reveal solution
Explanation: Combining an equality (customer_id) with closest() is the canonical SCD-Type-2 lookup: for each fact row, find the dimension version that was effective at the fact's timestamp. Without join_by() you would have to inner_join on the equality, then group_by() and slice_max(), which is several times slower and harder to verify. Orphan customer ids 7 and 9 drop out because there is no history row for them.
Section 4. Many-to-many and join helpers (4 problems)
Exercise 4.1: Acknowledge a many-to-many fan-out with relationship
Task: A platform engineer wants a deliberate cartesian product of orders and promo codes per customer. Build an inline promos tibble where customer 1 has two promo codes and customer 5 has two as well, then inner_join() orders to promos with relationship = "many-to-many" to silence the warning. Save the fanned-out table to ex_4_1.
Expected result:
#> # A tibble: 11 x 6
#> order_id customer_id order_date amount promo_code pct_off
#> <int> <int> <date> <dbl> <chr> <dbl>
#> 1 101 1 2024-04-02 50 WELCOME 10
#> 2 101 1 2024-04-02 50 SUMMER 15
#> 3 103 1 2024-05-10 30 WELCOME 10
#> 4 103 1 2024-05-10 30 SUMMER 15
#> # 7 more rows for customers 2 and 5
Difficulty: Advanced
When both sides have duplicate keys the row count fans out, and dplyr expects you to declare that this is intentional.
Pass relationship = "many-to-many" to inner_join() on customer_id.
Click to reveal solution
Explanation: When BOTH sides have duplicate keys, dplyr 1.1+ warns because the row count balloons unexpectedly. Setting relationship = "many-to-many" is your written acknowledgement that this fan-out is intentional. The cleaner alternative is to first aggregate one side to one-row-per-key, but for a true cartesian product within each key (every order paired with every promo), this is correct.
Exercise 4.2: Disambiguate duplicate column names with the suffix argument
Task: When two tables share a non-key column name, dplyr appends .x and .y by default. Build two tibbles feb_prices and mar_prices each with columns product_id and price, then inner_join() them passing suffix = c("_feb","_mar") so the result has clearly named price_feb and price_mar columns. Save to ex_4_2.
Expected result:
#> # A tibble: 4 x 3
#> product_id price_feb price_mar
#> <int> <dbl> <dbl>
#> 1 1 10 11
#> 2 2 20 19
#> 3 3 30 35
#> 4 4 40 38
Difficulty: Intermediate
Both tables share a non-key column name, so you need to control how the duplicates are labelled in the result.
Pass suffix = c("_feb", "_mar") to inner_join() on product_id.
Click to reveal solution
Explanation: The default .x and .y suffixes become useless once a column is shadowed twice in a chain (you end up with price.x.x). Always pass an explicit suffix when joining time-series snapshots, A/B variants, or before-and-after panels. A cleaner alternative is to rename one side BEFORE the join (rename(price_feb = price)) so the columns survive without machine suffixes.
Exercise 4.3: Force a hard failure on orphans with unmatched = "error"
Task: An ETL engineer wants every order to have a matching customer or the pipeline must FAIL LOUDLY rather than silently dropping rows. Use inner_join() with unmatched = "error" to force a hard error, wrap it in tryCatch() to capture the message, and save the captured error string (or the success string) to ex_4_3 so the pipeline can log it.
Expected result:
#> [1] "Each row of `x` must have a match in `y`.
#> i Row 5 of `x` does not have a match."
Difficulty: Advanced
An unmatched row should make the pipeline fail loudly rather than silently disappear, and the failure must be captured.
Pass unmatched = "error" to inner_join() and wrap the call in tryCatch() with an error handler returning conditionMessage(e).
Click to reveal solution
Explanation: unmatched = "error" turns silent row-loss into a loud failure, which is the right behavior for ETL where data integrity matters more than a partial result. The default "drop" quietly removes orphans, which is dangerous in production. Pair it with tryCatch() so a failed batch gets logged rather than crashing the whole job. The exact error wording can vary by dplyr version.
Exercise 4.4: Stop NA-to-NA matches with na_matches = "never"
Task: When a join key has NA values, dplyr's default is to match NA-against-NA, which is rarely what you want. Build two tibbles obs (with values 1, 2, NA, 3) and lookup (with values 1, NA, 3) on a key column called id, then left_join() them with na_matches = "never" so the NA rows do NOT join. Save to ex_4_4.
Expected result:
#> # A tibble: 4 x 3
#> id val_x val_y
#> <dbl> <chr> <chr>
#> 1 1 a X
#> 2 2 b NA
#> 3 NA c NA
#> 4 3 d Z
Difficulty: Intermediate
Two missing key values should not be treated as equal, so the join must be told to exclude them.
Pass na_matches = "never" to left_join() on id.
Click to reveal solution
Explanation: By default dplyr matches NA against NA in join keys, which descends from R's three-valued comparison semantics. na_matches = "never" changes this so NAs are excluded, matching base R's merge() behavior. Using the default ('na') would have joined obs's NA-id row to lookup's NA-id row, almost always a bug rather than intent in production data.
Section 5. Inequality and rolling joins (4 problems)
Exercise 5.1: Bucket each order into a price tier with an inequality join
Task: An accounting analyst wants to assign every order to the correct price tier from price_tiers (where each tier is a half-open range like [0, 50)). Use inner_join() with by = join_by(amount >= min_amount, amount < max_amount) to attach the matching tier label to each order. Save to ex_5_1.
Expected result:
#> # A tibble: 10 x 6
#> order_id customer_id order_date amount min_amount tier
#> <int> <int> <date> <dbl> <dbl> <chr>
#> 1 101 1 2024-04-02 50 50 small
#> 2 102 2 2024-04-15 80 50 small
#> 3 103 1 2024-05-10 30 0 micro
#> 4 104 3 2024-05-22 65 50 small
#> 5 105 7 2024-06-01 120 100 mid
#> 6 106 5 2024-07-04 200 200 large
#> # 4 more rows
Difficulty: Advanced
Each order should match the one tier whose half-open range contains its amount, which is a comparison rather than an equality.
Use join_by(amount >= min_amount, amount < max_amount) inside inner_join().
Click to reveal solution
Explanation: Inequality joins (also called non-equi joins) match rows where columns satisfy a comparison rather than an exact equality. Each order picks up the single tier whose half-open range contains its amount. Without join_by() you would have to cross-join then filter, which is O(n*m) and wastes memory. dplyr 1.1+ rewrites this into an efficient interval scan internally.
Exercise 5.2: Attach the most recent FX rate with a closest() rolling join
Task: A finance team wants to attach the most recent FX rate available on or before each transaction date. Build inline transactions (5 rows with txn_date and amount_usd) and fx_rates (4 rows with rate_date and eur_per_usd), then use inner_join() with join_by(closest(txn_date >= rate_date)) to pick exactly one rate per transaction. Save to ex_5_2.
Expected result:
#> # A tibble: 5 x 5
#> txn_id txn_date amount_usd rate_date eur_per_usd
#> <int> <date> <dbl> <date> <dbl>
#> 1 1 2024-03-04 1000 2024-03-01 0.92
#> 2 2 2024-03-15 2500 2024-03-10 0.91
#> 3 3 2024-04-01 800 2024-04-01 0.93
#> 4 4 2024-04-09 1500 2024-04-01 0.93
#> 5 5 2024-05-02 600 2024-05-01 0.94
Difficulty: Advanced
Each transaction needs exactly one rate: the most recent one dated on or before it.
Use join_by(closest(txn_date >= rate_date)) inside inner_join().
Click to reveal solution
Explanation: closest() wraps an inequality and asks for the SINGLE best match per left-side row. closest(txn_date >= rate_date) picks the latest rate that is still on-or-before the transaction date. This is the dplyr equivalent of a SQL correlated subquery with LIMIT 1 or a data.table rolling join. Without closest() you would get every prior rate and then have to slice_max() per group, which is slower and more verbose.
Exercise 5.3: Tag each customer signup with the active campaign using between()
Task: A subscription analyst wants to join each customer signup to the campaign that was running on that signup date. Build inline campaigns (3 rows with campaign, starts, ends) and use inner_join() with by = join_by(between(signup, starts, ends)) to attach the active campaign per customer. Save to ex_5_3.
Expected result:
#> # A tibble: 6 x 7
#> customer_id name segment signup campaign starts ends
#> <int> <chr> <chr> <date> <chr> <date> <date>
#> 1 1 Alice Pro 2024-01-15 WINTER 2024-01-01 2024-03-31
#> 2 2 Bob Free 2024-02-08 WINTER 2024-01-01 2024-03-31
#> 3 3 Carol Pro 2024-03-22 WINTER 2024-01-01 2024-03-31
#> 4 4 Dan Free 2024-05-01 SPRING 2024-04-01 2024-06-14
#> 5 5 Eve Pro 2024-06-18 SUMMER 2024-06-15 2024-09-30
#> 6 6 Frank Free 2024-09-04 SUMMER 2024-06-15 2024-09-30
Difficulty: Intermediate
Each signup should match the campaign whose start-to-end window it falls inside.
Use join_by(between(signup, starts, ends)) inside inner_join().
Click to reveal solution
Explanation: between(x, lo, hi) is shorthand for x >= lo, x <= hi with both bounds inclusive by default. Pass bounds = "[)" to mirror SQL's left-inclusive, right-exclusive convention if your campaign windows abut. A common mistake is leaving overlapping ranges in the right table, which silently fans rows out; validate the right table for non-overlap before relying on the join.
Exercise 5.4: Pair every order with its prior orders via a self inequality join
Task: An analyst wants every order paired with all PRIOR orders by the same customer (so each row reads "this order followed that earlier one"). Use a self inner_join on orders with join_by(customer_id, order_date > order_date) and pick informative suffixes c("_curr","_prev"). Save the chronological pairs to ex_5_4.
Expected result:
#> # A tibble: 4 x 7
#> order_id_curr customer_id order_date_curr amount_curr order_id_prev order_date_prev amount_prev
#> <int> <int> <date> <dbl> <int> <date> <dbl>
#> 1 103 1 2024-05-10 30 101 2024-04-02 50
#> 2 107 1 2024-08-12 45 101 2024-04-02 50
#> 3 107 1 2024-08-12 45 103 2024-05-10 30
#> 4 109 5 2024-10-21 90 106 2024-07-04 200
Difficulty: Advanced
Join the table to itself so each order can be paired with every earlier order from the same customer.
Use join_by(customer_id, order_date > order_date) inside a self inner_join(), with suffix = c("_curr", "_prev").
Click to reveal solution
Explanation: Self-joins on inequalities are the dplyr way to express "pair each row with all prior rows by the same group". The > operator forms a triangular join: for n events per customer you get n*(n-1)/2 pairs, which can blow up. For huge tables prefer window functions (lag() for just-the-previous, or lead()) over a full triangular self-join, and reserve this pattern for "all prior" reports.
Section 6. End-to-end join workflows (5 problems)
Exercise 6.1: Top three customers by total revenue for a quarterly review
Task: The GM is preparing for a quarterly business review and wants the top three customers by total order amount, with their full name and segment. Combine orders and customers with the appropriate join, aggregate per customer, then keep the top three rows by revenue. Save the leaderboard to ex_6_1.
Expected result:
#> # A tibble: 3 x 3
#> name segment total_revenue
#> <chr> <chr> <dbl>
#> 1 Eve Pro 290
#> 2 Alice Pro 125
#> 3 Bob Free 80
Difficulty: Intermediate
Combine the two tables, total the amounts per customer, then keep only the highest three.
Use inner_join() on customer_id, then group_by(name, segment), summarise(total_revenue = sum(amount)), and slice_max(total_revenue, n = 3).
Click to reveal solution
Explanation: Standard top-N report pattern: inner_join (drops orphan orders), aggregate per customer, slice the top N. Using inner_join() here is intentional because the report is for legitimate customers only, so the orphan orders 105 and 110 are correctly excluded. Always justify whether orphans should appear or be silently filtered, and pick the join type that matches the decision.
Exercise 6.2: Identify Pro-segment churn candidates
Task: The customer success team wants the list of Pro-segment customers who have NOT placed any order in the last 90 days as of 2024-12-01, sorted by signup date with oldest first. Use a filtering join after restricting orders to recent activity. Save the at-risk roster to ex_6_2.
Expected result:
#> # A tibble: 2 x 4
#> customer_id name segment signup
#> <int> <chr> <chr> <date>
#> 1 1 Alice Pro 2024-01-15
#> 2 3 Carol Pro 2024-03-22
Difficulty: Intermediate
Narrow to Pro customers, define what counts as recent activity, keep those without it, and order by signup.
Use filter(segment == "Pro"), an anti_join() against the recent orders set, then arrange(signup).
Click to reveal solution
Explanation: Filtering joins compose well with regular filters: pre-filter the right side to define "active", pre-filter the left side to define "in scope", then anti_join. Order matters less for correctness than for readability. Many analysts put the segment filter after the anti_join, which works but reads less like the stakeholder's plain-English question and so is harder to review.
Exercise 6.3: Build a system reconciliation table with a derived status column
Task: A compliance officer wants a single reconciliation table showing every customer_id from EITHER orders or customers, plus a status column with values "matched", "orphan_order", or "no_orders". Use full_join() on the two distinct keysets with boolean flags, derive status with case_when(), and save the audit table to ex_6_3.
Expected result:
#> # A tibble: 8 x 2
#> customer_id status
#> <int> <chr>
#> 1 1 matched
#> 2 2 matched
#> 3 3 matched
#> 4 4 matched
#> 5 5 matched
#> 6 6 no_orders
#> 7 7 orphan_order
#> 8 9 orphan_order
Difficulty: Advanced
Take the union of both keysets, then label each id by which side it actually came from.
Add boolean flags before a full_join() on customer_id, then derive status with case_when().
Click to reveal solution
Explanation: Reconciliation queries always start with full_join() to get the union of both keysets. The trick to deriving "where did this row come from" is to add boolean flags BEFORE the join, then use case_when() after. NA after a full join means "no row on this side", which is the same signal but harder to read in a status column shown to a stakeholder.
Exercise 6.4: Build a denormalized orders fact table for a dashboard
Task: A BI engineer wants a single denormalized "orders fact" table that the dashboard can query without further joins. Combine orders (with an added product_id = c(1,2,1,3,4,1,2,3,4,1)), customers, products, and the optional returns table. Use left_join() for everything so the order grain is preserved. Save to ex_6_4.
Expected result:
#> # A tibble: 10 x 11
#> order_id customer_id order_date amount product_id name segment signup product category reason
#> <int> <int> <date> <dbl> <dbl> <chr> <chr> <date> <chr> <chr> <chr>
#> 1 101 1 2024-04-02 50 1 Alice Pro 2024-01-15 Widget Tools damaged
#> 2 102 2 2024-04-15 80 2 Bob Free 2024-02-08 Gadget Tools NA
#> 3 103 1 2024-05-10 30 1 Alice Pro 2024-01-15 Widget Tools NA
#> 4 104 3 2024-05-22 65 3 Carol Pro 2024-03-22 Sprocket Parts NA
#> 5 105 7 2024-06-01 120 4 NA NA NA Gizmo Tools wrong size
#> # 5 more rows; refund column also present
Difficulty: Intermediate
Anchor on the orders table so the row count never changes, then layer each dimension on without dropping rows.
Add product_id with mutate(), then chain left_join() calls for customers, products, and returns.
Click to reveal solution
Explanation: Star-schema denormalization is a sequence of left_join()s anchored on the fact table (orders). Always start from the fact, never from a dimension, so the row count of the result equals the row count of the fact. Watch the column-count tax: wide tables hurt downstream pivot performance. For analytical engines prefer this shape; for OLTP storage keep it normalized.
Exercise 6.5: Compute days-to-first-order per customer with a left_join
Task: The growth team wants days_to_first_order for every customer, computed as the gap in days between their signup date and their first order date. Aggregate orders (after a semi_join() to drop orphans) to one row per customer using min(order_date), left_join() back to customers, then derive the lag in days. Save to ex_6_5.
Expected result:
#> # A tibble: 6 x 4
#> customer_id name signup days_to_first_order
#> <int> <chr> <date> <dbl>
#> 1 1 Alice 2024-01-15 78
#> 2 2 Bob 2024-02-08 67
#> 3 3 Carol 2024-03-22 61
#> 4 4 Dan 2024-05-01 131
#> 5 5 Eve 2024-06-18 16
#> 6 6 Frank 2024-09-04 NA
Difficulty: Advanced
Reduce orders to each customer's earliest date, attach it back so every customer stays visible, then take the gap.
Use semi_join() then summarise(first_order_date = min(order_date)), left_join() back to customers, and mutate() the day difference.
Click to reveal solution
Explanation: Time-to-first-X is one of the most common growth metrics. The pattern is: aggregate the event table to a per-entity minimum date, left-join back to the entity dimension to keep entities with no events visible (Frank gets NA, not dropped), then compute the lag. The semi_join() step is defensive: it removes orphans so the per-customer min is computed only over events whose customer actually exists.
What to do next
Once you finish these, work through these closely related hubs:
- dplyr Exercises in R: 50 Practice Problems covers the full verb set (select, filter, mutate, arrange, summarise) so the joins above slot into longer pipelines.
- dplyr Group By Exercises in R drills the aggregation patterns that almost always follow a join in real reports.
- tidyr Exercises in R covers the pivot and unite/separate operations that prepare wide or long tables before a join.
- Data Wrangling Exercises in R is the broader mixed-package hub if you want a longer end-to-end workout.
r-statistics.co · Verifiable credential · Public URL
This document certifies mastery of
dplyr Joins Mastery
Every certificate has a public verification URL that proves the holder passed the assessment. Anyone with the link can confirm the recipient and date.
172 learners have earned this certificate