data.table Exercises in R: 50 Real Practice Problems
Fifty hands-on practice problems on data.table covering the i/j/by triple, in-place modification with :=, joins, reshape, the special symbols family and the performance idioms that make data.table fast. Solutions are hidden under each exercise so you can attempt first and verify after.
Section 1. The i, j, by triple (10 problems)
Exercise 1.1: Filter fuel-efficient cars with i
Task: Use the i slot of data.table on the mt data.table to keep only rows where mpg is strictly greater than 25. Print the result and save it to ex_1_1.
Expected result:
#> car mpg cyl disp hp drat wt qsec vs am gear carb
#> <char> <num> ...
#> 1: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 2: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 3: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 4: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 5: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
#> 6: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Difficulty: Beginner
The first slot inside the brackets is the row filter; write the condition in the table's own column namespace, no table-name prefix.
Inside mt[...], put mpg > 25 as the first argument.
Click to reveal solution
Explanation: Inside [, the first argument is i and lives entirely in the data.table's column namespace, so you write mpg > 25 not mt$mpg > 25. Compared to base R's mt[mt$mpg > 25, ] this is shorter and avoids repeating the table name. data.table also evaluates i more efficiently when a key or index is set, which you'll explore later in Section 6.
Exercise 1.2: Select two columns by name in j
Task: From mt, return only the car and mpg columns using j with list() (or its . alias). Keep all rows. Save the resulting two-column data.table to ex_1_2.
Expected result:
#> car mpg
#> <char> <num>
#> 1: Mazda RX4 21.0
#> 2: Mazda RX4 Wag 21.0
#> 3: Datsun 710 22.8
#> 4: Hornet 4 Drive 21.4
#> ...
#> 32 rows total
Difficulty: Beginner
Column selection happens in the second slot; wrap the names so a table comes back instead of a bare vector.
In j, write .(car, mpg) using the .() list shorthand.
Click to reveal solution
Explanation: Writing column names inside .() (a shorthand for list()) tells data.table to return a new data.table with those columns. Without .(), mt[, car] returns a plain vector of car names because j is a single expression. The .() form is the idiomatic way to subset multiple columns while keeping data.table structure.
Exercise 1.3: Compute mean mpg in j
Task: Compute the mean of mpg across all 32 rows of mt by writing the aggregation directly in the j slot. The result should be a single-row data.table with a named column called avg_mpg. Save it to ex_1_3.
Expected result:
#> avg_mpg
#> <num>
#> 1: 20.09062
Difficulty: Beginner
An aggregation written in the second slot collapses all rows down to a single summary row.
In j, write .(avg_mpg = mean(mpg)) to name the result column.
Click to reveal solution
Explanation: When j is wrapped in .(), you can name the returned columns inline using name = expression syntax, mirroring dplyr::summarise. Drop the .() and you get a bare numeric value, which is sometimes what you want, but losing the column name makes downstream code harder to chain. Use .() whenever you'll bind, join or print the result.
Exercise 1.4: Lot manager filters and selects in one step
Task: A used-car lot manager wants only the model name and weight of cars that achieve at least 25 mpg and have four cylinders. Combine the row filter in i and the column projection in j of mt. Save the two-column data.table to ex_1_4.
Expected result:
#> car wt
#> <char> <num>
#> 1: Fiat 128 2.200
#> 2: Honda Civic 1.615
#> 3: Toyota Corolla 1.835
#> 4: Fiat X1-9 1.935
#> 5: Lotus Europa 1.513
Difficulty: Intermediate
The row filter and the column projection are independent slots you can stack inside one set of brackets.
Combine mpg >= 25 & cyl == 4 in i with .(car, wt) in j.
Click to reveal solution
Explanation: i and j are independent expressions, so you can stack a logical filter and a column projection in a single bracketed call without piping. The order is fixed: rows first, then columns. Compared to base R you avoid creating an intermediate subset and avoid repeating mt$, which is the main ergonomic win of data.table syntax.
Exercise 1.5: Group mean mpg by cylinder count
Task: Compute the mean of mpg for each value of cyl in mt. Use the by slot. Name the aggregated column mean_mpg and save the two-column data.table to ex_1_5.
Expected result:
#> cyl mean_mpg
#> <num> <num>
#> 1: 6 19.74286
#> 2: 4 26.66364
#> 3: 8 15.10000
Difficulty: Beginner
Grouping is a third slot that splits the table before the aggregation in the second slot runs.
Add by = cyl and name the column with .(mean_mpg = mean(mpg)).
Click to reveal solution
Explanation: The by slot defines the grouping. data.table returns groups in the order they first appear in the source table. To get groups in sorted order without re-sorting afterwards, use keyby = cyl instead. keyby also sets a key on the result, which makes downstream joins and lookups fast.
Exercise 1.6: Multiple aggregations per group
Task: For each cyl group in mt, compute three statistics in one call: mean of mpg, max of hp, and the count of rows using .N. Name the columns mean_mpg, max_hp, n. Save the result to ex_1_6.
Expected result:
#> cyl mean_mpg max_hp n
#> <num> <num> <num> <int>
#> 1: 6 19.74286 175 7
#> 2: 4 26.66364 113 11
#> 3: 8 15.10000 335 14
Difficulty: Intermediate
Each comma-separated item in the second slot is its own per-group expression evaluated independently.
In .(), list mean(mpg), max(hp) and .N, each named, with by = cyl.
Click to reveal solution
Explanation: Each item inside .() is its own expression evaluated per group. .N is a special symbol holding the number of rows in the current group, so it works without referencing any column. Mixing summary functions and .N in one call is the standard "summary by group" pattern in data.table and avoids the multi-step group_by + summarise flow.
Exercise 1.7: Marketing manager groups by two keys
Task: A marketing manager looking at mt wants the average mpg broken out by every combination of cyl and am (transmission). Group by both columns simultaneously using by. Save the result to ex_1_7.
Expected result:
#> cyl am mean_mpg
#> <num> <num> <num>
#> 1: 6 1 20.56667
#> 2: 4 1 28.07500
#> 3: 6 0 19.12500
#> 4: 8 0 15.05000
#> 5: 4 0 22.90000
#> 6: 8 1 15.40000
Difficulty: Intermediate
Grouping can span more than one column, giving one output row per combination of values.
Pass both columns to by inside .(), as by = .(cyl, am).
Click to reveal solution
Explanation: Pass multiple grouping columns inside .() (or list()). The output has one row per unique combination. With keyby = .(cyl, am) instead, you'd get the same six rows but sorted by both keys, which often matters for plotting or for downstream merges that rely on a sort order.
Exercise 1.8: Order rows in i by descending mpg
Task: Sort mt by mpg from highest to lowest and return only the top six rows so you can inspect the most fuel-efficient cars. Use order(-mpg) inside i. Save the six-row data.table to ex_1_8.
Expected result:
#> car mpg cyl disp hp drat wt qsec vs am gear carb
#> <char> <num> ...
#> 1: Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
#> 2: Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
#> 3: Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
#> 4: Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
#> 5: Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
#> 6: Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Difficulty: Beginner
Sorting belongs in the row slot, and a negative sign flips the direction to descending.
Use order(-mpg) in i, then chain a [1:6] bracket to take the top rows.
Click to reveal solution
Explanation: data.table overrides order() internally with a fast radix-sort version when used inside [. The minus sign reverses direction. Chaining [1:6] afterwards is the idiomatic "top N" pattern. For an in-place sort that also rearranges the underlying object (no copy), use setorder(mt, -mpg) instead, which is slightly faster on large tables.
Exercise 1.9: Junior analyst chains two brackets
Task: A junior analyst onboarding to data.table wants to compute the mean mpg per cyl using mt, then keep only the groups where the mean is above 18, using bracket chaining [][]. Save the filtered grouped result to ex_1_9.
Expected result:
#> cyl mean_mpg
#> <num> <num>
#> 1: 6 19.74286
#> 2: 4 26.66364
Difficulty: Intermediate
One bracket can feed its result straight into the next, the way a pipe chains steps.
After the grouped .(mean_mpg = mean(mpg)), by = cyl, chain a second bracket [mean_mpg > 18].
Click to reveal solution
Explanation: Bracket chaining is data.table's answer to the pipe: the first [] produces an intermediate data.table and the second filters it. Because each [] returns a fresh data.table, you can chain arbitrarily many. The newly named mean_mpg column is available immediately in the next bracket since it lives in the intermediate result.
Exercise 1.10: Group by an expression rather than a column
Task: Compute the mean mpg of cars in mt split into two groups defined by the logical expression am == 1, naming the grouping output column is_manual. Group by the expression directly, no helper column. Save the two-row result to ex_1_10.
Expected result:
#> is_manual mean_mpg
#> <lgcl> <num>
#> 1: FALSE 17.14737
#> 2: TRUE 24.39231
Difficulty: Intermediate
The grouping slot accepts an arbitrary expression, not just a bare column name.
Write by = .(is_manual = am == 1) to group on the logical expression directly.
Click to reveal solution
Explanation: by accepts arbitrary expressions, not just column names. Naming them inside .() makes the output column explicit. This pattern saves you from creating a throwaway flag column before grouping. It's especially handy in interactive analysis when you want a quick "is X true vs not" cut without polluting the source table.
Section 2. Modify in place with := (8 problems)
Exercise 2.1: Add a kilometres-per-litre column
Task: On the mt data.table, create a new column kpl that converts US mpg to kilometres per litre using the factor 0.4251, modifying mt in place with the walrus operator. Then print mt and save it to ex_2_1.
Expected result:
#> car mpg cyl disp hp drat wt qsec vs am gear carb kpl
#> <char> <num> ... <num>
#> 1: Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 8.9271
#> 2: Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 8.9271
#> ...
Difficulty: Beginner
A new column can be written straight into the table by reference, with no copy of the frame.
On the right of kpl :=, put mpg * 0.4251.
Click to reveal solution
Explanation: The := operator modifies the data.table by reference: no new object is created and the assignment returns invisibly. That makes it extremely memory-efficient for large tables compared to mt$kpl <- mt$mpg * 0.4251, which would copy the whole frame. To see the modified table after a :=, print it explicitly or wrap the call in [] on the right side.
Exercise 2.2: Compliance officer flags low-fuel-economy vehicles
Task: A compliance officer wants a new logical column low_mpg on mt set to TRUE whenever mpg is below 18, otherwise FALSE. Create the column with := in a single call, then save the modified mt to ex_2_2.
Expected result:
#> car mpg ... low_mpg
#> 1: Mazda RX4 21.0 ... FALSE
#> 2: Hornet 4 Drive 21.4 ... FALSE
#> 3: Duster 360 14.3 ... TRUE
#> 4: Merc 240D 24.4 ... FALSE
#> 5: Cadillac Fleetwood 10.4 ... TRUE
Difficulty: Intermediate
A comparison applied to a column evaluates element-wise, producing a TRUE/FALSE vector directly.
Assign mpg < 18 to low_mpg with :=; no conditional wrapper is needed.
Click to reveal solution
Explanation: A logical expression on a vector evaluates element-wise, so mpg < 18 produces a logical vector of the right length automatically. No ifelse wrapper is needed when the result is already TRUE/FALSE. If you wanted "Low"/"Normal" labels instead, switch to fifelse(mpg < 18, "Low", "Normal"), the type-stable data.table version of ifelse.
Exercise 2.3: Add two columns at once with the list form
Task: In a single bracketed call on mt, create two new columns simultaneously: power_to_weight defined as hp / wt, and disp_per_cyl defined as disp / cyl. Use the := list(...) form, then save the modified mt to ex_2_3.
Expected result:
#> car mpg ... power_to_weight disp_per_cyl
#> 1: Mazda RX4 21.0 ... 41.984 26.66667
#> 2: Hornet 4 Drive 21.4 ... 33.426 43.00000
#> 3: Duster 360 14.3 ... 62.171 45.00000
#> ...
Difficulty: Intermediate
Several derived columns can be created together in a single pass over the table.
Inside ` :=(...) , set power_to_weight = hp / wt and disp_per_cyl = disp / cyl`.
Click to reveal solution
Explanation: The functional form ` :=(name1 = expr1, name2 = expr2) is the cleanest way to add multiple columns in one pass. It edits the table once, scanning columns just once, which is faster than calling :=` twice in a row. Use this whenever you have several derived columns to compute together.
Exercise 2.4: Remove an unwanted column
Task: The marketing analyst no longer needs the qsec (quarter-mile time) column from mt. Remove the column in place by assigning it to NULL with :=. Then save the slimmer mt to ex_2_4.
Expected result:
#> "qsec" %in% names(ex_2_4)
#> [1] FALSE
Difficulty: Beginner
A column can be dropped in place in constant time, no matter how large the table is.
Assign NULL to qsec with :=.
Click to reveal solution
Explanation: Setting a column to NULL via := deletes it by reference, in constant time, regardless of table size. This is the data.table equivalent of df$qsec <- NULL but it does not copy the whole table. To drop multiple columns at once use mt[, c("col1", "col2") := NULL].
Exercise 2.5: Audit team patches values in a subset
Task: An audit team on mt wants to clamp implausibly high gear values: for any row where gear is greater than 4, set gear to exactly 4. Combine an i filter with a j update via :=. Save the modified mt to ex_2_5.
Expected result:
#> ex_2_5[, max(gear)]
#> [1] 4
Difficulty: Intermediate
An update touches only the rows the filter slot selects, leaving every other row unchanged.
Filter with gear > 4 in i, then write gear := 4 in j.
Click to reveal solution
Explanation: When := is combined with an i filter, the assignment only touches the matched rows, leaving all others untouched. This is the canonical "patch a subset" idiom: cheap, in place, no copy. Equivalent base R code, mt$gear[mt$gear > 4] <- 4, works but is slower on wide tables because it triggers a copy of the whole frame in some R versions.
Exercise 2.6: Sports scout ranks players within position
Task: Treat cyl as a "position" and hp as a "performance score" on mt. A sports scout wants a new column hp_rank giving the descending rank of hp within each cyl group. Compute it with := and by. Save the modified mt to ex_2_6.
Expected result:
#> ex_2_6[order(cyl, hp_rank), .(car, cyl, hp, hp_rank)][1:6]
#> car cyl hp hp_rank
#> 1: Lotus Europa 4 113 1
#> 2: Honda Civic 4 52 11
#> ...
Difficulty: Advanced
A within-group rank needs both an update and a grouping in the same bracketed call.
Use frank(-hp, ties.method = "min") as the hp_rank := value, with by = cyl.
Click to reveal solution
Explanation: frank is data.table's fast equivalent of base rank. Combined with := and by, it produces a within-group rank in one pass without splitting and recombining. Negating hp reverses the order so higher horsepower gets rank 1. The ties.method = "min" choice gives the smallest of the tied ranks to all tied values, a common choice for leaderboards.
Exercise 2.7: Rename two columns in place
Task: Rename the disp column of mt to displacement and the wt column to weight_klb using setnames() in a single call with vector arguments. This modifies mt in place. Save the renamed table to ex_2_7.
Expected result:
#> all(c("displacement", "weight_klb") %in% names(ex_2_7))
#> [1] TRUE
Difficulty: Intermediate
Renaming happens by reference and pairs old names to new names by position.
Pass old = c("disp", "wt") and new = c("displacement", "weight_klb") to setnames.
Click to reveal solution
Explanation: setnames renames columns by reference, never copying the table. The old/new vector form lets you rename many at once and is order-sensitive: position 1 in old maps to position 1 in new. To rename by position, pass integer indices in old. After this exercise, remember the table is permanently changed for downstream exercises in your session.
Exercise 2.8: Functional := over many columns programmatically
Task: Convert all numeric columns of ir (sepal and petal measurements) from centimetres to millimetres by multiplying each by 10, in place, using := with column vectors selected via .SDcols. Save the modified ir to ex_2_8.
Expected result:
#> ex_2_8[1:3]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <num> <num> <num> <num> <fctr>
#> 1: 51 35 14 2 setosa
#> 2: 49 30 14 2 setosa
#> 3: 47 32 13 2 setosa
Difficulty: Advanced
To transform many columns at once, iterate a function over the per-column subset of the table.
Use lapply(.SD, function(x) x * 10) with .SDcols = num_cols and the (num_cols) := form.
Click to reveal solution
Explanation: The (num_cols) syntax on the left of := unquotes the variable so the names come from the character vector, not literally. .SD is the subset of data, restricted to .SDcols, and lapply iterates a function over each column. This idiom scales: change .SDcols and the function, and the same shape works for hundreds of columns.
Section 3. Special symbols and helpers (8 problems)
Exercise 3.1: Count rows per cylinder group with .N
Task: Use the .N special symbol on mt to compute the number of rows in each cyl group. Return a two-column data.table with cyl and n. Save the result to ex_3_1.
Expected result:
#> cyl n
#> <num> <int>
#> 1: 6 7
#> 2: 4 11
#> 3: 8 14
Difficulty: Beginner
A special symbol holds the row count of the current group without referencing any column.
Return .(n = .N) with by = cyl.
Click to reveal solution
Explanation: .N is the row count of the current i-filtered, by-grouped frame. It is an integer scalar inside each group, so wrapping it with .() gives you a named output column. This is the data.table equivalent of dplyr::count(cyl) and the most-used idiom in real-world data.table code.
Exercise 3.2: Total rows in the table with .N
Task: Without using nrow(), count the total number of rows in mt by referencing .N from the j slot without a by argument. Return a single-row data.table named total_rows. Save the result to ex_3_2.
Expected result:
#> total_rows
#> <int>
#> 1: 32
Difficulty: Beginner
With no grouping, the row-count symbol refers to the entire table.
Return .(total_rows = .N) with no by argument.
Click to reveal solution
Explanation: With no by, .N refers to the whole table (after any i filter). This is the convenient way to express a row count plus other summary stats in the same call. Note that mt[, .N] (without the .() wrapper) returns a bare integer instead of a single-row data.table, which is fine for printing but harder to bind into a result table.
Exercise 3.3: Inspect the structure of .SD
Task: Inside j, return .SD itself for the first row of each cyl group of mt. This shows that .SD is the data.table of all non-by columns for the current group. Use head(.SD, 1) and group by cyl. Save the result to ex_3_3.
Expected result:
#> cyl car mpg displacement hp drat weight_klb vs am gear carb kpl low_mpg power_to_weight disp_per_cyl hp_rank
#> <num> <char> <num> <num> ...
#> 1: 6 Mazda RX4 21.0 160.0 110 3.90 2.620 0 1 4 4 ...
#> 2: 4 Datsun 710 22.8 108.0 93 3.85 2.320 1 1 4 1 ...
#> 3: 8 Hornet Sportabout 18.7 360.0 175 3.15 3.440 0 0 3 2 ...
Difficulty: Intermediate
The per-group data subset is itself a table you can return directly from the second slot.
Return head(.SD, 1) with by = cyl.
Click to reveal solution
Explanation: .SD is "Subset of Data" and represents the per-group slice of the table without the grouping columns. Returning head(.SD, 1) per group is the data.table equivalent of dplyr::slice_head(n = 1). Because .SD includes all non-group columns by default, this is also useful for sampling, lagging, or any per-group operation that needs the full row, not just an aggregate.
Exercise 3.4: Auditor summarises a column subset
Task: An auditor wants the mean of only the mpg, hp and weight_klb columns of mt, by cyl. Use .SD with .SDcols to restrict the columns considered, then apply lapply(.SD, mean). Save the four-column result to ex_3_4.
Expected result:
#> cyl mpg hp weight_klb
#> <num> <num> <num> <num>
#> 1: 6 19.74286 122.28571 3.117143
#> 2: 4 26.66364 82.63636 2.285727
#> 3: 8 15.10000 209.21429 3.999214
Difficulty: Intermediate
Restricting which columns enter the per-group subset lets you summarise only those columns.
Set audit_cols <- c("mpg", "hp", "weight_klb") and use lapply(.SD, mean) in j with .SDcols.
Click to reveal solution
Explanation: .SDcols controls which columns end up in .SD. Combined with lapply(.SD, FUN) it's the standard pattern for "apply this function to many columns at once". Pass a character vector, an integer range, or a function predicate like is.numeric to choose columns dynamically. The output has one row per group with one summary column per .SDcols entry.
Exercise 3.5: Performance reviewer keeps top per group
Task: A performance reviewer wants the two heaviest cars within each cyl group of mt, returning the full row for each. Use .SD with head after ordering, all in one bracketed call using order in i per group. Save the resulting data.table to ex_3_5.
Expected result:
#> cyl car weight_klb ...
#> <num> <char> <num>
#> 1: 6 Valiant 3.460
#> 2: 6 Merc 280C 3.440
#> 3: 4 Toyota Corona 2.465
#> 4: 4 Volvo 142E 2.780
#> 5: 8 Lincoln Continental 5.424
#> 6: 8 Cadillac Fleetwood 5.250
Difficulty: Intermediate
Sort the whole table first, then take the leading rows of each group from the sorted order.
With rows ordered by -weight_klb, return head(.SD, 2) with by = cyl.
Click to reveal solution
Explanation: When i sorts globally and by groups afterwards, data.table preserves the sorted order within each group. Calling head(.SD, 2) per group then takes the top two heaviest rows. This "sort then per-group head" trick is the cleanest top-N idiom in data.table and is much faster than splitting by group and applying head separately.
Exercise 3.6: Tag groups with a sequential ID using .GRP
Task: Add a column grp_id to mt containing a sequential integer that increments per cyl group, using the .GRP special symbol. Cars in the same cylinder bucket share an id. Save the modified mt to ex_3_6.
Expected result:
#> ex_3_6[, .(car, cyl, grp_id)][1:4]
#> car cyl grp_id
#> 1: Mazda RX4 6 1
#> 2: Mazda RX4 Wag 6 1
#> 3: Datsun 710 4 2
#> 4: Hornet 4 Drive 6 1
Difficulty: Advanced
A special symbol tags each group with a sequential integer as the groups are encountered.
Assign .GRP to grp_id with by = cyl.
Click to reveal solution
Explanation: .GRP is an integer that uniquely tags each group, assigned in the order the groups are encountered. It is useful for tracing which rows belong together after a complex pipeline, for joining a per-group lookup back, or as a compact factor replacement. Note: the numbering follows the order groups appear in the source, not the sorted order; use keyby to align the two.
Exercise 3.7: Reporting analyst captures original row indices
Task: A reporting analyst wants the original row position of the maximum-hp car within each cyl group of mt, using the .I special symbol that holds row indices. Return a two-column table of cyl and row_idx. Save it to ex_3_7.
Expected result:
#> cyl row_idx
#> <num> <int>
#> 1: 6 31
#> 2: 4 28
#> 3: 8 31
Difficulty: Advanced
One special symbol carries each group's positions back into the original table's row numbering.
Use .I[which.max(hp)] as the row_idx expression, with by = cyl.
Click to reveal solution
Explanation: .I is the vector of row indices in the original table for the current group. which.max(hp) returns the position of the maximum inside the group, and .I[] translates that group-relative position into the original row number. This is the standard idiom when you need the index back into the source frame, for example to update those rows or to extract them by mt[idx_vector].
Exercise 3.8: Inspect group metadata with .BY
Task: Apply a function inside j that returns the current group's values along with a per-group computed statistic on mt. Use .BY to inspect the grouping values inside the expression and pair it with the group's mean mpg. Save the resulting data.table to ex_3_8.
Expected result:
#> cyl mean_mpg cyl_label
#> <num> <num> <char>
#> 1: 6 19.74286 cyl=6
#> 2: 4 26.66364 cyl=4
#> 3: 8 15.10000 cyl=8
Difficulty: Advanced
Inside a grouped expression you can read back the current group's own grouping values.
Build the label with paste0("cyl=", .BY$cyl).
Click to reveal solution
Explanation: .BY is a named list of the grouping column values inside the current group, even when the group is defined by an anonymous expression. It's most useful inside custom functions called from j, where the function body doesn't see the surrounding by argument. Use it to build descriptive labels, write group-specific log lines, or branch logic based on which group is currently being processed.
Section 4. Joins (8 problems)
These problems use small inline data.tables that we construct first.
Exercise 4.1: E-commerce analyst left-joins orders to customers
Task: An e-commerce analyst wants every orders row enriched with the customer's name and city. Use the X[Y, on=] idiom where the right-hand side is customers to perform a left join keeping all order rows. Save the merged data.table to ex_4_1.
Expected result:
#> order_id cust_id amount name city
#> <int> <num> <num> <char> <char>
#> 1: 101 1 120 Anya Mumbai
#> 2: 102 1 45 Anya Mumbai
#> 3: 103 2 230 Bilal Lagos
#> 4: 104 3 75 Chen Shanghai
#> 5: 105 3 60 Chen Shanghai
#> 6: 106 3 410 Chen Shanghai
#> 7: 107 6 50 <NA> <NA>
#> 8: 108 6 30 <NA> <NA>
Difficulty: Intermediate
In a bracket join, the table inside the brackets drives the lookup and the outer table is the source matched against.
Write customers[orders, on = "cust_id"].
Click to reveal solution
Explanation: Reading customers[orders, on = "cust_id"] aloud as "for every row of orders, look up the matching row in customers" gives you the mental model: the right-hand table is the driver, the bracketed table is the lookup. Rows from orders with no match get NA for the lookup columns. This is data.table's idiomatic left join and works without setting a key.
Exercise 4.2: Inner join keeping only matched rows
Task: Repeat the previous join but drop unmatched order rows entirely, returning only the orders whose cust_id exists in customers. Use the nomatch = NULL argument inside the same bracketed join call. Save the resulting six-row data.table to ex_4_2.
Expected result:
#> order_id cust_id amount name city
#> <int> <num> <num> <char> <char>
#> 1: 101 1 120 Anya Mumbai
#> 2: 102 1 45 Anya Mumbai
#> 3: 103 2 230 Bilal Lagos
#> 4: 104 3 75 Chen Shanghai
#> 5: 105 3 60 Chen Shanghai
#> 6: 106 3 410 Chen Shanghai
Difficulty: Intermediate
A single flag flips a left join into an inner join by discarding the unmatched rows.
Add nomatch = NULL to customers[orders, on = "cust_id"].
Click to reveal solution
Explanation: The nomatch = NULL flag tells data.table to discard rows where the lookup fails, producing an inner join. The default nomatch = NA keeps unmatched rows with NAs (a left join). Memorising this single flag is enough to flip between left and inner joins without changing the rest of the syntax.
Exercise 4.3: Full outer join with merge()
Task: Build the full outer join of customers and orders on cust_id so unmatched rows from both sides appear. Use merge.data.table with all = TRUE. Save the result to ex_4_3.
Expected result:
#> cust_id name city order_id amount
#> <num> <char> <char> <int> <num>
#> 1: 1 Anya Mumbai 101 120
#> 2: 1 Anya Mumbai 102 45
#> 3: 2 Bilal Lagos 103 230
#> 4: 3 Chen Shanghai 104 75
#> 5: 3 Chen Shanghai 105 60
#> 6: 3 Chen Shanghai 106 410
#> 7: 4 Devi Delhi NA NA
#> 8: 5 Eli Cairo NA NA
#> 9: 6 <NA> <NA> 107 50
#>10: 6 <NA> <NA> 108 30
Difficulty: Intermediate
A full outer join keeps unmatched rows from both sides, which the bracket syntax cannot express.
Call merge(customers, orders, by = "cust_id", all = TRUE).
Click to reveal solution
Explanation: merge with all = TRUE returns every row from both tables, filling unmatched cells with NA. The bracket syntax X[Y] cannot express a full outer join directly, so merge is the right tool here. all.x = TRUE and all.y = TRUE give left and right outer joins respectively, completing the four standard join variants in a single function.
Exercise 4.4: Fraud team finds anti-join orphans
Task: A fraud team wants the orders rows whose cust_id is unknown, that is rows that do NOT match anything in customers. Use the anti-join pattern with !customers[orders, on = "cust_id", which = TRUE] indexing on the order table. Save the orphan rows to ex_4_4.
Expected result:
#> order_id cust_id amount
#> <int> <num> <num>
#> 1: 107 6 50
#> 2: 108 6 30
Difficulty: Advanced
Negating the lookup table keeps the rows that fail to match instead of those that do.
Write orders[!customers, on = "cust_id"].
Click to reveal solution
Explanation: Prefixing the right-hand table with ! flips the semantics from "keep matches" to "keep non-matches". This anti-join pattern reads naturally: "give me orders not in customers". It's the idiomatic way to find orphans, missing references, or recently churned IDs without writing an explicit is.na filter afterwards.
Exercise 4.5: Risk team updates a column during the join
Task: A risk team wants to enrich orders in place with the customer's city, leaving unmatched orders with a city value of "unknown". Use the update-on-join idiom orders[customers, city := ..., on = "cust_id"] then patch missing cities afterwards. Save the modified orders to ex_4_5.
Expected result:
#> order_id cust_id amount city
#> <int> <num> <num> <char>
#> 1: 101 1 120 Mumbai
#> 2: 102 1 45 Mumbai
#> 3: 103 2 230 Lagos
#> 4: 104 3 75 Shanghai
#> 5: 105 3 60 Shanghai
#> 6: 106 3 410 Shanghai
#> 7: 107 6 50 unknown
#> 8: 108 6 30 unknown
Difficulty: Advanced
An update during a join reaches into the other table's columns, told apart by a name prefix.
Assign i.city to city inside orders[customers, ..., on = "cust_id"].
Click to reveal solution
Explanation: Inside an update-on-join, the right-hand table's columns are referenced with the i. prefix to disambiguate from the left-hand columns. i.city is the city column from customers. Update-on-join writes the new column directly into the left-hand table by reference, never creating a copy, which is the most memory-efficient way to enrich a large fact table with dimensional attributes.
Exercise 4.6: Jeweller buckets prices with a non-equi join
Task: A jeweller has price bands in price_bands and a 100-row sample dm_sample of diamonds. Use a non-equi join on price >= band_lo AND price <= band_hi to label every diamond with its band. Save the labelled data.table to ex_4_6.
Expected result:
#> price band
#> <int> <char>
#> 1: 326 budget
#> 2: 326 budget
#> 3: 327 budget
#> 4: 2757 mid
#> 5: 18000 premium
#> ...
Difficulty: Advanced
A range lookup matches rows with inequalities rather than equality on the join keys.
Join with on = .(price >= band_lo, price <= band_hi) and select .(price = x.price, band = i.band).
Click to reveal solution
Explanation: Non-equi joins use inequality operators inside on = .(). They are the data.table answer to "lookup by range", which would otherwise require a cross join plus filter. The x. prefix disambiguates the left-hand price column when the same name appears in the join condition. Non-equi joins also support <, > and combinations, and are dramatically faster than the SQL equivalent on large tables.
Exercise 4.7: Trading desk uses a rolling join
Task: A trading desk has trades dated rows and rates dated reference rows. For each trade, attach the most recent rate as of the trade date using a rolling join with roll = TRUE. Save the joined data.table with trade_id, date and rate to ex_4_7.
Expected result:
#> trade_id date rate
#> <int> <IDate> <num>
#> 1: 1 2026-01-15 7.50
#> 2: 2 2026-02-20 7.65
#> 3: 3 2026-02-28 7.65
#> 4: 4 2026-03-10 7.40
#> 5: 5 2026-04-25 7.30
Difficulty: Advanced
A rolling join carries the last observed value forward to the requested time points.
In the keyed roll = TRUE join, select .(trade_id, date, rate) in the j slot.
Click to reveal solution
Explanation: roll = TRUE carries the last observed value forward to fill gaps in the join. It's designed for time-series and reference data: as-of pricing, currency rates, last-known location. The keyed columns become the join columns automatically, so setkey on date for both tables is mandatory. Use roll = -Inf to carry the next value backward instead.
Exercise 4.8: Multi-key join between two transactional tables
Task: Build two tables, sales and targets, each keyed on region and quarter, and join them on both columns simultaneously. The output should show every sales row with the matched target. Save the multi-key joined result to ex_4_8.
Expected result:
#> region quarter sales target
#> <char> <char> <num> <num>
#> 1: North Q1 1200 1000
#> 2: North Q2 1500 1400
#> 3: South Q1 900 800
#> 4: South Q2 980 900
Difficulty: Intermediate
A composite-key join must align on every shared column at the same time.
Pass a character vector to on, as sales[targets, on = c("region", "quarter")].
Click to reveal solution
Explanation: Pass a character vector to on = to join on multiple columns. The match must align on every column, otherwise the row is unmatched. This is the workhorse pattern when fact tables and dimensions share a composite key. For the most common case where both tables already have an identical key set via setkey, you can omit on = entirely and data.table will use the shared key automatically.
Section 5. Reshape with melt and dcast (6 problems)
Exercise 5.1: Marketing analyst melts a wide table to long
Task: A marketing analyst has the wide table iris and wants a long version where each row holds one measurement value. Use melt on ir with Species as the id column to produce columns variable and value. Save the long data.table to ex_5_1.
Expected result:
#> Species variable value
#> <fctr> <fctr> <num>
#> 1: setosa Sepal.Length 51
#> 2: setosa Sepal.Length 49
#> ...
#> 600 rows total
Difficulty: Intermediate
Reshaping wide to long stacks every non-identifier column into one variable/value pair.
Call melt(ir, id.vars = "Species").
Click to reveal solution
Explanation: melt converts wide to long. Columns listed in id.vars are preserved as identifiers; every other column becomes a stacked variable/value pair. The result has rows equal to nrow(input) * (ncol - id_count). This is the input shape expected by ggplot2's facetting and most modelling pipelines that want one observation per row.
Exercise 5.2: Cast a long table back to wide
Task: Take ex_5_1 from the previous problem and pivot it back to a wide table with one column per measurement variable, using dcast. The formula is Species ~ variable. Aggregate with mean since each Species has multiple measurements. Save the wide data.table to ex_5_2.
Expected result:
#> Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#> <fctr> <num> <num> <num> <num>
#> 1: setosa 50.06 34.28 14.62 2.46
#> 2: versicolor 59.36 27.70 42.60 13.26
#> 3: virginica 65.88 29.74 55.52 20.26
Difficulty: Intermediate
Pivoting long back to wide spreads one column's labels out across new columns.
Use dcast(ex_5_1, Species ~ variable, value.var = "value", fun.aggregate = mean).
Click to reveal solution
Explanation: dcast reverses melt. The formula has id columns on the left and the column that holds variable names on the right. When multiple rows share an id-variable pair, fun.aggregate collapses them; omitting it produces an error when duplicates exist. value.var names the column whose values fill the resulting cells.
Exercise 5.3: Climatologist melts with multiple measure groups
Task: A climatologist has the airquality data.table aq and wants two measurement groups side by side: temperature/wind and ozone/solar.R per Month and Day. Use melt with the measure.vars argument as a list of two character vectors. Save the long data.table to ex_5_3.
Expected result:
#> Month Day variable1 value1 variable2 value2
#> <int> <int> <fctr> <num> <fctr> <num>
#> 1: 5 1 Temp 67.0 Ozone 41
#> 2: 5 2 Temp 72.0 Ozone 36
#> ...
Difficulty: Advanced
Two measurement families can be reshaped together so their values land side by side.
Pass measure.vars a list of two vectors, list(c("Temp", "Wind"), c("Ozone", "Solar.R")).
Click to reveal solution
Explanation: Passing measure.vars as a list of vectors creates one melted group per list element, producing paired variable1/value1, variable2/value2 columns. The two groups are aligned by position within each vector. This is the data.table-only shortcut for "melt multiple variable families at once" without running melt twice and joining the results.
Exercise 5.4: Aggregate during dcast
Task: Cast mt to wide with rows by cyl and columns by gear, with each cell containing the count of cars in that combination. Use dcast with fun.aggregate = length and any column as value.var. Save the wide count table to ex_5_4.
Expected result:
#> cyl `3` `4` `5`
#> <num> <int> <int> <int>
#> 1: 4 1 8 2
#> 2: 6 2 4 1
#> 3: 8 12 0 2
Difficulty: Intermediate
Casting with a counting aggregator turns the pivot into a cross-tabulation of group sizes.
Set fun.aggregate = length in the dcast call.
Click to reveal solution
Explanation: Setting fun.aggregate = length turns dcast into a fast cross-tabulation, faster than table() because it runs on the underlying data.table engine. Empty combinations become 0 rather than NA because the missing-cell fill default is 0 when fun.aggregate returns counts. Use fill = NA to override.
Exercise 5.5: Ecologist dcasts multiple value columns
Task: An ecologist has aq and wants a wide table with rows by Month, columns by Day, and two value matrices side by side: mean Temp and mean Wind. Use dcast with value.var = c("Temp", "Wind"). Save the resulting wide data.table to ex_5_5.
Expected result:
#> Month Temp_1 Temp_2 ... Wind_1 Wind_2 ...
#> <int> <num> <num> <num> <num>
#> 1: 5 67 72 7.4 8.0
#> 2: 6 78 74 8.6 9.7
#> ...
Difficulty: Advanced
A cast can lay out more than one measurement series at once, each prefixed by its name.
Pass value.var = c("Temp", "Wind") to dcast.
Click to reveal solution
Explanation: Passing a character vector to value.var produces one set of wide columns per value variable, prefixed with the variable name. This is the easiest way to lay several measurement series next to each other for visual inspection without running dcast repeatedly. na.rm = TRUE is forwarded to mean so missing readings are ignored per cell.
Exercise 5.6: Round-trip melt and dcast as a pivot pipeline
Task: Starting from mt, build a pipeline that melts the numeric columns mpg, hp and weight_klb to long form keyed on car, then immediately casts back with dcast to recover the original wide shape. Verify it round-trips. Save the wide reconstruction to ex_5_6.
Expected result:
#> identical(setcolorder(ex_5_6, c("car", "mpg", "hp", "weight_klb"))[order(car)],
#> mt[, .(car, mpg, hp, weight_klb)][order(car)])
#> [1] TRUE
Difficulty: Intermediate
Reshaping out to long and back to wide should recover the original shape unchanged.
First melt(..., id.vars = "car"), then dcast(long, car ~ variable, value.var = "value").
Click to reveal solution
Explanation: melt then dcast is data.table's pivot pipeline. Round-tripping is a useful sanity check that no information was lost: the recovered table has the same rows and column types as the original. The pattern is also how you reshape via an intermediate long form when neither pure melt nor pure dcast is enough on its own.
Section 6. Performance and idioms (10 problems)
Exercise 6.1: Set a key and lookup by binary search
Task: Set car as the key on mt so lookups become binary searches rather than linear scans. Then retrieve the row for "Honda Civic" using mt["Honda Civic"]. Save the single-row result to ex_6_1.
Expected result:
#> car mpg cyl displacement hp drat weight_klb vs am gear carb kpl low_mpg power_to_weight disp_per_cyl hp_rank grp_id
#> <char> <num> ...
#> 1: Honda Civic 304.0 ...
Difficulty: Intermediate
Marking a sort column turns a lookup from a linear scan into a binary search.
Call setkey(mt, car) before the mt["Honda Civic"] lookup.
Click to reveal solution
Explanation: setkey sorts the table by reference and marks the column as the key, enabling O(log n) lookups via the mt[value] shorthand. Without a key, the same lookup is O(n) and you'd have to write mt[car == "Honda Civic"]. For batch lookups, pass a vector: mt[c("Honda Civic", "Lotus Europa")]. Keys are persistent: once set, they survive subset operations.
Exercise 6.2: setorder mutates in place, order copies
Task: Sort mt by descending hp using setorder which modifies in place, and compare the runtime cost with mt[order(-hp)] which returns a copy, by capturing only the result of the in-place sort. Save the sorted mt to ex_6_2.
Expected result:
#> ex_6_2[1, .(car, hp)]
#> car hp
#> <char> <num>
#> 1: Maserati Bora 335
Difficulty: Intermediate
Sorting in place rearranges the table itself, with no copy and no returned value.
Call setorder(mt, -hp) for the descending sort.
Click to reveal solution
Explanation: setorder reorders the rows of the table by reference, no copy, no key set, no return value. It's faster than mt[order(-hp)] because the latter materialises a copy. The minus sign in front of a column reverses sort direction. For sorting plus key setting in one call, use setkey which always sorts ascending; for sorting without keying, prefer setorder.
Exercise 6.3: SRE adds a secondary index for fast filters
Task: An SRE wants repeated filters on cyl in mt to use an index without changing the table order. Set a secondary index on cyl using setindex and then filter with mt[.(4), on = "cyl"]. Save the four-cylinder rows to ex_6_3.
Expected result:
#> nrow(ex_6_3)
#> [1] 11
Difficulty: Advanced
A secondary index speeds repeated filters on a column without reordering the table.
Call setindex(mt, cyl) before filtering with mt[.(4), on = "cyl"].
Click to reveal solution
Explanation: A secondary index gives you the speed of binary search on a non-key column without reordering the table. data.table also auto-creates indexes when you filter with == on a column repeatedly: this is called "auto-indexing". You can inspect indexes with indices(mt) and remove them with setindex(mt, NULL). Multiple indexes can coexist on different columns.
Exercise 6.4: Use fcase for vectorised multi-branch logic
Task: Replace nested ifelse calls on mt by using fcase to label each car as "low", "mid" or "high" based on whether mpg is below 18, between 18 and 25, or above 25 respectively. Add the label as a column mpg_band. Save the modified mt to ex_6_4.
Expected result:
#> ex_6_4[order(mpg), .(car, mpg, mpg_band)][1:3]
#> car mpg mpg_band
#> <char> <num> <char>
#> 1: Cadillac Fleetwood 10.4 low
#> 2: Lincoln Continental 10.4 low
#> 3: Camaro Z28 13.3 low
Difficulty: Advanced
A vectorised multi-branch conditional checks each condition top to bottom and returns the first match.
Assign fcase(mpg < 18, "low", mpg >= 18 & mpg <= 25, "mid", mpg > 25, "high") to mpg_band.
Click to reveal solution
Explanation: fcase is data.table's vectorised, type-stable multi-branch conditional. It evaluates conditions top to bottom and returns the value for the first match. Unlike dplyr::case_when, it never silently coerces types, raising an error if branches mix incompatible types. Pass default = "other" to handle the catch-all rather than writing an explicit final TRUE row.
Exercise 6.5: fwrite a table to a temporary file and read it back
Task: Write mt out to a temporary CSV file with fwrite, then read it back into a new data.table with fread. Save the round-tripped data.table to ex_6_5. This demonstrates the fast IO functions on the same session.
Expected result:
#> nrow(ex_6_5) == nrow(mt)
#> [1] TRUE
Difficulty: Intermediate
A fast write to disk followed by a fast read should return the same number of rows.
Write with fwrite(mt, tmp) and read it back with fread(tmp).
Click to reveal solution
Explanation: fread and fwrite are typically 10x to 100x faster than base read.csv/write.csv, thanks to a multi-threaded C implementation. fread also auto-detects types, column separators and headers. For very large files, pass nrows to read a subset, select to keep only a few columns, or cmd to pipe through gzip/grep before reading.
Exercise 6.6: setDT converts a data.frame without copying
Task: Start from a base R data.frame copy of mtcars and turn it into a data.table by reference using setDT. Demonstrate that the resulting object inherits data.table class. Save the converted table to ex_6_6.
Expected result:
#> inherits(ex_6_6, "data.table")
#> [1] TRUE
Difficulty: Intermediate
A data.frame can be upgraded to a data.table in place, with no memory copy.
Call setDT(df, keep.rownames = "car").
Click to reveal solution
Explanation: setDT upgrades a data.frame to a data.table in place, with zero copy. as.data.table does the same job but returns a new object, doubling memory transiently. keep.rownames = "car" preserves rownames as a column rather than dropping them, useful for mtcars-like datasets where the rowname is informative.
Exercise 6.7: Code reviewer catches a copy() pitfall
Task: A code reviewer wants to demonstrate that assigning a data.table with <- does NOT copy. Create dt_a <- mt, modify dt_a with :=, then prove the original mt was also modified because they share memory. Use copy() to fix the bug and save the safely independent copy to ex_6_7.
Expected result:
#> ex_6_7[1, mpg] != mt[1, mpg]
#> [1] TRUE
Difficulty: Advanced
Assigning a table to a new name shares memory, so a deep copy is needed for true independence.
Use copy(mt) before modifying the new table with :=.
Click to reveal solution
Explanation: Because := mutates by reference, two variable names pointing at the same data.table will both see the change. copy() performs a deep copy so the new object is independent. This is the single most common source of subtle bugs in data.table code. The general rule: if a function accepts a data.table and mutates it with := or set*, document it loudly or call copy() defensively at the top.
Exercise 6.8: Chained := updates in a single pass
Task: Add three derived columns to mt in a single bracketed call to amortise overhead: mpg_band_n (1, 2, 3 for low/mid/high), weight_t (weight_klb / 1000), and power_idx (hp * 0.001). Use the functional ` :=(...) form. Save the modified mt to ex_6_8`.
Expected result:
#> all(c("mpg_band_n", "weight_t", "power_idx") %in% names(ex_6_8))
#> [1] TRUE
Difficulty: Intermediate
Bundling several column updates into one call amortises the per-call dispatch overhead.
In ` :=(...) , set mpg_band_n = fmatch(mpg_band, c("low", "mid", "high")), weight_t = weight_klb / 1000, power_idx = hp * 0.001`.
Click to reveal solution
Explanation: A single ` :=(...) call is faster than three back-to-back := calls because the data.table dispatcher and column-name lookup happens once. The trade-off is readability: too many simultaneous updates start to obscure intent. Three to five is a common sweet spot. fmatch is data.table's fast match` used here to translate a character band into a numeric index.
Exercise 6.9: Geneticist counts distinct values with uniqueN
Task: A geneticist wants the count of distinct gear values per cyl group in mt. Use uniqueN inside j as a per-group expression. Return a two-column data.table of cyl and n_gears. Save the result to ex_6_9.
Expected result:
#> cyl n_gears
#> <num> <int>
#> 1: 4 3
#> 2: 6 3
#> 3: 8 2
Difficulty: Intermediate
A fast distinct-count gives the number of unique values within each group.
Use uniqueN(gear) as the n_gears expression with by = cyl.
Click to reveal solution
Explanation: uniqueN is data.table's fast distinct-count, equivalent to length(unique(x)) but implemented in C and several times faster. It also accepts a data.table argument, in which case it counts distinct rows across all columns. Pair it with by for "cardinality per group" reports, a frequent metric in audits, churn dashboards and feature-engineering.
Exercise 6.10: setcolorder rearranges columns by reference
Task: Rearrange the columns of mt so that car, cyl and mpg appear first in that order, followed by everything else in their current order. Use setcolorder which operates in place without copying. Save the rearranged mt to ex_6_10.
Expected result:
#> names(ex_6_10)[1:3]
#> [1] "car" "cyl" "mpg"
Difficulty: Beginner
Column order can be changed in place, moving chosen columns to the front of the table.
Call setcolorder(mt, c("car", "cyl", "mpg")).
Click to reveal solution
Explanation: setcolorder moves the listed columns to the front, in the order given, and leaves the rest in their original order. It mutates by reference, so no copy is made even on a very wide table. Use it to put key columns at the start of a printed table for readability, or to align column order across multiple tables before a rbindlist.
What to do next
- Walk through the syntax fundamentals in the dplyr Exercises hub for an alternative grammar.
- Practise wider data-wrangling patterns in Data Wrangling Exercises.
- Drill the
*applyand functional patterns in Apply Family Exercises. - Sharpen statistical modelling with Linear Regression Exercises.
r-statistics.co · Verifiable credential · Public URL
This document certifies mastery of
data.table Mastery
Every certificate has a public verification URL that proves the holder passed the assessment. Anyone with the link can confirm the recipient and date.
1,271 learners have earned this certificate