Text Mining Exercises in R: 20 Real-World Practice Problems

Exercise 1.1: Split a sentence into word tokens with str_split

Task: A junior analyst onboarding to a text mining workflow needs the absolute basics. Take the single string "The quick brown fox jumps over the lazy dog" and split it into individual word tokens on whitespace. Return a character vector (not a list) and save to ex_1_1.

Expected result:

#> [1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"

Difficulty: Beginner

Your turn:

library(tidyverse)  # loads stringr, dplyr, tidyr and tibble, used throughout
ex_1_1 <- # your code here
ex_1_1

Solution:

library(tidyverse)
ex_1_1 <- str_split("The quick brown fox jumps over the lazy dog", " ")[[1]]
ex_1_1
#> [1] "The"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"

Explanation: str_split() returns a list (one element per input string) because the input is vectorised. The [[1]] unwraps the single result into a plain character vector. For a single string with simple whitespace splitting, strsplit(x, " ")[[1]] from base R works too. Prefer str_split_1() when you know the input is a single string: it returns the vector directly with no unwrapping.

Exercise 1.2: Lowercase and strip punctuation in one pass

Task: Before counting words you almost always normalise. Take the noisy headline "Breaking: Apple's Q3 Revenue HITS $89.5B - Stock Soars!" and produce a cleaned token vector: lowercased, punctuation stripped, and split on whitespace. Save to ex_1_2.

Expected result:

[1] "breaking" "apples"   "q3"       "revenue"  "hits"     "895b"     "stock"   
[8] "soars"

Difficulty: Beginner

Your turn:

headline <- "Breaking: Apple's Q3 Revenue HITS $89.5B - Stock Soars!"
ex_1_2 <- # your code here
ex_1_2

Solution:

headline <- "Breaking: Apple's Q3 Revenue HITS $89.5B - Stock Soars!"
ex_1_2 <- headline |>
  str_to_lower() |>
  str_replace_all("[[:punct:]$]", "") |>
  str_split("\\s+") |>
  unlist()
ex_1_2
#> [1] "breaking" "apples"   "q3"       "revenue"  "hits"     "895b"     "stock"
#> [8] "soars"

Explanation: Order matters: strip punctuation before you split. Removing the hyphen from " - " leaves two adjacent spaces, and splitting on \\s+ collapses any run of whitespace into a single delimiter, so no empty token appears. Lowercasing can happen at any point before counting. [[:punct:]] is the POSIX class for punctuation (colons, apostrophes, periods, hyphens, exclamation marks), but stringr's ICU regex engine treats the dollar sign $ as a currency symbol rather than punctuation, so [[:punct:]] alone leaves it in place; that is why we add it to the character class explicitly.
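The two regex points above can be checked in a few lines. This is a sketch with made-up sample strings: base R's strsplit() and gsub() stand in for the stringr calls, and the stringr comparison only runs if the package is installed.

```r
# Splitting on a single space keeps an empty token where two spaces meet;
# splitting on a run of whitespace does not.
messy <- "stock  soars"                       # note the double space
single_space <- strsplit(messy, " ")[[1]]
any_run      <- strsplit(messy, "\\s+")[[1]]
print(single_space)
print(any_run)

# Base R's regex engine counts "$" as punctuation and removes it.
base_clean <- gsub("[[:punct:]]", "", "$89.5b")
print(base_clean)

# stringr uses the ICU engine; compare its result for the same input.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_replace_all("$89.5b", "[[:punct:]]", ""))
}
```

If the two engines disagree on your machine, that difference is the reason the solution spells out $ inside the character class.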

Exercise 1.3: Tokenise a paragraph into a tidy one-token-per-row tibble

Task: A content analyst preparing a per-word audit needs each token in its own row to enable downstream group_by() aggregation. Take the inline paragraph below, split into words, and produce a tibble with columns doc_id (always 1) and word. Save to ex_1_3.

Expected result:

# A tibble: 15 × 2
   doc_id word    
    <dbl> <chr>   
 1      1 to      
 2      1 be      
 3      1 or      
 4      1 not     
 5      1 to      
 6      1 be      
 7      1 that    
 8      1 is      
 9      1 the     
10      1 question
11      1 whether 
12      1 tis     
13      1 nobler  
14      1 in      
15      1 mind

Difficulty: Intermediate

Your turn:

para <- "To be or not to be that is the question whether tis nobler in mind"
ex_1_3 <- # your code here
ex_1_3

Solution:

para <- "To be or not to be that is the question whether tis nobler in mind"
ex_1_3 <- tibble(doc_id = 1, text = para) |>
  mutate(word = str_split(str_to_lower(text), "\\s+")) |>  # list-column of tokens
  select(doc_id, word) |>
  unnest(word)                                             # one row per token
ex_1_3
#> # A tibble: 15 x 2
#>   doc_id word
#>    <dbl> <chr>
#> 1      1 to
#> 2      1 be
#> # 13 more rows

Explanation: Building list-columns then unnest()-ing is the idiomatic tidyverse pattern for one-row-per-token tables. The lowercase happens before splitting so casing variants collapse. The reason for the doc_id column is forward compatibility: when you scale from one document to many, every downstream join (sentiment lexicons, tf-idf) keys on doc_id, so it pays to introduce it on day one.

Exercise 2.1: Top 5 most common words in a small corpus

Task: A blogger wants a quick frequency audit of their latest post excerpt. Tokenise the inline paragraph below, count word occurrences, and return the top 5 words sorted by descending count. Save the resulting tibble (columns word, n) to ex_2_1.

Expected result:

# A tibble: 5 × 2
  word      n
  <chr> <int>
1 the       4
2 a         3
3 to        3
4 and       2
5 code      2

Difficulty: Beginner

Your turn:

post <- "The fastest way to learn a language is to write a lot of code and read a lot of code from others. The library and the docs are friends to the developer."
ex_2_1 <- # your code here
ex_2_1

Solution:

post <- "The fastest way to learn a language is to write a lot of code and read a lot of code from others. The library and the docs are friends to the developer."
ex_2_1 <- tibble(text = post) |>
  mutate(word = str_split(str_to_lower(str_replace_all(text, "[[:punct:]]", "")), "\\s+")) |>
  unnest(word) |>
  count(word, sort = TRUE) |>
  slice_head(n = 5)
ex_2_1
#> # A tibble: 5 x 2
#>   word      n
#>   <chr> <int>
#> 1 the       4
#> 2 a         3
#> 3 to        3
#> 4 and       2
#> 5 code      2

Explanation: count(word, sort = TRUE) is dplyr's one-shot frequency table; it groups, tallies, and orders descending in a single call, and tied words keep count()'s alphabetical order (which is why a comes before to). slice_head(n = 5) gives the same rows as head(5) on an ungrouped tibble; it is preferred in dplyr pipelines because it respects groups (the first n rows per group) and shares an interface with the other slice_*() verbs. Stop-words like "the" and "to" dominate the top of any raw frequency table, which motivates the stop-word filter you will write in Section 4.

Exercise 2.2: Find hapax legomena (words appearing exactly once)

Task: A linguist auditing vocabulary diversity needs the count of hapax legomena: words that appear exactly once in the corpus. Using the same post paragraph from Exercise 2.1, return a tibble of those words sorted alphabetically. Save to ex_2_2.

Expected result:

# A tibble: 14 × 2
   word          n
   <chr>     <int>
 1 are           1
 2 developer     1
 3 docs          1
 4 fastest       1
 5 friends       1
 6 from          1
 7 is            1
 8 language      1
 9 learn         1
10 library       1
11 others        1
12 read          1
13 way           1
14 write         1

Difficulty: Intermediate

Your turn:

ex_2_2 <- # your code here
ex_2_2

Solution:

ex_2_2 <- tibble(text = post) |>
  mutate(word = str_split(str_to_lower(str_replace_all(text, "[[:punct:]]", "")), "\\s+")) |>
  unnest(word) |>
  count(word) |>
  filter(n == 1) |>
  arrange(word)
ex_2_2
#> # A tibble: 14 x 2
#>   word          n
#>   <chr>     <int>
#> 1 are           1
#> 2 developer     1
#> # 12 more rows

Explanation: Hapax legomena ratios (hapax / total tokens) are a classic lexical-diversity metric used in stylometry and authorship attribution. A high hapax ratio suggests rich vocabulary; a low one signals repetitive writing. The pipeline is identical to a top-N count until the final two steps: filter on n == 1 instead of slice_head(), then sort alphabetically for a stable presentation order.
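The hapax ratio mentioned above is a one-liner once the tokens exist. A minimal base-R sketch on the same post paragraph (table() stands in for count(); the variable names are illustrative):

```r
post <- "The fastest way to learn a language is to write a lot of code and read a lot of code from others. The library and the docs are friends to the developer."

# Same normalisation as the exercise: strip punctuation, lowercase, split.
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", post)), "\\s+")[[1]]

freq  <- table(tokens)                 # frequency of each distinct word
hapax <- names(freq)[freq == 1]        # words that occur exactly once

hapax_ratio <- length(hapax) / length(tokens)
hapax_ratio
#> [1] 0.4375
```

Here 14 of the 32 tokens are hapax legomena, hence 14 / 32 = 0.4375.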

Exercise 2.3: Compute the type-token ratio for vocabulary richness

Task: A stylometrics researcher wants a single-number summary of vocabulary richness called the type-token ratio (TTR): unique words divided by total words, ranging 0 (one word repeated) to 1 (every word unique). Compute it for the post paragraph from Exercise 2.1 and save the numeric scalar to ex_2_3.

Expected result:

[1] 0.65625

Difficulty: Intermediate

Your turn:

ex_2_3 <- # your code here
ex_2_3

Solution:

tokens <- post |>
  str_to_lower() |>
  str_replace_all("[[:punct:]]", "") |>
  str_split("\\s+") |>
  unlist()
ex_2_3 <- n_distinct(tokens) / length(tokens)  # 21 unique words / 32 tokens
ex_2_3
#> [1] 0.65625

Explanation: TTR is sensitive to document length: longer documents trend toward lower TTR because common words inevitably repeat. For length-comparable corpora, TTR is fine; for mixed lengths, switch to root-TTR (unique / sqrt(total)) or moving-average TTR over fixed windows. n_distinct() is dplyr's vectorised wrapper around length(unique(x)) and behaves correctly on factors and NAs.
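The two length-robust variants named above can be sketched in base R. The mattr() helper and its default window of 10 tokens are assumptions for illustration, not a standard API.

```r
post <- "The fastest way to learn a language is to write a lot of code and read a lot of code from others. The library and the docs are friends to the developer."
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", post)), "\\s+")[[1]]

ttr      <- length(unique(tokens)) / length(tokens)        # plain TTR
root_ttr <- length(unique(tokens)) / sqrt(length(tokens))  # root-TTR

# Hypothetical helper: moving-average TTR, the mean TTR over every
# window of `window` consecutive tokens.
mattr <- function(tokens, window = 10) {
  n <- length(tokens)
  if (n < window) return(length(unique(tokens)) / n)  # too short: plain TTR
  starts <- seq_len(n - window + 1)
  mean(vapply(starts, function(i) {
    length(unique(tokens[i:(i + window - 1)])) / window
  }, numeric(1)))
}

c(ttr = ttr, root_ttr = root_ttr, mattr = mattr(tokens))
```

Because every window has the same size, the moving-average score can be compared across documents of different lengths, which plain TTR cannot.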

Exercise 3.1: Extract all hashtags from a tweet stream

Task: A social media analyst sweeping a brand-mention stream needs all hashtags from a small tweet vector. A hashtag starts with # followed by one or more alphanumeric characters. Extract every hashtag (lowercased, deduplicated, sorted) and save the character vector to ex_3_1.

Expected result:

[1] "#aiethics"   "#dataviz"    "#nlp"        "#r4ds"       "#rstats"    
[6] "#textmining"

Difficulty: Intermediate

Your turn:

tweets <- c(
  "Loving #rstats today! Just shipped a new #dataviz dashboard #r4ds",
  "Read this thread on #textmining and #nlp by @hadleywickham",
  "Hot take: #rstats community is the friendliest #r4ds gang. Also #AIethics matters."
)
ex_3_1 <- # your code here
ex_3_1

Solution:

tweets <- c(
  "Loving #rstats today! Just shipped a new #dataviz dashboard #r4ds",
  "Read this thread on #textmining and #nlp by @hadleywickham",
  "Hot take: #rstats community is the friendliest #r4ds gang. Also #AIethics matters."
)
ex_3_1 <- tweets |>
  str_extract_all("#[[:alnum:]]+") |>
  unlist() |>
  str_to_lower() |>
  unique() |>
  sort()
ex_3_1
#> [1] "#aiethics"   "#dataviz"    "#nlp"        "#r4ds"       "#rstats"
#> [6] "#textmining"

Explanation: str_extract_all() returns a list (one vector per input string) because each tweet can hold multiple hashtags. unlist() flattens, then the standard normalise-dedupe-sort trio finalises the output. Using [[:alnum:]]+ is intentional rather than \\w+: \\w includes underscore in most regex flavours and you typically want to break hashtags on _ in social contexts.
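To see what the [[:alnum:]] versus \\w choice changes, here is a small base-R sketch on a made-up tweet. The extract_all() helper is hypothetical and simply wraps regmatches() and gregexpr().

```r
tweet <- "Join the #data_science meetup and tag #rstats"  # made-up example

# Hypothetical helper: all matches of `pattern` in a single string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

alnum_tags <- extract_all(tweet, "#[[:alnum:]]+")  # stops at the underscore
word_tags  <- extract_all(tweet, "#\\w+")          # underscore is a word character

alnum_tags
#> [1] "#data"   "#rstats"
word_tags
#> [1] "#data_science" "#rstats"
```

Which behaviour is right depends on how your source platform defines a hashtag, so it is worth deciding explicitly rather than by accident.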

Exercise 3.2: Extract @mentions and count how many times each handle appears

Task: The same social analyst now wants a per-handle mention count from a second small tweet vector, tweets2. Handles start with @ followed by alphanumerics or underscores. Extract them, lowercase, and return a tibble (columns handle, n) sorted by descending count. Save to ex_3_2.

Expected result:

#> # A tibble: 2 x 2
#>   handle             n
#>   <chr>          <int>
#> 1 @hadleywickham     1
#> 2 @rstudio           1

Difficulty: Intermediate

Your turn:

tweets2 <- c(
  "Big thanks to @hadleywickham for the tidyverse pipeline",
  "Also @rstudio cloud is great for teaching"
)
ex_3_2 <- # your code here
ex_3_2

Solution:

tweets2 <- c(
  "Big thanks to @hadleywickham for the tidyverse pipeline",
  "Also @rstudio cloud is great for teaching"
)
ex_3_2 <- tibble(tweet = tweets2) |>
  mutate(handle = str_extract_all(tweet, "@[\\w]+")) |>
  unnest(handle) |>
  mutate(handle = str_to_lower(handle)) |>
  count(handle, sort = TRUE)
ex_3_2
#> # A tibble: 2 x 2
#>   handle             n
#>   <chr>          <int>
#> 1 @hadleywickham     1
#> 2 @rstudio           1

Explanation: Here \\w is the right choice: Twitter/X handles allow underscore, unlike hashtags where users seldom embed underscores. Combining str_extract_all() with unnest() is the cleanest way to convert a variable-length-per-row extraction into a tidy long table where every row is one mention. Aggregating with count() then produces the per-handle tally directly.

Exercise 3.3: Pull every valid email address from a customer support log

Task: A support engineer reviewing CSV-imported tickets needs every email address mentioned in the free-text comment field. An email is a token of the form local@domain.tld where local and domain accept letters, digits, dots, plus, and hyphens, and the TLD is 2-6 letters. Save the unique sorted address vector to ex_3_3.

Expected result:

[1] "alex@firm.io"      "billing@acme.com"  "ops@acme.com"     
[4] "sue@partner.co.uk"

Difficulty: Intermediate

Your turn:

log <- c(
  "Customer alex@firm.io reported the bug; cc billing@acme.com",
  "Reply from ops@acme.com (resolved). Forward to sue@partner.co.uk",
  "Duplicate of alex@firm.io ticket"
)
ex_3_3 <- # your code here
ex_3_3

Solution:

log <- c(
  "Customer alex@firm.io reported the bug; cc billing@acme.com",
  "Reply from ops@acme.com (resolved). Forward to sue@partner.co.uk",
  "Duplicate of alex@firm.io ticket"
)
email_pat <- "[A-Za-z0-9._+\\-]+@[A-Za-z0-9.\\-]+\\.[A-Za-z]{2,6}"
ex_3_3 <- log |>
  str_extract_all(email_pat) |>
  unlist() |>
  unique() |>
  sort()
ex_3_3
#> [1] "alex@firm.io"      "billing@acme.com"  "ops@acme.com"
#> [4] "sue@partner.co.uk"

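Explanation: A production-grade email regex is famously long (patterns that try to follow RFC 5322 run to hundreds of characters); this pragmatic pattern catches the vast majority of everyday addresses with much less risk of catastrophic backtracking. Matches cannot run into neighbouring words because whitespace is in none of the character classes, and a sentence-ending period is left out because the final dot must be followed by letters. The {2,6} bound encodes the task's TLD rule, but it is not anchored: a longer TLD such as .technology would be silently truncated to its first six letters, so add a trailing \\b if that matters. For sue@partner.co.uk, the regex captures the full partner.co.uk because dots are allowed in the domain.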
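The behaviour of the TLD bound is easy to probe. The sketch below uses base R with perl = TRUE; the sample sentences and the extract_all() helper are made up for illustration.

```r
email_pat <- "[A-Za-z0-9._+\\-]+@[A-Za-z0-9.\\-]+\\.[A-Za-z]{2,6}"

# Hypothetical helper: all matches of `pattern` in a single string.
extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
}

# A sentence-ending period is not swallowed: the last dot needs letters after it.
plain <- extract_all("Write to sue@partner.co.uk today.", email_pat)

# A TLD longer than six letters is cut off at six.
truncated <- extract_all("Contact jo@shop.technology now", email_pat)

# Requiring a word boundary after the TLD rejects the address instead.
bounded <- extract_all("Contact jo@shop.technology now", paste0(email_pat, "\\b"))

plain
#> [1] "sue@partner.co.uk"
truncated
#> [1] "jo@shop.techno"
bounded
#> character(0)
```

Whether truncating or rejecting is preferable depends on the pipeline; a wider upper bound is the other obvious option if long TLDs are expected.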

Exercise 3.4: Extract dollar amounts and convert to numeric

Task: A finance team reading scraped press releases wants every dollar amount as a numeric value, preserving the order in which they appeared. Match patterns like $89.5B, $1,250, or $0.99. Strip the $ and any commas, then convert to numeric. Save the numeric vector to ex_3_4.

Expected result:

#> [1] 89.50  1250.00     0.99    25.00

Difficulty: Advanced

Your turn:

press <- c(
  "Q3 revenue hit $89.5B, beating consensus.",
  "Operating costs rose to $1,250 per unit; promotions averaged $0.99.",
  "CEO compensation: $25 million."
)
ex_3_4 <- # your code here
ex_3_4

Solution:

press <- c(
  "Q3 revenue hit $89.5B, beating consensus.",
  "Operating costs rose to $1,250 per unit; promotions averaged $0.99.",
  "CEO compensation: $25 million."
)
ex_3_4 <- press |>
  str_extract_all("\\$[0-9,]+(?:\\.[0-9]+)?") |>
  unlist() |>
  str_remove_all("[\\$,]") |>
  as.numeric()
ex_3_4
#> [1]   89.50 1250.00    0.99   25.00

Explanation: The pattern captures $ then a run of digits-and-commas, optionally followed by a decimal portion. The non-capturing group (?:...) keeps the match as a single token (capturing groups would return only the inside on some functions). Note the limitation: magnitude markers are lost, so $89.5B becomes 89.5 rather than 89,500,000,000 and $25 million becomes 25. For accurate magnitude handling you would post-process with a unit-suffix detector. This pipeline is a good starting point, not a complete solution for accounting-grade extraction.
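One way a unit-suffix step could look, as a base-R sketch. The parse_dollars() helper, its suffix table, and the choice of recognised suffixes (K, M, B, thousand, million, billion) are all assumptions for illustration, not part of the exercise.

```r
press <- c(
  "Q3 revenue hit $89.5B, beating consensus.",
  "Operating costs rose to $1,250 per unit; promotions averaged $0.99.",
  "CEO compensation: $25 million."
)

# Hypothetical helper: dollar amounts scaled by an optional magnitude suffix.
parse_dollars <- function(x) {
  pattern <- "\\$[0-9,]+(?:\\.[0-9]+)?(?:\\s?(?:thousand|million|billion)\\b|[KMB]\\b)?"
  hits <- unlist(regmatches(x, gregexpr(pattern, x, perl = TRUE)))

  amount <- as.numeric(gsub("[^0-9.]", "", hits))   # keep digits and the decimal point
  suffix <- tolower(gsub("[^A-Za-z]", "", hits))    # keep only the letters, if any

  multiplier <- c(k = 1e3, thousand = 1e3, m = 1e6, million = 1e6,
                  b = 1e9, billion = 1e9)
  scale <- ifelse(suffix == "", 1, multiplier[suffix])
  unname(amount * scale)
}

parse_dollars(press)
```

With the sample press vector this returns 89.5e9, 1250, 0.99 and 25e6, in order of appearance. Note that the suffix match is case-sensitive as written, so "$5 Million" would be read as 5.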

Exercise 4.1: Filter stop-words from a token stream

Task: Before analysing topic words you need to drop function words like "the", "and", "is". Given the token-level tibble for the post paragraph (re-tokenise it), remove the supplied stop-words vector and return the remaining tokens with their counts, sorted by descending count. Save to ex_4_1.

Expected result:

# A tibble: 5 × 2
  word      n
  <chr> <int>
1 code      2
2 lot       2
3 docs      1
4 read      1
5 write     1

Difficulty: Intermediate

Your turn:

stop_words_mini <- c("the","to","a","and","of","is","are","from","others","fastest","way","language","library","developer","learn","friends")
ex_4_1 <- # your code here
ex_4_1

Solution:

stop_words_mini <- c("the","to","a","and","of","is","are","from","others","fastest","way","language","library","developer","learn","friends")
ex_4_1 <- tibble(text = post) |>
  mutate(word = str_split(str_to_lower(str_replace_all(text, "[[:punct:]]", "")), "\\s+")) |>
  unnest(word) |>
  filter(!word %in% stop_words_mini) |>
  count(word, sort = TRUE)
ex_4_1
#> # A tibble: 5 x 2
#>   word      n
#>   <chr> <int>
#> 1 code      2
#> 2 lot       2
#> 3 docs      1
#> 4 read      1
#> 5 write     1

Explanation: The anti-join idiom !word %in% stop_list is cleaner than anti_join() when the stop list is a plain vector; switch to anti_join(stop_tibble, by = "word") once your stop list comes from a tibble with metadata (lexicon source, language). Production pipelines often layer multiple stop lists: a generic English list (SMART, snowball), a domain-specific list (legal jargon, twitter handles), and a project-specific blocklist of names or boilerplate.

Exercise 4.2: Strip common English suffixes for a poor-man's stemmer

Task: A search-prototype engineer wants a quick-and-dirty stemmer that collapses simple plurals, gerunds, and -ness nouns without pulling in a Porter stemmer dependency. Given the word vector below, strip trailing s, es, ed, ing, or ness (whichever is longest) and return the stems in input order. Save to ex_4_2.

Expected result:

[1] "runn"  "jump"  "fish"  "watch" "data"  "tabl"  "happi"

Difficulty: Advanced

Your turn:

words <- c("running","jumped","fishes","watches","data","tables","happiness")
ex_4_2 <- # your code here
ex_4_2

Solution:

words <- c("running","jumped","fishes","watches","data","tables","happiness")
ex_4_2 <- str_replace(words, "(ing|ed|es|ness|s)$", "")
ex_4_2
#> [1] "runn"  "jump"  "fish"  "watch" "data"  "tabl"  "happi"

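Explanation: Because the pattern is anchored with $, the regex engine scans start positions from left to right and stops at the first one from which some alternative reaches the end of the string, so the longest matching suffix wins: tables loses es rather than just s, and happiness loses ness. Alternation order only decides the outcome when two alternatives can match from the same start position; listing longer suffixes first is still a sensible habit. This toy stemmer is intentionally crude: it leaves running as runn (no consonant undoubling), tables as tabl, and happiness as happi. Real stemmers (Porter, Snowball, Lancaster) encode dozens of rules; the SnowballC package is the standard wrapper when accuracy matters.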
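It is worth checking when alternation order actually changes the result. A base-R sketch (sub() with perl = TRUE, which tries alternatives in the order written; the "ingest" prefix example is made up):

```r
words <- c("running", "jumped", "fishes", "watches", "data", "tables", "happiness")

# Suffix stripping with an end anchor, alternatives listed in two different orders.
longest_first  <- sub("(ness|ing|ed|es|s)$", "", words, perl = TRUE)
shortest_first <- sub("(s|ed|es|ing|ness)$", "", words, perl = TRUE)
identical(longest_first, shortest_first)
#> [1] TRUE

# With a start anchor both alternatives begin at the same position,
# so the first one that matches wins and order does matter.
short_prefix <- sub("^(in|ing)", "", "ingest", perl = TRUE)
long_prefix  <- sub("^(ing|in)", "", "ingest", perl = TRUE)
c(short_prefix, long_prefix)
#> [1] "gest" "est"
```

With the end anchor, both orderings give the same stems because the engine takes the leftmost position from which some alternative reaches the end of the string, and that position belongs to the longest suffix.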

Exercise 5.1: Build all bigrams from a sentence

Task: An NLP intern needs the foundation for an n-gram language model: every consecutive word pair from a single sentence. Given the sentence below (already lowercase, punctuation-stripped), produce a character vector where each element is "word1 word2". Save to ex_5_1.

Expected result:

#> [1] "the fastest"     "fastest way"     "way to"          "to learn"        "learn a"
#> [6] "a language"      "language is"     "is to"           "to write"        "write a"
#> [11] "a lot"          "lot of"          "of code"

Difficulty: Intermediate

Your turn:

sent <- "the fastest way to learn a language is to write a lot of code"
ex_5_1 <- # your code here
ex_5_1

Solution:

sent <- "the fastest way to learn a language is to write a lot of code"
toks <- str_split(sent, "\\s+")[[1]]
ex_5_1 <- paste(toks[-length(toks)], toks[-1])
ex_5_1
#>  [1] "the fastest" "fastest way" "way to"      "to learn"    "learn a"
#>  [6] "a language"  "language is" "is to"       "to write"    "write a"
#> [11] "a lot"       "lot of"      "of code"

Explanation: The paste(toks[-length(toks)], toks[-1]) trick zips a vector with a shifted copy of itself, yielding pairs without an explicit loop. For trigrams, with n <- length(toks): paste(toks[1:(n-2)], toks[2:(n-1)], toks[3:n]). The tidytext alternative is unnest_tokens(token = "ngrams", n = 2), which is more readable inside a pipeline but adds a dependency. For ad-hoc analysis or unit tests, the base-R shift trick is preferable.
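The shift trick can be wrapped into a single function for any n. This is a sketch in base R; make_ngrams() is a hypothetical helper name, not a function from stringr or tidytext.

```r
# Hypothetical helper: all n-grams of a token vector, in order of appearance.
make_ngrams <- function(toks, n = 2) {
  count <- length(toks) - n + 1
  if (count < 1) return(character(0))   # fewer tokens than n: no n-grams
  vapply(seq_len(count), function(i) {
    paste(toks[i:(i + n - 1)], collapse = " ")
  }, character(1))
}

sent <- "the fastest way to learn a language is to write a lot of code"
toks <- strsplit(sent, "\\s+")[[1]]

bigrams  <- make_ngrams(toks, 2)
trigrams <- make_ngrams(toks, 3)
head(trigrams, 3)
#> [1] "the fastest way" "fastest way to"  "way to learn"
```

The explicit length guard matters for short inputs: with fewer than three tokens, an index such as 1:(n-2) counts backwards in R instead of producing an empty range.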

Exercise 5.2: Top bigrams after stop-word filtering

Task: A content strategist wants the most informative bigrams from a small corpus, but raw bigram counts are dominated by stop-word pairs like "of the" and "in the". Build all bigrams from the inline corpus vector, drop any bigram where either word is a stop word, and return the top 5 by count. Save the tibble (columns bigram, n) to ex_5_2.

Expected result:

#> # A tibble: 5 x 2
#>   bigram             n
#>   <chr>          <int>
#> 1 climate change     3
#> 2 carbon emissions   2
#> 3 global warming     2
#> 4 ice caps           1
#> 5 sea level          1

Difficulty: Advanced

Your turn:

corpus <- c(
  "climate change is accelerating and global warming is real",
  "carbon emissions drive climate change",
  "global warming melts ice caps and raises sea level",
  "policy on carbon emissions matters for climate change"
)
sw <- c("is","and","on","for","the","of","a","to","drive","matters","real","accelerating","melts","raises","policy")
ex_5_2 <- # your code here
ex_5_2

Solution:

corpus <- c(
  "climate change is accelerating and global warming is real",
  "carbon emissions drive climate change",
  "global warming melts ice caps and raises sea level",
  "policy on carbon emissions matters for climate change"
)
sw <- c("is","and","on","for","the","of","a","to","drive","matters","real","accelerating","melts","raises","policy")
bigrams <- tibble(doc = seq_along(corpus), text = corpus) |>
  mutate(toks = str_split(text, "\\s+")) |>
  rowwise() |>                                   # build bigrams one document at a time
  mutate(bg = list(paste(toks[-length(toks)], toks[-1]))) |>
  ungroup() |>
  select(doc, bg) |>
  unnest(bg) |>
  separate(bg, into = c("w1", "w2"), sep = " ", remove = FALSE) |>
  filter(!w1 %in% sw, !w2 %in% sw)               # drop bigrams containing a stop word
ex_5_2 <- bigrams |>
  count(bigram = bg, sort = TRUE) |>
  slice_head(n = 5)
ex_5_2
#> # A tibble: 5 x 2
#>   bigram               n
#>   <chr>            <int>
#> 1 climate change       3
#> 2 carbon emissions     2
#> 3 global warming       2
#> 4 ice caps             1
#> 5 sea level            1

Explanation: rowwise() makes the shift trick safe to run per document (without it, the unlisted vector would mix documents and produce phantom bigrams across the boundary). separate() splits each bigram for the stop-word filter, but the original bg column is kept via remove = FALSE so the final tibble retains the readable "climate change" form. This pattern (build n-grams per document, separate to filter, then count) generalises to any n-gram filtering task.
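The phantom-bigram problem is easy to reproduce. A base-R sketch with two short made-up documents; bigrams_of() is a hypothetical helper wrapping the shift trick.

```r
docs <- c("carbon emissions drive climate change",
          "global warming melts ice caps")

# Hypothetical helper: the shift trick from Exercise 5.1.
bigrams_of <- function(toks) paste(toks[-length(toks)], toks[-1])

# Correct: build bigrams inside each document, then combine.
per_doc <- unlist(lapply(strsplit(docs, "\\s+"), bigrams_of))

# Incorrect: pool all tokens first, then build bigrams.
pooled <- bigrams_of(unlist(strsplit(docs, "\\s+")))

setdiff(pooled, per_doc)   # the bigram that straddles the document boundary
#> [1] "change global"
```

The pooled version invents one extra bigram per document boundary, which is exactly what the rowwise() step prevents.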

Exercise 5.3: Trigram collocations from a longer paragraph

Task: A discourse analyst wants three-word phrases that recur across an inaugural-style speech excerpt. Build trigrams from the inline paragraph (case-normalised, punctuation stripped, no stop-word filter this time), then return any trigram appearing more than once with its count. Save to ex_5_3.

Expected result:

# A tibble: 3 × 2
  trigram            n
  <chr>          <int>
1 the people of      3
2 people of this     2
3 we the people      2

Difficulty: Advanced

Your turn:

speech <- "we the people of this nation stand united we the people of this republic shall not waver the people of every state are bound together"
ex_5_3 <- # your code here
ex_5_3

Solution:

speech <- "we the people of this nation stand united we the people of this republic shall not waver the people of every state are bound together"
toks <- str_split(speech, "\\s+")[[1]]
n <- length(toks)
trigrams <- paste(toks[1:(n-2)], toks[2:(n-1)], toks[3:n])
ex_5_3 <- tibble(trigram = trigrams) |>
  count(trigram, sort = TRUE) |>
  filter(n > 1)
ex_5_3
#> # A tibble: 3 x 2
#>   trigram            n
#>   <chr>          <int>
#> 1 the people of      3
#> 2 people of this     2
#> 3 we the people      2

Explanation: Trigrams capture phrase-level signals that bigrams miss: we the people is a recognised political phrase whereas we the and the people separately are unremarkable. The triple-shift idiom (toks[1:(n-2)], toks[2:(n-1)], toks[3:n]) generalises straightforwardly: for 4-grams add a fourth shifted vector to the paste(). For real collocation discovery (statistically significant phrases, not just frequent ones), use pointwise mutual information (PMI) or the quanteda::textstat_collocations() helper.
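For a feel of what PMI measures, here is a minimal base-R sketch for bigrams on the same speech, using PMI(w1, w2) = log2(p(w1 w2) / (p(w1) p(w2))) with probabilities estimated from raw relative frequencies. The relative_freq() helper is hypothetical, and no smoothing or minimum-count filter is applied.

```r
speech <- "we the people of this nation stand united we the people of this republic shall not waver the people of every state are bound together"
toks    <- strsplit(speech, "\\s+")[[1]]
bigrams <- paste(toks[-length(toks)], toks[-1])

# Hypothetical helper: named vector of relative frequencies.
relative_freq <- function(x) {
  tab <- table(x)
  setNames(as.vector(tab) / length(x), names(tab))
}

p_word   <- relative_freq(toks)
p_bigram <- relative_freq(bigrams)

# PMI for every distinct bigram.
pmi <- vapply(names(p_bigram), function(bg) {
  w <- strsplit(bg, " ")[[1]]
  log2(p_bigram[[bg]] / (p_word[[w[1]]] * p_word[[w[2]]]))
}, numeric(1))

head(sort(pmi, decreasing = TRUE))
```

On a corpus this small, pairs of words that each occur once score highest, which illustrates PMI's well-known bias toward rare events; in practice a minimum frequency threshold is usually applied before ranking.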

Exercise 6.1: Per-document term frequency for a 3-document corpus

Task: A search-relevance engineer needs term-frequency vectors for three product descriptions. Build a long tibble with columns doc_id, word, and tf (count of the word in that document divided by total words in that document). Save to ex_6_1.

Expected result:

# A tibble: 11 × 3
   doc_id word       tf
    <int> <chr>   <dbl>
 1      1 cheap   0.25 
 2      1 fast    0.25 
 3      1 light   0.25 
 4      1 phone   0.25 
 5      2 fast    0.333
 6      2 phone   0.333
 7      2 premium 0.333
 8      3 bulky   0.25 
 9      3 cheap   0.25 
10      3 phone   0.25 
11      3 plastic 0.25

Difficulty: Intermediate

Your turn:

docs <- c(
  "fast light cheap phone",
  "premium phone fast",
  "cheap plastic phone bulky"
)
ex_6_1 <- # your code here
ex_6_1

Solution:

docs <- c(
  "fast light cheap phone",
  "premium phone fast",
  "cheap plastic phone bulky"
)
ex_6_1 <- tibble(doc_id = seq_along(docs), text = docs) |>
  mutate(word = str_split(text, "\\s+")) |>
  unnest(word) |>
  count(doc_id, word, name = "freq") |>   # raw count per document and word
  group_by(doc_id) |>
  mutate(tf = freq / sum(freq)) |>        # normalise by document length
  ungroup() |>
  select(doc_id, word, tf)
ex_6_1
#> # A tibble: 11 x 3
#>   doc_id word     tf
#>    <int> <chr> <dbl>
#> 1      1 cheap  0.25
#> 2      1 fast   0.25
#> # 9 more rows

Explanation: The two-step count() |> group_by(doc_id) |> mutate(tf = freq/sum(freq)) pattern is the standard tidy formulation of term frequency: count first to get raw occurrence counts, then normalise by document length. Normalising matters because a longer document mentions every term more often simply because it has more words; raw counts would falsely flag long documents as topically heavy on every term. tidytext's bind_tf_idf() computes this internally before applying IDF.

Exercise 6.2: Compute inverse document frequency for every term

Task: Continuing from the same docs vector, compute IDF for each unique term using the classic formula log(N / df) where N is the document count and df is the number of documents containing the term. Drop any term whose IDF is zero (here, phone, which appears in every document). Return a tibble of unique word and idf, sorted by descending IDF. Save to ex_6_2.

Expected result:

# A tibble: 6 × 2
  word      idf
  <chr>   <dbl>
1 bulky   1.10 
2 light   1.10 
3 plastic 1.10 
4 premium 1.10 
5 cheap   0.405
6 fast    0.405

Difficulty: Advanced

Your turn:

ex_6_2 <- # your code here
ex_6_2

Solution:

N <- length(docs)
ex_6_2 <- tibble(doc_id = seq_along(docs), text = docs) |>
  mutate(word = str_split(text, "\\s+")) |>
  unnest(word) |>
  distinct(doc_id, word) |>        # one row per (document, term) pair
  count(word, name = "df") |>      # document frequency
  mutate(idf = log(N / df)) |>
  filter(word != "phone") |>       # zero-IDF term, dropped for clarity
  select(word, idf) |>
  arrange(desc(idf))
ex_6_2
#> # A tibble: 6 x 2
#>   word      idf
#>   <chr>   <dbl>
#> 1 bulky   1.10
#> 2 light   1.10
#> 3 plastic 1.10
#> 4 premium 1.10
#> 5 cheap   0.405
#> 6 fast    0.405

Explanation: Note distinct(doc_id, word) before count(word): that turns a raw frequency count into a document-frequency count (how many docs contain the term, regardless of how many times). The word phone appears in all 3 documents so its IDF is log(3/3) = 0, which means it has zero discriminative value and is filtered out here for clarity. In production, you would keep the zero-IDF rows so downstream joins do not introduce NAs. Different libraries use different smoothings: log(N / (1 + df)), log((1+N) / (1+df)) + 1, etc.
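The smoothing variants listed above behave quite differently on small corpora. A base-R sketch using document frequencies taken from the three product descriptions (the variable names are illustrative):

```r
N  <- 3
df <- c(phone = 3, fast = 2, cheap = 2, light = 1)   # documents containing each term

idf_classic  <- log(N / df)                    # formula used in this exercise
idf_plus_one <- log(N / (1 + df))              # add one to the denominator
idf_smooth   <- log((1 + N) / (1 + df)) + 1    # add one to both, then shift by 1

round(cbind(idf_classic, idf_plus_one, idf_smooth), 3)
#>       idf_classic idf_plus_one idf_smooth
#> phone       0.000       -0.288      1.000
#> fast        0.405        0.000      1.288
#> cheap       0.405        0.000      1.288
#> light       1.099        0.405      1.693
```

Note that log(N / (1 + df)) goes negative for a term present in every document, while the last variant never drops below 1, so scores from different libraries are not directly comparable.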

Exercise 6.3: Top TF-IDF term per document for content tagging

Task: A content-tagging pipeline wants exactly one tag per document: the term with the highest TF-IDF score. Join the TF table (from Exercise 6.1) with the IDF table (from Exercise 6.2), compute tf_idf = tf * idf, then for each doc_id return the single row with the maximum score. Save to ex_6_3.

Expected result:

# A tibble: 3 × 4
  doc_id word       tf tf_idf
   <int> <chr>   <dbl>  <dbl>
1      1 light   0.25   0.275
2      2 premium 0.333  0.366
3      3 bulky   0.25   0.275

Difficulty: Advanced

Your turn:

ex_6_3 <- # your code here
ex_6_3

Solution:

tf_tab <- ex_6_1
idf_tab <- ex_6_2
ex_6_3 <- tf_tab |>
  inner_join(idf_tab, by = "word") |>
  mutate(tf_idf = tf * idf) |>
  group_by(doc_id) |>
  slice_max(tf_idf, n = 1, with_ties = FALSE) |>   # one top-scoring row per document
  ungroup() |>
  select(doc_id, word, tf, tf_idf) |>
  arrange(doc_id)
ex_6_3
#> # A tibble: 3 x 4
#>   doc_id word       tf tf_idf
#>    <int> <chr>   <dbl>  <dbl>
#> 1      1 light   0.25   0.275
#> 2      2 premium 0.333  0.366
#> 3      3 bulky   0.25   0.275

Explanation: slice_max(tf_idf, n = 1, with_ties = FALSE) is the modern dplyr idiom for "argmax per group": cleaner than filter(tf_idf == max(tf_idf)) because it deterministically breaks ties. Here with_ties = FALSE keeps the first tied row in data order, which is why document 3 is tagged bulky rather than plastic even though both score 0.275. The inner join naturally drops zero-IDF terms (because phone was filtered from idf_tab), which is exactly the behaviour you want for tagging: tags should differentiate documents, so terms common to all docs are useless. This short pipeline is the heart of every "auto-tag this document" service.

Exercise 7.1: Compute a lexicon-based polarity score for product reviews

Task: A product manager scanning fresh reviews wants a simple polarity score per review: count of positive lexicon words minus count of negative lexicon words. Use the inline pos/neg lexicons. Tokenise each review, score it, and return a tibble with columns review_id, score. Save to ex_7_1.

Expected result:

# A tibble: 4 × 2
  review_id score
      <int> <int>
1         1     3
2         2    -3
3         3     0
4         4     2

Difficulty: Intermediate

Your turn:

reviews <- c(
  "great product fast shipping love it",
  "terrible quality broken on arrival hate it",
  "okay phone good battery bad camera",
  "decent value good packaging"
)
pos <- c("great","love","good","decent","fast")
neg <- c("terrible","broken","hate","bad")
ex_7_1 <- # your code here
ex_7_1

Solution:

reviews <- c(
  "great product fast shipping love it",
  "terrible quality broken on arrival hate it",
  "okay phone good battery bad camera",
  "decent value good packaging"
)
pos <- c("great","love","good","decent","fast")
neg <- c("terrible","broken","hate","bad")
ex_7_1 <- tibble(review_id = seq_along(reviews), text = reviews) |>
  mutate(word = str_split(str_to_lower(text), "\\s+")) |>
  unnest(word) |>
  mutate(
    polarity = case_when(
      word %in% pos ~ 1L,
      word %in% neg ~ -1L,
      TRUE ~ 0L
    )
  ) |>
  group_by(review_id) |>
  summarise(score = sum(polarity), .groups = "drop")
ex_7_1
#> # A tibble: 4 x 2
#>   review_id score
#>       <int> <int>
#> 1         1     3
#> 2         2    -3
#> 3         3     0
#> 4         4     2

Explanation: Lexicon-based sentiment is the simplest sentiment model: assign +1 to known positive words, -1 to known negative words, 0 to everything else, then sum per document. The strengths are interpretability (every score traces back to specific words) and zero training cost; the weaknesses are no negation handling ("not good" still scores +1) and no intensifier weighting. Production systems layer a negation-flip rule (any negation token in the previous 3 words flips polarity) before reaching for ML-based models.
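A negation-flip rule of the kind described above might look like the following base-R sketch. The score_with_negation() helper, the negator list, and the three-word window are assumptions for illustration; real systems tune all three.

```r
pos <- c("great", "love", "good", "decent", "fast")
neg <- c("terrible", "broken", "hate", "bad")
negators <- c("not", "no", "never")   # assumed negation tokens

# Hypothetical helper: lexicon score where a negator within the previous
# `window` tokens flips the polarity of a sentiment word.
score_with_negation <- function(tokens, window = 3) {
  polarity <- ifelse(tokens %in% pos, 1L, ifelse(tokens %in% neg, -1L, 0L))
  for (i in seq_along(tokens)) {
    if (polarity[i] != 0L && i > 1) {
      previous <- tokens[max(1, i - window):(i - 1)]
      if (any(previous %in% negators)) polarity[i] <- -polarity[i]
    }
  }
  sum(polarity)
}

score_with_negation(c("not", "good"))
#> [1] -1

mixed <- strsplit("battery is not very good but camera is great", " ")[[1]]
score_with_negation(mixed)   # "good" is flipped, "great" is not
#> [1] 0
```

A fixed window is a blunt instrument: it can flip words that the negator does not actually govern, which is one reason this rule is usually treated as a stopgap.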

Exercise 7.2: Rank reviews by normalised sentiment intensity

Task: The same product manager wants a length-normalised score so a long mildly-positive review does not outrank a short strongly-positive one. Using the per-review counts from Exercise 7.1, compute intensity = score / total_words for each review and return a tibble sorted by descending intensity. Save to ex_7_2.

Expected result:

# A tibble: 4 × 3
  review_id score intensity
      <int> <int>     <dbl>
1         1     3     0.5  
2         4     2     0.5  
3         3     0     0    
4         2    -3    -0.429

Difficulty: Advanced

Your turn:

ex_7_2 <- # your code here
ex_7_2

Solution:

ex_7_2 <- tibble(review_id = seq_along(reviews), text = reviews) |>
  mutate(word = str_split(str_to_lower(text), "\\s+")) |>
  unnest(word) |>
  group_by(review_id) |>
  summarise(
    total = n(),                       # tokens in the review
    score = sum(case_when(
      word %in% pos ~ 1L,
      word %in% neg ~ -1L,
      TRUE ~ 0L
    )),
    .groups = "drop"
  ) |>
  mutate(intensity = score / total) |>
  arrange(desc(intensity)) |>
  select(review_id, score, intensity)
ex_7_2
#> # A tibble: 4 x 3
#>   review_id score intensity
#>       <int> <int>     <dbl>
#> 1         1     3     0.5
#> 2         4     2     0.5
#> 3         3     0     0
#> 4         2    -3    -0.429

Explanation: Intensity normalisation prevents review length from dominating ranking: a 50-word review with 5 positive hits has intensity 0.10, whereas a 5-word review with 2 positive hits has intensity 0.40, so the second is clearly more enthusiastic per word. The denominator choice matters: dividing by total tokens (including stop-words) understates intensity in verbose reviews; dividing by content-word count is fairer but requires stop-word filtering. Both are valid; document your choice when shipping the metric.
