tmtyro is optimized for a fast start, speeding users along a standard workflow made possible by tidytext and the tidyverse packages on which it’s built. But it also offers a pathway for growth by providing vectorized functions for use in a larger ecosystem of tools and simplifying the transition to text mining for anyone familiar with those tools.

Loading a corpus with tmtyro

Regardless of a user’s starting point, tmtyro’s main verbs for gathering and preparing a corpus will prove useful:

library(dplyr)
library(ggplot2)
library(gt)
library(tmtyro)
corpus_dubliners <- get_gutenberg_corpus(2814) |> 
  load_texts(lemma = TRUE, pos = TRUE) |> 
  identify_by(part) |> 
  standardize_titles()

corpus_dubliners
#> # A tibble: 67,885 × 7
#>    doc_id      title     author       part        word  pos   lemma
#>    <fct>       <chr>     <chr>        <chr>       <chr> <chr> <chr>
#>  1 The Sisters Dubliners Joyce, James THE SISTERS there EX    there
#>  2 The Sisters Dubliners Joyce, James THE SISTERS was   VBD   be   
#>  3 The Sisters Dubliners Joyce, James THE SISTERS no    DT    no   
#>  4 The Sisters Dubliners Joyce, James THE SISTERS hope  NN    hope 
#>  5 The Sisters Dubliners Joyce, James THE SISTERS for   IN    for  
#>  6 The Sisters Dubliners Joyce, James THE SISTERS him   PRP   him  
#>  7 The Sisters Dubliners Joyce, James THE SISTERS this  DT    this 
#>  8 The Sisters Dubliners Joyce, James THE SISTERS time  NN    time 
#>  9 The Sisters Dubliners Joyce, James THE SISTERS it    PRP   it   
#> 10 The Sisters Dubliners Joyce, James THE SISTERS was   VBD   be   
#> # ℹ 67,875 more rows

Adding columns with dplyr

Beyond this point, anyone familiar with common tidyverse packages like dplyr and ggplot2 might forge their own path. For instance, while tmtyro offers a selection of verbs for adding new columns—all helpfully beginning with add_...()—the dplyr way to add columns is with mutate(). To support this workflow, tmtyro offers functions for working with columns as vectors, too.

Word count and frequency

The main path for adding a column of word frequencies is with the add_frequency() function. Users familiar with dplyr can instead use mutate() paired with get_frequency():

corpus_dubliners |> 
  select(doc_id, word, lemma) |> 
  mutate(
    count_word = get_frequency(word),
    count_lemma = get_frequency(lemma))
#> # A tibble: 67,885 × 5
#>    doc_id      word  lemma count_word count_lemma
#>    <fct>       <chr> <chr>      <int>       <int>
#>  1 The Sisters there there        168         168
#>  2 The Sisters was   be          1169        2146
#>  3 The Sisters no    no           168         168
#>  4 The Sisters hope  hope          14          26
#>  5 The Sisters for   for          508         508
#>  6 The Sisters him   him          494         494
#>  7 The Sisters this  this         124         173
#>  8 The Sisters time  time         114         135
#>  9 The Sisters it    it           604         604
#> 10 The Sisters was   be          1169        2146
#> # ℹ 67,875 more rows

These frequencies can just as easily be reported as percentages relative to the whole, using get_frequency(percent = TRUE):

corpus_dubliners |> 
  select(doc_id, word, lemma) |> 
  mutate(
    count_word = get_frequency(word, percent = TRUE),
    count_lemma = get_frequency(lemma, percent = TRUE))
#> # A tibble: 67,885 × 5
#>    doc_id      word  lemma count_word count_lemma
#>    <fct>       <chr> <chr>      <dbl>       <dbl>
#>  1 The Sisters there there   0.00247     0.00247 
#>  2 The Sisters was   be      0.0172      0.0316  
#>  3 The Sisters no    no      0.00247     0.00247 
#>  4 The Sisters hope  hope    0.000206    0.000383
#>  5 The Sisters for   for     0.00748     0.00748 
#>  6 The Sisters him   him     0.00728     0.00728 
#>  7 The Sisters this  this    0.00183     0.00255 
#>  8 The Sisters time  time    0.00168     0.00199 
#>  9 The Sisters it    it      0.00890     0.00890 
#> 10 The Sisters was   be      0.0172      0.0316  
#> # ℹ 67,875 more rows

Of course, these frequencies relate to the entire corpus. To get document-level numbers, use the dplyr functions group_by() and ungroup() around mutate():

corpus_dubliners <- corpus_dubliners |> 
  select(doc_id, word) 

dubliners_count <- corpus_dubliners |> 
  group_by(doc_id) |> 
  mutate(
    n = get_frequency(word),
    freq = get_frequency(word, percent = TRUE)) |> 
  ungroup()

dubliners_count
#> # A tibble: 67,885 × 4
#>    doc_id      word      n    freq
#>    <fct>       <chr> <int>   <dbl>
#>  1 The Sisters there    15 0.0048 
#>  2 The Sisters was      57 0.0182 
#>  3 The Sisters no       16 0.00512
#>  4 The Sisters hope      1 0.00032
#>  5 The Sisters for      32 0.0102 
#>  6 The Sisters him      43 0.0138 
#>  7 The Sisters this      6 0.00192
#>  8 The Sisters time      3 0.00096
#>  9 The Sisters it       40 0.0128 
#> 10 The Sisters was      57 0.0182 
#> # ℹ 67,875 more rows
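
To see what grouping changes, the same pattern can be tried on a toy table. The sketch below uses a hypothetical helper, count_in_vector(), as a base R stand-in for get_frequency(). It is an assumption for illustration rather than tmtyro’s implementation, and the toy data is invented.

```r
library(dplyr)

# Hypothetical stand-in for get_frequency(); not tmtyro's own code
count_in_vector <- function(x, percent = FALSE) {
  # look up each element's count in a frequency table of the vector
  n <- as.integer(table(x)[as.character(x)])
  if (percent) n / length(x) else n
}

# Invented toy corpus with two short "documents"
toy <- tibble(
  doc_id = c("a", "a", "a", "b", "b"),
  word   = c("the", "cat", "the", "the", "dog"))

toy_counts <- toy |>
  # ungrouped: counts relate to the whole table
  mutate(n_corpus = count_in_vector(word)) |>
  # grouped: counts and proportions relate to each document
  group_by(doc_id) |>
  mutate(
    n_doc = count_in_vector(word),
    freq_doc = count_in_vector(word, percent = TRUE)) |>
  ungroup()

toy_counts
```

In the result, "the" is counted three times across the toy corpus but only twice in document "a" and once in document "b", which is the difference group_by() makes.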

Vocabulary richness

tmtyro offers add_vocabulary() for adding columns devoted to vocabulary growth, uniqueness, and ratios of lexical diversity. When using mutate(), these features are handled by testing functions like is_new() and is_hapax() and measuring functions like get_cumulative_vocabulary(), get_ttr(), and get_hir(). As with word count, it’s usually best to calculate values grouped by document using group_by() and ungroup():

dubliners_vocab <- corpus_dubliners |> 
  group_by(doc_id) |> 
  mutate(
    new_word = is_new(word), 
    hapax_word = is_hapax(word),
    vocab = get_cumulative_vocabulary(word), 
    ttr = get_ttr(word), 
    hir = get_hir(word)) |> 
  ungroup()

dubliners_vocab
#> # A tibble: 67,885 × 7
#>    doc_id      word  new_word hapax_word vocab   ttr   hir
#>    <fct>       <chr> <lgl>    <lgl>      <int> <dbl> <dbl>
#>  1 The Sisters there TRUE     FALSE          1   1   0    
#>  2 The Sisters was   TRUE     FALSE          2   1   0    
#>  3 The Sisters no    TRUE     FALSE          3   1   0    
#>  4 The Sisters hope  TRUE     TRUE           4   1   0.25 
#>  5 The Sisters for   TRUE     FALSE          5   1   0.2  
#>  6 The Sisters him   TRUE     FALSE          6   1   0.167
#>  7 The Sisters this  TRUE     FALSE          7   1   0.143
#>  8 The Sisters time  TRUE     FALSE          8   1   0.125
#>  9 The Sisters it    TRUE     FALSE          9   1   0.111
#> 10 The Sisters was   FALSE    FALSE          9   0.9 0.1  
#> # ℹ 67,875 more rows
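
The columns above can be approximated in base R, which may help clarify what each one measures. This is a minimal sketch on an invented vector. The definitions are inferred from the printed output rather than taken from tmtyro’s source, so treat them as assumptions.

```r
# Invented toy vector of words in reading order
words <- c("there", "was", "no", "hope", "was", "no", "time")

# first occurrence of each word (compare is_new())
new_word <- !duplicated(words)

# words appearing exactly once in the whole vector (compare is_hapax())
hapax_word <- !(duplicated(words) | duplicated(words, fromLast = TRUE))

# running count of distinct words (compare get_cumulative_vocabulary())
vocab <- cumsum(new_word)

# running type-token ratio (compare get_ttr())
ttr <- vocab / seq_along(words)

# running share of tokens that are hapax legomena (compare get_hir())
hir <- cumsum(hapax_word) / seq_along(words)

data.frame(words, new_word, hapax_word, vocab, ttr, hir)
```

Because every value depends on position within the vector, running the same steps inside group_by(doc_id) restarts the counts for each document.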

get_htr() goes beyond what is available in add_vocabulary(), returning the hapax-token ratio. This method is slower than the others, so take care when applying it to a large corpus:

dubliners_vocab <- 
  dubliners_vocab |> 
  select(doc_id, word, ttr, hir) |> 
  filter(doc_id %in% c("The Sisters", "An Encounter")) |> 
  group_by(doc_id) |> 
  mutate(
    htr = get_htr(word)
  ) |> 
  ungroup()

dubliners_vocab |> 
  # skip the first few lines
  slice(-(1:6))
#> # A tibble: 6,374 × 5
#>    doc_id      word     ttr    hir   htr
#>    <fct>       <chr>  <dbl>  <dbl> <dbl>
#>  1 The Sisters this   1     0.143  1    
#>  2 The Sisters time   1     0.125  1    
#>  3 The Sisters it     1     0.111  1    
#>  4 The Sisters was    0.9   0.1    0.8  
#>  5 The Sisters the    0.909 0.0909 0.818
#>  6 The Sisters third  0.917 0.167  0.833
#>  7 The Sisters stroke 0.923 0.231  0.846
#>  8 The Sisters night  0.929 0.214  0.857
#>  9 The Sisters after  0.933 0.2    0.867
#> 10 The Sisters night  0.875 0.188  0.75 
#> # ℹ 6,364 more rows
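
One way to read the htr column is as the share of tokens so far that have occurred exactly once so far. The sketch below implements that reading naively with a hypothetical function, running_htr(). It is an assumption based on the printed values, not tmtyro’s code, but it suggests why a measure like this can be costly: the counts are rebuilt at every position.

```r
# Hypothetical, naive running hapax-token ratio; not tmtyro's implementation
running_htr <- function(x) {
  vapply(seq_along(x), function(i) {
    # recount everything seen up to position i
    counts <- table(x[seq_len(i)])
    # words seen exactly once so far, divided by tokens so far
    sum(counts == 1) / i
  }, numeric(1))
}

running_htr(c("a", "b", "a", "c"))
```

Unlike a ratio based on document-level hapax legomena, this value drops whenever a previously unique word repeats, as "night" does in the output above.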

Dictionary matching

While add_dictionary() adds a column of dictionary matches for a column like word or lemma, the get_match() function returns the dictionary match for any word or vector of words. Use it with dplyr’s mutate() to add a column like this:

emoji_weather <- make_dictionary(
  list(
    "️☔️" = c("rain", "rains", "rainy", "raining"),
    "️⛈️" = c("storm", "storms", "stormy", "storming"),
    "☁️" = c("cloud", "clouds", "cloudy"),
    "🌞" = c("sun", "sunny"),
    "🌫️" = c("fog", "fogs", "foggy", "mist", "misty"),
    "🌬️" = c("wind", "winds", "windy"),
    "️❄️" = c("snow", "snows", "snowing")),
  name = "weather")

dubliners_weather <- corpus_dubliners |> 
  mutate(weather = get_match(word, emoji_weather))

dubliners_weather |> 
  # show only one story and skip a few hundred words
  filter(doc_id == "The Dead") |> 
  filter(row_number() > 602)
#> # A tibble: 15,076 × 3
#>    doc_id   word     weather
#>    <fct>    <chr>    <chr>  
#>  1 The Dead he       NA     
#>  2 The Dead stood    NA     
#>  3 The Dead on       NA     
#>  4 The Dead the      NA     
#>  5 The Dead mat      NA     
#>  6 The Dead scraping NA     
#>  7 The Dead the      NA     
#>  8 The Dead snow     ️❄️      
#>  9 The Dead from     NA     
#> 10 The Dead his      NA     
#> # ℹ 15,066 more rows

dubliners_weather |> 
  drop_na()
#> # A tibble: 53 × 3
#>    doc_id       word   weather
#>    <fct>        <chr>  <chr>  
#>  1 The Sisters  clouds ☁️      
#>  2 The Sisters  sunny  🌞     
#>  3 The Sisters  sun    🌞     
#>  4 The Sisters  clouds ☁️      
#>  5 An Encounter storm  ️⛈️      
#>  6 An Encounter sunny  🌞     
#>  7 An Encounter sun    🌞     
#>  8 An Encounter clouds ☁️      
#>  9 Araby        rainy  ️☔️     
#> 10 Araby        rain   ️☔️     
#> # ℹ 43 more rows
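
Conceptually, this kind of matching is a lookup from each term to its category. The base R sketch below shows the idea with a named vector. The objects weather_terms and lookup are hypothetical, and the structure returned by make_dictionary() may well differ.

```r
# Hypothetical categories and terms, invented for illustration
weather_terms <- list(
  rain = c("rain", "rainy"),
  sun  = c("sun", "sunny"))

# Flatten to a named vector: names are terms, values are categories
lookup <- setNames(
  rep(names(weather_terms), lengths(weather_terms)),
  unlist(weather_terms, use.names = FALSE))

words <- c("the", "sunny", "rain", "fell")

# Words missing from the lookup come back as NA
matches <- unname(lookup[words])
matches
```

Unmatched words return NA, which is why dropping missing values in the example above leaves only the rows with a match.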

Sentiment

Matching sentiment with get_sentiment() just uses a special kind of dictionary matching:

corpus_dubliners |> 
  mutate(
    sent_1 = get_match(word, tidytext::get_sentiments("bing")),
    sent_2 = get_sentiment(word, "bing")) |> 
  drop_na()
#> # A tibble: 3,786 × 4
#>    doc_id      word      sent_1   sent_2  
#>    <fct>       <chr>     <chr>    <chr>   
#>  1 The Sisters evenly    positive positive
#>  2 The Sisters dead      negative negative
#>  3 The Sisters darkened  negative negative
#>  4 The Sisters blind     negative negative
#>  5 The Sisters idle      negative negative
#>  6 The Sisters strangely negative negative
#>  7 The Sisters like      positive positive
#>  8 The Sisters like      positive positive
#>  9 The Sisters sinful    negative negative
#> 10 The Sisters fear      negative negative
#> # ℹ 3,776 more rows

Tf-idf

The get_tf() function is a shorthand alias for get_frequency(percent = TRUE). Importantly, it reports frequencies relative to the whole of a group, so document-level values usually require group_by() and ungroup():

dubliners_tfidf <- corpus_dubliners |> 
  group_by(doc_id) |> 
  mutate(
    tf = get_tf(word)) |> 
  ungroup()

dubliners_tfidf
#> # A tibble: 67,885 × 3
#>    doc_id      word       tf
#>    <fct>       <chr>   <dbl>
#>  1 The Sisters there 0.0048 
#>  2 The Sisters was   0.0182 
#>  3 The Sisters no    0.00512
#>  4 The Sisters hope  0.00032
#>  5 The Sisters for   0.0102 
#>  6 The Sisters him   0.0138 
#>  7 The Sisters this  0.00192
#>  8 The Sisters time  0.00096
#>  9 The Sisters it    0.0128 
#> 10 The Sisters was   0.0182 
#> # ℹ 67,875 more rows

A simpler option is get_tf_by(), which accepts a column of words and a grouping variable like doc_id. It works well with get_idf_by(), which calculates the inverse document frequency for each term. Multiplying these two columns together gives tf-idf:

dubliners_tfidf <- corpus_dubliners |> 
  mutate(
    tf = get_tf_by(word, doc_id),
    idf = get_idf_by(word, doc_id),
    tf_idf = tf * idf)

dubliners_tfidf
#> # A tibble: 67,885 × 5
#>    doc_id      word       tf    idf   tf_idf
#>    <fct>       <chr>   <dbl>  <dbl>    <dbl>
#>  1 The Sisters there 0.0048  0      0       
#>  2 The Sisters was   0.0182  0      0       
#>  3 The Sisters no    0.00512 0      0       
#>  4 The Sisters hope  0.00032 0.511  0.000163
#>  5 The Sisters for   0.0102  0      0       
#>  6 The Sisters him   0.0138  0      0       
#>  7 The Sisters this  0.00192 0.0690 0.000132
#>  8 The Sisters time  0.00096 0      0       
#>  9 The Sisters it    0.0128  0      0       
#> 10 The Sisters was   0.0182  0      0       
#> # ℹ 67,875 more rows

Alternatively, get_tfidf_by() simplifies the process to one step:

dubliners_tfidf |> 
  select(doc_id, word, tf_idf) |> 
  mutate(
    tf_idf2 = get_tfidf_by(word, doc_id)
  )
#> # A tibble: 67,885 × 4
#>    doc_id      word    tf_idf  tf_idf2
#>    <fct>       <chr>    <dbl>    <dbl>
#>  1 The Sisters there 0        0       
#>  2 The Sisters was   0        0       
#>  3 The Sisters no    0        0       
#>  4 The Sisters hope  0.000163 0.000163
#>  5 The Sisters for   0        0       
#>  6 The Sisters him   0        0       
#>  7 The Sisters this  0.000132 0.000132
#>  8 The Sisters time  0        0       
#>  9 The Sisters it    0        0       
#> 10 The Sisters was   0        0       
#> # ℹ 67,875 more rows
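
For readers who want to see the arithmetic, the steps above can be sketched with dplyr alone on an invented toy corpus. The sketch assumes the natural-log form of idf, log(number of documents / number of documents containing the term), which appears consistent with the idf values printed above. It is not tmtyro’s implementation.

```r
library(dplyr)

# Invented toy corpus with two documents
toy <- tibble(
  doc_id = c("a", "a", "a", "b", "b", "b"),
  word   = c("the", "cat", "the", "the", "dog", "dog"))

n_docs <- n_distinct(toy$doc_id)

toy_tfidf <- toy |>
  # term frequency: share of a document's tokens taken by each word
  group_by(doc_id) |>
  mutate(tf = as.numeric(table(word)[word]) / n()) |>
  ungroup() |>
  # inverse document frequency: rarer across documents means higher idf
  group_by(word) |>
  mutate(idf = log(n_docs / n_distinct(doc_id))) |>
  ungroup() |>
  mutate(tf_idf = tf * idf)

toy_tfidf
```

A word found in every document, like "the" here, gets an idf of zero, which explains the many zeros in the tf_idf column above.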

Preparing tables with dplyr and gt

In addition to adding new columns, tmtyro’s functions like add_vocabulary() and add_sentiment() also prepare objects to work easily with tabulize(). Users who wish to prepare similar tables manually will need to become familiar with a package like gt, which allows for nearly limitless customization. A few methods for creating and modifying gt tables are shown below, and more can be found in the gt package documentation.

Corpus details

By default, a corpus prepared by tmtyro will tabulize() into a table showing word counts for each document. A simple version of this can be prepared by hand with very little effort:

gt_details <- corpus_dubliners |> 
  count(doc_id) |> 
  gt()

gt_details
doc_id                             n
The Sisters                     3125
An Encounter                    3255
Araby                           2332
Eveline                         1829
After the Race                  2235
Two Gallants                    3916
The Boarding House              2793
A Little Cloud                  4981
Counterparts                    4118
Clay                            2648
A Painful Case                  3622
Ivy Day in the Committee Room   5269
A Mother                        4526
Grace                           7558
The Dead                       15678

Once the table is prepared, gt allows for further tweaking—for instance, to format word counts for readability, hide the doc_id column header, and rename n as words:

gt_details |> 
  fmt_integer(n) |> 
  cols_label(
    doc_id = "",
    n = "words")
                                words
The Sisters                     3,125
An Encounter                    3,255
Araby                           2,332
Eveline                         1,829
After the Race                  2,235
Two Gallants                    3,916
The Boarding House              2,793
A Little Cloud                  4,981
Counterparts                    4,118
Clay                            2,648
A Painful Case                  3,622
Ivy Day in the Committee Room   5,269
A Mother                        4,526
Grace                           7,558
The Dead                       15,678

Word frequencies

The standard tmtyro workflow for preparing a polished table of high-frequency word counts, add_frequency() |> tabulize(), shows a few of the most used words in each document. When word counts were instead added with get_frequency(), a chain of dplyr functions prepares the summary table first: group_by() |> summarize() |> ungroup() |> slice_max(). Once it’s ready, gt() will do the rest.

dubliners_count <- dubliners_count |>
  group_by(doc_id, word) |> 
  summarize(n = max(n)) |> 
  ungroup() |> 
  slice_max(
    order_by = n, 
    by = doc_id, 
    n = 3) # show three words each

gt_counts <- dubliners_count |> 
  # limit to three stories for a shorter display
  filter(doc_id %in% c("The Sisters", "An Encounter", "The Dead")) |> 
  gt()

gt_counts
doc_id        word    n
The Sisters   the   171
The Sisters   and   118
The Sisters   to     94
An Encounter  the   181
An Encounter  and   107
An Encounter  he    101
The Dead      the   866
The Dead      and   570
The Dead      of    395

The cols_label() function from gt can adjust headers, and tmtyro’s collapse_rows() function cleans up repeated values in the first column:

gt_counts |> 
  cols_label(doc_id = "") |> 
  collapse_rows(doc_id)
              word    n
The Sisters   the   171
              and   118
              to     94
An Encounter  the   181
              and   107
              he    101
The Dead      the   866
              and   570
              of    395

Choosing to adjust things manually introduces a steeper learning curve, but it also allows for greater customization:

dubliners_count |> 
  filter(doc_id %in% c("The Sisters", "An Encounter", "The Dead")) |> 
  gt(groupname_col = "doc_id") |> 
  cols_label(
    word = "") |> 
  data_color(columns = n, palette = "PuBuGn") |> 
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_row_groups())
                n
The Sisters
  the         171
  and         118
  to           94
An Encounter
  the         181
  and         107
  he          101
The Dead
  the         866
  and         570
  of          395

Dictionary matches, including for sentiment, follow the same pattern.

Vocabulary richness

A similar manual workflow can be used to prepare tables of vocabulary richness:

gt_vocab <- corpus_dubliners |> 
  filter(doc_id %in% c("The Sisters", "An Encounter", "The Dead")) |> 
  group_by(doc_id) |> 
  summarize(
    length = n(),
    vocab_count = sum(is_new(word)),
    ttr = last(get_ttr(word)),
    hapax_count = sum(is_hapax(word)),
    # at a document's final token, get_hir() equals hapax_count / length
    htr = last(get_hir(word))) |> 
  ungroup() |> 
  gt()

gt_vocab
doc_id        length  vocab_count  ttr        hapax_count  htr
The Sisters     3125          901  0.2883200          557  0.1782400
An Encounter    3255          982  0.3016897          624  0.1917051
The Dead       15678         2757  0.1758515         1580  0.1007782

Here, tab spanners can be added to approximate the version created by a typical tmtyro workflow:

gt_vocab |> 
  tab_spanner(
    label = "vocabulary",
    columns = c("vocab_count", "ttr")) |> 
  tab_spanner(
    label = "hapax",
    columns = c("hapax_count", "htr")) |> 
  cols_label(
    vocab_count = "total",
    ttr = "ratio",
    hapax_count = "total",
    htr = "ratio") |> 
  fmt_number(c(ttr, htr), decimals = 3)
                      vocabulary      hapax
doc_id        length  total  ratio  total  ratio
The Sisters     3125    901  0.288    557  0.178
An Encounter    3255    982  0.302    624  0.192
The Dead       15678   2757  0.176   1580  0.101

Extending tabulize()

Learning some of gt’s functions can also be helpful for customizing outputs derived from tabulize(), which are just gt objects. To demonstrate the usefulness of customization, we can create a short table of word counts by selecting a few titles and showing the default output from tabulize():

some_docs <- unique(corpus_dubliners$doc_id)[c(1:3, 12, 15)]

gt_1 <- corpus_dubliners |> 
  filter(doc_id %in% some_docs) |> 
  tabulize()

gt_1
                                words
The Sisters                     3,125
An Encounter                    3,255
Araby                           2,332
Ivy Day in the Committee Room   5,269
The Dead                       15,678

This table is functional but not necessarily pretty. While beauty is subjective, customization makes it possible to aim for something clean like this:

gt_1 |> 
  tab_style(
    style = cell_borders(
      sides = "all", 
      color = NULL),
    locations = cells_body()) |> 
  tab_style(
    style = cell_text(size = pct(70)),
    locations = cells_column_labels()
  ) |> 
  cols_align(
    align = "right",
    columns = doc_id) |> 
  opt_css(
    css = ".gt_col_headings {border-bottom-color: #FFFFFF !important;}"
  )
                                words
                  The Sisters   3,125
                 An Encounter   3,255
                        Araby   2,332
Ivy Day in the Committee Room   5,269
                     The Dead  15,678

Preparing figures with ggplot2

Vectorized functions beginning with get_...() and is_...() don’t work well with visualize(), which is made for the standard workflow. Those visualizations are all built with ggplot2, so they can be recreated with a little effort and nearly limitless customization. Consult the ggplot2 documentation to learn more methods than can be demonstrated here.

Corpus details

By default, a corpus prepared with load_texts() will visualize() into a bar chart showing word counts for each document. Preparing something manually is pretty simple, even if it doesn’t compare well to the default output:

# default output
visualize(corpus_dubliners)

# manual output
corpus_dubliners |> 
  count(doc_id) |> 
  ggplot(aes(
    x = n, 
    y = doc_id)) +
  geom_col()

Among other things, visualize() preserves the order of documents from top to bottom, adjusts labeling, and adds some settings for theme and color. Alone, each is a simple change. But everything adds up when polishing publication-ready graphs, including customizing gridlines, adjusting label spacing, and formatting numbers:

corpus_dubliners |> 
  count(doc_id) |> 
  # reverse doc_id order
  mutate(doc_id = forcats::fct_rev(doc_id)) |> 
  ggplot(aes(
    x = n, 
    y = doc_id, 
    # add color
    fill = doc_id)) +
  geom_col(show.legend = FALSE) +
  # adjust number format and shift y-axis labels
  scale_x_continuous(
    labels = scales::label_comma(),
    expand = c(0, 0)) +
  # change the theme background
  theme_minimal() +
  # adjust labels
  labs(
    x = "length (words)",
    y = NULL) +
  # adjust grid lines
  theme(
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.y = element_blank())

Word frequencies

When used after add_frequency(), visualize() will prepare a faceted graph of some of the top word frequencies for each document. To create something similar manually, using mutate() with a vectorized function like get_frequency() or get_tf_by(), it’s necessary to prepare a table with summarize() and slice_max() before piping it to ggplot():

corpus_dubliners |>
  mutate(
    n = get_tf_by(word, doc_id)) |> 
  group_by(doc_id, word) |> 
  summarize(n = max(n)) |> 
  ungroup() |> 
  slice_max(
    order_by = n, 
    by = doc_id, 
    n = 3) |> # show 3 words each
  ggplot(aes(n, word)) +
  geom_col() +
  facet_wrap(vars(doc_id), scales = "free")

The resulting graph can be further customized with ggplot2’s functions.
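
As one example of such customization, bars can be sorted within each facet and the axis shown as percentages. The sketch below uses a small invented table in place of the summarized corpus, so its values are placeholders. It relies on reorder_within() and scale_y_reordered() from tidytext and label_percent() from scales, and it is only one of many possible approaches.

```r
library(dplyr)
library(ggplot2)

# Invented stand-in for a summary table of top words per document
toy_top <- tibble(
  doc_id = rep(c("Story A", "Story B"), each = 3),
  word   = c("the", "and", "to", "the", "and", "he"),
  n      = c(0.055, 0.038, 0.030, 0.056, 0.033, 0.031))

p <- toy_top |>
  # order words by frequency separately within each document
  mutate(word = tidytext::reorder_within(word, n, doc_id)) |>
  ggplot(aes(n, word)) +
  geom_col() +
  # strip the suffix that reorder_within() adds to axis labels
  tidytext::scale_y_reordered() +
  # show proportions as percentages
  scale_x_continuous(labels = scales::label_percent()) +
  facet_wrap(vars(doc_id), scales = "free_y") +
  labs(x = "term frequency", y = NULL)

p
```

Using scales = "free_y" keeps a shared x-axis so that bar lengths stay comparable across documents while each facet lists only its own words.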

Vocabulary richness

The default visualizations tmtyro makes after add_vocabulary() are highly customized. It isn’t hard to make a simple version after a vectorized function like get_cumulative_vocabulary(), but this version can lack readability:

corpus_dubliners |> 
  group_by(doc_id) |> 
  mutate(
    vocab = get_cumulative_vocabulary(word), 
    progress = row_number()) |> 
  ungroup() |> 
  ggplot(aes(
    x = progress, 
    y = vocab, 
    color = doc_id)) +
  geom_line()

Adding direct labels is often worth the effort:

dubliners_vocab <- corpus_dubliners |> 
  group_by(doc_id) |> 
  mutate(
    vocab = get_cumulative_vocabulary(word), 
    progress = row_number()) |> 
  ungroup()

# table of labels and locations
document_labels <- dubliners_vocab |> 
  group_by(doc_id) |> 
  summarize(
    vocab = last(vocab),
    progress = last(progress)) |> 
  ungroup()

dubliners_vocab |> 
  ggplot(aes(
    x = progress, 
    y = vocab, 
    color = doc_id)) +
  geom_line() +
  geom_point(
    data = document_labels) +
  # avoid overlapping labels
  ggrepel::geom_text_repel(
    data = document_labels,
    aes(label = doc_id)) +
  theme(legend.position = "none")