Working with text as data is a multi-step process. After choosing and collecting documents, you’ll need to load them in some structured way before anything else. Only then is it possible to “do” text analysis: tagging parts of speech, normalizing by lemma, comparing features, measuring sentiment, and so on. Even then, you’ll need to communicate findings by preparing compelling explanations, tables, and visualizations of your results.

The tmtyro package aims to make these steps fast and easy.

  • Purpose-built functions for collecting a corpus let you focus on what instead of how.
  • Scalable functions for loading a corpus provide room for growth, from simple word count to grammar parsing and lemmatizing.
  • Additional functions standardize approaches for measuring word use and vocabulary uniqueness, detecting sentiment, assessing term frequency–inverse document frequency, working with n-grams, and even building topic models.
  • One simple function prepares publication-ready tables, automatically adjusting based on the kind of data used. Another simple function prepares compelling visualizations, returning clean, publication-ready figures.
  • Every step is offered as a verb using complementary syntax. This keeps workflows easy to build, easy to understand, easy to explain, and easy to reproduce.
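
For example, a complete workflow can read as a single pipeline of verbs. This is only a sketch using functions introduced below; Gutenberg ID 2814 is Dubliners:

library(tmtyro)

get_gutenberg_corpus(2814) |>   # collect Dubliners from Project Gutenberg
  load_texts() |>               # load in tidytext format, one word per row
  identify_by(part) |>          # treat each story as its own document
  add_sentiment() |>            # measure sentiment with the default lexicon
  visualize()                   # prepare a publication-ready figure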

Preparing texts

tmtyro offers a few functions to gather and load texts for study:

  • get_gutenberg_corpus() caches the HTML version of books by their Project Gutenberg ID, parses their text and headers, and presents them in a table.
  • get_micusp_corpus() caches papers from the Michigan Corpus of Upper-level Student Papers, parses them for metadata and contents, and presents them in a table.
  • download_once() caches an online file and passes the local path invisibly (see the sketch after this list).
  • load_texts() prepares a table in “tidytext” format with one word per row and columns for metadata. These texts can be loaded from a folder of files or passed from a table. Parameters allow for lemmatization, part-of-speech processing, and other options.
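
Of these, download_once() may be the least familiar. A minimal sketch, with a hypothetical URL:

# The URL is hypothetical. download_once() caches the file locally and
# passes the local path invisibly, so assign the result to use it later.
local_path <- download_once("https://example.com/texts/moby_dick.txt")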

Other functions aid with preparing a corpus:

  • move_header_to_text() corrects overzealous identification of HTML headers when parsing books from Project Gutenberg.
  • standardize_titles() converts a vector or column into title case, replaces underscores with spaces, and optionally removes initial articles.
  • identify_by() sets a column of metadata to serve as document marker.

Get a corpus

Collecting texts from Project Gutenberg will be a common first step for many. The function get_gutenberg_corpus() needs only the Gutenberg ID number, found in the book’s URL. The resulting table draws metadata from the gutenbergr package, with columns for “gutenberg_id”, “title”, “author”, headers such as those used for chapters, and “text”.

library(tmtyro)
joyce <- get_gutenberg_corpus(c(2814, 4217, 4300))

joyce
#> # A tibble: 10,810 × 7
#>    gutenberg_id title     author       part        section subsection text      
#>           <int> <chr>     <chr>        <chr>       <chr>   <chr>      <chr>     
#>  1         2814 Dubliners Joyce, James THE SISTERS NA      NA         There was…
#>  2         2814 Dubliners Joyce, James THE SISTERS NA      NA         Old Cotte…
#>  3         2814 Dubliners Joyce, James THE SISTERS NA      NA         “No, I wo…
#>  4         2814 Dubliners Joyce, James THE SISTERS NA      NA         He began …
#>  5         2814 Dubliners Joyce, James THE SISTERS NA      NA         “I have m…
#>  6         2814 Dubliners Joyce, James THE SISTERS NA      NA         He began …
#>  7         2814 Dubliners Joyce, James THE SISTERS NA      NA         “Well, so…
#>  8         2814 Dubliners Joyce, James THE SISTERS NA      NA         “Who?” sa…
#>  9         2814 Dubliners Joyce, James THE SISTERS NA      NA         “Father F…
#> 10         2814 Dubliners Joyce, James THE SISTERS NA      NA         “Is he de…
#> # ℹ 10,800 more rows

In some cases, headers may make better sense if read as part of the text, as in the “Aeolus” chapter of Ulysses, where frequent newspaper headlines pepper the page:

ulysses <- get_gutenberg_corpus(4300)

# dplyr is used here to choose a smaller example for comparison
ulysses |> 
  dplyr::filter(section == "[ 7 ]")
#> # A tibble: 476 × 7
#>    gutenberg_id title   author       part   section subsection             text 
#>           <int> <chr>   <chr>        <chr>  <chr>   <chr>                  <chr>
#>  1         4300 Ulysses Joyce, James — II — [ 7 ]   IN THE HEART OF THE H… Befo…
#>  2         4300 Ulysses Joyce, James — II — [ 7 ]   IN THE HEART OF THE H… —Rat…
#>  3         4300 Ulysses Joyce, James — II — [ 7 ]   IN THE HEART OF THE H… —Com…
#>  4         4300 Ulysses Joyce, James — II — [ 7 ]   IN THE HEART OF THE H… Righ…
#>  5         4300 Ulysses Joyce, James — II — [ 7 ]   IN THE HEART OF THE H… —Sta…
#>  6         4300 Ulysses Joyce, James — II — [ 7 ]   THE WEARER OF THE CRO… Unde…
#>  7         4300 Ulysses Joyce, James — II — [ 7 ]   GENTLEMEN OF THE PRESS Gros…
#>  8         4300 Ulysses Joyce, James — II — [ 7 ]   GENTLEMEN OF THE PRESS —The…
#>  9         4300 Ulysses Joyce, James — II — [ 7 ]   GENTLEMEN OF THE PRESS —Jus…
#> 10         4300 Ulysses Joyce, James — II — [ 7 ]   GENTLEMEN OF THE PRESS The …
#> # ℹ 466 more rows

These can be corrected with move_header_to_text().

ulysses <- get_gutenberg_corpus(4300) |> 
  move_header_to_text(subsection)

# dplyr is used here to choose a smaller example for comparison
ulysses |> 
  dplyr::filter(section == "[ 7 ]")
#> # A tibble: 539 × 6
#>    gutenberg_id title   author       part   section text                        
#>           <int> <chr>   <chr>        <chr>  <chr>   <chr>                       
#>  1         4300 Ulysses Joyce, James — II — [ 7 ]   IN THE HEART OF THE HIBERNI…
#>  2         4300 Ulysses Joyce, James — II — [ 7 ]   Before Nelson’s pillar tram…
#>  3         4300 Ulysses Joyce, James — II — [ 7 ]   —Rathgar and Terenure!      
#>  4         4300 Ulysses Joyce, James — II — [ 7 ]   —Come on, Sandymount Green! 
#>  5         4300 Ulysses Joyce, James — II — [ 7 ]   Right and left parallel cla…
#>  6         4300 Ulysses Joyce, James — II — [ 7 ]   —Start, Palmerston Park!    
#>  7         4300 Ulysses Joyce, James — II — [ 7 ]   THE WEARER OF THE CROWN     
#>  8         4300 Ulysses Joyce, James — II — [ 7 ]   Under the porch of the gene…
#>  9         4300 Ulysses Joyce, James — II — [ 7 ]   GENTLEMEN OF THE PRESS      
#> 10         4300 Ulysses Joyce, James — II — [ 7 ]   Grossbooted draymen rolled …
#> # ℹ 529 more rows

Headers can be moved for specific texts in a corpus by specifying a filter like title == "Ulysses":

joyce <- joyce |> 
  move_header_to_text(subsection, title == "Ulysses")

joyce |> 
  dplyr::filter(section == "[ 7 ]")
#> # A tibble: 539 × 6
#>    gutenberg_id title   author       part   section text                        
#>           <int> <chr>   <chr>        <chr>  <chr>   <chr>                       
#>  1         4300 Ulysses Joyce, James — II — [ 7 ]   IN THE HEART OF THE HIBERNI…
#>  2         4300 Ulysses Joyce, James — II — [ 7 ]   Before Nelson’s pillar tram…
#>  3         4300 Ulysses Joyce, James — II — [ 7 ]   —Rathgar and Terenure!      
#>  4         4300 Ulysses Joyce, James — II — [ 7 ]   —Come on, Sandymount Green! 
#>  5         4300 Ulysses Joyce, James — II — [ 7 ]   Right and left parallel cla…
#>  6         4300 Ulysses Joyce, James — II — [ 7 ]   —Start, Palmerston Park!    
#>  7         4300 Ulysses Joyce, James — II — [ 7 ]   THE WEARER OF THE CROWN     
#>  8         4300 Ulysses Joyce, James — II — [ 7 ]   Under the porch of the gene…
#>  9         4300 Ulysses Joyce, James — II — [ 7 ]   GENTLEMEN OF THE PRESS      
#> 10         4300 Ulysses Joyce, James — II — [ 7 ]   Grossbooted draymen rolled …
#> # ℹ 529 more rows

Load texts

load_texts() prepares a set of documents for study, either from a table or from a folder of files.

From a table

A table like the one returned by get_gutenberg_corpus() can be converted to tidytext format, with one word per row, using load_texts().

corpus_ulysses <- ulysses |> 
  load_texts()

corpus_ulysses
#> # A tibble: 265,043 × 6
#>    doc_id title   author       part  section word     
#>     <int> <chr>   <chr>        <chr> <chr>   <chr>    
#>  1   4300 Ulysses Joyce, James — I — [ 1 ]   stately  
#>  2   4300 Ulysses Joyce, James — I — [ 1 ]   plump    
#>  3   4300 Ulysses Joyce, James — I — [ 1 ]   buck     
#>  4   4300 Ulysses Joyce, James — I — [ 1 ]   mulligan 
#>  5   4300 Ulysses Joyce, James — I — [ 1 ]   came     
#>  6   4300 Ulysses Joyce, James — I — [ 1 ]   from     
#>  7   4300 Ulysses Joyce, James — I — [ 1 ]   the      
#>  8   4300 Ulysses Joyce, James — I — [ 1 ]   stairhead
#>  9   4300 Ulysses Joyce, James — I — [ 1 ]   bearing  
#> 10   4300 Ulysses Joyce, James — I — [ 1 ]   a        
#> # ℹ 265,033 more rows

From files

If text files are already collected in a folder on disk, they can be prepared in a table by passing the path to the folder inside load_texts(). Used this way, load_texts() loads every file with the “.txt” extension, populating the doc_id column with the first part of each file name.

corpus_austen <- load_texts("austen")

In this example, the “austen” folder is found within the current project. If it were instead located somewhere else on the computer, the complete path can be passed like this:
load_texts("~/corpora/austen")

Choose a different doc_id

Documents loaded from get_gutenberg_corpus() use the gutenberg_id column as their document identifier.

corpus_dubliners <- get_gutenberg_corpus(2814) |> 
  load_texts(lemma = TRUE, pos = TRUE)

corpus_dubliners
#> # A tibble: 67,885 × 7
#>    doc_id title     author       part        word  pos   lemma
#>     <int> <chr>     <chr>        <chr>       <chr> <chr> <chr>
#>  1   2814 Dubliners Joyce, James THE SISTERS there EX    there
#>  2   2814 Dubliners Joyce, James THE SISTERS was   VBD   be   
#>  3   2814 Dubliners Joyce, James THE SISTERS no    DT    no   
#>  4   2814 Dubliners Joyce, James THE SISTERS hope  NN    hope 
#>  5   2814 Dubliners Joyce, James THE SISTERS for   IN    for  
#>  6   2814 Dubliners Joyce, James THE SISTERS him   PRP   him  
#>  7   2814 Dubliners Joyce, James THE SISTERS this  DT    this 
#>  8   2814 Dubliners Joyce, James THE SISTERS time  NN    time 
#>  9   2814 Dubliners Joyce, James THE SISTERS it    PRP   it   
#> 10   2814 Dubliners Joyce, James THE SISTERS was   VBD   be   
#> # ℹ 67,875 more rows

If a different column is preferred, identify_by() makes the switch. In this example from Dubliners, each story’s title is shown under “part”. identify_by() makes it easy to identify documents by that column:

corpus_dubliners <- corpus_dubliners |> 
  identify_by(part)

corpus_dubliners
#> # A tibble: 67,885 × 7
#>    doc_id      title     author       part        word  pos   lemma
#>    <fct>       <chr>     <chr>        <chr>       <chr> <chr> <chr>
#>  1 THE SISTERS Dubliners Joyce, James THE SISTERS there EX    there
#>  2 THE SISTERS Dubliners Joyce, James THE SISTERS was   VBD   be   
#>  3 THE SISTERS Dubliners Joyce, James THE SISTERS no    DT    no   
#>  4 THE SISTERS Dubliners Joyce, James THE SISTERS hope  NN    hope 
#>  5 THE SISTERS Dubliners Joyce, James THE SISTERS for   IN    for  
#>  6 THE SISTERS Dubliners Joyce, James THE SISTERS him   PRP   him  
#>  7 THE SISTERS Dubliners Joyce, James THE SISTERS this  DT    this 
#>  8 THE SISTERS Dubliners Joyce, James THE SISTERS time  NN    time 
#>  9 THE SISTERS Dubliners Joyce, James THE SISTERS it    PRP   it   
#> 10 THE SISTERS Dubliners Joyce, James THE SISTERS was   VBD   be   
#> # ℹ 67,875 more rows

Standardize titles

standardize_titles() cleans up titles by converting them to title case.

before <- unique(corpus_dubliners$doc_id)

corpus_dubliners <- corpus_dubliners |> 
  standardize_titles()

after <- unique(corpus_dubliners$doc_id)

data.frame(before, after)
#>                           before                         after
#> 1                    THE SISTERS                   The Sisters
#> 2                   AN ENCOUNTER                  An Encounter
#> 3                          ARABY                         Araby
#> 4                        EVELINE                       Eveline
#> 5                 AFTER THE RACE                After the Race
#> 6                   TWO GALLANTS                  Two Gallants
#> 7             THE BOARDING HOUSE            The Boarding House
#> 8                 A LITTLE CLOUD                A Little Cloud
#> 9                   COUNTERPARTS                  Counterparts
#> 10                          CLAY                          Clay
#> 11                A PAINFUL CASE                A Painful Case
#> 12 IVY DAY IN THE COMMITTEE ROOM Ivy Day in the Committee Room
#> 13                      A MOTHER                      A Mother
#> 14                         GRACE                         Grace
#> 15                      THE DEAD                      The Dead

Studying texts

Useful at many stages of work with a corpus, contextualize() shows the context of a search term, with an adjustable window on either side and options for searching with regular expressions. Most other functions for studying texts follow a predictable naming convention:

  • add_vocabulary() adds columns measuring the lexical variety of texts.
  • add_sentiment() adds a column of sentiment identifiers from a chosen lexicon.
  • add_ngrams() adds columns of words for bigrams, trigrams, or more.

Not every method preserves the size or shape of data passed to it:

  • summarize_tf_idf() returns a data frame for every token in each document in a corpus, with columns indicating weights for term frequency–inverse document frequency.

Along with these, other functions assist with the process:

  • drop_na() drops rows with missing data in any column or in specified columns.
  • combine_ngrams() combines multiple columns for n-grams into one.
  • separate_ngrams() separates a single column of n-grams into one column per word (see the sketch after this list).
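
As a minimal sketch of the last two (assuming, per their descriptions, that they act as near inverses), n-gram columns can be combined and then split back apart:

corpus_dubliners |> 
  add_ngrams() |>       # adds word_1 and word_2 columns for bigrams
  combine_ngrams() |>   # collapses them into a single n-gram column
  separate_ngrams()     # splits that column back into one column per word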

But understanding the context of key words with contextualize() might be especially helpful.

Showing context

contextualize() finds uses of a word within a corpus and returns a window of context around each use.

corpus_dubliners |> 
  contextualize("snow")
#> mat scraping the snow from his goloshes
#> light fringe of snow lay like a
#> home in the snow if she were
#> the park the snow would be lying
#> standing in the snow on the quay

By default, contextualize() returns five results, showing a window of three words before and after an exact search term. Adjusting limit changes the number of results, with limit = 0 returning a table. Other options include window to adjust the number of words shown and regex to accept partial matches.

corpus_dubliners |> 
  contextualize(regex = "sno",
                window = 2,
                limit = 0)
#> # A tibble: 22 × 4
#>    doc_id         word           index context                            
#>    <fct>          <chr>          <int> <chr>                              
#>  1 After the Race snorting        1088 to the SNOrting motor the          
#>  2 The Dead       snow             610 scraping the SNOw from his         
#>  3 The Dead       snow             705 fringe of SNOw lay like            
#>  4 The Dead       snow_stiffened   739 through the SNOw_stiffened frieze a
#>  5 The Dead       snowing          754 is it SNOwing again mr             
#>  6 The Dead       snow            1660 in the SNOw if she                 
#>  7 The Dead       snow            5411 park the SNOw would be             
#>  8 The Dead       snow            8739 in the SNOw on the                 
#>  9 The Dead       snow            8773 weighted with SNOw the wellington  
#> 10 The Dead       snow            8782 cap of SNOw that flashed           
#> # ℹ 12 more rows

When loading texts, load_texts() provides an option to keep original capitalization and punctuation. This option doesn’t suit every workflow, and it seems incompatible with the current implementation of part-of-speech parsing. But using contextualize() on corpora loaded with load_texts(keep_original = TRUE) will show search terms much closer to their original context:

corpus_joyce <- joyce |> 
  load_texts(keep_original = TRUE) |> 
  identify_by(title)

tundish <- 
  corpus_joyce |> 
  contextualize("tundish", limit = 1:7)
#> it not a tundish? —What is a
#> —What is a tundish? —That. The the
#> that called a tundish in Ireland? asked
#> is called a tundish in Lower Drumcondra,
#> best English. —A tundish, said the dean
#> word yet again. —Tundish! Well now, that
#> April 13. That tundish has been on

Even when limit is set to some value other than 0, the table of results is returned invisibly for later recall.

tundish
#> # A tibble: 7 × 4
#>   doc_id                                  word    index context                 
#>   <fct>                                   <chr>   <int> <chr>                   
#> 1 A Portrait of the Artist as a Young Man tundish 63827 it not a TUNDISH? —What…
#> 2 A Portrait of the Artist as a Young Man tundish 63831 —What is a TUNDISH? —Th…
#> 3 A Portrait of the Artist as a Young Man tundish 63840 that called a TUNDISH i…
#> 4 A Portrait of the Artist as a Young Man tundish 63858 is called a TUNDISH in …
#> 5 A Portrait of the Artist as a Young Man tundish 63872 best English. —A TUNDIS…
#> 6 A Portrait of the Artist as a Young Man tundish 64122 word yet again. —TUNDIS…
#> 7 A Portrait of the Artist as a Young Man tundish 84352 April 13. That TUNDISH …

Vocabulary richness

add_vocabulary() adds measurements of vocabulary richness, including cumulative vocabulary size, indicators of hapax legomena, and markers of progress.

vocab_dubliners <- 
  corpus_dubliners |> 
  add_vocabulary()

vocab_dubliners
#> # A tibble: 67,885 × 14
#>    doc_id   title author part  word  pos   lemma new_word hapax vocabulary   ttr
#>    <fct>    <chr> <chr>  <chr> <chr> <chr> <chr> <lgl>    <lgl>      <int> <dbl>
#>  1 The Sis… Dubl… Joyce… THE … there EX    there TRUE     FALSE          1   1  
#>  2 The Sis… Dubl… Joyce… THE … was   VBD   be    TRUE     FALSE          2   1  
#>  3 The Sis… Dubl… Joyce… THE … no    DT    no    TRUE     FALSE          3   1  
#>  4 The Sis… Dubl… Joyce… THE … hope  NN    hope  TRUE     TRUE           4   1  
#>  5 The Sis… Dubl… Joyce… THE … for   IN    for   TRUE     FALSE          5   1  
#>  6 The Sis… Dubl… Joyce… THE … him   PRP   him   TRUE     FALSE          6   1  
#>  7 The Sis… Dubl… Joyce… THE … this  DT    this  TRUE     FALSE          7   1  
#>  8 The Sis… Dubl… Joyce… THE … time  NN    time  TRUE     FALSE          8   1  
#>  9 The Sis… Dubl… Joyce… THE … it    PRP   it    TRUE     FALSE          9   1  
#> 10 The Sis… Dubl… Joyce… THE … was   VBD   be    FALSE    FALSE          9   0.9
#> # ℹ 67,875 more rows
#> # ℹ 3 more variables: htr <dbl>, progress_words <int>, progress_percent <dbl>

Sentiment

add_sentiment() adds measurements of sentiment using the “Bing” lexicon by default.

sentiment_dubliners <- corpus_dubliners |> 
  add_sentiment()

sentiment_dubliners
#> # A tibble: 67,946 × 5
#>    title     author       doc_id      word  sentiment
#>    <chr>     <chr>        <fct>       <chr> <chr>    
#>  1 Dubliners Joyce, James The Sisters there NA       
#>  2 Dubliners Joyce, James The Sisters was   NA       
#>  3 Dubliners Joyce, James The Sisters no    NA       
#>  4 Dubliners Joyce, James The Sisters hope  NA       
#>  5 Dubliners Joyce, James The Sisters for   NA       
#>  6 Dubliners Joyce, James The Sisters him   NA       
#>  7 Dubliners Joyce, James The Sisters this  NA       
#>  8 Dubliners Joyce, James The Sisters time  NA       
#>  9 Dubliners Joyce, James The Sisters it    NA       
#> 10 Dubliners Joyce, James The Sisters was   NA       
#> # ℹ 67,936 more rows

Dropping empty rows

Since many words may not be found in a given sentiment lexicon, drop_na() makes it easy to remove empty rows.

sentiment_dubliners |> 
  drop_na(sentiment)
#> # A tibble: 3,868 × 5
#>    title     author       doc_id      word      sentiment
#>    <chr>     <chr>        <fct>       <chr>     <chr>    
#>  1 Dubliners Joyce, James The Sisters evenly    positive 
#>  2 Dubliners Joyce, James The Sisters dead      negative 
#>  3 Dubliners Joyce, James The Sisters darkened  negative 
#>  4 Dubliners Joyce, James The Sisters blind     negative 
#>  5 Dubliners Joyce, James The Sisters idle      negative 
#>  6 Dubliners Joyce, James The Sisters strangely negative 
#>  7 Dubliners Joyce, James The Sisters like      positive 
#>  8 Dubliners Joyce, James The Sisters like      positive 
#>  9 Dubliners Joyce, James The Sisters sinful    negative 
#> 10 Dubliners Joyce, James The Sisters fear      negative 
#> # ℹ 3,858 more rows

Choosing a sentiment lexicon

The lexicon can be chosen when sentiment is measured.

sentiment_ulysses <- ulysses |> 
  load_texts() |> 
  identify_by(section) |> 
  add_sentiment(lexicon = "nrc")

sentiment_ulysses |> 
  drop_na(sentiment)
#> # A tibble: 63,006 × 6
#>    title   author       part  doc_id word    sentiment   
#>    <chr>   <chr>        <chr> <fct>  <chr>   <chr>       
#>  1 Ulysses Joyce, James — I — [ 1 ]  stately positive    
#>  2 Ulysses Joyce, James — I — [ 1 ]  plump   anticipation
#>  3 Ulysses Joyce, James — I — [ 1 ]  buck    fear        
#>  4 Ulysses Joyce, James — I — [ 1 ]  buck    negative    
#>  5 Ulysses Joyce, James — I — [ 1 ]  buck    positive    
#>  6 Ulysses Joyce, James — I — [ 1 ]  buck    surprise    
#>  7 Ulysses Joyce, James — I — [ 1 ]  razor   fear        
#>  8 Ulysses Joyce, James — I — [ 1 ]  dark    sadness     
#>  9 Ulysses Joyce, James — I — [ 1 ]  fearful fear        
#> 10 Ulysses Joyce, James — I — [ 1 ]  fearful negative    
#> # ℹ 62,996 more rows

N-grams

Following the same pattern, add_ngrams() adds columns for n-length phrases of words. By default, it prepares bigrams (or 2-grams).

bigrams_joyce <- corpus_joyce |> 
  add_ngrams()

bigrams_joyce
#> # A tibble: 417,846 × 8
#>    doc_id    title     author       part        section original word_1 word_2
#>    <fct>     <chr>     <chr>        <chr>       <chr>   <chr>    <chr>  <chr> 
#>  1 Dubliners Dubliners Joyce, James THE SISTERS NA      There    there  was   
#>  2 Dubliners Dubliners Joyce, James THE SISTERS NA      was      was    no    
#>  3 Dubliners Dubliners Joyce, James THE SISTERS NA      no       no     hope  
#>  4 Dubliners Dubliners Joyce, James THE SISTERS NA      hope     hope   for   
#>  5 Dubliners Dubliners Joyce, James THE SISTERS NA      for      for    him   
#>  6 Dubliners Dubliners Joyce, James THE SISTERS NA      him      him    this  
#>  7 Dubliners Dubliners Joyce, James THE SISTERS NA      this     this   time  
#>  8 Dubliners Dubliners Joyce, James THE SISTERS NA      time:    time   it    
#>  9 Dubliners Dubliners Joyce, James THE SISTERS NA      it       it     was   
#> 10 Dubliners Dubliners Joyce, James THE SISTERS NA      was      was    the   
#> # ℹ 417,836 more rows

Other n-grams can be chosen by passing a vector of numbers.

trigrams_joyce <- corpus_joyce |> 
  add_ngrams(1:3)

trigrams_joyce
#> # A tibble: 417,846 × 9
#>    doc_id    title     author       part   section original word_1 word_2 word_3
#>    <fct>     <chr>     <chr>        <chr>  <chr>   <chr>    <chr>  <chr>  <chr> 
#>  1 Dubliners Dubliners Joyce, James THE S… NA      There    there  was    no    
#>  2 Dubliners Dubliners Joyce, James THE S… NA      was      was    no     hope  
#>  3 Dubliners Dubliners Joyce, James THE S… NA      no       no     hope   for   
#>  4 Dubliners Dubliners Joyce, James THE S… NA      hope     hope   for    him   
#>  5 Dubliners Dubliners Joyce, James THE S… NA      for      for    him    this  
#>  6 Dubliners Dubliners Joyce, James THE S… NA      him      him    this   time  
#>  7 Dubliners Dubliners Joyce, James THE S… NA      this     this   time   it    
#>  8 Dubliners Dubliners Joyce, James THE S… NA      time:    time   it     was   
#>  9 Dubliners Dubliners Joyce, James THE S… NA      it       it     was    the   
#> 10 Dubliners Dubliners Joyce, James THE S… NA      was      was    the    third 
#> # ℹ 417,836 more rows

Tf-idf

Unlike other measurements, term frequency–inverse document frequency doesn’t preserve word order, and it reduces documents to one instance of each token. Since tf-idf can’t merely add a column, summarize_tf_idf() avoids the add_ naming convention. Results are returned in descending order of tf-idf weight.

tfidf_dubliners <- corpus_dubliners |> 
  summarize_tf_idf()

tfidf_dubliners
#> # A tibble: 17,656 × 6
#>    doc_id                        word         n      tf   idf tf_idf
#>    <fct>                         <chr>    <int>   <dbl> <dbl>  <dbl>
#>  1 Clay                          maria       40 0.0151   2.71 0.0409
#>  2 Two Gallants                  corley      46 0.0117   2.71 0.0318
#>  3 After the Race                jimmy       24 0.0107   2.71 0.0291
#>  4 Ivy Day in the Committee Room henchy      53 0.0101   2.71 0.0272
#>  5 A Little Cloud                gallaher    48 0.00964  2.71 0.0261
#>  6 The Dead                      gabriel    142 0.00906  2.71 0.0245
#>  7 Grace                         kernan      66 0.00873  2.71 0.0236
#>  8 Ivy Day in the Committee Room o’connor    45 0.00854  2.71 0.0231
#>  9 A Little Cloud                chandler    41 0.00823  2.71 0.0223
#> 10 A Mother                      kearney     50 0.0110   2.01 0.0223
#> # ℹ 17,646 more rows

Tf-idf’s method understandably emphasizes proper nouns that are unique to each document. The remove_names argument in load_texts() can help to filter out words that appear only in capitalized form. Removing names from Dubliners makes a noticeable difference in tf-idf results:

tfidf_dubliners <- get_gutenberg_corpus(2814) |> 
  load_texts(remove_names = TRUE) |> 
  identify_by(part) |> 
  standardize_titles() |> 
  summarize_tf_idf()

tfidf_dubliners
#> # A tibble: 16,721 × 6
#>    doc_id         word         n      tf   idf  tf_idf
#>    <fct>          <chr>    <int>   <dbl> <dbl>   <dbl>
#>  1 The Dead       aunt       101 0.00693 1.61  0.0112 
#>  2 Araby          bazaar       9 0.00406 2.71  0.0110 
#>  3 The Sisters    aunt        19 0.00649 1.61  0.0104 
#>  4 After the Race cars         6 0.00286 2.71  0.00773
#>  5 An Encounter   we          58 0.0189  0.405 0.00767
#>  6 After the Race car         11 0.00524 1.32  0.00692
#>  7 Eveline        avenue       4 0.00225 2.71  0.00610
#>  8 Counterparts   weathers    11 0.00283 2.01  0.00570
#>  9 Counterparts   pa           8 0.00206 2.71  0.00557
#> 10 The Sisters    snuff        6 0.00205 2.71  0.00555
#> # ℹ 16,711 more rows

If load_texts() is used with pos = TRUE, proper nouns can be filtered, but these tags are sometimes inaccurate.
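
A minimal sketch, assuming the Penn Treebank tags shown earlier, in which NNP and NNPS mark singular and plural proper nouns:

get_gutenberg_corpus(2814) |> 
  load_texts(lemma = TRUE, pos = TRUE) |> 
  identify_by(part) |> 
  standardize_titles() |> 
  # NNP and NNPS are assumed here to tag proper nouns
  dplyr::filter(!pos %in% c("NNP", "NNPS")) |> 
  summarize_tf_idf()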

Preparing tables

tabulize() prepares tables for every kind of measurement. This consistency makes it easy to see and appreciate findings without struggling to recall a specialized function.

Corpus details

By default, tabulize() prepares a table showing the lengths of each document.

corpus_joyce |> 
  tabulize()
                                           words
Dubliners                                 67,945
A Portrait of the Artist as a Young Man   84,926
Ulysses                                  264,975

Word counts

Adding count = TRUE will show the counts of the most-frequent words.

corpus_joyce |> 
  tabulize(count = TRUE)
                                          word        n
Ulysses                                   the    14,952
                                          of      8,143
                                          and     7,210
                                          a       6,501
                                          to      4,960
                                          in      4,945
A Portrait of the Artist as a Young Man   the     5,913
                                          and     3,375
                                          of      3,148
                                          a       1,948
                                          to      1,929
                                          he      1,855
Dubliners                                 the     4,075
                                          and     2,234
                                          of      1,867
                                          to      1,753
                                          he      1,646
                                          a       1,582

Vocabulary richness

When used after add_vocabulary(), tabulize() prepares a clean summary table.

corpus_joyce |> 
  add_vocabulary() |> 
  tabulize()
                                           length   vocabulary        hapax
                                                    total   ratio     total   ratio
Dubliners                                  67,945    7,339   0.108     3,683   0.054
A Portrait of the Artist as a Young Man    84,926    9,177   0.108     4,581   0.054
Ulysses                                   264,975   29,959   0.113    16,331   0.062

Sentiment

For sentiment analysis, tabulize() returns a summary of figures for each document.

# dplyr is used here to choose a smaller example for comparison
sentiment_dubliners_part <- sentiment_dubliners |> 
  dplyr::filter(doc_id %in% c("The Sisters", "An Encounter",  "Araby"))

sentiment_dubliners_part |> 
  tabulize()
               sentiment       n       %
The Sisters    negative      110    3.53
               positive       69    2.22
                            2,934   94.25
An Encounter   negative       89    2.73
               positive       78    2.39
                            3,090   94.87
Araby          negative       86    3.67
               positive       41    1.75
                            2,218   94.58

Setting drop_na = TRUE removes rows without sentiment measure.

sentiment_dubliners_part |> 
  tabulize(drop_na = TRUE)
               sentiment      n       %
The Sisters    negative     110   61.45
               positive      69   38.55
An Encounter   negative      89   53.29
               positive      78   46.71
Araby          negative      86   67.72
               positive      41   32.28

The ignore parameter aids in selecting a subset of sentiments, converting the rest to NA.

# dplyr is used here to choose a smaller example for comparison
sentiment_ulysses_part <- sentiment_ulysses |> 
  dplyr::filter(doc_id %in% c("[ 1 ]", "[ 2 ]", "[ 3 ]"))

sentiment_ulysses_part |> 
  tabulize(ignore = c("anger", "anticipation", "disgust", "fear", "trust", "positive", "negative"))
         sentiment       n       %
[ 1 ]    joy           161    1.93
         sadness       124    1.48
         surprise      164    1.96
                     7,910   94.63
[ 2 ]    joy            65    1.32
         sadness        69    1.40
         surprise       64    1.30
                     4,735   95.99
[ 3 ]    joy            89    1.40
         sadness       114    1.79
         surprise       69    1.08
                     6,104   95.73

N-grams

After add_ngrams(), tabulize() returns the top n-grams per document. By default, the first six are shown for each group, but rows can be chosen freely.

bigrams_joyce |> 
  tabulize(rows = 1:2)
                                          ngram     n       %
Ulysses                                   of the    1,628   0.61
                                          in the    1,447   0.55
A Portrait of the Artist as a Young Man   of the      896   1.06
                                          in the      499   0.59
Dubliners                                 of the      507   0.75
                                          in the      353   0.52

Tf-idf

For data frames prepared with summarize_tf_idf(), tabulize() returns six rows of the top-scoring words for each document. This amount can be specified with the rows argument.

tfidf_dubliners |> 
  tabulize(rows = 1:3)
                                word           n        tf       idf    tf_idf
The Sisters                     aunt          19   0.00649   1.60944   0.01044
                                snuff          6   0.00205   2.70805   0.00555
                                me            35   0.01195   0.31015   0.00371
An Encounter                    we            58   0.01890   0.40547   0.00767
                                field          9   0.00293   1.32176   0.00388
                                us            27   0.00880   0.40547   0.00357
Araby                           bazaar         9   0.00406   2.70805   0.01100
                                uncle          5   0.00226   2.01490   0.00455
                                my            43   0.01941   0.22314   0.00433
Eveline                         avenue         4   0.00225   2.70805   0.00610
                                her           96   0.05402   0.06899   0.00373
                                mother’s       5   0.00281   1.32176   0.00372
After the Race                  cars           6   0.00286   2.70805   0.00773
                                car           11   0.00524   1.32176   0.00692
                                host           3   0.00143   2.70805   0.00387
Two Gallants                    peas           4   0.00107   2.70805   0.00289
                                ginger         5   0.00134   2.01490   0.00269
                                companion’s    3   0.00080   2.70805   0.00217
The Boarding House              reparation     5   0.00184   2.70805   0.00498
                                boarding       4   0.00147   2.70805   0.00398
                                bread          3   0.00110   2.70805   0.00299
A Little Cloud                  child         13   0.00282   1.09861   0.00309
                                whisky         7   0.00152   1.60944   0.00244
                                melancholy     6   0.00130   1.60944   0.00209
Counterparts                    weathers      11   0.00283   2.01490   0.00570
                                pa             8   0.00206   2.70805   0.00557
                                desk           8   0.00206   1.60944   0.00331
Clay                            tip            6   0.00237   2.01490   0.00478
                                cakes          4   0.00158   2.70805   0.00428
                                matron         4   0.00158   2.70805   0.00428
A Painful Case                  deceased       6   0.00170   2.70805   0.00460
                                engine         5   0.00142   2.70805   0.00383
                                paragraph      5   0.00142   2.70805   0.00383
Ivy Day in the Committee Room   he’s          26   0.00554   0.91629   0.00508
                                cigarette      7   0.00149   2.01490   0.00301
                                bottle        15   0.00320   0.91629   0.00293
A Mother                        concert       14   0.00337   1.60944   0.00543
                                baritone       8   0.00193   2.70805   0.00522
                                concerts      10   0.00241   2.01490   0.00485
Grace                           pope          13   0.00191   2.01490   0.00385
                                constable     11   0.00162   2.01490   0.00326
                                gentlemen     13   0.00191   1.09861   0.00210
The Dead                        aunt         101   0.00693   1.60944   0.01116
                                snow          20   0.00137   2.70805   0.00372
                                miss          64   0.00439   0.76214   0.00335

Preparing figures

tmtyro provides many functions for preparing figures, but only one is typically needed:

  • visualize() works intuitively with tmtyro objects, preparing figures suited to whatever work is being done.

Customization is easy:

  • change_colors() provides a single interface for modifying filled and colored layers.

Corpus details

By default, visualize() prepares a figure showing the lengths of each document.

corpus_joyce |> 
  visualize(inorder = FALSE)

Word counts

Adding count = TRUE will show the counts of the most-frequent words.

corpus_joyce |> 
  visualize(count = TRUE)

Vocabulary richness

When used after add_vocabulary(), visualize() charts each document by its length and the number of unique tokens. A figure like this is useful to compare documents by their rate of vocabulary growth.

corpus_dubliners |> 
  add_vocabulary() |> 
  visualize()

Other features, such as type-token ratio (“ttr”), hapax-token ratio (“htr”), or a sampling of hapax legomena (“hapax”) can also be shown.

vocab_dubliners |> 
  visualize("ttr")

corpus_joyce |> 
  add_vocabulary() |> 
  visualize("hapax")

Sentiment

For sentiment analysis, visualize() allows for comparison among documents in a set.

sentiment_dubliners |> 
  visualize()

The ignore parameter removes chosen values from the Y-axis to focus a figure.

sentiment_ulysses |> 
  visualize(ignore = c("anger", "anticipation", "disgust", "fear", "trust", "positive", "negative"))

N-grams

For n-grams, visualize() typically returns a network visualization inspired by the bigram network in Text Mining with R.

bigrams_joyce |> 
  visualize()

Combining n-grams

N-gram frequencies can be compared by combining them before visualization. A few arguments allow deviation from the typical chart, including rows to choose which rows are charted and color_y to set colors by the values on the Y-axis.

bigrams_joyce |> 
  dplyr::filter(word_1 == "he") |> 
  combine_ngrams() |> 
  visualize(rows = 1:5, color_y = TRUE)

Tf-idf

For data prepared with summarize_tf_idf(), visualize() returns bars showing the top words for each document. This can be a useful way to differentiate texts in a set. Because tfidf_dubliners was prepared with load_texts(remove_names = TRUE), the resulting chart shows clearer delineation of topics characteristic of the stories in Joyce’s collection:

tfidf_dubliners |> 
  visualize(rows = 1:4)

Changing colors

change_colors() does what its name implies. By default, it adopts the “Dark2” palette from ColorBrewer.

sentiment_dubliners |> 
  visualize() |> 
  change_colors()

Colors can be chosen manually.

library(ggraph)
bigrams_joyce |> 
  visualize(top_n = 60) |> 
  change_colors(c("#999999", 
                  "orange", 
                  "darkred"))

Optionally, use a named vector to set some colors by value instead of by order. By default, unnamed colors are gray.

bigrams_joyce |> 
  dplyr::filter(word_1 == "he") |> 
  combine_ngrams() |> 
  visualize(rows = 1:5, color_y = TRUE, reorder_y = TRUE) |> 
  change_colors(c(
    "he is" = "darkorange",
    "he has" = "orange"))

Unnamed colors fill in as needed.

bigrams_joyce |> 
  dplyr::filter(word_1 == "he") |> 
  combine_ngrams() |> 
  visualize(rows = 1:5, color_y = TRUE, reorder_y = TRUE) |> 
  change_colors(c(
    "he is" = "darkorange",
    "he has" = "orange", 
    "navy", "skyblue"))

Or choose a predetermined color set and palette, as described in function documentation.

tfidf_dubliners |> 
  visualize(rows = 1:4) |> 
  change_colors(colorset = "viridis", palette = "mako", direction = -1)