get_corpus() works nearly identically to load_texts(), with two differences. First, it adds a "corpus" column to the resulting table to help with record keeping. Second, it can cache its output in a local RDS file saved in the project directory.
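The two differences can be illustrated with a minimal base R sketch. This is not tmtyro's implementation: `load_with_cache()`, `toy_loader()`, the `dir` argument, and the `<corpus>.rds` naming scheme are all hypothetical and exist only to show the idea of tagging rows with a corpus name and reusing a cached RDS file.

```r
# Hypothetical helper, not part of tmtyro: load a table once, tag it with a
# "corpus" column, and reuse a cached RDS copy on later calls.
load_with_cache <- function(corpus, loader, cache = TRUE, dir = tempdir()) {
  # Assumed cache naming scheme: "<corpus>.rds" inside `dir`.
  cache_file <- file.path(dir, paste0(corpus, ".rds"))

  # Reuse the cached copy when one exists.
  if (cache && file.exists(cache_file)) {
    return(readRDS(cache_file))
  }

  result <- loader(corpus)
  result$corpus <- corpus  # record which corpus each row came from

  if (cache) {
    saveRDS(result, cache_file)
  }
  result
}

# Stand-in for an expensive loading step; counts how often it actually runs.
calls <- 0
toy_loader <- function(corpus) {
  calls <<- calls + 1
  data.frame(doc_id = c("a", "a", "b"), word = c("one", "two", "three"))
}

cache_dir <- file.path(tempdir(), "corpus_cache_demo")
dir.create(cache_dir, showWarnings = FALSE)
unlink(file.path(cache_dir, "demo.rds"))  # start from a clean cache

first  <- load_with_cache("demo", toy_loader, dir = cache_dir)
second <- load_with_cache("demo", toy_loader, dir = cache_dir)
```

In this sketch the second call reads the RDS file instead of running the loader again, which is the kind of saving that caching is meant to provide when slow options are enabled.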
Usage
get_corpus(
  corpus,
  name = ".txt",
  word = TRUE,
  lemma = FALSE,
  lemma_replace = FALSE,
  to_lower = TRUE,
  remove_names = FALSE,
  pos = FALSE,
  poetry = FALSE,
  paragraph = TRUE,
  cache = TRUE
)
Arguments
- corpus
Vector of any length, where each value is either a string identifying a directory of texts or the first part of the filename of a cached RDS file prepared by tmtyro.
- name
What naming pattern to search for in this folder. Defaults to ".txt".
- word
Whether to split one word per line. Defaults to TRUE.
- lemma
Whether to lemmatize the text. When word is TRUE, adds a new column called lemma. This step can add a lot of time, so it defaults to FALSE.
- lemma_replace
When lemma and word are both TRUE, toggles whether to replace the word column with the lemmatized tokens. Defaults to FALSE.
- to_lower
When word is TRUE, toggles whether to convert all words to lowercase. Defaults to TRUE.
- remove_names
When word is TRUE, toggles whether to remove words that appear only with an initial capital. Defaults to FALSE.
- pos
Whether to add a column for part-of-speech tag. This step can add a lot of time, so it defaults to FALSE.
- poetry
Whether to detect and indicate stanza breaks and line breaks. Defaults to FALSE.
- paragraph
Whether to detect paragraph breaks for prose. Defaults to TRUE.
- cache
Whether to save a cached copy of the corpus. Some options, like pos = TRUE and lemma = TRUE, can add significant time to corpus preparation, so setting cache = TRUE avoids repeating those steps each time a corpus is loaded. Defaults to TRUE.
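The remove_names description above can be read as "drop any word that never occurs in lowercase form." The following base R sketch shows one way that rule might work. It is an assumption about the behavior, not tmtyro's actual code, and `drop_capitalized_only()` is a hypothetical name.

```r
# Hypothetical sketch, not tmtyro's implementation: drop tokens that never
# occur without an initial capital, a rough proxy for proper names.
drop_capitalized_only <- function(words) {
  starts_upper <- grepl("^[[:upper:]]", words)
  key <- tolower(words)
  # A lowercase key is "name-like" when every occurrence starts uppercase.
  always_upper <- tapply(starts_upper, key, all)
  name_like <- names(always_upper)[always_upper]
  words[!key %in% name_like]
}

tokens <- c("Emma", "was", "happy", "The", "day", "the", "Emma", "Highbury")
kept <- drop_capitalized_only(tokens)
```

Here "The" survives because "the" also appears in lowercase, while "Emma" and "Highbury" are dropped. A rule like this would also drop words such as "I", which is one reason to treat the sketch as illustrative only.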
Examples
if (FALSE) { # \dontrun{
  austen <- get_corpus("austen")

  shakespeare <- get_corpus(
    c("comedy",
      "history",
      "tragedy"))
} # }