load_texts() loads a corpus from a folder of texts or a data frame and prepares it for further study using tidytext principles. By default, load_texts() adds paragraph numbers (suitable for prose) and unnests at the word level. Options exist to change these defaults for poetry, to avoid unnesting, to remove words that seem like proper nouns, or to apply natural language processing techniques for lemmatizing words or tagging their parts of speech.
Usage
load_texts(
  src = "data",
  name = ".txt",
  word = TRUE,
  lemma = FALSE,
  lemma_replace = FALSE,
  to_lower = TRUE,
  remove_names = FALSE,
  pos = FALSE,
  keep_original = FALSE,
  poetry = FALSE,
  paragraph = TRUE,
  n = 1L,
  ...
)
Arguments
- src
Either a string identifying the name of a directory containing texts or a data frame containing an unnested column called "text" and one column with a name ending in "_id". Files should either be stored in a directory within the project folder or under the subdirectory called "data". Defaults to "data" to load texts from that directory.
- name
What naming pattern to search for in this folder. Defaults to ".txt".
- word
Whether to split one word per line. Defaults to TRUE.
- lemma
Whether to lemmatize the text. When word is TRUE, adds a new column called lemma. This step can add a lot of time, so it defaults to FALSE.
- lemma_replace
When lemma and word are both TRUE, toggles whether to replace the word column with the lemmatized tokens. Defaults to FALSE.
- to_lower
When word is TRUE, toggles whether to convert all words to lowercase. Defaults to TRUE.
- remove_names
When word is TRUE, toggles whether to remove words that appear only with an initial capital letter. Defaults to FALSE.
- pos
Whether to add a column for part-of-speech tag. This step can add a lot of time, so it defaults to FALSE.
- keep_original
Whether to try to retain the original punctuation and capitalization in a parallel column. This won't always work, so it defaults to FALSE.
- poetry
Whether to detect and indicate stanza breaks and line breaks. Defaults to FALSE.
- paragraph
Whether to detect paragraph breaks for prose. Defaults to TRUE.
- n
The number of words per row. By default, load_texts() unnests a text one word at a time using a column called word. When n is a value greater than 1, load_texts() will instead use tidytext::unnest_tokens() with token = "ngrams" to create a column called ngram.
- ...
Additional arguments passed along to tidytext::unnest_tokens() for use with tokenizers.
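The n argument is described above as handing off to tidytext::unnest_tokens() with token = "ngrams". As a rough illustration of that underlying call only (not of load_texts() itself), the sketch below applies it directly to a hypothetical one-row data frame. The doc_id value and the sample sentence are invented for the example.

```r
library(tibble)
library(tidytext)

# Hypothetical one-document corpus, invented for illustration
corpus <- tibble(
  doc_id = "sample",
  text = "The quick brown fox jumps"
)

# Tokenize into overlapping two-word sequences in a column called `ngram`
bigrams <- unnest_tokens(corpus, ngram, text, token = "ngrams", n = 2)

print(bigrams$ngram)
# "the quick" "quick brown" "brown fox" "fox jumps"
```

Five words yield four overlapping bigrams, and the tokens are lowercased because unnest_tokens() lowercases by default.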
Value
A data frame with two to five columns and one row for each token (optionally, one row for each paragraph or one row for each line) in the corpus.
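When src is a data frame rather than a directory, the argument description asks for an unnested column called "text" and one column whose name ends in "_id". The sketch below builds a hypothetical input of that shape and checks it against those two requirements. The column name doc_id, the document labels, and the sample sentences are assumptions made for illustration, and the call to load_texts() runs only if the function is available in the session.

```r
# Hypothetical two-document corpus; names and sentences are invented
corpus_df <- data.frame(
  doc_id = c("letter-1", "letter-2"),
  text = c("It was a dark and stormy night.",
           "The rain fell in torrents."),
  stringsAsFactors = FALSE
)

# Checks mirroring the documented requirements for a data-frame `src`
has_text <- "text" %in% names(corpus_df)
id_cols <- grep("_id$", names(corpus_df), value = TRUE)
stopifnot(has_text, length(id_cols) == 1)

# Only attempt the call when load_texts() is actually attached
if (exists("load_texts")) {
  tidy_corpus <- load_texts(corpus_df)
}
```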
Examples
if (FALSE) { # \dontrun{
mysteries <-
  load_texts("mystery-novels")

dickinson <-
  load_texts("dickinson-poems",
             poetry = TRUE)

# `load_texts()` can also be used with
# a traditional tidytext workflow:
mysteries <-
  load_texts("mystery-novels",
             word = FALSE,
             to_lower = FALSE) |>
  tidytext::unnest_tokens(word, text)
} # }