add_vocabulary() augments a tidy text table with columns describing the lexical variety of the corpus. It flags first occurrences and hapax legomena, tracks the running size of the vocabulary, and adds ratios that report these measurements in relation to document size.
Value
A data frame with 8 added columns, the first three logical and the rest numeric:
new_word (logical) Indicates whether this is the first instance of a given word
hapax_doc (logical) Indicates whether this word is the only occurrence of a given word, or hapax legomenon, at the document level
hapax_corpus (logical) Indicates whether this word is the only occurrence of a given word, or hapax legomenon, at the corpus level
vocabulary (integer) Running count of distinct words used
ttr (double) Type-token ratio, derived from the running count of distinct words divided by the total number of words used
hir (double) Hapax introduction ratio, derived from the running count of hapax legomena divided by the total number of words used
progress_words (integer) Running count of total words used so far in a document
progress_percent (double) Words used so far as a percentage of the total number of words used in a document
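To make the column definitions concrete, the following base R sketch reproduces the document-level measures for a single toy document. It is an illustration only, not the package's implementation: the `words` vector is made up, "total number of words used" is read as the running total so far, and `progress_percent` is assumed to be on a 0 to 100 scale. `hapax_corpus` is omitted because it needs counts across every document in the corpus.

```r
# Hypothetical toy document: one token per element, in reading order.
words <- c("the", "cat", "saw", "the", "dog", "and", "the", "bird")

# TRUE the first time each word appears.
new_word <- !duplicated(words)

# TRUE for words that occur exactly once in this document.
counts <- table(words)
hapax_doc <- as.vector(counts[words]) == 1

# Running count of distinct words, and running count of all words.
vocabulary <- cumsum(new_word)
progress_words <- seq_along(words)

# Type-token ratio: distinct words so far over all words so far (assumed reading).
ttr <- vocabulary / progress_words

# Hapax introduction ratio: hapax legomena so far over all words so far (assumed reading).
hir <- cumsum(hapax_doc) / progress_words

# Position in the document as a percentage (assumed 0 to 100 scale).
progress_percent <- 100 * progress_words / length(words)

toy <- data.frame(
  word = words, new_word, hapax_doc, vocabulary,
  ttr, hir, progress_words, progress_percent
)
```

In this toy vector "the" appears three times, so it is flagged as new only once and never as a hapax, and the type-token ratio at the last word is 6 / 8 = 0.75.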
Examples
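if (FALSE) { # \dontrun{
# Build a corpus from Project Gutenberg text 2814
dubliners <- get_gutenberg_corpus(2814) |>
  load_texts() |>
  identify_by(part) |>
  standardize_titles()

# Add the vocabulary columns and preview the first rows
dubliners |>
  add_vocabulary() |>
  head()
} # }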
