Prepare Texts

Functions for collecting, loading, and cleaning a corpus of texts.

Collecting Texts

get_corpus()
Prepare a corpus or corpora of texts
get_gutenberg_corpus()
Build and load a corpus from Project Gutenberg
get_micusp_corpus()
Get a MICUSP corpus
download_once()
Download a file once
micusp_metadata()
Get MICUSP metadata
parse_html()
Read HTML headers and text from file

Loading Texts

load_texts()
Load a folder or data frame of texts

Cleaning Text and Metadata

move_header_to_text()
Move a header column to text
identify_by()
Choose a new doc_id column
standardize_titles()
Standardize document titles
unnest_without_caps()
Split text into words and drop proper nouns

Measure Text Features

Functions for measuring features of texts and being choosy about how you do it.

add_dictionary()
Add values from a dictionary
add_frequency()
Add frequency of words or other features
add_index()
Index document row numbers
add_ngrams()
Add ngram columns
add_partitions()
Divide documents in equal lengths
add_sentiment()
Add sentiment markers
add_tf_idf()
Compare usage across a corpus
add_vocabulary()
Measure lexical variety
drop_na()
Drop rows containing missing values
drop_stopwords()
Remove stopwords
summarize_tf_idf()
Compare usage across a corpus
expand_documents()
Convert data frame from long tidy format to wider format
combine_ngrams()
Combine ngram columns
separate_ngrams()
Separate one word per column
make_dictionary()
Create a lexicon
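
As a rough illustration of the kind of measurement behind add_frequency() and add_tf_idf(), the following base R sketch computes term frequencies and tf-idf weights for a toy two-document corpus. It does not call the package's functions and does not reproduce their arguments or output. The data, the variable names, and the choice of a natural-log idf are all assumptions made for the example.

```r
# Toy corpus: one word per element, with a parallel vector of document ids.
# (Hypothetical data; not drawn from the package.)
words <- c("whale", "sea", "whale", "ship", "sea", "garden", "tea", "garden")
docs  <- c("a", "a", "a", "a", "b", "b", "b", "b")

# Document-by-term count matrix
counts <- table(docs, words)

# Term frequency: each word's share of its own document
tf <- prop.table(counts, margin = 1)

# Inverse document frequency: log(number of documents / documents containing the term)
# Assumption: natural log, no smoothing.
doc_freq <- colSums(counts > 0)
idf <- log(nrow(counts) / doc_freq)

# tf-idf: scale each term's column by its idf weight
tf_idf <- sweep(tf, 2, idf, `*`)

print(round(tf_idf, 3))
```

A word that appears in every document, such as "sea" here, gets an idf of zero and so drops out, while words concentrated in one document receive the highest weights.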

Model Topics

Model complex relationships in a corpus.

load_topic_model()
Load (or cache and load) a topic model
make_topic_model()
Construct a topic model

Explore Results

Generic functions make it easy to share results with an audience (or keep them to yourself).

contextualize()
Show a term in context
tabulize()
Prepare a table of data
visualize()
Visualize output

Adjusting tables and figures

collapse_rows()
Collapse gt rows in the style of kableExtra
change_colors()
Choose other colors

Vectorized functions

get_cumulative_vocabulary()
Cumulative total of vocabulary size
get_frequency() get_tf()
Get frequencies of values in a vector
get_hir()
Cumulative hapax introduction ratio
get_htr()
Cumulative hapax-token ratio
get_idf_by()
Get inverse document frequencies of values in one vector, x, categorized by another vector, by
get_match()
Get dictionary matches of values in a vector
get_sentiment()
Get sentiment matches of values in a vector
get_tf_by()
Get term frequencies of values in one vector, x, categorized by another vector, by
get_tfidf_by()
Term frequency–inverse document frequency
get_ttr()
Cumulative type-token ratio
is_hapax()
Check for hapax legomena
is_new()
Check for new words in a vocabulary
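
The cumulative measures listed above can be sketched in a few lines of base R. This is a minimal illustration of the underlying ideas (first occurrences, running vocabulary size, type-token ratio, hapax legomena), not the package's implementation. The variable names are hypothetical, and the functions above may differ in how they handle case, missing values, or other details.

```r
# A short sequence of tokens in reading order (hypothetical data)
words <- c("the", "cat", "saw", "the", "dog", "the", "end")

# A word is "new" the first time it appears in the sequence
is_new_word <- !duplicated(words)

# Cumulative vocabulary size: running count of distinct words seen so far
vocab_size <- cumsum(is_new_word)

# Cumulative type-token ratio: distinct words so far / total words so far
ttr <- vocab_size / seq_along(words)

# Hapax legomena: words that occur exactly once in the whole vector
word_counts <- table(words)
is_hapax_word <- words %in% names(word_counts)[word_counts == 1]

print(data.frame(words, is_new_word, vocab_size, ttr, is_hapax_word))
```

Because each measure is computed over a plain vector, the same expressions can be applied per document inside a grouped data frame.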

Data

Data included

pos_tags
Part of speech tags