Package index • tmtyro

Prepare Texts

Functions for collecting, loading, and cleaning a corpus of texts.

Collecting Texts

get_corpus(): Prepare a corpus or corpora of texts

get_gutenberg_corpus(): Build and load a corpus from Project Gutenberg

get_micusp_corpus(): Get a MICUSP corpus

download_once(): Download a file once

micusp_metadata(): Get MICUSP metadata

parse_html(): Read HTML headers and text from file

Loading Texts

load_texts(): Load a folder or data frame of texts

Cleaning Text and Metadata

move_header_to_text(): Move a header column to text

identify_by(): Choose a new doc_id column

standardize_titles(): Standardize document titles

unnest_without_caps(): Split text into words and drop proper nouns

Measure Text Features

Functions for measuring features of texts and being choosy about how you do it.

add_dictionary(): Add values from a dictionary

add_frequency(): Add frequency of words or other features

add_index(): Index document row numbers

add_ngrams(): Add ngram columns

add_partitions(): Divide documents into equal lengths

add_sentiment(): Add sentiment markers

add_tf_idf(): Compare usage across a corpus

add_vocabulary(): Measure lexical variety

drop_na(): Drop rows containing missing values

drop_stopwords(): Remove stopwords

summarize_tf_idf(): Compare usage across a corpus

expand_documents(): Convert a data frame from long tidy format to wider format

combine_ngrams(): Combine ngram columns

separate_ngrams(): Separate one word per column

make_dictionary(): Create a lexicon

Model Topics

Model complex relationships in a corpus.

load_topic_model(): Load (or cache and load) a topic model

make_topic_model(): Construct a topic model

Explore Results

Generic functions make it easy to share results with an audience (or keep them to yourself).

contextualize(): Show a term in context

tabulize(): Prepare a table of data

visualize(): Visualize output

Adjusting Tables and Figures

collapse_rows(): Collapse gt rows in the style of kableExtra

change_colors(): Choose other colors

Vectorized Functions

get_cumulative_vocabulary(): Cumulative total of vocabulary size

get_frequency(), get_tf(): Get frequencies of values in a vector

get_hir(): Cumulative hapax introduction ratio

get_htr(): Cumulative hapax-token ratio

get_idf_by(): Get inverse document frequencies of values in one vector (x), categorized by another vector (by)

get_match(): Get dictionary matches of values in a vector

get_sentiment(): Get sentiment matches of values in a vector

get_tf_by(): Get term frequencies of values in one vector (x), categorized by another vector (by)

get_tfidf_by(): Term frequency–inverse document frequency

get_ttr(): Cumulative type-token ratio

is_hapax(): Check for hapax legomena

is_new(): Check for new words in a vocabulary
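
Several of the vectorized measures above are running statistics over a word vector. The sketch below illustrates the general ideas behind new words, cumulative vocabulary size, type-token ratio, and hapax legomena using base R only. It does not call tmtyro, and the package's own functions may define or normalize these measures differently; the variable names are hypothetical and chosen for illustration.

```r
# Illustration in base R only; none of these lines call tmtyro itself.
words <- c("the", "cat", "saw", "the", "dog", "and", "the", "cat")

# TRUE the first time each word appears (a "new" word in the vocabulary)
first_seen <- !duplicated(words)

# Running count of distinct words (types) seen so far
cumulative_vocab <- cumsum(first_seen)

# Running types divided by running tokens
cumulative_ttr <- cumulative_vocab / seq_along(words)

# Words that occur exactly once in the whole vector (hapax legomena)
hapax <- !(duplicated(words) | duplicated(words, fromLast = TRUE))
```

In this toy vector the vocabulary stops growing after the sixth word, so the running ratio falls from that point on as repeated words add tokens without adding types.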

Data

Data included

pos_tags: Part of speech tags