Build and load a corpus from Project Gutenberg — get_gutenberg_corpus

get_gutenberg_corpus() improves upon the functionality of gutenbergr::gutenberg_download() in three key ways.

Retrieving the ".htm" version of texts instead of the ".zip" version typically used by gutenbergr dramatically improves file coverage.
Parsing HTML headers allows texts to be studied by sections and chapters. Parsing is handled by parse_html(), with move_header_to_text() available for corrections.
Caching files locally avoids repeated downloads, thereby improving code portability, allowing offline access, and reducing network use.
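
The idea of studying a text by section can be illustrated with a small base R sketch. This is not the implementation of parse_html(); the toy HTML lines, the strip_tags() helper, and the column names chapter and text are all assumptions made for the example. Each h2 header starts a new chapter, and every paragraph line is labeled with the most recent header.

```r
# Illustration only: toy HTML lines, not a real Project Gutenberg file.
html_lines <- c(
  "<h1>A Toy Book</h1>",
  "<h2>Chapter 1</h2>",
  "<p>First line of chapter one.</p>",
  "<p>Second line of chapter one.</p>",
  "<h2>Chapter 2</h2>",
  "<p>Only line of chapter two.</p>"
)

# Hypothetical helper: drop anything that looks like an HTML tag.
strip_tags <- function(x) trimws(gsub("<[^>]+>", "", x))

is_header <- grepl("^<h2[ >]", html_lines)
is_body   <- grepl("^<p[ >]", html_lines)

# Each <h2> increments the counter, so cumsum() carries the current
# chapter index forward onto the lines that follow it.
chapter_index <- cumsum(is_header)
chapter_names <- strip_tags(html_lines[is_header])

# One row per body line, labeled with its chapter. This assumes no body
# line appears before the first <h2>, which holds for the toy data.
toy_corpus <- data.frame(
  chapter = chapter_names[chapter_index[is_body]],
  text    = strip_tags(html_lines[is_body]),
  stringsAsFactors = FALSE
)

print(toy_corpus)
```

Real Project Gutenberg HTML is far less regular than this, which is presumably why move_header_to_text() exists for corrections.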

To limit the load on Project Gutenberg's servers, a two-second delay is introduced between download attempts. This slows the initial acquisition of a corpus, but local caching makes subsequent use considerably faster.
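
The cache-then-download pattern described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the package's code: fetch_cached(), its fetch argument, and the ".rds" file naming are hypothetical, and a stand-in fetcher replaces the real network request so the example runs offline.

```r
# Hypothetical helper, not part of the package: return the cached copy when
# one exists; otherwise wait, fetch, and cache the result.
fetch_cached <- function(id, dir = "gutenberg", fetch, delay = 2) {
  path <- file.path(dir, paste0(id, ".rds"))  # file naming is an assumption
  if (file.exists(path)) {
    return(readRDS(path))                     # cache hit: no delay, no download
  }
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  Sys.sleep(delay)                            # polite pause before each download
  text <- fetch(id)
  saveRDS(text, path)
  text
}

# Stand-in for a real download, counting how often it is called.
counter <- new.env()
counter$n <- 0
fake_fetch <- function(id) {
  counter$n <- counter$n + 1
  paste("text of book", id)
}

cache_dir <- tempfile("gutenberg_demo_")

# delay = 0 keeps the demonstration fast; a real download would keep the pause.
first  <- fetch_cached(1, dir = cache_dir, fetch = fake_fetch, delay = 0)
second <- fetch_cached(1, dir = cache_dir, fetch = fake_fetch, delay = 0)

counter$n  # the fetcher ran once; the second call was served from disk
```

Because the delay sits only on the download path, cached reads pay no waiting cost, which is what makes repeated use fast.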

Usage

get_gutenberg_corpus(
  gutenberg_id,
  dir = "gutenberg",
  meta_fields = c("gutenberg_id", "title", "author"),
  html_title = FALSE,
  ...
)

Arguments

gutenberg_id: A vector of ID numbers from Project Gutenberg or a data frame containing a gutenberg_id column, such as from the results of a call to gutenbergr::gutenberg_works().
dir: The directory for storing downloaded .txt files. Default value is "gutenberg".
meta_fields: Additional fields to add from gutenbergr::gutenberg_metadata describing each book. By default, title and author are added.
html_title: Whether to take a document's title from the h1 header of its HTML file. The default, FALSE, takes the title from gutenbergr::gutenberg_metadata.
...: Additional parameters passed along to gutenbergr::gutenberg_strip().

Value

A data frame with one row for each line of the texts in the corpus.
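
To make the one-row-per-line shape concrete, here is a sketch using mock data in place of a real download. The gutenberg_id, title, and author columns follow the default meta_fields; the text column name is an assumption borrowed from gutenbergr's conventions, and the rows themselves are invented for illustration.

```r
# Mock data standing in for a returned corpus; not real Project Gutenberg text.
corpus <- data.frame(
  gutenberg_id = c(1, 1, 1, 2, 2),
  title  = c("Book A", "Book A", "Book A", "Book B", "Book B"),
  author = c("Author One", "Author One", "Author One",
             "Author Two", "Author Two"),
  text   = c("first line", "second line", "third line",
             "opening line", "closing line"),  # `text` is an assumed name
  stringsAsFactors = FALSE
)

# Because each row is one line, counting rows per title counts lines per book.
lines_per_title <- table(corpus$title)
print(lines_per_title)
```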

Examples

library(gutenbergr)

dalloway <- gutenberg_works(author == "Woolf, Virginia",
                            title == "Mrs Dalloway in Bond Street") |>
  get_gutenberg_corpus()