get_gutenberg_corpus() improves upon the functionality
of gutenbergr::gutenberg_download() in three key ways.
- Retrieving the ".htm" version of texts instead of the ".zip" version typically used by gutenbergr dramatically improves file coverage.
- Parsing HTML headers allows texts to be studied by sections and chapters. Parsing is handled by parse_html(), with move_header_to_text() available for corrections.
- Caching files locally avoids repeated downloads, thereby improving code portability, allowing offline access, and reducing network use.
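As a rough illustration of why header parsing makes section-level study possible, the toy sketch below tags each paragraph with the most recent chapter header. It uses hand-written HTML lines and base R regular expressions only; it is not how parse_html() is implemented, and the object names (html, tag, sections) are hypothetical.

```r
# Toy input: one HTML element per line (an assumption made for simplicity).
html <- c("<h1>Title</h1>",
          "<h2>Chapter I</h2>",
          "<p>First paragraph.</p>",
          "<p>Second.</p>",
          "<h2>Chapter II</h2>",
          "<p>Third.</p>")

# Extract the tag name and the text content of each line.
tag  <- sub("^<([a-z0-9]+)>.*$", "\\1", html)
text <- gsub("<[^>]+>", "", html)

# Count h2 headers seen so far; each line belongs to the latest one.
is_header <- tag == "h2"
header_no <- cumsum(is_header)
chapter   <- ifelse(header_no == 0, NA, text[is_header][pmax(header_no, 1)])

# Keep only paragraph rows, each labeled with its chapter.
sections <- data.frame(chapter = chapter[tag == "p"],
                       text    = text[tag == "p"],
                       stringsAsFactors = FALSE)
```

With each row labeled this way, text can be grouped or filtered by chapter. Real Project Gutenberg HTML is far less regular than this toy input, which is presumably why a correction helper such as move_header_to_text() is provided.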
All changes are made with consideration for server bandwidth, so a two-second delay is introduced between each download attempt. This will slow down the initial acquisition of corpora, but offline caching speeds things up considerably in subsequent use.
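The cache-then-delay behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's actual code: cached_fetch(), its arguments, the default directory name, and the URL pattern are all hypothetical, and the downloader is injectable so the logic can be tried without network access.

```r
# Hypothetical helper illustrating local caching with a polite delay.
cached_fetch <- function(id, dir = "gutenberg", delay = 2,
                         fetch = function(url, dest) {
                           utils::download.file(url, dest, quiet = TRUE)
                         }) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dir, paste0(id, ".htm"))

  # Cache hit: reuse the local copy, with no network use and no delay.
  if (file.exists(dest)) {
    return(dest)
  }

  # Assumed URL pattern; the real function may build its URLs differently.
  url <- sprintf("https://www.gutenberg.org/files/%s/%s-h/%s-h.htm", id, id, id)
  fetch(url, dest)

  # Pause after each download attempt to limit load on the server.
  Sys.sleep(delay)
  dest
}
```

Calling the helper twice with the same ID triggers only one download; the second call returns the cached path immediately, which is where the speedup in later sessions comes from.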
Arguments
- gutenberg_id
A vector of ID numbers from Project Gutenberg or a data frame containing a
gutenberg_id column, such as the result of a call to gutenbergr::gutenberg_works()
- download
Whether files should be automatically downloaded into a project subdirectory as needed (the default), always downloaded into the project folder, temporarily downloaded once per session, or never downloaded
- dir
The project subdirectory for storing downloaded .htm files
- meta_fields
Additional fields to add from gutenbergr::gutenberg_metadata describing each book
- html_title
Whether to use the h1 header from an HTML file to determine a document's title instead of gutenbergr::gutenberg_metadata
- ...
Additional parameters passed along to
gutenbergr::gutenberg_strip()
Examples
if (FALSE) { # \dontrun{
library(gutenbergr)
dalloway <- gutenberg_works(author == "Woolf, Virginia",
                            title == "Mrs Dalloway in Bond Street") |>
  get_gutenberg_corpus()
} # }
