get_gutenberg_corpus() improves upon the functionality of gutenbergr::gutenberg_download() in three key ways.
- Retrieving the ".htm" version of texts instead of the ".zip" version typically used by gutenbergr dramatically improves file coverage.
- Parsing HTML headers allows texts to be studied by sections and chapters. Parsing is handled by parse_html(), with move_header_to_text() available for corrections.
- Caching files locally avoids repeated downloads, thereby improving code portability, allowing offline access, and reducing network use.
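To illustrate what studying a text by sections can look like, here is a minimal base R sketch that labels each paragraph with the most recent header above it. It is not the implementation of parse_html(); the toy input, the strip_tags() helper, and the output column names are assumptions made for illustration only.

```r
# Toy input: simplified HTML lines, not real Project Gutenberg markup.
html_lines <- c(
  "<h2>Chapter 1</h2>", "<p>First paragraph.</p>", "<p>Second paragraph.</p>",
  "<h2>Chapter 2</h2>", "<p>Third paragraph.</p>"
)

# Hypothetical helper: drop anything that looks like an HTML tag.
strip_tags <- function(x) gsub("<[^>]+>", "", x)

# Flag header lines (h1 to h6) and count how many have been seen so far.
is_header <- grepl("^<h[1-6][ >]", html_lines)
headers <- strip_tags(html_lines[is_header])
section_index <- cumsum(is_header)

# Lines before the first header get NA as their section label.
labels <- c(NA_character_, headers)[section_index + 1]

# Keep only body lines, each paired with its section label.
sections <- data.frame(
  section = labels[!is_header],
  text = strip_tags(html_lines[!is_header]),
  stringsAsFactors = FALSE
)
```

Real HTML files are messier than this toy input, which is presumably why a correction helper such as move_header_to_text() is provided.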
These changes are made with consideration for server bandwidth, so a two-second delay is introduced between download attempts. This slows the initial acquisition of a corpus, but local caching makes subsequent use considerably faster.
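The cache-then-delay pattern described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used by get_gutenberg_corpus(): the helper name cached_fetch() and its arguments are hypothetical, and the way the cached file is named here (the last component of the URL) is a simplification.

```r
# Hypothetical helper: download a file once, then reuse the local copy.
# `fetch` defaults to utils::download.file but can be swapped out for testing.
cached_fetch <- function(url, dir = "gutenberg", delay = 2,
                         fetch = utils::download.file) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  path <- file.path(dir, basename(url))
  if (!file.exists(path)) {
    # Only a cache miss touches the network.
    fetch(url, path, quiet = TRUE)
    # Pause after each real download to limit load on the server.
    Sys.sleep(delay)
  }
  path
}
```

Because the pause applies only to cache misses, a second call for the same URL returns immediately and works offline.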
Usage
get_gutenberg_corpus(
  gutenberg_id,
  dir = "gutenberg",
  meta_fields = c("gutenberg_id", "title", "author"),
  html_title = FALSE,
  ...
)
Arguments
- gutenberg_id
A vector of ID numbers from Project Gutenberg or a data frame containing a gutenberg_id column, such as from the results of a call to gutenbergr::gutenberg_works().
- dir
The directory for storing downloaded .txt files. Default value is "gutenberg".
- meta_fields
Additional fields to add from gutenbergr::gutenberg_metadata describing each book. By default, title and author are added.
- html_title
Whether to use the h1 header from an HTML file to determine a document's title. By default, uses gutenbergr::gutenberg_metadata.
- ...
Additional parameters passed along to gutenbergr::gutenberg_strip().
Examples
library(gutenbergr)
dalloway <- gutenberg_works(author == "Woolf, Virginia",
                            title == "Mrs Dalloway in Bond Street") |>
  get_gutenberg_corpus()
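
As the Arguments section notes, gutenberg_id may also be a plain vector of ID numbers. The following sketch passes two IDs directly and sets the remaining documented arguments explicitly. The IDs, the extra "language" metadata field, and the object name are choices made for illustration; the call needs network access on first use, and the package providing get_gutenberg_corpus() must be loaded.

```r
library(gutenbergr)
# Also load the package that provides get_gutenberg_corpus().

# 1342 and 84 are illustrative Project Gutenberg ID numbers.
two_novels <- get_gutenberg_corpus(
  c(1342, 84),
  dir = "gutenberg",                       # cache location (the default)
  meta_fields = c("gutenberg_id", "title", "author", "language"),
  html_title = TRUE                        # take titles from the h1 header
)
```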