Divide documents into equal lengths
Usage
add_partitions(
  df,
  size = 1000,
  overlap = 0,
  minimum = 0.25,
  by = doc_id,
  character = FALSE
)
Arguments
- df
A tidy data frame, potentially containing a column called "word"
- size
Size of each partition
- overlap
Amount by which each partition should overlap the next. If a value between 0 and 1 is used,
overlap will be calculated as a percentage of size (for example, 0.1 means 10% of size).
- minimum
Minimum partition size. If a value between 0 and 1 is used,
minimum will be calculated as a percentage of size (the default, 0.25, means 25% of size).
- by
A column containing document grouping
- character
Whether to return a partition column as a character vector with zeroes added for padding. This feature may be helpful if using identify_by() to consider partition when defining documents in a corpus.
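The arithmetic behind fractional overlap and minimum values, and the effect of zero padding, can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: resolve_fraction() is a hypothetical helper, the rounding rule is assumed, and the padding width is chosen here from the largest partition number.

```r
# Hypothetical helper, not part of the package: turn an `overlap` or
# `minimum` argument into a count relative to `size`.
resolve_fraction <- function(value, size) {
  # Assumption: values strictly between 0 and 1 are fractions of `size`;
  # anything else is already a count. The rounding is also an assumption.
  if (value > 0 && value < 1) round(value * size) else value
}

size <- 1000
resolve_fraction(0.25, size)  # default `minimum`: 0.25 * 1000
resolve_fraction(0, size)     # default `overlap`: no overlap
resolve_fraction(50, size)    # values of 1 or more pass through unchanged

# Zero-padded labels, roughly what `character = TRUE` is described as
# returning. The width is taken from the largest partition number.
partition <- c(1L, 2L, 10L)
padded <- formatC(partition, width = nchar(max(partition)), flag = "0")
padded

sort(as.character(partition))  # unpadded labels sort as "1", "10", "2"
sort(padded)                   # padded labels keep their numeric order
```

Padding matters mainly when partition labels are compared or sorted as text, which is one reason a character column can be convenient when partition feeds into document identifiers.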
Examples
dubliners <- get_gutenberg_corpus(2814) |>
  load_texts() |>
  identify_by(part) |>
  standardize_titles()

dubliners |>
  add_partitions() |>
  head()
#> # A tibble: 6 × 6
#>   doc_id      title     author       part        partition word
#>   <fct>       <chr>     <chr>        <chr>           <int> <chr>
#> 1 The Sisters Dubliners Joyce, James THE SISTERS         1 there
#> 2 The Sisters Dubliners Joyce, James THE SISTERS         1 was
#> 3 The Sisters Dubliners Joyce, James THE SISTERS         1 no
#> 4 The Sisters Dubliners Joyce, James THE SISTERS         1 hope
#> 5 The Sisters Dubliners Joyce, James THE SISTERS         1 for
#> 6 The Sisters Dubliners Joyce, James THE SISTERS         1 him
