Divide documents in equal lengths
Usage
add_partitions(
  df,
  size = 1000,
  overlap = 0,
  minimum = 0.25,
  by = doc_id,
  character = FALSE
)
Arguments
- df
A tidy data frame, potentially containing a column called "word"
- size
Size of each partition
- overlap
Amount by which consecutive partitions should overlap. If a value between 0 and 1 is used, overlap will be calculated as a fraction of size.
- minimum
Minimum partition size. If a value between 0 and 1 is used, minimum will be calculated as a fraction of size.
- by
A column containing document grouping
- character
Whether to return the partition column as a character vector with zeroes added for padding. This feature may be helpful if using identify_by() to consider partition when defining documents in a corpus.
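The way fractional values of overlap and minimum relate to size can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: resolve_size() is a hypothetical helper, the rounding is assumed, and the handling of short final partitions (the minimum argument) is not shown.

```r
# Hypothetical helper, not part of the package: turn a value between
# 0 and 1 into an absolute size, as the argument descriptions suggest.
resolve_size <- function(value, size) {
  if (value > 0 && value < 1) {
    round(value * size)  # assumed rounding; the package may differ
  } else {
    value                # absolute sizes pass through unchanged
  }
}

size <- 1000
min_size     <- resolve_size(0.25, size)  # the default minimum
overlap_size <- resolve_size(0.1, size)   # an overlap given as a fraction
fixed_size   <- resolve_size(50, size)    # an overlap given in absolute terms

# With no overlap, equal-length partitions amount to dividing each
# word's position within a document by size and rounding up.
n_words <- 2500
partition <- ceiling(seq_len(n_words) / size)
counts <- table(partition)
print(counts)
```

Here the last partition holds only the words left over after the full ones, which is the case the minimum argument is meant to address.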
Examples
dubliners <- get_gutenberg_corpus(2814) |>
  load_texts() |>
  identify_by(part) |>
  standardize_titles()

dubliners |>
  add_partitions() |>
  head()
#> # A tibble: 6 × 6
#>   doc_id      title     author       part        partition word
#>   <fct>       <chr>     <chr>        <chr>           <int> <chr>
#> 1 The Sisters Dubliners Joyce, James THE SISTERS         1 there
#> 2 The Sisters Dubliners Joyce, James THE SISTERS         1 was
#> 3 The Sisters Dubliners Joyce, James THE SISTERS         1 no
#> 4 The Sisters Dubliners Joyce, James THE SISTERS         1 hope
#> 5 The Sisters Dubliners Joyce, James THE SISTERS         1 for
#> 6 The Sisters Dubliners Joyce, James THE SISTERS         1 him