Divide documents into equal lengths

Usage

add_partitions(
  df,
  size = 1000,
  overlap = 0,
  minimum = 0.25,
  by = doc_id,
  character = FALSE
)

Arguments

df

A tidy data frame, potentially containing a column called "word"

size

Size of each partition

overlap

Amount by which each partition should overlap the next. If a value between 0 and 1 is used, overlap is calculated as a proportion of size.

minimum

Minimum partition size. If a value between 0 and 1 is used, minimum is calculated as a proportion of size.
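
The fractional form of overlap and minimum can be illustrated with a small base R sketch. The helper `resolve_setting()` is hypothetical and is not part of the package. It assumes that values strictly between 0 and 1 are multiplied by size and rounded; the package may round differently or treat the boundary values in another way.

```r
# Hypothetical helper, not part of the package: resolve a setting that may be
# given either as an absolute count or as a fraction of `size`.
resolve_setting <- function(value, size) {
  # Assumption: values strictly between 0 and 1 are fractions of `size`
  if (value > 0 && value < 1) {
    return(round(value * size))
  }
  # Anything else is taken as an absolute count
  value
}

resolve_setting(0.25, size = 1000)  # 250, the default minimum at the default size
resolve_setting(100, size = 1000)   # 100, already an absolute count
resolve_setting(0, size = 1000)     # 0, the default overlap
```

Under these assumptions, the defaults shown in Usage (size = 1000, minimum = 0.25) correspond to a minimum partition size of 250.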

by

A column containing document grouping

character

Whether to return the partition column as a character vector padded with leading zeroes. This may be helpful when using identify_by() to include partition in the definition of documents in a corpus.
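
A short base R sketch of why zero padding matters for character labels. It does not call the package, and the padding width of 2 is an assumption chosen for this toy vector; the width used by add_partitions() is not stated here.

```r
partition <- c(1L, 2L, 10L, 11L)

# Plain character conversion sorts as text, so "10" and "11" come before "2"
unpadded <- sort(as.character(partition))
unpadded  # order: "1", "10", "11", "2"

# Zero-padded labels keep their numeric order when sorted as text
padded <- sort(sprintf("%02d", partition))
padded    # order: "01", "02", "10", "11"
```

Padded labels therefore keep partitions in reading order whenever they are sorted or combined into document identifiers as text.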

Value

The original data frame with a column added for partition.
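
As a rough illustration of the added column, the base R sketch below numbers fixed-size partitions within each document of a toy data frame. It is not the package's implementation: it assumes one word per row, assumes that size counts rows, and ignores overlap and minimum entirely.

```r
# Toy stand-in for a tidy data frame with one word per row
df <- data.frame(
  doc_id = c(rep("a", 5), rep("b", 3)),
  word = c("one", "two", "three", "four", "five", "six", "seven", "eight")
)

size <- 2  # rows per partition, kept small for display

# Number the rows within each document, then map positions to partition indices
position <- ave(seq_len(nrow(df)), df$doc_id, FUN = seq_along)
df$partition <- as.integer(ceiling(position / size))

df$partition  # 1 1 2 2 3 1 1 2
```

The index restarts at 1 for each value of doc_id, which mirrors the role of the by argument.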

Examples

dubliners <- get_gutenberg_corpus(2814) |>
  load_texts() |>
  identify_by(part) |>
  standardize_titles()

dubliners |>
  add_partitions() |>
  head()
#> # A tibble: 6 × 6
#>   doc_id      title     author       part        partition word 
#>   <fct>       <chr>     <chr>        <chr>           <int> <chr>
#> 1 The Sisters Dubliners Joyce, James THE SISTERS         1 there
#> 2 The Sisters Dubliners Joyce, James THE SISTERS         1 was  
#> 3 The Sisters Dubliners Joyce, James THE SISTERS         1 no   
#> 4 The Sisters Dubliners Joyce, James THE SISTERS         1 hope 
#> 5 The Sisters Dubliners Joyce, James THE SISTERS         1 for  
#> 6 The Sisters Dubliners Joyce, James THE SISTERS         1 him