Segment tokens into new documents of equal token length, optionally with overlap between consecutive chunks.
tokens_chunk(x, size, overlap = 0, use_docvars = TRUE)
x: tokens object whose token elements will be segmented into chunks
size: integer; the token length of the chunks
overlap: integer; the number of tokens at the start of each chunk that are repeated from the end of the preceding chunk
use_docvars: if TRUE, repeat the docvar values for each chunk; if FALSE, drop the docvars in the chunked tokens
A tokens object whose documents have been split into chunks of length size. The final chunk of each document may be shorter, as doc1.5 in the example output shows.
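To make the interaction of size and overlap concrete, here is a minimal base R sketch of the chunking rule for a single character vector of tokens. It is not the package's implementation: chunk_vector is a hypothetical helper, and the requirement that overlap be smaller than size is an assumption of this sketch. Each chunk starts size - overlap tokens after the previous one.

```r
# Hypothetical helper for illustration only; not part of any package.
# Splits one vector of tokens into chunks of at most `size` tokens, where
# each chunk repeats the last `overlap` tokens of the preceding chunk.
chunk_vector <- function(x, size, overlap = 0) {
  # Assumption of this sketch: overlap must be smaller than size,
  # otherwise the start position would never advance.
  stopifnot(size >= 1, overlap >= 0, overlap < size)
  if (length(x) == 0) return(list())
  step <- size - overlap                  # distance between chunk starts
  starts <- seq(1, length(x), by = step)  # first token index of each chunk
  lapply(starts, function(s) x[s:min(s + size - 1, length(x))])
}

toks_vec <- c("Fellow", "citizens", ",", "I", "am", "again", "called")
chunks <- chunk_vector(toks_vec, size = 5)
overlapped <- chunk_vector(toks_vec, size = 5, overlap = 4)
```

With overlap = 0 the seven tokens yield two chunks, the second holding only the two leftover tokens. With overlap = 4 the start position advances one token at a time, so every token begins a chunk, which is why the overlapping example further down produces far more documents than the non-overlapping one.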
library("quanteda")  # provides tokens() and tokens_chunk()

txts <- c(doc1 = "Fellow citizens, I am again called upon by the voice of
                  my country to execute the functions of its Chief Magistrate.",
          doc2 = "When the occasion proper for it shall arrive, I shall
                  endeavor to express the high sense I entertain of this
                  distinguished honor.")
toks <- tokens(txts)
tokens_chunk(toks, size = 5)
#> Tokens consisting of 10 documents.
#> doc1.1 :
#> [1] "Fellow" "citizens" "," "I" "am"
#>
#> doc1.2 :
#> [1] "again" "called" "upon" "by" "the"
#>
#> doc1.3 :
#> [1] "voice" "of" "my" "country" "to"
#>
#> doc1.4 :
#> [1] "execute" "the" "functions" "of" "its"
#>
#> doc1.5 :
#> [1] "Chief" "Magistrate" "."
#>
#> doc2.1 :
#> [1] "When" "the" "occasion" "proper" "for"
#>
#> [ reached max_ndoc ... 4 more documents ]
tokens_chunk(toks, size = 5, overlap = 4)
#> Tokens consisting of 47 documents.
#> doc1.1 :
#> [1] "Fellow" "citizens" "," "I" "am"
#>
#> doc1.2 :
#> [1] "citizens" "," "I" "am" "again"
#>
#> doc1.3 :
#> [1] "," "I" "am" "again" "called"
#>
#> doc1.4 :
#> [1] "I" "am" "again" "called" "upon"
#>
#> doc1.5 :
#> [1] "am" "again" "called" "upon" "by"
#>
#> doc1.6 :
#> [1] "again" "called" "upon" "by" "the"
#>
#> [ reached max_ndoc ... 41 more documents ]