Segment tokens into new documents of equally sized token lengths, with the possibility of overlapping the chunks.
tokens_chunk(
x,
size,
overlap = 0,
use_docvars = TRUE,
verbose = quanteda_options("verbose")
)
tokens object whose token elements will be segmented into chunks
integer; the token length of the chunks
integer; the number of tokens in a chunk to be taken from the
last overlap
tokens from the preceding chunk
if TRUE
, repeat the docvar values for each chunk;
if FALSE
, drop the docvars in the chunked tokens
if TRUE
print the number of tokens and documents before and
after the function is applied. The number of tokens does not include paddings.
A tokens object whose documents have been split into chunks of
length size
.
txts <- c(doc1 = "Fellow citizens, I am again called upon by the voice of
my country to execute the functions of its Chief Magistrate.",
doc2 = "When the occasion proper for it shall arrive, I shall
endeavor to express the high sense I entertain of this
distinguished honor.")
toks <- tokens(txts)
tokens_chunk(toks, size = 5)
#> Tokens consisting of 10 documents.
#> doc1.1 :
#> [1] "Fellow" "citizens" "," "I" "am"
#>
#> doc1.2 :
#> [1] "again" "called" "upon" "by" "the"
#>
#> doc1.3 :
#> [1] "voice" "of" "my" "country" "to"
#>
#> doc1.4 :
#> [1] "execute" "the" "functions" "of" "its"
#>
#> doc1.5 :
#> [1] "Chief" "Magistrate" "."
#>
#> doc2.1 :
#> [1] "When" "the" "occasion" "proper" "for"
#>
#> [ reached max_ndoc ... 4 more documents ]
tokens_chunk(toks, size = 5, overlap = 4)
#> Tokens consisting of 47 documents.
#> doc1.1 :
#> [1] "Fellow" "citizens" "," "I" "am"
#>
#> doc1.2 :
#> [1] "citizens" "," "I" "am" "again"
#>
#> doc1.3 :
#> [1] "," "I" "am" "again" "called"
#>
#> doc1.4 :
#> [1] "I" "am" "again" "called" "upon"
#>
#> doc1.5 :
#> [1] "am" "again" "called" "upon" "by"
#>
#> doc1.6 :
#> [1] "again" "called" "upon" "by" "the"
#>
#> [ reached max_ndoc ... 41 more documents ]