Segment tokens into new documents of equal token length, optionally with overlap between consecutive chunks.

tokens_chunk(
  x,
  size,
  overlap = 0,
  use_docvars = TRUE,
  verbose = quanteda_options("verbose")
)

Arguments

x

tokens object whose token elements will be segmented into chunks

size

integer; the token length of the chunks

overlap

integer; the number of tokens at the start of each chunk that are repeated from the end of the preceding chunk

use_docvars

if TRUE, repeat the docvar values for each chunk; if FALSE, drop the docvars in the chunked tokens

verbose

if TRUE, print the number of tokens and documents before and after the function is applied. The token counts exclude padding.

Value

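A tokens object whose documents have been split into chunks of at most size tokens; the final chunk of each document may be shorter. Each chunk is named after its source document with a serial number appended (for example, doc1.1, doc1.2).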
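
As a rough illustration of the chunking arithmetic, the base R sketch below splits a plain vector in a way that matches the examples on this page: each chunk starts size - overlap positions after the previous one and is cut off at the end of the document. The helper name chunk_sketch is hypothetical, and this is not quanteda's implementation; it assumes overlap is non-negative and smaller than size.

```r
# Hypothetical helper: splits a vector into chunks of up to `size` elements,
# where consecutive chunks share `overlap` elements.
chunk_sketch <- function(x, size, overlap = 0) {
  stopifnot(size >= 1, overlap >= 0, overlap < size)
  if (length(x) == 0) return(list())
  # Each chunk starts `size - overlap` positions after the previous one.
  starts <- seq(1, length(x), by = size - overlap)
  # Truncate chunks that would run past the end of the vector.
  lapply(starts, function(i) x[i:min(i + size - 1, length(x))])
}

x <- letters[1:23]                      # stand-in for a 23-token document
lengths(chunk_sketch(x, size = 5))      # four full chunks and one short chunk
length(chunk_sketch(x, size = 5, overlap = 4))  # one chunk per start position
```

With overlap = 4 and size = 5 the step between chunk starts is 1, which is why the number of chunks grows to the number of tokens in the document.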

Examples

txts <- c(doc1 = "Fellow citizens, I am again called upon by the voice of
                  my country to execute the functions of its Chief Magistrate.",
          doc2 = "When the occasion proper for it shall arrive, I shall
                  endeavor to express the high sense I entertain of this
                  distinguished honor.")
toks <- tokens(txts)
tokens_chunk(toks, size = 5)
#> Tokens consisting of 10 documents.
#> doc1.1 :
#> [1] "Fellow"   "citizens" ","        "I"        "am"      
#> 
#> doc1.2 :
#> [1] "again"  "called" "upon"   "by"     "the"   
#> 
#> doc1.3 :
#> [1] "voice"   "of"      "my"      "country" "to"     
#> 
#> doc1.4 :
#> [1] "execute"   "the"       "functions" "of"        "its"      
#> 
#> doc1.5 :
#> [1] "Chief"      "Magistrate" "."         
#> 
#> doc2.1 :
#> [1] "When"     "the"      "occasion" "proper"   "for"     
#> 
#> [ reached max_ndoc ... 4 more documents ]
tokens_chunk(toks, size = 5, overlap = 4)
#> Tokens consisting of 47 documents.
#> doc1.1 :
#> [1] "Fellow"   "citizens" ","        "I"        "am"      
#> 
#> doc1.2 :
#> [1] "citizens" ","        "I"        "am"       "again"   
#> 
#> doc1.3 :
#> [1] ","      "I"      "am"     "again"  "called"
#> 
#> doc1.4 :
#> [1] "I"      "am"     "again"  "called" "upon"  
#> 
#> doc1.5 :
#> [1] "am"     "again"  "called" "upon"   "by"    
#> 
#> doc1.6 :
#> [1] "again"  "called" "upon"   "by"     "the"   
#> 
#> [ reached max_ndoc ... 41 more documents ]
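
The effect of use_docvars can be sketched in the same way. The short texts and the document variable named group below are made up for illustration; the calls themselves (tokens(), docvars(), ndoc(), tokens_chunk()) are standard quanteda functions.

```r
library(quanteda)

# Hypothetical toy input: one 7-token document and one 3-token document.
toks_small <- tokens(c(d1 = "a b c d e f g", d2 = "h i j"))
docvars(toks_small, "group") <- c("x", "y")   # illustrative docvar

# Keep docvars: each chunk repeats the values of its source document.
chunks_dv <- tokens_chunk(toks_small, size = 3, use_docvars = TRUE)
ndoc(chunks_dv)
docvars(chunks_dv)

# Drop docvars: the chunked tokens carry no document variables.
chunks_nodv <- tokens_chunk(toks_small, size = 3, use_docvars = FALSE)
docvars(chunks_nodv)
```

Repeating the docvars is convenient when chunks are later grouped or filtered by their source document's metadata; dropping them keeps the chunked object lighter when that metadata is not needed.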