Segment tokens by splitting on a pattern match. This is useful for breaking tokenized texts into smaller document units based on a regular pattern or a user-supplied annotation. While it usually makes more sense to perform this operation at the corpus level (see corpus_segment), tokens_segment provides the option to apply it directly to tokens objects.

tokens_segment(x, pattern, valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE, extract_pattern = FALSE,
  pattern_position = c("before", "after"), use_docvars = TRUE)



x: tokens object whose token elements will be segmented


pattern: a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.


valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.


case_insensitive: logical; if TRUE, ignore case when matching


extract_pattern: if TRUE, remove the matched patterns from the texts and save them in docvars
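A minimal sketch of pattern extraction, assuming quanteda is installed; with extract_pattern = TRUE the matched delimiters are removed from the tokens and stored as a document variable (named "pattern" in current quanteda versions):

```r
library(quanteda)

# two sentences ending in different delimiters
toks <- tokens("One sentence. Another sentence!")

# split after each delimiter and extract the matched pattern into docvars
toks_seg <- tokens_segment(toks, c(".", "!"), valuetype = "fixed",
                           extract_pattern = TRUE, pattern_position = "after")

# the delimiter tokens are no longer in the segments; inspect the saved patterns
docvars(toks_seg)
```

This is useful when the delimiter itself carries information, for example when segmenting on section tags rather than punctuation.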


pattern_position: either "before" or "after", depending on whether the pattern precedes the text (as with a tag) or follows the text (as with punctuation delimiters)


use_docvars: if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented tokens object. Dropping the docvars might be useful to conserve space, or if they are not needed for the segmented object.
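A minimal sketch of docvar handling, assuming quanteda is installed; the document names, corpus, and "author" variable below are illustrative only:

```r
library(quanteda)

# a one-document corpus with a document-level variable
corp <- corpus(c(d1 = "First part. Second part."),
               docvars = data.frame(author = "A"))
toks <- tokens(corp)

# segment on the period; with use_docvars = TRUE each new segment
# inherits a copy of the original document's docvars
toks_seg <- tokens_segment(toks, ".", valuetype = "fixed",
                           pattern_position = "after", use_docvars = TRUE)
docvars(toks_seg)
```

Setting use_docvars = FALSE instead yields segments with no document variables attached.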


tokens_segment returns a tokens object whose documents have been split by patterns.


txts <- "Fellow citizens, I am again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occasion proper for it shall arrive, I shall endeavor to express the high sense I entertain of this distinguished honor."
toks <- tokens(txts)

# split by any punctuation
toks_punc <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
                            pattern_position = "after")

# split by sentence-terminal punctuation, matched as a regular expression
toks_punc <- tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
                            extract_pattern = FALSE, pattern_position = "after")