Segment tokens by splitting on a pattern match. This is useful for breaking tokenized texts into smaller document units, based on a regular pattern or a user-supplied annotation. While it normally makes more sense to do this at the corpus level (see corpus_segment()), tokens_segment() provides the option to perform this operation directly on a tokens object.

tokens_segment(x, pattern, valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE, extract_pattern = FALSE,
pattern_position = c("before", "after"), use_docvars = TRUE)

## Arguments

- `x`: tokens object whose token elements will be segmented
- `pattern`: a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.
- `valuetype`: the type of pattern matching: `"glob"` for "glob"-style wildcard expressions; `"regex"` for regular expressions; or `"fixed"` for exact matching. See valuetype for details.
- `case_insensitive`: ignore case when matching, if `TRUE`
- `extract_pattern`: remove matched patterns from the texts and save them in docvars, if `TRUE`
- `pattern_position`: either `"before"` or `"after"`, depending on whether the pattern precedes the text (as with a tag) or follows the text (as with punctuation delimiters)
- `use_docvars`: if `TRUE`, repeat the docvar values for each segmented text; if `FALSE`, drop the docvars in the segmented tokens. Dropping the docvars might be useful in order to conserve space or if these are not desired for the segmented object.

## Value

tokens_segment() returns a tokens object whose documents have been split by patterns.

## Examples

txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)

# split by any punctuation, using fixed matching
toks_punc <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
                            pattern_position = "after")

# the same split, using a regular expression for sentence-ending punctuation
toks_punc <- tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
                            extract_pattern = FALSE,
                            pattern_position = "after")
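The `extract_pattern` argument can also be set to `TRUE` to strip the matched delimiter from each segment and record it in the docvars instead. A minimal sketch, assuming quanteda is loaded and that the extracted matches are stored in a docvar as with corpus_segment() (the docvar column name is an assumption here, not taken from this page):

```r
library("quanteda")

txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)

# remove the matched punctuation from each segment and save it in the docvars
toks_ext <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
                           pattern_position = "after",
                           extract_pattern = TRUE)
docvars(toks_ext)   # inspect the docvar holding the extracted patterns
```

With `extract_pattern = FALSE` (the default), the matched token remains as the final (or, with `pattern_position = "before"`, the first) token of each segment.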