Segment tokens by splitting on a pattern match. This is useful for breaking the
tokenized texts into smaller document units, based on a regular pattern or a
user-supplied annotation. While it normally makes more sense to do this at the
corpus level (see corpus_segment()), tokens_segment() provides the option to
perform this operation on tokens.
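As a minimal sketch of that choice (assuming the quanteda package is attached;
the two-sentence text is made up for illustration):

library(quanteda)

corp <- corpus("Hello there. Fine, thanks.")
# corpus-level: split the raw text on the sentence-ending periods
corpus_segment(corp, pattern = ".", valuetype = "fixed",
               pattern_position = "after")
# token-level: the same split applied after tokenization
tokens_segment(tokens(corp), pattern = ".", valuetype = "fixed",
               pattern_position = "after")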
Arguments

x
    tokens object whose token elements will be segmented

pattern
    a character vector, list of character vectors, dictionary, or
    collocations object. See pattern for details.

valuetype
    the type of pattern matching: "glob" for "glob"-style wildcard
    expressions; "regex" for regular expressions; or "fixed" for exact
    matching. See valuetype for details.

case_insensitive
    logical; if TRUE, ignore case when matching a pattern or dictionary
    values

extract_pattern
    if TRUE, remove matched patterns from the texts and save them in
    docvars

pattern_position
    either "before" or "after", depending on whether the pattern precedes
    the text (as with a tag) or follows the text (as with punctuation
    delimiters)

use_docvars
    if TRUE, repeat the docvar values for each segmented text; if FALSE,
    drop the docvars in the segmented tokens object. Dropping the docvars
    might be useful in order to conserve space or if these are not desired
    for the segmented object.
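For pattern_position = "before", here is a hedged sketch of tag-based
segmentation; the ##-style tags and what = "fastestword" (whitespace
tokenization, so each tag survives as a single token) are illustrative
assumptions, not requirements:

library(quanteda)

# hypothetical tagged text: each ## tag precedes the section it labels
toks_tagged <- tokens("##INTRO Hello there ##BODY All the details",
                      what = "fastestword")
# split before each tag and store the matched tag in a docvar
tokens_segment(toks_tagged, pattern = "##*", valuetype = "glob",
               extract_pattern = TRUE,
               pattern_position = "before")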
Value

tokens_segment() returns a tokens object whose documents have been split by
patterns.

Examples
txts <- "Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for
it shall arrive, I shall endeavor to express the high sense I entertain of
this distinguished honor."
toks <- tokens(txts)
# split on sentence-ending punctuation (the Unicode Sterm property)
tokens_segment(toks, "^\\p{Sterm}$", valuetype = "regex",
extract_pattern = TRUE,
pattern_position = "after")
#> Tokens consisting of 2 documents and 1 docvar.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 10 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 11 more ]
#>
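# the same segmentation, matching the delimiters literally rather than by regex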
tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
extract_pattern = TRUE,
pattern_position = "after")
#> Tokens consisting of 2 documents and 1 docvar.
#> text1.1 :
#> [1] "Fellow" "citizens" "," "I" "am" "again"
#> [7] "called" "upon" "by" "the" "voice" "of"
#> [ ... and 10 more ]
#>
#> text1.2 :
#> [1] "When" "the" "occasion" "proper" "for" "it"
#> [7] "shall" "arrive" "," "I" "shall" "endeavor"
#> [ ... and 11 more ]
#>
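Since extract_pattern = TRUE saves each matched delimiter as a document
variable, the result can be inspected with docvars(); a short follow-up
reusing toks from above:

# rerun the fixed-pattern split, keeping the result this time
toks_seg <- tokens_segment(toks, c(".", "?", "!"), valuetype = "fixed",
                           extract_pattern = TRUE,
                           pattern_position = "after")
# one row per segment; the extracted delimiter is stored as a docvar
docvars(toks_seg)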