Substitute token types based on vectorized one-to-one matching. This function
was created for lemmatization and user-defined stemming. It supports
substitution of multi-word features by multi-word features, but substitution
is fastest when pattern and replacement are character vectors and
valuetype = "fixed", because the function then only substitutes the types of
tokens. To replace dictionary values, use tokens_lookup() with
exclusive = FALSE instead.
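The difference between the two routes can be sketched on a toy sentence. This is a minimal illustration, not taken from the package documentation: the sentence and the one-key dictionary are hypothetical, and capkeys = TRUE is spelled out so that the dictionary key appears in upper case.

```r
library(quanteda)

# Toy input, for illustration only
toks <- tokens("The tax was raised and taxes were levied")

# Hypothetical one-key dictionary
dict <- dictionary(list(tax = c("tax", "taxes")))

# Dictionary route: exclusive = FALSE keeps unmatched tokens and swaps
# matched tokens for the (upper-cased) dictionary key
toks_dict <- tokens_lookup(toks, dict, exclusive = FALSE, capkeys = TRUE)

# One-to-one route: each pattern is paired with the replacement at the
# same position
toks_repl <- tokens_replace(toks, c("tax", "taxes"), c("TAX", "TAX"),
                            valuetype = "fixed")

print(as.list(toks_dict))
print(as.list(toks_repl))
```

In this small case both calls should yield the same tokens; the dictionary route is the more natural fit when many values map to a few keys.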
tokens_replace(
  x,
  pattern,
  replacement,
  valuetype = "glob",
  case_insensitive = TRUE,
  apply_if = NULL,
  verbose = quanteda_options("verbose")
)
x: tokens object whose token elements will be replaced
pattern: a character vector or list of character vectors. See pattern for more details.
replacement: a character vector or (if pattern is a list) list of character vectors of the same length as pattern
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
apply_if: logical vector of length ndoc(x); documents are modified only when the corresponding values are TRUE, others are left unchanged.
verbose: print status messages if TRUE
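How valuetype, case_insensitive and apply_if interact can be sketched on two identical toy documents. This is an illustrative example under assumptions: the document names d1 and d2 and their text are made up, and the installed quanteda version is assumed to provide the apply_if argument shown in the usage above.

```r
library(quanteda)

# Two identical toy documents (hypothetical), so that per-document
# differences come only from the arguments
toks <- tokens(c(d1 = "Taxes on taxed goods", d2 = "Taxes on taxed goods"))

# Defaults: glob matching, case-insensitive, applied to every document
toks_glob <- tokens_replace(toks, "tax*", "TAX")

# Case-sensitive matching: "Taxes" no longer matches the pattern "tax*"
toks_case <- tokens_replace(toks, "tax*", "TAX", case_insensitive = FALSE)

# apply_if has one logical value per document; only d1 is modified here
toks_if <- tokens_replace(toks, "tax*", "TAX", apply_if = c(TRUE, FALSE))

print(as.list(toks_glob))
print(as.list(toks_case))
print(as.list(toks_if))
```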
See also: tokens_lookup()
toks1 <- tokens(data_corpus_inaugural, remove_punct = TRUE)
# lemmatization
taxwords <- c("tax", "taxing", "taxed", "taxation")
lemma <- rep("TAX", length(taxwords))
toks2 <- tokens_replace(toks1, taxwords, lemma, valuetype = "fixed")
kwic(toks2, "TAX") |>
  tail(10)
#> Keyword-in-context with 10 matches.
#>  [1925-Coolidge, 3004] a living we must have | TAX | reform The method of raising
#>  [1925-Coolidge, 3116] correct course to follow in | TAX | and all other economic legislation
#>  [1981-Reagan, 273] for their labor by a | TAX | system which penalizes successful achievement
#>  [1981-Reagan, 290] productivity But great as our | TAX | burden is it has not
#>  [1981-Reagan, 1521] and to lighten our punitive | TAX | burden And these will be
#>  [1985-Reagan, 496] were right to believe that | TAX | rates have been reduced inflation
#>  [1985-Reagan, 1106] lives We must simplify our | TAX | system make it more fair
#>  [1985-Reagan, 1418] permanently control Government's power to | TAX | and spend We must act
#>  [1985-Reagan, 1438] spend its citizens money and | TAX | them into servitude when the
#>  [2013-Obama, 739] remake our government revamp our | TAX | Code reform our schools and
# stemming
type <- types(toks1)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks1, type, stem, valuetype = "fixed", case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks1, "porter"))
#> [1] TRUE
# multi-multi substitution
toks4 <- tokens_replace(toks1, phrase(c("Supreme Court")),
                        phrase(c("Supreme Court of the United States")))
kwic(toks4, phrase(c("Supreme Court of the United States")))
#> Keyword-in-context with 4 matches.
#>  [1857-Buchanan, 441:446] which legitimately belongs to the | Supreme Court of the United States | of the United States before
#>  [1861-Lincoln, 2323:2328] to be decided by the | Supreme Court of the United States | nor do I deny that
#>  [1861-Lincoln, 2465:2470] fixed by decisions of the | Supreme Court of the United States | the instant they are made
#>  [1889-Harrison, 408:413] by the organization of the | Supreme Court of the United States | shall have been suitably observed
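Multi-word substitution can also shorten a sequence rather than lengthen it. The following is a small sketch on a made-up sentence, assuming that each element of the pattern list is paired with the replacement at the same position; the abbreviations NYC and USA are chosen purely for illustration.

```r
library(quanteda)

# Toy sentence, for illustration only
toks5 <- tokens("I moved from New York to the United States capital")

# phrase() splits each string on whitespace, so every pattern is a
# two-token sequence and every replacement a single token
toks6 <- tokens_replace(toks5,
                        phrase(c("New York", "United States")),
                        phrase(c("NYC", "USA")))

print(as.list(toks6))
```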