Substitute token types based on vectorized one-to-one matching. This function was created for lemmatization or user-defined stemming. It supports substitution of multi-word features by multi-word features, but substitution is fastest when pattern and replacement are character vectors and valuetype = "fixed", because in that case the function only substitutes the types of the tokens. Use tokens_lookup with exclusive = FALSE to replace dictionary values.

tokens_replace(x, pattern, replacement, valuetype = "glob",
  case_insensitive = TRUE, verbose = quanteda_options("verbose"))

Arguments

x

tokens object whose token elements will be replaced

pattern

a character vector or list of character vectors. See pattern for more details.

replacement

a character vector or (if pattern is a list) list of character vectors of the same length as pattern

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore case when matching, if TRUE

verbose

print status messages if TRUE
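
The interaction between pattern, replacement, and case_insensitive can be sketched on a toy sentence. This is a minimal illustration, not part of the package documentation: the sentence and the object names (toks_demo, infl, lemma, res_ci, res_cs) are made up here, and the code assumes quanteda is installed.

```r
library(quanteda)

# Hypothetical toy input; any tokens object would do.
toks_demo <- tokens("Focused teams keep their focus while the foci differ")

# pattern and replacement are matched one-to-one, so they have equal length.
infl  <- c("focused", "foci")
lemma <- c("focus", "focus")

# Default case_insensitive = TRUE: the capitalised "Focused" should also match.
res_ci <- tokens_replace(toks_demo, infl, lemma, valuetype = "fixed")

# case_insensitive = FALSE: only the exact lower-case forms are replaced.
res_cs <- tokens_replace(toks_demo, infl, lemma, valuetype = "fixed",
                         case_insensitive = FALSE)

print(res_ci)
print(res_cs)
```

Comparing the two printed objects shows whether the sentence-initial token was replaced, which is the only difference the case_insensitive flag should make for this input.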

See also

tokens_lookup

Examples

toks <- tokens(data_corpus_irishbudget2010, remove_punct = TRUE)

# lemmatization
infle <- c("foci", "focus", "focused", "focuses", "focusing", "focussed", "focusses")
lemma <- rep("focus", length(infle))
toks2 <- tokens_replace(toks, infle, lemma, valuetype = "fixed")
kwic(toks2, "focus*")
#>
#> [Lenihan, Brian (FF), 998] measures A key feature and | focus | of today's budget is regaining
#> [Lenihan, Brian (FF), 4990] 2010 our investment projects will | focus | on labour-intensive areas such as
#> [Bruton, Richard (FG), 1948] budget and see that the | focus | has been on the front
#> [Burton, Joan (LAB), 830] must therefore be the main | focus | of policy The Labour Party
#> [Burton, Joan (LAB), 3256] the budget had just one | focus | and that was just too
#> [Burton, Joan (LAB), 3727] the garden county however the | focus | of the feature is not
#> [Burton, Joan (LAB), 4517] That is too narrow a | focus | There is a character in
#> [Burton, Joan (LAB), 4712] economic revival that has a | focus | other than the dream of
#> [Morgan, Arthur (SF), 2891] creating new jobs Instead the | focus | was on rates of pay
#> [Morgan, Arthur (SF), 3420] what should be the main | focus | of economic recovery which is
#> [Morgan, Arthur (SF), 6225] must be completely redrawn to | focus | on the more labour intensive
#> [Cowen, Brian (FF), 2788] 2010 The scheme will also | focus | on providing information via the
#> [Cowen, Brian (FF), 3394] to maximise the efficiency and | focus | of our investment and ensure
#> [Cowen, Brian (FF), 4018] in place with a particular | focus | on some of the worst
#> [ODonnell, Kieran (FG), 1774] coherent plan which should be | focus | on jobs The Taoiseach is
#> [Gilmore, Eamon (LAB), 2390] also states More recent studies | focus | on country cases provide evidence

# stemming
type <- types(toks)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks, type, stem, valuetype = "fixed", case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks, "porter"))
#> [1] TRUE

# multi-multi substitution
toks4 <- tokens_replace(toks, phrase(c("Minister Deputy Lenihan")),
                        phrase(c("Minister Deputy Conor Lenihan")))
kwic(toks4, phrase(c("Minister Deputy Conor Lenihan")))
#>
#> [Burton, Joan (LAB), 1805:1808] ought to be cut The | Minister Deputy Conor Lenihan | should note that in this
#> [Burton, Joan (LAB), 1970:1973] their extra contribution comes The | Minister Deputy Conor Lenihan | had a simple choice today
#> [OCaolain, Caoimhghin (SF), 571:574] thanks that carers get The | Minister Deputy Conor Lenihan | claims the overriding objective of
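
The description points to tokens_lookup with exclusive = FALSE for replacing dictionary values, and the examples above do not show it. The following is a minimal sketch of that route, under stated assumptions: the sentence, the dictionary key fiscal, and the object names are invented for illustration, and the code relies on tokens_lookup upper-casing keys by default when exclusive = FALSE.

```r
library(quanteda)

# Hypothetical toy input and dictionary, not taken from the corpus above.
toks_demo <- tokens("The Minister spoke about tax and spending")
dict_demo <- dictionary(list(fiscal = c("tax", "spending")))

# exclusive = FALSE keeps unmatched tokens and substitutes the dictionary key
# for each matched token, instead of returning only the keys.
toks_dict <- tokens_lookup(toks_demo, dict_demo, exclusive = FALSE)

print(toks_dict)
```

Unlike tokens_replace, where each pattern has its own replacement, this maps every value under a key to that one key, so it suits many-to-one replacement driven by an existing dictionary.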