Replace tokens in a tokens object

Substitute token types based on vectorized one-to-one matching. Since this function is created for lemmatization or user-defined stemming. It supports substitution of multi-word features by multi-word features, but substitution is fastest when pattern and replacement are character vectors and valuetype = "fixed" as the function only substitute types of tokens. Please use tokens_lookup() with exclusive = FALSE to replace dictionary values.

tokens_replace(
  x,
  pattern,
  replacement,
  valuetype = "glob",
  case_insensitive = TRUE,
  apply_if = NULL,
  verbose = quanteda_options("verbose")
)

Arguments

x: tokens object whose token elements will be replaced
pattern: a character vector or list of character vectors. See pattern for more details.
replacement: a character vector or (if pattern is a list) list of character vectors of the same length as pattern
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
apply_if: logical vector of length ndoc(x); documents are modified only when corresponding values are TRUE, others are left unchanged.
verbose: if TRUE print the number of tokens and documents before and after the function is applied. The number of tokens does not include paddings.

Examples

toks1 <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# lemmatization
taxwords <- c("tax", "taxing", "taxed", "taxed", "taxation")
lemma <- rep("TAX", length(taxwords))
toks2 <- tokens_replace(toks1, taxwords, lemma, valuetype = "fixed")
kwic(toks2, "TAX") |>
    tail(10)
#> Keyword-in-context with 10 matches.                                                                        
#>  [1925-Coolidge, 3004]                     a living we must have | TAX |
#>  [1925-Coolidge, 3116]               correct course to follow in | TAX |
#>     [1981-Reagan, 273]                      for their labor by a | TAX |
#>     [1981-Reagan, 290]             productivity But great as our | TAX |
#>    [1981-Reagan, 1521]               and to lighten our punitive | TAX |
#>     [1985-Reagan, 496]                were right to believe that | TAX |
#>    [1985-Reagan, 1106]                lives We must simplify our | TAX |
#>    [1985-Reagan, 1418] permanently control Government's power to | TAX |
#>    [1985-Reagan, 1438]              spend its citizens money and | TAX |
#>      [2013-Obama, 739]          remake our government revamp our | TAX |
#>                                               
#>  reform The method of raising                 
#>  and all other economic legislation           
#>  system which penalizes successful achievement
#>  burden is it has not                         
#>  burden And these will be                     
#>  rates have been reduced inflation            
#>  system make it more fair                     
#>  and spend We must act                        
#>  them into servitude when the                 
#>  Code reform our schools and                  
#> 

# stemming
type <- types(toks1)
stem <- char_wordstem(type, "porter")
toks3 <- tokens_replace(toks1, type, stem, valuetype = "fixed", case_insensitive = FALSE)
identical(toks3, tokens_wordstem(toks1, "porter"))
#> [1] TRUE

# multi-multi substitution
toks4 <- tokens_replace(toks1, phrase(c("Supreme Court")),
                        phrase(c("Supreme Court of the United States")))
kwic(toks4, phrase(c("Supreme Court of the United States")))
#> Keyword-in-context with 4 matches.                                                              
#>   [1857-Buchanan, 441:446] which legitimately belongs to the |
#>  [1861-Lincoln, 2323:2328]              to be decided by the |
#>  [1861-Lincoln, 2465:2470]         fixed by decisions of the |
#>   [1889-Harrison, 408:413]        by the organization of the |
#>                                                                        
#>  Supreme Court of the United States | of the United States before      
#>  Supreme Court of the United States | nor do I deny that               
#>  Supreme Court of the United States | the instant they are made        
#>  Supreme Court of the United States | shall have been suitably observed
#>

Arguments

See also

Examples