Apply a dictionary to a dfm by looking up all dfm features for matches in a set of dictionary values, and replace the matched features with counts for the corresponding dictionary keys. If exclusive = FALSE, the dictionary is applied as a "thesaurus": each feature that matches a value is replaced by its dictionary key, converted to capitals if capkeys = TRUE (so that the replacements are easily distinguished from features that were terms found originally in the document).

dfm_lookup(x, dictionary, levels = 1:5, exclusive = TRUE,
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  capkeys = !exclusive, nomatch = NULL,
  verbose = quanteda_options("verbose"))

Arguments

x

the dfm to which the dictionary will be applied

dictionary

a dictionary class object

levels

levels of entries in a hierarchical dictionary that will be applied

exclusive

if TRUE, remove all features not in the dictionary; otherwise, replace features that match dictionary values with their keys and leave other features unaffected

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore the case of dictionary values if TRUE

capkeys

if TRUE, convert dictionary keys to uppercase to distinguish them from other features

nomatch

an optional character string naming a new feature that will contain the counts of features of x not matched to a dictionary key. If NULL (default), do not tabulate unmatched features.

verbose

print status messages if TRUE
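
The difference between the three valuetype settings can be sketched in base R. This is only an illustration of the matching semantics, not how dfm_lookup is implemented internally; the feature vector and helper variable names below are assumptions made for the example.

```r
# Illustration only: base R stand-ins for the three valuetype settings.
# `feats` is a hypothetical set of dfm feature names.
feats <- c("tax", "taxation", "united_states")

# "glob": wildcard pattern, anchored to the whole feature
glob_hits <- grepl(utils::glob2rx("tax*"), feats)

# "regex": unanchored regular expression; "tax*" means "ta" followed by
# zero or more "x", so it also hits the "ta" inside "united_states"
regex_hits <- grepl("tax*", feats)

# "fixed": literal comparison, so "tax*" matches nothing here
fixed_hits <- feats == "tax*"

# A stricter regex, as used for the taxregex key in the examples
strict_hits <- grepl("tax.+$", feats)

print(rbind(glob = glob_hits, regex = regex_hits,
            fixed = fixed_hits, strict = strict_hits))
```

The unanchored regex case is why a pattern written as a glob can match unexpected features when valuetype = "regex" is used.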

Note

If using dfm_lookup with dictionaries containing multi-word values, matches will only occur if the features themselves are multi-word or formed from ngrams. A better way to match dictionary values that include multi-word patterns is to apply tokens_lookup to the tokens, and then construct the dfm.
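
A minimal sketch of the tokens-first approach recommended in this note, assuming a quanteda version in which tokens(), tokens_lookup() and dfm() are available. The dictionary and text are hypothetical and chosen only to include a multi-word value.

```r
library(quanteda)

# Hypothetical dictionary with a multi-word value (illustration only)
dict <- dictionary(list(country = c("United States", "Sweden")))

txt <- c(d1 = "The United States and Sweden signed a treaty.")

# Tokenise first, so that "United" and "States" are adjacent tokens
toks <- tokens(txt)

# tokens_lookup() can match multi-word values as token sequences
toks_dict <- tokens_lookup(toks, dictionary = dict)

# Build the dfm from the looked-up tokens
d <- dfm(toks_dict)
d
```

Doing the lookup on tokens avoids having to pre-join multi-word expressions into single features (such as United_States) before building the dfm.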

Examples

myDict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             remove = stopwords("english"), verbose = FALSE)
myDfm
#> Document-feature matrix of: 2 documents, 11 features (50% sparse).
#> 2 x 11 sparse Matrix of class "dfmSparse"
#>        features
#> docs    christmas ruined opposition tax plan . united_states sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1
# glob format
dfm_lookup(myDfm, myDict, valuetype = "glob")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
dfm_lookup(myDfm, myDict, valuetype = "glob", case_insensitive = FALSE)
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(myDfm, myDict, valuetype = "glob")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
dfm_lookup(myDfm, myDict, valuetype = "regex", case_insensitive = TRUE)
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
# fixed format: no pattern matching
dfm_lookup(myDfm, myDict, valuetype = "fixed")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
dfm_lookup(myDfm, myDict, valuetype = "fixed", case_insensitive = FALSE)
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
# show unmatched tokens
dfm_lookup(myDfm, myDict, nomatch = "_UNMATCHED")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
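
The "thesaurus" behaviour described for exclusive = FALSE, and the nomatch tally, can be sketched with a smaller example. This is an illustrative sketch rather than output from the package documentation; the one-key dictionary and the sentence are assumptions, and the dfm is built from tokens so that the code does not depend on passing a character vector to dfm().

```r
library(quanteda)

# Hypothetical one-key dictionary (illustration only)
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday")))

toks <- tokens("Santa ruined my holiday plan")
d <- dfm(toks)

# exclusive = FALSE: matched features are replaced by the key and
# unmatched features are kept; capkeys defaults to !exclusive, so the
# key is converted to uppercase
thes <- dfm_lookup(d, dict, exclusive = FALSE)
featnames(thes)

# exclusive = TRUE (default): only keys are kept; nomatch adds one
# feature that tallies everything that matched no key
excl <- dfm_lookup(d, dict, nomatch = "_UNMATCHED")
featnames(excl)
```

Leaving capkeys at its default makes the substituted key easy to tell apart from the lowercase features that were in the documents to begin with.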