Apply a dictionary to a dfm by looking up all dfm features for matches in a a set of dictionary values, and replace those features with a count of the dictionary's keys. If exclusive = FALSE then the behaviour is to apply a "thesaurus", where each value match is replaced by the dictionary key, converted to capitals if capkeys = TRUE (so that the replacements are easily distinguished from features that were terms found originally in the document).

dfm_lookup(
  x,
  dictionary,
  levels = 1:5,
  exclusive = TRUE,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  capkeys = !exclusive,
  nomatch = NULL,
  verbose = quanteda_options("verbose")
)

Arguments

x

the dfm to which the dictionary will be applied

dictionary

a dictionary-class object

levels

levels of entries in a hierarchical dictionary that will be applied

exclusive

if TRUE, remove all features not in dictionary, otherwise, replace values in dictionary with keys while leaving other features unaffected

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

logical; if TRUE, ignore case when matching a pattern or dictionary values

capkeys

if TRUE, convert dictionary keys to uppercase to distinguish them from other features

nomatch

an optional character naming a new feature that will contain the counts of features of x not matched to a dictionary key. If NULL (default), do not tabulate unmatched features.

verbose

print status messages if TRUE

Note

If using dfm_lookup with dictionaries containing multi-word values, matches will only occur if the features themselves are multi-word or formed from n-grams. A better way to match dictionary values that include multi-word patterns is to apply tokens_lookup() to the tokens, and then construct the dfm.

See also

dfm_replace

Examples

dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                          opposition = c("Opposition", "reject", "notincorpus"),
                          taxglob = "tax*",
                          taxregex = "tax.+$",
                          country = c("United_States", "Sweden")))
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
                      "Does the United_States or Sweden have more progressive taxation?")))
dfmat
#> Document-feature matrix of: 2 documents, 20 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    my christmas was ruined by your opposition tax plan .
#>   text1  1         1   1      1  1    1          1   1    1 1
#>   text2  0         0   0      0  0    0          0   0    0 0
#> [ reached max_nfeat ... 10 more features ]

# glob format
dfm_lookup(dfmat, dict, valuetype = "glob")
#> Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2
dfm_lookup(dfmat, dict, valuetype = "glob", case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2

# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(dfmat, dict, valuetype = "glob")
#> Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2
dfm_lookup(dfmat, dict, valuetype = "regex", case_insensitive = TRUE)
#> Document-feature matrix of: 2 documents, 5 features (40.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       2        1       2

# fixed format: no pattern matching
dfm_lookup(dfmat, dict, valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 5 features (70.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       0        0       0
#>   text2         0          0       0        0       2
dfm_lookup(dfmat, dict, valuetype = "fixed", case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 5 features (70.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       0        0       0
#>   text2         0          0       0        0       2

# show unmatched tokens
dfm_lookup(dfmat, dict, nomatch = "_UNMATCHED")
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country _UNMATCHED
#>   text1         1          1       1        0       0          7
#>   text2         0          0       1        0       2          7