Apply a dictionary to a dfm by looking up all dfm features for matches in a
set of dictionary values, and replace those features with a count of matches
for each of the dictionary's keys. If exclusive = FALSE, the behaviour is to
apply a "thesaurus", where each value match is replaced by the dictionary
key, converted to capitals if capkeys = TRUE (so that the replacements
are easily distinguished from features that were terms found originally in
the document).
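The thesaurus behaviour can be sketched as follows. The dictionary key fruit, the object names, and the sample text are invented for illustration and are not part of the package examples.

```r
library(quanteda)

# Hypothetical one-key dictionary and toy document (illustration only)
dict_fruit <- dictionary(list(fruit = c("apple", "pear")))
dfmat_fruit <- dfm(tokens("apple pear kiwi"))

# exclusive = FALSE keeps unmatched features; capkeys then defaults to TRUE,
# so the matched features are merged under the upper-cased key "FRUIT"
dfmat_thes <- dfm_lookup(dfmat_fruit, dict_fruit, exclusive = FALSE)
featnames(dfmat_thes)
```

Setting capkeys = FALSE in the same call should leave the key in lower case, which is only advisable when the key cannot be confused with a term that occurs in the documents.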
dfm_lookup(
  x,
  dictionary,
  levels = 1:5,
  exclusive = TRUE,
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  capkeys = !exclusive,
  nomatch = NULL,
  verbose = quanteda_options("verbose")
)
x: the dfm to which the dictionary will be applied
dictionary: a dictionary-class object
levels: levels of entries in a hierarchical dictionary that will be applied
exclusive: if TRUE, remove all features not in dictionary; otherwise, replace values in dictionary with keys while leaving other features unaffected
valuetype: the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.
case_insensitive: logical; if TRUE, ignore case when matching a pattern or dictionary values
capkeys: if TRUE, convert dictionary keys to uppercase to distinguish them from other features
nomatch: an optional character naming a new feature that will contain the counts of features of x not matched to a dictionary key. If NULL (default), do not tabulate unmatched features.
verbose: print status messages if TRUE
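The effect of levels on a hierarchical dictionary can be illustrated with a small sketch. The two-level dictionary, its keys (geo, europe, america), and the sample text are hypothetical and chosen only to make the nesting visible.

```r
library(quanteda)

# Hypothetical two-level dictionary (illustration only)
dict_geo <- dictionary(list(geo = list(europe = c("sweden", "norway"),
                                       america = "canada")))
dfmat_geo <- dfm(tokens("Sweden Norway Canada"))

# levels = 1: counts are collapsed under the top-level key
dfmat_l1 <- dfm_lookup(dfmat_geo, dict_geo, levels = 1)
featnames(dfmat_l1)

# levels = 2: only the second-level keys are used as feature names
dfmat_l2 <- dfm_lookup(dfmat_geo, dict_geo, levels = 2)
featnames(dfmat_l2)

# levels = 1:2: the key path is kept, joined by "."
dfmat_l12 <- dfm_lookup(dfmat_geo, dict_geo, levels = 1:2)
featnames(dfmat_l12)
```

Restricting levels is a convenient way to aggregate counts at a coarser category without rebuilding the dictionary.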
If using dfm_lookup with dictionaries containing multi-word
values, matches will only occur if the features themselves are multi-word
or formed from n-grams. A better way to match dictionary values that include
multi-word patterns is to apply tokens_lookup() to the tokens,
and then construct the dfm.
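A minimal sketch of the difference, assuming a dictionary whose value "United States" is written with a space. The object names and the sample sentence are illustrative only.

```r
library(quanteda)

# Hypothetical dictionary with one multi-word value (illustration only)
dict_mw <- dictionary(list(country = c("United States", "Sweden")))
toks <- tokens("Does the United States or Sweden have more progressive taxation?")

# Lookup on a dfm of single-word features: the two-word value cannot
# match, so only the single-word value "Sweden" contributes to the count
dfmat_from_dfm <- dfm_lookup(dfm(toks), dict_mw)

# Lookup on the tokens first: the two-token sequence is matched as a unit,
# and the dfm is built afterwards from the resulting keys
dfmat_from_toks <- dfm(tokens_lookup(toks, dict_mw))

as.matrix(dfmat_from_dfm)
as.matrix(dfmat_from_toks)
```

Because tokens_lookup() works on the token sequence, it does not depend on the multi-word expression having been joined into a single feature beforehand.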
See also: dfm_replace()
library(quanteda)

# dictionary mixing fixed values, a glob pattern, and a regex pattern
dict <- dictionary(list(christmas = c("Christmas", "Santa", "holiday"),
                        opposition = c("Opposition", "reject", "notincorpus"),
                        taxglob = "tax*",
                        taxregex = "tax.+$",
                        country = c("United_States", "Sweden")))
dfmat <- dfm(tokens(c("My Christmas was ruined by your opposition tax plan.",
                      "Does the United_States or Sweden have more progressive taxation?")))
dfmat
#> Document-feature matrix of: 2 documents, 20 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    my christmas was ruined by your opposition tax plan .
#>   text1  1         1   1      1  1    1          1   1    1 1
#>   text2  0         0   0      0  0    0          0   0    0 0
#> [ reached max_nfeat ... 10 more features ]
# glob format
dfm_lookup(dfmat, dict, valuetype = "glob")
#> Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2
dfm_lookup(dfmat, dict, valuetype = "glob", case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2
# regex v. glob format: note that "united_states" is a regex match for "tax*"
dfm_lookup(dfmat, dict, valuetype = "glob")
#> Document-feature matrix of: 2 documents, 5 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       1        0       2
dfm_lookup(dfmat, dict, valuetype = "regex", case_insensitive = TRUE)
#> Document-feature matrix of: 2 documents, 5 features (40.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       1        0       0
#>   text2         0          0       2        1       2
# fixed format: no pattern matching
dfm_lookup(dfmat, dict, valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 5 features (70.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       0        0       0
#>   text2         0          0       0        0       2
dfm_lookup(dfmat, dict, valuetype = "fixed", case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 5 features (70.00% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country
#>   text1         1          1       0        0       0
#>   text2         0          0       0        0       2
# show unmatched tokens
dfm_lookup(dfmat, dict, nomatch = "_UNMATCHED")
#> Document-feature matrix of: 2 documents, 6 features (41.67% sparse) and 0 docvars.
#>        features
#> docs    christmas opposition taxglob taxregex country _UNMATCHED
#>   text1         1          1       1        0       0          7
#>   text2         0          0       1        0       2          7