Convert tokens into equivalence classes defined by values of a dictionary object.
tokens_lookup(
x,
dictionary,
levels = 1:5,
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
capkeys = !exclusive,
exclusive = TRUE,
nomatch = NULL,
nested_scope = c("key", "dictionary"),
verbose = quanteda_options("verbose")
)
tokens object to which dictionary or thesaurus will be supplied
the dictionary-class object that will be applied to
x
integers specifying the levels of entries in a hierarchical
dictionary that will be applied. The top level is 1, and subsequent levels
describe lower nesting levels. Values may be combined, even if these
levels are not contiguous, e.g. levels = c(1:3)
will collapse the second
level into the first, but record the third level (if present) collapsed
below the first (see examples).
the type of pattern matching: "glob"
for "glob"-style
wildcard expressions; "regex"
for regular expressions; or "fixed"
for
exact matching. See valuetype for details.
logical; if TRUE
, ignore case when matching a
pattern
or dictionary values
if TRUE, convert dictionary keys to uppercase to distinguish them from other features
if TRUE
, remove all features not in dictionary,
otherwise, replace values in dictionary with keys while leaving other
features unaffected
an optional character naming a new key for tokens that do not
matched to a dictionary values If NULL
(default), do not record
unmatched tokens.
how to treat matches from different dictionary keys that
are nested. When one value is nested within another, such as "a b" being
nested within "a b c", then tokens_lookup()
will match the longer. When
nested_scope = "key"
, this longer-match priority is applied only
within the key, while "dictionary"
applies it across keys, matching only
the key with the longer pattern, not the matches nested within that longer
pattern from other keys. See Details.
print status messages if TRUE
Dictionary values may consist of sequences, and there are different methods of counting key matches based on values that are nested or that overlap.
When two different keys in a dictionary are nested matches of one another,
the nested_scope
options provide the choice of matching each key's
values independently (the "key"
) option, or just counting the
longest match (the "dictionary"
option). Values that are nested
within the same key are always counted as a single match. See the
last example below comparing the New York and New York Times
for these two different behaviours.
Overlapping values, such as "a b"
and "b a"
are
currently always considered as separate matches if they are in different
keys, or as one match if the overlap is within the same key.
tokens_replace
toks1 <- tokens(data_corpus_inaugural)
dict1 <- dictionary(list(country = "united states",
law=c("law*", "constitution"),
freedom=c("free*", "libert*")))
dfm(tokens_lookup(toks1, dict1, valuetype = "glob", verbose = TRUE))
#> applying a dictionary consisting of 3 keys
#> Document-feature matrix of: 59 documents, 3 features (14.69% sparse) and 4 docvars.
#> features
#> docs country law freedom
#> 1789-Washington 2 1 6
#> 1793-Washington 0 1 0
#> 1797-Adams 3 10 6
#> 1801-Jefferson 0 6 11
#> 1805-Jefferson 1 13 6
#> 1809-Madison 2 2 5
#> [ reached max_ndoc ... 53 more documents ]
dfm(tokens_lookup(toks1, dict1, valuetype = "glob", verbose = TRUE, nomatch = "NONE"))
#> applying a dictionary consisting of 3 keys
#> Document-feature matrix of: 59 documents, 4 features (11.02% sparse) and 4 docvars.
#> features
#> docs country law freedom none
#> 1789-Washington 2 1 6 1526
#> 1793-Washington 0 1 0 146
#> 1797-Adams 3 10 6 2555
#> 1801-Jefferson 0 6 11 1906
#> 1805-Jefferson 1 13 6 2359
#> 1809-Madison 2 2 5 1250
#> [ reached max_ndoc ... 53 more documents ]
dict2 <- dictionary(list(country = "united states",
law = c("law", "constitution"),
freedom = c("freedom", "liberty")))
# dfm(applyDictionary(toks1, dict2, valuetype = "fixed"))
dfm(tokens_lookup(toks1, dict2, valuetype = "fixed"))
#> Document-feature matrix of: 59 documents, 3 features (20.90% sparse) and 4 docvars.
#> features
#> docs country law freedom
#> 1789-Washington 2 1 1
#> 1793-Washington 0 1 0
#> 1797-Adams 3 8 3
#> 1801-Jefferson 0 6 7
#> 1805-Jefferson 1 9 5
#> 1809-Madison 2 2 3
#> [ reached max_ndoc ... 53 more documents ]
# hierarchical dictionary example
txt <- c(d1 = "The United States has the Atlantic Ocean and the Pacific Ocean.",
d2 = "Britain and Ireland have the Irish Sea and the English Channel.")
toks2 <- tokens(txt)
dict3 <- dictionary(list(US = list(Countries = c("States"),
oceans = c("Atlantic", "Pacific")),
Europe = list(Countries = c("Britain", "Ireland"),
oceans = list(west = "Irish Sea",
east = "English Channel"))))
tokens_lookup(toks2, dict3, levels = 1)
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "US" "US" "US"
#>
#> d2 :
#> [1] "Europe" "Europe" "Europe" "Europe"
#>
tokens_lookup(toks2, dict3, levels = 2)
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "Countries" "oceans" "oceans"
#>
#> d2 :
#> [1] "Countries" "Countries" "oceans" "oceans"
#>
tokens_lookup(toks2, dict3, levels = 1:2)
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "US.Countries" "US.oceans" "US.oceans"
#>
#> d2 :
#> [1] "Europe.Countries" "Europe.Countries" "Europe.oceans" "Europe.oceans"
#>
tokens_lookup(toks2, dict3, levels = 3)
#> Tokens consisting of 2 documents.
#> d1 :
#> character(0)
#>
#> d2 :
#> [1] "west" "east"
#>
tokens_lookup(toks2, dict3, levels = c(1,3))
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "US" "US" "US"
#>
#> d2 :
#> [1] "Europe" "Europe" "Europe.west" "Europe.east"
#>
tokens_lookup(toks2, dict3, levels = c(2,3))
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "Countries" "oceans" "oceans"
#>
#> d2 :
#> [1] "Countries" "Countries" "oceans.west" "oceans.east"
#>
# show unmatched tokens
tokens_lookup(toks2, dict3, nomatch = "_UNMATCHED")
#> Tokens consisting of 2 documents.
#> d1 :
#> [1] "_UNMATCHED" "_UNMATCHED" "US.Countries" "_UNMATCHED" "_UNMATCHED"
#> [6] "US.oceans" "_UNMATCHED" "_UNMATCHED" "_UNMATCHED" "US.oceans"
#> [11] "_UNMATCHED" "_UNMATCHED"
#>
#> d2 :
#> [1] "Europe.Countries" "_UNMATCHED" "Europe.Countries"
#> [4] "_UNMATCHED" "_UNMATCHED" "Europe.oceans.west"
#> [7] "_UNMATCHED" "_UNMATCHED" "Europe.oceans.east"
#> [10] "_UNMATCHED"
#>
# nested matching differences
dict4 <- dictionary(list(paper = "New York Times", city = "New York"))
toks4 <- tokens("The New York Times is a New York paper.")
tokens_lookup(toks4, dict4, nested_scope = "key", exclusive = FALSE)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The" "PAPER" "CITY" "is" "a" "CITY" "paper" "."
#>
tokens_lookup(toks4, dict4, nested_scope = "dictionary", exclusive = FALSE)
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "The" "PAPER" "is" "a" "CITY" "paper" "."
#>