This function selects or removes features from a dfm or fcm, based on feature name matches with pattern. The most common usages are to eliminate features from a dfm already constructed, such as stopwords, or to select only terms of interest from a dictionary.
dfm_select(
  x,
  pattern = NULL,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  min_nchar = NULL,
  max_nchar = NULL,
  verbose = quanteda_options("verbose")
)

dfm_remove(x, ...)

dfm_keep(x, ...)

fcm_select(
  x,
  pattern = NULL,
  selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"),
  case_insensitive = TRUE,
  verbose = quanteda_options("verbose"),
  ...
)

fcm_remove(x, ...)

fcm_keep(x, ...)
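x | the dfm or fcm object whose features will be selected |
---|---|
pattern | a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
selection | whether to "keep" or "remove" the features that match pattern |
valuetype | the type of pattern matching: "glob", "regex", or "fixed" |
case_insensitive | logical; if TRUE, ignore case when matching pattern |
min_nchar, max_nchar | optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL |
verbose | if TRUE, print messages about the selection |
... | used only for passing arguments from the dfm_remove, dfm_keep, fcm_remove, and fcm_keep wrappers to dfm_select and fcm_select |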
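The three valuetype settings are easiest to compare side by side. The following is a minimal sketch, assuming the quanteda package is installed; the toy text and the object names (dfmat_toy, feats_glob, feats_regex, feats_fixed) are invented for illustration and are not part of the package.

```r
library("quanteda")

# Toy dfm; the text is made up for illustration (dfm() lowercases by default)
dfmat_toy <- dfm(tokens(c(d1 = "tax taxes taxation plan")))

# "glob" (the default): "*" stands for any run of characters
feats_glob <- featnames(dfm_select(dfmat_toy, pattern = "tax*"))

# "regex": the pattern is a regular expression; ".+" requires at least
# one character after "tax"
feats_regex <- featnames(
  dfm_select(dfmat_toy, pattern = "^tax.+", valuetype = "regex")
)

# "fixed": the pattern must equal the feature name exactly
feats_fixed <- featnames(
  dfm_select(dfmat_toy, pattern = "tax", valuetype = "fixed")
)

print(feats_glob)
print(feats_regex)
print(feats_fixed)
```

In this sketch the glob pattern should keep all three "tax" features, the regex should keep only the longer forms, and the fixed pattern should keep the single exact match.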
A dfm or fcm object, after the feature selection has been applied.
For compatibility with earlier versions, when pattern is a dfm object and selection = "keep", then this will be equivalent to calling dfm_match(). In this case, the following settings are always used: case_insensitive = FALSE, and valuetype = "fixed". This functionality is deprecated, however, and you should use dfm_match() instead.
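As a rough illustration of the recommended replacement, the sketch below conforms one dfm to the feature set of another with dfm_match(). It assumes quanteda is installed; the two toy documents and the names dfmat_train, dfmat_test, and dfmat_conf are hypothetical.

```r
library("quanteda")

# Two toy dfms with overlapping but different features (illustrative texts)
dfmat_train <- dfm(tokens(c(d1 = "tax plan opposition")))
dfmat_test  <- dfm(tokens(c(d2 = "tax cut plan")))

# Make the test dfm carry exactly the features of the training dfm:
# features absent from dfmat_test are added with zero counts, and
# features not in the training set (here "cut") are dropped
dfmat_conf <- dfm_match(dfmat_test, features = featnames(dfmat_train))

print(featnames(dfmat_conf))
print(as.matrix(dfmat_conf))
```

Passing featnames() of the reference dfm makes the intent explicit, which is one reason to prefer this over supplying a dfm as pattern.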
dfm_remove and fcm_remove are simply convenience wrappers for calling dfm_select and fcm_select with selection = "remove".
dfm_keep and fcm_keep are simply convenience wrappers for calling dfm_select and fcm_select with selection = "keep".
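The equivalence between the wrappers and the selection argument can be checked directly. This is a small sketch, assuming quanteda is installed; the object names (dfmat_sw, via_wrapper, via_select, kept_wrapper, kept_select) are made up for the example.

```r
library("quanteda")

dfmat_sw <- dfm(tokens("No if, and, or but about it: lots of stopwords."))

# Removing stopwords through the wrapper ...
via_wrapper <- dfm_remove(dfmat_sw, pattern = stopwords("english"))
# ... and through dfm_select() with selection = "remove"
via_select <- dfm_select(dfmat_sw, pattern = stopwords("english"),
                         selection = "remove")

# The same comparison for the "keep" direction
kept_wrapper <- dfm_keep(dfmat_sw, pattern = stopwords("english"))
kept_select <- dfm_select(dfmat_sw, pattern = stopwords("english"),
                          selection = "keep")

print(identical(featnames(via_wrapper), featnames(via_select)))
print(identical(featnames(kept_wrapper), featnames(kept_select)))
```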
This function selects features based on their labels. To select features based on the values of the document-feature matrix, use dfm_trim().
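To make the distinction concrete, the sketch below contrasts selection by label with selection by count. It assumes quanteda is installed; the toy documents and the names dfmat_cnt, by_label, and by_value are illustrative only.

```r
library("quanteda")

# "tax" occurs three times in total; "plan" and "cut" occur once each
dfmat_cnt <- dfm(tokens(c(d1 = "tax tax plan", d2 = "tax cut")))

# Selection by label: keep the feature named "plan", whatever its count
by_label <- dfm_select(dfmat_cnt, pattern = "plan")

# Selection by value: keep features occurring at least twice overall
by_value <- dfm_trim(dfmat_cnt, min_termfreq = 2)

print(featnames(by_label))
print(featnames(by_value))
```

Here dfm_select() never looks at the counts, while dfm_trim() never looks at the feature names.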
# load the package that provides tokens(), dfm(), fcm(), and stopwords()
library("quanteda")

dfmat <- tokens(c("My Christmas was ruined by your opposition tax plan.",
                  "Does the United_States or Sweden have more progressive taxation?")) %>%
  dfm(tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))

# keep only features matching the dictionary values
dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.00% sparse) and 0 docvars.
#>        features
#> docs    by
#>   text1  1
#>   text2  0

# keep or remove features matching regular expressions
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have
#>   text1      1    1          1   1    1 1   0  0      0    0
#>   text2      0    0          0   0    0 0   1  1      1    1
#> [ reached max_nfeat ... 4 more features ]

# keep or remove stopwords using exact matching
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation
#>   text1        0
#>   text2        1
#> [ reached max_nfeat ... 1 more feature ]

# select based on character length
dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.00% sparse) and 0 docvars.
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1

dfmat <- dfm(tokens(c("This is a document with lots of stopwords.",
                      "No if, and, or but about it: lots of stopwords.")))
dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.89% sparse) and 0 docvars.
#>        features
#> docs    this is a document with lots of stopwords . no
#>   text1    1  1 1        1    1    1  1         1 1  0
#>   text2    0  0 0        0    0    1  1         1 1  1
#> [ reached max_nfeat ... 8 more features ]
dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.00% sparse) and 0 docvars.
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1

# the same selection applied to a feature co-occurrence matrix
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#>            features
#> features    this contains lots of stopwords no if and or but
#>   this         0        1    1  1         1  0  0   0  0   0
#>   contains     0        0    1  1         1  0  0   0  0   0
#>   lots         0        0    0  1         1  1  1   1  1   1
#>   of           0        0    0  0         1  0  0   0  0   0
#>   stopwords    0        0    0  0         0  0  0   0  0   0
#>   no           0        0    0  0         0  0  1   1  1   1
#>   if           0        0    0  0         0  0  0   1  1   1
#>   and          0        0    0  0         0  0  0   0  1   1
#>   or           0        0    0  0         0  0  0   0  0   1
#>   but          0        0    0  0         0  0  0   0  0   0
#> [ reached max_feat ... 2 more features, reached max_nfeat ... 2 more features ]
fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0