This function selects or removes features from a dfm or fcm, based on feature name matches with pattern. The most common usages are to eliminate features from a dfm already constructed, such as stopwords, or to select only terms of interest from a dictionary.

dfm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  min_nchar = NULL, max_nchar = NULL,
  verbose = quanteda_options("verbose"))

dfm_remove(x, ...)

dfm_keep(x, ...)

fcm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = quanteda_options("verbose"), ...)

fcm_remove(x, pattern = NULL, ...)

fcm_keep(x, pattern = NULL, ...)

Arguments

x

the dfm or fcm object whose features will be selected

pattern

a character vector, list of character vectors, dictionary, or collocations object. See pattern for details.

selection

whether to keep or remove the features

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

For dfm_select, pattern may also be a dfm; see Value below.

case_insensitive

if TRUE, ignore case when matching pattern against feature names

min_nchar, max_nchar

optional numerics specifying the minimum and maximum length in characters for tokens to be removed or kept; defaults are NULL for no limits. These are applied after (and hence, in addition to) any selection based on pattern matches.

verbose

if TRUE, print a message reporting how many features were removed

...

used only for passing arguments from dfm_remove or dfm_keep to dfm_select. Cannot include selection, because each wrapper sets that argument itself.
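As a rough illustration of how the three valuetype settings differ, the sketch below matches the same stem against a toy dfm. The example text and the object names are made up for illustration, and the dfm is built from tokens() so that the sketch does not depend on dfm() accepting a character vector.

```r
library("quanteda")

# toy dfm with four features: tax, taxes, taxation, syntax
dfmat_toy <- dfm(tokens("tax taxes taxation syntax"))

# "glob": the wildcard * matches any trailing characters, anchored to the whole feature
feats_glob <- featnames(dfm_select(dfmat_toy, pattern = "tax*", valuetype = "glob"))

# "regex": unanchored, so any feature containing "tax" matches
feats_regex <- featnames(dfm_select(dfmat_toy, pattern = "tax", valuetype = "regex"))

# "fixed": exact match of the whole feature only
feats_fixed <- featnames(dfm_select(dfmat_toy, pattern = "tax", valuetype = "fixed"))

feats_glob   # tax, taxes, taxation
feats_regex  # tax, taxes, taxation, syntax
feats_fixed  # tax
```

Anchoring a regular expression explicitly (for instance "^tax") should restrict it to features that begin with the stem.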

Value

A dfm or fcm object, after the feature selection has been applied.

For compatibility with earlier versions, when pattern is a dfm object and selection = "keep", then this will be equivalent to calling dfm_match. In this case, the following settings are always used: case_insensitive = FALSE, and valuetype = "fixed". This functionality is deprecated, however, and you should use dfm_match instead.
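A minimal sketch of the recommended replacement follows, assuming a version of quanteda that provides dfm_match. The texts and object names are invented for illustration; the point is only that the second dfm ends up with the same feature set as the first.

```r
library("quanteda")

# two dfms with only partly overlapping features (toy data)
dfmat_train <- dfm(tokens(c(d1 = "a b c d", d2 = "b c e")))
dfmat_test  <- dfm(tokens(c(d3 = "c d f")))

# conform the test dfm to the training features:
# features absent from dfmat_test are padded with zeros, extras such as "f" are dropped
dfmat_conf <- dfm_match(dfmat_test, features = featnames(dfmat_train))

identical(featnames(dfmat_conf), featnames(dfmat_train))
```

This is the typical use when a model fitted on one dfm has to be applied to another.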

Details

dfm_remove and fcm_remove are simply convenience wrappers for calling dfm_select and fcm_select with selection = "remove".

dfm_keep and fcm_keep are simply convenience wrappers for calling dfm_select and fcm_select with selection = "keep".
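The equivalence described above can be checked with a short sketch. The text and the word list are hypothetical, chosen only so that some features are removed and others remain.

```r
library("quanteda")

dfmat_toy <- dfm(tokens(c("the cat sat on the mat", "a dog sat")))
drop_words <- c("the", "a", "on")

# wrapper form
dfmat_a <- dfm_remove(dfmat_toy, pattern = drop_words)

# explicit form that the wrapper is described as calling
dfmat_b <- dfm_select(dfmat_toy, pattern = drop_words, selection = "remove")

# both should leave the same features behind
identical(featnames(dfmat_a), featnames(dfmat_b))
```

The wrapper form is mainly useful in pipelines, where it states the intent without the extra argument.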

Note

This function selects features based on their labels. To select features based on the values of the document-feature matrix, use dfm_trim.
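To make the distinction concrete, the sketch below contrasts the two approaches on a toy dfm. The data are invented, and min_termfreq is assumed to be the name of the frequency threshold argument of dfm_trim in the installed version.

```r
library("quanteda")

# apple occurs 3 times; banana and cherry occur once each
dfmat_toy <- dfm(tokens(c("apple apple banana", "apple cherry")))

# selection by label: keep features whose name contains "an"
feats_by_label <- featnames(dfm_select(dfmat_toy, pattern = "*an*"))

# selection by value: keep features occurring at least twice overall
feats_by_count <- featnames(dfm_trim(dfmat_toy, min_termfreq = 2))

feats_by_label  # banana
feats_by_count  # apple
```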

Examples

dfmat <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             tolower = FALSE)
dict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                        wordsEndingInY = c("by", "my"),
                        notintext = "blahblah"))
dfm_select(dfmat, pattern = dict)
#> Document-feature matrix of: 2 documents, 4 features (50.0% sparse).
#> 2 x 4 sparse Matrix of class "dfm"
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
dfm_select(dfmat, pattern = dict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50.0% sparse).
#> 2 x 1 sparse Matrix of class "dfm"
#>        features
#> docs    by
#>   text1  1
#>   text2  0
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50.0% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
dfm_select(dfmat, pattern = c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50.0% sparse).
#> 2 x 14 sparse Matrix of class "dfm"
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have more progressive
#>   text1      1    1          1   1    1 1   0  0      0    0    0           0
#>   text2      0    0          0   0    0 0   1  1      1    1    1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1
dfm_select(dfmat, pattern = stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50.0% sparse).
#> 2 x 9 sparse Matrix of class "dfm"
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
dfm_select(dfmat, pattern = stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50.0% sparse).
#> 2 x 11 sparse Matrix of class "dfm"
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1
# select based on character length
dfm_select(dfmat, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50.0% sparse).
#> 2 x 7 sparse Matrix of class "dfm"
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1
dfmat <- dfm(c("This is a document with lots of stopwords.",
               "No if, and, or but about it: lots of stopwords."))
dfmat
#> Document-feature matrix of: 2 documents, 18 features (38.9% sparse).
#> 2 x 18 sparse Matrix of class "dfm"
#>        features
#> docs    this is a document with lots of stopwords . no if , and or but about it
#>   text1    1  1 1        1    1    1  1         1 1  0  0 0   0  0   0     0  0
#>   text2    0  0 0        0    0    1  1         1 1  1  1 2   1  1   1     1  1
#>        features
#> docs    :
#>   text1 0
#>   text2 1
dfm_remove(dfmat, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25.0% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
fcmat <- fcm(toks)
fcmat
#> Feature co-occurrence matrix of: 12 by 12 features.
#> 12 x 12 sparse Matrix of class "fcm"
#>            features
#> features    this contains lots of stopwords no if and or but about it
#>   this         0        1    1  1         1  0  0   0  0   0     0  0
#>   contains     0        0    1  1         1  0  0   0  0   0     0  0
#>   lots         0        0    0  1         1  1  1   1  1   1     1  1
#>   of           0        0    0  0         1  0  0   0  0   0     0  0
#>   stopwords    0        0    0  0         0  0  0   0  0   0     0  0
#>   no           0        0    0  0         0  0  1   1  1   1     1  1
#>   if           0        0    0  0         0  0  0   1  1   1     1  1
#>   and          0        0    0  0         0  0  0   0  1   1     1  1
#>   or           0        0    0  0         0  0  0   0  0   1     1  1
#>   but          0        0    0  0         0  0  0   0  0   0     1  1
#>   about        0        0    0  0         0  0  0   0  0   0     0  1
#>   it           0        0    0  0         0  0  0   0  0   0     0  0
fcm_remove(fcmat, stopwords("english"))
#> Feature co-occurrence matrix of: 3 by 3 features.
#> 3 x 3 sparse Matrix of class "fcm"
#>            features
#> features    contains lots stopwords
#>   contains         0    1         1
#>   lots             0    0         1
#>   stopwords        0    0         0