This function selects or discards features from a dfm or fcm, based on whether the feature names match pattern. The most common uses are to remove features, such as stopwords, from a dfm that has already been constructed, or to keep only the terms of interest listed in a dictionary.

dfm_select(x, pattern, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  min_nchar = 1L, max_nchar = 63L, verbose = quanteda_options("verbose"),
  ...)

dfm_remove(x, pattern, ...)

fcm_select(x, pattern = NULL, selection = c("keep", "remove"),
  valuetype = c("glob", "regex", "fixed"), case_insensitive = TRUE,
  verbose = TRUE, ...)

fcm_remove(x, pattern, ...)

Arguments

x

the dfm or fcm object whose features will be selected

pattern

a character vector, list of character vectors, dictionary, collocations, or dfm. See pattern for details.

selection

whether to keep or remove the features

valuetype

the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details.

case_insensitive

ignore the case of dictionary values if TRUE

min_nchar, max_nchar

numerics specifying the minimum and maximum length in characters for features to be removed or kept; defaults are 1 and 63, as shown in the usage above. (Set max_nchar to NULL for no upper limit.) These are applied after (and hence, in addition to) any selection based on pattern matches.

verbose

if TRUE, print a message about how many features were kept or removed

...

used only for passing arguments from *_remove to *_select functions
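The way valuetype, case_insensitive, and the character-length limits interact can be sketched in base R on a plain character vector. This is only an illustration of the matching rules described above, not quanteda's implementation: match_features(), within_length(), and the feats vector are hypothetical names introduced here.

```r
# Hypothetical feature names, standing in for the feature names of x
feats <- c("tax", "taxation", "Taxes", "plan", "t.x", "a")

# Hypothetical helper (not a quanteda function): does a feature match any pattern?
match_features <- function(feats, pattern,
                           valuetype = c("glob", "regex", "fixed"),
                           case_insensitive = TRUE) {
  valuetype <- match.arg(valuetype)
  if (valuetype == "fixed") {
    # exact matching, optionally ignoring case
    if (case_insensitive) {
      return(tolower(feats) %in% tolower(pattern))
    }
    return(feats %in% pattern)
  }
  # glob wildcards are translated to anchored regular expressions;
  # regex patterns are used as given
  rx <- if (valuetype == "glob") utils::glob2rx(pattern) else pattern
  hits <- lapply(rx, function(p) grepl(p, feats, ignore.case = case_insensitive))
  Reduce("|", hits)
}

# Hypothetical helper: the length filter applied in addition to pattern matching
within_length <- function(feats, min_nchar = 1L, max_nchar = 63L) {
  ok <- nchar(feats) >= min_nchar
  if (!is.null(max_nchar)) ok <- ok & nchar(feats) <= max_nchar
  ok
}

glob_kept  <- feats[match_features(feats, "tax*", "glob")]   # "tax" "taxation" "Taxes"
regex_kept <- feats[match_features(feats, "t.x", "regex")]   # "tax" "taxation" "Taxes" "t.x"
fixed_kept <- feats[match_features(feats, "t.x", "fixed")]   # "t.x"

# pattern match first, then the length limits on top of it
both_kept <- feats[match_features(feats, "*", "glob") &
                     within_length(feats, min_nchar = 4L)]   # "taxation" "Taxes" "plan"
```

The same pattern "t.x" selects four features as a regular expression but only one as a fixed string, which is why valuetype matters whenever a pattern contains characters such as "." or "*".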

Value

A dfm or fcm object, after the feature selection has been applied. When pattern is a dfm object, the returned object will be identical in its feature set to the dfm supplied as the pattern argument. This means that any features in x that are not in the dfm provided as pattern will be discarded, and that any features found in the dfm supplied as pattern but not found in x will be added with all zero counts. Because selecting on a dfm is designed to produce a selected dfm with an exact feature match, the following settings are always used when pattern is a dfm object: case_insensitive = FALSE and valuetype = "fixed". Selecting on a dfm is useful when you have trained a model on one dfm and need to project it onto a test set whose features must be identical. It is also used in bootstrap_dfm. See examples.
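The effect of selecting on a dfm (drop features missing from the pattern, pad features missing from x with zeros) can be illustrated with ordinary base R matrices. This is a minimal sketch under that assumption, not how quanteda implements it; conform() is a hypothetical helper, and the counts mirror the "selecting on a dfm" example on this page.

```r
# Counts for two documents, standing in for the dfm x
m1 <- matrix(c(1, 1, 1, 1, 0, 0,
               0, 0, 1, 0, 1, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(docs = c("text1", "text2"),
                             features = c("this", "is", "text", "one", "the", "second")))

# Feature set of the dfm supplied as pattern
target <- c("the", "second", "text", "this", "is", "three")

# Hypothetical helper: make m carry exactly the features in target
conform <- function(m, target) {
  out <- matrix(0, nrow = nrow(m), ncol = length(target),
                dimnames = list(docs = rownames(m), features = target))
  shared <- intersect(target, colnames(m))  # exact, case-sensitive match
  out[, shared] <- m[, shared]              # features absent from m stay all zero
  out
}

m3 <- conform(m1, target)
m3
```

Here "one" is discarded because it is not in the target feature set, and "three" is added as an all-zero column, so m3 has the same features, in the same order, as target.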

Details

dfm_remove and fcm_remove are simply convenience wrappers that call dfm_select and fcm_select with selection = "remove".
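The wrapper relationship can be sketched with two small base R functions operating on a character vector. Both select_features() and remove_features() are hypothetical stand-ins that use exact matching only; they show the delegation pattern, not the package's source code.

```r
# Hypothetical stand-in for a *_select function (exact matching only)
select_features <- function(feats, pattern,
                            selection = c("keep", "remove"),
                            case_insensitive = TRUE) {
  selection <- match.arg(selection)
  hit <- if (case_insensitive) {
    tolower(feats) %in% tolower(pattern)
  } else {
    feats %in% pattern
  }
  if (selection == "keep") feats[hit] else feats[!hit]
}

# Hypothetical stand-in for a *_remove function: fixes selection = "remove"
# and forwards everything else through ...
remove_features <- function(feats, pattern, ...) {
  select_features(feats, pattern, selection = "remove", ...)
}

feats <- c("document", "With", "lots", "of", "stopwords")
removed_default <- remove_features(feats, c("with", "of"))
removed_cased   <- remove_features(feats, c("with", "of"), case_insensitive = FALSE)
```

Because the extra arguments travel through ..., an option such as case_insensitive reaches the select function unchanged.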

Note

This function selects features based on their labels. To select features based on the values of the document-feature matrix, use dfm_trim.
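The difference between selecting on labels and selecting on values can be shown on a small base R count matrix. This sketch does not call dfm_trim or show its arguments; the matrix and the threshold of 2 are hypothetical and serve only to contrast the two ideas.

```r
# Hypothetical document-feature counts
m <- matrix(c(1, 0, 2,
              0, 0, 3),
            nrow = 2, byrow = TRUE,
            dimnames = list(docs = c("text1", "text2"),
                            features = c("tax", "plan", "the")))

# Label-based selection: keep columns by feature name
by_label <- m[, colnames(m) %in% c("tax", "plan"), drop = FALSE]

# Value-based selection: keep columns whose total count reaches a threshold
by_value <- m[, colSums(m) >= 2, drop = FALSE]

colnames(by_label)  # "tax" "plan"
colnames(by_value)  # "the"
```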

Examples

myDfm <- dfm(c("My Christmas was ruined by your opposition tax plan.",
               "Does the United_States or Sweden have more progressive taxation?"),
             tolower = FALSE, verbose = FALSE)
mydict <- dictionary(list(countries = c("United_States", "Sweden", "France"),
                          wordsEndingInY = c("by", "my"),
                          notintext = "blahblah"))
dfm_select(myDfm, mydict)
#> Document-feature matrix of: 2 documents, 4 features (50% sparse).
#> 2 x 4 sparse Matrix of class "dfmSparse"
#>        features
#> docs    My by United_States Sweden
#>   text1  1  1             0      0
#>   text2  0  0             1      1
dfm_select(myDfm, mydict, case_insensitive = FALSE)
#> Document-feature matrix of: 2 documents, 1 feature (50% sparse).
#> 2 x 1 sparse Matrix of class "dfmSparse"
#>        features
#> docs    by
#>   text1  1
#>   text2  0
dfm_select(myDfm, c("s$", ".y"), selection = "keep", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 6 features (50% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    My Christmas was by Does United_States
#>   text1  1         1   1  1    0             0
#>   text2  0         0   0  0    1             1
dfm_select(myDfm, c("s$", ".y"), selection = "remove", valuetype = "regex")
#> Document-feature matrix of: 2 documents, 14 features (50% sparse).
#> 2 x 14 sparse Matrix of class "dfmSparse"
#>        features
#> docs    ruined your opposition tax plan . the or Sweden have more progressive
#>   text1      1    1          1   1    1 1   0  0      0    0    0           0
#>   text2      0    0          0   0    0 0   1  1      1    1    1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1
dfm_select(myDfm, stopwords("english"), selection = "keep", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 9 features (50% sparse).
#> 2 x 9 sparse Matrix of class "dfmSparse"
#>        features
#> docs    My was by your Does the or have more
#>   text1  1   1  1    1    0   0  0    0    0
#>   text2  0   0  0    0    1   1  1    1    1
dfm_select(myDfm, stopwords("english"), selection = "remove", valuetype = "fixed")
#> Document-feature matrix of: 2 documents, 11 features (50% sparse).
#> 2 x 11 sparse Matrix of class "dfmSparse"
#>        features
#> docs    Christmas ruined opposition tax plan . United_States Sweden progressive
#>   text1         1      1          1   1    1 1             0      0           0
#>   text2         0      0          0   0    0 0             1      1           1
#>        features
#> docs    taxation ?
#>   text1        0 0
#>   text2        1 1
# select based on character length
dfm_select(myDfm, min_nchar = 5)
#> Document-feature matrix of: 2 documents, 7 features (50% sparse).
#> 2 x 7 sparse Matrix of class "dfmSparse"
#>        features
#> docs    Christmas ruined opposition United_States Sweden progressive taxation
#>   text1         1      1          1             0      0           0        0
#>   text2         0      0          0             1      1           1        1
# selecting on a dfm
txts <- c("This is text one", "The second text", "This is text three")
(dfm1 <- dfm(txts[1:2]))
#> Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    this is text one the second
#>   text1    1  1    1   1   0      0
#>   text2    0  0    1   0   1      1
(dfm2 <- dfm(txts[2:3]))
#> Document-feature matrix of: 2 documents, 6 features (41.7% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    the second text this is three
#>   text1   1      1    1    0  0     0
#>   text2   0      0    1    1  1     1
(dfm3 <- dfm_select(dfm1, dfm2, valuetype = "fixed", verbose = TRUE))
#> kept 5 features, padded 1 feature
#> Document-feature matrix of: 2 documents, 6 features (50% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    the second text this is three
#>   text1   0      0    1    1  1     0
#>   text2   1      1    1    0  0     0
setequal(featnames(dfm2), featnames(dfm3))
#> [1] TRUE
tmpdfm <- dfm(c("This is a document with lots of stopwords.",
                "No if, and, or but about it: lots of stopwords."),
              verbose = FALSE)
tmpdfm
#> Document-feature matrix of: 2 documents, 18 features (38.9% sparse).
#> 2 x 18 sparse Matrix of class "dfmSparse"
#>        features
#> docs    this is a document with lots of stopwords . no if , and or but about it
#>   text1    1  1 1        1    1    1  1         1 1  0  0 0   0  0   0     0  0
#>   text2    0  0 0        0    0    1  1         1 1  1  1 2   1  1   1     1  1
#>        features
#> docs    :
#>   text1 0
#>   text2 1
dfm_remove(tmpdfm, stopwords("english"))
#> Document-feature matrix of: 2 documents, 6 features (25% sparse).
#> 2 x 6 sparse Matrix of class "dfmSparse"
#>        features
#> docs    document lots stopwords . , :
#>   text1        1    1         1 1 0 0
#>   text2        0    1         1 1 2 1
toks <- tokens(c("this contains lots of stopwords",
                 "no if, and, or but about it: lots"),
               remove_punct = TRUE)
tmpfcm <- fcm(toks)
tmpfcm
fcm_remove(tmpfcm, stopwords("english"))