Returns a document-feature matrix reduced in size based on term and document frequency. Features are usually selected by a minimum frequency, but maximum frequencies may also be used. Setting both a minimum and a maximum selects features whose frequencies fall within that range.

Feature selection is implemented by considering features across all documents: their counts are summed for term frequency, or the documents in which they occur are counted for document frequency. Rank and quantile versions of these are also implemented, for taking the first \(n\) features in descending order of overall counts or document frequencies, or for setting cutoffs as quantiles of all frequencies.
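As a rough illustration of the two quantities described above, the following base R sketch computes them on a small hand-made count matrix. The matrix `m` and its values are invented for illustration, and a plain matrix is used here only for simplicity; it is not how a dfm is stored.

```r
# Toy document-by-feature count matrix (hypothetical values)
m <- matrix(
  c(2, 0, 1,
    3, 1, 0,
    0, 0, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(docs = c("d1", "d2", "d3"), features = c("a", "b", "c"))
)

# Term frequency: total count of each feature across all documents
term_freq <- colSums(m)

# Document frequency: number of documents in which each feature occurs
doc_freq <- colSums(m > 0)

# Keep features with term frequency >= 4 and document frequency >= 2
keep <- term_freq >= 4 & doc_freq >= 2
m[, keep, drop = FALSE]
```

Here features `a` and `c` survive, while `b` (one occurrence in one document) is dropped. The number of rows is unchanged.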

dfm_trim(
  x,
  min_termfreq = NULL,
  max_termfreq = NULL,
  termfreq_type = c("count", "prop", "rank", "quantile"),
  min_docfreq = NULL,
  max_docfreq = NULL,
  docfreq_type = c("count", "prop", "rank", "quantile"),
  sparsity = NULL,
  verbose = quanteda_options("verbose"),
  ...
)

Arguments

x

a dfm object

min_termfreq, max_termfreq

minimum/maximum values of feature frequencies across all documents, below/above which features will be removed

termfreq_type

how min_termfreq and max_termfreq are interpreted. "count" sums the frequencies; "prop" divides the term frequencies by the total sum; "rank" is matched against the inverted ranking of features in terms of overall frequency, so that 1, 2, ... are the highest and second highest frequency features, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of term frequencies.

min_docfreq, max_docfreq

minimum/maximum values of a feature's document frequency, below/above which features will be removed

docfreq_type

specify how min_docfreq and max_docfreq are interpreted. "count" is the same as docfreq(x, scheme = "count"); "prop" divides the document frequencies by the total sum; "rank" is matched against the inverted ranking of document frequency, so that 1, 2, ... are the features with the highest and second highest document frequencies, and so on; "quantile" sets the cutoffs according to the quantiles (see quantile()) of document frequencies.

sparsity

equivalent to 1 - min_docfreq (with min_docfreq expressed as a proportion), included for comparison with removeSparseTerms() from the tm package

verbose

if TRUE, print messages reporting how many features were removed

...

not used
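The four `termfreq_type` interpretations can be sketched in base R on a plain frequency vector. This is an illustration only, not quanteda's internal code: the vector `freq` mirrors the ten-feature dfm used in the quantile examples, and the use of `type = 1` in `quantile()` is an assumption, chosen because it reproduces the cutoff of 9 reported in the verbose output there.

```r
# Term frequencies for features a..j (10, 9, ..., 1)
freq <- setNames(10:1, letters[1:10])

# "count": compare the raw sums with the threshold
names(freq)[freq >= 9]                      # "a" "b"

# "prop": compare each frequency's share of the total (the sum is 55)
prop <- freq / sum(freq)
names(prop)[prop >= 0.15]                   # "a" "b"  (10/55 and 9/55)

# "rank": rank 1 is the most frequent feature
rk <- rank(-freq, ties.method = "min")
names(rk)[rk <= 2]                          # "a" "b"

# "quantile": turn a probability into a frequency cutoff
# (type = 1 is an assumption, as noted above)
cutoff <- quantile(freq, 0.800001, type = 1, names = FALSE)
cutoff                                      # 9
names(freq)[freq >= cutoff]                 # "a" "b"
```

Under the same assumption, a probability of exactly 0.8 gives a cutoff of 8 rather than 9, which would explain why the quantile example passes 0.800001 to keep only the two most frequent features.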

Value

A dfm reduced in features (with the same number of documents)

Note

Trimming a dfm object is an operation based on the values in the document-feature matrix. To select subsets of a dfm based on the features themselves (meaning the feature labels from featnames()), such as keeping those that match a regular expression or removing those that match a stopword list, use dfm_select().
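A minimal sketch of this distinction, assuming quanteda is installed and attached. The two-document input and the object name `dfmat_toy` are made up for illustration.

```r
library(quanteda)

# Hypothetical two-document example
toks <- tokens(c(d1 = "a a b c", d2 = "a b b d"))
dfmat_toy <- dfm(toks)

# dfm_trim() filters on the counts: keep features occurring >= 3 times overall
trimmed <- dfm_trim(dfmat_toy, min_termfreq = 3)
featnames(trimmed)    # "a" "b"

# dfm_select() filters on the feature labels: drop features matching a regex
selected <- dfm_select(dfmat_toy, pattern = "^[ab]$",
                       selection = "remove", valuetype = "regex")
featnames(selected)   # "c" "d"
```

The first call depends on how often each feature occurs, while the second depends only on what the feature is called.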

Examples

dfmat <- dfm(tokens(data_corpus_inaugural))

# keep only words occurring >= 10 times and in >= 2 documents
dfm_trim(dfmat, min_termfreq = 10, min_docfreq = 2)
#> Document-feature matrix of: 59 documents, 1,533 features (68.72% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens  of the senate and house representatives :
#>   1789-Washington               1  71 116      1  48     2               2 1
#>   1793-Washington               0  11  13      0   2     0               0 1
#>   1797-Adams                    3 140 163      1 130     0               2 0
#>   1801-Jefferson                2 104 130      0  81     0               0 1
#>   1805-Jefferson                0 101 143      0  93     0               0 0
#>   1809-Madison                  1  69 104      0  43     0               0 0
#>                  features
#> docs              among to
#>   1789-Washington     1 48
#>   1793-Washington     0  5
#>   1797-Adams          4 72
#>   1801-Jefferson      1 61
#>   1805-Jefferson      7 83
#>   1809-Madison        0 61
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 1,523 more features ]

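# keep only words occurring >= 10 times; note that with the default
# docfreq_type = "count", min_docfreq = 0.4 is a count threshold, not a
# proportion (add docfreq_type = "prop" to require 0.4 of the documents)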
dfm_trim(dfmat, min_termfreq = 10, min_docfreq = 0.4)
#> Document-feature matrix of: 59 documents, 1,534 features (68.74% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens  of the senate and house representatives :
#>   1789-Washington               1  71 116      1  48     2               2 1
#>   1793-Washington               0  11  13      0   2     0               0 1
#>   1797-Adams                    3 140 163      1 130     0               2 0
#>   1801-Jefferson                2 104 130      0  81     0               0 1
#>   1805-Jefferson                0 101 143      0  93     0               0 0
#>   1809-Madison                  1  69 104      0  43     0               0 0
#>                  features
#> docs              among to
#>   1789-Washington     1 48
#>   1793-Washington     0  5
#>   1797-Adams          4 72
#>   1801-Jefferson      1 61
#>   1805-Jefferson      7 83
#>   1809-Madison        0 61
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 1,524 more features ]

# keep only words occurring <= 10 times and in <= 2 documents
dfm_trim(dfmat, max_termfreq = 10, max_docfreq = 2)
#> Document-feature matrix of: 59 documents, 5,675 features (97.86% sparse) and 4 docvars.
#>                  features
#> docs              notification 14th month fondest predilection flattering
#>   1789-Washington            1    1     1       1            1          1
#>   1793-Washington            0    0     0       0            0          0
#>   1797-Adams                 0    0     0       0            0          0
#>   1801-Jefferson             0    0     0       0            0          0
#>   1805-Jefferson             0    0     0       0            0          0
#>   1809-Madison               0    0     0       0            0          0
#>                  features
#> docs              immutable asylum interruptions gradual
#>   1789-Washington         2      1             1       1
#>   1793-Washington         0      0             0       0
#>   1797-Adams              0      0             0       0
#>   1801-Jefferson          0      0             0       0
#>   1805-Jefferson          0      0             0       0
#>   1809-Madison            0      0             0       0
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 5,665 more features ]

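# keep only words occurring <= 10 times; with the default
# docfreq_type = "count", max_docfreq = 0.75 is a count threshold that no
# feature can meet, so nothing is kept (add docfreq_type = "prop" to allow
# at most 3/4 of the documents)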
dfm_trim(dfmat, max_termfreq = 10, max_docfreq = 0.75)
#> Document-feature matrix of: 59 documents, 0 features (0.00% sparse) and 4 docvars.
#> [ reached max_ndoc ... 53 more documents ]

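# keep only words making up at least 0.005 (5 in 1,000) of all tokens;
# min_docfreq = 0.4 is again a count threshold, since docfreq_type is left
# at its default of "count"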
dfm_trim(dfmat, min_docfreq = 0.4, min_termfreq = 0.005, termfreq_type = "prop")
#> Document-feature matrix of: 59 documents, 27 features (0.88% sparse) and 4 docvars.
#>                  features
#> docs               of the and to have with that which by   ,
#>   1789-Washington  71 116  48 48   12   17   18    36 20  70
#>   1793-Washington  11  13   2  5    1    0    1     1  2   5
#>   1797-Adams      140 163 130 72    7   16   22    20 30 201
#>   1801-Jefferson  104 130  81 61   10   20   24    25 16 128
#>   1805-Jefferson  101 143  93 83   24   28   37    23 22 142
#>   1809-Madison     69 104  43 61    8   10    9    14 11  47
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 17 more features ]

## quantiles
toks <- as.tokens(list(unlist(mapply(rep, letters[1:10], 10:1), use.names = FALSE)))
dfmat <- dfm(toks)
dfmat
#> Document-feature matrix of: 1 document, 10 features (0.00% sparse) and 0 docvars.
#>        features
#> docs     a b c d e f g h i j
#>   text1 10 9 8 7 6 5 4 3 2 1

# keep only words above the 80th percentile
dfm_trim(dfmat, min_termfreq = 0.800001, termfreq_type = "quantile", verbose = TRUE)
#> Removing features occurring: 
#>   - fewer than 9 times: 8
#>   Total features removed: 8 (80.0%).
#> Document-feature matrix of: 1 document, 2 features (0.00% sparse) and 0 docvars.
#>        features
#> docs     a b
#>   text1 10 9

# keep only words at or above the 20th percentile of term frequency and in <= 2 documents
dfm_trim(dfmat, min_termfreq = 0.2, max_docfreq = 2, termfreq_type = "quantile")
#> Document-feature matrix of: 1 document, 9 features (0.00% sparse) and 0 docvars.
#>        features
#> docs     a b c d e f g h i
#>   text1 10 9 8 7 6 5 4 3 2

if (FALSE) {
# compare to removeSparseTerms from the tm package
(dfmattm <- convert(dfmat, "tm"))
tm::removeSparseTerms(dfmattm, 0.7)
# min_docfreq must be read as a proportion to match sparsity = 0.7
dfm_trim(dfmat, min_docfreq = 0.3, docfreq_type = "prop")
dfm_trim(dfmat, sparsity = 0.7)
}