For a dfm object, returns a (weighted) document frequency for each term. The default is a simple count of the number of documents in which a feature occurs more often than a given frequency threshold. (The default threshold is zero, meaning that any feature occurring at least once in a document will be counted.)

docfreq(x, scheme = c("count", "inverse", "inversemax", "inverseprob",
"unary"), smoothing = 0, k = 0, base = 10, threshold = 0,
use.names = TRUE)

## Arguments

- `x`: a dfm
- `scheme`: type of document frequency weighting, computed as follows, where $$N$$ is defined as the number of documents in the dfm and $$s$$ is the smoothing constant:
  - `count`: $$df_j$$, the number of documents for which $$n_{ij} > \textrm{threshold}$$
  - `inverse`: $$\textrm{log}_{base}\left(s + \frac{N}{k + df_j}\right)$$
  - `inversemax`: $$\textrm{log}_{base}\left(s + \frac{\textrm{max}(df_j)}{k + df_j}\right)$$
  - `inverseprob`: $$\textrm{log}_{base}\left(\frac{N - df_j}{k + df_j}\right)$$
  - `unary`: 1 for each feature
- `smoothing`: added to the quotient before taking the logarithm
- `k`: added to the denominator in the "inverse" weighting types, to prevent a zero document count for a term
- `base`: the base with respect to which logarithms in the inverse document frequency weightings are computed; default is 10 (see Manning, Raghavan, and Schütze 2008, p123).
- `threshold`: numeric value of the threshold above which a feature will be considered in the computation of document frequency. The default is 0, meaning that a feature's document frequency will be the number of documents in which it occurs greater than zero times.
- `use.names`: logical; if TRUE attach feature labels as names of the resulting numeric vector
- (argument name not shown): not used
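The weighting formulas can be tried out on a plain matrix without the package. The following is a minimal sketch in base R, not quanteda's implementation: `docfreq_sketch` is a hypothetical helper name, and it assumes a dense numeric matrix with documents in rows and features in columns. It applies the documented formulas literally, so it may differ from the package at edge cases. For example, the `inverseprob` formula yields `-Inf` for a feature that occurs in every document, whereas the example output on this page shows 0 for such features.

```r
# Hypothetical helper (not part of quanteda): applies the documented formulas
# to a plain numeric matrix with documents in rows and features in columns.
docfreq_sketch <- function(m, scheme = "count", smoothing = 0, k = 0,
                           base = 10, threshold = 0) {
  N  <- nrow(m)                  # number of documents
  df <- colSums(m > threshold)   # documents in which each feature exceeds the threshold
  switch(scheme,
    count       = df,
    inverse     = log(smoothing + N / (k + df), base = base),
    inversemax  = log(smoothing + max(df) / (k + df), base = base),
    inverseprob = log((N - df) / (k + df), base = base),
    unary       = setNames(rep(1, ncol(m)), colnames(m)),
    stop("unknown scheme: ", scheme)
  )
}

# Same counts as the two-document example used on this page
m <- matrix(c(1, 1, 2, 1, 0, 0,
              1, 1, 0, 0, 2, 3),
            byrow = TRUE, nrow = 2,
            dimnames = list(c("document1", "document2"),
                            c("this", "is", "a", "sample", "another", "example")))

docfreq_sketch(m)
docfreq_sketch(m, scheme = "inverse")
docfreq_sketch(m, scheme = "inverse", k = 1, smoothing = 1)
docfreq_sketch(m, threshold = 1)   # count only occurrences greater than 1
```

With `threshold = 1`, only cells greater than 1 count, so in this matrix only `a`, `another`, and `example` keep a document frequency of 1.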

## Value

a numeric vector of document frequencies for each feature

## References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

## Examples

dfmat1 <- dfm(data_corpus_inaugural[1:2])
docfreq(dfmat1[, 1:20])
#> fellow-citizens              of             the          senate             and
#>               1               2               2               1               2
#>           house representatives               :           among    vicissitudes
#>               1               1               2               1               1
#>        incident              to            life              no           event
#>               1               2               1               1               1
#>           could            have          filled              me            with
#>               1               2               1               2               1
# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
dfmat2 <-
    matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
           byrow = TRUE, nrow = 2,
           dimnames = list(docs = c("document1", "document2"),
                           features = c("this", "is", "a", "sample",
                                        "another", "example"))) %>%
    as.dfm()
dfmat2
#> Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>            features
#> docs        this is a sample another example
#>   document1    1  1 2      1       0       0
#>   document2    1  1 0      0       2       3
docfreq(dfmat2)
#>    this      is       a  sample another example
#>       2       2       1       1       1       1
docfreq(dfmat2, scheme = "inverse")
#>    this      is       a  sample another example
#> 0.00000 0.00000 0.30103 0.30103 0.30103 0.30103
docfreq(dfmat2, scheme = "inverse", k = 1, smoothing = 1)
#>      this        is         a    sample   another   example
#> 0.2218487 0.2218487 0.3010300 0.3010300 0.3010300 0.3010300
docfreq(dfmat2, scheme = "unary")
#>    this      is       a  sample another example
#>       1       1       1       1       1       1
docfreq(dfmat2, scheme = "inversemax")
#>    this      is       a  sample another example
#> 0.00000 0.00000 0.30103 0.30103 0.30103 0.30103
docfreq(dfmat2, scheme = "inverseprob")
#>    this      is       a  sample another example
#> 0.00000 0.00000 0.30103 0.30103 0.30103 0.30103
#>       0       0       0       0       0       0
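As a check on the smoothed inverse weighting, the values printed for `scheme = "inverse", k = 1, smoothing = 1` can be reproduced by hand from the documented formula. This is only a worked illustration, taking $$N = 2$$ from the two-document example, with $$df_j = 2$$ for `this` and `is`, and $$df_j = 1$$ for the remaining features:

```latex
\log_{10}\left(1 + \frac{2}{1 + 2}\right) = \log_{10}\left(\frac{5}{3}\right) \approx 0.2218
\qquad
\log_{10}\left(1 + \frac{2}{1 + 1}\right) = \log_{10}(2) \approx 0.3010
```

Both results match the printed output to the precision shown.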