Weight a dfm by term frequency-inverse document frequency (tf-idf), with full control over options. Uses fully sparse methods for efficiency.

dfm_tfidf(x, scheme_tf = "count", scheme_df = "inverse", base = 10,
  ...)

Arguments

x

object for which idf or tf-idf will be computed (a document-feature matrix)

scheme_tf

scheme for dfm_weight; defaults to "count"

scheme_df

scheme for docfreq; defaults to "inverse". Other options to docfreq can be passed through the ellipsis (...).

base

the base for the logarithms in the tf and docfreq calls; default is 10

...

additional arguments passed to docfreq.
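
As an illustration of the base argument, the following base R sketch compares inverse document frequencies computed with base 10 and base 2. It does not call dfm_tfidf; it assumes that the "inverse" scheme is log(N / df) in the chosen base, where N is the number of documents, and it borrows a few document frequencies from the docfreq() example on this page. The variable names are illustrative only.

```r
# Effect of the logarithm base on inverse document frequency (base R only).
# Assumption: idf = log(N / df, base), with N documents and document frequency df.
n_docs <- 6
doc_freq <- c(E = 1, F = 2, G = 2, H = 3)  # taken from the docfreq() example on this page

idf10 <- log(n_docs / doc_freq, base = 10)
idf2  <- log(n_docs / doc_freq, base = 2)

# Changing the base rescales every weight by the same constant, log2(10)
idf2 / idf10
```

Under this assumption, the choice of base changes the scale of the weights but not their ranking across features.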

Details

dfm_tfidf computes term frequency-inverse document frequency weighting. The default is to use counts instead of normalized term frequency (the relative term frequency within document), but this can be overridden using scheme_tf = "prop".
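
The weighting described above can be sketched by hand in base R. This is a minimal illustration rather than the package's implementation: it works on a small dense matrix instead of a sparse dfm, and it assumes that the "inverse" scheme is log(N / df) in the given base, which is consistent with the example output on this page. The object names (counts, tfidf_count, tfidf_prop) are hypothetical.

```r
# Hand computation of tf-idf on a small dense matrix (base R only).
# Assumption: inverse document frequency is log(N / df, base).
counts <- matrix(c(1, 1, 2, 1, 0, 0,
                   1, 1, 0, 0, 2, 3),
                 byrow = TRUE, nrow = 2,
                 dimnames = list(docs = c("document1", "document2"),
                                 features = c("this", "is", "a", "sample",
                                              "another", "example")))

n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)                # documents containing each feature
idf      <- log(n_docs / doc_freq, base = 10)  # inverse document frequency

# Multiply each column by its idf: raw counts (like scheme_tf = "count")
tfidf_count <- sweep(counts, 2, idf, `*`)
# Same, but with within-document proportions (like scheme_tf = "prop")
tfidf_prop  <- sweep(counts / rowSums(counts), 2, idf, `*`)

round(tfidf_prop, 2)
```

Features that occur in every document ("this" and "is" here) receive a weight of zero under either term-frequency scheme, because log(N / N) = 0.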

References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Examples

mydfm <- as.dfm(data_dfm_lbgexample)
head(mydfm[, 5:10])
#> Document-feature matrix of: 6 documents, 6 features (61.1% sparse).
#> 6 x 6 sparse Matrix of class "dfm"
#>     features
#> docs  E  F   G   H   I   J
#>   R1 45 78 115 146 158 146
#>   R2  0  2   3  10  22  45
#>   R3  0  0   0   0   0   0
#>   R4  0  0   0   0   0   0
#>   R5  0  0   0   0   0   0
#>   V1  0  0   0   2   3  10
head(dfm_tfidf(mydfm)[, 5:10])
#> Document-feature matrix of: 6 documents, 6 features (61.1% sparse).
#> 6 x 6 sparse Matrix of class "dfm"
#>     features
#> docs        E          F         G        H        I        J
#>   R1 35.01681 37.2154579 54.868944 43.95038 47.56274 43.95038
#>   R2  0        0.9542425  1.431364  3.01030  6.62266 13.54635
#>   R3  0        0          0         0        0        0
#>   R4  0        0          0         0        0        0
#>   R5  0        0          0         0        0        0
#>   V1  0        0          0         0.60206  0.90309  3.01030
docfreq(mydfm)[5:15]
#> E F G H I J K L M N O
#> 1 2 2 3 3 3 4 4 4 4 4
head(dfm_weight(mydfm)[, 5:10])
#> Document-feature matrix of: 6 documents, 6 features (61.1% sparse).
#> 6 x 6 sparse Matrix of class "dfm"
#>     features
#> docs  E  F   G   H   I   J
#>   R1 45 78 115 146 158 146
#>   R2  0  2   3  10  22  45
#>   R3  0  0   0   0   0   0
#>   R4  0  0   0   0   0   0
#>   R5  0  0   0   0   0   0
#>   V1  0  0   0   2   3  10
# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
wiki_dfm <-
    matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
           byrow = TRUE, nrow = 2,
           dimnames = list(docs = c("document1", "document2"),
                           features = c("this", "is", "a", "sample",
                                        "another", "example"))) %>%
    as.dfm()
wiki_dfm
#> Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>            features
#> docs        this is a sample another example
#>   document1    1  1 2      1       0       0
#>   document2    1  1 0      0       2       3
docfreq(wiki_dfm)
#>    this      is       a  sample another example
#>       2       2       1       1       1       1
dfm_tfidf(wiki_dfm, scheme_tf = "prop") %>% round(digits = 2)
#> Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
#> 2 x 6 sparse Matrix of class "dfm"
#>            features
#> docs        this is    a sample another example
#>   document1    0  0 0.12   0.06    0       0
#>   document2    0  0 0      0       0.09    0.13
# NOT RUN {
# comparison with tm
if (requireNamespace("tm")) {
    convert(wiki_dfm, to = "tm") %>% tm::weightTfIdf() %>% as.matrix()
    # same as:
    dfm_tfidf(wiki_dfm, base = 2, scheme_tf = "prop")
}
# }