Weight the feature frequencies in a dfm
dfm_weight(
  x,
  scheme = c("count", "prop", "propmax", "logcount", "boolean", "augmented", "logave"),
  weights = NULL,
  base = 10,
  k = 0.5,
  smoothing = 0.5,
  force = FALSE
)

dfm_smooth(x, smoothing = 1)
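Argument | Description |
---|---|
x | document-feature matrix created by dfm |
scheme | a label of the weight type: one of "count", "prop", "propmax", "logcount", "boolean", "augmented", or "logave" |
weights | if scheme is unused, a named numeric vector of weights to apply to the dfm; the names correspond to feature labels, and each weight multiplies the existing counts of the named feature. Features that are not named keep a weight of 1 (unchanged) |
base | base for the logarithm when scheme is "logcount" or "logave" |
k | the k for the augmentation when scheme = "augmented" |
smoothing | constant added to the dfm cells for smoothing; default is 1 for dfm_smooth() and 0.5 for dfm_weight() |
force | logical; if TRUE, apply the weighting scheme even if the dfm has already been weighted |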
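
The effect of base is easiest to see on a small matrix. The sketch below uses base R only and assumes the usual definition of "logcount", namely 1 + log_base(count) for non-zero counts, with zero cells left at zero. It illustrates the arithmetic and is not quanteda's implementation; the toy counts and the helper name logcount_sketch are hypothetical.

```r
# Toy document-feature counts (hypothetical data, not from the inaugural corpus)
counts <- matrix(c(  1, 10, 0,
                   100,  0, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"), c("a", "b", "c")))

# Assumed definition of "logcount": 1 + log_base(count) for non-zero cells;
# zero cells stay zero so the matrix keeps its sparsity pattern
logcount_sketch <- function(m, base = 10) {
  out <- m
  nz <- m > 0
  out[nz] <- 1 + log(m[nz], base = base)
  out
}

log10_weights <- logcount_sketch(counts)            # base 10 (the default)
log2_weights  <- logcount_sketch(counts, base = 2)  # a smaller base gives larger weights
log10_weights
log2_weights
```

With base 10, a count of 10 maps to 2 and a count of 100 maps to 3, so large counts are compressed rather than scaled linearly.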
dfm_weight returns the dfm with weighted values. Note that because the default weighting scheme is "count", simply calling this function on an unweighted dfm will return the same object. Many users will want the normalized dfm consisting of the proportions of the feature counts within each document, which requires setting scheme = "prop".
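
To make the "prop" normalization concrete, here is a minimal base R sketch of the arithmetic: each row (document) is divided by its total count. It uses a plain matrix with the same counts as the apple/banana example on this page; it is an illustration, not the package's own code.

```r
# Counts for two short documents (same values as the apple/banana example)
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"),
                                 c("apple", "better", "banana", "much")))

# Divide every row by its total so that each document sums to 1
prop <- counts / rowSums(counts)
prop
rowSums(prop)
```

Because the rows are rescaled independently, documents of different lengths become directly comparable.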

dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount. Note that this effectively converts a matrix from sparse to dense format, so it may exceed available memory, depending on the size of your input matrix.

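
The loss of sparsity can be seen on a toy example. The following base R sketch (a plain matrix and a hypothetical helper named sparsity, not quanteda objects) measures the share of zero cells before and after adding a smoothing constant.

```r
# Toy counts with several zero cells (hypothetical data)
counts <- matrix(c(1, 0, 0, 2,
                   0, 3, 0, 0),
                 nrow = 2, byrow = TRUE)

# Sparsity here means the proportion of cells that are exactly zero
sparsity <- function(m) mean(m == 0)

before <- sparsity(counts)     # 5 of 8 cells are zero
smoothed <- counts + 0.5       # add the smoothing constant to every cell
after <- sparsity(smoothed)    # no zero cells remain

before
after
```

Every cell now holds a non-zero value, which is why a smoothed dfm reports 0.00% sparsity and needs storage for all documents × features cells.
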
Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
dfmat1 <- dfm(data_corpus_inaugural)
#> Warning: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
dfmat2 <- dfm_weight(dfmat1, scheme = "prop")
topfeatures(dfmat2)
#>       the         ,        of       and         .        to        in       our
#> 3.8381701 2.8260570 2.7179413 2.1189165 2.0425725 1.7929599 1.0882747 0.8917136
#>         a        we
#> 0.8790494 0.8082196
dfmat3 <- dfm_weight(dfmat1)
topfeatures(dfmat3)
#>   the    of     ,   and     .    to    in     a   our    we
#> 10183  7180  7173  5406  5155  4591  2827  2292  2224  1827
dfmat4 <- dfm_weight(dfmat1, scheme = "logcount")
topfeatures(dfmat4)
#>      the        ,       of      and        .       to       in        a
#> 185.1899 177.4855 176.2702 170.1605 168.3167 166.0280 153.0303 146.2659
#>      our     that
#> 143.3759 141.3741
dfmat5 <- dfm_weight(dfmat1, scheme = "logave")
topfeatures(dfmat5)
#>       the         ,        of       and         .        to        in         a
#> 124.00625 118.76377 118.00973 113.83079 112.55748 111.14070 102.32000  97.53466
#>       our      that
#>  95.57679  94.48593

# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(dfmat1, scheme_tf = "logcount"))
#> Document-feature matrix of: 6 documents, 9,439 features (93.84% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens of the    senate and    house representatives
#>   1789-Washington       0.4920984  0   0 0.8166095   0 1.128984       0.8127846
#>   1793-Washington               0  0   0         0   0        0               0
#>   1797-Adams            0.7268890  0   0 0.8166095   0        0       0.8127846
#>   1801-Jefferson        0.6402348  0   0         0   0        0               0
#>   1805-Jefferson                0  0   0         0   0        0               0
#>   1809-Madison          0.4920984  0   0         0   0        0               0
#>                  features
#> docs                      :     among vicissitudes
#>   1789-Washington 0.2026503 0.1373836     1.071882
#>   1793-Washington 0.2026503         0            0
#>   1797-Adams              0 0.2200967            0
#>   1801-Jefferson  0.2026503 0.1373836            0
#>   1805-Jefferson          0 0.2534861            0
#>   1809-Madison            0         0            0
#> [ reached max_nfeat ... 9,429 more features ]

# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(dfmat6 <- dfm(str, remove = stopwords("english")))
#> Warning: 'dfm.character()' is deprecated. Use 'tokens()' first.
#> Warning: 'remove' is deprecated; use dfm_remove() instead
#> Document-feature matrix of: 2 documents, 4 features (12.50% sparse) and 0 docvars.
#>        features
#> docs    apple better banana much
#>   text1     1      1      1    0
#>   text2     1      1      2    1
dfm_weight(dfmat6, weights = c(apple = 5, banana = 3, much = 0.5))
#> Document-feature matrix of: 2 documents, 4 features (12.50% sparse) and 0 docvars.
#>        features
#> docs    apple better banana much
#>   text1     5      1      3    0
#>   text2     5      1      6  0.5

# smooth the dfm
dfmat <- dfm(data_corpus_inaugural)
#> Warning: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
dfm_smooth(dfmat, 0.5)
#> Document-feature matrix of: 59 documents, 9,439 features (0.00% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens    of   the senate   and house
#>   1789-Washington             1.5  71.5 116.5    1.5  48.5   2.5
#>   1793-Washington             0.5  11.5  13.5    0.5   2.5   0.5
#>   1797-Adams                  3.5 140.5 163.5    1.5 130.5   0.5
#>   1801-Jefferson              2.5 104.5 130.5    0.5  81.5   0.5
#>   1805-Jefferson              0.5 101.5 143.5    0.5  93.5   0.5
#>   1809-Madison                1.5  69.5 104.5    0.5  43.5   0.5
#>                  features
#> docs              representatives   : among vicissitudes
#>   1789-Washington             2.5 1.5   1.5          1.5
#>   1793-Washington             0.5 1.5   0.5          0.5
#>   1797-Adams                  2.5 0.5   4.5          0.5
#>   1801-Jefferson              0.5 1.5   1.5          0.5
#>   1805-Jefferson              0.5 0.5   7.5          0.5
#>   1809-Madison                0.5 0.5   0.5          0.5
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,429 more features ]
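
As a cross-check on the numeric-weights example, the same result can be reproduced with base R alone. This sketch treats the named weights as per-feature multipliers and gives every unnamed feature a multiplier of 1; the plain matrix and the variable names are illustrative assumptions, not quanteda internals.

```r
# Counts from the apple/banana example, as a plain matrix
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"),
                                 c("apple", "better", "banana", "much")))

weights <- c(apple = 5, banana = 3, much = 0.5)

# Start from a multiplier of 1 for every feature, then fill in the named ones
multipliers <- setNames(rep(1, ncol(counts)), colnames(counts))
multipliers[names(weights)] <- weights

# Multiply each column (feature) by its multiplier
weighted <- sweep(counts, 2, multipliers, `*`)
weighted
```

The feature "better" has no entry in weights, so its counts pass through unchanged, matching the dfm_weight() output above.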