Weight the feature frequencies in a dfm

dfm_weight(
x,
scheme = c("count", "prop", "propmax", "logcount", "boolean", "augmented", "logave"),
weights = NULL,
base = 10,
k = 0.5,
smoothing = 0.5,
force = FALSE
)

dfm_smooth(x, smoothing = 1)

## Arguments

- x: a document-feature matrix created by dfm

- scheme: a label of the weight type:
  - "count": $$tf_{ij}$$, an integer feature count (default when a dfm is created)
  - "prop": the proportion of the feature counts of total feature counts (aka relative frequency), calculated as $$tf_{ij} / \sum_j tf_{ij}$$
  - "propmax": the proportion of the feature counts of the highest feature count in a document, $$tf_{ij} / \textrm{max}_j tf_{ij}$$
  - "logcount": take 1 + the logarithm of each count, for the given base, or 0 if the count was zero: $$1 + \textrm{log}_{base}(tf_{ij})$$ if $$tf_{ij} > 0$$, or 0 otherwise
  - "boolean": recode all non-zero counts as 1
  - "augmented": equivalent to $$k + (1 - k) *$$ dfm_weight(x, "propmax")
  - "logave": (1 + the log of the counts) / (1 + log of the average count within document), or $$\frac{1 + \textrm{log}_{base} tf_{ij}}{1 + \textrm{log}_{base}(\sum_j tf_{ij} / N_i)}$$
  - "logsmooth": log of the counts + smoothing, or $$\textrm{log}(tf_{ij} + s)$$

- weights: if scheme is unused, then weights can be a named numeric vector of weights to be applied to the dfm, where the names of the vector correspond to feature labels of the dfm, and the weights will be applied as multipliers to the existing feature counts for the corresponding named features. Any features not named will be assigned a weight of 1.0 (meaning they will be unchanged).

- base: base for the logarithm when scheme is "logcount" or "logave"

- k: the k for the augmentation when scheme = "augmented"

- smoothing: constant added to the dfm cells for smoothing; the default is 1 for dfm_smooth() and 0.5 for dfm_weight()

- force: logical; if TRUE, apply the weighting scheme even if the dfm has been weighted before. This can result in invalid weights, such as weighting by "prop" after applying "logcount", or after having grouped a dfm using dfm_group().
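The scheme formulas above can be checked by hand. The following base-R sketch (an illustration of the arithmetic only, not quanteda internals) applies several of them to a toy count matrix:

```r
# Toy 2-document x 3-feature count matrix
tf <- matrix(c(2, 0, 4,
               1, 3, 0),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("d1", "d2"), c("a", "b", "c")))

prop    <- tf / rowSums(tf)                 # "prop": tf_ij / sum_j tf_ij
propmax <- tf / apply(tf, 1, max)           # "propmax": tf_ij / max_j tf_ij
logcnt  <- ifelse(tf > 0, 1 + log10(tf), 0) # "logcount" with base = 10
bool    <- (tf > 0) * 1                     # "boolean"
k <- 0.5
augm    <- k + (1 - k) * propmax            # "augmented"

prop["d1", ]  # a = 1/3, b = 0, c = 2/3
```

Note that every row of the "prop" matrix sums to 1, which is the sense in which it is a relative frequency.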

## Value

dfm_weight returns the dfm with weighted values. Note that because the default weighting scheme is "count", simply calling this function on an unweighted dfm will return the same object. Many users will want the normalized dfm consisting of the proportions of the feature counts within each document, which requires setting scheme = "prop".

dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount. Note that this effectively converts the matrix from sparse to dense format, and so may exceed available memory, depending on the size of your input matrix.
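To see why smoothing densifies the matrix, note that the constant is added to every cell, including the zeros. A base-R sketch of the arithmetic (not the quanteda implementation):

```r
# Toy count matrix with several zero cells
tf <- matrix(c(2, 0, 4,
               1, 3, 0),
             nrow = 2, byrow = TRUE)

tf + 0.5  # every former zero becomes 0.5, so no cell is zero any more
```

Since a sparse matrix format only saves space by not storing zeros, a smoothed matrix with no zero cells gains nothing from sparse storage.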

## References

Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

## See also

docfreq()

## Examples

dfmat1 <- dfm(data_corpus_inaugural)
#> Warning: 'dfm.corpus()' is deprecated. Use 'tokens()' first.

dfmat2 <- dfm_weight(dfmat1, scheme = "prop")
topfeatures(dfmat2)
#>       the         ,        of       and         .        to        in       our
#> 3.8381701 2.8260570 2.7179413 2.1189165 2.0425725 1.7929599 1.0882747 0.8917136
#>         a        we
#> 0.8790494 0.8082196
dfmat3 <- dfm_weight(dfmat1)
topfeatures(dfmat3)
#>   the    of     ,   and     .    to    in     a   our    we
#> 10183  7180  7173  5406  5155  4591  2827  2292  2224  1827
dfmat4 <- dfm_weight(dfmat1, scheme = "logcount")
topfeatures(dfmat4)
#>      the        ,       of      and        .       to       in        a
#> 185.1899 177.4855 176.2702 170.1605 168.3167 166.0280 153.0303 146.2659
#>      our     that
#> 143.3759 141.3741
dfmat5 <- dfm_weight(dfmat1, scheme = "logave")
topfeatures(dfmat5)
#>       the         ,        of       and         .        to        in         a
#> 124.00625 118.76377 118.00973 113.83079 112.55748 111.14070 102.32000  97.53466
#>       our      that
#>  95.57679  94.48593

# combine these methods for more complex dfm_weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(dfmat1, scheme_tf = "logcount"))
#> Document-feature matrix of: 6 documents, 9,439 features (93.84% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens of the    senate and    house representatives
#>   1789-Washington       0.4920984  0   0 0.8166095   0 1.128984       0.8127846
#>   1793-Washington       0          0   0 0           0 0              0
#>   1797-Adams            0.7268890  0   0 0.8166095   0 0              0.8127846
#>   1801-Jefferson        0.6402348  0   0 0           0 0              0
#>   1805-Jefferson        0          0   0 0           0 0              0
#>   1809-Madison          0.4920984  0   0 0           0 0              0
#>                  features
#> docs                      :     among vicissitudes
#>   1789-Washington 0.2026503 0.1373836     1.071882
#>   1793-Washington 0.2026503 0             0
#>   1801-Jefferson  0.2026503 0.1373836     0
#>   1805-Jefferson  0         0.2534861     0
#> [ reached max_nfeat ... 9,429 more features ]

# apply numeric weights
str <- c("apple is better than banana", "banana banana apple much better")
(dfmat6 <- dfm(str, remove = stopwords("english")))
#> Warning: 'dfm.character()' is deprecated. Use 'tokens()' first.
#> Warning: 'remove' is deprecated; use dfm_remove() instead
#> Document-feature matrix of: 2 documents, 4 features (12.50% sparse) and 0 docvars.
#>        features
#> docs    apple better banana much
#>   text1     1      1      1    0
#>   text2     1      1      2    1
dfm_weight(dfmat6, weights = c(apple = 5, banana = 3, much = 0.5))
#> Document-feature matrix of: 2 documents, 4 features (12.50% sparse) and 0 docvars.
#>        features
#> docs    apple better banana much
#>   text1     5      1      3  0
#>   text2     5      1      6  0.5

# smooth the dfm
dfmat <- dfm(data_corpus_inaugural)
#> Warning: 'dfm.corpus()' is deprecated. Use 'tokens()' first.
dfm_smooth(dfmat, 0.5)
#> Document-feature matrix of: 59 documents, 9,439 features (0.00% sparse) and 4 docvars.
#>                  features
#> docs              fellow-citizens    of   the senate   and house
#>   1789-Washington             1.5  71.5 116.5    1.5  48.5   2.5
#>   1793-Washington             0.5  11.5  13.5    0.5   2.5   0.5
#>   1797-Adams                  3.5 140.5 163.5    1.5 130.5   0.5
#>   1801-Jefferson              2.5 104.5 130.5    0.5  81.5   0.5
#>   1805-Jefferson              0.5 101.5 143.5    0.5  93.5   0.5
#>   1809-Madison                1.5  69.5 104.5    0.5  43.5   0.5
#>                  features
#> docs              representatives   : among vicissitudes
#>   1789-Washington             2.5 1.5   1.5          1.5
#>   1793-Washington             0.5 1.5   0.5          0.5
#>   1797-Adams                  2.5 0.5   4.5          0.5
#>   1801-Jefferson              0.5 1.5   1.5          0.5
#>   1805-Jefferson              0.5 0.5   7.5          0.5
#>   1809-Madison                0.5 0.5   0.5          0.5
#> [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,429 more features ]