These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.

textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = "euclidean", upper = FALSE, diag = FALSE, p = 2)

textstat_simil(x, selection = NULL, margin = c("documents", "features"),
  method = "correlation", upper = FALSE, diag = FALSE)

Arguments

x

a dfm object

selection

character vector of document names or feature labels from x. A "dist" object is returned if selection is NULL; otherwise, a matrix is returned.

margin

identifies the margin of the dfm on which similarity or distance will be computed: documents for documents or features for word/term features.

method

the similarity or distance measure to be used; see Details

upper

whether the upper triangle of the symmetric \(V \times V\) matrix is recorded

diag

whether the diagonal of the distance matrix should be recorded

p

The power of the Minkowski distance.

Value

textstat_simil and textstat_dist return dist class objects when selection is NULL; otherwise, they return a matrix.
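
For readers unfamiliar with the dist class, the following base R sketch may help. It uses stats::dist on a small made-up count matrix rather than textstat_dist itself, so the object `counts` and its row and column names are illustrative assumptions. It shows that a dist object stores only the pairwise values and that as.matrix expands it to the full symmetric matrix.

```r
# Toy document-by-feature counts (hypothetical data, for illustration only)
counts <- matrix(c(2, 0, 1,
                   0, 3, 1,
                   1, 1, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("a", "b", "c")))

# stats::dist() returns a "dist" object holding n * (n - 1) / 2 values
d <- dist(counts, method = "euclidean")
class(d)     # "dist"
length(d)    # 3 pairwise distances for 3 documents

# as.matrix() expands it to the full symmetric matrix with a zero diagonal
m <- as.matrix(d)
dim(m)       # 3 3
```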

Details

textstat_dist options are: "euclidean" (default), "Chisquared", "Chisquared2", "hamming", "kullback", "manhattan", "maximum", "canberra", and "minkowski". textstat_simil options are: "correlation" (default), "cosine", "jaccard", "eJaccard", "dice", "eDice", "simple matching", "hamann", and "faith".
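
Some of the distance options are related through the power p: the Minkowski distance with p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. The base R sketch below illustrates this on made-up data. The helper `minkowski` and the matrix `x` are hypothetical names introduced here, and stats::dist stands in for textstat_dist.

```r
# Two toy count vectors (hypothetical data)
x <- rbind(doc1 = c(1, 4, 0, 2),
           doc2 = c(3, 1, 1, 2))

# Minkowski distance of order p, written out directly
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

minkowski(x["doc1", ], x["doc2", ], p = 1)   # equals the Manhattan distance
minkowski(x["doc1", ], x["doc2", ], p = 2)   # equals the Euclidean distance

# Cross-check against stats::dist()
dist(x, method = "manhattan")
dist(x, method = "euclidean")
dist(x, method = "minkowski", p = 3)
```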

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "relfreq").
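
To see why document length can matter, consider the following base R sketch on made-up counts. Dividing each row by its total is used here as a stand-in for relative-frequency weighting; it is not the quanteda implementation, and `counts` and `relfreq` are hypothetical names.

```r
# doc2 has the same word proportions as doc1 but is twice as long
# (hypothetical data)
counts <- rbind(doc1 = c(2, 1, 1),
                doc2 = c(4, 2, 2),
                doc3 = c(0, 3, 1))

# Raw counts: doc1 and doc2 look different purely because of length
dist(counts)

# Relative frequencies: each row divided by its row total
relfreq <- counts / rowSums(counts)
dist(relfreq)   # the doc1-doc2 distance is now 0
```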

References

The "Chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001). "Ecologically meaningful transformations for ordination of species data". Oecologia, 129(2), 271–280. doi.org/10.1007/s004420100716 The "Chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., & Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In Computer Vision – ECCV 2010 (Vol. 6312, pp. 749–762). Berlin, Heidelberg: Springer, Berlin, Heidelberg. doi.org/10.1007/978-3-642-15552-9_54. "hamming" is \(\sum{x \neq y)}\).

"kullback" is the Kullback-Leibler distance, which assumes that \(P(x_i) = 0\) implies \(P(y_i)=0\), and in case both \(P(x_i)\) and \(P(y_i)\) equals to zero, then \(P(x_i) * log(p(x_i)/p(y_i))\) is assumed to be zero as the limit value. The formula is: $$\sum{P(x)*log(P(x)/p(y))}$$ All other measures are described in the proxy package.

See also

textstat_dist, as.list.dist, dist

Examples

# create a dfm from inaugural addresses delivered after 1990
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

# distances for documents
(d1 <- textstat_dist(presDfm, margin = "documents"))
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
as.matrix(d1)
#> Error in as.matrix(d1): object 'd1' not found

# distances for specific documents
textstat_dist(presDfm, "2017-Trump", margin = "documents")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
textstat_dist(presDfm, "2005-Bush", margin = "documents", method = "eJaccard")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
(d2 <- textstat_dist(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents"))
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
as.list(d1)
#> Error in as.list(d1): object 'd1' not found

# similarities for documents
(s1 <- textstat_simil(presDfm, method = "cosine", margin = "documents"))
#> Error in getMethod("t", "dgCMatrix"): no generic function found for 't'
as.matrix(s1)
#> Error in as.matrix(s1): object 's1' not found
as.list(s1)
#> Error in as.list(s1): object 's1' not found

# similarities for specific documents
textstat_simil(presDfm, "2017-Trump", margin = "documents")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
textstat_simil(presDfm, "2017-Trump", method = "cosine", margin = "documents")
#> Error in getMethod("t", "dgCMatrix"): no generic function found for 't'
textstat_simil(presDfm, c("2009-Obama" , "2013-Obama"), margin = "documents")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found

# compute some term similarities
s2 <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine", margin = "features")
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
head(as.matrix(s2), 10)
#> Error in as.matrix(s2): object 's2' not found
as.list(s2, n = 8)
#> Error in as.list(s2, n = 8): object 's2' not found