These functions compute matrices of distances and similarities between
documents or features from a dfm and return a dist object (or a matrix if
specific targets are selected). They are fast and robust because they operate
directly on the sparse dfm objects.
textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = "euclidean", upper = FALSE, diag = FALSE, p = 2)

textstat_simil(x, selection = NULL, margin = c("documents", "features"),
  method = "correlation", upper = FALSE, diag = FALSE)
x  a dfm object 

selection  character vector of document names or feature labels from

margin  identifies the margin of the dfm on which similarity or
difference will be computed: "documents" for documents or "features" for features
method  the similarity or distance measure to be used; see Details 
upper  whether the upper triangle of the symmetric \(V \times V\) matrix is recorded 
diag  whether the diagonal of the distance matrix should be recorded 
p  The power of the Minkowski distance. 
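To make the role of p concrete, here is a minimal base R sketch of the Minkowski distance between two count vectors. The helper name minkowski_sketch and the toy vectors are hypothetical, and this is not the package's internal implementation; it only illustrates how p = 1 and p = 2 recover the Manhattan and Euclidean distances.

```r
# Hypothetical helper (not part of the package): Minkowski distance of order p
minkowski_sketch <- function(x, y, p = 2) {
  sum(abs(x - y)^p)^(1 / p)
}

# Toy feature counts for two documents (illustrative values only)
doc_a <- c(2, 0, 1, 3)
doc_b <- c(0, 1, 1, 1)

minkowski_sketch(doc_a, doc_b, p = 1)  # same as the Manhattan distance
minkowski_sketch(doc_a, doc_b, p = 2)  # same as the Euclidean distance

# Cross-check against stats::dist(), which also takes a p argument
as.numeric(dist(rbind(doc_a, doc_b), method = "minkowski", p = 3))
```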
textstat_simil and textstat_dist return dist class objects.
textstat_dist options are: "euclidean" (default), "Chisquared",
"Chisquared2", "hamming", "kullback", "manhattan", "maximum", "canberra",
and "minkowski".
textstat_simil options are: "correlation" (default), "cosine", "jaccard",
"eJaccard", "dice", "eDice", "simple matching", "hamann", and "faith".
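As an illustration of one of these options, the following base R sketch computes the cosine similarity of two count vectors. The helper name cosine_sketch and the toy vectors are hypothetical; the sketch uses the standard textbook definition and is not taken from the package source.

```r
# Hypothetical helper: cosine of the angle between two feature-count vectors
cosine_sketch <- function(x, y) {
  sum(x * y) / sqrt(sum(x^2) * sum(y^2))
}

# Toy feature counts for two documents (illustrative values only)
doc_a <- c(2, 0, 1, 3)
doc_b <- c(0, 1, 1, 1)

cosine_sketch(doc_a, doc_b)
cosine_sketch(doc_a, 3 * doc_a)  # rescaling a document leaves the cosine at 1
```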
If you want to compute similarity on a "normalized" dfm object
(controlling for variable document lengths, for methods such as correlation
for which different document lengths matter), then wrap the input dfm in
dfm_weight(x, "relfreq").
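The effect of relative-frequency weighting can be sketched in base R on a small dense matrix. The matrix below is made-up data standing in for a dfm, and Euclidean distance is used only because it makes the effect of document length easy to see; the sketch does not call the package itself.

```r
# Toy document-feature counts (illustrative values only):
# "long" has the same feature mix as "short" but is three times as long
counts <- rbind(short = c(2, 1, 1),
                long  = c(6, 3, 3))

# Relative frequencies: divide each row by its document length
relfreq <- counts / rowSums(counts)

rowSums(relfreq)   # every document now sums to 1
dist(counts)       # Euclidean distance on raw counts reflects the length gap
dist(relfreq)      # identical proportions give a distance of 0
```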
The "Chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001).
"Ecologically meaningful transformations for ordination of species data".
Oecologia, 129(2), 271–280. doi.org/10.1007/s004420100716
The "Chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., &
Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In
Computer Vision – ECCV 2010 (Vol. 6312, pp. 749–762). Berlin, Heidelberg:
Springer. doi.org/10.1007/978-3-642-15552-9_54.
"hamming" is \(\sum{(x \neq y)}\).
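Read literally, the formula counts the positions at which two vectors differ. A minimal base R sketch follows; hamming_sketch and the toy vectors are hypothetical, and the source does not say whether the package compares raw counts or presence/absence, so raw counts are an assumption here.

```r
# Hypothetical helper: number of positions where two vectors differ
# (assumption: raw counts are compared as-is)
hamming_sketch <- function(x, y) {
  sum(x != y)
}

# Toy feature counts for two documents (illustrative values only)
doc_a <- c(2, 0, 1, 3)
doc_b <- c(0, 1, 1, 1)

hamming_sketch(doc_a, doc_b)  # the vectors differ at positions 1, 2 and 4
```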
"kullback" is the Kullback-Leibler distance, which assumes that
\(P(x_i) = 0\) implies \(P(y_i) = 0\); where both \(P(x_i)\) and
\(P(y_i)\) equal zero, the term \(P(x_i) \log(P(x_i)/P(y_i))\) is
taken to be zero, its limit value. The formula is:
$$\sum{P(x) \log(P(x)/P(y))}$$
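The formula can be sketched in base R as follows. The helper name kullback_sketch and the toy vectors are hypothetical, and converting counts to probabilities by dividing by the document total is an assumption rather than something the source states. Terms with \(P(x_i) = 0\) are skipped, which applies the limit-value convention described above.

```r
# Hypothetical helper: Kullback-Leibler distance between two count vectors
# (assumption: counts are turned into probabilities by dividing by the total)
kullback_sketch <- function(x, y) {
  p <- x / sum(x)   # P(x): relative frequencies in the first document
  q <- y / sum(y)   # P(y): relative frequencies in the second document
  keep <- p > 0     # terms with P(x) = 0 contribute 0 (the limit value)
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Toy feature counts for two documents (illustrative values only)
doc_a <- c(2, 0, 1, 1)
doc_b <- c(1, 1, 1, 1)

kullback_sketch(doc_a, doc_b)
kullback_sketch(doc_a, doc_a)  # a distribution has distance 0 from itself
```

Note that the measure is not symmetric: swapping the two arguments generally gives a different value.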
All other measures are described in the proxy package.
textstat_dist, as.list.dist, dist
# create a dfm from inaugural addresses delivered after 1990
presDfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
               remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

# distances for documents
(d1 <- textstat_dist(presDfm, margin = "documents"))
as.matrix(d1)

# distances for specific documents
textstat_dist(presDfm, "2017-Trump", margin = "documents")
textstat_dist(presDfm, "2005-Bush", margin = "documents", method = "eJaccard")
(d2 <- textstat_dist(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents"))
as.list(d1)

# similarities for documents
(s1 <- textstat_simil(presDfm, method = "cosine", margin = "documents"))
as.matrix(s1)
as.list(s1)

# similarities for specific documents
textstat_simil(presDfm, "2017-Trump", margin = "documents")
textstat_simil(presDfm, "2017-Trump", method = "cosine", margin = "documents")
textstat_simil(presDfm, c("2009-Obama", "2013-Obama"), margin = "documents")

# compute some term similarities
s2 <- textstat_simil(presDfm, c("fair", "health", "terror"), method = "cosine",
                     margin = "features")
head(as.matrix(s2), 10)
as.list(s2, n = 8)