textstat_simil_old.Rd
These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.
textstat_dist_old(x, selection = NULL, margin = c("documents", "features"), method = "euclidean", upper = FALSE, diag = FALSE, p = 2)

textstat_simil_old(x, selection = NULL, margin = c("documents", "features"), method = "correlation", upper = FALSE, diag = FALSE)

Argument | Description |
---|---|
x | a dfm object |
selection | a valid index for document or feature names from |
margin | identifies the margin of the dfm on which similarity or difference will be computed: "documents" or "features" |
method | the similarity or distance measure to be used; see Details |
upper | whether the upper triangle of the symmetric \(V \times V\) matrix is recorded |
diag | whether the diagonal of the distance matrix should be recorded |
p | The power of the Minkowski distance. |
textstat_simil and textstat_dist return dist class objects if selection is NULL; otherwise, a matrix is returned matching distances to the documents or features identified in the selection.
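
The two return shapes can be illustrated with base R alone. This is a minimal sketch, not the package's implementation: it uses stats::dist() on a small made-up count matrix as a stand-in for a dfm, and the names `counts`, `d_all`, and `d_sel` are hypothetical. The subsetting step only mimics what selecting a single target document might produce.

```r
# Toy document-feature counts (hypothetical data standing in for a dfm)
counts <- matrix(c(2, 0, 1,
                   1, 1, 0,
                   0, 3, 1),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("a", "b", "c")))

# All pairwise distances: a compact "dist" object
d_all <- dist(counts, method = "euclidean")
class(d_all)

# Restricting to one target document yields an ordinary matrix:
# a single column holding the distance from every document to doc1
d_sel <- as.matrix(d_all)[, "doc1", drop = FALSE]
class(d_sel)
d_sel
```

A dist object stores each pair once, which is why a full comparison is returned in that form, while a comparison against selected targets is rectangular and needs a matrix.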
textstat_dist options are: "euclidean" (default), "chisquared", "chisquared2", "kullback", "manhattan", "maximum", "canberra", and "minkowski".
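
To show how the p argument relates several of these measures, here is a minimal base R sketch of the Minkowski distance. The helper `minkowski()` and the vectors `x` and `y` are hypothetical and written only for illustration; the sketch assumes the standard definition, under which p = 1 gives the Manhattan distance, p = 2 the Euclidean distance, and the limit of large p the maximum distance.

```r
# Two toy count vectors (hypothetical data)
x <- c(1, 4, 0, 2)
y <- c(3, 1, 2, 2)

# Standard Minkowski distance of order p (illustrative helper)
minkowski <- function(a, b, p) sum(abs(a - b)^p)^(1 / p)

d_manhattan <- minkowski(x, y, 1)   # sum of absolute differences
d_euclidean <- minkowski(x, y, 2)   # square root of summed squares
d_maximum   <- max(abs(x - y))      # limiting case as p grows

# Cross-check against stats::dist(), which also accepts p
d_check <- as.numeric(dist(rbind(x, y), method = "minkowski", p = 3))
c(d_manhattan, d_euclidean, d_maximum, d_check)
```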
textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".
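
As a rough guide to what three of these measures compute, the base R sketch below evaluates textbook correlation, cosine, and Jaccard similarity for two toy count vectors. The vectors and variable names are hypothetical. The Jaccard step assumes counts are first reduced to presence or absence; whether the package binarizes in exactly this way is an assumption.

```r
# Two toy feature-count vectors (hypothetical data)
x <- c(2, 0, 1, 0)
y <- c(1, 1, 0, 0)

# Cosine similarity: dot product divided by the product of the norms
cosine_xy <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Pearson correlation: the cosine of the mean-centred vectors
correlation_xy <- cor(x, y)

# Jaccard similarity on presence/absence (assumed binarization):
# shared features divided by features present in either vector
in_x <- x > 0
in_y <- y > 0
jaccard_xy <- sum(in_x & in_y) / sum(in_x | in_y)

c(correlation = correlation_xy, cosine = cosine_xy, jaccard = jaccard_xy)
```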
If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
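
The effect of proportion weighting can be sketched in base R. The example below is an assumption-laden analogue, not a call into the package: it divides each row of a made-up count matrix by its row total, which is taken here to be what "prop" weighting does, and uses Euclidean distance for the comparison. The matrix `counts` and its row names are hypothetical.

```r
# Toy counts: "long" has the same mix of features as "short", ten times over
counts <- rbind(short = c(1, 2, 0, 1),
                long  = c(10, 20, 0, 10),
                other = c(0, 1, 3, 0))

# Within-document proportions: each row divided by its own total
props <- counts / rowSums(counts)

# On raw counts, document length dominates the distance
d_raw <- as.matrix(dist(counts))

# On proportions, documents with the same feature mix coincide
d_prop <- as.matrix(dist(props))

d_raw["short", "long"]
d_prop["short", "long"]
```

After normalization the distance between "short" and "long" vanishes, because only their relative feature frequencies are compared.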
The "chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001). "Ecologically meaningful transformations for ordination of species data". Oecologia, 129(2), 271-280. doi.org/10.1007/s004420100716
The "chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., & Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In Computer Vision - ECCV 2010 (Vol. 6312, pp. 749-762). Berlin, Heidelberg: Springer. doi.org/10.1007/978-3-642-15552-9_54.
"kullback" is the Kullback-Leibler distance, which assumes that \(P(x_i) = 0\) implies \(P(y_i) = 0\); when both \(P(x_i)\) and \(P(y_i)\) equal zero, the term \(P(x_i) \log(P(x_i)/P(y_i))\) is taken to be zero, its limit value. The formula is:
$$\sum_i P(x_i) \log\frac{P(x_i)}{P(y_i)}$$
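
The zero-handling convention can be made concrete with a short base R sketch. The function `kl_divergence()` and the vectors `p` and `q` are hypothetical, and the natural logarithm is assumed; this is an illustration of the formula, not the package's code.

```r
# Kullback-Leibler sum with the convention that a term with P(x_i) = 0
# contributes zero (illustrative helper; natural log assumed)
kl_divergence <- function(p, q) {
  stopifnot(length(p) == length(q))
  terms <- ifelse(p == 0, 0, p * log(p / q))
  sum(terms)
}

# Two toy probability vectors; the third feature is absent from both
p <- c(0.5, 0.5, 0)
q <- c(0.25, 0.75, 0)

kl_pq <- kl_divergence(p, q)
kl_qp <- kl_divergence(q, p)   # generally differs: the measure is asymmetric
c(kl_pq, kl_qp)
```

Because the two directions give different values, the measure is not a metric in the strict sense, even though it is listed among the distance methods.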
All other measures are described in the proxy package.