Similarity and distance computation between documents or features

These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.

textstat_dist_old(
  x,
  selection = NULL,
  margin = c("documents", "features"),
  method = "euclidean",
  upper = FALSE,
  diag = FALSE,
  p = 2
)

textstat_simil_old(
  x,
  selection = NULL,
  margin = c("documents", "features"),
  method = "correlation",
  upper = FALSE,
  diag = FALSE
)

Arguments

x	a dfm object
selection	a valid index for document or feature names from `x`, to be selected for comparison
margin	identifies the margin of the dfm on which similarity or distance will be computed: `"documents"` for documents or `"features"` for word/term features
method	the similarity or distance measure to be used; see Details
upper	whether the upper triangle of the symmetric $V \times V$ matrix is recorded
diag	whether the diagonal of the distance matrix should be recorded
p	The power of the Minkowski distance.

Value

textstat_simil and textstat_dist return dist class objects if selection is NULL; otherwise, they return a matrix of distances or similarities to the documents or features identified in selection.
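
The two return types can be illustrated with a minimal sketch. It assumes an installed quanteda version that still exports textstat_dist_old; the toy texts and the object names (txt, d_all, d_sel) are made up for illustration.

```r
library(quanteda)

# Three toy documents; the texts are illustrative only.
txt <- c(d1 = "a b c d", d2 = "a b e f", d3 = "c d e f g")
x <- dfm(tokens(txt))

# selection = NULL (the default): all pairwise document distances,
# returned as a dist object according to the Value section.
d_all <- textstat_dist_old(x, margin = "documents", method = "euclidean")
class(d_all)

# selection given: distances from every document to "d1" only,
# returned as a matrix according to the Value section.
d_sel <- textstat_dist_old(x, selection = "d1", margin = "documents")
dim(d_sel)
```

Passing a feature name as selection together with margin = "features" should work the same way on the other margin.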

Details

textstat_dist options are: "euclidean" (default), "chisquared", "chisquared2", "kullback", "manhattan", "maximum", "canberra", and "minkowski".

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".
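
To show what the p argument controls for the "minkowski" method, here is a base R sketch of the textbook Minkowski distance between two count vectors. The helper minkowski() is hypothetical and not part of quanteda, and it is an assumption that the package follows this same definition.

```r
# Hypothetical helper: textbook Minkowski distance of order p.
minkowski <- function(a, b, p = 2) {
  sum(abs(a - b)^p)^(1 / p)
}

# Two toy feature-count vectors.
a <- c(1, 0, 2)
b <- c(0, 2, 2)

m1 <- minkowski(a, b, p = 1)  # p = 1 is the Manhattan distance
m2 <- minkowski(a, b, p = 2)  # p = 2 is the Euclidean distance

# Cross-check against stats::dist(), which uses the same definitions.
ref1 <- as.numeric(dist(rbind(a, b), method = "manhattan"))
ref2 <- as.numeric(dist(rbind(a, b), method = "euclidean"))
all.equal(c(m1, m2), c(ref1, ref2))
```

With the default p = 2, "minkowski" and "euclidean" coincide under this definition.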

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
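
A minimal sketch of this wrapping, assuming a quanteda version that exports textstat_dist_old. The two toy documents have the same word proportions but different lengths; Euclidean distance is used only because the effect of document length is easy to see with it.

```r
library(quanteda)

# Same relative word use, but the second document is three times longer.
txt <- c(short = "tax cut", long = "tax cut tax cut tax cut")
x <- dfm(tokens(txt))

# Convert counts to within-document proportions.
x_prop <- dfm_weight(x, scheme = "prop")
rowSums(x_prop)  # each document's proportions sum to 1

# Raw counts: the distance reflects the difference in length.
textstat_dist_old(x, margin = "documents", method = "euclidean")

# Proportions: the two documents have identical feature vectors.
textstat_dist_old(x_prop, margin = "documents", method = "euclidean")
```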

References

The "chisquared" metric is from Legendre, P., & Gallagher, E. D. (2001). "Ecologically meaningful transformations for ordination of species data". Oecologia, 129(2), 271-280. doi.org/10.1007/s004420100716

The "chisquared2" metric is the "Quadratic-Chi" measure from Pele, O., & Werman, M. (2010). "The Quadratic-Chi Histogram Distance Family". In Computer Vision - ECCV 2010 (Vol. 6312, pp. 749-762). Berlin, Heidelberg: Springer. doi.org/10.1007/978-3-642-15552-9_54

"kullback" is the Kullback-Leibler distance, which assumes that $P(x_i) = 0$ implies $P(y_i) = 0$. When both $P(x_i)$ and $P(y_i)$ equal zero, the term $P(x_i) \log(P(x_i)/P(y_i))$ is taken to be zero, its limit value. The formula is: $$\sum_i P(x_i) \log\frac{P(x_i)}{P(y_i)}$$
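
The formula and its zero convention can be sketched in base R. The helper kl_distance() is hypothetical and not part of quanteda; converting raw counts to proportions before applying the formula is an assumption about how $P(x_i)$ is obtained.

```r
# Hypothetical helper: Kullback-Leibler distance between two count vectors.
kl_distance <- function(x, y) {
  px <- x / sum(x)  # P(x_i), assumed to be the proportion of feature i in x
  py <- y / sum(y)  # P(y_i)
  # Terms with P(x_i) = 0 are set to zero (the limit value), which also
  # covers the case where both probabilities are zero.
  terms <- ifelse(px == 0, 0, px * log(px / py))
  sum(terms)
}

# Toy counts; the last feature is absent from both documents.
x <- c(2, 1, 1, 0)
y <- c(1, 1, 2, 0)

kl_xy <- kl_distance(x, y)
kl_yx <- kl_distance(y, x)
kl_xx <- kl_distance(x, x)  # a document compared with itself gives 0
```

For this particular pair kl_xy and kl_yx happen to be equal, but in general the measure is not symmetric, so the order of the two documents matters.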

All other measures are described in the proxy package.
