Similarity and distance computation between documents or features

These functions compute similarities or distances between documents or features from a dfm() and return them as a matrix in a sparse format. The methods are fast and robust because they operate directly on the sparse dfm objects. The output can easily be coerced to an ordinary matrix, a data.frame of pairwise comparisons, or a dist object.

textstat_simil(
  x,
  y = NULL,
  selection = NULL,
  margin = c("documents", "features"),
  method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice", "hamman",
    "simple matching"),
  min_simil = NULL,
  ...
)

textstat_dist(
  x,
  y = NULL,
  selection = NULL,
  margin = c("documents", "features"),
  method = c("euclidean", "manhattan", "maximum", "canberra", "minkowski"),
  p = 2,
  ...
)

# S3 method for textstat_proxy
as.list(x, sorted = TRUE, n = NULL, diag = FALSE, ...)

# S3 method for textstat_proxy
as.data.frame(
  x,
  row.names = NULL,
  optional = FALSE,
  diag = FALSE,
  upper = FALSE,
  ...
)

Arguments

x, y	dfm objects; `y` is an optional target matrix matching `x` in the margin on which the similarity or distance will be computed.
selection	(deprecated - use `y` instead).
margin	identifies the margin of the dfm on which similarity or distance will be computed: `"documents"` for documents or `"features"` for word/term features.
method	character; the method identifying the similarity or distance measure to be used; see Details.
min_simil	numeric; a threshold below which similarity values will not be returned.
...	unused
p	The power of the Minkowski distance.
sorted	sort results in descending order if `TRUE`
n	the number of highest-ranking items to return. If `n` is `NULL`, all items are returned.
diag	logical; if `FALSE`, exclude the item's comparison with itself
row.names	`NULL` or a character vector giving the row names for the data frame. Missing values are not allowed.
optional	logical. If `TRUE`, setting row names and converting column names (to syntactic names: see `make.names`) is optional. Note that all of R's base package `as.data.frame()` methods use `optional` only for column names treatment, basically with the meaning of `data.frame(*, check.names = !optional)`. See also the `make.names` argument of the `matrix` method.
upper	logical; if `TRUE`, return pairs as both (A, B) and (B, A)

Value

A sparse matrix from the Matrix package that will be symmetric unless y is specified.

The result can be converted with as.list(), which returns one list element for each unique second item of the pairs; with as.dist(), which returns a dist object; or with as.matrix(), which returns an ordinary matrix.

as.list for a textstat_simil or textstat_dist object returns a list equal in length to the number of columns of the simil or dist object, with the rows and their values as named elements. By default, this list excludes same-item pairs (diag = FALSE) and sorts the values in descending order (sorted = TRUE).

as.data.frame for a textstat_simil or textstat_dist object returns a data.frame of pairwise combinations and their similarity or distance values.
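
The coercions described above can be sketched as follows. This is a minimal illustration, not a canonical example: the three toy documents and the object names (`dfmat`, `sim`, `pairs`) are made up, and it assumes that in your installed version the textstat_* functions are provided by the quanteda.textstats package (in older releases they are part of quanteda itself).

```r
library(quanteda)
# Assumption: textstat_simil()/textstat_dist() live in quanteda.textstats
# in your installed version; drop this line if quanteda itself exports them.
library(quanteda.textstats)

# Hypothetical toy documents, chosen so the overlap is easy to see
txt <- c(d1 = "apple banana cherry date",
         d2 = "apple banana cherry fig",
         d3 = "kiwi lemon mango")
dfmat <- dfm(tokens(txt))

# Document-by-document cosine similarity, returned in sparse form
sim <- textstat_simil(dfmat, margin = "documents", method = "cosine")

# Ordinary dense matrix
sim_mat <- as.matrix(sim)

# One row per pair; diag = FALSE and upper = FALSE are the defaults
pairs <- as.data.frame(sim)

# For each document, the single most similar other document
top1 <- as.list(sim, sorted = TRUE, n = 1)

# Distances can be handed to functions that expect a dist object
dist_obj <- as.dist(textstat_dist(dfmat, method = "euclidean"))
```

Here d1 and d2 share three of their four features, so their cosine similarity is 3 / (2 * 2) = 0.75, while d3 shares no features with either.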

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", and "hamman".

textstat_dist options are: "euclidean" (default), "manhattan", "maximum", "canberra", and "minkowski".
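
To make two of these measures concrete, the following base R sketch applies the standard textbook definitions of cosine similarity and Minkowski distance to two count vectors. It is an illustration of the formulas only, not the package's implementation; the helper names `cosine_sim` and `minkowski_dist` and the toy counts are hypothetical. It also shows how `p` selects among related distances: `p = 1` gives the Manhattan distance and `p = 2` the Euclidean distance.

```r
# Hypothetical feature counts for two documents over five features
x <- rbind(d1 = c(1, 1, 1, 1, 0),
           d2 = c(1, 1, 1, 0, 1))

# Cosine similarity: dot product divided by the product of vector lengths
cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Minkowski distance of order p: (sum |a_i - b_i|^p)^(1/p)
minkowski_dist <- function(a, b, p = 2) sum(abs(a - b)^p)^(1 / p)

cos_12 <- cosine_sim(x["d1", ], x["d2", ])            # 3 / (2 * 2) = 0.75
euc_12 <- minkowski_dist(x["d1", ], x["d2", ], p = 2) # sqrt(2)
man_12 <- minkowski_dist(x["d1", ], x["d2", ], p = 1) # 2

# Cross-check against base R's dist(), which uses the same definition
ref_12 <- as.numeric(dist(x, method = "minkowski", p = 2))
```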

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
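
A minimal sketch of this normalization step follows. The toy documents and object names are hypothetical, and the sketch again assumes that the textstat_* functions come from quanteda.textstats in your installed version. Euclidean distance is used here only because it makes the effect of document length easy to see: the two documents below have identical relative frequencies but differ fourfold in length.

```r
library(quanteda)
# Assumption: textstat_dist() is provided by quanteda.textstats
library(quanteda.textstats)

# Hypothetical documents: "long" repeats "short" four times
txt <- c(short = "apple banana cherry",
         long  = "apple banana cherry apple banana cherry apple banana cherry apple banana cherry",
         other = "apple apple apple banana kiwi lemon")
dfmat <- dfm(tokens(txt))

# Convert counts to within-document proportions (each row sums to 1)
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")

# Distances on raw counts reflect the difference in document length
dist_raw <- as.matrix(textstat_dist(dfmat, method = "euclidean"))

# Distances on proportions compare relative feature usage only
dist_prop <- as.matrix(textstat_dist(dfmat_prop, method = "euclidean"))
```

On the raw counts, `short` and `long` are sqrt(27) apart; after weighting by proportion, both have the same profile (one third per feature), so their distance is zero.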

Examples