These functions compute matrices of distances and similarities between documents or features from a dfm and return them in a sparse format. They are fast and robust because they operate directly on the sparse dfm objects. The output can easily be coerced to an ordinary matrix, a data.frame of pairwise comparisons, or a dist object.

textstat_simil(x, y = NULL, selection = NULL, margin = c("documents",
  "features"), method = c("correlation", "cosine", "jaccard", "ejaccard",
  "dice", "edice", "hamman", "simple matching"), min_simil = NULL, ...)

textstat_dist(x, y = NULL, selection = NULL, margin = c("documents",
  "features"), method = c("euclidean", "manhattan", "maximum",
  "canberra", "minkowski"), p = 2, ...)

# S3 method for textstat_proxy
as.list(x, sorted = TRUE, n = NULL,
  diag = FALSE, ...)

# S3 method for textstat_proxy
as.data.frame(x, row.names = NULL,
  optional = FALSE, diag = FALSE, upper = FALSE, ...)

Arguments

x, y

dfm objects; y is an optional target matrix matching x in the margin on which the similarity or distance will be computed.

selection

(deprecated - use y instead).

margin

identifies the margin of the dfm on which similarity or distance will be computed: "documents" for documents or "features" for word/term features.

method

character; the method identifying the similarity or distance measure to be used; see Details.

min_simil

numeric; a threshold below which similarity values will not be returned.

...

unused

p

The power of the Minkowski distance.

sorted

sort results in descending order if TRUE

n

the top n highest-ranking items will be returned. If n is NULL, return all items.

diag

logical; if FALSE, exclude the item's comparison with itself

row.names

NULL or a character vector giving the row names for the data frame. Missing values are not allowed.

optional

logical. If TRUE, setting row names and converting column names (to syntactic names: see make.names) is optional. Note that all of R's base package as.data.frame() methods use optional only for column names treatment, basically with the meaning of data.frame(*, check.names = !optional). See also the make.names argument of the matrix method.

upper

logical; if TRUE, return pairs as both (A, B) and (B, A)

Value

A sparse matrix from the Matrix package that will be symmetric unless y is specified.

The result can be converted easily into a list using as.list(), which returns one list element for each unique item in the second position of the pairs; into a dist object using as.dist(); or into an ordinary matrix using as.matrix().

as.list for a textstat_simil or textstat_dist object returns a list equal in length to the number of columns of the simil or dist object, with the rows and their values as named elements. By default, this list excludes self-pairs (when diag = FALSE) and sorts the values in descending order (when sorted = TRUE).

as.data.frame for a textstat_simil or textstat_dist object returns a data.frame of pairwise combinations and their similarity or distance values.
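
As a hedged illustration of the pairwise format, the sketch below builds a tiny dfm from three made-up sentences and converts the similarity object with as.data.frame(). The texts and the object names (txt, dfmat_toy, tstat_toy, pairs_once, pairs_both) are hypothetical. Loading quanteda.textstats when it is installed is an assumption made to cover later releases, in which the textstat_*() functions are distributed in that separate package.

```r
library(quanteda)
# Assumption: on this page textstat_simil() is exported by quanteda itself.
# Later releases ship it in quanteda.textstats, so load that when available.
if (requireNamespace("quanteda.textstats", quietly = TRUE)) {
  library(quanteda.textstats)
}

# Three made-up documents; every pair shares at least one word
txt <- c(d1 = "apples and pears",
         d2 = "apples and plums",
         d3 = "pears and plums and figs")
dfmat_toy <- dfm(tokens(txt))

tstat_toy <- textstat_simil(dfmat_toy, method = "cosine", margin = "documents")

# Defaults (diag = FALSE, upper = FALSE): one row per pair, no self-pairs
pairs_once <- as.data.frame(tstat_toy)

# upper = TRUE also returns each pair in the reverse order, (B, A)
pairs_both <- as.data.frame(tstat_toy, upper = TRUE)
```

With upper = TRUE the data.frame has twice as many rows, which can be convenient when every item needs to appear in the first column.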

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", and "hamman".
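
To make two of these measures concrete, the following base R sketch computes cosine and binary Jaccard similarity between the rows of a small made-up count matrix. It assumes the standard textbook definitions and dense arithmetic. It is not the sparse implementation used by textstat_simil(), and the matrix m and all other object names are hypothetical.

```r
# Made-up document-feature counts: rows are documents, columns are features
m <- matrix(c(2, 1, 0,
              4, 2, 0,
              0, 1, 3),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c")))

# Cosine: dot product of two rows divided by the product of their lengths
norms <- sqrt(rowSums(m^2))
cos_sim <- tcrossprod(m) / outer(norms, norms)

# Binary Jaccard: features present in both / features present in either
b <- (m > 0) * 1
shared <- tcrossprod(b)
present <- rowSums(b)
jac_sim <- shared / (outer(present, present, "+") - shared)

round(cos_sim, 3)
round(jac_sim, 3)
```

Because d2 is exactly twice d1, their cosine similarity is 1: cosine depends on the direction of the count vectors and not on their magnitude.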

textstat_dist options are: "euclidean" (default), "manhattan", "maximum", "canberra", and "minkowski".
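
These options are related: the Minkowski distance with power p reduces to the Manhattan distance at p = 1 and to the Euclidean distance at p = 2, and the maximum distance is its limit as p grows. The base R sketch below checks this on two made-up vectors. The minkowski() helper is hypothetical and only illustrates the formula; it is not the textstat_dist() implementation.

```r
x <- c(3, 0, 1)
y <- c(1, 2, 1)

# Hypothetical helper: Minkowski distance of order p between two vectors
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

d_manhattan <- minkowski(x, y, p = 1)  # sum of absolute differences
d_euclidean <- minkowski(x, y, p = 2)  # square root of summed squares
d_maximum   <- max(abs(x - y))         # largest absolute difference

# Cross-check against stats::dist(), which offers the same measures
xy <- rbind(x, y)
stopifnot(
  isTRUE(all.equal(d_manhattan, as.numeric(dist(xy, method = "manhattan")))),
  isTRUE(all.equal(d_euclidean, as.numeric(dist(xy, method = "euclidean")))),
  isTRUE(all.equal(d_maximum, as.numeric(dist(xy, method = "maximum")))),
  isTRUE(all.equal(minkowski(x, y, p = 3),
                   as.numeric(dist(xy, method = "minkowski", p = 3))))
)
```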

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
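
The effect of this normalization can be sketched in base R without quanteda. The example assumes that proportion weighting divides each document's counts by that document's total, and it uses Euclidean distance and two made-up documents purely for illustration. All object names are hypothetical.

```r
# Two made-up documents with identical word proportions but different lengths
m <- rbind(short = c(a = 2, b = 1, c = 1),
           long  = c(a = 20, b = 10, c = 10))

# On raw counts the longer document looks far away from the shorter one
d_raw <- as.numeric(dist(m))

# Row-wise proportions: each count divided by its document's total
m_prop <- m / rowSums(m)
d_prop <- as.numeric(dist(m_prop))

c(raw = d_raw, proportions = d_prop)
```

After weighting, the distance between the two documents is zero, because they differ only in length and not in their relative feature frequencies.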

Examples

# similarities for documents
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000),
             remove_punct = TRUE, remove = stopwords("english"))
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
#> textstat_simil object; method = "cosine"
#>            2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 2001-Bush      1.000     0.520      0.541      0.556      0.452
#> 2005-Bush      0.520     1.000      0.458      0.516      0.435
#> 2009-Obama     0.541     0.458      1.000      0.637      0.448
#> 2013-Obama     0.556     0.516      0.637      1.000      0.455
#> 2017-Trump     0.452     0.435      0.448      0.455      1.000
as.matrix(tstat1)
#>            2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 2001-Bush  1.0000000 0.5204355  0.5411649  0.5561972  0.4518935
#> 2005-Bush  0.5204355 1.0000000  0.4575297  0.5163644  0.4349030
#> 2009-Obama 0.5411649 0.4575297  1.0000000  0.6373318  0.4481950
#> 2013-Obama 0.5561972 0.5163644  0.6373318  1.0000000  0.4546945
#> 2017-Trump 0.4518935 0.4349030  0.4481950  0.4546945  1.0000000
as.list(tstat1)
#> $`2001-Bush`
#> 2013-Obama 2009-Obama  2005-Bush 2017-Trump
#>  0.5561972  0.5411649  0.5204355  0.4518935
#>
#> $`2005-Bush`
#>  2001-Bush 2013-Obama 2009-Obama 2017-Trump
#>  0.5204355  0.5163644  0.4575297  0.4349030
#>
#> $`2009-Obama`
#> 2013-Obama  2001-Bush  2005-Bush 2017-Trump
#>  0.6373318  0.5411649  0.4575297  0.4481950
#>
#> $`2013-Obama`
#> 2009-Obama  2001-Bush  2005-Bush 2017-Trump
#>  0.6373318  0.5561972  0.5163644  0.4546945
#>
#> $`2017-Trump`
#> 2013-Obama  2001-Bush 2009-Obama  2005-Bush
#>  0.4546945  0.4518935  0.4481950  0.4349030
#>
as.list(tstat1, diag = TRUE)
#> $`2001-Bush`
#>  2001-Bush 2013-Obama 2009-Obama  2005-Bush 2017-Trump
#>  1.0000000  0.5561972  0.5411649  0.5204355  0.4518935
#>
#> $`2005-Bush`
#>  2005-Bush  2001-Bush 2013-Obama 2009-Obama 2017-Trump
#>  1.0000000  0.5204355  0.5163644  0.4575297  0.4349030
#>
#> $`2009-Obama`
#> 2009-Obama 2013-Obama  2001-Bush  2005-Bush 2017-Trump
#>  1.0000000  0.6373318  0.5411649  0.4575297  0.4481950
#>
#> $`2013-Obama`
#> 2013-Obama 2009-Obama  2001-Bush  2005-Bush 2017-Trump
#>  1.0000000  0.6373318  0.5561972  0.5163644  0.4546945
#>
#> $`2017-Trump`
#> 2017-Trump 2013-Obama  2001-Bush 2009-Obama  2005-Bush
#>  1.0000000  0.4546945  0.4518935  0.4481950  0.4349030
#>
# min_simil
(tstat2 <- textstat_simil(dfmat, method = "cosine", margin = "documents",
                          min_simil = 0.6))
#> textstat_simil object; method = "cosine"
#>            2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 2001-Bush          1         .          .          .          .
#> 2005-Bush          .         1          .          .          .
#> 2009-Obama         .         .      1.000      0.637          .
#> 2013-Obama         .         .      0.637      1.000          .
#> 2017-Trump         .         .          .          .          1
as.matrix(tstat2)
#>            2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 2001-Bush          1        NA         NA         NA         NA
#> 2005-Bush         NA         1         NA         NA         NA
#> 2009-Obama        NA        NA  1.0000000  0.6373318         NA
#> 2013-Obama        NA        NA  0.6373318  1.0000000         NA
#> 2017-Trump        NA        NA         NA         NA          1
# similarities for specific documents
textstat_simil(dfmat, dfmat["2017-Trump", ], margin = "documents")
#> textstat_simil object; method = "correlation"
#>            2017-Trump
#> 2001-Bush       0.364
#> 2005-Bush       0.344
#> 2009-Obama      0.343
#> 2013-Obama      0.361
#> 2017-Trump      1.000
textstat_simil(dfmat, dfmat["2017-Trump", ], method = "cosine", margin = "documents")
#> textstat_simil object; method = "cosine"
#>            2017-Trump
#> 2001-Bush       0.452
#> 2005-Bush       0.435
#> 2009-Obama      0.448
#> 2013-Obama      0.455
#> 2017-Trump      1.000
textstat_simil(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents")
#> textstat_simil object; method = "correlation"
#>            2009-Obama 2013-Obama
#> 2001-Bush       0.439      0.468
#> 2005-Bush       0.337      0.420
#> 2009-Obama      1.000      0.550
#> 2013-Obama      0.550      1.000
#> 2017-Trump      0.343      0.361
# compute some term similarities
tstat3 <- textstat_simil(dfmat, dfmat[, c("fair", "health", "terror")],
                         method = "cosine", margin = "features")
head(as.matrix(tstat3), 10)
#>                    fair    health    terror
#> president     0.4670994 0.5606119 0.1348400
#> clinton       0.4714045 0.4629100 0.0000000
#> distinguished 0.6666667 0.6546537 0.0000000
#> guests        0.6666667 0.6546537 0.0000000
#> fellow        0.6299408 0.7423075 0.2182179
#> citizens      0.7084919 0.6667367 0.0766965
#> peaceful      0.5773503 0.5669467 0.0000000
#> transfer      0.4082483 0.2672612 0.0000000
#> authority     0.8164966 0.5345225 0.0000000
#> rare          0.5773503 0.3779645 0.0000000
as.list(tstat3, n = 6)
#> $fair
#>    continue      chance      raging differences        turn     dangers
#>           1           1           1           1           1           1
#>
#> $health
#>         can generations        upon        work     without     greater
#>   0.9971765   0.9799579   0.9759001   0.9590244   0.9561829   0.9538210
#>
#> $terror
#>    bestowed  sacrifices   ancestors  generosity cooperation  forty-four
#>           1           1           1           1           1           1
#>
# distances for documents
(tstat4 <- textstat_dist(dfmat, margin = "documents"))
#> textstat_dist object; method = "euclidean"
#>            2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 2001-Bush          0      52.8       49.9       48.3       47.6
#> 2005-Bush       52.8         0       60.8       56.9       57.4
#> 2009-Obama      49.9      60.8          0       48.0       54.9
#> 2013-Obama      48.3      56.9       48.0          0       53.7
#> 2017-Trump      47.6      57.4       54.9       53.7          0
as.matrix(tstat4)
#>            2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 2001-Bush    0.00000  52.84884   49.94997   48.31149   47.61302
#> 2005-Bush   52.84884   0.00000   60.84406   56.85948   57.41080
#> 2009-Obama  49.94997  60.84406    0.00000   47.98958   54.91812
#> 2013-Obama  48.31149  56.85948   47.98958    0.00000   53.73081
#> 2017-Trump  47.61302  57.41080   54.91812   53.73081    0.00000
as.list(tstat4)
#> $`2001-Bush`
#>  2005-Bush 2009-Obama 2013-Obama 2017-Trump
#>   52.84884   49.94997   48.31149   47.61302
#>
#> $`2005-Bush`
#> 2009-Obama 2017-Trump 2013-Obama  2001-Bush
#>   60.84406   57.41080   56.85948   52.84884
#>
#> $`2009-Obama`
#>  2005-Bush 2017-Trump  2001-Bush 2013-Obama
#>   60.84406   54.91812   49.94997   47.98958
#>
#> $`2013-Obama`
#>  2005-Bush 2017-Trump  2001-Bush 2009-Obama
#>   56.85948   53.73081   48.31149   47.98958
#>
#> $`2017-Trump`
#>  2005-Bush 2009-Obama 2013-Obama  2001-Bush
#>   57.41080   54.91812   53.73081   47.61302
#>
as.dist(tstat4)
#>            2001-Bush 2005-Bush 2009-Obama 2013-Obama
#> 2005-Bush   52.84884
#> 2009-Obama  49.94997  60.84406
#> 2013-Obama  48.31149  56.85948   47.98958
#> 2017-Trump  47.61302  57.41080   54.91812   53.73081
# distances for specific documents
textstat_dist(dfmat, dfmat["2017-Trump", ], margin = "documents")
#> textstat_dist object; method = "euclidean"
#>            2017-Trump
#> 2001-Bush        47.6
#> 2005-Bush        57.4
#> 2009-Obama       54.9
#> 2013-Obama       53.7
#> 2017-Trump          0
(tstat5 <- textstat_dist(dfmat, dfmat[c("2009-Obama", "2013-Obama"), ], margin = "documents"))
#> textstat_dist object; method = "euclidean"
#>            2009-Obama 2013-Obama
#> 2001-Bush        49.9       48.3
#> 2005-Bush        60.8       56.9
#> 2009-Obama          0       48.0
#> 2013-Obama       48.0          0
#> 2017-Trump       54.9       53.7
as.matrix(tstat5)
#>            2009-Obama 2013-Obama
#> 2001-Bush    49.94997   48.31149
#> 2005-Bush    60.84406   56.85948
#> 2009-Obama    0.00000   47.98958
#> 2013-Obama   47.98958    0.00000
#> 2017-Trump   54.91812   53.73081
as.list(tstat5)
#> $`2009-Obama`
#>  2005-Bush 2017-Trump  2001-Bush 2013-Obama
#>   60.84406   54.91812   49.94997   47.98958
#>
#> $`2013-Obama`
#>  2005-Bush 2017-Trump  2001-Bush 2009-Obama
#>   56.85948   53.73081   48.31149   47.98958
#>
# NOT RUN {
# plot a dendrogram after converting the object into distances
plot(hclust(as.dist(tstat4)))
# }