These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.

textstat_simil(x, selection = NULL, margin = c("documents",
  "features"), method = c("correlation", "cosine", "jaccard", "ejaccard",
  "dice", "edice", "hamman", "simple matching", "faith"), upper = FALSE,
  diag = FALSE)

textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = c("euclidean", "kullback", "manhattan", "maximum", "canberra",
  "minkowski"), upper = FALSE, diag = FALSE, p = 2)

Arguments

x

a dfm object

selection

a valid index for document or feature names (depending on margin) from x, to be selected for comparison

margin

identifies the margin of the dfm on which similarity or distance will be computed: "documents" for documents or "features" for word/term features.

method

the similarity or distance measure to be used; see Details.

upper

whether the upper triangle of the symmetric \(V \times V\) matrix is recorded. Only used when a dist object is returned.

diag

whether the diagonal of the distance matrix should be recorded. Only used when a dist object is returned.

p

The power of the Minkowski distance.
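
As a rough illustration of what upper and diag control, the sketch below uses base R's stats::dist on a made-up count matrix. It assumes the dist objects returned here behave like ordinary dist objects, where these flags are stored as attributes and affect printing only; the matrix m is invented for the example.

```r
# Made-up counts: 3 "documents" by 4 "features".
m <- matrix(c(2, 0, 1, 3,
              1, 1, 0, 2,
              0, 4, 2, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), NULL))

d_lower <- dist(m)                             # default: lower triangle only
d_full  <- dist(m, diag = TRUE, upper = TRUE)  # flags stored as attributes

# The stored distances are identical; only the printed layout differs.
same_values <- all(as.vector(d_lower) == as.vector(d_full))

print(d_lower)
print(d_full)
```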

Value

By default, textstat_simil and textstat_dist return dist class objects if selection is NULL; otherwise, they return a matrix of similarities or distances to the documents or features identified in selection.

These can be transformed into a list format using as.list.dist, if that format is preferred.
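
To give a feel for the list format, the base R sketch below turns a dist object into a named list with one sorted vector per item. It is not the package's as.list.dist implementation; the helper name dist_to_list and the toy matrix are assumptions made for this example.

```r
# Made-up counts: 3 "documents" by 4 "features".
m <- matrix(c(2, 0, 1, 3,
              1, 1, 0, 2,
              0, 4, 2, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "e")))

d <- dist(m, method = "euclidean")  # stats::dist, lower triangle only

# Hypothetical helper: for each item, the values for all other items,
# sorted in decreasing order.
dist_to_list <- function(d, decreasing = TRUE) {
  full <- as.matrix(d)  # full symmetric matrix with a zero diagonal
  out <- lapply(rownames(full), function(nm) {
    others <- full[nm, colnames(full) != nm]  # drop the item itself
    sort(others, decreasing = decreasing)
  })
  names(out) <- rownames(full)
  out
}

dl <- dist_to_list(d)
print(dl)
```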

Details

textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".
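
For intuition about two of these measures, here is a minimal base R sketch of cosine similarity and Pearson correlation on a small dense matrix. It uses the textbook definitions and is not the package's sparse implementation; the helper cosine_sim and the data are made up for the example.

```r
# Made-up counts: rows are "documents", columns are "features".
m <- matrix(c(2, 0, 1, 3,
              1, 1, 0, 2,
              0, 4, 2, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("d1", "d2", "d3"), c("a", "b", "c", "e")))

# Cosine similarity between rows: (x . y) / (|x| |y|).
cosine_sim <- function(m) {
  norms <- sqrt(rowSums(m^2))
  tcrossprod(m) / outer(norms, norms)
}

cos_docs  <- cosine_sim(m)     # analogue of margin = "documents"
cos_feats <- cosine_sim(t(m))  # analogue of margin = "features"
cor_docs  <- cor(t(m))         # Pearson correlation between documents

print(round(cos_docs, 4))
```

Transposing the matrix is all that changes between the document and feature margins in this sketch.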

textstat_dist options are: "euclidean" (default), "kullback", "manhattan", "maximum", "canberra", and "minkowski".
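
The role of p can be seen in a short base R sketch of the Minkowski distance, using the standard definition rather than the package's code. The helper minkowski and the vectors x and y are made up for the example; stats::dist is used only as a cross-check.

```r
# Minkowski distance of order p between two numeric vectors.
minkowski <- function(x, y, p = 2) {
  sum(abs(x - y)^p)^(1 / p)
}

x <- c(2, 0, 1, 3)
y <- c(0, 4, 2, 0)

manhattan <- minkowski(x, y, p = 1)  # p = 1: sum of absolute differences
euclidean <- minkowski(x, y, p = 2)  # p = 2: Euclidean distance
maximum   <- max(abs(x - y))         # limit as p grows large

# Cross-check one value against base R.
ref <- as.numeric(dist(rbind(x, y), method = "minkowski", p = 3))
print(c(manhattan = manhattan, euclidean = euclidean,
        minkowski3 = minkowski(x, y, p = 3), maximum = maximum))
```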

Note

If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
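
The effect of document length can be sketched in base R. The example below uses Euclidean distance as the measure and plain row proportions as a stand-in for dfm_weight(x, "prop"); both choices, and the toy data, are assumptions made for illustration.

```r
# Two made-up "documents" with the same word mix; "long" is 3x "short".
m <- rbind(short = c(1, 2, 0, 1),
           long  = c(3, 6, 0, 3))

raw_dist <- as.numeric(dist(m))       # driven entirely by length

props <- m / rowSums(m)               # each row now sums to 1
prop_dist <- as.numeric(dist(props))  # identical profiles after weighting

print(c(raw = raw_dist, prop = prop_dist))
```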

References

"kullback" is the Kullback-Leibler distance, which assumes that \(P(x_i) = 0\) implies \(P(y_i) = 0\). If either \(P(x_i)\) or \(P(y_i)\) equals zero, then \(P(x_i) \log(P(x_i)/P(y_i))\) is taken to be zero as the limit value. The formula is: $$\sum_i P(x_i) \log\frac{P(x_i)}{P(y_i)}$$

All other measures are described in the proxy package.
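
The Kullback-Leibler formula and its zero-handling convention can be sketched in base R as follows. This is an illustration only, assuming counts are first converted to probabilities; the helper kl_distance is hypothetical and is not the package's implementation.

```r
# Kullback-Leibler distance between two count vectors.
# Terms where either probability is zero contribute 0, following the
# convention described above.
kl_distance <- function(x, y) {
  px <- x / sum(x)
  py <- y / sum(y)
  keep <- px > 0 & py > 0
  sum(px[keep] * log(px[keep] / py[keep]))
}

x <- c(1, 1)
y <- c(1, 3)

kl_xy <- kl_distance(x, y)  # 0.5 * log(4 / 3)
kl_yx <- kl_distance(y, x)  # differs from kl_xy: the measure is asymmetric
print(c(kl_xy = kl_xy, kl_yx = kl_yx))
```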

Examples

# similarities for documents
mt <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980),
          remove_punct = TRUE, remove = stopwords("english"))
(s1 <- textstat_simil(mt, method = "cosine", margin = "documents"))
#>              1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
#> 1985-Reagan    0.6885376
#> 1989-Bush      0.5342227   0.5569825
#> 1993-Clinton   0.5588243   0.5929382 0.5463195
#> 1997-Clinton   0.5620486   0.6409087 0.5723938    0.6258047
#> 2001-Bush      0.4972389   0.4951335 0.4982357    0.4775110    0.5442529
#> 2005-Bush      0.4577205   0.5113232 0.4556697    0.4396107    0.4355170
#> 2009-Obama     0.5768486   0.5892557 0.5721535    0.5925820    0.6297973
#> 2013-Obama     0.6054181   0.6331729 0.5290330    0.5997993    0.6121809
#> 2017-Trump     0.4234273   0.4734261 0.4530407    0.4982465    0.4947457
#>              2001-Bush 2005-Bush 2009-Obama 2013-Obama
#> 1985-Reagan
#> 1989-Bush
#> 1993-Clinton
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush    0.5204355
#> 2009-Obama   0.5411649 0.4575297
#> 2013-Obama   0.5561972 0.5163644  0.6373318
#> 2017-Trump   0.4518935 0.4349030  0.4481950  0.4546945
as.matrix(s1)
#>              1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
#> 1981-Reagan    1.0000000   0.6885376 0.5342227    0.5588243    0.5620486
#> 1985-Reagan    0.6885376   1.0000000 0.5569825    0.5929382    0.6409087
#> 1989-Bush      0.5342227   0.5569825 1.0000000    0.5463195    0.5723938
#> 1993-Clinton   0.5588243   0.5929382 0.5463195    1.0000000    0.6258047
#> 1997-Clinton   0.5620486   0.6409087 0.5723938    0.6258047    1.0000000
#> 2001-Bush      0.4972389   0.4951335 0.4982357    0.4775110    0.5442529
#> 2005-Bush      0.4577205   0.5113232 0.4556697    0.4396107    0.4355170
#> 2009-Obama     0.5768486   0.5892557 0.5721535    0.5925820    0.6297973
#> 2013-Obama     0.6054181   0.6331729 0.5290330    0.5997993    0.6121809
#> 2017-Trump     0.4234273   0.4734261 0.4530407    0.4982465    0.4947457
#>              2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 1981-Reagan  0.4972389 0.4577205  0.5768486  0.6054181  0.4234273
#> 1985-Reagan  0.4951335 0.5113232  0.5892557  0.6331729  0.4734261
#> 1989-Bush    0.4982357 0.4556697  0.5721535  0.5290330  0.4530407
#> 1993-Clinton 0.4775110 0.4396107  0.5925820  0.5997993  0.4982465
#> 1997-Clinton 0.5442529 0.4355170  0.6297973  0.6121809  0.4947457
#> 2001-Bush    1.0000000 0.5204355  0.5411649  0.5561972  0.4518935
#> 2005-Bush    0.5204355 1.0000000  0.4575297  0.5163644  0.4349030
#> 2009-Obama   0.5411649 0.4575297  1.0000000  0.6373318  0.4481950
#> 2013-Obama   0.5561972 0.5163644  0.6373318  1.0000000  0.4546945
#> 2017-Trump   0.4518935 0.4349030  0.4481950  0.4546945  1.0000000
as.list(s1)
#> $`1981-Reagan`
#>  1985-Reagan   2013-Obama   2009-Obama 1997-Clinton 1993-Clinton    1989-Bush
#>    0.6885376    0.6054181    0.5768486    0.5620486    0.5588243    0.5342227
#>    2001-Bush    2005-Bush   2017-Trump
#>    0.4972389    0.4577205    0.4234273
#>
#> $`1985-Reagan`
#>  1981-Reagan 1997-Clinton   2013-Obama 1993-Clinton   2009-Obama    1989-Bush
#>    0.6885376    0.6409087    0.6331729    0.5929382    0.5892557    0.5569825
#>    2005-Bush    2001-Bush   2017-Trump
#>    0.5113232    0.4951335    0.4734261
#>
#> $`1989-Bush`
#> 1997-Clinton   2009-Obama  1985-Reagan 1993-Clinton  1981-Reagan   2013-Obama
#>    0.5723938    0.5721535    0.5569825    0.5463195    0.5342227    0.5290330
#>    2001-Bush    2005-Bush   2017-Trump
#>    0.4982357    0.4556697    0.4530407
#>
#> $`1993-Clinton`
#> 1997-Clinton   2013-Obama  1985-Reagan   2009-Obama  1981-Reagan    1989-Bush
#>    0.6258047    0.5997993    0.5929382    0.5925820    0.5588243    0.5463195
#>   2017-Trump    2001-Bush    2005-Bush
#>    0.4982465    0.4775110    0.4396107
#>
#> $`1997-Clinton`
#>  1985-Reagan   2009-Obama 1993-Clinton   2013-Obama    1989-Bush  1981-Reagan
#>    0.6409087    0.6297973    0.6258047    0.6121809    0.5723938    0.5620486
#>    2001-Bush   2017-Trump    2005-Bush
#>    0.5442529    0.4947457    0.4355170
#>
#> $`2001-Bush`
#>   2013-Obama 1997-Clinton   2009-Obama    2005-Bush    1989-Bush  1981-Reagan
#>    0.5561972    0.5442529    0.5411649    0.5204355    0.4982357    0.4972389
#>  1985-Reagan 1993-Clinton   2017-Trump
#>    0.4951335    0.4775110    0.4518935
#>
#> $`2005-Bush`
#>    2001-Bush   2013-Obama  1985-Reagan  1981-Reagan   2009-Obama    1989-Bush
#>    0.5204355    0.5163644    0.5113232    0.4577205    0.4575297    0.4556697
#> 1993-Clinton 1997-Clinton   2017-Trump
#>    0.4396107    0.4355170    0.4349030
#>
#> $`2009-Obama`
#>   2013-Obama 1997-Clinton 1993-Clinton  1985-Reagan  1981-Reagan    1989-Bush
#>    0.6373318    0.6297973    0.5925820    0.5892557    0.5768486    0.5721535
#>    2001-Bush    2005-Bush   2017-Trump
#>    0.5411649    0.4575297    0.4481950
#>
#> $`2013-Obama`
#>   2009-Obama  1985-Reagan 1997-Clinton  1981-Reagan 1993-Clinton    2001-Bush
#>    0.6373318    0.6331729    0.6121809    0.6054181    0.5997993    0.5561972
#>    1989-Bush    2005-Bush   2017-Trump
#>    0.5290330    0.5163644    0.4546945
#>
#> $`2017-Trump`
#> 1993-Clinton 1997-Clinton  1985-Reagan   2013-Obama    1989-Bush    2001-Bush
#>    0.4982465    0.4947457    0.4734261    0.4546945    0.4530407    0.4518935
#>   2009-Obama    2005-Bush  1981-Reagan
#>    0.4481950    0.4349030    0.4234273
#>
# similarities for specific documents
textstat_simil(mt, selection = "2017-Trump", margin = "documents")
#>              2017-Trump
#> 1981-Reagan   0.3635906
#> 1985-Reagan   0.4208903
#> 1989-Bush     0.3983633
#> 1993-Clinton  0.4579742
#> 1997-Clinton  0.4531154
#> 2001-Bush     0.3999252
#> 2005-Bush     0.3814126
#> 2009-Obama    0.3870892
#> 2013-Obama    0.3996661
#> 2017-Trump    1.0000000
textstat_simil(mt, selection = "2017-Trump", method = "cosine", margin = "documents")
#>              2017-Trump
#> 1981-Reagan   0.4234273
#> 1985-Reagan   0.4734261
#> 1989-Bush     0.4530407
#> 1993-Clinton  0.4982465
#> 1997-Clinton  0.4947457
#> 2001-Bush     0.4518935
#> 2005-Bush     0.4349030
#> 2009-Obama    0.4481950
#> 2013-Obama    0.4546945
#> 2017-Trump    1.0000000
textstat_simil(mt, selection = c("2009-Obama", "2013-Obama"), margin = "documents")
#>              2009-Obama 2013-Obama
#> 1981-Reagan   0.5159057  0.5545574
#> 1985-Reagan   0.5329123  0.5878448
#> 1989-Bush     0.5133582  0.4707774
#> 1993-Clinton  0.5515630  0.5621398
#> 1997-Clinton  0.5915817  0.5744939
#> 2001-Bush     0.4823183  0.5046514
#> 2005-Bush     0.3877744  0.4603052
#> 2009-Obama    1.0000000  0.5869754
#> 2013-Obama    0.5869754  1.0000000
#> 2017-Trump    0.3870892  0.3996661
# compute some term similarities
s2 <- textstat_simil(mt, selection = c("fair", "health", "terror"),
                     method = "cosine", margin = "features")
head(as.matrix(s2), 10)
#>                fair     health    terror
#> senator   0.7385489 0.00000000 0.1666667
#> hatfield  0.6030227 0.00000000 0.4082483
#> mr        0.3078596 0.09724333 0.1786474
#> chief     0.6154575 0.27216553 0.1666667
#> justice   0.3594254 0.52981294 0.1622214
#> president 0.5817745 0.40929374 0.2864459
#> vice      0.8040303 0.33333333 0.2721655
#> bush      0.6154575 0.54433105 0.3333333
#> mondale   0.6030227 0.00000000 0.4082483
#> baker     0.6030227 0.00000000 0.4082483
as.list(s2, n = 8)
#> $fair
#>      size  economic       tax beginning  national   economy  republic    months
#> 0.9045340 0.8922269 0.8869686 0.8864053 0.8775269 0.8775269 0.8703883 0.8703883
#>
#> $health
#>       wrong      reform      common   knowledge      planet generations
#>   0.8944272   0.8944272   0.8888889   0.8888889   0.8819171   0.8728716
#>      ideals        true
#>   0.8540168   0.8432740
#>
#> $terror
#>        full     sustain       solve        land commonplace      denied
#>   0.9428090   0.9128709   0.9128709   0.8876254   0.8660254   0.8660254
#>   guarantee     problem
#>   0.8660254   0.8660254
#>
# create a dfm from inaugural addresses from Clinton onwards
mt <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
          remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
# distances for documents
(d1 <- textstat_dist(mt, margin = "documents"))
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1997-Clinton     58.90671
#> 2001-Bush        52.82045     63.63961
#> 2005-Bush        62.79331     73.38256  54.32311
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush
#> 2009-Obama
#> 2013-Obama
#> 2017-Trump     55.21775
as.matrix(d1)
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1993-Clinton      0.00000     58.90671  52.82045  62.79331   51.66237
#> 1997-Clinton     58.90671      0.00000  63.63961  73.38256   59.95832
#> 2001-Bush        52.82045     63.63961   0.00000  54.32311   50.70503
#> 2005-Bush        62.79331     73.38256  54.32311   0.00000   62.33779
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779    0.00000
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama 2017-Trump
#> 1993-Clinton   51.30302   52.14403
#> 1997-Clinton   60.81118   65.85590
#> 2001-Bush      49.03060   48.79549
#> 2005-Bush      57.90509   58.00000
#> 2009-Obama     48.48711   55.65968
#> 2013-Obama      0.00000   55.21775
#> 2017-Trump     55.21775    0.00000
# distances for specific documents
textstat_dist(mt, "2017-Trump", margin = "documents")
#>              2017-Trump
#> 1993-Clinton   52.14403
#> 1997-Clinton   65.85590
#> 2001-Bush      48.79549
#> 2005-Bush      58.00000
#> 2009-Obama     55.65968
#> 2013-Obama     55.21775
#> 2017-Trump      0.00000
(d2 <- textstat_dist(mt, c("2009-Obama", "2013-Obama"), margin = "documents"))
#>              2009-Obama 2013-Obama
#> 1993-Clinton   51.66237   51.30302
#> 1997-Clinton   59.95832   60.81118
#> 2001-Bush      50.70503   49.03060
#> 2005-Bush      62.33779   57.90509
#> 2009-Obama      0.00000   48.48711
#> 2013-Obama     48.48711    0.00000
#> 2017-Trump     55.65968   55.21775
as.list(d1)
#> $`1993-Clinton`
#>    2005-Bush 1997-Clinton    2001-Bush   2017-Trump   2009-Obama   2013-Obama
#>     62.79331     58.90671     52.82045     52.14403     51.66237     51.30302
#>
#> $`1997-Clinton`
#>    2005-Bush   2017-Trump    2001-Bush   2013-Obama   2009-Obama 1993-Clinton
#>     73.38256     65.85590     63.63961     60.81118     59.95832     58.90671
#>
#> $`2001-Bush`
#> 1997-Clinton    2005-Bush 1993-Clinton   2009-Obama   2013-Obama   2017-Trump
#>     63.63961     54.32311     52.82045     50.70503     49.03060     48.79549
#>
#> $`2005-Bush`
#> 1997-Clinton 1993-Clinton   2009-Obama   2017-Trump   2013-Obama    2001-Bush
#>     73.38256     62.79331     62.33779     58.00000     57.90509     54.32311
#>
#> $`2009-Obama`
#>    2005-Bush 1997-Clinton   2017-Trump 1993-Clinton    2001-Bush   2013-Obama
#>     62.33779     59.95832     55.65968     51.66237     50.70503     48.48711
#>
#> $`2013-Obama`
#> 1997-Clinton    2005-Bush   2017-Trump 1993-Clinton    2001-Bush   2009-Obama
#>     60.81118     57.90509     55.21775     51.30302     49.03060     48.48711
#>
#> $`2017-Trump`
#> 1997-Clinton    2005-Bush   2009-Obama   2013-Obama 1993-Clinton    2001-Bush
#>     65.85590     58.00000     55.65968     55.21775     52.14403     48.79549
#>