textstat_simil.Rd
These functions compute matrices of distances and similarities between documents or features from a dfm and return a dist object (or a matrix if specific targets are selected). They are fast and robust because they operate directly on the sparse dfm objects.
textstat_simil(x, selection = NULL, margin = c("documents", "features"),
  method = c("correlation", "cosine", "jaccard", "ejaccard", "dice", "edice",
    "hamman", "simple matching", "faith"),
  upper = FALSE, diag = FALSE)

textstat_dist(x, selection = NULL, margin = c("documents", "features"),
  method = c("euclidean", "kullback", "manhattan", "maximum", "canberra",
    "minkowski"),
  upper = FALSE, diag = FALSE, p = 2)
x | a dfm object |
---|---|
selection | a valid index for document or feature names (depending on |
margin | identifies the margin of the dfm on which similarity or difference will be computed: |
method | the similarity or distance measure to be used; see Details. |
upper | whether the upper triangle of the symmetric \(V \times V\) matrix is recorded. Only used when |
diag | whether the diagonal of the distance matrix should be recorded. Only used when |
p | the power of the Minkowski distance. |
By default, textstat_simil and textstat_dist return dist class objects if selection is NULL; otherwise, a matrix is returned matching distances to the documents or features identified in the selection.
These can be transformed into a list format using as.list.dist, if that format is preferred.
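The difference between a dist object, a full matrix, and a list format can be illustrated with base R alone. The sketch below uses stats::dist on a small hypothetical count matrix (the object `m` and its document names are invented for illustration). It is not quanteda's implementation, and the list built at the end is only a stand-in for the kind of output as.list.dist produces.

```r
# Hypothetical 3 x 4 count matrix standing in for a very small dfm
m <- matrix(c(2, 0, 1, 3,
              1, 1, 0, 2,
              0, 4, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("doc1", "doc2", "doc3"),
                            c("a", "b", "c", "d")))

# Base R "dist" object: stores only the lower triangle
d <- dist(m, method = "euclidean")
class(d)

# Full symmetric matrix with a zero diagonal
full <- as.matrix(d)
full

# One sorted vector of distances per document, excluding the document itself
# (illustrative list format only; not quanteda's as.list.dist)
per_doc <- lapply(rownames(full), function(nm) {
  sort(full[nm, setdiff(colnames(full), nm)], decreasing = TRUE)
})
names(per_doc) <- rownames(full)
per_doc
```

The dist form is more compact because a symmetric matrix with a fixed diagonal carries no extra information in its upper triangle.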
textstat_simil options are: "correlation" (default), "cosine", "jaccard", "ejaccard", "dice", "edice", "simple matching", "hamman", and "faith".
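To make a few of these measures concrete, here is a minimal base R sketch of the textbook definitions of cosine similarity, Jaccard similarity on presence/absence, and Pearson correlation for two hypothetical count vectors `x` and `y`. These are assumptions about the standard formulas; the package's own implementations may differ in detail (for example in how the Jaccard variants treat counts).

```r
# Two hypothetical documents represented as feature counts
x <- c(2, 0, 1, 3)
y <- c(1, 1, 0, 2)

# Cosine similarity: dot product divided by the product of vector norms
cosine <- sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Jaccard similarity on presence/absence:
# features present in both, divided by features present in either
jaccard <- sum(x > 0 & y > 0) / sum(x > 0 | y > 0)

# Pearson correlation, using base R
correlation <- cor(x, y)

c(cosine = cosine, jaccard = jaccard, correlation = correlation)
```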
textstat_dist options are: "euclidean" (default), "kullback", "manhattan", "maximum", "canberra", and "minkowski".
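The role of the power p in the Minkowski distance can be sketched in base R. The helper `minkowski()` below is a hypothetical function written for this illustration (it is not part of the package); it shows that p = 1 gives the Manhattan distance and p = 2 the Euclidean distance, and checks one value against stats::dist.

```r
# Hypothetical helper: Minkowski distance of order p between two vectors
minkowski <- function(x, y, p = 2) {
  sum(abs(x - y)^p)^(1 / p)
}

x <- c(2, 0, 1, 3)
y <- c(1, 1, 0, 2)

d_manhattan <- minkowski(x, y, p = 1)  # same as the Manhattan distance
d_euclidean <- minkowski(x, y, p = 2)  # same as the Euclidean distance
d_p3        <- minkowski(x, y, p = 3)

# Cross-check against base R's dist() with method = "minkowski"
d_base_p3 <- as.numeric(dist(rbind(x, y), method = "minkowski", p = 3))

c(manhattan = d_manhattan, euclidean = d_euclidean, p3 = d_p3, base_p3 = d_base_p3)
```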
If you want to compute similarity on a "normalized" dfm object (controlling for variable document lengths, for methods such as correlation for which different document lengths matter), then wrap the input dfm in dfm_weight(x, "prop").
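The effect of normalizing by document length can be sketched without the package. The example below uses a hypothetical count matrix in which the second document is the first one repeated twice, and converts counts to within-document proportions by hand. This is only a stand-in for the idea behind proportion weighting, not a call to dfm_weight itself.

```r
# Hypothetical counts: "long" has the same word profile as "short", doubled
counts <- rbind(short = c(2, 0, 1, 3),
                long  = c(4, 0, 2, 6))

# Convert each row to proportions of that document's total count
prop <- counts / rowSums(counts)

d_raw  <- as.numeric(dist(counts))  # Euclidean distance on raw counts
d_prop <- as.numeric(dist(prop))    # Euclidean distance on proportions

c(raw = d_raw, prop = d_prop)
```

On raw counts the two documents look distant purely because one is longer; on proportions they are identical.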
"kullback"
is the Kullback-Leibler distance, which assumes that \(P(x_i) = 0\) implies \(P(y_i) = 0\), and in case either \(P(x_i)\) or \(P(y_i)\) equals zero, then \(P(x_i) \log(P(x_i)/P(y_i))\) is assumed to be zero as the limit value. The formula is:
$$\sum_i P(x_i) \log\frac{P(x_i)}{P(y_i)}$$
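The formula and its zero-handling convention can be written out as a short base R sketch. The function `kl_distance()` is a hypothetical helper for illustration; it assumes the inputs are count vectors that are first converted to proportions, and it is not guaranteed to reproduce the package's numerical output.

```r
# Hypothetical helper: Kullback-Leibler distance between two count vectors,
# with any term where either probability is zero treated as zero
kl_distance <- function(x, y) {
  p <- x / sum(x)  # P(x_i)
  q <- y / sum(y)  # P(y_i)
  terms <- ifelse(p == 0 | q == 0, 0, p * log(p / q))
  sum(terms)
}

kl_same  <- kl_distance(c(2, 0, 1, 3), c(2, 0, 1, 3))  # identical distributions
kl_diff  <- kl_distance(c(1, 1), c(1, 3))
kl_zero  <- kl_distance(c(1, 0), c(1, 1))              # zero term is dropped

c(same = kl_same, diff = kl_diff, zero = kl_zero)
```

Note that the measure is not symmetric: swapping the two arguments generally gives a different value.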
All other measures are described in the proxy package.
textstat_dist, as.matrix.simil, as.list.dist, dist, as.dist
# similarities for documents
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980),
             remove_punct = TRUE, remove = stopwords("english"))
(tstat1 <- textstat_simil(dfmat, method = "cosine", margin = "documents"))
#>              1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
#> 1985-Reagan    0.6885376
#> 1989-Bush      0.5342227   0.5569825
#> 1993-Clinton   0.5588243   0.5929382 0.5463195
#> 1997-Clinton   0.5620486   0.6409087 0.5723938    0.6258047
#> 2001-Bush      0.4972389   0.4951335 0.4982357    0.4775110    0.5442529
#> 2005-Bush      0.4577205   0.5113232 0.4556697    0.4396107    0.4355170
#> 2009-Obama     0.5768486   0.5892557 0.5721535    0.5925820    0.6297973
#> 2013-Obama     0.6054181   0.6331729 0.5290330    0.5997993    0.6121809
#> 2017-Trump     0.4234273   0.4734261 0.4530407    0.4982465    0.4947457
#>              2001-Bush 2005-Bush 2009-Obama 2013-Obama
#> 1985-Reagan
#> 1989-Bush
#> 1993-Clinton
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush    0.5204355
#> 2009-Obama   0.5411649 0.4575297
#> 2013-Obama   0.5561972 0.5163644  0.6373318
#> 2017-Trump   0.4518935 0.4349030  0.4481950  0.4546945
as.matrix(tstat1)
#>              1981-Reagan 1985-Reagan 1989-Bush 1993-Clinton 1997-Clinton
#> 1981-Reagan    1.0000000   0.6885376 0.5342227    0.5588243    0.5620486
#> 1985-Reagan    0.6885376   1.0000000 0.5569825    0.5929382    0.6409087
#> 1989-Bush      0.5342227   0.5569825 1.0000000    0.5463195    0.5723938
#> 1993-Clinton   0.5588243   0.5929382 0.5463195    1.0000000    0.6258047
#> 1997-Clinton   0.5620486   0.6409087 0.5723938    0.6258047    1.0000000
#> 2001-Bush      0.4972389   0.4951335 0.4982357    0.4775110    0.5442529
#> 2005-Bush      0.4577205   0.5113232 0.4556697    0.4396107    0.4355170
#> 2009-Obama     0.5768486   0.5892557 0.5721535    0.5925820    0.6297973
#> 2013-Obama     0.6054181   0.6331729 0.5290330    0.5997993    0.6121809
#> 2017-Trump     0.4234273   0.4734261 0.4530407    0.4982465    0.4947457
#>              2001-Bush 2005-Bush 2009-Obama 2013-Obama 2017-Trump
#> 1981-Reagan  0.4972389 0.4577205  0.5768486  0.6054181  0.4234273
#> 1985-Reagan  0.4951335 0.5113232  0.5892557  0.6331729  0.4734261
#> 1989-Bush    0.4982357 0.4556697  0.5721535  0.5290330  0.4530407
#> 1993-Clinton 0.4775110 0.4396107  0.5925820  0.5997993  0.4982465
#> 1997-Clinton 0.5442529 0.4355170  0.6297973  0.6121809  0.4947457
#> 2001-Bush    1.0000000 0.5204355  0.5411649  0.5561972  0.4518935
#> 2005-Bush    0.5204355 1.0000000  0.4575297  0.5163644  0.4349030
#> 2009-Obama   0.5411649 0.4575297  1.0000000  0.6373318  0.4481950
#> 2013-Obama   0.5561972 0.5163644  0.6373318  1.0000000  0.4546945
#> 2017-Trump   0.4518935 0.4349030  0.4481950  0.4546945  1.0000000
as.list(tstat1)
#> $`1981-Reagan`
#>  1985-Reagan   2013-Obama   2009-Obama 1997-Clinton 1993-Clinton    1989-Bush
#>    0.6885376    0.6054181    0.5768486    0.5620486    0.5588243    0.5342227
#>    2001-Bush    2005-Bush   2017-Trump
#>    0.4972389    0.4577205    0.4234273
#>
#> $`1985-Reagan`
#>  1981-Reagan 1997-Clinton   2013-Obama 1993-Clinton   2009-Obama    1989-Bush
#>    0.6885376    0.6409087    0.6331729    0.5929382    0.5892557    0.5569825
#>    2005-Bush    2001-Bush   2017-Trump
#>    0.5113232    0.4951335    0.4734261
#>
#> $`1989-Bush`
#> 1997-Clinton   2009-Obama  1985-Reagan 1993-Clinton  1981-Reagan   2013-Obama
#>    0.5723938    0.5721535    0.5569825    0.5463195    0.5342227    0.5290330
#>    2001-Bush    2005-Bush   2017-Trump
#>    0.4982357    0.4556697    0.4530407
#>
#> $`1993-Clinton`
#> 1997-Clinton   2013-Obama  1985-Reagan   2009-Obama  1981-Reagan    1989-Bush
#>    0.6258047    0.5997993    0.5929382    0.5925820    0.5588243    0.5463195
#>   2017-Trump    2001-Bush    2005-Bush
#>    0.4982465    0.4775110    0.4396107
#>
#> $`1997-Clinton`
#>  1985-Reagan   2009-Obama 1993-Clinton   2013-Obama    1989-Bush  1981-Reagan
#>    0.6409087    0.6297973    0.6258047    0.6121809    0.5723938    0.5620486
#>    2001-Bush   2017-Trump    2005-Bush
#>    0.5442529    0.4947457    0.4355170
#>
#> $`2001-Bush`
#>   2013-Obama 1997-Clinton   2009-Obama    2005-Bush    1989-Bush  1981-Reagan
#>    0.5561972    0.5442529    0.5411649    0.5204355    0.4982357    0.4972389
#>  1985-Reagan 1993-Clinton   2017-Trump
#>    0.4951335    0.4775110    0.4518935
#>
#> $`2005-Bush`
#>    2001-Bush   2013-Obama  1985-Reagan  1981-Reagan   2009-Obama    1989-Bush
#>    0.5204355    0.5163644    0.5113232    0.4577205    0.4575297    0.4556697
#> 1993-Clinton 1997-Clinton   2017-Trump
#>    0.4396107    0.4355170    0.4349030
#>
#> $`2009-Obama`
#>   2013-Obama 1997-Clinton 1993-Clinton  1985-Reagan  1981-Reagan    1989-Bush
#>    0.6373318    0.6297973    0.5925820    0.5892557    0.5768486    0.5721535
#>    2001-Bush    2005-Bush   2017-Trump
#>    0.5411649    0.4575297    0.4481950
#>
#> $`2013-Obama`
#>   2009-Obama  1985-Reagan 1997-Clinton  1981-Reagan 1993-Clinton    2001-Bush
#>    0.6373318    0.6331729    0.6121809    0.6054181    0.5997993    0.5561972
#>    1989-Bush    2005-Bush   2017-Trump
#>    0.5290330    0.5163644    0.4546945
#>
#> $`2017-Trump`
#> 1993-Clinton 1997-Clinton  1985-Reagan   2013-Obama    1989-Bush    2001-Bush
#>    0.4982465    0.4947457    0.4734261    0.4546945    0.4530407    0.4518935
#>   2009-Obama    2005-Bush  1981-Reagan
#>    0.4481950    0.4349030    0.4234273
#>

# similarities for specific documents
textstat_simil(dfmat, selection = "2017-Trump", margin = "documents")
#>              2017-Trump
#> 1981-Reagan   0.3635906
#> 1985-Reagan   0.4208903
#> 1989-Bush     0.3983633
#> 1993-Clinton  0.4579742
#> 1997-Clinton  0.4531154
#> 2001-Bush     0.3999252
#> 2005-Bush     0.3814126
#> 2009-Obama    0.3870892
#> 2013-Obama    0.3996661
#> 2017-Trump    1.0000000
textstat_simil(dfmat, selection = "2017-Trump", method = "cosine", margin = "documents")
#>              2017-Trump
#> 1981-Reagan   0.4234273
#> 1985-Reagan   0.4734261
#> 1989-Bush     0.4530407
#> 1993-Clinton  0.4982465
#> 1997-Clinton  0.4947457
#> 2001-Bush     0.4518935
#> 2005-Bush     0.4349030
#> 2009-Obama    0.4481950
#> 2013-Obama    0.4546945
#> 2017-Trump    1.0000000
#>              2009-Obama 2013-Obama
#> 1981-Reagan   0.5159057  0.5545574
#> 1985-Reagan   0.5329123  0.5878448
#> 1989-Bush     0.5133582  0.4707774
#> 1993-Clinton  0.5515630  0.5621398
#> 1997-Clinton  0.5915817  0.5744939
#> 2001-Bush     0.4823183  0.5046514
#> 2005-Bush     0.3877744  0.4603052
#> 2009-Obama    1.0000000  0.5869754
#> 2013-Obama    0.5869754  1.0000000
#> 2017-Trump    0.3870892  0.3996661

# compute some term similarities
tstat2 <- textstat_simil(dfmat, selection = c("fair", "health", "terror"),
                         method = "cosine", margin = "features")
head(as.matrix(tstat2), 10)
#>                fair     health    terror
#> senator   0.7385489 0.00000000 0.1666667
#> hatfield  0.6030227 0.00000000 0.4082483
#> mr        0.3078596 0.09724333 0.1786474
#> chief     0.6154575 0.27216553 0.1666667
#> justice   0.3594254 0.52981294 0.1622214
#> president 0.5817745 0.40929374 0.2864459
#> vice      0.8040303 0.33333333 0.2721655
#> bush      0.6154575 0.54433105 0.3333333
#> mondale   0.6030227 0.00000000 0.4082483
#> baker     0.6030227 0.00000000 0.4082483
#> $fair
#>      size  economic       tax beginning  national   economy  republic    months
#> 0.9045340 0.8922269 0.8869686 0.8864053 0.8775269 0.8775269 0.8703883 0.8703883
#>
#> $health
#>       wrong      reform      common   knowledge      planet generations
#>   0.8944272   0.8944272   0.8888889   0.8888889   0.8819171   0.8728716
#>      ideals        true
#>   0.8540168   0.8432740
#>
#> $terror
#>        full     sustain       solve        land commonplace      denied
#>   0.9428090   0.9128709   0.9128709   0.8876254   0.8660254   0.8660254
#>   guarantee     problem
#>   0.8660254   0.8660254
#>

# create a dfm from inaugural addresses after 1990 (Clinton onwards)
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 1990),
             remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)

# distances for documents
(tstat1 <- textstat_dist(dfmat, margin = "documents"))
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1997-Clinton     58.90671
#> 2001-Bush        52.82045     63.63961
#> 2005-Bush        62.79331     73.38256  54.32311
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama
#> 1997-Clinton
#> 2001-Bush
#> 2005-Bush
#> 2009-Obama
#> 2013-Obama
#> 2017-Trump     55.21775
as.matrix(tstat1)
#>              1993-Clinton 1997-Clinton 2001-Bush 2005-Bush 2009-Obama
#> 1993-Clinton      0.00000     58.90671  52.82045  62.79331   51.66237
#> 1997-Clinton     58.90671      0.00000  63.63961  73.38256   59.95832
#> 2001-Bush        52.82045     63.63961   0.00000  54.32311   50.70503
#> 2005-Bush        62.79331     73.38256  54.32311   0.00000   62.33779
#> 2009-Obama       51.66237     59.95832  50.70503  62.33779    0.00000
#> 2013-Obama       51.30302     60.81118  49.03060  57.90509   48.48711
#> 2017-Trump       52.14403     65.85590  48.79549  58.00000   55.65968
#>              2013-Obama 2017-Trump
#> 1993-Clinton   51.30302   52.14403
#> 1997-Clinton   60.81118   65.85590
#> 2001-Bush      49.03060   48.79549
#> 2005-Bush      57.90509   58.00000
#> 2009-Obama     48.48711   55.65968
#> 2013-Obama      0.00000   55.21775
#> 2017-Trump     55.21775    0.00000

# distances for specific documents
textstat_dist(dfmat, "2017-Trump", margin = "documents")
#>              2017-Trump
#> 1993-Clinton   52.14403
#> 1997-Clinton   65.85590
#> 2001-Bush      48.79549
#> 2005-Bush      58.00000
#> 2009-Obama     55.65968
#> 2013-Obama     55.21775
#> 2017-Trump      0.00000
#>              2009-Obama 2013-Obama
#> 1993-Clinton   51.66237   51.30302
#> 1997-Clinton   59.95832   60.81118
#> 2001-Bush      50.70503   49.03060
#> 2005-Bush      62.33779   57.90509
#> 2009-Obama      0.00000   48.48711
#> 2013-Obama     48.48711    0.00000
#> 2017-Trump     55.65968   55.21775
as.list(tstat2)
#> $`2009-Obama`
#>    2005-Bush 1997-Clinton   2017-Trump 1993-Clinton    2001-Bush   2013-Obama
#>     62.33779     59.95832     55.65968     51.66237     50.70503     48.48711
#>
#> $`2013-Obama`
#> 1997-Clinton    2005-Bush   2017-Trump 1993-Clinton    2001-Bush   2009-Obama
#>     60.81118     57.90509     55.21775     51.30302     49.03060     48.48711
#>