Source: `R/textstat_collocations.R` (`textstat_collocations.Rd`)
Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
```r
textstat_collocations(
  x,
  method = "lambda",
  size = 2,
  min_count = 2,
  smoothing = 0.5,
  tolower = TRUE,
  ...
)

is.collocations(x)
```
| Argument | Description |
|---|---|
| `x` | a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, they should have been removed with `tokens_remove(x, pattern, padding = TRUE)`. |
| `method` | association measure for detecting collocations; currently limited to `"lambda"`. |
| `size` | integer; the length of the collocations to be scored. |
| `min_count` | numeric; minimum frequency of collocations that will be scored. |
| `smoothing` | numeric; a smoothing parameter added to the observed counts (default is 0.5). |
| `tolower` | logical; if `TRUE`, form collocations as lower-cased combinations. |
| `...` | additional arguments passed to `tokens()`. |
`textstat_collocations` returns a data.frame of collocations and their scores and statistics. This consists of the collocations, their counts, length, and \(\lambda\) and \(z\) statistics. When `size` is a vector, then `count_nested` counts the lower-order collocations that occur within a higher-order collocation (but this does not affect the statistics).

`is.collocations` returns `TRUE` if the object is of class `collocations`, `FALSE` otherwise.
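For instance, a minimal sketch of the class check (the corpus subset here is only for illustration):

```r
library(quanteda)

# textstat_collocations() returns an object of class "collocations"
cols <- textstat_collocations(data_corpus_inaugural[1:2])
is.collocations(cols)
#> [1] TRUE
is.collocations(data.frame(x = 1))
#> [1] FALSE
```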
Documents are grouped for the purposes of scoring, but collocations will not span sentences.

If `x` is a tokens object and some tokens have been removed, this should be done using `tokens_remove(x, pattern, padding = TRUE)` so that counts will still be accurate, but the pads will prevent those collocations from being scored.
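For example, a minimal sketch of this preparation step, using standard quanteda calls on an invented sentence:

```r
library(quanteda)

toks <- tokens("The quick brown fox jumped over the lazy dog")
# padding = TRUE leaves an empty pad ("") in place of each removed token, so
# the remaining tokens keep their positions and no collocation can form
# across a removed word
toks_padded <- tokens_remove(toks, pattern = stopwords("english"), padding = TRUE)
# min_count = 1 only because every pair occurs once in this tiny example
textstat_collocations(toks_padded, min_count = 1)
```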
The `lambda` computed for a `size =` \(K\)-word target multi-word expression is the coefficient for the \(K\)-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), where all multi-word expressions are considered (rather than just verbs, as in that paper). The `z` is the Wald \(z\)-statistic, computed as the quotient of `lambda` and its estimated standard error, as described below.
In detail:
Consider a \(K\)-word target expression \(x\), and let \(z\) be any
\(K\)-word expression. Define a comparison function \(c(x,z)=(j_{1},
\dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the
\(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0
otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots,
2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1,
\dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions
\(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\),
denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the
smoothing constant `smoothing`. The \(n_{i}\) are the counts in a
\(2^{K}\) contingency table whose dimensions are defined by the
\(c_{i}\).
\(\lambda\): The \(K\)-way interaction parameter in the saturated log-linear model fitted to the \(n_{i}\). It can be calculated as
$$\lambda = \sum_{i=1}^{M} (-1)^{K - b_{i}} \log n_{i}$$
where \(b_{i}\) is the number of the elements of \(c_{i}\) which are equal to 1.
\(z\): The Wald test \(z\)-statistic, calculated as
$$z = \frac{\lambda}{\left[ \sum_{i=1}^{M} n_{i}^{-1} \right]^{1/2}}$$
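To make this concrete, here is a minimal base-R sketch of the computation for a two-word target (\(K = 2\)). The token stream and target are invented for illustration, and the sketch ignores sentence boundaries; it is not quanteda's internal implementation.

```r
toks <- c("a", "b", "a", "c", "b", "b", "a", "b", "c", "a")
bigrams <- cbind(head(toks, -1), tail(toks, -1))  # all adjacent pairs z_r

target <- c("a", "b")
# c(x, z_r): does each slot of the pair match the corresponding target word?
m1 <- bigrams[, 1] == target[1]
m2 <- bigrams[, 2] == target[2]

smoothing <- 0.5
n <- c(sum(!m1 & !m2),   # c_1 = (0,0)
       sum(!m1 &  m2),   # c_2 = (0,1)
       sum( m1 & !m2),   # c_3 = (1,0)
       sum( m1 &  m2))   # c_4 = (1,1) = c_M
n <- n + smoothing       # smoothed counts of the 2^K contingency table
b <- c(0, 1, 1, 2)       # b_i = number of 1s in c_i
K <- 2

(lambda <- sum((-1)^(K - b) * log(n)))  # K-way interaction coefficient
(z <- lambda / sqrt(sum(1 / n)))        # Wald z: lambda over its std. error
```

For \(K = 2\) the sum reduces to \(\lambda = \log \frac{n_{(1,1)} \, n_{(0,0)}}{n_{(1,0)} \, n_{(0,1)}}\), the smoothed log odds ratio of the \(2 \times 2\) table.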
This function is under active development, with more measures to be added in the next release of quanteda.
Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
Kenneth Benoit, Jouni Kuha, Haiyan Wang, and Kohei Watanabe
```r
corp <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
#>    collocation count count_nested length   lambda        z
#> 1    have been     5            0      2 5.704259 7.354588
#> 2     has been     3            0      2 5.565217 6.409333
#> 3       of the    24            0      2 1.673501 6.382475
#> 4       i have     5            0      2 3.743580 6.268303
#> 5      which i     6            0      2 3.172217 6.135144
#> 6      will be     4            0      2 3.868500 5.930143
#> 7    less than     2            0      2 6.279494 5.529680
#> 8  public good     2            0      2 6.279494 5.529680
#> 9     you will     2            0      2 4.917893 5.431752
#> 10      may be     3            0      2 4.190711 5.328038

head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)
#>          collocation count count_nested length    lambda         z
#> 1       of which the     2            0      3 6.1259648 2.8317522
#> 2         in which i     3            0      3 2.1689288 1.1741918
#> 3          i have in     2            0      3 2.3809129 1.0618774
#> 4         and of the     2            0      3 0.8847383 0.7498730
#> 5          me by the     2            0      3 1.4726869 0.6560780
#> 6       to the great     2            0      3 1.2891870 0.5660311
#> 7        voice of my     2            0      3 1.2270130 0.5298220
#> 8     which ought to     2            0      3 1.4083232 0.5278314
#> 9  of the confidence     2            0      3 1.1220858 0.4948962
#> 10 the united states     2            0      3 1.2597834 0.4272349

# extracting multi-part proper nouns (capitalized terms)
toks1 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})",
                       valuetype = "regex", case_insensitive = FALSE,
                       padding = TRUE)
tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
head(tstat, 10)
#>              collocation count count_nested length     lambda         z
#> 1 United States Congress     2            0      3  -2.149876 -1.013431
#> 2    Vice President Bush     2            0      3 -11.580297 -4.470152

# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
         "a b . . a b . . a b . . a b . a b",
         "b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)
#>   collocation count count_nested length        lambda             z
#> 1         a b     7            2      2  5.652489e+00  2.745546e+00
#> 2         b c     6            3      2  5.609472e+00  2.721287e+00
#> 3         c d     2            2      2  4.976734e+00  2.354187e+00
#> 4       a b c     2            0      3 -1.110223e-16 -3.103168e-17
```