R/textstat_collocations.R
textstat_collocations.Rd
Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
textstat_collocations(
  x,
  method = "lambda",
  size = 2,
  min_count = 2,
  smoothing = 0.5,
  tolower = TRUE,
  ...
)

is.collocations(x)
x  a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with tokens_remove(x, pattern, padding = TRUE).
method  association measure for detecting collocations. Currently this is limited to "lambda".
size  integer; the length of the collocations to be scored.
min_count  numeric; minimum frequency of collocations that will be scored.
smoothing  numeric; a smoothing parameter added to the observed counts (default is 0.5).
tolower  logical; if TRUE, form collocations as lower-cased combinations.
...  additional arguments passed to tokens().
textstat_collocations returns a data.frame of collocations and their scores and statistics. This consists of the collocations, their counts, length, and \(\lambda\) and \(z\) statistics. When size is a vector, then count_nested counts the lower-order collocations that occur within a higher-order collocation (but this does not affect the statistics).
is.collocations returns TRUE if the object is of class collocations, FALSE otherwise.
Documents are grouped for the purposes of scoring, but collocations will not span sentences. If x is a tokens object and some tokens have been removed, this should be done using tokens_remove(x, pattern, padding = TRUE) so that counts will still be accurate, but the pads will prevent those collocations from being scored.
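The effect of padding can be seen on a small example (a minimal sketch, assuming quanteda is installed; note that in quanteda version 3 and later, textstat_collocations() is provided by the companion package quanteda.textstats):

```r
library(quanteda)
library(quanteda.textstats)  # home of textstat_collocations() in quanteda >= 3

toks <- tokens("The quick brown fox jumps over the lazy dog .")

# padding = TRUE replaces removed tokens with empty pads ("") instead of
# deleting them, so the surviving tokens keep their original adjacency
toks_pad <- tokens_remove(toks, pattern = stopwords("english"), padding = TRUE)

# "quick brown", "brown fox", and "fox jumps" remain adjacent and can be
# scored; "jumps" and "lazy" are separated by pads, so no collocation is
# formed across the removed words
textstat_collocations(toks_pad, size = 2, min_count = 1)
```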
The lambda computed for a size = \(K\)-word target multi-word expression is the coefficient for the \(K\)-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), where all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald \(z\)-statistic computed as the quotient of lambda and its estimated standard error, as described below.
In detail:
Consider a \(K\)word target expression \(x\), and let \(z\) be any
\(K\)word expression. Define a comparison function \(c(x,z)=(j_{1},
\dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the
\(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0
otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots,
2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1,
\dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions
\(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\),
denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the smoothing constant smoothing. The \(n_{i}\) are the counts in a
\(2^{K}\) contingency table whose dimensions are defined by the
\(c_{i}\).
\(\lambda\): The \(K\)-way interaction parameter in the saturated log-linear model fitted to the \(n_{i}\). It can be calculated as
$$\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}$$
where \(b_{i}\) is the number of the elements of \(c_{i}\) which are equal to 1.
The Wald test \(z\)-statistic is calculated as:
$$z = \frac{\lambda}{\left[\sum_{i=1}^{M} n_{i}^{-1}\right]^{1/2}}$$
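To make the formulas concrete, here is a hand computation for a bigram (\(K = 2\)); the counts are illustrative, not drawn from any corpus:

```r
# Worked example of the lambda and z formulas for K = 2, so M = 2^K = 4.
# The comparison vectors c_i are (0,0), (0,1), (1,0), (1,1), and n_i is the
# number of bigrams matching each pattern, plus the smoothing constant.
smoothing <- 0.5
n <- c(100, 8, 6, 4) + smoothing  # illustrative counts for (0,0), (0,1), (1,0), (1,1)
b <- c(0, 1, 1, 2)                # b_i = number of elements of c_i equal to 1
K <- 2

lambda <- sum((-1)^(K - b) * log(n))  # K-way interaction coefficient
z <- lambda / sqrt(sum(1 / n))        # Wald z-statistic

round(c(lambda = lambda, z = z), 3)
#> lambda      z
#>  2.102  2.962
```

For \(K = 2\) the sum reduces to the familiar log odds-ratio form \(\log n_{11} - \log n_{10} - \log n_{01} + \log n_{00}\) of the smoothed counts.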
This function is under active development, with more measures to be added in the next release of quanteda.
Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL-EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
Kenneth Benoit, Jouni Kuha, Haiyan Wang, and Kohei Watanabe
corp <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
#>    collocation count count_nested length   lambda        z
#> 1    have been     5            0      2 5.704259 7.354588
#> 2     has been     3            0      2 5.565217 6.409333
#> 3       of the    24            0      2 1.673501 6.382475
#> 4       i have     5            0      2 3.743580 6.268303
#> 5      which i     6            0      2 3.172217 6.135144
#> 6      will be     4            0      2 3.868500 5.930143
#> 7    less than     2            0      2 6.279494 5.529680
#> 8  public good     2            0      2 6.279494 5.529680
#> 9     you will     2            0      2 4.917893 5.431752
#> 10      may be     3            0      2 4.190711 5.328038

head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)
#>          collocation count count_nested length    lambda         z
#> 1       of which the     2            0      3 6.1259648 2.8317522
#> 2         in which i     3            0      3 2.1689288 1.1741918
#> 3          i have in     2            0      3 2.3809129 1.0618774
#> 4         and of the     2            0      3 0.8847383 0.7498730
#> 5          me by the     2            0      3 1.4726869 0.6560780
#> 6       to the great     2            0      3 1.2891870 0.5660311
#> 7        voice of my     2            0      3 1.2270130 0.5298220
#> 8     which ought to     2            0      3 1.4083232 0.5278314
#> 9  of the confidence     2            0      3 1.1220858 0.4948962
#> 10 the united states     2            0      3 1.2597834 0.4272349

# extracting multi-part proper nouns (capitalized terms)
toks1 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})",
                       valuetype = "regex", case_insensitive = FALSE,
                       padding = TRUE)
tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
head(tstat, 10)
#>              collocation count count_nested length    lambda        z
#> 1 United States Congress     2            0      3  2.149876 1.013431
#> 2    Vice President Bush     2            0      3 11.580297 4.470152

# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
         "a b . . a b . . a b . . a b . a b",
         "b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)
#>   collocation count count_nested length       lambda            z
#> 1         a b     7            2      2 5.652489e+00 2.745546e+00
#> 2         b c     6            3      2 5.609472e+00 2.721287e+00
#> 3         c d     2            2      2 4.976734e+00 2.354187e+00
#> 4       a b c     2            0      3 1.110223e-16 3.103168e-17