Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.
textstat_collocationsdev(x, method = "all", size = 2, min_count = 2,
  smoothing = 0.5, tolower = TRUE, show_counts = FALSE, ...)

is.collocationsdev(x)
Argument | Description
---|---
x | a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with padding = TRUE so that counts remain accurate
method | association measure for detecting collocations; "all" (the default) computes every implemented measure (lambda, z, G2, chi2, pmi, and LFMD)
size | integer; the length of the collocations to be scored
min_count | numeric; minimum frequency of collocations that will be scored
smoothing | numeric; a smoothing parameter added to the observed counts (default is 0.5)
tolower | logical; if TRUE, form collocations as lower-cased combinations
show_counts | logical; if TRUE, include the observed counts in the returned object
... | additional arguments passed to tokens(), when x is a character or corpus object
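As a quick orientation before the value and details that follow, a minimal call using the defaults might look like this sketch (the full worked examples appear at the end of this page; data_corpus_inaugural ships with quanteda):

library("quanteda")

# score bigrams occurring at least twice in the first two inaugural speeches
cols <- textstat_collocationsdev(data_corpus_inaugural[1:2],
                                 size = 2, min_count = 2)
head(cols)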
textstat_collocationsdev returns a data.frame of collocations and their scores and statistics. is.collocationsdev returns TRUE if the object is of class collocationsdev, FALSE otherwise.
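For instance, a sketch of the class check, reusing the cols object from the sketch above:

is.collocationsdev(cols)      # TRUE for a scored collocations object
is.collocationsdev(letters)   # FALSE for an ordinary character vector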
Documents are grouped for the purposes of scoring, but collocations will not span sentences.

If x is a tokens object and some tokens have been removed, this should be done using tokens_remove(x, pattern, padding = TRUE) so that counts will still be accurate, but the pads will prevent those collocations from being scored.
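For example, a sketch of removing stopwords while preserving positional information (this mirrors the pattern used in the examples below):

toks <- tokens(data_corpus_inaugural[1:2])
# padding = TRUE leaves an empty pad ("") where each removed token stood,
# so the surviving words keep accurate adjacency counts, and no collocation
# is scored across a pad
toks <- tokens_remove(toks, stopwords("english"), padding = TRUE)
cols_nostop <- textstat_collocationsdev(toks, size = 2, min_count = 2)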
The lambda computed for a size = \(K\)-word target multi-word expression is the coefficient of the \(K\)-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), except that here all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald \(z\)-statistic, computed as the quotient of lambda and its estimated standard error, as described below.
In detail:

Consider a \(K\)-word target expression \(x\), and let \(z\) be any \(K\)-word expression. Define a comparison function \(c(x,z)=(j_{1}, \dots, j_{K})=c\) such that the \(k\)th element of \(c\) is 1 if the \(k\)th word in \(z\) is equal to the \(k\)th word in \(x\), and 0 otherwise. Let \(c_{i}=(j_{i1}, \dots, j_{iK})\), \(i=1, \dots, 2^{K}=M\), be the possible values of \(c(x,z)\), with \(c_{M}=(1,1, \dots, 1)\). Consider the set of \(c(x,z_{r})\) across all expressions \(z_{r}\) in a corpus of text, and let \(n_{i}\), for \(i=1,\dots,M\), denote the number of the \(c(x,z_{r})\) which equal \(c_{i}\), plus the smoothing constant smoothing. The \(n_{i}\) are the counts in a \(2^{K}\) contingency table whose dimensions are defined by the \(c_{i}\).
\(\lambda\): The \(K\)-way interaction parameter in the saturated log-linear model fitted to the \(n_{i}\). It can be calculated as
$$\lambda = \sum_{i=1}^{M} (-1)^{K - b_{i}} \log n_{i}$$
where \(b_{i}\) is the number of the elements of \(c_{i}\) which are equal to 1.
The Wald test \(z\)-statistic \(z\) is calculated as:
$$z = \frac{\lambda}{\left[\sum_{i=1}^{M} n_{i}^{-1}\right]^{1/2}}$$
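To make the computation concrete, the following base-R sketch scores a single target bigram by hand, following the definitions above (the toy bigram corpus and the target are invented for illustration; this is not the package's internal code):

# toy corpus of bigrams (the z_r): each row is one 2-word expression
bigrams <- rbind(
  c("new", "york"), c("new", "york"), c("new", "car"),
  c("old", "york"), c("old", "car"), c("red", "car")
)
target <- c("new", "york")   # the K = 2 word target expression x
smoothing <- 0.5
K <- length(target)

# comparison vectors c(x, z_r): 1 where the word matches the target
cmp <- t(apply(bigrams, 1, function(z) as.integer(z == target)))

# counts n_i over the 2^K = 4 cells (0,0), (1,0), (0,1), (1,1), smoothed
cells <- expand.grid(j1 = 0:1, j2 = 0:1)
n <- apply(cells, 1, function(ci)
  sum(cmp[, 1] == ci[1] & cmp[, 2] == ci[2])) + smoothing

# b_i = number of 1s in each c_i; lambda = sum (-1)^(K - b_i) * log(n_i)
b <- rowSums(cells)
lambda <- sum((-1)^(K - b) * log(n))

# Wald z = lambda / sqrt(sum(1 / n_i))
z <- lambda / sqrt(sum(1 / n))
c(lambda = lambda, z = z)

With these six bigrams the cell counts are (2, 1, 1, 2) before smoothing, giving lambda of about 1.02 and z of about 0.70.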
This function is under active development, with more measures to be added in the next release of quanteda.
Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACLEACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.
txts <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocationsdev(txts, size = 2, min_count = 2), 10)
#>    collocation count length   lambda        z       G2      chi2      pmi
#> 1        , and    17      2 2.643957 8.170237 49.47743 108.46212 2.927463
#> 2    have been     5      2 5.731000 7.487958 43.20136 399.03760 6.200685
#> 3       of the    24      2 1.781820 6.830093 37.22476  58.28699 1.935835
#> 4     has been     3      2 5.717327 6.584944 28.52046 323.74321 6.548608
#> 5       i have     5      2 3.772416 6.461199 26.86011 113.55789 4.463719
#> 6          , i    10      2 2.570085 6.377237 29.25016  65.92607 2.956032
#> 7      will be     4      2 3.974267 6.109305 23.64307 112.94349 4.728587
#> 8    less than     2      2 6.431212 5.663496 23.15338 373.56773 7.233106
#> 9  public good     2      2 6.431212 5.663496 23.15338 373.56773 7.233106
#> 10     which i     6      2 2.657154 5.555529 19.98871  52.21109 3.264154
#>         LFMD
#> 1  11.186029
#> 2  11.119548
#> 3  11.165255
#> 4  10.163318
#> 5   9.382582
#> 6   9.740667
#> 7   9.068437
#> 8   9.876962
#> 9   9.876962
#> 10  8.665034

head(cols <- textstat_collocationsdev(txts, size = 3, min_count = 2), 10)
#>     collocation count length   lambda         z         G2       chi2       pmi
#> 1  of which the     2      3 6.179554 2.8579715 13.4611112 23.7539935 3.2516278
#> 2      , and of     2      3 3.066282 1.7161287  4.0540624  3.9852233 1.2377281
#> 3    in which i     3      3 2.907704 1.5893955  3.4809716  3.0412877 0.7012360
#> 4       , or by     2      3 3.086502 1.3263061  2.2762489  1.9886129 0.5716844
#> 5     i have in     2      3 2.484260 1.1250830  1.6346876  1.4070556 0.4984132
#> 6     me by the     2      3 2.362269 1.0839184  1.5158738  1.3075711 0.4867261
#> 7     , and the     3      3 1.017118 1.0243655  1.0678760  1.0779195 0.5158313
#> 8    and of the     2      3 1.057485 0.8988065  0.8416763  0.8445156 0.5606277
#> 9     , i shall     3      3 1.661358 0.7605286  0.6951628  0.6084811 0.1996503
#> 10     . on the     2      3 1.014510 0.5884358  0.3960617  0.3629160 0.2626685
#>        LFMD
#> 1  5.895484
#> 2  3.881584
#> 3  4.315946
#> 4  3.215541
#> 5  3.142269
#> 6  3.130582
#> 7  4.130541
#> 8  3.204484
#> 9  3.814360
#> 10 2.906525

# extracting multi-part proper nouns (capitalized terms)
toks2 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks2, stopwords("english"), padding = TRUE)
toks2 <- tokens_select(toks2, "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
seqs <- textstat_collocationsdev(toks2, size = 3, tolower = FALSE)
head(seqs, 10)
#>              collocation count length     lambda         z        G2
#> 1 United States Congress     2      3  -2.152404 -1.014623 0.7972545
#> 2    Vice President Bush     2      3 -11.582818 -4.471125 9.6364697
#>          chi2        pmi     LFMD
#> 1    1.182867 -0.1873977 2.456458
#> 2 9474.743454 -0.2634959 2.380360