Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.

textstat_collocations(x, method = "lambda", size = 2, min_count = 2,
smoothing = 0.5, tolower = TRUE, ...)

is.collocations(x)

## Arguments

x: a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with padding = TRUE. While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects, due to the relatively imperfect detection of sentence boundaries from texts that have already been tokenized.

method: association measure for detecting collocations. Currently this is limited to "lambda". See Details.

size: integer; the length of the collocations to be scored.

min_count: numeric; minimum frequency of collocations that will be scored.

smoothing: numeric; a smoothing parameter added to the observed counts (default is 0.5).

tolower: logical; if TRUE, form collocations as lower-cased combinations.

...: additional arguments passed to tokens, if x is not already a tokens object.

## Value

textstat_collocations returns a data.frame of collocations and their scores and statistics. This consists of the collocations, their counts, length, and $$\lambda$$ and $$z$$ statistics. When size is a vector, then count_nested counts the lower-order collocations that occur within a higher-order collocation (but this does not affect the statistics).

is.collocations returns TRUE if the object is of class collocations, FALSE otherwise.

## Details

Documents are grouped for the purposes of scoring, but collocations will not span sentences. If x is a tokens object and some tokens have been removed, this should be done using tokens_remove(x, pattern, padding = TRUE) so that counts will still be accurate, but the pads will prevent those collocations from being scored.
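As a sketch of the padding workflow described above (assuming quanteda is attached, and using its built-in data_corpus_inaugural corpus for illustration):

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural[1:2])

# Remove stopwords but leave empty "pads" in their place: the
# remaining tokens keep their original positions, so counts stay
# accurate, and no collocation can form across a removed word.
toks_pad <- tokens_remove(toks, pattern = stopwords("english"),
                          padding = TRUE)

textstat_collocations(toks_pad, size = 2, min_count = 2)
```

Without padding = TRUE, previously non-adjacent tokens would become adjacent after removal and could be scored as spurious collocations.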

The lambda computed for a size = $$K$$-word target multi-word expression is the coefficient of the $$K$$-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), except that here all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald $$z$$-statistic, computed as the quotient of lambda and its estimated standard error, as described below.

In detail:

Consider a $$K$$-word target expression $$x$$, and let $$z$$ be any $$K$$-word expression. Define a comparison function $$c(x,z)=(j_{1}, \dots, j_{K})=c$$ such that the $$k$$th element of $$c$$ is 1 if the $$k$$th word in $$z$$ is equal to the $$k$$th word in $$x$$, and 0 otherwise. Let $$c_{i}=(j_{i1}, \dots, j_{iK})$$, $$i=1, \dots, 2^{K}=M$$, be the possible values of $$c(x,z)$$, with $$c_{M}=(1,1, \dots, 1)$$. Consider the set of $$c(x,z_{r})$$ across all expressions $$z_{r}$$ in a corpus of text, and let $$n_{i}$$, for $$i=1,\dots,M$$, denote the number of the $$c(x,z_{r})$$ which equal $$c_{i}$$, plus the smoothing constant smoothing. The $$n_{i}$$ are the counts in a $$2^{K}$$ contingency table whose dimensions are defined by the $$c_{i}$$.

$$\lambda$$: The $$K$$-way interaction parameter in the saturated loglinear model fitted to the $$n_{i}$$. It can be calculated as

$$\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}$$

where $$b_{i}$$ is the number of the elements of $$c_{i}$$ which are equal to 1.

Wald test $$z$$-statistic: calculated as

$$z = \frac{\lambda}{\left[\sum_{i=1}^{M} n_{i}^{-1}\right]^{1/2}}$$
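The formulas above can be checked by hand in base R for a two-word expression ($$K = 2$$, so $$M = 4$$). The cell counts below are made up purely for illustration:

```r
# Hypothetical cell counts n_i for a 2 x 2 contingency table,
# indexed by c_i = (0,0), (0,1), (1,0), (1,1); the smoothing
# constant (default 0.5) is added to each observed count.
smoothing <- 0.5
n <- c(1000, 40, 30, 25) + smoothing
b <- c(0, 1, 1, 2)  # number of elements of c_i equal to 1
K <- 2

# lambda: the K-way interaction in the saturated log-linear model
lambda <- sum((-1)^(K - b) * log(n))

# Wald z: lambda divided by its estimated standard error
z <- lambda / sqrt(sum(1 / n))

round(c(lambda = lambda, z = z), 4)
```

For $$K = 2$$ this reduces to the familiar log odds ratio of the smoothed 2 x 2 table, $$\log(n_{11} n_{00} / (n_{10} n_{01}))$$.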

## Note

This function is under active development, with more measures to be added in the next release of quanteda.

## References

Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on Collocation: Computational Extraction, Analysis and Exploitation.

## Examples

corp <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
#>    collocation count count_nested length   lambda        z
#> 1    have been     5            0      2 5.704259 7.354588
#> 2     has been     3            0      2 5.565217 6.409333
#> 3       of the    24            0      2 1.673501 6.382475
#> 4       i have     5            0      2 3.743580 6.268303
#> 5      which i     6            0      2 3.172217 6.135144
#> 6      will be     4            0      2 3.868500 5.930143
#> 7    less than     2            0      2 6.279494 5.529680
#> 8  public good     2            0      2 6.279494 5.529680
#> 9     you will     2            0      2 4.917893 5.431752
#> 10      may be     3            0      2 4.190711 5.328038
head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)
#>          collocation count count_nested length    lambda         z
#> 1       of which the     2            0      3 6.1259648 2.8317522
#> 2         in which i     3            0      3 2.1689288 1.1741918
#> 3          i have in     2            0      3 2.3809129 1.0618774
#> 4         and of the     2            0      3 0.8847383 0.7498730
#> 5          me by the     2            0      3 1.4726869 0.6560780
#> 6       to the great     2            0      3 1.2891870 0.5660311
#> 7        voice of my     2            0      3 1.2270130 0.5298220
#> 8     which ought to     2            0      3 1.4083232 0.5278314
#> 9  of the confidence     2            0      3 1.1220858 0.4948962
#> 10 the united states     2            0      3 1.2597834 0.4272349
# extracting multi-part proper nouns (capitalized terms)
toks1 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
head(tstat, 10)
#>              collocation count count_nested length     lambda         z
#> 1 United States Congress     2            0      3  -2.152404 -1.014623
#> 2    Vice President Bush     2            0      3 -11.582818 -4.471125
# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
         "a b . . a b . . a b . . a b . a b",
         "b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)
#>   collocation count count_nested length        lambda             z
#> 1         a b     7            2      2  5.652489e+00  2.745546e+00
#> 2         b c     6            3      2  5.609472e+00  2.721287e+00
#> 3         c d     2            2      2  4.976734e+00  2.354187e+00
#> 4       a b c     2            0      3 -1.110223e-16 -3.103168e-17