Calculate the lexical diversity or complexity of text(s).

textstat_lexdiv(x, measure = c("all", "TTR", "C", "R", "CTTR", "U", "S",
  "K", "D", "Vm", "Maas"), log.base = 10, remove_numbers = TRUE,
  remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = FALSE)

Arguments

x

an input dfm or tokens object

measure

a character vector defining the measure(s) to calculate

log.base

a numeric value defining the base of the logarithm (for measures using logs)

remove_numbers

logical; if TRUE remove features or tokens that consist only of numerals (the Unicode "Number" [N] class)

remove_punct

logical; if TRUE remove all features or tokens that consist only of characters in the Unicode "Punctuation" [P] class

remove_symbols

logical; if TRUE remove all features or tokens that consist only of characters in the Unicode "Symbol" [S] class

remove_hyphens

logical; if TRUE, split words that are connected by hyphens and hyphen-like characters, e.g. "self-storage" becomes the two features or tokens "self" and "storage". Default is FALSE, which preserves such words as-is, with the hyphens.
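To illustrate the effect of remove_hyphens, consider the following toy comparison (a sketch only; the exact counts depend on how the default tokenizer handles intra-word hyphens):

library(quanteda)
txt_hyph <- "The self-storage business uses self-service kiosks."
textstat_lexdiv(tokens(txt_hyph), measure = "TTR", remove_hyphens = FALSE)
textstat_lexdiv(tokens(txt_hyph), measure = "TTR", remove_hyphens = TRUE)
## with remove_hyphens = TRUE, "self-storage" and "self-service" each split in
## two, adding two tokens and a repeated type ("self"), which lowers the TTR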

Value

textstat_lexdiv returns a data.frame of documents and their lexical diversity scores.

Details

textstat_lexdiv calculates a variety of proposed indices for lexical diversity. In the following formulas, \(N\) refers to the total number of tokens, \(V\) to the number of types, and \(f_v(i, N)\) to the number of types occurring \(i\) times in a sample of length \(N\).
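For illustration, these quantities can be computed directly from a toy tokens object (a minimal sketch of the definitions, not the package's internal code):

toks <- tokens("one fish two fish red fish blue fish")
counts <- table(as.character(toks))  # frequency of each type
N <- sum(counts)     # total number of tokens: 8
V <- length(counts)  # number of types: 5
f_v <- table(counts) # frequency spectrum: f_v(i, N) for each observed i
## here f_v(1, 8) = 4 (four types occur once) and f_v(4, 8) = 1 ("fish")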

"TTR":

The ordinary Type-Token Ratio: $$TTR = \frac{V}{N}$$
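Continuing the toy example above, the TTR can be verified by hand:

V / N                            # 0.625
textstat_lexdiv(toks, "TTR")$TTR # the same value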

"C":

Herdan's C (Herdan, 1960, as cited in Tweedie & Baayen, 1998; sometimes referred to as LogTTR): $$C = \frac{\log{V}}{\log{N}}$$

"R":

Guiraud's Root TTR (Guiraud, 1954, as cited in Tweedie & Baayen, 1998): $$R = \frac{V}{\sqrt{N}}$$

"CTTR":

Carroll's Corrected TTR: $$CTTR = \frac{V}{\sqrt{2N}}$$

"U":

Dugast's Uber Index (Dugast, 1978, as cited in Tweedie & Baayen, 1998): $$U = \frac{(\log{N})^2}{\log{N} - \log{V}}$$

"S":

Summer's index: $$S = \frac{\log{\log{V}}}{\log{\log{N}}}$$
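The five indices above are all simple functions of \(N\) and \(V\). Continuing the toy example, they can be checked by hand, using base-10 logarithms to match the log.base = 10 default (C is invariant to the base; U and S are not):

log10(V) / log10(N)                 # Herdan's C
V / sqrt(N)                         # Guiraud's R
V / sqrt(2 * N)                     # Carroll's CTTR
log10(N)^2 / (log10(N) - log10(V))  # Dugast's U
log10(log10(V)) / log10(log10(N))   # Summer's S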

"K":

Yule's K (Yule, 1944, as presented in Tweedie & Baayen, 1998, Eq. 16) is calculated by: $$K = 10^4 \times \left[ -\frac{1}{N} + \sum_{i=1}^{V} f_v(i, N) \left( \frac{i}{N} \right)^2 \right] $$

"D":

Simpson's D (Simpson 1949, as presented in Tweedie & Baayen, 1998, Eq. 17) is calculated by: $$D = \sum_{i=1}^{V} f_v(i, N) \frac{i}{N} \frac{i-1}{N-1}$$

"Vm":

Herdan's \(V_m\) (Herdan 1955, as presented in Tweedie & Baayen, 1998, Eq. 18) is calculated by: $$V_m = \sqrt{ \sum_{i=1}^{V} f_v(i, N) (i/N)^2 - \frac{1}{V} }$$
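K, D, and \(V_m\) depend on the full frequency spectrum \(f_v(i, N)\) rather than on \(N\) and \(V\) alone. Continuing the toy example (again a sketch of the formulas as written above, not the package internals):

i <- as.integer(names(f_v))               # the occurrence counts observed
10^4 * (-1/N + sum(f_v * (i / N)^2))      # Yule's K
sum(f_v * (i / N) * ((i - 1) / (N - 1)))  # Simpson's D
sqrt(sum(f_v * (i / N)^2) - 1/V)          # Herdan's Vm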

"Maas":

Maas' indices (\(a\), \(\log{V_0}\) & \(\log{}_{e}{V_0}\)): $$a^2 = \frac{\log{N} - \log{V}}{(\log{N})^2}$$ $$\log{V_0} = \frac{\log{V}}{\sqrt{1 - \left(\frac{\log{V}}{\log{N}}\right)^2}}$$ The measure was derived from a formula by Mueller (1969, as cited in Maas, 1972). \(\log{}_{e}{V_0}\) is equivalent to \(\log{V_0}\), only with \(e\) as the base for the logarithms. Also calculated are \(a\) and \(\log{V_0}\) (both different from the values above) and \(V'\), as measures of relative vocabulary growth as the text progresses. To calculate these measures, the first half of the text and the full text are examined (see Maas, 1972, p. 67 ff. for details). Note: for the current dfm method there is no computation on separate halves of the text.
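For the two global Maas quantities, continuing the toy example (base-10 logarithms, matching the log.base = 10 default; replacing log10 with log gives the natural-log variant \(\log{}_{e}{V_0}\)):

(log10(N) - log10(V)) / log10(N)^2            # a^2
log10(V) / sqrt(1 - (log10(V) / log10(N))^2)  # log V0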

Note

This implements only the static measures of lexical diversity, not more complex measures based on windows of text such as the Mean Segmental Type-Token Ratio, the Moving-Average Type-Token Ratio (Covington & McFall, 2010), the MTLD or MTLD-MA (Moving-Average Measure of Textual Lexical Diversity) proposed by McCarthy & Jarvis (2010) or Jarvis (no year), or the HD-D version of vocd-D (see McCarthy & Jarvis, 2007). These are available from the package koRpus.

References

Covington, Michael A. and Joe D. McFall. 2010. "Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR)". Journal of Quantitative Linguistics 17(2): 94--100.

Herdan, Gustav. 1955. "A New Derivation and Interpretation of Yule's 'Characteristic' K." Zeitschrift für angewandte Mathematik und Physik 6(4): 332--334.

Maas, Heinz-Dieter. 1972. "Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes". Zeitschrift für Literaturwissenschaft und Linguistik 2(8): 73--96.

McCarthy, Philip M. and Scott Jarvis. 2007. "vocd: A Theoretical and Empirical Evaluation". Language Testing 24(4): 459--488.

McCarthy, Philip M. and Scott Jarvis. 2010. "MTLD, vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment". Behavior Research Methods 42(2): 381--392.

Michalke, Meik. 2014. koRpus: An R Package for Text Analysis. R package version 0.05-5. http://reaktanz.de/?c=hacking&s=koRpus

Simpson, Edward H. 1949. "Measurement of Diversity." Nature 163: 688.

Tweedie, Fiona J. and R. Harald Baayen. 1998. "How Variable May a Constant Be? Measures of Lexical Richness in Perspective". Computers and the Humanities 32(5): 323--352.

Examples

mydfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), verbose = FALSE)
(result <- textstat_lexdiv(mydfm, c("CTTR", "TTR", "U")))
#>        document     CTTR       TTR        U
#> 1   1981-Reagan 12.07549 0.3463595 24.89515
#> 2   1985-Reagan 11.99335 0.3356835 24.48583
#> 3     1989-Bush 10.99287 0.3231102 23.07188
#> 4  1993-Clinton 10.61324 0.3754693 24.12386
#> 5  1997-Clinton 10.94684 0.3333333 23.29504
#> 6     2001-Bush 10.37904 0.3689198 23.63757
#> 7     2005-Bush 11.26505 0.3500724 24.12469
#> 8    2009-Obama 12.91628 0.3736402 26.69551
#> 9    2013-Obama 11.99681 0.3709369 25.60049
#> 10   2017-Trump 10.01461 0.3728344 23.29366
cor(textstat_lexdiv(mydfm, "all")[,-1])
#>              TTR           C           R        CTTR           U           S
#> TTR    1.00000000  0.87474824 -0.06890274 -0.06890274  0.43113568  0.54093379
#> C      0.87474824  1.00000000  0.42238766  0.42238766  0.81336016  0.88065570
#> R     -0.06890274  0.42238766  1.00000000  1.00000000  0.86988854  0.80053122
#> CTTR  -0.06890274  0.42238766  1.00000000  1.00000000  0.86988854  0.80053122
#> U      0.43113568  0.81336016  0.86988854  0.86988854  1.00000000  0.99092386
#> S      0.54093379  0.88065570  0.80053122  0.80053122  0.99092386  1.00000000
#> K      0.09869723 -0.05531017 -0.31164089 -0.31164089 -0.22536889 -0.19303493
#> D      0.04556277 -0.07140815 -0.24576843 -0.24576843 -0.19170973 -0.16897667
#> Vm     0.02173116 -0.03766314 -0.13161889 -0.13161889 -0.09963151 -0.08685865
#> Maas  -0.43150628 -0.81434842 -0.86944078 -0.86944078 -0.99943116 -0.99202024
#> lgV0   0.18391592  0.63653817  0.96757261  0.96757261  0.96608008  0.92549041
#> lgeV0  0.18391592  0.63653817  0.96757261  0.96757261  0.96608008  0.92549041
#>                 K           D          Vm       Maas       lgV0      lgeV0
#> TTR    0.09869723  0.04556277  0.02173116 -0.4315063  0.1839159  0.1839159
#> C     -0.05531017 -0.07140815 -0.03766314 -0.8143484  0.6365382  0.6365382
#> R     -0.31164089 -0.24576843 -0.13161889 -0.8694408  0.9675726  0.9675726
#> CTTR  -0.31164089 -0.24576843 -0.13161889 -0.8694408  0.9675726  0.9675726
#> U     -0.22536889 -0.19170973 -0.09963151 -0.9994312  0.9660801  0.9660801
#> S     -0.19303493 -0.16897667 -0.08685865 -0.9920202  0.9254904  0.9254904
#> K      1.00000000  0.99638586  0.98036518  0.2243634 -0.2745434 -0.2745434
#> D      0.99638586  1.00000000  0.99283358  0.1906940 -0.2227132 -0.2227132
#> Vm     0.98036518  0.99283358  1.00000000  0.0989224 -0.1157789 -0.1157789
#> Maas   0.22436335  0.19069398  0.09892240  1.0000000 -0.9658324 -0.9658324
#> lgV0  -0.27454341 -0.22271316 -0.11577890 -0.9658324  1.0000000  1.0000000
#> lgeV0 -0.27454341 -0.22271316 -0.11577890 -0.9658324  1.0000000  1.0000000
txt <- c("Anyway, like I was sayin', shrimp is the fruit of the sea. You can barbecue it, boil it, broil it, bake it, saute it.", "There's shrimp-kabobs, shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup, shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp sandwich.") tokens(txt) %>% textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
#>   document       TTR     CTTR         K
#> 1    text1 0.7916667 2.742414  798.6111
#> 2    text2 0.6060606 2.461830 1551.8825
dfm(txt) %>% textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
#>   document       TTR     CTTR         K
#> 1    text1 0.7916667 2.742414  798.6111
#> 2    text2 0.6060606 2.461830 1551.8825