Calculate the lexical diversity or complexity of text(s).

textstat_lexdiv(x, measure = c("all", "TTR", "C", "R", "CTTR", "U", "S",
  "K", "Vm", "Maas"), log.base = 10)

Arguments

x

an input dfm

measure

a character vector defining the measure to calculate.

log.base

a numeric value defining the base of the logarithm (for measures using logs)

...

not used

Value

textstat_lexdiv returns a data.frame of documents and their lexical diversity scores.

Details

textstat_lexdiv calculates a variety of proposed indices for lexical diversity. In the following formulas, \(N\) refers to the total number of tokens, \(V\) to the number of types, and \(f_v(i, N)\) to the number of types occurring \(i\) times in a sample of length \(N\).
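As a rough illustration of these quantities, the following base R sketch derives \(N\), \(V\) and the frequency spectrum \(f_v(i, N)\) from a small token vector. The toy tokens and the variable names are assumptions made for this example only; textstat_lexdiv itself takes a dfm rather than raw tokens.

```r
# Toy document as a character vector of tokens (hypothetical example data)
tokens <- c("a", "b", "a", "c", "b", "a", "d")

N <- length(tokens)             # total number of tokens
type_counts <- table(tokens)    # how often each type occurs
V <- length(type_counts)        # number of types

# Frequency spectrum: f_v(i, N) is the number of types occurring exactly i times
spectrum <- table(as.integer(type_counts))
i   <- as.integer(names(spectrum))   # distinct occurrence counts
f_v <- as.integer(spectrum)          # number of types with each count

# Sanity checks: the spectrum accounts for every type and every token
sum(f_v) == V
sum(i * f_v) == N
```

Both checks should hold for any input, since each type is counted once in the spectrum and contributes \(i\) tokens.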

"TTR":

The ordinary Type-Token Ratio: $$TTR = \frac{V}{N}$$

"C":

Herdan's C (Herdan, 1960, as cited in Tweedie & Baayen, 1998; sometimes referred to as LogTTR): $$C = \frac{\log{V}}{\log{N}}$$

"R":

Guiraud's Root TTR (Guiraud, 1954, as cited in Tweedie & Baayen, 1998): $$R = \frac{V}{\sqrt{N}}$$

"CTTR":

Carroll's Corrected TTR: $$CTTR = \frac{V}{\sqrt{2N}}$$

"U":

Dugast's Uber Index (Dugast, 1978, as cited in Tweedie & Baayen, 1998): $$U = \frac{(\log{N})^2}{\log{N} - \log{V}}$$

"S":

Summer's index: $$S = \frac{\log{\log{V}}}{\log{\log{N}}}$$

"K":

Yule's K (Yule, 1944, as presented in Tweedie & Baayen, 1998, Eq. 16) is calculated by: $$K = 10^4 \times \left[ -\frac{1}{N} + \sum_{i=1}^{V} f_v(i, N) \left( \frac{i}{N} \right)^2 \right] $$

"D":

Simpson's D (Simpson, 1949, as presented in Tweedie & Baayen, 1998, Eq. 17) is calculated by: $$D = \sum_{i=1}^{V} f_v(i, N) \frac{i}{N} \frac{i-1}{N-1}$$

"Vm":

Herdan's \(V_m\) (Herdan, 1955, as presented in Tweedie & Baayen, 1998, Eq. 18) is calculated by: $$V_m = \sqrt{ \sum_{i=1}^{V} f_v(i, N) \left( \frac{i}{N} \right)^2 - \frac{1}{V} }$$

"Maas":

Maas' indices (\(a\), \(\log{V_0}\) & \(\log{}_{e}{V_0}\)): $$a^2 = \frac{\log{N} - \log{V}}{(\log{N})^2}$$ $$\log{V_0} = \frac{\log{V}}{\sqrt{1 - \left( \frac{\log{V}}{\log{N}} \right)^2}}$$ The measure was derived from a formula by Mueller (1969, as cited in Maas, 1972). \(\log{}_{e}{V_0}\) is equivalent to \(\log{V_0}\), only with \(e\) as the base for the logarithms. In Maas' original formulation, \(a\), \(\log{V_0}\) (both different from the values above) and \(V'\) are also calculated as measures of relative vocabulary growth as the text progresses, by comparing the first half of the text with the full text (see Maas, 1972, p. 67 ff. for details). Note: the current method (for a dfm) does not perform any computation on separate halves of the text.
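The formulas above can be sketched in base R as follows. lexdiv_sketch is a hypothetical helper written for illustration, not the implementation used by textstat_lexdiv: it works on a single token vector rather than a dfm, and it takes the subtracted term in \(V_m\) to be \(1/V\), following Tweedie & Baayen (1998, Eq. 18).

```r
# Sketch: static lexical diversity indices from one token vector (base R only).
# `lexdiv_sketch` is a hypothetical helper, not part of any package.
lexdiv_sketch <- function(tokens, log.base = 10) {
  counts <- as.integer(table(tokens))    # occurrences of each type
  N <- sum(counts)                       # total number of tokens
  V <- length(counts)                    # number of types

  # Frequency spectrum: f_v[k] types occur exactly i[k] times
  spectrum <- table(counts)
  i   <- as.integer(names(spectrum))
  f_v <- as.integer(spectrum)

  logN <- log(N, base = log.base)
  logV <- log(V, base = log.base)

  c(
    TTR  = V / N,
    C    = logV / logN,
    R    = V / sqrt(N),
    CTTR = V / sqrt(2 * N),
    U    = logN^2 / (logN - logV),       # infinite when V == N
    S    = log(logV, base = log.base) / log(logN, base = log.base),
    K    = 1e4 * (-1 / N + sum(f_v * (i / N)^2)),
    D    = sum(f_v * (i / N) * ((i - 1) / (N - 1))),
    Vm   = sqrt(sum(f_v * (i / N)^2) - 1 / V),   # assumes the 1/V form
    Maas = sqrt((logN - logV) / logN^2),
    lgV0 = logV / sqrt(1 - (logV / logN)^2)
  )
}

# Usage on a toy document with N = 7 tokens and V = 4 types
lexdiv_sketch(c("a", "b", "a", "c", "b", "a", "d"))
```

For this toy input, K works out to \(10^4 \times 8/49\) and D to \(8/42\). Note that U is the reciprocal of the squared Maas \(a\), so the two indices carry the same information.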

Note

This implements only the static measures of lexical diversity, not more complex measures based on windows of text such as the Mean Segmental Type-Token Ratio, the Moving-Average Type-Token Ratio (Covington & McFall, 2010), the MTLD or MTLD-MA (Moving-Average Measure of Textual Lexical Diversity) proposed by McCarthy & Jarvis (2010) or Jarvis (no year), or the HD-D version of vocd-D (see McCarthy & Jarvis, 2007). These are available from the package koRpus (Michalke, 2014).

References

Covington, M.A. & McFall, J.D. (2010). "Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR)". Journal of Quantitative Linguistics 17(2), 94--100.

Herdan, G. (1955). "A New Derivation and Interpretation of Yule's 'Characteristic' K". Zeitschrift für angewandte Mathematik und Physik 6(4), 332--334.

Maas, H.-D. (1972). "Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes". Zeitschrift für Literaturwissenschaft und Linguistik 2(8), 73--96.

McCarthy, P.M. & Jarvis, S. (2007). "vocd: A theoretical and empirical evaluation". Language Testing 24(4), 459--488.

McCarthy, P.M. & Jarvis, S. (2010). "MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment". Behavior Research Methods 42(2), 381--392.

Michalke, M. (2014). koRpus: An R Package for Text Analysis. Version 0.05-5. http://reaktanz.de/?c=hacking&s=koRpus

Simpson, E.H. (1949). "Measurement of Diversity". Nature 163, 688.

Tweedie, F.J. & Baayen, R.H. (1998). "How Variable May a Constant Be? Measures of Lexical Richness in Perspective". Computers and the Humanities 32(5), 323--352.

Examples

mydfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), verbose = FALSE)
(result <- textstat_lexdiv(mydfm, c("CTTR", "TTR", "U")))
#>        document      CTTR       TTR        U
#> 1   1981-Reagan 11.378940 0.3046595 22.99986
#> 2   1985-Reagan 11.461022 0.2998973 22.96229
#> 3     1989-Bush 10.324247 0.2819843 21.37823
#> 4  1993-Clinton  9.992155 0.3300600 22.11897
#> 5  1997-Clinton 10.373546 0.2964475 21.75021
#> 6     2001-Bush  9.844814 0.3274336 21.88049
#> 7     2005-Bush 10.792498 0.3169470 22.69528
#> 8    2009-Obama 12.222576 0.3319808 24.61202
#> 9    2013-Obama 11.546345 0.3392318 24.11639
#> 10   2017-Trump  9.493324 0.3295181 21.50726
cor(textstat_lexdiv(mydfm, "all")[,-1])
#> Warning: longer object length is not a multiple of shorter object length
#> Warning: longer object length is not a multiple of shorter object length
#> Warning: longer object length is not a multiple of shorter object length
#> Warning: longer object length is not a multiple of shorter object length
#> Warning: the standard deviation is zero
#>               TTR          C           R        CTTR          U          S  K
#> TTR    1.00000000  0.8718973  0.04870698  0.04870698  0.4574744  0.5625226 NA
#> C      0.87189735  1.0000000  0.53081068  0.53081068  0.8331756  0.8951709 NA
#> R      0.04870698  0.5308107  1.00000000  1.00000000  0.9097523  0.8520574 NA
#> CTTR   0.04870698  0.5308107  1.00000000  1.00000000  0.9097523  0.8520574 NA
#> U      0.45747444  0.8331756  0.90975232  0.90975232  1.0000000  0.9915124 NA
#> S      0.56252261  0.8951709  0.85205745  0.85205745  0.9915124  1.0000000 NA
#> K              NA         NA          NA          NA         NA         NA  1
#> D              NA         NA          NA          NA         NA         NA NA
#> Vm    -0.34891250  0.1531773  0.91502934  0.91502934  0.6719463  0.5773260 NA
#> Maas  -0.45448662 -0.8320582 -0.91102148 -0.91102148 -0.9995375 -0.9919846 NA
#> lgV0   0.23953222  0.6835092  0.98095778  0.98095778  0.9728109  0.9368443 NA
#> lgeV0  0.23953222  0.6835092  0.98095778  0.98095778  0.9728109  0.9368443 NA
#>        D         Vm       Maas       lgV0      lgeV0
#> TTR   NA -0.3489125 -0.4544866  0.2395322  0.2395322
#> C     NA  0.1531773 -0.8320582  0.6835092  0.6835092
#> R     NA  0.9150293 -0.9110215  0.9809578  0.9809578
#> CTTR  NA  0.9150293 -0.9110215  0.9809578  0.9809578
#> U     NA  0.6719463 -0.9995375  0.9728109  0.9728109
#> S     NA  0.5773260 -0.9919846  0.9368443  0.9368443
#> K     NA         NA         NA         NA         NA
#> D      1         NA         NA         NA         NA
#> Vm    NA  1.0000000 -0.6750263  0.8244108  0.8244108
#> Maas  NA -0.6750263  1.0000000 -0.9734910 -0.9734910
#> lgV0  NA  0.8244108 -0.9734910  1.0000000  1.0000000
#> lgeV0 NA  0.8244108 -0.9734910  1.0000000  1.0000000
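The correlation of exactly 1 between R and CTTR in the matrix above follows from their definitions: CTTR is R divided by \(\sqrt{2}\). A minimal base R check, using made-up counts (the N and V values below are hypothetical and not taken from the inaugural corpus):

```r
# Hypothetical token and type counts for five documents (illustration only)
N <- c(2400, 2600, 2300, 1600, 2200)
V <- c(730, 780, 650, 530, 650)

R    <- V / sqrt(N)        # Guiraud's Root TTR
CTTR <- V / sqrt(2 * N)    # Carroll's Corrected TTR

# CTTR is a constant multiple of R, so the two are perfectly correlated
all.equal(CTTR, R / sqrt(2))
cor(R, CTTR)
```

Reporting both measures therefore adds no information when documents are compared by rank or correlation.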