Calculate the lexical diversity or complexity of text(s).

textstat_lexdiv(x, measure = c("all", "TTR", "C", "R", "CTTR", "U", "S",
  "K", "D", "Vm", "Maas"), log.base = 10, remove_numbers = TRUE,
  remove_punct = TRUE, remove_symbols = TRUE, remove_hyphens = FALSE)

Arguments

x

an input dfm or tokens object

measure

a character vector defining the measure(s) to calculate

log.base

a numeric value defining the base of the logarithm (for measures using logs)

remove_numbers

logical; if TRUE remove features or tokens that consist only of numerals (the Unicode "Number" [N] class)

remove_punct

logical; if TRUE remove all features or tokens that consist only of characters in the Unicode "Punctuation" [P] class

remove_symbols

logical; if TRUE remove all features or tokens that consist only of characters in the Unicode "Symbol" [S] class

remove_hyphens

logical; if TRUE, split words that are connected by hyphens and hyphen-like characters, e.g. "self-storage" becomes the two features or tokens "self" and "storage". Default is FALSE, which preserves such words as-is, with the hyphens.
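To illustrate the effect of remove_hyphens, consider the following toy comparison (a sketch only; the exact counts depend on how the default tokenizer handles intra-word hyphens):

library(quanteda)
txt_hyph <- "The self-storage business uses self-service kiosks."
textstat_lexdiv(tokens(txt_hyph), measure = "TTR", remove_hyphens = FALSE)
textstat_lexdiv(tokens(txt_hyph), measure = "TTR", remove_hyphens = TRUE)
## with remove_hyphens = TRUE, "self-storage" and "self-service" each split in
## two, adding two tokens and a repeated type ("self"), which lowers the TTR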

Value

textstat_lexdiv returns a data.frame of documents and their lexical diversity scores.

Details

textstat_lexdiv calculates a variety of proposed indices for lexical diversity. In the following formulas, \(N\) refers to the total number of tokens, \(V\) to the number of types, and \(f_v(i, N)\) to the number of types occurring \(i\) times in a sample of length \(N\).
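For illustration, these quantities can be computed directly from a toy tokens object (a minimal sketch of the definitions, not the package's internal code):

toks <- tokens("one fish two fish red fish blue fish")
counts <- table(as.character(toks))  # frequency of each type
N <- sum(counts)     # total number of tokens: 8
V <- length(counts)  # number of types: 5
f_v <- table(counts) # frequency spectrum: f_v(i, N) for each observed i
## here f_v(1, 8) = 4 (four types occur once) and f_v(4, 8) = 1 ("fish")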

"TTR":

The ordinary Type-Token Ratio: $$TTR = \frac{V}{N}$$
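Continuing the toy example above, the TTR can be verified by hand:

V / N                            # 0.625
textstat_lexdiv(toks, "TTR")$TTR # the same value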

"C":

Herdan's C (Herdan, 1960, as cited in Tweedie & Baayen, 1998; sometimes referred to as LogTTR): $$C = \frac{\log{V}}{\log{N}}$$

"R":

Guiraud's Root TTR (Guiraud, 1954, as cited in Tweedie & Baayen, 1998): $$R = \frac{V}{\sqrt{N}}$$

"CTTR":

Carroll's Corrected TTR: $$CTTR = \frac{V}{\sqrt{2N}}$$

"U":

Dugast's Uber Index (Dugast, 1978, as cited in Tweedie & Baayen, 1998): $$U = \frac{(\log{N})^2}{\log{N} - \log{V}}$$

"S":

Summer's index: $$S = \frac{\log{\log{V}}}{\log{\log{N}}}$$
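The five indices above are all simple functions of \(N\) and \(V\). Continuing the toy example, they can be checked by hand, using base-10 logarithms to match the log.base = 10 default (C is invariant to the base; U and S are not):

log10(V) / log10(N)                 # Herdan's C
V / sqrt(N)                         # Guiraud's R
V / sqrt(2 * N)                     # Carroll's CTTR
log10(N)^2 / (log10(N) - log10(V))  # Dugast's U
log10(log10(V)) / log10(log10(N))   # Summer's S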

"K":

Yule's K (Yule, 1944, as presented in Tweedie & Baayen, 1998, Eq. 16) is calculated by: $$K = 10^4 \times \left[ -\frac{1}{N} + \sum_{i=1}^{V} f_v(i, N) \left( \frac{i}{N} \right)^2 \right] $$

"D":

Simpson's D (Simpson 1949, as presented in Tweedie & Baayen, 1998, Eq. 17) is calculated by: $$D = \sum_{i=1}^{V} f_v(i, N) \frac{i}{N} \frac{i-1}{N-1}$$

"Vm":

Herdan's \(V_m\) (Herdan 1955, as presented in Tweedie & Baayen, 1998, Eq. 18) is calculated by: $$V_m = \sqrt{ \sum_{i=1}^{V} f_v(i, N) (i/N)^2 - \frac{1}{V} }$$
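K, D, and \(V_m\) depend on the full frequency spectrum \(f_v(i, N)\) rather than on \(N\) and \(V\) alone. Continuing the toy example (again a sketch of the formulas as written above, not the package internals):

i <- as.integer(names(f_v))               # the occurrence counts observed
10^4 * (-1/N + sum(f_v * (i / N)^2))      # Yule's K
sum(f_v * (i / N) * ((i - 1) / (N - 1)))  # Simpson's D
sqrt(sum(f_v * (i / N)^2) - 1/V)          # Herdan's Vm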

"Maas":

Maas' indices (\(a\), \(\log{V_0}\) & \(\log{}_{e}{V_0}\)): $$a^2 = \frac{\log{N} - \log{V}}{(\log{N})^2}$$ $$\log{V_0} = \frac{\log{V}}{\sqrt{1 - \left(\frac{\log{V}}{\log{N}}\right)^2}}$$ The measure was derived from a formula by Mueller (1969, as cited in Maas, 1972). \(\log{}_{e}{V_0}\) is equivalent to \(\log{V_0}\), only with \(e\) as the base for the logarithms. Also calculated are \(a\) and \(\log{V_0}\) (both different from the values above) and \(V'\), as measures of relative vocabulary growth as the text progresses. To calculate these measures, the first half of the text and the full text are examined (see Maas, 1972, p. 67 ff. for details). Note: for the current dfm method there is no computation on separate halves of the text.
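For the two global Maas quantities, continuing the toy example (base-10 logarithms, matching the log.base = 10 default; replacing log10 with log gives the natural-log variant \(\log{}_{e}{V_0}\)):

(log10(N) - log10(V)) / log10(N)^2            # a^2
log10(V) / sqrt(1 - (log10(V) / log10(N))^2)  # log V0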

Note

This implements only the static measures of lexical diversity, not more complex measures based on windows of text such as the Mean Segmental Type-Token Ratio, the Moving-Average Type-Token Ratio (Covington & McFall, 2010), the MTLD or MTLD-MA (Moving-Average Measure of Textual Lexical Diversity) proposed by McCarthy & Jarvis (2010) or Jarvis (no year), or the HD-D version of vocd-D (see McCarthy & Jarvis, 2007). These are available from the package koRpus.

References

Covington, Michael A. and Joe D. McFall. 2010. "Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR)". Journal of Quantitative Linguistics 17(2): 94--100.

Herdan, Gustav. 1955. "A New Derivation and Interpretation of Yule's 'Characteristic' K." Zeitschrift für angewandte Mathematik und Physik 6(4): 332--334.

Maas, Heinz-Dieter. 1972. "Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes". Zeitschrift für Literaturwissenschaft und Linguistik 2(8): 73--96.

McCarthy, Philip M. and Scott Jarvis. 2007. "vocd: A Theoretical and Empirical Evaluation". Language Testing 24(4): 459--488.

McCarthy, Philip M. and Scott Jarvis. 2010. "MTLD, vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment". Behavior Research Methods 42(2): 381--392.

Michalke, Meik. 2014. koRpus: An R Package for Text Analysis. R package version 0.05-5. http://reaktanz.de/?c=hacking&s=koRpus

Simpson, Edward H. 1949. "Measurement of Diversity." Nature 163: 688.

Tweedie, Fiona J. and R. Harald Baayen. 1998. "How Variable May a Constant Be? Measures of Lexical Richness in Perspective". Computers and the Humanities 32(5): 323--352.

Examples

mydfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), verbose = FALSE)
(result <- textstat_lexdiv(mydfm, c("CTTR", "TTR", "U")))
#>        document     CTTR       TTR        U
#> 1   1981-Reagan 12.07549 0.3463595 24.89515
#> 2   1985-Reagan 11.99335 0.3356835 24.48583
#> 3     1989-Bush 10.99287 0.3231102 23.07188
#> 4  1993-Clinton 10.61324 0.3754693 24.12386
#> 5  1997-Clinton 10.94684 0.3333333 23.29504
#> 6     2001-Bush 10.37904 0.3689198 23.63757
#> 7     2005-Bush 11.26505 0.3500724 24.12469
#> 8    2009-Obama 12.91628 0.3736402 26.69551
#> 9    2013-Obama 11.99681 0.3709369 25.60049
#> 10   2017-Trump 10.01461 0.3728344 23.29366
cor(textstat_lexdiv(mydfm, "all")[,-1])
#>              TTR           C           R        CTTR           U           S
#> TTR    1.00000000  0.87474824 -0.06890274 -0.06890274  0.43113568  0.54093379
#> C      0.87474824  1.00000000  0.42238766  0.42238766  0.81336016  0.88065570
#> R     -0.06890274  0.42238766  1.00000000  1.00000000  0.86988854  0.80053122
#> CTTR  -0.06890274  0.42238766  1.00000000  1.00000000  0.86988854  0.80053122
#> U      0.43113568  0.81336016  0.86988854  0.86988854  1.00000000  0.99092386
#> S      0.54093379  0.88065570  0.80053122  0.80053122  0.99092386  1.00000000
#> K      0.09869723 -0.05531017 -0.31164089 -0.31164089 -0.22536889 -0.19303493
#> D      0.04556277 -0.07140815 -0.24576843 -0.24576843 -0.19170973 -0.16897667
#> Vm     0.02173116 -0.03766314 -0.13161889 -0.13161889 -0.09963151 -0.08685865
#> Maas  -0.43150628 -0.81434842 -0.86944078 -0.86944078 -0.99943116 -0.99202024
#> lgV0   0.18391592  0.63653817  0.96757261  0.96757261  0.96608008  0.92549041
#> lgeV0  0.18391592  0.63653817  0.96757261  0.96757261  0.96608008  0.92549041
#>                 K           D          Vm       Maas       lgV0      lgeV0
#> TTR    0.09869723  0.04556277  0.02173116 -0.4315063  0.1839159  0.1839159
#> C     -0.05531017 -0.07140815 -0.03766314 -0.8143484  0.6365382  0.6365382
#> R     -0.31164089 -0.24576843 -0.13161889 -0.8694408  0.9675726  0.9675726
#> CTTR  -0.31164089 -0.24576843 -0.13161889 -0.8694408  0.9675726  0.9675726
#> U     -0.22536889 -0.19170973 -0.09963151 -0.9994312  0.9660801  0.9660801
#> S     -0.19303493 -0.16897667 -0.08685865 -0.9920202  0.9254904  0.9254904
#> K      1.00000000  0.99638586  0.98036518  0.2243634 -0.2745434 -0.2745434
#> D      0.99638586  1.00000000  0.99283358  0.1906940 -0.2227132 -0.2227132
#> Vm     0.98036518  0.99283358  1.00000000  0.0989224 -0.1157789 -0.1157789
#> Maas   0.22436335  0.19069398  0.09892240  1.0000000 -0.9658324 -0.9658324
#> lgV0  -0.27454341 -0.22271316 -0.11577890 -0.9658324  1.0000000  1.0000000
#> lgeV0 -0.27454341 -0.22271316 -0.11577890 -0.9658324  1.0000000  1.0000000
txt <- c("Anyway, like I was sayin', shrimp is the fruit of the sea. You can barbecue it, boil it, broil it, bake it, saute it.", "There's shrimp-kabobs, shrimp creole, shrimp gumbo. Pan fried, deep fried, stir-fried. There's pineapple shrimp, lemon shrimp, coconut shrimp, pepper shrimp, shrimp soup, shrimp stew, shrimp salad, shrimp and potatoes, shrimp burger, shrimp sandwich.") tokens(txt) %>% textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
#>   document       TTR     CTTR         K
#> 1    text1 0.7916667 2.742414  798.6111
#> 2    text2 0.6060606 2.461830 1551.8825
dfm(txt) %>% textstat_lexdiv(measure = c("TTR", "CTTR", "K"))
#>   document       TTR     CTTR         K
#> 1    text1 0.7916667 2.742414  798.6111
#> 2    text2 0.6060606 2.461830 1551.8825