Calculate the lexical diversity or complexity of text(s).

textstat_lexdiv(x, measure = c("all", "TTR", "C", "R", "CTTR", "U", "S",
  "Maas"), log.base = 10, drop = TRUE, ...)

Arguments

x

an input object, such as a document-feature matrix object

measure

a character vector defining the measure to calculate.

log.base

a numeric value defining the base of the logarithm (for measures using logs)

drop

if TRUE, the result is returned as a numeric vector if only a single measure is requested; otherwise, a data.frame is returned with each column consisting of a requested measure.

...

not used

Value

a data.frame or vector of lexical diversity statistics, each row or vector element corresponding to an input document

Details

textstat_lexdiv calculates a variety of proposed indices for lexical diversity. In the following formulae, \(N\) refers to the total number of tokens, and \(V\) to the number of types:

"TTR":

The ordinary Type-Token Ratio: $$TTR = \frac{V}{N}$$

"C":

Herdan's C (Herdan, 1960, as cited in Tweedie & Baayen, 1998; sometimes referred to as LogTTR): $$C = \frac{\log{V}}{\log{N}}$$

"R":

Guiraud's Root TTR (Guiraud, 1954, as cited in Tweedie & Baayen, 1998): $$R = \frac{V}{\sqrt{N}}$$

"CTTR":

Carroll's Corrected TTR: $$CTTR = \frac{V}{\sqrt{2N}}$$

"U":

Dugast's Uber Index (Dugast, 1978, as cited in Tweedie & Baayen, 1998): $$U = \frac{(\log{N})^2}{\log{N} - \log{V}}$$

"S":

Summer's index: $$S = \frac{\log{\log{V}}}{\log{\log{N}}}$$

"K":

Yule's K (Yule, 1944, as cited in Tweedie & Baayen, 1998) is calculated by: $$K = 10^4 \times \frac{(\sum_{X}{{f_X}X^2}) - N}{N^2}$$ where \(N\) is the number of tokens, \(X\) ranges over the distinct frequencies with which types occur, and \(f_X\) is the number of types that occur exactly \(X\) times.

"Maas":

Maas' indices (\(a\), \(\log{V_0}\) & \(\log{}_{e}{V_0}\)): $$a^2 = \frac{\log{N} - \log{V}}{(\log{N})^2}$$ $$\log{V_0} = \frac{\log{V}}{\sqrt{1 - \left(\frac{\log{V}}{\log{N}}\right)^2}}$$ The measure was derived from a formula by Mueller (1969, as cited in Maas, 1972). \(\log{}_{e}{V_0}\) is equivalent to \(\log{V_0}\), only with \(e\) as the base for the logarithms. Maas also describes \(a\), \(\log{V_0}\) (both different from the indices above) and \(V'\) as measures of relative vocabulary growth as the text progresses, calculated from the first half of the text and the full text (see Maas, 1972, p. 67 ff. for details). Note: the current method (for a dfm) does not perform any computation on separate halves of the text.
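
The formulae in this section can be checked by hand with a few lines of base R. The following is only a minimal sketch working on a plain character vector of tokens: the helper name lexdiv_sketch and its element names are hypothetical, it is not the package's implementation, and it assumes the squared terms in the Maas formulae apply to the whole logarithm or ratio.

```r
# Hypothetical helper, not part of the package: static indices from a token vector.
lexdiv_sketch <- function(tokens, log.base = 10) {
  N <- length(tokens)                 # total number of tokens
  type_freq <- table(tokens)          # frequency of each type
  V <- length(type_freq)              # number of types
  logN <- log(N, base = log.base)
  logV <- log(V, base = log.base)

  # Frequency spectrum for Yule's K: f_X types occur exactly X times
  spectrum <- table(as.vector(type_freq))
  X   <- as.numeric(names(spectrum))
  f_X <- as.numeric(spectrum)

  c(
    TTR   = V / N,
    C     = logV / logN,
    R     = V / sqrt(N),
    CTTR  = V / sqrt(2 * N),
    U     = logN^2 / (logN - logV),   # Inf when every token is a distinct type
    S     = log(logV, base = log.base) / log(logN, base = log.base),
    K     = 1e4 * (sum(f_X * X^2) - N) / N^2,
    a     = sqrt((logN - logV) / logN^2),          # Maas' a
    logV0 = logV / sqrt(1 - (logV / logN)^2)       # Maas' log V0
  )
}

# Toy input: 20 tokens, 13 types
toks <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "dog", "sat",
          "on", "the", "rug", "by", "the", "door", "of", "the", "old", "house")
res <- lexdiv_sketch(toks)
round(res, 3)
```

For this toy input, TTR is 13/20 = 0.65 and K is 10^4 × (54 − 20)/400 = 850. Two properties follow directly from the formulae: C does not depend on log.base, whereas U does, and Maas' \(a^2\) is the reciprocal of U.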

Note

This implements only the static measures of lexical diversity, not more complex measures based on windows of text such as the Mean Segmental Type-Token Ratio, the Moving-Average Type-Token Ratio (Covington & McFall, 2010), the MTLD or MTLD-MA (Moving-Average Measure of Textual Lexical Diversity) proposed by McCarthy & Jarvis (2010) or Jarvis (no year), or the HD-D version of vocd-D (see McCarthy & Jarvis, 2007). These are available from the package koRpus.
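
To show how a window-based measure differs from a static one, here is a minimal base R sketch of a moving-average TTR: the mean of the TTR over every window of a fixed number of consecutive tokens. The function name mattr_sketch and the default window size are assumptions made for illustration; this is neither the koRpus implementation nor part of this package.

```r
# Hypothetical helper: mean TTR over all windows of `window` consecutive tokens.
mattr_sketch <- function(tokens, window = 10) {
  N <- length(tokens)
  stopifnot(window >= 1, window <= N)
  starts <- seq_len(N - window + 1)   # first position of each window
  ttrs <- vapply(starts, function(i) {
    length(unique(tokens[i:(i + window - 1)])) / window
  }, numeric(1))
  mean(ttrs)
}

toks <- c("the", "cat", "sat", "on", "the", "mat", "and", "the", "dog", "sat",
          "on", "the", "rug", "by", "the", "door", "of", "the", "old", "house")
mattr_sketch(toks, window = 5)
```

With the window set to the full text length, the result reduces to the ordinary TTR, which is why a static measure needs only token and type counts, whereas a windowed measure needs the tokens in their original order.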

References

Covington, M.A. & McFall, J.D. (2010). Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR). Journal of Quantitative Linguistics, 17(2), 94–100.

Maas, H.-D. (1972). Über den Zusammenhang zwischen Wortschatzumfang und Länge eines Textes. Zeitschrift für Literaturwissenschaft und Linguistik, 2(8), 73–96.

McCarthy, P.M. & Jarvis, S. (2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4), 459–488.

McCarthy, P.M. & Jarvis, S. (2010). MTLD, vocd-D, and HD-D: A validation study of sophisticated approaches to lexical diversity assessment. Behavior Research Methods, 42(2), 381–392.

Michalke, M. (2014). koRpus: An R Package for Text Analysis. Version 0.05-5. http://reaktanz.de/?c=hacking&s=koRpus

Tweedie, F.J. & Baayen, R.H. (1998). How Variable May a Constant Be? Measures of Lexical Richness in Perspective. Computers and the Humanities, 32(5), 323–352.

Examples

mydfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980), verbose = FALSE)
(results <- textstat_lexdiv(mydfm, c("CTTR", "TTR", "U")))
cor(textstat_lexdiv(mydfm, "all"))
# with different settings of drop
textstat_lexdiv(mydfm, "TTR", drop = TRUE)
textstat_lexdiv(mydfm, "TTR", drop = FALSE)