Calculate "keyness", a score for features that occur at different rates across categories. Here, the categories are defined by a "target" document index in the dfm, with the reference group consisting of all other documents.

textstat_keyness(
  x,
  target = 1L,
  measure = c("chi2", "exact", "lr", "pmi"),
  sort = TRUE,
  correction = c("default", "yates", "williams", "none"),
  ...
)

Arguments

x

a dfm containing the features to be examined for keyness

target

the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference

measure

(signed) association measure to be used for computing keyness. Currently available: "chi2"; "exact" (Fisher's exact test); "lr" for the likelihood ratio; "pmi" for pointwise mutual information. Note that the "exact" test is very computationally intensive and therefore much slower than the other methods.

sort

logical; if TRUE sort features scored in descending order of the measure, otherwise leave in original feature order

correction

if "default", the Yates correction is applied to "chi2", the Williams correction is applied to "lr", and no correction is applied for the "exact" and "pmi" measures. Specifying a value other than "default" overrides these defaults, for instance to apply the Williams correction to the chi2 measure. Specifying a correction for the "exact" and "pmi" measures has no effect and produces a warning.

...

not used
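As a rough illustration of how target partitions the documents, the base R sketch below splits a small count matrix into target and reference frequencies. The matrix counts and its values are hypothetical, and the code is not quanteda's implementation; it only mirrors the description above, in which the frequencies of all non-target documents are combined.

```r
# Hypothetical document-feature counts (rows = documents, columns = features)
counts <- matrix(
  c(3, 0, 2,
    1, 4, 0,
    2, 2, 5),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("a", "b", "c"))
)

# A logical target index selecting the first document
target <- c(TRUE, FALSE, FALSE)

# Feature frequencies in the target document(s)
n_target <- colSums(counts[target, , drop = FALSE])

# Feature frequencies combined over all remaining documents
n_reference <- colSums(counts[!target, , drop = FALSE])

print(rbind(n_target, n_reference))
```

A numeric index (1) or a document name ("doc1") would select the same row in this toy matrix.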

Value

a data.frame of computed statistics and associated p-values, with one row per feature scored, together with the number of occurrences of that feature in the target and reference groups. For measure = "chi2" the statistic is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for measure = "exact" it is the estimate of the odds ratio; for measure = "lr" it is the likelihood ratio \(G^2\) statistic; for measure = "pmi" it is the pointwise mutual information statistic.

textstat_keyness returns a data.frame of features and their keyness scores and frequency counts.
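The signed chi-squared value described above can be sketched for a single feature in base R. All counts below are hypothetical, and this is an illustration of the idea (a 2 x 2 table with the Yates correction, signed by comparing observed and expected target counts), not the code that textstat_keyness runs internally.

```r
# Hypothetical counts for one feature, chosen for illustration only
n_target        <- 30    # occurrences of the feature in the target
n_reference     <- 20    # occurrences in all other documents combined
total_target    <- 1000  # total feature count in the target
total_reference <- 4000  # total feature count in the reference

# 2 x 2 table: this feature vs. all other features, target vs. reference
tab <- matrix(
  c(n_target,                n_reference,
    total_target - n_target, total_reference - n_reference),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("feature", "other"), c("target", "reference"))
)

# Chi-squared test with the Yates continuity correction
test <- chisq.test(tab, correct = TRUE)

# Expected count of the feature in the target under independence
expected_target <- test$expected["feature", "target"]

# Positive when the feature is over-represented in the target, negative otherwise
keyness_chi2 <- unname(test$statistic) * ifelse(n_target > expected_target, 1, -1)

print(keyness_chi2)
print(test$p.value)
```

In this toy table the expected target count is 10, so the observed count of 30 yields a positive score.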

References

Bondi, M. & Scott, M. (eds) (2010). Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.

Stubbs, M. (2010). Three Concepts of Keywords. In Keyness in Texts, Bondi, M. & Scott, M. (eds): 1–42. Amsterdam, Philadelphia: John Benjamins.

Scott, M. & Tribble, C. (2006). Textual Patterns: Keyword and Corpus Analysis in Language Education. Amsterdam: Benjamins: 55.

Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1): 61–74.

Examples

# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
dfmat1 <- dfm(data_corpus_inaugural, groups = period)
head(dfmat1) # make sure 'post-war' is in the first row
#> Document-feature matrix of: 2 documents, 9,360 features (34.9% sparse).
#>           features
#> docs       fellow-citizens   of  the senate  and house representatives   :
#>   post-war               0 1437 1988      2 1456     3               0 105
#>   pre-war               39 5666 8094     13 3854     8              19  29
#>           features
#> docs       among vicissitudes
#>   post-war    22            0
#>   pre-war     86            5
#> [ reached max_nfeat ... 9,350 more features ]
head(tstat1 <- textstat_keyness(dfmat1), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 707.2185 0      960         779
#> 2          . 230.9059 0     1804        3141
#> 3          - 226.2633 0      239         155
#> 4        our 190.4594 0      874        1307
#> 5         us 187.4062 0      262         216
#> 6          : 178.9971 0      105          29
#> 7    america 177.5687 0      130          54
#> 8      world 176.2784 0      188         123
#> 9  americans 151.2940 0       67           7
#> 10       new 142.2859 0      150          97
tail(tstat1, 10)
#>           feature       chi2            p n_target n_reference
#> 9351         upon  -51.51813 7.094325e-13       39         332
#> 9352           it  -51.84279 6.012968e-13      257        1132
#> 9353       public  -55.70000 8.437695e-14       11         213
#> 9354       states  -58.74394 1.798561e-14       28         305
#> 9355 constitution  -60.88287 6.106227e-15        6         200
#> 9356           be  -71.19896 0.000000e+00      257        1224
#> 9357       should  -82.68107 0.000000e+00       15         309
#> 9358        which -159.02346 0.000000e+00       95         911
#> 9359           of -175.47661 0.000000e+00     1437        5666
#> 9360          the -294.14547 0.000000e+00     1988        8094
# compare pre- v. post-war terms using logical vector
dfmat2 <- dfm(data_corpus_inaugural)
head(textstat_keyness(dfmat2, docvars(data_corpus_inaugural, "Year") >= 1945), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 707.2185 0      960         779
#> 2          . 230.9059 0     1804        3141
#> 3          - 226.2633 0      239         155
#> 4        our 190.4594 0      874        1307
#> 5         us 187.4062 0      262         216
#> 6          : 178.9971 0      105          29
#> 7    america 177.5687 0      130          54
#> 8      world 176.2784 0      188         123
#> 9  americans 151.2940 0       67           7
#> 10       new 142.2859 0      150          97
# compare Trump 2017 to other post-war presidents
dfmat3 <- dfm(corpus_subset(data_corpus_inaugural, period == "post-war"))
head(textstat_keyness(dfmat3, target = "2017-Trump"), 10)
#>      feature     chi2            p n_target n_reference
#> 1  protected 76.20152 0.000000e+00        5           1
#> 2       will 50.88954 9.771073e-13       40         299
#> 3      while 47.92566 4.426903e-12        6           7
#> 4      obama 47.58369 5.270562e-12        3           0
#> 5      we've 47.58369 5.270562e-12        3           0
#> 6    america 31.15097 2.387204e-08       18         112
#> 7      again 27.59193 1.498024e-07        9          33
#> 8   everyone 27.50081 1.570283e-07        4           5
#> 9       your 26.45161 2.702225e-07       11          50
#> 10    breath 25.39798 4.664057e-07        2           0
# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(dfmat3), measure = "lr", target = "2017-Trump"), 10)
#>      feature        G2            p n_target n_reference
#> 1       will 24.516313 7.368333e-07       41         317
#> 2    america 13.996558 1.831456e-04       19         130
#> 3       your 10.406851 1.255487e-03       12          68
#> 4      again  9.734223 1.808684e-03       10          51
#> 5      while  9.486598 2.069782e-03        7          25
#> 6   american  8.850857 2.929513e-03       12          76
#> 7  protected  8.804622 3.004684e-03        6          19
#> 8       back  6.836773 8.930003e-03        7          34
#> 9        you  6.685605 9.719453e-03       14         121
#> 10   country  5.801129 1.601588e-02       10          72