Calculate "keyness", a score for features that occur differentially across different categories. Here, the categories are defined by reference to a "target" document index in the dfm, with the reference group consisting of all other documents.
```r
textstat_keyness(
  x,
  target = 1L,
  measure = c("chi2", "exact", "lr", "pmi"),
  sort = TRUE,
  correction = c("default", "yates", "williams", "none"),
  ...
)
```
| `x` | a dfm containing the features to be examined for keyness |
|---|---|
| `target` | the document index (numeric, character, or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference |
| `measure` | (signed) association measure to be used for computing keyness. Currently available: `"chi2"` (chi-squared), `"exact"` (Fisher's exact test, reported as an odds ratio), `"lr"` (likelihood ratio \(G^2\)), and `"pmi"` (pointwise mutual information) |
| `sort` | logical; if `TRUE`, sort the scored features in decreasing order of keyness, otherwise keep the original feature order |
| `correction` | correction to apply to the test statistic: with `"default"`, the Yates continuity correction is applied for `"chi2"` and the Williams correction for `"lr"`, while no correction is applied for `"exact"` and `"pmi"`; specifying `"yates"`, `"williams"`, or `"none"` overrides this default |
| `...` | not used |
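To show how these arguments fit together, here is a minimal sketch using a small made-up dfm; the document names and texts are invented for illustration, and the call simply overrides a few defaults (a named target, the likelihood ratio measure, and no correction).

```r
library("quanteda")

# toy dfm with named documents (illustrative data only)
dfmat <- dfm(tokens(c(doc1 = "a b b c c c", doc2 = "a a a b c")))

# keyness of "doc2" against all other documents, using the likelihood
# ratio measure and overriding the default Williams correction
textstat_keyness(dfmat, target = "doc2", measure = "lr", correction = "none")
```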
`textstat_keyness` returns a data.frame of features and their keyness scores and frequency counts: each row is named by a scored feature and contains the computed statistic, its associated p-value, and the number of occurrences in the target and reference groups. For `measure = "chi2"` the statistic is the chi-squared value, signed positively if the observed count in the target exceeds its expected value; for `measure = "exact"` it is the estimate of the odds ratio; for `measure = "lr"` it is the likelihood ratio \(G^2\) statistic; and for `measure = "pmi"` it is the pointwise mutual information statistic.
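As a rough illustration of what the signed chi-squared score summarizes, the sketch below builds the 2x2 target/reference contingency table for a single feature by hand and computes an uncorrected chi-squared statistic. The choice of feature and of the inaugural corpus are assumptions made for illustration; because `textstat_keyness()` applies the default continuity correction for `"chi2"`, its reported value will differ slightly from this uncorrected one.

```r
library("quanteda")

# keyness of the 2017 address against all other inaugural speeches
dfmat <- dfm(tokens(data_corpus_inaugural))
tstat <- textstat_keyness(dfmat, target = "2017-Trump")

# hand-built 2x2 table: (this word vs. all other words) x (target vs. reference)
feat <- "america"   # illustrative choice of feature
n_target_feat <- sum(dfmat["2017-Trump", feat])
n_target_rest <- sum(dfmat["2017-Trump", ]) - n_target_feat
n_ref_feat    <- sum(dfmat[docnames(dfmat) != "2017-Trump", feat])
n_ref_rest    <- sum(dfmat[docnames(dfmat) != "2017-Trump", ]) - n_ref_feat
tbl <- matrix(c(n_target_feat, n_target_rest, n_ref_feat, n_ref_rest), nrow = 2)

# uncorrected chi-squared, signed positively when the observed target count
# exceeds its expected value
res <- chisq.test(tbl, correct = FALSE)
unname(sign(n_target_feat - res$expected[1, 1]) * res$statistic)
```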
Bondi, M. & Scott, M. (eds) (2010). Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.
Stubbs, M. (2010). Three Concepts of Keywords. In Keyness in Texts, Bondi, M. & Scott, M. (eds): 1--42. Amsterdam, Philadelphia: John Benjamins.
Scott, M. & Tribble, C. (2006). Textual Patterns: Keyword and Corpus Analysis in Language Education. Amsterdam: Benjamins: 55.
Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1): 61--74.
```r
# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
dfmat1 <- dfm(data_corpus_inaugural, groups = period)
head(dfmat1) # make sure 'post-war' is in the first row
#> Document-feature matrix of: 2 documents, 9,360 features (34.9% sparse).
#>           features
#> docs       fellow-citizens   of  the senate  and house representatives   :
#>   post-war               0 1437 1988      2 1456     3               0 105
#>   pre-war               39 5666 8094     13 3854     8              19  29
#>           features
#> docs       among vicissitudes
#>   post-war    22            0
#>   pre-war     86            5
#> [ reached max_nfeat ... 9,350 more features ]

head(textstat_keyness(dfmat1), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 707.2185 0      960         779
#> 2          . 230.9059 0     1804        3141
#> 3          - 226.2633 0      239         155
#> 4        our 190.4594 0      874        1307
#> 5         us 187.4062 0      262         216
#> 6          : 178.9971 0      105          29
#> 7    america 177.5687 0      130          54
#> 8      world 176.2784 0      188         123
#> 9  americans 151.2940 0       67           7
#> 10       new 142.2859 0      150          97

tail(textstat_keyness(dfmat1), 10)
#>           feature       chi2            p n_target n_reference
#> 9351         upon  -51.51813 7.094325e-13       39         332
#> 9352           it  -51.84279 6.012968e-13      257        1132
#> 9353       public  -55.70000 8.437695e-14       11         213
#> 9354       states  -58.74394 1.798561e-14       28         305
#> 9355 constitution  -60.88287 6.106227e-15        6         200
#> 9356           be  -71.19896 0.000000e+00      257        1224
#> 9357       should  -82.68107 0.000000e+00       15         309
#> 9358        which -159.02346 0.000000e+00       95         911
#> 9359           of -175.47661 0.000000e+00     1437        5666
#> 9360          the -294.14547 0.000000e+00     1988        8094

# compare pre- v. post-war terms using a logical vector as the target
dfmat2 <- dfm(data_corpus_inaugural)
head(textstat_keyness(dfmat2, docvars(data_corpus_inaugural, "Year") >= 1945), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 707.2185 0      960         779
#> 2          . 230.9059 0     1804        3141
#> 3          - 226.2633 0      239         155
#> 4        our 190.4594 0      874        1307
#> 5         us 187.4062 0      262         216
#> 6          : 178.9971 0      105          29
#> 7    america 177.5687 0      130          54
#> 8      world 176.2784 0      188         123
#> 9  americans 151.2940 0       67           7
#> 10       new 142.2859 0      150          97

# compare Trump 2017 to other post-war presidents
dfmat3 <- dfm(corpus_subset(data_corpus_inaugural, period == "post-war"))
head(textstat_keyness(dfmat3, target = "2017-Trump"), 10)
#>      feature     chi2            p n_target n_reference
#> 1  protected 76.20152 0.000000e+00        5           1
#> 2       will 50.88954 9.771073e-13       40         299
#> 3      while 47.92566 4.426903e-12        6           7
#> 4      obama 47.58369 5.270562e-12        3           0
#> 5      we've 47.58369 5.270562e-12        3           0
#> 6    america 31.15097 2.387204e-08       18         112
#> 7      again 27.59193 1.498024e-07        9          33
#> 8   everyone 27.50081 1.570283e-07        4           5
#> 9       your 26.45161 2.702225e-07       11          50
#> 10    breath 25.39798 4.664057e-07        2           0

# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(dfmat3), measure = "lr", target = "2017-Trump"), 10)
#>      feature        G2            p n_target n_reference
#> 1       will 24.516313 7.368333e-07       41         317
#> 2    america 13.996558 1.831456e-04       19         130
#> 3       your 10.406851 1.255487e-03       12          68
#> 4      again  9.734223 1.808684e-03       10          51
#> 5      while  9.486598 2.069782e-03        7          25
#> 6   american  8.850857 2.929513e-03       12          76
#> 7  protected  8.804622 3.004684e-03        6          19
#> 8       back  6.836773 8.930003e-03        7          34
#> 9        you  6.685605 9.719453e-03       14         121
#> 10   country  5.801129 1.601588e-02       10          72
```
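To visualize a keyness result rather than reading the table, the companion plotting function `textplot_keyness()` accepts the data.frame returned here; depending on your quanteda version it may live in the `quanteda.textplots` package. A minimal sketch, reusing the post-war comparison from the examples above:

```r
library("quanteda")

period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
dfmat3 <- dfm(corpus_subset(data_corpus_inaugural, period == "post-war"))
tstat <- textstat_keyness(dfmat3, target = "2017-Trump")

# bar plot of the most key features for the target and the reference group
# (on quanteda >= 3, load quanteda.textplots first)
textplot_keyness(tstat, n = 20)
```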