Calculate "keyness", a score for features that occur differentially across different categories. Here, the categories are defined by reference to a "target" document index in the dfm, with the reference group consisting of all other documents.

textstat_keyness(x, target = 1L, measure = c("chi2", "exact", "lr",
  "pmi"), sort = TRUE, correction = c("default", "yates", "williams",
  "none"))

Arguments

x

a dfm containing the features to be examined for keyness

target

the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference

measure

(signed) association measure to be used for computing keyness. Currently available: "chi2"; "exact" (Fisher's exact test); "lr" for the likelihood ratio; "pmi" for pointwise mutual information.

sort

logical; if TRUE sort features scored in descending order of the measure, otherwise leave in original feature order

correction

if "default", the Yates correction is applied to "chi2"; the Williams correction is applied to "lr"; and no correction is applied for the "exact" and "pmi" measures. Specifying a value other than the default overrides these defaults, for instance to apply the Williams correction to the chi2 measure. Specifying a correction for the "exact" and "pmi" measures has no effect and produces a warning.
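To make the "chi2" measure and its default Yates correction more concrete, the following is a minimal base-R sketch of a signed chi-squared score for a single feature. It is an illustration under stated assumptions, not the internal code of textstat_keyness: the helper name signed_chi2, its arguments, and all counts are hypothetical, and the function's own rules for applying the correction may differ.

```r
# Illustrative sketch only: NOT the internal code of textstat_keyness().
# The helper name and all counts below are hypothetical.
signed_chi2 <- function(n_target, n_reference,
                        total_target, total_reference, yates = TRUE) {
  # 2 x 2 table: rows = target / reference,
  # columns = the feature / all other tokens
  tab <- matrix(c(n_target,    total_target    - n_target,
                  n_reference, total_reference - n_reference),
                nrow = 2, byrow = TRUE)
  # chisq.test() applies the Yates continuity correction to 2 x 2 tables
  # when correct = TRUE
  test <- suppressWarnings(chisq.test(tab, correct = yates))
  # sign is positive when the target count exceeds its expected value
  direction <- if (tab[1, 1] > test$expected[1, 1]) 1 else -1
  c(chi2 = direction * unname(test$statistic), p = test$p.value)
}

# feature over-represented in the target (hypothetical counts)
over  <- signed_chi2(30, 20, 1000, 2000)
# same table with the roles of target and reference swapped
under <- signed_chi2(20, 30, 2000, 1000)
```

Swapping target and reference leaves the magnitude of the statistic unchanged and flips only its sign, which is what makes the signed score usable for ranking features at both ends of the sorted output.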

Value

a data.frame of computed statistics and associated p-values, with one row per feature scored, together with the number of occurrences of that feature in the target and reference groups. For measure = "chi2" the statistic is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for measure = "exact" it is the estimate of the odds ratio; for measure = "lr" it is the likelihood ratio \(G^2\) statistic; for measure = "pmi" it is the pointwise mutual information statistic.

textstat_keyness returns a data.frame of features and their keyness scores and frequency counts.
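As a rough illustration of the other statistics named above, the base-R sketch below computes an uncorrected \(G^2\), a pointwise mutual information value, and a Fisher odds-ratio estimate for one feature from a 2 x 2 table. The helper keyness_sketch, its arguments, and the counts are hypothetical. PMI is taken here as the log of the observed over the expected target count, which is one common definition and an assumption about what the function reports. No Williams correction is applied.

```r
# Illustrative sketch only: NOT the internal code of textstat_keyness().
# The helper name and all counts below are hypothetical.
keyness_sketch <- function(n_target, n_reference,
                           total_target, total_reference) {
  # 2 x 2 table: rows = target / reference,
  # columns = the feature / all other tokens
  observed <- matrix(c(n_target,    total_target    - n_target,
                       n_reference, total_reference - n_reference),
                     nrow = 2, byrow = TRUE)
  # expected counts under independence of rows and columns
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  # G2 = 2 * sum(O * log(O / E)); empty cells contribute zero
  terms <- ifelse(observed > 0, observed * log(observed / expected), 0)
  g2 <- 2 * sum(terms)
  # sign is positive when the target count exceeds its expected value
  direction <- if (observed[1, 1] > expected[1, 1]) 1 else -1
  list(G2 = direction * g2,
       p = pchisq(g2, df = 1, lower.tail = FALSE),
       pmi = log(observed[1, 1] / expected[1, 1]),
       odds_ratio = unname(fisher.test(observed)$estimate))
}

# feature over-represented in the target (hypothetical counts)
res  <- keyness_sketch(30, 20, 1000, 2000)
# feature occurring at the same rate in both groups
flat <- keyness_sketch(10, 20, 1000, 2000)
```

When a feature occurs at the same rate in both groups, the observed and expected counts coincide, so both \(G^2\) and PMI are zero.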

References

Bondi, Marina and Mike Scott, eds. 2010. Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.

Stubbs, Michael. 2010. "Three Concepts of Keywords". In Keyness in Texts, Marina Bondi and Mike Scott, eds: 1–42. Amsterdam, Philadelphia: John Benjamins.

Scott, Mike and Christopher Tribble. 2006. Textual Patterns: keyword and corpus analysis in language education. Amsterdam: Benjamins: 55.

Dunning, Ted. 1993. "Accurate Methods for the Statistics of Surprise and Coincidence." Computational Linguistics 19(1): 61–74.

Examples

# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
mydfm <- dfm(data_corpus_inaugural, groups = period)
head(mydfm) # make sure 'post-war' is in the first row
#> Document-feature matrix of: 2 documents, 9,357 features (34.9% sparse).
head(result <- textstat_keyness(mydfm), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 702.0019 0      960         779
#> 2          - 395.1361 0      450         312
#> 3          . 226.6654 0     1804        3141
#> 4        our 187.8329 0      874        1307
#> 5         us 186.0132 0      262         216
#> 6          : 178.1283 0      105          29
#> 7    america 176.6114 0      130          54
#> 8      world 175.1354 0      188         123
#> 9  americans 150.6480 0       67           7
#> 10       new 141.3690 0      150          97
tail(result, 10)
#>           feature       chi2            p n_target n_reference
#> 9348         upon  -51.91189 5.804246e-13       39         332
#> 9349           it  -52.69911 3.888001e-13      257        1132
#> 9350       public  -55.99225 7.271961e-14       11         213
#> 9351       states  -59.13009 1.476597e-14       28         305
#> 9352 constitution  -61.16666 5.218048e-15        6         200
#> 9353           be  -72.21927 0.000000e+00      257        1224
#> 9354       should  -83.10741 0.000000e+00       15         309
#> 9355        which -160.14560 0.000000e+00       95         911
#> 9356           of -179.17847 0.000000e+00     1437        5666
#> 9357          the -299.85716 0.000000e+00     1988        8094

# compare pre- v. post-war terms using logical vector
mydfm2 <- dfm(data_corpus_inaugural)
head(textstat_keyness(mydfm2, docvars(data_corpus_inaugural, "Year") >= 1945), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 702.0019 0      960         779
#> 2          - 395.1361 0      450         312
#> 3          . 226.6654 0     1804        3141
#> 4        our 187.8329 0      874        1307
#> 5         us 186.0132 0      262         216
#> 6          : 178.1283 0      105          29
#> 7    america 176.6114 0      130          54
#> 8      world 175.1354 0      188         123
#> 9  americans 150.6480 0       67           7
#> 10       new 141.3690 0      150          97
# compare Trump 2017 to other post-war presidents
pwdfm <- dfm(corpus_subset(data_corpus_inaugural, period == "post-war"))
head(textstat_keyness(pwdfm, target = "2017-Trump"), 10)
#>         feature     chi2            p n_target n_reference
#> 1     protected 76.64466 0.000000e+00        5           1
#> 2          will 51.44795 7.351897e-13       40         299
#> 3         while 48.23022 3.790079e-12        6           7
#> 4         obama 47.85727 4.584000e-12        3           0
#> 5         we've 47.85727 4.584000e-12        3           0
#> 6       america 31.45537 2.040775e-08       18         112
#> 7         again 27.81145 1.337322e-07        9          33
#> 8      everyone 27.67876 1.432269e-07        4           5
#> 9          your 26.67898 2.402201e-07       11          50
#> 10 transferring 25.54569 4.320292e-07        2           0
# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(pwdfm), measure = "lr", target = "2017-Trump"), 10)
#>      feature        G2            p n_target n_reference
#> 1       will 24.604106 7.040156e-07       41         317
#> 2    america 14.040255 1.789387e-04       19         130
#> 3       your 10.435140 1.236402e-03       12          68
#> 4      again  9.758516 1.784939e-03       10          51
#> 5      while  9.504990 2.049139e-03        7          25
#> 6   american  8.877690 2.886766e-03       12          76
#> 7  protected  8.820562 2.978550e-03        6          19
#> 8       back  6.853526 8.846653e-03        7          34
#> 9        you  6.713202 9.570175e-03       14         121
#> 10   country  5.821599 1.583055e-02       10          72