Calculate keyness statistics — textstat_keyness

Calculate "keyness", a score for features that occur at different rates across categories. Here, the two categories are the "target", identified by a document index in the dfm, and a reference group consisting of all other documents.
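As a rough illustration of the target/reference split described above, the following base R sketch combines the non-target rows of a small count matrix. The matrix, its document names, and its feature names are hypothetical; this is not the package's own code, only the idea of pooling "all other documents".

```r
# Toy document-feature counts (hypothetical data): 3 documents, 4 features
counts <- matrix(
  c(4, 0, 1, 5,
    1, 3, 2, 4,
    0, 2, 3, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("we", "shall", "of", "the"))
)

target <- 1L  # index of the target document

# Feature counts in the target document
n_target <- counts[target, ]

# Feature counts summed over all other documents (the reference group)
n_reference <- colSums(counts[-target, , drop = FALSE])

rbind(n_target, n_reference)
```

Each column of the result gives the two counts from which a per-feature keyness score can be computed.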

textstat_keyness(
  x,
  target = 1L,
  measure = c("chi2", "exact", "lr", "pmi"),
  sort = TRUE,
  correction = c("default", "yates", "williams", "none"),
  ...
)

Arguments

x	a dfm containing the features to be examined for keyness
target	the document index (numeric, character or logical) identifying the document forming the "target" for computing keyness; all other documents' feature frequencies will be combined for use as a reference
measure	(signed) association measure to be used for computing keyness. Currently available: `"chi2"`; `"exact"` (Fisher's exact test); `"lr"` for the likelihood ratio; `"pmi"` for pointwise mutual information. Note that the "exact" test is very computationally intensive and therefore much slower than the other methods.
sort	logical; if `TRUE` sort features scored in descending order of the measure, otherwise leave in original feature order
correction	if `"default"`, the Yates correction is applied to `"chi2"`, the Williams correction is applied to `"lr"`, and no correction is applied for the `"exact"` and `"pmi"` measures. Specifying a value other than the default overrides these defaults, for instance to apply the Williams correction to the `"chi2"` measure. Specifying a correction for the `"exact"` and `"pmi"` measures has no effect and produces a warning.
...	not used
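To make the `"chi2"` measure and the Yates correction more concrete, here is a minimal base R sketch for a single feature, using `chisq.test()` on a 2x2 table. The counts are hypothetical, and whether the package computes its statistic in exactly this way is an assumption; the sketch only shows a Yates-corrected chi-squared value given a sign by comparing the observed target count with its expected count.

```r
# 2x2 table for one hypothetical feature:
# rows = target / reference, columns = this feature / all other features
tab <- matrix(c(30,  970,
                20, 1980),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("target", "reference"), c("feature", "other")))

# correct = TRUE applies Yates' continuity correction to a 2x2 table
test <- chisq.test(tab, correct = TRUE)

# Sign the statistic: positive if the observed target count exceeds its expectation
direction <- if (tab[1, 1] > test$expected[1, 1]) 1 else -1
chi2_signed <- direction * unname(test$statistic)

c(chi2 = chi2_signed, p = test$p.value)
```

Setting `correct = FALSE` in `chisq.test()` gives the uncorrected statistic, which is the base R analogue of `correction = "none"`.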

Value

a data.frame of computed statistics and associated p-values, with one row per feature scored, along with the number of occurrences of each feature in the target and reference groups. The statistic depends on the measure: for measure = "chi2" it is the chi-squared value, signed positively if the observed value in the target exceeds its expected value; for measure = "exact" it is the estimate of the odds ratio; for measure = "lr" it is the likelihood ratio \(G^2\) statistic; for measure = "pmi" it is the pointwise mutual information statistic.
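For orientation, the textbook forms of the \(G^2\) and pointwise mutual information statistics can be sketched in base R as follows. The counts are hypothetical, no correction is applied, and the package's internals may differ in detail (for example in signing or in how corrections are applied), so treat this as an illustration of the quantities rather than a reimplementation.

```r
# Hypothetical 2x2 counts: rows = target / reference, cols = feature / other
obs <- matrix(c(30,  970,
                20, 1980),
              nrow = 2, byrow = TRUE)

# Expected counts under independence of row and column
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Likelihood ratio statistic: G^2 = 2 * sum(O * log(O / E)), uncorrected
g2 <- 2 * sum(obs * log(obs / expected))
p_g2 <- pchisq(g2, df = 1, lower.tail = FALSE)

# Pointwise mutual information for the target/feature cell: log(O11 / E11)
pmi <- log(obs[1, 1] / expected[1, 1])

c(G2 = g2, p = p_g2, pmi = pmi)
```

In this naive form a zero cell makes `O * log(O / E)` undefined, so smoothing the counts first is one way to avoid that.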

textstat_keyness returns a data.frame of features and their keyness scores and frequency counts.

References

Bondi, M. & Scott, M. (eds) (2010). Keyness in Texts. Amsterdam, Philadelphia: John Benjamins.

Stubbs, M. (2010). Three Concepts of Keywords. In Keyness in Texts, Bondi, M. & Scott, M. (eds): 1–42. Amsterdam, Philadelphia: John Benjamins.

Scott, M. & Tribble, C. (2006). Textual Patterns: Keyword and Corpus Analysis in Language Education. Amsterdam: Benjamins: 55.

Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1): 61–74.

Examples

# compare pre- v. post-war terms using grouping
period <- ifelse(docvars(data_corpus_inaugural, "Year") < 1945, "pre-war", "post-war")
dfmat1 <- dfm(data_corpus_inaugural, groups = period)
head(dfmat1) # make sure 'post-war' is in the first row
#> Document-feature matrix of: 2 documents, 9,360 features (34.9% sparse).
#>           features
#> docs       fellow-citizens   of  the senate  and house representatives   :
#>   post-war               0 1437 1988      2 1456     3               0 105
#>   pre-war               39 5666 8094     13 3854     8              19  29
#>           features
#> docs       among vicissitudes
#>   post-war    22            0
#>   pre-war     86            5
#> [ reached max_nfeat ... 9,350 more features ]
head(tstat1 <- textstat_keyness(dfmat1), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 707.2185 0      960         779
#> 2          . 230.9059 0     1804        3141
#> 3          - 226.2633 0      239         155
#> 4        our 190.4594 0      874        1307
#> 5         us 187.4062 0      262         216
#> 6          : 178.9971 0      105          29
#> 7    america 177.5687 0      130          54
#> 8      world 176.2784 0      188         123
#> 9  americans 151.2940 0       67           7
#> 10       new 142.2859 0      150          97
tail(tstat1, 10)
#>           feature       chi2            p n_target n_reference
#> 9351         upon  -51.51813 7.094325e-13       39         332
#> 9352           it  -51.84279 6.012968e-13      257        1132
#> 9353       public  -55.70000 8.437695e-14       11         213
#> 9354       states  -58.74394 1.798561e-14       28         305
#> 9355 constitution  -60.88287 6.106227e-15        6         200
#> 9356           be  -71.19896 0.000000e+00      257        1224
#> 9357       should  -82.68107 0.000000e+00       15         309
#> 9358        which -159.02346 0.000000e+00       95         911
#> 9359           of -175.47661 0.000000e+00     1437        5666
#> 9360          the -294.14547 0.000000e+00     1988        8094

# compare pre- v. post-war terms using logical vector
dfmat2 <- dfm(data_corpus_inaugural)
head(textstat_keyness(dfmat2, docvars(data_corpus_inaugural, "Year") >= 1945), 10)
#>      feature     chi2 p n_target n_reference
#> 1         we 707.2185 0      960         779
#> 2          . 230.9059 0     1804        3141
#> 3          - 226.2633 0      239         155
#> 4        our 190.4594 0      874        1307
#> 5         us 187.4062 0      262         216
#> 6          : 178.9971 0      105          29
#> 7    america 177.5687 0      130          54
#> 8      world 176.2784 0      188         123
#> 9  americans 151.2940 0       67           7
#> 10       new 142.2859 0      150          97

# compare Trump 2017 to other post-war presidents
dfmat3 <- dfm(corpus_subset(data_corpus_inaugural, period == "post-war"))
head(textstat_keyness(dfmat3, target = "2017-Trump"), 10)
#>      feature     chi2            p n_target n_reference
#> 1  protected 76.20152 0.000000e+00        5           1
#> 2       will 50.88954 9.771073e-13       40         299
#> 3      while 47.92566 4.426903e-12        6           7
#> 4      obama 47.58369 5.270562e-12        3           0
#> 5      we've 47.58369 5.270562e-12        3           0
#> 6    america 31.15097 2.387204e-08       18         112
#> 7      again 27.59193 1.498024e-07        9          33
#> 8   everyone 27.50081 1.570283e-07        4           5
#> 9       your 26.45161 2.702225e-07       11          50
#> 10    breath 25.39798 4.664057e-07        2           0

# using the likelihood ratio method
head(textstat_keyness(dfm_smooth(dfmat3), measure = "lr", target = "2017-Trump"), 10)
#>      feature        G2            p n_target n_reference
#> 1       will 24.516313 7.368333e-07       41         317
#> 2    america 13.996558 1.831456e-04       19         130
#> 3       your 10.406851 1.255487e-03       12          68
#> 4      again  9.734223 1.808684e-03       10          51
#> 5      while  9.486598 2.069782e-03        7          25
#> 6   american  8.850857 2.929513e-03       12          76
#> 7  protected  8.804622 3.004684e-03        6          19
#> 8       back  6.836773 8.930003e-03        7          34
#> 9        you  6.685605 9.719453e-03       14         121
#> 10   country  5.801129 1.601588e-02       10          72
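Because the result is an ordinary data.frame, the sign of the score can be used to separate features that are key in the target from those that are key in the reference group. The sketch below rebuilds a few rows by hand from the pre- v. post-war output shown above (so it runs without the package); the object names `tstat`, `key_target` and `key_reference` are illustrative only.

```r
# A few rows copied from the pre- v. post-war output above
tstat <- data.frame(
  feature     = c("we", "our", "of", "the"),
  chi2        = c(707.2185, 190.4594, -175.47661, -294.14547),
  p           = c(0, 0, 0, 0),
  n_target    = c(960, 874, 1437, 1988),
  n_reference = c(779, 1307, 5666, 8094),
  stringsAsFactors = FALSE
)

# Positive scores: features over-represented in the target (post-war)
key_target <- tstat$feature[tstat$chi2 > 0]

# Negative scores: features over-represented in the reference (pre-war)
key_reference <- tstat$feature[tstat$chi2 < 0]

key_target
key_reference
```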