Produces summaries of the counts and document frequencies of the features in a dfm, optionally grouped by a docvars variable or another supplied grouping variable.

textstat_frequency(
  x,
  n = NULL,
  groups = NULL,
  ties_method = c("min", "average", "first", "random", "max", "dense"),
  ...
)

Arguments

x

a dfm object

n

(optional) integer specifying the top n features to be returned, within group if groups is specified

groups

either: a character vector containing the names of document variables to be used for grouping; or a factor or object that can be coerced into a factor equal in length or rows to the number of documents. NA values of the grouping value are dropped. See groups for details.

ties_method

character string specifying how ties are treated. See data.table::frank() for details. Unlike that function, however, the default is "min", so that tied features share the lowest rank: frequencies of 11, 11, 10 would be ranked 1, 1, 3.

...

additional arguments passed to dfm_group(). This can be useful for passing force = TRUE, for instance, if you are grouping a dfm that has been weighted.
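The effect of ties_method can be illustrated without quanteda. The following is a minimal base-R sketch, not the package's implementation: the frequency vector freq is made up for illustration, and because base rank() has no "dense" option, that method is emulated here with match().

```r
# Hypothetical feature frequencies (illustration only, not from a real dfm).
freq <- c(a = 11, b = 11, c = 10, d = 10, e = 7)

# Negate so that rank 1 goes to the greatest frequency.
# "min": tied features share the lowest rank; the following rank is skipped.
rank_min <- rank(-freq, ties.method = "min")

# "first": ties are broken by order of appearance.
rank_first <- rank(-freq, ties.method = "first")

# "dense": tied features share a rank and no ranks are skipped.
# Emulated by matching against the sorted unique frequencies.
rank_dense <- match(freq, sort(unique(freq), decreasing = TRUE))
names(rank_dense) <- names(freq)

print(rbind(min = rank_min, first = rank_first, dense = rank_dense))
```

Under these assumptions, "min" leaves gaps after tied features, "first" never produces equal ranks, and "dense" produces consecutive ranks.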

Value

a data.frame containing the following variables:

feature

(character) the feature

frequency

count of the feature

rank

rank of the feature, where 1 indicates the greatest frequency

docfreq

document frequency of the feature, as a count (the number of documents in which this feature occurred at least once)

group

(only if groups is specified) the label of the group. If the features have been grouped, then all counts, ranks, and document frequencies are within group. If groups is not specified, the group column is omitted from the returned data.frame.

textstat_frequency returns a data.frame of features and their term and document frequencies within groups.
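As a rough illustration of what the frequency and docfreq columns measure, the sketch below recomputes them in base R from a hand-built count matrix. The matrix m and the grouping vector grp are assumptions made for this illustration; they mirror the three toy documents "a a b b c d", "a d d d" and "a a a" used in the Examples, and the code is not how textstat_frequency() is implemented.

```r
# Hand-built document-feature counts (rows = documents, columns = features).
m <- rbind(
  text1 = c(a = 2, b = 2, c = 1, d = 1),
  text2 = c(a = 1, b = 0, c = 0, d = 3),
  text3 = c(a = 3, b = 0, c = 0, d = 0)
)

# frequency: total count of each feature across all documents.
frequency <- colSums(m)

# docfreq: number of documents in which each feature occurs at least once.
docfreq <- colSums(m > 0)

# Grouped version: sum the counts and the presence indicators within each group.
grp <- c("one", "two", "one")
frequency_by_group <- rowsum(m, grp)
docfreq_by_group <- rowsum((m > 0) * 1, grp)

print(frequency)
print(docfreq)
print(frequency_by_group)
print(docfreq_by_group)
```

In the grouped case, a feature with a zero count in a group (such as b in group "two") is simply absent from that group's rows in the textstat_frequency() output shown in the Examples.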

Examples

set.seed(20)
dfmat1 <- dfm(c("a a b b c d", "a d d d", "a a a"))
textstat_frequency(dfmat1)
#>   feature frequency rank docfreq group
#> 1       a         6    1       3   all
#> 2       d         4    2       2   all
#> 3       b         2    3       1   all
#> 4       c         1    4       1   all
textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "first")
#>   feature frequency rank docfreq group
#> 1       a         5    1       2   one
#> 2       b         2    2       1   one
#> 3       c         1    3       1   one
#> 4       d         1    4       1   one
#> 5       d         3    1       1   two
#> 6       a         1    2       1   two
textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "dense")
#>   feature frequency rank docfreq group
#> 1       a         5    1       2   one
#> 2       b         2    2       1   one
#> 3       c         1    3       1   one
#> 4       d         1    3       1   one
#> 5       d         3    1       1   two
#> 6       a         1    2       1   two

dfmat2 <- corpus_subset(data_corpus_inaugural, President == "Obama") %>%
    dfm(remove_punct = TRUE, remove = stopwords("english"))
tstat1 <- textstat_frequency(dfmat2)
head(tstat1, 10)
#>    feature frequency rank docfreq group
#> 1       us        44    1       2   all
#> 2     must        25    2       2   all
#> 3      can        20    3       2   all
#> 4   nation        18    4       2   all
#> 5   people        18    4       2   all
#> 6      new        17    6       2   all
#> 7     time        16    7       2   all
#> 8    every        15    8       2   all
#> 9  america        14    9       2   all
#> 10     now        11   10       2   all

# plot 20 most frequent words
library("ggplot2")
ggplot(tstat1[1:20, ], aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() +
    coord_flip() +
    labs(x = NULL, y = "Frequency")

# plot relative frequencies by group
dfmat3 <- data_corpus_inaugural %>%
    corpus_subset(Year > 2000) %>%
    dfm(remove = stopwords("english"), remove_punct = TRUE) %>%
    dfm_group(groups = "President") %>%
    dfm_weight(scheme = "prop")

# calculate relative frequency by president
tstat2 <- textstat_frequency(dfmat3, n = 10, groups = "President")

# plot frequencies
ggplot(data = tstat2, aes(x = factor(nrow(tstat2):1), y = frequency)) +
    geom_point() +
    facet_wrap(~ group, scales = "free") +
    coord_flip() +
    scale_x_discrete(breaks = nrow(tstat2):1,
                     labels = tstat2$feature) +
    labs(x = NULL, y = "Relative frequency")