Identify the most frequent features in a dfm

List the most (or least) frequently occurring features in a dfm, either as a whole or separated by document.

topfeatures(
  x,
  n = 10,
  decreasing = TRUE,
  scheme = c("count", "docfreq"),
  groups = NULL
)

Arguments

x: the object whose features will be returned
n: how many top features should be returned
decreasing: If TRUE, return the n most frequent features; otherwise return the n least frequent features
scheme: one of count for total feature frequency (within group if applicable), or docfreq for the document frequencies of features
groups: grouping variable for sampling, equal in length to the number of documents. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This also changes previous behaviours for groups. See news(Version >= "3.0", package = "quanteda") for details.

Value

A named numeric vector of feature counts, where the names are the feature labels, or a list of these if groups is given.

Examples

dfmat1 <- corpus_subset(data_corpus_inaugural, Year > 1980) |>
    tokens(remove_punct = TRUE) |>
    dfm()
dfmat2 <- dfm_remove(dfmat1, stopwords("en"))

# most frequent features
topfeatures(dfmat1)
#>  the  and   of   we   to  our    a   in   is that 
#> 1302 1119  915  715  714  651  518  421  358  341 
topfeatures(dfmat2)
#>      us america    must  people     can  nation   world     new     one    time 
#>     219     134     119     108     104     103     103      99      91      81 

# least frequent features
topfeatures(dfmat2, decreasing = FALSE)
#>     hatfield      mondale        baker       moomaw    momentous   occurrence 
#>            1            1            1            1            1            1 
#>    routinely       unique       really every-4-year 
#>            1            1            1            1 

# top features of individual documents
topfeatures(dfmat2, n = 5, groups = docnames(dfmat2))
#> Error in eval(groupsub): object 'dfmat2' not found

# grouping by president last name
topfeatures(dfmat2, n = 5, groups = President)
#> $Biden
#>      us america     can     one  nation 
#>      27      18      16      15      12 
#> 
#> $Bush
#> freedom      us  nation america     can 
#>      36      27      27      27      24 
#> 
#> $Clinton
#>      us     new   world    must america 
#>      40      38      28      28      26 
#> 
#> $Obama
#>     us   must    can nation people 
#>     44     25     20     18     18 
#> 
#> $Reagan
#>         us government     people      world        one 
#>         52         29         25         23         22 
#> 
#> $Trump
#>  america       us      one american   people 
#>       36       29       23       20       19 
#> 

# features by document frequencies
tail(topfeatures(dfmat1, scheme = "docfreq", n = 200))
#>   schools america's     chief      vice       yet       two 
#>         9         9         8         8         8         8