List the most (or least) frequently occurring features in a dfm, either as a whole or separated by document.

topfeatures(
  x,
  n = 10,
  decreasing = TRUE,
  scheme = c("count", "docfreq"),
  groups = NULL
)

Arguments

x

the object whose features will be returned

n

how many top features should be returned

decreasing

If TRUE, return the n most frequent features; otherwise return the n least frequent features

scheme

one of count for total feature frequency (within group if applicable), or docfreq for the document frequencies of features

groups

grouping variable for sampling, equal in length to the number of documents. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This also changes previous behaviours for groups. See news(Version >= "3.0", package = "quanteda") for details.

Value

A named numeric vector of feature counts, where the names are the feature labels, or a list of these if groups is given.

Examples

dfmat1 <- corpus_subset(data_corpus_inaugural, Year > 1980) |>
    tokens(remove_punct = TRUE) |>
    dfm()
dfmat2 <- dfm_remove(dfmat1, stopwords("en"))

# most frequent features
topfeatures(dfmat1)
#>  the  and   of   to   we  our    a   in   is that 
#> 1201 1023  838  649  627  608  472  379  329  317 
topfeatures(dfmat2)
#>      us america    must  people     new   world  nation     can    time     one 
#>     192     116     109      99      96      95      91      88      77      76 

# least frequent features
topfeatures(dfmat2, decreasing = FALSE)
#>     hatfield      mondale        baker       moomaw    momentous   occurrence 
#>            1            1            1            1            1            1 
#>    routinely       unique       really every-4-year 
#>            1            1            1            1 

# top features of individual documents
topfeatures(dfmat2, n = 5, groups = docnames(dfmat2))
#> Error in eval(groupsub): object 'dfmat2' not found

# grouping by president last name
topfeatures(dfmat2, n = 5, groups = President)
#> $Biden
#>      us america     can     one  nation 
#>      27      18      16      15      12 
#> 
#> $Bush
#> freedom      us  nation america     can 
#>      36      27      27      27      24 
#> 
#> $Clinton
#>      us     new   world    must america 
#>      40      38      28      28      26 
#> 
#> $Obama
#>     us   must    can nation people 
#>     44     25     20     18     18 
#> 
#> $Reagan
#>         us government     people      world        one 
#>         52         29         25         23         22 
#> 
#> $Trump
#>  america american   people  country      one 
#>       18       11       10        9        8 
#> 

# features by document frequencies
tail(topfeatures(dfmat1, scheme = "docfreq", n = 200))
#>      unity       like     cannot       even challenges    schools 
#>          8          8          8          8          8          8