Get the count of tokens (total features) or types (unique tokens).

ntoken(x, ...)

ntype(x, ...)

Arguments

x

a quanteda object: a character, corpus, tokens, or dfm object

...

additional arguments passed to tokens

Value

a named vector giving the count of total tokens or types for each document in x

Details

The precise definition of "tokens" for objects not yet tokenized (e.g. character or corpus objects) can be controlled through optional arguments passed to tokens via the ... argument.
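
To illustrate why the token definition matters, here is a minimal base R sketch. It does not use quanteda: simple_counts is a hypothetical helper with a deliberately crude tokenizer (split on whitespace, each punctuation mark treated as its own token), so it only approximates what tokens does.

```r
# Hypothetical helper, for illustration only; NOT quanteda's tokenizer.
simple_counts <- function(txt, remove_punct = FALSE) {
  # pad punctuation with spaces so each mark becomes its own token
  padded <- gsub("([[:punct:]])", " \\1 ", txt)
  toks <- strsplit(trimws(padded), "[[:space:]]+")[[1]]
  if (remove_punct) {
    # drop tokens that consist solely of punctuation
    toks <- toks[!grepl("^[[:punct:]]+$", toks)]
  }
  # tokens = all items; types = distinct items (case-sensitive)
  c(ntoken = length(toks), ntype = length(unique(toks)))
}

txt <- "This is a sentence, this."
simple_counts(txt)                                # 7 tokens, 7 types
simple_counts(tolower(txt))                       # 7 tokens, 6 types
simple_counts(tolower(txt), remove_punct = TRUE)  # 5 tokens, 4 types
```

Lowercasing merges "This" and "this" into one type, and removing punctuation lowers both counts, which is the same pattern the examples on this page show for text1.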

For dfm objects, ntype will only return the count of features that occur more than zero times in the dfm.
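
The distinction can be sketched with a plain base R matrix standing in for a dfm. The counts below are made up for illustration, and this is a conceptual sketch rather than quanteda's implementation: token counts correspond to row totals, while type counts correspond to the number of non-zero cells in each row.

```r
# Made-up document-feature counts; a plain matrix stands in for a dfm
m <- matrix(c(2, 1, 0,
              0, 3, 1),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("text1", "text2"), c("a", "b", "c")))

token_counts <- rowSums(m)      # total feature occurrences per document
type_counts  <- rowSums(m > 0)  # features occurring at least once per document

token_counts  # text1 = 3, text2 = 4
type_counts   # text1 = 2, text2 = 2
```

Although the matrix has three features, each document has a type count of 2, because a feature with a zero count in a document does not contribute to that document's types.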

Note

Due to differences between raw text tokens and features that have been defined for a dfm, the counts may be different for dfm objects and the texts from which the dfm was generated. Because the method tokenizes the text in order to count the tokens, your results will depend on the options passed through to tokens.

Examples

# simple example
txt <- c(text1 = "This is a sentence, this.", text2 = "A word. Repeated repeated.")
ntoken(txt)
#> text1 text2
#>     7     6
ntype(txt)
#> text1 text2
#>     7     5
ntoken(char_tolower(txt))  # same
#> text1 text2
#>     7     6
ntype(char_tolower(txt))   # fewer types
#> text1 text2
#>     6     4
ntoken(char_tolower(txt), remove_punct = TRUE)
#> text1 text2
#>     5     4
ntype(char_tolower(txt), remove_punct = TRUE)
#> text1 text2
#>     4     3

# with some real texts
ntoken(corpus_subset(data_corpus_inaugural, Year<1806), remove_punct = TRUE)
#> 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson
#>            1430             135            2318            1726            2166
ntype(corpus_subset(data_corpus_inaugural, Year<1806), remove_punct = TRUE)
#> 1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson
#>             617              91             819             711             799
ntoken(dfm(corpus_subset(data_corpus_inaugural, Year<1800)))
#> Error in get(".SigLength", envir = env): object '.SigLength' not found
ntype(dfm(corpus_subset(data_corpus_inaugural, Year<1800)))
#> 1789-Washington 1793-Washington      1797-Adams
#>             603              95             801