Added tokens_xptr objects, which extend tokens objects with external pointers for greater efficiency. Once tokens objects are converted to tokens_xptr objects, the tokens_*.tokens_xptr() methods are called automatically.
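The conversion and automatic dispatch can be sketched as follows. This is a minimal illustration, assuming quanteda 4 or later is installed and that as.tokens_xptr() and as.tokens() are used for the round trip; the sample texts and object names are made up for the example.

```r
library(quanteda)

# Hypothetical sample texts, for illustration only.
txt <- c(d1 = "The quick brown fox jumps.", d2 = "The lazy dog sleeps.")
toks <- tokens(txt, remove_punct = TRUE)

# Convert once to the external-pointer representation.
xtoks <- as.tokens_xptr(toks)

# The usual tokens_*() generics now dispatch to the tokens_xptr methods.
xtoks <- tokens_tolower(xtoks)
xtoks <- tokens_remove(xtoks, stopwords("en"))

# Convert back to an ordinary tokens object for inspection.
toks2 <- as.tokens(xtoks)
print(toks2)
```

Converting back with as.tokens() is only needed when a regular tokens object is required downstream.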
Improved the C++ functions so that users can change the number of threads for parallel computing more flexibly using quanteda_options(). The value of threads can be changed in the middle of an analysis pipeline.
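Changing the thread count part-way through a pipeline might look like the sketch below. It assumes quanteda 4 or later; the choice of a single thread and the object names are illustrative, and the previous setting is restored at the end.

```r
library(quanteda)

# Remember the current setting so it can be restored afterwards.
old_threads <- quanteda_options("threads")

# Tokenise with whatever the current thread setting is.
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
dfmat_before <- dfm(toks)

# Reduce to a single thread in the middle of the pipeline,
# for example to leave cores free for another process.
quanteda_options(threads = 1)
dfmat_after <- dfm(toks)

# Restore the earlier setting.
quanteda_options(threads = old_threads)
```

The thread count affects only how the work is scheduled, so both dfm objects should have the same dimensions.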
"word4" the default (word) tokeniser, with improved efficiency, language handling, and customisation options.
Replaced all occurrences of the magrittr %>% pipe with the native R pipe |> introduced in R 4.1, although the %>% pipe is still re-exported and therefore available to all users of quanteda without loading any additional packages.
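As a small illustration of the two pipes side by side, the sketch below builds the same dfm twice. It assumes R 4.1 or later and a quanteda version that still re-exports %>%; the texts are invented for the example.

```r
library(quanteda)

# Hypothetical sample texts, for illustration only.
txt <- c(d1 = "Pipes make pipelines readable.",
         d2 = "Native pipes need R 4.1 or later.")

# Native R pipe.
dfmat_native <- txt |>
  tokens(remove_punct = TRUE) |>
  tokens_tolower() |>
  dfm()

# magrittr pipe, available through quanteda without loading magrittr.
dfmat_magrittr <- txt %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  dfm()
```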
bootstrap_dfm() was removed for character and corpus objects. The correct way to bootstrap sentences is now to tokenise them as sentences and then bootstrap them from the dfm. This is consistent with requiring the user to tokenise objects prior to forming dfms or other “downstream” objects.
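One possible way to follow this workflow is sketched below. It assumes quanteda 4 or later and that reshaping the corpus to sentences with corpus_reshape() before tokenising is an acceptable way to obtain sentence-level documents; the texts and the value of n are illustrative.

```r
library(quanteda)

# Hypothetical sample texts, for illustration only.
txt <- c(textone = "This is a sentence. Another sentence. Yet another.",
         texttwo = "Premiere phrase. Deuxieme phrase.")

# Reshape to one document per sentence, then tokenise and form the dfm.
dfmat_sent <- corpus(txt) |>
  corpus_reshape(to = "sentences") |>
  tokens() |>
  dfm()

# Bootstrap from the sentence-level dfm.
set.seed(10)
boot <- bootstrap_dfm(dfmat_sent, n = 3)
```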
dfm() no longer works on character or corpus objects, only on tokens or other dfm objects. This was deprecated in v3 and removed in v4.
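In practice this means inserting an explicit tokens() step before dfm(). A minimal sketch, assuming quanteda 4 or later and using made-up texts:

```r
library(quanteda)

# Hypothetical sample corpus, for illustration only.
corp <- corpus(c(d1 = "A first document.", d2 = "A second document."))

# Tokenise explicitly, then form the dfm from the tokens object.
dfmat <- corp |>
  tokens(remove_punct = TRUE) |>
  dfm()

print(dfmat)
```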
Very old arguments to dfm() that were not visible but still worked with a warning (such as stem = TRUE) are removed.
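Code that relied on stem = TRUE can be rewritten with an explicit stemming step. The sketch below shows two options, assuming quanteda 4 or later; which of tokens_wordstem() or dfm_wordstem() suits a given pipeline is a judgment call, and the text is invented for the example.

```r
library(quanteda)

# Hypothetical sample text, for illustration only.
toks <- tokens(c(d1 = "running runs runner"))

# Option 1: stem the tokens, then form the dfm.
dfmat_a <- toks |>
  tokens_wordstem() |>
  dfm()

# Option 2: form the dfm, then stem its features.
dfmat_b <- toks |>
  dfm() |>
  dfm_wordstem()
```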
Deprecated or renamed arguments to tokens() that formerly mapped to the v3 arguments with a warning are removed.
Methods for readtext objects are removed, since these are data.frame objects that are straightforward to convert into a
tokens_group() works efficiently even when the numbers of documents and groups are very large.
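For reference, a small grouping example is sketched below. It assumes quanteda 4 or later; the docvar name grp and the texts are hypothetical, and the example is far too small to show the efficiency gain itself.

```r
library(quanteda)

# Hypothetical corpus with a grouping variable, for illustration only.
corp <- corpus(c("a b c", "b c d", "c d e", "d e f"),
               docvars = data.frame(grp = c("x", "y", "x", "y")))
toks <- tokens(corp)

# Combine the tokens of all documents that share a value of grp.
toks_grouped <- tokens_group(toks, groups = grp)

print(toks_grouped)
```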