Version 3.2 • quanteda

Bug fixes and stability enhancements

dfm() returns a dfm with identical column order even if tokens_compound() or tokens_ngrams() is used upstream (#2100).
dfm_group() now drops documents with NA values in the grouping variable, similar to the behaviour of tokens_group() and corpus_group() (#2134).
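
The NA handling in dfm_group() can be illustrated with a minimal sketch. The document names and the grouping vector are made up for illustration; the only behaviour assumed is the one described in the entry above.

```r
library(quanteda)

# Three toy documents; the second has no group assigned (NA)
dfmat <- dfm(tokens(c(d1 = "a b", d2 = "b c", d3 = "c d")))
grp <- c("x", NA, "x")

# Documents whose grouping value is NA are dropped before grouping,
# so only d1 and d3 contribute to group "x"
dfmat_grouped <- dfm_group(dfmat, groups = grp)

docnames(dfmat_grouped)
```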

Changes and additions

char_wordstem() now has a new argument check_whitespace, which can be used to avoid an error when stemming text containing a whitespace character.
dfm_remove() now has a new argument padding = FALSE that, when TRUE, collects counts of the removed features in the first column. This produces results consistent with a dfm built from tokens from which some have been removed with padding = TRUE (#2152).
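
As a rough sketch of the padding behaviour, the example below removes a feature at the dfm level and compares the result with a dfm built from padded tokens. The toy texts and object names are illustrative assumptions, not part of the release notes.

```r
library(quanteda)

toks <- tokens(c(d1 = "a b c a", d2 = "b c d"))

# Remove "a" at the dfm level, collecting the removed counts in a pad column
dfmat_pad <- dfm_remove(dfm(toks), pattern = "a", padding = TRUE)

# Remove "a" at the tokens level with padding, then build the dfm
dfmat_ref <- dfm(tokens_remove(toks, pattern = "a", padding = TRUE))

# Both objects should have the same features, with the pad column first
featnames(dfmat_pad)
featnames(dfmat_ref)
```

Comparing the two objects with as.matrix() is a quick way to check that the counts agree as well as the feature names.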

Bug fixes and stability enhancements

dfm_lookup() ignores matches of multiple dictionary values in the same key in a similar way as tokens_lookup() (#2159).

Changes and additions

A new split_tags argument has been added to tokens(), to provide the user with an option not to preserve social media tags (addresses #2156).
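
A minimal sketch of the difference, using a made-up text: by default hashtags and usernames are kept as single tokens, whereas split_tags = TRUE lets the tokenizer break them up.

```r
library(quanteda)

txt <- "Loving #rstats with @quanteda"

# Default: social media tags are preserved as single tokens
toks_default <- tokens(txt)

# split_tags = TRUE: tags are no longer protected from splitting
toks_split <- tokens(txt, split_tags = TRUE)

as.character(toks_default)
as.character(toks_split)
```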

Bug fixes and stability enhancements

fcm() computes the marginal frequency of upper-case tokens correctly (#2176).
tokens_chunk() keeps all docids from the original object, including those of empty documents.
tokens_select() recycles values when the length of startpos or endpos is less than ndoc(x).
tokens_lookup() and dfm_lookup() can apply very large dictionaries (more than 100,000 keys).

Bug fixes and stability enhancements

Matrix package calls updated for compatibility with Matrix 1.4.2 (#2182).
Changes to C++ code for fcm() to prevent some (chance) errors downstream in LSX (#2181).

Bug fixes and stability enhancements

Fixes test failures caused by recent changes to Matrix package behaviours on some operating systems.

Changes and additions

segid() is added to extract document serial numbers from corpus, tokens or dfm objects.
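
A short usage sketch, assuming a toy corpus reshaped into sentences so that one document yields several segments; the texts and object names are illustrative only.

```r
library(quanteda)

corp <- corpus(c(d1 = "One. Two.", d2 = "Three."))

# Split each document into sentence-level segments
corp_sent <- corpus_reshape(corp, to = "sentences")

# One serial number per segment, alongside the original document ids
segid(corp_sent)
docid(corp_sent)
```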