dfm() returns a dfm with the identical column order even if tokens_compound() or tokens_ngrams() is used upstream (#2100).
dfm_group() with NA values in a grouping variable now drops those, similar to the behaviour of tokens_group() and corpus_group() (#2134).
char_wordstem() now has a new argument check_whitespace, which will not throw an error when lower-casing text containing a whitespace character.
dfm_remove() now has a new argument padding = FALSE that, when TRUE, collects counts of the removed features in the first column. This produces results consistent with a dfm built from tokens from which some have been removed with padding = TRUE (#2152).
dfm_lookup() ignores matches of multiple dictionary values in the same key, in a similar way to tokens_lookup() (#2159).
fcm() computes the marginal frequency of upper-case tokens correctly (#2176).
tokens_chunk() keeps all the docids in the original object, including those of empty documents.
tokens_select() recycles values when the length of startpos or endpos is less than ndoc(x).
tokens_lookup() and dfm_lookup() can apply very large dictionaries (more than 100,000 keys).
segid() is added to extract document serial numbers from corpus, tokens or dfm objects.
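
A few of the entries above can be tried out with a minimal sketch like the following. It assumes a quanteda version that includes these changes; the toy corpus, the group labels and the object names are hypothetical and chosen only for illustration. No output is shown, since the exact printed form may differ between versions.

```r
library(quanteda)  # assumes a quanteda release containing the changes listed above

# Hypothetical toy corpus, for illustration only
corp <- corpus(c(d1 = "a b c a", d2 = "b c d", d3 = "a d d"))
toks <- tokens(corp)
dfmat <- dfm(toks)

# dfm_remove() with padding = TRUE: counts of the removed features
# are collected in the first column instead of being discarded
dfmat_pad <- dfm_remove(dfmat, pattern = c("a", "d"), padding = TRUE)
print(dfmat_pad)

# The same removal applied upstream on the tokens object, for comparison
toks_pad <- tokens_remove(toks, pattern = c("a", "d"), padding = TRUE)
dfmat_pad_toks <- dfm(toks_pad)
print(dfmat_pad_toks)

# dfm_group() with an NA in the grouping variable:
# the document whose group is NA (here d2) is dropped
dfmat_grp <- dfm_group(dfmat, groups = c("g1", NA, "g2"))
print(docnames(dfmat_grp))

# segid() extracts serial numbers, here from documents split into chunks
toks_chunk <- tokens_chunk(toks, size = 2)
print(segid(toks_chunk))
```

Comparing `dfmat_pad` with `dfmat_pad_toks` is one way to check the consistency that the dfm_remove() entry describes: in both objects the removed tokens should still be counted rather than lost.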