Bug fixes and stability enhancements

Changes and additions

  • char_wordstem() now has a new argument check_whitespace, which will not throw an error when lower-casing text containing a whitespace character.
  • dfm_remove() now has a new argument padding = FALSE that, when TRUE, collects the counts of the removed features in the first column. The result is consistent with a dfm built from tokens from which some have been removed with padding = TRUE (#2152).
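
A minimal sketch of the padding behaviour described above, using a made-up two-document example. The document names and texts are illustrative only, and the comparison assumes that dfm() keeps the padding feature of a padded tokens object by default.

```r
library(quanteda)

# Hypothetical toy data, used only for illustration
toks <- tokens(c(d1 = "a b c a", d2 = "b c d"))
dfmat <- dfm(toks)

# Remove the feature "a" but keep its counts in a padding column
dfmat_pad <- dfm_remove(dfmat, pattern = "a", padding = TRUE)

# The same removal done at the tokens level, for comparison
dfmat_tok <- dfm(tokens_remove(toks, pattern = "a", padding = TRUE))

featnames(dfmat_pad)
featnames(dfmat_tok)
```

With padding = TRUE the total count in the dfm should be unchanged, because the removed occurrences are moved into the first column rather than dropped.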

Bug fixes and stability enhancements

Changes and additions

  • A new split_tags argument has been added to tokens(), to provide the user with an option not to preserve social media tags (addresses #2156).
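
As an illustration of the option above, the following sketch tokenizes an invented example string with and without split_tags. The text is hypothetical; only tokens() and its split_tags argument are taken from the entry above.

```r
library(quanteda)

# Hypothetical example text containing a username and a hashtag
txt <- "Follow @quanteda and use #rstats"

# Default behaviour: social media tags are preserved as single tokens
toks_keep <- tokens(txt)

# With split_tags = TRUE the tag symbols are split from the words
toks_split <- tokens(txt, split_tags = TRUE)

print(toks_keep)
print(toks_split)
```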

Bug fixes and stability enhancements

  • fcm() computes the marginal frequency of upper-case tokens correctly (#2176).
  • tokens_chunk() keeps all the docids of the original object, including those of empty documents.
  • tokens_select() recycles values when the length of startpos or endpos is less than ndoc(x).
  • tokens_lookup() and dfm_lookup() can apply very large dictionaries (more than 100,000 keys).

Bug fixes and stability enhancements

  • Matrix package calls updated for compatibility with Matrix 1.4.2 (#2182).
  • Changes to the C++ code for fcm() to prevent some (chance) errors downstream in LSX (#2181).

Bug fixes and stability enhancements

  • Fixes test failures caused by recent changes to Matrix package behaviours on some operating systems.

Changes and additions

  • segid() is added to extract document serial numbers from corpus, tokens or dfm objects.
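
A short sketch of how segid() might be used, assuming a corpus that has been reshaped into sentences so that one original document yields several segments. The corpus contents are invented for illustration.

```r
library(quanteda)

# Hypothetical two-document corpus
corp <- corpus(c(d1 = "First sentence. Second sentence.",
                 d2 = "Only one sentence."))

# Split each document into sentence-level segments
corp_sent <- corpus_reshape(corp, to = "sentences")

docid(corp_sent)  # original document of each segment
segid(corp_sent)  # serial number of each segment within its document
```

Together, docid() and segid() identify where each segment came from in the original corpus.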