quanteda 1.4.0 2019-01-30

Bug fixes and stability enhancements

  • Fixed a bug in dfm_compress() and dfm_group() that changed or deleted docvars attributes of dfm objects (#1506).
  • Fixed a bug in textplot_xray() that caused incorrect facet labels when a pattern contained multiple list elements or values (#1514).
  • kwic() now correctly returns the pattern associated with each match as the "keywords" attribute, for all pattern types (#1515).
  • Improved the efficiency of textstat_simil() and textstat_dist(), and their computation of unusual edge cases.

New features

  • textstat_lexdiv() now works on tokens objects, not just dfm objects. New measures of lexical diversity include MATTR (the Moving-Average Type-Token Ratio, Covington & McFall 2010) and MSTTR (Mean Segmental Type-Token Ratio).
  • New function tokens_split() splits single tokens into multiple tokens based on a pattern match. (#1500)
  • New function tokens_chunk() splits tokens into new documents of equally sized “chunks”. (#1520)
  • New function textstat_entropy() computes entropy for a dfm across feature or document margins.
  • The documentation for textstat_readability() is vastly improved, now detailing all formulas and providing full references.
  • New function dfm_match() allows a user to specify the features in a dfm according to a fixed vector of feature names, including those of another dfm. Replaces dfm_select(x, pattern) where pattern was a dfm.
  • Added a new argument vertex_labelsize to textplot_network() to allow more precise control of label sizes, either globally or individually.
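
As a rough illustration of the two lexical diversity measures named above, the following Python sketch computes MATTR and MSTTR from a plain list of tokens. It is not quanteda's implementation: the function names (ttr, mattr, msttr), the choice to raise an error when the text is shorter than the window, and the choice to discard a trailing partial segment in MSTTR are all assumptions made for this example.

```python
from statistics import mean


def ttr(tokens):
    """Type-token ratio: number of distinct tokens over number of tokens."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens, window):
    """Moving-average TTR: mean TTR over every window of `window` tokens,
    sliding one token at a time. Assumption: shorter texts are an error."""
    if not 0 < window <= len(tokens):
        raise ValueError("window must be between 1 and the number of tokens")
    starts = range(len(tokens) - window + 1)
    return mean(ttr(tokens[i:i + window]) for i in starts)


def msttr(tokens, segment):
    """Mean segmental TTR: mean TTR over consecutive non-overlapping segments.
    Assumption: a trailing partial segment is discarded."""
    n_full = len(tokens) // segment
    if segment <= 0 or n_full == 0:
        raise ValueError("segment must be between 1 and the number of tokens")
    return mean(ttr(tokens[i * segment:(i + 1) * segment]) for i in range(n_full))


tokens = "a b a c a b d a".split()
print(mattr(tokens, window=4))   # five overlapping windows of four tokens
print(msttr(tokens, segment=4))  # two non-overlapping segments of four tokens
```

The difference between the two is only in how the text is cut up: MATTR uses every overlapping window, while MSTTR uses disjoint segments, so MSTTR ignores any tokens left over after the last full segment.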

Behaviour changes

  • tokens.tokens(x, remove_hyphens = TRUE), where x was generated with remove_hyphens = FALSE, now treats the tokens similarly to how tokens.character(x, remove_hyphens = TRUE) would have treated the original character input. (#1498)

quanteda 1.4.1 2019-02-26

Bug fixes and stability enhancements

  • Fixed an issue with special handling of whitespace variants that caused a test to fail when running on an Ubuntu 18.10 system with libicu-dev version 63.1 (#1604).
  • Fixed the operation of docvars<-.corpus() in a way that solves #1603 (reassignment of docvar names).

quanteda 1.4.3 2019-04-01

Bug fixes and stability enhancements

  • Changed the default value of the size argument in dfm_sample() to the number of features, not the number of documents. (#1643)
  • Fixed a few CRAN-related issues (compiler warnings on Solaris and encoding warnings on r-devel-linux-x86_64-debian-clang).

Behaviour changes

  • Added a force = TRUE option and error checking for cases where dfm_weight() or dfm_group() is applied to a dfm that has already been weighted. (#1545) The function textstat_frequency() now allows passing this argument to dfm_group() through the ... argument. (#1646)
  • textstat_frequency() now has a new argument for resolving ties when ranking term frequencies, defaulting to the “min” method. (#1634)
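
To make the “min” tie-breaking convention concrete, here is a small Python sketch of ranking term frequencies in descending order, where tied terms all receive the lowest rank in their group. This mirrors the convention of ties.method = "min" in base R's rank(); the function name rank_min and the sample frequencies are hypothetical, and the code is not taken from quanteda.

```python
def rank_min(freqs):
    """Rank terms by descending frequency; tied terms share the minimum rank."""
    ordered = sorted(freqs.values(), reverse=True)
    first_position = {}
    for position, freq in enumerate(ordered, start=1):
        # Keep only the first (smallest) position at which each frequency occurs.
        first_position.setdefault(freq, position)
    return {term: first_position[freq] for term, freq in freqs.items()}


freqs = {"the": 10, "of": 7, "and": 7, "to": 3}
print(rank_min(freqs))
```

With this method the two terms tied at frequency 7 both get rank 2, and the next term gets rank 4, so rank 3 is skipped.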