quanteda 1.3.0 2018-06-05

New Features

  • Added to = "tripletlist" output type for convert(), to convert a dfm into a simple triplet list. (#1321)
  • Added tokens_tortl() and char_tortl() to add markers for right-to-left language tokens and character objects. (#1322)
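
As a rough illustration of the new output type, the sketch below converts a small dfm to a triplet list. The example texts and the object names (txts, x, tl) are made up for illustration, and the code assumes a quanteda installation in which convert() supports to = "tripletlist".

```r
library(quanteda)

# Two toy documents (illustrative data only)
txts <- c(d1 = "a b b c", d2 = "b c c c")

# Build a document-feature matrix from the tokenized texts
x <- dfm(tokens(txts))

# Convert the dfm into a simple triplet list:
# one entry per non-zero (document, feature) cell
tl <- convert(x, to = "tripletlist")
str(tl)
```

The result should be a plain list whose elements line up position by position, which can be handy when passing sparse counts to code that does not understand dfm objects.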

Behaviour changes

Bug fixes

  • Fixed a bug in textmodel_affinity() that caused failure when the input dfm had been compiled with tolower = FALSE. (#1338)
  • Fixed a bug affecting tokens_lookup() and dfm_lookup() when nomatch is used. (#1347)
  • Fixed a problem whereby NA texts created a “document” (or tokens) containing "NA". (#1372)

quanteda 1.3.4 2018-07-15

Bug fixes and stability enhancements

  • Keep encodings of types when a tokens object is recompiled. (#1387)
  • More robust handling in predict.textmodel_wordscores() when training and test feature sets are different (#1380).
  • char_segment() and corpus_segment() are more robust to whitespace characters preceding a pattern (#1394).
  • tokens_ngrams() is more robust when handling large numbers of documents (#1395).
  • corpus.data.frame() is now robust to data.frame inputs with improper or missing variable names (#1388).

New Features

quanteda 1.3.13 2018-11-01

Bug fixes and stability enhancements

  • Fixed a bug causing incorrect counting in fcm(x, ordered = TRUE). (#1413) Also relaxed the limit on window so that it can now be of size 1 (formerly the minimum was 2).
  • Fixed deprecation warnings raised when adding a dfm as docvars; doing so now automatically imports the feature names as docvar names. (related to #1417)
  • Fixed the behaviour of tokens(x, what = "fasterword", remove_separators = TRUE) so that it correctly splits words separated by \n and \t characters. (#1420)
  • Added error checking for functions taking dfm inputs in case a dfm has empty features (#1419).
  • For textstat_readability(), fixed a bug in Dale-Chall-based measures and in the Spache word list measure. These were caused by an incorrect lookup mechanism but also by limited implementation of the wordlists. The new wordlists include all of the variations called for in the original measures, but using fast fixed matching. (#1410)
  • Fixed problems with basic dfm operations (rowMeans(), rowSums(), colMeans(), colSums()) caused by not having access to the Matrix package methods. (#1428)
  • Fixed a problem in textplot_scale1d() when the input is a predicted wordscores object created with se.fit = TRUE (#1440).
  • Improved the stability of textplot_network(). (#1460)

New Features

  • Added new argument intermediate to textstat_readability(x, measure, intermediate = FALSE), which if TRUE returns intermediate quantities used in the computation of readability statistics. Useful for verification or direct use of the intermediate quantities.
  • Added a new separator argument to kwic() to allow a user to define which characters will be added between tokens returned from a keywords-in-context search. (#1449)
  • Reimplemented textstat_dist() and textstat_simil() in C++ for enhanced performance. (#1210)
  • Added a tokens_sample() function (#1478).
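
The two new arguments above can be tried out along the following lines. This is only a sketch: the sample sentence and the object names (txt, toks, kw, rd) are invented for illustration, and it assumes a quanteda 1.3.x installation (later major releases moved the textstat_*() functions into the separate quanteda.textstats package).

```r
library(quanteda)

# Illustrative text only
txt <- c(d1 = "The quick brown fox jumps over the lazy dog.")
toks <- tokens(txt)

# Join the context tokens with "_" instead of the default single space
kw <- kwic(toks, pattern = "fox", window = 2, separator = "_")
print(kw)

# Return the intermediate quantities used to compute the Flesch score
# alongside the score itself
rd <- textstat_readability(txt, measure = "Flesch", intermediate = TRUE)
print(rd)
```

Comparing the output of the second call with intermediate = FALSE is a quick way to see which extra columns are returned.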

Behaviour changes

  • Removed the Hamming distance method from textstat_dist() (#1443), based on the reasoning in #1442.
  • Removed the “chisquared” and “chisquared2” distance measures from textstat_dist(). (#1442)

quanteda 1.3.14 2018-11-19

Bug fixes and stability enhancements

  • Improved the robustness of textstat_keyness() (#1482).
  • Improved the accuracy of sparsity reporting for the print method of a dfm (#1473).
  • Diagonals on a textstat_simil() return object coerced to matrix now default to 1.0, rather than 0.0 (#1494).

New Features

  • Added the following measures to textstat_lexdiv(): Yule’s K, Simpson’s D, and Herdan’s Vm.
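
A minimal sketch of requesting the new measures follows. The toy texts and object names (txts, x, ld) are illustrative, and the measure labels "K", "D" and "Vm" are assumed to match those used in the textstat_lexdiv() documentation; as above, this assumes a quanteda 1.3.x installation rather than a later release where the function lives in quanteda.textstats.

```r
library(quanteda)

# Illustrative documents only
txts <- c(d1 = "the cat sat on the mat and the dog sat too",
          d2 = "one two three four five six seven eight")
x <- dfm(tokens(txts))

# Request Yule's K, Simpson's D and Herdan's Vm in a single call
ld <- textstat_lexdiv(x, measure = c("K", "D", "Vm"))
print(ld)
```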