Version 1.3 • quanteda

Added a to = "tripletlist" output type for convert(), which converts a dfm into a simple triplet list. (#1321)
Added tokens_tortl() and char_tortl() to add markers for right-to-left language tokens and character objects. (#1322)
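
As a rough illustration of the new convert() output type, the sketch below builds a small dfm and converts it to a triplet list. The toy texts and object names are invented for this example; the exact element names of the result are whatever the installed quanteda version returns, so the code only inspects them with str().

```r
library(quanteda)

# Toy documents, invented for illustration only
txt <- c(d1 = "a b b c", d2 = "b c c d")

# Build a dfm from a tokens object
x <- dfm(tokens(txt))

# Convert the dfm to a simple triplet list: one entry per non-zero cell
tl <- convert(x, to = "tripletlist")

# Inspect the structure of the result
str(tl)
```

A triplet list stores only the non-zero cells of the matrix, which makes it a convenient interchange format for other sparse-matrix tools.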

Improved corpus.kwic() by adding new arguments split_context and extract_keyword.
dfm_remove(x, selection = anydfm) is now equivalent to dfm_remove(x, selection = featnames(anydfm)). (#1320)
Improved consistency of predict.textmodel_nb() returns, and added type = argument. (#1329)

Fixed a bug in textmodel_affinity() that caused failure when the input dfm had been compiled with tolower = FALSE. (#1338)
Fixed a bug affecting tokens_lookup() and dfm_lookup() when nomatch is used. (#1347)
Fixed a problem whereby NA texts created a “document” (or tokens) containing the string "NA". (#1372)

Keep encodings of types when a tokens object is recompiled. (#1387)
More robust handling in predict.textmodel_wordscores() when training and test feature sets are different (#1380).
char_segment() and corpus_segment() are more robust to whitespace characters preceding a pattern (#1394).
tokens_ngrams() is more robust to handling large numbers of documents (#1395).
corpus.data.frame() is now robust to handling data.frame inputs with improper or missing variable names (#1388).

Added as.igraph.fcm() method for converting an fcm object into an igraph graph object.
Added a case_insensitive argument to char_segment() and corpus_segment().

Fixed a bug causing incorrect counting in fcm(x, ordered = TRUE). (#1413) Also relaxed the limit on window, which can now be of size 1 (formerly the minimum was 2).
Fixed deprecation warnings triggered by adding a dfm as docvars; doing so now automatically imports the feature names as docvar names. (related to #1417)
Fixed the behaviour of tokens(x, what = "fasterword", remove_separators = TRUE) so that it correctly splits words separated by \n and \t characters. (#1420)
Add error checking for functions taking dfm inputs in case a dfm has empty features (#1419).
For textstat_readability(), fixed a bug in Dale-Chall-based measures and in the Spache word list measure. These were caused by an incorrect lookup mechanism but also by limited implementation of the wordlists. The new wordlists include all of the variations called for in the original measures, but using fast fixed matching. (#1410)
Fixed problems with basic dfm operations (rowMeans(), rowSums(), colMeans(), colSums()) caused by not having access to the Matrix package methods. (#1428)
Fixed a problem in textplot_scale1d() when the input is a predicted wordscores object with se.fit = TRUE (#1440).
Improved the stability of textplot_network(). (#1460)
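
The fcm() change in this group (ordered counting, with a window of size 1 now permitted) can be tried with a sketch like the following. The one-line toy text is invented for illustration, and the call assumes a tokens object as input.

```r
library(quanteda)

# A single toy document, invented for illustration
toks <- tokens("a b c")

# Ordered co-occurrences within a window of one token:
# with ordered = TRUE, "before" and "after" positions are counted separately
f <- fcm(toks, context = "window", window = 1, ordered = TRUE)

# Print the resulting feature co-occurrence matrix
print(f)
```

With ordered counting the matrix is generally not symmetric, so it is worth checking the orientation of rows and columns on a small input like this before applying it to a real corpus.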

Added new argument intermediate to textstat_readability(x, measure, intermediate = FALSE), which if TRUE returns intermediate quantities used in the computation of readability statistics. Useful for verification or direct use of the intermediate quantities.
Added a new separator argument to kwic() to allow a user to define which characters will be added between tokens returned from a keywords in context search. (#1449)
Reimplemented textstat_dist() and textstat_simil() in C++ for enhanced performance. (#1210)
Added a tokens_sample() function (#1478).
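
A minimal sketch of two of the additions above, the separator argument to kwic() and the new tokens_sample() function. The texts, the "_" separator, and the seed are arbitrary choices made for this example.

```r
library(quanteda)

# Toy documents, invented for illustration only
txt <- c(d1 = "The quick brown fox jumps.",
         d2 = "A lazy dog sleeps.",
         d3 = "The fox and the dog play.")
toks <- tokens(txt)

# Keywords in context, joining the context tokens with "_" instead of a space
kw <- kwic(toks, pattern = "fox", window = 2, separator = "_")
print(kw)

# Randomly sample two documents from the tokens object
set.seed(42)  # arbitrary seed, only to make the sample reproducible
samp <- tokens_sample(toks, size = 2)
ndoc(samp)
```

A non-space separator can be useful when the context strings are later split back into tokens, since it avoids ambiguity with tokens that themselves contain spaces.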

Removed the Hamming distance method from textstat_dist() (#1443), based on the reasoning in #1442.
Removed the “chisquared” and “chisquared2” distance measures from textstat_simil(). (#1442)

Improved the robustness of textstat_keyness() (#1482).
Improved the accuracy of sparsity reporting for the print method of a dfm (#1473).
Diagonals on a textstat_simil() return object coerced to matrix now default to 1.0, rather than 0.0 (#1494).

Added the following measures to textstat_lexdiv(): Yule’s K, Simpson’s D, and Herdan’s Vm.
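
For readers unfamiliar with these measures, the base R sketch below computes Yule's K and Simpson's D from their textbook definitions on a toy token vector. It is an illustration of the formulas under those definitions, not a description of how textstat_lexdiv() computes them internally; the variable names are invented for this example.

```r
# Toy token vector, invented for illustration
toks <- c("a", "b", "b", "c", "c", "c")

freq <- as.numeric(table(toks))  # frequency of each type
n <- sum(freq)                   # total number of tokens, N

# Yule's K: 10^4 * (sum of squared type frequencies - N) / N^2
yule_k <- 1e4 * (sum(freq^2) - n) / n^2

# Simpson's D: probability that two tokens drawn without
# replacement belong to the same type
simpson_d <- sum(freq * (freq - 1)) / (n * (n - 1))

print(c(K = yule_k, D = simpson_d))
```

Both measures grow as the text becomes more repetitive, so lower values indicate greater lexical diversity.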
