Added flatten and levels arguments to as.list.dictionary2() to enable more flexible conversion of dictionary objects. (#1661)
In corpus_sample(), the size argument now works with the by argument, to control the size of units sampled from each group.
textstat_dist() and textstat_simil(): see below.
tokens(). (#1713)
textstat_dist() and textstat_simil() now return sparse symmetric matrix objects using classes from the Matrix package. This replaces the former structure based on the dist class. The computation now also uses the fast implementation in the proxyC package. When computing similarities, the new min_simil argument lets a user ignore values below a specified similarity threshold. A new coercion method, as.data.frame.textstat_simildist(), converts these return objects into a data.frame of pairwise comparisons. Existing methods such as as.matrix(), as.dist(), and as.list() work as they did before.
textstat_dist() and textstat_simil() because these were either not symmetric or not invariant to document or feature ordering. Finally, the selection argument has been deprecated in favour of a new y argument.
textstat_readability() now defaults to measure = "Flesch" if no measure is supplied. This makes it consistent with textstat_lexdiv(), which also takes a default measure ("TTR") if none is supplied. (#1715)
The defaults for max_nchar and min_nchar in tokens_select() are now NULL, meaning they are not applied if the user does not supply values. Fixes #1713.
kwic.corpus() and kwic.tokens() behaviour is now aligned, meaning that dictionaries are correctly faceted by key instead of by value. (#1684)
tokens() verbose output. (#1683)
textstat_readability(). (#1701)
textstat_dist() and textstat_simil(). (#1730)
textstat_dist() and textstat_simil() class symmetric matrices.
textstat_lexdiv().
Added featfreq() to compute the overall feature frequencies from a dfm.
tokens_lookup() when exclusive = FALSE and the tokens object has paddings. (#1743)
tokens_replace() (#1765).
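
The textstat_simil() changes noted above can be sketched as follows. This is a minimal illustration, not taken from these notes: it assumes a quanteda release from this series, in which textstat_simil() is exported by quanteda itself (later releases may locate it in a separate package). The bundled data_corpus_inaugural corpus, the cosine method, and the 0.5 threshold are arbitrary choices made for the example.

```r
# Assumption: a quanteda release from this series, where textstat_simil()
# is exported by quanteda itself.
library(quanteda)

# Small document-feature matrix built from the inaugural speeches
# that ship with quanteda (illustrative data only).
dfmat <- dfm(corpus_subset(data_corpus_inaugural, Year > 2000))

# Cosine similarity between documents. Pairs scoring below 0.5 are
# left out of the sparse result via the min_simil argument.
simil <- textstat_simil(dfmat, method = "cosine", margin = "documents",
                        min_simil = 0.5)

# Coerce to a data.frame of pairwise comparisons: one row per retained pair.
pairs <- as.data.frame(simil)
head(pairs)

# Coercion to a dense matrix is still available, as before.
m <- as.matrix(simil)
```

The data.frame form is convenient for filtering or joining pairs, while as.matrix() suits code written against the earlier return structure.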