Added block_size to quanteda_options() to control the number of documents in blocked tokenization.
Added print.dictionary2() to control the printing of nested levels with max_nkey. (#1967)
Added textstat_summary() to provide detailed information about dfm, tokens and corpus objects. It will replace summary() in future versions.
[...] (what = "word") corpora with large numbers of documents that contain social media tags and URLs that needed to be preserved (such as a large corpus of Tweets).
[...] quanteda_options(). The following are now preserved: “#政治” as well as Weibo-style hashtags such as “#英国首相#”.
convert(x, to = "data.frame") now outputs the first column as “doc_id” rather than “document”, since “document” is a commonly occurring term in many texts. (#1918)
Added char_select(), char_keep(), and char_remove() for easy manipulation of character vectors.
Added dictionary_edit() for easy, interactive editing of dictionaries, plus the functions char_edit() and list_edit() for editing character and list of character objects.
Added a method to textplot_wordcloud() that plots objects from textstat_keyness(), to visualize keywords either by comparison or for the target category only.
[...] kwic(). (#1840)
Added logsmooth scheme to dfm_weight().
Added textstat_summary() method, which returns summary information about the tokens/types/features etc. in an object. It also caches summary information so that it can be retrieved on subsequent calls rather than recomputed.
[...] NA for non-existent features when n > nfeat(x) in textstat_frequency(x, n). (#1929)
Fixed a bug in dfm_lookup() and tokens_lookup() in which an error was caused when no dictionary key returned a single match. (#1946)
Fixed a bug that caused a textstat_simil/dist object converted to a data.frame to drop its document2 labels. (#1939)
Fixed a bug that caused dfm_match() to fail on a dfm that included “pads” (""). (#1960)
Updated the data_dfm_lbgexample object using more modern dfm internals.
Updated textstat_readability(), textstat_lexdiv(), and nscrabble() so that empty texts are not dropped in the result. (#1976)
corpus_reshape() now allows reshaping back to documents even when segmented texts were of zero length. (#1978)
[...] summary.corpus()/textstat_summary().
textstat_keyness() performance is now improved through implementation in (multi-threaded) C++.
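
A few of the entries above (char_keep() and char_remove(), the “doc_id” column from convert(), and the logsmooth scheme in dfm_weight()) can be tried out with a short sketch like the one below. It assumes a quanteda installation at or after this release. The keyword vector and the two toy documents are hypothetical inputs made up for illustration, not taken from the release notes.

```r
library(quanteda)

# Hypothetical toy inputs, for illustration only
keywords <- c("natural", "national", "denatured", "other")

# char_keep() / char_remove(): select or drop elements matching a pattern
# (glob-style matching is assumed here, as the default valuetype)
kept    <- char_keep(keywords, "nat*")
removed <- char_remove(keywords, "nat*")

# Build a small dfm from two made-up documents
toks  <- tokens(c(d1 = "a b b c", d2 = "b c d"))
dfmat <- dfm(toks)

# convert() to a data.frame: the first column should now be named "doc_id"
df <- convert(dfmat, to = "data.frame")
first_col <- names(df)[1]

# Apply the logsmooth weighting scheme; see the dfm_weight() documentation
# for its exact definition and tuning arguments
weighted <- dfm_weight(dfmat, scheme = "logsmooth")

print(kept)
print(removed)
print(first_col)
print(weighted)
```

Code that still refers to the first column of the converted data.frame as “document” would need to be updated to “doc_id” after this change.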