New features

  • textstat_keyness() now returns a data.frame with p-values as well as the test statistic, and rownames containing the feature. This is more consistent with the other textstat functions.
  • tokens_lookup() implements new rules for nested and linked sequences in dictionary values. See #502.
  • tokens_compound() has a new join argument for better handling of nested and linked sequences. See #517.
  • Internal operations on tokens are now significantly faster due to a reimplementation of the hash table functions in C++. (#510)
  • dfm() now works with multi-word dictionaries and thesauruses, which previously worked only with tokens_lookup().
  • fcm() is now parallelized for improved performance on multi-core systems.
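
As a brief sketch of the revised textstat_keyness() return format (a hypothetical two-document example; exact values depend on the measure chosen):

```r
library(quanteda)

# two documents: the first is the target, the second the reference
corp <- corpus(c(d1 = "cats cats dogs fish", d2 = "dogs dogs fish fish birds"))
mydfm <- dfm(corp)

# returns a data.frame with the test statistic and p-value,
# and the features in the rownames
kn <- textstat_keyness(mydfm, target = "d1")
head(kn)
```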

Bug fixes

  • Fixed C++ build failures on older platforms caused by compiler incompatibilities with the required TBB libraries (for multi-threading) (#531, #532, #535), in addition to safeguarding against other compiler warnings across a variety of newly tested undefined behaviours.
  • Fixed a bug in convert(x, to = "lsa") that transposed row and column names (#526)
  • Added missing fcm() method for corpus objects (#538)
  • Fixed some minor issues with reading in Lexicoder format dictionaries, improving Lexicoder dictionary handling overall.

quanteda 0.9.9-3

Bug fixes

  • Fixed a bug causing dfm and tokens to break on > 10,000 documents. (#438)
  • Fixed a bug in tokens(x, what = "character", removeSeparators = TRUE) that returned an empty string.
  • Fixed a bug in corpus.VCorpus if the VCorpus contains a single document. (#445)
  • Fixed a bug in dfm_compress in which the function failed on documents that contained zero feature counts. (#467)
  • Fixed a bug in textmodel_NB that caused the class priors Pc to be refactored alphabetically instead of in the order of assignment (#471), also affecting predicted classes (#476).

New features

  • New textstat function textstat_keyness() discovers words that occur at differential rates between partitions of a dfm (using chi-squared, Fisher’s exact test, and the G^2 likelihood ratio test to measure the strength of associations).
  • Added the 2017 Trump inaugural address to the inaugural corpus datasets (data_corpus_inaugural and data_char_inaugural).
  • Improved the groups argument in texts() (and in dfm(), which uses this function): it now coerces its argument to a factor rather than requiring one.
  • Added a dfm constructor from dfm objects, with the option of collapsing by groups.
  • Added new arguments to sequences(): ordered and max_length, the latter to prevent memory leaks from extremely long sequences.
  • dictionary() now accepts YAML as an input file format.
  • dfm_lookup and tokens_lookup now accept a levels argument to determine which level of a hierarchical dictionary should be applied.
  • Added min_nchar and max_nchar arguments to dfm_select.
  • dictionary() can now be called directly on its arguments without explicitly wrapping them in list().
  • fcm now works directly on a dfm object when context = "documents".
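
A minimal sketch of applying one level of a hierarchical dictionary with the new levels argument (the dictionary keys and values here are invented for illustration):

```r
library(quanteda)

dict <- dictionary(list(
  animals = list(mammals = c("cat", "dog"), fish = c("carp"))
))
toks <- tokens("the cat chased the carp")

# match at the top level only: both "cat" and "carp" count as "animals"
tokens_lookup(toks, dict, levels = 1)
# match at the second level: counts appear under "mammals" and "fish"
tokens_lookup(toks, dict, levels = 2)
```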

This release has some major changes to the API, described below.

Data objects

Renamed data objects

new name               original name   notes
data_char_sampletext   exampleString
data_char_mobydick     mobydickText
data_dfm_LBGexample    LBGexample

Renamed internal data objects

The following objects have been renamed, but will not affect user-level functionality because they are primarily internal. Their man pages have been moved to a common ?data-internal man page, hidden from the index, but linked from some of the functions that use them.

new name              original name      notes
data_int_syllables    englishSyllables   (used by textcount_syllables())
data_char_wordlists   wordlists          (used by readability())
data_char_stopwords   .stopwords         (used by stopwords())

Deprecated data objects

In v0.9.9 the old names remain available but are deprecated.

new name                      original name   notes
data_char_ukimmig2010         ukimmigTexts
data_corpus_irishbudget2010   ie2010Corpus
data_char_inaugural           inaugTexts
data_corpus_inaugural         inaugCorpus

Deprecated functions

The following functions will still work, but issue a deprecation warning:

new function           deprecated function                      constructs
tokens                 tokenize()                               tokens class object
corpus_subset          subset.corpus                            corpus class object
corpus_reshape         changeunits                              corpus class object
corpus_sample          sample                                   corpus class object
corpus_segment         segment                                  corpus class object
dfm_compress           compress                                 dfm class object
dfm_lookup             applyDictionary                          dfm class object
dfm_remove             removeFeatures.dfm                       dfm class object
dfm_sample             sample.dfm                               dfm class object
dfm_select             selectFeatures.dfm                       dfm class object
dfm_smooth             smoother                                 dfm class object
dfm_sort               sort.dfm                                 dfm class object
dfm_trim               trim.dfm                                 dfm class object
dfm_weight             weight                                   dfm class object
textplot_wordcloud     plot.dfm                                 (plot)
textplot_xray          plot.kwic                                (plot)
textstat_readability   readability                              data.frame
textstat_lexdiv        lexdiv                                   data.frame
textstat_simil         similarity                               dist
textstat_dist          similarity                               dist
featnames              features                                 character
nsyllable              syllables                                (named) integer
nscrabble              scrabble                                 (named) integer
tokens_ngrams          ngrams                                   tokens class object
tokens_skipgrams       skipgrams                                tokens class object
tokens_toupper         toUpper.tokens, toUpper.tokenizedTexts   tokens, tokenizedTexts
tokens_tolower         toLower.tokens, toLower.tokenizedTexts   tokens, tokenizedTexts
char_toupper           toUpper.character                        character
char_tolower           toLower.character                        character
tokens_compound        joinTokens, phrasetotoken                tokens class object

New functions

The following are new to v0.9.9 (and not associated with deprecated functions):

new function   description                                          output class
fcm()          constructor for a feature co-occurrence matrix       fcm
fcm_select     selects features from an fcm                         fcm
fcm_remove     removes features from an fcm                         fcm
fcm_sort       sorts an fcm in alphabetical order of its features   fcm
fcm_compress   compacts an fcm                                      fcm
fcm_tolower    lowercases the features of an fcm and compacts it    fcm
fcm_toupper    uppercases the features of an fcm and compacts it    fcm
dfm_tolower    lowercases the features of a dfm and compacts it     dfm
dfm_toupper    uppercases the features of a dfm and compacts it     dfm
sequences      experimental collocation detection                   sequences
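
A minimal usage sketch for the new fcm() constructor (toy text for illustration):

```r
library(quanteda)

toks <- tokens("a b c a b c")
# co-occurrence counted within a window of 2 tokens
myfcm <- fcm(toks, context = "window", window = 2)
# lowercase the features and compact the resulting fcm
fcm_tolower(myfcm)
```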

Deleted functions and data objects

name            reason
describeTexts   deprecated several versions ago for summary.character
textfile        moved to the readtext package
encodedTexts    moved to the readtext package, as data_char_encodedtexts
findSequences   replaced by sequences

Other new features

  • to = "lsa" functionality added to convert() (#414)
  • Much faster pattern matching in general, through an overhaul of how valuetype matches work for many functions.
  • Added experimental View methods for kwic objects, based on Javascript Datatables.
  • kwic is completely rewritten, now uses fast hashed index matching in C++ and fully implements vectorized matches (#306) and all valuetypes (#307).
  • tokens_lookup, tokens_select, and tokens_remove are faster and use parallelization (based on the TBB library).
  • textstat_dist and textstat_simil add fast, sparse, and parallel computation of many new distance and similarity matrices.
  • Added textmodel_wordshoal fitting function.
  • Add max_docfreq and min_docfreq arguments, and better verbose output, to dfm_trim (#383).
  • Added support for batch hashing of tokens through tokens(), for more memory-efficient token hashing when dealing with very large numbers of documents.
  • Added support for in-memory compressed corpus objects.
  • Consolidated corpus-level metadata arguments in corpus() through the metacorpus list argument.
  • Added Greek stopwords. (See #282).
  • Added index handling [, [[, and $ for (hashed) tokens objects.
  • Plotting now uses the ggplot2 package.
  • Added tokens methods for collocations() and kwic().
  • Much improved performance for tokens_select() (formerly selectFeatures.tokens()).
  • Improved ngrams() and joinTokens() performance for hashed tokens class objects.
  • Improved dfm.character() by using new tokens() constructor to create hashed tokenized texts by default when creating a dfm, resulting in performance gains when constructing a dfm. Creating a dfm from a hashed tokens object is now 4-5 times faster than the older tokenizedTexts object.
  • Added new (hashed) tokens class object.
  • Added plot method for fitted textmodel_wordscores objects.
  • Added fast tokens_lookup() method (formerly applyDictionary()), that also works with dictionaries that have multi-word keys. Addresses but does not entirely yet solve #188.
  • Added sparsity() function to compute the sparsity of a dfm.
  • Added feature co-occurrence matrix functions (fcm).

New features

  • corpus_reshape() can now go from sentences and paragraph units back to documents.
  • Added a by = argument to corpus_sample(), for use in bootstrap resampling of sub-document units.
  • Added an experimental method bootstrap_dfm() to generate a list of dimensionally-equivalent dfm objects based on sentence-level resampling of the original documents.
  • Added option to tokens() and dfm() for passing docvars through to tokens and dfm objects, and added docvars() and metadoc() methods for tokens and dfm class objects. Overall, the code for docvars and metadoc is now more robust and consistent.
  • docvars() on eligible objects that contain no docvars now returns an empty 0 x 0 data.frame (in the spirit of #242).
  • Redesigned textmodel_scale1d now produces sorted and grouped document positions for fitted wordfish models, and produces a ggplot2 plot object.
  • textmodel_wordfish() now preserves sparsity while processing the dfm, and uses a fast approximation to an SVD to get starting values. This also dramatically improves performance in computing this model. (#482, #124)
  • The speed of kwic() is now dramatically improved, and also returns an indexed set of tokens that makes subsequent commands on a kwic class object much faster. (#603)
  • Package options (for verbose, threads) can now be set or queried using quanteda_options().
  • Improved performance and better documentation for corpus_segment(). (#634)
  • Added functions corpus_trimsentences() and char_trimsentences() to remove sentences from a corpus or character object, based on token length or pattern matching.
  • Added options to textstat_readability(): min_sentence_length and max_sentence_length. (#632)
  • Indexing now works for dictionaries, for slicing out keys and values ([), or accessing values directly ([[). (#651)
  • Began the consolidation of collocation detection and scoring into a new function textstat_collocations(), which combines the existing collocations() and sequences() functions. (#434) Collocations now behave as sequences for other functions (such as tokens_compound()) and have a greatly improved performance for such uses.
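
Setting and querying package options, as a short sketch:

```r
library(quanteda)

quanteda_options(threads = 2, verbose = TRUE)   # set options
quanteda_options("threads")                     # query a single option
quanteda_options()                              # list all current options
```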

Behaviour changes

  • docvars() now permits direct access to “metadoc” fields (starting with _, e.g. _document)
  • metadoc() now returns a vector instead of a data.frame for a single variable, similar to docvars()
  • Most verbose options now take the default from getOption("verbose") rather than fixing the value in the function signatures. (#577)
  • textstat_dist() and textstat_simil() now return a matrix if a selection argument is supplied, and coercion to a list produces a list of distances or similarities only for that selection.
  • All remaining camelCase arguments are gone. For commonly used ones, such as those in tokens(), the old arguments (e.g. removePunct) still produce the same behaviour but with a deprecation warning.
  • Added n_target and n_reference columns to textstat_keyness() to return counts for each category being compared for keyness.

Bug fixes

  • Fixed a problem in tokens generation for some irregular characters (#554).
  • Fixed a problem in setting the parallel thread size on single-core machines (#556).
  • Fixed problems for str() on a corpus with no docvars (#571).
  • removeURL in tokens() now removes URLs where the first part of the URL is a single letter (#587).
  • dfm_select now works correctly for ngram features (#589).
  • Fixed a bug crashing corpus constructors for character vectors with duplicated names (the cause of #580).
  • Fixed a bug in the behaviour for dfm_select(x, features) when features was a dfm, that failed to produce the intended featnames matches for the output dfm.
  • Fixed a bug in corpus_segment(x, what = "tags") when a document contained a whitespace just before a tag, at the beginning of the file, or ended with a tag followed by no text (#618, #634).
  • Fixed some problems with dictionary construction and reading some dictionary formats (#454, #455, #459).

New features

  • Corpus construction using corpus() now works for a tm::SimpleCorpus object. (#680)
  • Added corpus_trim() and char_trim() functions for selecting documents or subsets of documents based on sentence, paragraph, or document lengths.
  • Conversion of a dfm to an stm object now passes docvars through in the $meta of the return object.
  • New dfm_group(x, groups = ) command, a convenience wrapper around dfm.dfm(x, groups = ) (#725).
  • Methods for extending quanteda functions to readtext objects updated to match CRAN release of readtext package.
  • Corpus constructor methods for data.frame objects now conform to the “text interchange format” for corpus data.frames, automatically recognizing doc_id and text fields, which also provides interoperability with the readtext package. corpus construction methods are now more explicitly tailored to input object classes.
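
A minimal sketch of corpus construction from a text-interchange-format data.frame (the column values here are invented for illustration):

```r
library(quanteda)

df <- data.frame(doc_id = c("d1", "d2"),
                 text   = c("First text.", "Second text."),
                 year   = c(2016, 2017),
                 stringsAsFactors = FALSE)

# doc_id and text are recognized automatically;
# remaining columns become docvars
corp <- corpus(df)
docvars(corp)
```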

Bug fixes and stability enhancements

  • dfm_lookup() behaves more robustly on different platforms, especially for keys whose values match no features (#704).
  • textstat_simil() and textstat_dist() no longer take the n argument, as this was not sorting features in correct order.
  • Fixed failure of tokens(x, what = "character") when x included Twitter characters @ and # (#637).
  • Fixed bug #707 where ntype.dfm() produced an incorrect result.
  • Fixed bug #706 where textstat_readability() and textstat_lexdiv() failed for single-document returns when drop = TRUE.
  • Improved the robustness of corpus_reshape().
  • The print, head, and tail methods for dfm objects are more robust (#684).
  • Fixed bug in convert(x, to = "stm") caused by zero-count documents and zero-count features in a dfm (#699, #700, #701). This also removes docvar rows from $meta when this is passed through the dfm, for zero-count documents.
  • Corrected broken handling of nested Yoshikoder dictionaries in dictionary(). (#722)
  • dfm_compress now preserves a dfm’s docvars if collapsing only on the features margin, which means that dfm_tolower() and dfm_toupper() no longer remove the docvars.
  • fcm_compress() now retains the fcm class, and generates an error when an asymmetric compression is attempted (#728).
  • textstat_collocations() now returns the collocations as character, not as a factor (#736)
  • Fixed a bug in dfm_lookup(x, exclusive = FALSE) wherein an empty dfm was returned when there was no match (#116).
  • Argument passing through dfm() to tokens() is now robust, and preserves variables defined in the calling environment (#721).
  • Fixed issues related to dictionaries failing when applying str(), names(), or other indexing operations, which started happening on Linux and Windows platforms following the CRAN move to 3.4.0. (#744)
  • Dictionary import using the LIWC format is more robust to improperly formatted input files (#685).
  • Weights applied using dfm_weight() now print friendlier error messages when the weight vector contains features not found in the dfm. See this Stack Overflow question for the use case that sparked this improvement.

New features

  • Improvements and consolidation of methods for detecting multi-word expressions, now active only through textstat_collocations(), which computes only the lambda method for now, but does so accurately and efficiently. (#753, #803). This function is still under development and likely to change further.
  • Added new quanteda_options that affect the maximum documents and features displayed by the dfm print method (#756).
  • ngram formation is now significantly faster, including with skips (skipgrams).
  • Improvements to topfeatures():
    • now accepts a groups argument that can be used to generate lists of top (or bottom) features in a group of texts, including by document (#336).
    • new argument scheme that takes the default of (frequency) "count" but also a new "docfreq" value (#408).
  • New wrapper phrase() converts whitespace-separated multi-word patterns into a list of patterns. This affects the feature/pattern matching in tokens/dfm_select/remove, tokens_compound, tokens/dfm_lookup, and kwic. phrase() and the associated changes also make the behaviour of using character vectors, lists of characters, dictionaries, and collocation objects for pattern matches far more consistent. (See #820, #787, #740, #837, #836, #838)
  • corpus.Corpus() for creating a corpus from a tm Corpus now works with more complex objects that include document-level variables, such as data from the manifestoR package (#849).
  • New plot function textplot_keyness() plots term “keyness”, the association of words with contrasting classes as measured by textstat_keyness().
  • Added corpus constructor for corpus objects (#690).
  • Added dictionary constructor for dictionary objects (#690).
  • Added a tokens constructor for tokens objects (#690), including updates to tokens() that improve the consistency and efficiency of the tokenization.
  • Added new quanteda_options(): language_stemmer and language_stopwords, now used for default in *_wordstem functions and stopwords() for defaults, respectively. Also uses this option in dfm() when stem = TRUE, rather than hard-wiring in the “english” stemmer (#386).
  • Added a new function textstat_frequency() to compile feature frequencies, possibly by groups. (#825)
  • Added nomatch option to tokens_lookup() and dfm_lookup(), to provide tokens or feature counts for categories not matched to any dictionary key. (#496)
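
A short sketch of phrase() for multi-word pattern matching:

```r
library(quanteda)

toks <- tokens("The United Kingdom is leaving the European Union")

# without phrase(), each pattern element is matched as a single token;
# phrase() converts whitespace-separated patterns into multi-word matches
tokens_select(toks, pattern = phrase(c("United Kingdom", "European Union")))
kwic(toks, pattern = phrase("European Union"))
```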

Behaviour changes

  • The functions sequences() and collocations() have been removed and replaced by textstat_collocations().
  • (Finally) we added “will” to the list of English stopwords (#818).
  • dfm objects with one or both dimensions having zero length, and empty kwic objects now display more appropriately in their print methods (per #811).
  • Pattern matches are now implemented more consistently across functions. In functions such as *_select, *_remove, and tokens_compound, features has been replaced by pattern, and in kwic, keywords has been replaced by pattern. These all behave consistently with respect to pattern, which now has a unified single help page and parameter description. (#839) See also the new features above related to phrase().
  • We have improved the performance of the C++ routines that handle many of the tokens_* functions using hashed tokens, making some of them 10x faster (#853).
  • Upgrades to the dfm_group() function now allow “empty” documents to be created using the fill = TRUE option, for making documents conform to a selection (similar to how dfm_select() works for features, when supplied a dfm as the pattern argument). The groups argument now behaves consistently across the functions where it is used. (#854)
  • dictionary() now requires its main argument to be a list, not a series of elements that can be used to build a list.
  • Some changes to the behaviour of tokens() have improved the behaviour of remove_hyphens = FALSE, which now behaves more correctly regardless of the setting of remove_punct (#887).
  • Improved cbind.dfm() function allows cbinding vectors, matrixes, and (recyclable) scalars to dfm objects.
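
A sketch of dfm_group() with the new fill = TRUE behaviour (toy grouping variable for illustration):

```r
library(quanteda)

mydfm <- dfm(c(a = "x y", b = "y z", c = "z z"))
grps <- factor(c("g1", "g2", "g2"), levels = c("g1", "g2", "g3"))

# fill = TRUE creates an "empty" document for the unused level g3
dfm_group(mydfm, groups = grps, fill = TRUE)
```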

Bug fixes and stability enhancements

  • For the underlying methods behind textstat_collocations(), we corrected the word matching, and lambda and z calculation methods, which were slightly incorrect before. We also removed the chi2, G2, and pmi statistics, because these were incorrectly calculated for size > 2.
  • LIWC-formatted dictionary import is now robust to term assignment to missing categories.
  • textmodel_NB(x, y, distribution = "Bernoulli") was previously inactive even when this option was set. It has now been fully implemented and tested (#776, #780).
  • Separators including rare spacing characters are now handled more robustly by the remove_separators argument in tokens(). See #796.
  • Improved memory usage when computing ntoken() and ntype(). (#795)
  • quanteda_options() no longer throws an error when quanteda functions are called directly without attaching the package. In addition, quanteda options can now be set in .Rprofile and will not be overwritten when options initialization takes place upon attaching the package.
  • Fixed a bug in textstat_readability() that wrongly computed the number of words with fewer than 3 syllables in a text; this affected the FOG.NRI and the Linsear.Write measures only.
  • Fixed mistakes in the computation of two docfreq schemes: "logave" and "inverseprob".
  • Fixed a bug in the handling of multi-thread options, where settings made via quanteda_options() did not actually set the number of threads. In addition, we fixed a bug (a check for a gcc version not used to compile the macOS binaries) that turned threading off on macOS, preventing multi-threading from being used at all on that platform.
  • Fixed a bug causing failure when functions that use quanteda_options() are called without the namespace or package being attached or loaded (#864).
  • Fixed a bug in overloading the View method that caused all named objects in the RStudio/Source pane to be named “x”. (#893)