• trim() now accepts proportions in addition to integer thresholds. Also accepts a new sparsity argument, which works like tm’s removeSparseTerms(x, sparse = ) (for those who really want to think of sparsity this way).

  • [i] and [i, j] indexing of corpus objects is now possible, for extracting texts or docvars using convenient notation. See ?corpus Details.

  • ngrams() and skipgrams() now use the same underlying function, with skip replacing the previous window argument (where skip = window - 1). For efficiency, both are now implemented in C++.

  • tokenize() has a new argument, removeHyphens, that controls the treatment of intra-word hyphens.

  • Added new measures to readability() that return mean syllables per word and mean words per sentence directly.

  • wordstem now works on ngrams (tokenizedTexts and dfm objects).

  • Enhanced operation of kwic(), including the definition of a kwic class object, and a plot method for this object (produces a dispersion plot).

  • Much more error checking of arguments passed through … (which may be misspecified or misspelled). Addresses Issue #62.

  • Almost all functions are now methods defined for object classes, dispatched from a generic.

  • texts(x, groups = ) now allows groups to be factors, not just document variable labels. There is a new method for texts.character(x, groups = ) which is useful for supplying a factor to concatenate character objects by group.
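The grouping behaviour described for texts.character(x, groups = ) can be pictured with base R alone. The sketch below is not the package's implementation: the helper name group_texts and the sample data are hypothetical, and it assumes that texts in the same group are joined with a single space.

```r
# Hypothetical helper: concatenate character elements by the levels of a factor.
# Assumption: texts within a group are joined by a single space.
group_texts <- function(x, groups) {
  stopifnot(is.character(x), length(x) == length(groups))
  groups <- as.factor(groups)
  # split() yields one character vector per factor level; paste() collapses each
  vapply(split(x, groups), paste, character(1), collapse = " ")
}

txts <- c("a b", "c d", "e f", "g h")
party <- factor(c("left", "right", "left", "right"))
grouped <- group_texts(txts, party)
print(grouped)
```

The result is a named character vector with one element per factor level, which is the shape one would expect when concatenating documents by group.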

Bug fixes

  • corrected inaccurate printing of valuetype in verbose note of selectFeatures.dfm(). (Did not affect functionality.)

  • fixed broken quanteda.R demo, expanded demonstration code.

  • added new methods for similarity(), including sparse matrix computation for method = "correlation" and "cosine". (More planned soon.) Also allows easy conversion to a matrix using as.matrix() on similarity lists.

  • more robust implementation of LIWC-formatted dictionary file imports

  • better implementation of tf-idf, and relative frequency weighting, especially for very large sparse matrix objects. tf(), idf(), and tfidf() now provide relative term frequency, inverse document frequency, and tf-idf directly.

  • textmodel_wordfish() now accepts an integer dispersionFloor argument to constrain the phi parameter to a minimum value (of underdispersion).

  • textfile() now takes a vector of filenames, if you wish to construct these yourself. See ?textfile examples.

  • removeFeatures() and selectFeatures.collocations() now all use a consistent interface and same underlying code, with removeFeatures() acting as a wrapper to selectFeatures().

  • convert(x, to = "stm") now about 3-4x faster because it uses index positions from the dgCMatrix to convert to the sparse matrix format expected by stm.
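For readers who want to see what relative term frequency, inverse document frequency, and tf-idf amount to, here is a minimal base R sketch on a toy count matrix. It does not call tf(), idf(), or tfidf(); the data are made up, and the idf definition used (log10 of the number of documents over the document frequency) is an assumption that may differ from the package's exact scheme.

```r
# Toy document-feature counts: 3 documents (rows) by 3 features (columns).
counts <- matrix(c(2, 0, 1,
                   1, 1, 0,
                   1, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"), c("tax", "vote", "law")))

# Relative term frequency: each row divided by its total count.
rel_tf <- counts / rowSums(counts)

# Inverse document frequency, ASSUMING log10(N / document frequency).
doc_freq <- colSums(counts > 0)
inv_df <- log10(nrow(counts) / doc_freq)

# tf-idf: scale each column of the relative frequencies by its idf.
tf_idf <- sweep(rel_tf, 2, inv_df, `*`)
print(round(tf_idf, 3))
```

Under this definition a feature that occurs in every document (here "tax") receives an idf of zero and so drops out of the weighted matrix.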

Bug fixes

  • Fixed a bug in textfile() preventing encodingFrom and encodingTo from working properly.

  • Fixed a nasty bug in convert(x, to = "stm") that mixed up the word indexes. Thanks Felix Haass for spotting this!

  • Fixed a problem where wordstem was not working on ngram=1 tokenized objects

  • Fixed a bug in toLower(x, keepAcronyms = TRUE) that caused an error when x contained no acronyms.

  • Creating a corpus from a tm VCorpus now works if a “document” is a vector of texts rather than a single text

  • Fixed a bug in texts(x, groups = MORE THAN ONE DOCVAR), which now groups correctly on combinations of multiple groups.

  • Added presidents’ first names to inaugCorpus

  • Added textmodel implementation of multinomial and Bernoulli Naive Bayes.

  • Improved documentation.

  • Added c.corpus() method for concatenating arbitrarily large sets of corpus objects.

  • Default for similarity() is now margin = "documents" – prevents overly massive results if selection = NULL.

  • Defined rowMeans() and colMeans() methods for dfm objects.

  • Enhancements to summary.character() and summary.corpus(): Added n = to summary.character(); added pass-through options to tokenize() in summary.corpus() and summary.character() methods; added toLower as an argument to both.

  • Enhancements to corpus object indexing, including [[ and [[<-.

Bug fixes

  • Fixed a bug preventing smoother() from working.

  • Fixed a bug in segment.corpus(x, what = "tag") that was failing to recover the tag values after the first text.

  • Fix bug in plot.dfm(x, comparison = TRUE) method causing warning about rowMeans() failing.

  • Fixed an issue for mfdict <- dictionary(file = "http://ow.ly/VMRkL", format = "LIWC") causing it to fail because of the irregular combination of tabs and spaces in the dictionary file.

  • Fixed an exception thrown by wordstem.character(x) if one element of x was NA.

  • dfm() on a text or tokenized text containing an NA element now returns a row with 0 feature counts. Previously it returned a count of 1 for an NA feature.

  • Fixed issue #91, in which removeHyphens = FALSE did not work in tokenise for some words with multiple intra-word hyphens, such as “one-of-a-kind”.

  • Fixed a bug in as.matrix.similMatrix() that caused scrambled conversion when the feature sets compared were unequal, which normally occurs when calling similarity(x, n = <something>) with n < nfeature(x).

  • Fixed a bug in which a corpusSource object (from textfile()) with empty docvars prevented this argument from being supplied to corpus(corpusSourceObject, docvars = something).

  • Fixed inaccurate documentation for weight(), which previously listed unavailable options.

  • More accurate and complete documentation for tokenize().

  • traps an exception when calling wordstem.tokenizedTexts(x) where x was not word tokenized.

  • Fixed a bug in textfile() that prevented passthrough arguments in …, such as fileEncoding = or encoding =

  • Fixed a bug in textfile() that caused exceptions with input documents containing docvars when there was only a single column of docvars (such as .csv files)

  • Improved Naive Bayes model and prediction, textmodel(x, y, method = "NB"), now works correctly on k > 2.

  • Improved tag handling for segment(x, what = "tags")

  • Added valuetype argument to segment() methods, which allows faster and more robust segmentation on large texts.

  • corpus() now converts all hyphen-like characters to simple hyphen

  • segment.corpus() now preserves all existing docvars.

  • The corpus documentation no longer describes the corpus object’s structure, since too many users were accessing these internal elements directly. This is strongly discouraged, as we are likely to change the corpus internals (soon and often). Repeat after me: “encapsulation”.

  • Improve robustness of corpus.VCorpus() for constructing a corpus from a tm Corpus object.

  • Add UTF-8 preservation to ngrams.cpp.

  • Fix encoding issues for textfile(), improve functionality.

  • Added two data objects: Moby Dick is now available as mobydickText, without needing to access a zipped text file; encodedTextFiles.zip is now a zipped archive of different encodings of (mainly) the UN Declaration of Human Rights, for testing conversions from 8-bit encodings in different (non-Roman) languages.

  • phrasetotoken() now has a method correctly defined for corpus class objects.

  • lexdiv() now works just like readability(), and is faster (based on data.table) and the code is simpler.

  • removed quanteda::df() as a synonym for docfreq(), as this conflicted with stats::df().

  • added version information when package is attached.

  • improved rbind() and cbind() methods for dfm. Both now take any length sequence of dfms and perform better type checking.
    rbind.dfm() also knits together dfms with different features, which can be useful for information retrieval purposes or machine learning.

  • selectFeatures(x, anyDfm) (where the second argument is a dfm) now works with a selection = "remove" option.

  • tokenize.character adds a removeURL option.

  • added a corpus method for data.frame objects, so that a corpus can be constructed directly from a data.frame. Requires the addition of a textField argument (similar to textfile).

  • added compress.dfm() to combine identically named columns or rows. #123

  • Much better phrasetotoken(), with additional methods for all combinations of corpus/character v. dictionary/character/collocations.

  • Added a weight(x, type, ...) signature in which the second argument can be a named numeric vector of weights, not just a label for a type of weight. Thanks to https://stackoverflow.com/questions/36815926/assigning-weights-to-different-features-in-r/36823475#36823475.

  • as.data.frame for dfms now passes ... to as.data.frame.matrix.

  • Fixed bug in predict.fitted_textmodel_NB() that caused a failure with k > 2 classes (#129)

  • Improved dfm.tokenizedTexts() performance by taking care of zero-token documents more efficiently.

  • dictionary(file = "liwc_formatted_dict.dic", format = "LIWC") now handles poorly formatted dictionary files better, such as the Moral Foundations Dictionary in the examples for ?dictionary.

  • added as.tokenizedTexts to coerce any list of characters to a tokenizedTexts object.
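The idea behind rbind.dfm() knitting together dfms with different features can be illustrated with ordinary dense matrices. This is only a sketch of the concept, not the package's sparse implementation; the helper name rbind_features and the toy matrices are hypothetical.

```r
# Hypothetical helper: row-bind two count matrices whose feature (column) sets
# differ, padding features that are absent from one input with zeros.
rbind_features <- function(a, b) {
  feats <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(feats),
                  dimnames = list(rownames(m), feats))
    out[, colnames(m)] <- m  # copy existing counts into the matching columns
    out
  }
  rbind(pad(a), pad(b))
}

a <- matrix(c(1, 2), nrow = 1, dimnames = list("d1", c("tax", "vote")))
b <- matrix(c(3, 4), nrow = 1, dimnames = list("d2", c("vote", "law")))
combined <- rbind_features(a, b)
print(combined)
```

The combined matrix carries the union of both feature sets, so a document from one input has zero counts for features seen only in the other.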

Bug fixes >= 0.9.6-3

  • Fix bug in phrasetotoken, signature ‘corpus,ANY’ that was causing an infinite loop.

  • Fixed bug introduced in commit b88287f (0.9.5-26) that caused a failure in dfm() with empty (zero-token) documents. Also fixes Issue #168.

  • Fixed bug that caused dfm() to break if no features or only one feature was found.

  • Fixed bug in predict.fitted_textmodel_NB() that caused a failure with k > 2 classes (#129)

Bug fixes

  • Fixed a false-alarm warning message in textmodel_wordfish()

  • Argument defaults for readability.corpus() now same as readability.character(). Fixes #107.

  • Fixed a bug causing LIWC format dictionary imports to fail if extra characters followed the closing % in the file header.

  • Fixed a bug in applyDictionary(x, dictionary, exclusive = FALSE) when the dictionary produced no matches at all, caused by an attempt to negative index a NULL. #115

  • Fixed #117, a bug where wordstem.tokenizedTexts() removed attributes from the object, causing a failure of dfm.tokenizedTexts().

  • Fixed #119, a bug in selectFeatures.tokenizedTexts(x, features, selection = "remove") that returned a NULL for a document’s tokens when no matching pattern for removal was found.

  • Improved the behaviour of the removeHyphens option to tokenize() when what = "fasterword" or what = "fastestword".

  • readability() now returns measures in order called, not function definition order.

  • textmodel(x, model = "wordfish") now removes zero-frequency documents and words prior to calling Rcpp.

  • Fixed a bug in sample.corpus() that caused an error when no docvars existed. #128

New Features

  • Improved the performance of selectFeatures.tokenizedTexts().
  • Improved the performance of rbind.dfm().
  • Added support for different docvars when importing multiple files using textfile(). (#147)
  • Added support for comparison dispersion plots in plot.kwic(). (#146)
  • Added a corpus constructor method for kwic objects.
  • Substantially improved the performance of convert(x, to = "stm") for dfm export, including adding an argument for meta-data (docvars, in quanteda parlance). (#209)
  • Internal rewrite of textfile(), now supports more file types, more wildcard patterns, and is far more robust generally.
  • Add support for loading external dictionary formats:
    • Yoshikoder
    • Lexicoder v2 and v3 (#228)
  • Autodetect dictionary file format from file extension, so no longer require format keyword for loading dictionaries (#227)
  • Improved compatibility with rOpenSci guidelines (#218):
    • Use httr to get remote files
    • Use message() to display messages rather than print() or cat()
    • Reorganise sections in README file
  • Added new punctuation argument to collocations() to provide new options for handling collocations separated by punctuation characters (#220).

Bug fixes

  • (0.9.8.7) Solved #267 in which fcm(x, tri = TRUE) temporarily created a dense logical matrix.
  • (0.9.8.7) Added feature co-occurrence matrix functions (fcm).
  • (0.9.8.5) Fixed an incompatibility in sequences.cpp with Solaris x86 (#257)
  • (0.9.8.4) Fix bug in verbose output of dfm that causes misreporting of number of features (#250)
  • (0.9.8.4) Fix a bug in selectFeatures.dfm() that ignored case_insensitive = TRUE settings (#251); correct the documentation for this function.
  • (0.9.8.3) Fix a bug in tf(x, scheme = "propmax") that returned a wrong computation; correct the documentation for this function.
  • (0.9.8.2) Fixed a bug in textfile() causing all texts to have the same name, for types using the “textField” argument (a single file containing multiple documents).
  • Fixed bug in phrasetotoken() where if pattern included a + for valuetype = c("glob", "fixed") it threw a regex error. #239
  • Fixed bug in textfile() where source is a remote .zip set. (#172)
  • Fixed bug in wordstem.dfm() that caused an error if supplied a dfm with a feature whose total frequency count was zero, or with a feature whose total docfreq was zero. Fixes #181.
  • Fix #214 “mysterious stemmed token” bug in wordstem.dfm(), introduced in fixing #181.
  • Fixed previously non-functional toLower = argument in dfm.tokenizedTexts().
  • Fixed some errors in the computation of a few readability formulas (#215).
  • Added filenames as names to text vectors returned by textfile() (#221).
  • dictionary() now works correctly when reading LIWC dictionaries where all terms belong to one key (#229).
  • convert(x, to = "stm") now indexes the dfm components from 1, not 0 (#222).
  • Remove temporary stemmed token (#214).
  • Fixed bug in textmodel_NB() for non-“uniform” priors (#241)
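To make the 1-based indexing fix for convert(x, to = "stm") concrete, the following base R sketch turns a small dense count matrix into a list of two-row index/count matrices, one per document. It assumes that this two-row layout with 1-based vocabulary positions is what stm expects; the helper name to_stm_docs and the toy data are hypothetical, and the package itself works from sparse matrix index positions rather than a dense loop like this.

```r
# Hypothetical helper: for each document (row), keep the 1-based column
# positions of non-zero features together with their counts.
to_stm_docs <- function(counts) {
  lapply(seq_len(nrow(counts)), function(i) {
    idx <- unname(which(counts[i, ] > 0))  # which() is 1-based in R
    rbind(index = idx, count = as.integer(unname(counts[i, idx])))
  })
}

counts <- matrix(c(2, 0, 1,
                   1, 1, 0,
                   0, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("d1", "d2", "d3"), c("tax", "vote", "law")))
docs <- to_stm_docs(counts)
vocab <- colnames(counts)
print(docs[[1]])
```

Because the positions start at 1, vocab[docs[[1]]["index", ]] recovers the feature names directly; an off-by-one here would silently pair every count with the wrong word.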

Changes

  • Added warn = FALSE to the readLines() calls in textfile(), so that no warnings are issued when files are read that are missing a final EOL or that contain embedded nuls.
  • trim() now prints an output message even when no features are removed (#223)
  • We now skip some platform-dependent tests on CRAN, travis-ci and Windows.