trim() now accepts proportions in addition to integer thresholds. Also accepts a new sparsity argument, which works like tm’s removeSparseTerms(x, sparse = ) (for those who really want to think of sparsity this way).
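To make the relationship between a document-proportion threshold and a sparsity threshold concrete, here is a minimal base R sketch on a toy count matrix. It does not call trim() itself; the matrix, the variable names, and the strict inequalities at the boundary are assumptions made purely for illustration.

```r
# Toy document-feature counts (hypothetical data):
# rows are documents, columns are features.
counts <- matrix(c(1, 0, 2, 0,
                   1, 0, 0, 0,
                   3, 1, 0, 0,
                   1, 0, 1, 0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4),
                                 c("the", "cat", "sat", "mat")))

# Proportion of documents in which each feature occurs at least once.
doc_prop <- colSums(counts > 0) / nrow(counts)

# Sparsity is the complement: the share of documents where the feature is absent.
sparsity <- 1 - doc_prop

# Keeping features with sparsity below 0.6 is the same as keeping
# features that occur in more than 40% of the documents.
keep_by_sparsity   <- names(which(sparsity < 0.6))
keep_by_proportion <- names(which(doc_prop > 0.4))
```

Both selections keep the same features, which is why the sparsity argument is just another way of expressing a document-proportion threshold.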
[i] and [i, j] indexing of corpus objects is now possible, for extracting texts or docvars using convenient notation. See ?corpus Details.
ngrams() and skipgrams() now use the same underlying function, with skip replacing the previous window argument (where a skip = window - 1). For efficiency, both are now implemented in C++.
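The skip = window - 1 relationship can be sketched in base R for the bigram case. This is an illustrative helper, not the package's C++ implementation; the function name skip_bigrams and the "_" concatenator are assumptions.

```r
# Hypothetical helper: form bigrams whose two tokens are separated by
# exactly `skip` intervening tokens (skip = 0 gives ordinary adjacent bigrams).
skip_bigrams <- function(tokens, skip = 0) {
  gap <- skip + 1                      # distance between the paired positions
  n <- length(tokens)
  if (n <= gap) return(character(0))   # too few tokens to form any pair
  starts <- seq_len(n - gap)
  paste(tokens[starts], tokens[starts + gap], sep = "_")
}

tokens <- c("a", "b", "c", "d")
adjacent <- skip_bigrams(tokens, skip = 0)  # equivalent to the old window = 1
skipped  <- skip_bigrams(tokens, skip = 1)  # equivalent to the old window = 2
```

With skip = 0 the pairs are adjacent tokens; with skip = 1 each pair jumps over one token.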
tokenize() has a new argument, removeHyphens, that controls the treatment of intra-word hyphens.
readability() now provides new measures for mean syllables per word and mean words per sentence directly.
wordstem now works on ngrams (tokenizedTexts and dfm objects).
Enhanced operation of kwic(), including the definition of a kwic class object, and a plot method for this object (produces a dispersion plot).
Lots more error checking of arguments passed to … (and potentially misspecified or misspelled). Addresses Issue #62.
Almost all methods are now methods defined for objects, from a generic.
texts(x, groups = ) now allows groups to be factors, not just document variable labels. There is a new method for texts.character(x, groups = ) which is useful for supplying a factor to concatenate character objects by group.
Added new methods for similarity(), including sparse matrix computation for method = "correlation" and "cosine". (More planned soon.) Also allows easy conversion to a matrix using as.matrix() on similarity lists.
more robust implementation of LIWC-formatted dictionary file imports
better implementation of tf-idf, and relative frequency weighting, especially for very large sparse matrix objects. tf(), idf(), and tfidf() now provide relative term frequency, inverse document frequency, and tf-idf directly.
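As a reminder of what these three quantities are, here is a minimal base R sketch on a dense toy matrix. It does not call tf(), idf(), or tfidf(); the base-10 logarithm and the absence of smoothing are assumptions of this sketch, and the package's exact defaults may differ.

```r
# Toy counts (hypothetical data): two documents, three features.
counts <- matrix(c(2, 1, 0,
                   1, 0, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("d1", "d2"), c("a", "b", "c")))

# Relative term frequency: each row is divided by its document length.
rel_tf <- counts / rowSums(counts)

# Inverse document frequency: log of (number of documents / document frequency).
idf <- log10(nrow(counts) / colSums(counts > 0))

# tf-idf: scale each feature column of rel_tf by that feature's idf.
tfidf <- sweep(rel_tf, 2, idf, `*`)
```

Feature "a" appears in every document, so its idf, and therefore its tf-idf, is zero in both rows.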
textmodel_wordfish() now accepts an integer dispersionFloor argument to constrain the phi parameter to a minimum value (of underdispersion).
textfile() now takes a vector of filenames, if you wish to construct these yourself. See ?textfile examples.
removeFeatures() and selectFeatures.collocations() now all use a consistent interface and same underlying code, with removeFeatures() acting as a wrapper to selectFeatures().
convert(x, to = "stm") is now about 3-4x faster because it uses index positions from the dgCMatrix to convert to the sparse matrix format expected by stm.
Fixed a bug in textfile() preventing encodingFrom and encodingTo from working properly.
Fixed a nasty bug in convert(x, to = "stm") that mixed up the word indexes. Thanks Felix Haass for spotting this!
Fixed a problem where wordstem was not working on ngram=1 tokenized objects
Fixed a bug in toLower(x, keepAcronyms = TRUE) that caused an error when x contained no acronyms.
Creating a corpus from a tm VCorpus now works if a “document” is a vector of texts rather than a single text
Fixed a bug in texts(x, groups = MORE THAN ONE DOCVAR); it now groups correctly on combinations of multiple groups.
Added presidents’ first names to inaugCorpus
Added textmodel implementation of multinomial and Bernoulli Naive Bayes.
c.corpus() method for concatenating arbitrarily large sets of corpus objects.
similarity() now defaults to margin = "documents", which prevents overly massive results if selection = NULL.
Enhancements to summary.character() and summary.corpus(): Added n = to summary.character(); added pass-through options to tokenize() in summary.corpus() and summary.character() methods; added toLower as an argument to both.
Enhancements to corpus object indexing, including [[ and [[<-.
Fixed a bug preventing smoother() from working.
Fixed a bug in segment.corpus(x, what = "tag") that was failing to recover the tag values after the first text.
Fixed a bug in the plot.dfm(x, comparison = TRUE) method that caused a warning about rowMeans() failing.
Fixed an issue for mfdict <- dictionary(file = "http://ow.ly/VMRkL", format = "LIWC") causing it to fail because of the irregular combination of tabs and spaces in the dictionary file.
Fixed an exception thrown by wordstem.character(x) if one element of x was NA.
dfm() on a text or tokenized text containing an NA element now returns a row with 0 feature counts. Previously it returned a count of 1 for an NA feature.
Fixed issue #91: removeHyphens = FALSE was not working in tokenise for some multiple intra-word hyphens, such as “one-of-a-kind”.
Fixed a bug in as.matrix.similMatrix() that caused scrambled conversion when the feature sets compared were unequal, which normally occurs when setting similarity(x, n = <something>) when n < nfeature(x).
Fixed a bug in which a corpusSource object (from textfile()) with empty docvars prevented this argument from being supplied to corpus(corpusSourceObject, docvars = something).
Fixed inaccurate documentation for weight(), which previously listed unavailable options.
More accurate and complete documentation for
Traps an exception when calling wordstem.tokenizedTexts(x) where x was not word tokenized.
Fixed a bug in textfile() that prevented passthrough arguments in …, such as fileEncoding = or
Fixed a bug in textfile() that caused exceptions with input documents containing docvars when there was only a single column of docvars (such as .csv files).
Improved Naive Bayes model and prediction: textmodel(x, y, method = "NB") now works correctly for k > 2 classes.
Improved tag handling for segment(x, what = "tags").
Added valuetype argument to segment() methods, which allows faster and more robust segmentation on large texts.
corpus() now converts all hyphen-like characters to a simple hyphen.
segment.corpus() now preserves all existing docvars.
corpus documentation now removes the description of the corpus object’s structure since too many users were accessing these internal elements directly, which is strongly discouraged, as we are likely to change the corpus internals (soon and often). Repeat after me: “encapsulation”.
Improved robustness of corpus.VCorpus() for constructing a corpus from a tm Corpus object.
Add UTF-8 preservation to ngrams.cpp.
Fix encoding issues for textfile(), improve functionality.
Added two data objects: Moby Dick is now available as mobydickText, without needing to access a zipped text file; encodedTextFiles.zip is now a zipped archive of different encodings of (mainly) the UN Declaration of Human Rights, for testing conversions from 8-bit encodings in different (non-Roman) languages.
phrasetotoken() now has a method correctly defined for corpus class objects.
lexdiv() now works just like readability(), and is faster (based on data.table) and the code is simpler.
removed quanteda::df() as a synonym for docfreq(), as this conflicted with stats::df().
added version information when package is attached.
improved rbind() and cbind() methods for dfm. Both now take any length sequence of dfms and perform better type checking.
rbind.dfm() also knits together dfms with different features, which can be useful for information retrieval purposes or machine learning.
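The idea of knitting together matrices with different feature sets can be sketched in base R on dense matrices: take the union of the column names, pad each input with zero columns for the features it lacks, then stack the rows. The helper name rbind_fill is hypothetical, and this is not how rbind.dfm() is implemented for sparse objects.

```r
# Hypothetical helper: row-bind two count matrices whose columns differ,
# filling features absent from one input with zeros.
rbind_fill <- function(a, b) {
  feats <- union(colnames(a), colnames(b))
  pad <- function(m) {
    out <- matrix(0, nrow = nrow(m), ncol = length(feats),
                  dimnames = list(rownames(m), feats))
    out[, colnames(m)] <- m   # copy existing counts into the matching columns
    out
  }
  rbind(pad(a), pad(b))
}

a <- matrix(c(1, 2), nrow = 1, dimnames = list("d1", c("x", "y")))
b <- matrix(c(3, 4), nrow = 1, dimnames = list("d2", c("y", "z")))
combined <- rbind_fill(a, b)
```

The result has one column per feature seen in either input, so a model fitted on one set of documents can be applied to rows that originally had a different vocabulary.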
selectFeatures(x, anyDfm) (where the second argument is a dfm) now works with a selection = "remove" option.
tokenize.character adds a removeURL option.
added a corpus method for data.frame objects, so that a corpus can be constructed directly from a data.frame. Requires the addition of a textField argument (similar to textfile).
Added compress.dfm() to combine identically named columns or rows. #123
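Combining identically named columns amounts to summing the counts of columns that share a name. A minimal base R sketch on a dense toy matrix, using rowsum() on the transpose, is shown below; it illustrates the operation only and is not the compress.dfm() implementation.

```r
# Toy counts (hypothetical data) with a duplicated feature name "a".
m <- matrix(c(1, 2, 3,
              4, 5, 6),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("d1", "d2"), c("a", "b", "a")))

# rowsum() adds up rows that share a group label, so transpose first to
# group by column name, then transpose back to documents-by-features.
compressed <- t(rowsum(t(m), group = colnames(m)))
```

The same call without the transposes, rowsum(m, group = rownames(m)), would combine identically named rows instead.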
phrasetotoken(), with additional methods for all combinations of corpus/character v. dictionary/character/collocations.
weight(x, type, ...) signature where the second argument can be a named numeric vector of weights, not just a label for a type of weight. Thanks http://stackoverflow.com/questions/36815926/assigning-weights-to-different-features-in-r/36823475#36823475.
as.data.frame for dfms now passes
Fixed a bug in predict.fitted_textmodel_NB() that caused a failure with k > 2 classes (#129).
Improved dfm.tokenizedTexts() performance by taking care of zero-token documents more efficiently.
dictionary(file = "liwc_formatted_dict.dic", format = "LIWC") now handles poorly formatted dictionary files better, such as the Moral Foundations Dictionary in the examples for
Added as.tokenizedTexts to coerce any list of characters to a tokenizedTexts object.
Fix bug in phrasetotoken, signature ‘corpus,ANY’ that was causing an infinite loop.
Fixed bug introduced in commit b88287f (0.9.5-26) that caused a failure in dfm() with empty (zero-token) documents. Also fixes Issue #168.
Fixed bug that caused dfm() to break if no features or only one feature was found.
Fixed a false-alarm warning message in textmodel_wordfish()
Argument defaults for readability.corpus() now same as readability.character(). Fixes #107.
Fixed a bug causing LIWC format dictionary imports to fail if extra characters followed the closing % in the file header.
Fixed a bug in applyDictionary(x, dictionary, exclusive = FALSE) when the dictionary produced no matches at all, caused by an attempt to negative index a NULL. #115
Fixed #117, a bug where wordstem.tokenizedTexts() removed attributes from the object, causing a failure of dfm.tokenizedTexts().
Fixed #119, a bug in selectFeatures.tokenizedTexts(x, features, selection = "remove") that returned a NULL for a document’s tokens when no matching pattern for removal was found.
Improved the behaviour of the
removeHyphens option to
what = "fasterword" or
readability() now returns measures in order called, not function definition order.
textmodel(x, model = "wordfish") now removes zero-frequency documents and words prior to calling Rcpp.
Fixed a bug in sample.corpus() that caused an error when no docvars existed. #128
convert(x, to = "stm") for dfm export, including adding an argument for meta-data (docvars, in quanteda parlance). (#209)
textfile() now supports more file types, more wildcard patterns, and is far more robust generally.
format keyword for loading dictionaries (#227)
messages() to display messages rather than
collocations() to provide new options for handling collocations separated by punctuation characters (#220).
fcm(x, tri = TRUE) temporarily created a dense logical matrix.
case_insensitive = TRUE settings (#251) correct the documentation for this function.
tf(x, scheme = "propmax") that returned a wrong computation; correct the documentation for this function.
phrasetotoken() where if pattern included a
valuetype = c("glob", "fixed") it threw a regex error. #239
textfile() where source is a remote .zip set. (#172)
wordstem.dfm() that caused an error if supplied a dfm with a feature whose total frequency count was zero, or with a feature whose total docfreq was zero. Fixes #181.
wordstem.dfm(), introduced in fixing #181.
toLower = argument in
dictionary() now works correctly when reading LIWC dictionaries where all terms belong to one key (#229).
warn = FALSE to the
textfile(), so that no warnings are issued when files are read that are missing a final EOL or that contain embedded nuls.
trim() now prints an output message even when no features are removed (#223)