trim() now accepts proportions in addition to integer thresholds. Also accepts a new sparsity argument, which works like tm’s removeSparseTerms(x, sparse = ) (for those who really want to think of sparsity this way).
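The sparsity convention can be sketched generically (a minimal Python sketch of tm's removeSparseTerms(x, sparse = ) semantics, not quanteda code; the function and argument names here are illustrative): a feature is kept when its document frequency exceeds (1 - sparse) of all documents.

```python
def trim_by_sparsity(doc_freq, ndoc, sparse):
    """Keep features whose sparsity (share of documents NOT containing
    the feature) is below `sparse`, i.e. whose document frequency
    exceeds (1 - sparse) * ndoc.  Assumed to mirror tm's
    removeSparseTerms(x, sparse = ) behaviour."""
    return [feat for feat, df in doc_freq.items() if df / ndoc > 1 - sparse]
```

For example, with 10 documents and sparse = 0.5, a feature appearing in only 1 document (sparsity 0.9) is dropped, while one appearing in all 10 is kept.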
[i] and [i, j] indexing of corpus objects is now possible, for extracting texts or docvars using convenient notation. See ?corpus Details.
ngrams() and skipgrams() now use the same underlying function, with skip replacing the previous window argument (where skip = window - 1). For efficiency, both are now implemented in C++.
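The skip = window - 1 relationship can be illustrated with a minimal Python sketch of skip-gram formation (illustrative only, not the package's C++ implementation): with skip = 0 the function reduces to ordinary n-grams.

```python
from itertools import combinations

def skipgrams(tokens, n=2, skip=0):
    """Form n-grams allowing up to `skip` tokens to be skipped between
    adjacent gram elements; skip = 0 yields plain n-grams, so the old
    window argument corresponds to skip + 1."""
    grams = []
    for i in range(len(tokens)):
        # positions reachable for the remaining n - 1 elements
        tail = range(i + 1, min(i + 1 + (n - 1) * (skip + 1), len(tokens)))
        for combo in combinations(tail, n - 1):
            pos = (i,) + combo
            # each consecutive gap may skip at most `skip` tokens
            if all(b - a <= skip + 1 for a, b in zip(pos, pos[1:])):
                grams.append("_".join(tokens[p] for p in pos))
    return grams
```

For the tokens a b c d, skip = 0 gives the bigrams a_b, b_c, c_d; skip = 1 adds the one-token-skipped pairs a_c and b_d.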
tokenize() has a new argument, removeHyphens, that controls the treatment of intra-word hyphens.
Added new readability() measures for mean syllables per word and mean words per sentence.
wordstem() now works on ngrams (tokenizedTexts and dfm objects).
Enhanced operation of kwic(), including the definition of a kwic class object, and a plot method for this object (produces a dispersion plot).
Lots more error checking of arguments passed to … (and potentially misspecified or misspelled). Addresses Issue #62.
Almost all functions are now defined as object-class methods dispatched from generics.
texts(x, groups = ) now allows groups to be factors, not just document variable labels. There is a new method for texts.character(x, groups = ) which is useful for supplying a factor to concatenate character objects by group.
added new methods for similarity(), including sparse matrix computation for method = "correlation" and "cosine". (More planned soon.) Also allows easy conversion to a matrix using as.matrix() on similarity lists.
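For reference, the two measures named above can be sketched generically in Python (textbook definitions, not quanteda's sparse implementation; function names are illustrative):

```python
import math

def cosine_sim(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def correlation_sim(u, v):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine_sim([a - mu for a in u], [b - mv for b in v])
```

Cosine similarity of orthogonal count vectors is 0 and of proportional vectors is 1; correlation applies the same formula after centering each vector on its mean.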
more robust implementation of LIWC-formatted dictionary file imports
better implementation of tf-idf, and relative frequency weighting, especially for very large sparse matrix objects. tf(), idf(), and tfidf() now provide relative term frequency, inverse document frequency, and tf-idf directly.
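The three weightings can be sketched generically in Python (illustrative only; quanteda's actual defaults, such as the logarithm base, may differ):

```python
import math

def tf(counts):
    """Relative term frequency: count / total tokens in the document."""
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def idf(docs):
    """Inverse document frequency: log10(N / document frequency)."""
    n = len(docs)
    vocab = {t for d in docs for t in d}
    return {t: math.log10(n / sum(t in d for d in docs)) for t in vocab}

def tfidf(docs):
    """tf-idf: relative term frequency times inverse document frequency."""
    weights = idf(docs)
    return [{t: f * weights[t] for t, f in tf(d).items()} for d in docs]
```

A term occurring in every document gets idf = 0 and therefore a tf-idf weight of 0, which is what makes the scheme useful for discounting ubiquitous terms.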
textmodel_wordfish() now accepts an integer dispersionFloor argument to constrain the phi parameter to a minimum value (of underdispersion).
textfile() now takes a vector of filenames, if you wish to construct these yourself. See ?textfile examples.
removeFeatures() and selectFeatures.collocations() now all use a consistent interface and same underlying code, with removeFeatures() acting as a wrapper to selectFeatures().
convert(x, to = "stm") is now about 3-4x faster because it uses index positions from the dgCMatrix to convert to the sparse matrix format expected by stm.
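The speed-up comes from walking the compressed-sparse-column slots of the dgCMatrix directly. A schematic Python version of that index walk (function name and output shape are illustrative, not stm's actual input format):

```python
def csc_to_docs(i, p, x, ndoc):
    """Walk dgCMatrix-style CSC slots (row indices i, column pointers p,
    nonzero values x) of a documents-by-features matrix and emit, per
    document, a list of (feature_index, count) pairs, 1-based."""
    docs = [[] for _ in range(ndoc)]
    for col in range(len(p) - 1):            # one column per feature
        for k in range(p[col], p[col + 1]):  # nonzero entries in this column
            docs[i[k]].append((col + 1, int(x[k])))
    return docs
```

Because only the stored nonzero entries are visited, the cost is proportional to the number of nonzeros rather than to the full matrix size.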
Fixed a bug in textfile() preventing encodingFrom and encodingTo from working properly.
Fixed a nasty bug in convert(x, to = "stm") that mixed up the word indexes. Thanks to Felix Haass for spotting this!
Fixed a problem where wordstem was not working on ngram = 1 tokenized objects.
Fixed toLower(x, keepAcronyms = TRUE) that caused an error when x contained no acronyms.
Creating a corpus from a tm VCorpus now works if a "document" is a vector of texts rather than a single text.
Fixed a bug in texts(x, groups = ) when more than one docvar is supplied; it now groups correctly on combinations of multiple groups.
Added presidents’ first names to inaugCorpus
Added textmodel implementation of multinomial and Bernoulli Naive Bayes.
Improved documentation.
Added c.corpus() method for concatenating arbitrarily large sets of corpus objects.
Default for similarity() is now margin = "documents" – prevents overly massive results if selection = NULL.
Defined rowMeans() and colMeans() methods for dfm objects.
Enhancements to summary.character() and summary.corpus(): Added n = to summary.character(); added pass-through options to tokenize() in summary.corpus() and summary.character() methods; added toLower as an argument to both.
Enhancements to corpus object indexing, including [[ and [[<-.
Fixed a bug preventing smoother() from working.
Fixed a bug in segment.corpus(x, what = "tag") that was failing to recover the tag values after the first text.
Fix bug in the plot.dfm(x, comparison = TRUE) method that caused a warning about rowMeans() failing.
Fixed an issue for mfdict <- dictionary(file = "http://ow.ly/VMRkL", format = "LIWC") causing it to fail because of the irregular combination of tabs and spaces in the dictionary file.
Fixed an exception thrown by wordstem.character(x) if one element of x was NA.
dfm() on a text or tokenized text containing an NA element now returns a row with 0 feature counts. Previously it returned a count of 1 for an NA feature.
Fix issue #91: removeHyphens = FALSE not working in tokenize() for some multiple intra-word hyphens, such as "one-of-a-kind".
Fixed a bug in as.matrix.similMatrix() that caused scrambled conversion when the feature sets compared were unequal, which normally occurs when setting similarity(x, n = <something>) with n < nfeature(x).
Fixed a bug in which a corpusSource object (from textfile()) with empty docvars prevented this argument from being supplied to corpus(corpusSourceObject, docvars = something).
Fixed inaccurate documentation for weight(), which previously listed unavailable options.
More accurate and complete documentation for tokenize().
Now traps an exception when calling wordstem.tokenizedTexts(x) on an x that was not word-tokenized.
Fixed a bug in textfile() that prevented pass-through arguments in ..., such as fileEncoding = or encoding =.
Fixed a bug in textfile() that caused exceptions with input documents containing docvars when there was only a single column of docvars (such as in .csv files).
Improved Naive Bayes model and prediction; textmodel(x, y, method = "NB") now works correctly for k > 2 classes.
Improved tag handling for segment(x, what = "tags").
Added valuetype argument to segment() methods, which allows faster and more robust segmentation on large texts.
corpus() now converts all hyphen-like characters to a simple hyphen.
segment.corpus() now preserves all existing docvars.
corpus documentation no longer describes the corpus object’s internal structure, since too many users were accessing these internal elements directly; this is strongly discouraged, as we are likely to change the corpus internals (soon and often). Repeat after me: “encapsulation”.
Improve robustness of corpus.VCorpus() for constructing a corpus from a tm Corpus object.
Add UTF-8 preservation to ngrams.cpp.
Fix encoding issues for textfile(), improve functionality.
Added two data objects: Moby Dick is now available as mobydickText, without needing to access a zipped text file; encodedTextFiles.zip is now a zipped archive of different encodings of (mainly) the UN Declaration of Human Rights, for testing conversions from 8-bit encodings in different (non-Roman) languages.
phrasetotoken() now has a method correctly defined for corpus class objects.
lexdiv() now works just like readability(), and is faster (based on data.table) and the code is simpler.
removed quanteda::df() as a synonym for docfreq(), as this conflicted with stats::df().
added version information when package is attached.
improved rbind() and cbind() methods for dfm. Both now take any length sequence of dfms and perform better type checking.
rbind.dfm() also knits together dfms with different features, which can be useful for information and retrieval purposes or machine learning.
selectFeatures(x, anyDfm) (where the second argument is a dfm) now works with a selection = "remove" option.
tokenize.character adds a removeURL option.
added a corpus method for data.frame objects, so that a corpus can be constructed directly from a data.frame. Requires the addition of a textField argument (similar to textfile).
added compress.dfm() to combine identically named columns or rows. (#123)
Much better phrasetotoken(), with additional methods for all combinations of corpus/character v. dictionary/character/collocations.
Added a weight(x, type, ...) signature where the second argument can be a named numeric vector of weights, not just a label for a type of weight. Thanks to https://stackoverflow.com/questions/36815926/assigning-weights-to-different-features-in-r/36823475#36823475.
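Weighting by a named numeric vector amounts to multiplying each matching feature's count by its weight; a minimal Python sketch of that idea (names hypothetical, not the quanteda API):

```python
def apply_feature_weights(counts, weights):
    """Scale each feature count by its named weight; features with no
    entry in `weights` are left unchanged (an assumption of this sketch)."""
    return {feat: c * weights.get(feat, 1.0) for feat, c in counts.items()}
```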
as.data.frame for dfms now passes ... to as.data.frame.matrix.
Fixed bug in predict.fitted_textmodel_NB() that caused a failure with k > 2 classes. (#129)
Improved dfm.tokenizedTexts() performance by taking care of zero-token documents more efficiently.
dictionary(file = "liwc_formatted_dict.dic", format = "LIWC") now handles poorly formatted dictionary files better, such as the Moral Foundations Dictionary in the examples for ?dictionary.
added as.tokenizedTexts to coerce any list of characters to a tokenizedTexts object.
Fix bug in phrasetotoken, signature 'corpus,ANY', that was causing an infinite loop.
Fixed bug introduced in commit b88287f (0.9.5-26) that caused a failure in dfm() with empty (zero-token) documents. Also fixes Issue #168.
Fixed bug that caused dfm() to break if no features or only one feature was found.
Fixed a false-alarm warning message in textmodel_wordfish()
Argument defaults for readability.corpus() now same as readability.character(). Fixes #107.
Fixed a bug causing LIWC format dictionary imports to fail if extra characters followed the closing % in the file header.
Fixed a bug in applyDictionary(x, dictionary, exclusive = FALSE) when the dictionary produced no matches at all, caused by an attempt to negative index a NULL. #115
Fixed #117, a bug where wordstem.tokenizedTexts() removed attributes from the object, causing a failure of dfm.tokenizedTexts().
Fixed #119, a bug in selectFeatures.tokenizedTexts(x, features, selection = "remove") that returned a NULL for a document's tokens when no matching pattern for removal was found.
Improved the behaviour of the removeHyphens option to tokenize() when what = "fasterword" or what = "fastestword".
readability() now returns measures in order called, not function definition order.
textmodel(x, model = "wordfish") now removes zero-frequency documents and words prior to calling Rcpp.
Fixed a bug in sample.corpus() that caused an error when no docvars existed. #128
… selectFeatures.tokenizedTexts().
… rbind.dfm().
… textfile(). (#147)
… plot.kwic(). (#146)
… convert(x, to = "stm") for dfm export, including adding an argument for meta-data (docvars, in quanteda parlance). (#209)
… textfile(), now supports more file types, more wildcard patterns, and is far more robust generally.
Added a format keyword for loading dictionaries. (#227)
… messages() to display messages rather than print or cat.
Added a punctuation argument to collocations() to provide new options for handling collocations separated by punctuation characters. (#220)
… fcm(x, tri = TRUE) temporarily created a dense logical matrix. (… fcm)
Fixed a bug in selectFeatures.dfm() that ignored case_insensitive = TRUE settings (#251); corrected the documentation for this function.
Fixed a bug in tf(x, scheme = "propmax") that returned a wrong computation; corrected the documentation for this function.
Fixed a bug in phrasetotoken() where a pattern that included a + for valuetype = c("glob", "fixed") threw a regex error. (#239)
Fixed a bug in textfile() where the source is a remote .zip set. (#172)
Fixed a bug in wordstem.dfm() that caused an error if supplied a dfm with a feature whose total frequency count was zero, or with a feature whose total docfreq was zero. Fixes #181.
Fixed a bug in wordstem.dfm() introduced in fixing #181.
… toLower = argument in dfm.tokenizedTexts().
… textfile (#221).
dictionary() now works correctly when reading LIWC dictionaries where all terms belong to one key. (#229)
Added warn = FALSE to the readLines() calls in textfile(), so that no warnings are issued when files are read that are missing a final EOL or that contain embedded nuls.
trim() now prints an output message even when no features are removed. (#223)