Functions for creating and managing textual corpora, extracting features from textual data, and analyzing those features using quantitative methods.
quanteda makes it easy to manage texts in the form of a corpus, defined as a collection of texts that includes document-level variables specific to each text, as well as metadata. quanteda includes tools to manipulate the texts in a corpus quickly and simply, covering the most common natural language processing tasks such as tokenizing, stemming, or forming ngrams. quanteda's functions for tokenizing texts and forming multiple tokenized documents into a document-feature matrix are both extremely fast and very simple to use. quanteda can segment texts easily by words, paragraphs, sentences, or even user-supplied delimiters and tags.
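The workflow described above can be sketched as follows. This is a minimal illustration that assumes quanteda version 3 or later; the two texts and the document variable `party` are invented for the example.

```r
library(quanteda)

# Toy texts and a made-up document variable ("party"), for illustration only
txt <- c(doc1 = "The quick brown fox jumps. It runs away.",
         doc2 = "A lazy dog sleeps all day.")
corp <- corpus(txt, docvars = data.frame(party = c("A", "B")))

# Tokenize into words, dropping punctuation
toks <- tokens(corp, remove_punct = TRUE)

# Form bigrams and stem the tokens
toks_bigrams <- tokens_ngrams(toks, n = 2)
toks_stemmed <- tokens_wordstem(toks)

# Build a document-feature matrix from the tokens
dfmat <- dfm(toks)

# Segment the corpus so that each sentence becomes its own document
corp_sent <- corpus_reshape(corp, to = "sentences")
```

Printing `dfmat` or calling `summary(corp)` is a quick way to inspect the results.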
Built on the text processing functions in the stringi package, which is in turn built on the C++ implementation of the ICU libraries for Unicode text handling, quanteda pays special attention to fast and correct implementation of Unicode and the handling of text in any character set.
quanteda is built for efficiency and speed, through its design around three infrastructures: the stringi package for text processing, the Matrix package for sparse matrix objects, and computationally intensive processing (e.g. for tokens) handled in parallelized C++. If you can fit it into memory, quanteda will handle it quickly. (And eventually, we will make it possible to process objects even larger than available memory.)
quanteda is principally designed to allow users a fast and convenient method to go from a corpus of texts to a selected matrix of documents by features, after defining what the documents and features are. The package makes it easy to redefine documents, for instance by splitting them into sentences or paragraphs, or by tags, as well as to group them into larger documents by document variables, or to subset them based on logical conditions or combinations of document variables. The package also implements common NLP feature selection functions, such as removing stopwords and stemming in numerous languages, selecting words found in dictionaries, treating words as equivalent based on a user-defined "thesaurus", and trimming and weighting features based on document frequency, feature frequency, and related measures such as tf-idf.
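As a rough sketch of the document and feature selection steps just described, the example below subsets, groups, trims, and weights a small dfm. It assumes quanteda version 3 or later (where `groups` can name a document variable directly); the texts and the document variables `party` and `year` are hypothetical.

```r
library(quanteda)

# Toy texts with two made-up document variables
txt <- c(s1 = "Taxes should fall. Spending should fall too.",
         s2 = "Taxes fund schools and hospitals.",
         s3 = "Schools need more spending.")
corp <- corpus(txt, docvars = data.frame(party = c("A", "B", "B"),
                                         year = c(2019, 2020, 2021)))

# Subset documents on a logical condition over document variables
corp_recent <- corpus_subset(corp, year >= 2020)

# Feature selection: remove English stopwords, then stem
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)
dfmat <- dfm(toks)

# Group documents into larger documents by a document variable
dfmat_party <- dfm_group(dfmat, groups = party)

# Keep features appearing in at least two documents, then weight by tf-idf
dfmat_trim <- dfm_trim(dfmat, min_docfreq = 2)
dfmat_tfidf <- dfm_tfidf(dfmat_trim)
```

The same selection functions exist for both tokens and dfm objects (for example `tokens_remove()` and `dfm_remove()`), so the step at which features are dropped is a design choice; removing them at the tokens stage matters when ngrams are formed afterwards.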
Tools for working with dictionaries are one of quanteda's principal strengths, and the package includes several core functions for preparing and applying dictionaries to texts, for example for lexicon-based sentiment analysis.
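A minimal sketch of lexicon-based sentiment counting with a dictionary might look like this. The two-key dictionary and the texts are invented for illustration and are not a real sentiment lexicon.

```r
library(quanteda)

# A tiny, made-up sentiment dictionary; "great*" is a glob pattern
dict <- dictionary(list(positive = c("good", "great*"),
                        negative = c("bad", "terrible")))

txt <- c(d1 = "A great film with a good cast.",
         d2 = "Terrible plot and bad acting, but good music.")
toks <- tokens(txt, remove_punct = TRUE)

# Replace matching tokens with their dictionary keys, then count keys per document
toks_sent <- tokens_lookup(toks, dictionary = dict)
dfmat_sent <- dfm(toks_sent)

# Dense matrix of counts: one row per document, one column per dictionary key
m <- as.matrix(dfmat_sent)
```

Matching is case-insensitive by default, which is why "Terrible" in the second text is counted under `negative`.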
Once constructed, a quanteda document-feature matrix ("dfm") can be easily analyzed using either quanteda's built-in tools for scaling document positions or a number of other text analytic tools, such as: topic models (including converters for direct use with the topicmodels, LDA, and stm packages); document scaling (using the quanteda.textmodels package's functions for the "wordfish" and "Wordscores" models, or direct use with the ca package for correspondence analysis); or machine learning through a variety of other packages that take matrix or matrix-like inputs. quanteda includes functions for converting its core objects, especially a dfm, into other formats so that these are easy to use with other analytic packages.
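The conversion step can be sketched with `convert()`. The example below assumes a recent quanteda version and uses two of the supported targets, a data frame and the input format used by the stm package; the texts are made up, and fitting an actual model with the converted object is left out.

```r
library(quanteda)

# Toy texts, for illustration only
txt <- c(d1 = "markets rise as banks lend",
         d2 = "banks fall as markets slow")
dfmat <- dfm(tokens(txt))

# Convert to a plain data frame: one row per document, one column per feature
df <- convert(dfmat, to = "data.frame")

# Convert to the list structure expected by the stm package
stm_input <- convert(dfmat, to = "stm")

# A dfm can also be coerced to a dense base R matrix for other packages
m <- as.matrix(dfmat)
```

Other targets are selected through the same `to` argument; some of them may need the corresponding package to be installed.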
Additional features of quanteda include:
powerful, flexible tools for working with dictionaries;
the ability to identify keywords associated with documents or groups of documents;
the ability to explore texts using key-words-in-context;
quick computation of word or document similarities, for clustering or to compute distances for other purposes;
a comprehensive suite of descriptive statistics on text such as the number of sentences, words, characters, or syllables per document; and
flexible, easy-to-use graphical tools to portray many of the analyses available in the package.
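Two of the features listed above, key-words-in-context and simple descriptive counts, can be sketched as follows. This assumes quanteda version 3 or later, where `kwic()` operates on a tokens object; the texts are invented for the example.

```r
library(quanteda)

# Toy texts, for illustration only
txt <- c(d1 = "The fox ran. The fox hid.",
         d2 = "No animals here.")
toks <- tokens(txt, remove_punct = TRUE)

# Key-words-in-context: each match of "fox" with two tokens of context per side
kw <- kwic(toks, pattern = "fox", window = 2)

# Descriptive counts per document: number of tokens and of unique types
n_tokens <- ntoken(toks)
n_types <- ntype(toks)
```

Depending on the installed version, similarity, keyness, and plotting functions may live in companion packages rather than in quanteda itself, so they are not shown here.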
Report bugs at https://github.com/quanteda/quanteda/issues