quanteda is an R package for managing and analyzing text, created and maintained by Kenneth Benoit and Kohei Watanabe. Its creation was funded by the European Research Council grant ERC-2011-StG 283794-QUANTESS and its continued development is supported by the Quanteda Initiative CIC.
The package is designed for R users needing to apply natural language processing to texts, from documents to final analysis. Its capabilities match or exceed those provided in many end-user software applications, many of which are expensive and not open source. The package is therefore of great benefit to researchers, students, and other analysts with fewer financial resources. While using quanteda requires R programming knowledge, its API is designed to enable powerful, efficient analysis with a minimum of steps. By emphasizing consistent design, furthermore, quanteda lowers the barriers to learning and using NLP and quantitative text analysis even for proficient R programmers.
quanteda 4.0 is a major release that improves functionality and performance and further improves function consistency by removing previously deprecated functions. It also includes significant new tokeniser rules that make the default tokeniser smarter than ever before, with new Unicode and ICU-compliant rules that enable it to work more consistently with even more languages.
We describe these significant changes more fully in:

* an article about the new external pointer tokens objects;
* an article showing performance benchmarks for the new external pointer tokens objects, as well as some of the tokeniser improvements in v4; and
* the changelog for v4, a full listing of the changes, improvements, and deprecations in v4.
We completed the trend of splitting quanteda into modular packages with the release of v3. The quanteda family of packages includes the following:
* quanteda.textmodels: the `textmodel_*()` functions. This was split from the main package with the v2 release
* quanteda.textstats: the `textstat_*()` functions, split with the v3 release
* quanteda.textplots: the `textplot_*()` functions, split with the v3 release

We are working on additional package releases, available in the meantime from our GitHub pages:
The normal way from CRAN, using your R GUI or `install.packages("quanteda")`.
(New for quanteda v4.0) For Linux users: Because all installations on Linux are compiled, Linux users will first need to install the Intel oneAPI Threading Building Blocks (TBB), which quanteda uses for parallel computing, for installation to work.
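To install TBB on Linux, use your distribution's package manager, for example `sudo apt install libtbb-dev` on Debian and Ubuntu, or `sudo yum install tbb-devel` on Fedora, CentOS, and RHEL.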
Windows or macOS users do not have to install TBB or any other packages to enable parallel computing when installing quanteda from CRAN.
Because installing the development version compiles some C++ and Fortran source code, you will need to have installed the appropriate compilers to build it.
You will also need to install TBB:
macOS: after installing Homebrew, install TBB with `brew install tbb`.
Windows: install RTools, which includes the TBB libraries.
See the quick start guide to learn how to use quanteda.
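As a rough illustration of the "minimum of steps" design, here is a minimal sketch of a typical workflow from raw text to a document-feature matrix. The two example texts and all object names (`txt`, `corp`, `toks`, `dfmat`) are made up for illustration; the quick start guide remains the authoritative introduction.

```r
library(quanteda)

# Two made-up example documents (illustrative only)
txt <- c(doc1 = "The quick brown fox jumps over the lazy dog.",
         doc2 = "The dog barks, and the fox runs away.")

# Build a corpus from the named character vector
corp <- corpus(txt)

# Tokenise, dropping punctuation, then remove English stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# Construct a document-feature matrix (features are lowercased by default)
dfmat <- dfm(toks)

# Inspect the most frequent features
topfeatures(dfmat, 5)
```

Each step returns a dedicated object (corpus, tokens, dfm), so the intermediate results can be inspected or reused for other analyses.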
Benoit, Kenneth, Kohei Watanabe, Haiyan Wang, Paul Nulty, Adam Obeng, Stefan Müller, and Akitaka Matsuo. (2018) “quanteda: An R package for the quantitative analysis of textual data”. Journal of Open Source Software. 3(30), 774. https://doi.org/10.21105/joss.00774.
For a BibTeX entry, use the output from `citation(package = "quanteda")`.
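One possible way to turn that output into BibTeX text is sketched below. It assumes quanteda is already installed and uses `toBibtex()` from R's built-in utils package; the variable names `cit` and `bib` are illustrative.

```r
# Retrieve the citation information shipped with the installed package
cit <- citation(package = "quanteda")

# Convert it to BibTeX lines (toBibtex() comes from the utils package)
bib <- toBibtex(cit)

# Print the entry so it can be copied into a .bib file
print(bib)
```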
If you like quanteda, please consider leaving feedback or a testimonial here.
Contributions in the form of feedback, comments, code, and bug reports are most welcome. How to contribute: