vignettes/pkgdown/comparison.Rmd
This article compares quanteda with alternative R packages for quantitative text analysis (tm, tidytext, corpus, and koRpus) and with the Natural Language Toolkit (NLTK) for Python. Where another package offers equivalent functionality, the table lists the corresponding command; an empty cell means we found no equivalent.
The comparison is based on the package manuals. If we have overlooked a function, please let us know, either by editing the table and opening a pull request or by contacting the maintainer.
Function | quanteda | tm | tidytext | corpus | koRpus | NLTK |
---|---|---|---|---|---|---|
Create corpus | corpus() | Corpus() | | corpus_frame() | read.corp.custom() | PlaintextCorpusReader() |
Bind/subset corpora | corpus_subset() | tm_combine(); tm_filter() | | | | |
Reshape corpus into smaller units | corpus_reshape(); corpus_segment() | | | text_split() | | |
Take random sample of corpus texts | corpus_sample() | | | | | |
Keywords-in-context | kwic() | | | text_locate() | | common_contexts() |
Tokenize texts | tokens() | tokenizer() | unnest_tokens() | text_tokens() | tokenize() | nltk.word_tokenize |
Stem features | tokens_wordstem() | stemDocument() | | stem_snowball() | treetag() | stem() |
Define multi-word features | phrase() | | | | | MWETokenizer |
Create document-feature matrix | dfm() | TermDocumentMatrix() | cast_dfm() | term_matrix() | | |
Create a feature co-occurrence matrix | fcm() | | | | | |
Weight a dfm | dfm_weight() | weightTf(); weightTfIdf() | bind_tf_idf() | | | |
Create a custom dictionary | dictionary() | | dictionary always a data.frame object | | | SentimentAnalyzer |
Included dictionaries | Lexicoder Sentiment Dictionary | | AFINN, Bing, NRC | AFINN Sentiment dictionary, WordNet-Affect Lexicon | | |
Apply custom dictionaries | dfm_lookup() | | dplyr::inner_join() | | | SentimentAnalyzer |
Supported dictionary formats | Wordstat, LIWC, Yoshikoder, Lexicoder, YAML | | data.frame objects | | | |
Calculate feature frequencies | textstat_frequency() | findMostFreqTerms() | dplyr::count() | term_stats() | freq.analysis() | FreqDist() |
Extract collocations | textstat_collocations() | | unnest_tokens(token = "ngrams") | | | collocations() |
Readability scores | textstat_readability() | | | | readability() | nltk_contrib.readability |
Lexical diversity | textstat_lexdiv() | | | | various measures | lexical_diversity() |
Distance/similarity measures | textstat_simil(); textstat_dist() | | | | | |
Keyness statistics | textstat_keyness() | | | | | |
Wordcloud | textplot_wordcloud() | | | | | |
Correspondence Analysis | textmodel_ca() | | | | | |
Naïve Bayes | textmodel_nb() | | | | | NaiveBayesClassifier |
Wordscores | textmodel_wordscores() | | | | | |
Wordfish | textmodel_wordfish() | | | | | |
Convert dfm to other format | convert() | | cast_tdm() | | | |
POS-tagging | spacyr package | | parts_of_speech() | | kRp.POS.tags() | nltk.pos_tag |
Import texts | readtext package | Reader() | read() | | | |
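
To make the quanteda column more concrete, the following is a minimal sketch of how several of the listed functions can be chained in one workflow. It assumes a reasonably recent quanteda release is installed; the two example texts and the dictionary categories (`positive`, `negative`) are invented for illustration and are not part of any package.

```r
library(quanteda)

# Toy texts, made up for this example
txt <- c(doc1 = "Taxes are rising and voters are worried.",
         doc2 = "Lower taxes pleased many happy voters.")

corp <- corpus(txt)                                # Create corpus
toks <- tokens(corp, remove_punct = TRUE)          # Tokenize texts
toks_stem <- tokens_wordstem(toks)                 # Stem features
dfmat <- dfm(toks)                                 # Create document-feature matrix
dfmat_prop <- dfm_weight(dfmat, scheme = "prop")   # Weight a dfm (relative frequencies)

# Create and apply a custom dictionary (hypothetical categories)
dict <- dictionary(list(positive = c("pleased", "happy"),
                        negative = c("worried")))
dfmat_dict <- dfm_lookup(dfmat, dictionary = dict)

# Keywords-in-context for a pattern, two tokens either side
kw <- kwic(toks, pattern = "taxes", window = 2)
```

The unstemmed tokens are passed to `dfm()` here so that the dictionary entries can be written as plain word forms; with stemmed tokens, the dictionary values would need to be stems or glob patterns such as `worr*`.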