Fit a multinomial or Bernoulli Naive Bayes model, given a dfm and some training labels.

textmodel_NB(x, y, smooth = 1, prior = c("uniform", "docfreq", "termfreq"),
  distribution = c("multinomial", "Bernoulli"), ...)

Arguments

x

the dfm on which the model will be fit. Does not need to contain only the training documents.

y

vector of training labels associated with each document in x. (These will be converted to factors if not already factors.)

smooth

smoothing parameter for feature counts by class

prior

prior distribution on texts; see Details

distribution

count model for text features, either multinomial or Bernoulli. To fit a "binary multinomial" model, first convert the dfm to a binary matrix using tf(x, "boolean"); see the sketch following this list of arguments.

...

additional arguments passed through

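As a hedged illustration of the distribution argument and of the binary conversion mentioned there (the objects txt, mydfm, labels, and the fitted-model names are made up for this sketch):

library(quanteda)

txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Tokyo Japan Chinese")
mydfm <- dfm(txt, tolower = FALSE)
labels <- factor(c("Y", "N"))

## multinomial (the default) vs. Bernoulli count models
nb_multi <- textmodel_NB(mydfm, labels, distribution = "multinomial")
nb_bern  <- textmodel_NB(mydfm, labels, distribution = "Bernoulli")

## "binary multinomial": recode counts to 0/1 first, then fit the multinomial model
nb_binmulti <- textmodel_NB(tf(mydfm, "boolean"), labels)
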
Value

A list of return values, consisting of:

call

original function call

PwGc

probability of the word given the class (empirical likelihood)

Pc

class prior probability

PcGw

posterior class probability given the word (see the worked sketch following this list)

Pw

baseline probability of the word

data

list of the model inputs, consisting of the dfm (x) and the training class labels (y)

distribution

the distribution argument

prior

the prior argument

smooth

smoothing parameter

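As a brief worked check of how these components relate (a sketch using the numbers printed in the docfreq-prior example below, not additional package output): PcGw is the Bayes-rule posterior formed from PwGc and Pc, normalised across classes.

## Bayes' rule for the word "Chinese" in the docfreq example:
## P(c|w) = P(w|c) * P(c) / sum_c' P(w|c') * P(c')
PwGc_chinese <- c(Y = 0.42857143, N = 0.2222222)   # likelihoods P(w|c)
Pc <- c(Y = 0.75, N = 0.25)                        # class priors
PwGc_chinese * Pc / sum(PwGc_chinese * Pc)         # 0.8526316 0.1473684, the PcGw row
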
Predict Methods

A predict method is also available for a fitted Naive Bayes object; see predict.textmodel_NB_fitted and the sketch below.

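The sketch below (with made-up object names) shows the typical workflow implied by the example code: documents whose label in y is NA are not used in fitting, but can still be scored afterwards with predict().

library(quanteda)

txt <- c(doc1 = "Chinese Beijing Chinese",
         doc2 = "Tokyo Japan Chinese",
         doc3 = "Chinese Chinese Shanghai Tokyo")   # unlabelled, held-out document
mydfm <- dfm(txt, tolower = FALSE)
labels <- factor(c("Y", "N", NA))

nb_fit <- textmodel_NB(mydfm, labels)    # doc3 is excluded from training
predict(nb_fit, newdata = mydfm[3, ])    # posterior class probabilities for doc3
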
References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

Jurafsky, D., & Martin, J. H. (2016). Speech and Language Processing. Draft of November 7, 2016. https://web.stanford.edu/~jurafsky/slp3/6.pdf

Examples

## Example from 13.1 of _An Introduction to Information Retrieval_
txt <- c(d1 = "Chinese Beijing Chinese",
         d2 = "Chinese Chinese Shanghai",
         d3 = "Chinese Macao",
         d4 = "Tokyo Japan Chinese",
         d5 = "Chinese Chinese Chinese Tokyo Japan")
trainingset <- dfm(txt, tolower = FALSE)
trainingclass <- factor(c("Y", "Y", "Y", "N", NA), ordered = TRUE)
## replicate IIR p261 prediction for test set (document 5)
(nb.p261 <- textmodel_NB(trainingset, trainingclass, prior = "docfreq"))
#> Fitted Naive Bayes model:
#> Call:
#> textmodel_NB.dfm(x = trainingset, y = trainingclass, prior = "docfreq")
#> 
#> 
#> Training classes and priors:
#>    Y    N 
#> 0.75 0.25 
#> 
#>          Likelihoods:          Class Posteriors:
#> 6 x 4 Matrix of class "dgeMatrix"
#>                   Y         N         Y         N
#> Chinese  0.42857143 0.2222222 0.8526316 0.1473684
#> Beijing  0.14285714 0.1111111 0.7941176 0.2058824
#> Shanghai 0.14285714 0.1111111 0.7941176 0.2058824
#> Macao    0.14285714 0.1111111 0.7941176 0.2058824
#> Tokyo    0.07142857 0.2222222 0.4909091 0.5090909
#> Japan    0.07142857 0.2222222 0.4909091 0.5090909
#> 
predict(nb.p261, newdata = trainingset[5, ])
#> Predicted textmodel of type: Naive Bayes
#> 
#>       lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d5 -8.10769 -8.906681 0.6898 0.3102         Y
#> 
# contrast with other priors
predict(textmodel_NB(trainingset, trainingclass, prior = "uniform"))
#> Predicted textmodel of type: Naive Bayes
#> 
#>        lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d1 -4.333653 -5.898527 0.8271 0.1729         Y
#> d2 -4.333653 -5.898527 0.8271 0.1729         Y
#> d3 -3.486355 -4.394449 0.7126 0.2874         Y
#> d4 -6.818560 -5.205379 0.1661 0.8339         N
#> d5 -8.513155 -8.213534 0.4257 0.5743         N
#> 
predict(textmodel_NB(trainingset, trainingclass, prior = "termfreq"))
#> Predicted textmodel of type: Naive Bayes
#> 
#>        lp(Y)     lp(N)  Pr(Y)  Pr(N) Predicted
#> d1 -3.958960 -6.504662 0.9273 0.0727         Y
#> d2 -3.958960 -6.504662 0.9273 0.0727         Y
#> d3 -3.111662 -5.000585 0.8686 0.1314         Y
#> d4 -6.443866 -5.811515 0.3470 0.6530         N
#> d5 -8.138462 -8.819670 0.6640 0.3360         Y
#> 
## replicate IIR p264 Bernoulli Naive Bayes
(nb.p261.bern <- textmodel_NB(trainingset, trainingclass, distribution = "Bernoulli",
                              prior = "docfreq"))
#> Fitted Naive Bayes model:
#> Call:
#> textmodel_NB.dfm(x = trainingset, y = trainingclass, prior = "docfreq", 
#>     distribution = "Bernoulli")
#> 
#> 
#> Training classes and priors:
#>    Y    N 
#> 0.75 0.25 
#> 
#>          Likelihoods:      Class Posteriors:
#> 6 x 4 Matrix of class "dgeMatrix"
#>            Y         N         Y         N
#> Chinese  0.8 0.6666667 0.7826087 0.2173913
#> Beijing  0.4 0.3333333 0.7826087 0.2173913
#> Shanghai 0.4 0.3333333 0.7826087 0.2173913
#> Macao    0.4 0.3333333 0.7826087 0.2173913
#> Tokyo    0.2 0.6666667 0.4736842 0.5263158
#> Japan    0.2 0.6666667 0.4736842 0.5263158
#> 
predict(nb.p261.bern, newdata = trainingset[5, ])
#> Error in getMethod("t", "dgCMatrix"): no generic function found for 't'