The tutorial covers:
- Creating sample data
- Preparing document matrix
- Defining the model
- Prediction and accuracy check
- Source code listing
We'll start by loading the required packages.
library(RTextTools)
library(e1071)
library(caret)
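If any of these packages are not installed yet, they can be added with install.packages(); note that RTextTools has at times been archived on CRAN, so on newer R versions it may have to be installed from the CRAN archive.
# Install the packages if they are not available yet
# (RTextTools may need to come from the CRAN archive on recent R versions)
install.packages(c("RTextTools", "e1071", "caret"))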
Creating sample data
First, we'll generate sample sentences to use as the dataset for this tutorial. The sentences are short, made-up opinions; you can add your own sentences or substitute other input data. Our task is to determine whether each opinion is positive or negative.
sentPositive = c(
"I like it", "like it a lot", "It's really good",
"recommend!", "Enjoyed!", "like it",
"It's really good", "recommend too",
"outstanding", "good", "recommend!",
"like it a lot", "really good",
"Definitely recommend!", "It is fun",
"liked!", "highly recommend this",
"fantastic show", "exciting",
"Very good", "it's ok",
"exciting show", "amazing performance",
"it is great!","I am excited a lot",
"it is terrific", "Definitely good one",
"very satisfied", "Glad we went",
"Once again outstanding!", "awesome"
)
sentNegative = c(
"Not good at all!", "rude",
"It is rude", "I don't like this type",
"poor", "Boring", "Not good!",
"not liked", "I hate this type of",
"not recommend", "not satisfied",
"not enjoyed", "Not recommend this.",
"disgusting movie","waste of time",
"feel tired after watching this",
"horrible performance", "not so good",
"so boring I fell asleep", "poor show",
"a bit strange","terrible"
)
df = data.frame(sentiment = "positive", text = sentPositive)
df = rbind(df, data.frame(sentiment = "negative", text = sentNegative))
Next, we'll split df into train (90%) and test (10%) parts. We set a random seed first so the split is reproducible, as in the full listing below.
set.seed(12345)
index = sample(1:nrow(df), size = .9 * nrow(df))
train = df[index, ]
test = df[-index, ]
head(train)
sentiment text
8 positive recommend too
13 positive really good
24 positive it is great!
14 positive Definitely recommend!
43 negative not enjoyed
3 positive It's really good
head(test)
sentiment text
5 positive Enjoyed!
20 positive Very good
21 positive it's ok
38 negative Not good!
51 negative poor show
52 negative a bit strange
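Because sample() draws the split at random, a dataset this small can end up with an unbalanced test part. caret's createDataPartition() draws a stratified sample within each sentiment class; a possible alternative to the split above (not used in the rest of the tutorial):
# Stratified 90/10 split that keeps the positive/negative ratio in both parts
indexStrat = createDataPartition(df$sentiment, p = 0.9, list = FALSE)
trainStrat = df[indexStrat, ]
testStrat  = df[-indexStrat, ]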
Preparing document matrix
Next, we'll create document-term matrices from the text columns of the train and test data with the create_matrix() function of the RTextTools package. RTextTools is a package for automatic text classification; its create_matrix() function builds a document-term matrix from a vector of texts.
mTrain = create_matrix(train[,2], language="english",
removeStopwords=FALSE, removeNumbers=TRUE,
stemWords=FALSE)
matTrain = as.matrix(mTrain)
mTest = create_matrix(test[,2], language="english",
removeStopwords=FALSE, removeNumbers=TRUE,
stemWords=FALSE)
matTest = as.matrix(mTest)
print(matTest)
Terms
Docs bit enjoyed good not poor show strange very
Enjoyed! 0 1 0 0 0 0 0 0
Very good 0 0 1 0 0 0 0 1
it's ok 0 0 0 0 0 0 0 0
Not good! 0 0 1 1 0 0 0 0
poor show 0 0 0 0 1 1 0 0
a bit strange 1 0 0 0 0 0 1 0
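Notice that the test matrix above contains only the eight terms that occur in the test sentences, so its vocabulary does not match the training matrix; predict() will later match columns by name and simply ignore training terms it cannot find. If you want the two matrices to share the training vocabulary, create_matrix() also accepts an originalMatrix argument; a sketch (this argument has been unreliable in some RTextTools releases, and we don't use it below):
# Build the test matrix over the training vocabulary (sketch, not used below)
mTestAligned = create_matrix(test[,2], language = "english",
                             removeStopwords = FALSE, removeNumbers = TRUE,
                             stemWords = FALSE, originalMatrix = mTrain)
matTestAligned = as.matrix(mTestAligned)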
Defining the model
We'll create the classifier with the naiveBayes() function from the e1071 package, which implements the Naive Bayes algorithm. To fit the model, we need the document-term matrix and the target labels.
labelTrain = as.factor(train[,1])
labelTest = as.factor(test[,1])
model = naiveBayes(matTrain, labelTrain)
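One caveat worth knowing: naiveBayes() treats numeric predictors as Gaussian variables, so the raw term counts above are modelled as continuous values. A common alternative for text data is to recode each term as a presence/absence feature before fitting. A minimal sketch, where convert_counts() is a helper defined here (it is not part of e1071), kept separate from the model used in the rest of the tutorial:
# Recode term counts as "Yes"/"No" presence indicators (sketch)
convert_counts = function(x) ifelse(x > 0, "Yes", "No")
trainCat = as.data.frame(apply(matTrain, 2, convert_counts))
modelCat = naiveBayes(trainCat, labelTrain)
# The same conversion would be applied to matTest before predicting with modelCat.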
We evaluate the fitted model on the training data. Note that caret's confusionMatrix() expects the predictions as its first argument and the reference (true) labels as its second; the call below passes them in the opposite order, so the "Prediction" rows in the printout actually contain the true labels (accuracy and kappa are unaffected by the swap).
pred = predict(model, matTrain)
confusionMatrix(labelTrain, pred)
Confusion Matrix and Statistics
Reference
Prediction positive negative
positive 27 1
negative 2 17
Accuracy : 0.9362
95% CI : (0.8246, 0.9866)
No Information Rate : 0.617
P-Value [Acc > NIR] : 6.026e-07
Kappa : 0.8664
Mcnemar's Test P-Value : 1
Sensitivity : 0.9310
Specificity : 0.9444
Pos Pred Value : 0.9643
Neg Pred Value : 0.8947
Prevalence : 0.6170
Detection Rate : 0.5745
Detection Prevalence : 0.5957
Balanced Accuracy : 0.9377
'Positive' Class : positive
Prediction and accuracy check
Finally, we'll predict our test data with the fitted model and check the accuracy.
pred = predict(model, matTest)
data.frame(test,pred)
sentiment text pred
5 positive Enjoyed! positive
20 positive Very good positive
21 positive it's ok positive
38 negative Not good! negative
51 negative poor show positive
52 negative a bit strange positive
confusionMatrix(labelTest, pred)
Confusion Matrix and Statistics
Reference
Prediction positive negative
positive 3 0
negative 2 1
Accuracy : 0.6667
95% CI : (0.2228, 0.9567)
No Information Rate : 0.8333
P-Value [Acc > NIR] : 0.9377
Kappa : 0.3333
Mcnemar's Test P-Value : 0.4795
Sensitivity : 0.6000
Specificity : 1.0000
Pos Pred Value : 1.0000
Neg Pred Value : 0.3333
Prevalence : 0.8333
Detection Rate : 0.5000
Detection Prevalence : 0.5000
Balanced Accuracy : 0.8000
'Positive' Class : positive
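The fitted model can also score sentences it has never seen. A small sketch (the newText value is our own example, not part of the tutorial data): build a document-term matrix for the new text with the same create_matrix() settings and pass it to predict(), which matches terms by column name.
# Classify a new, unseen sentence (example text is hypothetical)
newText = "really boring, not good at all"
mNew = create_matrix(newText, language = "english",
                     removeStopwords = FALSE, removeNumbers = TRUE,
                     stemWords = FALSE)
predict(model, as.matrix(mNew))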
In this tutorial, we've briefly learned how to classify sentiment data with the Naive Bayes method in R. The complete code is listed below.
Source code listing
library(RTextTools)
library(e1071)
library(caret)
set.seed(12345)
sentPositive <- c(
"I like it", "like it a lot", "It's really good",
"recommend!", "Enjoyed!", "like it",
"It's really good", "recommend too",
"outstanding", "good", "recommend!",
"like it a lot", "really good",
"Definitely recommend!", "It is fun",
"liked!", "highly recommend this",
"fantastic show", "exciting",
"Very good", "it's ok",
"exciting show", "amazing performance",
"it is great!","I am excited a lot",
"it is terrific", "Definitely good one",
"very satisfied", "Glad we went",
"Once again outstanding!", "awesome"
)
sentNegative <- c(
"Not good at all!", "rude",
"It is rude", "I don't like this type",
"poor", "Boring", "Not good!",
"not liked", "I hate this type of",
"not recommend", "not satisfied",
"not enjoyed", "Not recommend this.",
"disgusting movie","waste of time",
"feel tired after watching this",
"horrible performance", "not so good",
"so boring I fell asleep", "poor show",
"a bit strange","terrible"
)
df = data.frame(sentiment="positive", text=sentPositive)
df = rbind(df, data.frame(sentiment="negative", text=sentNegative))
index = sample(1:nrow(df), size = .9 * nrow(df))
train = df[index, ]
test = df[-index, ]
head(train)
head(test)
mTrain = create_matrix(train[,2], language = "english",
removeStopwords=FALSE, removeNumbers=TRUE,
stemWords=FALSE)
matTrain = as.matrix(mTrain)
mTest = create_matrix(test[,2], language = "english",
removeStopwords=FALSE, removeNumbers=TRUE,
stemWords=FALSE)
matTest = as.matrix(mTest)
labelTrain = as.factor(train[,1])
labelTest = as.factor(test[,1])
model = naiveBayes(matTrain, labelTrain)
pred = predict(model, matTrain)
confusionMatrix(labelTrain, pred)
pred = predict(model, matTest)
data.frame(test, pred)
confusionMatrix(labelTest, pred)