DataTechNotes: Sentiment Classification with NLTK Naive Bayes Classifier

NLTK (Natural Language Toolkit) provides a Naive Bayes classifier for classifying text data. In this post, we'll learn how to use the NLTK Naive Bayes classifier to classify text data in Python. You can get more information about NLTK on this page. The classifier expects its input in a different form from most other libraries (a list of feature-dictionary and label pairs), so data preparation is the most important part of this tutorial to understand. The post covers:

Data preparation
Training the model
Predicting the test data

We'll start by loading the required libraries.

import pandas as pd
import nltk
import random
from nltk.tokenize import word_tokenize

Data preparation

Here, I prepared a simple sentiment dataset for this tutorial. The data contains made-up sentiment texts. In a datasets folder, we'll place a pos_sentiment.csv file that contains positive sentiment data and a neg_sentiment.csv file that contains negative sentiment data. You can find the content of both files at the end of this post. Copy and save them in your datasets folder.

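We'll load the files as shown below. Note that read_csv treats the first line of each file as a header by default, so the first sentence in each file becomes a column name (which we then overwrite) rather than a data row; pass header=None if you want to keep it.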

poss = pd.read_csv('datasets/pos_sentiment.csv')
negs = pd.read_csv('datasets/neg_sentiment.csv')
poss.columns = ["text"]
negs.columns = ["text"]
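As a small check of how read_csv's default header handling affects row counts, the sketch below reads the same three-line text twice from an in-memory buffer. The sample lines are taken from pos_sentiment.csv; the variable names and the use of io.StringIO are only for illustration.

```python
import io
import pandas as pd

# Three sample lines in the same format as pos_sentiment.csv (no header row)
csv_text = '"I like it "\n"like it a lot "\n"It\'s really good "\n'

# Default behaviour: the first line is consumed as the column header
with_header = pd.read_csv(io.StringIO(csv_text))
with_header.columns = ["text"]

# header=None keeps every line as data; names= sets the column name directly
no_header = pd.read_csv(io.StringIO(csv_text), header=None, names=["text"])

print(len(with_header), len(no_header))
```

With the default settings one row is lost per file, which is worth keeping in mind when counting samples later.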

Then, we'll attach a 'positive' or 'negative' label to each line of text data.

data=([(pos['text'], 'positive') for index, pos in poss.iterrows()]+
    [(neg['text'], 'negative') for index, neg in negs.iterrows()])

print(data[0:3])

[('like it a lot ', 'positive'), ("It's really good ", 'positive'),
 ('Recommend! I really enjoyed! ', 'positive')]

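Next, we'll tokenize the text, build a vocabulary of lowercased tokens, and create the training data. Each sample becomes a dictionary that maps every vocabulary word to True or False, depending on whether the word occurs in that text.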

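# Vocabulary: every distinct lowercased token in the data
tokens = set(word.lower() for words in data for word in word_tokenize(words[0]))
# Lowercase each text before tokenizing so it matches the lowercased vocabulary
train = [({word: (word in word_tokenize(x[0].lower()))
           for word in tokens}, x[1]) for x in data]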

The first training sample looks like this.

print(train[0])
({'i': False, 'fun': False, 'again': False, 'an': False,
 'excellent': False, 'exciting': False, 'go': False,
 'nasty': False, 'what': False, 'restaurant': False,
 'really': False, 'horrible': False, 'enjoyed': False,
 'did': False, 'too': False, 'terrific': False,
 'strange': False, 'so': False, 'exceptional': False,
 'am': False, 'once': False, 'definitely': False,
 'went': False, 'it': True, 'good': False, 'one': False,
 'great': False, 'time': False, '.': False, 'satisfied': False,
 'awesome': False, 'expect': False, 'tired': False,
 'offensive': False, 'service': False, 'disgusting': False,
 'asleep': False, 'nightmare': False, 'we': False,
 'after': False, 'this': False, 'type': False, 'nice': False,
 'feel': False, 'poor': False, 'fantastic': False,
 .....
 'you': False, 'not': False, 'the': False, 'movie': False,
 'a': True, "n't": False, 'recommend': False, 'ok': False}, 'positive')
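To make the feature format concrete, here is a minimal sketch on two made-up sentences. It uses str.split as a stand-in for word_tokenize (an assumption made so the snippet runs without NLTK tokenizer data), and the helper name to_features is hypothetical.

```python
# Toy labeled data in the same (text, label) form as `data` above
samples = [("really good show", "positive"),
           ("not good at all", "negative")]

# Vocabulary of distinct lowercased tokens (whitespace split stands in
# for word_tokenize here; this is an assumption for illustration)
vocab = sorted({w for text, _ in samples for w in text.lower().split()})

def to_features(text, vocab):
    """Map every vocabulary word to True/False for one text (hypothetical helper)."""
    words = set(text.lower().split())
    return {w: (w in words) for w in vocab}

toy_train = [(to_features(text, vocab), label) for text, label in samples]
print(toy_train[0])
```

Every sample carries the full vocabulary as keys, so the dictionaries all have the same shape and differ only in which entries are True.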

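Finally, we'll shuffle the training data and split it into train and test parts. Note that the slices below leave out the item at index 50, so the test part holds four samples; use train[50:55] if you want to include it.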

random.shuffle(train)

len(train)
55

train_x=train[0:50]
test_x=train[51:55]
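If you want the split to be repeatable between runs, one option is to shuffle with a seeded random.Random instance and cut the list at a single index so that no item falls between the two parts. The seed, the 90% ratio, and the placeholder items below are illustrative choices, not part of the original code.

```python
import random

# Placeholder items standing in for the 55 (features, label) pairs
items = [({"dummy": True}, "positive" if i % 2 else "negative") for i in range(55)]

rng = random.Random(42)      # fixed seed so the shuffle is repeatable
rng.shuffle(items)

cut = int(len(items) * 0.9)  # single cut index: 49 of 55 items go to training
train_part = items[:cut]
test_part = items[cut:]      # everything after the cut, nothing skipped

print(len(train_part), len(test_part))
```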

Training the model

We'll define an NLTK Naive Bayes model and train it with the train_x data.

model = nltk.NaiveBayesClassifier.train(train_x)
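As a rough illustration of the input that train expects, the sketch below fits the classifier on four hand-written feature dictionaries instead of the tutorial's data. The feature names and labels are made up for the example.

```python
import nltk

# Each training item is a (feature dictionary, label) pair
toy_train = [
    ({"good": True,  "bad": False}, "positive"),
    ({"good": True,  "bad": False}, "positive"),
    ({"good": False, "bad": True},  "negative"),
    ({"good": False, "bad": True},  "negative"),
]

toy_model = nltk.NaiveBayesClassifier.train(toy_train)

# Classify a new feature dictionary that uses the same keys
prediction = toy_model.classify({"good": True, "bad": False})
print(prediction)
print(toy_model.labels())
```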

The most informative features can be listed with the method below.

model.show_most_informative_features()

Most Informative Features
                    this = True           negati : positi =      4.4 : 1.0
                      it = True           positi : negati =      2.9 : 1.0
                       , = True           negati : positi =      2.7 : 1.0
                    show = True           positi : negati =      2.0 : 1.0
                       a = True           positi : negati =      1.6 : 1.0
               recommend = True           positi : negati =      1.6 : 1.0
             performance = True           negati : positi =      1.5 : 1.0
                    like = True           negati : positi =      1.5 : 1.0
                   liked = True           negati : positi =      1.5 : 1.0
                 enjoyed = True           negati : positi =      1.5 : 1.0

We can check the model's prediction accuracy with the test_x data.

acc=nltk.classify.accuracy(model, test_x)
print("Accuracy:", acc)

Accuracy: 0.75
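The accuracy value is the fraction of test items whose predicted label matches the true label. With only four test items, it can take just five values (0, 0.25, 0.5, 0.75, 1), so 0.75 here means three of the four were right. The sketch below recomputes that fraction by hand; the label lists are made up for the example.

```python
# Hypothetical true and predicted labels for four test items
gold      = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "negative", "positive", "positive"]

# Count matching positions and divide by the number of items
correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)
print("Accuracy:", accuracy)
```

A test set this small gives only a coarse estimate, so the number will vary noticeably from one shuffle to the next.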

Predicting the test data

Finally, we'll classify new, unseen sentences with the trained model. Each sentence is converted to the same kind of feature dictionary that was used for training.

tests=['I really like it', 
       'I do not think this is good one', 
       'this is good one',
       'I hate the show!']

for test in tests:
    # Tokenize once per sentence, then flag which vocabulary words occur in it
    words = word_tokenize(test.lower())
    t_features = {word: (word in words) for word in tokens}
    print(test, " : ", model.classify(t_features))

I really like it  :  positive
I do not think this is good one  :  negative
this is good one  :  positive
I hate the show!  :  positive
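The last sentence above is labeled positive although it reads as negative. One way to see how confident the model is in a label is prob_classify, which returns a probability distribution over the labels instead of a single label. The sketch below uses a small hand-made model rather than the tutorial's; the feature names are made up.

```python
import nltk

toy_train = [
    ({"good": True,  "hate": False}, "positive"),
    ({"good": True,  "hate": False}, "positive"),
    ({"good": False, "hate": True},  "negative"),
    ({"good": False, "hate": True},  "negative"),
]
toy_model = nltk.NaiveBayesClassifier.train(toy_train)

# prob_classify returns a distribution; prob(label) gives each label's probability
dist = toy_model.prob_classify({"good": False, "hate": True})
best = dist.max()
for label in sorted(dist.samples()):
    print(label, round(dist.prob(label), 3))
```

On the tutorial's model, the equivalent call would be model.prob_classify(t_features).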

In this tutorial, we've briefly learned how to classify sentiment data with the NLTK Naive Bayes classifier in Python. Thank you for reading.

The full source code and training data are listed below.

import pandas as pd
import nltk
import random
from nltk.tokenize import word_tokenize

poss = pd.read_csv('datasets/pos_sentiment.csv')
negs = pd.read_csv('datasets/neg_sentiment.csv')
poss.columns = ["text"]
negs.columns = ["text"]

data=([(pos['text'], 'positive') for index, pos in poss.iterrows()]+
    [(neg['text'], 'negative') for index, neg in negs.iterrows()])

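# Vocabulary: every distinct lowercased token in the data
tokens = set(word.lower() for words in data for word in word_tokenize(words[0]))
# Lowercase each text before tokenizing so it matches the lowercased vocabulary
train = [({word: (word in word_tokenize(x[0].lower()))
           for word in tokens}, x[1]) for x in data]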

print(tokens)
print(train[0])

random.shuffle(train)
train_x=train[0:50]
test_x=train[51:55]

model = nltk.NaiveBayesClassifier.train(train_x)
acc=nltk.classify.accuracy(model, test_x)
print("Accuracy:", acc)

model.show_most_informative_features()

tests=['I really like it', 
    'I do not think this is good one', 
    'this is good one',
    'I hate the show!']

for test in tests:
    # Tokenize once per sentence, then flag which vocabulary words occur in it
    words = word_tokenize(test.lower())
    t_features = {word: (word in words) for word in tokens}
    print(test, " : ", model.classify(t_features))

pos_sentiment.csv data

"I like it "
"like it a lot "
"It's really good "
"Recommend! I really enjoyed! "
"It's really good "
"recommend too "
"outstanding performance "
"it's good! recommend! "
"Great! "
"really good. Definitely, recommend! "
"It is fun "
"Exceptional! liked a lot! "
"highly recommend this "
"fantastic show "
"exciting, liked. "
"it's ok "
"exciting show "
"amazing performance "
"it is great! "
"I am excited a lot "
"it is terrific "
"Definitely good one "
"Excellent, very satisfied "
"Glad we went "
"Once again outstanding! "
"awesome! excellent show "
"This is truly a good one! "
"What a nice restaurant."
"What a nice show."
"what a great place!"
"Great atmosphere"
"Definitely you should go"
"This is a great!"
"I really love it"

neg_sentiment.csv data

"it's mediocre! not recommend "
"Not good at all! "
"It is rude "
"I don't like this type "
"poor performance "
"Boring, not good at all! "
"not liked "
"I hate this type of things "
"not recommend, not satisfied "
"not enjoyed, I don't recommend this. "
"disgusting movie "
"waste of time, poor show "
"feel tired after watching this "
"horrible performance "
"not so good "
"so boring I fell asleep "
"a bit strange "
"terrible! I did not expect. "
"This is an awful "
"Nasty and horrible! "
"Offensive, it is a crap! "
"Disappointing! not liked. "
"The service is a nightmare"
