- Data preparation
- Training the model
- Predicting the test data
import pandas as pd
import nltk
import random
from nltk.tokenize import word_tokenize
Data preparation
I prepared a small sentiment dataset for this tutorial. It contains made-up sentiment texts. The datasets folder holds two files: pos_sentiment.csv with the positive texts and neg_sentiment.csv with the negative texts. The content of both files is listed at the end of this post; copy it and save it in your datasets folder.
We'll load the files as shown below.
poss = pd.read_csv('datasets/pos_sentiment.csv')
negs = pd.read_csv('datasets/neg_sentiment.csv')
poss.columns = ["text"]
negs.columns = ["text"]
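One thing to keep in mind: read_csv treats the first line of a file as the header by default, so the first sentence of each file ends up as a column name instead of a data row. That is why the 34 and 23 lines of the two files give 33 + 22 = 55 rows. If you'd rather keep every line, a minimal sketch with header=None looks like this (the three-line sample string is a stand-in for the real file):

```python
import io
import pandas as pd

# Hypothetical three-line sample in the same format as the sentiment files.
sample = '"I like it "\n"like it a lot "\n"It\'s really good "\n'

# Default behaviour: the first line is consumed as the column header.
with_header = pd.read_csv(io.StringIO(sample))
with_header.columns = ["text"]

# header=None keeps every line as data; names= sets the column name directly.
no_header = pd.read_csv(io.StringIO(sample), header=None, names=["text"])

print(len(with_header), len(no_header))
print(no_header["text"].tolist())
```

The rest of the tutorial keeps the default loading, so the counts and outputs shown below refer to 55 rows.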
Then, we'll attach a 'positive' or 'negative' label to each line of text.
data = ([(pos['text'], 'positive') for index, pos in poss.iterrows()] +
        [(neg['text'], 'negative') for index, neg in negs.iterrows()])
print(data[0:3])

[('like it a lot ', 'positive'), ("It's really good ", 'positive'),
 ('Recommend! I really enjoyed! ', 'positive')]

Next, we'll tokenize the words in the text data and create the train data.
tokens = set(word.lower() for words in data for word in word_tokenize(words[0]))
# Lowercase each text before the membership test so it matches the lowercased tokens
train = [({word: (word in word_tokenize(x[0].lower())) for word in tokens}, x[1])
         for x in data]
The first item of the train data looks like this.
print(train[0])

({'i': False, 'fun': False, 'again': False, 'an': False,
'excellent': False, 'exciting': False, 'go': False,
'nasty': False, 'what': False, 'restaurant': False,
'really': False, 'horrible': False, 'enjoyed': False,
'did': False, 'too': False, 'terrific': False,
'strange': False, 'so': False, 'exceptional': False,
'am': False, 'once': False, 'definitely': False,
'went': False, 'it': True, 'good': False, 'one': False,
'great': False, 'time': False, '.': False, 'satisfied': False,
'awesome': False, 'expect': False, 'tired': False,
'offensive': False, 'service': False, 'disgusting': False,
'asleep': False, 'nightmare': False, 'we': False,
'after': False, 'this': False, 'type': False, 'nice': False,
'feel': False, 'poor': False, 'fantastic': False,
.....
'you': False, 'not': False, 'the': False, 'movie': False,
'a': True, "n't": False, 'recommend': False, 'ok': False}, 'positive')
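To make the structure of these feature dictionaries easier to see, here is a small self-contained sketch on two sentences. It uses a simple regular-expression tokenizer as a stand-in for word_tokenize (that tokenizer and the helper names are assumptions for this example, not part of NLTK), and it tokenizes each text only once:

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk's word_tokenize (an assumption for this sketch):
    # lowercases, then splits into runs of word characters or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

samples = [("like it a lot ", "positive"), ("Not good at all! ", "negative")]

# Vocabulary: every distinct lowercased token in the sample texts.
vocab = sorted({tok for text, _ in samples for tok in simple_tokenize(text)})

def to_features(text, vocab):
    present = set(simple_tokenize(text))  # tokenize once per text
    return {word: (word in present) for word in vocab}

features = [(to_features(text, vocab), label) for text, label in samples]
print(vocab)
print(features[0])
```

Every dictionary has one key per vocabulary word, and the value only says whether that word occurs in the text.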
Finally, we'll shuffle the train data and split it into train and test parts.
random.shuffle(train)
len(train)
55
# Note: these slices leave out index 50, so test_x holds 4 items (indices 51 to 54)
train_x = train[0:50]
test_x = train[51:55]
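The slices train[0:50] and train[51:55] skip the item at index 50. If you want every item to land in one of the two parts, and a shuffle that gives the same order on each run, a possible sketch is the following (the seed value and the integer placeholders are assumptions for illustration):

```python
import random

items = list(range(55))   # placeholder for the 55 (features, label) pairs
random.seed(42)           # hypothetical seed, only to make the shuffle repeatable
random.shuffle(items)

split = 50
train_part = items[:split]   # first 50 items
test_part = items[split:]    # remaining 5 items, nothing is skipped
print(len(train_part), len(test_part))
```

With this split the test part has 5 items instead of 4, so the accuracy values will differ from the ones shown in this post.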
Training the model
We'll define an NLTK Naive Bayes model and train it on the train_x data.
model = nltk.NaiveBayesClassifier.train(train_x)
The most informative features can be listed with the method below.
model.show_most_informative_features()
Most Informative Features
           this = True    negati : positi = 4.4 : 1.0
             it = True    positi : negati = 2.9 : 1.0
              , = True    negati : positi = 2.7 : 1.0
           show = True    positi : negati = 2.0 : 1.0
              a = True    positi : negati = 1.6 : 1.0
      recommend = True    positi : negati = 1.6 : 1.0
    performance = True    negati : positi = 1.5 : 1.0
           like = True    negati : positi = 1.5 : 1.0
          liked = True    negati : positi = 1.5 : 1.0
        enjoyed = True    negati : positi = 1.5 : 1.0
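In this printout, "negati" and "positi" are the shortened label names, and each row compares how likely a feature value is under one label versus the other. As a rough sketch of where such a ratio comes from, assuming add-0.5 smoothing of the per-label counts (as far as I know this is the default estimator of the NLTK trainer, but treat it as an assumption) and using hypothetical counts rather than the tutorial's data:

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    # Add-gamma estimate of P(feature value | label).
    # bins=2 because each feature here is either True or False.
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts, not taken from the tutorial's data:
# how many training texts per label contain the word.
neg_with_word, neg_total = 5, 20
pos_with_word, pos_total = 2, 30

p_neg = smoothed_prob(neg_with_word, neg_total)   # 5.5 / 21
p_pos = smoothed_prob(pos_with_word, pos_total)   # 2.5 / 31
ratio = p_neg / p_pos

print(f"word = True    negative : positive = {ratio:.1f} : 1.0")
```

A ratio well above 1.0 means the word is much more typical of one label, which is why such features are listed first.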
We can check the model's prediction accuracy on the test_x data.
acc = nltk.classify.accuracy(model, test_x)
print("Accuracy:", acc)
Accuracy: 0.75
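Accuracy here is simply the share of test items whose predicted label matches the true label. With only four test items, 0.75 means three correct predictions out of four, so the number is a very coarse estimate. A small sketch with hypothetical labels shows the arithmetic:

```python
# Hypothetical true and predicted labels for four test items
gold = ["positive", "negative", "negative", "positive"]
predicted = ["positive", "negative", "positive", "positive"]

# Count the matching pairs and divide by the number of items
correct = sum(g == p for g, p in zip(gold, predicted))
accuracy = correct / len(gold)
print("Accuracy:", accuracy)
```

Because the data is shuffled randomly, the value you get will change from run to run.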
Predicting the test data
Finally, we'll predict the new test data with the trained model.
tests = ['I really like it', 'I do not think this is good one',
         'this is good one', 'I hate the show!']
for test in tests:
    t_features = {word: (word in word_tokenize(test.lower())) for word in tokens}
    print(test, " : ", model.classify(t_features))

I really like it : positive
I do not think this is good one : negative
this is good one : positive
I hate the show! : positive
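Note that the features of a new sentence are built only from the words in tokens, so any word that never appeared in the training texts has no influence on the prediction. A quick way to see which words are ignored is sketched below; the tokenizer is a simple stand-in for word_tokenize and the vocabulary is a hypothetical subset, both assumptions for this example:

```python
import re

def simple_tokenize(text):
    # Stand-in for nltk's word_tokenize (an assumption for this sketch)
    return re.findall(r"\w+|[^\w\s]", text.lower())

# Hypothetical subset of the training vocabulary
vocab = {"i", "really", "like", "it", "hate", "the", "show", "!"}

def unseen_words(sentence, vocab):
    # Words of the sentence that the model has no feature for
    return [w for w in simple_tokenize(sentence) if w not in vocab]

print(unseen_words("I absolutely hate the show!", vocab))
```

With such a small training set, a single word like "show" can also outweigh "hate", which is one plausible reason for the last prediction above.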
In this tutorial, we've briefly learned how to classify sentiment data with the NLTK Naive Bayes classifier in Python. Thank you for reading.
The full source code and training data are listed below.
import pandas as pd
import nltk
import random
from nltk.tokenize import word_tokenize

# Load the data and label each text
poss = pd.read_csv('datasets/pos_sentiment.csv')
negs = pd.read_csv('datasets/neg_sentiment.csv')
poss.columns = ["text"]
negs.columns = ["text"]
data = ([(pos['text'], 'positive') for index, pos in poss.iterrows()] +
        [(neg['text'], 'negative') for index, neg in negs.iterrows()])

# Build the vocabulary and the word-presence features
tokens = set(word.lower() for words in data for word in word_tokenize(words[0]))
# Lowercase each text before the membership test so it matches the lowercased tokens
train = [({word: (word in word_tokenize(x[0].lower())) for word in tokens}, x[1])
         for x in data]
print(tokens)
print(train[0])

# Shuffle and split (index 50 is left out of both parts)
random.shuffle(train)
train_x = train[0:50]
test_x = train[51:55]

# Train and evaluate the model
model = nltk.NaiveBayesClassifier.train(train_x)
acc = nltk.classify.accuracy(model, test_x)
print("Accuracy:", acc)
model.show_most_informative_features()

# Predict new sentences
tests = ['I really like it', 'I do not think this is good one',
         'this is good one', 'I hate the show!']
for test in tests:
    t_features = {word: (word in word_tokenize(test.lower())) for word in tokens}
    print(test, " : ", model.classify(t_features))
pos_sentiment.csv data
"I like it "
"like it a lot "
"It's really good "
"Recommend! I really enjoyed! "
"It's really good "
"recommend too "
"outstanding performance "
"it's good! recommend! "
"Great! "
"really good. Definitely, recommend! "
"It is fun "
"Exceptional! liked a lot! "
"highly recommend this "
"fantastic show "
"exciting, liked. "
"it's ok "
"exciting show "
"amazing performance "
"it is great! "
"I am excited a lot "
"it is terrific "
"Definitely good one "
"Excellent, very satisfied "
"Glad we went "
"Once again outstanding! "
"awesome! excellent show "
"This is truly a good one! "
"What a nice restaurant."
"What a nice show."
"what a great place!"
"Great atmosphere"
"Definitely you should go"
"This is a great!"
"I really love it"
neg_sentiment.csv data
"it's mediocre! not recommend "
"Not good at all! "
"It is rude "
"I don't like this type "
"poor performance "
"Boring, not good at all! "
"not liked "
"I hate this type of things "
"not recommend, not satisfied "
"not enjoyed, I don't recommend this. "
"disgusting movie "
"waste of time, poor show "
"feel tired after watching this "
"horrible performance "
"not so good "
"so boring I fell asleep "
"a bit strange "
"terrible! I did not expect. "
"This is an awful "
"Nasty and horrible! "
"Offensive, it is a crap! "
"Disappointing! not liked. "
"The service is a nightmare"