- Preparing the data
- Defining the keras model
- Predicting test data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
Preparing the data
I prepared a simple sentiment dataset for this tutorial. It contains imaginary opinions, where a positive opinion is labeled '1' and a negative opinion is labeled '0'. Below is a sample of the sentiment training data. You can find the full list of the sentiment data in this link and save it as a sentiments.csv file in your target folder.
1,"I like it "
1,"like it a lot "
1,"It's really good "
1,"Recommend! I really enjoyed! "
1,"It's really good "
1,"recommend too "
1,"outstanding performance "
...
0,"it's mediocre! not recommend "
0,"Not good at all! "
0,"It is rude "
0,"I don't like this type "
0,"poor performance "
0,"Boring, not good at all! "
0,"not liked "
0,"I hate this type of things "
...
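Each row is ordinary CSV: an integer label followed by the quoted text, which may itself contain a comma. As a small sketch of that layout, the snippet below parses a few of the sample rows with Python's standard csv module. It reads from an in-memory string instead of the real file, and the names `sample` and `rows` are only illustrative.

```python
import csv
import io

# A few rows copied from the sample above; the real data lives in sentiments.csv.
sample = '''1,"I like it "
1,"Recommend! I really enjoyed! "
0,"Boring, not good at all! "
0,"not liked "
'''

# csv.reader honours the double quotes, so the comma inside
# "Boring, not good at all! " does not split the text field.
rows = [(int(label), text) for label, text in csv.reader(io.StringIO(sample))]

for label, text in rows:
    print(label, repr(text))
```

The quoting matters here: splitting each line on every comma would break the third row into three fields.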
First, we'll load the text data and split it into train and test parts.
df = pd.read_csv('datasets/sentiments.csv')
df.columns = ["label","text"]
x = df['text'].values
y = df['label'].values
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.1, random_state=123)
Next, we'll vectorize the text data with the Tokenizer class, which maps each word to an integer index.
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain= tokenizer.texts_to_sequences(x_train)
xtest= tokenizer.texts_to_sequences(x_test)
print(xtest)
[[6, 42, 43, 1, 15], [21, 14], [76, 6, 77, 2, 78],
 [17, 1, 25, 53], [2, 5, 2, 24], [17, 1, 25, 12]]
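To make the integer sequences above less opaque, here is a rough pure-Python sketch of the idea behind the tokenizer: count the words, give the most frequent word index 1, and replace each word with its index. This is a simplification and not Keras's actual implementation; the helper names `build_word_index` and `texts_to_sequences` and the regular expression are assumptions made for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9']+")  # assumed word pattern, for illustration only

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(WORD.findall(text.lower()))
    # Index 0 is left unused so it can later serve as the padding value.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word with its index; unknown words are dropped."""
    return [
        [word_index[w] for w in WORD.findall(t.lower()) if w in word_index]
        for t in texts
    ]

texts = ["I like it ", "like it a lot ", "Not good at all! "]
word_index = build_word_index(texts)
print(word_index)
print(texts_to_sequences(["like it", "not bad at all"], word_index))
```

In the second call, "bad" never appeared in the fitted texts, so it is simply skipped in the output sequence.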
We'll apply padding to append zeros so that every vector has the same fixed length.
maxlen=20
xtrain=pad_sequences(xtrain,padding='post', maxlen=maxlen)
xtest=pad_sequences(xtest,padding='post', maxlen=maxlen)
print(xtest)
[[ 6 42 43  1 15  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [21 14  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [76  6 77  2 78  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [17  1 25 53  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  5  2 24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [17  1 25 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]]
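The padding step can be sketched in a few lines of plain Python. The hypothetical helper `pad_post` below mimics what padding='post' does: zeros go on the right, and a sequence longer than maxlen keeps only its last maxlen items (which, as far as I know, matches the default truncating='pre' behaviour of pad_sequences). It assumes maxlen is at least 1 and returns plain lists, not a NumPy array.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; keep the last `maxlen` items if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # truncate from the front when too long
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

sequences = [[6, 42, 43, 1, 15], [21, 14]]
for row in pad_post(sequences, maxlen=8):
    print(row)
```

A shorter maxlen of 8 is used here only to keep the printed rows compact.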
Defining the keras model
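Before creating the keras model, we need to define the vocabulary size and the embedding dimension. We can get the vocabulary size from the tokenizer's word index. We add 1 because the word index starts at 1, while index 0 is reserved for padding.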
vocab_size=len(tokenizer.word_index)+1
embedding_dim=50
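Conceptually, an embedding layer is a lookup table with one row per token index and embedding_dim columns. The toy sketch below uses made-up sizes (6 and 3 instead of the tutorial's values) and random numbers in place of trained weights; it is only meant to show why the table needs one more row than the number of words.

```python
import random

random.seed(0)

vocab_size = 6      # hypothetical: 5 known words + 1 row for the padding index 0
embedding_dim = 3   # kept tiny for display; the tutorial uses 50

# One row of random numbers per token index; training would adjust these values.
embedding_matrix = [
    [round(random.uniform(-0.05, 0.05), 3) for _ in range(embedding_dim)]
    for _ in range(vocab_size)
]

def embed(sequence):
    """Look up the vector for each token index in a padded sequence."""
    return [embedding_matrix[idx] for idx in sequence]

padded = [2, 5, 2, 4, 0, 0]
vectors = embed(padded)
print(len(vectors), "x", len(vectors[0]))
```

The largest index in `padded` is 5, so the table must have 6 rows; with only 5 rows the lookup for the last word would fail.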
Next, we'll create a keras sequential model, add the Embedding layer and the other layers into the model, and compile it.
model=Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(16,activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 20, 50)            4450
_________________________________________________________________
flatten_28 (Flatten)         (None, 1000)              0
_________________________________________________________________
dense_42 (Dense)             (None, 16)                16016
_________________________________________________________________
dense_43 (Dense)             (None, 1)                 17
=================================================================
Total params: 20,483
Trainable params: 20,483
Non-trainable params: 0
_________________________________________________________________
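The parameter counts in the summary can be checked with simple arithmetic. The sketch below assumes a vocabulary size of 89, which is inferred from the summary (4450 / 50) and is not stated elsewhere in the post.

```python
# 89 is inferred from the summary (4450 / 50); 50 and 20 come from the code above.
vocab_size, embedding_dim, maxlen = 89, 50, 20

embedding_params = vocab_size * embedding_dim   # one vector per token index
flatten_units = maxlen * embedding_dim          # Flatten only reshapes, no weights
dense1_params = flatten_units * 16 + 16         # weights + biases of Dense(16)
dense2_params = 16 * 1 + 1                      # weights + bias of Dense(1)

total = embedding_params + dense1_params + dense2_params
print(embedding_params, flatten_units, dense1_params, dense2_params, total)
```

Most of the weights sit in the first Dense layer, because Flatten turns every 20 x 50 embedded sentence into a single vector of 1000 values.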
Finally, we'll train the model and check the training accuracy.
model.fit(xtrain,y_train, epochs=20, batch_size=16, verbose=False)
loss, acc = model.evaluate(xtrain, y_train, verbose=False)
# round() works whether evaluate returns a Python float or a NumPy scalar
print("Training Accuracy: ", round(acc, 2))
Training Accuracy: 0.8
Predicting test data
We can predict test data and check the result accuracy.
ypred=model.predict(xtest)
ypred[ypred>0.5]=1
ypred[ypred<=0.5]=0
cm = confusion_matrix(y_test, ypred)
print(cm)
[[2 0]
 [1 3]]
Next, we'll print each test sentence along with its original and predicted labels.
result=zip(x_test, y_test, ypred)
for i in result:
    print(i)
('I am excited a lot ', 1, array([0.], dtype=float32))
('exciting, liked. ', 1, array([1.], dtype=float32))
('terrible! I did not expect. ', 0, array([0.], dtype=float32))
('What a nice restaurant.', 1, array([1.], dtype=float32))
('not recommend, not satisfied ', 0, array([0.], dtype=float32))
('What a nice show.', 1, array([1.], dtype=float32))
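The confusion matrix shown earlier can be reproduced by hand from these six rows. The sketch below copies the true and predicted labels from the printed output and counts them in plain Python, with rows as true labels and columns as predicted labels, the same layout sklearn's confusion_matrix uses.

```python
# Labels copied from the six printed test rows above.
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 0, 1, 0, 1]

# cm[t][p] counts samples whose true label is t and predicted label is p.
cm = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(cm)
print(round(accuracy, 2))
```

The single off-diagonal count is the first sentence, a positive opinion that the model predicted as negative, so five of the six test sentences are classified correctly.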
In this post, we've briefly learned how to implement word embedding for binary classification of text data with keras.
The full source code is listed below.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd

# Load the data and split it into train and test parts
df = pd.read_csv('datasets/sentiments.csv')
df.columns = ["label","text"]
x = df['text'].values
y = df['label'].values
x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.1, random_state=123)

# Convert the text into integer sequences
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain= tokenizer.texts_to_sequences(x_train)
xtest= tokenizer.texts_to_sequences(x_test)
vocab_size=len(tokenizer.word_index)+1

# Pad every sequence with zeros to a fixed length
maxlen=20
xtrain=pad_sequences(xtrain,padding='post', maxlen=maxlen)
xtest=pad_sequences(xtest,padding='post', maxlen=maxlen)

# Define and compile the model
embedding_dim=50
model=Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.Flatten())
model.add(layers.Dense(16,activation="relu"))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.summary()

# Train the model and check the training accuracy
model.fit(xtrain,y_train, epochs=20, batch_size=16, verbose=False)
loss, acc = model.evaluate(xtrain, y_train, verbose=False)
# round() works whether evaluate returns a Python float or a NumPy scalar
print("Training Accuracy: ", round(acc, 2))

# Predict the test data and threshold the probabilities at 0.5
ypred=model.predict(xtest)
ypred[ypred>0.5]=1
ypred[ypred<=0.5]=0
cm = confusion_matrix(y_test, ypred)
print(cm)

# Print each test sentence with its original and predicted labels
result=zip(x_test, y_test, ypred)
for i in result:
    print(i)