learn sequence data in deep learning. In this post, we'll learn how to apply an LSTM to a binary text classification problem. The post covers:
- Preparing data
- Defining the LSTM model
- Predicting test data
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd
Preparing data
Here, I prepared simple sentiment data for this tutorial. The data contains imaginary random opinions, where a positive opinion is labeled '1' and a negative opinion is labeled '0'. Below is a sample of the sentiment training data. You can find the full list of the sentiment data in this link and save it as a sentiments.csv file in your target folder.
1,"I like it "
1,"like it a lot "
1,"It's really good "
1,"Recommend! I really enjoyed! "
1,"It's really good "
1,"recommend too "
1,"outstanding performance "
...
0,"it's mediocre! not recommend "
0,"Not good at all! "
0,"It is rude "
0,"I don't like this type "
0,"poor performance "
0,"Boring, not good at all! "
0,"not liked "
0,"I hate this type of things "
...
We'll load the text data and split it into train and test parts.
df = pd.read_csv('datasets/sentiments.csv')
df.columns = ["label", "text"]
x = df['text'].values
y = df['label'].values

x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.1, random_state=123)
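To make the expected file layout concrete, here is a small standard-library sketch that parses a few rows in the `label,"text"` format shown above and holds out a test fraction with a seeded shuffle. The in-memory `SAMPLE`, the helper names `load_rows` and `split_rows`, and the assumption that the file has no header row are all illustrative. This is not how pandas or `train_test_split` work internally, so the resulting split will differ from theirs.

```python
import csv
import io
import random

# A few rows copied from the sample above (assumed: no header row).
SAMPLE = '''1,"I like it "
1,"It's really good "
0,"Not good at all! "
0,"poor performance "
1,"outstanding performance "
0,"It is rude "
'''

def load_rows(handle):
    """Return (texts, labels), converting labels to int and trimming the text."""
    texts, labels = [], []
    for label, text in csv.reader(handle):
        texts.append(text.strip())
        labels.append(int(label))
    return texts, labels

def split_rows(texts, labels, test_size=0.1, seed=123):
    """Shuffle the row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(texts)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(items, idx):
        return [items[i] for i in idx]

    return (pick(texts, train_idx), pick(texts, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

texts, labels = load_rows(io.StringIO(SAMPLE))
x_tr, x_te, y_tr, y_te = split_rows(texts, labels)
print(len(x_tr), len(x_te))
```

Converting the labels to integers up front keeps them numeric by the time they reach the model.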
Next, we'll convert the text data into sequences of integer tokens.
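tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain = tokenizer.texts_to_sequences(x_train)
xtest = tokenizer.texts_to_sequences(x_test)

# Vocabulary size for the Embedding layer (+1 because index 0 is reserved for padding)
vocab_size = len(tokenizer.word_index) + 1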
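To see roughly what the tokenizer step does, here is a simplified pure-Python sketch: it builds a frequency-ranked word index and maps a sentence to integer ids. The function names and the tiny corpus are made up for illustration, and the filtering rules only loosely imitate a typical tokenizer. It is not the Keras implementation.

```python
from collections import Counter
import string

# Punctuation to strip; the apostrophe is kept so "it's" stays one token.
# (Assumption: a loose imitation of common tokenizer defaults.)
FILTERS = string.punctuation.replace("'", "")

def to_words(text):
    """Lowercase, replace filtered punctuation with spaces, split on whitespace."""
    table = str.maketrans(FILTERS, " " * len(FILTERS))
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to a rank: 1 = most frequent. Index 0 is left for padding."""
    counts = Counter(word for text in texts for word in to_words(text))
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable for ties
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words=None):
    """Convert text to integer ids, dropping unknown or out-of-range words."""
    ids = [word_index[w] for w in to_words(text) if w in word_index]
    if num_words is not None:
        ids = [i for i in ids if i < num_words]
    return ids

corpus = ["I like it", "like it a lot", "Not good at all!", "It's really good"]
word_index = build_word_index(corpus)
print(word_index)
print(to_sequence("I like it a lot", word_index))
```

Because index 0 is never assigned to a word, the vocabulary size passed to an embedding layer is the number of indexed words plus one.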
We'll pad each sequence with zeros so that every vector has the same fixed length.
maxlen = 10
xtrain = pad_sequences(xtrain, padding='post', maxlen=maxlen)
xtest = pad_sequences(xtest, padding='post', maxlen=maxlen)
print(x_train[3])
Excellent, very satisfied
print(xtrain[3])
[23 45 24 0 0 0 0 0 0 0]
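The padding step can be sketched in a few lines of plain Python. The helper name `pad_post` is hypothetical. It appends zeros at the end and, for sequences longer than `maxlen`, drops items from the front, which imitates the default `truncating='pre'` behavior of `pad_sequences`.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad with `value` at the end; if too long, keep the last `maxlen` items."""
    start = max(0, len(sequence) - maxlen)   # truncate from the front
    trimmed = list(sequence[start:])
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([23, 45, 24], maxlen=10))        # short sequence: zeros appended
print(pad_post(list(range(1, 13)), maxlen=10))  # long sequence: first two items dropped
```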
Defining the LSTM model
We add an Embedding layer for the input data before adding the LSTM layers to the Keras sequential model. The model is defined as follows.
embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.LSTM(units=50, return_sequences=True))
model.add(layers.LSTM(units=10))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(8))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_24 (Embedding)     (None, 10, 50)            4450
_________________________________________________________________
lstm_40 (LSTM)               (None, 10, 50)            20200
_________________________________________________________________
lstm_41 (LSTM)               (None, 10)                2440
_________________________________________________________________
dropout_16 (Dropout)         (None, 10)                0
_________________________________________________________________
dense_65 (Dense)             (None, 8)                 88
_________________________________________________________________
dense_66 (Dense)             (None, 1)                 9
=================================================================
Total params: 27,187
Trainable params: 27,187
Non-trainable params: 0
_________________________________________________________________
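The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers. The vocabulary size of 89 is not stated in the post; it is inferred from the embedding count (4450 / 50), so treat it as an assumption.

```python
def embedding_params(vocab_size, output_dim):
    # One output_dim-sized vector per vocabulary index
    return vocab_size * output_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit
    return input_dim * units + units

vocab_size = 89  # inferred from 4450 / 50, not given in the post

counts = [
    embedding_params(vocab_size, 50),  # embedding
    lstm_params(50, 50),               # first LSTM
    lstm_params(50, 10),               # second LSTM
    dense_params(10, 8),               # Dense(8)
    dense_params(8, 1),                # Dense(1)
]
print(counts, sum(counts))
```

The Dropout layer has no weights, so it contributes nothing to the total.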
Next, we'll train the model and check the training accuracy.
model.fit(xtrain, y_train, epochs=20, batch_size=16, verbose=False)

loss, acc = model.evaluate(xtrain, y_train, verbose=False)
# round() works whether evaluate() returns a Python float or a NumPy scalar
print("Training Accuracy: ", round(acc, 2))
Training Accuracy: 1.0
loss, acc = model.evaluate(xtest, y_test, verbose=False)
print("Test Accuracy: ", round(acc, 2))
Test Accuracy: 1.0
Predicting test data
Finally, we can predict test data and check the prediction accuracy.
ypred = model.predict(xtest)

# Convert sigmoid probabilities into 0/1 class labels
ypred[ypred > 0.5] = 1
ypred[ypred <= 0.5] = 0

cm = confusion_matrix(y_test, ypred)
print(cm)
[[2 0]
 [0 4]]
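To show how the thresholding and the confusion matrix fit together, here is a small pure-Python sketch. The true labels are those of the six test samples in this post, but the probabilities are made-up values chosen for illustration, not the model's real outputs, and the helper names are hypothetical.

```python
def threshold(probabilities, cutoff=0.5):
    """Turn sigmoid outputs into 0/1 labels: 1 if above the cutoff, else 0."""
    return [1 if p > cutoff else 0 for p in probabilities]

def confusion(y_true, y_pred):
    """2x2 matrix: rows = true class (0, 1), columns = predicted class (0, 1)."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Hypothetical sigmoid outputs for six test samples
probs = [0.97, 0.91, 0.08, 0.88, 0.12, 0.93]
y_true = [1, 1, 0, 1, 0, 1]

y_pred = threshold(probs)
cm = confusion(y_true, y_pred)
accuracy = (cm[0][0] + cm[1][1]) / len(y_true)  # correct predictions on the diagonal
print(cm, accuracy)
```

The diagonal holds the correct predictions, so a matrix with zeros off the diagonal corresponds to an accuracy of 1.0.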
result = zip(x_test, y_test, ypred)
for i in result:
    print(i)
('I am excited a lot ', 1, array([1.], dtype=float32))
('exciting, liked. ', 1, array([1.], dtype=float32))
('terrible! I did not expect. ', 0, array([0.], dtype=float32))
('What a nice restaurant.', 1, array([1.], dtype=float32))
('not recommend, not satisfied ', 0, array([0.], dtype=float32))
('What a nice show.', 1, array([1.], dtype=float32))
In this post, we've briefly learned how to implement LSTM for binary classification of text data with Keras. The source code is listed below.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras import layers
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import pandas as pd

# Load the data and split it into train and test parts
df = pd.read_csv('datasets/sentiments.csv')
df.columns = ["label", "text"]
x = df['text'].values
y = df['label'].values

x_train, x_test, y_train, y_test = \
    train_test_split(x, y, test_size=0.1, random_state=123)

# Convert text into integer sequences
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(x)
xtrain = tokenizer.texts_to_sequences(x_train)
xtest = tokenizer.texts_to_sequences(x_test)

vocab_size = len(tokenizer.word_index) + 1

# Pad every sequence to the same length
maxlen = 10
xtrain = pad_sequences(xtrain, padding='post', maxlen=maxlen)
xtest = pad_sequences(xtest, padding='post', maxlen=maxlen)

print(x_train[3])
print(xtrain[3])

# Define the model
embedding_dim = 50

model = Sequential()
model.add(layers.Embedding(input_dim=vocab_size,
                           output_dim=embedding_dim,
                           input_length=maxlen))
model.add(layers.LSTM(units=50, return_sequences=True))
model.add(layers.LSTM(units=10))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(8))
model.add(layers.Dense(1, activation="sigmoid"))
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=['accuracy'])
model.summary()

# Train and evaluate
model.fit(xtrain, y_train, epochs=20, batch_size=16, verbose=False)

loss, acc = model.evaluate(xtrain, y_train, verbose=False)
print("Training Accuracy: ", round(acc, 2))

loss, acc = model.evaluate(xtest, y_test, verbose=False)
print("Test Accuracy: ", round(acc, 2))

# Predict test data
ypred = model.predict(xtest)
ypred[ypred > 0.5] = 1
ypred[ypred <= 0.5] = 0

cm = confusion_matrix(y_test, ypred)
print(cm)

result = zip(x_test, y_test, ypred)
for i in result:
    print(i)
please share dataset file
Thanks for sharing.
I have tried your code and am getting an error in this part (training the model):
model.fit(xtrain,y_train, epochs=20, batch_size=16, verbose=False)...
The error is :
UnimplementedError: Cast string to float is not supported.
Do you have any idea why this happened?
I'm really stuck on this.
Thank you very much
I'm getting the same error. Did you figure it out?
What about a Bidirectional LSTM?