Text classification is a fundamental task in natural language processing (NLP) that involves categorizing text into predefined categories or labels. In this blog post, we will explore how to perform text classification using the SpaCy library for text preprocessing and the Scikit-Learn library for building a machine learning classifier. The tutorial covers:
- Preparing data
- Feature extraction with TF-IDF
- Building a text classifier
- Evaluating the model and making predictions
- Conclusion
Let's get started.
We'll begin by loading the necessary libraries for this tutorial.
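A minimal set of imports for the steps that follow might look like this (the exact import list in the original post may differ):

```python
# spaCy for tokenization/preprocessing; scikit-learn for splitting,
# TF-IDF feature extraction, the classifier, and evaluation metrics.
import spacy
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
```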
Preparing data
We'll be working with a simple dataset that contains text samples categorized into three labels: "NLP", "Programming", and "Machine Learning". The dataset is designed to showcase the diversity of language used in these domains. You can also use your own dataset instead of this one.
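A toy dataset along these lines would work; the sentences below are illustrative stand-ins, not the post's original samples, and the variable names `texts` and `labels` are assumptions:

```python
# Nine short samples, three per label, covering the three domains.
texts = [
    "Tokenization splits text into words and subwords.",
    "Named entity recognition finds people and places in text.",
    "Word embeddings capture the meaning of words as vectors.",
    "Python functions are defined with the def keyword.",
    "A for loop iterates over the items of a sequence.",
    "Version control tools like Git track changes in code.",
    "Gradient descent minimizes a loss function step by step.",
    "Overfitting happens when a model memorizes the training data.",
    "Neural networks learn hierarchical feature representations.",
]
labels = [
    "NLP", "NLP", "NLP",
    "Programming", "Programming", "Programming",
    "Machine Learning", "Machine Learning", "Machine Learning",
]
```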
Next, we load the SpaCy language model for tokenization and text preprocessing. Each text sample is tokenized and converted to lowercase for consistency.
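A sketch of that preprocessing step, assuming the small English model `en_core_web_sm` (with a fallback to a blank pipeline in case the model hasn't been downloaded):

```python
import spacy

# Try the small English model; fall back to a blank English tokenizer
# so the sketch still runs without the separate model download.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

def preprocess(text):
    """Tokenize with spaCy and join the lowercased tokens."""
    return " ".join(token.text.lower() for token in nlp(text))

sample = "SpaCy Splits Sentences Into Tokens."
processed = preprocess(sample)
```

In a pipeline you would map `preprocess` over every sample, e.g. `processed_texts = [preprocess(t) for t in texts]`.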
The dataset is then split into training and testing sets using Scikit-Learn's train_test_split function. This ensures that the model is trained on one portion of the data and evaluated on another, unseen portion.
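The split might look like the following; the dataset, the 75/25 ratio, and `random_state=42` are assumptions for reproducibility, not necessarily the post's exact settings:

```python
from sklearn.model_selection import train_test_split

texts = [
    "Tokenization splits text into words.",
    "Embeddings capture word meaning.",
    "Parsing builds a sentence structure.",
    "Python functions use the def keyword.",
    "A for loop iterates over a sequence.",
    "Git tracks changes in source code.",
    "Gradient descent minimizes a loss.",
    "Overfitting memorizes training data.",
    "Neural networks learn representations.",
]
labels = [
    "NLP", "NLP", "NLP",
    "Programming", "Programming", "Programming",
    "Machine Learning", "Machine Learning", "Machine Learning",
]

# Hold out 25% for testing; stratify so each label appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)
```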
Feature extraction with TF-IDF
To represent the text data numerically, we use the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer. This converts each text sample into a vector of numerical features, taking into account the importance of each term in the entire dataset.
Building a text classifier
The Multinomial Naive Bayes (MNB) classifier is a probabilistic machine learning algorithm based on Bayes' theorem. It is particularly well-suited for text classification tasks, where the features (in this case, words or terms) are assumed to be multinomially distributed.
The algorithm estimates the probabilities of each label using the training data. For each label, it calculates the likelihood of observing the features given that label. The prior probability of each label is also estimated from the training data.
During prediction, the algorithm uses Bayes' theorem to calculate the probability of each label given the observed features and selects the label with the highest probability.
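The decision rule above can be worked through by hand on made-up numbers. The priors and per-term likelihoods below are purely illustrative (real values are estimated from training counts with smoothing):

```python
# posterior(label) ∝ prior(label) × Π P(term | label) over observed terms.
priors = {"NLP": 0.5, "Programming": 0.5}
likelihoods = {  # P(term | label), illustrative values only
    "NLP":         {"token": 0.20, "loop": 0.01},
    "Programming": {"token": 0.02, "loop": 0.25},
}
observed = ["token", "loop"]

scores = {
    label: priors[label]
           * likelihoods[label][observed[0]]
           * likelihoods[label][observed[1]]
    for label in priors
}
best = max(scores, key=scores.get)  # the label with the highest posterior
```

Here "loop" is much more likely under "Programming" than under "NLP", so "Programming" wins despite "token" pulling the other way.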
We use a Multinomial Naive Bayes classifier from Scikit-Learn to build the text classification model. The model is trained on the TF-IDF transformed training data.
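Training then comes down to two calls; the small training set below is an illustrative stand-in:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "tokenization and embeddings for text",
    "parsing sentences with a language model",
    "python loops and functions in code",
    "git tracks changes in source code",
    "gradient descent trains a neural network",
    "overfitting on the training data",
]
train_labels = [
    "NLP", "NLP",
    "Programming", "Programming",
    "Machine Learning", "Machine Learning",
]

# Fit TF-IDF on the training texts, then fit the classifier
# on the resulting sparse feature matrix.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_texts)
clf = MultinomialNB()
clf.fit(X_train_tfidf, train_labels)
```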
Evaluating the model and making predictions
The model's performance is evaluated on the testing set using accuracy and a detailed classification report. Here, we use the accuracy_score and classification_report functions from Scikit-Learn.
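The evaluation calls look like this; the `y_test`/`y_pred` lists are illustrative placeholders standing in for the real test labels and model predictions:

```python
from sklearn.metrics import accuracy_score, classification_report

y_test = ["NLP", "Programming", "Machine Learning"]
y_pred = ["NLP", "Programming", "NLP"]  # e.g. one misclassification

# Fraction of correct predictions.
accuracy = accuracy_score(y_test, y_pred)

# Per-label precision, recall, F1, and support.
report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(report)
```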
The result:
The trained model is then used to predict the labels for new sentences. We define two new sentences to classify with the trained model.
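A sketch of the prediction step, retraining a small model inline so the example is self-contained; the training sentences and the two new sentences are illustrative stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "tokenization and embeddings for text",
    "parsing sentences with a language model",
    "python loops and functions in code",
    "git tracks changes in source code",
    "gradient descent trains a neural network",
    "overfitting on the training data",
]
train_labels = [
    "NLP", "NLP",
    "Programming", "Programming",
    "Machine Learning", "Machine Learning",
]

vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# New, unseen sentences must go through the SAME fitted vectorizer
# (transform, not fit_transform) before prediction.
new_sentences = [
    "spaCy performs tokenization of text",
    "a python function in my code",
]
predictions = clf.predict(vectorizer.transform(new_sentences))
```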
The result is as follows.
Source code listing