In this tutorial, we will learn how to fine-tune a pre-trained large language model (LLM) for a text classification task using the Hugging Face Transformers library. We will use DistilBERT, a smaller and faster version of BERT, and fine-tune it on the IMDb movie review dataset for sentiment analysis (positive or negative). The tutorial covers:
- Introduction to fine-tuning LLMs
- Loading and preparing a dataset
- Data tokenization
- Fine-tuning the model
- Prediction and model evaluation
- Execution
- Conclusion
- Full code listing
Introduction to fine-tuning LLMs
Large Language Models (LLMs) like BERT, GPT, and DistilBERT are pre-trained on massive amounts of text data, learning general language patterns and representations. However, to make them perform well on specific tasks like sentiment analysis, text classification, or question answering, we need to fine-tune them with relevant data.
What is fine-tuning?
Fine-tuning is the process of taking a
pre-trained LLM and adapting it to a specific task by training it
further on a smaller, task-specific dataset. This allows the model to
learn task-specific patterns while retaining the general language
knowledge it gained during pre-training.
Why fine-tune LLMs?
- Customization: Pre-trained LLMs are general-purpose, but fine-tuning tailors them to specific tasks like sentiment analysis, spam detection, or named entity recognition.
- Efficiency: Fine-tuning requires less data and fewer computational resources than training a model from scratch.
- Better results: A fine-tuned model usually performs better on a specific task than a general pre-trained model.
Loading and preparing a dataset
We run this code on Google Colab, which provides a GPU. Using a GPU is highly recommended when fine-tuning LLMs because it substantially reduces training time.
Before starting, make sure you have the following Python libraries installed. You can install them using pip.
pip install transformers datasets torch scikit-learn
We start by importing the necessary libraries.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import torch
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import numpy as np
Next, we specify the model name, the directory for saving the model, and the number of samples for training, evaluation, and testing. Since the IMDb dataset contains 50,000 movie reviews, we use only a small subset to keep the process simple and speed up training.
# Constants
MODEL_NAME = "distilbert-base-uncased"
OUTPUT_DIR = "./fine_tuned_model"
TRAIN_SAMPLES = 1000
EVAL_SAMPLES = 200
TEST_SAMPLES = 100
SEED = 42
We define a function to load train, evaluation, and test data.
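# Load IMDb dataset
def load_imdb_dataset():
    dataset = load_dataset("imdb")
    train_dataset = dataset["train"].shuffle(seed=SEED).select(range(TRAIN_SAMPLES))
    # Note: eval and test are both drawn from the "test" split with the same seed,
    # so the 100 test samples are also the first 100 evaluation samples.
    # For a strictly held-out test set, select a disjoint range instead, e.g.
    # range(EVAL_SAMPLES, EVAL_SAMPLES + TEST_SAMPLES).
    eval_dataset = dataset["test"].shuffle(seed=SEED).select(range(EVAL_SAMPLES))
    test_dataset = dataset["test"].shuffle(seed=SEED).select(range(TEST_SAMPLES))
    return train_dataset, eval_dataset, test_dataset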
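Because the evaluation and test subsets are both taken from the start of the same shuffled split, they share examples. The following is a minimal sketch of that effect using plain Python index lists as a stand-in for the dataset. The helper shuffled_indices is hypothetical, and random.Random does not reproduce the exact permutation produced by the datasets library; only the selection logic is illustrated. The size 25,000 is the number of reviews in IMDb's test split.

```python
import random

SEED = 42
EVAL_SAMPLES = 200
TEST_SAMPLES = 100
SPLIT_SIZE = 25_000  # number of reviews in the IMDb "test" split


def shuffled_indices(n, seed):
    """Hypothetical stand-in for dataset.shuffle(seed=...): a seeded permutation of 0..n-1."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


indices = shuffled_indices(SPLIT_SIZE, SEED)

# Same seed + ranges that both start at 0: the test set is a subset of the eval set.
eval_idx = indices[:EVAL_SAMPLES]
test_idx_overlapping = indices[:TEST_SAMPLES]

# Disjoint alternative: take the test samples right after the eval samples.
test_idx_disjoint = indices[EVAL_SAMPLES:EVAL_SAMPLES + TEST_SAMPLES]

shared_overlapping = len(set(eval_idx) & set(test_idx_overlapping))
shared_disjoint = len(set(eval_idx) & set(test_idx_disjoint))

print("shared examples (same range start):", shared_overlapping)
print("shared examples (disjoint ranges):", shared_disjoint)
```

With the datasets library, the disjoint variant would correspond to passing range(EVAL_SAMPLES, EVAL_SAMPLES + TEST_SAMPLES) to select() for the test subset.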
Data tokenization
In this step, we prepare the text data for the model by converting it into a format the model can understand. This process is called tokenization. It splits text into tokens and maps each token to a numerical ID, which is the input the model is trained on.
We define a function that applies a tokenizer to a given dataset.
# Tokenize dataset
def tokenize_dataset(dataset, tokenizer):
    return dataset.map(
        lambda examples: tokenizer(examples["text"], padding="max_length", truncation=True),
        batched=True,
        num_proc=4,
    ).with_format("torch")
- dataset.map(): Applies the tokenizer to each text example in the dataset.
- lambda function: Uses the tokenizer to convert "text" into tokenized format, ensuring all sequences have the same length (padding="max_length") and are truncated if too long (truncation=True).
- batched=True: Processes multiple samples at once for efficiency.
- num_proc=4: Uses four parallel processes to speed up tokenization.
- .with_format("torch"): Converts the dataset into a PyTorch-compatible format.
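To make padding and truncation concrete, here is a toy sketch that mimics what the tokenizer options do. It is not DistilBERT's WordPiece tokenizer: the whitespace splitting, the tiny vocabulary, and the helpers build_vocab and encode are all assumptions made for illustration. It only shows how every sequence ends up with the same length and how an attention mask marks the real tokens.

```python
PAD_ID = 0  # ID used to fill short sequences
UNK_ID = 1  # ID used for words missing from the vocabulary


def build_vocab(texts):
    """Assign an integer ID to every lowercase whitespace-separated word."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Convert text to fixed-length IDs, like padding='max_length' with truncation=True."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]                # truncation: drop tokens beyond max_length
    mask = [1] * len(ids)                 # 1 marks a real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * n_pad,   # padding: fill up to max_length
        "attention_mask": mask + [0] * n_pad,  # 0 marks padding the model should ignore
    }


reviews = ["A great movie", "Terrible plot and even worse acting overall"]
vocab = build_vocab(reviews)
encoded = [encode(review, vocab, max_length=5) for review in reviews]

for review, item in zip(reviews, encoded):
    print(review, "->", item)
```

The first review is shorter than max_length and gets padded; the second is longer and gets cut off after five tokens.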
Fine-tuning the model
In this section, we fine-tune the DistilBERT model on the IMDb dataset to classify movie reviews as positive or negative using the following steps:
- We use DistilBertForSequenceClassification, which adds a classification head to DistilBERT; we configure it with two labels (positive/negative).
- We define key training settings like learning rate, batch size, and number of epochs using TrainingArguments.
- The Trainer class automates the training process, handling tasks like forward and backward passes as well as optimization.
- The model learns from the tokenized IMDb dataset to classify reviews accurately.
- Once training is complete, we save the fine-tuned model and tokenizer for later use.
# Fine-tune the model
def fine_tune_model(train_dataset, eval_dataset, output_dir):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenized_train = tokenize_dataset(train_dataset, tokenizer)
    tokenized_eval = tokenize_dataset(eval_dataset, tokenizer)

    model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    training_args = TrainingArguments(
        output_dir=output_dir,           # Directory to save model checkpoints
        evaluation_strategy="epoch",     # Evaluate after each epoch (renamed eval_strategy in recent transformers releases)
        learning_rate=2e-5,              # Learning rate
        per_device_train_batch_size=32,  # Batch size for training (larger for GPU)
        per_device_eval_batch_size=32,   # Batch size for evaluation
        num_train_epochs=10,             # Number of epochs
        fp16=torch.cuda.is_available(),  # Mixed precision (FP16) only when a GPU is available
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
    )

    trainer.train()
    eval_results = trainer.evaluate()
    print(f"Evaluation results: {eval_results}")

    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Model saved to {output_dir}")
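The training settings above determine how many optimizer steps the run performs. The sketch below works through that arithmetic under stated assumptions: a single device, no gradient accumulation, and a partial last batch that is kept. The linear_lr helper is hypothetical and assumes linear decay to zero without warmup, which is our understanding of the Trainer's default schedule rather than something configured in this tutorial.

```python
import math

TRAIN_SAMPLES = 1000
BATCH_SIZE = 32       # per_device_train_batch_size, assuming one device
EPOCHS = 10           # num_train_epochs
LEARNING_RATE = 2e-5  # learning_rate

# The last batch of each epoch is partial (1000 is not a multiple of 32), hence ceil.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS


def linear_lr(step, total, base_lr=LEARNING_RATE):
    """Learning rate at a given step, assuming linear decay to zero and no warmup."""
    return base_lr * max(0.0, (total - step) / total)


print("steps per epoch:", steps_per_epoch)
print("total steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(f"step {step}: lr = {linear_lr(step, total_steps):.2e}")
```

The total of 320 steps agrees with the 320/320 progress counter shown in the Execution section.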
To use the fine-tuned model and tokenizer saved during training, we create a load_model() function. This function loads the DistilBERT model and its tokenizer from the saved directory. Once loaded, the model is ready for inference, meaning we can use it to make predictions on new data. This way, we don’t have to retrain the model, saving both time and resources.
# Load the fine-tuned model
def load_model(model_dir):
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    print("Model and tokenizer loaded successfully!")
    return model, tokenizer
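At inference time, a sequence classification model returns one raw score (logit) per label for each review. As a rough sketch of how such scores become a sentiment label with a confidence value, the example below applies a softmax in plain Python. The logit values are made up, and the ID2LABEL mapping and helper names are assumptions for illustration; the label order (0 = Negative, 1 = Positive) follows the order used in the classification report later in this tutorial.

```python
import math

ID2LABEL = {0: "Negative", 1: "Positive"}  # assumed label order for IMDb


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (numerically stable)."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def logits_to_prediction(logits):
    """Return the most likely label and its probability."""
    probs = softmax(logits)
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[label_id], probs[label_id]


example_logits = [-1.2, 2.3]  # made-up scores for a single review
label, confidence = logits_to_prediction(example_logits)
print(f"{label} (confidence {confidence:.3f})")
```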
Prediction and model evaluation
In this step, we use the fine-tuned model to make predictions on the test dataset and evaluate its performance. The evaluate_model() function takes the model, tokenizer, and test data and generates sentiment predictions, classifying each review as either positive or negative.
To measure accuracy, we compare the model’s predicted labels with the actual labels from the test dataset. We then calculate the accuracy and generate a classification report that includes precision, recall, and F1-score. This shows how well the model performs on data it was not trained on. The evaluate_model() function is defined as follows.
# Evaluate the model
def evaluate_model(model, tokenizer, test_dataset):
    tokenized_test = tokenize_dataset(test_dataset, tokenizer)
    trainer = Trainer(model=model)
    predictions = trainer.predict(tokenized_test)
    # Pick the label with the highest logit for each review
    predicted_labels = np.argmax(predictions.predictions, axis=-1)
    true_labels = np.array([example["label"] for example in test_dataset])
    accuracy = accuracy_score(true_labels, predicted_labels)
    cr = classification_report(true_labels, predicted_labels, target_names=["Negative", "Positive"])
    print(f"Classification Accuracy: {accuracy:.4f}")
    print("Classification Report:\n", cr)
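The figures in a classification report all derive from four counts: true negatives, false positives, false negatives, and true positives. The sketch below computes accuracy, precision, recall, and F1 by hand. The counts are not printed by the original run; they are inferred from the rounded values and supports (53 negative, 47 positive) in the report shown in the Execution section, so treat them as an assumption. The helper precision_recall_f1 is hypothetical.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Per-class metrics from the counts that concern that class."""
    precision = true_pos / (true_pos + false_pos)  # how many predicted as the class were right
    recall = true_pos / (true_pos + false_neg)     # how many of the class were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Inferred counts (assumption): rows are actual labels, columns are predicted labels.
tn, fp = 43, 10  # actual Negative: predicted Negative / predicted Positive
fn, tp = 7, 40   # actual Positive: predicted Negative / predicted Positive

accuracy = (tn + tp) / (tn + fp + fn + tp)

# For the Positive class the counts are used as named.
positive = precision_recall_f1(tp, fp, fn)
# For the Negative class the roles swap: a "hit" is a true negative.
negative = precision_recall_f1(tn, fn, fp)

print(f"accuracy: {accuracy:.2f}")
print("Negative: precision={:.2f} recall={:.2f} f1={:.2f}".format(*negative))
print("Positive: precision={:.2f} recall={:.2f} f1={:.2f}".format(*positive))
```

In practice, sklearn's confusion_matrix (already imported at the top of the script) would give these four counts directly from the true and predicted labels.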
Execution
Finally, we execute the above functions step by step to start fine-tuning the model. Once you have your trained model, you don’t need to run the fine_tune_model() function again.
train_dataset, eval_dataset, test_dataset = load_imdb_dataset()
fine_tune_model(train_dataset, eval_dataset, OUTPUT_DIR) # Comment out after training the model
model, tokenizer = load_model(OUTPUT_DIR)
evaluate_model(model, tokenizer, test_dataset)
The result is as follows:
[320/320 02:09, Epoch 10/10]
Epoch | Training Loss | Validation Loss
1 | No log | 0.479953
2 | No log | 0.335293
3 | No log | 0.562791
4 | No log | 0.379063
5 | No log | 0.448461
6 | No log | 0.469026
7 | No log | 0.478982
8 | No log | 0.499201
9 | No log | 0.518111
10 | No log | 0.520344
Evaluation results: {'eval_loss': 0.5203442573547363, 'eval_runtime': 0.6816,
'eval_samples_per_second': 293.412, 'eval_steps_per_second': 10.269, 'epoch': 10.0}
Model saved to ./fine_tuned_model
Model and tokenizer loaded successfully!
Classification Accuracy: 0.8300
Classification Report:
              precision    recall  f1-score   support

    Negative       0.86      0.81      0.83        53
    Positive       0.80      0.85      0.82        47

    accuracy                           0.83       100
   macro avg       0.83      0.83      0.83       100
weighted avg       0.83      0.83      0.83       100
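The validation losses in the training log are worth a closer look. The short sketch below takes the ten values from the log and finds the epoch with the lowest loss and the later epochs in which the loss rose again. The variable names are ours; the numbers are copied from the output above.

```python
# Validation loss per epoch, copied from the training log above (epochs 1 to 10).
val_losses = [0.479953, 0.335293, 0.562791, 0.379063, 0.448461,
              0.469026, 0.478982, 0.499201, 0.518111, 0.520344]

# Epoch numbers are 1-based, list indices are 0-based.
best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__) + 1
best_loss = val_losses[best_epoch - 1]
final_loss = val_losses[-1]

# Epochs after the best one in which the loss was higher than in the previous epoch.
rising_epochs = [
    index + 1
    for index in range(best_epoch, len(val_losses))
    if val_losses[index] > val_losses[index - 1]
]

print(f"best epoch: {best_epoch} (loss {best_loss})")
print(f"final loss: {final_loss}")
print("epochs with rising loss after the best one:", rising_epochs)
```

The lowest validation loss is reached early and the loss mostly climbs afterwards, which suggests the model starts to overfit the 1,000 training samples; training for fewer epochs, or using Trainer options such as load_best_model_at_end or EarlyStoppingCallback, may be worth trying. The "No log" entries in the Training Loss column are likely because the run has only 320 steps, fewer than the Trainer's default logging interval of 500 steps.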
Conclusion
In this tutorial, we explored how to load and preprocess a dataset for text classification, fine-tune a pre-trained LLM (DistilBERT) using Hugging Face's Transformers library, and evaluate the fine-tuned model on a test dataset. You can adapt this code to fine-tune other models or datasets for various NLP tasks. The full source code is listed below.
Full code listing
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DistilBertForSequenceClassification,
    TrainingArguments,
    Trainer,
)
import torch
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import numpy as np
# Constants
MODEL_NAME = "distilbert-base-uncased"
OUTPUT_DIR = "./fine_tuned_model"
TRAIN_SAMPLES = 1000
EVAL_SAMPLES = 200
TEST_SAMPLES = 100
SEED = 42
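# Load IMDb dataset
def load_imdb_dataset():
    dataset = load_dataset("imdb")
    train_dataset = dataset["train"].shuffle(seed=SEED).select(range(TRAIN_SAMPLES))
    # Note: eval and test are both drawn from the "test" split with the same seed,
    # so the 100 test samples are also the first 100 evaluation samples.
    # For a strictly held-out test set, select a disjoint range instead, e.g.
    # range(EVAL_SAMPLES, EVAL_SAMPLES + TEST_SAMPLES).
    eval_dataset = dataset["test"].shuffle(seed=SEED).select(range(EVAL_SAMPLES))
    test_dataset = dataset["test"].shuffle(seed=SEED).select(range(TEST_SAMPLES))
    return train_dataset, eval_dataset, test_dataset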
# Tokenize dataset
def tokenize_dataset(dataset, tokenizer):
    return dataset.map(
        lambda examples: tokenizer(examples["text"], padding="max_length", truncation=True),
        batched=True,
        num_proc=4,
    ).with_format("torch")
# Fine-tune the model
def fine_tune_model(train_dataset, eval_dataset, output_dir):
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenized_train = tokenize_dataset(train_dataset, tokenizer)
    tokenized_eval = tokenize_dataset(eval_dataset, tokenizer)

    model = DistilBertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)

    training_args = TrainingArguments(
        output_dir=output_dir,           # Directory to save model checkpoints
        evaluation_strategy="epoch",     # Evaluate after each epoch (renamed eval_strategy in recent transformers releases)
        learning_rate=2e-5,              # Learning rate
        per_device_train_batch_size=32,  # Batch size for training (larger for GPU)
        per_device_eval_batch_size=32,   # Batch size for evaluation
        num_train_epochs=10,             # Number of epochs
        fp16=torch.cuda.is_available(),  # Mixed precision (FP16) only when a GPU is available
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_train,
        eval_dataset=tokenized_eval,
    )

    trainer.train()
    eval_results = trainer.evaluate()
    print(f"Evaluation results: {eval_results}")

    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"Model saved to {output_dir}")
# Load the fine-tuned model
def load_model(model_dir):
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    print("Model and tokenizer loaded successfully!")
    return model, tokenizer
# Evaluate the model
def evaluate_model(model, tokenizer, test_dataset):
    tokenized_test = tokenize_dataset(test_dataset, tokenizer)
    trainer = Trainer(model=model)
    predictions = trainer.predict(tokenized_test)
    # Pick the label with the highest logit for each review
    predicted_labels = np.argmax(predictions.predictions, axis=-1)
    true_labels = np.array([example["label"] for example in test_dataset])
    accuracy = accuracy_score(true_labels, predicted_labels)
    cr = classification_report(true_labels, predicted_labels, target_names=["Negative", "Positive"])
    print(f"Classification Accuracy: {accuracy:.4f}")
    print("Classification Report:\n", cr)
# execution
train_dataset, eval_dataset, test_dataset = load_imdb_dataset()
fine_tune_model(train_dataset, eval_dataset, OUTPUT_DIR) # Comment out after training the model
model, tokenizer = load_model(OUTPUT_DIR)
evaluate_model(model, tokenizer, test_dataset)