DataTechNotes: Gradient Boosting Classification Example with Scikit-learn

Gradient Boosting is a powerful ensemble learning technique used for classification and regression tasks. Its effectiveness and flexibility make Gradient Boosting suitable for implementation in various domains in machine learning. In this tutorial, we'll learn about Gradient Boosting classification using Scikit-learn machine learning library in Python. The tutorial covers the following topics:

Introduction to Gradient Boosting
Preparing data
Defining the model and training
Hyperparameters
Making predictions and evaluating the model
Conclusion

Let's get started.

Introduction to Gradient Boosting

Gradient Boosting is an ensemble learning technique that constructs a sequence of weak learners, typically decision trees, in a sequential manner. Each learner in the sequence aims to correct the mistakes made by the previous one. It optimizes the loss function directly using gradient descent, making it highly effective for handling complex datasets and producing accurate predictions.

Gradient boosting includes the following components for train the model and predicting the data.

Initialization:
- Gradient Boosting starts with an initial prediction, often the mean value for regression tasks or the log odds for classification tasks.
Sequential Training:
- It sequentially trains a series of weak learners, usually decision trees, each attempting to correct the errors made by the combination of all previous learners.
Gradient Descent:
- Gradient Boosting optimizes the loss function directly using gradient descent. It minimizes the loss by adding weak learners that minimize the gradient of the loss function with respect to the ensemble's predictions.
Adding Weak Learners:
- At each iteration, a weak learner is trained on the residuals (the differences between the current predictions and the actual values). This weak learner is fitted to the negative gradient of the loss function with respect to the current predictions to reduce the residual errors.
Combining Predictions:
- The predictions from all weak learners are combined to obtain the final ensemble prediction. Each learner contributes a weighted prediction to the ensemble.
Regularization:
- To prevent overfitting, Gradient Boosting applies regularization techniques like tree depth limits, shrinkage (learning rate), and subsampling of training instances.
Loss Functions:
- Gradient Boosting can be applied to classification loss functions like binary cross-entropy or multinomial deviance.
Stopping Criteria:
- Gradient Boosting continues adding weak learners until a specified stopping criterion is met, such as reaching a maximum number of iterations.

Preparing data

We'll begin by loading the necessary libraries for this tutorial.

 
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
 

Next, we create a synthetic classification dataset generated using the make_classification function from scikit-learn. The dataset contains 1000 samples with 5 input features and 4 classes.

 
# Generating synthetic regression dataset
X, y = make_classification(n_samples=1000, n_features=5, n_classes=4,
                           n_clusters_per_class=1, n_informative=2, random_state=42)

print(X[:5])
print(y[:5])

The dataset looks as follows.

[[-0.43964281  0.54254734 -0.82241993  0.40136622 -0.85484   ]
 [ 2.82223116 -2.48085944 -1.14769139 -2.10113103  3.04027792]
 [ 1.61838572 -1.36947785 -2.08411294 -1.17965857  1.61360231]
 [ 1.65904812 -0.61520205  1.11268837 -0.83509772 -0.27220548]
 [ 1.84982402 -1.67945551 -0.92669831 -1.40250885  2.12312866]]
[0 3 3 1 3]

Then we split data into train and test parts. Here, we use 20 percent of data as test data.

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Defining the model and training

We initialize the Gradient Boosting classifier using the GradientBoostingClassifier class from scikit-learn, where we specify hyperparameters such as the number of trees, learning rate, and maximum depth.
We proceed to train the Gradient Boosting classifier on the training data by invoking the fit() method.

# Initialize and train the Gradient Boosting Classifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbc.fit(X_train, y_train)
 

Hyperparameters

Hyperparameters are parameters that are set before the learning process begins. They control the behavior of the learning algorithm and influence the performance of the model. By adjusting these hyperparameters, you can fine-tune the performance of the Gradient Boosting model to achieve better accuracy and generalization on unseen data. However, finding the optimal combination of hyperparameters often requires experimentation and tuning using techniques like grid search or random search.

n_estimators specifies the number of weak learners (decision trees in this case) that will be combined to form the final ensemble. Increasing the number of estimators may improve the model's performance, but it also increases the computational cost.
learning_rate determines the step size at which the gradient descent optimization procedure adjusts the weights of the weak learners. A lower learning rate makes the model more robust to overfitting but may require more iterations to converge.
max_depth specifies the maximum depth of each decision tree in the ensemble. It controls the complexity of the individual trees and helps prevent overfitting.

Making predictions and evaluating the model

Using the trained classifier, we proceed to make predictions on the testing data by calling the predict method.
Then, we calculate the accuracy of the model by comparing the predicted labels with the true labels from the testing set. To achieve this, we leverage the accuracy_score and classification_report functions from scikit-learn. These functions provide insightful metrics such as precision, recall, and f1-score, enabling a comprehensive evaluation of the classification performance.

 
# Make predictions on the test set
y_pred = gbc.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))
 

The result appears as follows:

  
Accuracy: 0.8433333333333334
Classification Report:
              precision    recall  f1-score   support

           0       0.84      0.88      0.86        67
           1       0.79      0.78      0.79        82
           2       0.91      0.87      0.89        79
           3       0.84      0.85      0.84        72

    accuracy                           0.84       300
   macro avg       0.84      0.85      0.84       300
weighted avg       0.84      0.84      0.84       300
 

Conclusion

Gradient Boosting is a powerful ensemble learning technique that can be used for classification and regression tasks. In this tutorial, we covered the basics of Gradient Boosting with Scikit-learn and classification example. The full source code is listed below.

Source code

 
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report


# Generating synthetic regression dataset
X, y = make_classification(n_samples=1000, n_features=5, n_classes=4,
                           n_clusters_per_class=1, n_informative=2, random_state=42)
print(X[:5])
print(y[:5])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Gradient Boosting Classifier
gbc = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbc.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gbc.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

# Get feature importances
feature_importances = gbc.feature_importances_
print("Feature Importances:", feature_importances)

DataTechNotes

Pages

Gradient Boosting Classification Example with Scikit-learn

No comments:

Post a Comment