Classification Example with BaggingClassifier in Python
Bagging, short for Bootstrap Aggregating, is an ensemble learning technique in machine learning that combines multiple models to improve predictive performance. It works by training multiple models independently on different subsets of the training data and then combining their predictions through averaging (for regression) or voting (for classification).
In this tutorial, we'll explore the basics of the bagging technique and how to implement classification using the Scikit-learn BaggingClassifier. The tutorial covers:
Introduction to bagging
Bagging with single estimator
Bagging with multiple estimators
Conclusion
Introduction to Bagging
Bagging, short for Bootstrap Aggregating, is a widely used technique in ensemble learning to improve the performance of machine learning models.
In bagging, multiple base learners (often of the same type) are trained independently on different subsets of the training data. These subsets are typically created by sampling the training data with replacement (bootstrap sampling), so each subset can contain some samples more than once and omit others. Each base learner then makes its predictions, and for classification tasks the final prediction is usually obtained by majority voting over the predictions of all base learners.
The main idea behind bagging is to reduce variance, and with it overfitting, by combining the predictions of multiple models trained on different subsets of the data. This often leads to better generalization performance compared to individual models. Random forest, for example, is a popular ensemble learning method that uses bagging with decision trees as base learners, together with random feature selection at each split.
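To make the mechanics concrete, here is a minimal hand-rolled sketch of bootstrap sampling followed by majority voting. It is only an illustration of the idea, not how BaggingClassifier is implemented internally; the number of models, the seeds, and all variable names are assumptions chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n_models = 10                   # illustrative ensemble size
n_train = len(X_train)

# Train each tree on its own bootstrap sample (rows drawn with replacement)
models = []
for _ in range(n_models):
    idx = rng.integers(0, n_train, size=n_train)
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    models.append(tree)

# Stack predictions: one row per model, one column per test sample
all_preds = np.array([m.predict(X_test) for m in models])

# Majority vote per test sample (Iris labels are the integers 0, 1, 2)
n_classes = len(np.unique(y))
bagged_pred = np.array(
    [np.bincount(col, minlength=n_classes).argmax() for col in all_preds.T]
)

manual_accuracy = np.mean(bagged_pred == y_test)
print("Manual bagging accuracy:", manual_accuracy)
```

Each tree sees a slightly different version of the training set, so their individual errors differ, and the vote tends to smooth them out.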
Bagging with single estimator
Now let's implement classification with the bagging method in Python. We'll begin by loading the necessary libraries for this tutorial.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
Then we load the Iris dataset and split it into training and test sets using the train_test_split function.
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
To define the base classifier, we use the DecisionTreeClassifier class, then initialize the bagging classifier with the base estimator and the number of estimators. We train the model on the training data using the fit() method.
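The steps just described could look like the following sketch. The choice of 10 estimators and the random_state values are assumptions made for illustration. The base estimator is passed positionally because its keyword name differs between Scikit-learn releases (base_estimator in older versions, estimator in newer ones).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load and split the data, as in the steps above
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the base classifier
base_classifier = DecisionTreeClassifier()

# Initialize the bagging classifier; n_estimators=10 is an illustrative value
bagging_classifier = BaggingClassifier(
    base_classifier, n_estimators=10, random_state=42
)

# Train on the training data
bagging_classifier.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = bagging_classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred, target_names=iris.target_names))
```

The fitted base models are available in bagging_classifier.estimators_, which can be useful for inspecting how individual trees differ.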
Bagging with multiple estimators
In this part of the tutorial, we use several different base estimators and compare their performance. To evaluate the models, we create a custom classification dataset and apply bagging classification with each estimator.
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
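def create_data(N):
    columns = ['a', 'b', 'c', 'target']
    data = np.random.randint(10, size=(N, 3))  # Random integers in [0, 10) for columns a, b, and c
    y = np.where(np.sum(data, axis=1) > 25, 'high',
                 np.where(np.sum(data, axis=1) < 12, 'low', 'normal'))  # Label each row by its sum
    # Assumed completion: assemble the features and labels into a DataFrame
    df = pd.DataFrame(data, columns=columns[:3])
    df['target'] = y
    return df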
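The comparison described in this section might be sketched as follows. This is an assumption-laden illustration rather than the original listing: make_labeled_frame is a hypothetical stand-in that mirrors the labeling rule above, and the dataset size, n_estimators, max_iter, and seeds are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def make_labeled_frame(n_rows, seed=0):
    """Hypothetical helper: three integer features labeled by their row sum."""
    rng = np.random.default_rng(seed)
    data = rng.integers(0, 10, size=(n_rows, 3))
    total = data.sum(axis=1)
    target = np.where(total > 25, "high",
                      np.where(total < 12, "low", "normal"))
    df = pd.DataFrame(data, columns=["a", "b", "c"])
    df["target"] = target
    return df


df = make_labeled_frame(1000)
X = df[["a", "b", "c"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Base estimators to wrap in a bagging ensemble
base_estimators = {
    "DecisionTree": DecisionTreeClassifier(),
    "GaussianNB": GaussianNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
}

# Fit one bagging ensemble per base estimator and record its test accuracy
results = {}
for name, estimator in base_estimators.items():
    model = BaggingClassifier(estimator, n_estimators=10, random_state=42)
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: accuracy = {results[name]:.3f}")
```

Note that the 'high' class is very rare under this labeling rule (only row sums of 26 or 27 qualify), so accuracy alone says little about how well that class is handled; classification_report gives a per-class view.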
Conclusion
In this tutorial, we learned about the bagging technique and how to classify data using the Scikit-learn BaggingClassifier class. We also used multiple base estimators for classifying data and evaluated their performance.