Bagging, short for Bootstrap Aggregating, is an ensemble machine learning technique used to improve the performance and accuracy of predictive models, particularly for high-variance algorithms like decision trees. It works by training multiple instances of the same model on different bootstrap samples of the training data and then combining their predictions, typically by majority vote for classification tasks.
In this tutorial, we will learn how to apply bagging to a classification task using the 'ipred' package, which builds its base learners with the Classification and Regression Trees (CART) algorithm.
Getting started
Make sure you have the "ipred" and "caret" packages installed, which provide essential tools for bagging in R. To install the packages, use the install.packages() function: open R or RStudio and run the commands below. We'll also be working with the classic "iris" dataset for the classification task.
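A minimal setup might look like the following; the install.packages() calls only need to be run once.

# Install the required packages (run once)
install.packages("ipred")
install.packages("caret")

# Load the packages and the built-in iris dataset
library(ipred)
library(caret)

data(iris)
str(iris)   # 150 observations: 4 numeric features plus the Species label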
Data preparation
The first step is to split our dataset into training and testing sets.
We'll allocate 90% of the data to training and the remaining 10% to
testing.
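One way to do this is with caret's createDataPartition() function, which produces a stratified split. The seed value below is an arbitrary choice for reproducibility; a different seed will give a slightly different split.

# Split the data: 90% for training, 10% for testing (stratified by Species)
set.seed(12)
indexes <- createDataPartition(iris$Species, p = 0.9, list = FALSE)
train <- iris[indexes, ]
test  <- iris[-indexes, ]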
Utilizing Bagging
Now we're ready to apply the bagging() function to create our classification model. In this example, we're using classification trees based on the CART (Classification and Regression Trees) algorithm. The coob argument enables calculation of the out-of-bag error estimate, and nbagg specifies the number of bootstrap replications.
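Using the train set from the previous step (the name model for the fitted object is our choice), the call looks like this:

# Fit a bagged classification tree model on 100 bootstrap samples;
# coob = TRUE requests the out-of-bag error estimate
model <- bagging(Species ~ ., data = train, coob = TRUE, nbagg = 100)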
Once we've built our bagging model, it's a good practice to check the model's out-of-bag estimate of misclassification error.
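Printing the fitted object shows the call and the out-of-bag error:

print(model)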
Call: bagging.data.frame(formula = Species ~ ., data = train, coob = TRUE,
nbagg = 100)
Out-of-bag estimate of misclassification error: 0.0593
Making Predictions and Confusion Matrix
With our model ready, we can predict on the test data and evaluate the results. The predict() function returns the predicted labels, which we can place side by side with the original species labels for comparison.
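Assuming the model and test objects from the previous steps:

# Predict species labels for the test set
predictions <- predict(model, newdata = test)

# Compare the original labels with the predicted values
data.frame(actual = test$Species, predicted = predictions)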
Now, let's generate a confusion matrix to see how well the model is performing. The confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives, along with summary statistics such as overall accuracy and per-class sensitivity and specificity.
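The caret package provides confusionMatrix() for exactly this; it takes the predicted labels and the reference labels:

# Evaluate the predictions against the true test labels
confusionMatrix(predictions, test$Species)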
            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         1
  virginica       0          0         4

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.6805, 0.9983)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 2.16e-06

                  Kappa : 0.9

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.8000
Specificity                 1.0000            0.9000           1.0000
Pos Pred Value              1.0000            0.8333           1.0000
Neg Pred Value              1.0000            1.0000           0.9091
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.4000           0.2667
Balanced Accuracy           1.0000            0.9500           0.9000
Conclusion
In this blog post, we've delved into bagging for classification in R. By aggregating the results from multiple bootstrapped datasets, we can improve the accuracy and stability of our classification models. Bagging is a valuable tool in machine learning, especially when working with complex datasets and challenging classification tasks.
Source code listing
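Collecting the snippets above into one script (the variable names and the seed are the assumptions used throughout this tutorial):

# Bagging classification with ipred and caret
library(ipred)
library(caret)

# Load the data
data(iris)

# Train/test split: 90% / 10%, stratified by Species
set.seed(12)
indexes <- createDataPartition(iris$Species, p = 0.9, list = FALSE)
train <- iris[indexes, ]
test  <- iris[-indexes, ]

# Fit bagged classification trees with out-of-bag error estimation
model <- bagging(Species ~ ., data = train, coob = TRUE, nbagg = 100)
print(model)

# Predict on the test set and evaluate
predictions <- predict(model, newdata = test)
confusionMatrix(predictions, test$Species)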