Bagging, short for Bootstrap Aggregating, is an ensemble machine learning technique used to improve the performance and accuracy of predictive models, particularly for high-variance algorithms like decision trees. It works by training multiple instances of the same model on different bootstrap samples of the training data and then combining their predictions, typically by majority vote for classification tasks.
In this tutorial, we will learn how to apply bagging to a classification task using the 'ipred' package, which builds its base learners with the Classification and Regression Trees (CART) algorithm.
Getting started
Make sure you have the "ipred" and "caret" packages installed, which provide essential tools for bagging in R. To install the packages, use the install.packages() function: open R or RStudio and run the commands below. We'll also be working with the classic "iris" dataset for the classification task.
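A minimal setup might look like the following; the install.packages() calls only need to be run once.

# Install the required packages (run once)
install.packages("ipred")
install.packages("caret")

# Load the packages and the built-in iris dataset
library(ipred)
library(caret)

data(iris)
str(iris)   # 150 observations: 4 numeric features plus the Species label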
Data preparation
The first step is to split our dataset into training and testing sets.
We'll allocate 90% of the data to training and the remaining 10% to
testing.
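One way to do this is with caret's createDataPartition() function, which produces a stratified split. The seed value below is an arbitrary choice for reproducibility; a different seed will give a slightly different split.

# Split the data: 90% for training, 10% for testing (stratified by Species)
set.seed(12)
indexes <- createDataPartition(iris$Species, p = 0.9, list = FALSE)
train <- iris[indexes, ]
test  <- iris[-indexes, ]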
Utilizing Bagging
Now we're ready to apply the bagging() function to create our classification model. In this example, we're using classification trees based on the CART (Classification and Regression Trees) algorithm. The coob argument enables calculation of the out-of-bag error estimate, and nbagg specifies the number of bootstrap replications.
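Using the train set from the previous step (the name model for the fitted object is our choice), the call looks like this:

# Fit a bagged classification tree model on 100 bootstrap samples;
# coob = TRUE requests the out-of-bag error estimate
model <- bagging(Species ~ ., data = train, coob = TRUE, nbagg = 100)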
Once we've built our bagging model, it's a good practice to check the model's out-of-bag estimate of misclassification error.
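Printing the fitted object shows the call and the out-of-bag error:

print(model)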
Call: bagging.data.frame(formula = Species ~ ., data = train, coob = TRUE,
nbagg = 100)
Out-of-bag estimate of misclassification error: 0.0593
Making Predictions and Confusion Matrix
With our model ready, we can predict on the test data and evaluate the results. The predict() function returns the predicted labels, which we can place side by side with the original species labels for comparison.
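Assuming the model and test objects from the previous steps:

# Predict species labels for the test set
predictions <- predict(model, newdata = test)

# Compare the original labels with the predicted values
data.frame(actual = test$Species, predicted = predictions)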
Now, let's generate a confusion matrix to see how well the model is performing. The confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives, along with summary statistics such as overall accuracy and per-class sensitivity and specificity.
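The caret package provides confusionMatrix() for exactly this; it takes the predicted labels and the reference labels:

# Evaluate the predictions against the true test labels
confusionMatrix(predictions, test$Species)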
            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         1
  virginica       0          0         4

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.6805, 0.9983)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 2.16e-06

                  Kappa : 0.9

 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           0.8000
Specificity                 1.0000            0.9000           1.0000
Pos Pred Value              1.0000            0.8333           1.0000
Neg Pred Value              1.0000            1.0000           0.9091
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.4000           0.2667
Balanced Accuracy           1.0000            0.9500           0.9000
Conclusion
In this blog post, we've delved into bagging for classification in R. By aggregating the results from multiple bootstrapped datasets, we can improve the accuracy and stability of our classification models. Bagging is a valuable tool in machine learning, especially when working with complex datasets and challenging classification tasks.
Source code listing
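Collecting the snippets above into one script (the variable names and the seed are the assumptions used throughout this tutorial):

# Bagging classification with ipred and caret
library(ipred)
library(caret)

# Load the data
data(iris)

# Train/test split: 90% / 10%, stratified by Species
set.seed(12)
indexes <- createDataPartition(iris$Species, p = 0.9, list = FALSE)
train <- iris[indexes, ]
test  <- iris[-indexes, ]

# Fit bagged classification trees with out-of-bag error estimation
model <- bagging(Species ~ ., data = train, coob = TRUE, nbagg = 100)
print(model)

# Predict on the test set and evaluate
predictions <- predict(model, newdata = test)
confusionMatrix(predictions, test$Species)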