AdaBoost (Adaptive Boosting) is a boosting algorithm in machine learning. The key idea of boosting is to improve weak learners and combine them into an aggregated model with higher accuracy. A weak learner is a classifier that performs poorly, only slightly better than random guessing. AdaBoost improves on such classifiers by iteratively increasing the weights of misclassified observations and combining the weighted votes of the learners into a final model.
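To make the reweighting idea concrete, here is a minimal sketch of one AdaBoost.M1 round. This is an illustration with made-up label vectors, not the adabag internals:
y    = c(1, 1, -1, -1, 1)             # true labels (hypothetical)
pred = c(1, -1, -1, -1, -1)           # a weak learner's predictions (hypothetical)
w    = rep(1 / length(y), length(y))  # start with uniform observation weights

err   = sum(w * (pred != y))          # weighted error of the weak learner
alpha = log((1 - err) / err)          # the learner's vote in the final model
w     = w * exp(alpha * (pred != y))  # increase weights of misclassified points
w     = w / sum(w)                    # renormalize
Misclassified observations now carry more weight, so the next weak learner focuses on them.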
In this post, we'll learn how to use the adabag package's boosting function to classify data in R. The tutorial covers:
- Preparing data
- Classification with boosting
- Classification with boosting.cv
- Source code listing
We'll start by loading the required libraries.
library(adabag)
library(caret)
In this tutorial, we'll use the Iris dataset as the target classification data. We'll split it into train and test parts, using 10 percent of the dataset as test data.
indexes=createDataPartition(iris$Species, p=.90, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]
Classification with boosting
We'll define the model with the boosting function and train it with the train data. The boosting function applies the AdaBoost.M1 and SAMME algorithms using classification trees as weak learners. If boos is TRUE, a bootstrap sample of the training set is drawn using the observation weights at each iteration; otherwise, every observation is used with its weight. mfinal is the number of iterations, that is, the number of trees in the ensemble.
model = boosting(Species~., data=train, boos=TRUE, mfinal=50)
We can check the components of the fitted model.
print(names(model))
[1] "formula" "trees" "weights" "votes" "prob" "class"
[7] "importance" "terms" "call"
print(model$trees[1])
[[1]]
n= 135
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 135 88 versicolor (0.3185185 0.3481481 0.3333333)
2) Petal.Length< 2.7 43 0 setosa (1.0000000 0.0000000 0.0000000) *
3) Petal.Length>=2.7 92 45 versicolor (0.0000000 0.5108696 0.4891304)
6) Petal.Width< 1.75 50 3 versicolor (0.0000000 0.9400000 0.0600000) *
7) Petal.Width>=1.75 42 0 virginica (0.0000000 0.0000000 1.0000000) *
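Since 'importance' is among the components, we can also inspect how much each feature contributes to the ensemble. The exact values vary from run to run; adabag also provides an importanceplot function to visualize them.
# relative contribution of each feature to the splits in the ensemble
print(model$importance)
importanceplot(model)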
The model is ready, so we can predict the test data. The prediction output also includes a confusion matrix and the error rate.
pred = predict(model, test)
print(pred$confusion)
Observed Class
Predicted Class setosa versicolor virginica
setosa 5 0 0
versicolor 0 5 0
virginica 0 0 5
print(pred$error)
[1] 0
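Since caret is already loaded, we can also cross-check the result with its confusionMatrix function, which adds overall accuracy and per-class statistics. A quick sketch; pred$class is returned as a character vector, so we convert it to a factor first:
# detailed accuracy statistics via caret
cm = confusionMatrix(factor(pred$class, levels = levels(test$Species)),
                     test$Species)
print(cm)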
We can also print the probability of each class for the test data.
result = data.frame(test$Species, pred$prob, pred$class)
print(result)
test.Species X1 X2 X3 pred.class
1 setosa 0.92897958 0.07102042 0.00000000 setosa
2 setosa 0.90999935 0.07693250 0.01306815 setosa
3 setosa 0.88902756 0.09790429 0.01306815 setosa
4 setosa 0.92897958 0.07102042 0.00000000 setosa
5 setosa 0.88902756 0.09790429 0.01306815 setosa
6 versicolor 0.01288461 0.91943143 0.06768396 versicolor
7 versicolor 0.01288461 0.84235917 0.14475622 versicolor
8 versicolor 0.03205498 0.95093238 0.01701263 versicolor
9 versicolor 0.03205498 0.95093238 0.01701263 versicolor
10 versicolor 0.03205498 0.95093238 0.01701263 versicolor
11 virginica 0.00000000 0.04468596 0.95531404 virginica
12 virginica 0.00000000 0.01577596 0.98422404 virginica
13 virginica 0.00000000 0.05561801 0.94438199 virginica
14 virginica 0.00000000 0.05561801 0.94438199 virginica
15 virginica 0.00000000 0.33446425 0.66553575 virginica
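The probability columns are printed as X1, X2, and X3; they follow the order of the factor levels, so we can label them for readability:
# label the probability columns with the class names
colnames(result) = c("observed", levels(test$Species), "predicted")
print(result)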
Classification with boosting.cv
The boosting.cv function provides v-fold cross-validation. The data is divided into v non-overlapping subsets; in each run, boosting is applied to v-1 of the subsets and the remaining subset is predicted, so every observation in the dataset receives an out-of-sample prediction. Here, v is the number of cross-validation subsets.
cvmodel = boosting.cv(Species~., data=iris, boos=TRUE, mfinal=10, v=5)
We'll check the accuracy.
print(cvmodel[-1])
$confusion
Observed Class
Predicted Class setosa versicolor virginica
setosa 50 0 0
versicolor 0 45 3
virginica 0 5 47
$error
[1] 0.05333333
You can compare the original and predicted classes.
data.frame(iris$Species, cvmodel$class)
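To see which observations the cross-validated model misclassified, we can filter for the mismatches:
# list the observations misclassified during cross-validation
print(iris[iris$Species != cvmodel$class, ])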
In this post, we've briefly learned how to classify data with the adabag package's boosting model in R. The full source code is listed below.
Source code listing
library(adabag)
library(caret)
indexes=createDataPartition(iris$Species, p=.90, list = F)
train = iris[indexes, ]
test = iris[-indexes, ]
model = boosting(Species~., data=train, boos=TRUE, mfinal=50)
print(names(model))
print(model$trees[1])
pred = predict(model, test)
print(pred$confusion)
print(pred$error)
result = data.frame(test$Species, pred$prob, pred$class)
print(result)
# cross-validation method
cvmodel = boosting.cv(Species~., data=iris, boos=TRUE, mfinal=10, v=5)
print(cvmodel[-1])
print(data.frame(iris$Species, cvmodel$class))