Boosting is an ensemble learning technique in machine learning that is widely used in regression and classification problems. The main idea is to fit weak learners sequentially, each one correcting the errors of the ones before it, and to combine them into a single model with higher accuracy. There are several boosting algorithms, such as gradient boosting, AdaBoost (Adaptive Boosting), XGBoost, and others.
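To make the idea of sequentially improving weak learners concrete, here is a minimal base-R sketch of gradient boosting for a regression target with squared-error loss. It is a toy illustration, not how the gbm package is implemented: the function names (fit_stump, predict_stump, boost), the one-split stumps, and the use of two iris columns as x and y are all assumptions made for this example.

```r
# Toy gradient boosting for regression with squared-error loss (base R only).
# All function names here are hypothetical and exist only for this sketch.

# Fit a one-split "stump" to the residuals r on a single numeric feature x.
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between distinct values
  best <- list(sse = Inf)
  for (s in cuts) {
    left <- x <= s
    pl <- mean(r[left])                        # prediction on the left side
    pr <- mean(r[!left])                       # prediction on the right side
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best$sse) best <- list(sse = sse, cut = s, left = pl, right = pr)
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

# Add stumps one at a time; each stump is fitted to the current residuals,
# which are the negative gradient of the squared-error loss.
boost <- function(x, y, n_trees = 50, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))              # start from a constant model
  stumps <- vector("list", n_trees)
  train_mse <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    residual <- y - pred
    stumps[[m]] <- fit_stump(x, residual)
    pred <- pred + shrinkage * predict_stump(stumps[[m]], x)
    train_mse[m] <- mean((y - pred)^2)
  }
  list(init = mean(y), stumps = stumps, shrinkage = shrinkage,
       train_mse = train_mse)
}

x <- iris$Petal.Length
y <- iris$Petal.Width
fit <- boost(x, y, n_trees = 50, shrinkage = 0.1)
baseline_mse <- mean((y - mean(y))^2)
cat("baseline MSE:", baseline_mse, "\n")
cat("MSE after 50 stumps:", fit$train_mse[50], "\n")
```

The training error can only go down or stay flat as stumps are added, which is the "boosting" effect described above. The shrinkage value plays the same role as the shrinkage argument used with gbm later in this post: smaller values take smaller steps and usually need more trees.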
In this post, we'll learn how to classify data with the gbm() function from the gbm (Generalized Boosted Regression Models) package. This package implements J. Friedman's gradient boosting machines and the AdaBoost algorithm. The tutorial covers:
- Preparing the data
- Classification with gbm
- Classification with caret train method
- Source code listing
We'll start by loading the required libraries.
library(gbm)
library(caret)
Preparing the data
We'll use the Iris dataset as the classification data and prepare it by splitting it into train and test parts. Here, we'll hold out 10 percent of the dataset as test data. Because the split is random, the exact rows and results will differ from run to run.
indexes = createDataPartition(iris$Species, p = .90, list = FALSE)
train = iris[indexes, ]
test = iris[-indexes, ]
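createDataPartition() samples within each class, so the class proportions are preserved in both parts. The following base-R sketch shows the same idea under stated assumptions: stratified_index is a hypothetical helper written for this example (it is not caret's code), and the seed is arbitrary and only makes the sketch reproducible.

```r
# Base-R sketch of a stratified 90/10 split (illustrative, not caret's code).
set.seed(123)  # assumed seed, used only so the sketch is reproducible

stratified_index <- function(y, p) {
  # Group row positions by class, then sample a proportion p within each class.
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

idx <- stratified_index(iris$Species, p = 0.90)
train_demo <- iris[idx, ]
test_demo  <- iris[-idx, ]

# Each species contributes the same share of rows to both parts.
print(table(train_demo$Species))
print(table(test_demo$Species))
```

With 50 rows per species in iris, this leaves 45 rows per species for training and 5 per species for testing, which matches the 15-row test set shown in the results below.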
Classification with gbm
We'll define the gbm model and fit it on the train data. Here, we'll set a multinomial distribution, 10 cross-validation folds, 200 trees, a shrinkage (learning rate) of 0.01, and a minimum of 10 observations per terminal node.
mod_gbm = gbm(Species ~.,
              data = train,
              distribution = "multinomial",
              cv.folds = 10,
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 200)
print(mod_gbm)
gbm(formula = Species ~ ., distribution = "multinomial", data = train,
n.trees = 200, n.minobsinnode = 10, shrinkage = 0.01, cv.folds = 10)
A gradient boosted model with multinomial loss function.
200 iterations were performed.
The best cross-validation iteration was 200.
There were 4 predictors of which 3 had non-zero influence.
Now that the model is ready, we can predict the test data.
pred = predict.gbm(object = mod_gbm,
                   newdata = test,
                   n.trees = 200,
                   type = "response")
The predicted result holds one probability per class for each row, which is not easy to read, so we'll take the class name with the highest predicted value for each row.
labels = colnames(pred)[apply(pred, 1, which.max)]
result = data.frame(test$Species, labels)
print(result)
test.Species labels
1 setosa setosa
2 setosa setosa
3 setosa setosa
4 setosa setosa
5 setosa setosa
6 versicolor versicolor
7 versicolor versicolor
8 versicolor versicolor
9 versicolor virginica
10 versicolor versicolor
11 virginica versicolor
12 virginica virginica
13 virginica virginica
14 virginica virginica
15 virginica virginica
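The label extraction step above can be tried on its own with hand-made numbers. The sketch below is only an illustration: the probabilities are made up, and the three-dimensional array is an assumption about the shape predict.gbm() typically returns for a multinomial model (rows, classes, and one slice per requested n.trees value).

```r
# Made-up class probabilities for three observations (not model output).
probs <- matrix(c(0.90, 0.07, 0.03,
                  0.10, 0.55, 0.35,
                  0.05, 0.30, 0.65),
                nrow = 3, byrow = TRUE,
                dimnames = list(NULL, c("setosa", "versicolor", "virginica")))

# which.max gives the column position of the largest value in each row;
# indexing colnames() with it turns positions into class names.
demo_labels <- colnames(probs)[apply(probs, 1, which.max)]
print(demo_labels)

# The same expression also works on a rows x classes x 1 array
# (assumed here to mimic the multinomial prediction shape).
probs_3d <- array(probs, dim = c(3, 3, 1),
                  dimnames = list(NULL, colnames(probs), "200"))
demo_labels_3d <- colnames(probs_3d)[apply(probs_3d, 1, which.max)]

# Setting the levels explicitly keeps all classes present in the factor,
# even if some class is never predicted.
demo_factor <- factor(demo_labels, levels = colnames(probs))
```

Fixing the factor levels this way is a defensive choice: as.factor() on the predicted labels alone would drop any class that happens not to be predicted, which can cause trouble when the predictions are compared with the actual species.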
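Finally, we'll check the confusion matrix. Note that confusionMatrix() treats its first argument as the predictions and its second as the reference. The call below passes the actual species first, so the rows labeled "Prediction" in the output hold the actual species and the "Reference" columns hold the predicted labels.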
cm = confusionMatrix(test$Species, as.factor(labels))
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          4         1
  virginica       0          1         4

Overall Statistics

               Accuracy : 0.8667
                 95% CI : (0.5954, 0.9834)
    No Information Rate : 0.3333
    P-Value [Acc > NIR] : 3.143e-05

                  Kappa : 0.8
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8000           0.8000
Specificity                 1.0000            0.9000           0.9000
Pos Pred Value              1.0000            0.8000           0.8000
Neg Pred Value              1.0000            0.9000           0.9000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2667           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.8500           0.8500
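Several of the numbers in this report can be reproduced by hand from the nine counts in the table. The base-R sketch below does that for accuracy, kappa, sensitivity, and specificity. The counts are copied from the output above; the variable names are chosen for this example, and the formulas are the standard textbook definitions, with columns taken as the reference as in caret's layout.

```r
# Counts copied from the confusion matrix above (rows x reference columns).
classes <- c("setosa", "versicolor", "virginica")
cm_counts <- matrix(c(5, 0, 0,
                      0, 4, 1,
                      0, 1, 4),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(Prediction = classes, Reference = classes))

n <- sum(cm_counts)

# Accuracy: share of observations on the diagonal (13 of 15).
accuracy <- sum(diag(cm_counts)) / n

# Cohen's kappa: accuracy corrected for the agreement expected by chance.
expected <- sum(rowSums(cm_counts) * colSums(cm_counts)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct hits divided by the reference column total.
sensitivity <- diag(cm_counts) / colSums(cm_counts)

# Specificity per class: true negatives divided by all reference negatives.
specificity <- sapply(seq_along(classes), function(k) {
  sum(cm_counts[-k, -k]) / (n - sum(cm_counts[, k]))
})
names(specificity) <- classes

print(round(c(accuracy = accuracy, kappa = kappa_hat), 4))
print(round(rbind(sensitivity, specificity), 4))
```

The results agree with the report: accuracy 0.8667, kappa 0.8, sensitivity 1.0, 0.8, 0.8 and specificity 1.0, 0.9, 0.9 for setosa, versicolor, and virginica.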
Classification with caret train method
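In the second method, we use the caret package's train() function for model fitting. Resampling is configured with a trainControl object, which is passed to train() through the trControl argument. Here, number = 10 requests 10 folds; because repeats is not set, caret uses its default number of repeats.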
tc = trainControl(method = "repeatedcv", number = 10)
Next, we'll define the model and train it with train data.
model = train(Species ~., data=train, method="gbm", trControl=tc)
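When no grid is supplied, train() chooses a small set of candidate settings on its own. For caret's "gbm" method the tuning parameters are n.trees, interaction.depth, shrinkage, and n.minobsinnode, so a custom grid can be built with base R's expand.grid(). The values below are arbitrary choices for illustration, and the commented-out train() call is a sketch of how the grid would be passed, assuming the train and tc objects defined above.

```r
# Candidate settings for caret's "gbm" method (values chosen for illustration).
gbm_grid <- expand.grid(n.trees = c(50, 100, 150),
                        interaction.depth = 1:3,
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = 10)

# One row per combination: 3 * 3 * 2 * 1 = 18 candidate models.
print(nrow(gbm_grid))
print(head(gbm_grid))

# Passing the grid to train() (requires caret and gbm, plus `train` and `tc`):
# model = train(Species ~ ., data = train, method = "gbm",
#               trControl = tc, tuneGrid = gbm_grid, verbose = FALSE)
```

Each row of the grid is evaluated with the resampling scheme in tc, so larger grids increase the fitting time proportionally.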
We can predict test data with the fitted model.
pred = predict(model, test)
result = data.frame(test$Species, pred)
print(result)
test.Species pred
1 setosa setosa
2 setosa setosa
3 setosa setosa
4 setosa setosa
5 setosa setosa
6 versicolor versicolor
7 versicolor versicolor
8 versicolor versicolor
9 versicolor versicolor
10 versicolor versicolor
11 virginica virginica
12 virginica versicolor
13 virginica virginica
14 virginica virginica
15 virginica virginica
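Finally, we'll check the confusion matrix. As in the first example, the actual species are passed first, so the "Prediction" rows in the output are the actual species and the "Reference" columns are the predicted ones.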
cm = confusionMatrix(test$Species, as.factor(pred))
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         0
  virginica       0          1         4

Overall Statistics

               Accuracy : 0.9333
                 95% CI : (0.6805, 0.9983)
    No Information Rate : 0.4
    P-Value [Acc > NIR] : 2.523e-05

                  Kappa : 0.9
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9091
Pos Pred Value              1.0000            1.0000           0.8000
Neg Pred Value              1.0000            0.9000           1.0000
Prevalence                  0.3333            0.4000           0.2667
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9167           0.9545
In this tutorial, we've learned how to classify data with the gbm() function and with caret's train() function in R. The full source code is listed below.
Source code listing
library(gbm)
library(caret)
indexes = createDataPartition(iris$Species, p = .90, list = FALSE)
train = iris[indexes, ]
test = iris[-indexes, ]
mod_gbm = gbm(Species ~.,
              data = train,
              distribution = "multinomial",
              cv.folds = 10,
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 200)
print(mod_gbm)
pred = predict.gbm(object = mod_gbm,
                   newdata = test,
                   n.trees = 200,
                   type = "response")
labels = colnames(pred)[apply(pred, 1, which.max)]
result = data.frame(test$Species, labels)
print(result)
cm = confusionMatrix(test$Species, as.factor(labels))
print(cm)
# caret train method
tc = trainControl(method = "repeatedcv", number = 10)
model = train(Species ~., data=train, method="gbm", trControl=tc)
print(model)
pred = predict(model, test)
result = data.frame(test$Species, pred)
print(result)
cm = confusionMatrix(test$Species, as.factor(pred))
print(cm)