DataTechNotes: Gradient Boosting Classification with GBM in R

Boosting is an ensemble learning technique in machine learning that is widely used for regression and classification problems. The main idea is to fit weak learners sequentially, each one correcting the errors of those before it, and to combine them into a single, more accurate model. Boosting algorithms include Gradient Boosting, AdaBoost (Adaptive Boosting), XGBoost, and others.
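To make the sequential idea concrete, here is a minimal base R sketch of gradient boosting for a toy regression problem with squared-error loss. It is not how the gbm package is implemented. The helper names fit_stump and predict_stump, the simulated data, and the shrinkage and tree-count values are all assumptions made for illustration.

```r
# Toy gradient boosting with one-split "stumps" as weak learners.
# fit_stump() and predict_stump() are hypothetical helpers, not gbm functions.
fit_stump <- function(x, r) {
  # Candidate thresholds: midpoints between consecutive sorted x values
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  best <- NULL
  best_sse <- Inf
  for (thr in cuts) {
    left <- mean(r[x <= thr])
    right <- mean(r[x > thr])
    sse <- sum((r - ifelse(x <= thr, left, right))^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(thr = thr, left = left, right = right)
    }
  }
  best
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$thr, stump$left, stump$right)
}

set.seed(1)
x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.1)

shrinkage <- 0.1                 # learning rate (assumed value)
n_trees <- 100                   # number of boosting rounds (assumed value)
f <- rep(mean(y), length(y))     # start from a constant prediction
mse <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  res <- y - f                   # negative gradient of squared-error loss
  stump <- fit_stump(x, res)     # weak learner fitted to current residuals
  f <- f + shrinkage * predict_stump(stump, x)
  mse[m] <- mean((y - f)^2)
}

print(c(first = mse[1], last = mse[n_trees]))
```

Each round fits a very weak model to what the current ensemble still gets wrong, and the shrinkage factor keeps every single step small. The training error printed at the end should be lower after the last round than after the first.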

In this post, we'll learn how to classify data with the gbm() function of the gbm (Generalized Boosted Regression Models) package. This package implements J. Friedman's gradient boosting machine and the AdaBoost algorithm. The tutorial covers:

Preparing the data
Classification with gbm
Classification with caret train method
Source code listing

We'll start by loading the required packages.

library(gbm)

library(caret)

Preparing the data

We'll use the Iris dataset as the classification data and split it into train and test parts. Here, we'll hold out 10 percent of the dataset as test data. The createDataPartition() function samples within each species, so the split keeps the class proportions.

indexes = createDataPartition(iris$Species, p = .90, list = FALSE)
train = iris[indexes, ]
test = iris[-indexes, ]
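Because the partition is random, the rows that end up in the test set change from run to run. The sketch below fixes the random seed and checks the class balance of both parts. The seed value 123 is an arbitrary assumption, not something the original post uses.

```r
library(caret)

set.seed(123)  # arbitrary seed, chosen only to make the split repeatable
indexes <- createDataPartition(iris$Species, p = 0.90, list = FALSE)
train <- iris[indexes, ]
test <- iris[-indexes, ]

# With 50 rows per species and p = 0.90, each species should
# contribute 45 rows to train and 5 rows to test.
print(table(train$Species))
print(table(test$Species))
```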

Classification with gbm

We'll define the gbm model and fit it on the train data. Here, we'll set a multinomial distribution, 10 cross-validation folds, 200 trees, a shrinkage (learning rate) of 0.01, and a minimum of 10 observations per terminal node.

mod_gbm = gbm(Species ~.,
              data = train,
              distribution = "multinomial",
              cv.folds = 10,
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 200)

print(mod_gbm)
gbm(formula = Species ~ ., distribution = "multinomial", data = train, 
    n.trees = 200, n.minobsinnode = 10, shrinkage = 0.01, cv.folds = 10)
A gradient boosted model with multinomial loss function.
200 iterations were performed.
The best cross-validation iteration was 200.
There were 4 predictors of which 3 had non-zero influence.
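The printed summary reports that the best cross-validation iteration equals the total number of trees, which suggests the cross-validated error was still falling when training stopped. A sketch of how one might inspect this with gbm.perf() and look at the relative influence of the predictors is shown below. It fits on the full iris data with 5 folds purely to stay self-contained; the seed and fold count are assumptions. Depending on the installed gbm version, a warning about the multinomial distribution may be printed.

```r
library(gbm)

set.seed(123)  # arbitrary seed for repeatability
# Illustration only: fitted on all of iris rather than the train split
mod <- gbm(Species ~ .,
           data = iris,
           distribution = "multinomial",
           cv.folds = 5,
           shrinkage = 0.01,
           n.minobsinnode = 10,
           n.trees = 200)

# Iteration with the lowest cross-validated error
best_iter <- gbm.perf(mod, method = "cv", plot.it = FALSE)
print(best_iter)

# Relative influence of each predictor at that iteration
infl <- summary(mod, n.trees = best_iter, plotit = FALSE)
print(infl)
```

If best_iter sits at or near n.trees, increasing n.trees or raising the shrinkage is a reasonable next experiment.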

The model is ready, and we'll predict test data.

pred = predict.gbm(object = mod_gbm,
                   newdata = test,
                   n.trees = 200,
                   type = "response")

The prediction is an array of class probabilities, which is hard to read directly, so for each row we'll take the class name with the highest probability.

labels = colnames(pred)[apply(pred, 1, which.max)]
result = data.frame(test$Species, labels)

print(result)
   test.Species     labels
1        setosa     setosa
2        setosa     setosa
3        setosa     setosa
4        setosa     setosa
5        setosa     setosa
6    versicolor versicolor
7    versicolor versicolor
8    versicolor versicolor
9    versicolor  virginica
10   versicolor versicolor
11    virginica versicolor
12    virginica  virginica
13    virginica  virginica
14    virginica  virginica
15    virginica  virginica
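The conversion from probabilities to labels can be tried without fitting a model. The sketch below builds a small hand-made array in the shape that predict.gbm() appears to return for a multinomial model (rows by classes by number of n.trees values). The probability values are made up for illustration, and the array shape is an assumption worth checking with dim(pred) on your own output.

```r
# Hand-made stand-in for the output of predict.gbm(); values are invented
probs <- array(
  c(0.90, 0.10, 0.20,    # setosa probabilities for rows 1 to 3
    0.07, 0.70, 0.35,    # versicolor probabilities
    0.03, 0.20, 0.45),   # virginica probabilities
  dim = c(3, 3, 1),
  dimnames = list(NULL, c("setosa", "versicolor", "virginica"), "200")
)

# Drop the third dimension to get a plain rows-by-classes matrix
prob_mat <- probs[, , 1]

# Column index of the largest probability in each row
labels <- colnames(prob_mat)[max.col(prob_mat, ties.method = "first")]

# Keep all class levels, even if a class is never predicted
labels <- factor(labels, levels = colnames(prob_mat))
print(labels)
```

Storing the labels as a factor with the full set of levels avoids problems later when a class happens to be absent from the predictions.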

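Finally, we'll check the confusion matrix. Note that confusionMatrix() takes the predicted classes as its first argument and the reference (true) classes as its second. The call below passes test$Species first, so the rows labeled Prediction hold the true classes and the columns labeled Reference hold the model's predictions.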

cm = confusionMatrix(test$Species, as.factor(labels))
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          4         1
  virginica       0          1         4

Overall Statistics
                                          
               Accuracy : 0.8667          
                 95% CI : (0.5954, 0.9834)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 3.143e-05       
                                          
                  Kappa : 0.8             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8000           0.8000
Specificity                 1.0000            0.9000           0.9000
Pos Pred Value              1.0000            0.8000           0.8000
Neg Pred Value              1.0000            0.9000           0.9000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2667           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.8500           0.8500
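The headline numbers in this report can be reproduced by hand with base R. The sketch below re-enters the 15 actual and predicted labels from the result table above and computes the accuracy and per-class sensitivity. The variable names are chosen for illustration.

```r
classes <- c("setosa", "versicolor", "virginica")

# Actual and predicted labels copied from the printed result table
actual <- factor(rep(classes, each = 5), levels = classes)
predicted <- factor(
  c(rep("setosa", 5),
    "versicolor", "versicolor", "versicolor", "virginica", "versicolor",
    "versicolor", "virginica", "virginica", "virginica", "virginica"),
  levels = classes
)

# Rows are predictions, columns are true classes
tab <- table(Predicted = predicted, Actual = actual)
print(tab)

# Accuracy: share of cases on the diagonal (13 of 15)
accuracy <- sum(diag(tab)) / sum(tab)
print(round(accuracy, 4))

# Sensitivity per class: correct predictions divided by true class size
sensitivity <- diag(tab) / colSums(tab)
print(sensitivity)
```

With caret, the equivalent call in the conventional argument order would be confusionMatrix(data = predicted, reference = actual).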

Classification with caret train method

In the second method, we use the caret package's train() function for model fitting. The train() function takes its resampling settings from a trainControl object, which we can define as below.

tc = trainControl(method = "repeatedcv", number = 10)
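The control object above names repeated cross-validation but does not say how many times to repeat it. As far as I can tell, caret then falls back to a single repeat, which behaves like plain 10-fold cross-validation. A sketch of setting the number of repeats explicitly is shown below; the value 3 is an assumption chosen for illustration.

```r
library(caret)

# Plain 10-fold cross-validation
tc_plain <- trainControl(method = "cv", number = 10)

# 10-fold cross-validation repeated 3 times (30 resamples in total)
tc_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

print(tc_repeated$method)
print(tc_repeated$number)
print(tc_repeated$repeats)
```

More repeats give a steadier performance estimate at the cost of proportionally longer training.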

Next, we'll define the model and train it with train data.

model = train(Species ~., data=train, method="gbm", trControl=tc)
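When no tuning grid is given, train() picks a small default set of candidate values for the gbm tuning parameters. A sketch of supplying an explicit grid through tuneGrid is shown below. The grid values, the seed, the 5-fold setting, and the use of the full iris data are assumptions made to keep the example short and self-contained.

```r
library(caret)
library(gbm)

set.seed(123)  # arbitrary seed for repeatability
tc <- trainControl(method = "cv", number = 5)

# The four tuning parameters caret exposes for method = "gbm"
grid <- expand.grid(
  n.trees = c(50, 100, 150),
  interaction.depth = c(1, 2),
  shrinkage = 0.1,
  n.minobsinnode = 10
)

# verbose = FALSE is passed through to gbm() to silence its progress log
model <- train(Species ~ ., data = iris,
               method = "gbm",
               trControl = tc,
               tuneGrid = grid,
               verbose = FALSE)

print(model$bestTune)   # parameter combination with the best resampled accuracy
print(model$results)    # one row per grid combination
```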

We can predict test data with the fitted model.

pred = predict(model, test)
result = data.frame(test$Species, pred)
print(result)
   test.Species       pred
1        setosa     setosa
2        setosa     setosa
3        setosa     setosa
4        setosa     setosa
5        setosa     setosa
6    versicolor versicolor
7    versicolor versicolor
8    versicolor versicolor
9    versicolor versicolor
10   versicolor versicolor
11    virginica  virginica
12    virginica versicolor
13    virginica  virginica
14    virginica  virginica
15    virginica  virginica

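Finally, we'll check the confusion matrix. As before, test$Species is passed as the first argument, so the rows are the true classes and the columns are the predictions.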

cm = confusionMatrix(test$Species, as.factor(pred))
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         0
  virginica       0          1         4

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.6805, 0.9983)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : 2.523e-05       
                                          
                  Kappa : 0.9             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9091
Pos Pred Value              1.0000            1.0000           0.8000
Neg Pred Value              1.0000            0.9000           1.0000
Prevalence                  0.3333            0.4000           0.2667
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9167           0.9545

In this tutorial, we've learned how to classify data with the gbm method in R, both directly and through caret's train() function. The full source code is listed below.

Source code listing


library(gbm)
library(caret)

indexes = createDataPartition(iris$Species, p = .90, list = FALSE)
train = iris[indexes, ]
test = iris[-indexes, ]

mod_gbm = gbm(Species ~.,
              data = train,
              distribution = "multinomial",
              cv.folds = 10,
              shrinkage = .01,
              n.minobsinnode = 10,
              n.trees = 200)
print(mod_gbm)

pred = predict.gbm(object = mod_gbm,
                    newdata = test,
                    n.trees = 200,
                    type = "response")

labels = colnames(pred)[apply(pred, 1, which.max)]
result = data.frame(test$Species, labels)
print(result)

cm = confusionMatrix(test$Species, as.factor(labels))
print(cm)

# caret train method
tc = trainControl(method = "repeatedcv", number = 10)
model = train(Species ~., data=train, method="gbm", trControl=tc)
print(model)

pred = predict(model, test)
result = data.frame(test$Species, pred)
print(result)

cm = confusionMatrix(test$Species, as.factor(pred))
print(cm)
