Classification with XGBoost Model in R

    Extreme Gradient Boosting (XGBoost) is a gradient boosting algorithm in machine learning. XGBoost applies regularization techniques to reduce overfitting. Its advantages over classical gradient boosting are its fast execution speed and its strong predictive performance on both classification and regression problems.

    In this tutorial, we'll briefly learn how to classify data with XGBoost by using the xgboost package in R. The tutorial covers:

  1. Preparing data
  2. Defining the model
  3. Predicting test data
We'll start by loading the required packages.

library(xgboost)
library(caret)
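
If the packages are not installed yet, they can be installed from CRAN first:

install.packages(c("xgboost", "caret"))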


Preparing data

    In this tutorial, we'll use the Iris dataset as the target classification data. First, we'll split the dataset into train and test parts; here, ten percent of the dataset is held out as test data.

indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]
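
Note that createDataPartition() samples rows randomly, so the exact split (and therefore the outputs shown below) can vary between runs. To reproduce a split, you can fix the random seed before partitioning; the seed value here is arbitrary and not part of the original example.

set.seed(123)   # any fixed value makes the partition reproducible
indexes = createDataPartition(iris$Species, p=.9, list=F)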

Next, we'll extract the x (feature) and y (label) parts. The training x data must be of matrix type to be used in xgboost, so we'll convert the x data into a matrix.

train_x = data.matrix(train[,-5])
train_y = train[,5]
 
test_x = data.matrix(test[,-5])
test_y = test[,5]

Next, we need to convert the train and test data into the xgb.DMatrix type that xgboost works with.

xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
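
One detail worth noting: the Species column is a factor, while xgb.DMatrix() expects a numeric label vector, so the code above works because the factor is coerced to its underlying integer codes 1, 2 and 3. Making that conversion explicit, as in the small variant below, is equivalent but easier to read:

# equivalent, with the factor-to-integer conversion made explicit
xgb_train = xgb.DMatrix(data=train_x, label=as.integer(train_y))
xgb_test = xgb.DMatrix(data=test_x, label=as.integer(test_y))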


Defining the model

    We can define the xgboost model with the xgboost() function, changing some of its parameters. Note that xgboost() is a training function, so we need to pass the training data as well; once we run it, it fits the model to the training data.

xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
[1] train-rmse:1.213938 
[2] train-rmse:0.865807 
[3] train-rmse:0.622092 
[4] train-rmse:0.451725 
[5] train-rmse:0.334372 
[6] train-rmse:0.255238
.... 
[43] train-rmse:0.026330 
[44] train-rmse:0.026025 
[45] train-rmse:0.025677 
[46] train-rmse:0.025476 
[47] train-rmse:0.024495 
[48] train-rmse:0.023678 
[49] train-rmse:0.022138 
[50] train-rmse:0.020715 
 
print(xgbc)
##### xgb.Booster
raw: 30.2 Kb 
call:
  xgb.train(params = params, data = dtrain, nrounds = nrounds, 
    watchlist = watchlist, verbose = verbose, print_every_n = print_every_n, 
    early_stopping_rounds = early_stopping_rounds, maximize = maximize, 
    save_period = save_period, save_name = save_name, xgb_model = xgb_model, 
    callbacks = callbacks, max.depth = 3)
params (as set within xgb.train):
  max_depth = "3", silent = "1"
xgb.attributes:
  niter
callbacks:
  cb.print.evaluation(period = print_every_n)
  cb.evaluation.log()
  cb.save.model(save_period = save_period, save_name = save_name)
niter: 50
evaluation_log:
    iter train_rmse
       1   1.213938
       2   0.865807
---                
      49   0.022138
      50   0.020715
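
Since no objective parameter was set, xgboost fell back to its default regression objective, which is why the log above reports train-rmse: the model regresses on the numeric class codes 1-3, and we round its continuous predictions back to classes below. As an alternative (a minimal sketch, not used in the rest of this tutorial, with variable names invented for illustration), the same problem can be set up as a proper multiclass model with the multi:softmax objective, which expects 0-based labels:

# alternative sketch: multiclass objective instead of regression on class codes
xgb_train_mc = xgb.DMatrix(data=train_x, label=as.integer(train_y) - 1)
xgbc_mc = xgboost(data=xgb_train_mc, max.depth=3, nrounds=50,
                  objective="multi:softmax", num_class=3)
pred_mc = predict(xgbc_mc, test_x) + 1   # shift back to the 1-based codes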


Predicting test data

The model is ready and we can predict the test data.

pred = predict(xgbc, xgb_test)
print(pred)
 [1] 1.0083745 0.9993168 0.7263275 0.9887304 0.9993168 1.9989902 1.9592317 1.9999132
 [9] 2.0134101 1.9976928 2.9946277 3.5094361 2.8852687 2.8306360 2.1748595

The predicted values are continuous, so we'll clamp any prediction above 3, round each value to the nearest class code, and convert the codes back into a factor of class labels.

pred[(pred>3)] = 3
pred_y = as.factor((levels(test_y))[round(pred)])
print(pred_y)
 [1] setosa     setosa     setosa     setosa     setosa     versicolor versicolor
 [8] versicolor versicolor versicolor virginica  virginica  virginica  virginica 
[15] versicolor
Levels: setosa versicolor virginica
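
The clamping above only catches predictions that overshoot the largest class code. A slightly more defensive variant (equivalent on this data) clamps on both ends before rounding:

# clamp into the valid 1..3 range, round, then map codes back to labels
pred_y = as.factor(levels(test_y)[pmin(pmax(round(pred), 1), 3)])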

We'll check the prediction accuracy with a confusion matrix. One caveat: caret's confusionMatrix() expects the predicted values as its first argument and the reference labels as its second, so with the argument order used below, the rows labeled "Prediction" actually hold the true labels. The overall accuracy is unaffected.

cm = confusionMatrix(test_y, pred_y)
print(cm)
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          5          0         0
  versicolor      0          5         0
  virginica       0          1         4

Overall Statistics
                                          
               Accuracy : 0.9333          
                 95% CI : (0.6805, 0.9983)
    No Information Rate : 0.4             
    P-Value [Acc > NIR] : 2.523e-05       
                                          
                  Kappa : 0.9             
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8333           1.0000
Specificity                 1.0000            1.0000           0.9091
Pos Pred Value              1.0000            1.0000           0.8000
Neg Pred Value              1.0000            0.9000           1.0000
Prevalence                  0.3333            0.4000           0.2667
Detection Rate              0.3333            0.3333           0.2667
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            0.9167           0.9545
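
The same accuracy figure can be confirmed without caret by comparing the predicted and actual labels directly:

# fraction of correctly classified test samples (here 14 of 15, i.e. 0.9333)
mean(as.character(pred_y) == as.character(test_y))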


Finally, we can compare the predicted results with the original values.

result = cbind(orig=as.character(test_y),
               factor=as.factor(test_y),
               pred=pred,
               rounded=round(pred),
               pred=as.character(levels(test_y))[round(pred)])
 
print(data.frame(result))
         orig factor              pred rounded     pred.1
1      setosa      1  1.00837445259094       1     setosa
2      setosa      1 0.999316811561584       1     setosa
3      setosa      1 0.726327538490295       1     setosa
4      setosa      1 0.988730430603027       1     setosa
5      setosa      1 0.999316811561584       1     setosa
6  versicolor      2  1.99899017810822       2 versicolor
7  versicolor      2  1.95923173427582       2 versicolor
8  versicolor      2  1.99991321563721       2 versicolor
9  versicolor      2  2.01341009140015       2 versicolor
10 versicolor      2  1.99769282341003       2 versicolor
11  virginica      3   2.9946277141571       3  virginica
12  virginica      3                 3       3  virginica
13  virginica      3   2.8852686882019       3  virginica
14  virginica      3   2.8306360244751       3  virginica
15  virginica      3  2.17485952377319       2 versicolor



   In this tutorial, we've briefly learned how to classify data with xgboost in R. The full source code is listed below.


Source code listing

library(xgboost)
library(caret)

indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]

train_x = data.matrix(train[,-5])
train_y = train[,5]
test_x = data.matrix(test[,-5])
test_y = test[,5]

xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)

xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
print(xgbc)

pred = predict(xgbc, xgb_test)
print(pred)

pred[(pred>3)] = 3
pred_y = as.factor((levels(test_y))[round(pred)])
print(pred_y)

cm = confusionMatrix(test_y, pred_y)
print(cm)

result = cbind(orig=as.character(test_y),
               factor=as.factor(test_y),
               pred=pred,
               rounded=round(pred),
               pred=as.character(levels(test_y))[round(pred)])
print(data.frame(result))

