Extreme Gradient Boosting (XGBoost) is a gradient boosting algorithm for machine learning. XGBoost applies regularization techniques to reduce overfitting. Its advantages over classical gradient boosting are its fast execution speed and its strong predictive performance on classification and regression problems.
In this tutorial, we'll briefly learn how to classify data with XGBoost by using the xgboost package in R. The tutorial covers:
- Preparing data
- Defining the model
- Predicting test data
library(xgboost)
library(caret)
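If the packages are not installed yet, they can be installed from CRAN first (this step is not shown in the original post):

install.packages(c("xgboost", "caret"))   # one-time installation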
Preparing data
In this tutorial, we'll use the Iris dataset as the target classification data. First, we'll split the dataset into train and test parts. Here, ten percent of the dataset is held out as test data.
indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]
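Because createDataPartition() samples randomly, the exact rows (and the numbers shown later in this post) will differ between runs. A quick sanity check of the stratified split (an aside, not part of the original code):

# for a reproducible split, call set.seed(123) before createDataPartition()
table(train$Species)   # about 45 rows per class in the training set
table(test$Species)    # about 5 rows per class in the test set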
Next, we'll extract the x (feature) and y (label) parts. The x data must be a numeric matrix to be used with xgboost, so we convert it with data.matrix().
train_x = data.matrix(train[,-5])
train_y = train[,5]
test_x = data.matrix(test[,-5])
test_y = test[,5]
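A quick check (not in the original) confirms that data.matrix() produced a numeric matrix and that the labels are still a factor:

str(train_x)      # numeric matrix with the 4 feature columns
class(train_y)    # "factor" with levels setosa, versicolor, virginica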
Next, we need to convert the train and test data into the xgb.DMatrix type.
xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
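Note that xgb.DMatrix() expects a numeric label vector; the Species factor is coerced to the integers 1, 2, and 3 here. If your version of xgboost rejects a factor label, the conversion can be made explicit (an assumption about your setup, not part of the original code):

xgb_train = xgb.DMatrix(data=train_x, label=as.numeric(train_y))
xgb_test = xgb.DMatrix(data=test_x, label=as.numeric(test_y))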
Defining the model
We can define the XGBoost model with the xgboost() function, changing some of its parameters. Note that xgboost() is a training function, so we need to pass the training data to it. Once we run the function, it fits the model to the training data. Because the label here is numeric (the factor levels 1, 2, and 3), xgboost() falls back on its default regression objective, which is why the training log below reports train-rmse.
xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
[1] train-rmse:1.213938
[2] train-rmse:0.865807
[3] train-rmse:0.622092
[4] train-rmse:0.451725
[5] train-rmse:0.334372
[6] train-rmse:0.255238
....
[43] train-rmse:0.026330
[44] train-rmse:0.026025
[45] train-rmse:0.025677
[46] train-rmse:0.025476
[47] train-rmse:0.024495
[48] train-rmse:0.023678
[49] train-rmse:0.022138
[50] train-rmse:0.020715
print(xgbc)
##### xgb.Booster
raw: 30.2 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max.depth = 3)
params (as set within xgb.train):
max_depth = "3", silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log()
cb.save.model(save_period = save_period, save_name = save_name)
niter: 50
evaluation_log:
iter train_rmse
1 1.213938
2 0.865807
---
49 0.022138
50 0.020715
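Since the label was numeric, the model above is effectively a regression on the class index, and the predictions below are continuous values that we round back to classes. As an alternative sketch (not part of the original tutorial), xgboost can also be trained as a true multiclass classifier with the multi:softmax objective, which expects 0-based integer labels:

# hedged alternative: multiclass objective instead of regression + rounding
xgb_train_cls = xgb.DMatrix(data=train_x, label=as.numeric(train_y) - 1)
xgbc_cls = xgboost(data=xgb_train_cls, max.depth=3, nrounds=50,
                   objective="multi:softmax", num_class=3, verbose=0)
pred_cls = predict(xgbc_cls, xgb.DMatrix(data=test_x))   # returns 0, 1, or 2
pred_cls_y = factor(levels(test_y)[pred_cls + 1], levels=levels(test_y))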
Predicting test data
The model is ready and we can predict the test data.
pred = predict(xgbc, xgb_test)
print(pred)
[1] 1.0083745 0.9993168 0.7263275 0.9887304 0.9993168 1.9989902 1.9592317 1.9999132
[9] 2.0134101 1.9976928 2.9946277 3.5094361 2.8852687 2.8306360 2.1748595
The predictions are continuous values close to the class indices 1-3, so we clip values above 3, round them, and map the results back to the factor levels.
pred[(pred>3)] = 3
pred_y = as.factor((levels(test_y))[round(pred)])
print(pred_y)
[1] setosa setosa setosa setosa setosa versicolor versicolor
[8] versicolor versicolor versicolor virginica virginica virginica virginica
[15] versicolor
Levels: setosa versicolor virginica
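An equivalent, slightly more compact way to do the clipping and rounding (an alternative, not the original code) uses pmin() and pmax() to keep the rounded values inside 1..3:

pred_y = factor(levels(test_y)[pmin(pmax(round(pred), 1), 3)], levels=levels(test_y))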
We'll check the prediction accuracy with a confusion matrix.
cm = confusionMatrix(test_y, pred_y)
print(cm)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 5 0 0
versicolor 0 5 0
virginica 0 1 4
Overall Statistics
Accuracy : 0.9333
95% CI : (0.6805, 0.9983)
No Information Rate : 0.4
P-Value [Acc > NIR] : 2.523e-05
Kappa : 0.9
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8333 1.0000
Specificity 1.0000 1.0000 0.9091
Pos Pred Value 1.0000 1.0000 0.8000
Neg Pred Value 1.0000 0.9000 1.0000
Prevalence 0.3333 0.4000 0.2667
Detection Rate 0.3333 0.3333 0.2667
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9167 0.9545
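The headline accuracy can also be read directly from the returned object, or computed by hand (a small aside):

cm$overall["Accuracy"]      # 0.9333 for this split
mean(pred_y == test_y)      # same value, computed directly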
We can compare the result with the original values.
result = cbind(orig=as.character(test_y),
factor=as.factor(test_y),
pred=pred,
rounded=round(pred),
pred=as.character(levels(test_y))[round(pred)])
print(data.frame(result))
orig factor pred rounded pred.1
1 setosa 1 1.00837445259094 1 setosa
2 setosa 1 0.999316811561584 1 setosa
3 setosa 1 0.726327538490295 1 setosa
4 setosa 1 0.988730430603027 1 setosa
5 setosa 1 0.999316811561584 1 setosa
6 versicolor 2 1.99899017810822 2 versicolor
7 versicolor 2 1.95923173427582 2 versicolor
8 versicolor 2 1.99991321563721 2 versicolor
9 versicolor 2 2.01341009140015 2 versicolor
10 versicolor 2 1.99769282341003 2 versicolor
11 virginica 3 2.9946277141571 3 virginica
12 virginica 3 3 3 virginica
13 virginica 3 2.8852686882019 3 virginica
14 virginica 3 2.8306360244751 3 virginica
15 virginica 3 2.17485952377319 2 versicolor
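Since cbind() on mixed types coerces everything to character, a cleaner comparison table (a tidier variant, not the original code) can be built directly as a data frame:

result_df = data.frame(orig = test_y,
                       pred = pred,
                       rounded = round(pred),
                       pred_label = pred_y)
print(result_df)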
In this tutorial, we've briefly learned how to classify data with xgboost in R. The full source code is listed below.
Source code listing
library(xgboost)
library(caret)
indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]
train_x = data.matrix(train[,-5])
train_y = train[,5]
test_x = data.matrix(test[,-5])
test_y = test[,5]
xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
print(xgbc)
pred = predict(xgbc, xgb_test)
print(pred)
pred[(pred>3)]=3
pred_y = as.factor((levels(test_y))[round(pred)])
print(pred_y)
cm = confusionMatrix(test_y, pred_y)
print(cm)
result = cbind(orig=as.character(test_y),
factor=as.factor(test_y),
pred=pred,
rounded=round(pred),
pred=as.character(levels(test_y))[round(pred)])
print(data.frame(result))