Extreme Gradient Boosting (XGBoost) is a gradient boosting algorithm for machine learning. XGBoost applies regularization techniques to reduce overfitting. Its advantages over classical gradient boosting are its fast execution speed and its strong predictive performance on classification and regression problems.
In this tutorial, we'll briefly learn how to classify data with XGBoost by using the xgboost package in R. The tutorial covers:
- Preparing data
- Defining the model
- Predicting test data
library(xgboost)
library(caret)
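If the packages are not installed yet, they can be installed from CRAN first (this step is not shown in the original post):

install.packages(c("xgboost", "caret"))   # one-time installation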
Preparing data
In this tutorial, we'll use the Iris dataset as the target classification data. First, we'll split the dataset into train and test parts. Here, ten percent of the dataset is held out as test data.
indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]
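Because createDataPartition() samples randomly, the exact rows (and the numbers shown later in this post) will differ between runs. A quick sanity check of the stratified split (an aside, not part of the original code):

# for a reproducible split, call set.seed(123) before createDataPartition()
table(train$Species)   # about 45 rows per class in the training set
table(test$Species)    # about 5 rows per class in the test set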
Next, we'll extract the x (feature) and y (label) parts. The x data must be a numeric matrix to be used with xgboost, so we convert it with data.matrix().
train_x = data.matrix(train[,-5])
train_y = train[,5]
test_x = data.matrix(test[,-5])
test_y = test[,5]
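A quick check (not in the original) confirms that data.matrix() produced a numeric matrix and that the labels are still a factor:

str(train_x)      # numeric matrix with the 4 feature columns
class(train_y)    # "factor" with levels setosa, versicolor, virginica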
Next, we need to convert the train and test data into the xgb.DMatrix type.
xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
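Note that xgb.DMatrix() expects a numeric label vector; the Species factor is coerced to the integers 1, 2, and 3 here. If your version of xgboost rejects a factor label, the conversion can be made explicit (an assumption about your setup, not part of the original code):

xgb_train = xgb.DMatrix(data=train_x, label=as.numeric(train_y))
xgb_test = xgb.DMatrix(data=test_x, label=as.numeric(test_y))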
Defining the model
We can define the XGBoost model with the xgboost() function, changing some of its parameters. Note that xgboost() is a training function, so we need to pass the training data to it. Once we run the function, it fits the model to the training data. Because the label here is numeric (the factor levels 1, 2, and 3), xgboost() falls back on its default regression objective, which is why the training log below reports train-rmse.
xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
[1] train-rmse:1.213938
[2] train-rmse:0.865807
[3] train-rmse:0.622092
[4] train-rmse:0.451725
[5] train-rmse:0.334372
[6] train-rmse:0.255238
....
[43] train-rmse:0.026330
[44] train-rmse:0.026025
[45] train-rmse:0.025677
[46] train-rmse:0.025476
[47] train-rmse:0.024495
[48] train-rmse:0.023678
[49] train-rmse:0.022138
[50] train-rmse:0.020715
print(xgbc)
##### xgb.Booster
raw: 30.2 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max.depth = 3)
params (as set within xgb.train):
max_depth = "3", silent = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log()
cb.save.model(save_period = save_period, save_name = save_name)
niter: 50
evaluation_log:
iter train_rmse
1 1.213938
2 0.865807
---
49 0.022138
50 0.020715
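Since the label was numeric, the model above is effectively a regression on the class index, and the predictions below are continuous values that we round back to classes. As an alternative sketch (not part of the original tutorial), xgboost can also be trained as a true multiclass classifier with the multi:softmax objective, which expects 0-based integer labels:

# hedged alternative: multiclass objective instead of regression + rounding
xgb_train_cls = xgb.DMatrix(data=train_x, label=as.numeric(train_y) - 1)
xgbc_cls = xgboost(data=xgb_train_cls, max.depth=3, nrounds=50,
                   objective="multi:softmax", num_class=3, verbose=0)
pred_cls = predict(xgbc_cls, xgb.DMatrix(data=test_x))   # returns 0, 1, or 2
pred_cls_y = factor(levels(test_y)[pred_cls + 1], levels=levels(test_y))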
Predicting test data
The model is ready and we can predict the test data.
pred = predict(xgbc, xgb_test)
print(pred)
[1] 1.0083745 0.9993168 0.7263275 0.9887304 0.9993168 1.9989902 1.9592317 1.9999132
[9] 2.0134101 1.9976928 2.9946277 3.5094361 2.8852687 2.8306360 2.1748595
The predictions are continuous values close to the class indices 1-3, so we clip values above 3, round them, and map the results back to the factor levels.
pred[(pred>3)] = 3
pred_y = as.factor((levels(test_y))[round(pred)])
print(pred_y)
[1] setosa setosa setosa setosa setosa versicolor versicolor
[8] versicolor versicolor versicolor virginica virginica virginica virginica
[15] versicolor
Levels: setosa versicolor virginica
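An equivalent, slightly more compact way to do the clipping and rounding (an alternative, not the original code) uses pmin() and pmax() to keep the rounded values inside 1..3:

pred_y = factor(levels(test_y)[pmin(pmax(round(pred), 1), 3)], levels=levels(test_y))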
We'll check the prediction accuracy with a confusion matrix.
cm = confusionMatrix(test_y, pred_y)
print(cm)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 5 0 0
versicolor 0 5 0
virginica 0 1 4
Overall Statistics
Accuracy : 0.9333
95% CI : (0.6805, 0.9983)
No Information Rate : 0.4
P-Value [Acc > NIR] : 2.523e-05
Kappa : 0.9
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8333 1.0000
Specificity 1.0000 1.0000 0.9091
Pos Pred Value 1.0000 1.0000 0.8000
Neg Pred Value 1.0000 0.9000 1.0000
Prevalence 0.3333 0.4000 0.2667
Detection Rate 0.3333 0.3333 0.2667
Detection Prevalence 0.3333 0.3333 0.3333
Balanced Accuracy 1.0000 0.9167 0.9545
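The headline accuracy can also be read directly from the returned object, or computed by hand (a small aside):

cm$overall["Accuracy"]      # 0.9333 for this split
mean(pred_y == test_y)      # same value, computed directly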
We can compare the result with the original values.
result = cbind(orig=as.character(test_y),
factor=as.factor(test_y),
pred=pred,
rounded=round(pred),
pred=as.character(levels(test_y))[round(pred)])
print(data.frame(result))
orig factor pred rounded pred.1
1 setosa 1 1.00837445259094 1 setosa
2 setosa 1 0.999316811561584 1 setosa
3 setosa 1 0.726327538490295 1 setosa
4 setosa 1 0.988730430603027 1 setosa
5 setosa 1 0.999316811561584 1 setosa
6 versicolor 2 1.99899017810822 2 versicolor
7 versicolor 2 1.95923173427582 2 versicolor
8 versicolor 2 1.99991321563721 2 versicolor
9 versicolor 2 2.01341009140015 2 versicolor
10 versicolor 2 1.99769282341003 2 versicolor
11 virginica 3 2.9946277141571 3 virginica
12 virginica 3 3 3 virginica
13 virginica 3 2.8852686882019 3 virginica
14 virginica 3 2.8306360244751 3 virginica
15 virginica 3 2.17485952377319 2 versicolor
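Since cbind() on mixed types coerces everything to character, a cleaner comparison table (a tidier variant, not the original code) can be built directly as a data frame:

result_df = data.frame(orig = test_y,
                       pred = pred,
                       rounded = round(pred),
                       pred_label = pred_y)
print(result_df)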
In this tutorial, we've briefly learned how to classify data with xgboost in R. The full source code is listed below.
Source code listing
library(xgboost)
library(caret)
indexes = createDataPartition(iris$Species, p=.9, list=F)
train = iris[indexes, ]
test = iris[-indexes, ]
train_x = data.matrix(train[,-5])
train_y = train[,5]
test_x = data.matrix(test[,-5])
test_y = test[,5]
xgb_train = xgb.DMatrix(data=train_x, label=train_y)
xgb_test = xgb.DMatrix(data=test_x, label=test_y)
xgbc = xgboost(data=xgb_train, max.depth=3, nrounds=50)
print(xgbc)
pred = predict(xgbc, xgb_test)
print(pred)
pred[(pred>3)]=3
pred_y = as.factor((levels(test_y))[round(pred)])
print(pred_y)
cm = confusionMatrix(test_y, pred_y)
print(cm)
result = cbind(orig=as.character(test_y),
factor=as.factor(test_y),
pred=pred,
rounded=round(pred),
pred=as.character(levels(test_y))[round(pred)])
print(data.frame(result))