XGBoost stands for "Extreme Gradient Boosting" and is an implementation of the gradient boosted trees algorithm. It is a popular supervised machine learning method valued for its computation speed, parallelization, and predictive performance. XGBoost is an open-source software library, and you can use it in R by installing the xgboost package.
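If the packages used below are not installed yet, they can be added from CRAN first (MASS, which provides the Boston data, ships with standard R distributions):
install.packages("xgboost")
install.packages("caret")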
In this tutorial, we'll briefly learn how to fit and predict regression data with the xgboost() function. The tutorial covers:
- Preparing the data
- Fitting the model and prediction
- Accuracy check
- Source code listing
We'll start by loading the required libraries.
library(xgboost)
library(caret)
Preparing the data
We'll use the Boston house-price dataset (available in the MASS package) as the regression dataset in this tutorial. After loading the data, we'll split it into train and test parts and extract the x input and the y label (medv, the median house price) from each. Here, we'll hold out 15 percent of the dataset as test data. xgboost works on its own xgb.DMatrix format, so we also need to convert the data into that type.
boston = MASS::Boston
str(boston)
set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]
train_x = data.matrix(train[, -14])   # all 13 predictors (drop medv)
train_y = train[, 14]                 # medv, the target
test_x = data.matrix(test[, -14])
test_y = test[, 14]
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
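A quick sanity check on the split sizes can catch partitioning mistakes early. dim() works on data frames, and the xgboost package also provides a dim() method for xgb.DMatrix objects:
dim(train)      # roughly 85 percent of the 506 rows
dim(test)       # the remaining ~15 percent
dim(xgb_train)  # rows and number of features in the DMatrix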
Fitting the model and prediction
We'll define the model by using the xgboost() function of the xgboost package. Here, we'll set the max_depth and nrounds parameters: max_depth limits the depth of each tree (the higher the value, the more complex the model), and nrounds sets the number of boosting iterations.
Calling the function is enough to train the model on the supplied data. You can check a summary of the fitted model with the print() and str() functions.
xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)
##### xgb.Booster
raw: 22.2 Kb
call:
xgb.train(params = params, data = dtrain, nrounds = nrounds,
watchlist = watchlist, verbose = verbose, print_every_n = print_every_n,
early_stopping_rounds = early_stopping_rounds, maximize = maximize,
save_period = save_period, save_name = save_name, xgb_model = xgb_model,
callbacks = callbacks, max_depth = 2)
params (as set within xgb.train):
max_depth = "2", validate_parameters = "1"
xgb.attributes:
niter
callbacks:
cb.print.evaluation(period = print_every_n)
cb.evaluation.log()
# of features: 13
niter: 50
nfeatures : 13
evaluation_log:
iter train_rmse
1 10.288543
2 7.710918
---
49 2.007022
50 1.997438
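Besides max_depth and nrounds, xgboost() passes other booster parameters through to the underlying xgb.train() interface. As a sketch (the values here are illustrative defaults, not tuned settings), you could set the learning rate eta and the objective explicitly; "reg:squarederror" is the default regression objective in recent versions of the package:
xgbc2 = xgboost(data = xgb_train,
                max_depth = 2,
                eta = 0.3,                       # learning rate (0.3 is the default)
                objective = "reg:squarederror",  # squared-error regression
                nrounds = 50,
                verbose = 0)                     # suppress the per-iteration log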
Next, we'll predict the test data with the trained xgbc model.
pred_y = predict(xgbc, xgb_test)
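Before computing any metrics, a quick look at the first few predictions next to the actual values gives an immediate feel for the fit:
head(data.frame(actual = test_y, predicted = pred_y))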
Accuracy check
Next, we'll check the prediction accuracy with the MSE, MAE, and RMSE metrics.
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)
cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)
MSE: 11.99942 MAE: 2.503739 RMSE: 3.464018
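If you also want the coefficient of determination, caret provides an R2() function; collecting all the metrics in a data frame prints them as a neat one-row table:
data.frame(MSE = mean((test_y - pred_y)^2),
           MAE = caret::MAE(test_y, pred_y),
           RMSE = caret::RMSE(test_y, pred_y),
           R2 = caret::R2(test_y, pred_y))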
Finally, we'll visualize the original and the predicted test values in a plot.
x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))
In this tutorial, we've learned how to fit and predict regression data with xgboost in R. The full source code is listed below.
Source code listing
library(xgboost)
library(caret)
boston = MASS::Boston
str(boston)
set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]
train_x = data.matrix(train[, -14])   # all 13 predictors (drop medv)
train_y = train[, 14]                 # medv, the target
test_x = data.matrix(test[, -14])
test_y = test[, 14]
xgb_train = xgb.DMatrix(data = train_x, label = train_y)
xgb_test = xgb.DMatrix(data = test_x, label = test_y)
xgbc = xgboost(data = xgb_train, max_depth = 2, nrounds = 50)
print(xgbc)
pred_y = predict(xgbc, xgb_test)
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)
cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)
x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l")
lines(x, pred_y, col = "blue", type = "l")
legend(x = 1, y = 38, legend = c("original test_y", "predicted test_y"),
col = c("red", "blue"), box.lty = 1, cex = 0.8, lty = c(1, 1))