Decision trees can be implemented by using the 'rpart' package in R. The name 'rpart' stands for Recursive Partitioning and Regression Trees, and the package applies tree-based models to both regression and classification problems.
In this tutorial, we'll briefly learn how to fit and predict regression data by using the 'rpart' function in R. The tutorial covers:
- Preparing the data
- Fitting the model and prediction
- Accuracy checking
- Source code listing
We'll start by loading the required libraries.
library(rpart)
library(caret)
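If the 'rpart' and 'caret' packages are not installed yet, they can be installed from CRAN first ('MASS', which provides the Boston dataset used below, usually ships with R):
install.packages(c("rpart", "caret"))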
Preparing the data
We use the Boston housing dataset as the target regression data in this tutorial. After loading the dataset, we'll split it into train and test parts and extract the x-input and y-label parts. Here, I'll set aside 15 percent of the dataset as test data. Scaling the x data is optional for tree-based models (they are not sensitive to feature scaling), but we'll scale it here anyway to keep the preprocessing consistent with other regression workflows.
boston = MASS::Boston
str(boston)
set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]
train_x = train[, -14]    # all predictor columns ('medv' is column 14)
train_x = scale(train_x)[,]
train_y = train[, 14]     # target column 'medv'
test_x = test[, -14]
test_x = scale(test[, -14])[,]
test_y = test[, 14]
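Note that the code above scales the test data with its own mean and standard deviation. A common alternative, sketched below, is to reuse the training set's statistics so both parts are transformed the same way (the 'ctr' and 'scl' names are only illustrative and not part of the original tutorial):
# centering/scaling values computed from the training predictors only
ctr = colMeans(train[, -14])
scl = apply(train[, -14], 2, sd)
# apply the training statistics to both train and test predictors
train_x = scale(train[, -14], center = ctr, scale = scl)[,]
test_x  = scale(test[, -14],  center = ctr, scale = scl)[,]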
Fitting the model and prediction
We'll define the model by using the rpart() function of the rpart package and fit it on the train data. Here, we'll set the 'control' parameter as shown below; cp is the complexity parameter, and a very small value lets the tree grow deep (we'll prune it afterwards). Calling the function is enough to train the model with the provided data.
You can check the summary of the fitted model by using the print() or printcp() function.
fit = rpart(train_y~., data = data.frame(train_x, train_y),
control = rpart.control(cp = 0.00001))
printcp(fit)
Regression tree:
rpart(formula = train_y ~ ., data = data.frame(train_x, train_y),
control = rpart.control(cp = 1e-05))
Variables actually used in tree construction:
[1] age black crim dis lstat nox ptratio rad rm tax zn
Root node error: 38319/458 = 83.667
n= 458
CP nsplit rel error xerror xstd
1 0.45252258 0 1.00000 1.00371 0.087768
2 0.17808779 1 0.54748 0.62144 0.061139
3 0.06372523 2 0.36939 0.41766 0.049265
4 0.04001076 3 0.30566 0.34812 0.045449
5 0.03530233 4 0.26565 0.34948 0.046059
6 0.02585806 5 0.23035 0.31908 0.044292
7 0.00855071 6 0.20449 0.27123 0.039490
....
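The xerror column above shows the cross-validated error for each cp value. One common way to choose the pruning level, shown as a sketch below, is to take the cp value with the smallest xerror from the model's cptable (the 'best_cp' and 'fit.best' names are only illustrative; this tutorial simply uses a fixed cp instead):
# pick the cp value that minimizes the cross-validated error
best_cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
fit.best = prune(fit, cp = best_cp)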
Next, we'll apply the prune() function to the fitted model with a small cp value. Then we can plot the pruned tree.
fit.pruned = prune(fit, cp = 0.0001)
plot(fit.pruned)
text(fit.pruned, cex = 0.9, xpd = TRUE)
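The base plot() output can be hard to read for deeper trees. If the 'rpart.plot' package is available (an extra dependency, not used elsewhere in this tutorial), it draws a more readable tree:
# optional: nicer tree visualization with the rpart.plot package
library(rpart.plot)
rpart.plot(fit.pruned)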
Now, we can predict the test data with the trained model.
pred_y = predict(fit.pruned, data.frame(test_x))
Accuracy checking
Next, we'll check the prediction accuracy with MSE, MAE, and RMSE metrics.
print(data.frame(test_y, pred_y))
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)
cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)
MSE: 20.28907 MAE: 2.979355 RMSE: 4.504339
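For reference, the same metrics can also be computed in base R without caret; RMSE is simply the square root of MSE (the '_base' names are only illustrative).
# equivalent base-R computations (no caret required)
mae_base  = mean(abs(test_y - pred_y))
rmse_base = sqrt(mean((test_y - pred_y)^2))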
Finally, we'll visualize the original and predicted test values in a plot.
x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l", lwd=2,
main = "Boston housing test data prediction")
lines(x, pred_y, col = "blue", lwd=2)
legend("topright", legend = c("original-medv", "predicted-medv"),
fill = c("red", "blue"), col = 2:3, adj = c(0, 0.6))
grid()
In this tutorial, we've briefly learned how to fit and predict regression data with the rpart() function in R. The full source code is listed below.
Source code listing
library(rpart)
library(caret)
boston = MASS::Boston
str(boston)
set.seed(12)
indexes = createDataPartition(boston$medv, p = .85, list = F)
train = boston[indexes, ]
test = boston[-indexes, ]
train_x = train[, -14]
train_x = scale(train_x)[,]
train_y = train[,14]
test_x = test[, -14]
test_x = scale(test[,-14])[,]
test_y = test[,14]
fit = rpart(train_y~., data = data.frame(train_x, train_y),
control = rpart.control(cp = 0.00001))
printcp(fit)
fit.pruned = prune(fit, cp = 0.0001)
plot(fit.pruned)
text(fit.pruned, cex = 0.9, xpd = TRUE)
pred_y = predict(fit.pruned, data.frame(test_x))
print(data.frame(test_y, pred_y))
mse = mean((test_y - pred_y)^2)
mae = caret::MAE(test_y, pred_y)
rmse = caret::RMSE(test_y, pred_y)
cat("MSE: ", mse, "MAE: ", mae, " RMSE: ", rmse)
x = 1:length(test_y)
plot(x, test_y, col = "red", type = "l", lwd=2,
main = "Boston housing test data prediction")
lines(x, pred_y, col = "blue", lwd=2)
legend("topright", legend = c("original-medv", "predicted-medv"),
fill = c("red", "blue"), col = 2:3, adj = c(0, 0.6))
grid()