- Preparing data
- Best alpha
- Fitting the model and checking the results
- Cross-validation with RidgeCV
- Source code listing
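# Note: load_boston and the normalize= argument were removed in
# scikit-learn 1.2, so this listing needs an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np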
Preparing data
We use the Boston house-price dataset as the regression dataset in this tutorial. After loading it, we'll separate the data into x (features) and y (target), then split both into train and test parts. Here, I'll hold out 15 percent of the dataset as test data.
boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
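Because train_test_split shuffles the rows randomly, the numbers in this post will change from run to run. Here is a minimal sketch of a reproducible split. The synthetic arrays only stand in for the real data (same 506 x 13 shape), and the random_state value of 42 is an arbitrary choice for illustration, not something the original code used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 506 rows, 13 features.
rng = np.random.default_rng(0)
x = rng.normal(size=(506, 13))
y = rng.normal(size=506)

# random_state fixes the shuffle, so every run yields the same split.
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.15, random_state=42
)

# 15 percent of 506 rows is rounded up for the test part.
print(xtrain.shape, xtest.shape)
```

With a fixed random_state, the scores printed later in a tutorial like this one stay the same between runs, which makes different alpha values easier to compare.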
Best alpha
Alpha is the key parameter in Ridge regularization. It defines the regularization strength, that is, how strongly the coefficients are shrunk toward zero: the higher the value, the stronger the regularization. We don't know in advance which value works well for our data, so we'll find a good alpha by trying several values and checking the model accuracy for each.
alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 1]
We can define a Ridge model for each alpha and fit it on the full x, y data. Then we check the R-squared, MSE, and RMSE values for each alpha.
for a in alphas:
    # Fit on the full dataset and score on the same data
    model = Ridge(alpha=a, normalize=True).fit(x, y)
    score = model.score(x, y)
    pred_y = model.predict(x)
    mse = mean_squared_error(y, pred_y)
    print("Alpha:{0:.6f}, R2:{1:.3f}, MSE:{2:.2f}, RMSE:{3:.2f}"
          .format(a, score, mse, np.sqrt(mse)))
Alpha:0.000001, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.000010, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.000100, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.001000, R2:0.741, MSE:21.90, RMSE:4.68
Alpha:0.010000, R2:0.740, MSE:21.92, RMSE:4.68
Alpha:0.100000, R2:0.732, MSE:22.66, RMSE:4.76
Alpha:0.500000, R2:0.686, MSE:26.49, RMSE:5.15
Alpha:1.000000, R2:0.635, MSE:30.81, RMSE:5.55
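The scores barely change up to alpha = 0.01 and get clearly worse beyond it, so 0.01 is the best value we can use: it is the strongest regularization that does not noticeably hurt the fit. Keep in mind that these scores are computed on the same data the model was fit on, so they can only get worse as alpha grows.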
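To see what "stronger regularization" does to the coefficients, here is a rough sketch of ridge regression in closed form with plain NumPy. The ridge_fit helper and the synthetic data are assumptions made for illustration. It is not how scikit-learn implements Ridge internally, and it skips the feature scaling that normalize=True applied above, so the alpha values are not comparable with the ones in this post.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge on centered data: w = (X^T X + alpha*I)^-1 X^T y."""
    Xc = X - X.mean(axis=0)          # centering replaces the intercept term
    yc = y - y.mean()
    n_features = X.shape[1]
    A = Xc.T @ Xc + alpha * np.eye(n_features)
    return np.linalg.solve(A, Xc.T @ yc)

# Synthetic data with known coefficients (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_w + rng.normal(scale=0.5, size=200)

norms = []
for alpha in [0.0, 1.0, 10.0, 100.0, 1000.0]:
    w = ridge_fit(X, y, alpha)
    norms.append(np.linalg.norm(w))
    print(f"alpha={alpha:>7.1f}  ||w||={norms[-1]:.3f}")
```

The coefficient norm shrinks as alpha grows, which is why the fit on the training data gets worse for large alpha values.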
Fitting the model and checking the results
Next, we'll define the Ridge model again with alpha set to 0.01 and fit it on the xtrain and ytrain data. Then we'll predict the xtest data and check the prediction accuracy.
ridge_mod = Ridge(alpha=0.01, normalize=True).fit(xtrain, ytrain)
ypred = ridge_mod.predict(xtest)
# Score with ridge_mod (fit on the training split), not the loop variable model
score = ridge_mod.score(xtest, ytest)
mse = mean_squared_error(ytest, ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
      .format(score, mse, np.sqrt(mse)))
R2:0.691, MSE:15.56, RMSE:3.95
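For reference, the three metrics can be computed by hand from the same pair of arrays. The sketch below uses a hypothetical regression_metrics helper and made-up toy values. It follows the standard definitions of R-squared, MSE, and RMSE, which should match what score and mean_squared_error report for a regressor.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R2, MSE, RMSE) computed from one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = np.mean(residual ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(residual ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return r2, mse, rmse

# Toy values for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
r2, mse, rmse = regression_metrics(y_true, y_pred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}".format(r2, mse, rmse))
# prints: R2:0.900, MSE:0.50, RMSE:0.71
```

Since all three numbers derive from the same residuals, R2 and MSE should always be taken from the same model and the same test data.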
Finally, we'll visualize the result.
x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()
Cross-validation with RidgeCV
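RidgeCV is a Ridge regressor with built-in cross-validation. We can pass it the whole list of alpha values, and it selects the one with the best cross-validation score. By default it uses an efficient form of leave-one-out cross-validation.
# store_cv_values=True keeps the leave-one-out errors for each alpha in cv_values_
ridge_cv = RidgeCV(alphas=alphas, store_cv_values=True)
ridge_mod = ridge_cv.fit(xtrain, ytrain)
print(ridge_mod.alpha_)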
0.01
print(np.mean(ridge_mod.cv_values_, axis=0))
[25.38818446 25.388184 25.38817941 25.388134 25.387734 25.38842764 25.44565372 25.54571739]
Now we can predict the test data and check the accuracy.
ypred = ridge_mod.predict(xtest)
score = ridge_mod.score(xtest, ytest)
mse = mean_squared_error(ytest, ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
      .format(score, mse, np.sqrt(mse)))
R2:0.814, MSE:15.49, RMSE:3.94
We can also plot the result.
x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()
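On recent scikit-learn releases, where normalize is no longer available, one possible way to get a similar workflow is to scale the features in a pipeline before RidgeCV. The sketch below is an assumption of mine, not the method used in this post: it runs on synthetic data from make_regression, and StandardScaler is not identical to the old normalize=True (which centered each feature and divided it by its L2 norm), so the selected alpha is not directly comparable with the results above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the housing data.
x, y = make_regression(n_samples=500, n_features=13, noise=10.0,
                       random_state=0)
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.15, random_state=0
)

alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 1]

# The scaler is fit on the training part only, then RidgeCV picks alpha.
pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
pipe.fit(xtrain, ytrain)

# make_pipeline names each step after its lowercased class name.
best_alpha = pipe.named_steps["ridgecv"].alpha_
test_r2 = pipe.score(xtest, ytest)
print("best alpha:", best_alpha)
print("test R2: {:.3f}".format(test_r2))
```

Keeping the scaler inside the pipeline means the test data is transformed with statistics learned from the training data only.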
In this post, we've briefly learned how to use the Ridge and RidgeCV classes for regression data analysis in Python. The full source code is listed below. Thank you for reading!
Source code listing
# Note: load_boston and the normalize= argument were removed in
# scikit-learn 1.2, so this listing needs an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

boston = load_boston()
x, y = boston.data, boston.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

alphas = [0.000001, 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.5, 1]

for a in alphas:
    model = Ridge(alpha=a, normalize=True).fit(x, y)
    score = model.score(x, y)
    pred_y = model.predict(x)
    mse = mean_squared_error(y, pred_y)
    print("Alpha:{0:.6f}, R2:{1:.3f}, MSE:{2:.2f}, RMSE:{3:.2f}"
          .format(a, score, mse, np.sqrt(mse)))

ridge_mod = Ridge(alpha=0.01, normalize=True).fit(xtrain, ytrain)
ypred = ridge_mod.predict(xtest)
# Score with ridge_mod (fit on the training split), not the loop variable model
score = ridge_mod.score(xtest, ytest)
mse = mean_squared_error(ytest, ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
      .format(score, mse, np.sqrt(mse)))

x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()

# RidgeCV method
ridge_cv = RidgeCV(alphas=alphas, store_cv_values=True)
ridge_mod = ridge_cv.fit(xtrain, ytrain)
print(ridge_mod.alpha_)
print(np.mean(ridge_mod.cv_values_, axis=0))

ypred = ridge_mod.predict(xtest)
score = ridge_mod.score(xtest, ytest)
mse = mean_squared_error(ytest, ypred)
print("R2:{0:.3f}, MSE:{1:.2f}, RMSE:{2:.2f}"
      .format(score, mse, np.sqrt(mse)))

x_ax = range(len(xtest))
plt.scatter(x_ax, ytest, s=5, color="blue", label="original")
plt.plot(x_ax, ypred, lw=0.8, color="red", label="predicted")
plt.legend()
plt.show()