Linear regression is a widely used statistical technique for modeling
the relationship between a dependent variable and one or more
independent variables. In this tutorial, we'll learn how to perform
linear regression in R using the lm() function and evaluate the model's performance.
The tutorial covers:
Introduction to linear regression
Data preparation
Fitting the model
Accuracy check
Source code listing
Introduction to linear regression
Linear regression models the relationship between a dependent variable (target) and one or more independent variables (predictors) by fitting a linear equation to the observed data.
In simple linear regression, there is only one independent variable, while in multiple linear regression there are several. The linear equation can be expressed as:
y = mx + b
Where:
y is the dependent variable
x is the independent variable
m is the slope of the line
b is the y-intercept
The parameters m and b are estimated from the data, typically by the method of least squares. Once the parameters are estimated, the linear equation can be used to predict the value of the dependent variable y for new values of the independent variable x.
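As a quick illustration (using simulated data, not the dataset used in this tutorial), a simple linear regression with one predictor can be fitted in R like this:
# Toy example: fit y = m*x + b on simulated data (illustrative only)
set.seed(1)
x <- 1:50
y <- 2.5 * x + 4 + rnorm(50, sd = 3)   # true slope 2.5, intercept 4, plus noise
fit <- lm(y ~ x)                        # least squares estimates of slope and intercept
coef(fit)                               # estimates should be close to 4 and 2.5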
Data preparation
We'll start by loading the required libraries for this tutorial.
# Load libraries
library(caret)
In this tutorial, we'll utilize the Boston housing price dataset for regression analysis. We'll begin by preparing the data, which involves splitting it into training and testing sets. Additionally, you can examine the structure of the dataset using the str() command. In this dataset, medv represents the target variable (output or label), while the remaining variables serve as input features (predictors).
# Load the Boston dataset
boston <- MASS::Boston
str(boston)
# Set seed for reproducibility
set.seed(123)
# Split the data into training and testing sets
indexes <- createDataPartition(boston$medv, p = 0.85, list = FALSE)
train <- boston[indexes, ]
test <- boston[-indexes, ]
The str() function displays the structure of an R object.
Fitting the model
Next, we fit a linear regression model with the lm() function.
The formula medv ~ . specifies the model. In this formula, medv is the dependent variable (also known as the response or target variable), and . denotes all other variables in the dataset (excluding medv). This notation tells R to use every other variable in the training data as a predictor of medv.
The argument data = train specifies the dataset from which the variables are taken. Here, train is the training set containing both the dependent variable (medv) and the independent variables used for prediction.
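Putting this together, the fitting step looks like the following; the model object is named model to match the prediction code below, and summary() prints the coefficient table and overall fit statistics.
# Fit the linear regression model on the training data
model <- lm(medv ~ ., data = train)
# Display the model summary (coefficients, R-squared, F-statistic)
summary(model)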
The last line of the summary output, F-statistic: 96.45 on 13 and 418 DF, p-value: < 2.2e-16, indicates that the model as a whole is statistically significant.
Next, we'll predict on the test data with the trained model. Here, test[, -14] means that we're selecting all columns of the test dataset except the 14th column, which corresponds to the medv
variable. This ensures that the predictor variables in the test dataset
align with the predictors used in the model fitting process.
# Make predictions on the test set
pred_medv <- predict(model, newdata = test[, -14])
Accuracy check
We can assess the prediction accuracy using several metrics, including Mean Squared Error (MSE), Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared. Let's compute these metrics to evaluate the performance of our linear regression model:
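The original metric code is not shown, so here is a minimal sketch using base R, with R-squared computed by the standard 1 - SSE/SST definition on the test set:
# Calculate accuracy metrics on the test set
mse  <- mean((test$medv - pred_medv)^2)        # Mean Squared Error
mae  <- mean(abs(test$medv - pred_medv))       # Mean Absolute Error
rmse <- sqrt(mse)                              # Root Mean Squared Error
r2   <- 1 - sum((test$medv - pred_medv)^2) / sum((test$medv - mean(test$medv))^2)  # R-squared
cat("MSE:", mse, " MAE:", mae, " RMSE:", rmse, " R-squared:", r2, "\n")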
In this plot, the actual values are represented by red dots, and the predicted values are connected by a blue line.
This plot allows for a quick visual comparison between the actual and predicted values, helping us assess the performance of our linear regression model.
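The exact plotting code is not included in the original, but a plot matching this description can be produced with base R graphics along these lines:
# Plot actual values (red points) and predicted values (blue line)
x_axis <- 1:length(pred_medv)
plot(x_axis, test$medv, col = "red", pch = 19,
     xlab = "Observation", ylab = "medv", main = "Actual vs Predicted")
lines(x_axis, pred_medv, col = "blue", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"),
       col = c("red", "blue"), pch = c(19, NA), lty = c(NA, 1))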
Conclusion
In this tutorial, we learned how to perform linear regression in R using the lm()
function. We split the data into training and testing sets, trained the
model, evaluated its performance using evaluation metrics, and
visualized the results. Linear regression is a powerful tool for
modeling relationships between variables and making predictions based on
observed data. The full source code is listed below.
Source code listing
# Load libraries
library(caret)
# Load the Boston dataset
boston <- MASS::Boston
# Set seed for reproducibility
set.seed(123)
# Split the data into training and testing sets
indexes <- createDataPartition(boston$medv, p = 0.85, list = FALSE)
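The remaining steps from the tutorial body complete the listing; the accuracy and plotting code mirrors the sketches above and is one possible implementation.
train <- boston[indexes, ]
test <- boston[-indexes, ]

# Fit the linear regression model
model <- lm(medv ~ ., data = train)
summary(model)

# Make predictions on the test set
pred_medv <- predict(model, newdata = test[, -14])

# Accuracy metrics
mse  <- mean((test$medv - pred_medv)^2)
mae  <- mean(abs(test$medv - pred_medv))
rmse <- sqrt(mse)
r2   <- 1 - sum((test$medv - pred_medv)^2) / sum((test$medv - mean(test$medv))^2)
cat("MSE:", mse, " MAE:", mae, " RMSE:", rmse, " R-squared:", r2, "\n")

# Plot actual (red points) vs predicted (blue line) values
x_axis <- 1:length(pred_medv)
plot(x_axis, test$medv, col = "red", pch = 19, xlab = "Observation", ylab = "medv")
lines(x_axis, pred_medv, col = "blue", lwd = 2)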