Feature selection, a technique in feature engineering, plays a key role in building effective machine learning models. Lasso regression, short for Least Absolute Shrinkage and Selection Operator, is a useful tool for selecting important features. It helps reduce model complexity, prevent overfitting, and make the model easier to understand.
In this tutorial, we'll go through the steps for using Lasso regression to perform feature selection. This tutorial will cover:
- Brief Explanation of Lasso
- Preparing the data
- Training a Baseline Linear Regression Model
- Applying Lasso for Feature Selection
- Evaluating a Model Using Selected Features
- Conclusion
- Full source code listing
Let's get started.
Brief Explanation of Lasso
Lasso is a popular technique for feature selection and regularization, especially useful for linear models. It reduces model complexity and improves interpretability by penalizing the model's coefficients, automatically shrinking the coefficients of irrelevant features to exactly zero. This makes it a great choice for high-dimensional datasets, as it helps focus only on the most meaningful features.
How Lasso Works
Lasso regression adds a penalty to the linear regression objective function. This penalty is the sum of the absolute values of the model’s coefficients, which pushes some coefficients to zero if their associated features aren’t informative. The objective function for Lasso is:

$$\min_{\beta} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert$$

where:
- $y_i$ is the true target value for sample $i$,
- $\hat{y}_i$ is the predicted value,
- $\beta_j$ represents the coefficient for each feature $j$,
- $\lambda$ is the regularization parameter controlling the penalty strength.
As $\lambda$ increases, Lasso forces more coefficients toward zero, selecting only the most important features. This way, Lasso combines regularization with feature selection, helping simplify models and improve interpretability.
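To see this behavior concretely, here is a minimal sketch (on a synthetic dataset from make_regression; the alpha values are arbitrary illustration choices, not recommendations) that counts how many coefficients survive as the penalty grows:
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
# Synthetic data: 20 features, only 5 of which are truly informative
X_demo, y_demo = make_regression(n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo) # Lasso assumes comparable feature scales
for alpha in [0.01, 0.1, 1.0, 10.0]: # stronger penalty -> fewer surviving features
    n_nonzero = np.sum(Lasso(alpha=alpha).fit(X_demo, y_demo).coef_ != 0)
    print(f"alpha={alpha}: {n_nonzero} non-zero coefficients")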
Preparing the Data
We'll begin by loading the necessary libraries for this tutorial. Here, we’ll use classes and functions from the Scikit-learn library to implement Lasso for feature selection.
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
We'll use the California housing dataset, loaded via fetch_california_housing from sklearn.datasets. It contains features describing California housing districts, and the target variable is the median house value.
After loading the dataset, split it into features (X) and target (y), then divide it into training and testing sets. We’ll use an 80-20 split, where 80% of the data goes into training and 20% into testing.
# Load the California housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names) # Features (predictors)
y = pd.Series(data.target) # Target variable (house prices)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Lasso regression is sensitive to feature scales because the penalty is applied uniformly to all coefficients, so we standardize the features to mean 0 and variance 1. This step ensures that each feature is penalized on a comparable scale.
# Standardize features for improved model performance (mean=0, variance=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform on train set
X_test_scaled = scaler.transform(X_test) # Transform test set based on train fit
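As a side note, the scaler and the model can also be chained with scikit-learn's Pipeline, which guarantees the scaler is fit on the training data only. A minimal sketch of this alternative (not used in the rest of the tutorial):
from sklearn.pipeline import make_pipeline
# Scaling and Lasso fitting in one estimator; fit() scales using the training data only
pipe = make_pipeline(StandardScaler(), Lasso(alpha=0.05))
pipe.fit(X_train, y_train)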
Training a Baseline Linear Regression Model
As a baseline, we first train a simple linear regression model on the scaled data, evaluating its performance on the test set. This will serve as a comparison for the Lasso model.
# Train a simple Linear Regression model to establish a baseline performance
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train) # Fit model on scaled training data
linear_score = linear_model.score(X_test_scaled, y_test) # Evaluate on test set
print(f'Linear Regression Model Score: {linear_score:.4f}') # Print baseline model score
The model score (R-squared) value is:
Linear Regression Model Score: 0.5758
Applying Lasso for Feature Selection
Now, we use Lasso with a regularization parameter alpha=0.05. Increasing alpha strengthens the penalty and pushes more coefficients toward zero, effectively removing some features. Here, alpha=0.05 provides a moderate penalty that helps in selecting features without being too restrictive.
# Apply Lasso regression for feature selection with an alpha (regularization parameter)
lasso = Lasso(alpha=0.05) # Increase alpha to impose more feature selection
lasso.fit(X_train_scaled, y_train) # Fit Lasso on the scaled training data
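The value alpha=0.05 above is a manual choice. If you prefer to let the data choose the penalty strength, scikit-learn's LassoCV searches over a range of alphas with cross-validation. A minimal sketch (the cv=5 setting is an arbitrary illustration choice):
from sklearn.linear_model import LassoCV
lasso_cv = LassoCV(cv=5, random_state=42) # cross-validated search over alpha values
lasso_cv.fit(X_train_scaled, y_train)
print("Best alpha found:", lasso_cv.alpha_)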
Once the Lasso model is fitted, we can examine the feature coefficients. Non-zero coefficients represent selected features that contribute to the prediction, while zero-valued coefficients correspond to discarded features.
# Extract feature importance (coefficients of the Lasso model)
importance = lasso.coef_
feature_importance = pd.Series(importance, index=X.columns) # Create series for readability
print("Feature Importance:\n", feature_importance)
The result looks like this:
Feature Importance:
MedInc 0.741977
HouseAge 0.139559
AveRooms -0.000000
AveBedrms 0.000000
Population 0.000000
AveOccup -0.000000
Latitude -0.259219
Longitude -0.216379
dtype: float64
We can now keep only the features with non-zero coefficients to create a new dataset with the selected features. This reduces the data's dimensionality, making the model simpler and potentially improving its ability to generalize.
# Select only the important features with non-zero coefficients
important_features = feature_importance[feature_importance != 0].index.tolist()
print("Selected Features:", important_features)
Selected Features: ['MedInc', 'HouseAge', 'Latitude', 'Longitude']
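For reference, scikit-learn's SelectFromModel wraps this same idea: given the fitted Lasso, it keeps the features whose coefficients are effectively non-zero. A minimal sketch of this equivalent approach:
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(lasso, prefit=True) # reuse the Lasso fitted above
mask = selector.get_support() # boolean mask of kept features
print("Selected via SelectFromModel:", X.columns[mask].tolist())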
Evaluating a Model Using Selected Features
Using the selected features, we subset our training and testing data, then train a new linear regression model on this reduced set. Note that we use the unscaled features here; ordinary least squares predictions are unaffected by linear rescaling of the inputs, so the R-squared score remains comparable to the baseline. Finally, we compare the new model's performance with the baseline model.
# Subset the original training and testing data based on selected features
X_train_selected = X_train[important_features]
X_test_selected = X_test[important_features]
# Train a Linear Regression model on the selected features
linear_model_selected = LinearRegression()
linear_model_selected.fit(X_train_selected, y_train) # Fit model on reduced feature set
linear_score_selected = linear_model_selected.score(X_test_selected, y_test) # Evaluate on test set
print(f'Linear Regression Model Score with Selected Features: {linear_score_selected:.4f}')
Linear Regression Model Score with Selected Features: 0.5811
Conclusion
In this tutorial, we used Lasso regression to select the most relevant features in a dataset. We learned how Lasso identifies important features by driving the coefficients of less relevant ones to zero, and we compared the resulting model's performance with a baseline linear regression model.
By reducing the number of features, Lasso can make models simpler and easier to understand, which can also improve generalization and performance. This is especially helpful with large datasets that have many irrelevant or redundant features.
Full source code listing
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
# Load the California housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names) # Features (predictors)
y = pd.Series(data.target) # Target variable (house prices)
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Standardize features for improved model performance (mean=0, variance=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform on train set
X_test_scaled = scaler.transform(X_test) # Transform test set based on train fit
# Train a simple Linear Regression model to establish a baseline performance
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train) # Fit model on scaled training data
linear_score = linear_model.score(X_test_scaled, y_test) # Evaluate on test set
print(f'Linear Regression Model Score: {linear_score:.4f}') # Print baseline model score
# Apply Lasso regression for feature selection with an alpha (regularization parameter)
lasso = Lasso(alpha=0.05) # Increase alpha to impose more feature selection
lasso.fit(X_train_scaled, y_train) # Fit Lasso on the scaled training data
# Extract feature importance (coefficients of the Lasso model)
importance = lasso.coef_
feature_importance = pd.Series(importance, index=X.columns) # Create series for readability
#print("Feature Importance:\n", feature_importance)
# Select only the important features with non-zero coefficients
important_features = feature_importance[feature_importance != 0].index.tolist()
print("Selected Features:", important_features)
# Subset the original training and testing data based on selected features
X_train_selected = X_train[important_features]
X_test_selected = X_test[important_features]
# Train a Linear Regression model on the selected features
linear_model_selected = LinearRegression()
linear_model_selected.fit(X_train_selected, y_train) # Fit model on reduced feature set
linear_score_selected = linear_model_selected.score(X_test_selected, y_test) # Evaluate on test set
print(f'Linear Regression Model Score with Selected Features: {linear_score_selected:.4f}')