Feature Selection with Lasso Regression

        Feature selection, a key technique in feature engineering, plays an important role in building effective machine learning models. Lasso regression, short for Least Absolute Shrinkage and Selection Operator, is a useful tool for selecting important features. It helps reduce model complexity, prevent overfitting, and make the model easier to understand.
    In this tutorial, we'll go through the steps for using Lasso regression to perform feature selection. The tutorial covers:

  1. Brief Explanation of Lasso
  2. Preparing the data  
  3. Training a Baseline Linear Regression Model 
  4. Applying Lasso for Feature Selection
  5. Evaluating a Model Using Selected Features
  6. Conclusion 
  7. Full source code listing

     Let's get started.

Brief Explanation of Lasso

        Lasso, short for Least Absolute Shrinkage and Selection Operator, is a popular technique for feature selection and regularization, especially useful for linear models. It reduces model complexity and improves interpretability by penalizing the magnitude of the coefficients, automatically setting the coefficients of irrelevant features to zero. This makes it a great choice for high-dimensional datasets, as it helps the model focus on only the most meaningful features.

How Lasso Works

    Lasso regression adds a penalty to the linear regression objective function. This penalty is the sum of the absolute values of the model’s coefficients, which pushes some coefficients to zero if their associated features aren’t informative. The objective function for Lasso is:

\text{minimize} \quad \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

where:

  • y_i is the true target value for sample i,
  • \hat{y}_i is the predicted value,
  • \beta_j is the coefficient for feature j,
  • \lambda is the regularization parameter controlling the penalty strength.

As \lambda increases, Lasso forces more coefficients toward zero, selecting only the most important features. This way, Lasso combines regularization with feature selection, helping simplify models and improve interpretability.
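
    To see this effect concretely, here is a minimal sketch (on synthetic data from scikit-learn's make_regression, so the exact counts will vary) that fits Lasso at several penalty strengths and counts how many coefficients are driven to zero. Note that in scikit-learn's Lasso class, the \lambda above corresponds to the alpha parameter.

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 5 are actually informative
X_demo, y_demo = make_regression(n_samples=200, n_features=20,
                                 n_informative=5, noise=10.0, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# As alpha grows, more coefficients become exactly zero
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    n_zero = (model.coef_ == 0).sum()
    print(f"alpha={alpha:>5}: {n_zero} of {len(model.coef_)} coefficients are zero")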

 

Preparing the Data

    We'll begin by loading the necessary libraries for this tutorial. Here, we’ll use classes and functions from the Scikit-learn library to implement Lasso for feature selection.

 
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler
 

   We'll use the California housing dataset, loaded with fetch_california_housing from sklearn.datasets. It contains various housing-related features for districts in California, and the target variable is the median house value.

     After loading the dataset, split it into features (X) and target (y), then divide it into training and testing sets. We’ll use an 80-20 split, where 80% of the data goes into training and 20% into testing.

 
# Load the California housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names) # Features (predictors)
y = pd.Series(data.target) # Target variable (house prices)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    Lasso regression is sensitive to feature scales, so we standardize the features to mean 0 and variance 1. This step ensures that the penalty treats all features on a comparable scale, rather than being dominated by features with large numeric ranges.


# Standardize features for improved model performance (mean=0, variance=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform on train set
X_test_scaled = scaler.transform(X_test) # Transform test set based on train fit
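
    As a quick aside (not part of the main pipeline), the sketch below illustrates this scale sensitivity: the same penalty applied to raw versus standardized inputs can select a different set of features, because the penalty acts on the raw coefficient magnitudes. The fit on raw inputs may also emit a convergence warning.

# Illustration only: the same alpha can select different features on raw
# versus standardized inputs
lasso_raw = Lasso(alpha=0.05).fit(X_train, y_train)        # raw inputs
lasso_std = Lasso(alpha=0.05).fit(X_train_scaled, y_train) # standardized inputs
print("Non-zero coefficients (raw):   ", (lasso_raw.coef_ != 0).sum())
print("Non-zero coefficients (scaled):", (lasso_std.coef_ != 0).sum())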


Training a Baseline Linear Regression Model

    As a baseline, we first train a simple linear regression model on the scaled data and evaluate its performance on the test set. This will serve as a point of comparison for the model built on Lasso-selected features.

 
# Train a simple Linear Regression model to establish a baseline performance
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train) # Fit model on scaled training data
linear_score = linear_model.score(X_test_scaled, y_test) # Evaluate on test set
print(f'Linear Regression Model Score: {linear_score:.4f}') # Print baseline model score

The model's score (R-squared) on the test set is:

 
Linear Regression Model Score: 0.5758
 


Applying Lasso for Feature Selection

    Now, we use Lasso with a regularization parameter alpha=0.05. Increasing alpha strengthens the penalty and pushes more coefficients toward zero, effectively removing some features. Here, alpha=0.05 provides a moderate penalty that selects features without being too restrictive.

 
# Apply Lasso regression for feature selection with an alpha (regularization parameter)
lasso = Lasso(alpha=0.05) # Increase alpha to impose more feature selection
lasso.fit(X_train_scaled, y_train) # Fit Lasso on the scaled training data
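
    The value alpha=0.05 here is a manual choice. In practice, alpha is usually tuned with cross-validation; below is a minimal sketch using scikit-learn's LassoCV (the chosen value will depend on the candidate grid and the folds, so treat it as illustrative).

from sklearn.linear_model import LassoCV

# Search a small grid of alphas with 5-fold cross-validation on training data
lasso_cv = LassoCV(alphas=[0.001, 0.01, 0.05, 0.1, 0.5], cv=5)
lasso_cv.fit(X_train_scaled, y_train)
print("Best alpha:", lasso_cv.alpha_)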
 

    Once the Lasso model is fitted, we can examine the feature coefficients. Non-zero coefficients represent selected features that contribute to the prediction, while zero-valued coefficients correspond to discarded features.


# Extract feature importance (coefficients of the Lasso model)
importance = lasso.coef_
feature_importance = pd.Series(importance, index=X.columns) # Create series for readability
print("Feature Importance:\n", feature_importance)
 

The result looks like this:

 
Feature Importance:
MedInc        0.741977
HouseAge      0.139559
AveRooms     -0.000000
AveBedrms     0.000000
Population    0.000000
AveOccup     -0.000000
Latitude     -0.259219
Longitude    -0.216379
dtype: float64

    We can now keep only the features with non-zero coefficients to create a new dataset with the selected features. This reduces the data's dimensionality, making the model simpler and potentially improving its ability to generalize.

 
# Select only the important features with non-zero coefficients
important_features = feature_importance[feature_importance != 0].index.tolist()
print("Selected Features:", important_features) 


 
Selected Features: ['MedInc', 'HouseAge', 'Latitude', 'Longitude'] 
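
    Equivalently, scikit-learn's SelectFromModel utility can wrap the fitted Lasso and perform the same filtering. A minimal sketch (the explicit near-zero threshold is our assumption, chosen so that only exactly-zero coefficients are dropped):

from sklearn.feature_selection import SelectFromModel

# Wrap the already-fitted Lasso; keep features whose absolute coefficient
# is at least the (near-zero) threshold, i.e. the non-zero coefficients
selector = SelectFromModel(lasso, prefit=True, threshold=1e-10)
mask = selector.get_support() # Boolean mask over the original feature columns
print("Selected Features:", X.columns[mask].tolist())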
 

 

Evaluating a Model Using Selected Features

    Using the selected features, we subset our (unscaled) training and testing data, then train a new linear regression model on this reduced set. Standardization is not needed here, since plain linear regression's R-squared is unaffected by feature scaling. Finally, we compare its performance with the baseline model.


# Subset the original training and testing data based on selected features
X_train_selected = X_train[important_features]
X_test_selected = X_test[important_features]

# Train a Linear Regression model on the selected features
linear_model_selected = LinearRegression()
linear_model_selected.fit(X_train_selected, y_train) # Fit model on reduced feature set
linear_score_selected = linear_model_selected.score(X_test_selected, y_test) # Evaluate on test set
print(f'Linear Regression Model Score with Selected Features: {linear_score_selected:.4f}')



 
 Linear Regression Model Score with Selected Features: 0.5811
 
    With only four of the original eight features, the reduced model slightly outperforms the baseline (0.5811 vs. 0.5758), suggesting that the discarded features added little predictive value.

   

Conclusion
 
    In this tutorial, we used Lasso regression to select the most relevant features in a dataset. We saw how Lasso identifies important features by setting the coefficients of less relevant ones to zero, and we compared the resulting model's performance with a baseline linear regression model.
    By reducing the number of features, Lasso makes models simpler and easier to understand, which can also improve generalization and performance. This is especially helpful with large datasets that have many irrelevant or redundant features.
 
 
 Full source code listing

 
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Load the California housing dataset
data = fetch_california_housing()
X = pd.DataFrame(data.data, columns=data.feature_names) # Features (predictors)
y = pd.Series(data.target) # Target variable (house prices)

# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features for improved model performance (mean=0, variance=1)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit and transform on train set
X_test_scaled = scaler.transform(X_test) # Transform test set based on train fit

# Train a simple Linear Regression model to establish a baseline performance
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train) # Fit model on scaled training data
linear_score = linear_model.score(X_test_scaled, y_test) # Evaluate on test set
print(f'Linear Regression Model Score: {linear_score:.4f}') # Print baseline model score

# Apply Lasso regression for feature selection with an alpha (regularization parameter)
lasso = Lasso(alpha=0.05) # Increase alpha to impose more feature selection
lasso.fit(X_train_scaled, y_train) # Fit Lasso on the scaled training data

# Extract feature importance (coefficients of the Lasso model)
importance = lasso.coef_
feature_importance = pd.Series(importance, index=X.columns) # Create series for readability
print("Feature Importance:\n", feature_importance)

# Select only the important features with non-zero coefficients
important_features = feature_importance[feature_importance != 0].index.tolist()
print("Selected Features:", important_features)

# Subset the original training and testing data based on selected features
X_train_selected = X_train[important_features]
X_test_selected = X_test[important_features]

# Train a Linear Regression model on the selected features
linear_model_selected = LinearRegression()
linear_model_selected.fit(X_train_selected, y_train) # Fit model on reduced feature set
linear_score_selected = linear_model_selected.score(X_test_selected, y_test) # Evaluate on test set
print(f'Linear Regression Model Score with Selected Features: {linear_score_selected:.4f}')




