The scikit-learn API provides the SelectKBest class for extracting the best features of a given dataset. SelectKBest selects the features with the k highest scores, and by changing the 'score_func' parameter we can apply it to both classification and regression data. Selecting the best features is an important step when preparing a large dataset for training: it eliminates the less informative parts of the data and reduces training time.
In this tutorial, we'll briefly learn how to select the best features of classification and regression data by using SelectKBest in Python. The tutorial covers:
- SelectKBest for classification data
- SelectKBest for regression data
- Source code listing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array
First, we'll apply the SelectKBest model to classification data, the Iris dataset. We'll load the dataset and check the feature data dimension. The 'data' attribute of the iris object holds the feature data.
iris = load_iris()
x = iris.data
y = iris.target
print("Feature data dimension: ", x.shape)
Feature data dimension: (150, 4)
Next, we'll define the model by using the SelectKBest class. For classification, we'll set the 'chi2' method as the scoring function, and the target number of features is defined by the k parameter. Then we'll fit and transform the model on the training x and y data.
select = SelectKBest(score_func=chi2, k=3)
z = select.fit_transform(x, y)
print("After selecting best 3 features:", z.shape)
After selecting best 3 features: (150, 3)
We've selected the 3 best features in the x data. To identify the selected features we use the get_support() function and filter them out of the feature name list. The z object contains the selected x data.
filter = select.get_support()
features = array(iris.feature_names)
print("All features:")
print(features)
print("Selected best 3:")
print(features[filter])
print(z)
All features:
['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)' 'petal width (cm)']
Selected best 3:
['sepal length (cm)' 'petal length (cm)' 'petal width (cm)']
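To see why these three features were chosen, we can inspect the per-feature scores that SelectKBest computed during fitting. A minimal sketch (the scores_ and pvalues_ attributes are standard SelectKBest attributes; the exact score values depend on the scikit-learn version):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
select = SelectKBest(score_func=chi2, k=3).fit(iris.data, iris.target)

# chi2 score per feature; a higher score means a stronger
# dependence between the feature and the target classes
for name, score in zip(iris.feature_names, select.scores_):
    print(f"{name}: {score:.2f}")
```

The feature with the lowest chi2 score, 'sepal width (cm)', is the one that was dropped above.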
SelectKBest for regression data
We apply the same method to regression data, changing only the scoring function. We'll load the Boston housing dataset and check the feature data dimensions. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this part requires an older version.)
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
Feature data dimension: (506, 13)
Next, we'll define the model by using the SelectKBest class. For regression, we'll set the 'f_regression' method as the scoring function. The target number of features to select is 8. We'll fit and transform the model on the training x and y data.
select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y)
print("After selecting best 8 features:", z.shape)
After selecting best 8 features: (506, 8)
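On scikit-learn 1.2 or newer, where load_boston is no longer available, the same steps can be sketched with a synthetic dataset (make_regression here is a stand-in, not the original Boston data):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the Boston data: 506 samples, 13 features,
# of which only 8 actually influence the target
x, y = make_regression(n_samples=506, n_features=13, n_informative=8,
                       noise=10.0, random_state=42)

select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y)
print("After selecting best 8 features:", z.shape)  # (506, 8)
```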
To identify the selected features, we can use the get_support() function and filter them out of the feature list. The z object contains the selected x data.
filter = select.get_support()
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected best 8:")
print(features[filter])
print(z)
All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected best 8:
['CRIM' 'INDUS' 'NOX' 'RM' 'RAD' 'TAX' 'PTRATIO' 'LSTAT']
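The choice of k=8 here is arbitrary. One common way to pick k is to treat it as a hyperparameter and cross-validate it inside a pipeline. A sketch of that idea, using a synthetic dataset and a Ridge regressor as the downstream model (both are assumptions, not part of the original tutorial):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with 13 features
x, y = make_regression(n_samples=506, n_features=13, n_informative=8,
                       noise=10.0, random_state=42)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),
])

# Try every possible k and keep the one with the best cross-validated score
grid = GridSearchCV(pipe, {"select__k": list(range(1, 14))}, cv=5)
grid.fit(x, y)
print("Best k:", grid.best_params_["select__k"])
```

This avoids guessing: the selected k is whichever value generalizes best for the model that will consume the features.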
In this tutorial, we've briefly learned how to select the k best features of classification and regression data by using the SelectKBest model in Python. The full source code is listed below.

Source code listing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array
iris = load_iris()
x = iris.data
y = iris.target
print("Feature data dimension: ", x.shape)
select = SelectKBest(score_func=chi2, k=3)
z = select.fit_transform(x, y)
print("After selecting best 3 features:", z.shape)
filter = select.get_support()
features = array(iris.feature_names)
print("All features:")
print(features)
print("Selected best 3:")
print(features[filter])
print(z)
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y)
print("After selecting best 8 features:", z.shape)
filter = select.get_support()
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected best 8:")
print(features[filter])
print(z)