The scikit-learn API provides the SelectKBest class for extracting the best features of a given dataset. SelectKBest selects the features with the k highest scores, and by changing the 'score_func' parameter we can apply it to both classification and regression data. Selecting the best features is an important step when preparing a large dataset for training: it eliminates the less informative parts of the data and reduces training time.
In this tutorial, we'll briefly learn how to select the best features of classification and regression data by using SelectKBest in Python. The tutorial covers:
- SelectKBest for classification data
- SelectKBest for regression data
- Source code listing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array
First, we'll apply the SelectKBest model to classification data, the Iris dataset. We'll load the dataset and check the feature data dimension. The 'data' attribute of the iris object holds the feature data.
iris = load_iris()
x = iris.data
y = iris.target
print("Feature data dimension: ", x.shape)
Feature data dimension: (150, 4)
Next, we'll define the model by using the SelectKBest class. For classification, we'll set the 'chi2' method as the scoring function, and the target number of features is defined by the k parameter. Then we'll fit and transform the model on the training x and y data.
select = SelectKBest(score_func=chi2, k=3)
z = select.fit_transform(x, y)
print("After selecting best 3 features:", z.shape)
After selecting best 3 features: (150, 3)
We've selected the 3 best features in the x data. To identify the selected features we use the get_support() function and filter them out of the feature name list. The z object contains the selected x data.
filter = select.get_support()
features = array(iris.feature_names)
print("All features:")
print(features)
print("Selected best 3:")
print(features[filter])
print(z)
All features:
['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)' 'petal width (cm)']
Selected best 3:
['sepal length (cm)' 'petal length (cm)' 'petal width (cm)']
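To see why these three features were chosen, we can inspect the per-feature scores that SelectKBest computed during fitting. A minimal sketch (the scores_ and pvalues_ attributes are standard SelectKBest attributes; the exact score values depend on the scikit-learn version):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
select = SelectKBest(score_func=chi2, k=3).fit(iris.data, iris.target)

# chi2 score per feature; a higher score means a stronger
# dependence between the feature and the target classes
for name, score in zip(iris.feature_names, select.scores_):
    print(f"{name}: {score:.2f}")
```

The feature with the lowest chi2 score, 'sepal width (cm)', is the one that was dropped above.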
SelectKBest for regression data
We apply the same method to regression data, changing only the scoring function. We'll load the Boston housing dataset and check the feature data dimensions. (Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this part requires an older version.)
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
Feature data dimension: (506, 13)
Next, we'll define the model by using the SelectKBest class. For regression, we'll set the 'f_regression' method as the scoring function. The target number of features to select is 8. We'll fit and transform the model on the training x and y data.
select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y)
print("After selecting best 8 features:", z.shape)
After selecting best 8 features: (506, 8)
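On scikit-learn 1.2 or newer, where load_boston is no longer available, the same steps can be sketched with a synthetic dataset (make_regression here is a stand-in, not the original Boston data):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the Boston data: 506 samples, 13 features,
# of which only 8 actually influence the target
x, y = make_regression(n_samples=506, n_features=13, n_informative=8,
                       noise=10.0, random_state=42)

select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y)
print("After selecting best 8 features:", z.shape)  # (506, 8)
```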
To identify the selected features, we can use the get_support() function and filter them out of the feature list. The z object contains the selected x data.
filter = select.get_support()
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected best 8:")
print(features[filter])
print(z)
All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected best 8:
['CRIM' 'INDUS' 'NOX' 'RM' 'RAD' 'TAX' 'PTRATIO' 'LSTAT']
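The choice of k=8 here is arbitrary. One common way to pick k is to treat it as a hyperparameter and cross-validate it inside a pipeline. A sketch of that idea, using a synthetic dataset and a Ridge regressor as the downstream model (both are assumptions, not part of the original tutorial):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic regression data with 13 features
x, y = make_regression(n_samples=506, n_features=13, n_informative=8,
                       noise=10.0, random_state=42)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_regression)),
    ("model", Ridge()),
])

# Try every possible k and keep the one with the best cross-validated score
grid = GridSearchCV(pipe, {"select__k": list(range(1, 14))}, cv=5)
grid.fit(x, y)
print("Best k:", grid.best_params_["select__k"])
```

This avoids guessing: the selected k is whichever value generalizes best for the model that will consume the features.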
In this tutorial, we've briefly learned how to select the k best features of classification and regression data by using the SelectKBest model in Python. The full source code is listed below.

Source code listing
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, f_regression
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array
iris = load_iris()
x = iris.data
y = iris.target
print("Feature data dimension: ", x.shape)
select = SelectKBest(score_func=chi2, k=3)
z = select.fit_transform(x, y)
print("After selecting best 3 features:", z.shape)
filter = select.get_support()
features = array(iris.feature_names)
print("All features:")
print(features)
print("Selected best 3:")
print(features[filter])
print(z)
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
select = SelectKBest(score_func=f_regression, k=8)
z = select.fit_transform(x, y)
print("After selecting best 8 features:", z.shape)
filter = select.get_support()
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected best 8:")
print(features[filter])
print(z)