Scikit-learn provides the SelectFromModel class for selecting the best features of a given dataset according to the importance of feature weights. SelectFromModel is a meta-estimator that selects features by comparing their importance values against a given threshold. In this tutorial, we'll briefly learn how to select the best features of regression data by using SelectFromModel in Python. The tutorial covers:
- SelectFromModel for regression data
- Source code listing
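As a minimal sketch of this thresholding behavior, consider a linear model on synthetic data (this setup, including the threshold value, is illustrative and not part of the tutorial's dataset): features whose absolute coefficient meets the threshold are kept, the rest are dropped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
# The target depends strongly on columns 0 and 2 only
y = 5 * X[:, 0] + 4 * X[:, 2] + 0.01 * rng.rand(100)

# Keep features whose absolute coefficient is at least 1.0
selector = SelectFromModel(LinearRegression(), threshold=1.0)
selector.fit(X, y)
print(selector.get_support())  # columns 0 and 2 should be selected
```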
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import load_boston
from numpy import array
SelectFromModel for regression data
We use the Boston housing price dataset in this tutorial. We'll load the dataset and check the dimensions of the feature data.
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
Feature data dimension: (506, 13)
SelectFromModel requires an estimator, and we can use the AdaBoostRegressor class for this purpose. The estimator must expose a measure of feature importance after fitting, such as a feature_importances_ or coef_ attribute; the selector itself then reports the chosen features through its get_support() method. We'll define the selector with the default threshold, which uses the mean of the feature importances, and fit it on the x and y data.
estimator = AdaBoostRegressor(random_state=0, n_estimators=50)
selector = SelectFromModel(estimator)
selector = selector.fit(x, y)
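The threshold can also be set explicitly via the threshold parameter, either as a string like "median" or as a float cutoff. A sketch on synthetic data (the make_regression parameters here are illustrative assumptions, used so the snippet runs without the Boston dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the tutorial's x, y data
x, y = make_regression(n_samples=200, n_features=13,
                       n_informative=3, random_state=0)

est = AdaBoostRegressor(random_state=0, n_estimators=50)

# Default: threshold is the mean of feature_importances_
sel_mean = SelectFromModel(est).fit(x, y)
# "median" keeps features whose importance is at least the
# median importance, i.e. roughly the top half
sel_median = SelectFromModel(est, threshold="median").fit(x, y)

print("mean threshold keeps:  ", sel_mean.get_support().sum())
print("median threshold keeps:", sel_median.get_support().sum())
```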
After training, we can check the selection status of each feature. To identify the selected features, we use the get_support() method and filter the feature list with it. Finally, we get the selected feature names and extract the corresponding data from x.
status = selector.get_support()
print("Selection status: ", status)
Selection status: [False False False False False True False True False False False False
True]
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected features:")
print(features[status])
selector.transform(x)
All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected features:
['RM' 'DIS' 'LSTAT']
array([[6.575 , 4.09 , 4.98 ],
[6.421 , 4.9671, 9.14 ],
[7.185 , 4.9671, 4.03 ],
...,
[6.976 , 2.1675, 5.64 ],
[6.794 , 2.3889, 6.48 ],
[6.03 , 2.505 , 7.88 ]])
In this tutorial, we've briefly learned how to select the important features of a dataset by using the Scikit-learn SelectFromModel class in Python. The full source code is listed below.
Source code listing
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import load_boston
from numpy import array
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
estimator = AdaBoostRegressor(random_state=0, n_estimators=50)
selector = SelectFromModel(estimator)
selector = selector.fit(x, y)
status = selector.get_support()
print("Selection status: ", status)
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected features:")
print(features[status])
selector.transform(x)
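Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the listing above will not run on recent releases. A sketch of the same workflow on a synthetic dataset (the make_regression parameters are illustrative assumptions, sized to match the Boston data's shape):

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression

# Synthetic stand-in for the Boston housing data (506 rows, 13 features)
x, y = make_regression(n_samples=506, n_features=13,
                       n_informative=3, noise=1.0, random_state=0)
print("Feature data dimension: ", x.shape)

estimator = AdaBoostRegressor(random_state=0, n_estimators=50)
selector = SelectFromModel(estimator).fit(x, y)
print("Selection status: ", selector.get_support())

# transform() keeps only the selected feature columns
x_selected = selector.transform(x)
print("Selected data dimension: ", x_selected.shape)
```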