Scikit-learn provides the SelectFromModel class for selecting the best features of a given dataset according to the importance of feature weights. SelectFromModel is a meta-estimator that selects features by comparing their importance values against a given threshold. In this tutorial, we'll briefly learn how to select the best features of regression data by using SelectFromModel in Python. The tutorial covers:
- SelectFromModel for regression data
- Source code listing
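As a minimal sketch of this thresholding behavior, consider a linear model on synthetic data (this setup, including the threshold value, is illustrative and not part of the tutorial's dataset): features whose absolute coefficient meets the threshold are kept, the rest are dropped.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
# The target depends strongly on columns 0 and 2 only
y = 5 * X[:, 0] + 4 * X[:, 2] + 0.01 * rng.rand(100)

# Keep features whose absolute coefficient is at least 1.0
selector = SelectFromModel(LinearRegression(), threshold=1.0)
selector.fit(X, y)
print(selector.get_support())  # columns 0 and 2 should be selected
```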
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import load_boston
from numpy import array
SelectFromModel for regression data
We use the Boston housing price dataset in this tutorial. We'll load the dataset and check the dimensions of the feature data.
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
Feature data dimension: (506, 13)
SelectFromModel requires an estimator, and we can use the AdaBoostRegressor class for this purpose. The estimator must expose a measure of feature importance after fitting, such as a feature_importances_ or coef_ attribute; the selector itself then reports the chosen features through its get_support() method. We'll define the selector with the default threshold, which uses the mean of the feature importances, and fit it on the x and y data.
estimator = AdaBoostRegressor(random_state=0, n_estimators=50)
selector = SelectFromModel(estimator)
selector = selector.fit(x, y)
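The threshold can also be set explicitly via the threshold parameter, either as a string like "median" or as a float cutoff. A sketch on synthetic data (the make_regression parameters here are illustrative assumptions, used so the snippet runs without the Boston dataset):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the tutorial's x, y data
x, y = make_regression(n_samples=200, n_features=13,
                       n_informative=3, random_state=0)

est = AdaBoostRegressor(random_state=0, n_estimators=50)

# Default: threshold is the mean of feature_importances_
sel_mean = SelectFromModel(est).fit(x, y)
# "median" keeps features whose importance is at least the
# median importance, i.e. roughly the top half
sel_median = SelectFromModel(est, threshold="median").fit(x, y)

print("mean threshold keeps:  ", sel_mean.get_support().sum())
print("median threshold keeps:", sel_median.get_support().sum())
```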
After training, we can check the selection status of each feature. To identify the selected features, we use the get_support() method and filter the feature list with it. Finally, we get the selected feature names and extract the corresponding data from x.
status = selector.get_support()
print("Selection status: ", status)
Selection status: [False False False False False True False True False False False False
True]
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected features:")
print(features[status])
selector.transform(x)
All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected features:
['RM' 'DIS' 'LSTAT']
array([[6.575 , 4.09 , 4.98 ],
[6.421 , 4.9671, 9.14 ],
[7.185 , 4.9671, 4.03 ],
...,
[6.976 , 2.1675, 5.64 ],
[6.794 , 2.3889, 6.48 ],
[6.03 , 2.505 , 7.88 ]])
In this tutorial, we've briefly learned how to select the important features of a dataset by using the Scikit-learn SelectFromModel class in Python. The full source code is listed below.
Source code listing
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import load_boston
from numpy import array
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
estimator = AdaBoostRegressor(random_state=0, n_estimators=50)
selector = SelectFromModel(estimator)
selector = selector.fit(x, y)
status = selector.get_support()
print("Selection status: ", status)
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected features:")
print(features[status])
selector.transform(x)
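Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the listing above will not run on recent releases. A sketch of the same workflow on a synthetic dataset (the make_regression parameters are illustrative assumptions, sized to match the Boston data's shape):

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression

# Synthetic stand-in for the Boston housing data (506 rows, 13 features)
x, y = make_regression(n_samples=506, n_features=13,
                       n_informative=3, noise=1.0, random_state=0)
print("Feature data dimension: ", x.shape)

estimator = AdaBoostRegressor(random_state=0, n_estimators=50)
selector = SelectFromModel(estimator).fit(x, y)
print("Selection status: ", selector.get_support())

# transform() keeps only the selected feature columns
x_selected = selector.transform(x)
print("Selected data dimension: ", x_selected.shape)
```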