Extracting the influential features of a dataset is an essential part of data preparation for training a model in machine learning. The scikit-learn API provides the RFE class, which ranks features by recursive feature elimination to select the best ones. The method recursively eliminates the least important features based on a weight attribute (such as coef_ or feature_importances_) exposed by the estimator.
In this tutorial, we'll briefly learn how to select the best features of a dataset by using RFE in Python. The tutorial covers:
- RFE Example with Boston dataset
- Source code listing
from sklearn.feature_selection import RFE
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import load_boston
from numpy import array
We'll load the Boston housing price dataset and check the dimensions of the feature data. The 'data' attribute of the boston object contains the feature data.
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
Feature data dimension: (506, 13)
The feature data contains 506 rows and 13 columns; our goal is to reduce those columns to the best 8 by their influence rank.
Next, we'll define the model by using the RFE class. The class requires an estimator, and we can use the AdaBoostRegressor ensemble model for this purpose. The target number of features to select is defined by the n_features_to_select parameter, and step defines the number of features to remove in each iteration. We'll fit the model on the x and y training data.
estimator = AdaBoostRegressor(random_state=0, n_estimators=100)
selector = RFE(estimator, n_features_to_select=8, step=1)
selector = selector.fit(x, y)
After fitting, we can obtain the selected features and their ranking positions.
filter = selector.support_
ranking = selector.ranking_
print("Mask data: ", filter)
print("Ranking: ", ranking)
Mask data: [ True False False False True True False True True True True False
True]
Ranking: [1 5 3 6 1 1 4 1 1 1 1 2 1]
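In the ranking array, a value of 1 marks a selected feature; because step=1, larger values indicate features that were eliminated in earlier rounds. Pairing the printed ranking with the feature names makes the elimination order explicit:

```python
import numpy as np

# Ranking and feature names as printed above
ranking = np.array([1, 5, 3, 6, 1, 1, 4, 1, 1, 1, 1, 2, 1])
names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                  'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])

# Rank 1 means selected; a larger rank means the feature was
# eliminated in an earlier round of the recursion
for name, rank in sorted(zip(names, ranking), key=lambda pair: pair[1]):
    print(rank, name)
```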
To make the result readable, we'll use the mask to print the names of the selected features.
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected features:")
print(features[filter])
All features:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected features:
['CRIM' 'NOX' 'RM' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'LSTAT']
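The fitted selector can also shrink the feature matrix directly with its transform() method, which keeps only the selected columns. A minimal sketch on synthetic data (using make_regression here, since load_boston has been removed from recent scikit-learn releases):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in for the Boston data: 200 samples, 13 features
X, y = make_regression(n_samples=200, n_features=13, n_informative=8,
                       random_state=0)

selector = RFE(AdaBoostRegressor(random_state=0, n_estimators=50),
               n_features_to_select=8, step=1)
selector.fit(X, y)

# transform() drops the eliminated columns from the feature matrix
X_reduced = selector.transform(X)
print(X_reduced.shape)  # (200, 8)
```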
Source code listing
from sklearn.feature_selection import RFE
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import load_boston
from numpy import array
boston = load_boston()
x = boston.data
y = boston.target
print("Feature data dimension: ", x.shape)
estimator = AdaBoostRegressor(random_state=0, n_estimators=100)
selector = RFE(estimator, n_features_to_select=8, step=1)
selector = selector.fit(x, y)
filter = selector.support_
ranking = selector.ranking_
print("Mask data: ", filter)
print("Ranking: ", ranking)
features = array(boston.feature_names)
print("All features:")
print(features)
print("Selected features:")
print(features[filter])
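If the right number of features isn't known in advance, scikit-learn also provides the RFECV class, which chooses it by cross-validation instead of taking a fixed n_features_to_select. A brief sketch with a lighter linear estimator:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=150, n_features=10, n_informative=4,
                       noise=5, random_state=0)

# cv=5 scores each candidate feature count with 5-fold cross-validation
selector = RFECV(LinearRegression(), step=1, cv=5)
selector.fit(X, y)
print("Optimal number of features:", selector.n_features_)
```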