RFECV (Recursive Feature Elimination with Cross-Validation) performs recursive feature elimination within a cross-validation loop to find the optimal number of features. Scikit-learn provides the RFECV class to apply this method and identify the most important features in a given dataset.
Selecting optimal features is an important part of data preparation in machine learning. It helps us eliminate the less important parts of the data and reduces training time on large datasets.
In this tutorial, we'll briefly learn how to select the best features of classification and regression data by using RFECV in Python. The tutorial covers:
- RFECV for classification data
- RFECV for regression data
- Source code listing
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston   # note: removed in scikit-learn 1.2
from sklearn.datasets import load_iris
from numpy import array
RFECV for classification data
First, we'll apply RFECV to a classification dataset. We'll load the iris data and take the feature and label parts.
iris = load_iris()
x = iris.data
y = iris.target
RFECV requires an estimator model; here we can use the RandomForestClassifier class as the estimator. We'll define the RFECV and fit it on the training x and y data. The ranking_ property gives us the ranking position of each feature; optimal features are labeled with rank 1.
rfc = RandomForestClassifier()
select = RFECV(estimator=rfc, cv=10)
select = select.fit(x,y)
print("Feature ranking: ", select.ranking_)
Feature ranking: [2 3 1 1]
Next, we'll extract the selected features. The get_support() function helps us get those feature names.
mask = select.get_support()
features = array(iris.feature_names)
best_features = features[mask]
print("All features: ", x.shape[1])
print(features)
print("Selected best: ", best_features.shape[0])
print(features[mask])
All features: 4
['sepal length (cm)' 'sepal width (cm)' 'petal length (cm)'
'petal width (cm)']
Selected best: 2
['petal length (cm)' 'petal width (cm)']
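The fitted selector is also a standard scikit-learn transformer, so instead of masking the feature-name array we can reduce the data matrix itself with transform(). A short sketch, refitting on the iris data (the random_state is our addition):

```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
x, y = iris.data, iris.target

select = RFECV(RandomForestClassifier(random_state=0), cv=10).fit(x, y)

# transform() keeps only the selected columns
x_reduced = select.transform(x)
print(x.shape, "->", x_reduced.shape)
```

The reduced matrix can be fed directly into a downstream model or pipeline.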
RFECV for regression data
We apply the same method to regression data. We'll load the Boston housing dataset and take the feature and label parts of the data.
boston = load_boston()
x = boston.data
y = boston.target
Next, we'll define the estimator model and pass it to the RFECV class. Then we fit the model on the training x and y data. The ranking_ property gives us the ranking position of each feature; optimal features are labeled with rank 1.
rfr = RandomForestRegressor()
select = RFECV(rfr, step=1, cv=5)
select = select.fit(x, y)
print("Feature ranking: ", select.ranking_)
Feature ranking: [1 2 1 3 1 1 1 1 1 1 1 1 1]
Next, we'll extract the selected features. The get_support() function helps us get those feature names.
mask = select.get_support()
features = array(boston.feature_names)
best_features = features[mask]
print("All features: ", x.shape[1])
print(features)
print("Selected best: ", best_features.shape[0])
print(features[mask])
All features: 13
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
'B' 'LSTAT']
Selected best: 11
['CRIM' 'INDUS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
In this tutorial, we've briefly learned how to select the optimal features of classification and regression data by using the RFECV model in Python. The full source code is listed below.
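Since load_boston has been removed from recent scikit-learn releases (1.2 and later), the same regression workflow can be sketched with a synthetic dataset such as make_friedman1, in which only the first five of ten features drive the target; the n_estimators and random_state values below are our choices, not part of the original tutorial:

```python
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_friedman1

# 10 features; only the first 5 are informative
x, y = make_friedman1(n_samples=200, n_features=10, random_state=0)

rfr = RandomForestRegressor(n_estimators=50, random_state=0)
select = RFECV(rfr, step=1, cv=5).fit(x, y)

print("Feature ranking: ", select.ranking_)
print("Selected best: ", select.n_features_)
```

Ideally the selector ranks the informative features first, which makes the synthetic data a convenient sanity check for the method.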
Source code listing
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_boston
from sklearn.datasets import load_iris
from numpy import array

# RFECV for classification
iris = load_iris()
x = iris.data
y = iris.target

rfc = RandomForestClassifier()
select = RFECV(estimator=rfc, cv=10)
select = select.fit(x, y)
print("Feature ranking: ", select.ranking_)

mask = select.get_support()
features = array(iris.feature_names)
best_features = features[mask]
print("All features: ", x.shape[1])
print(features)
print("Selected best: ", best_features.shape[0])
print(features[mask])

# RFECV for regression
boston = load_boston()
x = boston.data
y = boston.target

rfr = RandomForestRegressor()
select = RFECV(rfr, step=1, cv=5)
select = select.fit(x, y)
print("Feature ranking: ", select.ranking_)

mask = select.get_support()
features = array(boston.feature_names)
best_features = features[mask]
print("All features: ", x.shape[1])
print(features)
print("Selected best: ", best_features.shape[0])
print(features[mask])