Extremely Randomized Trees (or Extra-Trees) is an ensemble learning method. The method builds many randomized decision trees on sub-samples of the dataset and combines their predictions by voting to improve the predictive accuracy of the classifier. Combining many trees in this way reduces the variance of the model. At each split, the method draws a random threshold for each candidate feature and picks the best of these thresholds as the splitting rule.
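To make the random-threshold idea concrete, here is a minimal sketch of how a single split could be chosen. It is a simplified illustration, not Scikit-learn's actual implementation: the helper names gini and random_split and the tiny demo arrays are made up for this example, and every feature is considered at the node rather than a random subset.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def random_split(x, y, rng):
    """Draw one random threshold per feature and keep the best one.

    Returns (feature index, threshold, weighted child impurity).
    Hypothetical helper for illustration only.
    """
    best = None
    for j in range(x.shape[1]):
        lo, hi = x[:, j].min(), x[:, j].max()
        if lo == hi:
            continue  # constant feature, nothing to split on
        t = rng.uniform(lo, hi)  # random threshold instead of an exhaustive search
        left = y[x[:, j] <= t]
        right = y[x[:, j] > t]
        # impurity of the two children, weighted by their sizes
        score = (left.size * gini(left) + right.size * gini(right)) / y.size
        if best is None or score < best[2]:
            best = (j, t, score)
    return best

rng = np.random.default_rng(0)
x_demo = np.array([[1.0, 5.0], [2.0, 6.0], [8.0, 5.5], [9.0, 6.5]])
y_demo = np.array([0, 0, 1, 1])

feature, threshold, impurity = random_split(x_demo, y_demo, rng)
print("feature:", feature, "threshold:", threshold, "impurity:", impurity)
```

An ordinary decision tree would search every possible threshold for each feature. Drawing the thresholds at random makes the individual trees less correlated with each other, which is where the variance reduction comes from.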
In this tutorial, we'll briefly learn how to classify data by using Scikit-learn's ExtraTreesClassifier class in Python. The tutorial covers:
- Preparing the data
- Training the model
- Predicting and accuracy check
- Source code listing
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
Preparing the data
In this tutorial, we'll use the Iris dataset as the data to classify. First, we'll load it and define the x (features) and y (labels) parts.
iris = load_iris()
x, y = iris.data, iris.target
Then, we'll split them into train and test parts. Here, we'll extract 15 percent of the dataset as test data.
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
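The split above is random, so the results change on every run. If you want a repeatable split that also keeps the class proportions similar in both parts, you can pass the random_state and stratify parameters of train_test_split. The sketch below is one way to do it; the seed value 42 is an arbitrary choice for this example, not something the rest of the tutorial depends on.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x, y = iris.data, iris.target

# random_state fixes the shuffle; stratify=y keeps the class balance
# roughly the same in the train and test parts
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.15, random_state=42, stratify=y
)

print("Train shape:", xtrain.shape)
print("Test shape:", xtest.shape)
print("Test samples per class:", np.bincount(ytest))
```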
Training the model
Next, we'll define the classifier by using the ExtraTreesClassifier class. We can set the number of trees with the n_estimators parameter; here I'll set it to 100.
clf = ExtraTreesClassifier(n_estimators=100)
print(clf)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None,
                     verbose=0, warm_start=False)
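Recent Scikit-learn versions print only the parameters that differ from their defaults, so print(clf) may show a much shorter line than the output above. If you want to see the full parameter set, the get_params method returns it as a dictionary. A small sketch:

```python
from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(n_estimators=100)

# get_params() returns every constructor parameter with its current value
params = clf.get_params()
for name in sorted(params):
    print(name, "=", params[name])
```

The exact list of parameters and some default values depend on the installed Scikit-learn version.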
Then, we'll fit the model on the train data and check its accuracy score on the same data.
clf.fit(xtrain, ytrain)
score = clf.score(xtrain, ytrain)
print("Score: ", score)

Score: 1.0
A score of 1.0 on the data the model was trained on is optimistic, so we can also apply cross-validation to the model and check the average accuracy over the folds.
cv_scores = cross_val_score(clf, xtrain, ytrain, cv=5)
print("CV average score: %.2f" % cv_scores.mean())

CV average score: 0.96
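The average hides how much the accuracy varies from fold to fold. The sketch below prints the individual fold scores and their standard deviation. To keep it self-contained it runs on the whole Iris dataset rather than on xtrain, and it fixes random_state=0 for the classifier; both are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

x, y = load_iris(return_X_y=True)

clf = ExtraTreesClassifier(n_estimators=100, random_state=0)

# cross_val_score returns one accuracy value per fold
cv_scores = cross_val_score(clf, x, y, cv=5)

print("Fold scores:", cv_scores)
print("CV average score: %.2f (std: %.2f)" % (cv_scores.mean(), cv_scores.std()))
```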
Predicting and accuracy check
Now, we can predict the test data by using the trained model. After the prediction, we'll check the accuracy level by using the confusion matrix function.
ypred = clf.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)

[[5 0 0]
 [0 4 0]
 [0 0 6]]
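In a confusion matrix, the rows are the true classes and the columns are the predicted classes, so the correct predictions sit on the diagonal. The accuracy is therefore the sum of the diagonal divided by the total number of samples. The sketch below checks this against Scikit-learn's accuracy_score function; the small ytest_demo and ypred_demo arrays are made up for illustration and are not results from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: one sample of class 1 is predicted as class 2
ytest_demo = np.array([0, 0, 1, 1, 2, 2, 2])
ypred_demo = np.array([0, 0, 1, 2, 2, 2, 2])

cm_demo = confusion_matrix(ytest_demo, ypred_demo)
print(cm_demo)

# Correct predictions are on the diagonal of the matrix
acc_from_cm = np.trace(cm_demo) / cm_demo.sum()
acc = accuracy_score(ytest_demo, ypred_demo)

print("Accuracy from confusion matrix: %.3f" % acc_from_cm)
print("Accuracy from accuracy_score:   %.3f" % acc)
```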
In this tutorial, we've briefly learned how to classify data by using Scikit-learn API's ExtraTreesClassifier class in Python. The full source code is listed below.
Source code listing
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix

iris = load_iris()
x, y = iris.data, iris.target

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

clf = ExtraTreesClassifier(n_estimators=100)
print(clf)

clf.fit(xtrain, ytrain)
score = clf.score(xtrain, ytrain)
print("Score: ", score)

cv_scores = cross_val_score(clf, xtrain, ytrain, cv=5)
print("CV average score: %.2f" % cv_scores.mean())

ypred = clf.predict(xtest)
cm = confusion_matrix(ytest, ypred)
print(cm)