DataTechNotes: Classification Example with an Extra-Trees Method in Python

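Extremely Randomized Trees (or Extra-Trees) is an ensemble learning method. It builds many randomized decision trees and combines their predictions to improve the predictive accuracy of the classifier: the original formulation uses majority voting, while Scikit-learn averages the trees' predicted class probabilities. Combining many trees in this way reduces the variance. In Scikit-learn, each tree is trained on the whole training set by default (bootstrap=False); sub-samples are drawn only when bootstrapping is enabled. At each node, the method draws a random threshold for each candidate feature and uses the best of these random thresholds as the splitting rule.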
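To make the splitting rule more concrete, here is a minimal NumPy sketch of how a single node could choose its split. It is an illustration only, not Scikit-learn's internal implementation: the helper names gini and pick_random_split, the toy data, and the use of weighted Gini impurity as the selection score are all assumptions made for this example.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - np.sum(p ** 2)

def pick_random_split(X, y, n_candidates, rng):
    """Hypothetical helper: draw one random threshold per candidate
    feature and keep the candidate with the lowest weighted impurity."""
    n_samples, n_features = X.shape
    features = rng.choice(n_features, size=n_candidates, replace=False)
    best = None  # (weighted impurity, feature index, threshold)
    for f in features:
        lo, hi = X[:, f].min(), X[:, f].max()
        if lo == hi:
            continue  # constant feature, nothing to split on
        t = rng.uniform(lo, hi)  # random threshold, not an optimized one
        left = y[X[:, f] <= t]
        right = y[X[:, f] > t]
        score = (left.size * gini(left) + right.size * gini(right)) / n_samples
        if best is None or score < best[0]:
            best = (score, int(f), float(t))
    return best

# Toy data (assumption): the label depends only on feature 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = (X[:, 2] > 0).astype(int)

impurity, feature, threshold = pick_random_split(X, y, n_candidates=2, rng=rng)
print("feature:", feature, "threshold: %.3f" % threshold, "impurity: %.3f" % impurity)
```

A classical decision tree would search every possible threshold for each feature; drawing thresholds at random is cheaper and makes the individual trees less correlated with each other.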

In this tutorial, we'll briefly learn how to classify data by using Scikit-learn's ExtraTreesClassifier class in Python. The tutorial covers:

Preparing the data
Training the model
Predicting and accuracy check
Source code listing
Video tutorial

We'll start by loading the required libraries.

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix

Preparing the data

In this tutorial, we'll classify the Iris dataset. First, we'll separate it into the feature data x and the label data y.

iris = load_iris()
x, y = iris.data, iris.target

Then, we'll split them into train and test parts. Here, we'll hold out 15 percent of the dataset as test data.

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
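The split above is random, so the results change on every run. As a hedged variation, the sketch below fixes the seed and stratifies by class so that each run produces the same, class-balanced split. The random_state=42 value and the use of stratify are assumptions added for illustration; they are not part of the original tutorial.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
x, y = iris.data, iris.target

# random_state makes the split repeatable; stratify keeps the class
# proportions roughly equal in the train and test parts.
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.15, random_state=42, stratify=y)

# 15 percent of 150 samples is rounded up to 23 test samples.
print(xtrain.shape, xtest.shape)
print("Test samples per class:", np.bincount(ytest))
```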

Training the model

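Next, we'll define the classifier by using the ExtraTreesClassifier class. The n_estimators parameter sets the number of trees in the ensemble; here I'll set it to 100. Printing the classifier shows its parameters. Older Scikit-learn versions print the full list, as in the output below, while newer versions print only the parameters that differ from their defaults.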

clf = ExtraTreesClassifier(n_estimators=100)
print(clf)

ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=None, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=None, verbose=0,
                     warm_start=False)
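Besides n_estimators, a few other parameters are commonly set explicitly. The sketch below shows one possible configuration and reads the values back with get_params(), which works the same way regardless of how a given Scikit-learn version prints the estimator. The specific values (max_features="sqrt", random_state=0, n_jobs=-1) are illustrative assumptions, not recommendations from the original tutorial.

```python
from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(
    n_estimators=100,     # number of trees in the ensemble
    max_features="sqrt",  # candidate features considered at each split
    bootstrap=False,      # each tree sees the whole training set
    random_state=0,       # repeatable randomness (assumed value)
    n_jobs=-1,            # build trees on all available CPU cores
)

params = clf.get_params()
print(params["n_estimators"], params["max_features"], params["bootstrap"])
```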

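Then, we'll fit the model on the training data and check its accuracy score on that same data. Because the trees are grown to full depth by default, a training score of 1.0 is typical and says little about how well the model generalizes.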

clf.fit(xtrain, ytrain)

score = clf.score(xtrain, ytrain)
print("Score: ", score)

Score:  1.0

We can also apply cross-validation on the training data to get a more realistic estimate of the model's accuracy. Each fold is scored on data the model did not see while fitting.

cv_scores = cross_val_score(clf, xtrain, ytrain, cv=5)
print("CV average score: %.2f" % cv_scores.mean())

CV average score: 0.96
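The average hides how much the folds disagree. As a small self-contained sketch, the code below prints the individual fold scores and their standard deviation alongside the mean. For brevity it cross-validates on the full Iris data rather than on the train part, and random_state=0 is an assumed value used only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

x, y = load_iris(return_X_y=True)
clf = ExtraTreesClassifier(n_estimators=100, random_state=0)

# cross_val_score returns one accuracy value per fold.
cv_scores = cross_val_score(clf, x, y, cv=5)
print("Fold scores:", cv_scores)
print("CV mean: %.2f, std: %.2f" % (cv_scores.mean(), cv_scores.std()))
```

A large spread between folds would suggest that the estimate is unstable, for example because the dataset is small.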

Predicting and accuracy check

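Now, we can predict the test data by using the trained model. After the prediction, we'll check the accuracy with the confusion_matrix function. Rows correspond to the true classes and columns to the predicted classes, so correct predictions lie on the diagonal. The counts depend on the random split, so your matrix will differ from the sample output below, which comes from a run with 15 test samples (a 15 percent split of the 150 Iris samples yields 23).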

ypred = clf.predict(xtest)

cm = confusion_matrix(ytest, ypred)
print(cm)

[[5 0 0]
 [0 4 0]
 [0 0 6]]
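A confusion matrix can be reduced to a single accuracy figure by dividing the sum of its diagonal by the total count. The sketch below demonstrates this on a small set of hypothetical labels, chosen for illustration only and not taken from the tutorial run, and also prints per-class precision and recall with classification_report.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical labels for illustration only; one class-1 sample is
# misclassified as class 2.
ytest = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])
ypred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 2, 1])

cm = confusion_matrix(ytest, ypred)
print(cm)

# Correct predictions sit on the diagonal of the matrix.
accuracy = np.trace(cm) / cm.sum()
print("Accuracy: %.2f" % accuracy)  # 9 of 10 correct -> 0.90

print(classification_report(ytest, ypred))
```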

In this tutorial, we've briefly learned how to classify data by using Scikit-learn's ExtraTreesClassifier class in Python. The full source code is listed below.

Source code listing

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix

iris = load_iris()
x, y = iris.data, iris.target
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)

clf = ExtraTreesClassifier(n_estimators=100)
print(clf)

clf.fit(xtrain, ytrain)
score = clf.score(xtrain, ytrain)
print("Score: ", score)

cv_scores = cross_val_score(clf, xtrain, ytrain, cv=5)
print("CV average score: %.2f" % cv_scores.mean())

ypred = clf.predict(xtest)

cm = confusion_matrix(ytest, ypred)
print(cm)

Video tutorial

https://youtu.be/xUck_ISpoYI

References:

Scikit learn API
