In this tutorial, we'll learn how to classify data with the VotingClassifier class of the sklearn.ensemble package in Python. The post covers:
- Preparing the dataset
- Classifying with hard voting
- Classifying with soft voting
- Classifying the iris dataset with a voting classifier
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
Preparing the dataset
We'll create a classification dataset from randomly generated numbers, labeling each row by the sum of its features. Then we'll separate it into X and Y parts, encode the Y values, and split the dataset into training and test parts.
def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # label depends on the sum of the features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(300)
df.head()

X = df[["a", "b", "c"]]
Y = df[["y"]]
Y.head()

# ravel Y into a 1-D array to avoid a shape warning from LabelEncoder
le = LabelEncoder()
y = le.fit_transform(np.ravel(Y))
y[0:5]

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
Classifying with hard voting
A voting classifier combines several different classifiers. Here, we'll create a set of base classifiers.
lr = LogisticRegression()
gnb = GaussianNB()
dtc = DecisionTreeClassifier(criterion="entropy")
knc = KNeighborsClassifier(n_neighbors=1)

base_methods = [('LogisticReg', lr),
                ('GaussianNB', gnb),
                ('DecisionTree', dtc),
                ('KNeighbors', knc)]
Next, we'll create a VotingClassifier model with the base classifiers.
vote_model = VotingClassifier(estimators=base_methods)
print(vote_model)

VotingClassifier(estimators=[('LogisticReg',
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
        penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
        verbose=0, warm_start=False)),
    ('Gau...owski', metric_params=None, n_jobs=1, n_neighbors=1, p=2,
        weights='uniform'))],
    flatten_transform=None, n_jobs=1, voting='hard', weights=None)
As you may have noticed, hard voting is the default method in VotingClassifier. Next, we'll fit the model on the training data, predict the test data, and check the accuracy.
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
0.92
print(confusion_matrix(ytest, ytest_pred))
[[ 5  0  0]
 [ 0 22  1]
 [ 1  4 42]]
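Under the hood, hard voting simply takes the majority class across the base classifiers' predictions for each sample. As a rough illustration, here is a minimal sketch that reproduces the majority vote by hand with scipy.stats.mode; tie-breaking details may differ slightly from sklearn's, so don't expect the numbers to match exactly in every case.

from scipy import stats

# fit each base classifier on the training set and stack their test predictions
preds = np.array([clf.fit(Xtrain, ytrain).predict(Xtest)
                  for _, clf in base_methods])   # shape: (n_classifiers, n_samples)

# per-sample majority vote across the classifiers
majority = stats.mode(preds, axis=0).mode.ravel()

print("manual majority-vote accuracy:", np.mean(majority == ytest))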
Classifying with soft voting
For the soft voting method, we'll also give weights to the classifiers, set according to each classifier's performance. First, we'll check the performance of each classifier.
for name, method in base_methods:
    method.fit(Xtrain, ytrain)
    acc = method.score(Xtest, ytest)
    print(name, "Accuracy:", acc)
LogisticReg Accuracy: 0.8533333333333334
GaussianNB Accuracy: 0.9066666666666666
DecisionTree Accuracy: 0.88
KNeighbors Accuracy: 0.9333333333333333
KNeighbors and GaussianNB perform better than the other classifiers, so we'll give them higher weights. We'll create a VotingClassifier with the soft voting method and the weights parameter.
vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
print(vote_model)

VotingClassifier(estimators=[('LogisticReg',
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
        penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
        verbose=0, warm_start=False)),
    ('Gau...owski', metric_params=None, n_jobs=1, n_neighbors=1, p=2,
        weights='uniform'))],
    flatten_transform=None, n_jobs=1, voting='soft', weights=[1, 2, 1, 2])
Next, we'll fit the model, predict the test data, and check the accuracy.
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
0.9466666666666667
print(confusion_matrix(ytest, ytest_pred))
[[ 5  0  0]
 [ 0 22  1]
 [ 1  2 44]]
The result shows slightly better classification performance than hard voting (0.947 vs. 0.92 accuracy).
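For intuition, soft voting can also be reproduced by hand: it takes a weighted average of the class probabilities from the base classifiers and picks the most likely class. Below is a minimal sketch, assuming every base estimator supports predict_proba (all four used here do); the accuracy it prints should closely match the ensemble's.

# weighted average of class probabilities, then pick the most likely class
weights = np.array([1, 2, 1, 2])

probas = np.array([clf.fit(Xtrain, ytrain).predict_proba(Xtest)
                   for _, clf in base_methods])   # shape: (n_classifiers, n_samples, n_classes)
avg_proba = np.average(probas, axis=0, weights=weights)
soft_pred = np.argmax(avg_proba, axis=1)

print("manual soft-vote accuracy:", np.mean(soft_pred == ytest))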
Classifying the iris dataset with a voting classifier
Here, we'll show how to apply the VotingClassifier to a real dataset like iris. We'll reuse the base classifiers we created above.
iris = datasets.load_iris()
X = iris.data
Y = iris.target

le = LabelEncoder()
y = le.fit_transform(Y)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
0.9736842105263158
print(confusion_matrix(ytest, ytest_pred))
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
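A single train/test split can be noisy, so one way to double-check the ensemble against its base estimators is cross-validation. The snippet below is a quick sketch using sklearn's cross_val_score; the 5-fold setting is an arbitrary choice for illustration.

from sklearn.model_selection import cross_val_score

# compare each base classifier and the ensemble with 5-fold cross-validation
for name, clf in base_methods + [('Voting', vote_model)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, "CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))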
In this post, we've briefly learned how to use the VotingClassifier class to classify data in Python.
The full source code is listed below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # label depends on the sum of the features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(300)
df.head()

X = df[["a", "b", "c"]]
Y = df[["y"]]
Y.head()

le = LabelEncoder()
y = le.fit_transform(np.ravel(Y))
y[0:5]

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# creating base classifiers
lr = LogisticRegression()
gnb = GaussianNB()
dtc = DecisionTreeClassifier(criterion="entropy")
knc = KNeighborsClassifier(n_neighbors=1)
base_methods = [('LogisticReg', lr),
                ('GaussianNB', gnb),
                ('DecisionTree', dtc),
                ('KNeighbors', knc)]

# hard voting method
vote_model = VotingClassifier(estimators=base_methods)
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))

# check the performance of each classifier
for name, method in base_methods:
    method.fit(Xtrain, ytrain)
    acc = method.score(Xtest, ytest)
    print(name, "Accuracy:", acc)

# soft voting method
vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))

# iris classification with a voting classifier
iris = datasets.load_iris()
X = iris.data
Y = iris.target
le = LabelEncoder()
y = le.fit_transform(Y)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))