In this tutorial, we'll learn how to classify data with the VotingClassifier class of the sklearn.ensemble package in Python. The post covers:
- Preparing the dataset
- Classifying with hard voting
- Classifying with soft voting
- Classifying the iris dataset with a voting classifier
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
Preparing the dataset
We'll create a classification dataset from randomly generated numbers, labeling each row by the sum of its features. Then we'll separate it into X and Y parts, encode the Y values, and split the dataset into training and test parts.
def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # label depends on the sum of the features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(300)
df.head()

X = df[["a", "b", "c"]]
Y = df[["y"]]
Y.head()

# ravel Y into a 1-D array to avoid a shape warning from LabelEncoder
le = LabelEncoder()
y = le.fit_transform(np.ravel(Y))
y[0:5]

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
Classifying with hard voting
A voting classifier combines several different classifiers. Here, we'll create a set of base classifiers.
lr = LogisticRegression()
gnb = GaussianNB()
dtc = DecisionTreeClassifier(criterion="entropy")
knc = KNeighborsClassifier(n_neighbors=1)

base_methods = [('LogisticReg', lr),
                ('GaussianNB', gnb),
                ('DecisionTree', dtc),
                ('KNeighbors', knc)]
Next, we'll create a VotingClassifier model with the base classifiers.
vote_model = VotingClassifier(estimators=base_methods)
print(vote_model)

VotingClassifier(estimators=[('LogisticReg',
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
        penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
        verbose=0, warm_start=False)),
    ('Gau...owski', metric_params=None, n_jobs=1, n_neighbors=1, p=2,
        weights='uniform'))],
    flatten_transform=None, n_jobs=1, voting='hard', weights=None)
As you may have noticed, hard voting is the default method in VotingClassifier. Next, we'll fit the model on the training data, predict the test data, and check the accuracy.
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
0.92
print(confusion_matrix(ytest, ytest_pred))
[[ 5  0  0]
 [ 0 22  1]
 [ 1  4 42]]
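Under the hood, hard voting simply takes the majority class across the base classifiers' predictions for each sample. As a rough illustration, here is a minimal sketch that reproduces the majority vote by hand with scipy.stats.mode; tie-breaking details may differ slightly from sklearn's, so don't expect the numbers to match exactly in every case.

from scipy import stats

# fit each base classifier on the training set and stack their test predictions
preds = np.array([clf.fit(Xtrain, ytrain).predict(Xtest)
                  for _, clf in base_methods])   # shape: (n_classifiers, n_samples)

# per-sample majority vote across the classifiers
majority = stats.mode(preds, axis=0).mode.ravel()

print("manual majority-vote accuracy:", np.mean(majority == ytest))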
Classifying with soft voting
For the soft voting method, we'll also give weights to the classifiers, set according to each classifier's performance. First, we'll check the performance of each classifier.
for name, method in base_methods:
    method.fit(Xtrain, ytrain)
    acc = method.score(Xtest, ytest)
    print(name, "Accuracy:", acc)
LogisticReg Accuracy: 0.8533333333333334
GaussianNB Accuracy: 0.9066666666666666
DecisionTree Accuracy: 0.88
KNeighbors Accuracy: 0.9333333333333333
KNeighbors and GaussianNB perform better than the other classifiers, so we'll give them higher weights. We'll create a VotingClassifier with the soft voting method and the weights parameter.
vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
print(vote_model)

VotingClassifier(estimators=[('LogisticReg',
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
        intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
        penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
        verbose=0, warm_start=False)),
    ('Gau...owski', metric_params=None, n_jobs=1, n_neighbors=1, p=2,
        weights='uniform'))],
    flatten_transform=None, n_jobs=1, voting='soft', weights=[1, 2, 1, 2])
Next, we'll fit the model, predict the test data, and check the accuracy.
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
0.9466666666666667
print(confusion_matrix(ytest, ytest_pred))
[[ 5  0  0]
 [ 0 22  1]
 [ 1  2 44]]
The result shows slightly better classification performance than hard voting (0.947 vs. 0.92 accuracy).
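For intuition, soft voting can also be reproduced by hand: it takes a weighted average of the class probabilities from the base classifiers and picks the most likely class. Below is a minimal sketch, assuming every base estimator supports predict_proba (all four used here do); the accuracy it prints should closely match the ensemble's.

# weighted average of class probabilities, then pick the most likely class
weights = np.array([1, 2, 1, 2])

probas = np.array([clf.fit(Xtrain, ytrain).predict_proba(Xtest)
                   for _, clf in base_methods])   # shape: (n_classifiers, n_samples, n_classes)
avg_proba = np.average(probas, axis=0, weights=weights)
soft_pred = np.argmax(avg_proba, axis=1)

print("manual soft-vote accuracy:", np.mean(soft_pred == ytest))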
Classifying the iris dataset with a voting classifier
Here, we'll show how to apply the VotingClassifier to a real dataset like iris. We'll reuse the base classifiers we created above.
iris = datasets.load_iris()
X = iris.data
Y = iris.target

le = LabelEncoder()
y = le.fit_transform(Y)

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
0.9736842105263158
print(confusion_matrix(ytest, ytest_pred))
[[13  0  0]
 [ 0 15  1]
 [ 0  0  9]]
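A single train/test split can be noisy, so one way to double-check the ensemble against its base estimators is cross-validation. The snippet below is a quick sketch using sklearn's cross_val_score; the 5-fold setting is an arbitrary choice for illustration.

from sklearn.model_selection import cross_val_score

# compare each base classifier and the ensemble with 5-fold cross-validation
for name, clf in base_methods + [('Voting', vote_model)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, "CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))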
In this post, we've briefly learned how to use the VotingClassifier class to classify data in Python.
The full source code is listed below.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

def CreateDataFrame(N):
    columns = ['a', 'b', 'c', 'y']
    df = pd.DataFrame(columns=columns)
    for i in range(N):
        a = np.random.randint(10)
        b = np.random.randint(20)
        c = np.random.randint(5)
        # label depends on the sum of the features
        y = "normal"
        if (a + b + c) > 25:
            y = "high"
        elif (a + b + c) < 12:
            y = "low"
        df.loc[i] = [a, b, c, y]
    return df

df = CreateDataFrame(300)
df.head()

X = df[["a", "b", "c"]]
Y = df[["y"]]
Y.head()

le = LabelEncoder()
y = le.fit_transform(np.ravel(Y))
y[0:5]

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# creating base classifiers
lr = LogisticRegression()
gnb = GaussianNB()
dtc = DecisionTreeClassifier(criterion="entropy")
knc = KNeighborsClassifier(n_neighbors=1)
base_methods = [('LogisticReg', lr),
                ('GaussianNB', gnb),
                ('DecisionTree', dtc),
                ('KNeighbors', knc)]

# hard voting method
vote_model = VotingClassifier(estimators=base_methods)
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))

# check the performance of each classifier
for name, method in base_methods:
    method.fit(Xtrain, ytrain)
    acc = method.score(Xtest, ytest)
    print(name, "Accuracy:", acc)

# soft voting method
vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))

# iris classification with a voting classifier
iris = datasets.load_iris()
X = iris.data
Y = iris.target
le = LabelEncoder()
y = le.fit_transform(Y)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)
vote_model = VotingClassifier(estimators=base_methods, voting='soft', weights=[1, 2, 1, 2])
vote_model = vote_model.fit(Xtrain, ytrain)
ytest_pred = vote_model.predict(Xtest)
print(vote_model.score(Xtest, ytest))
print(confusion_matrix(ytest, ytest_pred))