DataTechNotes: One Hot Encoding Example in Python

One hot encoding is an important technique for data classification with neural network models. To train such a model, class labels are typically represented as rows of 0 and 1 elements, with one position per class and a single 1 marking the class of each sample. This representation is called one-hot encoding.
In this post, we'll learn how to create a one hot encoded array in Python. The post covers:

One hot encoding with sklearn
One hot encoding with Keras
Iris dataset one hot encoding example
Source code listing

We'll start by loading the required libraries.

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.utils import to_categorical
from sklearn import datasets

One hot encoding with sklearn

To represent labels as a one hot encoded map, we first need to create an integer vector in which each label class is assigned a unique integer value, such as 'cat':0, 'dog':1, 'mouse':2. LabelEncoder assigns these integers in sorted order of the class names. Let's see an example.

labels=['dog','cat','cat','mouse','dog','dog']
label_encoder=LabelEncoder()
label_ids=label_encoder.fit_transform(labels)

print(labels)

['dog', 'cat', 'cat', 'mouse', 'dog', 'dog']

print(label_ids)

[1 0 0 2 1 1]
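
To check which integer was assigned to which class, a minimal sketch like the following may help. It assumes scikit-learn is installed and relies on the fitted encoder's classes_ attribute and inverse_transform() method; the names id_to_label and decoded are chosen here only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['dog', 'cat', 'cat', 'mouse', 'dog', 'dog']
label_encoder = LabelEncoder()
label_ids = label_encoder.fit_transform(labels)

# classes_ holds the unique labels in sorted order; the index is the integer id
id_to_label = {i: str(name) for i, name in enumerate(label_encoder.classes_)}
print(id_to_label)

# inverse_transform maps the integer ids back to the original label names
decoded = label_encoder.inverse_transform(label_ids)
print(decoded.tolist())
```

Keeping the fitted encoder around is useful later, because a model trained on the integer ids returns ids that need to be translated back into names.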

Then we can create a one hot encoded matrix that identifies each label with the value 1. Each position in a row corresponds to one unique label name, taken in alphabetical order: {cat, dog, mouse}. The target label is defined by setting a 1 in its position and a 0 in every other position.

   { (0, 0, 1),
      (0, 1, 0),
      (1, 0, 0) }

Here, (0, 0, 1) represents 'mouse', (0, 1, 0) represents 'dog', and (1, 0, 0) represents 'cat'. We can create the matrix map as shown below.

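# scikit-learn 1.2 renamed the sparse argument to sparse_output; on older releases use sparse=False
onehot_encoder=OneHotEncoder(sparse_output=False)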
reshaped=label_ids.reshape(len(label_ids), 1)
onehot=onehot_encoder.fit_transform(reshaped)

print(onehot)

[[0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]]
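
The integer step is not strictly required to obtain this matrix. As a sketch, assuming a reasonably recent scikit-learn release, OneHotEncoder can also be fitted on the string labels directly. The example keeps the encoder's default sparse result and converts it with toarray(), so it does not depend on how the dense-output argument is named in the installed version. The names label_column and onehot_direct are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

labels = ['dog', 'cat', 'cat', 'mouse', 'dog', 'dog']

# OneHotEncoder expects a 2-D array: one row per sample, one column per feature
label_column = np.array(labels).reshape(-1, 1)

encoder = OneHotEncoder()
# The default result is a SciPy sparse matrix; toarray() converts it to a dense array
onehot_direct = encoder.fit_transform(label_column).toarray()

# categories_[0] lists the classes in the same order as the matrix columns
print(encoder.categories_[0])
print(onehot_direct)
```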

One hot encoding with Keras

We can also create a one hot encoded map with the to_categorical() function of Keras. Here, we'll reuse the label_ids vector.

print(label_ids)
[1 0 0 2 1 1]

to_cat=to_categorical(label_ids)
print(to_cat)

[[0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]]
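
For integer ids that start at 0, the same matrix can be sketched with plain NumPy, which may help to see what the conversion does. This is an illustrative equivalent under that assumption, not the Keras implementation; num_classes and to_cat_np are names assumed here.

```python
import numpy as np

label_ids = np.array([1, 0, 0, 2, 1, 1])

# Number of classes, assuming the ids run from 0 up to the largest id
num_classes = int(label_ids.max()) + 1

# Row i of the identity matrix is the one hot vector for class i,
# so indexing the identity matrix with the ids builds the whole encoding
to_cat_np = np.eye(num_classes)[label_ids]
print(to_cat_np)
```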

Iris dataset one hot encoding example

Next, we'll create a one hot encoded map for the category values of the iris dataset. The iris data contains three species: setosa, versicolor, and virginica. They are already encoded as 0, 1, and 2 in the dataset's target vector, so we only need to reshape it and transform it with OneHotEncoder().

iris= datasets.load_iris()
X = iris.data
Y = iris.target

onehot_encoder=OneHotEncoder(sparse_output=False)
reshaped=Y.reshape(len(Y), 1)
y_onehot=onehot_encoder.fit_transform(reshaped)

print(Y.shape)

(150,)

print(y_onehot.shape)

(150, 3)

print(Y[0:10])

[0 0 0 0 0 0 0 0 0 0]

print(y_onehot[0:10])

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
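
A one hot matrix can be turned back into class ids by taking the position of the 1 in each row. The sketch below assumes NumPy and scikit-learn are available and builds the dense matrix with toarray() on the encoder's default sparse result; decoded_ids and species are illustrative names.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import OneHotEncoder

iris = datasets.load_iris()
Y = iris.target

# Dense one hot matrix of shape (150, 3)
y_onehot = OneHotEncoder().fit_transform(Y.reshape(-1, 1)).toarray()

# argmax along each row returns the column index of the 1, i.e. the class id
decoded_ids = np.argmax(y_onehot, axis=1)

# target_names maps class ids to species names
species = iris.target_names[decoded_ids]

print(np.array_equal(decoded_ids, Y))  # checks that the round trip recovers the ids
print(y_onehot.sum(axis=0))            # column sums give the number of samples per class
print(species[:3])
```

The same argmax step is commonly applied to the output of a classifier that predicts one score per class.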

In this post, we've briefly learned how to create a one hot encoded map for the labels in classification data. The full source code is listed below.

Source code listing

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from keras.utils import to_categorical
from sklearn import datasets

labels=['dog','cat','cat','mouse','dog','dog']
label_encoder=LabelEncoder()
label_ids=label_encoder.fit_transform(labels)
print(labels)
print(label_ids)

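# scikit-learn 1.2 renamed the sparse argument to sparse_output; on older releases use sparse=False
onehot_encoder=OneHotEncoder(sparse_output=False)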
reshaped=label_ids.reshape(len(label_ids), 1)
onehot=onehot_encoder.fit_transform(reshaped)
print(onehot)

to_cat=to_categorical(label_ids)
print(to_cat)

iris= datasets.load_iris()
X = iris.data
Y = iris.target

onehot_encoder=OneHotEncoder(sparse_output=False)
reshaped=Y.reshape(len(Y), 1)
y_onehot=onehot_encoder.fit_transform(reshaped)
print(Y.shape)
print(y_onehot.shape)

print(Y[0:10])
print(y_onehot[0:10])
