In machine learning, Principal Component Analysis (PCA) is used to reduce the number of variables in a dataset without losing key information. PCA is a linear transformation method; Kernel PCA is an extension of PCA for nonlinear data.
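To see why the nonlinear extension matters, here is a minimal sketch (not part of this tutorial's pipeline) comparing linear PCA and Kernel PCA on scikit-learn's make_circles data, where the two classes form concentric rings that no linear projection can separate; the dataset and gamma value are illustrative choices.

```python
# Sketch: linear PCA vs. Kernel PCA on nonlinear (concentric-circle) data.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings of points, labeled 0 and 1.
x, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A linear projection just rotates the rings; an RBF-kernel projection
# can unfold them so the classes become separable.
z_lin = PCA(n_components=2).fit_transform(x)
z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(x)

print(z_lin.shape, z_rbf.shape)  # (400, 2) (400, 2)
```

Plotting z_rbf colored by y (as we do below for Iris and MNIST) shows the inner and outer rings pulled apart along the first component, which linear PCA cannot achieve.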
The scikit-learn API provides the KernelPCA class to apply the Kernel PCA method in Python. In this tutorial, we'll briefly learn how to project data with KernelPCA and visualize the projected data in a graph. The tutorial covers:
- Iris dataset Kernel PCA projection and visualizing
- MNIST dataset Kernel PCA projection and visualizing
- Source code listing
We'll start by loading the required libraries and functions.
from sklearn.decomposition import KernelPCA
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
Iris dataset Kernel PCA projection and visualizing
After loading the Iris dataset, we'll take the 'data' and 'target' parts of the dataset.
iris = load_iris()
x = iris.data
y = iris.target
We'll define the model with the KernelPCA class, setting the kernel type, n_components, and gamma. To fit the model and project the data, we'll use the fit_transform() method with the x data.
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=.01)
z = kpca.fit_transform(x)
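As an aside (not a step this tutorial performs), KernelPCA can also map projected points approximately back to the original feature space when it is constructed with fit_inverse_transform=True; this sketch reuses the same Iris setup to show the round trip.

```python
# Sketch: approximate reconstruction from the Kernel PCA projection.
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA

x = load_iris().data

# fit_inverse_transform=True makes KernelPCA learn an inverse mapping.
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=.01,
                 fit_inverse_transform=True)
z = kpca.fit_transform(x)
x_back = kpca.inverse_transform(z)  # approximate reconstruction

print(z.shape, x_back.shape)  # (150, 2) (150, 4)
```

The reconstruction is only approximate because the inverse of the kernel mapping is learned, not exact; it is useful for judging how much information the two components retain.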
To visualize the result in a graph, we'll collect the output
component data in a pandas DataFrame, then plot it with the 'seaborn'
library's scatterplot(). In the scatter plot's color palette, we
set 3, the number of categories in the label data.
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data KernelPCA projection")
MNIST dataset Kernel PCA projection and visualizing
Next,
we'll apply the same method to a larger dataset. The MNIST handwritten
digit dataset works well for this purpose, and we can load it through the
Keras API. We extract only the training part of the dataset, because it is
enough here to test the KernelPCA class.
(x_train, y_train), (_ , _) = mnist.load_data()
print(x_train.shape)
(60000, 28, 28)
For simplicity, we use only part of the data.
# slicing the data
x_train = x_train[0:3000,]
y_train = y_train[0:3000,]
MNIST is three-dimensional data, so we'll reshape it into two dimensions.
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
(3000, 784)
Here, we have 3000 samples with 784 features.
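The same flattening can also be written with the array's own reshape method; scaling pixel intensities to [0, 1] is likewise common before applying an RBF kernel, since gamma interacts with the feature scale. This is a hypothetical variant, not a step the tutorial performs, and dummy data stands in for MNIST so the sketch is self-contained.

```python
# Sketch: flattening 28x28 images with ndarray.reshape, plus optional scaling.
import numpy as np

# Dummy stand-in for the sliced MNIST training images.
x_train = np.random.randint(0, 256, size=(3000, 28, 28), dtype=np.uint8)

# -1 lets NumPy infer 28*28 = 784 features per sample.
x_mnist = x_train.reshape(x_train.shape[0], -1)   # (3000, 784)

# Optional: scale pixel values to [0, 1] so gamma behaves consistently.
x_scaled = x_mnist.astype("float32") / 255.0

print(x_mnist.shape)  # (3000, 784)
```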
First, we'll transform the data with KernelPCA's default parameters, then with explicitly set parameters. The code below shows the projection and its visualization in a graph.
kpca = KernelPCA()
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")
Next, we'll change some of KernelPCA's parameters and visualize the result in a graph.
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=1)
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")
The above graphs show a two-dimensional visualization of the MNIST data. The colors mark
the target digits and the locations of their feature data in 2D space. You can see how the projection differs when we change the parameters. Comparing the projections helps you to understand the key components of your data.
In this tutorial, we've briefly learned how to project data with the Kernel PCA method and visualize the projected data in Python. The full source code is listed below.
Source code listing
from sklearn.decomposition import KernelPCA
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
iris = load_iris()
x = iris.data
y = iris.target
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=.01)
z = kpca.fit_transform(x)
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data KernelPCA projection")
# MNIST data projection
(x_train, y_train), (_ , _) = mnist.load_data()
# use part of data
x_train = x_train[0:3000,]
y_train = y_train[0:3000,]
print(x_train.shape)
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
kpca = KernelPCA()
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=1)
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")