In machine learning, Principal Component Analysis (PCA) is used to reduce the number of variables in a dataset without losing key information. PCA is a linear transformation method; Kernel PCA is an extension of PCA for nonlinear data.
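To see why the nonlinear extension matters, here is a minimal sketch (not part of this tutorial's pipeline) comparing linear PCA and Kernel PCA on scikit-learn's make_circles data, where the two classes form concentric rings that no linear projection can separate; the dataset and gamma value are illustrative choices.

```python
# Sketch: linear PCA vs. Kernel PCA on nonlinear (concentric-circle) data.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric rings of points, labeled 0 and 1.
x, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# A linear projection just rotates the rings; an RBF-kernel projection
# can unfold them so the classes become separable.
z_lin = PCA(n_components=2).fit_transform(x)
z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(x)

print(z_lin.shape, z_rbf.shape)  # (400, 2) (400, 2)
```

Plotting z_rbf colored by y (as we do below for Iris and MNIST) shows the inner and outer rings pulled apart along the first component, which linear PCA cannot achieve.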
The scikit-learn API provides the KernelPCA class to apply the Kernel PCA method in Python. In this tutorial, we'll briefly learn how to project data with KernelPCA and visualize the projected data in a graph. The tutorial covers:
- Iris dataset Kernel PCA projection and visualizing
- MNIST dataset Kernel PCA projection and visualizing
- Source code listing
We'll start by loading the required libraries and functions.
from sklearn.decomposition import KernelPCA
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
Iris dataset Kernel PCA projection and visualizing
After loading the Iris dataset, we'll take the 'data' and 'target' parts of the dataset.
iris = load_iris()
x = iris.data
y = iris.target
We'll define the model with the KernelPCA class, setting the kernel type, n_components, and gamma. To fit the model and project the data, we'll use the fit_transform() method with the x data.
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=.01)
z = kpca.fit_transform(x)
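As an aside (not a step this tutorial performs), KernelPCA can also map projected points approximately back to the original feature space when it is constructed with fit_inverse_transform=True; this sketch reuses the same Iris setup to show the round trip.

```python
# Sketch: approximate reconstruction from the Kernel PCA projection.
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA

x = load_iris().data

# fit_inverse_transform=True makes KernelPCA learn an inverse mapping.
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=.01,
                 fit_inverse_transform=True)
z = kpca.fit_transform(x)
x_back = kpca.inverse_transform(z)  # approximate reconstruction

print(z.shape, x_back.shape)  # (150, 2) (150, 4)
```

The reconstruction is only approximate because the inverse of the kernel mapping is learned, not exact; it is useful for judging how much information the two components retain.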
To visualize the result in a graph, we'll collect the output
component data in a pandas DataFrame, then plot it with the 'seaborn'
library's scatterplot(). In the scatter plot's color palette, we
set 3, the number of categories in the label data.
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data KernelPCA projection")
MNIST dataset Kernel PCA projection and visualizing
Next,
we'll apply the same method to a larger dataset. The MNIST handwritten
digit dataset works well for this purpose, and we can load it through the
Keras API. We extract only the training part of the dataset, because it is
enough here to test the KernelPCA class.
(x_train, y_train), (_ , _) = mnist.load_data()
print(x_train.shape)
(60000, 28, 28)
For simplicity, we use only part of the data.
# slicing the data
x_train = x_train[0:3000,]
y_train = y_train[0:3000,]
MNIST is three-dimensional data, so we'll reshape it into two dimensions.
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
(3000, 784)
Here, we have 3000 samples with 784 features.
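The same flattening can also be written with the array's own reshape method; scaling pixel intensities to [0, 1] is likewise common before applying an RBF kernel, since gamma interacts with the feature scale. This is a hypothetical variant, not a step the tutorial performs, and dummy data stands in for MNIST so the sketch is self-contained.

```python
# Sketch: flattening 28x28 images with ndarray.reshape, plus optional scaling.
import numpy as np

# Dummy stand-in for the sliced MNIST training images.
x_train = np.random.randint(0, 256, size=(3000, 28, 28), dtype=np.uint8)

# -1 lets NumPy infer 28*28 = 784 features per sample.
x_mnist = x_train.reshape(x_train.shape[0], -1)   # (3000, 784)

# Optional: scale pixel values to [0, 1] so gamma behaves consistently.
x_scaled = x_mnist.astype("float32") / 255.0

print(x_mnist.shape)  # (3000, 784)
```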
First, we'll transform the data with KernelPCA's default parameters, then with explicitly set parameters. The code below shows the projection and its visualization in a graph.
kpca = KernelPCA()
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")
Next, we'll change some of KernelPCA's parameters and visualize the result in a graph.
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=1)
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")
The above graphs show a two-dimensional visualization of the MNIST data. The colors mark
the target digits and the locations of their feature data in 2D space. You can see how the projection differs when we change the parameters. Comparing the projections helps you to understand the key components of your data.
In this tutorial, we've briefly learned how to project data with the Kernel PCA method and visualize the projected data in Python. The full source code is listed below.
Source code listing
from sklearn.decomposition import KernelPCA
from keras.datasets import mnist
from sklearn.datasets import load_iris
from numpy import reshape
import seaborn as sns
import pandas as pd
iris = load_iris()
x = iris.data
y = iris.target
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=.01)
z = kpca.fit_transform(x)
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 3),
data=df).set(title="Iris data KernelPCA projection")
# MNIST data projection
(x_train, y_train), (_ , _) = mnist.load_data()
# use part of data
x_train = x_train[0:3000,]
y_train = y_train[0:3000,]
print(x_train.shape)
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
kpca = KernelPCA()
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")
kpca = KernelPCA(kernel="rbf", n_components=2, gamma=1)
z = kpca.fit_transform(x_mnist)
df = pd.DataFrame()
df["y"] = y_train
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]
sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
palette=sns.color_palette("hls", 10),
data=df).set(title="MNIST data KernelPCA projection")