Dimensionality reduction is used when we deal with large datasets containing too many features: it speeds up computation, shrinks model size, and makes huge datasets easier to visualize. The goal is to keep the most informative part of the data while discarding most of the redundant features.
In this tutorial, we'll briefly learn how to reduce data dimensions with sparse random projection, Gaussian random projection, and PCA in Python. The scikit-learn API provides the SparseRandomProjection and GaussianRandomProjection classes and the PCA transformer for reducing data dimensionality. After reading this tutorial, you'll know how to reduce the dimensionality of a dataset with these methods. The tutorial covers:
- Preparing the data
- Gaussian random projection
- Sparse random projection
- PCA projection
- MNIST data projection
- Source code listing
We'll start by loading the required libraries and functions.
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from keras.datasets import mnist
from numpy import reshape
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Generate a synthetic regression dataset with 1000 features
x, _ = make_regression(n_samples=50000, n_features=1000)
print(x.shape)
(50000, 1000)
# Load the MNIST digits; keep only the training images and labels
(x_train, y_train), (_, _) = mnist.load_data()
print(x_train.shape)
(60000, 28, 28)
# Flatten each 28x28 image into a 784-element vector
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
(60000, 784)
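The same flattening can also be written with the array's own reshape method, using -1 to let NumPy infer the flattened size; scaling the pixel intensities to [0, 1] is a common extra step before projecting. A small sketch (using random stand-in images rather than the MNIST arrays above):

```python
import numpy as np

# Stand-in for x_train: two random 28x28 uint8 "images"
imgs = np.random.RandomState(0).randint(0, 256, size=(2, 28, 28), dtype=np.uint8)

flat = imgs.reshape(imgs.shape[0], -1)  # -1 infers 28*28 = 784
print(flat.shape)                       # (2, 784)

flat_scaled = flat / 255.0              # scale pixel values to [0, 1]
print(flat_scaled.max() <= 1.0)
```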
# Gaussian random projection down to 200 components
grp = GaussianRandomProjection(n_components=200)
grp_data = grp.fit_transform(x)
print(grp_data.shape)
(50000, 200)
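As a rule of thumb for choosing n_components, scikit-learn ships the johnson_lindenstrauss_min_dim helper, which returns a conservative minimum number of components that preserves pairwise distances within a relative error eps. Note that the 200 components used above are chosen for illustration and sit well below this bound:

```python
from sklearn.random_projection import johnson_lindenstrauss_min_dim

# Conservative lower bound on n_components from the
# Johnson-Lindenstrauss lemma, for several distortion levels eps
for eps in (0.1, 0.3, 0.5):
    k = johnson_lindenstrauss_min_dim(n_samples=50000, eps=eps)
    print(f"eps={eps}: n_components >= {k}")
```

The looser the allowed distortion, the fewer components are required.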
# Sparse random projection down to 200 components
srp = SparseRandomProjection(n_components=200)
srp_data = srp.fit_transform(x)
print(srp_data.shape)
(50000, 200)
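What makes the sparse variant cheaper is that most entries of its projection matrix are zero. A sketch of inspecting this after fitting (assuming the default density='auto', which sets the non-zero fraction to 1/sqrt(n_features); the small random dataset here is a stand-in for x):

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

data = np.random.RandomState(0).rand(100, 1000)  # stand-in for x
srp = SparseRandomProjection(n_components=200, random_state=0)
srp.fit(data)

# density_ is the non-zero fraction chosen by 'auto': 1/sqrt(1000) ~ 0.0316
print(srp.density_)
# components_ is stored as a scipy sparse matrix; compare realized sparsity
print(srp.components_.nnz / (200 * 1000))
```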
# PCA projection down to 200 components
pca = PCA(n_components=200)
pca_data = pca.fit_transform(x)
print(pca_data.shape)
(50000, 200)
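Unlike random projections, PCA orders components by explained variance, and the fitted transformer exposes explained_variance_ratio_ to show how much variance the kept components retain. Passing a float to n_components even selects the component count automatically. A small sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(500, 50)

# Keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(data)
print(reduced.shape)

# Cumulative variance captured by the kept components (>= 0.95 by construction)
print(np.cumsum(pca.explained_variance_ratio_)[-1])
```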
# Sparse random projection on 2 components
srp = SparseRandomProjection(n_components=2)
z = srp.fit_transform(x_mnist)
df_srp = pd.DataFrame()
df_srp["y"] = y_train
df_srp["comp-1"] = z[:,0]
df_srp["comp-2"] = z[:,1]
# Gaussian random projection on 2 components
grp = GaussianRandomProjection(n_components=2)
z = grp.fit_transform(x_mnist)
df_grp = pd.DataFrame()
df_grp["y"] = y_train
df_grp["comp-1"] = z[:,0]
df_grp["comp-2"] = z[:,1]
# PCA projection on 2 components
pca = PCA(n_components=2)
z = pca.fit_transform(x_mnist)
df_pca = pd.DataFrame()
df_pca["y"] = y_train
df_pca["comp-1"] = z[:,0]
df_pca["comp-2"] = z[:,1]
fig, ax = plt.subplots(3,1, figsize=(10,20))
sns.scatterplot(x="comp-1", y="comp-2", hue=df_srp.y.tolist(),
palette=sns.color_palette("hls", 10), data=df_srp,
ax=ax[0]).set(title='Sparse random projection')
sns.scatterplot(x="comp-1", y="comp-2", hue=df_grp.y.tolist(),
palette=sns.color_palette("hls", 10), data=df_grp,
ax=ax[1]).set(title='Gaussian random projection')
sns.scatterplot(x="comp-1", y="comp-2", hue=df_pca.y.tolist(),
palette=sns.color_palette("hls", 10), data=df_pca,
ax=ax[2]).set(title="PCA projection")
plt.show()
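Since PCA keeps the directions of maximal variance, it also supports inverse_transform to map the 2-component embedding back to the original feature space; the reconstruction error gives a rough measure of how much the projection discards. A sketch on synthetic data (a stand-in for the 784-dimensional MNIST vectors):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(1000, 784)  # stand-in for x_mnist

pca = PCA(n_components=2)
embedded = pca.fit_transform(data)
restored = pca.inverse_transform(embedded)  # back to 784 dimensions

# Mean squared reconstruction error: lower means the two components
# capture more of the data's variance
mse = np.mean((data - restored) ** 2)
print(restored.shape, mse)
```

Random projections, by contrast, are not variance-aware, which is why their 2-D scatter plots typically separate the digit classes less cleanly than PCA's.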
The complete source code of this tutorial is listed below.
from sklearn.random_projection import GaussianRandomProjection
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import PCA
from sklearn.datasets import make_regression
from keras.datasets import mnist
from numpy import reshape
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
x, _ = make_regression(n_samples=50000, n_features=1000)
print(x.shape)
(x_train, y_train), (_ , _) = mnist.load_data()
print(x_train.shape)
x_mnist = reshape(x_train, [x_train.shape[0], x_train.shape[1]*x_train.shape[2]])
print(x_mnist.shape)
grp = GaussianRandomProjection(n_components=200)
grp_data = grp.fit_transform(x)
print(grp_data.shape)
srp = SparseRandomProjection(n_components=200)
srp_data = srp.fit_transform(x)
print(srp_data.shape)
pca = PCA(n_components=200)
pca_data = pca.fit_transform(x)
print(pca_data.shape)
# Sparse random projection on 2 components
srp = SparseRandomProjection(n_components=2)
z = srp.fit_transform(x_mnist)
df_srp = pd.DataFrame()
df_srp["y"] = y_train
df_srp["comp-1"] = z[:,0]
df_srp["comp-2"] = z[:,1]
# Gaussian random projection on 2 components
grp = GaussianRandomProjection(n_components=2)
z = grp.fit_transform(x_mnist)
df_grp = pd.DataFrame()
df_grp["y"] = y_train
df_grp["comp-1"] = z[:,0]
df_grp["comp-2"] = z[:,1]
# PCA projection on 2 components
pca = PCA(n_components=2)
z = pca.fit_transform(x_mnist)
df_pca = pd.DataFrame()
df_pca["y"] = y_train
df_pca["comp-1"] = z[:,0]
df_pca["comp-2"] = z[:,1]
fig, ax = plt.subplots(3,1, figsize=(10,20))
sns.scatterplot(x="comp-1", y="comp-2", hue=df_srp.y.tolist(),
palette=sns.color_palette("hls", 10), data=df_srp,
ax=ax[0]).set(title='Sparse random projection')
sns.scatterplot(x="comp-1", y="comp-2", hue=df_grp.y.tolist(),
palette=sns.color_palette("hls", 10), data=df_grp,
ax=ax[1]).set(title='Gaussian random projection')
sns.scatterplot(x="comp-1", y="comp-2", hue=df_pca.y.tolist(),
palette=sns.color_palette("hls", 10), data=df_pca,
ax=ax[2]).set(title="PCA projection")
plt.show()