DataTechNotes: PCA-Based Anomaly Detection in Python

Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior. Principal Component Analysis (PCA) is a dimensionality reduction technique that can be used for anomaly detection by projecting data into a lower-dimensional space and flagging the points that this projection represents poorly.

In this tutorial, we will learn how to perform PCA-based anomaly detection using Python. We will generate synthetic 3D data, apply PCA, and detect anomalies based on the reconstruction error. Finally, we will evaluate the performance using a confusion matrix and classification report and visualize the results in a 3D plot.

The tutorial covers:

Introduction to PCA and anomaly detection
Generating test data
Applying PCA
Detecting anomalies
Conclusion
Source code listing

Introduction to PCA and anomaly detection

PCA is a statistical technique that transforms the data into a new coordinate system, where the first coordinate (principal component) explains the most variance in the data, the second coordinate explains the second most variance, and so on. By reducing the dimensionality of the data, PCA can help in identifying patterns and anomalies.
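
To make the variance ordering concrete, here is a minimal sketch on hypothetical toy data (the array `toy` and the generator seed are assumptions for illustration, not part of this tutorial's dataset). It fits PCA and prints scikit-learn's `explained_variance_ratio_`, the share of total variance captured by each component.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: two strongly correlated columns plus a little noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

toy_pca = PCA(n_components=2).fit(toy)

# Fraction of total variance explained by each component, in decreasing order
ratios = toy_pca.explained_variance_ratio_
print(ratios)
```

Because the two columns are almost perfectly correlated, the first component should capture nearly all of the variance.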

Anomaly Detection
Anomaly detection involves identifying data points that deviate significantly from the majority of the data. In the context of PCA, anomalies are points that have a high reconstruction error when projected back to the original space.
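
As a rough illustration of this idea, the sketch below uses hypothetical toy data: 50 points on the line y = x and one point placed off that line. Keeping a single component preserves the direction of the line, so the off-line point cannot be reconstructed well.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: points on the line y = x, plus one point far off the line
line = np.column_stack([np.linspace(-3, 3, 50), np.linspace(-3, 3, 50)])
outlier = np.array([[2.0, -2.0]])
toy = np.vstack([line, outlier])

# Keep 1 of 2 components, so information off the main direction is lost
toy_pca = PCA(n_components=1)
toy_back = toy_pca.inverse_transform(toy_pca.fit_transform(toy))

# Squared distance between each point and its reconstruction
toy_error = np.sum((toy - toy_back) ** 2, axis=1)
print(toy_error.argmax())
```

The largest error belongs to the last row (index 50), which is the point that was placed off the line.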

Generating test data

Before we start, make sure you have the necessary libraries installed. You can install them with pip.

 
pip install numpy pandas scikit-learn matplotlib
 

We import the required libraries for this tutorial.

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report
 

Next, we'll generate synthetic 3D data with a small portion of anomalies.

 
# Generate simple 3D random scattered data
np.random.seed(42)
n_samples = 300
n_features = 3  # Only 3 features for simplicity

# Generate normal data (centered around 0)
normal_data = np.random.normal(loc=0, scale=1, size=(n_samples, n_features))

# Introduce some anomalies (located at the edges)
# Uniform distribution for edge anomalies
anomalies = np.random.uniform(low=-5, high=5, size=(20, n_features)) 
data = np.vstack([normal_data, anomalies])

# Convert to a DataFrame for easier manipulation
df = pd.DataFrame(data, columns=[f'feature_{i}' for i in range(n_features)])

# Add ground truth labels (1 for anomalies, 0 for normal)
df['ground_truth'] = 0  # Initialize all as normal
df.iloc[-20:, df.columns.get_loc('ground_truth')] = 1  # Last 20 rows are anomalies
 
print(df.head())

The first five rows of the 'df' DataFrame:

 
 feature_0  feature_1  feature_2  ground_truth
  0.496714  -0.138264   0.647689             0
  1.523030  -0.234153  -0.234137             0
  1.579213   0.767435  -0.469474             0
  0.542560  -0.463418  -0.465730             0
  0.241962  -1.913280  -1.724918             0

Applying PCA

PCA is sensitive to the scale of the data, so it's important to standardize the data before applying PCA.

 
scaler = StandardScaler()
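# Scale only the feature columns; the ground-truth label must not be fed to PCA
data_scaled = scaler.fit_transform(df.drop(columns=['ground_truth']))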
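
The effect of standardization can be checked directly. This is a small sketch with made-up features on very different scales (`raw` is a hypothetical array, not the tutorial's data). After `StandardScaler`, each column has approximately zero mean and unit standard deviation, so no feature dominates the principal components just because of its units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
rng = np.random.default_rng(1)
raw = np.column_stack([rng.normal(0, 1, 500), rng.normal(0, 1000, 500)])

scaled = StandardScaler().fit_transform(raw)

print(raw.std(axis=0))      # very different spreads before scaling
print(scaled.mean(axis=0))  # approximately 0 for every column
print(scaled.std(axis=0))   # approximately 1 for every column
```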

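Now we can apply PCA to the scaled data. We keep 3 components so that the result can be plotted in 3D. Note that the data has only 3 features, so this rotates the data rather than reducing its dimensionality; a meaningful reconstruction error requires keeping fewer components than there are features.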

  
# Apply PCA with 3 components
pca = PCA(n_components=3)
data_pca = pca.fit_transform(data_scaled)

# Convert to DataFrame for easier manipulation
df_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])

Detecting anomalies

Anomalies can be detected by calculating the reconstruction error. The idea is that normal data points will have a low reconstruction error, while anomalies will have a high reconstruction error.

We will reconstruct the data from the PCA space and calculate the reconstruction error.

 
# Reconstruct the data from the PCA space
data_reconstructed = pca.inverse_transform(data_pca)

# Calculate the reconstruction error
reconstruction_error = np.sum((data_scaled - data_reconstructed) ** 2, axis=1)

# Add the reconstruction error to the DataFrame
df['reconstruction_error'] = reconstruction_error
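
One thing worth checking when using reconstruction error: if PCA keeps as many components as the data has features, the transform is only a rotation, and the reconstruction is exact up to floating-point rounding. The sketch below assumes the same kind of synthetic data as this tutorial and introduces a hypothetical helper, `reconstruction_errors`, to compare the errors with 3 components against the errors with 2.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same kind of synthetic data as in the tutorial: 300 normal points, 20 anomalies
np.random.seed(42)
normal_part = np.random.normal(loc=0, scale=1, size=(300, 3))
anomaly_part = np.random.uniform(low=-5, high=5, size=(20, 3))
X = StandardScaler().fit_transform(np.vstack([normal_part, anomaly_part]))

def reconstruction_errors(X, n_components):
    """Squared reconstruction error per row for a PCA with n_components."""
    model = PCA(n_components=n_components)
    X_back = model.inverse_transform(model.fit_transform(X))
    return np.sum((X - X_back) ** 2, axis=1)

errors_full = reconstruction_errors(X, 3)     # as many components as features
errors_reduced = reconstruction_errors(X, 2)  # one component dropped

print(errors_full.max())
print(errors_reduced.max())
```

With all 3 components kept, the errors sit at the level of floating-point rounding. With 2 components, each error is the squared distance from the point to the plane spanned by the first two components, which is the quantity the method is meant to threshold.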
 

We can identify anomalies by setting a threshold on the reconstruction error. Here we use the 95th percentile as the threshold, so the 5% of points with the largest error are flagged as anomalies.

# Determine anomalies based on reconstruction error
threshold = np.percentile(df['reconstruction_error'], 95)  # 95th percentile as threshold
df['anomaly'] = df['reconstruction_error'] > threshold

# Add anomaly labels to the PCA DataFrame
df_pca['anomaly'] = df['anomaly'] 
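
Because the threshold is a percentile of the errors themselves, it flags a fixed share of the points no matter how many true anomalies the data contains. A minimal sketch with made-up error values (the `errors` array is hypothetical and only has the same length as the tutorial's dataset) shows this:

```python
import numpy as np

# Hypothetical reconstruction errors for 320 points (values are made up)
rng = np.random.default_rng(2)
errors = rng.exponential(scale=1.0, size=320)

threshold = np.percentile(errors, 95)
flagged = errors > threshold

# Number of points above the 95th percentile
print(flagged.sum())
```

For 320 distinct values this flags 16 points. If the expected share of anomalies is known, the percentile can be chosen to match it; otherwise the threshold is a judgment call.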
 

We will evaluate the performance of the anomaly detection using a confusion matrix and classification report.

 
# Evaluate performance
print("Confusion Matrix:")
print(confusion_matrix(df['ground_truth'], df['anomaly']))

print("\nClassification Report:")
print(classification_report(df['ground_truth'], df['anomaly']))
 

The output looks like this:

Confusion Matrix:
[[297   3]
 [  7  13]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       300
           1       0.81      0.65      0.72        20

    accuracy                           0.97       320
   macro avg       0.89      0.82      0.85       320
weighted avg       0.97      0.97      0.97       320 
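
The anomaly-class figures in the report follow directly from the confusion matrix. As a quick check, the sketch below recomputes precision, recall, and F1 for class 1 from the four counts shown above (the variable names are chosen here for illustration):

```python
import numpy as np

# Confusion matrix reported above: rows are true labels, columns are predictions
cm = np.array([[297, 3],
               [7, 13]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of flagged points that are true anomalies
recall = tp / (tp + fn)     # share of true anomalies that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(float(precision), 2), round(float(recall), 2), round(float(f1), 2))
```

Of the 16 flagged points, 13 are true anomalies (precision 0.81), and 13 of the 20 true anomalies are caught (recall 0.65), which agrees with the class 1 row of the report.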
 

Using the extracted anomalous points, we can visualize the data in a 3D plot, highlighting the anomalies.

 
# Visualize the data in 3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot normal points
ax.scatter(df_pca[~df['anomaly']]['PC1'], 
           df_pca[~df['anomaly']]['PC2'], 
           df_pca[~df['anomaly']]['PC3'], 
           color='blue', label='Normal', alpha=0.6)

# Plot anomalies
ax.scatter(df_pca[df['anomaly']]['PC1'], 
           df_pca[df['anomaly']]['PC2'], 
           df_pca[df['anomaly']]['PC3'], 
           color='red', label='Anomaly', alpha=0.6)

ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('3D PCA-Based Anomaly Detection')
ax.legend()
plt.show()
 

Conclusion

In this tutorial, we applied PCA-based anomaly detection to synthetic 3D data. We evaluated the performance using a confusion matrix and classification report and visualized the results in a 3D plot.

PCA-based anomaly detection is most useful for high-dimensional data in which normal points lie close to a lower-dimensional subspace. Its effectiveness depends on the number of components kept, the choice of the threshold, and the nature of the data. The full source code is provided below.

Source code listing

 
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, classification_report

# Create simpler 3D random scattered data
np.random.seed(42)
n_samples = 300
n_features = 3  # Only 3 features for simplicity

# Generate normal data (centered around 0)
normal_data = np.random.normal(loc=0, scale=1, size=(n_samples, n_features))

# Introduce some anomalies (located at the edges)
# Uniform distribution for edge anomalies 
anomalies = np.random.uniform(low=-5, high=5, size=(20, n_features))  
data = np.vstack([normal_data, anomalies])

# Convert to a DataFrame for easier manipulation
df = pd.DataFrame(data, columns=[f'feature_{i}' for i in range(n_features)])

# Add ground truth labels (1 for anomalies, 0 for normal)
df['ground_truth'] = 0  # Initialize all as normal
df.iloc[-20:, df.columns.get_loc('ground_truth')] = 1  # Last 20 rows are anomalies
print(df.head())

# Scale the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(df.drop(columns=['ground_truth']))

# Apply PCA with 3 components
pca = PCA(n_components=3)
data_pca = pca.fit_transform(data_scaled)

# Convert to DataFrame for easier manipulation
df_pca = pd.DataFrame(data_pca, columns=['PC1', 'PC2', 'PC3'])

# Reconstruct the data from the PCA space
data_reconstructed = pca.inverse_transform(data_pca)

# Calculate the reconstruction error
reconstruction_error = np.sum((data_scaled - data_reconstructed) ** 2, axis=1)

# Add the reconstruction error to the DataFrame
df['reconstruction_error'] = reconstruction_error

# Determine anomalies based on reconstruction error
threshold = np.percentile(df['reconstruction_error'], 95)  # 95th percentile as threshold
df['anomaly'] = df['reconstruction_error'] > threshold

# Add anomaly labels to the PCA DataFrame
df_pca['anomaly'] = df['anomaly']

# Evaluate performance
print("Confusion Matrix:")
print(confusion_matrix(df['ground_truth'], df['anomaly']))

print("\nClassification Report:")
print(classification_report(df['ground_truth'], df['anomaly']))

# Visualize the data in 3D
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

# Plot normal points
ax.scatter(df_pca[~df['anomaly']]['PC1'], 
           df_pca[~df['anomaly']]['PC2'], 
           df_pca[~df['anomaly']]['PC3'], 
           color='blue', label='Normal', alpha=0.6)

# Plot anomalies
ax.scatter(df_pca[df['anomaly']]['PC1'], 
           df_pca[df['anomaly']]['PC2'], 
           df_pca[df['anomaly']]['PC3'], 
           color='red', label='Anomaly', alpha=0.6)

ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('3D PCA-Based Anomaly Detection')
ax.legend()
plt.show()
 
