Anomaly Detection Example with Kernel Density in Python

Kernel density estimation is a method for estimating the probability density function of a random variable. We can apply it to detect outliers in a dataset: samples that fall in regions of low estimated density are candidates for anomalies.
In this tutorial, we'll learn how to detect the outliers of regression data by applying the KernelDensity class of the Scikit-learn API in Python. The tutorial covers:
  1. Preparing the data
  2. Anomaly detection with KernelDensity
  3. Testing with Boston housing dataset
  4. Source code listing
  5. Video tutorial

If you'd like to learn about other anomaly detection methods, please check out my tutorial, A Brief Explanation of 8 Anomaly Detection Methods with Python.

We'll start by loading the required libraries for this tutorial.

from sklearn.neighbors import KernelDensity
from numpy import where, random, array, quantile
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston


Preparing the data

We'll use randomly generated regression data as the target dataset. Here, we'll write a simple function that generates the sample data, occasionally shifting a point up or down by 10 so that it acts as an outlier. We'll then plot the dataset to check it.

random.seed(124)
def makeData(N):
    x = []
    for i in range(N):
        a = i/1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        if(r >= 9.8):
            r = r + 10
        elif(r<(-4.8)):
            r = r +(-10)        
        x.append([a + r])   
    return array(x)

n = 500
x= makeData(n)

x_ax = range(n)
plt.plot(x_ax, x)
plt.show() 


Next, we'll standardize the dataset to zero mean and unit variance with scale().

x = scale(x)


Anomaly detection with KernelDensity

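We'll use the KernelDensity class of the Scikit-learn API to define the kernel density model. Called with no arguments, it uses a Gaussian kernel with a bandwidth of 1.0, as the printed parameters show.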

kernaldens = KernelDensity().fit(x)
print(kernaldens)
KernelDensity(algorithm='auto', atol=0, bandwidth=1.0, breadth_first=True, kernel='gaussian', leaf_size=40, metric='euclidean', metric_params=None, rtol=0)
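The default bandwidth of 1.0 shown above is not necessarily the best fit for every dataset. As a rough sketch, one common way to pick it is a cross-validated search over candidate values. The toy data, the candidate grid, and the names `data`, `candidates`, and `best_bw` below are assumptions made for illustration and are not part of this tutorial's pipeline.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Hypothetical stand-in for the scaled data: 300 standard-normal points.
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 1))

# Candidate bandwidths (an arbitrary illustrative grid from 0.1 to 10).
candidates = np.logspace(-1, 1, 20)

# KernelDensity.score() returns the total log-likelihood of held-out data,
# so GridSearchCV keeps the bandwidth that maximizes it across the folds.
search = GridSearchCV(KernelDensity(kernel="gaussian"),
                      {"bandwidth": candidates}, cv=5)
search.fit(data)

best_bw = search.best_params_["bandwidth"]
print("selected bandwidth:", best_bw)
```

A smaller bandwidth makes the density estimate follow individual points more closely, while a larger one smooths it out, which in turn changes which samples score lowest.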

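We'll obtain the score of each sample in the x dataset by using the score_samples() method. Each score is the log of the estimated density at that sample, so a lower score means the sample lies in a sparser region.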

scores = kernaldens.score_samples(x)
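To make the meaning of these scores concrete, here is a minimal sketch that compares score_samples() with a hand-computed Gaussian kernel density estimate. The toy sample, the two query points, and the variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Hypothetical toy sample: 200 standard-normal points in one column.
rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 1))

h = 1.0  # bandwidth, matching the KernelDensity default
kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(sample)

# One point in the dense center and one far out in the tail.
points = np.array([[0.0], [6.0]])
log_dens = kde.score_samples(points)

# Manual estimate: average of Gaussian kernels centered on each sample.
diffs = (points - sample.T) / h                      # shape (2, 200)
kernels = np.exp(-0.5 * diffs ** 2) / (h * np.sqrt(2 * np.pi))
manual_log_dens = np.log(kernels.mean(axis=1))

print("score_samples:", log_dens)
print("manual       :", manual_log_dens)
```

The two results should agree, and the point at 6.0 gets a much lower log-density than the point at 0.0, which is what the thresholding step relies on.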

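Then, we'll extract the threshold value from the scores by using the quantile() function. Here, .01 places the threshold at the 1st percentile, so roughly the lowest-scoring 1% of the samples will be treated as anomalies.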

thresh = quantile(scores, .01)
print(thresh)
-4.071068385863522
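As a quick sanity check of how the quantile threshold behaves, the sketch below applies the same rule to made-up scores and counts how many fall at or below the cutoff. The `toy_scores` array is a hypothetical stand-in, not the output of the model above.

```python
import numpy as np

# Hypothetical stand-in for log-density scores: 1000 random values.
rng = np.random.default_rng(1)
toy_scores = rng.normal(size=1000)

# The 0.01 quantile is the value below which about 1% of the scores fall.
cutoff = np.quantile(toy_scores, 0.01)

# where() returns a tuple of index arrays; [0] takes the single array.
flagged = np.where(toy_scores <= cutoff)[0]

print(len(flagged), "of", len(toy_scores), "scores flagged")
```

Raising the quantile (for example to .05) flags more samples, so the value acts as an assumed contamination rate rather than something learned from the data.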

Using the threshold value, we'll find the samples whose scores are equal to or lower than the threshold.

index = where(scores <= thresh)
values = x[index]

Finally, we'll visualize the results in a plot, highlighting the anomalies in red.

plt.plot(x_ax, x)
plt.scatter(index,values, color='r')
plt.show()



Testing with Boston housing dataset

We can apply the same method to the Boston housing dataset. We'll use only the target (y) part of the dataset, reshaping it into a single column and scaling it before passing it to the KernelDensity model.

boston = load_boston()
y = boston.target

y = y.reshape(y.shape[0],1)
y = scale(y)

Next, we'll define the model, fit it on the y data, and compute the scores of the samples. Then, we'll collect the anomalies by using the threshold value.

kernaldens = KernelDensity().fit(y)
print(kernaldens)

scores = kernaldens.score_samples(y)
thresh = quantile(scores, .01)
print(thresh)
index = where(scores <= thresh)
values = y[index]

Finally, we'll visualize the results in a plot, highlighting the anomalies in red.

x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(index,values, color='r')
plt.show()
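Because load_boston is no longer available from scikit-learn 1.2 onward, the same steps can be tried on another bundled dataset. The sketch below swaps in the target of load_diabetes purely as a stand-in, and wraps the steps in a helper named `flag_low_density`, which is a hypothetical name introduced here and not part of Scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import scale


def flag_low_density(values, q=0.01, bandwidth=1.0):
    """Hypothetical helper: return indices of the lowest-density samples."""
    # Reshape to a single column and standardize, as in the steps above.
    col = scale(np.asarray(values, dtype=float).reshape(-1, 1))
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(col)
    scores = kde.score_samples(col)       # log-density of each sample
    cutoff = np.quantile(scores, q)       # q-th quantile as the threshold
    return np.where(scores <= cutoff)[0], cutoff


# Stand-in dataset (assumption): diabetes target instead of Boston prices.
target = load_diabetes().target
idx, cutoff = flag_low_density(target)

print("threshold:", cutoff)
print("flagged target values:", target[idx])
```

The returned indices can be passed to plt.scatter() in the same way as in the plots above.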


In this tutorial, we've briefly learned how to detect anomalies with the kernel density method by using Scikit-learn's KernelDensity class in Python. The full source code is listed below.


Source code listing

from sklearn.neighbors import KernelDensity
from numpy import where, random, array, quantile
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston

random.seed(124)
def makeData(N):
    x = []
    for i in range(N):
        a = i/1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        if(r >= 9.8):
            r = r + 10
        elif(r<(-4.8)):
            r = r +(-10)        
        x.append([a + r])   
    return array(x)

n = 500
x= makeData(n)

x_ax = range(n)
plt.plot(x_ax, x)
plt.show()

x = scale(x)

kernaldens = KernelDensity().fit(x)
print(kernaldens)

scores = kernaldens.score_samples(x)
thresh = quantile(scores, .01)
print(thresh)

index = where(scores <= thresh)
values = x[index]

plt.plot(x_ax, x)
plt.scatter(index,values, color='r')
plt.show()

boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0],1)
y = scale(y)

kernaldens = KernelDensity().fit(y)
print(kernaldens)

scores = kernaldens.score_samples(y)
thresh = quantile(scores, .01)
print(thresh)

index = where(scores <= thresh)
values = y[index]

x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(index,values, color='r')
plt.show()


Video tutorial


