Kernel density estimation is a method for estimating the probability density function of a random variable. We can apply this model to detect outliers in a dataset: samples that fall in low-density regions are candidates for anomalies.
In this tutorial, we'll learn how to detect the outliers of regression
data by applying the KernelDensity class of Scikit-learn API in Python. The
tutorial covers:
- Preparing the data
- Anomaly detection with KernelDensity
- Testing with Boston housing dataset
- Source code listing
- Video tutorial
If you want to learn about other anomaly detection methods, please check out my tutorial, A Brief Explanation of 8 Anomaly Detection Methods with Python.
from sklearn.neighbors import KernelDensity
from numpy import where, random, array, quantile
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt
# load_boston was removed in scikit-learn 1.2; this import needs an older release.
from sklearn.datasets import load_boston
Preparing the data
We'll use randomly generated regression data as the target dataset. Here, we'll write a simple function to generate the sample data, then plot it to check how it looks.
random.seed(124)

def makeData(N):
    x = []
    for i in range(N):
        a = i/1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        # push values near either edge of the range far away to create outliers
        if(r >= 9.8):
            r = r + 10
        elif(r<(-4.8)):
            r = r +(-10)
        x.append([a + r])
    return array(x)

n = 500
x= makeData(n)
x_ax = range(n)
plt.plot(x_ax, x)
plt.show()
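To get a feel for how many extreme points makeData() tends to produce, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate based on the uniform(-5, 10) draw and the two cut-offs in the function; the variable names are introduced here for illustration.

```python
# Rough estimate of how many shifted (extreme) points makeData(500) produces.
low, high = -5.0, 10.0            # range of the uniform draw for r
width = high - low                # 15.0

p_upper = (high - 9.8) / width    # chance that r >= 9.8
p_lower = (-4.8 - low) / width    # chance that r < -4.8
p_extreme = p_upper + p_lower

n_samples = 500
expected_extremes = n_samples * p_extreme

print(f"P(extreme) = {p_extreme:.4f}")
print(f"Expected extreme points out of {n_samples}: {expected_extremes:.1f}")
```

The actual count in any single run will vary around this expectation, since each draw is random.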
Next, we'll scale the dataset.
x = scale(x)
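As a quick check of what scale() does, the sketch below standardizes a small array and compares the result with a manual computation. The array values are made up for illustration and are not taken from the tutorial's data.

```python
import numpy as np
from sklearn.preprocessing import scale

# Illustrative values only; not taken from the tutorial's dataset.
demo = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

demo_scaled = scale(demo)

# scale() subtracts the column mean and divides by the column standard
# deviation (population form, ddof=0), so this manual version should match.
manual = (demo - demo.mean(axis=0)) / demo.std(axis=0)

print(demo_scaled.ravel())
print(np.allclose(demo_scaled, manual))
```

After scaling, the column has zero mean and unit standard deviation, which keeps the default bandwidth of the density model on a comparable footing across datasets.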
Anomaly detection with KernelDensity
We'll use Scikit-learn API's KernelDensity class to define the kernel density model.
kernaldens = KernelDensity().fit(x)
print(kernaldens)
KernelDensity(algorithm='auto', atol=0, bandwidth=1.0, breadth_first=True,
kernel='gaussian', leaf_size=40, metric='euclidean',
metric_params=None, rtol=0)
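The printed parameters show that the model uses a Gaussian kernel with bandwidth=1.0 by default. The sketch below, on a made-up sample with one deliberately isolated point, gives a rough idea of how the bandwidth changes the scores. The data, the bandwidth values tried, and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Illustrative 1-D sample: 200 standard-normal points plus one far-away point.
data = np.concatenate([rng.normal(size=200), [8.0]]).reshape(-1, 1)

lowest = {}
for bw in (0.2, 1.0, 3.0):
    kde = KernelDensity(kernel="gaussian", bandwidth=bw).fit(data)
    log_dens = kde.score_samples(data)
    # Index of the sample with the lowest log-density for this bandwidth.
    lowest[bw] = int(np.argmin(log_dens))
    # Gap between the median log-density and the isolated point's log-density.
    gap = np.median(log_dens) - log_dens[-1]
    print(f"bandwidth={bw}: isolated point is {gap:.2f} below the median")

print(lowest)
```

In this sample the isolated point (index 200) gets the lowest score at every bandwidth tried, but the size of the gap changes, so the bandwidth is worth tuning on real data.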
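We'll obtain the score of each sample in the x dataset by using the score_samples() method. It returns the log-density of each sample under the fitted model, so lower scores correspond to less likely samples.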
scores = kernaldens.score_samples(x)
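To make the scores less of a black box, here is a minimal sketch that recomputes them by hand for a tiny made-up sample, assuming the Gaussian kernel and a bandwidth of 1.0 (the defaults shown above). The names pts, h, and manual_log_dens are introduced for this example only.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Illustrative 1-D sample; the values are made up for this sketch.
pts = np.array([[-1.0], [0.0], [0.5], [4.0]])
h = 1.0  # bandwidth, matching the KernelDensity default

kde = KernelDensity(kernel="gaussian", bandwidth=h).fit(pts)
sk_log_dens = kde.score_samples(pts)

# Manual Gaussian KDE: average of normal densities centred on each sample.
diffs = pts - pts.T                      # pairwise differences, shape (4, 4)
kernels = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
manual_log_dens = np.log(kernels.mean(axis=1))

print(sk_log_dens)
print(manual_log_dens)
```

The two arrays should agree, and the point at 4.0, which sits far from the others, receives the lowest log-density.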
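Then, we'll extract the threshold value from the scores by using the quantile() function. Here we take the 0.01 quantile, so roughly the 1% of samples with the lowest scores will be treated as outliers.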
thresh = quantile(scores, .01)
print(thresh)
-4.071068385863522
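The effect of the quantile-based threshold can be sketched on a small set of made-up scores. The values below are purely illustrative (99 typical scores and one very low one) and do not come from the model above.

```python
import numpy as np

# Made-up log-density scores: 99 typical values and one very low one.
demo_scores = np.concatenate([np.linspace(-2.0, -1.0, 99), [-9.0]])

# The 0.01 quantile is the value below which roughly 1% of the scores fall.
demo_thresh = np.quantile(demo_scores, 0.01)

# Indices of the scores at or below the threshold.
flagged = np.where(demo_scores <= demo_thresh)[0]
print(demo_thresh, flagged)
```

Note that a quantile threshold always flags about the same fraction of samples, whether or not they are true outliers, so the quantile value is a choice to adjust for each dataset.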
Using the threshold value, we'll find the samples whose scores are equal to or lower than the threshold.
index = where(scores <= thresh)
values = x[index]

plt.plot(x_ax, x)
plt.scatter(index,values, color='r')
plt.show()
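The steps so far (scale, fit, score, threshold, select) can be wrapped into one function, as in the sketch below. The helper kde_outlier_indices() is hypothetical and not part of scikit-learn; it assumes 1-D input, and the test data with two planted outliers is made up for illustration.

```python
import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import scale


def kde_outlier_indices(data, q=0.01, bandwidth=1.0):
    """Return indices of samples whose KDE log-density is in the lowest q quantile.

    Hypothetical helper for illustration; assumes 1-D input data.
    """
    data = scale(np.asarray(data, dtype=float).reshape(-1, 1))
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(data)
    scores = kde.score_samples(data)
    thresh = np.quantile(scores, q)
    return np.where(scores <= thresh)[0]


rng = np.random.default_rng(1)
sample = rng.normal(size=300)
sample[[10, 150]] = [12.0, -12.0]   # plant two obvious outliers

found = kde_outlier_indices(sample)
print(found)
```

The planted points at indices 10 and 150 should appear in the result, along with whichever other sample happens to fall in the lowest 1% of scores.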
Testing with Boston housing dataset
We can apply the same method to the Boston housing dataset. We'll use only the target (y) part of the dataset, reshaping it into a single column and scaling it so that it can be used with the KernelDensity model.
# Requires scikit-learn < 1.2, where load_boston is still available.
boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0],1)
y = scale(y)
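The reshape step is needed because scikit-learn estimators expect a 2-D array of shape (n_samples, n_features), while the target is 1-D. A small sketch with made-up values shows the shape change and an equivalent shorthand.

```python
import numpy as np

target = np.array([1.5, 2.0, 3.5, 50.0])      # made-up 1-D values

col_a = target.reshape(target.shape[0], 1)    # form used in the tutorial
col_b = target.reshape(-1, 1)                 # -1 lets NumPy infer the row count

print(target.shape, col_a.shape, col_b.shape)
print(np.array_equal(col_a, col_b))
```

Both forms produce the same single-column array, so either can be passed to scale() and KernelDensity.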
Next, we'll define the model, fit it on the y data, and compute the scores of the samples. Then, we'll collect the anomalies by using the threshold value.
kernaldens = KernelDensity().fit(y)
print(kernaldens)

scores = kernaldens.score_samples(y)
thresh = quantile(scores, .01)
print(thresh)

index = where(scores <= thresh)
values = y[index]
Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.
x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(index,values, color='r')
plt.show()
Source code listing
from sklearn.neighbors import KernelDensity
from numpy import where, random, array, quantile
from sklearn.preprocessing import scale
import matplotlib.pyplot as plt
# load_boston was removed in scikit-learn 1.2; this import needs an older release.
from sklearn.datasets import load_boston

random.seed(124)

def makeData(N):
    x = []
    for i in range(N):
        a = i/1000 + random.uniform(-3, 2)
        r = random.uniform(-5, 10)
        # push values near either edge of the range far away to create outliers
        if(r >= 9.8):
            r = r + 10
        elif(r<(-4.8)):
            r = r +(-10)
        x.append([a + r])
    return array(x)

n = 500
x= makeData(n)
x_ax = range(n)
plt.plot(x_ax, x)
plt.show()

x = scale(x)

kernaldens = KernelDensity().fit(x)
print(kernaldens)

scores = kernaldens.score_samples(x)
thresh = quantile(scores, .01)
print(thresh)

index = where(scores <= thresh)
values = x[index]

plt.plot(x_ax, x)
plt.scatter(index,values, color='r')
plt.show()

boston = load_boston()
y = boston.target
y = y.reshape(y.shape[0],1)
y = scale(y)

kernaldens = KernelDensity().fit(y)
print(kernaldens)

scores = kernaldens.score_samples(y)
thresh = quantile(scores, .01)
print(thresh)

index = where(scores <= thresh)
values = y[index]

x_ax = range(y.shape[0])
plt.plot(x_ax, y)
plt.scatter(index,values, color='r')
plt.show()
Video tutorial