The Elliptical Envelope method is a statistical and machine learning technique used for detecting outliers or anomalies in a dataset. It's particularly useful when you have multivariate data (data with multiple features or dimensions) and you want to identify observations that deviate significantly from the norm.
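The Elliptical Envelope method assumes that the regular data follow a Gaussian distribution. It fits a robust covariance estimate to the data, which defines an ellipse around the central mass of points, and treats the observations that fall far outside that ellipse as outliers.
The scikit-learn API provides the EllipticEnvelope class to apply this method for anomaly detection. In this tutorial, we'll learn how to detect anomalies with the Elliptical Envelope method in Python. The tutorial covers: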
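To make the idea concrete, here is a minimal sketch of what the fitted model measures. It assumes synthetic two-dimensional Gaussian data (the mean, covariance, sample size, and seeds are illustrative choices, not this tutorial's dataset) and checks that the value returned by score_samples() is the negative squared Mahalanobis distance computed from the model's robust location and precision estimates.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Illustrative data: 300 correlated Gaussian points (all values are assumptions).
rng = np.random.RandomState(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[1.0, 0.6], [0.6, 1.0]],
                            size=300)

model = EllipticEnvelope(random_state=0).fit(X)

# Squared Mahalanobis distance of every point from the robust center,
# computed by hand as diff @ precision @ diff for each row.
diff = X - model.location_
d2_manual = np.einsum("ij,jk,ik->i", diff, model.precision_, diff)

# The same quantity as reported by the estimator itself.
d2_sklearn = model.mahalanobis(X)
scores = model.score_samples(X)

print(np.allclose(d2_manual, d2_sklearn))  # manual vs. built-in distance
print(np.allclose(scores, -d2_manual))     # score is the negated distance
```

Points far from the center in this metric get strongly negative scores, which is what the rest of the tutorial relies on.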
- Preparing the data
- Defining the model and prediction
- Anomaly detection with scores
- Source code listing
If you want to learn about other anomaly detection methods, please check out my tutorial "A Brief Explanation of 8 Anomaly Detection Methods with Python".
We'll start by loading the required libraries for this tutorial.
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
Preparing the data
random.seed(2)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))
We'll check the dataset by visualizing it in a plot.
plt.scatter(x[:,0], x[:,1])
plt.show()
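As an alternative to seeding NumPy's global generator, make_blobs also accepts a random_state argument. The sketch below is one way to get a reproducible dataset. The seed value and the center_box=(5, 20) bounds, written here in (min, max) order, are assumptions for illustration and will not reproduce the exact points plotted above.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Passing random_state makes the call reproducible on its own,
# independent of the global NumPy random state.
x1, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                   center_box=(5, 20), random_state=2)
x2, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                   center_box=(5, 20), random_state=2)

print(x1.shape)                 # number of samples and features
print(np.array_equal(x1, x2))   # both calls return identical data
print(x1.mean(axis=0))          # approximate location of the single cluster
```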
Defining the model and prediction
We'll define the model by using the EllipticEnvelope class of the scikit-learn API and set the contamination argument in the constructor. The contamination argument defines the expected proportion of outliers in the dataset.
elenv = EllipticEnvelope(contamination=.02)
print(elenv)
EllipticEnvelope(assume_centered=False, contamination=0.02, random_state=None,
store_precision=True, support_fraction=None)
We'll fit the model on the x dataset and get the predictions with the fit_predict() method, which returns 1 for inliers and -1 for outliers.
pred = elenv.fit_predict(x)
Next, we'll extract the negative outputs as the outliers.
anom_index = where(pred==-1)
values = x[anom_index]
Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
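To see how the contamination argument drives the result, the following sketch fits the model with a few different values and counts the samples labeled -1. The dataset parameters and the random_state values are assumptions chosen for reproducibility, and the dictionary name flagged_counts is only used for this illustration.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

# Illustrative dataset, similar in shape to the one used in this tutorial.
X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                  center_box=(5, 20), random_state=2)

flagged_counts = {}
for contamination in (0.02, 0.05, 0.1):
    model = EllipticEnvelope(contamination=contamination, random_state=0)
    pred = model.fit_predict(X)            # 1 for inliers, -1 for outliers
    n_flagged = int(np.sum(pred == -1))
    flagged_counts[contamination] = n_flagged
    print(contamination, n_flagged, n_flagged / len(X))
```

The flagged share follows the contamination value closely, which means the model always reports roughly that fraction of the training data as outliers, whether or not the data contain real anomalies.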
Anomaly detection with scores
We can also find anomalies by using their scores. In this approach, we'll define the model without setting the contamination argument, so the model applies the default value of 0.1.
elenv = EllipticEnvelope()
print(elenv)
EllipticEnvelope(assume_centered=False, contamination=0.1, random_state=None,
store_precision=True, support_fraction=None)
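We'll fit the model on the x dataset, then extract the score of each sample with the score_samples() method. The score is the negative Mahalanobis distance of a sample, so lower scores indicate more abnormal samples.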
elenv.fit(x)
scores = elenv.score_samples(x)
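The raw scores are closely related to the labels from the previous section. As a minimal sketch (the dataset parameters and random_state values are assumptions), the code below checks that decision_function() is the score shifted by the fitted offset_ attribute, and that predict() labels a sample -1 exactly when that shifted value is negative.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                  center_box=(5, 20), random_state=2)

model = EllipticEnvelope(random_state=0).fit(X)

scores = model.score_samples(X)        # negative Mahalanobis distances
decision = model.decision_function(X)  # scores shifted by the learned offset
labels = model.predict(X)              # 1 for inliers, -1 for outliers

print(np.allclose(decision, scores - model.offset_))
print(np.array_equal(labels == -1, decision < 0))
```

Working with score_samples() directly lets us pick our own cut-off instead of the one implied by contamination.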
Next, we'll obtain the threshold value from the scores by using the quantile function. Here, we'll treat the lowest 2 percent of the scores as anomalies.
thresh = quantile(scores, .02)
print(thresh)
-9.469243838613968
Next, we'll extract the anomalies by selecting the samples whose scores are at or below the threshold.
index = where(scores <= thresh)
values = x[index]
Finally, we can visualize the results in a plot by highlighting the anomalies with a color.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
Both methods above give the same result here, because the 2 percent score quantile corresponds to the contamination value of 0.02 used in the first method. You can use either of them in your analysis. Lower the quantile or the contamination value to keep only the more extreme cases.
In this tutorial, we've learned how to detect anomalies with the Elliptical Envelope method by using scikit-learn's EllipticEnvelope class in Python. The full source code is listed below.
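The agreement between the two approaches can be checked directly. The sketch below is an illustration under stated assumptions: the dataset parameters are not the tutorial's exact ones, and both models get the same random_state so that their covariance fits are identical.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3,
                  center_box=(5, 20), random_state=2)

# Approach 1: let the model label outliers through the contamination argument.
labels = EllipticEnvelope(contamination=0.02, random_state=0).fit_predict(X)
idx_pred = np.where(labels == -1)[0]

# Approach 2: threshold the raw scores at their 2 percent quantile.
scores = EllipticEnvelope(random_state=0).fit(X).score_samples(X)
thresh = np.quantile(scores, 0.02)
idx_score = np.where(scores <= thresh)[0]

print(idx_pred)
print(idx_score)
print(np.array_equal(idx_pred, idx_score))
```

Without a fixed random_state, the two fits can differ slightly, so the flagged sets are expected to match closely rather than being guaranteed identical.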
Source code listing
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
random.seed(12)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))
plt.scatter(x[:,0], x[:,1])
plt.show()
elenv = EllipticEnvelope(contamination=.02)
print(elenv)
pred = elenv.fit_predict(x)
anom_index=where(pred==-1)
values = x[anom_index]
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
elenv = EllipticEnvelope()
print(elenv)
elenv.fit(x)
scores = elenv.score_samples(x)
thresh = quantile(scores, .02)
print(thresh)
index = where(scores <= thresh)
values = x[index]
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()