In this tutorial, we'll learn how to detect anomalies in a dataset by using the Local Outlier Factor (LOF) method in Python. The Scikit-learn API provides the LocalOutlierFactor class for this algorithm, and we'll use it throughout. The tutorial covers:
- Preparing the dataset
- Defining the model and prediction
- Anomaly detection with scores
- Source code listing
If you want to learn about other anomaly detection methods, please check out my tutorial A Brief Explanation of 8 Anomaly Detection Methods with Python.
We'll start by loading the required libraries for this tutorial.
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
Preparing the dataset
We'll create a random sample dataset for this tutorial by using the make_blobs() function. Seeding NumPy's random generator first makes the sample reproducible.
random.seed(1)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(10,10))
We'll check the dataset by visualizing it in a plot.
plt.scatter(x[:,0], x[:,1])
plt.show()
Defining the model and prediction
We'll define the model by using the LocalOutlierFactor class of the Scikit-learn API. We'll set the number of neighbors (n_neighbors) and the contamination value in the arguments. Contamination defines the expected proportion of outliers in the dataset.
lof = LocalOutlierFactor(n_neighbors=20, contamination=.03)
We'll fit the model on the x dataset and get the predictions with the fit_predict() method.
y_pred = lof.fit_predict(x)
The fit_predict() method returns 1 for inliers and -1 for outliers, so we'll extract the samples labeled -1 as the outliers.
lofs_index = where(y_pred==-1)
values = x[lofs_index]
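To see how the contamination argument affects the result, the following minimal sketch refits the model with a few different values and counts the samples labeled -1. The random_state argument and the list of contamination values are assumptions added here for reproducibility and illustration; they are not part of the tutorial's code.

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# Same blob as in the tutorial; random_state is an assumption made here
# so that the sketch is reproducible on its own.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                  center_box=(10, 10), random_state=1)

# Illustrative contamination values (hypothetical choices).
contaminations = [0.01, 0.03, 0.05, 0.10]
counts = []
for c in contaminations:
    labels = LocalOutlierFactor(n_neighbors=20, contamination=c).fit_predict(x)
    # Count the samples that the model labeled as outliers (-1).
    counts.append(int((labels == -1).sum()))

for c, n in zip(contaminations, counts):
    print(f"contamination={c:.2f} -> {n} samples flagged as outliers")
```

A larger contamination value raises the internal score cut-off, so the number of flagged samples should grow with it.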
Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
Anomaly detection with scores
In the second method, we'll define the model without setting the contamination argument.
model = LocalOutlierFactor(n_neighbors=20)
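We'll fit the model on the x dataset, then extract the score of each sample from the negative_outlier_factor_ attribute. Scores close to -1 indicate inliers, while more negative scores indicate outliers.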
model.fit_predict(x)
lof = model.negative_outlier_factor_
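As a quick sanity check of how the scores behave, the sketch below appends one deliberately distant point to the blob and confirms that it receives the most negative score. The extra point at (13, 13) and the random_state value are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# Same blob as in the tutorial; random_state is an assumption for reproducibility.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                  center_box=(10, 10), random_state=1)

# Hypothetical far-away point: 3 units from the center along each axis,
# while the cluster itself has a standard deviation of 0.3.
far_point = np.array([[13.0, 13.0]])
x_aug = np.vstack([x, far_point])

model = LocalOutlierFactor(n_neighbors=20)
model.fit(x_aug)
scores = model.negative_outlier_factor_

# The sample with the lowest (most negative) score is the strongest outlier.
worst = int(np.argmin(scores))
print("index of the most negative score:", worst)
print("its score:", scores[worst])
print("median score of all samples:", np.median(scores))
```

The median score should sit near -1, which is the typical value for samples inside the cluster.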
Next, we'll obtain the threshold value from the scores by using the quantile function. Here, we'll treat the lowest 3 percent of the scores as anomalies.
thresh = quantile(lof, .03)
print(thresh)
-1.8191482960907037
We'll find the anomalies by comparing the scores with the threshold value, then use their indexes to select the corresponding samples.
index = where(lof<=thresh)
values = x[index]
Finally, we can visualize the results in a plot by highlighting the anomalies with a color.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()
Both methods above give the same result, so you can use either of them in your analysis. Lowering the threshold quantile or the contamination value keeps only the more extreme cases.
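The agreement between the two methods can be checked directly. The sketch below, which assumes a fixed random_state for reproducibility, compares the indexes returned by each method. It also prints the fitted offset_ attribute, which is the score cut-off that Scikit-learn derives from the contamination value.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# Same blob as in the tutorial; random_state is an assumption for reproducibility.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                  center_box=(10, 10), random_state=1)

# Method 1: let the model label outliers from the contamination value.
lof_model = LocalOutlierFactor(n_neighbors=20, contamination=.03)
y_pred = lof_model.fit_predict(x)
index_pred = np.where(y_pred == -1)[0]

# Method 2: threshold the raw scores at the 3 percent quantile.
score_model = LocalOutlierFactor(n_neighbors=20)
score_model.fit(x)
scores = score_model.negative_outlier_factor_
thresh = np.quantile(scores, .03)
index_score = np.where(scores <= thresh)[0]

print("offset_ from contamination:", lof_model.offset_)
print("quantile threshold:", thresh)
print("same samples flagged:", np.array_equal(index_pred, index_score))
```

Scikit-learn flags samples whose score is strictly below offset_, while the comparison in this tutorial uses <=, so the two methods could differ only if a score were exactly equal to the threshold.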
In this tutorial, we've learned how to detect anomalies with the Local Outlier Factor algorithm by using the LocalOutlierFactor class of the Scikit-learn API in Python. The full source code is listed below.
Source code listing
from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt
random.seed(1)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(10,10))
plt.scatter(x[:,0], x[:,1])
plt.show()
lof = LocalOutlierFactor(n_neighbors=20, contamination=.03)
y_pred = lof.fit_predict(x)
lofs_index=where(y_pred==-1)
values = x[lofs_index]
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
model = LocalOutlierFactor(n_neighbors=20)
print(model)
model.fit_predict(x)
lof = model.negative_outlier_factor_
thresh = quantile(lof, .03)
print(thresh)
index = where(lof<=thresh)
values = x[index]
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()