DataTechNotes: Anomaly Detection Example with Local Outlier Factor in Python

The Local Outlier Factor (LOF) is an algorithm for detecting anomalies in observation data. Its main idea is to measure the local density of each sample and compare it with the local densities of its neighbors. Samples whose density is substantially lower than that of their neighbors are considered anomalies.
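
To make the density comparison concrete, here is a minimal from-scratch sketch of the LOF score in NumPy. It is an illustration only, not the Scikit-learn implementation: the function name lof_scores and the toy data are assumptions made for this example, and it uses a brute-force distance matrix that only suits small datasets.

```python
import numpy as np

def lof_scores(points, k):
    """Return the LOF of every row in `points` (hypothetical helper)."""
    # Pairwise Euclidean distances; the diagonal is set to inf so a point
    # is never counted as its own neighbor.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)

    # Indices and distances of the k nearest neighbors of every point.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    nn_dist = np.take_along_axis(dist, nn_idx, axis=1)

    # k-distance: distance from each point to its k-th nearest neighbor.
    k_dist = nn_dist[:, -1]

    # Reachability distance from p to neighbor o: max(k_dist(o), d(p, o)).
    reach = np.maximum(nn_dist, k_dist[nn_idx])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = 1.0 / reach.mean(axis=1)

    # LOF: average ratio of the neighbors' density to the point's own density.
    return lrd[nn_idx].mean(axis=1) / lrd

# Toy data (assumed for illustration): a tight cluster plus one distant point.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(10, 0.3, size=(50, 2)), [[13.0, 13.0]]])
scores = lof_scores(points, k=10)
print(scores.argmax(), scores.max())
```

Scores close to 1 mean a sample is about as dense as its neighbors, while scores well above 1 mean it sits in a much sparser region. In this toy data the distant point, stored last, receives the largest score.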

In this tutorial, we'll learn how to detect anomalies in a dataset by using the Local Outlier Factor method in Python. The Scikit-learn API provides the LocalOutlierFactor class for this algorithm, and we'll use it in this tutorial. The tutorial covers:

Preparing the dataset
Defining the model and prediction
Anomaly detection with scores
Source code listing

If you want to learn about other anomaly detection methods, please check out my A Brief Explanation of 8 Anomaly Detection Methods with Python tutorial.

We'll start by loading the required libraries for this tutorial.

from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

Preparing the dataset

We'll create a random sample dataset for this tutorial by using the make_blobs() function.

random.seed(1)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(10,10))
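
As an alternative to seeding NumPy's global generator, make_blobs() also accepts a random_state argument. The sketch below assumes random_state=1 (a choice made for this example) and checks that two calls return identical data centered near (10, 10).

```python
from sklearn.datasets import make_blobs

# random_state fixes the generator for this call only (assumed value: 1).
x1, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                   center_box=(10, 10), random_state=1)
x2, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                   center_box=(10, 10), random_state=1)

print(x1.shape)            # number of samples and features
print((x1 == x2).all())    # both calls produce the same points
print(x1.mean(axis=0))     # the blob is centered close to (10, 10)
```

Passing random_state keeps the data reproducible even if other code draws random numbers in between.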

We'll check the dataset by visualizing it in a plot.

plt.scatter(x[:,0], x[:,1])
plt.show()

Defining the model and prediction

We'll define the model by using the LocalOutlierFactor class of the Scikit-learn API. We'll set the number of neighbors and the contamination value in the arguments. Contamination defines the expected proportion of outliers in the dataset.

lof = LocalOutlierFactor(n_neighbors=20, contamination=.03)
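
To see how the contamination argument drives the number of flagged samples, the following sketch fits the model with a few contamination values and counts the samples labeled -1. The values tried and random_state=1 are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                  center_box=(10, 10), random_state=1)

counts = {}
for c in (0.01, 0.03, 0.1):
    labels = LocalOutlierFactor(n_neighbors=20, contamination=c).fit_predict(x)
    # Count the samples the model marks as outliers for this contamination.
    counts[c] = int((labels == -1).sum())

print(counts)
```

A larger contamination value flags more samples, roughly in proportion to the dataset size.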

We'll fit the model on the x dataset and get the predicted labels with the fit_predict() method.

y_pred = lof.fit_predict(x)

The method labels inliers as 1 and outliers as -1, so we'll extract the samples labeled -1 as the outliers.

lofs_index = where(y_pred==-1)
values = x[lofs_index]
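
The same selection can also be written with a boolean mask instead of where(). The sketch below, which regenerates the data with an assumed random_state=1, shows that both forms pick the same rows.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                  center_box=(10, 10), random_state=1)
y_pred = LocalOutlierFactor(n_neighbors=20, contamination=.03).fit_predict(x)

# where() with a single argument returns a tuple of index arrays.
idx = np.where(y_pred == -1)
via_index = x[idx]

# A boolean mask selects the same rows without building indices first.
mask = y_pred == -1
via_mask = x[mask]

print(np.unique(y_pred))                  # labels used by the model
print(np.array_equal(via_index, via_mask))
```

The index form is handy when the positions of the outliers are needed later, while the mask form is shorter when only the values matter.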

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()

Anomaly detection with scores

In the second method, we'll define the model without setting the contamination argument.

model = LocalOutlierFactor(n_neighbors=20)

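We'll fit the model on the x dataset, then extract the score of each sample. The negative_outlier_factor_ attribute holds the negated LOF of each sample, so the lower the value, the more abnormal the sample.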

model.fit_predict(x)

lof = model.negative_outlier_factor_
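
When contamination is left at its default, the model still produces labels by comparing the scores with its offset_ attribute. The sketch below, using an assumed random_state=1 for the data, reproduces those labels from the raw scores.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                  center_box=(10, 10), random_state=1)

model = LocalOutlierFactor(n_neighbors=20)
labels = model.fit_predict(x)
scores = model.negative_outlier_factor_

# With the default contamination, offset_ is the cut-off applied to the scores.
print(model.offset_)

# Samples scoring below offset_ are labeled -1, all others 1.
manual = np.where(scores < model.offset_, -1, 1)
print(np.array_equal(labels, manual))
```

Inliers tend to score around -1, so the default cut-off only flags samples that are clearly sparser than their neighbors.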

Next, we'll obtain the threshold value from the scores by using the quantile function. Here, we'll treat the lowest 3 percent of the scores as the anomalies.

thresh = quantile(lof, .03)
print(thresh)

-1.8191482960907037

We'll extract the anomalies by selecting the samples whose score is at or below the threshold.

index = where(lof<=thresh)
values = x[index]

Finally, we can visualize the results in a plot by highlighting the anomalies with a color.

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()

Both methods above give the same result here. You can use either of them in your analysis. The threshold or contamination value can be adjusted to make the detection stricter or looser.
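
The agreement between the two methods can be checked directly. The following sketch regenerates the data with an assumed random_state=1 and compares the sets of flagged sample indices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3,
                  center_box=(10, 10), random_state=1)

# Method 1: let the model apply the contamination threshold.
labels = LocalOutlierFactor(n_neighbors=20, contamination=.03).fit_predict(x)
by_label = set(np.where(labels == -1)[0].tolist())

# Method 2: threshold the raw scores at the 3 percent quantile.
model = LocalOutlierFactor(n_neighbors=20)
model.fit(x)
scores = model.negative_outlier_factor_
by_score = set(np.where(scores <= np.quantile(scores, .03))[0].tolist())

print(len(by_label), len(by_score), by_label == by_score)
```

Both approaches cut the same scores at the same percentile, which is why they select the same samples on this data.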

In this tutorial, we've learned how to detect the anomalies with the Local Outlier Factor algorithm by using the Scikit-learn API class in Python. The full source code is listed below.

Source code listing

from sklearn.neighbors import LocalOutlierFactor
from sklearn.datasets import make_blobs
from numpy import quantile, where, random
import matplotlib.pyplot as plt

random.seed(1)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(10,10))

plt.scatter(x[:,0], x[:,1])
plt.show()

lof = LocalOutlierFactor(n_neighbors=20, contamination=.03)

y_pred = lof.fit_predict(x)

lofs_index=where(y_pred==-1)
values = x[lofs_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()

model = LocalOutlierFactor(n_neighbors=20)

print(model)

model.fit_predict(x)

lof = model.negative_outlier_factor_
thresh = quantile(lof, .03)
print(thresh)

index = where(lof<=thresh)
values = x[index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0],values[:,1], color='r')
plt.show()
