The Scikit-learn API provides the DBSCAN class for this algorithm and we'll use it in this tutorial. The tutorial covers:
- Preparing the dataset
- Defining the model and anomaly detection
- Source code listing
If you'd like to learn about other anomaly detection methods, please check out my tutorial, A Brief Explanation of 8 Anomaly Detection Methods with Python.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt
Preparing the dataset
We'll create a random sample dataset for this tutorial by using the make_blobs() function.
random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))
We'll check the dataset by visualizing it in a plot.
plt.scatter(x[:,0], x[:,1])
plt.show()
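As a quick sanity check on what make_blobs() returns, here is a minimal sketch. It passes random_state=7 directly to make_blobs(), which is an assumption of this sketch (the tutorial seeds NumPy's global generator instead), and the names data_a, labels_a, and data_b are only illustrative.

```python
from sklearn.datasets import make_blobs

# random_state=7 is an assumption of this sketch; the tutorial calls random.seed(7) instead.
data_a, labels_a = make_blobs(n_samples=200, centers=1, cluster_std=.3, random_state=7)
data_b, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, random_state=7)

# 200 samples with 2 features each (n_features defaults to 2).
print(data_a.shape)

# With centers=1, every sample belongs to the same generated blob (label 0).
print(set(labels_a.tolist()))

# The same random_state reproduces exactly the same points.
print((data_a == data_b).all())
```

Fixing the seed in either way keeps the plot and the detected outliers the same from run to run.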
Defining the model and anomaly detection
We'll define the model by using the DBSCAN class of the Scikit-learn API and set the 'eps' and 'min_samples' arguments. The 'eps' argument is the maximum distance between two samples for one to be considered in the neighborhood of the other, and 'min_samples' is the number of samples required in a neighborhood (including the point itself) for a point to be treated as a core point.
dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan)
DBSCAN(algorithm='auto', eps=0.28, leaf_size=30, metric='euclidean',
metric_params=None, min_samples=20, n_jobs=None, p=None)
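To get a feel for how 'eps' affects the result, the sketch below fits DBSCAN with several 'eps' values and counts the points labeled as noise. The helper count_noise(), the list of 'eps' values, and the x_demo dataset (generated with random_state=7) are assumptions made for illustration, not part of the tutorial's code.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data similar to the tutorial's; random_state=7 is an assumption.
x_demo, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, random_state=7)

def count_noise(data, eps, min_samples=20):
    """Hypothetical helper: number of points DBSCAN labels as noise (-1)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(data)
    return int((labels == -1).sum())

# Candidate eps values chosen only for illustration.
eps_values = [0.1, 0.2, 0.28, 0.4, 0.6]
noise_counts = [count_noise(x_demo, e) for e in eps_values]

for e, n in zip(eps_values, noise_counts):
    print(f"eps={e}: {n} points labeled as noise")
```

With 'min_samples' fixed, a larger 'eps' can only enlarge each neighborhood, so the number of noise points never increases as 'eps' grows. A value that is too small flags most of the data as anomalies, while a value that is too large flags none.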
We'll fit the model on the x dataset and get the cluster labels with the fit_predict() method.
pred = dbscan.fit_predict(x)
DBSCAN assigns the label -1 to noise points, so next we'll extract the samples with that label as the outliers.
anom_index = where(pred == -1)
values = x[anom_index]
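Before plotting, it can help to look at how many samples received each label. The following is a minimal sketch under the same assumptions as the tutorial's model settings; x_demo, pred_demo, summary, and outliers are illustrative names, and the data is generated with random_state=7 rather than the global seed.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative data; random_state=7 is an assumption of this sketch.
x_demo, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, random_state=7)
pred_demo = DBSCAN(eps=0.28, min_samples=20).fit_predict(x_demo)

# Count samples per label; -1 is the noise label, 0 and above are cluster ids.
labels, counts = np.unique(pred_demo, return_counts=True)
summary = dict(zip(labels.tolist(), counts.tolist()))
print(summary)

# A boolean mask selects the same rows as where(pred == -1).
mask = pred_demo == -1
outliers = x_demo[mask]
print(outliers.shape)
```

The boolean mask is an alternative to where(); both select the rows labeled -1.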
Finally, we'll visualize the results in a plot, highlighting the anomalies in red.
plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()
In this tutorial, we've learned how to detect anomalies with the DBSCAN method by using Scikit-learn's DBSCAN class in Python. The full source code is listed below.
Source code listing
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from numpy import random, where
import matplotlib.pyplot as plt

random.seed(7)
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=.3, center_box=(20, 5))

plt.scatter(x[:,0], x[:,1])
plt.show()

dbscan = DBSCAN(eps = 0.28, min_samples = 20)
print(dbscan)

pred = dbscan.fit_predict(x)
anom_index = where(pred == -1)
values = x[anom_index]

plt.scatter(x[:,0], x[:,1])
plt.scatter(values[:,0], values[:,1], color='r')
plt.show()