The Gaussian Mixture Model (GMM) is a probabilistic model that represents data as a mixture of Gaussian distributions. It is widely used for clustering.
In this tutorial, we'll learn how to detect anomalies in a dataset by using the GaussianMixture class of the scikit-learn API. The tutorial covers:
- Preparing the dataset
- Defining the model and anomaly detection
- Source code listing
If you want to learn about other anomaly detection methods, please check out my tutorial, A Brief Explanation of 8 Anomaly Detection Methods with Python.
We'll start by loading the required libraries for this tutorial.
Preparing the dataset
We'll create a random sample dataset for this tutorial by using the make_blobs() function.
This is the target data in which we'll detect anomalies with the Gaussian mixture method.
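The data-generation code did not survive in this copy of the post, so the following is only a minimal sketch of how such a dataset could be built. The sample count, cluster spread, and random seed are assumptions chosen for illustration, not the original values.

```python
from sklearn.datasets import make_blobs

# Assumed settings: a single tight 2-D cluster of 200 points.
# random_state is fixed only so that repeated runs give the same data.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=13)

print(x.shape)  # (200, 2)
```

make_blobs() also returns cluster labels, which are discarded here because anomaly detection with a mixture model is unsupervised.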
Defining the model and anomaly detection
In scikit-learn's GaussianMixture class, the score_samples method computes the log likelihood of each sample in the input data. The log likelihood represents how well the observed data fits the estimated Gaussian mixture model.
In the context of anomaly detection, we can set a threshold on these log likelihood scores. Samples with log likelihoods below a certain threshold are considered anomalies or outliers, as they are less likely to be generated by the learned Gaussian mixture model.
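For reference, the two paragraphs above can be written compactly. The notation is the standard one for mixture models and is an assumption, since the post itself does not define symbols: K components with weights \pi_k, means \mu_k, and covariances \Sigma_k, and a threshold \tau.

```latex
\log p(x_i) = \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),
\qquad
x_i \text{ is flagged as an anomaly if } \log p(x_i) \le \tau
```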
We'll define the model by using the GaussianMixture class of scikit-learn. Here, we'll use the class with its default parameter values. You can set the arguments according to the content of your dataset. You can check all the parameters of an estimator with the get_params() method.
{'covariance_type': 'full', 'init_params': 'kmeans', 'max_iter': 100, 'means_init': None,
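A minimal sketch of this step might look like the following. The variable name gausMix is taken from the comments under the post. The dataset settings are assumptions for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumed sample data; see the dataset section for the idea behind it.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=13)

# Define the model with default arguments and fit it to the data.
gausMix = GaussianMixture().fit(x)

# get_params() returns every constructor argument as a dictionary.
params = gausMix.get_params()
print(params["covariance_type"], params["init_params"], params["max_iter"])
```

Note that the default n_components is 1, so the default model is a single Gaussian. In recent scikit-learn versions, print(gausMix) shows only the arguments that differ from their defaults, so get_params() is the more reliable way to inspect them.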
We'll get the weighted log probabilities (the log-likelihood) of each sample with the score_samples() method.
Next, we'll extract a threshold value from the scores by using the NumPy quantile() function.
-2.4998195352804533
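The scoring and threshold steps could be sketched as follows. The 3% quantile and the dataset settings are assumptions, so the printed threshold will generally differ from the value shown above.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumed sample data and a default model fitted to it.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=13)
gausMix = GaussianMixture().fit(x)

# One log-likelihood value per sample; lower means less likely under the model.
scores = gausMix.score_samples(x)

# Assumed cut-off: treat the lowest 3% of scores as anomalous.
thresh = np.quantile(scores, 0.03)
print(thresh)
```

A quantile-based threshold always flags roughly the same fraction of samples, so the fraction should reflect how many anomalies you expect in your data.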
Based on the extracted threshold value, we'll identify samples with scores equal to or lower than the threshold.
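One way to pick out those samples is with a boolean comparison and np.where(). This is a sketch under the same assumed data and 3% quantile as before, not necessarily the original code.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Assumed sample data, default model, and 3% quantile threshold.
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=13)
gausMix = GaussianMixture().fit(x)
scores = gausMix.score_samples(x)
thresh = np.quantile(scores, 0.03)

# Row indices of samples whose score is at or below the threshold.
index = np.where(scores <= thresh)[0]

# The anomalous samples themselves.
anomalies = x[index]
print(len(index), "samples flagged as anomalies")
```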
Finally, we'll visualize the results by highlighting the anomalies in red.
In this tutorial, we've learned how to detect anomalies with the Gaussian mixture method by using scikit-learn's GaussianMixture class in Python. We detected the anomalies in the data by using their log-likelihood scores. The full source code is listed below.
Source code listing
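The original listing is missing from this copy of the post. The script below is a reconstruction of the steps described in the tutorial, offered as a sketch only: the dataset settings, the 3% quantile, the plot styling, and the output file name are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# 1. Prepare the dataset (assumed settings).
x, _ = make_blobs(n_samples=200, centers=1, cluster_std=0.3, random_state=13)

# 2. Define the model with default arguments and fit it.
gausMix = GaussianMixture().fit(x)

# 3. Log-likelihood score of each sample.
scores = gausMix.score_samples(x)

# 4. Threshold at an assumed 3% quantile of the scores.
thresh = np.quantile(scores, 0.03)

# 5. Samples with scores equal to or lower than the threshold.
index = np.where(scores <= thresh)[0]
anomalies = x[index]

# 6. Plot all samples, with the anomalies highlighted in red.
fig, ax = plt.subplots()
ax.scatter(x[:, 0], x[:, 1], label="samples")
ax.scatter(anomalies[:, 0], anomalies[:, 1], color="red", label="anomalies")
ax.legend()
fig.savefig("gmm_anomalies.png")  # use plt.show() instead in an interactive session
```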
A comment and a question.
Comment: when I print(gausMix), I get "GaussianMixture()". Do you know why we see a difference? (I am following your code verbatim.)
Question: I think you chose your threshold based on how you designed your data blobs. How would you pick a threshold in general?
1. This may help you.
from sklearn import set_config
2. Yes, you need to set the threshold according to the content of your data.
Hello, I would like to ask: can I somehow split the data into quartiles (Q1, Q2, and Q3) after applying GMM?
good