A Gaussian mixture model is a probabilistic model that represents a population as a mixture of several Gaussian distributions. It is widely used in clustering problems.
In this tutorial, you'll briefly learn how to detect outliers in a dataset with a Gaussian mixture model in R. We'll use the Mclust() function of the mclust package.
The tutorial covers:
- Preparing the data
- Defining the model and anomaly detection
- Source code listing
We'll start by loading the required library.
library(mclust)
Preparing the data
We'll create a random sample dataset for this tutorial and plot it to inspect it visually.
set.seed(124)
n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)
plot(x, col="blue", type='l', pch=19)
Our goal is to find the outliers in this dataset.
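To check the detection results later, it helps to know where the values were injected. The sketch below follows the same recipe but stores the positions and values separately; `idx`, `injected`, and `outside` are hypothetical names, and because the random draws happen in separate statements, the exact numbers may differ from the data above. Note that values drawn from -20:20 can fall inside the 0 to 10 range of the regular data, and those cannot be told apart from normal samples.

```r
# Sketch: same data recipe, but the injected positions and values are kept.
# idx, injected and outside are hypothetical helper names.
set.seed(124)
n <- 500
x <- runif(n) * 10                 # regular data, uniform on [0, 10]

injected <- sample(-20:20, 10)     # 10 integer values from -20..20
idx <- sample(1:n, 10)             # 10 distinct positions to overwrite
x[idx] <- injected

# Injected values inside [0, 10] look like regular data
outside <- injected < 0 | injected > 10
print(data.frame(position = idx, value = injected, outside_range = outside))
```

Only the rows with outside_range equal to TRUE are outliers that a detector can reasonably be expected to find.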
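Before fitting, we'll standardize the data with scale(). It returns a one-column matrix, so [,1] extracts the values as a plain vector.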
x = scale(x)[,1]
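As a quick illustration of what this step does, the sketch below compares scale() with the manual formula (x - mean(x)) / sd(x). The stand-in data and the names `x_raw`, `x_scaled`, and `x_manual` are assumptions made only for this example.

```r
# Sketch: scale() on a numeric vector centers it and divides by the standard deviation.
set.seed(124)
x_raw <- runif(500) * 10                        # stand-in data (hypothetical)

x_scaled <- scale(x_raw)[, 1]                   # one-column matrix reduced to a vector
x_manual <- (x_raw - mean(x_raw)) / sd(x_raw)   # the same transformation by hand

print(all.equal(x_scaled, x_manual))
print(c(mean = mean(x_scaled), sd = sd(x_scaled)))
```

After scaling, the data has a mean of about 0 and a standard deviation of 1.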
Defining the model and anomaly detection
We'll define the model with the Mclust() function of the mclust package. Here, we set the number of components G to 3 and the model type to "V" (univariate, unequal variance). We'll fit the model on x and print its summary.
xfit = Mclust(x, G=3, modelNames="V")
summary(xfit)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------

Mclust V (univariate, unequal variance) model with 3 components:

 log-likelihood   n df       BIC       ICL
      -610.9331 500  8 -1271.583 -1344.313

Clustering table:
  1   2   3
  6 331 163
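To see what the three components look like, we can read the estimated parameters from the fitted object. This is a minimal sketch, assuming the fit succeeds as in the summary above; `params` is a hypothetical name, and the values are taken from the parameters element of the fitted model.

```r
library(mclust)

# Rebuild the tutorial data and model so the block runs on its own
set.seed(124)
n <- 500
x <- runif(n) * 10
x[sample(1:n, 10)] <- sample(-20:20, 10)
x <- scale(x)[, 1]
xfit <- Mclust(x, G = 3, modelNames = "V")

# Collect the estimated parameters of each component in one table
params <- data.frame(
  component  = seq_len(xfit$G),
  proportion = xfit$parameters$pro,                     # mixing weights, sum to 1
  mean       = as.numeric(xfit$parameters$mean),        # component means
  sd         = sqrt(xfit$parameters$variance$sigmasq)   # "V" model: one variance per component
)
print(params)
```

A component with a small proportion and a large standard deviation is a natural candidate for the group that absorbs the outliers.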
Next, we'll call predict() on the fitted model to get the cluster assignment and the posterior membership probabilities for x.
pred = predict(xfit)
str(pred)
List of 2
 $ classification: int [1:500] 2 2 2 2 2 2 2 2 3 2 ...
 $ z             : num [1:500, 1:3] 0.00804 0.00324 0.00424 0.0032 0.00394 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "1" "2" "3"
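We'll use the first column of z, which holds the posterior probability that each sample belongs to component 1. In the clustering table above, component 1 contains only 6 samples, so we treat it as the outlier component.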
xpred = pred$z[,1]
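Hard-coding column 1 works for this dataset, but on other data the small component may not be numbered first. The sketch below picks the column programmatically, assuming the outlier component is the one with the smallest mixing proportion. That rule is a heuristic added for illustration, not part of the method above, and `k` is a hypothetical name.

```r
library(mclust)

# Rebuild the tutorial data and model so the block runs on its own
set.seed(124)
n <- 500
x <- runif(n) * 10
x[sample(1:n, 10)] <- sample(-20:20, 10)
x <- scale(x)[, 1]
xfit <- Mclust(x, G = 3, modelNames = "V")
pred <- predict(xfit)

# Heuristic (assumption): the component with the smallest weight collects the outliers
k <- which.min(xfit$parameters$pro)
xpred <- pred$z[, k]      # posterior probability of belonging to component k

cat("outlier component:", k, "\n")
print(summary(xpred))
```

It is still worth checking the clustering table by eye, since the smallest component is not always an outlier group.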
Next, we'll derive a threshold from the probability scores with the quantile() function. Here, 0.99 requests the 99th percentile, the value below which 99% of the scores fall.
thr = quantile(xpred, .99)
print(thr)
99%
0.5860772
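To make the role of the quantile concrete, here is a small stand-alone sketch with made-up scores. The `scores` vector is hypothetical and unrelated to the model output; it only shows that a 0.99 quantile threshold flags roughly the top 1% of continuous scores.

```r
# Sketch: a 99th-percentile threshold on hypothetical scores
set.seed(1)
scores <- runif(500)               # made-up scores in [0, 1]

thr <- quantile(scores, 0.99)      # 99th percentile of the scores
flagged <- which(scores >= thr)    # positions at or above the threshold

cat("threshold:", thr, "\n")
cat("flagged:", length(flagged), "of", length(scores), "\n")
```

Lowering the probability, for example to 0.95, flags more samples. When many scores are tied, the flagged share can differ from the nominal percentage.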
Using the threshold, we'll find the positions of the samples whose scores are equal to or higher than it. Then, we'll look up the x values at those positions.
outliers = which(xpred >= thr)
index = x[outliers]
Finally, we'll plot the data again and highlight the anomalies in red.
plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")
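A different way to score anomalies with the same fitted model is to flag samples that lie in low-density regions of the whole mixture, instead of using the posterior probability of one component. The sketch below is a swapped-in alternative, not the method used above. It computes the mixture density by hand with dnorm() from the fitted parameters; `dens_mix`, `thr_low`, and `low_density` are hypothetical names, and the 1% cut-off is an assumption.

```r
library(mclust)

# Rebuild the tutorial data and model so the block runs on its own
set.seed(124)
n <- 500
x <- runif(n) * 10
x[sample(1:n, 10)] <- sample(-20:20, 10)
x <- scale(x)[, 1]
xfit <- Mclust(x, G = 3, modelNames = "V")

# Mixture density: sum over components of weight * Gaussian density
p <- xfit$parameters
comp_dens <- sapply(seq_len(xfit$G), function(k) {
  p$pro[k] * dnorm(x, mean = p$mean[k], sd = sqrt(p$variance$sigmasq[k]))
})
dens_mix <- rowSums(comp_dens)

# Flag the 1% of samples with the lowest density (assumed cut-off)
thr_low <- quantile(dens_mix, 0.01)
low_density <- which(dens_mix <= thr_low)
print(data.frame(position = low_density, value = x[low_density]))
```

The flagged positions can be drawn on the plot with points() in the same way as above. The two scores do not have to agree, so comparing them is a useful sanity check.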
In this tutorial, we've learned how to detect anomalies with a Gaussian mixture model by using the Mclust() function of the mclust package in R. The full source code is listed below.
Source code listing
library(mclust)
set.seed(124)
n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)
plot(x, col="blue", type='l', pch=19)
x = scale(x)[,1]
xfit = Mclust(x, G=3, modelNames="V")
summary(xfit)
pred = predict(xfit)
str(pred)
xpred = pred$z[,1]
thr = quantile(xpred, .99)
print(thr)
outliers = which(xpred >= thr)
index = x[outliers]
plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")