DataTechNotes: Anomaly Detection Example with Gaussian Mixture in R

A Gaussian mixture is a probabilistic model that represents population data as a mixture of several Gaussian distributions. The model is widely used in clustering problems.
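
As a rough illustration of this definition, the density of a Gaussian mixture is a weighted sum of normal densities. The sketch below uses base R only; the weights, means, and standard deviations are made-up values, and dmix is a hypothetical helper name, not part of mclust.

```r
# Hypothetical parameters of a two-component mixture (illustration only)
w  <- c(0.7, 0.3)   # mixing proportions, must sum to 1
mu <- c(0, 5)       # component means
s  <- c(1, 2)       # component standard deviations

# Mixture density: weighted sum of the component normal densities
dmix <- function(x) {
  w[1] * dnorm(x, mu[1], s[1]) + w[2] * dnorm(x, mu[2], s[2])
}

# A valid density integrates to 1 (up to numerical error)
total <- integrate(dmix, -Inf, Inf)$value
print(total)
```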

In this tutorial, you'll briefly learn how to detect outliers in data by using the Gaussian Mixture method in R. We'll use the Mclust() function of the mclust package.

The tutorial covers:

Preparing the data
Defining the model and anomaly detection
Source code listing

We'll start by loading the required library.

library(mclust)

Preparing the data

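We'll create a random sample dataset for this tutorial and plot it for a visual check. The data consists of 500 uniform random values between 0 and 10, with 10 randomly chosen positions overwritten by integers drawn from -20 to 20 to act as outliers.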

set.seed(124)

n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)

plot(x, col="blue", type='l', pch=19)

We'll try to detect the outliers in this dataset.

Before fitting the model, we standardize the data with scale().

x = scale(x)[,1]
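
For reference, scale() with its default arguments centers the values on their mean and divides by the sample standard deviation. It returns a one-column matrix, which is why [,1] is used to get a plain vector back. A small base-R sketch with a toy vector v (a made-up example, not the tutorial data):

```r
v <- c(2, 4, 6, 8, 30)   # toy data with one large value

scaled <- scale(v)[, 1]            # scale() returns a matrix; [, 1] extracts the column as a vector
manual <- (v - mean(v)) / sd(v)    # the same computation written out by hand

print(all.equal(scaled, manual))
```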

Defining the model and anomaly detection

We'll define the model by using the Mclust() function of the mclust package. Here, I'll set the number of components G to 3 and the model type to "V" (univariate, unequal variance). We'll fit the model on the x data and print its summary.

xfit = Mclust(x, G=3, modelNames="V")


summary(xfit)

---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust V (univariate, unequal variance) model with 3 components: 

 log-likelihood   n df       BIC       ICL
      -610.9331 500  8 -1271.583 -1344.313

Clustering table:
  1   2   3 
  6 331 163
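
The choice of G=3 is fixed by hand here. As a possible alternative, Mclust() also accepts a range of values for G and keeps the candidate with the best BIC. The sketch below assumes the mclust package is installed and rebuilds the same data as above so that it runs on its own. The range 1:5 and the name xfit_auto are arbitrary choices for illustration, and the selected number of components may differ from 3.

```r
library(mclust)

# Rebuild the tutorial data so the block runs on its own
set.seed(124)
n <- 500
x <- runif(n) * 10
x[sample(1:n, 10)] <- sample(-20:20, 10)
x <- scale(x)[, 1]

# Try 1 to 5 components with the "V" model and let BIC pick one
xfit_auto <- Mclust(x, G = 1:5, modelNames = "V", verbose = FALSE)

print(xfit_auto$G)     # number of components selected by BIC
print(xfit_auto$BIC)   # BIC value for each candidate G
```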

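Next, we'll call predict() on the fitted model. Since no new data is passed, it returns the classification and the posterior probability matrix z for the data the model was fitted on.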

pred = predict(xfit)
str(pred)

List of 2
 $ classification: int [1:500] 2 2 2 2 2 2 2 2 3 2 ...
 $ z             : num [1:500, 1:3] 0.00804 0.00324 0.00424 0.0032 0.00394 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:3] "1" "2" "3"

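We'll use the first column of the z element, which holds each sample's posterior probability of belonging to component 1. In the clustering table above, component 1 is the smallest group (6 samples), so in this fit it appears to be the component that captures the extreme values. With other data or another seed, the relevant column may differ.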

xpred = pred$z[,1]
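
To make the meaning of z concrete: each row holds the posterior probabilities that a sample belongs to each component, obtained from Bayes' rule. The base-R sketch below computes them by hand for a toy two-component mixture. All parameter values and variable names are made up for illustration and are not taken from the fitted model.

```r
# Hypothetical mixture: a narrow "normal" component and a wide "outlier" component
w  <- c(0.95, 0.05)  # mixing proportions
mu <- c(0, 0)        # component means
s  <- c(1, 5)        # component standard deviations

pts <- c(0.2, 1.0, 6.0)  # toy samples; the last one is far from the bulk

# Weighted component densities, one column per component
dens <- cbind(w[1] * dnorm(pts, mu[1], s[1]),
              w[2] * dnorm(pts, mu[2], s[2]))

# Bayes' rule: divide each row by its total so the probabilities sum to 1
z <- dens / rowSums(dens)
print(round(z, 4))
```

The sample at 6.0 gets almost all of its probability from the wide component, which is the same effect the tutorial relies on when it reads one column of z.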

Next, we'll derive a threshold value from the probability scores by using the quantile() function. Here, 0.99 requests the 99th percentile, so roughly the top 1% of the scores fall at or above the threshold.

thr = quantile(xpred, .99)
print(thr)

     99% 
0.5860772
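
As a side note on what this cut-off implies: with 500 distinct scores, the 99th percentile leaves about 1% of them, that is 5 samples, at or above the threshold, so the number of flagged points is set by the chosen quantile rather than by the data. A base-R sketch with random stand-in scores (not the model's probabilities):

```r
set.seed(1)
scores <- runif(500)                # stand-in scores, assumed distinct

cutoff  <- quantile(scores, 0.99)   # 99th percentile (R's default quantile type 7)
flagged <- which(scores >= cutoff)  # positions at or above the cut-off

print(length(flagged))
```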

By using the threshold value, we'll find the samples whose scores are equal to or higher than the threshold. The which() function returns the positions of those samples (stored in outliers), and we then look up their values in x (stored in index).

outliers = which(xpred >= thr)
index = x[outliers]
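
The distinction between positions and values can be seen on a toy example. In this sketch, toy_scores and toy_values are made-up vectors standing in for xpred and x:

```r
toy_scores <- c(0.01, 0.90, 0.02, 0.70)   # stand-in for xpred
toy_values <- c(1.2, -3.5, 0.4, 4.1)      # stand-in for x

pos  <- which(toy_scores >= 0.5)  # positions of the flagged samples: 2 and 4
vals <- toy_values[pos]           # their values: -3.5 and 4.1

print(pos)
print(vals)
```

When plotting, the positions go on the horizontal axis and the values on the vertical axis.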

Finally, we'll visualize the results in a plot by highlighting the anomalies with a color.

plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")

In this tutorial, we've learned how to detect anomalies with the Gaussian mixture method by using the Mclust() function of the mclust package in R. The full source code is listed below.

Source code listing

library(mclust)
 
set.seed(124)
n = 500
x = runif(n)*10
x[sample(1:n, 10)] <- sample(-20:20, 10)
plot(x, col="blue", type='l', pch=19)
 
x = scale(x)[,1]
xfit = Mclust(x, G=3, modelNames="V")

summary(xfit)
pred = predict(xfit)
str(pred)

xpred = pred$z[,1]
 
thr = quantile(xpred, .99)
print(thr)

outliers = which(xpred >= thr)
index = x[outliers]
 
plot(x, col="blue", type='l', pch=19)
points(outliers,index, pch=19, col="red")
