DataTechNotes: Outlier Detection Example With K-means Distance Calculation in R

Outliers in data can be detected by calculating each element's distance from the center of its cluster. We can divide the data into a specified number of clusters by using R's kmeans() function.

In this tutorial, I'll try to detect outliers in a list of values by using the kmeans() function and distance calculation in R. The tutorial covers:

Preparing test data
K-means distance calculation
Source code listing

Preparing test data

We'll start by preparing the test data for this tutorial. Here, we use the label data of the Boston housing dataset, which is the 14th column (medv). We'll load the dataset and visualize the target data in a graph.

boston = MASS::Boston
dim(boston)


test = boston[,14]
plot(test, pch=16, col="blue")

K-means distance calculation

The number of clusters can be decided by checking the structure of the test data. We can divide the test data into two clusters by setting the 'centers' parameter of the function to 2.
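One possible way to check the data structure is an elbow plot: run kmeans() for several cluster counts and compare the total within-cluster sum of squares. The sketch below is only an illustration and not part of the tutorial's own code; the range of 1 to 6 clusters, the seed, and nstart = 10 are arbitrary choices.

```r
# Sketch: compare total within-cluster sum of squares for several cluster counts.
boston <- MASS::Boston
test <- boston[, 14]          # the target column used in this tutorial

set.seed(123)                 # arbitrary seed, only to make the run repeatable
ks <- 1:6                     # candidate cluster counts (an assumption)
wss <- sapply(ks, function(k) kmeans(test, centers = k, nstart = 10)$tot.withinss)

# a visible bend ("elbow") in this curve suggests a reasonable cluster count
plot(ks, wss, type = "b", pch = 16, col = "blue",
     xlab = "Number of clusters", ylab = "Total within-cluster SS")
```

A bend in the curve points to a candidate cluster count. The rest of this tutorial uses two clusters.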

 
km = kmeans(test, centers=2)

print(km)

K-means clustering with 2 clusters of sizes 106, 400

Cluster means:
      [,1]
1 36.73019
2 18.77050

Clustering vector:
  [1] 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [34] 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2
 [67] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 1
[100] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[133] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 2
[166] 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[199] 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1 1 2
[232] 1 1 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 1
[265] 1 2 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2
[298] 2 2 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[331] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
[364] 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[397] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[430] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[463] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[496] 2 2 2 2 2 2 2 2 2 2 2

Within cluster sum of squares by cluster:
[1] 6040.163 9648.192
 (between_SS / total_SS =  63.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"
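Note that kmeans() picks its starting centers at random, so the cluster numbering (and, in some cases, the cluster sizes) can differ between runs. A minimal sketch of making the result repeatable is shown below; the seed value and nstart = 25 are assumptions for illustration.

```r
boston <- MASS::Boston
test <- boston[, 14]

set.seed(42)                                   # arbitrary seed value
km1 <- kmeans(test, centers = 2, nstart = 25)  # try 25 random starts, keep the best fit

set.seed(42)
km2 <- kmeans(test, centers = 2, nstart = 25)

# same seed and same call give the same labels
identical(km1$cluster, km2$cluster)
sort(km1$centers[, 1])                         # centers, independent of label order
```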

We'll extract the centers from the km object. Indexing km$centers by km$cluster gives the center value of the assigned cluster for each element, named by its cluster id.

centers=km$centers[km$cluster,] 
str(centers)

Named num [1:506] 18.8 18.8 36.7 36.7 36.7 ...
 - attr(*, "names")= chr [1:506] "2" "2" "1" "1" ...

head(centers)

 2        2        1        1        1        1 
18.77050 18.77050 36.73019 36.73019 36.73019 36.73019
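The same per-observation centers can also be obtained with the fitted() method for kmeans objects. The short sketch below compares both approaches on a small made-up vector; the toy values and the seed are assumptions for illustration.

```r
x <- c(1, 2, 3, 10, 11, 12, 30)                  # toy data, not the Boston values
set.seed(1)                                      # arbitrary seed
km <- kmeans(x, centers = 2)

by_index  <- km$centers[km$cluster, ]            # lookup used in this tutorial
by_fitted <- fitted(km, method = "centers")[, 1] # built-in equivalent

# both give one center value per observation
all.equal(unname(by_index), unname(by_fitted))
```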

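Next, we'll calculate the distance of each observation from its cluster center and get the observation indices sorted by decreasing distance.

# in one dimension this equals abs(test - centers)
distance <- sqrt((test-centers)^2)
# indices of the observations, largest distance first
ordered <- order(distance, decreasing = TRUE)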
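The test data here has a single column, so the distance is just the absolute difference. For data with several columns, the squared differences need to be summed across columns before taking the root. Below is a minimal sketch on made-up two-column data; the toy matrix, the seed, and nstart = 10 are assumptions for illustration.

```r
set.seed(7)                                   # arbitrary seed
# toy two-column data: two groups plus one far-away point in the last row
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2),
           c(8, 20))

km <- kmeans(x, centers = 2, nstart = 10)
centers <- km$centers[km$cluster, ]           # one center row per observation

# Euclidean distance: sum squared differences over columns, then take the root
distance <- sqrt(rowSums((x - centers)^2))
ordered <- order(distance, decreasing = TRUE)
head(ordered, 1)                              # index of the most distant row
```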

We'll set the number of top outliers by counting the observations that are equal to the two extreme (minimum and maximum) values of the test data. This is a simple heuristic that suits this data, where the extreme values are repeated.

min_out = min(test)
max_out = max(test)
outs_val = c(test[test==min_out], test[test==max_out])

outs_count = length(outs_val)

Now, we can obtain the outliers by taking that many indices from the top of the ordered list.

outs = head(ordered, outs_count)

cat("Outliers index: ", outs, "\n")
cat("Outliers value: ", test[outs], "\n")

Outliers index:  399 406 162 163 164 167 187 196 205 226 258 268 284 369 370 371 372 373

Outliers value:  5 5 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
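The count used above depends on how many values tie with the minimum and maximum. A more general alternative, which is not the method of this tutorial, is to flag every observation whose distance exceeds a cutoff. The sketch below uses the 95th percentile of the distances; that percentile, the seed, and nstart = 10 are assumptions to adjust for your data.

```r
boston <- MASS::Boston
test <- boston[, 14]

set.seed(123)                                    # arbitrary seed
km <- kmeans(test, centers = 2, nstart = 10)
distance <- abs(test - km$centers[km$cluster, ]) # 1-D distance to the assigned center

# assumption for illustration: distances above the 95th percentile are outliers
cutoff <- quantile(distance, 0.95)
outs <- which(distance > cutoff)

cat("Cutoff distance:", cutoff, "\n")
cat("Outliers index: ", outs, "\n")
```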

Finally, we'll visualize the outliers in a graph by marking them in red.

plot(test, pch=16, col="blue")
points(outs, test[outs], pch=16, col="red")

In this tutorial, we have briefly learned how to detect outliers by using the kmeans() function and distance calculation in R. The full source code is listed below.

Source code listing

# load Boston data and extract label part

boston = MASS::Boston
dim(boston)

test = boston[,14]
plot(test, pch=16, col="blue")


# apply kmeans and extract centers

km = kmeans(test, centers=2)
centers = km$centers[km$cluster,] 
head(centers)

# calculate distance

distance = sqrt((test-centers)^2)
ordered = order(distance, decreasing = TRUE)

# extract outliers

min_out = min(test)
max_out = max(test)
outs_val = c(test[test==min_out], test[test==max_out])

outs_count = length(outs_val)
outs = head(ordered, outs_count)

cat("Outliers index: ", outs, "\n")
cat("Outliers value: ", test[outs], "\n")

# visualize in a plot

plot(test, pch=16, col="blue")
points(outs, test[outs], pch=16, col="red")
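The steps in the listing can also be wrapped into a small reusable function. The sketch below is one possible arrangement for a single numeric vector; the name kmeans_outliers, its parameters, and the toy data are hypothetical and not part of the original code.

```r
# hypothetical helper: returns the n observations farthest from their cluster center
kmeans_outliers <- function(x, k = 2, n = 5, nstart = 10) {
  km <- kmeans(x, centers = k, nstart = nstart)
  centers <- km$centers[km$cluster, ]     # center value for each observation
  distance <- abs(x - centers)            # 1-D distance to the assigned center
  idx <- head(order(distance, decreasing = TRUE), n)
  data.frame(index = idx, value = x[idx], distance = unname(distance[idx]))
}

set.seed(123)                             # arbitrary seed
# toy data: two groups and one extreme value in the last position
x <- c(rnorm(50, mean = 10), rnorm(50, mean = 30), 60)
kmeans_outliers(x, k = 2, n = 3)
```

The function returns a data frame, so the indices can be passed to points() in the same way as in the plot above.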
