Outliers in data can be detected by calculating each element's distance from the center of the cluster it belongs to. We can divide the data into a specified number of clusters by using R's kmeans() function.
In this tutorial, I'll try to detect outliers in a list by using the kmeans() function and a distance calculation in R. The tutorial covers:
- Preparing test data
- K-means distance calculation
- Source code listing
Preparing test data
We'll use the Boston housing dataset from the MASS package and take its 14th column as the test data.
boston = MASS::Boston
dim(boston)
test = boston[,14]
plot(test, pch=16, col="blue")
K-means distance calculation
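The number of clusters can be chosen by inspecting the structure of the test data. Here, we divide the test data into two clusters by setting the 'centers' parameter of kmeans() to 2. Note that kmeans() starts from randomly chosen centers, so the cluster ids (1 and 2) may be swapped between runs.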
km = kmeans(test, centers=2)
print(km)
K-means clustering with 2 clusters of sizes 106, 400
Cluster means:
[,1]
1 36.73019
2 18.77050
Clustering vector:
[1] 2 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[34] 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2
[67] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 1
[100] 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[133] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 1 2
[166] 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
[199] 1 1 1 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 1 1 1 1 1 1 1 2
[232] 1 1 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1 1 1 1 1 1 1
[265] 1 2 1 1 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 2 2 1 2
[298] 2 2 1 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[331] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
[364] 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[397] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[430] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[463] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[496] 2 2 2 2 2 2 2 2 2 2 2
Within cluster sum of squares by cluster:
[1] 6040.163 9648.192
(between_SS / total_SS = 63.3 %)
Available components:
[1] "cluster" "centers" "totss" "withinss"
[5] "tot.withinss" "betweenss" "size" "iter"
[9] "ifault"
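If the structure of the data does not make the number of clusters obvious, one common heuristic is to compare the total within-cluster sum of squares for several candidate values and look for the point where it stops dropping sharply. The sketch below is only an illustration: the vector x is made-up toy data standing in for 'test', and the seed, the nstart value, and the candidate range 1:4 are arbitrary assumptions.

```r
# Toy data standing in for 'test': two well-separated groups (assumed values)
x <- c(1, 2, 3, 4, 5, 21, 22, 23, 24, 25)

set.seed(123)  # kmeans() starts from random centers, so fix the seed
ks <- 1:4      # candidate numbers of clusters (arbitrary range)

# Total within-cluster sum of squares for each candidate number of clusters
wss <- sapply(ks, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

# A large drop from k = 1 to k = 2 followed by small drops suggests 2 clusters
print(data.frame(k = ks, tot.withinss = round(wss, 2)))
```

The same loop could be run on 'test' itself; the tot.withinss component is one of the components listed in the kmeans() output above.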
We'll extract the centers from the km object. Indexing km$centers by km$cluster gives, for each element, the center value of the cluster it belongs to, named by its cluster id.
centers=km$centers[km$cluster,]
str(centers)
Named num [1:506] 18.8 18.8 36.7 36.7 36.7 ...
- attr(*, "names")= chr [1:506] "2" "2" "1" "1" ...
head(centers)
2 2 1 1 1 1
18.77050 18.77050 36.73019 36.73019 36.73019 36.73019
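To make the indexing step easier to follow, here is a minimal sketch on made-up toy data (the vector x and the name km_toy are assumptions for illustration only). It also compares the result with fitted(), which returns the same per-observation centers for a kmeans object as a one-column matrix.

```r
# Toy data with two obvious groups (assumed values, not the Boston data)
x <- c(1, 2, 3, 21, 22, 23)

set.seed(1)
km_toy <- kmeans(x, centers = 2)

# One center per observation: row i of km_toy$centers is picked for every
# element whose cluster id is i
per_obs <- km_toy$centers[km_toy$cluster, ]
print(per_obs)

# fitted() gives the same values, so this comparison should be TRUE
same <- all.equal(unname(per_obs), as.vector(fitted(km_toy)))
print(same)
```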
Next, we'll calculate the distance of each observation from its cluster center and order the observations by decreasing distance.
distance <- sqrt((test-centers)^2)
ordered <- order(distance, decreasing = TRUE)
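For a single column, sqrt((x - center)^2) is simply the absolute difference, so abs() gives the same result. The sketch below checks this on a few made-up values and shows how the same idea might extend to data with several columns; the values, the per-observation centers, and the variable names are all hypothetical.

```r
# Made-up values and hypothetical per-observation centers
x <- c(5, 18, 20, 50)
centers_x <- c(18.8, 18.8, 18.8, 36.7)

d_sqrt <- sqrt((x - centers_x)^2)
d_abs  <- abs(x - centers_x)
print(all.equal(d_sqrt, d_abs))  # both forms give the same distances

# Indices sorted from the farthest to the nearest observation
ord <- order(d_abs, decreasing = TRUE)
print(ord)

# With several columns, the Euclidean distance is taken per row
m  <- cbind(a = c(0, 3), b = c(0, 4))  # two observations
cm <- cbind(a = c(0, 0), b = c(0, 0))  # their (hypothetical) centers
d_multi <- sqrt(rowSums((m - cm)^2))
print(d_multi)
```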
We'll decide how many outliers to extract by counting the observations that equal the two extreme (minimum and maximum) values of the test data.
min_out = min(test)
max_out = max(test)
outs = c(test[test==min_out], test[test==max_out])
outs_count = length(outs)
Now, we can obtain the outliers by taking that many indices from the top of the ordered list.
outs = head(ordered, outs_count)
cat("Outliers index: ", outs, "\n")
cat("Outliers value: ", test[outs], "\n")
Outliers index: 399 406 162 163 164 167 187 196 205 226 258 268 284 369 370 371 372 373
Outliers value: 5 5 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50 50
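Counting the extreme values is only one way to choose how many outliers to report. As a rough sketch, the steps above could be wrapped in a small helper that takes the number of outliers directly. The function name kmeans_outliers, its parameters, and the toy data are all hypothetical and not part of any package.

```r
# Hypothetical helper: returns the n observations farthest from their
# k-means cluster center (single numeric vector only)
kmeans_outliers <- function(x, k = 2, n = 5, seed = 123) {
  set.seed(seed)                                # reproducible starting centers
  km <- kmeans(x, centers = k, nstart = 10)
  center_of_obs <- unname(km$centers[km$cluster, ])
  d <- abs(x - center_of_obs)                   # distance to own center
  idx <- head(order(d, decreasing = TRUE), n)   # n largest distances
  data.frame(index = idx, value = x[idx], distance = d[idx])
}

# Toy data (assumed values): two groups and one value far from both
x <- c(10, 11, 12, 13, 14, 30, 31, 32, 33, 60)
res <- kmeans_outliers(x, k = 2, n = 1)
print(res)
```

With the tutorial data, a call such as kmeans_outliers(test, k = 2, n = outs_count) would be expected to follow the same steps as the code above, although the result depends on the clustering that kmeans() finds.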
Finally, we'll visualize the above values in a graph.
plot(test, pch=16, col="blue")
points(outs, test[outs], pch=16, col="red")
In this tutorial, we've briefly learned how to detect outliers by using the kmeans() function and a distance calculation in R. The full source code is listed below.
Source code listing
# load Boston data and extract label part
boston = MASS::Boston
dim(boston)
test = boston[,14]
plot(test, pch=16, col="blue")
# apply kmeans and extract centers
km = kmeans(test, centers=2)
centers = km$centers[km$cluster,]
head(centers)
# calculate distance
distance = sqrt((test-centers)^2)
ordered = order(distance, decreasing = TRUE)
# extract outliers
min_out = min(test[ordered])
max_out = max(test[ordered])
outs_val = c(test[test==min_out], test[test==max_out])
outs_count = length(outs_val)
outs = head(ordered, outs_count)
cat("Outliers index: ", outs, "\n")
cat("Outliers value: ", test[outs], "\n")
# visualize in a plot
plot(test, pch=16, col="blue")
points(outs, test[outs], pch=16, col="red")