Clustering Example with kmeans in R

    Clustering is an unsupervised learning technique that learns characteristics of a dataset and divides its samples into groups based on their similar features.
    The K-means algorithm can be used to cluster a dataset. In this method, K random points are selected as the initial centroids. Each element is then assigned to its closest centroid by calculating the distance, and each centroid is moved to the mean of the elements assigned to it. The process is repeated until the assignments stop changing, which reduces the distances between the samples and their centroids.
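The loop described above can be sketched in a few lines of base R. This is only a minimal illustration of the assign-then-update idea (often called Lloyd's algorithm) on made-up one-dimensional data. The function name simple_kmeans and its arguments are hypothetical, and R's own kmeans() uses the more refined Hartigan-Wong algorithm by default.

```r
# Minimal sketch of the k-means loop (Lloyd's algorithm) for a numeric vector.
# simple_kmeans is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  # Step 1: pick k distinct observed values as the initial centroids
  centers <- sample(unique(x), k)
  cluster <- integer(length(x))
  for (i in seq_len(max_iter)) {
    # Step 2: assign every element to its closest centroid
    dists <- abs(outer(x, centers, "-"))   # length(x) by k distance matrix
    new_cluster <- max.col(-dists, ties.method = "first")
    # Stop once the assignments no longer change
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: move each centroid to the mean of its assigned elements
    for (j in seq_len(k)) {
      if (any(cluster == j)) centers[j] <- mean(x[cluster == j])
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
# Made-up data: two well-separated groups around 0 and 10
x <- c(rnorm(20, mean = 0), rnorm(20, mean = 10))
fit <- simple_kmeans(x, k = 2)
print(fit$centers)
print(table(fit$cluster))
```

With groups this far apart, the two centroids should end up close to the two group means, whichever starting points were drawn.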
    In this tutorial, we'll learn how to cluster data with the kmeans() function in R. The tutorial covers:
  1. Preparing the data
  2. Clustering with kmeans() and visualizing
  3. Source code listing
Let's get started.

Preparing the data

    We'll use the "medv" column of the Boston housing dataset as the target data for clustering. First, we'll load the dataset, extract the "medv" column, and add an index column.

boston = MASS::Boston
str(boston)

'data.frame':    506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 

n = nrow(boston)
data = data.frame("index" = seq.int(n), "medv" = boston[,"medv"])


Then, we'll plot the data to inspect it visually.

plot(data$index, data$medv, col="blue", pch = 16)



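If the variables are measured on very different scales, it is better to standardize them with the scale() function, because k-means relies on distances and variables with large values would otherwise dominate the result.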

# data$medv = scale(data$medv)[, 1]
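To see what scaling changes for this data, the ranges of the two columns can be compared before and after standardizing. This is a minimal sketch that rebuilds the same data frame so it runs on its own. The name scaled is introduced here for illustration only, and unlike the commented line above it standardizes both columns rather than medv alone.

```r
# Rebuild the tutorial's data frame so the snippet runs on its own
boston <- MASS::Boston
data <- data.frame(index = seq_len(nrow(boston)), medv = boston[, "medv"])

# Before scaling: index spans 1..506 while medv covers a much narrower range,
# so Euclidean distances are dominated by the index column
print(sapply(data, range))

# scale() centers each column to mean 0 and rescales it to standard deviation 1;
# as.data.frame() turns the returned matrix back into a data frame
scaled <- as.data.frame(scale(data))
print(round(sapply(scaled, mean), 10))
print(sapply(scaled, sd))
```

Passing scaled instead of data to kmeans() would then give both columns equal weight in the distance calculation.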


Clustering with kmeans() and visualizing

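    Next, we'll cluster the data with the kmeans() function. In its simplest form, it is called as shown below.

 kmeans(x, centers)
       x - numeric data (a vector, a matrix, or a data frame with numeric columns),
       centers - the number of clusters, k (or a set of initial cluster centers)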
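Because the initial centroids are chosen at random, two runs can number the clusters differently or even settle on different solutions. The following is a minimal sketch of one common way to make a run repeatable, using set.seed() and the nstart argument of kmeans(). The seed value 123 and nstart = 25 are arbitrary choices, not taken from the original post.

```r
# Rebuild the tutorial's data frame so the snippet runs on its own
boston <- MASS::Boston
data <- data.frame(index = seq_len(nrow(boston)), medv = boston[, "medv"])

# Fix the random number generator so the random starting centroids repeat
set.seed(123)
# nstart = 25 tries 25 random starts and keeps the one with the lowest
# total within-cluster sum of squares
km_a <- kmeans(data, centers = 3, nstart = 25)

# Same seed, same call: the result is reproduced
set.seed(123)
km_b <- kmeans(data, centers = 3, nstart = 25)
print(identical(km_a$cluster, km_b$cluster))
print(km_a$size)
```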


Here, we'll set k, the number of clusters, to 3.

km = kmeans(data, 3)
print(km)


K-means clustering with 3 clusters of sizes 170, 166, 170

Cluster means:
  index     medv
1 251.5 28.69824
2  83.5 21.60181
3 421.5 17.27647

Clustering vector:
  [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [44] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [87] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[130] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
[173] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[216] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[259] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[302] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3
[345] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[388] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[431] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[474] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Within cluster sum of squares by cluster:
[1] 422186.4 388747.8 420461.7
 (between_SS / total_SS =  88.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"  
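The components listed in the output can be read directly from the returned object with the $ operator. This is a short sketch. The seed is an arbitrary choice added here for repeatability, and because of the random start the cluster numbering may differ from the output above.

```r
# Rebuild the tutorial's data frame so the snippet runs on its own
boston <- MASS::Boston
data <- data.frame(index = seq_len(nrow(boston)), medv = boston[, "medv"])

set.seed(123)  # arbitrary seed, added only for repeatability
km <- kmeans(data, 3)

print(km$size)           # number of observations in each cluster
print(km$centers)        # one row per cluster centroid
print(head(km$cluster))  # cluster labels of the first few rows

# Share of the total variation explained by the clustering, i.e. the
# "between_SS / total_SS" figure that print(km) reports
print(round(100 * km$betweenss / km$totss, 1))

# Mean medv per cluster computed from the labels;
# it matches the medv column of km$centers
print(tapply(data$medv, km$cluster, mean))
```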


Next, we'll plot the clustered data, coloring each point by its cluster and marking the centroid of each cluster.

plot(data, col = km$cluster)
points(km$centers, col = 1:3, pch = c(6, 7, 8), lwd = 5)



    In this tutorial, we've briefly learned how to cluster data with the kmeans() function and visualize the result in R. The source code is listed below.


Source code listing

boston = MASS::Boston
str(boston)

n = nrow(boston)
data = data.frame("index" = seq.int(n), "medv" = boston[,"medv"])
plot(data$index, data$medv, col="blue", pch = 16)

# data$medv = scale(data$medv)[,1]

km = kmeans(data, 3)
print(km)

plot(data, col = km$cluster)
points(km$centers, col = 1:3, pch = c(6, 7, 8), lwd = 5)

