DataTechNotes: Clustering Example with kmeans in R

Clustering is an unsupervised learning technique that learns characteristics of a dataset and divides its samples into groups based on their similar features.

The K-means algorithm is one way to cluster a dataset. In this method, K random points are selected as the initial centroids. Each element is then assigned to its closest centroid by calculating the distance, and each centroid is moved to the mean of the elements assigned to it. The process is repeated until the assignments stop changing, which reduces the distances between the samples and their centroids.
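The loop described above can be sketched in a few lines of R. This is only an illustration of the idea, not what kmeans() runs internally (R's kmeans() uses the Hartigan-Wong algorithm by default). The function name simple_kmeans and the toy data are assumptions made for this example.

```r
# Minimal sketch of the k-means loop; simple_kmeans is a hypothetical helper,
# not part of base R.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct rows at random as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to the centroid with the smallest distance
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when no point changes its cluster
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Step 3: move each centroid to the mean of its assigned points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

# Toy data (synthetic): two well-separated groups of 20 points each
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
res <- simple_kmeans(toy, 2)
table(res$cluster)
```

On data this well separated, the sketch should place each group in its own cluster. In practice, prefer the built-in kmeans() function used in the rest of this tutorial.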

In this tutorial, we'll learn how to cluster data with the kmeans() function in R. The tutorial covers:

Preparing the data
Clustering with kmeans() and visualizing
Source code listing

Let's get started.

Preparing the data

We'll use the "medv" column of the Boston housing dataset as the target data for clustering. First, we'll load the dataset, extract the "medv" column, and add an index column.

boston = MASS::Boston
str(boston)

'data.frame':    506 obs. of 14 variables:
$ crim   : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn     : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas   : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox    : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm     : num 6.58 6.42 7.18 7 7.15 ...
$ age    : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
$ dis    : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad    : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax    : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv   : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

n = nrow(boston)
data = data.frame("index" = seq.int(n), "medv" = boston[,"medv"])

Then, we'll plot the data to inspect it visually.

plot(data$index, data$medv, col="blue", pch = 16)

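If the values in your dataset are large, or the columns have very different ranges, it is better to scale them with the scale() function, because kmeans() works on distances and the column with the widest range would otherwise dominate.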

# data$medv = scale(data$medv)[, 1]
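As a small illustration of what scale() does, the sketch below standardizes every column of a data frame at once. The toy data frame is a synthetic stand-in with the same column names as the tutorial's data, not the Boston values themselves.

```r
# Toy stand-in for the tutorial's data frame (values are synthetic)
set.seed(42)
toy <- data.frame(index = 1:100, medv = runif(100, min = 5, max = 50))

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

round(colMeans(toy_scaled), 10)   # column means after scaling
apply(toy_scaled, 2, sd)          # column standard deviations after scaling
```

After scaling, both columns contribute on a comparable footing to the distances that kmeans() computes.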

Clustering with kmeans() and visualizing

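Next, we'll cluster the data with the kmeans() function. Its basic call looks like this.

kmeans(x, centers)
x - numeric data (a vector, matrix, or data frame with numeric columns),
centers - the number of clusters, k (or a set of initial cluster centers)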

Here, we'll set the number of clusters to 3.

km = kmeans(data, 3)
print(km)

K-means clustering with 3 clusters of sizes 170, 166, 170

Cluster means:
index     medv
1 251.5 28.69824
2 83.5 21.60181
3 421.5 17.27647

Clustering vector:
[1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[44] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[87] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[130] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1
[173] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[216] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[259] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[302] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3
[345] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[388] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[431] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[474] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

Within cluster sum of squares by cluster:
[1] 422186.4 388747.8 420461.7
(between_SS / total_SS = 88.6 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
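The components listed at the end of the output can be read from the fitted object with the $ operator. Because kmeans() starts from randomly chosen centroids, cluster numbering and results can differ between runs. The sketch below fixes the seed with set.seed() and uses the nstart argument to try several random starts. The toy data and the variable name fit are assumptions made for this example.

```r
set.seed(123)   # fix the random starting centroids for reproducibility

# Toy data (synthetic): two groups of 30 points each
toy <- data.frame(x = c(rnorm(30, 0), rnorm(30, 8)),
                  y = c(rnorm(30, 0), rnorm(30, 8)))

# nstart = 10 runs kmeans from 10 random starts and keeps the best fit
fit <- kmeans(toy, centers = 2, nstart = 10)

fit$size                    # number of points in each cluster
fit$centers                 # one row of coordinates per centroid
head(fit$cluster)           # cluster label of the first few rows
fit$betweenss / fit$totss   # the between_SS / total_SS ratio from the printout
```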

Next, we'll plot the clustered data, coloring each point by its cluster and adding the centroid of each cluster.

plot(data, col = km$cluster)

points(km$centers, col = 1:3, pch = c(6, 7, 8), lwd = 5)

In this tutorial, we've briefly learned how to cluster data with the kmeans() function and visualize the result in R. The source code is listed below.

Source code listing

boston = MASS::Boston
str(boston)

n = nrow(boston)
data = data.frame("index" = seq.int(n), "medv" = boston[,"medv"])
plot(data$index, data$medv, col="blue", pch = 16)

# data$medv = scale(data$medv)[,1]

km = kmeans(data, 3)
print(km)

plot(data, col = km$cluster)
points(km$centers, col = 1:3, pch = c(6, 7, 8), lwd = 5)
