In this tutorial, you will briefly learn how to perform hierarchical clustering in R and visualize the result in a plot. The hclust() function from base R and some of the functions of the 'cluster' package help us to do a hierarchical cluster analysis. The tutorial covers:
- Preparing the data
- Clustering with the hclust() function
- Visualizing in a plot
- Source code listing
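Hierarchical clustering is a method to create groups that contain similar objects in a given dataset. The agglomerative variant used here builds the hierarchy bottom-up: every object starts as its own cluster, and the two most similar clusters are merged step by step until only one cluster remains. We'll start by loading the required libraries.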
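To make the bottom-up idea concrete, here is a minimal sketch on five made-up points on a line. The vector x and the name hc_toy are illustrative assumptions and are not part of the tutorial's data.

```r
# Five made-up points on a line (illustrative values only)
x <- c(a = 1, b = 2, c = 6, d = 7.5, e = 15)

# Agglomerative clustering with complete linkage on the pairwise distances
hc_toy <- hclust(dist(x), method = "complete")

# Each row of 'merge' is one agglomeration step: negative numbers are
# original observations, positive numbers refer to earlier merge steps
print(hc_toy$merge)

# Height at which each merge happened: 1, 1.5, 6.5, 14
print(hc_toy$height)

# Cutting into two groups separates the far-away point e from the rest
print(cutree(hc_toy, k = 2))
```

With complete linkage, the distance between two clusters is the largest distance between any pair of their members, which is why the last merge happens at 14 (the distance between a and e).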
library(MASS)
library(cluster)
Preparing the data
We'll use the "Animals" dataset of the MASS package, which contains the body and brain weights of 28 species. First, we'll load it and check its content.
data(Animals)
str(Animals)
'data.frame': 28 obs. of 2 variables:
$ body : num 1.35 465 36.33 27.66 1.04 ...
$ brain: num 8.1 423 119.5 115 5.5 ...
head(Animals,10)
body brain
Mountain beaver 1.35 8.1
Cow 465.00 423.0
Grey wolf 36.33 119.5
Goat 27.66 115.0
Guinea pig 1.04 5.5
Dipliodocus 11700.00 50.0
Asian elephant 2547.00 4603.0
Donkey 187.10 419.0
Horse 521.00 655.0
Potar monkey 10.00 115.0
Clustering with the hclust() function
The hclust() function, based on the agglomerative clustering method, requires the distance data of a given dataset. We can compute it by using the dist() function. The dist() function calculates the distances between the rows of the data matrix with a given method (Euclidean, Manhattan, etc.). Euclidean is the default distance method in this function, and you can change it according to the content of your dataset.
distance = dist(Animals, method = "euclidean")
str(distance)
'dist' num [1:378] 622.18 116.76 110.09 2.62 11698.73 ...
- attr(*, "Size")= int 28
- attr(*, "Labels")= chr [1:28] "Mountain beaver" "Cow" "Grey wolf" "Goat" ...
- attr(*, "Diag")= logi FALSE
- attr(*, "Upper")= logi FALSE
- attr(*, "method")= chr "euclidean"
- attr(*, "call")= language dist(x = Animals, method = "euclidean")
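As a quick sanity check on what dist() returns, the sketch below recomputes one Euclidean distance by hand and also builds a Manhattan distance object for comparison. The names d_euc, d_man, manual, and from_dist are illustrative.

```r
library(MASS)
data(Animals)

d_euc <- dist(Animals, method = "euclidean")

# Euclidean distance between two rows, computed by hand
a <- unlist(Animals["Mountain beaver", ])
b <- unlist(Animals["Cow", ])
manual <- sqrt(sum((a - b)^2))

# The same pair, looked up in the dist object
from_dist <- as.matrix(d_euc)["Mountain beaver", "Cow"]
print(c(manual = manual, from_dist = from_dist))

# Switching the metric only requires changing the 'method' argument
d_man <- dist(Animals, method = "manhattan")
print(as.matrix(d_man)["Mountain beaver", "Cow"])
```

The hand-computed value matches the first entry (622.18) shown in the str(distance) output above.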
Now, we can use the hclust() function with distance data.
hc = hclust(distance, "complete")
print(hc)
Call:
hclust(d = distance, method = "complete")
Cluster method : complete
Distance : euclidean
Number of objects: 28
Next, we'll cut the tree into groups by using the cutree() function. Here, we'll set the k parameter to 3 to define the number of groups.
groups = cutree(hc, k = 3)
table(groups)
groups
1 2 3
25 2 1
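The cutree() function can also take several values of k at once, or a cut height h instead of a group count. The following is a small sketch of both options; the names cuts and h3 are illustrative, and the cut height is picked here as the midpoint between two merge heights purely for demonstration.

```r
library(MASS)
data(Animals)

hc <- hclust(dist(Animals, method = "euclidean"), "complete")

# A vector of k values returns a matrix with one column per k
cuts <- cutree(hc, k = 2:4)
print(head(cuts))
print(table(cuts[, "3"]))

# Cutting by height: any h between the 2nd and 3rd largest merge
# heights leaves exactly three groups (assuming no tied heights)
h3 <- mean(rev(hc$height)[2:3])
groups_h <- cutree(hc, h = h3)
print(table(groups_h))
```

Cutting by height is handy when a distance threshold is more meaningful for the data than a fixed number of groups.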
We can check each group's elements after the grouping.
gid = unique(groups)
for(i in 1:length(gid))
cat("Group:", gid[i], rownames(Animals)[groups == gid[i]], "\n")
Group: 1 Mountain beaver Cow Grey wolf Goat Guinea pig Asian elephant Donkey Horse Potar monkey Cat Giraffe Gorilla Human African elephant Rhesus monkey Kangaroo Golden hamster Mouse Rabbit Sheep Jaguar Chimpanzee Rat Mole Pig
Group: 2 Dipliodocus Triceratops
Group: 3 Brachiosaurus
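The same membership listing can be obtained without a loop. The sketch below uses split() to collect the row names per group and aggregate() to summarize each group; the names members and group_means are illustrative.

```r
library(MASS)
data(Animals)

groups <- cutree(hclust(dist(Animals), "complete"), k = 3)

# A list with one character vector of animal names per group
members <- split(rownames(Animals), groups)
print(members)

# Mean body and brain weight within each group
group_means <- aggregate(Animals, by = list(group = groups), FUN = mean)
print(group_means)
```

The group means make it easy to see that the two small groups hold the animals with extreme body weights.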
Visualizing in a plot
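Next, we'll construct a hierarchy of clustered elements by using the agnes() function of the 'cluster' package. Note that agnes() uses average linkage by default, while we used complete linkage in the hclust() call above.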
agn = agnes(distance)
print(agn)
Call: agnes(x = distance)
Agglomerative coefficient: 0.9579279
Order of objects:
[1] Mountain beaver Guinea pig Golden hamster Mouse Rat
[6] Mole Rabbit Cat Kangaroo Grey wolf
[11] Goat Potar monkey Rhesus monkey Sheep Jaguar
[16] Pig Donkey Gorilla Chimpanzee Cow
[21] Horse Giraffe Human Asian elephant African elephant
[26] Dipliodocus Triceratops Brachiosaurus
Height (summary):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.61 15.48 65.94 4173.58 517.81 85797.28
Available components:
[1] "order" "height" "ac" "merge" "diss" "call" "method"
[8] "order.lab"
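The agglomerative coefficient reported by agnes() can be used to compare linkage methods on the same distance data. The sketch below is one possible way to do that; the vector name ac and the selection of methods are illustrative choices.

```r
library(MASS)
library(cluster)
data(Animals)

distance <- dist(Animals, method = "euclidean")

# Fit agnes() with several linkage methods and collect the coefficients
methods <- c("average", "single", "complete", "ward")
ac <- sapply(methods, function(m) agnes(distance, method = m)$ac)
print(round(ac, 4))
```

Values closer to 1 suggest a stronger clustering structure, although the coefficient tends to grow with the number of observations, so it is best used to compare methods on the same dataset.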
Finally, we'll visualize the hierarchical tree of the clustered data. The pltree() function draws a clustering tree, or dendrogram, in a plot. We can also highlight the clustered groups in the plot by using the rect.hclust() function. Since rect.hclust() expects an hclust object, we convert the agnes result with as.hclust() first.
pltree(agn, cex = 0.8, hang = -1)
rect.hclust(as.hclust(agn), k = 3, border = 2:4)
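The hclust() result from the earlier step can be plotted directly as well, which keeps the highlighted rectangles consistent with the complete-linkage groups returned by cutree(). This is a minimal sketch; the plot title and the name boxes are illustrative.

```r
library(MASS)
data(Animals)

hc <- hclust(dist(Animals, method = "euclidean"), "complete")

# Draw the dendrogram of the hclust object
plot(hc, cex = 0.8, hang = -1,
     main = "Complete-linkage dendrogram", xlab = "", sub = "")

# Highlight three groups; the invisible return value lists the
# members of each rectangle from left to right
boxes <- rect.hclust(hc, k = 3, border = 2:4)
print(lengths(boxes))
```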
In this tutorial, we have briefly learned how to perform hierarchical cluster analysis and draw a plot in R. The full source code is listed below.
Source code listing
library(MASS)
library(cluster)
data(Animals)
str(Animals)
head(Animals,10)
distance = dist(Animals, method = "euclidean")
str(distance)
hc = hclust(distance, "complete")
print(hc)
groups = cutree(hc, k = 3)
table(groups)
gid = unique(groups)
for(i in 1:length(gid))
cat("Group:", gid[i], rownames(Animals)[groups == gid[i]], "\n")
agn = agnes(distance)
print(agn)
pltree(agn, cex = 0.8, hang = -1)
rect.hclust(as.hclust(agn), k = 3, border = 2:4)