Hierarchical (Agglomerative) Clustering Example in R

   Hierarchical clustering applies either a "top-down" or a "bottom-up" method to group the observations in a dataset. Agglomerative clustering is the hierarchical method that takes the "bottom-up" approach: each element starts in its own cluster and is progressively merged with other clusters according to certain criteria.

   In this tutorial, you will briefly learn how to perform hierarchical clustering in R and visualize the result in a plot. Functions from the 'cluster' package help us perform the hierarchy analysis. The tutorial covers:

  1. Preparing the data
  2. Clustering with the hclust() function
  3. Visualizing in a plot
  4. Source code listing

   We'll start by loading the required R libraries for this tutorial.

   Hierarchical clustering is a method for creating groups of similar objects in a given dataset. It builds an ordered hierarchy from the bottom up by successively merging similar objects.

library(MASS)
library(cluster)
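As a quick illustration of the bottom-up merging described above, here is a minimal sketch (the points and their names are made up for illustration):

```r
# Four 1-D points: a and b are close together, as are c and d.
x <- c(a = 1, b = 2, c = 10, d = 11)
hc_toy <- hclust(dist(x))

# hclust() first merges the two closest singletons, then merges
# clusters; the merge heights grow as clusters get farther apart.
print(hc_toy$height)
```

With the default complete linkage, the first two merges happen at distance 1 (a with b, c with d) and the final merge at distance 10.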


Preparing the data


   We'll use the "Animals" dataset from the MASS package, which we load with the data() function.

data(Animals)
str(Animals)
'data.frame':    28 obs. of  2 variables:
 $ body : num  1.35 465 36.33 27.66 1.04 ...
 $ brain: num  8.1 423 119.5 115 5.5 ...


head(Animals,10)
                     body   brain
Mountain beaver      1.35     8.1
Cow                465.00   423.0
Grey wolf           36.33   119.5
Goat                27.66   115.0
Guinea pig           1.04     5.5
Dipliodocus      11700.00    50.0
Asian elephant    2547.00  4603.0
Donkey             187.10   419.0
Horse              521.00   655.0
Potar monkey        10.00   115.0



Clustering with the hclust() function


   The hclust() function, based on the agglomerative clustering method, requires the distance data of a given dataset. We can compute it with the dist() function, which calculates the distances between the rows of a data matrix using a given method (Euclidean, Manhattan, etc.). Euclidean is the default distance method, and you can change it to suit your dataset.

distance = dist(Animals, method = "euclidean")
str(distance)
 

 'dist' num [1:378] 622.18 116.76 110.09 2.62 11698.73 ...
 - attr(*, "Size")= int 28
 - attr(*, "Labels")= chr [1:28] "Mountain beaver" "Cow" "Grey wolf" "Goat" ...
 - attr(*, "Diag")= logi FALSE
 - attr(*, "Upper")= logi FALSE
 - attr(*, "method")= chr "euclidean"
 - attr(*, "call")= language dist(x = Animals, method = "euclidean")
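As a hedged variation (not part of the original tutorial), the distance method can be swapped out, and standardizing the variables first is often advisable here because 'body' and 'brain' are on very different scales:

```r
library(MASS)
data(Animals)

# Manhattan distance on standardized columns; scale() centers and
# scales each variable so neither one dominates the distances.
d_man <- dist(scale(Animals), method = "manhattan")
attr(d_man, "method")
```

The resulting 'dist' object has the same 378 pairwise entries as before, only computed with the Manhattan metric on scaled data.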


Now, we can use the hclust() function with the distance data.

hc = hclust(distance, "complete")
print(hc)
Call:
hclust(d = distance, method = "complete")

Cluster method   : complete
Distance         : euclidean
Number of objects: 28


Next, we'll cut the tree into groups by using the cutree() function. Here, we'll set the k parameter to 3 to specify the number of groups.

groups = cutree(hc, k = 3)
table(groups)
 
groups
 1  2  3
25  2  1


We can check each group's elements after the grouping.

gid = unique(groups)
for(i in 1:length(gid))
   cat("Group:", gid[i], rownames(Animals)[groups == gid[i]], "\n")
 

Group: 1 Mountain beaver Cow Grey wolf Goat Guinea pig Asian elephant Donkey Horse Potar monkey Cat Giraffe Gorilla Human African elephant Rhesus monkey Kangaroo Golden hamster Mouse Rabbit Sheep Jaguar Chimpanzee Rat Mole Pig
Group: 2 Dipliodocus Triceratops
Group: 3 Brachiosaurus 
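An equivalent, slightly more idiomatic way to list each group's members is split(); this sketch repeats the earlier steps so it runs on its own:

```r
library(MASS)
data(Animals)

hc <- hclust(dist(Animals, method = "euclidean"), "complete")
groups <- cutree(hc, k = 3)

# split() returns a named list with one character vector
# of row names per group.
split(rownames(Animals), groups)
```

This produces the same three groups as the loop above, just as a single list.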



Visualizing in a plot

   Next, we'll construct a hierarchy of clustered elements by using the agnes() function. 

agn = agnes(distance)
print(agn)
 
Call:     agnes(x = distance)
Agglomerative coefficient:  0.9579279
Order of objects:
 [1] Mountain beaver  Guinea pig       Golden hamster   Mouse            Rat            
 [6] Mole             Rabbit           Cat              Kangaroo         Grey wolf      
[11] Goat             Potar monkey     Rhesus monkey    Sheep            Jaguar         
[16] Pig              Donkey           Gorilla          Chimpanzee       Cow            
[21] Horse            Giraffe          Human            Asian elephant   African elephant
[26] Dipliodocus      Triceratops      Brachiosaurus  
Height (summary):
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.61    15.48    65.94  4173.58   517.81 85797.28

Available components:
[1] "order"     "height"    "ac"        "merge"     "diss"      "call"      "method"  
[8] "order.lab"
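The "Agglomerative coefficient" in the output above measures how strong the clustering structure is (values near 1 suggest a clear structure). It can also be retrieved directly with coef(); this sketch recomputes it standalone:

```r
library(MASS)
library(cluster)
data(Animals)

agn <- agnes(dist(Animals, method = "euclidean"))

# coef() on an 'agnes' object returns the agglomerative coefficient.
round(coef(agn), 4)
```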


Finally, we'll visualize the hierarchical tree of the clustered data. The pltree() function draws the cluster tree (dendrogram) in a plot, and we can highlight the clustered groups with the rect.hclust() function.

pltree(agn, cex = 0.8, hang = -1) 
rect.hclust(agn, k = 3, border = 2:4)
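As a possible refinement (an assumption on my part, not part of the original tutorial), clustering the log-transformed data can give a more informative tree here, because body and brain masses span several orders of magnitude:

```r
library(MASS)
library(cluster)
data(Animals)

# Cluster on the log scale so the dinosaurs' huge body masses
# don't dominate the Euclidean distances.
agn_log <- agnes(dist(log(Animals)))
pltree(agn_log, cex = 0.8, hang = -1,
       main = "Dendrogram of log(Animals)")
```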



   In this tutorial, we have briefly learned how to perform hierarchical cluster analysis and draw a plot in R. The full source code is listed below.


Source code listing 


library(MASS)
library(cluster)

data(Animals)
str(Animals)
 

head(Animals, 10)

distance = dist(Animals, method = "euclidean")
str(distance)

hc = hclust(distance, "complete")
print(hc)

groups = cutree(hc, k = 3)
table(groups)

gid = unique(groups)
for(i in 1:length(gid))
   cat("Group:", gid[i], rownames(Animals)[groups == gid[i]], "\n")

agn = agnes(distance)
print(agn)

pltree(agn, cex = 0.8, hang = -1) 
rect.hclust(agn, k = 3, border = 2:4)  


