Robots, Art, Offshore Finance, and Life 



ML single linkage example
ML DENDROGRAM allocation to clusters
ML complete linkage example


In hierarchical clustering, you categorise the objects into a hierarchy similar to a tree-like diagram which is called a "dendrogram".

The main use of the dendrogram is to work out the best way to allocate objects to clusters. The dendrogram below shows the hierarchical clustering of observations shown on the scatter-plot to the left. 

The key to interpreting a dendrogram is to focus on the height at which any two objects are joined together. In the example above, we can see that E and F are most similar, as the height of the link that joins them together is the smallest. The next two most similar objects are A and B.   

The dendrogram is a summary of the distance matrix, and, as occurs with most summaries, information is lost. For example, the dendrogram suggests that C and D are much closer to each other than is C to B, but the original data (shown in the scatterplot), shows us that this is not true. A dendrogram is only accurate when data satisfies the ultrametric tree inequality, and this is unlikely for any real-world data.  The consequence of the information loss is that the dendrograms are most accurate at the bottom, showing which items are very similar.


Observations are allocated to clusters by drawing a horizontal line through the dendrogram. Observations that are joined together below the line are in clusters. In the example below, we have two clusters, one that combines A and B, and a second combining C, D, E, and F.


A common mistake when reading dendrograms is to assume that the shape of the dendrogram gives a clue as to how many clusters exist. In the example above, the (incorrect) interpretation is that the dendrogram shows that there are two clusters, as the distance between the clusters (the vertical segments of the dendrogram) are highest between two and three clusters.

Such an interpretation is justified only when the ultrametric tree inequality holds, which, as mentioned above, is very rare. In general, it is a mistake to use dendrograms as a tool for determining the number of clusters in data. Where there is an obviously “correct” number of clusters this will often be evident in a dendrogram. However, dendrograms often suggest a correct number of clusters when there is no real evidence to support the conclusion.

One question is how to decide when to stop merging the clusters? Well, that depends on the domain knowledge you have about the data. For, example if you are clustering football players on a field based on their positions on the field which will represent their coordinates for distance calculation, you already know that you should end with only 2 clusters as there can be only two teams playing a football match.

But sometimes you don't have that information. In such cases, you can leverage the results from the dendrogram to approximate the number of clusters. You cut the dendrogram tree with a horizontal line at a height where the line can traverse the maximum distance up and down without intersecting the merging point.

Measuring the goodness of Clusters
Perhaps the most important part in any unsupervised learning task is the analysis of the results. After you have performed the clustering using any algorithm and any sets of parameters you need to make sure that you did it right. But how do you determine that?

There are many measures to do this, perhaps the most popular one is the Dunn's Index. Dunn's index is the ratio between the minimum inter-cluster distances to the maximum intra-cluster diameter. The diameter of a cluster is the distance between its two furthermost points. In order to have well separated and compact clusters you should aim for a higher Dunn's index.


Randomly choose K examples as initial centroids

While True:

  • Create K clusters by assigning each example to the closest centroid
  • Compute K new centroids by averaging examples in each cluster
  • If centroids dont change; break

What is the complexity of one iteration? - K*N*D, where N is number of points and D is time required to compute the distance between a pair of points

K can be chosen by having "a priori" knowledge bout the application domain. e.g.

  • There are two kinds of people; K=2
  • There five different kinds of bacteria; K=5

Search for a good K 

  • Try different values of K and evaluate the quality of the results
  • Run hierarchical clustering on subset of data.


There are many fundamental differences between the two algorithms, although any one can perform better than the other in different cases. Some of the differences are:

  • Distance used: Hierarchical clustering can virtually handle any distance metric while k-means rely on euclidean distances.
  • Stability of results: k-means requires a random step at its initialisation that may yield different results if the process is re-run. That wouldn't be the case in hierarchical clustering.
  • Number of Clusters: While you can use elbow plots, Silhouette plot etc. to figure the right number of clusters in k-means, hierarchical too can use all of those but with the added benefit of leveraging the dendrogram for the same.
  • Computation Complexity: K-means is less computationally expensive than hierarchical clustering and can be run on large data-sets within a reasonable time frame, which is the main reason k-means is more popular.
Robots, Art, Offshore Finance, Life - Machine Learning Hierarchical Clusters and Dendrograms


Clustering is a well known technique used, for example, in customer segmentation or outlier detection.

Hierarchical clustering is the hierarchical application of clustering techniques.

Clustering is the extraction of natural groupings of similar data objects, and there are a few general principles which are relevant:

  • The clusters should be naturally occurring in data.
  • The clustering should discover hidden patterns in the data.
  • Data points within the cluster should be similar.
  • Data points in two different clusters should not be similar.

Common algorithms used for clustering include K-Means

Hierarchical clustering relies on using these clustering techniques to find a hierarchy of clusters, where this hierarchy resembles a tree structure, called a "Dendrogram".


There are two top-level methods for finding these hierarchical clusters:

"Agglomerative clustering"uses a bottom-up approach, wherein each data point starts in its own cluster. These clusters are then joined greedily, by taking the two most similar clusters together and merging them.

"Divisive clustering" uses a top-down approach, wherein all data points start in the same cluster. You can then use a parametric clustering algorithm like "K-Means" to divide the cluster into two clusters.

  • Start by assigning each item to a cluster, so that if you have n items you now have n clusters, each containing just one item.
  • Find the closest (most similar) pair of clusters and merge them into a single cluster so that you have one less cluster.
  • Continue the process until all items are clustered into a single cluster of size n.

Both of these methods rely on constructing a similarity matrix between all of the data points

It is necessary to start with each data point as its own cluster and then combine clusters based on some similarity measure. The similarity between the clusters is often calculated from the dissimilarity measures such as the euclidean distance between two clusters. So the larger the distance between two clusters, the better it is.

There are many distance metrics that you can consider to calculate the dissimilarity measure, and the choice depends on the type of data in the data-set. For example if you have continuous numerical values in your data-set you can use euclidean distance, Other distance measures include Manhattan and Minkowski.


The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster.

There are 3 (three) key questions that need to be answered first:

  1. How do you represent a cluster of more than one point?
  2. How do you determine the "nearness" of clusters?
  3. When do you stop combining clusters?


  1. It starts by calculating the distance between every pair of observation points and stores it in a distance matrix.
  2. It then puts every point in its own cluster.
  3. Then it starts merging the closest pairs of points based on the distances from the distance matrix and as a result the amount of clusters goes down by 1.
  4. Then it recomputes the distance between the new cluster and the old ones and stores them in a new distance matrix.
  5. Lastly it repeats steps 2 and 3 until all the clusters are merged into one single cluster.

There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. The most common linkage methods are:

SINGLE LINKAGE:  Consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.

COMPLETE LINKAGE: Consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.

AVERAGE LINKAGE: Consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.

Which linkage method to use is entirely a matter of choice and there is no hard and fast method that will always give good results. Different linkage methods lead to different clusters.