In hierarchical clustering, you organise the objects into a hierarchy, which is drawn as a tree-like diagram called a "dendrogram".
The main use of the dendrogram is to work out the best way to allocate objects to clusters. The dendrogram below shows the hierarchical clustering of observations shown on the scatter-plot to the left.
The key to interpreting a dendrogram is to focus on the height at which any two objects are joined together. In the example above, we can see that E and F are most similar, as the height of the link that joins them together is the smallest. The next two most similar objects are A and B.
The dendrogram is a summary of the distance matrix, and, as with most summaries, information is lost. For example, the dendrogram suggests that C and D are much closer to each other than C is to B, but the original data (shown in the scatterplot) show that this is not true. A dendrogram is only accurate when the data satisfy the ultrametric tree inequality, and this is unlikely for any real-world data. The consequence of the information loss is that dendrograms are most accurate at the bottom, showing which items are very similar.
Observations are allocated to clusters by drawing a horizontal line through the dendrogram. Observations that are joined together below the line belong to the same cluster. In the example below, we have two clusters: one that combines A and B, and a second combining C, D, E, and F.
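As a concrete sketch of this cut, the snippet below uses SciPy's hierarchical-clustering routines on six made-up 2-D points standing in for A–F (the coordinates are invented for illustration, chosen so that A and B sit apart from C, D, E, and F):

```python
# Sketch: cut a dendrogram at a chosen height to form flat clusters.
# The six 2-D points are made up for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[0.0, 0.0],   # A
                   [0.5, 0.3],   # B
                   [4.0, 4.0],   # C
                   [4.5, 4.2],   # D
                   [5.0, 5.0],   # E
                   [5.1, 5.1]])  # F

Z = linkage(points, method="single")               # build the hierarchy
labels = fcluster(Z, t=2.0, criterion="distance")  # horizontal cut at height 2.0
print(labels)  # A and B share one label; C, D, E, F share the other
```

Cutting at height 2.0 falls above the small within-group merges but below the final merge, so two clusters remain, matching the example in the text.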
DENDROGRAMS CANNOT DICTATE HOW MANY CLUSTERS YOU NEED
A common mistake when reading dendrograms is to assume that the shape of the dendrogram gives a clue as to how many clusters exist. In the example above, the (incorrect) interpretation is that the dendrogram shows there are two clusters, because the distance between clusters (the vertical segments of the dendrogram) is greatest between the two-cluster and three-cluster solutions.
Such an interpretation is justified only when the ultrametric tree inequality holds, which, as mentioned above, is very rare. In general, it is a mistake to use dendrograms as a tool for determining the number of clusters in data. Where there is an obviously “correct” number of clusters this will often be evident in a dendrogram. However, dendrograms often suggest a correct number of clusters when there is no real evidence to support the conclusion.
One question is how to decide when to stop merging the clusters. That depends on the domain knowledge you have about the data. For example, if you are clustering football players based on their positions on the field (which serve as their coordinates for the distance calculation), you already know you should end up with only 2 clusters, as only two teams play a football match.
But sometimes you don't have that information. In such cases, you can use the dendrogram to approximate the number of clusters: cut the tree with a horizontal line at a height where the line can travel the maximum vertical distance up and down without intersecting a merge point.
Measuring the Goodness of Clusters
Perhaps the most important part of any unsupervised learning task is the analysis of the results. After you have performed the clustering, with whatever algorithm and set of parameters, you need to check that you did it right. But how do you determine that?
There are many measures for this; perhaps the most popular is Dunn's index. Dunn's index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster diameter, where the diameter of a cluster is the distance between its two furthest points. To get well-separated, compact clusters, you should aim for a higher Dunn's index.
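A minimal sketch of Dunn's index, in one common variant that takes the inter-cluster distance to be the shortest point-to-point distance between clusters, under Euclidean distance (the example clusters are made up):

```python
# Dunn's index = min inter-cluster distance / max intra-cluster diameter.
# Higher is better: large numerator = well separated, small denominator = compact.
import math

def dunn_index(clusters):
    """clusters: list of clusters, each a list of point tuples."""
    # numerator: smallest distance between points in different clusters
    min_inter = min(
        math.dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a for q in b
    )
    # denominator: largest distance between two points in the same cluster
    max_diam = max(
        math.dist(p, q)
        for c in clusters for p in c for q in c
    )
    return min_inter / max_diam

tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
print(dunn_index(tight))  # 10 / 1 = 10.0
```

Two tight, far-apart clusters score highly; smearing the clusters out or moving them closer together drives the index down.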
A few practical notes on the related K-Means algorithm: you start by randomly choosing K examples as the initial centroids. The complexity of one iteration is K*N*D, where N is the number of points and D is the time required to compute the distance between a pair of points. K can be chosen using "a priori" knowledge about the application domain (as in the football example above, where K = 2); where no such knowledge exists, you have to search for a good K.
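The notes above can be sketched as a minimal K-Means implementation in pure Python (the 2-D points are made up; the assignment step is where the K*N*D work per iteration happens):

```python
# Minimal K-Means sketch: random initial centroids, then alternate
# assignment and centroid update for a fixed number of iterations.
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K random examples as initial centroids
    for _ in range(iters):
        # assignment step: each of N points vs each of K centroids -> O(K*N*D)
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # update step: move each centroid to the mean of its group
        for j, g in enumerate(groups):
            if g:
                centroids[j] = tuple(sum(xs) / len(g) for xs in zip(*g))
    return centroids

pts = [(0.0, 0.0), (0.1, 0.2), (9.0, 9.0), (9.2, 8.8)]
print(sorted(kmeans(pts, 2)))  # two centroids, one near each group
```

A production implementation would also check for convergence and rerun with different initial centroids, since the random start affects the result.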
There are many fundamental differences between the two algorithms, although either can perform better than the other in different cases. Some of the differences are:
Clustering is a well known technique used, for example, in customer segmentation or outlier detection.
Hierarchical clustering is the repeated application of clustering techniques to build a hierarchy of clusters.
Clustering is the extraction of natural groupings of similar data objects, and there are a few general principles which are relevant:
Common algorithms used for clustering include K-Means.
Hierarchical clustering relies on using these clustering techniques to find a hierarchy of clusters, where this hierarchy resembles a tree structure, called a "Dendrogram".
There are two top-level methods for finding these hierarchical clusters:
"Agglomerative clustering" uses a bottom-up approach, wherein each data point starts in its own cluster. These clusters are then joined greedily, by repeatedly merging the two most similar clusters.
"Divisive clustering" uses a top-down approach, wherein all data points start in the same cluster. You can then use a parametric clustering algorithm like "K-Means" to divide the cluster into two clusters.
Both of these methods rely on constructing a similarity matrix between all of the data points.
In agglomerative clustering, you start with each data point as its own cluster and then combine clusters based on some similarity measure. The similarity between clusters is usually calculated from a dissimilarity measure, such as the Euclidean distance between the two clusters: the smaller the distance between two clusters, the more similar they are.
There are many distance metrics you can use to calculate the dissimilarity measure, and the choice depends on the type of data in the data-set. For example, if your data-set contains continuous numerical values, you can use Euclidean distance. Other distance measures include Manhattan and Minkowski.
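The three metrics just mentioned are closely related: Minkowski distance with p = 1 is Manhattan distance, and with p = 2 it is Euclidean distance. A small sketch with made-up vectors:

```python
# Minkowski distance between two numeric vectors:
# p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2), (4, 6)
print(minkowski(x, y, 1))  # Manhattan: 3 + 4 = 7.0
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```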
The key operation in hierarchical agglomerative clustering is to repeatedly combine the two nearest clusters into a larger cluster.
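That key operation can be sketched as a naive pure-Python loop, using single linkage (shortest member-to-member distance) as the measure of "nearest"; the points are made up, and a real implementation would cache distances rather than recompute them:

```python
# Naive agglomerative clustering: start with singleton clusters and
# repeatedly merge the two nearest ones until n_clusters remain.
import math

def agglomerate(points, n_clusters, linkage=min):
    clusters = [[p] for p in points]  # every point starts in its own cluster
    while len(clusters) > n_clusters:
        best = None  # (distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(math.dist(p, q)
                            for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the nearest pair
    return clusters

pts = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerate(pts, 2))  # the two nearby pairs end up together
```

Passing `max` or a mean function as `linkage` turns this same loop into complete or average linkage, which the next section describes.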
There are three key questions that need to be answered first:
There are several ways to measure the distance between clusters in order to decide the rules for clustering, and they are often called Linkage Methods. The most common linkage methods are:
SINGLE LINKAGE: Consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.
COMPLETE LINKAGE: Consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.
AVERAGE LINKAGE: Consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
Which linkage method to use is largely a matter of choice: there is no hard-and-fast rule that always gives good results, and different linkage methods lead to different clusters.
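The three linkage rules above can be written as small functions of two clusters (lists of points) under Euclidean distance; the example clusters are made up:

```python
# Single, complete, and average linkage between two clusters a and b.
import math

def _pair_dists(a, b):
    # all member-to-member distances between the two clusters
    return [math.dist(p, q) for p in a for q in b]

def single_linkage(a, b):
    return min(_pair_dists(a, b))   # shortest member-to-member distance

def complete_linkage(a, b):
    return max(_pair_dists(a, b))   # greatest member-to-member distance

def average_linkage(a, b):
    d = _pair_dists(a, b)
    return sum(d) / len(d)          # mean member-to-member distance

a, b = [(0, 0), (0, 2)], [(4, 0)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
```

By construction, single linkage never exceeds average linkage, which never exceeds complete linkage, which is one way to see why single linkage tends to chain clusters together while complete linkage keeps them compact.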