{% extends "layout.html" %} {% block content %}

🌳 Study Guide: Hierarchical Clustering


🔹 Core Concepts

Story-style intuition: Organizing a Family Reunion

Imagine you are organizing a big family reunion. You start by grouping the closest relatives: siblings form small groups. Then, you merge those groups with their cousins. Next, you merge those larger groups with their aunts and uncles. You keep doing this until the entire extended family is in one giant group. Hierarchical clustering works just like this: it builds a family tree, or a dendrogram, showing how everyone is related, from the closest individuals to the entire family.

What is Clustering?

In machine learning, clustering is an unsupervised learning technique. This means you have data, but you don't have pre-defined labels for it. The goal of clustering is to find natural groupings (or "clusters") in the data, where points within the same group are more similar to each other than to those in other groups.

Example: You have a list of customers and their purchasing habits (e.g., spending amount, frequency of visits). Clustering would help you automatically identify groups like "high-spending loyal customers," "occasional bargain hunters," and "new visitors" without you having to define these groups first.

Definition of Hierarchical Clustering

Hierarchical Clustering builds a hierarchy of clusters, either from the bottom up or the top down. The result is a tree-like structure called a dendrogram, which shows the entire "family tree" of how the groups were formed.

🔹 Mathematical Foundation

Distance Metrics (Measuring Point-to-Point Similarity)

Before we can group anything, we need a way to measure how "close" two individual points are. The most common choice is Euclidean distance (the straight-line distance used in the worked example below); Manhattan distance (the sum of absolute coordinate differences) is a common alternative.
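As a minimal sketch (the two points are illustrative), here is how Euclidean and Manhattan distance compare for the same pair of points:

```python
import numpy as np

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

# Euclidean distance: square root of the summed squared differences
euclidean = np.sqrt(np.sum((p - q) ** 2))   # 5.0

# Manhattan distance: sum of the absolute coordinate differences
manhattan = np.sum(np.abs(p - q))           # 7.0

print(euclidean, manhattan)
```

Note that the two metrics disagree (5.0 vs 7.0) for the same pair of points, which is why the choice of metric can change which clusters end up merging first.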

Linkage Criteria (Measuring Cluster-to-Cluster Similarity)

Once you have family groups, how do you decide which two groups should merge next? Do you connect them based on their two closest members (single linkage), their two most distant members (complete linkage), or the average distance between all their members (average linkage)?
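The three linkage criteria above can be sketched directly from the pairwise distance matrix between two clusters (the point coordinates here are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small clusters of 2-D points (illustrative values)
cluster_1 = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_2 = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise point-to-point distances between the two clusters
d = cdist(cluster_1, cluster_2)  # shape (2, 2)

single   = d.min()    # closest pair:            3.0
complete = d.max()    # most distant pair:       6.0
average  = d.mean()   # mean of all four pairs: (4 + 6 + 3 + 5) / 4 = 4.5

print(single, complete, average)
```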

🔹 Algorithm Steps (Agglomerative)

Let's walk through a simple example with four points A, B, C, D. Their initial distance matrix (Euclidean) is:

      A   B   C   D
    A   0   2   6   10
    B   2   0   5   9
    C   6   5   0   4
    D   10  9   4   0
  1. Initialization: We start with four clusters: {A}, {B}, {C}, {D}.
  2. Merge 1: The smallest distance is 2 (between A and B). We merge them into a new cluster {A, B}. Our clusters are now {A, B}, {C}, {D}.
  3. Update 1: We update the distance matrix using, for example, single linkage:
    • dist({A,B}, C) = min(dist(A,C), dist(B,C)) = min(6, 5) = 5
    • dist({A,B}, D) = min(dist(A,D), dist(B,D)) = min(10, 9) = 9
    New Matrix:
              {A,B}   C   D
        {A,B}   0     5   9
        C       5     0   4
        D       9     4   0
  4. Merge 2: The smallest distance is now 4 (between C and D). We merge them into {C, D}. Our clusters are now {A, B}, {C, D}.
  5. Update 2: Update the final distance:
    • dist({A,B}, {C,D}) = min(dist(A,C), dist(A,D), dist(B,C), dist(B,D)) = min(6, 10, 5, 9) = 5
  6. Final Merge: We merge the last two clusters {A, B} and {C, D} at a distance of 5. The process is complete.
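The walkthrough above can be reproduced with scipy by feeding it the distance matrix in condensed form (a sketch; the labels A, B, C, D become indices 0–3):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed (upper-triangle) distances, in pair order:
# (A,B), (A,C), (A,D), (B,C), (B,D), (C,D)
condensed = np.array([2.0, 6.0, 10.0, 5.0, 9.0, 4.0])

Z = linkage(condensed, method='single')

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size].
# The merge distances are 2 ({A,B}), then 4 ({C,D}), then 5 (final merge),
# exactly matching the hand-computed steps above.
print(Z)
```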

🔹 The Dendrogram

The dendrogram is the tree diagram that visualizes the entire merging process. The y-axis represents the distance at which clusters were merged. By "cutting" the dendrogram with a horizontal line, you can choose the final number of clusters.

Example: For the algorithm steps above, the dendrogram would show A and B merging at a low height (distance 2). C and D would merge at a slightly higher height (distance 4). Finally, the {A, B} group and the {C, D} group would merge at an even higher height (distance 5).

If you draw a horizontal "cut" line at a distance of 3, you would cross three vertical lines, giving you three clusters: {A, B}, {C}, and {D}. If you cut at a distance of 6, you would get one cluster: {A, B, C, D}.
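The two "cuts" described above can be checked with scipy's fcluster, reusing the same four-point distance matrix (a sketch; points A–D are indices 0–3):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Condensed distances for A, B, C, D (pair order: AB, AC, AD, BC, BD, CD)
condensed = np.array([2.0, 6.0, 10.0, 5.0, 9.0, 4.0])
Z = linkage(condensed, method='single')

# Cut at distance 3: only the A-B merge (distance 2) has happened,
# so A and B share a label while C and D remain singletons
labels_at_3 = fcluster(Z, t=3, criterion='distance')
print(len(set(labels_at_3)))  # 3 clusters

# Cut at distance 6: every merge (max distance 5) is below the cut
labels_at_6 = fcluster(Z, t=6, criterion='distance')
print(len(set(labels_at_6)))  # 1 cluster
```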

🔹 Comparison: Hierarchical vs. K-Means

Number of Clusters
  • Hierarchical: Not needed upfront; chosen by cutting the dendrogram.
  • K-Means: Must be pre-specified (K).

Speed & Scalability
  • Hierarchical: Slow (O(n^2) to O(n^3)); not suited to large datasets.
  • K-Means: Fast; scales well to large data.

Output
  • Hierarchical: A full hierarchy (dendrogram) showing relationships.
  • K-Means: A single set of K clusters.

Determinism
  • Hierarchical: Deterministic (always the same result).
  • K-Means: Can vary based on initial random centroids.

Use-Case Example
  • Hierarchical: Understanding evolutionary relationships between species (phylogenetic trees), where the hierarchy itself is the goal.
  • K-Means: Segmenting 1 million customers into 3 pricing tiers (Gold, Silver, Bronze), where speed and scalability are key.
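As a minimal side-by-side sketch (the dataset and parameters are illustrative), both estimators from scikit-learn can be run on the same data, which highlights the determinism difference:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# K-Means: K must be pre-specified, and the result depends on centroid
# initialization, so we fix random_state for reproducibility
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Agglomerative: deterministic, no random seed needed; n_clusters here
# plays the role of a dendrogram cut
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(len(set(km_labels)), len(set(agg_labels)))  # 3 3
```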

🔹 Strengths & Weaknesses

Advantages:

  • No need to specify the number of clusters upfront; you can choose it afterwards by cutting the dendrogram.
  • The dendrogram gives a rich, interpretable picture of how points and clusters relate at every scale.
  • Deterministic: the same data and linkage method always produce the same result.

Disadvantages:

  • Computationally expensive (O(n^2) to O(n^3)), so it does not scale to large datasets.
  • Greedy: once two clusters are merged, the decision can never be undone.
  • Sensitive to noise and outliers, especially with single linkage, which can "chain" distinct groups together.

🔹 Python Implementation

Let's use Python's `scipy` library to build our "family tree" (the dendrogram) and `scikit-learn` to perform the final clustering once we decide how many groups we want.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage

# 1. Generate and Prepare Data
X, y = make_blobs(n_samples=50, centers=4, cluster_std=1.2, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# 2. Build the Linkage Matrix using Scipy for the dendrogram
linkage_matrix = linkage(X_scaled, method='ward')

# 3. Plot the Dendrogram
plt.figure(figsize=(15, 7))
plt.title('Hierarchical Clustering Dendrogram (Ward Linkage)')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
dendrogram(linkage_matrix)
plt.show()

# 4. Perform Clustering with Scikit-learn
# Let's say the dendrogram suggests 4 clusters is a good choice.
agg_cluster = AgglomerativeClustering(n_clusters=4, linkage='ward')
labels = agg_cluster.fit_predict(X_scaled)

# 5. Visualize the final clusters
plt.figure(figsize=(10, 7))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis', s=50)
plt.title('Final Clusters (n=4)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()


🔹 Key Terminology Explained

The Story: Decoding the Family Reunion Plan

Let's break down the technical terms used in our family reunion plan to make sure everything is crystal clear.

  • Agglomerative ("bottom-up"): start with every point as its own cluster and repeatedly merge the two closest clusters, like relatives forming ever-larger groups.
  • Divisive ("top-down"): start with one all-inclusive cluster and recursively split it apart.
  • Dendrogram: the tree diagram (the "family tree") that records every merge and the distance at which it happened.
  • Linkage: the rule for measuring the distance between two clusters, such as single, complete, average, or Ward.
  • Ward linkage: merges the pair of clusters whose union causes the smallest increase in total within-cluster variance.

🔹 Best Practices

  • Scale your features first (e.g., with StandardScaler, as in the code above): distance-based methods are sensitive to feature scales.
  • Plot the dendrogram before committing to a number of clusters; a large vertical gap between merge heights is a natural place to cut.
  • Choose the linkage method deliberately: Ward favors compact, similarly sized clusters (and requires Euclidean distance), while single linkage can find elongated shapes but is prone to chaining.
  • Because of its quadratic-or-worse time and memory costs, consider sampling the data or switching to K-Means for very large datasets.

{% endblock %}