Learn how machines find hidden groups in data. Watch points get assigned and clusters adapt in real-time!
Points can partially belong to multiple clusters.
Alternates between E-step and M-step.
Ellipses show the statistical shape of clusters.
Algorithm stops when centers stabilize.
Measures how well the model explains the data. Higher log-likelihood = better fit.
Calculating the "responsibility" (probability) that each point belongs to each cluster center.
Moving and stretching clusters to better fit the points assigned to them.
The point where the clusters stop moving because they've settled into a locally optimal mathematical fit (EM is not guaranteed to find the global best).
EM (Expectation-Maximization) is a powerful iterative method used to find maximum likelihood estimates of parameters in statistical models, where the model depends on unobserved latent variables.
Imagine you have a bag of colored marbles, but you're colorblind! You can feel their sizes and weights, but you can't see the colors. EM helps you figure out which marbles are likely the same color based on their shared physical properties.
The algorithm works by alternating between two main steps until it finds the best mathematical grouping.
Assignment Phase
Goal: Calculate the probability (responsibility) of each cluster for each data point.
responsibility = (cluster weight × fit to cluster) / (sum of weighted fits across all clusters)
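The E-step formula above can be sketched in a few lines of numpy. This is our own minimal illustration (function names `gaussian_pdf` and `e_step` are ours, not part of the demo): each point's weighted density under every cluster is computed, then normalized so each point's responsibilities sum to 1.

```python
import numpy as np

def gaussian_pdf(points, mean, cov):
    """Density of each point under one Gaussian cluster."""
    d = points.shape[1]
    diff = points - mean
    inv = np.linalg.inv(cov)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm

def e_step(points, weights, means, covs):
    """Responsibility of each cluster for each point (rows sum to 1)."""
    fit = np.column_stack([
        w * gaussian_pdf(points, m, c)            # weighted fit to cluster
        for w, m, c in zip(weights, means, covs)
    ])
    return fit / fit.sum(axis=1, keepdims=True)   # divide by total fit
```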
Update Phase
Goal: Update cluster parameters (center, shape, weight) based on assigned points.
new_mean = sum(points × their_responsibilities) / sum(responsibilities)
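Here is one way the M-step could look in numpy (a sketch of ours, not the demo's source; the name `m_step` is ours). Each cluster's weight, mean, and covariance is refit as a responsibility-weighted average over all points:

```python
import numpy as np

def m_step(points, resp):
    """Refit each cluster's weight, mean, and shape from responsibilities."""
    n, d = points.shape
    nk = resp.sum(axis=0)                    # effective number of points per cluster
    weights = nk / n                         # new mixing weights
    means = (resp.T @ points) / nk[:, None]  # responsibility-weighted means
    covs = np.empty((resp.shape[1], d, d))
    for k in range(resp.shape[1]):
        diff = points - means[k]
        covs[k] = (resp[:, k, None] * diff).T @ diff / nk[k]
    return weights, means, covs
```

With hard (0-or-1) responsibilities this reduces to the familiar K-Means centroid update, which is why GMM is often described as a "soft" K-Means.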
1. Initialize with random centers and circular shapes
2. Assign data points to clusters
3. Update cluster parameters
4. Repeat steps 2 & 3 until stable
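The whole loop above can be sketched end-to-end in plain numpy. This is an illustrative implementation under our own assumptions (function name `fit_gmm`, a small ridge term for numerical stability, and convergence measured by the change in log-likelihood), not the demo's actual code:

```python
import numpy as np

def fit_gmm(points, k, iters=200, tol=1e-8, seed=0):
    """Minimal EM loop for a Gaussian mixture (numpy-only sketch)."""
    rng = np.random.default_rng(seed)
    n, d = points.shape
    # 1. Initialize: random centers, circular (identity) shapes, equal weights
    means = points[rng.choice(n, size=k, replace=False)].copy()
    covs = np.array([np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    history = []
    for _ in range(iters):
        # 2. E-step: weighted density of every point under every cluster
        dens = np.empty((n, k))
        for j in range(k):
            diff = points - means[j]
            inv = np.linalg.inv(covs[j])
            norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[j]))
            dens[:, j] = weights[j] * np.exp(
                -0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm
        total = dens.sum(axis=1)
        history.append(np.log(total).sum())   # log-likelihood this iteration
        resp = dens / total[:, None]
        # 3. M-step: refit weights, means, and shapes
        nk = resp.sum(axis=0)
        weights = nk / n
        means = (resp.T @ points) / nk[:, None]
        for j in range(k):
            diff = points - means[j]
            covs[j] = ((resp[:, j, None] * diff).T @ diff / nk[j]
                       + 1e-6 * np.eye(d))    # tiny ridge keeps covs invertible
        # 4. Repeat until the fit stabilizes
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return weights, means, covs, history
```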
We have a classic loop: To find the centers, we need to know point assignments. To find point assignments, we need to know the centers. EM solves this by starting with a "best guess" and iteratively refining it.
Mathematically, each iteration of EM is guaranteed to increase the log-likelihood of the model (or leave it unchanged). This means the model always gets better at explaining the data until it reaches a maximum, though that maximum may be local rather than global, which is why EM is often run several times from different random starts.
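The log-likelihood being maximized can be computed directly: sum, over all points, the log of each point's total weighted density under the mixture. A minimal numpy sketch (the function name `log_likelihood` is ours):

```python
import numpy as np

def log_likelihood(points, weights, means, covs):
    """Sum over points of log(total weighted density under the mixture)."""
    d = points.shape[1]
    total = np.zeros(len(points))
    for w, m, c in zip(weights, means, covs):
        diff = points - m
        inv = np.linalg.inv(c)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(c))
        total += w * np.exp(-0.5 * np.einsum('ij,jk,ik->i', diff, inv, diff)) / norm
    return np.log(total).sum()
```

Parameters that fit the data well produce a higher value than parameters that miss it, which is exactly what each EM iteration improves.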
| Feature | K-Means | GMM (EM) |
|---|---|---|
| Assignment | Hard (0 or 1) | Soft (Probabilities) |
| Cluster Shape | Always circular/spherical | Flexible ellipses (any orientation) |
| Model Type | Distance-based | Distribution-based |
| Use Case | Simple, distinct groups | Overlapping, varied group shapes |
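The hard-vs-soft distinction in the table is easy to see with scikit-learn (a separate library, not part of this demo; the toy data here is our own). K-Means returns one label per point, while `GaussianMixture` (fitted with EM) returns a probability per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2))])

# K-Means: every point gets exactly one label (hard, 0 or 1)
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# GMM fitted with EM: every point gets a probability per cluster (soft)
soft = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)
```

Points deep inside a cluster get responsibilities near 1, while points in the overlap region get intermediate probabilities — information K-Means discards.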
Grouping pixels by color/texture to separate objects in photos.
Identifying different speakers in an audio stream using voice patterns.
Finding groups of customers with similar shopping behaviors.
Clustering gene expression data to find functional biological groups.
Classifying climate zones based on temperature and humidity data.
Clustering emails into 'Ham' and 'Spam' based on content features.