Understanding DBSCAN
DBSCAN is a density-based clustering algorithm that groups data points that are closely packed together and marks outliers as noise based on their density in the feature space. It identifies clusters as dense regions in the data space separated by areas of lower density. Unlike K-Means or hierarchical clustering which assumes clusters are compact and spherical, DBSCAN performs well in handling real-world data irregularities such as:
- Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular or convex.
- Noise and Outliers: It effectively identifies and handles noise points without assigning them to any cluster.
The figure below shows a dataset with clustering algorithms: K-Means and Hierarchical handling compact, spherical clusters with varying noise tolerance while DBSCAN manages arbitrary-shaped clusters and noise handling.
Key Parameters in DBSCAN
1. eps ($$\epsilon$$): This defines the radius of the neighborhood around a data point. If the distance between two points is less than or equal to $$\epsilon$$, they are considered neighbors. A common method to determine $\epsilon$ is by analyzing the k-distance graph. Choosing the right $\epsilon$ is important:
- If $$\epsilon$$ is too small, most points will be classified as noise.
- If $$\epsilon$$ is too large, clusters may merge and the algorithm may fail to distinguish between them.
2. MinPts: This is the minimum number of points required within the $\epsilon$ radius to form a dense region. A general rule of thumb is to set MinPts $$$\ge D+1$$, where $$D$$ is the number of dimensions in the dataset.
How Does DBSCAN Work?
DBSCAN works by categorizing data points into three types:
- Core points: which have a sufficient number of neighbors within a specified radius (epsilon).
- Border points: which are near core points but lack enough neighbors to be core points themselves.
- Noise points: which do not belong to any cluster.
Steps in the DBSCAN Algorithm
- Identify Core Points: For each point in the dataset, count the number of points within its $\epsilon$ neighborhood. If the count meets or exceeds MinPts, mark the point as a core point.
- Form Clusters: For each core point that is not already assigned to a cluster, create a new cluster. Recursively find all density-connected points (i.e., points within the $\epsilon$ radius of the core point) and add them to the cluster.
- Density Connectivity: Two points $a$ and $b$ are density-connected if there exists a chain of points where each point is within the $\epsilon$ radius of the next, and at least one point in the chain is a core point. This chaining process ensures that all points in a cluster are connected through a series of dense regions.
- Label Noise Points: After processing all points, any point that does not belong to a cluster is labeled as noise.
How this DBSCAN Visualization Handles User-Added Data
In this interactive visualization, when you click the "Add New Point & Cluster" button, the new point you specify is appended to the existing dataset. Importantly, the entire DBSCAN clustering algorithm is then re-run from scratch on this updated dataset. This means that:
- The new point is treated as part of the original data, and its type (core, border, or noise) and cluster assignment are determined by the algorithm based on its density relative to all other points.
- The cluster assignments and even the types (core/border/noise) of previously existing points might change, as the addition of a new point can alter the density landscape and connectivity.
- The visualization dynamically updates to reflect these new cluster structures, showing the real-time effect of adding data on the DBSCAN clustering process.