DBSCAN
1. Basics
- is a density based algorithm compared to its centroid based counterparts.
- instead of having number of clusters as the hyperparameter, we have two hyperparameters now:
2. Algorithm
- breifly describing the algorithm:
- pick an unassigned sample randomly and assign it to new cluster
- count how many samples from this one are at a distance less than epsilon
- if this count is greater than n, all of these are assigned to the present cluster as well
- repeat this check for each member of cluster A. if the above conditions are satisfied (more than n in less than epsilon), the present cluster is updated with these new additions as well
- repeat step 4 until convergence of the present cluster
- when no new additions to the present cluster can be made, start building a new cluster from step 1. Terminate procedure when no unassigned clusters remain
3. Misc
- DBSCAN can build arbitarily shaped clusters whereas others algorithms are bound to generate hyperspheres.
- a major drawback is that tuning for two hyperparameters is much harder than tuning for one.
- moreover, a fixed epsilon cannot account for clusters with different densities (they're still clusters)
4. HDBSCAN
- removes the need to decide on epsilon and functions with only one hyperparameter (n - the minimum number of samples to put in a cluster).
- fast implementations are present and can serve as a quick baseline for a clustering task.
- read more here: https://en.wikipedia.org/wiki/DBSCAN#Extensions
Tags::ml:ai: