Clustering
1. Basics
- learning to assign a label to each example using only an unlabeled dataset
- because the data is unlabeled, evaluation strategies are especially important
- performance depends much more on the underlying distribution from which the data was sampled than in supervised settings
1.1. Evaluation
A clustering can be evaluated from multiple perspectives and their combinations. This leads to several loss functions that are loosely based on two major criteria:
- larger inter-cluster distance is better
- smaller intra-cluster diameter is better
Some advanced metrics even consider the shape of the clusters, but I won't be exploring that here.
Check out the scikit-learn clustering docs to learn more about evaluating clustering performance.
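For concreteness, here is a minimal sketch of a few internal metrics from scikit-learn that combine these two criteria (the synthetic blob data is just for illustration):

```python
# Internal clustering evaluation with scikit-learn (a minimal sketch).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,         # mixes intra-cluster tightness and inter-cluster separation
    davies_bouldin_score,     # lower is better: intra-cluster spread vs inter-cluster distance
    calinski_harabasz_score,  # higher is better: between- vs within-cluster dispersion
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))
print("davies-bouldin:", davies_bouldin_score(X, labels))
print("calinski-harabasz:", calinski_harabasz_score(X, labels))
```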
1.2. Determining number of clusters
- easier to eyeball for fewer than 4 dimensions; a principled strategy is needed in higher dimensions
1.2.1. Prediction strength
- split the data into a training set and a test set
- fix a number of clusters k
- run the clustering algorithm on the training set and on the test set independently
- each result now induces clustering regions over the input space
- build a co-membership matrix for the two as follows:
- rows and cols are indexed by the points in the test set
- for each cell, mark 1 if the two test points fall into the same cluster region of the training clustering
- if k is reasonable, this co-membership matrix should closely agree with the test clustering results (see the sketch after this list)
- read more here: https://gwalther.su.domains/predictionstrength.pdf
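Here is a minimal sketch of the procedure for k-means, assuming scikit-learn (the helper name prediction_strength is my own, not a library function):

```python
# Prediction strength for k-means (a minimal sketch).
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def prediction_strength(X, k, random_state=0):
    X_tr, X_te = train_test_split(X, test_size=0.5, random_state=random_state)
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_tr)
    km_te = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_te)

    test_labels = km_te.labels_          # test clustering of the test points
    train_regions = km_tr.predict(X_te)  # training cluster regions applied to test points

    strengths = []
    for j in range(k):
        members = np.where(test_labels == j)[0]
        if len(members) < 2:
            continue
        # fraction of same-test-cluster pairs that are also co-members
        # under the training clustering (pairwise co-membership check)
        agree = sum(train_regions[a] == train_regions[b]
                    for a, b in combinations(members, 2))
        strengths.append(agree / (len(members) * (len(members) - 1) / 2))
    return min(strengths) if strengths else 0.0  # worst test cluster drives the score

# rule of thumb from the paper: pick the largest k whose strength stays high (~0.8-0.9)
```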
1.2.2. Gap statistic
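- compares the within-cluster dispersion W_k on the data to its expectation under a null reference distribution (e.g. uniform over the data's bounding box)
- rule of thumb: choose the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
- read more: Tibshirani, Walther & Hastie (2001)

A minimal sketch, assuming scikit-learn and a uniform bounding-box reference (the helper names are my own):

```python
# Gap statistic for k-means (a minimal sketch).
import numpy as np
from sklearn.cluster import KMeans

def log_wk(X, k, random_state=0):
    # log of the within-cluster dispersion (k-means inertia)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, k, n_refs=10, random_state=0):
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # expected log dispersion under uniform reference datasets of the same shape
    ref_logs = [log_wk(rng.uniform(lo, hi, size=X.shape), k) for _ in range(n_refs)]
    return np.mean(ref_logs) - log_wk(X, k)
```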
2. Algorithms
2.1. Hard clustering models
- definite assignment of a single cluster to each example
2.1.1. k-Means
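- alternates between assigning each point to the nearest centroid and recomputing centroids as cluster means

A minimal usage sketch, assuming scikit-learn:

```python
# k-means on synthetic blob data (a minimal sketch).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])      # hard cluster assignment per example
print(km.cluster_centers_)  # learned centroids
```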
2.1.2. DBSCAN
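- density-based: grows clusters from core points that have at least min_samples neighbours within radius eps; low-density points are labelled noise, and the number of clusters is not fixed in advance

A minimal usage sketch, assuming scikit-learn (eps and min_samples here are illustrative and need tuning per dataset):

```python
# DBSCAN on synthetic two-moons data (a minimal sketch).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))  # label -1 marks noise points, not a cluster
```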
2.2. Soft clustering models
- membership scores for clusters rather than hard labels
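A common example is a Gaussian mixture model; a minimal sketch, assuming scikit-learn:

```python
# Soft clustering via a Gaussian mixture model (a minimal sketch).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gm.predict_proba(X)  # (n_samples, n_components) membership scores
print(probs[:3].round(3))    # each row sums to 1
```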