Clustering
1. Basics
- learning to assign a label to each example using only an unlabeled dataset
- because the data is unlabeled, evaluation strategies are especially important
- performance depends much more on the underlying distribution from which the data was sampled than in supervised settings
1.1. Evaluation
A clustering can be evaluated from multiple perspectives and their combinations. This leads to several loss functions that are loosely based on two major criteria:
- larger inter-cluster distance is better
- smaller intra-cluster diameter is better
Some advanced metrics even consider the shape of the clusters, but I won't be exploring that here.
Check out the scikit-learn clustering docs to learn more about evaluating clustering performance.
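For concreteness, here is a minimal sketch of a few internal metrics from scikit-learn that combine these two criteria (the synthetic blob data is just for illustration):

```python
# Internal clustering evaluation with scikit-learn (a minimal sketch).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    silhouette_score,         # mixes intra-cluster tightness and inter-cluster separation
    davies_bouldin_score,     # lower is better: intra-cluster spread vs inter-cluster distance
    calinski_harabasz_score,  # higher is better: between- vs within-cluster dispersion
)

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("silhouette:", silhouette_score(X, labels))
print("davies-bouldin:", davies_bouldin_score(X, labels))
print("calinski-harabasz:", calinski_harabasz_score(X, labels))
```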
1.2. Determining number of clusters
- easier to eyeball for fewer than 4 dimensions; a principled strategy is needed in higher dimensions
1.2.1. Prediction strength
- split the data into a training set and a test set
- fix a number of clusters k
- run the clustering algorithm on the training set and on the test set independently
- each result now induces clustering regions over the input space
- build a co-membership matrix for the two as follows:
- rows and cols are indexed by the points in the test set
- for each cell, mark 1 if the two test points fall into the same cluster region of the training clustering
- if k is reasonable, this co-membership matrix should closely agree with the test clustering results (see the sketch after this list)
- read more here: https://gwalther.su.domains/predictionstrength.pdf
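Here is a minimal sketch of the procedure for k-means, assuming scikit-learn (the helper name prediction_strength is my own, not a library function):

```python
# Prediction strength for k-means (a minimal sketch).
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

def prediction_strength(X, k, random_state=0):
    X_tr, X_te = train_test_split(X, test_size=0.5, random_state=random_state)
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_tr)
    km_te = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_te)

    test_labels = km_te.labels_          # test clustering of the test points
    train_regions = km_tr.predict(X_te)  # training cluster regions applied to test points

    strengths = []
    for j in range(k):
        members = np.where(test_labels == j)[0]
        if len(members) < 2:
            continue
        # fraction of same-test-cluster pairs that are also co-members
        # under the training clustering (pairwise co-membership check)
        agree = sum(train_regions[a] == train_regions[b]
                    for a, b in combinations(members, 2))
        strengths.append(agree / (len(members) * (len(members) - 1) / 2))
    return min(strengths) if strengths else 0.0  # worst test cluster drives the score

# rule of thumb from the paper: pick the largest k whose strength stays high (~0.8-0.9)
```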
1.2.2. Gap statistic
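- compares the within-cluster dispersion W_k on the data to its expectation under a null reference distribution (e.g. uniform over the data's bounding box)
- rule of thumb: choose the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
- read more: Tibshirani, Walther & Hastie (2001)

A minimal sketch, assuming scikit-learn and a uniform bounding-box reference (the helper names are my own):

```python
# Gap statistic for k-means (a minimal sketch).
import numpy as np
from sklearn.cluster import KMeans

def log_wk(X, k, random_state=0):
    # log of the within-cluster dispersion (k-means inertia)
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, k, n_refs=10, random_state=0):
    rng = np.random.default_rng(random_state)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # expected log dispersion under uniform reference datasets of the same shape
    ref_logs = [log_wk(rng.uniform(lo, hi, size=X.shape), k) for _ in range(n_refs)]
    return np.mean(ref_logs) - log_wk(X, k)
```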
2. Algorithms
2.1. Hard clustering models
- definite assignment of a single cluster to each example
2.1.1. k-Means
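- alternates between assigning each point to the nearest centroid and recomputing centroids as cluster means

A minimal usage sketch, assuming scikit-learn:

```python
# k-means on synthetic blob data (a minimal sketch).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])      # hard cluster assignment per example
print(km.cluster_centers_)  # learned centroids
```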
2.1.2. DBSCAN
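- density-based: grows clusters from core points that have at least min_samples neighbours within radius eps; low-density points are labelled noise, and the number of clusters is not fixed in advance

A minimal usage sketch, assuming scikit-learn (eps and min_samples here are illustrative and need tuning per dataset):

```python
# DBSCAN on synthetic two-moons data (a minimal sketch).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))  # label -1 marks noise points, not a cluster
```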
2.2. Soft clustering models
- membership scores for clusters rather than hard labels
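A common example is a Gaussian mixture model; a minimal sketch, assuming scikit-learn:

```python
# Soft clustering via a Gaussian mixture model (a minimal sketch).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gm.predict_proba(X)  # (n_samples, n_components) membership scores
print(probs[:3].round(3))    # each row sums to 1
```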