Clustering

1. Basics

  • grouping the examples of an unlabeled dataset so that similar examples receive the same label
  • because the data is unlabeled, evaluation strategies matter a great deal
    • performance depends heavily on the underlying distribution from which the data was sampled

1.1. Evaluation

A clustering can be evaluated from multiple perspectives and combinations thereof. This leads to several loss functions that are loosely based on two major criteria:

  • larger inter-cluster distance is better
  • smaller intra-cluster diameter is better
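One widely used metric that combines both criteria is the silhouette coefficient, which compares each point's intra-cluster cohesion against its separation from the nearest other cluster. A minimal sketch using scikit-learn (the blob dataset here is purely illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# synthetic, well-separated data for illustration
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# silhouette lies in [-1, 1]; higher means tighter, better-separated clusters
score = silhouette_score(X, labels)
```

Since clustering is unsupervised, note that this score is computed from the data and labels alone, with no ground truth required.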

Some advanced metrics also consider the shape of the clusters, but I won't be exploring that here.

Check out the scikit-learn clustering docs to learn more about evaluating clustering performance.

1.2. Determining number of clusters

  • easy to eyeball in fewer than 4 dimensions; a principled strategy is needed for higher dimensions
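One common baseline strategy (not mentioned above, so take it as an aside) is the elbow method: plot k-means inertia against k and look for the point where the curve flattens. A minimal sketch:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# illustrative data with 4 true clusters
X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# inertia (within-cluster sum of squares) for a range of candidate k
inertias = {
    k: KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    for k in range(1, 8)
}
# look for the "elbow": the k after which inertia stops dropping sharply
```

Inertia always decreases as k grows, so the elbow is a judgment call; the prediction-strength approach below is more principled.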

1.2.1. Prediction strength

  • split the data into a training and a test set
  • fix the number of clusters k
  • run the clustering algorithm separately on the training set and the test set
  • the two runs induce two sets of clustering regions
  • build a comembership matrix for the test set as follows:
    • rows and columns are indexed by the points of the test set
    • mark a cell as 1 if the two points fall into the same region of the training clustering
  • if k is reasonable, this comembership matrix should closely agree with the test clustering itself
  • read more here: https://gwalther.su.domains/predictionstrength.pdf
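The steps above can be sketched with k-means as the underlying algorithm (the dataset, split ratio, and 0.8 threshold are illustrative choices, the last suggested in the linked paper):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def prediction_strength(X, k, random_state=0):
    """Min over test clusters of the fraction of within-cluster pairs
    that also fall in the same training-derived cluster."""
    X_tr, X_te = train_test_split(X, test_size=0.5, random_state=random_state)
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_tr)
    km_te = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(X_te)

    # assign test points to the training clustering's regions
    tr_labels = km_tr.predict(X_te)
    te_labels = km_te.labels_

    # comembership matrix: D[i, j] = 1 iff test points i and j land in the
    # same training-derived cluster
    D = (tr_labels[:, None] == tr_labels[None, :]).astype(int)

    strengths = []
    for c in range(k):
        idx = np.flatnonzero(te_labels == c)
        n_c = len(idx)
        if n_c < 2:
            strengths.append(1.0)  # degenerate cluster: nothing to disagree on
            continue
        pairs = D[np.ix_(idx, idx)]
        # subtract the diagonal (self-pairs), normalize by ordered pair count
        strengths.append((pairs.sum() - n_c) / (n_c * (n_c - 1)))
    return min(strengths)

X, _ = make_blobs(n_samples=240, centers=3, cluster_std=0.8, random_state=0)
ps = prediction_strength(X, 3)
```

A common rule of thumb from the paper is to pick the largest k whose prediction strength stays above roughly 0.8.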

2. Algorithms

2.1. Hard clustering models

  • a definite cluster assignment for each example

2.1.1. k-Means
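A minimal scikit-learn usage sketch (the blob dataset is illustrative); each point receives exactly one cluster label:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

labels = km.labels_            # one hard label per point
centers = km.cluster_centers_  # learned centroids, shape (3, 2) here
```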

2.1.2. DBSCAN
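A minimal sketch; DBSCAN finds density-based clusters of arbitrary shape and marks low-density points as noise (the two-moons dataset and the `eps`/`min_samples` values are illustrative):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# non-convex clusters that k-means would struggle with
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_  # cluster ids; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Unlike k-means, the number of clusters is not fixed in advance; it falls out of the density parameters.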

2.2. Soft clustering models

  • membership scores for clusters rather than hard labels

2.2.1. HDBSCAN

2.3. Malleable

Tags::ml:ai: