Classification

To classify given data points into one or more known data classes.

1. Overarching Types

  • binary classification
  • multiclass classification
  • multilabel classification (multiple viable labels for a data point)

2. Generic Classification Pipeline

  1. Obtain (datum, label) pairs that can be used for learning.
  2. Split the dataset into train, validation, and (optionally) test parts, and decide on the evaluation metrics to be employed (a minimal split sketch follows this list).
  3. Pre-process the splits accordingly and proceed with an alternating training/validation phase. Degrees of freedom that can be explored:
    • improving feature engineering
    • tuning the model hyperparameters
  4. Test and benchmark the model on the test set using the chosen evaluation metrics.
  5. Deployment: run the model on new data points with unknown categories.
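
A minimal sketch of the split in step 2, assuming the dataset is a list of (datum . label) pairs that has already been shuffled; split-dataset and its ratios are illustrative, not part of any fixed recipe.

(defun split-dataset (dataset &key (train-ratio 0.8) (validation-ratio 0.1))
  ;; Whatever remains after the train and validation parts becomes the
  ;; (optional) test split.
  (let* ((n (length dataset))
         (n-train (floor (* train-ratio n)))
         (n-validation (floor (* validation-ratio n))))
    (list :train (subseq dataset 0 n-train)
          :validation (subseq dataset n-train (+ n-train n-validation))
          :test (subseq dataset (+ n-train n-validation)))))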

3. Classification Evaluation Metrics

3.1. Primitive

  • Accuracy

Percentage of correctly labelled data points

  • weighting accuracy by class importance is a viable modification (a sketch follows the code below)
  • Precision

How many positive predictions were actually positive?

  • Recall

How many positives were predicted out of all the actual positives?

  • F1-score/measure

Harmonic mean of Precision and Recall

Summarizing the above for binary classification:

;; Stand-ins for a data source and a trained model (bodies elided).
(defun data-generator () ...)
(defvar true-labels ...) ;; 1 - positive ; 0 - negative

(defvar pred-labels (model (data-generator)))

(defun count-match (true-label pred-label trues preds)
  ;; Count positions where the actual label is TRUE-LABEL and the
  ;; prediction is PRED-LABEL.
  (reduce #'+ (mapcar (lambda (actual prediction)
                        (if (and (= actual true-label)
                                 (= prediction pred-label))
                            1
                            0))
                      trues
                      preds)))

(defun counter (true-label pred-label)
  ;; Convenience wrapper over the global label lists.
  (count-match true-label pred-label true-labels pred-labels))

;; The four confusion-matrix entries.
(defvar true-positives (counter 1 1))
(defvar true-negatives (counter 0 0))
(defvar false-positives (counter 0 1))
(defvar false-negatives (counter 1 0))

(assert (= (+ true-positives true-negatives
              false-positives false-negatives)
           (length true-labels)))

(defvar accuracy (/ (+ true-positives true-negatives)
                    (length true-labels)))

(defvar precision (/ true-positives
                     (+ true-positives false-positives)))

(defvar recall (/ true-positives
                  (+ true-positives false-negatives)))

(defun harmonic-mean (a b) (/ (* 2 a b) (+ a b)))

(defvar f1-measure (harmonic-mean precision recall))
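
The class-weighted accuracy variant noted above can be sketched in the same style; weighted-accuracy and its weight table are hypothetical, and every class label is assumed to have an entry in the table.

(defun weighted-accuracy (trues preds weights)
  ;; WEIGHTS: hash table mapping class label -> importance weight.
  ;; Each correct prediction contributes its class's weight; the total
  ;; is normalized by the summed weight over all data points.
  (/ (reduce #'+ (mapcar (lambda (actual prediction)
                           (if (= actual prediction)
                               (gethash actual weights)
                               0))
                         trues preds))
     (reduce #'+ (mapcar (lambda (actual) (gethash actual weights))
                         trues))))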

3.2. Not-so-primitive

3.2.1. Area under ROC curve

  1. ROC Curve (Receiver Operating Characteristic)
    • X and Y Axes: The ROC curve is a plot with the False Positive Rate (FPR) on the X-axis and the True Positive Rate (TPR) on the Y-axis.
      • False Positive Rate (FPR): It's the ratio of false positives (incorrectly classified as positive) to the total number of actual negatives.
      • True Positive Rate (TPR): It's the ratio of true positives (correctly classified as positive) to the total number of actual positives.
    • Thresholds: The ROC curve is created by varying the classification threshold of a binary classifier; each threshold value yields one (FPR, TPR) point on the curve.
    • Curve Shape: The curve starts at the bottom-left corner (0,0) and goes towards the top-right corner (1,1). A diagonal line (the "random guess" line) would represent a model that's no better than random guessing.
    • Performance: A model's performance is determined by how far its ROC curve is from the random guess line. The closer it is to the top-left corner, the better the model's ability to distinguish between classes.
  2. AUC (Area Under the Curve)
    • AUC Value: The AUC is a single number that quantifies the overall performance of the model based on the ROC curve. It's the area under the ROC curve.
    • Interpretation: AUC values range from 0 to 1. A model with an AUC of 0.5 represents random guessing (no discrimination), while a model with an AUC of 1.0 represents perfect discrimination.
    • High AUC: A higher AUC indicates that the model is good at distinguishing between the two classes. It suggests that, on average, the model ranks positive examples higher than negative examples.
  3. Summary

    The ROC curve is a graphical representation of a model's ability to classify data, showing the trade-off between false positives and true positives at different thresholds. The AUC summarizes this performance in a single number, with a higher AUC indicating better discrimination ability. It's a common tool for evaluating the performance of binary classification models.
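
A rough sketch of both ideas in the same style as above, assuming the model emits a real-valued positive-class score per data point and that both classes occur in the true labels; roc-point and auc are illustrative names.

(defun roc-point (threshold trues scores)
  ;; One ROC point: classify as positive when the score meets the
  ;; threshold, then return (FPR . TPR). Assumes both classes occur
  ;; in TRUES so the rates are well defined.
  (let ((tp 0) (fp 0) (pos 0) (neg 0))
    (mapc (lambda (actual score)
            (if (= actual 1) (incf pos) (incf neg))
            (when (>= score threshold)
              (if (= actual 1) (incf tp) (incf fp))))
          trues scores)
    (cons (/ fp neg) (/ tp pos))))

(defun auc (trues scores &optional (steps 100))
  ;; Sweep thresholds from 1 down to 0 (so FPR grows monotonically)
  ;; and integrate TPR over FPR with the trapezoidal rule.
  (let ((points (loop for i from steps downto 0
                      collect (roc-point (/ i steps) trues scores))))
    (loop for (p1 p2) on points
          while p2
          sum (* (- (car p2) (car p1))
                 (/ (+ (cdr p1) (cdr p2)) 2)))))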

4. Classifiers

4.1. Types

4.1.1. Generative Classifiers

  • model the probability of observing a data point's feature set given the label (the class-conditional likelihood), combine it with the class prior, and report the argmax over classes

4.1.2. Discriminative Classifiers

  • model the conditional probability of the label given the feature set directly and report the label with the maximum conditional probability

4.2. Examples

4.2.1. Naive Bayes

  • naive (conditional-independence) application of Bayes' theorem.
  • prediction is the class with the highest posterior probability given the current data point.
  • the class priors and per-class feature likelihoods used to evaluate that posterior are estimated from the dataset.
  • example of a generative classifier (sketch below)
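
A minimal Bernoulli flavor of the idea, assuming binary (0/1) feature lists, binary labels, and both classes present in the training data; nb-train and nb-predict are illustrative names, with add-one smoothing on the feature likelihoods.

(defun nb-train (samples labels)
  ;; SAMPLES: list of equal-length lists of 0/1 features; LABELS: 0/1.
  ;; Count class occurrences and per-class feature activations.
  (let* ((dim (length (first samples)))
         (class-counts (make-array 2 :initial-element 0))
         (feature-counts (make-array (list 2 dim) :initial-element 0)))
    (loop for sample in samples
          for label in labels
          do (incf (aref class-counts label))
             (loop for x in sample
                   for j from 0
                   do (incf (aref feature-counts label j) x)))
    (list class-counts feature-counts dim)))

(defun nb-predict (model sample)
  ;; Score each class by log P(class) + sum_j log P(feature_j | class),
  ;; with add-one smoothing, and return the argmax class.
  (destructuring-bind (class-counts feature-counts dim) model
    (declare (ignore dim))
    (let ((n (+ (aref class-counts 0) (aref class-counts 1))))
      (flet ((score (c)
               (+ (log (/ (aref class-counts c) n))
                  (loop for x in sample
                        for j from 0
                        for p = (/ (+ (aref feature-counts c j) 1)
                                   (+ (aref class-counts c) 2))
                        sum (log (if (= x 1) p (- 1 p)))))))
        (if (> (score 1) (score 0)) 1 0)))))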

4.2.2. Logistic Regression

  • example of a discriminative classifier
  • slap a logistic function on top of a linear regressor (sketch below)
  • serves as a quick baseline (see MVP)
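
A sketch of the "logistic function on top of a regressor" idea; the weights and bias are assumed to have been learned elsewhere (training itself is omitted).

(defun sigmoid (z)
  ;; Squash a real-valued score into (0, 1).
  (/ 1 (+ 1 (exp (- z)))))

(defun logistic-predict (weights bias sample &optional (threshold 0.5))
  ;; Linear score -> probability -> hard label.
  (let ((p (sigmoid (+ bias (reduce #'+ (mapcar #'* weights sample))))))
    (if (>= p threshold) 1 0)))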

4.2.3. Support Vector Machine

  • tries to find a separating hyperplane after mapping data points (via a kernel function) to a higher-dimensional space; an example kernel follows this list
  • unlike logistic regression, can deal with non-linear boundaries
  • can take longer to train
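
For flavor, one common kernel choice, the RBF (Gaussian) kernel, whose implicit mapping is infinite-dimensional; gamma is a hyperparameter and rbf-kernel is an illustrative name.

(defun rbf-kernel (x y &optional (gamma 1.0))
  ;; k(x, y) = exp(-gamma * ||x - y||^2): only this pairwise similarity
  ;; is ever computed, never the high-dimensional mapping itself.
  (exp (- (* gamma
             (reduce #'+ (mapcar (lambda (a b)
                                   (let ((d (- a b))) (* d d)))
                                 x y))))))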

4.2.4. Deep Learning based

  • don't use a hammer when pliers get it done elegantly.
  • usage reduces to formulating the feature set and labels in a format compatible with deep learning algorithms
    • see Deep Learning
    • relevant architectures : CNNs, RNNs and more complex variants
  • Transfer Learning methods are increasingly feasible today: finetuning a generically pretrained large neural network can produce good results fairly quickly.

5. Possible Problems

  • Class Imbalance
  • Feature Engineering
    • overly sparse representations (e.g., for text)
    • un-normalized/un-standardized numerical features (a standardization sketch follows this list)
    • linearly dependent numerical features that could be combined into a single feature, reducing model complexity
  • Hyperparameter Tuning
    • model/algorithm dependent
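
For the un-standardized numerical features point, a z-score standardization sketch for a single feature column; it assumes the column is not constant (non-zero variance).

(defun standardize (values)
  ;; Map each value to (v - mean) / stddev over the column.
  (let* ((n (length values))
         (mean (/ (reduce #'+ values) n))
         (variance (/ (reduce #'+ (mapcar (lambda (v)
                                            (let ((d (- v mean)))
                                              (* d d)))
                                          values))
                      n)))
    (mapcar (lambda (v) (/ (- v mean) (sqrt variance))) values)))
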
Tags::task:ai: