Classification

To classify given data points into one or more known data classes.

1. Overarching Types

  • binary classification
  • multiclass classification
  • multilabel classification (multiple viable labels for a data point)

2. Generic Classification Pipeline

  1. Obtain (datum, label) pairs that can be used for learning.
  2. Split the dataset into train, validation, and (optionally) test parts, and decide on the evaluation metrics to be employed (a minimal split sketch follows this list).
  3. Pre-process the splits accordingly and proceed with an alternating training/validation phase. Degrees of freedom that can be explored:
    • improving feature engineering
    • tuning the model hyperparameters
  4. Test and benchmark the model on the test set using the chosen evaluation metrics.
  5. Deployment: run the model on new data points with unknown categories.
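
A minimal sketch of the split in step 2, assuming the dataset is a list of (datum . label) pairs that has already been shuffled; split-dataset and its ratios are illustrative, not part of any fixed recipe.

(defun split-dataset (dataset &key (train-ratio 0.8) (validation-ratio 0.1))
  ;; Whatever remains after the train and validation parts becomes the
  ;; (optional) test split.
  (let* ((n (length dataset))
         (n-train (floor (* train-ratio n)))
         (n-validation (floor (* validation-ratio n))))
    (list :train (subseq dataset 0 n-train)
          :validation (subseq dataset n-train (+ n-train n-validation))
          :test (subseq dataset (+ n-train n-validation)))))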

3. Classification Evaluation Metrics

3.1. Primitive

  • Accuracy

Percentage of correctly labelled data points

  • weighting accuracy by class importance is a viable modification (a sketch follows the code below)
  • Precision

How many positive predictions were actually positive?

  • Recall

How many positives were predicted out of all the actual positives?

  • F1-score/measure

Harmonic mean of Precision and Recall

Summarizing the above for binary classification:

;; Stand-ins for a data source and a trained model (bodies elided).
(defun data-generator () ...)
(defvar true-labels ...) ;; 1 - positive ; 0 - negative

(defvar pred-labels (model (data-generator)))

(defun count-match (true-label pred-label trues preds)
  ;; Count positions where the actual label is TRUE-LABEL and the
  ;; prediction is PRED-LABEL.
  (reduce #'+ (mapcar (lambda (actual prediction)
                        (if (and (= actual true-label)
                                 (= prediction pred-label))
                            1
                            0))
                      trues
                      preds)))

(defun counter (true-label pred-label)
  ;; Convenience wrapper over the global label lists.
  (count-match true-label pred-label true-labels pred-labels))

;; The four confusion-matrix entries.
(defvar true-positives (counter 1 1))
(defvar true-negatives (counter 0 0))
(defvar false-positives (counter 0 1))
(defvar false-negatives (counter 1 0))

(assert (= (+ true-positives true-negatives
              false-positives false-negatives)
           (length true-labels)))

(defvar accuracy (/ (+ true-positives true-negatives)
                    (length true-labels)))

(defvar precision (/ true-positives
                     (+ true-positives false-positives)))

(defvar recall (/ true-positives
                  (+ true-positives false-negatives)))

(defun harmonic-mean (a b) (/ (* 2 a b) (+ a b)))

(defvar f1-measure (harmonic-mean precision recall))
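
The class-weighted accuracy variant noted above can be sketched in the same style; weighted-accuracy and its weight table are hypothetical, and every class label is assumed to have an entry in the table.

(defun weighted-accuracy (trues preds weights)
  ;; WEIGHTS: hash table mapping class label -> importance weight.
  ;; Each correct prediction contributes its class's weight; the total
  ;; is normalized by the summed weight over all data points.
  (/ (reduce #'+ (mapcar (lambda (actual prediction)
                           (if (= actual prediction)
                               (gethash actual weights)
                               0))
                         trues preds))
     (reduce #'+ (mapcar (lambda (actual) (gethash actual weights))
                         trues))))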

3.2. Not-so-primitive

3.2.1. Area under ROC curve

  1. ROC Curve (Receiver Operating Characteristic)
    • X and Y Axes: The ROC curve is a plot with the False Positive Rate (FPR) on the X-axis and the True Positive Rate (TPR) on the Y-axis.
      • False Positive Rate (FPR): It's the ratio of false positives (incorrectly classified as positive) to the total number of actual negatives.
      • True Positive Rate (TPR): It's the ratio of true positives (correctly classified as positive) to the total number of actual positives.
    • Thresholds: The ROC curve is created by varying the classification threshold of a binary classifier; each threshold value yields one (FPR, TPR) point on the curve.
    • Curve Shape: The curve starts at the bottom-left corner (0,0) and goes towards the top-right corner (1,1). A diagonal line (the "random guess" line) would represent a model that's no better than random guessing.
    • Performance: A model's performance is determined by how far its ROC curve is from the random guess line. The closer it is to the top-left corner, the better the model's ability to distinguish between classes.
  2. AUC (Area Under the Curve)
    • AUC Value: The AUC is a single number that quantifies the overall performance of the model based on the ROC curve. It's the area under the ROC curve.
    • Interpretation: AUC values range from 0 to 1. A model with an AUC of 0.5 represents random guessing (no discrimination), while a model with an AUC of 1.0 represents perfect discrimination.
    • High AUC: A higher AUC indicates that the model is good at distinguishing between the two classes. It suggests that, on average, the model ranks positive examples higher than negative examples.
  3. Summary

    The ROC curve is a graphical representation of a model's ability to classify data, showing the trade-off between false positives and true positives at different thresholds. The AUC summarizes this performance in a single number, with a higher AUC indicating better discrimination ability. It's a common tool for evaluating the performance of binary classification models.
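
A rough sketch of both ideas in the same style as above, assuming the model emits a real-valued positive-class score per data point and that both classes occur in the true labels; roc-point and auc are illustrative names.

(defun roc-point (threshold trues scores)
  ;; One ROC point: classify as positive when the score meets the
  ;; threshold, then return (FPR . TPR). Assumes both classes occur
  ;; in TRUES so the rates are well defined.
  (let ((tp 0) (fp 0) (pos 0) (neg 0))
    (mapc (lambda (actual score)
            (if (= actual 1) (incf pos) (incf neg))
            (when (>= score threshold)
              (if (= actual 1) (incf tp) (incf fp))))
          trues scores)
    (cons (/ fp neg) (/ tp pos))))

(defun auc (trues scores &optional (steps 100))
  ;; Sweep thresholds from 1 down to 0 (so FPR grows monotonically)
  ;; and integrate TPR over FPR with the trapezoidal rule.
  (let ((points (loop for i from steps downto 0
                      collect (roc-point (/ i steps) trues scores))))
    (loop for (p1 p2) on points
          while p2
          sum (* (- (car p2) (car p1))
                 (/ (+ (cdr p1) (cdr p2)) 2)))))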

4. Classifiers

4.1. Types

4.1.1. Generative Classifiers

  • model the probability of observing a data point's feature set given the label (the class-conditional likelihood), combine it with the class prior, and report the argmax over classes

4.1.2. Discriminative Classifiers

  • model the conditional probability of the label given the feature set directly and report the label with the maximum conditional probability

4.2. Examples

4.2.1. Naive Bayes

  • naive (conditional-independence) application of Bayes' theorem.
  • prediction is the class with the highest posterior probability given the current data point.
  • the class priors and per-class feature likelihoods used to evaluate that posterior are estimated from the dataset.
  • example of a generative classifier (sketch below)
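
A minimal Bernoulli flavor of the idea, assuming binary (0/1) feature lists, binary labels, and both classes present in the training data; nb-train and nb-predict are illustrative names, with add-one smoothing on the feature likelihoods.

(defun nb-train (samples labels)
  ;; SAMPLES: list of equal-length lists of 0/1 features; LABELS: 0/1.
  ;; Count class occurrences and per-class feature activations.
  (let* ((dim (length (first samples)))
         (class-counts (make-array 2 :initial-element 0))
         (feature-counts (make-array (list 2 dim) :initial-element 0)))
    (loop for sample in samples
          for label in labels
          do (incf (aref class-counts label))
             (loop for x in sample
                   for j from 0
                   do (incf (aref feature-counts label j) x)))
    (list class-counts feature-counts dim)))

(defun nb-predict (model sample)
  ;; Score each class by log P(class) + sum_j log P(feature_j | class),
  ;; with add-one smoothing, and return the argmax class.
  (destructuring-bind (class-counts feature-counts dim) model
    (declare (ignore dim))
    (let ((n (+ (aref class-counts 0) (aref class-counts 1))))
      (flet ((score (c)
               (+ (log (/ (aref class-counts c) n))
                  (loop for x in sample
                        for j from 0
                        for p = (/ (+ (aref feature-counts c j) 1)
                                   (+ (aref class-counts c) 2))
                        sum (log (if (= x 1) p (- 1 p)))))))
        (if (> (score 1) (score 0)) 1 0)))))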

4.2.2. Logistic Regression

  • example of a discriminative classifier
  • slap a logistic function on top of a linear regressor (sketch below)
  • serves as a quick baseline (see MVP)
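
A sketch of the "logistic function on top of a regressor" idea; the weights and bias are assumed to have been learned elsewhere (training itself is omitted).

(defun sigmoid (z)
  ;; Squash a real-valued score into (0, 1).
  (/ 1 (+ 1 (exp (- z)))))

(defun logistic-predict (weights bias sample &optional (threshold 0.5))
  ;; Linear score -> probability -> hard label.
  (let ((p (sigmoid (+ bias (reduce #'+ (mapcar #'* weights sample))))))
    (if (>= p threshold) 1 0)))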

4.2.3. Support Vector Machine

  • tries to find a separating hyperplane after mapping data points (via a kernel function) to a higher-dimensional space; an example kernel follows this list
  • unlike logistic regression, can deal with non-linear boundaries
  • can take longer to train
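
For flavor, one common kernel choice, the RBF (Gaussian) kernel, whose implicit mapping is infinite-dimensional; gamma is a hyperparameter and rbf-kernel is an illustrative name.

(defun rbf-kernel (x y &optional (gamma 1.0))
  ;; k(x, y) = exp(-gamma * ||x - y||^2): only this pairwise similarity
  ;; is ever computed, never the high-dimensional mapping itself.
  (exp (- (* gamma
             (reduce #'+ (mapcar (lambda (a b)
                                   (let ((d (- a b))) (* d d)))
                                 x y))))))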

4.2.4. Deep Learning based

  • don't use a hammer when pliers get it done elegantly.
  • usage reduces to formulating the feature set and labels in a format compatible with deep learning algorithms
    • see Deep Learning
    • relevant architectures : CNNs, RNNs and more complex variants
  • Transfer Learning methods are increasingly feasible today: finetuning a generically pretrained large neural network can produce good results fairly quickly.

5. Possible Problems

  • Class Imbalance
  • Feature Engineering
    • overly sparse representations (e.g., for text)
    • un-normalized/un-standardized numerical features (a standardization sketch follows this list)
    • linearly dependent numerical features that could be combined into a single feature, reducing model complexity
  • Hyperparameter Tuning
    • model/algorithm dependent
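
For the un-standardized numerical features point, a z-score standardization sketch for a single feature column; it assumes the column is not constant (non-zero variance).

(defun standardize (values)
  ;; Map each value to (v - mean) / stddev over the column.
  (let* ((n (length values))
         (mean (/ (reduce #'+ values) n))
         (variance (/ (reduce #'+ (mapcar (lambda (v)
                                            (let ((d (- v mean)))
                                              (* d d)))
                                          values))
                      n)))
    (mapcar (lambda (v) (/ (- v mean) (sqrt variance))) values)))
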
Tags::task:ai: