Classification
The goal: classify given data points into one or more known classes.
1. Overarching Types
- binary classification
- multiclass classification
- multilabel classification (multiple viable labels for a data point)
2. Generic Classification Pipeline
- Obtain (datum, label) pairs that can be used for learning.
- Split the dataset into train, validation, and (optionally) test parts, and decide on the evaluation metrics that will be employed.
- Pre-process the splits accordingly and proceed with alternating training/validation phases. Degrees of freedom that can be explored:
- improving feature engineering
- tuning the model hyperparameters
- Test and benchmark the model on the test set using the chosen evaluation metric.
- Deployment: run the model on new data points with unknown categories.
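The pipeline above can be sketched end-to-end in Python. Everything here (the toy dataset, the thresholding "model", the choice of accuracy as the metric) is an illustrative assumption, not a prescribed implementation:

```python
import random

# Placeholder dataset: (features, label) pairs -- made up for illustration.
data = [([i, i % 3], int(i > 50)) for i in range(100)]

# Shuffle, then split into train/validation/test parts.
random.seed(0)
random.shuffle(data)
train, val, test = data[:60], data[60:80], data[80:]

# A trivial "model": predicts 1 when the first feature exceeds a threshold.
def make_model(threshold):
    return lambda features: int(features[0] > threshold)

# Evaluation metric: accuracy.
def accuracy(model, split):
    return sum(model(x) == y for x, y in split) / len(split)

# Alternate training/validation -- here, tune the single hyperparameter
# (the threshold) by picking the candidate that scores best on validation.
best = max(range(0, 100, 10), key=lambda t: accuracy(make_model(t), val))

# Benchmark the chosen model once on the held-out test set.
test_accuracy = accuracy(make_model(best), test)
```

The same shape (split, tune on validation, score once on test) carries over regardless of the actual model or metric.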
3. Classification Evaluation Metrics
3.1. Primitive
- Accuracy
Percentage of correctly labelled data points
- weighted accuracy (weighting classes by their importance) is a viable modification
- Precision
How many positive predictions were actually positive?
- Recall
How many positives were predicted out of all the actual positives?
- F1-score/measure
Harmonic mean of Precision and Recall
Summarizing the above for binary classification:
(defun data-generator ...)
(defvar true-labels ...)  ;; 1 - positive ; 0 - negative
(defvar pred-labels (model (data-generator)))

(defun count-match (true-label pred-label trues preds)
  (reduce #'+
          (mapcar (lambda (actual prediction)
                    (cond ((and (= actual true-label) (= prediction pred-label)) 1)
                          (t 0)))
                  trues preds)))

(defun counter (true-label pred-label)
  (count-match true-label pred-label true-labels pred-labels))

(defvar true-positives (counter 1 1))
(defvar true-negatives (counter 0 0))
(defvar false-positives (counter 0 1))
(defvar false-negatives (counter 1 0))

(assert (= (+ true-positives true-negatives false-positives false-negatives)
           (length true-labels)))

(defvar accuracy (/ (+ true-positives true-negatives) (length true-labels)))
(defvar precision (/ true-positives (+ true-positives false-positives)))
(defvar recall (/ true-positives (+ true-positives false-negatives)))

(defun harmonic-mean (a b) (/ (* 2 a b) (+ a b)))
(defvar f1-measure (harmonic-mean precision recall))
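For a concrete numeric check of the same confusion-matrix arithmetic, a Python sketch with made-up label vectors:

```python
# Toy binary labels: 1 = positive, 0 = negative (made-up example data).
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

def count_match(t, p):
    # Count positions where the true label is t and the prediction is p.
    return sum(a == t and b == p for a, b in zip(true_labels, pred_labels))

tp, tn = count_match(1, 1), count_match(0, 0)
fp, fn = count_match(0, 1), count_match(1, 0)
assert tp + tn + fp + fn == len(true_labels)

accuracy = (tp + tn) / len(true_labels)   # 8/10 = 0.8
precision = tp / (tp + fp)                # 3/4 = 0.75
recall = tp / (tp + fn)                   # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)
```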
3.2. Not-so-primitive
3.2.1. Area under ROC curve
- ROC Curve (Receiver Operating Characteristic)
- X and Y Axes: The ROC curve is a plot with the False Positive Rate (FPR) on the X-axis and the True Positive Rate (TPR) on the Y-axis.
- False Positive Rate (FPR): It's the ratio of false positives (incorrectly classified as positive) to the total number of actual negatives.
- True Positive Rate (TPR): It's the ratio of true positives (correctly classified as positive) to the total number of actual positives.
- Thresholds: The ROC curve is created by varying the classification threshold of a binary classifier. At different threshold values, the model classifies data as positive or negative.
- Curve Shape: The curve starts at the bottom-left corner (0,0) and goes towards the top-right corner (1,1). A diagonal line (the "random guess" line) would represent a model that's no better than random guessing.
- Performance: A model's performance is determined by how far its ROC curve is from the random guess line. The closer it is to the top-left corner, the better the model's ability to distinguish between classes.
- AUC (Area Under the Curve)
- AUC Value: The AUC is a single number that quantifies the overall performance of the model based on the ROC curve. It's the area under the ROC curve.
- Interpretation: AUC values range from 0 to 1. A model with an AUC of 0.5 represents random guessing (no discrimination), while a model with an AUC of 1.0 represents perfect discrimination.
- High AUC: A higher AUC indicates that the model is good at distinguishing between the two classes. It suggests that, on average, the model ranks positive examples higher than negative examples.
- Summary
The ROC curve is a graphical representation of a model's ability to classify data, showing the trade-off between false positives and true positives at different thresholds. The AUC summarizes this performance in a single number, with a higher AUC indicating better discrimination ability. It's a common tool for evaluating the performance of binary classification models.
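A sketch of tracing the ROC curve and computing AUC by sweeping a threshold over classifier scores; the scores and labels are invented toy values:

```python
# Toy classifier scores and true binary labels (made up for illustration).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

def roc_point(threshold):
    # Classify as positive when the score meets the threshold.
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    pos, neg = sum(labels), len(labels) - sum(labels)
    return fp / neg, tp / pos  # (FPR, TPR)

# Sweep thresholds from high to low: the curve runs from (0,0) to (1,1).
thresholds = sorted(set(scores), reverse=True) + [0.0]
curve = [(0.0, 0.0)] + [roc_point(t) for t in thresholds]

# AUC via the trapezoidal rule over the curve points.
auc = sum((x2 - x1) * (y1 + y2) / 2
          for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
```

On these toy values the AUC works out to 0.75, i.e. noticeably better than the 0.5 of random guessing but far from perfect discrimination.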
4. Classifiers
4.1. Types
4.1.1. Generative Classifiers
- model the probability of observing a data point's feature set given the label (equivalently, with the class prior, the joint feature-label distribution) and report the argmax
4.1.2. Discriminative Classifiers
- model the conditional probability of the label given the feature set directly and report the label with the maximum probability
4.2. Examples
4.2.1. Naive Bayes
- naive usage of Bayes' theorem: assumes conditional independence between features.
- prediction is the class which has the highest likelihood given the current data point.
- the distribution to evaluate the above likelihood is built from the dataset.
- example of a generative classifier
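A count-based Naive Bayes sketch on a tiny made-up dataset of binary features (Laplace smoothing added to avoid zero likelihoods; all values are illustrative):

```python
# Tiny made-up training set: binary feature vectors and class labels.
X = [(1, 0), (1, 1), (0, 1), (0, 0), (1, 0)]
y = [1, 1, 0, 0, 1]

classes = set(y)
n = len(y)

# Class priors P(c), estimated from label counts.
prior = {c: y.count(c) / n for c in classes}

def likelihood(j, v, c):
    # Smoothed estimate of P(feature j = v | class c) from the data.
    rows = [x for x, label in zip(X, y) if label == c]
    # Laplace smoothing: add 1 to the count, 2 to the denominator.
    return (sum(x[j] == v for x in rows) + 1) / (len(rows) + 2)

def predict(x):
    def score(c):
        p = prior[c]
        for j, v in enumerate(x):
            p *= likelihood(j, v, c)  # "naive" independence assumption
        return p
    return max(classes, key=score)
```

`predict` multiplies the prior by per-feature likelihoods and reports the argmax class, which is exactly the generative recipe described above.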
4.2.2. Logistic Regression
- example of a discriminative classifier
- slaps a logistic function on top of a linear regressor
- serves as a quick baseline (see MVP)
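The "logistic function on top of a regressor" idea in a short sketch; the weights are hand-picked toy values, not fitted:

```python
import math

def sigmoid(z):
    # Squashes the real-valued regression output into (0, 1).
    return 1 / (1 + math.exp(-z))

def logistic_predict(weights, bias, features, threshold=0.5):
    # Linear regression score ...
    z = sum(w * x for w, x in zip(weights, features)) + bias
    # ... squashed into a probability, then thresholded into a class.
    return int(sigmoid(z) >= threshold)
```

In a real model the weights and bias would be learned (e.g. by gradient descent on the log loss); only the prediction path is shown here.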
4.2.3. Support Vector Machine
- tries to find a separating hyperplane after mapping data points (via a kernel function) to a higher-dimensional space
- unlike logistic regression, can deal with non-linear boundaries
- can take longer to train
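One way to see the kernel-mapping point: for a degree-2 polynomial kernel, the kernel value equals an ordinary dot product in an explicitly expanded feature space, without ever computing that expansion. Toy 2-D vectors, purely illustrative:

```python
import math

def poly2_kernel(a, b):
    # K(a, b) = (a . b)^2 -- computed entirely in the original space.
    return sum(x * y for x, y in zip(a, b)) ** 2

def explicit_map(v):
    # The feature map phi with K(a, b) = phi(a) . phi(b) for 2-D inputs.
    x1, x2 = v
    return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

a, b = (1.0, 2.0), (3.0, 0.5)
lhs = poly2_kernel(a, b)
rhs = sum(x * y for x, y in zip(explicit_map(a), explicit_map(b)))
# lhs and rhs agree: the kernel implicitly works in the 3-D mapped space.
```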
4.2.4. Deep Learning based
- don't use a hammer when pliers get it done elegantly.
- usage reduces to formulating the feature set and labels in a format compatible with deep learning algorithms
- see Deep Learning
- relevant architectures : CNNs, RNNs and more complex variants
- Transfer Learning methods are increasingly feasible today: finetuning a generically pretrained large neural network can produce good results fairly quickly.
- see Deep Learning in NLP for specific info
5. Possible Problems
- Class Imbalance
- Feature Engineering
- too sparse representations (in case of text)
- un-normalized/ un-standardized numerical features
- too many linearly dependent numerical features; these could be collapsed into a single combination to reduce model complexity.
- Hyperparameter Tuning
- model/algorithm dependent
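For the un-normalized numerical features issue above, a minimal standardization (z-scoring) sketch; the column values are made up, and in practice the train-split mean/std would be reused on validation/test data:

```python
import statistics

# One made-up numerical feature column with a large scale.
column = [1200.0, 1500.0, 900.0, 1800.0, 1100.0]

mean = statistics.mean(column)
std = statistics.pstdev(column)  # population std of the train split

# z-score: shifts to zero mean and unit variance; apply the SAME
# mean/std to any new data points at validation/test/deployment time.
standardized = [(v - mean) / std for v in column]
```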