Support Vector Machine
Note: this is a form of Statistical Model and a classical example of Supervised Learning
1. Pedagogical Representation Build-up
An SVM algorithm
- envisions each feature vector as a point in a high-dimensional space
- separates (in the context of classification) these points via a hyperplane (called a decision boundary)
- which can be symbolically expressed as follows
(defmacro assert-equation (label expression) ...)  ; tags and equates a symbolic expression to 0

(assert-equation 'svm-hyperplane '(- (dot-product w x) b))
;; w is the weight vector
;; b is the bias for the decision
;; x is the feature vector

(defun fetch-eqn-expr-value (tag inputs) ...)
(defun sign (expr) ...)  ; return sign of the value of the expression

(defun prediction-label (inputs)
  (sign (fetch-eqn-expr-value 'svm-hyperplane inputs)))
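For concreteness, here is a minimal runnable sketch of the same decision rule in plain Common Lisp; the helper names and the particular values of the weight vector and bias are illustrative assumptions, not learned parameters.

;; Minimal sketch of the SVM decision rule with explicit (assumed) parameters.
(defun dot-product (u v)
  "Sum of element-wise products of two equal-length lists."
  (reduce #'+ (mapcar #'* u v)))

(defparameter *w* '(0.4 -1.2 0.7))  ; example weight vector (assumed, not learned)
(defparameter *b* 0.3)              ; example bias (assumed)

(defun predict-label (x)
  "Return +1 or -1 depending on which side of the hyperplane x lies on."
  (if (>= (- (dot-product *w* x) *b*) 0) 1 -1))

;; (predict-label '(1.0 0.5 -0.2)) => -1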
- the goal of the learning algorithm is to find optimal values of w and b (call them w* and b*) that minimize the associated loss function.
- finding optimal values for an expression is studied in the domain of Optimization
- we'd also like the decision boundary to have an associated margin, and we'd like that margin to be as wide as possible.
- a large margin aids in better generalization
- a rudimentary approach to building a margin is to require the decision value to reach +1 or -1 rather than merely checking its sign.
- maximizing the margin (2 divided by the L2 norm of w) is equivalent to minimizing the norm of w subject to the decision constraints asserted over the training data
(defmacro constraint-assert (...) ...)
(defmacro for-all (...) ...)

(defvar svm-constraint
  (for-all i in (range N)  ; N - number of training samples
    (constraint-assert (>= (* (y i) (- (dot-product w (x i)) b)) 1))))
The problem can then be symbolically summarized as
(defmacro optimize (task-tag parameters objective constraint) ...)

(optimize 'minimize '(w b) '(l2-norm w) svm-constraint)
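As a sanity check of the constraint above, the following runnable sketch verifies whether a candidate (w, b) satisfies the hard-margin conditions on a toy, linearly separable dataset; the function and the data are hypothetical illustrations, not part of the symbolic framework.

;; Toy check of the hard-margin constraint: (y i) * ((dot-product w (x i)) - b) >= 1 for all i.
(defun dot-product (u v)
  (reduce #'+ (mapcar #'* u v)))

(defun satisfies-margin-p (w b xs ys)
  "T if every training sample respects the hard-margin constraint."
  (every (lambda (x y) (>= (* y (- (dot-product w x) b)) 1))
         xs ys))

;; Two points per class, separable by the vertical axis:
;; (satisfies-margin-p '(2 0)   0 '((1 0) (2 1) (-1 0) (-2 -1)) '(1 1 -1 -1)) => T
;; (satisfies-margin-p '(0.5 0) 0 '((1 0) (2 1) (-1 0) (-2 -1)) '(1 1 -1 -1)) => NIL (margin too narrow)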
2. Customization
2.1. Noisy Data
- in the case of linearly non-separable data, we employ an additional hinge loss, defined as:
(defun svm-hinge-loss (input label)
  (max 0 (- 1 (* label (- (dot-product w input) b)))))
- this is zero when a sample is correctly classified and lies outside the margin, and grows linearly with how far the sample falls on the wrong side of the margin boundary.
- the overall "soft-margin" SVM objective (the earlier formulation, without the hinge loss, is the "hard-margin" one) is now the following (a runnable sketch appears after this list)
'(+ (* c (square (l2-norm w))) (average (mapcar #'svm-hinge-loss inputs labels)))
- where c is a hyperparameter that balances the squared-norm (margin) term against the average hinge loss
- it therefore regulates the trade-off between minimizing empirical risk (fitting the training data well) and generalizing well (classifying future queries more accurately)
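Putting the pieces together, here is a runnable sketch of the soft-margin objective on a toy dataset; w and b are passed explicitly, and the names and data are assumptions for illustration, since the symbolic form above leaves w and b as free symbols.

;; Soft-margin objective: c * ||w||^2 + average hinge loss over the training set.
(defun dot-product (u v)
  (reduce #'+ (mapcar #'* u v)))

(defun hinge-loss (w b input label)
  (max 0 (- 1 (* label (- (dot-product w input) b)))))

(defun soft-margin-objective (w b inputs labels c)
  (+ (* c (dot-product w w))  ; squared L2 norm of w, weighted by c
     (/ (reduce #'+ (mapcar (lambda (x y) (hinge-loss w b x y)) inputs labels))
        (length inputs))))    ; average hinge loss

;; Toy usage: one sample sits inside the margin, so it contributes a hinge loss of 0.5.
;; (soft-margin-objective '(1 0) 0 '((2 0) (0.5 0) (-1 0) (-2 0)) '(1 1 -1 -1) 0.1)
;; => 0.225 (0.1 for the norm term plus 0.125 average hinge loss)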
2.2. Kernel Trick
- it may not always be possible to draw a suitable separating hyperplane between two classes in the original feature space
- one can employ kernels in such a case
- loosely speaking, kernels correspond to maps into higher-dimensional spaces that one uses to mathematically draw out certain properties of the original distribution
- an identity kernel leads to the simplest form of a linear SVM.
- note that explicitly mapping all points into the higher-dimensional space could quickly become very inefficient.
- this can be avoided by working in the dual formulation obtained via the method of Lagrange multipliers (further explored in the domain of Optimization), in which only kernel evaluations between pairs of points are needed; see the sketch after this list
- different types and characteristics of Kernels are explored in their own node.
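As a sketch of why only kernel evaluations are needed, here is a hypothetical dual-form decision function with a Gaussian (RBF) kernel; the support vectors, multipliers (alphas) and helper names are assumptions for illustration, not outputs of an actual solver.

;; Dual-form decision: f(x) = (sum over support vectors of alpha_i * y_i * K(x_i, x)) - b.
;; Only kernel evaluations appear, never an explicit map into the higher-dimensional space.
(defun dot-product (u v)
  (reduce #'+ (mapcar #'* u v)))

(defun rbf-kernel (u v &optional (gamma 1.0))
  "Gaussian kernel: exp(-gamma * ||u - v||^2)."
  (let ((diff (mapcar #'- u v)))
    (exp (- (* gamma (dot-product diff diff))))))

(defun kernel-prediction (x support-vectors labels alphas b &optional (kernel #'rbf-kernel))
  "Sign of the dual-form decision value at point x."
  (let ((value (- (loop for sv in support-vectors
                        for y in labels
                        for a in alphas
                        sum (* a y (funcall kernel sv x)))
                  b)))
    (if (>= value 0) 1 -1)))

;; (kernel-prediction '(0.2 0.1) '((1 0) (-1 0)) '(1 -1) '(0.8 0.8) 0.0) => 1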