Support Vector Machine
Note: this is a form of Statistical Model and a classical example of Supervised Learning
1. Pedagogical Representation Build-up
An SVM algorithm
- envisions each feature vector as a point in a high-dimensional space
- separates (in the context of classification) these points via a hyperplane (called a decision boundary)
- which can be symbolically expressed as follows
(defmacro assert-equation (label expression) ...)  ; tags and equates a symbolic expression to 0

(assert-equation 'svm-hyperplane '(- (dot-product w x) b))
;; w is the weight vector
;; b is the bias for the decision
;; x is the feature vector

(defun fetch-eqn-expr-value (tag inputs) ...)
(defun sign (expr) ...)  ; return sign of the value of the expression

(defun prediction-label (inputs)
  (sign (fetch-eqn-expr-value 'svm-hyperplane inputs)))
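For concreteness, here is a minimal runnable sketch of the same decision rule in plain Common Lisp; the helper names and the particular values of the weight vector and bias are illustrative assumptions, not learned parameters.

;; Minimal sketch of the SVM decision rule with explicit (assumed) parameters.
(defun dot-product (u v)
  "Sum of element-wise products of two equal-length lists."
  (reduce #'+ (mapcar #'* u v)))

(defparameter *w* '(0.4 -1.2 0.7))  ; example weight vector (assumed, not learned)
(defparameter *b* 0.3)              ; example bias (assumed)

(defun predict-label (x)
  "Return +1 or -1 depending on which side of the hyperplane x lies on."
  (if (>= (- (dot-product *w* x) *b*) 0) 1 -1))

;; (predict-label '(1.0 0.5 -0.2)) => -1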
- the goal of the learning algorithm is to find optimal values of w and b (call them w* and b*) that minimize the associated loss function.
- finding optimal values for an expression is studied in the domain of Optimization
- we'd also like the decision boundary to have an associated margin, and we'd like that margin to be as wide as possible.
- a large margin aids in better generalization
- a rudimentary approach to building a margin is to require the decision value to reach +1 or -1 rather than merely checking its sign.
- maximizing the margin (2 divided by the L2 norm of w) is equivalent to minimizing the norm of w subject to the decision constraints asserted over the training data
(defmacro constraint-assert (...) ...)
(defmacro for-all (...) ...)

(defvar svm-constraint
  (for-all i in (range N)  ; N - number of training samples
    (constraint-assert (>= (* (y i) (- (dot-product w (x i)) b)) 1))))
The problem can then be symbolically summarized as
(defmacro optimize (task-tag parameters objective constraint) ...)

(optimize 'minimize '(w b) '(l2-norm w) svm-constraint)
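As a sanity check of the constraint above, the following runnable sketch verifies whether a candidate (w, b) satisfies the hard-margin conditions on a toy, linearly separable dataset; the function and the data are hypothetical illustrations, not part of the symbolic framework.

;; Toy check of the hard-margin constraint: (y i) * ((dot-product w (x i)) - b) >= 1 for all i.
(defun dot-product (u v)
  (reduce #'+ (mapcar #'* u v)))

(defun satisfies-margin-p (w b xs ys)
  "T if every training sample respects the hard-margin constraint."
  (every (lambda (x y) (>= (* y (- (dot-product w x) b)) 1))
         xs ys))

;; Two points per class, separable by the vertical axis:
;; (satisfies-margin-p '(2 0)   0 '((1 0) (2 1) (-1 0) (-2 -1)) '(1 1 -1 -1)) => T
;; (satisfies-margin-p '(0.5 0) 0 '((1 0) (2 1) (-1 0) (-2 -1)) '(1 1 -1 -1)) => NIL (margin too narrow)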
2. Customization
2.1. Noisy Data
- in the case of linearly non-separable data, we employ an additional hinge loss, defined as:
(defun svm-hinge-loss (input label)
  (max 0 (- 1 (* label (- (dot-product w input) b)))))
- this is zero when a sample is correctly classified and lies outside the margin, and grows linearly with how far the sample falls on the wrong side of the margin boundary.
- the overall "soft-margin" SVM objective (the earlier formulation, without the hinge loss, is the "hard-margin" one) is now the following (a runnable sketch appears after this list)
'(+ (* c (square (l2-norm w))) (average (mapcar #'svm-hinge-loss inputs labels)))
- where c is a hyperparameter that balances the squared-norm (margin) term against the average hinge loss
- it therefore regulates the trade-off between minimizing empirical risk (fitting the training data well) and generalizing well (classifying future queries more accurately)
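Putting the pieces together, here is a runnable sketch of the soft-margin objective on a toy dataset; w and b are passed explicitly, and the names and data are assumptions for illustration, since the symbolic form above leaves w and b as free symbols.

;; Soft-margin objective: c * ||w||^2 + average hinge loss over the training set.
(defun dot-product (u v)
  (reduce #'+ (mapcar #'* u v)))

(defun hinge-loss (w b input label)
  (max 0 (- 1 (* label (- (dot-product w input) b)))))

(defun soft-margin-objective (w b inputs labels c)
  (+ (* c (dot-product w w))  ; squared L2 norm of w, weighted by c
     (/ (reduce #'+ (mapcar (lambda (x y) (hinge-loss w b x y)) inputs labels))
        (length inputs))))    ; average hinge loss

;; Toy usage: one sample sits inside the margin, so it contributes a hinge loss of 0.5.
;; (soft-margin-objective '(1 0) 0 '((2 0) (0.5 0) (-1 0) (-2 0)) '(1 1 -1 -1) 0.1)
;; => 0.225 (0.1 for the norm term plus 0.125 average hinge loss)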
2.2. Kernel Trick
- it may not always be possible to draw a suitable separating hyperplane between two classes in the original feature space
- one can employ kernels in such a case
- loosely speaking, kernels correspond to maps into higher-dimensional spaces that one uses to mathematically draw out certain properties of the original distribution
- an identity kernel leads to the simplest form of a linear SVM.
- note that explicitly mapping all points into the higher-dimensional space could quickly become very inefficient.
- this can be avoided by working in the dual formulation obtained via the method of Lagrange multipliers (further explored in the domain of Optimization), in which only kernel evaluations between pairs of points are needed; see the sketch after this list
- different types and characteristics of Kernels are explored in their own node.
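As a sketch of why only kernel evaluations are needed, here is a hypothetical dual-form decision function with a Gaussian (RBF) kernel; the support vectors, multipliers (alphas) and helper names are assumptions for illustration, not outputs of an actual solver.

;; Dual-form decision: f(x) = (sum over support vectors of alpha_i * y_i * K(x_i, x)) - b.
;; Only kernel evaluations appear, never an explicit map into the higher-dimensional space.
(defun dot-product (u v)
  (reduce #'+ (mapcar #'* u v)))

(defun rbf-kernel (u v &optional (gamma 1.0))
  "Gaussian kernel: exp(-gamma * ||u - v||^2)."
  (let ((diff (mapcar #'- u v)))
    (exp (- (* gamma (dot-product diff diff))))))

(defun kernel-prediction (x support-vectors labels alphas b &optional (kernel #'rbf-kernel))
  "Sign of the dual-form decision value at point x."
  (let ((value (- (loop for sv in support-vectors
                        for y in labels
                        for a in alphas
                        sum (* a y (funcall kernel sv x)))
                  b)))
    (if (>= value 0) 1 -1)))

;; (kernel-prediction '(0.2 0.1) '((1 0) (-1 0)) '(1 -1) '(0.8 0.8) 0.0) => 1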