Support Vector Machine

Note: this is a form of a Statistical Model and a classical example of Supervised Learning

1. Pedagogical Representation Build-up

An SVM algorithm

  • envisions each feature vector as a point in a high-dimensional space
  • separates (in the context of classification) these points via a hyperplane (called a decision boundary)
    • which can be symbolically expressed as follows
(defmacro assert-equation (label expression)
  ...) ; tags and equates a symbolic expression to 0

(assert-equation 'svm-hyperplane
                 '(- (dot-product w x)
                   b)) 
;; w is the weight vector
;; b is the bias for the decision
;; x is the feature vector

(defun fetch-eqn-expr-value (tag inputs)
  ...) ; evaluates the tagged symbolic expression at the given inputs

(defun sign (expr)
  ...) ;return sign of value of expression

(defun prediction-label (inputs)
  (sign (fetch-eqn-expr-value 'svm-hyperplane
                              inputs)))
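
As a minimal runnable sketch of the decision rule above, a concrete counterpart of prediction-label (assuming vectors are plain lists of numbers; the w, b and x values in the usage comment are illustrative, not learned):

(defun dot-product (u v)
  (reduce #'+ (mapcar #'* u v)))

(defun predict (w b x)
  ;; returns +1 or -1 depending on which side of the hyperplane x falls
  (if (minusp (- (dot-product w x) b)) -1 +1))

;; (predict '(2.0 -1.0) 0.5 '(1.0 1.0)) ;=> 1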
  • the goal of the learning algorithm is to find optimal values of w and b, denoted w* and b*, that minimize the associated loss function.
  • finding optimal values for an expression is studied in the domain of Optimization
  • we'd also like the decision boundary to have an associated margin, and to make that margin as wide as possible.
    • a large margin aids in better generalization
  • a rudimentary approach to building a margin is to require the decision value to reach +1 or -1 rather than merely having the correct sign.
  • maximizing the margin ( 2 / L2 norm of w ) is equivalent to minimizing the norm of w subject to the decision constraints imposed by the training data (asserted symbolically below; a runnable constraint check and a margin-width sketch follow)
(defmacro constraint-assert (...)
  ...) ; records a symbolic constraint

(defmacro for-all (...)
  ...) ; quantifies a constraint over an index range

(defvar svm-constraint
  (for-all i in (range N) ; N - num training samples
           (constraint-assert
            (>= (* (y i)
                   (- (dot-product w
                                   (x i))
                      b))
                1))))
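
A runnable check of these constraints on a toy data set might look as follows (the data and the candidate w and b are illustrative assumptions; dot-product is the concrete helper sketched earlier):

(defun margin-constraint-satisfied-p (w b x y)
  ;; true when y * (w . x - b) >= 1 for a single training sample
  (>= (* y (- (dot-product w x) b)) 1))

(defun all-constraints-satisfied-p (w b xs ys)
  (every (lambda (x y) (margin-constraint-satisfied-p w b x y)) xs ys))

;; (all-constraints-satisfied-p '(1.0 1.0) 0.0
;;                              '((2.0 2.0) (-2.0 -2.0))
;;                              '(+1 -1)) ;=> T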

The problem can then be symbolically summarized as

(defmacro optimize (task-tag
                    parameters
                    objective
                    constraint)
  ...)

(optimize 'minimize
          '(w b)
          '(l2-norm w)
          svm-constraint)
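
To make the objective concrete, a small numeric sketch of the margin width mentioned earlier (l2-norm here is a concrete stand-in for the symbol used in the objective above):

(defun l2-norm (v)
  (sqrt (reduce #'+ (mapcar (lambda (c) (* c c)) v))))

(defun margin-width (w)
  ;; the margin is 2 / ||w||, so shrinking ||w|| widens the margin
  (/ 2 (l2-norm w)))

;; (margin-width '(3.0 4.0)) ;=> 0.4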

2. Customization

2.1. Noisy Data

  • in the case of linearly inseparable data, we employ an additional hinge loss, defined as follows:
(defun svm-hinge-loss (input label)
  ;; w and b are the hyperplane parameters introduced above
  (max 0 (- 1 (* label
                 (- (dot-product w
                                 input)
                    b)))))
  • this is zero for a sample that is classified correctly and lies outside the margin, and grows linearly with how far a sample falls on the margin-violating side.
  • the overall "soft-margin" SVM objective (the earlier formulation, without the hinge loss, is the "hard-margin" one) now is
'(+ (* c (square (l2-norm w)))
    (average (mapcar #'svm-hinge-loss
                     inputs
                     labels)))
  • where c is a hyperparameter that weights the margin (norm) term relative to the hinge loss
    • it therefore regulates the trade-off between minimizing empirical risk (fitting the training data well) and generalizing well (classifying future queries better); a runnable sketch of this objective follows this list
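
A runnable sketch of the soft-margin pieces, with the hyperplane parameters passed explicitly rather than as free variables; the data, w, b and c values in the usage comment are illustrative assumptions (dot-product and l2-norm are the helpers sketched earlier):

(defun hinge-loss (w b x y)
  (max 0 (- 1 (* y (- (dot-product w x) b)))))

(defun soft-margin-objective (w b xs ys c)
  ;; c * ||w||^2 + average hinge loss over the training data
  (+ (* c (expt (l2-norm w) 2))
     (/ (reduce #'+ (mapcar (lambda (x y) (hinge-loss w b x y)) xs ys))
        (length xs))))

;; (soft-margin-objective '(1.0 0.0) 0.0
;;                        '((2.0 1.0) (-0.5 3.0)) '(+1 -1)
;;                        0.1) ;=> 0.35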

2.2. Kernel Trick

  • it may not always be possible to draw a suitable hyperplane between two classes
    • one can employ kernels in such a case
    • loosely speaking, a kernel implicitly computes inner products in a higher-dimensional feature space, where the original distribution may become separable, without ever constructing that map explicitly
  • a plain dot-product (linear) kernel recovers the linear SVM described above.
  • note that explicitly mapping all points into the higher-dimensional space could quickly become very inefficient; working through kernel evaluations avoids this (see the sketch below).
  • exploring the different types and characteristics of Kernels is deferred to its own node.
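
As a loose illustration, here is a sketch of a kernelized decision function in dual form; the dual coefficients (alphas), the support vectors and the bias are assumed to come from an already trained model and are not derived in this note (dot-product and l2-norm are the helpers sketched earlier):

(defun linear-kernel (u v)
  ;; plain dot product; recovers the linear SVM above
  (dot-product u v))

(defun rbf-kernel (u v &optional (gamma 1.0))
  ;; exp(-gamma * ||u - v||^2), a commonly used non-linear kernel
  (exp (- (* gamma (expt (l2-norm (mapcar #'- u v)) 2)))))

(defun kernel-predict (kernel alphas ys support-vectors b x)
  ;; sign of sum_i alpha_i * y_i * K(sv_i, x) - b ;
  ;; the data enters only through kernel evaluations, so the
  ;; high-dimensional map is never constructed explicitly
  (if (minusp (- (reduce #'+
                         (mapcar (lambda (a y sv)
                                   (* a y (funcall kernel sv x)))
                                 alphas ys support-vectors))
                 b))
      -1 +1))

;; (kernel-predict #'rbf-kernel '(0.5 0.5) '(+1 -1)
;;                 '((1.0 1.0) (-1.0 -1.0)) 0.0 '(0.9 1.1)) ;=> 1
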
Tags::ml:ai: