Regression

1. Instances

1.1. Linear Regression

  • the prediction is an affine function of the input feature vector (a weighted sum of the features plus a bias).
    • the model is described by the set of parameters:
      • the weights vector
      • the bias scalar
    • the predictor is formulated as:
(defun predict (input-vector)
  ;; fetch-weight-vector and fetch-bias return the learned parameters
  (let ((w (fetch-weight-vector))
        (b (fetch-bias)))
    (+ (dot-product w input-vector)
       b)))
  • The loss function for this model is expressed as:
(defun mse-loss (inputs labels)
  "mean squared error, given lists of inputs and
corresponding labels"
  (/ (reduce #'+
             (mapcar #'(lambda (input label)
                         (expt (- (predict input)
                                  label)
                               2))
                     inputs
                     labels))
     (length inputs)))
  • this can also be termed the cost function.
  • the average of the penalties (the squared error in this case) over the dataset is termed the empirical risk.
  • Linear models rarely overfit: they're among the simplest models one could use - see Occam's Razor
  • a convenient property for a loss function is differentiability: this is why the square of the difference is preferred over the absolute value.
    • this allows one to employ matrix calculus to potentially arrive at closed-form solutions.
    • condensing the above, that means setting the gradient of the loss to zero and solving for the parameters - see the sketch below.
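  • as a concrete sketch of that closed-form route (the normal equations), assuming hypothetical matrix helpers matrix-transpose, matrix-multiply and matrix-inverse, and a design matrix whose rows are the inputs with a trailing 1 appended to absorb the bias:
(defun fit-linear-regression (design-matrix label-vector)
  ;; normal equations: w = (X^T X)^-1 X^T y
  ;; matrix-transpose, matrix-multiply and matrix-inverse are assumed
  ;; helpers for illustration, not shown here
  (let ((xt (matrix-transpose design-matrix)))
    (matrix-multiply (matrix-inverse (matrix-multiply xt design-matrix))
                     (matrix-multiply xt label-vector))))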

1.2. Logistic Regression

  • maps the output of linear regression onto a 0-1 scale via the sigmoid function, which suits it to binary classification:
(defun sigmoid (x)
  ;; the standard logistic function 1 / (1 + e^(-x))
  (/ 1 (+ 1 (exp (- x)))))

(defun linear-regression (x)
  ...) ; see previous section's predictor

(defun logistic-regression (x)
  (sigmoid (linear-regression x)))
  • the loss this time is the negative log-likelihood: the optimization criterion we're using is "maximum likelihood"
  • the likelihood for a positive sample is the result of the predictor
    • "1 - that" would be the likelihood of a negative sample
  • given the labels are binary (1s and 0s), the likelihood of observing the dataset under this model would be:
(defun likelihood (input label)
  (let ((prediction (logistic-regression input)))
    (* (expt prediction
             label)
       (expt (- 1 prediction)
             (- 1 label)))))

(defun overall-likelihood (inputs labels)
  (reduce #'*
          (mapcar #'likelihood
                  inputs
                  labels)
          :initial-value 1))
  • do note that this is simply the mathematical representation of the concept and not how one actually goes about the computation.
    • because we wish to differentiate the likelihood, we map it through a convenient monotonic function that doesn't change the locations of the loss's maxima and minima (the product of exponentiations above is expensive and awkward to differentiate directly).
  • hence the logarithm, which turns the product into a sum.

    (defun log-likelihood (input label)
      (let ((prediction (logistic-regression input)))
        (+ (* label
              (log prediction))
           (* (- 1 label)
              (log (- 1 prediction))))))
    
    (defun log-overall-likelihood (inputs labels)
      (reduce #'+
              (mapcar #'log-likelihood
                      inputs
                      labels)
              :initial-value 0))
    
  • Likelihood is explored further in its own node.
  • matrix calculus alone isn't sufficient here: there's no convenient closed-form solution
  • more general optimization techniques like Gradient Descent are required - a sketch of one step follows below
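  • a minimal sketch of one gradient-descent step on the negative log-likelihood, assuming (for illustration only) that the parameters live in the special variables *weights* and *bias* and that each input is a plain list of numbers:
(defvar *weights* (list 0.0 0.0)) ; illustrative initial parameters
(defvar *bias* 0.0)

(defun predict-probability (input)
  (sigmoid (+ (reduce #'+ (mapcar #'* *weights* input))
              *bias*)))

(defun gradient-descent-step (inputs labels learning-rate)
  ;; the gradient of the negative log-likelihood is, per weight j,
  ;; the average over the dataset of (prediction - label) * feature_j,
  ;; and for the bias the average of (prediction - label)
  (let* ((n (length inputs))
         (errors (mapcar #'(lambda (input label)
                             (- (predict-probability input) label))
                         inputs labels)))
    (setf *weights*
          (loop for j below (length *weights*)
                collect (- (nth j *weights*)
                           (* learning-rate
                              (/ (reduce #'+
                                         (mapcar #'(lambda (input err)
                                                     (* err (nth j input)))
                                                 inputs errors))
                                 n)))))
    (setf *bias*
          (- *bias*
             (* learning-rate (/ (reduce #'+ errors) n))))))
  • repeating this step until the updates become negligible (or for a fixed number of epochs) constitutes the training loop; the learning rate is itself a hyperparameter.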

2. Misc

2.1. Kernel Regression

  • Non-parametric model
  • briefly speaking,
    • the idea is to estimate the value of the function being modelled at a point using a weighted average of the function's values at the surrounding known input-output pairs.
    • the weights are determined by a kernel of our choice with the following primary properties:
      • values closer to the center (the point being estimated) have larger weights
      • values farther away have lower weights
      • prefer a smooth (differentiable) kernel to avoid a jerky regression curve.
  • given that this is a non-parametric model, the weights are based on the dataset we already have:
(defun kernel (input)
  ...) ;; a suitable kernel function

(defun distance (input index)
  ;; b is a hyperparameter (the bandwidth)
  (/ (- (fetch-feature index) input)
     b))

(defun generate-weight (input index)
  (* (length dataset)
     (/ (kernel (distance input index))
        (reduce #'+
                (mapcar #'(lambda (j) ; j runs over the whole dataset
                            (kernel (distance input j)))
                        (range (length dataset)))
                :initial-value 0))))

(defun func (input) ;; the modelled regression function
  (/ (reduce #'+
             (mapcar #'(lambda (index)
                         (* (fetch-label index)
                            (generate-weight input index)))
                     (range (length dataset)))
             :initial-value 0)
     (length dataset)))

the kernel should satisfy some basic requirements, as above. The most frequently used is the Gaussian kernel:

(defun kernel (input)
  ;; the standard Gaussian (normal) density with zero mean and unit variance
  (* (/ 1 (sqrt (* 2 pi)))
     (exp (/ (- (expt input 2)) 2))))

b is a hyperparameter that dictates the fit of the regression curve and can be chosen using a validation set. A higher value of b widens the neighbourhood that influences the point being estimated: the curve becomes smoother and one can expect a regularized, good fit. Lowering b emphasizes the local points more, producing a wavier curve that varies drastically with the points immediately surrounding the centre - this may lead to overfitting.
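
A minimal sketch of that selection, assuming a hypothetical func-with-b (the func above with b passed in explicitly rather than taken from the environment) and held-out validation-inputs / validation-labels lists:

(defun validation-mse (b validation-inputs validation-labels)
  ;; mean squared error of the kernel regression on the held-out set
  (/ (reduce #'+
             (mapcar #'(lambda (input label)
                         (let ((err (- (func-with-b input b) label)))
                           (* err err)))
                     validation-inputs
                     validation-labels)
             :initial-value 0)
     (length validation-inputs)))

(defun choose-b (candidate-bs validation-inputs validation-labels)
  ;; pick the candidate value of b with the lowest validation error
  (first (sort (copy-list candidate-bs) #'<
               :key #'(lambda (b)
                        (validation-mse b validation-inputs validation-labels)))))

For example, (choose-b '(0.1 0.5 1.0 2.0) validation-inputs validation-labels) returns the candidate bandwidth with the lowest validation error.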

Tags::ml:ai: