Regression
1. Instances
1.1. Linear Regression
- the prediction is an affine function of the input feature vector: a weighted sum of the features plus a bias.
- the model is described by the set of parameters:
- the weights vector
- the bias scalar
- the predictor is formulated as:
(defun predict (input-vector)
  "Affine predictor: dot product of the weight vector with INPUT-VECTOR, plus the bias."
  (let ((w (fetch-weight-vector)) ; learned weight vector
        (b (fetch-bias)))         ; learned bias scalar
    (+ (dot-product w input-vector) b)))
- The loss function for this model is expressed as:
(defun mse-loss (inputs labels)
  "Mean squared error, given lists of inputs and corresponding labels."
  (/ (reduce #'+ (mapcar #'(lambda (input label)
                             (expt (- (predict input) label) 2)) ; squared error per sample
                         inputs labels))
     (length inputs)))
- this can also be termed the cost function.
- the average of the penalties (the squared error in this case) over the dataset is termed the empirical risk.
- Linear models rarely overfit: they're among the simplest models one could use (see Occam's Razor).
- a point of convenience about a loss function is that it should be differentiable: the square of the difference is preferable to the absolute value for this reason.
- this allows one to employ matrix calculus to potentially arrive at closed-form solutions.
- condensing the above: equate the gradient of the loss with respect to the parameters to zero and solve (see the normal equation below).
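As a concrete instance of that closed form (assuming the bias is folded into the weight vector by appending a column of ones to the design matrix X, with y the vector of labels), setting the gradient of the MSE to zero yields the familiar normal equation:

\nabla_{\mathbf{w}} \, \mathrm{MSE}(\mathbf{w}) = 0 \;\Longrightarrow\; \mathbf{w}^{*} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}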
1.2. Logistic Regression
- maps the output of linear regression onto a 0-1 scale via the sigmoid function, suiting it to the task of binary classification
(defun sigmoid (x)
  "Squash X into the (0, 1) interval."
  (/ 1 (+ 1 (exp (- x)))))

(defun linear-regression (x) ...) ; see previous section's predictor

(defun logistic-regression (x)
  (sigmoid (linear-regression x)))
- the loss this time is the negative log-likelihood: the optimization criterion we're using is "maximum likelihood"
- the likelihood for a positive sample is the result of the predictor
- "1 - that" would be the likelihood of a negative sample
- given that the labels are binary (1s and 0s), the likelihood of observing the dataset under this model would be:
(defun likelihood (input label)
  "Bernoulli likelihood of a single (input, label) pair under the model."
  (let ((prediction (logistic-regression input)))
    (* (expt prediction label)
       (expt (- 1 prediction) (- 1 label)))))

(defun overall-likelihood (inputs labels)
  "Product of the per-sample likelihoods over the dataset."
  (reduce #'* (mapcar #'likelihood inputs labels) :initial-value 1))
- do note that this is simply the mathematical representation of the concept and not how one actually goes about the computation.
- because we wish to differentiate the likelihood, we map it through a convenient monotonic function that doesn't change the locations of its maxima and minima; this turns the product over samples into a sum, which is cheaper to compute and easier to differentiate.
hence the logarithm.
(defun log-likelihood (input label)
  "Log of the Bernoulli likelihood of a single (input, label) pair."
  (let ((prediction (logistic-regression input)))
    (+ (* label (log prediction))
       (* (- 1 label) (log (- 1 prediction))))))

(defun log-overall-likelihood (inputs labels)
  "Sum of the per-sample log-likelihoods; negate this to obtain the loss to minimize."
  (reduce #'+ (mapcar #'log-likelihood inputs labels) :initial-value 0))
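For reference, the quantity actually minimized is the averaged negative log-likelihood (the binary cross-entropy), with \hat{y}_i the logistic-regression prediction for the i-th input:

\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]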
- Likelihood is explored further in its own node.
- matrix calculus alone isn't sufficient here and there's no convenient closed-form solution
- more generic optimization techniques like Gradient Descent are required
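As a rough sketch (not part of the original notes), one batch gradient-descent update on the negative log-likelihood could look like the following; the list-of-lists representation of inputs, the helper names, and the fixed learning rate are assumptions made for illustration:

(defun dot-product (a b)
  (reduce #'+ (mapcar #'* a b)))

(defun gradient-step (weights bias inputs labels learning-rate)
  "One batch gradient-descent update for logistic regression.
INPUTS is a list of feature lists, LABELS a list of 0/1 values."
  (let* ((n (length inputs))
         ;; per-sample error: prediction - label
         (errors (mapcar #'(lambda (x y)
                             (- (sigmoid (+ (dot-product weights x) bias)) y))
                         inputs labels))
         ;; gradient w.r.t. each weight: mean of (error * corresponding feature)
         (grad-w (loop for j below (length weights)
                       collect (/ (reduce #'+ (mapcar #'(lambda (x e) (* (nth j x) e))
                                                      inputs errors))
                                  n)))
         ;; gradient w.r.t. the bias: mean error
         (grad-b (/ (reduce #'+ errors) n)))
    ;; step against the gradient; return the updated parameters
    (values (mapcar #'(lambda (w g) (- w (* learning-rate g))) weights grad-w)
            (- bias (* learning-rate grad-b)))))

Iterating this update until the log-likelihood stops improving yields the maximum-likelihood estimates.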
2. Misc
2.1. Kernel Regression
- Non-parametric model
- briefly speaking,
- the idea is to estimate the value of the function being modelled at a point using a weighted average of the function's values at the surrounding known input-output pairs.
- the weights are decided by a kernel of our choice, whose primary behaviour is that:
- values closer to the center (the point being estimated) have larger weights
- values farther away have lower weights
- preferably engineer a differentiable kernel to avoid a jerky regression curve.
- given that this is a non-parametric model, the weights are based on the dataset we already have:
(defun kernel (input) ...) ; a suitable kernel function, see below

(defun distance (input index)
  "Scaled distance between the query INPUT and the INDEX-th stored feature; b is the bandwidth hyperparameter."
  (/ (- (fetch-feature index) input) b))

(defun generate-weight (input index)
  "Kernel weight assigned to the INDEX-th stored sample for the query INPUT."
  (* (length dataset)
     (/ (kernel (distance input index))
        (reduce #'+ (mapcar #'(lambda (index) ; shadows the outer index
                                (kernel (distance input index)))
                            (range (length dataset)))
                :initial-value 0))))

(defun func (input)
  "The modelled function: a kernel-weighted average of the stored labels."
  (/ (reduce #'+ (mapcar #'(lambda (index)
                             (* (fetch-label index) (generate-weight input index)))
                         (range (length dataset)))
             :initial-value 0)
     (length dataset)))
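Written out (with N the dataset size, (x_i, y_i) the stored pairs and K the kernel), the N factors cancel and this is just the standard Nadaraya-Watson estimator:

\hat{f}(x) = \frac{\sum_{i=1}^{N} K\!\left(\frac{x_i - x}{b}\right) y_i}{\sum_{i=1}^{N} K\!\left(\frac{x_i - x}{b}\right)}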
the kernel should satisfy some basic requirements, as above. The most frequently used is the Gaussian kernel:
(defun kernel (input)
  "Standard Gaussian (normal) kernel."
  (* (/ 1 (sqrt (* 2 pi)))
     (exp (/ (- (expt input 2)) 2))))
b is a hyperparameter that dictates the fit of the regression curve and can be chosen using a validation set. A higher value of b lets points farther from the query influence the estimate: the curve will be smoother and one can expect a regularized, reasonable fit. Lowering b emphasizes nearby points more, producing a wavier curve that varies drastically with the points surrounding the query; this may lead to overfitting.
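As an illustrative sketch (not from the original notes) of choosing b on a validation set: assume a helper func-with-bandwidth, which is just func above with b passed explicitly, and a hand-picked candidate grid; both are assumptions made for the example.

(defun validation-mse (b val-inputs val-labels)
  "Mean squared error of the kernel regressor on the validation set for bandwidth B."
  (/ (reduce #'+ (mapcar #'(lambda (x y)
                             (expt (- (func-with-bandwidth x b) y) 2))
                         val-inputs val-labels))
     (length val-inputs)))

(defun select-bandwidth (candidates val-inputs val-labels)
  "Return the candidate bandwidth with the lowest validation MSE."
  (first (sort (copy-list candidates) #'<
               :key #'(lambda (b) (validation-mse b val-inputs val-labels)))))

;; e.g. (select-bandwidth '(0.05 0.1 0.2 0.5 1.0) val-inputs val-labels)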