Regression
1. Instances
1.1. Linear Regression
- the prediction is an affine function of the input feature vector: a weighted sum of the features plus a bias.
- the model is described by the set of parameters:
- the weights vector
- the bias scalar
- the predictor is formulated as:
(defun predict (input-vector)
  "Affine predictor: dot product of the weight vector with INPUT-VECTOR, plus the bias."
  (let ((w (fetch-weight-vector)) ; learned weight vector
        (b (fetch-bias)))         ; learned bias scalar
    (+ (dot-product w input-vector) b)))
- The loss function for this model is expressed as:
(defun mse-loss (inputs labels)
  "Mean squared error, given lists of inputs and corresponding labels."
  (/ (reduce #'+ (mapcar #'(lambda (input label)
                             (expt (- (predict input) label) 2)) ; squared error per sample
                         inputs labels))
     (length inputs)))
- this can also be termed the cost function.
- the average of the penalties (the squared error in this case) over the dataset is termed the empirical risk.
- Linear models rarely overfit: they're among the simplest models one could use (see Occam's Razor).
- a point of convenience about a loss function is that it should be differentiable: the square of the difference is preferable to the absolute value for this reason.
- this allows one to employ matrix calculus to potentially arrive at closed-form solutions.
- condensing the above: equate the gradient of the loss with respect to the parameters to zero and solve (see the normal equation below).
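As a concrete instance of that closed form (assuming the bias is folded into the weight vector by appending a column of ones to the design matrix X, with y the vector of labels), setting the gradient of the MSE to zero yields the familiar normal equation:

\nabla_{\mathbf{w}} \, \mathrm{MSE}(\mathbf{w}) = 0 \;\Longrightarrow\; \mathbf{w}^{*} = (X^{\top} X)^{-1} X^{\top} \mathbf{y}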
1.2. Logistic Regression
- maps the output of linear regression onto a 0-1 scale via the sigmoid function, suiting it to the task of binary classification
(defun sigmoid (x)
  "Squash X into the (0, 1) interval."
  (/ 1 (+ 1 (exp (- x)))))

(defun linear-regression (x) ...) ; see previous section's predictor

(defun logistic-regression (x)
  (sigmoid (linear-regression x)))
- the loss this time is the negative log-likelihood: the optimization criterion we're using is "maximum likelihood"
- the likelihood for a positive sample is the result of the predictor
- "1 - that" would be the likelihood of a negative sample
- given that the labels are binary (1s and 0s), the likelihood of observing the dataset under this model would be:
(defun likelihood (input label)
  "Bernoulli likelihood of a single (input, label) pair under the model."
  (let ((prediction (logistic-regression input)))
    (* (expt prediction label)
       (expt (- 1 prediction) (- 1 label)))))

(defun overall-likelihood (inputs labels)
  "Product of the per-sample likelihoods over the dataset."
  (reduce #'* (mapcar #'likelihood inputs labels) :initial-value 1))
- do note that this is simply the mathematical representation of the concept and not how one actually goes about the computation.
- because we wish to differentiate the likelihood, we map it through a convenient monotonic function that doesn't change the locations of its maxima and minima; this turns the product over samples into a sum, which is cheaper to compute and easier to differentiate.
hence the logarithm.
(defun log-likelihood (input label)
  "Log of the Bernoulli likelihood of a single (input, label) pair."
  (let ((prediction (logistic-regression input)))
    (+ (* label (log prediction))
       (* (- 1 label) (log (- 1 prediction))))))

(defun log-overall-likelihood (inputs labels)
  "Sum of the per-sample log-likelihoods; negate this to obtain the loss to minimize."
  (reduce #'+ (mapcar #'log-likelihood inputs labels) :initial-value 0))
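For reference, the quantity actually minimized is the averaged negative log-likelihood (the binary cross-entropy), with \hat{y}_i the logistic-regression prediction for the i-th input:

\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]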
- Likelihood is explored further in its own node.
- matrix calculus alone isn't sufficient here and there's no convenient closed-form solution
- more generic optimization techniques like Gradient Descent are required
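As a rough sketch (not part of the original notes), one batch gradient-descent update on the negative log-likelihood could look like the following; the list-of-lists representation of inputs, the helper names, and the fixed learning rate are assumptions made for illustration:

(defun dot-product (a b)
  (reduce #'+ (mapcar #'* a b)))

(defun gradient-step (weights bias inputs labels learning-rate)
  "One batch gradient-descent update for logistic regression.
INPUTS is a list of feature lists, LABELS a list of 0/1 values."
  (let* ((n (length inputs))
         ;; per-sample error: prediction - label
         (errors (mapcar #'(lambda (x y)
                             (- (sigmoid (+ (dot-product weights x) bias)) y))
                         inputs labels))
         ;; gradient w.r.t. each weight: mean of (error * corresponding feature)
         (grad-w (loop for j below (length weights)
                       collect (/ (reduce #'+ (mapcar #'(lambda (x e) (* (nth j x) e))
                                                      inputs errors))
                                  n)))
         ;; gradient w.r.t. the bias: mean error
         (grad-b (/ (reduce #'+ errors) n)))
    ;; step against the gradient; return the updated parameters
    (values (mapcar #'(lambda (w g) (- w (* learning-rate g))) weights grad-w)
            (- bias (* learning-rate grad-b)))))

Iterating this update until the log-likelihood stops improving yields the maximum-likelihood estimates.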
2. Misc
2.1. Kernel Regression
- Non-parametric model
- briefly speaking,
- the idea is to estimate the value of the function being modelled at a point using a weighted average of the function's values at the surrounding known input-output pairs.
- the weights are decided by a kernel of our choice, whose primary behaviour is that:
- values closer to the center (the point being estimated) have larger weights
- values farther away have lower weights
- preferably engineer a differentiable kernel to avoid a jerky regression curve.
- given that this is a non-parametric model, the weights are based on the dataset we already have:
(defun kernel (input) ...) ; a suitable kernel function, see below

(defun distance (input index)
  "Scaled distance between the query INPUT and the INDEX-th stored feature; b is the bandwidth hyperparameter."
  (/ (- (fetch-feature index) input) b))

(defun generate-weight (input index)
  "Kernel weight assigned to the INDEX-th stored sample for the query INPUT."
  (* (length dataset)
     (/ (kernel (distance input index))
        (reduce #'+ (mapcar #'(lambda (index) ; shadows the outer index
                                (kernel (distance input index)))
                            (range (length dataset)))
                :initial-value 0))))

(defun func (input)
  "The modelled function: a kernel-weighted average of the stored labels."
  (/ (reduce #'+ (mapcar #'(lambda (index)
                             (* (fetch-label index) (generate-weight input index)))
                         (range (length dataset)))
             :initial-value 0)
     (length dataset)))
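Written out (with N the dataset size, (x_i, y_i) the stored pairs and K the kernel), the N factors cancel and this is just the standard Nadaraya-Watson estimator:

\hat{f}(x) = \frac{\sum_{i=1}^{N} K\!\left(\frac{x_i - x}{b}\right) y_i}{\sum_{i=1}^{N} K\!\left(\frac{x_i - x}{b}\right)}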
the kernel should satisfy some basic requirements, as above. The most frequently used is the Gaussian kernel:
(defun kernel (input)
  "Standard Gaussian (normal) kernel."
  (* (/ 1 (sqrt (* 2 pi)))
     (exp (/ (- (expt input 2)) 2))))
b is a hyperparameter that dictates the fit of the regression curve and can be chosen using a validation set. A higher value of b lets points farther from the query influence the estimate: the curve will be smoother and one can expect a regularized, reasonable fit. Lowering b emphasizes nearby points more, producing a wavier curve that varies drastically with the points surrounding the query; this may lead to overfitting.
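As an illustrative sketch (not from the original notes) of choosing b on a validation set: assume a helper func-with-bandwidth, which is just func above with b passed explicitly, and a hand-picked candidate grid; both are assumptions made for the example.

(defun validation-mse (b val-inputs val-labels)
  "Mean squared error of the kernel regressor on the validation set for bandwidth B."
  (/ (reduce #'+ (mapcar #'(lambda (x y)
                             (expt (- (func-with-bandwidth x b) y) 2))
                         val-inputs val-labels))
     (length val-inputs)))

(defun select-bandwidth (candidates val-inputs val-labels)
  "Return the candidate bandwidth with the lowest validation MSE."
  (first (sort (copy-list candidates) #'<
               :key #'(lambda (b) (validation-mse b val-inputs val-labels)))))

;; e.g. (select-bandwidth '(0.05 0.1 0.2 0.5 1.0) val-inputs val-labels)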