Density Estimation
- the problem of modeling the probability density function of the unknown probability distribution from which the dataset has been drawn.
- proceeding with a non-parametric model, since assumptions about the form of the underlying distribution might not hold.
- see Kernel Regression for an example of a non-parametric model
Building up a pedagogical one-dimensional case:
for inputs x_1, …, x_N sampled from an unknown pdf f, and denoting the kernel model by f-hat, the estimate at a point x is f-hat(x) = (1 / (N * b)) * sum_i kernel((x - x_i) / b):
(defun kernel (input)
  ...) ;; suitable kernel function

(defun distance (input index)
  ;; scaled distance from the query point to the index-th sample
  (/ (- input (fetch-feature index)) b)) ; b is a hyperparameter

(defun f-hat (input)
  ;; the kernel model of f
  (/ (reduce #'+ (mapcar #'(lambda (index)
                             (kernel (distance input index)))
                         (range (length dataset)))
             :initial-value 0)
     (* (length dataset) b)))
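The snippet above leans on a few helpers it doesn't define (dataset, b, fetch-feature, range, plus square used further down). One plausible, purely illustrative set of definitions, keeping the snippet's earmuff-free names:

(defparameter dataset '(0.2 0.5 0.9 1.3 1.7)) ; toy sample, purely illustrative
(defparameter b 0.5) ; bandwidth; declared special so it can be rebound later

(defun fetch-feature (index)
  ;; the index-th sample from the dataset
  (elt dataset index))

(defun range (n)
  ;; the list (0 1 ... n-1)
  (loop for i below n collect i))

(defun square (x)
  (* x x))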
- b (the bandwidth) is a hyperparameter that controls the trade-off between bias and variance
- the kernel can be any suitable function of choice, but proceeding with the Gaussian kernel here:
(defun gaussian-kernel (input)
  "zero mean, unit variance Gaussian kernel"
  (/ (exp (- (/ (square input) 2)))
     (sqrt (* 2 pi))))
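To wire things together, the placeholder kernel can simply delegate to gaussian-kernel; with the hypothetical toy dataset and b = 0.5 sketched earlier, a query looks like this:

(defun kernel (input)
  ;; fill in the placeholder kernel with the Gaussian one
  (gaussian-kernel input))

;; with the toy dataset and b = 0.5 from the sketch above:
;; (f-hat 1.0) => roughly 0.49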
- we look for a value of b that minimizes the difference between the real latent f and f-hat.
- A reasonable choice for this is the mean integrated squared error (MISE), because we are dealing with a continuous and not a discrete domain this time: MISE(b) = E[ integral over x of (f-hat(x) - f(x))^2 dx ]. Formulating:
(defun expectation (rand-var)
  ...) ; returns the expectation (pdf-weighted mean) of a random variable

(defun integrate (limits    ;; range or set
                  variable  ;; integrated w.r.t. this
                  integrand);; expression to integrate
  ;; this will be a computational approximation if possible,
  ;; i.e. a large summation ultimately; otherwise another
  ;; variable to be passed on for further usage
  (...))

(defun MISE (b) ;; mean integrated squared error
  (expectation
   (integrate (set-of-reals)        ;; integrate over all reals
              (x)                   ;; w.r.t. x
              (square (- (f-hat x)  ;; kernel model
                         (f x)))))) ;; real latent pdf
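integrate is left abstract above; the "large summation" it alludes to could be realized with a plain midpoint Riemann sum over a finite window. A rough sketch (names hypothetical):

(defun integrate-numerically (lo hi fn &optional (steps 10000))
  ;; midpoint Riemann sum approximating the integral of fn over [lo, hi]
  (let ((dx (/ (- hi lo) steps)))
    (loop for i below steps
          sum (* (funcall fn (+ lo (* (+ i 0.5) dx))) dx))))

;; e.g. the MISE integrand over a window wide enough to cover the data
;; (only computable when the true f is known, as in a simulation):
;; (integrate-numerically -5.0 5.0
;;                        #'(lambda (x) (square (- (f-hat x) (f x)))))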
Do note that this could also be done for a probability mass function with some tweaks to the loss function: since the function being modelled is discrete this time, the integral reduces to a sum, and we'd use the usual mean squared error instead; a sketch follows.
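A sketch of that discrete analogue, assuming the pmf's support is available as a list (support is hypothetical; everything else as before):

(defun mse (b support)
  ;; discrete analogue of MISE: the integral collapses into a sum
  ;; over the pmf's support (here a hypothetical list of values);
  ;; b is special, so this binding is visible inside f-hat
  (expectation
   (reduce #'+
           (mapcar #'(lambda (x)
                       (square (- (f-hat x) (f x))))
                   support))))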
If b is too small, we weigh the closer samples a little too heavily, leading to an overfit, wiggly estimator. If b is too large, all samples weigh in a little too much for a local approximation, leading to a flatter, underfit estimate.
To find the appropriate b, tune using the validation set.
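One common recipe, sketched under the assumptions above (b declared with defparameter, hence dynamically rebindable): grid-search candidate bandwidths and keep the one whose estimator assigns the highest average log-density to held-out points. All names here are hypothetical:

(defun validation-log-likelihood (trial-b validation-set)
  ;; average log-density f-hat assigns to held-out points; higher is better
  (let ((b trial-b)) ; b is special (defparameter), so f-hat sees this binding
    (/ (reduce #'+ (mapcar #'(lambda (x) (log (f-hat x)))
                           validation-set))
       (length validation-set))))

(defun tune-b (candidates validation-set)
  ;; grid search: keep the bandwidth whose estimator best explains held-out data
  (first (sort (copy-list candidates) #'>
               :key #'(lambda (trial-b)
                        (validation-log-likelihood trial-b validation-set)))))

;; e.g. (tune-b '(0.1 0.25 0.5 1.0 2.0) '(0.4 0.8 1.5))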
read up more on MISE(b) here: https://en.wikipedia.org/wiki/Mean_integrated_squared_error
(haven't covered further simplification here: expanding the square, substituting unbiased estimators, and …)