Feature Engineering
- preparing the dataset for use by the learning algorithm
- the goal is to convert the raw data into features with high predictive power and to make them usable in the first place
Some common feature engineering processes are:
1. One Hot Encoding
- converting a categorical variable into separate boolean features, one per category, as sketched below
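A minimal sketch in Common Lisp; one-hot-encode and its argument names are illustrative, not from any library:
(defun one-hot-encode (value categories)
  ;; Return a boolean vector with a 1 at VALUE's position in
  ;; CATEGORIES and a 0 everywhere else.
  (mapcar #'(lambda (category)
              (if (equal value category) 1 0))
          categories))
For example, (one-hot-encode 'red '(red green blue)) evaluates to (1 0 0).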
2. Entity Embeddings
- one-hot encoding might not be the best option when meaningful relationships do exist between the categories being considered; learned embeddings can represent such similarities in a dense vector space
- see https://arxiv.org/pdf/1604.06737.pdf
3. Binning (Bucketing)
- converting a continuous feature into multiple mutually exclusive boolean buckets based on value ranges
- 0 to 10, 10 to 20, and so on, for instance (see the sketch below)
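A minimal sketch, assuming half-open buckets defined by a list of upper cutoffs; bin-index is an illustrative name:
(defun bin-index (value cutoffs)
  ;; Return the index of the first bucket whose upper CUTOFF exceeds
  ;; VALUE; values beyond the last cutoff land in an overflow bucket.
  (or (position-if #'(lambda (cutoff) (< value cutoff)) cutoffs)
      (length cutoffs)))
For example, (bin-index 15 '(10 20 30)) evaluates to 1, i.e. the 10 to 20 bucket; the resulting index can then be one-hot encoded as above to obtain the exclusive booleans.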
4. Normalization
- converting varying numerical ranges into one standard range (-1 to 1 or 0 to 1)
- aids learning algorithms computationally (helps avoid precision and overflow problems)
(defun normalize (numerical-data-vector)
  ;; Rescale every feature to the 0 to 1 range via min-max scaling.
  (let* ((lo (reduce #'min numerical-data-vector))
         (hi (reduce #'max numerical-data-vector))
         (span (- hi lo)))
    (mapcar #'(lambda (feature) (/ (- feature lo) span))
            numerical-data-vector)))
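For example, (normalize '(10 20 30)) evaluates to (0 1/2 1); note that Common Lisp keeps exact rationals here, so coerce to floats if the downstream code expects them.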
5. Standardization
- aka z-score normalization
- rescaling features so that they have the properties of a standard normal distribution (zero mean, unit variance)
(defun standardize (numerical-data-vector)
  ;; Rescale every feature to zero mean and unit variance, using the
  ;; population variance (division by n).
  (let* ((n (length numerical-data-vector))
         (mu (/ (reduce #'+ numerical-data-vector) n))
         (sigma (sqrt (/ (reduce #'+ (mapcar #'(lambda (feature)
                                                 (expt (- feature mu) 2))
                                             numerical-data-vector))
                         n))))
    (mapcar #'(lambda (feature) (/ (- feature mu) sigma))
            numerical-data-vector)))
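For example, (standardize '(10.0 20.0 30.0)) evaluates to approximately (-1.2247 0.0 1.2247); dividing by n - 1 instead (the sample variance) is an equally common convention.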
6. Dealing with Missing Features
Possible approaches:
- removing examples with missing features
- using a learning algorithm that can deal with missing data
- data imputation techniques
7. Data Imputation Techniques
- replace the missing value by the mean, median, or a similar summary statistic (see the sketch at the end of this section)
- use something outside the normal range to indicate imputation (-1 when the feature normally spans 2 to 5, for instance)
- use something chosen according to the range rather than a statistic (0 for a feature spanning -1 to 1, for instance)
A more advanced approach is to model the imputation as a regression problem before proceeding with the actual task: all the other features are used to predict the missing one.
With a large dataset, one can instead introduce an extra indicator feature to signify missing data and then fill in a value of choice.
- test more than one technique and proceed with whichever suits the task best
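A minimal sketch of mean replacement combined with the indicator-feature idea, assuming missing values are marked as nil; impute-with-mean is an illustrative name:
(defun impute-with-mean (numerical-data-vector)
  ;; Replace each NIL with the mean of the observed values and return
  ;; a second vector marking imputed entries (1) versus observed (0).
  (let* ((present (remove nil numerical-data-vector))
         (mu (/ (reduce #'+ present) (length present))))
    (values
     (mapcar #'(lambda (feature) (or feature mu)) numerical-data-vector)
     (mapcar #'(lambda (feature) (if feature 0 1)) numerical-data-vector))))
For example, (impute-with-mean '(2 nil 4)) returns (2 3 4) and the indicator (0 1 0).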