Algorithm Selection

For a quick overview, have a look at the scikit-learn estimator cheat sheet. You can also use a small validation set to test candidate algorithms quickly before fitting on the full training set; a sketch of this follows.
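
A minimal sketch of this quick screening with scikit-learn; the dataset and the candidate models are placeholders, not recommendations:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)
    # Hold out a small validation set for a cheap first comparison.
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=0
    )

    candidates = {
        "logistic regression": LogisticRegression(max_iter=5000),
        "random forest": RandomForestClassifier(random_state=0),
        "svm": SVC(),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        print(name, model.score(X_val, y_val))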

1. Factors to consider

1.1. Interpretability

  • does the model need to be intellectually accessible to a non-technical audience (e.g. explaining predictions in medical imaging)?
  • if only accuracy matters, one may not mind black boxes; otherwise, prefer models whose parameters can be explained (sketch below).
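
A minimal sketch of an interpretable choice: a linear model whose coefficients map one-to-one onto input features (the dataset is a placeholder; features are standardized so coefficients are comparable):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    data = load_breast_cancer()
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(data.data, data.target)

    # Each coefficient is directly attributable to one feature, which
    # makes the model's behavior explainable to a non-technical audience.
    coefs = model.named_steps["logisticregression"].coef_[0]
    for name, coef in zip(data.feature_names, coefs):
        print(f"{name}: {coef:+.3f}")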

1.2. In-memory vs Out-of-memory

  • can the dataset be loaded into RAM all at once? If so, the choice of algorithms is much wider.
  • otherwise, use an incremental (out-of-core) learning algorithm that improves the model gradually, one chunk of data at a time (sketch below).
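
A minimal sketch of incremental learning via scikit-learn's partial_fit API; the chunks are simulated here, whereas in practice each chunk would be streamed from disk:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier(random_state=0)
    classes = np.array([0, 1])  # partial_fit needs all classes up front

    rng = np.random.default_rng(0)
    for _ in range(10):  # stand-in for reading chunks from disk
        X_chunk = rng.normal(size=(1000, 20))
        y_chunk = (X_chunk[:, 0] > 0).astype(int)
        # Update the model one chunk at a time; memory use stays bounded.
        model.partial_fit(X_chunk, y_chunk, classes=classes)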

1.3. Number of features and examples

  • neural nets and suitable ensembles are good candidates when dealing with a large number of features and examples.
  • traditional models are preferable when the number of features is limited and speed matters -> see Occam's Razor: keep the simplest model that performs adequately (sketch below).
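
A minimal sketch of the Occam's Razor heuristic: keep the simpler model unless the more complex one is clearly better (the 1% tolerance and the two models are arbitrary placeholders):

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    simple = cross_val_score(LogisticRegression(max_iter=5000), X, y).mean()
    complex_ = cross_val_score(GradientBoostingClassifier(random_state=0), X, y).mean()

    # Favor the simpler, faster model unless the gain is substantial.
    if complex_ - simple < 0.01:
        print("keep the simple model:", simple)
    else:
        print("complexity pays off:", complex_)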

1.4. Categorical vs numerical features

  • what kind of data do we have: only one of the two, or a mix of both?
  • the algorithm and the associated feature engineering must handle each type appropriately, e.g. encoding categorical features before feeding them to a numerical model (sketch below).
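
A minimal sketch of handling a mixed dataset with scikit-learn's ColumnTransformer; the column names and values are hypothetical:

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    df = pd.DataFrame({
        "age": [25, 32, 47, 51],                      # numerical
        "city": ["paris", "lyon", "paris", "nice"],   # categorical
        "bought": [0, 1, 1, 0],                       # target
    })

    # Scale numerical columns, one-hot encode categorical ones.
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])
    model = make_pipeline(preprocess, LogisticRegression())
    model.fit(df[["age", "city"]], df["bought"])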

1.5. Nonlinearity of the data
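
  • if the relationship between features and target is not (roughly) linear, plain linear models will underfit; kernel methods, tree ensembles, or neural nets are common alternatives.

A minimal sketch contrasting a linear and a kernelized classifier on nonlinearly separable data; the two-moons dataset is a stand-in:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
    # A linear decision boundary cannot separate the two moons well...
    print("linear:", cross_val_score(SVC(kernel="linear"), X, y).mean())
    # ...whereas an RBF kernel captures the nonlinear structure.
    print("rbf:   ", cross_val_score(SVC(kernel="rbf"), X, y).mean())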

1.6. Training speed

  • neural nets will usually train more slowly than a traditional solution; when in doubt, measure (sketch below).
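
A minimal sketch for timing training directly rather than guessing; the dataset and models are placeholders:

    import time
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    X, y = load_breast_cancer(return_X_y=True)
    for model in (LogisticRegression(max_iter=5000),
                  MLPClassifier(max_iter=500, random_state=0)):
        start = time.perf_counter()
        model.fit(X, y)
        print(type(model).__name__, f"{time.perf_counter() - start:.2f}s")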

1.7. Prediction speed

  • live, high-throughput serving and daily batch-processing jobs call for different approaches: the former is dominated by per-example latency, the latter by overall throughput. Measure both for the target workload (sketch below) and pick the model that minimizes the overall tradeoffs.
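
A minimal sketch comparing per-example latency (the live case) with batched prediction (the batch case); the model and sizes are placeholders:

    import time
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_breast_cancer(return_X_y=True)
    model = RandomForestClassifier(random_state=0).fit(X, y)

    # Live serving: one example at a time, latency dominates.
    start = time.perf_counter()
    for row in X[:100]:
        model.predict(row.reshape(1, -1))
    print("per-example:", (time.perf_counter() - start) / 100, "s")

    # Batch job: one large call, throughput dominates.
    start = time.perf_counter()
    model.predict(X)
    print("full batch: ", time.perf_counter() - start, "s")
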
Tags::ml:ai: