Text Classification
see Classification (this node collates notes specific to classification in NLP)
To classify given textual sequences into one or more known data classes.
1. Primitives
- Lexicon-based sentiment analysis
- see VaderSentiment; a usage sketch follows after this list
- Traditional machine-learning-based methods
- see Classifiers
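A minimal usage sketch of the lexicon-based approach via the vaderSentiment package; the input sentence is illustrative, and the ±0.05 compound-score cutoffs follow VADER's conventional defaults:

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("The movie was surprisingly good!")  # dict: neg/neu/pos/compound

# map the normalized compound score to a class label (conventional +/-0.05 thresholds)
if scores["compound"] >= 0.05:
    label = "positive"
elif scores["compound"] <= -0.05:
    label = "negative"
else:
    label = "neutral"
print(scores, label)
```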
2. Text-Classification Pipeline
- similar to the Generic Classification Pipeline
- for feature engineering, see: textual feature representation; a minimal traditional-pipeline sketch follows below
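A minimal sketch of such a traditional pipeline in scikit-learn, using TF-IDF features and a linear classifier; the toy texts and labels are illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy corpus and labels (1 = positive, 0 = negative), purely illustrative
texts = ["great product", "terrible service", "loved it", "worst purchase ever"]
labels = [1, 0, 1, 0]

# TF-IDF feature representation followed by a linear classifier
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["pretty great service"]))
```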
3. Deep Learning Approaches
- two base architectures are used for tackling NLP tasks: CNNs and RNNs. The rest are variants and/or major enhancements, but these two capture the initial approaches to the tasks
- in addition to the text processing done for traditional machine learning pipelines, we may need to pad tokenized sequences to a common length; padding calls for attention masks before further processing
- next, token ids are mapped to embeddings via lookups into an embedding matrix, and this preprocessed input is fed into the network; see the sketch after this list
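A sketch of these preprocessing steps in PyTorch; the toy vocabulary, pre-tokenized batch, and embedding size are illustrative assumptions:

```python
import torch
import torch.nn as nn

# toy vocabulary and a pre-tokenized batch (illustrative)
vocab = {"<pad>": 0, "the": 1, "movie": 2, "was": 3, "good": 4, "bad": 5}
batch = [["the", "movie", "was", "good"], ["bad", "movie"]]

# pad every sequence to the length of the longest one
max_len = max(len(seq) for seq in batch)
ids = torch.tensor(
    [[vocab[tok] for tok in seq] + [0] * (max_len - len(seq)) for seq in batch]
)
attention_mask = (ids != 0).long()  # 1 for real tokens, 0 for padding

# embedding lookup: each token id indexes a row of the embedding matrix
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8, padding_idx=0)
x = embedding(ids)  # (batch, max_len, embedding_dim), ready for the network
```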
3.1. CNNs
- employ 1D CNNs after preprocessing
- embed, convolve, pool …
- slap on a linear layer + softmax (or other suitable collator) at the end for classification
- learns latent relations that contribute toward a class; a sketch follows below
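A minimal sketch of such a 1D-CNN text classifier in PyTorch; the TextCNN name and all layer sizes are illustrative assumptions, not a canonical implementation:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, num_filters=32, kernel_size=3, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size)
        self.fc = nn.Linear(num_filters, num_classes)

    def forward(self, ids):                   # ids: (batch, seq_len)
        x = self.embedding(ids)               # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)                 # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))          # (batch, num_filters, seq_len - k + 1)
        x = x.max(dim=2).values               # global max pool over time
        return self.fc(x)                     # logits; apply softmax for probabilities

logits = TextCNN(vocab_size=10_000)(torch.randint(1, 10_000, (4, 20)))
```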
3.2. LSTMs (RNN variants)
- the sequential semantics of the data do not have to be explicitly incorporated into the architecture; they are captured by the temporality of the data flow itself
- instead of multiple convolutions and pools, a single RNN-variant unit processes the whole sequence, and its output is then passed to a linear layer + any suitable collator for classification; a sketch follows below
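A matching LSTM-based sketch under the same illustrative assumptions; here the final hidden state stands in for the pooled features:

```python
import torch
import torch.nn as nn

class TextLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, ids):                   # ids: (batch, seq_len)
        x = self.embedding(ids)               # (batch, seq_len, embed_dim)
        _, (h_n, _) = self.lstm(x)            # h_n: (1, batch, hidden_dim), final hidden state
        return self.fc(h_n.squeeze(0))        # logits; softmax (or other collator) on top

logits = TextLSTM(vocab_size=10_000)(torch.randint(1, 10_000, (4, 20)))
```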