text-embeddings
See Text Representation.
word2vec was a seminal step toward representing the tokens of a corpus as dense vectors. Some NLP tasks, however, require embeddings for longer-than-usual sequences, and the techniques for adapting single-token mappings to whole sequences are not obvious.
1. Baseline
- represent the sequence with the sum or average of the individual token embeddings (see the sketch after this list).
- loses information such as token ordering, but may be sufficient for many tasks.
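A minimal sketch of the averaging baseline, assuming a pre-trained gensim KeyedVectors model (the specific model and tokenization here are illustrative):

```python
import numpy as np
import gensim.downloader as api

# Load pre-trained vectors (illustrative choice; any KeyedVectors model works).
wv = api.load("glove-wiki-gigaword-50")

def sequence_embedding(tokens, wv):
    """Represent a token sequence as the mean of its token vectors.

    Tokens missing from the vocabulary are skipped; ordering is lost.
    """
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

print(sequence_embedding("the cat sat on the mat".split(), wv).shape)  # (50,)
```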
2. The OOV problem
- as an initial approach, OOV (out-of-vocabulary) tokens can be handled by assigning them a default vector or by skipping them altogether during preprocessing.
- a more principled approach is to look deeper into the morphology of the tokens and build embeddings from character-level n-grams rather than whole words (see Releases · facebookresearch/fastText · GitHub): also available via gensim.
- this allows the embedder to map suffixes, prefixes, and other meaningful sub-word units, which are later combined to produce a word's representation and, in turn, a larger text's representation, resulting in better handling of OOV words (see the sketch below).
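A short sketch of the subword idea using gensim's FastText implementation; the toy corpus and hyperparameters are placeholders:

```python
from gensim.models import FastText

# Toy corpus; in practice this would be a tokenized real corpus.
sentences = [
    ["embedding", "words", "with", "character", "ngrams"],
    ["subword", "units", "handle", "rare", "morphology"],
]

# min_n / max_n control the character n-gram range used to build word vectors.
model = FastText(
    sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=6, epochs=10
)

# An out-of-vocabulary word still gets a vector, composed from its character n-grams.
oov_vector = model.wv["embeddings"]   # never seen during training
print(oov_vector.shape)               # (100,)
```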
3. Doc2Vec
- trains on sequences of processed paragraphs instead of words,
- i.e. paragraphs play the role that words play for SkipGram / CBOW when training word2vec.
- retains information about the order of the sequence.
- also available via gensim; a minimal sketch follows.
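A minimal Doc2Vec sketch with gensim; the documents, tags, and hyperparameters are illustrative:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each paragraph gets a tag; tags play the role that words play in word2vec training.
corpus = [
    TaggedDocument(words=["dense", "vectors", "for", "paragraphs"], tags=[0]),
    TaggedDocument(words=["order", "aware", "document", "embeddings"], tags=[1]),
]

model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)

# Trained document vectors are indexed by tag; unseen documents are embedded by inference.
print(model.dv[0].shape)                                    # (50,)
print(model.infer_vector(["a", "new", "paragraph"]).shape)  # (50,)
```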
4. Interpretation
- beyond serving as input to downstream training modules, visualizing the embeddings of a corpus can provide important insights.
- techniques for visualizing embeddings generically are explored in Vector Visualization: this is usually not as simple as analysing low-dimensional data, and the vectors need to be pre-processed appropriately (e.g. via dimensionality reduction), as sketched below.
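One common route is to reduce the embedding matrix to 2-D with t-SNE and scatter-plot it; the pre-trained model and word list below are assumptions for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gensim.downloader as api

# Pre-trained vectors as an example source of embeddings (illustrative choice).
wv = api.load("glove-wiki-gigaword-50")
words = ["king", "queen", "man", "woman", "paris", "london", "france", "england"]
X = wv[words]  # shape: (n_words, 50)

# t-SNE projects the high-dimensional vectors down to 2-D for plotting.
# Perplexity must be smaller than the number of samples; tune it for real corpora.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title("2-D t-SNE projection of word embeddings")
plt.show()
```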