text-embeddings
See Text Representation.
word2vec was a seminal step toward representing the tokens of a corpus as dense vectors. Some NLP tasks, however, require embeddings for longer-than-usual sequences, and the techniques for adapting single-token mappings to whole sequences are not obvious.
1. Baseline
- represent the sequence with the sum or average of the individual token embeddings (see the sketch after this list).
- loses information such as token ordering, but may be sufficient for many tasks.
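A minimal sketch of the averaging baseline, assuming a pre-trained gensim KeyedVectors model (the specific model and tokenization here are illustrative):

```python
import numpy as np
import gensim.downloader as api

# Load pre-trained vectors (illustrative choice; any KeyedVectors model works).
wv = api.load("glove-wiki-gigaword-50")

def sequence_embedding(tokens, wv):
    """Represent a token sequence as the mean of its token vectors.

    Tokens missing from the vocabulary are skipped; ordering is lost.
    """
    vectors = [wv[t] for t in tokens if t in wv]
    if not vectors:
        return np.zeros(wv.vector_size)
    return np.mean(vectors, axis=0)

print(sequence_embedding("the cat sat on the mat".split(), wv).shape)  # (50,)
```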
2. The OOV problem
- as an initial approach, OOV (out-of-vocabulary) tokens can be handled by assigning them a default vector or by skipping them altogether during preprocessing.
- a more principled approach is to look deeper into the morphology of the tokens and build embeddings from character-level n-grams rather than whole words (see Releases · facebookresearch/fastText · GitHub): also available via gensim.
- this allows the embedder to map suffixes, prefixes, and other meaningful sub-word units, which are later combined to produce a word's representation and, in turn, a larger text's representation, resulting in better handling of OOV words (see the sketch below).
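A short sketch of the subword idea using gensim's FastText implementation; the toy corpus and hyperparameters are placeholders:

```python
from gensim.models import FastText

# Toy corpus; in practice this would be a tokenized real corpus.
sentences = [
    ["embedding", "words", "with", "character", "ngrams"],
    ["subword", "units", "handle", "rare", "morphology"],
]

# min_n / max_n control the character n-gram range used to build word vectors.
model = FastText(
    sentences, vector_size=100, window=5, min_count=1, min_n=3, max_n=6, epochs=10
)

# An out-of-vocabulary word still gets a vector, composed from its character n-grams.
oov_vector = model.wv["embeddings"]   # never seen during training
print(oov_vector.shape)               # (100,)
```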
3. Doc2Vec
- trains on sequences of processed paragraphs instead of words,
- i.e. paragraphs play the role that words play for SkipGram / CBOW when training word2vec.
- retains information about the order of the sequence.
- also available via gensim; a minimal sketch follows.
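A minimal Doc2Vec sketch with gensim; the documents, tags, and hyperparameters are illustrative:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each paragraph gets a tag; tags play the role that words play in word2vec training.
corpus = [
    TaggedDocument(words=["dense", "vectors", "for", "paragraphs"], tags=[0]),
    TaggedDocument(words=["order", "aware", "document", "embeddings"], tags=[1]),
]

model = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=40)

# Trained document vectors are indexed by tag; unseen documents are embedded by inference.
print(model.dv[0].shape)                                    # (50,)
print(model.infer_vector(["a", "new", "paragraph"]).shape)  # (50,)
```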
4. Interpretation
- beyond serving as input to downstream training modules, visualizing the embeddings of a corpus can provide important insights.
- techniques for visualizing embeddings generically are explored in Vector Visualization: this is usually not as simple as analysing low-dimensional data, and the vectors need to be pre-processed appropriately (e.g. via dimensionality reduction), as sketched below.
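One common route is to reduce the embedding matrix to 2-D with t-SNE and scatter-plot it; the pre-trained model and word list below are assumptions for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import gensim.downloader as api

# Pre-trained vectors as an example source of embeddings (illustrative choice).
wv = api.load("glove-wiki-gigaword-50")
words = ["king", "queen", "man", "woman", "paris", "london", "france", "england"]
X = wv[words]  # shape: (n_words, 50)

# t-SNE projects the high-dimensional vectors down to 2-D for plotting.
# Perplexity must be smaller than the number of samples; tune it for real corpora.
coords = TSNE(n_components=2, perplexity=3, random_state=0).fit_transform(X)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title("2-D t-SNE projection of word embeddings")
plt.show()
```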