text-embeddings

See Text Representation.

word2vec was a seminal step toward using dense vector spaces to represent tokens in a corpus. Some NLP tasks require embeddings for longer-than-usual sequences (sentences, paragraphs, whole documents), and the techniques for adapting single-token embeddings to such sequences are not obvious.

1. Baseline

  • represent the sequence as the sum or average of the individual token embeddings (see the sketch below)
  • this loses information such as token order, but it may be sufficient for many tasks.
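  A minimal sketch of this baseline, assuming a pretrained gensim KeyedVectors model is available (the model path in the usage comment is only illustrative):

    import numpy as np
    from gensim.models import KeyedVectors

    def average_embedding(tokens, kv):
        """Represent a token sequence as the mean of its token vectors.

        Tokens missing from the vocabulary are skipped; ordering is ignored.
        """
        vectors = [kv[t] for t in tokens if t in kv]
        if not vectors:
            return np.zeros(kv.vector_size)
        return np.mean(vectors, axis=0)

    # Usage (path is illustrative):
    # kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
    # vec = average_embedding("the quick brown fox".split(), kv)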

2. The OOV problem

  • as an initial approach, OOV (out-of-vocabulary) tokens can be handled by assigning a default vector to them or skipping them altogether in the preprocessing phase.
  • a more principled approach is to dive deeper into the morphology of the tokens and embed character-level n-grams rather than whole words (see Releases · facebookresearch/fastText · GitHub): also available via gensim.
    • this allows the embedder to map suffixes, prefixes and other meaningful morphological segments of a word, which are later combined to produce the word's representation and, in turn, a larger text's representation, resulting in better handling of OOV words (see the sketch below).
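  A minimal sketch of subword embeddings using gensim's FastText implementation; the toy corpus and hyperparameters are illustrative:

    from gensim.models import FastText

    corpus = [
        ["embedding", "vectors", "for", "tokens"],
        ["subword", "ngrams", "handle", "rare", "words"],
    ]

    model = FastText(
        sentences=corpus,
        vector_size=100,
        window=5,
        min_count=1,
        min_n=3,   # smallest character n-gram
        max_n=6,   # largest character n-gram
        epochs=10,
    )

    # An out-of-vocabulary word still gets a vector, composed from its
    # character n-grams rather than looked up as a whole word.
    oov_vector = model.wv["embeddings"]  # not present in the training corpus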

3. Doc2Vec

  • trains on tagged paragraphs/documents rather than individual words
    • these play the role that words play for SkipGram / CBOW when training word2vec
  • retains information about the order of tokens in the sequence (unlike the averaging baseline)
  • also available via gensim (see the sketch below)
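  A minimal sketch of training paragraph vectors with gensim's Doc2Vec; the documents and hyperparameters are illustrative:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    docs = [
        TaggedDocument(words=["dense", "vectors", "for", "paragraphs"], tags=["doc0"]),
        TaggedDocument(words=["order", "aware", "document", "embeddings"], tags=["doc1"]),
    ]

    model = Doc2Vec(docs, vector_size=50, window=2, min_count=1, epochs=40)

    # Vector for a training document, looked up by its tag:
    trained_vec = model.dv["doc0"]

    # Inferred vector for an unseen piece of text:
    new_vec = model.infer_vector(["embedding", "a", "new", "paragraph"])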

4. Interpretation

  • beyond serving as input to downstream training modules, visualizing the embeddings of a corpus can provide important insights.
  • techniques for visualizing embeddings generically are explored in Vector Visualization: this is usually not as simple as analysing low-dimensional data, and the vectors typically need to be projected down to two or three dimensions first (see the sketch below).
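  A minimal sketch of that pre-processing step: projecting the high-dimensional vectors down to 2D before plotting. It assumes the KeyedVectors model kv from the baseline sketch above; PCA stands in for any reduction technique (t-SNE, UMAP, ...):

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # `kv` is the KeyedVectors model loaded earlier (assumption).
    words = [w for w in ["king", "queen", "man", "woman", "paris", "london"] if w in kv]
    vectors = [kv[w] for w in words]

    # Reduce to 2 dimensions so the vectors can be scattered on a plane.
    coords = PCA(n_components=2).fit_transform(vectors)

    for (x, y), word in zip(coords, words):
        plt.scatter(x, y)
        plt.annotate(word, (x, y))
    plt.show()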
Tags::nlp: