Text Representation

Note: we approach this step post data engineering.

AKA Textual feature representation

1. Abstract

  • not as straightforward as representing images, videos, speech
  • there is a malleable synthetic component to representation
    • gives rise to several degrees of freedom/choices in how one chooses to represent samples
  • written language has an explicitly observable discrete (symbolic) structure, and how it maps to a vector space may not be that intuitive to the beginner at the get-go

1.1. Characteristics of a good representation?

For any generic nlp task, processing a data point involves:

  • breaking the data point into its atomic units: retrieving their independent semantics
  • understanding their syntactics (grammar)
  • understanding the context
  • finally, inferring the semantics of the data point as a whole from the above.

A good representation would intuitively facilitate the above operations.

2. Specifics

Most text representation schemes fall under the bucket of the vector space model: they vectorize the samples under some priors to obtain a numerical representation. Some basic vectorization approaches:

  1. One hot encoding
    • assuming independence between tokens and no notion of closeness/farness between tokens : only a discrete distinction
    • sparse and inefficient for large vocabulary size
    • a variable length representation: not applicable when we need to map, say, two different paragraphs to the same vector space.

      from sklearn.preprocessing import OneHotEncoder

      # init_corpus / test_sequence are placeholders; OneHotEncoder expects
      # a 2D array, e.g. one token per row
      corpus = init_corpus()

      encoder = OneHotEncoder()
      encoder.fit(corpus)

      encoder.transform(test_sequence)
      
  2. Bag of words (subset of bag-of-ngrams)
    • union of the indicator vectors of the individual tokens
    • does not capture order or context
    • is of fixed length (len(vocab)) but still sparse if vocabulary is large and data points are small
    • frequency usually isn't captured: only existence (cf. binary=True below)

      from sklearn.feature_extraction.text import CountVectorizer

      corpus = init_corpus()

      # binary=True only records existence indicators (no counts);
      # ngram_range could be set to generalize to bag-of-ngrams
      encoder = CountVectorizer(binary=True)

      encoder.fit(corpus)

      encoder.transform(test_sequence)
      
  3. TF-IDF (term frequency - inverse document frequency)
    • words that occur frequently across all documents of a corpus may not be important for most tasks.
    • stop word removal does deal with this to an extent but that is not a statistical approach
    • TF-IDF reports the frequency of a term in a document, adjusted for how common the term is across the corpus: an implicit feature engineering step along the lines of dimensionality reduction.
    • still can't capture closeness/farness between terms

      ;; conceptual sketch: the helper functions are placeholders
      (defun TF-IDF (term document)
        (flet ((term-frequency ()
                 (/ (number-of-occurences-of-t-in-d)
                    (total-number-of-terms-in-d)))
               (inverse-document-frequency ()
                 ;; natural log of (total documents / documents containing the term)
                 (log (/ (total-number-of-documents)
                         (number-of-documents-that-have-t)))))
          (* (term-frequency) (inverse-document-frequency))))
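
    • the same idea via sklearn (a minimal sketch; note that TfidfVectorizer applies smoothing and normalization by default; init_corpus and test_sequence are the placeholders used in the earlier snippets):

      from sklearn.feature_extraction.text import TfidfVectorizer

      corpus = init_corpus()

      encoder = TfidfVectorizer()
      encoder.fit(corpus)

      # rows are documents, columns are vocabulary terms weighted by tf-idf
      encoder.transform(test_sequence)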
      
  4. Word embeddings (init by word2vec)
    • The major disadvantage of the approaches above is that they have no explicit gauge for how semantically close two words/phrases (beyond unigrams) are. This leads to the usage of dense vector representations aka embeddings aka distributed representations: these call for a black-box function that takes your n-gram and produces the mapped embedding - that's a job for neural networks
    • progress made towards capturing semantic differences/similarities between two tokens in a corpus
    • allows for some intuitive vector math to understand relations between tokens. For instance, the following may be reasonable assumptions regarding the representations in the embedding space:
      ;; conceptual sketch: bodies are placeholders
      (defun embedding (token-chain)
        ;; maps a token (or n-gram) to its dense vector
        ...)

      (defun closep (a b &key (thresh defthresh))
        ;; true when vectors A and B are within THRESH of each other
        ...)

      (assert (closep (+ (embedding 'land) (embedding 'transport))
                      (embedding 'car)))

      (assert (closep (+ (- (embedding 'king) (embedding 'man))
                         (embedding 'woman))
                      (embedding 'queen)))
  • exploring the embedder further:
    • it builds the embedding by looking at the distributional similarity (accepting the distributional hypothesis) of a word, i.e. its neighborhood, aka context.
    • on a conceptual level: this is done by a vector-space-level fixed point iteration where each token embedding is initialized randomly and then improved upon with each iteration using that token's context in its occurrences in the corpus.
    • specifically, word2vec uses a 2-layer neural net for this.
  • for pretrained word embeddings (which are a form of key-value store), refer to gensim
  • quick similarity searches can be done on a vector space by using cosine similarity
  • for training your own word embeddings, look into continuous bag of words (CBOW) and skip-gram : see the sketch after this list
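  • a minimal training sketch via gensim (assumptions: a whitespace-tokenized corpus, init_corpus as the earlier placeholder, and "car" being in the vocabulary; sg=1 selects skip-gram, sg=0 CBOW):

    from gensim.models import Word2Vec

    # one list of tokens per document
    tokenized_corpus = [doc.split() for doc in init_corpus()]

    model = Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, sg=1)

    # cosine-similarity based neighborhood lookup in the embedding space
    model.wv.most_similar("car")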
  5. Scaling beyond words:
    • most tasks require embedding paragraphs or even documents into a dense vector space : exploring this in a separate node
    • Also note that in addition to the individual shortcomings of the above methods, none of them can handle the OOV (out of vocabulary) problem gracefully.
  6. HandCrafted Features:
    • when significant domain-specific knowledge is available beforehand, one can add manual feature-engineering steps on top of a generic representation pipeline for better performance, as sketched below
    • this may be the case when the task at hand has specific requirements that generic pipelines alone do not capture well when fitting a model.
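    • a hedged illustration of mixing a generic vectorizer with one handcrafted feature (DOMAIN_TERMS and count_domain_terms are hypothetical; init_corpus is the placeholder used throughout):

      import numpy as np
      from scipy.sparse import hstack
      from sklearn.feature_extraction.text import CountVectorizer

      DOMAIN_TERMS = {"invoice", "ledger"}  # hypothetical domain lexicon

      def count_domain_terms(doc):
          # handcrafted feature: number of domain-specific terms in the document
          return sum(token in DOMAIN_TERMS for token in doc.lower().split())

      corpus = init_corpus()

      generic = CountVectorizer(binary=True).fit_transform(corpus)
      handcrafted = np.array([[count_domain_terms(doc)] for doc in corpus])

      # final representation: generic bag-of-words columns + one handcrafted column
      features = hstack([generic, handcrafted])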

3. Relevant nodes

4. Detour

  • from a more abstract perspective, text is pretty personal to me as most of my journaling, logging, blogging and ideation takes place textually.
  • This deserves a special treatment in the context of operating your environments.
Tags::nlp: