Text Representation
Note: we approach this step after data engineering.
AKA Textual feature representation
1. Abstract
- not as straightforward as representing images, videos, or speech
- there is a malleable, synthetic component to the representation
- this gives rise to several degrees of freedom/choices in terms of how one chooses to represent samples
- written language has an explicitly observable discrete (symbolic) structure, and mapping it to a vector space may not be that intuitive to a beginner at the get-go
1.1. Characteristics of a good representation?
For any generic NLP task, processing a data point involves:
- breaking the data point into its atomic units: retrieving their independent semantics
- understanding their syntactics (grammar)
- understanding the context
- finally, inferring the semantics of the data point as a whole from the above.
A good representation would intuitively facilitate the above operations.
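As a concrete illustration of these operations (a sketch assuming spaCy and its en_core_web_sm model are available; the sentence is arbitrary):

    import spacy

    # tokenization gives the atomic units; POS tags and the dependency parse
    # expose each unit's syntactics and its local context
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog.")

    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)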
2. Specifics
Most text representation schemes fall under the bucket of the vector space model: they vectorize the samples under some priors to obtain a numerical representation. Some basic vectorization approaches:
- One hot encoding
- assuming independence between tokens and no notion of closeness/farness between tokens : only a discrete distinction
- sparse and inefficient for large vocabulary size
- yields a variable-length representation: not applicable when we need to map, say, two different paragraphs to the same vector space.
    # OneHotEncoder lives in sklearn.preprocessing, not the top-level sklearn package
    from sklearn.preprocessing import OneHotEncoder

    corpus = init_corpus()            # placeholder: 2D array-like of shape (n_samples, 1) holding tokens
    encoder = OneHotEncoder()
    encoder.fit(corpus)               # learn the vocabulary
    encoder.transform(test_sequence)  # sparse indicator vectors; test_sequence has the same 2D shape
- Bag of words (a special case of bag-of-n-grams)
- union of the indicator vectors of the individual tokens
- does not capture order or context
- is of fixed length (len(vocab)) but still sparse if vocabulary is large and data points are small
- frequency often isn't captured, only existence (hence the binary=True below)
- might be useful in certain cases (Occam's Razor)
    # CountVectorizer lives in sklearn.feature_extraction.text
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = init_corpus()                  # placeholder: an iterable of raw text documents
    encoder = CountVectorizer(binary=True)  # only record existence indicators, not counts
    encoder.fit(corpus)                     # learn the vocabulary
    encoder.transform(test_sequence)        # sparse (n_docs, len(vocab)) indicator matrix
- TF-IDF (term frequency - inverse document frequency)
- words that occur frequently across all documents of a corpus may not be important for most tasks.
- stop word removal does deal with this to an extent but that is not a statistical approach
- TF-IDF weights a term's frequency within a document by how rare the term is across the corpus: an implicit feature engineering step along the lines of dimensionality reduction (uninformative terms get down-weighted)
- still can't capture closeness/farness between terms
    ;; pseudocode sketch of the TF-IDF weight of a term in a document
    (defun TF-IDF (term document)
      (let ((term-frequency
              #'(lambda ()
                  (/ (number-of-occurrences-of-t-in-d)
                     (total-number-of-terms-in-d))))
            (inverse-document-frequency
              #'(lambda ()
                  ;; log with a single argument is the natural logarithm
                  (log (/ (total-number-of-documents)
                          (number-of-documents-that-have-t))))))
        (* (funcall term-frequency)
           (funcall inverse-document-frequency))))
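For symmetry with the earlier snippets, a minimal scikit-learn counterpart to the sketch above, reusing the same hypothetical init_corpus()/test_sequence placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = init_corpus()            # placeholder: an iterable of raw text documents
    encoder = TfidfVectorizer()       # tf * idf weighting, l2-normalized rows by default
    encoder.fit(corpus)               # learn the vocabulary and per-term idf values
    encoder.transform(test_sequence)  # sparse (n_docs, len(vocab)) matrix of tf-idf weights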
- Word embeddings (popularized by word2vec)
- The major disadvantage of the schemes above is that they don't have an explicit gauge for how semantically close two words/phrases (beyond unigrams) are. This leads to the usage of dense vector representations aka embeddings aka distributed representations: this calls for a black-box function that takes your n-gram and produces the mapped embedding. That's a job for neural networks.
- progress made towards capturing semantic differences/similarities between two tokens in a corpus
- allows for some intuitive vector math to understand relations between tokens. for instance, the following may be reasonable assumptions regarding the representations in the embedding space:
    ;; sketch: embedder maps a chain of tokens to a dense vector
    (defun embedder (token-chain)
      (progn
        ...
        embedding))

    ;; closep: are two vectors within some threshold of each other?
    (defun closep (a b &key (thresh defthresh))
      ...)

    ;; plausible relations in the embedding space
    (assert (closep (+ (embedder 'land) (embedder 'transport))
                    (embedder 'car)))
    (assert (closep (+ (- (embedder 'king) (embedder 'man))
                       (embedder 'woman))
                    (embedder 'queen)))
- exploring embedder further: it builds the embedding by looking at the distributional similarity (accepting the distributional hypothesis) of a word, i.e. its neighborhood, aka context.
- on a conceptual level, this is done by a fixed-point-style iteration over the vector space: each token embedding is initialized randomly and then improved upon with each iteration, using that token's context across its occurrences in the corpus.
- specifically, word2vec uses a 2-layer neural net for this.
- for pretrained word embeddings (which are a form of key-value store), refer to gensim; a minimal sketch follows this list
- quick similarity searches can be done on a vector space by using cosine similarity
- for training your own word embeddings, look into Continuous Bag of Words (CBOW) and Skip-gram
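A minimal gensim sketch of both routes (pretrained lookups and training from scratch); the glove-wiki-gigaword-50 vectors and the toy sentences are assumptions for illustration, not part of the original notes:

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # pretrained route: a KeyedVectors object, essentially a token -> vector store
    kv = api.load("glove-wiki-gigaword-50")
    kv.most_similar("car", topn=5)      # nearest neighbours, ranked by cosine similarity
    kv.similarity("king", "queen")      # cosine similarity between two tokens
    kv.most_similar(positive=["king", "woman"], negative=["man"])  # the king - man + woman analogy

    # training route: CBOW (sg=0) or Skip-gram (sg=1) over your own tokenized corpus
    sentences = [["land", "transport", "car"], ["royal", "king", "queen"]]  # hypothetical toy corpus
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    model.wv["car"]                     # the learned dense vector for 'car'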
- Scaling beyond words:
- most tasks require embedding paragraphs or even documents into a dense vector space : exploring this in a separate node
- Also note that, in addition to the individual shortcomings of the above methods, none of them can handle the OOV (out-of-vocabulary) problem gracefully.
- HandCrafted Features:
- when significant domain-specific knowledge is available beforehand, one can add manual feature engineering steps before the features are consumed, and get better performance than a generic representation pipeline (a small sketch follows)
- this may be the case when the task at hand has specific requirements that generic pipelines alone don't capture well enough for fitting a model.
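A sketch of the idea under assumed specifics: the domain lexicon, the two manual signals, and the placeholder init_corpus() are hypothetical, and the handcrafted columns are simply appended to a generic TF-IDF matrix:

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    DOMAIN_TERMS = {"dosage", "contraindication", "mg"}  # hypothetical domain lexicon

    def handcrafted_features(doc):
        """Manual, domain-informed signals computed per document."""
        tokens = doc.lower().split()
        return [
            sum(tok in DOMAIN_TERMS for tok in tokens),  # domain-term count
            sum(tok.isdigit() for tok in tokens),        # how numeric the document is
        ]

    corpus = init_corpus()                              # placeholder, as in the earlier snippets
    generic = TfidfVectorizer().fit_transform(corpus)   # generic representation
    manual = csr_matrix(np.array([handcrafted_features(d) for d in corpus], dtype=float))
    features = hstack([generic, manual])                # (n_docs, len(vocab) + 2) feature matrix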
4. Detour
- from a more abstract perspective, text is pretty personal to me as most of my journaling, logging, blogging and ideation takes place textually.
- This deserves a special treatment in the context of operating your environments.