Text Representation
Note: we approach this step after data engineering.
AKA Textual feature representation
1. Abstract
- not as straightforward as representing images, videos, or speech
- there is a malleable, synthetic component to the representation
- this gives rise to several degrees of freedom/choices in terms of how one chooses to represent samples
- written language has an explicitly observable discrete (symbolic) structure, and mapping it to a vector space may not be that intuitive to a beginner at the get-go
1.1. Characteristics of a good representation?
For any generic NLP task, processing a data point involves:
- breaking the data point into its atomic units: retrieving their independent semantics
- understanding their syntactics (grammar)
- understanding the context
- finally, inferring the semantics of the data point as a whole from the above.
A good representation would intuitively facilitate the above operations.
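As a concrete illustration of these operations (a sketch assuming spaCy and its en_core_web_sm model are available; the sentence is arbitrary):

    import spacy

    # tokenization gives the atomic units; POS tags and the dependency parse
    # expose each unit's syntactics and its local context
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("The quick brown fox jumps over the lazy dog.")

    for token in doc:
        print(token.text, token.pos_, token.dep_, token.head.text)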
2. Specifics
Most text representation schemes fall under the bucket of the vector space model: they vectorize the samples under some priors to obtain a numerical representation. Some basic vectorization approaches:
- One hot encoding
- assuming independence between tokens and no notion of closeness/farness between tokens : only a discrete distinction
- sparse and inefficient for large vocabulary size
- yields a variable-length representation: not applicable when we need to map, say, two different paragraphs to the same vector space.
    # OneHotEncoder lives in sklearn.preprocessing, not the top-level sklearn package
    from sklearn.preprocessing import OneHotEncoder

    corpus = init_corpus()            # placeholder: 2D array-like of shape (n_samples, 1) holding tokens
    encoder = OneHotEncoder()
    encoder.fit(corpus)               # learn the vocabulary
    encoder.transform(test_sequence)  # sparse indicator vectors; test_sequence has the same 2D shape
- Bag of words (a special case of bag-of-n-grams)
- union of the indicator vectors of the individual tokens
- does not capture order or context
- is of fixed length (len(vocab)) but still sparse if vocabulary is large and data points are small
- frequency often isn't captured, only existence (hence the binary=True below)
- might be useful in certain cases (Occam's Razor)
    # CountVectorizer lives in sklearn.feature_extraction.text
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = init_corpus()                  # placeholder: an iterable of raw text documents
    encoder = CountVectorizer(binary=True)  # only record existence indicators, not counts
    encoder.fit(corpus)                     # learn the vocabulary
    encoder.transform(test_sequence)        # sparse (n_docs, len(vocab)) indicator matrix
- TF-IDF (term frequency - inverse document frequency)
- words that occur frequently across all documents of a corpus may not be important for most tasks.
- stop word removal does deal with this to an extent but that is not a statistical approach
- TF-IDF weights a term's frequency within a document by how rare the term is across the corpus: an implicit feature engineering step along the lines of dimensionality reduction (uninformative terms get down-weighted)
- still can't capture closeness/farness between terms
    ;; pseudocode sketch of the TF-IDF weight of a term in a document
    (defun TF-IDF (term document)
      (let ((term-frequency
              #'(lambda ()
                  (/ (number-of-occurrences-of-t-in-d)
                     (total-number-of-terms-in-d))))
            (inverse-document-frequency
              #'(lambda ()
                  ;; log with a single argument is the natural logarithm
                  (log (/ (total-number-of-documents)
                          (number-of-documents-that-have-t))))))
        (* (funcall term-frequency)
           (funcall inverse-document-frequency))))
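For symmetry with the earlier snippets, a minimal scikit-learn counterpart to the sketch above, reusing the same hypothetical init_corpus()/test_sequence placeholders:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = init_corpus()            # placeholder: an iterable of raw text documents
    encoder = TfidfVectorizer()       # tf * idf weighting, l2-normalized rows by default
    encoder.fit(corpus)               # learn the vocabulary and per-term idf values
    encoder.transform(test_sequence)  # sparse (n_docs, len(vocab)) matrix of tf-idf weights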
- Word embeddings (popularized by word2vec)
- The major disadvantage of the schemes above is that they don't have an explicit gauge for how semantically close two words/phrases (beyond unigrams) are. This leads to the usage of dense vector representations aka embeddings aka distributed representations: this calls for a black-box function that takes your n-gram and produces the mapped embedding. That's a job for neural networks.
- progress made towards capturing semantic differences/similarities between two tokens in a corpus
- allows for some intuitive vector math to understand relations between tokens. for instance, the following may be reasonable assumptions regarding the representations in the embedding space:
    ;; sketch: embedder maps a chain of tokens to a dense vector
    (defun embedder (token-chain)
      (progn
        ...
        embedding))

    ;; closep: are two vectors within some threshold of each other?
    (defun closep (a b &key (thresh defthresh))
      ...)

    ;; plausible relations in the embedding space
    (assert (closep (+ (embedder 'land) (embedder 'transport))
                    (embedder 'car)))
    (assert (closep (+ (- (embedder 'king) (embedder 'man))
                       (embedder 'woman))
                    (embedder 'queen)))
- exploring embedder further: it builds the embedding by looking at the distributional similarity (accepting the distributional hypothesis) of a word, i.e. its neighborhood, aka context.
- on a conceptual level, this is done by a fixed-point-style iteration over the vector space: each token embedding is initialized randomly and then improved upon with each iteration, using that token's context across its occurrences in the corpus.
- specifically, word2vec uses a 2-layer neural net for this.
- for pretrained word embeddings (which are a form of key-value store), refer to gensim; a minimal sketch follows this list
- quick similarity searches can be done on a vector space by using cosine similarity
- for training your own word embeddings, look into Continuous Bag of Words (CBOW) and Skip-gram
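A minimal gensim sketch of both routes (pretrained lookups and training from scratch); the glove-wiki-gigaword-50 vectors and the toy sentences are assumptions for illustration, not part of the original notes:

    import gensim.downloader as api
    from gensim.models import Word2Vec

    # pretrained route: a KeyedVectors object, essentially a token -> vector store
    kv = api.load("glove-wiki-gigaword-50")
    kv.most_similar("car", topn=5)      # nearest neighbours, ranked by cosine similarity
    kv.similarity("king", "queen")      # cosine similarity between two tokens
    kv.most_similar(positive=["king", "woman"], negative=["man"])  # the king - man + woman analogy

    # training route: CBOW (sg=0) or Skip-gram (sg=1) over your own tokenized corpus
    sentences = [["land", "transport", "car"], ["royal", "king", "queen"]]  # hypothetical toy corpus
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
    model.wv["car"]                     # the learned dense vector for 'car'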
- Scaling beyond words:
- most tasks require embedding paragraphs or even documents into a dense vector space : exploring this in a separate node
- Also note that, in addition to the individual shortcomings of the above methods, none of them can handle the OOV (out-of-vocabulary) problem gracefully.
- HandCrafted Features:
- when significant domain-specific knowledge is available beforehand, one can add manual feature engineering steps before the features are consumed, and get better performance than a generic representation pipeline (a small sketch follows)
- this may be the case when the task at hand has specific requirements that generic pipelines alone don't capture well enough for fitting a model.
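A sketch of the idea under assumed specifics: the domain lexicon, the two manual signals, and the placeholder init_corpus() are hypothetical, and the handcrafted columns are simply appended to a generic TF-IDF matrix:

    import numpy as np
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    DOMAIN_TERMS = {"dosage", "contraindication", "mg"}  # hypothetical domain lexicon

    def handcrafted_features(doc):
        """Manual, domain-informed signals computed per document."""
        tokens = doc.lower().split()
        return [
            sum(tok in DOMAIN_TERMS for tok in tokens),  # domain-term count
            sum(tok.isdigit() for tok in tokens),        # how numeric the document is
        ]

    corpus = init_corpus()                              # placeholder, as in the earlier snippets
    generic = TfidfVectorizer().fit_transform(corpus)   # generic representation
    manual = csr_matrix(np.array([handcrafted_features(d) for d in corpus], dtype=float))
    features = hstack([generic, manual])                # (n_docs, len(vocab) + 2) feature matrix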
4. Detour
- from a more abstract perspective, text is pretty personal to me as most of my journaling, logging, blogging and ideation takes place textually.
- This deserves a special treatment in the context of operating your environments.