Attention
1. Overview
- Definition and Purpose
- Attention mechanisms are neural network components that compute a weighted combination of input elements, letting a model emphasize the parts of the input most relevant to the current prediction.
- They allow models to focus on specific parts of the input data when making predictions or generating outputs, much like human attention.
- Types of Attention
- Soft Attention: Assigns a probability distribution over the input elements, allowing the model to attend to different parts of the input differently.
- Hard Attention: Selects specific parts of the input with binary choices, often using reinforcement learning techniques.
- Self-Attention: Each element of the input sequence attends to all other elements, widely used in Transformer models.
- Bahdanau Attention (Additive Attention): Computes alignment weights with a small feed-forward network; commonly used in encoder-decoder architectures.
- Luong Attention (Multiplicative Attention): Utilizes dot-product or general methods for calculating alignment scores between input and output sequences.
- Applications
- Natural Language Processing: Improves sequence-to-sequence tasks such as translation, summarization, and question answering.
- Computer Vision: Helps in identifying relevant areas in images, enhancing object detection and image captioning.
- Speech Recognition: Aligns spoken and textual data, improving accuracy in transcriptions and translations.
- Key Models Leveraging Attention
- Transformer: Utilizes self-attention and has become foundational in NLP tasks.
- BERT: Uses bidirectional attention to achieve deep contextual understanding.
- GPT: Autoregressive transformer-based model that generates human-like text by predicting the next token in a sequence.
2. Working Mechanism
2.0.1. Self-Attention and Vectors
- Query (Q): Determines which sequence elements to focus on by projecting the input sequence into a different space.
- Key (K): Represents each sequence element in a way that allows the model to match it against the Query.
- Value (V): Contains the actual information to aggregate and use for the output.
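The Query, Key, and Value projections above can be sketched in NumPy. This is a minimal toy example: the dimensions (3 tokens, model width 4) and the random projection matrices are illustrative assumptions; in a trained model, \( W^Q, W^K, W^V \) are learned parameters.

```python
import numpy as np

# Hypothetical toy dimensions: 3 tokens, model width 4
rng = np.random.default_rng(42)
X = rng.normal(size=(3, 4))     # input sequence, one row per token

# Projection matrices (random here; learned during training in practice)
W_Q = rng.normal(size=(4, 4))
W_K = rng.normal(size=(4, 4))
W_V = rng.normal(size=(4, 4))

Q = X @ W_Q   # queries: what each token is looking for
K = X @ W_K   # keys: what each token offers for matching
V = X @ W_V   # values: the content that gets aggregated
```

Each of `Q`, `K`, `V` has one row per token, so token *i*'s query can be compared against every token's key.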
2.0.2. Mathematical Explanation
- Self-Attention Mechanism
- For a given word in a sequence, self-attention computes a weighted sum of values from all words in the sequence.
- The weights are determined by the dot product of Query and Key vectors, followed by a softmax operation.
- Formulas
- Computing the Attention Scores:
- Let \( X \) be the input sequence matrix where each row is an input word vector.
- Compute Queries, Keys, and Values as linear projections of the input:
\[
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
\]
where \( W^Q, W^K, W^V \) are learned parameter matrices specific to Query, Key, and Value respectively.
- Scaled Dot-Product Attention:
- Compute attention scores using dot products followed by scaling:
\[
\text{Attention Scores} = \frac{QK^T}{\sqrt{d_k}}
\]
where \( d_k \) is the dimension of the Key vectors; scaling by \( \sqrt{d_k} \) keeps the dot products from growing large, which would otherwise push the softmax into regions with vanishingly small gradients and slow convergence.
- Apply Softmax and Compute Output:
- Apply the softmax function to the scores to obtain the attention weights:
\[
\text{Attention Weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
\]
- Compute the final output as a weighted sum of the values:
\[
\text{Output} = \text{Attention Weights} \cdot V
\]
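The full pipeline above (scores, softmax, weighted sum of values) can be written as a short NumPy function. This is a sketch of single-head scaled dot-product attention, assuming the `Q`, `K`, `V` matrices have already been computed; the toy shapes are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V, as in the formulas above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ V, weights                        # weighted sum of values

# Toy example: 3 tokens, dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the sequence, so every row sums to 1, and `output` has the same shape as `V`.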
2.0.3. Connections and Insights
- Projection Matrices: The matrices \( W^Q, W^K, W^V \) transform the input to queries, keys, and values, allowing attention to operate over different semantic representations.
- Softmax Scaling: The division by \(\sqrt{d_k}\) stabilizes training: for roughly unit-variance queries and keys, the dot products have standard deviation about \(\sqrt{d_k}\), so without scaling the softmax saturates and gradients vanish.
- Attention Mechanism: It enables the transformer to focus differently on different input parts, capturing dependencies irrespective of their position.
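The scaling insight above can be checked numerically: with independent, unit-variance query and key components, the raw dot products have standard deviation near \(\sqrt{d_k}\), while the scaled ones stay near 1. The sample sizes below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256
n = 10_000  # number of sampled (query, key) pairs

q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
dots = (q * k).sum(axis=1)          # raw dot products

raw_std = dots.std()                # grows like sqrt(d_k) ~ 16
scaled_std = (dots / np.sqrt(d_k)).std()  # stays near 1
```

Without the scaling, a softmax over values with standard deviation around 16 puts nearly all its mass on one element, leaving almost no gradient for the rest.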
Tags::ml:ai: