Attention
1. Overview
- Definition and Purpose
- Attention mechanisms are neural network components that compute a weighted combination of input elements, letting a model emphasize the parts of the input most relevant to the current prediction.
- They allow models to focus on specific parts of the input data when making predictions or generating outputs, much like human attention.
- Types of Attention
- Soft Attention: Assigns a probability distribution over the input elements, allowing the model to attend to different parts of the input differently.
- Hard Attention: Selects specific parts of the input with binary choices, often using reinforcement learning techniques.
- Self-Attention: Each element of the input sequence attends to all other elements, widely used in Transformer models.
- Bahdanau Attention (Additive Attention): Computes alignment weights with a small feed-forward network; commonly used in encoder-decoder architectures.
- Luong Attention (Multiplicative Attention): Utilizes dot-product or general methods for calculating alignment scores between input and output sequences.
- Applications
- Natural Language Processing: Improves sequence-to-sequence tasks such as translation, summarization, and question answering.
- Computer Vision: Helps in identifying relevant areas in images, enhancing object detection and image captioning.
- Speech Recognition: Aligns spoken and textual data, improving accuracy in transcriptions and translations.
- Key Models Leveraging Attention
- Transformer: Utilizes self-attention and has become foundational in NLP tasks.
- BERT: Uses bidirectional attention to achieve deep contextual understanding.
- GPT: Autoregressive transformer-based model that generates human-like text by predicting the next token in a sequence.
2. Working Mechanism
2.0.1. Self-Attention and Vectors
- Query (Q): Determines which sequence elements to focus on by projecting the input sequence into a different space.
- Key (K): Represents each sequence element in a way that allows the model to match it against the Query.
- Value (V): Contains the actual information to aggregate and use for the output.
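The Query, Key, and Value projections above can be sketched in NumPy. This is a minimal toy example: the dimensions (3 tokens, model width 4) and the random projection matrices are illustrative assumptions; in a trained model, \( W^Q, W^K, W^V \) are learned parameters.

```python
import numpy as np

# Hypothetical toy dimensions: 3 tokens, model width 4
rng = np.random.default_rng(42)
X = rng.normal(size=(3, 4))     # input sequence, one row per token

# Projection matrices (random here; learned during training in practice)
W_Q = rng.normal(size=(4, 4))
W_K = rng.normal(size=(4, 4))
W_V = rng.normal(size=(4, 4))

Q = X @ W_Q   # queries: what each token is looking for
K = X @ W_K   # keys: what each token offers for matching
V = X @ W_V   # values: the content that gets aggregated
```

Each of `Q`, `K`, `V` has one row per token, so token *i*'s query can be compared against every token's key.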
2.0.2. Mathematical Explanation
- Self-Attention Mechanism
- For a given word in a sequence, self-attention computes a weighted sum of values from all words in the sequence.
- The weights are determined by the dot product of Query and Key vectors, followed by a softmax operation.
- Formulas
- Computing the Attention Scores:
- Let \( X \) be the input sequence matrix where each row is an input word vector.
- Compute Queries, Keys, and Values as linear projections of the input:
\[
Q = XW^Q, \quad K = XW^K, \quad V = XW^V
\]
where \( W^Q, W^K, W^V \) are learned parameter matrices specific to Query, Key, and Value respectively.
- Scaled Dot-Product Attention:
- Compute attention scores using dot products followed by scaling:
\[
\text{Attention Scores} = \frac{QK^T}{\sqrt{d_k}}
\]
where \( d_k \) is the dimension of the Key vectors; scaling by \( \sqrt{d_k} \) keeps the dot products from growing large, which would otherwise push the softmax into regions with vanishingly small gradients and slow convergence.
- Apply Softmax and Compute Output:
- Apply the softmax function to the scores to obtain the attention weights:
\[
\text{Attention Weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
\]
- Compute the final output as a weighted sum of the values:
\[
\text{Output} = \text{Attention Weights} \cdot V
\]
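The full pipeline above (scores, softmax, weighted sum of values) can be written as a short NumPy function. This is a sketch of single-head scaled dot-product attention, assuming the `Q`, `K`, `V` matrices have already been computed; the toy shapes are illustrative.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(QK^T / sqrt(d_k)) V, as in the formulas above."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ V, weights                        # weighted sum of values

# Toy example: 3 tokens, dimension 4
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the sequence, so every row sums to 1, and `output` has the same shape as `V`.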
2.0.3. Connections and Insights
- Projection Matrices: The matrices \( W^Q, W^K, W^V \) transform the input to queries, keys, and values, allowing attention to operate over different semantic representations.
- Softmax Scaling: The division by \(\sqrt{d_k}\) stabilizes training: for roughly unit-variance queries and keys, the dot products have standard deviation about \(\sqrt{d_k}\), so without scaling the softmax saturates and gradients vanish.
- Attention Mechanism: It enables the transformer to focus differently on different input parts, capturing dependencies irrespective of their position.
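The scaling insight above can be checked numerically: with independent, unit-variance query and key components, the raw dot products have standard deviation near \(\sqrt{d_k}\), while the scaled ones stay near 1. The sample sizes below are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 256
n = 10_000  # number of sampled (query, key) pairs

q = rng.normal(size=(n, d_k))
k = rng.normal(size=(n, d_k))
dots = (q * k).sum(axis=1)          # raw dot products

raw_std = dots.std()                # grows like sqrt(d_k) ~ 16
scaled_std = (dots / np.sqrt(d_k)).std()  # stays near 1
```

Without the scaling, a softmax over values with standard deviation around 16 puts nearly all its mass on one element, leaving almost no gradient for the rest.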
Tags::ml:ai: