Attention

1. Overview

  • Definition and Purpose
    • Attention mechanisms are computational techniques used to improve the efficiency and performance of neural networks.
    • They allow models to focus on specific parts of the input data when making predictions or generating outputs, much like human attention.
  • Types of Attention
    • Soft Attention: Assigns a probability distribution over the input elements, allowing the model to attend to different parts of the input differently.
    • Hard Attention: Selects specific parts of the input with binary choices, often using reinforcement learning techniques.
    • Self-Attention: Each element of the input sequence attends to all other elements, widely used in Transformer models.
    • Bahdanau Attention (Additive Attention): Computes weights using a feed-forward network, applying to encoder-decoder architectures.
    • Luong Attention (Multiplicative Attention): Utilizes dot-product or general methods for calculating alignment scores between input and output sequences.
  • Applications
    • Natural Language Processing: Improves sequence-to-sequence tasks such as translation, summarization, and question answering.
    • Computer Vision: Helps in identifying relevant areas in images, enhancing object detection and image captioning.
    • Speech Recognition: Aligns spoken and textual data, improving accuracy in transcriptions and translations.
  • Key Models Leveraging Attention
    • Transformer: Utilizes self-attention and has become foundational in NLP tasks.
    • BERT: Uses bidirectional attention to achieve deep contextual understanding.
    • GPT: Autoregressive transformer-based model that generates human-like text by predicting next words in sequences.

2. Working Mechanism

2.0.1. Self-Attention and Vectors

  • Query (Q): Determines which sequence elements to focus on by projecting the input sequence into a different space.
  • Key (K): Represents each sequence element in a way that allows the model to match it against the Query.
  • Value (V): Contains the actual information to aggregate and use for the output.

2.0.2. Mathematical Explanation

  1. Self-Attention Mechanism
    • For a given word in a sequence, self-attention computes a weighted sum of values from all words in the sequence.
    • The weights are determined by the dot product of Query and Key vectors, followed by a softmax operation.
  2. Formulas
    1. Computing the Attention Scores:
      • Let \( X \) be the input sequence matrix where each row is an input word vector.
      • Compute Queries, Keys, and Values as linear projections of the input: \[ Q = XW^Q, \quad K = XW^K, \quad V = XW^V \] where \( W^Q, W^K, W^V \) are learned parameter matrices specific to Query, Key, and Value respectively.
    2. Scaled Dot-Product Attention:
      • Compute attention scores using dot products followed by scaling: \[ \text{Attention Scores} = \frac{QK^T}{\sqrt{d_k}} \] where \( d_k \) is the dimension of the Key vectors (used for scaling to prevent large values which improve convergence).
    3. Apply Softmax and Compute Output:
      • Apply the softmax function to the scores to obtain the attention weights: \[ \text{Attention Weights} = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) \]
      • Compute the final output as a weighted sum of the values: \[ \text{Output} = \text{Attention Weights} \cdot V \]

2.0.3. Connections and Insights

  • Projection Matrices: The matrices \( W^Q, W^K, W^V \) transform the input to queries, keys, and values, allowing attention to operate over different semantic representations.
  • Softmax Scaling: The division by \(\sqrt{d_k}\) stabilizes the gradients by preventing the dot-product values from becoming too large.
  • Attention Mechanism: It enables the transformer to focus differently on different input parts, capturing dependencies irrespective of their position.
Tags::ml:ai: