Large Language Models

1. Overview

1.1. Definition:

Large Language Models (LLMs) are a class of artificial intelligence that leverage deep learning techniques, particularly neural networks, to understand, generate, and manipulate human language.

1.2. Architecture:

  • Primarily built on the transformer architecture, which enables efficient processing and context understanding over long text sequences.
  • Components include layers of encoders and decoders, self-attention mechanisms, and feedforward neural networks.
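
To make the self-attention component concrete, a minimal sketch of single-head scaled dot-product attention in NumPy follows. It is illustrative only: it omits the multi-head projections, masking, feedforward sublayers, and normalization of a full transformer block, and the array shapes are arbitrary choices for the example.

  import numpy as np

  def scaled_dot_product_attention(Q, K, V):
      """Minimal single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
      d_k = Q.shape[-1]
      scores = Q @ K.T / np.sqrt(d_k)                  # query-key similarity scores
      weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
      weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
      return weights @ V                               # attention-weighted sum of values

  # Toy example: 4 tokens with embedding dimension 8 (shapes chosen only for illustration).
  rng = np.random.default_rng(0)
  Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
  out = scaled_dot_product_attention(Q, K, V)          # -> array of shape (4, 8)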

1.3. Training Data:

  • LLMs are trained on vast datasets comprising text from books, articles, websites, and various online resources.
  • The quality and breadth of training data affect model performance and bias.

1.4. Applications:

2. Relevant Attributes

2.1. Context Window

  • Definition:
    • The context window refers to the amount of text (i.e., number of tokens) that a language model considers when generating a response.
    • This is essentially the "memory" the model uses to understand the current input and produce coherent outputs.
  • Technical Aspects:
    • The size of a context window is often set by the architecture of the model and can vary across different implementations.
    • It dictates the length of text the model can process at one time, influencing both coherence and relevance.
  • Operational Implications:
    • A smaller context window means that input exceeding the window must be truncated, losing information (a small truncation sketch follows this list).
    • A larger context window allows the model to draw from a more substantial portion of text, enhancing its ability to maintain coherence across a longer narrative.
  • Storage and Performance:
    • Larger context windows generally require more computational resources due to the increased memory footprint and processing time.
    • This can impact system performance, requiring optimization to handle longer context windows efficiently.
  • Limitations:
    • Even when a large context window is available, models may struggle with very long-term dependencies, as the influence of earlier parts of the text diminishes.
    • Attention weights are spread over more tokens in very long spans, which can reduce the relevance of distant context.
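
A minimal sketch of the truncation behavior noted above, assuming a naive whitespace split as a stand-in for a real subword tokenizer and an arbitrarily small window of 8 tokens:

  def truncate_to_context(text, max_tokens=8):
      """Keep only the most recent max_tokens tokens (a simple right-truncation policy)."""
      tokens = text.split()                   # whitespace split; real models count subword tokens
      if len(tokens) <= max_tokens:
          return text, []                     # everything fits inside the window
      dropped = tokens[:-max_tokens]          # the earliest tokens fall out of the window
      return " ".join(tokens[-max_tokens:]), dropped

  kept, dropped = truncate_to_context(
      "the quick brown fox jumps over the lazy dog again and again")
  # kept    -> 'jumps over the lazy dog again and again'  (last 8 tokens)
  # dropped -> ['the', 'quick', 'brown', 'fox']            (context the model never sees)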

3. Misc

3.1. The March of Nines w.r.t. LLMs

3.1.3. Fine Tuning

3.1.4. Custom UI/UX

3.2. Scaling Laws

The scaling laws for Large Language Models (LLMs) describe how model performance changes as model size, dataset size, and training compute are scaled.

  • N: number of model parameters
  • D: dataset size (training tokens)
  • F: compute budget (FLOPs)
  1. (N) Model Size:
    • Increasing the number of parameters in a model generally improves performance, though the gains diminish beyond a certain scale.
    • Larger models capture more complex patterns and nuances in data, which can help improve generalization.
  2. (D) Data Size:
    • More training data typically leads to better model performance, as it allows the model to learn from a wider array of examples and scenarios.
    • There's a synergy between model size and data size; a larger model may require significantly more data to reach optimal performance.
    • The Chinchilla paper recommends a training dataset of roughly 20 tokens per model parameter (a back-of-the-envelope sketch follows this list).
  3. (F) Compute Budget:
    • The amount of computational resources directly influences the model's training and inference times.
    • Efficient utilization of the compute budget involves balancing between model size and data size to achieve the desired performance.
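
A back-of-the-envelope sketch of the interplay above, assuming two commonly cited approximations: the Chinchilla heuristic of roughly 20 training tokens per parameter, and training FLOPs of roughly 6 * N * D for a dense transformer. The 7B-parameter example size is an arbitrary choice, not a recommendation.

  def chinchilla_estimate(n_params):
      """Rough compute-optimal numbers in the spirit of the Chinchilla paper.

      Rules of thumb used here (approximations, not exact results):
        D ~ 20 * N       (training tokens per parameter)
        F ~ 6 * N * D    (training FLOPs for a dense transformer)
      """
      n_tokens = 20 * n_params
      flops = 6 * n_params * n_tokens
      return n_tokens, flops

  tokens, flops = chinchilla_estimate(7e9)    # illustrative 7B-parameter model
  # tokens -> 1.4e11  (about 140B training tokens)
  # flops  -> ~5.9e21 training FLOPs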

3.2.1. Resources

3.3. Emergent Abilities in LLMs

  • Definition:
    • Emergent abilities are features or skills that manifest in large-scale neural networks and are not observed in smaller models.
  • Scale and Complexity:
    • The occurrence of emergent abilities is generally correlated with an increase in the model's parameters and training data.
    • Larger models have a more complex representation space, allowing for more sophisticated pattern recognition and problem-solving.
  • Examples:
    • Language translation without specific training for multilingual tasks.
    • Basic reasoning and common sense knowledge application.
    • Playing complex games or performing tasks that require strategy or planning.
  • Reasons for Emergence:
    • Large datasets provide diverse patterns and contexts, assisting in generalization.
    • Complex architectures allow for nuanced data transformations, uncovering higher-order patterns.
    • Spontaneous discovery of useful heuristics or shortcuts to perform tasks efficiently.
  • Research and Development Directions:
    • Increasingly accurate benchmarking and analysis to study when and how these abilities manifest.
    • Developing tools to better visualize and interpret the decision-making processes of LLMs.

3.4. Evaluating LLMs via Benchmarks

3.4.1. BIG-bench Suite

3.4.2. TruthfulQA

3.4.3. Massive Multitask Language Understanding (MMLU)

3.4.4. Word in Context (WiC)

3.5. Hyperparameters of an LLM

3.5.1. During Training:

  • Learning Rate:
    • Controls the step size for updating model weights.
    • A crucial hyperparameter as it affects convergence and stability.
  • Batch Size:
    • Number of training examples used in one iteration.
    • Larger batch sizes can stabilize gradient updates but require more memory.
  • Number of Epochs:
    • Defines how many times the entire training dataset is passed through the model.
    • Needed to ensure adequate learning without overfitting.
  • Optimizer Type:
    • Algorithms like Adam, SGD, or RMSProp used to adjust weights.
    • Different optimizers can result in varying convergence speeds and outcomes.
  • Dropout Rate:
    • Probability of dropping units in neural networks to prevent overfitting.
    • Applied to the network layers during training.
  • Weight Initialization:
    • Strategy for initializing model weights.
    • Influences how quickly and effectively the model converges.
  • Gradient Clipping:
    • Limits the maximum value of gradients to prevent exploding gradient issues.
    • Especially useful in training large networks.
  • Warmup Steps:
    • Number of initial training steps with a gradually increasing learning rate.
    • Helps avoid large, destabilizing updates early in training (a small schedule sketch follows this list).
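
A minimal sketch of how warmup and gradient clipping might look in a training loop, assuming a linear warmup followed by cosine decay and global-norm clipping; the constants (base learning rate, warmup steps, clip norm) are placeholder values, not recommendations. In practice a framework's built-in schedulers and clipping utilities would be used.

  import math

  def lr_at_step(step, base_lr=3e-4, warmup_steps=2000, total_steps=100_000):
      """Linear warmup to base_lr, then cosine decay (a common, not universal, schedule)."""
      if step < warmup_steps:
          return base_lr * step / warmup_steps                 # ramp up gradually
      progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
      return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

  def clip_gradients(grads, max_norm=1.0):
      """Scale the gradient vector down if its global norm exceeds max_norm."""
      norm = math.sqrt(sum(g * g for g in grads))
      return [g * max_norm / norm for g in grads] if norm > max_norm else grads

  print(lr_at_step(100))     # 1.5e-05 -- still warming up
  print(lr_at_step(2000))    # 3e-04   -- peak learning rate at the end of warmup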

3.5.2. During Inference:

  • Beam Size (in beam search):
    • Number of beams (alternate sequences) considered for output generation.
    • Balances computational cost against output quality (a toy beam-search sketch follows this list).
  • Temperature:
    • Controls randomness during sampling; higher values increase randomness.
    • Influences creativity versus coherence of generated text.
  • Top-k Sampling:
    • Limits the next word selection to the top k probable entries.
    • Reduces unpredictability by narrowing down the choice of words.
  • Top-p Sampling (Nucleus Sampling):
    • Samples from the smallest set of tokens whose cumulative probability exceeds p, so the candidate set adapts to the shape of the distribution.
    • Balances diversity and coherence more effectively than fixed k.
  • Max Token Length:
    • Maximum number of tokens to generate in the output.
    • Caps the length of the generated output and bounds inference cost.
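
A toy sketch of the beam-search idea described above, assuming a made-up step function in place of a real model's next-token distribution; the beam size, sequence length, and probabilities are arbitrary.

  import math

  def beam_search(step_fn, start, beam_size=3, max_len=5):
      """Generic beam search: keep the beam_size best partial sequences at each step."""
      beams = [(start, 0.0)]                           # (sequence, cumulative log-probability)
      for _ in range(max_len):
          candidates = []
          for seq, score in beams:
              for token, logp in step_fn(seq):         # step_fn stands in for the model
                  candidates.append((seq + [token], score + logp))
          beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
      return beams

  # Toy "model": always offers the same three continuations (probabilities are made up).
  def toy_step(seq):
      return [("a", math.log(0.5)), ("b", math.log(0.3)), ("c", math.log(0.2))]

  best = beam_search(toy_step, start=[], beam_size=2, max_len=3)
  # best[0] -> (['a', 'a', 'a'], 3 * log(0.5)), the highest-probability sequence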

3.5.3. Connections:

  • Learning Rate and Warmup Steps:
    • Both influence how learning is paced and stabilized during the early training stages.
  • Batch Size and Gradient Clipping:
    • Batch size affects the noise in gradient estimates, and clipping guards against occasional large gradients that could destabilize training.
  • Temperature, Top-k, and Top-p Sampling:
    • These hyperparameters work together to modulate the randomness and quality of generated text during inference (a combined sampling sketch follows this list).
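
A minimal sketch of how temperature, top-k, and top-p might be combined when sampling one next token from a vector of logits. The vocabulary size, logit values, and parameter settings are made up for the example, and real decoders may apply these filters in a slightly different order.

  import numpy as np

  def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9, rng=None):
      """Apply temperature, then top-k, then top-p (nucleus) filtering, then sample."""
      rng = rng or np.random.default_rng()
      logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)   # temperature scaling

      # Top-k: keep only the k highest-scoring tokens.
      if top_k < len(logits):
          cutoff = np.sort(logits)[-top_k]
          logits = np.where(logits < cutoff, -np.inf, logits)

      # Softmax over the surviving tokens.
      probs = np.exp(logits - np.max(logits))
      probs /= probs.sum()

      # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p.
      order = np.argsort(probs)[::-1]
      cumulative = np.cumsum(probs[order])
      keep = order[: np.searchsorted(cumulative, top_p) + 1]
      filtered = np.zeros_like(probs)
      filtered[keep] = probs[keep]
      filtered /= filtered.sum()

      return rng.choice(len(probs), p=filtered)

  # Toy vocabulary of 5 tokens with hand-picked logits (values are illustrative).
  token_id = sample_next_token([2.0, 1.5, 0.3, -1.0, -2.0], temperature=0.7, top_k=3)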

3.7. RLHF

4. Resources

4.1. Book: Building LLMs for production

Tags::ml:ai: