# CHAPTER 4: Alignment

**CSC 375/575 - Generative AI**  
**Prof. Rongyu Lin, Quinnipiac University**

---

### Topics Covered

- **4.1 An Overview of LLM Alignment** - Understanding alignment challenges and approaches
- **4.2 Instruction Alignment** - Supervised fine-tuning for instruction-following
- **4.3 RLHF (Reinforcement Learning from Human Feedback)** - Training LLMs with human preferences
- **4.4 Advanced Alignment Methods** - DPO, process-based alignment, and inference-time techniques

### Learning Objectives

By the end of this lecture, students will be able to:

1. **Understand the alignment problem** in LLMs and why it's critical for safe, effective AI systems
2. **Implement supervised fine-tuning (SFT)** for instruction alignment and understand data acquisition methods
3. **Apply reinforcement learning fundamentals** to LLM training, including policy gradients, value functions, and advantage functions
4. **Explain RLHF components** including reward models, Bradley-Terry model, and PPO algorithm
5. **Distinguish between alignment approaches** including DPO vs. RLHF, outcome-based vs. process-based methods
6. **Implement inference-time alignment** techniques such as Best-of-N sampling

## Setup

First, let's import the necessary libraries for our demonstrations.

In [37]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple, Optional
from IPython.display import Image, display
import torch
import torch.nn as nn
import torch.nn.functional as F

# Set random seed for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("Setup complete!")

Setup complete!


## Introduction: What is Alignment?

### The Alignment Problem

**Alignment** in the context of LLMs refers to ensuring that model outputs align with human expectations, values, and preferences. While the term "alignment" has roots in traditional NLP (e.g., word alignment in machine translation), its modern usage focuses on a more critical challenge:

#### Key Challenges:

1. **Instruction Following**: LLMs pre-trained with next-token prediction may not naturally follow user instructions
2. **Safety Concerns**: Models may generate harmful, biased, or inappropriate content
3. **Value Alignment**: Ensuring outputs reflect human values and ethical standards
4. **Context Appropriateness**: Responses should be contextually suitable and helpful

### Why Can't Pre-training Solve Everything?

While LLMs are trained on vast amounts of data, several factors make alignment necessary:

- **Incomplete Coverage**: Training data cannot cover all possible tasks and human preferences
- **Conflicting Signals**: Pre-training data may contain contradictory information about appropriate behavior
- **Specific Requirements**: Users have specific needs that general pre-training cannot anticipate
- **Safety Constraints**: Pre-training data may include harmful content that should not be replicated

### The Pre-train-then-Align Paradigm

Modern LLM development follows a two-stage approach:

1. **Pre-training Stage**: Train on massive datasets using next-token prediction
2. **Alignment Stage**: Fine-tune to follow instructions, align with human preferences, and ensure safety

## 4.1 An Overview of LLM Alignment

### Three Main Approaches to Alignment

Alignment can be achieved through different methods, each with unique strengths:

#### **1. Fine-tuning with Labeled Data (Instruction Alignment)**

**Method**: Supervised Fine-Tuning (SFT)
- Train LLM on task-specific instruction-output pairs
- Dataset: Collection of (instruction + input, expected output) pairs
- **Example**: Question-answer pairs, summarization tasks, translation examples

**When to use**:
- Clear input-output relationships
- Straightforward data annotation
- Well-defined task objectives

#### **2. Fine-tuning with Reward Models (RLHF)**

**Method**: Reinforcement Learning from Human Feedback
- Train a reward model to score outputs based on human preferences
- Use RL to optimize LLM policy to maximize reward
- **Example**: Preferring helpful, harmless, and honest responses

**When to use**:
- Complex, subjective human preferences
- Hard-to-formalize quality criteria
- Nuanced behavioral alignment (tone, style, values)

#### **3. Inference-time Alignment (Prompting and Reranking)**

**Method**: Dynamic adaptation without training
- Prompting: Guide behavior through carefully crafted prompts
- Reranking: Generate multiple outputs, select best using scoring
- **Example**: Best-of-N sampling with reward model

**When to use**:
- No training/fine-tuning resources available
- Task-specific adaptation needed
- Quick prototyping and iteration
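
The reranking idea above can be sketched in a few lines. This is a minimal illustration of Best-of-N sampling, not a production implementation: `generate` and `score` are hypothetical callables standing in for an LLM sampler and a reward model, and the toy versions below are made up purely so the example runs.

```python
import random
from typing import Callable, List


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 4) -> str:
    """Sample n candidate responses and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))


# Toy stand-ins (assumptions): the "generator" picks a canned reply at random,
# and the "reward model" simply prefers longer replies.
rng = random.Random(0)
CANNED = [
    "Sure.",
    "Sure, here is a short answer.",
    "Sure, here is a detailed, step-by-step answer.",
]


def toy_generate(prompt: str) -> str:
    return rng.choice(CANNED)


def toy_score(prompt: str, response: str) -> float:
    return float(len(response))


best = best_of_n("Explain DST.", toy_generate, toy_score, n=8)
print(best)
```

No model weights change here: the cost is paid at inference time, since each query requires `n` generations plus `n` scoring calls.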

### Typical Alignment Pipeline

In practice, these methods are often used sequentially:

```
Pre-training → SFT → RLHF → Prompting (at inference)
```

Each stage refines the model's behavior, building on the previous stage's improvements.

---

<img src="images/fig_4_1_pretrain_align.png" width="50%" alt="Pre-train-then-align Paradigm" style="display: block; margin: 0 auto;">

<p style="text-align: center; font-style: italic; margin-top: 15px; line-height: 1.6;">
<strong>Figure 4.1:</strong> Schematic illustration of the pre-train-then-align method. In the pre-training stage, we train an LLM on vast amounts of data using next token prediction. Then, in the alignment stage, we align the LLM to user instructions, intents, and preferences.
</p>



## 4.2 Instruction Alignment

### What is Instruction-Following?

**Instruction-following** is the ability of an LLM to understand and execute user instructions correctly. 

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

<strong>❌ Without Instruction Fine-tuning (Pre-trained LLM fails):</strong><br><br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">Instruction:</span> Summarize this text in three sentences.<br><br>

<span style="background-color: #87ceeb; padding: 2px 4px;">Input:</span><br>
Daylight Savings Time (DST) - the process of moving clocks forward by one hour in the summer - was started in Germany in 1916. The practice was implemented to conserve energy during World War I...<br><br>

<span style="background-color: #ffcdd2; padding: 2px 4px;">Output (WRONG):</span><br>
<u>...and later adopted by other European countries. The United States began using DST in 1918, though it was repealed after the war ended. It was reinstated during World War II...</u> ❌<br><br>

<strong>Problem:</strong> The model continues writing the article instead of summarizing!<br><br>

<hr style="border: 1px solid #ddd; margin: 12px 0;">

<strong>✓ With Instruction Fine-tuning (Aligned LLM succeeds):</strong><br><br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">Instruction:</span> Summarize this text in three sentences.<br><br>

<span style="background-color: #87ceeb; padding: 2px 4px;">Input:</span><br>
Daylight Savings Time (DST) - the process of moving clocks forward by one hour in the summer - was started in Germany in 1916...<br><br>

<span style="background-color: #c8e6c9; padding: 2px 4px;">Output (CORRECT):</span><br>
Daylight Savings Time (DST) was first implemented in Germany in 1916 to conserve energy during wartime. The practice involves moving clocks forward by one hour during summer months. Many countries adopted this system, though its effectiveness and necessity remain debated. ✓

</div>

**Solution**: Instruction fine-tuning adapts the LLM to follow instructions accurately.

<h3 style="font-size: 1.5em; color: #003865; margin-top: 20px;">4.2.1 Supervised Fine-Tuning (SFT)</h3>

### Core Concept

Unlike standard language model training, which maximizes the joint probability $\text{Pr}_{\theta}(\mathbf{x}, \mathbf{y})$ of the whole sequence, SFT focuses on **conditional generation**:

$$\tilde{\theta} = \arg\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x})$$

where:
- $\mathbf{x}$: instruction + user input
- $\mathbf{y}$: expected output
- $\mathcal{D}$: SFT dataset of (x, y) pairs

### Mathematical Formulation

The conditional log-probability decomposes as:

$$\log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x}) = \sum_{i=1}^{n} \log \text{Pr}_{\theta}(y_i \mid \mathbf{x}, \mathbf{y}_{<i})$$

**Key Implementation Detail**: 
- Concatenate input $\mathbf{x}$ and output $\mathbf{y}$ into single sequence
- During training: **Set loss for input tokens to 0**, compute loss **only for output tokens**
- This ensures the model learns to generate outputs conditioned on inputs
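
The masking trick above can be illustrated without any deep learning framework. The sketch below assumes we already have per-token log-probabilities from the model; the token IDs and probabilities are made up, and `build_sft_sequence` and `masked_nll` are illustrative helper names, not part of any library.

```python
import math
from typing import List, Tuple


def build_sft_sequence(input_ids: List[int],
                       output_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Concatenate x and y; the mask is 0 for input tokens and 1 for output tokens."""
    tokens = input_ids + output_ids
    loss_mask = [0] * len(input_ids) + [1] * len(output_ids)
    return tokens, loss_mask


def masked_nll(token_log_probs: List[float], loss_mask: List[int]) -> float:
    """Negative log-likelihood summed over output tokens only.

    token_log_probs[i] is log Pr(token_i | all tokens before i) from the model.
    """
    return -sum(lp * m for lp, m in zip(token_log_probs, loss_mask))


# Toy example: 3 input tokens followed by 2 output tokens.
tokens, mask = build_sft_sequence([11, 12, 13], [21, 22])
log_probs = [math.log(p) for p in (0.1, 0.2, 0.3, 0.5, 0.25)]
loss = masked_nll(log_probs, mask)
print(mask, round(loss, 4))
```

Only the last two log-probabilities contribute, so the loss equals $-(\log 0.5 + \log 0.25)$. In PyTorch the same effect is commonly obtained by setting the labels at input positions to the `ignore_index` of `F.cross_entropy` (default `-100`).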

### SFT Dataset Example

<table style="width: 100%; border-collapse: collapse; margin: 20px 0; font-size: 0.9em; line-height: 1.5;">
  <thead>
    <tr style="background-color: #003865; color: white;">
      <th style="width: 25%; padding: 12px; text-align: left; border: 1px solid #ddd; font-weight: bold;">Instruction + Input</th>
      <th style="width: 25%; padding: 12px; text-align: left; border: 1px solid #ddd; font-weight: bold;">Output</th>
    </tr>
  </thead>
  <tbody>
    <tr style="background-color: #f9f9f9;">
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        <strong>Summarize the following article.</strong><br>
        In recent years, solar energy has seen unprecedented growth...
      </td>
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        Solar energy has experienced rapid growth recently, becoming the fastest-growing renewable energy source. This growth is driven by technological advances and decreasing costs.
      </td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        <strong>Extract the main financial figures from the following earnings report.</strong><br>
        The company reported a revenue of $10 million...
      </td>
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        Revenue: $10 million, Profit Margin: 15%
      </td>
    </tr>
    <tr style="background-color: #f9f9f9;">
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        <strong>Classify the following email as spam or not spam.</strong><br>
        Congratulations! You've won a $500 gift card...
      </td>
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        Spam
      </td>
    </tr>
    <tr style="background-color: #ffffff;">
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        <strong>Provide a solution to the following technical issue.</strong><br>
        My computer is running slow and often freezes.
      </td>
      <td style="padding: 12px; border: 1px solid #ddd; vertical-align: top;">
        First, check for malware using antivirus software. Clear temporary files and disk space...
      </td>
    </tr>
  </tbody>
</table>

---


<img src="images/fig_4_2_sft_illustration.png" width="60%" alt="SFT Illustration" style="display: block; margin: 0 auto;">

<p style="text-align: center; font-style: italic; margin-top: 15px; line-height: 1.6;">
<strong>Figure 4.2:</strong> Illustration of supervised fine-tuning for LLMs. The input and output are concatenated into a single sequence. During forward pass, the LLM runs as usual. During backward pass, the loss is computed only for the output part (set loss for input part to 0).
</p>

### Multi-Round Conversations (Chatbots)

For conversational models, we extend SFT to handle multiple turns:

$$\tilde{\theta} = \arg\max_{\theta} \sum_{k=1}^{K} \log \text{Pr}_{\theta}(\mathbf{y}^k \mid \mathbf{x}^1, \mathbf{y}^1, \ldots, \mathbf{x}^k)$$

where $K$ is the number of conversation rounds.

**Efficient Implementation**:
- Treat entire conversation as single sequence: $\mathbf{x}^1, \mathbf{y}^1, \mathbf{x}^2, \mathbf{y}^2, \ldots, \mathbf{x}^K, \mathbf{y}^K$
- Set loss to 0 for all user inputs ($\mathbf{x}^k$)
- Compute loss only for chatbot responses ($\mathbf{y}^k$)
- Single forward pass processes entire conversation!

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

<strong>Example: Healthcare Chatbot Conversation</strong><br><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">User:</span> I've been feeling very tired lately.<br>
<span style="background-color: #c8e6c9; padding: 2px 4px;">Chatbot:</span> I'm sorry to hear that. Besides feeling tired, have you noticed any other symptoms?<br><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">User:</span> Yes, I'm also experiencing headaches frequently.<br>
<span style="background-color: #c8e6c9; padding: 2px 4px;">Chatbot:</span> How long have these symptoms been going on?<br><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">User:</span> About a week now.<br>
<span style="background-color: #c8e6c9; padding: 2px 4px;">Chatbot:</span> It might be good to check in with a healthcare professional. Would you like help setting up an appointment?<br><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">User:</span> Yes, please. Can it be after work hours?<br>
<span style="background-color: #c8e6c9; padding: 2px 4px;">Chatbot:</span> Sure, I can arrange that. There are slots available next Wednesday and Thursday after 5 PM. Which day works better for you?<br><br>

<hr style="border: 1px solid #ddd; margin: 12px 0;">

<strong>Training Process:</strong><br>
• Loss computed <strong>ONLY</strong> for <span style="background-color: #c8e6c9; padding: 2px 4px;">chatbot responses</span><br>
• <span style="background-color: #ffb6c1; padding: 2px 4px;">User inputs</span> are used as context but <strong>NOT</strong> trained on<br>
• Single forward pass through entire conversation sequence

</div>
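
As a rough sketch of the single-sequence trick, the snippet below flattens a conversation into one token list plus a loss mask. The whitespace "tokenizer" and the `(role, text)` tuple format are assumptions made for illustration; a real pipeline would use the model's tokenizer and chat template.

```python
from typing import List, Tuple

Turn = Tuple[str, str]  # (role, text), where role is "user" or "chatbot"


def flatten_conversation(turns: List[Turn]) -> Tuple[List[str], List[int]]:
    """Flatten a conversation into one token sequence plus a loss mask."""
    tokens: List[str] = []
    loss_mask: List[int] = []
    for role, text in turns:
        turn_tokens = text.split()  # whitespace split, for illustration only
        tokens.extend(turn_tokens)
        # Train only on chatbot tokens; user tokens serve as context.
        loss_mask.extend([1 if role == "chatbot" else 0] * len(turn_tokens))
    return tokens, loss_mask


conversation = [
    ("user", "I feel tired"),
    ("chatbot", "Any other symptoms ?"),
    ("user", "Headaches"),
    ("chatbot", "How long ?"),
]
tokens, loss_mask = flatten_conversation(conversation)
print(list(zip(tokens, loss_mask)))
```

Because every chatbot turn is masked in, one forward pass yields the training signal for all $K$ responses at once instead of requiring $K$ separate passes over growing prefixes.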

---


<img src="images/fig_4_3_multi_round_sft.png" width="60%" alt="Multi-round SFT" style="display: block; margin: 0 auto;">

<p style="text-align: center; font-style: italic; margin-top: 15px; line-height: 1.6;">
<strong>Figure 4.3:</strong> Supervised fine-tuning for conversational models. The conversation history is concatenated into a single sequence, and loss is computed only for chatbot responses, not user inputs.
</p>


<h3 style="font-size: 1.5em; color: #003865; margin-top: 20px;">4.2.2 Fine-tuning Data Acquisition</h3>

### Challenge

Creating high-quality SFT data is **expensive and time-consuming**. We need efficient methods to acquire diverse, high-quality instruction-output pairs.

### Two Main Approaches:

1. **Manually Generated Data**: Human annotators create examples
2. **Automatically Generated Data**: Use LLMs to generate synthetic data

<h4 style="font-size: 1.3em; color: #003865; margin-top: 15px;">4.2.2.1 Manually Generated Data</h4>

#### Process:

1. **Write Prompt Templates**: Define task structure
2. **Collect Source Data**: Gather relevant examples
3. **Fill Templates**: Create instruction-output pairs

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

<strong>Example: Machine Translation Template</strong><br><br>

<span style="background-color: #ffeb3b; padding: 3px 6px; font-weight: bold;">Template:</span><br><br>

<strong>Instruction:</strong> Translate the text from <span style="background-color: #87ceeb; padding: 2px 4px;">[source_lang]</span> to <span style="background-color: #87ceeb; padding: 2px 4px;">[target_lang]</span>.<br>
<strong>User Input:</strong> <span style="background-color: #87ceeb; padding: 2px 4px;">[source_text]</span><br>
<strong>Output:</strong> <span style="background-color: #87ceeb; padding: 2px 4px;">[target_text]</span><br><br>

<hr style="border: 1px solid #ddd; margin: 12px 0;">

<span style="background-color: #90ee90; padding: 3px 6px; font-weight: bold;">Filled Instance:</span><br><br>

<strong>Instruction:</strong> Translate the text from <span style="background-color: #c8e6c9; padding: 2px 4px;">English</span> to <span style="background-color: #c8e6c9; padding: 2px 4px;">Chinese</span>.<br>
<strong>User Input:</strong> <span style="background-color: #c8e6c9; padding: 2px 4px;">How's the weather today?</span><br>
<strong>Output:</strong> <strong style="background-color: #4caf50; color: white; padding: 2px 4px;">今天天气怎么样?</strong>

</div>
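
The template-filling step can be sketched with ordinary string formatting. The record field names below mirror the placeholders in the template above; the `fill_template` helper and the dictionary layout of the resulting pair are illustrative assumptions.

```python
TEMPLATE = (
    "Instruction: Translate the text from {source_lang} to {target_lang}.\n"
    "User Input: {source_text}"
)


def fill_template(record: dict) -> dict:
    """Turn one parallel-text record into an (input, output) SFT pair."""
    return {
        "input": TEMPLATE.format(
            source_lang=record["source_lang"],
            target_lang=record["target_lang"],
            source_text=record["source_text"],
        ),
        "output": record["target_text"],
    }


records = [
    {
        "source_lang": "English",
        "target_lang": "Chinese",
        "source_text": "How's the weather today?",
        "target_text": "今天天气怎么样?",
    },
]
sft_pairs = [fill_template(r) for r in records]
print(sft_pairs[0]["input"])
print(sft_pairs[0]["output"])
```

Applying the same function to every record of an existing parallel corpus yields one SFT pair per sentence pair; writing several paraphrased templates for the same task is one simple way to add prompt diversity.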

#### Data Sources:

1. **Existing NLP Tasks**: Transform benchmark datasets into prompt format
2. **Web Data**: QA websites, forums (naturally occurring questions and answers)
3. **Crowdsourcing**: Users provide questions, experts or LLMs provide answers

#### Importance of Diversity:

- More diverse data → Better robustness and generalization
- Multiple tasks and prompt styles → Enhanced instruction-following ability

<h4 style="font-size: 1.3em; color: #003865; margin-top: 15px;">4.2.2.2 Automatically Generated Data (Self-Instruct)</h4>

#### Motivation

Manual data generation is limited by:
- Human experience and creativity
- Scalability constraints
- Annotation costs

**Solution**: Use LLMs to generate both instructions and outputs!

### Self-Instruct Algorithm

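An iterative process to bootstrap diverse instruction data from a small seed set:

1. **Initialize**: Start an instruction pool with a small set of seed tasks
2. **Sample**: Draw a few instructions from the pool as demonstrations
3. **Generate**: Prompt the LLM with these demonstrations to produce a new instruction, then a matching input-output sample for it
4. **Validate**: Check the generated sample and, if it passes, add it back to the pool
5. **Repeat**: Return to step 2 until the pool is large enough
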
### Prompt Templates

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

<strong>Template 1: Instruction Generation</strong><br><br>

You are provided several different instructions for performing some tasks.<br> 
Please generate an instruction based on these.<br><br>

<span style="background-color: #c8e6c9; padding: 2px 4px;">Task 1:</span> {instruction1}<br>
<span style="background-color: #c8e6c9; padding: 2px 4px;">Task 2:</span> {instruction2}<br>
<span style="background-color: #c8e6c9; padding: 2px 4px;">Task 3:</span> {instruction3}<br>
<span style="background-color: #c8e6c9; padding: 2px 4px;">Task 4:</span> {instruction4}<br>
<span style="background-color: #ffeb3b; padding: 2px 4px; font-weight: bold;">New Task:</span> <u>______</u><br><br>

<hr style="border: 1px solid #ddd; margin: 12px 0;">

<strong>Template 2: Input-Output Generation</strong><br><br>

You are provided with a set of input-output samples, each composed of<br> 
an instruction, a user input, and an output. Please generate a new<br> 
sample based on these.<br><br>

<span style="background-color: #c8e6c9; padding: 2px 4px;">Sample 1:</span> {instruction1}<br>
<strong>Input:</strong> {user_input1}<br>
<strong>Output:</strong> {output1}<br><br>

<span style="background-color: #c8e6c9; padding: 2px 4px;">Sample 2:</span> {instruction2}<br>
<strong>Input:</strong> {user_input2}<br>
<strong>Output:</strong> {output2}<br><br>

<span style="background-color: #ffeb3b; padding: 2px 4px; font-weight: bold;">New Sample:</span> {new_instruction}<br>
<u>______</u>

</div>

---


<img src="images/fig_4_4_self_instruct.png" width="40%" alt="Self-Instruct Process" style="display: block; margin: 0 auto;">

<p style="text-align: center; font-style: italic; margin-top: 15px; line-height: 1.6;">
<strong>Figure 4.4:</strong> Illustration of the Self-Instruct process. An instruction pool is maintained with seed tasks. In each iteration, instructions are drawn from the pool, used to generate new instructions, which are then validated and added back to the pool.
</p>
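
The loop in Figure 4.4 can be sketched as follows. This is a simplified illustration under stated assumptions: `generate_instruction` and `is_valid` are hypothetical callables standing in for an LLM call (filled with Template 1) and a quality filter, and the input-output generation step (Template 2) is omitted for brevity.

```python
import random
from typing import Callable, List


def self_instruct(seed_instructions: List[str],
                  generate_instruction: Callable[[List[str]], str],
                  is_valid: Callable[[str, List[str]], bool],
                  num_iterations: int,
                  num_demos: int = 4,
                  seed: int = 0) -> List[str]:
    """Grow an instruction pool by repeatedly sampling, generating, and filtering."""
    rng = random.Random(seed)
    pool = list(seed_instructions)
    for _ in range(num_iterations):
        # Draw a few instructions from the pool as in-context demonstrations.
        demos = rng.sample(pool, k=min(num_demos, len(pool)))
        # Ask the (stand-in) LLM for a new instruction based on the demos.
        candidate = generate_instruction(demos)
        # Keep the candidate only if it passes validation.
        if is_valid(candidate, pool):
            pool.append(candidate)
    return pool


# Toy stand-ins (assumptions): the "LLM" recombines the first words of the demos,
# and the validator rejects empty strings and exact duplicates.
def toy_generate(demos: List[str]) -> str:
    return " and ".join(d.split()[0] for d in demos) + " the text"


def toy_is_valid(candidate: str, pool: List[str]) -> bool:
    return bool(candidate.strip()) and candidate not in pool


seeds = ["Summarize the article", "Translate the sentence",
         "Classify the email", "Extract the figures"]
pool = self_instruct(seeds, toy_generate, toy_is_valid, num_iterations=5)
print(len(pool), pool[-1])
```

Because newly accepted instructions are added back to the pool, later iterations can draw on generated instructions as demonstrations, which is what lets a small seed set bootstrap a much larger one.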


### Improvements to Self-Instruct

1. **Input Inversion**: For tasks with biased outputs (e.g., classification)
   - Generate output label first
   - Then generate input that leads to that output
   - Ensures balanced class distribution
   
   **Example: Sentiment Classification**
   - Traditional approach: "Generate a movie review" → Often produces negative reviews (if model is biased)
   - Input Inversion: "Generate a movie review that would be classified as POSITIVE" → Guarantees positive example
   
   **Why this matters**: If your dataset has 90% negative and 10% positive reviews, traditional augmentation maintains this imbalance. Input Inversion lets you explicitly generate 50% positive examples to balance the dataset.

2. **Evolutionary Algorithms**: Generate more diverse instructions
   - Mutation: Modify existing instructions
   - Crossover: Combine multiple instructions
   - Selection: Keep most diverse and high-quality examples

3. **Synthetic Pre-training Data**: Recent trend to use synthetic data even in pre-training

### Applications

- **Bootstrap** from small seed set (e.g., 175 seed tasks)
- **Augment** limited expert-annotated data
- **Common practice** in NLP: self-generated data has long been used successfully in areas such as parsing and machine translation

In [None]:
# Example: Input Inversion for Data Generation

# Traditional approach - may maintain dataset bias
traditional_prompt = """
Generate a customer support ticket.
"""

# Input Inversion approach - explicitly control output distribution
input_inversion_examples = [
    # For balanced sentiment classification
    {
        "target_label": "POSITIVE",
        "prompt": "Generate a movie review that expresses positive sentiment about the film.",
        "example_output": "The cinematography was breathtaking, and the performances were outstanding. A must-watch!"
    },
    {
        "target_label": "NEGATIVE", 
        "prompt": "Generate a movie review that expresses negative sentiment about the film.",
        "example_output": "The plot was confusing and the acting felt forced. Disappointing overall."
    },
    
    # For balanced priority classification
    {
        "target_label": "URGENT",
        "prompt": "Generate a customer support ticket that would be classified as URGENT priority.",
        "example_output": "Production server is down! All customer transactions are failing. Need immediate help!"
    },
    {
        "target_label": "LOW",
        "prompt": "Generate a customer support ticket that would be classified as LOW priority.", 
        "example_output": "Could you update the color scheme on the settings page when you have time?"
    }
]

# Implementation example
def generate_balanced_dataset(num_samples_per_class, generate_fn=None):
    """
    Use Input Inversion to generate balanced training data.

    generate_fn: callable mapping a prompt string to generated text.
    If None, a placeholder string is stored and no LLM is called.
    """
    dataset = []

    # Define target classes
    classes = ["POSITIVE", "NEGATIVE"]

    for target_class in classes:
        for _ in range(num_samples_per_class):
            # Generate prompt that explicitly asks for the target class
            prompt = f"Generate a text that would be classified as {target_class}."

            # In practice, generate_fn would wrap a call to your LLM
            text = generate_fn(prompt) if generate_fn else f"<generated {target_class} text>"

            # The label is fixed by construction rather than predicted afterwards
            dataset.append({"text": text, "label": target_class})

    return dataset

# Result: label counts are balanced regardless of the model's natural bias
# (assuming the LLM actually follows the requested label)
dataset = generate_balanced_dataset(num_samples_per_class=3)
label_counts = {c: sum(d["label"] == c for d in dataset) for c in ("POSITIVE", "NEGATIVE")}
print(label_counts)

<h3 style="font-size: 1.5em; color: #003865; margin-top: 20px;">4.2.3 Fine-tuning with Less Data</h3>

### The Problem

Large SFT datasets are computationally expensive:
- **FLAN**: 15 million samples from 1,836 tasks
- Training on such datasets requires significant resources

### Key Insight: **Superficial Alignment Hypothesis**

> "Core abilities and knowledge are established during pre-training. Fine-tuning merely activates instruction-response mappings already learned."

**Implication**: A small amount of specialized data may be sufficient to adapt the model!

### Evidence

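**Study**: LLaMA 65B fine-tuned on **1,000 carefully curated samples**
- Competitive with models fine-tuned on much larger instruction datasets
- Key: sample quality and diversity matter more than sheer volume
- Related work: score each instruction-response pair with a GPT-3.5-based assessor and fine-tune only on the high-scoring pairs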

### Strategies for Efficient Fine-tuning

1. **Parameter-efficient fine-tuning methods** (LoRA, adapters, etc.)
2. **Data selection**: Choose most relevant and impactful examples
3. **Quality over quantity**: Focus on diverse, high-quality samples
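
Data selection can be sketched as a simple score-and-keep step. The scorer below is a deliberately crude stand-in (it counts output words); in practice the score would come from something like the LLM-based quality assessment mentioned above. `select_top_k` and `toy_quality_score` are illustrative names, not an established API.

```python
from typing import Callable, Dict, List


def select_top_k(samples: List[Dict[str, str]],
                 quality_score: Callable[[Dict[str, str]], float],
                 k: int) -> List[Dict[str, str]]:
    """Keep the k samples with the highest quality score."""
    return sorted(samples, key=quality_score, reverse=True)[:k]


# Toy scorer (assumption): longer outputs score higher, as a crude proxy for detail.
def toy_quality_score(sample: Dict[str, str]) -> float:
    return float(len(sample["output"].split()))


samples = [
    {"instruction": "Explain overfitting.", "output": "It is bad."},
    {"instruction": "Explain overfitting.",
     "output": "The model memorizes training data and generalizes poorly."},
    {"instruction": "Say hi.", "output": "Hi!"},
]
selected = select_top_k(samples, toy_quality_score, k=2)
print([s["output"] for s in selected])
```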

### Related Discovery

Instruction-following can be achieved by fine-tuning **only on responses** without instructions!
- Suggests implicit learning of instruction-following behavior
- Model learns patterns during pre-training, fine-tuning just "unlocks" them

<h3 style="font-size: 1.5em; color: #003865; margin-top: 20px;">4.2.5 Using Weak Models to Improve Strong Models</h3>

### The Challenge

**Problem**: Strong LLMs are already excellent. Where do we find superior models for supervision?

**Weak-to-Strong Generalization**: Can smaller models supervise larger ones?

**Future Relevance**: This mirrors a scenario in which humans must supervise AI systems that are more capable than they are!

### Method 1: Simple Knowledge Distillation

**Core Idea**: Use weak model to generate training data for strong model

$$\tilde{\theta} = \arg\max_{\theta} \sum_{\mathbf{x} \in X} \log \text{Pr}_{\theta}^{s}(\hat{\mathbf{y}} \mid \mathbf{x})$$

where $\hat{\mathbf{y}}$ is generated by the weak model and $\text{Pr}_{\theta}^{s}$ is the strong model being trained.

**How it works**:
1. Take a collection of inputs (instructions) $X$
2. Use weak model (e.g., GPT-2) to generate outputs: $\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \text{Pr}^{w}(\mathbf{y} \mid \mathbf{x})$
3. Train strong model (e.g., GPT-4) on these (input, weak_output) pairs
4. Surprisingly, the strong model often performs BETTER than the weak model!
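
Steps 1 and 2 amount to building a pseudo-labeled dataset. The sketch below uses a keyword rule as a stand-in for the weak model (an assumption for illustration only); the strong-model fine-tuning in step 3 is the usual SFT objective applied to these pairs and is not shown.

```python
from typing import Callable, List, Tuple


def build_weak_supervision(inputs: List[str],
                           weak_model: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Label each input x with the weak model's output y_hat."""
    return [(x, weak_model(x)) for x in inputs]


# Toy weak model (assumption): a keyword rule standing in for a small LLM.
def toy_weak_model(x: str) -> str:
    return "POSITIVE" if "good" in x.lower() else "NEGATIVE"


X = ["A good movie.", "Terrible pacing.", "Good acting, weak plot."]
weak_pairs = build_weak_supervision(X, toy_weak_model)
print(weak_pairs)
# The strong model would then be fine-tuned on weak_pairs by maximizing
# log Pr(y_hat | x); that training step is omitted here.
```

Note that the weak labels can be wrong (the third example is arguably mixed), which is exactly why it is interesting that the strong model can still end up better than its supervisor.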

### Performance Gap Recovered (PGR)

Metric to measure weak-to-strong effectiveness:

$$\text{PGR} = \max\left\{0, \frac{P_{\text{weak} \rightarrow \text{strong}} - P_{\text{weak}}}{P_{\text{ceiling}} - P_{\text{weak}}}\right\}$$

where:
- $P_{\text{weak}}$: Weak model baseline performance (e.g., 60% accuracy)
- $P_{\text{weak} \rightarrow \text{strong}}$: Strong model trained with weak supervision (e.g., 88% accuracy)
- $P_{\text{ceiling}}$: Strong model trained with ground truth (e.g., 95% accuracy)

**Example Calculation**:
- If weak model: 60%, strong with weak supervision: 88%, ceiling: 95%
- PGR = (88 - 60) / (95 - 60) = 28/35 = 0.8
- This means we recovered 80% of the possible performance gain!

**Results**: PGR ≈ 0.8 on 22 NLP classification tasks (substantial recovery!)
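
The PGR formula translates directly into code. The function name and the error handling for a non-positive gap are choices made for this sketch; the numbers are the ones from the example calculation above.

```python
def performance_gap_recovered(p_weak: float,
                              p_weak_to_strong: float,
                              p_ceiling: float) -> float:
    """PGR = max(0, (P_weak_to_strong - P_weak) / (P_ceiling - P_weak))."""
    gap = p_ceiling - p_weak
    if gap <= 0:
        raise ValueError("p_ceiling must exceed p_weak for PGR to be defined")
    return max(0.0, (p_weak_to_strong - p_weak) / gap)


# Worked example from the text: weak 60%, weak-to-strong 88%, ceiling 95%.
print(performance_gap_recovered(60, 88, 95))
# A strong model that ends up worse than its weak supervisor is clipped to 0.
print(performance_gap_recovered(60, 55, 95))
```

A PGR of 1 would mean weak supervision was as good as ground-truth labels, while a PGR of 0 means the strong model did no better than the weak model that supervised it.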

### Method 2: Auxiliary Loss Approach

**Core Idea**: Don't just use weak model's outputs, also use its confidence distribution

Combine language modeling with KL divergence:

$$\text{Loss}_{\text{kd}} = \text{KL}\left(\text{Pr}^w(\cdot \mid \mathbf{x}) \| \text{Pr}_{\theta}^s(\cdot \mid \mathbf{x})\right)$$

$$\tilde{\theta} = \arg\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log \text{Pr}_{\theta}^s(\mathbf{y} \mid \mathbf{x}) - \lambda \cdot \text{Loss}_{\text{kd}}$$

**Key Components**:
- First term: Standard supervised learning on ground truth data $\mathcal{D}$
- Second term: Regularization to keep strong model's distribution close to weak model's
- $\lambda$: Controls influence of weak model (can decrease over time)

**Why this works**: The weak model provides useful regularization, preventing overfitting while still allowing the strong model to exceed the weak model's performance.
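
To make the two terms concrete, the sketch below evaluates the objective for a single example with small, made-up distributions over three candidate outputs. It assumes the strong model assigns non-zero probability wherever the weak model does; the function names are illustrative.

```python
import math
from typing import List


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def auxiliary_objective(strong_probs: List[float],
                        weak_probs: List[float],
                        target_index: int,
                        lam: float) -> float:
    """log Pr^s(y | x) - lambda * KL(Pr^w || Pr^s) for one example (to be maximized)."""
    log_likelihood = math.log(strong_probs[target_index])
    return log_likelihood - lam * kl_divergence(weak_probs, strong_probs)


# Toy distributions over three candidate outputs (numbers are made up).
weak = [0.5, 0.3, 0.2]
strong = [0.7, 0.2, 0.1]
for lam in (0.0, 0.5, 1.0):
    value = auxiliary_objective(strong, weak, target_index=0, lam=lam)
    print(lam, round(value, 4))
```

With $\lambda = 0$ the objective reduces to plain SFT; increasing $\lambda$ penalizes the strong model more heavily for drifting away from the weak model's distribution.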

### Other Methods (from Figure 4.5)

1. **Data Selection (c)**: Use weak models to identify which training examples are most valuable
2. **Model Aggregation (d)**: Combine predictions from multiple weak models to create strong supervision
3. **Model Cascading (e)**: Use weak models for easy cases, strong models only for hard cases (efficiency)

---


<img src="images/fig_4_5_weak_to_strong.png" width="50%" alt="Weak-to-Strong Methods" style="display: block; margin: 0 auto;">

<p style="text-align: center; font-style: italic; margin-top: 15px; line-height: 1.6;">
<strong>Figure 4.5:</strong> Methods for using weak models to improve strong models: (a) Generating synthetic data, (b) Incorporating auxiliary loss, (c) Selecting appropriate data, (d) Aggregating multiple small models, (e) Cascading small models with large models.
</p>