# Lecture 10: Prompting (Chapter 3) - Part III

**CSC 375/575 - Generative AI**  
**Prof. Rongyu Lin, Quinnipiac University**


### Topics Covered

- **3.3.1 Prompt Optimization** - Automated prompt design and search strategies
- **3.3.2 Soft Prompts** - Learnable hidden representations for efficient prompting
- **3.3.3 Prompt Length Reduction** - Simplifying prompts while maintaining effectiveness

### Learning Objectives

By the end of this lecture, students will be able to:

1. **Understand prompt optimization frameworks** including search space, performance estimation, and search strategies for automated prompt design
2. **Apply LLM-based optimization techniques** to iteratively improve prompts through initialization, evaluation, pruning, and expansion steps
3. **Distinguish between hard and soft prompts** and understand how soft prompts provide dense, learnable representations for efficient fine-tuning
4. **Implement parameter-efficient fine-tuning** using soft prompt methods such as prefix tuning and prompt tuning
5. **Apply prompt length reduction techniques** to simplify prompts while maintaining their effectiveness and interpretability

## Setup

First, let's import the necessary libraries for our demonstrations.

```python
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple
from IPython.display import Image, display

# For those with API access (optional):
# import openai
# from anthropic import Anthropic

print("Setup complete!")
```

```
Setup complete!
```


## Background: The Challenge of Prompt Design

In previous lectures, we explored various prompting techniques such as Chain of Thought, problem decomposition, self-refinement, ensembling, and RAG. While these methods are powerful, they share a common challenge:

### The Manual Prompt Design Problem

**Key Challenges**:
1. **Labor-Intensive**: Creating effective prompts requires significant human effort and expertise
2. **Task-Specific**: Each task often requires custom prompt design
3. **Computational Cost**: Long, complex prompts are expensive to process repeatedly
4. **No Guarantees**: Manual design doesn't guarantee optimal performance

### Solutions Covered in This Lecture

To address these challenges, this lecture covers three advanced techniques:

1. **Prompt Optimization (3.3.1)**: Automated machine learning approaches to discover optimal prompts
2. **Soft Prompts (3.3.2)**: Learnable hidden representations that replace or complement text prompts
3. **Prompt Length Reduction (3.3.3)**: Methods to simplify prompts while maintaining effectiveness

These techniques represent the cutting edge of prompt engineering, moving from manual design toward automated, efficient, and optimized prompting strategies.

## 3.3.1 Prompt Optimization

### Introduction

Given that prompt design is difficult and labor-intensive, it is desirable to use **machine learning models to discover the optimal prompt** for a specific task. This approach is called:

- **Automatic Prompt Design**, or
- **Prompt Optimization**

This can be viewed as an instance of **Automated Machine Learning (AutoML)**, which aims to reduce or eliminate the need for expert-driven manual design.

### Relationship to Neural Architecture Search (NAS)

Prompt optimization is conceptually similar to **Neural Architecture Search (NAS)**, where:
- **NAS Goal**: Find optimal neural network architectures by exploring a space of possible networks
- **Prompt Optimization Goal**: Find optimal prompts by exploring a space of possible prompt formulations

Both involve discrete structures and systematic search through a design space.

### General Prompt Optimization Framework

A general framework for prompt optimization involves three key components:

<div style="max-width: 900px; font-size: 0.95em; line-height: 1.6; padding: 20px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

#### 1. Prompt Search Space

Defines all possible prompts that the algorithms can explore.

**Example**: Edit seed prompts to generate diverse candidate prompts
- Start with: "Summarize this text."
- Generate variations: "Provide a concise summary.", "Briefly describe the main points.", etc.

#### 2. Performance Estimation

Evaluates the quality of each candidate prompt.

**Methods**:
- Feed prompt to LLM and measure performance on validation set
- Use metrics like accuracy, F1 score, or task-specific measures
- Compute log-likelihood of correct outputs

#### 3. Search Strategy

Systematically explores the search space to find better prompts.

**Process**:
- At each step: Explore promising prompts in search space
- Evaluate them using performance estimation
- Continue until stopping criterion is met
- Output: Best-performing prompt observed

</div>

### Key Insight

This framework is **very general** - different prompt optimization systems can vary in their design of each component. A popular approach uses **LLMs themselves** to implement these components!

<img src="images/prompt_optimization_framework.png" alt="Prompt Optimization Framework" width="45%" style="display: block; margin: 20px auto;">

*Prompt Optimization Framework: A systematic approach involving search space definition, performance estimation, and iterative search strategy*

### LLM-Based Prompt Optimization

A widely-used approach uses **LLMs as the basis** to develop optimization components [Zhou et al., 2023c].

**Benefits**:
- Uses off-the-shelf LLMs without substantial system development
- Can prompt or fine-tune LLMs to adapt to optimization tasks
- Leverages LLMs' generative capabilities for prompt creation

**Process Overview**:
1. Provide a few initial prompts
2. Iterate until stopping criterion:
   - Evaluate prompts on validation set
   - Maintain candidate pool with most promising prompts
   - Use LLMs to generate similar prompts from candidate pool

Let's examine each step in detail.

### Step 1: Initialization

Let $C$ represent the **pool of candidate prompts** we intend to explore.

#### Method 1: Manual Prompt Creation

Create initial prompts by hand for the given task.

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example:</strong><br>

C = {<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color: #87ceeb; padding: 2px 4px;">"Summarize this text in three sentences."</span>,<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color: #87ceeb; padding: 2px 4px;">"Provide a brief summary of the following."</span>,<br>
&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color: #87ceeb; padding: 2px 4px;">"What are the main points of this text?"</span><br>
}<br>

<strong>üí° Limitation:</strong> Requires human expertise about effective prompts for the task

</div>

#### Method 2: LLM-Generated from Task Description

Use LLMs to generate prompts given a task description.

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Prompt Template:</strong><br>

You are given a task to complete using LLMs. Please write a prompt to guide the LLMs.<br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">{task-description}</span><br>

<strong>Example Task Description:</strong><br>
"Translate English sentences into French while maintaining formal tone."

</div>

**Limitation**: Still requires human-provided task description

#### Method 3: LLM-Generated from Input-Output Examples

Use LLMs to infer prompts from example input-output pairs.

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Prompt Template:</strong><br>

You are provided with several input-output pairs for a task. Please write an instruction for performing this task.<br>

<span style="background-color: #90ee90; padding: 2px 4px;">Input:</span> {input1} <span style="background-color: #87ceeb; padding: 2px 4px;">Output:</span> {output1}<br>
<span style="background-color: #90ee90; padding: 2px 4px;">Input:</span> {input2} <span style="background-color: #87ceeb; padding: 2px 4px;">Output:</span> {output2}<br>
...<br>

<strong>üí° Advantage:</strong> LLM can infer the instruction from patterns in the data!

</div>
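As a minimal sketch of Method 3, the template above can be filled in programmatically before it is sent to an LLM. The function name `build_induction_prompt` and the example pairs are illustrative assumptions, and no LLM call is made here.

```python
from typing import List, Tuple


def build_induction_prompt(pairs: List[Tuple[str, str]]) -> str:
    """Fill the Method 3 template with input-output pairs (hypothetical helper)."""
    header = (
        "You are provided with several input-output pairs for a task. "
        "Please write an instruction for performing this task."
    )
    # One line per example, mirroring the "Input: ... Output: ..." layout.
    lines = [f"Input: {x} Output: {y}" for x, y in pairs]
    return "\n".join([header, *lines])


# Illustrative pairs (assumed): English words and their French translations.
pairs = [("cat", "chat"), ("dog", "chien")]
induction_prompt = build_induction_prompt(pairs)
print(induction_prompt)
```

The returned string would then be passed to whichever LLM is used for initialization, and its reply becomes one candidate in $C$.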

### Step 2: Evaluation

Once we obtain the candidate pool $C$, we need to **evaluate the prompts** in $C$.

#### Evaluation Methods

**Method 1: Task Performance**
- Feed each prompt into an LLM
- Assess results on downstream task
- Use pre-defined metrics (accuracy, F1, BLEU, etc.)

**Method 2: Log-Likelihood**
- Use log-likelihood of correct output as quality measure
- Higher log-likelihood indicates better prompt

**Evaluation Score**: Each prompt $p \in C$ receives a score $S(p)$
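The task-performance method can be sketched as follows, assuming exact-match accuracy as the metric. `toy_llm` is a hypothetical stand-in for a real model so that the example runs offline; in practice it would be replaced by an actual LLM call.

```python
from typing import Callable, List, Tuple


def score_prompt(prompt: str,
                 llm: Callable[[str], str],
                 validation: List[Tuple[str, str]]) -> float:
    """S(p): fraction of validation examples answered exactly right."""
    correct = 0
    for x, y in validation:
        output = llm(f"{prompt}\n{x}")
        correct += int(output.strip() == y)
    return correct / len(validation)


def toy_llm(full_prompt: str) -> str:
    """Hypothetical stand-in for an LLM: uppercases the input only when
    the instruction mentions 'uppercase', otherwise echoes it."""
    instruction, _, text = full_prompt.partition("\n")
    return text.upper() if "uppercase" in instruction.lower() else text


validation = [("abc", "ABC"), ("hello", "HELLO"), ("OK", "OK")]
scores = {p: score_prompt(p, toy_llm, validation)
          for p in ["Convert to uppercase.", "Repeat the text."]}
print(scores)
```

Swapping exact match for F1, BLEU, or the log-likelihood of the reference output changes only the body of the loop.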

#### Example: Evaluation Results

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Candidate Prompts with Evaluation Scores:</strong><br>

<span style="background-color: #87ceeb; padding: 2px 4px;">"Summarize this text in three sentences."</span> → Score: <strong>0.85</strong><br>
<span style="background-color: #87ceeb; padding: 2px 4px;">"Provide a brief summary."</span> → Score: <strong>0.78</strong><br>
<span style="background-color: #87ceeb; padding: 2px 4px;">"What are the main points?"</span> → Score: <strong>0.72</strong><br>
<span style="background-color: #90ee90; padding: 2px 4px;">"Condense the following into key takeaways."</span> → Score: <strong style="color: #4caf50;">0.88</strong> ✓ Best!<br>
<span style="background-color: #87ceeb; padding: 2px 4px;">"Give a short overview."</span> → Score: <strong>0.75</strong><br>

<strong>💡 Key Insight:</strong> Higher scores indicate better prompt performance on the validation set.

</div>

### Step 3: Pruning

If $C$ contains a large number of prompts, it's reasonable to **prune unpromising prompts**, reducing computational burden in subsequent steps.

#### Pruning Strategy

Given evaluation scores for each prompt:
- **Simple method**: Keep only a certain percentage of best-performing prompts
- **Discard** the rest

#### Example Pruning

<div style="max-width: 850px; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Before Pruning (5 prompts):</strong><br>

1. "Summarize this text in three sentences." - Score: 0.85 ✓<br>
2. "Provide a brief summary." - Score: 0.78 ✓<br>
3. "What are the main points?" - Score: 0.72 ❌<br>
4. "Condense the following into key takeaways." - Score: 0.88 ✓<br>
5. "Give a short overview." - Score: 0.75 ❌<br>

<strong>After Pruning (Keep top 60%):</strong><br>

1. "Condense the following into key takeaways." - Score: 0.88<br>
2. "Summarize this text in three sentences." - Score: 0.85<br>
3. "Provide a brief summary." - Score: 0.78<br>

<strong>üí° Result:</strong> Reduced from 5 to 3 prompts, keeping the best performers

</div>
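The percentage-based pruning rule can be written in a few lines. This is a sketch that reuses the five scores from the example; the function name `prune` and the choice to always keep at least one prompt are assumptions.

```python
from typing import Dict, List, Tuple


def prune(scores: Dict[str, float], keep_fraction: float) -> List[Tuple[str, float]]:
    """Keep the top `keep_fraction` of prompts ranked by score (at least one)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    return ranked[:n_keep]


scores = {
    "Summarize this text in three sentences.": 0.85,
    "Provide a brief summary.": 0.78,
    "What are the main points?": 0.72,
    "Condense the following into key takeaways.": 0.88,
    "Give a short overview.": 0.75,
}
kept = prune(scores, keep_fraction=0.6)
for prompt, score in kept:
    print(f"{score:.2f}  {prompt}")
```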

### Step 4: Expansion

**Expansion** is a key operation to explore different states in the search space.

#### Mathematical Formulation

The expansion operation can be defined as:

$$C' = \text{Expand}(C, f)$$

where:
- $C'$ is the set of **new prompts** generated from $C$
- $f$ is the model (typically an LLM) used for expansion

#### LLM-Based Expansion

We can perform expansion by instructing an LLM to generate new and relevant prompts based on $C$.

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Expansion Prompt Template:</strong><br>

Below is a prompt for an LLM. Please provide some new prompts to perform the same task.<br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">Input:</span> {prompt}<br>

<u>_______</u>

<strong>Example:</strong><br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">Input Prompt:</span><br>
"Summarize this text in three sentences."<br>

<span style="background-color: #90ee90; padding: 2px 4px;">Generated New Prompts:</span><br>
1. "Provide a three-sentence summary of the following text."<br>
2. "Condense the main ideas into exactly three sentences."<br>
3. "Create a brief summary limited to three sentences."<br>
4. "Capture the essence in three sentences or fewer."<br>

</div>
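A minimal sketch of $\text{Expand}(C, f)$ follows, assuming the LLM replies with a numbered list like the one in the example. `canned_llm` is a hypothetical stub that returns a fixed reply so the code runs without API access; the parsing rule is likewise an assumption about the reply format.

```python
import re
from typing import Callable, List

EXPANSION_TEMPLATE = (
    "Below is a prompt for an LLM. Please provide some new prompts "
    "to perform the same task.\n\nInput: {prompt}\n"
)


def parse_numbered_list(text: str) -> List[str]:
    """Extract items from lines shaped like: 1. "some prompt"."""
    items = []
    for line in text.splitlines():
        match = re.match(r'^\s*\d+\.\s*"?(.*?)"?\s*$', line)
        if match and match.group(1):
            items.append(match.group(1))
    return items


def expand(candidates: List[str], llm: Callable[[str], str]) -> List[str]:
    """C' = Expand(C, f): ask the model f for variants of every prompt in C."""
    new_prompts: List[str] = []
    for prompt in candidates:
        reply = llm(EXPANSION_TEMPLATE.format(prompt=prompt))
        for item in parse_numbered_list(reply):
            # Skip duplicates and prompts that are already in the pool.
            if item not in candidates and item not in new_prompts:
                new_prompts.append(item)
    return new_prompts


def canned_llm(_: str) -> str:
    """Hypothetical stub standing in for a real LLM call."""
    return ('1. "Provide a three-sentence summary of the following text."\n'
            '2. "Condense the main ideas into exactly three sentences."')


new_prompts = expand(["Summarize this text in three sentences."], canned_llm)
print(new_prompts)
```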

#### Iterative Process

After expansion:
1. Replace $C$ with $C'$
2. Repeat evaluation → pruning → expansion
3. Gradually explore a wider range of prompts

<img src="images/llm_based_optimization_workflow.png" alt="LLM-Based Optimization Workflow" width="65%" style="display: block; margin: 20px auto;">

*LLM-Based Optimization Workflow: Iterative process of initialization, evaluation, pruning, and expansion to discover optimal prompts*
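The whole loop can be sketched end to end as below. The scoring and expansion functions are toy stand-ins (assumptions) for the LLM-based evaluation and expansion steps, chosen so that the run is deterministic. The loop returns the best prompt observed at any point, since replacing $C$ with $C'$ discards earlier candidates.

```python
from typing import Callable, List, Tuple


def optimize_prompts(initial: List[str],
                     score: Callable[[str], float],
                     expand: Callable[[List[str]], List[str]],
                     keep: int = 2,
                     rounds: int = 3) -> Tuple[str, float]:
    """Evaluate -> prune -> expand, returning the best prompt seen."""
    pool = list(initial)
    best_prompt, best_score = "", float("-inf")
    for step in range(rounds):
        if not pool:
            break
        # Evaluation: score every candidate in the current pool.
        scored = sorted(((score(p), p) for p in pool), reverse=True)
        if scored[0][0] > best_score:
            best_score, best_prompt = scored[0]
        if step == rounds - 1:
            break  # no need to expand after the final evaluation
        # Pruning: keep only the top candidates.
        survivors = [p for _, p in scored[:keep]]
        # Expansion: replace C with C'.
        pool = expand(survivors)
    return best_prompt, best_score


def toy_score(prompt: str) -> float:
    """Toy metric (assumed): reward useful keywords, penalize length."""
    keywords = ("concise", "three", "sentences")
    hits = sum(word in prompt.lower() for word in keywords)
    return hits - 0.01 * len(prompt.split())


def toy_expand(prompts: List[str]) -> List[str]:
    """Toy expansion (assumed): append one of two fixed suffixes."""
    return [f"{p} {suffix}" for p in prompts
            for suffix in ("Be concise.", "Use three sentences.")]


best_prompt, best_score = optimize_prompts(
    ["Summarize the text.", "Give an overview."], toy_score, toy_expand)
print(best_prompt, round(best_score, 2))
```

In a real system, `score` would run the prompt on a validation set and `expand` would call an LLM, so the cost of each round is dominated by those two calls.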

### Advanced Expansion Techniques

The expansion step plays a **key role** in prompt optimization. Our goal is to find optimal results with minimal effort.

#### 1. Paraphrasing-Based Expansion

Treat expansion as a **paraphrasing task**:
- Apply off-the-shelf paraphrasing systems (LLM-based or other models)
- Transform input prompts into semantically equivalent forms
- Maintains meaning while exploring different phrasings

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example:</strong><br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">Original:</span> "Translate this sentence to French."<br>

<span style="background-color: #90ee90; padding: 2px 4px;">Paraphrases:</span><br>
&nbsp;&nbsp;• "Convert the following to French."<br>
&nbsp;&nbsp;• "Provide a French translation."<br>
&nbsp;&nbsp;• "Render this in French language."<br>

</div>

#### 2. Edit Operations

Define specific edit operations for each token:
- **Insertions**: Add new tokens
- **Deletions**: Remove tokens  
- **Substitutions**: Replace tokens
- **Reorderings**: Change word order

Apply these operations to transform prompts into new variants.

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example:</strong><br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">Original:</span> "Summarize the text."<br>
<br>
<span style="background-color: #87ceeb; padding: 2px 4px;">Insert:</span> "Summarize the <strong>following</strong> text."<br>
<span style="background-color: #87ceeb; padding: 2px 4px;">Substitute:</span> "<strong>Condense</strong> the text."<br>
<span style="background-color: #90ee90; padding: 2px 4px;">Insert+Modify:</span> "Briefly summarize the main points of the text."<br>

</div>
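The four edit operations can be sketched on a whitespace-tokenized prompt. Splitting on whitespace is a simplifying assumption; a real system would more likely operate on the model's subword tokens. The helper names are illustrative.

```python
from typing import List


def insert_token(tokens: List[str], index: int, token: str) -> List[str]:
    """Insertion: add a new token before position `index`."""
    return tokens[:index] + [token] + tokens[index:]


def delete_token(tokens: List[str], index: int) -> List[str]:
    """Deletion: remove the token at position `index`."""
    return tokens[:index] + tokens[index + 1:]


def substitute_token(tokens: List[str], index: int, token: str) -> List[str]:
    """Substitution: replace the token at position `index`."""
    return tokens[:index] + [token] + tokens[index + 1:]


def swap_tokens(tokens: List[str], i: int, j: int) -> List[str]:
    """Reordering: exchange the tokens at positions `i` and `j`."""
    result = list(tokens)
    result[i], result[j] = result[j], result[i]
    return result


tokens = "Summarize the text.".split()
variants = [
    " ".join(insert_token(tokens, 2, "following")),
    " ".join(substitute_token(tokens, 0, "Condense")),
    " ".join(delete_token(tokens, 1)),
    " ".join(swap_tokens(tokens, 0, 1)),
]
print(variants)
```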

#### 3. Feedback-Based Refinement

Improve prompt quality by **learning from feedback** (related to self-refinement from Section 3.2.3).

**Process**:
1. Use LLM to generate feedback on input prompt
2. Revise prompt based on feedback
3. Repeat feedback-and-revision cycle
4. Continue until convergence or desired outcome

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Feedback-Based Refinement Example:</strong><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">Initial Prompt:</span><br>
"Translate to French."<br>

<span style="background-color: #ffa500; padding: 2px 4px;">Feedback:</span><br>
"Too vague. Should specify source language and provide context about formality."<br>

<span style="background-color: #90ee90; padding: 2px 4px;">Revised Prompt:</span><br>
"Translate the following English text to French, maintaining a formal tone."<br>

<span style="background-color: #ffa500; padding: 2px 4px;">Feedback:</span><br>
"Better! Consider adding instructions about handling idiomatic expressions."<br>

<span style="background-color: #90ee90; padding: 2px 4px;">Final Prompt:</span><br>
"Translate the following English text to French, maintaining a formal tone and adapting idiomatic expressions appropriately."

</div>
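The feedback-and-revision cycle can be sketched as a loop over two callables: a critic that returns feedback (or `None` once it has no further complaints) and a reviser that applies it. Both are rule-based stand-ins here (assumptions), where a real system would call an LLM for each role.

```python
from typing import Callable, List, Optional


def refine(prompt: str,
           critic: Callable[[str], Optional[str]],
           reviser: Callable[[str, str], str],
           max_rounds: int = 5) -> List[str]:
    """Return the history of prompts produced by feedback-based refinement."""
    history = [prompt]
    for _ in range(max_rounds):
        feedback = critic(history[-1])
        if feedback is None:  # converged: the critic has nothing to add
            break
        history.append(reviser(history[-1], feedback))
    return history


def toy_critic(prompt: str) -> Optional[str]:
    """Hypothetical critic: checks for source language and tone."""
    if "English" not in prompt:
        return "specify source language"
    if "formal" not in prompt:
        return "specify tone"
    return None


def toy_reviser(prompt: str, feedback: str) -> str:
    """Hypothetical reviser: applies one fixed edit per feedback type."""
    if feedback == "specify source language":
        return prompt.replace("Translate", "Translate the following English text")
    return prompt.rstrip(".") + ", maintaining a formal tone."


history = refine("Translate to French.", toy_critic, toy_reviser)
for version in history:
    print(version)
```

The `max_rounds` cap matters in practice because an LLM critic may never report that it is fully satisfied.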

### Classic Optimization Techniques

Beyond LLM-based methods, we can apply **classic optimization techniques** to prompt optimization.

#### Evolutionary Computation

Frame the problem as an **evolutionary computation problem**:
- Treat prompts as **candidates** (individuals)
- Evolve generation by generation
- Apply genetic operations: selection, crossover, mutation

**Advantages**:
- Many powerful optimization algorithms available
- Can be directly applied to discrete search spaces
- Proven effective in related optimization problems

<img src="images/evolutionary_optimization.png" alt="Evolutionary Optimization for Prompts" width="60%" style="display: block; margin: 20px auto;">

*Evolutionary Optimization: Prompts evolve through selection, crossover, and mutation to find optimal formulations*
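A toy genetic algorithm over prompts might look like the sketch below. The fitness function, vocabulary, and population are illustrative assumptions; in a real setting, fitness would be the validation score $S(p)$. The best individual is carried over unchanged each generation (elitism), so the best fitness never decreases. The sketch assumes a population of at least two non-empty prompts.

```python
import random
from typing import Callable, List


def crossover(a: str, b: str, rng: random.Random) -> str:
    """Join a prefix of parent `a` with a suffix of parent `b`."""
    ta, tb = a.split(), b.split()
    child = ta[:rng.randint(0, len(ta))] + tb[rng.randint(0, len(tb)):]
    return " ".join(child) if child else a


def mutate(prompt: str, vocabulary: List[str], rng: random.Random,
           rate: float = 0.2) -> str:
    """Replace each token with a random vocabulary word with probability `rate`."""
    return " ".join(rng.choice(vocabulary) if rng.random() < rate else token
                    for token in prompt.split())


def evolve(population: List[str], fitness: Callable[[str], float],
           vocabulary: List[str], generations: int = 20, seed: int = 0) -> str:
    rng = random.Random(seed)
    size = len(population)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, size // 2)]   # selection
        children = [ranked[0]]                 # elitism
        while len(children) < size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), vocabulary, rng))
        population = children
    return max(population, key=fitness)


# Toy fitness (assumed): number of target words that appear in the prompt.
targets = {"summarize", "three", "sentences", "concise"}


def fitness(prompt: str) -> int:
    return len(targets & set(prompt.lower().split()))


vocabulary = ["summarize", "three", "sentences", "concise", "the", "text", "briefly"]
population = ["summarize the text", "describe text briefly",
              "give three points", "write concise notes"]
best = evolve(population, fitness, vocabulary)
print(best, fitness(best))
```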

### Reinforcement Learning for Prompt Optimization

#### The API Dependency Problem

Using existing LLM APIs for optimization has limitations:
- **Strong dependency** on LLM's inference and in-context learning abilities
- Weak LLMs may introduce errors (e.g., generating incorrect prompts during expansion)
- May lack adaptation to specific tasks

#### Solution: Train Task-Specific Models

**Reinforcement Learning Approach** [Deng et al., 2022]:

1. **Architecture**: Develop prompt generator by integrating FFN-based adaptor into LLM
2. **Training**: Train as policy network (only adaptor parameters updated)
3. **Reward**: Test generated prompts using another LLM (similar to evaluation step)
4. **Deployment**: Use trained prompt generator to create new prompts

**Benefits**:
- Better suited to specific tasks
- Reduces errors from weak general-purpose LLMs
- Can be fine-tuned on task-specific data

### Beyond Simple Prompts: Structured Optimization

#### Prompt Structure

So far, we have treated prompts as simple **sequences of tokens**. In practice, however, prompts have **complex structures**:

<div style="max-width: 850px; font-size: 0.95em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

**Prompt Components**:
- **User Input**: The actual question or task
- **Instruction**: How to process the input
- **Demonstrations**: Few-shot examples
- **Context**: Background information

</div>

#### Focused Optimization Areas

**1. Instruction Learning**

Much work focuses on **learning better instructions** for prompting:
- Generate instructions that effectively guide LLMs
- Based on given task requirements

**Challenge**: 
- Pre-trained LLMs not suited to predict instruction quality
- Testing instructions on downstream tasks is computationally expensive
- Makes optimization methods costly
- Exploring wide variety of instructions poses significant challenges

**2. Demonstration Learning**

Substantial research on **learning to select or generate demonstrations** in CoT:
- Generating high-quality demonstrations using LLMs is relatively easy
- Focus typically on **sampling appropriate demonstrations** from candidate pool
- Different challenge compared to instruction learning

**Example**: For a math problem-solving task, select the most relevant worked examples from a library of demonstrations.
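Demonstration selection can be sketched as a nearest-neighbor lookup over the demonstration library. Word-overlap (Jaccard) similarity is used here only to keep the example dependency-free; this is an assumption, and practical systems typically rank by embedding similarity instead. The library entries are made up for illustration.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def select_demonstrations(query: str,
                          library: List[Tuple[str, str]],
                          k: int = 2) -> List[Tuple[str, str]]:
    """Return the k demonstrations whose questions are most similar to the query."""
    ranked = sorted(library, key=lambda demo: jaccard(query, demo[0]), reverse=True)
    return ranked[:k]


library = [
    ("What is 12 plus 7?", "12 + 7 = 19"),
    ("What is the capital of France?", "Paris"),
    ("What is 9 times 6?", "9 * 6 = 54"),
]
chosen = select_demonstrations("What is 15 plus 4?", library, k=2)
for question, answer in chosen:
    print(question, "->", answer)
```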

### Summary: Prompt Optimization

**Key Takeaways**:

1. **General Framework**: Search space + Performance estimation + Search strategy
2. **LLM-Based Methods**: Use LLMs for initialization, evaluation, pruning, and expansion
3. **Advanced Techniques**: Paraphrasing, edit operations, feedback-based refinement
4. **Classic Methods**: Evolutionary computation, reinforcement learning
5. **Structured Optimization**: Focus on specific components (instructions, demonstrations)

**Practical Considerations**:
- Balance automation with computational cost
- Choose optimization method based on task requirements
- Consider whether to optimize entire prompt or specific components
- Evaluate trade-offs between search thoroughness and efficiency

## 3.3.2 Soft Prompts

### Introduction

While developing natural language prompts (manually or automatically) is straightforward and widely applied, it presents some **problems**:

#### Problem 1: Computational Burden

- Natural language prompts can be **complex and lengthy**
- Processing long prompts via LLMs is computationally expensive
- Repeatedly inputting the same long prompt is clearly inefficient

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example:</strong><br>

A detailed prompt for financial analysis might be hundreds of tokens long.<br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">Long Prompt:</span> "You are a financial analyst with expertise in market trends, risk assessment, and portfolio management. When analyzing financial data, please provide detailed insights on market conditions, identify potential risks, evaluate investment opportunities, and recommend strategic actions based on current economic indicators and historical trends."<br>

<strong>Problem:</strong> Using this roughly 45-word prompt for <span style="background-color: #ffeb3b; padding: 2px 4px;">thousands of queries</span> wastes significant computational resources.

</div>

#### Problem 2: Discrete Representation

- Prompts are typically **discrete token sequences** (called **hard prompts**)
- LLMs encode them as **low-dimensional real-valued vectors**
- Question: Are there more compact and efficient ways to represent prompts?

### Solution: Soft Prompts

**Soft prompts** are **hidden, distributed representations** of prompts - learnable vectors that serve as implicit prompting patterns embedded within LLMs.

### Hard Prompts vs. Soft Prompts

<div style="max-width: 900px; font-size: 0.95em; line-height: 1.6; padding: 20px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

#### Hard Prompts

**Definition**: Explicit, predefined text sequences that users input directly into LLMs

**Characteristics**:
- Expressed in **natural language**
- **Understandable for humans**
- Discrete token sequences
- Examples: "Translate this to French.", "Summarize in three sentences."

#### Soft Prompts

**Definition**: Implicit, adaptable prompting patterns embedded within LLMs

**Characteristics**:
- Encoded in a format **comprehensible to the model** (not humans)
- Continuous vector representations
- Learnable through optimization
- More compact than text prompts

</div>

### Illustrative Example

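**Consider a simple prompt**: "Translate the sentence into Chinese. Consider it done!" Here the first sentence is the instruction, and "Consider it done!" is the user input to be translated.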


**Hard Prompt Analysis**:
- Instruction: "Translate the sentence into Chinese"
- Denoted by token sequence: $c_1 \ldots c_5$

**Soft Prompt Analysis**:
- Feed tokens into LLM
- Transformed into sequence of real-valued vectors: $\mathbf{h}_1 \ldots \mathbf{h}_5$
- Each $\mathbf{h}_i$ corresponds to a token
- We can think of $\mathbf{h}_1 \ldots \mathbf{h}_5$ as a **soft prompt**

<img src="images/fig3-3_hard_soft_prompts.png" alt="Hard vs Soft Prompts" width="45%" style="display: block; margin: 20px auto;">

*Figure 3.3: Illustration of hard and soft prompts. The hard prompt is the instruction we input to the LLM. The LLM encodes this instruction, and the intermediate representations can be viewed as soft prompts.*
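The contrast can be sketched in NumPy. As a simplification (an assumption), a random embedding table stands in for the LLM's encoder, so $\mathbf{h}_1 \ldots \mathbf{h}_5$ are plain embedding lookups rather than true intermediate states. The soft prompt is a free matrix of vectors that is not tied to any vocabulary entry and that training could update directly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # hidden size (illustrative)

vocab = {"Translate": 0, "the": 1, "sentence": 2, "into": 3, "Chinese": 4,
         "Consider": 5, "it": 6, "done": 7, "!": 8}
embedding = rng.normal(size=(len(vocab), d_model))

hard_prompt = ["Translate", "the", "sentence", "into", "Chinese"]  # c_1 ... c_5
user_input = ["Consider", "it", "done", "!"]

# Hard prompt: discrete tokens mapped to vectors h_1 ... h_5.
h = embedding[[vocab[token] for token in hard_prompt]]
z = embedding[[vocab[token] for token in user_input]]

# Soft prompt: learnable vectors with no corresponding text.
# Two vectors are used here to show that it can be shorter than the hard prompt.
soft_prompt = rng.normal(size=(2, d_model))

hard_sequence = np.concatenate([h, z])            # what the model sees with a hard prompt
soft_sequence = np.concatenate([soft_prompt, z])  # what it sees with a soft prompt

print(hard_sequence.shape, soft_sequence.shape)
```

In soft prompt methods such as prompt tuning, only `soft_prompt` would receive gradient updates while the model weights stay frozen.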

### Key Insights About Soft Prompts

#### 1. No Direct Correspondence Required

While the example shows soft prompts generated by transforming hard prompts, **there's not necessarily a direct correspondence** between them.

**Important**: We don't even need to interpret soft prompts using meaningful text!

#### 2. Learnable Parameters

Soft prompts are simply **hidden states in LLMs** and can be learned as **standard parameters** through continuous optimization.

**Benefits**:
- Explore prompting methods **beyond text**
- Dense, low-dimensional representations
- Learnable through gradient descent
- Significantly lower computational cost than processing long hard prompts

#### 3. Practical Value

Especially valuable in LLM inference applications where:
- The same prompt is **repeatedly used**
- Computational efficiency is critical
- Task-specific adaptation is needed

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example Use Case: Customer Service Chatbot</strong><br>

<strong>Scenario:</strong> A chatbot handles <span style="background-color: #ffeb3b; padding: 2px 4px;">thousands of similar queries daily</span><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">Traditional Approach (Hard Prompt):</span><br>
Process "Please help the customer with their technical issue professionally and provide step-by-step solutions" <strong>repeatedly</strong> for each query.<br>

⬇️<br>

<span style="background-color: #90ee90; padding: 2px 4px;">Soft Prompt Approach:</span><br>
Use a learned soft prompt vector that encodes the same behavior.<br>

<strong>üí° Result:</strong> Significant computational savings while maintaining service quality!

</div>

### 3.3.2.1 Adapting LLMs with Less Prompting

#### The Core Idea

One obvious way to adapt an LLM for a particular task is **fine-tuning** using labeled data. This leads to various LLM alignment methods like **supervised fine-tuning (SFT)**.

**Key Insight**: If we take this idea further, we can expect LLMs to **absorb prompting knowledge** during fine-tuning. Consequently:
- Prompting information is partially captured in model parameters
- Fine-tuned LLMs can perform tasks with **less prompting**

#### Simple Prompt Formulation

Consider a simple prompt with only:
- **Instruction** (denoted by $\mathbf{c}$)
- **User input** (denoted by $\mathbf{z}$)

The prompt can be expressed as:

$$\mathbf{x} = (\mathbf{c}, \mathbf{z})$$

#### Fine-Tuning Objective

Given a set of prompt-response pairs $\mathcal{D} = \{(\mathbf{x}, \mathbf{y})\}$, maximize the total log-likelihood (equivalently, minimize the total negative log-likelihood loss):

$$\begin{aligned}
\hat{\theta} &= \arg\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x}) \\
&= \arg\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{c}, \mathbf{z})
\end{aligned}$$

where $\text{Pr}_{\theta}(\cdot \mid \cdot)$ is the probability predicted by an LLM with parameters $\theta$.
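To make the objective concrete, the sketch below evaluates $\sum \log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x})$ for fixed model outputs. It assumes the model exposes per-step logits over the vocabulary, so that the sequence log-likelihood is the sum of per-token log-probabilities. The logits are made-up arrays, and the $\arg\max$ over $\theta$ (gradient-based training) is not shown.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def sequence_log_likelihood(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """log Pr(y | x) as the sum over steps t of log Pr(y_t | x, y_<t)."""
    log_probs = log_softmax(logits)
    return float(log_probs[np.arange(len(target_ids)), target_ids].sum())


def objective(dataset) -> float:
    """Total log-likelihood over all (logits, targets) pairs in the dataset."""
    return sum(sequence_log_likelihood(logits, y) for logits, y in dataset)


targets = np.array([0, 1, 2])

# A model that is uniform over a 4-word vocabulary at each of 3 steps.
uniform = (np.zeros((3, 4)), targets)

# A model that puts a higher logit on the correct token at every step.
confident_logits = np.zeros((3, 4))
confident_logits[np.arange(3), targets] = 5.0
confident = (confident_logits, targets)

print(objective([uniform]), objective([confident]))
```

Fine-tuning moves $\theta$ so that the logits look more like the second case on the training pairs, which raises the objective.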

### Instruction Simplification Through Fine-Tuning

The fine-tuning objective **doesn't restrict the instruction to any particular form**. This flexibility allows us to instruct LLMs in any way we want.

#### Example: Translation Task

Consider instructing LLMs to translate English to Chinese:

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Complex Instruction:</strong><br>
<span style="background-color: #87ceeb; padding: 2px 4px;">"Translate the following sentence from English to Chinese."</span><br>

⬇️ Simplify<br>

<strong>Simpler Instruction:</strong><br>
<span style="background-color: #90ee90; padding: 2px 4px;">"Translate this into Chinese."</span><br>

⬇️ Further simplify<br>

<strong>Minimal Instruction:</strong><br>
<span style="background-color: #ffeb3b; padding: 2px 4px;">"Translate!"</span><br>

<strong>üí° Key Point:</strong> With sufficient fine-tuning, we can adapt LLMs to follow <strong>ANY</strong> of these instructions!

</div>

#### Benefits of Simplified Instructions

**Computational Advantages**:
- Shorter prompts = Fewer tokens to process
- Faster inference
- Lower computational cost per query
- Makes subsequent prompting more efficient

**Example**: Using "Translate!" instead of "Translate the following sentence from English to Chinese" saves about seven words per query (the exact token count depends on the tokenizer). Over millions of queries, the savings add up.

### Challenges of Over-Simplification

While simplified instructions have benefits, **over-simplification can be harmful**:

#### Problem 1: Information Loss

- Simplified instructions may **lose important information**
- Less context for the model to understand task requirements
- Can lead to ambiguous or unclear task specifications

#### Problem 2: Overfitting Risk

- LLMs more likely to **overfit** fine-tuning data
- May fail to **generalize** beyond specific instructions
- Especially problematic with limited labeled data

#### Problem 3: Instruction Variety

In scenarios with both complex and simplified instructions:
- Accommodating variety of instructions is **costly**
- Limited fine-tuning data available
- Challenge to balance coverage and quality

**Key Trade-off**: Efficiency vs. Generalization

### Alternative: Context Distillation

An alternative to direct fine-tuning is **knowledge distillation**, applied here to prompts as **context distillation** [Snell et al., 2022].

#### Goal

Learn a **student model** that can use simplified instructions from a well-trained **teacher model** that follows complex instructions.

#### Process Overview

1. **Teacher Model**: Standard instruction-following LLM
   - Takes: Context (complex instruction) + User input
   - Produces: High-quality predictions

2. **Student Model**: Learns to match teacher's performance with simplified instructions
   - Takes: Simplified context + User input
   - Trained to: Match teacher's outputs

3. **Dataset Construction**: Create $\mathcal{D}'$ where each sample is:
   $$\mathbf{x}' = (\mathbf{c}, \mathbf{c}', \mathbf{z})$$
   - $\mathbf{c}$: Original complex instruction
   - $\mathbf{c}'$: Corresponding simplified instruction
   - $\mathbf{z}$: User input

<img src="images/fig3-4_context_distillation.png" alt="Context Distillation" width="50%" style="display: block; margin: 20px auto;">

*Figure 3.4: Context Distillation. The teacher model uses complex context, while the student model learns to achieve similar performance with simplified context.*

### Knowledge Distillation Objective

The goal is to train the student model to mimic the teacher model's outputs. We minimize a loss function defined on outputs of teacher and student models:

$$\hat{\theta} = \arg\min_{\theta} \sum_{\mathbf{x}' \in \mathcal{D}'} \text{Loss}(\text{Pr}^t(\cdot \mid \cdot), \text{Pr}_{\theta}^s(\cdot \mid \cdot), \mathbf{x}')$$

where:
- $\text{Pr}^t(\cdot \mid \cdot)$: Pre-trained teacher model (fixed, not updated)
- $\text{Pr}_{\theta}^s(\cdot \mid \cdot)$: Student model with learnable parameters $\theta$
- $\mathbf{x}' = (\mathbf{c}, \mathbf{c}', \mathbf{z})$: Training sample with complex context, simplified context, and input

**Interpretation**: Find parameters $\theta$ that minimize the difference between teacher and student predictions across all training samples.

#### Common Loss Functions

**1. Sequence-Level Loss (Computationally Infeasible)**

$$\text{Loss} = -\sum_{\mathbf{y}} \text{Pr}^t(\mathbf{y} \mid \mathbf{c}, \mathbf{z}) \log \text{Pr}_{\theta}^s(\mathbf{y} \mid \mathbf{c}', \mathbf{z})$$

**Interpretation**: This is the cross-entropy between teacher and student: the negative expected log-likelihood of the student over all possible outputs $\mathbf{y}$, weighted by the teacher's probability distribution.

**Problem**: The sum runs over every possible output sequence, and the number of sequences grows exponentially with output length, so computing it exactly is infeasible in practice.

**2. Teacher-Generated Output Loss (Practical)**

First, use teacher model to generate its best output:
$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log \text{Pr}^t(\mathbf{y} \mid \mathbf{c}, \mathbf{z})$$

**Interpretation**: Teacher generates a single high-quality output given complex context $\mathbf{c}$.

Then train student to reproduce this output given simplified context:
$$\text{Loss} = -\log \text{Pr}_{\theta}^s(\hat{\mathbf{y}} \mid \mathbf{c}', \mathbf{z})$$

**Interpretation**: Maximize student's probability of generating the teacher's output, but using simplified context $\mathbf{c}'$ instead of complex context $\mathbf{c}$.

**3. Distribution Matching Loss (KL Divergence)**

Minimize distance between probability distributions:
$$\text{Loss} = \text{KL}(\mathrm{P}^t \| \mathrm{P}_{\theta}^s)$$

where:
$$\begin{aligned}
\mathrm{P}^t &= \text{Pr}^t(\cdot \mid \mathbf{c}, \mathbf{z}) \quad \text{(teacher's output distribution)} \\
\mathrm{P}_{\theta}^s &= \text{Pr}_{\theta}^s(\cdot \mid \mathbf{c}', \mathbf{z}) \quad \text{(student's output distribution)}
\end{aligned}$$

**Interpretation**: Match the entire probability distribution over outputs, not just the most likely output. Captures uncertainty and alternative responses from teacher.

**Advantage**: Student learns the full distribution of teacher's responses, including uncertainty and multiple plausible outputs.
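The three losses can be compared on a toy problem whose output space is small enough to enumerate. This is a minimal sketch: the two probability vectors are made-up numbers chosen for illustration, and the vocabulary size and sequence length used for the count are hypothetical. In a real LLM the first loss could not be computed this way, because the sum runs over all sequences.

```python
import numpy as np

# Why the exact sum is infeasible: with a hypothetical 50,000-word vocabulary
# and outputs of length 20, the number of candidate sequences is |V|^T.
num_sequences = 50_000 ** 20
print(f"candidate sequences: about 10^{len(str(num_sequences)) - 1}")

# Toy setup: pretend there are only 4 possible outputs y.
# teacher[i] stands for Pr^t(y_i | c, z), student[i] for Pr^s(y_i | c', z).
teacher = np.array([0.70, 0.20, 0.05, 0.05])
student = np.array([0.40, 0.30, 0.20, 0.10])

# 1. Sequence-level loss: teacher-weighted negative log-likelihood.
seq_level_loss = -np.sum(teacher * np.log(student))

# 2. Teacher-generated output loss: NLL of the teacher's single best output.
y_hat = int(np.argmax(teacher))
teacher_output_loss = -np.log(student[y_hat])

# 3. Distribution matching loss: KL(P^t || P^s).
kl_loss = np.sum(teacher * np.log(teacher / student))

# The sequence-level loss equals the teacher's entropy plus the KL term,
# and the entropy does not depend on the student's parameters.
teacher_entropy = -np.sum(teacher * np.log(teacher))

print(f"sequence-level loss   : {seq_level_loss:.4f}")
print(f"teacher-output loss   : {teacher_output_loss:.4f}")
print(f"KL divergence         : {kl_loss:.4f}")
print(f"entropy + KL          : {teacher_entropy + kl_loss:.4f}")
```

Because the entropy term is constant with respect to $\theta$, minimizing the sequence-level loss and minimizing the KL divergence lead to the same student. The teacher-generated output loss looks only at one output and ignores the rest of the teacher's distribution.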

### Broader Applications of Distillation

#### Beyond Instructions

While we focused on **knowledge distillation for instructions**, the approaches are **general**:

**Key Principle**: By learning from teacher model outputs, prompting knowledge can be **distilled into student model parameters**.

**Result**: The distilled model encodes a form of **soft prompt**.

#### Applications to Other Prompt Learning Problems

1. **Compressing Long Contexts**
   - Distill knowledge from long-context teacher
   - Student works with shorter contexts
   - Maintains performance while reducing computational cost

2. **Learning Soft Prompts as LLM Components**
   - Integrate soft prompts as specific model components
   - Learn these components through distillation
   - Enable efficient task-specific adaptation

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example: Customer Service Chatbot Distillation</strong><br>

<strong>Teacher Model</strong> uses complex instruction:<br>
<span style="background-color: #ffb6c1; padding: 2px 4px;">"Professionally assist customers with technical issues, maintaining empathy and providing step-by-step solutions while ensuring customer satisfaction and following company guidelines."</span><br>

⬇️ Knowledge Distillation<br>

<strong>Student Model</strong> learns compact soft prompt representation:<br>
<span style="background-color: #90ee90; padding: 2px 4px;">Soft prompt vector (e.g., 10 embeddings)</span><br>

<strong>üí° Result:</strong> Student model achieves similar performance with <strong>much lower computational cost</strong>!

</div>

### 3.3.2.2 Learning Soft Prompts for Parameter-Efficient Fine-Tuning

#### The Computational Challenge

**Problem**: Updating all parameters in fine-tuning is:
- Common for adapting LLMs to tasks
- Cheaper than pre-training, but still **computationally expensive**
- Resource-intensive for large models

**Motivation**: Develop **parameter-efficient fine-tuning** methods that minimize the number of parameters to update.

#### Solution: Prefix Fine-Tuning

**Prefix fine-tuning** [Li and Liang, 2021]: Prepend a series of **trainable vectors** (prefixes) to the input of each Transformer layer.

**Key Idea**: Prefixes can be thought of as **soft prompts** that:
- Serve as additional context
- Guide model behavior for specific tasks
- Only require learning a small set of parameters

### Prefix Fine-Tuning: Mathematical Formulation

#### Standard Transformer Layer (Without Prefixes)

Let the input of layer at depth $l$ be a sequence of hidden states:
$$\mathbf{H}^l = \mathbf{h}_0^l \mathbf{h}_1^l \ldots \mathbf{h}_m^l$$

**Interpretation**: $m+1$ vectors representing tokens in the sequence at layer $l$.

The output is computed by applying the Transformer layer:
$$\mathbf{H}^{l+1} = \text{Layer}(\mathbf{H}^l)$$

**Interpretation**: Standard forward pass through attention and feed-forward networks.

#### With Prefix Fine-Tuning (Three Steps)

**Step 1: Prepend Trainable Prefixes**

Extend the sequence by adding $n+1$ trainable prefix vectors at the beginning:
$$\mathbf{p}_0^l \mathbf{p}_1^l \ldots \mathbf{p}_n^l$$

**Interpretation**: These are task-specific "soft prompts" that will be learned.

So $\mathbf{H}^l$ becomes an extended sequence:
$$\mathbf{H}^l = \underbrace{\mathbf{p}_0^l \mathbf{p}_1^l \ldots \mathbf{p}_n^l}_{\text{trainable prefixes}} \underbrace{\mathbf{h}_0^l \mathbf{h}_1^l \ldots \mathbf{h}_m^l}_{\text{actual input tokens}}$$

**Interpretation**: Now we have $(n+1) + (m+1)$ total vectors: prefixes + original tokens.

**Step 2: Process Through Layer and Extract**

Process the extended sequence through the Transformer layer:
$$\text{Full output} = \text{Layer}(\mathbf{H}^l) = \tilde{\mathbf{p}}_0^{l+1} \tilde{\mathbf{p}}_1^{l+1} \ldots \tilde{\mathbf{p}}_n^{l+1} \mathbf{h}_0^{l+1} \mathbf{h}_1^{l+1} \ldots \mathbf{h}_m^{l+1}$$

**Interpretation**: Layer outputs $(n+1) + (m+1)$ vectors. Through self-attention, the token positions can attend to the prefix vectors, which is how the prefixes influence the token representations.

Extract only the last $m+1$ representations (corresponding to actual tokens):
$$\begin{aligned}
\overline{\mathbf{H}}^{l+1} &= \text{Layer}(\mathbf{H}^l)[-m-1:] \\
&= \mathbf{h}_0^{l+1} \mathbf{h}_1^{l+1} \ldots \mathbf{h}_m^{l+1}
\end{aligned}$$

**Interpretation**: Discard the prefix outputs; keep only token representations that have been influenced by prefixes through attention.

**Notation**: $[-m-1:]$ is Python-style slicing to extract the last $m+1$ elements.

**Step 3: Form Input for Next Layer**

Create input for next layer by prepending new prefixes:
$$\begin{aligned}
\mathbf{H}^{l+1} &= \mathbf{p}_0^{l+1} \mathbf{p}_1^{l+1} \ldots \mathbf{p}_n^{l+1} \, \overline{\mathbf{H}}^{l+1} \\
&= \mathbf{p}_0^{l+1} \mathbf{p}_1^{l+1} \ldots \mathbf{p}_n^{l+1} \, \mathbf{h}_0^{l+1} \mathbf{h}_1^{l+1} \ldots \mathbf{h}_m^{l+1}
\end{aligned}$$

**Interpretation**: Repeat the process at every layer - each layer has its own set of learnable prefixes.
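The three steps can be sketched in NumPy. This is an illustration under strong assumptions: `toy_layer` is a hypothetical stand-in for a Transformer layer (single-head causal self-attention followed by one linear map and a `tanh`), and all shapes and values are made up. It shows only the prepend, process, and slice pattern, not a real model.

```python
import numpy as np

def toy_layer(H, W):
    """Hypothetical stand-in for Layer(.): causal self-attention + linear map.
    H has shape (seq_len, d); W has shape (d, d)."""
    seq_len, d = H.shape
    scores = H @ H.T / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.tanh((weights @ H) @ W)

def forward_with_prefixes(token_states, prefixes, layer_weights):
    """Run the token states through all layers, prepending that layer's
    prefixes before each layer and discarding the prefix outputs afterwards."""
    num_tokens = token_states.shape[0]           # this is m + 1
    H_bar = token_states
    for P_l, W_l in zip(prefixes, layer_weights):
        H = np.concatenate([P_l, H_bar], axis=0)  # Steps 1 and 3: prepend prefixes
        full_output = toy_layer(H, W_l)           # Step 2: process extended sequence
        H_bar = full_output[-num_tokens:]         # Step 2: keep last m + 1 vectors
    return H_bar

rng = np.random.default_rng(0)
num_layers, num_prefixes, num_tokens, d = 3, 2, 5, 8
tokens = rng.normal(size=(num_tokens, d))
prefixes = [rng.normal(size=(num_prefixes, d)) for _ in range(num_layers)]
layer_weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(num_layers)]

out = forward_with_prefixes(tokens, prefixes, layer_weights)
print(out.shape)  # (5, 8): one vector per actual token, prefixes discarded
```

Swapping in different prefix vectors changes `out` even though `tokens` and `layer_weights` stay the same, which is the sense in which the prefixes steer the frozen model.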

#### Training

**Learnable Parameters**: Each prefix $\mathbf{p}_i^l \in \mathbb{R}^d$ is a $d$-dimensional vector that will be optimized.

**What's Updated**: During training, only the prefix vectors $\{\mathbf{p}_0^l, \mathbf{p}_1^l, \ldots, \mathbf{p}_n^l\}$ for all layers $l$ are updated via gradient descent.

**What's Frozen**: All original Transformer parameters (attention weights, feed-forward weights) remain **fixed**.

**Key Insight**: Prefixes act as continuous task-specific prompts that guide the model's behavior through attention mechanisms, without modifying the pre-trained model itself.
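The split between updated and frozen parameters can be illustrated with a deliberately tiny model. This is a sketch under stated assumptions, not prefix tuning on an LLM: the "model" mean-pools one prefix vector with the token states and applies a frozen linear readout, the target value is made up, and the gradient is written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3

# Frozen "pre-trained" readout weights and fixed token states (made-up values).
w_frozen = rng.normal(size=d)
tokens = rng.normal(size=(m, d))
target = 1.0

# The only trainable parameters: a single prefix vector, initialized at zero.
prefix = np.zeros(d)

def predict(prefix_vec):
    # Toy model: mean-pool [prefix; tokens], then apply the frozen readout.
    pooled = (prefix_vec + tokens.sum(axis=0)) / (m + 1)
    return w_frozen @ pooled

w_before = w_frozen.copy()
loss_before = (predict(prefix) - target) ** 2

learning_rate = 0.1
for _ in range(500):
    error = predict(prefix) - target
    grad_prefix = 2 * error * w_frozen / (m + 1)  # d(loss)/d(prefix)
    prefix -= learning_rate * grad_prefix         # only the prefix is updated

loss_after = (predict(prefix) - target) ** 2
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.6f}")
print("frozen weights unchanged:", np.array_equal(w_frozen, w_before))
```

The loss falls although `w_frozen` is never touched. All task adaptation is carried by the prefix.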

### Prefix Fine-Tuning: Translation Example

<div style="max-width: 900px; font-size: 0.95em; line-height: 1.6; padding: 20px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

**Task**: English to Chinese translation ("Look out!" → "小心!")

#### Training Process
1. **Forward Pass**: Input passes through all layers
2. **Error Computation**: Compare output with correct translation
3. **Backward Pass**: Only prefix vectors $\mathbf{p}_0^l$ and $\mathbf{p}_1^l$ receive gradients
4. **Parameter Update**: Only prefixes are updated

#### Result
- Prefix vectors **adapt to translation task**
- Act as **soft prompts** that activate translation behavior
- No explicit hard prompts like "Translate from English to Chinese" needed!

#### Inference
- Prepend optimized $\mathbf{p}_0^l$ and $\mathbf{p}_1^l$ to each layer
- LLM automatically translates input sentences
- No additional prompting required

</div>

<img src="images/fig3-5_prefix_tuning.png" alt="Prefix Fine-Tuning" width="40%" style="display: block; margin: 20px auto;">

*Figure 3.5: Prefix fine-tuning for translation. Only prefix vectors are updated while the rest of the model parameters remain fixed, enabling efficient task adaptation.*

### Prefix Fine-Tuning: Efficiency Analysis

#### Parameter Count

Prefix fine-tuning introduces:
$$L \times n \times d \text{ parameters}$$

where:
- $L$: Number of layers
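- $n$: Number of prefixes per layer (the indexed notation $\mathbf{p}_0^l \ldots \mathbf{p}_n^l$ above has $n+1$ of them; here $n$ simply denotes the count)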
- $d$: Dimensionality of each prefix

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example: Parameter Calculation</strong><br>

For a model with:<br>
• <span style="background-color: #87ceeb; padding: 2px 4px;">24 layers</span> ($L = 24$)<br>
• <span style="background-color: #87ceeb; padding: 2px 4px;">10 prefixes per layer</span> ($n = 10$)<br>
• <span style="background-color: #87ceeb; padding: 2px 4px;">1024-dimensional vectors</span> ($d = 1024$)<br>

<hr style="border: 1px solid #ddd; margin: 12px 0;">

<strong>Calculation:</strong><br>
Total trainable parameters = $24 \times 10 \times 1024$ = <span style="background-color: #90ee90; padding: 2px 4px;"><strong>245,760</strong></span><br>

<strong>üí° Comparison:</strong> This is <strong>much smaller</strong> than <span style="background-color: #ffeb3b; padding: 2px 4px;">billions of parameters</span> in the full LLM!<br>

<strong>Result:</strong> Highly efficient fine-tuning process with <strong>minimal memory footprint</strong>.

</div>
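The calculation above is easy to reproduce. In the sketch below, `prefix_param_count` is a hypothetical helper name, and the 1-billion-parameter model size is an assumed figure used only to put the count in proportion.

```python
def prefix_param_count(num_layers: int, prefixes_per_layer: int, dim: int) -> int:
    """Number of trainable parameters introduced by prefix fine-tuning: L * n * d."""
    return num_layers * prefixes_per_layer * dim

trainable = prefix_param_count(num_layers=24, prefixes_per_layer=10, dim=1024)
print(trainable)  # 245760

# Assumed full-model size for comparison (hypothetical 1B-parameter LLM).
full_model_params = 1_000_000_000
print(f"fraction of full model: {trainable / full_model_params:.4%}")
```

Doubling the number of prefixes per layer doubles the count, so the prefix length is the main knob for trading capacity against cost.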

### Prompt Tuning: A Simpler Alternative

#### Motivation

While prefix fine-tuning is efficient, it still **requires modifications to LLMs**.

**Alternative**: Separate soft prompts from LLMs to:
- Preserve original model architecture
- Enable more efficient deployment across tasks
- Avoid adjusting core model

#### Prompt Tuning Method [Lester et al., 2021]

**Key Difference**: Modifies **only the embedding layer** (not every layer like prefix tuning)

#### How It Works

**Standard Input**: Each token $z_i$ is represented by embedding $\mathbf{e}_i$

**Prompt Tuning**: Add pseudo embeddings at the beginning:
$$\mathbf{p}_0 \mathbf{p}_1 \ldots \mathbf{p}_n \mathbf{e}_0 \mathbf{e}_1 \ldots \mathbf{e}_m$$

where:
- $\mathbf{p}_0 \ldots \mathbf{p}_n$: **Trainable soft prompt embeddings**
- $\mathbf{e}_0 \ldots \mathbf{e}_m$: Fixed token embeddings

#### Important Properties

**Pseudo Embeddings**:
- Need NOT correspond to any natural language token
- Serve as **"soft prompt embeddings"**
- Condition the LLM for specific tasks

**Training**:
- Learn soft prompt embeddings on task-specific data
- They adaptively interact with token embeddings
- Guide LLM behavior

**Benefits**:
- Lightweight and efficient
- Doesn't change underlying LLM parameters
- Maintains generalization capabilities

<img src="images/fig3-6_prompt_tuning.png" alt="Prompt Tuning" width="40%" style="display: block; margin: 20px auto;">

*Figure 3.6: Prompt tuning for translation. Learnable soft prompt embeddings are added at the beginning of the embedding sequence, providing an efficient way to adapt LLMs without modifying core parameters.*
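Constructing the prompt-tuning input is a single concatenation at the embedding layer. The sketch below uses a random matrix as a stand-in for a frozen embedding table. The vocabulary size, dimensions, token ids, and the name `build_input` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 100, 16
num_soft = 4

# Frozen token embedding table (stand-in for the LLM's embedding layer).
embedding_table = rng.normal(size=(vocab_size, d))

# Trainable soft prompt embeddings p_0 ... p_n; they need not match any row
# of the embedding table, i.e. any natural-language token.
soft_prompt = rng.normal(scale=0.02, size=(num_soft, d))

def build_input(token_ids, soft_prompt, embedding_table):
    """Return the sequence p_0 ... p_n e_0 ... e_m fed to the first layer."""
    token_embeddings = embedding_table[token_ids]   # fixed e_0 ... e_m
    return np.concatenate([soft_prompt, token_embeddings], axis=0)

token_ids = np.array([5, 17, 42])  # made-up ids for a 3-token input
x = build_input(token_ids, soft_prompt, embedding_table)
print(x.shape)  # (7, 16): 4 soft prompt vectors followed by 3 token embeddings
```

During training only `soft_prompt` would receive gradient updates. Everything downstream of this concatenation is the unmodified model.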

### Advanced Prompt Tuning Techniques

#### 1. Sequence Modeling for Soft Prompts

Since $\mathbf{p}_0, \mathbf{p}_1 \ldots \mathbf{p}_n$ is a sequence, we can use **sequence models** to better represent it:

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example: Transformer Encoding</strong><br>

<strong>Approach:</strong><br>
Use a Transformer to encode the soft prompt sequence<br>

<strong>Process:</strong><br>
1. Transformer encodes $\mathbf{p}_0 \mathbf{p}_1 \ldots \mathbf{p}_n$<br>
2. Resulting representation used as input to LLM<br>
3. Develops <span style="background-color: #90ee90; padding: 2px 4px;">additional model</span> for encoding soft prompts<br>

<strong>üí° Benefit:</strong> Better captures sequential dependencies in soft prompts!

</div>

#### 2. Combining Soft and Hard Prompts

Take advantage of **both types** of prompts by combining them [Liu et al., 2023b].

**Mixed Pattern Example**:
$$\mathbf{p}_0 \mathbf{p}_1 \ldots \mathbf{p}_n \quad c_0 c_1 \ldots c_{m'} \quad \mathbf{q}_0 \mathbf{q}_1 \ldots \mathbf{q}_{m'}$$

where:
- $\mathbf{p}_0 \ldots \mathbf{p}_n$: Soft prompt embeddings
- $c_0 \ldots c_{m'}$: Hard prompt tokens (e.g., "Translate to French:")
- $\mathbf{q}_0 \ldots \mathbf{q}_{m'}$: Embeddings of hard prompt tokens

**Possibilities**:
- Arrange or intersperse prompts in different patterns
- Soft prompts before hard prompts
- Interleaved soft and hard prompts
- Task-specific optimal patterns

**Benefits**:
- Flexibility in prompt design
- Combine efficiency (soft) with interpretability (hard)
- Better task adaptation
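One possible arrangement, soft prompts first, then the embedded hard prompt, then the embedded user input, can be sketched as follows. The tiny vocabulary, the random embedding table, and the helper `embed` are hypothetical. Other orderings or interleavings are built the same way by changing the concatenation order.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical toy vocabulary and frozen embedding table.
vocab = {"Translate": 0, "to": 1, "French": 2, ":": 3, "Look": 4, "out": 5, "!": 6}
embedding_table = rng.normal(size=(len(vocab), d))

# Trainable soft prompt embeddings (3 vectors, made-up size).
soft_prompt = rng.normal(scale=0.02, size=(3, d))

def embed(tokens):
    """Look up frozen embeddings for a list of token strings."""
    return embedding_table[[vocab[t] for t in tokens]]

hard_prompt = ["Translate", "to", "French", ":"]
user_input = ["Look", "out", "!"]

# Mixed pattern: [soft prompt] [hard prompt embeddings] [input embeddings].
mixed = np.concatenate([soft_prompt, embed(hard_prompt), embed(user_input)], axis=0)

# Mask marking which positions are trainable (soft) versus fixed (hard/input).
trainable_mask = np.array(
    [True] * len(soft_prompt) + [False] * (len(hard_prompt) + len(user_input))
)
print(mixed.shape, int(trainable_mask.sum()))  # (10, 8) 3
```

The hard prompt stays readable and editable as text, while the soft positions absorb whatever task information is learned during training.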

### Training Soft Prompts

So far we have focused on **where soft prompts are inserted** in LLMs. Training follows standard supervised learning, so we state only the objective here:

**Standard Training Objective**: Maximize likelihood of correct model output given model input

$$\max_{\theta} \sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \log \text{Pr}_{\theta}(\mathbf{y} \mid \mathbf{x})$$

where $\theta$ represents the soft prompt parameters.

#### Connection to Other Methods

Learning soft prompts relates to many issues in **LLM fine-tuning**:

**1. Context Compression**

- View as context compression problem
- Apply knowledge distillation methods
- Compress prompts into pseudo tokens

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example: Context Compression [Mu et al., 2024]</strong><br>

<strong>Approach:</strong><br>
Compress long prompts into compact pseudo tokens<br>

<strong>Process:</strong><br>
1. <span style="background-color: #ffb6c1; padding: 2px 4px;">Long prompt</span> compressed into <span style="background-color: #90ee90; padding: 2px 4px;">few pseudo tokens</span><br>
2. Pseudo tokens appended to each input sequence<br>
3. Embeddings optimized to mimic standard-prompted model predictions<br>
4. Prompting knowledge <strong>distilled</strong> from teacher into pseudo tokens<br>

<strong>üí° Result:</strong> Efficient inference with <strong>compressed representations</strong> of complex prompts!

</div>

### Broader Perspective on Parameter-Efficient Fine-Tuning

#### General Principle

Broadly speaking, many **parameter-efficient fine-tuning methods** can be thought of as learning some sort of **soft prompt** [Lialin et al., 2023].

**Key Insight**: When we fine-tune a part of an LLM for a task, this can essentially be seen as **injecting task-related prompting information** into that specific part of the model.

#### Example: Adaptor Layers

Another widely-used approach:
- Add **adaptor layer** between existing model layers
- Fine-tune only the adaptor layer on specific tasks
- Keep original model parameters fixed

**Conceptually**: The adaptor layer learns task-specific soft prompts that modify information flow through the model.

#### Unified View

<div style="max-width: 900px; font-size: 0.95em; line-height: 1.6; padding: 20px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

**Common Theme Across Methods**:

- **Prefix Tuning**: Soft prompts at each layer input
- **Prompt Tuning**: Soft prompts at embedding layer
- **Adaptor Layers**: Task-specific transformations between layers
- **LoRA**: Low-rank adaptations as implicit soft prompts

All can be viewed as introducing **learnable, task-specific information** (soft prompts) into different parts of the model architecture.

</div>

### 3.3.2.3 Learning Soft Prompts with Compression

While prefix tuning and prompt tuning provide efficient ways to adapt LLMs, they don't directly address the problem of **long prompts**. In many applications, we need to provide extensive context, instructions, or examples that result in very long input sequences.

**Challenge**: Processing long prompts is computationally expensive, especially when the same context is used repeatedly for different inputs.

**Solution**: Use soft prompts to **compress** long textual prompts into compact vector representations.

#### The Context Compression Problem

Consider a scenario where we repeatedly use a long prompt:

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Long Prompt Example (200+ tokens):</strong><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">System Instructions:</span> "You are a medical diagnosis assistant with expertise in cardiology, pulmonology, and general internal medicine. When analyzing patient symptoms, always consider: (1) patient history and demographics, (2) symptom onset and progression, (3) relevant lab values and vital signs, (4) differential diagnosis with probability rankings, (5) recommended diagnostic tests, (6) treatment options with contraindications. Provide detailed reasoning for each diagnosis, cite medical literature when applicable, and always recommend consulting with a qualified physician for final decisions..."<br>

<span style="background-color: #87ceeb; padding: 2px 4px;">User Input:</span> "Patient presents with chest pain and shortness of breath..."<br>

<strong>üí° Problem:</strong> The 200+ token instruction is processed for <strong>every single query</strong>, wasting computation!

</div>

#### Prompt Compression Approach

**Goal**: Compress the long prompt into a small number of learnable pseudo-tokens (soft prompts).

**Method**: Use knowledge distillation to learn compressed representations.

**Process**:

1. **Teacher Model**: Uses full long prompt $\mathbf{c}_{long}$ + user input $\mathbf{z}$
   $$\mathbf{y}^t = \arg\max_{\mathbf{y}} \text{Pr}^t(\mathbf{y} \mid \mathbf{c}_{long}, \mathbf{z})$$

2. **Student Model**: Uses compressed soft prompts $\mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_k$ (where $k \ll |\mathbf{c}_{long}|$)
   $$\text{Pr}_{\theta}^s(\mathbf{y} \mid \mathbf{p}_0, \mathbf{p}_1, \ldots, \mathbf{p}_k, \mathbf{z})$$

3. **Training Objective**: Learn pseudo-tokens to match teacher's outputs
   $$\min_{\theta} \sum_{(\mathbf{z}, \mathbf{y}^t)} -\log \text{Pr}_{\theta}^s(\mathbf{y}^t \mid \mathbf{p}_0, \ldots, \mathbf{p}_k, \mathbf{z})$$

**Interpretation**: The soft prompts $\mathbf{p}_0, \ldots, \mathbf{p}_k$ learn to encode, in compact form, the information from the long prompt that the student needs to reproduce the teacher's outputs. The compression is generally lossy.
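The teacher, student, and training objective above can be mimicked with a linear toy model. This is a sketch under heavy assumptions, not an LLM: the "model" is a frozen softmax classifier over 5 outputs, the long prompt is summarized by mean-pooling 200 random embeddings, and the student replaces it with a single trainable vector `p` (so $k = 0$ in the indexing above). All sizes and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_outputs, prompt_len = 8, 5, 200

W = rng.normal(size=(num_outputs, d))                     # frozen, shared weights
long_prompt = rng.normal(loc=1.0, size=(prompt_len, d))   # embeddings of c_long
inputs = rng.normal(size=(64, d))                         # pooled embeddings of z

def softmax(logits):
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

# 1. Teacher: conditions on the full long prompt and emits its best output y^t.
context = long_prompt.mean(axis=0)
teacher_labels = softmax((inputs + context) @ W.T).argmax(axis=1)

# 2. Student: conditions on one trainable soft prompt vector instead.
p = np.zeros(d)
rows = np.arange(len(inputs))

def student_loss(p_vec):
    probs = softmax((inputs + p_vec) @ W.T)
    return -np.log(probs[rows, teacher_labels]).mean()

# 3. Training objective: minimize -log Pr^s(y^t | p, z) over p only.
loss_before = student_loss(p)
learning_rate = 0.05
for _ in range(500):
    probs = softmax((inputs + p) @ W.T)
    probs[rows, teacher_labels] -= 1.0           # d(loss)/d(logits) per example
    grad_p = (probs @ W).mean(axis=0)            # chain rule through logits = W (z + p)
    p -= learning_rate * grad_p

loss_after = student_loss(p)
agreement = (softmax((inputs + p) @ W.T).argmax(axis=1) == teacher_labels).mean()
print(f"loss before: {loss_before:.3f}, loss after: {loss_after:.3f}")
print(f"student/teacher agreement: {agreement:.0%}")
```

The student never sees the 200 prompt embeddings. It only sees the teacher's outputs, and the single vector `p` is adjusted until the student's predictions move toward them.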

#### Compression Ratio and Efficiency

**Typical Compression**: A 200-token prompt can be compressed into 10-20 soft prompt embeddings.

**Compression Ratio**: 10:1 to 20:1 reduction in prompt length.

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Efficiency Comparison:</strong><br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">Original Approach:</span><br>
• Input length: 200 (prompt) + 50 (user query) = 250 tokens<br>
• Computational cost: Process 250 tokens through all layers<br>
• Repeated for every query<br>
<br>
<span style="background-color: #90ee90; padding: 2px 4px;">Compressed Approach:</span><br>
• Input length: 10 (soft prompts) + 50 (user query) = 60 tokens<br>
• Computational cost: Process 60 tokens through all layers<br>
• <strong>76% reduction</strong> in sequence length!<br>

<strong>💡 Speedup:</strong> Self-attention cost grows as O(n²) in sequence length, so the attention computation shrinks by roughly (250/60)² ≈ 17×, a <strong>quadratic</strong> gain, while costs that are linear in length (such as feed-forward layers) shrink by about 4×.

</div>
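The figures in this comparison follow from two short formulas. The helper names below are hypothetical, and the attention estimate assumes cost proportional to the square of the sequence length while ignoring everything else in the model.

```python
def length_reduction(original_len: int, compressed_len: int) -> float:
    """Fractional reduction in sequence length."""
    return 1 - compressed_len / original_len

def attention_cost_ratio(original_len: int, compressed_len: int) -> float:
    """How many times cheaper self-attention becomes, assuming O(n^2) cost."""
    return (original_len / compressed_len) ** 2

original_len = 200 + 50    # long prompt + user query
compressed_len = 10 + 50   # soft prompts + user query

print(f"{length_reduction(original_len, compressed_len):.0%}")        # 76%
print(f"{attention_cost_ratio(original_len, compressed_len):.1f}x")   # 17.4x
```

The benefit shrinks as the user query grows relative to the prompt, because only the prompt portion is compressed.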

#### Practical Applications

**1. Repeated Task Instructions**

When the same instruction is used for many inputs (e.g., customer service chatbot, document classification):
- Compress system prompt once during training
- Deploy with compact soft prompts
- Significant inference cost reduction

**2. Long-Context RAG Systems**

When providing retrieved documents as context:
- Compress retrieved passages into soft prompts
- Reduces context window usage
- Allows more user input capacity

**3. Multi-Turn Dialogues**

Compress conversation history:
- Long chat history → compact soft prompts
- Maintains conversation context
- Reduces token usage over time

#### Advantages and Trade-offs

**Advantages**:
- ✅ **Computational Efficiency**: Dramatically reduced sequence length
- ✅ **Cost Reduction**: Lower per-query inference cost, since fewer tokens are processed
- ✅ **Preserves Information**: Soft prompts can encode complex instructions
- ✅ **Task-Specific**: Optimized for specific use cases

**Trade-offs**:
- ❌ **Training Required**: Need training data (inputs paired with teacher outputs) to learn the compression
- ❌ **Not Interpretable**: Cannot read or edit compressed prompts
- ❌ **Fixed Task**: Compression is task-specific, not transferable
- ❌ **Quality Dependency**: Requires high-quality teacher model

#### Recent Work: AutoCompressors [Chevalier et al., 2023]

Recent research has developed sophisticated compression methods:

**AutoCompressors** learn to compress prompts by:
1. Training special compression tokens that attend to long context
2. Using recursive compression for very long sequences
3. Achieving high compression ratios with minimal quality loss

**Key Results**: 
- Compress 4000 tokens → 50 summary vectors
- Maintains 90%+ of original model performance
- Enables processing of longer documents within context limits

## 3.3.3 Prompt Length Reduction

### The Interpretability-Efficiency Trade-off

#### Soft Prompts Limitations

While soft prompts provide dense, hidden representations, they have **significant drawbacks**:

**1. Lack of Interpretability**
- Not directly interpretable by humans
- Difficult to understand how inputs influence outputs
- Barrier for users trying to debug or improve prompts

**2. Inflexibility**
- Cannot easily adjust without extensive fine-tuning
- Modifications require retraining
- Limits utility in dynamic environments
- Frequent prompt changes become impractical

### Alternative: Text Simplification

One alternative for developing efficient prompts: **Simplify the text** used for prompting while maintaining interpretability.

### Example: Healthcare and Finance Prompt Simplification

<div style="max-width: 900px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong><span style="background-color: #ffb6c1; padding: 2px 4px;">Original Prompt (Verbose):</span></strong><br>

The task involves developing a language model capable of understanding and responding to user inquiries across various domains, with a particular emphasis on healthcare and finance. Considering the broad range of potential queries, from the specifics of medical diagnoses to the nuances of financial regulations, the model must ensure a comprehensive understanding and accurate responses.<br>

Question:<br>
What are the best practices for using artificial intelligence in diagnosing cardiovascular diseases?<br>

<u>_____________</u>

<strong><span style="background-color: #87ceeb; padding: 2px 4px;">Method 1: Delete Unimportant Parts</span></strong><br>

The task involves developing a language model capable of understanding and responding to user inquiries across various domains, with a particular emphasis on healthcare and finance. <strike style="text-decoration: line-through; color: #999;">Considering the broad range of potential queries, from the specifics of medical diagnoses to the nuances of financial regulations.</strike> The model must ensure a comprehensive understanding and accurate responses.<br>

<u>_____________</u>

<strong><span style="background-color: #90ee90; padding: 2px 4px;">Method 2: Paraphrase to Shorter Text</span></strong><br>

The task involves developing a language model focused on healthcare and finance, capable of understanding and accurately responding to a wide range of user inquiries.<br>

<strong>üí° Result:</strong> Reduced from ~60 words to ~25 words while preserving core meaning!

</div>
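A quick way to check how much a simplification saves is to count words before and after. The sketch below uses whitespace splitting, which is only a rough proxy for the token count an LLM would see. `word_count` and `reduction_percent` are hypothetical helper names.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (a rough proxy for tokens)."""
    return len(text.split())

def reduction_percent(original: str, simplified: str) -> float:
    """Percentage of words removed by the simplification."""
    return 100 * (1 - word_count(simplified) / word_count(original))

original = (
    "The task involves developing a language model capable of understanding "
    "and responding to user inquiries across various domains, with a particular "
    "emphasis on healthcare and finance. Considering the broad range of potential "
    "queries, from the specifics of medical diagnoses to the nuances of financial "
    "regulations, the model must ensure a comprehensive understanding and "
    "accurate responses."
)
paraphrased = (
    "The task involves developing a language model focused on healthcare and "
    "finance, capable of understanding and accurately responding to a wide range "
    "of user inquiries."
)

print(word_count(original), word_count(paraphrased))
print(f"{reduction_percent(original, paraphrased):.0f}% fewer words")
```

For billing or context-window purposes, the model's own tokenizer should replace `text.split()`, since token counts and word counts can differ noticeably.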

### Prompt Simplification Methods

This problem can be viewed as a classic NLP task: **Text Simplification**. Methods are general and not restricted to prompts.

#### Method 1: Heuristic-Based Token Removal

**Approach**:
- Define heuristics to identify redundant words
- Examine each token's contribution to overall meaning
- Remove tokens providing minimal value
- Preserve essential information

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Example: Token Removal Process</strong><br>

<strong>Step 1 - Analyze:</strong><br>
Examine each token's importance<br>

<strong>Step 2 - Score Based On:</strong><br>
• <span style="background-color: #87ceeb; padding: 2px 4px;">Semantic contribution</span> - Does it add meaning?<br>
• <span style="background-color: #87ceeb; padding: 2px 4px;">Redundancy with other tokens</span> - Is it repetitive?<br>
• <span style="background-color: #87ceeb; padding: 2px 4px;">Syntactic necessity</span> - Is it required for grammar?<br>

<strong>Step 3 - Remove:</strong><br>
Delete <span style="background-color: #ffb6c1; padding: 2px 4px;">low-scoring tokens</span><br>

<strong>üí° References:</strong> [Li et al., 2023c; Jiang et al., 2023b]

</div>
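The analyze, score, and remove loop can be sketched with two very simple heuristics: a fixed list of filler words (low semantic contribution) and immediate repetition (redundancy). This is an illustrative simplification, not the method of the cited papers. The filler list and the function name are assumptions, and syntactic necessity is not checked at all.

```python
import re

# Hypothetical filler list; a practical system would score tokens with a model.
FILLER_WORDS = {"please", "very", "really", "just", "kindly", "basically", "actually"}

def remove_low_value_tokens(prompt: str, filler=FILLER_WORDS) -> str:
    """Drop filler words and immediately repeated words from a prompt."""
    kept = []
    previous = None
    for token in prompt.split():
        normalized = re.sub(r"\W+", "", token).lower()
        if normalized in filler:
            continue                      # low semantic contribution
        if normalized and normalized == previous:
            continue                      # redundant repetition
        kept.append(token)
        previous = normalized
    return " ".join(kept)

example = "Please just summarize the the report and really keep it very short."
print(remove_low_value_tokens(example))
# summarize the report and keep it short.
```

Rules like these are cheap but blunt. A word such as "just" can carry meaning in some prompts, which is one reason heuristic removal needs validation on the downstream task.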

#### Method 2: Sequence-to-Sequence Models

**Approach**:
- Frame as sequence-to-sequence task
- Train encoder-decoder model
- Transform input text into simplified form

**Requirements**:
- Labeled data for text simplification
- Pairs of (complex text, simplified text)

**Training Objective**:
$$\max_{\theta} \sum_{(\mathbf{x}_{complex}, \mathbf{x}_{simple})} \log \text{Pr}_{\theta}(\mathbf{x}_{simple} \mid \mathbf{x}_{complex})$$

### Method 3: LLM-Based Simplification

#### Direct Application

Many LLMs have been **fine-tuned and aligned** to perform text simplification tasks.

**Straightforward Approach**: Use these models directly to simplify prompts

#### Constrained Simplification

Prompt an LLM to simplify text under certain **constraints**:

<div style="max-width: 850px; font-family: monospace; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px;">

<strong>Simplification Prompt Template:</strong><br>

Please simplify the following text while preserving its essential meaning.<br>

<span style="background-color: #ffeb3b; padding: 2px 4px;">Constraints:</span><br>
- Maximum length: 30 words<br>
- Maintain key information about healthcare and finance<br>
- Keep formal tone<br>

<span style="background-color: #ffb6c1; padding: 2px 4px;">Original Text:</span><br>
{complex_prompt}<br>

<span style="background-color: #90ee90; padding: 2px 4px;">Simplified Text:</span><br>
{LLM generates simplified version}

</div>
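The template above can be filled programmatically so that the constraints are easy to vary. The sketch below only builds the prompt string and checks a candidate output against the length limit. It does not call any LLM, and the function names and default values are assumptions.

```python
def build_simplification_prompt(
    text: str,
    max_words: int = 30,
    keep_topics=("healthcare", "finance"),
    tone: str = "formal",
) -> str:
    """Fill the simplification template with the given constraints."""
    constraints = [
        f"- Maximum length: {max_words} words",
        f"- Maintain key information about {' and '.join(keep_topics)}",
        f"- Keep {tone} tone",
    ]
    return (
        "Please simplify the following text while preserving its essential meaning.\n\n"
        "Constraints:\n" + "\n".join(constraints) + "\n\n"
        f"Original Text:\n{text}\n\n"
        "Simplified Text:\n"
    )

def within_limit(simplified: str, max_words: int = 30) -> bool:
    """Check a returned simplification against the length constraint."""
    return len(simplified.split()) <= max_words

prompt = build_simplification_prompt("<long prompt to be simplified>")
print(prompt)
```

Since an LLM may not respect the limit, a check such as `within_limit` on the returned text gives a simple way to decide whether to retry.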

#### Common Constraints

1. **Length Limits**: Maximum word/token count
2. **Information Preservation**: Must retain specific key points
3. **Style Requirements**: Maintain formality, tone, etc.
4. **Readability Goals**: Target reading level
5. **Format Specifications**: Maintain structure (e.g., bullet points)

### Practical Example: Multi-Method Comparison

<div style="max-width: 900px; font-size: 0.9em; line-height: 1.6; padding: 15px; border-left: 4px solid #003865; border-radius: 5px; background-color: #f9f9f9;">

**Original Prompt** (60 words):<br>
"You are an expert assistant specializing in providing detailed, accurate, and well-researched information across multiple domains including but not limited to science, technology, healthcare, finance, and general knowledge. When answering questions, please ensure that your responses are comprehensive, cite relevant sources when applicable, and are presented in a clear and accessible manner suitable for users with varying levels of expertise."<br>

<hr style="border: 1px solid #ddd; margin: 12px 0;">

**Heuristic Simplification** (28 words):<br>
"You are an expert assistant providing accurate information across science, technology, healthcare, finance, and general knowledge. Ensure responses are comprehensive and clear for users with varying expertise levels."<br>

**Seq2Seq Model Output** (22 words):<br>
"You are an expert assistant. Provide detailed, accurate information on science, technology, healthcare, finance, and general topics. Make responses clear and accessible."<br>

**LLM Simplification (with constraints)** (18 words):<br>
"Expert assistant providing clear, accurate information across science, technology, healthcare, finance, and general topics for all expertise levels."<br>

<strong>💡 Analysis:</strong><br>
- All methods preserve the core role and domains, though each drops some detail (for example, the instruction to cite sources)<br>
- Significant length reduction (60 → 18 words = 70% reduction)<br>
- Computational savings scale with frequency of use<br>
- Interpretability maintained (unlike soft prompts)

</div>

### Benefits and Limitations of Prompt Length Reduction

#### Benefits

**1. Computational Efficiency**
- Fewer tokens to process
- Faster inference times
- Lower API costs
- Reduced memory requirements

**2. Maintained Interpretability**
- Still human-readable text
- Easy to understand and debug
- Can manually adjust if needed
- No "black box" representations

**3. Flexibility**
- Easy to modify without retraining
- Quick iterations during development
- Adaptable to changing requirements

#### Limitations

**1. Information Loss Risk**
- May lose nuanced details
- Could affect task performance
- Requires careful validation

**2. Not Always Applicable**
- Some prompts already concise
- Complex tasks may need detailed instructions
- Over-simplification can hurt performance

**3. Quality Depends on Method**
- Heuristics may be too aggressive
- Seq2Seq models need training data
- LLM-based methods depend on model quality

### Comparison: Soft Prompts vs. Prompt Length Reduction

<div style="max-width: 950px; font-size: 0.9em; line-height: 1.5; padding: 15px;">

<table style="width: 100%; border-collapse: collapse; margin: 15px 0;">
<thead style="background-color: #003865; color: white;">
<tr>
<th style="padding: 10px; border: 1px solid #ddd;">Aspect</th>
<th style="padding: 10px; border: 1px solid #ddd;">Soft Prompts</th>
<th style="padding: 10px; border: 1px solid #ddd;">Prompt Length Reduction</th>
</tr>
</thead>
<tbody>
<tr style="background-color: #f9f9f9;">
<td style="padding: 8px; border: 1px solid #ddd;"><strong>Representation</strong></td>
<td style="padding: 8px; border: 1px solid #ddd;">Continuous vectors (hidden states)</td>
<td style="padding: 8px; border: 1px solid #ddd;">Discrete text (simplified)</td>
</tr>
<tr>
<td style="padding: 8px; border: 1px solid #ddd;"><strong>Interpretability</strong></td>
<td style="padding: 8px; border: 1px solid #ddd;">❌ Not human-interpretable</td>
<td style="padding: 8px; border: 1px solid #ddd;">✅ Human-readable text</td>
</tr>
<tr style="background-color: #f9f9f9;">
<td style="padding: 8px; border: 1px solid #ddd;"><strong>Flexibility</strong></td>
<td style="padding: 8px; border: 1px solid #ddd;">❌ Requires retraining to modify</td>
<td style="padding: 8px; border: 1px solid #ddd;">✅ Easy to manually adjust</td>
</tr>
<tr>
<td style="padding: 8px; border: 1px solid #ddd;"><strong>Efficiency</strong></td>
<td style="padding: 8px; border: 1px solid #ddd;">✅ Very compact representation</td>
<td style="padding: 8px; border: 1px solid #ddd;">✅ Reduced token count</td>
</tr>
<tr style="background-color: #f9f9f9;">
<td style="padding: 8px; border: 1px solid #ddd;"><strong>Setup Cost</strong></td>
<td style="padding: 8px; border: 1px solid #ddd;">High (requires fine-tuning)</td>
<td style="padding: 8px; border: 1px solid #ddd;">Low (apply simplification method)</td>
</tr>
<tr>
<td style="padding: 8px; border: 1px solid #ddd;"><strong>Best Use Case</strong></td>
<td style="padding: 8px; border: 1px solid #ddd;">Fixed tasks with high volume</td>
<td style="padding: 8px; border: 1px solid #ddd;">Dynamic tasks requiring iteration</td>
</tr>
</tbody>
</table>

</div>

## Summary

In this lecture, we covered advanced techniques for making prompting more efficient and automated:

3.3.1 Prompt Optimization

**Automated Prompt Discovery**:
- General framework: Search space + Performance estimation + Search strategy
- LLM-based optimization: Initialization → Evaluation → Pruning → Expansion
- Advanced techniques: Paraphrasing, edit operations, feedback-based refinement
- Classic methods: Evolutionary computation, reinforcement learning
- Structured optimization: Focus on instructions, demonstrations, or full prompts

**Key Takeaway**: Machine learning can automate the labor-intensive process of prompt design by systematically searching the space of possible prompts for high-performing formulations.
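The initialization, evaluation, pruning, and expansion steps listed above can be sketched as a small search loop. This is only an illustrative sketch: `toy_score` and `toy_expand` are hypothetical stand-ins (a keyword count and fixed text edits) for what would normally be validation-set accuracy and LLM-generated paraphrases.

```python
from typing import Callable, List

KEYWORDS = ("step by step", "concise", "json")
EDITS = (" Think step by step.", " Be concise.", " Answer in JSON.")


def toy_score(prompt: str) -> float:
    """Hypothetical scorer: rewards keywords, lightly penalizes length.
    A real system would run the LLM on held-out examples instead."""
    text = prompt.lower()
    return sum(k in text for k in KEYWORDS) - 0.001 * len(prompt)


def toy_expand(prompt: str) -> List[str]:
    """Hypothetical expansion: append each edit not already present.
    A real system might ask an LLM to paraphrase or refine the prompt."""
    lowered = prompt.lower()
    return [prompt + e for e in EDITS
            if e.strip().lower().rstrip(".") not in lowered]


def optimize_prompt(seed_prompts: List[str],
                    score: Callable[[str], float],
                    expand: Callable[[str], List[str]],
                    rounds: int = 3,
                    keep: int = 2) -> str:
    pool = list(dict.fromkeys(seed_prompts))             # initialization
    for _ in range(rounds):
        ranked = sorted(pool, key=score, reverse=True)   # evaluation
        survivors = ranked[:keep]                        # pruning
        children = [c for p in survivors for c in expand(p)]  # expansion
        pool = list(dict.fromkeys(survivors + children))  # drop duplicates
    return max(pool, key=score)


seeds = ["Classify the sentiment of the review.", "What is the sentiment?"]
best = optimize_prompt(seeds, toy_score, toy_expand)
print(best)
```

The `keep` parameter controls how aggressively candidates are pruned; a larger value explores more of the search space at the cost of more evaluations per round.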

### 3.3.2 Soft Prompts

**Learnable Hidden Representations**:
- Hard prompts: Explicit text sequences (human-interpretable)
- Soft prompts: Continuous vector representations (model-optimized)
- Parameter-efficient fine-tuning methods:
  - **Prefix Tuning**: Trainable prefixes at each layer
  - **Prompt Tuning**: Trainable embeddings at input layer
- Context distillation: Learn from teacher models with complex prompts
- Can combine soft and hard prompts for best of both worlds

**Key Takeaway**: Soft prompts provide compact, learnable representations that enable efficient task adaptation without modifying core model parameters, though at the cost of interpretability.
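To make "compact" concrete, the following back-of-the-envelope sketch counts trainable parameters for the two methods. The model dimensions are hypothetical, and the prefix tuning count assumes one trainable key vector and one trainable value vector per prefix position at every layer, which is a simplification.

```python
def prompt_tuning_params(prompt_length: int, hidden_dim: int) -> int:
    """Trainable embeddings at the input layer only."""
    return prompt_length * hidden_dim


def prefix_tuning_params(prefix_length: int, hidden_dim: int,
                         num_layers: int) -> int:
    """Assumption: a key and a value vector per prefix position per layer."""
    return num_layers * 2 * prefix_length * hidden_dim


# Hypothetical model: 12 layers, hidden size 768, about 110M frozen parameters
hidden_dim, num_layers, base_params = 768, 12, 110_000_000
length = 20

pt = prompt_tuning_params(length, hidden_dim)              # 15360
px = prefix_tuning_params(length, hidden_dim, num_layers)  # 368640

print(f"Prompt tuning: {pt:,} trainable ({pt / base_params:.4%} of base)")
print(f"Prefix tuning: {px:,} trainable ({px / base_params:.4%} of base)")
```

Under these assumptions both methods train well under 1% of the base model's parameters, with prompt tuning the smaller of the two because it touches only the input layer.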

### 3.3.3 Prompt Length Reduction

**Simplified Text Prompts**:
- Heuristic-based: Remove redundant tokens while preserving meaning
- Seq2Seq models: Learn to transform complex → simple
- LLM-based: Use existing capabilities with constraints
- Maintains interpretability while improving efficiency
- Flexible and easy to modify compared to soft prompts

**Key Takeaway**: Text simplification offers a middle ground - more efficient than full prompts, more interpretable than soft prompts.
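A heuristic reducer of the kind listed above can be sketched in a few lines. The filler patterns here are a hypothetical, hand-picked list, and whitespace-separated words are used as a rough proxy for tokens; a real system would tune the rules or score tokens with a model.

```python
import re

# Hypothetical rewrite rules: pattern -> replacement
REPLACEMENTS = {
    r"\bplease\b": "",
    r"\bkindly\b": "",
    r"\bi would like you to\b": "",
    r"\bin order to\b": "to",
    r"\b(very|really)\b": "",
}


def reduce_prompt(prompt: str) -> str:
    """Strip filler phrases while keeping the prompt readable."""
    reduced = prompt
    for pattern, replacement in REPLACEMENTS.items():
        reduced = re.sub(pattern, replacement, reduced, flags=re.IGNORECASE)
    reduced = re.sub(r"\s+", " ", reduced)              # collapse whitespace
    reduced = re.sub(r"\s+([,.;:!?])", r"\1", reduced)  # no space before punctuation
    reduced = reduced.strip()
    return reduced[:1].upper() + reduced[1:]            # restore sentence case


original = ("I would like you to please read the really long review below "
            "in order to decide whether its sentiment is positive or negative.")
shorter = reduce_prompt(original)

print(shorter)
print(f"{len(original.split())} words -> {len(shorter.split())} words")
```

Unlike a soft prompt, the reduced prompt is still plain text, so it can be inspected and edited by hand if a rule removes something important.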

### Overall Insights

**The Evolution of Prompting**:
1. **Manual Design** → Labor-intensive, requires expertise
2. **Automated Optimization** → Systematic search for better prompts
3. **Learnable Representations** → Soft prompts for efficiency
4. **Text Simplification** → Balance efficiency and interpretability

**Practical Guidance**:
- Use **prompt optimization** when you need to discover the best prompt formulation
- Use **soft prompts** when efficiency is critical and the task is fixed
- Use **prompt length reduction** when you need interpretability with efficiency
- Consider **hybrid approaches** combining multiple techniques

These advanced techniques represent the cutting edge of prompt engineering, moving from manual crafting toward automated, efficient, and optimized prompting strategies that make LLMs more practical and cost-effective in real-world applications.

## Exercise: Implementing Soft Prompt Mechanism and Comparing Effects

**Background**: Soft prompts are learnable continuous vectors that are combined with a model's input embeddings and optimized by gradient descent. Unlike hard prompts, which are natural-language text, soft prompts are not tied to vocabulary tokens, so they steer model outputs directly in embedding space.

**Task**: Complete the following Python code to implement a soft prompt mechanism and demonstrate its effects through comparison:

1. Define a `SoftPrompt` class to manage learnable prompt vectors
2. Implement a simple text classification model
3. Compare model outputs with and without soft prompts



In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPrompt(nn.Module):
    def __init__(self, prompt_length, embedding_dim):
        super().__init__()
        self.prompt_length = prompt_length
        self.embedding_dim = embedding_dim
        
        # TODO: Initialize soft prompts as learnable parameters
        self.soft_prompts = None  # TODO: Replace this
        
    def forward(self, input_embeddings):
        """Combine soft prompts with input embeddings"""
        batch_size = input_embeddings.size(0)
        
        # TODO: Expand soft prompts to batch dimension
        prompts = None  # TODO: Replace this
        
        # TODO: Add soft prompts before the input sequence
        combined_embeddings = None  # TODO: Replace this
        
        return combined_embeddings

class SimpleClassifier(nn.Module):
    """Simple text classifier"""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, use_soft_prompt=False, prompt_length=0):
        super().__init__()
        self.use_soft_prompt = use_soft_prompt
        self.prompt_length = prompt_length
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        if use_soft_prompt:
            self.soft_prompt = SoftPrompt(prompt_length, embedding_dim)
        
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)
        )
        
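    def forward(self, input_ids):
        embeddings = self.embedding(input_ids)
        
        if self.use_soft_prompt:
            # Prepend the soft prompt and keep it in the sequence: slicing it
            # off again would leave it with no effect and a zero gradient.
            embeddings = self.soft_prompt(embeddings)
        
        # Mean-pool over all positions (prompt + input tokens)
        sentence_embedding = embeddings.mean(dim=1)
        logits = self.classifier(sentence_embedding)
        
        return logits

def compare_models():
    """Compare models with and without soft prompts"""
    vocab_size = 1000
    embedding_dim = 128
    hidden_dim = 64
    num_classes = 3
    prompt_length = 2
    
    model_without = SimpleClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, False)
    model_with = SimpleClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, True, prompt_length)
    # Share embedding and classifier weights so any output difference comes
    # from the soft prompt, not from different random initializations.
    model_with.load_state_dict(model_without.state_dict(), strict=False)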
    
    input_ids = torch.randint(0, vocab_size, (2, 5))
    
    print("Input shape:", input_ids.shape)
    print("=" * 40)
    
    with torch.no_grad():
        # Without soft prompts
        logits_without = model_without(input_ids)
        probs_without = F.softmax(logits_without, dim=-1)
        print("Without soft prompts:")
        print(f"Probs: {probs_without}")
        print(f"Class: {torch.argmax(probs_without, dim=-1)}")
    
    print("-" * 20)
    
    with torch.no_grad():
        # With soft prompts
        logits_with = model_with(input_ids)
        probs_with = F.softmax(logits_with, dim=-1)
        print("With soft prompts:")
        print(f"Probs: {probs_with}")
        print(f"Class: {torch.argmax(probs_with, dim=-1)}")
    
    print("=" * 40)
    print("Comparison:")
    prob_diff = torch.abs(probs_with - probs_without)
    print(f"Max prob change: {torch.max(prob_diff):.4f}")
    print(f"Soft prompt shape: {model_with.soft_prompt.soft_prompts.shape}")

# Run comparison
compare_models()

def show_training_effect():
    """Show soft prompt changes during training"""
    print("\nTraining Effect:")
    print("=" * 30)
    
    model = SimpleClassifier(500, 64, 32, 2, True, 2)
    
    initial = model.soft_prompt.soft_prompts.clone().detach()
    
    # One training step
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    input_ids = torch.randint(0, 500, (4, 6))
    labels = torch.tensor([0, 1, 0, 1])
    
    logits = model(input_ids)
    loss = F.cross_entropy(logits, labels)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    trained = model.soft_prompt.soft_prompts.clone().detach()
    
    print(f"Initial norm: {torch.norm(initial):.4f}")
    print(f"Trained norm: {torch.norm(trained):.4f}")
    print(f"Change: {torch.norm(trained - initial):.4f}")

# Show training effect
show_training_effect()

In [None]:
# solution
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPrompt(nn.Module):
    def __init__(self, prompt_length, embedding_dim):
        super().__init__()
        self.prompt_length = prompt_length
        self.embedding_dim = embedding_dim
        
        # Initialize soft prompts as learnable parameters
        self.soft_prompts = nn.Parameter(torch.randn(prompt_length, embedding_dim) * 0.1)
        
    def forward(self, input_embeddings):
        """Combine soft prompts with input embeddings"""
        batch_size = input_embeddings.size(0)
        
        # Expand soft prompts to batch dimension
        prompts = self.soft_prompts.unsqueeze(0).repeat(batch_size, 1, 1)
        
        # Prepend soft prompts to input sequence
        combined_embeddings = torch.cat([prompts, input_embeddings], dim=1)
        
        return combined_embeddings

class SimpleClassifier(nn.Module):
    """Simple text classifier"""
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes, use_soft_prompt=False, prompt_length=0):
        super().__init__()
        self.use_soft_prompt = use_soft_prompt
        self.prompt_length = prompt_length
        
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        if use_soft_prompt:
            self.soft_prompt = SoftPrompt(prompt_length, embedding_dim)
        
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes)
        )
        
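    def forward(self, input_ids):
        embeddings = self.embedding(input_ids)
        
        if self.use_soft_prompt:
            # Prepend the soft prompt and keep it in the sequence: slicing it
            # off again would leave it with no effect and a zero gradient.
            embeddings = self.soft_prompt(embeddings)
        
        # Mean-pool over all positions (prompt + input tokens)
        sentence_embedding = embeddings.mean(dim=1)
        logits = self.classifier(sentence_embedding)
        
        return logits

def compare_models():
    """Compare models with and without soft prompts"""
    vocab_size = 1000
    embedding_dim = 128
    hidden_dim = 64
    num_classes = 3
    prompt_length = 2
    
    model_without = SimpleClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, False)
    model_with = SimpleClassifier(vocab_size, embedding_dim, hidden_dim, num_classes, True, prompt_length)
    # Share embedding and classifier weights so any output difference comes
    # from the soft prompt, not from different random initializations.
    model_with.load_state_dict(model_without.state_dict(), strict=False)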
    
    input_ids = torch.randint(0, vocab_size, (2, 5))
    
    print("Input shape:", input_ids.shape)
    print("=" * 40)
    
    with torch.no_grad():
        # Without soft prompts
        logits_without = model_without(input_ids)
        probs_without = F.softmax(logits_without, dim=-1)
        print("Without soft prompts:")
        print(f"Probs: {probs_without}")
        print(f"Class: {torch.argmax(probs_without, dim=-1)}")
    
    print("-" * 20)
    
    with torch.no_grad():
        # With soft prompts
        logits_with = model_with(input_ids)
        probs_with = F.softmax(logits_with, dim=-1)
        print("With soft prompts:")
        print(f"Probs: {probs_with}")
        print(f"Class: {torch.argmax(probs_with, dim=-1)}")
    
    print("=" * 40)
    print("Comparison:")
    prob_diff = torch.abs(probs_with - probs_without)
    print(f"Max prob change: {torch.max(prob_diff):.4f}")
    print(f"Soft prompt shape: {model_with.soft_prompt.soft_prompts.shape}")

# Run comparison
compare_models()

def show_training_effect():
    """Show soft prompt changes during training"""
    print("\nTraining Effect:")
    print("=" * 30)
    
    model = SimpleClassifier(500, 64, 32, 2, True, 2)
    
    initial = model.soft_prompt.soft_prompts.clone().detach()
    
    # One training step
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    input_ids = torch.randint(0, 500, (4, 6))
    labels = torch.tensor([0, 1, 0, 1])
    
    logits = model(input_ids)
    loss = F.cross_entropy(logits, labels)
    
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    trained = model.soft_prompt.soft_prompts.clone().detach()
    
    print(f"Initial norm: {torch.norm(initial):.4f}")
    print(f"Trained norm: {torch.norm(trained):.4f}")
    print(f"Change: {torch.norm(trained - initial):.4f}")

# Show training effect
show_training_effect()