Assignment 2: Building GPT-2 from Scratch
Enhanced with Advanced Concepts
CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin
Due: See Course Website
Overview
In this assignment, you will implement the core components of GPT-2: attention mechanisms and the transformer architecture. You'll build practical, engineering-focused implementations that demonstrate both textbook concepts and modern optimizations that are currently used in production LLMs.
Important Note: This assignment includes several advanced concepts that extend beyond our textbook coverage. Don't worry - we provide detailed explanations and background for all advanced techniques below.
Learning Objectives
By completing this assignment, you will:
- Implement attention mechanisms from basic to optimized
- Build the GPT-2 architecture with modern improvements
- Understand engineering trade-offs in LLM design
- Experience the progression from theory to production
- Learn cutting-edge optimizations used in modern LLMs like LLaMA and GPT-4
Assignment Structure
assignment2_starter/
├── src/ # Your implementation files
│ ├── Attention.py # Attention mechanisms (40 points)
│ └── gpt2_model.py # GPT-2 model (40 points)
├── tests/ # Test files (don't modify)
├── data/ # Sample data
├── tools/ # Utility scripts
└── requirements.txt # Dependencies
Advanced Concepts Guide
Before diving into implementation, please read this section carefully. It explains several modern LLM techniques that go beyond our textbook coverage but are essential for understanding production-quality models.
QuickNorm vs LayerNorm
What you learned in the textbook
LayerNorm normalizes both mean and variance:
# Standard LayerNorm concept (from textbook)
# 1. Calculate mean and variance across features
# 2. Subtract mean and divide by sqrt(variance + eps)
# 3. Apply learnable scale and shift parameters
What you'll implement in this assignment
QuickNorm (similar to RMSNorm) skips mean centering for efficiency:
# QuickNorm implementation steps:
# 1. Calculate variance only (skip mean calculation)
# 2. Normalize by dividing by sqrt(variance + eps)
# 3. Apply only scale parameter (no shift)
# Your task: implement in the forward() method
Why this matters:
- Speed: 30% faster than LayerNorm (fewer operations)
- Memory: Reduces activation memory usage
- Performance: Surprisingly, often works as well as full LayerNorm
- Used in: LLaMA, PaLM, and other modern models
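To make the steps above concrete, here is a minimal RMSNorm-style sketch. The class name, attribute names, and epsilon value are illustrative; the TODOs in gpt2_model.py define the exact names and formula you must follow:
import torch
import torch.nn as nn

class QuickNorm(nn.Module):
    """RMSNorm-style normalization: scale by the root mean square, with no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))   # learnable scale only, no shift

    def forward(self, x):
        mean_square = x.pow(2).mean(dim=-1, keepdim=True)    # average of squared features
        return x * torch.rsqrt(mean_square + self.eps) * self.scale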
SwiGLU vs GELU Activation
What you learned in the textbook
GELU activation function:
# GELU concept (from textbook)
# Smooth approximation to ReLU with probabilistic gating
# Formula: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
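For reference only, the formula above translates directly into a one-line function (PyTorch also ships this as torch.nn.functional.gelu with approximate='tanh'):
import math
import torch

def gelu_tanh(x):
    # Tanh approximation of GELU, matching the formula above
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))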
What you'll implement in this assignment
SwiGLU (Swish-Gated Linear Unit) from LLaMA:
# SwiGLU implementation approach:
# 1. Project input through two separate linear layers (gate and value)
# 2. Apply SiLU (Swish) activation to the gate projection
# 3. Element-wise multiply the activated gate with value projection
# 4. Project the result back to original dimension
# Your task: implement this gated activation pattern
Research Foundation:
Primary Research: "GLU Variants Improve Transformer" (Shazeer, 2020)
LLaMA Implementation: "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)
Why SwiGLU is Superior:
- Gated Information Flow: Unlike GELU which applies the same transformation to all inputs, SwiGLU uses a learned gate to selectively control which information passes through. This creates more sophisticated feature interactions.
- Empirical Performance: In the original GLU paper, SwiGLU consistently outperformed GELU across multiple language modeling benchmarks, typically improving perplexity by 1-3 points.
- Nonlinear Interactions: The elementwise multiplication between the gate and value creates richer nonlinear interactions compared to simple pointwise activations.
- Gradient Flow: The gating mechanism provides better gradient flow during backpropagation, especially for deeper networks.
- Production Validation: Used in LLaMA (65B+ parameters), PaLM-2, and Chinchilla with demonstrated improvements in language understanding tasks.
Mathematical Intuition:
SwiGLU can be viewed as a generalization of the classical MLP with a learned, input-dependent gating mechanism:
# Conceptual comparison:
# Traditional MLP: fixed activation applied to all inputs uniformly
# SwiGLU: learned gate selectively controls information flow
#
# Core implementation pattern (inside the module's forward method):
def forward(self, x):
    gate = self.w_gate(x)        # Gate projection
    up = self.w_up(x)            # Value projection
    swish_gate = F.silu(gate)    # Apply SiLU to gate (F is torch.nn.functional)
    gated = swish_gate * up      # Element-wise multiply
    return self.w_down(gated)    # Final projection
Implementation Details:
- Input Splitting: Project the input to 2× the hidden dimension and split it into gate and value halves, or equivalently use two separate gate/value projections as described above
- SiLU Activation: Apply SiLU (Swish) activation to the gate projection
- Gated Output: Element-wise multiply the activated gate with the value projection
- Parameter Cost: Requires ~1.67× parameters compared to standard MLP, but this cost is offset by better performance
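Putting these pieces together, a minimal sketch of a gated feed-forward module could look like the following. The class and layer names (SwiGLU, w_gate, w_up, w_down) are illustrative, and the starter code may instead use a single fused projection that is split into gate and value halves; the two forms are equivalent:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)   # gate projection
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)     # value projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)   # back to the model dimension

    def forward(self, x):
        gated = F.silu(self.w_gate(x)) * self.w_up(x)   # SiLU-activated gate times value
        return self.w_down(gated)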
Weight Tying Optimization
What you learned in the textbook
Separate embedding and output layers:
# Separate weights concept (textbook approach)
# input_embedding: maps token_id → hidden_representation
# output_projection: maps hidden_representation → vocab_logits
# Two separate parameter matrices (doubled memory usage)
What you'll implement in this assignment
Weight tying (sharing weights):
# Weight tying implementation approach:
# 1. Create token embedding layer as normal
# 2. Create output linear layer WITHOUT bias
# 3. Set output layer's weight to be the same as embedding weight
# 4. Now both input and output use the same parameter matrix
# Result: 50% reduction in vocabulary-related parameters
Research Foundation:
Primary Research: "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2017)
Theoretical Analysis: "Weight Tying Improves Inclusivity of Word Representations" (Kumar & Tsvetkov, 2019)
The Conceptual Intuition:
Weight tying is based on a fundamental insight about vocabulary representation:
- Shared Semantic Space: Input and output vocabularies represent the same concepts, so why use different representations?
- Symmetric Learning: When the model learns that "cat" maps to representation X during input, it should naturally output that same representation X when generating "cat"
- Regularization Effect: Forcing the model to use consistent representations acts as a form of regularization
- Information Density: The same parameter matrix encodes both input understanding and output generation
Implementation Benefits:
- Memory Reduction: Eliminates duplicate vocabulary parameters, reducing model size by 20-30%
- Training Stability: Shared representations prevent input/output vocabulary drift during training
- Generalization: Often improves performance on rare words by strengthening their representations
- Efficiency: Faster inference due to reduced memory bandwidth requirements
- Production Standard: Used in GPT-2, GPT-3, T5, BERT, and many other language models
Mathematical Perspective:
Weight tying creates a symmetric embedding space where input and output transformations share parameters:
# Conceptual understanding:
# Traditional approach: separate matrices for input and output
# Weight tying approach: same matrix used for both directions
#
# Implementation challenge:
# How do you make the embedding layer (vocab_size × embed_dim)
# work as the output layer (embed_dim × vocab_size)?
# Hint: Think about matrix transpose and weight sharing
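One way to think about the hint: nn.Linear stores its weight with shape (out_features, in_features), which for an output head is (vocab_size, embed_dim), the same shape as the embedding table, so sharing is a direct assignment. A minimal sketch, with GPT-2-small sizes used purely for illustration:
import torch.nn as nn

vocab_size, embed_dim = 50257, 768                        # illustrative GPT-2-small sizes

tok_emb = nn.Embedding(vocab_size, embed_dim)             # weight shape: (vocab_size, embed_dim)
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)    # weight shape: (vocab_size, embed_dim)

lm_head.weight = tok_emb.weight                           # both layers now share one parameter tensor
assert lm_head.weight.data_ptr() == tok_emb.weight.data_ptr()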
Historical Context:
Weight tying bridges classical NLP and modern deep learning:
- Classical Motivation: Early language models used the same vocabulary for input and output by necessity
- Deep Learning Adoption: Press & Wolf (2017) demonstrated significant improvements in neural language models
- Theoretical Understanding: Later work showed weight tying improves word representation quality and inclusivity
- Practical Adoption: Now standard practice in all major language model architectures
Sinusoidal vs Learned Positional Encoding
What you learned in the textbook
Learned positional embeddings:
# Learned positions concept (textbook)
# Embedding layer that maps position indices to learned vectors
# Parameters are learned during training for each position
What you'll implement in this assignment
Sinusoidal positions (from original Transformer paper):
# Sinusoidal positions implementation approach:
# 1. Create position indices (0, 1, 2, ..., seq_len-1)
# 2. Create frequency terms using exponential decay
# 3. Apply sine to even dimensions, cosine to odd dimensions
# 4. No learnable parameters - purely mathematical function
# Your task: implement the create_sinusoidal_positions method
Research Foundation:
Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
Analysis: "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)
The Mathematical Foundation:
Sinusoidal positional encoding uses trigonometric functions to create a unique, deterministic representation for each position:
# Mathematical formulation to implement:
# For each position pos and dimension i:
# Even dimensions (0, 2, 4, ...): apply sine function
# Odd dimensions (1, 3, 5, ...): apply cosine function
# Use different frequencies for different dimension pairs
#
# Core implementation pattern:
pos = torch.arange(max_len).unsqueeze(1) # Position indices [max_len, 1]
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div_term) # Even dimensions
pe[:, 1::2] = torch.cos(pos * div_term) # Odd dimensions
Key Properties and Advantages:
- Infinite Extrapolation: Can handle any sequence length, even much longer than training sequences
- Parameter Efficiency: Zero learnable parameters dedicated to positional information
- Relative Position Awareness: The model can learn to attend to relative positions through linear combinations
- Unique Encodings: Each position gets a unique, deterministic representation
- Smooth Transitions: Adjacent positions have similar but distinguishable encodings
- Frequency Hierarchy: Different dimensions encode position at different frequencies
Intuitive Understanding:
Think of sinusoidal encoding as a multi-scale clock system:
- High Frequencies (fast oscillation): Distinguish between adjacent positions
- Low Frequencies (slow oscillation): Capture broader positional patterns
- Combined Frequencies: Create unique fingerprints for each position
- Binary Analogy: Similar to how binary representation uses powers of 2 to uniquely represent numbers
Comparison with Learned Embeddings:
| Aspect | Learned Positional Embeddings | Sinusoidal Positional Encoding |
| --- | --- | --- |
| Parameters | max_seq_len × d_model parameters | Zero parameters |
| Sequence Length | Fixed maximum during training | Unlimited extrapolation |
| Interpretability | Learned patterns, less interpretable | Mathematical pattern, fully interpretable |
| Relative Position | Must be learned implicitly | Can be computed through linear combinations |
| Generalization | May not generalize to unseen lengths | Perfect generalization to any length |
Implementation Details:
The implementation creates a frequency spectrum that encodes position across multiple scales:
# Implementation strategy:
# 1. Create position tensor: shape (max_seq_len, 1)
# 2. Create frequency tensor: shape (d_model//2,)
# 3. Compute position × frequency matrix
# 4. Apply sine to columns 0, 2, 4, ... (even indices)
# 5. Apply cosine to columns 1, 3, 5, ... (odd indices)
# 6. Result: shape (max_seq_len, d_model)
#
# Challenge: How do you ensure the right frequencies for each dimension?
# Hint: Use exponential decay pattern from the original paper
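Here is a self-contained sketch of this strategy, written as a free function with an even d_model assumed; the starter file expects the same logic inside the create_sinusoidal_positions method:
import math
import torch

def create_sinusoidal_positions(max_len, d_model):
    # Returns a (max_len, d_model) table of fixed position encodings (assumes even d_model)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)            # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * -(math.log(10000.0) / d_model))                        # (d_model // 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
    return pe

print(create_sinusoidal_positions(1024, 768).shape)   # torch.Size([1024, 768])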
Why Both Approaches Matter:
- Research Value: Sinusoidal encoding from the foundational Transformer paper helps understand theoretical principles
- Practical Benefits: Zero parameters and infinite extrapolation make it attractive for resource-constrained applications
- Modern Variants: Inspired more sophisticated positional encodings like RoPE (Rotary Position Embedding) used in modern LLMs
- Educational Importance: Understanding both approaches deepens comprehension of how models encode sequence information
What You'll Implement
Part 1: Attention Mechanisms (40 points)
File: src/Attention.py
Implement attention mechanisms from basic to production-ready: BasicAttention, ScaledAttention, MultiHeadAttention, and causal masking.
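Before you start, here is a minimal sketch of the core computation all four pieces build on: scaled dot-product attention with an optional causal mask. The function name and tensor layout are assumptions for illustration; the TODOs in Attention.py define the actual class interfaces and shapes:
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: (batch, heads, seq_len, head_dim); returns an output of the same shape
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)                  # (batch, heads, seq, seq)
    if causal:
        seq_len = q.size(-2)
        future = torch.triu(torch.ones(seq_len, seq_len, device=q.device), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))             # block attention to future tokens
    weights = F.softmax(scores, dim=-1)                                # weights sum to 1 per query
    return weights @ v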
Part 2: GPT-2 with Modern Optimizations (40 points)
File: src/gpt2_model.py
Build GPT-2 with advanced features: QuickNorm, SwiGLU activation, TransformerLayer, complete forward pass, and text generation.
Note: All implementation details and hints are in the TODO comments within the source files. Refer to the Advanced Concepts Guide above for conceptual understanding.
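For the residual wiring only, here is a runnable pre-norm block sketch. It deliberately uses PyTorch stand-ins (nn.LayerNorm, nn.MultiheadAttention, and a plain GELU MLP) so that it runs on its own; your TransformerLayer should instead compose your QuickNorm, MultiHeadAttention, and SwiGLU modules, with the exact interfaces given by the TODOs:
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    # Pre-norm residual wiring: x = x + Attn(Norm(x)); then x = x + FFN(Norm(x))
    def __init__(self, d_model, n_heads, d_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)                                       # stand-in for QuickNorm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)    # stand-in for your MHA
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))                   # stand-in for SwiGLU

    def forward(self, x):
        h = self.norm1(x)
        mask = torch.triu(torch.ones(x.size(1), x.size(1), device=x.device), diagonal=1).bool()
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)    # causal self-attention
        x = x + attn_out                                    # residual connection around attention
        x = x + self.ffn(self.norm2(x))                     # residual connection around feed-forward
        return x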
Setup Instructions
Recommended Setup: We strongly recommend using Conda for this assignment. Conda automatically handles PyTorch installation for different platforms (macOS, Windows, Linux) and hardware (CPU, NVIDIA GPU, Apple Silicon), which eliminates the most common setup errors.
Quick Setup Overview
- Step 1: Install Conda (if not already installed) - one-time setup
- Step 2: Download and extract assignment2_starter.zip
- Step 3: Create environment:
conda env create -f environment.yml
- Step 4: Activate and start coding:
conda activate assignment2_env
Estimated time: 10-15 minutes (first time), 2-3 minutes (if Conda already installed)
Step 1: Install Conda (One-Time Setup)
Skip this step if you already have Anaconda or Miniconda installed.
macOS Installation
1. Download Miniconda installer
Visit: https://docs.conda.io/en/latest/miniconda.html
Download: "Miniconda3 macOS 64-bit pkg" (Intel Mac)
or "Miniconda3 macOS Apple M1 64-bit pkg" (M1/M2/M3 Mac)
2. Install Miniconda
- Double-click the downloaded .pkg file
- Follow installation wizard
- Use default settings
3. Verify installation
Open a NEW Terminal window and run:
conda --version
Expected output: conda 24.x.x or similar
If you see "conda: command not found":
Close and reopen Terminal, then try again
source ~/miniconda3/bin/activate
conda init zsh # if using zsh (default on macOS)
# OR
conda init bash # if using bash
# Then close and reopen Terminal
Windows Installation
1. Download Miniconda installer
Visit: https://docs.conda.io/en/latest/miniconda.html
Download: "Miniconda3 Windows 64-bit"
2. Install Miniconda
- Double-click the downloaded .exe file
- Installation options:
- Install for: Just Me (recommended)
- Destination folder: Use default
- Advanced options:
[X] CHECK: "Add Miniconda3 to my PATH environment variable"
[X] CHECK: "Register Miniconda3 as my default Python"
3. Verify installation
Open "Anaconda Prompt (Miniconda3)" from Start menu and run:
conda --version
Expected output: conda 24.x.x or similar
Note: Miniconda installs "Anaconda Prompt" - it's the same thing!
Linux/Ubuntu Installation
1. Download Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
2. Install Miniconda
bash Miniconda3-latest-Linux-x86_64.sh
- Press Enter to review license
- Type "yes" to accept
- Press Enter to confirm installation location
- Type "yes" when asked to initialize
3. Activate conda
source ~/.bashrc
4. Verify installation
conda --version
Expected output: conda 24.x.x or similar
Step 2: Create Assignment Environment
Once Conda is installed, follow these steps to set up your assignment environment. These steps are the same for all platforms.
All Platforms (macOS, Windows, Linux)
1. Download and extract assignment package
Download: assignment2_starter.zip from course website
Extract to your preferred location (e.g., Downloads or Documents)
2. Navigate to assignment directory
Open Terminal (macOS/Linux) or Anaconda Prompt (Windows)
# macOS/Linux:
cd ~/Downloads/assignment2_starter
# Windows:
cd C:\Users\YourName\Downloads\assignment2_starter
3. Create Conda environment
conda env create -f environment.yml
This will:
- Create an environment called "assignment2_env"
- Install Python 3.10
- Install PyTorch (automatically selects correct version for your system)
- Install all required packages (numpy, pytest, tqdm, etc.)
Expected time: 2-5 minutes depending on internet speed
4. Activate the environment
conda activate assignment2_env
5. Verify installation
python tools/run_tests.py
You should see test output showing TODO sections not implemented.
This is expected - you haven't written any code yet!
Step 3: Working on the Assignment
Every time you work on the assignment:
1. Open Terminal/Command Prompt
2. Navigate to assignment folder
cd ~/Downloads/assignment2_starter # macOS/Linux
cd C:\Users\YourName\Downloads\assignment2_starter # Windows
3. Activate environment
conda activate assignment2_env
4. Start coding!
# Edit files in src/ folder
# Run tests to check your work:
python tools/run_tests.py
5. When done for the day
conda deactivate
Troubleshooting
If "conda env create" fails
# Try creating environment manually:
conda create -n assignment2_env python=3.10 -y
conda activate assignment2_env
# Install PyTorch:
conda install pytorch -c pytorch -y
# Install other packages:
pip install numpy pytest tqdm pytest-timeout
If PyTorch import fails
# Verify PyTorch installation:
python -c "import torch; print('PyTorch version:', torch.__version__)"
# If it fails, reinstall PyTorch:
conda install pytorch -c pytorch -y
Alternative: pip/virtualenv (Advanced Users Only)
Advanced Method: Only use this if you're experienced with Python environments. We strongly recommend Conda for most students.
1. Install PyTorch first
Visit: https://pytorch.org/get-started/locally/
Follow platform-specific instructions
2. Create virtual environment
cd assignment2_starter
python3 -m venv assignment2_env
# Activate:
source assignment2_env/bin/activate # macOS/Linux
assignment2_env\Scripts\activate # Windows
3. Install dependencies
pip install -r requirements.txt
4. Verify
python tools/run_tests.py
Common Issues
Problem: "conda: command not found"
Solution: Close and reopen your terminal/command prompt. If still not working, add Conda to PATH or use Anaconda Prompt (Windows).
Problem: "ModuleNotFoundError: No module named 'torch'"
# Make sure environment is activated:
conda activate assignment2_env
# Verify PyTorch installation:
python -c "import torch; print(torch.__version__)"
# If fails, reinstall:
conda install pytorch -c pytorch
Need Help?
- Discussion Forum: Post error messages with OS details
- Office Hours: Tuesday/Thursday 2-4 PM
- Email: Include OS, Python version, and full error message
Quick Reference
Every time you work on the assignment:
conda activate assignment2_env
cd path/to/assignment2_starter
# Edit files, then test:
python tools/run_tests.py
When done:
conda deactivate
Delete environment (if needed):
conda env remove -n assignment2_env
Testing Your Implementation
Run Individual Module Tests
Test your implementation files:
python src/Attention.py
python src/gpt2_model.py
Run All Tests
python tools/run_tests.py
Run Pytest
python -m pytest tests/ -v
Grading Rubric (100 Points Total)
| Component | Points | Key Requirements |
| --- | --- | --- |
| Attention.py | 40 | BasicAttention (10), ScaledAttention (15), MultiHeadAttention (10), causal_mask (5) |
| gpt2_model.py | 40 | QuickNorm (5), SwiGLU (5), TransformerLayer (10), Forward (15), Generate (5) |
| Code Quality | 10 | Clean code, proper structure, no hardcoded paths |
| Documentation | 10 | Docstrings, comments where needed |
| Total | 100 | |
Grading Criteria
- Correctness (80%): Implementation produces expected outputs
- Code Quality (10%): Clean, readable code following best practices
- Documentation (10%): Clear docstrings and comments
Hints and Tips
- Start Simple: Get BasicAttention working first, then build up
- Test Often: Run tests after implementing each method
- Check Shapes: Print tensor shapes when debugging
- Read Comments: The TODOs have detailed hints
- Use PyTorch Docs: Many operations have built-in PyTorch functions
- Understand Before Implementing: Read the Advanced Concepts section carefully
- Don't Fear the Unknown: These optimizations are used in ChatGPT, Claude, and other LLMs you use daily!
Academic Integrity
This is an individual assignment. You may:
- Discuss concepts with classmates
- Use course materials and PyTorch documentation
- Ask for help during office hours
- Research the concepts mentioned (QuickNorm, SwiGLU, etc.) in papers and blogs
You may NOT:
- Share code with classmates
- Use AI code generators (ChatGPT, Copilot, etc.)
- Copy code from online sources
Submission Instructions
- Complete all TODO sections in the two Python files
- Ensure all tests pass
- Create a ZIP file with your src/ directory:
zip -r assignment2_submission.zip src/
- Submit on the course website before the deadline
Getting Help
- Office Hours: Tuesday/Thursday 2-4 PM
- Discussion Forum: Post conceptual questions about the advanced concepts (no code)
- Email: For personal matters only
Good Luck!
Remember: You're not just implementing textbook concepts - you're learning the same optimizations used in GPT-4, Claude, and LLaMA! Understanding these concepts deeply will help you throughout your AI career.
Take time to understand what each component does, not just make the tests pass. The modern optimizations you're learning here are actively used in production systems serving millions of users.
Additional Resources for Advanced Concepts
Research Papers (Optional Reading)
- SwiGLU: "GLU Variants Improve Transformer" (Shazeer, 2020)
- Flash Attention: "FlashAttention: Fast and Memory-Efficient Exact Attention" (Dao et al., 2022)
- RMSNorm: "Root Mean Square Layer Normalization" (Zhang & Sennrich, 2019)
- Weight Tying: "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2017)
Blog Posts (Optional Reading)
- Hugging Face: "The Annotated Transformer"
- Lilian Weng: "Attention? Attention!"
- Jay Alammar: "The Illustrated Transformer"
"Attention is all you need" - but understanding modern optimizations helps even more!