Assignment 2: Building GPT-2 from Scratch

Enhanced with Advanced Concepts

CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin
Due: See Course Website


Overview

In this assignment, you will implement the core components of GPT-2: attention mechanisms and the transformer architecture. You'll build practical, engineering-focused implementations that demonstrate both textbook concepts and modern optimizations that are currently used in production LLMs.

Important Note: This assignment includes several advanced concepts that extend beyond our textbook coverage. Don't worry - we provide detailed explanations and background for all advanced techniques below.

Learning Objectives

By completing this assignment, you will:

  1. Implement attention mechanisms from basic to optimized
  2. Build the GPT-2 architecture with modern improvements
  3. Understand engineering trade-offs in LLM design
  4. Experience the progression from theory to production
  5. Learn cutting-edge optimizations used in modern LLMs like LLaMA and GPT-4

Assignment Structure

assignment2_starter/
├── src/                  # Your implementation files
│   ├── Attention.py      # Attention mechanisms (40 points)
│   └── gpt2_model.py     # GPT-2 model (40 points)
├── tests/                # Test files (don't modify)
├── data/                 # Sample data
├── tools/                # Utility scripts
└── requirements.txt      # Dependencies

Advanced Concepts Guide

Before diving into implementation, please read this section carefully. It explains several modern LLM techniques that go beyond our textbook coverage but are essential for understanding production-quality models.

QuickNorm vs LayerNorm

What you learned in the textbook

LayerNorm normalizes both mean and variance:

# Standard LayerNorm concept (from textbook)
# 1. Calculate the mean and variance across features
# 2. Subtract the mean and divide by sqrt(variance + eps)
# 3. Apply learnable scale and shift parameters

What you'll implement in this assignment

QuickNorm (similar to RMSNorm) skips mean centering for efficiency:

# QuickNorm implementation steps:
# 1. Calculate variance only (skip the mean calculation)
# 2. Normalize by dividing by sqrt(variance + eps)
# 3. Apply only a scale parameter (no shift)
# Your task: implement this in the forward() method
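As a concrete point of reference, here is a minimal sketch of the idea, assuming an RMSNorm-style formulation. The class and attribute names are illustrative, not the starter-code API; your actual code belongs in the forward() method of the provided class.

import torch
import torch.nn as nn

class QuickNormSketch(nn.Module):
    """Illustrative RMSNorm-style normalization: no mean centering, scale only."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))  # learnable scale; no shift parameter

    def forward(self, x):
        # "Variance" here is the mean of squares over the feature dimension,
        # since the mean is never subtracted.
        ms = x.pow(2).mean(dim=-1, keepdim=True)
        return x * torch.rsqrt(ms + self.eps) * self.scale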

Why this matters:

Skipping the mean-centering step makes each normalization cheaper while preserving model quality in practice; RMSNorm-style layers are used in production models such as LLaMA.

SwiGLU vs GELU Activation

What you learned in the textbook

GELU activation function:

# GELU concept (from textbook)
# Smooth approximation to ReLU with probabilistic gating
# Formula: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
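As a quick sanity check (assuming PyTorch 1.12 or newer, which exposes the tanh approximation directly), the textbook formula matches PyTorch's built-in GELU:

import math
import torch
import torch.nn.functional as F

x = torch.randn(8)
manual = 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
print(torch.allclose(manual, F.gelu(x, approximate="tanh"), atol=1e-6))  # expect True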

What you'll implement in this assignment

SwiGLU (Swish-Gated Linear Unit) from LLaMA:

# SwiGLU implementation approach:
# 1. Project the input through two separate linear layers (gate and value)
# 2. Apply the SiLU (Swish) activation to the gate projection
# 3. Element-wise multiply the activated gate with the value projection
# 4. Project the result back to the original dimension
# Your task: implement this gated activation pattern

Research Foundation:

Primary Research: "GLU Variants Improve Transformer" (Shazeer, 2020)
LLaMA Implementation: "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)

Why SwiGLU is Superior:

In Shazeer's experiments, GLU-based feed-forward layers consistently reached lower perplexity than GELU MLPs at comparable compute, and SwiGLU has since been adopted by LLaMA and other recent models.

Mathematical Intuition:

SwiGLU can be viewed as a generalization of the classical MLP with a learned, input-dependent gating mechanism:

# Conceptual comparison:
# Traditional MLP: fixed activation applied to all inputs uniformly
# SwiGLU: learned gate selectively controls information flow
#
# Core implementation pattern:
gate = self.w_gate(x)        # Gate projection
up = self.w_up(x)            # Value projection
swish_gate = F.silu(gate)    # Apply SiLU to the gate
gated = swish_gate * up      # Element-wise multiply
return self.w_down(gated)    # Final projection
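Putting the pattern together, a self-contained sketch might look like the following. The module and attribute names are illustrative and may differ from the starter files.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUSketch(nn.Module):
    """Illustrative SwiGLU-style feed-forward block."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)  # gate projection
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)    # value projection
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)  # back to the model dimension

    def forward(self, x):
        gated = F.silu(self.w_gate(x)) * self.w_up(x)  # SiLU-activated gate × value
        return self.w_down(gated)

# Example: y = SwiGLUSketch(dim=768, hidden_dim=2048)(torch.randn(2, 16, 768))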

Implementation Details:

  1. Input Splitting: The input is projected to 2× the hidden dimension, then split into gate and value components
  2. SiLU Activation: Apply SiLU (Swish) activation to the gate projection
  3. Gated Output: Element-wise multiply the activated gate with the value projection
  4. Parameter Cost: With the same hidden width, the three projections (gate, value, down) use about 1.5× the parameters of a standard two-matrix MLP; LLaMA compensates by shrinking the hidden width to roughly two-thirds of the usual 4× expansion, and any remaining cost is offset by better performance

Weight Tying Optimization

What you learned in the textbook

Separate embedding and output layers:

# Separate weights concept (textbook approach)
# input_embedding: maps token_id → hidden_representation
# output_projection: maps hidden_representation → vocab_logits
# Two separate parameter matrices (doubled memory usage)

What you'll implement in this assignment

Weight tying (sharing weights):

# Weight tying implementation approach:
# 1. Create the token embedding layer as normal
# 2. Create the output linear layer WITHOUT bias
# 3. Set the output layer's weight to be the same tensor as the embedding weight
# 4. Now both input and output use the same parameter matrix
# Result: 50% reduction in vocabulary-related parameters
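A minimal sketch of the sharing step is shown below. The variable names and sizes are illustrative (50257 and 768 are the GPT-2 small vocabulary and embedding sizes); the TODOs in gpt2_model.py define the actual attributes to tie.

import torch.nn as nn

vocab_size, embed_dim = 50257, 768   # GPT-2 small sizes, for illustration only

token_embedding = nn.Embedding(vocab_size, embed_dim)
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)

# Tie the weights: both layers now reference the same parameter tensor.
# nn.Embedding.weight and nn.Linear.weight are both stored as (vocab_size, embed_dim),
# so no explicit transpose is needed.
lm_head.weight = token_embedding.weight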

Research Foundation:

Primary Research: "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2017)
Theoretical Analysis: "Weight Tying Improves Inclusivity of Word Representations" (Kumar & Tsvetkov, 2019)

The Conceptual Intuition:

Weight tying is based on a fundamental insight about vocabulary representation: the input embedding and the output projection both relate the same vocabulary to the same hidden space, so tokens that behave similarly as inputs should also be scored similarly as outputs, and a single shared matrix can serve both roles.

Implementation Benefits:

The vocabulary matrix is stored only once (a large saving, since vocab_size × embed_dim is often the single biggest weight matrix in a small model), and the shared weights receive gradient signal from both the input and output sides of the network.

Mathematical Perspective:

Weight tying creates a symmetric embedding space where input and output transformations share parameters:

# Conceptual understanding:
# Traditional approach: separate matrices for input and output
# Weight tying approach: the same matrix is used for both directions
#
# Implementation challenge:
# How do you make the embedding layer (vocab_size × embed_dim)
# work as the output layer (embed_dim × vocab_size)?
# Hint: think about matrix transpose and weight sharing

Historical Context:

Weight tying bridges classical NLP and modern deep learning: it was first studied for recurrent language models (Press & Wolf, 2017) and is still used in GPT-2 and many current LLMs.

Sinusoidal vs Learned Positional Encoding

What you learned in the textbook

Learned positional embeddings:

# Learned positions concept (textbook)
# Embedding layer that maps position indices to learned vectors
# Parameters are learned during training for each position

What you'll implement in this assignment

Sinusoidal positions (from original Transformer paper):

# Sinusoidal positions implementation approach:
# 1. Create position indices (0, 1, 2, ..., seq_len-1)
# 2. Create frequency terms using exponential decay
# 3. Apply sine to even dimensions, cosine to odd dimensions
# 4. No learnable parameters - purely mathematical function
# Your task: implement the create_sinusoidal_positions method

Research Foundation:

Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
Analysis: "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)

The Mathematical Foundation:

Sinusoidal positional encoding uses trigonometric functions to create a unique, deterministic representation for each position:

# Mathematical formulation to implement:
# For each position pos and dimension i:
#   Even dimensions (0, 2, 4, ...): apply the sine function
#   Odd dimensions (1, 3, 5, ...): apply the cosine function
#   Use different frequencies for different dimension pairs
#
# Core implementation pattern:
pos = torch.arange(max_len).unsqueeze(1)   # Position indices, shape (max_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div_term)    # Even dimensions
pe[:, 1::2] = torch.cos(pos * div_term)    # Odd dimensions

Key Properties and Advantages:

The encoding requires no parameters, is deterministic for any position, and extends to sequence lengths never seen during training; the comparison table below summarizes the trade-offs.

Intuitive Understanding:

Think of sinusoidal encoding as a multi-scale clock system: the fastest-varying dimensions act like a second hand that distinguishes neighboring positions, while the slowest-varying dimensions act like an hour hand that tracks coarse position across the whole sequence.

Comparison with Learned Embeddings:

Aspect | Learned Positional Embeddings | Sinusoidal Positional Encoding
Parameters | max_seq_len × d_model parameters | Zero parameters
Sequence length | Fixed maximum set during training | Extrapolates beyond the training length
Interpretability | Learned patterns, less interpretable | Mathematical pattern, fully interpretable
Relative position | Must be learned implicitly | Can be computed through linear combinations
Generalization | May not generalize to unseen lengths | Generalizes to any length in principle

Implementation Details:

The implementation creates a frequency spectrum that encodes position across multiple scales:

# Implementation strategy:
# 1. Create the position tensor: shape (max_seq_len, 1)
# 2. Create the frequency tensor: shape (d_model//2,)
# 3. Compute the position × frequency matrix
# 4. Apply sine to columns 0, 2, 4, ... (even indices)
# 5. Apply cosine to columns 1, 3, 5, ... (odd indices)
# 6. Result: shape (max_seq_len, d_model)
#
# Challenge: How do you ensure the right frequencies for each dimension?
# Hint: use the exponential decay pattern from the original paper
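A self-contained sketch of this strategy, including how the table is typically added to the token embeddings, is shown below; the function name and shapes are illustrative, not the starter-code API.

import math
import torch

def sinusoidal_table(max_len: int, d_model: int) -> torch.Tensor:
    """Build a (max_len, d_model) sinusoidal position table with no learned parameters."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)              # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * -(math.log(10000.0) / d_model))                     # (d_model // 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div_term)  # odd dimensions
    return pe

# Typical use: precompute once, slice to the current sequence length, add to embeddings.
pe = sinusoidal_table(max_len=1024, d_model=768)
token_embeds = torch.randn(2, 16, 768)          # (batch, seq_len, d_model)
x = token_embeds + pe[: token_embeds.size(1)]   # broadcasts over the batch dimension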

Why Both Approaches Matter:

The original Transformer used sinusoidal encodings while GPT-2 used learned positional embeddings; implementing the sinusoidal variant here lets you compare the two designs directly and understand why later models moved to relative schemes such as RoPE.


What You'll Implement

Part 1: Attention Mechanisms (40 points)

File: src/Attention.py

Implement attention mechanisms from basic to production-ready: BasicAttention, ScaledAttention, MultiHeadAttention, and causal masking.
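For orientation before you start, here is a minimal sketch of causal scaled dot-product attention for a single head. It is not the starter-code API (the starter classes add the learned projections and multi-head reshaping); it only illustrates the core computation.

import math
import torch

def causal_scaled_attention(q, k, v):
    """q, k, v: (batch, seq_len, head_dim) tensors. Returns the attended values."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)     # (batch, seq_len, seq_len)
    seq_len = q.size(1)
    # Upper-triangular mask blocks attention to future positions.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

out = causal_scaled_attention(torch.randn(2, 8, 64), torch.randn(2, 8, 64), torch.randn(2, 8, 64))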

Part 2: GPT-2 with Modern Optimizations (40 points)

File: src/gpt2_model.py

Build GPT-2 with advanced features: QuickNorm, SwiGLU activation, TransformerLayer, complete forward pass, and text generation.
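To see how the pieces compose, here is a rough pre-norm residual block in the GPT-2 style. The class is illustrative only, with your attention, SwiGLU, and QuickNorm modules passed in as stand-ins.

import torch.nn as nn

class TransformerLayerSketch(nn.Module):
    """Illustrative pre-norm block: x + Attn(Norm(x)), then x + FFN(Norm(x))."""
    def __init__(self, dim, attention: nn.Module, feed_forward: nn.Module, norm_cls):
        super().__init__()
        self.norm1 = norm_cls(dim)
        self.attn = attention        # e.g. your multi-head attention with a causal mask
        self.norm2 = norm_cls(dim)
        self.ffn = feed_forward      # e.g. your SwiGLU block

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # residual connection around attention
        x = x + self.ffn(self.norm2(x))   # residual connection around the feed-forward
        return x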

Note: All implementation details and hints are in the TODO comments within the source files. Refer to the Advanced Concepts Guide above for conceptual understanding.


Setup Instructions

Recommended Setup: We strongly recommend using Conda for this assignment. Conda automatically handles PyTorch installation for different platforms (macOS, Windows, Linux) and hardware (CPU, NVIDIA GPU, Apple Silicon), which eliminates the most common setup errors.

Quick Setup Overview

  1. Install Conda (if not already installed) - one-time setup
  2. Download and extract assignment2_starter.zip
  3. Create the environment: conda env create -f environment.yml
  4. Activate it and start coding: conda activate assignment2_env

Estimated time: 10-15 minutes (first time), 2-3 minutes (if Conda already installed)


Step 1: Install Conda (One-Time Setup)

Skip this step if you already have Anaconda or Miniconda installed.

macOS Installation

1. Download the Miniconda installer
   Visit: https://docs.conda.io/en/latest/miniconda.html
   Download: "Miniconda3 macOS 64-bit pkg" (Intel Mac) or "Miniconda3 macOS Apple M1 64-bit pkg" (M1/M2/M3 Mac)

2. Install Miniconda
   - Double-click the downloaded .pkg file
   - Follow the installation wizard
   - Use the default settings

3. Verify the installation
   Open a NEW Terminal window and run:

   conda --version

   Expected output: conda 24.x.x or similar

   If you see "conda: command not found", close and reopen Terminal, then try again:

   source ~/miniconda3/bin/activate
   conda init zsh   # if using zsh (default on macOS)
   # OR
   conda init bash  # if using bash
   # Then close and reopen Terminal

Windows Installation

1. Download the Miniconda installer
   Visit: https://docs.conda.io/en/latest/miniconda.html
   Download: "Miniconda3 Windows 64-bit"

2. Install Miniconda
   - Double-click the downloaded .exe file
   - Installation options:
     - Install for: Just Me (recommended)
     - Destination folder: use the default
     - Advanced options:
       [X] CHECK: "Add Miniconda3 to my PATH environment variable"
       [X] CHECK: "Register Miniconda3 as my default Python"

3. Verify the installation
   Open "Anaconda Prompt (Miniconda3)" from the Start menu and run:

   conda --version

   Expected output: conda 24.x.x or similar
   Note: Miniconda installs "Anaconda Prompt" - it's the same thing!

Linux/Ubuntu Installation

1. Download the Miniconda installer

   wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

2. Install Miniconda

   bash Miniconda3-latest-Linux-x86_64.sh

   - Press Enter to review the license
   - Type "yes" to accept
   - Press Enter to confirm the installation location
   - Type "yes" when asked to initialize

3. Activate conda

   source ~/.bashrc

4. Verify the installation

   conda --version

   Expected output: conda 24.x.x or similar

Step 2: Create Assignment Environment

Once Conda is installed, follow these steps to set up your assignment environment. These steps are the same for all platforms.

All Platforms (macOS, Windows, Linux)

1. Download and extract the assignment package
   Download assignment2_starter.zip from the course website and extract it to your preferred location (e.g., Downloads or Documents).

2. Navigate to the assignment directory
   Open Terminal (macOS/Linux) or Anaconda Prompt (Windows):

   # macOS/Linux:
   cd ~/Downloads/assignment2_starter

   # Windows:
   cd C:\Users\YourName\Downloads\assignment2_starter

3. Create the Conda environment

   conda env create -f environment.yml

   This will:
   - Create an environment called "assignment2_env"
   - Install Python 3.10
   - Install PyTorch (automatically selecting the correct version for your system)
   - Install all required packages (numpy, pytest, tqdm, etc.)

   Expected time: 2-5 minutes depending on internet speed

4. Activate the environment

   conda activate assignment2_env

5. Verify the installation

   python tools/run_tests.py

   You should see test output showing TODO sections not implemented. This is expected - you haven't written any code yet!

Step 3: Working on the Assignment

Every time you work on the assignment:

1. Open Terminal/Command Prompt

2. Navigate to the assignment folder

   cd ~/Downloads/assignment2_starter                   # macOS/Linux
   cd C:\Users\YourName\Downloads\assignment2_starter   # Windows

3. Activate the environment

   conda activate assignment2_env

4. Start coding!

   # Edit files in the src/ folder
   # Run tests to check your work:
   python tools/run_tests.py

5. When done for the day

   conda deactivate

Troubleshooting

If "conda env create" fails

# Try creating the environment manually:
conda create -n assignment2_env python=3.10 -y
conda activate assignment2_env

# Install PyTorch:
conda install pytorch -c pytorch -y

# Install the other packages:
pip install numpy pytest tqdm pytest-timeout

If PyTorch import fails

# Verify the PyTorch installation:
python -c "import torch; print('PyTorch version:', torch.__version__)"

# If it fails, reinstall PyTorch:
conda install pytorch -c pytorch -y

Alternative: pip/virtualenv (Advanced Users Only)

Advanced Method: Only use this if you're experienced with Python environments. We strongly recommend Conda for most students.
1. Install PyTorch first Visit: https://pytorch.org/get-started/locally/ Follow platform-specific instructions 2. Create virtual environment cd assignment2_starter python3 -m venv assignment2_env # Activate: source assignment2_env/bin/activate # macOS/Linux assignment2_env\Scripts\activate # Windows 3. Install dependencies pip install -r requirements.txt 4. Verify python tools/run_tests.py

Common Issues

Problem: "conda: command not found"

Solution: Close and reopen your terminal/command prompt. If still not working, add Conda to PATH or use Anaconda Prompt (Windows).

Problem: "ModuleNotFoundError: No module named 'torch'"

# Make sure the environment is activated:
conda activate assignment2_env

# Verify the PyTorch installation:
python -c "import torch; print(torch.__version__)"

# If that fails, reinstall:
conda install pytorch -c pytorch

Need Help?


Quick Reference

Every time you work on the assignment:

conda activate assignment2_env
cd path/to/assignment2_starter
# Edit files, then test:
python tools/run_tests.py

When done:

conda deactivate

Delete the environment (if needed):

conda env remove -n assignment2_env

Testing Your Implementation

Run Individual Module Tests

Test your implementation files:

python src/Attention.py
python src/gpt2_model.py

Run All Tests

python tools/run_tests.py

Run Pytest

python -m pytest tests/ -v

Grading Rubric (100 Points Total)

Component | Points | Key Requirements
Attention.py | 40 | BasicAttention (10), ScaledAttention (15), MultiHeadAttention (10), causal_mask (5)
gpt2_model.py | 40 | QuickNorm (5), SwiGLU (5), TransformerLayer (10), Forward (15), Generate (5)
Code Quality | 10 | Clean code, proper structure, no hardcoded paths
Documentation | 10 | Docstrings, comments where needed
Total | 100 |

Grading Criteria


Hints and Tips

  1. Start Simple: Get BasicAttention working first, then build up
  2. Test Often: Run tests after implementing each method
  3. Check Shapes: Print tensor shapes when debugging
  4. Read Comments: The TODOs have detailed hints
  5. Use PyTorch Docs: Many operations have built-in PyTorch functions
  6. Understand Before Implementing: Read the Advanced Concepts section carefully
  7. Don't Fear the Unknown: These optimizations are used in ChatGPT, Claude, and other LLMs you use daily!

Academic Integrity

This is an individual assignment. You may:

You may NOT:


Submission Instructions

  1. Complete all TODO sections in the two Python files
  2. Ensure all tests pass
  3. Create a ZIP file with your src/ directory:
    zip -r assignment2_submission.zip src/
  4. Submit on the course website before the deadline

Getting Help


Good Luck!

Remember: You're not just implementing textbook concepts - you're learning the same optimizations used in GPT-4, Claude, and LLaMA! Understanding these concepts deeply will help you throughout your AI career.

Take time to understand what each component does, not just make the tests pass. The modern optimizations you're learning here are actively used in production systems serving millions of users.


Additional Resources for Advanced Concepts

Research Papers (Optional Reading)

  - "Attention Is All You Need" (Vaswani et al., 2017)
  - "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2017)
  - "GLU Variants Improve Transformer" (Shazeer, 2020)
  - "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)
  - "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)

Blog Posts (Optional Reading)

"Attention is all you need" - but understanding modern optimizations helps even more!

