Assignment 2: Building GPT-2 from Scratch
Enhanced with Advanced Concepts
CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin
Due: See Course Website
Overview
In this assignment, you will implement the core components of GPT-2: attention mechanisms and the transformer architecture. You'll build practical, engineering-focused implementations that demonstrate both textbook concepts and modern optimizations that are currently used in production LLMs.
Important Note: This assignment includes several advanced concepts that extend beyond our textbook coverage. Don't worry - we provide detailed explanations and background for all advanced techniques below.
Learning Objectives
By completing this assignment, you will:
- Implement attention mechanisms from basic to optimized
- Build the GPT-2 architecture with modern improvements
- Understand engineering trade-offs in LLM design
- Experience the progression from theory to production
- Learn cutting-edge optimizations used in modern LLMs like LLaMA and GPT-4
Assignment Structure
assignment2_starter/
├── src/ # Your implementation files
│ ├── Attention.py # Attention mechanisms (40 points)
│ └── gpt2_model.py # GPT-2 model (40 points)
├── tests/ # Test files (don't modify)
├── data/ # Sample data
├── tools/ # Utility scripts
└── requirements.txt # Dependencies
Advanced Concepts Guide
Before diving into implementation, please read this section carefully. It explains several modern LLM techniques that go beyond our textbook coverage but are essential for understanding production-quality models.
QuickNorm vs LayerNorm
What you learned in the textbook
LayerNorm normalizes both mean and variance:
# Standard LayerNorm concept (from textbook)
# 1. Calculate mean and variance across features
# 2. Subtract mean and divide by sqrt(variance + eps)
# 3. Apply learnable scale and shift parameters
What you'll implement in this assignment
QuickNorm (similar to RMSNorm) skips mean centering for efficiency:
# QuickNorm implementation steps:
# 1. Calculate variance only (skip mean calculation)
# 2. Normalize by dividing by sqrt(variance + eps)
# 3. Apply only scale parameter (no shift)
# Your task: implement in the forward() method
Why this matters:
- Speed: 30% faster than LayerNorm (fewer operations)
- Memory: Reduces activation memory usage
- Performance: Surprisingly, often works as well as full LayerNorm
- Used in: LLaMA, PaLM, and other modern models
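To make the steps above concrete, here is a minimal RMSNorm-style sketch. The class name, attribute names, and epsilon value are illustrative; the TODOs in gpt2_model.py define the exact names and formula you must follow:
import torch
import torch.nn as nn

class QuickNorm(nn.Module):
    """RMSNorm-style normalization: scale by the root mean square, with no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))   # learnable scale only, no shift

    def forward(self, x):
        mean_square = x.pow(2).mean(dim=-1, keepdim=True)    # average of squared features
        return x * torch.rsqrt(mean_square + self.eps) * self.scale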
SwiGLU vs GELU Activation
What you learned in the textbook
GELU activation function:
# GELU concept (from textbook)
# Smooth approximation to ReLU with probabilistic gating
# Formula: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
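For reference only, the formula above translates directly into a one-line function (PyTorch also ships this as torch.nn.functional.gelu with approximate='tanh'):
import math
import torch

def gelu_tanh(x):
    # Tanh approximation of GELU, matching the formula above
    return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x.pow(3))))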
What you'll implement in this assignment
SwiGLU (Swish-Gated Linear Unit) from LLaMA:
# SwiGLU implementation approach:
# 1. Project input through two separate linear layers (gate and value)
# 2. Apply SiLU (Swish) activation to the gate projection
# 3. Element-wise multiply the activated gate with value projection
# 4. Project the result back to original dimension
# Your task: implement this gated activation pattern
Research Foundation:
Primary Research: "GLU Variants Improve Transformer" (Shazeer, 2020)
LLaMA Implementation: "LLaMA: Open and Efficient Foundation Language Models" (Touvron et al., 2023)
Why SwiGLU is Superior:
- Gated Information Flow: Unlike GELU which applies the same transformation to all inputs, SwiGLU uses a learned gate to selectively control which information passes through. This creates more sophisticated feature interactions.
- Empirical Performance: In the original GLU paper, SwiGLU consistently outperformed GELU across multiple language modeling benchmarks, typically improving perplexity by 1-3 points.
- Nonlinear Interactions: The elementwise multiplication between the gate and value creates richer nonlinear interactions compared to simple pointwise activations.
- Gradient Flow: The gating mechanism provides better gradient flow during backpropagation, especially for deeper networks.
- Production Validation: Used in LLaMA (65B+ parameters), PaLM-2, and Chinchilla with demonstrated improvements in language understanding tasks.
Mathematical Intuition:
SwiGLU can be viewed as a generalization of the classical MLP with a learned, input-dependent gating mechanism:
# Conceptual comparison:
# Traditional MLP: fixed activation applied to all inputs uniformly
# SwiGLU: learned gate selectively controls information flow
#
# Core implementation pattern (inside the module's forward method):
def forward(self, x):
    gate = self.w_gate(x)        # Gate projection
    up = self.w_up(x)            # Value projection
    swish_gate = F.silu(gate)    # Apply SiLU to gate (F is torch.nn.functional)
    gated = swish_gate * up      # Element-wise multiply
    return self.w_down(gated)    # Final projection
Implementation Details:
- Input Splitting: Project the input to 2× the hidden dimension and split it into gate and value halves, or equivalently use two separate gate/value projections as described above
- SiLU Activation: Apply SiLU (Swish) activation to the gate projection
- Gated Output: Element-wise multiply the activated gate with the value projection
- Parameter Cost: Requires ~1.67× parameters compared to standard MLP, but this cost is offset by better performance
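Putting these pieces together, a minimal sketch of a gated feed-forward module could look like the following. The class and layer names (SwiGLU, w_gate, w_up, w_down) are illustrative, and the starter code may instead use a single fused projection that is split into gate and value halves; the two forms are equivalent:
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)   # gate projection
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)     # value projection
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)   # back to the model dimension

    def forward(self, x):
        gated = F.silu(self.w_gate(x)) * self.w_up(x)   # SiLU-activated gate times value
        return self.w_down(gated)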
Weight Tying Optimization
What you learned in the textbook
Separate embedding and output layers:
# Separate weights concept (textbook approach)
# input_embedding: maps token_id → hidden_representation
# output_projection: maps hidden_representation → vocab_logits
# Two separate parameter matrices (doubled memory usage)
What you'll implement in this assignment
Weight tying (sharing weights):
# Weight tying implementation approach:
# 1. Create token embedding layer as normal
# 2. Create output linear layer WITHOUT bias
# 3. Set output layer's weight to be the same as embedding weight
# 4. Now both input and output use the same parameter matrix
# Result: 50% reduction in vocabulary-related parameters
Research Foundation:
Primary Research: "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2017)
Theoretical Analysis: "Weight Tying Improves Inclusivity of Word Representations" (Kumar & Tsvetkov, 2019)
The Conceptual Intuition:
Weight tying is based on a fundamental insight about vocabulary representation:
- Shared Semantic Space: Input and output vocabularies represent the same concepts, so why use different representations?
- Symmetric Learning: When the model learns that "cat" maps to representation X during input, it should naturally output that same representation X when generating "cat"
- Regularization Effect: Forcing the model to use consistent representations acts as a form of regularization
- Information Density: The same parameter matrix encodes both input understanding and output generation
Implementation Benefits:
- Memory Reduction: Eliminates duplicate vocabulary parameters, reducing model size by 20-30%
- Training Stability: Shared representations prevent input/output vocabulary drift during training
- Generalization: Often improves performance on rare words by strengthening their representations
- Efficiency: Faster inference due to reduced memory bandwidth requirements
- Production Standard: Used in GPT-2, GPT-3, T5, BERT, and many other language models
Mathematical Perspective:
Weight tying creates a symmetric embedding space where input and output transformations share parameters:
# Conceptual understanding:
# Traditional approach: separate matrices for input and output
# Weight tying approach: same matrix used for both directions
#
# Implementation challenge:
# How do you make the embedding layer (vocab_size × embed_dim)
# work as the output layer (embed_dim × vocab_size)?
# Hint: Think about matrix transpose and weight sharing
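One way to think about the hint: nn.Linear stores its weight with shape (out_features, in_features), which for an output head is (vocab_size, embed_dim), the same shape as the embedding table, so sharing is a direct assignment. A minimal sketch, with GPT-2-small sizes used purely for illustration:
import torch.nn as nn

vocab_size, embed_dim = 50257, 768                        # illustrative GPT-2-small sizes

tok_emb = nn.Embedding(vocab_size, embed_dim)             # weight shape: (vocab_size, embed_dim)
lm_head = nn.Linear(embed_dim, vocab_size, bias=False)    # weight shape: (vocab_size, embed_dim)

lm_head.weight = tok_emb.weight                           # both layers now share one parameter tensor
assert lm_head.weight.data_ptr() == tok_emb.weight.data_ptr()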
Historical Context:
Weight tying bridges classical NLP and modern deep learning:
- Classical Motivation: Early language models used the same vocabulary for input and output by necessity
- Deep Learning Adoption: Press & Wolf (2017) demonstrated significant improvements in neural language models
- Theoretical Understanding: Later work showed weight tying improves word representation quality and inclusivity
- Practical Adoption: Now standard practice in all major language model architectures
Sinusoidal vs Learned Positional Encoding
What you learned in the textbook
Learned positional embeddings:
# Learned positions concept (textbook)
# Embedding layer that maps position indices to learned vectors
# Parameters are learned during training for each position
What you'll implement in this assignment
Sinusoidal positions (from original Transformer paper):
# Sinusoidal positions implementation approach:
# 1. Create position indices (0, 1, 2, ..., seq_len-1)
# 2. Create frequency terms using exponential decay
# 3. Apply sine to even dimensions, cosine to odd dimensions
# 4. No learnable parameters - purely mathematical function
# Your task: implement the create_sinusoidal_positions method
Research Foundation:
Original Paper: "Attention Is All You Need" (Vaswani et al., 2017)
Analysis: "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)
The Mathematical Foundation:
Sinusoidal positional encoding uses trigonometric functions to create a unique, deterministic representation for each position:
# Mathematical formulation to implement:
# For each position pos and dimension i:
# Even dimensions (0, 2, 4, ...): apply sine function
# Odd dimensions (1, 3, 5, ...): apply cosine function
# Use different frequencies for different dimension pairs
#
# Core implementation pattern:
pos = torch.arange(max_len).unsqueeze(1) # Position indices [max_len, 1]
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * div_term) # Even dimensions
pe[:, 1::2] = torch.cos(pos * div_term) # Odd dimensions
Key Properties and Advantages:
- Infinite Extrapolation: Can handle any sequence length, even much longer than training sequences
- Parameter Efficiency: Zero learnable parameters dedicated to positional information
- Relative Position Awareness: The model can learn to attend to relative positions through linear combinations
- Unique Encodings: Each position gets a unique, deterministic representation
- Smooth Transitions: Adjacent positions have similar but distinguishable encodings
- Frequency Hierarchy: Different dimensions encode position at different frequencies
Intuitive Understanding:
Think of sinusoidal encoding as a multi-scale clock system:
- High Frequencies (fast oscillation): Distinguish between adjacent positions
- Low Frequencies (slow oscillation): Capture broader positional patterns
- Combined Frequencies: Create unique fingerprints for each position
- Binary Analogy: Similar to how binary representation uses powers of 2 to uniquely represent numbers
Comparison with Learned Embeddings:
| Aspect | Learned Positional Embeddings | Sinusoidal Positional Encoding |
| --- | --- | --- |
| Parameters | max_seq_len × d_model parameters | Zero parameters |
| Sequence Length | Fixed maximum during training | Unlimited extrapolation |
| Interpretability | Learned patterns, less interpretable | Mathematical pattern, fully interpretable |
| Relative Position | Must be learned implicitly | Can be computed through linear combinations |
| Generalization | May not generalize to unseen lengths | Perfect generalization to any length |
Implementation Details:
The implementation creates a frequency spectrum that encodes position across multiple scales:
# Implementation strategy:
# 1. Create position tensor: shape (max_seq_len, 1)
# 2. Create frequency tensor: shape (d_model//2,)
# 3. Compute position × frequency matrix
# 4. Apply sine to columns 0, 2, 4, ... (even indices)
# 5. Apply cosine to columns 1, 3, 5, ... (odd indices)
# 6. Result: shape (max_seq_len, d_model)
#
# Challenge: How do you ensure the right frequencies for each dimension?
# Hint: Use exponential decay pattern from the original paper
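Here is a self-contained sketch of this strategy, written as a free function with an even d_model assumed; the starter file expects the same logic inside the create_sinusoidal_positions method:
import math
import torch

def create_sinusoidal_positions(max_len, d_model):
    # Returns a (max_len, d_model) table of fixed position encodings (assumes even d_model)
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)            # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * -(math.log(10000.0) / d_model))                        # (d_model // 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)    # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)    # odd dimensions
    return pe

print(create_sinusoidal_positions(1024, 768).shape)   # torch.Size([1024, 768])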
Why Both Approaches Matter:
- Research Value: Sinusoidal encoding from the foundational Transformer paper helps understand theoretical principles
- Practical Benefits: Zero parameters and infinite extrapolation make it attractive for resource-constrained applications
- Modern Variants: Inspired more sophisticated positional encodings like RoPE (Rotary Position Embedding) used in modern LLMs
- Educational Importance: Understanding both approaches deepens comprehension of how models encode sequence information
What You'll Implement
Part 1: Attention Mechanisms (40 points)
File: src/Attention.py
Implement attention mechanisms from basic to production-ready: BasicAttention, ScaledAttention, MultiHeadAttention, and causal masking.
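Before you start, here is a minimal sketch of the core computation all four pieces build on: scaled dot-product attention with an optional causal mask. The function name and tensor layout are assumptions for illustration; the TODOs in Attention.py define the actual class interfaces and shapes:
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: (batch, heads, seq_len, head_dim); returns an output of the same shape
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)                  # (batch, heads, seq, seq)
    if causal:
        seq_len = q.size(-2)
        future = torch.triu(torch.ones(seq_len, seq_len, device=q.device), diagonal=1).bool()
        scores = scores.masked_fill(future, float("-inf"))             # block attention to future tokens
    weights = F.softmax(scores, dim=-1)                                # weights sum to 1 per query
    return weights @ v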
Part 2: GPT-2 with Modern Optimizations (40 points)
File: src/gpt2_model.py
Build GPT-2 with advanced features: QuickNorm, SwiGLU activation, TransformerLayer, complete forward pass, and text generation.
Note: All implementation details and hints are in the TODO comments within the source files. Refer to the Advanced Concepts Guide above for conceptual understanding.
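For the residual wiring only, here is a runnable pre-norm block sketch. It deliberately uses PyTorch stand-ins (nn.LayerNorm, nn.MultiheadAttention, and a plain GELU MLP) so that it runs on its own; your TransformerLayer should instead compose your QuickNorm, MultiHeadAttention, and SwiGLU modules, with the exact interfaces given by the TODOs:
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    # Pre-norm residual wiring: x = x + Attn(Norm(x)); then x = x + FFN(Norm(x))
    def __init__(self, d_model, n_heads, d_hidden):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)                                       # stand-in for QuickNorm
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)    # stand-in for your MHA
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))                   # stand-in for SwiGLU

    def forward(self, x):
        h = self.norm1(x)
        mask = torch.triu(torch.ones(x.size(1), x.size(1), device=x.device), diagonal=1).bool()
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)    # causal self-attention
        x = x + attn_out                                    # residual connection around attention
        x = x + self.ffn(self.norm2(x))                     # residual connection around feed-forward
        return x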
Setup Instructions
Recommended Setup: We strongly recommend using Conda for this assignment. Conda automatically handles PyTorch installation for different platforms (macOS, Windows, Linux) and hardware (CPU, NVIDIA GPU, Apple Silicon), which eliminates the most common setup errors.
Quick Setup Overview
- Step 1: Install Conda (if not already installed) - one-time setup
- Step 2: Download and extract assignment2_starter.zip
- Step 3: Create environment:
conda env create -f environment.yml
- Step 4: Activate and start coding:
conda activate assignment2_env
Estimated time: 10-15 minutes (first time), 2-3 minutes (if Conda already installed)
Step 1: Install Conda (One-Time Setup)
Skip this step if you already have Anaconda or Miniconda installed.
macOS Installation
1. Download Miniconda installer
Visit: https://docs.conda.io/en/latest/miniconda.html
Download: "Miniconda3 macOS 64-bit pkg" (Intel Mac)
or "Miniconda3 macOS Apple M1 64-bit pkg" (M1/M2/M3 Mac)
2. Install Miniconda
- Double-click the downloaded .pkg file
- Follow installation wizard
- Use default settings
3. Verify installation
Open a NEW Terminal window and run:
conda --version
Expected output: conda 24.x.x or similar
If you see "conda: command not found":
Close and reopen Terminal, then try again
source ~/miniconda3/bin/activate
conda init zsh # if using zsh (default on macOS)
# OR
conda init bash # if using bash
# Then close and reopen Terminal
Windows Installation
1. Download Miniconda installer
Visit: https://docs.conda.io/en/latest/miniconda.html
Download: "Miniconda3 Windows 64-bit"
2. Install Miniconda
- Double-click the downloaded .exe file
- Installation options:
- Install for: Just Me (recommended)
- Destination folder: Use default
- Advanced options:
[X] CHECK: "Add Miniconda3 to my PATH environment variable"
[X] CHECK: "Register Miniconda3 as my default Python"
3. Verify installation
Open "Anaconda Prompt (Miniconda3)" from Start menu and run:
conda --version
Expected output: conda 24.x.x or similar
Note: Miniconda installs "Anaconda Prompt" - it's the same thing!
Linux/Ubuntu Installation
1. Download Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
2. Install Miniconda
bash Miniconda3-latest-Linux-x86_64.sh
- Press Enter to review license
- Type "yes" to accept
- Press Enter to confirm installation location
- Type "yes" when asked to initialize
3. Activate conda
source ~/.bashrc
4. Verify installation
conda --version
Expected output: conda 24.x.x or similar
Step 2: Create Assignment Environment
Once Conda is installed, follow these steps to set up your assignment environment. These steps are the same for all platforms.
All Platforms (macOS, Windows, Linux)
1. Download and extract assignment package
Download: assignment2_starter.zip from course website
Extract to your preferred location (e.g., Downloads or Documents)
2. Navigate to assignment directory
Open Terminal (macOS/Linux) or Anaconda Prompt (Windows)
# macOS/Linux:
cd ~/Downloads/assignment2_starter
# Windows:
cd C:\Users\YourName\Downloads\assignment2_starter
3. Create Conda environment
conda env create -f environment.yml
This will:
- Create an environment called "assignment2_env"
- Install Python 3.10
- Install PyTorch (automatically selects correct version for your system)
- Install all required packages (numpy, pytest, tqdm, etc.)
Expected time: 2-5 minutes depending on internet speed
4. Activate the environment
conda activate assignment2_env
5. Verify installation
python tools/run_tests.py
You should see test output showing TODO sections not implemented.
This is expected - you haven't written any code yet!
Step 3: Working on the Assignment
Every time you work on the assignment:
1. Open Terminal/Command Prompt
2. Navigate to assignment folder
cd ~/Downloads/assignment2_starter # macOS/Linux
cd C:\Users\YourName\Downloads\assignment2_starter # Windows
3. Activate environment
conda activate assignment2_env
4. Start coding!
# Edit files in src/ folder
# Run tests to check your work:
python tools/run_tests.py
5. When done for the day
conda deactivate
Troubleshooting
If "conda env create" fails
# Try creating environment manually:
conda create -n assignment2_env python=3.10 -y
conda activate assignment2_env
# Install PyTorch:
conda install pytorch -c pytorch -y
# Install other packages:
pip install numpy pytest tqdm pytest-timeout
If PyTorch import fails
# Verify PyTorch installation:
python -c "import torch; print('PyTorch version:', torch.__version__)"
# If it fails, reinstall PyTorch:
conda install pytorch -c pytorch -y
Alternative: pip/virtualenv (Advanced Users Only)
Advanced Method: Only use this if you're experienced with Python environments. We strongly recommend Conda for most students.
1. Install PyTorch first
Visit: https://pytorch.org/get-started/locally/
Follow platform-specific instructions
2. Create virtual environment
cd assignment2_starter
python3 -m venv assignment2_env
# Activate:
source assignment2_env/bin/activate # macOS/Linux
assignment2_env\Scripts\activate # Windows
3. Install dependencies
pip install -r requirements.txt
4. Verify
python tools/run_tests.py
Common Issues
Problem: "conda: command not found"
Solution: Close and reopen your terminal/command prompt. If still not working, add Conda to PATH or use Anaconda Prompt (Windows).
Problem: "ModuleNotFoundError: No module named 'torch'"
# Make sure environment is activated:
conda activate assignment2_env
# Verify PyTorch installation:
python -c "import torch; print(torch.__version__)"
# If fails, reinstall:
conda install pytorch -c pytorch
Need Help?
- Discussion Forum: Post error messages with OS details
- Office Hours: Tuesday/Thursday 2-4 PM
- Email: Include OS, Python version, and full error message
Quick Reference
Every time you work on the assignment:
conda activate assignment2_env
cd path/to/assignment2_starter
# Edit files, then test:
python tools/run_tests.py
When done:
conda deactivate
Delete environment (if needed):
conda env remove -n assignment2_env
Testing Your Implementation
Run Individual Module Tests
Test your implementation files:
python src/Attention.py
python src/gpt2_model.py
Run All Tests
python tools/run_tests.py
Run Pytest
python -m pytest tests/ -v
Grading Rubric (100 Points Total)
| Component | Points | Key Requirements |
| --- | --- | --- |
| Attention.py | 40 | BasicAttention (10), ScaledAttention (15), MultiHeadAttention (10), causal_mask (5) |
| gpt2_model.py | 40 | QuickNorm (5), SwiGLU (5), TransformerLayer (10), Forward (15), Generate (5) |
| Code Quality | 10 | Clean code, proper structure, no hardcoded paths |
| Documentation | 10 | Docstrings, comments where needed |
| Total | 100 | |
Grading Criteria
- Correctness (80%): Implementation produces expected outputs
- Code Quality (10%): Clean, readable code following best practices
- Documentation (10%): Clear docstrings and comments
Hints and Tips
- Start Simple: Get BasicAttention working first, then build up
- Test Often: Run tests after implementing each method
- Check Shapes: Print tensor shapes when debugging
- Read Comments: The TODOs have detailed hints
- Use PyTorch Docs: Many operations have built-in PyTorch functions
- Understand Before Implementing: Read the Advanced Concepts section carefully
- Don't Fear the Unknown: These optimizations are used in ChatGPT, Claude, and other LLMs you use daily!
Academic Integrity
This is an individual assignment. You may:
- Discuss concepts with classmates
- Use course materials and PyTorch documentation
- Ask for help during office hours
- Research the concepts mentioned (QuickNorm, SwiGLU, etc.) in papers and blogs
You may NOT:
- Share code with classmates
- Use AI code generators (ChatGPT, Copilot, etc.)
- Copy code from online sources
Submission Instructions
- Complete all TODO sections in the two Python files
- Ensure all tests pass
- Create a ZIP file with your src/ directory:
zip -r assignment2_submission.zip src/
- Submit on the course website before the deadline
Getting Help
- Office Hours: Tuesday/Thursday 2-4 PM
- Discussion Forum: Post conceptual questions about the advanced concepts (no code)
- Email: For personal matters only
Good Luck!
Remember: You're not just implementing textbook concepts - you're learning the same optimizations used in GPT-4, Claude, and LLaMA! Understanding these concepts deeply will help you throughout your AI career.
Take time to understand what each component does, not just make the tests pass. The modern optimizations you're learning here are actively used in production systems serving millions of users.
Additional Resources for Advanced Concepts
Research Papers (Optional Reading)
- SwiGLU: "GLU Variants Improve Transformer" (Shazeer, 2020)
- Flash Attention: "FlashAttention: Fast and Memory-Efficient Exact Attention" (Dao et al., 2022)
- RMSNorm: "Root Mean Square Layer Normalization" (Zhang & Sennrich, 2019)
- Weight Tying: "Using the Output Embedding to Improve Language Models" (Press & Wolf, 2017)
Blog Posts (Optional Reading)
- Hugging Face: "The Annotated Transformer"
- Lilian Weng: "Attention? Attention!"
- Jay Alammar: "The Illustrated Transformer"
"Attention is all you need" - but understanding modern optimizations helps even more!