Assignment 1: Tokenizer Implementation
Student Instructions & Getting Started Guide
CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin
Quick Start & Environment Setup
Step 1: Extract and Navigate
# Extract the assignment files
unzip assignment1_starter.zip
# Navigate to the assignment directory
cd assignment1/
# Check the project structure
ls -la
# You should see: src/, tests/, data/, tools/, requirements.txt, STUDENT_INSTRUCTIONS.md
Step 2: Critical Environment Setup (MANDATORY)
# STEP 2A: Verify Python Version (CRITICAL)
python3 --version
# REQUIRED: Python 3.7.x (autograder uses 3.7.17 in Docker)
# If wrong version, install correct Python before proceeding
# STEP 2B: Create ISOLATED virtual environment (MANDATORY)
python3 -m venv assignment1_env
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# STEP 2C: Upgrade pip to exact version (CRITICAL)
pip install --upgrade pip==21.3.1
# STEP 2D: Install EXACT dependency versions (NO DEVIATIONS)
pip install -r requirements.txt
# STEP 2E: Verify environment matches autograder
python -c "import pytest; print(f'pytest version: {pytest.__version__}')"
python --version
pip freeze > my-environment.txt
CRITICAL WARNING:
- EXACT versions required - autograder will fail with version mismatches
- Virtual environment MANDATORY - global installs cause conflicts
- Environment verification REQUIRED - compare the output of pip freeze with requirements.txt
Step 3: Environment Verification (MANDATORY)
# Verify your environment EXACTLY matches autograder
cat my-environment.txt | grep -E "(pytest|pluggy|py==|attrs|more-itertools)"
# Expected output (MUST MATCH):
# attrs==21.4.0
# more-itertools==8.12.0
# pluggy==1.0.0
# py==1.11.0
# pytest==7.0.1
# If ANY version differs, reinstall with: pip install -r requirements.txt --force-reinstall
Step 4: Verify Project Setup
# Make sure virtual environment is activated
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# Run the test runner to check everything works
python tools/run_tests.py
# Expected output:
# All tokenizers and performance analyzer can be imported
# Tests failed (this is normal - you haven't implemented anything yet!)
Project Structure
assignment1/
├── src/ # Your implementation files
│ ├── simple_tokenizer.py # Basic word-level tokenizer
│ ├── regex_tokenizer.py # Pattern-based tokenizer
│ ├── bpe_tokenizer.py # Byte Pair Encoding tokenizer
│ └── performance_analyzer.py # Performance analysis tool
├── tests/ # Test files (don't modify)
│ ├── test_simple_tokenizer.py
│ ├── test_regex_tokenizer.py
│ ├── test_bpe_tokenizer.py
│ └── test_performance.py
├── data/ # Sample training data
│ └── sample_texts.txt
├── tools/ # Utility scripts
│ └── run_tests.py # Test runner
├── requirements.txt # LOCKED dependency versions (CRITICAL)
└── STUDENT_INSTRUCTIONS.md # This file
What You Need to Implement
Complete the TODO sections in these four files in src/:
simple_tokenizer.py - 1990s word-level tokenization
regex_tokenizer.py - 2000s pattern-based tokenization
bpe_tokenizer.py - 2015 BPE subword tokenization
performance_analyzer.py - Performance analysis and comparison
The Three Tokenizers You'll Build
1. Simple Tokenizer
What: Basic word-level tokenization (like early NLP systems)
Your Task: Complete these methods in simple_tokenizer.py:
class SimpleTokenizer:
    def build_vocabulary(self, texts)   # Build word-to-ID mapping
    def encode(self, text)              # Text → token IDs
    def decode(self, token_ids)         # Token IDs → text
Key Features:
- Vocabulary building from training texts
- Handle unknown words with <UNK> tokens
- Add special tokens (<BOS>, <EOS>, <PAD>)
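The starter code defines the exact signatures and token IDs; the snippet below is only a minimal sketch of the core idea, assuming a plain dict vocabulary, lowercase word splitting, and an <UNK> fallback - not the required implementation:

# Illustrative sketch only
vocab = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
for word in "the cat sat on the mat".lower().split():
    vocab.setdefault(word, len(vocab))  # give each new word the next free ID

def encode(text):
    # Map each word to its ID; unseen words fall back to <UNK>
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(encode("the dog sat"))  # "dog" is unseen, so it maps to <UNK>: [4, 1, 6]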
2. Regex Tokenizer
What: Pattern-based tokenization using regular expressions
Your Task: Complete these methods in regex_tokenizer.py:
class RegexTokenizer:
    def _compile_patterns(self)   # Set up regex patterns
    def tokenize(self, text)      # Split text using patterns
    def encode(self, text)        # Text → token IDs
Key Features:
- Handle punctuation, numbers, and words separately
- Preserve important patterns (emails, URLs, etc.)
- Case-sensitive and case-insensitive options
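The pattern set and class layout should follow the TODOs in regex_tokenizer.py; the snippet below is just a minimal sketch of pattern-based splitting with the standard re module, using one illustrative pattern:

import re

# Illustrative pattern: word/number runs as one token, each punctuation mark as its own token
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    # findall returns every non-overlapping match, left to right
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Hello, world! It costs $3.50."))
# ['hello', ',', 'world', '!', 'it', 'costs', '$', '3', '.', '50', '.']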
3. BPE Tokenizer
What: Byte Pair Encoding (used in GPT, BERT, LLaMA)
Your Task: Complete these methods in bpe_tokenizer.py:
class BPETokenizer:
    def train(self, texts)        # Learn merge rules from data
    def encode(self, text)        # Apply BPE merges
    def decode(self, token_ids)   # Perfect reconstruction
Key Features:
- BPE training learns subword patterns
- Handles rare words without <UNK> tokens
- Better compression than word-level tokenizers
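To make the training loop concrete, here is one illustrative BPE merge step (count adjacent symbol pairs weighted by word frequency, then merge the most frequent pair). The toy corpus and data structures are assumptions for illustration, not the starter code's:

from collections import Counter

# Toy corpus: each word as a tuple of symbols with its frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}

# Count every adjacent symbol pair, weighted by word frequency
pair_counts = Counter()
for symbols, freq in corpus.items():
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

best = max(pair_counts, key=pair_counts.get)  # most frequent pair, e.g. ('l', 'o')

# Replace every occurrence of the best pair with one merged symbol
merged_corpus = {}
for symbols, freq in corpus.items():
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    merged_corpus[tuple(out)] = freq

print(best)           # ('l', 'o')
print(merged_corpus)  # {('lo', 'w'): 5, ('lo', 'w', 'e', 'r'): 2, ('n', 'e', 'w'): 3}

Full training repeats this step until the desired number of merge rules has been learned.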
4. Performance Analyzer
What: Analyze and compare performance of different tokenizers
Your Task: Complete these methods in performance_analyzer.py:
class PerformanceAnalyzer:
    def measure_timing(self, tokenizer, texts, tokenizer_name)            # Measure timing performance
    def measure_compression_effectiveness(self, tokenizer, texts, name)   # Measure compression metrics
    def compare_tokenizers(self, tokenizers, texts)                       # Compare multiple tokenizers
Key Features:
- Timing analysis (training, encoding, decoding)
- Compression effectiveness measurement
- Multi-tokenizer comparison
- Utility methods provided to simplify implementation
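The metric names and tokenizer API in the snippet below are assumptions based on the method descriptions above, not the starter code; it is only a sketch of how timing and a rough compression proxy can be measured with time.perf_counter:

import time

def measure_encode(tokenizer, texts):
    # Time all encode() calls with a high-resolution monotonic clock
    start = time.perf_counter()
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    elapsed = time.perf_counter() - start
    total_chars = sum(len(t) for t in texts)
    return {
        "encode_seconds": elapsed,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else 0.0,
        "chars_per_token": total_chars / total_tokens if total_tokens else 0.0,  # rough compression proxy
    }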
Text Preprocessing Guidelines
To ensure consistent implementation across all students, please follow these preprocessing standards:
Unified Preprocessing Rules
All tokenizers should apply the same preprocessing so that results are consistent and comparable across implementations:
- Whitespace Handling: Multiple consecutive spaces should be merged into a single space
- Punctuation Handling: Separate punctuation marks from words (treat as individual tokens)
- Case Normalization: Convert text to lowercase to avoid duplicate tokens
- Text Cleanup: Remove leading and trailing whitespace from input text
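If it helps, here is a minimal helper that applies all four rules in one place; the function name and regexes are illustrative, so keep whatever structure the starter code expects:

import re

def preprocess(text):
    text = text.lower().strip()                 # case normalization + trim leading/trailing whitespace
    text = re.sub(r"([^\w\s])", r" \1 ", text)  # split punctuation marks off as separate tokens
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace into single spaces
    return text

print(preprocess("  Hello,   World!! "))  # -> "hello , world ! !"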
Development Workflow
Recommended Development Process:
Phase 1: Start with Simple Tokenizer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# 1. Open the file and read the TODO sections
code src/simple_tokenizer.py # or your preferred editor
# 2. Run individual tests to see what fails
python -m pytest tests/test_simple_tokenizer.py -v
# 3. Implement one method at a time
# 4. Test frequently
python src/simple_tokenizer.py
# 5. Run tests again to see progress
python -m pytest tests/test_simple_tokenizer.py -v
Phase 2: Move to Regex Tokenizer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# Similar process for regex tokenizer
python -m pytest tests/test_regex_tokenizer.py -v
code src/regex_tokenizer.py
python src/regex_tokenizer.py
Phase 3: Tackle BPE Tokenizer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# BPE is most complex - take your time
python -m pytest tests/test_bpe_tokenizer.py -v
code src/bpe_tokenizer.py
python src/bpe_tokenizer.py
Phase 4: Complete Performance Analyzer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# Use provided utility methods to implement analysis
python -m pytest tests/test_performance.py -v
code src/performance_analyzer.py
python src/performance_analyzer.py
Testing Your Implementation
Run All Tests:
# Make sure virtual environment is activated first
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# Use our test runner (recommended)
python tools/run_tests.py
# Or run pytest directly
python -m pytest tests/ -v
Test Individual Components:
# Make sure virtual environment is activated first
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# Test specific tokenizer
python -m pytest tests/test_simple_tokenizer.py -v
python -m pytest tests/test_regex_tokenizer.py -v
python -m pytest tests/test_bpe_tokenizer.py -v
python -m pytest tests/test_performance.py -v
# Run the tokenizer directly
python src/simple_tokenizer.py
python src/performance_analyzer.py
Expected Results as You Progress:
- Initially: 12 tests failed (normal!)
- After Simple Tokenizer: 3 tests passed, 9 failed
- After Regex Tokenizer: 6 tests passed, 6 failed
- After BPE Tokenizer: 9 tests passed, 3 failed
- After Performance Analyzer: 12 tests passed!
Submission Requirements
Submit These Files:
Upload each file individually to the assignment portal:
simple_tokenizer.py - Your completed simple tokenizer
regex_tokenizer.py - Your completed regex tokenizer
bpe_tokenizer.py - Your completed BPE tokenizer
performance_analyzer.py - Your completed performance analyzer
File Requirements:
- File names must be exact: simple_tokenizer.py, regex_tokenizer.py, bpe_tokenizer.py, performance_analyzer.py
- Keep existing structure: Don't change class names or method signatures
- Complete all TODO sections: Fill in your implementation where indicated
- Test your code: Make sure it runs without errors
Implementation Tips
Getting Started:
- Start with Simple Tokenizer - It's the easiest and builds foundation concepts
- Use the test code - Run the provided test cases at the bottom of each file
- Read the TODO comments - They provide specific guidance for each method
- Test frequently - Don't wait until the end to test your code
Common Approaches:
- Simple Tokenizer: Use split() and dictionaries for vocabulary
- Regex Tokenizer: Learn Python's re module for pattern matching
- BPE Tokenizer: Implement the merge algorithm step by step
Debugging Tips:
- Print intermediate results to understand what's happening
- Handle edge cases like empty strings and special characters
- Check your logic with simple examples first
Grading Criteria
What We're Looking For:
- Correctness: Does your code work as expected?
- Completeness: Are all TODO sections implemented?
- Code Quality: Is your code readable and well-structured?
- Edge Cases: Does it handle unusual inputs properly?
Testing Approach:
- We'll test your tokenizers on various text samples
- Check encoding/decoding round-trip accuracy
- Verify special token handling
- Test performance on different text types
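A quick self-check you can run yourself is a round-trip test. The snippet assumes the train/encode/decode methods listed above; the import path is hypothetical, and whether decode should reproduce the raw or the preprocessed text depends on the tokenizer, so follow what the provided tests expect:

# Hypothetical self-check; adapt the import and expected output to your setup and the provided tests
from bpe_tokenizer import BPETokenizer  # your implementation in src/

text = "the quick brown fox"
tokenizer = BPETokenizer()
tokenizer.train([text])
encoded = tokenizer.encode(text)
assert tokenizer.decode(encoded) == text, "round trip failed"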
Resources & Help
Allowed Libraries:
- Standard Library: re, collections, json, string
- FORBIDDEN: tiktoken, transformers, sentencepiece (these solve the assignment for you)
Getting Help:
- Office Hours: Check course schedule
- Email: For specific questions
- Test Code: Use the examples provided in each file
What You'll Learn
By completing this assignment, you'll understand:
- How tokenization works in modern NLP systems
- Different tokenization strategies and their trade-offs
- The evolution from simple word splitting to sophisticated BPE
- Implementation skills that transfer to real AI systems
Remember: These are the same techniques used in ChatGPT, GPT-4, and LLaMA. You're learning to build the foundations of modern AI!
Ready to Begin?
- Download the starter files from the course website
- Start with simple_tokenizer.py - it's the most straightforward
- Test your code frequently using the provided test cases
- Submit all four files when complete
Good luck! You're building the same technology that powers today's most advanced AI systems.