Assignment 1: Tokenizer Implementation
Student Instructions & Getting Started Guide
CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin
Quick Start & Environment Setup
Step 1: Extract and Navigate
# Extract the assignment files
unzip assignment1_starter.zip
# Navigate to the assignment directory
cd assignment1/
# Check the project structure
ls -la
# You should see: src/, tests/, data/, tools/, requirements.txt, STUDENT_INSTRUCTIONS.md
Step 2: Critical Environment Setup (MANDATORY)
# STEP 2A: Verify Python Version (CRITICAL)
python3 --version
# REQUIRED: Python 3.7.x (autograder uses 3.7.17 in Docker)
# If wrong version, install correct Python before proceeding
# STEP 2B: Create ISOLATED virtual environment (MANDATORY)
python3 -m venv assignment1_env
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# STEP 2C: Upgrade pip to exact version (CRITICAL)
pip install --upgrade pip==21.3.1
# STEP 2D: Install EXACT dependency versions (NO DEVIATIONS)
pip install -r requirements.txt
# STEP 2E: Verify environment matches autograder
python -c "import pytest; print(f'pytest version: {pytest.__version__}')"
python --version
pip freeze > my-environment.txt
CRITICAL WARNING:
- EXACT versions required - autograder will fail with version mismatches
- Virtual environment MANDATORY - global installs cause conflicts
- Environment verification REQUIRED - compare the output of pip freeze with requirements.txt
Step 3: Environment Verification (MANDATORY)
# Verify your environment EXACTLY matches autograder
cat my-environment.txt | grep -E "(pytest|pluggy|py==|attrs|more-itertools)"
# Expected output (MUST MATCH):
# attrs==21.4.0
# more-itertools==8.12.0
# pluggy==1.0.0
# py==1.11.0
# pytest==7.0.1
# If ANY version differs, reinstall with: pip install -r requirements.txt --force-reinstall
Step 4: Verify Project Setup
# Make sure virtual environment is activated
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# Run the test runner to check everything works
python tools/run_tests.py
# Expected output:
# All tokenizers and performance analyzer can be imported
# Tests failed (this is normal - you haven't implemented anything yet!)
Project Structure
assignment1/
├── src/ # Your implementation files
│ ├── simple_tokenizer.py # Basic word-level tokenizer
│ ├── regex_tokenizer.py # Pattern-based tokenizer
│ ├── bpe_tokenizer.py # Byte Pair Encoding tokenizer
│ └── performance_analyzer.py # Performance analysis tool
├── tests/ # Test files (don't modify)
│ ├── test_simple_tokenizer.py
│ ├── test_regex_tokenizer.py
│ ├── test_bpe_tokenizer.py
│ └── test_performance.py
├── data/ # Sample training data
│ └── sample_texts.txt
├── tools/ # Utility scripts
│ └── run_tests.py # Test runner
├── requirements.txt # LOCKED dependency versions (CRITICAL)
└── STUDENT_INSTRUCTIONS.md # This file
What You Need to Implement
Complete the TODO sections in these four files in src/:
simple_tokenizer.py - 1990s word-level tokenization
regex_tokenizer.py - 2000s pattern-based tokenization
bpe_tokenizer.py - 2015 BPE subword tokenization
performance_analyzer.py - Performance analysis and comparison
The Three Tokenizers You'll Build
1. Simple Tokenizer
What: Basic word-level tokenization (like early NLP systems)
Your Task: Complete these methods in simple_tokenizer.py:
class SimpleTokenizer:
    def build_vocabulary(self, texts)   # Build word-to-ID mapping
    def encode(self, text)              # Text → token IDs
    def decode(self, token_ids)         # Token IDs → text
Key Features:
- Vocabulary building from training texts
- Handle unknown words with <UNK> tokens
- Add special tokens (<BOS>, <EOS>, <PAD>)
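The starter code defines the exact signatures and token IDs; the snippet below is only a minimal sketch of the core idea, assuming a plain dict vocabulary, lowercase word splitting, and an <UNK> fallback - not the required implementation:

# Illustrative sketch only
vocab = {"<PAD>": 0, "<UNK>": 1, "<BOS>": 2, "<EOS>": 3}
for word in "the cat sat on the mat".lower().split():
    vocab.setdefault(word, len(vocab))  # give each new word the next free ID

def encode(text):
    # Map each word to its ID; unseen words fall back to <UNK>
    return [vocab.get(word, vocab["<UNK>"]) for word in text.lower().split()]

print(encode("the dog sat"))  # "dog" is unseen, so it maps to <UNK>: [4, 1, 6]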
2. Regex Tokenizer
What: Pattern-based tokenization using regular expressions
Your Task: Complete these methods in regex_tokenizer.py:
class RegexTokenizer:
    def _compile_patterns(self)   # Set up regex patterns
    def tokenize(self, text)      # Split text using patterns
    def encode(self, text)        # Text → token IDs
Key Features:
- Handle punctuation, numbers, and words separately
- Preserve important patterns (emails, URLs, etc.)
- Case-sensitive and case-insensitive options
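The pattern set and class layout should follow the TODOs in regex_tokenizer.py; the snippet below is just a minimal sketch of pattern-based splitting with the standard re module, using one illustrative pattern:

import re

# Illustrative pattern: word/number runs as one token, each punctuation mark as its own token
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    # findall returns every non-overlapping match, left to right
    return TOKEN_PATTERN.findall(text.lower())

print(tokenize("Hello, world! It costs $3.50."))
# ['hello', ',', 'world', '!', 'it', 'costs', '$', '3', '.', '50', '.']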
3. BPE Tokenizer
What: Byte Pair Encoding (used in GPT, BERT, LLaMA)
Your Task: Complete these methods in bpe_tokenizer.py:
class BPETokenizer:
    def train(self, texts)        # Learn merge rules from data
    def encode(self, text)        # Apply BPE merges
    def decode(self, token_ids)   # Perfect reconstruction
Key Features:
- BPE training learns subword patterns
- Handles rare words without <UNK> tokens
- Better compression than word-level tokenizers
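To make the training loop concrete, here is one illustrative BPE merge step (count adjacent symbol pairs weighted by word frequency, then merge the most frequent pair). The toy corpus and data structures are assumptions for illustration, not the starter code's:

from collections import Counter

# Toy corpus: each word as a tuple of symbols with its frequency
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2, ("n", "e", "w"): 3}

# Count every adjacent symbol pair, weighted by word frequency
pair_counts = Counter()
for symbols, freq in corpus.items():
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq

best = max(pair_counts, key=pair_counts.get)  # most frequent pair, e.g. ('l', 'o')

# Replace every occurrence of the best pair with one merged symbol
merged_corpus = {}
for symbols, freq in corpus.items():
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    merged_corpus[tuple(out)] = freq

print(best)           # ('l', 'o')
print(merged_corpus)  # {('lo', 'w'): 5, ('lo', 'w', 'e', 'r'): 2, ('n', 'e', 'w'): 3}

Full training repeats this step until the desired number of merge rules has been learned.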
4. Performance Analyzer
What: Analyze and compare performance of different tokenizers
Your Task: Complete these methods in performance_analyzer.py:
class PerformanceAnalyzer:
    def measure_timing(self, tokenizer, texts, tokenizer_name)            # Measure timing performance
    def measure_compression_effectiveness(self, tokenizer, texts, name)   # Measure compression metrics
    def compare_tokenizers(self, tokenizers, texts)                       # Compare multiple tokenizers
Key Features:
- Timing analysis (training, encoding, decoding)
- Compression effectiveness measurement
- Multi-tokenizer comparison
- Utility methods provided to simplify implementation
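The metric names and tokenizer API in the snippet below are assumptions based on the method descriptions above, not the starter code; it is only a sketch of how timing and a rough compression proxy can be measured with time.perf_counter:

import time

def measure_encode(tokenizer, texts):
    # Time all encode() calls with a high-resolution monotonic clock
    start = time.perf_counter()
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    elapsed = time.perf_counter() - start
    total_chars = sum(len(t) for t in texts)
    return {
        "encode_seconds": elapsed,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else 0.0,
        "chars_per_token": total_chars / total_tokens if total_tokens else 0.0,  # rough compression proxy
    }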
Text Preprocessing Guidelines
To ensure consistent implementation across all students, please follow these preprocessing standards:
Unified Preprocessing Rules
All tokenizers should apply the same preprocessing so that results are consistent and comparable across implementations:
- Whitespace Handling: Multiple consecutive spaces should be merged into a single space
- Punctuation Handling: Separate punctuation marks from words (treat as individual tokens)
- Case Normalization: Convert text to lowercase to avoid duplicate tokens
- Text Cleanup: Remove leading and trailing whitespace from input text
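If it helps, here is a minimal helper that applies all four rules in one place; the function name and regexes are illustrative, so keep whatever structure the starter code expects:

import re

def preprocess(text):
    text = text.lower().strip()                 # case normalization + trim leading/trailing whitespace
    text = re.sub(r"([^\w\s])", r" \1 ", text)  # split punctuation marks off as separate tokens
    text = re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace into single spaces
    return text

print(preprocess("  Hello,   World!! "))  # -> "hello , world ! !"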
Development Workflow
Recommended Development Process:
Phase 1: Start with Simple Tokenizer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# 1. Open the file and read the TODO sections
code src/simple_tokenizer.py # or your preferred editor
# 2. Run individual tests to see what fails
python -m pytest tests/test_simple_tokenizer.py -v
# 3. Implement one method at a time
# 4. Test frequently
python src/simple_tokenizer.py
# 5. Run tests again to see progress
python -m pytest tests/test_simple_tokenizer.py -v
Phase 2: Move to Regex Tokenizer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# Similar process for regex tokenizer
python -m pytest tests/test_regex_tokenizer.py -v
code src/regex_tokenizer.py
python src/regex_tokenizer.py
Phase 3: Tackle BPE Tokenizer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# BPE is most complex - take your time
python -m pytest tests/test_bpe_tokenizer.py -v
code src/bpe_tokenizer.py
python src/bpe_tokenizer.py
Phase 4: Complete Performance Analyzer
# Make sure virtual environment is activated
source assignment1_env/bin/activate
# Use provided utility methods to implement analysis
python -m pytest tests/test_performance.py -v
code src/performance_analyzer.py
python src/performance_analyzer.py
Testing Your Implementation
Run All Tests:
# Make sure virtual environment is activated first
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# Use our test runner (recommended)
python tools/run_tests.py
# Or run pytest directly
python -m pytest tests/ -v
Test Individual Components:
# Make sure virtual environment is activated first
source assignment1_env/bin/activate # On Windows: assignment1_env\Scripts\activate
# Test specific tokenizer
python -m pytest tests/test_simple_tokenizer.py -v
python -m pytest tests/test_regex_tokenizer.py -v
python -m pytest tests/test_bpe_tokenizer.py -v
python -m pytest tests/test_performance.py -v
# Run the tokenizer directly
python src/simple_tokenizer.py
python src/performance_analyzer.py
Expected Results as You Progress:
- Initially: 12 tests failed (normal!)
- After Simple Tokenizer: 3 tests passed, 9 failed
- After Regex Tokenizer: 6 tests passed, 6 failed
- After BPE Tokenizer: 9 tests passed, 3 failed
- After Performance Analyzer: 12 tests passed!
Submission Requirements
Submit These Files:
Upload each file individually to the assignment portal:
simple_tokenizer.py - Your completed simple tokenizer
regex_tokenizer.py - Your completed regex tokenizer
bpe_tokenizer.py - Your completed BPE tokenizer
performance_analyzer.py - Your completed performance analyzer
File Requirements:
- File names must be exact: simple_tokenizer.py, regex_tokenizer.py, bpe_tokenizer.py, performance_analyzer.py
- Keep existing structure: Don't change class names or method signatures
- Complete all TODO sections: Fill in your implementation where indicated
- Test your code: Make sure it runs without errors
Implementation Tips
Getting Started:
- Start with Simple Tokenizer - It's the easiest and builds foundation concepts
- Use the test code - Run the provided test cases at the bottom of each file
- Read the TODO comments - They provide specific guidance for each method
- Test frequently - Don't wait until the end to test your code
Common Approaches:
- Simple Tokenizer: Use split() and dictionaries for vocabulary
- Regex Tokenizer: Learn Python's re module for pattern matching
- BPE Tokenizer: Implement the merge algorithm step by step
Debugging Tips:
- Print intermediate results to understand what's happening
- Handle edge cases like empty strings and special characters
- Check your logic with simple examples first
Grading Criteria
What We're Looking For:
- Correctness: Does your code work as expected?
- Completeness: Are all TODO sections implemented?
- Code Quality: Is your code readable and well-structured?
- Edge Cases: Does it handle unusual inputs properly?
Testing Approach:
- We'll test your tokenizers on various text samples
- Check encoding/decoding round-trip accuracy
- Verify special token handling
- Test performance on different text types
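A quick self-check you can run yourself is a round-trip test. The snippet assumes the train/encode/decode methods listed above; the import path is hypothetical, and whether decode should reproduce the raw or the preprocessed text depends on the tokenizer, so follow what the provided tests expect:

# Hypothetical self-check; adapt the import and expected output to your setup and the provided tests
from bpe_tokenizer import BPETokenizer  # your implementation in src/

text = "the quick brown fox"
tokenizer = BPETokenizer()
tokenizer.train([text])
encoded = tokenizer.encode(text)
assert tokenizer.decode(encoded) == text, "round trip failed"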
Resources & Help
Allowed Libraries:
- Standard Library: re, collections, json, string
- FORBIDDEN: tiktoken, transformers, sentencepiece (these solve the assignment for you)
Getting Help:
- Office Hours: Check course schedule
- Email: For specific questions
- Test Code: Use the examples provided in each file
What You'll Learn
By completing this assignment, you'll understand:
- How tokenization works in modern NLP systems
- Different tokenization strategies and their trade-offs
- The evolution from simple word splitting to sophisticated BPE
- Implementation skills that transfer to real AI systems
Remember: These are the same techniques used in ChatGPT, GPT-4, and LLaMA. You're learning to build the foundations of modern AI!
Ready to Begin?
- Download the starter files from the course website
- Start with simple_tokenizer.py - it's the most straightforward
- Test your code frequently using the provided test cases
- Submit all four files when complete
Good luck! You're building the same technology that powers today's most advanced AI systems.