Assignment 1: Tokenizer Implementation

Student Instructions & Getting Started Guide

CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin


Quick Start & Environment Setup

Step 1: Extract and Navigate

```bash
# Extract the assignment files
unzip assignment1_starter.zip

# Navigate to the assignment directory
cd assignment1/

# Check the project structure
ls -la
# You should see: src/, tests/, data/, tools/, requirements.txt, STUDENT_INSTRUCTIONS.md
```

Step 2: Critical Environment Setup (MANDATORY)

```bash
# STEP 2A: Verify Python version (CRITICAL)
python3 --version
# REQUIRED: Python 3.7.x (autograder uses 3.7.17 in Docker)
# If you have the wrong version, install the correct Python before proceeding

# STEP 2B: Create an ISOLATED virtual environment (MANDATORY)
python3 -m venv assignment1_env
source assignment1_env/bin/activate
# On Windows: assignment1_env\Scripts\activate

# STEP 2C: Upgrade pip to the exact version (CRITICAL)
pip install --upgrade pip==21.3.1

# STEP 2D: Install EXACT dependency versions (NO DEVIATIONS)
pip install -r requirements.txt

# STEP 2E: Verify your environment matches the autograder
python -c "import pytest; print(f'pytest version: {pytest.__version__}')"
python --version
pip freeze > my-environment.txt
```
CRITICAL WARNING:

Step 3: Environment Verification (MANDATORY)

```bash
# Verify your environment EXACTLY matches the autograder
cat my-environment.txt | grep -E "(pytest|pluggy|py==|attrs|more-itertools)"

# Expected output (MUST MATCH):
# attrs==21.4.0
# more-itertools==8.12.0
# pluggy==1.0.0
# py==1.11.0
# pytest==7.0.1

# If ANY version differs, reinstall with:
pip install -r requirements.txt --force-reinstall
```

Step 4: Verify Project Setup

```bash
# Make sure the virtual environment is activated
source assignment1_env/bin/activate
# On Windows: assignment1_env\Scripts\activate

# Run the test runner to check that everything works
python tools/run_tests.py

# Expected output:
# All tokenizers and the performance analyzer can be imported
# Tests fail (this is normal - you haven't implemented anything yet!)
```

Project Structure

```text
assignment1/
├── src/                          # Your implementation files
│   ├── simple_tokenizer.py       # Basic word-level tokenizer
│   ├── regex_tokenizer.py        # Pattern-based tokenizer
│   ├── bpe_tokenizer.py          # Byte Pair Encoding tokenizer
│   └── performance_analyzer.py   # Performance analysis tool
├── tests/                        # Test files (don't modify)
│   ├── test_simple_tokenizer.py
│   ├── test_regex_tokenizer.py
│   ├── test_bpe_tokenizer.py
│   └── test_performance.py
├── data/                         # Sample training data
│   └── sample_texts.txt
├── tools/                        # Utility scripts
│   └── run_tests.py              # Test runner
├── requirements.txt              # LOCKED dependency versions (CRITICAL)
└── STUDENT_INSTRUCTIONS.md       # This file
```

What You Need to Implement

Complete the TODO sections in these four files in src/:

  1. simple_tokenizer.py - 1990s word-level tokenization
  2. regex_tokenizer.py - 2000s pattern-based tokenization
  3. bpe_tokenizer.py - 2015 BPE subword tokenization
  4. performance_analyzer.py - Performance analysis and comparison

The Three Tokenizers You'll Build

1. Simple Tokenizer

What: Basic word-level tokenization (like early NLP systems)

Your Task: Complete these methods in simple_tokenizer.py:

```python
class SimpleTokenizer:
    def build_vocabulary(self, texts)   # Build word-to-ID mapping
    def encode(self, text)              # Text → token IDs
    def decode(self, token_ids)         # Token IDs → text
```
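To see how these three methods fit together, here is a minimal sketch of a word-level tokenizer. The vocabulary ordering and the handling of unknown words (`-1` / `<UNK>` below) are assumptions for illustration; follow the TODO comments in the starter code for the required behavior.

```python
class SimpleTokenizerSketch:
    """Illustrative only -- not the required starter-code implementation."""

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}

    def build_vocabulary(self, texts):
        # Assign an ID to each unique whitespace-separated word, in order seen.
        for text in texts:
            for word in text.split():
                if word not in self.word_to_id:
                    idx = len(self.word_to_id)
                    self.word_to_id[word] = idx
                    self.id_to_word[idx] = word

    def encode(self, text):
        # Unknown words map to -1 here; the starter may use an <UNK> token instead.
        return [self.word_to_id.get(w, -1) for w in text.split()]

    def decode(self, token_ids):
        # Rejoin words with single spaces (original spacing is not recoverable).
        return " ".join(self.id_to_word.get(i, "<UNK>") for i in token_ids)
```

Note the limitation this exposes: word-level decoding cannot reconstruct the exact original spacing or handle words never seen in training, which motivates the subword approaches below.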

Key Features:

2. Regex Tokenizer

What: Pattern-based tokenization using regular expressions

Your Task: Complete these methods in regex_tokenizer.py:

```python
class RegexTokenizer:
    def _compile_patterns(self)   # Set up regex patterns
    def tokenize(self, text)      # Split text using patterns
    def encode(self, text)        # Text → token IDs
```
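As a sketch of the pattern-based approach, the example below uses a single regex that captures word characters and punctuation as separate tokens. The actual patterns the assignment requires may differ; this only illustrates the `_compile_patterns` / `tokenize` split.

```python
import re

class RegexTokenizerSketch:
    """Illustrative only -- the assignment's required patterns may differ."""

    def __init__(self):
        self._compile_patterns()

    def _compile_patterns(self):
        # Match either a run of word characters, or a single
        # non-word, non-space character (punctuation).
        self.pattern = re.compile(r"\w+|[^\w\s]")

    def tokenize(self, text):
        # findall returns every non-overlapping match, skipping whitespace.
        return self.pattern.findall(text)
```

Unlike simple whitespace splitting, this separates punctuation from words, so `"world!"` becomes two tokens instead of one vocabulary entry.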

Key Features:

3. BPE Tokenizer

What: Byte Pair Encoding (used in GPT, BERT, LLaMA)

Your Task: Complete these methods in bpe_tokenizer.py:

```python
class BPETokenizer:
    def train(self, texts)        # Learn merge rules from data
    def encode(self, text)        # Apply BPE merges
    def decode(self, token_ids)   # Perfect reconstruction
```
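The heart of BPE training is a loop: count adjacent symbol pairs across the corpus, merge the most frequent pair everywhere, repeat. Here is a sketch of one training step; the function names and the word-frequency representation are illustrative, not the starter code's API.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, word_freqs):
    # Replace every occurrence of the pair with the concatenated symbol.
    merged = {}
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# One training step: words start as tuples of characters.
word_freqs = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
best = get_pair_counts(word_freqs).most_common(1)[0][0]
word_freqs = merge_pair(best, word_freqs)
```

The full `train` method repeats this until the target vocabulary size is reached, recording each chosen pair as a merge rule that `encode` later replays in order.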

Key Features:

4. Performance Analyzer

What: Analyze and compare performance of different tokenizers

Your Task: Complete these methods in performance_analyzer.py:

```python
class PerformanceAnalyzer:
    def measure_timing(self, tokenizer, texts, tokenizer_name)           # Measure timing performance
    def measure_compression_effectiveness(self, tokenizer, texts, name)  # Measure compression metrics
    def compare_tokenizers(self, tokenizers, texts)                      # Compare multiple tokenizers
```
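As a sketch of the timing measurement, the function below encodes the whole corpus and reports throughput. The signature mirrors the starter stub, but the returned dictionary's keys are an assumption; check the tests in `tests/test_performance.py` for the required format.

```python
import time

def measure_timing(tokenizer, texts, tokenizer_name):
    # Hypothetical return format -- the starter tests define the real one.
    start = time.perf_counter()
    total_tokens = sum(len(tokenizer.encode(t)) for t in texts)
    elapsed = time.perf_counter() - start
    return {
        "name": tokenizer_name,
        "seconds": elapsed,
        "total_tokens": total_tokens,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }
```

Using `time.perf_counter()` rather than `time.time()` matters here: it is a monotonic, high-resolution clock intended for exactly this kind of interval measurement.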

Key Features:


Text Preprocessing Guidelines

To ensure consistent implementation across all students, please follow these preprocessing standards:

Unified Preprocessing Rules

All tokenizers should apply the same preprocessing to avoid implementation confusion:
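The exact rules are spelled out in the starter code; as a purely hypothetical illustration of what a shared normalization helper can look like (lowercasing and whitespace collapsing are common choices, not necessarily the assignment's rules):

```python
import re

def preprocess(text):
    # Hypothetical example only -- apply the rules specified in the starter code.
    text = text.lower()               # case-fold
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()
```

Whatever the rules are, applying them identically in all three tokenizers is what makes the compression comparisons in the performance analyzer meaningful.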


Development Workflow

Recommended Development Process:

Phase 1: Start with Simple Tokenizer

```bash
# Make sure the virtual environment is activated
source assignment1_env/bin/activate

# 1. Open the file and read the TODO sections
code src/simple_tokenizer.py  # or your preferred editor

# 2. Run individual tests to see what fails
python -m pytest tests/test_simple_tokenizer.py -v

# 3. Implement one method at a time
# 4. Test frequently
python src/simple_tokenizer.py

# 5. Run the tests again to see progress
python -m pytest tests/test_simple_tokenizer.py -v
```

Phase 2: Move to Regex Tokenizer

```bash
# Make sure the virtual environment is activated
source assignment1_env/bin/activate

# Similar process for the regex tokenizer
python -m pytest tests/test_regex_tokenizer.py -v
code src/regex_tokenizer.py
python src/regex_tokenizer.py
```

Phase 3: Tackle BPE Tokenizer

```bash
# Make sure the virtual environment is activated
source assignment1_env/bin/activate

# BPE is the most complex - take your time
python -m pytest tests/test_bpe_tokenizer.py -v
code src/bpe_tokenizer.py
python src/bpe_tokenizer.py
```

Phase 4: Complete Performance Analyzer

```bash
# Make sure the virtual environment is activated
source assignment1_env/bin/activate

# Use the provided utility methods to implement the analysis
python -m pytest tests/test_performance.py -v
code src/performance_analyzer.py
python src/performance_analyzer.py
```

Testing Your Implementation

Run All Tests:

```bash
# Make sure the virtual environment is activated first
source assignment1_env/bin/activate
# On Windows: assignment1_env\Scripts\activate

# Use our test runner (recommended)
python tools/run_tests.py

# Or run pytest directly
python -m pytest tests/ -v
```

Test Individual Components:

```bash
# Make sure the virtual environment is activated first
source assignment1_env/bin/activate
# On Windows: assignment1_env\Scripts\activate

# Test a specific tokenizer
python -m pytest tests/test_simple_tokenizer.py -v
python -m pytest tests/test_regex_tokenizer.py -v
python -m pytest tests/test_bpe_tokenizer.py -v
python -m pytest tests/test_performance.py -v

# Run a tokenizer directly
python src/simple_tokenizer.py
python src/performance_analyzer.py
```

Expected Results as You Progress:


Submission Requirements

Submit These Files:

Upload each file individually to the assignment portal:

  1. simple_tokenizer.py - Your completed simple tokenizer
  2. regex_tokenizer.py - Your completed regex tokenizer
  3. bpe_tokenizer.py - Your completed BPE tokenizer

File Requirements:


Implementation Tips

Getting Started:

  1. Start with the Simple Tokenizer - it's the easiest and builds the foundational concepts
  2. Use the test code - Run the provided test cases at the bottom of each file
  3. Read the TODO comments - They provide specific guidance for each method
  4. Test frequently - Don't wait until the end to test your code

Common Approaches:

Debugging Tips:


Grading Criteria

What We're Looking For:

Testing Approach:


Resources & Help

Allowed Libraries:

Getting Help:


What You'll Learn

By completing this assignment, you'll understand:

Remember: These are the same techniques used in ChatGPT, GPT-4, and LLaMA. You're learning to build the foundations of modern AI!


Ready to Begin?

  1. Download the starter files from the course website
  2. Start with simple_tokenizer.py - it's the most straightforward
  3. Test your code frequently using the provided test cases
  4. Submit all three files when complete

Good luck! You're building the same technology that powers today's most advanced AI systems.

