Assignment 3: Classification and Instruction Fine-Tuning
CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin
Overview
This assignment explores two fundamental fine-tuning paradigms for large language models: classification fine-tuning (Chapter 6) and instruction fine-tuning (Chapter 7). You will implement manual classification heads and instruction formatting following textbook concepts while using Hugging Face tools for efficiency.
Learning Objectives
By completing this assignment, you will:
- Understand manual classification head implementation (Chapter 6)
- Learn why last token extraction is critical for causal attention models
- Implement instruction fine-tuning with Alpaca prompt formatting (Chapter 7)
- Apply parameter freezing strategies for efficient training
Textbook Alignment
This assignment directly implements concepts from:
- Chapter 6: Classification Fine-tuning (pages 206-212)
  - Manual classification head implementation
  - Parameter freezing and selective unfreezing
  - Last token extraction for causal attention
- Chapter 7: Instruction Fine-tuning (pages 228-235)
  - Alpaca prompt formatting
  - Keeping the original LM head for generation
  - Next-token prediction training
Understanding Hugging Face Transformers
Important: Hugging Face Library Introduction
We have NOT covered Hugging Face in lectures, but we're using it in this assignment because:
- Industry Standard: Hugging Face is the most widely used library for working with transformers in both research and industry
- Efficiency: It provides pre-trained models and training utilities, saving us from implementing everything from scratch
- Learning Focus: Allows us to focus on fine-tuning concepts (Chapter 6 & 7) rather than low-level implementation details
What is Hugging Face?
Think of Hugging Face as a library of pre-trained AI models that you can download and use. Instead of training a model from scratch (which takes weeks and costs thousands of dollars), you can:
- Download a pre-trained model (like DistilGPT-2)
- Fine-tune it on your specific task (what this assignment is about)
- Use it for your application
Analogy: It's like downloading a pre-trained athlete. Instead of training someone from birth, you get an athlete who already knows how to run, and you just teach them your specific sport.
Key Concept: What is a "Model"?
In this assignment, a model is a neural network that has been trained on massive amounts of text. Think of it as a student who has:
- Read billions of sentences from the internet
- Learned patterns in language (grammar, facts, reasoning)
- Can perform various language tasks
DistilGPT-2 is the specific model we're using. It's a smaller, faster version of GPT-2.
Understanding Different Model Types
Hugging Face offers the same base model (DistilGPT-2) in three different "configurations". Each configuration is designed for a different task:
1. AutoModel (Base Configuration)
What it is: The core transformer without any task-specific layer on top.
What it gives you: Just the hidden representations (embeddings) of your text.
When to use: When you want to ADD YOUR OWN custom layer for a specific task (Task 1).
from transformers import AutoModel
# Load the base model
model = AutoModel.from_pretrained('distilgpt2')
# This model gives you ONLY hidden states
# No classification, no text generation - just embeddings
# You need to add your own layers on top
Analogy: Like a car engine without a body. You get the core power, but you need to build the rest yourself.
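To make "just embeddings" concrete, here is a small sketch (not part of the starter code) that prints the shape of the hidden states AutoModel returns:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModel.from_pretrained('distilgpt2')

inputs = tokenizer("The movie was great", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token; no logits, no predictions.
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, 768) for DistilGPT-2
```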
2. AutoModelForCausalLM (Text Generation Configuration)
What it is: Base model + a "language modeling head" that predicts the next word.
What it gives you: Ability to generate text word-by-word.
When to use: For text generation tasks like chatbots, instruction following (Task 2).
from transformers import AutoModelForCausalLM
# Load model with text generation capability
model = AutoModelForCausalLM.from_pretrained('distilgpt2')
# This model can GENERATE text
# It predicts: "given these words, what comes next?"
# Perfect for chatbots and instruction following
Analogy: Like a complete car that's ready to drive. It can do its job (generate text) right away.
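To see the LM head in action before any fine-tuning, here is a small self-contained sketch (the prompt is arbitrary; this is not part of the starter code):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

inputs = tokenizer("Once upon a time", return_tensors='pt')
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,                     # generate up to 20 additional tokens
    do_sample=False,                       # greedy decoding for reproducibility
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token; reuse EOS
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```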
3. AutoModelForSequenceClassification (Pre-built Classification)
What it is: Base model + a built-in classification layer.
What it gives you: Ability to classify text into categories.
When to use: For classification when you DON'T want to implement the head manually (we DON'T use this here, because the point of Task 1 is to learn the manual way).
from transformers import AutoModelForSequenceClassification
# Load model with built-in classification head
model = AutoModelForSequenceClassification.from_pretrained('distilgpt2', num_labels=2)
# This model can classify text into 2 categories
# BUT we won't use this because we want to implement manually!
Analogy: Like buying a pre-built race car. It works, but you don't learn how to build cars.
Why Three Different Types?
| Model Type | Use Case | This Assignment |
|---|---|---|
| AutoModel | Custom tasks where you build your own head | Task 1: We add our own classification layer |
| AutoModelForCausalLM | Text generation (chatbots, completion) | Task 2: Instruction fine-tuning |
| AutoModelForSequenceClassification | Quick classification without manual implementation | NOT USED (we want to learn manually) |
Other Key Hugging Face Concepts
Tokenizer - Converting Text to Numbers
What it does: Neural networks can't read text directly. A tokenizer converts text into numbers that models can process.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
# Convert text to numbers
text = "Hello world"
tokens = tokenizer(text, return_tensors='pt')
# Result: {'input_ids': tensor([[15496, 995]]), ...}
# "Hello" -> 15496, " world" -> 995 (GPT-2 tokens keep the leading space)
Trainer - Simplified Training
What it does: Instead of writing complex training loops yourself, Trainer handles all the details.
from transformers import Trainer, TrainingArguments
# Configure training
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=8,
)
# Train with one line!
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()
Datasets Library - Easy Data Loading
What it does: Download popular datasets with one line of code.
from datasets import load_dataset
# Download IMDb movie reviews
dataset = load_dataset('imdb')
# That's it! Dataset is ready to use
Summary: What You Need to Know
- Hugging Face = Library of pre-trained models
  - Download models instead of training from scratch
  - Fine-tune them for your specific task
- Three model types for different purposes
  - AutoModel: Base transformer, add your own layers
  - AutoModelForCausalLM: Text generation ready
  - AutoModelForSequenceClassification: Classification ready (we don't use it)
- Supporting tools
  - Tokenizer: Text to numbers
  - Trainer: Easy training
  - Datasets: Easy data loading
For this assignment:
- Task 1: Use AutoModel + add manual classification head
- Task 2: Use AutoModelForCausalLM for instruction following
Assignment Structure
assignment3_starter/
├── src/
│ ├── classification_model.py # Manual GPT2Classifier (Task 1)
│ ├── instruction_model.py # Instruction fine-tuning (Task 2)
│ ├── utils.py # Data loading and evaluation
│ └── trainer_config.py # Hugging Face Trainer setup
├── main.py # Run both tasks
├── requirements.txt
└── README.md
Running Modes: DEBUG vs FULL
Key Concept: Two-Stage Development
This assignment supports two running modes to make your development process much faster:
- DEBUG mode: Small datasets, quick testing (~5 minutes)
- FULL mode: Complete datasets, final results (longer)
Always start with DEBUG mode! Only run FULL mode when your code works correctly.
Mode Comparison Table
| Aspect | DEBUG Mode | FULL Mode |
|---|---|---|
| Command | python main.py --mode debug | python main.py --mode full |
| Task 1 Dataset | 50 train / 20 val / 20 test | 500 train / 100 val / 100 test |
| Task 2 Dataset | 20 instruction samples | 150 instruction samples |
| Task 1 Epochs | 1 epoch | 3 epochs |
| Task 2 Epochs | 1 epoch | 2 epochs |
| Purpose | Code verification, debugging | Complete training for submission |
| When to Use | During development, after each change | Once, when code is ready |
Detailed Mode Explanations
DEBUG Mode - Your Development Best Friend
Use this mode 90% of the time during development!
When to use DEBUG mode:
- Testing if your code runs without errors
- Debugging implementation issues
- Checking if TODO sections are completed correctly
- Verifying model architecture
- Testing data loading
- Quick iteration during development
What DEBUG mode does:
- Loads small datasets: Downloads/processes much faster
- Runs 1 epoch only: Training completes quickly
- Same code path: Tests your implementation without waiting
Example DEBUG workflow:
# 1. Write code in classification_model.py
vim src/classification_model.py
# 2. Test with DEBUG mode
python main.py --mode debug
# 3. See error? Fix it and test again
vim src/classification_model.py
python main.py --mode debug
# 4. Repeat until no errors!
FULL Mode - Your Submission Version
Only run this when your DEBUG mode works perfectly!
When to use FULL mode:
- Your code runs successfully in DEBUG mode
- All TODO sections are implemented
- You're ready to generate final results
- You want to test actual performance
What FULL mode does:
- Loads complete datasets: Full training data for quality results
- Runs multiple epochs: Proper convergence for good accuracy
- Generates submission results: Performance metrics for your report
Example FULL workflow:
# 1. Verify DEBUG mode works
python main.py --mode debug
# Output: All tasks complete without errors
# 2. Now run FULL mode
python main.py --mode full
# This will take longer but gives final results
# 3. Use these results for your analysis.pdf report
Recommended Development Workflow
# Week 1: Work on Task 1 (Classification)
# ------------------------------------------
# 1. Read Task 1 instructions carefully
# 2. Implement TODO sections in classification_model.py
# 3. Test frequently with DEBUG mode:
python main.py --mode debug # Run this 5-10 times while developing
# 4. When Task 1 works in DEBUG mode, move to Task 2
# Week 2: Work on Task 2 (Instruction)
# ------------------------------------------
# 1. Read Task 2 instructions carefully
# 2. Implement TODO sections in instruction_model.py
# 3. Test frequently with DEBUG mode:
python main.py --mode debug # Run this 5-10 times while developing
# Final Step: Run FULL mode once
# ------------------------------------------
# 1. Verify both tasks work in DEBUG mode
# 2. Run FULL mode for final results:
python main.py --mode full
# 3. Save outputs for your analysis.pdf
# 4. Submit!
Common Mistakes to Avoid
- DON'T run FULL mode repeatedly - it takes longer and you'll waste time
- DON'T skip DEBUG mode - you'll spend hours waiting for FULL mode to find errors
- DO use DEBUG mode for all development and debugging
- DO run FULL mode only once when everything works
Quick Start Guide
1. Environment Setup (REQUIRED: Use Conda)
IMPORTANT: Conda Environment Required
This assignment requires using Conda for environment management. Do NOT use venv or system Python.
# Extract the assignment files
unzip assignment3_starter.zip
cd assignment3_starter/
# REQUIRED: Create Conda environment (Python 3.9-3.11 recommended)
conda create -n assignment3_env python=3.10 -y
conda activate assignment3_env
# Install dependencies
pip install -r requirements.txt
# Verify installation
python -c "import torch; import transformers; print('Environment ready!')"
2. Run the Assignment
# Run both tasks sequentially
python main.py
Task 1: Classification Fine-Tuning (60 points)
Overview
Implement a manual classification head following Chapter 6 concepts. Building the head yourself demonstrates exactly what AutoModelForSequenceClassification would otherwise do for you behind the scenes.
What You Need to Implement
In src/classification_model.py, complete the GPT2Classifier class following Chapter 6, pages 207-211.
Key Components to Implement:
- Load pretrained base transformer (use AutoModel, not AutoModelForSequenceClassification)
- Add manual classification head layer
- Implement parameter freezing strategy
- Extract last token from hidden states for classification
- Calculate loss and return proper format
Critical Concept: Why Last Token?
Chapter 6 Core Concept (Figure 6.12):
GPT models use causal attention masks, meaning each token only attends to previous tokens. The last token has access to information from the entire sequence, making it the ideal choice for classification.
Using the first token (as BERT does with its [CLS] token) would be incorrect here, because in a GPT model the first token cannot see any of the tokens that follow it due to causal masking.
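Putting these ideas together, here is a minimal sketch of the kind of class Task 1 asks for. The class name matches the starter file, but the constructor arguments and return format shown here are assumptions; follow the TODOs and signatures in src/classification_model.py rather than copying this verbatim.
```python
import torch
import torch.nn as nn
from transformers import AutoModel

class GPT2Classifier(nn.Module):
    """Sketch of a manual classification head on top of DistilGPT-2 (Chapter 6)."""

    def __init__(self, num_labels=2, freeze_base=True, unfreeze_last_block=False):
        super().__init__()
        # 1. Base transformer only: hidden states, no task head (AutoModel, not ...ForSequenceClassification).
        self.base = AutoModel.from_pretrained('distilgpt2')
        hidden_size = self.base.config.hidden_size  # 768 for DistilGPT-2

        # 2. Manual classification head.
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

        # 3. Parameter freezing strategy.
        if freeze_base:
            for param in self.base.parameters():
                param.requires_grad = False
            if unfreeze_last_block:
                # Optionally unfreeze the last transformer block and the final layer norm.
                for param in self.base.h[-1].parameters():
                    param.requires_grad = True
                for param in self.base.ln_f.parameters():
                    param.requires_grad = True

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state            # (batch, seq_len, hidden)

        # 4. Last token extraction: with right padding, pick the last *real* token
        #    per sequence (the only position that has seen the whole review).
        if attention_mask is not None:
            last_idx = attention_mask.sum(dim=1) - 1          # index of last non-pad token
        else:
            last_idx = torch.full((input_ids.size(0),), input_ids.size(1) - 1,
                                  device=input_ids.device)
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = hidden_states[batch_idx, last_idx]      # (batch, hidden)

        # 5. Classify and (optionally) compute the loss.
        logits = self.classifier(self.dropout(last_hidden))
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {'loss': loss, 'logits': logits}
```
Returning a dict with loss and logits keys is one convention the Hugging Face Trainer understands; check what src/trainer_config.py actually expects from your model.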
Implementation Details
| Component | Points | Requirements |
|---|---|---|
| Base Model Loading | 10 | Use AutoModel (not AutoModelForSequenceClassification) |
| Classification Head | 15 | Manual nn.Linear(768, 2) with dropout |
| Parameter Freezing | 15 | Freeze base, optional unfreeze last block |
| Forward Pass | 15 | Last token extraction, loss calculation |
| Training Success | 15 | Achieve 75-85% test accuracy on 500 IMDb samples |
Dataset: IMDb Movie Reviews
| Split | Samples | Purpose |
|---|---|---|
| Train | 500 | Model training |
| Validation | 100 | Early stopping |
| Test | 100 | Final evaluation |
Expected Performance
DEBUG vs FULL Mode Results
DEBUG Mode (50 samples, 1 epoch):
- Test Accuracy: 35-50% (intentionally low)
- Purpose: Code verification only
- Training Time: ~2 minutes
FULL Mode (500 samples, 3 epochs):
- Test Accuracy: 75-85% (target range)
- Purpose: Final submission quality
- Training Time: ~8-10 min (GPU/MPS) or ~15 min (CPU)
Important: DEBUG mode results are NOT representative. Low accuracy in DEBUG mode is normal and expected.
Model Configuration
- Trainable Parameters: ~7M when the last transformer block is unfrozen; the nn.Linear(768, 2) classification head by itself adds only about 1.5K parameters
- Total Parameters: ~82M (DistilGPT-2); the snippet below shows one way to verify these counts
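If you want to check these counts yourself, a few lines of PyTorch are enough. The sketch below runs on the bare base model; you can call the same helper on your finished classifier to see the effect of freezing:
```python
from transformers import AutoModel

def count_parameters(model):
    """Print total vs. trainable parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"total={total:,}  trainable={trainable:,}")

# Example on the bare base model (your GPT2Classifier adds the head and freezing on top):
base = AutoModel.from_pretrained('distilgpt2')
count_parameters(base)  # all ~82M parameters are trainable before any freezing
```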
Hardware Acceleration
- CUDA (NVIDIA GPU): Automatically detected and used
- MPS (Apple Silicon): Automatically detected on M-series Macs
- CPU: Fallback option (slower, but it works); a typical device-selection pattern is shown below
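The starter code most likely selects the device for you, but if you want to confirm what will be used, the standard PyTorch pattern looks like this:
```python
import torch

# Pick the fastest available device: CUDA GPU, then Apple Silicon MPS, then CPU.
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

print(f"Using device: {device}")
```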
Task 2: Instruction Fine-Tuning (40 points)
Overview
Implement instruction fine-tuning using the Alpaca prompt format (Chapter 7). This demonstrates the fundamental difference from classification: the original LM head is kept so the model can generate open-ended text.
What You Need to Implement
In src/instruction_model.py, complete the instruction fine-tuning functions following Chapter 7, pages 229-235.
Key Components to Implement:
- Implement Alpaca prompt formatting (system prompt + Instruction + Input + Response sections)
- Load model with language modeling head (use AutoModelForCausalLM, not AutoModel)
- Prepare dataset with proper tokenization
- Implement generation function for responses (a minimal sketch of the formatting and model setup appears after this list)
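Here is a minimal sketch of the formatting and model-setup components above. The Alpaca wording below follows the common convention; confirm the precise template the starter code and Chapter 7 expect before relying on it.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ALPACA_SYSTEM_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def format_alpaca_prompt(instruction, input_text=""):
    """Build an Alpaca-style prompt; the Input section is included only when present."""
    prompt = f"{ALPACA_SYSTEM_PROMPT}\n\n### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    return prompt

# Model setup: keep the original LM head so the model can still generate text.
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# For causal LM fine-tuning, the labels are simply the input token IDs
# (Hugging Face shifts them internally), so the model learns next-token
# prediction over the whole formatted prompt + response.
example = format_alpaca_prompt("Name the capital of France.") + "Paris."
batch = tokenizer(example, return_tensors='pt')
outputs = model(**batch, labels=batch['input_ids'])
print(outputs.loss)
```
Note the contrast with Task 1: the loss here is next-token prediction over every position in the formatted sequence, not cross-entropy on a single last-token representation.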
Critical Difference: Classification vs Instruction
| Aspect | Classification (Task 1) | Instruction (Task 2) |
|---|---|---|
| Model Head | Custom classifier (2 outputs) | Original LM head (50,257-token vocabulary) |
| Model Loading | AutoModel | AutoModelForCausalLM |
| Token Usage | Last token only | All tokens (next-token prediction) |
| Output Type | Fixed classes (pos/neg) | Open-ended generation |
| Training Objective | Cross-entropy on last token | Next-token prediction on all tokens |
Implementation Details
| Component | Points | Requirements |
|---|---|---|
| Alpaca Formatting | 8 | Correct template with Instruction/Input/Response |
| Model Setup | 7 | Use AutoModelForCausalLM, keep LM head |
| Training Success | 5 | Model generates coherent responses after fine-tuning |
Dataset: Instruction Examples
- Source: Alpaca-style instruction-response pairs
- Samples: 150 instruction examples
- Format: JSON records with instruction, input (optional), and output fields (see the example record below)
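Each record has the following shape (shown here as a Python dict with made-up values; the actual file stores the same fields as JSON):
```python
# Illustrative only: one instruction record (the values are invented, not from the dataset).
example_record = {
    "instruction": "Rewrite the sentence in the past tense.",
    "input": "She walks to school every day.",   # optional; may be an empty string
    "output": "She walked to school every day.",
}
```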
Expected Performance
DEBUG vs FULL Mode Results
DEBUG Mode (20 samples, 1 epoch):
- Generation Quality: May produce incoherent responses
- Purpose: Code verification only
- Training Time: ~1 minute
FULL Mode (150 samples, 2 epochs):
- Generation Quality: Coherent, on-topic responses
- Purpose: Final submission quality
- Training Time: ~5-8 min (GPU/MPS) or ~15-25 min (CPU)
Important: DEBUG mode generation quality will be poor due to limited training data. This is expected.
Submission Requirements
Files to Submit:
- src/classification_model.py - Completed manual classifier
- src/instruction_model.py - Completed instruction fine-tuning
Grading Rubric (100 points)
| Component | Points | Criteria |
|---|---|---|
| Task 1: Classification Fine-Tuning | 60 | Manual head implementation (20 pts), Parameter freezing (15 pts), Last token extraction (15 pts), Training success (10 pts) |
| Task 2: Instruction Fine-Tuning | 40 | Alpaca formatting (15 pts), Model setup (15 pts), Training success (10 pts) |
Academic Integrity
You may NOT:
- Share code with other students
- Use AI tools (ChatGPT, etc.) to generate code