Assignment 3: Classification and Instruction Fine-Tuning
CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin
Overview
This assignment explores two fundamental fine-tuning paradigms for large language models: classification fine-tuning (Chapter 6) and instruction fine-tuning (Chapter 7). You will implement manual classification heads and instruction formatting following textbook concepts while using Hugging Face tools for efficiency.
Learning Objectives
By completing this assignment, you will:
- Understand manual classification head implementation (Chapter 6)
- Learn why last token extraction is critical for causal attention models
- Implement instruction fine-tuning with Alpaca prompt formatting (Chapter 7)
- Apply parameter freezing strategies for efficient training
Textbook Alignment
This assignment directly implements concepts from:
- Chapter 6: Classification Fine-tuning (pages 206-212)
  - Manual classification head implementation
  - Parameter freezing and selective unfreezing
  - Last token extraction for causal attention
- Chapter 7: Instruction Fine-tuning (pages 228-235)
  - Alpaca prompt formatting
  - Keeping the original LM head for generation
  - Next-token prediction training
Understanding Hugging Face Transformers
Important: Hugging Face Library Introduction
We have NOT covered Hugging Face in lectures, but we're using it in this assignment because:
- Industry Standard: Hugging Face is the most widely used library for working with transformers in both research and industry
- Efficiency: It provides pre-trained models and training utilities, saving us from implementing everything from scratch
- Learning Focus: Allows us to focus on fine-tuning concepts (Chapter 6 & 7) rather than low-level implementation details
What is Hugging Face?
Think of Hugging Face as a library of pre-trained AI models that you can download and use. Instead of training a model from scratch (which takes weeks and costs thousands of dollars), you can:
- Download a pre-trained model (like DistilGPT-2)
- Fine-tune it on your specific task (what this assignment is about)
- Use it for your application
Analogy: It's like downloading a pre-trained athlete. Instead of training someone from birth, you get an athlete who already knows how to run, and you just teach them your specific sport.
Key Concept: What is a "Model"?
In this assignment, a model is a neural network that has been trained on massive amounts of text. Think of it as a student who has:
- Read billions of sentences from the internet
- Learned patterns in language (grammar, facts, reasoning)
- Can perform various language tasks
DistilGPT-2 is the specific model we're using. It's a smaller, faster version of GPT-2.
Understanding Different Model Types
Hugging Face offers the same base model (DistilGPT-2) in three different "configurations". Each configuration is designed for a different task:
1. AutoModel (Base Configuration)
What it is: The core transformer without any task-specific layer on top.
What it gives you: Just the hidden representations (embeddings) of your text.
When to use: When you want to ADD YOUR OWN custom layer for a specific task (Task 1).
from transformers import AutoModel
# Load the base model
model = AutoModel.from_pretrained('distilgpt2')
# This model gives you ONLY hidden states
# No classification, no text generation - just embeddings
# You need to add your own layers on top
Analogy: Like a car engine without a body. You get the core power, but you need to build the rest yourself.
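To make "just embeddings" concrete, here is a small sketch (not part of the starter code) that prints the shape of the hidden states AutoModel returns:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModel.from_pretrained('distilgpt2')

inputs = tokenizer("The movie was great", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token; no logits, no predictions.
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, 768) for DistilGPT-2
```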
2. AutoModelForCausalLM (Text Generation Configuration)
What it is: Base model + a "language modeling head" that predicts the next word.
What it gives you: Ability to generate text word-by-word.
When to use: For text generation tasks like chatbots, instruction following (Task 2).
from transformers import AutoModelForCausalLM
# Load model with text generation capability
model = AutoModelForCausalLM.from_pretrained('distilgpt2')
# This model can GENERATE text
# It predicts: "given these words, what comes next?"
# Perfect for chatbots and instruction following
Analogy: Like a complete car that's ready to drive. It can do its job (generate text) right away.
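To see the LM head in action before any fine-tuning, here is a small self-contained sketch (the prompt is arbitrary; this is not part of the starter code):
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

inputs = tokenizer("Once upon a time", return_tensors='pt')
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,                     # generate up to 20 additional tokens
    do_sample=False,                       # greedy decoding for reproducibility
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token; reuse EOS
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```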
3. AutoModelForSequenceClassification (Pre-built Classification)
What it is: Base model + a built-in classification layer.
What it gives you: Ability to classify text into categories.
When to use: For classification when you DON'T want to implement the head manually (we DON'T use this here, because the point of Task 1 is to learn the manual way).
from transformers import AutoModelForSequenceClassification
# Load model with built-in classification head
model = AutoModelForSequenceClassification.from_pretrained('distilgpt2', num_labels=2)
# This model can classify text into 2 categories
# BUT we won't use this because we want to implement manually!
Analogy: Like buying a pre-built race car. It works, but you don't learn how to build cars.
Why Three Different Types?
| Model Type | Use Case | This Assignment |
|---|---|---|
| AutoModel | Custom tasks where you build your own head | Task 1: We add our own classification layer |
| AutoModelForCausalLM | Text generation (chatbots, completion) | Task 2: Instruction fine-tuning |
| AutoModelForSequenceClassification | Quick classification without manual implementation | NOT USED (we want to learn manually) |
Other Key Hugging Face Concepts
Tokenizer - Converting Text to Numbers
What it does: Neural networks can't read text directly. A tokenizer converts text into numbers that models can process.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
# Convert text to numbers
text = "Hello world"
tokens = tokenizer(text, return_tensors='pt')
# Result: {'input_ids': tensor([[15496, 995]]), ...}
# "Hello" -> 15496, " world" -> 995 (GPT-2 tokens keep the leading space)
Trainer - Simplified Training
What it does: Instead of writing complex training loops yourself, Trainer handles all the details.
from transformers import Trainer, TrainingArguments
# Configure training
training_args = TrainingArguments(
output_dir='./results',
num_train_epochs=3,
per_device_train_batch_size=8,
)
# Train with one line!
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()
Datasets Library - Easy Data Loading
What it does: Download popular datasets with one line of code.
from datasets import load_dataset
# Download IMDb movie reviews
dataset = load_dataset('imdb')
# That's it! Dataset is ready to use
Summary: What You Need to Know
- Hugging Face = Library of pre-trained models
  - Download models instead of training from scratch
  - Fine-tune them for your specific task
- Three model types for different purposes
  - AutoModel: Base transformer, add your own layers
  - AutoModelForCausalLM: Text generation ready
  - AutoModelForSequenceClassification: Classification ready (we don't use it)
- Supporting tools
  - Tokenizer: Text to numbers
  - Trainer: Easy training
  - Datasets: Easy data loading
For this assignment:
- Task 1: Use AutoModel + add manual classification head
- Task 2: Use AutoModelForCausalLM for instruction following
Assignment Structure
assignment3_starter/
├── src/
│ ├── classification_model.py # Manual GPT2Classifier (Task 1)
│ ├── instruction_model.py # Instruction fine-tuning (Task 2)
│ ├── utils.py # Data loading and evaluation
│ └── trainer_config.py # Hugging Face Trainer setup
├── main.py # Run both tasks
├── requirements.txt
└── README.md
Running Modes: DEBUG vs FULL
Key Concept: Two-Stage Development
This assignment supports two running modes to make your development process much faster:
- DEBUG mode: Small datasets, quick testing (~5 minutes)
- FULL mode: Complete datasets, final results (longer)
Always start with DEBUG mode! Only run FULL mode when your code works correctly.
Mode Comparison Table
| Aspect | DEBUG Mode | FULL Mode |
|---|---|---|
| Command | python main.py --mode debug | python main.py --mode full |
| Task 1 Dataset | 50 train / 20 val / 20 test | 500 train / 100 val / 100 test |
| Task 2 Dataset | 20 instruction samples | 150 instruction samples |
| Task 1 Epochs | 1 epoch | 3 epochs |
| Task 2 Epochs | 1 epoch | 2 epochs |
| Purpose | Code verification, debugging | Complete training for submission |
| When to Use | During development, after each change | Once, when code is ready |
Detailed Mode Explanations
DEBUG Mode - Your Development Best Friend
Use this mode 90% of the time during development!
When to use DEBUG mode:
- Testing if your code runs without errors
- Debugging implementation issues
- Checking if TODO sections are completed correctly
- Verifying model architecture
- Testing data loading
- Quick iteration during development
What DEBUG mode does:
- Loads small datasets: Downloads/processes much faster
- Runs 1 epoch only: Training completes quickly
- Same code path: Tests your implementation without waiting
Example DEBUG workflow:
# 1. Write code in classification_model.py
vim src/classification_model.py
# 2. Test with DEBUG mode
python main.py --mode debug
# 3. See error? Fix it and test again
vim src/classification_model.py
python main.py --mode debug
# 4. Repeat until no errors!
FULL Mode - Your Submission Version
Only run this when your DEBUG mode works perfectly!
When to use FULL mode:
- Your code runs successfully in DEBUG mode
- All TODO sections are implemented
- You're ready to generate final results
- You want to test actual performance
What FULL mode does:
- Loads complete datasets: Full training data for quality results
- Runs multiple epochs: Proper convergence for good accuracy
- Generates submission results: Performance metrics for your report
Example FULL workflow:
# 1. Verify DEBUG mode works
python main.py --mode debug
# Output: All tasks complete without errors
# 2. Now run FULL mode
python main.py --mode full
# This will take longer but gives final results
# 3. Use these results for your analysis.pdf report
Recommended Development Workflow
# Week 1: Work on Task 1 (Classification)
# ------------------------------------------
# 1. Read Task 1 instructions carefully
# 2. Implement TODO sections in classification_model.py
# 3. Test frequently with DEBUG mode:
python main.py --mode debug # Run this 5-10 times while developing
# 4. When Task 1 works in DEBUG mode, move to Task 2
# Week 2: Work on Task 2 (Instruction)
# ------------------------------------------
# 1. Read Task 2 instructions carefully
# 2. Implement TODO sections in instruction_model.py
# 3. Test frequently with DEBUG mode:
python main.py --mode debug # Run this 5-10 times while developing
# Final Step: Run FULL mode once
# ------------------------------------------
# 1. Verify both tasks work in DEBUG mode
# 2. Run FULL mode for final results:
python main.py --mode full
# 3. Save outputs for your analysis.pdf
# 4. Submit!
Common Mistakes to Avoid
- DON'T run FULL mode repeatedly - it takes longer and you'll waste time
- DON'T skip DEBUG mode - you'll spend hours waiting for FULL mode to find errors
- DO use DEBUG mode for all development and debugging
- DO run FULL mode only once when everything works
Quick Start Guide
1. Environment Setup (REQUIRED: Use Conda)
IMPORTANT: Conda Environment Required
This assignment requires using Conda for environment management. Do NOT use venv or system Python.
# Extract the assignment files
unzip assignment3_starter.zip
cd assignment3_starter/
# REQUIRED: Create Conda environment (Python 3.9-3.11 recommended)
conda create -n assignment3_env python=3.10 -y
conda activate assignment3_env
# Install dependencies
pip install -r requirements.txt
# Verify installation
python -c "import torch; import transformers; print('Environment ready!')"
2. Run the Assignment
# Run both tasks sequentially
python main.py
Task 1: Classification Fine-Tuning (60 points)
Overview
Implement a manual classification head following Chapter 6 concepts. Building the head yourself demonstrates exactly what AutoModelForSequenceClassification would otherwise do for you behind the scenes.
What You Need to Implement
In src/classification_model.py, complete the GPT2Classifier class following Chapter 6, pages 207-211.
Key Components to Implement:
- Load pretrained base transformer (use AutoModel, not AutoModelForSequenceClassification)
- Add manual classification head layer
- Implement parameter freezing strategy
- Extract last token from hidden states for classification
- Calculate loss and return proper format
Critical Concept: Why Last Token?
Chapter 6 Core Concept (Figure 6.12):
GPT models use causal attention masks, meaning each token only attends to previous tokens. The last token has access to information from the entire sequence, making it the ideal choice for classification.
Using the first token (as BERT does with its [CLS] token) would be incorrect here, because in a GPT model the first token cannot see any of the tokens that follow it due to causal masking.
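Putting these ideas together, here is a minimal sketch of the kind of class Task 1 asks for. The class name matches the starter file, but the constructor arguments and return format shown here are assumptions; follow the TODOs and signatures in src/classification_model.py rather than copying this verbatim.
```python
import torch
import torch.nn as nn
from transformers import AutoModel

class GPT2Classifier(nn.Module):
    """Sketch of a manual classification head on top of DistilGPT-2 (Chapter 6)."""

    def __init__(self, num_labels=2, freeze_base=True, unfreeze_last_block=False):
        super().__init__()
        # 1. Base transformer only: hidden states, no task head (AutoModel, not ...ForSequenceClassification).
        self.base = AutoModel.from_pretrained('distilgpt2')
        hidden_size = self.base.config.hidden_size  # 768 for DistilGPT-2

        # 2. Manual classification head.
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

        # 3. Parameter freezing strategy.
        if freeze_base:
            for param in self.base.parameters():
                param.requires_grad = False
            if unfreeze_last_block:
                # Optionally unfreeze the last transformer block and the final layer norm.
                for param in self.base.h[-1].parameters():
                    param.requires_grad = True
                for param in self.base.ln_f.parameters():
                    param.requires_grad = True

    def forward(self, input_ids, attention_mask=None, labels=None):
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask)
        hidden_states = outputs.last_hidden_state            # (batch, seq_len, hidden)

        # 4. Last token extraction: with right padding, pick the last *real* token
        #    per sequence (the only position that has seen the whole review).
        if attention_mask is not None:
            last_idx = attention_mask.sum(dim=1) - 1          # index of last non-pad token
        else:
            last_idx = torch.full((input_ids.size(0),), input_ids.size(1) - 1,
                                  device=input_ids.device)
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        last_hidden = hidden_states[batch_idx, last_idx]      # (batch, hidden)

        # 5. Classify and (optionally) compute the loss.
        logits = self.classifier(self.dropout(last_hidden))
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {'loss': loss, 'logits': logits}
```
Returning a dict with loss and logits keys is one convention the Hugging Face Trainer understands; check what src/trainer_config.py actually expects from your model.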
Implementation Details
| Component | Points | Requirements |
|---|---|---|
| Base Model Loading | 10 | Use AutoModel (not AutoModelForSequenceClassification) |
| Classification Head | 15 | Manual nn.Linear(768, 2) with dropout |
| Parameter Freezing | 15 | Freeze base, optional unfreeze last block |
| Forward Pass | 15 | Last token extraction, loss calculation |
| Training Success | 15 | Achieve 75-85% test accuracy on 500 IMDb samples |
Dataset: IMDb Movie Reviews
| Split | Samples | Purpose |
|---|---|---|
| Train | 500 | Model training |
| Validation | 100 | Early stopping |
| Test | 100 | Final evaluation |
Expected Performance
DEBUG vs FULL Mode Results
DEBUG Mode (50 samples, 1 epoch):
- Test Accuracy: 35-50% (intentionally low)
- Purpose: Code verification only
- Training Time: ~2 minutes
FULL Mode (500 samples, 3 epochs):
- Test Accuracy: 75-85% (target range)
- Purpose: Final submission quality
- Training Time: ~8-10 min (GPU/MPS) or ~15 min (CPU)
Important: DEBUG mode results are NOT representative. Low accuracy in DEBUG mode is normal and expected.
Model Configuration
- Trainable Parameters: ~7M when the last transformer block is unfrozen; the nn.Linear(768, 2) classification head by itself adds only about 1.5K parameters
- Total Parameters: ~82M (DistilGPT-2); the snippet below shows one way to verify these counts
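If you want to check these counts yourself, a few lines of PyTorch are enough. The sketch below runs on the bare base model; you can call the same helper on your finished classifier to see the effect of freezing:
```python
from transformers import AutoModel

def count_parameters(model):
    """Print total vs. trainable parameter counts for a PyTorch module."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"total={total:,}  trainable={trainable:,}")

# Example on the bare base model (your GPT2Classifier adds the head and freezing on top):
base = AutoModel.from_pretrained('distilgpt2')
count_parameters(base)  # all ~82M parameters are trainable before any freezing
```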
Hardware Acceleration
- CUDA (NVIDIA GPU): Automatically detected and used
- MPS (Apple Silicon): Automatically detected on M-series Macs
- CPU: Fallback option (slower, but it works); a typical device-selection pattern is shown below
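The starter code most likely selects the device for you, but if you want to confirm what will be used, the standard PyTorch pattern looks like this:
```python
import torch

# Pick the fastest available device: CUDA GPU, then Apple Silicon MPS, then CPU.
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

print(f"Using device: {device}")
```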
Task 2: Instruction Fine-Tuning (40 points)
Overview
Implement instruction fine-tuning using the Alpaca prompt format (Chapter 7). This demonstrates the fundamental difference from classification: the original LM head is kept so the model can generate open-ended text.
What You Need to Implement
In src/instruction_model.py, complete the instruction fine-tuning functions following Chapter 7, pages 229-235.
Key Components to Implement:
- Implement Alpaca prompt formatting (system prompt + Instruction + Input + Response sections)
- Load model with language modeling head (use AutoModelForCausalLM, not AutoModel)
- Prepare dataset with proper tokenization
- Implement generation function for responses (a minimal sketch of the formatting and model setup appears after this list)
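Here is a minimal sketch of the formatting and model-setup components above. The Alpaca wording below follows the common convention; confirm the precise template the starter code and Chapter 7 expect before relying on it.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

ALPACA_SYSTEM_PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)

def format_alpaca_prompt(instruction, input_text=""):
    """Build an Alpaca-style prompt; the Input section is included only when present."""
    prompt = f"{ALPACA_SYSTEM_PROMPT}\n\n### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    prompt += "### Response:\n"
    return prompt

# Model setup: keep the original LM head so the model can still generate text.
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# For causal LM fine-tuning, the labels are simply the input token IDs
# (Hugging Face shifts them internally), so the model learns next-token
# prediction over the whole formatted prompt + response.
example = format_alpaca_prompt("Name the capital of France.") + "Paris."
batch = tokenizer(example, return_tensors='pt')
outputs = model(**batch, labels=batch['input_ids'])
print(outputs.loss)
```
Note the contrast with Task 1: the loss here is next-token prediction over every position in the formatted sequence, not cross-entropy on a single last-token representation.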
Critical Difference: Classification vs Instruction
| Aspect | Classification (Task 1) | Instruction (Task 2) |
|---|---|---|
| Model Head | Custom classifier (2 outputs) | Original LM head (50,257-token vocabulary) |
| Model Loading | AutoModel | AutoModelForCausalLM |
| Token Usage | Last token only | All tokens (next-token prediction) |
| Output Type | Fixed classes (pos/neg) | Open-ended generation |
| Training Objective | Cross-entropy on last token | Next-token prediction on all tokens |
Implementation Details
| Component | Points | Requirements |
|---|---|---|
| Alpaca Formatting | 8 | Correct template with Instruction/Input/Response |
| Model Setup | 7 | Use AutoModelForCausalLM, keep LM head |
| Training Success | 5 | Model generates coherent responses after fine-tuning |
Dataset: Instruction Examples
- Source: Alpaca-style instruction-response pairs
- Samples: 150 instruction examples
- Format: JSON records with instruction, input (optional), and output fields (see the example record below)
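Each record has the following shape (shown here as a Python dict with made-up values; the actual file stores the same fields as JSON):
```python
# Illustrative only: one instruction record (the values are invented, not from the dataset).
example_record = {
    "instruction": "Rewrite the sentence in the past tense.",
    "input": "She walks to school every day.",   # optional; may be an empty string
    "output": "She walked to school every day.",
}
```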
Expected Performance
DEBUG vs FULL Mode Results
DEBUG Mode (20 samples, 1 epoch):
- Generation Quality: May produce incoherent responses
- Purpose: Code verification only
- Training Time: ~1 minute
FULL Mode (150 samples, 2 epochs):
- Generation Quality: Coherent, on-topic responses
- Purpose: Final submission quality
- Training Time: ~5-8 min (GPU/MPS) or ~15-25 min (CPU)
Important: DEBUG mode generation quality will be poor due to limited training data. This is expected.
Submission Requirements
Files to Submit:
- src/classification_model.py - Completed manual classifier
- src/instruction_model.py - Completed instruction fine-tuning
Grading Rubric (100 points)
| Component | Points | Criteria |
|---|---|---|
| Task 1: Classification Fine-Tuning | 60 | Manual head implementation (20 pts), Parameter freezing (15 pts), Last token extraction (15 pts), Training success (10 pts) |
| Task 2: Instruction Fine-Tuning | 40 | Alpaca formatting (15 pts), Model setup (15 pts), Training success (10 pts) |
Academic Integrity
You may NOT:
- Share code with other students
- Use AI tools (ChatGPT, etc.) to generate code