Assignment 3: Classification and Instruction Fine-Tuning

CSC 375/575 - Generative AI | Fall 2025
Professor: Rongyu Lin


Overview

This assignment explores two fundamental fine-tuning paradigms for large language models: classification fine-tuning (Chapter 6) and instruction fine-tuning (Chapter 7). You will implement manual classification heads and instruction formatting following textbook concepts while using Hugging Face tools for efficiency.

Learning Objectives

By completing this assignment, you will:

  • Implement a classification head manually on top of a pre-trained transformer (Chapter 6)
  • Format instruction data in the Alpaca style and fine-tune a causal language model (Chapter 7)
  • Use Hugging Face models, tokenizers, the Trainer API, and the datasets library to keep the workflow efficient

Textbook Alignment

This assignment directly implements concepts from:

  • Chapter 6: Classification fine-tuning (pages 207-211, including the last-token selection idea of Figure 6.12)
  • Chapter 7: Instruction fine-tuning with the Alpaca prompt format (pages 229-235)

Understanding Hugging Face Transformers

Important: Hugging Face Library Introduction

We have NOT covered Hugging Face in lectures, but we're using it in this assignment because:

  • It lets you download pre-trained models (like DistilGPT-2) instead of training one from scratch
  • Its tokenizer, Trainer, and datasets tools remove boilerplate so you can focus on the fine-tuning concepts from Chapters 6 and 7

What is Hugging Face?

Think of Hugging Face as a library of pre-trained AI models that you can download and use. Instead of training a model from scratch (which takes weeks and costs thousands of dollars), you can:

  1. Download a pre-trained model (like DistilGPT-2)
  2. Fine-tune it on your specific task (what this assignment is about)
  3. Use it for your application

Analogy: It's like downloading a pre-trained athlete. Instead of training someone from birth, you get an athlete who already knows how to run, and you just teach them your specific sport.

Key Concept: What is a "Model"?

In this assignment, a model is a neural network that has been trained on massive amounts of text. Think of it as a student who has:

  • Already read an enormous amount of text and absorbed grammar, facts, and common patterns of language
  • Not yet been trained on your specific task (that is what fine-tuning adds)

DistilGPT-2 is the specific model we're using. It's a smaller, faster version of GPT-2.
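If you want to see the size difference for yourself, the short sketch below (not part of the assignment) loads both models and counts their parameters; DistilGPT-2 uses 6 transformer blocks instead of GPT-2's 12.

import torch
from transformers import AutoModel

# Compare the number of trainable parameters in DistilGPT-2 vs GPT-2
for name in ['distilgpt2', 'gpt2']:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")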

Understanding Different Model Types

Hugging Face offers the same base model (DistilGPT-2) in three different "configurations". Each configuration is designed for a different task:

1. AutoModel (Base Configuration)

What it is: The core transformer without any task-specific layer on top.

What it gives you: Just the hidden representations (embeddings) of your text.

When to use: When you want to ADD YOUR OWN custom layer for a specific task (Task 1).

from transformers import AutoModel

# Load the base model
model = AutoModel.from_pretrained('distilgpt2')

# This model gives you ONLY hidden states
# No classification, no text generation - just embeddings
# You need to add your own layers on top

Analogy: Like a car engine without a body. You get the core power, but you need to build the rest yourself.
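To make "just embeddings" concrete, here is a small sketch (using the tokenizer introduced later in this section) that inspects what AutoModel actually returns:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModel.from_pretrained('distilgpt2')

inputs = tokenizer("This movie was great!", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per input token - no predictions yet
print(outputs.last_hidden_state.shape)  # torch.Size([1, num_tokens, 768])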

2. AutoModelForCausalLM (Text Generation Configuration)

What it is: Base model + a "language modeling head" that predicts the next word.

What it gives you: Ability to generate text word-by-word.

When to use: For text generation tasks like chatbots, instruction following (Task 2).

from transformers import AutoModelForCausalLM

# Load model with text generation capability
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

# This model can GENERATE text
# It predicts: "given these words, what comes next?"
# Perfect for chatbots and instruction following

Analogy: Like a complete car that's ready to drive. It can do its job (generate text) right away.
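As a quick illustration (a sketch, not assignment code), the same model can produce a continuation with its generate() method:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

inputs = tokenizer("The movie was", return_tensors='pt')
# Predict up to 20 new tokens, one at a time
output_ids = model.generate(**inputs, max_new_tokens=20,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0]))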

3. AutoModelForSequenceClassification (Pre-built Classification)

What it is: Base model + a built-in classification layer.

What it gives you: Ability to classify text into categories.

When to use: For classification when you DON'T want to implement the head yourself (we do NOT use this in the assignment, because the goal is to learn the manual approach).

from transformers import AutoModelForSequenceClassification

# Load model with built-in classification head
model = AutoModelForSequenceClassification.from_pretrained('distilgpt2', num_labels=2)

# This model can classify text into 2 categories
# BUT we won't use this because we want to implement manually!

Analogy: Like buying a pre-built race car. It works, but you don't learn how to build cars.

Why Three Different Types?

Model Type | Use Case | This Assignment
AutoModel | Custom tasks where you build your own head | Task 1: We add our own classification layer
AutoModelForCausalLM | Text generation (chatbots, completion) | Task 2: Instruction fine-tuning
AutoModelForSequenceClassification | Quick classification without manual implementation | NOT USED (we want to learn manually)

Other Key Hugging Face Concepts

Tokenizer - Converting Text to Numbers

What it does: Neural networks can't read text directly. A tokenizer converts text into numbers that models can process.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

# Convert text to numbers
text = "Hello world"
tokens = tokenizer(text, return_tensors='pt')

# Result: {'input_ids': tensor([[15496, 995]]), ...}
# "Hello" -> 15496, " world" -> 995
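One practical detail you will likely run into: GPT-2's tokenizer has no padding token, so batching several reviews of different lengths requires assigning one first. A common convention (assumed in this sketch) is to reuse the end-of-text token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilgpt2')

# GPT-2 has no pad token by default; reuse end-of-text so batching works
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["Great film!", "A long, boring, and predictable plot."],
    padding=True,       # pad shorter sequences to the longest in the batch
    truncation=True,
    return_tensors='pt',
)
print(batch['input_ids'].shape)    # (2, longest_sequence_length)
print(batch['attention_mask'][0])  # 1 = real token, 0 = padding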

Trainer - Simplified Training

What it does: Instead of writing complex training loops yourself, Trainer handles all the details.

from transformers import Trainer, TrainingArguments

# Configure training
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
)

# Train with one line!
trainer = Trainer(model=model, args=training_args, train_dataset=dataset)
trainer.train()
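Since Task 1 is graded on test accuracy, you may also want the Trainer to report accuracy during evaluation. A minimal sketch of a compute_metrics function (hypothetical names; adapt it to your own setup) looks like this:

import numpy as np

def compute_metrics(eval_pred):
    # The Trainer passes an EvalPrediction holding model outputs and gold labels
    logits = eval_pred.predictions
    labels = eval_pred.label_ids
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Hooked into training roughly like this (names are illustrative):
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=train_ds, eval_dataset=val_ds,
#                   compute_metrics=compute_metrics)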

Datasets Library - Easy Data Loading

What it does: Download popular datasets with one line of code.

from datasets import load_dataset

# Download IMDb movie reviews
dataset = load_dataset('imdb')

# That's it! Dataset is ready to use
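DEBUG mode works with tiny slices of the data, which the datasets library makes easy. A sketch of how such a subset can be built (the starter's utils.py presumably handles this for you):

from datasets import load_dataset

dataset = load_dataset('imdb')

# Shuffle with a fixed seed, then keep a small slice for quick DEBUG runs
small_train = dataset['train'].shuffle(seed=42).select(range(50))
small_test = dataset['test'].shuffle(seed=42).select(range(20))
print(small_train[0]['text'][:100], small_train[0]['label'])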

Summary: What You Need to Know

  1. Hugging Face = Library of pre-trained models
    • Download models instead of training from scratch
    • Fine-tune them for your specific task
  2. Three model types for different purposes
    • AutoModel: Base transformer, add your own layers
    • AutoModelForCausalLM: Text generation ready
    • AutoModelForSequenceClassification: Classification ready (we don't use)
  3. Supporting tools
    • Tokenizer: Text to numbers
    • Trainer: Easy training
    • Datasets: Easy data loading

For this assignment:

  • Task 1 loads the base model with AutoModel and adds a classification head you implement yourself
  • Task 2 loads the model with AutoModelForCausalLM and keeps the original language modeling head
  • AutoModelForSequenceClassification is NOT used anywhere

Assignment Structure

assignment3_starter/
├── src/
│   ├── classification_model.py   # Manual GPT2Classifier (Task 1)
│   ├── instruction_model.py      # Instruction fine-tuning (Task 2)
│   ├── utils.py                  # Data loading and evaluation
│   └── trainer_config.py         # Hugging Face Trainer setup
├── main.py                       # Run both tasks
├── requirements.txt
└── README.md

Running Modes: DEBUG vs FULL

Key Concept: Two-Stage Development

This assignment supports two running modes to keep development fast: DEBUG for quick iteration on tiny datasets, and FULL for the complete training runs you report.

Always start with DEBUG mode! Only run FULL mode when your code works correctly.

Mode Comparison Table

Aspect | DEBUG Mode | FULL Mode
Command | python main.py --mode debug | python main.py --mode full
Task 1 Dataset | 50 train / 20 val / 20 test | 500 train / 100 val / 100 test
Task 2 Dataset | 20 instruction samples | 150 instruction samples
Task 1 Epochs | 1 epoch | 3 epochs
Task 2 Epochs | 1 epoch | 2 epochs
Purpose | Code verification, debugging | Complete training for submission
When to Use | During development, after each change | Once, when code is ready

Detailed Mode Explanations

DEBUG Mode - Your Development Best Friend

Use this mode 90% of the time during development!

When to use DEBUG mode:

  • During development, after every change to your code
  • Whenever you hit an error and want to re-test quickly

What DEBUG mode does:

  • Uses tiny datasets (50/20/20 for Task 1, 20 instruction samples for Task 2) and a single epoch
  • Runs quickly, so errors surface fast
  • Produces results that are NOT representative of final performance

Example DEBUG workflow:
# 1. Write code in classification_model.py
vim src/classification_model.py

# 2. Test with DEBUG mode
python main.py --mode debug

# 3. See error? Fix it and test again
vim src/classification_model.py
python main.py --mode debug

# 4. Repeat until no errors!

FULL Mode - Your Submission Version

Only run this when your DEBUG mode works perfectly!

When to use FULL mode:

  • Once, after both tasks run cleanly in DEBUG mode
  • To produce the final numbers and generations for your analysis.pdf

What FULL mode does:

  • Uses the complete assignment datasets (500/100/100 for Task 1, 150 instruction samples for Task 2)
  • Trains for the full number of epochs (3 for Task 1, 2 for Task 2)
  • Takes noticeably longer than DEBUG mode

Example FULL workflow:
# 1. Verify DEBUG mode works
python main.py --mode debug
# Output: All tasks complete without errors

# 2. Now run FULL mode
python main.py --mode full
# This will take longer but gives final results

# 3. Use these results for your analysis.pdf report

Recommended Development Workflow

# Week 1: Work on Task 1 (Classification)
# ------------------------------------------
# 1. Read Task 1 instructions carefully
# 2. Implement TODO sections in classification_model.py
# 3. Test frequently with DEBUG mode:
python main.py --mode debug   # Run this 5-10 times while developing
# 4. When Task 1 works in DEBUG mode, move to Task 2

# Week 2: Work on Task 2 (Instruction)
# ------------------------------------------
# 1. Read Task 2 instructions carefully
# 2. Implement TODO sections in instruction_model.py
# 3. Test frequently with DEBUG mode:
python main.py --mode debug   # Run this 5-10 times while developing

# Final Step: Run FULL mode once
# ------------------------------------------
# 1. Verify both tasks work in DEBUG mode
# 2. Run FULL mode for final results:
python main.py --mode full
# 3. Save outputs for your analysis.pdf
# 4. Submit!

Common Mistakes to Avoid

  • Running FULL mode while you are still debugging - you wait much longer just to hit the same error
  • Reporting DEBUG-mode numbers in your analysis.pdf - they are not representative
  • Forgetting to activate the assignment3_env Conda environment before running main.py

Quick Start Guide

1. Environment Setup (REQUIRED: Use Conda)

IMPORTANT: Conda Environment Required

This assignment requires using Conda for environment management. Do NOT use venv or system Python.

# Extract the assignment files
unzip assignment3_starter.zip
cd assignment3_starter/

# REQUIRED: Create Conda environment (Python 3.9-3.11 recommended)
conda create -n assignment3_env python=3.10 -y
conda activate assignment3_env

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "import torch; import transformers; print('Environment ready!')"

2. Run the Assignment

# Run both tasks sequentially
python main.py

# Or pick a mode explicitly (recommended: start with DEBUG, see "Running Modes" above)
python main.py --mode debug

Task 1: Classification Fine-Tuning (60 points)

Overview

Implement a manual classification head following Chapter 6 concepts. Doing this by hand shows exactly what AutoModelForSequenceClassification would otherwise do for you.

What You Need to Implement

In src/classification_model.py, complete the GPT2Classifier class following Chapter 6, pages 207-211.

Key Components to Implement:

  • Load the base model with AutoModel (no built-in head)
  • Add a manual classification head: nn.Linear(768, 2) with dropout
  • Freeze the pre-trained parameters (optionally unfreeze the last transformer block)
  • Write the forward pass: extract the last token's hidden state and compute the loss

Critical Concept: Why Last Token?

Chapter 6 Core Concept (Figure 6.12):

GPT models use causal attention masks, meaning each token only attends to previous tokens. The last token has access to information from the entire sequence, making it the ideal choice for classification.

Using the first token (as BERT does with its [CLS] token) would be incorrect here: because of causal masking, the first token in a GPT model cannot see any of the tokens that follow it.
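To tie the pieces together, here is a minimal sketch of the structure such a classifier can take. The class name and details are illustrative, not the required solution; follow the TODOs in classification_model.py.

import torch
import torch.nn as nn
from transformers import AutoModel

class GPT2ClassifierSketch(nn.Module):
    def __init__(self, num_labels=2, dropout=0.1):
        super().__init__()
        self.base = AutoModel.from_pretrained('distilgpt2')   # base model, no head

        # Freeze the pre-trained weights (optionally unfreeze the last block)
        for param in self.base.parameters():
            param.requires_grad = False

        # Manual classification head on top of the 768-dim hidden states
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.base(input_ids, attention_mask=attention_mask).last_hidden_state

        # Last *real* token per sequence (ignoring padding): with causal
        # attention, only this position has seen the entire review.
        last_idx = attention_mask.sum(dim=1) - 1                  # (batch,)
        last_hidden = hidden[torch.arange(hidden.size(0)), last_idx]

        logits = self.classifier(self.dropout(last_hidden))
        loss = None
        if labels is not None:
            loss = nn.functional.cross_entropy(logits, labels)
        return {'loss': loss, 'logits': logits}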

Implementation Details

Component | Points | Requirements
Base Model Loading | 10 | Use AutoModel (not AutoModelForSequenceClassification)
Classification Head | 15 | Manual nn.Linear(768, 2) with dropout
Parameter Freezing | 15 | Freeze the base model, optionally unfreeze the last block
Forward Pass | 15 | Last token extraction, loss calculation
Training Success | 15 | Achieve 75-85% accuracy on the IMDb test set (after training on 500 samples)

Dataset: IMDb Movie Reviews

Split | Samples | Purpose
Train | 500 | Model training
Validation | 100 | Early stopping
Test | 100 | Final evaluation
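IMDb ships with only train and test splits, so a validation set has to be carved out. One way this can be done (utils.py may do it differently; the sizes match the FULL-mode table above):

from datasets import load_dataset

imdb = load_dataset('imdb')

# Carve 500 train / 100 validation reviews out of the official train split
split = imdb['train'].shuffle(seed=42).select(range(600)).train_test_split(
    test_size=100, seed=42)
train_ds = split['train']                                    # 500 reviews
val_ds = split['test']                                       # 100 reviews
test_ds = imdb['test'].shuffle(seed=42).select(range(100))   # 100 reviews
print(len(train_ds), len(val_ds), len(test_ds))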

Expected Performance

DEBUG vs FULL Mode Results

DEBUG Mode (50 samples, 1 epoch):

  • Expect low, possibly near-chance accuracy; this run only verifies that your code executes end to end

FULL Mode (500 samples, 3 epochs):

  • Expect roughly 75-85% test accuracy, the target used for grading

Important: DEBUG mode results are NOT representative. Low accuracy in DEBUG mode is normal and expected.

Model Configuration

Base model: DistilGPT-2 loaded with AutoModel (768-dimensional hidden states). On top of it you add a manual nn.Linear(768, 2) classification head with dropout. The pre-trained weights are frozen; the last transformer block may optionally be unfrozen.

Hardware Acceleration
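Training these small models is feasible on CPU, but an NVIDIA GPU or Apple Silicon speeds things up considerably. A typical device-selection snippet (a sketch; the starter code may already handle this) looks like:

import torch

# Pick the fastest available device: NVIDIA GPU, Apple Silicon, or CPU
if torch.cuda.is_available():
    device = torch.device('cuda')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')

print(f"Using device: {device}")
# Then move the model (and each batch) onto it, e.g. model.to(device)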

Task 2: Instruction Fine-Tuning (40 points)

Overview

Implement instruction fine-tuning using the Alpaca prompt format (Chapter 7). The fundamental difference from classification is that the original language modeling head is kept, so the model can generate text.

What You Need to Implement

In src/instruction_model.py, complete the instruction fine-tuning functions following Chapter 7, pages 229-235.

Key Components to Implement:

  • An Alpaca-style formatting function that builds prompts from Instruction/Input/Response fields
  • Model setup with AutoModelForCausalLM, keeping the original LM head
  • Fine-tuning on the formatted prompts and generating responses for evaluation

Critical Difference: Classification vs Instruction

Aspect | Classification (Task 1) | Instruction (Task 2)
Model Head | Custom classifier (2 outputs) | Original LM head (50,257-token vocabulary)
Model Loading | AutoModel | AutoModelForCausalLM
Token Usage | Last token only | All tokens (next-token prediction)
Output Type | Fixed classes (pos/neg) | Open-ended generation
Training Objective | Cross-entropy on last token | Next-token prediction on all tokens
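For reference, the Alpaca-style prompt wraps each example in labeled Instruction / Input / Response sections. Below is a sketch of a formatting function; the field names ('instruction', 'input', 'output') and exact preamble wording are assumptions, so adapt them to the dataset and template used in instruction_model.py.

def format_alpaca(example):
    """Format one instruction example into an Alpaca-style prompt."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{example['instruction']}\n\n"
    )
    # The Input section is only included when the example provides one
    if example.get('input'):
        prompt += f"### Input:\n{example['input']}\n\n"
    prompt += f"### Response:\n{example['output']}"
    return prompt

example = {
    'instruction': 'Rewrite the sentence in the past tense.',
    'input': 'He runs every morning.',
    'output': 'He ran every morning.',
}
print(format_alpaca(example))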

Implementation Details

Component | Points | Requirements
Alpaca Formatting | 8 | Correct template with Instruction/Input/Response
Model Setup | 7 | Use AutoModelForCausalLM, keep the LM head
Training Success | 5 | Model generates coherent responses after fine-tuning

Dataset: Instruction Examples

A small instruction-following dataset: 150 examples in FULL mode (20 in DEBUG mode). Each example contains an instruction, an optional input, and the expected response.

Expected Performance

DEBUG vs FULL Mode Results

DEBUG Mode (20 samples, 1 epoch):

  • Generated responses will be rough or incoherent; the run only confirms that your formatting and training code work

FULL Mode (150 samples, 2 epochs):

  • The model should produce coherent responses that follow the Alpaca prompt structure

Important: DEBUG mode generation quality will be poor due to limited training data. This is expected.

Submission Requirements

Files to Submit:

  1. src/classification_model.py - Completed manual classifier
  2. src/instruction_model.py - Completed instruction fine-tuning
  3. analysis.pdf - Report with your FULL-mode results (referenced in the workflow above)

Grading Rubric (100 points)

Component | Points | Criteria
Task 1: Classification Fine-Tuning | 60 | Manual head implementation (20 pts), Parameter freezing (15 pts), Last token extraction (15 pts), Training success (10 pts)
Task 2: Instruction Fine-Tuning | 40 | Alpaca formatting (15 pts), Model setup (15 pts), Training success (10 pts)

Academic Integrity

You may NOT:

  • Copy code from classmates, previous semesters, or online solutions
  • Use AutoModelForSequenceClassification (or any other pre-built classification head) for Task 1 - the head must be implemented manually

