# Lecture 16: pandas Fundamentals Part 1 - Series
## The Foundation of Data Analysis

**Course:** INF 605 - Introduction to Programming - Python  
**Instructor:** Prof. Rongyu Lin  
**Institution:** Quinnipiac University

**Learning Objectives:**
- Explain the role of pandas in Python's data science ecosystem
- Create pandas Series from various data sources with custom indices
- Access Series elements using multiple methods
- Apply statistical methods for data analysis
- Work with string Series using the .str accessor
- Perform Series arithmetic with automatic index alignment
- Build real-world data analysis applications

## Setup and Imports

pandas is Python's fundamental library for data analysis and manipulation. Think of pandas as the "Excel of Python" - it provides powerful tools for organizing, analyzing, and transforming data. The name "pandas" is derived from "panel data" (an econometrics term for multi-dimensional datasets) and is also a play on "Python data analysis".

The pandas library is built on top of NumPy (which we learned in Lectures 12-13), so it inherits NumPy's speed and efficiency while adding convenient data analysis features. Data scientists, analysts, and machine learning engineers rely on pandas heavily because it makes working with labeled data intuitive and powerful.

The standard convention is to import pandas with the alias `pd`, which you'll see throughout professional code, the official documentation, and most tutorials. This consistency makes code immediately recognizable and easier to read.

In [None]:
# Standard pandas import - used universally in data science
import pandas as pd
import numpy as np  # We'll use NumPy for comparisons

# Display pandas version for reference
print(f"pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Part 1: Why pandas Series? The Labeled Data Problem

Think about how you organize information in real life. When you make a shopping list, you don't just write numbers - you write item names with quantities. When a teacher records grades, they write student names alongside scores. This pairing of labels with values is how humans naturally think about data.

NumPy arrays, which we mastered in Lectures 12-13, are incredibly fast for numerical operations but force you to use integer positions to access elements. If Alice's grade is at position 3, you must remember "position 3 equals Alice" every time you work with the data. This works for small datasets but becomes error-prone and confusing with hundreds or thousands of entries.

pandas Series solve this fundamental problem by combining NumPy's computational speed with meaningful labels. A Series is essentially a NumPy array with a custom index that can use any labels you want - names, dates, product codes, or anything meaningful to your data. This makes your code readable, self-documenting, and less prone to indexing errors.

In [None]:
# NumPy array: fast but uses only integer positions
grades_array = np.array([87, 92, 88, 95])
print("NumPy array (unlabeled):")
print(grades_array)
print(f"\nAlice's grade (position 0): {grades_array[0]}")
print("Problem: You must remember that position 0 is Alice!")

**What Happened:** The NumPy array stores the data efficiently, but there's no way to know which grade belongs to which student. You're forced to track the mapping between positions and student names separately, which is error-prone and makes code hard to read. Imagine managing grades for 100 students - you'd need to remember that position 0 is Alice, position 1 is Bob, and so on, or maintain a separate list of names. This mental overhead quickly becomes overwhelming and leads to mistakes. NumPy's position-based indexing works well for mathematical operations, but for real-world data where labels matter, you need something better.

In [None]:
# pandas Series: labeled data with meaningful indices
grades_series = pd.Series([87, 92, 88, 95],
                          index=['Alice', 'Bob', 'Charlie', 'Diana'])
print("pandas Series (labeled):")
print(grades_series)
print(f"\nAlice's grade: {grades_series['Alice']}")
print("Solution: Self-documenting, readable, intuitive!")

**What Happened:** The Series displays as two columns - the left column shows indices (student names), and the right column shows values (grades). Notice how much more readable this is. You can access Alice's grade by name, making the code self-explanatory and preventing indexing errors.
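
The two columns can also be inspected separately. The following is a minimal sketch, reusing the grades from the cell above, of how the labels and the underlying NumPy values might be pulled apart with the standard `.index` attribute and `.to_numpy()` method:

```python
import pandas as pd

grades_series = pd.Series([87, 92, 88, 95],
                          index=['Alice', 'Bob', 'Charlie', 'Diana'])

# The labels live in the index...
print(grades_series.index.tolist())   # ['Alice', 'Bob', 'Charlie', 'Diana']

# ...and the values are available as a plain NumPy array
values = grades_series.to_numpy()
print(type(values).__name__)          # ndarray
print(values)
```

Seeing the pieces side by side makes the "NumPy array plus labels" description concrete.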

## Part 2: Creating pandas Series

pandas provides multiple ways to create Series, each suited for different data sources and situations. Just like you choose different containers for different purposes (a lunchbox for food, a toolbox for tools), you'll choose different Series creation methods based on where your data comes from.

Understanding these creation methods is essential because real-world data comes in many forms: sometimes you have lists from calculations, sometimes dictionaries from JSON files, sometimes you need to initialize with default values. pandas makes all of these scenarios straightforward.

In [None]:
# Create Series from a Python list (simplest method)
test_scores = [85, 92, 78, 95, 88]
scores_series = pd.Series(test_scores)

print("Series from list (default integer index):")
print(scores_series)
print(f"\nData type: {scores_series.dtype}")
print(f"Type: {type(scores_series)}")

**What Happened:** When you create a Series from a list without specifying an index, pandas automatically assigns integer indices starting from 0. The `dtype: int64` tells you the data type - pandas inferred that these are integers. This looks similar to a NumPy array but with the Series structure.

In [None]:
# Create Series with custom indices (the real power!)
grades = pd.Series([87, 92, 88, 95],
                   index=['Alice', 'Bob', 'Charlie', 'Diana'])

print("Series with custom string indices:")
print(grades)
print(f"\nAccess by label: grades['Alice'] = {grades['Alice']}")

**What Happened:** Now the indices are student names instead of integers. This makes the data self-describing - you can immediately see that Alice scored 87, Bob scored 92, and so on. Accessing data by meaningful labels makes your code much more readable than using numeric positions.

In [None]:
# Create Series from a dictionary (keys become indices)
city_temps = {
    'New York': 68,
    'Los Angeles': 75,
    'Chicago': 62,
    'Houston': 80,
    'Phoenix': 85
}
temps_series = pd.Series(city_temps)
print("Series from dictionary:")
print(temps_series)
print(f"\nHouston temperature: {temps_series['Houston']}F")

**What Happened:** When you create a Series from a dictionary, the keys become the Series index and the values become the Series data. The key-value structure of a dictionary maps directly onto the index-value structure of a Series, so this approach is a natural fit whenever your data already has labels or identifiers. Here the city names are the index, so you can look up a temperature by city name directly, with no separate position tracking.
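
A dictionary can also be combined with an explicit `index=` argument. The sketch below is illustrative only: it uses a shortened version of the city data, and 'Boston' is a made-up entry that is deliberately absent from the dictionary, to show how pandas selects and orders keys and what it does with a label it cannot find:

```python
import pandas as pd

city_temps = {'New York': 68, 'Los Angeles': 75, 'Chicago': 62}

# index= picks which keys to keep and in what order.
# 'Boston' is not a key in the dictionary, so its value is NaN.
selected = pd.Series(city_temps, index=['Chicago', 'New York', 'Boston'])
print(selected)
print(f"\nData type: {selected.dtype}")  # float64, because NaN is a float
```

Note that 'Los Angeles' is dropped because it was not requested in the index.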

In [None]:
# Create Series from a scalar (repeated value)
default_score = pd.Series(70, index=['Alice', 'Bob', 'Charlie', 'Diana'])
print("Series from scalar (initialization):")
print(default_score)

# Useful for creating baseline values
baseline = pd.Series(0, index=range(5))
print("\nZero baseline series:")
print(baseline)

**What Happened:** When you provide a scalar value (like 70 or 0), pandas broadcasts it to all indices. This is useful for initialization or creating template Series with default values.

### Exercise 1: Create Series from Different Sources

Create three different Series:
1. A Series of monthly sales for Jan, Feb, Mar with values [1200, 1350, 1180]
2. A Series from a dictionary of product prices: Widget ($25), Gadget ($40), Tool ($15)
3. A Series with value 100 for 5 students: Amy, Ben, Cal, Dan, Eve

In [None]:
# Your code here - try creating all three Series


In [None]:
# Solution
# 1. Monthly sales Series
sales = pd.Series([1200, 1350, 1180],
                  index=['Jan', 'Feb', 'Mar'])
print("Monthly Sales:")
print(sales)

# 2. Product prices from dictionary
prices_dict = {'Widget': 25, 'Gadget': 40, 'Tool': 15}
prices = pd.Series(prices_dict)
print("\nProduct Prices:")
print(prices)

# 3. Default score for students
students = ['Amy', 'Ben', 'Cal', 'Dan', 'Eve']
default_scores = pd.Series(100, index=students)
print("\nDefault Scores (Perfect 100):")
print(default_scores)

## Part 3: Accessing Series Elements

pandas Series support multiple ways to access elements, giving you flexibility for different situations. You can use integer positions (like lists and NumPy arrays) or custom labels (the pandas superpower). Understanding when to use each method is crucial for writing clear, correct code.

The key insight is that pandas provides explicit accessors - `.loc[]` for label-based access and `.iloc[]` for position-based access - to eliminate ambiguity. While you can sometimes use simple bracket notation `[]`, using the explicit accessors makes your intent crystal clear and prevents subtle bugs.

In [None]:
# Create a sample Series for demonstration
grades = pd.Series([87, 92, 88, 95, 85],
                   index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'])

print("Grade Series:")
print(grades)

# Access by label using square brackets
alice_grade = grades['Alice']
print(f"\nAlice's grade: {alice_grade}")

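# Access by position: use .iloc[] (plain grades[0] relies on a deprecated fallback)
first_grade = grades.iloc[0]
print(f"First grade (position 0): {first_grade}")

**What Happened:** Plain brackets look up labels, so `grades['Alice']` works as expected. Older pandas versions also let `grades[0]` fall back to position-based access when the index holds strings, but that fallback is deprecated in recent releases (it emits a `FutureWarning` in pandas 2.x), and it is ambiguous - what if your index labels are also integers? The explicit `.loc[]` and `.iloc[]` accessors solve this problem.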
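
To make the integer-label ambiguity concrete, here is a small sketch. The student ID numbers are hypothetical values chosen for illustration; the point is only that with an integer index, plain brackets treat the number as a label, not a position:

```python
import pandas as pd

# Hypothetical example: student ID numbers used as integer labels
by_id = pd.Series([87, 92, 88], index=[101, 102, 103])

print(by_id.loc[101])   # label 101 -> 87
print(by_id.iloc[0])    # position 0 -> 87

# Plain brackets look for the LABEL 0, which does not exist
try:
    by_id[0]
    lookup_failed = False
except KeyError:
    lookup_failed = True
    print("by_id[0] raises KeyError: there is no label 0")
```

Using `.loc[]` and `.iloc[]` everywhere avoids having to remember which rule plain brackets apply.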

In [None]:
# Use .loc[] for explicit label-based access
print("Using .loc[] (label-based):")
print(f"grades.loc['Alice'] = {grades.loc['Alice']}")
print(f"grades.loc['Diana'] = {grades.loc['Diana']}")

# Use .iloc[] for explicit position-based access  
print("\nUsing .iloc[] (position-based):")
print(f"grades.iloc[0] = {grades.iloc[0]}")
print(f"grades.iloc[-1] = {grades.iloc[-1]}")

**What Happened:** `.loc[]` uses labels explicitly - `grades.loc['Alice']` clearly means "get the element labeled Alice". `.iloc[]` uses positions explicitly - `grades.iloc[0]` means "get the element at position 0". Negative indices work with `.iloc[]` just like in lists.

In [None]:
# Slicing with .iloc[] (position-based, excludes endpoint)
print("Position slicing with .iloc[]:")
first_three = grades.iloc[0:3]
print(first_three)

# Slicing with .loc[] (label-based, INCLUDES endpoint!)
print("\nLabel slicing with .loc[]:")
alice_to_charlie = grades.loc['Alice':'Charlie']
print(alice_to_charlie)

**Critical Difference:** Notice that `.iloc[0:3]` excludes position 3 (standard Python slicing behavior), but `.loc['Alice':'Charlie']` INCLUDES 'Charlie' (both endpoints included). This is a crucial distinction to remember when working with pandas.

### Exercise 2: Accessing Series Elements

Given a Series of temperatures for days of the week, access:
1. The temperature for Wednesday using .loc[]
2. The last temperature using .iloc[]
3. All temperatures from Tuesday through Thursday using .loc[]

In [None]:
# Given data
temps = pd.Series([72, 68, 75, 71, 69, 76, 73],
                  index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Your code here


In [None]:
# Solution
# 1. Wednesday temperature using .loc[]
wed_temp = temps.loc['Wed']
print(f"Wednesday temperature: {wed_temp}F")

# 2. Last temperature using .iloc[]
last_temp = temps.iloc[-1]
print(f"Last temperature (Sunday): {last_temp}F")

# 3. Tuesday through Thursday using .loc[]
midweek = temps.loc['Tue':'Thu']
print("\nTuesday through Thursday:")
print(midweek)

## Part 4: Statistical Methods

Remember the statistical methods you learned for NumPy arrays in Lecture 13? Great news: pandas Series provide the same methods! They are called the same way and return the same results in almost every case (the main exception is `std()`, whose default differs from NumPy's), and pandas adds a crucial advantage - methods preserve and work with your meaningful index labels.

In this section, we'll see how these familiar methods (mean, std, min, max) work on labeled data, and we'll discover pandas-specific features like idxmin() and idxmax() that return index labels instead of positions, making analysis results immediately interpretable.

In [None]:
# Create test scores with named indices
test_scores = pd.Series([85, 92, 78, 90, 88, 95, 82, 87, 91, 86],
                        index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve',
                               'Frank', 'Grace', 'Henry', 'Iris', 'Jack'])

print("Test Scores:")
print(test_scores)

# All your familiar statistical methods from Lecture 13!
print("\nStatistical Methods (from Lecture 13):")
print(f"Mean (average): {test_scores.mean():.2f}")
print(f"Median: {test_scores.median():.2f}")
print(f"Standard deviation: {test_scores.std():.2f}")
print(f"Range: {test_scores.max() - test_scores.min()}")

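**What Happened:** These are the same statistical methods you learned in Lecture 13 for NumPy arrays - mean(), median(), std(), min(), max(). They are called the same way on a pandas Series, with one difference in defaults: pandas' std() computes the sample standard deviation (ddof=1), while NumPy's computes the population standard deviation (ddof=0), so the two can return slightly different numbers for the same data. pandas also preserves your meaningful indices (student names), making results more interpretable than position-based arrays.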
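
One default is worth checking when you compare pandas results against NumPy: the `ddof` (delta degrees of freedom) setting of `std()`. The sketch below uses a short, made-up list of scores and shows that the two libraries agree once `ddof` is passed explicitly:

```python
import numpy as np
import pandas as pd

scores = [85, 92, 78, 90, 88]  # small illustrative sample
arr = np.array(scores)
ser = pd.Series(scores)

# Different defaults: NumPy divides by n, pandas divides by n - 1
print(f"NumPy std  (default ddof=0): {arr.std():.3f}")
print(f"pandas std (default ddof=1): {ser.std():.3f}")

# Passing ddof explicitly makes the two libraries agree
print(np.isclose(arr.std(ddof=1), ser.std()))   # True
print(np.isclose(arr.std(), ser.std(ddof=0)))   # True
```

If a homework answer differs from a classmate's in the second decimal place, this default is a likely cause.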

In [None]:
# pandas SPECIAL FEATURE: Index-returning methods!
# NumPy's argmax()/argmin() return positions
# pandas' idxmax()/idxmin() return LABELS - much more useful!

print("Finding Extremes - The pandas Way:")
print(f"Highest score: {test_scores.max()}")
print(f"Student with highest score: {test_scores.idxmax()}")  # Returns 'Frank'!

print(f"\nLowest score: {test_scores.min()}")
print(f"Student with lowest score: {test_scores.idxmin()}")  # Returns 'Charlie'!

# Compare: NumPy way would give position 5 and 2
# pandas way gives 'Frank' and 'Charlie' directly!

**pandas Advantage in Action:** This is where pandas truly shines! While NumPy's argmax() and argmin() return integer positions (requiring you to look up names separately), pandas' idxmax() and idxmin() return the actual index labels. You get "Frank scored highest" instead of "position 5 scored highest, let me check who that is..." This makes your analysis immediately interpretable!
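
For comparison, the position-based route is still available on a Series through `argmax()`. This is a minimal sketch on a shortened version of the score data, showing the extra lookup step that `idxmax()` saves you:

```python
import pandas as pd

test_scores = pd.Series([85, 92, 78, 95],
                        index=['Alice', 'Bob', 'Charlie', 'Frank'])

# Position-based: argmax() gives an integer, then you look up the label
position = test_scores.argmax()
label_from_position = test_scores.index[position]
print(f"argmax() -> position {position} -> {label_from_position}")

# Label-based: idxmax() gives the label in one step
label_direct = test_scores.idxmax()
print(f"idxmax() -> {label_direct}")
```

Both routes end at the same student; `idxmax()` simply skips the intermediate position.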

In [None]:
# The describe() method - ALL Lecture 13 stats in one call!
# Combines mean(), std(), min(), max(), and quartiles

print("Complete Statistical Summary:")
summary = test_scores.describe()
print(summary)
print("\nThis combines ALL the statistical methods from Lecture 13!")

**What Happened:** The describe() method is your "all-in-one" statistical summary, combining everything from Lecture 13:
- count: Number of data points
- mean: Average (from Lecture 13)
- std: Standard deviation (from Lecture 13)
- min/max: Extremes (from Lecture 13)
- 25%, 50%, 75%: Quartiles (like percentiles from Lecture 13)

Many data scientists run describe() early when exploring a new dataset - it can reveal data quality issues (such as an unexpected count or an implausible min or max) and gives a rough picture of the distribution in one glance. Think of it as a comprehensive health checkup combining all your Lecture 13 statistical knowledge!
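
Because describe() returns a Series of its own, the individual statistics can be read back by label. The following sketch reuses the test score values from above (without the student names) and, as one possible illustration, computes the interquartile range from the quartiles:

```python
import pandas as pd

test_scores = pd.Series([85, 92, 78, 90, 88, 95, 82, 87, 91, 86])
summary = test_scores.describe()

# The summary's index holds the statistic names
print(summary.index.tolist())

# Pull out single statistics by label
print(f"Mean: {summary['mean']:.2f}")
print(f"Median (50%): {summary['50%']:.2f}")

# Interquartile range: the spread of the middle half of the data
iqr = summary['75%'] - summary['25%']
print(f"IQR: {iqr:.2f}")
```

This is the same label-based access you practiced in Part 3, applied to the summary instead of the raw data.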

### Exercise 3: Statistical Analysis

Given daily temperatures for a week, calculate:
1. The average temperature
2. The temperature range (max - min)
3. Which day had the highest temperature
4. A complete statistical summary

In [None]:
# Given data
daily_temps = pd.Series([72, 68, 75, 71, 69, 76, 73],
                        index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

# Your code here


In [None]:
# Solution
# 1. Average temperature
avg_temp = daily_temps.mean()
print(f"Average temperature: {avg_temp:.1f}F")

# 2. Temperature range
temp_range = daily_temps.max() - daily_temps.min()
print(f"Temperature range: {temp_range:.1f}F")

# 3. Hottest day
hottest_day = daily_temps.idxmax()
hottest_temp = daily_temps.max()
print(f"Hottest day: {hottest_day} at {hottest_temp}F")

# 4. Complete summary
print("\nComplete Summary:")
print(daily_temps.describe())

## Part 5: Working with String Series

When your Series contains strings, you need special methods to manipulate them. You can't just call `.upper()` on a Series: a Series has no such method, so the call raises an `AttributeError`. pandas provides the `.str` accessor, which applies a string method to every element of the Series in a single call.

Think of the `.str` accessor like giving a command to an entire classroom. Instead of going to each student individually and saying "raise your hand," you just say "class, raise your hands!" The `.str` accessor is that class-wide command for string operations - it applies the operation to every string element efficiently.

In [None]:
# Create Series with string data
names = pd.Series(['alice johnson', 'bob smith', 'charlie brown', 'diana prince'])

print("Original names:")
print(names)

# Convert to uppercase using .str accessor
upper_names = names.str.upper()
print("\nUppercase names:")
print(upper_names)

# Title case (capitalize first letter of each word)
title_names = names.str.title()
print("\nTitle case names:")
print(title_names)

**What Happened:** The `.str` accessor gives you access to most of Python's string methods. The `.str.upper()` call converts each name to uppercase, and `.str.title()` capitalizes each word. Each call processes the whole Series at once, so you never have to write an explicit loop, even for very large Series.
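
To see why the accessor is needed, the sketch below tries the direct call first and then compares `.str.upper()` with the equivalent list comprehension. The two-name Series is a shortened version of the data above:

```python
import pandas as pd

names = pd.Series(['alice johnson', 'bob smith'])

# Calling a string method directly on the Series fails
try:
    names.upper()
except AttributeError as err:
    print(f"AttributeError: {err}")

# .str applies the method to each element, like this list comprehension
via_accessor = names.str.upper()
via_loop = pd.Series([name.upper() for name in names])
print(via_accessor.equals(via_loop))  # True
```

The accessor version is shorter and keeps the original index, which matters once your Series has custom labels.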

In [None]:
# Check which names contain specific text
has_brown = names.str.contains('brown')
print("Names containing 'brown':")
print(has_brown)

# Filter using boolean mask
brown_names = names[has_brown]
print("\nActual names with 'brown':")
print(brown_names)

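**What Happened:** The `.str.contains()` method creates a boolean Series (True/False for each element). You can use this boolean mask to filter the original Series, selecting only the names that match your pattern. This combines string operations with boolean indexing. By default the pattern is treated as a regular expression and the match is case-sensitive; for plain lowercase text like 'brown' that makes no difference.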
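
Two optional parameters of `.str.contains()` are worth knowing: `regex` and `case`. The version strings and names below are made up for illustration; the sketch shows how a pattern containing a dot behaves differently as a regular expression and as plain text:

```python
import pandas as pd

versions = pd.Series(['v1.0', 'v100', 'v2.0'])  # hypothetical version strings

# Default: the pattern is a regex, so '.' matches any character
print(versions.str.contains('1.0').tolist())               # [True, True, False]

# regex=False treats the pattern as plain text
print(versions.str.contains('1.0', regex=False).tolist())  # [True, False, False]

# case=False ignores letter case
names = pd.Series(['Charlie Brown', 'bob smith'])
print(names.str.contains('brown', case=False).tolist())    # [True, False]
```

When your search text contains punctuation, passing `regex=False` is the safer choice.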

In [None]:
# Extract parts of strings by splitting
first_names = names.str.split().str[0]  # Split on space, get first element
last_names = names.str.split().str[1]   # Split on space, get second element

print("First names only:")
print(first_names)
print("\nLast names only:")
print(last_names)

**What Happened:** The `.str.split()` method splits each string on whitespace (by default), returning a Series of lists. Then `.str[0]` extracts the first element from each list. This is a powerful pattern for parsing structured text data.
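
The `.str[1]` pattern assumes every name has exactly two words. As a hedged illustration with made-up names of different lengths, the sketch below shows what happens otherwise and how `.str[-1]` can pick the last word instead:

```python
import pandas as pd

# Hypothetical names with one, two, and three words
names = pd.Series(['alice johnson', 'mary jane watson', 'cher'])
parts = names.str.split()

first = parts.str[0]     # first word of each name
last = parts.str[-1]     # last word, however many words there are
second = parts.str[1]    # NaN where there is no second word

print(first.tolist())    # ['alice', 'mary', 'cher']
print(last.tolist())     # ['johnson', 'watson', 'cher']
print(second)
```

Out-of-range positions produce NaN rather than an error, so it is worth checking for missing values after parsing.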

### Exercise 4: String Operations

Given a Series of product codes, perform these operations:
1. Convert all codes to lowercase
2. Find which products contain 'WIDGET'
3. Replace 'WIDGET' with 'ITEM'

In [None]:
# Given data
products = pd.Series(['WIDGET-A', 'GADGET-B', 'WIDGET-C', 'TOOL-D', 'WIDGET-E'])

# Your code here


In [None]:
# Solution
# 1. Convert to lowercase
lowercase_products = products.str.lower()
print("Lowercase product codes:")
print(lowercase_products)

# 2. Find products containing 'WIDGET'
widget_mask = products.str.contains('WIDGET')
widgets = products[widget_mask]
print("\nProducts containing 'WIDGET':")
print(widgets)

# 3. Replace 'WIDGET' with 'ITEM'
updated_products = products.str.replace('WIDGET', 'ITEM', regex=False)  # plain-text replacement
print("\nUpdated product codes:")
print(updated_products)

## Part 6: Series Arithmetic and Index Alignment

pandas Series support arithmetic operations just like NumPy arrays - addition, subtraction, multiplication, division. These operations are vectorized (applied element-wise to the entire Series), making them fast and efficient. But pandas adds a crucial feature that NumPy lacks: automatic index alignment.

When you add two Series, pandas doesn't just add element 0 to element 0 based on position (like NumPy). Instead, it aligns by index labels, matching 'Alice' with 'Alice', 'Bob' with 'Bob', regardless of their positions. This automatic alignment prevents a huge category of errors and makes data analysis more intuitive.

In [None]:
# Simple arithmetic operations
quiz1 = pd.Series([85, 92, 88, 90],
                  index=['Alice', 'Bob', 'Charlie', 'Diana'])
quiz2 = pd.Series([87, 89, 91, 93],
                  index=['Alice', 'Bob', 'Charlie', 'Diana'])

print("Quiz 1 scores:")
print(quiz1)
print("\nQuiz 2 scores:")
print(quiz2)

# Add the two Series (element-wise)
total_score = quiz1 + quiz2
print("\nTotal scores:")
print(total_score)

**What Happened:** The addition `quiz1 + quiz2` adds corresponding elements by matching index labels. Alice's scores (85 + 87 = 172), Bob's scores (92 + 89 = 181), and so on. The result maintains the index, so you can see which student has which total.

In [None]:
# Index alignment with different orders
scores1 = pd.Series([85, 92, 88],
                    index=['Alice', 'Bob', 'Charlie'])
scores2 = pd.Series([90, 87, 89],
                    index=['Charlie', 'Alice', 'Bob'])

print("Scores1 order: Alice, Bob, Charlie")
print(scores1)
print("\nScores2 order: Charlie, Alice, Bob")
print(scores2)

# pandas aligns by index automatically!
total = scores1 + scores2
print("\nTotal (aligned by name, not position):")
print(total)

**Critical Feature:** Notice that the Series are in different orders, but pandas aligns by index label. Alice gets 85 + 87 = 172 (matching labels), Bob gets 92 + 89 = 181, Charlie gets 88 + 90 = 178. NumPy would add by position and give wrong results!
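
The contrast with position-based addition can be checked directly. This sketch reuses the two Series from the cell above and strips the labels with `.to_numpy()` to imitate what plain NumPy arrays would do:

```python
import pandas as pd

scores1 = pd.Series([85, 92, 88], index=['Alice', 'Bob', 'Charlie'])
scores2 = pd.Series([90, 87, 89], index=['Charlie', 'Alice', 'Bob'])

# Position-based: adds Alice to Charlie, Bob to Alice, Charlie to Bob
by_position = scores1.to_numpy() + scores2.to_numpy()
print(by_position)              # [175 179 177]

# Label-based: each student's two scores are matched by name
by_label = scores1 + scores2
print(by_label['Alice'])        # 172
print(by_label['Bob'])          # 181
print(by_label['Charlie'])      # 178
```

None of the three position-based sums matches a real student's total, and nothing in the output warns you about it.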

In [None]:
# What happens with mismatched indices?
quiz_a = pd.Series([85, 92, 88],
                   index=['Alice', 'Bob', 'Charlie'])
quiz_b = pd.Series([87, 89, 91],
                   index=['Bob', 'Charlie', 'Diana'])

print("Quiz A students: Alice, Bob, Charlie")
print(quiz_a)
print("\nQuiz B students: Bob, Charlie, Diana")
print(quiz_b)

# Add with mismatched indices
result = quiz_a + quiz_b
print("\nResult (NaN for unmatched indices):")
print(result)

**What Happened:** When indices don't match completely, pandas introduces NaN (Not a Number) for the unmatched labels. Alice appears only in quiz_a, Diana only in quiz_b, so they get NaN. Because NaN is a floating-point value, the result's dtype changes from `int64` to `float64`. This is the correct behavior - pandas tells you there's incomplete data rather than making up values or causing errors.
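
If NaN is not what you want, there are two common ways to handle it. The sketch below reuses the quiz data from the cell above; treating a missing quiz as a score of 0 is an assumption made only for illustration, and whether that is appropriate depends on your grading policy:

```python
import pandas as pd

quiz_a = pd.Series([85, 92, 88], index=['Alice', 'Bob', 'Charlie'])
quiz_b = pd.Series([87, 89, 91], index=['Bob', 'Charlie', 'Diana'])

# Option 1: treat a missing quiz as 0 (an assumption, not a pandas default)
filled = quiz_a.add(quiz_b, fill_value=0)
print(filled)

# Option 2: keep only students who took both quizzes
both = (quiz_a + quiz_b).dropna()
print("\nStudents with both quizzes:")
print(both)
```

The `.add()` method does the same work as `+` but accepts the extra `fill_value` argument.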

### Exercise 5: Series Arithmetic

Calculate weighted final scores for students:
1. Midterm (weight 0.4) and Final (weight 0.6)
2. Check if any students are missing scores (NaN)

In [None]:
# Given data
midterm = pd.Series([85, 92, 78, 90], index=['Amy', 'Ben', 'Cal', 'Dan'])
final = pd.Series([88, 85, 95, 87], index=['Amy', 'Ben', 'Cal', 'Dan'])

# Your code here


In [None]:
# Solution
# 1. Calculate weighted final scores
weighted_final = midterm * 0.4 + final * 0.6
print("Weighted Final Scores (40% midterm, 60% final):")
print(weighted_final)

# 2. Check for missing scores
has_missing = weighted_final.isna().any()
print(f"\nAny missing scores? {has_missing}")

# Show which students have complete scores
complete_scores = weighted_final.notna()
print("\nStudents with complete scores:")
print(weighted_final[complete_scores])

## Summary and Key Takeaways

In this lecture, you learned the fundamentals of pandas Series:

1. **Why pandas Series**: Combine NumPy's speed with meaningful labels for intuitive data analysis
2. **Creating Series**: From lists, dictionaries, scalars with custom indices
3. **Accessing Elements**: Use `.loc[]` for labels, `.iloc[]` for positions
4. **Statistical Methods**: `mean()`, `std()`, `describe()` for instant data insights
5. **String Operations**: Use `.str` accessor for vectorized string manipulation
6. **Index Alignment**: Automatic matching by labels during arithmetic operations
7. **Missing Data**: NaN marks labels that had no match during alignment; detect it with `isna()` and `notna()`

**Next Lecture Preview:** In Lecture 17, we'll extend Series to two dimensions with pandas DataFrames - spreadsheet-like structures for multi-variable data analysis.