# Lecture 15: NumPy Data Manipulation
## Filtering, Statistics, and Sorting

**INF 605 - Introduction to Programming - Python**  
**Prof. Rongyu Lin**  
**Quinnipiac University**

---

## Learning Objectives

By the end of this lecture, you will be able to:

1. **Master Boolean Indexing** for powerful data filtering using comparison operators and logical combinations
2. **Apply Fancy Indexing** for advanced element selection and understand copies vs views
3. **Utilize Universal Functions** efficiently for fast element-wise operations
4. **Perform Statistical Analysis** with axis operations (CRITICAL SKILL for data analysis)
5. **Sort and Find Unique Values** using both in-place and copy-based approaches
6. **Combine Techniques** for real data analysis workflows preparing for Assignment 4

---

## Prerequisites Review

From Lecture 12, you mastered NumPy array creation, basic indexing and slicing, element-wise operations, and array reshaping. You understand how NumPy arrays provide 10-100x speed improvement over Python lists for numerical operations, and you've worked with array attributes like shape, dtype, and ndim.

This lecture advances your NumPy skills to the next level: filtering data with boolean masks, selecting non-contiguous elements with fancy indexing, calculating statistics across dimensions with the axis parameter, and sorting data. These are the fundamental operations for any data analysis task you'll encounter in real-world applications.

**Transformation Goal:** Evolve from basic array operations to sophisticated data filtering, analysis, and manipulation techniques.

## Setup: Import NumPy

We'll use NumPy throughout this lecture for all array operations and data manipulation tasks.

In [None]:
# Import NumPy library
import numpy as np

print(f"NumPy version: {np.__version__}")

---
# Part 1: Boolean Indexing - Powerful Data Filtering

Boolean indexing is one of NumPy's most powerful features for working with real data. Imagine you have a spreadsheet with thousands of rows and you want to see only the rows where sales exceeded 1000 dollars - instead of manually checking each row, you create a filter that automatically shows only matching rows. In NumPy, we create boolean masks using comparison operators (>, <, ==, !=, >=, <=) which produce True/False arrays that match the original array's shape. Each True value in the mask means "include this element" and each False means "skip this element". This technique is essential for data analysis - filtering datasets, finding outliers, and selecting subsets of data based on conditions.

## 1.1 Creating Boolean Masks with Comparison Operators

Let's start with a simple example to understand how boolean masks work. When you apply a comparison operator to an array, NumPy creates a new array of the same shape filled with True and False values - this is called a boolean mask. Think of it like a spotlight that illuminates only the elements you want to see, leaving everything else in darkness.

In [None]:
# Create sample data - names and corresponding data rows
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])

print("Names:", names)
print("Data:")
print(data)

**Result Explanation:**  
We have an array of names and a corresponding 2D array of data values. Each row in the data array corresponds to a person in the names array. This is a common pattern in data analysis where you have labels and corresponding data.

In [None]:
# Create boolean mask - which elements equal "Bob"?
mask = names == "Bob"  # Create boolean mask for filtering
print("Boolean mask:", mask)
print("Type of mask:", type(mask))

**Result Explanation:**  
The comparison operator creates a boolean array showing True at positions where "Bob" appears and False everywhere else. This mask has the same length as the names array. Notice that positions 0 and 3 are True because "Bob" appears at those indices.

In [None]:
# Use mask to filter data - get only Bob's rows
bob_data = data[mask]
print("Bob's data:")
print(bob_data)

**Result Explanation:**  
When we use the boolean mask to index the data array, NumPy returns only the rows where the mask is True. This gives us rows 0 and 3 from the original data, which are Bob's data rows: [4, 7] and [0, 0]. This is like using a filter in Excel to show only certain rows.

Now let's apply boolean indexing to numeric data, which is extremely common in data analysis. Think of filtering test scores to find all passing grades, or selecting temperatures above a threshold.

In [None]:
# Create array of test scores
scores = np.array([78, 92, 85, 67, 95, 73, 88])

# Find passing grades (>= 70)
passing_mask = scores >= 70  # Create boolean mask for filtering
passing_scores = scores[passing_mask]

print(f"All scores: {scores}")
print(f"Passing mask: {passing_mask}")
print(f"Passing scores: {passing_scores}")

**Result Explanation:**  
The >= operator creates a boolean mask showing which scores are 70 or above. When we filter with this mask, we get only the passing scores. Notice that the score 67 is excluded because it's below 70. This is much faster and cleaner than writing a loop to check each score individually!
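A mask can also answer "how many?" and "what fraction?" directly, because NumPy treats True as 1 and False as 0 in arithmetic. The sketch below reuses the scores from the cell above; the names `num_passing` and `pass_rate` are illustrative and not part of the lecture code.

```python
import numpy as np

scores = np.array([78, 92, 85, 67, 95, 73, 88])
passing_mask = scores >= 70

# True is treated as 1 and False as 0, so summing a mask counts the matches
num_passing = passing_mask.sum()
# The mean of a mask is the fraction of elements that satisfy the condition
pass_rate = passing_mask.mean()

print(f"Passing students: {num_passing} of {scores.size}")
print(f"Pass rate: {pass_rate:.0%}")
```

Counting this way avoids building the filtered array when only the count is needed.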

### Practice Exercise: Temperature Filtering

Now it's your turn to practice boolean indexing! You have an array of daily temperatures. Your task is to filter out only the hot days (temperatures above 80 degrees). Try to solve this before looking at the solution.

In [None]:
# Given temperatures array
temperatures = np.array([68, 75, 82, 79, 71, 85, 88])

# Exercise: Filter temperatures above 80 degrees
# Step 1: Create a boolean mask for temps > 80
# Step 2: Use the mask to filter the array
# Your code here:


In [None]:
# Solution
hot_days_mask = temperatures > 80  # Create boolean mask for filtering
hot_temperatures = temperatures[hot_days_mask]
print(f"Hot day temperatures: {hot_temperatures}")  # [82 85 88]

**Solution Explanation:**  
We create a boolean mask using the > operator to find temperatures exceeding 80 degrees. Then we use this mask to filter the original array, getting only the values [82, 85, 88]. The key insight is that the mask identifies positions, and boolean indexing extracts values at those positions.

## 1.2 Combining Conditions with Logical Operators

In real data analysis, you often need to filter based on multiple criteria simultaneously - like finding students who scored between 80 and 90, or temperatures that are either very hot or very cold. NumPy provides logical operators to combine multiple boolean conditions: & (and), | (or), and ~ (not). Think of it like building complex database queries: "show me records where condition1 AND condition2" or "condition1 OR condition2". **IMPORTANT:** You must use & and | instead of Python's "and" and "or" keywords for NumPy arrays, and you need parentheses around each condition because of operator precedence. This is one of the most common mistakes students make - remember: parentheses are required!

In [None]:
# Combining with OR - filter for Bob OR Will
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])

# Create combined mask with | operator
mask = (names == "Bob") | (names == "Will")
print("Combined OR mask:", mask)

selected_data = data[mask]
print("Selected data:")
print(selected_data)

**Result Explanation:**  
The | operator creates a mask that is True where either condition is True. Positions 0, 2, 3, and 4 are True because they contain either "Bob" or "Will". Notice the parentheses around each condition - they're essential! Without them, Python's operator precedence would cause errors.
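The ~ (not) operator listed at the start of this section works the same way: it flips every value in a mask. Here is a minimal sketch on the same names and data arrays; the variable names `not_bob`, `others`, and `same_rows` are illustrative.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])

# ~ flips every True to False and every False to True
not_bob = ~(names == "Bob")
print("NOT Bob mask:", not_bob)

# Rows for everyone except Bob (rows 1, 2, 4, 5, 6)
others = data[not_bob]
print(others)

# The same rows can be selected with the != comparison
same_rows = data[names != "Bob"]
print("Same as != :", np.array_equal(others, same_rows))
```

For a single condition, != is usually simpler. The ~ operator is most useful for inverting a combined mask that is already stored in a variable.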

In [None]:
# Combining with AND - find scores between 80 and 90
scores = np.array([78, 92, 85, 67, 95, 73, 88])

# Both conditions must be true
good_range = (scores >= 80) & (scores <= 90)  # Create boolean mask for filtering
range_scores = scores[good_range]

print(f"All scores: {scores}")
print(f"Scores 80-90: {range_scores}")

**Result Explanation:**  
The & operator requires BOTH conditions to be True. A score must be >= 80 AND <= 90 to pass through the filter. Scores 85 and 88 meet both criteria, but 92 and 95 are too high (> 90), and the others are too low (< 80). This is perfect for finding values in a specific range.

**CRITICAL WARNING: Common Mistakes with Boolean Operators**

Students frequently make these errors when combining conditions. Let's see the wrong and right ways to write boolean combinations.

In [None]:
# WRONG - This will cause an error!
# mask = scores >= 80 and scores <= 90  # Error: use & not 'and'

# WRONG - Missing parentheses causes precedence error
# mask = scores >= 80 & scores <= 90    # Error: wrong precedence

# CORRECT - Use & with parentheses
mask = (scores >= 80) & (scores <= 90)  # Works perfectly!
print("Correct mask:", mask)

**Key Takeaway:**  
Always use & for AND, | for OR, and wrap each condition in parentheses. This is different from regular Python logic operators (and, or) which don't work with NumPy arrays. The parentheses ensure correct operator precedence.
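To see what these mistakes look like in practice, the two wrong forms can be run inside try/except blocks. This is an illustrative sketch: the `errors_seen` list exists only for the demonstration, and the exact error message text may differ between NumPy versions.

```python
import numpy as np

scores = np.array([78, 92, 85, 67, 95, 73, 88])
errors_seen = []

# Python's 'and' asks for a single True/False, which a whole array cannot give
try:
    mask = scores >= 80 and scores <= 90
except ValueError as err:
    errors_seen.append("and")
    print("Using 'and':", err)

# Without parentheses, & binds tighter than >= and <=, so Python evaluates
# 80 & scores first; the resulting chained comparison then fails the same way
try:
    mask = scores >= 80 & scores <= 90
except ValueError as err:
    errors_seen.append("precedence")
    print("Missing parentheses:", err)

# Correct form: parentheses around each condition, combined with &
mask = (scores >= 80) & (scores <= 90)
print("Correct mask:", mask)
```

Both wrong forms raise a ValueError about the truth value of an array being ambiguous, which is a useful message to recognize when debugging.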

### Practice Exercise: Extreme Temperature Detection

Practice combining conditions with the OR operator. You have daily temperatures and want to identify days that are either uncomfortably cold (below 70) OR uncomfortably hot (above 85).

In [None]:
# Given array of daily temperatures
temps = np.array([72, 85, 90, 68, 95, 78, 82, 88])

# Exercise: Find temperatures that are either below 70 OR above 85
# Use the | operator and remember parentheses!
# Your code here:


In [None]:
# Solution
extreme_temps = (temps < 70) | (temps > 85)  # Create boolean mask for filtering
result = temps[extreme_temps]
print(f"Extreme temperatures: {result}")  # [90 68 95 88]

**Solution Explanation:**  
The | operator combines two conditions with OR logic. A temperature makes it through the filter if it's EITHER below 70 OR above 85 (or both, though that's impossible here). The values 90, 68, 95, and 88 satisfy at least one condition. Notice that parentheses around each condition are required - this is non-negotiable for correct results!

## 1.3 Modifying Values with Boolean Indexing

Boolean indexing isn't just for filtering - you can also use it to modify specific elements in place based on conditions. This is incredibly useful for data cleaning: replacing negative values with zero, capping outliers at a maximum value, or correcting invalid data. Think of it like a sophisticated find-and-replace that works with any condition you can express. The pattern is simple: arr[condition] = new_value, and NumPy will replace all elements where the condition is True. This modifies the original array, so be careful - make a copy first if you need to preserve the original data!

In [None]:
# Replace all negative values with 0
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])
print("Before:", data)

# Set all negative values to 0
data[data < 0] = 0  # Assign 0 at every position where the mask (data < 0) is True
print("After:")
print(data)

**Result Explanation:**  
The condition `data < 0` creates a boolean mask identifying all negative values. The assignment `data[data < 0] = 0` then replaces every True position with 0. Notice that -5, -12, and -4 all became 0, while positive values and zeros remained unchanged. This is data cleaning in action!

In [None]:
# Cap test scores at maximum of 100
scores = np.array([78, 92, 105, 67, 98, 110, 88])
print(f"Original: {scores}")

# Cap any score above 100
scores[scores > 100] = 100  # Assign 100 at every position where the score exceeds 100
print(f"Capped: {scores}")

**Result Explanation:**  
We identify scores exceeding 100 and replace them with exactly 100. Scores 105 and 110 are capped to 100, while valid scores remain unchanged. This is a common data validation technique - ensuring values stay within acceptable bounds.
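Because this kind of assignment changes the array in place, the original values are lost unless a copy is made first. One way to preserve them is sketched below; the names `raw_scores` and `capped` are illustrative.

```python
import numpy as np

raw_scores = np.array([78, 92, 105, 67, 98, 110, 88])

# .copy() allocates new memory, so edits to 'capped' leave raw_scores alone
capped = raw_scores.copy()
capped[capped > 100] = 100

print(f"Raw (unchanged): {raw_scores}")
print(f"Capped copy:     {capped}")
```

Keeping the raw array makes it possible to check later how many values were changed by the cleaning step.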

### Practice Exercise: Error Value Replacement

This is a realistic data cleaning scenario! Temperature sensors sometimes report -999 as an error code when they malfunction. Your task is to replace these error values with the mean of the valid temperatures. This requires two steps: calculate the mean of valid data, then replace errors with that mean.

In [None]:
# Temperature data with some errors (-999 marks a sensor error)
temps = np.array([72, -999, 85, 90, -999, 78, 82])

# Exercise: Replace -999 error values with the mean of valid temperatures
# Hint: Step 1: Filter to get valid temps (temps != -999)
#       Step 2: Calculate mean of valid temps
#       Step 3: Replace -999 with the mean
# Your code here:


In [None]:
# Solution
# Step 1: Get valid temperatures
valid_temps = temps[temps != -999]  # Keep only readings that are not the error code

# Step 2: Calculate mean of valid data
mean_temp = valid_temps.mean()
print(f"Mean of valid temps: {mean_temp:.1f}")

# Step 3: Replace error values
# temps holds integers, so assigning 81.4 directly would silently truncate it to 81;
# convert to float first so the mean is stored exactly
temps = temps.astype(float)
temps[temps == -999] = mean_temp
print(f"Cleaned data: {temps}")

**Solution Explanation:**  
We first filter to get only valid temperatures (not -999) and calculate their mean (81.4). Because the original array holds integers, we convert it to float before assigning; otherwise NumPy would silently truncate 81.4 to 81. Then we replace all -999 values with the mean. This is a professional data cleaning technique - replacing bad data with reasonable estimates rather than just deleting it. The key is working in steps: filter, calculate, then replace.

---
# Part 2: Fancy Indexing - Advanced Element Selection

Fancy indexing allows you to select multiple elements from an array using an array of indices - it's like having a shopping list of specific positions you want to pick from a shelf. Unlike slicing (which selects contiguous elements), fancy indexing lets you grab elements from any positions in any order: "give me elements 5, 2, 8, 1" - they don't have to be next to each other! This is incredibly powerful for reordering data, selecting specific rows based on rankings, or picking non-contiguous samples. **IMPORTANT:** Fancy indexing creates a copy of the data, not a view, so modifying the result won't affect the original array - this is different from slicing which creates views.

## 2.1 Integer Array Indexing Basics

The fundamental idea of fancy indexing is simple: instead of using a single number to select one element, you use an array of numbers to select multiple elements. The result contains elements at those specific positions, in the order you specified. Think of it like giving NumPy a list of row numbers to pull out of a table.

In [None]:
# Create array where each row equals its row number
arr = np.zeros((8, 4))
for i in range(8):
    arr[i] = i

print("Original array:")
print(arr)

**Setup Explanation:**  
We created an 8x4 array where each row is filled with its row number. Row 0 contains all 0s, row 1 contains all 1s, etc. This makes it easy to see which rows we're selecting with fancy indexing.

In [None]:
# Select specific rows in specific order using fancy indexing
selected = arr[[4, 3, 0, 6]]
print("Selected rows [4, 3, 0, 6]:")
print(selected)

**Result Explanation:**  
Fancy indexing with [4, 3, 0, 6] extracts rows 4, 3, 0, and 6 in that exact order. Notice the result has row 4 first (filled with 4s), then row 3 (filled with 3s), then row 0 (filled with 0s), and finally row 6 (filled with 6s). The order in the index array determines the order in the result!

Now let's apply fancy indexing to a real-world scenario: selecting specific student scores from a larger dataset.

In [None]:
# Array of student scores
scores = np.array([78, 92, 85, 67, 95, 73, 88, 91])

# Select scores at positions 1, 4, 7 (2nd, 5th, 8th students)
indices = np.array([1, 4, 7])
selected_scores = scores[indices]

print(f"All scores: {scores}")
print(f"Selected positions {indices}: {selected_scores}")

**Result Explanation:**  
We selected scores at positions 1 (92), 4 (95), and 7 (91) using an array of indices. This is perfect for scenarios where you have a list of specific students (by position) you want to examine - maybe the top performers identified by some ranking algorithm.
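As one possible version of such a ranking step, the positions can be computed instead of typed by hand. The sketch below uses np.argsort purely as an illustration of where an index array might come from; the names `order`, `top3_positions`, and `top3_scores` are assumptions for this example.

```python
import numpy as np

scores = np.array([78, 92, 85, 67, 95, 73, 88, 91])

# argsort returns the positions that would sort the array in ascending order
order = np.argsort(scores)
# Reverse for descending order, then keep the first three positions
top3_positions = order[::-1][:3]
# Fancy indexing with the computed positions
top3_scores = scores[top3_positions]

print(f"Top 3 positions: {top3_positions}")
print(f"Top 3 scores: {top3_scores}")
```

The index array here plays the same role as the hand-written `indices` above; only its source differs.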

### Practice Exercise: Quarterly Sales Selection

You have 12 months of sales data and need to extract the first month of each quarter (months 0, 3, 6, 9) for quarterly reports. Use fancy indexing to select these non-contiguous months.

In [None]:
# Monthly sales data for the year
sales = np.array([1200, 1500, 1800, 1650, 2000, 1750, 1900, 2100, 1850, 2200, 2050, 2300])

# Exercise: Select sales for months 0, 3, 6, 9 (quarterly data)
# Create an array of indices and use fancy indexing
# Your code here:


In [None]:
# Solution
quarters = np.array([0, 3, 6, 9])
quarterly_sales = sales[quarters]
print(f"Quarterly sales: {quarterly_sales}")  # [1200 1650 1900 2200]

**Solution Explanation:**  
Fancy indexing lets us pick non-contiguous months representing each quarter's first month. We get January (1200), April (1650), July (1900), and October (2200). This would be much more verbose with regular indexing: sales[0], sales[3], sales[6], sales[9]. Fancy indexing makes it clean and scalable!

## 2.2 Fancy Indexing in 2D Arrays

Fancy indexing becomes even more powerful with 2D arrays, where you can select specific rows, specific elements, or even create rectangular selections. When you provide two index arrays for a 2D array, NumPy pairs them up: first index array specifies rows, second specifies columns, and it selects elements at (row[0], col[0]), (row[1], col[1]), etc. This is perfect for selecting diagonal elements, specific matrix entries, or creating custom data selections. Understanding the difference between arr[[1,2,3]] (selecting entire rows) and arr[[1,2,3], [0,1,2]] (selecting specific row,col pairs) is crucial for working with real datasets.

In [None]:
# Create 2D array
arr = np.arange(32).reshape((8, 4))
print("Original array:")
print(arr)

**Setup Explanation:**  
We created an 8x4 array with values 0-31 arranged in rows. This gives us a clear reference to see exactly which elements fancy indexing selects.

In [None]:
# Select specific row,col pairs
# Gets elements at (1,0), (5,3), (7,1), (2,2)
result = arr[[1, 5, 7, 2], [0, 3, 1, 2]]
print("Selected elements:", result)

**Result Explanation:**  
With two index arrays, NumPy pairs them up: (row[0], col[0]) = (1, 0) gives 4, (row[1], col[1]) = (5, 3) gives 23, (row[2], col[2]) = (7, 1) gives 29, and (row[3], col[3]) = (2, 2) gives 10. The result is [4, 23, 29, 10] - specific elements picked from anywhere in the array!
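The same pairing rule gives a compact way to pick diagonal elements: pass the same index array for rows and columns. A small sketch on the array above, where `idx` and `diagonal` are illustrative names:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

# Pair row i with column i for i = 0..3 to walk down the diagonal
idx = np.arange(4)
diagonal = arr[idx, idx]
print("Diagonal of the top 4x4 block:", diagonal)
```

Since the array has only 4 columns, the diagonal stops at row 3. A larger row index paired with a column index of 4 or more would raise an IndexError.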

In [None]:
# Rectangular selection - select rows then columns
arr = np.arange(32).reshape((8, 4))

# Select rows 1, 5, 7, 2, then columns 0, 3, 1, 2 from those rows
result = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
print("Rectangular selection:")
print(result)

**Result Explanation:**  
This two-step selection first takes rows [1, 5, 7, 2], then from those 4 rows, reorders columns to [0, 3, 1, 2]. The result is a 4x4 sub-matrix with both rows and columns reordered. This is useful when you want to reorganize both dimensions of your data.
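NumPy also offers np.ix_, which builds the index arrays for a rectangular selection in a single step. The sketch below compares it with the two-step form; the names `rows`, `cols`, `two_step`, and `one_step` are illustrative.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

# Two-step version: select rows, then reorder columns of that result
two_step = arr[rows][:, cols]
# np.ix_ builds index arrays that select every (row, col) combination at once
one_step = arr[np.ix_(rows, cols)]

print(one_step)
print("Same result:", np.array_equal(two_step, one_step))
```

Either form produces the same 4x4 sub-matrix. The np.ix_ version avoids creating the intermediate array of selected rows.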

### Practice Exercise: Selecting Specific Test Scores

You have a table of test scores where rows are students and columns are tests. Your task is to select specific (student, test) pairs - not complete rows or columns, but individual scores at specific positions.

In [None]:
# Student test scores: rows=students, columns=tests
scores = np.array([[85, 92, 78, 88],
                   [72, 68, 75, 70],
                   [95, 98, 92, 97],
                   [88, 85, 90, 87]])

# Exercise: Get test 0 for student 2, test 1 for student 0, test 3 for student 3
# Use fancy indexing with paired row and column indices
# Your code here:


In [None]:
# Solution
students = np.array([2, 0, 3])
tests = np.array([0, 1, 3])
selected = scores[students, tests]
print(f"Selected scores: {selected}")  # [95 92 87]

**Solution Explanation:**  
We pair up row and column indices to select specific cells: student 2 test 0 (95), student 0 test 1 (92), and student 3 test 3 (87). This is powerful for selecting scattered data points based on some criteria - maybe the first test each student completed, or specific scores flagged for review.

## 2.3 Understanding Copies vs Views - Memory Management

One of the most important differences between fancy indexing and slicing is how they handle memory: slicing creates a view (a window into the original data), while fancy indexing creates a copy (completely new data in memory). This matters because modifying a view changes the original array, but modifying a copy doesn't affect the original. Think of a view like looking at your room through a window - changes you make through the window affect the actual room. A copy is like taking a photograph - you can draw on the photo but the room stays the same. For data analysis, views are more memory efficient, but copies give you independence to modify without side effects.

In [None]:
# Slicing creates a view
arr = np.array([10, 20, 30, 40, 50])
view = arr[1:4]  # Slice creates a view

print("Original array:", arr)
print("View (slice [1:4]):", view)

# Modify the view
view[0] = 999
print("\nAfter modifying view[0] = 999:")
print("View:", view)
print("Original array:", arr)  # Original changed!

**Result Explanation:**  
Slicing created a view into the original array. When we changed view[0] to 999, we actually changed arr[1] to 999 because they share the same memory! This can be surprising if you're not expecting it. Views are memory-efficient but create side effects.

In [None]:
# Fancy indexing creates a copy
arr = np.array([10, 20, 30, 40, 50])
copy = arr[[1, 2, 3]]  # Fancy indexing creates a copy

print("Original array:", arr)
print("Copy (fancy [1,2,3]):", copy)

# Modify the copy
copy[0] = 999
print("\nAfter modifying copy[0] = 999:")
print("Copy:", copy)
print("Original array:", arr)  # Original unchanged!

**Result Explanation:**  
Fancy indexing created an independent copy. When we changed copy[0] to 999, the original array remained [10, 20, 30, 40, 50] - completely unaffected! This independence is safer for many operations but uses more memory because you have two separate arrays.
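When it is unclear whether a result is a view or a copy, np.shares_memory can check directly. A short sketch, with illustrative variable names, that also includes boolean indexing, which returns a copy as well:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

sliced = arr[1:4]        # basic slicing
fancy = arr[[1, 2, 3]]   # fancy (integer array) indexing
masked = arr[arr > 15]   # boolean indexing

slice_shared = np.shares_memory(arr, sliced)
fancy_shared = np.shares_memory(arr, fancy)
mask_shared = np.shares_memory(arr, masked)

print("Slice shares memory:", slice_shared)
print("Fancy shares memory:", fancy_shared)
print("Boolean shares memory:", mask_shared)
```

A quick check like this is helpful before modifying a result whose origin is uncertain.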

---
# Part 3: Universal Functions - Fast Element-wise Operations

Universal functions (ufuncs) are NumPy functions that perform element-wise operations on arrays at high speed. Think of them like an assembly line that handles a whole container of items in one pass - instead of looping through each element in Python, a ufunc runs the loop inside optimized, compiled C code. These functions are called "universal" because they work on arrays of any size or shape. Common examples include mathematical operations like square root, exponential, logarithm, and trigonometric functions. Using ufuncs instead of Python loops can often make your code 10-100 times faster, which becomes crucial when working with large datasets containing millions of values.
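The size of the speedup depends on the operation, the array size, and the machine, so it is worth measuring rather than assuming. Below is a rough timing sketch; the array size is arbitrary and a single run is noisy, so treat the printed numbers only as an indication.

```python
import time
import numpy as np

values = np.arange(1, 200_001, dtype=float)

# Pure-Python loop: one interpreter step per element
start = time.perf_counter()
loop_result = np.array([v ** 0.5 for v in values])
loop_seconds = time.perf_counter() - start

# Ufunc: the loop over elements runs inside NumPy's compiled code
start = time.perf_counter()
ufunc_result = np.sqrt(values)
ufunc_seconds = time.perf_counter() - start

print(f"Loop:  {loop_seconds:.4f} s")
print(f"Ufunc: {ufunc_seconds:.4f} s")
print("Same values:", np.allclose(loop_result, ufunc_result))
```

Both approaches compute the same values; only the time taken differs.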

## 3.1 Unary Universal Functions

Unary ufuncs take a single array as input and apply an operation to each element. The most common unary ufuncs are mathematical transformations like square root (np.sqrt), exponential (np.exp), natural logarithm (np.log), absolute value (np.abs), and sign extraction (np.sign). Think of them like filters you apply to a photo - each pixel gets the same transformation applied to it independently.

In [None]:
# Create array of numbers including negative values
arr = np.array([-3, -1, 0, 1, 4, 9, 16])

# Apply unary ufuncs
print("Original:", arr)
print("Absolute value:", np.abs(arr))
print("Sign (-1, 0, or 1):", np.sign(arr))

**Result Explanation:**  
The np.abs() function converts all negative numbers to positive (absolute value), while np.sign() extracts just the sign of each number: -1 for negative, 0 for zero, +1 for positive. These are simple but powerful transformations used constantly in data analysis for normalizing values or detecting patterns.

In [None]:
# Mathematical transformations (only on non-negative values)
positive_arr = np.array([1, 4, 9, 16, 25])

print("Original:", positive_arr)
print("Square root:", np.sqrt(positive_arr))
print("Exponential:", np.exp(np.array([0, 1, 2])))
print("Natural log:", np.log(positive_arr))

**Result Explanation:**  
np.sqrt() computes square roots element-wise (1→1, 4→2, 9→3, etc.). np.exp() calculates e^x for each element. np.log() computes natural logarithm. These are fundamental mathematical operations used everywhere from physics calculations to machine learning algorithms.
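The example above uses only non-negative inputs because np.sqrt of a negative real number produces nan along with a RuntimeWarning. The sketch below shows two ways this might be handled; the array `mixed` and the result names are illustrative.

```python
import numpy as np

mixed = np.array([-4.0, 0.0, 9.0, 16.0])

# np.sqrt of a negative real number is nan; errstate silences the RuntimeWarning
with np.errstate(invalid="ignore"):
    roots = np.sqrt(mixed)
print("With nan:", roots)

# Alternative: filter to valid inputs first using a boolean mask
safe_roots = np.sqrt(mixed[mixed >= 0])
print("Valid inputs only:", safe_roots)
```

The first approach keeps the array the same length and marks invalid positions with nan. The second drops the invalid values, so the result is shorter.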

### Practice Exercise: Temperature Conversion

You have temperatures in Celsius and need to convert them to Fahrenheit using the formula: F = C * 9/5 + 32. Use array operations and ufuncs instead of loops.

In [None]:
# Temperatures in Celsius
celsius = np.array([0, 10, 20, 30, 40])

# Exercise: Convert to Fahrenheit using array operations
# Formula: F = C * 9/5 + 32
# Your code here:


In [None]:
# Solution
fahrenheit = celsius * 9/5 + 32
print(f"Celsius: {celsius}")
print(f"Fahrenheit: {fahrenheit}")  # [32, 50, 68, 86, 104]

**Solution Explanation:**  
We use element-wise multiplication and addition - no loops needed! NumPy applies the formula to every element automatically. The expression `celsius * 9/5 + 32` broadcasts the operations across the entire array, giving us instant conversion of all temperatures.

## 3.2 Binary Universal Functions

Binary ufuncs take two arrays as input and combine them element-wise. The most useful binary ufuncs are np.maximum() and np.minimum() which compare corresponding elements and keep the larger or smaller value. Unlike Python's max() and min() which find the single largest/smallest value in an array, these ufuncs compare elements position-by-position and create a new array of the same shape. This is perfect for implementing logic like "use the backup value whenever the main value is invalid" or "cap all values at a threshold".

In [None]:
# Compare two arrays element-wise
arr1 = np.array([5, 1, 8, 3, 7])
arr2 = np.array([3, 6, 2, 9, 4])

# Take maximum of each pair
result_max = np.maximum(arr1, arr2)
print(f"Array 1: {arr1}")
print(f"Array 2: {arr2}")
print(f"Maximum: {result_max}")  # [5, 6, 8, 9, 7]

**Result Explanation:**  
np.maximum() compares position by position: 5 vs 3 (keep 5), 1 vs 6 (keep 6), 8 vs 2 (keep 8), 3 vs 9 (keep 9), 7 vs 4 (keep 7). The result contains the larger value from each position. This is different from max(arr1, arr2) which would cause an error!

In [None]:
# Practical example: cap values at threshold
temperatures = np.array([68, 75, 105, 79, 110, 85])
max_valid = 100

# Cap all values at 100
capped = np.minimum(temperatures, max_valid)
print(f"Original: {temperatures}")
print(f"Capped at {max_valid}: {capped}")

**Result Explanation:**  
np.minimum() compares each temperature with 100, keeping whichever is smaller. Values like 105 and 110 get capped to 100, while values already below 100 (68, 75, 79, 85) stay unchanged. This is a clean way to enforce maximum limits on data.
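When a lower limit is needed as well, np.maximum and np.minimum can be nested, or np.clip can apply both limits in one call. The sketch below uses made-up bounds of 50 and 100 and an extra low reading of 40 added for illustration.

```python
import numpy as np

temperatures = np.array([68, 75, 105, 79, 110, 85, 40])
low, high = 50, 100  # illustrative bounds, not from the lecture

# Nest the two ufuncs: raise values below 'low', then lower values above 'high'
bounded = np.minimum(np.maximum(temperatures, low), high)
# np.clip expresses the same two-sided limit in one call
clipped = np.clip(temperatures, low, high)

print(f"Bounded: {bounded}")
print(f"Clipped: {clipped}")
```

Both lines give the same array; np.clip is simply the shorter way to write a two-sided cap.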

### Practice Exercise: Replace Sensor Errors

You have two arrays: main (the main sensor, sometimes faulty with -999 errors) and backup (the backup sensor, always reliable). Use np.where() combined with an element-wise comparison to use the backup values whenever the main sensor shows -999.

In [None]:
# Sensor data
main = np.array([72, -999, 85, 90, -999, 78])
backup = np.array([70, 73, 84, 89, 76, 77])

# Exercise: Create corrected array that uses backup when main = -999
# Hint: Use np.where(condition, value_if_true, value_if_false)
# Your code here:


In [None]:
# Solution
corrected = np.where(main == -999, backup, main)  # Take backup where main is -999, otherwise keep main
print(f"Main sensor: {main}")
print(f"Backup sensor: {backup}")
print(f"Corrected: {corrected}")  # [72, 73, 85, 90, 76, 78]

**Solution Explanation:**  
np.where() checks each position: if main sensor equals -999, use backup sensor value; otherwise use main sensor value. This gives us [72 from main, 73 from backup, 85 from main, 90 from main, 76 from backup, 78 from main]. Perfect for data validation and error correction!

---
# Part 4: Mathematical and Statistical Methods - Analyzing Your Data

NumPy provides a comprehensive suite of mathematical and statistical methods that let you analyze arrays with just a single function call. Instead of writing loops to calculate averages or find patterns, you can call methods like .mean(), .sum(), .std() directly on arrays. These methods can operate on entire arrays or along specific dimensions using the axis parameter. Think of these as your data analysis toolkit - they answer questions like "what's the average?", "how spread out is the data?", "what's the trend?", and "are there any extreme values?". The axis parameter is one of the most CRITICAL concepts in NumPy - it determines whether you're calculating statistics down rows or across columns, and mastering it is essential for working with multi-dimensional data.

## 4.1 Basic Aggregation Methods

Aggregation methods reduce an array to a single value by combining all elements using an operation. The most common are: mean() for average, sum() for total, std() for standard deviation (spread), var() for variance, min() and max() for extremes. These methods work on the entire array by default, collapsing all values into one summary statistic. Think of aggregation like crushing a bunch of numbers through a funnel - many values go in, one summary value comes out.

In [None]:
# Student test scores
scores = np.array([78, 92, 85, 67, 95, 73, 88, 91, 84, 79])

# Calculate basic statistics
print(f"Scores: {scores}")
print(f"Mean (average): {scores.mean():.2f}")  # Calculate overall mean of all elements
print(f"Sum (total): {scores.sum()}")  # Calculate total sum of all elements
print(f"Std (spread): {scores.std():.2f}")  # Calculate standard deviation (spread of data)
print(f"Min: {scores.min()}, Max: {scores.max()}")  # Find minimum/maximum value

**Result Explanation:**  
The mean is 83.2 (average score), sum is 832 (total points), standard deviation is 8.46 (how spread out scores are), min is 67 (lowest score), and max is 95 (highest score). These five statistics give you a complete summary of the dataset - central tendency, total, spread, and range. This is the foundation of data analysis!
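Two related details are worth knowing. By default, std() divides by n (the population standard deviation), and the ddof parameter switches to the sample version that divides by n - 1. The argmin() and argmax() methods report where the extremes are rather than what they are. A brief sketch with illustrative variable names:

```python
import numpy as np

scores = np.array([78, 92, 85, 67, 95, 73, 88, 91, 84, 79])

# Default (ddof=0): divide the summed squared deviations by n
population_std = scores.std()
# ddof=1: divide by n - 1, the usual choice when the data is a sample
sample_std = scores.std(ddof=1)

# Positions of the extremes, to complement min() and max()
lowest_at = scores.argmin()
highest_at = scores.argmax()

print(f"Population std: {population_std:.2f}")
print(f"Sample std:     {sample_std:.2f}")
print(f"Lowest score at index {lowest_at}, highest at index {highest_at}")
```

The sample version is always slightly larger than the population version, and the gap shrinks as the number of values grows.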

## 4.2 The Axis Parameter - CRITICAL CONCEPT

The axis parameter is one of the most important and initially confusing concepts in NumPy, but once you understand it, multi-dimensional data analysis becomes incredibly powerful. When you have a 2D array (rows and columns), you can calculate statistics either "down the columns" (across all rows) or "across the rows" (across all columns). The axis parameter controls this direction: axis=0 means "go DOWN through rows, collapsing them" (resulting in one value per column), while axis=1 means "go ACROSS columns, collapsing them" (resulting in one value per row). Think of axis as the dimension you want to "crush" or "eliminate" - axis=0 eliminates the row dimension, axis=1 eliminates the column dimension. This is counter-intuitive at first because axis=0 produces column-wise results, but with practice and visual examples, it becomes second nature.

In [None]:
# Create sample 2D array - student scores
# Rows = students, Columns = tests
scores = np.array([[85, 92, 78, 88],
                   [72, 68, 75, 70],
                   [95, 98, 92, 97],
                   [88, 85, 90, 87]])

print("Student scores (rows=students, cols=tests):")
print(scores)
print(f"Shape: {scores.shape} (4 students, 4 tests)")

**Setup Explanation:**  
We have a 4x4 array where each row is a student and each column is a test. Student 0 scored [85, 92, 78, 88], Student 1 scored [72, 68, 75, 70], etc. This is a common data structure - rows as observations, columns as variables.

In [None]:
# axis=0: Go DOWN rows (get average per test across all students)
test_averages = scores.mean(axis=0)
print("Average per TEST (axis=0, down rows):")
print(test_averages)
print(f"Shape: {test_averages.shape}")

**Result Explanation - axis=0:**  
axis=0 goes DOWN through rows, calculating the mean for each column separately. We get 4 values (one per test): Test 0 average is (85+72+95+88)/4 = 85, Test 1 average is (92+68+98+85)/4 = 85.75, etc. Think of it as: "For each TEST, what's the average score across all STUDENTS?" The result has shape (4,) because we collapsed the 4 rows into 1 value per column.

In [None]:
# axis=1: Go ACROSS columns (get average per student across all tests)
student_averages = scores.mean(axis=1)
print("Average per STUDENT (axis=1, across columns):")
print(student_averages)
print(f"Shape: {student_averages.shape}")

**Result Explanation - axis=1:**  
axis=1 goes ACROSS columns, calculating the mean for each row separately. We get 4 values (one per student): Student 0 average is (85+92+78+88)/4 = 85.75, Student 1 average is (72+68+75+70)/4 = 71.25, etc. Think of it as: "For each STUDENT, what's their average score across all TESTS?" The result has shape (4,) because we collapsed the 4 columns into 1 value per row.

**VISUAL GUIDE FOR AXIS PARAMETER:**

```
axis=0: Go DOWN ↓ through rows
Result: One value per column
[85, 92, 78, 88]    ↓
[72, 68, 75, 70]    ↓  Calculate
[95, 98, 92, 97]    ↓  mean for
[88, 85, 90, 87]    ↓  each column
-------------------
[85, 85.75, 83.75, 85.5]  ← Result shape (4,)

axis=1: Go ACROSS → through columns
Result: One value per row
[85, 92, 78, 88] → Calculate → 85.75
[72, 68, 75, 70] → mean for  → 71.25
[95, 98, 92, 97] → each row  → 95.5
[88, 85, 90, 87] →            → 87.5
                    Result shape (4,)
```

**MEMORY AID:**  
- **axis=0** = DOWN rows = one value per COLUMN (result length = number of columns)
- **axis=1** = ACROSS columns = one value per ROW (result length = number of rows)
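
Because the student example uses a square 4x4 array, both results happen to have shape (4,). A non-square array makes the memory aid easier to check. The following is a small sketch; the array `data` and its values are made up for illustration.

```python
import numpy as np

# Hypothetical data: 2 rows, 3 columns (values chosen only for illustration)
data = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_sums = data.sum(axis=0)   # collapse the 2 rows -> one value per column
row_sums = data.sum(axis=1)   # collapse the 3 columns -> one value per row

print(f"Original shape: {data.shape}")
print(f"axis=0 result: {col_sums}, shape {col_sums.shape}")
print(f"axis=1 result: {row_sums}, shape {row_sums.shape}")

# keepdims=True keeps the collapsed axis with length 1 instead of dropping it
kept = data.sum(axis=1, keepdims=True)
print(f"axis=1 with keepdims=True has shape {kept.shape}")
```

With shape (2, 3), axis=0 leaves 3 values and axis=1 leaves 2 values, so the two directions can no longer be confused.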

### Practice Exercise: Axis Parameter - Temperature Data

You have daily temperatures for 3 cities over 5 days (rows=days, columns=cities). Calculate: (1) average temperature per city (across all days), and (2) average temperature per day (across all cities).

In [None]:
# Temperature data: rows=days, columns=cities
temps = np.array([[72, 68, 75],
                  [75, 70, 78],
                  [71, 67, 73],
                  [78, 72, 80],
                  [76, 71, 77]])

print("Temperatures (rows=days, cols=cities):")
print(temps)

# Exercise 1: Average per CITY (across all days) - which axis?
# Exercise 2: Average per DAY (across all cities) - which axis?
# Your code here:


In [None]:
# Solution
# Average per city: go DOWN days (axis=0)
city_avg = temps.mean(axis=0)
print(f"Average per city (axis=0): {city_avg}")

# Average per day: go ACROSS cities (axis=1)
day_avg = temps.mean(axis=1)
print(f"Average per day (axis=1): {day_avg}")

**Solution Explanation:**  
For average per city, we want one value per city (column), so we go DOWN days with axis=0. For average per day, we want one value per day (row), so we go ACROSS cities with axis=1. Remember: axis=0 eliminates rows (result has column shape), axis=1 eliminates columns (result has row shape).

## 4.3 Cumulative Operations

Cumulative operations compute running totals or products as they go through an array. np.cumsum() gives you the running sum (useful for tracking cumulative totals like bank balances or distance traveled), while np.cumprod() gives you the running product (useful for compound growth). Unlike aggregations that return a single value, cumulative operations return an array of the same size showing the accumulated result at each position. Think of cumsum like a bank account that shows your balance after each transaction - each value includes all previous values plus the current one.

In [None]:
# Daily sales - calculate running total
daily_sales = np.array([1200, 1500, 1800, 1650, 2000])

running_total = np.cumsum(daily_sales)
print(f"Daily sales: {daily_sales}")
print(f"Running total: {running_total}")

**Result Explanation:**  
cumsum creates running totals: Day 0: 1200, Day 1: 1200+1500=2700, Day 2: 2700+1800=4500, Day 3: 4500+1650=6150, Day 4: 6150+2000=8150. Each position shows the total of all values up to and including that position. Perfect for tracking cumulative metrics!

In [None]:
# Compound growth - running product
growth_rates = np.array([1.05, 1.03, 1.04, 1.06])  # 5%, 3%, 4%, 6% growth

cumulative_growth = np.cumprod(growth_rates)
print(f"Period growth rates: {growth_rates}")
print(f"Cumulative growth: {cumulative_growth}")
print(f"Total growth: {cumulative_growth[-1]:.4f} = {(cumulative_growth[-1]-1)*100:.2f}%")

**Result Explanation:**  
cumprod shows compound growth: Period 1: 1.05, Period 2: 1.05*1.03=1.0815, Period 3: 1.0815*1.04≈1.1248, Period 4: 1.12476*1.06≈1.1922 (using the unrounded Period 3 value). The final value 1.1922 means 19.22% total growth through compound effects.
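
One way to sanity-check a cumulative result is that its last element should match the corresponding full aggregation. A minimal sketch, reusing the two arrays from the cells above:

```python
import numpy as np

daily_sales = np.array([1200, 1500, 1800, 1650, 2000])
growth_rates = np.array([1.05, 1.03, 1.04, 1.06])

running_total = np.cumsum(daily_sales)
cumulative_growth = np.cumprod(growth_rates)

# The final running total is the plain sum of the whole array
print(running_total[-1] == daily_sales.sum())

# For floats, compare with np.isclose instead of == to allow for rounding
print(np.isclose(cumulative_growth[-1], growth_rates.prod()))
```

Both checks should print True. The float comparison uses np.isclose because the two computations may round slightly differently.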

## 4.4 Boolean Array Methods

Boolean arrays have special methods that help you ask questions about your data: .any() checks if at least one value is True, .all() checks if every value is True, and .sum() counts how many True values exist. These are incredibly useful for data validation and filtering analysis. Think of .any() as "is there at least one problem?", .all() as "is everything perfect?", and .sum() as "how many problems are there?".

In [None]:
# Student scores - check for failures and perfect scores
scores = np.array([78, 92, 85, 67, 95, 73, 88, 91, 84, 79])

# Check conditions
has_failures = (scores < 70).any()    # True if at least one score is below 70
all_passing = (scores >= 70).all()    # True only if every score is 70 or above
num_excellent = (scores >= 90).sum()  # Count of scores that are 90 or above

print(f"Scores: {scores}")
print(f"Has failures (<70)? {has_failures}")
print(f"All passing (>=70)? {all_passing}")
print(f"Number excellent (>=90): {num_excellent}")

**Result Explanation:**  
.any() returns True because there's at least one score < 70 (the 67). .all() returns False because not ALL scores are >= 70. .sum() counts True values (treating True as 1), giving us 3 scores >= 90 (92, 95, and 91). These methods turn boolean logic into actionable insights!
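
These methods also accept the axis parameter from Section 4.2, which lets you ask the same questions per row or per column. The sketch below uses a small hypothetical 2D array (`scores_2d` is not part of the lecture data).

```python
import numpy as np

# Hypothetical scores: rows = students, columns = tests
scores_2d = np.array([[85, 92, 78],
                      [72, 68, 75],
                      [95, 98, 92]])

below_70 = scores_2d < 70                      # boolean array, same shape as scores_2d

any_fail_per_student = below_70.any(axis=1)    # did each student fail at least one test?
all_pass_per_test = (~below_70).all(axis=0)    # did every student pass each test?
fails_per_student = below_70.sum(axis=1)       # number of failed tests per student

print(any_fail_per_student)
print(all_pass_per_test)
print(fails_per_student)
```

Only the second student has a score below 70 (the 68 on the second test), so only that row and that column are flagged.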

## 4.5 Percentiles for Distribution Analysis

Percentiles tell you the value below which a given percentage of data falls. The 50th percentile is the median (half the values are below it), the 25th percentile is the first quartile, and the 75th percentile is the third quartile. np.percentile() lets you find any percentile, which is perfect for understanding data distribution, detecting outliers, and setting thresholds. Think of percentiles as dividing your sorted data into segments - the 90th percentile is better than 90% of values.

In [None]:
# Test scores - analyze distribution
scores = np.array([78, 92, 85, 67, 95, 73, 88, 91, 84, 79, 82, 76])

# Calculate key percentiles
p25 = np.percentile(scores, 25)
p50 = np.percentile(scores, 50)  # median
p75 = np.percentile(scores, 75)
p90 = np.percentile(scores, 90)

print(f"Scores (sorted): {np.sort(scores)}")
print(f"25th percentile: {p25}")
print(f"50th percentile (median): {p50}")
print(f"75th percentile: {p75}")
print(f"90th percentile: {p90}")

**Result Explanation:**  
The 25th percentile (77.5) means roughly 25% of scores fall below this value. The median (83.0) splits the data in half. The 75th percentile (88.75) means roughly 75% of scores fall below it, so scores above it are in the top 25%. The 90th percentile (91.9) marks the threshold for top 10% performance. These values come from NumPy's default linear interpolation between neighboring sorted scores. Together they give you a clear picture of the score distribution!
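
np.percentile() also accepts a list of percentiles, which is convenient for outlier checks. The sketch below computes the quartiles in one call and applies the 1.5 x IQR rule of thumb. That rule is an assumption added here for illustration, not something the lecture prescribes.

```python
import numpy as np

scores = np.array([78, 92, 85, 67, 95, 73, 88, 91, 84, 79, 82, 76])

# Passing a list returns one value per requested percentile
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1   # interquartile range: spread of the middle 50% of scores

# Rule of thumb (assumed here): values more than 1.5 * IQR beyond the
# quartiles are flagged as possible outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = scores[(scores < lower) | (scores > upper)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"Outlier bounds: [{lower}, {upper}]")
print(f"Possible outliers: {outliers}")
```

For this dataset no score falls outside the bounds, so the outlier array is empty.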

---
# Part 5: Sorting and Unique Values - Organizing Your Data

Sorting is a fundamental operation in data analysis - it helps you find top performers, identify trends, and organize information logically. NumPy provides two sorting approaches: arr.sort() modifies the array in-place (changes the original), while np.sort(arr) returns a sorted copy (leaves original unchanged). Beyond basic sorting, argsort() is incredibly powerful - it returns the indices that would sort the array, allowing you to sort multiple related arrays in sync. Additionally, np.unique() finds distinct values, and np.isin() tests membership (NumPy's documentation recommends it over the older np.in1d()), both essential for data cleaning and analysis.

## 5.1 Sorting Arrays - In-Place vs Copy

Understanding the difference between in-place sorting and copy sorting is crucial for data integrity. The .sort() method modifies your array directly (fast and memory-efficient but destructive), while np.sort() creates a new sorted array (safe but uses more memory). Choose based on whether you need to preserve the original order.

In [None]:
# Create array of test scores
scores = np.array([78, 92, 85, 67, 95, 73, 88])

# Method 1: np.sort() returns sorted copy (original unchanged)
sorted_copy = np.sort(scores)
print("Original (after np.sort()):", scores)  # original order is unchanged
print("Sorted copy:", sorted_copy)

**Result Explanation:**  
np.sort() created a new sorted array [67, 73, 78, 85, 88, 92, 95] while leaving the original scores array untouched. This is the safe choice when you need both the original order and the sorted order.

In [None]:
# Method 2: .sort() modifies array in-place (original changed)
scores2 = np.array([78, 92, 85, 67, 95, 73, 88])
print("Before .sort():", scores2)  # original, unsorted order
scores2.sort()  # No return value, modifies scores2 directly
print("After .sort():", scores2)  # scores2 itself is now sorted

**Result Explanation:**  
The .sort() method modified scores2 directly, replacing the original order with sorted order. Notice it returns None - the operation happens in-place. Use this when you don't need the original order and want to save memory.
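
A common slip that follows from this is assigning the result of .sort() back to a variable. The short sketch below shows what happens; the variable names `result`, `original`, and `ordered` are chosen here for illustration.

```python
import numpy as np

scores = np.array([78, 92, 85, 67, 95, 73, 88])

result = scores.sort()    # sorts scores in place; the method itself returns None
print(result)             # None
print(scores)             # scores is now sorted

# Pitfall: writing `scores = scores.sort()` would replace the array with None.
# To keep the original order, sort a copy instead:
original = np.array([78, 92, 85, 67, 95, 73, 88])
ordered = np.sort(original)
print(original, ordered)
```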

## 5.2 Indirect Sorting with argsort()

argsort() is one of NumPy's most powerful and initially confusing functions. Instead of returning sorted values, it returns the indices that would sort the array. Why is this useful? Because it lets you sort multiple related arrays in sync! If you have student names and scores in separate arrays, argsort() lets you sort both by scores while keeping them aligned. Think of argsort() as creating a "recipe" for sorting - it tells you which positions to pick in which order.

In [None]:
# Student names and scores (parallel arrays)
names = np.array(["Alice", "Bob", "Charlie", "Diana"])
scores = np.array([85, 92, 78, 95])

# Get indices that would sort scores
sorted_indices = np.argsort(scores)
print("Original scores:", scores)
print("Sorting indices:", sorted_indices)
print("Sorted scores:", scores[sorted_indices])
print("Names in score order:", names[sorted_indices])

**Result Explanation:**  
argsort() returns [2, 0, 1, 3] meaning: "to sort the array, first take element 2 (78), then element 0 (85), then element 1 (92), then element 3 (95)". We can use these same indices to sort names: Charlie (78), Alice (85), Bob (92), Diana (95). Both arrays stay perfectly aligned!

In [None]:
# Sort in descending order (highest to lowest)
desc_indices = np.argsort(scores)[::-1]
print("Top performers:")
for i, idx in enumerate(desc_indices, 1):
    print(f"{i}. {names[idx]}: {scores[idx]}")

**Result Explanation:**  
By reversing argsort() with [::-1], we get descending order. This creates a leaderboard: Diana (95), Bob (92), Alice (85), Charlie (78). The pattern argsort()[::-1] is extremely common for ranking from best to worst.
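
The same pattern extends to a "top k" selection by slicing the reversed indices. A minimal sketch, where the cutoff `k = 2` is a hypothetical value chosen for illustration:

```python
import numpy as np

names = np.array(["Alice", "Bob", "Charlie", "Diana"])
scores = np.array([85, 92, 78, 95])

k = 2  # hypothetical cutoff chosen for illustration
top_k = np.argsort(scores)[::-1][:k]   # indices of the k highest scores, best first

print(names[top_k])
print(scores[top_k])
```

Here the two selected students are Diana (95) and Bob (92), matching the top of the leaderboard above.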

## 5.3 Finding Unique Values and Testing Membership

np.unique() removes duplicates and returns sorted unique values - essential for understanding what distinct values exist in your data. np.isin() tests which elements from one array appear in another array - perfect for filtering or validation (it is the function NumPy's documentation recommends over the older np.in1d()). These functions answer questions like "what categories exist in my dataset?" and "which students are on the honor roll list?".

In [None]:
# Survey responses with duplicates
responses = np.array([5, 3, 5, 2, 4, 3, 5, 1, 4, 5, 2])

# Find unique values
unique_responses = np.unique(responses)
print("All responses:", responses)
print("Unique responses:", unique_responses)

# Count occurrences
for val in unique_responses:
    count = (responses == val).sum()  # Count how many responses equal val
    print(f"  {val}: appears {count} times")

**Result Explanation:**  
np.unique() found the distinct values [1, 2, 3, 4, 5] in sorted order, removing all duplicates. We then counted how often each appears. This is fundamental for categorical data analysis - understanding what values exist and their frequencies.
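
The counting loop can also be replaced by asking np.unique() for the counts directly via its return_counts argument. A short sketch on the same survey data (the name `most_common` is introduced here for illustration):

```python
import numpy as np

responses = np.array([5, 3, 5, 2, 4, 3, 5, 1, 4, 5, 2])

# return_counts=True returns a second array with the frequency of each unique value
values, counts = np.unique(responses, return_counts=True)
for val, count in zip(values, counts):
    print(f"  {val}: appears {count} times")

# The value with the largest count is the most frequent response
most_common = values[np.argmax(counts)]
print(f"Most common response: {most_common}")
```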

In [None]:
# Check membership - which students are on honor roll?
all_students = np.array(["Alice", "Bob", "Charlie", "Diana", "Eve"])
honor_roll = np.array(["Diana", "Alice"])

# Test which students are on honor roll
is_honor = np.isin(all_students, honor_roll)  # np.isin is recommended over the older np.in1d
print("All students:", all_students)
print("Honor roll:", honor_roll)
print("Is on honor roll?", is_honor)
print("Honor roll students:", all_students[is_honor])

**Result Explanation:**  
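The membership test returns [True, False, False, True, False], indicating which students from all_students appear in honor_roll. We can use this boolean array to filter and get ["Alice", "Diana"]. NumPy's documentation recommends np.isin() over the older np.in1d() for this; both give the same result on 1D arrays. This is perfect for set operations and membership testing!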
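
The inverse question ("who is NOT on the list?") only needs the mask to be negated. The sketch below uses np.isin(); the names `on_roll` and `not_on_roll` are introduced here for illustration.

```python
import numpy as np

all_students = np.array(["Alice", "Bob", "Charlie", "Diana", "Eve"])
honor_roll = np.array(["Diana", "Alice"])

on_roll = np.isin(all_students, honor_roll)
not_on_roll = ~on_roll          # same result as np.isin(..., invert=True)

print("Not on honor roll:", all_students[not_on_roll])
```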

---
# Part 6: Comprehensive Practice Example - Student Grade Analysis System

Now let's combine everything you've learned into a realistic, complete data analysis workflow. You'll work with a dataset of student quiz scores where some data is missing or invalid, and you need to clean it, analyze it, and extract meaningful insights. This example integrates boolean indexing for filtering and cleaning, fancy indexing for selecting specific data, universal functions for calculations, statistical methods with axis parameters for analysis, and sorting for rankings. This is exactly the kind of work you'll do in real data analysis projects!

## Setting Up the Student Data

We have quiz scores for 6 students across 4 quizzes. Some data points are missing (represented as -1), and we need to handle these properly. The goal is to calculate valid statistics, find top performers, and identify students who need help.

In [None]:
# Student quiz scores (rows=students, columns=quizzes)
# -1 indicates missing score (student was absent)
student_names = np.array(["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank"])
quiz_scores = np.array([[85, 92, -1, 88],   # Alice
                        [78, 72, 75, 70],   # Bob
                        [95, 98, 92, 97],   # Charlie
                        [67, -1, 72, 68],   # Diana
                        [88, 85, 90, 87],   # Eve
                        [72, 75, -1, 78]])  # Frank

print("Student Quiz Scores (rows=students, cols=quizzes):")
print(quiz_scores)
print(f"Shape: {quiz_scores.shape}")

**Data Overview:**  
We have 6 students and 4 quizzes, but some scores are -1 (missing). Before analyzing, we need to clean this data. Counting quizzes from 1 (as the later output does), Alice missed Quiz 3, Diana missed Quiz 2, and Frank missed Quiz 3 (column indices 2, 1, and 2). Real datasets always have missing or invalid data!

## Step 1: Data Cleaning - Replace Invalid Scores

First, we'll identify missing scores (-1) and replace them with each student's average from their valid quizzes. This is a common data imputation technique - using available information to estimate missing values. We'll use boolean indexing to find missing values and NumPy's statistical methods to calculate replacements.

In [None]:
# Create a float copy: leaves the original untouched and keeps
# fractional averages from being truncated by the integer dtype
cleaned_scores = quiz_scores.astype(float)

# Find students with missing scores
has_missing = (quiz_scores == -1).any(axis=1)  # True for each student (row) with at least one -1
print("Students with missing scores:", student_names[has_missing])

# For each student with missing scores, replace with their average
for i in range(len(cleaned_scores)):
    student_row = cleaned_scores[i]
    if (student_row == -1).any():
        # Calculate average of valid scores
        valid_scores = student_row[student_row != -1]  # Keep only this student's valid scores
        avg = valid_scores.mean()
        # Replace -1 with average
        cleaned_scores[i][student_row == -1] = avg
        print(f"{student_names[i]}: Replaced -1 with {avg:.2f}")

**Cleaning Result:**  
We identified students with -1 scores, calculated each student's average from their valid quizzes, and replaced -1 with that average. Alice's average of [85, 92, 88] is 88.33, so her missing score becomes 88.33. This preserves each student's performance level while filling gaps.
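
The same imputation can be written without an explicit Python loop. The following is one possible vectorized sketch, assuming missing scores are marked with -1 as in the dataset above; the names `missing`, `with_nan`, `row_means`, and `filled` are introduced here for illustration.

```python
import numpy as np

quiz_scores = np.array([[85, 92, -1, 88],   # Alice
                        [78, 72, 75, 70],   # Bob
                        [95, 98, 92, 97],   # Charlie
                        [67, -1, 72, 68],   # Diana
                        [88, 85, 90, 87],   # Eve
                        [72, 75, -1, 78]])  # Frank

missing = quiz_scores == -1

# Convert to float and mark missing entries as NaN
with_nan = np.where(missing, np.nan, quiz_scores.astype(float))

# np.nanmean ignores NaN; keepdims=True gives shape (6, 1) so it broadcasts per row
row_means = np.nanmean(with_nan, axis=1, keepdims=True)

# Fill each missing entry with that student's mean of valid scores
filled = np.where(missing, row_means, with_nan)
print(np.round(filled, 2))
```

Working in floats matters here: an integer array would silently truncate a fractional average such as 88.33 to 88.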

## Step 2: Statistical Analysis - Overall Performance

Now with clean data, let's calculate comprehensive statistics. We'll find average scores per student (axis=1), average scores per quiz (axis=0), identify top performers, and detect students who might need help.

In [None]:
# Calculate student averages (axis=1: across quizzes per student)
student_averages = cleaned_scores.mean(axis=1)

# Calculate quiz averages (axis=0: across students per quiz)
quiz_averages = cleaned_scores.mean(axis=0)

print("Student averages (each student's mean across quizzes):")
for name, avg in zip(student_names, student_averages):
    print(f"  {name}: {avg:.2f}")

print("\nQuiz averages (each quiz's mean across students):")
for i, avg in enumerate(quiz_averages, 1):
    print(f"  Quiz {i}: {avg:.2f}")

**Statistical Insights:**  
We now have each student's overall performance and each quiz's difficulty level. axis=1 gave us one average per student (how each individual performed overall), while axis=0 gave us one average per quiz (which quizzes were harder or easier for the class).

## Step 3: Ranking and Identification

Use argsort to rank students from highest to lowest average, identify top performers (>= 90 average), and flag students who need help (< 75 average). This combines sorting, boolean indexing, and filtering techniques.

In [None]:
# Rank students (descending order)
ranking_indices = np.argsort(student_averages)[::-1]

print("Class Rankings (highest to lowest):")
for rank, idx in enumerate(ranking_indices, 1):
    print(f"{rank}. {student_names[idx]}: {student_averages[idx]:.2f}")

# Identify top performers (>= 90)
top_performers = student_averages >= 90  # Create boolean mask for filtering
print(f"\nTop performers (avg >= 90): {student_names[top_performers]}")

# Identify students needing help (< 75)
needs_help = student_averages < 75  # Create boolean mask for filtering
print(f"Students needing help (avg < 75): {student_names[needs_help]}")

**Ranking Insights:**  
argsort()[::-1] gave us indices in descending order, creating a leaderboard. Boolean indexing identified top performers and students struggling. This is exactly how real academic analytics systems work - automated identification of students needing intervention!

## Step 4: Quiz Difficulty Analysis

Determine which quiz was hardest (lowest average) and easiest (highest average), and use percentiles to understand score distributions. This helps teachers identify which topics need more review.

In [None]:
# Find easiest and hardest quizzes
easiest_quiz = np.argmax(quiz_averages) + 1
hardest_quiz = np.argmin(quiz_averages) + 1

print(f"Easiest quiz: Quiz {easiest_quiz} (avg: {quiz_averages[easiest_quiz-1]:.2f})")
print(f"Hardest quiz: Quiz {hardest_quiz} (avg: {quiz_averages[hardest_quiz-1]:.2f})")

# Score distribution analysis
all_scores = cleaned_scores.flatten()
p25 = np.percentile(all_scores, 25)
p50 = np.percentile(all_scores, 50)
p75 = np.percentile(all_scores, 75)

print("\nOverall score distribution:")
print(f"  25th percentile: {p25:.2f}")
print(f"  50th percentile (median): {p50:.2f}")
print(f"  75th percentile: {p75:.2f}")

**Quiz Difficulty Insights:**  
argmax and argmin identified the easiest and hardest quizzes based on average scores. Percentiles show the overall score distribution - 25% of scores are below the first quartile, 50% below the median, and 75% below the third quartile. This gives a complete picture of class performance!
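
np.percentile() accepts the axis parameter too, so the per-quiz view used for averages also works for medians, which are less sensitive to a single very high or very low score. A small sketch using only the three students with no missing scores (Bob, Charlie, Eve); restricting to these rows is a simplification made here for illustration.

```python
import numpy as np

# Rows with no missing scores from the dataset above (Bob, Charlie, Eve)
complete_scores = np.array([[78, 72, 75, 70],
                            [95, 98, 92, 97],
                            [88, 85, 90, 87]])

# axis=0 collapses students, giving one median per quiz
quiz_medians = np.percentile(complete_scores, 50, axis=0)
print(f"Median per quiz: {quiz_medians}")

# Quiz with the lowest median, numbered from 1 like the output above
hardest_by_median = np.argmin(quiz_medians) + 1
print(f"Lowest median: Quiz {hardest_by_median}")
```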

## Comprehensive Analysis Summary

You've just completed a full data analysis workflow! Let's review what techniques you used:

1. **Boolean Indexing:** Found missing scores (-1), filtered for top performers and students needing help
2. **Fancy Indexing:** Used argsort results to reorder students and names together
3. **Aggregation Functions:** Calculated averages with mean() and located extremes with argmax/argmin (these are reductions rather than element-wise ufuncs)
4. **Statistical Methods with Axis:** Computed per-student averages (axis=1) and per-quiz averages (axis=0)
5. **Array Cleaning:** Replaced invalid data with calculated estimates
6. **Sorting:** Ranked students from best to worst performance
7. **Percentiles:** Analyzed overall score distribution

This is the foundation of data analysis with NumPy - every technique you learned today combines to solve real problems!

---
# Lecture Summary

## Key Concepts Mastered Today:

**Boolean Indexing:**
- Create boolean masks with comparison operators (>, <, ==, >=, <=, !=)
- Combine conditions with & (and), | (or), ~ (not) - remember parentheses!
- Modify values in-place based on conditions

**Fancy Indexing:**
- Select non-contiguous elements with integer arrays
- Understand copies vs views (fancy indexing creates copies)
- Select specific elements in 2D arrays with paired indices

**Universal Functions:**
- Apply element-wise operations efficiently (np.sqrt, np.exp, np.log)
- Use binary ufuncs (np.maximum, np.minimum)
- Leverage out parameter for memory efficiency

**Statistical Methods:**
- Calculate aggregations (mean, sum, std, min, max)
- **MASTER THE AXIS PARAMETER:** axis=0 goes DOWN rows, axis=1 goes ACROSS columns
- Use cumulative operations (cumsum, cumprod)
- Apply boolean array methods (any, all, sum)
- Analyze distributions with percentiles

**Sorting and Unique:**
- Sort in-place with .sort() or create copies with np.sort()
- Use argsort() to sort multiple related arrays together
- Find unique values and test membership

## Connection to Assignment 4:

The techniques you learned today directly prepare you for Assignment 4:
- **Problem 2:** Basic statistics (mean, sum, std)
- **Problem 3:** Boolean indexing for filtering wins/losses
- **Problem 5:** Combining boolean indexing, percentiles, and np.where()
- **Problem 6:** Broadcasting and normalization

**Excellent work today - you've mastered the core data manipulation techniques that make NumPy so powerful for data science!**