Calculus for Machine Learning — Complete Reference
Holberton · DLH · Machine Learning

A complete reference covering derivatives, chain rule, gradient descent, backpropagation, and optimization — every concept you need to understand how neural networks learn.

Blue panels — NumPy code
Green panels — Pure Python (no NumPy)
How to read this guide
🚀
Quick Scan
Read the blue Concept boxes and ML Application callouts only. Takes ~15 min per Part.
📖
Full Learning
Follow each section top-to-bottom: Concept → Math → Code → Application. Deep understanding in ~3 hours.
🎓
Exam Prep
Focus on Chain Rule, Gradient Descent, and Backpropagation. These are what Holberton tests most heavily.
Part I

Foundations

The building blocks of calculus — what functions are, how they change, and how we measure that change precisely. These concepts form the bedrock of every optimization algorithm in ML.

After Part I you will be able to:
  • Explain why calculus is essential for training neural networks
  • Plot and reason about common ML functions
  • Define the derivative as a limit and compute it numerically
  • Understand the geometric meaning of the derivative as slope
  • Implement numerical differentiation in pure Python
01

What is Calculus? Beginner

Calculus is the mathematics of change. It gives us two superpowers: (1) measuring how fast things change (differentiation) and (2) accumulating tiny changes into totals (integration). In ML, we use differentiation to figure out which direction to adjust model parameters to reduce error.
🧠 Mental model: Imagine driving a car. Your speedometer shows the derivative — the rate of change of position. Your odometer shows the integral — the total distance accumulated. Training a neural network is like adjusting the steering wheel (weights) based on how fast the error is changing (gradient).

Calculus has two fundamental branches:

Differential Calculus

Studies rates of change. Given a function f(x), the derivative f'(x) tells us how fast f is changing at each point x. This is how we compute gradients to train models.

Integral Calculus

Studies accumulation. Given a rate of change, integration recovers the total quantity. In ML, integrals appear in probability distributions, expected values, and KL divergence.

Why ML needs calculus
Diagram (How Neural Networks Learn): Model makes prediction ŷ → Compare with true label y → Compute loss L(ŷ, y) → Take derivative ∂L/∂w (calculus!) → Update weights w ← w − α·∂L/∂w → back to prediction
🎯 ML Application: Every time you call loss.backward() in PyTorch, you are using differential calculus. The framework computes ∂Loss/∂w for every weight w in your network, then gradient descent uses those derivatives to update each weight in the direction that reduces the loss.
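The loop described above (predict, measure the loss, differentiate, update) can be sketched in a few lines of pure Python. This is only an illustration, not how a framework implements it: the one-weight model ŷ = w·x, the squared-error loss, the data point (x = 2, y = 6), and the learning rate α = 0.1 are all assumptions chosen for the example.

```python
# Toy training loop: one weight, one training example (all values are illustrative).
x, y = 2.0, 6.0      # input and true label; the ideal weight is 3.0
w = 0.0              # initial weight
alpha = 0.1          # learning rate

for step in range(50):
    y_hat = w * x                   # 1. model makes a prediction
    loss = (y_hat - y) ** 2         # 2. compare with the label and compute the loss
    dL_dw = 2 * (y_hat - y) * x     # 3. derivative of the loss with respect to w
    w = w - alpha * dL_dw           # 4. move the weight against the slope

print(round(w, 4))
```

Each pass shrinks the distance between w and 3.0, which is the behavior the update rule w ← w − α·∂L/∂w is designed to produce.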
02

Functions & Graphs Beginner

A function maps each input to exactly one output: f(x) = y. Functions are the objects we differentiate. In ML, activation functions (sigmoid, ReLU), loss functions (MSE, cross-entropy), and the entire model itself are all functions.
🧠 Mental model: A function is a machine — you feed in a number, it spits out another number. The graph of a function is a photograph of every input-output pair plotted on a 2D plane.

Key ML functions you'll encounter throughout this guide:

Linear: f(x) = mx + b

Straight line. The simplest model. Linear regression is just finding the best m and b.

Quadratic: f(x) = x²

Parabola. MSE loss is quadratic — this is why gradient descent works so well on it (convex!).

Sigmoid: σ(x) = 1/(1+e⁻ˣ)

S-curve. Squashes any input to (0, 1). Used in binary classification and gating.

ReLU: f(x) = max(0, x)

The most popular activation. Zero for negatives, identity for positives. Simple derivative.

With NumPy
import numpy as np

# Define common functions
x = np.linspace(-5, 5, 100)

linear  = 2 * x + 1
quad    = x ** 2
sigmoid = 1 / (1 + np.exp(-x))
relu    = np.maximum(0, x)

# Evaluate at a point
sigmoid_at_0 = 1 / (1 + np.exp(0))
print(sigmoid_at_0)  # 0.5
Pure Python (no NumPy)
import math

def linear(x, m=2, b=1):
    return m * x + b

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def relu(x):
    return max(0, x)

# Plot manually: generate (x, y) pairs
xs = [i * 0.1 for i in range(-50, 51)]
ys = [sigmoid(x) for x in xs]
print(sigmoid(0))  # 0.5
🎯 ML Application: Your entire neural network is a composition of functions: output = f₃(f₂(f₁(x))). Each layer applies a linear transformation (matmul + bias) followed by a nonlinear activation function. Calculus lets us differentiate through this entire chain.
03

Limits & Continuity Beginner

A limit describes the value a function approaches as its input gets closer and closer to some point. Limits are the foundation of derivatives — a derivative is a limit. Continuity means a function has no jumps or holes.
🧠 Mental model: Walk toward a cliff edge. The limit is the edge — you can get arbitrarily close without stepping off. The function's value at that point might differ from the limit (a "hole"), or it might match perfectly (continuous).
lim(x→a) f(x) = L
"As x approaches a, f(x) approaches L"

Three key properties of limits that matter for ML:

1. Sum rule: lim[f(x) + g(x)] = lim f(x) + lim g(x)
2. Product rule: lim[f(x) · g(x)] = lim f(x) · lim g(x)
3. Composition rule: lim f(g(x)) = f(lim g(x)) (if f is continuous at lim g(x))
Numerical Limit with NumPy
# lim x→0 sin(x)/x = 1
import numpy as np

# Approach from right
h_values = [0.1, 0.01, 0.001, 0.0001]
for h in h_values:
    print(f"h={h}: {np.sin(h)/h:.8f}")

# h=0.1:    0.99833417
# h=0.01:   0.99998333
# h=0.001:  0.99999983
# h=0.0001: 1.00000000
# → Approaches 1!
Pure Python
import math

def numerical_limit(f, a, h_values=None):
    """Estimate lim x→a f(x)"""
    if h_values is None:
        h_values = [10**-i
                    for i in range(1,9)]
    results = []
    for h in h_values:
        results.append(f(a + h))
    return results[-1]

# lim x→0 sin(x)/x
limit = numerical_limit(
    lambda x: math.sin(x)/x, 0
)
print(limit)  # ≈ 1.0
⚠️ Why continuity matters: Continuity alone does not guarantee a derivative. ReLU is continuous everywhere but not differentiable at x=0 (it has a kink). The difference quotient approaches 0 from the left and 1 from the right, so the two-sided limit that would define the derivative does not exist. In practice, we define the derivative at 0 as either 0 or 1, which works because we almost never land exactly at 0 during training.
🎯 ML Application: Gradient-based optimization assumes the loss landscape is smooth (differentiable almost everywhere). When we use ReLU or other piecewise functions, we rely on the fact that the set of non-differentiable points has measure zero — limits and continuity formalize this.
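The kink in ReLU at x=0 can be seen numerically by computing the difference quotient from each side. The helper name `one_sided_slopes` and the step size h = 1e-6 are assumptions made for this sketch.

```python
def relu(x):
    return max(0.0, x)

def one_sided_slopes(f, x, h=1e-6):
    """Return (left, right) difference quotients of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

left, right = one_sided_slopes(relu, 0.0)
print(left, right)  # 0.0 1.0
```

The two one-sided slopes disagree, so no single tangent slope exists at 0. At any other point the two values agree up to rounding.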
04

The Derivative (Geometric View) Beginner

The derivative f'(x) is the slope of the tangent line to f(x) at point x. It tells you: if you nudge x by a tiny amount, how much does f(x) change? Positive derivative = function going up. Negative = going down. Zero = flat (potential minimum or maximum).
🧠 Mental model: Place a ruler tangent to a hill at your feet. The angle of the ruler is the derivative. Steep uphill = large positive derivative. Flat plateau = zero derivative. Steep downhill = large negative derivative.
f'(x) = slope of tangent line at x
f'(x) > 0 → increasing · f'(x) < 0 → decreasing · f'(x)=0 → critical point
Mathematical Explanation & Example:

A derivative measures how a function changes as its input changes. Mathematically, it's defined as the limit of the average rate of change as the interval approaches zero:

f'(x) = lim(h→0) [f(x+h) - f(x)] / h

Example: Let f(x) = x². We want to find the derivative at x = 3.

  • Using the definition, the rate of change from x to x+h is ((x+h)² - x²) / h.
  • Expanding the numerator: x² + 2xh + h² - x² = 2xh + h².
  • Dividing by h gives: 2x + h.
  • As h → 0, this expression becomes 2x.
  • So, the derivative is f'(x) = 2x. At x = 3, the slope is 2(3) = 6.
The derivative tells gradient descent which way to go!
Diagram (Derivative as Slope):
  • f'(x) > 0 (going uphill ↗) → move left (decrease x)
  • f'(x) < 0 (going downhill ↘) → move right (increase x)
  • f'(x) = 0 (flat, possibly at a minimum) → stop (converged)
With NumPy
import numpy as np

# f(x) = x² → f'(x) = 2x
def f(x): return x**2
def f_prime(x): return 2*x

x = np.array([-2, -1, 0, 1, 2])
print(f"f(x)  = {f(x)}")
# [4, 1, 0, 1, 4]
print(f"f'(x) = {f_prime(x)}")
# [-4, -2, 0, 2, 4]

# At x=-2: slope is -4 (steeply downhill)
# At x=0:  slope is  0 (minimum!)
# At x=2:  slope is  4 (steeply uphill)
Pure Python
# Numerical derivative approximation
def numerical_derivative(f, x, h=1e-7):
    """Central difference formula"""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x): return x ** 2

# Test at several points
for x in [-2, -1, 0, 1, 2]:
    slope = numerical_derivative(f, x)
    print(f"x={x}: f'={slope:.4f}")

# x=-2: f'=-4.0000
# x=-1: f'=-2.0000
# x= 0: f'= 0.0000  ← minimum!
# x= 1: f'= 2.0000
# x= 2: f'= 4.0000
🎯 ML Application: When training a model, the loss function L(w) depends on weights w. The derivative ∂L/∂w tells us the slope of the loss landscape at our current position. Gradient descent simply moves in the opposite direction of the slope: w ← w − α · ∂L/∂w. That's it — that's the entire algorithm.
05

Derivative from First Principles Beginner

The derivative is formally defined as a limit: f'(x) = lim(h→0) [f(x+h) − f(x)] / h. This is called the "first principles" or "limit definition" of the derivative. It measures the slope of a secant line as the two points merge into one.
🧠 Mental model: Draw a secant line between two points on a curve. Now slide the second point closer and closer to the first. The secant line rotates and becomes the tangent line in the limit. Its slope is the derivative.
f'(x) = lim(h→0) [f(x + h) − f(x)] / h
The limit of the difference quotient — slope of the secant as h shrinks to zero

Worked example: Derive f'(x) for f(x) = x² from first principles:

f'(x) = lim(h→0) [(x+h)² − x²] / h
      = lim(h→0) [x² + 2xh + h² − x²] / h
      = lim(h→0) [2xh + h²] / h
      = lim(h→0) (2x + h)
      = 2x
Three Numerical Approximations
import numpy as np

def f(x): return x**2

x = 3.0; h = 1e-7

# Forward difference
fwd = (f(x+h) - f(x)) / h
# Backward difference
bwd = (f(x) - f(x-h)) / h
# Central difference (best!)
ctr = (f(x+h) - f(x-h)) / (2*h)

print(f"Forward:  {fwd:.10f}")
print(f"Backward: {bwd:.10f}")
print(f"Central:  {ctr:.10f}")
print(f"Exact:    {2*x:.10f}")
# Central is most accurate!
Full Numerical Gradient
def numerical_gradient(f, params, h=1e-7):
    """Gradient for a list of params"""
    grad = []
    for i in range(len(params)):
        # Perturb param i
        params_plus = params[:]
        params_minus = params[:]
        params_plus[i] += h
        params_minus[i] -= h
        # Central difference
        g = (f(params_plus) - f(params_minus))
        g /= (2 * h)
        grad.append(g)
    return grad

# f(w0, w1) = w0² + 3*w1²
def loss(w):
    return w[0]**2 + 3*w[1]**2

g = numerical_gradient(loss, [2.0, 1.0])
print(g)  # ≈ [4.0, 6.0] (up to floating-point rounding)
💡 Gradient checking: Before trusting your backprop implementation, compare its gradients against the numerical gradient from first principles. If they differ by more than ~1e-5, you have a bug. This is called gradient checking and is a standard debugging technique.
🎯 ML Application: The numerical_gradient function above is exactly what gradient checking does in practice. Frameworks like PyTorch provide torch.autograd.gradcheck() which computes both the analytical gradient (via backprop) and the numerical gradient, then asserts they match.
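A gradient check can be sketched as follows. The relative-error formula, the step size h = 1e-5, and the deliberately broken `buggy_grad` are assumptions made for illustration; real frameworks use their own tolerances.

```python
def numerical_grad(f, params, h=1e-5):
    """Central-difference gradient of f at params (a list of floats)."""
    grad = []
    for i in range(len(params)):
        plus, minus = params[:], params[:]
        plus[i] += h
        minus[i] -= h
        grad.append((f(plus) - f(minus)) / (2 * h))
    return grad

def relative_error(a, b):
    """Largest elementwise relative difference between two gradients."""
    return max(abs(x - y) / max(abs(x), abs(y), 1e-12) for x, y in zip(a, b))

def loss(w):                      # toy loss: w0² + 3·w1²
    return w[0] ** 2 + 3 * w[1] ** 2

def analytic_grad(w):             # hand-derived gradient: [2·w0, 6·w1]
    return [2 * w[0], 6 * w[1]]

def buggy_grad(w):                # deliberately wrong (forgets the factor 3)
    return [2 * w[0], 2 * w[1]]

w = [2.0, 1.0]
ok_err = relative_error(analytic_grad(w), numerical_grad(loss, w))
bad_err = relative_error(buggy_grad(w), numerical_grad(loss, w))
print(ok_err < 1e-5, bad_err < 1e-5)  # True False
```

Using a relative rather than absolute difference keeps the check meaningful when gradients are very large or very small.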
Part II

Differentiation Rules

Instead of computing every derivative from the limit definition, we use rules that let us differentiate any function instantly. The chain rule is the single most important rule for ML — it powers backpropagation.

After Part II you will be able to:
  • Apply power, sum, product, quotient, and chain rules fluently
  • Differentiate any polynomial, exponential, or trigonometric function
  • Understand why the chain rule is the backbone of backpropagation
  • Look up any common derivative from the reference table
06

Power Rule Beginner

The power rule is the most used differentiation rule: bring down the exponent as a coefficient, then subtract one from the exponent. It handles all polynomials.
🧠 Mental model: x³ → bring down the 3, reduce to x²: 3x². Simple as moving the exponent to the front and decrementing.
d/dx [xⁿ] = n · xⁿ⁻¹
Works for any real n — integers, fractions, negatives

Examples:

d/dx [x³] = 3x²
d/dx [x⁻¹] = −x⁻² = −1/x²
d/dx [√x] = d/dx [x^(1/2)] = (1/2)·x^(−1/2) = 1/(2√x)
d/dx [5] = 0  (constant → derivative is 0)
With NumPy
import numpy as np

# f(x) = 3x⁴ + 2x² − 7x + 5
# f'(x) = 12x³ + 4x − 7

def f(x):
    return 3*x**4 + 2*x**2 - 7*x + 5

def f_prime(x):
    return 12*x**3 + 4*x - 7

x = np.array([0, 1, 2])
print(f_prime(x))  # [-7, 9, 97]
Pure Python
def power_rule(coeff, exp):
    """Apply power rule: d/dx [c·x^n]"""
    if exp == 0:
        return (0, 0)  # constant
    new_coeff = coeff * exp
    new_exp = exp - 1
    return (new_coeff, new_exp)

# Differentiate 3x⁴
c, e = power_rule(3, 4)
print(f"{c}x^{e}")
# 12x^3

# Differentiate √x = x^0.5
c, e = power_rule(1, 0.5)
print(f"{c}x^{e}")
# 0.5x^-0.5
🎯 ML Application: L2 regularization adds λ||w||² to the loss. Its derivative is 2λw — a direct application of the power rule. This is why L2 regularization is also called weight decay: it adds a gradient that pushes weights toward zero.
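The effect of the 2λw term can be sketched with a single update function. The function name `sgd_step_with_l2` and the values α = 0.1 and λ = 0.01 are assumptions chosen for the example.

```python
def sgd_step_with_l2(w, grad, alpha=0.1, lam=0.01):
    """One gradient step on loss + lam * ||w||^2.
    The penalty adds 2*lam*w_i to each gradient component (power rule)."""
    return [wi - alpha * (gi + 2 * lam * wi) for wi, gi in zip(w, grad)]

# With a zero data gradient only the penalty acts: each weight is
# multiplied by (1 - 2*alpha*lam) on every step, so it decays toward zero.
w = [1.0, -2.0]
for _ in range(3):
    w = sgd_step_with_l2(w, grad=[0.0, 0.0])
print(w)
```

The multiplicative shrink factor is where the name "weight decay" comes from in the plain gradient descent setting.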
07

Sum, Difference & Constant Rules Beginner

Differentiation is linear: the derivative of a sum is the sum of derivatives, and you can pull constants out front. This lets you differentiate any polynomial term by term.
d/dx [f(x) + g(x)] = f'(x) + g'(x)  Sum rule
d/dx [f(x) − g(x)] = f'(x) − g'(x)  Difference rule
d/dx [c · f(x)] = c · f'(x)  Constant multiple rule
d/dx [c] = 0  Constant rule

Example: Differentiate f(x) = 5x³ − 2x² + 7x − 3

f'(x) = 5·3x² − 2·2x + 7·1 − 0 = 15x² − 4x + 7
🎯 ML Application: A loss function is often a sum over all training examples: L = (1/N)Σ Lᵢ. The linearity of differentiation means ∂L/∂w = (1/N)Σ ∂Lᵢ/∂w — we can compute gradients for each example independently and average them. This is why mini-batch gradient descent works.
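The claim that the gradient of the averaged loss equals the average of the per-example gradients can be checked on a toy problem. The data, the one-weight model ŷ = w·x, and the squared-error loss are assumptions made for this sketch.

```python
# Illustrative data for a one-weight model y_hat = w * x with squared error.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]
w = 1.5

def example_grad(w, x, y):
    """d/dw of (w*x - y)^2 for a single example."""
    return 2 * (w * x - y) * x

def mean_loss(w):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Average of the per-example gradients ...
avg_of_grads = sum(example_grad(w, x, y) for x, y in zip(xs, ys)) / len(xs)

# ... versus the gradient of the averaged loss (central difference).
h = 1e-6
grad_of_avg = (mean_loss(w + h) - mean_loss(w - h)) / (2 * h)

print(abs(avg_of_grads - grad_of_avg) < 1e-6)  # True
```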
08

Product & Quotient Rules Intermediate

When two functions are multiplied or divided, their derivatives interact. The product rule says both functions contribute to the rate of change. The quotient rule follows from the product rule (together with the chain rule) applied to f(x) · [g(x)]⁻¹.
Product rule: d/dx [f·g] = f'·g + f·g'
Quotient rule: d/dx [f/g] = (f'·g − f·g') / g²

Product rule example: d/dx [x² · eˣ]

f(x) = x²,  f'(x) = 2x
g(x) = eˣ,  g'(x) = eˣ
(fg)' = 2x·eˣ + x²·eˣ = eˣ(2x + x²)
With NumPy
import numpy as np

# f(x) = x² · eˣ
# f'(x) = eˣ(2x + x²)

def f(x):
    return x**2 * np.exp(x)

def f_prime(x):
    return np.exp(x) * (2*x + x**2)

x = np.array([0, 1, 2])
print(f_prime(x))
# ≈ [0.0, 8.155, 59.112]
Pure Python
import math

def product_rule(f, f_p, g, g_p, x):
    """(fg)' = f'g + fg'"""
    return f_p(x)*g(x) + f(x)*g_p(x)

# f=x², f'=2x, g=eˣ, g'=eˣ
result = product_rule(
    lambda x: x**2,
    lambda x: 2*x,
    lambda x: math.exp(x),
    lambda x: math.exp(x),
    1.0
)
print(result)  # 8.1548
🎯 ML Application: The product rule appears when differentiating attention scores in Transformers (Q·Kᵀ), gated mechanisms in LSTMs (forget gate × cell state), and any layer that multiplies two learned quantities.
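The quotient rule from this section can be exercised the same way as the product rule. This sketch writes tanh as sinh/cosh (a standard identity) and compares the result with the closed form 1 − tanh²(x); the helper name `quotient_rule` is an assumption.

```python
import math

def quotient_rule(f, f_p, g, g_p, x):
    """(f/g)' = (f'·g − f·g') / g²"""
    return (f_p(x) * g(x) - f(x) * g_p(x)) / g(x) ** 2

# tanh(x) = sinh(x) / cosh(x), with sinh' = cosh and cosh' = sinh
x = 0.5
via_quotient = quotient_rule(math.sinh, math.cosh, math.cosh, math.sinh, x)
closed_form = 1 - math.tanh(x) ** 2

print(abs(via_quotient - closed_form) < 1e-12)  # True
```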
09

Chain Rule Intermediate

The chain rule is THE most important rule in all of machine learning. It tells us how to differentiate composed functions — functions inside functions. Since a neural network is just a chain of composed functions, backpropagation is literally the chain rule applied recursively.
🧠 Mental model: Imagine a production line in a factory. If the raw material speeds up by 2×, the first machine produces 3× faster, and the second machine produces 4× faster, the total speedup is 2 × 3 × 4 = 24×. You multiply the rates through each stage — that's the chain rule.
d/dx [f(g(x))] = f'(g(x)) · g'(x)
"Derivative of the outer function (evaluated at the inner) × derivative of the inner function"
The chain rule powers ALL of backpropagation!
Diagram (Chain): input x → inner function g → outer function f → final derivative dy/dx

Worked example: d/dx [sin(x²)]

outer: f(u) = sin(u) → f'(u) = cos(u)
inner: g(x) = x² → g'(x) = 2x
chain rule: f'(g(x)) · g'(x) = cos(x²) · 2x = 2x·cos(x²)

Multi-layer chain: d/dx [f(g(h(x)))] = f'(g(h(x))) · g'(h(x)) · h'(x)

This is exactly how backpropagation works through a neural network with 3 layers. Each layer is a function, and the chain rule multiplies the local derivatives through all layers.
With NumPy
import numpy as np

# f(x) = sigmoid(3x + 2)
# outer: sigmoid(u), inner: 3x+2

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

# Chain rule: σ'(3x+2) · 3
x = np.array([0, 1, 2])
inner = 3*x + 2
deriv = sigmoid_prime(inner) * 3
print(deriv)
# ≈ [0.315, 0.020, 0.001]
Pure Python — General Chain Rule
def chain_rule(outer_deriv, inner_deriv,
              inner_fn, x):
    """d/dx f(g(x)) = f'(g(x))·g'(x)"""
    u = inner_fn(x)        # g(x)
    df_du = outer_deriv(u)  # f'(g(x))
    du_dx = inner_deriv(x)  # g'(x)
    return df_du * du_dx

import math
# d/dx sin(x²) = cos(x²)·2x
result = chain_rule(
    lambda u: math.cos(u),  # f'
    lambda x: 2*x,          # g'
    lambda x: x**2,         # g
    1.0
)
print(result)  # ≈ 1.0806
🎯 ML Application: A 3-layer neural network computes y = σ(W₃ · σ(W₂ · σ(W₁ · x + b₁) + b₂) + b₃). To find ∂Loss/∂W₁, we apply the chain rule through all 3 layers. This recursive application IS backpropagation, popularized for neural networks by Rumelhart, Hinton & Williams in 1986. The chain rule is all you need.
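A scalar version of the 3-layer network makes the multiplication of local derivatives concrete. Everything here is a simplifying assumption for illustration: the weights are single numbers rather than matrices, the biases default to zero, and the derivative is taken of the output y rather than of a loss.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(x, w1, w2, w3, b1=0.0, b2=0.0, b3=0.0):
    """Scalar 3-layer network; returns every activation for reuse in backprop."""
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    a3 = sigmoid(w3 * a2 + b3)
    return a1, a2, a3

def dy_dw1(x, w1, w2, w3):
    """Chain rule from the output back to w1: multiply the local derivatives."""
    a1, a2, a3 = forward(x, w1, w2, w3)
    d3 = a3 * (1 - a3)          # sigmoid'(z3)
    d2 = a2 * (1 - a2)          # sigmoid'(z2)
    d1 = a1 * (1 - a1)          # sigmoid'(z1)
    return d3 * w3 * d2 * w2 * d1 * x

x, w1, w2, w3 = 1.0, 0.5, -1.2, 0.8
h = 1e-6
numeric = (forward(x, w1 + h, w2, w3)[2] - forward(x, w1 - h, w2, w3)[2]) / (2 * h)
analytic = dy_dw1(x, w1, w2, w3)
print(abs(analytic - numeric) < 1e-8)  # True
```

Storing the activations during the forward pass and reusing them on the way back is the same bookkeeping a full backpropagation implementation performs.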
10

Common Derivatives Reference Table Beginner

Your derivative cheat sheet. Memorize the first column; look up the rest as needed. The ML column tells you where each derivative appears in practice.
f(x) | f'(x) | ML Context
xⁿ | nxⁿ⁻¹ | Power rule: polynomials, L2 reg
eˣ | eˣ | Softmax, sigmoid, probability
ln(x) | 1/x | Cross-entropy loss, log-likelihood
sin(x) | cos(x) | Positional encodings (Transformers)
cos(x) | −sin(x) | Positional encodings
σ(x) = 1/(1+e⁻ˣ) | σ(x)(1−σ(x)) | Binary classification, gates
tanh(x) | 1 − tanh²(x) | RNN hidden states, normalization
ReLU(x) = max(0,x) | 0 if x<0, 1 if x>0 | Most popular activation function
|x| (absolute) | sign(x) | L1 regularization, MAE loss
aˣ | aˣ · ln(a) | Exponential learning rate decay
logₐ(x) | 1/(x·ln(a)) | Information theory
💡 Pro tip: You only need to truly memorize 5 derivatives for ML: xⁿ → nxⁿ⁻¹, eˣ → eˣ, ln(x) → 1/x, σ(x) → σ(1−σ), and ReLU → step function. Everything else can be derived from these using the chain rule.
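As one example of deriving a table entry from the memorized ones, aˣ can be rewritten as e^(x·ln a), and the chain rule then gives aˣ·ln(a). The sketch below checks this numerically; the function names and the test point a = 2, x = 3 are assumptions.

```python
import math

def exp_base_a_deriv(a, x):
    """d/dx a^x via a^x = e^(x·ln a): the outer e^u keeps its value,
    and the inner function x·ln a contributes the factor ln a."""
    return math.exp(x * math.log(a)) * math.log(a)

def central_diff(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

a, x = 2.0, 3.0
numeric = central_diff(lambda t: a ** t, x)
print(abs(exp_base_a_deriv(a, x) - numeric) < 1e-6)  # True
```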
Part III

ML Activation Functions & Their Derivatives

Activation functions introduce nonlinearity into neural networks. Without them, stacking linear layers would just give another linear function. Understanding their derivatives is essential for backpropagation.

After Part III you will be able to:
  • Implement sigmoid, tanh, ReLU, and softmax from scratch
  • Derive and implement their derivatives
  • Choose the right activation function for each layer
  • Understand vanishing gradient problems with sigmoid/tanh
  • Differentiate common loss functions (MSE, cross-entropy)
11

Sigmoid & Its Derivative Intermediate

The sigmoid function squashes any input into (0, 1). Its derivative has a beautiful property: σ'(x) = σ(x)(1 − σ(x)). The derivative is maximized at x=0 (where σ=0.5) and vanishes for large |x|. This "saturation" causes the vanishing gradient problem.
🧠 Mental model: Sigmoid is like a dimmer switch. At x=0 it's at 50% brightness, and the switch is most sensitive (high derivative). At extreme values, it's either fully off or fully on, and turning the knob barely changes anything (near-zero derivative).
σ(x) = 1 / (1 + e⁻ˣ)
σ'(x) = σ(x) · (1 − σ(x))
Max derivative: σ'(0) = 0.25

Derivation using chain rule:

σ(x) = (1 + e⁻ˣ)⁻¹
σ'(x) = −1·(1 + e⁻ˣ)⁻² · (−e⁻ˣ)  chain rule
     = e⁻ˣ / (1 + e⁻ˣ)²
     = [1/(1 + e⁻ˣ)] · [e⁻ˣ/(1 + e⁻ˣ)]
     = σ(x) · [1 − σ(x)]  ✓
With NumPy
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

x = np.linspace(-6, 6, 5)
print("σ(x):  ", np.round(sigmoid(x), 4))
print("σ'(x): ", np.round(sigmoid_deriv(x), 4))
# σ(x):   [0.0025, 0.0474, 0.5, 0.9526, 0.9975]
# σ'(x):  [0.0025, 0.0452, 0.25, 0.0452, 0.0025]
# ↑ Derivative peaks at 0.25 (x=0) and vanishes at extremes!
Pure Python
import math

def sigmoid(x):
    # Numerically stable version
    if x >= 0:
        z = math.exp(-x)
        return 1 / (1 + z)
    else:
        z = math.exp(x)
        return z / (1 + z)

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1 - s)

# Verify max at x=0
print(sigmoid_deriv(0))   # 0.25
print(sigmoid_deriv(5))   # 0.0066
print(sigmoid_deriv(-5))  # 0.0066
Figure: Sigmoid vs Its Derivative
⚠️ Vanishing gradient problem: The maximum value of σ'(x) is only 0.25. In a 10-layer network, gradients get multiplied through each layer: 0.25¹⁰ = 0.00000095. The gradient essentially disappears — early layers barely learn. This is why deep networks switched from sigmoid to ReLU.
🎯 ML Application: Sigmoid is still used in (1) binary classification output layers to produce probabilities, (2) LSTM gates (forget/input/output gates), and (3) attention mechanisms. But it's almost never used as a hidden layer activation anymore due to vanishing gradients.
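The shrinking described in the vanishing gradient warning can be tabulated directly. This sketch looks only at the best case, where every sigmoid derivative is at its maximum of 0.25, and it ignores the weights; both are simplifying assumptions.

```python
def max_sigmoid_gradient_factor(n_layers):
    """Upper bound on the product of n sigmoid derivatives (each is at most 0.25).
    Weights are ignored here, which is a simplifying assumption."""
    return 0.25 ** n_layers

for n in (1, 5, 10, 20):
    print(n, max_sigmoid_gradient_factor(n))
```

Even in this best case the factor falls below one in a million by 10 layers, and real activations rarely sit exactly at x=0.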
12

Tanh & Its Derivative Intermediate

Tanh is a "centered" sigmoid: it maps inputs to (−1, 1) instead of (0, 1). This zero-centering helps gradient descent converge faster. Its derivative: tanh'(x) = 1 − tanh²(x). Maximum derivative is 1.0 (at x=0) — better than sigmoid's 0.25.
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
tanh'(x) = 1 − tanh²(x)
Note: tanh(x) = 2σ(2x) − 1 (scaled sigmoid!)
With NumPy
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_deriv(x):
    t = np.tanh(x)
    return 1 - t**2

x = np.array([-2, -1, 0, 1, 2])
print("tanh:  ", np.round(tanh(x), 4))
print("tanh': ", np.round(tanh_deriv(x), 4))
# tanh:   [-0.9640, -0.7616, 0.0, 0.7616, 0.9640]
# tanh':  [0.0707, 0.4200, 1.0, 0.4200, 0.0707]
Pure Python
import math

def tanh(x):
    ep = math.exp(x)
    em = math.exp(-x)
    return (ep - em) / (ep + em)

def tanh_deriv(x):
    t = tanh(x)
    return 1 - t**2

# Verify: max deriv at x=0 is 1.0
print(tanh_deriv(0))  # 1.0
print(tanh_deriv(3))  # 0.0099
🎯 ML Application: Tanh is preferred over sigmoid in RNN/LSTM hidden states because its output is zero-centered. In Transformers, tanh appears in GELU (the default activation): GELU(x) ≈ 0.5x(1 + tanh(√(2/π)(x + 0.044715x³))).
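The tanh form of GELU quoted above is an approximation of x·Φ(x), where Φ is the standard normal CDF. The sketch below compares the two using `math.erf`; the grid of test points and the 1e-2 tolerance are assumptions chosen for illustration.

```python
import math

def gelu_exact(x):
    """x · Φ(x), with the standard normal CDF written via the error function."""
    return x * 0.5 * (1 + math.erf(x / math.sqrt(2)))

def gelu_tanh(x):
    """The tanh approximation of GELU."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x ** 3)))

xs = [i * 0.5 for i in range(-8, 9)]          # -4.0 ... 4.0 in steps of 0.5
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(max_gap < 1e-2)  # True
```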
13

ReLU Family Intermediate

ReLU (Rectified Linear Unit) greatly reduces the vanishing gradient problem. Its derivative is either 0 or 1, so active units pass gradients through without shrinking them, which lets deep networks train effectively. Variants like Leaky ReLU, ELU, and GELU address ReLU's weaknesses.
🧠 Mental model: ReLU is a gate that blocks negative signals and passes positive ones unchanged. If a neuron's input is negative, it outputs 0 (it is inactive for that input; a neuron that is inactive for every input is called "dead"). If positive, it passes the signal straight through with no distortion.
ReLU: f(x) = max(0, x)  →  f'(x) = 0 if x<0, 1 if x>0
Leaky ReLU: f(x) = max(αx, x)  →  f'(x) = α if x<0, 1 if x>0  (α≈0.01)
ELU: f(x) = x if x>0, α(eˣ−1) if x≤0  →  f'(x) = 1 if x>0, f(x)+α if x≤0
GELU: f(x) = x · Φ(x)  (Φ = standard normal CDF, used in GPT/BERT)
With NumPy — All Variants
import numpy as np

def relu(x): return np.maximum(0, x)
def relu_deriv(x): return (x > 0).astype(float)

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a*x)
def leaky_relu_deriv(x, a=0.01):
    return np.where(x > 0, 1, a)

def gelu(x):
    return 0.5*x*(1+np.tanh(
        np.sqrt(2/np.pi)*(x+0.044715*x**3)))

x = np.array([-2, -1, 0, 1, 2])
print("ReLU:  ", relu(x))
# [0, 0, 0, 1, 2]
Pure Python
def relu(x):
    return max(0, x)

def relu_deriv(x):
    return 1 if x > 0 else 0

def leaky_relu(x, alpha=0.01):
    return x if x > 0 else alpha * x

def leaky_relu_deriv(x, alpha=0.01):
    return 1 if x > 0 else alpha

# The "dying ReLU" problem:
# If a neuron always receives
# negative input, its gradient
# is always 0 → it never learns!
# Leaky ReLU fixes this by
# allowing a small gradient (α)
# for negative inputs.
🎯 ML Application: ReLU is the default for hidden layers in CNNs and MLPs. GELU is the default for Transformers (GPT, BERT). Leaky ReLU is preferred when you suspect dying neurons. ELU sometimes gives faster convergence by producing negative outputs.
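ELU appears in the formulas for this section but not in the code panels. A pure-Python sketch, assuming the common default α = 1.0, checks the stated derivative f(x) + α for x ≤ 0 against a central difference.

```python
import math

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1)

def elu_deriv(x, alpha=1.0):
    """1 for x > 0; alpha·e^x, which equals elu(x) + alpha, for x <= 0."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha

h = 1e-6
for x in (-2.0, -0.5, 1.5):
    numeric = (elu(x + h) - elu(x - h)) / (2 * h)
    print(x, abs(elu_deriv(x) - numeric) < 1e-6)
```

Unlike ReLU, the derivative for negative inputs is small but never exactly zero, so a unit can still recover from a run of negative inputs.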
14

Softmax & Cross-Entropy Intermediate

Softmax converts a vector of raw scores (logits) into probabilities that sum to 1. Combined with cross-entropy loss, the gradient simplifies beautifully to ŷ − y (predicted minus true). This elegance is why softmax + cross-entropy is the standard classification pipeline.
softmax(z)ᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)
cross-entropy: L = −Σᵢ yᵢ · log(ŷᵢ)
Combined gradient: ∂L/∂zᵢ = ŷᵢ − yᵢ (beautifully simple!)
With NumPy
import numpy as np

def softmax(z):
    # Numerically stable
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(y_true, y_pred):
    return -np.sum(y_true * np.log(y_pred + 1e-15))

# Example: 3-class classification
logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)
# [0.659, 0.242, 0.099]
print(probs.sum())  # 1.0

# Gradient: simply ŷ − y!
y_true = np.array([1, 0, 0])  # one-hot
grad = probs - y_true
print(grad)
# [-0.341, 0.242, 0.099]
Pure Python
import math

def softmax(z):
    # Subtract max for stability
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred):
    return -sum(
        yt * math.log(yp + 1e-15)
        for yt, yp in zip(y_true, y_pred)
    )

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print([round(p, 3) for p in probs])
# [0.659, 0.242, 0.099]
🎯 ML Application: Softmax + cross-entropy is the standard output layer for multi-class classification networks. Because the gradient simplifies to ŷ − y, backprop through the last layer is simple and numerically stable. PyTorch's nn.CrossEntropyLoss takes raw logits and combines log-softmax and negative log-likelihood into one operation.
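The combined gradient ŷ − y can be verified numerically by differentiating the loss with respect to the logits. The helper name `loss_from_logits` and the step size are assumptions made for this sketch; the logits match the 3-class example in this section.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_from_logits(z, y_true):
    """Cross-entropy of softmax(z) against a one-hot target."""
    return -sum(yt * math.log(p) for yt, p in zip(y_true, softmax(z)))

logits = [2.0, 1.0, 0.1]
y_true = [1, 0, 0]

analytic = [p - yt for p, yt in zip(softmax(logits), y_true)]   # ŷ − y

h = 1e-6
numeric = []
for i in range(len(logits)):
    plus, minus = logits[:], logits[:]
    plus[i] += h
    minus[i] -= h
    numeric.append(
        (loss_from_logits(plus, y_true) - loss_from_logits(minus, y_true)) / (2 * h)
    )

print(all(abs(a - n) < 1e-6 for a, n in zip(analytic, numeric)))  # True
```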
15

Loss Functions & Their Gradients Intermediate

The loss function measures how wrong the model is. Its gradient tells us how to adjust each parameter to reduce that wrongness. Different tasks need different loss functions.
Loss → Gradient → Update → Repeat!
Diagram (Loss Functions by Task):
  • Regression (continuous output) → MSE loss: L = (1/n)Σ(ŷ−y)²
  • Regression (continuous output) → MAE loss: L = (1/n)Σ|ŷ−y|
  • Classification (categorical) → Binary CE: L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
  • Classification (categorical) → Categorical CE: L = −Σ yᵢ·log(ŷᵢ)
Loss | Formula (per example) | Gradient ∂L/∂ŷ
MSE | (ŷ − y)² | 2(ŷ − y)
MAE | |ŷ − y| | sign(ŷ − y)
Binary CE | −y·log(ŷ) − (1−y)·log(1−ŷ) | (ŷ − y) / [ŷ(1 − ŷ)]
Categorical CE | −Σ yᵢ·log(ŷᵢ) | −yᵢ/ŷᵢ (softmax then combines to ŷ−y)
Huber | ½(ŷ−y)² if |ŷ−y|≤δ, else δ|ŷ−y|−½δ² | (ŷ−y) if |ŷ−y|≤δ, else δ·sign(ŷ−y)
With NumPy
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true)**2)

def mse_grad(y_true, y_pred):
    n = len(y_true)
    return 2 * (y_pred - y_true) / n

def binary_ce(y_true, y_pred):
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1-eps)
    return -np.mean(
        y_true*np.log(y_pred) +
        (1-y_true)*np.log(1-y_pred)
    )

y = np.array([1, 0, 1])
yh = np.array([0.9, 0.1, 0.8])
print(f"MSE:  {mse(y, yh):.4f}")
print(f"BCE:  {binary_ce(y, yh):.4f}")
Pure Python
import math

def mse(y_true, y_pred):
    n = len(y_true)
    return sum((yp-yt)**2
        for yt,yp in zip(y_true,y_pred)) / n

def mse_grad(y_true, y_pred):
    n = len(y_true)
    return [2*(yp-yt)/n
        for yt,yp in zip(y_true,y_pred)]

def binary_ce(y_true, y_pred):
    eps = 1e-15
    n = len(y_true)
    total = 0
    for yt, yp in zip(y_true, y_pred):
        yp = max(eps, min(1-eps, yp))
        total -= (yt*math.log(yp) +
                  (1-yt)*math.log(1-yp))
    return total / n
🎯 ML Application: Choosing the right loss function is one of the most important design decisions: MSE for regression, Cross-Entropy for classification, Huber for robust regression. The gradient of the loss function is what flows backward through the network during training.
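The Huber loss from the table has no implementation in the panels above. A pure-Python sketch following the table's formula, with δ = 1.0 and the sample data as assumptions:

```python
def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        r = yp - yt
        if abs(r) <= delta:
            total += 0.5 * r ** 2
        else:
            total += delta * abs(r) - 0.5 * delta ** 2
    return total / len(y_true)

def huber_grad(y_true, y_pred, delta=1.0):
    """Gradient with respect to each prediction (includes the 1/n from the mean)."""
    n = len(y_true)
    grads = []
    for yt, yp in zip(y_true, y_pred):
        r = yp - yt
        g = r if abs(r) <= delta else delta * (1 if r > 0 else -1)
        grads.append(g / n)
    return grads

y = [1.0, 0.0, 2.0]
yh = [1.5, 0.0, 5.0]          # last prediction is an outlier (residual 3.0)
print(huber(y, yh), huber_grad(y, yh))
```

The outlier's gradient is capped at δ/n, whereas under MSE it would grow in proportion to the residual. That cap is what makes Huber robust.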
Part IV

Integral Calculus

Integration is the reverse of differentiation — it accumulates tiny changes into totals. In ML, integrals appear in probability distributions, expected values, and information theory.

After Part IV you will be able to:
  • Understand the integral as area under a curve
  • Apply the Fundamental Theorem of Calculus
  • Implement numerical integration in pure Python
  • Recognize where integrals appear in ML (expectations, KL divergence)
16

The Integral (Area Under the Curve) Intermediate

If the derivative measures the rate of change, the integral measures the total accumulated change. Geometrically, the definite integral ∫ₐᵇ f(x)dx is the signed area between f(x) and the x-axis from a to b.
🧠 Mental model: If your speedometer (derivative) shows 60 mph for 2 hours, the integral tells you the total distance: 120 miles. Integration sums up infinitely many tiny slices of "speed × time".
Definite integral: ∫ₐᵇ f(x) dx = F(b) − F(a)  where F'(x) = f(x)
Indefinite integral: ∫ f(x) dx = F(x) + C
Mathematical Explanation & Example:

An integral computes the total accumulation of a quantity, visualized as the area under a curve. If a function f(x) describes the rate of change, its integral F(x) (the antiderivative) gives the total amount.

Example: Calculate the definite integral of f(x) = 2x from x = 0 to x = 3.

  • First, find the indefinite integral. What function's derivative is 2x? The power rule in reverse tells us F(x) = x².
  • Using the Fundamental Theorem of Calculus: ∫₀³ 2x dx = F(3) - F(0).
  • Evaluate at the bounds: (3)² - (0)² = 9 - 0 = 9.
  • The total area under the curve y = 2x between x = 0 and x = 3 is exactly 9.
Integration undoes differentiation!
Diagram (The Fundamental Connection): F(x) (antiderivative) → differentiate (d/dx) → f(x) (function) → integrate (∫ dx) → F(x)
With NumPy/SciPy
import numpy as np
from scipy import integrate

# ∫₀² x² dx = [x³/3]₀² = 8/3 ≈ 2.667
result, error = integrate.quad(
    lambda x: x**2, 0, 2
)
print(f"{result:.4f}")  # 2.6667

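# Numerical integration with the trapezoid rule
# (np.trapz was renamed np.trapezoid in NumPy 2.0)
trapezoid = np.trapezoid if hasattr(np, "trapezoid") else np.trapz
x = np.linspace(0, 2, 1000)
y = x**2
area = trapezoid(y, x)
print(f"{area:.4f}")    # 2.6667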
Pure Python — Riemann Sums
def riemann_sum(f, a, b, n=1000):
    """Left Riemann sum approximation"""
    dx = (b - a) / n
    total = 0
    for i in range(n):
        x = a + i * dx
        total += f(x) * dx
    return total

def trapezoidal(f, a, b, n=1000):
    """More accurate than Riemann"""
    dx = (b - a) / n
    total = (f(a) + f(b)) / 2
    for i in range(1, n):
        total += f(a + i * dx)
    return total * dx

# ∫₀² x² dx = 8/3
result = trapezoidal(
    lambda x: x**2, 0, 2
)
print(f"{result:.6f}")
# 2.666668 (exact value: 8/3 ≈ 2.666667)
🎯 ML Application: Integrals are everywhere in probability: the probability of an event is the integral of a PDF over a region. Expected values E[f(X)] = ∫f(x)p(x)dx are integrals. KL divergence between two distributions is an integral. Monte Carlo sampling is how we approximate integrals that have no closed form.
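Monte Carlo estimation of an expected value can be sketched with the standard library alone. The helper name `monte_carlo_expectation`, the sample count, and the seed are assumptions; the target E[X²] = 1 for a standard normal X is a known result used here as a check.

```python
import random

def monte_carlo_expectation(f, sampler, n=100_000):
    """Estimate E[f(X)] by averaging f over n samples drawn from sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n

rng = random.Random(0)            # fixed seed so the run is repeatable

# E[X²] for X ~ N(0, 1) is the variance, which is exactly 1.
estimate = monte_carlo_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0))
print(abs(estimate - 1.0) < 0.05)  # True
```

The error of such an estimate shrinks roughly like 1/√n, so each extra digit of accuracy costs about 100 times more samples.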
17

Fundamental Theorem of Calculus Intermediate

The FTC bridges differentiation and integration: if F'(x) = f(x), then ∫ₐᵇ f(x)dx = F(b) − F(a). Differentiation and integration are inverse operations.
🧠 Mental model: If you know the speed at every moment (derivative), integrating gives total distance (antiderivative). If you know total distance at every moment, differentiating gives speed.
d/dx [∫ₐˣ f(t) dt] = f(x)
Part 1: The derivative of an integral is the original function
∫ₐᵇ f(x) dx = F(b) − F(a)
Part 2: Evaluate a definite integral using any antiderivative F
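Part 1 can be checked numerically. The sketch below is a rough sanity check, not a proof. `area_up_to` and `derivative_of_area` are hypothetical helper names, and the choice of f = cos, the point x = 1, and the step sizes are assumptions made for illustration. It builds A(x) = ∫₀ˣ cos(t) dt with the trapezoidal rule and then differentiates A with a central difference.

```python
import math

def area_up_to(f, a, x, n=2000):
    """Trapezoidal estimate of the integral of f from a to x."""
    dx = (x - a) / n
    total = (f(a) + f(x)) / 2
    for i in range(1, n):
        total += f(a + i * dx)
    return total * dx

def derivative_of_area(f, a, x, h=1e-3):
    """Central-difference estimate of d/dx of the area function."""
    return (area_up_to(f, a, x + h) - area_up_to(f, a, x - h)) / (2 * h)

x0 = 1.0
slope = derivative_of_area(math.cos, 0.0, x0)
# FTC Part 1 says the slope of the area function equals the integrand
print(f"A'(1) = {slope:.4f}, cos(1) = {math.cos(x0):.4f}")
```

The two printed values should agree to several decimal places. Any small gap comes from the two numerical approximations, not from the theorem.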

Common antiderivatives:

∫ xⁿ dx = xⁿ⁺¹/(n+1) + C  (n ≠ −1)
∫ eˣ dx = eˣ + C
∫ 1/x dx = ln|x| + C
∫ cos(x) dx = sin(x) + C
∫ sin(x) dx = −cos(x) + C
With NumPy
import numpy as np

# Verify FTC: ∫₀³ 2x dx = [x²]₀³ = 9
# Antiderivative F(x) = x²
F = lambda x: x**2
exact = F(3) - F(0)
print(exact)  # 9

# Numerical check
x = np.linspace(0, 3, 10000)
numerical = np.trapz(2*x, x)
print(f"{numerical:.4f}")  # 9.0000
Pure Python
def integrate_poly(coeffs, a, b):
    """Exact integral of polynomial
    coeffs: [c0, c1, c2, ...]
    p(x) = c0 + c1*x + c2*x² + ...
    """
    # Antiderivative coeffs
    F_coeffs = [0]  # constant term
    for i, c in enumerate(coeffs):
        F_coeffs.append(c / (i + 1))

    def F(x):
        return sum(c * x**(i)
            for i, c in enumerate(F_coeffs))

    return F(b) - F(a)

# ∫₀³ 2x dx (coeffs = [0, 2])
print(integrate_poly([0, 2], 0, 3))
# 9.0
🎯 ML Application: The FTC guarantees that if a loss function has a known antiderivative, we can compute exact expectations. In variational autoencoders (VAEs), the ELBO involves integrals over latent distributions — we use the FTC when possible and Monte Carlo when not.
18

Integration Techniques Intermediate

Two essential techniques: substitution (the chain rule in reverse) and integration by parts (the product rule in reverse). Most ML integrals are solved numerically, but these techniques help build intuition.
Substitution: ∫ f(g(x))·g'(x) dx = ∫ f(u) du  where u = g(x)
By Parts: ∫ u·dv = u·v − ∫ v·du

Substitution example: ∫ 2x·e^(x²) dx

Let u = x²,  du = 2x dx
∫ eᵘ du = eᵘ + C = e^(x²) + C
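The substitution result can be sanity-checked on a definite integral. This is a minimal sketch under assumed bounds (0 to 1, chosen here for illustration): it compares a trapezoidal estimate of the integrand 2x·e^(x²) against the antiderivative e^(x²) evaluated at the bounds. The helper name `trapezoid` is hypothetical.

```python
import math

def trapezoid(f, a, b, n=10_000):
    """Trapezoidal-rule estimate of the integral of f over [a, b]."""
    dx = (b - a) / n
    total = (f(a) + f(b)) / 2
    for i in range(1, n):
        total += f(a + i * dx)
    return total * dx

integrand = lambda x: 2 * x * math.exp(x**2)   # f(g(x)) * g'(x) with g(x) = x^2
antiderivative = lambda x: math.exp(x**2)      # result of the substitution u = x^2

numeric = trapezoid(integrand, 0.0, 1.0)
exact = antiderivative(1.0) - antiderivative(0.0)  # e - 1
print(f"numeric = {numeric:.6f}, exact = {exact:.6f}")
```

If the substitution were wrong (for example, a missing factor of 2), the two numbers would visibly disagree.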
🎯 ML Application: Substitution (a change of variables) is the idea behind the reparameterization trick in VAEs: we substitute z = μ + σε so the expectation is taken over ε and the gradient can move inside it. Integration by parts appears in derivations such as score matching; the REINFORCE estimator in reinforcement learning instead rests on the log-derivative identity ∇p = p·∇log p.
19

Numerical Integration Intermediate

Most ML integrals have no closed-form solution. We approximate them numerically using Riemann sums, the trapezoidal rule, or Simpson's rule — or statistically using Monte Carlo sampling.
With NumPy — Simpson's Rule
import numpy as np

def simpsons(f, a, b, n=1000):
    """Simpson's 1/3 rule (most accurate)"""
    assert n % 2 == 0
    x = np.linspace(a, b, n+1)
    y = f(x)
    h = (b - a) / n
    return h/3 * (y[0] + y[-1]
        + 4*np.sum(y[1:-1:2])
        + 2*np.sum(y[2:-2:2]))

# ∫₀π sin(x) dx = 2
result = simpsons(np.sin, 0, np.pi)
print(f"{result:.10f}")
# 2.0000000000
Pure Python — Monte Carlo
import random, math

def monte_carlo_integrate(f, a, b, n=100000):
    """Approximate ∫ₐᵇ f(x) dx
    by sampling random points"""
    total = 0
    for _ in range(n):
        x = random.uniform(a, b)
        total += f(x)
    return (b - a) * total / n

# ∫₀π sin(x) dx = 2
result = monte_carlo_integrate(
    math.sin, 0, math.pi
)
print(f"{result:.4f}")
# ≈ 2.0000 (stochastic!)
🎯 ML Application: Monte Carlo sampling is the backbone of modern ML: training GANs, estimating expectations in VAEs, computing policy gradients in RL, and MCMC for Bayesian inference. When you can't compute an integral exactly, you sample.
20

Integrals in Machine Learning Intermediate

Integrals show up in three critical ML contexts: (1) computing expected values, (2) measuring distances between probability distributions (KL divergence), and (3) normalizing probability distributions.
Expected value: E[f(X)] = ∫ f(x)·p(x) dx
KL divergence: D_KL(P||Q) = ∫ p(x)·log(p(x)/q(x)) dx
Normalization: ∫ p(x) dx = 1
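As a rough illustration of the normalization and expected-value integrals, the sketch below integrates a standard normal density numerically. The helper names `normal_pdf` and `integrate`, the integration range [-8, 8], and the grid size are assumptions chosen for this example. The range is wide enough that the neglected tails are negligible.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and std sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, a, b, n=20_000):
    """Trapezoidal-rule estimate of the integral of f over [a, b]."""
    dx = (b - a) / n
    total = (f(a) + f(b)) / 2
    for i in range(1, n):
        total += f(a + i * dx)
    return total * dx

total_mass = integrate(normal_pdf, -8.0, 8.0)                            # normalization
mean = integrate(lambda x: x * normal_pdf(x), -8.0, 8.0)                 # E[X]
second_moment = integrate(lambda x: x * x * normal_pdf(x), -8.0, 8.0)    # E[X^2]

print(f"mass = {total_mass:.6f}, E[X] = {mean:.6f}, E[X^2] = {second_moment:.6f}")
```

The mass should come out close to 1, the mean close to 0, and the second moment close to 1 (the variance of a standard normal).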
Where integrals hide in ML
graph TD subgraph IntML ["Integrals in Machine Learning"] EV["Expected Value
E[X] = ∫x·p(x)dx"] --> LOSS["Expected Loss
Risk minimization"] KL["KL Divergence
∫p·log(p/q)dx"] --> VAE["VAE Training
(ELBO objective)"] NORM["Normalization
∫p(x)dx = 1"] --> SOFT["Softmax
Σ eᶻⁱ = partition fn"] end
KL Divergence — NumPy
import numpy as np

def kl_divergence(p, q):
    """KL(P||Q) for discrete dists"""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Avoid log(0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.4, 0.3, 0.3]
q = [0.33, 0.33, 0.34]
print(f"KL(P||Q) = {kl_divergence(p, q):.4f}")
# KL(P||Q) ≈ 0.0108
Expected Value — Pure Python
import math

def expected_value(values, probs):
    """E[X] = Σ xᵢ · p(xᵢ)"""
    return sum(x*p for x,p
               in zip(values, probs))

# Fair die: E[X] = 3.5
vals = [1,2,3,4,5,6]
probs = [1/6]*6
print(expected_value(vals, probs))
# 3.5

# Variance: E[X²] − E[X]²
ex2 = expected_value([x**2 for x in vals], probs)
ex = expected_value(vals, probs)
print(f"Var = {ex2 - ex**2:.4f}")
# Var = 2.9167
🎯 ML Application: The ELBO in VAEs = E_q[log p(x|z)] − D_KL(q(z|x) || p(z)). Both terms are integrals: the first is estimated via Monte Carlo sampling, the second has a closed form for Gaussians. Understanding integrals is essential for generative models.
20.5

Double Integrals Advanced

A double integral integrates a function of two variables over a 2D region. While a single integral computes the area under a 1D curve, a double integral computes the volume under a 2D surface z = f(x, y).
Double Integral: ∬_R f(x, y) dA = ∫_a^b ∫_c^d f(x, y) dy dx
Mathematical Explanation & Example:

To compute a double integral over a rectangular region, we perform iterated integration: integrate with respect to one variable first (treating the other as a constant), then integrate the result with respect to the second variable.

Example: Calculate the volume under f(x, y) = xy for x from 0 to 2, and y from 0 to 1.

  • Set up the integral: ∫₀² ( ∫₀¹ xy dy ) dx
  • Inner integral (w.r.t y): Treat x as a constant.
    • The integral of y is y²/2. So, ∫₀¹ xy dy = x * [y²/2]₀¹ = x * (1/2 - 0) = x/2.
  • Outer integral (w.r.t x): Now integrate the result w.r.t x.
    • ∫₀² (x/2) dx = (1/2) * [x²/2]₀² = (1/2) * (2 - 0) = 1.
  • The total volume under the surface is exactly 1.
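The iterated integral above can be checked with a simple numerical sketch. `double_integral` is a hypothetical helper using the midpoint rule on a rectangular grid. The grid sizes are arbitrary assumptions. For f(x, y) = xy the midpoint rule happens to be exact up to floating-point rounding, because the function is linear in each variable.

```python
def double_integral(f, ax, bx, ay, by, nx=200, ny=200):
    """Midpoint-rule estimate of the integral of f over [ax, bx] x [ay, by]."""
    dx = (bx - ax) / nx
    dy = (by - ay) / ny
    total = 0.0
    for i in range(nx):
        x = ax + (i + 0.5) * dx          # midpoint of the i-th x-cell
        for j in range(ny):
            y = ay + (j + 0.5) * dy      # midpoint of the j-th y-cell
            total += f(x, y)
    return total * dx * dy               # each cell contributes f(mid) * area

volume = double_integral(lambda x, y: x * y, 0.0, 2.0, 0.0, 1.0)
print(f"{volume:.6f}")  # close to 1, matching the hand calculation
```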
graph TD START["∫₀² ∫₀¹ xy dy dx"] START --> INNER["Inner Integral: ∫₀¹ xy dy"] INNER --> INNER_EVAL["Treat x as constant
[x·y²/2]₀¹ = x/2"] INNER_EVAL --> OUTER["Outer Integral: ∫₀² (x/2) dx"] OUTER --> OUTER_EVAL["[x²/4]₀² = 1"] OUTER_EVAL --> VOL["Final Answer
Volume = 1"] style START fill:#f8f9fa,stroke:#6c757d,color:#212529 style VOL fill:#f3e8ff,stroke:#9333ea,color:#6b21a8
🎯 ML Application: Double integrals (and multiple integrals in general) appear frequently in continuous probability. To find the probability that two continuous random variables fall within a certain region, you compute the double integral of their joint probability density function p(x, y). They are also used in marginalization (e.g., integrating out a variable p(x) = ∫ p(x,y) dy).
Part V

Multivariable Calculus & Optimization

Real ML models have millions of parameters. Partial derivatives, gradients, and optimization algorithms let us navigate this high-dimensional landscape to find parameter values that minimize the loss.

After Part V you will be able to:
  • Compute partial derivatives and gradient vectors
  • Implement gradient descent from scratch
  • Understand learning rate, momentum, and Adam
  • Explain what Jacobian and Hessian matrices encode
21

Partial Derivatives Intermediate

A partial derivative measures how a function changes when you vary ONE variable while holding all others constant. For f(x, y), ∂f/∂x treats y as a constant and differentiates with respect to x.
🧠 Mental model: Imagine standing on a hilly landscape (z = f(x,y)). ∂f/∂x is the slope if you walk purely East. ∂f/∂y is the slope if you walk purely North. Together they tell you the steepest uphill direction.
∂f/∂x = lim_{h→0} [f(x+h, y) − f(x, y)] / h
Treat y as a constant, differentiate with respect to x only
Mathematical Explanation & Example:

A partial derivative is used when a function depends on multiple variables, but we only want to see how it changes with respect to one variable, holding all others constant.

Example: Let f(x, y) = 3x²y + 2xy³ - 5

  • To find ∂f/∂x (Partial w.r.t x): Treat y as a constant number (like 4 or 10).
    • The derivative of 3x²y w.r.t x is 6xy (the 3y is just a constant multiplier of x²).
    • The derivative of 2xy³ w.r.t x is 2y³ (the 2y³ is a constant multiplier of x).
    • The derivative of -5 is 0.
    • So, ∂f/∂x = 6xy + 2y³.
  • To find ∂f/∂y (Partial w.r.t y): Treat x as a constant number.
    • The derivative of 3x²y w.r.t y is 3x² (the 3x² is a constant multiplier of y).
    • The derivative of 2xy³ w.r.t y is 2x * 3y² = 6xy².
    • So, ∂f/∂y = 3x² + 6xy².
graph TD F["f(x, y) = 3x²y + 2xy³ - 5"] F -->|∂/∂x| X["Hold y constant"] F -->|∂/∂y| Y["Hold x constant"] X --> X1["d(3x²y)/dx = 6xy d(2xy³)/dx = 2y³ d(-5)/dx = 0"] Y --> Y1["d(3x²y)/dy = 3x² d(2xy³)/dy = 6xy² d(-5)/dy = 0"] X1 --> RES_X["∂f/∂x = 6xy + 2y³"] Y1 --> RES_Y["∂f/∂y = 3x² + 6xy²"]
With NumPy
import numpy as np

def f(x, y):
    return 3*x**2*y + 2*x*y**3 - 5

def partial_x(f, x, y, h=1e-7):
    return (f(x+h, y) - f(x-h, y)) / (2*h)

def partial_y(f, x, y, h=1e-7):
    return (f(x, y+h) - f(x, y-h)) / (2*h)

print(partial_x(f, 1, 2))  # 28.0 (6·1·2 + 2·8)
print(partial_y(f, 1, 2))  # 27.0 (3·1 + 6·1·4)
Pure Python
def partial_derivative(f, args, idx, h=1e-7):
    """Partial deriv w.r.t. args[idx]"""
    args_p = list(args)
    args_m = list(args)
    args_p[idx] += h
    args_m[idx] -= h
    return (f(*args_p) - f(*args_m)) / (2*h)

def f(x, y):
    return 3*x**2*y + 2*x*y**3 - 5

# ∂f/∂x at (1,2)
print(partial_derivative(f, [1,2], 0))
# 28.0
# ∂f/∂y at (1,2)
print(partial_derivative(f, [1,2], 1))
# 27.0
🎯 ML Application: Each weight wᵢⱼ in a neural network gets its own partial derivative ∂L/∂wᵢⱼ. With millions of weights, backpropagation efficiently computes ALL partial derivatives in a single backward pass — much faster than computing each one individually.
22

The Gradient Vector Intermediate

The gradient ∇f bundles ALL partial derivatives into a single vector. It points in the direction of steepest ascent. Its negative, −∇f, points in the direction of steepest descent — which is exactly where gradient descent goes.
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
The gradient is a vector of all partial derivatives — it points "uphill"
∇f always points uphill — negate it to go downhill!
graph TD subgraph Grad ["Gradient Properties"] DIR["Direction
Points toward steepest ascent"] --> NEG["Negate it
−∇f = steepest DESCENT"] MAG["Magnitude
||∇f|| = steepness of slope"] --> FLAT["||∇f|| ≈ 0
means flat (near minimum)"] end
With NumPy
import numpy as np

def gradient(f, params, h=1e-7):
    """Compute ∇f numerically"""
    grad = np.zeros_like(params, dtype=float)
    for i in range(len(params)):
        p_plus = params.copy(); p_plus[i] += h
        p_minus = params.copy(); p_minus[i] -= h
        grad[i] = (f(p_plus) - f(p_minus)) / (2*h)
    return grad

# f(w) = w₀² + 3w₁² (bowl)
f = lambda w: w[0]**2 + 3*w[1]**2
w = np.array([4.0, 2.0])
print(gradient(f, w))  # [8.0, 12.0]
# Exact: [2w₀, 6w₁] = [8, 12] ✓
Pure Python
def gradient(f, params, h=1e-7):
    """Compute ∇f for list of params"""
    grad = []
    for i in range(len(params)):
        p_plus = params[:]
        p_minus = params[:]
        p_plus[i] += h
        p_minus[i] -= h
        g = (f(p_plus) - f(p_minus)) / (2*h)
        grad.append(g)
    return grad

def f(w):
    return w[0]**2 + 3*w[1]**2

print(gradient(f, [4.0, 2.0]))
# [8.0, 12.0]
🎯 ML Application: When you call loss.backward(), PyTorch computes ∇L with respect to every parameter in the model. The optimizer then uses this gradient to update each parameter. The gradient vector is THE central object in all of machine learning optimization.
23

Gradient Descent Intermediate

Gradient descent is THE optimization algorithm of deep learning. It iteratively adjusts parameters in the direction opposite to the gradient. Three words: compute gradient → step downhill → repeat.
🧠 Mental model: You're blindfolded on a mountain. To find the valley, feel the slope under your feet (gradient) and take a step downhill (negative gradient). Repeat until flat (minimum).
w ← w − α · ∇L(w)
α = learning rate · ∇L = gradient of loss · w = parameters
Mathematical Explanation & Example:

Let's trace one step of Gradient Descent to minimize f(w₀, w₁) = w₀² + 3w₁².

  • 1. Compute the Gradient: The partial derivatives are ∂f/∂w₀ = 2w₀ and ∂f/∂w₁ = 6w₁. So, ∇f(w) = [2w₀, 6w₁].
  • 2. Pick a Starting Point & Learning Rate: Let's start at w = [4, 2] with learning rate α = 0.1.
  • 3. Evaluate Gradient at Start: ∇f([4, 2]) = [2(4), 6(2)] = [8, 12].
  • 4. Take a Step: w_new = w - α · ∇f(w)
    w_new = [4, 2] - 0.1 · [8, 12]
    w_new = [4, 2] - [0.8, 1.2] = [3.2, 0.8]
  • Result: We moved from [4, 2] to [3.2, 0.8], getting much closer to the true minimum at [0, 0]!
The ENTIRE training loop in one line!
graph TD subgraph GD ["Gradient Descent Loop"] INIT["Initialize weights w
randomly"] --> FWD["Forward Pass
ŷ = model(x)"] FWD --> LOSS["Compute Loss
L = loss(ŷ, y)"] LOSS --> GRAD["Compute Gradient
∇L = ∂L/∂w"] GRAD --> UPDATE["Update Weights
w ← w − α·∇L"] UPDATE -->|Repeat| FWD end
With NumPy — Full GD
import numpy as np

def gradient_descent(f, grad_f, w0,
                     lr=0.01, epochs=100):
    w = w0.copy()
    history = [f(w)]
    for _ in range(epochs):
        g = grad_f(w)
        w = w - lr * g
        history.append(f(w))
    return w, history

# Minimize f(w) = w₀² + 3w₁²
f = lambda w: w[0]**2 + 3*w[1]**2
grad = lambda w: np.array([2*w[0], 6*w[1]])

w_opt, hist = gradient_descent(
    f, grad, np.array([4.0, 2.0]),
    lr=0.1, epochs=50)
print(f"w = {w_opt}")
print(f"f(w) = {f(w_opt):.6f}")
# w ≈ [0.0, 0.0]  f(w) ≈ 0.0
Pure Python
def gradient_descent(f, grad_f, w0,
                     lr=0.01, epochs=100):
    w = w0[:]
    for _ in range(epochs):
        g = grad_f(w)
        w = [wi - lr*gi
             for wi, gi in zip(w, g)]
    return w

def f(w): return w[0]**2 + 3*w[1]**2
def grad_f(w): return [2*w[0], 6*w[1]]

w = gradient_descent(
    f, grad_f, [4.0, 2.0],
    lr=0.1, epochs=50)
print([round(x,6) for x in w])
# ≈ [5.7e-05, 0.0], i.e. close to [0, 0]
Gradient Descent on a 2D Bowl
🎯 ML Application: Every optimizer.step() call in PyTorch executes one iteration of gradient descent (or a variant like Adam). The training loop is: (1) forward pass → (2) compute loss → (3) backward pass (get gradients) → (4) optimizer step (update weights) → repeat.
24

Learning Rate & Convergence Intermediate

The learning rate α controls how big each step is. Too large → overshoot and diverge. Too small → converge painfully slowly. Finding the right learning rate is one of the most important hyperparameter choices in ML.
α too large
Overshoots minimum, loss oscillates or diverges to infinity
α just right
Smooth convergence to minimum in reasonable time
α too small
Converges but takes forever; may get stuck in local minimum
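The three regimes can be reproduced on a toy problem. This is a minimal sketch, assuming the loss f(w) = w², a start at w = 4, and 50 steps. The function name `run_gd` and the specific learning rates are illustrative choices, not recommendations.

```python
def run_gd(lr, w0=4.0, steps=50):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w**2 is 2w, so w is scaled by (1 - 2*lr)
    return w

w_large = run_gd(lr=1.1)     # scale factor -1.2 per step: oscillates and diverges
w_good = run_gd(lr=0.1)      # scale factor 0.8 per step: converges quickly
w_small = run_gd(lr=0.001)   # scale factor 0.998 per step: barely moves

for name, w in [("too large", w_large), ("about right", w_good), ("too small", w_small)]:
    print(f"{name:12s} final w = {w:.6g}")
```

On this quadratic, any α above 1 makes |1 − 2α| exceed 1, which is why the iterate grows instead of shrinking.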

Common schedules:

Step decay: α = α₀ × 0.1 every 30 epochs
Exponential: α = α₀ × e^(−kt)
Cosine annealing: α = α_min + ½(α_max − α_min)(1 + cos(πt/T))
Warmup: linearly increase α from 0 to α₀ over first N steps
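The schedules above translate directly into small functions of the step (or epoch) index t. The sketch below is one possible reading of each formula. The function names, the default values (α₀ = 0.1, k = 0.05, T = 100, 10 warmup steps), and the choice to hold α₀ after warmup are assumptions for illustration only.

```python
import math

def step_decay(t, lr0=0.1, drop=0.1, every=30):
    """Multiply the rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (t // every)

def exponential_decay(t, lr0=0.1, k=0.05):
    """Smooth decay: lr0 * exp(-k * t)."""
    return lr0 * math.exp(-k * t)

def cosine_annealing(t, T=100, lr_min=0.0, lr_max=0.1):
    """Cosine curve from lr_max at t = 0 down to lr_min at t = T."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / T))

def linear_warmup(t, lr0=0.1, warmup_steps=10):
    """Ramp linearly from 0 to lr0, then hold lr0 (an assumed convention)."""
    return lr0 * min(1.0, t / warmup_steps)

for t in (0, 10, 30, 60, 100):
    print(t, step_decay(t), round(exponential_decay(t), 5),
          round(cosine_annealing(t), 5), linear_warmup(t))
```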
🎯 ML Application: Common starting points are lr=3e-4 for Adam and lr=0.1 for SGD. Cosine annealing with warmup is a common choice for Transformer training. The torch.optim.lr_scheduler module provides many common schedules.
25

SGD, Mini-Batch & Momentum Advanced

Full-batch GD is slow. Stochastic GD (SGD) uses one random sample per step — fast but noisy. Mini-batch GD (the standard) uses a batch of 32-512 samples — balancing speed and stability. Momentum adds a velocity term that accelerates through ravines.
SGD with Momentum:
v ← β·v + ∇L(w)  accumulate velocity (β≈0.9)
w ← w − α·v  step in velocity direction
Mathematical Explanation & Example:

Let's trace one step of Momentum with β = 0.9 and α = 0.1. Assume our current velocity is v = [1.0, -0.5], and the current gradient is g = [0.2, 2.0].

  • 1. Update Velocity: v_new = β·v + g
    v_new = 0.9 · [1.0, -0.5] + [0.2, 2.0]
    v_new = [0.9, -0.45] + [0.2, 2.0] = [1.1, 1.55]
    Notice how the velocity "remembers" the past direction but adds the new gradient's push.
  • 2. Update Weights: w_new = w - α·v_new
    w_new = w - 0.1 · [1.1, 1.55] = w - [0.11, 0.155]
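The single momentum step traced above can be replayed in a few lines. The velocity, gradient, β, and α come from the worked example. The weight vector `w = [1.0, 1.0]` is a hypothetical value, since the text leaves w unspecified.

```python
beta, alpha = 0.9, 0.1
v = [1.0, -0.5]   # current velocity (from the worked example)
g = [0.2, 2.0]    # current gradient (from the worked example)
w = [1.0, 1.0]    # hypothetical weights, assumed only for this illustration

# 1. Update velocity: keep 90% of the old direction, add the new gradient
v_new = [beta * vi + gi for vi, gi in zip(v, g)]

# 2. Update weights: step against the velocity
w_new = [wi - alpha * vi for wi, vi in zip(w, v_new)]

print([round(x, 4) for x in v_new])  # [1.1, 1.55]
print([round(x, 4) for x in w_new])  # [0.89, 0.845]
```

Note that the second velocity component flips sign: the new gradient (2.0) outweighs the remembered velocity (−0.45).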
graph LR subgraph Paths["Optimizer Behaviors"] SGD["Vanilla SGD\nZig-zags wildly"] MOM["Momentum\nDampens zig-zags\naccelerates downhill"] SGD --> MOM end style SGD fill:#fee2e2,stroke:#dc2626 style MOM fill:#dcfce7,stroke:#16a34a
With NumPy — SGD + Momentum
import numpy as np

def sgd_momentum(grad_f, w0, lr=0.01,
                 beta=0.9, epochs=100):
    w = w0.copy()
    v = np.zeros_like(w)  # velocity
    for _ in range(epochs):
        g = grad_f(w)
        v = beta * v + g       # accumulate
        w = w - lr * v         # step
    return w

grad = lambda w: np.array([2*w[0], 6*w[1]])
w = sgd_momentum(grad, np.array([4.0, 2.0]),
                 lr=0.01, beta=0.9, epochs=500)
print(w)  # ≈ [0, 0]; the default 100 epochs can leave w on the order of 1e-2
Pure Python
def sgd_momentum(grad_f, w0, lr=0.01,
                 beta=0.9, epochs=100):
    w = w0[:]
    v = [0.0] * len(w)
    for _ in range(epochs):
        g = grad_f(w)
        v = [beta*vi + gi
             for vi,gi in zip(v,g)]
        w = [wi - lr*vi
             for wi,vi in zip(w,v)]
    return w

def grad_f(w):
    return [2*w[0], 6*w[1]]

w = sgd_momentum(grad_f, [4.0, 2.0], epochs=500)
print([round(x,4) for x in w])
# ≈ [0.0, 0.0] (an entry may print as -0.0)
🎯 ML Application: torch.optim.SGD(params, lr=0.1, momentum=0.9): SGD with momentum remains a strong baseline for many vision models (e.g. ResNet). It often generalizes better than Adam despite converging more slowly.
26

Adam Optimizer Advanced

Adam = Adaptive Moment Estimation. It combines momentum (first moment) with RMSProp (second moment) to give each parameter its own adaptive learning rate. Adam is the default optimizer for most deep learning tasks.
m ← β₁·m + (1−β₁)·g  (first moment — mean of gradients)
v ← β₂·v + (1−β₂)·g²  (second moment: running mean of squared gradients, i.e. uncentered variance)
m̂ = m / (1 − β₁ᵗ)  (bias correction)
v̂ = v / (1 − β₂ᵗ)
w ← w − α · m̂ / (√v̂ + ε)  (update)
Adam adapts the learning rate per-parameter!
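One way to see the per-parameter adaptation is to trace the very first Adam step by hand. The sketch below assumes the running example's gradient [8, 12] (the bowl f(w) = w₀² + 3w₁² at [4, 2]) and the usual default hyperparameters. At t = 1 the bias-corrected moments reduce to m̂ = g and v̂ = g², so each parameter moves by roughly α regardless of how large its gradient is.

```python
import math

lr, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
g = [8.0, 12.0]   # gradient of w0^2 + 3*w1^2 at [4, 2]
m = [0.0, 0.0]    # first-moment state starts at zero
v = [0.0, 0.0]    # second-moment state starts at zero
t = 1             # first update

steps = []
for i in range(len(g)):
    m[i] = b1 * m[i] + (1 - b1) * g[i]
    v[i] = b2 * v[i] + (1 - b2) * g[i] ** 2
    m_hat = m[i] / (1 - b1 ** t)   # bias correction: equals g[i] at t = 1
    v_hat = v[i] / (1 - b2 ** t)   # bias correction: equals g[i]**2 at t = 1
    steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))

print(steps)  # both entries are close to lr = 0.001 despite different gradients
```

Plain gradient descent with the same α would have moved the two parameters by 0.008 and 0.012. Adam normalizes both to about 0.001.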
graph LR subgraph Opt ["Optimizer Evolution"] SGD["SGD
w−α·g"] -->|"Add velocity"| MOM["Momentum
v=βv+g, w−αv"] MOM -->|"Add per-param lr"| ADAM["Adam
Adaptive lr per param"] end
With NumPy — Full Adam
import numpy as np

def adam(grad_f, w0, lr=0.001,
        b1=0.9, b2=0.999, eps=1e-8,
        epochs=200):
    w = w0.copy()
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, epochs+1):
        g = grad_f(w)
        m = b1*m + (1-b1)*g
        v = b2*v + (1-b2)*g**2
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

grad = lambda w: np.array([2*w[0], 6*w[1]])
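w = adam(grad, np.array([4.0, 2.0]), lr=0.1, epochs=1000)
print(np.round(w, 4))  # ≈ [0, 0]
# With the defaults (lr=0.001, epochs=200) each weight moves only
# about 0.2, because Adam's step size is roughly lr per iteration.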
Pure Python
import math

def adam(grad_f, w0, lr=0.001,
        b1=0.9, b2=0.999, eps=1e-8,
        epochs=200):
    w = w0[:]
    m = [0.0]*len(w)
    v = [0.0]*len(w)
    for t in range(1, epochs+1):
        g = grad_f(w)
        for i in range(len(w)):
            m[i] = b1*m[i]+(1-b1)*g[i]
            v[i] = b2*v[i]+(1-b2)*g[i]**2
            mh = m[i]/(1-b1**t)
            vh = v[i]/(1-b2**t)
            w[i] -= lr*mh/(math.sqrt(vh)+eps)
    return w

def grad_f(w):
    return [2*w[0], 6*w[1]]

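w = adam(grad_f, [4.0, 2.0], lr=0.1, epochs=1000)
print([round(x,4) for x in w])
# ≈ [0.0, 0.0] (with the defaults lr=0.001, epochs=200,
# each weight would move only about 0.2)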
🎯 ML Application: torch.optim.Adam(params, lr=3e-4) is one of the most common optimizer choices in deep learning. Defaults (β₁=0.9, β₂=0.999, ε=1e-8) work for most tasks. AdamW decouples weight decay from the gradient update and is widely used for Transformer training.
27

Jacobian Matrix Advanced

The Jacobian is the matrix of ALL first-order partial derivatives for a vector-valued function. If f: ℝⁿ → ℝᵐ, the Jacobian J is an m×n matrix where J_ij = ∂fᵢ/∂xⱼ.
J = [∂f₁/∂x₁ ∂f₁/∂x₂ ··· ∂f₁/∂xₙ]
    [∂f₂/∂x₁ ∂f₂/∂x₂ ··· ∂f₂/∂xₙ]
    [  ⋮        ⋮              ⋮  ]
    [∂fₘ/∂x₁ ∂fₘ/∂x₂ ··· ∂fₘ/∂xₙ]
Shape: (outputs × inputs) = (m × n)
Mathematical Explanation & Example:

Let's find the Jacobian for the function f(x, y) = [x² + y, xy]. Here, we have two inputs (x, y) and two outputs (f₁, f₂).

  • Output 1: f₁(x, y) = x² + y.
    Its partial derivatives are ∂f₁/∂x = 2x and ∂f₁/∂y = 1. This forms the first row.
  • Output 2: f₂(x, y) = xy.
    Its partial derivatives are ∂f₂/∂x = y and ∂f₂/∂y = x. This forms the second row.
  • The Jacobian Matrix J is:
    [[ 2x, 1 ],
     [  y, x ]]
  • Evaluate at (3, 2): Plugging in x=3, y=2 gives:
    [[ 2(3), 1 ],
     [    2, 3 ]]
    = [[ 6, 1 ], [ 2, 3 ]]
graph LR X["Input: x, y
ℝ²"] -->|Jacobian Matrix J| F["Output: f₁, f₂
ℝ²"] J1["Row 1: ∇f₁"] -.->|"∂f₁/∂x, ∂f₁/∂y"| F J2["Row 2: ∇f₂"] -.->|"∂f₂/∂x, ∂f₂/∂y"| F style X fill:#f8f9fa,stroke:#999999 style F fill:#f8f9fa,stroke:#999999
With NumPy
import numpy as np

def jacobian(f, x, h=1e-7):
    """Numerical Jacobian matrix"""
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f(x))
    n = len(x); m = len(f0)
    J = np.zeros((m, n))
    for j in range(n):
        xp = x.copy(); xp[j] += h
        xm = x.copy(); xm[j] -= h
        J[:, j] = (f(xp) - f(xm)) / (2*h)
    return J

# f(x,y) = [x²+y, xy]
f = lambda x: np.array([x[0]**2+x[1], x[0]*x[1]])
J = jacobian(f, [3.0, 2.0])
print(J)
# [[6. 1.]    ← [2x, 1]
#  [2. 3.]]  ← [y, x]
Pure Python
def jacobian(f, x, h=1e-7):
    n = len(x)
    f0 = f(x)
    m = len(f0)
    J = [[0.0]*n for _ in range(m)]
    for j in range(n):
        xp = x[:]; xm = x[:]
        xp[j] += h; xm[j] -= h
        fp = f(xp); fm = f(xm)
        for i in range(m):
            J[i][j] = (fp[i]-fm[i])/(2*h)
    return J

def f(x):
    return [x[0]**2+x[1], x[0]*x[1]]

J = jacobian(f, [3.0, 2.0])
for row in J: print(row)
# [6.0, 1.0]
# [2.0, 3.0]
🎯 ML Application: In Normalizing Flows, we need the log-determinant of the Jacobian to compute exact likelihoods. In physics-informed neural networks (PINNs), the Jacobian encodes how outputs change with respect to inputs — essential for enforcing differential equations.
28

Hessian Matrix Advanced

The Hessian is the matrix of second-order partial derivatives. It captures the curvature of a function. If the Hessian is positive definite at a critical point, that point is a minimum. The Hessian also tells us about the conditioning of the optimization landscape.
H_ij = ∂²f / ∂xᵢ∂xⱼ
Hessian is symmetric (H = Hᵀ) for smooth functions
Newton's method: w ← w − H⁻¹ · ∇f (quadratic convergence)
Mathematical Explanation & Example:

Let's calculate the Hessian for f(x, y) = x² + 3y².

  • 1. First Derivatives (Gradient):
    ∂f/∂x = 2x
    ∂f/∂y = 6y
  • 2. Second Derivatives (Hessian entries):
    • Differentiate 2x wrt x: ∂²f/∂x² = 2
    • Differentiate 2x wrt y: ∂²f/∂x∂y = 0
    • Differentiate 6y wrt x: ∂²f/∂y∂x = 0
    • Differentiate 6y wrt y: ∂²f/∂y² = 6
  • The Hessian Matrix H is:
    [[ 2, 0 ],
     [ 0, 6 ]]
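With this Hessian in hand, the Newton update w ← w − H⁻¹·∇f can be tried on the same function. This is a small sketch, assuming the starting point [4, 2] from the gradient descent example. `solve_2x2` is a hypothetical helper that applies Cramer's rule, which is adequate for a 2×2 system but not how large systems are solved in practice.

```python
def grad(w):
    """Gradient of f(x, y) = x^2 + 3y^2."""
    return [2 * w[0], 6 * w[1]]

H = [[2.0, 0.0],
     [0.0, 6.0]]   # Hessian from the worked example (constant for a quadratic)

def solve_2x2(A, b):
    """Solve A x = b for a 2x2 matrix A using Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    x0 = (b[0] * A[1][1] - A[0][1] * b[1]) / det
    x1 = (A[0][0] * b[1] - b[0] * A[1][0]) / det
    return [x0, x1]

w = [4.0, 2.0]
step = solve_2x2(H, grad(w))                     # H^-1 * grad without forming H^-1
w_newton = [wi - si for wi, si in zip(w, step)]
print(w_newton)  # [0.0, 0.0]
```

Because f is exactly quadratic, one Newton step lands on the minimum. On a general loss the quadratic model is only local, so several steps are needed.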
graph TD subgraph Landscape ["Curvature from Hessian"] POS["Hessian is Positive Definite
(Eigenvalues > 0)"] --> MIN["Local Minimum
Looks like a Bowl"] NEG["Hessian is Negative Definite
(Eigenvalues < 0)"] --> MAX["Local Maximum
Looks like a Hill"] MIX["Hessian is Indefinite
(Mixed Eigenvalues)"] --> SAD["Saddle Point
Looks like a Pringles chip"] end
With NumPy
import numpy as np

def hessian(f, x, h=1e-5):
    """Numerical Hessian matrix"""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            xpp=x.copy(); xpp[i]+=h; xpp[j]+=h
            xpm=x.copy(); xpm[i]+=h; xpm[j]-=h
            xmp=x.copy(); xmp[i]-=h; xmp[j]+=h
            xmm=x.copy(); xmm[i]-=h; xmm[j]-=h
            H[i,j] = (f(xpp)-f(xpm)-f(xmp)+f(xmm))
            H[i,j] /= (4*h*h)
    return H

# f(x,y) = x² + 3y² → H = [[2,0],[0,6]]
f = lambda x: x[0]**2 + 3*x[1]**2
print(hessian(f, np.array([1.0,1.0])))
# [[2. 0.]
#  [0. 6.]]
Hessian Eigenvalues
# Hessian eigenvalues tell us
# about the loss landscape shape:
#
# All positive → local minimum
# All negative → local maximum
# Mixed → saddle point
#
# Condition number (λ_max / λ_min)
# tells us how "elongated" the
# bowl is. High condition number
# → GD oscillates → slow.
# This is why Adam helps: it
# scales each dimension by 1/√v,
# effectively preconditioning.

import numpy as np
H = np.array([[2, 0], [0, 6]])
eigvals = np.linalg.eigvalsh(H)
print(f"Eigenvalues: {eigvals}")
# [2. 6.]
print(f"Condition: {max(eigvals)/min(eigvals)}")
# 3.0
🎯 ML Application: Second-order optimizers (L-BFGS, natural gradient) use Hessian information for faster convergence. In practice, the full Hessian is too expensive (O(n²) storage), so methods like Hessian-free optimization and Fisher information approximations are used.
Part VI

Backpropagation

Backpropagation is the chain rule applied systematically through a computational graph. It is the algorithm that makes training deep neural networks possible.

After Part VI you will be able to:
  • Draw a computational graph for any neural network
  • Trace the forward and backward passes step by step
  • Implement backprop for a 2-layer network from scratch
  • Explain why gradients vanish or explode in deep networks
29

Computational Graphs Advanced

A computational graph is a DAG (directed acyclic graph) where nodes are operations and edges carry data. The forward pass flows data left-to-right to compute the output. The backward pass flows gradients right-to-left using the chain rule.
Forward pass builds the graph; backward pass computes gradients!
graph LR subgraph CompGraph ["Simple Neuron: y = σ(wx + b)"] X["x (input)"] --> MUL["× (multiply)"] W["w (weight)"] --> MUL MUL --> ADD["+ (add)"] B["b (bias)"] --> ADD ADD --> SIG["σ (sigmoid)"] SIG --> Y["y (output)"] end

Each node in the graph stores two things during training:

Forward Pass →

Each node computes its output from its inputs and caches intermediate values (needed for backward pass).

← Backward Pass

Each node receives ∂L/∂output, computes ∂L/∂input using its local derivative, and passes it back to the nodes that produced its inputs.
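The single-neuron graph y = σ(wx + b) is small enough to trace by hand. The sketch below is an illustrative implementation, not how any framework stores its graph. The function names, the input values, and the choice of dL/dy = 1 (treating the output itself as the loss) are assumptions. A finite-difference check confirms the analytic gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward(x, w, b):
    """Forward pass: multiply node, add node, sigmoid node. Caches what backward needs."""
    p = w * x          # multiply node
    z = p + b          # add node
    y = sigmoid(z)     # sigmoid node
    return y, (x, w, y)

def neuron_backward(dL_dy, cache):
    """Backward pass: visit the nodes in reverse, applying each local derivative."""
    x, w, y = cache
    dL_dz = dL_dy * y * (1.0 - y)   # sigmoid node: local derivative is y(1 - y)
    dL_db = dL_dz                   # add node passes the gradient through unchanged
    dL_dp = dL_dz
    dL_dw = dL_dp * x               # multiply node: derivative w.r.t. w is x
    dL_dx = dL_dp * w               # multiply node: derivative w.r.t. x is w
    return dL_dw, dL_db, dL_dx

x, w, b = 2.0, 0.5, -1.0
y, cache = neuron_forward(x, w, b)
dw, db, dx = neuron_backward(1.0, cache)

# Finite-difference check of dL/dw
h = 1e-6
dw_numeric = (neuron_forward(x, w + h, b)[0] - neuron_forward(x, w - h, b)[0]) / (2 * h)
print(y, dw, db, dx, dw_numeric)
```

With these inputs z = 0, so y = 0.5 and the sigmoid's local derivative takes its maximum value of 0.25.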

🎯 ML Application: PyTorch and TensorFlow build a computational graph dynamically during the forward pass. When you call .backward(), they traverse this graph in reverse to compute all gradients. This is why tensors track requires_grad=True.
30

Forward Pass Advanced

The forward pass computes the output of the network by propagating input through each layer. At each layer: (1) linear transformation z = Wx + b, then (2) activation a = σ(z). Cache z and a — you'll need them for backprop.
# Forward pass for a 2-layer network
# Layer 1: z1 = W1·x + b1,  a1 = ReLU(z1)
# Layer 2: z2 = W2·a1 + b2, a2 = sigmoid(z2)

import numpy as np

def forward(x, W1, b1, W2, b2):
    # Layer 1
    z1 = W1 @ x + b1
    a1 = np.maximum(0, z1)         # ReLU

    # Layer 2
    z2 = W2 @ a1 + b2
    a2 = 1 / (1 + np.exp(-z2))     # Sigmoid

    cache = (x, z1, a1, z2, a2, W1, W2)
    return a2, cache
🎯 ML Application: During inference (prediction), you only need the forward pass. During training, you need both forward AND backward. The cache stores everything the backward pass needs, which is why training uses ~2-3× more memory than inference.
31

Backward Pass (Backpropagation) Advanced

Backpropagation = chain rule applied in reverse through the computational graph. Starting from ∂L/∂output, each layer computes its local gradients and passes ∂L/∂input to the previous layer.
Each node: receive upstream gradient, multiply by local gradient, pass downstream
graph RL subgraph Backprop ["Backward Pass"] dL["∂L/∂ŷ
(from loss)"] --> dSIG["σ' node
×σ(1−σ)"] dSIG --> dADD["+ node
pass through"] dADD --> dW["∂L/∂w
(save for update)"] dADD --> dB["∂L/∂b
(save for update)"] dADD --> dX["∂L/∂x
(to prev layer)"] end
# Backward pass — compute all gradients
import numpy as np

def backward(y_true, cache):
    x, z1, a1, z2, a2, W1, W2 = cache
    m = y_true.shape[1]  # batch size (samples are columns)

    # Output layer gradient (BCE loss + sigmoid)
    dz2 = a2 - y_true             # ∂L/∂z2 = ŷ - y
    dW2 = (1/m) * dz2 @ a1.T       # ∂L/∂W2
    db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True)

    # Hidden layer gradient
    da1 = W2.T @ dz2              # chain rule
    dz1 = da1 * (z1 > 0)          # ReLU derivative
    dW1 = (1/m) * dz1 @ x.T
    db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True)

    return {'dW1': dW1, 'db1': db1,
            'dW2': dW2, 'db2': db2}
🎯 ML Application: This is exactly what PyTorch does when you call loss.backward(). The framework traverses the computational graph in reverse, applying the chain rule at each node. All parameter gradients are stored in param.grad and then consumed by the optimizer.
32

Backprop Through a 2-Layer Network Advanced

Let's put it all together. A complete training loop: initialize → forward → loss → backward → update. This is the foundation of ALL neural network training.
import numpy as np

# === COMPLETE 2-LAYER NETWORK ===
np.random.seed(42)

# Data: XOR problem
X = np.array([[0,0],[0,1],[1,0],[1,1]]).T  # (2, 4)
Y = np.array([[0,1,1,0]])                    # (1, 4)

# Initialize weights
W1 = np.random.randn(4, 2) * 0.5  # (4, 2)
b1 = np.zeros((4, 1))
W2 = np.random.randn(1, 4) * 0.5  # (1, 4)
b2 = np.zeros((1, 1))

lr = 1.0
for epoch in range(10000):
    # Forward
    z1 = W1 @ X + b1
    a1 = np.maximum(0, z1)
    z2 = W2 @ a1 + b2
    a2 = 1 / (1 + np.exp(-z2))

    # Loss
    loss = -np.mean(Y*np.log(a2+1e-8) + (1-Y)*np.log(1-a2+1e-8))

    # Backward
    m = 4
    dz2 = a2 - Y
    dW2 = (1/m) * dz2 @ a1.T
    db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True)
    dz1 = (W2.T @ dz2) * (z1 > 0)
    dW1 = (1/m) * dz1 @ X.T
    db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True)

    # Update
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

    if epoch % 2000 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}")

# Test predictions
print(np.round(a2, 2))  # target: [[0, 1, 1, 0]]; results depend on the random init (ReLU units can die)
🎯 ML Application: This 40-line script IS a complete neural network. Every deep learning framework (PyTorch, TensorFlow, JAX) automates exactly these steps. Understanding this from scratch means you truly understand how neural networks learn.
33

Vanishing & Exploding Gradients Advanced

In deep networks, gradients are multiplied through many layers (chain rule). If each layer's local gradient has magnitude < 1, the product shrinks exponentially (vanishing). If it is > 1, the product grows exponentially (exploding). Both prevent learning.
∂L/∂w₁ = ∂L/∂aₙ · ∂aₙ/∂aₙ₋₁ · ... · ∂a₂/∂a₁ · ∂a₁/∂w₁
n multiplications → (0.25)ⁿ for sigmoid → vanishes for large n
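The (0.25)ⁿ bound is easy to tabulate. This sketch considers only the sigmoid's local derivative at its best case (z = 0) and deliberately ignores the weight matrices, which also multiply into the product in a real network. The depths chosen are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1 - s)

best_case = sigmoid_grad(0.0)   # 0.25, the largest value the derivative can take

for depth in (1, 5, 10, 20):
    # Product of `depth` best-case sigmoid derivatives (weights ignored)
    print(f"{depth:2d} layers: {best_case ** depth:.3e}")

shrink_20 = best_case ** 20
```

Even in this best case, twenty layers scale the gradient by less than 1e-11. Away from z = 0 the derivative is smaller still.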

Solutions:

1. ReLU activation: derivative is 0 or 1 → no shrinking
2. Skip connections (ResNet): gradient flows through shortcut paths
3. Batch normalization: keeps activations in well-conditioned range
4. Gradient clipping: cap gradient magnitude to prevent explosion
5. Careful initialization: He init (ReLU), Xavier init (tanh/sigmoid)
6. LSTM/GRU gates: control gradient flow in recurrent networks
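Of the remedies listed, gradient clipping is the simplest to write down. The sketch below shows clipping by global L2 norm for a flat list of gradient values. `clip_by_norm` is a hypothetical helper written for illustration and is not the PyTorch implementation.

```python
import math

def clip_by_norm(grads, max_norm=1.0):
    """Rescale gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)            # already small enough: leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grads] # same direction, shorter length

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)   # original norm is 5
small = clip_by_norm([0.3, 0.4], max_norm=1.0)     # original norm is 0.5

print(clipped)  # approximately [0.6, 0.8]
print(small)    # [0.3, 0.4]
```

Clipping changes only the length of the gradient vector, not its direction, so the update still points downhill.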
🎯 ML Application: ResNet's skip connections (y = F(x) + x) were revolutionary because the gradient for the shortcut path is exactly 1 — no vanishing. This enabled training networks with 100+ layers. torch.nn.utils.clip_grad_norm_(params, max_norm=1.0) is standard for RNNs and Transformers.
Part VII

Advanced Topics

Taylor series approximations and automatic differentiation — the mathematical machinery that powers modern deep learning frameworks.

34

Taylor Series & Approximation Advanced

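A Taylor series approximates a smooth function near a point a by a polynomial built from its derivatives at a. The key insight: knowing a function and all its derivatives at one point lets you reconstruct the function nearby (for well-behaved functions such as eˣ and sin x). Gradient descent relies on the 1st-order (linear) approximation of the loss, and Newton's method on the 2nd-order (quadratic) one. Adam's second moment is a running average of squared gradients, not a second derivative, so Adam is not a 2nd-order Taylor method.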
f(x) = f(a) + f'(a)(x−a) + f''(a)(x−a)²/2! + f'''(a)(x−a)³/3! + ...
1st order ≈ linear · 2nd order ≈ quadratic (Newton's method)
With NumPy — Taylor of eˣ
import math
import numpy as np

def taylor_exp(x, n_terms=10):
    """eˣ ≈ 1 + x + x²/2! + x³/3! + ..."""
    x = np.asarray(x, dtype=float)
    result = np.zeros_like(x)
    for n in range(n_terms):
        # np.math is unavailable in recent NumPy releases; use the stdlib math module
        result += x**n / math.factorial(n)
    return result

x = np.array([0, 1, 2])
print("Taylor:", np.round(taylor_exp(x), 3))
print("Exact: ", np.round(np.exp(x), 3))
# Taylor: [1.    2.718 7.389]
# Exact:  [1.    2.718 7.389]
Pure Python
def factorial(n):
    r = 1
    for i in range(2, n+1): r *= i
    return r

def taylor_exp(x, n_terms=15):
    return sum(x**n / factorial(n)
               for n in range(n_terms))

def taylor_sin(x, n_terms=10):
    return sum(
        (-1)**n * x**(2*n+1) / factorial(2*n+1)
        for n in range(n_terms))

print(taylor_exp(1))    # 2.71828...
print(taylor_sin(3.14159/2))
# ≈ 1.0
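🎯 ML Application: Newton's method uses a 2nd-order Taylor expansion to reach a minimum in fewer steps than GD, at the cost of computing curvature. Adam's adaptive learning rates are sometimes interpreted as a rough diagonal curvature estimate, although they are built from squared gradients rather than the Hessian. The GELU activation, x·Φ(x), is often computed with a tanh-based approximation of the Gaussian CDF Φ.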
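The difference between a 1st-order and a 2nd-order step can be seen on a one-dimensional toy problem. This sketch assumes f(x) = eˣ − 2x, whose minimum is at x = ln 2; the function, starting point, learning rate, and helper names are all illustrative choices.

```python
import math

def grad(x):       # f'(x) for f(x) = e^x - 2x
    return math.exp(x) - 2.0

def hess(x):       # f''(x)
    return math.exp(x)

def gd_step(x, lr=0.1):
    # 1st-order Taylor model: step against the slope
    return x - lr * grad(x)

def newton_step(x):
    # 2nd-order Taylor model: jump to the minimum of the local quadratic
    return x - grad(x) / hess(x)

x_gd = x_newton = 0.0
for _ in range(5):
    x_gd = gd_step(x_gd)
    x_newton = newton_step(x_newton)

target = math.log(2.0)
print(f"GD error after 5 steps:     {abs(x_gd - target):.2e}")
print(f"Newton error after 5 steps: {abs(x_newton - target):.2e}")
```

Newton's method converges much faster here because it rescales the step by the curvature, but in high dimensions it needs the full Hessian, which is why deep learning mostly relies on first-order methods.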
35

Automatic Differentiation Advanced

Automatic differentiation (AD) computes exact derivatives by decomposing functions into elementary operations and applying the chain rule systematically. It's neither symbolic (like Wolfram Alpha) nor numerical (like finite differences): it's exact up to floating-point rounding AND efficient.
How PyTorch computes gradients — not magic, just clever chain rule!
Diagram: Automatic Differentiation Modes
Forward Mode: compute ∂y/∂x alongside y; good for few inputs, many outputs → uses dual numbers, i.e. (value, derivative) pairs
Reverse Mode (Backprop): compute all ∂y/∂xᵢ in one pass; good for many inputs, few outputs → uses a computation tape (what PyTorch does)
Forward Mode

Propagates derivatives alongside values. Cost = O(n) passes for n inputs. Good when inputs << outputs (rare in ML).

Reverse Mode (Backprop)

One forward pass + one backward pass computes ALL gradients of a scalar output. Cost = O(1) backward passes regardless of # inputs (in general, one backward pass per output). This is why backprop is used in ML, where the loss is a single scalar.
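Forward mode, as described above, can be sketched with dual numbers in a few lines of pure Python. The `Dual` class is a hypothetical minimal example that supports only `+` and `*`; it uses the same function f = (a · b) + b as the reverse-mode example so the results can be compared.

```python
class Dual:
    """A (value, derivative) pair; arithmetic propagates both together."""
    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative w.r.t. the seeded input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def f(a, b):
    return a * b + b

# One forward pass per input: seed dot=1 on the input being differentiated
df_da = f(Dual(2.0, 1.0), Dual(3.0, 0.0)).dot
df_db = f(Dual(2.0, 0.0), Dual(3.0, 1.0)).dot
print(df_da, df_db)  # 3.0 3.0
```

Two inputs required two passes, which is the O(n) cost noted above; with millions of parameters and one scalar loss, reverse mode is the better fit.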

# Tiny autodiff engine (inspired by Karpathy's micrograd)
class Value:
    def __init__(self, data, children=(), op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(children)
        self._op = op  # label of the op that produced this node (for debugging)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad   # ∂(a+b)/∂a = 1
            other.grad += out.grad  # ∂(a+b)/∂b = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad  # ∂(ab)/∂a = b
            other.grad += self.data * out.grad  # ∂(ab)/∂b = a
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort + reverse traverse
        topo = []; visited = set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._prev: build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo): v._backward()

# Example: f = (a * b) + b
a = Value(2.0); b = Value(3.0)
c = a * b      # c = 6
d = c + b      # d = 9
d.backward()
print(f"∂d/∂a = {a.grad}")  # 3.0 (= b)
print(f"∂d/∂b = {b.grad}")  # 3.0 (= a + 1)
🎯 ML Application: This tiny Value class IS the core idea behind PyTorch's autograd. PyTorch's torch.Tensor does exactly this at scale — tracks operations, stores local derivatives, and traverses the graph backward. Karpathy's micrograd (~100 lines) implements a full autodiff engine.
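A common sanity check for any autodiff result is to compare it against central finite differences. The sketch below does this for the same function f = (a · b) + b used in the example above; `central_diff` is a hypothetical helper written for this illustration.

```python
def f(a, b):
    return a * b + b

def central_diff(func, args, i, h=1e-5):
    """Approximate the partial derivative of func w.r.t. args[i]."""
    up = list(args);   up[i] += h
    down = list(args); down[i] -= h
    return (func(*up) - func(*down)) / (2 * h)

args = (2.0, 3.0)
num_da = central_diff(f, args, 0)  # should be close to b = 3
num_db = central_diff(f, args, 1)  # should be close to a + 1 = 3
print(num_da, num_db)
```

Both values should agree with the autodiff gradients to several decimal places. This kind of gradient check is slow (two function evaluations per input), so it is used for testing rather than training.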
Appendix

Complete Coverage Map

Every concept in this guide mapped to its ML application:

📐 Derivatives: the foundation of gradient-based optimization → Gradient descent, backpropagation, all parameter updates
🔗 Chain Rule: differentiating composed functions → Backpropagation IS the chain rule applied recursively
📊 Sigmoid/Tanh/ReLU: activation functions and their derivatives → Every neural network layer uses an activation; knowing their derivatives is essential
📉 Loss Functions: MSE, cross-entropy, and their gradients → The starting point of every backward pass
Integrals: area under curve, expected values → Probability, KL divergence, ELBO in VAEs
Gradients: vector of all partial derivatives → The direction of steepest descent; core of all optimizers
⬇️ Gradient Descent: the optimization algorithm → SGD, Adam, AdamW; every training loop
🔄 Backpropagation: computing all gradients efficiently → loss.backward() in PyTorch; the algorithm that makes deep learning possible
🧮 Jacobian/Hessian: higher-order derivative matrices → Normalizing flows, Newton's method, loss landscape analysis
Automatic Differentiation: how frameworks compute gradients → PyTorch autograd, JAX, TensorFlow GradientTape
Tags: derivatives · chain rule · gradient descent · backpropagation · optimization · integrals · sigmoid · ReLU · Adam · Jacobian · Hessian · loss functions