Calculus for Machine Learning
A complete reference covering derivatives, chain rule, gradient descent, backpropagation, and optimization — every concept you need to understand how neural networks learn.
Foundations
The building blocks of calculus — what functions are, how they change, and how we measure that change precisely. These concepts form the bedrock of every optimization algorithm in ML.
- Explain why calculus is essential for training neural networks
- Plot and reason about common ML functions
- Define the derivative as a limit and compute it numerically
- Understand the geometric meaning of the derivative as slope
- Implement numerical differentiation in pure Python
What is Calculus? Beginner
Calculus has two fundamental branches:
Studies rates of change. Given a function f(x), the derivative f'(x) tells us how fast f is changing at each point x. This is how we compute gradients to train models.
Studies accumulation. Given a rate of change, integration recovers the total quantity. In ML, integrals appear in probability distributions, expected values, and KL divergence.
prediction ŷ"] --> B{"Compare with
true label y"} B --> C["Compute Loss
L(ŷ, y)"] C --> D["Take Derivative
∂L/∂w (Calculus!)"] D --> E["Update Weights
w ← w − α·∂L/∂w"] E --> A end
loss.backward() in PyTorch, you are using differential calculus. The framework computes ∂Loss/∂w
for every weight w in your network, then gradient descent uses those derivatives to update each weight in the
direction that reduces the loss.
Functions & Graphs Beginner
Key ML functions you'll encounter throughout this guide:
Straight line. The simplest model. Linear regression is just finding the best m and b.
Parabola. MSE loss is quadratic — this is why gradient descent works so well on it (convex!).
S-curve. Squashes any input to (0, 1). Used in binary classification and gating.
The most popular activation. Zero for negatives, identity for positives. Simple derivative.
import numpy as np # Define common functions x = np.linspace(-5, 5, 100) linear = 2 * x + 1 quad = x ** 2 sigmoid = 1 / (1 + np.exp(-x)) relu = np.maximum(0, x) # Evaluate at a point sigmoid_at_0 = 1 / (1 + np.exp(0)) print(sigmoid_at_0) # 0.5
import math def linear(x, m=2, b=1): return m * x + b def sigmoid(x): return 1 / (1 + math.exp(-x)) def relu(x): return max(0, x) # Plot manually: generate (x, y) pairs xs = [i * 0.1 for i in range(-50, 51)] ys = [sigmoid(x) for x in xs] print(sigmoid(0)) # 0.5
Limits & Continuity Beginner
Three key properties of limits that matter for ML:
2. Product rule: lim[f(x) · g(x)] = lim f(x) · lim g(x)
3. Chain rule: lim f(g(x)) = f(lim g(x)) (if f is continuous)
# lim x→0 sin(x)/x = 1 import numpy as np # Approach from right h_values = [0.1, 0.01, 0.001, 0.0001] for h in h_values: print(f"h={h}: {np.sin(h)/h:.8f}") # h=0.1: 0.99833417 # h=0.01: 0.99998333 # h=0.001: 0.99999983 # h=0.0001: 1.00000000 # → Approaches 1!
import math def numerical_limit(f, a, h_values=None): """Estimate lim x→a f(x)""" if h_values is None: h_values = [10**-i for i in range(1,9)] results = [] for h in h_values: results.append(f(a + h)) return results[-1] # lim x→0 sin(x)/x limit = numerical_limit( lambda x: math.sin(x)/x, 0 ) print(limit) # ≈ 1.0
The Derivative (Geometric View) Beginner
A derivative measures how a function changes as its input changes. Mathematically, it's defined as the limit of the average rate of change as the interval approaches zero:
f'(x) = lim(h→0) [f(x+h) - f(x)] / h
Example: Let f(x) = x². We want to find the
derivative at x = 3.
- Using the definition, the rate of change from
xtox+his((x+h)² - x²) / h. - Expanding the numerator:
x² + 2xh + h² - x² = 2xh + h². - Dividing by
hgives:2x + h. - As
h → 0, this expression becomes2x. - So, the derivative is
f'(x) = 2x. Atx = 3, the slope is2(3) = 6.
Going uphill ↗"] --> ACT1["Move LEFT
(Decrease x)"] NEG["f'(x) < 0
Going downhill ↘"] --> ACT2["Move RIGHT
(Increase x)"] ZERO["f'(x) = 0
Flat — at minimum? ✓"] --> ACT3["Stop!
(Converged)"] end
import numpy as np # f(x) = x² → f'(x) = 2x def f(x): return x**2 def f_prime(x): return 2*x x = np.array([-2, -1, 0, 1, 2]) print(f"f(x) = {f(x)}") # [4, 1, 0, 1, 4] print(f"f'(x) = {f_prime(x)}") # [-4, -2, 0, 2, 4] # At x=-2: slope is -4 (steeply downhill) # At x=0: slope is 0 (minimum!) # At x=2: slope is 4 (steeply uphill)
# Numerical derivative approximation def numerical_derivative(f, x, h=1e-7): """Central difference formula""" return (f(x + h) - f(x - h)) / (2 * h) def f(x): return x ** 2 # Test at several points for x in [-2, -1, 0, 1, 2]: slope = numerical_derivative(f, x) print(f"x={x}: f'={slope:.4f}") # x=-2: f'=-4.0000 # x=-1: f'=-2.0000 # x= 0: f'= 0.0000 ← minimum! # x= 1: f'= 2.0000 # x= 2: f'= 4.0000
Derivative from First Principles Beginner
Worked example: Derive f'(x) for f(x) = x² from first principles:
= limh→0 [x² + 2xh + h² − x²] / h
= limh→0 [2xh + h²] / h
= limh→0 (2x + h)
= 2x ✓
import numpy as np def f(x): return x**2 x = 3.0; h = 1e-7 # Forward difference fwd = (f(x+h) - f(x)) / h # Backward difference bwd = (f(x) - f(x-h)) / h # Central difference (best!) ctr = (f(x+h) - f(x-h)) / (2*h) print(f"Forward: {fwd:.10f}") print(f"Backward: {bwd:.10f}") print(f"Central: {ctr:.10f}") print(f"Exact: {2*x:.10f}") # Central is most accurate!
def numerical_gradient(f, params, h=1e-7): """Gradient for a list of params""" grad = [] for i in range(len(params)): # Perturb param i params_plus = params[:] params_minus = params[:] params_plus[i] += h params_minus[i] -= h # Central difference g = (f(params_plus) - f(params_minus)) g /= (2 * h) grad.append(g) return grad # f(w0, w1) = w0² + 3*w1² def loss(w): return w[0]**2 + 3*w[1]**2 g = numerical_gradient(loss, [2.0, 1.0]) print(g) # [4.0, 6.0]
numerical_gradient
function above is exactly what gradient checking does in practice. Frameworks like PyTorch
provide torch.autograd.gradcheck() which computes both the analytical gradient (via backprop)
and the numerical gradient, then asserts they match.Differentiation Rules
Instead of computing every derivative from the limit definition, we use rules that let us differentiate any function instantly. The chain rule is the single most important rule for ML — it powers backpropagation.
- Apply power, sum, product, quotient, and chain rules fluently
- Differentiate any polynomial, exponential, or trigonometric function
- Understand why the chain rule is the backbone of backpropagation
- Look up any common derivative from the reference table
Power Rule Beginner
Examples:
d/dx [x⁻¹] = −x⁻² = −1/x²
d/dx [√x] = d/dx [x1/2] = (1/2)x−1/2 = 1/(2√x)
d/dx [5] = 0 (constant → derivative is 0)
import numpy as np # f(x) = 3x⁴ + 2x² − 7x + 5 # f'(x) = 12x³ + 4x − 7 def f(x): return 3*x**4 + 2*x**2 - 7*x + 5 def f_prime(x): return 12*x**3 + 4*x - 7 x = np.array([0, 1, 2]) print(f_prime(x)) # [-7, 9, 97]
def power_rule(coeff, exp): """Apply power rule: d/dx [c·x^n]""" if exp == 0: return (0, 0) # constant new_coeff = coeff * exp new_exp = exp - 1 return (new_coeff, new_exp) # Differentiate 3x⁴ c, e = power_rule(3, 4) print(f"{c}x^{e}") # 12x^3 # Differentiate √x = x^0.5 c, e = power_rule(1, 0.5) print(f"{c}x^{e}") # 0.5x^-0.5
Sum, Difference & Constant Rules Beginner
d/dx [f(x) − g(x)] = f'(x) − g'(x) Difference rule
d/dx [c · f(x)] = c · f'(x) Constant multiple rule
d/dx [c] = 0 Constant rule
Example: Differentiate f(x) = 5x³ − 2x² + 7x − 3
Product & Quotient Rules Intermediate
Quotient rule: d/dx [f/g] = (f'·g − f·g') / g²
Product rule example: d/dx [x² · eˣ]
g(x) = eˣ, g'(x) = eˣ
(fg)' = 2x·eˣ + x²·eˣ = eˣ(2x + x²)
import numpy as np # f(x) = x² · eˣ # f'(x) = eˣ(2x + x²) def f(x): return x**2 * np.exp(x) def f_prime(x): return np.exp(x) * (2*x + x**2) x = np.array([0, 1, 2]) print(f_prime(x)) # [0.0, 8.154, 43.510]
import math def product_rule(f, f_p, g, g_p, x): """(fg)' = f'g + fg'""" return f_p(x)*g(x) + f(x)*g_p(x) # f=x², f'=2x, g=eˣ, g'=eˣ result = product_rule( lambda x: x**2, lambda x: 2*x, lambda x: math.exp(x), lambda x: math.exp(x), 1.0 ) print(result) # 8.1548
Chain Rule Intermediate
Worked example: d/dx [sin(x²)]
inner: g(x) = x² → g'(x) = 2x
chain rule: f'(g(x)) · g'(x) = cos(x²) · 2x = 2x·cos(x²)
Multi-layer chain: d/dx [f(g(h(x)))] = f'(g(h(x))) · g'(h(x)) · h'(x)
import numpy as np # f(x) = sigmoid(3x + 2) # outer: sigmoid(u), inner: 3x+2 def sigmoid(x): return 1 / (1 + np.exp(-x)) def sigmoid_prime(x): s = sigmoid(x) return s * (1 - s) # Chain rule: σ'(3x+2) · 3 x = np.array([0, 1, 2]) inner = 3*x + 2 deriv = sigmoid_prime(inner) * 3 print(deriv) # [0.315, 0.044, 0.005]
def chain_rule(outer_deriv, inner_deriv, inner_fn, x): """d/dx f(g(x)) = f'(g(x))·g'(x)""" u = inner_fn(x) # g(x) df_du = outer_deriv(u) # f'(g(x)) du_dx = inner_deriv(x) # g'(x) return df_du * du_dx import math # d/dx sin(x²) = cos(x²)·2x result = chain_rule( lambda u: math.cos(u), # f' lambda x: 2*x, # g' lambda x: x**2, # g 1.0 ) print(result) # ≈ 1.0806
Common Derivatives Reference Table Beginner
| f(x) | f'(x) | ML Context |
|---|---|---|
| xⁿ | nxⁿ⁻¹ | Power rule — polynomials, L2 reg |
| eˣ | eˣ | Softmax, sigmoid, probability |
| ln(x) | 1/x | Cross-entropy loss, log-likelihood |
| sin(x) | cos(x) | Positional encodings (Transformers) |
| cos(x) | −sin(x) | Positional encodings |
| σ(x) = 1/(1+e⁻ˣ) | σ(x)(1−σ(x)) | Binary classification, gates |
| tanh(x) | 1 − tanh²(x) | RNN hidden states, normalization |
| ReLU(x) = max(0,x) | 0 if x<0, 1 if x>0 | Most popular activation function |
| |x| (absolute) | sign(x) | L1 regularization, MAE loss |
| aˣ | aˣ · ln(a) | Exponential learning rate decay |
| loga(x) | 1/(x·ln(a)) | Information theory |
xⁿ → nxⁿ⁻¹, eˣ → eˣ, ln(x) → 1/x, σ(x) → σ(1−σ),
and ReLU → step function. Everything else can be derived from these using the chain rule.ML Activation Functions & Their Derivatives
Activation functions introduce nonlinearity into neural networks. Without them, stacking linear layers would just give another linear function. Understanding their derivatives is essential for backpropagation.
- Implement sigmoid, tanh, ReLU, and softmax from scratch
- Derive and implement their derivatives
- Choose the right activation function for each layer
- Understand vanishing gradient problems with sigmoid/tanh
- Differentiate common loss functions (MSE, cross-entropy)
Sigmoid & Its Derivative Intermediate
σ'(x) = σ(x) · (1 − σ(x))
Max derivative: σ'(0) = 0.25
Derivation using chain rule:
σ'(x) = −1·(1 + e⁻ˣ)⁻² · (−e⁻ˣ) chain rule
= e⁻ˣ / (1 + e⁻ˣ)²
= [1/(1 + e⁻ˣ)] · [e⁻ˣ/(1 + e⁻ˣ)]
= σ(x) · [1 − σ(x)] ✓
import numpy as np def sigmoid(x): return 1 / (1 + np.exp(-x)) def sigmoid_deriv(x): s = sigmoid(x) return s * (1 - s) x = np.linspace(-6, 6, 5) print("σ(x): ", np.round(sigmoid(x), 4)) print("σ'(x): ", np.round(sigmoid_deriv(x), 4)) # σ(x): [0.0025, 0.0474, 0.5, 0.9526, 0.9975] # σ'(x): [0.0025, 0.0452, 0.25, 0.0452, 0.0025] # ↑ Derivative peaks at 0.25 (x=0) and vanishes at extremes!
import math def sigmoid(x): # Numerically stable version if x >= 0: z = math.exp(-x) return 1 / (1 + z) else: z = math.exp(x) return z / (1 + z) def sigmoid_deriv(x): s = sigmoid(x) return s * (1 - s) # Verify max at x=0 print(sigmoid_deriv(0)) # 0.25 print(sigmoid_deriv(5)) # 0.0066 print(sigmoid_deriv(-5)) # 0.0066
Tanh & Its Derivative Intermediate
tanh'(x) = 1 − tanh²(x)
Note: tanh(x) = 2σ(2x) − 1 (scaled sigmoid!)
import numpy as np def tanh(x): return np.tanh(x) def tanh_deriv(x): t = np.tanh(x) return 1 - t**2 x = np.array([-2, -1, 0, 1, 2]) print("tanh: ", np.round(tanh(x), 4)) print("tanh': ", np.round(tanh_deriv(x), 4)) # tanh: [-0.9640, -0.7616, 0.0, 0.7616, 0.9640] # tanh': [0.0707, 0.4200, 1.0, 0.4200, 0.0707]
import math def tanh(x): ep = math.exp(x) em = math.exp(-x) return (ep - em) / (ep + em) def tanh_deriv(x): t = tanh(x) return 1 - t**2 # Verify: max deriv at x=0 is 1.0 print(tanh_deriv(0)) # 1.0 print(tanh_deriv(3)) # 0.0099
ReLU Family Intermediate
Leaky ReLU: f(x) = max(αx, x) → f'(x) = α if x<0, 1 if x>0 (α≈0.01)
ELU: f(x) = x if x>0, α(eˣ−1) if x≤0 → f'(x) = 1 if x>0, f(x)+α if x≤0
GELU: f(x) = x · Φ(x) (Φ = standard normal CDF, used in GPT/BERT)
import numpy as np def relu(x): return np.maximum(0, x) def relu_deriv(x): return (x > 0).astype(float) def leaky_relu(x, a=0.01): return np.where(x > 0, x, a*x) def leaky_relu_deriv(x, a=0.01): return np.where(x > 0, 1, a) def gelu(x): return 0.5*x*(1+np.tanh( np.sqrt(2/np.pi)*(x+0.044715*x**3))) x = np.array([-2, -1, 0, 1, 2]) print("ReLU: ", relu(x)) # [0, 0, 0, 1, 2]
def relu(x): return max(0, x) def relu_deriv(x): return 1 if x > 0 else 0 def leaky_relu(x, alpha=0.01): return x if x > 0 else alpha * x def leaky_relu_deriv(x, alpha=0.01): return 1 if x > 0 else alpha # The "dying ReLU" problem: # If a neuron always receives # negative input, its gradient # is always 0 → it never learns! # Leaky ReLU fixes this by # allowing a small gradient (α) # for negative inputs.
Softmax & Cross-Entropy Intermediate
cross-entropy: L = −Σᵢ yᵢ · log(ŷᵢ)
Combined gradient: ∂L/∂zᵢ = ŷᵢ − yᵢ (beautifully simple!)
import numpy as np def softmax(z): # Numerically stable e = np.exp(z - np.max(z)) return e / e.sum() def cross_entropy(y_true, y_pred): return -np.sum(y_true * np.log(y_pred + 1e-15)) # Example: 3-class classification logits = np.array([2.0, 1.0, 0.1]) probs = softmax(logits) print(probs) # [0.659, 0.242, 0.099] print(probs.sum()) # 1.0 # Gradient: simply ŷ − y! y_true = np.array([1, 0, 0]) # one-hot grad = probs - y_true print(grad) # [-0.341, 0.242, 0.099]
import math def softmax(z): # Subtract max for stability m = max(z) exps = [math.exp(zi - m) for zi in z] total = sum(exps) return [e / total for e in exps] def cross_entropy(y_true, y_pred): return -sum( yt * math.log(yp + 1e-15) for yt, yp in zip(y_true, y_pred) ) logits = [2.0, 1.0, 0.1] probs = softmax(logits) print([round(p, 3) for p in probs]) # [0.659, 0.242, 0.099]
nn.CrossEntropyLoss combines both into one fused operation for efficiency.
Loss Functions & Their Gradients Intermediate
(continuous output)"] --> MSE["MSE Loss
L = (1/n)Σ(ŷ−y)²"] REG --> MAE["MAE Loss
L = (1/n)Σ|ŷ−y|"] CLS["Classification
(categorical)"] --> BCE["Binary CE
L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]"] CLS --> CCE["Categorical CE
L = −Σ yᵢ·log(ŷᵢ)"] end
| Loss | Formula | Gradient ∂L/∂ŷ |
|---|---|---|
| MSE | (ŷ − y)² | 2(ŷ − y) |
| MAE | |ŷ − y| | sign(ŷ − y) |
| Binary CE | −y·log(ŷ) − (1−y)·log(1−ŷ) | (ŷ − y) / [ŷ(1 − ŷ)] |
| Categorical CE | −Σ yᵢ·log(ŷᵢ) | −y/ŷ (then softmax combines to ŷ−y) |
| Huber | ½(ŷ−y)² if |ŷ−y|≤δ, else δ|ŷ−y|−½δ² | (ŷ−y) if |ŷ−y|≤δ, else δ·sign(ŷ−y) |
import numpy as np def mse(y_true, y_pred): return np.mean((y_pred - y_true)**2) def mse_grad(y_true, y_pred): n = len(y_true) return 2 * (y_pred - y_true) / n def binary_ce(y_true, y_pred): eps = 1e-15 y_pred = np.clip(y_pred, eps, 1-eps) return -np.mean( y_true*np.log(y_pred) + (1-y_true)*np.log(1-y_pred) ) y = np.array([1, 0, 1]) yh = np.array([0.9, 0.1, 0.8]) print(f"MSE: {mse(y, yh):.4f}") print(f"BCE: {binary_ce(y, yh):.4f}")
import math def mse(y_true, y_pred): n = len(y_true) return sum((yp-yt)**2 for yt,yp in zip(y_true,y_pred)) / n def mse_grad(y_true, y_pred): n = len(y_true) return [2*(yp-yt)/n for yt,yp in zip(y_true,y_pred)] def binary_ce(y_true, y_pred): eps = 1e-15 n = len(y_true) total = 0 for yt, yp in zip(y_true, y_pred): yp = max(eps, min(1-eps, yp)) total -= (yt*math.log(yp) + (1-yt)*math.log(1-yp)) return total / n
Integral Calculus
Integration is the reverse of differentiation — it accumulates tiny changes into totals. In ML, integrals appear in probability distributions, expected values, and information theory.
- Understand the integral as area under a curve
- Apply the Fundamental Theorem of Calculus
- Implement numerical integration in pure Python
- Recognize where integrals appear in ML (expectations, KL divergence)
The Integral (Area Under the Curve) Intermediate
Indefinite integral: ∫ f(x) dx = F(x) + C
An integral computes the total accumulation of a
quantity, visualized as the area under a curve. If a function f(x) describes the rate of
change, its integral F(x) (the antiderivative) gives the total amount.
Example: Calculate the definite integral of
f(x) = 2x from x = 0 to x = 3.
- First, find the indefinite integral. What function's derivative is
2x? The power rule in reverse tells usF(x) = x². - Using the Fundamental Theorem of Calculus:
∫₀³ 2x dx = F(3) - F(0). - Evaluate at the bounds:
(3)² - (0)² = 9 - 0 = 9. - The total area under the curve
y = 2xbetweenx = 0andx = 3is exactly 9.
Antiderivative"] -->|"Differentiate
d/dx"| f["f(x)
Function"] f -->|"Integrate
∫ dx"| F end
import numpy as np from scipy import integrate # ∫₀² x² dx = [x³/3]₀² = 8/3 ≈ 2.667 result, error = integrate.quad( lambda x: x**2, 0, 2 ) print(f"{result:.4f}") # 2.6667 # Numerical integration with trapezoid x = np.linspace(0, 2, 1000) y = x**2 area = np.trapz(y, x) print(f"{area:.4f}") # 2.6667
def riemann_sum(f, a, b, n=1000): """Left Riemann sum approximation""" dx = (b - a) / n total = 0 for i in range(n): x = a + i * dx total += f(x) * dx return total def trapezoidal(f, a, b, n=1000): """More accurate than Riemann""" dx = (b - a) / n total = (f(a) + f(b)) / 2 for i in range(1, n): total += f(a + i * dx) return total * dx # ∫₀² x² dx = 8/3 result = trapezoidal( lambda x: x**2, 0, 2 ) print(f"{result:.6f}") # 2.666667
Fundamental Theorem of Calculus Intermediate
Common antiderivatives:
∫ eˣ dx = eˣ + C
∫ 1/x dx = ln|x| + C
∫ cos(x) dx = sin(x) + C
∫ sin(x) dx = −cos(x) + C
import numpy as np # Verify FTC: ∫₀³ 2x dx = [x²]₀³ = 9 # Antiderivative F(x) = x² F = lambda x: x**2 exact = F(3) - F(0) print(exact) # 9 # Numerical check x = np.linspace(0, 3, 10000) numerical = np.trapz(2*x, x) print(f"{numerical:.4f}") # 9.0000
def integrate_poly(coeffs, a, b): """Exact integral of polynomial coeffs: [c0, c1, c2, ...] p(x) = c0 + c1*x + c2*x² + ... """ # Antiderivative coeffs F_coeffs = [0] # constant term for i, c in enumerate(coeffs): F_coeffs.append(c / (i + 1)) def F(x): return sum(c * x**(i) for i, c in enumerate(F_coeffs)) return F(b) - F(a) # ∫₀³ 2x dx (coeffs = [0, 2]) print(integrate_poly([0, 2], 0, 3)) # 9.0
Integration Techniques Intermediate
By Parts: ∫ u·dv = u·v − ∫ v·du
Substitution example: ∫ 2x·eˣ² dx
∫ eᵘ du = eᵘ + C = eˣ² + C ✓
Numerical Integration Intermediate
import numpy as np def simpsons(f, a, b, n=1000): """Simpson's 1/3 rule (most accurate)""" assert n % 2 == 0 x = np.linspace(a, b, n+1) y = f(x) h = (b - a) / n return h/3 * (y[0] + y[-1] + 4*np.sum(y[1:-1:2]) + 2*np.sum(y[2:-2:2])) # ∫₀π sin(x) dx = 2 result = simpsons(np.sin, 0, np.pi) print(f"{result:.10f}") # 2.0000000000
import random, math def monte_carlo_integrate(f, a, b, n=100000): """Approximate ∫ₐᵇ f(x) dx by sampling random points""" total = 0 for _ in range(n): x = random.uniform(a, b) total += f(x) return (b - a) * total / n # ∫₀π sin(x) dx = 2 result = monte_carlo_integrate( math.sin, 0, math.pi ) print(f"{result:.4f}") # ≈ 2.0000 (stochastic!)
Integrals in Machine Learning Intermediate
KL divergence: DKL(P||Q) = ∫ p(x)·log(p(x)/q(x)) dx
Normalization: ∫ p(x) dx = 1
E[X] = ∫x·p(x)dx"] --> LOSS["Expected Loss
Risk minimization"] KL["KL Divergence
∫p·log(p/q)dx"] --> VAE["VAE Training
(ELBO objective)"] NORM["Normalization
∫p(x)dx = 1"] --> SOFT["Softmax
Σ eᶻⁱ = partition fn"] end
import numpy as np def kl_divergence(p, q): """KL(P||Q) for discrete dists""" p = np.asarray(p, dtype=float) q = np.asarray(q, dtype=float) # Avoid log(0) mask = p > 0 return np.sum(p[mask] * np.log(p[mask] / q[mask])) p = [0.4, 0.3, 0.3] q = [0.33, 0.33, 0.34] print(f"KL(P||Q) = {kl_divergence(p, q):.4f}") # KL(P||Q) = 0.0133
import math def expected_value(values, probs): """E[X] = Σ xᵢ · p(xᵢ)""" return sum(x*p for x,p in zip(values, probs)) # Fair die: E[X] = 3.5 vals = [1,2,3,4,5,6] probs = [1/6]*6 print(expected_value(vals, probs)) # 3.5 # Variance: E[X²] − E[X]² ex2 = expected_value([x**2 for x in vals], probs) ex = expected_value(vals, probs) print(f"Var = {ex2 - ex**2:.4f}") # Var = 2.9167
Double Integrals Advanced
z = f(x, y).
To compute a double integral over a rectangular region, we perform iterated integration: integrate with respect to one variable first (treating the other as a constant), then integrate the result with respect to the second variable.
Example: Calculate the volume under
f(x, y) = xy for x from 0 to 2, and y from 0 to 1.
- Set up the integral:
∫₀² ( ∫₀¹ xy dy ) dx - Inner integral (w.r.t y): Treat
xas a constant.- The integral of
yisy²/2. So,∫₀¹ xy dy = x * [y²/2]₀¹ = x * (1/2 - 0) = x/2.
- The integral of
- Outer integral (w.r.t x): Now integrate the result w.r.t
x.∫₀² (x/2) dx = (1/2) * [x²/2]₀² = (1/2) * (2 - 0) = 1.
- The total volume under the surface is exactly 1.
[x·y²/2]₀¹ = x/2"] INNER_EVAL --> OUTER["Outer Integral: ∫₀² (x/2) dx"] OUTER --> OUTER_EVAL["[x²/4]₀² = 1"] OUTER_EVAL --> VOL["Final Answer
Volume = 1"] style START fill:#f8f9fa,stroke:#6c757d,color:#212529 style VOL fill:#f3e8ff,stroke:#9333ea,color:#6b21a8
p(x, y). They are also used in marginalization (e.g., integrating out a variable
p(x) = ∫ p(x,y) dy).
Multivariable Calculus & Optimization
Real ML models have millions of parameters. Partial derivatives, gradients, and optimization algorithms let us navigate this high-dimensional landscape to find parameter values that minimize the loss.
- Compute partial derivatives and gradient vectors
- Implement gradient descent from scratch
- Understand learning rate, momentum, and Adam
- Explain what Jacobian and Hessian matrices encode
Partial Derivatives Intermediate
Treat y as a constant, differentiate with respect to x only
A partial derivative is used when a function depends on multiple variables, but we only want to see how it changes with respect to one variable, holding all others constant.
Example: Let f(x, y) = 3x²y + 2xy³ - 5
- To find ∂f/∂x (Partial w.r.t x): Treat
yas a constant number (like 4 or 10).- The derivative of
3x²yw.r.t x is6xy(the3yis just a constant multiplier ofx²). - The derivative of
2xy³w.r.t x is2y³(the2y³is a constant multiplier ofx). - The derivative of
-5is 0. - So,
∂f/∂x = 6xy + 2y³.
- The derivative of
- To find ∂f/∂y (Partial w.r.t y): Treat
xas a constant number.- The derivative of
3x²yw.r.t y is3x²(the3x²is a constant multiplier ofy). - The derivative of
2xy³w.r.t y is2x * 3y² = 6xy². - So,
∂f/∂y = 3x² + 6xy².
- The derivative of
import numpy as np def f(x, y): return 3*x**2*y + 2*x*y**3 - 5 def partial_x(f, x, y, h=1e-7): return (f(x+h, y) - f(x-h, y)) / (2*h) def partial_y(f, x, y, h=1e-7): return (f(x, y+h) - f(x, y-h)) / (2*h) print(partial_x(f, 1, 2)) # 28.0 (6·1·2 + 2·8) print(partial_y(f, 1, 2)) # 27.0 (3·1 + 6·1·4)
def partial_derivative(f, args, idx, h=1e-7): """Partial deriv w.r.t. args[idx]""" args_p = list(args) args_m = list(args) args_p[idx] += h args_m[idx] -= h return (f(*args_p) - f(*args_m)) / (2*h) def f(x, y): return 3*x**2*y + 2*x*y**3 - 5 # ∂f/∂x at (1,2) print(partial_derivative(f, [1,2], 0)) # 28.0 # ∂f/∂y at (1,2) print(partial_derivative(f, [1,2], 1)) # 27.0
The Gradient Vector Intermediate
Points toward steepest ascent"] --> NEG["Negate it
−∇f = steepest DESCENT"] MAG["Magnitude
||∇f|| = steepness of slope"] --> FLAT["||∇f|| ≈ 0
means flat (near minimum)"] end
import numpy as np def gradient(f, params, h=1e-7): """Compute ∇f numerically""" grad = np.zeros_like(params, dtype=float) for i in range(len(params)): p_plus = params.copy(); p_plus[i] += h p_minus = params.copy(); p_minus[i] -= h grad[i] = (f(p_plus) - f(p_minus)) / (2*h) return grad # f(w) = w₀² + 3w₁² (bowl) f = lambda w: w[0]**2 + 3*w[1]**2 w = np.array([4.0, 2.0]) print(gradient(f, w)) # [8.0, 12.0] # Exact: [2w₀, 6w₁] = [8, 12] ✓
def gradient(f, params, h=1e-7): """Compute ∇f for list of params""" grad = [] for i in range(len(params)): p_plus = params[:] p_minus = params[:] p_plus[i] += h p_minus[i] -= h g = (f(p_plus) - f(p_minus)) / (2*h) grad.append(g) return grad def f(w): return w[0]**2 + 3*w[1]**2 print(gradient(f, [4.0, 2.0])) # [8.0, 12.0]
loss.backward(), PyTorch computes ∇L with respect to every parameter in the model. The
optimizer then uses this gradient to update each parameter. The gradient vector is THE central object in all
of machine learning optimization.
Gradient Descent Intermediate
Let's trace one step of Gradient Descent to minimize
f(w₀, w₁) = w₀² + 3w₁².
- 1. Compute the Gradient: The partial derivatives are
∂f/∂w₀ = 2w₀and∂f/∂w₁ = 6w₁. So,∇f(w) = [2w₀, 6w₁]. - 2. Pick a Starting Point & Learning Rate: Let's start at
w = [4, 2]with learning rateα = 0.1. - 3. Evaluate Gradient at Start:
∇f([4, 2]) = [2(4), 6(2)] = [8, 12]. - 4. Take a Step:
w_new = w - α · ∇f(w)
w_new = [4, 2] - 0.1 · [8, 12]
w_new = [4, 2] - [0.8, 1.2] = [3.2, 0.8] - Result: We moved from
[4, 2]to[3.2, 0.8], getting much closer to the true minimum at[0, 0]!
randomly"] --> FWD["Forward Pass
ŷ = model(x)"] FWD --> LOSS["Compute Loss
L = loss(ŷ, y)"] LOSS --> GRAD["Compute Gradient
∇L = ∂L/∂w"] GRAD --> UPDATE["Update Weights
w ← w − α·∇L"] UPDATE -->|Repeat| FWD end
import numpy as np def gradient_descent(f, grad_f, w0, lr=0.01, epochs=100): w = w0.copy() history = [f(w)] for _ in range(epochs): g = grad_f(w) w = w - lr * g history.append(f(w)) return w, history # Minimize f(w) = w₀² + 3w₁² f = lambda w: w[0]**2 + 3*w[1]**2 grad = lambda w: np.array([2*w[0], 6*w[1]]) w_opt, hist = gradient_descent( f, grad, np.array([4.0, 2.0]), lr=0.1, epochs=50) print(f"w = {w_opt}") print(f"f(w) = {f(w_opt):.6f}") # w ≈ [0.0, 0.0] f(w) ≈ 0.0
def gradient_descent(f, grad_f, w0, lr=0.01, epochs=100): w = w0[:] for _ in range(epochs): g = grad_f(w) w = [wi - lr*gi for wi, gi in zip(w, g)] return w def f(w): return w[0]**2 + 3*w[1]**2 def grad_f(w): return [2*w[0], 6*w[1]] w = gradient_descent( f, grad_f, [4.0, 2.0], lr=0.1, epochs=50) print([round(x,6) for x in w]) # [0.0, 0.0]
optimizer.step() call
in PyTorch executes one iteration of gradient descent (or a variant like Adam). The training loop is: (1)
forward pass → (2) compute loss → (3) backward pass (get gradients) → (4) optimizer step (update weights) →
repeat.Learning Rate & Convergence Intermediate
Common schedules:
Exponential: α = α₀ × e−kt
Cosine annealing: α = αmin + ½(αmax−αmin)(1 + cos(πt/T))
Warmup: linearly increase α from 0 to α₀ over first N steps
torch.optim.lr_scheduler module provides all common schedules.SGD, Mini-Batch & Momentum Advanced
v ← β·v + ∇L(w) accumulate velocity (β≈0.9)
w ← w − α·v step in velocity direction
Let's trace one step of Momentum with
β = 0.9 and α = 0.1. Assume our current velocity is
v = [1.0, -0.5], and the current gradient is g = [0.2, 2.0].
- 1. Update Velocity:
v_new = β·v + g
v_new = 0.9 · [1.0, -0.5] + [0.2, 2.0]
v_new = [0.9, -0.45] + [0.2, 2.0] = [1.1, 1.55]
Notice how the velocity "remembers" the past direction but adds the new gradient's push. - 2. Update Weights:
w_new = w - α·v_new
w_new = w - 0.1 · [1.1, 1.55] = w - [0.11, 0.155]
import numpy as np def sgd_momentum(grad_f, w0, lr=0.01, beta=0.9, epochs=100): w = w0.copy() v = np.zeros_like(w) # velocity for _ in range(epochs): g = grad_f(w) v = beta * v + g # accumulate w = w - lr * v # step return w grad = lambda w: np.array([2*w[0], 6*w[1]]) w = sgd_momentum(grad, np.array([4.0, 2.0]), lr=0.01, beta=0.9) print(w) # ≈ [0, 0]
def sgd_momentum(grad_f, w0, lr=0.01, beta=0.9, epochs=100): w = w0[:] v = [0.0] * len(w) for _ in range(epochs): g = grad_f(w) v = [beta*vi + gi for vi,gi in zip(v,g)] w = [wi - lr*vi for wi,vi in zip(w,v)] return w def grad_f(w): return [2*w[0], 6*w[1]] w = sgd_momentum(grad_f, [4.0, 2.0]) print([round(x,4) for x in w]) # [0.0, 0.0]
torch.optim.SGD(params, lr=0.1, momentum=0.9) — SGD with momentum is still the state-of-the-art
optimizer for many vision tasks (ResNet, EfficientNet). It often generalizes better than Adam despite
converging more slowly.
Adam Optimizer Advanced
v ← β₂·v + (1−β₂)·g² (second moment — variance of gradients)
m̂ = m / (1 − β₁ᵗ) (bias correction)
v̂ = v / (1 − β₂ᵗ)
w ← w − α · m̂ / (√v̂ + ε) (update)
w−α·g"] -->|"Add velocity"| MOM["Momentum
v=βv+g, w−αv"] MOM -->|"Add per-param lr"| ADAM["Adam
Adaptive lr per param"] end
import numpy as np def adam(grad_f, w0, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, epochs=200): w = w0.copy() m = np.zeros_like(w) v = np.zeros_like(w) for t in range(1, epochs+1): g = grad_f(w) m = b1*m + (1-b1)*g v = b2*v + (1-b2)*g**2 m_hat = m / (1 - b1**t) v_hat = v / (1 - b2**t) w = w - lr * m_hat / (np.sqrt(v_hat) + eps) return w grad = lambda w: np.array([2*w[0], 6*w[1]]) w = adam(grad, np.array([4.0, 2.0])) print(np.round(w, 4)) # [0, 0]
import math def adam(grad_f, w0, lr=0.001, b1=0.9, b2=0.999, eps=1e-8, epochs=200): w = w0[:] m = [0.0]*len(w) v = [0.0]*len(w) for t in range(1, epochs+1): g = grad_f(w) for i in range(len(w)): m[i] = b1*m[i]+(1-b1)*g[i] v[i] = b2*v[i]+(1-b2)*g[i]**2 mh = m[i]/(1-b1**t) vh = v[i]/(1-b2**t) w[i] -= lr*mh/(math.sqrt(vh)+eps) return w def grad_f(w): return [2*w[0], 6*w[1]] w = adam(grad_f, [4.0, 2.0]) print([round(x,4) for x in w]) # [0.0, 0.0]
torch.optim.Adam(params, lr=3e-4) is the most common optimizer in deep learning. Defaults
(β₁=0.9, β₂=0.999, ε=1e-8) work for most tasks. AdamW adds proper weight decay and is the
default for Transformer training.
Jacobian Matrix Advanced
[∂f₂/∂x₁ ∂f₂/∂x₂ ··· ∂f₂/∂xₙ]
[ ⋮ ⋮ ⋮ ]
[∂fₘ/∂x₁ ∂fₘ/∂x₂ ··· ∂fₘ/∂xₙ]
Shape: (outputs × inputs) = (m × n)
Let's find the Jacobian for the function
f(x, y) = [x² + y, xy]. Here, we have two inputs (x, y) and two outputs
(f₁, f₂).
- Output 1:
f₁(x, y) = x² + y.
Its partial derivatives are∂f₁/∂x = 2xand∂f₁/∂y = 1. This forms the first row. - Output 2:
f₂(x, y) = xy.
Its partial derivatives are∂f₂/∂x = yand∂f₂/∂y = x. This forms the second row. - The Jacobian Matrix J is:
[[ 2x, 1 ]]
[[ y, x ]] - Evaluate at (3, 2): Plugging in
x=3, y=2gives:
[[ 2(3), 1 ]]=
[[ 2, 3 ]][[ 6, 1 ], [ 2, 3 ]]
ℝ²"] -->|Jacobian Matrix J| F["Output: f₁, f₂
ℝ²"] J1["Row 1: ∇f₁"] -.->|"∂f₁/∂x, ∂f₁/∂y"| F J2["Row 2: ∇f₂"] -.->|"∂f₂/∂x, ∂f₂/∂y"| F style X fill:#f8f9fa,stroke:#999999 style F fill:#f8f9fa,stroke:#999999
import numpy as np def jacobian(f, x, h=1e-7): """Numerical Jacobian matrix""" x = np.asarray(x, dtype=float) f0 = np.asarray(f(x)) n = len(x); m = len(f0) J = np.zeros((m, n)) for j in range(n): xp = x.copy(); xp[j] += h xm = x.copy(); xm[j] -= h J[:, j] = (f(xp) - f(xm)) / (2*h) return J # f(x,y) = [x²+y, xy] f = lambda x: np.array([x[0]**2+x[1], x[0]*x[1]]) J = jacobian(f, [3.0, 2.0]) print(J) # [[6. 1.] ← [2x, 1] # [2. 3.]] ← [y, x]
def jacobian(f, x, h=1e-7): n = len(x) f0 = f(x) m = len(f0) J = [[0.0]*n for _ in range(m)] for j in range(n): xp = x[:]; xm = x[:] xp[j] += h; xm[j] -= h fp = f(xp); fm = f(xm) for i in range(m): J[i][j] = (fp[i]-fm[i])/(2*h) return J def f(x): return [x[0]**2+x[1], x[0]*x[1]] J = jacobian(f, [3.0, 2.0]) for row in J: print(row) # [6.0, 1.0] # [2.0, 3.0]
Hessian Matrix Advanced
Hessian is symmetric (H = Hᵀ) for smooth functions
Newton's method: w ← w − H⁻¹ · ∇f (quadratic convergence)
Let's calculate the Hessian for
f(x, y) = x² + 3y².
- 1. First Derivatives (Gradient):
∂f/∂x = 2x
∂f/∂y = 6y - 2. Second Derivatives (Hessian entries):
- Differentiate
2xwrtx:∂²f/∂x² = 2 - Differentiate
2xwrty:∂²f/∂x∂y = 0 - Differentiate
6ywrtx:∂²f/∂y∂x = 0 - Differentiate
6ywrty:∂²f/∂y² = 6
- Differentiate
- The Hessian Matrix H is:
[[ 2, 0 ]]
[[ 0, 6 ]]
(Eigenvalues > 0)"] --> MIN["Local Minimum
Looks like a Bowl"] NEG["Hessian is Negative Definite
(Eigenvalues < 0)"] --> MAX["Local Maximum
Looks like a Hill"] MIX["Hessian is Indefinite
(Mixed Eigenvalues)"] --> SAD["Saddle Point
Looks like a Pringles chip"] end
import numpy as np def hessian(f, x, h=1e-5): """Numerical Hessian matrix""" n = len(x) H = np.zeros((n, n)) for i in range(n): for j in range(n): xpp=x.copy(); xpp[i]+=h; xpp[j]+=h xpm=x.copy(); xpm[i]+=h; xpm[j]-=h xmp=x.copy(); xmp[i]-=h; xmp[j]+=h xmm=x.copy(); xmm[i]-=h; xmm[j]-=h H[i,j] = (f(xpp)-f(xpm)-f(xmp)+f(xmm)) H[i,j] /= (4*h*h) return H # f(x,y) = x² + 3y² → H = [[2,0],[0,6]] f = lambda x: x[0]**2 + 3*x[1]**2 print(hessian(f, np.array([1.0,1.0]))) # [[2. 0.] # [0. 6.]]
# Hessian eigenvalues tell us # about the loss landscape shape: # # All positive → local minimum # All negative → local maximum # Mixed → saddle point # # Condition number (λ_max / λ_min) # tells us how "elongated" the # bowl is. High condition number # → GD oscillates → slow. # This is why Adam helps: it # scales each dimension by 1/√v, # effectively preconditioning. import numpy as np H = np.array([[2, 0], [0, 6]]) eigvals = np.linalg.eigvalsh(H) print(f"Eigenvalues: {eigvals}") # [2. 6.] print(f"Condition: {max(eigvals)/min(eigvals)}") # 3.0
Backpropagation
Backpropagation is the chain rule applied systematically through a computational graph. It is the algorithm that makes training deep neural networks possible.
- Draw a computational graph for any neural network
- Trace the forward and backward passes step by step
- Implement backprop for a 2-layer network from scratch
- Explain why gradients vanish or explode in deep networks
Computational Graphs Advanced
Each node in the graph stores two things during training:
Each node computes its output from its inputs and caches intermediate values (needed for backward pass).
Each node receives ∂L/∂output, computes ∂L/∂input using its local derivative, and passes it upstream.
.backward(), they traverse this graph in reverse to compute all gradients. This is why tensors
track requires_grad=True.
Forward Pass Advanced
# Forward pass for a 2-layer network # Layer 1: z1 = W1·x + b1, a1 = ReLU(z1) # Layer 2: z2 = W2·a1 + b2, a2 = sigmoid(z2) import numpy as np def forward(x, W1, b1, W2, b2): # Layer 1 z1 = W1 @ x + b1 a1 = np.maximum(0, z1) # ReLU # Layer 2 z2 = W2 @ a1 + b2 a2 = 1 / (1 + np.exp(-z2)) # Sigmoid cache = (x, z1, a1, z2, a2, W1, W2) return a2, cache
Backward Pass (Backpropagation) Advanced
(from loss)"] --> dSIG["σ' node
×σ(1−σ)"] dSIG --> dADD["+ node
pass through"] dADD --> dW["∂L/∂w
(save for update)"] dADD --> dB["∂L/∂b
(save for update)"] dADD --> dX["∂L/∂x
(to prev layer)"] end
# Backward pass — compute all gradients def backward(y_true, cache): x, z1, a1, z2, a2, W1, W2 = cache m = y_true.shape[0] # batch size # Output layer gradient (BCE loss + sigmoid) dz2 = a2 - y_true # ∂L/∂z2 = ŷ - y dW2 = (1/m) * dz2 @ a1.T # ∂L/∂W2 db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True) # Hidden layer gradient da1 = W2.T @ dz2 # chain rule dz1 = da1 * (z1 > 0) # ReLU derivative dW1 = (1/m) * dz1 @ x.T db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True) return {'dW1': dW1, 'db1': db1, 'dW2': dW2, 'db2': db2}
loss.backward(). The framework traverses the computational graph in reverse, applying
the chain rule at each node. All parameter gradients are stored in param.grad and then consumed
by the optimizer.Backprop Through a 2-Layer Network Advanced
import numpy as np # === COMPLETE 2-LAYER NETWORK === np.random.seed(42) # Data: XOR problem X = np.array([[0,0],[0,1],[1,0],[1,1]]).T # (2, 4) Y = np.array([[0,1,1,0]]) # (1, 4) # Initialize weights W1 = np.random.randn(4, 2) * 0.5 # (4, 2) b1 = np.zeros((4, 1)) W2 = np.random.randn(1, 4) * 0.5 # (1, 4) b2 = np.zeros((1, 1)) lr = 1.0 for epoch in range(10000): # Forward z1 = W1 @ X + b1 a1 = np.maximum(0, z1) z2 = W2 @ a1 + b2 a2 = 1 / (1 + np.exp(-z2)) # Loss loss = -np.mean(Y*np.log(a2+1e-8) + (1-Y)*np.log(1-a2+1e-8)) # Backward m = 4 dz2 = a2 - Y dW2 = (1/m) * dz2 @ a1.T db2 = (1/m) * np.sum(dz2, axis=1, keepdims=True) dz1 = (W2.T @ dz2) * (z1 > 0) dW1 = (1/m) * dz1 @ X.T db1 = (1/m) * np.sum(dz1, axis=1, keepdims=True) # Update W1 -= lr * dW1; b1 -= lr * db1 W2 -= lr * dW2; b2 -= lr * db2 if epoch % 2000 == 0: print(f"Epoch {epoch}: loss = {loss:.4f}") # Test predictions print(np.round(a2, 2)) # ≈ [0, 1, 1, 0] — XOR solved!
Vanishing & Exploding Gradients Advanced
n multiplications → (0.25)ⁿ for sigmoid → vanishes for large n
Solutions:
2. Skip connections (ResNet): gradient flows through shortcut paths
3. Batch normalization: keeps activations in well-conditioned range
4. Gradient clipping: cap gradient magnitude to prevent explosion
5. Careful initialization: He init (ReLU), Xavier init (tanh/sigmoid)
6. LSTM/GRU gates: control gradient flow in recurrent networks
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0) is
standard for RNNs and Transformers.Advanced Topics
Taylor series approximations and automatic differentiation — the mathematical machinery that powers modern deep learning frameworks.
Taylor Series & Approximation Advanced
1st order ≈ linear · 2nd order ≈ quadratic (Newton's method)
import numpy as np def taylor_exp(x, n_terms=10): """eˣ ≈ 1 + x + x²/2! + x³/3! +...""" result = np.zeros_like(x, dtype=float) for n in range(n_terms): result += x**n / np.math.factorial(n) return result x = np.array([0, 1, 2]) print("Taylor:", taylor_exp(x)) print("Exact: ", np.exp(x)) # Taylor: [1. 2.718 7.389] # Exact: [1. 2.718 7.389]
def factorial(n): r = 1 for i in range(2, n+1): r *= i return r def taylor_exp(x, n_terms=15): return sum(x**n / factorial(n) for n in range(n_terms)) def taylor_sin(x, n_terms=10): return sum( (-1)**n * x**(2*n+1) / factorial(2*n+1) for n in range(n_terms)) print(taylor_exp(1)) # 2.71828... print(taylor_sin(3.14159/2)) # ≈ 1.0
Automatic Differentiation Advanced
Compute ∂y/∂x alongside y
Good for few inputs, many outputs"] --> DUAL["Uses dual numbers
(value, derivative) pairs"] REV["Reverse Mode (Backprop)
Compute all ∂y/∂xᵢ in one pass
Good for many inputs, few outputs"] --> TAPE["Uses computation tape
= what PyTorch does"] end
Propagates derivatives alongside values. Cost = O(n) passes for n inputs. Good when inputs << outputs (rare in ML).
One forward pass + one backward pass computes ALL gradients. Cost = O(1) backward passes regardless of # inputs. This is why backprop is used in ML.
# Tiny autodiff engine (inspired by Karpathy's micrograd) class Value: def __init__(self, data, children=(), op=''): self.data = data self.grad = 0.0 self._backward = lambda: None self._prev = set(children) def __add__(self, other): out = Value(self.data + other.data, (self, other), '+') def _backward(): self.grad += out.grad # ∂(a+b)/∂a = 1 other.grad += out.grad # ∂(a+b)/∂b = 1 out._backward = _backward return out def __mul__(self, other): out = Value(self.data * other.data, (self, other), '*') def _backward(): self.grad += other.data * out.grad # ∂(ab)/∂a = b other.grad += self.data * out.grad # ∂(ab)/∂b = a out._backward = _backward return out def backward(self): # Topological sort + reverse traverse topo = []; visited = set() def build(v): if v not in visited: visited.add(v) for c in v._prev: build(c) topo.append(v) build(self) self.grad = 1.0 for v in reversed(topo): v._backward() # Example: f = (a * b) + b a = Value(2.0); b = Value(3.0) c = a * b # c = 6 d = c + b # d = 9 d.backward() print(f"∂d/∂a = {a.grad}") # 3.0 (= b) print(f"∂d/∂b = {b.grad}") # 3.0 (= a + 1)
Value class IS
the core idea behind PyTorch's autograd. PyTorch's torch.Tensor does exactly this at scale —
tracks operations, stores local derivatives, and traverses the graph backward. Karpathy's
micrograd (~100 lines) implements a full autodiff engine.
Complete Coverage Map
Every concept in this guide mapped to its ML application:
Comments
Loading comments...