Linear Algebra for Machine Learning
Welcome to the ultimate interactive reference guide for Linear Algebra in Machine Learning. Dive into vectors, matrices, dimensionality reduction, and pure Python implementations. This guide bridges the gap between complex mathematical theory and practical, runnable code.
Foundations
What is a Vector?
A vector is an ordered list of numbers. Think of it as a point in space — or an arrow with both magnitude and direction. In machine learning it holds features: one number per feature.
import numpy as np v = np.array([1, 4, 9]) # 1-D array print(v.shape) # (3,) print(v.ndim) # 1 # Column vector (3×1) col = np.array([[1],[4],[9]]) print(col.shape) # (3, 1)
# Row vector — a plain list v = [1, 4, 9] len(v) # 3 (length = shape) # Column vector — list of 1-element lists col = [[1], [4], [9]] len(col) # 3 rows len(col[0]) # 1 col # Access elements v[0] # 1 col[1][0] # 4
What is a Matrix?
A matrix is a 2-D grid of numbers arranged in rows and columns. A dataset with m samples and n features is naturally an m × n matrix.
A = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]]) print(A.shape) # (3, 4) print(A.ndim) # 2 # Access element [row][col] A[1, 2] # 7
A = [[1,2,3,4], [5,6,7,8], [9,10,11,12]] # Shape via len() rows = len(A) # 3 cols = len(A[0]) # 4 # Access element [row][col] A[1][2] # 7
What is a Transpose?
The transpose of a matrix flips it over its main diagonal — rows become columns and columns become rows. An (m × n) matrix becomes (n × m).
A = np.array([[1,2,3],[4,5,6]]) # Method 1 — attribute AT = A.T # Method 2 — function AT = np.transpose(A) print(AT.shape) # (3, 2)
A = [[1,2,3],[4,5,6]] # Method 1 — explicit loops def transpose(A): rows, cols = len(A), len(A[0]) T = [] for j in range(cols): row = [] for i in range(rows): row.append(A[i][j]) T.append(row) return T # Method 2 — list comprehension def transpose(A): return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))] # Method 3 — zip trick (Pythonic) def transpose(A): return [list(row) for row in zip(*A)] transpose(A) # [[1,4],[2,5],[3,6]]
zip(*A) unpacks all rows as arguments to zip, which groups by column position — elegant and Pythonic.Transposing 3D Tensors
In 3D (e.g., shape (batch, rows, cols)), you rarely reverse all dimensions. Instead, you typically transpose the rows and columns of each matrix in the batch independently, leaving the batch dimension intact.
T = np.random.rand(10, 4, 5) # 10 batches of 4x5 # Swap axes 1 (rows) and 2 (cols), leave axis 0 (batch) T_trans = np.transpose(T, axes=(0, 2, 1)) print(T_trans.shape) # (10, 5, 4)
# Transpose each matrix in the batch def transpose_3d(T): return [transpose(batch) for batch in T] # uses the 2D transpose function defined above
Shape of a Matrix & What is an Axis?
The shape is a tuple describing how many elements exist along each axis (dimension). Axis 0 runs down the rows; axis 1 runs across the columns.
A.shape # (3, 4) np.sum(A, axis=0) # sum down rows → (4,) np.sum(A, axis=1) # sum across cols → (3,)
# Shape def shape(A): rows = len(A) cols = len(A[0]) return (rows, cols) # Sum along axis 0 (down rows) def sum_axis0(A): cols = len(A[0]) return [sum(A[i][j] for i in range(len(A))) for j in range(cols)] # Sum along axis 1 (across cols) def sum_axis1(A): return [sum(row) for row in A]
Axes in 3D Tensors
When you move to 3D, a new axis is added at the front. The standard convention in machine learning is (batch, rows, columns):
- Axis 0 (Batch): Depth or number of samples. Summing along axis 0 collapses all batches into a single matrix.
- Axis 1 (Rows): Height of each matrix.
- Axis 2 (Columns): Width of each matrix.
Element-wise Operations
Element-wise operations apply a function independently to each matching pair of elements. Both arrays must have the same shape (or be broadcastable).
A = np.array([[1,2],[3,4]]) B = np.array([[2,4],[6,8]]) A + B # [[3,6],[9,12]] A - B # [[-1,-2],[-3,-4]] A * B # [[2,8],[18,32]] A / B # [[0.5,0.5],[0.5,0.5]] A ** 2 # [[1,4],[9,16]] np.sqrt(A) # [[1.,1.41…],[1.73…,2.]]
A = [[1,2],[3,4]] B = [[2,4],[6,8]] # Addition def add(A, B): return [[A[i][j] + B[i][j] for j in range(len(A[0]))] for i in range(len(A))] # Element-wise multiply def mul(A, B): return [[A[i][j] * B[i][j] for j in range(len(A[0]))] for i in range(len(A))] # Power / sqrt import math [[x**2 for x in row] for row in A] [[math.sqrt(x) for x in row] for row in A]
A * B is element-wise multiplication. A @ B is matrix multiplication. They are completely different operations.Concatenation
Joining arrays along an existing axis.
a = np.array([[1,2],[3,4]]) # (2,2) b = np.array([[5,6]]) # (1,2) # Vertical stack (axis 0) np.concatenate([a, b], axis=0) # (3,2) np.vstack([a, b]) # Horizontal stack (axis 1) np.concatenate([a, b.T], axis=1) # (2,3) np.hstack([a, b.T])
a = [[1,2],[3,4]] b = [[5,6]] # Vertical stack (add rows) def vstack(A, B): return A + B # [[1,2],[3,4],[5,6]] # Horizontal stack (add cols) def hstack(A, B): return [rowA + rowB for rowA, rowB in zip(A, B)] # transpose b first, then hstack bT = [list(row) for row in zip(*b)] hstack(a, bT) # [[1,2,5],[3,4,6]]
Concatenating 3D Tensors
To join 3D tensors, the dimensions along all non-concatenated axes must perfectly match. For example, adding more samples to a batch means joining along axis=0.
T1 = np.ones((2, 3, 4)) # 2 batches T2 = np.ones((5, 3, 4)) # 5 batches # Join along batch dimension (axis 0) T_joined = np.concatenate([T1, T2], axis=0) print(T_joined.shape) # (7, 3, 4)
# Since axis 0 is the outermost list, # we just add the lists together! T_joined = T1 + T2 # Length goes from 2 to 7 batches
The Dot Product
The dot product of two equal-length vectors multiplies corresponding elements and sums the results into a single scalar — a measure of alignment between two vectors.
a = np.array([1, 3, 5]) b = np.array([2, 4, 6]) np.dot(a, b) # 44 a @ b # 44 (@ operator)
a = [1, 3, 5] b = [2, 4, 6] # Method 1 — index loop def dot(a, b): return sum(a[i] * b[i] for i in range(len(a))) # Method 2 — zip (Pythonic) def dot(a, b): return sum(x * y for x, y in zip(a, b)) dot(a, b) # 44
Matrix Multiplication
Matrix multiplication is a series of dot products. It is the core operation of deep learning, driving everything from linear layers to attention mechanisms.
A's columns must equal B's rows — that shared n is the dimension you "dot over". The result inherits A's row count and B's column count.
Each cell C[i][j] is the dot product of the entire row i of A with the entire column j of B. Multiply element-by-element, then sum.
Here n = 3, so k runs 0 → 1 → 2 (three iterations, three pairs, three products summed):
A = np.array([[1,2,3],[4,5,6]]) # (2,3) B = np.array([[7,8],[9,10],[11,12]]) # (3,2) C = np.matmul(A, B) # (2,2) C = A @ B # identical
A = [[1,2,3],[4,5,6]] B = [[7,8],[9,10],[11,12]] def matmul(A, B): m = len(A) p = len(B[0]) n = len(A[0]) # shared dimension C = [[0] * p for _ in range(m)] for i in range(m): # row in A for j in range(p): # col in B for k in range(n): # dot product sum C[i][j] += A[i][k] * B[k][j] return C
Total operations: m × p × n multiplications. For a 2×3 and 3×2 matrix: 2×2×3 = 12 multiplications to produce 4 result cells.
for b in range(batch): # NEW OUTER LOOP for i in range(m): # selects row in A[b] for j in range(p): # selects col in B[b] for k in range(n): # dot product C[b][i][j] += A[b][i][k] * B[b][k][j]
NumPy
What is NumPy?
NumPy (Numerical Python) is the core library for numerical computing in Python. Its star is the ndarray — an N-dimensional array stored as a contiguous block of typed memory, enabling vectorised operations orders of magnitude faster than Python loops.
result = [] for i in range(len(a)): result.append(a[i] * b[i])
result = a * b
# vectorised — runs in C
What is Parallelization & Why it Matters
Parallelization means splitting a computation across multiple processing units (CPU cores, GPU threads) so they run simultaneously. Instead of doing 1 million multiplications one after another, you do them all at once.
In ML, training on millions of examples would take days without parallelisation. NumPy's vectorised API, and frameworks like TensorFlow/PyTorch, express computation as array operations so the hardware (CPU/GPU/TPU) can parallelise automatically.
Broadcasting
Broadcasting lets NumPy perform operations on arrays with different shapes by virtually stretching the smaller array to match the larger one — without copying data in memory.
A = np.array([[1,2,3],[4,5,6],[7,8,9]]) # (3,3) b = np.array([10, 20, 30]) # (3,) A + b # b is broadcast row-wise # Rule: align shapes from the RIGHT. # Each dim must be equal OR one of them must be 1. # (3,3) vs (3,) → (3,3) vs (1,3) → compatible ✓
print(A.shape, b.shape) before operating on unfamiliar arrays. Silent shape mismatches produce wrong results, not errors.NumPy Slicing
NumPy slicing uses start:stop:step — stop is always excluded. For 2-D arrays, slice both dimensions at once: A[rows, cols].
v = np.array([10, 20, 30, 40, 50]) v[1:4] # [20, 30, 40] v[::2] # [10, 30, 50] (every other) v[-1] # 50 (last element) A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]]) A[1:, 1:3] # [[6,7],[10,11]] A[:, 0] # [1, 5, 9] (entire first column) A[0, :] # [1, 2, 3, 4] (entire first row) A[::-1] # reverse rows A[:, ::-1] # reverse columns A[A > 5] # boolean mask → [6,7,8,9,10,11,12]
.copy() to avoid surprises.5 Slicing Exercises (NumPy)
v = np.array([10, 20, 30, 40, 50, 60])
a) Get element at index 3.
b) Get elements from index 1 up to (not including) 4.
c) Get every other element (indices 0, 2, 4).
v[3] # 40 v[1:4] # [20, 30, 40] v[::2] # [10, 30, 50]
start:stop:step — stop is always excluded. Omitting start defaults to 0, omitting stop defaults to the end.A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])
a) Extract the entire 3rd row (index 2).
b) Extract the entire 2nd column (index 1).
c) Extract the 2×2 sub-matrix from rows 1–2, cols 2–3.
A[2, :] # [9, 10, 11, 12] A[:, 1] # [2, 6, 10, 14] A[1:3, 2:4] # [[7,8],[11,12]]
A[row, col] — both can be a single index, a range, or : (all). First dimension is always rows.B = np.array([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]])
a) Get the last column using a negative index.
b) Get the last 2 rows.
c) Reverse the matrix left-to-right (flip column order).
B[:, -1] # [5, 10, 15] B[-2:, :] # [[6,7,8,9,10],[11,12,13,14,15]] B[:, ::-1] # [[5,4,3,2,1],[10,9,…],[15,14,…]]
C = np.array([3, -1, 7, -5, 2, -8, 4]) D = np.array([[10,20,30],[40,50,60],[70,80,90]])
a) From C, extract only positive values.
b) Replace all negative values in C with 0.
c) From D, select rows 0 and 2 only.
C[C > 0] # [3, 7, 2, 4] C[C < 0] = 0 # C → [3,0,7,0,2,0,4] D[[0, 2], :] # [[10,20,30],[70,80,90]]
# Shape: (2 batches, 3 rows, 4 cols) E = np.arange(24).reshape(2, 3, 4)
a) Extract the first batch (index 0).
b) Extract the last column of every batch.
c) Extract the 2×2 top-left corner from each batch simultaneously.
d) Get the single element at batch 1, row 2, col 3.
E[0] # first batch — shape (3, 4) E[:, :, -1] # last col of every batch — shape (2, 3) E[:, :2, :2] # top-left 2×2 of each batch — shape (2, 2, 2) E[1, 2, 3] # 23
Pure Python — No NumPy
Holberton will require you to implement these operations from scratch using nested lists and loops — before you're allowed to touch NumPy. Understand every line here.
Representing Vectors & Matrices
A vector is a plain list. A matrix is a list of lists — each inner list is one row.
# Vector v = [1, 2, 3, 4] # Matrix (2D) — list of rows A = [[1, 2, 3], [4, 5, 6]] # Tensor (3D) — list of matrices (batches) T = [ [[1, 2], [3, 4]], # batch 0 [[5, 6], [7, 8]] # batch 1 ] A[1][2] # 6 — row 1, col 2 (NOT A[1,2]!) T[1][0][1] # 6 — batch 1, row 0, col 1
A[row][col] — NOT A[row, col]. That comma syntax is NumPy only and will raise a TypeError on plain lists.Getting the Shape
No .shape attribute on lists. Build it yourself with len().
def shape_2d(matrix): return (len(matrix), len(matrix[0])) def shape_3d(tensor): batches = len(tensor) rows = len(tensor[0]) cols = len(tensor[0][0]) return (batches, rows, cols) A = [[1,2,3],[4,5,6]] T = [[[1,2]],[[3,4]],[[5,6]]] shape_2d(A) # (2, 3) shape_3d(T) # (3, 1, 2)
Transpose (Pure Python)
Swap rows and columns: T[j][i] = A[i][j].
# Explicit loops def transpose(A): rows, cols = len(A), len(A[0]) T = [] for j in range(cols): row = [] for i in range(rows): row.append(A[i][j]) T.append(row) return T # List comprehension def transpose(A): return [[A[i][j] for i in range(len(A))] for j in range(len(A[0]))] # Elegant zip trick def transpose(A): return [list(row) for row in zip(*A)]
zip(*A) works for transpose, dot products, and matrix multiplication.Element-wise Addition & Subtraction
# Vector addition def add_vectors(a, b): return [a[i] + b[i] for i in range(len(a))] # Matrix addition (2D) def add_matrices(A, B): return [[A[i][j] + B[i][j] for j in range(len(A[0]))] for i in range(len(A))] # Tensor addition (3D) — triple comprehension def add_3d(T1, T2): return [[[T1[b][i][j] + T2[b][i][j] for j in range(len(T1[0][0]))] for i in range(len(T1[0]))] for b in range(len(T1))]
Scalar Multiplication
def scalar_multiply_vec(v, s): return [x * s for x in v] def scalar_multiply(A, s): return [[elem * s for elem in row] for row in A] scalar_multiply([[1,2],[3,4]], 2) # [[2, 4], [6, 8]]
Dot Product (Pure Python)
def dot_product(a, b): return sum(a[i] * b[i] for i in range(len(a))) # Using zip (Pythonic) def dot_product(a, b): return sum(x * y for x, y in zip(a, b)) dot_product([1,3,5], [2,4,6]) # 44
Matrix Multiplication (Pure Python)
C[i][j] = dot product of row i of A with column j of B. The classic triple nested loop.
def mat_mul(A, B): m, k, n = len(A), len(B), len(B[0]) C = [[0] * n for _ in range(m)] for i in range(m): # each row of A for j in range(n): # each col of B for p in range(k): # shared dimension C[i][j] += A[i][p] * B[p][j] return C # Batched Matrix Multiply (3D) # shape: (batch, m, k) @ (batch, k, n) -> (batch, m, n) def bmm(T1, T2): return [mat_mul(T1[b], T2[b]) for b in range(len(T1))]
Concatenation (Pure Python)
# Vertical — stack rows (like np.vstack) def vstack(A, B): return A + B # lists concat directly # Horizontal — join cols row by row (like np.hstack) def hstack(A, B): return [A[i] + B[i] for i in range(len(A))] A = [[1,2],[3,4]] ; B = [[5,6],[7,8]] vstack(A, B) # [[1,2],[3,4],[5,6],[7,8]] hstack(A, B) # [[1,2,5,6],[3,4,7,8]]
2D & 3D Matrix Slicing — Pure Python
Row slicing works directly on the outer list. Column slicing requires a list comprehension — there is no direct column slice on a plain list.
A = [
[ 1, 2, 3, 4, 5], # row 0
[ 6, 7, 8, 9, 10], # row 1
[11, 12, 13, 14, 15], # row 2
[16, 17, 18, 19, 20] # row 3
] # shape: 4 rows × 5 cols
Single element
A[1][2] # 8 — row 1, col 2 A[0][0] # 1 — top-left corner A[-1][-1] # 20 — bottom-right corner
A[1, 2] → TypeError on a plain list. Always use A[row][col].Entire row / range of rows
A[1] # [6, 7, 8, 9, 10] — single row A[1:3] # rows 1 and 2 A[:2] # first 2 rows A[2:] # last 2 rows A[::2] # every other row A[-1] # last row A[-2:] # last 2 rows
Entire column / range of columns
# Single column — NumPy: A[:, 2] [row[2] for row in A] # [3, 8, 13, 18] # Column range — NumPy: A[:, 1:4] [row[1:4] for row in A] # [[2,3,4],[7,8,9],[12,13,14],[17,18,19]] # Last column — NumPy: A[:, -1] [row[-1] for row in A] # [5, 10, 15, 20] # Last 2 cols — NumPy: A[:, -2:] [row[-2:] for row in A] # [[4,5],[9,10],[14,15],[19,20]]
[row[j] for row in A] is the pure-Python equivalent of NumPy's A[:, j]. Memorise this pattern.Sub-matrix (rows AND columns)
# rows 1–2, cols 2–3 — NumPy: A[1:3, 2:4] [row[2:4] for row in A[1:3]] # [[8,9],[13,14]] # General helper def submatrix(A, r1, r2, c1, c2): return [row[c1:c2] for row in A[r1:r2]] # First 2 rows, last 2 cols [row[-2:] for row in A[:2]] # [[4,5],[9,10]]
Step slicing & reversal
# Every other row A[::2] # rows 0 and 2 # Every other column [row[::2] for row in A] # cols 0, 2, 4 # Every other row AND column [row[::2] for row in A[::2]] # [[1,3,5],[11,13,15]] # Reverse rows (upside-down) A[::-1] # Reverse columns (mirrored) [row[::-1] for row in A] # Fully reversed (upside-down AND mirrored) [row[::-1] for row in A[::-1]]
Diagonal elements
# Main diagonal [A[i][i] for i in range(min(len(A), len(A[0])))] # [1, 7, 13, 19] # Anti-diagonal n = len(A[0]) [A[i][n - 1 - i] for i in range(len(A))] # [5, 9, 13, 17]
# SINGLE ELEMENT A[i][j] # FULL ROW A[i] # ROW RANGE A[r1:r2] # FULL COLUMN (NumPy: A[:, j]) [row[j] for row in A] # COLUMN RANGE (NumPy: A[:, c1:c2]) [row[c1:c2] for row in A] # SUB-MATRIX (NumPy: A[r1:r2, c1:c2]) [row[c1:c2] for row in A[r1:r2]] # LAST ROW / LAST COLUMN A[-1] [row[-1] for row in A] # EVERY Nth ROW / COLUMN A[::n] [row[::n] for row in A] # REVERSE ROWS / COLUMNS A[::-1] [row[::-1] for row in A] # MAIN DIAGONAL [A[i][i] for i in range(min(len(A), len(A[0])))]
3D Tensor Slicing
# 3D Tensor T: (batches, rows, cols) T = [ [[1,2], [3,4]], [[5,6], [7,8]] ] # First batch — NumPy: T[0] T[0] # [[1, 2], [3, 4]] # First row of every batch — NumPy: T[:, 0, :] [batch[0] for batch in T] # [[1, 2], [5, 6]] # First column of every batch — NumPy: T[:, :, 0] [[row[0] for row in batch] for batch in T] # [[1, 3], [5, 7]]
Zip & Unpacking (Pure Python)
The zip() function is the cornerstone of pure Python matrix operations. It pairs elements from multiple iterables, which is essential for operations like transposing and dot products.
zip() takes two or more iterables and pairs them up element-by-element, producing a lazy iterator of tuples. It stops as soon as the shortest iterable is exhausted. Think of a physical zipper — two rows of teeth coming together one pair at a time.
B: [4, 5, 6]
zip() returns a generator that produces one tuple per iteration, on demand. No intermediate list is built. Zipping two 10,000-row matrices costs zero extra memory until you iterate.
a = [1, 2, 3] b = [4, 5, 6] for x, y in zip(a, b): print(x, y) # 1 4 | 2 5 | 3 6 # wrap in list() to materialise pairs = list(zip(a, b)) # [(1,4), (2,5), (3,6)] # Three iterables at once list(zip(a, b, [7,8,9])) # [(1,4,7), (2,5,8), (3,6,9)]
zip(*matrix) is the most important matrix trick. The * unpacks the outer list, passing each row as a separate argument to zip — which then pairs up the first elements of every row, then the second, etc. This is exactly a transpose.
M = [[1,2,3], [4,5,6]] # Without *: zip sees ONE argument (the whole matrix) list(zip(M)) # [([1,2,3],), ([4,5,6],)] — wrong! # With *: zip sees TWO arguments (row0 and row1) list(zip(*M)) # [(1,4), (2,5), (3,6)] — columns! # Equivalent to calling: list(zip([1,2,3], [4,5,6]))
A 2D matrix is a list of lists. zip(*M) treats each row as one iterable, pairing corresponding elements across all rows.
# zip(*M) yields tuples — convert each to list T = [list(col) for col in zip(*M)] # [[1,4,7], [2,5,8], [3,6,9]] # Or with map: T = list(map(list, zip(*M)))
A 3D matrix is a list of 2D matrices (slices). zip() at the outermost level pairs batch slices together.
Pairs 2D matrices from the same batch index. Foundation of batched operations.
for a_slice, b_slice in zip(A3, B3): # a_slice and b_slice are 2D pass
Pairs rows, then pairs elements within rows.
summed = [ [a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(slice_0, slice_1) ]
zip(*tensor) swaps the batch axis with the row axis (equivalent to numpy transpose axes=(1,0,2)).
T = [[[1,2],[3,4]], [[5,6],[7,8]]] # shape (2,2,2) swapped = [list(pair) for pair in zip(*T)] # swapped shape is now (2,2,2) with axes swapped # pair 0: ([1,2], [5,6]) → first row of each batch
Using iter() with zip to group a flat list into chunks of n.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9] chunked = list(zip(*[iter(data)] * 3)) # [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
Comparing consecutive elements (discrete derivatives) by zipping a list with a shifted version of itself.
t_series = [10, 12, 15, 20, 26] diffs = [n - p for p, n in zip(t_series, t_series[1:])] # [2, 3, 5, 6] — represents rate of change
Iterating through features and labels simultaneously during training loops.
for epoch in range(epochs): for x_batch, y_batch in zip(X_train, y_train): prediction = model.forward(x_batch) loss = compute_loss(prediction, y_batch) # backprop...
map can be used instead of list comprehension + zip for element-wise functions.
import operator A, B = [1, 2, 3], [4, 5, 6] # zip add_zip = [a + b for a, b in zip(A, B)] # map add_map = list(map(operator.add, A, B))
Why is lazy evaluation essential? Zipping two massive arrays takes almost zero memory until materialized into a list.
import sys a, b = range(1_000_000), range(1_000_000) z = zip(a, b) print(sys.getsizeof(z)) # 56 bytes! (Stores only iterator state) materialised = list(z) print(sys.getsizeof(materialised)) # 8,000,056 bytes (8MB)
The classic matmul uses a triple nested loop (i, j, k). With zip, you transpose B so its columns become rows, then dot-product each row of A with each (former) column of B — eliminating the explicit k loop entirely.
Columns of B can't be directly iterated in Python — there's no B[:, j] on a list-of-lists. zip(*B) transposes B, turning each column into an iterable row. Now zip(row_A, col_B) pairs elements for the dot product.
C[i][j] = Σ A[i][k] · B[k][j]
= sum(a*b for a,b in zip(row_i, col_j))
# ── zip-based matmul (no explicit k loop) ───────────────── def matmul_zip(A, B): assert len(A[0]) == len(B), "cols_A must equal rows_B" B_T = list(zip(*B)) # transpose B: columns → rows return [ [sum(a * b for a, b in zip(row, col)) # dot product for col in B_T] # each column of B for row in A # each row of A ] A = [[1,2],[3,4]] B = [[5,6],[7,8]] matmul_zip(A, B) # [[19, 22], [43, 50]] # ── Step trace for C[0][0]: ─────────────────────────────── # B_T = list(zip(*B)) → [(5,7), (6,8)] # row = A[0] = [1,2], col = B_T[0] = (5,7) # zip(row, col) → (1,5) (2,7) # products → 5 14 # sum → 19 ← C[0][0] ✓ # ── Non-square: (2×3) @ (3×2) → (2×2) ──────────────────── A = [[1,2,3],[4,5,6]] B = [[7,8],[9,10],[11,12]] matmul_zip(A, B) # [[58, 64], [139, 154]] # ── Batched 3D matmul using zip to pair slices ──────────── def batched_matmul(A3, B3): return [matmul_zip(a, b) for a, b in zip(A3, B3)]
A zip object can only be iterated once. The second pass silently yields nothing.
z = zip([1,2],[3,4]) list(z) # [(1,3),(2,4)] list(z) # [] ← empty! # Fix: materialise first z = list(zip([1,2],[3,4]))
zip(M) and zip(*M) are completely different.
list(zip([[1,2],[3,4]])) # [([1,2],), ([3,4],)] ← WRONG list(zip(*[[1,2],[3,4]])) # [(1,3),(2,4)] ← correct
Mismatched lengths drop elements with no warning.
list(zip([1,2,3],[10,20])) # [(1,10),(2,20)] ← 3 dropped! # Fix: Python 3.10+ zip([1,2,3],[10,20], strict=True) # ValueError raised
zip(*B) yields tuples. Tuples are immutable — assignment crashes.
B_T = list(zip(*[[1,2],[3,4]])) B_T[0][0] = 99 # TypeError! # Fix: convert to lists B_T = [list(c) for c in zip(*B)]
zip produces new output. The original is unchanged.
M = [[1,2],[3,4]] M = zip(*M) # M is now a zip object! # Fix: keep original, assign to new var M_T = [list(r) for r in zip(*M)]
zip(*[]) is safe but returns nothing.
def safe_transpose(M): if not M or not M[0]: return [] return [list(c) for c in zip(*M)]
zip(*zipped) reverses a previous zip. Same transpose trick — applied twice returns the original.
zipped = [(1,4),(2,5),(3,6)] a, b = zip(*zipped) # a = (1,2,3), b = (4,5,6) # Matrix round-trip: M = [[1,2],[3,4]] T = list(zip(*M)) # transpose O = list(zip(*T)) # back to original
Pads shorter iterables with a fill value instead of dropping elements. Essential for ragged data.
from itertools import zip_longest list(zip_longest( [1,2,3], [10,20], fillvalue=0)) # [(1,10),(2,20),(3,0)]
Raises ValueError on length mismatch. Use for shape validation.
def safe_add(A, B): return [ [a+b for a,b in zip(ra, rb, strict=True)] for ra,rb in zip(A, B, strict=True) ]
When you need the index and parallel values simultaneously.
A = [[1,2],[3,4]] B = [[5,6],[7,8]] for i, (ra, rb) in enumerate(zip(A,B)): print(f"row {i}: {ra} + {rb}") # row 0: [1,2] + [5,6] # row 1: [3,4] + [7,8]
zip(a,b) = pair elements |
zip(*M) = transpose |
sum(a*b for a,b in zip(r,c)) = dot product |
zip(*zipped) = unzip |
zip_longest = no truncation |
strict=True = enforce equal length
Deep Dive & Advanced
Norms (L1, L2, Frobenius)
A norm is a function that measures the "size" or "length" of a vector or matrix. In machine learning, norms are heavily used in regularization to penalize large weights and prevent overfitting.
Sum of absolute values. Used in Lasso regression. Drives weights to exactly zero (sparsity).
Square root of sum of squared values. Used in Ridge regression. Spreads weights out evenly.
The L2 norm equivalent for matrices. Square root of sum of all squared elements.
v = np.array([-3, 4]) A = np.array([[1,2],[3,4]]) # Vector Norms np.linalg.norm(v, ord=1) # L1: |-3| + |4| = 7.0 np.linalg.norm(v) # L2 (default): sqrt(9+16) = 5.0 # Matrix Norm np.linalg.norm(A) # Frobenius (default for 2D) # sqrt(1+4+9+16) = 5.477
import math v = [-3, 4] A = [[1,2],[3,4]] def l1_norm(v): return sum(abs(x) for x in v) def l2_norm(v): return math.sqrt(sum(x**2 for x in v)) def frobenius_norm(A): return math.sqrt(sum(x**2 for row in A for x in row))
Eigenvalues & Eigenvectors
When a matrix $A$ acts on a vector $v$, it generally rotates and stretches it. An eigenvector is a special vector that only gets stretched, not rotated, by $A$. The amount it stretches is the eigenvalue ($\lambda$).
A = np.array([[4, 1], [2, 3]]) # eigenvalues and eigenvectors eigenvalues, eigenvectors = np.linalg.eig(A) print(eigenvalues) # [5. 2.] print(eigenvectors) # [[ 0.707 -0.447] # [ 0.707 0.894]] # Note: eigenvectors are COLUMNS in the result.
1. Dimensionality Reduction (PCA)
The eigenvectors of a covariance matrix tell us the directions of maximum variance in our data. The eigenvalues tell us how much variance is in those directions.
2. Graph AI & PageRank
Google's original PageRank algorithm essentially computes the dominant eigenvector of the web's hyperlink matrix.
np.linalg.eig returns eigenvectors as columns of a matrix, so the first eigenvector is eigenvectors[:, 0], not eigenvectors[0].SVD & PCA
Singular Value Decomposition (SVD) generalizes eigendecomposition to any matrix (not just square ones). It factors a matrix $A$ into three matrices: $U \Sigma V^T$.
Orthonormal basis for the output space. Describes row relationships.
Diagonal matrix. Values tell us the importance of each vector, sorted descending.
Orthonormal basis for the input space. Describes column relationships.
A = np.array([[1, 2], [3, 4], [5, 6]]) U, S, VT = np.linalg.svd(A, full_matrices=False) # U shape: (3, 2) # S shape: (2,) -> 1D array of singular values # VT shape: (2, 2)
Principal Component Analysis (PCA) is an algorithm that uses SVD under the hood.
# 1. Center the data X_centered = X - np.mean(X, axis=0) # 2. Compute SVD U, S, VT = np.linalg.svd(X_centered, full_matrices=False) # 3. Project onto top k components # VT holds the principal components in its rows components = VT[:k] X_reduced = np.dot(X_centered, components.T)
Advanced Operations
Determinants
The determinant is a scalar value that encodes key properties of a square matrix. It tells you whether a matrix is invertible (det ≠ 0), and geometrically it measures the signed volume scaling factor of the linear transformation.
For larger matrices, we use cofactor expansion (Laplace expansion) — recursively breaking an n×n determinant into smaller (n-1)×(n-1) determinants.
A = np.array([[1, 2], [3, 4]]) np.linalg.det(A) # -2.0 (1*4 - 2*3) B = np.array([[6,1,1], [4,-2,5], [2,8,7]]) np.linalg.det(B) # -306.0
def det(A): n = len(A) if n == 1: return A[0][0] if n == 2: return A[0][0]*A[1][1] - A[0][1]*A[1][0] total = 0 for j in range(n): # Minor: delete row 0, col j minor = [[A[i][k] for k in range(n) if k != j] for i in range(1, n)] sign = (-1) ** j total += sign * A[0][j] * det(minor) return total
Minors & Cofactors
The minor Mij is the determinant of the sub-matrix formed by deleting row i and column j. The cofactor Cij = (-1)i+j × Mij adds a checkerboard sign pattern.
A = np.array([[1,2,3], [0,4,5], [1,0,6]]) def minor_matrix(A): n = A.shape[0] M = np.zeros((n, n)) for i in range(n): for j in range(n): sub = np.delete(np.delete(A, i, 0), j, 1) M[i,j] = np.linalg.det(sub) return M def cofactor_matrix(A): M = minor_matrix(A) n = A.shape[0] signs = np.array([[(-1)**(i+j) for j in range(n)] for i in range(n)]) return signs * M
def get_minor(A, i, j): """Sub-matrix with row i, col j removed""" return [[A[r][c] for c in range(len(A)) if c != j] for r in range(len(A)) if r != i] def minor_matrix(A): n = len(A) return [[det(get_minor(A, i, j)) for j in range(n)] for i in range(n)] def cofactor_matrix(A): n = len(A) M = minor_matrix(A) return [[(-1)**(i+j) * M[i][j] for j in range(n)] for i in range(n)]
Adjugate & Inverse
The adjugate (or classical adjoint) is the transpose of the cofactor matrix. The inverse A-1 satisfies A · A-1 = I (identity). It only exists when det(A) ≠ 0.
A = np.array([[1, 2], [3, 4]]) A_inv = np.linalg.inv(A) print(A_inv) # [[-2. , 1. ], # [ 1.5, -0.5]] # Verify: A @ A_inv = Identity np.allclose(A @ A_inv, np.eye(2)) # True # Pseudo-inverse (works for non-square) np.linalg.pinv(A)
def adjugate(A): C = cofactor_matrix(A) return transpose(C) def inverse(A): d = det(A) if d == 0: raise ValueError("Singular matrix") adj = adjugate(A) n = len(A) return [[adj[i][j] / d for j in range(n)] for i in range(n)] # Quick 2x2 shortcut def inv_2x2(A): a,b = A[0]; c,d = A[1] D = a*d - b*c return [[d/D, -b/D], [-c/D, a/D]]
np.linalg.solve or np.linalg.lstsq in practice.Matrix Power
Raising a square matrix to an integer power means multiplying it by itself k times: Ak = A × A × … × A. Negative powers use the inverse: A-k = (A-1)k.
A = np.array([[2, 1], [0, 3]]) np.linalg.matrix_power(A, 3) # [[8, 13], # [0, 27]] # A^0 = Identity np.linalg.matrix_power(A, 0) # [[1,0],[0,1]] # Negative power = inverse raised np.linalg.matrix_power(A, -1) # same as inv(A)
def mat_power(A, k): n = len(A) # Identity matrix result = [[1 if i==j else 0 for j in range(n)] for i in range(n)] if k < 0: A = inverse(A) k = -k for _ in range(k): result = mat_mul(result, A) return result mat_power([[2,1],[0,3]], 3) # [[8, 13], [0, 27]]
QR & Cholesky Decompositions
QR decomposition factors A = QR where Q is orthogonal (QTQ = I) and R is upper-triangular. Cholesky decomposition factors a symmetric positive-definite matrix as A = LLT where L is lower-triangular.
A = Q × R
Used in: least-squares, eigenvalue algorithms, solving Ax=b numerically stably
A = L × LT
Used in: sampling from multivariate Gaussians, solving positive-definite systems (~2× faster than LU)
A = np.array([[1,1],[1,0],[0,1]]) Q, R = np.linalg.qr(A) print(Q.shape) # (3, 2) print(R.shape) # (2, 2) # Verify: Q @ R ≈ A np.allclose(Q @ R, A) # True # Q columns are orthonormal np.allclose(Q.T @ Q, np.eye(2)) # True
# Must be symmetric positive-definite A = np.array([[4,2],[2,3]]) L = np.linalg.cholesky(A) print(L) # [[2. , 0. ], # [1. , 1.41]] # Verify: L @ L.T ≈ A np.allclose(L @ L.T, A) # True
Gram-Schmidt Orthogonalization
The Gram-Schmidt process takes a set of linearly independent vectors and produces an orthonormal set (mutually perpendicular unit vectors). It is the algorithm behind QR decomposition.
def gram_schmidt_np(V): """V: columns are input vectors""" n = V.shape[1] U = np.zeros_like(V, dtype=float) for i in range(n): u = V[:, i].astype(float) for j in range(i): u -= np.dot(U[:,j], V[:,i]) * U[:,j] U[:, i] = u / np.linalg.norm(u) return U V = np.array([[1,1],[1,0]]) Q = gram_schmidt_np(V) # Q columns are orthonormal
def proj(u, v): """Project v onto u""" dot_uv = sum(a*b for a,b in zip(u,v)) dot_uu = sum(a*a for a in u) return [dot_uv/dot_uu * a for a in u] def gram_schmidt(vectors): ortho = [] for v in vectors: u = v[:] for o in ortho: p = proj(o, v) u = [u[i]-p[i] for i in range(len(u))] norm = l2_norm(u) ortho.append([x/norm for x in u]) return ortho
Solving Linear Systems
A linear system Ax = b asks: given matrix A and vector b, find vector x. This is the mathematical core of regression, curve fitting, and engineering.
# Exact solution (square A) A = np.array([[3,1],[1,2]]) b = np.array([9, 8]) x = np.linalg.solve(A, b) print(x) # [2. 3.] # Least-squares (overdetermined) A = np.array([[1,1],[2,1],[3,1]]) b = np.array([1, 2, 2]) x, res, rank, sv = np.linalg.lstsq(A, b, rcond=None) print(x) # [0.5, 0.333...]
def solve_2x2(A, b): """Cramer's rule for 2x2""" D = det(A) Dx = [[b[0], A[0][1]], [b[1], A[1][1]]] Dy = [[A[0][0], b[0]], [A[1][0], b[1]]] return [det(Dx)/D, det(Dy)/D] def solve_inverse(A, b): """General: x = A_inv @ b""" A_inv = inverse(A) n = len(b) return [sum(A_inv[i][j]*b[j] for j in range(n)) for i in range(n)] solve_2x2([[3,1],[1,2]], [9,8]) # [2.0, 3.0]
np.linalg.solve uses LU decomposition internally and is both faster and more numerically stable than computing inv(A) @ b. Always prefer solve over inv.ML Essentials
Trace
The trace of a square matrix is the sum of its diagonal elements. It appears constantly in matrix calculus — the Frobenius inner product is tr(ATB), and many loss functions are expressed as traces.
A = np.array([[1,2],[3,4]]) np.trace(A) # 5 (1 + 4) # Frobenius inner product np.trace(A.T @ A) # same as np.sum(A**2)
def trace(A): return sum(A[i][i] for i in range(len(A))) trace([[1,2],[3,4]]) # 5
Rank & Linear Independence
The rank of a matrix is the number of linearly independent rows (or columns). Vectors are linearly independent if none can be written as a combination of the others. Rank tells you the “true dimensionality” of your data.
rank(A) = min(m,n). All rows/cols independent. System has a unique solution.
rank(A) < min(m,n). Redundant features. Infinite or no solutions.
A = np.array([[1,2,3], [4,5,6], [7,8,9]]) np.linalg.matrix_rank(A) # 2 (row 3 = 2*row2 - row1) B = np.eye(3) np.linalg.matrix_rank(B) # 3 (full rank)
If your feature matrix X is rank-deficient, XTX is singular and the normal equation has no unique solution. This is why regularization adds λI — it guarantees full rank.
Positive Definite Matrices
A symmetric matrix A is positive definite (PD) if xTAx > 0 for every non-zero vector x. Equivalently: all eigenvalues are positive. PD matrices guarantee a unique minimum in optimization.
All eigenvalues > 0. Invertible. Cholesky works.
All eigenvalues ≥ 0. Covariance matrices are always PSD.
Mixed signs. Saddle point — no guaranteed minimum.
A = np.array([[2, 1], [1, 3]]) eigenvalues = np.linalg.eigvalsh(A) # [1.38, 3.62] — both > 0 → PD # Quick test: try Cholesky — if it works, it's PD np.linalg.cholesky(A) # succeeds → PD ✓
Covariance Matrix
The covariance matrix Σ captures how features vary together. Element Σij = Cov(xi, xj). The diagonal holds variances; off-diagonal holds covariances. It is always symmetric positive semi-definite.
# X: (samples, features) X = np.array([[1,2],[3,4],[5,6],[7,8]]) cov = np.cov(X, rowvar=False) print(cov) # [[6.667, 6.667], # [6.667, 6.667]] # PCA starts here: eigenvectors of cov vals, vecs = np.linalg.eigh(cov)
def mean(col): return sum(col) / len(col) def cov_matrix(X): # X: list of rows n = len(X); d = len(X[0]) means = [mean([X[i][j] for i in range(n)]) for j in range(d)] C = [[0]*d for _ in range(d)] for a in range(d): for b in range(d): C[a][b] = sum( (X[i][a]-means[a])*(X[i][b]-means[b]) for i in range(n)) / (n-1) return C
Special Matrices
Certain matrix types appear so often in ML that recognizing them instantly saves time and prevents bugs.
AI = IA = A. The “1” of matrices. np.eye(n)
Non-zero only on diagonal. np.diag(v). Scales each axis independently.
A = AT. Covariance, kernel, and Hessian matrices are symmetric.
QTQ = I. Preserves lengths/angles. Used in weight init for RNNs. np.linalg.qr
Mostly zeros. NLP bag-of-words, graph adjacency. scipy.sparse saves memory.
# NumPy special matrix constructors np.eye(3) # 3×3 identity np.diag([1,2,3]) # 3×3 diagonal np.zeros((3,4)) # 3×4 zeros np.ones((2,2)) # 2×2 ones # Sparse (scipy) from scipy import sparse S = sparse.csr_matrix(np.eye(1000)) # stores only 1000 values, not 1M
Projection Matrices
A projection matrix P maps a vector onto a subspace. It satisfies P² = P (applying it twice changes nothing). The least-squares projection is P = X(XTX)-1XT, which projects b onto the column space of X.
# Project b onto the column space of A A = np.array([[1,0],[0,1],[0,0]]) P = A @ np.linalg.inv(A.T @ A) @ A.T b = np.array([1,2,3]) proj_b = P @ b # [1, 2, 0] # Verify P² = P np.allclose(P @ P, P) # True
Linear regression is a projection: ŷ = Xb projects the target vector onto the column space of the feature matrix. Attention in Transformers computes Q(KTK)-1KTV — a weighted projection.
Matrix Calculus (Jacobians & Hessians)
Matrix calculus extends derivatives to vectors and matrices. The Jacobian is the matrix of all first-order partial derivatives. The Hessian is the matrix of second-order derivatives — it tells you about curvature.
Jij = ∂fi/∂xj. Maps how each output changes w.r.t. each input. Shape: (outputs × inputs).
Hij = ∂²f/∂xi∂xj. Always symmetric. PD → convex → guaranteed minimum.
# Key gradient rules (used in backprop) # d/dx (Ax) = A # d/dx (x^T A x) = (A + A^T) x (= 2Ax if A symmetric) # d/dA tr(AB) = B^T # d/dx ||Ax-b||² = 2 A^T(Ax - b) ← linear regression gradient! # PyTorch auto-computes Jacobians: # loss.backward() builds the chain of Jacobians (backpropagation)
Einstein Summation (einsum)
np.einsum expresses any tensor operation in one line using Einstein notation. Repeated indices are summed over. It's the lingua franca of PyTorch/TensorFlow for complex tensor ops.
# Dot product np.einsum('i,i->', a, b) # same as a @ b # Matrix multiply np.einsum('ik,kj->ij', A, B) # same as A @ B # Batch matrix multiply np.einsum('bik,bkj->bij', T1, T2) # per-batch matmul # Trace np.einsum('ii->', A) # same as np.trace(A) # Transpose np.einsum('ij->ji', A) # same as A.T # Outer product np.einsum('i,j->ij', a, b) # same as np.outer(a,b) # Attention: Q K^T (batched) np.einsum('bhid,bhjd->bhij', Q, K) # multi-head attention scores
-> are kept; letters that appear only on the left are summed over. This single rule covers dot products, matmuls, traces, and more.Null Space & Column Space
The column space of A is the set of all possible outputs Ax — it's the “reach” of A. The null space is the set of all x where Ax = 0 — inputs that A annihilates. Together: dim(col space) + dim(null space) = n (number of columns).
A = np.array([[1,2,3],[4,5,6],[7,8,9]]) # Column space basis (via SVD) U, S, VT = np.linalg.svd(A) rank = np.sum(S > 1e-10) col_space = U[:, :rank] # shape (3, 2) # Null space from scipy.linalg import null_space ns = null_space(A) # shape (3, 1)
Column space = what your model can predict. If b isn't in the column space, you can only get the closest approximation (least squares).
Null space = parameter directions that don't affect output. A non-trivial null space means infinite solutions — your model is under-constrained.
Condition Number
The condition number κ(A) = σmax/σmin measures how sensitive a matrix computation is to numerical errors. A high condition number means tiny input changes cause huge output changes — the matrix is ill-conditioned.
# Well-conditioned A = np.array([[1,0],[0,1]]) np.linalg.cond(A) # 1.0 (perfect) # Ill-conditioned B = np.array([[1, 1], [1, 1.0001]]) np.linalg.cond(B) # ~40000 (dangerous!) # Rule of thumb: # κ ~ 1 → stable # κ ~ 10^k → lose ~k digits of precision # κ = ∞ → singular
NaN gradients? Check if your weight matrices are ill-conditioned. Features with vastly different scales create ill-conditioned XTX.
Fix: Feature scaling (standardization), regularization (λI raises the smallest singular values), or using np.linalg.lstsq instead of inv.
np.linalg.cond(X.T @ X) is large, do not compute inv(X.T @ X). Use np.linalg.lstsq(X, y) or add regularization instead.
Comments
Loading comments...