Holberton · DLH · Machine Learning

Linear Algebra for Machine Learning

Linear Algebra Feature

Welcome to the ultimate interactive reference guide for Linear Algebra in Machine Learning. Dive into vectors, matrices, dimensionality reduction, and pure Python implementations. This guide bridges the gap between complex mathematical theory and practical, runnable code.

Part I

Foundations

01

What is a Vector?

A vector is an ordered list of numbers. Think of it as a point in space — or an arrow with both magnitude and direction. In machine learning it holds features: one number per feature.

Row vector (1 × 3)
[
149
]
Column vector (3 × 1)
[
1
4
9
]
▶ With NumPy
import numpy as np

v = np.array([1, 4, 9])  # 1-D array
print(v.shape)            # (3,)
print(v.ndim)             # 1

# Column vector (3×1)
col = np.array([[1],[4],[9]])
print(col.shape)          # (3, 1)
▶ Pure Python (no NumPy)
# Row vector — a plain list
v = [1, 4, 9]
len(v)          # 3  (length = shape)

# Column vector — list of 1-element lists
col = [[1], [4], [9]]
len(col)        # 3 rows
len(col[0])    # 1 col

# Access elements
v[0]           # 1
col[1][0]      # 4
In ML: a single training example is often a vector — e.g. 784 pixel values for an MNIST digit.
02

What is a Matrix?

A matrix is a 2-D grid of numbers arranged in rows and columns. A dataset with m samples and n features is naturally an m × n matrix.

Matrix A — shape (3 × 4)
[
1234
5678
9101112
]
▶ With NumPy
A = np.array([[1,2,3,4],
              [5,6,7,8],
              [9,10,11,12]])
print(A.shape)  # (3, 4)
print(A.ndim)   # 2

# Access element [row][col]
A[1, 2]         # 7
▶ Pure Python (no NumPy)
A = [[1,2,3,4],
     [5,6,7,8],
     [9,10,11,12]]

# Shape via len()
rows = len(A)       # 3
cols = len(A[0])   # 4

# Access element [row][col]
A[1][2]            # 7
Rows = axis 0. Columns = axis 1. Shape is always (rows, cols) — memorise this order.
03

What is a Transpose?

The transpose of a matrix flips it over its main diagonal — rows become columns and columns become rows. An (m × n) matrix becomes (n × m).

A — (2 × 3)
[
123
456
]
Aᵀ — (3 × 2)
[
14
25
36
]
▶ With NumPy
A = np.array([[1,2,3],[4,5,6]])

# Method 1 — attribute
AT = A.T

# Method 2 — function
AT = np.transpose(A)

print(AT.shape)  # (3, 2)
▶ Pure Python (no NumPy)
A = [[1,2,3],[4,5,6]]

# Method 1 — explicit loops
def transpose(A):
    rows, cols = len(A), len(A[0])
    T = []
    for j in range(cols):
        row = []
        for i in range(rows):
            row.append(A[i][j])
        T.append(row)
    return T

# Method 2 — list comprehension
def transpose(A):
    return [[A[i][j] for i in range(len(A))]
            for j in range(len(A[0]))]

# Method 3 — zip trick (Pythonic)
def transpose(A):
    return [list(row) for row in zip(*A)]

transpose(A)  # [[1,4],[2,5],[3,6]]
Transpose is critical for aligning matrix dimensions before multiplication — a very common operation in neural networks. zip(*A) unpacks all rows as arguments to zip, which groups by column position — elegant and Pythonic.

Transposing 3D Tensors

In 3D (e.g., shape (batch, rows, cols)), you rarely reverse all dimensions. Instead, you typically transpose the rows and columns of each matrix in the batch independently, leaving the batch dimension intact.

▶ With NumPy
T = np.random.rand(10, 4, 5)  # 10 batches of 4x5

# Swap axes 1 (rows) and 2 (cols), leave axis 0 (batch)
T_trans = np.transpose(T, axes=(0, 2, 1))
print(T_trans.shape)  # (10, 5, 4)
▶ Pure Python (no NumPy)
# Transpose each matrix in the batch
def transpose_3d(T):
    return [transpose(batch) for batch in T]
    # uses the 2D transpose function defined above
04

Shape of a Matrix & What is an Axis?

The shape is a tuple describing how many elements exist along each axis (dimension). Axis 0 runs down the rows; axis 1 runs across the columns.

axis 0 ↓ (rows)
1234
5678
9101112
col 0 highlighted
axis 1 → (cols)
1234
5678
9101112
row 0 highlighted
▶ With NumPy
A.shape            # (3, 4)
np.sum(A, axis=0)  # sum down rows   → (4,)
np.sum(A, axis=1)  # sum across cols → (3,)
▶ Pure Python (no NumPy)
# Shape
def shape(A):
    rows = len(A)
    cols = len(A[0])
    return (rows, cols)

# Sum along axis 0 (down rows)
def sum_axis0(A):
    cols = len(A[0])
    return [sum(A[i][j] for i in
            range(len(A))) for j in range(cols)]

# Sum along axis 1 (across cols)
def sum_axis1(A):
    return [sum(row) for row in A]

Axes in 3D Tensors

When you move to 3D, a new axis is added at the front. The standard convention in machine learning is (batch, rows, columns):

  • Axis 0 (Batch): Depth or number of samples. Summing along axis 0 collapses all batches into a single matrix.
  • Axis 1 (Rows): Height of each matrix.
  • Axis 2 (Columns): Width of each matrix.
05

Element-wise Operations

Element-wise operations apply a function independently to each matching pair of elements. Both arrays must have the same shape (or be broadcastable).

▶ With NumPy
A = np.array([[1,2],[3,4]])
B = np.array([[2,4],[6,8]])

A + B      # [[3,6],[9,12]]
A - B      # [[-1,-2],[-3,-4]]
A * B      # [[2,8],[18,32]]
A / B      # [[0.5,0.5],[0.5,0.5]]
A ** 2     # [[1,4],[9,16]]
np.sqrt(A) # [[1.,1.41…],[1.73…,2.]]
▶ Pure Python (no NumPy)
A = [[1,2],[3,4]]
B = [[2,4],[6,8]]

# Addition
def add(A, B):
    return [[A[i][j] + B[i][j]
             for j in range(len(A[0]))]
            for i in range(len(A))]

# Element-wise multiply
def mul(A, B):
    return [[A[i][j] * B[i][j]
             for j in range(len(A[0]))]
            for i in range(len(A))]

# Power / sqrt
import math
[[x**2 for x in row] for row in A]
[[math.sqrt(x) for x in row] for row in A]
A * B is element-wise multiplication. A @ B is matrix multiplication. They are completely different operations.
06

Concatenation

Joining arrays along an existing axis.

▶ With NumPy
a = np.array([[1,2],[3,4]])   # (2,2)
b = np.array([[5,6]])          # (1,2)

# Vertical stack (axis 0)
np.concatenate([a, b], axis=0)   # (3,2)
np.vstack([a, b])

# Horizontal stack (axis 1)
np.concatenate([a, b.T], axis=1) # (2,3)
np.hstack([a, b.T])
▶ Pure Python (no NumPy)
a = [[1,2],[3,4]]
b = [[5,6]]

# Vertical stack (add rows)
def vstack(A, B):
    return A + B       # [[1,2],[3,4],[5,6]]

# Horizontal stack (add cols)
def hstack(A, B):
    return [rowA + rowB
            for rowA, rowB in zip(A, B)]

# transpose b first, then hstack
bT = [list(row) for row in zip(*b)]
hstack(a, bT)  # [[1,2,5],[3,4,6]]

Concatenating 3D Tensors

To join 3D tensors, the dimensions along all non-concatenated axes must perfectly match. For example, adding more samples to a batch means joining along axis=0.

▶ With NumPy
T1 = np.ones((2, 3, 4))  # 2 batches
T2 = np.ones((5, 3, 4))  # 5 batches

# Join along batch dimension (axis 0)
T_joined = np.concatenate([T1, T2], axis=0)
print(T_joined.shape)  # (7, 3, 4)
▶ Pure Python (no NumPy)
# Since axis 0 is the outermost list, 
# we just add the lists together!
T_joined = T1 + T2
# Length goes from 2 to 7 batches
07

The Dot Product

The dot product of two equal-length vectors multiplies corresponding elements and sums the results into a single scalar — a measure of alignment between two vectors.

a · b = (1×2) + (3×4) + (5×6)
1×2
+
3×4
+
5×6
= 44
▶ With NumPy
a = np.array([1, 3, 5])
b = np.array([2, 4, 6])

np.dot(a, b)  # 44
a @ b         # 44  (@ operator)
▶ Pure Python (no NumPy)
a = [1, 3, 5]
b = [2, 4, 6]

# Method 1 — index loop
def dot(a, b):
    return sum(a[i] * b[i]
               for i in range(len(a)))

# Method 2 — zip (Pythonic)
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

dot(a, b)  # 44
In neural networks, every layer computes: output = weights · inputs + bias. The dot product is the heartbeat of ML.
08

Matrix Multiplication

Matrix multiplication is a series of dot products. It is the core operation of deep learning, driving everything from linear layers to attention mechanisms.

The golden rule
A: m × n · B: n × p = C: m × p

A's columns must equal B's rows — that shared n is the dimension you "dot over". The result inherits A's row count and B's column count.

The formula
C[i][j] = Σk=0n-1 A[i][k] · B[k][j]

Each cell C[i][j] is the dot product of the entire row i of A with the entire column j of B. Multiply element-by-element, then sum.

Concrete trace — computing C[0][0] for a 2×3 · 3×2 example

Here n = 3, so k runs 0 → 1 → 2 (three iterations, three pairs, three products summed):

k = 0
A[0][0] × B[0][0]
1 × 7 = 7
k = 1
A[0][1] × B[1][0]
2 × 9 = 18
k = 2
A[0][2] × B[2][0]
3 × 11 = 33
C[0][0] = 7 + 18 + 33 = 58
▶ With NumPy
A = np.array([[1,2,3],[4,5,6]])    # (2,3)
B = np.array([[7,8],[9,10],[11,12]]) # (3,2)

C = np.matmul(A, B)  # (2,2)
C = A @ B            # identical
▶ Pure Python (no NumPy)
A = [[1,2,3],[4,5,6]]
B = [[7,8],[9,10],[11,12]]

def matmul(A, B):
    m = len(A)
    p = len(B[0])
    n = len(A[0]) # shared dimension
    C = [[0] * p for _ in range(m)]
    
    for i in range(m):         # row in A
        for j in range(p):     # col in B
            for k in range(n): # dot product sum
                C[i][j] += A[i][k] * B[k][j]
    return C
Why exactly three nested loops?
for i in range(m)
Selects a row in A
Runs m times. Determines which row of the result C we're filling.
for j in range(p)
Selects a col in B
Runs p times. Determines which column of C we're computing.
for k in range(n)
The dot product
Runs n times. Multiplies A[i][k] × B[k][j] and accumulates.

Total operations: m × p × n multiplications. For a 2×3 and 3×2 matrix: 2×2×3 = 12 multiplications to produce 4 result cells.

2D vs 3D — what changes and what stays the same
2D Matmul
A: m × n
B: n × p
C: m × p
k runs over n
3D Batched Matmul
A: b × m × n
B: b × n × p
C: b × m × p
k still runs over n — batch b is just a new outer loop.
The four nested loops — batch wraps everything
for b in range(batch):  # NEW OUTER LOOP
    for i in range(m):  # selects row in A[b]
        for j in range(p):  # selects col in B[b]
            for k in range(n):  # dot product
                C[b][i][j] += A[b][i][k] * B[b][k][j]
The inner dimensions must match: (m × n) @ (n × p). If they don't, NumPy raises a ValueError. Always check shapes first.
Part II

NumPy

09

What is NumPy?

NumPy (Numerical Python) is the core library for numerical computing in Python. Its star is the ndarray — an N-dimensional array stored as a contiguous block of typed memory, enabling vectorised operations orders of magnitude faster than Python loops.

Without NumPy (slow)
result = []
for i in range(len(a)):
    result.append(a[i] * b[i])
With NumPy (fast)
result = a * b
# vectorised — runs in C
NumPy is not just convenient — it is parallelised. Operations run on BLAS/LAPACK routines that exploit CPU vector instructions (SIMD), making them 10–100× faster than pure Python.
10

What is Parallelization & Why it Matters

Parallelization means splitting a computation across multiple processing units (CPU cores, GPU threads) so they run simultaneously. Instead of doing 1 million multiplications one after another, you do them all at once.

Sequential
123456
→ → → → → → (one by one)
slow
Parallel (vectorised)
123456
↓ ↓ ↓ ↓ ↓ ↓ (all at once)
fast

In ML, training on millions of examples would take days without parallelisation. NumPy's vectorised API, and frameworks like TensorFlow/PyTorch, express computation as array operations so the hardware (CPU/GPU/TPU) can parallelise automatically.

11

Broadcasting

Broadcasting lets NumPy perform operations on arrays with different shapes by virtually stretching the smaller array to match the larger one — without copying data in memory.

Add a (3,) row vector to a (3×3) matrix
[
123
456
789
]
+
[
102030
]
=
[
112233
142536
172839
]
[10,20,30] was broadcast across all 3 rows automatically
A = np.array([[1,2,3],[4,5,6],[7,8,9]])  # (3,3)
b = np.array([10, 20, 30])                   # (3,)
A + b   # b is broadcast row-wise

# Rule: align shapes from the RIGHT.
# Each dim must be equal OR one of them must be 1.
# (3,3) vs (3,) → (3,3) vs (1,3) → compatible ✓
Always print(A.shape, b.shape) before operating on unfamiliar arrays. Silent shape mismatches produce wrong results, not errors.
12

NumPy Slicing

NumPy slicing uses start:stop:step — stop is always excluded. For 2-D arrays, slice both dimensions at once: A[rows, cols].

v = np.array([10, 20, 30, 40, 50])
v[1:4]        # [20, 30, 40]
v[::2]        # [10, 30, 50]  (every other)
v[-1]         # 50            (last element)

A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12]])
A[1:, 1:3]   # [[6,7],[10,11]]
A[:, 0]     # [1, 5, 9]    (entire first column)
A[0, :]     # [1, 2, 3, 4] (entire first row)
A[::-1]     # reverse rows
A[:, ::-1]  # reverse columns
A[A > 5]    # boolean mask → [6,7,8,9,10,11,12]
NumPy slices are views, not copies. Modifying a slice modifies the original array. Use .copy() to avoid surprises.
13

5 Slicing Exercises (NumPy)

0 / 5 revealed
Level 1Extract a single element & a range
v = np.array([10, 20, 30, 40, 50, 60])

a) Get element at index 3.
b) Get elements from index 1 up to (not including) 4.
c) Get every other element (indices 0, 2, 4).

v[3]        # 40
v[1:4]      # [20, 30, 40]
v[::2]      # [10, 30, 50]
Key rule: start:stop:step — stop is always excluded. Omitting start defaults to 0, omitting stop defaults to the end.
Level 22-D matrix — rows, columns, sub-matrix
A = np.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]])

a) Extract the entire 3rd row (index 2).
b) Extract the entire 2nd column (index 1).
c) Extract the 2×2 sub-matrix from rows 1–2, cols 2–3.

A[2, :]      # [9, 10, 11, 12]
A[:, 1]      # [2, 6, 10, 14]
A[1:3, 2:4]  # [[7,8],[11,12]]
A[row, col] — both can be a single index, a range, or : (all). First dimension is always rows.
Level 3Negative indices & reverse
B = np.array([[1,2,3,4,5],[6,7,8,9,10],[11,12,13,14,15]])

a) Get the last column using a negative index.
b) Get the last 2 rows.
c) Reverse the matrix left-to-right (flip column order).

B[:, -1]      # [5, 10, 15]
B[-2:, :]     # [[6,7,8,9,10],[11,12,13,14,15]]
B[:, ::-1]    # [[5,4,3,2,1],[10,9,…],[15,14,…]]
Negative indices count from the end: -1 = last, -2 = second-to-last. A step of -1 walks backwards.
Level 4Boolean masking & fancy indexing
C = np.array([3, -1, 7, -5, 2, -8, 4])
D = np.array([[10,20,30],[40,50,60],[70,80,90]])

a) From C, extract only positive values.
b) Replace all negative values in C with 0.
c) From D, select rows 0 and 2 only.

C[C > 0]           # [3, 7, 2, 4]
C[C < 0] = 0        # C → [3,0,7,0,2,0,4]
D[[0, 2], :]         # [[10,20,30],[70,80,90]]
Boolean indexing filters by condition — result is always a 1-D copy. Fancy indexing (list of indices) picks arbitrary rows in any order.
Level 53-D array slicing (batch × rows × cols)
# Shape: (2 batches, 3 rows, 4 cols)
E = np.arange(24).reshape(2, 3, 4)

a) Extract the first batch (index 0).
b) Extract the last column of every batch.
c) Extract the 2×2 top-left corner from each batch simultaneously.
d) Get the single element at batch 1, row 2, col 3.

E[0]              # first batch — shape (3, 4)
E[:, :, -1]      # last col of every batch — shape (2, 3)
E[:, :2, :2]    # top-left 2×2 of each batch — shape (2, 2, 2)
E[1, 2, 3]      # 23
N-D slicing: one slice per axis, separated by commas. A range keeps a dimension; a single index removes it.
Part III

Pure Python — No NumPy

Holberton will require you to implement these operations from scratch using nested lists and loops — before you're allowed to touch NumPy. Understand every line here.

14

Representing Vectors & Matrices

A vector is a plain list. A matrix is a list of lists — each inner list is one row.

# Vector
v = [1, 2, 3, 4]

# Matrix (2D) — list of rows
A = [[1, 2, 3],
     [4, 5, 6]]

# Tensor (3D) — list of matrices (batches)
T = [
  [[1, 2], [3, 4]],   # batch 0
  [[5, 6], [7, 8]]    # batch 1
]

A[1][2]      # 6  — row 1, col 2  (NOT A[1,2]!)
T[1][0][1]   # 6  — batch 1, row 0, col 1
Pure Python uses A[row][col] — NOT A[row, col]. That comma syntax is NumPy only and will raise a TypeError on plain lists.
15

Getting the Shape

No .shape attribute on lists. Build it yourself with len().

def shape_2d(matrix):
    return (len(matrix), len(matrix[0]))

def shape_3d(tensor):
    batches = len(tensor)
    rows = len(tensor[0])
    cols = len(tensor[0][0])
    return (batches, rows, cols)

A = [[1,2,3],[4,5,6]]
T = [[[1,2]],[[3,4]],[[5,6]]]
shape_2d(A)    # (2, 3)
shape_3d(T)    # (3, 1, 2)
16

Transpose (Pure Python)

Swap rows and columns: T[j][i] = A[i][j].

# Explicit loops
def transpose(A):
    rows, cols = len(A), len(A[0])
    T = []
    for j in range(cols):
        row = []
        for i in range(rows):
            row.append(A[i][j])
        T.append(row)
    return T

# List comprehension
def transpose(A):
    return [[A[i][j] for i in range(len(A))]
            for j in range(len(A[0]))]

# Elegant zip trick
def transpose(A):
    return [list(row) for row in zip(*A)]
See §23 Zip & Unpacking for a deep dive into how zip(*A) works for transpose, dot products, and matrix multiplication.
17

Element-wise Addition & Subtraction

# Vector addition
def add_vectors(a, b):
    return [a[i] + b[i] for i in range(len(a))]

# Matrix addition (2D)
def add_matrices(A, B):
    return [[A[i][j] + B[i][j]
              for j in range(len(A[0]))]
             for i in range(len(A))]

# Tensor addition (3D) — triple comprehension
def add_3d(T1, T2):
    return [[[T1[b][i][j] + T2[b][i][j]
               for j in range(len(T1[0][0]))]
              for i in range(len(T1[0]))]
             for b in range(len(T1))]
18

Scalar Multiplication

def scalar_multiply_vec(v, s):
    return [x * s for x in v]

def scalar_multiply(A, s):
    return [[elem * s for elem in row]
            for row in A]

scalar_multiply([[1,2],[3,4]], 2)
# [[2, 4], [6, 8]]
19

Dot Product (Pure Python)

def dot_product(a, b):
    return sum(a[i] * b[i] for i in range(len(a)))

# Using zip (Pythonic)
def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

dot_product([1,3,5], [2,4,6])  # 44
20

Matrix Multiplication (Pure Python)

C[i][j] = dot product of row i of A with column j of B. The classic triple nested loop.

def mat_mul(A, B):
    m, k, n = len(A), len(B), len(B[0])
    C = [[0] * n for _ in range(m)]
    for i in range(m):          # each row of A
        for j in range(n):      # each col of B
            for p in range(k):  # shared dimension
                C[i][j] += A[i][p] * B[p][j]
    return C

# Batched Matrix Multiply (3D)
# shape: (batch, m, k) @ (batch, k, n) -> (batch, m, n)
def bmm(T1, T2):
    return [mat_mul(T1[b], T2[b]) for b in range(len(T1))]
This triple loop is O(m·k·n) — exactly why NumPy's C-backed matmul is so much faster. For 3D tensors, we do a "batched matmul" (bmm): multiplying corresponding matrices per batch.
21

Concatenation (Pure Python)

# Vertical — stack rows (like np.vstack)
def vstack(A, B):
    return A + B          # lists concat directly

# Horizontal — join cols row by row (like np.hstack)
def hstack(A, B):
    return [A[i] + B[i] for i in range(len(A))]

A = [[1,2],[3,4]] ; B = [[5,6],[7,8]]
vstack(A, B)  # [[1,2],[3,4],[5,6],[7,8]]
hstack(A, B)  # [[1,2,5,6],[3,4,7,8]]
22

2D & 3D Matrix Slicing — Pure Python

Row slicing works directly on the outer list. Column slicing requires a list comprehension — there is no direct column slice on a plain list.

SetupReference matrix (used below)
A = [
    [ 1,  2,  3,  4,  5],   # row 0
    [ 6,  7,  8,  9, 10],   # row 1
    [11, 12, 13, 14, 15],   # row 2
    [16, 17, 18, 19, 20]    # row 3
]  # shape: 4 rows × 5 cols

Single element

A[1][2]    # 8   — row 1, col 2
A[0][0]    # 1   — top-left corner
A[-1][-1]  # 20  — bottom-right corner
A[1, 2]TypeError on a plain list. Always use A[row][col].

Entire row / range of rows

A[1]         # [6, 7, 8, 9, 10]     — single row
A[1:3]       # rows 1 and 2
A[:2]        # first 2 rows
A[2:]        # last 2 rows
A[::2]       # every other row
A[-1]        # last row
A[-2:]       # last 2 rows

Entire column / range of columns

# Single column — NumPy: A[:, 2]
[row[2] for row in A]        # [3, 8, 13, 18]

# Column range — NumPy: A[:, 1:4]
[row[1:4] for row in A]      # [[2,3,4],[7,8,9],[12,13,14],[17,18,19]]

# Last column — NumPy: A[:, -1]
[row[-1] for row in A]       # [5, 10, 15, 20]

# Last 2 cols — NumPy: A[:, -2:]
[row[-2:] for row in A]      # [[4,5],[9,10],[14,15],[19,20]]
[row[j] for row in A] is the pure-Python equivalent of NumPy's A[:, j]. Memorise this pattern.

Sub-matrix (rows AND columns)

# rows 1–2, cols 2–3 — NumPy: A[1:3, 2:4]
[row[2:4] for row in A[1:3]]   # [[8,9],[13,14]]

# General helper
def submatrix(A, r1, r2, c1, c2):
    return [row[c1:c2] for row in A[r1:r2]]

# First 2 rows, last 2 cols
[row[-2:] for row in A[:2]]    # [[4,5],[9,10]]

Step slicing & reversal

# Every other row
A[::2]                          # rows 0 and 2

# Every other column
[row[::2] for row in A]        # cols 0, 2, 4

# Every other row AND column
[row[::2] for row in A[::2]]  # [[1,3,5],[11,13,15]]

# Reverse rows (upside-down)
A[::-1]

# Reverse columns (mirrored)
[row[::-1] for row in A]

# Fully reversed (upside-down AND mirrored)
[row[::-1] for row in A[::-1]]

Diagonal elements

# Main diagonal
[A[i][i] for i in range(min(len(A), len(A[0])))]
# [1, 7, 13, 19]

# Anti-diagonal
n = len(A[0])
[A[i][n - 1 - i] for i in range(len(A))]
# [5, 9, 13, 17]
Pure Python Slicing — Quick Reference
# SINGLE ELEMENT
A[i][j]

# FULL ROW
A[i]

# ROW RANGE
A[r1:r2]

# FULL COLUMN            (NumPy: A[:, j])
[row[j] for row in A]

# COLUMN RANGE           (NumPy: A[:, c1:c2])
[row[c1:c2] for row in A]

# SUB-MATRIX             (NumPy: A[r1:r2, c1:c2])
[row[c1:c2] for row in A[r1:r2]]

# LAST ROW / LAST COLUMN
A[-1]
[row[-1] for row in A]

# EVERY Nth ROW / COLUMN
A[::n]
[row[::n] for row in A]

# REVERSE ROWS / COLUMNS
A[::-1]
[row[::-1] for row in A]

# MAIN DIAGONAL
[A[i][i] for i in range(min(len(A), len(A[0])))]

3D Tensor Slicing

# 3D Tensor T: (batches, rows, cols)
T = [
  [[1,2], [3,4]], 
  [[5,6], [7,8]]
]

# First batch — NumPy: T[0]
T[0]                 # [[1, 2], [3, 4]]

# First row of every batch — NumPy: T[:, 0, :]
[batch[0] for batch in T]  # [[1, 2], [5, 6]]

# First column of every batch — NumPy: T[:, :, 0]
[[row[0] for row in batch] for batch in T]
# [[1, 3], [5, 7]]
23

Zip & Unpacking (Pure Python)

The zip() function is the cornerstone of pure Python matrix operations. It pairs elements from multiple iterables, which is essential for operations like transposing and dot products.

What zip() is

zip() takes two or more iterables and pairs them up element-by-element, producing a lazy iterator of tuples. It stops as soon as the shortest iterable is exhausted. Think of a physical zipper — two rows of teeth coming together one pair at a time.

Basic pairing — visualised
A: [1, 2, 3]
B: [4, 5, 6]
zip(A, B) → (1,4), (2,5), (3,6)
Lazy evaluation — no copied data

zip() returns a generator that produces one tuple per iteration, on demand. No intermediate list is built. Zipping two 10,000-row matrices costs zero extra memory until you iterate.

a = [1, 2, 3]
b = [4, 5, 6]

for x, y in zip(a, b):
    print(x, y)  # 1 4 | 2 5 | 3 6

# wrap in list() to materialise
pairs = list(zip(a, b))  # [(1,4), (2,5), (3,6)]

# Three iterables at once
list(zip(a, b, [7,8,9]))  # [(1,4,7), (2,5,8), (3,6,9)]
The star (*) unpacking operator — essential for matrices

zip(*matrix) is the most important matrix trick. The * unpacks the outer list, passing each row as a separate argument to zip — which then pairs up the first elements of every row, then the second, etc. This is exactly a transpose.

M = [[1,2,3], [4,5,6]]

# Without *: zip sees ONE argument (the whole matrix)
list(zip(M))   # [([1,2,3],), ([4,5,6],)] — wrong!

# With *: zip sees TWO arguments (row0 and row1)
list(zip(*M))  # [(1,4), (2,5), (3,6)] — columns!

# Equivalent to calling:
list(zip([1,2,3], [4,5,6]))
zip in 2D — transposing rows to columns

A 2D matrix is a list of lists. zip(*M) treats each row as one iterable, pairing corresponding elements across all rows.

Original M (3×3)
123
456
789
Transposed T (3×3)
147
258
369
# zip(*M) yields tuples — convert each to list
T = [list(col) for col in zip(*M)]
# [[1,4,7], [2,5,8], [3,6,9]]

# Or with map:
T = list(map(list, zip(*M)))
zip in 3D — pairing slices and rows

A 3D matrix is a list of 2D matrices (slices). zip() at the outermost level pairs batch slices together.

1. zip over batch dimension

Pairs 2D matrices from the same batch index. Foundation of batched operations.

for a_slice, b_slice in zip(A3, B3):
    # a_slice and b_slice are 2D
    pass
2. zip over rows within a 3D slice

Pairs rows, then pairs elements within rows.

summed = [
  [a + b for a, b in zip(row_a, row_b)]
  for row_a, row_b in zip(slice_0, slice_1)
]
Transposing axes of a 3D matrix

zip(*tensor) swaps the batch axis with the row axis (equivalent to numpy transpose axes=(1,0,2)).

T = [[[1,2],[3,4]], [[5,6],[7,8]]]  # shape (2,2,2)
swapped = [list(pair) for pair in zip(*T)]
# swapped shape is now (2,2,2) with axes swapped
# pair 0: ([1,2], [5,6]) → first row of each batch
Advanced ML Idioms & Gotchas
1. The Chunking Idiom

Using iter() with zip to group a flat list into chunks of n.

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
chunked = list(zip(*[iter(data)] * 3))
# [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
2. Sliding Windows / Derivatives

Comparing consecutive elements (discrete derivatives) by zipping a list with a shifted version of itself.

t_series = [10, 12, 15, 20, 26]
diffs = [n - p for p, n in zip(t_series, t_series[1:])]
# [2, 3, 5, 6]  — represents rate of change
3. ML Application: Data Loaders

Iterating through features and labels simultaneously during training loops.

for epoch in range(epochs):
    for x_batch, y_batch in zip(X_train, y_train):
        prediction = model.forward(x_batch)
        loss = compute_loss(prediction, y_batch)
        # backprop...
4. zip vs map

map can be used instead of list comprehension + zip for element-wise functions.

import operator
A, B = [1, 2, 3], [4, 5, 6]
# zip
add_zip = [a + b for a, b in zip(A, B)]
# map
add_map = list(map(operator.add, A, B))
5. Proving "Lazy Evaluation" with sys.getsizeof

Why is lazy evaluation essential? Zipping two massive arrays takes almost zero memory until materialized into a list.

import sys
a, b = range(1_000_000), range(1_000_000)

z = zip(a, b)
print(sys.getsizeof(z))  # 56 bytes! (Stores only iterator state)

materialised = list(z)
print(sys.getsizeof(materialised))  # 8,000,056 bytes (8MB)
zip in Matrix Multiplication — transpose B, then dot

The classic matmul uses a triple nested loop (i, j, k). With zip, you transpose B so its columns become rows, then dot-product each row of A with each (former) column of B — eliminating the explicit k loop entirely.

Why zip(*B)?

Columns of B can't be directly iterated in Python — there's no B[:, j] on a list-of-lists. zip(*B) transposes B, turning each column into an iterable row. Now zip(row_A, col_B) pairs elements for the dot product.

The formula
C[i][j] = Σ A[i][k] · B[k][j]
        = sum(a*b for a,b in zip(row_i, col_j))
# ── zip-based matmul (no explicit k loop) ─────────────────
def matmul_zip(A, B):
    assert len(A[0]) == len(B), "cols_A must equal rows_B"
    B_T = list(zip(*B))           # transpose B: columns → rows
    return [
        [sum(a * b for a, b in zip(row, col))  # dot product
         for col in B_T]                        # each column of B
        for row in A                             # each row of A
    ]

A = [[1,2],[3,4]]
B = [[5,6],[7,8]]
matmul_zip(A, B)  # [[19, 22], [43, 50]]

# ── Step trace for C[0][0]: ───────────────────────────────
# B_T = list(zip(*B)) → [(5,7), (6,8)]
# row = A[0] = [1,2],  col = B_T[0] = (5,7)
# zip(row, col) → (1,5) (2,7)
# products     →  5     14
# sum          →  19  ← C[0][0] ✓

# ── Non-square: (2×3) @ (3×2) → (2×2) ────────────────────
A = [[1,2,3],[4,5,6]]
B = [[7,8],[9,10],[11,12]]
matmul_zip(A, B)  # [[58, 64], [139, 154]]

# ── Batched 3D matmul using zip to pair slices ────────────
def batched_matmul(A3, B3):
    return [matmul_zip(a, b) for a, b in zip(A3, B3)]
Gotchas — 6 traps every beginner hits
⚠ 1. Iterator exhaustion

A zip object can only be iterated once. The second pass silently yields nothing.

z = zip([1,2],[3,4])
list(z)  # [(1,3),(2,4)]
list(z)  # []  ← empty!
# Fix: materialise first
z = list(zip([1,2],[3,4]))
⚠ 2. Forgetting * on a matrix

zip(M) and zip(*M) are completely different.

list(zip([[1,2],[3,4]]))
# [([1,2],), ([3,4],)] ← WRONG
list(zip(*[[1,2],[3,4]]))
# [(1,3),(2,4)] ← correct
⚠ 3. Silent truncation

Mismatched lengths drop elements with no warning.

list(zip([1,2,3],[10,20]))
# [(1,10),(2,20)] ← 3 dropped!
# Fix: Python 3.10+
zip([1,2,3],[10,20], strict=True)
# ValueError raised
⚠ 4. Tuples, not lists

zip(*B) yields tuples. Tuples are immutable — assignment crashes.

B_T = list(zip(*[[1,2],[3,4]]))
B_T[0][0] = 99  # TypeError!
# Fix: convert to lists
B_T = [list(c) for c in zip(*B)]
⚠ 5. zip never mutates

zip produces new output. The original is unchanged.

M = [[1,2],[3,4]]
M = zip(*M)  # M is now a zip object!
# Fix: keep original, assign to new var
M_T = [list(r) for r in zip(*M)]
⚠ 6. Empty matrix edge case

zip(*[]) is safe but returns nothing.

def safe_transpose(M):
    if not M or not M[0]:
        return []
    return [list(c) for c in zip(*M)]
Unzipping, zip_longest, strict mode & enumerate
Unzipping — zip is its own inverse

zip(*zipped) reverses a previous zip. Same transpose trick — applied twice returns the original.

zipped = [(1,4),(2,5),(3,6)]
a, b = zip(*zipped)
# a = (1,2,3), b = (4,5,6)

# Matrix round-trip:
M = [[1,2],[3,4]]
T = list(zip(*M))     # transpose
O = list(zip(*T))     # back to original
zip_longest — no truncation

Pads shorter iterables with a fill value instead of dropping elements. Essential for ragged data.

from itertools import zip_longest

list(zip_longest(
    [1,2,3], [10,20],
    fillvalue=0))
# [(1,10),(2,20),(3,0)]
strict=True (Python 3.10+)

Raises ValueError on length mismatch. Use for shape validation.

def safe_add(A, B):
    return [
      [a+b for a,b in
       zip(ra, rb, strict=True)]
      for ra,rb in
       zip(A, B, strict=True)
    ]
enumerate + zip

When you need the index and parallel values simultaneously.

A = [[1,2],[3,4]]
B = [[5,6],[7,8]]
for i, (ra, rb) in enumerate(zip(A,B)):
    print(f"row {i}: {ra} + {rb}")
# row 0: [1,2] + [5,6]
# row 1: [3,4] + [7,8]
Quick Reference: zip(a,b) = pair elements  |  zip(*M) = transpose  |  sum(a*b for a,b in zip(r,c)) = dot product  |  zip(*zipped) = unzip  |  zip_longest = no truncation  |  strict=True = enforce equal length
Part IV

Deep Dive & Advanced

24

Norms (L1, L2, Frobenius)

A norm is a function that measures the "size" or "length" of a vector or matrix. In machine learning, norms are heavily used in regularization to penalize large weights and prevent overfitting.

L1 Norm (Manhattan)

Sum of absolute values. Used in Lasso regression. Drives weights to exactly zero (sparsity).

||v||₁ = Σ |v_i|
L2 Norm (Euclidean)

Square root of sum of squared values. Used in Ridge regression. Spreads weights out evenly.

||v||₂ = √(Σ v_i²)
Frobenius Norm

The L2 norm equivalent for matrices. Square root of sum of all squared elements.

||A||_F = √(ΣΣ A_ij²)
▶ With NumPy
v = np.array([-3, 4])
A = np.array([[1,2],[3,4]])

# Vector Norms
np.linalg.norm(v, ord=1)  # L1: |-3| + |4| = 7.0
np.linalg.norm(v)         # L2 (default): sqrt(9+16) = 5.0

# Matrix Norm
np.linalg.norm(A)         # Frobenius (default for 2D)
                          # sqrt(1+4+9+16) = 5.477
▶ Pure Python (no NumPy)
import math
v = [-3, 4]
A = [[1,2],[3,4]]

def l1_norm(v):
    return sum(abs(x) for x in v)

def l2_norm(v):
    return math.sqrt(sum(x**2 for x in v))

def frobenius_norm(A):
    return math.sqrt(sum(x**2 
                         for row in A 
                         for x in row))
25

Eigenvalues & Eigenvectors

When a matrix $A$ acts on a vector $v$, it generally rotates and stretches it. An eigenvector is a special vector that only gets stretched, not rotated, by $A$. The amount it stretches is the eigenvalue ($\lambda$).

A · v = λ · v
Matrix transformation = Scalar scaling
▶ With NumPy
A = np.array([[4, 1],
              [2, 3]])

# eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)

print(eigenvalues) 
# [5. 2.]

print(eigenvectors) 
# [[ 0.707 -0.447]
#  [ 0.707  0.894]]
# Note: eigenvectors are COLUMNS in the result.
▶ Why it matters in ML

1. Dimensionality Reduction (PCA)
The eigenvectors of a covariance matrix tell us the directions of maximum variance in our data. The eigenvalues tell us how much variance is in those directions.

2. Graph AI & PageRank
Google's original PageRank algorithm essentially computes the dominant eigenvector of the web's hyperlink matrix.

np.linalg.eig returns eigenvectors as columns of a matrix, so the first eigenvector is eigenvectors[:, 0], not eigenvectors[0].
26

SVD & PCA

Singular Value Decomposition (SVD) generalizes eigendecomposition to any matrix (not just square ones). It factors a matrix $A$ into three matrices: $U \Sigma V^T$.

U (Left Singular Vectors)

Orthonormal basis for the output space. Describes row relationships.

Σ (Singular Values)

Diagonal matrix. Values tell us the importance of each vector, sorted descending.

V^T (Right Singular Vectors)

Orthonormal basis for the input space. Describes column relationships.

▶ With NumPy (SVD)
A = np.array([[1, 2],
              [3, 4],
              [5, 6]])

U, S, VT = np.linalg.svd(A, full_matrices=False)

# U shape: (3, 2)
# S shape: (2,)  -> 1D array of singular values
# VT shape: (2, 2)
▶ Connection to PCA

Principal Component Analysis (PCA) is an algorithm that uses SVD under the hood.

# 1. Center the data
X_centered = X - np.mean(X, axis=0)

# 2. Compute SVD
U, S, VT = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project onto top k components
# VT holds the principal components in its rows
components = VT[:k] 
X_reduced = np.dot(X_centered, components.T)
Part V

Advanced Operations

27

Determinants

The determinant is a scalar value that encodes key properties of a square matrix. It tells you whether a matrix is invertible (det ≠ 0), and geometrically it measures the signed volume scaling factor of the linear transformation.

2×2: det(A) = ad − bc
For A = [[a,b],[c,d]]

For larger matrices, we use cofactor expansion (Laplace expansion) — recursively breaking an n×n determinant into smaller (n-1)×(n-1) determinants.

▶ With NumPy
A = np.array([[1, 2],
              [3, 4]])
np.linalg.det(A)  # -2.0  (1*4 - 2*3)

B = np.array([[6,1,1],
              [4,-2,5],
              [2,8,7]])
np.linalg.det(B)  # -306.0
▶ Pure Python (no NumPy)
def det(A):
    n = len(A)
    if n == 1:
        return A[0][0]
    if n == 2:
        return A[0][0]*A[1][1] - A[0][1]*A[1][0]
    total = 0
    for j in range(n):
        # Minor: delete row 0, col j
        minor = [[A[i][k] for k in range(n)
                  if k != j]
                 for i in range(1, n)]
        sign = (-1) ** j
        total += sign * A[0][j] * det(minor)
    return total
det(A) = 0 means the matrix is singular (not invertible). Rows/columns are linearly dependent — the transformation collapses space into a lower dimension.
28

Minors & Cofactors

The minor Mij is the determinant of the sub-matrix formed by deleting row i and column j. The cofactor Cij = (-1)i+j × Mij adds a checkerboard sign pattern.

Cij = (-1)i+j × det(A with row i, col j removed)
Sign pattern: + − + / − + − / + − +
▶ With NumPy
A = np.array([[1,2,3],
              [0,4,5],
              [1,0,6]])

def minor_matrix(A):
    n = A.shape[0]
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            sub = np.delete(np.delete(A, i, 0), j, 1)
            M[i,j] = np.linalg.det(sub)
    return M

def cofactor_matrix(A):
    M = minor_matrix(A)
    n = A.shape[0]
    signs = np.array([[(-1)**(i+j)
             for j in range(n)]
            for i in range(n)])
    return signs * M
▶ Pure Python (no NumPy)
def get_minor(A, i, j):
    """Sub-matrix with row i, col j removed"""
    return [[A[r][c] for c in range(len(A))
             if c != j]
            for r in range(len(A))
            if r != i]

def minor_matrix(A):
    n = len(A)
    return [[det(get_minor(A, i, j))
             for j in range(n)]
            for i in range(n)]

def cofactor_matrix(A):
    n = len(A)
    M = minor_matrix(A)
    return [[(-1)**(i+j) * M[i][j]
             for j in range(n)]
            for i in range(n)]
Cofactors are the building blocks for both the determinant (cofactor expansion) and the adjugate matrix (used to compute the inverse).
29

Adjugate & Inverse

The adjugate (or classical adjoint) is the transpose of the cofactor matrix. The inverse A-1 satisfies A · A-1 = I (identity). It only exists when det(A) ≠ 0.

adj(A) = cofactor(A)T
A-1 = (1 / det(A)) × adj(A)
For 2×2: A-1 = (1/(ad-bc)) × [[d,-b],[-c,a]]
▶ With NumPy
A = np.array([[1, 2],
              [3, 4]])

A_inv = np.linalg.inv(A)
print(A_inv)
# [[-2. ,  1. ],
#  [ 1.5, -0.5]]

# Verify: A @ A_inv = Identity
np.allclose(A @ A_inv, np.eye(2))  # True

# Pseudo-inverse (works for non-square)
np.linalg.pinv(A)
▶ Pure Python (no NumPy)
def adjugate(A):
    C = cofactor_matrix(A)
    return transpose(C)

def inverse(A):
    d = det(A)
    if d == 0:
        raise ValueError("Singular matrix")
    adj = adjugate(A)
    n = len(A)
    return [[adj[i][j] / d
             for j in range(n)]
            for i in range(n)]

# Quick 2x2 shortcut
def inv_2x2(A):
    a,b = A[0]; c,d = A[1]
    D = a*d - b*c
    return [[d/D, -b/D],
            [-c/D, a/D]]
In ML, the normal equation for linear regression is β = (XTX)-1XTy. Computing the inverse directly is O(n³) and numerically unstable — use np.linalg.solve or np.linalg.lstsq in practice.
30

Matrix Power

Raising a square matrix to an integer power means multiplying it by itself k times: Ak = A × A × … × A. Negative powers use the inverse: A-k = (A-1)k.

▶ With NumPy
A = np.array([[2, 1],
              [0, 3]])

np.linalg.matrix_power(A, 3)
# [[8, 13],
#  [0, 27]]

# A^0 = Identity
np.linalg.matrix_power(A, 0)  # [[1,0],[0,1]]

# Negative power = inverse raised
np.linalg.matrix_power(A, -1)  # same as inv(A)
▶ Pure Python (no NumPy)
def mat_power(A, k):
    n = len(A)
    # Identity matrix
    result = [[1 if i==j else 0
               for j in range(n)]
              for i in range(n)]
    if k < 0:
        A = inverse(A)
        k = -k
    for _ in range(k):
        result = mat_mul(result, A)
    return result

mat_power([[2,1],[0,3]], 3)
# [[8, 13], [0, 27]]
Matrix powers appear in Markov chains (state transition after k steps), graph theory (counting k-length paths), and solving systems of linear differential equations.
31

QR & Cholesky Decompositions

QR decomposition factors A = QR where Q is orthogonal (QTQ = I) and R is upper-triangular. Cholesky decomposition factors a symmetric positive-definite matrix as A = LLT where L is lower-triangular.

QR Decomposition

A = Q × R

Used in: least-squares, eigenvalue algorithms, solving Ax=b numerically stably

Cholesky Decomposition

A = L × LT

Used in: sampling from multivariate Gaussians, solving positive-definite systems (~2× faster than LU)

▶ QR with NumPy
A = np.array([[1,1],[1,0],[0,1]])
Q, R = np.linalg.qr(A)
print(Q.shape)  # (3, 2)
print(R.shape)  # (2, 2)

# Verify: Q @ R ≈ A
np.allclose(Q @ R, A)  # True
# Q columns are orthonormal
np.allclose(Q.T @ Q, np.eye(2))  # True
▶ Cholesky with NumPy
# Must be symmetric positive-definite
A = np.array([[4,2],[2,3]])

L = np.linalg.cholesky(A)
print(L)
# [[2.  , 0.  ],
#  [1.  , 1.41]]

# Verify: L @ L.T ≈ A
np.allclose(L @ L.T, A)  # True
Cholesky is essential in Bayesian ML: to sample from N(μ, Σ), compute L = cholesky(Σ), then sample z ~ N(0,I) and return μ + Lz.
32

Gram-Schmidt Orthogonalization

The Gram-Schmidt process takes a set of linearly independent vectors and produces an orthonormal set (mutually perpendicular unit vectors). It is the algorithm behind QR decomposition.

uk = vk − Σ proj(uj, vk)
ek = uk / ||uk||
Subtract all projections onto previous vectors, then normalise
▶ With NumPy
def gram_schmidt_np(V):
    """V: columns are input vectors"""
    n = V.shape[1]
    U = np.zeros_like(V, dtype=float)
    for i in range(n):
        u = V[:, i].astype(float)
        for j in range(i):
            u -= np.dot(U[:,j], V[:,i]) * U[:,j]
        U[:, i] = u / np.linalg.norm(u)
    return U

V = np.array([[1,1],[1,0]])
Q = gram_schmidt_np(V)
# Q columns are orthonormal
▶ Pure Python (no NumPy)
def proj(u, v):
    """Project v onto u"""
    dot_uv = sum(a*b for a,b in zip(u,v))
    dot_uu = sum(a*a for a in u)
    return [dot_uv/dot_uu * a for a in u]

def gram_schmidt(vectors):
    ortho = []
    for v in vectors:
        u = v[:]
        for o in ortho:
            p = proj(o, v)
            u = [u[i]-p[i] for i in range(len(u))]
        norm = l2_norm(u)
        ortho.append([x/norm for x in u])
    return ortho
33

Solving Linear Systems

A linear system Ax = b asks: given matrix A and vector b, find vector x. This is the mathematical core of regression, curve fitting, and engineering.

Ax = b → x = A-1b
Direct inverse is O(n³) and unstable. Prefer factorisation-based solvers.
▶ With NumPy
# Exact solution (square A)
A = np.array([[3,1],[1,2]])
b = np.array([9, 8])
x = np.linalg.solve(A, b)
print(x)  # [2. 3.]

# Least-squares (overdetermined)
A = np.array([[1,1],[2,1],[3,1]])
b = np.array([1, 2, 2])
x, res, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x)  # [0.5, 0.333...]
▶ Pure Python (no NumPy)
def solve_2x2(A, b):
    """Cramer's rule for 2x2"""
    D  = det(A)
    Dx = [[b[0], A[0][1]],
          [b[1], A[1][1]]]
    Dy = [[A[0][0], b[0]],
          [A[1][0], b[1]]]
    return [det(Dx)/D, det(Dy)/D]

def solve_inverse(A, b):
    """General: x = A_inv @ b"""
    A_inv = inverse(A)
    n = len(b)
    return [sum(A_inv[i][j]*b[j]
                for j in range(n))
            for i in range(n)]

solve_2x2([[3,1],[1,2]], [9,8])
# [2.0, 3.0]
np.linalg.solve uses LU decomposition internally and is both faster and more numerically stable than computing inv(A) @ b. Always prefer solve over inv.
Part VI

ML Essentials

34

Trace

The trace of a square matrix is the sum of its diagonal elements. It appears constantly in matrix calculus — the Frobenius inner product is tr(ATB), and many loss functions are expressed as traces.

tr(A) = Σ Aii = a11 + a22 + … + ann
Properties: tr(A+B) = tr(A) + tr(B) • tr(AB) = tr(BA) • tr(A) = Σ eigenvalues
▶ With NumPy
A = np.array([[1,2],[3,4]])
np.trace(A)  # 5  (1 + 4)

# Frobenius inner product
np.trace(A.T @ A)  # same as np.sum(A**2)
▶ Pure Python
def trace(A):
    return sum(A[i][i] for i in range(len(A)))

trace([[1,2],[3,4]])  # 5
35

Rank & Linear Independence

The rank of a matrix is the number of linearly independent rows (or columns). Vectors are linearly independent if none can be written as a combination of the others. Rank tells you the “true dimensionality” of your data.

Full rank

rank(A) = min(m,n). All rows/cols independent. System has a unique solution.

Rank-deficient

rank(A) < min(m,n). Redundant features. Infinite or no solutions.

▶ With NumPy
A = np.array([[1,2,3],
              [4,5,6],
              [7,8,9]])
np.linalg.matrix_rank(A)  # 2 (row 3 = 2*row2 - row1)

B = np.eye(3)
np.linalg.matrix_rank(B)  # 3 (full rank)
▶ Why it matters in ML

If your feature matrix X is rank-deficient, XTX is singular and the normal equation has no unique solution. This is why regularization adds λI — it guarantees full rank.

36

Positive Definite Matrices

A symmetric matrix A is positive definite (PD) if xTAx > 0 for every non-zero vector x. Equivalently: all eigenvalues are positive. PD matrices guarantee a unique minimum in optimization.

Positive Definite

All eigenvalues > 0. Invertible. Cholesky works.

Positive Semi-definite

All eigenvalues ≥ 0. Covariance matrices are always PSD.

Indefinite

Mixed signs. Saddle point — no guaranteed minimum.

A = np.array([[2, 1],
              [1, 3]])
eigenvalues = np.linalg.eigvalsh(A)  # [1.38, 3.62] — both > 0 → PD

# Quick test: try Cholesky — if it works, it's PD
np.linalg.cholesky(A)  # succeeds → PD ✓
In optimization, the Hessian being PD at a point means you're at a local minimum. In kernel methods (SVM), the kernel matrix must be PSD.
37

Covariance Matrix

The covariance matrix Σ captures how features vary together. Element Σij = Cov(xi, xj). The diagonal holds variances; off-diagonal holds covariances. It is always symmetric positive semi-definite.

▶ With NumPy
# X: (samples, features)
X = np.array([[1,2],[3,4],[5,6],[7,8]])
cov = np.cov(X, rowvar=False)
print(cov)
# [[6.667, 6.667],
#  [6.667, 6.667]]

# PCA starts here: eigenvectors of cov
vals, vecs = np.linalg.eigh(cov)
▶ Pure Python
def mean(col):
    return sum(col) / len(col)

def cov_matrix(X):
    # X: list of rows
    n = len(X); d = len(X[0])
    means = [mean([X[i][j] for i in range(n)])
             for j in range(d)]
    C = [[0]*d for _ in range(d)]
    for a in range(d):
        for b in range(d):
            C[a][b] = sum(
              (X[i][a]-means[a])*(X[i][b]-means[b])
              for i in range(n)) / (n-1)
    return C
38

Special Matrices

Certain matrix types appear so often in ML that recognizing them instantly saves time and prevents bugs.

Identity (I)

AI = IA = A. The “1” of matrices. np.eye(n)

Diagonal

Non-zero only on diagonal. np.diag(v). Scales each axis independently.

Symmetric

A = AT. Covariance, kernel, and Hessian matrices are symmetric.

Orthogonal

QTQ = I. Preserves lengths/angles. Used in weight init for RNNs. np.linalg.qr

Sparse

Mostly zeros. NLP bag-of-words, graph adjacency. scipy.sparse saves memory.

# NumPy special matrix constructors
np.eye(3)          # 3×3 identity
np.diag([1,2,3])   # 3×3 diagonal
np.zeros((3,4))    # 3×4 zeros
np.ones((2,2))     # 2×2 ones

# Sparse (scipy)
from scipy import sparse
S = sparse.csr_matrix(np.eye(1000))  # stores only 1000 values, not 1M
39

Projection Matrices

A projection matrix P maps a vector onto a subspace. It satisfies P² = P (applying it twice changes nothing). The least-squares projection is P = X(XTX)-1XT, which projects b onto the column space of X.

▶ With NumPy
# Project b onto the column space of A
A = np.array([[1,0],[0,1],[0,0]])
P = A @ np.linalg.inv(A.T @ A) @ A.T

b = np.array([1,2,3])
proj_b = P @ b  # [1, 2, 0]

# Verify P² = P
np.allclose(P @ P, P)  # True
▶ Why it matters in ML

Linear regression is a projection: ŷ = Xb projects the target vector onto the column space of the feature matrix. Attention in Transformers computes Q(KTK)-1KTV — a weighted projection.

40

Matrix Calculus (Jacobians & Hessians)

Matrix calculus extends derivatives to vectors and matrices. The Jacobian is the matrix of all first-order partial derivatives. The Hessian is the matrix of second-order derivatives — it tells you about curvature.

Jacobian J

Jij = ∂fi/∂xj. Maps how each output changes w.r.t. each input. Shape: (outputs × inputs).

Hessian H

Hij = ∂²f/∂xi∂xj. Always symmetric. PD → convex → guaranteed minimum.

# Key gradient rules (used in backprop)
# d/dx (Ax)    = A
# d/dx (x^T A x) = (A + A^T) x  (= 2Ax if A symmetric)
# d/dA tr(AB)  = B^T
# d/dx ||Ax-b||² = 2 A^T(Ax - b)   ← linear regression gradient!

# PyTorch auto-computes Jacobians:
# loss.backward() builds the chain of Jacobians (backpropagation)
Backpropagation is just the chain rule applied to Jacobians: ∂L/∂W1 = (∂L/∂zn) × (∂zn/∂zn-1) × … × (∂z2/∂W1). Each term is a Jacobian matrix.
41

Einstein Summation (einsum)

np.einsum expresses any tensor operation in one line using Einstein notation. Repeated indices are summed over. It's the lingua franca of PyTorch/TensorFlow for complex tensor ops.

# Dot product
np.einsum('i,i->', a, b)           # same as a @ b

# Matrix multiply
np.einsum('ik,kj->ij', A, B)      # same as A @ B

# Batch matrix multiply
np.einsum('bik,bkj->bij', T1, T2) # per-batch matmul

# Trace
np.einsum('ii->', A)              # same as np.trace(A)

# Transpose
np.einsum('ij->ji', A)            # same as A.T

# Outer product
np.einsum('i,j->ij', a, b)        # same as np.outer(a,b)

# Attention: Q K^T (batched)
np.einsum('bhid,bhjd->bhij', Q, K) # multi-head attention scores
Rule: letters that appear on both sides of -> are kept; letters that appear only on the left are summed over. This single rule covers dot products, matmuls, traces, and more.
42

Null Space & Column Space

The column space of A is the set of all possible outputs Ax — it's the “reach” of A. The null space is the set of all x where Ax = 0 — inputs that A annihilates. Together: dim(col space) + dim(null space) = n (number of columns).

▶ With NumPy / SciPy
A = np.array([[1,2,3],[4,5,6],[7,8,9]])

# Column space basis (via SVD)
U, S, VT = np.linalg.svd(A)
rank = np.sum(S > 1e-10)
col_space = U[:, :rank]  # shape (3, 2)

# Null space
from scipy.linalg import null_space
ns = null_space(A)  # shape (3, 1)
▶ Why it matters in ML

Column space = what your model can predict. If b isn't in the column space, you can only get the closest approximation (least squares).

Null space = parameter directions that don't affect output. A non-trivial null space means infinite solutions — your model is under-constrained.

43

Condition Number

The condition number κ(A) = σmaxmin measures how sensitive a matrix computation is to numerical errors. A high condition number means tiny input changes cause huge output changes — the matrix is ill-conditioned.

▶ With NumPy
# Well-conditioned
A = np.array([[1,0],[0,1]])
np.linalg.cond(A)  # 1.0 (perfect)

# Ill-conditioned
B = np.array([[1, 1],
              [1, 1.0001]])
np.linalg.cond(B)  # ~40000 (dangerous!)

# Rule of thumb:
# κ ~ 1     → stable
# κ ~ 10^k  → lose ~k digits of precision
# κ = ∞     → singular
▶ Why it matters in ML

NaN gradients? Check if your weight matrices are ill-conditioned. Features with vastly different scales create ill-conditioned XTX.

Fix: Feature scaling (standardization), regularization (λI raises the smallest singular values), or using np.linalg.lstsq instead of inv.

If np.linalg.cond(X.T @ X) is large, do not compute inv(X.T @ X). Use np.linalg.lstsq(X, y) or add regularization instead.
Appendix

Coverage Map

✓ Covered — Holberton objectives
Vectors — definition, 1-D arrays, row vs columnYou can describe and create them in NumPy and pure Python
Matrices — 2-D arrays, shape (m×n), ndimYou can build and describe any matrix
Transpose — A.T, np.transpose, shape flip, pure PythonYou know when and why to use it
Shape & Axes — axis 0/1, shape tuple, sum along axisYou understand how dimensions work
Slicing — 1-D, 2-D, 3-D, negative, boolean, fancy, pure PythonFully drilled with 5 NumPy exercises and pure-Python cheatsheet
Element-wise operations — +, *, **, sqrt, pure PythonYou know the difference from matmul
Concatenation — np.concatenate, vstack, hstack, pure PythonAlong any axis
Dot product — np.dot, @ operator, scalar result, pure PythonYou understand it as the engine of ML layers
Matrix multiplication — np.matmul, @, triple loop, shape rulesBoth NumPy and pure Python
NumPy — ndarray, vectorisation, speedYou know why it exists and how it's fast
Parallelisation — SIMD, BLAS, CPU/GPUYou can explain why it matters in ML
Broadcasting — right-align rule, dimension stretchingYou know the rule and the pitfall
✓ Covered — Advanced Operations
Determinants — np.linalg.det, recursive cofactor expansion, pure PythonTells you if a matrix is invertible; measures volume scaling
Minors & Cofactors — sub-matrix determinants, checkerboard signsBuilding blocks for adjugate and inverse computation
Adjugate & Inverse — np.linalg.inv, pinv, pure Python via cofactorsSolving Ax = b; linear regression closed-form uses pseudo-inverse
Matrix Power — np.linalg.matrix_power, pure Python loopMarkov chains, graph paths, differential equations
QR & Cholesky Decompositions — np.linalg.qr, np.linalg.choleskyNumerical stability, sampling from Gaussians, eigenvalue algorithms
Gram-Schmidt Orthogonalization — projection-based, pure PythonFoundation of QR decomposition; produces orthonormal bases
Solving Linear Systems — np.linalg.solve, lstsq, Cramer's ruleCore of regression, curve fitting, and engineering
✓ Covered — ML Essentials
Trace — np.trace, diagonal sum, Frobenius inner productUsed in loss functions, matrix derivatives, eigenvalue sums
Rank & Linear Independence — np.linalg.matrix_rankFeature redundancy, model capacity, regularization motivation
Positive Definite Matrices — eigenvalue test, Cholesky testOptimization guarantees, kernel validity, covariance properties
Covariance Matrix — np.cov, pure PythonPCA foundation, Gaussian distributions, feature correlations
Special Matrices — Identity, Diagonal, Symmetric, Orthogonal, SparseWeight init, regularization, large-scale NLP
Projection Matrices — P²=P, least-squares projectionLinear regression as projection, attention mechanisms
Matrix Calculus — Jacobians, Hessians, gradient rulesMathematical machinery behind backpropagation
Einstein Summation — np.einsum, index notationExpress any tensor op in one line; PyTorch/TF native
Null & Column Space — SVD-based, scipy.linalg.null_spaceModel reach vs. under-constrained parameters
Condition Number — np.linalg.cond, numerical stabilityDebugging NaN gradients, feature scaling motivation
★ Further topics
🟠
Least squares & normal equation — β = (XTX)-1XTyDirectly connects linear algebra to your first ML model
🟠
Kronecker & Hadamard productsAdvanced architectures, tensor factorization
Linear Algebra for Machine Learning · Holberton DLH · Built with Claude
Comments

Comments

Loading comments...