8 Least Squares
Least squares problems naturally arise when we seek the “best” solution to an inconsistent system of linear equations \(A\mathbf{x}=\mathbf{b}\). While we typically classify linear systems as having no solution, exactly one solution, or infinitely many solutions, real-world data often leads to situations where \(\mathbf{b}\) lies outside the column space of \(A\), making an exact solution impossible. Instead of giving up, we ask a more refined question: which vector \(\mathbf{x}\) minimizes the discrepancy between \(A\mathbf{x}\) and \(\mathbf{b}\)? In other words, we seek to solve \[\min_{\mathbf{x}}\|A\mathbf{x}-\mathbf{b}\|,\] where \(\|\cdot\|\) is the Euclidean norm induced by the dot product.
This problem is fundamental in statistics, applied science, and modern machine learning, where data rarely fits models exactly. Least squares provides a principled way to handle noisy observations, from experimental measurements to regression analysis and AI algorithms.
At its core, least squares is an optimization problem: we seek the vector \(\mathbf{x}\) that minimizes a given objective function. This idea extends far beyond linear systems—modern machine learning methods, including deep learning, rely on solving large-scale optimization problems where the goal is to minimize a loss function that quantifies error. Least squares serves as an important first example of how mathematical optimization underpins data-driven modeling.
8.1 Orthogonal Projection
The concept of orthogonal projection is fundamental to understanding how we can find the closest point in a subspace to any given vector. This builds from our understanding of orthogonal complements (Section 7.3).
Theorem 8.1 Suppose that \(W\) is a subspace of \(\mathbb{R}^n\) and that \(\mathbf{v}\in\mathbb{R}^n\). Then there exists a unique \(\hat{\mathbf{v}}\in W\) such that \[\|\mathbf{v}-\hat{\mathbf{v}}\|=\min_{\mathbf{w}\in W}\|\mathbf{v}-\mathbf{w}\|. \] Moreover, \(\hat{\mathbf{v}}\) is characterized by
- \(\hat{\mathbf{v}}\in W\)
- \(\mathbf{v}-\hat{\mathbf{v}}\in W^\perp\)
The vector \(\hat{\mathbf{v}}\) is called the orthogonal projection of \(\mathbf{v}\) onto \(W\). The term “orthogonal” comes from the second condition: the error vector \(\mathbf{v}-\hat{\mathbf{v}}\) is perpendicular to the subspace \(W\). Two fundamental geometric properties give us the solution: the Pythagorean Theorem (Theorem 1.2) demonstrates why this orthogonality guarantees that \(\hat{\mathbf{v}}\) is the closest point in \(W\) to \(\mathbf{v}\), while the Parallelogram Law (Equation 1.1) ensures this closest point is unique. Note that we solve this minimization problem using purely geometric arguments, without resorting to calculus.
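As a small worked example, take \(W=\text{span}\{\mathbf{w}\}\) with \(\mathbf{w}\neq\mathbf{0}\). Then \(\hat{\mathbf{v}}=c\mathbf{w}\) for some scalar \(c\), and the orthogonality condition \((\mathbf{v}-c\mathbf{w})\cdot\mathbf{w}=0\) forces \(c=\frac{\mathbf{v}\cdot\mathbf{w}}{\mathbf{w}\cdot\mathbf{w}}\). For instance, with \(\mathbf{v}=(1,2,2)\) and \(\mathbf{w}=(1,1,0)\), \[\hat{\mathbf{v}}=\frac{\mathbf{v}\cdot\mathbf{w}}{\mathbf{w}\cdot\mathbf{w}}\,\mathbf{w}=\frac{3}{2}\begin{bmatrix}1\\1\\0\end{bmatrix}, \qquad \mathbf{v}-\hat{\mathbf{v}}=\begin{bmatrix}-1/2\\1/2\\2\end{bmatrix},\] and a direct check gives \((\mathbf{v}-\hat{\mathbf{v}})\cdot\mathbf{w}=-\tfrac{1}{2}+\tfrac{1}{2}+0=0\), so both conditions of Theorem 8.1 hold.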
This projection theorem forms the theoretical foundation for solving least squares problems.
8.2 Least Squares via Projection onto col(A)
The projection theorem provides what we need to solve the least squares problem \(\min \|A\mathbf{x} - \mathbf{b}\|\). When we seek to minimize \(\|A\mathbf{x} - \mathbf{b}\|\), we are effectively looking for the closest point in \(\text{col}(A)\) to the vector \(\mathbf{b}\). By the projection theorem, this closest point must be the orthogonal projection of \(\mathbf{b}\) onto \(\text{col}(A)\).
Theorem 8.2 Let \(A\) be an \(m \times n\) matrix and \(\mathbf{b}\in\mathbb{R}^m\). Then \(\mathbf{b}\) can be uniquely decomposed as \[\mathbf{b} = \mathbf{\hat{b}} + \mathbf{e}\] where \(\mathbf{\hat{b}} \in \text{col}(A)\) and \(\mathbf{e} = \mathbf{b} - \hat{\mathbf{b}} \in (\text{col}(A))^\perp = \text{null}(A^T)\). Moreover:
- \(\mathbf{\hat{b}} = A\mathbf{\hat{x}}\) where \(\mathbf{\hat{x}}\) is any solution to the normal equations \(A^TA\mathbf{x} = A^T\mathbf{b}\)
- \(\|\mathbf{b} - A\mathbf{x}\| \geq \|\mathbf{e}\|\) for all \(\mathbf{x} \in \mathbb{R}^n\), with equality if and only if \(\mathbf{x} = \mathbf{\hat{x}}\)
If \(A^TA\) is invertible, then \(\mathbf{\hat{x}}\) is unique and
- \(\mathbf{\hat{x}}=(A^TA)^{-1}A^T\mathbf{b}\), and
- \(\mathbf{\hat{b}}=A(A^TA)^{-1}A^T\mathbf{b}\)
This theorem translates the abstract projection theorem into concrete matrix computations. The normal equations arise naturally because \((\text{col}(A))^\perp = \text{null}(A^T)\) (Theorem 7.1), connecting the algebraic and geometric perspectives of least squares solutions.
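As a small illustration, here is a minimal sketch assuming NumPy; the matrix \(A\) and vector \(\mathbf{b}\) below are made up for the example, with \(\mathbf{b}\) chosen to lie outside \(\text{col}(A)\).

```python
import numpy as np

# An inconsistent 3 x 2 system: b is not in col(A)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 0.0, 2.0])

# Solve the normal equations A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

b_hat = A @ x_hat   # projection of b onto col(A)
e = b - b_hat       # error vector, lies in null(A^T)
print(x_hat)        # least squares solution
print(A.T @ e)      # approximately the zero vector
```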
The matrix \(P=A(A^TA)^{-1}A^T\) from Theorem 8.2 has two remarkable properties: \(P^2=P\) and \(P^T=P\). A matrix satisfying \(P^2=P\) is called a projection matrix: applying it twice doesn’t change anything, just as geometrically projecting a point that’s already in the subspace doesn’t move it. When a projection matrix is also symmetric (\(P^T=P\)), we call it an orthogonal projection matrix, as it projects vectors orthogonally onto its range. These properties capture the geometric meaning of our least squares solution: \(P\mathbf{b}\) is the orthogonal projection of \(\mathbf{b}\) onto \(\text{col}(A)\).
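Continuing the sketch above (reusing `A`, `b`, and `x_hat` defined there, and assuming \(A^TA\) is invertible), these properties are easy to check numerically:

```python
# Orthogonal projection matrix onto col(A)
P = A @ np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(P @ P, P))           # True: P^2 = P
print(np.allclose(P.T, P))             # True: P^T = P
print(np.allclose(P @ b, A @ x_hat))   # True: P b is the projection b_hat
```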
8.3 Invertibility of \(A^TA\) (Optional)
In this section we will prove that if the columns of \(A\) are linearly independent, then \(A^TA\) is invertible. The main tool is Theorem 5.3.
Suppose that \(A\) is an \(m\times n\) matrix for which the equation \(A\mathbf{x}=\mathbf{0}\) has a unique solution. We will show that this implies that \(A^TA\mathbf{x}=\mathbf{0}\) also has a unique solution. It follows that the RREF of \(A^TA\) has \(n\) pivots, and since \(A^TA\) is \(n\times n\), its RREF is the identity, which implies that \(A^TA\) is invertible, as we showed in the lab.
The main tool to prove this is the following important Lemma. Recall that if \(\mathbf{x},\mathbf{y}\in\mathbb{R}^n\), we can write their dot product as the transpose of the first vector times the second one: \[\mathbf{x}\cdot\mathbf{y}=\begin{bmatrix}x_1&x_2&\cdots &x_n\end{bmatrix} \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_n \end{bmatrix}\]
Lemma 8.1 Suppose that \(A\) is an \(m\times n\) matrix, that \(\mathbf{x}\in\mathbb{R}^m\) and \(\mathbf{y}\in\mathbb{R}^n\). Then \[(A^T\mathbf{x})\cdot\mathbf{y}=\mathbf{x}\cdot(A\mathbf{y})\]
Proof. Using the fact that the dot product can be written as matrix multiplication of a row vector with a column vector:
\[(A^T\mathbf{x})\cdot\mathbf{y} = ((A^T\mathbf{x})^T)\mathbf{y}.\]
Then
\[((A^T\mathbf{x})^T)\mathbf{y} = \mathbf{x}^TA\mathbf{y} = \mathbf{x}\cdot(A\mathbf{y})\]
where the first equality uses the transpose property \((A^T\mathbf{x})^T = \mathbf{x}^T(A^T)^T = \mathbf{x}^TA\). \(\square\)
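A quick numerical sanity check of the lemma (a minimal sketch assuming NumPy; the matrix and vectors are random):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # m = 4, n = 3
x = rng.standard_normal(4)        # x in R^m
y = rng.standard_normal(3)        # y in R^n

# Both sides of the identity in Lemma 8.1 agree (up to rounding)
print(np.isclose((A.T @ x) @ y, x @ (A @ y)))   # True
```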
Proposition: Let \(A\) be an \(m\times n\) matrix. If \(A\mathbf{x}=\mathbf{0}\) has a unique solution, then \(A^TA\mathbf{x}=\mathbf{0}\) also has a unique solution.
Proof. Suppose that \(A^TA\mathbf{x}=\mathbf{0}\). Then \((A^TA\mathbf{x})\cdot \mathbf{x}=0\). Applying Lemma 8.1 we get \[0=(A^TA\mathbf{x})\cdot\mathbf{x}=(A^T(A\mathbf{x}))\cdot\mathbf{x}=(A\mathbf{x})\cdot (A\mathbf{x})=\|A\mathbf{x}\|^2.\] This implies that \(A\mathbf{x}=\mathbf{0}\), and the assumption on \(A\) tells us that \(\mathbf{x}=\mathbf{0}\). \(\square\)
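This too can be sanity checked numerically (a sketch assuming NumPy; the matrix below is random, so its columns are linearly independent with probability one):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))          # random tall matrix: columns a.s. independent

print(np.linalg.matrix_rank(A))          # 3, so A x = 0 has only the trivial solution
print(np.linalg.matrix_rank(A.T @ A))    # 3, so the 3 x 3 matrix A^T A is invertible
```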
8.4 Regression Problems
See slide presentation
8.5 Working with Data using Pandas
8.5.1 What is Pandas?
- Python library for data manipulation and analysis
- Built on top of NumPy, provides DataFrame objects
- Efficiently handles tabular data with labeled rows and columns
- Perfect for preprocessing data before linear regression
8.5.2 Basic Syntax
```python
import pandas as pd
import numpy as np

# Create DataFrame from CSV
df = pd.read_csv('data.csv')

# View first few rows
df.head()

# Basic information about the dataset
df.info()
```
8.5.3 Essential Functions for Linear Regression
Data Selection
```python
# Select single column (returns Series)
y = df['target_variable']

# Select multiple columns (returns DataFrame)
X = df[['feature1', 'feature2', 'feature3']]
```
Converting to NumPy
```python
# Convert to numpy arrays for matrix operations
X_matrix = X.to_numpy()
y_vector = y.to_numpy()
```
Working Example: Least Squares
```python
# Read data
df = pd.read_csv('housing_data.csv')

# Select features and target
X = df[['square_feet', 'bedrooms', 'age']]
y = df['price']

# Convert to numpy arrays
X_np = X.to_numpy()
y_np = y.to_numpy()

# Add column of ones for intercept term
X_np = np.column_stack([np.ones(len(X_np)), X_np])

# Calculate least squares coefficients: β = (X^T X)^{-1} X^T y
beta = np.linalg.inv(X_np.T @ X_np) @ X_np.T @ y_np
```
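On larger or nearly collinear data sets, explicitly inverting \(X^TX\) as above can be numerically unstable. NumPy’s `np.linalg.lstsq` solves the same minimization problem more robustly:

```python
# Equivalent, more numerically stable least squares solve
beta, residuals, rank, sing_vals = np.linalg.lstsq(X_np, y_np, rcond=None)
```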
8.5.4 Helpful Data Cleaning Functions
- `df.dropna()`: Remove rows with missing values
- `df.fillna(value)`: Fill missing values
- `df.describe()`: Summary statistics
- `df['column'].value_counts()`: Count unique values
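For instance, a typical cleaning pass before setting up the least squares problem might look like the sketch below, which reuses the hypothetical `housing_data.csv` from the working example above:

```python
import pandas as pd

df = pd.read_csv('housing_data.csv')

df = df.dropna()                       # remove rows with missing values
print(df.describe())                   # summary statistics for numeric columns
print(df['bedrooms'].value_counts())   # count of houses by number of bedrooms
```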