Week 2: Linear Models & Least Squares

Stat 577 — Statistical Learning Theory

Soumojit Das

Spring 2026

Linear Models: The Foundation

Why Start with Linear Models?

Linear models are the workhorse of statistical learning.

Reasons to understand them deeply:

  • Interpretable: coefficients have clear meaning
  • Computationally efficient: closed-form solutions
  • Theoretical foundation: many methods generalize from here
  • Still competitive: often hard to beat for inference

This week: Understand linear regression geometrically, not just algebraically.

Three Goals of Regression

What do we actually want from a linear model?

1. Inference

Estimate \(\boldsymbol{\beta}\) and quantify uncertainty

“Which predictors matter?”

2. Prediction

Given new \(\mathbf{x}\), predict \(y\)

“What will happen?”

3. Interpretation

Understand relationships

“How does \(X_j\) affect \(Y\)?”

Different Goals, Different Priorities

  • Inference prioritizes unbiasedness and valid standard errors
  • Prediction prioritizes low test error (bias-variance tradeoff!)
  • These goals can conflict—a theme we’ll revisit in Week 3

The Linear Model

We assume:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon\]

Or in vector notation for a single observation:

\[Y = \beta_0 + \boldsymbol{\beta}^T \mathbf{x} + \varepsilon\]

where:

  • \(\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T\) is the coefficient vector
  • \(\mathbf{x} = (X_1, \ldots, X_p)^T\) is the predictor vector
  • \(\varepsilon\) is the error with \(E[\varepsilon] = 0\)

Simple Linear Regression

One Predictor

The simplest case:

\[Y = \beta_0 + \beta_1 X + \varepsilon\]

Goal: Find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the residual sum of squares (RSS):

\[\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\]

Deriving the OLS Estimates

Take partial derivatives and set to zero:

\[\frac{\partial \text{RSS}}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0\]

\[\frac{\partial \text{RSS}}{\partial \beta_1} = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0\]

From the first equation: \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

The Closed-Form Solution

Simple Linear Regression Estimates

\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}\]

\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]

Interpretation

  • \(\hat{\beta}_1\) is the estimated change in \(Y\) per unit change in \(X\)
  • The regression line passes through \((\bar{x}, \bar{y})\)
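These formulas are easy to sanity-check numerically. Below is a minimal NumPy sketch (the data are made up for illustration; the course code is in R, but the arithmetic is language-agnostic):

```python
import numpy as np

# Toy data (hypothetical): y is roughly linear in x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([5.1, 7.9, 11.2, 13.8, 17.0])

# Closed-form estimates: beta1 = Sxy / Sxx, beta0 = ybar - beta1 * xbar
xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar) ** 2)
Sxy = np.sum((x - xbar) * (y - ybar))
beta1 = Sxy / Sxx
beta0 = ybar - beta1 * xbar

# Cross-check against numpy's least-squares polynomial fit
b1_np, b0_np = np.polyfit(x, y, deg=1)
assert np.allclose([beta0, beta1], [b0_np, b1_np])

# The fitted line passes through (xbar, ybar)
assert np.isclose(beta0 + beta1 * xbar, ybar)
```

Note the final assertion: the fitted line always passes through the point of means, exactly as the slope/intercept formulas imply.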

Simple Linear Regression Example

Multiple Regression: Matrix Formulation

The Design Matrix

With \(n\) observations and \(p\) predictors, write as:

\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]

where:

  • \(\mathbf{y} \in \mathbb{R}^n\) is the response vector
  • \(\mathbf{X} \in \mathbb{R}^{n \times (p+1)}\) is the design matrix (including column of 1s)
  • \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\) is the coefficient vector
  • \(\boldsymbol{\varepsilon} \in \mathbb{R}^n\) is the error vector

The Design Matrix Structure

\[\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}\]

The first column of 1s accounts for the intercept \(\beta_0\).

Row \(i\) of \(\mathbf{X}\) contains the predictors for observation \(i\).

The Normal Equations

Minimize RSS:

\[\text{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Take the gradient:

\[\nabla_{\boldsymbol{\beta}} \text{RSS} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]

Setting to zero gives the normal equations:

\[\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}\]

The OLS Estimator

Ordinary Least Squares (OLS) Estimator

If \(\mathbf{X}^T\mathbf{X}\) is invertible:

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

When is \(\mathbf{X}^T\mathbf{X}\) invertible?

  • When columns of \(\mathbf{X}\) are linearly independent
  • Requires \(n \geq p + 1\) (more observations than parameters)
  • No perfect multicollinearity
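A quick numerical check of the normal equations on simulated data (a NumPy sketch; the design and coefficients below are invented for illustration). With independent Gaussian columns, \(\mathbf{X}^T\mathbf{X}\) is invertible with probability one:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Solve the normal equations X'X beta = X'y via a linear solve
# (never form the explicit inverse in real code)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with numpy's dedicated least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(beta_hat, beta_lstsq)
```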

Numerical Stability in Practice

In practice, we never compute \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly—this is numerically unstable!

Instead, lm() in R uses the QR decomposition: \(\mathbf{X} = \mathbf{Q}\mathbf{R}\) where \(\mathbf{Q}\) is orthonormal and \(\mathbf{R}\) is upper triangular. Then:

\[\hat{\boldsymbol{\beta}} = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}\]

This avoids computing \(\mathbf{X}^T\mathbf{X}\) (which squares the condition number) and solves via back-substitution. The formula \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) is for theory; QR is for computation.
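The QR route can be sketched in a few lines of NumPy (illustrative data; this mimics what lm() does internally, not its actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p1 = 40, 4                      # p1 = p + 1 columns including the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p1 - 1))])
y = rng.normal(size=n)

# Thin QR: X = Q R with Q (n x p1) orthonormal and R (p1 x p1) upper triangular
Q, R = np.linalg.qr(X)
# Solve R beta = Q'y (back-substitution; np.linalg.solve handles the triangular system)
beta_qr = np.linalg.solve(R, Q.T @ y)

# Same answer as the textbook normal-equations formula, but better conditioned
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)
assert np.allclose(beta_qr, beta_normal)
```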

Historical Note: Origins of Least Squares

The least squares method has a remarkable history:

  • Legendre (1805): First to publish the method of least squares.
  • Gauss (claimed use from 1795, published 1809): Applied the method in 1801 to predict the orbit of the asteroid Ceres from limited observations. His predictions proved accurate when Ceres was rediscovered. He later proved the optimality result (1821–1823).
  • Markov (1900): Restated and clarified that optimality result, which we now call the Gauss-Markov theorem.
  • David (1988): Popularized the “BLUE” acronym (Best Linear Unbiased Estimator) that we use today.

The fact that a method developed for celestial mechanics remains foundational 230 years later speaks to its mathematical elegance.

Residuals and Fitted Values

Fitted values: \[\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Residuals: \[\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\]

Key Property

The residuals are orthogonal to the columns of \(\mathbf{X}\):

\[\mathbf{X}^T\mathbf{e} = \mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\]

This follows directly from the normal equations!
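The orthogonality property is easy to verify numerically (a NumPy sketch on made-up data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.normal(size=30)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat              # residuals

# Normal equations  =>  X'e = 0 (up to floating-point error)
assert np.allclose(X.T @ e, 0.0, atol=1e-10)
# In particular, residuals sum to zero whenever X contains an intercept column
assert abs(e.sum()) < 1e-10
```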

Geometry of Least Squares

The Key Insight

Geometric View

Least squares finds the point in Col(\(\mathbf{X}\)) closest to \(\mathbf{y}\).

This is orthogonal projection.

  • \(\mathbf{y} \in \mathbb{R}^n\) is our observed response
  • Col(\(\mathbf{X}\)) is a \((p+1)\)-dimensional subspace of \(\mathbb{R}^n\)
  • \(\hat{\mathbf{y}}\) is the projection of \(\mathbf{y}\) onto Col(\(\mathbf{X}\))
  • \(\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\) is perpendicular to Col(\(\mathbf{X}\))

Visualizing the Projection (p = 2)

Interactive 3D View: Explore the Projection

Rotate this plot to see how \(\hat{\mathbf{y}}\) is the closest point in Col(\(\mathbf{X}\)) to \(\mathbf{y}\).

What to Look For

  • Rotate to see that \(\mathbf{e}\) is truly perpendicular to the green plane
  • The right angle marker at \(\hat{\mathbf{y}}\) confirms orthogonality
  • \(\hat{\mathbf{y}}\) is the unique point in Col(\(\mathbf{X}\)) closest to \(\mathbf{y}\)

The Hat Matrix

Definition: Hat Matrix

\[\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]

This matrix “puts the hat on \(\mathbf{y}\)”: \[\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\]

\(\mathbf{H}\) is the projection matrix onto Col(\(\mathbf{X}\)).

Properties of the Hat Matrix

Algebraic Properties:

  1. Symmetric: \(\mathbf{H}^T = \mathbf{H}\)

  2. Idempotent: \(\mathbf{H}^2 = \mathbf{H}\)

  3. Eigenvalues: All 0 or 1

  4. Trace: \(\text{tr}(\mathbf{H}) = p + 1\)

Geometric Meaning:

  • \(\mathbf{H}\mathbf{y}\) = projection of \(\mathbf{y}\)
  • \((\mathbf{I} - \mathbf{H})\mathbf{y}\) = residual
  • Applying twice = once (idempotent)
  • Rank = dimension of Col(\(\mathbf{X}\))

Why “Hat” Matrix?

It’s called the hat matrix because it puts the “hat” (\(\hat{}\)) on \(\mathbf{y}\)!

Proof: \(\mathbf{H}\) is Idempotent

\[\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \cdot \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]

\[= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}[\mathbf{X}^T\mathbf{X}](\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]

\[= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \mathbf{H}\]

Geometric Intuition

Projecting twice onto the same subspace gives the same result as projecting once.
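All four algebraic properties can be confirmed on a small simulated design (NumPy sketch; forming \(\mathbf{H}\) explicitly is fine at this toy scale, though one never would for large \(n\)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])

# Hat matrix H = X (X'X)^{-1} X', built via a linear solve
H = X @ np.linalg.solve(X.T @ X, X.T)

assert np.allclose(H, H.T)                    # symmetric
assert np.allclose(H @ H, H)                  # idempotent: projecting twice = once
assert np.isclose(np.trace(H), p + 1)         # trace = p + 1
eig = np.linalg.eigvalsh(H)
assert np.all((np.abs(eig) < 1e-8) | (np.abs(eig - 1) < 1e-8))  # eigenvalues 0 or 1
```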

Residuals Live in the Orthogonal Complement

The residual maker matrix: \[\mathbf{M} = \mathbf{I} - \mathbf{H}\]

Properties:

  • \(\mathbf{M}\) is also symmetric and idempotent
  • \(\mathbf{M}\) projects onto the orthogonal complement of Col(\(\mathbf{X}\))
  • \(\mathbf{e} = \mathbf{M}\mathbf{y}\)

Key: \(\mathbf{H}\mathbf{M} = \mathbf{0}\) — fitted values and residuals are orthogonal!

Leverage: Diagonal of \(\mathbf{H}\)

The \(i\)th diagonal element of \(\mathbf{H}\):

\[h_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\]

This is the leverage of observation \(i\).

Interpretation of Leverage

  • \(h_{ii}\) measures how much \(y_i\) influences \(\hat{y}_i\)
  • \(\hat{y}_i = h_{ii} y_i + \sum_{j \neq i} h_{ij} y_j\)
  • High leverage = unusual \(\mathbf{x}_i\) values
  • When model includes intercept: \(1/n \leq h_{ii} \leq 1\) and \(\sum_i h_{ii} = p + 1\)
  • Without intercept: \(0 \leq h_{ii} \leq 1\)
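A NumPy sketch illustrating these facts: one deliberately unusual predictor value is planted, and it shows up as the largest leverage (data invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
X[0, 1] = 8.0                     # one extreme predictor value -> high leverage

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                    # leverages h_ii

# Bounds and sum (model includes an intercept)
assert np.all(h >= 1.0 / n - 1e-12) and np.all(h <= 1.0 + 1e-12)
assert np.isclose(h.sum(), p + 1)

# The observation with the unusual x-value has the largest leverage
assert h.argmax() == 0
```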

Gauss-Markov Theorem

What Makes OLS Special?

Gauss-Markov Theorem

Among all linear unbiased estimators of \(\boldsymbol{\beta}\), the OLS estimator has the smallest variance.

OLS is BLUE: Best Linear Unbiased Estimator

Assumptions needed:

  1. \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\) (errors have mean zero)
  2. \(\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}\) (homoskedasticity, uncorrelated errors)
  3. \(\mathbf{X}\) is fixed (non-random), OR if \(\mathbf{X}\) is random, assumptions 1-2 hold conditionally on \(\mathbf{X}\)

What About Full Rank?

Common simplification: Assume \(\text{rank}(\mathbf{X}) = p + 1\) so \(\mathbf{X}^T\mathbf{X}\) is invertible.

But the theorem is more general! See next slide.

The General Gauss-Markov Theorem

When \(\mathbf{X}\) is rank-deficient, \(\boldsymbol{\beta}\) itself isn’t uniquely identifiable. But we can still do inference!

Estimable Functions

A linear function \(\mathbf{c}^T\boldsymbol{\beta}\) is estimable if \(\mathbf{c}\) lies in the row space of \(\mathbf{X}\).

Equivalently: \(\mathbf{c}^T\boldsymbol{\beta}\) is estimable iff it has the same value for all \(\boldsymbol{\beta}\) satisfying \(\mathbf{X}\boldsymbol{\beta} = E[\mathbf{y}]\).

General Gauss-Markov: For any estimable \(\mathbf{c}^T\boldsymbol{\beta}\), the OLS estimator

\[\mathbf{c}^T\hat{\boldsymbol{\beta}} = \mathbf{c}^T(\mathbf{X}^T\mathbf{X})^{-}\mathbf{X}^T\mathbf{y}\]

is BLUE, where \((\mathbf{X}^T\mathbf{X})^{-}\) is any generalized inverse.

Key insight: The choice of g-inverse doesn’t matter for estimable functions — they all give the same answer!

Why This Matters

Full rank (common case):

  • All functions of \(\boldsymbol{\beta}\) are estimable
  • \(\boldsymbol{\beta}\) uniquely identified
  • Use ordinary inverse \((\mathbf{X}^T\mathbf{X})^{-1}\)

Rank deficient:

  • Only estimable functions have BLUEs
  • \(\boldsymbol{\beta}\) not unique, but \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) is!
  • Use any g-inverse (Moore-Penrose, reflexive, etc.)

Estimable Functions in Practice

Intuition: What Makes Something Estimable?

A function of \(\boldsymbol{\beta}\) is estimable if you get the same answer no matter which valid \(\boldsymbol{\beta}\) parameterization you use.

Think of it this way: if there are multiple \(\boldsymbol{\beta}\) vectors that all produce the same fitted values \(\hat{\mathbf{y}}\), then any quantity that differs across these \(\boldsymbol{\beta}\)’s is not estimable—the data cannot distinguish between them. Only quantities that are the same for all valid \(\boldsymbol{\beta}\)’s can be estimated from the data.

Often, what we actually care about is estimable — even when \(\boldsymbol{\beta}\) isn’t identified!

Example: One-way ANOVA with the overparameterized effects coding

\[y_{ij} = \mu + \alpha_i + \varepsilon_{ij}, \quad i = 1, \ldots, k; \; j = 1, \ldots, n_i\]

The design matrix \(\mathbf{X}\) has \(k + 1\) columns but rank \(k\) (the intercept column equals the sum of the group indicators), so \((\mu, \alpha_1, \ldots, \alpha_k)\) is not identified. What’s estimable?

| Quantity | Estimable? | Why it matters |
|---|---|---|
| \(\mu + \alpha_1\) | ✓ Yes | Group 1 mean |
| \(\alpha_1 - \alpha_2\) | ✓ Yes | Treatment contrast |
| \(\mu + \frac{1}{k}\sum_i \alpha_i\) | ✓ Yes | Overall mean |
| \(\mu\) or \(\alpha_1\) alone | ✗ No | Confounded with each other |
| \(\hat{y}\) for any \(\mathbf{x}\) in data | ✓ Yes | Predictions always work |

Example: Regression with perfect collinearity

If \(X_3 = X_1 + X_2\) exactly, then \(\beta_1, \beta_2, \beta_3\) aren’t individually identifiable.

But \(\beta_1 + \beta_3\) and \(\beta_2 + \beta_3\) are estimable — and often that’s what we need!
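This can be demonstrated numerically. The sketch below (NumPy; data invented) builds two different generalized inverses of the singular \(\mathbf{X}^T\mathbf{X}\), using the fact that \(\mathbf{A}^- = \mathbf{A}^+ + (\mathbf{I} - \mathbf{A}^+\mathbf{A})\mathbf{W}\) is a g-inverse for any \(\mathbf{W}\). The two resulting \(\hat{\boldsymbol{\beta}}\)'s differ, but every estimable quantity agrees:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                        # perfect collinearity: rank(X) = 3, not 4
X = np.column_stack([np.ones(n), x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

A = X.T @ X                         # singular
Aplus = np.linalg.pinv(A)           # Moore-Penrose g-inverse
W = rng.normal(size=A.shape)
Aother = Aplus + (np.eye(4) - Aplus @ A) @ W   # another valid g-inverse
assert np.allclose(A @ Aother @ A, A)          # g-inverse property A A^- A = A

b1 = Aplus @ X.T @ y                # two different "solutions" for beta
b2 = Aother @ X.T @ y
assert not np.allclose(b1, b2)      # beta itself is not unique...

# ...but estimable functions agree: fitted values, and beta1 + beta3
assert np.allclose(X @ b1, X @ b2)
c = np.array([0.0, 1.0, 0.0, 1.0])  # c'beta = beta1 + beta3, c in the row space of X
assert np.isclose(c @ b1, c @ b2)
```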

What G-Inverses Do NOT Solve

Critical Distinction

G-inverses handle rank deficiency from linear dependencies.

They do NOT solve the \(p \gg n\) problem!

Two very different situations:

Rank deficiency (g-inverses help)

  • \(n = 100\), \(p = 10\), but 2 columns linearly dependent
  • \(\text{rank}(\mathbf{X}) = 9 < p\)
  • Problem: Some \(\beta_j\)’s not identified
  • Solution: Focus on estimable functions
  • Inference is valid for what’s estimable

High-dimensional (\(p > n\), g-inverses don’t help)

  • \(n = 100\), \(p = 1000\)
  • \(\text{rank}(\mathbf{X}) \leq n < p\)
  • Problem: Massive overfitting, infinite solutions
  • G-inverse gives a solution, but terrible prediction
  • Need: Regularization (Week 3!)

Why the Difference?

With \(p > n\): even with g-inverse, \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) interpolates the training data perfectly.

Zero training error, catastrophic test error. This is the overfitting problem — no amount of linear algebra solves it. We need to constrain the solution space.
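A NumPy sketch of the failure mode (simulated data; the minimum-norm solution from the pseudoinverse stands in for "what a g-inverse gives you"):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 100                     # far more predictors than observations
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 1.0]   # only 3 real signals; noise sd = 1
y = X @ beta_true + rng.normal(size=n)

# Minimum-norm least-squares solution via the Moore-Penrose pseudoinverse
beta_hat = np.linalg.pinv(X) @ y

train_rss = np.sum((y - X @ beta_hat) ** 2)
assert train_rss < 1e-16           # interpolates: essentially zero training error

# Fresh test data from the same model: prediction is poor
X_new = rng.normal(size=(1000, p))
y_new = X_new @ beta_true + rng.normal(size=1000)
test_mse = np.mean((y_new - X_new @ beta_hat) ** 2)
assert test_mse > 2.0              # well above the irreducible sigma^2 = 1
```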

Proof Sketch: Setting Up

Consider any other linear estimator: \[\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\]

for some matrix \(\mathbf{C} \in \mathbb{R}^{(p+1) \times n}\).

For unbiasedness: \[E[\tilde{\boldsymbol{\beta}}] = \mathbf{C}E[\mathbf{y}] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}\]

This requires \(\mathbf{C}\mathbf{X} = \mathbf{I}\).

Proof Sketch: Comparing Variances

Write \(\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\) for some matrix \(\mathbf{D}\).

The unbiasedness condition \(\mathbf{C}\mathbf{X} = \mathbf{I}\) implies: \[\mathbf{D}\mathbf{X} = \mathbf{0}\]

Variance of \(\tilde{\boldsymbol{\beta}}\): \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2 \mathbf{C}\mathbf{C}^T\]

Proof Sketch: The Key Step

\[\mathbf{C}\mathbf{C}^T = [(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}][(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}]^T\]

\[= (\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T + \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T}_{= 0} + \underbrace{\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}}_{= 0}\]

Since \(\mathbf{D}\mathbf{X} = \mathbf{0}\): \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T]\]

Since \(\mathbf{D}\mathbf{D}^T\) is positive semi-definite, \(\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}})\).
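The proof's key identities can be checked numerically: build an arbitrary \(\mathbf{D}\) with \(\mathbf{D}\mathbf{X} = \mathbf{0}\) by projecting a random matrix off Col(\(\mathbf{X}\)), then confirm the excess variance is exactly \(\mathbf{D}\mathbf{D}^T\) and is positive semi-definite (NumPy sketch, simulated design):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p1 = 30, 3                       # p1 = p + 1 columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, p1 - 1))])
XtX_inv = np.linalg.inv(X.T @ X)    # explicit inverse is fine for a 3x3 illustration
C_ols = XtX_inv @ X.T

# Any other linear unbiased estimator: C = C_ols + D with D X = 0.
# Build such a D by projecting a random matrix off Col(X).
H = X @ XtX_inv @ X.T
D = rng.normal(size=(p1, n)) @ (np.eye(n) - H)
assert np.allclose(D @ X, 0.0, atol=1e-10)

C = C_ols + D
assert np.allclose(C @ X, np.eye(p1))        # still unbiased

# Var(beta_tilde) = sigma^2 C C' = sigma^2 [(X'X)^{-1} + D D']
excess = C @ C.T - XtX_inv
assert np.allclose(excess, D @ D.T)          # cross terms vanish, as in the proof
assert np.all(np.linalg.eigvalsh((excess + excess.T) / 2) >= -1e-10)  # PSD
```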

What Gauss-Markov Does NOT Say

Limitations

  1. Only compares linear estimators — nonlinear estimators might be better

  2. Only considers unbiased estimators — biased estimators might have lower MSE

  3. Doesn’t guarantee small variance — just smallest among linear unbiased

  4. Assumes homoskedasticity — if violated, OLS is not BLUE

This motivates regularization! (Ridge, Lasso in Week 3)

Preview: Ridge Regression

Ridge regression adds a penalty to the OLS objective, yielding:

\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

where \(\lambda \geq 0\) is a tuning parameter.

Key insight: Compare to OLS: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

  • When \(\lambda = 0\): Ridge = OLS (unbiased, potentially high variance)
  • As \(\lambda \to \infty\): \(\hat{\boldsymbol{\beta}}_{\text{ridge}} \to \mathbf{0}\) (high bias, low variance)

The parameter \(\lambda\) gives us explicit control over the bias-variance tradeoff. We’ll develop this fully in Week 3.
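The two limiting behaviors are easy to see numerically (NumPy sketch on simulated data; a real analysis would standardize predictors and leave the intercept unpenalized, details we defer to Week 3):

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(size=n)

def ridge(X, y, lam):
    """Ridge solution (X'X + lam I)^{-1} X'y via a linear solve."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# lambda = 0 recovers OLS exactly
assert np.allclose(ridge(X, y, 0.0), beta_ols)

# Coefficient norm shrinks monotonically toward zero as lambda grows
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in [0.0, 1.0, 10.0, 1e6]]
assert all(a >= b for a, b in zip(norms, norms[1:]))
assert norms[-1] < 1e-3 * norms[0]
```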

Connecting to Bias-Variance

Recall from Week 1

\[\text{EPE}(\hat{f}) = \underbrace{\sigma^2}_{\text{irreducible}} + \underbrace{[\text{Bias}(\hat{f})]^2}_{\text{model wrong?}} + \underbrace{\text{Var}(\hat{f})}_{\text{estimation noise}}\]

For OLS under the correct linear model:

  • Bias = 0 (OLS is unbiased)
  • Gauss-Markov says: variance is minimized among linear unbiased estimators

But what if we’re willing to accept some bias?

  • A biased estimator with much lower variance might have lower total MSE
  • This is exactly the tradeoff Ridge and Lasso exploit
  • James & Stein (1961): For \(p \geq 3\), OLS is inadmissible—there exist estimators that dominate it in total squared error!

Bias-Variance at a Test Point

For a new observation \(\mathbf{x}_0\), the OLS prediction is \(\hat{y}_0 = \mathbf{x}_0^T \hat{\boldsymbol{\beta}}\).

Prediction Variance

\[\text{Var}(\hat{y}_0) = \sigma^2 \mathbf{x}_0^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_0\]

Under correct model specification:

  • Bias: \(E[\hat{y}_0] - E[y_0] = 0\) (OLS is unbiased)
  • Variance: Depends on \(\mathbf{x}_0\) and the design matrix geometry
  • MSE: \(\text{MSE}(\hat{y}_0) = \sigma^2 \mathbf{x}_0^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_0\)

Connection to Week 1

Recall the bias-variance decomposition: \(\text{EPE} = \sigma^2 + \text{Bias}^2 + \text{Variance}\).

For OLS under correct specification, Bias = 0, so prediction error is driven entirely by variance and irreducible noise. Regularization (Week 3) trades bias for reduced variance.

Distributional Properties

Under Normality Assumptions

Add a stronger assumption:

\[\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})\]

Then we get exact finite-sample distributions:

Distribution of \(\hat{\boldsymbol{\beta}}\)

\[\hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})\]

The covariance matrix \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) tells us about precision of each estimate.

Estimating \(\sigma^2\)

We don’t know \(\sigma^2\), so we estimate it:

\[\hat{\sigma}^2 = \frac{\text{RSS}}{n - p - 1} = \frac{\mathbf{e}^T\mathbf{e}}{n - p - 1} = \frac{\sum_{i=1}^n e_i^2}{n - p - 1}\]

Degrees of Freedom

Why \(n - p - 1\)?

\[E[\mathbf{e}^T\mathbf{e}] = E[\mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}] = \sigma^2 \text{tr}(\mathbf{I} - \mathbf{H}) = \sigma^2(n - p - 1)\]

(The mean term drops out because \((\mathbf{I} - \mathbf{H})\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\): the trend lies in Col(\(\mathbf{X}\)), which \(\mathbf{I} - \mathbf{H}\) annihilates.)

We “use up” \(p + 1\) degrees of freedom estimating \(\boldsymbol{\beta}\).

Intuition: Why Residuals Have Fewer Degrees of Freedom

The residuals \(\mathbf{e}\) are not free to vary in all \(n\) dimensions—they are constrained by the normal equations \(\mathbf{X}^T\mathbf{e} = \mathbf{0}\).

This gives \(p + 1\) linear constraints (one for each column of \(\mathbf{X}\), including the intercept). So while we have \(n\) residuals, they only have \(n - (p+1)\) degrees of freedom to move around.

Equivalently: the residuals live in the \((n - p - 1)\)-dimensional orthogonal complement of Col(\(\mathbf{X}\)).
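Both the trace identity and the unbiasedness of \(\hat{\sigma}^2\) can be verified numerically (NumPy sketch; the Monte Carlo check uses simulated errors with known \(\sigma = 2\)):

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
H = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - H                                # residual maker

assert np.isclose(np.trace(M), n - p - 1)        # tr(I - H) = n - (p + 1)
assert np.linalg.matrix_rank(M) == n - p - 1     # residuals live in an (n-p-1)-dim space

# Monte Carlo check that E[RSS / (n - p - 1)] = sigma^2
sigma = 2.0
reps = 2000
rss = np.empty(reps)
for r in range(reps):
    y = X @ np.ones(p + 1) + sigma * rng.normal(size=n)
    e = M @ y
    rss[r] = e @ e
sigma2_hat = rss.mean() / (n - p - 1)
assert abs(sigma2_hat - sigma**2) < 0.2          # close to 4 on average
```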

Independence of \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\)

Key Result

Under normality, \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are independent.

Furthermore:

\[\frac{(n - p - 1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p-1}\]

These results enable exact inference (t-tests, F-tests).

Standard Errors

The estimated standard error of \(\hat{\beta}_j\):

\[\text{SE}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}\]

This measures our uncertainty about \(\hat{\beta}_j\).

When Homoskedasticity Fails: Robust Standard Errors

What if \(\text{Var}(\varepsilon_i) = \sigma_i^2\) varies across observations?

Under Heteroskedasticity

  • OLS \(\hat{\boldsymbol{\beta}}\) is still unbiased and consistent
  • But the usual standard errors are wrong (often too small)
  • t-tests and confidence intervals become unreliable

Heteroskedasticity-Robust (Sandwich) Standard Errors

White (1980) proposed the “sandwich” estimator:

\[\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})_{\text{robust}} = (\mathbf{X}^T\mathbf{X})^{-1}\left(\sum_{i=1}^n e_i^2 \mathbf{x}_i\mathbf{x}_i^T\right)(\mathbf{X}^T\mathbf{X})^{-1}\]

where \(e_i = y_i - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}\) are the residuals.

In R: Use sandwich::vcovHC() with lmtest::coeftest():

library(sandwich); library(lmtest)
fit <- lm(y ~ x1 + x2, data = mydata)
coeftest(fit, vcov = vcovHC(fit, type = "HC3"))

The “HC3” variant is recommended for small samples (MacKinnon & White, 1985).

Beyond Independence: Generalized Least Squares

When errors are correlated (e.g., time series, spatial data, clustered observations):

\[\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \boldsymbol{\Omega} \neq \sigma^2 \mathbf{I}\]

The Generalized Least Squares (GLS) estimator accounts for this structure:

\[\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^T\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Omega}^{-1}\mathbf{y}\]

Key properties: GLS is BLUE when \(\boldsymbol{\Omega}\) is known. In practice, \(\boldsymbol{\Omega}\) must be estimated (Feasible GLS). Common applications include time series (AR errors), panel data, and spatial models.

In R: nlme::gls() or specify correlation structures in mixed models.
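A NumPy sketch of GLS with a known AR(1)-style \(\boldsymbol{\Omega}\) (simulated data; real applications must estimate \(\boldsymbol{\Omega}\)). It also verifies the classical equivalence: GLS equals OLS run on data "whitened" by the inverse Cholesky factor of \(\boldsymbol{\Omega}\):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# AR(1)-style correlation structure, assumed KNOWN here for illustration
rho = 0.7
idx = np.arange(n)
Omega = rho ** np.abs(idx[:, None] - idx[None, :])

# Generate correlated errors via the Cholesky factor of Omega
L = np.linalg.cholesky(Omega)
y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=n)

# Direct GLS formula
Oinv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# Equivalent view: whiten with L^{-1}, then run ordinary least squares
Xw = np.linalg.solve(L, X)
yw = np.linalg.solve(L, y)
beta_whitened = np.linalg.lstsq(Xw, yw, rcond=None)[0]
assert np.allclose(beta_gls, beta_whitened)
```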

Inference

Testing Individual Coefficients

To test \(H_0: \beta_j = 0\):

\[t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \sim t_{n-p-1} \text{ under } H_0\]

Interpretation: Is predictor \(j\) useful after accounting for all other predictors?

Caution: Multicollinearity Trap

A non-significant \(t\)-test doesn’t mean the predictor is unimportant—it could be collinear with others.

Example: Suppose \(X_1\) (father’s height) and \(X_2\) (mother’s height) both predict \(Y\) (child’s height), with \(\text{Cor}(X_1, X_2) = 0.6\).

  • Individual \(t\)-tests: \(p = 0.12\) for \(X_1\), \(p = 0.15\) for \(X_2\) (both non-significant)
  • Joint \(F\)-test for \(H_0: \beta_1 = \beta_2 = 0\): \(p < 0.001\) (highly significant!)

Why? Each predictor’s \(t\)-test asks “does this help given the other?” With correlated predictors, neither adds much unique information beyond the other—but together they explain substantial variance.

Rule: Always examine joint \(F\)-tests alongside individual \(t\)-tests when predictors are correlated.
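The trap can be reproduced with simulated data. The sketch below (NumPy; an invented example with even stronger collinearity than the heights illustration above) computes the t statistics and the joint F statistic by hand; each slope has a huge standard error, yet the F statistic is far beyond the roughly 3.35 critical value of \(F_{2,27}\) at the 5% level:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)           # Cor(x1, x2) ~ 0.99: severe collinearity
y = 1.0 * x1 + 1.0 * x2 + 1.5 * rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
df = n - 3
sigma2_hat = e @ e / df
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se                           # individual t-tests

# Joint F-test of H0: beta1 = beta2 = 0 (reduced model = intercept only)
rss_full = e @ e
rss_reduced = np.sum((y - y.mean()) ** 2)
F = ((rss_reduced - rss_full) / 2) / (rss_full / df)

# Each slope has a large standard error (neither looks significant on its own)...
assert se[1] > 1.0 and se[2] > 1.0
# ...yet jointly the two predictors clearly matter
assert F > 5.0
```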

Confidence Intervals

A \((1-\alpha)\) confidence interval for \(\beta_j\):

\[\hat{\beta}_j \pm t_{n-p-1, \alpha/2} \cdot \text{SE}(\hat{\beta}_j)\]

F-Test for Nested Models

To compare a reduced model to a full model:

\[F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}}) / q}{\text{RSS}_{\text{full}} / (n - p - 1)} \sim F_{q, n-p-1}\]

where \(q\) is the number of additional parameters in the full model.

Use case: Testing whether a group of predictors improves the model.

Overall F-Test

Testing \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (all coefficients zero):

\[F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}\]

where \(R^2 = 1 - \text{RSS}/\text{TSS}\) is the coefficient of determination.

\(R^2\) and Adjusted \(R^2\)

\(R^2\): Proportion of variance explained \[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i(y_i - \hat{y}_i)^2}{\sum_i(y_i - \bar{y})^2}\]

Problem: \(R^2\) always increases when adding predictors.

Adjusted \(R^2\): Penalizes for number of predictors \[R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n - 1)}\]
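A NumPy sketch of the contrast (simulated data): adding five pure-noise predictors mechanically raises \(R^2\), while the adjustment tempers the apparent gain.

```python
import numpy as np

rng = np.random.default_rng(12)
n = 50
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

def r2_and_adj(X, y):
    """Return (R^2, adjusted R^2); X must include the intercept column."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    p = X.shape[1] - 1                      # predictors excluding the intercept
    return 1 - rss / tss, 1 - (rss / (n - p - 1)) / (tss / (n - 1))

X1 = np.column_stack([np.ones(n), x])
r2_1, adj_1 = r2_and_adj(X1, y)

# Add 5 pure-noise predictors
X2 = np.column_stack([X1, rng.normal(size=(n, 5))])
r2_2, adj_2 = r2_and_adj(X2, y)

assert r2_2 >= r2_1 - 1e-10                 # R^2 never decreases with more columns
assert (adj_2 - adj_1) < (r2_2 - r2_1)      # adjusted R^2 penalizes the extra columns
```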

Model Diagnostics

Why Diagnostics Matter

OLS assumes:

  1. Linear relationship
  2. Constant variance (homoskedasticity)
  3. Independence of errors
  4. Normal errors (for inference)
  5. No extreme outliers

Diagnostics help us check these assumptions.

Residual Plots

Q-Q Plot: Checking Normality

Leverage and Influence

Diagnostic Summary

| Issue | Detection | Consequence | Remedy |
|---|---|---|---|
| Nonlinearity | Residual plots | Biased estimates | Transform predictors; add polynomial terms; use nonlinear model |
| Heteroskedasticity | Residual vs fitted | Invalid inference | Weighted LS; robust SEs; transform response |
| Non-normality | Q-Q plot | Invalid inference | Bootstrap; transform response; robust methods |
| High leverage | Hat values \(h_{ii}\) | Sensitive to outliers | Investigate data entry; consider robust regression |
| Influential points | Cook’s distance | Model driven by few points | Sensitivity analysis; robust regression; report with/without |
| Multicollinearity | VIF | Large standard errors | Remove/combine predictors; ridge regression; collect more data |

Code Demo: Full Diagnostic Check
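The live demo itself is not reproduced in these notes; below is a NumPy sketch of the same checks on simulated data (leverage, internally studentized residuals, Cook’s distance, and a VIF). In practice, R’s plot(lm_fit) and car::vif() do all of this in two lines.

```python
import numpy as np

rng = np.random.default_rng(13)
n = 40
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + 0.8 * rng.normal(size=n)      # moderately correlated predictors
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
p1 = X.shape[1]                                # p + 1

beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)                                 # leverage
s2 = e @ e / (n - p1)                          # sigma^2 estimate

# Internally studentized residuals and Cook's distance
r = e / np.sqrt(s2 * (1 - h))
cooks_d = r**2 * h / ((1 - h) * p1)

# Variance inflation factor for x1: regress x1 on the other predictor(s)
Z = np.column_stack([np.ones(n), x2])
g = np.linalg.lstsq(Z, x1, rcond=None)[0]
r2_aux = 1 - np.sum((x1 - Z @ g) ** 2) / np.sum((x1 - x1.mean()) ** 2)
vif_x1 = 1 / (1 - r2_aux)

# Sanity checks on the diagnostics
assert np.isclose(h.sum(), p1) and np.all(cooks_d >= 0)
assert vif_x1 >= 1.0
```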

Summary

What We Covered

  1. Linear model setup: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)

  2. OLS estimator: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)

  3. Geometry: Least squares = orthogonal projection onto Col(\(\mathbf{X}\))

  4. Hat matrix: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) projects \(\mathbf{y}\) to \(\hat{\mathbf{y}}\)

  5. Gauss-Markov: OLS is BLUE (among linear unbiased estimators)

  6. Inference: t-tests, F-tests, confidence intervals under normality

  7. Diagnostics: Residual plots, Q-Q plots, leverage, influence

Key Insights

Remember These

  • Geometry over algebra: Thinking about projections gives deeper understanding

  • Gauss-Markov limits: Only optimal among linear unbiased — biased estimators may be better!

  • Diagnostics matter: Conclusions depend on assumptions being (approximately) satisfied

Looking Ahead

We built the foundation this week. Now we break some rules.

What we established:

  • OLS has beautiful geometry
  • Gauss-Markov: BLUE among linear unbiased
  • Clear inference under normality

What we hinted at:

  • Unbiasedness isn’t everything
  • Stein’s paradox: OLS is inadmissible for \(p \geq 3\)
  • Bias-variance tradeoff applies here too

Next week: Ridge, Lasso, Elastic Net — deliberately biased estimators that often outperform OLS.