Stat 577 — Statistical Learning Theory
Spring 2026
Linear models are the workhorse of statistical learning.
Reasons to understand them deeply:
This week: Understand linear regression geometrically, not just algebraically.
What do we actually want from a linear model?
1. Inference
Estimate \(\boldsymbol{\beta}\) and quantify uncertainty
“Which predictors matter?”
2. Prediction
Given new \(\mathbf{x}\), predict \(y\)
“What will happen?”
3. Interpretation
Understand relationships
“How does \(X_j\) affect \(Y\)?”
Different Goals, Different Priorities
We assume:
\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon\]
Or in vector notation for a single observation:
\[Y = \beta_0 + \boldsymbol{\beta}^T \mathbf{x} + \varepsilon\]
where:
The simplest case:
\[Y = \beta_0 + \beta_1 X + \varepsilon\]
Goal: Find \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the residual sum of squares (RSS):
\[\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i)^2\]
Take partial derivatives and set to zero:
\[\frac{\partial \text{RSS}}{\partial \beta_0} = -2\sum_{i=1}^n (y_i - \beta_0 - \beta_1 x_i) = 0\]
\[\frac{\partial \text{RSS}}{\partial \beta_1} = -2\sum_{i=1}^n x_i(y_i - \beta_0 - \beta_1 x_i) = 0\]
From the first equation: \[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
Simple Linear Regression Estimates
\[\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{S_{xy}}{S_{xx}}\]
\[\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}\]
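These closed-form estimates are easy to verify by hand; a NumPy sketch with made-up numbers, cross-checked against NumPy's own degree-1 least-squares fit:

```python
import numpy as np

# Tiny illustrative dataset (numbers are made up)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

xbar, ybar = x.mean(), y.mean()
Sxy = np.sum((x - xbar) * (y - ybar))
Sxx = np.sum((x - xbar) ** 2)

beta1 = Sxy / Sxx            # slope: S_xy / S_xx
beta0 = ybar - beta1 * xbar  # intercept: ybar - beta1 * xbar

# Cross-check against NumPy's polynomial least-squares fit
b0_np, b1_np = np.polynomial.polynomial.polyfit(x, y, 1)
assert np.allclose([beta0, beta1], [b0_np, b1_np])
```

For these numbers, \(\hat{\beta}_1 = 19.9/10 = 1.99\) and \(\hat{\beta}_0 = 6.02 - 1.99 \cdot 3 = 0.05\).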
Interpretation: \(\hat{\beta}_1\) is the estimated change in \(Y\) per one-unit increase in \(X\); \(\hat{\beta}_0\) is the estimated mean of \(Y\) at \(X = 0\). The fitted line always passes through \((\bar{x}, \bar{y})\).
With \(n\) observations and \(p\) predictors, write as:
\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
where:
\[\mathbf{X} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}\]
The first column of 1s accounts for the intercept \(\beta_0\).
Row \(i\) of \(\mathbf{X}\) contains the predictors for observation \(i\).
Minimize RSS:
\[\text{RSS}(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]
Take the gradient:
\[\nabla_{\boldsymbol{\beta}} \text{RSS} = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})\]
Setting to zero gives the normal equations:
\[\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}\]
Ordinary Least Squares (OLS) Estimator
If \(\mathbf{X}^T\mathbf{X}\) is invertible:
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
When is \(\mathbf{X}^T\mathbf{X}\) invertible? Exactly when \(\mathbf{X}\) has full column rank \(p + 1\): no column is a linear combination of the others, which in particular requires \(n \geq p + 1\).
Numerical Stability in Practice
In practice, we never compute \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly—this is numerically unstable!
Instead, lm() in R uses the QR decomposition: \(\mathbf{X} = \mathbf{Q}\mathbf{R}\), where \(\mathbf{Q}\) has orthonormal columns and \(\mathbf{R}\) is upper triangular. Then:
\[\hat{\boldsymbol{\beta}} = \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y}\]
This avoids computing \(\mathbf{X}^T\mathbf{X}\) (which squares the condition number) and solves via back-substitution. The formula \((\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\) is for theory; QR is for computation.
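The two routes are easy to compare numerically; a NumPy sketch on simulated data (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column of 1s
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Textbook route: solve the normal equations X'X b = X'y (fine for theory)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR route, as lm() effectively does: X = QR, then solve R b = Q'y
Q, R = np.linalg.qr(X)                 # reduced QR: Q is n x (p+1)
beta_qr = np.linalg.solve(R, Q.T @ y)  # R is triangular: back-substitution

assert np.allclose(beta_normal, beta_qr)
```

On well-conditioned data the two agree; the QR route's advantage shows up when \(\mathbf{X}\) is nearly collinear, since it never forms \(\mathbf{X}^T\mathbf{X}\).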
Historical Note: Origins of Least Squares
The least squares method has a remarkable history: Legendre published it in 1805, and Gauss claimed to have been using it since around 1795, most famously to predict the orbit of the asteroid Ceres from a handful of observations.
The fact that a method developed for celestial mechanics remains foundational roughly 230 years later speaks to its mathematical elegance.
Fitted values: \[\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
Residuals: \[\mathbf{e} = \mathbf{y} - \hat{\mathbf{y}}\]
Key Property
The residuals are orthogonal to the columns of \(\mathbf{X}\):
\[\mathbf{X}^T\mathbf{e} = \mathbf{X}^T(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}\]
This follows directly from the normal equations!
Geometric View
Least squares finds the point in Col(\(\mathbf{X}\)) closest to \(\mathbf{y}\).
This is orthogonal projection.
Definition: Hat Matrix
\[\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]
This matrix “puts the hat on \(\mathbf{y}\)”: \[\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\]
\(\mathbf{H}\) is the projection matrix onto Col(\(\mathbf{X}\)).
Algebraic Properties:
Symmetric: \(\mathbf{H}^T = \mathbf{H}\)
Idempotent: \(\mathbf{H}^2 = \mathbf{H}\)
Eigenvalues: All 0 or 1
Trace: \(\text{tr}(\mathbf{H}) = p + 1\)
Geometric Meaning: \(\mathbf{H}\) orthogonally projects any vector onto Col(\(\mathbf{X}\)); its eigenvalue-1 eigenspace is Col(\(\mathbf{X}\)) and its eigenvalue-0 eigenspace is the orthogonal complement.
Why “Hat” Matrix?
It’s called the hat matrix because it puts the “hat” (\(\hat{}\)) on \(\mathbf{y}\)!
\[\mathbf{H}^2 = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \cdot \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]
\[= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}[\mathbf{X}^T\mathbf{X}](\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]
\[= \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T = \mathbf{H}\]
Geometric Intuition
Projecting twice onto the same subspace gives the same result as projecting once.
The residual maker matrix: \[\mathbf{M} = \mathbf{I} - \mathbf{H}\]
Properties: \(\mathbf{M}\) is symmetric and idempotent, \(\mathbf{M}\mathbf{X} = \mathbf{0}\), and \(\mathbf{e} = \mathbf{M}\mathbf{y}\): it projects onto the orthogonal complement of Col(\(\mathbf{X}\)).
Key: \(\mathbf{H}\mathbf{M} = \mathbf{0}\) — fitted values and residuals are orthogonal!
The \(i\)th diagonal element of \(\mathbf{H}\):
\[h_{ii} = \mathbf{x}_i^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{x}_i\]
This is the leverage of observation \(i\).
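All of the hat-matrix and residual-maker properties above can be verified numerically; a NumPy sketch on simulated data (forming \(\mathbf{H}\) explicitly is fine at this size, though not something to do for large \(n\)):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix (demo only; avoid for big n)
M = np.eye(n) - H                      # residual maker

assert np.allclose(H, H.T)             # symmetric
assert np.allclose(H @ H, H)           # idempotent
assert np.isclose(np.trace(H), p + 1)  # trace = p + 1
assert np.allclose(H @ M, 0)           # fitted values orthogonal to residuals

e = M @ y
assert np.allclose(X.T @ e, 0)         # residuals orthogonal to Col(X)

leverage = np.diag(H)                  # h_ii
assert np.isclose(leverage.sum(), p + 1)
assert np.all((leverage > 0) & (leverage <= 1))
```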
Interpretation of Leverage: \(h_{ii}\) measures how far observation \(i\)'s predictors lie from the center of the predictor cloud. With an intercept, \(1/n \leq h_{ii} \leq 1\) and the average leverage is \((p+1)/n\); observations with leverage well above average have outsized influence on their own fitted values.
Gauss-Markov Theorem
Among all linear unbiased estimators of \(\boldsymbol{\beta}\), the OLS estimator has the smallest variance.
OLS is BLUE: Best Linear Unbiased Estimator
Assumptions needed: linearity (\(E[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}\)) and homoskedastic, uncorrelated errors (\(\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\)). Normality is not required.
What About Full Rank?
Common simplification: Assume \(\text{rank}(\mathbf{X}) = p + 1\) so \(\mathbf{X}^T\mathbf{X}\) is invertible.
But the theorem is more general! See next slide.
When \(\mathbf{X}\) is rank-deficient, \(\boldsymbol{\beta}\) itself isn’t uniquely identifiable. But we can still do inference!
Estimable Functions
A linear function \(\mathbf{c}^T\boldsymbol{\beta}\) is estimable if \(\mathbf{c}\) lies in the row space of \(\mathbf{X}\).
Equivalently: \(\mathbf{c}^T\boldsymbol{\beta}\) is estimable iff it has the same value for all \(\boldsymbol{\beta}\) satisfying \(\mathbf{X}\boldsymbol{\beta} = E[\mathbf{y}]\).
General Gauss-Markov: For any estimable \(\mathbf{c}^T\boldsymbol{\beta}\), the OLS estimator
\[\mathbf{c}^T\hat{\boldsymbol{\beta}} = \mathbf{c}^T(\mathbf{X}^T\mathbf{X})^{-}\mathbf{X}^T\mathbf{y}\]
is BLUE, where \((\mathbf{X}^T\mathbf{X})^{-}\) is any generalized inverse.
Key insight: The choice of g-inverse doesn’t matter for estimable functions — they all give the same answer!
Full rank (common case): \(\hat{\boldsymbol{\beta}}\) is unique and every linear function \(\mathbf{c}^T\boldsymbol{\beta}\) is estimable.
Rank deficient: \(\hat{\boldsymbol{\beta}}\) is not unique, but \(\hat{\mathbf{y}}\) and every estimable \(\mathbf{c}^T\boldsymbol{\beta}\) still are.
Intuition: What Makes Something Estimable?
A function of \(\boldsymbol{\beta}\) is estimable if you get the same answer no matter which valid \(\boldsymbol{\beta}\) parameterization you use.
Think of it this way: if there are multiple \(\boldsymbol{\beta}\) vectors that all produce the same fitted values \(\hat{\mathbf{y}}\), then any quantity that differs across these \(\boldsymbol{\beta}\)’s is not estimable—the data cannot distinguish between them. Only quantities that are the same for all valid \(\boldsymbol{\beta}\)’s can be estimated from the data.
Often, what we actually care about is estimable — even when \(\boldsymbol{\beta}\) isn’t identified!
Example: One-way ANOVA with cell means coding
\[y_{ij} = \mu_i + \varepsilon_{ij}, \quad i = 1, \ldots, k; \; j = 1, \ldots, n_i\]
With cell means coding, the design matrix \(\mathbf{X}\) has \(k\) columns and full rank \(k\), so each of the \(k\) means is identifiable. What's estimable here?
| Quantity | Estimable? | Why it matters |
|---|---|---|
| \(\mu_1\) | ✓ Yes | Individual group mean |
| \(\mu_1 - \mu_2\) | ✓ Yes | Treatment contrast |
| \(\frac{1}{k}\sum_i \mu_i\) | ✓ Yes | Overall mean |
| \(\hat{y}\) for any \(\mathbf{x}\) in data | ✓ Yes | Predictions always work |
Example: Regression with perfect collinearity
If \(X_3 = X_1 + X_2\) exactly, then \(\beta_1, \beta_2, \beta_3\) aren’t individually identifiable.
But \(\beta_1 + \beta_3\) and \(\beta_2 + \beta_3\) are estimable — and often that’s what we need!
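This invariance is easy to see numerically; a NumPy sketch with a deliberately collinear design (coefficients and data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2                                 # exact collinearity
X = np.column_stack([np.ones(n), x1, x2, x3])
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.2 * x3 + 0.01 * rng.normal(size=n)

beta_min_norm = np.linalg.pinv(X) @ y        # one valid solution (min-norm)
null_vec = np.array([0.0, 1.0, 1.0, -1.0])   # X @ null_vec = 0 by construction
assert np.allclose(X @ null_vec, 0)

beta_other = beta_min_norm + 3.7 * null_vec  # a different valid solution

# Same fitted values, so beta_1 etc. are NOT individually identified...
assert np.allclose(X @ beta_min_norm, X @ beta_other)

# ...but the estimable combinations beta_1 + beta_3 and beta_2 + beta_3 agree
c1 = np.array([0.0, 1.0, 0.0, 1.0])
c2 = np.array([0.0, 0.0, 1.0, 1.0])
assert np.isclose(c1 @ beta_min_norm, c1 @ beta_other)
assert np.isclose(c2 @ beta_min_norm, c2 @ beta_other)
```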
Critical Distinction
G-inverses handle rank deficiency from linear dependencies.
They do NOT solve the \(p \gg n\) problem!
Two very different situations:
Rank deficiency (g-inverses help): \(n \geq p + 1\) but the columns of \(\mathbf{X}\) are linearly dependent; Col(\(\mathbf{X}\)) is a proper subspace, and estimable functions are still pinned down by the data.
High-dimensional (\(p > n\), g-inverses don't help): Col(\(\mathbf{X}\)) can be all of \(\mathbb{R}^n\); least squares can reproduce any \(\mathbf{y}\) exactly, so the fit says nothing about generalization.
Why the Difference?
With \(p > n\): even with a g-inverse, \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) interpolates the training data perfectly.
Zero training error, catastrophic test error. This is the overfitting problem — no amount of linear algebra solves it. We need to constrain the solution space.
Consider any other linear estimator: \[\tilde{\boldsymbol{\beta}} = \mathbf{C}\mathbf{y}\]
for some matrix \(\mathbf{C} \in \mathbb{R}^{(p+1) \times n}\).
For unbiasedness: \[E[\tilde{\boldsymbol{\beta}}] = \mathbf{C}E[\mathbf{y}] = \mathbf{C}\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta}\]
This requires \(\mathbf{C}\mathbf{X} = \mathbf{I}\).
Write \(\mathbf{C} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}\) for some matrix \(\mathbf{D}\).
The unbiasedness condition \(\mathbf{C}\mathbf{X} = \mathbf{I}\) implies: \[\mathbf{D}\mathbf{X} = \mathbf{0}\]
Variance of \(\tilde{\boldsymbol{\beta}}\): \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2 \mathbf{C}\mathbf{C}^T\]
\[\mathbf{C}\mathbf{C}^T = [(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}][(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T + \mathbf{D}]^T\]
\[= (\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T + \underbrace{(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{D}^T}_{= 0} + \underbrace{\mathbf{D}\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}}_{= 0}\]
Since \(\mathbf{D}\mathbf{X} = \mathbf{0}\): \[\text{Var}(\tilde{\boldsymbol{\beta}}) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1} + \mathbf{D}\mathbf{D}^T]\]
Since \(\mathbf{D}\mathbf{D}^T\) is positive semi-definite, \(\text{Var}(\tilde{\boldsymbol{\beta}}) \geq \text{Var}(\hat{\boldsymbol{\beta}})\).
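The key algebraic steps, the cross terms vanishing and the excess variance being exactly \(\mathbf{D}\mathbf{D}^T\), can be checked directly; a NumPy sketch in which a competing estimator is constructed artificially just to satisfy \(\mathbf{D}\mathbf{X} = \mathbf{0}\):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T

# Any D of the form D0 (I - H) satisfies D X = 0, since (I - H) X = 0
D = rng.normal(size=(p + 1, n)) @ (np.eye(n) - H)
assert np.allclose(D @ X, 0)

C = XtX_inv @ X.T + D                  # competing linear estimator's matrix
assert np.allclose(C @ X, np.eye(p + 1))   # unbiasedness: C X = I

# Cross terms vanish, so C C' = (X'X)^{-1} + D D' exactly
assert np.allclose(C @ C.T, XtX_inv + D @ D.T)

# The excess variance D D' is positive semi-definite
assert np.min(np.linalg.eigvalsh(D @ D.T)) > -1e-10
```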
Limitations
Only compares linear estimators — nonlinear estimators might be better
Only considers unbiased estimators — biased estimators might have lower MSE
Doesn’t guarantee small variance — just smallest among linear unbiased
Assumes homoskedasticity — if violated, OLS is not BLUE
This motivates regularization! (Ridge, Lasso in Week 3)
Preview: Ridge Regression
Ridge regression adds a penalty to the OLS objective, yielding:
\[\hat{\boldsymbol{\beta}}_{\text{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]
where \(\lambda \geq 0\) is a tuning parameter.
Key insight: Compare to OLS: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
The parameter \(\lambda\) gives us explicit control over the bias-variance tradeoff. We’ll develop this fully in Week 3.
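A minimal NumPy sketch of the ridge formula (no intercept handling or predictor standardization, which a real implementation would need):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution (X'X + lam I)^{-1} X'y. Sketch only: assumes
    predictors are already centered/scaled and there is no intercept."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(3)
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ np.ones(p) + rng.normal(size=n)

# lam = 0 recovers OLS
b_ols = ridge(X, y, 0.0)
assert np.allclose(b_ols, np.linalg.lstsq(X, y, rcond=None)[0])

# Increasing lam shrinks the coefficient vector toward zero
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
assert all(a > b for a, b in zip(norms, norms[1:]))
```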
Recall from Week 1
\[\text{EPE}(\hat{f}) = \underbrace{\sigma^2}_{\text{irreducible}} + \underbrace{[\text{Bias}(\hat{f})]^2}_{\text{model wrong?}} + \underbrace{\text{Var}(\hat{f})}_{\text{estimation noise}}\]
For OLS under the correct linear model: the bias term is zero, so all reducible error is estimation variance.
But what if we’re willing to accept some bias?
For a new observation \(\mathbf{x}_0\), the OLS prediction is \(\hat{y}_0 = \mathbf{x}_0^T \hat{\boldsymbol{\beta}}\).
Prediction Variance
\[\text{Var}(\hat{y}_0) = \sigma^2 \mathbf{x}_0^T (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{x}_0\]
Under correct model specification: \(E[\hat{y}_0] = \mathbf{x}_0^T\boldsymbol{\beta}\), so the prediction is unbiased and this variance is the entire reducible error.
Connection to Week 1
Recall the bias-variance decomposition: \(\text{EPE} = \sigma^2 + \text{Bias}^2 + \text{Variance}\).
For OLS under correct specification, Bias = 0, so prediction error is driven entirely by variance and irreducible noise. Regularization (Week 3) trades bias for reduced variance.
Add a stronger assumption:
\[\boldsymbol{\varepsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I})\]
Then we get exact finite-sample distributions:
Distribution of \(\hat{\boldsymbol{\beta}}\)
\[\hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2(\mathbf{X}^T\mathbf{X})^{-1})\]
The covariance matrix \(\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\) tells us about precision of each estimate.
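This covariance claim can be checked by Monte Carlo, holding \(\mathbf{X}\) fixed and resimulating \(\mathbf{y}\); a NumPy sketch (sample sizes and seeds are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, sigma = 40, 2, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, -2.0, 0.5])
cov_theory = sigma**2 * np.linalg.inv(X.T @ X)   # sigma^2 (X'X)^{-1}

# Many simulated responses with X held fixed; lstsq handles multiple RHS
reps = 5000
Y = (X @ beta)[:, None] + sigma * rng.normal(size=(n, reps))
betas = np.linalg.lstsq(X, Y, rcond=None)[0].T   # shape (reps, p + 1)

# Empirical mean and covariance of beta-hat match the theory
assert np.allclose(betas.mean(axis=0), beta, atol=0.05)   # unbiased
cov_emp = np.cov(betas, rowvar=False)
rel_err = np.linalg.norm(cov_emp - cov_theory) / np.linalg.norm(cov_theory)
assert rel_err < 0.1
```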
We don’t know \(\sigma^2\), so we estimate it:
\[\hat{\sigma}^2 = \frac{\text{RSS}}{n - p - 1} = \frac{\mathbf{e}^T\mathbf{e}}{n - p - 1} = \frac{\sum_{i=1}^n e_i^2}{n - p - 1}\]
Degrees of Freedom
Why \(n - p - 1\)?
\[E[\mathbf{e}^T\mathbf{e}] = E[\mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}] = \sigma^2 \text{tr}(\mathbf{I} - \mathbf{H}) = \sigma^2(n - p - 1)\]
(The mean term drops out because \((\mathbf{I} - \mathbf{H})\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\).)
We “use up” \(p + 1\) degrees of freedom estimating \(\boldsymbol{\beta}\).
Intuition: Why Residuals Have Fewer Degrees of Freedom
The residuals \(\mathbf{e}\) are not free to vary in all \(n\) dimensions—they are constrained by the normal equations \(\mathbf{X}^T\mathbf{e} = \mathbf{0}\).
This gives \(p + 1\) linear constraints (one for each column of \(\mathbf{X}\), including the intercept). So while we have \(n\) residuals, they only have \(n - (p+1)\) degrees of freedom to move around.
Equivalently: the residuals live in the \((n - p - 1)\)-dimensional orthogonal complement of Col(\(\mathbf{X}\)).
Key Result
Under normality, \(\hat{\boldsymbol{\beta}}\) and \(\hat{\sigma}^2\) are independent.
Furthermore:
\[\frac{(n - p - 1)\hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-p-1}\]
These results enable exact inference (t-tests, F-tests).
The estimated standard error of \(\hat{\beta}_j\):
\[\text{SE}(\hat{\beta}_j) = \hat{\sigma}\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}\]
This measures our uncertainty about \(\hat{\beta}_j\).
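Putting \(\hat{\sigma}^2\) and the SE formula together; a NumPy sketch on simulated data, where the true \(\sigma\) is known and the estimated SEs can be compared with \(\sigma\sqrt{[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}\):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 200, 2, 1.5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, 0.0]) + sigma * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ (X.T @ y)
e = y - X @ beta_hat

sigma2_hat = (e @ e) / (n - p - 1)            # RSS / (n - p - 1)
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))   # SE(beta_hat_j)

# Since sigma is known in this simulation, the estimated SEs should be
# close to the "true" ones sigma * sqrt([(X'X)^{-1}]_{jj})
se_true = sigma * np.sqrt(np.diag(XtX_inv))
assert np.allclose(se, se_true, rtol=0.2)
```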
What if \(\text{Var}(\varepsilon_i) = \sigma_i^2\) varies across observations?
Under Heteroskedasticity
Heteroskedasticity-Robust (Sandwich) Standard Errors
White (1980) proposed the “sandwich” estimator:
\[\widehat{\text{Var}}(\hat{\boldsymbol{\beta}})_{\text{robust}} = (\mathbf{X}^T\mathbf{X})^{-1}\left(\sum_{i=1}^n e_i^2 \mathbf{x}_i\mathbf{x}_i^T\right)(\mathbf{X}^T\mathbf{X})^{-1}\]
where \(e_i = y_i - \mathbf{x}_i^T\hat{\boldsymbol{\beta}}\) are the residuals.
In R: use sandwich::vcovHC() together with lmtest::coeftest(), e.g. `coeftest(fit, vcov = vcovHC(fit, type = "HC3"))`. The “HC3” variant is recommended for small samples (MacKinnon & White, 1985).
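The sandwich formula also translates directly into NumPy; a sketch implementing the basic (HC0) estimator with an optional HC3 leverage correction, on data simulated so the error spread grows with \(|x|\):

```python
import numpy as np

def sandwich_se(X, y, hc3=False):
    """Heteroskedasticity-robust SEs (White, 1980). With hc3=True, rescale
    each residual by 1/(1 - h_ii) (MacKinnon & White, 1985). Sketch only."""
    XtX_inv = np.linalg.inv(X.T @ X)
    e = y - X @ (XtX_inv @ (X.T @ y))
    if hc3:
        h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # leverages h_ii
        e = e / (1.0 - h)
    meat = X.T @ (e[:, None] ** 2 * X)                # sum of e_i^2 x_i x_i'
    cov = XtX_inv @ meat @ XtX_inv                    # the "sandwich"
    return np.sqrt(np.diag(cov))

rng = np.random.default_rng(8)
n = 300
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + np.abs(x) * rng.normal(size=n)   # heteroskedastic errors

se_hc0 = sandwich_se(X, y)
se_hc3 = sandwich_se(X, y, hc3=True)
assert np.all(se_hc3 >= se_hc0)   # HC3 inflates each SE relative to HC0
```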
Beyond Independence: Generalized Least Squares
When errors are correlated (e.g., time series, spatial data, clustered observations):
\[\text{Cov}(\boldsymbol{\varepsilon}) = \sigma^2 \boldsymbol{\Omega} \neq \sigma^2 \mathbf{I}\]
The Generalized Least Squares (GLS) estimator accounts for this structure:
\[\hat{\boldsymbol{\beta}}_{\text{GLS}} = (\mathbf{X}^T\boldsymbol{\Omega}^{-1}\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\Omega}^{-1}\mathbf{y}\]
Key properties: GLS is BLUE when \(\boldsymbol{\Omega}\) is known. In practice, \(\boldsymbol{\Omega}\) must be estimated (Feasible GLS). Common applications include time series (AR errors), panel data, and spatial models.
In R: nlme::gls() or specify correlation structures in mixed models.
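With \(\boldsymbol{\Omega}\) known, GLS is equivalent to OLS on "whitened" data (multiply through by \(\mathbf{L}^{-1}\) where \(\boldsymbol{\Omega} = \mathbf{L}\mathbf{L}^T\)), which gives a convenient correctness check; a NumPy sketch with an AR(1)-style correlation (all numbers illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])

# AR(1)-style correlation: Omega_ij = rho^|i-j| (treated as known here)
rho = 0.7
idx = np.arange(n)
Omega = rho ** np.abs(np.subtract.outer(idx, idx))
L = np.linalg.cholesky(Omega)
y = X @ np.array([1.0, 2.0]) + L @ rng.normal(size=n)   # correlated errors

# GLS: (X' Omega^{-1} X)^{-1} X' Omega^{-1} y
Oinv = np.linalg.inv(Omega)
beta_gls = np.linalg.solve(X.T @ Oinv @ X, X.T @ Oinv @ y)

# Equivalent: OLS on whitened data W X, W y with W = L^{-1}
W = np.linalg.inv(L)
beta_white = np.linalg.lstsq(W @ X, W @ y, rcond=None)[0]
assert np.allclose(beta_gls, beta_white)
```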
To test \(H_0: \beta_j = 0\):
\[t_j = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)} \sim t_{n-p-1} \text{ under } H_0\]
Interpretation: Is predictor \(j\) useful after accounting for all other predictors?
Caution: Multicollinearity Trap
A non-significant \(t\)-test doesn’t mean the predictor is unimportant—it could be collinear with others.
Example: Suppose \(X_1\) (father’s height) and \(X_2\) (mother’s height) both predict \(Y\) (child’s height), with \(\text{Cor}(X_1, X_2) = 0.6\). Individually, neither \(t\)-test may reach significance, yet the joint \(F\)-test for both predictors can be highly significant.
Why? Each predictor’s \(t\)-test asks “does this help given the other?” With correlated predictors, neither adds much unique information beyond the other—but together they explain substantial variance.
Rule: Always examine joint \(F\)-tests alongside individual \(t\)-tests when predictors are correlated.
A \((1-\alpha)\) confidence interval for \(\beta_j\):
\[\hat{\beta}_j \pm t_{n-p-1, \alpha/2} \cdot \text{SE}(\hat{\beta}_j)\]
To compare a reduced model to a full model:
\[F = \frac{(\text{RSS}_{\text{reduced}} - \text{RSS}_{\text{full}}) / q}{\text{RSS}_{\text{full}} / (n - p - 1)} \sim F_{q, n-p-1}\]
where \(q\) is the number of additional parameters in the full model.
Use case: Testing whether a group of predictors improves the model.
Testing \(H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0\) (all coefficients zero):
\[F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)} = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}\]
where \(R^2 = 1 - \text{RSS}/\text{TSS}\) is the coefficient of determination.
\(R^2\): Proportion of variance explained \[R^2 = 1 - \frac{\text{RSS}}{\text{TSS}} = 1 - \frac{\sum_i(y_i - \hat{y}_i)^2}{\sum_i(y_i - \bar{y})^2}\]
Problem: \(R^2\) always increases when adding predictors.
Adjusted \(R^2\): Penalizes for number of predictors \[R^2_{\text{adj}} = 1 - \frac{\text{RSS}/(n - p - 1)}{\text{TSS}/(n - 1)}\]
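The overall-\(F\) and \(R^2\) formulas are linked by an exact identity that is worth verifying once; a NumPy sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(10)
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.8, -0.5, 0.0]) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
RSS = e @ e
TSS = np.sum((y - y.mean()) ** 2)

R2 = 1 - RSS / TSS
R2_adj = 1 - (RSS / (n - p - 1)) / (TSS / (n - 1))

# Overall F two ways: from sums of squares, and from R^2
F_ss = ((TSS - RSS) / p) / (RSS / (n - p - 1))
F_r2 = (R2 / p) / ((1 - R2) / (n - p - 1))
assert np.isclose(F_ss, F_r2)   # the two formulas agree exactly
assert R2_adj < R2              # the adjustment always penalizes (p >= 1)
```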
OLS assumes:
Diagnostics help us check these assumptions.
| Issue | Detection | Consequence | Remedy |
|---|---|---|---|
| Nonlinearity | Residual plots | Biased estimates | Transform predictors; add polynomial terms; use nonlinear model |
| Heteroskedasticity | Residual vs fitted | Invalid inference | Weighted LS; robust SEs; transform response |
| Non-normality | Q-Q plot | Invalid inference | Bootstrap; transform response; robust methods |
| High leverage | Hat values \(h_{ii}\) | Sensitive to outliers | Investigate data entry; consider robust regression |
| Influential points | Cook’s distance | Model driven by few points | Sensitivity analysis; robust regression; report with/without |
| Multicollinearity | VIF | Large standard errors | Remove/combine predictors; ridge regression; collect more data |
Linear model setup: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)
OLS estimator: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
Geometry: Least squares = orthogonal projection onto Col(\(\mathbf{X}\))
Hat matrix: \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) projects \(\mathbf{y}\) to \(\hat{\mathbf{y}}\)
Gauss-Markov: OLS is BLUE (among linear unbiased estimators)
Inference: t-tests, F-tests, confidence intervals under normality
Diagnostics: Residual plots, Q-Q plots, leverage, influence
Remember These
Geometry over algebra: Thinking about projections gives deeper understanding
Gauss-Markov limits: Only optimal among linear unbiased — biased estimators may be better!
Diagnostics matter: Conclusions depend on assumptions being (approximately) satisfied
We built the foundation this week. Now we break some rules.
What we established: the OLS estimator, its projection geometry, Gauss-Markov optimality, and exact inference under normality.
What we hinted at: unbiasedness is a constraint, not a virtue; accepting a little bias can reduce variance and lower MSE.
Next week: Ridge, Lasso, Elastic Net — deliberately biased estimators that often outperform OLS.