Week 3: Regularization — Ridge, Lasso, Elastic Net

Stat 577 — Statistical Learning Theory

Soumojit Das

Spring 2026

Why Regularize?

When OLS Breaks Down

Last week we established that OLS is BLUE — Best Linear Unbiased Estimator.

But OLS fails spectacularly in several settings:

High Dimensions (\(p \geq n\))

  • \(\mathbf{X}^T\mathbf{X}\) is singular
  • Infinitely many solutions
  • Perfect training fit, terrible predictions

Multicollinearity

  • \(\mathbf{X}^T\mathbf{X}\) nearly singular
  • Huge variance in \(\hat{\boldsymbol{\beta}}\)
  • Unstable estimates

Overfitting

  • Model memorizes noise
  • Low training error, high test error
  • Poor generalization

The Core Problem

OLS is unbiased but can have enormous variance. Sometimes we want to trade a little bias for much lower variance.

Stein’s Phenomenon

Stein’s Paradox (1956)

For \(p \geq 3\), the MLE (sample mean) is inadmissible for estimating a multivariate normal mean under squared error loss.

There exist biased estimators that dominate it uniformly!

Setup: Observe \(\mathbf{z} \sim N(\boldsymbol{\theta}, \sigma^2 \mathbf{I}_p)\). Want to estimate \(\boldsymbol{\theta}\).

  • MLE: \(\hat{\boldsymbol{\theta}}_{MLE} = \mathbf{z}\)
  • MSE: \(E[\|\hat{\boldsymbol{\theta}}_{MLE} - \boldsymbol{\theta}\|^2] = p\sigma^2\)

The James-Stein Estimator

\[\hat{\boldsymbol{\theta}}_{JS} = \left(1 - \frac{(p-2)\sigma^2}{\|\mathbf{z}\|^2}\right)\mathbf{z}\]

This shrinks \(\mathbf{z}\) toward the origin. For \(p \geq 3\):

\[E[\|\hat{\boldsymbol{\theta}}_{JS} - \boldsymbol{\theta}\|^2] < E[\|\hat{\boldsymbol{\theta}}_{MLE} - \boldsymbol{\theta}\|^2]\]

for all \(\boldsymbol{\theta}\)!

Key insight: Shrinkage toward a target (here, zero) can uniformly reduce MSE. This is the foundation of regularization.

James-Stein Simulation

Observation

For \(p \geq 3\), James-Stein uniformly dominates MLE. The improvement grows with dimension!
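Such a simulation can be sketched in a few lines of numpy (the dimension, noise level, and true mean below are illustrative choices, not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma2, n_reps = 10, 1.0, 2000
theta = np.ones(p)               # true mean vector (illustrative)

mse_mle = mse_js = 0.0
for _ in range(n_reps):
    z = rng.normal(theta, np.sqrt(sigma2))
    # MLE is the observation itself; James-Stein shrinks z toward the origin
    js = (1 - (p - 2) * sigma2 / np.sum(z**2)) * z
    mse_mle += np.sum((z - theta) ** 2)
    mse_js += np.sum((js - theta) ** 2)

print(mse_mle / n_reps)  # close to p * sigma2 = 10
print(mse_js / n_reps)   # strictly smaller: JS dominates the MLE
```

Rerunning with larger `p` shows the gap widening, matching the observation above.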

From Stein to Hierarchical Bayes

The James-Stein estimator seemed like a “trick” — why does shrinking toward an arbitrary point help?

Efron & Morris (1973, 1975) provided the answer: JS is an empirical Bayes estimator!

The Bayesian Explanation

Consider the hierarchical model: \[\theta_j \stackrel{iid}{\sim} N(\mu, A) \quad \text{(prior)}\] \[z_j | \theta_j \sim N(\theta_j, \sigma^2) \quad \text{(likelihood)}\]

The posterior mean (optimal under squared error loss): \[E[\theta_j | z_j] = \left(1 - \frac{\sigma^2}{\sigma^2 + A}\right)z_j + \frac{\sigma^2}{\sigma^2 + A}\mu\]

James-Stein emerges when we:

  1. Set \(\mu = 0\) (shrink toward origin)
  2. Estimate \(A\) from the data (empirical Bayes)

Why This Matters

  • Shrinkage is not a “trick” — it’s borrowing strength across parameters
  • Each \(\theta_j\) estimate is improved by information from other \(\theta_k\)’s
  • This is the foundation of hierarchical Bayes and modern shrinkage priors
  • Ridge, Lasso, and Elastic Net are all doing this same borrowing!

Historical Arc of Shrinkage

James-Stein (1961) → Ridge (Hoerl & Kennard, 1970) → Lasso (Tibshirani, 1996) → Elastic Net (Zou & Hastie, 2005)

Each method built on its predecessor, progressing from theoretical insight to practical methodology.

Norms and Geometry

\(\ell_p\) Norms

The \(\ell_p\) norm of a vector \(\mathbf{x} \in \mathbb{R}^d\):

\[\|\mathbf{x}\|_p = \left(\sum_{i=1}^d |x_i|^p\right)^{1/p}\]

Common norms:

Norm Formula Name
\(\ell_1\) \(\sum_i |x_i|\) Manhattan / Taxicab
\(\ell_2\) \(\sqrt{\sum_i x_i^2}\) Euclidean
\(\ell_\infty\) \(\max_i |x_i|\) Maximum / Chebyshev

Properties:

  • \(\ell_2\) is rotation invariant
  • \(\ell_1\) promotes sparsity
  • \(\ell_\infty\) penalizes largest component
  • As \(p \to 0^+\), \(\sum_i |x_i|^p\) approaches the \(\ell_0\) “norm”
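For concreteness, the norms in the table can be evaluated with numpy (the vector is arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])

l1 = np.linalg.norm(x, 1)          # |3| + |-4| + |0| = 7
l2 = np.linalg.norm(x, 2)          # sqrt(9 + 16) = 5
linf = np.linalg.norm(x, np.inf)   # max(3, 4, 0) = 4
l0 = np.count_nonzero(x)           # number of nonzeros = 2

print(l1, l2, linf, l0)  # 7.0 5.0 4.0 2
```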

Unit Balls in Different Norms

Key Observation

The \(\ell_1\) ball has corners on the axes — this is why Lasso produces sparse solutions!

The \(\ell_0\) “Norm”

The \(\ell_0\) “norm” counts nonzero entries:

\[\|\mathbf{x}\|_0 = \sum_{i=1}^d \mathbf{1}(x_i \neq 0) = \#\{i : x_i \neq 0\}\]

Not Actually a Norm!

A norm must satisfy three properties:

  1. Non-negativity: \(\|\mathbf{x}\| \geq 0\) with equality iff \(\mathbf{x} = \mathbf{0}\)
  2. Homogeneity: \(\|c\mathbf{x}\| = |c| \cdot \|\mathbf{x}\|\) for all scalars \(c\)
  3. Triangle inequality: \(\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|\)

\(\ell_0\) fails homogeneity: \(\|c\mathbf{x}\|_0 = \|\mathbf{x}\|_0\) for \(c \neq 0\), but we need \(|c| \cdot \|\mathbf{x}\|_0\).

Example: \(\|(2, 0, 3)\|_0 = 2\), but \(\|2 \cdot (2, 0, 3)\|_0 = \|(4, 0, 6)\|_0 = 2 \neq 2 \cdot 2 = 4\).
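The failure of homogeneity in this example is easy to check numerically:

```python
import numpy as np

x = np.array([2.0, 0.0, 3.0])
# Scaling by a nonzero constant leaves the count of nonzeros unchanged,
# so homogeneity (which would demand 2 * 2 = 4 here) fails.
print(np.count_nonzero(x), np.count_nonzero(2 * x))  # 2 2
```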

Why \(\ell_0\) matters:

  • Directly measures sparsity
  • Best subset selection minimizes RSS subject to \(\|\boldsymbol{\beta}\|_0 \leq k\)
  • Problem: Combinatorial optimization (NP-hard)

The Key Insight

\(\ell_1\) is the tightest convex relaxation of \(\ell_0\).

This makes Lasso computationally tractable while still promoting sparsity!

Ridge Regression

Ridge Objective

Ridge Regression

Penalized form: \[\hat{\boldsymbol{\beta}}_R = \arg\min_{\boldsymbol{\beta}} \left\{ \sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\}\]

Constrained form (equivalent): \[\hat{\boldsymbol{\beta}}_R = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 \quad \text{subject to} \quad \sum_{j=1}^p \beta_j^2 \leq t\]

  • \(\lambda \geq 0\) is the regularization parameter (tuning parameter)
  • \(\lambda = 0\): OLS
  • \(\lambda \to \infty\): \(\hat{\boldsymbol{\beta}}_R \to \mathbf{0}\)
  • The penalty does not include the intercept \(\beta_0\)

Deriving the Ridge Solution

Objective: Minimize \(L(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + \lambda \boldsymbol{\beta}^T\boldsymbol{\beta}\)

Expanding: \[L(\boldsymbol{\beta}) = \mathbf{y}^T\mathbf{y} - 2\boldsymbol{\beta}^T\mathbf{X}^T\mathbf{y} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda\boldsymbol{\beta}^T\boldsymbol{\beta}\]

Taking the gradient: \[\nabla_{\boldsymbol{\beta}} L = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + 2\lambda\boldsymbol{\beta}\]

Setting to zero: \[\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} + \lambda\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}\] \[(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}\]

Ridge Estimator

\[\hat{\boldsymbol{\beta}}_R = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

Key property: \(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\) is always invertible for \(\lambda > 0\), even when \(p > n\)!

Numerical Stability

Adding \(\lambda\mathbf{I}\) not only guarantees invertibility but also improves the condition number of \(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I}\), leading to more numerically stable solutions.

Connection to Sufficient Statistics

The Ridge estimator depends on the data only through \((\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y})\):

\[\hat{\boldsymbol{\beta}}_R = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

For the linear model \(\mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})\):

  • \((\mathbf{X}^T\mathbf{X}, \mathbf{X}^T\mathbf{y})\) are jointly sufficient for \(\boldsymbol{\beta}\) (when \(\sigma^2\) known)
  • Ridge is a linear shrinkage of OLS: \(\hat{\boldsymbol{\beta}}_R = \mathbf{W}_\lambda \hat{\boldsymbol{\beta}}_{OLS}\)
  • where \(\mathbf{W}_\lambda = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{X}\)

This preserves the sufficiency principle — Ridge doesn’t discard information, it just reweights it!

Bayesian Interpretation

Ridge regression has an elegant Bayesian interpretation.

Likelihood: \(\mathbf{y} | \boldsymbol{\beta} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2\mathbf{I})\)

Prior: \(\boldsymbol{\beta} \sim N(\mathbf{0}, \tau^2\mathbf{I})\) (independent Gaussian prior)

Posterior mode (MAP estimate): \[\hat{\boldsymbol{\beta}}_{MAP} = \arg\max_{\boldsymbol{\beta}} \left\{ \log p(\mathbf{y}|\boldsymbol{\beta}) + \log p(\boldsymbol{\beta}) \right\}\]

\[= \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2\sigma^2}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \frac{1}{2\tau^2}\|\boldsymbol{\beta}\|^2 \right\}\]

Connection to Ridge

\[\lambda = \frac{\sigma^2}{\tau^2}\]

  • Large \(\tau^2\) (diffuse prior) \(\Rightarrow\) small \(\lambda\) \(\Rightarrow\) close to OLS
  • Small \(\tau^2\) (tight prior) \(\Rightarrow\) large \(\lambda\) \(\Rightarrow\) strong shrinkage

ML Connection: Weight Decay

The Ridge penalty \(\lambda\|\boldsymbol{\beta}\|_2^2\) corresponds to “weight decay” in neural network training (the two are exactly equivalent under plain SGD; adaptive optimizers such as AdamW apply a decoupled decay step). When you see weight_decay=0.01 in a PyTorch or TensorFlow optimizer, you’re applying Ridge-style regularization to the network weights.

Effective Degrees of Freedom

OLS uses \(p\) degrees of freedom (for \(p\) predictors, excluding intercept).

Ridge uses fewer! The effective degrees of freedom:

\[\text{df}(\lambda) = \text{tr}\left[\mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\right] = \sum_{j=1}^p \frac{d_j^2}{d_j^2 + \lambda}\]

where \(d_1, \ldots, d_p\) are the singular values of \(\mathbf{X}\).

Properties

  • When \(\lambda = 0\): \(\text{df}(\lambda) = p\) (OLS)
  • When \(\lambda \to \infty\): \(\text{df}(\lambda) \to 0\)
  • \(\text{df}(\lambda)\) decreases monotonically in \(\lambda\)
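The two expressions for \(\text{df}(\lambda)\) can be checked against each other numerically (random design and \(\lambda\) chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 100, 5, 3.0
X = rng.normal(size=(n, p))

# Trace of the ridge "hat" matrix X (X'X + lam I)^{-1} X'
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
df_trace = np.trace(H)

# Equivalent sum over the squared singular values of X
d = np.linalg.svd(X, compute_uv=False)
df_svd = np.sum(d**2 / (d**2 + lam))

print(df_trace, df_svd)  # equal, and strictly less than p = 5
```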

Effective Degrees of Freedom vs \(\lambda\)

Ridge via SVD

The SVD

The Singular Value Decomposition (SVD) of \(\mathbf{X} \in \mathbb{R}^{n \times p}\):

\[\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T\]

where:

  • \(\mathbf{U} \in \mathbb{R}^{n \times n}\): left singular vectors
    • Columns are orthonormal: \(\mathbf{U}^T\mathbf{U} = \mathbf{I}\)
    • Span the column space of \(\mathbf{X}\)
  • \(\mathbf{V} \in \mathbb{R}^{p \times p}\): right singular vectors
    • Columns are orthonormal: \(\mathbf{V}^T\mathbf{V} = \mathbf{I}\)
    • Span the row space of \(\mathbf{X}\)
  • \(\mathbf{D} \in \mathbb{R}^{n \times p}\): singular values
    • Rectangular diagonal matrix with \(d_1 \geq d_2 \geq \cdots \geq d_r > 0\) on the diagonal
    • \(r = \text{rank}(\mathbf{X})\)
    • \(d_j^2\) are eigenvalues of \(\mathbf{X}^T\mathbf{X}\)

OLS Through SVD

Using the SVD, the OLS fitted values:

\[\hat{\mathbf{y}}_{OLS} = \mathbf{X}\hat{\boldsymbol{\beta}}_{OLS} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]

Substituting \(\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T\):

\[\hat{\mathbf{y}}_{OLS} = \mathbf{U}_r\mathbf{U}_r^T\mathbf{y} = \sum_{j=1}^r \mathbf{u}_j \mathbf{u}_j^T \mathbf{y}\] where \(\mathbf{U}_r\) collects the first \(r\) columns of \(\mathbf{U}\).

Interpretation

OLS projects \(\mathbf{y}\) onto the column space of \(\mathbf{X}\), giving equal weight to all principal directions.

Ridge via SVD

The ridge fitted values:

\[\hat{\mathbf{y}}_R = \mathbf{X}(\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

Using SVD of \(\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T\):

  • \(\mathbf{X}^T\mathbf{X} = \mathbf{V}\mathbf{D}^2\mathbf{V}^T\)
  • \((\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1} = \mathbf{V}(\mathbf{D}^2 + \lambda\mathbf{I})^{-1}\mathbf{V}^T\)

After simplification:

Ridge via SVD

\[\hat{\mathbf{y}}_R = \sum_{j=1}^p \mathbf{u}_j \frac{d_j^2}{d_j^2 + \lambda} \mathbf{u}_j^T \mathbf{y}\]

Each principal direction is shrunk by the factor \(\frac{d_j^2}{d_j^2 + \lambda}\).

Key insight:

  • Directions with large \(d_j\) (high variance): shrinkage factor \(\approx 1\) (little shrinkage)
  • Directions with small \(d_j\) (low variance): shrinkage factor \(\approx 0\) (heavy shrinkage)

Ridge adaptively shrinks directions based on their variance in the data!
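The SVD expression for the ridge fit can be verified against the direct formula (a small random example):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 60, 4, 2.0
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Direct ridge fitted values
yhat_direct = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Same fit assembled direction by direction from the thin SVD
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)        # per-direction shrinkage factors
yhat_svd = U @ (shrink * (U.T @ y))

print(np.allclose(yhat_direct, yhat_svd))  # True
```

Inspecting `shrink` shows the adaptivity directly: large singular values give factors near 1, small ones give factors near 0.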

SVD Shrinkage Factors

Common Misconception

Ridge does NOT shrink all coefficients equally. Shrinkage is adaptive: directions with less variance (smaller singular values) are shrunk more aggressively.

Connection to PCA

Ridge and Principal Components

Ridge regression shrinks along principal component directions:

  • High-variance directions (large \(d_j\)): preserved (little shrinkage)
  • Low-variance directions (small \(d_j\)): suppressed (heavy shrinkage)

This is similar to Principal Component Regression (PCR), but:

  • PCR: hard threshold (keep or discard components)
  • Ridge: soft shrinkage (continuous weighting)

Why shrink low-variance directions?

  • Low variance in \(\mathbf{X}\) \(\Rightarrow\) little information about \(\boldsymbol{\beta}\)
  • OLS estimates in these directions are unstable (high variance)
  • Better to shrink toward zero than estimate poorly

Lasso Regression

Lasso Objective

Lasso (Least Absolute Shrinkage and Selection Operator)

Penalized form: \[\hat{\boldsymbol{\beta}}_L = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}\]

Constrained form: \[\hat{\boldsymbol{\beta}}_L = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \leq t\]

Key differences from Ridge:

  • \(\ell_1\) penalty instead of \(\ell_2\)
  • No closed-form solution (requires optimization)
  • Produces sparse solutions (some \(\hat{\beta}_j = 0\) exactly)

Bayesian Interpretation

Lasso corresponds to a Laplace (double-exponential) prior: \[p(\beta_j) \propto \exp\left(-\frac{|\beta_j|}{\tau}\right)\]

The Laplace distribution has heavier tails than Gaussian and places more mass at zero, promoting sparsity. Compare: Ridge \(\leftrightarrow\) Gaussian prior; Lasso \(\leftrightarrow\) Laplace prior.

Soft Thresholding

For orthogonal design (\(\mathbf{X}^T\mathbf{X} = n\mathbf{I}\)), Lasso has a closed form!

The Lasso objective separates by coordinate. Let \(z = \hat{\beta}_j^{OLS}\). We solve: \[\min_{\beta} \left\{ \frac{1}{2}(z - \beta)^2 + \lambda|\beta| \right\}\]

Derivation via subgradient:

The subgradient condition for optimality is: \[0 \in -(z - \beta) + \lambda \cdot \partial|\beta|\]

where \(\partial|\beta|\) is the subdifferential of \(|\beta|\): \[\partial|\beta| = \begin{cases} \{1\} & \text{if } \beta > 0 \\ [-1, 1] & \text{if } \beta = 0 \\ \{-1\} & \text{if } \beta < 0 \end{cases}\]

Case 1: If \(\hat{\beta} = 0\), then \(0 \in -z + \lambda[-1, 1]\), which requires \(|z| \leq \lambda\).

Case 2: If \(\hat{\beta} > 0\), then \(z - \hat{\beta} = \lambda\), so \(\hat{\beta} = z - \lambda\). This is positive when \(z > \lambda\).

Case 3: If \(\hat{\beta} < 0\), then \(z - \hat{\beta} = -\lambda\), so \(\hat{\beta} = z + \lambda\). This is negative when \(z < -\lambda\).

Soft Thresholding

\[\hat{\beta}_j^{Lasso} = \text{sign}(\hat{\beta}_j^{OLS})\left(|\hat{\beta}_j^{OLS}| - \lambda\right)_+\]

where \((x)_+ = \max(0, x)\).

Interpretation:

  • If \(|\hat{\beta}_j^{OLS}| \leq \lambda\): shrink to exactly zero (inside “dead zone”)
  • If \(|\hat{\beta}_j^{OLS}| > \lambda\): shrink by \(\lambda\) toward zero
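The operator itself is one line of numpy (the vector of “OLS coefficients” below is made up for illustration):

```python
import numpy as np

def soft_threshold(z, lam):
    """sign(z) * max(|z| - lam, 0): the Lasso solution under orthogonal design."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([-3.0, -0.5, 0.0, 0.8, 2.5])
# Entries with |z| <= 1 land exactly at zero; the rest move 1 toward zero
out = soft_threshold(z, 1.0)
print(out)
```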

Soft vs Hard vs Ridge Shrinkage

Why Lasso Gives Sparsity

Ridge constraint region:

  • The constraint \(\sum_j \beta_j^2 \leq t\) defines a ball
  • RSS contours are ellipses
  • Solution typically at a smooth point
  • No exact zeros (except by chance)

Lasso constraint region:

  • The constraint \(\sum_j |\beta_j| \leq t\) defines a diamond
  • Diamond has corners on coordinate axes
  • Solution often at a corner
  • Exact zeros for some coefficients

Geometry: Ridge vs Lasso

Subdifferential Perspective

Why does \(\ell_1\) allow exact zeros but \(\ell_2\) does not?

For Ridge (\(\ell_2\)): The derivative of \(\beta_j^2\) at \(\beta_j = 0\) is \(2\beta_j = 0\).

  • Zero gradient at zero means no “pull” toward zero
  • Must have RSS gradient = 0 at the same point (rare)

For Lasso (\(\ell_1\)): The subdifferential of \(|\beta_j|\) at \(\beta_j = 0\) is \([-1, 1]\).

  • Any value in \([-1, 1]\) is a valid “subgradient”
  • Optimality condition: \(-\frac{\partial RSS}{\partial \beta_j} \in \lambda \cdot [-1, 1]\)

Sparsity Condition

\(\hat{\beta}_j = 0\) whenever \(\left|\frac{\partial RSS}{\partial \beta_j}\right| \leq \lambda\)

The interval of valid subgradients at zero allows the Lasso to “park” coefficients at exactly zero.

Lasso Selection Caveat

When predictors are highly correlated, Lasso tends to arbitrarily select one and exclude others. It does not always select the “correct” variables. Elastic Net addresses this via the grouping effect.

Elastic Net

Elastic Net Objective

Elastic Net (Zou & Hastie, 2005)

\[\hat{\boldsymbol{\beta}}_{EN} = \arg\min_{\boldsymbol{\beta}} \left\{ \frac{1}{2n}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2 + \lambda\left[\frac{1-\alpha}{2}\sum_{j=1}^p \beta_j^2 + \alpha\sum_{j=1}^p |\beta_j|\right] \right\}\]

Parameters:

  • \(\lambda \geq 0\): overall regularization strength
  • \(\alpha \in [0, 1]\): mixing parameter
    • \(\alpha = 0\): Ridge
    • \(\alpha = 1\): Lasso
    • \(0 < \alpha < 1\): Combination

Intuition

Elastic Net combines the variable selection of Lasso with the grouped shrinkage of Ridge.

The Grouping Effect

Problem with Lasso: When predictors are highly correlated, Lasso tends to select only one and ignore the others.

Elastic Net’s solution: The Ridge component encourages correlated predictors to be selected together.

Grouping Effect Theorem (Zou & Hastie)

For the Elastic Net with \(\alpha \in (0, 1)\):

If \(\text{Cor}(X_j, X_k) = 1\), then \(\hat{\beta}_j = \hat{\beta}_k\).

More generally, if \(\rho = \text{Cor}(X_j, X_k)\) is large, then \(|\hat{\beta}_j - \hat{\beta}_k|\) is small.

When to use Elastic Net:

  • Groups of correlated predictors (genomics, imaging)
  • Want sparsity but expect related features to enter together
  • \(p > n\) with correlation structure

Elastic Net Constraint Shapes

Selecting \(\lambda\)

The Lambda Problem

How do we choose the regularization parameter \(\lambda\)?

Small \(\lambda\):

  • Less regularization
  • Lower bias
  • Higher variance
  • Risk of overfitting

Large \(\lambda\):

  • More regularization
  • Higher bias
  • Lower variance
  • Risk of underfitting

The Goal

Find \(\lambda\) that minimizes expected prediction error on new data.

We cannot use training error — it always decreases with less regularization!

K-Fold Cross-Validation

Procedure:

  1. Divide data into \(K\) roughly equal folds
  2. For each fold \(k = 1, \ldots, K\):
    • Fit model on all data except fold \(k\)
    • Compute prediction error on fold \(k\)
  3. Average the \(K\) error estimates

\[\text{CV}(\lambda) = \frac{1}{K}\sum_{k=1}^K \text{MSE}_k(\lambda)\]
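A minimal numpy sketch of this procedure for ridge (the fold scheme, data, and lambda grid are illustrative; in practice use cv.glmnet or sklearn's cross-validation utilities):

```python
import numpy as np

def ridge_fit(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, K=5):
    """Average held-out MSE of ridge across K folds."""
    n = len(y)
    fold_errs = []
    for idx in np.array_split(np.arange(n), K):
        train = np.setdiff1d(np.arange(n), idx)
        beta = ridge_fit(X[train], y[train], lam)
        fold_errs.append(np.mean((y[idx] - X[idx] @ beta) ** 2))
    return np.mean(fold_errs)

rng = np.random.default_rng(4)
n, p = 120, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # one strong signal

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = [cv_error(X, y, lam) for lam in lambdas]
lam_min = lambdas[int(np.argmin(cv))]
print(lam_min)
```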

Choosing \(\lambda\):

  • lambda.min: the \(\lambda\) that minimizes CV error; optimizes prediction accuracy
  • lambda.1se: the largest \(\lambda\) whose CV error is within one standard error of the minimum; gives a more parsimonious model (fewer predictors)

Other Selection Methods

Information criteria can also select \(\lambda\):

  • AIC: \(-2\ell(\hat{\boldsymbol{\beta}}) + 2 \cdot \text{df}(\lambda)\)
  • BIC: \(-2\ell(\hat{\boldsymbol{\beta}}) + \log(n) \cdot \text{df}(\lambda)\)

BIC penalizes complexity more heavily. We’ll cover these in detail in Week 4.

CV Error vs \(\lambda\)

Package-Specific Lambda Scaling

Lambda is scaled differently across packages:

  • glmnet (R): minimizes \(\frac{1}{2n}\text{RSS} + \lambda\left[\frac{1-\alpha}{2}\|\boldsymbol{\beta}\|_2^2 + \alpha\|\boldsymbol{\beta}\|_1\right]\)
  • sklearn (Python): its alpha parameter uses a different scaling, so penalty values cannot be compared directly across packages

Always use cross-validation within each package rather than transferring lambda values between R and Python.

Code Demos: glmnet in Action

Ridge Coefficient Paths

Code
# Packages used in the demos (glmnet for fitting, tidyverse for wrangling/plots)
library(glmnet)
library(tidyverse)

# Plot colors and the figure directory come from the deck's setup chunk;
# illustrative values so this chunk runs standalone:
col_red <- "#D55E00"; col_blue <- "#0072B2"; col_orange <- "#E69F00"
fig_dir <- "figures"
if (!dir.exists(fig_dir)) dir.create(fig_dir)

# Simulated data: 5 true signals among 50 predictors
set.seed(577)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
beta_true <- c(rep(2, 5), rep(0, 45))
y <- X %*% beta_true + rnorm(n, 0, 2)

# Fit Ridge
ridge_fit <- glmnet(X, y, alpha = 0)

# Extract coefficient paths
coef_matrix <- as.matrix(ridge_fit$beta)
ridge_data <- as_tibble(t(coef_matrix)) %>%
  mutate(log_lambda = log(ridge_fit$lambda)) %>%
  pivot_longer(-log_lambda, names_to = "variable", values_to = "coefficient") %>%
  mutate(true_nonzero = variable %in% paste0("X", 1:5))

p <- ggplot(ridge_data, aes(x = log_lambda, y = coefficient,
                             group = variable, color = true_nonzero)) +
  geom_line(alpha = 0.7) +
  scale_color_manual(values = c("TRUE" = col_red, "FALSE" = "gray70"),
                     labels = c("TRUE" = "True signal", "FALSE" = "Noise")) +
  labs(title = "Ridge Regression: Coefficient Paths",
       subtitle = "Coefficients shrink smoothly toward zero",
       x = expression(log(lambda)),
       y = "Coefficient Value",
       color = "") +
  theme(legend.position = "bottom")

ggsave(file.path(fig_dir, "09_ridge_path.png"), p, width = 10, height = 5, dpi = 150)
p

Lasso Coefficient Paths

Code
# Fit Lasso
lasso_fit <- glmnet(X, y, alpha = 1)

# Extract coefficient paths
coef_matrix <- as.matrix(lasso_fit$beta)
lasso_data <- as_tibble(t(coef_matrix)) %>%
  mutate(log_lambda = log(lasso_fit$lambda)) %>%
  pivot_longer(-log_lambda, names_to = "variable", values_to = "coefficient") %>%
  mutate(true_nonzero = variable %in% paste0("X", 1:5))

p <- ggplot(lasso_data, aes(x = log_lambda, y = coefficient,
                             group = variable, color = true_nonzero)) +
  geom_line(alpha = 0.7) +
  scale_color_manual(values = c("TRUE" = col_red, "FALSE" = "gray70"),
                     labels = c("TRUE" = "True signal", "FALSE" = "Noise")) +
  labs(title = "Lasso Regression: Coefficient Paths",
       subtitle = "Coefficients hit exactly zero (sparsity)",
       x = expression(log(lambda)),
       y = "Coefficient Value",
       color = "") +
  theme(legend.position = "bottom")

ggsave(file.path(fig_dir, "10_lasso_path.png"), p, width = 10, height = 5, dpi = 150)
p

Python Equivalents

In scikit-learn, use:

  • sklearn.linear_model.Ridge
  • sklearn.linear_model.Lasso
  • sklearn.linear_model.ElasticNet

Note: Lambda scaling differs between packages—always tune via CV within each framework.

cv.glmnet Output

Code
# Cross-validation for Lasso
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

# Save the default plot
png(file.path(fig_dir, "11_cv_glmnet.png"), width = 1000, height = 600, res = 100)
plot(cv_lasso)
title("Cross-Validation for Lasso (cv.glmnet)", line = 2.5)
invisible(dev.off())

# Display
plot(cv_lasso)
title("Cross-Validation for Lasso (cv.glmnet)", line = 2.5)

Reading the Plot

  • Top axis: number of nonzero coefficients at each \(\lambda\)
  • Red dots: CV error; error bars: \(\pm\) 1 SE
  • Left dashed line: lambda.min (minimum CV error)
  • Right dashed line: lambda.1se (most regularized within 1 SE)

Ridge vs Lasso: Comparison

Code
# Cross-validation for both
cv_ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)
cv_lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

# Combine results
comparison_data <- bind_rows(
  tibble(
    log_lambda = log(cv_ridge$lambda),
    cvm = cv_ridge$cvm,
    cvsd = cv_ridge$cvsd,
    method = "Ridge"
  ),
  tibble(
    log_lambda = log(cv_lasso$lambda),
    cvm = cv_lasso$cvm,
    cvsd = cv_lasso$cvsd,
    method = "Lasso"
  )
)

# Best lambdas
best_lambdas <- tibble(
  method = c("Ridge", "Lasso"),
  log_lambda = c(log(cv_ridge$lambda.min), log(cv_lasso$lambda.min)),
  min_cvm = c(min(cv_ridge$cvm), min(cv_lasso$cvm))
)

p <- ggplot(comparison_data, aes(x = log_lambda, y = cvm, color = method)) +
  geom_ribbon(aes(ymin = cvm - cvsd, ymax = cvm + cvsd, fill = method),
              alpha = 0.2, color = NA) +
  geom_line(linewidth = 1) +
  geom_point(data = best_lambdas, aes(x = log_lambda, y = min_cvm),
             size = 4, shape = 18) +
  scale_color_manual(values = c("Ridge" = col_blue, "Lasso" = col_orange)) +
  scale_fill_manual(values = c("Ridge" = col_blue, "Lasso" = col_orange)) +
  labs(title = "Ridge vs Lasso: Cross-Validation Comparison",
       subtitle = "Diamonds mark lambda.min for each method",
       x = expression(log(lambda)),
       y = "CV Mean Squared Error",
       color = "", fill = "") +
  theme(legend.position = "bottom")

ggsave(file.path(fig_dir, "12_ridge_vs_lasso_comparison.png"), p, width = 12, height = 6, dpi = 150)
p

Summary

Comparison: Ridge vs Lasso vs Elastic Net

Property Ridge Lasso Elastic Net
Penalty \(\lambda\sum_j \beta_j^2\) \(\lambda\sum_j |\beta_j|\) \(\lambda[\alpha\sum_j|\beta_j| + \frac{1-\alpha}{2}\sum_j\beta_j^2]\)
Sparsity No Yes Yes
Closed form Yes No No
Correlated predictors Shrinks together Selects one Selects groups
\(p > n\) Handles well Selects \(\leq n\) Handles well
Bayesian prior Gaussian Laplace Mixture
Computation Direct solve Coordinate descent Coordinate descent

Key Takeaways

Remember These Concepts

  1. Bias-variance tradeoff: Regularization introduces bias to reduce variance

  2. Ridge: Shrinks all coefficients; best when all predictors contribute

  3. Lasso: Produces sparse solutions via \(\ell_1\) penalty; performs variable selection

  4. Elastic Net: Combines both; handles correlated predictors

  5. SVD insight: Ridge shrinks along principal components adaptively

  6. Cross-validation: Essential for choosing \(\lambda\); use lambda.1se for parsimony

Looking Ahead

Next week: Model Selection and Resampling Methods

What we covered:

  • Why OLS fails (high dimensions, multicollinearity)
  • Ridge: \(\ell_2\) penalty, SVD shrinkage
  • Lasso: \(\ell_1\) penalty, sparsity
  • Elastic Net: best of both
  • Cross-validation for \(\lambda\)

Coming up:

  • K-fold CV in depth
  • Bootstrap methods
  • Information criteria (AIC, BIC)
  • Model selection strategies
  • Stability selection

Preparation

Review: ESL Chapter 7 (Model Assessment), ISLR Chapter 5 (Resampling Methods)