Assume that \(Y = f(X) + \epsilon\), where \(E(\epsilon) = 0\) and \(Var(\epsilon) = \sigma_{\epsilon}^2\). The expected prediction error of a regression fit \(\hat{f}(X)\) at an input point \(X = x_0\) is
\[ \begin{align*} Err(x_0) &= E[(Y - \hat{f}(x_0))^2 | X = x_0] \\ &= E[(Y - f(x_0) + f(x_0) -\hat{f}(x_0))^2 | X = x_0] \\ &= \sigma_{\epsilon}^2 + [E \hat{f}(x_0) - f(x_0)]^2 + E[\hat{f}(x_0)- E\hat{f}(x_0)]^2 \\ &= \text{Irreducible Error} + \text{Bias}^2 + \text{Variance} \end{align*} \]
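For completeness, the cross terms in the expansion above vanish because the new observation's noise is independent of the fitted model:
\[ \begin{align*} E[(Y - \hat{f}(x_0))^2 | X = x_0] &= E[(Y - f(x_0))^2] + 2E[(Y - f(x_0))(f(x_0) - \hat{f}(x_0))] + E[(f(x_0) - \hat{f}(x_0))^2] \\ &= \sigma_{\epsilon}^2 + 0 + E[(f(x_0) - \hat{f}(x_0))^2], \end{align*} \]
since \(Y - f(x_0) = \epsilon\) has mean zero and is independent of \(\hat{f}\), which depends only on the training data. Adding and subtracting \(E\hat{f}(x_0)\) inside the last term then gives the \(\text{Bias}^2 + \text{Variance}\) split, with the new cross term vanishing because \(E[\hat{f}(x_0) - E\hat{f}(x_0)] = 0\).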
Typically, the more complex we make the model \(\hat{f}\), the lower the bias but the higher the variance.
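A minimal simulation sketch of this tradeoff (not part of the original notes; the true function, noise level, test point, and polynomial degrees below are arbitrary choices for illustration): refit polynomials of increasing degree on many simulated training sets and estimate the squared bias and variance of the prediction at a single test point.
set.seed(1)
f = function(x) sin(2 * pi * x)       # hypothetical true regression function
x0 = 0.25                             # test point
sigma_eps = 0.3                       # noise standard deviation
degrees = 1:8
n_sim = 500
preds = matrix(NA, n_sim, length(degrees))
for (s in 1:n_sim) {
  x = runif(50)
  y = f(x) + rnorm(50, sd = sigma_eps)
  for (d in seq_along(degrees)) {
    fit = lm(y ~ poly(x, degrees[d]))
    preds[s, d] = predict(fit, newdata = data.frame(x = x0))
  }
}
bias2 = (colMeans(preds) - f(x0))^2   # squared bias at x0
variance = apply(preds, 2, var)       # variance at x0
# bias^2 typically falls and variance rises as the degree (model complexity) grows
round(cbind(degree = degrees, bias2 = bias2, variance = variance), 4)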
Best subset selection searches through all possible subsets: for each \(k \in \{0,1,...,p \}\), find the subset among the \(\binom{p}{k}\) models of size \(k\) that gives the smallest residual sum of squares. The overall cost is \(O(2^p)\).
An efficient algorithm, leaps and bounds, makes this feasible for \(p\) as large as 30 or 40.
Computational and statistical problems: it cannot deal with large \(p\), and the large search space can lead to overfitting.
regsubsets function from the leaps package | documentation
plot function to display the selected variables in each model, ordered by the model selection criterion, e.g. \(C_p\), AIC, BIC, Adjusted \(R^2\).
# library(ISLR)
library(leaps)
library(glmnet)
data("mtcars")
reg.full = regsubsets(mpg~., data = mtcars,nvmax = 10)
summary(reg.full)
## Subset selection object
## Call: regsubsets.formula(mpg ~ ., data = mtcars, nvmax = 10)
## 10 Variables (and intercept)
## Forced in Forced out
## cyl FALSE FALSE
## disp FALSE FALSE
## hp FALSE FALSE
## drat FALSE FALSE
## wt FALSE FALSE
## qsec FALSE FALSE
## vs FALSE FALSE
## am FALSE FALSE
## gear FALSE FALSE
## carb FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: exhaustive
## cyl disp hp drat wt qsec vs am gear carb
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " "
## 4 ( 1 ) " " " " "*" " " "*" "*" " " "*" " " " "
## 5 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " "
## 6 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" " " " "
## 7 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" " "
## 8 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 9 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
plot(reg.full, scale = "bic")
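As a short follow-up sketch using the reg.full object above, the summary object stores the criteria directly, so the BIC-optimal subset can be pulled out programmatically:
reg.sum = summary(reg.full)
which.min(reg.sum$bic)                   # size of the BIC-optimal model
coef(reg.full, which.min(reg.sum$bic))   # its coefficient estimates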
Stepwise selection seeks a good path: a greedy algorithm produces a nested sequence of models, at a cost of \(O(p^2)\).
It has computational and statistical advantages over best subset selection.
Selection criteria: \(C_p\), AIC, BIC, Adjusted \(R^2\), cross-validation.
regsubsets function with method = c("forward", "backward")
stepAIC function from the MASS package | documentation
reg.fwd = regsubsets(mpg~., data = mtcars,nvmax = 10,method = "forward")
summary(reg.fwd)
## Subset selection object
## Call: regsubsets.formula(mpg ~ ., data = mtcars, nvmax = 10, method = "forward")
## 10 Variables (and intercept)
## Forced in Forced out
## cyl FALSE FALSE
## disp FALSE FALSE
## hp FALSE FALSE
## drat FALSE FALSE
## wt FALSE FALSE
## qsec FALSE FALSE
## vs FALSE FALSE
## am FALSE FALSE
## gear FALSE FALSE
## carb FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: forward
## cyl disp hp drat wt qsec vs am gear carb
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " "*" " " " " " " " " " "
## 3 ( 1 ) "*" " " "*" " " "*" " " " " " " " " " "
## 4 ( 1 ) "*" " " "*" " " "*" " " " " "*" " " " "
## 5 ( 1 ) "*" " " "*" " " "*" "*" " " "*" " " " "
## 6 ( 1 ) "*" "*" "*" " " "*" "*" " " "*" " " " "
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" " " " "
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" " "
## 9 ( 1 ) "*" "*" "*" "*" "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
reg.bwd = regsubsets(mpg~., data = mtcars,nvmax = 10,method = "backward")
summary(reg.bwd)
## Subset selection object
## Call: regsubsets.formula(mpg ~ ., data = mtcars, nvmax = 10, method = "backward")
## 10 Variables (and intercept)
## Forced in Forced out
## cyl FALSE FALSE
## disp FALSE FALSE
## hp FALSE FALSE
## drat FALSE FALSE
## wt FALSE FALSE
## qsec FALSE FALSE
## vs FALSE FALSE
## am FALSE FALSE
## gear FALSE FALSE
## carb FALSE FALSE
## 1 subsets of each size up to 10
## Selection Algorithm: backward
## cyl disp hp drat wt qsec vs am gear carb
## 1 ( 1 ) " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " "*" "*" " " " " " " " "
## 3 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " "
## 4 ( 1 ) " " " " "*" " " "*" "*" " " "*" " " " "
## 5 ( 1 ) " " "*" "*" " " "*" "*" " " "*" " " " "
## 6 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" " " " "
## 7 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" " "
## 8 ( 1 ) " " "*" "*" "*" "*" "*" " " "*" "*" "*"
## 9 ( 1 ) " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
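The stepAIC function mentioned above is not run in these notes; a minimal sketch of backward stepwise selection by AIC on the same data (the direction and starting model are illustrative choices):
library(MASS)
full.lm = lm(mpg ~ ., data = mtcars)
step.fit = stepAIC(full.lm, direction = "backward", trace = FALSE)
coef(step.fit)   # predictors retained by AIC-based backward elimination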
For the Gaussian model containing \(d\) predictors,
\[ \begin{align*} C_p &= \frac{1}{n}(RSS + 2d \hat{\sigma}^2) \\ AIC &= \frac{1}{n \hat{\sigma}^2}(RSS + 2d \hat{\sigma}^2) \end{align*} \]
\(C_p\) and AIC are proportional to each other in least squares models.
\(C_p\) adds a penalty to the training RSS to adjust for the fact that the training error tends to underestimate the test error.
The model with the lowest \(C_p\) value is the best.
\[ BIC = \frac{1}{n \hat{\sigma}^2}(RSS + \log(n)\, d\, \hat{\sigma}^2) \]
\[ \text{Adjusted} \ R^2 = 1- \frac{RSS/ (n-d-1)}{TSS/(n-1)} \]
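A minimal sketch of computing these formulas by hand from the reg.full fit above, assuming \(\hat{\sigma}^2\) is estimated from the full least-squares model (one common choice):
n = nrow(mtcars)
d = 1:10                                          # number of predictors in each model
rss = summary(reg.full)$rss
sigma2 = summary(lm(mpg ~ ., data = mtcars))$sigma^2
tss = sum((mtcars$mpg - mean(mtcars$mpg))^2)
cp = (rss + 2 * d * sigma2) / n
bic = (rss + log(n) * d * sigma2) / (n * sigma2)
adjr2 = 1 - (rss / (n - d - 1)) / (tss / (n - 1))
c(which.min(cp), which.min(bic), which.max(adjr2))   # model sizes chosen by each criterion
(Note that regsubsets reports the classical Mallows' \(C_p\), an increasing affine transformation of the version above, so the two rank models identically when the same \(\hat{\sigma}^2\) is used.)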
Problems with subset selection: it is a discrete process (each variable is either retained or discarded), so it exhibits high variability and often does not reduce the prediction error of the full model.
Alternative to subset selection: fit a model containing all \(p\) predictors using a technique that constrains or regularizes the coefficient estimates.
\[ \hat{\beta} = argmin_{\beta} \sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]
glmnet and cv.glmnet functions from the glmnet package | documentation | vignette
alpha argument: alpha = 0 gives ridge regression, alpha = 1 gives the lasso.
plot function: produces the coefficient paths for a glmnet model.
xvar argument: "norm" plots against the L1-norm of the coefficients, "lambda" plots against the log-lambda sequence.
cv.glmnet: cross-validation to determine the value of \(\lambda\).
lambda.min: the value of \(\lambda\) that gives minimum mean cross-validation error.
lambda.1se: the largest value of \(\lambda\) such that the error is within 1 standard error of the minimum.
library(glmnet)
x = model.matrix(mpg~.,data = mtcars)[,-1]
y = mtcars$mpg
grid = 10^seq(4,-2,length.out = 100)
ridge.fit = glmnet(x,y,alpha = 0, lambda = grid)
plot(ridge.fit, label = T)
plot(ridge.fit, xvar = "lambda",label = T)
ridge.fit = cv.glmnet(x,y,alpha = 0, lambda = grid)
plot(ridge.fit)
lambda_opt = ridge.fit$lambda.min
ridge.coef = predict(ridge.fit, type = "coefficients", s = lambda_opt)[1:11,]
ridge.coef
## (Intercept) cyl disp hp drat
## 21.236907423 -0.358556850 -0.004945645 -0.011903921 1.045337278
## wt qsec vs am gear
## -1.337003381 0.173271645 0.718580031 1.730146134 0.554486713
## carb
## -0.587817375
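A short usage sketch (the rows chosen for newx are arbitrary): the fitted cv.glmnet object can also be used for prediction at the selected \(\lambda\).
predict(ridge.fit, newx = x[1:5, ], s = lambda_opt)   # predicted mpg for the first 5 cars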
\[ \hat{\beta} = argmin_{\beta} \sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j| \]
lasso.fit = glmnet(x,y,alpha = 1, lambda = grid)
plot(lasso.fit)
lasso.fit = cv.glmnet(x,y,alpha = 1, lambda = grid)
plot(lasso.fit)
lambda_opt_lasso = lasso.fit$lambda.min
lasso.coef = predict(lasso.fit, type = "coefficients", s = lambda_opt_lasso)[1:11,]
lasso.coef
## (Intercept) cyl disp hp drat wt
## 36.44085178 -0.89055459 0.00000000 -0.01295111 0.00000000 -2.78242409
## qsec vs am gear carb
## 0.00000000 0.00000000 0.02855393 0.00000000 0.00000000
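For comparison (a sketch), the coefficients at the more conservative lambda.1se described above shrink further and typically zero out more predictors:
predict(lasso.fit, type = "coefficients", s = lasso.fit$lambda.1se)[1:11,]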
The elastic net is a compromise between the ridge and the lasso penalties.
\[ \hat{\beta} = argmin_{\beta} \ \frac{1}{2} \sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda [ \frac{1}{2} (1-\alpha)\| \beta \|_2^2 + \alpha \| \beta \|_1 ]\] When \(\alpha = 1\), the penalty reduces to the \(\ell_1\) norm (lasso); when \(\alpha = 0\), it reduces to the squared \(\ell_2\) norm (ridge).
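A minimal elastic net sketch with glmnet, using alpha = 0.5 as an arbitrary intermediate value (reusing x, y, and grid defined above):
enet.fit = cv.glmnet(x, y, alpha = 0.5, lambda = grid)
lambda_opt_enet = enet.fit$lambda.min
predict(enet.fit, type = "coefficients", s = lambda_opt_enet)[1:11,]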
When predictors have a natural group structure, the group lasso makes all coefficients within a group become nonzero (or zero) simultaneously.
\[ \hat{\theta} = argmin_{\theta} \ \frac{1}{2} \sum_{i=1}^N (y_i - \theta_0 - \sum_{j=1}^J z_{ij} \theta_j)^2 + \lambda \sum_{j=1}^J \| \theta_j \|_2\]
where \(\|\theta_j \|_2\) is the Euclidean norm of the vector \(\theta_j\).
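These notes do not fit a group lasso; the sketch below assumes the grpreg package (not used elsewhere here) and an entirely hypothetical grouping of the mtcars predictors, purely for illustration:
library(grpreg)
# Hypothetical groups: engine, chassis, performance, transmission
x.grp = as.matrix(mtcars[, c("cyl", "disp", "hp", "vs", "carb",
                             "drat", "wt", "qsec", "am", "gear")])
grp = c(1, 1, 1, 1, 1, 2, 2, 3, 4, 4)
grp.fit = grpreg(x.grp, y, group = grp, penalty = "grLasso")
plot(grp.fit)   # coefficient paths; whole groups enter or leave together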
The ridge regression coefficient estimates are shrunken proportionally towards zero.
The lasso coefficient estimates are soft-thresholded: each coefficient is moved towards zero by a constant amount \(\lambda\), and coefficients smaller than \(\lambda\) in absolute value are set exactly to zero.
Best subset selection corresponds to hard-thresholding.
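To make the three behaviors concrete, in the special case of an orthonormal design the three estimators act on the least-squares coefficients \(\hat{\beta}_j\) in closed form (a standard textbook result, stated here for reference, up to the convention used for scaling \(\lambda\)):
\[ \begin{align*} \text{Best subset (size } M)&: \quad \hat{\beta}_j \cdot I\big(|\hat{\beta}_j| \geq |\hat{\beta}_{(M)}|\big) \\ \text{Ridge}&: \quad \hat{\beta}_j / (1 + \lambda) \\ \text{Lasso}&: \quad \text{sign}(\hat{\beta}_j)\big(|\hat{\beta}_j| - \lambda\big)_+ \end{align*} \]
where \(\hat{\beta}_{(M)}\) denotes the \(M\)-th largest least-squares coefficient in absolute value.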
[SLS] Hastie, T., Tibshirani, R., & Wainwright, M. (2015). Statistical learning with sparsity: the lasso and generalizations. Chapman and Hall/CRC.
[glmnet vignette] Hastie, T., & Qian, J. (2014). Glmnet vignette. Retrieved from http://www.web.stanford.edu/~hastie/Papers/Glmnet_Vignette.pdf. Accessed September 20, 2016.