We can identify two basic applications of regression models: estimating the mean value of the response at a given point, and predicting a new observation at that point.
Consider the regression model
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i, \ i = 1,2,\cdots,n \]
which can also be written as:
\[ y_i = \begin{bmatrix} 1 & x_{1i} & x_{2i} & \cdots & x_{ki} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \epsilon_i = x_i^T \beta + \epsilon_i, \ i = 1,2,\ldots,n \]
Using the estimated parameters and the residuals:
\[ y_i = \begin{bmatrix} 1 & x_{1i} & x_{2i} & \cdots & x_{ki} \end{bmatrix} \begin{bmatrix} \hat \beta_0 \\ \hat \beta_1 \\ \hat \beta_2 \\ \vdots \\ \hat \beta_k \end{bmatrix} + e_i = x_i^T \hat \beta + e_i, \ i = 1,2,\ldots,n \]
Given a “new” value of the regressors \(x_p^T = [1 \ x_{1p} \ x_{2p} \ \cdots \ x_{kp}]\), the prediction of the mean value of \(y\) at \(x_p\) is defined as:
\[ \hat y_p = x_p^T \hat \beta \]
This is an unbiased estimator of the mean value of \(y_p\), \(E[y_p]\):
\[ E[\hat y_p] = E[x_p^T \hat \beta] = x_p^T \beta \]
with variance:
\[ Var[\hat y_p] = Var[x_p^T \hat \beta] = x_p^T Var[\hat \beta] x_p = \sigma^2 x_p^T (X^TX)^{-1} x_p \]
Therefore,
\[ \hat y_p \sim N(x_p^T \beta, \sigma^2 x_p^T (X^TX)^{-1} x_p) \Rightarrow \frac{\hat y_p - x_p^T \beta}{\hat s_R\sqrt{x_p^T (X^TX)^{-1} x_p}} \sim t_{n-k-1} \]
Finally, the confidence interval for \(x_p^T \beta\) is:
\[ x_p^T \hat \beta - t_{\alpha/2} \hat s_R\sqrt{x_p^T (X^TX)^{-1} x_p} \leq (x_p^T \beta) \leq x_p^T \hat \beta + t_{\alpha/2} \hat s_R\sqrt{x_p^T (X^TX)^{-1} x_p} \]
The prediction error made when the mean-level prediction is used for a new observation \(y_p\) is:
\[ e_{p} = y_p - x_p^T \hat \beta \]
The distribution of \(e_p\) is:
\[ \boxed{ e_p \sim N(0, \sigma^2(1 + x_p^T (X^TX)^{-1} x_p)) } \]
since (the new observation \(y_p\) is independent of the sample used to estimate \(\beta\), so the variances add):
\[ E[e_p] = E[y_p] - x_p^T E[\hat \beta] = x_p^T \beta - x_p^T \beta = 0 \]
\[ Var[e_p] = Var[y_p] + Var[x_p^T \hat \beta] = \sigma^2 + x_p^T Var[\hat \beta] x_p = \sigma^2 (1 + x_p^T (X^TX)^{-1} x_p) \]
Therefore:
\[ \frac{e_p}{\hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p}} \sim t_{n-k-1} \]
And hence:
\[ P \left( - t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \leq e_p \leq t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \right) = 1 - \alpha \]
Finally, the prediction interval is:
\[ P \left( x_p^T \hat \beta - t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \leq y_p \leq x_p^T \hat \beta + t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \right) = 1 - \alpha \]
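We illustrate these results with a data frame d containing children's test scores (kid_score) and characteristics of their mothers; str(d) shows its structure (the data-loading step is omitted here):

str(d)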
## 'data.frame': 434 obs. of 5 variables:
## $ kid_score: int 65 98 85 83 115 98 69 106 102 95 ...
## $ mom_hs : int 1 1 1 1 1 0 1 1 1 1 ...
## $ mom_iq : num 121.1 89.4 115.4 99.4 92.7 ...
## $ mom_work : int 4 4 4 3 4 1 4 3 1 1 ...
## $ mom_age : int 27 25 27 25 27 18 20 23 24 19 ...
The variables are:

- kid_score: the child's score on a cognitive test;
- mom_hs: whether the mother completed high school;
- mom_iq: the mother's IQ;
- mom_work: the mother's work status (four categories);
- mom_age: the mother's age.

We first fit a simple regression of kid_score on mom_iq and plot the fitted line:
m = lm(kid_score ~ mom_iq, data = d)
plot(d$mom_iq, d$kid_score, pch = 19)
abline(m, col = "red", lwd = 1)  # fitted regression line
We predict kid_score for a mother with mom_iq = 130, computing everything by hand first:

xp = matrix(c(1, 130), ncol = 1)           # new value of the regressors
n = nrow(d)
beta_e = coef(m)                           # estimated coefficients
sR2 = sum(resid(m)^2)/(n-2)                # residual variance
X = model.matrix(m)
mat = t(xp) %*% solve(t(X) %*% X) %*% xp   # x_p^T (X^T X)^{-1} x_p
# prediction of the mean value
(yp_medio = t(xp) %*% beta_e)
## [,1]
## [1,] 105.0965
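The 95% confidence interval for the mean value follows from the formula above (a minimal sketch; qt gives the Student's t quantile):

t_alpha = qt(0.975, df = n - 2)
c(yp_medio - t_alpha*sqrt(sR2*mat), yp_medio + t_alpha*sqrt(sR2*mat))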
## [1] 101.2394 108.9535
In R, the same interval is obtained directly with predict():
xp1 = data.frame(mom_iq = 130)
(yp_medio1 = predict(m, newdata = xp1, interval = "confidence", level = 0.95))
## fit lwr upr
## 1 105.0965 101.2394 108.9535
plot(d$mom_iq, d$kid_score, pch = 19)
abline(m, col = "red", lwd = 1)
points(xp1$mom_iq, yp_medio1[1], col = "red", pch = 19) # point prediction
points(xp1$mom_iq, yp_medio1[2], col = "red", pch = 19) # lower confidence limit
points(xp1$mom_iq, yp_medio1[3], col = "red", pch = 19) # upper confidence limit
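We now compute the prediction interval for a new observation, first by hand (a sketch consistent with the output below; note the extra 1 + term in the variance):

t_alpha = qt(0.975, df = n - 2)
c(yp_medio - t_alpha*sqrt(sR2*(1 + mat)), yp_medio + t_alpha*sqrt(sR2*(1 + mat)))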
## [1] 68.98835 141.20459
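In R (this call also defines yp1, which is used in the plot below):

(yp1 = predict(m, newdata = xp1, interval = "prediction", level = 0.95))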
## fit lwr upr
## 1 105.0965 68.98835 141.2046
plot(d$mom_iq, d$kid_score, pch = 19)
abline(m, col = "red", lwd = 1)
points(xp1$mom_iq, yp_medio1[1], col = "red", pch = 19) # point prediction
points(xp1$mom_iq, yp_medio1[2], col = "red", pch = 19) # lower confidence limit
points(xp1$mom_iq, yp_medio1[3], col = "red", pch = 19) # upper confidence limit
points(xp1$mom_iq, yp1[2], col = "green", pch = 19)     # lower prediction limit
points(xp1$mom_iq, yp1[3], col = "green", pch = 19)     # upper prediction limit
We now predict using the model with all the regressors:
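The summary below corresponds to this fit, reconstructed from its Call line (mom_hs and mom_work enter as factors, judging by the coefficients mom_hsyes, mom_work2, ..., so they are assumed to have been recoded earlier):

m2 = lm(kid_score ~ mom_iq + mom_hs + mom_age + mom_work, data = d)
summary(m2)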
##
## Call:
## lm(formula = kid_score ~ mom_iq + mom_hs + mom_age + mom_work,
## data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.414 -12.095 2.015 11.653 49.100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.27273 9.39320 2.158 0.0315 *
## mom_iq 0.55288 0.06138 9.008 <2e-16 ***
## mom_hsyes 5.43466 2.32518 2.337 0.0199 *
## mom_age 0.21629 0.33351 0.649 0.5170
## mom_work2 2.98266 2.81289 1.060 0.2896
## mom_work3 5.48824 3.25239 1.687 0.0922 .
## mom_work4 1.41929 2.51621 0.564 0.5730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.14 on 427 degrees of freedom
## Multiple R-squared: 0.2213, Adjusted R-squared: 0.2103
## F-statistic: 20.22 on 6 and 427 DF, p-value: < 2.2e-16
Recall that the model is: \(kid\_score = \hat \beta_0 + \hat \beta_1 mom\_iq + \hat \beta_2 mom\_hsyes + \hat \beta_3 mom\_age + \hat \beta_4 mom\_work2 + \hat \beta_5 mom\_work3 + \hat \beta_6 mom\_work4 + e\)
# new regressor values: mom_iq = 130, mom_hs = "no", mom_age = 25, mom_work = "3"
xp = matrix(c(1, 130, 0, 25, 0, 1, 0), ncol = 1)
beta_e = coef(m2)
sR2 = sum(resid(m2)^2)/(n-6-1)             # residual variance (k = 6)
X = model.matrix(m2)
mat = t(xp) %*% solve(t(X) %*% X) %*% xp   # x_p^T (X^T X)^{-1} x_p
# prediction of the mean value
(yp_medio = t(xp) %*% beta_e)
## [,1]
## [1,] 103.0423
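Applying the confidence-interval formula (a sketch; the residual degrees of freedom are now n - 6 - 1 = 427):

t_alpha = qt(0.975, df = n - 6 - 1)
c(yp_medio - t_alpha*sqrt(sR2*mat), yp_medio + t_alpha*sqrt(sR2*mat))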
## [1] 95.74381 110.34083
In R:

xp1 = data.frame(mom_iq = 130, mom_hs = "no", mom_age = 25, mom_work = "3")
(yp_medio1 = predict(m2, newdata = xp1, interval = "confidence", level = 0.95))
## fit lwr upr
## 1 103.0423 95.74381 110.3408
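The prediction interval for a new observation, again by hand (a sketch consistent with the output below):

t_alpha = qt(0.975, df = n - 6 - 1)
c(yp_medio - t_alpha*sqrt(sR2*(1 + mat)), yp_medio + t_alpha*sqrt(sR2*(1 + mat)))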
## [1] 66.65286 139.43178
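And with predict():

(yp1 = predict(m2, newdata = xp1, interval = "prediction", level = 0.95))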
## fit lwr upr
## 1 103.0423 66.65286 139.4318
Let us compute the confidence interval for the mean value using the bootstrap:
B = 1000
yp_medio_b = rep(0, B)
e = resid(m2)
for (b in 1:B){
  # resample the residuals and rebuild the response
  eb = sample(e, replace = TRUE)
  yb = fitted(m2) + eb
  # refit the model and predict the mean at xp1
  mb = lm(yb ~ mom_iq + mom_hs + mom_age + mom_work, data = d)
  yp_medio_b[b] = predict(mb, newdata = xp1, interval = "none", level = 0.95)
}
The confidence interval for the mean value is:
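Taking the empirical 2.5% and 97.5% quantiles of the bootstrap replicates (a sketch of the call that produces the output below):

quantile(yp_medio_b, probs = c(0.025, 0.975))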
## 2.5% 97.5%
## 95.75576 110.07076
Let us now compute the prediction interval using the bootstrap, based on the identity:
\[ e_p = y_p - \hat y_p \Rightarrow y_p = \hat y_p + e_p \]
yp_b = rep(0,B)
e = resid(m2)
# the residuals from lm() have variance (approximately) equal to the residual
# variance, but the prediction errors have variance sR2*(1+mat); we rescale the
# residuals to the theoretical variance: (e/sR)*sqrt(sR2*(1+mat)) = e*sqrt(1+mat)
e1 = e * sqrt(1 + mat[1,1])
for (b in 1:B){
  # add a resampled, rescaled residual to each bootstrap mean prediction
  yp_b[b] = yp_medio_b[b] + sample(e1, 1)
}
Results:
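Again taking empirical quantiles (assumed call, consistent with the output below):

quantile(yp_b, probs = c(0.025, 0.975))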
## 2.5% 97.5%
## 65.44265 137.32346