We can identify two basic applications of regression models: estimating the mean value of the response at a given point, and predicting a new observation at that point.
Consider the regression model
\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i, \ i = 1,2,\cdots,n \]
which can also be written as:
\[ y_i = \begin{bmatrix} 1 & x_{1i} & x_{2i} & \cdots & x_{ki} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \epsilon_i = x_i^T \beta + \epsilon_i, \ i = 1,2,\ldots,n \]
Using the estimated parameters and the residuals:
\[ y_i = \begin{bmatrix} 1 & x_{1i} & x_{2i} & \cdots & x_{ki} \end{bmatrix} \begin{bmatrix} \hat \beta_0 \\ \hat \beta_1 \\ \hat \beta_2 \\ \vdots \\ \hat \beta_k \end{bmatrix} + e_i = x_i^T \hat \beta + e_i, \ i = 1,2,\ldots,n \]
Given a “new” value of the regressors \(x_p^T = [1 \ x_{1p} \ x_{2p} \ \cdots \ x_{kp}]\), the prediction of the mean value of \(y\) at \(x_p\) is defined as:
\[ \hat y_p = x_p^T \hat \beta \]
This is an unbiased estimator of the mean value of \(y_p\), \(E[y_p]\):
\[ E[\hat y_p] = E[x_p^T \hat \beta] = x_p^T \beta \]
with variance:
\[ Var[\hat y_p] = Var[x_p^T \hat \beta] = x_p^T Var[\hat \beta] x_p = \sigma^2 x_p^T (X^TX)^{-1} x_p \]
Therefore,
\[ \hat y_p \sim N(x_p^T \beta, \sigma^2 x_p^T (X^TX)^{-1} x_p) \Rightarrow \frac{\hat y_p - x_p^T \beta}{\hat s_R\sqrt{x_p^T (X^TX)^{-1} x_p}} \sim t_{n-k-1} \]
Finally, the confidence interval for \(x_p^T \beta\) is:
\[ x_p^T \hat \beta - t_{\alpha/2} \hat s_R\sqrt{x_p^T (X^TX)^{-1} x_p} \leq (x_p^T \beta) \leq x_p^T \hat \beta + t_{\alpha/2} \hat s_R\sqrt{x_p^T (X^TX)^{-1} x_p} \]
The prediction error made when the mean-level prediction is used for a new observation \(y_p\) is:
\[ e_{p} = y_p - x_p^T \hat \beta \]
The distribution of \(e_p\) is:
\[ \boxed{ e_p \sim N(0, \sigma^2(1 + x_p^T (X^TX)^{-1} x_p)) } \]
since (the new observation \(y_p\) is independent of the sample used to estimate \(\beta\), so the variances add):
\[ E[e_p] = E[y_p] - x_p^T E[\hat \beta] = x_p^T \beta - x_p^T \beta = 0 \]
\[ Var[e_p] = Var[y_p] + Var[x_p^T \hat \beta] = \sigma^2 + x_p^T Var[\hat \beta] x_p = \sigma^2 (1 + x_p^T (X^TX)^{-1} x_p) \]
Therefore:
\[ \frac{e_p}{\hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p}} \sim t_{n-k-1} \]
And hence:
\[ P \left( - t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \leq e_p \leq t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \right) = 1 - \alpha \]
Finally, the prediction interval is:
\[ P \left( x_p^T \hat \beta - t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \leq y_p \leq x_p^T \hat \beta + t_{\alpha/2} \hat s_R\sqrt{1 + x_p^T (X^TX)^{-1} x_p} \right) = 1 - \alpha \]
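We illustrate these results with a data frame d containing children's test scores (kid_score) and characteristics of their mothers; str(d) shows its structure (the data-loading step is omitted here):

str(d)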
## 'data.frame': 434 obs. of 5 variables:
## $ kid_score: int 65 98 85 83 115 98 69 106 102 95 ...
## $ mom_hs : int 1 1 1 1 1 0 1 1 1 1 ...
## $ mom_iq : num 121.1 89.4 115.4 99.4 92.7 ...
## $ mom_work : int 4 4 4 3 4 1 4 3 1 1 ...
## $ mom_age : int 27 25 27 25 27 18 20 23 24 19 ...
The variables are:

- kid_score: the child's score on a cognitive test;
- mom_hs: whether the mother completed high school;
- mom_iq: the mother's IQ;
- mom_work: the mother's work status (four categories);
- mom_age: the mother's age.

We first fit a simple regression of kid_score on mom_iq and plot the fitted line:
m = lm(kid_score ~ mom_iq, data = d)
plot(d$mom_iq, d$kid_score, pch = 19)
abline(m, col = "red", lwd = 1)  # fitted regression line
We predict kid_score for a mother with mom_iq = 130, computing everything by hand first:

xp = matrix(c(1, 130), ncol = 1)           # new value of the regressors
n = nrow(d)
beta_e = coef(m)                           # estimated coefficients
sR2 = sum(resid(m)^2)/(n-2)                # residual variance
X = model.matrix(m)
mat = t(xp) %*% solve(t(X) %*% X) %*% xp   # x_p^T (X^T X)^{-1} x_p
# prediction of the mean value
(yp_medio = t(xp) %*% beta_e)
## [,1]
## [1,] 105.0965
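The 95% confidence interval for the mean value follows from the formula above (a minimal sketch; qt gives the Student's t quantile):

t_alpha = qt(0.975, df = n - 2)
c(yp_medio - t_alpha*sqrt(sR2*mat), yp_medio + t_alpha*sqrt(sR2*mat))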
## [1] 101.2394 108.9535
In R, the same interval is obtained directly with predict():
xp1 = data.frame(mom_iq = 130)
(yp_medio1 = predict(m, newdata = xp1, interval = "confidence", level = 0.95))
## fit lwr upr
## 1 105.0965 101.2394 108.9535
plot(d$mom_iq, d$kid_score, pch = 19)
abline(m, col = "red", lwd = 1)
points(xp1$mom_iq, yp_medio1[1], col = "red", pch = 19) # point prediction
points(xp1$mom_iq, yp_medio1[2], col = "red", pch = 19) # lower confidence limit
points(xp1$mom_iq, yp_medio1[3], col = "red", pch = 19) # upper confidence limit
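We now compute the prediction interval for a new observation, first by hand (a sketch consistent with the output below; note the extra 1 + term in the variance):

t_alpha = qt(0.975, df = n - 2)
c(yp_medio - t_alpha*sqrt(sR2*(1 + mat)), yp_medio + t_alpha*sqrt(sR2*(1 + mat)))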
## [1] 68.98835 141.20459
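In R (this call also defines yp1, which is used in the plot below):

(yp1 = predict(m, newdata = xp1, interval = "prediction", level = 0.95))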
## fit lwr upr
## 1 105.0965 68.98835 141.2046
plot(d$mom_iq, d$kid_score, pch = 19)
abline(m, col = "red", lwd = 1)
points(xp1$mom_iq, yp_medio1[1], col = "red", pch = 19) # point prediction
points(xp1$mom_iq, yp_medio1[2], col = "red", pch = 19) # lower confidence limit
points(xp1$mom_iq, yp_medio1[3], col = "red", pch = 19) # upper confidence limit
points(xp1$mom_iq, yp1[2], col = "green", pch = 19)     # lower prediction limit
points(xp1$mom_iq, yp1[3], col = "green", pch = 19)     # upper prediction limit
We now predict using the model with all the regressors:
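The summary below corresponds to this fit, reconstructed from its Call line (mom_hs and mom_work enter as factors, judging by the coefficients mom_hsyes, mom_work2, ..., so they are assumed to have been recoded earlier):

m2 = lm(kid_score ~ mom_iq + mom_hs + mom_age + mom_work, data = d)
summary(m2)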
##
## Call:
## lm(formula = kid_score ~ mom_iq + mom_hs + mom_age + mom_work,
## data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.414 -12.095 2.015 11.653 49.100
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.27273 9.39320 2.158 0.0315 *
## mom_iq 0.55288 0.06138 9.008 <2e-16 ***
## mom_hsyes 5.43466 2.32518 2.337 0.0199 *
## mom_age 0.21629 0.33351 0.649 0.5170
## mom_work2 2.98266 2.81289 1.060 0.2896
## mom_work3 5.48824 3.25239 1.687 0.0922 .
## mom_work4 1.41929 2.51621 0.564 0.5730
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.14 on 427 degrees of freedom
## Multiple R-squared: 0.2213, Adjusted R-squared: 0.2103
## F-statistic: 20.22 on 6 and 427 DF, p-value: < 2.2e-16
Recall that the model is: \(kid\_score = \hat \beta_0 + \hat \beta_1 mom\_iq + \hat \beta_2 mom\_hsyes + \hat \beta_3 mom\_age + \hat \beta_4 mom\_work2 + \hat \beta_5 mom\_work3 + \hat \beta_6 mom\_work4 + e\)
# new regressor values: mom_iq = 130, mom_hs = "no", mom_age = 25, mom_work = "3"
xp = matrix(c(1, 130, 0, 25, 0, 1, 0), ncol = 1)
beta_e = coef(m2)
sR2 = sum(resid(m2)^2)/(n-6-1)             # residual variance (k = 6)
X = model.matrix(m2)
mat = t(xp) %*% solve(t(X) %*% X) %*% xp   # x_p^T (X^T X)^{-1} x_p
# prediction of the mean value
(yp_medio = t(xp) %*% beta_e)
## [,1]
## [1,] 103.0423
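Applying the confidence-interval formula (a sketch; the residual degrees of freedom are now n - 6 - 1 = 427):

t_alpha = qt(0.975, df = n - 6 - 1)
c(yp_medio - t_alpha*sqrt(sR2*mat), yp_medio + t_alpha*sqrt(sR2*mat))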
## [1] 95.74381 110.34083
In R:

xp1 = data.frame(mom_iq = 130, mom_hs = "no", mom_age = 25, mom_work = "3")
(yp_medio1 = predict(m2, newdata = xp1, interval = "confidence", level = 0.95))
## fit lwr upr
## 1 103.0423 95.74381 110.3408
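The prediction interval for a new observation, again by hand (a sketch consistent with the output below):

t_alpha = qt(0.975, df = n - 6 - 1)
c(yp_medio - t_alpha*sqrt(sR2*(1 + mat)), yp_medio + t_alpha*sqrt(sR2*(1 + mat)))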
## [1] 66.65286 139.43178
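And with predict():

(yp1 = predict(m2, newdata = xp1, interval = "prediction", level = 0.95))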
## fit lwr upr
## 1 103.0423 66.65286 139.4318
Let us compute the confidence interval for the mean value using the bootstrap:
B = 1000
yp_medio_b = rep(0, B)
e = resid(m2)
for (b in 1:B){
  # resample the residuals and rebuild the response
  eb = sample(e, replace = TRUE)
  yb = fitted(m2) + eb
  # refit the model and predict the mean at xp1
  mb = lm(yb ~ mom_iq + mom_hs + mom_age + mom_work, data = d)
  yp_medio_b[b] = predict(mb, newdata = xp1, interval = "none", level = 0.95)
}
The confidence interval for the mean value is:
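Taking the empirical 2.5% and 97.5% quantiles of the bootstrap replicates (a sketch of the call that produces the output below):

quantile(yp_medio_b, probs = c(0.025, 0.975))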
## 2.5% 97.5%
## 95.75576 110.07076
Let us now compute the prediction interval using the bootstrap, based on the identity:
\[ e_p = y_p - \hat y_p \Rightarrow y_p = \hat y_p + e_p \]
yp_b = rep(0,B)
e = resid(m2)
# the residuals from lm() have variance (approximately) equal to the residual
# variance, but the prediction errors have variance sR2*(1+mat); we rescale the
# residuals to the theoretical variance: (e/sR)*sqrt(sR2*(1+mat)) = e*sqrt(1+mat)
e1 = e * sqrt(1 + mat[1,1])
for (b in 1:B){
  # add a resampled, rescaled residual to each bootstrap mean prediction
  yp_b[b] = yp_medio_b[b] + sample(e1, 1)
}
Results:
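Again taking empirical quantiles (assumed call, consistent with the output below):

quantile(yp_b, probs = c(0.025, 0.975))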
## 2.5% 97.5%
## 65.44265 137.32346