Introduction to Linear Regression

Consider the following linear regression model:

\[ y_i = 2 + 0.5 x_i + u_i, \ u_i \sim N(0,1) \]

We will generate data from this model. Suppose the regressor takes the values:

(x = 1:10)

##  [1]  1  2  3  4  5  6  7  8  9 10

We generate the random error terms:

set.seed(12345)
n = length(x) # number of observations
(u1 = rnorm(n, mean = 0, sd = 1))

##  [1]  0.5855288  0.7094660 -0.1093033 -0.4534972  0.6058875 -1.8179560
##  [7]  0.6300986 -0.2761841 -0.2841597 -0.9193220

The corresponding response variable is:

(y1 = 2 + 0.5*x + u1)

##  [1] 3.085529 3.709466 3.390697 3.546503 5.105887 3.182044 6.130099
##  [8] 5.723816 6.215840 6.080678

plot(x,y1, col = "red", ylim = c(0,10), pch = 19)
abline(2,0.5, col = "blue", lty = 2)

We estimate the regression line from the data:

m1 = lm(y1 ~ x)

We then plot it (solid red) together with the “theoretical” line (dashed blue) and the data:

plot(x,y1, col = "red", ylim = c(0,10), pch = 19)
abline(2,0.5, col = "blue", lty = 2)
abline(m1, col = "red")
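
For a simple regression, the least squares estimates returned by `lm` have a closed form, which may help to see what estimating the line amounts to. The sketch below recomputes them by hand from the printed values of `y1`, so it is only accurate to the printed precision. The names `x_obs`, `y_obs`, `b0_hat`, `b1_hat` and `fit_obs` are illustrative and not part of the original code.

```r
# Data copied from the printed output of x and y1 (rounded to 6 decimals)
x_obs <- 1:10
y_obs <- c(3.085529, 3.709466, 3.390697, 3.546503, 5.105887,
           3.182044, 6.130099, 5.723816, 6.215840, 6.080678)

# Closed-form least squares estimates for y = b0 + b1 * x
sxx <- sum((x_obs - mean(x_obs))^2)                    # sum of squares of x
sxy <- sum((x_obs - mean(x_obs)) * (y_obs - mean(y_obs)))
b1_hat <- sxy / sxx                                    # slope
b0_hat <- mean(y_obs) - b1_hat * mean(x_obs)           # intercept

# Same fit through lm, for comparison
fit_obs <- lm(y_obs ~ x_obs)

c(b0_hat = b0_hat, b1_hat = b1_hat)
coef(fit_obs)
```

Up to rounding, both results should agree with the coefficients of `m1`.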

If we repeat the process with a new sample of errors:

u2 = rnorm(n, mean = 0, sd = 1)
y2 = 2 + 0.5*x + u2
m2 = lm(y2 ~ x)

plot(x,y1, col = "red", ylab = "y", ylim = c(0,10), pch = 19)
points(x,y2, col = "green", pch = 19)
abline(2,0.5, col = "blue", lty = 2)
abline(m1, col = "red")
abline(m2, col = "green")

If we repeat this many times, we can look at how the estimates vary from sample to sample:

nmuestras = 1000              # number of simulated samples
beta0 = numeric(nmuestras)    # storage for the intercept estimates
beta1 = numeric(nmuestras)    # storage for the slope estimates
for (k in 1:nmuestras){
  u = rnorm(n, mean = 0, sd = 1)   # new errors for sample k
  y = 2 + 0.5*x + u
  m = lm(y ~ x)
  beta0[k] = coef(m)["(Intercept)"]
  beta1[k] = coef(m)["x"]
}
# histograms of the estimates with a normal density overlaid
par(mfrow = c(1,2))
hist(beta0, freq = FALSE)
curve(dnorm(x, mean = mean(beta0), sd = sd(beta0)), add = TRUE)
hist(beta1, freq = FALSE)
curve(dnorm(x, mean = mean(beta1), sd = sd(beta1)), add = TRUE)

c(mean(beta0),sd(beta0))

## [1] 1.9777567 0.6799878

c(mean(beta1),sd(beta1))

## [1] 0.5039175 0.1096082
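
Because `x` is the same in every sample, the same simulation can also be written without an explicit loop. The following is a minimal sketch, assuming that `lm` is given a matrix response with one simulated sample per column; the seed and the names `x_sim`, `n_rep`, `U`, `Y` and `B` are hypothetical and chosen only for this example, so the numbers will differ slightly from those above.

```r
set.seed(2024)            # arbitrary seed, only to make this sketch reproducible
x_sim <- 1:10
n_sim <- length(x_sim)
n_rep <- 1000             # number of simulated samples

# Each column of U holds the errors of one sample
U <- matrix(rnorm(n_sim * n_rep, mean = 0, sd = 1), nrow = n_sim)

# x_sim is recycled down each column, so every column is one simulated response
Y <- 2 + 0.5 * x_sim + U

# With a matrix response, lm fits one regression per column
fits <- lm(Y ~ x_sim)
B <- coef(fits)           # 2 x n_rep matrix: row 1 intercepts, row 2 slopes

rowMeans(B)               # averages of the estimates
apply(B, 1, sd)           # standard deviations of the estimates
```

The averages should be close to 2 and 0.5, and the standard deviations close to the values obtained with the loop.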

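The theoretical sampling distribution of the estimators is normal and centered at the true parameter value (the second argument below denotes the standard deviation):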

\[\hat \beta_0 \sim N \left( \beta_0, SE(\hat \beta_0) \right)\]

\[\hat \beta_1 \sim N \left( \beta_1, SE(\hat \beta_1) \right)\]

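For a single sample, `summary` reports the estimated coefficients together with estimates of these standard errors (column `Std. Error`):

summary(m1)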

## 
## Call:
## lm(formula = y1 ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6229 -0.2721  0.1633  0.3765  0.9495 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  2.55061    0.52272    4.88  0.00123 **
## x            0.37572    0.08424    4.46  0.00211 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7652 on 8 degrees of freedom
## Multiple R-squared:  0.7132, Adjusted R-squared:  0.6773 
## F-statistic: 19.89 on 1 and 8 DF,  p-value: 0.002111
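
The `Std. Error` column of `summary(m1)` estimates the standard deviations of the estimators. Under the model assumptions, with error standard deviation \(\sigma\), these are

\[ sd(\hat \beta_1) = \frac{\sigma}{\sqrt{\sum_i (x_i - \bar x)^2}}, \qquad sd(\hat \beta_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar x^2}{\sum_i (x_i - \bar x)^2}} \]

As a sketch, they can be evaluated both with the true \(\sigma = 1\) used to generate the data and with the residual standard error 0.7652 printed above. The names `x_obs`, `sigma`, `s_hat` and the result names are illustrative, not part of the original code.

```r
x_obs <- 1:10
n_obs <- length(x_obs)
sxx <- sum((x_obs - mean(x_obs))^2)   # sum of squares of x around its mean

# Factors that multiply the error standard deviation
f_b0 <- sqrt(1 / n_obs + mean(x_obs)^2 / sxx)
f_b1 <- 1 / sqrt(sxx)

# Theoretical standard deviations, using the true sigma of the simulation
sigma <- 1
sd_b0 <- sigma * f_b0
sd_b1 <- sigma * f_b1

# Estimated standard errors, using the residual standard error of m1
s_hat <- 0.7652                       # copied from the printed summary(m1)
se_b0 <- s_hat * f_b0
se_b1 <- s_hat * f_b1

round(c(sd_b0 = sd_b0, sd_b1 = sd_b1), 4)
round(c(se_b0 = se_b0, se_b1 = se_b1), 4)
```

The theoretical values (about 0.683 and 0.110) are close to the standard deviations observed over the 1000 simulated samples, while the estimated ones should reproduce the `Std. Error` column up to rounding.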

Javier Cara

Academic year 2018-19