1 Model

We have a data set organized in the following table:

\[\begin{equation} \begin{matrix} y & x_{1} & x_{2} & \cdots & x_{k} \\ \hline y_1 & x_{11} & x_{21} & \cdots & x_{k1} \\ y_2 & x_{12} & x_{22} & \cdots & x_{k2} \\ \cdots &\cdots & \cdots & \cdots & \cdots \\ y_n & x_{1n} & x_{2n} & \cdots & x_{kn} \\ \end{matrix} \end{equation}\]

The model we will use to analyze these data is the linear model:

\[\begin{equation} y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i, \ i = 1,2,\cdots,n \end{equation}\]

The term \(y\) is known as the response variable, and the \(x_j\) are known as regressors.
The term \(\epsilon\) represents the model error.
The model is called linear because its equation is a linear function of the parameters \(\beta_0\), \(\beta_1\), \(\beta_2\), \(\ldots\), \(\beta_k\).
Models that look complicated can often be treated as linear models. For example:
polynomials:

\[\begin{equation} y = \beta_0 + \beta_1 x + \beta_2 x^2 + \epsilon \Rightarrow y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \quad x_1 = x, \ x_2 = x^2 \end{equation}\]

models with functions of the regressors:

\[\begin{equation} y = \beta_0 + \beta_1 x + \beta_2 \log x + \epsilon \Rightarrow y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon, \quad x_1 = x, \ x_2 = \log x \end{equation}\]

models with interaction terms:

\[\begin{equation} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \end{equation}\]

In contrast, the following model is not linear, because the parameter \(\beta_2\) appears as an exponent:

\[\begin{equation} y = \beta_0 + \beta_1 x^{\beta_2} + \epsilon \end{equation}\]
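As a minimal sketch of the polynomial case, the quadratic model can be fitted with ordinary linear machinery once \(x\) and \(x^2\) are treated as two separate regressors. The data are simulated, and the true coefficients (1, 2, -0.5) and the noise level are assumptions chosen only for illustration. `lm()` is base R's least squares fitter.

```r
# Simulated data (illustrative assumption, not taken from the text)
set.seed(1)
n = 50
x = seq(0, 5, length.out = n)
y = 1 + 2 * x - 0.5 * x^2 + rnorm(n, sd = 0.2)

# Treat x and x^2 as two regressors x1 and x2:
# the model is still linear in beta_0, beta_1, beta_2
x1 = x
x2 = x^2
fit = lm(y ~ x1 + x2)
coef(fit)
```

The estimated coefficients should land close to the values used in the simulation, although the exact numbers depend on the random seed.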

The model equation is usually written in matrix notation. To do so, we write out the equation for every available observation:

\[\begin{equation} i = 1 \Rightarrow y_1 = \beta_0 + \beta_1 x_{11} + \beta_2 x_{21} + \cdots + \beta_k x_{k1} + \epsilon_1 \end{equation}\]

\[\begin{equation} i = 2 \Rightarrow y_2 = \beta_0 + \beta_1 x_{12} + \beta_2 x_{22} + \cdots + \beta_k x_{k2} + \epsilon_2 \end{equation}\]

\[\begin{equation} \cdots \end{equation}\]

\[\begin{equation} i = n \Rightarrow y_n = \beta_0 + \beta_1 x_{1n} + \beta_2 x_{2n} + \cdots + \beta_k x_{kn} + \epsilon_n \end{equation}\]

Stacking these equations:

\[\begin{equation} \begin{bmatrix} y_1 \\ y_2 \\ \cdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{21} & \cdots & x_{k1} \\ 1 & x_{12} & x_{22} & \cdots & x_{k2} \\ \cdots &\cdots & \cdots & \cdots & \cdots \\ 1 & x_{1n} & x_{2n} & \cdots & x_{kn} \\ \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \cdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \cdots \\ \epsilon_n \end{bmatrix} \end{equation}\]

Finally, in matrix notation:

\[\begin{equation} y = X \beta + \epsilon \end{equation}\]

This equation holds for any number of regressors and any number of observations.
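
For reference, with \(n\) observations and \(k\) regressors plus the intercept, the dimensions of each term in the matrix equation are:

\[\begin{equation} \underbrace{y}_{n \times 1} = \underbrace{X}_{n \times (k+1)} \ \underbrace{\beta}_{(k+1) \times 1} + \underbrace{\epsilon}_{n \times 1} \end{equation}\]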

2 Model estimation

2.1 Definitions

The vector of estimated parameters is defined as:

\[\begin{equation} \hat \beta = \begin{bmatrix} \hat \beta_0 \\ \hat \beta_1 \\ \hat \beta_2 \\ \cdots \\ \hat \beta_k \end{bmatrix} \end{equation}\]

The response estimated by the model is computed as:

\[\begin{equation} \hat y_i = \hat \beta_0 + \hat \beta_1 x_{1i} + \hat \beta_2 x_{2i} + \cdots + \hat \beta_k x_{ki}, \ i = 1,2,\cdots,n \end{equation}\]

In matrix form:

\[\begin{equation} \hat y = X \hat \beta \end{equation}\]

The residuals are defined as the difference between the observed response and the estimated response:

\[\begin{equation} e_i = y_i - \hat y_i = y_i - (\hat \beta_0 + \hat \beta_1 x_{1i} + \hat \beta_2 x_{2i} + \cdots + \hat \beta_k x_{ki}), \ i = 1,2,\cdots,n \end{equation}\]

In matrix form:

\[\begin{equation} e = y - \hat y = y - X \hat \beta \end{equation}\]

2.2 Estimating the model by least squares

The least squares method consists of finding the value of \(\hat \beta\) that minimizes the residual sum of squares (RSS):

\[\begin{equation} RSS = \sum e_i^2 = e^T e = (y - X \hat \beta)^T(y - X \hat \beta) = RSS(\hat \beta) \end{equation}\]

Differentiating with respect to \(\hat \beta\) and setting the result to zero gives the minimizer (assuming \(X^TX\) is invertible):

\[\begin{equation} \hat \beta = (X^TX)^{-1}X^Ty \end{equation}\]
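
The intermediate steps can be sketched as follows, assuming that the columns of \(X\) are linearly independent so that \(X^TX\) is invertible. Expanding the quadratic form:

\[\begin{equation} RSS(\hat \beta) = y^Ty - 2 \hat \beta^T X^T y + \hat \beta^T X^T X \hat \beta \end{equation}\]

Differentiating and setting the gradient to zero:

\[\begin{equation} \frac{\partial RSS}{\partial \hat \beta} = -2 X^T y + 2 X^T X \hat \beta = 0 \Rightarrow X^T X \hat \beta = X^T y \end{equation}\]

The last equality is the system of normal equations. Under the stated assumption the second derivative \(2X^TX\) is positive definite, so the stationary point is a minimum.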

The estimated response can now be written as:

\[\begin{equation} \hat y = X \hat \beta = X (X^TX)^{-1}X^T y = H y \end{equation}\]

The matrix \(H\) is known as the hat matrix. It is very useful for deriving theoretical results, but in practice it is not usually computed explicitly. For example, the residuals can be expressed in terms of \(H\):

\[\begin{equation} e =y - \hat y = (I-H)y \end{equation}\]
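
Some standard properties of \(H\) can be checked numerically. The sketch below uses a small simulated design matrix (an assumption for illustration, not data from the text) and forms \(H\) explicitly, which is acceptable only because \(n\) is small.

```r
# Simulated design matrix with an intercept and k = 2 regressors (assumption)
set.seed(2)
n = 20
X = cbind(1, rnorm(n), runif(n))
y = 3 + X[, 2] - 2 * X[, 3] + rnorm(n)

# Hat matrix H = X (X'X)^{-1} X'
H = X %*% solve(t(X) %*% X) %*% t(X)

# H is symmetric and idempotent (H %*% H equals H), up to rounding error
sym_ok  = isTRUE(all.equal(H, t(H)))
idem_ok = isTRUE(all.equal(H %*% H, H))

# The trace of H equals the number of columns of X (k + 1 = 3 here)
tr_H = sum(diag(H))

# Residuals computed as (I - H) y are orthogonal to the columns of X
e = (diag(n) - H) %*% y
max_abs_Xte = max(abs(t(X) %*% e))

c(symmetric = sym_ok, idempotent = idem_ok)
tr_H
max_abs_Xte
```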

2.3 Goodness of fit

It is useful to measure how well the model fits the data. The most common measure is the coefficient of determination, \(R^2\):

\[\begin{equation} R^2 = 1 - \frac{RSS}{TSS} \end{equation}\]

where TSS is the total sum of squares:

\[\begin{equation} TSS = \sum(y_i - \bar y)^2 \end{equation}\]
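
When the model includes the intercept \(\beta_0\) and is estimated by least squares, the total sum of squares splits into an explained part and the residual part, which is why \(R^2\) lies between 0 and 1 in that case:

\[\begin{equation} TSS = \sum(\hat y_i - \bar y)^2 + RSS \Rightarrow 0 \leq R^2 \leq 1 \end{equation}\]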

2.4 Example

Data on the number of species found on several islands of the Galápagos archipelago:

d = faraway::gala
str(d)

## 'data.frame':    30 obs. of  7 variables:
##  $ Species  : num  58 31 3 25 2 18 24 10 8 2 ...
##  $ Endemics : num  23 21 3 9 1 11 0 7 4 2 ...
##  $ Area     : num  25.09 1.24 0.21 0.1 0.05 ...
##  $ Elevation: num  346 109 114 46 77 119 93 168 71 112 ...
##  $ Nearest  : num  0.6 0.6 2.8 1.9 1.9 8 6 34.1 0.4 2.6 ...
##  $ Scruz    : num  0.6 26.3 58.7 47.4 1.9 ...
##  $ Adjacent : num  1.84 572.33 0.78 0.18 903.82 ...

The variables are:

Species: number of species found on the island
Area: area of the island (km²)
Elevation: highest elevation on the island (m)
Nearest: distance to the nearest island (km)
Scruz: distance to Santa Cruz island (km)
Adjacent: area of the adjacent island (km²)
Endemics: number of endemic species

We want to fit the regression:

\[\begin{equation} Species = \beta_0 + \beta_1 Area + \beta_2 Elevation + \beta_3 Nearest + \beta_4 Scruz + \beta_5 Adjacent + \epsilon \end{equation}\]

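Model matrices

n = nrow(d)                          # number of observations
y = matrix(d$Species, ncol = 1)      # response as an n x 1 column vector
X = as.matrix(d[,c("Area","Elevation","Nearest","Scruz","Adjacent")])
X = cbind(rep(1,n),X)                # prepend the column of ones for the intercept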

Estimation

XT_X = t(X) %*% X                            # X'X
( beta_e = solve(XT_X) %*% (t(X) %*% y) )    # (X'X)^{-1} X'y; the outer parentheses print the result

##                   [,1]
##            7.068220709
## Area      -0.023938338
## Elevation  0.319464761
## Nearest    0.009143961
## Scruz     -0.240524230
## Adjacent  -0.074804832
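
Forming the explicit inverse of \(X^TX\) works for a small problem like this one, but it is not the only option. As a hedged sketch, the same estimate can be obtained by solving the normal equations directly with `solve(A, b)`, or from a QR decomposition of \(X\) with `qr.solve()`. Both functions are part of base R, and the QR route is generally the better-conditioned of the two. The data below are simulated (an assumption for illustration) so that the block runs on its own. In the example above, `X` and `y` would be the matrices already built from `gala`.

```r
# Simulated data (assumption): intercept plus two regressors
set.seed(3)
n = 30
X = cbind(1, matrix(rnorm(n * 2), ncol = 2))
y = X %*% c(1, 2, -1) + rnorm(n)

# (a) explicit inverse, as in the text
beta_inv = solve(t(X) %*% X) %*% (t(X) %*% y)

# (b) solve the normal equations X'X b = X'y without forming the inverse
beta_ne = solve(crossprod(X), crossprod(X, y))

# (c) least squares via the QR decomposition of X
beta_qr = qr.solve(X, y)

# The three columns agree up to rounding error
cbind(beta_inv, beta_ne, beta_qr)
```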

Estimated response

y_e = X %*% beta_e

Residuals

e = y - y_e

Putting the results together:

data.frame(y,y_e,e)

##                y         y_e           e
## Baltra        58 116.7259460  -58.725946
## Bartolome     31  -7.2731544   38.273154
## Caldwell       3  29.3306594  -26.330659
## Champion      25  10.3642660   14.635734
## Coamano        2 -36.3839155   38.383916
## Daphne.Major  18  43.0877052  -25.087705
## Daphne.Minor  24  33.9196678   -9.919668
## Darwin        10  -9.0189919   19.018992
## Eden           8  28.3142017  -20.314202
## Enderby        2  30.7859425  -28.785943
## Espanola      97  47.6564865   49.343513
## Fernandina    93  96.9895982   -3.989598
## Gardner1      58  -4.0332759   62.033276
## Gardner2       5  64.6337956  -59.633796
## Genovesa      40  -0.4971756   40.497176
## Isabela      347 386.4035578  -39.403558
## Marchena      51  88.6945404  -37.694540
## Onslow         2   4.0372328   -2.037233
## Pinta        104 215.6794862 -111.679486
## Pinzon       108 150.4753750  -42.475375
## Las.Plazas    12  35.0758066  -23.075807
## Rabida        70  75.5531221   -5.553122
## SanCristobal 280 206.9518779   73.048122
## SanSalvador  237 277.6763183  -40.676318
## SantaCruz    444 261.4164131  182.583587
## SantaFe       62  85.3764857  -23.376486
## SantaMaria   285 195.6166286   89.383371
## Seymour       44  49.8050946   -5.805095
## Tortuga       16  52.9357316  -36.935732
## Wolf          21  26.7005735   -5.700573

(RSS = sum(e^2))

## [1] 89231.37

(TSS = sum((y-mean(y))^2))

## [1] 381081.4

(R2 = 1 - RSS/TSS)

## [1] 0.7658469
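
As a cross-check, the same model can be fitted with base R's `lm()`, which should reproduce the coefficients, RSS and \(R^2\) computed by hand above. This is a minimal sketch that assumes the `faraway` package is installed, as the example already requires. The names `fit`, `r2` and `rss` are introduced here only for illustration.

```r
# Assumes the faraway package is installed, as in the example above
d = faraway::gala
fit = lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = d)

coef(fit)                    # estimated coefficients
r2 = summary(fit)$r.squared  # coefficient of determination
rss = sum(residuals(fit)^2)  # residual sum of squares
c(RSS = rss, R2 = r2)
```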
