1 Polinomios a trozos

Se divide el rango de X por medio de k puntos (o nodos), \(c_1 < c_2 < \cdots < c_k\).
En cada intervalo se estima un polinomio de orden d.

\[ \begin{equation*} y_i = \left\{ \begin{array}{cl} \beta_{00} + \beta_{10} x_i + \beta_{20} x_i^2 + \cdots + \beta_{d0} x_i^d + u_i & \text{si } x < c_1,\\ \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \cdots + \beta_{d1} x_i^d + u_i & \text{si } c_1 \leq x \leq c_2, \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \cdots + \beta_{d2} x_i^d + u_i & \text{si } c_2 \leq x \leq c_3, \\ \ldots & \ldots, \\ \beta_{0k} + \beta_{1k} x_i + \beta_{2k} x_i^2 + \cdots + \beta_{dk} x_i^d + u_i & \text{si } x \geq c_k, \\ \end{array} \right. \end{equation*} \]

Por tanto, hay (k + 1)(d + 1) parámetros a estimar o grados de libertad.
En los puntos \(c_1, c_2, \cdots, c_k\) los polinomios han de ser continuos y suaves (con derivada de orden (d-1) continua). Por tanto, se añaden kd ecuaciones en cada nodo.
En total se tienen (k + 1)(d + 1) - kd = k + d grados de libertad. Por ejemplo en splines cubicas, se tienen k + 4.

2 Funciones base

Estimar el las ecuaciones anteriores con las restricciones correspondientes puede parecer complicado. Una alternativa es usar funciones base para los polinomios. Es fácil comprobar que el siguiente modelo cumple las condiciones de las splines:

\[ y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \beta_{d+1} h_1(x) + \cdots + \beta_{d+k} h_k(x) + u_i \]

donde:

\[ \begin{equation*} h_i(x) = \left\{ \begin{array}{cl} (x - c_i)^d & \text{si } x > c_i,\\ 0 & \text{si } x \leq c_i. \end{array} \right. \end{equation*} \]

En total se tienen d+k+1 prámetros (base de tamaño d+k mas \(\beta_0\)).
Por tanto, un modelo que utiliza splines de orden 3 (d=3) y cuatro nodos (k = 4) tiene 7 grados de libertad.
Al fin y al cabo, la spline definida de esta manera es un polinomio de orden d+k. Por tanto se puede expresar utilizando una base de tamaño d:

\[ p(x) = a_1 b_1(x) + a_2 b_2(x) + \cdots + a_{d+k} b_{d+k}(x) \]

Esta base no es única. Si conocemos una base para p(x) podemos estimar los parámetros utilizando regresión lineal.

3 Datos

Datos: Wage

Wage and other data for a group of 3000 male workers in the Mid-Atlantic region.

library(ISLR)
datos = Wage
datos = datos[datos$wage<250,]
plot(datos$age,datos$wage, cex = 0.5, col = "darkgrey", ylab = "wage (x 1000 $)", xlab = "age")

4 Funciones base

library(splines)

La función bs() define automaticamente una matriz con las funciones de base necesarias a partir de los nodos. Se puede hacer de dos maneras:

Especificando los nodos:

dim(bs(datos$age, knots = c(25, 40, 60)))

## [1] 2921    6

Tres nodos y splines cúbicos originan una base de tamaño k+3!.

# por defecto utiliza splines cubicos
dim(bs(datos$age, knots = c(25, 40, 60), degree = 4))

## [1] 2921    7

Especificando los grados de libertad:

# k = df - 3, se utilizan quantiles
attr(bs(datos$age, df = 6), "knots")

## 25% 50% 75% 
##  33  42  50

attr(bs(datos$age, df = 6, degree = 4), "knots")

## 33.33333% 66.66667% 
##        37        48

5 Estimacion del modelo

m1 = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = datos)
summary(m1)

## 
## Call:
## lm(formula = wage ~ bs(age, knots = c(25, 40, 60)), data = datos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -93.271 -20.467  -2.006  17.424 116.146 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       59.978      7.151   8.388  < 2e-16 ***
## bs(age, knots = c(25, 40, 60))1    5.534      9.484   0.583 0.559625    
## bs(age, knots = c(25, 40, 60))2   44.434      7.280   6.104 1.17e-09 ***
## bs(age, knots = c(25, 40, 60))3   55.949      8.149   6.866 8.04e-12 ***
## bs(age, knots = c(25, 40, 60))4   52.762      8.119   6.498 9.53e-11 ***
## bs(age, knots = c(25, 40, 60))5   40.726     10.942   3.722 0.000201 ***
## bs(age, knots = c(25, 40, 60))6   25.289     14.470   1.748 0.080630 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.17 on 2914 degrees of freedom
## Multiple R-squared:  0.1096, Adjusted R-squared:  0.1077 
## F-statistic: 59.76 on 6 and 2914 DF,  p-value: < 2.2e-16

6 Prediccion

age_grid = seq(from = min(datos$age), to = max(datos$age), by = 1)
alfa = 0.05
yp = predict(m1, newdata = data.frame(age = age_grid), se = T)
yp1 = yp$fit + qnorm(alfa/2, mean = 0, sd = 1)*yp$se.fit
yp2 = yp$fit + qnorm(1-alfa/2, mean = 0, sd = 1)*yp$se.fit

plot(datos$age, datos$wage, col = "gray")
lines(age_grid, yp$fit, col = "blue", lwd = 2)
lines(age_grid, yp1, col = "blue", lty = 3)
lines(age_grid, yp2, col = "blue", lty = 3)

7 Splines cubicas naturales

Las splines tienen el problema de que en los extremos las predicciones tienen varianza elevada.
Esto se debe a que fuera del rango de la variable (valor minimo, valor máximo), la spline es de orden d.
Una opción es obligar a que para valores menores que el mínimo y mayores que el máximo, la spline sea lineal. Esto corrige la varianza alta de los bordes.

\[ \begin{equation*} y_i = \left\{ \begin{array}{cl} \beta_{00} + \beta_{10} x_i + u_i & \text{si } x < min(x_i),\\ \beta_{00} + \beta_{10} x_i + \beta_{20} x_i^2 + \cdots + \beta_{d0} x_i^d + u_i & \text{si } min(x_i) \leq x \leq c_1, \\ \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \cdots + \beta_{d1} x_i^d + u_i & \text{si } c_1 \leq x \leq c_2, \\ \ldots & \ldots, \\ \beta_{0k} + \beta_{1k} x_i + \beta_{2k} x_i^2 + \cdots + \beta_{dk} x_i^d + u_i & \text{si } c_k \leq x \leq max(x_i), \\ \beta_{0k} + \beta_{1k} x_i + u_i & \text{si } x \geq max(x_i), \\ \end{array} \right. \end{equation*} \] - Luego se añaden d-2 restricciones en cada uno de los contornos (derivadas 2,3 ,… ,(d-1) = 0), en total, * k + d - 2d + 4 = k - d + 4* funciones base.

En splines cúbicas, la base tiene tamaño k+1.
En R, se definen splines cúbicas naturales (las splines cúbicas son las más utilizadas) por medio de la función ns().

dim(bs(datos$age, knots = c(25, 40, 60)))

## [1] 2921    6

Se estima el modelo:

m2 = lm(wage ~ ns(age, knots = c(25, 40, 60)), data = Wage)
summary(m2)

## 
## Call:
## lm(formula = wage ~ ns(age, knots = c(25, 40, 60)), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.733 -24.761  -5.187  15.216 204.965 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       54.760      5.138  10.658  < 2e-16 ***
## ns(age, knots = c(25, 40, 60))1   67.402      5.013  13.444  < 2e-16 ***
## ns(age, knots = c(25, 40, 60))2   51.383      5.712   8.996  < 2e-16 ***
## ns(age, knots = c(25, 40, 60))3   88.566     12.016   7.371 2.18e-13 ***
## ns(age, knots = c(25, 40, 60))4   10.637      9.833   1.082    0.279    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39.92 on 2995 degrees of freedom
## Multiple R-squared:  0.08588,    Adjusted R-squared:  0.08466 
## F-statistic: 70.34 on 4 and 2995 DF,  p-value: < 2.2e-16

ypn = predict(m2, newdata = data.frame(age = age_grid), se = T)
ypn1 = ypn$fit + qnorm(alfa/2, mean = 0, sd = 1)*ypn$se.fit
ypn2 = ypn$fit + qnorm(1-alfa/2, mean = 0, sd = 1)*ypn$se.fit

plot(datos$age, datos$wage, col = "gray")
#
lines(age_grid, yp$fit, col = "blue", lwd = 2)
lines(age_grid, yp1, col = "blue", lty = 3)
lines(age_grid, yp2, col = "blue", lty = 3)
#
lines(age_grid, ypn$fit, col = "red", lwd = 2)
lines(age_grid, ypn1, col = "red", lty = 3)
lines(age_grid, ypn2, col = "red", lty = 3)

8 Seleccion del numero de nodos para splines cubicas

Se van a utilizar splines cubicas.

# numero maximo de escalones
num_max_gdl = 8

r2_adj = rep(0, num_max_gdl)
for (i in 3:num_max_gdl){
  m = lm(wage ~ bs(age, df = i), data = datos)
  m_summary = summary(m)
  r2_adj[i] = m_summary$adj.r.squared
}

plot(3:num_max_gdl, r2_adj[3:num_max_gdl], type = "b")

source("cross_val_pos.R")
source("MSE.R")

# datos de los folds
num_folds = 10
# grado de libertad maximo
num_max_gdl = 8
set.seed(1)
pos = cross_val_pos(nrow(datos),num_folds)

mse_cv = matrix(0, nrow = num_folds, ncol = num_max_gdl)
for (i in 1:num_folds){
  # datos de training y de validation de cada fold
  datos_train = datos[pos$train[[i]],]
  datos_test = datos[pos$test[[i]],]
  
  for (j in 3:num_max_gdl){
    m_cv = lm(wage ~ bs(age, df = j), data = datos_train)
    pred = predict(m_cv, newdata = datos_test)
    mse_cv[i,j] = MSE(datos_test$wage,pred)
  }
}

mse_cv_med = apply(mse_cv, 2, mean)
plot(3:num_max_gdl,mse_cv_med[3:num_max_gdl], type = "b")

Regresión no Lineal: splines de regresion

Javier Cara

curso 2018-19