\[ \begin{equation*} y_i = \left\{ \begin{array}{cl} \beta_{00} + \beta_{10} x_i + \beta_{20} x_i^2 + \cdots + \beta_{d0} x_i^d + u_i & \text{si } x < c_1,\\ \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \cdots + \beta_{d1} x_i^d + u_i & \text{si } c_1 \leq x \leq c_2, \\ \beta_{02} + \beta_{12} x_i + \beta_{22} x_i^2 + \cdots + \beta_{d2} x_i^d + u_i & \text{si } c_2 \leq x \leq c_3, \\ \ldots & \ldots, \\ \beta_{0k} + \beta_{1k} x_i + \beta_{2k} x_i^2 + \cdots + \beta_{dk} x_i^d + u_i & \text{si } x \geq c_k, \\ \end{array} \right. \end{equation*} \]
Por tanto, hay (k + 1)(d + 1) parƔmetros a estimar o grados de libertad.
En los puntos \(c_1, c_2, \cdots, c_k\) los polinomios han de ser continuos y suaves (con derivada de orden (d-1) continua). Por tanto, se aƱaden kd ecuaciones en cada nodo.
En total se tienen (k + 1)(d + 1) - kd = k + d grados de libertad. Por ejemplo en splines cubicas, se tienen k + 4.
Estimar el las ecuaciones anteriores con las restricciones correspondientes puede parecer complicado. Una alternativa es usar funciones base para los polinomios. Es fƔcil comprobar que el siguiente modelo cumple las condiciones de las splines:
\[ y_i = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_d x^d + \beta_{d+1} h_1(x) + \cdots + \beta_{d+k} h_k(x) + u_i \]
donde:
\[ \begin{equation*} h_i(x) = \left\{ \begin{array}{cl} (x - c_i)^d & \text{si } x > c_i,\\ 0 & \text{si } x \leq c_i. \end{array} \right. \end{equation*} \]
En total se tienen d+k+1 prƔmetros (base de tamaƱo d+k mas \(\beta_0\)).
Por tanto, un modelo que utiliza splines de orden 3 (d=3) y cuatro nodos (k = 4) tiene 7 grados de libertad.
Al fin y al cabo, la spline definida de esta manera es un polinomio de orden d+k. Por tanto se puede expresar utilizando una base de tamaƱo d:
\[ p(x) = a_1 b_1(x) + a_2 b_2(x) + \cdots + a_{d+k} b_{d+k}(x) \]
Datos: Wage
Wage and other data for a group of 3000 male workers in the Mid-Atlantic region.
library(ISLR)
datos = Wage
datos = datos[datos$wage<250,]
plot(datos$age,datos$wage, cex = 0.5, col = "darkgrey", ylab = "wage (x 1000 $)", xlab = "age")
library(splines)
La función bs() define automaticamente una matriz con las funciones de base necesarias a partir de los nodos. Se puede hacer de dos maneras:
dim(bs(datos$age, knots = c(25, 40, 60)))
## [1] 2921 6
Tres nodos y splines cúbicos originan una base de tamaño k+3!.
# por defecto utiliza splines cubicos
dim(bs(datos$age, knots = c(25, 40, 60), degree = 4))
## [1] 2921 7
# k = df - 3, se utilizan quantiles
attr(bs(datos$age, df = 6), "knots")
## 25% 50% 75%
## 33 42 50
attr(bs(datos$age, df = 6, degree = 4), "knots")
## 33.33333% 66.66667%
## 37 48
m1 = lm(wage ~ bs(age, knots = c(25, 40, 60)), data = datos)
summary(m1)
##
## Call:
## lm(formula = wage ~ bs(age, knots = c(25, 40, 60)), data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.271 -20.467 -2.006 17.424 116.146
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59.978 7.151 8.388 < 2e-16 ***
## bs(age, knots = c(25, 40, 60))1 5.534 9.484 0.583 0.559625
## bs(age, knots = c(25, 40, 60))2 44.434 7.280 6.104 1.17e-09 ***
## bs(age, knots = c(25, 40, 60))3 55.949 8.149 6.866 8.04e-12 ***
## bs(age, knots = c(25, 40, 60))4 52.762 8.119 6.498 9.53e-11 ***
## bs(age, knots = c(25, 40, 60))5 40.726 10.942 3.722 0.000201 ***
## bs(age, knots = c(25, 40, 60))6 25.289 14.470 1.748 0.080630 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.17 on 2914 degrees of freedom
## Multiple R-squared: 0.1096, Adjusted R-squared: 0.1077
## F-statistic: 59.76 on 6 and 2914 DF, p-value: < 2.2e-16
age_grid = seq(from = min(datos$age), to = max(datos$age), by = 1)
alfa = 0.05
yp = predict(m1, newdata = data.frame(age = age_grid), se = T)
yp1 = yp$fit + qnorm(alfa/2, mean = 0, sd = 1)*yp$se.fit
yp2 = yp$fit + qnorm(1-alfa/2, mean = 0, sd = 1)*yp$se.fit
plot(datos$age, datos$wage, col = "gray")
lines(age_grid, yp$fit, col = "blue", lwd = 2)
lines(age_grid, yp1, col = "blue", lty = 3)
lines(age_grid, yp2, col = "blue", lty = 3)
Las splines tienen el problema de que en los extremos las predicciones tienen varianza elevada.
Esto se debe a que fuera del rango de la variable (valor minimo, valor mƔximo), la spline es de orden d.
Una opción es obligar a que para valores menores que el mĆnimo y mayores que el mĆ”ximo, la spline sea lineal. Esto corrige la varianza alta de los bordes.
\[ \begin{equation*} y_i = \left\{ \begin{array}{cl} \beta_{00} + \beta_{10} x_i + u_i & \text{si } x < min(x_i),\\ \beta_{00} + \beta_{10} x_i + \beta_{20} x_i^2 + \cdots + \beta_{d0} x_i^d + u_i & \text{si } min(x_i) \leq x \leq c_1, \\ \beta_{01} + \beta_{11} x_i + \beta_{21} x_i^2 + \cdots + \beta_{d1} x_i^d + u_i & \text{si } c_1 \leq x \leq c_2, \\ \ldots & \ldots, \\ \beta_{0k} + \beta_{1k} x_i + \beta_{2k} x_i^2 + \cdots + \beta_{dk} x_i^d + u_i & \text{si } c_k \leq x \leq max(x_i), \\ \beta_{0k} + \beta_{1k} x_i + u_i & \text{si } x \geq max(x_i), \\ \end{array} \right. \end{equation*} \] - Luego se añaden d-2 restricciones en cada uno de los contornos (derivadas 2,3 ,⦠,(d-1) = 0), en total, * k + d - 2d + 4 = k - d + 4* funciones base.
En splines cúbicas, la base tiene tamaño k+1.
En R, se definen splines cúbicas naturales (las splines cúbicas son las mÔs utilizadas) por medio de la función ns().
dim(bs(datos$age, knots = c(25, 40, 60)))
## [1] 2921 6
m2 = lm(wage ~ ns(age, knots = c(25, 40, 60)), data = Wage)
summary(m2)
##
## Call:
## lm(formula = wage ~ ns(age, knots = c(25, 40, 60)), data = Wage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -98.733 -24.761 -5.187 15.216 204.965
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.760 5.138 10.658 < 2e-16 ***
## ns(age, knots = c(25, 40, 60))1 67.402 5.013 13.444 < 2e-16 ***
## ns(age, knots = c(25, 40, 60))2 51.383 5.712 8.996 < 2e-16 ***
## ns(age, knots = c(25, 40, 60))3 88.566 12.016 7.371 2.18e-13 ***
## ns(age, knots = c(25, 40, 60))4 10.637 9.833 1.082 0.279
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.92 on 2995 degrees of freedom
## Multiple R-squared: 0.08588, Adjusted R-squared: 0.08466
## F-statistic: 70.34 on 4 and 2995 DF, p-value: < 2.2e-16
ypn = predict(m2, newdata = data.frame(age = age_grid), se = T)
ypn1 = ypn$fit + qnorm(alfa/2, mean = 0, sd = 1)*ypn$se.fit
ypn2 = ypn$fit + qnorm(1-alfa/2, mean = 0, sd = 1)*ypn$se.fit
plot(datos$age, datos$wage, col = "gray")
#
lines(age_grid, yp$fit, col = "blue", lwd = 2)
lines(age_grid, yp1, col = "blue", lty = 3)
lines(age_grid, yp2, col = "blue", lty = 3)
#
lines(age_grid, ypn$fit, col = "red", lwd = 2)
lines(age_grid, ypn1, col = "red", lty = 3)
lines(age_grid, ypn2, col = "red", lty = 3)
Se van a utilizar splines cubicas.
# numero maximo de escalones
num_max_gdl = 8
r2_adj = rep(0, num_max_gdl)
for (i in 3:num_max_gdl){
m = lm(wage ~ bs(age, df = i), data = datos)
m_summary = summary(m)
r2_adj[i] = m_summary$adj.r.squared
}
plot(3:num_max_gdl, r2_adj[3:num_max_gdl], type = "b")
source("cross_val_pos.R")
source("MSE.R")
# datos de los folds
num_folds = 10
# grado de libertad maximo
num_max_gdl = 8
set.seed(1)
pos = cross_val_pos(nrow(datos),num_folds)
mse_cv = matrix(0, nrow = num_folds, ncol = num_max_gdl)
for (i in 1:num_folds){
# datos de training y de validation de cada fold
datos_train = datos[pos$train[[i]],]
datos_test = datos[pos$test[[i]],]
for (j in 3:num_max_gdl){
m_cv = lm(wage ~ bs(age, df = j), data = datos_train)
pred = predict(m_cv, newdata = datos_test)
mse_cv[i,j] = MSE(datos_test$wage,pred)
}
}
mse_cv_med = apply(mse_cv, 2, mean)
plot(3:num_max_gdl,mse_cv_med[3:num_max_gdl], type = "b")