1 Metodo

  • Se divide el rango de X por medio de k puntos, \(c_1 < c_2 < \cdots < c_k\).
  • Se definen k+1 variables auxiliares.

\[ \begin{align} C_0(X) & = I(X < c_1 ) \\ C_1(X) & = I(X \in (c_1,c_2] )\\ C_2(X) & = I(X \in (c_2,c_3] ) \\ \cdots & = \cdots \\ C_k(X) & = I(X > c_k ) \end{align} \]

donde:

\[ \begin{equation*} I(X \in (c_{i},c_{i+1}] ) = \left\{ \begin{array}{rl} 1 & \text{si } X \in (c_{i},c_{i+1}],\\ 0 & \text{si } X \notin (c_{i},c_{i+1}]. \end{array} \right. \end{equation*} \]

  • Se estima el modelo de regresión:

\[ y_i = \beta_0 + \beta_1 C_1(x) + \beta_2 C_2(x) + \cdots + \beta_k C_k(x) + u_i \]

2 Ejemplo

2.1 Datos

Datos: Wage

Wage and other data for a group of 3000 male workers in the Mid-Atlantic region.

library(ISLR)
datos = Wage
datos = datos[datos$wage<250,]
plot(datos$age,datos$wage, cex = 0.5, col = "darkgrey")

2.2 Variables auxiliares

La función cut() define automaticamente cuatro intervalos:

head(cut(datos$age,4), 10)
##  [1] (17.9,33.5] (17.9,33.5] (33.5,49]   (33.5,49]   (49,64.5]  
##  [6] (49,64.5]   (33.5,49]   (17.9,33.5] (33.5,49]   (49,64.5]  
## Levels: (17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]

El resultado de cut() es un factor.

class(cut(datos$age,4))
## [1] "factor"

Se han definido las siguientes variables:

\[ \begin{align} C_0(X) & = I(X \in (17.9,33.5] ) \\ C_1(X) & = I(X \in (33.5,49] )\\ C_2(X) & = I(X \in (49,64.5] ) \\ C_3(X) & = I(X \in (64.5,80.1] ) \end{align} \]

2.3 Estimacion del modelo

El modelo de regresión es:

\[ y_i = \beta_0 + \beta_1 C_1(x) + \beta_2 C_2(x) + \beta_3 C_3(x) + u_i \]

Se estima el modelo de regresión:

m_esc = lm(wage ~ cut(age,4), data = datos)
summary(m_esc)
## 
## Call:
## lm(formula = wage ~ cut(age, 4), data = datos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -92.783 -21.169  -2.486  18.753 115.545 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              92.968      1.128  82.443   <2e-16 ***
## cut(age, 4)(33.5,49]     19.901      1.404  14.174   <2e-16 ***
## cut(age, 4)(49,64.5]     18.946      1.592  11.904   <2e-16 ***
## cut(age, 4)(64.5,80.1]    6.366      3.823   1.665    0.096 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.78 on 2917 degrees of freedom
## Multiple R-squared:  0.07234,    Adjusted R-squared:  0.07138 
## F-statistic: 75.82 on 3 and 2917 DF,  p-value: < 2.2e-16

Como son variables cero-uno, estamos estimando cuatro modelos:

\[ X \in (17.9,33.5], \quad y_i = \beta_0 + u_i\]

\[ X \in (33.5,49], \quad y_i = \beta_0 + \beta_1 + u_i\]

\[ X \in (49,64.5], \quad y_i = \beta_0 + \beta_2 + u_i\]

\[ X \in (64.5,80.1], \quad y_i = \beta_0 + \beta_3 + u_i\]

El pvalor de \(\beta_3 > 0.05\), luego \(\beta_3 = 0\).

2.4 Dibujo del modelo estimado con intervalos de confianza

age_grid = seq(from = min(datos$age), to = max(datos$age), by = 1)
yp = predict(m_esc, newdata = data.frame(age = age_grid), se = TRUE)
alfa = 0.05
yp1 = yp$fit + qnorm(alfa/2, mean = 0, sd = 1)*yp$se.fit
yp2 = yp$fit + qnorm(1-alfa/2, mean = 0, sd = 1)*yp$se.fit
plot(wage ~ age, data = datos, xlim = range(age), cex = 0.5, col = "darkgrey")
title("Funcion escalon")
lines(age_grid, yp$fit, lwd = 2, col = "blue")
lines(age_grid, yp1, col = "blue", lty = 3)
lines(age_grid, yp2, col = "blue", lty = 3)

2.5 Cálculo del número de escalones con el R2-ajustado

# numero maximo de escalones
num_max_esc = 20

# datos de los folds

r2_adj = rep(0, num_max_esc)
for (i in 2:num_max_esc){
  m = lm(wage ~ cut(age,i), data = datos)
  m_summary = summary(m)
  r2_adj[i] = m_summary$adj.r.squared
}

plot(r2_adj, type = "b")

2.6 Cálculo del número de escalones con Cross-Validation