\[ \begin{align} C_0(X) & = I(X < c_1 ) \\ C_1(X) & = I(X \in (c_1,c_2] )\\ C_2(X) & = I(X \in (c_2,c_3] ) \\ \cdots & = \cdots \\ C_k(X) & = I(X > c_k ) \end{align} \]
donde:
\[ \begin{equation*} I(X \in (c_{i},c_{i+1}] ) = \left\{ \begin{array}{rl} 1 & \text{si } X \in (c_{i},c_{i+1}],\\ 0 & \text{si } X \notin (c_{i},c_{i+1}]. \end{array} \right. \end{equation*} \]
\[ y_i = \beta_0 + \beta_1 C_1(x) + \beta_2 C_2(x) + \cdots + \beta_k C_k(x) + u_i \]
Datos: Wage
Wage and other data for a group of 3000 male workers in the Mid-Atlantic region.
library(ISLR)
datos = Wage
datos = datos[datos$wage<250,]
plot(datos$age,datos$wage, cex = 0.5, col = "darkgrey")
La función cut() define automaticamente cuatro intervalos:
head(cut(datos$age,4), 10)
## [1] (17.9,33.5] (17.9,33.5] (33.5,49] (33.5,49] (49,64.5]
## [6] (49,64.5] (33.5,49] (17.9,33.5] (33.5,49] (49,64.5]
## Levels: (17.9,33.5] (33.5,49] (49,64.5] (64.5,80.1]
El resultado de cut() es un factor.
class(cut(datos$age,4))
## [1] "factor"
Se han definido las siguientes variables:
\[ \begin{align} C_0(X) & = I(X \in (17.9,33.5] ) \\ C_1(X) & = I(X \in (33.5,49] )\\ C_2(X) & = I(X \in (49,64.5] ) \\ C_3(X) & = I(X \in (64.5,80.1] ) \end{align} \]
El modelo de regresión es:
\[ y_i = \beta_0 + \beta_1 C_1(x) + \beta_2 C_2(x) + \beta_3 C_3(x) + u_i \]
Se estima el modelo de regresión:
m_esc = lm(wage ~ cut(age,4), data = datos)
summary(m_esc)
##
## Call:
## lm(formula = wage ~ cut(age, 4), data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -92.783 -21.169 -2.486 18.753 115.545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.968 1.128 82.443 <2e-16 ***
## cut(age, 4)(33.5,49] 19.901 1.404 14.174 <2e-16 ***
## cut(age, 4)(49,64.5] 18.946 1.592 11.904 <2e-16 ***
## cut(age, 4)(64.5,80.1] 6.366 3.823 1.665 0.096 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.78 on 2917 degrees of freedom
## Multiple R-squared: 0.07234, Adjusted R-squared: 0.07138
## F-statistic: 75.82 on 3 and 2917 DF, p-value: < 2.2e-16
Como son variables cero-uno, estamos estimando cuatro modelos:
\[ X \in (17.9,33.5], \quad y_i = \beta_0 + u_i\]
\[ X \in (33.5,49], \quad y_i = \beta_0 + \beta_1 + u_i\]
\[ X \in (49,64.5], \quad y_i = \beta_0 + \beta_2 + u_i\]
\[ X \in (64.5,80.1], \quad y_i = \beta_0 + \beta_3 + u_i\]
El pvalor de \(\beta_3 > 0.05\), luego \(\beta_3 = 0\).
age_grid = seq(from = min(datos$age), to = max(datos$age), by = 1)
yp = predict(m_esc, newdata = data.frame(age = age_grid), se = TRUE)
alfa = 0.05
yp1 = yp$fit + qnorm(alfa/2, mean = 0, sd = 1)*yp$se.fit
yp2 = yp$fit + qnorm(1-alfa/2, mean = 0, sd = 1)*yp$se.fit
plot(wage ~ age, data = datos, xlim = range(age), cex = 0.5, col = "darkgrey")
title("Funcion escalon")
lines(age_grid, yp$fit, lwd = 2, col = "blue")
lines(age_grid, yp1, col = "blue", lty = 3)
lines(age_grid, yp2, col = "blue", lty = 3)
# numero maximo de escalones
num_max_esc = 20
# datos de los folds
r2_adj = rep(0, num_max_esc)
for (i in 2:num_max_esc){
m = lm(wage ~ cut(age,i), data = datos)
m_summary = summary(m)
r2_adj[i] = m_summary$adj.r.squared
}
plot(r2_adj, type = "b")