Data: Wage
Wage and other data for a group of 3000 male workers in the Mid-Atlantic region.
## 'data.frame': 3000 obs. of 12 variables:
## $ X : int 231655 86582 161300 155159 11443 376662 450601 377954 228963 81404 ...
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 1 level "2. Middle Atlantic": 1 1 1 1 1 1 1 1 1 1 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
The help page describes a data frame with 3000 observations on 11 variables; the extra X column in the str() output above is simply an observation identifier.
There appear to be two distinct groups: workers earning more than $250,000 and those earning less. We will work with those earning less.
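How the working data frame d is built is not shown; the sketch below is a minimal assumption, based on the X index column in the str() output, that the data were read from a CSV export of the Wage data set (the file name Wage.csv is hypothetical; wage is recorded in thousands of dollars):
Wage = read.csv("Wage.csv")  # hypothetical file name; X is the exported row index
d = Wage[Wage$wage < 250, ]  # keep the lower-earning group (wage is in $1000s)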
\[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \cdots + \beta_k x_i^k + u_i \]
There are several ways to implement this in R. The first is to protect each power with I() inside the formula:
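The fitting code is not shown in the original, but the call is visible in the summary below; m1 is the name used later when plotting the curve:
# Option 1: wrap each power of age in I() inside the formula
m1 = lm(wage ~ age + I(age^2) + I(age^3) + I(age^4), data = d)
summary(m1)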
##
## Call:
## lm(formula = wage ~ age + I(age^2) + I(age^3) + I(age^4), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.565 -20.689 -2.015 17.584 116.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.620e+02 4.563e+01 -3.550 0.000391 ***
## age 1.948e+01 4.477e+00 4.350 1.41e-05 ***
## I(age^2) -5.150e-01 1.569e-01 -3.283 0.001039 **
## I(age^3) 6.113e-03 2.334e-03 2.619 0.008869 **
## I(age^4) -2.800e-05 1.250e-05 -2.240 0.025186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.16 on 2916 degrees of freedom
## Multiple R-squared: 0.1094, Adjusted R-squared: 0.1082
## F-statistic: 89.55 on 4 and 2916 DF, p-value: < 2.2e-16
# Option 2: create each power of age as an explicit variable
z1 = d$age
z2 = d$age^2
z3 = d$age^3
z4 = d$age^4
m2 = lm(wage ~ z1 + z2 + z3 + z4, data = d)
summary(m2)
##
## Call:
## lm(formula = wage ~ z1 + z2 + z3 + z4, data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.565 -20.689 -2.015 17.584 116.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.620e+02 4.563e+01 -3.550 0.000391 ***
## z1 1.948e+01 4.477e+00 4.350 1.41e-05 ***
## z2 -5.150e-01 1.569e-01 -3.283 0.001039 **
## z3 6.113e-03 2.334e-03 2.619 0.008869 **
## z4 -2.800e-05 1.250e-05 -2.240 0.025186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.16 on 2916 degrees of freedom
## Multiple R-squared: 0.1094, Adjusted R-squared: 0.1082
## F-statistic: 89.55 on 4 and 2916 DF, p-value: < 2.2e-16
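The third option is poly() with raw = T; again, the call producing the summary below is visible in the output:
# Option 3: poly() with raw = T builds the same raw-power design matrix
summary(lm(wage ~ poly(age, degree = 4, raw = T), data = d))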
##
## Call:
## lm(formula = wage ~ poly(age, degree = 4, raw = T), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.565 -20.689 -2.015 17.584 116.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.620e+02 4.563e+01 -3.550 0.000391 ***
## poly(age, degree = 4, raw = T)1 1.948e+01 4.477e+00 4.350 1.41e-05 ***
## poly(age, degree = 4, raw = T)2 -5.150e-01 1.569e-01 -3.283 0.001039 **
## poly(age, degree = 4, raw = T)3 6.113e-03 2.334e-03 2.619 0.008869 **
## poly(age, degree = 4, raw = T)4 -2.800e-05 1.250e-05 -2.240 0.025186 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.16 on 2916 degrees of freedom
## Multiple R-squared: 0.1094, Adjusted R-squared: 0.1082
## F-statistic: 89.55 on 4 and 2916 DF, p-value: < 2.2e-16
The option raw = T is necessary here, because otherwise poly() uses orthogonal polynomials (see below). Let's plot the fitted curve together with its confidence and prediction intervals:
age_grid = seq(from = min(d$age), to = max(d$age), by = 1)
# fitted values with 95% confidence and prediction intervals
yp = predict(m1, newdata = data.frame(age = age_grid), interval = "confidence", level = 0.95)
yp1 = predict(m1, newdata = data.frame(age = age_grid), interval = "prediction", level = 0.95)
plot(wage ~ age, data = d, xlim = range(d$age), cex = 0.5, col = "darkgrey")
title("Degree-4 polynomial")
lines(age_grid, yp[,1], lwd = 2, col = "blue")
# confidence intervals for the mean response
lines(age_grid, yp[,2], col = "blue", lty = 3)
lines(age_grid, yp[,3], col = "blue", lty = 3)
# prediction intervals
lines(age_grid, yp1[,2], col = "red", lty = 3)
lines(age_grid, yp1[,3], col = "red", lty = 3)
Let's now increase the degree of the polynomial:
# maximum polynomial degree to try
max_grad = 10
r2_adj = rep(0, max_grad)
for (i in 1:max_grad){
  mi = lm(wage ~ poly(age, degree = i, raw = T), data = d)
  mi_summary = summary(mi)
  r2_adj[i] = mi_summary$adj.r.squared
}
plot(r2_adj, type = "b")
As we can see, the adjusted R² stops improving beyond degree 4. We can refine the choice with the partial F-test, where \(RSS_m\) and \(RSS_k\) are the residual sums of squares of the models of degree \(m\) and \(k\) (\(m < k\)) and \(n\) is the number of observations:
\[ F_0 = \frac{(RSS_m - RSS_k)/(k-m)}{RSS_k/(n-k-1)} \sim F_{k-m,\, n-k-1} \]
Let's compare the models of degree 3, 4, and 5:
mk3 = lm(wage ~ poly(age, degree = 3, raw = T), data = d)
mk4 = lm(wage ~ poly(age, degree = 4, raw = T), data = d)
mk5 = lm(wage ~ poly(age, degree = 5, raw = T), data = d)
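The two ANOVA tables below come from comparing degree 3 against degree 4 and against degree 5:
anova(mk3, mk4)
anova(mk3, mk5)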
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, degree = 3, raw = T)
## Model 2: wage ~ poly(age, degree = 4, raw = T)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2917 2657641
## 2 2916 2653077 1 4563.9 5.0162 0.02519 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, degree = 3, raw = T)
## Model 2: wage ~ poly(age, degree = 5, raw = T)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2917 2657641
## 2 2915 2653073 2 4567.8 2.5094 0.08149 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
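As a check, the F statistic of the first table follows directly from the formula above: \(F_0 = \frac{(2657641 - 2653077)/1}{2653077/2916} = \frac{4563.9}{909.8} \approx 5.016\), matching the value reported by anova().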
As we can see, degree 4 improves on degree 3 at the 5% level (p = 0.025), but degree 3 and degree 5 are equivalent (p = 0.081), so we keep the degree-3 model: we always prefer simple models over complex ones.
One of the main problems with this raw-power parameterisation is that, for polynomials of high degree, the matrix \(X^t X\) is nearly singular, and the estimation of the model can run into numerical trouble. For example:
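The degree-6 raw fit whose summary follows (the call is visible in the output):
summary(lm(wage ~ poly(age, degree = 6, raw = T), data = d))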
##
## Call:
## lm(formula = wage ~ poly(age, degree = 6, raw = T), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.347 -20.526 -1.956 17.549 115.680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.219e+02 3.275e+02 0.372 0.710
## poly(age, degree = 6, raw = T)1 -2.428e+01 4.944e+01 -0.491 0.623
## poly(age, degree = 6, raw = T)2 2.168e+00 2.986e+00 0.726 0.468
## poly(age, degree = 6, raw = T)3 -7.784e-02 9.255e-02 -0.841 0.400
## poly(age, degree = 6, raw = T)4 1.391e-03 1.556e-03 0.894 0.372
## poly(age, degree = 6, raw = T)5 -1.231e-05 1.349e-05 -0.913 0.362
## poly(age, degree = 6, raw = T)6 4.299e-08 4.723e-08 0.910 0.363
##
## Residual standard error: 30.17 on 2914 degrees of freedom
## Multiple R-squared: 0.1097, Adjusted R-squared: 0.1078
## F-statistic: 59.81 on 6 and 2914 DF, p-value: < 2.2e-16
As we can see, in this model every parameter comes out non-significant, even \(\beta_1\) and \(\beta_2\).
One option is to use the model:
\[ y_i = \beta_0 P_0(x_i) + \beta_1 P_1(x_i) + \beta_2 P_2(x_i) + \cdots + \beta_k P_k(x_i) + u_i \]
where \(P_k(x_i)\) is the degree-\(k\) polynomial satisfying:
\[ \sum _{i=1}^n P_r(x_i) P_s(x_i) = 0, \ r \neq s, \ r,s = 0,1,\ldots,k \\ P_0(x_i) = 1 \]
that is, the \(P_r\) are orthogonal polynomials. The model is still \(y = X\beta + u\), with
\[ X = \begin{bmatrix} P_0(x_1) & P_1(x_1) & \cdots & P_k(x_1) \\ P_0(x_2) & P_1(x_2) & \cdots & P_k(x_2) \\ \vdots & \vdots & & \vdots \\ P_0(x_n) & P_1(x_n) & \cdots & P_k(x_n) \\ \end{bmatrix} \]
Therefore:
\[ X^t X = \begin{bmatrix} n & 0 & \cdots & 0 \\ 0 & \sum_{i=1}^n P_1^2(x_i) & \cdots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & \sum_{i=1}^n P_k^2(x_i) \\ \end{bmatrix} \]
Since \(P_0(x_i) = 1\), the first diagonal entry is \(\sum_{i=1}^n P_0^2(x_i) = n\). The matrix is diagonal with positive entries, hence invertible, and:
\[ \hat \beta_j = \frac{\sum_{i=1}^n P_j(x_i) y_i}{\sum_{i=1}^n P_j^2(x_i)} \]
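In R this is what poly() does by default (raw = FALSE); the fit whose summary follows is grounded in the Call shown in the output. The columns returned by poly() are in fact orthonormal, which is why all the slope estimates below share the same standard error:
# Orthogonal polynomials: the default behaviour of poly()
summary(lm(wage ~ poly(age, degree = 6), data = d))
# sanity check: the crossproduct of the poly() columns is (numerically) the identity
# round(crossprod(poly(d$age, degree = 6)), 10)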
##
## Call:
## lm(formula = wage ~ poly(age, degree = 6), data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93.347 -20.526 -1.956 17.549 115.680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 107.2187 0.5582 192.074 < 2e-16 ***
## poly(age, degree = 6)1 358.9196 30.1695 11.897 < 2e-16 ***
## poly(age, degree = 6)2 -418.0999 30.1695 -13.858 < 2e-16 ***
## poly(age, degree = 6)3 133.0075 30.1695 4.409 1.08e-05 ***
## poly(age, degree = 6)4 -67.5570 30.1695 -2.239 0.0252 *
## poly(age, degree = 6)5 -1.9657 30.1695 -0.065 0.9481
## poly(age, degree = 6)6 27.4635 30.1695 0.910 0.3627
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.17 on 2914 degrees of freedom
## Multiple R-squared: 0.1097, Adjusted R-squared: 0.1078
## F-statistic: 59.81 on 6 and 2914 DF, p-value: < 2.2e-16
The coefficient estimates are completely different from those of the raw degree-6 fit, but the residual standard error, the R², and the F-statistic coincide: we have only changed the basis, and the fitted model is the same.
The predictions must also be the same:
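A sketch of how predictions like the ones below could be obtained; the original code is not shown, so the model names and the age value (20) are purely illustrative assumptions:
m_raw6 = lm(wage ~ poly(age, degree = 6, raw = T), data = d)   # assumed name
m_orth6 = lm(wage ~ poly(age, degree = 6), data = d)           # assumed name
predict(m_raw6, newdata = data.frame(age = 20), interval = "confidence", level = 0.95)
predict(m_orth6, newdata = data.frame(age = 20), interval = "confidence", level = 0.95)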
## fit lwr upr
## 1 75.79416 72.15342 79.43489
## fit lwr upr
## 1 75.23709 71.40363 79.07055