## 'data.frame': 400 obs. of 12 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Income : num 14.9 106 104.6 148.9 55.9 ...
## $ Limit : int 3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
## $ Rating : int 283 483 514 681 357 569 259 512 266 491 ...
## $ Cards : int 2 3 4 3 2 4 2 2 5 3 ...
## $ Age : int 34 82 71 36 68 77 37 87 66 41 ...
## $ Education: int 11 15 11 11 16 10 12 9 13 19 ...
## $ Gender : Factor w/ 2 levels " Male","Female": 1 2 1 2 1 1 2 1 2 2 ...
## $ Student : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
## $ Married : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
## $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
## $ Balance : int 333 903 580 964 331 1151 203 872 279 1350 ...
We create a dummy variable, Mujer, with the values:
Mujer = 0, if Gender = Male
Mujer = 1, if Gender = Female
The statistical model we want to estimate is:
\[ Balance = \beta_0 + \beta_1 Mujer + u \]
##
## Call:
## lm(formula = Balance ~ Mujer, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## Mujer 19.73 46.05 0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
In effect we have two models, one for men and one for women:
Men (Mujer = 0): \(Balance = \beta_0\). The average balance for men is 509.80.
Women (Mujer = 1): \(Balance = \beta_0 + \beta_1\). The average balance for women is 509.80 + 19.73 = 529.53.
A more elegant way to estimate these models in R is to represent the qualitative variable with a factor:
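The reading of the coefficients as group means can be checked numerically. The following is a minimal sketch on a small made-up data frame (the name `toy` and its values are assumptions, standing in for `datos`). The comparison uses "Female" because the str() output above shows the male level stored as " Male", with a leading space.

```r
# Toy data, made up for illustration; `toy` stands in for the document's `datos`.
set.seed(1)
toy = data.frame(
  Gender  = factor(rep(c("Male", "Female"), times = c(8, 12)),
                   levels = c("Male", "Female")),
  Balance = round(runif(20, min = 0, max = 1500))
)

# Dummy variable: 1 for women, 0 for men.
toy$Mujer = as.numeric(toy$Gender == "Female")

m = lm(Balance ~ Mujer, data = toy)
b = unname(coef(m))

# Group means computed directly from the data.
mean_men   = mean(toy$Balance[toy$Gender == "Male"])
mean_women = mean(toy$Balance[toy$Gender == "Female"])

# Intercept = mean for men; intercept + slope = mean for women.
print(c(from_model = b[1], from_data = mean_men))
print(c(from_model = b[1] + b[2], from_data = mean_women))
```

With a single 0/1 regressor, least squares reproduces the two group means exactly, so each pair of printed values should agree.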
## [1] "factor"
## [1] " Male" "Female"
Since Gender is a factor, we can write:
##
## Call:
## lm(formula = Balance ~ Gender, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 509.80 33.13 15.389 <2e-16 ***
## GenderFemale 19.73 46.05 0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
Internally, R assigns the values:
## Female
## Male 0
## Female 1
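A table like the one above can be obtained with contrasts(), and model.matrix() shows the resulting 0/1 column that lm() would use. This is a minimal sketch on a made-up four-element factor `g`, not the document's data.

```r
# A small factor with "Male" as first (reference) level; values are made up.
g = factor(c("Male", "Female", "Female", "Male"), levels = c("Male", "Female"))

# Treatment coding (R's default for unordered factors):
# the reference level gets 0, the other level gets 1.
coding = contrasts(g)
print(coding)

# The design matrix lm() would build: an intercept column plus the 0/1 dummy.
X = model.matrix(~ g)
print(X)
```

The dummy column is named by pasting the variable name and the level, which is why the summary above labels the coefficient GenderFemale.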
We could also have chosen the dummy variable Hombre, with the values:
Hombre = 1, if Gender = Male
Hombre = 0, if Gender = Female
The model would now be:
\[ Balance = \beta_0 + \beta_1 Hombre + u \]
##
## Call:
## lm(formula = Balance ~ Hombre, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 529.54 31.99 16.554 <2e-16 ***
## Hombre -19.73 46.05 -0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
That is:
Men (Hombre = 1): \(Balance = \beta_0 + \beta_1\). The average balance for men is 529.54 - 19.73 = 509.81.
Women (Hombre = 0): \(Balance = \beta_0\). The average balance for women is 529.54.
When using factors, we have to change the order of the factor levels so that Female becomes the reference level:
## [1] "Female" " Male"
##
## Call:
## lm(formula = datos$Balance ~ Gender1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 529.54 31.99 16.554 <2e-16 ***
## Gender1 Male -19.73 46.05 -0.429 0.669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.0004611, Adjusted R-squared: -0.00205
## F-statistic: 0.1836 on 1 and 398 DF, p-value: 0.6685
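One way to change the reference level of a factor is relevel(). The code that produced the output above is not shown, so using relevel() here is an assumption, and the data frame `toy` is made up for illustration.

```r
# Toy data, made up for illustration.
set.seed(2)
toy = data.frame(
  Gender  = factor(rep(c("Male", "Female"), times = c(8, 12)),
                   levels = c("Male", "Female")),
  Balance = round(runif(20, min = 0, max = 1500))
)

# Reference level "Male" (first level).
m_a = lm(Balance ~ Gender, data = toy)

# Same factor with "Female" moved to the front, so it becomes the reference.
toy$Gender1 = relevel(toy$Gender, ref = "Female")
m_b = lm(Balance ~ Gender1, data = toy)

print(levels(toy$Gender1))
print(coef(m_a))
print(coef(m_b))
```

Only the parameterization changes: the slope flips sign, the intercept moves to the other group's mean, and the fitted values are identical.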
We can also use the equivalent model:
\[ Balance = \beta_1 Hombre + \beta_2 Mujer + u \]
which uses both dummy variables but drops the parameter \(\beta_0\). Each coefficient is then directly the average balance of its group.
##
## Call:
## lm(formula = Balance ~ 0 + Hombre + Mujer, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Hombre 509.80 33.13 15.39 <2e-16 ***
## Mujer 529.54 31.99 16.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.5621, Adjusted R-squared: 0.5599
## F-statistic: 255.4 on 2 and 398 DF, p-value: < 2.2e-16
With factors:
##
## Call:
## lm(formula = datos$Balance ~ 0 + Gender, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529.54 -455.35 -60.17 334.71 1489.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Gender Male 509.80 33.13 15.39 <2e-16 ***
## GenderFemale 529.54 31.99 16.55 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.2 on 398 degrees of freedom
## Multiple R-squared: 0.5621, Adjusted R-squared: 0.5599
## F-statistic: 255.4 on 2 and 398 DF, p-value: < 2.2e-16
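The model without intercept gives the same fit as the model with intercept, but its R-squared is much larger (0.5621 against 0.0004611 in the outputs above). The reason is that, without an intercept, summary() measures total variation around zero rather than around the mean of the response. A minimal sketch on made-up data (`toy` is an assumption, not the document's `datos`):

```r
# Toy data, made up for illustration.
set.seed(3)
toy = data.frame(
  Gender  = factor(rep(c("Male", "Female"), times = c(8, 12)),
                   levels = c("Male", "Female")),
  Balance = round(runif(20, min = 0, max = 1500))
)

m_int = lm(Balance ~ Gender, data = toy)      # intercept + one dummy
m_no  = lm(Balance ~ 0 + Gender, data = toy)  # no intercept, one dummy per level

mean_men   = mean(toy$Balance[toy$Gender == "Male"])
mean_women = mean(toy$Balance[toy$Gender == "Female"])

# Without intercept, each coefficient is its group's mean.
print(coef(m_no))
print(c(mean_men, mean_women))

# Same fitted values, different R-squared.
print(all.equal(fitted(m_int), fitted(m_no)))
print(c(with_intercept = summary(m_int)$r.squared,
        no_intercept   = summary(m_no)$r.squared))
```

The two R-squared values are therefore not comparable, even though both models have the same residuals.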
For qualitative variables with more than two levels:
## [1] "factor"
## [1] "African American" "Asian" "Caucasian"
We define the dummy variables:
Afro = as.numeric(datos$Ethnicity == "African American")
Asia = as.numeric(datos$Ethnicity == "Asian")
Cauc = as.numeric(datos$Ethnicity == "Caucasian")
Since the variable has three levels, we need to include two dummy variables in the statistical model. Including all three together with \(\beta_0\) would make the regressors perfectly collinear, because Afro + Asia + Cauc = 1 for every observation.
General model:
\[ Balance = \beta_0 + \beta_1 Asia + \beta_2 Cauc + u \]
Model for “African American”: Asia = 0, Cauc = 0
\[ Balance = \beta_0 + u \]
Model for “Asian”: Asia = 1, Cauc = 0
\[ Balance = \beta_0 + \beta_1 + u \]
Model for “Caucasian”: Asia = 0, Cauc = 1
\[ Balance = \beta_0 + \beta_2 + u \]
In R:
##
## Call:
## lm(formula = Balance ~ Asia + Cauc, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## Asia -18.69 65.02 -0.287 0.774
## Cauc -12.50 56.68 -0.221 0.826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
Using factors gives the same results:
##
## Call:
## lm(formula = Balance ~ Ethnicity, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 531.00 46.32 11.464 <2e-16 ***
## EthnicityAsian -18.69 65.02 -0.287 0.774
## EthnicityCaucasian -12.50 56.68 -0.221 0.826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
We can check that R internally creates dummy variables with the values:
## Asian Caucasian
## African American 0 0
## Asian 1 0
## Caucasian 0 1
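The equivalence between hand-made dummy variables and a three-level factor can be sketched as follows. The data frame `toy` and its values are made up; only the level names and the dummy names Asia and Cauc come from the text.

```r
# Toy data, made up for illustration.
set.seed(4)
eth = c("African American", "Asian", "Caucasian")
toy = data.frame(
  Ethnicity = factor(rep(eth, times = c(6, 7, 8)), levels = eth),
  Balance   = round(runif(21, min = 0, max = 1500))
)

# Two dummies; "African American" is the omitted (reference) level.
toy$Asia = as.numeric(toy$Ethnicity == "Asian")
toy$Cauc = as.numeric(toy$Ethnicity == "Caucasian")

m_dummy  = lm(Balance ~ Asia + Cauc, data = toy)
m_factor = lm(Balance ~ Ethnicity, data = toy)

# Coding used internally for the factor: one 0/1 column per non-reference level.
print(contrasts(toy$Ethnicity))

# Same estimates, only the coefficient names differ.
print(cbind(dummies = coef(m_dummy), factor = coef(m_factor)))
```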
We can make other comparisons by changing the reference level:
## [1] "Asian" "African American" "Caucasian"
##
## Call:
## lm(formula = Balance ~ Ethnicity1, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 512.314 45.632 11.227 <2e-16 ***
## Ethnicity1African American 18.686 65.021 0.287 0.774
## Ethnicity1Caucasian 6.184 56.122 0.110 0.912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
Which, underneath, amounts to:
##
## Call:
## lm(formula = Balance ~ Ethnicity1, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 512.314 45.632 11.227 <2e-16 ***
## Ethnicity1African American 18.686 65.021 0.287 0.774
## Ethnicity1Caucasian 6.184 56.122 0.110 0.912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
We can also reorder the levels of the factor:
Ethnicity2 = factor(datos$Ethnicity,levels=c("Caucasian","Asian","African American"))
levels(Ethnicity2)
## [1] "Caucasian" "Asian" "African American"
##
## Call:
## lm(formula = Balance ~ Ethnicity2, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -531.00 -457.08 -63.25 339.25 1480.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 518.497 32.670 15.871 <2e-16 ***
## Ethnicity2Asian -6.184 56.122 -0.110 0.912
## Ethnicity2African American 12.503 56.681 0.221 0.826
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.9 on 397 degrees of freedom
## Multiple R-squared: 0.0002188, Adjusted R-squared: -0.004818
## F-statistic: 0.04344 on 2 and 397 DF, p-value: 0.9575
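As a consistency check, the three parameterizations describe the same three group means. Using the estimates reported in the outputs above (small discrepancies are rounding in the printed values):

\[ \text{African American:} \quad 531.00 \approx 512.314 + 18.686 \approx 518.497 + 12.503 \]
\[ \text{Asian:} \quad 531.00 - 18.69 \approx 512.314 \approx 518.497 - 6.184 \]
\[ \text{Caucasian:} \quad 531.00 - 12.50 \approx 512.314 + 6.184 \approx 518.497 \]

Changing the reference level changes which differences the coefficients report, not the fitted means, which is why the residual standard error, R-squared and F-statistic are identical in all three outputs.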
It is very common to have both qualitative and quantitative regressors. For example, let us study Balance as a function of Income (quantitative) and Student (qualitative). We define the dummy variable Estudiante:
Estudiante = 0: (Student = No)
Estudiante = 1: (Student = Yes)
The model we are going to analyze is:
\[ Balance = \beta_0 + \beta_1 Income + \beta_2 Estudiante + u \]
Therefore:
If Estudiante = 0: \(Balance = \beta_0 + \beta_1 Income + u\)
If Estudiante = 1: \(Balance = (\beta_0 + \beta_2) + \beta_1 Income + u\)
We have two lines with the same slope \(\beta_1\) and different intercepts, \(\beta_0\) and \(\beta_0 + \beta_2\). In R:
##
## Call:
## lm(formula = Balance ~ Income + Student, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -762.37 -331.38 -45.04 323.60 818.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 211.1430 32.4572 6.505 2.34e-10 ***
## Income 5.9843 0.5566 10.751 < 2e-16 ***
## StudentYes 382.6705 65.3108 5.859 9.78e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.8 on 397 degrees of freedom
## Multiple R-squared: 0.2775, Adjusted R-squared: 0.2738
## F-statistic: 76.22 on 2 and 397 DF, p-value: < 2.2e-16
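# m12 is assumed to be the fit whose Call is shown above
m12 = lm(Balance ~ Income + Student, data = datos)
# Scatter plot, one colour per level of Student
plot(datos$Income, datos$Balance, col = datos$Student)
# Fitted line for non-students
abline(m12$coefficients["(Intercept)"], m12$coefficients["Income"])
# Fitted line for students: same slope, intercept shifted by StudentYes
abline(m12$coefficients["(Intercept)"] + m12$coefficients["StudentYes"],
m12$coefficients["Income"], col="red")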
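Plugging the estimates from the summary of Balance ~ Income + Student into the two equations gives the two parallel fitted lines:

\[ Estudiante = 0: \quad \widehat{Balance} = 211.1430 + 5.9843 \, Income \]
\[ Estudiante = 1: \quad \widehat{Balance} = (211.1430 + 382.6705) + 5.9843 \, Income = 593.8135 + 5.9843 \, Income \]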
Can a single model represent two lines with different slopes, one for students and one for non-students? Consider the model:
\[ Balance = \beta_0 + \beta_1 Income + \beta_2 Estudiante + \beta_3 Estudiante * Income + u \]
If Estudiante = 0: \(Balance = \beta_0 + \beta_1 Income + u\)
If Estudiante = 1: \(Balance = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) Income + u\)
We have two lines with different slopes and different intercepts.
##
## Call:
## lm(formula = Balance ~ Income * Student, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.39 -325.70 -41.13 321.65 814.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.6232 33.6984 5.953 5.79e-09 ***
## Income 6.2182 0.5921 10.502 < 2e-16 ***
## StudentYes 476.6758 104.3512 4.568 6.59e-06 ***
## Income:StudentYes -1.9992 1.7313 -1.155 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared: 0.2799, Adjusted R-squared: 0.2744
## F-statistic: 51.3 on 3 and 396 DF, p-value: < 2.2e-16
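With the estimates reported in this output, the two fitted lines are:

\[ Estudiante = 0: \quad \widehat{Balance} = 200.6232 + 6.2182 \, Income \]
\[ Estudiante = 1: \quad \widehat{Balance} = (200.6232 + 476.6758) + (6.2182 - 1.9992) \, Income = 677.2990 + 4.2190 \, Income \]

The interaction estimate has a p-value of 0.249, so these data give no strong evidence that the two slopes actually differ.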
In R the interaction term can also be written with a colon (:). The previous model is therefore equivalent to writing:
##
## Call:
## lm(formula = Balance ~ Income + Student + Income:Student, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -773.39 -325.70 -41.13 321.65 814.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 200.6232 33.6984 5.953 5.79e-09 ***
## Income 6.2182 0.5921 10.502 < 2e-16 ***
## StudentYes 476.6758 104.3512 4.568 6.59e-06 ***
## Income:StudentYes -1.9992 1.7313 -1.155 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 391.6 on 396 degrees of freedom
## Multiple R-squared: 0.2799, Adjusted R-squared: 0.2744
## F-statistic: 51.3 on 3 and 396 DF, p-value: < 2.2e-16
plot(datos$Income, datos$Balance, col = datos$Student)
abline(m14$coefficients["(Intercept)"], m14$coefficients["Income"])
abline(m14$coefficients["(Intercept)"] + m14$coefficients["StudentYes"],
m14$coefficients["Income"] + m14$coefficients["Income:StudentYes"], col="red")
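The equivalence of the two formulas, and the reading of the interaction model as two separate lines, can be checked with a minimal sketch. The data frame `toy` is simulated for illustration (its sample size and coefficients are assumptions); only the variable names come from the text.

```r
# Toy data, simulated for illustration.
set.seed(5)
n = 40
toy = data.frame(
  Income  = round(runif(n, min = 10, max = 150), 1),
  Student = factor(rep(c("No", "Yes"), times = c(28, 12)), levels = c("No", "Yes"))
)
toy$Balance = 200 + 6 * toy$Income + 400 * (toy$Student == "Yes") + rnorm(n, sd = 50)

# Income * Student expands to Income + Student + Income:Student.
m_star  = lm(Balance ~ Income * Student, data = toy)
m_colon = lm(Balance ~ Income + Student + Income:Student, data = toy)

# Separate simple regressions, one per group.
m_no  = lm(Balance ~ Income, data = subset(toy, Student == "No"))
m_yes = lm(Balance ~ Income, data = subset(toy, Student == "Yes"))

# Intercept and slope of each line, recovered from the interaction model.
b = coef(m_star)
line_no  = c(b[["(Intercept)"]], b[["Income"]])
line_yes = c(b[["(Intercept)"]] + b[["StudentYes"]],
             b[["Income"]] + b[["Income:StudentYes"]])
print(rbind(no = line_no, yes = line_yes))
```

The lines recovered from the single model coincide with the two separate regressions. What the single model adds is a direct test of whether the intercepts and slopes differ, under a common error variance.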
Consider the model that explains Balance as a function of Income, Gender and Ethnicity. Recall that we have the dummy variables:
Mujer = 0, if Gender = Male
Mujer = 1, if Gender = Female
Asia = 1, if Ethnicity = Asian
Cauc = 1, if Ethnicity = Caucasian
The first model we can build is:
\[ Balance = \beta_0 + \beta_1 Income + \beta_2 Mujer + \beta_3 Asia + \beta_4 Cauc + u \]
Gender | Ethnicity | Model |
---|---|---|
Male | Afro | \(Balance = \beta_0 + \beta_1 Income + u\) |
Male | Asia | \(Balance = (\beta_0 + \beta_3) + \beta_1 Income + u\) |
Male | Cauc | \(Balance = (\beta_0 + \beta_4) + \beta_1 Income + u\) |
Female | Afro | \(Balance = (\beta_0 + \beta_2) + \beta_1 Income + u\) |
Female | Asia | \(Balance = (\beta_0 + \beta_2 + \beta_3) + \beta_1 Income + u\) |
Female | Cauc | \(Balance = (\beta_0 + \beta_2 + \beta_4) + \beta_1 Income + u\) |
As we can see, the slope \(\beta_1\) relating Balance and Income does not depend on gender or ethnicity. The intercept, however, does depend on both factors, additively.
##
## Call:
## lm(formula = Balance ~ Income + Gender + Ethnicity, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -794.14 -351.67 -52.02 328.02 1110.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 230.0291 53.8574 4.271 2.44e-05 ***
## Income 6.0542 0.5818 10.406 < 2e-16 ***
## GenderFemale 24.3396 40.9630 0.594 0.553
## EthnicityAsian 1.6372 57.7867 0.028 0.977
## EthnicityCaucasian 6.4469 50.3634 0.128 0.898
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 409.2 on 395 degrees of freedom
## Multiple R-squared: 0.2157, Adjusted R-squared: 0.2078
## F-statistic: 27.16 on 4 and 395 DF, p-value: < 2.2e-16
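For example, with the estimates reported in this output, the fitted lines for the reference group and for Caucasian women are:

\[ Mujer = 0,\ Asia = 0,\ Cauc = 0: \quad \widehat{Balance} = 230.0291 + 6.0542 \, Income \]
\[ Mujer = 1,\ Asia = 0,\ Cauc = 1: \quad \widehat{Balance} = (230.0291 + 24.3396 + 6.4469) + 6.0542 \, Income = 260.8156 + 6.0542 \, Income \]

The slope is the same in both lines; only the intercept shifts, by the sum of the two dummy coefficients.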
We can also consider interactions between the quantitative regressor and the qualitative ones:
\[ Balance = \beta_0 + \beta_1 Income + \beta_2 Mujer + \beta_3 Asia + \beta_4 Cauc + \beta_5 Mujer * Income + \beta_6 Asia * Income + \beta_7 Cauc * Income + u \]
Gender | Ethnicity | Model |
---|---|---|
Male | Afro | \(Balance = \beta_0 + \beta_1 Income + u\) |
Male | Asia | \(Balance = (\beta_0 + \beta_3) + (\beta_1 + \beta_6) Income + u\) |
Male | Cauc | \(Balance = (\beta_0 + \beta_4) + (\beta_1 + \beta_7) Income + u\) |
Female | Afro | \(Balance = (\beta_0 + \beta_2) + (\beta_1 + \beta_5) Income + u\) |
Female | Asia | \(Balance = (\beta_0 + \beta_2 + \beta_3) + (\beta_1 + \beta_5 + \beta_6) Income + u\) |
Female | Cauc | \(Balance = (\beta_0 + \beta_2 + \beta_4) + (\beta_1 + \beta_5 + \beta_7) Income + u\) |
That is, both the slope and the intercept depend on the levels of the factors. In R:
m16 = lm(Balance ~ Income + Gender + Ethnicity + Income:Gender + Income:Ethnicity, data = datos)
summary(m16)
##
## Call:
## lm(formula = Balance ~ Income + Gender + Ethnicity + Income:Gender +
## Income:Ethnicity, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -753.83 -357.12 -60.18 328.83 1100.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 150.4704 74.3670 2.023 0.0437 *
## Income 7.7622 1.2337 6.292 8.4e-10 ***
## GenderFemale 47.5525 67.4646 0.705 0.4813
## EthnicityAsian 55.6196 92.2098 0.603 0.5467
## EthnicityCaucasian 124.8133 81.5912 1.530 0.1269
## Income:GenderFemale -0.5773 1.2039 -0.480 0.6318
## Income:EthnicityAsian -1.0723 1.5796 -0.679 0.4976
## Income:EthnicityCaucasian -2.5669 1.3858 -1.852 0.0647 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 408.9 on 392 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2089
## F-statistic: 16.05 on 7 and 392 DF, p-value: < 2.2e-16
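As a worked example with the estimates of m16, the fitted lines for the reference group and for Caucasian women are:

\[ Mujer = 0,\ Asia = 0,\ Cauc = 0: \quad \widehat{Balance} = 150.4704 + 7.7622 \, Income \]
\[ Mujer = 1,\ Asia = 0,\ Cauc = 1: \quad \widehat{Balance} = (150.4704 + 47.5525 + 124.8133) + (7.7622 - 0.5773 - 2.5669) \, Income = 322.8362 + 4.6180 \, Income \]

None of the three interaction terms is significant at the 5% level in this output (the smallest p-value is 0.0647), so the differences in slope should be read with caution.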
In the same way, we could add the interactions Gender:Ethnicity and Gender:Ethnicity:Income, which would add new terms to the model.
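A minimal sketch of that full model, assuming the formula shorthand Income * Gender * Ethnicity (which expands to all main effects, all two-way interactions and the three-way interaction). The data frame `toy` is simulated so the block runs on its own; it is not the document's `datos`.

```r
# Toy data, simulated for illustration: 6 observations per Gender x Ethnicity cell.
set.seed(6)
eth = c("African American", "Asian", "Caucasian")
toy = expand.grid(Gender = c("Male", "Female"), Ethnicity = eth, id = 1:6)
toy$Income  = round(runif(nrow(toy), min = 10, max = 150), 1)
toy$Balance = 200 + 6 * toy$Income + rnorm(nrow(toy), sd = 50)

# Main effects, two-way interactions and the three-way interaction.
m_full = lm(Balance ~ Income * Gender * Ethnicity, data = toy)

# 12 coefficients: the 8 of the previous model, plus 2 for Gender:Ethnicity
# and 2 for Income:Gender:Ethnicity.
print(names(coef(m_full)))
```

With every cell of Gender by Ethnicity getting its own intercept and slope, this model is equivalent to fitting six separate lines, so it needs enough observations in each cell to be estimated reliably.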