> summary(lm(y~., data=mydf))

Call:
lm(formula = y ~ ., data = mydf)

Residuals:
    Min      1Q  Median      3Q     Max
-73.111  -9.528  -0.897   8.907  78.653

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  107.20300    2.83286  37.843  < 2e-16
age           -0.87090    0.12356  -7.048 1.97e-12  # SIGNIFICANT
genderM       -6.34184    0.33625 -18.861  < 2e-16  # SIGNIFICANT
htcm          -0.05992    0.02657  -2.255  0.02415  # SIGNIFICANT
wtkg           0.01247    0.04037   0.309  0.75745
waistcm        0.08095    0.03434   2.358  0.01842  # SIGNIFICANT
cityP          1.18070    0.38454   3.070  0.00214  # SIGNIFICANT
seasonsummer   0.28349    0.66278   0.428  0.66886
seasonwinter  -1.25711    0.67247  -1.869  0.06161

Residual standard error: 14.32 on 7767 degrees of freedom
  (396 observations deleted due to missingness)
Multiple R-squared:  0.08514,   Adjusted R-squared:  0.08419
F-statistic: 90.35 on 8 and 7767 DF,  p-value: < 2.2e-16

> summary(aov(y~., data=mydf))
              Df  Sum Sq Mean Sq F value  Pr(>F)
age            1   68902   68902 335.992 < 2e-16    # SIGNIFICANT
gender         1   72243   72243 352.280 < 2e-16    # SIGNIFICANT
htcm           1     149     149   0.726 0.39409
wtkg           1    1592    1592   7.762 0.00535    # SIGNIFICANT
waistcm        1     767     767   3.738 0.05323
city           1     829     829   4.043 0.04440    # SIGNIFICANT
season         2    3742    1871   9.124 0.00011    # SIGNIFICANT
Residuals   7767 1592791     205
396 observations deleted due to missingness

> bestglm(mydf)
Morgan-Tatar search since factors present with more than 2 levels.
BIC
Best Model:
              Df  Sum Sq Mean Sq F value Pr(>F)
age            1   68902   68902   334.8 <2e-16 # SIGNIFICANT
gender         1   72243   72243   351.0 <2e-16 # SIGNIFICANT
Residuals   7773 1599869     206
396 observations deleted due to missingness

> library(randomForest)
> fit <- randomForest(y~., data=mydf, importance=TRUE)
> print(fit)

Call:
 randomForest(formula = y ~ ., data = mydf)
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 207.2199
                    % Var explained: 7.45

# FOLLOWING IS FROM fit$importance:

           IncNodePurity
htcm       219809.13
waistcm    196753.10
wtkg       181179.19
age        119446.90
gender      83154.71
season      42938.42
city        27040.10

             %IncMSE
htcm       72.663197
wtkg       68.040321
age        48.075415
waistcm    33.267517
gender     26.680004
season      5.932131
city        3.905936

var        importance
gender     55.4005861
waistcm    34.4082250
age        32.3720673
htcm       28.6817975
wtkg       26.7268140
season      8.0689392
city        7.9994742