Dissertation Statistics Help: best subsets regression

Friday, September 14, 2012

Best Subsets Regression

Best subsets regression is an exploratory model-building technique. It compares all possible models that can be built from an identified set of predictors. By default, Minitab's best subsets output shows the two best models with one predictor, with two predictors, with three predictors, and so on, up to the total number of predictors entered into the analysis. For each model, Minitab reports R², adjusted R², Mallows' C_p, and S, and these fit statistics should be used in conjunction with one another to determine the best model. R² is the coefficient of multiple determination: the proportion of variance in the criterion variable that is explained by the set of predictor variables. Adjusted R² is the same quantity penalized for the number of predictors in the model. Mallows' C_p is a measure of bias or prediction error. S is the square root of the mean square error (MSE).
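To make the four statistics concrete, here is a minimal Python sketch of a best subsets search using NumPy. It is not Minitab's implementation; the function names (sse_of, best_subsets), the synthetic data, and the choice to rank models of the same size by R² are assumptions made for illustration. C_p is computed with the usual textbook formula, SSE_p / MSE_full - (n - 2p), where p is the number of parameters in the model.

```python
from itertools import combinations

import numpy as np


def sse_of(X, y, cols):
    """Residual sum of squares for an OLS fit with an intercept on columns `cols`."""
    n = len(y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def best_subsets(X, y, top=2):
    """Return the `top` models (ranked by R^2) for each number of predictors."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    # MSE of the full model (all k predictors) is the yardstick for C_p.
    mse_full = sse_of(X, y, range(k)) / (n - (k + 1))
    rows = []
    for size in range(1, k + 1):
        fits = []
        for cols in combinations(range(k), size):
            p = size + 1  # parameters = predictors + intercept
            sse = sse_of(X, y, cols)
            r2 = 1 - sse / sst
            fits.append({
                "predictors": cols,
                "p": p,
                "r2": r2,
                "adj_r2": 1 - (1 - r2) * (n - 1) / (n - p),
                "cp": sse / mse_full - (n - 2 * p),
                "s": (sse / (n - p)) ** 0.5,
            })
        fits.sort(key=lambda f: f["r2"], reverse=True)
        rows.extend(fits[:top])
    return rows


# Synthetic data (assumed for illustration): only columns 0 and 2 affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = 2.0 + 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=60)

table = best_subsets(X, y)
for row in table:
    print(row["predictors"], round(row["r2"], 3), round(row["adj_r2"], 3),
          round(row["cp"], 1), round(row["s"], 3))
```

Each printed row corresponds to one line of a best subsets table: the predictors in the model, followed by R², adjusted R², C_p, and S. An exhaustive search fits 2^k - 1 models, so this sketch is only practical for a modest number of predictors.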

The decision is not always clear, so the researcher must use all the tools available to make the most informed choice. When selecting the best subset, we are looking for the highest adjusted R². Adding a predictor never decreases R², so R² alone will always favor the larger model. When choosing among models with different numbers of predictors, it is therefore more reasonable to use adjusted R², which increases only if the added predictors improve the model more than would be expected by chance. With regard to Mallows' C_p, where p indicates the number of parameters in the model, we are looking for a value close to, or less than, p. The number of parameters in each model is equal to the number of predictors plus one, where the one is the intercept parameter. So if the output reads two variables, we know that the number of parameters in the model is three. There are a few things to note when analyzing Mallows' C_p:

· The model with the maximum number of predictors always shows C_p = p, so Mallows' C_p is not a good selection tool for the full model.

· If all models but the full model display a large C_p, then the models are lacking important predictors that must be identified before going forward.

· When several models show a C_p near p, the model with the smallest C_p should be selected to be certain the bias is small.

· Further, when several models show a C_p near p, the model with the fewest predictors should be selected.

In addition to these guidelines, we are also looking for the model with the smallest S. Taking these factors into account should allow the researcher to select the most appropriate, best fitting regression model.
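The guidelines above can be sketched as a small screening function in Python. This is one possible encoding of the rules of thumb, not a standard procedure: the function name shortlist, the cp_slack tolerance for "near p", the ordering of the tie-breakers, and the example numbers are all assumptions made for illustration.

```python
def shortlist(models, cp_slack=1.0):
    """Screen candidate models using the C_p guidelines.

    Each model is a dict with keys "name", "p" (number of parameters),
    "adj_r2", "cp", and "s". The full model is excluded because its C_p
    always equals p and therefore carries no information.
    """
    max_p = max(m["p"] for m in models)
    candidates = [
        m for m in models
        if m["p"] < max_p and m["cp"] <= m["p"] + cp_slack
    ]
    # Prefer fewer parameters, then smaller C_p, then smaller S.
    return sorted(candidates, key=lambda m: (m["p"], m["cp"], m["s"]))


# Hypothetical best subsets output (made-up values, not real results).
models = [
    {"name": "X1", "p": 2, "adj_r2": 0.614, "cp": 48.3, "s": 1.92},
    {"name": "X1, X3", "p": 3, "adj_r2": 0.880, "cp": 2.7, "s": 1.07},
    {"name": "X1, X2, X3", "p": 4, "adj_r2": 0.878, "cp": 4.1, "s": 1.08},
    {"name": "X1, X2, X3, X4", "p": 5, "adj_r2": 0.875, "cp": 5.0, "s": 1.09},
]

picks = shortlist(models)
for m in picks:
    print(m["name"], m["cp"], m["adj_r2"], m["s"])
```

In this made-up table the one-predictor model is dropped because its C_p is far above p, the full model is dropped by construction, and the two-predictor model ranks first. A shortlist like this is a starting point; the final choice should still weigh adjusted R², S, and subject-matter knowledge together.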

Additional reading/reference

https://onlinecourses.science.psu.edu/stat501/node/89
