Request

To request a blog written on a specific topic, please email James@StatisticsSolutions.com with your suggestion. Thank you!

Friday, December 7, 2012

The differences among the most common statistical analyses




Correlation vs. Regression vs. Mean Differences
  •  Inferential (parametric and non-parametric) statistics are conducted when the goal of the research is to draw conclusions about the statistical significance of the relationships and/or differences among variables of interest.

  •  The “relationships” can be tested in different statistical ways, depending on the goal of the research.  The three most common meanings of “relationship” between/among variables are:

1.      Strength, or association, between variables = e.g., Pearson & Spearman rho correlations
2.      Statistical differences on a continuous variable by group(s) = e.g., t-test and ANOVA
3.      Statistical contribution/prediction of one variable from one or more others = e.g., regression.

  •  Correlations are the appropriate analyses when the goal of the research is to test the strength, or association, between two variables.  There are two main types of correlations: Pearson product-moment correlations, a.k.a. Pearson (r), and Spearman rho (rs) correlations.  A Pearson correlation is a parametric test that is appropriate when the two variables are continuous.  As with all parametric tests, there are assumptions that need to be met; for a Pearson correlation, these are linearity and homoscedasticity.  A Spearman correlation is a non-parametric test that is appropriate when at least one of the variables is ordinal.

o   E.g., a Pearson correlation is appropriate for the two continuous variables: age and height.
o   E.g., a Spearman correlation is appropriate for the variables: age (continuous) and income level (under 25,000, 25,000 – 50,000, 50,001 – 100,000, above 100,000).
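The two correlation examples above can be sketched in a few lines with SciPy. The data below are made-up illustrative values (the income levels are coded as ordinal ranks 1–4), not figures from this post.

```python
# Sketch: Pearson vs. Spearman correlation with scipy.stats (made-up data).
from scipy import stats

age = [21, 25, 30, 35, 40, 45, 50, 55]
height = [64, 66, 67, 69, 68, 70, 69, 71]   # continuous, so Pearson fits
income_level = [1, 1, 2, 2, 3, 3, 4, 4]     # ordinal codes, so Spearman fits

# Parametric: strength of the linear association between two continuous variables
r, p_r = stats.pearsonr(age, height)

# Non-parametric: rank-based association when one variable is ordinal
rho, p_s = stats.spearmanr(age, income_level)

print(f"Pearson r = {r:.3f} (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_s:.4f})")
```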

  •  To test for mean differences by group, there are a variety of analyses that can be appropriate.  Three parametric examples will be given: the dependent sample t test, the independent sample t test, and the analysis of variance (ANOVA).  The assumption of the dependent sample t test is normality.  The assumptions of the independent sample t test and the ANOVA are normality and equality of variance (a.k.a. homogeneity of variance).

o   E.g., a dependent t-test is appropriate for testing mean differences on a continuous variable by time on the same group of people: testing weight differences by time (year 1 - before diet vs. year 2 - after diet) for the same participants.
o   E.g., an independent t-test is appropriate for testing mean differences on a continuous variable by two independent groups: testing GPA scores by gender (males vs. females)
o   E.g., an ANOVA is appropriate for testing mean differences on a continuous variable by a grouping variable with more than two independent groups: testing IQ scores by college major (Business vs. Engineering vs. Nursing vs. Communications)
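The three mean-difference examples above map directly onto SciPy functions. This is a minimal sketch with made-up data, not results from the post.

```python
# Sketch: paired t-test, independent t-test, and one-way ANOVA (made-up data).
from scipy import stats

# Dependent (paired) t-test: same participants, weight before vs. after a diet
before = [200, 185, 190, 210, 172, 195, 188, 202]
after = [192, 180, 184, 200, 170, 188, 183, 195]
t_dep, p_dep = stats.ttest_rel(before, after)

# Independent t-test: GPA by two independent groups
gpa_m = [2.9, 3.1, 3.4, 2.7, 3.0, 3.3]
gpa_f = [3.2, 3.5, 3.1, 3.6, 3.3, 3.4]
t_ind, p_ind = stats.ttest_ind(gpa_m, gpa_f)  # assumes equal variances by default

# One-way ANOVA: IQ by three or more independent groups
iq_bus = [105, 110, 98, 112]
iq_eng = [115, 108, 120, 111]
iq_nur = [102, 107, 99, 109]
f_stat, p_anova = stats.f_oneway(iq_bus, iq_eng, iq_nur)

print(f"paired t = {t_dep:.2f} (p = {p_dep:.4f})")
print(f"independent t = {t_ind:.2f} (p = {p_ind:.4f})")
print(f"ANOVA F = {f_stat:.2f} (p = {p_anova:.4f})")
```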

  •  To test if one or more variables offer a significant contribution to, or predict, another variable, a regression is appropriate.  Three parametric examples will be given: simple linear regression, multiple linear regression, and binary logistic regression.  The assumptions of a simple linear regression are linearity and homoscedasticity.  The assumptions of a multiple linear regression are linearity, homoscedasticity, and the absence of multicollinearity.  The assumption of binary logistic regression is absence of multicollinearity.

o   E.g., a simple linear regression is appropriate for testing if a continuous variable predicts another continuous variable: testing if IQ scores predict SAT scores
o   E.g., a multiple linear regression is appropriate for testing if more than one continuous variable predicts another continuous variable: testing if IQ scores and GPA scores predict SAT scores
o   E.g., a binary logistic regression is appropriate for testing if more than one variable (continuous or dichotomous) predicts a dichotomous variable: testing if IQ scores, gender, and GPA scores predict entrance to college (yes = 1 vs. no = 0). 
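The two linear-regression examples above can be sketched with SciPy and NumPy on made-up data; a dedicated package such as statsmodels would additionally report coefficient significance tests, and logistic regression likewise needs such a package, so only the linear cases are shown here.

```python
# Sketch: simple and multiple linear regression (made-up IQ/GPA/SAT data).
import numpy as np
from scipy import stats

iq = np.array([100, 110, 120, 105, 115, 125, 95, 130])
gpa = np.array([3.0, 3.2, 3.8, 3.1, 3.5, 3.9, 2.8, 4.0])
sat = np.array([1000, 1100, 1250, 1050, 1150, 1300, 950, 1350])

# Simple linear regression: does IQ predict SAT?
res = stats.linregress(iq, sat)
r2_simple = res.rvalue ** 2
print(f"slope = {res.slope:.2f}, R^2 = {r2_simple:.3f}")

# Multiple linear regression: do IQ and GPA jointly predict SAT?
X = np.column_stack([np.ones_like(iq, dtype=float), iq, gpa])  # intercept column
coef, *_ = np.linalg.lstsq(X, sat, rcond=None)
pred = X @ coef
r2_multiple = 1 - np.sum((sat - pred) ** 2) / np.sum((sat - sat.mean()) ** 2)
print(f"multiple R^2 = {r2_multiple:.3f}")
```

Note that the multiple-regression R² can never be lower than the simple-regression R² when the simple model's predictor is retained, which is exactly why the best-subsets post below relies on adjusted R² instead.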


  •  Regarding the assumptions mentioned above:


o   Linearity assumes a straight line relationship between the variables
o   Homoscedasticity assumes that the variance of the scores about the regression line is constant across all levels of the predictor (i.e., the spread of residuals is even)
o   Absence of multicollinearity assumes that predictor variables are not too related
o   Normality assumes that the dependent variables are normally distributed (symmetrical bell shaped) for each group
o   Homogeneity of variance assumes that groups have equal error variances
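Two of these assumptions (normality and homogeneity of variance) have standard formal tests in SciPy. A minimal sketch, using made-up group data:

```python
# Sketch: testing normality and homogeneity of variance (made-up data).
from scipy import stats

group_a = [5.1, 4.8, 5.5, 5.0, 4.9, 5.2, 5.3, 4.7]
group_b = [6.0, 5.8, 6.3, 5.9, 6.1, 5.7, 6.2, 6.4]

# Normality: Shapiro-Wilk tests each group's distribution separately;
# a small p-value indicates departure from normality.
w_a, p_norm_a = stats.shapiro(group_a)
w_b, p_norm_b = stats.shapiro(group_b)

# Homogeneity of variance: Levene's test compares the groups' error variances;
# a small p-value indicates unequal variances.
w_lev, p_lev = stats.levene(group_a, group_b)

print(f"Shapiro-Wilk p (group A) = {p_norm_a:.3f}")
print(f"Levene p = {p_lev:.3f}")
```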

Friday, September 14, 2012

Best Subsets Regression


Best subsets regression is an exploratory model-building regression analysis.  It compares all possible models that can be created based upon an identified set of predictors.  The results presented for best subsets, by default in Minitab, show the two best models for one predictor, two predictors, three predictors, and so on for the number of possible predictors that were entered into the best subsets regression.  The output in Minitab presents R2, adjusted R2, Mallows' Cp, and S.  To determine the best model, these model fit statistics are used in conjunction with one another.  R2 and adjusted R2 measure the coefficient of multiple determination and are used to determine the amount of predictability of the criterion variable based upon the set of predictor variables.  Mallows' Cp is a measure of bias or prediction error.  S is the square root of the mean square error (MSE).
The decision is not always clear, so the researcher must use all the tools available to make the most informed choice.  When selecting the best subset, we are looking for the highest adjusted R2.  Every increase in the number of predictors will cause an increase in the R2 value; therefore, when selecting among different numbers of predictors it is more reasonable to use the adjusted R2, as the adjusted R2 increases only if the added predictors improve the model more than chance alone.  Regarding Mallows' Cp, where p indicates the number of parameters in the model, we are looking for a value equal to or less than p.  The number of parameters in each model is equal to the number of predictors plus one, where the one is the intercept parameter.  So if our output reads two variables, we know that the number of parameters in the model is equal to three.  There are a few things to note when analyzing Mallows' Cp:
o   The model with the maximum number of predictors always shows Cp = p, so Mallows' Cp is not a good selection tool for the full model.
o   If all models except the full model display a large Cp, then the candidate set is lacking important predictors that must be identified before going forward.
o   When several models show a Cp near p, the model with the smallest Cp should be selected to be certain the bias is small.
o   Further, when several models show a Cp near p, the model with the fewest predictors should be selected.
In addition to these guidelines, we are also looking for the model with the smallest S.  Taking these factors into account should allow the researcher to select the most appropriate, best-fitting regression model.
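The exhaustive comparison that Minitab automates can be sketched with a small NumPy search. This is a hypothetical illustration on simulated data where only two of four candidate predictors carry signal, ranking subsets by adjusted R² as described above (Mallows' Cp and S would be computed and inspected alongside it in practice).

```python
# Minimal best-subsets sketch: fit OLS for every predictor subset and
# keep the model with the highest adjusted R^2 (simulated data).
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 4))  # four candidate predictors
# Only predictors 0 and 2 actually drive the response:
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=n)

def adj_r2(X_sub, y):
    """Fit OLS with an intercept and return adjusted R^2."""
    n, k = X_sub.shape
    A = np.column_stack([np.ones(n), X_sub])  # k predictors + intercept = k+1 parameters
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

best = max(
    (cols for size in range(1, 5) for cols in combinations(range(4), size)),
    key=lambda cols: adj_r2(X[:, cols], y),
)
print("best subset (0-indexed predictors):", best)
```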
 
Additional reading/reference
https://onlinecourses.science.psu.edu/stat501/node/89