Friday, December 7, 2012

The differences among the most common statistical analyses




Correlation vs. Regression vs. Mean Differences
  •  Inferential (parametric and non-parametric) statistics are conducted when the goal of the research is to draw conclusions about the statistical significance of the relationships and/or differences among variables of interest.

  •   The “relationships” can be tested in different statistical ways, depending on the goal of the research.  The three most common meanings of “relationship” between/among variables are:

1.      Strength, or association, between variables = e.g., Pearson & Spearman rho correlations
2.      Statistical differences on a continuous variable by group(s) = e.g., t-test and ANOVA
3.      Statistical contribution to, or prediction of, a variable from another (or others) = e.g., regression.

  •  Correlations are the appropriate analyses when the goal of the research is to test the strength, or association, between two variables.  There are two main types of correlations: the Pearson product-moment correlation, a.k.a. Pearson (r), and the Spearman rho (rs) correlation.  A Pearson correlation is a parametric test that is appropriate when the two variables are continuous.  As with all parametric tests, there are assumptions that need to be met; for a Pearson correlation these are linearity and homoscedasticity.  A Spearman correlation is a non-parametric test that is appropriate when at least one of the variables is ordinal.  (A short computational sketch follows the examples below.)

o   E.g., a Pearson correlation is appropriate for the two continuous variables: age and height.
o   E.g., a Spearman correlation is appropriate for the variables: age (continuous) and income level (under 25,000, 25,000 – 50,000, 50,001 – 100,000, above 100,000).
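As a minimal sketch of these two examples in Python with scipy (the age, height, and income-level values below are hypothetical):

import numpy as np
from scipy import stats

# Hypothetical data: age and height are continuous; income_level is ordinal (1-4)
age = np.array([23, 31, 45, 52, 38, 29, 61, 47])
height = np.array([170, 175, 168, 172, 180, 165, 169, 174])
income_level = np.array([1, 2, 3, 4, 2, 1, 4, 3])

# Pearson r: strength of the linear association between two continuous variables
r, p_r = stats.pearsonr(age, height)

# Spearman rho: rank-based association, appropriate when at least one variable is ordinal
rho, p_rho = stats.spearmanr(age, income_level)

print("Pearson r =", round(r, 3), "p =", round(p_r, 3))
print("Spearman rho =", round(rho, 3), "p =", round(p_rho, 3))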

  • To test for mean differences by group, there are a variety of analyses that can be appropriate.  Three parametric examples will be given: the dependent sample t-test, the independent sample t-test, and the analysis of variance (ANOVA).  The assumption of the dependent sample t-test is normality.  The assumptions of the independent sample t-test and of the ANOVA are normality and equality of variance (a.k.a. homogeneity of variance).  (See the sketch after the examples that follow.)

o   E.g., a dependent t-test is appropriate for testing mean differences on a continuous variable by time on the same group of people: testing weight differences by time (year 1 – before diet vs. year 2 – after diet) for the same participants.
o   E.g., an independent t-test is appropriate for testing mean differences on a continuous variable by two independent groups: testing GPA scores by gender (males vs. females).
o   E.g., an ANOVA is appropriate for testing mean differences on a continuous variable by more than two independent groups: testing IQ scores by college major (Business vs. Engineering vs. Nursing vs. Communications).
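A minimal sketch of the three tests above in Python with scipy, using made-up data:

import numpy as np
from scipy import stats

# Dependent (paired) samples t-test: weight for the same participants at two time points
weight_year1 = np.array([82, 95, 78, 88, 102, 91])
weight_year2 = np.array([79, 90, 76, 85, 97, 89])
t_dep, p_dep = stats.ttest_rel(weight_year1, weight_year2)

# Independent samples t-test: GPA by two independent groups
gpa_group1 = np.array([3.1, 2.8, 3.5, 3.0, 2.9])
gpa_group2 = np.array([3.4, 3.2, 3.6, 3.1, 3.3])
t_ind, p_ind = stats.ttest_ind(gpa_group1, gpa_group2)

# One-way ANOVA: IQ by more than two independent groups (e.g., college major)
iq_business = np.array([105, 110, 98, 112])
iq_engineering = np.array([115, 108, 120, 111])
iq_nursing = np.array([107, 103, 109, 113])
f_stat, p_anova = stats.f_oneway(iq_business, iq_engineering, iq_nursing)

print(p_dep, p_ind, p_anova)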

  •  To test whether a variable (or variables) offers a significant contribution to, or predicts, another variable, a regression is appropriate.  Three parametric examples will be given: simple linear regression, multiple linear regression, and binary logistic regression.  The assumptions of a simple linear regression are linearity and homoscedasticity.  The assumptions of a multiple linear regression are linearity, homoscedasticity, and the absence of multicollinearity.  The assumption of a binary logistic regression is the absence of multicollinearity.  (A sketch follows the examples below.)

o   E.g., a simple linear regression is appropriate for testing if a continuous variable predicts another continuous variable: testing if IQ scores predict SAT scores
o   E.g., a multiple linear regression is appropriate for testing if more than one continuous variable predicts another continuous variable: testing if IQ scores and GPA scores predict SAT scores
o   E.g., a binary logistic regression is appropriate for testing if more than one variable (continuous or dichotomous) predicts a dichotomous variable: testing if IQ scores, gender, and GPA scores predict entrance to college (yes = 1 vs. no = 0). 
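A minimal sketch of the three regressions above in Python with statsmodels, again on small hypothetical data (a toy illustration only):

import pandas as pd
import statsmodels.api as sm

# Hypothetical data; "college" is a dichotomous outcome (1 = entered college, 0 = did not)
df = pd.DataFrame({
    "iq":      [100, 110, 95, 120, 105, 115, 98, 108],
    "gpa":     [3.0, 2.9, 3.4, 3.2, 2.7, 3.8, 3.1, 3.5],
    "sat":     [1050, 1180, 980, 1350, 1100, 1250, 1000, 1150],
    "college": [0, 1, 0, 1, 0, 1, 1, 1],
})

# Simple linear regression: IQ predicting SAT
simple = sm.OLS(df["sat"], sm.add_constant(df[["iq"]])).fit()

# Multiple linear regression: IQ and GPA predicting SAT
multiple = sm.OLS(df["sat"], sm.add_constant(df[["iq", "gpa"]])).fit()

# Binary logistic regression: IQ and GPA predicting the dichotomous outcome
logistic = sm.Logit(df["college"], sm.add_constant(df[["iq", "gpa"]])).fit(disp=0)

print(simple.params, multiple.params, logistic.params, sep="\n\n")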


  •  In regard to the assumptions mentioned above (a quick way to check several of them is sketched after this list):


o   Linearity assumes a straight-line relationship between the variables
o   Homoscedasticity assumes that the variability of scores about the regression line is constant (the same at every level of the predictor)
o   Absence of multicollinearity assumes that the predictor variables are not too highly related to one another
o   Normality assumes that the dependent variable is normally distributed (a symmetrical bell shape) for each group
o   Homogeneity of variance assumes that the groups have equal error variances
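Several of these assumptions can be examined empirically; a minimal sketch of common checks in Python, using simulated data:

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)

# Normality: Shapiro-Wilk test on the dependent variable within each group
group_a = rng.normal(100, 15, 30)
group_b = rng.normal(105, 15, 30)
w_a, p_a = stats.shapiro(group_a)
w_b, p_b = stats.shapiro(group_b)

# Homogeneity of variance: Levene's test across groups
lev_stat, lev_p = stats.levene(group_a, group_b)

# Absence of multicollinearity: variance inflation factors for a set of predictors
X = sm.add_constant(pd.DataFrame({"iq": rng.normal(100, 15, 30),
                                  "gpa": rng.normal(3.0, 0.4, 30)}))
vifs = [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])]

print(p_a, p_b, lev_p, vifs)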

Monday, November 19, 2012

Manipulation Checks (between two groups)



  • A procedure that can be used to test whether the levels (or groups) of the IV differ on the DVs.  E.g., a study consists of two different types of primates, where one primate is “more intelligent” and the other primate is “less intelligent.”  The IV is primate intelligence (high intelligence vs. low intelligence), and the DVs are five different questionnaires that each measure, or rate, the participants’ attitudes toward the primates.  Each questionnaire can measure different attributes that deal with primate intelligence (e.g., problem solving, memorization, etc…).  A manipulation check would assess whether the researcher has effectively “manipulated” primate intelligence.  In this example, an independent sample t-test would be the appropriate statistical analysis for the manipulation check: five t-tests on the five composite scores (from the five different questionnaires) by primate intelligence (high intelligence vs. low intelligence).  If the results (per composite score) are statistically significant, then primate intelligence can be said to be effectively manipulated, and the IV can be used for further analyses.  (A minimal sketch of this check appears after this list.)

  • A manipulation check can be, but does not have to be, included at the end of each questionnaire.

  • Checks (each possible comparison) for consistency on the different dependent variables (the questionnaires) by the independent variable (the two groups).

  • Included in a study to test whether the IV affects the different surveys (DVs) in the manner in which the study was designed to work.
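A minimal sketch of such a manipulation check in Python (the group labels, column names, and scores below are all hypothetical):

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)

# One row per participant; group = primate intelligence condition,
# q1..q5 = composite scores from the five questionnaires
df = pd.DataFrame({"group": np.repeat(["high", "low"], 20),
                   **{f"q{i}": rng.normal(3.5, 0.6, 40) for i in range(1, 6)}})

# Manipulation check: an independent samples t-test on each composite score by group
for col in ["q1", "q2", "q3", "q4", "q5"]:
    high = df.loc[df["group"] == "high", col]
    low = df.loc[df["group"] == "low", col]
    t, p = stats.ttest_ind(high, low)
    print(col, "t =", round(t, 2), "p =", round(p, 3))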

Monday, November 12, 2012

EFA vs. Cronbach's Alpha



When a researcher chooses to create their own survey instrument, it is appropriate to run an exploratory factor analysis (EFA) to assess for potential subscales within the instrument.  However, it is seemingly unnecessary to run an EFA on an already established instrument.  In the case of an already established instrument, a Cronbach’s alpha is typically the acceptable way to assess reliability.
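For reference, Cronbach's alpha can be computed directly from the item responses; a minimal sketch in Python, using simulated item data:

import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    # alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Simulated responses of 100 people to a 4-item subscale
rng = np.random.default_rng(2)
base = rng.normal(3, 1, 100)
items = pd.DataFrame({f"item{i}": base + rng.normal(0, 0.7, 100) for i in range(1, 5)})
print(round(cronbach_alpha(items), 3))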

Thursday, November 8, 2012

Recoding



Survey items can be worded with a positive or negative direction:

·         Positively worded: e.g., I know that I am welcomed at my child’s school, I feel that I am good at my job, Having a wheelchair helps, etc…
·         Negatively worded: e.g., I feel isolated at my child’s school, I am not good at my job, having a wheelchair is a hindrance, etc…
·         Likert scaled responses can vary: e.g., 1 = never, 2= sometimes, 3 = always; OR
1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree
·         When creating a composite score from specific survey items, we want to make sure we are looking at the responses in the same manner.  If we have survey items that are not all worded in the same direction, we need to re-code the responses. E.g.: I want to make a composite score called “helpfulness” from the following survey items:

o   5-point Likert scaled, where 5 = always, 4 = almost always, 3 = sometimes, 2 = almost never, 1 = never

1.      I like to tutor at school   
2.      I am usually asked by my friends to help with homework
3.      I typically do homework in a group setting
4.      I do not go over my homework with others

In this example, survey items 1 – 3 are all positively worded, but survey item 4 is not.  When creating the composite score, we wish to make sure that we are examining the coded responses the same way.  In this case, we’d have to re-code the responses to survey item 4 to make sure that all responses for the score “helpfulness” are correctly interpreted; the recoded responses for survey item 4 are: 1 = always, 2 = almost always, 3 = sometimes, 4 = almost never, 5 = never.

Now, all responses that are scored have the same direction and thus, can be interpreted correctly: positive responses for “helpfulness” have higher values and negative responses for “helpfulness” have lower values.
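A minimal sketch of this reverse-coding step in pandas (the responses below are made up):

import pandas as pd

# Hypothetical 1-5 responses to the four "helpfulness" items; item 4 is negatively worded
df = pd.DataFrame({"item1": [5, 4, 3, 5, 2],
                   "item2": [4, 4, 2, 5, 3],
                   "item3": [5, 3, 3, 4, 2],
                   "item4": [1, 2, 4, 1, 4]})

# Reverse-code the negatively worded item so all items point in the same direction:
# on a 5-point scale the recoded value is 6 minus the original (5->1, 4->2, ..., 1->5)
df["item4_recoded"] = 6 - df["item4"]

# Composite "helpfulness" score from the consistently coded items
df["helpfulness"] = df[["item1", "item2", "item3", "item4_recoded"]].mean(axis=1)
print(df)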

·         Also, you may wish to change the number of responses.  For example, you may wish to dichotomize or trichotomize the responses.  In the example above, you can trichotomize the responses by recoding responses “always” and “almost always” to 3 = high, “sometimes” to 2 = sometimes, and “almost never” and “never” to 1 = low.  However, please be advised to make sure that you have sound reason to alter the number of responses.
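And a sketch of trichotomizing along the lines described above (values again hypothetical):

import pandas as pd

# Collapse 1-5 responses into three levels: 5 or 4 -> 3 (high), 3 -> 2 (sometimes), 2 or 1 -> 1 (low)
responses = pd.Series([5, 4, 3, 5, 2, 1], name="item1")
trichotomized = responses.map({5: 3, 4: 3, 3: 2, 2: 1, 1: 1}).rename("item1_tri")
print(pd.concat([responses, trichotomized], axis=1))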

Wednesday, November 7, 2012

Cox Event History

Cox event history is a branch of statistics that deals mainly with the death of biological organisms and the failure of mechanical systems. It is also described as a statistical method for analyzing survival data, and it is known by various other names, such as survival analysis, duration analysis, or transition analysis. Generally speaking, this technique involves the modeling of data structured in a time-to-event format. The goal of this analysis is to understand the probability of the occurrence of an event. Cox event history was primarily developed for use in the medical and biological sciences. However, this technique is now frequently used in engineering as well as in statistical and data analysis.

One of the key purposes of the Cox event history technique is to explain the causes behind the differences or similarities between the events encountered by subjects. For instance, Cox regression may be used to evaluate why certain individuals are at a higher risk of contracting some diseases. Thus, it can be effectively applied to studying acute or chronic diseases, hence the interest in Cox regression by the medical science field. The Cox event history model mainly focuses on the hazard function, which gives the instantaneous rate (or risk) of the event occurring at a given point in time, given that it has not already occurred.

The basic Cox event history model can be summarized by the following function:

h(t) = h0(t)e^(b1X1 + b2X2 + … + bnXn)

where:

h(t) = the hazard rate

h0(t) = the baseline hazard function

the b's and X's = the coefficients and covariates
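A model of this form can be fit in Python with the lifelines package (assuming lifelines is installed; its bundled Rossi recidivism data set is used here purely as an example):

from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

# Time-to-event data: 'week' = duration until arrest (or censoring), 'arrest' = event indicator
rossi = load_rossi()

# Fit the proportional hazards model h(t) = h0(t) * exp(b1*X1 + ... + bn*Xn)
cph = CoxPHFitter()
cph.fit(rossi, duration_col="week", event_col="arrest")

# Estimated coefficients (the b's) and hazard ratios exp(b) for each covariate
cph.print_summary()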

Cox event history can be categorized mainly under three models: nonparametric, semi-parametric and parametric.

Non-parametric: The non-parametric model does not make any assumptions about the hazard function or the variables affecting it. Consequently, only a limited number of variable types can be handled with the help of a non-parametric model. This type of model involves the analysis of empirical data showing changes over a period of time and cannot handle continuous variables.

Semi-parametric: Similar to the non-parametric model, the semi-parametric model also does not make any assumptions about the shape of the hazard function or the variables affecting it. What makes this model different is that it assumes the rate of the hazard is proportional over a period of time. The estimates of the hazard function's shape can be derived empirically as well. Multivariate analyses are supported by semi-parametric models, which are often considered the more reliable fitting method for a Cox event history analysis.

Parametric: In this model, the shape of the hazard function and the variables affecting it are determined in advance. Multivariate analyses of discrete and continuous explanatory variables are supported by the parametric model. However, if the shape of the hazard function is incorrectly specified, then there is a chance that the results could be biased. Parametric models are frequently used to analyze the nature of time dependency, and they are particularly useful for predictive modeling because the shape of the baseline hazard function is fully specified.

Cox event history analysis involves the use of certain assumptions. As with every other statistical method or technique, if an assumption is violated, it will often lead to the results being statistically unreliable. The major assumption is that, with the passage of time, the effects of the independent variables do not change; in other words, each independent variable should be associated with a constant hazard ratio over time (the proportional hazards assumption).
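Continuing the lifelines sketch above, the proportional hazards assumption can be checked with the fitter's built-in diagnostic (again assuming lifelines is installed):

from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

rossi = load_rossi()
cph = CoxPHFitter().fit(rossi, duration_col="week", event_col="arrest")

# Tests based on scaled Schoenfeld residuals; a small p-value for a covariate suggests
# its effect changes over time, i.e., the proportional hazards assumption may be violated
cph.check_assumptions(rossi, p_value_threshold=0.05)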

In addition, hazard rates are rarely smooth in reality. Frequently, these rates need to be smoothed over in order for them to be useful for Cox event history analysis.

Applications of Cox Event History
Cox event history can be applied in many fields, although initially it was used primarily in the medical and other biological sciences. Today, it is an excellent tool for other applications, frequently used as a statistical method where the dependent variable is the time until an event occurs, especially in socio-economic analyses. For instance, in the field of economics, Cox event history is used extensively to model the duration of economic states, such as how long individuals remain unemployed before moving back into employment. In addition, in commercial applications, Cox event history can be applied to estimate the lifespan of a certain machine and its likely breakdown points based on historical data.
