When a researcher chooses to create their own survey instrument, it is appropriate to run an exploratory factor analysis (EFA) to assess for potential subscales within the instrument. However, it is generally unnecessary to run an EFA on an already established instrument. In the case of an established instrument, Cronbach’s alpha is typically the accepted way to assess reliability.
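As a quick illustration, Cronbach’s alpha can be computed from the item variances and the variance of the total score. The sketch below uses made-up responses (4 respondents, 3 items) and population variances:

```python
from statistics import pvariance

# Hypothetical responses: 3 survey items, each with one response per respondent
items = [
    [1, 2, 3, 4],  # item 1
    [2, 3, 4, 5],  # item 2
    [1, 3, 2, 4],  # item 3
]

k = len(items)                              # number of items
totals = [sum(col) for col in zip(*items)]  # each respondent's total score

# alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
alpha = (k / (k - 1)) * (1 - sum(pvariance(i) for i in items) / pvariance(totals))
print(round(alpha, 3))  # → 0.951
```

With these hypothetical data, alpha comes out around 0.95, which would indicate high internal consistency among the items.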
Monday, November 12, 2012
Thursday, November 8, 2012
Recoding
Survey items can be worded with a positive or negative direction:
· Positively worded: e.g., I know that I am welcomed at my child’s school, I feel that I am good at my job, having a wheelchair helps, etc.
· Negatively worded: e.g., I feel isolated at my child’s school, I am not good at my job, having a wheelchair is a hindrance, etc.
· Likert-scaled responses can vary: e.g., 1 = never, 2 = sometimes, 3 = always; or 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree.
· When creating a composite score from specific survey items, we want to make sure we are looking at the responses in the same manner. If the survey items are not all worded in the same direction, we need to recode the responses. E.g., I want to make a composite score called “helpfulness” from the following survey items:
o 5-point Likert scale, where 5 = always, 4 = almost always, 3 = sometimes, 2 = almost never, 1 = never
1. I like to tutor at school
2. I am usually asked by my friends to help with homework
3. I typically do homework in a group setting
4. I do not go over my homework with others
In this example, survey items 1–3 are all positively worded, but survey item 4 is not. When creating the composite score, we want to make sure that we are examining the coded responses the same way. In this case, we would have to recode the responses to survey item 4 so that all responses for the “helpfulness” score are correctly interpreted; the recoded responses for survey item 4 are: 1 = always, 2 = almost always, 3 = sometimes, 4 = almost never, 5 = never.
Now all of the scored responses have the same direction and can be interpreted correctly: positive responses for “helpfulness” have higher values and negative responses for “helpfulness” have lower values.
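On a 5-point scale, reverse coding amounts to subtracting each response from 6 (the scale maximum plus one). A minimal sketch with hypothetical responses to survey item 4:

```python
# Hypothetical responses to survey item 4 (1 = never ... 5 = always)
item4 = [5, 1, 3, 2]

# Reverse-code on a 5-point scale: new = (max + 1) - old = 6 - old
item4_recoded = [6 - r for r in item4]
print(item4_recoded)  # → [1, 5, 3, 4]
```

The same idea generalizes to any scale: subtract each response from (scale minimum + scale maximum).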
· Also, you may wish to change the number of responses. For example, you may wish to dichotomize or trichotomize the responses. In the example above, you could trichotomize the responses by recoding “always” and “almost always” to 3 = high, “sometimes” to 2 = medium, and “almost never” and “never” to 1 = low. However, please be sure that you have a sound reason to alter the number of responses.
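That trichotomization can be expressed as a simple lookup table (the raw responses here are hypothetical):

```python
# Recode a 5-point item into three levels: 3 = high, 2 = medium, 1 = low
recode_map = {5: 3, 4: 3, 3: 2, 2: 1, 1: 1}

responses = [5, 3, 1, 4, 2]  # hypothetical raw responses
trichotomized = [recode_map[r] for r in responses]
print(trichotomized)  # → [3, 2, 1, 3, 1]
```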
Wednesday, November 7, 2012
Cox Event History
Cox event history is a branch of statistics that deals mainly with the death of biological organisms and the failure of mechanical systems. It is sometimes described simply as a statistical method for analyzing survival data, and it is known by various other names, such as survival analysis, duration analysis, or transition analysis. Generally speaking, this technique involves the modeling of data structured in a time-to-event format. The goal of the analysis is to understand the probability of the occurrence of an event. Cox event history was developed primarily for use in the medical and biological sciences; however, the technique is now frequently used in engineering and in many other areas of statistical and data analysis.
One of the key purposes of the Cox event history technique is to explain the causes behind the differences or similarities between the events encountered by subjects. For instance, Cox regression may be used to evaluate why certain individuals are at a higher risk of contracting some diseases. Thus, it can be effectively applied to the study of acute or chronic diseases, hence the interest in Cox regression within the medical sciences. The Cox event history model focuses mainly on the hazard function, which gives the rate at which events occur at a particular instant in time, given that the event has not yet occurred.
The basic Cox event history model can be summarized by the following function:
h(t) = h0(t) exp(b1X1 + b2X2 + … + bnXn)
where h(t) = hazard rate,
h0(t) = baseline hazard function, and
b’s and X’s = coefficients and covariates.
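That function translates directly into code. A minimal sketch, where the baseline hazard value, coefficients, and covariates are all hypothetical:

```python
import math

def cox_hazard(h0_t, b, x):
    """Cox model hazard: h(t) = h0(t) * exp(b1*x1 + ... + bn*xn)."""
    return h0_t * math.exp(sum(bi * xi for bi, xi in zip(b, x)))

# Hypothetical example: baseline hazard 0.10 at time t, two covariates
h = cox_hazard(0.10, b=[0.5, -0.3], x=[1.0, 2.0])
print(round(h, 4))  # 0.10 * exp(0.5 - 0.6) → 0.0905
```

Note that the coefficients act multiplicatively on the baseline hazard: exp(b) is the hazard ratio associated with a one-unit increase in the corresponding covariate.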
Cox event history can be categorized mainly under three models: non-parametric, semi-parametric, and parametric.
Non-parametric: The non-parametric model does not make any assumptions about the hazard function or the variables affecting it. Consequently, only a limited number of variable types can be handled with the help of a non-parametric model. This type of model involves the analysis of empirical data showing changes over a period of time and cannot handle continuous variables.
Semi-parametric: Similar to the non-parametric model, the semi-parametric model does not make any assumptions about the shape of the hazard function or the variables affecting it. What makes this model different is that it assumes the hazard rates are proportional over time. The estimates of the hazard function shape can be derived empirically as well. Multivariate analyses are supported by semi-parametric models, which are often considered the more reliable fitting method for a Cox event history analysis.
Parametric: In this model, the shape of the hazard function and the variables affecting it are specified in advance. Multivariate analyses of discrete and continuous explanatory variables are supported by the parametric model. However, if the hazard function shape is incorrectly specified, the results may be biased. Parametric models are frequently used to analyze the nature of time dependency; they are also particularly useful for predictive modeling, because the shape of the baseline hazard function is fully determined by the parametric model.
Cox event history analysis involves certain assumptions. As with every other statistical method or technique, violating an assumption will often render the results statistically unreliable. The major assumption is proportional hazards: the effect of each independent variable on the hazard does not change with the passage of time. In other words, the independent variables should have a constant hazard ratio over time.
In addition, hazard rates are rarely smooth in reality. Frequently, these rates need to be smoothed over in order for them to be useful for Cox event history analysis.
Applications of Cox Event History
Cox event history can be applied in many fields, although initially it was used primarily in the medical and other biological sciences. Today it is an excellent tool for other applications and is frequently used where the dependent variable is a time-to-event outcome, especially in socio-economic analyses. For instance, in economics, Cox event history is used extensively to relate macro- or micro-economic indicators over time; one could, for example, model the duration of spells of unemployment. In addition, in commercial applications, Cox event history can be applied to estimate the lifespan and breakdown points of a certain machine based on historical data.
Friday, September 14, 2012
Best Subsets Regression
Best subsets regression is an exploratory model-building regression analysis. It compares all possible models that can be created from an identified set of predictors. By default, the best subsets results in Minitab show the two best models for one predictor, two predictors, three predictors, and so on, up to the number of predictors entered into the best subsets regression.
The Minitab output presents R2, adjusted R2, Mallows’ Cp, and S. These model fit statistics are used in conjunction with one another to determine the best model. R2 and adjusted R2 are coefficients of multiple determination and measure how much of the variability in the criterion variable is explained by the set of predictor variables. Mallows’ Cp is a measure of bias or prediction error. S is the square root of the mean square error (MSE).
The decision is not always clear, so the researcher must use all the tools available to make the most informed choice. When selecting the best subset, we are looking for the highest adjusted R2. Every increase in the number of predictors causes an increase in the R2 value; therefore, when selecting among models with different numbers of predictors, it is more reasonable to use the adjusted R2, which increases only if the added predictors improve the model more than chance alone. In regard to Mallows’ Cp, where p indicates the number of parameters in the model, we are looking for a value equal to or less than p. The number of parameters in each model is equal to the number of predictors plus one, where the one is the intercept parameter. So if our output reads two variables, we know that the number of parameters in the model is three. There are a few things to note when analyzing Mallows’ Cp:
· The model with the maximum number of predictors always shows Cp = p, so Mallows’ Cp is not a good selection tool for the full model.
· If all models except the full model display a large Cp, then the models are missing important predictors, which must be identified before going forward.
· When several models show a Cp near p, the model with the smallest Cp should be selected to be certain the bias is small.
· Further, when several models show a Cp near p, the model with the fewest predictors should be selected.
In addition to these guidelines, we are also looking for the model with the smallest S. Taking these factors into account should allow the researcher to select the most appropriate, best-fitting regression model.
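These selection rules are easy to check by hand. The sketch below uses the standard formulas for adjusted R2 and Mallows’ Cp; the sample size, R2 values, and SSE values are hypothetical stand-ins for best subsets output:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def mallows_cp(sse_k, mse_full, n, k):
    """Mallows' Cp for a subset model with k predictors (p = k + 1 parameters)."""
    p = k + 1
    return sse_k / mse_full - (n - 2 * p)

n = 30  # hypothetical sample size
# Hypothetical candidate models: (number of predictors k, R^2, SSE)
candidates = [(1, 0.60, 40.0), (2, 0.70, 30.0), (3, 0.71, 29.0)]
mse_full = 29.0 / (n - 3 - 1)  # MSE of the full (three-predictor) model

for k, r2, sse in candidates:
    print(k, round(adjusted_r2(r2, n, k), 3), round(mallows_cp(sse, mse_full, n, k), 2))
```

With these made-up numbers, the two-predictor model wins: it has the highest adjusted R2 (about 0.678, versus about 0.677 for the full model) and a Cp of about 2.9, below p = 3, while the full model shows Cp = p = 4 exactly, as it always must.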
Additional reading/reference: https://onlinecourses.science.psu.edu/stat501/node/89
Monday, September 10, 2012
Binary Logistic Regression
- Logistic regression is an extension of simple linear regression.
- Where the dependent variable is dichotomous or binary in nature, we cannot use simple linear regression. Logistic regression is the statistical technique used to predict the relationship between predictors (our independent variables) and a predicted variable (the dependent variable) where the dependent variable is binary (e.g., sex [male vs. female], response [yes vs. no], score [high vs. low], etc…).
- There must be one or more independent variables, or predictors, for a logistic regression. The IVs, or predictors, can be continuous (interval/ratio) or categorical (ordinal/nominal).
- All predictor variables are tested in one block to assess their predictive ability while controlling for the effects of other predictors in the model.
- Assumptions for a logistic regression:
1. adequate sample size (too few participants for too many predictors is bad!);
2. absence of multicollinearity (multicollinearity = high intercorrelations among the predictors);
3. no outliers.
- The statistic -2LogL (minus 2 times the log of the likelihood) is a badness-of-fit indicator, that is, large numbers mean poor fit of the model to the data.
- When taken from large samples, the difference between two values of -2LogL is distributed as chi-square:
chi-square = (-2LogLikelihoodR) - (-2LogLikelihoodF)
where likelihoodR is for a restricted, or smaller, model and likelihoodF is for a full, or larger, model.
- LikelihoodF has all the parameters of interest.
- LikelihoodR is nested in the larger model. (nested = all terms occur in the larger model; necessary condition for model comparison tests).
- A nested model cannot contain, as an IV, some categorical or continuous variable that is not contained in the full model. If it does, then it is no longer nested, and we cannot compare the two values of -2LogL to get a chi-square value.
- The chi-square is used to statistically test whether including a variable significantly reduces the badness-of-fit measure.
- If the chi-square is significant, the variable is considered to be a significant predictor in the equation.
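The whole procedure can be sketched end to end. The data below are hypothetical, the full model is fit by plain gradient ascent rather than a packaged routine, and the p-value uses the closed-form chi-square survival function for 1 degree of freedom:

```python
import math

def neg2_log_likelihood(y, p):
    """-2 * log-likelihood for binary outcomes y and predicted probabilities p."""
    ll = sum(math.log(pi) if yi == 1 else math.log(1 - pi) for yi, pi in zip(y, p))
    return -2 * ll

# Hypothetical data: one continuous predictor, one binary outcome
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 1, 0, 1, 1]

# Restricted (intercept-only) model: every predicted probability is the sample proportion
p_bar = sum(y) / len(y)
neg2ll_restricted = neg2_log_likelihood(y, [p_bar] * len(y))

# Full model (intercept + slope), fit by gradient ascent on the log-likelihood
b0, b1 = 0.0, 0.0
for _ in range(20000):
    p = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]
    b0 += 0.01 * sum(yi - pi for yi, pi in zip(y, p))
    b1 += 0.01 * sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
neg2ll_full = neg2_log_likelihood(y, [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x])

# Likelihood-ratio chi-square: difference of the two -2LogL values, 1 df for one added IV
chi_square = neg2ll_restricted - neg2ll_full
p_value = math.erfc(math.sqrt(chi_square / 2))  # chi-square survival function, df = 1
print(round(chi_square, 3), round(p_value, 3))
```

Because the intercept-only model is nested in the full model, the full model always fits at least as well, so the chi-square is non-negative; a significant chi-square indicates that the added predictor improves the fit.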