In various posts, particularly those on regression analysis, analysis of variance, and time series, we’ve come across terms that seem deliberately designed to scare the reader.
The aim of these articles is to explain these key concepts simply, beyond their apparent complexity (something I would have really appreciated as a student, instead of facing texts written in a purposely convoluted and unnecessarily difficult way).
So, it’s time to spend a few words on three very important concepts that often recur in statistical analysis and need to be well understood. The reality is much, much clearer than it seems, so… don’t be afraid!
If you have followed me across various posts, you may remember that we mentioned this term when approaching regression analysis.
We talk about multicollinearity when there is a strong correlation between two or more explanatory variables in our regression model.
Multicollinearity is a rather tricky problem because it can undermine the validity of a regression analysis even when the coefficient of determination R² is high and the model appears significant.
When multicollinearity exists, it’s difficult to isolate the effect that each explanatory variable has on the dependent variable, and the coefficients estimated with the least squares method may turn out to be statistically insignificant.
We have several options for dealing with it, such as removing one of the correlated variables, combining them into a single predictor, or collecting more data.
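To make the diagnosis concrete, here is a minimal, purely illustrative sketch in R (the data are simulated and the variable names are invented for the example) that uses the variance inflation factor (VIF) from the car package to flag collinear predictors:

```r
# Minimal sketch: detecting multicollinearity with variance inflation factors (VIF).
# Assumes the 'car' package is installed; the data are simulated for illustration.
library(car)

set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.2)   # x2 is strongly correlated with x1
x3 <- rnorm(n)
y  <- 2 + x1 + x2 + x3 + rnorm(n)

model <- lm(y ~ x1 + x2 + x3)
vif(model)   # values well above 5-10 are a warning sign of multicollinearity
```

Here the VIFs of x1 and x2 should come out well above 10, while x3 stays close to 1, which is exactly the pattern you would expect given how the data were generated.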
This term seems designed to scare. If you want to reinforce someone’s belief (or bias) in the inherent, terrifying complexity of statistics, this is the magic word to use! 🙂
Surprise: the concept isn’t actually that complicated.
Heteroscedasticity simply means unequal dispersion.
It refers to situations where the variance of the error term isn’t constant for all values of the independent variable.
In regression analysis, heteroscedasticity is problematic because ordinary least squares regression assumes that all residuals come from a population with a constant variance (homoscedasticity).
Homoscedasticity is thus the opposite of heteroscedasticity…
Returning for a moment to the topic of regression, the assumption of homoscedasticity means that the prediction errors in Y have roughly the same spread at all levels of X.
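To see what this looks like in practice, here is a small, purely illustrative sketch in R: the data are simulated so that the error spread grows with X, and the Breusch-Pagan test from the lmtest package (an assumption added here for the example, not something covered in the earlier posts) is used as a formal check:

```r
# Minimal sketch: simulating heteroscedastic errors and checking for them.
# Assumes the 'lmtest' package is installed; the data are simulated for illustration.
library(lmtest)

set.seed(123)
n <- 200
x <- runif(n, 1, 10)
y <- 3 + 2 * x + rnorm(n, sd = 0.5 * x)   # the error spread grows with x

model <- lm(y ~ x)

plot(fitted(model), resid(model))   # a fan-shaped cloud is the classic visual symptom
bptest(model)                       # Breusch-Pagan test: a small p-value rejects homoscedasticity
```

The residual plot is often enough on its own: if the points fan out as the fitted values grow, constant variance is a hard assumption to defend.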
We discussed autocorrelation in the long post on time series analysis, where we also looked at a practical example.
To define the most common case, we can say that
positive first-order autocorrelation exists when the error term of one period is positively correlated with the error term of the immediately preceding period.
In time series, this is quite a common scenario and can lead to biased standard errors, which in turn produce misleading statistical tests and confidence intervals.
Autocorrelation, also referred to in some texts as serial correlation, can also be of a higher order (it is of the second order if the error term of one period is correlated with the error term from two periods earlier, and so forth) and can also be negative.
In my post on time series analysis, we used R’s valuable acf() function and discussed the Ljung-Box test.
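As a quick refresher, here is a minimal, self-contained sketch (simulated AR(1)-style errors, base R only) showing the two tools side by side:

```r
# Minimal sketch: inspecting autocorrelation in regression residuals.
# Base R only; the positively autocorrelated errors are simulated for illustration.
set.seed(123)
n <- 200
x <- 1:n
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))   # AR(1) errors with positive autocorrelation
y <- 5 + 0.3 * x + e

model <- lm(y ~ x)
res   <- resid(model)

acf(res)                                      # slowly decaying spikes suggest autocorrelation
Box.test(res, lag = 10, type = "Ljung-Box")   # small p-value -> residuals are autocorrelated
```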
A “classic” method involves checking for autocorrelation using the Durbin-Watson statistic, calculating the d value and comparing it to the appropriate table values at the desired significance level, typically 5% or 1%.
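If you prefer to let the software do the bookkeeping, a convenient alternative (assuming the lmtest package, and reusing the model fitted in the sketch above) is the dwtest() function, which computes d together with a p-value, so no table lookup is needed:

```r
# Minimal sketch: Durbin-Watson test on the model fitted in the previous sketch.
# Assumes the 'lmtest' package is installed.
library(lmtest)
dwtest(model)   # d near 2 -> no autocorrelation; d well below 2 -> positive autocorrelation
```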
In the presence of autocorrelation, the estimates obtained using ordinary least squares remain consistent and unbiased, but the standard errors of the estimated regression parameters are, unfortunately, biased, potentially leading to inaccurate statistical tests and confidence intervals.
A method to correct for positive first-order autocorrelation (the most common type) is the Durbin two-stage method, which we won’t cover here but will likely be the subject of a future post.