
Multicollinearity, Heteroscedasticity, Autocorrelation: Three Difficult-Sounding Concepts (Explained Simply)

In various posts, particularly those on regression analysis, variance analysis, and time series, we’ve come across terms that seem deliberately designed to scare the reader.
The aim of these articles is to explain these key concepts simply, beyond the apparent complexity (something I really wanted when I was a student, instead of facing texts written in a purposely convoluted and unnecessarily difficult way).
So, it’s time to spend a few words on three very important concepts that often recur in statistical analysis and need to be well understood. The reality is much, much clearer than it seems, so… don’t be afraid!

Multicollinearity

If you have followed me across various posts, you may remember that we mentioned this term when approaching regression analysis.

We talk about multicollinearity when there is a strong correlation between two or more explanatory variables in our regression model.

Multicollinearity is a rather tricky problem because it can undermine the validity of a regression analysis even when the coefficient of determination R² is high and the model looks significant.
When multicollinearity exists, it’s difficult to isolate the effect that each explanatory variable has on the dependent variable, and the coefficients estimated with the least squares method may turn out to be statistically insignificant.

How can we reduce the problem?

We have several options:

  • Using a larger amount of data. In other words, increasing the sample size.
  • Transforming the functional relationship.
  • Using prior information.
  • Excluding one of the variables that show a strong collinear relationship (a quick way to spot such pairs is sketched right after this list).
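To make this concrete, here is a minimal sketch in R (the language used in the related posts) with entirely invented variables: x2 is built as an almost exact linear copy of x1, so the two are collinear by construction. The correlation matrix and a hand-computed Variance Inflation Factor, VIF = 1 / (1 − R²), both expose the problem.

# Invented data: x2 is (almost) a linear copy of x1, so the two are collinear
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 2 * x1 + rnorm(n, sd = 0.1)   # strongly correlated with x1
x3 <- rnorm(n)
y  <- 1 + x1 + x3 + rnorm(n)

# 1) Correlation matrix of the explanatory variables
round(cor(cbind(x1, x2, x3)), 2)

# 2) Variance Inflation Factor for x1: regress it on the other predictors
#    and compute 1 / (1 - R^2); values well above 5-10 usually signal trouble
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared)
vif_x1

# 3) Note how the standard errors of x1 and x2 blow up in the full model
summary(lm(y ~ x1 + x2 + x3))

(The car package also offers a ready-made vif() function, if you prefer not to compute it by hand.)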

Heteroscedasticity

This term seems designed to scare. If you want to reinforce someone’s belief (or bias) in the inherently terrifying complexity of statistics, this is the magic word to use! 🙂

Surprise: the concept isn’t actually that complicated.

Heteroscedasticity simply means unequal dispersion.
It refers to situations where the variance of the error term isn’t constant for all values of the independent variable.

In regression analysis, heteroscedasticity is problematic because ordinary least squares regression assumes that all residuals come from a population with a constant variance (homoscedasticity).
Homoscedasticity is thus the opposite of heteroscedasticity…

Returning for a moment to the topic of regression, the assumption of homoscedasticity means that the prediction errors in Y have roughly the same spread at all levels of X.
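As usual, a picture says a lot. The sketch below uses invented data in which the spread of the errors grows with x: after fitting a simple regression, plotting the residuals against the fitted values produces the classic “funnel” shape that signals non-constant variance. A more formal check is the Breusch-Pagan test, available through the lmtest package.

# Invented data: the standard deviation of the errors grows with x
set.seed(2)
x <- runif(200, 1, 10)
y <- 3 + 2 * x + rnorm(200, sd = 0.5 * x)   # heteroscedastic errors

fit <- lm(y ~ x)

# Residuals vs fitted values: a funnel shape suggests heteroscedasticity
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Formal check (requires the lmtest package):
# lmtest::bptest(fit)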

Autocorrelation

We discussed autocorrelation in the long post on time series analysis, where we also looked at a practical example.

To define the most common case, we can say that

positive first-order autocorrelation exists when the error term of one period is positively correlated with the error term of the immediately preceding period.

In time series, this is quite a common scenario and can introduce systematic errors into our inferences, leading to incorrect statistical test results and confidence intervals.

Autocorrelation, also referred to in some texts as serial correlation, can also be of a higher order (it is of the second order if the error term of one period is correlated with the error term two periods earlier, and so on) and can also be negative.

How can I check for autocorrelation?

In my post on time series analysis, we used R’s valuable acf() function and discussed the Ljung-Box test.
A “classic” method involves checking for autocorrelation using the Durbin-Watson statistic, calculating the d value and comparing it to the appropriate table values at the desired significance level, typically 5% or 1%.
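To see all three checks side by side, here is a small sketch with simulated data in which the errors follow a positive first-order autoregressive process: the correlogram from acf(), the Ljung-Box test from Box.test(), and the Durbin-Watson statistic computed directly from its definition, d = Σ(e_t − e_(t−1))² / Σ e_t². A value of d close to 2 indicates no autocorrelation, while values well below 2 point to positive autocorrelation.

# Simulated data: errors follow a positive first-order (AR(1)) process
set.seed(3)
n <- 200
x <- 1:n
e <- as.numeric(arima.sim(model = list(ar = 0.7), n = n))  # autocorrelated errors
y <- 5 + 0.3 * x + e

fit <- lm(y ~ x)
res <- resid(fit)

# Correlogram of the residuals: slowly decaying spikes suggest autocorrelation
acf(res)

# Ljung-Box test on the residuals
Box.test(res, lag = 10, type = "Ljung-Box")

# Durbin-Watson statistic computed by hand
d <- sum(diff(res)^2) / sum(res^2)
d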

In the presence of autocorrelation, the estimates obtained using ordinary least squares remain consistent and free of systematic error (they are unbiased), but the standard errors of the estimated regression parameters are unfortunately biased, potentially leading to inaccurate statistical tests and confidence intervals.

A method to correct for positive first-order autocorrelation (the most common type) is the Durbin two-stage method, which we won’t cover here but will likely be the subject of a future post.

paolo
