Correlation and Regression Analysis – Linear Regression

In previous posts, we have examined concepts such as the mean and standard deviation, which describe a single variable. These statistics are of great importance; in daily practice, however, we often need to investigate the relationships between two or more variables. This is where two key concepts come into play: correlation and regression analysis.

Correlation and regression analysis are tools widely used when analyzing our datasets. Correlation measures the strength and direction of the association between two variables, while regression estimates the relationship between a dependent variable and one or more independent variables.
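
As a minimal sketch of what this looks like in practice, here is how a simple linear regression could be fitted in R with lm(), using the built-in mtcars dataset purely for illustration:

```r
# Simple linear regression: fuel consumption (mpg) as a function of weight (wt),
# using the mtcars dataset that ships with base R.
data(mtcars)

# Pearson correlation between the two variables
cor(mtcars$wt, mtcars$mpg)

# Fit the linear model mpg = b0 + b1 * wt + error
fit <- lm(mpg ~ wt, data = mtcars)

# Coefficients, standard errors, R-squared, residual summary
summary(fit)

# Predicted mpg for a hypothetical car weighing 3000 lbs (wt is in 1000s of lbs)
predict(fit, newdata = data.frame(wt = 3))
```

summary(fit) reports, among other things, the estimated intercept and slope with their standard errors and the model’s R-squared.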

Continue reading “Correlation and Regression Analysis – Linear Regression”

Time Series Analysis and Forecasting in R

What is meant by a time series?

A time series consists of values observed over a set of sequentially ordered time periods. For anyone who works in SEO, this alone makes it a topic of great interest.

Website traffic data, observed over time, is in fact an example of a time series.

Time series analysis is a set of methods that allow us to extract meaningful patterns and statistics from data that carry temporal information.

In very general terms, we can say that a time series is a sequence of random variables indexed in time.
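
As a small illustrative sketch, monthly traffic figures (the numbers below are invented for the example) can be represented in R as a ts object:

```r
# Hypothetical monthly visit counts for two years (illustrative values only)
visits <- c(1200, 1350, 1280, 1400, 1520, 1610, 1580, 1490,
            1630, 1700, 1820, 1950, 1880, 2010, 1990, 2100,
            2230, 2310, 2280, 2190, 2340, 2420, 2550, 2700)

# Turn the vector into a monthly time series starting in January 2023
traffic <- ts(visits, start = c(2023, 1), frequency = 12)

# Plot the series against its time index
plot(traffic, ylab = "Monthly visits")
```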

The purpose of analyzing a time series can be descriptive (think of decomposing the series to remove seasonal components or to highlight underlying trends) or inferential, the latter including the forecasting of values for future time periods.
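
As a minimal sketch of both purposes in R, using the built-in AirPassengers dataset and Holt-Winters smoothing as just one possible forecasting approach:

```r
# Descriptive: decompose the classic AirPassengers series into
# trend, seasonal and remainder components
# (the log transform stabilises the multiplicative seasonality)
data(AirPassengers)
decomposition <- stl(log(AirPassengers), s.window = "periodic")
plot(decomposition)

# Inferential: fit an exponential-smoothing (Holt-Winters) model
# and forecast the next 12 months of the logged series
fit <- HoltWinters(log(AirPassengers))
predict(fit, n.ahead = 12)
```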

Continue reading “Time Series Analysis and Forecasting in R”

The Chi-Square Test: Goodness of Fit and Test of Independence

In previous posts, we have seen different types of tests that we can use to analyze our data and test hypotheses.

The chi-square test was proposed by Karl Pearson in 1900, and it is widely used to assess how well the observed distribution of a categorical variable matches an expected distribution (in this case we speak of the “Goodness of Fit Test”) or to test whether two categorical variables are independent of each other (in which case we speak of the “Test of Independence”).
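
As a brief sketch of both uses in R, with invented counts purely for illustration:

```r
# Goodness of fit: do 120 observed die rolls match a fair die?
observed <- c(18, 22, 19, 25, 17, 19)   # invented counts
chisq.test(observed, p = rep(1, 6) / 6)

# Test of independence: is preference independent of group?
# Rows: group A / group B, columns: prefers X / prefers Y (invented counts)
counts <- matrix(c(30, 20,
                   15, 35),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("A", "B"), c("X", "Y")))
chisq.test(counts)
```

In both cases, chisq.test() returns the test statistic, the degrees of freedom, and the p-value.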

Such is the importance and widespread use of this test that it was listed by the magazine Scientific American among the 20 most important scientific discoveries of the 20th century.

Continue reading “The Chi-Square Test: Goodness of Fit and Test of Independence”

The Normal Distribution

The concept of the normal distribution is one of the key elements in the field of statistical research. Very often, the data we collect shows typical characteristics, so typical that the resulting distribution is simply called… “normal”. In this post, we will look at the characteristics of this distribution and touch on some other concepts of notable importance.
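
As a quick sketch in R, using a simulated sample purely for illustration:

```r
set.seed(42)  # reproducibility of the simulated example

# Simulate 1000 observations from a normal distribution
# with mean 100 and standard deviation 15
values <- rnorm(1000, mean = 100, sd = 15)

# Compare the sample histogram with the theoretical density curve
hist(values, freq = FALSE, main = "Simulated normal data")
curve(dnorm(x, mean = 100, sd = 15), add = TRUE)

# About 68% of the probability lies within one standard deviation of the mean
pnorm(115, mean = 100, sd = 15) - pnorm(85, mean = 100, sd = 15)
```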

Continue reading “The Normal Distribution”

Descriptive Statistics: Measures of Position and Central Tendency

Measures of position, also known as position indices or measures of central tendency, are values that summarize the position of a statistical distribution, providing a single figure that captures its most important features. In this brief discussion, we will explore some of the most common and practical indices, such as the various types of means, the median, quartiles, and percentiles.
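
As a minimal sketch in R, using a small invented sample:

```r
# A small invented sample of, say, page load times in seconds
x <- c(1.2, 1.5, 1.7, 2.0, 2.1, 2.3, 2.6, 3.0, 3.4, 8.9)

mean(x)                                   # arithmetic mean (pulled up by the outlier 8.9)
median(x)                                 # median: the central value, robust to the outlier
quantile(x, probs = c(0.25, 0.5, 0.75))   # quartiles
quantile(x, probs = 0.90)                 # 90th percentile
exp(mean(log(x)))                         # geometric mean, one of the other types of means
```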

Continue reading “Descriptive Statistics: Measures of Position and Central Tendency”