Paolo Gironi - notes on data analysis, SEO, statistics, and retrocomputing

Guide to Statistical Tests for A/B Analysis

Statistical tests are fundamental tools for data analysis and informed decision-making. Choosing the appropriate test depends on the characteristics of the data, the hypotheses to be tested, and the underlying assumptions.

How to Use Decision Trees to Classify Data

Decision trees are a machine learning algorithm that uses a tree structure to split data according to logical rules and predict the class of new observations. They are easy to interpret and adapt to different types of data, but they can suffer from overfitting, excessive complexity, and sensitivity to class imbalance.
Let’s understand a bit more about them and examine a simple example of use in R.

The Gradient Descent Algorithm Explained Simply

Imagine wanting to find the fastest route to reach a destination by car. You could use a road map to estimate the distance and travel time of different roads. However, this method doesn’t account for traffic, which can vary significantly throughout the day.

Gradient Descent can be used to find the fastest route in real time. In this analogy:

The cost function represents the travel time of the journey.
The parameter to optimize is the route to follow.
The gradient indicates the direction in which travel time increases most rapidly.

The Gradient Descent algorithm can then be used to update the route iteratively, getting closer to the fastest route with each iteration.

Let’s now try to organize the definitions a bit.

Gradient Descent is an algorithm that searches for a minimum of an objective function, i.e., a point where the function takes its lowest value, at least locally. To do this, the algorithm starts from an initial point, often chosen at random, and repeatedly takes a small step in the direction opposite to the gradient, which is the direction in which the function grows most rapidly. For a function of a single variable, the gradient is simply the derivative, i.e., the slope of the curve at a point; for several variables, it is the vector of partial derivatives. The larger the magnitude of the gradient, the steeper the function at that point.
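The update rule described above can be sketched in a few lines of Python. This is a minimal illustration, not a production optimizer: the objective f(x) = (x - 3)^2, the starting point, the learning rate, and the number of steps are all assumptions chosen for the example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Minimise a one-dimensional function, given its derivative `grad`."""
    x = x0
    for _ in range(n_steps):
        # Move against the gradient: downhill on the objective function.
        x = x - learning_rate * grad(x)
    return x


# Hypothetical objective: f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
def f_prime(x):
    return 2.0 * (x - 3.0)


x_min = gradient_descent(f_prime, x0=-5.0)
print(f"Estimated minimiser: {x_min:.6f}")
```

With this objective the iterates approach x = 3, the true minimiser. A learning rate that is too large would make the steps overshoot, while one that is too small would make convergence slow.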

The Monte Carlo Method Explained Simply with Real-World Applications

Monte Carlo simulation is a method used to quantify the risk associated with a decision-making process. This technique, based on random number generation, is particularly useful when dealing with many unknown variables and when historical data or past experiences are not available for making reliable predictions.

The core idea behind Monte Carlo simulation is to create a series of simulated scenarios, each characterized by a different set of variables. Each scenario is determined by randomly generating values for each variable. This process is repeated many times, thus creating a large number of different scenarios.
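As a rough sketch of this idea, the following Python snippet generates many random scenarios and counts how often a loss occurs. The profit model, the ranges of the uncertain variables, and the fixed cost are all hypothetical values invented for illustration.

```python
import random


def simulate_profit(rng):
    """Draw one scenario: each uncertain variable gets a random value."""
    units_sold = rng.uniform(800, 1200)   # assumed range of demand
    unit_margin = rng.uniform(1.5, 2.5)   # assumed range of margin per unit
    fixed_cost = 2000.0                   # assumed fixed cost
    return units_sold * unit_margin - fixed_cost


def estimate_loss_probability(n_scenarios=10_000, seed=42):
    """Repeat the simulation many times and return the share of losses."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    losses = sum(1 for _ in range(n_scenarios) if simulate_profit(rng) < 0)
    return losses / n_scenarios


p_loss = estimate_loss_probability()
print(f"Estimated probability of a loss: {p_loss:.3f}")
```

The fraction of scenarios that end in a loss is the Monte Carlo estimate of the risk; increasing `n_scenarios` makes the estimate more stable at the cost of more computation.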

The Hypergeometric Distribution

We have seen that the binomial distribution assumes an infinite population, a condition that can be reproduced in practice by sampling with replacement from a finite population of size N.

If this does not occur, meaning that we are sampling from a finite population without replacement, we must use the hypergeometric distribution. (In practice, when N is large relative to the sample size, the hypergeometric probability mass function approaches the binomial one.)

The hypergeometric distribution is used to calculate the probability of obtaining a certain number of successes in a series of binary trials (yes or no) that are dependent on one another, so that the probability of success changes from one trial to the next.

The hypergeometric distribution allows us to answer questions like:

If I draw a sample of size n, without replacement, from a population of N elements in which M elements meet certain requirements, what is the probability of drawing x elements that meet those requirements?
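A question of this kind can be answered with the standard hypergeometric formula P(X = x) = C(M, x) * C(N - M, n - x) / C(N, n). The sketch below assumes the following notation: N is the population size, M the number of elements in the population that meet the requirements, n the sample size, and x the number of qualifying elements drawn. The numbers in the example are made up.

```python
from math import comb


def hypergeom_pmf(x, N, M, n):
    """Probability of exactly x successes in n draws without replacement
    from a population of N elements, M of which are successes."""
    # Outside this range the event is impossible.
    if x < max(0, n - (N - M)) or x > min(n, M):
        return 0.0
    return comb(M, x) * comb(N - M, n - x) / comb(N, n)


# Hypothetical example: 10 items, 3 of them defective, 2 drawn at random.
p_one_defective = hypergeom_pmf(1, N=10, M=3, n=2)
print(f"P(exactly one defective) = {p_one_defective:.4f}")
```

Here the result is 3 * 7 / 45 = 21/45, and summing the function over all possible values of x gives 1, as expected for a probability distribution.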