Python Statistics and ML
Probability and Statistics
- Granularity: the level of detail at which data is stored
- Rule of thumb: a principle based on practical experience rather than theory
Sampling
- Population: the set of all individuals relevant to a particular statistical question
- Sample: a smaller group selected from a population
- Parameter: a population metric
- Statistic: a sample metric
- Sampling error: difference between the metrics of a population and the metrics of a sample
- sampling error = parameter - statistic
- Representativeness: every individual in the population has an equal chance to be selected, leading to smaller sampling error
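The parameter/statistic relationship above can be sketched on a made-up population (all numbers here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical population of 10,000 heights (illustrative data)
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(loc=170, scale=10, size=10_000))

parameter = population.mean()                                # population mean
statistic = population.sample(n=100, random_state=0).mean()  # sample mean
sampling_error = parameter - statistic                       # parameter - statistic
```

With a reasonably large random sample, the sampling error stays small.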
- Simple random sampling (SRS): a sampling method using random numbers to select sample units
df.sample()
- Stratified sampling: organize (stratify) data into different groups (strata), and then sample randomly from each group
- Maximize the variability between strata (different groups)
- Minimize the variability within each stratum
- The stratification criterion should be strongly correlated with the property you’re trying to measure
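A minimal stratified-sampling sketch with pandas (the `position` column and the 10% fraction are invented for illustration):

```python
import pandas as pd

# Toy dataset: 100 players grouped by position (illustrative data)
df = pd.DataFrame({
    'position': ['G'] * 50 + ['F'] * 30 + ['C'] * 20,
    'height':   [185] * 50 + [200] * 30 + [210] * 20,
})

# Sample 10% from each stratum, so every position keeps its population share
strat_sample = df.groupby('position', group_keys=False).sample(frac=0.1, random_state=1)
```

Each stratum contributes proportionally: 5 guards, 3 forwards, 2 centers.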
- Cluster sampling: picking only a few of the individual data sources (clusters)
- Descriptive statistics: describing a sample or a population by measuring and visualizing stuff
- Inferential statistics: using a sample (inferring) to draw conclusions about a population
Variables
A variable is a property with a varying value. Variables can be divided into two categories:
- Quantitative variable: describes how much there is of something
- We can tell the size or direction of the difference
- e.g. height, age (date), points, experience
- Qualitative variable (Categorical): describes what or how
- We cannot tell the size and direction of the difference
- e.g. name, position, place, college
Scales of measurement are the systems of rules that define how each variable is measured:
- The Nominal scale: measuring qualitative variables only
- An Ordinal scale: measuring quantitative variables only
- We can tell the direction of the difference
- We cannot tell the size of the difference (intervals between ranks could differ)
- We should be cautious when calculating averages for ordinal variables (shifted encoding systems give different results)
- Interval and Ratio scales: measuring quantitative variables only
- Preserves the order between values and has well-defined intervals using real numbers
- On a Ratio scale, the zero point means “no quantity”, while on an Interval scale zero is just another point that still indicates the presence of a quantity (e.g. 0 °C)
- Using a Ratio scale we can measure the difference in terms of ratios (division)
- Discrete variable: there is no possible intermediate value between any two adjacent values
- Continuous variable: contains an infinity of values between any two values
Frequency Distributions
- Frequency Distribution Table shows how frequencies are distributed
- Grouped Frequency Distribution Tables: each group (interval) is called a class interval
s.value_counts(bins=intervals)
pd.interval_range(start=0, end=100, freq=10)
- There should be a good balance between information and comprehensibility
Types of Frequencies:
- Absolute frequencies: absolute counts
s.value_counts()
- Relative frequencies: proportions and percentages
s.value_counts(normalize=True)
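A small sketch of the three kinds of frequency tables on made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4])

abs_freq = s.value_counts()                # absolute counts
rel_freq = s.value_counts(normalize=True)  # proportions

# Grouped frequency table with class intervals (0, 2] and (2, 4]
grouped = s.value_counts(bins=[0, 2, 4]).sort_index()
```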
Percentiles and Quartiles:
- Percentile rank of a score is the percentage of scores in its distribution that are less than it
- Percentile and percentile rank are related terms: the percentile rank is expressed as a percentage, while a percentile is the value below which that percentage of scores falls
from scipy.stats import percentileofscore
percentileofscore(a=series, score=value, kind='weak')
- Quartiles: the three percentiles, 25th (lower quartile), the 50th (middle quartile), and the 75th (upper quartile), that divide the distribution in four equal parts
s.describe(percentiles=[])
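For example, on a made-up series of ten values:

```python
import pandas as pd
from scipy.stats import percentileofscore

series = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# kind='weak': percentage of values <= score
rank = percentileofscore(series, score=3, kind='weak')

# Quartiles via describe
desc = series.describe(percentiles=[0.25, 0.5, 0.75])
```

Three of ten values are ≤ 3, so the percentile rank of 3 is 30.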
Types of Distributions:
- Skewed Distributions
- Left skewed (negatively skewed): the tail points in the direction of negative numbers
- Right skewed (positively skewed): the tail points in the direction of positive numbers
- Symmetrical Distributions
- Normal distribution (Gaussian distribution): the values pile up in the middle and gradually decrease toward both ends
- Uniform distribution: the values are distributed uniformly
Visualizing Distributions:
- Nominal and Ordinal variables are commonly visualized using a bar plot or a pie chart (which gives a better sense of the relative frequencies)
- The most commonly used graph for visualizing distributions is the histogram
- A smoothed histogram that displays densities (probabilities) instead of frequencies is called a Kernel Density Estimate (KDE) plot
- When we need to compare multiple (> 4) distributions, it is better to use a strip plot or a box plot
- Quartiles
- Lower quartile index: $Qi_1 = (n+1) * 0.25$
- Upper quartile index: $Qi_3 = (n+1) * 0.75$
- Interquartile range: $\text{IQR} = \text{upper quartile} - \text{lower quartile}$
- Outliers are values in the distribution that are much larger or much lower than the rest of the values
- Lower bound: $\min = Q_1 - 1.5* \text{IQR}$
- Upper bound: $\max = Q_3 + 1.5 * \text{IQR}$
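Putting the quartile formulas together on toy data (note that pandas’ `quantile` interpolates, which can differ slightly from the $(n+1)$ index rule above):

```python
import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])  # 102 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are flagged as outliers
outliers = s[(s < lower_bound) | (s > upper_bound)]
```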
Averages and Variability
- Arithmetic Mean μ (Parameter): total sum divided by total number of values (distances below and above the mean are the same) $\dfrac{1}{N}\sum_{i=1}^N x_i$
- Sample Mean x̄ (Statistic): there are three possible scenarios: overestimation, underestimation, and equal estimation (when x̄ > μ or x̄ < μ, sampling error occurs)
- Sampling Error: $μ - x̄$
- Sample Representativity: the more representative a sample is, the closer x̄ will be to μ
- Sample Size: the larger the sample, the more chances we have to get a representative sample and the smaller the sampling error
- Unbiased Estimator: a statistic that is on average equal to the parameter it estimates
- This is true for any distribution of real numbers with equal sample size
- Weighted Mean: takes into account the different weights $\dfrac{\sum_{i=1}^{N} x_i w_i}{\sum_{i=1}^{N} w_i}$
np.average(houses_per_year['Mean Price'], weights=houses_per_year['Houses Sold'])
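The `np.average` call above computes exactly the weighted-mean formula; a quick check on made-up numbers standing in for `houses_per_year`’s columns:

```python
import numpy as np

# Illustrative stand-ins for 'Mean Price' and 'Houses Sold'
mean_price  = np.array([100_000, 120_000, 150_000])
houses_sold = np.array([10, 5, 1])

weighted = np.average(mean_price, weights=houses_sold)           # weighted mean
manual = (mean_price * houses_sold).sum() / houses_sold.sum()    # same formula by hand
```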
- Open-Ended Distribution: distribution with open boundary, for example “10 or more / 10+”
- Median: the middle value in a sorted distribution ($Q_2$), resistant to outliers (a robust statistic)
s.median()
- Mode: the most frequent value in the distribution
s.mode()
- The best option for discrete values, because it gives you a whole number
- The distribution could be unimodal, bimodal or even multimodal (in case of more than one mode)
- Range of Distribution: measures the variability of a distribution (average distance, dispersion)
- $\text{mean absolute deviation} = \dfrac{\sum_{i=1}^{N} |x_i - \mu|}{N}$
- $\text{mean squared deviation (variance)} = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- $\text{standard deviation} = \sqrt{\text{variance}}$
s.std()
- Bessel’s correction suggests dividing by n − 1 instead of n, to correct the sample statistic’s tendency to underestimate the population value
np.std(list, ddof=1)
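A quick comparison of the two denominators on made-up data:

```python
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_std = np.std(data)             # divides by n (population formula)
sample_std = np.std(data, ddof=1)  # Bessel's correction: divides by n - 1
```

The corrected estimate is always slightly larger, offsetting the underestimation.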
| | Can be used | Can’t be used | Ideal for |
|---|---|---|---|
| Mean | Interval / Ratio; Continuous Ordinal | Nominal; Non-numeric Ordinal (for different weights use the weighted mean) | Summarizing numerical distributions, taking each value in the distribution into account |
| Median | Interval / Ratio; Numeric Ordinal | Nominal; Non-numeric Ordinal | Summarizing numerical distributions with outliers; Open-ended distributions |
| Mode | Interval / Ratio; Nominal or Ordinal | Uniform distributions; Continuous Ordinal | Nominal or Non-numeric Ordinal; Discrete values |
| | Value | Reporting to non-technical audiences |
|---|---|---|
| Mean | 1.04 | The average house has 1.04 kitchens |
| Median | 1 | The average house has one kitchen |
| Mode | 1 | The typical house has one kitchen |
Machine Learning
- Correlation: measures how attributes relate to each other, on a scale from −1 to 1
- Pearson correlation coefficient:
df.corr()
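A minimal sketch of `df.corr()` on made-up height/weight data:

```python
import pandas as pd

df = pd.DataFrame({
    'height': [170, 180, 160, 175, 185],
    'weight': [65, 80, 55, 72, 90],
})

corr_matrix = df.corr()               # Pearson by default
r = df['height'].corr(df['weight'])   # a single pair of columns
```

Height and weight rise together here, so `r` is close to +1.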
- Weighted sum model (WSM): the best-known and simplest multi-criteria decision analysis (MCDA) method
- $A_i^{\text{WSM-score}} = \sum_{j=1}^{n} w_j a_{ij}, \text{ for } i = 1, 2, 3, \dots, m$
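The WSM score formula is a matrix-vector product (alternatives and weights below are invented for illustration):

```python
import numpy as np

# Three alternatives scored on two criteria (rows = alternatives, columns = criteria)
scores  = np.array([[8, 7],
                    [9, 3],
                    [6, 9]])
weights = np.array([0.6, 0.4])  # criterion weights, summing to 1

wsm_scores = scores @ weights   # A_i = sum_j w_j * a_ij
best = int(wsm_scores.argmax())  # index of the best alternative
```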
- Min-Max Feature scaling (Normalization): rescales values into [0, 1] so that different scales can be compared in a meaningful way
- $x' = \frac {x-\min(x)} {\max(x) - \min(x)}$
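The normalization formula in NumPy, on toy values:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# x' = (x - min) / (max - min) maps every value into [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
```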