Python Statistics and ML
Probability and Statistics
- Granularity: the level of detail at which data is stored
- Rule of thumb: a principle based on practical experience rather than theory
Sampling
- Population: the set of all individuals relevant to a particular statistical question
- Sample: a smaller group selected from a population
- Parameter: a population metric
- Statistic: a sample metric
- Sampling error: difference between the metrics of a population and the metrics of a sample
- sampling error = parameter - statistic
- Representativeness: every individual in the population has an equal chance to be selected, leading to smaller sampling error
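The parameter/statistic relationship above can be sketched on a made-up population (all numbers here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical population of 10,000 heights (illustrative data)
rng = np.random.default_rng(0)
population = pd.Series(rng.normal(loc=170, scale=10, size=10_000))

parameter = population.mean()                                # population mean
statistic = population.sample(n=100, random_state=0).mean()  # sample mean
sampling_error = parameter - statistic                       # parameter - statistic
```

With a reasonably large random sample, the sampling error stays small.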
- Simple random sampling (SRS): a sampling method using random numbers to select sample units
df.sample()
- Stratified sampling: organize (stratify) data into different groups (strata), and then sample randomly from each group
- Maximize the variability between strata (different groups)
- Minimize the variability within each stratum
- The stratification criterion should be strongly correlated with the property you’re trying to measure
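A minimal stratified-sampling sketch with pandas (the `position` column and the 10% fraction are invented for illustration):

```python
import pandas as pd

# Toy dataset: 100 players grouped by position (illustrative data)
df = pd.DataFrame({
    'position': ['G'] * 50 + ['F'] * 30 + ['C'] * 20,
    'height':   [185] * 50 + [200] * 30 + [210] * 20,
})

# Sample 10% from each stratum, so every position keeps its population share
strat_sample = df.groupby('position', group_keys=False).sample(frac=0.1, random_state=1)
```

Each stratum contributes proportionally: 5 guards, 3 forwards, 2 centers.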
- Cluster sampling: picking only a few of the individual data sources (clusters)
- Descriptive statistics: describing a sample or a population by measuring and visualizing stuff
- Inferential statistics: using a sample (inferring) to draw conclusions about a population
Variables
A variable is a property with a varying value. Variables can be divided into two categories:
- Quantitative variable: describes how much there is of something
- We can tell the size or direction of the difference
- e.g. height, age (date), points, experience
- Qualitative variable (Categorical): describes what or how
- We cannot tell the size and direction of the difference
- e.g. name, position, place, college
Scales of measurement are the systems of rules that define how each variable is measured:
- The Nominal scale: measuring qualitative variables only
- An Ordinal scale: measuring quantitative variables only
- We can tell the direction of the difference
- We cannot tell the size of the difference (intervals between ranks could differ)
- We should be cautious when calculating averages for ordinal variables (shifted encoding systems give different results)
- Interval and Ratio scales: measuring quantitative variables only
- Preserves the order between values and has well-defined intervals using real numbers
- On a Ratio scale, the zero point means “no quantity”, while on an Interval scale zero is just another point that still indicates the presence of a quantity (e.g. 0 °C)
- Using a Ratio scale we can measure the difference in terms of ratios (division)
- Discrete variable: there is no possible intermediate value between any two adjacent values
- Continuous variable: contains an infinity of values between any two values
Frequency Distributions
- Frequency Distribution Table shows how frequencies are distributed
- Grouped Frequency Distribution Tables: each group (interval) is called a class interval
s.value_counts(bins=intervals)
pd.interval_range(start=0, end=100, freq=10)
- There should be a good balance between information and comprehensibility
Types of Frequencies:
- Absolute frequencies: absolute counts
s.value_counts()
- Relative frequencies: proportions and percentages
s.value_counts(normalize=True)
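A small sketch of the three kinds of frequency tables on made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4])

abs_freq = s.value_counts()                # absolute counts
rel_freq = s.value_counts(normalize=True)  # proportions

# Grouped frequency table with class intervals (0, 2] and (2, 4]
grouped = s.value_counts(bins=[0, 2, 4]).sort_index()
```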
Percentiles and Quartiles:
- Percentile rank of a score is the percentage of scores in its distribution that are less than it
- Percentile and percentile rank are related terms: the percentile rank is expressed as a percentage, while a percentile is the value below which that percentage of scores falls
from scipy.stats import percentileofscore
percentileofscore(a=series, score=value, kind='weak')
- Quartiles: the three percentiles, 25th (lower quartile), the 50th (middle quartile), and the 75th (upper quartile), that divide the distribution in four equal parts
s.describe(percentiles=[])
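For example, on a made-up series of ten values:

```python
import pandas as pd
from scipy.stats import percentileofscore

series = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# kind='weak': percentage of values <= score
rank = percentileofscore(series, score=3, kind='weak')

# Quartiles via describe
desc = series.describe(percentiles=[0.25, 0.5, 0.75])
```

Three of ten values are ≤ 3, so the percentile rank of 3 is 30.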
Types of Distributions:
- Skewed Distributions
- Left skewed (negatively skewed): the tail points in the direction of negative numbers
- Right skewed (positively skewed): the tail points in the direction of positive numbers
- Symmetrical Distributions
- Normal distribution (Gaussian distribution): the values pile up in the middle and gradually decrease toward both ends
- Uniform distribution: the values are distributed uniformly
Visualizing Distributions:
- Nominal and Ordinal variables are commonly visualized using a bar plot or a pie chart (which gives a better sense of the relative frequencies)
- The most commonly used graph for visualizing distributions is the histogram
- A smoothed histogram that displays densities (probabilities) instead of frequencies is called a Kernel Density Estimate (KDE) plot
- When we need to compare multiple (> 4) distributions, it is better to use a strip plot or a box plot
- Quartiles
- Lower quartile index: $Qi_1 = (n+1) * 0.25$
- Upper quartile index: $Qi_3 = (n+1) * 0.75$
- Interquartile range: $\text{IQR} = \text{upper quartile} - \text{lower quartile}$
- Outliers are values in the distribution that are much larger or much lower than the rest of the values
- Lower bound: $\min = Q_1 - 1.5* \text{IQR}$
- Upper bound: $\max = Q_3 + 1.5 * \text{IQR}$
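Putting the quartile formulas together on toy data (note that pandas’ `quantile` interpolates, which can differ slightly from the $(n+1)$ index rule above):

```python
import pandas as pd

s = pd.Series([10, 12, 12, 13, 12, 11, 14, 13, 15, 102])  # 102 is an outlier

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are flagged as outliers
outliers = s[(s < lower_bound) | (s > upper_bound)]
```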
Averages and Variability
- Arithmetic Mean μ (Parameter): total sum divided by total number of values (distances below and above the mean are the same) $\dfrac{1}{N}\sum_{i=1}^N x_i$
- Sample Mean x̄ (Statistic): there are three possible scenarios: overestimation, underestimation, and equal estimation (when x̄ > μ or x̄ < μ, sampling error occurs)
- Sampling Error: $μ - x̄$
- Sample Representativity: the more representative a sample is, the closer x̄ will be to μ
- Sample Size: the larger the sample, the more chances we have to get a representative sample and the smaller the sampling error
- Unbiased Estimator: a statistic that is on average equal to the parameter it estimates
- This is true for any distribution of real numbers with equal sample size
- Weighted Mean: takes into account the different weights $\dfrac{\sum_{i=1}^{N} x_i w_i}{\sum_{i=1}^{N} w_i}$
np.average(houses_per_year['Mean Price'], weights=houses_per_year['Houses Sold'])
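The `np.average` call above computes exactly the weighted-mean formula; a quick check on made-up numbers standing in for `houses_per_year`’s columns:

```python
import numpy as np

# Illustrative stand-ins for 'Mean Price' and 'Houses Sold'
mean_price  = np.array([100_000, 120_000, 150_000])
houses_sold = np.array([10, 5, 1])

weighted = np.average(mean_price, weights=houses_sold)           # weighted mean
manual = (mean_price * houses_sold).sum() / houses_sold.sum()    # same formula by hand
```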
- Open-Ended Distribution: distribution with open boundary, for example “10 or more / 10+”
- Median: the middle value in a sorted distribution ($Q_2$), resistant to outliers (a robust statistic)
s.median()
- Mode: the most frequent value in the distribution
s.mode()
- The best option for discrete values, because it gives you a whole number
- The distribution could be unimodal, bimodal or even multimodal (in case of more than one mode)
- Range of Distribution: measures the variability of a distribution (average distance, dispersion)
- $\text{mean absolute deviation} = \dfrac{\sum_{i=1}^{N} |x_i - \mu|}{N}$
- $\text{mean squared deviation (variance)} = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
- $\text{standard deviation} = \sqrt{\text{variance}}$
s.std()
- Bessel’s correction suggests dividing by n − 1 instead of n, to correct the sample statistic’s tendency to underestimate the population value
np.std(list, ddof=1)
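A quick comparison of the two denominators on made-up data:

```python
import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

pop_std = np.std(data)             # divides by n (population formula)
sample_std = np.std(data, ddof=1)  # Bessel's correction: divides by n - 1
```

The corrected estimate is always slightly larger, offsetting the underestimation.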
| | Can be used | Can’t be used | Ideal for |
|---|---|---|---|
| Mean | Interval / Ratio; Continuous Ordinal | Nominal; Non-numeric Ordinal (for different weights use the weighted mean) | Summarizing numerical distributions, taking each value in the distribution into account |
| Median | Interval / Ratio; Numeric Ordinal | Nominal; Non-numeric Ordinal | Summarizing numerical distributions with outliers; Open-ended distributions |
| Mode | Interval / Ratio; Nominal or Ordinal | Uniform distributions; Continuous Ordinal | Nominal or Non-numeric Ordinal; Discrete values |
| | Value | Reporting to non-technical audiences |
|---|---|---|
| Mean | 1.04 | The average house has 1.04 kitchens |
| Median | 1 | The average house has one kitchen |
| Mode | 1 | The typical house has one kitchen |
Machine Learning
- Correlation: measures how attributes relate to each other, on a scale from −1 to 1
- Pearson correlation coefficient:
df.corr()
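A minimal sketch of `df.corr()` on made-up height/weight data:

```python
import pandas as pd

df = pd.DataFrame({
    'height': [170, 180, 160, 175, 185],
    'weight': [65, 80, 55, 72, 90],
})

corr_matrix = df.corr()               # Pearson by default
r = df['height'].corr(df['weight'])   # a single pair of columns
```

Height and weight rise together here, so `r` is close to +1.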
- Weighted sum model (WSM): the best-known and simplest multi-criteria decision analysis (MCDA) method
- $A_i^{\text{WSM-score}} = \sum_{j=1}^{n} w_j a_{ij}, \text{ for } i = 1, 2, 3, \dots, m$
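The WSM score formula is a matrix-vector product (alternatives and weights below are invented for illustration):

```python
import numpy as np

# Three alternatives scored on two criteria (rows = alternatives, columns = criteria)
scores  = np.array([[8, 7],
                    [9, 3],
                    [6, 9]])
weights = np.array([0.6, 0.4])  # criterion weights, summing to 1

wsm_scores = scores @ weights   # A_i = sum_j w_j * a_ij
best = int(wsm_scores.argmax())  # index of the best alternative
```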
- Min-Max Feature scaling (Normalization): rescales values into [0, 1] so that different scales can be compared in a meaningful way
- $x' = \frac {x-\min(x)} {\max(x) - \min(x)}$
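The normalization formula in NumPy, on toy values:

```python
import numpy as np

x = np.array([10.0, 20.0, 35.0, 50.0])

# x' = (x - min) / (max - min) maps every value into [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())
```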