Python Statistics and ML

Probability and Statistics

  • Granularity: the level of detail at which data is stored
  • Rule of thumb: a principle based on practical experience rather than theory

Sampling

  • Population: the set of all individuals relevant to a particular statistical question
  • Sample: a smaller group selected from a population
  • Parameter: a population metric
  • Statistic: a sample metric
  • Sampling error: difference between the metrics of a population and the metrics of a sample
    • sampling error = parameter - statistic
  • Representativeness: every individual in the population has an equal chance of being selected, which leads to a smaller sampling error

Sampling methods:

  • Simple random sampling (SRS): a sampling method that uses random numbers to select the sample units (see the sketch after this list) # df.sample()
  • Stratified sampling: organize (stratify) the data into different groups (strata), then sample randomly from each group
    • Maximize the variability between strata (different groups)
    • Minimize the variability within each stratum
    • The stratification criterion should be strongly correlated with the property you’re trying to measure
  • Cluster sampling: sampling only a few of the individual data sources (clusters)
  • Descriptive statistics: describing a sample or a population by measuring and visualizing stuff
  • Inferential statistics: using a sample to draw conclusions (make inferences) about a population
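
A minimal sketch of the three sampling methods above with pandas; the DataFrame `df`, its columns, and the sample sizes are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "player": [f"p{i}" for i in range(100)],
    "league": ["A"] * 50 + ["B"] * 30 + ["C"] * 20,
    "points": range(100),
})

# Simple random sampling: every row has an equal chance of being selected
srs = df.sample(n=10, random_state=1)

# Stratified sampling: sample randomly from each stratum (here: each league)
stratified = df.groupby("league", group_keys=False).apply(
    lambda stratum: stratum.sample(n=5, random_state=1)
)

# Cluster sampling: randomly pick a few clusters (leagues) and keep all their rows
picked_clusters = pd.Series(df["league"].unique()).sample(n=2, random_state=1)
cluster_sample = df[df["league"].isin(picked_clusters)]
```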

Variables

A variable is a property whose value can vary. Variables can be divided into two categories:

  • Quantitative variable: describes how much there is of something
    • We can tell the size or direction of the difference
    • e.g. height, age (date), points, experience
  • Qualitative variable (Categorical): describes what or how
    • We cannot tell the size and direction of the difference
    • e.g. name, position, place, college

A scale of measurement is the system of rules that defines how a variable is measured:

  • Nominal scale: for measuring qualitative variables only
  • Ordinal scale: for measuring quantitative variables only
    • We can tell the direction of the difference
    • We cannot tell the size of the difference (intervals between ranks could differ)
    • Be careful when calculating averages for ordinal variables: different encoding systems can give different results (see the example after this list)
  • Interval and Ratio scales: for measuring quantitative variables only
    • Preserves the order between values and has well-defined intervals using real numbers
    • On a Ratio scale, the zero point means “no quantity”, while on an Interval scale it indicates the presence of a quantity
    • Using a Ratio scale we can measure the difference in terms of ratios (division)
    • Discrete variable: there is no possible intermediate value between any two adjacent values
    • Continuous variable: contains an infinity of values between any two values
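
A small worked example of the ordinal-averaging caveat above: the same ratings encoded with two different order-preserving numbering systems produce different means (the ratings and encodings are invented for illustration):

```python
import pandas as pd

ratings = pd.Series(["bad", "good", "good", "excellent"])

encoding_a = {"bad": 1, "good": 2, "excellent": 3}   # evenly spaced ranks
encoding_b = {"bad": 1, "good": 5, "excellent": 10}  # same order, different spacing

print(ratings.map(encoding_a).mean())  # 2.00 -> exactly "good" on average
print(ratings.map(encoding_b).mean())  # 5.25 -> slightly above "good" on average
```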

Frequency Distributions

  • Frequency Distribution Table shows how frequencies are distributed
  • Grouped Frequency Distribution Tables: each group (interval) is called a class interval (see the sketch below)
    • s.value_counts(bins=intervals)
    • pd.interval_range(start=0, end=100, freq=10)
    • there should be a good balance between information and comprehensibility

Types of Frequencies:

  • Absolute frequencies: absolute counts # s.value_counts()
  • Relative frequencies: proportions and percentages # s.value_counts(normalize=True)
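
A minimal sketch of the frequency tables above with pandas; the `scores` Series is hypothetical:

```python
import pandas as pd

scores = pd.Series([12, 35, 47, 47, 58, 63, 63, 63, 81, 94])

# Absolute frequencies (counts)
print(scores.value_counts())

# Relative frequencies (proportions; multiply by 100 for percentages)
print(scores.value_counts(normalize=True))

# Grouped frequency distribution table with ten class intervals
intervals = pd.interval_range(start=0, end=100, freq=10)
print(scores.value_counts(bins=intervals).sort_index())
```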

Percentiles and Quartiles:

  • Percentile rank of a score: the percentage of scores in its distribution that are less than or equal to it (this is what kind='weak' computes)
  • Percentile and percentile rank are related terms: the percentile rank is expressed as a percentage, while the percentile is the score below which that percentage of values falls (see the sketch below)
    • from scipy.stats import percentileofscore
    • percentileofscore(a=series, score=value, kind='weak')
  • Quartiles: the three percentiles, 25th (lower quartile), the 50th (middle quartile), and the 75th (upper quartile), that divide the distribution in four equal parts # s.describe(percentiles=[])
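
A short sketch of percentile ranks and quartiles; the `scores` Series is hypothetical, and kind='weak' counts the values less than or equal to the given score:

```python
import pandas as pd
from scipy.stats import percentileofscore

scores = pd.Series([10, 25, 40, 55, 55, 70, 80, 90, 95, 100])

# Percentage of scores less than or equal to 70
print(percentileofscore(a=scores, score=70, kind="weak"))  # 60.0

# Quartiles (25th, 50th and 75th percentiles) via describe()
print(scores.describe(percentiles=[0.25, 0.5, 0.75]))
```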

Types of Distributions:

  • Skewed Distributions
    • Left skewed (negatively skewed): the tail points in the direction of negative numbers
    • Right skewed (positively skewed): the tail points in the direction of positive numbers
  • Symmetrical Distributions
    • Normal distribution (Gaussian distribution): the values pile up in the middle and gradually decrease toward both ends
    • Uniform distribution: the values are distributed uniformly
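
A quick check, assuming pandas and NumPy, that the sign of Series.skew() matches the tail directions described above; the generated samples are arbitrary:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

right_skewed = pd.Series(rng.exponential(scale=2, size=1_000))   # long right tail
left_skewed = pd.Series(-rng.exponential(scale=2, size=1_000))   # long left tail
symmetric = pd.Series(rng.normal(loc=0, scale=1, size=1_000))    # roughly symmetric

print(right_skewed.skew())  # positive
print(left_skewed.skew())   # negative
print(symmetric.skew())     # close to 0
```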

Visualizing Distributions:

  • Nominal and ordinal variables are commonly visualized with bar plots and pie charts (a pie chart gives a better sense of the relative frequencies)
  • The most commonly used graph for visualizing distributions is the histogram
  • A smoothed histogram that displays densities (probabilities) instead of frequencies is called a Kernel Density Estimate (KDE) plot
  • When we need to compare multiple (> 4) distributions, it is better to use a strip plot or a box plot
    • Quartiles
      • Lower quartile index: $Qi_1 = (n+1) * 0.25$
      • Upper quartile index: $Qi_3 = (n+1) * 0.75$
      • Interquartile range: $\text{IQR} = \text{upper quartile} - \text{lower quartile}$
    • Outliers are values in the distribution that are much larger or much lower than the rest of the values
      • Lower bound: $Q_1 - 1.5 \cdot \text{IQR}$
      • Upper bound: $Q_3 + 1.5 \cdot \text{IQR}$
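
A sketch of the quartile, IQR, and outlier-bound calculations above; `values` is a hypothetical Series, and note that Series.quantile() uses linear interpolation, so its quartiles can differ slightly from the $(n+1)$ rank formulas:

```python
import pandas as pd

values = pd.Series([2, 4, 5, 5, 6, 7, 8, 9, 10, 45])

q1 = values.quantile(0.25)   # lower quartile
q3 = values.quantile(0.75)   # upper quartile
iqr = q3 - q1                # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = values[(values < lower_bound) | (values > upper_bound)]
print(outliers)  # 45 falls above the upper bound
```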

Averages and Variability

  • Arithmetic Mean μ (Parameter): the total sum divided by the total number of values (the distances below and above the mean balance out): $\mu = \dfrac{1}{N}\sum_{i=1}^{N} x_i$
  • Sample Mean x̄ (Statistic): three scenarios are possible: overestimation (x̄ > μ), underestimation (x̄ < μ), or equal estimation; in the first two cases there is sampling error
  • Sampling Error: $\mu - \bar{x}$
  • Sample Representativity: the more representative a sample is, the closer x̄ will be to μ
  • Sample Size: the larger the sample, the more chances we have to get a representative sample and less sampling error
  • Unbiased Estimator: a statistic that, averaged across all possible samples of the same size, equals the parameter it estimates
    • The sample mean is an unbiased estimator of μ for any distribution of real numbers
  • Weighted Mean: takes into account the different weights $\dfrac{\sum_{i=1}^{N} x_i w_i}{\sum_{i=1}^{N} w_i}$
    • np.average(houses_per_year['Mean Price'], weights=houses_per_year['Houses Sold'])
  • Open-Ended Distribution: distribution with open boundary, for example “10 or more / 10+”
  • Median: the middle value in a sorted distribution ($Q_2$), resistant to outliers (a robust statistic) # s.median()
  • Mode: the most frequent value in the distribution # s.mode()
    • The best option for discrete variables, because it always returns a whole number that actually occurs in the distribution
    • The distribution could be unimodal, bimodal or even multimodal (in case of more than one mode)
  • Variability of a Distribution: how dispersed the values are; the simplest measure is the range (max - min), while the measures below use the average distance from the mean # s.std()
    • $\text{mean absolute deviation} = \dfrac{\sum_{i=1}^{N} |x_i - \mu|}{N}$
    • $\text{mean squared deviation (variance)} = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
    • $\text{standard deviation} = \sqrt{\text{variance}}$
    • Bessel’s correction: divide by n - 1 instead of n to correct the sample’s tendency to underestimate the population variance (see the sketch after the tables below) # np.std(data, ddof=1)

When to use each average:

  • Mean
    • Can be used: Interval / Ratio, Continuous Ordinal
    • Can’t be used: Nominal, Non-numeric Ordinal
    • Ideal for: summarizing numerical distributions where each value in the distribution matters (for different weights use the weighted mean)
  • Median
    • Can be used: Interval / Ratio, Numeric Ordinal
    • Can’t be used: Nominal, Non-numeric Ordinal
    • Ideal for: summarizing numerical distributions with outliers, open-ended distributions
  • Mode
    • Can be used: Interval / Ratio, Nominal or Ordinal
    • Can’t be used: Uniform distributions, Continuous Ordinal
    • Ideal for: Nominal or Non-numeric Ordinal, discrete values

Reporting to non-technical audiences:

  • Mean (value 1.04): The average house has 1.04 kitchens
  • Median (value 1): The average house has one kitchen
  • Mode (value 1): The typical house has one kitchen
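
A minimal sketch tying the averages and variability measures above to pandas/NumPy calls; the prices and sales figures are made up:

```python
import numpy as np
import pandas as pd

prices = pd.Series([120_000, 135_000, 150_000, 150_000, 410_000])

print(prices.mean())    # arithmetic mean, pulled up by the 410,000 outlier
print(prices.median())  # middle value, robust to the outlier
print(prices.mode())    # most frequent value(s)

# Weighted mean: mean prices weighted by the number of houses sold
mean_price = pd.Series([100_000, 110_000, 120_000])
houses_sold = pd.Series([10, 25, 5])
print(np.average(mean_price, weights=houses_sold))

# Population standard deviation (ddof=0) vs. sample standard deviation
# with Bessel's correction (ddof=1); pandas defaults to ddof=1
print(prices.std(ddof=0))
print(prices.std(ddof=1))
```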

Machine Learning

PyTorch for Deep Learning - Full Course / Tutorial

Computer Vision

OpenCV Course - Full Tutorial with Python