Topic A3 — Summary Statistics

Table of contents
  1. Topic A3 — Summary Statistics
    1. Mean vs Median vs Mode
    2. Arithmetic vs weighted mean
    3. Percentiles and quantiles
    4. Interquartile range (IQR)
    5. Outliers & boxplot
    6. Variance and standard deviation
    7. Z-scores, standardization
    8. Covariance vs correlation
    9. Chebyshev’s theorem
    10. The Empirical Rule

Mean vs Median vs Mode

  • Mean
    • sum of all observations / number of observations: $\bar{x} = \frac{1}{n} \sum^n_{i=1} x_i$
    • /!\ affected by outliers
  • Median
    • observations ranked, divided in two (middle value, or avg of center values)
    • OR 50% of obs above and below that value
    • safe from outliers
  • Mode
    • most frequently occurring value in the data
    • also usable for qualitative variables
    • unimodal: one value occurs most often –> bimodal: two values tie for first place
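
To make the contrast concrete, here is a minimal sketch using Python's standard `statistics` module. The sample values are hypothetical and chosen so that one extreme observation (100) pulls the mean far above the median.

```python
import statistics

# Hypothetical sample; 100 plays the role of an outlier.
data = [2, 3, 3, 4, 5, 100]

mean = statistics.mean(data)        # sum of observations / number of observations
median = statistics.median(data)    # even count: average of the two center values (3 and 4)
modes = statistics.multimode(data)  # list of the most frequent value(s)

print(mean, median, modes)

# A hypothetical bimodal sample: two values tie for first place.
bimodal_modes = statistics.multimode([1, 1, 2, 2, 3])
print(bimodal_modes)
```

The mean (19.5) lands far from most of the data, while the median (3.5) stays in the middle of it.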

Arithmetic vs weighted mean

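  • Arithmetic mean: the “normal” mean, every observation counts equally
  • Weighted mean: some values matter more than others (e.g. grades weighted by credits): $\bar{x}_w = \frac{\sum^n_{i=1} w_i x_i}{\sum^n_{i=1} w_i}$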
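
A short sketch of the credits-and-grades case, assuming the usual weighted-mean definition (sum of weight × value, divided by the sum of the weights). The grades and credits below are made up for illustration.

```python
# Hypothetical grades and the credits (weights) attached to each course.
grades = [15, 12, 18]
credits = [6, 3, 1]

# Arithmetic mean: every grade counts equally.
arithmetic_mean = sum(grades) / len(grades)

# Weighted mean: each grade counts in proportion to its credits.
weighted_mean = sum(g * w for g, w in zip(grades, credits)) / sum(credits)

print(arithmetic_mean, weighted_mean)
```

Here the weighted mean is lower than the arithmetic mean because the best grade (18) carries only one credit.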

Percentiles and quantiles

  • Cumulative frequencies
  • 10th percentile = 4 –> 10% of observations have a value of 4 or less
  • Quartiles are chunks of 25%: 1st quartile 25%, 2nd quartile 50% (= the median), 3rd quartile 75%
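
As an illustration, quartiles can be computed with `statistics.quantiles` from the Python standard library. Note that several interpolation conventions exist, so other tools may return slightly different cut points; the `"inclusive"` method and the sample below are choices made for this sketch only.

```python
import statistics

# Hypothetical sample.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three cut points that split the data into quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)

# The second quartile coincides with the median.
print(q2 == statistics.median(data))
```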

Interquartile range (IQR)

  • The range of values within the central 50% of observations

  • $\mathit{IQR} = Q_3 - Q_1$

  • i.e. the range of values between the 1st and 3rd quartiles

Outliers & boxplot

Procedure:

  • Calculate $\mathit{IQR} = Q_3 - Q_1$
  • Multiply $\mathit{IQR} \times 1.5$
  • Observation outlier if $x_i > Q_3 + 1.5 \times \mathit{IQR}$ (upper bound)
  • OR $x_i < Q_1 - 1.5 \times \mathit{IQR}$ (lower bound)
Q1-1.5*IQR     Q1     Median   Q3      Q3+1.5*IQR
                -----------------
* |-------------|        |      |----------|    * *
                -----------------
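
The procedure above can be sketched in a few lines of Python. The data are hypothetical, and the quartiles use the `"inclusive"` method of `statistics.quantiles`, which is one convention among several, so the exact bounds may differ slightly in other software.

```python
import statistics

# Hypothetical sample; 30 looks suspicious.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Bounds at 1.5 * IQR beyond the quartiles (the whisker limits of the boxplot).
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# An observation is flagged as an outlier if it falls outside the bounds.
outliers = [x for x in data if x < lower_bound or x > upper_bound]

print(iqr, lower_bound, upper_bound, outliers)
```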

Variance and standard deviation

  • Range: Max-Min
  • Mean absolute deviation (MAD): average absolute difference from the mean $\mathit{MAD} = \frac{1}{n} \sum^n_{i=1} |x_i - \bar{x}|$
  • Variance: average squared distance from the mean $s^2 = \frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x})^2$
    • Squaring penalizes observations far from the mean more heavily
  • Standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x})^2}$
  • Coefficient of variation: “standardizes” the standard deviation, making it comparable across datasets $\mathit{CV} = \frac{s}{\bar{x}}$
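
The measures of spread listed above can be computed by hand as follows. This is a sketch on hypothetical data; it assumes the sample versions of variance and standard deviation (dividing by n - 1) and a mean absolute deviation that divides by n.

```python
import math
import statistics

# Hypothetical sample.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

value_range = max(data) - min(data)

# Mean absolute deviation: average absolute distance from the mean (assumed to divide by n).
mad = sum(abs(x - mean) for x in data) / n

# Sample variance: squared distances from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance, back in the units of the data.
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation relative to the mean (unitless).
cv = std_dev / mean

print(value_range, mad, variance, std_dev, cv)

# The standard library gives the same sample variance and standard deviation.
print(statistics.variance(data), statistics.stdev(data))
```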

Z-scores, standardization

  • How far an observation is from the mean, measured in standard deviations (standardizing): $z_i = \frac{x_i - \bar{x}}{s}$
    • on the mean –> 0
    • smaller than the mean –> < 0
    • bigger than the mean –> > 0
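
A minimal sketch of standardization on hypothetical data, using the sample standard deviation from the `statistics` module:

```python
import statistics

# Hypothetical sample with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
std_dev = statistics.stdev(data)  # sample standard deviation (n - 1)

# z-score: (observation - mean) / standard deviation
z_scores = [(x - mean) / std_dev for x in data]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
```

After standardizing, the z-scores themselves have mean 0 and standard deviation 1, which is what makes observations from different datasets comparable.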

Covariance vs correlation

  • Covariance formula: $s_{xy} = \frac{1}{n-1} \sum^n_{i=1} (x_i-\bar{x}) (y_i-\bar{y})$
    • Looks familiar no? It’s basically the variance, but for two different variables
    • How does a variable move relative to another?
  • Correlation: $r_{xy} = \frac{s_{xy}}{s_x s_y}$, that is $\frac{\mathit{Cov}_{xy}}{\mathit{SD}_x \times \mathit{SD}_y}$ (the covariance rescaled by both standard deviations)
    • Yields a number between -1 and 1. What does it mean if the correlation = 1? Or -1?
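
The two formulas can be sketched directly in Python. The helper names (`covariance`, `std_dev`, `correlation`) and the data are hypothetical, written for this illustration; the two `y` series are built to be perfectly linear in `x`, one increasing and one decreasing.

```python
import math

def covariance(a, b):
    """Sample covariance: average product of deviations, divided by n - 1."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)

def std_dev(a):
    """Sample standard deviation: the covariance of a variable with itself is its variance."""
    return math.sqrt(covariance(a, a))

def correlation(a, b):
    """Covariance rescaled by both standard deviations."""
    return covariance(a, b) / (std_dev(a) * std_dev(b))

x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises exactly in line with x
y_down = [10, 8, 6, 4, 2]  # falls exactly in line with x

print(covariance(x, y_up), correlation(x, y_up))
print(covariance(x, y_down), correlation(x, y_down))
```

In this sketch the covariance depends on the units of the data, while the correlation comes out at (approximately, up to floating-point rounding) 1 for the increasing line and -1 for the decreasing one.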

Chebyshev’s theorem

  • At least $1 - \frac{1}{k^2}$ of observations lie within the range $\mathit{mean} \pm (k \times \mathit{SD})$, where $k$ is the number of standard deviations and $k > 1$
  • Just need the mean and std deviation to get an idea of the spread of data
  • Features
    • Regardless of the distribution of the dataset
    • Represents a lower bound, i.e. the minimum proportion within $k$ std dev (the actual proportion can be much larger); equivalently: “No more than $\frac{1}{k^2}$ of observations can be more than $k$ $SD$ away from the mean”
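
To see the bound at work, the sketch below compares the Chebyshev minimum with the share actually observed in a small, deliberately skewed sample. The function name `chebyshev_lower_bound` and the data are hypothetical.

```python
import statistics

def chebyshev_lower_bound(k):
    """Minimum share of observations within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

# Hypothetical skewed sample (not bell-shaped).
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
mean = statistics.mean(data)
sd = statistics.stdev(data)

k = 2
bound = chebyshev_lower_bound(k)

# Share of observations that actually fall within mean ± k * SD.
observed = sum(abs(x - mean) <= k * sd for x in data) / len(data)

print(bound, observed)
```

For k = 2 the theorem guarantees at least 75%; in this sample 90% of the observations fall inside the range, which illustrates that the bound is a minimum and not an estimate.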

The Empirical Rule

  • Formula
    • $ \mathit{mean} \pm (1 \times SD) $ has approximately 68% of values
    • $ \mathit{mean} \pm (2 \times SD) $ has approximately 95% of values
    • $ \mathit{mean} \pm (3 \times SD) $ has approximately 99.7% of values (almost all)
  • Features
    • More precise than Chebyshev’s theorem
    • Only valid for bell-shaped, symmetric distributions (e.g. the normal distribution)
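
The three percentages can be checked against a normal distribution, assuming that "bell-shaped" is modelled by the standard normal. The sketch uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The shares come out close to 68%, 95% and 99.7%, each well above the Chebyshev minimum for the same number of standard deviations (75% for 2, about 89% for 3).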