Topic A3 — Summary Statistics
Mean vs Median vs Mode
 Mean
 sum of all observations / number of observations
 /!\ affected by outliers
 Median
 observations ranked, divided in two (middle value, or avg of center values)
 OR 50% of obs above and below that value
 safe from outliers
 Mode
 most frequently occurring value in the data
 also usable for qualitative variables
 one mode: one value occurs most often –> bimodal: two values tied for first position
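A minimal sketch of the three measures, using Python's standard `statistics` module. The sample values are made up for illustration; the last one is a deliberate outlier, to show how it pulls the mean but not the median.

```python
from statistics import mean, median, multimode

# Hypothetical sample; 100 is a deliberate outlier.
values = [2, 3, 3, 4, 5, 100]

sample_mean = mean(values)        # sum of observations / number of observations
sample_median = median(values)    # even count: average of the two center values
sample_modes = multimode(values)  # list of the most frequent value(s)

print("mean:", sample_mean)
print("median:", sample_median)
print("mode(s):", sample_modes)

# The mode also works on qualitative data; here two values tie (bimodal).
colour_modes = multimode(["red", "blue", "red", "blue", "green"])
print("colour mode(s):", colour_modes)
```

The outlier drags the mean far above most observations, while the median stays between the two center values.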
Arithmetic vs weighted mean
 Arithmetic: the “normal” mean, every observation counts equally
 VS weighted: some values matter more than others (e.g. grades weighted by credits)
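The credits-and-grades case can be sketched as follows. The grades and credit values are invented for illustration, and `course_credits` is a hypothetical name.

```python
# Hypothetical grades and the credits attached to each course.
grades = [15, 12, 18]
course_credits = [6, 3, 1]

# Arithmetic mean: every grade counts the same.
arithmetic_mean = sum(grades) / len(grades)

# Weighted mean: each grade is multiplied by its weight,
# and the total is divided by the sum of the weights.
weighted_mean = (
    sum(g * w for g, w in zip(grades, course_credits)) / sum(course_credits)
)

print("arithmetic:", arithmetic_mean)
print("weighted:", weighted_mean)
```

The weighted mean sits closer to the grade of the 6-credit course, because that course carries the most weight.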
Percentiles and quantiles
 Cumulative frequencies
 10th percentile = 4 –> 10% of observations have a value of 4 or less
 Quartiles are chunks of 25%: 1st Quart 25%, Second 50% (=???), Third 75%
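One way to compute percentiles is the nearest-rank convention sketched below: the p-th percentile is the smallest value with at least p% of observations at or below it. This is only one of several conventions (many tools interpolate between neighbouring values and can return slightly different numbers). The function name and the data are assumptions made for this example.

```python
import math

def percentile_nearest_rank(data, p):
    """Smallest value such that at least p% of observations are <= it."""
    ordered = sorted(data)
    # 1-based rank in the sorted list; at least 1 so that p = 0 still works.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # hypothetical data

p10 = percentile_nearest_rank(values, 10)
q1 = percentile_nearest_rank(values, 25)  # first quartile
q2 = percentile_nearest_rank(values, 50)  # second quartile
q3 = percentile_nearest_rank(values, 75)  # third quartile

print(p10, q1, q2, q3)
```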
Interquartile range (IQR)

The range of values within the central 50% of observations

$\mathit{IQR} = Q_3 - Q_1$

OR the range of values between the 1st and 3rd quartile
Outliers & boxplot
Procedure:
 Calculate $\mathit{IQR} = Q_3 - Q_1$
 Multiply $\mathit{IQR} \times 1.5$
 Observation outlier if $x_i > Q_3 + 1.5 \times \mathit{IQR}$ (upper bound)
 OR $x_i < Q_1 - 1.5 \times \mathit{IQR}$ (lower bound)
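[Boxplot: box from Q1 to Q3 with the median inside, whiskers reaching at most Q1 - 1.5*IQR and Q3 + 1.5*IQR, outliers beyond the whiskers drawn as *]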
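The procedure above can be sketched in Python. The quartiles come from `statistics.quantiles` with `method="inclusive"`, which is an assumed convention (other quartile methods can shift the bounds slightly), and the data are invented for illustration.

```python
from statistics import quantiles

# Hypothetical sample with one suspiciously large value.
values = [10, 12, 13, 14, 15, 16, 17, 18, 40]

# n=4 returns the three cut points Q1, Q2 (median), Q3.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# An observation is flagged when it falls outside the two bounds.
outliers = [x for x in values if x < lower_bound or x > upper_bound]

print("Q1, Q3:", q1, q3)
print("IQR:", iqr)
print("bounds:", lower_bound, upper_bound)
print("outliers:", outliers)
```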

Variance and standard deviation
 Range: Max - Min
 Mean absolute deviation (MAD): average absolute difference from the mean $\frac{1}{n} \sum^n_{i=1} |x_i - \bar{x}|$
 Variance: average squared distance from the mean $s^2 = \frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x})^2$
 Squaring penalizes observations far from the mean more heavily
 Standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x})^2}$
 Coeff of variation: “standardizes” the std dev, making it comparable across datasets: $\mathit{CV}= \frac{s}{\bar{x}}$
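A sketch of these dispersion measures, computed by hand and checked against the standard `statistics` module. Two assumptions: the MAD divides by $n$ (the usual convention for a plain average), while the variance uses the sample denominator $n - 1$. The data are invented.

```python
import math
from statistics import mean, stdev, variance

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
n = len(values)
x_bar = mean(values)

value_range = max(values) - min(values)

# Mean absolute deviation: average absolute distance from the mean.
mad = sum(abs(x - x_bar) for x in values) / n

# Sample variance: squared distances, divided by n - 1.
s2 = sum((x - x_bar) ** 2 for x in values) / (n - 1)

# Standard deviation: square root of the variance (back in the data's units).
s = math.sqrt(s2)

# Coefficient of variation: std dev relative to the mean (unit-free).
cv = s / x_bar

print("range:", value_range)
print("MAD:", mad)
print("variance:", round(s2, 4), "library:", round(variance(values), 4))
print("std dev:", round(s, 4), "library:", round(stdev(values), 4))
print("CV:", round(cv, 4))
```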
Z-scores, standardization
 How far an observation is from the mean, measured in standard deviations (standardizing)
 if it’s on the mean, =0
 smaller than mean –> <0
 bigger than mean –> >0
 $\textit{z-score} = \frac{\mathit{Observation} - \mathit{Mean}}{\mathit{Std\ Deviation}}$
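A short sketch of standardization on an invented sample, using the sample standard deviation from the `statistics` module (an assumption; a population standard deviation would give slightly different scores).

```python
from statistics import mean, stdev

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample
x_bar = mean(values)
s = stdev(values)

# z-score: (observation - mean) / standard deviation
z_scores = [(x - x_bar) / s for x in values]

for x, z in zip(values, z_scores):
    print(f"x = {x}: z = {z:+.2f}")
```

After standardizing, the z-scores themselves have mean 0 and standard deviation 1, which is what makes values from different datasets comparable.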
Covariance vs correlation
 Covariance formula: $s_{xy} = \frac{1}{n-1} \sum^n_{i=1} (x_i - \bar{x}) (y_i - \bar{y})$
 Looks familiar, no? It’s basically the variance, but for two different variables
 How does a variable move relative to another?
 Correlation: $ r_{xy} = \frac{s_{xy}}{s_x s_y} $ that is $ \frac{\mathit{Cov}_{xy}}{\mathit{SD}_x \times \mathit{SD}_y} $
 Yields a number between -1 and 1. What does it mean if the correlation = 1? Or -1?
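Both formulas can be sketched directly from their definitions. The paired data are invented, and the sample denominator $n - 1$ is used throughout.

```python
from statistics import mean, stdev

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Covariance: average product of the paired deviations from each mean.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Correlation: covariance divided by the product of the standard deviations.
r_xy = s_xy / (stdev(x) * stdev(y))

print("covariance:", s_xy)
print("correlation:", round(r_xy, 4))
```

The covariance depends on the units of both variables, whereas the correlation is unit-free, which is why it is easier to interpret.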
Chebyshev’s theorem
 $ 1 - \frac{1}{\mathit{nbSD}^2} =$ minimum proportion of observations within the range $ \mathit{mean} \pm (\mathit{nbSD} \times \mathit{SD}) $, for a number of standard deviations $\mathit{nbSD}$ greater than 1
 Just need the mean and std deviation to get an idea of the spread of data
 Features
 Regardless of the distribution of the dataset
 Represents a lower bound, i.e. the minimum percentage within $k$ std dev (the actual percentage can be much larger): “No more than $1/k^2$ of the observations can be more than $k$ $SD$ away from the mean”
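A sketch comparing the Chebyshev bound with the share actually observed in a deliberately skewed, invented sample. The function name `chebyshev_lower_bound` is hypothetical.

```python
from statistics import mean, stdev

def chebyshev_lower_bound(k):
    """Minimum share of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k**2

# Hypothetical, strongly right-skewed sample.
values = [1, 1, 1, 2, 2, 3, 4, 6, 10, 20]
x_bar = mean(values)
s = stdev(values)

k = 2
bound = chebyshev_lower_bound(k)

# Share of observations actually inside mean ± k * SD.
within = sum(abs(x - x_bar) <= k * s for x in values) / len(values)

print(f"guaranteed at least: {bound:.0%}")
print(f"observed: {within:.0%}")
```

Even on skewed data the observed share stays at or above the bound, and it is often well above it.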
The Empirical Rule
 Formula
 $ \mathit{mean} \pm (1 \times SD) $ has approximately 68% of values
 $ \mathit{mean} \pm (2 \times SD) $ has approximately 95% of values
 $ \mathit{mean} \pm (3 \times SD) $ has approximately 99.7% of values
 Features
 More precise
 Only bell-shaped and symmetric distributions
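The three percentages can be checked against an exactly normal distribution, which is the idealized bell-shaped case and an assumption of this sketch. It uses `NormalDist` from the standard `statistics` module; `share_within` is a hypothetical helper name.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Share of a normal distribution lying within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"mean ± {k} SD: {share:.1%}")
```

For comparison, Chebyshev's theorem only guarantees 75% within 2 SD, which illustrates why the empirical rule is more precise when the bell-shape assumption holds.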