%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Descriptive statistics}
Descriptive statistics characterizes data sets by means of a few measures.
In addition to histograms that estimate the full distribution of the data,
the following measures are used for characterizing univariate data:
\begin{description}
\item[Location, central tendency] (``Lagema{\ss}e''):
arithmetic mean, median, mode.
\item[Spread, dispersion] (``Streuungsma{\ss}e''): variance,
standard deviation, inter-quartile range,\linebreak coefficient of variation
(``Variationskoeffizient'').
\item[Shape]: skewness (``Schiefe''), kurtosis (``W\"olbung'').
\end{description}
For bivariate and multivariate data sets we can also analyse their
\begin{description}
\item[Dependence, association] (``Zusammenhangsma{\ss}e''): Pearson's correlation coefficient,
Spearman's rank correlation coefficient.
\end{description}
The following is not a complete introduction to descriptive
statistics, but summarizes a few concepts that are most important in
daily data-analysis problems.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mean, variance, and standard deviation}
The \enterm{arithmetic mean} is a measure of location. For $n$ data values
$x_i$ the arithmetic mean is computed by
\[ \bar x = \langle x \rangle = \frac{1}{n}\sum_{i=1}^n x_i \; . \]
This computation (summing up all elements of a vector and dividing by
the length of the vector) is provided by the function \mcode{mean()}.
The mean has the same unit as the data values.
The dispersion of the data values around the mean is quantified by
their \enterm{variance}
\[ \sigma^2_x = \langle (x-\langle x \rangle)^2 \rangle = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2 \; . \]
The variance is computed by the function \mcode{var()}.
The unit of the variance is the unit of the data values squared.
Therefore, variances cannot be compared to the mean or the data values
themselves. In particular, variances cannot be used for plotting error
bars along with the mean.
The standard deviation
\[ \sigma_x = \sqrt{\sigma^2_x} \; , \]
as computed by the function \mcode{std()}, however, has the same unit
as the data values and can (and should) be used to display the
dispersion of the data together with their mean.
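A minimal sketch of how these measures can be computed in \matlab{}
(the data are just hypothetical random numbers):
\begin{verbatim}
x = randn(100, 1) * 2.0 + 5.0; % 100 hypothetical data values
m = mean(x);                   % arithmetic mean, same unit as x
v = var(x);                    % variance, squared unit of x
s = std(x);                    % standard deviation, same unit as x
% note: var() and std() normalize by n-1 by default;
% var(x, 1) and std(x, 1) normalize by n as in the formulas above.
\end{verbatim}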
The mean of a data set can be displayed by a bar plot
\matlabfun{bar()}. Additional error bars \matlabfun{errorbar()} can be
used to illustrate the standard deviation of the data
(\figref{displayunivariatedatafig} (2)).
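A minimal sketch of such a plot, assuming hypothetical data in a
vector \code{x}:
\begin{verbatim}
x = randn(100, 1) + 2.0;           % hypothetical data
bar(1, mean(x));                   % bar showing the mean
hold on;
errorbar(1, mean(x), std(x), 'k'); % +/- one standard deviation
hold off;
\end{verbatim}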
\begin{figure}[t]
\includegraphics[width=1\textwidth]{displayunivariatedata}
\titlecaption{\label{displayunivariatedatafig} Displaying statistics
of univariate data.}{(1) In particular for small data sets it is
most informative to plot the data themselves. The value of each
data point is plotted on the y-axis. To make the data points
overlap less, they are jittered along the x-axis by means of
uniformly distributed random numbers \matlabfun{rand()}. (2) With
a bar plot \matlabfun{bar()} one usually shows the mean of the
data. The additional errorbar illustrates the deviation of the
data from the mean by $\pm$ one standard deviation. (3) A
    box-whisker plot \matlabfun{boxplot()} shows more details of the
    distribution of the data values. The box extends from the
    1$^{\rm st}$ to the 3$^{\rm rd}$ quartile, a horizontal line
    within the box marks the median value, and the whiskers extend to
    the minimum and the maximum data values. (4) The probability
    density $p(x)$ estimated from a
normalized histogram shows the entire distribution of the
data. Estimating the probability distribution is only meaningful
for sufficiently large data sets.}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mode, median, quartile, etc.}
\begin{figure}[t]
\includegraphics[width=1\textwidth]{median}
\titlecaption{\label{medianfig} Median, mean and mode of a
probability distribution.}{Left: Median, mean and mode are
identical for the symmetric and unimodal normal distribution.
Right: for asymmetric distributions these three measures differ. A
heavy tail of a distribution pulls out the mean most strongly. In
contrast, the median is more robust against heavy tails, but not
necessarily identical with the mode.}
\end{figure}
The \enterm{mode} is the most frequent value, i.e. the position of the maximum of the probability distribution.
The \enterm{median} separates a list of data values into two halves
such that one half of the data is not greater and the other half is
not smaller than the median (\figref{medianfig}).
\begin{exercise}{mymedian.m}{}
Write a function \code{mymedian()} that computes the median of a vector.
\end{exercise}
\matlab{} provides the function \code{median()} for computing the median.
\begin{exercise}{checkmymedian.m}{}
  Write a script that tests whether your median function really
  returns a median, i.e. a value above which lie as many data values
  as below. In particular, the script should test data vectors of
  different lengths. Do not use the \mcode{median()} function
  for testing your own function.
Writing tests for your own functions is a very important strategy for
writing reliable code!
\end{exercise}
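One possible implementation of such a median function is sketched
below: sort the data and take the middle element, or, for an even
number of data values, the mean of the two middle elements:
\begin{verbatim}
function m = mymedian(x)
% median of a vector: middle element(s) of the sorted data
    xs = sort(x(:));
    n = length(xs);
    if mod(n, 2) == 1
        m = xs((n+1)/2);              % odd n: middle element
    else
        m = (xs(n/2) + xs(n/2+1))/2;  % even n: mean of middle pair
    end
end
\end{verbatim}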
\begin{figure}[t]
\includegraphics[width=1\textwidth]{quartile}
\titlecaption{\label{quartilefig} Median and quartiles of a normal distribution.}{}
\end{figure}
The distribution of data can be further characterized by the position
of its \enterm[quartile]{quartiles}. Neighboring quartiles are
separated by 25\,\% of the data (\figref{quartilefig}).
\enterm[percentile]{Percentiles} allow us to characterize the
distribution of the data in more detail. The 3$^{\rm rd}$ quartile
corresponds to the 75$^{\rm th}$ percentile, because 75\,\% of the
data are smaller than the 3$^{\rm rd}$ quartile.
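A minimal sketch of how quartiles can be computed, assuming the
\code{prctile()} function of the Statistics Toolbox is available
(alternatively, the sorted data can be indexed at the corresponding
positions):
\begin{verbatim}
x = randn(200, 1);            % hypothetical data
q = prctile(x, [25 50 75]);   % 1st quartile, median, 3rd quartile
\end{verbatim}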
% \begin{definition}[quartile]
% Die Quartile Q1, Q2 und Q3 unterteilen die Daten in vier gleich
% gro{\ss}e Gruppen, die jeweils ein Viertel der Daten enthalten.
% Das mittlere Quartil entspricht dem Median.
% \end{definition}
% \begin{exercise}{quartiles.m}{}
% Write a function that computes the first, second, and third quartile of a vector.
% \end{exercise}
% \begin{figure}[t]
% \includegraphics[width=1\textwidth]{boxwhisker}
% \titlecaption{\label{boxwhiskerfig} Box-Whisker Plot.}{Box-whisker
% plots are well suited for comparing unimodal distributions. Each
% box-whisker characterizes 40 random numbers that have been drawn
% from a normal distribution.}
% \end{figure}
\enterm{Box-whisker plots} are commonly used to visualize and compare
the distribution of unimodal data. A box is drawn around the median
that extends from the 1$^{\rm st}$ to the 3$^{\rm rd}$ quartile. The
whiskers mark the minimum and maximum value of the data set
(\figref{displayunivariatedatafig} (3)).
\begin{exercise}{boxwhisker.m}{}
  Generate a $40 \times 10$ matrix of random numbers and
  illustrate their distribution in a box-whisker plot
  (\code{boxplot()} function). How do you interpret the plot?
\end{exercise}
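A minimal sketch for this exercise (\code{boxplot()} requires the
Statistics Toolbox):
\begin{verbatim}
x = randn(40, 10);  % each column is one data set
boxplot(x);         % one box-whisker per column
\end{verbatim}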
\section{Distributions}
The distribution of values in a data set is estimated by histograms
(\figref{displayunivariatedatafig} (4)).
\subsection{Histograms}
\enterm[Histogram]{Histograms} count the frequencies $n_i$ with which
$N=\sum_{i=1}^M n_i$ measurements fall into each of $M$ bins $i$
(\figref{diehistogramsfig} left). The bins usually tile the data
range into intervals of equal size, whose width is called the bin
width.
\begin{figure}[t]
\includegraphics[width=1\textwidth]{diehistograms}
  \titlecaption{\label{diehistogramsfig} Histograms resulting from
    rolling a die 100 or 500 times.}{Left: the absolute frequency
histogram counts the frequency of each number the die
shows. Right: When normalized by the sum of the frequency
histogram the two data sets become comparable with each other and
with the expected theoretical distribution of $P=1/6$.}
\end{figure}
Histograms are often used to estimate the \enterm{probability
distribution} of the data values.
\subsection{Probabilities}
In the frequentist interpretation of probability, the probability of
an event (e.g. getting a six when rolling a die) is the relative
occurrence of this event in the limit of a large number of trials.
For a finite number of trials $N$ where the event $i$ occurred $n_i$
times, the probability $P_i$ of this event is estimated by
\[ P_i = \frac{n_i}{N} = \frac{n_i}{\sum_{j=1}^M n_j} \; . \]
From this definition it follows that a probability is a unitless
quantity that takes on values between zero and one. Most importantly,
the sum of the probabilities of all possible events is one:
\[ \sum_{i=1}^M P_i = \sum_{i=1}^M \frac{n_i}{N} = \frac{1}{N} \sum_{i=1}^M n_i = \frac{N}{N} = 1\; , \]
i.e. the probability of getting any event is one.
\subsection{Probability distributions of categorical data}
For categorical data values (e.g. the faces of a die, as integer
numbers or as colors) a bin can be defined for each category $i$.
The histogram is normalized by the total number of measurements to
make it independent of the size of the data set
(\figref{diehistogramsfig}). After this normalization the height of
each histogram bar is an estimate of the probability $P_i$ of the
category $i$, i.e. of getting a data value in the $i$-th bin.
\begin{exercise}{rollthedie.m}{}
Write a function that simulates rolling a die $n$ times.
\end{exercise}
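A minimal sketch of such a function, using \code{randi()} for
uniformly distributed integers:
\begin{verbatim}
function x = rollthedie(n)
% n rolls of a fair die: uniform integers between 1 and 6
    x = randi(6, n, 1);
end
\end{verbatim}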
\begin{exercise}{diehistograms.m}{}
  Plot histograms of rolling a die 20, 100, and 1000 times. Use
  \code[hist()]{hist(x)}, enforce six bins with
  \code[hist()]{hist(x,6)}, or set sensible bins yourself. Then
  normalize the histogram.
\end{exercise}
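A minimal sketch, assuming the \code{rollthedie()} function from the
previous exercise:
\begin{verbatim}
x = rollthedie(100);                % 100 rolls of the die
[counts, centers] = hist(x, 1:6);   % one bin per face of the die
bar(centers, counts / sum(counts)); % normalized to probabilities
hold on;
plot([0.5 6.5], [1/6 1/6], 'b');    % expected probability P = 1/6
hold off;
\end{verbatim}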
\subsection{Probability density functions}
In cases where we deal with data sets of measurements of a real
quantity (e.g. the length of snakes, the weight of elephants, the
time between successive spikes) there is no natural bin width for
computing a histogram. In addition, the probability of measuring a
data value that exactly equals a specific real number like, e.g.,
0.123456789 is zero, because there are uncountably many real numbers.
We can only ask for the probability of getting a measurement value in
some range. For example, we can ask for the probability $P(0<x<1)$ to
get a measurement between zero and one
(\figref{pdfprobabilitiesfig}). More generally, we want to know the
probability $P(x_0<x<x_1)$ to obtain a measurement between $x_0$ and
$x_1$. If we denote the width of this range by $\Delta x = x_1 -
x_0$, then the probability can also be expressed as $P(x_0<x<x_0 +
\Delta x)$.
In the limit of very small ranges $\Delta x$ the probability of
getting a measurement between $x_0$ and $x_0+\Delta x$ scales down to
zero with $\Delta x$:
\[ P(x_0<x<x_0+\Delta x) \approx p(x_0) \cdot \Delta x \; . \]
Here, the quantity $p(x_0)$ is a so-called \enterm{probability
  density}. This is not a unitless probability with values between 0
and 1, but a quantity that can take on any non-negative real value
and whose unit is the inverse of the unit of the data values --- hence
the name ``density''.
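For example, the probability density of the standard normal
distribution introduced below takes the value $p(0) = 1/\sqrt{2\pi}
\approx 0.4$ at zero. The probability of a measurement within a
narrow range of width $\Delta x = 0.01$ around zero is therefore
approximately
\[ P(-0.005 < x < 0.005) \approx p(0) \cdot \Delta x \approx 0.4
\cdot 0.01 = 0.004 \; . \]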
\begin{figure}[t]
\includegraphics[width=1\textwidth]{pdfprobabilities}
  \titlecaption{\label{pdfprobabilitiesfig} Probability obtained from
    a probability density.}{The probability of a data value $x$ between,
e.g., zero and one is the integral (red area) over the probability
density (blue).}
\end{figure}
The probability of getting a value $x$ between $x_1$ and $x_2$ is
given by the integral over the probability density:
\[ P(x_1 < x < x_2) = \int\limits_{x_1}^{x_2} p(x) \, dx \; . \]
Because the probability of getting any value $x$ must be one, the
probability density obeys the normalization
\begin{equation}
\label{pdfnorm}
P(-\infty < x < \infty) = \int\limits_{-\infty}^{+\infty} p(x) \, dx = 1 \; .
\end{equation}
\pagebreak[2]
The function $p(x)$ that assigns a probability density to each value
$x$ is called a \enterm{probability density function}
(\enterm[pdf|see{probability density function}]{pdf}, or in short
\enterm[density|see{probability density function}]{density};
\determ{Wahrscheinlichkeitsdichtefunktion}). The best known
probability density function is that of the \enterm{normal
  distribution} (``Normalverteilung'')
\[ p_g(x) =
   \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
--- the Gaussian bell curve (``Gau{\ss}sche Glockenkurve'') with mean
$\mu$ and standard deviation $\sigma$.
\begin{exercise}{gaussianpdf.m}{gaussianpdf.out}
\begin{enumerate}
\item Plot the probability density of the normal distribution $p_g(x)$.
\item Compute the probability of getting a data value between zero and one
for the normal distribution with zero mean and standard deviation of one.
\item Draw 1000 normally distributed random numbers and use these
numbers to calculate the probability of getting a number between
zero and one.
  \item Compute the integral $\int_{-\infty}^{+\infty} p(x) \, dx$ over the normal distribution.
\end{enumerate}
\end{exercise}
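A minimal sketch for parts of this exercise, with zero mean and unit
standard deviation and a simple rectangle approximation of the
integral:
\begin{verbatim}
dx = 0.01;
x = -4:dx:4;
p = exp(-x.^2/2) / sqrt(2*pi);  % standard normal density
plot(x, p);
% probability of a value between zero and one as integral over p(x):
idx = (x >= 0) & (x <= 1);
P = sum(p(idx)) * dx;
% ... and estimated from 1000 normally distributed random numbers:
r = randn(1000, 1);
Pest = sum((r >= 0) & (r <= 1)) / length(r);
\end{verbatim}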
\begin{figure}[t]
\includegraphics[width=1\textwidth]{pdfhistogram}
\titlecaption{\label{pdfhistogramfig} Histograms with different bin
widths of normally distributed data.}{Left: The height of the
histogram bars strongly depends on the width of the bins. Right:
If the histogram is normalized such that its integral is one we
get an estimate of the probability density of the data values.
The normalized histograms are comparable with each other and can
also be compared to theoretical probability densities, like the
normal distributions (blue).}
\end{figure}
\pagebreak[4]
\begin{exercise}{gaussianbins.m}{}
  Draw 100 random numbers from a Gaussian distribution and plot
  histograms of these data with different bin sizes. What do you
  observe?
\end{exercise}
For histograms of real-valued measurements to be comparable with each
other, despite different numbers of measurements and different bin
widths, and to be comparable with known probability density
functions, they need to be normalized such that their integral equals
one \eqnref{pdfnorm}. The integral (not the sum) over the histogram
must be one, because the probability of measuring any value at all is
one. The integral is the area of the histogram, which is composed of
the areas of the single histogram bars. The bars of the histogram
have the height $n_i$ and the width $\Delta x$. The total area $A$ of
a histogram with $M$ bins is thus
\[ A = \sum_{i=1}^M ( n_i \cdot \Delta x ) = \Delta x \sum_{i=1}^M n_i \]
and the normalized histogram has the height
\[ p(x_i) = \frac{n_i}{\Delta x \sum_{i=1}^M n_i} \; . \]
The histogram counts thus need to be divided not only by their sum
but also by the bin width $\Delta x$ (\figref{pdfhistogramfig}).
\begin{exercise}{gaussianbinsnorm.m}{}
  Normalize the histogram of the previous exercise to a probability
  density.
\end{exercise}
\section{Correlations}
Until now we described properties of univariate data sets. In
bivariate or multivariate data sets where we have pairs or tuples of
data values (e.g. the size and the weight of elephants) we want to analyze
dependencies between the variables.
The \enterm{correlation coefficient}
\[ r_{x,y} = \frac{Cov(x,y)}{\sigma_x \sigma_y} = \frac{\langle
  (x-\langle x \rangle)(y-\langle y \rangle) \rangle}{\sqrt{\langle
    (x-\langle x \rangle)^2 \rangle} \sqrt{\langle (y-\langle y
    \rangle)^2 \rangle}} \]
quantifies linear relationships between two variables
\matlabfun{corr()}. The correlation coefficient is the
\enterm{covariance} normalized by the standard deviations of the
single variables. Perfectly correlated variables result in a
correlation coefficient of $+1$, anti-correlated or negatively
correlated data in a correlation coefficient of $-1$, and
uncorrelated data in a correlation coefficient close to zero
(\figrefb{correlationfig}).
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{correlation}
  \titlecaption{\label{correlationfig} Correlations between pairs of data.}{}
\end{figure}
\begin{exercise}{correlations.m}{}
Generate pairs of random numbers with four different correlations
(perfectly correlated, somehow correlated, uncorrelated, negatively
correlated). Plot them into a scatter plot and compute their
correlation coefficient.
\end{exercise}
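A minimal sketch for this exercise: correlated pairs are generated by
mixing a common component into independent random numbers, and
\code{corrcoef()} (plain \matlab{}, in contrast to \code{corr()} from
the Statistics Toolbox) computes the correlation coefficient:
\begin{verbatim}
n = 1000;
x = randn(n, 1);
for a = [1.0, 0.5, 0.0, -1.0]  % varying degrees of correlation
    y = a*x + sqrt(1 - a^2)*randn(n, 1);
    c = corrcoef(x, y);        % 2x2 correlation matrix
    fprintf('a = %+.1f: r = %+.2f\n', a, c(1, 2));
end
\end{verbatim}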
Note that the correlation coefficient detects non-linear dependencies
between two variables insufficiently or not at all
(\figref{nonlincorrelationfig}).
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{nonlincorrelation}
\titlecaption{\label{nonlincorrelationfig} Correlations for
non-linear dependencies.}{The correlation coefficient detects
    linear dependencies only. Both the quadratic dependency (left) and
    the noise correlation (right), where the dispersion of the
    $y$-values depends on the $x$-value, result in correlation
    coefficients close to zero. $\xi$ denotes normally distributed
    random numbers.}
\end{figure}