%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Descriptive statistics}
Descriptive statistics characterizes data sets by means of a few measures.
In addition to histograms that estimate the full distribution of the data,
the following measures are used for characterizing univariate data:
\begin{description}
\item[Location, central tendency] (``Lagema{\ss}e''):
arithmetic mean, median, mode.
\item[Spread, dispersion] (``Streuungsma{\ss}e''): variance,
standard deviation, inter-quartile range,\linebreak coefficient of variation
(``Variationskoeffizient'').
\item[Shape]: skewness (``Schiefe''), kurtosis (``W\"olbung'').
\end{description}
For bivariate and multivariate data sets we can also analyse their
\begin{description}
\item[Dependence, association] (``Zusammenhangsma{\ss}e''): Pearson's correlation coefficient,
Spearman's rank correlation coefficient.
\end{description}
The following is in no way a complete introduction to descriptive
statistics, but summarizes a few concepts that are most important in
daily data-analysis problems.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mean, variance, and standard deviation}
The \enterm{arithmetic mean} is a measure of location. For $N$ data values
$x_i$ the arithmetic mean is computed by
\[ \bar x = \langle x \rangle = \frac{1}{N}\sum_{i=1}^N x_i \; . \]
This computation (summing up all elements of a vector and dividing by
the length of the vector) is provided by the function \mcode{mean()}.
The mean has the same unit as the data values.
The dispersion of the data values around the mean is quantified by
their \enterm{variance}
\[ \sigma^2_x = \langle (x-\langle x \rangle)^2 \rangle = \frac{1}{N}\sum_{i=1}^N (x_i - \bar x)^2 \; . \]
The variance is computed by the function \mcode{var()}.
The unit of the variance is the unit of the data values squared.
Therefore, variances cannot be compared to the mean or the data values
themselves. In particular, variances cannot be used for plotting error
bars along with the mean.
The standard deviation
\[ \sigma_x = \sqrt{\sigma^2_x} \; , \]
as computed by the function \mcode{std()}, however, has the same unit
as the data values and can (and should) be used to display the
dispersion of the data together with their mean.
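The following is a minimal sketch (with hypothetical example data) of how these measures are computed in \matlab{}:
\begin{lstlisting}
x = 5.0 + 2.0*randn(100, 1);  % hypothetical data: 100 values, mean 5, std 2
xm = mean(x);                 % arithmetic mean, same unit as the data
xv = var(x);                  % variance, squared unit of the data
xs = std(x);                  % standard deviation, same unit as the data
% note: var() and std() normalize by N-1 by default; var(x, 1) normalizes by N
\end{lstlisting}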
The mean of a data set can be displayed by a bar plot
\matlabfun{bar()}. Additional errorbars \matlabfun{errorbar()} can be
used to illustrate the standard deviation of the data
(\figref{displayunivariatedatafig} (2)).
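A minimal sketch of such a plot (the data are hypothetical):
\begin{lstlisting}
x = 2.0 + randn(40, 1);             % hypothetical data set
bar(1, mean(x));                    % bar showing the mean
hold on;
errorbar(1, mean(x), std(x), 'k');  % error bar of +/- one standard deviation
hold off;
\end{lstlisting}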
\begin{figure}[t]
\includegraphics[width=1\textwidth]{displayunivariatedata}
\titlecaption{\label{displayunivariatedatafig} Displaying statistics
of univariate data.}{(1) In particular for small data sets it is
most informative to plot the data themselves. The value of each
data point is plotted on the y-axis. To make the data points
overlap less, they are jittered along the x-axis by means of
uniformly distributed random numbers \matlabfun{rand()}. (2) With
a bar plot \matlabfun{bar()} one usually shows the mean of the
data. The additional errorbar illustrates the deviation of the
data from the mean by $\pm$ one standard deviation. In case of
non-normal data mean and standard deviation only poorly
characterize the distribution of the data values. (3) A
box-whisker plot \matlabfun{boxplot()} shows more details of the
distribution of the data values. The box extends from the 1. to
the 3. quartile, a horizontal line within the box marks the median
value, and the whiskers extend to the minimum and the maximum data
values. (4) The probability density $p(x)$ estimated from a
normalized histogram shows the entire distribution of the
data. Estimating the probability distribution is only meaningful
for sufficiently large data sets.}
\end{figure}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mode, median, quartile, etc.}
\begin{figure}[t]
\includegraphics[width=1\textwidth]{median}
\titlecaption{\label{medianfig} Median, mean and mode of a
probability distribution.}{Left: Median, mean and mode coincide
for the symmetric and unimodal normal distribution. Right: for
asymmetric distributions these three measures differ. A heavy tail
of a distribution pulls out the mean most strongly. In contrast,
the median is more robust against heavy tails, but not necessarily
identical with the mode.}
\end{figure}
The \enterm{mode} is the most frequent value, i.e. the position of the maximum of the probability distribution.
The \enterm{median} separates a list of data values into two halves
such that one half of the data is not greater and the other half is
not smaller than the median (\figref{medianfig}).
\begin{exercise}{mymedian.m}{}
Write a function \code{mymedian()} that computes the median of a vector.
\end{exercise}
\matlab{} provides the function \code{median()} for computing the median.
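For skewed data, for example, the median is less affected by the heavy tail than the mean. A small sketch with hypothetical, chi-squared distributed data:
\begin{lstlisting}
x = randn(1000, 1).^2;   % hypothetical skewed (chi-squared distributed) data
mean(x)                  % pulled towards the heavy tail (roughly 1)
median(x)                % more robust measure of location (roughly 0.45)
\end{lstlisting}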
\begin{exercise}{checkmymedian.m}{}
Write a script that tests whether your median function really
returns a value above which there are as many data values as
below. In particular the script should test data vectors of
different length. You should not use the \mcode{median()} function
for testing your function.
Writing tests for your own functions is a very important strategy for
writing reliable code!
\end{exercise}
\begin{figure}[t]
\includegraphics[width=1\textwidth]{quartile}
\titlecaption{\label{quartilefig} Median and quartiles of a normal
distribution.}{ The interquartile range between the first and the
third quartile contains 50\,\% of the data and contains the
median.}
\end{figure}
The distribution of data can be further characterized by the position
of its \enterm[quartile]{quartiles}. Neighboring quartiles are
separated by 25\,\% of the data (\figref{quartilefig}).
\enterm[percentile]{Percentiles} allow us to characterize the
distribution of the data in more detail. The 3$^{\rm rd}$ quartile
corresponds to the 75$^{\rm th}$ percentile, because 75\,\% of the
data are smaller than the 3$^{\rm rd}$ quartile.
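A crude way to estimate quartiles is to sort the data and read off the values at the corresponding fractions of the data set. The sketch below uses hypothetical data; the Statistics Toolbox functions \code{quantile()} and \code{prctile()} do this properly:
\begin{lstlisting}
x = randn(1000, 1);        % hypothetical data set
xs = sort(x);              % sort the data values
n = length(xs);
q1 = xs(round(0.25*n));    % 1st quartile (25th percentile)
q2 = xs(round(0.50*n));    % median (2nd quartile, 50th percentile)
q3 = xs(round(0.75*n));    % 3rd quartile (75th percentile)
\end{lstlisting}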
% \begin{definition}[quartile]
% Die Quartile Q1, Q2 und Q3 unterteilen die Daten in vier gleich
% gro{\ss}e Gruppen, die jeweils ein Viertel der Daten enthalten.
% Das mittlere Quartil entspricht dem Median.
% \end{definition}
% \begin{exercise}{quartiles.m}{}
% Write a function that computes the first, second, and third quartile of a vector.
% \end{exercise}
% \begin{figure}[t]
% \includegraphics[width=1\textwidth]{boxwhisker}
% \titlecaption{\label{boxwhiskerfig} Box-Whisker Plot.}{Box-whisker
% plots are well suited for comparing unimodal distributions. Each
% box-whisker characterizes 40 random numbers that have been drawn
% from a normal distribution.}
% \end{figure}
\enterm{Box-whisker plots} are commonly used to visualize and compare
the distribution of unimodal data. A box is drawn around the median
that extends from the 1$^{\rm st}$ to the 3$^{\rm rd}$ quartile. The
whiskers mark the minimum and maximum value of the data set
(\figref{displayunivariatedatafig} (3)).
\begin{exercise}{univariatedata.m}{}
Generate 40 normally distributed random numbers with a mean of 2 and
illustrate their distribution in a box-whisker plot
(\code{boxplot()} function), with a bar and errorbar illustrating
the mean and standard deviation (\code{bar()}, \code{errorbar()}),
and the data themselves jittered randomly (as in
\figref{displayunivariatedatafig}). How do you interpret the
different plots?
\end{exercise}
% \begin{exercise}{boxwhisker.m}{}
% Generate a $40 \times 10$ matrix of random numbers and
% illustrate their distribution in a box-whisker plot
% (\code{boxplot()} function). How to interpret the plot?
% \end{exercise}
\section{Distributions}
The distribution of values in a data set is estimated by histograms
(\figref{displayunivariatedatafig} (4)).
\subsection{Histograms}
\enterm[Histogram]{Histograms} count the frequency $n_i$ of
$N=\sum_{i=1}^M n_i$ measurements in each of $M$ bins $i$
(\figref{diehistogramsfig} left). The bins usually tile the data range
into intervals of the same size. The width of the bins is
called the bin width. The frequencies $n_i$ plotted against the
categories $i$ form the \enterm{histogram}, or the \enterm{frequency
  histogram}.
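A minimal sketch for counting frequencies with \code{hist()} (the integer-valued data are hypothetical):
\begin{lstlisting}
x = randi(6, 100, 1);   % hypothetical data: 100 rolls of a die
[n, c] = hist(x, 1:6);  % frequencies n_i in six bins centered on 1..6
bar(c, n);              % plot the frequency histogram
\end{lstlisting}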
\begin{figure}[t]
\includegraphics[width=1\textwidth]{diehistograms}
\titlecaption{\label{diehistogramsfig} Histograms resulting from 100
or 500 times rolling a die.}{Left: the absolute frequency
histogram counts the frequency of each number the die
shows. Right: When normalized by the sum of the frequency
histogram the two data sets become comparable with each other and
with the expected theoretical distribution of $P=1/6$.}
\end{figure}
Histograms are often used to estimate the \enterm{probability
distribution} of the data values.
\subsection{Probabilities}
In the frequentist interpretation of probability, the probability of
an event (e.g. getting a six when rolling a die) is the relative
occurrence of this event in the limit of a large number of trials.
For a finite number of trials $N$ where the event $i$ occurred $n_i$
times, the probability $P_i$ of this event is estimated by
\[ P_i = \frac{n_i}{N} = \frac{n_i}{\sum_{j=1}^M n_j} \; . \]
From this definition it follows that a probability is a unitless
quantity that takes on values between zero and one. Most importantly,
the sum of the probabilities of all possible events is one:
\[ \sum_{i=1}^M P_i = \sum_{i=1}^M \frac{n_i}{N} = \frac{1}{N} \sum_{i=1}^M n_i = \frac{N}{N} = 1\; , \]
i.e. the probability of getting any event is one.
\subsection{Probability distributions of categorical data}
For categorical data values (e.g. the faces of a die, whether as integer
numbers or as colors) a bin can be defined for each category $i$.
The histogram is normalized by the total number of measurements to
make it independent of the size of the data set
(\figref{diehistogramsfig}). After this normalization the height of
each histogram bar is an estimate of the probability $P_i$ of the
category $i$, i.e. of getting a data value in the $i$-th bin.
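A sketch of this normalization (with hypothetical die data):
\begin{lstlisting}
x = randi(6, 500, 1);  % hypothetical data: 500 rolls of a die
n = hist(x, 1:6);      % frequency n_i of each face i
P = n/sum(n);          % estimated probabilities P_i = n_i/N
bar(1:6, P);           % each close to 1/6, and sum(P) equals one
\end{lstlisting}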
\begin{exercise}{rollthedie.m}{}
Write a function that simulates rolling a die $n$ times.
\end{exercise}
\begin{exercise}{diehistograms.m}{}
Plot histograms for 20, 100, and 1000 times rolling a die. Use
\code[hist()]{hist(x)}, enforce six bins with
\code[hist()]{hist(x,6)}, or set useful bins yourself. Normalize the
histograms appropriately.
\end{exercise}
\subsection{Probability density functions}
In cases where we deal with data sets of measurements of a real
quantity (e.g. lengths of snakes, weights of elephants, times
between succeeding spikes) there is no natural bin width for computing
a histogram. In addition, the probability of measuring a data value that
equals exactly a specific real number like, e.g., 0.123456789 is zero, because
there are uncountably many real numbers.
We can only ask for the probability to get a measurement value in some
range. For example, we can ask for the probability $P(0<x<1)$ to
get a measurement between zero and one (\figref{pdfprobabilitiesfig}). More
generally, we want to know the probability $P(x_0<x<x_1)$ to obtain a
measurement between $x_0$ and $x_1$. If we define the width of the
range between $x_0$ and $x_1$ as $\Delta x = x_1 - x_0$ then the
probability can also be expressed as $P(x_0<x<x_0 + \Delta x)$.
In the limit to very small ranges $\Delta x$ the probability of
getting a measurement between $x_0$ and $x_0+\Delta x$ scales down to
zero with $\Delta x$:
\[ P(x_0<x<x_0+\Delta x) \approx p(x_0) \cdot \Delta x \; . \]
Here the quantity $p(x_0)$ is a so-called \enterm{probability
  density} that is non-negative and describes the
  distribution of the data values. The probability density is not a
  unitless probability with values between 0 and 1, but a quantity that
  can take on any non-negative real number and has as its unit the inverse of the
  unit of the data values --- hence the name ``density''.
\begin{figure}[t]
\includegraphics[width=1\textwidth]{pdfprobabilities}
\titlecaption{\label{pdfprobabilitiesfig} Probability of a
probability density.}{The probability of a data value $x$ between,
e.g., zero and one is the integral (red area) of the probability
density (blue).}
\end{figure}
The probability to get a value $x$ between $x_1$ and $x_2$ is
given by the integral of the probability density:
\[ P(x_1 < x < x_2) = \int\limits_{x_1}^{x_2} p(x) \, dx \; . \]
Because the probability to get any value $x$ is one, the integral of
the probability density over the whole real axis must be one:
\begin{equation}
\label{pdfnorm}
P(-\infty < x < \infty) = \int\limits_{-\infty}^{+\infty} p(x) \, dx = 1 \; .
\end{equation}
The function $p(x)$ that assigns a probability density to every $x$
is called \enterm{probability density function},
\enterm[pdf|see{probability density function}]{pdf}, or just
\enterm[density|see{probability density function}]{density}
(\determ{Wahrscheinlichkeitsdichtefunktion}). The well known
\enterm{normal distribution} (\determ{Normalverteilung}) is an example of a
probability density function
\[ p_g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
--- the \enterm{Gaussian distribution}
(\determ{Gau{\ss}sche Glockenkurve}) with mean $\mu$ and standard
deviation $\sigma$.
The factor in front of the exponential function ensures the normalization to
$\int p_g(x) \, dx = 1$, \eqnref{pdfnorm}.
\begin{exercise}{gaussianpdf.m}{gaussianpdf.out}
\begin{enumerate}
\item Plot the probability density of the normal distribution $p_g(x)$.
\item Compute the probability of getting a data value between zero and one
for the normal distribution with zero mean and standard deviation of one.
\item Draw 1000 normally distributed random numbers and use these
numbers to calculate the probability of getting a number between
zero and one.
\item Compute from the normal distribution $\int_{-\infty}^{+\infty} p(x) \, dx$.
\end{enumerate}
\end{exercise}
Histograms of real valued data depend on both the number of data
values \emph{and} the chosen bin width. As in the example with the die
(\figref{diehistogramsfig} left), the histogram bars get higher the
larger the data set is. Also, as the bin width is increased the
height of the histogram increases, because more data
values fall within each bin (\figref{pdfhistogramfig} left).
\begin{exercise}{gaussianbins.m}{}
Draw 100 random numbers from a Gaussian distribution and plot
histograms of the data with different bin widths. What do you
observe?
\end{exercise}
To turn such histograms into estimates of probability densities they
need to be normalized such that according to \eqnref{pdfnorm} their
integral equals one. While histograms of categorical data are
normalized such that their sum equals one, here we need to integrate
over the histogram. The integral is the area (not the height) of the
histogram bars. Each bar has the height $n_i$ and the width $\Delta
x$. The total area $A$ of the histogram is thus
\[ A = \sum_{i=1}^M ( n_i \cdot \Delta x ) = \Delta x \sum_{i=1}^M n_i = N \, \Delta x \]
and the normalized histogram has the heights
\[ p(x_i) = \frac{n_i}{A} = \frac{n_i}{\Delta x \sum_{i=1}^M n_i} =
   \frac{n_i}{N \Delta x} \; .\]
A histogram needs to be divided by both the sum of the frequencies
$n_i$ and the bin width $\Delta x$ to result in an estimate of the
corresponding probability density. Only then can the distribution be
compared with other distributions and in particular with theoretical
probability density functions like the one of the normal distribution
(\figref{pdfhistogramfig} right).
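A sketch of this normalization with hypothetical, normally distributed data:
\begin{lstlisting}
x = randn(100, 1);               % hypothetical data from a normal distribution
[n, c] = hist(x);                % frequencies n_i and bin centers
binwidth = c(2) - c(1);          % width Delta x of the bins
p = n/sum(n)/binwidth;           % normalize such that the integral equals one
bar(c, p);
hold on;
xx = -4:0.01:4;
plot(xx, exp(-0.5*xx.^2)/sqrt(2*pi), 'b');  % standard normal density for comparison
hold off;
\end{lstlisting}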
\begin{figure}[t]
\includegraphics[width=1\textwidth]{pdfhistogram}
\titlecaption{\label{pdfhistogramfig} Histograms with different bin
widths of normally distributed data.}{Left: The height of the
histogram bars strongly depends on the width of the bins. Right:
If the histogram is normalized such that its integral is one we
get an estimate of the probability density of the data values.
The normalized histograms are comparable with each other and can
also be compared to theoretical probability densities, like the
normal distributions (blue).}
\end{figure}
\begin{exercise}{gaussianbinsnorm.m}{}
Normalize the histogram of the previous exercise to a probability density.
\end{exercise}
\subsection{Kernel densities}
A problem of using histograms for estimating probability densities is
that they have hard bin edges. Depending on where the bin edges are
placed, a data value falls in one or the other bin. As a result the
shape of the histogram depends on the exact position of its bins
(\figref{kerneldensityfig} left).
\begin{figure}[t]
\includegraphics[width=1\textwidth]{kerneldensity}
\titlecaption{\label{kerneldensityfig} Kernel densities.}{Left: The
histogram-based estimation of the probability density is dependent
on the position of the bins. In the bottom plot the bins have
been shifted by half a bin width (here $\Delta x=0.4$) and as a
result details of the probability density look different. Look,
for example, at the height of the largest bin. Right: In contrast,
a kernel density is uniquely defined for a given kernel width
(here Gaussian kernels with standard deviation of $\sigma=0.2$).}
\end{figure}
To avoid this problem one can use so-called \enterm{kernel densities}
for estimating probability densities from data. Here every data point
is replaced by a kernel (a function with integral one, like for
example the Gaussian) that is moved exactly to the position
indicated by the data value. Then all the kernels of all the data
values are summed up, the sum is divided by the number of data values,
and we get an estimate of the probability density
(\figref{kerneldensityfig} right).
As for the histogram, where we need to choose a bin width, we need to
choose the width of the kernels appropriately.
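A sketch of a kernel density estimate with Gaussian kernels (the data and the kernel width are hypothetical):
\begin{lstlisting}
x = randn(100, 1);      % hypothetical data
sigma = 0.2;            % width (standard deviation) of the Gaussian kernels
xg = -4:0.01:4;         % grid on which the density is evaluated
kd = zeros(size(xg));
for i = 1:length(x)
    % add a Gaussian kernel centered on each data value:
    kd = kd + exp(-0.5*((xg - x(i))/sigma).^2)/(sqrt(2*pi)*sigma);
end
kd = kd/length(x);      % normalize by the number of data values
plot(xg, kd);
\end{lstlisting}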
\begin{exercise}{gaussiankerneldensity.m}{}
Construct and plot a kernel density of the data from the previous
two exercises.
\end{exercise}
\subsection{Cumulative distributions}
The \enterm{cumulative distribution function},
\enterm[cdf|see{cumulative distribution function}]{cdf}, or
\enterm[cumulative density function|see{cumulative distribution
function}]{cumulative density function}
(\determ{kumulative Verteilung}) is the integral over the probability density
up to any value $x$:
\[ F(x) = \int_{-\infty}^x p(x') \, dx' \; . \]
As such the cumulative distribution is a probability. It is the
probability of getting a value smaller than $x$.
For estimating the cumulative distribution from a set of data values
we do not need to rely on histograms or kernel densities. Instead, it
can be computed from the data directly without the need of a bin width
or width of a kernel. For a data set of $N$ data values $x_i$ the
probability of a data value smaller than $x$ is the number of data
points with values smaller than $x$ divided by $N$. If we sort the
data values then at each data value $x_i$ the number of data elements
smaller than $x_i$ is increased by one and the corresponding
probability of getting a value smaller than $x_i$ is increased by $1/N$.
That is, the cumulative distribution is
\[ F(x_i) = \frac{i}{N} \]
See \figref{cumulativefig} for an example.
The cumulative distribution tells you the fraction of data values that are
below a certain value and can therefore be used to evaluate significance
based on null hypotheses constructed from the data, as is done with bootstrap methods
(see chapter~\ref{bootstrapchapter}). Conversely, the values of quartiles
and percentiles can be determined from the inverse cumulative distribution function.
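A sketch of this estimate with hypothetical, normally distributed data:
\begin{lstlisting}
x = randn(200, 1);                  % hypothetical data set
xs = sort(x);                       % sort the data values
F = (1:length(xs))'/length(xs);     % F(x_i) = i/N
plot(xs, F);                        % estimated cumulative distribution
sum(x < -1.0)/length(x)             % fraction of data values below -1
\end{lstlisting}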
\begin{figure}[t]
\includegraphics[width=1\textwidth]{cumulative}
\titlecaption{\label{cumulativefig} Estimation of the cumulative
distribution.}{The cumulative distribution $F(x)$ estimated from
100 data values drawn from a normal distribution (red) in
comparison to the true cumulative distribution function computed
by numerically integrating the normal distribution function
(blue). From the cumulative distribution function one can read off
the probabilities of getting values smaller than a given value
(here: $P(x \le -1) \approx 0.15$). From the inverse cumulative
distribution the position of percentiles can be computed (here:
the median (50\,\% percentile) is as expected close to zero and
the third quartile (75\,\% percentile) at $x=0.68$).
\end{figure}
\begin{exercise}{cumulative.m}{cumulative.out}
Generate 200 normally distributed data values and construct an
estimate of the cumulative distribution function from this data.
Compare this estimate with an integral over the normal distribution.
Use the estimate to compute the probability of having data values
smaller than $-1$.
Use the estimate to compute the value of the 5\,\% percentile.
\end{exercise}
\newpage
\section{Correlations}
Until now we described properties of univariate data sets. In
bivariate or multivariate data sets where we have pairs or tuples of
data values (e.g. size and weight of elephants) we want to analyze
dependencies between the variables.
The \enterm{correlation coefficient}
\[ r_{x,y} = \frac{Cov(x,y)}{\sigma_x \sigma_y} = \frac{\langle
  (x-\langle x \rangle)(y-\langle y \rangle) \rangle}{\sqrt{\langle
  (x-\langle x \rangle)^2 \rangle} \sqrt{\langle (y-\langle y
  \rangle)^2 \rangle}} \]
quantifies linear relationships between two variables
\matlabfun{corr()}. The correlation coefficient is the
\determ{covariance} normalized by the standard deviations of the
single variables. Perfectly correlated variables result in a
correlation coefficient of $+1$, anti-correlated or negatively
correlated data in a correlation coefficient of $-1$, and uncorrelated
data in a correlation coefficient close to zero
(\figrefb{correlationfig}).
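A sketch of how the correlation coefficient follows from the covariance (the data are hypothetical; \code{corrcoef()} returns the same value):
\begin{lstlisting}
n = 1000;
x = randn(n, 1);                          % hypothetical data
y = 0.5*x + 0.5*randn(n, 1);              % y partially depends on x
cv = mean((x - mean(x)).*(y - mean(y)));  % covariance Cov(x,y)
r = cv/(std(x, 1)*std(y, 1));             % std(x, 1) normalizes by n
c = corrcoef(x, y);                       % same value in c(1, 2)
\end{lstlisting}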
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{correlation}
\titlecaption{\label{correlationfig} Correlations between pairs of
data.}{Shown are scatter plots of four data sets. Each point is a
single data pair. The correlation coefficient $r$ is given in the top
left of each plot.}
\end{figure}
\begin{exercise}{correlations.m}{}
Generate pairs of random numbers with four different correlations
(perfectly correlated, somewhat correlated, uncorrelated, negatively
correlated). Plot them into a scatter plot and compute their
correlation coefficient.
\end{exercise}
Note that non-linear dependencies between two variables are detected
only insufficiently or not at all by the correlation coefficient
(\figref{nonlincorrelationfig}).
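For example, a quadratic dependence of $y$ on $x$ results in a correlation coefficient close to zero despite the strong dependence (a sketch with hypothetical data):
\begin{lstlisting}
n = 1000;
x = randn(n, 1);                % hypothetical data
y = x.^2 + 0.1*randn(n, 1);     % quadratic dependence of y on x
c = corrcoef(x, y);             % c(1, 2) is close to zero
\end{lstlisting}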
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{nonlincorrelation}
\titlecaption{\label{nonlincorrelationfig} Correlations for
non-linear dependencies.}{The correlation coefficient detects
linear dependencies only. Both the quadratic dependency (left) and
the noise correlation (right), where the dispersion of the
$y$-values depends on the $x$-value, result in correlation
coefficients close to zero. $\xi$ denote normally distributed
random numbers.}
\end{figure}