
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Bootstrap methods}
\label{bootstrapchapter}
\exercisechapter{Bootstrap methods}

Bootstrapping methods are applied to create distributions of
statistical measures via resampling of a sample. Bootstrapping offers several
advantages:
\begin{itemize}
\item Fewer assumptions (e.g. a measured sample does not need to be
  normally distributed).
\item Often increased precision compared to classical methods, in
  particular when the assumptions of the classical methods are not
  met. %such as?
\item General applicability: the bootstrapping methods are very
  similar for different statistics and there is no need to specialize
  the method to specific statistical measures.
\end{itemize}

\begin{figure}[tp]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-26-05_771}\\[2ex]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-41-39_523}\\[2ex]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-29-35_312}
  \titlecaption{\label{statisticalpopulationfig} Why can't we measure
    properties of the full population but only draw samples?}{}
\end{figure}

Reminder: in statistics we are interested in properties of a
\enterm{statistical population} (\determ{Grundgesamtheit}), e.g. the
average length of all pickles (\figref{statisticalpopulationfig}). But
we cannot measure the lengths of all pickles in the
population. Rather, we draw samples (\enterm{simple random sample}
\enterm[SRS|see{simple random sample}]{SRS}, \determ{Stichprobe}). We
then estimate a statistical measure of interest (e.g. the average
length of the pickles) within this sample and hope that it is a good
approximation of the unknown and immeasurable true average length of
the population (\entermde{Populationsparameter}{population
  parameter}). We apply statistical methods to find out how precise
this approximation is.

If we could draw a large number of simple random samples we
could calculate the statistical measure of interest for each sample
and estimate its probability distribution using a histogram. This
distribution is called the \enterm{sampling distribution}
(\determ{Stichprobenverteilung},
\subfigref{bootstrapsamplingdistributionfig}{a}).

\begin{figure}[tp]
  \includegraphics[height=0.2\textheight]{srs1}\\[2ex]
  \includegraphics[height=0.2\textheight]{srs2}\\[2ex]
  \includegraphics[height=0.2\textheight]{srs3}
  \titlecaption{\label{bootstrapsamplingdistributionfig}Bootstrapping
    the sampling distribution.}{(a) Simple random samples (SRS) are
    drawn from a statistical population with an unknown population
    parameter (e.g. the average $\mu$). The statistical measure (the
    estimate $\bar x$) is calculated for each sample. The
    measured values originate from the sampling distribution. Often
    only a single random sample is drawn! (b) By applying assumptions
    and theory one can infer the sampling distribution without
    actually measuring it. (c) Alternatively, one can generate many
    bootstrapped samples from the same SRS (resampling) and use these to
    estimate the sampling distribution empirically. From Hesterberg et
    al. 2003, Bootstrap Methods and Permutation Tests.}
\end{figure}

Commonly, there will be only a single SRS. In such cases we make use
of certain assumptions (e.g. we assume a normal distribution) that
allow us to infer the precision of our estimation based on the
SRS. For example, the formula $\sigma/\sqrt{n}$ gives the standard
error of the mean, which is the standard deviation of the sampling
distribution of average values around the true mean of the population
(\subfigref{bootstrapsamplingdistributionfig}{b}).
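
As a worked example with made-up numbers (they are assumptions for
illustration and not taken from any measurement): for pickle lengths
with a standard deviation of $\sigma = 2$\,cm and a sample of $n =
100$ pickles, the standard error of the mean would be
\[ \frac{\sigma}{\sqrt{n}} = \frac{2\,\mathrm{cm}}{\sqrt{100}} = 0.2\,\mathrm{cm} . \]
Quadrupling the sample size to $n = 400$ would only halve the standard
error to $0.1$\,cm.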

Alternatively, we can use \enterm{bootstrapping}
(\determ[Bootstrap!Verfahren]{Bootstrapverfahren}) to generate new
samples from one set of measurements
(\entermde{Resampling}{resampling}). From each of these bootstrapped
samples we compute the desired statistical measure and estimate its
distribution (\entermde{Bootstrap!Verteilung}{bootstrap distribution},
\subfigref{bootstrapsamplingdistributionfig}{c}). Interestingly, the
width of this distribution is very similar to that of the sampling
distribution. The only difference is that the bootstrapped values are
distributed around the measure of the original sample and not around
the one of the statistical population. We can use the bootstrap
distribution to draw conclusions regarding the precision of our
estimation (e.g. standard errors and confidence intervals).

Bootstrapping methods create bootstrapped samples from an SRS by
resampling. The bootstrapped samples are used to estimate the sampling
distribution of a statistical measure. The bootstrapped samples have
the same size as the original sample and are created by randomly
drawing with replacement. That is, each value of the original sample
can occur once, multiple times, or not at all in a bootstrapped
sample. This can be implemented by generating random indices into the
data set using the \code{randi()} function.
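
A single resampling step might look like the following minimal
sketch. The data values and the variable names \code{x}, \code{inx},
and \code{xb} are made up for illustration only.
\begin{verbatim}
% sketch: draw one bootstrapped sample from a small data vector
x = [2.1, 3.4, 1.9, 4.2, 2.8];  % hypothetical measurements
n = length(x);                  % size of the original sample
inx = randi(n, 1, n);           % n random indices from 1 to n,
                                % drawn with replacement
xb = x(inx);                    % bootstrapped sample, same size as x
\end{verbatim}
Because the indices are drawn with replacement, some values of
\code{x} typically appear several times in \code{xb} and others are
missing.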


\section{Bootstrap of the standard error}

Bootstrapping can be nicely illustrated using the example of the
\enterm{standard error} of the mean (\determ{Standardfehler}). The
arithmetic mean is calculated for a simple random sample. The standard
error of the mean is the standard deviation of the expected
distribution of mean values around the mean of the statistical
population.

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{bootstrapsem}
  \titlecaption{\label{bootstrapsemfig}Bootstrapping the standard
    error of the mean.}{The --- usually unknown --- sampling
    distribution of the mean is distributed around the true mean of
    the statistical population ($\mu=0$, red). The bootstrap
    distribution of the means computed from many bootstrapped samples
    has the same shape as the sampling distribution but is centered
    around the mean of the SRS used for resampling. The standard
    deviation of the bootstrap distribution (blue) is an estimator for
    the standard error of the mean.}
\end{figure}

Via bootstrapping we create a distribution of the mean values
(\figref{bootstrapsemfig}). The standard deviation of this
distribution is an estimate of the standard error of the mean.
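
Written out as a formula, this could read as follows. The notation is
introduced here for illustration only: $B$ denotes the number of
bootstrapped samples and $\bar x^*_b$ the mean of the $b$-th
bootstrapped sample.
\[ \mathrm{SE}_{\mathrm{boot}} = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B}
    \left( \bar x^*_b - \langle \bar x^* \rangle \right)^2}
  \quad \mathrm{with} \quad
  \langle \bar x^* \rangle = \frac{1}{B} \sum_{b=1}^{B} \bar x^*_b \]
This is simply the sample standard deviation of the $B$ bootstrapped
mean values.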

\pagebreak[4]
\begin{exercise}{bootstrapsem.m}{bootstrapsem.out}
  Create the distribution of mean values from bootstrapped samples
  resampled from a single SRS. Use this distribution to estimate the
  standard error of the mean.
  \begin{enumerate}
  \item Draw 1000 normally distributed random numbers and calculate the
    mean, the standard deviation, and the standard error
    ($\sigma/\sqrt{n}$).
  \item Resample the data 1000 times (randomly draw with replacement)
    and calculate the mean of each bootstrapped sample.
  \item Plot a histogram of the resulting distribution and calculate
    its mean and standard deviation. Compare them with the original
    values from the first step.
  \end{enumerate}
\end{exercise}


\section{Permutation tests}
Statistical tests ask for the probability that a measured value
originates from a null hypothesis. If this probability is smaller than
the desired \entermde{Signifikanz}{significance level}, the
\entermde{Nullhypothese}{null hypothesis} may be rejected.

Traditionally, such probabilities are taken from theoretical
distributions which are based on assumptions about the data.  Thus the
applied statistical test has to be appropriate for the type of
data. An alternative approach is to calculate the probability density
of the null hypothesis directly from the data itself. To do this, we
need to resample the data according to the null hypothesis from the
SRS. By such permutation operations we destroy the feature of interest
while we conserve all other statistical properties of the data.

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{permutecorrelation}
  \titlecaption{\label{permutecorrelationfig}Permutation test for
    correlations.}{Let the correlation coefficient of a dataset with
    200 samples be $\rho=0.21$. The distribution of the null
    hypothesis (yellow), obtained from the correlation coefficients of
    permuted and therefore uncorrelated datasets, is centered around
    zero. The measured correlation coefficient is larger than the
    95\,\% percentile of the null hypothesis. The null hypothesis may
    thus be rejected and the measured correlation is considered
    statistically significant.}
\end{figure}

A good example of the application of a
\entermde{Permutationstest}{permutation test} is the statistical
assessment of \entermde[correlation]{Korrelation}{correlations}. Given
are measured pairs of data points $(x_i, y_i)$. By calculating the
\entermde[correlation!correlation
coefficient]{Korrelationskoeffizient}{correlation
  coefficient} we can quantify how strongly $y$ depends on $x$. The
correlation coefficient alone, however, does not tell whether the
correlation is significantly different from a random correlation. The
\entermde{Nullhypothese}{null hypothesis} for such a situation is that
$y$ does not depend on $x$. In order to perform a permutation test, we
need to destroy the correlation by permuting the $(x_i, y_i)$ pairs,
i.e. we randomly reassign the $y_i$ values to the $x_i$ values.
Generating many such sets of random pairs and computing the
resulting correlation coefficients yields the distribution of
correlation coefficients that arise by chance from uncorrelated
data. By comparing the actually measured correlation coefficient with
this distribution we can directly assess the significance of the
correlation (\figref{permutecorrelationfig}).
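
A single permutation step could be sketched as follows. The data
values and all variable names are made up for illustration. Only the
$y$-values are shuffled here, which is already sufficient to break the
pairing with the $x$-values.
\begin{verbatim}
% sketch: one permutation of hypothetical paired data
x = [0.5, 1.2, 2.3, 3.1, 4.8, 5.0];  % made-up x values
y = [1.1, 0.7, 2.9, 2.2, 4.1, 5.3];  % made-up y values
c = corrcoef(x, y);                  % 2x2 correlation matrix
r = c(1, 2);                         % correlation of original pairs
yp = y(randperm(length(y)));         % shuffle the y values only
cp = corrcoef(x, yp);
rp = cp(1, 2);                       % one value under the null hypothesis
\end{verbatim}
The shuffled vector \code{yp} contains exactly the same values as
\code{y}, so its mean and standard deviation are unchanged. Only the
relation to \code{x} is destroyed. Repeating the shuffling many times
and collecting the \code{rp} values yields the distribution of the
null hypothesis.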

\begin{exercise}{correlationsignificance.m}{correlationsignificance.out}
Estimate the statistical significance of a correlation coefficient.
\begin{enumerate}
\item Create pairs of $(x_i, y_i)$ values. Randomly choose $x$-values
  and calculate the respective $y$-values according to $y_i =0.2 \cdot x_i + u_i$
  where $u_i$ is a random number drawn from a normal distribution.
\item Calculate the correlation coefficient.
\item Generate the distribution of the null hypothesis by generating
  uncorrelated pairs. For this, permute the $x$- and $y$-values with
  \matlabfun{randperm()} 1000 times and calculate the correlation
  coefficient for each permutation.
\item Read out the 95\,\% percentile from the resulting distribution
  of the null hypothesis and compare it with the correlation
  coefficient computed from the original data.
\end{enumerate}
\end{exercise}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions