%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Resampling methods}
\label{bootstrapchapter}
\exercisechapter{Resampling methods}
\entermde{Resampling-Methoden}{Resampling methods} are applied to
generate distributions of statistical measures via resampling of
existing samples. Resampling offers several advantages:
\begin{itemize}
\item Fewer assumptions (e.g. a measured sample does not need to be
normally distributed).
\item Increased precision as compared to classical methods. %such as?
\item General applicability: the resampling methods are very
similar for different statistics and there is no need to specialize
the method for specific statistical measures.
\end{itemize}
Resampling methods can be used both for estimating the precision of
estimated statistics (e.g. standard error of the mean, confidence
intervals) and for testing for significance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bootstrapping}
\begin{figure}[tp]
\includegraphics[width=0.8\textwidth]{2012-10-29_16-26-05_771}\\[2ex]
\includegraphics[width=0.8\textwidth]{2012-10-29_16-41-39_523}\\[2ex]
\includegraphics[width=0.8\textwidth]{2012-10-29_16-29-35_312}
\titlecaption{\label{statisticalpopulationfig} Why can't we measure
properties of the full population but only draw samples?}{}
\end{figure}
Reminder: in statistics we are interested in properties of a
\enterm{statistical population} (\determ{Grundgesamtheit}), e.g. the
average length of all pickles (\figref{statisticalpopulationfig}). But
we cannot measure the lengths of all pickles in the
population. Rather, we draw samples (\enterm{simple random sample}
\enterm[SRS|see{simple random sample}]{SRS}, \determ{Stichprobe}). We
then estimate a statistical measure of interest (e.g. the average
length of the pickles) within this sample and hope that it is a good
approximation of the unknown and immeasurable true average length of
the population (\entermde{Populationsparameter}{population
parameter}). We apply statistical methods to find out how precise
this approximation is.

If we could draw a large number of simple random samples we
could calculate the statistical measure of interest for each sample
and estimate its probability distribution using a histogram. This
distribution is called the \enterm{sampling distribution}
(\determ{Stichprobenverteilung},
\subfigref{bootstrapsamplingdistributionfig}{a}).
\begin{figure}[tp]
\includegraphics[height=0.2\textheight]{srs1}\\[2ex]
\includegraphics[height=0.2\textheight]{srs2}\\[2ex]
\includegraphics[height=0.2\textheight]{srs3}
\titlecaption{\label{bootstrapsamplingdistributionfig}Bootstrapping
the sampling distribution.}{(a) Simple random samples (SRS) are
drawn from a statistical population with an unknown population
parameter (e.g. the average $\mu$). The statistical measure (here
the estimate $\bar x$ of the mean) is calculated for each sample. The
measured values originate from the sampling distribution. Often
only a single random sample is drawn! (b) By applying assumptions
and theory one can infer the sampling distribution without
actually measuring it. (c) Alternatively, one can generate many
bootstrap-samples from the same SRS (resampling) and use these to
estimate the sampling distribution empirically. From Hesterberg et
al. 2003, Bootstrap Methods and Permutation Tests}
\end{figure}
Commonly, there will be only a single SRS. In such cases we make use
of certain assumptions (e.g. we assume a normal distribution) that
allow us to infer the precision of our estimation based on the
SRS. For example, the formula $\sigma/\sqrt{n}$ gives the standard
error of the mean, which is the standard deviation of the sampling
distribution of average values around the true mean of the population
(\subfigref{bootstrapsamplingdistributionfig}{b}).

Alternatively, we can use \enterm{bootstrapping}
(\determ[Bootstrap!Verfahren]{Bootstrapverfahren}) to generate new
samples from one set of measurements by means of resampling. For
each of these bootstrapped samples we compute the desired statistical
measure and estimate the distribution of these values
(\entermde{Bootstrap!Verteilung}{bootstrap distribution},
\subfigref{bootstrapsamplingdistributionfig}{c}). Interestingly, this
distribution is very similar to the sampling distribution regarding
its width. The only difference is that the bootstrapped values are
distributed around the measure of the original sample and not around
that of the statistical population. We can use the bootstrap
distribution to draw conclusions regarding the precision of our
estimate (e.g. standard errors and confidence intervals).

Bootstrapping methods generate bootstrapped samples from an SRS by
resampling. The bootstrapped samples are used to estimate the sampling
distribution of a statistical measure. The bootstrapped samples have
the same size as the original sample and are generated by randomly
drawing with replacement. That is, each value of the original sample
can occur once, multiple times, or not at all in a bootstrapped
sample. This can be implemented by generating random indices into the
data set using the \code{randi()} function.
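
A minimal sketch of this resampling step could look as follows (here
100 normally distributed values stand in for a measured SRS; all
variable names are chosen for illustration only):
\begin{lstlisting}
x = randn(100, 1);       % example data standing in for a measured SRS
n = length(x);           % size of the original sample
inx = randi(n, n, 1);    % n random indices into x, drawn with replacement
xb = x(inx);             % one bootstrapped sample of the same size as x
\end{lstlisting}
Repeating this step many times and computing the statistical measure
of interest for each bootstrapped sample yields the bootstrap
distribution.
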
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Bootstrap the standard error}
Bootstrapping can be nicely illustrated with the example of the
\enterm{standard error} of the mean (\determ{Standardfehler}). The
arithmetic mean is calculated for a simple random sample. The standard
error of the mean is the standard deviation of the expected
distribution of mean values around the mean of the statistical
population.
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{bootstrapsem}
\titlecaption{\label{bootstrapsemfig}Bootstrapping the standard
error of the mean.}{The --- usually unknown --- sampling
distribution of the mean is centered around the true mean of
the statistical population ($\mu=0$, red). The bootstrap
distribution of the means computed from many bootstrapped samples
has the same shape as the sampling distribution but is centered
around the mean of the SRS used for resampling. The standard
deviation of the bootstrap distribution (blue) is an estimator for
the standard error of the mean.}
\end{figure}
Via bootstrapping we generate a distribution of mean values
(\figref{bootstrapsemfig}) and the standard deviation of this
distribution is the standard error of the sample mean.
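
The core of such a bootstrap could be sketched like this (the data
vector \varcode{x} and the names \varcode{nboot} and
\varcode{bootmeans} are chosen for illustration only):
\begin{lstlisting}
x = randn(1000, 1);                % example data standing in for the SRS
nboot = 1000;                      % number of bootstrapped samples
n = length(x);
bootmeans = zeros(nboot, 1);
for b = 1:nboot
    xb = x(randi(n, n, 1));        % resample x with replacement
    bootmeans(b) = mean(xb);       % mean of the bootstrapped sample
end
sem = std(bootmeans);              % bootstrap estimate of the standard error
\end{lstlisting}
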
\begin{exercise}{bootstrapsem.m}{bootstrapsem.out}
Create the distribution of mean values from bootstrapped samples
resampled from a single SRS. Use this distribution to estimate the
standard error of the mean.
\begin{enumerate}
\item Draw 1000 normally distributed random numbers and calculate the
mean, the standard deviation, and the standard error
($\sigma/\sqrt{n}$).
\item Resample the data 1000 times (randomly draw with replacement) and calculate
the mean of each bootstrapped sample.
\item Plot a histogram of the respective distribution and calculate its mean and
standard deviation. Compare with the
original values based on the statistical population.
\end{enumerate}
\end{exercise}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Permutation tests}
Statistical tests ask for the probability that a measured value
originates from a null hypothesis. If this probability is smaller than
the desired \entermde{Signifikanz}{significance level}, the
\entermde{Nullhypothese}{null hypothesis} can be rejected.
Traditionally, such probabilities are taken from theoretical
distributions which have been derived based on some assumptions about
the data. For example, the data should be normally distributed. Given
some data one has to find an appropriate test that matches the
properties of the data. An alternative approach is to calculate the
probability density of the null hypothesis directly from the data
themselves. To do so, we resample the data from the SRS in accordance
with the null hypothesis. By such permutation operations we
destroy the feature of interest while conserving all other statistical
properties of the data.
\subsection{Significance of a difference in the mean}
Often we would like to know whether two data sets differ in their
mean. Whether the ears of foxes are larger in southern Europe compared
to the ones from Scandinavia, whether a drug decreases blood pressure
in humans, whether a sensory stimulus increases the firing rate of a
neuron, etc. The \entermde{Nullhypothese}{null hypothesis} is that
they do not differ in their means, i.e. that both data sets come from
the same distribution. But even if the two data sets come from the
same distribution, their sample means may nevertheless differ by
chance. We need to figure out how these differences of the means are
distributed. Only if the measured difference between the means is
significantly larger than the ones obtained by chance can we reject
the null hypothesis and consider the two data sets to differ
significantly in their means.

We can easily estimate the distribution of the null hypothesis by
putting the data of both data sets in one big bag. By merging the two
data sets we assume that all the data values come from the same
distribution. We then randomly separate the data values into two new
data sets. These random data sets contain data from both original data
sets and thus come from the same distribution. From these random data
sets we compute the difference of their sample means. This procedure
is repeated many, say one thousand, times and each time we get a value
for a difference of means. The distribution of these values is the
distribution of the null hypothesis. It is the distribution of
differences of mean values that we get by chance although the two data
sets come from the same distribution. For a one-sided test that checks
whether the measured difference of means is significantly larger than
zero at a significance level of 5\,\% we compute the value of the
95\,\% percentile of the null distribution. If the measured value is
larger, we can reject the null hypothesis and consider the two data
sets to differ significantly in their means.
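
A possible implementation of this procedure could be sketched like
this (the example data and all variable names, e.g. \varcode{nperm}
and \varcode{nulld}, are chosen for illustration only):
\begin{lstlisting}
% example data: two independent samples with population means differing by 0.7:
x = randn(200, 1) + 0.7;
y = randn(200, 1);
md = mean(x) - mean(y);                 % measured difference of the means
xy = [x(:); y(:)];                      % lump both data sets together
nx = length(x);
nperm = 1000;                           % number of permutations
nulld = zeros(nperm, 1);
for p = 1:nperm
    xyp = xy(randperm(length(xy)));     % shuffle the merged data
    nulld(p) = mean(xyp(1:nx)) - mean(xyp(nx+1:end));
end
d95 = quantile(nulld, 0.95);            % 95% percentile of the null distribution
issignificant = md > d95;               % one-sided test at the 5% level
\end{lstlisting}
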
By using the original data to estimate the null hypothesis, we make no
assumption about the properties of the data. We do not need to worry
about the data being normally distributed. We do not need to memorize
which test to use in which situation. And we better understand what we
are testing, because we design the test ourselves. Nowadays, computers
are powerful enough to iterate even ten thousand times over the data
to compute the distribution of the null hypothesis --- with only a few
lines of code. This is why the \entermde{Permutationstest}{permutation
test} is getting quite popular.
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{permuteaverage}
\titlecaption{\label{permuteaverage}Permutation test for differences
in means.}{We want to test whether two datasets
$\left\{x_i\right\}$ (red) and $\left\{y_i\right\}$ (blue) come
from different distributions by assessing the significance of the
difference in their sample means. The data sets were generated
with a difference in their population means of $d=0.7$. For
generating the distribution of the null hypothesis, i.e. the
distribution of differences in the means if the two data sets come
from the same distribution, we randomly select the same number of
samples from both data sets (top right). This is repeated many
times and results in the desired distribution of differences of
means (bottom). The measured difference is clearly beyond the
95\,\% percentile of this distribution and thus indicates a
significant difference between the distributions of the two
original data sets.}
\end{figure}
\begin{exercise}{meandiffsignificance.m}{meandiffsignificance.out}
Estimate the statistical significance of a difference in the mean of two data sets.
\vspace{-1ex}
\begin{enumerate}
\item Generate two independent data sets, $\left\{x_i\right\}$ and
$\left\{y_i\right\}$, of $n=200$ samples each, by drawing random
numbers from a normal distribution. Add 0.2 to all the $y_i$ samples
so that the population means differ by 0.2.
\item Calculate the difference between the sample means of the two data sets.
\item Estimate the distribution of the null hypothesis of no
difference of the means by generating new data sets with the same
number of samples randomly selected from both data sets. For this
lump the two data sets together into a single vector. Then permute
the order of the elements in this vector using the function
\varcode{randperm()}, split it into two data sets and calculate
the difference of their means. Repeat this 1000 times.
\item Read out the 95\,\% percentile of the resulting distribution
of the differences in the means (the null distribution) using the
\varcode{quantile()} function, and compare it with the difference of
means measured from the original data sets.
\end{enumerate}
\end{exercise}
\subsection{Significance of correlations}
Another nice example for the application of a
\entermde{Permutationstest}{permutation test} is testing for
significant \entermde[correlation]{Korrelation}{correlations}
(figure\,\ref{permutecorrelationfig}). Given are measured pairs of
data points $(x_i, y_i)$. By calculating the
\entermde[correlation!correlation
coefficient]{Korrelationskoeffizient}{correlation coefficient} we can
quantify how strongly $y$ depends on $x$. The correlation coefficient
alone, however, does not tell us whether the correlation differs
significantly from the non-zero correlations that we might get by
chance although there is no true correlation in the data. The
\entermde{Nullhypothese}{null
hypothesis} for such a situation is that $y$ does not depend on
$x$. In order to perform a permutation test, we need to destroy the
correlation between the data pairs by permuting the $(x_i, y_i)$
pairs, i.e. we rearrange the $x_i$ and $y_i$ values in a random
fashion. Generating many sets of random pairs and computing the
corresponding correlation coefficients yields a distribution of
correlation coefficients that result randomly from truly uncorrelated
data. By comparing the actually measured correlation coefficient with
this distribution we can directly assess the significance of the
correlation.
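
Such a test could be sketched as follows (the example data and the
variable names, e.g. \varcode{nullrho}, are chosen for illustration
only):
\begin{lstlisting}
% example data: 200 weakly correlated pairs:
x = randn(200, 1);
y = 0.2*x + randn(200, 1);
c = corrcoef(x, y);                     % 2x2 matrix of correlation coefficients
rho = c(1, 2);                          % measured correlation coefficient
nperm = 1000;                           % number of permutations
nullrho = zeros(nperm, 1);
for p = 1:nperm
    yp = y(randperm(length(y)));        % permute y to destroy the correlation
    cp = corrcoef(x, yp);
    nullrho(p) = cp(1, 2);              % correlation of uncorrelated data
end
rho95 = quantile(nullrho, 0.95);        % 95% percentile of the null distribution
issignificant = rho > rho95;
\end{lstlisting}
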
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{permutecorrelation}
\titlecaption{\label{permutecorrelationfig}Permutation test for
correlations.}{Let the correlation coefficient of a dataset with
200 samples be $\rho=0.21$ (top left). By shuffling the data pairs
we destroy any true correlation (top right). The resulting
distribution of the null hypothesis (bottom, yellow), obtained from
the correlation coefficients of permuted and therefore
uncorrelated datasets, is centered around zero. The measured
correlation coefficient is larger than the 95\,\% percentile of
the null hypothesis. The null hypothesis may thus be rejected and
the measured correlation is considered statistically significant.}
\end{figure}
\begin{exercise}{correlationsignificance.m}{correlationsignificance.out}
Estimate the statistical significance of a correlation coefficient.
\begin{enumerate}
\item Generate pairs of $(x_i, y_i)$ values. Randomly choose $x$-values
and calculate the respective $y$-values according to $y_i =0.2 \cdot x_i + u_i$
where $u_i$ is a random number drawn from a normal distribution.
\item Calculate the correlation coefficient.
\item Estimate the distribution of the null hypothesis by generating
uncorrelated pairs. For this, permute the $x$- and $y$-values
\matlabfun{randperm()} 1000 times and calculate the correlation
coefficient for each permutation.
\item Read out the 95\,\% percentile from the resulting distribution
of the null hypothesis using the \varcode{quantile()} function and
compare it with the correlation coefficient computed from the
original data.
\end{enumerate}
\end{exercise}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions