%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Resampling methods}
\label{bootstrapchapter}
\exercisechapter{Resampling methods}
\entermde{Resampling-Methoden}{Resampling methods} are applied to
generate distributions of statistical measures via resampling of
existing samples. Resampling offers several advantages:
\begin{itemize}
\item Fewer assumptions (e.g. a measured sample does not need to be
normally distributed).
\item Increased precision as compared to classical methods. %such as?
\item General applicability: the resampling methods are very
similar for different statistics and there is no need to specialize
the method for specific statistical measures.
\end{itemize}
Resampling methods can be used both for estimating the precision of
estimated statistics (e.g. standard error of the mean, confidence
intervals) and for testing for significance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bootstrapping}
\begin{figure}[tp]
\includegraphics[width=0.8\textwidth]{2012-10-29_16-26-05_771}\\[2ex]
\includegraphics[width=0.8\textwidth]{2012-10-29_16-41-39_523}\\[2ex]
\includegraphics[width=0.8\textwidth]{2012-10-29_16-29-35_312}
\titlecaption{\label{statisticalpopulationfig} Why can't we measure
properties of the full population but only draw samples?}{}
\end{figure}
Reminder: in statistics we are interested in properties of a
\enterm{statistical population} (\determ{Grundgesamtheit}), e.g. the
average length of all pickles (\figref{statisticalpopulationfig}). But
we cannot measure the lengths of all pickles in the
population. Rather, we draw samples (\enterm{simple random sample}
\enterm[SRS|see{simple random sample}]{SRS}, \determ{Stichprobe}). We
then estimate a statistical measure of interest (e.g. the average
length of the pickles) within this sample and hope that it is a good
approximation of the unknown and immeasurable true average length of
the population (\entermde{Populationsparameter}{population
parameter}). We apply statistical methods to find out how precise
this approximation is.

If we could draw a large number of simple random samples we
could calculate the statistical measure of interest for each sample
and estimate its probability distribution using a histogram. This
distribution is called the \enterm{sampling distribution}
(\determ{Stichprobenverteilung},
\subfigref{bootstrapsamplingdistributionfig}{a}).
\begin{figure}[tp]
\includegraphics[height=0.2\textheight]{srs1}\\[2ex]
\includegraphics[height=0.2\textheight]{srs2}\\[2ex]
\includegraphics[height=0.2\textheight]{srs3}
\titlecaption{\label{bootstrapsamplingdistributionfig}Bootstrapping
the sampling distribution.}{(a) Simple random samples (SRS) are
drawn from a statistical population with an unknown population
parameter (e.g. the average $\mu$). The statistical measure (here
the estimate $\bar x$ of the mean) is calculated for each sample. The
measured values originate from the sampling distribution. Often
only a single random sample is drawn! (b) By applying assumptions
and theory one can infer the sampling distribution without
actually measuring it. (c) Alternatively, one can generate many
bootstrap-samples from the same SRS (resampling) and use these to
estimate the sampling distribution empirically. From Hesterberg et
al. 2003, Bootstrap Methods and Permutation Tests}
\end{figure}
Commonly, there will be only a single SRS. In such cases we make use
of certain assumptions (e.g. we assume a normal distribution) that
allow us to infer the precision of our estimation based on the
SRS. For example, the formula $\sigma/\sqrt{n}$ gives the standard
error of the mean, which is the standard deviation of the sampling
distribution of average values around the true mean of the population
(\subfigref{bootstrapsamplingdistributionfig}{b}).

Alternatively, we can use \enterm{bootstrapping}
(\determ[Bootstrap!Verfahren]{Bootstrapverfahren}) to generate new
samples from one set of measurements by means of resampling. For
each of these bootstrapped samples we compute the desired statistical
measure and estimate the distribution of these values
(\entermde{Bootstrap!Verteilung}{bootstrap distribution},
\subfigref{bootstrapsamplingdistributionfig}{c}). Interestingly, this
distribution is very similar to the sampling distribution regarding
its width. The only difference is that the bootstrapped values are
distributed around the measure of the original sample and not around
that of the statistical population. We can use the bootstrap
distribution to draw conclusions regarding the precision of our
estimate (e.g. standard errors and confidence intervals).

Bootstrapping methods generate bootstrapped samples from an SRS by
resampling. The bootstrapped samples are used to estimate the sampling
distribution of a statistical measure. The bootstrapped samples have
the same size as the original sample and are generated by randomly
drawing with replacement. That is, each value of the original sample
can occur once, multiple times, or not at all in a bootstrapped
sample. This can be implemented by generating random indices into the
data set using the \code{randi()} function.
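
A minimal sketch of this resampling step could look as follows (here
100 normally distributed values stand in for a measured SRS; all
variable names are chosen for illustration only):
\begin{lstlisting}
x = randn(100, 1);       % example data standing in for a measured SRS
n = length(x);           % size of the original sample
inx = randi(n, n, 1);    % n random indices into x, drawn with replacement
xb = x(inx);             % one bootstrapped sample of the same size as x
\end{lstlisting}
Repeating this step many times and computing the statistical measure
of interest for each bootstrapped sample yields the bootstrap
distribution.
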
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Bootstrap the standard error}
Bootstrapping can be nicely illustrated with the example of the
\enterm{standard error} of the mean (\determ{Standardfehler}). The
arithmetic mean is calculated for a simple random sample. The standard
error of the mean is the standard deviation of the expected
distribution of mean values around the mean of the statistical
population.
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{bootstrapsem}
\titlecaption{\label{bootstrapsemfig}Bootstrapping the standard
error of the mean.}{The --- usually unknown --- sampling
distribution of the mean is centered around the true mean of
the statistical population ($\mu=0$, red). The bootstrap
distribution of the means computed from many bootstrapped samples
has the same shape as the sampling distribution but is centered
around the mean of the SRS used for resampling. The standard
deviation of the bootstrap distribution (blue) is an estimator for
the standard error of the mean.}
\end{figure}
Via bootstrapping we generate a distribution of mean values
(\figref{bootstrapsemfig}) and the standard deviation of this
distribution is the standard error of the sample mean.
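
The core of such a bootstrap could be sketched like this (the data
vector \varcode{x} and the names \varcode{nboot} and
\varcode{bootmeans} are chosen for illustration only):
\begin{lstlisting}
x = randn(1000, 1);                % example data standing in for the SRS
nboot = 1000;                      % number of bootstrapped samples
n = length(x);
bootmeans = zeros(nboot, 1);
for b = 1:nboot
    xb = x(randi(n, n, 1));        % resample x with replacement
    bootmeans(b) = mean(xb);       % mean of the bootstrapped sample
end
sem = std(bootmeans);              % bootstrap estimate of the standard error
\end{lstlisting}
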
\begin{exercise}{bootstrapsem.m}{bootstrapsem.out}
Create the distribution of mean values from bootstrapped samples
resampled from a single SRS. Use this distribution to estimate the
standard error of the mean.
\begin{enumerate}
\item Draw 1000 normally distributed random numbers and calculate the
mean, the standard deviation, and the standard error
($\sigma/\sqrt{n}$).
\item Resample the data 1000 times (randomly draw with replacement) and calculate
the mean of each bootstrapped sample.
\item Plot a histogram of the respective distribution and calculate its mean and
standard deviation. Compare with the
original values based on the statistical population.
\end{enumerate}
\end{exercise}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Permutation tests}
Statistical tests ask for the probability that a measured value
originates from a null hypothesis. If this probability is smaller than
the desired \entermde{Signifikanz}{significance level}, the
\entermde{Nullhypothese}{null hypothesis} can be rejected.
Traditionally, such probabilities are taken from theoretical
distributions which have been derived based on some assumptions about
the data. For example, the data should be normally distributed. Given
some data one has to find an appropriate test that matches the
properties of the data. An alternative approach is to calculate the
probability density of the null hypothesis directly from the data
themselves. To do so, we resample the data from the SRS in accordance
with the null hypothesis. By such permutation operations we
destroy the feature of interest while conserving all other statistical
properties of the data.
\subsection{Significance of a difference in the mean}
Often we would like to know whether two data sets differ in their
mean. Whether the ears of foxes are larger in southern Europe compared
to the ones from Scandinavia, whether a drug decreases blood pressure
in humans, whether a sensory stimulus increases the firing rate of a
neuron, etc. The \entermde{Nullhypothese}{null hypothesis} is that
they do not differ in their means, i.e. that both data sets come from
the same distribution. But even if the two data sets come from the
same distribution, their sample means may nevertheless differ by
chance. We need to figure out how these differences of the means are
distributed. Only if the measured difference between the means is
significantly larger than the ones obtained by chance can we reject
the null hypothesis and consider the two data sets to differ
significantly in their means.

We can easily estimate the distribution of the null hypothesis by
putting the data of both data sets in one big bag. By merging the two
data sets we assume that all the data values come from the same
distribution. We then randomly separate the data values into two new
data sets. These random data sets contain data from both original data
sets and thus come from the same distribution. From these random data
sets we compute the difference of their sample means. This procedure
is repeated many, say one thousand, times and each time we get a value
for a difference of means. The distribution of these values is the
distribution of the null hypothesis. It is the distribution of
differences of mean values that we get by chance although the two data
sets come from the same distribution. For a one-sided test that checks
whether the measured difference of means is significantly larger than
zero at a significance level of 5\,\% we compute the value of the
95\,\% percentile of the null distribution. If the measured value is
larger, we can reject the null hypothesis and consider the two data
sets to differ significantly in their means.
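
A possible implementation of this procedure could be sketched like
this (the example data and all variable names, e.g. \varcode{nperm}
and \varcode{nulld}, are chosen for illustration only):
\begin{lstlisting}
% example data: two independent samples with population means differing by 0.7:
x = randn(200, 1) + 0.7;
y = randn(200, 1);
md = mean(x) - mean(y);                 % measured difference of the means
xy = [x(:); y(:)];                      % lump both data sets together
nx = length(x);
nperm = 1000;                           % number of permutations
nulld = zeros(nperm, 1);
for p = 1:nperm
    xyp = xy(randperm(length(xy)));     % shuffle the merged data
    nulld(p) = mean(xyp(1:nx)) - mean(xyp(nx+1:end));
end
d95 = quantile(nulld, 0.95);            % 95% percentile of the null distribution
issignificant = md > d95;               % one-sided test at the 5% level
\end{lstlisting}
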
By using the original data to estimate the null hypothesis, we make no
assumption about the properties of the data. We do not need to worry
about the data being normally distributed. We do not need to memorize
which test to use in which situation. And we better understand what we
are testing, because we design the test ourselves. Nowadays, computers
are powerful enough to iterate even ten thousand times over the data
to compute the distribution of the null hypothesis --- with only a few
lines of code. This is why the \entermde{Permutationstest}{permutation
test} is getting quite popular.
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{permuteaverage}
\titlecaption{\label{permuteaverage}Permutation test for differences
in means.}{We want to test whether two datasets
$\left\{x_i\right\}$ (red) and $\left\{y_i\right\}$ (blue) come
from different distributions by assessing the significance of the
difference in their sample means. The data sets were generated
with a difference in their population means of $d=0.7$. For
generating the distribution of the null hypothesis, i.e. the
distribution of differences in the means if the two data sets come
from the same distribution, we randomly select the same number of
samples from both data sets (top right). This is repeated many
times and results in the desired distribution of differences of
means (bottom). The measured difference is clearly beyond the
95\,\% percentile of this distribution and thus indicates a
significant difference between the distributions of the two
original data sets.}
\end{figure}
\begin{exercise}{meandiffsignificance.m}{meandiffsignificance.out}
Estimate the statistical significance of a difference in the mean of two data sets.
\vspace{-1ex}
\begin{enumerate}
\item Generate two independent data sets, $\left\{x_i\right\}$ and
$\left\{y_i\right\}$, of $n=200$ samples each, by drawing random
numbers from a normal distribution. Add 0.2 to all the $y_i$ samples
so that the population means differ by 0.2.
\item Calculate the difference between the sample means of the two data sets.
\item Estimate the distribution of the null hypothesis of no
difference of the means by generating new data sets with the same
number of samples randomly selected from both data sets. For this
lump the two data sets together into a single vector. Then permute
the order of the elements in this vector using the function
\varcode{randperm()}, split it into two data sets and calculate
the difference of their means. Repeat this 1000 times.
\item Read out the 95\,\% percentile of the resulting distribution
of the differences in the means (the null distribution) using the
\varcode{quantile()} function, and compare it with the difference of
means measured from the original data sets.
\end{enumerate}
\end{exercise}
\subsection{Significance of correlations}
Another nice example for the application of a
\entermde{Permutationstest}{permutation test} is testing for
significant \entermde[correlation]{Korrelation}{correlations}
(figure\,\ref{permutecorrelationfig}). Given are measured pairs of
data points $(x_i, y_i)$. By calculating the
\entermde[correlation!correlation
coefficient]{Korrelationskoeffizient}{correlation coefficient} we can
quantify how strongly $y$ depends on $x$. The correlation coefficient
alone, however, does not tell us whether the correlation differs
significantly from the non-zero correlations that we might get by
chance although there is no true correlation in the data. The
\entermde{Nullhypothese}{null
hypothesis} for such a situation is that $y$ does not depend on
$x$. In order to perform a permutation test, we need to destroy the
correlation between the data pairs by permuting the $(x_i, y_i)$
pairs, i.e. we rearrange the $x_i$ and $y_i$ values in a random
fashion. Generating many sets of random pairs and computing the
corresponding correlation coefficients yields a distribution of
correlation coefficients that result randomly from truly uncorrelated
data. By comparing the actually measured correlation coefficient with
this distribution we can directly assess the significance of the
correlation.
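
Such a test could be sketched as follows (the example data and the
variable names, e.g. \varcode{nullrho}, are chosen for illustration
only):
\begin{lstlisting}
% example data: 200 weakly correlated pairs:
x = randn(200, 1);
y = 0.2*x + randn(200, 1);
c = corrcoef(x, y);                     % 2x2 matrix of correlation coefficients
rho = c(1, 2);                          % measured correlation coefficient
nperm = 1000;                           % number of permutations
nullrho = zeros(nperm, 1);
for p = 1:nperm
    yp = y(randperm(length(y)));        % permute y to destroy the correlation
    cp = corrcoef(x, yp);
    nullrho(p) = cp(1, 2);              % correlation of uncorrelated data
end
rho95 = quantile(nullrho, 0.95);        % 95% percentile of the null distribution
issignificant = rho > rho95;
\end{lstlisting}
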
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{permutecorrelation}
\titlecaption{\label{permutecorrelationfig}Permutation test for
correlations.}{Let the correlation coefficient of a dataset with
200 samples be $\rho=0.21$ (top left). By shuffling the data pairs
we destroy any true correlation (top right). The resulting
distribution of the null hypothesis (bottom, yellow), obtained from
the correlation coefficients of permuted and therefore
uncorrelated datasets, is centered around zero. The measured
correlation coefficient is larger than the 95\,\% percentile of
the null hypothesis. The null hypothesis may thus be rejected and
the measured correlation is considered statistically significant.}
\end{figure}
\begin{exercise}{correlationsignificance.m}{correlationsignificance.out}
Estimate the statistical significance of a correlation coefficient.
\begin{enumerate}
\item Generate pairs of $(x_i, y_i)$ values. Randomly choose $x$-values
and calculate the respective $y$-values according to $y_i =0.2 \cdot x_i + u_i$
where $u_i$ is a random number drawn from a normal distribution.
\item Calculate the correlation coefficient.
\item Estimate the distribution of the null hypothesis by generating
uncorrelated pairs. For this, permute the $x$- and $y$-values
\matlabfun{randperm()} 1000 times and calculate the correlation
coefficient for each permutation.
\item Read out the 95\,\% percentile from the resulting distribution
of the null hypothesis using the \varcode{quantile()} function and
compare it with the correlation coefficient computed from the
original data.
\end{enumerate}
\end{exercise}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions