%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Resampling methods}
\label{bootstrapchapter}
\exercisechapter{Resampling methods}

\entermde{Resampling-Methoden}{Resampling methods} are applied to
generate distributions of statistical measures via resampling of
existing samples. Resampling offers several advantages:
\begin{itemize}
\item Fewer assumptions (e.g. a measured sample does not need to be
  normally distributed).
\item Increased precision as compared to classical methods. %such as?
\item General applicability: the resampling methods are very similar
  for different statistics and there is no need to specialize the
  method to specific statistical measures.
\end{itemize}
Resampling methods can be used both for estimating the precision of
estimated statistics (e.g. standard error of the mean, confidence
intervals) and for testing for significance.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Bootstrapping}

\begin{figure}[tp]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-26-05_771}\\[2ex]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-41-39_523}\\[2ex]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-29-35_312}
  \titlecaption{\label{statisticalpopulationfig}Why can't we measure
    properties of the full population but only draw samples?}{}
\end{figure}

Reminder: in statistics we are interested in properties of a
\enterm{statistical population} (\determ{Grundgesamtheit}), e.g. the
average length of all pickles (\figref{statisticalpopulationfig}). But
we cannot measure the lengths of all pickles in the population.
Rather, we draw samples (\enterm{simple random sample},
\enterm[SRS|see{simple random sample}]{SRS}, \determ{Stichprobe}). We
then estimate a statistical measure of interest (e.g. the average
length of the pickles) within this sample and hope that it is a good
approximation of the unknown and immeasurable true average length of
the population (\entermde{Populationsparameter}{population
  parameter}). We apply statistical methods to find out how precise
this approximation is.

If we could draw a large number of simple random samples, we could
calculate the statistical measure of interest for each sample and
estimate its probability distribution using a histogram. This
distribution is called the \enterm{sampling distribution}
(\determ{Stichprobenverteilung},
\subfigref{bootstrapsamplingdistributionfig}{a}).

\begin{figure}[tp]
  \includegraphics[height=0.2\textheight]{srs1}\\[2ex]
  \includegraphics[height=0.2\textheight]{srs2}\\[2ex]
  \includegraphics[height=0.2\textheight]{srs3}
  \titlecaption{\label{bootstrapsamplingdistributionfig}Bootstrapping
    the sampling distribution.}{(a) Simple random samples (SRS) are
    drawn from a statistical population with an unknown population
    parameter (e.g. the average $\mu$). The statistical measure (the
    estimate $\bar x$) is calculated for each sample. The measured
    values originate from the sampling distribution. Often only a
    single random sample is drawn! (b) By applying assumptions and
    theory one can derive the sampling distribution without actually
    measuring it. (c) Alternatively, one can generate many bootstrap
    samples from the same SRS (resampling) and use these to estimate
    the sampling distribution empirically. From Hesterberg et
    al. 2003, Bootstrap Methods and Permutation Tests.}
\end{figure}

Commonly, there will be only a single SRS. In such cases we make use
of certain assumptions (e.g. we assume a normal distribution) that
allow us to infer the precision of our estimation based on the
SRS. For example, the formula $\sigma/\sqrt{n}$ gives the standard
error of the mean, which is the standard deviation of the sampling
distribution of average values around the true mean of the population
(\subfigref{bootstrapsamplingdistributionfig}{b}).

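For a sample stored in a vector \varcode{x}, this classical estimate
is a single line of code (a minimal sketch with made-up data):

\begin{lstlisting}
x = randn(100, 1);             % a simple random sample
sem = std(x)/sqrt(length(x));  % classical standard error of the mean
\end{lstlisting}
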
Alternatively, we can use \enterm{bootstrapping}
(\determ[Bootstrap!Verfahren]{Bootstrapverfahren}) to generate new
samples from one set of measurements by means of resampling. From
these bootstrapped samples we compute the desired statistical measure
and estimate its distribution
(\entermde{Bootstrap!Verteilung}{bootstrap distribution},
\subfigref{bootstrapsamplingdistributionfig}{c}). Interestingly, this
distribution is very similar to the sampling distribution regarding
its width. The only difference is that the bootstrapped values are
distributed around the measure of the original sample and not around
the one of the statistical population. We can use the bootstrap
distribution to draw conclusions regarding the precision of our
estimation (e.g. standard errors and confidence intervals).

Bootstrapping methods generate bootstrapped samples from an SRS by
resampling. The bootstrapped samples are used to estimate the
sampling distribution of a statistical measure. The bootstrapped
samples have the same size as the original sample and are generated
by randomly drawing with replacement. That is, each value of the
original sample can occur once, multiple times, or not at all in a
bootstrapped sample. This can be implemented by generating random
indices into the data set using the \code{randi()} function.

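For example, a single bootstrapped sample can be generated like this
(a minimal sketch; the data vector and variable names are made up):

\begin{lstlisting}
x = randn(100, 1);     % the original sample: 100 random data values
n = length(x);         % size of the original sample
inx = randi(n, n, 1);  % n random indices into x, drawn with replacement
xb = x(inx);           % a bootstrapped sample of the same size as x
\end{lstlisting}
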

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Bootstrap the standard error}

Bootstrapping can be nicely illustrated with the example of the
\enterm{standard error} of the mean (\determ{Standardfehler}). The
arithmetic mean is calculated for a simple random sample. The
standard error of the mean is the standard deviation of the expected
distribution of mean values around the mean of the statistical
population.

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{bootstrapsem}
  \titlecaption{\label{bootstrapsemfig}Bootstrapping the standard
    error of the mean.}{The --- usually unknown --- sampling
    distribution of the mean is distributed around the true mean of
    the statistical population ($\mu=0$, red). The bootstrap
    distribution of the means computed from many bootstrapped samples
    has the same shape as the sampling distribution but is centered
    around the mean of the SRS used for resampling. The standard
    deviation of the bootstrap distribution (blue) is an estimator for
    the standard error of the mean.}
\end{figure}

Via bootstrapping we generate a distribution of mean values
(\figref{bootstrapsemfig}) and the standard deviation of this
distribution is the standard error of the sample mean.

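The following sketch illustrates the procedure (it is not the
solution of the exercise below; sample size and the number of
resamplings are arbitrary choices):

\begin{lstlisting}
x = randn(1000, 1);       % the simple random sample
nboot = 1000;             % number of bootstrapped samples
means = zeros(nboot, 1);  % means of the bootstrapped samples
for i = 1:nboot
    xb = x(randi(length(x), length(x), 1));  % resample with replacement
    means(i) = mean(xb);                     % mean of the bootstrapped sample
end
fprintf('bootstrapped standard error: %.4f\n', std(means));
fprintf('classical standard error   : %.4f\n', std(x)/sqrt(length(x)));
\end{lstlisting}
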
\begin{exercise}{bootstrapsem.m}{bootstrapsem.out}
  Create the distribution of mean values from bootstrapped samples
  resampled from a single SRS. Use this distribution to estimate the
  standard error of the mean.
  \begin{enumerate}
  \item Draw 1000 normally distributed random numbers and calculate
    the mean, the standard deviation, and the standard error
    ($\sigma/\sqrt{n}$).
  \item Resample the data 1000 times (randomly draw with replacement)
    and calculate the mean of each bootstrapped sample.
  \item Plot a histogram of the resulting distribution and calculate
    its mean and standard deviation. Compare with the original values
    based on the statistical population.
  \end{enumerate}
\end{exercise}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Permutation tests}

Statistical tests ask for the probability of a measured value to
originate from a null hypothesis. If this probability is smaller than
the desired \entermde{Signifikanz}{significance level}, the
\entermde{Nullhypothese}{null hypothesis} can be rejected.

Traditionally, such probabilities are taken from theoretical
distributions that have been derived under some assumptions about the
data. For example, the data should be normally distributed. Given
some data, one has to find an appropriate test that matches their
properties. An alternative approach is to calculate the probability
density of the null hypothesis directly from the data themselves. To
do so, we need to resample the data of the SRS according to the null
hypothesis. By such permutation operations we destroy the feature of
interest while conserving all other statistical properties of the
data.


\subsection{Significance of a difference in the mean}

Often we would like to know whether two data sets differ in their
mean: whether the ears of foxes in southern Europe are larger than
the ones of foxes from Scandinavia, whether a drug decreases blood
pressure in humans, whether a sensory stimulus increases the firing
rate of a neuron, etc. The \entermde{Nullhypothese}{null hypothesis}
is that they do not differ in their means, i.e. that both data sets
come from the same distribution. But even if the two data sets come
from the same distribution, their sample means may nevertheless
differ by chance. We need to figure out how these differences of the
means are distributed. Only if the measured difference between the
means is significantly larger than the differences obtained by chance
can we reject the null hypothesis and consider the two data sets to
differ significantly in their means.

We can easily estimate the distribution of the null hypothesis by
putting the data of both data sets into one big bag. By merging the
two data sets we assume that all the data values come from the same
distribution. We then randomly separate the data values into two new
data sets. These random data sets contain data from both original
data sets and thus come from the same distribution. From these random
data sets we compute the difference of their sample means. This
procedure is repeated many, say one thousand, times and each time we
get a value for the difference of means. The distribution of these
values is the distribution of the null hypothesis: the distribution
of differences of mean values that we get by chance even though the
two data sets come from the same distribution. For a one-sided test
that checks whether the measured difference of means is significantly
larger than zero at a significance level of 5\,\%, we compute the
95\,\% percentile of the null distribution. If the measured value is
larger, we can reject the null hypothesis and consider the two data
sets to differ significantly in their means.

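In code, this procedure could look like the following sketch
(simulated data and variable names are ours; this is not the solution
of the exercise below):

\begin{lstlisting}
x = randn(200, 1);          % first data set
y = randn(200, 1) + 0.2;    % second data set, population means differ by 0.2
md = mean(y) - mean(x);     % measured difference of the sample means
xy = [x; y];                % lump both data sets into one big bag
nperm = 1000;               % number of permutations
mds = zeros(nperm, 1);      % differences of means of the permuted data sets
for i = 1:nperm
    xyp = xy(randperm(length(xy)));  % shuffle the merged data
    mds(i) = mean(xyp(201:end)) - mean(xyp(1:200));  % difference of means
end
crit = quantile(mds, 0.95); % 95% percentile of the null distribution
if md > crit
    fprintf('difference of means %.2f is significant\n', md);
end
\end{lstlisting}
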
By using the original data to estimate the null hypothesis, we make
no assumptions about the properties of the data. We do not need to
worry about the data being normally distributed. We do not need to
memorize which test to use in which situation. And we better
understand what we are testing, because we design the test
ourselves. Nowadays, computers are powerful enough to iterate even
ten thousand times over the data to compute the distribution of the
null hypothesis --- with only a few lines of code. This is why
\entermde{Permutationstest}{permutation tests} are becoming quite
popular.

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{permuteaverage}
  \titlecaption{\label{permuteaverage}Permutation test for differences
    in means.}{We want to test whether two datasets
    $\left\{x_i\right\}$ (red) and $\left\{y_i\right\}$ (blue) come
    from different distributions by assessing the significance of the
    difference in their sample means. The data sets were generated
    with a difference in their population means of $d=0.7$. For
    generating the distribution of the null hypothesis, i.e. the
    distribution of differences in the means if the two data sets come
    from the same distribution, we randomly select the same number of
    samples from both data sets (top right). This is repeated many
    times and results in the desired distribution of differences of
    means (bottom). The measured difference is clearly beyond the
    95\,\% percentile of this distribution and thus indicates a
    significant difference between the distributions of the two
    original data sets.}
\end{figure}

\begin{exercise}{meandiffsignificance.m}{meandiffsignificance.out}
  Estimate the statistical significance of a difference in the mean
  of two data sets.
  \vspace{-1ex}
  \begin{enumerate}
  \item Generate two independent data sets, $\left\{x_i\right\}$ and
    $\left\{y_i\right\}$, of $n=200$ samples each, by drawing random
    numbers from a normal distribution. Add 0.2 to all the $y_i$
    samples so that the population means differ by 0.2.
  \item Calculate the difference between the sample means of the two
    data sets.
  \item Estimate the distribution of the null hypothesis of no
    difference of the means by generating new data sets with the same
    number of samples randomly selected from both data sets. For
    this, lump the two data sets together into a single vector. Then
    permute the order of the elements in this vector using the
    function \varcode{randperm()}, split it into two data sets, and
    calculate the difference of their means. Repeat this 1000 times.
  \item Read out the 95\,\% percentile from the resulting
    distribution of the differences in the means (the distribution of
    the null hypothesis) using the \varcode{quantile()} function, and
    compare it with the difference of means measured from the
    original data sets.
  \end{enumerate}
\end{exercise}


\subsection{Significance of correlations}

Another nice example for the application of a
\entermde{Permutationstest}{permutation test} is testing for
significant \entermde[correlation]{Korrelation}{correlations}
(\figref{permutecorrelationfig}). Given are measured pairs of data
points $(x_i, y_i)$. By calculating the
\entermde[correlation!correlation
coefficient]{Korrelationskoeffizient}{correlation coefficient} we can
quantify how strongly $y$ depends on $x$. The correlation coefficient
alone, however, does not tell us whether the correlation is
significant, i.e. whether it is larger than the non-zero correlations
that we might get by chance from truly uncorrelated data. The
\entermde{Nullhypothese}{null hypothesis} for such a situation is
that $y$ does not depend on $x$. In order to perform a permutation
test, we need to destroy the correlation between the data pairs by
permuting the $(x_i, y_i)$ pairs, i.e. we rearrange the $x_i$ and
$y_i$ values in a random fashion. Generating many sets of random
pairs and computing the corresponding correlation coefficients yields
a distribution of correlation coefficients that result randomly from
truly uncorrelated data. By comparing the actually measured
correlation coefficient with this distribution we can directly assess
the significance of the correlation.

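A minimal sketch of this test (simulated data as in the exercise
below; here only the $y$-values are shuffled, which equally destroys
the pairing):

\begin{lstlisting}
x = randn(200, 1);          % random x-values
y = 0.2*x + randn(200, 1);  % y-values weakly depending on x
c = corrcoef(x, y);         % correlation matrix of the original pairs
rd = c(1, 2);               % measured correlation coefficient
nperm = 1000;               % number of permutations
rs = zeros(nperm, 1);       % correlation coefficients of shuffled pairs
for i = 1:nperm
    yp = y(randperm(length(y)));  % destroy the pairing by shuffling y
    c = corrcoef(x, yp);
    rs(i) = c(1, 2);        % correlation coefficient of uncorrelated pairs
end
crit = quantile(rs, 0.95);  % 95% percentile of the null distribution
if rd > crit
    fprintf('correlation %.2f is significant\n', rd);
end
\end{lstlisting}
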
\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{permutecorrelation}
  \titlecaption{\label{permutecorrelationfig}Permutation test for
    correlations.}{Let the correlation coefficient of a dataset with
    200 samples be $\rho=0.21$ (top left). By shuffling the data pairs
    we destroy any true correlation (top right). The resulting
    distribution of the null hypothesis (bottom, yellow), obtained
    from the correlation coefficients of permuted and therefore
    uncorrelated datasets, is centered around zero. The measured
    correlation coefficient is larger than the 95\,\% percentile of
    the null hypothesis. The null hypothesis may thus be rejected and
    the measured correlation is considered statistically significant.}
\end{figure}

\begin{exercise}{correlationsignificance.m}{correlationsignificance.out}
  Estimate the statistical significance of a correlation coefficient.
  \begin{enumerate}
  \item Generate pairs of $(x_i, y_i)$ values: randomly choose
    $x$-values and calculate the respective $y$-values according to
    $y_i = 0.2 \cdot x_i + u_i$, where $u_i$ is a random number drawn
    from a normal distribution.
  \item Calculate the correlation coefficient.
  \item Estimate the distribution of the null hypothesis by
    generating uncorrelated pairs. For this, permute the $x$- and
    $y$-values \matlabfun{randperm()} 1000 times and calculate the
    correlation coefficient for each permutation.
  \item Read out the 95\,\% percentile from the resulting
    distribution of the null hypothesis using the
    \varcode{quantile()} function and compare it with the correlation
    coefficient computed from the original data.
  \end{enumerate}
\end{exercise}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions