[bootstrap] fixed english text
parent 025d4eb640
commit 155f6d7e54
\chapter{\tr{Bootstrap methods}{Bootstrap Methoden}}
\label{bootstrapchapter}

\selectlanguage{english}

Bootstrapping methods are applied to create distributions of
statistical measures via resampling of a sample. Bootstrapping offers several
advantages:
\begin{itemize}
\item Fewer assumptions (e.g. a measured sample does not need to be
  normally distributed).
\item Increased precision as compared to classical methods. %such as?
\item General applicability: the bootstrapping methods are very
  similar for different statistics and there is no need to specialize
  the method to specific statistical measures.
\end{itemize}

\begin{figure}[tp]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-41-39_523}\\[2ex]
  \includegraphics[width=0.8\textwidth]{2012-10-29_16-29-35_312}
  \titlecaption{\label{statisticalpopulationfig} Why can't we measure
    properties of the full population but only draw samples?}{}
\end{figure}

Reminder: in statistics we are interested in properties of a
\enterm{statistical population} (\determ{Grundgesamtheit}), e.g. the
average length of all pickles (\figref{statisticalpopulationfig}). But
we cannot measure the lengths of all pickles in the
population. Rather, we draw samples (\enterm{simple random sample},
\enterm[SRS|see{simple random sample}]{SRS}, \determ{Stichprobe}). We
then estimate a statistical measure of interest (e.g. the average
length of the pickles) within this sample and hope that it is a good
approximation of the unknown and immeasurable true average length of
the population (\determ{Populationsparameter}). We apply statistical
methods to find out how precise this approximation is.

If we could draw a large number of \enterm{simple random samples} we
could calculate the statistical measure of interest for each sample
and estimate its probability distribution using a histogram. This
distribution is called the \enterm{sampling distribution}
(\determ{Stichprobenverteilung},
\subfigref{bootstrapsamplingdistributionfig}{a}).

\begin{figure}[tp]
Commonly, there will be only a single SRS. In such cases we make use
of certain assumptions (e.g. we assume a normal distribution) that
allow us to infer the precision of our estimation based on the
SRS. For example, the formula $\sigma/\sqrt{n}$ gives the standard
error of the mean, which is the standard deviation of the sampling
distribution of average values around the true mean of the population
(\subfigref{bootstrapsamplingdistributionfig}{b}).

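The relation between the formula $\sigma/\sqrt{n}$ and the width of the sampling distribution can be checked numerically. The following is a minimal sketch in Python with NumPy (the exercises in this chapter use MATLAB); the population parameters and the number of samples are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n = 2.0, 50   # population standard deviation and sample size

# draw many simple random samples and compute the mean of each one
means = np.array([rng.normal(0.0, sigma, n).mean() for _ in range(10000)])

# the standard deviation of this sampling distribution of means ...
empirical_sem = means.std()
# ... matches the standard error predicted by the formula
predicted_sem = sigma / np.sqrt(n)
print(empirical_sem, predicted_sem)  # both approximately 2/sqrt(50) = 0.283
```

In practice only a single sample is available, which is exactly why the resampling methods of this chapter are needed.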
Alternatively, we can use ``bootstrapping'' to generate new samples
from one set of measurements (resampling). From these bootstrapped
samples we compute the desired statistical measure and estimate their
distribution (\enterm{bootstrap distribution},
\subfigref{bootstrapsamplingdistributionfig}{c}). Interestingly, this
distribution is very similar to the sampling distribution regarding
its width. The only difference is that the bootstrapped values are
centered around the mean of the SRS and not around the true mean of
the statistical population.
Bootstrapping methods create bootstrapped samples from a SRS by
resampling. The bootstrapped samples are used to estimate the sampling
distribution of a statistical measure. The bootstrapped samples have
the same size as the original sample and are created by randomly drawing with
replacement. That is, each value of the original sample can occur
once, multiple times, or not at all in a bootstrapped sample.

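The resampling scheme just described can be sketched in a few lines. This is an illustrative Python sketch (the course material itself uses MATLAB), with an arbitrary made-up sample and number of resamplings:

```python
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(5.0, 2.0, 100)   # one simple random sample (SRS)

# draw bootstrapped samples: same size as the SRS, drawn with
# replacement, so each original value may appear once, several
# times, or not at all
nresamples = 5000
bmeans = np.empty(nresamples)
for i in range(nresamples):
    bsample = rng.choice(sample, size=len(sample), replace=True)
    bmeans[i] = bsample.mean()

# the standard deviation of the bootstrap distribution of the means
# estimates the standard error of the mean
print(bmeans.std(), sample.std() / np.sqrt(len(sample)))
```

The two printed numbers agree closely, illustrating that the width of the bootstrap distribution estimates the standard error without any formula specific to the mean.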
  error of the mean.}{The --- usually unknown --- sampling
  distribution of the mean is distributed around the true mean of
  the statistical population ($\mu=0$, red). The bootstrap
  distribution of the means computed from many bootstrapped samples
  has the same shape as the sampling distribution but is centered
  around the mean of the SRS used for resampling. The standard
  deviation of the bootstrap distribution (blue) is an estimator for
  the standard error of the mean.}
\end{figure}

\section{Permutation tests}
Statistical tests ask for the probability that a measured value
originates from a null hypothesis. If this probability is smaller
than the desired significance level, the null hypothesis can be
rejected.

Traditionally, such probabilities are taken from theoretical
distributions which are based on assumptions about the
data. An alternative approach is to calculate the probability density
of the null hypothesis directly from the data itself. To do this, we
need to resample the data according to the null hypothesis from the
SRS. By such permutation operations we destroy the feature of interest
while we conserve all other statistical properties of the data.

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{permutecorrelation}
  \titlecaption{\label{permutecorrelationfig}Permutation test for
    correlations.}{Let the correlation coefficient of a dataset with
    200 samples be $\rho=0.21$. The distribution of the null
    hypothesis (yellow), obtained from the correlation coefficients of
    permuted and therefore uncorrelated datasets, is centered around
    zero. The measured correlation coefficient is larger than the
    95\,\% percentile of the null hypothesis. The null hypothesis may
    thus be rejected and the measured correlation is considered
    statistically significant.}
\end{figure}

|
|
||||||
A good example for the application of a permutaion test is the
|
A good example for the application of a permutaion test is the
|
||||||
statistical assessment of correlations. Given are measured pairs of
|
statistical assessment of correlations. Given are measured pairs of
|
||||||
data points $(x_i, y_i)$. By calculating the correlation coefficient
|
data points $(x_i, y_i)$. By calculating the correlation coefficient
|
||||||
we can quantify how strongly $y$ depends on $x$. The correlation
|
we can quantify how strongly $y$ depends on $x$. The correlation
|
||||||
coefficient alone, however, does not tell whether it is statistically
|
coefficient alone, however, does not tell whether the correlation is
|
||||||
significantly different from a random correlation. The null hypothesis
|
significantly different from a random correlation. The null hypothesis
|
||||||
for such a situation would be that $y$ does not depend on $x$. In
|
for such a situation would be that $y$ does not depend on $x$. In
|
||||||
order to perform a permutation test, we now destroy the correlation by
|
order to perform a permutation test, we need to destroy the
|
||||||
permuting the $(x_i, y_i)$ pairs, i.e. we rearrange the $x_i$ and
|
correlation by permuting the $(x_i, y_i)$ pairs, i.e. we rearrange the
|
||||||
$y_i$ values in a random fashion. By creating many sets of random
|
$x_i$ and $y_i$ values in a random fashion. Generating many sets of
|
||||||
pairs and calculating the resulting correlation coefficients, we yield
|
random pairs and computing the resulting correlation coefficients,
|
||||||
a distribution of correlation coefficients that are a result of
|
yields a distribution of correlation coefficients that result
|
||||||
randomness. From this distribution we can directly measure the
|
randomnly from uncorrelated data. By comparing the actually measured
|
||||||
statistical significance (figure\,\ref{permutecorrelationfig}).
|
correlation coefficient with this distribution we can directly assess
|
||||||
|
the significance of the correlation
|
||||||
|
(figure\,\ref{permutecorrelationfig}).
|
||||||
|
|
||||||
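The procedure described above can be sketched as follows. Again this is an illustrative Python sketch with made-up, weakly correlated data (the exercise below implements the same idea in MATLAB):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(0.0, 1.0, n)
y = 0.2 * x + rng.normal(0.0, 1.0, n)   # weakly correlated pairs

r = np.corrcoef(x, y)[0, 1]             # measured correlation coefficient

# null hypothesis: y does not depend on x -> destroy the pairing by
# permuting the y values and recompute the correlation each time
nperm = 1000
rnull = np.empty(nperm)
for i in range(nperm):
    rnull[i] = np.corrcoef(x, rng.permutation(y))[0, 1]

# the measured correlation is considered significant if it exceeds
# the 95% percentile of the null distribution
threshold = np.quantile(rnull, 0.95)
print(r, threshold)
```

Note that the null distribution is centered around zero, as in the figure above, and that no assumption about the shape of the data's distribution was needed.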
\begin{exercise}{correlationsignificance.m}{correlationsignificance.out}
  Estimate the statistical significance of a correlation coefficient.
  \begin{enumerate}
  \item Estimate the distribution of the null hypothesis by
    generating uncorrelated pairs. For this permute $x$- and $y$-values
    using \matlabfun{randperm()} 1000 times and calculate for each
    permutation the correlation coefficient.
  \item Read out the 95\,\% percentile from the resulting distribution
    of the null hypothesis and compare it with the correlation
    coefficient computed from the original data.
  \end{enumerate}
\end{exercise}

\else
  \newcommand{\stitle}{}
\fi
\header{{\bfseries\large Exercise 8\stitle}}{{\bfseries\large Statistics}}{{\bfseries\large December 2nd, 2019}}
\firstpagefooter{Prof. Dr. Jan Benda}{Phone: 29 74573}{Email:
  jan.benda@uni-tuebingen.de}
\runningfooter{}{\thepage}{}