[bootstrap] fixed english text

Jan Benda 2019-11-26 14:41:51 +01:00
parent 025d4eb640
commit 155f6d7e54
2 changed files with 49 additions and 55 deletions

View File

@ -3,8 +3,6 @@
\chapter{\tr{Bootstrap methods}{Bootstrap Methoden}}
\label{bootstrapchapter}

Bootstrapping methods are applied to create distributions of
statistical measures via resampling of a sample. Bootstrapping offers several
advantages:
@ -12,9 +10,9 @@ advantages:
\item Fewer assumptions (e.g. a measured sample does not need to be
  normally distributed).
\item Increased precision as compared to classical methods. %such as?
\item General applicability: the bootstrapping methods are very
  similar for different statistics and there is no need to specialize
  the method to specific statistical measures.
\end{itemize}

\begin{figure}[tp]
@ -22,27 +20,26 @@ advantages:
\includegraphics[width=0.8\textwidth]{2012-10-29_16-41-39_523}\\[2ex]
\includegraphics[width=0.8\textwidth]{2012-10-29_16-29-35_312}
\titlecaption{\label{statisticalpopulationfig} Why can't we measure
properties of the full population but only draw samples?}{}
\end{figure}

Reminder: in statistics we are interested in properties of a
\enterm{statistical population} (\determ{Grundgesamtheit}), e.g. the
average length of all pickles (\figref{statisticalpopulationfig}). But
we cannot measure the lengths of all pickles in the
population. Rather, we draw samples (\enterm{simple random sample},
\enterm[SRS|see{simple random sample}]{SRS}, \determ{Stichprobe}). We
then estimate a statistical measure of interest (e.g. the average
length of the pickles) within this sample and hope that it is a good
approximation of the unknown and immeasurable true average length of
the population (\determ{Populationsparameter}). We apply statistical
methods to find out how precise this approximation is.

If we could draw a large number of \enterm{simple random samples} we
could calculate the statistical measure of interest for each sample
and estimate its probability distribution using a histogram. This
distribution is called the \enterm{sampling distribution}
(\determ{Stichprobenverteilung},
\subfigref{bootstrapsamplingdistributionfig}{a}).
\begin{figure}[tp]
@ -67,16 +64,14 @@ Commonly, there will be only a single SRS. In such cases we make use
of certain assumptions (e.g. we assume a normal distribution) that
allow us to infer the precision of our estimation based on the
SRS. For example, the formula $\sigma/\sqrt{n}$ gives the standard
error of the mean, which is the standard deviation of the sampling
distribution of average values around the true mean of the population
(\subfigref{bootstrapsamplingdistributionfig}{b}).
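To make the sampling distribution concrete, the following sketch (not
part of the original chapter; the population parameters and sample
sizes are made-up values) simulates a normally distributed population,
draws many SRS, and compares the standard deviation of the resulting
means with $\sigma/\sqrt{n}$:
\begin{lstlisting}
% Sketch: simulate the sampling distribution of the mean.
mu = 0.0;                          % true mean of the population
sigma = 1.0;                       % true standard deviation
n = 100;                           % size of each SRS
nsamples = 1000;                   % number of SRS we draw
means = zeros(nsamples, 1);
for i = 1:nsamples
    x = mu + sigma * randn(n, 1);  % draw one SRS from the population
    means(i) = mean(x);            % statistical measure of interest
end
hist(means, 20);                   % histogram: the sampling distribution
% its standard deviation approximates the standard error sigma/sqrt(n):
fprintf('std of means: %.3f, sigma/sqrt(n): %.3f\n', std(means), sigma/sqrt(n))
\end{lstlisting}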
Alternatively, we can use ``bootstrapping'' to generate new samples
from one set of measurements (resampling). From these bootstrapped
samples we compute the desired statistical measure and estimate their
distribution (\enterm{bootstrap distribution},
\subfigref{bootstrapsamplingdistributionfig}{c}). Interestingly, this
distribution is very similar to the sampling distribution regarding
its width. The only difference is that the bootstrapped values are
@ -89,7 +84,7 @@ Bootstrapping methods create bootstrapped samples from a SRS by
resampling. The bootstrapped samples are used to estimate the sampling
distribution of a statistical measure. The bootstrapped samples have
the same size as the original sample and are created by randomly
drawing with replacement. That is, each value of the original sample
can occur once, multiple times, or not at all in a bootstrapped
sample.
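In MATLAB, drawing with replacement boils down to indexing with random
integers. A minimal sketch (assuming the SRS is stored in a vector
\texttt{x}):
\begin{lstlisting}
% Sketch: one bootstrapped sample from an SRS stored in vector x.
n = length(x);           % size of the original sample
inx = randi(n, n, 1);    % n random indices, drawn with replacement
xb = x(inx);             % each value may occur once, several times, or never
\end{lstlisting}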
@ -107,10 +102,10 @@ of the statistical population.
error of the mean.}{The --- usually unknown --- sampling
distribution of the mean is centered around the true mean of
the statistical population ($\mu=0$, red). The bootstrap
distribution of the means computed from many bootstrapped samples
has the same shape as the sampling distribution but is centered
around the mean of the SRS used for resampling. The standard
deviation of the bootstrap distribution (blue) is an estimator for
the standard error of the mean.}
\end{figure}
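As a sketch of this idea (again assuming the SRS in a vector
\texttt{x}; the number of resamplings is an arbitrary choice), the
bootstrap estimate of the standard error of the mean can be compared
with the classical formula:
\begin{lstlisting}
% Sketch: bootstrap estimate of the standard error of the mean.
nboot = 1000;
bmeans = zeros(nboot, 1);
n = length(x);
for i = 1:nboot
    xb = x(randi(n, n, 1));        % resample x with replacement
    bmeans(i) = mean(xb);          % mean of the bootstrapped sample
end
semboot = std(bmeans);             % width of the bootstrap distribution
semclassic = std(x) / sqrt(n);     % classical estimate sigma/sqrt(n)
\end{lstlisting}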
@ -137,8 +132,8 @@ distribution is the standard error of the mean.
\section{Permutation tests}

Statistical tests ask for the probability that a measured value
originates from a null hypothesis. If this probability is smaller
than the desired significance level, the null hypothesis may be
rejected.

Traditionally, such probabilities are taken from theoretical
@ -148,36 +143,37 @@ data. An alternative approach is to calculate the probability density
of the null hypothesis directly from the data itself. To do this, we
need to resample the data according to the null hypothesis from the
SRS. By such permutation operations we destroy the feature of interest
while we conserve all other statistical properties of the data.
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{permutecorrelation}
\titlecaption{\label{permutecorrelationfig}Permutation test for
correlations.}{Let the correlation coefficient of a dataset with
200 samples be $\rho=0.21$. The distribution of the null
hypothesis (yellow), obtained from the correlation coefficients of
permuted and therefore uncorrelated datasets, is centered around
zero. The measured correlation coefficient is larger than the
95\,\% percentile of the null hypothesis. The null hypothesis may
thus be rejected and the measured correlation is considered
statistically significant.}
\end{figure}
A good example for the application of a permutation test is the
statistical assessment of correlations. Given are measured pairs of
data points $(x_i, y_i)$. By calculating the correlation coefficient
we can quantify how strongly $y$ depends on $x$. The correlation
coefficient alone, however, does not tell whether the correlation is
significantly different from a random correlation. The null hypothesis
for such a situation would be that $y$ does not depend on $x$. In
order to perform a permutation test, we need to destroy the
correlation by permuting the $(x_i, y_i)$ pairs, i.e. we rearrange the
$x_i$ and $y_i$ values in a random fashion. Generating many sets of
random pairs and computing the resulting correlation coefficients
yields a distribution of correlation coefficients that result
randomly from uncorrelated data. By comparing the actually measured
correlation coefficient with this distribution we can directly assess
the significance of the correlation
(figure\,\ref{permutecorrelationfig}).
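A compact sketch of this procedure (the data vectors \texttt{x} and
\texttt{y} are our assumption; the exercise below develops the same
approach step by step). Note that permuting only the $y$ values
already destroys the pairing:
\begin{lstlisting}
% Sketch: permutation test for a correlation coefficient.
nperm = 1000;
rhos = zeros(nperm, 1);
for i = 1:nperm
    yp = y(randperm(length(y)));   % permutation destroys the correlation
    c = corrcoef(x, yp);
    rhos(i) = c(1, 2);             % correlation of uncorrelated pairs
end
rhos = sort(rhos);
rho95 = rhos(floor(0.95 * nperm)); % 95% percentile of the null hypothesis
c = corrcoef(x, y);                % correlation of the measured data
significant = c(1, 2) > rho95;
\end{lstlisting}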
\begin{exercise}{correlationsignificance.m}{correlationsignificance.out}
Estimate the statistical significance of a correlation coefficient.
@ -190,10 +186,8 @@ Estimate the statistical significance of a correlation coefficient.
generating uncorrelated pairs. For this, permute $x$- and $y$-values
\matlabfun{randperm()} 1000 times and calculate for each
permutation the correlation coefficient.
\item Read out the 95\,\% percentile from the resulting distribution
of the null hypothesis and compare it with the correlation
coefficient computed from the original data.
\end{enumerate}
\end{exercise}

View File

@ -15,7 +15,7 @@
\else
\newcommand{\stitle}{}
\fi
\header{{\bfseries\large Exercise 8\stitle}}{{\bfseries\large Statistics}}{{\bfseries\large December 2nd, 2019}}
\firstpagefooter{Prof. Dr. Jan Benda}{Phone: 29 74573}{Email:
jan.benda@uni-tuebingen.de}
\runningfooter{}{\thepage}{}