[statistics] fixed the chapter

Jan Benda 2019-11-25 23:08:48 +01:00
parent 4b9d1134fe
commit b32000214a
3 changed files with 16 additions and 21 deletions

View File

@@ -9,9 +9,7 @@ PYPDFFILES=$(PYFILES:.py=.pdf)
 pythonplots : $(PYPDFFILES)
 
 $(PYPDFFILES) : %.pdf: %.py
-	echo $$(which python)
 	python3 $<
-	#python $<
 
 cleanpythonplots :
 	rm -f $(PYPDFFILES)
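The pattern rule turns every plotting script into a PDF of the same base name simply by executing it with python3. A minimal sketch of a script that such a rule can build (the file name exampleplot.py is hypothetical, not part of the repository), assuming numpy and matplotlib are installed:

    # Hypothetical plotting script; running "python3 exampleplot.py"
    # must leave behind "exampleplot.pdf" for the %.pdf: %.py rule.
    import os
    import sys
    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(0.0, 1.0, 100)
    plt.plot(x, x**2)
    # Derive the output name from the script's own name:
    pdffile = os.path.splitext(os.path.basename(sys.argv[0]))[0] + '.pdf'
    plt.savefig(pdffile)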

View File

@@ -16,11 +16,6 @@
 \input{statistics}
-
-\section{TODO}
-\begin{itemize}
-\item Replace exercise 1.3 (boxwhisker) by one recreating figure 1.
-\end{itemize}
 \end{document}

View File

@@ -307,7 +307,7 @@ $\int p_g(x) \, dx = 1$, \eqnref{pdfnorm}.
 \newpage
 Histograms of real valued data depend on both the number of data
-values and the chosen bin width. As in the example with the die
+values \emph{and} the chosen bin width. As in the example with the die
 (\figref{diehistogramsfig} left), the height of the histogram gets
 larger the larger the size of the data set. Also, as the bin width is
 increased the height of the histogram increases, because more data
@@ -315,7 +315,7 @@ values fall within each bin (\figref{pdfhistogramfig} left).
 \begin{exercise}{gaussianbins.m}{}
   Draw 100 random data values from a Gaussian distribution and plot
-  histograms with different bin sizes of the data. What do you
+  histograms of the data with different bin widths. What do you
   observe?
 \end{exercise}
@@ -328,8 +328,8 @@ histogram bars. Each bar has the height $n_i$ and the width $\Delta
 x$. The total area $A$ of the histogram is thus
 \[ A = \sum_{i=1}^N ( n_i \cdot \Delta x ) = \Delta x \sum_{i=1}^N n_i = N \, \Delta x \]
 and the normalized histogram has the heights
-\[ p(x_i) = \frac{n_i}{\Delta x \sum_{i=1}^N n_i} = \frac{n_i}{N
-\Delta x} \; .\]
+\[ p(x_i) = \frac{n_i}{A} = \frac{n_i}{\Delta x \sum_{i=1}^N n_i} =
+\frac{n_i}{N \Delta x} \; .\]
 A histogram needs to be divided by both the sum of the frequencies
 $n_i$ and the bin width $\Delta x$ to result in an estimate of the
 corresponding probability density. Only then can the distribution be
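The normalization $p(x_i) = n_i/(N \Delta x)$ from this hunk can be checked numerically; a minimal sketch, assuming numpy (numpy's density=True option computes the same thing):

    # Normalize histogram counts n_i to a probability density:
    # divide by the total count and the bin width, p_i = n_i/(N*dx).
    import numpy as np

    rng = np.random.default_rng()
    data = rng.normal(0.0, 1.0, 1000)
    width = 0.5
    bins = np.arange(-4.0, 4.0 + width, width)
    counts, edges = np.histogram(data, bins)     # frequencies n_i
    density = counts / (np.sum(counts) * width)  # n_i / (N * dx)
    print(np.sum(density * width))               # integrates to 1.0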
@@ -359,33 +359,35 @@ probability density functions like the one of the normal distribution
 \subsection{Kernel densities}
 A problem of using histograms for estimating probability densities is
-that they have hard bin edges. Depending on where the bin edges are placed
-a data value falls in one or the other bin.
+that they have hard bin edges. Depending on where the bin edges are
+placed a data value falls in one or the other bin. As a result the
+shape of the histogram depends on the exact position of its bins
+(\figref{kerneldensityfig} left).
 \begin{figure}[t]
   \includegraphics[width=1\textwidth]{kerneldensity}
   \titlecaption{\label{kerneldensityfig} Kernel densities.}{Left: The
     histogram-based estimation of the probability density is dependent
     on the position of the bins. In the bottom plot the bins have
-    bin shifted by half a bin width (here $\Delta x=0.4$) and as a
+    been shifted by half a bin width (here $\Delta x=0.4$) and as a
     result details of the probability density look different. Look,
-    for example at the height of the largest bin. Right: In contrast,
+    for example, at the height of the largest bin. Right: In contrast,
     a kernel density is uniquely defined for a given kernel width
-    (here Gaussian kernels with standard deviation of $\sigma=2$).}
+    (here Gaussian kernels with standard deviation of $\sigma=0.2$).}
 \end{figure}
-To avoid this problem one can use so called \enterm {kernel densities}
+To avoid this problem one can use so called \enterm{kernel densities}
 for estimating probability densities from data. Here every data point
 is replaced by a kernel (a function with integral one, like for
 example the Gaussian) that is moved exactly to the position
 indicated by the data value. Then all the kernels of all the data
 values are summed up, the sum is divided by the number of data values,
-and we get an estimate of the probability density.
+and we get an estimate of the probability density
+(\figref{kerneldensityfig} right).
 
 As for the histogram, where we need to choose a bin width, we need to
 choose the width of the kernels appropriately.
-\newpage
 \begin{exercise}{gaussiankerneldensity.m}{}
   Construct and plot a kernel density of the data from the previous
   two exercises.
 \end{exercise}
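The construction described in this hunk (one Gaussian per data point, summed and divided by the number of data values) fits in a few lines; a sketch in Python rather than the MATLAB file named in the exercise, assuming numpy:

    # Gaussian kernel density: place a normal density of width sigma
    # at every data value and average over all data values.
    import numpy as np

    def kerneldensity(data, xgrid, sigma=0.2):
        diffs = xgrid[:, None] - data[None, :]
        kernels = np.exp(-0.5 * (diffs / sigma)**2)
        kernels /= np.sqrt(2.0 * np.pi) * sigma   # each integrates to one
        return np.mean(kernels, axis=1)           # divide sum by N

    rng = np.random.default_rng()
    data = rng.normal(0.0, 1.0, 100)
    xgrid = np.linspace(-4.0, 4.0, 200)
    density = kerneldensity(data, xgrid)
    print(np.sum(density) * (xgrid[1] - xgrid[0]))  # close to 1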
@@ -430,7 +432,7 @@ and percentiles can be determined from the inverse cumulative function.
 by numerically integrating the normal distribution function
 (blue). From the cumulative distribution function one can read off
 the probabilities of getting values smaller than a given value
-(here: $P(x \ge -1) \approx 0.15$). From the inverse cumulative
+(here: $P(x \le -1) \approx 0.15$). From the inverse cumulative
 distribution the position of percentiles can be computed (here:
 the median (50\,\% percentile) is as expected close to zero and
 the third quartile (75\,\% percentile) at $x=0.68$.}
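Both readings from the figure follow from a numerically integrated cumulative distribution; a minimal sketch, assuming numpy:

    # Integrate the standard normal density to get the cumulative
    # distribution, then read off probabilities and percentiles.
    import numpy as np

    x = np.linspace(-4.0, 4.0, 801)
    dx = x[1] - x[0]
    pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
    cdf = np.cumsum(pdf) * dx                # numerical integration
    print(np.interp(-1.0, x, cdf))           # P(x <= -1), about 0.16
    print(np.interp(0.5, cdf, x))            # median, close to zero
    print(np.interp(0.75, cdf, x))           # 75% percentile, about 0.68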
@@ -453,7 +455,7 @@ and percentiles can be determined from the inverse cumulative function.
 Until now we described properties of univariate data sets. In
 bivariate or multivariate data sets where we have pairs or tuples of
-data values (e.g. the size and the weight of elephants) we want to analyze
+data values (e.g. size and weight of elephants) we want to analyze
 dependencies between the variables.
 
 The \enterm{correlation coefficient}