fixed many index entries

2019-12-09 20:01:27 +01:00
parent f24c14e6f5
commit bf52536b7b
12 changed files with 332 additions and 306 deletions


@@ -7,17 +7,16 @@ Descriptive statistics characterizes data sets by means of a few measures.
In addition to histograms that estimate the full distribution of the data,
the following measures are used for characterizing univariate data:
\begin{description}
\item[Location, central tendency] (``Lagema{\ss}e''):
arithmetic mean, median, mode.
\item[Spread, dispersion] (``Streuungsma{\ss}e''): variance,
standard deviation, inter-quartile range,\linebreak coefficient of variation
(``Variationskoeffizient'').
\item[Shape]: skewness (``Schiefe''), kurtosis (``W\"olbung'').
\item[Location, central tendency] (\determ{Lagema{\ss}e}):
\entermde[mean!arithmetic]{Mittel!arithmetisches}{arithmetic mean}, \entermde{Median}{median}, \enterm{mode}.
\item[Spread, dispersion] (\determ{Streuungsma{\ss}e}): \entermde{Varianz}{variance},
\entermde{Standardabweichung}{standard deviation}, inter-quartile range,\linebreak \enterm{coefficient of variation} (\determ{Variationskoeffizient}).
\item[Shape]: \enterm{skewness} (\determ{Schiefe}), \enterm{kurtosis} (\determ{W\"olbung}).
\end{description}
For bivariate and multivariate data sets we can also analyse their
\begin{description}
\item[Dependence, association] (``Zusammenhangsma{\ss}e''): Pearson's correlation coefficient,
Spearman's rank correlation coefficient.
\item[Dependence, association] (\determ{Zusammenhangsma{\ss}e}): \entermde[correlation!coefficient!Pearson's]{Korrelation!Pearson}{Pearson's correlation coefficient},
\entermde[correlation!coefficient!Spearman's rank]{{Rangkorrelationskoeffizient!Spearman'scher}}{Spearman's rank correlation coefficient}.
\end{description}
The following is in no way a complete introduction to descriptive
@@ -26,15 +25,16 @@ daily data-analysis problems.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mean, variance, and standard deviation}
The \enterm{arithmetic mean} is a measure of location. For $n$ data values
$x_i$ the arithmetic mean is computed by
The \entermde[mean!arithmetic]{Mittel!arithmetisches}{arithmetic mean}
is a measure of location. For $n$ data values $x_i$ the arithmetic
mean is computed by
\[ \bar x = \langle x \rangle = \frac{1}{n}\sum_{i=1}^n x_i \; . \]
This computation (summing up all elements of a vector and dividing by
the length of the vector) is provided by the function \mcode{mean()}.
The mean has the same unit as the data values.
The dispersion of the data values around the mean is quantified by
their \enterm{variance}
their \entermde{Varianz}{variance}
\[ \sigma^2_x = \langle (x-\langle x \rangle)^2 \rangle = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2 \; . \]
The variance is computed by the function \mcode{var()}.
The unit of the variance is the unit of the data values squared.
@@ -42,14 +42,15 @@ Therefore, variances cannot be compared to the mean or the data values
themselves. In particular, variances cannot be used for plotting error
bars along with the mean.
The standard deviation
\[ \sigma_x = \sqrt{\sigma^2_x} \; , \]
as computed by the function \mcode{std()}, however, has the same unit
as the data values and can (and should) be used to display the
dispersion of the data together with their mean.
In contrast to the variance, the
\entermde{Standardabweichung}{standard deviation}
\[ \sigma_x = \sqrt{\sigma^2_x} \; , \]
as computed by the function \mcode{std()} has the same unit as the
data values and can (and should) be used to display the dispersion of
the data together with their mean.
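The text relies on MATLAB's \mcode{mean()}, \mcode{var()}, and \mcode{std()}. As a cross-check, a minimal Python sketch of the same three measures (helper names are my own; note that the formulas above divide by the number of data values $n$, whereas MATLAB's \mcode{var()} and \mcode{std()} divide by $n-1$ by default):

```python
# Sketch: arithmetic mean, variance, and standard deviation from scratch,
# following the text's formulas (division by n, not n-1).
import math

def arithmetic_mean(x):
    return sum(x) / len(x)

def variance(x):
    m = arithmetic_mean(x)
    # mean squared deviation from the mean; unit is the data unit squared
    return sum((xi - m) ** 2 for xi in x) / len(x)

def standard_deviation(x):
    # same unit as the data, suitable for error bars
    return math.sqrt(variance(x))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(arithmetic_mean(data))     # 5.0
print(variance(data))            # 4.0
print(standard_deviation(data))  # 2.0
```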
The mean of a data set can be displayed by a bar-plot
\matlabfun{bar()}. Additional errorbars \matlabfun{errobar()} can be
\matlabfun{bar()}. Additional errorbars \matlabfun{errorbar()} can be
used to illustrate the standard deviation of the data
(\figref{displayunivariatedatafig} (2)).
@@ -90,18 +91,18 @@ used to illustrate the standard deviation of the data
identical with the mode.}
\end{figure}
The \enterm{mode} is the most frequent value, i.e. the position of the maximum of the probability distribution.
The \enterm{mode} (\determ{Modus}) is the most frequent value,
i.e. the position of the maximum of the probability distribution.
The \enterm{median} separates a list of data values into two halves
such that one half of the data is not greater and the other half is
not smaller than the median (\figref{medianfig}).
The \entermde{Median}{median} separates a list of data values into two
halves such that one half of the data is not greater and the other
half is not smaller than the median (\figref{medianfig}). The
function \mcode{median()} computes the median.
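A sorting-based median along the lines of the \varcode{mymedian()} exercise below might look like this in Python (shown here with one common convention for an even number of values):

```python
# Sketch: median by sorting, matching the definition in the text:
# half of the data is not greater, the other half not smaller.
def my_median(x):
    s = sorted(x)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                       # odd n: middle element
    return 0.5 * (s[mid - 1] + s[mid])      # even n: mean of the two middle elements

print(my_median([3, 1, 2]))     # 2
print(my_median([4, 1, 3, 2]))  # 2.5
```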
\begin{exercise}{mymedian.m}{}
Write a function \varcode{mymedian()} that computes the median of a vector.
\end{exercise}
\matlab{} provides the function \code{median()} for computing the median.
\begin{exercise}{checkmymedian.m}{}
Write a script that tests whether your median function really
returns a median above which lie the same number of data values as below it.
@@ -122,9 +123,9 @@ not smaller than the median (\figref{medianfig}).
\end{figure}
The distribution of data can be further characterized by the position
of its \enterm[quartile]{quartiles}. Neighboring quartiles are
of its \entermde[quartile]{Quartil}{quartiles}. Neighboring quartiles are
separated by 25\,\% of the data (\figref{quartilefig}).
\enterm[percentile]{Percentiles} allow to characterize the
\entermde[percentile]{Perzentil}{Percentiles} characterize the
distribution of the data in more detail. The 3$^{\rm rd}$ quartile
corresponds to the 75$^{\rm th}$ percentile, because 75\,\% of the
data are smaller than the 3$^{\rm rd}$ quartile.
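MATLAB provides \mcode{quantile()} and \mcode{prctile()} for this; as an illustration, a Python sketch using the simple nearest-rank rule (MATLAB interpolates between data points, so its results can differ slightly):

```python
# Sketch: p-th percentile by the nearest-rank rule.
# Quartiles are the 25th, 50th, and 75th percentiles.
import math

def percentile(x, p):
    """Smallest value such that at least p percent of the data are not greater."""
    s = sorted(x)
    rank = max(1, math.ceil(p / 100.0 * len(s)))
    return s[rank - 1]

data = list(range(1, 101))   # 1..100
print(percentile(data, 25))  # 25 (1st quartile)
print(percentile(data, 50))  # 50 (median by this rule)
print(percentile(data, 75))  # 75 (3rd quartile = 75th percentile)
```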
@@ -147,11 +148,12 @@ data are smaller than the 3$^{\rm rd}$ quartile.
% from a normal distribution.}
% \end{figure}
\enterm[box-whisker plots]{Box-whisker plots} are commonly used to
visualize and compare the distribution of unimodal data. A box is
drawn around the median that extends from the 1$^{\rm st}$ to the
3$^{\rm rd}$ quartile. The whiskers mark the minimum and maximum value
of the data set (\figref{displayunivariatedatafig} (3)).
\entermde[box-whisker plots]{Box-Whisker-Plot}{Box-whisker plots}, or
\entermde{Box-Plot}{box plots}, are commonly used to visualize and
compare the distribution of unimodal data. A box is drawn around the
median that extends from the 1$^{\rm st}$ to the 3$^{\rm rd}$
quartile. The whiskers mark the minimum and maximum value of the data
set (\figref{displayunivariatedatafig} (3)).
\begin{exercise}{univariatedata.m}{}
Generate 40 normally distributed random numbers with a mean of 2 and
@@ -170,13 +172,14 @@ of the data set (\figref{displayunivariatedatafig} (3)).
% \end{exercise}
\section{Distributions}
The distribution of values in a data set is estimated by histograms
(\figref{displayunivariatedatafig} (4)).
The \enterm{distribution} (\determ{Verteilung}) of values in a data
set is estimated by histograms (\figref{displayunivariatedatafig}
(4)).
\subsection{Histograms}
\enterm[histogram]{Histograms} count the frequency $n_i$ of
$N=\sum_{i=1}^M n_i$ measurements in each of $M$ bins $i$
\entermde[histogram]{Histogramm}{Histograms} count the frequency $n_i$
of $N=\sum_{i=1}^M n_i$ measurements in each of $M$ bins $i$
(\figref{diehistogramsfig} left). The bins usually tile the data
range into intervals of the same size. The width of the bins is
called the bin width. The frequencies $n_i$ plotted against the
@@ -194,13 +197,14 @@ categories $i$ is the \enterm{histogram}, or the \enterm{frequency
\end{figure}
Histograms are often used to estimate the
\enterm[probability!distribution]{probability distribution} of the
data values.
\enterm[probability!distribution]{probability distribution}
(\determ[Wahrscheinlichkeits!-verteilung]{Wahrscheinlichkeitsverteilung}) of the data values.
\subsection{Probabilities}
In the frequentist interpretation of probability, the probability of
an event (e.g. getting a six when rolling a die) is the relative
occurrence of this event in the limit of a large number of trials.
In the frequentist interpretation of probability, the
\enterm{probability} (\determ{Wahrscheinlichkeit}) of an event
(e.g. getting a six when rolling a die) is the relative occurrence of
this event in the limit of a large number of trials.
For a finite number of trials $N$ where the event $i$ occurred $n_i$
times, the probability $P_i$ of this event is estimated by
@@ -212,15 +216,16 @@ the sum of the probabilities of all possible events is one:
i.e. the probability of getting any event is one.
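The estimate $P_i = n_i/N$ can be sketched in a few lines of Python (a simulated fair die stands in for the experiment; the fixed seed is only for repeatability):

```python
# Sketch: estimating event probabilities P_i = n_i / N by counting
# relative occurrences, here for a simulated fair die.
import random
from collections import Counter

random.seed(1)
n_trials = 6000
rolls = [random.randint(1, 6) for _ in range(n_trials)]

counts = Counter(rolls)
probabilities = {face: counts[face] / n_trials for face in range(1, 7)}

# the estimates sum to one, and each is close to 1/6
print(sum(probabilities.values()))
```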
\subsection{Probability distributions of categorial data}
\subsection{Probability distributions of categorical data}
For categorial data values (e.g. the faces of a die (as integer
numbers or as colors)) a bin can be defined for each category $i$.
The histogram is normalized by the total number of measurements to
make it independent of the size of the data set
(\figref{diehistogramsfig}). After this normalization the height of
each histogram bar is an estimate of the probability $P_i$ of the
category $i$, i.e. of getting a data value in the $i$-th bin.
For \entermde[data!categorical]{Daten!kategorische}{categorical} data
values (e.g. the faces of a die (as integer numbers or as colors)) a
bin can be defined for each category $i$. The histogram is normalized
by the total number of measurements to make it independent of the size
of the data set (\figref{diehistogramsfig}). After this normalization
the height of each histogram bar is an estimate of the probability
$P_i$ of the category $i$, i.e. of getting a data value in the $i$-th
bin.
\begin{exercise}{rollthedie.m}{}
Write a function that simulates rolling a die $n$ times.
@@ -236,12 +241,14 @@ category $i$, i.e. of getting a data value in the $i$-th bin.
\subsection{Probability density functions}
In cases where we deal with data sets of measurements of a real
quantity (e.g. lengths of snakes, weights of elephants, times
between succeeding spikes) there is no natural bin width for computing
a histogram. In addition, the probability of measuring a data value that
equals exactly a specific real number like, e.g., 0.123456789 is zero, because
there are uncountably many real numbers.
In cases where we deal with
\entermde[data!continuous]{Daten!kontinuierliche}{continuous data},
(measurements of real-valued quantities, e.g. lengths of snakes,
weights of elephants, times between successive spikes) there is no
natural bin width for computing a histogram. In addition, the
probability of measuring a data value that equals exactly a specific
real number like, e.g., 0.123456789 is zero, because there are
uncountably many real numbers.
We can only ask for the probability of getting a measurement value in some
range. For example, we can ask for the probability $P(1.2<x<1.3)$ to
@@ -254,14 +261,14 @@ probability can also be expressed as $P(x_0<x<x_0 + \Delta x)$.
In the limit to very small ranges $\Delta x$ the probability of
getting a measurement between $x_0$ and $x_0+\Delta x$ scales down to
zero with $\Delta x$:
\[ P(x_0<x<x_0+\Delta x) \approx p(x_0) \cdot \Delta x \; . \]
Here the quantity $p(x_0)$ is a so-called
\enterm[probability!density]{probability density} that is larger than
zero and that describes the distribution of the data values. The
probability density is not a unitless probability with values between
0 and 1, but a number that takes on any positive real number and has
as a unit the inverse of the unit of the data values --- hence the
name ``density''.
\[ P(x_0<x<x_0+\Delta x) \approx p(x_0) \cdot \Delta x \; . \] Here
the quantity $p(x_0)$ is a so-called
\enterm[probability!density]{probability density}
(\determ[Wahrscheinlichkeits!-dichte]{Wahrscheinlichkeitsdichte}) that is larger than zero and that
describes the distribution of the data values. The probability density
is not a unitless probability with values between 0 and 1, but a
number that takes on any positive real number and has as a unit the
inverse of the unit of the data values --- hence the name ``density''.
\begin{figure}[t]
\includegraphics[width=1\textwidth]{pdfprobabilities}
@@ -282,17 +289,18 @@ the probability density over the whole real axis must be one:
\end{equation}
The function $p(x)$, which assigns to every $x$ a probability density,
is called \enterm[probability!density function]{probability density function},
\enterm[pdf|see{probability density function}]{pdf}, or just
\enterm[density|see{probability density function}]{density}
(\determ{Wahrscheinlichkeitsdichtefunktion}). The well known
\enterm{normal distribution} (\determ{Normalverteilung}) is an example of a
probability density function
is called \enterm[probability!density function]{probability density
function}, \enterm[pdf|see{probability density function}]{pdf}, or
just \enterm[density|see{probability density function}]{density}
(\determ[Wahrscheinlichkeits!-dichtefunktion]{Wahrscheinlichkeitsdichtefunktion},
\determ[Wahrscheinlichkeits!-dichte]{Wahrscheinlichkeitsdichte}). The
well known \entermde{Normalverteilung}{normal distribution} is an
example of a probability density function
\[ p_g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
--- the \enterm{Gaussian distribution}
(\determ{Gau{\ss}sche Glockenkurve}) with mean $\mu$ and standard
deviation $\sigma$.
The factor in front of the exponential function ensures the normalization to
The factor in front of the exponential function ensures normalization to
$\int p_g(x) \, dx = 1$, \eqnref{pdfnorm}.
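The normalization can be checked numerically; a Python sketch with a simple Riemann sum over a wide interval (the book's exercise does the equivalent in MATLAB):

```python
# Sketch: the Gaussian probability density p_g(x) for mu=0, sigma=1,
# and a numerical check that it integrates to one.
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Riemann sum over [-10, 10], where the tails are negligible
dx = 0.001
integral = sum(gaussian_pdf(-10.0 + i * dx) * dx for i in range(20000))
print(round(integral, 6))  # 1.0
```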
\begin{exercise}{gaussianpdf.m}{gaussianpdf.out}
@@ -322,13 +330,15 @@ values fall within each bin (\figref{pdfhistogramfig} left).
To turn such histograms to estimates of probability densities they
need to be normalized such that according to \eqnref{pdfnorm} their
integral equals one. While histograms of categorial data are
integral equals one. While histograms of categorical data are
normalized such that their sum equals one, here we need to integrate
over the histogram. The integral is the area (not the height) of the
histogram bars. Each bar has the height $n_i$ and the width $\Delta
x$. The total area $A$ of the histogram is thus
\[ A = \sum_{i=1}^M ( n_i \cdot \Delta x ) = \Delta x \sum_{i=1}^M n_i = N \, \Delta x \]
and the normalized histogram has the heights
and the
\entermde[histogram!normalized]{Histogramm!normiertes}{normalized
histogram} has the heights
\[ p(x_i) = \frac{n_i}{A} = \frac{n_i}{\Delta x \sum_{i=1}^M n_i} =
   \frac{n_i}{N \Delta x} \; .\]
A histogram needs to be divided by both the sum of the frequencies
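This normalization, $p_i = n_i/(N\,\Delta x)$, can be sketched in Python (bin edges chosen by hand; MATLAB's histogram functions offer an equivalent 'pdf' normalization):

```python
# Sketch: normalizing a histogram to a probability density so that
# the bar areas sum to one: p_i = n_i / (N * dx).
import random

random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(10000)]

dx = 0.5
n_bins = 16                      # bins covering [-4, 4]
counts = [0] * n_bins
for x in data:
    k = int((x + 4.0) // dx)     # bin index
    if 0 <= k < n_bins:
        counts[k] += 1

N = sum(counts)
density = [n / (N * dx) for n in counts]

# total area of the normalized histogram is one
print(sum(p * dx for p in density))
```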
@@ -375,14 +385,14 @@ shape histogram depends on the exact position of its bins
(here Gaussian kernels with standard deviation of $\sigma=0.2$).}
\end{figure}
To avoid this problem one can use so called \enterm{kernel densities}
for estimating probability densities from data. Here every data point
is replaced by a kernel (a function with integral one, like for
example the Gaussian) that is moved exactly to the position
indicated by the data value. Then all the kernels of all the data
values are summed up, the sum is divided by the number of data values,
and we get an estimate of the probability density
(\figref{kerneldensityfig} right).
To avoid this problem, so-called \entermde[kernel
density]{Kerndichte}{kernel densities} can be used for estimating
probability densities from data. Here every data point is replaced by
a kernel (a function with integral one, like for example the Gaussian)
that is moved exactly to the position indicated by the data
value. Then all the kernels of all the data values are summed up, the
sum is divided by the number of data values, and we get an estimate of
the probability density (\figref{kerneldensityfig} right).
As for the histogram, where we need to choose a bin width, we need to
choose the width of the kernels appropriately.
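The recipe above (one Gaussian of integral one per data point, summed and divided by the number of data points) can be sketched directly in Python (kernel width chosen by hand here):

```python
# Sketch: Gaussian kernel density estimate by direct summation.
import math

def kernel_density(data, x, sigma=0.2):
    # place a Gaussian of integral one on every data point and average
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma * len(data))
    return norm * sum(math.exp(-(x - xi) ** 2 / (2 * sigma ** 2)) for xi in data)

data = [1.0, 1.2, 1.9, 2.1, 2.2]

# the estimate itself integrates to one (numerical check on a fine grid)
dx = 0.01
area = sum(kernel_density(data, -3.0 + i * dx) * dx for i in range(800))
print(round(area, 3))  # 1.0
```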
@@ -457,7 +467,9 @@ bivariate or multivariate data sets where we have pairs or tuples of
data values (e.g. size and weight of elephants) we want to analyze
dependencies between the variables.
The \enterm[correlation!correlation coefficient]{correlation coefficient}
The
\entermde[correlation!coefficient]{Korrelation!-skoeffizient}{correlation
coefficient}
\begin{equation}
\label{correlationcoefficient}
r_{x,y} = \frac{Cov(x,y)}{\sigma_x \sigma_y} = \frac{\langle
@@ -467,8 +479,8 @@ The \enterm[correlation!correlation coefficient]{correlation coefficient}
\end{equation}
quantifies linear relationships between two variables
\matlabfun{corr()}. The correlation coefficient is the
\enterm{covariance} normalized by the standard deviations of the
single variables. Perfectly correlated variables result in a
\entermde{Kovarianz}{covariance} normalized by the standard deviations
of the single variables. Perfectly correlated variables result in a
correlation coefficient of $+1$, anti-correlated or negatively
correlated data in a correlation coefficient of $-1$, and uncorrelated
data in a correlation coefficient close to zero
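A Python sketch of the correlation coefficient computed from its definition, $r_{x,y} = Cov(x,y)/(\sigma_x \sigma_y)$ (MATLAB's \mcode{corr()} does this directly):

```python
# Sketch: Pearson's correlation coefficient from the definition
# r = Cov(x, y) / (sigma_x * sigma_y).
import math

def pearson_r(x, y):
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / n
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x) / n)
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y) / n)
    return cov / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(x, [2 * xi + 1 for xi in x]))  # close to +1: perfectly correlated
print(pearson_r(x, [-xi for xi in x]))         # close to -1: anti-correlated
```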