Descriptive statistics characterizes data sets by means of a few measures.
In addition to histograms that estimate the full distribution of the data,
the following measures are used for characterizing univariate data:
\begin{description}
\item[Location, central tendency] (\determ{Lagema{\ss}e}):
  \entermde[mean!arithmetic]{Mittel!arithmetisches}{arithmetic mean}, \entermde{Median}{median}, \enterm{mode}.
\item[Spread, dispersion] (\determ{Streuungsma{\ss}e}): \entermde{Varianz}{variance},
  \entermde{Standardabweichung}{standard deviation}, inter-quartile range,\linebreak \enterm{coefficient of variation} (\determ{Variationskoeffizient}).
\item[Shape]: \enterm{skewness} (\determ{Schiefe}), \enterm{kurtosis} (\determ{W\"olbung}).
\end{description}
For bivariate and multivariate data sets we can also analyse their
\begin{description}
\item[Dependence, association] (\determ{Zusammenhangsma{\ss}e}): \entermde[correlation!coefficient!Pearson's]{Korrelation!Pearson}{Pearson's correlation coefficient},
  \entermde[correlation!coefficient!Spearman's rank]{Rangkorrelationskoeffizient!Spearman'scher}{Spearman's rank correlation coefficient}.
\end{description}

The following is in no way a complete introduction to descriptive
statistics; it rather provides a few concepts and tools for
daily data-analysis problems.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mean, variance, and standard deviation}

The \entermde[mean!arithmetic]{Mittel!arithmetisches}{arithmetic mean}
is a measure of location. For $n$ data values $x_i$ the arithmetic
mean is computed by
\[ \bar x = \langle x \rangle = \frac{1}{n}\sum_{i=1}^n x_i \; . \]
This computation (summing up all elements of a vector and dividing by
the length of the vector) is provided by the function \mcode{mean()}.
The mean has the same unit as the data values.
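As a language-neutral sketch of this computation (in \matlab{} one simply calls \mcode{mean()}), the formula can be spelled out directly; the data values here are hypothetical:

```python
import statistics

x = [2.0, 1.5, 3.0, 2.5, 2.0]  # hypothetical data values

# arithmetic mean: sum of all values divided by their number
mean = sum(x) / len(x)
assert mean == statistics.mean(x)  # stdlib equivalent
print(mean)  # 2.2
```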

The dispersion of the data values around the mean is quantified by
their \entermde{Varianz}{variance}
\[ \sigma^2_x = \langle (x-\langle x \rangle)^2 \rangle = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2 \; . \]
The variance is computed by the function \mcode{var()}.
The unit of the variance is the unit of the data values squared.
Therefore, variances cannot be compared to the mean or the data values
themselves. In particular, variances cannot be used for plotting error
bars along with the mean.

In contrast to the variance, the
\entermde{Standardabweichung}{standard deviation}
\[ \sigma_x = \sqrt{\sigma^2_x} \; , \]
as computed by the function \mcode{std()}, has the same unit as the
data values and can (and should) be used to display the dispersion of
the data together with their mean.
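The two formulas above can be sketched as follows (hypothetical data values; note that Python's \varcode{pvariance}/\varcode{pstdev} implement the $1/n$ "population" normalization used here, whereas \matlab{}'s \mcode{var()} and \mcode{std()} by default normalize by $n-1$):

```python
import math
import statistics

x = [2.0, 1.5, 3.0, 2.5, 2.0]  # hypothetical data values
mean = sum(x) / len(x)

# variance: mean squared deviation from the mean (1/n normalization)
var = sum((xi - mean) ** 2 for xi in x) / len(x)
# standard deviation: square root of the variance, same unit as the data
std = math.sqrt(var)

assert math.isclose(var, statistics.pvariance(x))
assert math.isclose(std, statistics.pstdev(x))
```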

The mean of a data set can be displayed by a bar-plot
\matlabfun{bar()}. Additional errorbars \matlabfun{errorbar()} can be
used to illustrate the standard deviation of the data
(\figref{displayunivariatedatafig} (2)).
The \enterm{mode} (\determ{Modus}) is the most frequent value,
i.e. the position of the maximum of the probability distribution.

The \entermde{Median}{median} separates a list of data values into two
halves such that one half of the data is not greater and the other
half is not smaller than the median (\figref{medianfig}). The
function \mcode{median()} computes the median.
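The sorting-based computation that the \varcode{mymedian()} exercise below asks for can be sketched like this (a minimal Python sketch, not the \matlab{} solution itself):

```python
def my_median(x):
    """Median: the middle value of the sorted data, or the mean of
    the two middle values if the number of data values is even."""
    xs = sorted(x)
    n = len(xs)
    if n % 2 == 1:
        return xs[n // 2]
    return 0.5 * (xs[n // 2 - 1] + xs[n // 2])

print(my_median([3, 1, 2]))     # 2
print(my_median([4, 1, 3, 2]))  # 2.5
```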

\begin{exercise}{mymedian.m}{}
  Write a function \varcode{mymedian()} that computes the median of a vector.
\end{exercise}

\begin{exercise}{checkmymedian.m}{}
  Write a script that tests whether your median function really
  returns a median above which are the same number of data values as
  below the median.
\end{exercise}
The distribution of data can be further characterized by the position
of its \entermde[quartile]{Quartil}{quartiles}. Neighboring quartiles are
separated by 25\,\% of the data (\figref{quartilefig}).
\entermde[percentile]{Perzentil}{Percentiles} allow us to characterize the
distribution of the data in more detail. The 3$^{\rm rd}$ quartile
corresponds to the 75$^{\rm th}$ percentile, because 75\,\% of the
data are smaller than the 3$^{\rm rd}$ quartile.
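Quartiles can be computed with Python's stdlib as a quick illustration (hypothetical data; \varcode{statistics.quantiles} requires Python 3.8 or later):

```python
import statistics

x = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data values

# quartiles cut the sorted data into four parts of 25% each;
# the 2nd quartile is the median, the 3rd the 75th percentile
q1, q2, q3 = statistics.quantiles(x, n=4, method='inclusive')
assert q2 == statistics.median(x)
print(q1, q2, q3)  # 3.0 5.0 7.0
```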

% from a normal distribution.}
% \end{figure}

\entermde[box-whisker plots]{Box-Whisker-Plot}{Box-whisker plots}, or
\entermde{Box-Plot}{box plots}, are commonly used to visualize and
compare the distribution of unimodal data. A box is drawn around the
median that extends from the 1$^{\rm st}$ to the 3$^{\rm rd}$
quartile. The whiskers mark the minimum and maximum value of the data
set (\figref{displayunivariatedatafig} (3)).

\begin{exercise}{univariatedata.m}{}
  Generate 40 normally distributed random numbers with a mean of 2 and
  illustrate their distribution.
\end{exercise}

||||
\section{Distributions}
|
||||
The distribution of values in a data set is estimated by histograms
|
||||
(\figref{displayunivariatedatafig} (4)).
|
||||
The \enterm{distribution} (\determ{Verteilung}) of values in a data
|
||||
set is estimated by histograms (\figref{displayunivariatedatafig}
|
||||
(4)).
|
||||
|
||||
\subsection{Histograms}
|
||||
|
||||
\enterm[histogram]{Histograms} count the frequency $n_i$ of
|
||||
$N=\sum_{i=1}^M n_i$ measurements in each of $M$ bins $i$
|
||||
\entermde[histogram]{Histogramm}{Histograms} count the frequency $n_i$
|
||||
of $N=\sum_{i=1}^M n_i$ measurements in each of $M$ bins $i$
|
||||
(\figref{diehistogramsfig} left). The bins tile the data range
|
||||
usually into intervals of the same size. The width of the bins is
|
||||
called the bin width. The frequencies $n_i$ plotted against the
|
||||
@@ -194,13 +197,14 @@ categories $i$ is the \enterm{histogram}, or the \enterm{frequency
|
||||
\end{figure}
|
||||
|
||||
Histograms are often used to estimate the
\enterm[probability!distribution]{probability distribution}
(\determ[Wahrscheinlichkeits!-verteilung]{Wahrscheinlichkeitsverteilung})
of the data values.

\subsection{Probabilities}
In the frequentist interpretation of probability, the
\enterm{probability} (\determ{Wahrscheinlichkeit}) of an event
(e.g. getting a six when rolling a die) is the relative occurrence of
this event in the limit of a large number of trials.

For a finite number of trials $N$ where the event $i$ occurred $n_i$
times, the probability $P_i$ of this event is estimated by
\[ P_i \approx \frac{n_i}{N} \; . \]
By construction,
the sum of the probabilities of all possible events is one:
\[ \sum_i P_i = 1 \; , \]
i.e. the probability of getting any event is one.
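The frequentist estimate can be illustrated with simulated die rolls (a Python sketch; the seed and number of trials are arbitrary choices):

```python
import random

random.seed(2)
n = 10000
# simulate rolling a fair die n times and count the sixes
rolls = [random.randint(1, 6) for _ in range(n)]
p_six = rolls.count(6) / n  # relative occurrence estimates P(six) = 1/6

print(p_six)
assert abs(p_six - 1 / 6) < 0.02  # close to 1/6 for large n
```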

\subsection{Probability distributions of categorical data}

For \entermde[data!categorical]{Daten!kategorische}{categorical} data
values (e.g. the faces of a die, as integer numbers or as colors) a
bin can be defined for each category $i$. The histogram is normalized
by the total number of measurements to make it independent of the size
of the data set (\figref{diehistogramsfig}). After this normalization
the height of each histogram bar is an estimate of the probability
$P_i$ of the category $i$, i.e. of getting a data value in the $i$-th
bin.
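This normalization can be sketched for a small, hypothetical set of die rolls:

```python
from collections import Counter

# hypothetical die rolls (categorical data: faces 1 to 6)
rolls = [1, 3, 6, 6, 2, 5, 3, 3, 4, 6]
counts = Counter(rolls)  # frequency n_i of each category i
n = len(rolls)

# normalize by the total number of measurements: P_i = n_i / n
probs = {face: counts[face] / n for face in range(1, 7)}

assert abs(sum(probs.values()) - 1.0) < 1e-12  # probabilities sum to one
print(probs[6])  # 0.3: three sixes in ten rolls
```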

\begin{exercise}{rollthedie.m}{}
  Write a function that simulates rolling a die $n$ times.
\end{exercise}

\subsection{Probability density functions}

In cases where we deal with
\entermde[data!continuous]{Daten!kontinuierliche}{continuous data}
(measurements of real-valued quantities, e.g. lengths of snakes,
weights of elephants, times between succeeding spikes) there is no
natural bin width for computing a histogram. In addition, the
probability of measuring a data value that equals exactly a specific
real number like, e.g., 0.123456789 is zero, because there are
uncountably many real numbers.

We can only ask for the probability to get a measurement value in some
range. For example, we can ask for the probability $P(1.2<x<1.3)$ to
obtain a measurement between 1.2 and 1.3. Such a
probability can also be expressed as $P(x_0<x<x_0 + \Delta x)$.

In the limit to very small ranges $\Delta x$ the probability of
getting a measurement between $x_0$ and $x_0+\Delta x$ scales down to
zero with $\Delta x$:
\[ P(x_0<x<x_0+\Delta x) \approx p(x_0) \cdot \Delta x \; . \]
Here the quantity $p(x_0)$ is a so-called
\enterm[probability!density]{probability density}
(\determ[Wahrscheinlichkeits!-dichte]{Wahrscheinlichkeitsdichte}) that
is larger than zero and that describes the distribution of the data
values. The probability density is not a unitless probability with
values between 0 and 1, but a number that takes on any positive real
number and has as a unit the inverse of the unit of the data values
--- hence the name ``density''.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{pdfprobabilities}
\end{figure}

The integral of
the probability density over the whole real axis must be one:
\begin{equation}
  \label{pdfnorm}
  \int_{-\infty}^{+\infty} p(x) \, dx = 1 \; .
\end{equation}


The function $p(x)$, that assigns to every $x$ a probability density,
is called \enterm[probability!density function]{probability density
function}, \enterm[pdf|see{probability density function}]{pdf}, or
just \enterm[density|see{probability density function}]{density}
(\determ[Wahrscheinlichkeits!-dichtefunktion]{Wahrscheinlichkeitsdichtefunktion},
\determ[Wahrscheinlichkeits!-dichte]{Wahrscheinlichkeitsdichte}). The
well known \entermde{Normalverteilung}{normal distribution} is an
example of a probability density function
\[ p_g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
--- the \enterm{Gaussian distribution}
(\determ{Gau{\ss}sche Glockenkurve}) with mean $\mu$ and standard
deviation $\sigma$.
The factor in front of the exponential function ensures normalization to
$\int p_g(x) \, dx = 1$, \eqnref{pdfnorm}.
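The normalization can be checked numerically; the following Python sketch approximates the integral of the standard Gaussian by a Riemann sum over a wide range:

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Gaussian probability density with mean mu and standard deviation sigma."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) \
        / math.sqrt(2 * math.pi * sigma ** 2)

# Riemann sum over [-10, 10] approximates the integral over the real axis,
# because the Gaussian tails beyond ten standard deviations are negligible
dx = 0.01
integral = sum(gauss_pdf(k * dx) * dx for k in range(-1000, 1000))
assert abs(integral - 1.0) < 1e-6  # normalized to one
```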
\begin{exercise}{gaussianpdf.m}{gaussianpdf.out}
  Plot the Gaussian probability density function.
\end{exercise}

For histograms of continuous data, the data range is divided into
bins and it is counted how many data values fall within each bin
(\figref{pdfhistogramfig} left).

To turn such histograms into estimates of probability densities they
need to be normalized such that according to \eqnref{pdfnorm} their
integral equals one. While histograms of categorical data are
normalized such that their sum equals one, here we need to integrate
over the histogram. The integral is the area (not the height) of the
histogram bars. Each bar has the height $n_i$ and the width $\Delta
x$. The total area $A$ of the histogram is thus
\[ A = \sum_{i=1}^M ( n_i \cdot \Delta x ) = \Delta x \sum_{i=1}^M n_i = N \, \Delta x \]
and the
\entermde[histogram!normalized]{Histogramm!normiertes}{normalized
histogram} has the heights
\[ p(x_i) = \frac{n_i}{A} = \frac{n_i}{\Delta x \sum_{i=1}^M n_i} =
\frac{n_i}{N \Delta x} \; .\]
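The normalization $p(x_i) = n_i/(N \Delta x)$ can be sketched on hypothetical data (bin edges and bin width are arbitrary choices for the illustration):

```python
import math

# hypothetical continuous data values
data = [1.1, 1.2, 1.4, 1.8, 2.0, 2.1, 2.3, 2.4, 2.6, 3.3]
x0, dx, m = 1.0, 0.5, 5  # left edge, bin width, number of bins

# count the frequencies n_i in each bin
counts = [0] * m
for x in data:
    counts[int((x - x0) / dx)] += 1

n = sum(counts)
# normalized histogram: p_i = n_i / (n * dx), so that sum(p_i * dx) = 1
density = [c / (n * dx) for c in counts]
assert math.isclose(sum(p * dx for p in density), 1.0)
```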
A histogram needs to be divided by both the sum of the frequencies
$\sum_i n_i = N$ and the bin width $\Delta x$ in order to result in an
estimate of the probability density. The shape of the resulting
histogram, however, depends on the exact position of its bins
(\figref{kerneldensityfig} left).

To avoid this problem, so-called \entermde[kernel
density]{Kerndichte}{kernel densities} can be used for estimating
probability densities from data. Here every data point is replaced by
a kernel (a function with integral one, like for example the Gaussian)
that is moved exactly to the position indicated by the data
value. Then all the kernels of all the data values are summed up, the
sum is divided by the number of data values, and we get an estimate of
the probability density (\figref{kerneldensityfig} right).
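The steps above — one Gaussian kernel per data value, summed and divided by the number of values — can be sketched as follows (hypothetical data values and kernel width):

```python
import math

def kernel_density(xs, data, sigma):
    """Sum a Gaussian kernel of width sigma centered on every data
    value and divide by the number of data values."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * sigma * len(data))
    return [norm * sum(math.exp(-0.5 * ((x - d) / sigma) ** 2) for d in data)
            for x in xs]

data = [1.2, 1.9, 2.1, 2.4, 3.0]            # hypothetical data values
xs = [k * 0.01 for k in range(-500, 1000)]  # evaluation grid
p = kernel_density(xs, data, sigma=0.2)

# the estimate integrates to one, like any probability density
assert abs(sum(pi * 0.01 for pi in p) - 1.0) < 1e-3
```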

As for the histogram, where we need to choose a bin width, we need to
choose the width of the kernels appropriately.

\section{Correlations}

In bivariate or multivariate data sets, where we have pairs or tuples of
data values (e.g. size and weight of elephants), we want to analyze
dependencies between the variables.

The
\entermde[correlation!coefficient]{Korrelation!-skoeffizient}{correlation
coefficient}
\begin{equation}
  \label{correlationcoefficient}
  r_{x,y} = \frac{Cov(x,y)}{\sigma_x \sigma_y} = \frac{\langle
    (x - \langle x \rangle)(y - \langle y \rangle) \rangle}{\sigma_x \sigma_y}
\end{equation}
quantifies linear relationships between two variables
\matlabfun{corr()}. The correlation coefficient is the
\entermde{Kovarianz}{covariance} normalized by the standard deviations
of the single variables. Perfectly correlated variables result in a
correlation coefficient of $+1$, anti-correlated or negatively
correlated data in a correlation coefficient of $-1$, and uncorrelated
data in a correlation coefficient close to zero.
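The definition in \eqnref{correlationcoefficient} can be spelled out directly (a Python sketch of the formula, not \matlab{}'s \mcode{corr()}; the test data are hypothetical):

```python
import math

def corr_coef(x, y):
    """Correlation coefficient: the covariance normalized by the
    standard deviations of the single variables."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
assert math.isclose(corr_coef(x, [2 * xi + 1 for xi in x]), 1.0)  # perfectly correlated
assert math.isclose(corr_coef(x, [-xi for xi in x]), -1.0)        # anti-correlated
```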