%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Descriptive statistics}

Descriptive statistics characterizes data sets by means of a few measures.

In addition to histograms that estimate the full distribution of the data,
the following measures are used for characterizing univariate data:
\begin{description}
\item[Location, central tendency] (``Lagema{\ss}e''):
  arithmetic mean, median, mode.
\item[Spread, dispersion] (``Streuungsma{\ss}e''): variance,
  standard deviation, inter-quartile range,\linebreak coefficient of variation
  (``Variationskoeffizient'').
\item[Shape]: skewness (``Schiefe''), kurtosis (``W\"olbung'').
\end{description}
For bivariate and multivariate data sets we can also analyse their
\begin{description}
\item[Dependence, association] (``Zusammenhangsma{\ss}e''): Pearson's
  correlation coefficient, Spearman's rank correlation coefficient.
\end{description}

The following is not a complete introduction to descriptive
statistics, but summarizes a few concepts that are most important in
daily data-analysis problems.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mean, variance, and standard deviation}

The \enterm{arithmetic mean} is a measure of location. For $n$ data values
$x_i$ the arithmetic mean is computed by
\[ \bar x = \langle x \rangle = \frac{1}{n}\sum_{i=1}^n x_i \; . \]
This computation (summing up all elements of a vector and dividing by
the length of the vector) is provided by the function \mcode{mean()}.
The mean has the same unit as the data values.
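For illustration, here is a minimal sketch (with made-up example
values) of computing the mean by explicit summation and with the
built-in \mcode{mean()} function:
\begin{verbatim}
x = [2.1, 3.4, 2.9, 4.2, 3.1];  % some example data values
mymean = sum(x) / length(x);    % sum all elements, divide by n
m = mean(x);                    % the same with the built-in function
\end{verbatim}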
The dispersion of the data values around the mean is quantified by
their \enterm{variance}
\[ \sigma^2_x = \langle (x-\langle x \rangle)^2 \rangle = \frac{1}{n}\sum_{i=1}^n (x_i - \bar x)^2 \; . \]
The variance is computed by the function \mcode{var()}.
The unit of the variance is the unit of the data values squared.
Therefore, variances cannot be compared to the mean or the data values
themselves. In particular, variances cannot be used for plotting error
bars along with the mean.
|
The standard deviation
\[ \sigma_x = \sqrt{\sigma^2_x} \; , \]
as computed by the function \mcode{std()}, however, has the same unit
as the data values and can (and should) be used to display the
dispersion of the data together with their mean.
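The following sketch illustrates the relation between \mcode{var()}
and \mcode{std()} on some arbitrary, normally distributed data:
\begin{verbatim}
x = randn(100, 1);    % 100 normally distributed data values
v = var(x);           % variance, in squared units of x
s = std(x);           % standard deviation, same unit as x
% the standard deviation is the square root of the variance:
s2 = sqrt(var(x));
\end{verbatim}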
|
The mean of a data set can be displayed by a bar plot
\matlabfun{bar()}. Additional error bars \matlabfun{errorbar()} can be
used to illustrate the standard deviation of the data
(\figref{displayunivariatedatafig} (2)).
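A minimal sketch of such a plot, assuming some arbitrary example data
in the vector \code{x}:
\begin{verbatim}
x = randn(100, 1) + 5.0;            % some example data
bar(1, mean(x));                    % a single bar showing the mean
hold on;
errorbar(1, mean(x), std(x), 'k');  % +/- one standard deviation
hold off;
\end{verbatim}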
|
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{displayunivariatedata}
  \titlecaption{\label{displayunivariatedatafig} Displaying statistics
    of univariate data.}{(1) In particular for small data sets it is
    most informative to plot the data themselves. The value of each
    data point is plotted on the y-axis. To make the data points
    overlap less, they are jittered along the x-axis by means of
    uniformly distributed random numbers \matlabfun{rand()}. (2) With
    a bar plot \matlabfun{bar()} one usually shows the mean of the
    data. The additional error bar illustrates the deviation of the
    data from the mean by $\pm$ one standard deviation. (3) A
    box-whisker plot \matlabfun{boxplot()} shows more details of the
    distribution of the data values. The box extends from the
    1$^{\rm st}$ to the 3$^{\rm rd}$ quartile, a horizontal line
    within the box marks the median value, and the whiskers extend to
    the minimum and the maximum data values. (4) The probability
    density $p(x)$ estimated from a normalized histogram shows the
    entire distribution of the data. Estimating the probability
    distribution is only meaningful for sufficiently large data sets.}
\end{figure}
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mode, median, quartile, etc.}
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{median}
  \titlecaption{\label{medianfig} Median, mean and mode of a
    probability distribution.}{Left: Median, mean and mode are
    identical for the symmetric and unimodal normal distribution.
    Right: For asymmetric distributions these three measures differ. A
    heavy tail of a distribution pulls the mean towards the tail most
    strongly. In contrast, the median is more robust against heavy
    tails, but not necessarily identical to the mode.}
\end{figure}
The \enterm{mode} is the most frequent value, i.e. the position of the
maximum of the probability distribution.
The \enterm{median} separates a list of data values into two halves
such that one half of the data is not greater and the other half is
not smaller than the median (\figref{medianfig}).
\begin{exercise}{mymedian.m}{}
  Write a function \code{mymedian()} that computes the median of a vector.
\end{exercise}
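If you get stuck, one possible sketch of such a function sorts the
data and picks the middle element (or the mean of the two middle
elements for vectors of even length):
\begin{verbatim}
function m = mymedian(x)
    xs = sort(x(:));             % sort the data values
    n = length(xs);
    if mod(n, 2) == 1            % odd number of elements:
        m = xs((n + 1) / 2);     % take the middle one
    else                         % even number of elements:
        m = 0.5 * (xs(n/2) + xs(n/2 + 1));
    end
end
\end{verbatim}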
\matlab{} provides the function \code{median()} for computing the median.
\begin{exercise}{checkmymedian.m}{}
  Write a script that tests whether your median function really
  returns a median, i.e. a value above which lie as many data values
  as below. In particular, the script should test data vectors of
  different lengths. You should not use the \mcode{median()} function
  for testing your function.

  Writing tests for your own functions is a very important strategy
  for writing reliable code!
\end{exercise}
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{quartile}
  \titlecaption{\label{quartilefig} Median and quartiles of a normal distribution.}{}
\end{figure}
The distribution of data can be further characterized by the position
of its \enterm[quartile]{quartiles}. Neighboring quartiles are
separated by 25\,\% of the data (\figref{quartilefig}).
\enterm[percentile]{Percentiles} allow us to characterize the
distribution of the data in more detail. The 3$^{\rm rd}$ quartile
corresponds to the 75$^{\rm th}$ percentile, because 75\,\% of the
data are smaller than the 3$^{\rm rd}$ quartile.
% \begin{definition}[quartile]
%   The quartiles Q1, Q2, and Q3 divide the data into four equally
%   sized groups that each contain a quarter of the data.
%   The middle quartile corresponds to the median.
% \end{definition}

% \begin{exercise}{quartiles.m}{}
%   Write a function that computes the first, second, and third quartile of a vector.
% \end{exercise}
% \begin{figure}[t]
%   \includegraphics[width=1\textwidth]{boxwhisker}
%   \titlecaption{\label{boxwhiskerfig} Box-Whisker Plot.}{Box-whisker
%     plots are well suited for comparing unimodal distributions. Each
%     box-whisker characterizes 40 random numbers that have been drawn
%     from a normal distribution.}
% \end{figure}
\enterm{Box-whisker plots} are commonly used to visualize and compare
the distribution of unimodal data. A box is drawn around the median
that extends from the 1$^{\rm st}$ to the 3$^{\rm rd}$ quartile. The
whiskers mark the minimum and maximum value of the data set
(\figref{displayunivariatedatafig} (3)).
\begin{exercise}{boxwhisker.m}{}
  Generate a $40 \times 10$ matrix of random numbers and
  illustrate their distribution in a box-whisker plot
  (\code{boxplot()} function). How do you interpret the plot?
\end{exercise}
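A minimal sketch of this exercise:
\begin{verbatim}
x = randn(40, 10);   % 40 rows of random numbers in 10 columns
boxplot(x);          % one box-whisker per column
\end{verbatim}
Note that \code{boxplot()} treats each column of the matrix as a
separate data set.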
\section{Distributions}

The distribution of values in a data set is estimated by histograms
(\figref{displayunivariatedatafig} (4)).
\subsection{Histograms}

\enterm[Histogram]{Histograms} count the frequency $n_i$ of
$N=\sum_{i=1}^M n_i$ measurements in each of $M$ bins $i$
(\figref{diehistogramsfig} left). The bins usually tile the data
range into intervals of the same size. The width of the bins is
called the bin width.
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{diehistograms}
  \titlecaption{\label{diehistogramsfig} Histograms resulting from 100
    or 500 times rolling a die.}{Left: the absolute frequency
    histogram counts the frequency of each number the die
    shows. Right: When normalized by the sum of the frequency
    histogram the two data sets become comparable with each other and
    with the expected theoretical distribution of $P=1/6$.}
\end{figure}
Histograms are often used to estimate the \enterm{probability
  distribution} of the data values.
\subsection{Probabilities}

In the frequentist interpretation of probability, the probability of
an event (e.g. getting a six when rolling a die) is the relative
occurrence of this event in the limit of a large number of trials.
For a finite number of trials $N$ where the event $i$ occurred $n_i$
times, the probability $P_i$ of this event is estimated by
\[ P_i = \frac{n_i}{N} = \frac{n_i}{\sum_{j=1}^M n_j} \; . \]
From this definition it follows that a probability is a unitless
quantity that takes on values between zero and one. Most importantly,
the sum of the probabilities of all possible events is one:
\[ \sum_{i=1}^M P_i = \sum_{i=1}^M \frac{n_i}{N} = \frac{1}{N} \sum_{i=1}^M n_i = \frac{N}{N} = 1\; , \]
i.e. the probability of getting any event is one.
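As an illustration, the probability of rolling a six can be estimated
from simulated die rolls like this (a sketch, with an arbitrary number
of trials):
\begin{verbatim}
x = randi(6, 1000, 1);   % 1000 rolls of a fair die
n6 = sum(x == 6);        % count how often the event "six" occurred
P6 = n6 / length(x);     % estimated probability, close to 1/6
\end{verbatim}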
\subsection{Probability distributions of categorical data}

For categorical data values (e.g. the faces of a die, as integer
numbers or as colors) a bin can be defined for each category $i$.
The histogram is normalized by the total number of measurements to
make it independent of the size of the data set
(\figref{diehistogramsfig}). After this normalization the height of
each histogram bar is an estimate of the probability $P_i$ of the
category $i$, i.e. of getting a data value in the $i$-th bin.
\begin{exercise}{rollthedie.m}{}
  Write a function that simulates rolling a die $n$ times.
\end{exercise}
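One possible implementation could be based on the \mcode{randi()}
function:
\begin{verbatim}
function x = rollthedie(n)
    % return a column vector of n random integers between 1 and 6:
    x = randi(6, n, 1);
end
\end{verbatim}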
\begin{exercise}{diehistograms.m}{}
  Plot histograms of 20, 100, and 1000 die rolls. Use
  \code[hist()]{hist(x)}, force six bins with
  \code[hist()]{hist(x,6)}, or define sensible bins yourself. Then
  normalize the histogram.
\end{exercise}
\subsection{Probability density functions}

In cases where we deal with data sets of measurements of a real-valued
quantity (e.g. the length of snakes, the weight of elephants, the time
between successive spikes) there is no natural bin width for computing
a histogram. In addition, the probability of measuring a data value that
equals exactly a specific real number like, e.g., 0.123456789 is zero, because
there are uncountably many real numbers.
We can only ask for the probability to get a measurement value in some
range. For example, we can ask for the probability $P(1.2<x<1.3)$ to
get a measurement between 1.2 and 1.3 (\figref{pdfprobabilitiesfig}). More
generally, we want to know the probability $P(x_0<x<x_1)$ to obtain a
measurement between $x_0$ and $x_1$. If we denote the width of the
range between $x_0$ and $x_1$ by $\Delta x = x_1 - x_0$, then this
probability can also be expressed as $P(x_0<x<x_0 + \Delta x)$.
In the limit of very small ranges $\Delta x$ the probability of
getting a measurement between $x_0$ and $x_0+\Delta x$ scales down to
zero with $\Delta x$:
\[ P(x_0<x<x_0+\Delta x) \approx p(x_0) \cdot \Delta x \; . \]
Here the quantity $p(x_0)$ is a so-called \enterm{probability
  density}. This is not a unitless probability with values between 0
and 1, but a quantity that can take on any non-negative real value and
has as a unit the inverse of the unit of the data values --- hence the
name ``density''.
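A quick numerical check of this approximation, assuming the functions
\code{normpdf()} and \code{normcdf()} of the statistics toolbox are
available:
\begin{verbatim}
x0 = 0.5;              % left border of the range
dx = 0.01;             % a small range
P = normpdf(x0) * dx;  % approximate probability p(x0)*dx
% compare with the exact integral over the range:
Pexact = normcdf(x0 + dx) - normcdf(x0);
\end{verbatim}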
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{pdfprobabilities}
  \titlecaption{\label{pdfprobabilitiesfig} Probability of a
    probability density.}{The probability of a data value $x$ between,
    e.g., zero and one is the integral (red area) over the probability
    density (blue).}
\end{figure}
The probability to get a value $x$ between $x_1$ and $x_2$ is
given by the integral over the probability density:
\[ P(x_1 < x < x_2) = \int\limits_{x_1}^{x_2} p(x) \, dx \; . \]
Because the probability to get any value $x$ at all must be one, the
probability density obeys the normalization
\begin{equation}
  \label{pdfnorm}
  P(-\infty < x < \infty) = \int\limits_{-\infty}^{+\infty} p(x) \, dx = 1 \; .
\end{equation}
\pagebreak[2]
The whole function $p(x)$, which assigns a probability density to
every value $x$, is called a \enterm{probability density function}
(\enterm[pdf|see{probability density function}]{pdf}, or short
\enterm[density|see{probability density function}]{density};
``Wahrscheinlichkeitsdichtefunktion''). The best known probability
density function is that of the \enterm{normal distribution}
\[ p_g(x) =
   \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
--- the Gaussian bell curve with mean $\mu$ and standard deviation
$\sigma$.
\begin{exercise}{gaussianpdf.m}{gaussianpdf.out}
  \begin{enumerate}
  \item Plot the probability density of the normal distribution $p_g(x)$.
  \item Compute the probability of getting a data value between zero and one
    for the normal distribution with zero mean and standard deviation of one.
  \item Draw 1000 normally distributed random numbers and use these
    numbers to calculate the probability of getting a number between
    zero and one.
  \item Compute from the normal distribution $\int_{-\infty}^{+\infty} p(x) \, dx$.
  \end{enumerate}
\end{exercise}
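A sketch of how the first two parts of this exercise could be
approached:
\begin{verbatim}
x = -4:0.01:4;                       % positions for evaluating p(x)
p = exp(-0.5*x.^2) / sqrt(2.0*pi);   % standard normal density
plot(x, p);
% probability of a value between zero and one,
% approximating the integral by a sum:
dx = 0.01;
P = sum(p((x >= 0.0) & (x < 1.0))) * dx;
\end{verbatim}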
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{pdfhistogram}
  \titlecaption{\label{pdfhistogramfig} Histograms with different bin
    widths of normally distributed data.}{Left: The height of the
    histogram bars strongly depends on the width of the bins. Right:
    If the histogram is normalized such that its integral is one we
    get an estimate of the probability density of the data values.
    The normalized histograms are comparable with each other and can
    also be compared to theoretical probability densities, like the
    normal distribution (blue).}
\end{figure}
\pagebreak[4]
\begin{exercise}{gaussianbins.m}{}
  Draw 100 random data values from a Gaussian distribution and plot
  histograms of the data with different bin sizes. What do you
  observe?
\end{exercise}
To make histograms of real-valued measurements comparable with each
other, despite different numbers of measurements and different bin
widths, and comparable with known probability density functions, they
need to be normalized such that their integral equals one
\eqnref{pdfnorm}. The integral (not the sum) over the histogram must
be one, because the probability that any of the measured values occurs
must be one. The integral is the total area of the histogram, which is
the sum of the areas of the individual histogram bars. The bars of the
histogram have height $n_i$ and width $\Delta x$. The total area $A$
of a histogram with $M$ bins is thus
\[ A = \sum_{i=1}^M ( n_i \cdot \Delta x ) = \Delta x \sum_{i=1}^M n_i \]
and the normalized histogram has the height
\[ p(x_i) = \frac{n_i}{\Delta x \sum_{i=1}^M n_i} \; . \]
The frequencies $n_i$ thus need to be divided not only by their sum,
but also by the bin width $\Delta x$ (\figref{pdfhistogramfig}).
\begin{exercise}{gaussianbinsnorm.m}{}
  Normalize the histogram of the previous exercise to a probability density.
\end{exercise}
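A sketch of the normalization, using the counts \code{n} and bin
centers \code{c} returned by \code{hist()}:
\begin{verbatim}
x = randn(100, 1);     % Gaussian data as in the previous exercise
[n, c] = hist(x, 20);  % counts n in 20 bins centered at c
dx = c(2) - c(1);      % the bin width
p = n / sum(n) / dx;   % normalize to a probability density
bar(c, p);             % the integral over this histogram is one
\end{verbatim}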
\section{Correlations}

Until now we described properties of univariate data sets. In
bivariate or multivariate data sets where we have pairs or tuples of
data values (e.g. the size and the weight of elephants) we want to analyze
dependencies between the variables.
The \enterm{correlation coefficient}
\[ r_{x,y} = \frac{Cov(x,y)}{\sigma_x \sigma_y} = \frac{\langle
  (x-\langle x \rangle)(y-\langle y \rangle) \rangle}{\sqrt{\langle
  (x-\langle x \rangle)^2 \rangle} \sqrt{\langle (y-\langle y
  \rangle)^2 \rangle}} \]
quantifies linear relationships between two variables
\matlabfun{corr()}. The correlation coefficient is the
\enterm{covariance} normalized by the standard deviations of the
single variables. Perfectly correlated variables result in a
correlation coefficient of $+1$, anti-correlated or negatively
correlated data in a correlation coefficient of $-1$, and uncorrelated
data in a correlation coefficient close to zero
(\figrefb{correlationfig}).
\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{correlation}
  \titlecaption{\label{correlationfig} Correlations between pairs of data values.}{}
\end{figure}
\begin{exercise}{correlations.m}{}
  Generate pairs of random numbers with four different correlations
  (perfectly correlated, somewhat correlated, uncorrelated, negatively
  correlated). Plot them in a scatter plot and compute their
  correlation coefficients.
\end{exercise}
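One way to generate pairs with a given amount of correlation is to mix
a common signal with independent noise, as in this sketch:
\begin{verbatim}
n = 1000;
x = randn(n, 1);       % first variable
xi = randn(n, 1);      % independent noise
y = 0.8*x + 0.6*xi;    % somewhat correlated with x
plot(x, y, '.');       % scatter plot of the pairs
r = corr(x, y);        % correlation coefficient, close to 0.8
\end{verbatim}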
Note that non-linear dependencies between two variables are
insufficiently or not at all detected by the correlation coefficient
(\figref{nonlincorrelationfig}).
\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{nonlincorrelation}
  \titlecaption{\label{nonlincorrelationfig} Correlations for
    non-linear dependencies.}{The correlation coefficient detects
    linear dependencies only. Both the quadratic dependency (left) and
    the noise correlation (right), where the dispersion of the
    $y$-values depends on the $x$-value, result in correlation
    coefficients close to zero. $\xi$ denotes normally distributed
    random numbers.}
\end{figure}