%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Descriptive statistics}

Descriptive statistics characterizes data sets by means of a few
measures.

In addition to histograms that visualize the distribution of the data,
the following measures are used for characterizing the data:
\begin{description}
\item[Location, central tendency] (``Lagema{\ss}e''):
  arithmetic mean, median, mode.
\item[Spread, dispersion] (``Streuungsma{\ss}e''): variance,
  standard deviation, inter-quartile range,\linebreak coefficient of variation
  (``Variationskoeffizient'').
\item[Shape]: skewness (``Schiefe''), kurtosis (``W\"olbung'').
\item[Dependence, association] (``Zusammenhangsma{\ss}e''): Pearson's
  correlation coefficient, Spearman's rank correlation coefficient.
\end{description}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mode, median, quartile, etc.}

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{median}
  \titlecaption{\label{medianfig} Median, mean and mode of a
    probability distribution.}{Left: Median, mean and mode are
    identical for the symmetric and unimodal normal distribution.
    Right: for asymmetric distributions these three measures differ. A
    heavy tail of a distribution pulls the mean most strongly toward
    it. In contrast, the median is more robust against heavy tails,
    but it is not necessarily identical to the mode.}
\end{figure}

The \enterm{mode} is the most frequent value, i.e. the position of the
maximum of the probability distribution.

The \enterm{median} separates a list of data values into two halves
such that one half of the data is not greater and the other half is
not smaller than the median (\figref{medianfig}).

\newpage
\begin{exercise}{mymedian.m}{}
  Write a function \code{mymedian()} that computes the median of a vector.
\end{exercise}

\matlab{} provides the function \code{median()} for computing the median.
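
As a sketch of the definition of the median given above, assume the
$N$ data values have been sorted in ascending order, $x_1 \le x_2 \le
\ldots \le x_N$ (this sorted notation is an assumption introduced here
for illustration only). One common convention, which may differ in
detail from other definitions, then takes the median as
\[ x_{\rm median} = \left\{ \begin{array}{ll}
  x_{(N+1)/2} & \mbox{if $N$ is odd} \\[1ex]
  \frac{1}{2}\left(x_{N/2} + x_{N/2+1}\right) & \mbox{if $N$ is even}
  \end{array} \right. \; . \]
For example, the five values $1, 1, 3, 4, 5$ have the median $x_3 =
3$, whereas the four values $1, 1, 3, 4$ have the median
$\frac{1}{2}(x_2 + x_3) = \frac{1}{2}(1 + 3) = 2$.
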
\newpage
\begin{exercise}{checkmymedian.m}{}
  Write a script that tests whether your median function really
  returns a value with as many data values above it as below it. In
  particular, the script should test data vectors of different
  lengths.
\end{exercise}

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{quartile}
  \titlecaption{\label{quartilefig} Median and quartiles of a normal distribution.}{}
\end{figure}

The distribution of data can be further characterized by the position
of its \enterm[quartile]{quartiles}. Neighboring quartiles are
separated by 25\,\% of the data (\figref{quartilefig}).
\enterm[percentile]{Percentiles} characterize the distribution of the
data in more detail. The 3$^{\rm rd}$ quartile corresponds to the
75$^{\rm th}$ percentile, because 75\,\% of the data are smaller than
the 3$^{\rm rd}$ quartile.

% \begin{definition}[quartile]
%   The quartiles Q1, Q2 and Q3 divide the data into four equally
%   sized groups, each containing a quarter of the data.
%   The middle quartile corresponds to the median.
% \end{definition}

% \begin{exercise}{quartiles.m}{}
%   Write a function that computes the first, second, and third quartile of a vector.
% \end{exercise}

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{boxwhisker}
  \titlecaption{\label{boxwhiskerfig} Box-whisker plot.}{Box-whisker
    plots are well suited for comparing unimodal distributions. Each
    box-whisker characterizes 40 random numbers that have been drawn
    from a normal distribution.}
\end{figure}

\enterm{Box-whisker plots} are commonly used to visualize and compare
the distribution of unimodal data. A box is drawn around the median
that extends from the 1$^{\rm st}$ to the 3$^{\rm rd}$ quartile. The
whiskers mark the minimum and maximum value of the data set
(\figref{boxwhiskerfig}).

\begin{exercise}{boxwhisker.m}{}
  Generate a $40 \times 10$ matrix of random numbers and
  illustrate their distribution in a box-whisker plot
  (\code{boxplot()} function).
  How should the plot be interpreted?
\end{exercise}

\section{Histograms}

\enterm[Histogram]{Histograms} count the frequencies $n_i$ with which
$N=\sum_{i=1}^M n_i$ measurements fall into each of $M$ bins $i$. The
bins usually tile the data range into intervals of equal size.
Histograms are often used to estimate the \enterm{probability
  distribution} of the data values.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{diehistograms}
  \titlecaption{\label{diehistogramsfig} Histograms resulting from
    rolling a die 100 or 500 times.}{Left: the absolute frequency
    histogram counts the frequency of each number the die shows.
    Right: When normalized by the sum of the frequency histogram, the
    two data sets become comparable with each other and with the
    expected theoretical distribution of $P=1/6$.}
\end{figure}

For integer data values (e.g. the number shown on the face of a die or
the number of action potentials occurring within a fixed time window)
a bin can be defined for each data value. The histogram is usually
normalized by the total number of measurements to make it independent
of the size of the data set (\figref{diehistogramsfig}). Then the
height of each histogram bar is an estimate of the probability
$P(x_i)$ of the data value $x_i$ in the $i$-th bin:
\[ P(x_i) = P_i = \frac{n_i}{N} = \frac{n_i}{\sum_{j=1}^M n_j} \; . \]

\begin{exercise}{rollthedie.m}{}
  Write a function that simulates rolling a die $n$ times.
\end{exercise}

\begin{exercise}{diehistograms.m}{}
  Plot histograms of rolling a die 20, 100, and 1000 times. Use
  \code[hist()]{hist(x)}, force six bins with
  \code[hist()]{hist(x,6)}, or set meaningful bins yourself.
  Then normalize the histogram.
\end{exercise}

\section{Probability density functions}

In most cases, however, we are dealing with real-valued measurements
(e.g. the weight of tigers, the length of interspike intervals). It
makes no sense to assign a probability to the occurrence of each
individual real number, because the probability of measuring exactly
the value of a particular real number, e.g. 1.23456789, is zero, since
there are uncountably many real numbers. It makes more sense to ask
for the probability of obtaining a number from a certain range,
e.g. the probability $P(1.2