%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Descriptive statistics}
\label{descriptivestatisticschapter}
\exercisechapter{Descriptive statistics}

Descriptive statistics characterizes data sets by means of a few
measures. In addition to histograms that estimate the full
distribution of the data, the following measures are used for
characterizing univariate data:
\begin{description}
\item[Location, central tendency] (\determ{Lagema{\ss}e}):
  \entermde[mean!arithmetic]{Mittel!arithmetisches}{arithmetic mean},
  \entermde{Median}{median}, \enterm{mode}.
\item[Spread, dispersion] (\determ{Streuungsma{\ss}e}):
  \entermde{Varianz}{variance},
  \entermde{Standardabweichung}{standard deviation}, inter-quartile
  range,\linebreak \enterm{coefficient of variation}
  (\determ{Variationskoeffizient}).
\item[Shape]: \enterm{skewness} (\determ{Schiefe}), \enterm{kurtosis}
  (\determ{W\"olbung}).
\end{description}
For bivariate and multivariate data sets we can also analyse their
\begin{description}
\item[Dependence, association] (\determ{Zusammenhangsma{\ss}e}):
  \entermde[correlation!coefficient!Pearson's]{Korrelation!Pearson}{Pearson's correlation coefficient},
  \entermde[correlation!coefficient!Spearman's rank]{{Rangkorrelationskoeffizient!Spearman'scher}}{Spearman's rank correlation coefficient}.
\end{description}

The following is in no way a complete introduction to descriptive
statistics, but summarizes a few concepts that are most important in
daily data-analysis problems.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mean, variance, and standard deviation}

The \entermde[mean!arithmetic]{Mittel!arithmetisches}{arithmetic mean}
is a measure of location. For $N$ data values $x_i$ the arithmetic
mean is computed by
\[ \bar x = \langle x \rangle = \frac{1}{N}\sum_{i=1}^N x_i \; . \]
This computation (summing up all elements of a vector and dividing by
the length of the vector) is provided by the function
\mcode{mean()}. The mean has the same unit as the data values.

The dispersion of the data values around the mean is quantified by
their \entermde{Varianz}{variance}
\[ \sigma^2_x = \langle (x-\langle x \rangle)^2 \rangle = \frac{1}{N}\sum_{i=1}^N (x_i - \bar x)^2 \; . \]
The variance is computed by the function \mcode{var()}. The unit of
the variance is the unit of the data values squared. Therefore,
variances cannot be compared to the mean or the data values
themselves. In particular, variances cannot be used for plotting
error bars along with the mean.

In contrast to the variance, the
\entermde{Standardabweichung}{standard deviation}
\[ \sigma_x = \sqrt{\sigma^2_x} \; , \]
as computed by the function \mcode{std()}, has the same unit as the
data values and can (and should) be used to display the dispersion of
the data together with their mean.

The mean of a data set can be displayed by a bar-plot
\matlabfun{bar()}. Additional errorbars \matlabfun{errorbar()} can be
used to illustrate the standard deviation of the data
(\figref{displayunivariatedatafig} (2)).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{displayunivariatedata}
  \titlecaption{\label{displayunivariatedatafig} Displaying
    statistics of univariate data.}{(1) In particular for small data
    sets it is most informative to plot the data themselves. The
    value of each data point is plotted on the y-axis.
    To make the data points overlap less, they are jittered along the
    x-axis by means of uniformly distributed random numbers
    \matlabfun{rand()}. (2) With a bar plot \matlabfun{bar()} one
    usually shows the mean of the data. The additional errorbar
    illustrates the deviation of the data from the mean by $\pm$ one
    standard deviation. In case of non-normal data, mean and standard
    deviation only poorly characterize the distribution of the data
    values. (3) A box-whisker plot \matlabfun{boxplot()} shows more
    details of the distribution of the data values. The box extends
    from the 1$^{\rm st}$ to the 3$^{\rm rd}$ quartile, a horizontal
    line within the box marks the median value, and the whiskers
    extend to the minimum and the maximum data values. (4) The
    probability density $p(x)$ estimated from a normalized histogram
    shows the entire distribution of the data. Estimating the
    probability distribution is only meaningful for sufficiently
    large data sets.}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Mode, median, quartile, etc.}

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{median}
  \titlecaption{\label{medianfig} Median, mean and mode of a
    probability distribution.}{Left: Median, mean and mode coincide
    for the symmetric and unimodal normal distribution. Right: for
    asymmetric distributions these three measures differ. A heavy
    tail of a distribution pulls out the mean most strongly. In
    contrast, the median is more robust against heavy tails, but not
    necessarily identical with the mode.}
\end{figure}

The \enterm{mode} (\determ{Modus}) is the most frequent value,
i.e. the position of the maximum of the probability distribution.

The \entermde{Median}{median} separates a list of data values into
two halves such that one half of the data is not greater and the
other half is not smaller than the median (\figref{medianfig}). The
function \mcode{median()} computes the median.

\begin{exercise}{mymedian.m}{}
  Write a function \varcode{mymedian()} that computes the median of a
  vector.
\end{exercise}

\begin{exercise}{checkmymedian.m}{}
  Write a script that tests whether your median function really
  returns a median, i.e. a value above which lie as many data values
  as below. In particular the script should test data vectors of
  different lengths. You should not use the \mcode{median()} function
  for testing your function. Writing tests for your own functions is
  a very important strategy for writing reliable code!
\end{exercise}

The distribution of data can be further characterized by the position
of its \entermde[quartile]{Quartil}{quartiles}. Neighboring quartiles
are separated by 25\,\% of the data.
% (\figref{quartilefig}).
\entermde[percentile]{Perzentil}{Percentiles} allow us to characterize
the distribution of the data in more detail. The 3$^{\rm rd}$ quartile
corresponds to the 75$^{\rm th}$ percentile, because 75\,\% of the
data are smaller than the 3$^{\rm rd}$ quartile.

% \begin{definition}[quartile]
%   The quartiles Q1, Q2, and Q3 divide the data into four equally
%   sized groups, each containing one quarter of the data. The middle
%   quartile corresponds to the median.
% \end{definition}

% \begin{exercise}{quartiles.m}{}
%   Write a function that computes the first, second, and third
%   quartile of a vector.
% \end{exercise}

% \begin{figure}[t]
%   \includegraphics[width=1\textwidth]{boxwhisker}
%   \titlecaption{\label{boxwhiskerfig} Box-Whisker Plot.}{Box-whisker
%     plots are well suited for comparing unimodal distributions.
%     Each box-whisker characterizes 40 random numbers that have been
%     drawn from a normal distribution.}
% \end{figure}

\entermde[box-whisker plots]{Box-Whisker-Plot}{Box-whisker plots}, or
\entermde{Box-Plot}{box plots}, are commonly used to visualize and
compare the distribution of unimodal data. A box is drawn around the
median that extends from the 1$^{\rm st}$ to the 3$^{\rm rd}$
quartile. The whiskers mark the minimum and maximum value of the data
set (\figref{displayunivariatedatafig} (3)).

% \begin{figure}[t]
%   \includegraphics[width=1\textwidth]{quartile}
%   \titlecaption{\label{quartilefig} Median and quartiles of a normal
%     distribution.}{The interquartile range between the first and the
%     third quartile contains 50\,\% of the data and contains the
%     median.}
% \end{figure}

% \begin{exercise}{boxwhisker.m}{}
%   Generate a $40 \times 10$ matrix of random numbers and illustrate
%   their distribution in a box-whisker plot (\code{boxplot()}
%   function). How to interpret the plot?
% \end{exercise}

\section{Distributions}

The \enterm{distribution} (\determ{Verteilung}) of values in a data
set is estimated by histograms (\figref{displayunivariatedatafig}
(4)).

\subsection{Histograms}

\entermde[histogram]{Histogramm}{Histograms} count the frequency
$n_i$ of measurements falling into each of $M$ bins $i$, with
$N=\sum_{i=1}^M n_i$ being the total number of measurements
(\figref{diehistogramsfig} left). The bins usually tile the data
range into intervals of equal size. The width of the bins is called
the bin width. The frequencies $n_i$ plotted against the categories
$i$ are the \enterm{histogram}, or the \enterm{frequency histogram}.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{diehistograms}
  \titlecaption{\label{diehistogramsfig} Histograms resulting from
    100 or 500 times rolling a die.}{Left: the absolute frequency
    histogram counts the frequency of each number the die
    shows. Right: when normalized by the sum of all frequencies the
    two data sets become comparable with each other and with the
    expected theoretical distribution of $P=1/6$.}
\end{figure}

Histograms are often used to estimate the
\enterm[probability!distribution]{probability distribution}
(\determ[Wahrscheinlichkeits!-verteilung]{Wahrscheinlichkeitsverteilung})
of the data values.

\begin{exercise}{univariatedata.m}{}
  Generate 40 normally distributed random numbers with a mean of 2
  and illustrate their distribution in a box-whisker plot
  (\code{boxplot()} function), with a bar and errorbar illustrating
  the mean and standard deviation (\code{bar()}, \code{errorbar()}),
  and the data themselves jittered randomly (as in
  \figref{displayunivariatedatafig}). How to interpret the different
  plots?
\end{exercise}

\subsection{Probabilities}

In the frequentist interpretation of probability, the
\enterm{probability} (\determ{Wahrscheinlichkeit}) of an event
(e.g. getting a six when rolling a die) is the relative occurrence of
this event in the limit of a large number of trials.

For a finite number of trials $N$ where the event $i$ occurred $n_i$
times, the probability $P_i$ of this event is estimated by
\[ P_i = \frac{n_i}{N} = \frac{n_i}{\sum_{j=1}^M n_j} \; . \]
From this definition it follows that a probability is a unitless
quantity that takes on values between zero and one. Most importantly,
the sum of the probabilities of all possible events is one:
\[ \sum_{i=1}^M P_i = \sum_{i=1}^M \frac{n_i}{N} = \frac{1}{N} \sum_{i=1}^M n_i = \frac{N}{N} = 1\; , \]
i.e. the probability of getting any event is one.
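For illustration, here is a minimal sketch (not part of the
exercises; variable names are chosen ad hoc) that estimates the
probabilities of the six faces of a die from the relative frequencies
of simulated rolls, using \matlabfun{randi()} to draw uniformly
distributed integers:

\begin{lstlisting}
n = 1000;              % number of rolls
x = randi(6, n, 1);    % n rolls of a fair die: integers 1 to 6
ni = hist(x, 1:6);     % frequency n_i of each face i
Pi = ni / n;           % estimated probabilities P_i = n_i/N
sum(Pi)                % the estimated probabilities sum to one
\end{lstlisting}

With an increasing number of rolls the estimated $P_i$ converge to
the theoretical value of $1/6$ of a fair die.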
\subsection{Probability distributions of categorical data}

For \entermde[data!categorical]{Daten!kategorische}{categorical} data
values (e.g. the faces of a die (as integer numbers or as colors)) a
bin can be defined for each category $i$. The histogram is normalized
by the total number of measurements to make it independent of the
size of the data set (\figref{diehistogramsfig}). After this
normalization the height of each histogram bar is an estimate of the
probability $P_i$ of the category $i$, i.e. of getting a data value
in the $i$-th bin.

\begin{exercise}{rollthedie.m}{}
  Write a function that simulates rolling a die $n$ times.
\end{exercise}

\begin{exercise}{diehistograms.m}{}
  Plot histograms for 20, 100, and 1000 times rolling a die. Use
  \code[hist()]{hist(x)}, enforce six bins with
  \code[hist()]{hist(x,6)}, or set useful bins yourself. Normalize
  the histograms appropriately.
\end{exercise}

\subsection{Probability density functions}

In cases where we deal with
\entermde[data!continuous]{Daten!kontinuierliche}{continuous data}
(measurements of real-valued quantities, e.g. lengths of snakes,
weights of elephants, times between successive spikes), there is no
natural bin width for computing a histogram. In addition, the
probability of measuring a data value that equals exactly a specific
real number like, e.g., 0.123456789 is zero, because there are
uncountably many real numbers.

We can only ask for the probability to get a measurement value in
some range. For example, we can ask for the probability $P(0<x<1)$ to
get a measurement between 0 and 1
(\figref{pdfprobabilitiesfig}). More generally, we want to know the
probability $P(x_0<x<x_1)$ to obtain a measurement between $x_0$ and
$x_1$. If we define the width of the range between $x_0$ and $x_1$ as
$\Delta x = x_1 - x_0$ then the probability can also be expressed as
$P(x_0<x<x_0 + \Delta x)$.

In the limit to very small ranges $\Delta x$ the probability of
getting a measurement between $x_0$ and $x_0+\Delta x$ scales down to
zero with $\Delta x$:
\[ P(x_0<x<x_0+\Delta x) \approx p(x_0) \cdot \Delta x \; . \]
Here the quantity $p(x_0)$ is a so-called
\enterm[probability!density]{probability density}
(\determ[Wahrscheinlichkeits!-dichte]{Wahrscheinlichkeitsdichte})
that is non-negative and that describes the distribution of the data
values. The probability density is not a unitless probability with
values between 0 and 1, but a quantity that can take on any
non-negative real value and has as its unit the inverse of the unit of
the data values --- hence the name ``density''.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{pdfprobabilities}
  \titlecaption{\label{pdfprobabilitiesfig} Probability obtained from
    a probability density.}{The probability of a data value $x$
    between, e.g., zero and one is the integral (red area) of the
    probability density (blue).}
\end{figure}

The probability to get a value $x$ between $x_1$ and $x_2$ is given
by an integral over the probability density:
\[ P(x_1 < x < x_2) = \int\limits_{x_1}^{x_2} p(x) \, dx \; . \]
Because the probability to get any value $x$ is one, the integral of
the probability density over the whole real axis must be one:
\begin{equation}
  \label{pdfnorm}
  P(-\infty < x < \infty) = \int\limits_{-\infty}^{+\infty} p(x) \, dx = 1 \; .
\end{equation}

The function $p(x)$, which assigns to every $x$ a probability
density, is called \enterm[probability!density function]{probability
  density function}, \enterm[pdf|see{probability density
  function}]{pdf}, or just \enterm[density|see{probability density
  function}]{density}
(\determ[Wahrscheinlichkeits!-dichtefunktion]{Wahrscheinlichkeitsdichtefunktion},
\determ[Wahrscheinlichkeits!-dichte]{Wahrscheinlichkeitsdichte}). The
well-known \entermde{Normalverteilung}{normal distribution} is an
example of a probability density function
\[ p_g(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]
--- the \enterm{Gaussian distribution}
(\determ{Gau{\ss}sche-Glockenkurve}) with mean $\mu$ and standard
deviation $\sigma$. The factor in front of the exponential function
ensures normalization to $\int p_g(x) \, dx = 1$, \eqnref{pdfnorm}.

\begin{exercise}{gaussianpdf.m}{gaussianpdf.out}
  \begin{enumerate}
  \item Plot the probability density of the normal distribution $p_g(x)$.
  \item Compute the probability of getting a data value between zero
    and one for the normal distribution with zero mean and standard
    deviation of one.
  \item Draw 1000 normally distributed random numbers and use these
    numbers to calculate the probability of getting a number between
    zero and one.
  \item Compute from the normal distribution
    $\int_{-\infty}^{+\infty} p(x) \, dx$.
  \end{enumerate}
\end{exercise}

Histograms of real valued data depend on both the number of data
values \emph{and} the chosen bin width. As in the example with the
die (\figref{diehistogramsfig} left), the height of the histogram
bars increases with the size of the data set. Also, as the bin width
is increased the height of the histogram increases, because more data
values fall within each bin (\figref{pdfhistogramfig} left).

\begin{exercise}{gaussianbins.m}{}
  Draw 100 random numbers from a Gaussian distribution and plot
  histograms of the data with different bin widths. What do you
  observe?
\end{exercise}

To turn such histograms into estimates of probability densities they
need to be normalized such that according to \eqnref{pdfnorm} their
integral equals one. While histograms of categorical data are
normalized such that their sum equals one, here we need to integrate
over the histogram. The integral is the area (not the height) of the
histogram bars. Each bar has the height $n_i$ and the width $\Delta
x$. The total area $A$ of the histogram is thus
\[ A = \sum_{i=1}^M ( n_i \cdot \Delta x ) = \Delta x \sum_{i=1}^M n_i = N \, \Delta x \]
and the \entermde[histogram!normalized]{Histogramm!normiertes}{normalized histogram} has the heights
\[ p(x_i) = \frac{n_i}{A} = \frac{n_i}{\Delta x \sum_{i=1}^M n_i} = \frac{n_i}{N \Delta x} \; .\]
A histogram needs to be divided by both the sum of the frequencies
$n_i$ and the bin width $\Delta x$ to result in an estimate of the
corresponding probability density. Only then can the distribution be
compared with other distributions and in particular with theoretical
probability density functions like the one of the normal distribution
(\figref{pdfhistogramfig} right).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{pdfhistogram}
  \titlecaption{\label{pdfhistogramfig} Histograms with different bin
    widths of normally distributed data.}{Left: The height of the
    histogram bars strongly depends on the width of the bins. Right:
    If the histogram is normalized such that its integral is one we
    get an estimate of the probability density of the data values.
    The normalized histograms are comparable with each other and can
    also be compared to theoretical probability densities, like the
    normal distributions (blue).}
\end{figure}

\begin{exercise}{gaussianbinsnorm.m}{}
  Normalize the histogram of the previous exercise to a probability
  density.
\end{exercise}

\subsection{Kernel densities}

A problem of using histograms for estimating probability densities is
that they have hard bin edges. Depending on where the bin edges are
placed a data value falls in one or the other bin. As a result the
shape of the resulting histogram depends on the exact position of its
bins (\figref{kerneldensityfig} left).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{kerneldensity}
  \titlecaption{\label{kerneldensityfig} Kernel densities.}{The
    histogram-based estimation of the probability density depends on
    the position of the bins (left). In the bottom plot the bins have
    been shifted by half a bin width (here $\Delta x=0.4$) and as a
    result details of the probability density look different. Look,
    for example, at the height of the largest bin. In contrast, a
    kernel density is uniquely defined for a given kernel width
    (right, Gaussian kernels with standard deviation of $\sigma=0.2$).}
\end{figure}

To avoid this problem so-called \entermde[kernel
density]{Kerndichte}{kernel densities} can be used for estimating
probability densities from data. Here every data point is replaced by
a kernel (a function with integral one, like for example the
Gaussian) that is centered at the position of the data value. Then
the kernels of all the data values are summed up, the sum is divided
by the number of data values, and we get an estimate of the
probability density (\figref{kerneldensityfig} right). As with the
bin width of histograms, the width of the kernels needs to be chosen
appropriately.

\begin{exercise}{gaussiankerneldensity.m}{}
  Construct and plot a kernel density of the data from the previous
  two exercises.
\end{exercise}

\subsection{Cumulative distributions}

The \enterm{cumulative distribution function},
\enterm[cdf|see{cumulative distribution function}]{cdf}, or
\enterm[cumulative density function|see{cumulative distribution
  function}]{cumulative density function} (\determ{kumulative
  Verteilung}) is the integral over the probability density up to any
value $x$:
\[ F(x) = \int_{-\infty}^x p(x') \, dx' \; . \]
As such the cumulative distribution is a probability. It is the
probability of getting a value smaller than $x$.

For estimating the cumulative distribution from a set of data values
we do not need to rely on histograms or kernel densities. Instead, it
can be computed from the data directly without the need of choosing a
bin width or a kernel width. For a data set of $N$ data values $x_i$
the probability of a data value smaller than $x$ is the number of
data points with values smaller than $x$ divided by $N$. If we sort
the data values, then at each data value $x_i$ the number of data
elements smaller than $x_i$ is increased by one and the corresponding
probability of getting a value smaller than $x_i$ is increased by
$1/N$. That is, the cumulative distribution is
\[ F(x_i) = \frac{i}{N} \; . \]
See \figref{cumulativefig} for an example.

The cumulative distribution tells you the fraction of data values
that are below a certain value and can therefore be used to evaluate
significance based on null hypotheses constructed from the data, as
it is done with bootstrap methods (see chapter
\ref{bootstrapchapter}).
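As a minimal sketch (using normally distributed numbers as example
data; variable names are chosen ad hoc), the estimate $F(x_i) = i/N$
can be computed by sorting the data, and probabilities as well as
percentiles can then be read off by interpolation:

\begin{lstlisting}
x = randn(100, 1);                 % example data values
xs = sort(x);                      % sorted data values x_i
F = (1:length(xs))' / length(xs);  % fraction i/N of data values <= x_i
plot(xs, F);                       % estimated cumulative distribution
p1 = interp1(xs, F, -1.0);         % probability of a value smaller than -1
q = interp1(F, xs, [0.5 0.75]);    % median and 3rd quartile via the inverse
\end{lstlisting}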
Conversely, the values of quartiles and percentiles can be determined
from the inverse cumulative distribution function.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{cumulative}
  \titlecaption{\label{cumulativefig} Estimation of the cumulative
    distribution.}{The cumulative distribution $F(x)$ estimated from
    100 data values drawn from a normal distribution (red) in
    comparison to the true cumulative distribution function computed
    by numerically integrating the normal distribution function
    (blue). From the cumulative distribution function one can read
    off the probabilities of getting values smaller than a given
    value (here: $P(x \le -1) \approx 0.15$). From the inverse
    cumulative distribution the position of percentiles can be
    computed (here: the median (50\,\% percentile) is as expected
    close to zero and the third quartile (75\,\% percentile) is at
    $x=0.68$).}
\end{figure}

\begin{exercise}{cumulative.m}{cumulative.out}
  Generate 200 normally distributed data values and construct an
  estimate of the cumulative distribution function from these
  data. Compare this estimate with an integral over the normal
  distribution. Use the estimate to compute the probability of having
  data values smaller than $-1$. Use the estimate to compute the
  value of the 5\,\% percentile.
\end{exercise}

\section{Correlations}

So far we have described properties of univariate data sets. In
bivariate or multivariate data sets, where we have pairs or tuples of
data values (e.g. size and weight of elephants), we want to analyze
dependencies between the variables.

The \entermde[correlation!coefficient]{Korrelationskoeffizient}{correlation coefficient}
\begin{equation}
  \label{correlationcoefficient}
  r_{x,y} = \frac{Cov(x,y)}{\sigma_x \sigma_y} = \frac{\langle (x-\langle x \rangle)(y-\langle y \rangle) \rangle}{\sqrt{\langle (x-\langle x \rangle)^2 \rangle} \sqrt{\langle (y-\langle y \rangle)^2 \rangle}}
\end{equation}
quantifies linear relationships between two variables
\matlabfun{corr()}. The correlation coefficient is the
\entermde{Kovarianz}{covariance} normalized by the standard
deviations of the individual variables. Perfectly correlated
variables result in a correlation coefficient of $+1$, anti-correlated
or negatively correlated data in a correlation coefficient of $-1$,
and uncorrelated data in a correlation coefficient close to zero
(\figrefb{correlationfig}).

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{correlation}
  \titlecaption{\label{correlationfig} Correlations between pairs of
    data.}{Shown are scatter plots of four data sets. Each point is a
    single data pair. The correlation coefficient $r$ is given in the
    top left of each plot.}
\end{figure}

\begin{exercise}{correlations.m}{}
  Generate pairs of random numbers with four different correlations
  (perfectly correlated, somewhat correlated, uncorrelated,
  negatively correlated). Plot them into a scatter plot and compute
  their correlation coefficient.
\end{exercise}

Note that non-linear dependencies between two variables are detected
only insufficiently, or not at all, by the correlation coefficient
(\figref{nonlincorrelationfig}).

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{nonlincorrelation}
  \titlecaption{\label{nonlincorrelationfig} Correlations for
    non-linear dependencies.}{The correlation coefficient detects
    linear dependencies only. Both the quadratic dependency (left)
    and the noise correlation (right), where the dispersion of the
    $y$-values depends on the $x$-value, result in correlation
    coefficients close to zero.
    $\xi$ denotes normally distributed random numbers.}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions