%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{\tr{Maximum likelihood estimation}{Maximum-Likelihood-Sch\"atzer}}
\label{maximumlikelihoodchapter}

A common problem in statistics is to estimate, from the data $x_1, x_2, \ldots x_n$, one or more parameters $\theta$ of a probability distribution that best describes these data. \enterm{Maximum likelihood estimators} (\enterm[mle|see{Maximum likelihood estimators}]{mle}, \determ{Maximum-Likelihood-Sch\"atzer}) choose the parameters such that they maximize the likelihood of the data $x_1, x_2, \ldots x_n$ to originate from the distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum Likelihood}

Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given $\theta$'') be the probability (density) distribution of $x$ given the parameters $\theta$. This could be the normal distribution
\begin{equation}
  \label{normpdfmean}
  p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots x_n$ are independent of each other and originate from the same probability density distribution (they are \enterm{i.i.d.}, independent and identically distributed), then the conditional probability $p(x_1,x_2, \ldots x_n|\theta)$ of observing $x_1, x_2, \ldots x_n$ given a specific $\theta$ is
\begin{equation}
  p(x_1,x_2, \ldots x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta) \ldots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \enterm{likelihood} of the parameters $\theta$ given the observed data $x_1, x_2, \ldots x_n$ is
\begin{equation}
  {\cal L}(\theta|x_1,x_2, \ldots x_n) = p(x_1,x_2, \ldots x_n|\theta) \; .
\end{equation}
Note: the likelihood ${\cal L}$ is not a probability in the classic sense since it does not integrate to unity ($\int {\cal L}(\theta|x_1,x_2, \ldots x_n) \, d\theta \ne 1$).

When applying maximum likelihood estimation we are interested in the parameter values
\begin{equation}
  \theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots x_n)
\end{equation}
that maximize the likelihood. $\text{argmax}_x f(x)$ denotes the value of the argument $x$ at which the function $f(x)$ assumes its global maximum. Thus, we search for the value of $\theta$ at which the likelihood ${\cal L}(\theta)$ reaches its maximum.

The position of a function's maximum does not change when the values of the function are transformed by a strictly monotonically increasing function such as the logarithm. For numerical reasons and for reasons that we will discuss below, we commonly search for the maximum of the logarithm of the likelihood (\enterm{log-likelihood}):
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
  & = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Example: the arithmetic mean}

Suppose that the measurements $x_1, x_2, \ldots x_n$ originate from a normal distribution \eqnref{normpdfmean} and we consider the mean $\mu$ as the only parameter $\theta$.
Which value of $\theta$ maximizes the likelihood of the data?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of the mean.}{Top: The measured data (blue dots) together with three different possible normal distributions with different means (arrows) the data could have originated from. Bottom left: the likelihood as a function of $\theta$, i.e. as a function of the mean. It is maximal at a value of $\theta = 2$. Bottom right: the log-likelihood. Taking the logarithm does not change the position of the maximum.}
\end{figure}

The log-likelihood \eqnref{loglikelihood} is
\begin{eqnarray*}
  \log {\cal L}(\theta|x_1,x_2, \ldots x_n) & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\theta)^2}{2\sigma^2}} \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\theta)^2}{2\sigma^2} \; .
\end{eqnarray*}
Since the logarithm is the inverse function of the exponential ($\log(e^x)=x$), taking the logarithm removes the exponential from the normal distribution. To calculate the maximum of the log-likelihood, we take the derivative with respect to $\theta$ and set it to zero:
\begin{eqnarray*}
  \frac{\text{d}}{\text{d}\theta} \log {\cal L}(\theta|x_1,x_2, \ldots x_n) & = & \sum_{i=1}^n \frac{2(x_i-\theta)}{2\sigma^2} \;\; = \;\; 0 \\
  \Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \theta & = & 0 \\
  \Leftrightarrow \quad n \theta & = & \sum_{i=1}^n x_i \\
  \Leftrightarrow \quad \theta & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray*}
Thus, the maximum likelihood estimator of the mean is the arithmetic mean. That is, the arithmetic mean maximizes the likelihood that the data originate from a normal distribution centered at the arithmetic mean (\figref{mlemeanfig}). Equivalently, the standard deviation computed from the data maximizes the likelihood that the data were generated from a normal distribution with this standard deviation.

\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean $\ne 0$ and a standard deviation $\ne 1$.

  Plot the likelihood (the product of the probabilities) and the log-likelihood (the sum of the logarithms of the probabilities) as functions of the mean. Compare the positions of the maxima with the mean calculated from the data.
\end{exercise}
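For illustration, a minimal sketch of such a grid evaluation of the likelihood and the log-likelihood could look as follows (the true mean, the standard deviation, which is assumed to be known, the grid of candidate means, and the sample size are arbitrary choices):
\begin{verbatim}
% draw data from a normal distribution (mu=2, sigma=2 chosen for illustration):
n = 50;
x = 2.0*randn(n, 1) + 2.0;
% evaluate likelihood and log-likelihood on a grid of candidate means,
% assuming the standard deviation sigma=2 to be known:
thetas = linspace(0.0, 4.0, 200);
lik = zeros(size(thetas));
loglik = zeros(size(thetas));
for k = 1:length(thetas)
    p = normpdf(x, thetas(k), 2.0);   % probability density of each data point
    lik(k) = prod(p);                 % likelihood
    loglik(k) = sum(log(p));          % log-likelihood
end
% both are maximal at the arithmetic mean of the data:
[~, kmax] = max(loglik);
fprintf('argmax of log-likelihood: %.3f, mean(x): %.3f\n', thetas(kmax), mean(x));
fprintf('maximum likelihood value: %g\n', max(lik));
\end{verbatim}
Note in the output how tiny the likelihood values themselves become even for only $n=50$ data points; this is one of the numerical reasons for working with the log-likelihood.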
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}

Consider normally distributed data with unknown mean and standard deviation. From the considerations above we have just seen that a Gaussian distribution with its mean at the arithmetic mean and its standard deviation equal to the standard deviation computed from the data fits the data best in a maximum likelihood sense, i.e. the likelihood that the data were generated from this distribution is the largest. Fitting a Gaussian distribution to data is therefore very simple: just compute the two parameters of the Gaussian distribution, the arithmetic mean and the standard deviation, directly from the data.

For non-Gaussian distributions, however, such simple analytical expressions for the parameters do not exist, e.g. for the shape parameter of a \enterm{Gamma-distribution}. How do we fit such a distribution to some data? That is, how should we compute the values of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by minimizing the squared difference to a histogram of the measured data. For several reasons this is, however, not the method of choice: (i) Probability densities can only be positive, which, in particular for small values, leads to asymmetric distributions of the estimated histogram values. (ii) The values of a histogram estimating the density are not independent, because the integral over a density is unity. The two basic assumptions of normally distributed and independent samples, which are a prerequisite for the minimization of the squared difference \eqnref{chisqmin} to be a maximum likelihood estimation, are thus violated. (iii) The histogram strongly depends on the chosen bin size (\figref{mlepdffig}).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a probability density.}{Left: 100 data points drawn from a 2nd order Gamma-distribution. The maximum likelihood estimate of the probability density function is shown in orange, the true pdf is shown in red. Right: normalized histogram of the data together with the real (red) and the fitted probability density functions. The fit was done by minimizing the squared difference to the histogram.}
\end{figure}

Instead we should stay with maximum-likelihood estimation. In exactly the same way as we estimated the mean value of a Gaussian distribution above, we can numerically fit the parameters of any type of distribution directly to the data by maximizing the likelihood. We simply search for the parameters $\theta$ of the desired probability density function that maximize the log-likelihood. In general this is a non-linear optimization problem that is solved with numerical methods such as gradient descent \matlabfun{mle()}.

\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the maximum likelihood method to estimate the parameters of the gamma distribution from the data.
\end{exercise}
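For illustration, a minimal sketch of such a numerical maximum likelihood fit for gamma-distributed data could look as follows (the shape and scale parameters, the sample size, the helper name, and the starting values of the optimization are arbitrary choices). Since numerical optimizers minimize, we minimize the negative log-likelihood:
\begin{verbatim}
% draw gamma-distributed data (shape a=2, scale b=1 chosen for illustration):
x = gamrnd(2.0, 1.0, 100, 1);
% negative log-likelihood as a function of the parameters p = [shape, scale]:
gammanll = @(p) -sum(log(gampdf(x, p(1), p(2))));
% numerically minimize the negative log-likelihood
% (starting values must be positive):
pest = fminsearch(gammanll, [1.0, 1.0]);
% for comparison, mle() performs the same kind of numerical fit:
phat = mle(x, 'distribution', 'gamma');
fprintf('fminsearch: shape=%.2f  scale=%.2f\n', pest(1), pest(2));
fprintf('mle():      shape=%.2f  scale=%.2f\n', phat(1), phat(2));
\end{verbatim}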
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}

When fitting a function of the form $f(x;\theta)$ to data pairs $(x_i|y_i)$, one tries to adapt the parameter $\theta$ such that the function best describes the data. With maximum likelihood we search for the parameter value $\theta$ for which the likelihood that the data were drawn from the corresponding function is maximal. If we assume that the $y_i$ values are normally distributed around the function values $f(x_i;\theta)$ with standard deviations $\sigma_i$, the log-likelihood is
\begin{eqnarray*}
  \log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \; .
\end{eqnarray*}
The only difference to the previous example is that the means in the equations above are now given by the function values $f(x_i;\theta)$. The parameter $\theta$ should be the one that maximizes the log-likelihood. The first term of the sum is independent of $\theta$ and can thus be ignored when computing the maximum:
\begin{eqnarray*}
  & = & - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2
\end{eqnarray*}
We can further simplify by inverting the sign and then searching for the minimum. Also the factor $1/2$ can be ignored since it does not affect the position of the minimum:
\begin{equation}
  \label{chisqmin}
  \theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard deviations is also called $\chi^2$. The parameter $\theta$ that minimizes the squared differences is thus the one that maximizes the likelihood that the data actually originate from the given function. Minimizing $\chi^2$ therefore is a maximum likelihood estimation.

From the mathematical considerations above we can see that the minimization of the squared difference is a maximum-likelihood estimation only if the data are normally distributed around the function. In case of other distributions, the log-likelihood \eqnref{loglikelihood} needs to be adapted to the respective distribution and then maximized.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}
  \titlecaption{\label{mleproplinefig} Maximum likelihood estimation of the slope of a line through the origin.}{The data (blue and left histogram) originate from a straight line $y=mx$ through the origin (red). The maximum-likelihood estimate of the slope $m$ of the regression line (orange), \eqnref{mleslope}, is close to the true one. The residuals, the data minus the estimated line (right), reveal the normal distribution of the data around the line (right histogram).}
\end{figure}

\subsection{Example: simple proportionality}

The function of a line with slope $\theta$ through the origin is
\[ f(x) = \theta x \; . \]
The $\chi^2$-sum is thus
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \; . \]
To find the minimum we again take the first derivative with respect to $\theta$ and equate it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}\theta}\chi^2 & = & \frac{\text{d}}{\text{d}\theta} \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
  & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber
\end{eqnarray}
\begin{eqnarray}
  \Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimate of the slope $\theta$ of the regression line (\figref{mleproplinefig}).

A gradient descent, as we have used it in the previous chapter, is not necessary for fitting the slope of a straight line, because the slope can be computed directly via \eqnref{mleslope}. More generally, this is also the case for fitting the coefficients of linearly combined basis functions, as for example the slope $m$ and the y-intercept $b$ of a straight line
\[ y = m \cdot x + b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.

Parameters that are combined non-linearly cannot be computed analytically. Consider for example the rate $\lambda$ of the exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the cost function, e.g. gradient descent \matlabfun{lsqcurvefit()}.
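For illustration, a minimal sketch of such a numerical $\chi^2$ minimization for the exponential decay could look as follows (the true parameters, the noise level, the helper name, and the starting values are arbitrary choices; with equal $\sigma_i$ the weighting by the standard deviation is just a constant factor that does not affect the position of the minimum):
\begin{verbatim}
% simulated data from an exponential decay with additive Gaussian noise
% (c=2, lambda=-0.5, and sigma=0.1 are arbitrary choices for illustration):
x = linspace(0.0, 5.0, 40);
y = 2.0*exp(-0.5*x) + 0.1*randn(size(x));
% chi-squared cost function for the parameters p = [c, lambda],
% assuming the same standard deviation sigma for all data points:
sigma = 0.1;
chisq = @(p) sum(((y - p(1)*exp(p(2)*x))/sigma).^2);
% numerically minimize chi-squared:
pest = fminsearch(chisq, [1.0, -1.0]);
fprintf('c = %.2f, lambda = %.2f\n', pest(1), pest(2));
\end{verbatim}
The same fit can also be obtained with \matlabfun{lsqcurvefit()}, which is specialized for such least-squares problems.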
Let us have a closer look at \eqnref{mleslope}. If the standard deviation of the data, $\sigma_i$, is the same for each data point, i.e. $\sigma_i = \sigma_j \; \forall \; i, j$, the standard deviation drops out of \eqnref{mleslope} and we get
\begin{equation}
  \label{whitemleslope}
  \theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \; .
\end{equation}
To see what this expression describes, we standardize the data. We make the data mean free and normalize them to their standard deviation, i.e. $x \mapsto (x - \bar x)/\sigma_x$. The resulting numbers are also called \enterm[z-values]{$z$-values} or $z$-scores and they have the properties $\bar x = 0$ and $\sigma_x = 1$. $z$-scores are often used in biology to make quantities that differ in their units comparable. For standardized data the variance
\[ \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1 \]
is given by the mean squared data and equals one. The covariance between $x$ and $y$ also simplifies to
\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \frac{1}{n} \sum_{i=1}^n x_i y_i \; , \]
the averaged product between pairs of $x$ and $y$ values. Recall that the correlation coefficient $r_{x,y}$, \eqnref{correlationcoefficient}, is the covariance normalized by the product of the standard deviations of $x$ and $y$. Therefore, if the standard deviations equal one, the correlation coefficient equals the covariance. Consequently, for standardized data the slope of the regression line \eqnref{whitemleslope} simplifies to
\begin{equation}
  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y} \; .
\end{equation}
For standardized data the slope of the regression line equals the correlation coefficient!
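For illustration, a minimal sketch that checks this relation on simulated data pairs (the slope, the noise level, and the sample size are arbitrary choices):
\begin{verbatim}
% simulated data pairs with a linear relation plus noise
% (slope 0.8 and noise level 0.5 are arbitrary choices):
n = 1000;
x = randn(n, 1);
y = 0.8*x + 0.5*randn(n, 1);
% standardize both variables (z-scores; the normalization of std()
% by n-1 instead of n cancels out in the slope below):
zx = (x - mean(x))/std(x);
zy = (y - mean(y))/std(y);
% slope of the regression line through the origin for standardized data:
slope = sum(zx.*zy)/sum(zx.^2);
% correlation coefficient for comparison:
r = corrcoef(x, y);
fprintf('slope = %.3f, correlation coefficient = %.3f\n', slope, r(1, 2));
\end{verbatim}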
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}

In sensory systems certain aspects of the environment are encoded in the neuronal activity of populations of neurons. One example of such a population code is the tuning of neurons in the primary visual cortex (V1) to the orientation of an edge or bar in the visual stimulus. Different neurons respond best to different edge orientations. Traditionally, such a tuning is measured by analyzing the neuronal response strength (e.g. the firing rate) as a function of the orientation of a black bar and is illustrated and summarized with the so-called \enterm{tuning-curve} (\determ{Abstimmkurve}, figure~\ref{mlecodingfig}, top).

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{mlecoding}
  \titlecaption{\label{mlecodingfig} Maximum likelihood estimation of a stimulus parameter from neuronal activity.}{Top: Tuning curve of an individual neuron as a function of the stimulus orientation (a dark bar in front of a white background). The stimulus that evokes the strongest activity in that neuron is the bar with the vertical orientation (arrow, $\phi_i=90$\,\degree). The red area indicates the variability of the neuronal activity $p(r|\phi)$ around the tuning curve. Center: In a population of neurons, each neuron may have a different tuning curve (colors). A specific stimulus (the vertical bar) activates the individual neurons of the population in a specific way (dots). Bottom: The log-likelihood of the activity pattern is maximized close to the real stimulus orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a certain activity pattern in the neuronal population, what is the stimulus (here the orientation of an edge)? In the sense of maximum likelihood, a possible answer to this question is: the stimulus for which the particular activity pattern is most likely given the tuning of the neurons.

Let's stay with the example of orientation tuning in V1. The tuning $\Omega_i(\phi)$ of neuron $i$ with preferred edge orientation $\phi_i$ can be well described by a von Mises function (the analogue of the Gaussian function on a cyclic x-axis, \figref{mlecodingfig}):
\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ \; . \]
If we approximate the variability of the neuronal activity around the tuning curve by a normal distribution with a standard deviation $\sigma_i = \Omega_i(\phi)/4$ that is proportional to the tuning curve, then the probability $p_i(r|\phi)$ of the $i$-th neuron having the activity $r$ given a certain orientation $\phi$ of an edge is given by
\[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \]
The log-likelihood of the edge orientation $\phi$ given the activity pattern $r_1$, $r_2$, \ldots $r_n$ in the population is thus
\[ \log {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; . \]
The angle $\phi$ that maximizes this log-likelihood is an estimate of the orientation of the edge.
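For illustration, a minimal sketch of such a maximum likelihood decoder for a simulated population of orientation-tuned neurons could look as follows (the number of neurons, the factor $c$, the noise model, the helper name, and the grid of candidate orientations are arbitrary choices):
\begin{verbatim}
% preferred orientations of n neurons, evenly spaced over the cycle:
n = 12;
phipref = pi*(0:n-1)/n;                          % in radians
c = 10.0;                                        % arbitrary activity scale
tuning = @(phi, phii) c*exp(cos(2*(phi - phii)));
% simulate the population response to a true stimulus orientation:
phitrue = 0.4*pi;
rates = tuning(phitrue, phipref);
rates = rates + 0.25*rates.*randn(size(rates));  % noise with sigma = Omega/4
% evaluate the log-likelihood on a grid of candidate orientations
% (orientations repeat every 180 degrees because of the factor 2 in the cosine):
phis = linspace(0.0, pi, 360);
loglik = zeros(size(phis));
for k = 1:length(phis)
    om = tuning(phis(k), phipref);
    loglik(k) = sum(log(normpdf(rates, om, om/4)));
end
% the orientation that maximizes the log-likelihood is the estimate:
[~, kmax] = max(loglik);
fprintf('true: %.1f deg, estimated: %.1f deg\n', ...
        phitrue*180/pi, phis(kmax)*180/pi);
\end{verbatim}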