%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{\tr{Maximum likelihood estimation}{Maximum-Likelihood-Sch\"atzer}}
\label{maximumlikelihoodchapter}
\selectlanguage{english}

There are situations in which we want to estimate one or more parameters $\theta$ of a probability distribution that best describe the data $x_1, x_2, \ldots x_n$. \enterm{Maximum likelihood estimators} (\determ{Maximum-Likelihood-Sch\"atzer}, \determ[mle|see{Maximum-Likelihood-Sch\"atzer}]{mle}) choose the parameters such that they maximize the likelihood of $x_1, x_2, \ldots x_n$ originating from the distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum Likelihood}

Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given $\theta$'') be the probability (density) distribution of $x$ given the parameters $\theta$. This could be the normal distribution
\begin{equation}
  \label{normpdfmean}
  p(x|\theta) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
with the mean $\mu$ and the standard deviation $\sigma$ as parameters $\theta$.

If the $n$ observations $x_1, x_2, \ldots x_n$ are drawn independently from the same probability (density) distribution (\enterm{i.i.d.}, independent and identically distributed), then the conditional probability $p(x_1,x_2, \ldots x_n|\theta)$ of observing $x_1, x_2, \ldots x_n$ given a specific $\theta$ is
\begin{equation}
  p(x_1,x_2, \ldots x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta) \ldots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \enterm{likelihood} of the parameters $\theta$ given the observed data $x_1, x_2, \ldots x_n$ is
\begin{equation}
  {\cal L}(\theta|x_1,x_2, \ldots x_n) = p(x_1,x_2, \ldots x_n|\theta) \; .
\end{equation}
Note: the likelihood ${\cal L}$ is not a probability in the classic sense since it does not integrate to unity ($\int {\cal L}(\theta|x_1,x_2, \ldots x_n) \, d\theta \ne 1$).

In maximum likelihood estimation we are interested in the parameters $\theta$ that maximize the likelihood (``mle''):
\begin{equation}
  \theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots x_n)
\end{equation}
$\text{argmax}_xf(x)$ denotes the value of the argument $x$ at which the function $f(x)$ reaches its global maximum. Thus, we search for the value of $\theta$ at which the likelihood ${\cal L}(\theta)$ reaches its maximum.

The position of a function's maximum does not change when the values of the function are transformed by a strictly monotonically increasing function such as the logarithm. For numerical reasons and for reasons that we will discuss below, we commonly search for the maximum of the logarithm of the likelihood (\enterm{log-likelihood}):
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
  & = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
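The numerical advantage of the log-likelihood can be illustrated directly: for many data points the product of the individual probability densities quickly underflows to zero in double precision, whereas the sum of their logarithms stays finite. The following minimal sketch (the sample size, mean, and standard deviation are arbitrary choices for this illustration) demonstrates the effect for i.i.d. samples from a normal distribution \eqnref{normpdfmean}:
\begin{verbatim}
% why the log-likelihood is preferred numerically: the product of many
% small probabilities underflows, the sum of their logarithms does not
n = 1000;                       % number of i.i.d. observations
mu = 2.0;                       % mean of the simulated data
sigma = 1.0;                    % standard deviation of the simulated data
x = mu + sigma * randn(n, 1);   % simulated data

theta = mu;                     % evaluate at the true parameter
p = exp(-(x-theta).^2/(2*sigma^2)) / sqrt(2*pi*sigma^2);

likelihood = prod(p);           % underflows to 0 for large n
loglikelihood = sum(log(p));    % stays a finite (negative) number
fprintf('likelihood     = %g\n', likelihood);
fprintf('log-likelihood = %g\n', loglikelihood);
\end{verbatim}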
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Example: the arithmetic mean}

Suppose that the measurements $x_1, x_2, \ldots x_n$ originate from a normal distribution \eqnref{normpdfmean} and we consider the mean $\mu=\theta$ as the only parameter. Which value of $\theta$ maximizes the likelihood?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of the mean.}{Top: The measured data (blue dots) together with three different possible normal distributions with different means (arrows) that could be the origin of the data. Bottom left: the likelihood as a function of $\theta$, i.e. the mean. It is maximal at a value of $\theta = 2$. Bottom right: the log-likelihood. Taking the logarithm does not change the position of the maximum.}
\end{figure}

The log-likelihood \eqnref{loglikelihood} is
\begin{eqnarray*}
  \log {\cal L}(\theta|x_1,x_2, \ldots x_n) & = & \sum_{i=1}^n \log \left( \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\theta)^2}{2\sigma^2}} \right) \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\theta)^2}{2\sigma^2} \; .
\end{eqnarray*}
Since the logarithm is the inverse function of the exponential ($\log(e^x)=x$), taking the logarithm removes the exponential from the normal distribution. To calculate the maximum of the log-likelihood, we take the derivative with respect to $\theta$ and set it to zero:
\begin{eqnarray*}
  \frac{\text{d}}{\text{d}\theta} \log {\cal L}(\theta|x_1,x_2, \ldots x_n) & = & \sum_{i=1}^n \frac{2(x_i-\theta)}{2\sigma^2} \;\; = \;\; 0 \\
  \Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \theta & = & 0 \\
  \Leftrightarrow \quad n \theta & = & \sum_{i=1}^n x_i \\
  \Leftrightarrow \quad \theta & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray*}
From these equations it becomes clear that the maximum likelihood estimate of $\theta$ is the arithmetic mean of the data. That is, taking the mean of the data as $\theta$ maximizes the likelihood that the data originate from a normal distribution with that mean (\figref{mlemeanfig}).

\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean different from zero and a standard deviation different from one. Plot the likelihood (the product of the probabilities) and the log-likelihood (the sum of the logarithms of the probabilities) as functions of the mean parameter. Compare the positions of the maxima with the mean calculated from the data.
  \pagebreak[4]
\end{exercise}

\pagebreak[4]
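The likelihood and log-likelihood curves shown in \figref{mlemeanfig} can be obtained by simply evaluating \eqnref{loglikelihood} on a grid of candidate values for $\theta$. The following is a minimal sketch of such a grid evaluation (the data parameters and the grid are arbitrary choices for this illustration, and this is not necessarily how the exercise's \texttt{mlemean.m} is implemented):
\begin{verbatim}
% likelihood and log-likelihood of the mean on a grid of candidate values
n = 50;
x = 2.0 + 0.5*randn(n, 1);        % simulated data (arbitrary mean and std)
sigma = std(x);                   % keep the standard deviation fixed

thetas = linspace(min(x), max(x), 500);
lik = zeros(size(thetas));
loglik = zeros(size(thetas));
for k = 1:length(thetas)
    p = exp(-(x-thetas(k)).^2/(2*sigma^2)) / sqrt(2*pi*sigma^2);
    lik(k) = prod(p);             % likelihood: product of the densities
    loglik(k) = sum(log(p));      % log-likelihood: sum of the log-densities
end

[~, kmax] = max(loglik);
fprintf('argmax of the log-likelihood: %.3f\n', thetas(kmax));
fprintf('mean of the data            : %.3f\n', mean(x));

subplot(1, 2, 1); plot(thetas, lik);    xlabel('\theta'); ylabel('likelihood');
subplot(1, 2, 2); plot(thetas, loglik); xlabel('\theta'); ylabel('log-likelihood');
\end{verbatim}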
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting as maximum-likelihood estimation}

In curve fitting, a function of the form $f(x;\theta)$ is adapted to the data pairs $(x_i|y_i)$ by adjusting the parameter $\theta$. When we assume that the $y_i$ values are normally distributed around the function values $f(x_i;\theta)$ with standard deviations $\sigma_i$, the log-likelihood is
\begin{eqnarray*}
  \log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n)) & = & \sum_{i=1}^n \log \left( \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \right) \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \; .
\end{eqnarray*}
The only difference to the previous example is that the means in the equations above are replaced by the function values $f(x_i;\theta)$. The parameter $\theta$ should be the one that maximizes the log-likelihood. The first term of the sum is independent of $\theta$ and can thus be ignored when searching for the maximum; what remains to be maximized is
\[ - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \; . \]
We can further simplify by inverting the sign and then searching for the minimum. The factor $1/2$ can also be ignored since it does not affect the position of the minimum:
\begin{equation}
  \label{chisqmin}
  \theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard deviations is also called $\chi^2$. The parameter $\theta$ that minimizes the squared differences is thus the one that maximizes the likelihood that the data actually originate from the given function. Minimizing $\chi^2$ therefore is a maximum likelihood estimation.

From the mathematical considerations above we can see that the minimization of the squared difference is a maximum-likelihood estimation only if the data are normally distributed around the function. In case of other distributions, the log-likelihood \eqnref{loglikelihood} needs to be adapted to the respective distribution and then maximized.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}
  \titlecaption{\label{mleproplinefig} Maximum likelihood estimation of the slope of a line through the origin.}{}
\end{figure}

\subsection{Example: simple proportionality}

A line through the origin is described by the function
\[ f(x) = \theta x \]
with slope $\theta$. The $\chi^2$-sum is thus
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \; . \]
To find the minimum we again take the first derivative with respect to $\theta$ and equate it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}\theta}\chi^2 & = & \frac{\text{d}}{\text{d}\theta} \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
  & = & -2 \sum_{i=1}^n \left( \frac{x_iy_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_iy_i}{\sigma_i^2} \nonumber \\
  \Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_iy_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
With this we have obtained an analytical expression for the maximum likelihood estimate of the slope $\theta$ of the regression line (\figref{mleproplinefig}). A gradient descent, as used in the previous chapter, is thus not necessary for fitting the slope of a line through the origin. This holds more generally for fitting the coefficients of linearly combined basis functions, as for example the slope $m$ and the y-intercept $b$ of the straight line
\[ y = m \cdot x + b \]
or, more generally, the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.
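The analytical estimate \eqnref{mleslope} can be computed in a single line. A minimal sketch with simulated data (the true slope, the $x$-values, and the error model are arbitrary choices for this illustration):
\begin{verbatim}
% maximum likelihood estimate of the slope of a line through the origin
n = 40;
x = 10.0 * rand(n, 1);               % x-values
sig = 0.5 + 0.5 * rand(n, 1);        % individual standard deviations
y = 2.0 * x + sig .* randn(n, 1);    % y-values scattered around y = 2 x

slope = sum(x.*y./sig.^2) / sum(x.^2./sig.^2);
fprintf('estimated slope: %.3f (true slope 2)\n', slope);
\end{verbatim}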
Parameters that are non-linearly combined cannot be calculated analytically. Consider for example the rate $\lambda$ of the exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the cost function, e.g. gradient descent \matlabfun{lsqcurvefit()}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}

Finally, let's consider the case in which we want to fit the parameters of a probability density function (e.g. the shape parameter of a \enterm{Gamma-distribution}) to a dataset.

A first guess could be to fit the probability density by minimizing the squared difference to a histogram of the measured data. For several reasons this is, however, not the method of choice: (i) Probability densities can only be positive, which, in particular for small values, leads to asymmetric distributions of the histogram values. (ii) The values of a histogram are not independent because the integral of a density is unity. Thus the two basic assumptions of normally distributed and independent samples, which are required for the minimization of the squared difference \eqnref{chisqmin} to be a maximum likelihood estimation, are violated. (iii) The histogram strongly depends on the chosen bin size (\figref{mlepdffig}).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a probability density.}{Left: 100 data points drawn from a 2nd-order Gamma-distribution. The maximum likelihood estimate of the probability density function is shown in orange, the true pdf is shown in red. Right: normalized histogram of the data together with the true (red) and the fitted probability density functions. The fit was done by minimizing the squared difference to the histogram.}
\end{figure}

Using the example of estimating the mean of a normal distribution we have discussed the direct approach of fitting a probability density to data via maximum likelihood. We simply search for the parameter $\theta$ of the desired probability density function that maximizes the log-likelihood. This is a non-linear optimization problem that is generally solved with numerical methods such as gradient descent \matlabfun{mle()}.

\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Create a sample of gamma-distributed random numbers and apply the maximum likelihood method to estimate the parameters of the gamma distribution from the data.
  \pagebreak
\end{exercise}
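A minimal sketch of such a numerical fit (an illustration, not necessarily the solution of the exercise above): the negative log-likelihood of a Gamma distribution with shape $a$ and scale $b$ is minimized with \matlabfun{fminsearch()}, using logarithmic parameters to keep $a$ and $b$ positive.
\begin{verbatim}
% maximum likelihood fit of a gamma distribution (shape a, scale b);
% gamma(2,1) samples generated as the sum of two independent exponentials:
n = 100;
x = -log(rand(n, 1)) - log(rand(n, 1));

% negative log-likelihood as a function of q = [log(a), log(b)]:
negloglik = @(q) -sum((exp(q(1))-1)*log(x) - x/exp(q(2)) ...
                      - gammaln(exp(q(1))) - exp(q(1))*q(2));
q0 = [0.0, log(0.5*mean(x))];        % crude starting values
qmle = fminsearch(negloglik, q0);
fprintf('estimated shape a = %.2f (true 2)\n', exp(qmle(1)));
fprintf('estimated scale b = %.2f (true 1)\n', exp(qmle(2)));
\end{verbatim}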
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}

In sensory systems certain aspects of the environment are encoded in the activity of populations of neurons. One example of such population coding is the tuning of neurons in the primary visual cortex (V1) to the orientation of a visual stimulus. Different neurons respond best to different stimulus orientations. Traditionally, such a tuning is measured by analyzing the neuronal response strength (e.g. the firing rate) as a function of the orientation of the visual stimulus and is depicted and summarized with the so-called \enterm{tuning curve} (\determ{Abstimmkurve}, \figref{mlecodingfig}, top).

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{mlecoding}
  \titlecaption{\label{mlecodingfig} Maximum likelihood estimation of a stimulus parameter from neuronal activity.}{Top: Tuning curve of an individual neuron as a function of the stimulus orientation (a dark bar in front of a white background). The stimulus that evokes the strongest activity in that neuron is the bar with the vertical orientation (arrow, $\phi_i=90$\,\degree). The red area indicates the variability of the neuronal activity $p(r)$ around the tuning curve. Center: In a population of neurons, each neuron may have a different tuning curve (colors). A specific stimulus (the vertical bar) activates the individual neurons of the population in a specific way (dots). Bottom: The log-likelihood of the activity pattern is maximized close to the true stimulus orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a certain activity pattern in the neuronal population, what was the stimulus? In the sense of maximum likelihood, a possible answer to this question would be: it was the stimulus for which this particular activity pattern is most likely.

Let's stay with the example of orientation tuning in V1. The tuning $\Omega_i(\phi)$ of neuron $i$ with preferred stimulus orientation $\phi_i$ can be well described by a von Mises function (the analogue of the Gaussian function on a cyclic x-axis, \figref{mlecodingfig}):
\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ \]
The neuronal activity is approximated by a normal distribution around the tuning curve with a standard deviation $\sigma=\Omega_i(\phi)/4$ that is proportional to $\Omega_i(\phi)$, such that the probability $p_i(r|\phi)$ of the $i$-th neuron responding with activity $r$ given a certain orientation $\phi$ is given by
\[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \]
The log-likelihood of the stimulus orientation $\phi$ given the activity pattern in the population $r_1, r_2, \ldots r_n$ is thus
\[ \log {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; . \]

\selectlanguage{english}
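A minimal sketch of this decoding scheme (the number of neurons, the constant $c$, and the stimulus orientation are arbitrary choices for this illustration): the log-likelihood is evaluated on a grid of candidate orientations and the orientation with the largest log-likelihood is taken as the estimate.
\begin{verbatim}
% maximum likelihood decoding of the stimulus orientation from the
% activity pattern of a population of orientation-tuned neurons
c = 10.0;                             % scaling of the tuning curves
phipref = pi * (0:11) / 12;           % preferred orientations of 12 neurons
tuning = @(phi, phii) c * exp(cos(2*(phi - phii)));

% simulate the population response to a stimulus with orientation phitrue:
phitrue = pi/3;
om = tuning(phitrue, phipref);
r = om + 0.25 * om .* randn(size(om));    % normal variability, sigma = om/4

% log-likelihood of candidate orientations given the activity pattern:
phis = linspace(0.0, pi, 360);
loglik = zeros(size(phis));
for k = 1:length(phis)
    om = tuning(phis(k), phipref);        % tuning-curve values of all neurons
    loglik(k) = sum(-0.5*((r - om)./(om/4)).^2 - log(sqrt(2*pi)*om/4));
end
[~, kmax] = max(loglik);
fprintf('true orientation     : %.1f deg\n', phitrue*180/pi);
fprintf('estimated orientation: %.1f deg\n', phis(kmax)*180/pi);
\end{verbatim}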