%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Maximum likelihood estimation}
\label{maximumlikelihoodchapter}
\exercisechapter{Maximum likelihood estimation}

The core task of statistics is to infer from measured data some parameters that describe the data. These parameters can simply be a mean or a standard deviation, any other parameter needed to describe the distribution the data originate from, a correlation coefficient, or some parameters of a function describing a particular dependence between the data. The brain faces exactly the same problem. Given the activity pattern of some neurons (the data), it needs to infer some aspects (parameters) of the environment and the internal state of the body in order to generate useful behavior. One possible approach for estimating parameters from data is provided by \enterm[maximum likelihood estimator]{maximum likelihood estimators} (\enterm[mle|see{maximum likelihood estimator}]{mle}, \determ{Maximum-Likelihood-Sch\"atzer}). They choose the parameters such that they maximize the likelihood of the specific data values originating from a specific distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum likelihood}

Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given $\theta$'') be the probability (density) distribution of data values $x$ given parameter values $\theta$. This could be the normal distribution
\begin{equation}
  \label{normpdfmean}
  p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots, x_n$ are independent of each other and originate from the same probability density distribution (they are \enterm[i.i.d.|see{independent and identically distributed}]{i.i.d.}, \enterm{independent and identically distributed}), then the conditional probability $p(x_1,x_2, \ldots, x_n|\theta)$ of observing the particular data values $x_1, x_2, \ldots, x_n$ given some specific parameter values $\theta$ of the probability density is given by the product of the probability densities of each data value:
\begin{equation}
  \label{probdata}
  p(x_1,x_2, \ldots, x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta) \cdots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \entermde{Likelihood}{likelihood} of the parameters $\theta$ given the observed data $x_1, x_2, \ldots, x_n$ is
\begin{equation}
  \label{likelihood}
  {\cal L}(\theta|x_1,x_2, \ldots, x_n) = p(x_1,x_2, \ldots, x_n|\theta) \; .
\end{equation}
Note that the likelihood ${\cal L}$ is not a probability in the classic sense, since it does not integrate to unity ($\int {\cal L}(\theta|x_1,x_2, \ldots, x_n) \, d\theta \ne 1$). For given observations $x_1, x_2, \ldots, x_n$, the likelihood \eqref{likelihood} is a function of the parameters $\theta$. This function has a global maximum for some specific parameter values. At this maximum the probability \eqref{probdata} to observe the measured data values is the largest. Maximum likelihood estimators just find the parameter values
\begin{equation}
  \theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots, x_n)
\end{equation}
that maximize the likelihood \eqref{likelihood}. $\text{argmax}_xf(x)$ is the value of the argument $x$ for which the function $f(x)$ assumes its global maximum.
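As a minimal numerical illustration of \eqref{probdata} and \eqref{likelihood}, the following sketch compares the likelihood of a small, hypothetical data set for two candidate means (the standard deviation is assumed to be known):
\begin{lstlisting}
% hypothetical data values for illustration:
x = [1.8, 2.3, 1.6, 2.9, 2.1];
sigma = 1.0;   % standard deviation assumed to be known
% normal probability density as a function of the data and the mean:
gauss = @(x, mu) exp(-(x-mu).^2/(2.0*sigma^2))/sqrt(2.0*pi*sigma^2);
% likelihood as the product of the densities for two candidate means:
L1 = prod(gauss(x, 1.0));
L2 = prod(gauss(x, 2.0));
[L1, L2]       % the data are more likely for mu = 2 than for mu = 1
\end{lstlisting}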
We thus search for the parameter values $\theta$ at which the likelihood ${\cal L}(\theta)$ reaches its maximum. For these parameter values the measured data most likely originated from the corresponding distribution.

The position of a function's maximum does not change when the values of the function are transformed by a strictly monotonically increasing function such as the logarithm. For numerical reasons and reasons that we discuss below, we instead search for the maximum of the logarithm of the likelihood (\entermde[likelihood!log-]{Likelihood!Log-}{log-likelihood})
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
  & = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
which is the sum of the logarithms of the probabilities of each observation. Let's illustrate the concept of maximum likelihood estimation with the example of the arithmetic mean.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Arithmetic mean}

Suppose that the measurements $x_1, x_2, \ldots, x_n$ originate from a normal distribution \eqref{normpdfmean} and we do not know the population mean $\mu$ of the normal distribution (\figrefb{mlemeanfig}). In this setting $\mu$ is the only parameter $\theta$. Which value of $\mu$ maximizes the likelihood of the data?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of the mean.}{Top: The measured data (blue dots) together with three normal distributions differing in their means (arrows) from which the data could have originated. Bottom left: the likelihood as a function of the parameter $\mu$. For these data it is maximal at $\mu = 2$. Bottom right: the log-likelihood. Taking the logarithm does not change the position of the maximum.}
\end{figure}

With the normal distribution \eqref{normpdfmean} and applying logarithmic identities, the log-likelihood \eqref{loglikelihood} reads
\begin{eqnarray}
  \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \nonumber \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\mu)^2}{2\sigma^2} \; .
\end{eqnarray}
Since the logarithm is the inverse function of the exponential ($\log(e^x)=x$), taking the logarithm removes the exponential from the normal distribution. This is the second reason why it is useful to maximize the log-likelihood. To calculate the maximum of the log-likelihood, we need to take the derivative with respect to $\mu$ and set it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}\mu} \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n - \frac{\text{d}}{\text{d}\mu} \log \sqrt{2\pi \sigma^2} - \frac{\text{d}}{\text{d}\mu} \frac{(x_i-\mu)^2}{2\sigma^2} \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n \frac{2(x_i-\mu)}{2\sigma^2} & = & 0 \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0 \nonumber \\
  \Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i \nonumber \\
  \Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x \label{arithmeticmean}
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of normally distributed data is the arithmetic mean. That is, the arithmetic mean maximizes the likelihood that the data originate from a normal distribution centered at the arithmetic mean (\figref{mlemeanfig}). Equivalently, the standard deviation computed from the data maximizes the likelihood that the data were generated from a normal distribution with this standard deviation.
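This result can also be checked numerically. The following sketch (hypothetical data, standard deviation assumed to be known) maximizes the log-likelihood \eqref{loglikelihood} by minimizing its negative with \matlabfun{fminsearch()} and compares the estimate with the arithmetic mean:
\begin{lstlisting}
n = 50;
x = 2.0 + 1.0*randn(n, 1);    % hypothetical data: mu = 2, sigma = 1
sigma = 1.0;                  % standard deviation assumed to be known
% negative log-likelihood as a function of the parameter mu:
negloglik = @(mu) -sum(-log(sqrt(2.0*pi*sigma^2)) - (x-mu).^2/(2.0*sigma^2));
mumle = fminsearch(negloglik, 0.0);  % numerical maximum likelihood estimate
[mumle, mean(x)]              % both values agree (up to the optimizer's tolerance)
\end{lstlisting}
Minimizing the negative log-likelihood is equivalent to maximizing the log-likelihood; the negation is used because most numerical optimizers are designed to minimize a function.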
\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean $\ne 0$ and a standard deviation $\ne 1$. Plot the likelihood (the product of the probabilities) and the log-likelihood (given by the sum of the logarithms of the probabilities) for the mean as parameter. Compare the position of the maxima with the mean calculated from the data.
\end{exercise}

Comparing the values of the likelihood with those of the log-likelihood shown in \figref{mlemeanfig} reveals the numerical reason for taking the logarithm of the likelihood. The likelihood values can get very small, because we multiply many, potentially small, probability densities with each other. The likelihood quickly becomes smaller than the smallest number a floating point variable on a computer can represent. Try it by increasing the number of data values in the exercise. Taking the logarithm avoids this problem. The log-likelihood takes on well-behaved values that can be handled by the computer.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}

Consider normally distributed data with unknown mean and standard deviation. From the considerations above we have just seen that a Gaussian distribution with its mean at the arithmetic mean and its standard deviation equal to the standard deviation computed from the data is the Gaussian that fits the data best in a maximum likelihood sense, i.e. the likelihood that the data were generated from this distribution is the largest. Fitting a Gaussian distribution to data is therefore very simple: just compute the two parameters $\mu$ and $\sigma$ of the Gaussian distribution as the arithmetic mean and the standard deviation, respectively, directly from the data.

For non-Gaussian distributions, for example a \entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
  \label{gammapdf}
  p(x|\alpha,\beta) \sim x^{\alpha-1}e^{-\beta x}
\end{equation}
with a shape parameter $\alpha$ and a rate parameter $\beta$ (\figrefb{mlepdffig}), in general no such simple analytical expressions for estimating the parameters directly from the data exist. How do we fit such a distribution to the data? That is, how should we compute the values of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by a \enterm{least squares} fit to a normalized histogram of the measured data, in the same way as we fit a function to some data. For several reasons this is, however, not the method of choice: (i) Probability densities can only be positive, which leads, in particular for small values, to asymmetric distributions of the estimated histogram around the true density. (ii) The values of a histogram estimating the density are not independent, because the integral over a density is unity. The two basic assumptions of normally distributed and independent samples, which are a prerequisite for the minimization of the squared difference to be a maximum likelihood estimation (see next section), are therefore violated.
(iii) The estimation of the probability density by means of a histogram strongly depends on the chosen bin size.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a probability density.}{Left: 100 data points drawn from a second-order Gamma-distribution. The maximum likelihood estimate of the probability density function is shown in orange, the true pdf is shown in red. Right: normalized histogram of the data together with the true probability density (red) and the probability density function obtained by a least squares fit to the histogram.}
\end{figure}

Instead we should stay with maximum-likelihood estimation. Exactly in the same way as we estimated the mean value of a Gaussian distribution above, we can numerically fit the parameters of any type of distribution directly from the data by maximizing the likelihood. We simply search for the parameter values of the desired probability density function that maximize the log-likelihood. In general this is a non-linear optimization problem that is solved with numerical methods such as the gradient descent \matlabfun{mle()}.

\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the maximum likelihood method to estimate the parameters of the gamma distribution from the data.
\end{exercise}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}

When fitting a function of the form $f(x;\theta)$ to data pairs $(x_i|y_i)$, one tries to adapt the parameters $\theta$ such that the function best describes the data. In chapter~\ref{gradientdescentchapter} we simply assumed that ``best'' means minimizing the squared distance between the data and the function. With maximum likelihood we search for those parameter values $\theta$ that maximize the likelihood of the data to be drawn from the corresponding function. If we assume that the $y_i$ values are normally distributed around the function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$, the log-likelihood is
\begin{eqnarray}
  \log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n)) & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \nonumber \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}
\end{eqnarray}
The only difference to the previous example of the arithmetic mean is that the means $\mu$ in the equations above are replaced by the function values $f(x_i;\theta)$. The parameters $\theta$ should maximize the log-likelihood. The first term in the sum is independent of $\theta$ and can be ignored when computing the maximum. From the second term we pull out the constant factor $-\frac{1}{2}$:
\begin{eqnarray}
  & = & - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2
\end{eqnarray}
We can simplify further by inverting the sign and then searching for the minimum. The factor $\frac{1}{2}$ can also be ignored, since it does not affect the position of the minimum:
\begin{equation}
  \label{chisqmin}
  \theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences between the $y$-data values and the function values, normalized by the standard deviations of the data around the function, is called $\chi^2$ (chi squared).
The parameter values $\theta$ that minimize the squared differences are thus the ones that maximize the likelihood that the data actually originate from the given function. Whether we minimize $\chi^2$ or the \enterm{mean squared error} \eqref{meansquarederror} introduced in chapter~\ref{gradientdescentchapter} does not matter. The latter is the mean and $\chi^2$ is the sum of the squared differences. They simply differ by the constant factor $n$, the number of data pairs, which does not affect the position of the minimum. $\chi^2$ is more general in that it allows for different standard deviations for each data pair. If they are all the same ($\sigma_i = \sigma$), the common standard deviation can be pulled out of the sum and also does not influence the position of the minimum. Both \enterm{least squares} and minimizing $\chi^2$ are maximum likelihood estimators of the parameters $\theta$ of a function. From the mathematical considerations above we can see that the minimization of the squared difference is a maximum-likelihood estimation only if the data are normally distributed around the function. In case of other distributions, the log-likelihood \eqref{loglikelihood} needs to be adapted accordingly.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}
  \titlecaption{\label{mleproplinefig} Maximum likelihood estimation of the slope of a line through the origin.}{The data (blue and left histogram) originate from a straight line $y=mx$ through the origin (red). The maximum-likelihood estimate of the slope $m$ of the regression line (orange), \eqnref{mleslope}, is close to the true one. The residuals, the data minus the estimated line (right), reveal the normal distribution of the data around the line (right histogram).}
\end{figure}

Let's go on and calculate the minimum \eqref{chisqmin} of $\chi^2$ analytically for a simple function.

\subsection{Straight line through the origin}

The function of a straight line with slope $m$ through the origin is
\[ f(x) = m x \; . \]
With this specific function, $\chi^2$ reads
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-m x_i}{\sigma_i} \right)^2 \; . \]
To calculate the minimum we take the first derivative with respect to $m$ and equate it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}m}\chi^2 & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}m} \left( \frac{y_i-m x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-m x_i}{\sigma_i} \right) \nonumber \\
  & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - m \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad m \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber \\
  \Leftrightarrow \quad m & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimate of the slope $m$ of the regression line (\figref{mleproplinefig}). We do not need to apply numerical methods like the gradient descent to find the slope that minimizes the squared differences. Instead, we can compute the slope directly from the data by means of \eqnref{mleslope}, very much like we calculate the mean of some data by means of the arithmetic mean \eqref{arithmeticmean}.
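Translated into code, \eqnref{mleslope} is a single line. Here is a sketch with hypothetical data, where all data points are assigned the same standard deviation:
\begin{lstlisting}
n = 40;
x = 10.0*rand(n, 1);           % hypothetical x-values
sig = 0.5*ones(n, 1);          % standard deviations of the individual data points
y = 2.0*x + sig.*randn(n, 1);  % y-values scattering around a line with slope 2
% maximum likelihood estimate of the slope:
m = sum(x.*y./sig.^2)/sum(x.^2./sig.^2);
\end{lstlisting}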
\subsection{Linear and non-linear fits}

A gradient descent, as we used it in the previous chapter, is not necessary for fitting the slope of a straight line, because the slope can be computed directly via \eqnref{mleslope}. More generally, this is the case also for fitting the coefficients of linearly combined basis functions, as for example the slope $m$ and the y-intercept $b$ of a straight line
\[ y = m \cdot x + b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.

Parameters that are non-linearly combined cannot be calculated analytically. Consider, for example, the factor $c$ and the rate $\lambda$ of the exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the cost function, e.g. the gradient descent \matlabfun{lsqcurvefit()}.

\subsection{Relation between slope and correlation coefficient}

Let us have a closer look at \eqnref{mleslope} for the slope of a line through the origin. If the standard deviation of the data $\sigma_i$ is the same for each data point, i.e. $\sigma_i = \sigma_j \; \forall \; i, j$, the standard deviation drops out and \eqnref{mleslope} simplifies to
\begin{equation}
  \label{whitemleslope}
  m = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}
\end{equation}
To see what the numerator and the denominator of this expression describe, we need to subtract from the data their mean value. We make the data mean-free, i.e. $x \mapsto x - \bar x$ and $y \mapsto y - \bar y$. For mean-free data the variance
\begin{equation}
  \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2
\end{equation}
is given by the mean squared data. In the same way, the covariance between $x$ and $y$ simplifies to
\begin{equation}
  \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i \; ,
\end{equation}
the averaged product between pairs of $x$ and $y$ values. Expanding the fraction in \eqnref{whitemleslope} by $\frac{1}{n}$ we get
\begin{equation}
  \label{meanfreeslope}
  m = \frac{\frac{1}{n}\sum_{i=1}^n x_i y_i}{\frac{1}{n}\sum_{i=1}^n x_i^2} = \frac{\text{cov}(x, y)}{\sigma_x^2}
\end{equation}
Recall that the correlation coefficient $r_{x,y}$ is the covariance normalized by the product of the standard deviations of $x$ and $y$:
\begin{equation}
  \label{meanfreecorrcoef}
  r_{x,y} = \frac{\text{cov}(x, y)}{\sigma_x\sigma_y}
\end{equation}
If furthermore the standard deviations of $x$ and $y$ are the same, i.e. $\sigma_x = \sigma_y$, the slope of a line through the origin is identical to the correlation coefficient.

This relation between slope and correlation coefficient in particular holds for standardized data that have been made mean-free and have been normalized by their standard deviation, i.e. $x \mapsto (x - \bar x)/\sigma_x$ and $y \mapsto (y - \bar y)/\sigma_y$. The resulting numbers are called \entermde[z-value]{z-Wert}{$z$-values} or \enterm[z-score]{$z$-scores}. Their mean equals zero and their standard deviation one. $z$-scores are often used to make quantities that differ in their units comparable. For standardized data the denominators of both the slope \eqref{meanfreeslope} and the correlation coefficient \eqref{meanfreecorrcoef} equal one, and both the slope of the regression line and the correlation coefficient reduce to the covariance between the $x$ and $y$ data:
\begin{equation}
  m = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y}
\end{equation}
For standardized data the slope of the regression line is the same as the correlation coefficient!
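This identity is easily checked numerically with hypothetical, correlated data:
\begin{lstlisting}
n = 200;
x = randn(n, 1);               % hypothetical data
y = 0.8*x + 0.6*randn(n, 1);   % y-values correlated with x
% standardize the data to z-scores:
xz = (x - mean(x))/std(x);
yz = (y - mean(y))/std(y);
m = sum(xz.*yz)/sum(xz.^2);    % slope of the line through the origin
rc = corrcoef(xz, yz);         % correlation coefficient
[m, rc(1, 2)]                  % both values are the same (up to rounding)
\end{lstlisting}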
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}

Maximum likelihood estimators are not only a central concept for data analysis. Neural systems face the very same problem. They also need to estimate parameters of the internal and external environment based on the activity of neurons. In sensory systems certain aspects of the environment are encoded in the neuronal activity of populations of neurons. One example of such a population code is the tuning of neurons in the primary visual cortex (V1) to the orientation of a bar in the visual stimulus. Different neurons respond best to different bar orientations. Traditionally, such a tuning is measured by analyzing the neuronal response strength (e.g. the firing rate) as a function of the orientation of a black bar and is illustrated and summarized with the so-called \enterm{tuning-curve} (\determ{Abstimmkurve}, figure~\ref{mlecodingfig}, top).

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{mlecoding}
  \titlecaption{\label{mlecodingfig} Maximum likelihood estimation of a stimulus parameter from neuronal activity.}{Top: Tuning curve $r_i(\phi;c,\phi_i)$ of a specific neuron $i$ as a function of the orientation $\phi$ of a stimulus, a dark bar in front of a white background. The preferred stimulus $\phi_i$ of that neuron, the one that evokes the strongest firing rate response, is a bar with vertical orientation (arrow, $\phi_i=90$\,\degree). The width of the red area indicates the variability of the neuronal activity $\sigma_i$ around the tuning curve. Center: In a population of neurons, each neuron may have a different tuning curve (colors). A specific stimulus activates the individual neurons of the population in a specific way (dots). Bottom: The log-likelihood of the activity pattern has a maximum close to the real stimulus orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a certain activity pattern in the neuronal population, what is the stimulus? In our example, what is the orientation of the bar? In the sense of maximum likelihood, a possible answer to this question would be: the stimulus for which the particular activity pattern is most likely, given the tuning of the neurons and the noise (standard deviation) of the responses.

Let's stay with the example of the orientation tuning in V1. The tuning of the firing rate $r_i(\phi)$ of neuron $i$ to the preferred bar orientation $\phi_i$ can be well described by a von Mises function (the analog of the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}):
\begin{equation}
  \label{bartuningcurve}
  r_i(\phi; c, \phi_i) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c > 0
\end{equation}
Note the factor of two in the cosine, because the response of the neuron is the same for a bar rotated by 180\,\degree. If we approximate the neuronal activity by a normal distribution around the tuning curve with a standard deviation $\sigma_i$, then the probability $p_i(r|\phi)$ of the $i$-th neuron having the observed activity $r$, given a certain orientation $\phi$ of a bar, is
\begin{equation}
  p_i(r|\phi) = \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{1}{2}\left(\frac{r-r_i(\phi; c, \phi_i)}{\sigma_i}\right)^2} \; .
\end{equation}
The log-likelihood of the bar orientation $\phi$ given the activity pattern in the population $r_1$, $r_2$, \ldots, $r_n$ is thus
\begin{equation}
  \log {\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; .
\end{equation}
The angle $\phi_{mle}$ that maximizes this log-likelihood is an estimate of the true orientation of the bar (\figref{mlecodingfig}).
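How such an estimate can be computed is illustrated by the following sketch. All tuning parameters are hypothetical and, for simplicity, all neurons share the same response variability $\sigma$, so that the terms of the log-likelihood that are independent of $\phi$ can be dropped and the orientation can be found by a simple grid search:
\begin{lstlisting}
phis = 0.0:10.0:170.0;    % preferred orientations of 18 neurons in degrees
c = 10.0;                 % hypothetical gain factor of the tuning curves
sig = 2.0;                % hypothetical standard deviation of the responses
phitrue = 60.0;           % true orientation of the bar
% tuning curves and simulated population response:
tuningcurve = @(phi, phipref) c*exp(cosd(2.0*(phi - phipref)));
r = tuningcurve(phitrue, phis) + sig*randn(size(phis));
% log-likelihood (up to constant terms) on a grid of candidate orientations:
phigrid = 0.0:1.0:179.0;
loglik = zeros(size(phigrid));
for k = 1:length(phigrid)
    loglik(k) = sum(-0.5*((r - tuningcurve(phigrid(k), phis))/sig).^2);
end
[~, imax] = max(loglik);
phimle = phigrid(imax)    % close to the true orientation of 60 degrees
\end{lstlisting}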
The noisiness of the neuron's responses as quantified by $\sigma_i$ usually is a function of the neuron's mean firing rate $r_i$, \eqnref{bartuningcurve}: $\sigma_i = \sigma_i(r_i)$. This dependence has a major impact on the maximum likelihood estimation. Usually, the stronger the response of a neuron, i.e. the higher its firing rate, the lower the noise. In this case, strong responses will have a stronger influence on the position of the maximum of the log-likelihood.

Whether neural systems really implement maximum likelihood estimators is another question. There are many ways in which a stimulus property can be read out from the activity of a population of neurons. The simplest one is a ``winner-takes-all'' rule: the preferred stimulus parameter of the neuron with the strongest response is the estimate. Another possibility is to compute a population vector: the estimated stimulus parameter is the sum of the preferred stimulus parameters of all neurons in the population, weighted by the activity of the neurons. In case of angular stimulus parameters, like the orientation of the bar in our example, a vector pointing in the direction of the angle is used instead of the angle itself to incorporate the cyclic nature of the parameter. Using maximum likelihood estimators for analyzing neural population activity gives us an upper bound on how well a stimulus parameter is encoded in the activity of the neurons. The brain would not be able to do better.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions