%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Maximum likelihood estimation}
\label{maximumlikelihoodchapter}
\exercisechapter{Maximum likelihood estimation}

The core task of statistics is to infer from measured data some
parameters describing the data. These parameters can be simply a mean,
a standard deviation, or any other parameter needed to describe the
distribution the data originate from, a correlation
coefficient, or some parameters of a function describing a particular
dependence between the data. The brain faces exactly the same
problem. Given the activity pattern of some neurons (the data) it
needs to infer some aspects (parameters) of the environment and the
internal state of the body in order to generate some useful
behavior. One possible approach to estimate parameters from data is the use of
\enterm[maximum likelihood estimator]{maximum likelihood estimators}
(\enterm[mle|see{maximum likelihood estimator}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}).  They choose the parameters
such that they maximize the likelihood of the specific data values to
originate from a specific distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum likelihood}

Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given
$\theta$'') be the probability (density) distribution of data value $x$
given parameter values $\theta$. This could be the normal distribution
\begin{equation}
  \label{normpdfmean}
  p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as
parameters $\theta$.  If the $n$ observations $x_1, x_2, \ldots, x_n$
are independent of each other and originate from the same probability
density distribution (they are \enterm[i.i.d.|see{independent and
  identically distributed}]{i.i.d.}, \enterm{independent and
  identically distributed}), then the conditional probability
$p(x_1,x_2, \ldots, x_n|\theta)$ of observing the particular data
values $x_1, x_2, \ldots, x_n$ given some specific parameter values
$\theta$ of the probability density is given by the product of the
probability densities of each data value:
\begin{equation}
  \label{probdata}
  p(x_1,x_2, \ldots, x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
  \cdots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \entermde{Likelihood}{likelihood} of the parameters $\theta$
given the observed data $x_1, x_2, \ldots, x_n$ is
\begin{equation}
  \label{likelihood}
  {\cal L}(\theta|x_1,x_2, \ldots, x_n) = p(x_1,x_2, \ldots, x_n|\theta) \; .
\end{equation}
Note that the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
  L}(\theta|x_1,x_2, \ldots, x_n) \, d\theta \ne 1$). For given
observations $x_1, x_2, \ldots, x_n$, the likelihood
\eqref{likelihood} is a function of the parameters $\theta$. This
function has a global maximum for some specific parameter values. At
this maximum the probability \eqref{probdata} to observe the measured
data values is the largest.

Maximum likelihood estimators just find the parameter values
\begin{equation}
  \theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots, x_n)
\end{equation}
that maximize the likelihood \eqref{likelihood}.
$\text{argmax}_xf(x)$ is the value of the argument $x$ for which the
function $f(x)$ assumes its global maximum. Thus, we search for the
parameter values $\theta$ at which the likelihood ${\cal L}(\theta)$
reaches its maximum. For these parameter values the measured data most
likely originated from the corresponding distribution.

The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons and reasons that
we discuss below, we instead search for the maximum of the logarithm
of the likelihood
(\entermde[likelihood!log-]{Likelihood!Log-}{log-likelihood})
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
              & = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
              & = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
              & = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
which is the sum of the logarithms of the probabilities of each
observation. Let's illustrate the concept of maximum likelihood
estimation on the arithmetic mean.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Arithmetic mean}
Suppose that the measurements $x_1, x_2, \ldots, x_n$ originate from a
normal distribution \eqref{normpdfmean} and we do not know the
population mean $\mu$ of the normal distribution
(\figrefb{mlemeanfig}). In this setting $\mu$ is the only parameter
$\theta$.  Which value of $\mu$ maximizes the likelihood of the data?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
    the mean.}{Top: The measured data (blue dots) together with three
    normal distributions differing in their means (arrows) from which
    the data could have originated.  Bottom left: the likelihood
    as a function of the parameter $\mu$. For the data it is maximal
    at a value of $\mu = 2$. Bottom right: the log-likelihood. Taking
    the logarithm does not change the position of the maximum.}
\end{figure}

With the normal distribution \eqref{normpdfmean} and applying
logarithmic identities, the log-likelihood \eqref{loglikelihood} reads
\begin{eqnarray}
  \log {\cal L}(\mu|x_1,x_2, \ldots, x_n)
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \nonumber \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\mu)^2}{2\sigma^2} \; .
\end{eqnarray}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. This is the second reason why it is useful to
maximize the log-likelihood. To calculate the maximum of the
log-likelihood, we need to take the derivative with respect to $\mu$
and set it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}\mu} \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n - \frac{\text{d}}{\text{d}\mu} \log \sqrt{2\pi \sigma^2} - \frac{\text{d}}{\text{d}\mu} \frac{(x_i-\mu)^2}{2\sigma^2} \;\; = \;\;  0  \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n \frac{2(x_i-\mu)}{2\sigma^2} & = & 0  \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0  \nonumber \\
  \Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i  \nonumber \\
  \Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of
normally distributed data is the arithmetic mean. That is, the
arithmetic mean maximizes the likelihood that the data originate from
a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Equivalently, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.

\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean
  $\ne 0$ and a standard deviation $\ne 1$.

  Plot the likelihood (the product of the probabilities) and the
  log-likelihood (given by the sum of the logarithms of the
  probabilities) for the mean as parameter. Compare the position of
  the maxima with the mean calculated from the data.
\end{exercise}

Comparing the values of the likelihood with those of the
log-likelihood shown in \figref{mlemeanfig} reveals the numerical
reason for taking the logarithm of the likelihood. The likelihood
values can get very small, because we multiply many, potentially small
probability densities with each other. The likelihood quickly gets
smaller than the smallest number a floating point number on a computer
can represent. Try it by increasing the number of data values in the
exercise. Taking the logarithm avoids this problem. The log-likelihood,
in contrast, results in well-behaved numbers that the computer can
handle.
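The following code sketch illustrates this effect (the data vector and
the parameter values are made up for the purpose of illustration and
are not part of the exercise):
\begin{lstlisting}
n = 10000;                                % number of data values
x = randn(n, 1);                          % hypothetical data, standard normal
lh = prod(normpdf(x, 0.0, 1.0));          % likelihood: underflows to zero
loglh = sum(log(normpdf(x, 0.0, 1.0)));   % log-likelihood: a well-behaved number
\end{lstlisting}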

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
Consider normally distributed data with unknown mean and standard
deviation.  From the considerations above we have just seen that a
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data is
the Gaussian that fits the data best in a maximum likelihood sense,
i.e. the likelihood that the data were generated from this
distribution is the largest. Fitting a Gaussian distribution to data
is very simple: just compute the two parameters of the Gaussian
distribution, $\mu$ and $\sigma$, as the arithmetic mean and the standard
deviation, respectively, directly from the data.
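In MATLAB this amounts to two lines of code (a sketch; the vector
\texttt{x} is assumed to hold the measured data):
\begin{lstlisting}
mu = mean(x);       % maximum likelihood estimate of the mean
sigma = std(x, 1);  % maximum likelihood estimate of the standard deviation
\end{lstlisting}
The second argument of \texttt{std()} makes it normalize by $n$
instead of $n-1$, which is the maximum likelihood estimate of the
standard deviation.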

For non-Gaussian distributions, for example a
\entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
  \label{gammapdf}
  p(x|\alpha,\beta) \sim x^{\alpha-1}e^{-\beta x} \; ,
\end{equation}
however, such simple analytical expressions for the parameters of the
distribution do not exist. This is the case, for example, for the
shape parameter $\alpha$ of the Gamma-distribution. How do we fit such
a distribution to some data?  That is, how should we compute the
values of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by
minimization of the squared difference to a histogram of the measured
data in the same way as we fit a function to some data. For several
reasons this is, however, not the method of choice: (i) Probability
densities can only be positive which leads, for small values in
particular, to asymmetric distributions of the estimated histogram
around the true density. (ii) The values of a histogram estimating the
density are not independent because the integral over a density is
unity. The two basic assumptions of normally distributed and
independent samples, which are a prerequisite for the
minimization of the squared difference to be a maximum likelihood
estimation (see next section), are violated. (iii) The estimation of
the probability density by means of a histogram strongly depends on
the chosen bin size (\figref{mlepdffig}).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
    probability density.}{Left: 100 data points drawn from a 2nd
    order Gamma-distribution. The maximum likelihood estimate of the
    probability density function is shown in orange, the true pdf is
    shown in red. Right: normalized histogram of the data together
    with the real (red) and the fitted probability density
    functions. The fit was done by minimizing the squared difference
    to the histogram.}
\end{figure}

Instead we should stay with maximum-likelihood estimation.  Exactly in
the same way as we estimated the mean value of a Gaussian distribution
above, we can numerically fit the parameters of any type of
distribution directly from the data by means of maximizing the
likelihood.  We simply search for the parameter values of the desired
probability density function that maximize the log-likelihood. In
general this is a non-linear optimization problem that is solved with
numerical methods such as gradient descent \matlabfun{mle()}.
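For the Gamma-distribution such a fit is a single call to
\matlabfun{mle()} (a sketch; \texttt{x} is assumed to hold the
measured data):
\begin{lstlisting}
phat = mle(x, 'distribution', 'gamma');  % maximum likelihood fit
% phat(1): estimated shape parameter, phat(2): estimated scale parameter
\end{lstlisting}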

\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the
  maximum likelihood method to estimate the parameters of the gamma
  distribution from the data.
\end{exercise}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}

When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. In
chapter~\ref{gradientdescentchapter} we simply assumed that ``best''
means minimizing the squared distance between the data and the
function.  With maximum likelihood we search for the parameter value
$\theta$ for which the likelihood that the data were drawn from the
corresponding function is maximal.

If we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$,
the log-likelihood is
\begin{eqnarray}
  \log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}}  \nonumber \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \; .
\end{eqnarray}
The only difference to the previous example is that the averages in
the equations above are now given as the function values
$f(x_i;\theta)$. The parameter $\theta$ should be the one that
maximizes the log-likelihood. The first term of the sum is independent
of $\theta$ and can thus be ignored when computing the maximum:
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta} \; \left( - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \right)
\end{eqnarray}
We can further simplify by inverting the sign and then searching for the
minimum. Also the factor $1/2$ can be ignored since it does not affect
the position of the minimum:
\begin{equation}
  \label{chisqmin}
  \theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard
deviations is also called $\chi^2$ (chi squared). The parameter
$\theta$ which minimizes the squared differences is thus the one that
maximizes the likelihood of the data to actually originate from the
given function. Therefore, minimizing $\chi^2$ is a maximum likelihood
estimation.

From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. In case of other distributions, the log-likelihood
\eqnref{loglikelihood} needs to be adapted accordingly.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}
  \titlecaption{\label{mleproplinefig} Maximum likelihood estimation
    of the slope of a line through the origin.}{The data (blue and
    left histogram) originate from a straight line $y=mx$ through the origin
    (red). The maximum-likelihood estimate of the slope $m$ of the
    regression line (orange), \eqnref{mleslope}, is close to the true
    one. The residuals, the data minus the estimated line (right), reveal
    the normal distribution of the data around the line (right histogram).}
\end{figure}


\subsection{Example: simple proportionality}
The function of a line with slope $\theta$ through the origin is
\[ f(x) = \theta x \; . \]
The $\chi^2$-sum is thus
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \; . \]
To find the minimum we again take the first derivative with
respect to $\theta$ and equate it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}\theta}\chi^2 & = & \frac{\text{d}}{\text{d}\theta} \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & -2 \sum_{i=1}^n  \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
  & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad  \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber
\end{eqnarray}
\begin{eqnarray}
\Leftrightarrow \quad  \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimate of the slope
$\theta$ of the regression line (\figref{mleproplinefig}).
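In MATLAB the slope \eqnref{mleslope} is a one-liner (a sketch;
\texttt{x}, \texttt{y}, and \texttt{sig} are assumed to hold the data
values and their standard deviations):
\begin{lstlisting}
theta = sum(x.*y./sig.^2) / sum(x.^2./sig.^2);  % ML estimate of the slope
\end{lstlisting}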

\subsection{Linear and non-linear fits}
Gradient descent, as used in the previous chapter, is not
necessary for fitting the slope of a straight line, because the slope
can be computed directly via \eqnref{mleslope}. More generally, this
is also the case for fitting the coefficients of linearly combined
basis functions, as for example the slope $m$ and the y-intercept $b$
of a straight line
\[ y = m \cdot x + b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.
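For example, the slope and intercept of a straight line can be
obtained like this (a sketch with made-up data vectors \texttt{x} and
\texttt{y}):
\begin{lstlisting}
p = polyfit(x, y, 1);  % fit a first-order polynomial to the data
m = p(1);              % slope
b = p(2);              % y-intercept
\end{lstlisting}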

Parameters that are non-linearly combined cannot be calculated
analytically. Consider for example the rate $\lambda$ of the
exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the
cost function, e.g. gradient descent \matlabfun{lsqcurvefit()}.
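A minimal sketch of such a non-linear fit (assuming data vectors
\texttt{x} and \texttt{y} and the availability of the Optimization
Toolbox):
\begin{lstlisting}
fexp = @(p, t) p(1)*exp(p(2)*t);     % model function with parameters c and lambda
p0 = [1.0, -1.0];                    % initial guess for [c, lambda]
pest = lsqcurvefit(fexp, p0, x, y);  % least-squares fit of the parameters
\end{lstlisting}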

\subsection{Relation between slope and correlation coefficient}
Let us have a closer look at \eqnref{mleslope}. If the standard
deviation of the data $\sigma_i$ is the same for each data point,
i.e. $\sigma_i = \sigma_j \; \forall \; i, j$, the standard deviation drops
out of \eqnref{mleslope} and we get
\begin{equation}
  \label{whitemleslope}
  \theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \; .
\end{equation}
To see what this expression is, we need to standardize the data. We
make the data mean free and normalize them to their standard
deviation, i.e. $x \mapsto (x - \bar x)/\sigma_x$. The resulting
numbers are also called \entermde[z-values]{z-Wert}{$z$-values} or
$z$-scores and they have the property $\bar x = 0$ and $\sigma_x =
1$. $z$-scores are often used in biology to make quantities that
differ in their units comparable. For standardized data the variance
\begin{equation}
  \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1
\end{equation}
is given by the mean squared data and equals one.
The covariance between $x$ and $y$ also simplifies to
\begin{equation}
  \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i -
  \bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i \; ,
\end{equation}
the averaged product between pairs of $x$ and $y$ values.  Recall that
the correlation coefficient $r_{x,y}$,
\eqnref{correlationcoefficient}, is the covariance normalized by the
product of the standard deviations of $x$ and $y$.
Therefore, if the standard deviations equal one,
the correlation coefficient is identical to the covariance.
Consequently, for standardized data the slope of the regression line
\eqnref{whitemleslope} simplifies to
\begin{equation}
  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y} \; .
\end{equation}
For standardized data the slope of the regression line is the same as the
correlation coefficient!
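This relation can be checked numerically (a sketch with made-up data
vectors \texttt{x} and \texttt{y}):
\begin{lstlisting}
zx = (x - mean(x)) / std(x, 1);    % z-scored x values
zy = (y - mean(y)) / std(y, 1);    % z-scored y values
theta = sum(zx.*zy) / sum(zx.^2);  % slope of the regression line
cc = corrcoef(zx, zy);             % correlation matrix
r = cc(1, 2);                      % correlation coefficient, equals theta
\end{lstlisting}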

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such a
population code is the tuning of neurons in the primary visual cortex
(V1) to the orientation of an edge or bar in the visual
stimulus. Different neurons respond best to different edge
orientations. Traditionally, such a tuning is measured by analyzing
the neuronal response strength (e.g. the firing rate) as a function of
the orientation of a black bar and is illustrated and summarized
with the so-called \enterm{tuning curve} (\determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{mlecoding}
  \titlecaption{\label{mlecodingfig} Maximum likelihood estimation of
    a stimulus parameter from neuronal activity.}{Top: Tuning curve of
    an individual neuron as a function of the stimulus orientation (a
    dark bar in front of a white background). The stimulus that evokes
    the strongest activity in that neuron is the bar with the vertical
    orientation (arrow, $\phi_i=90$\,\degree). The red area indicates
    the variability of the neuronal activity $p(r|\phi)$ around the
    tuning curve. Center: In a population of neurons, each neuron may
    have a different tuning curve (colors). A specific stimulus (the
    vertical bar) activates the individual neurons of the population
    in a specific way (dots). Bottom: The log-likelihood of the
    activity pattern will be maximized close to the real stimulus
    orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what is the
stimulus (here the orientation of an edge)? In the sense of maximum
likelihood, a possible answer to this question would be: the stimulus
for which the particular activity pattern is most likely given the
tuning of the neurons.

Let's stay with the example of the orientation tuning in V1. The
tuning $\Omega_i(\phi)$ of neuron $i$ with preferred edge
orientation $\phi_i$ can be well described by a von Mises function
(the analogue of the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}):
\begin{equation}
  \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ
\end{equation}
If we approximate the neuronal activity by a normal distribution
around the tuning curve with a standard deviation $\sigma=\Omega/4$,
which is proportional to $\Omega$, then the probability $p_i(r|\phi)$
of the $i$-th neuron showing the activity $r$ given a certain
orientation $\phi$ of an edge is given by
\begin{equation}
  p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; .
\end{equation}
The log-likelihood of the edge orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus
\begin{equation}
  \log {\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; .
\end{equation}
The angle $\phi$ that maximizes this log-likelihood is an estimate of
the orientation of the edge.
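As a sketch (with hypothetical vectors \texttt{r} of measured
activities and \texttt{phipref} of preferred orientations in radians,
and a made-up constant \texttt{c}), the orientation can be decoded by
scanning the log-likelihood over candidate orientations:
\begin{lstlisting}
c = 1.0;                        % scaling factor of the tuning curves
phis = linspace(0.0, pi, 180);  % candidate orientations
loglh = zeros(size(phis));
for k = 1:length(phis)
    omega = c * exp(cos(2.0*(phis(k) - phipref)));      % tuning curves
    loglh(k) = sum(log(normpdf(r, omega, omega/4.0)));  % log-likelihood
end
[~, kmax] = max(loglh);
phiest = phis(kmax);            % ML estimate of the edge orientation
\end{lstlisting}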

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions