%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Maximum likelihood estimation}
\label{maximumlikelihoodchapter}
\exercisechapter{Maximum likelihood estimation}

The core task of statistics is to infer from measured data some
parameters describing the data. These parameters can simply be a mean,
a standard deviation, or any other parameter needed to describe the
distribution the data originate from, a correlation coefficient, or
some parameters of a function describing a particular dependence
between the data. The brain faces exactly the same problem. Given the
activity pattern of some neurons (the data) it needs to infer some
aspects (parameters) of the environment and the internal state of the
body in order to generate some useful behavior. One possible approach
to estimate parameters from data is provided by \enterm[maximum likelihood
estimator]{maximum likelihood estimators} (\enterm[mle|see{maximum
likelihood estimator}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}). They choose the parameters
such that the likelihood of the specific data values to originate from
the corresponding distribution is maximized.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum likelihood}

Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given
$\theta$'') be the probability (density) distribution of data value $x$
given parameter values $\theta$. This could be the normal distribution
\begin{equation}
  \label{normpdfmean}
  p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as
parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots, x_n$
are independent of each other and originate from the same probability
density distribution (they are \enterm[i.i.d.|see{independent and
identically distributed}]{i.i.d.}, \enterm{independent and
identically distributed}), then the conditional probability
$p(x_1,x_2, \ldots, x_n|\theta)$ of observing the particular data
values $x_1, x_2, \ldots, x_n$ given some specific parameter values
$\theta$ of the probability density is given by the product of the
probability densities of each data value:
\begin{equation}
  \label{probdata}
  p(x_1,x_2, \ldots, x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
  \cdot \ldots \cdot p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \entermde{Likelihood}{likelihood} of the parameters $\theta$
given the observed data $x_1, x_2, \ldots, x_n$ is
\begin{equation}
  \label{likelihood}
  {\cal L}(\theta|x_1,x_2, \ldots, x_n) = p(x_1,x_2, \ldots, x_n|\theta) \; .
\end{equation}
Note that the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
L}(\theta|x_1,x_2, \ldots, x_n) \, d\theta \ne 1$). For given
observations $x_1, x_2, \ldots, x_n$, the likelihood
\eqref{likelihood} is a function of the parameters $\theta$. This
function has a global maximum for some specific parameter values. At
this maximum the probability \eqref{probdata} of observing the measured
data values is largest.

Maximum likelihood estimators just find the parameter values
\begin{equation}
  \theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots, x_n)
\end{equation}
that maximize the likelihood \eqref{likelihood}.
$\text{argmax}_xf(x)$ is the value of the argument $x$ for which the
function $f(x)$ assumes its global maximum. Thus, we search for the
parameter values $\theta$ at which the likelihood ${\cal L}(\theta)$
reaches its maximum. For these parameter values the measured data most
likely originated from the corresponding distribution.

The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons and reasons that
we discuss below, we instead search for the maximum of the logarithm
of the likelihood
(\entermde[likelihood!log-]{Likelihood!Log-}{log-likelihood})
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
  & = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
which is the sum of the logarithms of the probabilities of each
observation. Let's illustrate the concept of maximum likelihood
estimation with the arithmetic mean.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Arithmetic mean}
Suppose that the measurements $x_1, x_2, \ldots, x_n$ originate from a
normal distribution \eqref{normpdfmean} and we do not know the
population mean $\mu$ of the normal distribution
(\figrefb{mlemeanfig}). In this setting $\mu$ is the only parameter
$\theta$. Which value of $\mu$ maximizes the likelihood of the data?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
    the mean.}{Top: The measured data (blue dots) together with three
    normal distributions differing in their means (arrows) from which
    the data could have originated. Bottom left: the likelihood
    as a function of the parameter $\mu$. For these data it is maximal
    at a value of $\mu = 2$. Bottom right: the log-likelihood. Taking
    the logarithm does not change the position of the maximum.}
\end{figure}

With the normal distribution \eqref{normpdfmean} and applying
logarithmic identities, the log-likelihood \eqref{loglikelihood} reads
\begin{eqnarray}
  \log {\cal L}(\mu|x_1,x_2, \ldots, x_n)
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \nonumber \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\mu)^2}{2\sigma^2} \; .
\end{eqnarray}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. This is the second reason why it is useful to
maximize the log-likelihood. To calculate the maximum of the
log-likelihood, we need to take the derivative with respect to $\mu$
and set it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}\mu} \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n - \frac{\text{d}}{\text{d}\mu} \log \sqrt{2\pi \sigma^2} - \frac{\text{d}}{\text{d}\mu} \frac{(x_i-\mu)^2}{2\sigma^2} \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n \frac{2(x_i-\mu)}{2\sigma^2} & = & 0 \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0 \nonumber \\
  \Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i \nonumber \\
  \Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x \label{arithmeticmean}
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of
normally distributed data is the arithmetic mean. That is, the
arithmetic mean maximizes the likelihood that the data originate from
a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Likewise, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.

\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean
  $\ne 0$ and a standard deviation $\ne 1$.

  Plot the likelihood (the product of the probabilities) and the
  log-likelihood (given by the sum of the logarithms of the
  probabilities) as a function of the mean. Compare the position of
  the maxima with the mean calculated from the data.
\end{exercise}

Comparing the values of the likelihood with the ones of the
log-likelihood shown in \figref{mlemeanfig} reveals the numerical
reason for taking the logarithm of the likelihood. The likelihood
values can get very small, because we multiply many, potentially small
probability densities with each other. The likelihood quickly gets
smaller than the smallest positive number a computer's floating point
numbers can represent. Try it by increasing the number of data values
in the exercise. Taking the logarithm avoids this problem. The
log-likelihood takes on well-behaved numbers that can easily be
handled by the computer.
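
The following lines illustrate this numerical problem (a minimal
sketch, assuming the Statistics Toolbox function \matlabfun{normpdf()}
is available; variable names are our choice):
\begin{lstlisting}
% draw many data values from a standard normal distribution:
n = 10000;
x = randn(n, 1);
% likelihood of mu=0: product of the probability densities,
% underflows to zero for large n:
lh = prod(normpdf(x, 0.0, 1.0));
% log-likelihood of mu=0: sum of the log probability densities,
% stays a perfectly representable number:
loglh = sum(log(normpdf(x, 0.0, 1.0)));
fprintf('likelihood: %g, log-likelihood: %g\n', lh, loglh)
\end{lstlisting}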

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
Consider normally distributed data with unknown mean and standard
deviation. From the considerations above we have just seen that a
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data is
the Gaussian that fits the data best in a maximum likelihood sense,
i.e. the likelihood that the data were generated from this
distribution is the largest. Fitting a Gaussian distribution to data
is very simple: just compute the two parameters $\mu$ and $\sigma$ of
the Gaussian distribution as the arithmetic mean and the standard
deviation, respectively, directly from the data.

For non-Gaussian distributions, for example a
\entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
  \label{gammapdf}
  p(x|\alpha,\beta) \sim x^{\alpha-1}e^{-\beta x}
\end{equation}
with a shape parameter $\alpha$ and a rate parameter $\beta$
(\figrefb{mlepdffig}), such simple analytical expressions for
estimating the parameters directly from the data in general do not
exist. How do we fit such a distribution to the data? That is, how
should we compute the values of the parameters of the distribution,
given the data?

A first guess could be to fit the probability density function by a
\enterm{least squares} fit to a normalized histogram of the measured
data in the same way as we fit a function to some data. For several
reasons this is, however, not the method of choice: (i) Probability
densities can only be positive, which leads, in particular for small
values, to asymmetric distributions of the estimated histogram around
the true density. (ii) The values of a histogram estimating the
density are not independent because the integral over a density is
unity. The two basic assumptions of normally distributed and
independent samples, which are a prerequisite for the minimization of
the squared difference to be a maximum likelihood estimation (see next
section), are violated. (iii) The estimation of the probability
density by means of a histogram strongly depends on the chosen bin
size.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
    probability density.}{Left: 100 data points drawn from a 2nd
    order Gamma-distribution. The maximum likelihood estimate of the
    probability density function is shown in orange, the true pdf is
    shown in red. Right: normalized histogram of the data together
    with the true probability density (red) and the probability
    density function obtained by a least squares fit to the
    histogram.}
\end{figure}

Instead we should stay with maximum-likelihood estimation. Exactly in
the same way as we estimated the mean value of a Gaussian distribution
above, we can numerically estimate the parameters of any type of
distribution directly from the data by maximizing the likelihood. We
simply search for the parameter values of the desired probability
density function that maximize the log-likelihood. In general this is
a non-linear optimization problem that is solved with numerical
methods such as gradient descent \matlabfun{mle()}.
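
As a rough sketch of how such a fit can look in practice, the
following lines use the Statistics Toolbox functions
\matlabfun{gamrnd()}, \matlabfun{mle()}, and \matlabfun{gampdf()}
(variable names and parameter values are made up for illustration; see
also the exercise below):
\begin{lstlisting}
% draw 100 gamma-distributed data values
% (MATLAB parameterizes the gamma distribution by shape and scale):
shape = 2.0;
scale = 1.0;
x = gamrnd(shape, scale, 100, 1);
% maximum likelihood fit of the gamma distribution to the data:
phat = mle(x, 'distribution', 'gamma');   % phat = [shape, scale]
% compare estimated and true probability density:
xx = linspace(0.0, max(x), 200);
plot(xx, gampdf(xx, phat(1), phat(2)));
hold on;
plot(xx, gampdf(xx, shape, scale));
hold off;
\end{lstlisting}
Note that the rate parameter $\beta$ of \eqnref{gammapdf} is the
inverse of the scale parameter returned by \matlabfun{mle()}.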

\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the
  maximum likelihood method to estimate the parameters of the gamma
  distribution from the data.
\end{exercise}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}

When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameters $\theta$ such that the
function best describes the data. In
chapter~\ref{gradientdescentchapter} we simply assumed that ``best''
means minimizing the squared distance between the data and the
function. With maximum likelihood we search for those parameter
values $\theta$ that maximize the likelihood of the data to be drawn
from the corresponding function.

If we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$,
the log-likelihood is
\begin{eqnarray}
  \log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \nonumber \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \; .
\end{eqnarray}
The only difference to the previous example of the arithmetic mean is
that the means $\mu$ in the equations above are given by the function
values $f(x_i;\theta)$. The parameters $\theta$ should maximize the
log-likelihood. The first term in the sum is independent of $\theta$
and can be ignored when computing the maximum. From the second
term we pull out the constant factor $-\frac{1}{2}$:
\begin{eqnarray}
  & = & - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2
\end{eqnarray}
We can further simplify by inverting the sign and then searching for
the minimum. Also the factor $\frac{1}{2}$ can be ignored since it
does not affect the position of the minimum:
\begin{equation}
  \label{chisqmin}
  \theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences between the $y$-data values and the
function values, each normalized by the standard deviation of the data
around the function, is called $\chi^2$ (chi squared). The parameter
$\theta$ that minimizes the squared differences is thus the one that
maximizes the likelihood of the data to actually originate from the
given function.

Whether we minimize $\chi^2$ or the \enterm{mean squared error}
\eqref{meansquarederror} introduced in
chapter~\ref{gradientdescentchapter} does not matter. The latter is
the mean and $\chi^2$ is the sum of the squared differences. They
simply differ by the constant factor $n$, the number of data pairs,
which does not affect the position of the minimum. $\chi^2$ is more
general in that it allows for different standard deviations for each
data pair. If they are all the same ($\sigma_i = \sigma$), the common
standard deviation can be pulled out of the sum and also does not
influence the position of the minimum. Both \enterm{least squares} and
minimizing $\chi^2$ are maximum likelihood estimators of the
parameters $\theta$ of a function.

From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. In case of other distributions, the log-likelihood
\eqref{loglikelihood} needs to be adapted accordingly.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}
  \titlecaption{\label{mleproplinefig} Maximum likelihood estimation
    of the slope of a line through the origin.}{The data (blue and
    left histogram) originate from a straight line $y=mx$ through the
    origin (red). The maximum-likelihood estimate of the slope $m$ of
    the regression line (orange), \eqnref{mleslope}, is close to the
    true one. The residuals, the data minus the estimated line
    (right), reveal the normal distribution of the data around the
    line (right histogram).}
\end{figure}

Let's go on and calculate the minimum \eqref{chisqmin} of $\chi^2$
analytically for a simple function.


\subsection{Straight line through the origin}
The function of a straight line with slope $m$ through the origin
is
\[ f(x) = m x \; . \]
With this specific function, $\chi^2$ reads
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-m x_i}{\sigma_i} \right)^2
\; . \] To calculate the minimum we take the first derivative with
respect to $m$ and equate it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}m}\chi^2 & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}m} \left( \frac{y_i-m x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-m x_i}{\sigma_i} \right) \nonumber \\
  & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - m \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad m \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber \\
  \Leftrightarrow \quad m & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimation of the slope $m$
of the regression line (\figref{mleproplinefig}). We do not need to
apply numerical methods like gradient descent to find the slope that
minimizes the squared differences. Instead, we can compute the slope
directly from the data by means of \eqnref{mleslope}, very much like
we calculate the mean of some data by means of the arithmetic mean
\eqref{arithmeticmean}.
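
For equal standard deviations $\sigma_i$ this is a one-liner. The
following sketch (our own illustration with made-up data and variable
names) simulates data scattered around a line through the origin and
estimates the slope according to \eqnref{mleslope}:
\begin{lstlisting}
% data points scattered around a line through the origin with true slope 2:
n = 40;
x = 10.0 * rand(n, 1);
y = 2.0 * x + randn(n, 1);
% maximum likelihood estimate of the slope (equal sigmas cancel out):
m = sum(x .* y) / sum(x .* x);
\end{lstlisting}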

\subsection{Linear and non-linear fits}
A gradient descent, as we have done in the previous chapter, is not
necessary for fitting the slope of a straight line, because the slope
can be computed directly via \eqnref{mleslope}. More generally, this
is also the case for fitting the coefficients of linearly combined
basis functions, such as the slope $m$ and the y-intercept $b$
of a straight line
\[ y = m \cdot x + b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.
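
For example, a straight line can be fitted to data with
\matlabfun{polyfit()} like this (a minimal sketch with made-up data):
\begin{lstlisting}
% data scattered around a straight line y = m*x + b:
x = linspace(0.0, 5.0, 30);
y = 1.5 * x + 0.8 + 0.5 * randn(size(x));
% fit a polynomial of degree one, i.e. a straight line:
p = polyfit(x, y, 1);    % p(1) is the slope, p(2) the y-intercept
\end{lstlisting}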

Parameters that are non-linearly combined cannot be calculated
analytically. Consider, for example, the factor $c$ and the rate
$\lambda$ of the exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the
cost function, e.g. gradient descent \matlabfun{lsqcurvefit()}.
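
A minimal sketch of such a non-linear fit, assuming the Optimization
Toolbox function \matlabfun{lsqcurvefit()} is available (data and
starting values are made up for illustration):
\begin{lstlisting}
% data scattered around an exponential decay with c=5 and lambda=-0.5:
xd = linspace(0.0, 10.0, 50);
yd = 5.0 * exp(-0.5 * xd) + 0.2 * randn(size(xd));
% model function with parameter vector p = [c, lambda]:
expdecay = @(p, x) p(1) * exp(p(2) * x);
% non-linear least squares fit, starting from an initial guess p0:
p0 = [1.0, -1.0];
pest = lsqcurvefit(expdecay, p0, xd, yd);
\end{lstlisting}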

\subsection{Relation between slope and correlation coefficient}
Let us have a closer look at \eqnref{mleslope} for the slope of a line
through the origin. If the standard deviation of the data $\sigma_i$
is the same for each data point, i.e. $\sigma_i = \sigma_j \; \forall
\; i, j$, the standard deviation drops out and \eqnref{mleslope}
simplifies to
\begin{equation}
  \label{whitemleslope}
  m = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \; .
\end{equation}
To see what the numerator and the denominator of this expression
describe, we need to subtract from the data their mean value. We make
the data mean free, i.e. $x \mapsto x - \bar x$ and $y \mapsto y -
\bar y$. For mean-free data the variance
\begin{equation}
  \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2
\end{equation}
is given by the mean squared data. In the same way, the covariance
between $x$ and $y$ simplifies to
\begin{equation}
  \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i -
  \bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i \; ,
\end{equation}
the averaged product between pairs of $x$ and $y$ values. Expanding
the fraction in \eqnref{whitemleslope} by $\frac{1}{n}$ in both the
numerator and the denominator we get
\begin{equation}
  \label{meanfreeslope}
  m = \frac{\frac{1}{n}\sum_{i=1}^n x_i y_i}{\frac{1}{n}\sum_{i=1}^n x_i^2}
  = \frac{\text{cov}(x, y)}{\sigma_x^2} \; .
\end{equation}

Recall that the correlation coefficient $r_{x,y}$ is the covariance
normalized by the product of the standard deviations of $x$ and $y$:
\begin{equation}
  \label{meanfreecorrcoef}
  r_{x,y} = \frac{\text{cov}(x, y)}{\sigma_x\sigma_y} \; .
\end{equation}
If furthermore the standard deviations of $x$ and $y$ are the same,
i.e. $\sigma_x = \sigma_y$, the slope of a line through the origin is
identical to the correlation coefficient.

This relation between slope and correlation coefficient in particular
holds for standardized data that have been made mean free and have
been normalized by their standard deviation, i.e. $x \mapsto (x - \bar
x)/\sigma_x$ and $y \mapsto (y - \bar y)/\sigma_y$. The resulting
numbers are called \entermde[z-value]{z-Wert}{$z$-values} or
\enterm[z-score]{$z$-scores}. Their mean equals zero and their
standard deviation one. $z$-scores are often used to make quantities
that differ in their units comparable. For standardized data the
denominators of both the slope \eqref{meanfreeslope} and the
correlation coefficient \eqref{meanfreecorrcoef} equal one, and both
the slope of the regression line and the correlation coefficient
reduce to the covariance between the $x$ and $y$ data:
\begin{equation}
  m = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y} \; .
\end{equation}
For standardized data the slope of the regression line is the same as
the correlation coefficient!
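
This can be checked numerically with a few lines of code (a minimal
sketch; the data are made up for illustration):
\begin{lstlisting}
% correlated data:
x = randn(1000, 1);
y = 0.5 * x + randn(1000, 1);
% standardize both variables (z-scores):
xz = (x - mean(x)) / std(x);
yz = (y - mean(y)) / std(y);
% slope of the regression line through the origin of the standardized data:
m = sum(xz .* yz) / sum(xz .* xz);
% correlation coefficient of the original data:
rm = corrcoef(x, y);
r = rm(1, 2);
% m and r are identical (up to rounding errors).
\end{lstlisting}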


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
Maximum likelihood estimators are not only a central concept for data
analysis. Neural systems face the very same problem. They also need to
estimate parameters of the internal and external environment based on
the activity of neurons.

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{mlecoding}
  \titlecaption{\label{mlecodingfig} Maximum likelihood estimation of
    a stimulus parameter from neuronal activity.}{Top: Tuning curve
    $r_i(\phi;c,\phi_i)$ of a specific neuron $i$ as a function of the
    orientation $\phi$ of a stimulus, a dark bar in front of a white
    background. The preferred stimulus $\phi_i$ of that neuron, the
    one that evokes the strongest firing rate response, is a bar with
    vertical orientation (arrow, $\phi_i=90$\,\degree). The width of
    the red area indicates the variability of the neuronal activity
    $\sigma_i$ around the tuning curve. Center: In a population of
    neurons, each neuron may have a different tuning curve (colors). A
    specific stimulus activates the individual neurons of the
    population in a specific way (dots). Bottom: The log-likelihood of
    the activity pattern has a maximum close to the real stimulus
    orientation.}
\end{figure}

In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such a
population code is the tuning of neurons in the primary visual cortex
(V1) to the orientation of a bar in the visual stimulus. Different
neurons respond best to different bar orientations. Traditionally,
such a tuning is measured by analyzing the neuronal response strength
(e.g. the firing rate) as a function of the orientation of a black bar
and is illustrated and summarized with the so-called
\enterm{tuning-curve} (\determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).

The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what is the
stimulus? In our example, what is the orientation of the bar? In the
sense of maximum likelihood, a possible answer to this question would
be: the stimulus for which the particular activity pattern is most
likely given the tuning of the neurons and the noise (standard
deviation) of the responses.

Let's stay with the example of the orientation tuning in V1. The
tuning of the firing rate $r_i(\phi)$ of neuron $i$ to the preferred
bar orientation $\phi_i$ can be well described by a von Mises
function (the analog of the Gaussian function on a cyclic x-axis)
(\figref{mlecodingfig}):
\begin{equation}
  \label{bartuningcurve}
  r_i(\phi; c, \phi_i) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c > 0 \; .
\end{equation}
Note the factor two in the cosine, because the response of the neuron
is the same for a bar rotated by 180\,\degree.

If we approximate the neuronal activity by a normal distribution
around the tuning curve with a standard deviation $\sigma_i$, then the
probability $p_i(r|\phi)$ of the $i$-th neuron having the observed
activity $r$, given a certain orientation $\phi$ of a bar is
\begin{equation}
  p_i(r|\phi) = \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{1}{2}\left(\frac{r-r_i(\phi; c, \phi_i)}{\sigma_i}\right)^2} \; .
\end{equation}
The log-likelihood of the bar orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus
\begin{equation}
  \log {\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; .
\end{equation}
The angle $\phi_{mle}$ that maximizes this log-likelihood is an
estimate of the true orientation of the bar (\figref{mlecodingfig}).
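
A minimal sketch of this decoding scheme (tuning parameters, noise
level, and variable names are made up for illustration):
\begin{lstlisting}
% preferred orientations of 8 neurons, in radians:
phipref = pi * (0:7) / 8;
c = 10.0;          % gain of the tuning curves
sigma = 1.5;       % standard deviation of the responses
% population response to a bar with true orientation 60 degrees:
phitrue = pi / 3.0;
rates = c * exp(cos(2.0 * (phitrue - phipref))) + sigma * randn(size(phipref));
% log-likelihood for a grid of candidate orientations:
phis = linspace(0.0, pi, 360);
loglh = zeros(size(phis));
for k = 1:length(phis)
    tuning = c * exp(cos(2.0 * (phis(k) - phipref)));
    loglh(k) = sum(-0.5 * ((rates - tuning) / sigma).^2 - log(sqrt(2.0 * pi) * sigma));
end
% maximum likelihood estimate of the bar orientation:
[~, kmax] = max(loglh);
phimle = phis(kmax);
\end{lstlisting}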

The noisiness of the neurons' responses as quantified by $\sigma_i$
usually is a function of the neuron's mean firing rate $r_i$:
$\sigma_i = \sigma_i(r_i)$. This dependence has a major impact on the
maximum likelihood estimation. Usually, the stronger the response of a
neuron, i.e. the higher its firing rate, the lower the noise. In this
case, strong responses have a stronger influence on the position of
the maximum of the log-likelihood.

Whether neural systems really implement maximum likelihood estimators
is another question. There are many ways in which a stimulus property
can be read out from the activity of a population of neurons. The
simplest one is a ``winner-takes-all'' rule: the preferred stimulus
parameter of the neuron with the strongest response is taken as the
estimate. Another possibility is to compute a population vector, as
sketched below. The estimated stimulus parameter is the sum of the
preferred stimulus parameters of all neurons in the population,
weighted by the activity of the neurons. In case of angular stimulus
parameters, like the orientation of the bar in our example, a vector
pointing in the direction of the angle is used instead of the angle to
incorporate the cyclic nature of the parameter.
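
A population vector readout for the orientation example could look
like this (a sketch continuing the hypothetical variables from the
listing above; the factor two accounts for the 180\,\degree\
periodicity of the orientation):
\begin{lstlisting}
% vector sum of the doubled preferred orientations, weighted by the rates:
pv = sum(rates .* exp(2.0i * phipref));
% the angle of the resulting vector, halved, estimates the orientation:
phipv = 0.5 * angle(pv);
if phipv < 0.0
    phipv = phipv + pi;   % map the estimate back into [0, pi)
end
\end{lstlisting}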

Using maximum likelihood estimators for analyzing neural population
activity gives us an upper bound on how well a stimulus parameter is
encoded in the activity of the neurons. The brain would not be able to
do better.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions