313 lines
16 KiB
TeX
313 lines
16 KiB
TeX
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\chapter{\tr{Maximum likelihood estimation}{Maximum-Likelihood-Sch\"atzer}}
|
|
\label{maximumlikelihoodchapter}
|
|
|
|
\selectlanguage{english}
|
|
|
|
There are situations in which we want to estimate one or more
|
|
parameters $\theta$ of a probability distribution that best describe
|
|
the data $x_1, x_2, \ldots x_n$. \enterm{Maximum likelihood
|
|
estimators} (\determ{Maximum-Likelihood-Sch\"atzer},
|
|
\determ[mle|see{Maximum-Likelihood-Sch\"atzer}]{mle}) choose the
|
|
parameters such that it maximizes the likelihood of $x_1, x_2, \ldots
|
|
x_n$ originating from the distribution.
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Maximum Likelihood}
|
|
|
|
Let $p(x|\theta)$ (to be read as ``probability(density) of $x$ given
|
|
$\theta$.'') the probability (density) distribution of $x$ given the
|
|
parameters $\theta$. This could be the normal distribution
|
|
\begin{equation}
|
|
\label{normpdfmean}
|
|
p(x|\theta) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
|
|
\end{equation}
|
|
defined by the mean ($\mu$) and the standard deviation $\sigma$ as
|
|
parameters $\theta$. If the $n$ independent observations of $x_1,
|
|
x_2, \ldots x_n$ originate from the same probability density
|
|
distribution (\enterm{i.i.d.} independent and identically distributed)
|
|
then is the conditional probability $p(x_1,x_2, \ldots x_n|\theta)$ of
|
|
observing $x_1, x_2, \ldots x_n$ given the a specific $\theta$,
|
|
\begin{equation}
|
|
p(x_1,x_2, \ldots x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
|
|
\ldots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
|
|
\end{equation}
|
|
Vice versa, is the \enterm{likelihood} of the parameters $\theta$
|
|
given the observed data $x_1, x_2, \ldots x_n$,
|
|
\begin{equation}
|
|
{\cal L}(\theta|x_1,x_2, \ldots x_n) = p(x_1,x_2, \ldots x_n|\theta) \; .
|
|
\end{equation}
|
|
Note: the likelihood ${\cal L}$ is not a probability in the
|
|
classic sense since it does not integrate to unity ($\int {\cal
|
|
L}(\theta|x_1,x_2, \ldots x_n) \, d\theta \ne 1$).
|
|
|
|
When applying maximum likelihood estimations we are interested in the
|
|
parameters $\theta$ that maximize the likelihood (``mle''):
|
|
\begin{equation}
|
|
\theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots x_n)
|
|
\end{equation}
|
|
$\text{argmax}_xf(x)$ denotes the values of the argument $x$ of the
|
|
function $f(x)$ at which the function $f(x)$ reaches its global
|
|
maximum. Thus, we search the value of $\theta$ at which the
|
|
likelihood ${\cal L}(\theta)$ reaches its maximum.
|
|
|
|
The position of a function's maximum does not change when the values
|
|
of the function are transformed by a strictly monotonously rising
|
|
function such as the logarithm. For numerical and reasons that we will
|
|
discuss below, we commonly search for the maximum of the logarithm of
|
|
the likelihood (\enterm{log-likelihood}):
|
|
|
|
\begin{eqnarray}
|
|
\theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
|
|
& = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
|
|
& = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
|
|
& = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
|
|
\end{eqnarray}
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\subsection{Example: the arithmetic mean}
|
|
Suppose that the measurements $x_1, x_2, \ldots x_n$ originate from a
|
|
normal distribution \eqnref{normpdfmean} and we consider the mean
|
|
$\mu=\theta$ as the only parameter. Which value of $\theta$ maximizes
|
|
its likelihood?
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics[width=1\textwidth]{mlemean}
|
|
\titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
|
|
the mean.}{Top: The measured data (blue dots) together with three
|
|
different possible normal distributions with different means
|
|
(arrows) that could be the origin of the data. Bootom left: the
|
|
likelihood as a function of $\theta$ i.e. the mean. It is maximal
|
|
at a value of $\theta = 2$. Bottom right: the
|
|
log-likelihood. Taking the logarithm does not change the position
|
|
of the maximum.}
|
|
\end{figure}
|
|
The log-likelihood \eqnref{loglikelihood}
|
|
\begin{eqnarray*}
|
|
\log {\cal L}(\theta|x_1,x_2, \ldots x_n)
|
|
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\theta)^2}{2\sigma^2}} \\
|
|
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\theta)^2}{2\sigma^2} \; .
|
|
\end{eqnarray*}
|
|
% FIXME do we need parentheses around the normal distribution in line one?
|
|
Since the logarithm is the inverse function of the exponential
|
|
($\log(e^x)=x$), taking the logarithm removes the exponential from the
|
|
normal distribution. To calculate the maximum of the log-likelihood,
|
|
we need to take the derivative with respect to $\theta$ and set it to
|
|
zero:
|
|
\begin{eqnarray*}
|
|
\frac{\text{d}}{\text{d}\theta} \log {\cal L}(\theta|x_1,x_2, \ldots x_n) & = & \sum_{i=1}^n - \frac{2(x_i-\theta)}{2\sigma^2} \;\; = \;\; 0 \\
|
|
\Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \theta & = & 0 \\
|
|
\Leftrightarrow \quad n \theta & = & \sum_{i=1}^n x_i \\
|
|
\Leftrightarrow \quad \theta & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
|
|
\end{eqnarray*}
|
|
From the above equations it becomes clear that the maximum likelihood
|
|
estimation is equivalent to the mean of the data. That is, the
|
|
assuming the mean of the data as $\theta$ maximizes the likelihood
|
|
that the data originate from a normal distribution with that mean
|
|
(\figref{mlemeanfig}).
|
|
|
|
|
|
\begin{exercise}{mlemean.m}{mlemean.out}
|
|
Draw $n=50$ random numbers from a normal distribution with a mean of
|
|
$\ne 0$ and a standard deviation of $\ne 1$.
|
|
|
|
Plot the likelihood (the product of the probabilities) and the
|
|
log-likelihood (given by the sum of the logarithms of the
|
|
probabilities) for the mean as parameter. Compare the position of
|
|
the maxima with the mean calculated from the data.
|
|
\pagebreak[4]
|
|
\end{exercise}
|
|
|
|
|
|
\pagebreak[4]
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Curve fitting as using maximum-likelihood estimation}
|
|
|
|
During curve fitting a function of the form $f(x;\theta)$ with the
|
|
parameter $\theta$ is adapted to the data pairs $(x_i|y_i)$ by
|
|
adapting $\theta$. When we assume that the $y_i$ values are normally
|
|
distributed around the function values $f(x_i;\theta)$ with a standard
|
|
deviation $\sigma_i$, the log-likelihood is
|
|
|
|
\begin{eqnarray*}
|
|
\log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
|
|
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \\
|
|
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \\
|
|
\end{eqnarray*}
|
|
The only difference to the previous example is that the averages in
|
|
the equations above are now given as the function values
|
|
$f(x_i;\theta)$
|
|
|
|
The parameter $\theta$ should be the one that maximizes the
|
|
log-likelihood. The first part of the sum is independent of $\theta$
|
|
and can thus be ignored during the search of the maximum:
|
|
\begin{eqnarray*}
|
|
& = & - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2
|
|
\end{eqnarray*}
|
|
We can further simplify by inverting the sign and then search for the
|
|
minimum. Also the $1/2$ factor can be ignored since it does not affect
|
|
the position of the minimum:
|
|
\begin{equation}
|
|
\label{chisqmin}
|
|
\theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
|
|
\end{equation}
|
|
The sum of the squared differences when normalized to the standard
|
|
deviation is also called $\chi^2$. The parameter $\theta$ which
|
|
minimizes the squared differences is thus the one that maximizes the
|
|
probability that the data actually originate from the given
|
|
function. Minimizing $\chi^2$ therefore is a maximum likelihood
|
|
estimation.
|
|
|
|
From the mathematical considerations above we can see that the
|
|
minimization of the squared difference is a maximum-likelihood
|
|
estimation only if the data are normally distributed around the
|
|
function. In case of other distributions, the log-likelihood needs to
|
|
be adapted accordingly \eqnref{loglikelihood} and be maximized
|
|
respectively.
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics[width=1\textwidth]{mlepropline}
|
|
\titlecaption{\label{mleproplinefig} Maximum likelihood estimation
|
|
of the slope of line through the origin.}{}
|
|
\end{figure}
|
|
|
|
|
|
\subsection{Example: simple proportionality}
|
|
The function of a line going through the origin
|
|
\[ f(x) = \theta x \]
|
|
with the slope $\theta$. The $\chi^2$-sum is thus
|
|
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \; . \]
|
|
To estimate the minimum we again take the first derivative with
|
|
respect to $\theta$ and equate it to zero:
|
|
\begin{eqnarray}
|
|
\frac{\text{d}}{\text{d}\theta}\chi^2 & = & \frac{\text{d}}{\text{d}\theta} \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
|
|
& = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
|
|
& = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
|
|
& = & -2 \sum_{i=1}^n \left( \frac{x_iy_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
|
|
\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_iy_i}{\sigma_i^2} \nonumber \\
|
|
\Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_iy_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
|
|
\end{eqnarray}
|
|
With this we obtained an analytical expression for the estimation of
|
|
the slope $\theta$ of the regression line (\figref{mleproplinefig}).
|
|
|
|
A gradient descent, as we have done in the previous chapter, is thus
|
|
not necessary for the fitting of the slope of a linear equation. This
|
|
holds even more generally for fitting the coefficients of linearly
|
|
combined basis functions as for example the fitting of the slope $m$
|
|
and the y-intercept $b$ of the linear equation
|
|
\[ y = m \cdot x +b \]
|
|
or, more generally, the coefficients $a_k$ of a polynom
|
|
\[ y = \sum_{k=0}^N a_k x^k = a_o + a_1x + a_2x^2 + a_3x^4 + \ldots \]
|
|
\matlabfun{polyfit()}.
|
|
|
|
Parameters that are non-linearly combined can not be calculated
|
|
analytically. Consider for example the rate $\lambda$ of the
|
|
exponential decay
|
|
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
|
|
Such cases require numerical solutions for the optimization of the
|
|
cost function, e.g. the gradient descent \matlabfun{lsqcurvefit()}.
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Fits von Wahrscheinlichkeitsverteilungen}
|
|
Finally let's consider the case in which we want to fit the parameters
|
|
of a probability density function (e.g. the shape parameter of a
|
|
\enterm{Gamma-distribution}) to a dataset.
|
|
|
|
A first guess could be to fit the probability density by minimization
|
|
of the squared difference to a histogram of the measured data. For
|
|
several reasons this is, however, not the method of choice: (i)
|
|
probability densities can only be positive which leads, for small
|
|
values in particular, to asymmetric distributions. (ii) the values of
|
|
a histogram are not independent because the integral of a density is
|
|
unity. The two basic assumptions of normally distributed and
|
|
independent samples, which are a prerequisite make the minimization of
|
|
the squared difference \eqnref{chisqmin} to a maximum likelihood
|
|
estimation, are violated. (iii) The histogram strongly depends on the
|
|
chosen bin size \figref{mlepdffig}).
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics[width=1\textwidth]{mlepdf}
|
|
\titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
|
|
probability density.}{Left: the 100 data points drawn from a 2nd
|
|
order Gamma-distribution. The maximum likelihood estimation of the
|
|
probability density function is shown in orange, the true pdf is
|
|
shown in red. Right: normalized histogram of the data together
|
|
with the real (red) and the fitted probability density
|
|
functions. The fit was done by minimizing the squared difference
|
|
to the histogram.}
|
|
\end{figure}
|
|
|
|
|
|
Using the example of the estimating the mean value of a normal
|
|
distribution we have discussed the direct approach to fit a
|
|
probability density to data via maximum likelihood. We simply search
|
|
for the parameter $\theta$ of the desired probability density function
|
|
that maximizes the log-likelihood. This is a non-linear optimization
|
|
problem that is generally solved with numerical methods such as the
|
|
gradient descent \matlabfun{mle()}.
|
|
|
|
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
|
|
Create a sample of gamma-distributed random number and apply the
|
|
maximum likelihood method to estimate the parameters of the gamma
|
|
function from the data.
|
|
\pagebreak
|
|
\end{exercise}
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\section{Neural coding}
|
|
In sensory systems certain aspects of the surrounding are encoded in
|
|
the neuronal activity of populations of neurons. One example of such
|
|
population coding is the tuning of neurons in the primary visual
|
|
cortex (V1) to the orientation of a visual stimulus. Different neurons
|
|
respond best to different stimulus orientations. Traditionally, such a
|
|
tuning is measured by analyzing the neuronal response strength
|
|
(e.g. the firing rate) as a function of the orientation of the visual
|
|
stimulus and is depicted and summarized with the so called
|
|
\enterm{tuning-curve}(German \determ{Abstimmkurve},
|
|
figure~\ref{mlecoding}, top).
|
|
|
|
\begin{figure}[tp]
|
|
\includegraphics[width=1\textwidth]{mlecoding}
|
|
\titlecaption{\label{mlecodingfig} Maximum likelihood estimation of
|
|
a stimulus parameter from neuronal activity.}{Top: Tuning curve of
|
|
an individual neuron as a function of the stimulus orientation (a
|
|
dark bar in front of a white background). The stimulus that evokes
|
|
the strongest activity in that neuron is the bar with the vertical
|
|
orientation (arrow, $\phi_i=90$\,\degree). The red area indicates
|
|
the variability of the neuronal activity $p(r)$ around the tunig
|
|
curve. Center: In a population of neurons, each neuron may have a
|
|
different tuning curve (colors). A specific stimulus (the vertical
|
|
bar) activates the individual neurons of the population in a
|
|
specific way (dots). Bottom: The log-likelihood of the activity
|
|
pattern will be maximized close to the real stimulus orientation.}
|
|
\end{figure}
|
|
|
|
The brain, however, is confronted with the inverse problem: given a
|
|
certain activity pattern in the neuronal population, what was the
|
|
stimulus? In the sense of maximum likelihood, a possible answer to
|
|
this question would be: It was the stimulus for which the particular
|
|
activity pattern is most likely.
|
|
|
|
Let's stay with the example of the orientation tuning in V1. The
|
|
tuning $\Omega_i(\phi)$ of the neurons $i$ to the preferred stimulus
|
|
orientation $\phi_i$ can be well described using a van-Mises function
|
|
(the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}):
|
|
\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c
|
|
\in \reZ \]
|
|
The neuronal activity is approximated with a normal
|
|
distribution around the tuning curve with a standard deviation
|
|
$\sigma=\Omega/4$ which is proprotional to $\Omega$ such that the
|
|
probability $p_i(r|\phi)$ of the $i$-th neuron showing the activity
|
|
$r$ given a certain orientation $\phi$ is given by
|
|
|
|
\[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \]
|
|
The log-likelihood of the stimulus orientation $\phi$ given the
|
|
activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus
|
|
\[ {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \]
|
|
|
|
\selectlanguage{english}
|