%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{\tr{Maximum likelihood estimation}{Maximum-Likelihood-Sch\"atzer}}
\label{maximumlikelihoodchapter}
\selectlanguage{english}
There are situations in which we want to estimate one or more
parameters $\theta$ of a probability distribution that best describe
the data $x_1, x_2, \ldots x_n$. \enterm{Maximum likelihood
estimators} (\determ{Maximum-Likelihood-Sch\"atzer},
\determ[mle|see{Maximum-Likelihood-Sch\"atzer}]{mle}) choose the
parameters such that they maximize the likelihood of $x_1, x_2, \ldots
x_n$ originating from the distribution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum Likelihood}
Let $p(x|\theta)$ (read ``probability (density) of $x$ given
$\theta$'') be the probability (density) distribution of $x$ given the
parameters $\theta$. This could be the normal distribution
\begin{equation}
\label{normpdfmean}
p(x|\theta) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
with the mean $\mu$ and the standard deviation $\sigma$ as the
parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots x_n$
are independent and originate from the same probability density
distribution (\enterm{i.i.d.}, independent and identically
distributed), then the conditional probability $p(x_1,x_2, \ldots
x_n|\theta)$ of observing $x_1, x_2, \ldots x_n$ given a specific
$\theta$ is
\begin{equation}
p(x_1,x_2, \ldots x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
\ldots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \enterm{likelihood} of the parameters $\theta$
given the observed data $x_1, x_2, \ldots x_n$ is
\begin{equation}
{\cal L}(\theta|x_1,x_2, \ldots x_n) = p(x_1,x_2, \ldots x_n|\theta) \; .
\end{equation}
Note: the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
L}(\theta|x_1,x_2, \ldots x_n) \, d\theta \ne 1$).
When applying maximum likelihood estimation we are interested in the
parameters $\theta$ that maximize the likelihood (``mle''):
\begin{equation}
\theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots x_n)
\end{equation}
$\text{argmax}_xf(x)$ denotes the value of the argument $x$ of the
function $f(x)$ at which the function $f(x)$ reaches its global
maximum. Thus, we search for the value of $\theta$ at which the
likelihood ${\cal L}(\theta)$ reaches its maximum.
The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons, and for reasons
that we will discuss below, we commonly search for the maximum of the
logarithm of the likelihood (\enterm{log-likelihood}):
\begin{eqnarray}
\theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
& = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
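To make \eqnref{loglikelihood} concrete, the following sketch
evaluates the log-likelihood on a grid of candidate parameter values
and picks the value with the largest log-likelihood. The data, the
grid, and the assumption of a normal distribution with known $\sigma$
are illustrative choices and not part of the derivation:
\begin{verbatim}
% Sketch: evaluate the log-likelihood on a grid of candidate parameters
% and take the maximum. Assumes i.i.d. samples from a normal distribution
% with known sigma; only the mean theta is estimated.
sigma = 1.0;
x = sigma*randn(100, 1) + 2.0;     % simulated data with true mean 2
thetas = linspace(0.0, 4.0, 401);  % candidate values for theta
logl = zeros(size(thetas));
for k = 1:length(thetas)
    p = exp(-(x - thetas(k)).^2/(2*sigma^2))/sqrt(2*pi*sigma^2);
    logl(k) = sum(log(p));         % sum of log-probabilities
end
[~, kmax] = max(logl);
thetamle = thetas(kmax);           % grid-based maximum likelihood estimate
\end{verbatim}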
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Example: the arithmetic mean}
Suppose that the measurements $x_1, x_2, \ldots x_n$ originate from a
normal distribution \eqnref{normpdfmean} and we consider the mean
$\mu=\theta$ as the only parameter. Which value of $\theta$ maximizes
its likelihood?
\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlemean}
\titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
the mean.}{Top: The measured data (blue dots) together with three
different possible normal distributions with different means
(arrows) that could be the origin of the data. Bottom left: the
likelihood as a function of $\theta$, i.e. the mean. It is maximal
at a value of $\theta = 2$. Bottom right: the
log-likelihood. Taking the logarithm does not change the position
of the maximum.}
\end{figure}
The log-likelihood \eqnref{loglikelihood} is
\begin{eqnarray*}
  \log {\cal L}(\theta|x_1,x_2, \ldots x_n)
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\theta)^2}{2\sigma^2}} \\
  & = & \sum_{i=1}^n \left( - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\theta)^2}{2\sigma^2} \right) \; .
\end{eqnarray*}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. To calculate the maximum of the log-likelihood,
we need to take the derivative with respect to $\theta$ and set it to
zero:
\begin{eqnarray*}
  \frac{\text{d}}{\text{d}\theta} \log {\cal L}(\theta|x_1,x_2, \ldots x_n) & = & \sum_{i=1}^n \frac{2(x_i-\theta)}{2\sigma^2} \;\; = \;\; 0 \\
  \Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \theta & = & 0 \\
  \Leftrightarrow \quad n \theta & = & \sum_{i=1}^n x_i \\
  \Leftrightarrow \quad \theta & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray*}
From the above equations it becomes clear that the maximum-likelihood
estimate of $\theta$ is the arithmetic mean of the data. (The second
derivative of the log-likelihood, $-n/\sigma^2$, is negative, so this
extremum is indeed a maximum.) That is, taking the mean of the data as
$\theta$ maximizes the likelihood that the data originate from a
normal distribution with that mean (\figref{mlemeanfig}).
\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean
  different from zero and a standard deviation different from one.
  Plot the likelihood (the product of the probabilities) and the
  log-likelihood (the sum of the logarithms of the probabilities) as
  functions of the mean parameter. Compare the positions of the maxima
  with the mean calculated from the data.
\pagebreak[4]
\end{exercise}
\pagebreak[4]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting as maximum-likelihood estimation}
In curve fitting, a function of the form $f(x;\theta)$ is adapted to
the data pairs $(x_i, y_i)$ by adjusting the parameter $\theta$. When
we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with standard deviations $\sigma_i$,
the log-likelihood is
\begin{eqnarray*}
  \log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \\
  & = & \sum_{i=1}^n \left( - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \right) \; .
\end{eqnarray*}
The only difference to the previous example is that the means in the
equations above are now given by the function values $f(x_i;\theta)$.
The parameter $\theta$ should be the one that maximizes the
log-likelihood. The first term of the sum is independent of $\theta$
and can thus be ignored when searching for the maximum; what remains
to be maximized is
\[ - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \; . \]
We can simplify further by inverting the sign and searching for the
minimum. The factor $1/2$ can also be ignored since it does not affect
the position of the minimum:
\begin{equation}
\label{chisqmin}
\theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences, each normalized by its standard
deviation, is also called $\chi^2$. The parameter $\theta$ that
minimizes the squared differences is thus the one that maximizes the
probability that the data actually originate from the given
function. Minimizing $\chi^2$ is therefore a maximum likelihood
estimation.
From the mathematical considerations above we can see that the
minimization of the squared differences is a maximum-likelihood
estimation only if the data are normally distributed around the
function. For other distributions, the log-likelihood
\eqnref{loglikelihood} needs to be adapted to the respective
distribution and maximized accordingly.
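Under this normality assumption, the $\chi^2$ cost \eqnref{chisqmin}
can also be minimized numerically. The following is a minimal sketch
using \matlabfun{fminsearch()}; the model function, the data and the
initial guess are made up for illustration:
\begin{verbatim}
% Sketch: chi-squared cost for an arbitrary model f(x;theta), minimized
% numerically. Here f is a line through the origin; data, standard
% deviations, and the initial guess are invented for illustration.
f = @(x, theta) theta*x;                 % model function
x = (1:10)';                             % x-values of the data
sig = 0.5*ones(size(x));                 % standard deviations sigma_i
y = f(x, 2.0) + sig.*randn(size(x));     % simulated measurements
chisq = @(theta) sum(((y - f(x, theta))./sig).^2);
thetamle = fminsearch(chisq, 1.0);       % numerical minimum of chi^2
\end{verbatim}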
\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlepropline}
\titlecaption{\label{mleproplinefig} Maximum likelihood estimation
of the slope of a line through the origin.}{}
\end{figure}
\subsection{Example: simple proportionality}
A line through the origin is described by the function
\[ f(x) = \theta x \]
with the slope $\theta$. The $\chi^2$-sum is then
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \; . \]
To find the minimum we again take the first derivative with
respect to $\theta$ and set it to zero:
\begin{eqnarray}
\frac{\text{d}}{\text{d}\theta}\chi^2 & = & \frac{\text{d}}{\text{d}\theta} \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
& = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
& = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
& = & -2 \sum_{i=1}^n \left( \frac{x_iy_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_iy_i}{\sigma_i^2} \nonumber \\
\Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_iy_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
With this we have obtained an analytical expression for the
maximum-likelihood estimate of the slope $\theta$ of the regression
line (\figref{mleproplinefig}).
A gradient descent, as used in the previous chapter, is thus not
necessary for fitting the slope of a line through the origin. This
holds more generally for fitting the coefficients of linearly
combined basis functions, as for example the slope $m$
and the y-intercept $b$ of the linear equation
\[ y = m \cdot x +b \]
or, more generally, the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.
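As a quick numerical check, \eqnref{mleslope} can be evaluated
directly on the data. A minimal sketch (the data and the standard
deviations are simulated for illustration):
\begin{verbatim}
% Sketch: analytical maximum-likelihood estimate of the slope of a line
% through the origin. x, y, and sig are column vectors of equal length;
% the simulated values are made up for illustration.
x = (1:20)';
sig = 0.3*ones(size(x));                 % standard deviations sigma_i
y = 1.5*x + sig.*randn(size(x));         % data around a line with slope 1.5
theta = sum(x.*y./sig.^2)/sum(x.^2./sig.^2);   % estimated slope
\end{verbatim}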
Parameters that enter the function non-linearly cannot be calculated
analytically. Consider for example the rate $\lambda$ of the
exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the
cost function, e.g. the gradient descent \matlabfun{lsqcurvefit()}.
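A hedged sketch of such a numerical fit, assuming the Optimization
Toolbox is available for \matlabfun{lsqcurvefit()}; the data and the
starting values are invented for illustration:
\begin{verbatim}
% Sketch: numerical fit of the non-linear decay y = c*exp(lambda*x).
% lsqcurvefit() is part of the Optimization Toolbox; data and starting
% values p0 are made up for illustration.
decay = @(p, x) p(1)*exp(p(2)*x);        % p(1) = c, p(2) = lambda
x = linspace(0.0, 5.0, 50)';
y = decay([2.0, -0.5], x) + 0.05*randn(size(x));
p0 = [1.0, -1.0];                        % initial guess for [c, lambda]
pest = lsqcurvefit(decay, p0, x, y);     % estimated parameters c and lambda
\end{verbatim}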
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
Finally, let's consider the case in which we want to fit the parameters
of a probability density function (e.g. the shape parameter of a
\enterm{Gamma-distribution}) to a dataset.
A first guess could be to fit the probability density by minimizing
the squared difference to a histogram of the measured data. For
several reasons this is, however, not the method of choice: (i)
Probability densities can only be positive, which, in particular for
small values, leads to asymmetric distributions of the histogram
values. (ii) The values of a histogram are not independent because the
integral of a density is unity. The two basic assumptions of normally
distributed and independent samples, which are a prerequisite for the
minimization of the squared difference \eqnref{chisqmin} to be a
maximum likelihood estimation, are thus violated. (iii) The histogram
strongly depends on the chosen bin size (\figref{mlepdffig}).
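The dependence on the bin size, point (iii), is easy to see in a
small sketch that estimates the density from normalized histograms
with two different numbers of bins (the exponentially distributed data
are made up for illustration):
\begin{verbatim}
% Sketch: normalized histograms as density estimates with different bin
% counts. The exponentially distributed data are generated by inverse
% transform sampling to avoid toolbox dependencies.
data = -2.0*log(rand(200, 1));                % exponential, mean 2
for nbins = [10, 50]
    [counts, centers] = hist(data, nbins);
    binwidth = centers(2) - centers(1);
    pdfest = counts/(sum(counts)*binwidth);   % normalize to unit integral
    bar(centers, pdfest);                     % estimate changes with nbins
    hold on;
end
hold off;
\end{verbatim}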
\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlepdf}
\titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
probability density.}{Left: 100 data points drawn from a
second-order Gamma-distribution. The maximum likelihood estimate of the
probability density function is shown in orange, the true pdf is
shown in red. Right: normalized histogram of the data together
with the real (red) and the fitted probability density
functions. The fit was done by minimizing the squared difference
to the histogram.}
\end{figure}
Using the example of estimating the mean of a normal distribution we
have discussed the direct approach of fitting a probability density to
data via maximum likelihood. We simply search for the parameter
$\theta$ of the desired probability density function that maximizes
the log-likelihood. This is in general a non-linear optimization
problem that is solved with numerical methods such as gradient
descent \matlabfun{mle()}.
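A hedged sketch of this direct approach, assuming the Statistics
Toolbox for \matlabfun{exprnd()} and \matlabfun{mle()}, and using an
exponential distribution so that the Gamma-distribution is left for
the following exercise:
\begin{verbatim}
% Sketch: direct maximum-likelihood fit of a probability density to data.
% Requires the Statistics Toolbox; an exponential distribution is used
% as an illustrative stand-in for the Gamma-distribution of the exercise.
data = exprnd(2.0, 500, 1);                        % samples with true mean 2
phat = mle(data, 'distribution', 'exponential');   % fitted mean parameter
\end{verbatim}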
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Create a sample of gamma-distributed random numbers and apply the
  maximum likelihood method to estimate the parameters of the gamma
  distribution from the data.
\pagebreak
\end{exercise}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such
population coding is the tuning of neurons in the primary visual
cortex (V1) to the orientation of a visual stimulus. Different neurons
respond best to different stimulus orientations. Traditionally, such a
tuning is measured by analyzing the neuronal response strength
(e.g. the firing rate) as a function of the orientation of the visual
stimulus and is depicted and summarized with the so-called
\enterm{tuning-curve} (German \determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{mlecoding}
\titlecaption{\label{mlecodingfig} Maximum likelihood estimation of
a stimulus parameter from neuronal activity.}{Top: Tuning curve of
an individual neuron as a function of the stimulus orientation (a
dark bar in front of a white background). The stimulus that evokes
the strongest activity in that neuron is the bar with the vertical
orientation (arrow, $\phi_i=90$\,\degree). The red area indicates
    the variability of the neuronal activity $p(r)$ around the tuning
curve. Center: In a population of neurons, each neuron may have a
different tuning curve (colors). A specific stimulus (the vertical
bar) activates the individual neurons of the population in a
specific way (dots). Bottom: The log-likelihood of the activity
pattern will be maximized close to the real stimulus orientation.}
\end{figure}
The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what was the
stimulus? In the sense of maximum likelihood, a possible answer to
this question would be: It was the stimulus for which the particular
activity pattern is most likely.
Let's stay with the example of the orientation tuning in V1. The
tuning $\Omega_i(\phi)$ of neuron $i$ with preferred stimulus
orientation $\phi_i$ can be well described by a von Mises function
(the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}):
\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c
\in \reZ \]
The neuronal activity is approximated by a normal distribution around
the tuning curve with a standard deviation $\sigma_i=\Omega_i(\phi)/4$
that is proportional to $\Omega_i$, such that the probability
$p_i(r|\phi)$ of the $i$-th neuron showing the activity $r$ given a
certain orientation $\phi$ is given by
\[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \]
The log-likelihood of the stimulus orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus
\[ \log {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; . \]
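A minimal sketch of this decoding step, evaluating the log-likelihood
on a grid of candidate orientations; the constant $c$, the preferred
orientations, and the simulated responses are illustrative assumptions
(orientations are taken in degrees and converted to radians):
\begin{verbatim}
% Sketch: maximum-likelihood decoding of the stimulus orientation from
% a population response. Tuning constant, preferred orientations, and
% responses are made-up values for illustration.
c = 1.0;
phipref = linspace(0.0, 180.0, 12);            % preferred orientations (deg)
tuning = @(phi, phii) c*exp(cos(2*(phi - phii)/180*pi));  % tuning curves
r = tuning(90.0, phipref) + 0.1*randn(size(phipref));  % noisy responses
phis = linspace(0.0, 180.0, 361);              % candidate orientations
logl = zeros(size(phis));
for k = 1:length(phis)
    om = tuning(phis(k), phipref);             % expected activities Omega_i
    logl(k) = sum(-0.5*((r - om)./(om/4)).^2 - log(sqrt(2*pi)*om/4));
end
[~, kmax] = max(logl);
phimle = phis(kmax);                           % decoded orientation (deg)
\end{verbatim}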
\selectlanguage{english}