%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{\tr{Maximum likelihood estimation}{Maximum-Likelihood-Sch\"atzer}}
\label{maximumlikelihoodchapter}

A common problem in statistics is to estimate one or more parameters
$\theta$ of a probability distribution that best describe the data
$x_1, x_2, \ldots x_n$. \enterm{Maximum likelihood estimators}
(\enterm[mle|see{Maximum likelihood estimators}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}) choose the parameters such
that they maximize the likelihood that the data $x_1, x_2, \ldots x_n$
originate from the distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum Likelihood}

Let $p(x|\theta)$ (read: ``the probability (density) of $x$ given
$\theta$'') be the probability (density) distribution of $x$ given the
parameters $\theta$. This could be the normal distribution
\begin{equation}
\label{normpdfmean}
p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as
parameters $\theta$. If the $n$ independent observations $x_1,
x_2, \ldots x_n$ originate from the same probability density
distribution (they are \enterm{i.i.d.}, independent and identically
distributed) then the conditional probability $p(x_1,x_2, \ldots
x_n|\theta)$ of observing $x_1, x_2, \ldots x_n$ given a specific
$\theta$ is given by
\begin{equation}
p(x_1,x_2, \ldots x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
\cdot \ldots \cdot p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \enterm{likelihood} of the parameters $\theta$
given the observed data $x_1, x_2, \ldots x_n$ is
\begin{equation}
{\cal L}(\theta|x_1,x_2, \ldots x_n) = p(x_1,x_2, \ldots x_n|\theta) \; .
\end{equation}
Note: the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
L}(\theta|x_1,x_2, \ldots x_n) \, d\theta \ne 1$).

When applying maximum likelihood estimation we are interested in the
parameter values
\begin{equation}
\theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots x_n)
\end{equation}
that maximize the likelihood. $\text{argmax}_xf(x)$ denotes the value
of the argument $x$ at which the function $f(x)$ assumes its global
maximum. Thus, we search for the value of $\theta$ at which the
likelihood ${\cal L}(\theta)$ reaches its maximum.

The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons, and for reasons
that we will discuss below, we commonly search for the maximum of the
logarithm of the likelihood (\enterm{log-likelihood}):
\begin{eqnarray}
\theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
& = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
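
The numerical advantage of the log-likelihood is easy to demonstrate
with a few lines of code. The following sketch (data simulated from a
standard normal distribution, purely for illustration) evaluates both
the product of the densities and the sum of their logarithms:
\begin{verbatim}
% for many data points the product of densities underflows to zero,
% whereas the sum of the logarithms stays finite:
n = 1000;                        % number of data points
x = randn(n, 1);                 % i.i.d. samples, standard normal
mu = 0.0; sigma = 1.0;           % parameters theta of the normal pdf
p = exp(-(x-mu).^2/(2*sigma^2)) / sqrt(2*pi*sigma^2);
likelihood = prod(p)             % underflows to 0 for large n
loglikelihood = sum(log(p))      % finite, on the order of -1400
\end{verbatim}
For $n=1000$ the product is smaller than the smallest representable
double-precision number and is rounded to zero, while the
log-likelihood remains a perfectly ordinary number.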

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Example: the arithmetic mean}

Suppose that the measurements $x_1, x_2, \ldots x_n$ originate from a
normal distribution \eqnref{normpdfmean} and we consider the mean
$\mu$ as the only parameter $\theta$. Which value of $\theta$
maximizes the likelihood of the data?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
    the mean.}{Top: The measured data (blue dots) together with three
    different possible normal distributions with different means
    (arrows) from which the data could have originated. Bottom left:
    the likelihood as a function of $\theta$, i.e. the mean. It is
    maximal at a value of $\theta = 2$. Bottom right: the
    log-likelihood. Taking the logarithm does not change the position
    of the maximum.}
\end{figure}

The log-likelihood \eqnref{loglikelihood} is
\begin{eqnarray*}
\log {\cal L}(\theta|x_1,x_2, \ldots x_n)
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\theta)^2}{2\sigma^2}} \\
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\theta)^2}{2\sigma^2} \; .
\end{eqnarray*}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. To calculate the maximum of the log-likelihood,
we need to take the derivative with respect to $\theta$ and set it to
zero:
\begin{eqnarray*}
\frac{\text{d}}{\text{d}\theta} \log {\cal L}(\theta|x_1,x_2, \ldots x_n) & = & \sum_{i=1}^n \frac{2(x_i-\theta)}{2\sigma^2} \;\; = \;\; 0 \\
\Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \theta & = & 0 \\
\Leftrightarrow \quad n \theta & = & \sum_{i=1}^n x_i \\
\Leftrightarrow \quad \theta & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray*}
Thus, the maximum likelihood estimator is the arithmetic mean. That
is, the arithmetic mean maximizes the likelihood that the data
originate from a normal distribution (\figref{mlemeanfig}).

\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean
  $\ne 0$ and a standard deviation $\ne 1$.

  Plot the likelihood (the product of the probabilities) and the
  log-likelihood (given by the sum of the logarithms of the
  probabilities) as functions of the mean parameter. Compare the
  position of the maxima with the mean calculated from the data.
  \pagebreak[4]
\end{exercise}

\pagebreak[4]

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}

When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. With maximum likelihood we search
for the parameter value $\theta$ for which the likelihood that the
data were drawn from the corresponding function is maximal. If we
assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with standard deviations $\sigma_i$,
the log-likelihood is
\begin{eqnarray*}
\log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \\
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \; .
\end{eqnarray*}
The only difference to the previous example is that the means in
the equations above are now given by the function values
$f(x_i;\theta)$. The parameter $\theta$ should be the one that
maximizes the log-likelihood. The first term of the sum is independent
of $\theta$ and can thus be ignored when computing the maximum:
\begin{eqnarray*}
& = & - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2
\end{eqnarray*}
We can further simplify by inverting the sign and then searching for
the minimum. Also the factor $1/2$ can be ignored since it does not
affect the position of the minimum:
\begin{equation}
\label{chisqmin}
\theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard
deviations is also called $\chi^2$. The parameter $\theta$ that
minimizes the squared differences is thus the one that maximizes the
likelihood that the data actually originate from the given
function. Minimizing $\chi^2$ therefore is a maximum likelihood
estimation.
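
In practice, the $\chi^2$ of \eqnref{chisqmin} is treated as a cost
function of $\theta$ that is handed to a numerical minimizer. A
minimal sketch (simulated data for a line through the origin; data
values, error bars, and the starting point are illustrative
assumptions):
\begin{verbatim}
x = linspace(0.0, 10.0, 40)';        % positions of the data points
sig = 0.5*ones(size(x));             % assumed measurement errors
y = 2.0*x + sig.*randn(size(x));     % noisy data around a line
f = @(x, theta) theta*x;             % model: line through the origin
chisq = @(theta) sum(((y - f(x, theta))./sig).^2);  % chi-squared cost
thetamle = fminsearch(chisq, 1.0)    % numerical minimum, close to 2
\end{verbatim}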

From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. For other distributions, the log-likelihood
\eqnref{loglikelihood} needs to be adapted accordingly and then
maximized.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}
  \titlecaption{\label{mleproplinefig} Maximum likelihood estimation
    of the slope of a line through the origin.}{}
\end{figure}

\subsection{Example: simple proportionality}

The function of a line with slope $\theta$ through the origin is
\[ f(x) = \theta x \; . \]
The $\chi^2$-sum is thus
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \; . \]
To find the minimum we again take the first derivative with
respect to $\theta$ and equate it to zero:
\begin{eqnarray}
\frac{\text{d}}{\text{d}\theta}\chi^2 & = & \frac{\text{d}}{\text{d}\theta} \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
& = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
& = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
& = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber \\
\Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}

This is an analytical expression for the estimation of the slope
$\theta$ of the regression line (\figref{mleproplinefig}).
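
Applied to the simulated data from the sketch above, the analytical
solution \eqnref{mleslope} is a one-liner:
\begin{verbatim}
% the analytical slope estimate on the data vectors x, y, sig:
theta = sum(x.*y./sig.^2) / sum(x.^2./sig.^2)
\end{verbatim}
Its result agrees with the numerically minimized $\chi^2$ from above.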
A gradient descent, as we used it in the previous chapter, is not
necessary for fitting the slope of a straight line, because the slope
can be computed directly via \eqnref{mleslope}. More generally, this
is also the case for fitting the coefficients of linearly combined
basis functions, as for example the slope $m$ and the y-intercept $b$
of a straight line
\[ y = m \cdot x + b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.
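
For example, a straight line is a polynomial of order one, so its two
coefficients can be estimated directly (unweighted least squares; the
data vectors are assumed to be given):
\begin{verbatim}
p = polyfit(x, y, 1);    % p(1) is the slope m, p(2) the intercept b
yest = polyval(p, x);    % fitted line evaluated at the data positions
\end{verbatim}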
Parameters that are combined non-linearly cannot be calculated
analytically. Consider, for example, the rate $\lambda$ of the
exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the
cost function, e.g. the gradient descent \matlabfun{lsqcurvefit()}.
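
A minimal sketch of such a fit (simulated decay data; the initial
values in \texttt{p0} are an assumption that should be adapted to the
data; \matlabfun{lsqcurvefit()} is part of the optimization toolbox):
\begin{verbatim}
x = linspace(0.0, 5.0, 50)';
y = 3.0*exp(-0.5*x) + 0.1*randn(size(x));  % simulated decay data
fdecay = @(p, x) p(1)*exp(p(2)*x);         % model with p = [c, lambda]
p0 = [1.0, -1.0];                          % initial guess
pest = lsqcurvefit(fdecay, p0, x, y)       % approximately [3, -0.5]
\end{verbatim}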

Let us have a closer look at \eqnref{mleslope}. If the standard
deviation of the data $\sigma_i$ is the same for each data point,
i.e. $\sigma_i = \sigma_j \; \forall \; i, j$, the standard deviation
drops out of \eqnref{mleslope} and we get
\begin{equation}
\label{whitemleslope}
\theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \; .
\end{equation}
To see what this expression means, we standardize the data. We
make the data mean free and normalize them to their standard
deviation, i.e. $x \mapsto (x - \bar x)/\sigma_x$. The resulting
numbers are also called \enterm{$z$-values} or $z$-scores and they
have the properties $\bar x = 0$ and $\sigma_x = 1$. In this case,
the variance
\[ \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1 \]
is the mean squared data and equals one.
The covariance between $x$ and $y$ also simplifies to
\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \frac{1}{n} \sum_{i=1}^n x_i y_i \; , \]
the averaged product between pairs of $x$ and $y$ values.
Recall that the correlation coefficient $r_{x,y}$ is the covariance
normalized by the product of the standard deviations of $x$ and $y$.
Therefore, if the standard deviations equal one, the correlation
coefficient equals the covariance.
Consequently, for standardized data the slope of the regression line
\eqnref{whitemleslope} simplifies to
\begin{equation}
\theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y} \; .
\end{equation}
For standardized data the slope of the regression line equals the
correlation coefficient!
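
This identity is easily verified numerically (simulated, correlated
data as an assumption):
\begin{verbatim}
x = randn(100, 1);                % random data
y = 0.5*x + randn(100, 1);        % correlated with x
zx = (x - mean(x))/std(x);        % z-scores of x
zy = (y - mean(y))/std(y);        % z-scores of y
theta = sum(zx.*zy)/sum(zx.^2)    % slope of eq. whitemleslope
r = corrcoef(x, y);               % correlation matrix
r(1, 2)                           % same value as theta
\end{verbatim}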

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}

Finally let's consider the case in which we want to fit the parameters
of a probability density function (e.g. the shape parameter of a
\enterm{Gamma-distribution}) to a dataset.

A first guess could be to fit the probability density by minimizing
the squared difference to a histogram of the measured data. For
several reasons this is, however, not the method of choice: (i)
Probability densities can only be positive, which, in particular for
small values, leads to asymmetric distributions. (ii) The values of
a histogram estimating the density are not independent because the
integral over a density is unity. The two basic assumptions of
normally distributed and independent samples, which are a prerequisite
for the minimization of the squared difference \eqnref{chisqmin} to be
a maximum likelihood estimation, are violated. (iii) The histogram
strongly depends on the chosen bin size (\figref{mlepdffig}).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
    probability density.}{Left: the 100 data points drawn from a 2nd
    order Gamma-distribution. The maximum likelihood estimation of the
    probability density function is shown in orange, the true pdf is
    shown in red. Right: normalized histogram of the data together
    with the true (red) and the fitted (orange) probability density
    functions. The fit was done by minimizing the squared difference
    to the histogram.}
\end{figure}

Using the example of estimating the mean value of a normal
distribution we have discussed the direct approach of fitting a
probability density to data via maximum likelihood. We simply search
for the parameter $\theta$ of the desired probability density function
that maximizes the log-likelihood. In general this is a non-linear
optimization problem that is solved with numerical methods such as
gradient descent \matlabfun{mle()}.
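
A minimal sketch (the true shape and scale parameters of the simulated
data are arbitrary choices; \matlabfun{mle()} requires the statistics
toolbox):
\begin{verbatim}
data = gamrnd(2.0, 1.0, 500, 1);           % simulated gamma data
phat = mle(data, 'distribution', 'gamma')  % returns [shape, scale]
\end{verbatim}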
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the
  maximum likelihood method to estimate the parameters of the gamma
  distribution from the data.
  \pagebreak
\end{exercise}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such
a population code is the tuning of neurons in the primary visual
cortex (V1) to the orientation of a visual stimulus. Different neurons
respond best to different stimulus orientations. Traditionally, such a
tuning is measured by analyzing the neuronal response strength
(e.g. the firing rate) as a function of the orientation of the visual
stimulus and is depicted and summarized by the so-called
\enterm{tuning-curve} (\determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).

\begin{figure}[tp]
  \includegraphics[width=1\textwidth]{mlecoding}
  \titlecaption{\label{mlecodingfig} Maximum likelihood estimation of
    a stimulus parameter from neuronal activity.}{Top: Tuning curve of
    an individual neuron as a function of the stimulus orientation (a
    dark bar in front of a white background). The stimulus that evokes
    the strongest activity in that neuron is the bar with the vertical
    orientation (arrow, $\phi_i=90$\,\degree). The red area indicates
    the variability of the neuronal activity $p(r|\phi)$ around the
    tuning curve. Center: In a population of neurons, each neuron may
    have a different tuning curve (colors). A specific stimulus (the
    vertical bar) activates the individual neurons of the population
    in a specific way (dots). Bottom: The log-likelihood of the
    activity pattern is maximized close to the true stimulus
    orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what is the
stimulus? In the sense of maximum likelihood, a possible answer to
this question would be: the stimulus for which the particular
activity pattern is most likely given the tuning of the neurons.

Let's stay with the example of orientation tuning in V1. The
tuning $\Omega_i(\phi)$ of neuron $i$ to its preferred stimulus
orientation $\phi_i$ can be well described by a von Mises function
(the analogue of the Gaussian function on a cyclic x-axis)
(\figref{mlecodingfig}):
\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c
\in \reZ \]
If we approximate the neuronal activity by a normal distribution
around the tuning curve with a standard deviation $\sigma=\Omega/4$
that is proportional to $\Omega$, then the probability
$p_i(r|\phi)$ of the $i$-th neuron showing the activity $r$ given a
certain orientation $\phi$ is given by
\[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \]
The log-likelihood of the stimulus orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, \ldots $r_n$ is thus
\[ \log {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; . \]
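
A minimal sketch of this decoding scheme (the number of neurons, the
constant $c=1$ of the tuning curves, and the noise-free activity
pattern are illustrative assumptions):
\begin{verbatim}
n = 8;                             % number of neurons
phipref = (0:n-1)'*180.0/n;        % preferred orientations in degree
tuning = @(phi) exp(cos(2.0*(phi - phipref)*pi/180.0));  % c = 1
r = tuning(90.0);                  % activity evoked by a 90 deg bar
phis = 0.0:0.5:180.0;              % candidate orientations
logl = zeros(size(phis));
for k = 1:length(phis)
    om = tuning(phis(k));          % predicted activities
    sig = om/4.0;                  % standard deviations
    logl(k) = sum(-log(sqrt(2.0*pi)*sig) - 0.5*((r - om)./sig).^2);
end
[~, kmax] = max(logl);
phiest = phis(kmax)                % peaks at the true orientation
\end{verbatim}
In this noise-free example the estimated orientation equals the
stimulus orientation; with noisy activities $r_i$ the log-likelihood
still peaks close to it (\figref{mlecodingfig}, bottom).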
|