%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Maximum likelihood estimation}
\label{maximumlikelihoodchapter}
\exercisechapter{Maximum likelihood estimation}
The core task of statistics is to infer from measured data some
parameters describing the data. These parameters can be simply a mean,
a standard deviation, or any other parameter needed to describe the
distribution the data are originating from, a correlation
coefficient, or some parameters of a function describing a particular
dependence between the data. The brain faces exactly the same
problem. Given the activity pattern of some neurons (the data) it
needs to infer some aspects (parameters) of the environment and the
internal state of the body in order to generate some useful
behavior. One possible approach for estimating parameters from data is the use of
\enterm[maximum likelihood estimator]{maximum likelihood estimators}
(\enterm[mle|see{maximum likelihood estimator}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}). They choose the parameters
such that they maximize the likelihood of the specific data values to
originate from a specific distribution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum likelihood}
Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given
$\theta$'') be the probability (density) distribution of data value $x$
given parameter values $\theta$. This could be the normal distribution
\begin{equation}
\label{normpdfmean}
p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as
parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots, x_n$
are independent of each other and originate from the same probability
density distribution (they are \enterm[i.i.d.|see{independent and
identically distributed}]{i.i.d.}, \enterm{independent and
identically distributed}), then the conditional probability
$p(x_1,x_2, \ldots, x_n|\theta)$ of observing the particular data
values $x_1, x_2, \ldots, x_n$ given some specific parameter values
$\theta$ of the probability density is given by the product of the
probability densities of each data value:
\begin{equation}
\label{probdata}
p(x_1,x_2, \ldots, x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
\cdot \ldots \cdot p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \entermde{Likelihood}{likelihood} of the parameters $\theta$
given the observed data $x_1, x_2, \ldots, x_n$ is
\begin{equation}
\label{likelihood}
{\cal L}(\theta|x_1,x_2, \ldots, x_n) = p(x_1,x_2, \ldots, x_n|\theta) \; .
\end{equation}
Note that the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
L}(\theta|x_1,x_2, \ldots, x_n) \, d\theta \ne 1$). For given
observations $x_1, x_2, \ldots, x_n$, the likelihood
\eqref{likelihood} is a function of the parameters $\theta$. This
function has a global maximum for some specific parameter values. At
this maximum the probability \eqref{probdata} to observe the measured
data values is the largest.
Maximum likelihood estimators just find the parameter values
\begin{equation}
\theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots, x_n)
\end{equation}
that maximize the likelihood \eqref{likelihood}.
$\text{argmax}_xf(x)$ is the value of the argument $x$ for which the
function $f(x)$ assumes its global maximum. Thus, we search for the
parameter values $\theta$ at which the likelihood ${\cal L}(\theta)$
reaches its maximum. For these parameter values the measured data most
likely originated from the corresponding distribution.
The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons and reasons that
we discuss below, we instead search for the maximum of the logarithm
of the likelihood
(\entermde[likelihood!log-]{Likelihood!Log-}{log-likelihood})
\begin{eqnarray}
\theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
& = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
which is the sum of the logarithms of the probabilities of each
observation. Let's illustrate the concept of maximum likelihood
estimation on the arithmetic mean.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Arithmetic mean}
Suppose that the measurements $x_1, x_2, \ldots, x_n$ originate from a
normal distribution \eqref{normpdfmean} and we do not know the
population mean $\mu$ of the normal distribution
(\figrefb{mlemeanfig}). In this setting $\mu$ is the only parameter
$\theta$. Which value of $\mu$ maximizes the likelihood of the data?
\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlemean}
\titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
the mean.}{Top: The measured data (blue dots) together with three
normal distributions differing in their means (arrows) from which
the data could have originated. Bottom left: the likelihood
as a function of the parameter $\mu$. For the data it is maximal
at a value of $\mu = 2$. Bottom right: the log-likelihood. Taking
the logarithm does not change the position of the maximum.}
\end{figure}
With the normal distribution \eqref{normpdfmean} and applying
logarithmic identities, the log-likelihood \eqref{loglikelihood} reads
\begin{eqnarray}
\log {\cal L}(\mu|x_1,x_2, \ldots, x_n)
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \nonumber \\
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\mu)^2}{2\sigma^2} \; .
\end{eqnarray}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. This is the second reason why it is useful to
maximize the log-likelihood. To calculate the maximum of the
log-likelihood, we need to take the derivative with respect to $\mu$
and set it to zero:
\begin{eqnarray}
\frac{\text{d}}{\text{d}\mu} \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n - \frac{\text{d}}{\text{d}\mu} \log \sqrt{2\pi \sigma^2} - \frac{\text{d}}{\text{d}\mu} \frac{(x_i-\mu)^2}{2\sigma^2} \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad \sum_{i=1}^n \frac{2(x_i-\mu)}{2\sigma^2} & = & 0 \nonumber \\
\Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0 \nonumber \\
\Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i \nonumber \\
\Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of
normally distributed data is the arithmetic mean. That is, the
arithmetic mean maximizes the likelihood that the data originate from
a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Equivalently, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.
\begin{exercise}{mlemean.m}{mlemean.out}
Draw $n=50$ random numbers from a normal distribution with a mean
$\mu \ne 0$ and a standard deviation $\sigma \ne 1$.
Plot the likelihood (the product of the probabilities) and the
log-likelihood (given by the sum of the logarithms of the
probabilities) for the mean as parameter. Compare the position of
the maxima with the mean calculated from the data.
\end{exercise}
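A minimal sketch of such a computation (assuming the true standard
deviation is known, using the Statistics Toolbox function
\matlabfun{normpdf()}, and omitting the plotting) could look like this:
\begin{verbatim}
% Sketch: evaluate likelihood and log-likelihood on a grid of means.
n = 50;
x = 2.0 + 0.5*randn(n, 1);          % data with true mean 2 and sigma 0.5
mus = 0.0:0.01:4.0;                 % candidate values for the parameter mu
lik = zeros(size(mus));
loglik = zeros(size(mus));
for k = 1:length(mus)
    p = normpdf(x, mus(k), 0.5);    % density of each data point
    lik(k) = prod(p);               % likelihood: product of densities
    loglik(k) = sum(log(p));        % log-likelihood: sum of log densities
end
[~, imax] = max(loglik);
fprintf('maximum at mu = %.3f, mean(x) = %.3f\n', mus(imax), mean(x));
\end{verbatim}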
Comparing the values of the likelihood with those of the
log-likelihood shown in \figref{mlemeanfig} reveals the numerical
reason for taking the logarithm of the likelihood. The likelihood
values can get very small, because we multiply many, potentially small
probability densities with each other. The likelihood quickly gets
smaller than the smallest number a floating point number of a computer
can represent. Try it by increasing the number of data values in the
exercise. Taking the logarithm avoids this problem. The log-likelihood
takes on well-behaved values that can easily be handled by the
computer.
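The underflow can be demonstrated directly. In the following sketch
(with arbitrarily chosen sample sizes) the product of the densities
numerically underflows to zero for sufficiently many data values,
whereas the sum of their logarithms remains perfectly representable:
\begin{verbatim}
% Sketch: the likelihood underflows for many data values,
% the log-likelihood does not.
for n = [10, 100, 1000, 10000]
    x = randn(n, 1);                 % standard normally distributed data
    p = normpdf(x, 0.0, 1.0);        % densities at the true parameter values
    fprintf('n = %5d: likelihood = %g, log-likelihood = %g\n', ...
            n, prod(p), sum(log(p)));
end
\end{verbatim}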
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
Consider normally distributed data with unknown mean and standard
deviation. From the considerations above we have just seen that a
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data is
the Gaussian that fits the data best in a maximum likelihood sense,
i.e. the likelihood that the data were generated from this
distribution is the largest. Fitting a Gaussian distribution to data
is very simple: just compute the two parameters of the Gaussian
distribution, $\mu$ and $\sigma$, as the arithmetic mean and the
standard deviation, respectively, directly from the data.
For non-Gaussian distributions, for example a
\entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
\label{gammapdf}
p(x|\alpha,\beta) \sim x^{\alpha-1}e^{-\beta x} \; ,
\end{equation}
however, such simple analytical expressions for the parameters of the
distribution do not exist. This is the case, for example, for the
shape parameter $\alpha$ of the Gamma-distribution. How do we fit such
a distribution to some data? That is, how should we compute the
values of the parameters of the distribution, given the data?
A first guess could be to fit the probability density function by
minimization of the squared difference to a histogram of the measured
data in the same way as we fit a function to some data. For several
reasons this is, however, not the method of choice: (i) Probability
densities can only be positive which leads, for small values in
particular, to asymmetric distributions of the estimated histogram
around the true density. (ii) The values of a histogram estimating the
density are not independent because the integral over a density is
unity. The two basic assumptions of normally distributed and
independent samples, which are a prerequisite for making the
minimization of the squared difference a maximum likelihood
estimation (see next section), are violated. (iii) The estimation of
the probability density by means of a histogram strongly depends on
the chosen bin size (\figref{mlepdffig}).
\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlepdf}
\titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
probability density.}{Left: the 100 data points drawn from a 2nd
order Gamma-distribution. The maximum likelihood estimation of the
probability density function is shown in orange, the true pdf is
shown in red. Right: normalized histogram of the data together
with the real (red) and the fitted probability density
functions. The fit was done by minimizing the squared difference
to the histogram.}
\end{figure}
Instead we should stay with maximum-likelihood estimation. Exactly in
the same way as we estimated the mean value of a Gaussian distribution
above, we can numerically fit the parameters of any type of
distribution directly from the data by means of maximizing the
likelihood. We simply search for the parameter values of the desired
probability density function that maximize the log-likelihood. In
general this is a non-linear optimization problem that is solved with
numerical methods such as gradient descent \matlabfun{mle()}.
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
Generate a sample of gamma-distributed random numbers and apply the
maximum likelihood method to estimate the parameters of the gamma
distribution from the data.
\end{exercise}
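A sketch of such a fit (assuming the Statistics Toolbox functions
\matlabfun{gamrnd()} and \matlabfun{mle()}; note that MATLAB
parameterizes the gamma distribution by a shape and a scale parameter,
the latter being the inverse of the rate $\beta$ in \eqnref{gammapdf})
could look like this:
\begin{verbatim}
% Sketch: maximum likelihood fit of a gamma distribution.
x = gamrnd(2.0, 1.0, 100, 1);             % shape 2, scale 1
phat = mle(x, 'distribution', 'gamma');   % numerical maximum likelihood fit
fprintf('estimated shape = %.3f, scale = %.3f\n', phat(1), phat(2));
\end{verbatim}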
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}
When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. In
chapter~\ref{gradientdescentchapter} we simply assumed that ``best''
means minimizing the squared distance between the data and the
function. With maximum likelihood we search for the parameter value
$\theta$ for which the likelihood that the data were drawn from the
corresponding function is maximal.
If we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$,
the log-likelihood is
\begin{eqnarray}
\log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \nonumber \\
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2} \; .
\end{eqnarray}
The only difference to the previous example is that the means in the
equations above are replaced by the function values
$f(x_i;\theta)$. The parameter $\theta$ should be the one that
maximizes the log-likelihood. The first term of the sum is independent
of $\theta$ and can thus be ignored when computing the maximum:
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta} \; \left( - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \right)
\end{eqnarray}
We can further simplify by inverting the sign and then search for the
minimum. Also the factor $1/2$ can be ignored since it does not affect
the position of the minimum:
\begin{equation}
\label{chisqmin}
\theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard
deviation is also called $\chi^2$ (chi squared). The parameter
$\theta$ which minimizes the squared differences is thus the one that
maximizes the likelihood of the data to actually originate from the
given function. Therefore, minimizing $\chi^2$ is a maximum likelihood
estimation.
From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. In case of other distributions, the log-likelihood
\eqnref{loglikelihood} needs to be adapted accordingly.
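This equivalence can also be checked numerically. The following sketch
(assuming a simple proportionality $f(x;\theta)=\theta x$, normally
distributed residuals with $\sigma_i=1$, and the built-in optimizer
\matlabfun{fminsearch()}) fits the slope once by minimizing $\chi^2$
and once by maximizing the log-likelihood; both yield the same
estimate:
\begin{verbatim}
% Sketch: minimizing chi^2 is equivalent to maximizing the log-likelihood.
x = (1:20)';
y = 1.5*x + randn(size(x));                    % data around the line y = 1.5 x
chisq = @(theta) sum((y - theta*x).^2);        % sigma_i = 1 for all data points
loglik = @(theta) sum(-0.5*log(2*pi) - 0.5*(y - theta*x).^2);
theta_chisq = fminsearch(chisq, 1.0);          % minimize chi^2
theta_mle = fminsearch(@(t) -loglik(t), 1.0);  % maximize the log-likelihood
fprintf('chi^2: %.4f, maximum likelihood: %.4f\n', theta_chisq, theta_mle);
\end{verbatim}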
\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlepropline}
\titlecaption{\label{mleproplinefig} Maximum likelihood estimation
of the slope of a line through the origin.}{The data (blue and
left histogram) originate from a straight line $y=mx$ through the origin
(red). The maximum-likelihood estimation of the slope $m$ of the
regression line (orange), \eqnref{mleslope}, is close to the true
one. The residuals, the data minus the estimated line (right), reveal
the normal distribution of the data around the line (right histogram).}
\end{figure}
\subsection{Example: simple proportionality}
The function of a line with slope $\theta$ through the origin is
\[ f(x) = \theta x \; . \]
The $\chi^2$-sum is thus
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \; . \]
To estimate the minimum we again take the first derivative with
respect to $\theta$ and equate it to zero:
\begin{eqnarray}
\frac{\text{d}}{\text{d}\theta}\chi^2 & = & \frac{\text{d}}{\text{d}\theta} \sum_{i=1}^n \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
& = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
& = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
& = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber
\end{eqnarray}
\begin{eqnarray}
\Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimation of the slope
$\theta$ of the regression line (\figref{mleproplinefig}).
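A direct numerical check of \eqnref{mleslope} might look as follows (a
sketch with simulated data and, for simplicity, the same standard
deviation for every data point):
\begin{verbatim}
% Sketch: analytical maximum likelihood estimate of the slope.
x = (0:0.5:10)';
sigma = 0.5;                            % same standard deviation for all y_i
y = 2.0*x + sigma*randn(size(x));       % data scattered around y = 2 x
theta = sum(x.*y/sigma^2) / sum(x.^2/sigma^2);
fprintf('estimated slope theta = %.3f\n', theta);
\end{verbatim}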
\subsection{Linear and non-linear fits}
A gradient descent, as we have done in the previous chapter, is not
necessary for fitting the slope of a straight line, because the slope
can be directly computed via \eqnref{mleslope}. More generally, this
is the case also for fitting the coefficients of linearly combined
basis functions as for example the slope $m$ and the y-intercept $b$
of a straight line
\[ y = m \cdot x +b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
\matlabfun{polyfit()}.
Parameters that are non-linearly combined cannot be calculated
analytically. Consider for example the rate $\lambda$ of the
exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the
cost function, e.g. the gradient descent \matlabfun{lsqcurvefit()}.
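For the exponential decay, such a numerical fit could be sketched as
follows (assuming the Optimization Toolbox function
\matlabfun{lsqcurvefit()} and arbitrarily chosen true parameters):
\begin{verbatim}
% Sketch: non-linear fit of an exponential decay.
xdata = (0:0.2:5)';
ydata = 3.0*exp(-1.2*xdata) + 0.05*randn(size(xdata));
decay = @(p, x) p(1)*exp(p(2)*x);          % p(1) = c, p(2) = lambda
p0 = [1.0, -1.0];                          % initial guess for c and lambda
p = lsqcurvefit(decay, p0, xdata, ydata);  % numerical least-squares fit
fprintf('c = %.3f, lambda = %.3f\n', p(1), p(2));
\end{verbatim}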
\subsection{Relation between slope and correlation coefficient}
Let us have a closer look at \eqnref{mleslope}. If the standard
deviation of the data $\sigma_i$ is the same for each data point,
i.e. $\sigma_i = \sigma_j \; \forall \; i, j$, the standard deviation drops
out of \eqnref{mleslope} and we get
\begin{equation}
\label{whitemleslope}
\theta = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}
\end{equation}
To see what this expression is, we need to standardize the data. We
make the data mean free and normalize them to their standard
deviation, i.e. $x \mapsto (x - \bar x)/\sigma_x$. The resulting
numbers are also called \entermde[z-values]{z-Wert}{$z$-values} or
$z$-scores and they have the property $\bar x = 0$ and $\sigma_x =
1$. $z$-scores are often used in biology to make quantities that
differ in their units comparable. For standardized data the variance
\begin{equation}
\sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1
\end{equation}
is given by the mean squared data and equals one.
The covariance between $x$ and $y$ also simplifies to
\begin{equation}
\text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i -
\bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i
\end{equation}
the averaged product between pairs of $x$ and $y$ values. Recall that
the correlation coefficient $r_{x,y}$,
\eqnref{correlationcoefficient}, is the covariance normalized by the
product of the standard deviations of $x$ and $y$,
respectively. Therefore, in case the standard deviations equal one,
the correlation coefficient is identical to the covariance.
Consequently, for standardized data the slope of the regression line
\eqnref{whitemleslope} simplifies to
\begin{equation}
\theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y}
\end{equation}
For standardized data the slope of the regression line is the same as the
correlation coefficient!
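This relation can be verified numerically. A sketch (assuming the
Statistics Toolbox function \matlabfun{zscore()} for the
standardization and \matlabfun{corrcoef()} for the correlation
coefficient):
\begin{verbatim}
% Sketch: for standardized data the regression slope equals the
% correlation coefficient.
x = randn(200, 1);
y = 0.8*x + 0.6*randn(200, 1);         % correlated data
xs = zscore(x);                        % mean zero, standard deviation one
ys = zscore(y);
theta = sum(xs.*ys) / sum(xs.^2);      % slope of the regression line
r = corrcoef(x, y);                    % correlation coefficient
fprintf('slope = %.3f, r = %.3f\n', theta, r(1, 2));
\end{verbatim}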
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such a
population code is the tuning of neurons in the primary visual cortex
(V1) to the orientation of an edge or bar in the visual
stimulus. Different neurons respond best to different edge
orientations. Traditionally, such a tuning is measured by analyzing
the neuronal response strength (e.g. the firing rate) as a function of
the orientation of a black bar and is illustrated and summarized
with the so-called \enterm{tuning-curve} (\determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).
\begin{figure}[tp]
\includegraphics[width=1\textwidth]{mlecoding}
\titlecaption{\label{mlecodingfig} Maximum likelihood estimation of
a stimulus parameter from neuronal activity.}{Top: Tuning curve of
an individual neuron as a function of the stimulus orientation (a
dark bar in front of a white background). The stimulus that evokes
the strongest activity in that neuron is the bar with the vertical
orientation (arrow, $\phi_i=90$\,\degree). The red area indicates
the variability of the neuronal activity $p(r|\phi)$ around the
tuning curve. Center: In a population of neurons, each neuron may
have a different tuning curve (colors). A specific stimulus (the
vertical bar) activates the individual neurons of the population
in a specific way (dots). Bottom: The log-likelihood of the
activity pattern will be maximized close to the real stimulus
orientation.}
\end{figure}
The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what is the
stimulus (here the orientation of an edge)? In the sense of maximum
likelihood, a possible answer to this question would be: the stimulus
for which the particular activity pattern is most likely given the
tuning of the neurons.
Let's stay with the example of the orientation tuning in V1. The
tuning $\Omega_i(\phi)$ of the neurons $i$ with preferred edge
orientation $\phi_i$ can be well described by a von Mises function
(the analog of the Gaussian function on a cyclic $x$-axis, \figref{mlecodingfig}):
\begin{equation}
\Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ
\end{equation}
If we approximate the neuronal activity by a normal distribution
around the tuning curve with a standard deviation $\sigma=\Omega/4$,
which is proportional to $\Omega$, then the probability $p_i(r|\phi)$
of the $i$-th neuron showing the activity $r$ given a certain
orientation $\phi$ of an edge is given by
\begin{equation}
p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; .
\end{equation}
The log-likelihood of the edge orientation $\phi$ given the
activity pattern $r_1, r_2, \ldots, r_n$ in the population is thus
\begin{equation}
\log {\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \; .
\end{equation}
The angle $\phi$ that maximizes this log-likelihood is then an estimate of
the orientation of the edge.
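A sketch of such a maximum likelihood decoder (with hypothetical
preferred orientations and scaling factor, a single simulated activity
pattern, and the noise model introduced above) could look like this:
\begin{verbatim}
% Sketch: maximum likelihood decoding of the stimulus orientation.
c = 1.0;                                     % scaling of the tuning curves
phipref = (0:20:160)*pi/180;                 % preferred orientations, 9 neurons
tuning = @(phi, pref) c*exp(cos(2*(phi - pref)));
phitrue = 75*pi/180;                         % true stimulus orientation
r = tuning(phitrue, phipref);                % mean responses of the population
r = r + 0.25*r.*randn(size(r));              % noise with sigma = Omega/4
phis = (0:0.5:180)*pi/180;                   % candidate orientations
loglik = zeros(size(phis));
for k = 1:length(phis)
    omega = tuning(phis(k), phipref);        % predicted responses
    loglik(k) = sum(log(normpdf(r, omega, omega/4)));
end
[~, imax] = max(loglik);
fprintf('true: %.1f deg, estimated: %.1f deg\n', ...
        phitrue*180/pi, phis(imax)*180/pi);
\end{verbatim}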
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions