From 934a1e976cb9dfaca04f3675a0d9bd273d319c79 Mon Sep 17 00:00:00 2001
From: Jan Benda
Date: Sat, 30 Nov 2019 20:42:29 +0100
Subject: [PATCH] [likelihood] further textual improvements

---
 likelihood/lecture/likelihood.tex | 196 +++++++++++++++++-------------
 1 file changed, 109 insertions(+), 87 deletions(-)

diff --git a/likelihood/lecture/likelihood.tex b/likelihood/lecture/likelihood.tex
index 91680f3..12d1e3a 100644
--- a/likelihood/lecture/likelihood.tex
+++ b/likelihood/lecture/likelihood.tex
@@ -77,7 +77,7 @@ maximizes the likelihood of the data?
   \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
     the mean.}{Top: The measured data (blue dots) together with three
     different possible normal distributions with different means
-    (arrows) the data could have originated from. Bootom left: the
+    (arrows) the data could have originated from. Bottom left: the
     likelihood as a function of $\theta$ i.e. the mean. It is maximal
     at a value of $\theta = 2$. Bottom right: the log-likelihood.
     Taking the logarithm does not change the position
@@ -103,7 +103,10 @@ zero:
 \end{eqnarray*}
 Thus, the maximum likelihood estimator is the arithmetic mean. That
 is, the arithmetic mean maximizes the likelihood that the data
-originate from a normal distribution (\figref{mlemeanfig}).
+originate from a normal distribution centered at the arithmetic mean
+(\figref{mlemeanfig}). Equivalently, the standard deviation computed
+from the data maximizes the likelihood that the data were generated
+from a normal distribution with this standard deviation.
 
 \begin{exercise}{mlemean.m}{mlemean.out}
   Draw $n=50$ random numbers from a normal distribution with a mean of
@@ -113,18 +116,77 @@ originate from a normal distribution (\figref{mlemeanfig}).
   log-likelihood (given by the sum of the logarithms of the
   probabilities) for the mean as parameter. Compare the position of
   the maxima with the mean calculated from the data.
-  \pagebreak[4]
 \end{exercise}
 
-\pagebreak[4]
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Fitting probability distributions}
+Consider normally distributed data with unknown mean and standard
+deviation. From the considerations above we have just seen that the
+Gaussian distribution whose mean is the arithmetic mean and whose
+standard deviation equals the standard deviation computed from the
+data is the one that fits the data best in a maximum likelihood
+sense, i.e. the likelihood that the data were generated from this
+distribution is the largest. Fitting a Gaussian distribution to data
+is therefore very simple: just compute the two parameters of the
+Gaussian distribution, the arithmetic mean and the standard
+deviation, directly from the data.
+
+For non-Gaussian distributions, however, such simple analytical
+expressions for the parameters of the distribution, e.g. the shape
+parameter of a \enterm{Gamma-distribution}, do not exist. How do we
+fit such a distribution to some data? That is, how should we compute
+the values of the parameters of the distribution, given the data?
+
+A first guess could be to fit the probability density function by
+minimization of the squared difference to a histogram of the measured
+data. For several reasons this is, however, not the method of choice:
+(i) Probability densities can only be positive which leads, for small
+values in particular, to asymmetric distributions.
(ii) The values of
+a histogram estimating the density are not independent because the
+integral over a density is unity. The two basic assumptions of
+normally distributed and independent samples, which are a
+prerequisite for the minimization of the squared difference
+\eqnref{chisqmin} to be a maximum likelihood estimation, are
+violated. (iii) The histogram strongly depends on the chosen bin
+size (\figref{mlepdffig}).
+
+\begin{figure}[t]
+  \includegraphics[width=1\textwidth]{mlepdf}
+  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
+    probability density.}{Left: the 100 data points drawn from a 2nd
+    order Gamma-distribution. The maximum likelihood estimation of the
+    probability density function is shown in orange, the true pdf is
+    shown in red. Right: normalized histogram of the data together
+    with the real (red) and the fitted probability density
+    functions. The fit was done by minimizing the squared difference
+    to the histogram.}
+\end{figure}
+
+Instead we should stay with maximum-likelihood estimation. In exactly
+the same way as we estimated the mean value of a Gaussian distribution
+above, we can numerically fit the parameters of any type of
+distribution directly to the data by maximizing the likelihood. We
+simply search for the parameter values $\theta$ of the desired
+probability density function that maximize the log-likelihood. In
+general this is a non-linear optimization problem that is solved with
+numerical methods such as gradient descent \matlabfun{mle()}.
+
+\begin{exercise}{mlegammafit.m}{mlegammafit.out}
+  Generate a sample of gamma-distributed random numbers and apply the
+  maximum likelihood method to estimate the parameters of the gamma
+  distribution from the data.
+\end{exercise}
+
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Curve fitting}
 When fitting a function of the form $f(x;\theta)$ to data pairs
 $(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
 function best describes the data. With maximum likelihood we search
-for the paramter value $\theta$ for which the likelihood that the data
+for the parameter value $\theta$ for which the likelihood that the data
 were drawn from the corresponding function is maximal. If we assume
 that the $y_i$ values are normally distributed around the function
 values $f(x_i;\theta)$ with a standard deviation $\sigma_i$, the
@@ -182,7 +244,9 @@ respect to $\theta$ and equate it to zero:
 & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
 & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
 & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
-\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber \\
+\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber
+\end{eqnarray}
+\begin{eqnarray}
 \Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
 \end{eqnarray}
 This is an analytical expression for the estimation of the slope
 $\theta$ of the regression line (\figref{mleproplinefig}).
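The analytical solution \eqnref{mleslope} can be checked directly on
simulated data. The following minimal MATLAB sketch is not one of the
chapter's exercises; the number of data points, the true slope, and
the standard deviations $\sigma_i$ are arbitrary assumptions chosen
for illustration:
\begin{verbatim}
% sketch: maximum-likelihood slope of a line through the origin
% n, thetatrue, and the error bars sig are arbitrary assumptions
n = 40;                                   % number of data pairs
thetatrue = 2.0;                          % assumed true slope
x = 10.0 * rand(n, 1);                    % x values
sig = 0.5 + 0.5 * rand(n, 1);             % standard deviation of each y value
y = thetatrue * x + sig .* randn(n, 1);   % y values scattered around the line
% maximum-likelihood estimate of the slope, eq. (mleslope):
theta = sum(x .* y ./ sig.^2) / sum(x.^2 ./ sig.^2);
fprintf('estimated slope: %.3f\n', theta)
\end{verbatim}
The estimated slope scatters around the true slope and approaches it
as the number of data points is increased.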
A gradient descent, as we have done in the previous chapter, is not
 necessary for fitting the slope of a straight line, because the slope
-can be directly computed via \eqnref{nleslope}. More generally, this
+can be directly computed via \eqnref{mleslope}. More generally, this
 is the case also for fitting the coefficients of linearly combined
 basis functions as for example the slope $m$ and the y-intercept $b$
 of a straight line
 \[ y = m \cdot x +b \]
-or the coefficients $a_k$ of a polynom
+or the coefficients $a_k$ of a polynomial
 \[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
 \matlabfun{polyfit()}.
@@ -223,79 +287,34 @@ case, the variance
 \[ \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1 \]
 is the mean squared data and equals one. The covariance between $x$
 and $y$ also simplifies to
-\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i \]
-the averaged product between pairs of $x$ and $y$ values.
-Recall that the correlation coefficient $r_{x,y}$ is the covariance
-normalized by the product of the standard deviations of $x$ and $y$,
-respectively. Therefore, in case the standard deviations equal one, the
-correlation coefficient equals the covariance.
-Consequently, for standardized data the slope of the regression line
+\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i -
+\bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i \]
+the averaged product between pairs of $x$ and $y$ values. Recall that
+the correlation coefficient $r_{x,y}$,
+\eqnref{correlationcoefficient}, is the covariance normalized by the
+product of the standard deviations of $x$ and $y$,
+respectively. Therefore, if the standard deviations equal one,
+the correlation coefficient equals the covariance. Consequently, for
+standardized data the slope of the regression line
 \eqnref{whitemleslope} simplifies to
 \begin{equation}
-  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y} \]
+  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y}
 \end{equation}
 For standardized data the slope of the regression line equals the
 correlation coefficient!
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Fitting probability distributions}
-Finally let's consider the case in which we want to fit the parameters
-of a probability density function (e.g. the shape parameter of a
-\enterm{Gamma-distribution}) to a dataset.
-
-A first guess could be to fit the probability density by minimization
-of the squared difference to a histogram of the measured data. For
-several reasons this is, however, not the method of choice: (i)
-Probability densities can only be positive which leads, for small
-values in particular, to asymmetric distributions. (ii) The values of
-a histogram estimating the density are not independent because the
-integral over a density is unity. The two basic assumptions of
-normally distributed and independent samples, which are a prerequisite
-make the minimization of the squared difference \eqnref{chisqmin} to a
-maximum likelihood estimation, are violated. (iii) The histogram
-strongly depends on the chosen bin size \figref{mlepdffig}).
-
-\begin{figure}[t]
-  \includegraphics[width=1\textwidth]{mlepdf}
-  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
-    probability density.}{Left: the 100 data points drawn from a 2nd
-    order Gamma-distribution.
The maximum likelihood estimation of the - probability density function is shown in orange, the true pdf is - shown in red. Right: normalized histogram of the data together - with the real (red) and the fitted probability density - functions. The fit was done by minimizing the squared difference - to the histogram.} -\end{figure} - - -Using the example of estimating the mean value of a normal -distribution we have discussed the direct approach to fit a -probability density to data via maximum likelihood. We simply search -for the parameter $\theta$ of the desired probability density function -that maximizes the log-likelihood. In general this is a non-linear -optimization problem that is generally solved with numerical methods -such as the gradient descent \matlabfun{mle()}. - -\begin{exercise}{mlegammafit.m}{mlegammafit.out} - Generate a sample of gamma-distributed random numbers and apply the - maximum likelihood method to estimate the parameters of the gamma - function from the data. - \pagebreak -\end{exercise} - - %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Neural coding} In sensory systems certain aspects of the environment are encoded in -the neuronal activity of populations of neurons. One example of such -a population code is the tuning of neurons in the primary visual -cortex (V1) to the orientation of a visual stimulus. Different neurons -respond best to different stimulus orientations. Traditionally, such a -tuning is measured by analyzing the neuronal response strength -(e.g. the firing rate) as a function of the orientation of the visual -stimulus and is depicted and summarized with the so called -\enterm{tuning-curve} (\determ{Abstimmkurve}, +the neuronal activity of populations of neurons. One example of such a +population code is the tuning of neurons in the primary visual cortex +(V1) to the orientation of an edge or bar in the visual +stimulus. Different neurons respond best to different edge +orientations. Traditionally, such a tuning is measured by analyzing +the neuronal response strength (e.g. the firing rate) as a function of +the orientation of a black bar and is illustrated and summarized +with the so called \enterm{tuning-curve} (\determ{Abstimmkurve}, figure~\ref{mlecodingfig}, top). \begin{figure}[tp] @@ -306,32 +325,35 @@ figure~\ref{mlecodingfig}, top). dark bar in front of a white background). The stimulus that evokes the strongest activity in that neuron is the bar with the vertical orientation (arrow, $\phi_i=90$\,\degree). The red area indicates - the variability of the neuronal activity $p(r|\phi)$ around the tunig - curve. Center: In a population of neurons, each neuron may have a - different tuning curve (colors). A specific stimulus (the vertical - bar) activates the individual neurons of the population in a - specific way (dots). Bottom: The log-likelihood of the activity - pattern will be maximized close to the real stimulus orientation.} + the variability of the neuronal activity $p(r|\phi)$ around the + tuning curve. Center: In a population of neurons, each neuron may + have a different tuning curve (colors). A specific stimulus (the + vertical bar) activates the individual neurons of the population + in a specific way (dots). Bottom: The log-likelihood of the + activity pattern will be maximized close to the real stimulus + orientation.} \end{figure} The brain, however, is confronted with the inverse problem: given a certain activity pattern in the neuronal population, what is the -stimulus? 
In the sense of maximum likelihood, a possible answer to -this question would be: the stimulus for which the particular -activity pattern is most likely given the tuning of the neurons. +stimulus (here the orientation of an edge)? In the sense of maximum +likelihood, a possible answer to this question would be: the stimulus +for which the particular activity pattern is most likely given the +tuning of the neurons. Let's stay with the example of the orientation tuning in V1. The -tuning $\Omega_i(\phi)$ of the neurons $i$ to the preferred stimulus +tuning $\Omega_i(\phi)$ of the neurons $i$ to the preferred edge orientation $\phi_i$ can be well described using a van-Mises function (the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}): -\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c -\in \reZ \] +\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ \] If we approximate the neuronal activity by a normal distribution around the tuning curve with a standard deviation $\sigma=\Omega/4$, -which is proprotional to $\Omega$, then the probability -$p_i(r|\phi)$ of the $i$-th neuron showing the activity $r$ given a -certain orientation $\phi$ is given by +which is proportional to $\Omega$, then the probability $p_i(r|\phi)$ +of the $i$-th neuron showing the activity $r$ given a certain +orientation $\phi$ of an edge is given by \[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \] -The log-likelihood of the stimulus orientation $\phi$ given the +The log-likelihood of the edge orientation $\phi$ given the activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus \[ {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \] +The angle $\phi$ that maximizes this likelihood is then an estimate of +the orientation of the edge.
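A minimal MATLAB sketch of such a maximum-likelihood decoder is given
below. It is not one of the chapter's exercises; the number of
neurons, their preferred orientations, the constant $c=1$, and the
true orientation of the edge are arbitrary assumptions, and the
neuronal activity is simulated with the Gaussian noise model with
standard deviation $\Omega/4$ described above:
\begin{verbatim}
% sketch: maximum-likelihood decoding of the edge orientation
% from the activity of a population of orientation-tuned neurons
c = 1.0;                                % scaling factor of the tuning curves
phipref = 0.0:22.5:157.5;               % assumed preferred orientations (degree)
omega = @(phi, phii) c * exp(cos(2.0 * (phi - phii) * pi / 180.0));
phitrue = 90.0;                         % assumed true orientation of the edge
om = omega(phitrue, phipref);           % tuning-curve values at the true stimulus
r = om + om / 4.0 .* randn(size(om));   % noisy activity of each neuron
% log-likelihood of the activity pattern for candidate orientations:
phis = 0.0:0.5:180.0;
loglik = zeros(size(phis));
for k = 1:length(phis)
    omk = omega(phis(k), phipref);
    loglik(k) = sum(-0.5 * ((r - omk) ./ (omk / 4.0)).^2 ...
                    - log(sqrt(2.0 * pi) * omk / 4.0));
end
[~, imax] = max(loglik);
fprintf('decoded orientation: %.1f degree\n', phis(imax))
\end{verbatim}
The position of the maximum of the log-likelihood scatters around the
true orientation; with more neurons in the population the estimate
becomes more precise.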