[likelihood] further textual improvements
commit 934a1e976c (parent 1a9cddd451)
@@ -77,7 +77,7 @@ maximizes the likelihood of the data?
\titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
  the mean.}{Top: The measured data (blue dots) together with three
  different possible normal distributions with different means
  (arrows) the data could have originated from. Bottom left: the
  likelihood as a function of $\theta$, i.e. the mean. It is maximal
  at a value of $\theta = 2$. Bottom right: the
  log-likelihood. Taking the logarithm does not change the position
@@ -103,7 +103,10 @@ zero:
\end{eqnarray*}
Thus, the maximum likelihood estimator is the arithmetic mean. That
is, the arithmetic mean maximizes the likelihood that the data
originate from a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Equivalently, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.

\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean of
@@ -113,18 +116,77 @@ originate from a normal distribution (\figref{mlemeanfig}).
  log-likelihood (given by the sum of the logarithms of the
  probabilities) for the mean as parameter. Compare the position of
  the maxima with the mean calculated from the data.
\end{exercise}
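
A minimal sketch of such a brute-force scan of the log-likelihood
(this is not the actual \texttt{mlemean.m} solution; the data and the
grid of candidate means are assumptions made for illustration):
\begin{lstlisting}
% Sketch: scan candidate means and evaluate the log-likelihood.
% Assumes normally distributed data with known sigma = 1.
n = 50;
data = randn(n, 1) + 2.0;             % samples with true mean 2
means = 0.0:0.01:4.0;                 % candidate values for the mean
loglik = zeros(size(means));
for k = 1:length(means)
    p = normpdf(data, means(k), 1.0); % probability of each data point
    loglik(k) = sum(log(p));          % log-likelihood of the whole dataset
end
[~, imax] = max(loglik);
fprintf('ML estimate: %.3f, arithmetic mean: %.3f\n', means(imax), mean(data))
\end{lstlisting}
The maximum of the scanned log-likelihood coincides with the
arithmetic mean up to the resolution of the grid.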


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
Consider normally distributed data with unknown mean and standard
deviation. From the considerations above we have just seen that the
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data fits
the data best in a maximum likelihood sense, i.e. the likelihood that
the data were generated from this distribution is the largest.
Fitting a Gaussian distribution to data is thus very simple: just
compute the two parameters of the Gaussian distribution, the mean and
the standard deviation, directly from the data.

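In code this boils down to two statistics of the data (a minimal
sketch; the example data are an assumption). Note that the maximum
likelihood estimate of the standard deviation normalizes by $n$
rather than $n-1$, which is requested with the second argument to
\matlabfun{std()}:
\begin{lstlisting}
% Maximum likelihood fit of a Gaussian distribution:
data = 2.0 + 3.0*randn(100, 1);  % example data (true mu=2, sigma=3)
mu = mean(data);                 % ML estimate of the mean
sigma = std(data, 1);            % ML estimate: normalized by n, not n-1
\end{lstlisting}
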
For non-Gaussian distributions, however, such simple analytical
expressions for the parameters do not exist, e.g. for the shape
parameter of a \enterm{Gamma-distribution}. How do we fit such a
distribution to some data? That is, how should we compute the values
of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by
minimizing the squared difference to a histogram of the measured
data. For several reasons this is, however, not the method of choice:
(i) Probability densities can only be positive, which leads, in
particular for small values, to asymmetric distributions. (ii) The
values of a histogram estimating the density are not independent,
because the integral over a density is unity. The two basic
assumptions of normally distributed and independent samples, which
are a prerequisite for the minimization of the squared difference
\eqnref{chisqmin} to be equivalent to a maximum likelihood
estimation, are violated. (iii) The histogram strongly depends on the
chosen bin size (\figref{mlepdffig}).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
    probability density.}{Left: the 100 data points drawn from a
    2nd-order Gamma-distribution. The maximum likelihood estimate of
    the probability density function is shown in orange, the true pdf
    in red. Right: normalized histogram of the data together with the
    true (red) and the fitted probability density functions. The fit
    was done by minimizing the squared difference to the histogram.}
\end{figure}

Instead we should stick with maximum-likelihood estimation. In
exactly the same way as we estimated the mean value of a Gaussian
distribution above, we can numerically fit the parameter of any type
of distribution directly to the data by maximizing the
likelihood. We simply search for the parameter $\theta$ of the
desired probability density function that maximizes the
log-likelihood. In general this is a non-linear optimization problem
that is solved with numerical methods such as gradient descent
\matlabfun{mle()}.
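
As an illustration, a minimal sketch of such a numerical fit for a
Gamma-distribution (the example data, the starting point, and the
trick of optimizing log-parameters to keep shape and scale positive
are choices made for this sketch; \matlabfun{mle()} or
\matlabfun{gamfit()} bundle these steps):
\begin{lstlisting}
% Sketch: fit a Gamma-distribution by maximizing the log-likelihood,
% i.e. by numerically minimizing the negative log-likelihood.
data = gamrnd(2.0, 1.0, 500, 1);   % example data: shape=2, scale=1
% optimize log-parameters so that shape and scale stay positive:
negloglik = @(q) -sum(log(gampdf(data, exp(q(1)), exp(q(2)))));
qhat = fminsearch(negloglik, [0.0, 0.0]);
fprintf('shape: %.3f, scale: %.3f\n', exp(qhat(1)), exp(qhat(2)))
\end{lstlisting}
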
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the
  maximum likelihood method to estimate the parameters of the gamma
  distribution from the data.
\end{exercise}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}

When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. With maximum likelihood we search
for the parameter value $\theta$ for which the likelihood that the data
were drawn from the corresponding function is maximal. If we assume
that the $y_i$ values are normally distributed around the function
values $f(x_i;\theta)$ with a standard deviation $\sigma_i$, the
@@ -182,7 +244,9 @@ respect to $\theta$ and equate it to zero:
  & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
  & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber
\end{eqnarray}
\begin{eqnarray}
  \Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{\sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimation of the slope
@@ -190,12 +254,12 @@ $\theta$ of the regression line (\figref{mleproplinefig}).

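A minimal sketch of this estimator (the data and the error model
$\sigma_i \propto x_i$ are assumptions made for illustration):
\begin{lstlisting}
% Sketch: ML estimate of the slope of a line through the origin,
% weighting each data point by its standard deviation sigma_i.
x = linspace(0.5, 5.0, 40);
sigma = 0.2*x;                      % assumed known measurement errors
y = 2.0*x + sigma.*randn(size(x));  % data scattered around y = 2x
theta = sum(x.*y./sigma.^2) / sum(x.^2./sigma.^2);  % eqn. mleslope
\end{lstlisting}
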
A gradient descent, as we have done in the previous chapter, is not
necessary for fitting the slope of a straight line, because the slope
can be directly computed via \eqnref{mleslope}. More generally, this
also holds for fitting the coefficients of linearly combined basis
functions, such as the slope $m$ and the y-intercept $b$ of a
straight line
\[ y = m \cdot x + b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots \]
\matlabfun{polyfit()}.

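For example (a hypothetical snippet; the data are made up for
illustration), \matlabfun{polyfit()} solves this linear problem
directly:
\begin{lstlisting}
% Sketch: fitting a straight line and a cubic polynomial.
x = linspace(0.0, 4.0, 50);
y = 1.0 + 2.0*x + 0.3*randn(size(x));  % noisy line with m=2, b=1
pline = polyfit(x, y, 1);   % pline(1) is the slope m, pline(2) is b
pcubic = polyfit(x, y, 3);  % coefficients a_3 ... a_0, highest first
yfit = polyval(pline, x);   % evaluate the fitted line at the x values
\end{lstlisting}
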
@@ -223,79 +287,34 @@ case, the variance
\[ \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1 \]
is the mean of the squared data and equals one.
The covariance between $x$ and $y$ also simplifies to
\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \frac{1}{n} \sum_{i=1}^n x_i y_i \; , \]
the averaged product between pairs of $x$ and $y$ values. Recall that
the correlation coefficient $r_{x,y}$,
\eqnref{correlationcoefficient}, is the covariance normalized by the
product of the standard deviations of $x$ and $y$,
respectively. Therefore, if the standard deviations equal one, the
correlation coefficient equals the covariance. Consequently, for
standardized data the slope of the regression line
\eqnref{whitemleslope} simplifies to
\begin{equation}
  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y}
\end{equation}
For standardized data the slope of the regression line equals the
correlation coefficient!
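
A quick numerical check of this relation (the example data are
hypothetical; the slope is computed with \eqnref{mleslope} for
$\sigma_i = 1$):
\begin{lstlisting}
% Sketch: for standardized (z-scored) data the slope of the
% regression line through the origin equals the correlation coefficient.
x = randn(200, 1);
y = 0.8*x + 0.6*randn(200, 1);
zx = (x - mean(x)) / std(x, 1);    % zero mean, unit standard deviation
zy = (y - mean(y)) / std(y, 1);
theta = sum(zx.*zy) / sum(zx.^2);  % slope of the regression line
r = corrcoef(x, y);                % 2x2 matrix of correlation coefficients
fprintf('slope: %.3f, r: %.3f\n', theta, r(1, 2))
\end{lstlisting}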

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such a
population code is the tuning of neurons in the primary visual cortex
(V1) to the orientation of an edge or bar in the visual
stimulus. Different neurons respond best to different edge
orientations. Traditionally, such a tuning is measured by analyzing
the neuronal response strength (e.g. the firing rate) as a function of
the orientation of a black bar and is illustrated and summarized with
the so-called \enterm{tuning-curve} (\determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).

\begin{figure}[tp]
@@ -306,32 +325,35 @@ figure~\ref{mlecodingfig}, top).
    dark bar in front of a white background). The stimulus that evokes
    the strongest activity in that neuron is the bar with the vertical
    orientation (arrow, $\phi_i=90$\,\degree). The red area indicates
    the variability of the neuronal activity $p(r|\phi)$ around the
    tuning curve. Center: In a population of neurons, each neuron may
    have a different tuning curve (colors). A specific stimulus (the
    vertical bar) activates the individual neurons of the population
    in a specific way (dots). Bottom: The log-likelihood of the
    activity pattern will be maximized close to the real stimulus
    orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what is the
stimulus (here the orientation of an edge)? In the sense of maximum
likelihood, a possible answer to this question would be: the stimulus
for which the particular activity pattern is most likely given the
tuning of the neurons.

Let's stay with the example of the orientation tuning in V1. The
tuning $\Omega_i(\phi)$ of each neuron $i$ to the preferred edge
orientation $\phi_i$ can be well described by a van-Mises function
(the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}):
\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ \]
If we approximate the neuronal activity by a normal distribution
around the tuning curve with a standard deviation $\sigma=\Omega/4$,
which is proportional to $\Omega$, then the probability $p_i(r|\phi)$
of the $i$-th neuron showing the activity $r$ given a certain
orientation $\phi$ of an edge is given by
\[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \]
The log-likelihood of the edge orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus
\[ {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \]
The angle $\phi$ that maximizes this likelihood is then an estimate of
the orientation of the edge.
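
A minimal decoding sketch along these lines (the preferred
orientations, the noise model, and $c=1$ are assumptions made for
illustration):
\begin{lstlisting}
% Sketch: maximum likelihood decoding of the edge orientation from
% the activity pattern of a population of orientation-tuned neurons.
phis = 0.0:20.0:160.0;             % preferred orientations in degree
omega = @(phi, phipref) exp(cos(2.0*(phi - phipref)/180.0*pi));
rclean = omega(90.0, phis);        % tuning values for a 90 degree edge
ract = rclean + 0.25*rclean.*randn(size(rclean));  % sigma = Omega/4
phicand = 0.0:0.5:180.0;           % candidate orientations
loglik = zeros(size(phicand));
for k = 1:length(phicand)
    om = omega(phicand(k), phis);  % expected responses
    p = exp(-0.5*((ract - om)./(om/4.0)).^2) ./ (sqrt(2.0*pi)*om/4.0);
    loglik(k) = sum(log(p));       % log-likelihood of the pattern
end
[~, imax] = max(loglik);
fprintf('estimated orientation: %.1f degree\n', phicand(imax))
\end{lstlisting}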