From 934a1e976cb9dfaca04f3675a0d9bd273d319c79 Mon Sep 17 00:00:00 2001
From: Jan Benda
Date: Sat, 30 Nov 2019 20:42:29 +0100
Subject: [PATCH] [likelihood] further textual improvements

---
 likelihood/lecture/likelihood.tex | 196 +++++++++++++++++-------------
 1 file changed, 109 insertions(+), 87 deletions(-)

diff --git a/likelihood/lecture/likelihood.tex b/likelihood/lecture/likelihood.tex
index 91680f3..12d1e3a 100644
--- a/likelihood/lecture/likelihood.tex
+++ b/likelihood/lecture/likelihood.tex
@@ -77,7 +77,7 @@ maximizes the likelihood of the data?
   \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
     the mean.}{Top: The measured data (blue dots) together with three
     different possible normal distributions with different means
-    (arrows) the data could have originated from. Bootom left: the
+    (arrows) the data could have originated from. Bottom left: the
     likelihood as a function of $\theta$ i.e. the mean. It is maximal
     at a value of $\theta = 2$. Bottom right: the log-likelihood.
     Taking the logarithm does not change the position
@@ -103,7 +103,10 @@ zero:
 \end{eqnarray*}
 Thus, the maximum likelihood estimator is the arithmetic mean. That
 is, the arithmetic mean maximizes the likelihood that the data
-originate from a normal distribution (\figref{mlemeanfig}).
+originate from a normal distribution centered at the arithmetic mean
+(\figref{mlemeanfig}). Equivalently, the standard deviation computed
+from the data maximizes the likelihood that the data were generated
+from a normal distribution with this standard deviation.
 
 \begin{exercise}{mlemean.m}{mlemean.out}
   Draw $n=50$ random numbers from a normal distribution with a mean of
@@ -113,18 +116,77 @@ originate from a normal distribution (\figref{mlemeanfig}).
   log-likelihood (given by the sum of the logarithms of the
   probabilities) for the mean as parameter. Compare the position of
   the maxima with the mean calculated from the data.
-  \pagebreak[4]
 \end{exercise}
 
-\pagebreak[4]
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Fitting probability distributions}
+Consider normally distributed data with unknown mean and standard
+deviation. From the considerations above we have just seen that the
+Gaussian distribution whose mean is the arithmetic mean and whose
+standard deviation equals the standard deviation computed from the
+data is the one that fits the data best in a maximum likelihood
+sense, i.e. the likelihood that the data were generated from this
+distribution is the largest. Fitting a Gaussian distribution to data
+is therefore very simple: just compute the two parameters of the
+Gaussian distribution, the arithmetic mean and the standard
+deviation, directly from the data.
+
+For non-Gaussian distributions, however, such simple analytical
+expressions for the parameters of the distribution, e.g. the shape
+parameter of a \enterm{Gamma-distribution}, do not exist. How do we
+fit such a distribution to some data? That is, how should we compute
+the values of the parameters of the distribution, given the data?
+
+A first guess could be to fit the probability density function by
+minimization of the squared difference to a histogram of the measured
+data. For several reasons this is, however, not the method of choice:
+(i) Probability densities can only be positive which leads, for small
+values in particular, to asymmetric distributions.
(ii) The values of
+a histogram estimating the density are not independent because the
+integral over a density is unity. The two basic assumptions of
+normally distributed and independent samples, which are a
+prerequisite for the minimization of the squared difference
+\eqnref{chisqmin} to be a maximum likelihood estimation, are
+violated. (iii) The histogram strongly depends on the chosen bin
+size (\figref{mlepdffig}).
+
+\begin{figure}[t]
+  \includegraphics[width=1\textwidth]{mlepdf}
+  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
+    probability density.}{Left: the 100 data points drawn from a 2nd
+    order Gamma-distribution. The maximum likelihood estimation of the
+    probability density function is shown in orange, the true pdf is
+    shown in red. Right: normalized histogram of the data together
+    with the real (red) and the fitted probability density
+    functions. The fit was done by minimizing the squared difference
+    to the histogram.}
+\end{figure}
+
+Instead we should stay with maximum-likelihood estimation. In exactly
+the same way as we estimated the mean value of a Gaussian distribution
+above, we can numerically fit the parameters of any type of
+distribution directly to the data by maximizing the likelihood. We
+simply search for the parameter values $\theta$ of the desired
+probability density function that maximize the log-likelihood. In
+general this is a non-linear optimization problem that is solved with
+numerical methods such as gradient descent \matlabfun{mle()}.
+
+\begin{exercise}{mlegammafit.m}{mlegammafit.out}
+  Generate a sample of gamma-distributed random numbers and apply the
+  maximum likelihood method to estimate the parameters of the gamma
+  distribution from the data.
+\end{exercise}
+
+
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Curve fitting}
 When fitting a function of the form $f(x;\theta)$ to data pairs
 $(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
 function best describes the data. With maximum likelihood we search
-for the paramter value $\theta$ for which the likelihood that the data
+for the parameter value $\theta$ for which the likelihood that the data
 were drawn from the corresponding function is maximal. If we assume
 that the $y_i$ values are normally distributed around the function
 values $f(x_i;\theta)$ with a standard deviation $\sigma_i$, the
@@ -182,7 +244,9 @@ respect to $\theta$ and equate it to zero:
 & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
 & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
 & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
-\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber \\
+\Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber
+\end{eqnarray}
+\begin{eqnarray}
 \Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
 \end{eqnarray}
 This is an analytical expression for the estimation of the slope
 $\theta$ of the regression line (\figref{mleproplinefig}).
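The analytical solution \eqnref{mleslope} can be checked directly on
simulated data. The following minimal MATLAB sketch is not one of the
chapter's exercises; the number of data points, the true slope, and
the standard deviations $\sigma_i$ are arbitrary assumptions chosen
for illustration:
\begin{verbatim}
% sketch: maximum-likelihood slope of a line through the origin
% n, thetatrue, and the error bars sig are arbitrary assumptions
n = 40;                                   % number of data pairs
thetatrue = 2.0;                          % assumed true slope
x = 10.0 * rand(n, 1);                    % x values
sig = 0.5 + 0.5 * rand(n, 1);             % standard deviation of each y value
y = thetatrue * x + sig .* randn(n, 1);   % y values scattered around the line
% maximum-likelihood estimate of the slope, eq. (mleslope):
theta = sum(x .* y ./ sig.^2) / sum(x.^2 ./ sig.^2);
fprintf('estimated slope: %.3f\n', theta)
\end{verbatim}
The estimated slope scatters around the true slope and approaches it
as the number of data points is increased.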
A gradient descent, as we have done in the previous chapter, is not
 necessary for fitting the slope of a straight line, because the slope
-can be directly computed via \eqnref{nleslope}. More generally, this
+can be directly computed via \eqnref{mleslope}. More generally, this
 is the case also for fitting the coefficients of linearly combined
 basis functions as for example the slope $m$ and the y-intercept $b$
 of a straight line
 \[ y = m \cdot x +b \]
-or the coefficients $a_k$ of a polynom
+or the coefficients $a_k$ of a polynomial
 \[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1x + a_2x^2 + a_3x^3 + \ldots \]
 \matlabfun{polyfit()}.
@@ -223,79 +287,34 @@ case, the variance
 \[ \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1 \]
 is the mean squared data and equals one. The covariance between $x$
 and $y$ also simplifies to
-\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i \]
-the averaged product between pairs of $x$ and $y$ values.
-Recall that the correlation coefficient $r_{x,y}$ is the covariance
-normalized by the product of the standard deviations of $x$ and $y$,
-respectively. Therefore, in case the standard deviations equal one, the
-correlation coefficient equals the covariance.
-Consequently, for standardized data the slope of the regression line
+\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i -
+\bar y) =\frac{1}{n} \sum_{i=1}^n x_i y_i \]
+the averaged product between pairs of $x$ and $y$ values. Recall that
+the correlation coefficient $r_{x,y}$,
+\eqnref{correlationcoefficient}, is the covariance normalized by the
+product of the standard deviations of $x$ and $y$,
+respectively. Therefore, if the standard deviations equal one,
+the correlation coefficient equals the covariance. Consequently, for
+standardized data the slope of the regression line
 \eqnref{whitemleslope} simplifies to
 \begin{equation}
-  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y} \]
+  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y}
 \end{equation}
 For standardized data the slope of the regression line equals the
 correlation coefficient!
 
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{Fitting probability distributions}
-Finally let's consider the case in which we want to fit the parameters
-of a probability density function (e.g. the shape parameter of a
-\enterm{Gamma-distribution}) to a dataset.
-
-A first guess could be to fit the probability density by minimization
-of the squared difference to a histogram of the measured data. For
-several reasons this is, however, not the method of choice: (i)
-Probability densities can only be positive which leads, for small
-values in particular, to asymmetric distributions. (ii) The values of
-a histogram estimating the density are not independent because the
-integral over a density is unity. The two basic assumptions of
-normally distributed and independent samples, which are a prerequisite
-make the minimization of the squared difference \eqnref{chisqmin} to a
-maximum likelihood estimation, are violated. (iii) The histogram
-strongly depends on the chosen bin size \figref{mlepdffig}).
-
-\begin{figure}[t]
-  \includegraphics[width=1\textwidth]{mlepdf}
-  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
-    probability density.}{Left: the 100 data points drawn from a 2nd
-    order Gamma-distribution.
The maximum likelihood estimation of the - probability density function is shown in orange, the true pdf is - shown in red. Right: normalized histogram of the data together - with the real (red) and the fitted probability density - functions. The fit was done by minimizing the squared difference - to the histogram.} -\end{figure} - - -Using the example of estimating the mean value of a normal -distribution we have discussed the direct approach to fit a -probability density to data via maximum likelihood. We simply search -for the parameter $\theta$ of the desired probability density function -that maximizes the log-likelihood. In general this is a non-linear -optimization problem that is generally solved with numerical methods -such as the gradient descent \matlabfun{mle()}. - -\begin{exercise}{mlegammafit.m}{mlegammafit.out} - Generate a sample of gamma-distributed random numbers and apply the - maximum likelihood method to estimate the parameters of the gamma - function from the data. - \pagebreak -\end{exercise} - - %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Neural coding} In sensory systems certain aspects of the environment are encoded in -the neuronal activity of populations of neurons. One example of such -a population code is the tuning of neurons in the primary visual -cortex (V1) to the orientation of a visual stimulus. Different neurons -respond best to different stimulus orientations. Traditionally, such a -tuning is measured by analyzing the neuronal response strength -(e.g. the firing rate) as a function of the orientation of the visual -stimulus and is depicted and summarized with the so called -\enterm{tuning-curve} (\determ{Abstimmkurve}, +the neuronal activity of populations of neurons. One example of such a +population code is the tuning of neurons in the primary visual cortex +(V1) to the orientation of an edge or bar in the visual +stimulus. Different neurons respond best to different edge +orientations. Traditionally, such a tuning is measured by analyzing +the neuronal response strength (e.g. the firing rate) as a function of +the orientation of a black bar and is illustrated and summarized +with the so called \enterm{tuning-curve} (\determ{Abstimmkurve}, figure~\ref{mlecodingfig}, top). \begin{figure}[tp] @@ -306,32 +325,35 @@ figure~\ref{mlecodingfig}, top). dark bar in front of a white background). The stimulus that evokes the strongest activity in that neuron is the bar with the vertical orientation (arrow, $\phi_i=90$\,\degree). The red area indicates - the variability of the neuronal activity $p(r|\phi)$ around the tunig - curve. Center: In a population of neurons, each neuron may have a - different tuning curve (colors). A specific stimulus (the vertical - bar) activates the individual neurons of the population in a - specific way (dots). Bottom: The log-likelihood of the activity - pattern will be maximized close to the real stimulus orientation.} + the variability of the neuronal activity $p(r|\phi)$ around the + tuning curve. Center: In a population of neurons, each neuron may + have a different tuning curve (colors). A specific stimulus (the + vertical bar) activates the individual neurons of the population + in a specific way (dots). Bottom: The log-likelihood of the + activity pattern will be maximized close to the real stimulus + orientation.} \end{figure} The brain, however, is confronted with the inverse problem: given a certain activity pattern in the neuronal population, what is the -stimulus? 
In the sense of maximum likelihood, a possible answer to -this question would be: the stimulus for which the particular -activity pattern is most likely given the tuning of the neurons. +stimulus (here the orientation of an edge)? In the sense of maximum +likelihood, a possible answer to this question would be: the stimulus +for which the particular activity pattern is most likely given the +tuning of the neurons. Let's stay with the example of the orientation tuning in V1. The -tuning $\Omega_i(\phi)$ of the neurons $i$ to the preferred stimulus +tuning $\Omega_i(\phi)$ of the neurons $i$ to the preferred edge orientation $\phi_i$ can be well described using a van-Mises function (the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}): -\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c -\in \reZ \] +\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ \] If we approximate the neuronal activity by a normal distribution around the tuning curve with a standard deviation $\sigma=\Omega/4$, -which is proprotional to $\Omega$, then the probability -$p_i(r|\phi)$ of the $i$-th neuron showing the activity $r$ given a -certain orientation $\phi$ is given by +which is proportional to $\Omega$, then the probability $p_i(r|\phi)$ +of the $i$-th neuron showing the activity $r$ given a certain +orientation $\phi$ of an edge is given by \[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \] -The log-likelihood of the stimulus orientation $\phi$ given the +The log-likelihood of the edge orientation $\phi$ given the activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus \[ {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \] +The angle $\phi$ that maximizes this likelihood is then an estimate of +the orientation of the edge.
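A minimal MATLAB sketch of such a maximum-likelihood decoder is given
below. It is not one of the chapter's exercises; the number of
neurons, their preferred orientations, the constant $c=1$, and the
true orientation of the edge are arbitrary assumptions, and the
neuronal activity is simulated with the Gaussian noise model with
standard deviation $\Omega/4$ described above:
\begin{verbatim}
% sketch: maximum-likelihood decoding of the edge orientation
% from the activity of a population of orientation-tuned neurons
c = 1.0;                                % scaling factor of the tuning curves
phipref = 0.0:22.5:157.5;               % assumed preferred orientations (degree)
omega = @(phi, phii) c * exp(cos(2.0 * (phi - phii) * pi / 180.0));
phitrue = 90.0;                         % assumed true orientation of the edge
om = omega(phitrue, phipref);           % tuning-curve values at the true stimulus
r = om + om / 4.0 .* randn(size(om));   % noisy activity of each neuron
% log-likelihood of the activity pattern for candidate orientations:
phis = 0.0:0.5:180.0;
loglik = zeros(size(phis));
for k = 1:length(phis)
    omk = omega(phis(k), phipref);
    loglik(k) = sum(-0.5 * ((r - omk) ./ (omk / 4.0)).^2 ...
                    - log(sqrt(2.0 * pi) * omk / 4.0));
end
[~, imax] = max(loglik);
fprintf('decoded orientation: %.1f degree\n', phis(imax))
\end{verbatim}
The position of the maximum of the log-likelihood scatters around the
true orientation; with more neurons in the population the estimate
becomes more precise.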