[likelihood] improved text

Jan Benda 2021-01-09 19:48:59 +01:00
parent 1d0913c600
commit 454c7178c7
3 changed files with 232 additions and 152 deletions


@ -20,12 +20,11 @@
\section{TODO}
\begin{itemize}
\item Fitting psychometric functions, logistic regression:
  Variable $x_i$, responses $r_i$ either 0 or 1.
  $p(x_i, \theta)$ is Weibull or Boltzmann or logistic function.
  Likelihood is $L = \prod p(x_i, \theta)^{r_i} (1-p(x_i, \theta))^{1-r_i}$.
  Use fminsearch for fitting? (See the sketch after this list.)
\end{itemize}
\end{document}
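
For illustration only (this sketch is not part of the repository), the TODO item could look like this in Python: a logistic psychometric function is fitted to binary responses by maximizing the Bernoulli likelihood with the Nelder-Mead simplex, scipy's counterpart of MATLAB's fminsearch. All stimulus values and parameters are made up.

import numpy as np
from scipy.optimize import fmin

def psychometric(x, theta):
    # logistic function with threshold theta[0] and slope theta[1]
    return 1.0/(1.0 + np.exp(-theta[1]*(x - theta[0])))

def neg_log_likelihood(theta, x, r):
    # negative Bernoulli log-likelihood of binary responses r at stimuli x
    p = np.clip(psychometric(x, theta), 1e-10, 1.0 - 1e-10)
    return -np.sum(r*np.log(p) + (1.0 - r)*np.log(1.0 - p))

# simulated experiment: stimulus intensities and 0/1 responses
rng = np.random.RandomState(1)
x = np.tile(np.linspace(-3.0, 3.0, 9), 20)
r = (rng.rand(len(x)) < psychometric(x, [0.5, 1.5])).astype(float)

# maximize the likelihood by minimizing its negative
theta_mle = fmin(neg_log_likelihood, [0.0, 1.0], args=(x, r), disp=False)
print('threshold and slope:', theta_mle)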


@ -7,15 +7,15 @@
The core task of statistics is to infer from measured data some
parameters describing the data. These parameters can be simply a mean,
a standard deviation, or any other parameter needed to describe the
distribution the data originate from, a correlation coefficient,
or some parameters of a function describing a particular dependence
between the data. The brain faces exactly the same problem. Given the
activity pattern of some neurons (the data) it needs to infer some
aspects (parameters) of the environment and the internal state of the
body in order to generate some useful behavior. One possible approach
to estimating parameters from data is to use \enterm[maximum likelihood
estimator]{maximum likelihood estimators} (\enterm[mle|see{maximum
likelihood estimator}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}). They choose the parameters
such that they maximize the likelihood that the specific data values
originate from a specific distribution.
@ -124,7 +124,7 @@ and set it to zero:
\Leftrightarrow \quad \sum_{i=1}^n - \frac{2(x_i-\mu)}{2\sigma^2} & = & 0 \nonumber \\
\Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0 \nonumber \\
\Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i \nonumber \\
\Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x \label{arithmeticmean}
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of
normally distributed data is the arithmetic mean. That is, the
@ -173,28 +173,28 @@ For non-Gaussian distributions, for example a
\entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
\label{gammapdf}
p(x|\alpha,\beta) \sim x^{\alpha-1}e^{-\beta x}
\end{equation}
with a shape parameter $\alpha$ and a rate parameter $\beta$
(\figrefb{mlepdffig}), in general such simple analytical
expressions for estimating the parameters directly from the data do
not exist. How do we fit such a distribution to the data? That is,
how should we compute the values of the parameters of the
distribution, given the data?

A first guess could be to fit the probability density function by a
\enterm{least squares} fit to a normalized histogram of the measured
data in the same way as we fit a function to some data. For several
reasons this is, however, not the method of choice: (i) Probability
densities can only be positive, which leads, in particular for small
values, to asymmetric distributions of the estimated histogram around
the true density. (ii) The values of a histogram estimating the
density are not independent because the integral over a density is
unity. The two basic assumptions of normally distributed and
independent samples, which are a prerequisite for making the
minimization of the squared difference a maximum likelihood
estimation (see next section), are violated. (iii) The estimation of
the probability density by means of a histogram strongly depends on
the chosen bin size.

\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlepdf}
@ -203,9 +203,9 @@ the chosen bin size \figref{mlepdffig}).
order Gamma-distribution. The maximum likelihood estimation of the
probability density function is shown in orange, the true pdf is
shown in red. Right: normalized histogram of the data together
with the true probability density (red) and the probability
density function obtained by a least squares fit to the
histogram.}
\end{figure}
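
As an illustration of this figure (added here, not part of the original text), a minimal Python sketch could contrast the two approaches: a least squares fit of the Gamma pdf to a normalized histogram versus a numerical maximum likelihood fit. It assumes scipy is available; sample size, bin count, and true parameters are arbitrary.

import numpy as np
from scipy.stats import gamma
from scipy.optimize import curve_fit

rng = np.random.RandomState(2)
data = rng.gamma(shape=2.0, scale=1.0, size=500)   # true alpha=2, beta=1

# (i) least squares fit of the pdf to a normalized histogram:
counts, edges = np.histogram(data, bins=20, density=True)
centers = 0.5*(edges[:-1] + edges[1:])
popt, _ = curve_fit(lambda x, a, b: gamma.pdf(x, a, scale=1.0/b),
                    centers, counts, p0=[1.0, 1.0])

# (ii) numerical maximum likelihood fit of the same data:
alpha_mle, _, scale_mle = gamma.fit(data, floc=0.0)

print('least squares: alpha=%.2f, beta=%.2f' % tuple(popt))
print('maximum likelihood: alpha=%.2f, beta=%.2f' % (alpha_mle, 1.0/scale_mle))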
Instead we should stay with maximum-likelihood estimation. Exactly in
@ -228,13 +228,13 @@ numerical methods such as the gradient descent \matlabfun{mle()}.
\section{Curve fitting}
When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameters $\theta$ such that the
function best describes the data. In
chapter~\ref{gradientdescentchapter} we simply assumed that ``best''
means minimizing the squared distance between the data and the
function. With maximum likelihood we search for those parameter
values $\theta$ that maximize the likelihood that the data were drawn
from the corresponding function.

If we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$,
@ -242,35 +242,49 @@ the log-likelihood is
\begin{eqnarray}
\log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \nonumber \\
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma_i^2} -\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}
\end{eqnarray}
The only difference to the previous example of the arithmetic mean is
that the means $\mu$ in the equations above are given by the function
values $f(x_i;\theta)$. The parameters $\theta$ should maximize the
log-likelihood. The first term in the sum is independent of $\theta$
and can be ignored when computing the maximum. From the second
term we pull out the constant factor $-\frac{1}{2}$:
\begin{eqnarray}
& = & - \frac{1}{2} \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2
\end{eqnarray}
We can further simplify by inverting the sign and then search for the
minimum. Also the factor $\frac{1}{2}$ can be ignored since it does
not affect the position of the minimum:
\begin{equation}
\label{chisqmin}
\theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences between the $y$-data values and the
function values, normalized by the standard deviation of the data
around the function, is called $\chi^2$ (chi squared). The parameter
$\theta$ which minimizes the squared differences is thus the one that
maximizes the likelihood of the data to actually originate from the
given function.

Whether we minimize $\chi^2$ or the \enterm{mean squared error}
\eqref{meansquarederror} introduced in
chapter~\ref{gradientdescentchapter} does not matter. The latter is
the mean and $\chi^2$ is the sum of the squared differences. They
simply differ by the constant factor $n$, the number of data pairs,
which does not affect the position of the minimum. $\chi^2$ is more
general in that it allows for different standard deviations for each
data pair. If they are all the same ($\sigma_i = \sigma$), the common
standard deviation can be pulled out of the sum and also does not
influence the position of the minimum. Both \enterm{least squares} and
minimizing $\chi^2$ are maximum likelihood estimators of the
parameters $\theta$ of a function.
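
The equivalence can be checked numerically. The following sketch (not part of the original text, with made-up straight-line data) estimates the slope once by minimizing $\chi^2$ and once by maximizing the Gaussian log-likelihood; both yield the same value. It assumes scipy's minimize_scalar as the numerical minimizer.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.RandomState(3)
x = np.linspace(0.0, 10.0, 40)
sigma = 0.5 + 0.05*x                    # individual error bars
y = 2.0*x + sigma*rng.randn(len(x))     # data scattered around the line y = 2x

def chisq(m):
    return np.sum(((y - m*x)/sigma)**2)

def neg_loglik(m):
    return -np.sum(-0.5*np.log(2.0*np.pi*sigma**2)
                   - 0.5*((y - m*x)/sigma)**2)

m_chi = minimize_scalar(chisq, bracket=(0.0, 4.0)).x
m_mle = minimize_scalar(neg_loglik, bracket=(0.0, 4.0)).x
print(m_chi, m_mle)   # both approximately 2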
From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. In case of other distributions, the log-likelihood
\eqref{loglikelihood} needs to be adapted accordingly.

\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlepropline}
@ -283,26 +297,32 @@ function. In case of other distributions, the log-likelihood
the normal distribution of the data around the line (right histogram).}
\end{figure}

Let's go on and calculate the minimum \eqref{chisqmin} of $\chi^2$
analytically for a simple function.

\subsection{Straight line through the origin}
The function of a straight line with slope $m$ through the origin
is
\[ f(x) = m x \; . \]
With this specific function, $\chi^2$ reads
\[ \chi^2 = \sum_{i=1}^n \left( \frac{y_i-m x_i}{\sigma_i} \right)^2 \; . \]
To calculate the minimum we take the first derivative with
respect to $m$ and equate it to zero:
\begin{eqnarray}
\frac{\text{d}}{\text{d}m}\chi^2 & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}m} \left( \frac{y_i-m x_i}{\sigma_i} \right)^2 \nonumber \\
& = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-m x_i}{\sigma_i} \right) \nonumber \\
& = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - m \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad m \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber \\
\Leftrightarrow \quad m & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{ \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimation of the slope $m$
of the regression line (\figref{mleproplinefig}). We do not need to
apply numerical methods like the gradient descent to find the slope
that minimizes the squared differences. Instead, we can compute the
slope directly from the data by means of \eqnref{mleslope}, very much
like we calculate the mean of some data by means of the arithmetic
mean \eqref{arithmeticmean}.
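
In Python this direct computation is a one-liner. The following sketch (not from the original text, data made up) applies \eqnref{mleslope} to simulated data:

import numpy as np

rng = np.random.RandomState(4)
x = np.linspace(0.0, 10.0, 50)
sigma = np.full(len(x), 1.0)            # error bar of each data point
y = 1.8*x + sigma*rng.randn(len(x))     # data scattered around y = 1.8*x

m = np.sum(x*y/sigma**2)/np.sum(x**2/sigma**2)
print('estimated slope:', m)            # close to the true slope 1.8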
\subsection{Linear and non-linear fits}
A gradient descent, as we have done in the previous chapter, is not
@ -317,47 +337,68 @@ or the coefficients $a_k$ of a polynomial
\matlabfun{polyfit()}.

Parameters that are non-linearly combined cannot be calculated
analytically. Consider, for example, the factor $c$ and the rate
$\lambda$ of the exponential decay
\[ y = c \cdot e^{\lambda x} \quad , \quad c, \lambda \in \reZ \; . \]
Such cases require numerical solutions for the optimization of the
cost function, e.g. the gradient descent \matlabfun{lsqcurvefit()}.
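
In Python such a numerical fit could be done with scipy's curve_fit, used here (as an added sketch, not the book's code) in place of the MATLAB function lsqcurvefit mentioned above; data and start values are made up:

import numpy as np
from scipy.optimize import curve_fit

def expdecay(x, c, lam):
    # exponential decay with factor c and rate lam
    return c*np.exp(lam*x)

rng = np.random.RandomState(5)
x = np.linspace(0.0, 5.0, 60)
y = expdecay(x, 3.0, -0.8) + 0.1*rng.randn(len(x))

popt, pcov = curve_fit(expdecay, x, y, p0=[1.0, -1.0])
print('c = %.2f, lambda = %.2f' % tuple(popt))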
\subsection{Relation between slope and correlation coefficient}
Let us have a closer look at \eqnref{mleslope} for the slope of a line
through the origin. If the standard deviation of the data $\sigma_i$
is the same for each data point, i.e. $\sigma_i = \sigma_j \; \forall
\; i, j$, the standard deviation drops out and \eqnref{mleslope}
simplifies to
\begin{equation}
\label{whitemleslope}
m = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2}
\end{equation}
To see what the numerator and the denominator of this expression
describe, we need to subtract from the data their mean value. We make
the data mean free, i.e. $x \mapsto x - \bar x$ and $y \mapsto y -
\bar y$. For mean-free data the variance
\begin{equation}
\sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2
\end{equation}
is given by the mean squared data. In the same way, the covariance
between $x$ and $y$ simplifies to
\begin{equation}
\text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i -
\bar y) = \frac{1}{n} \sum_{i=1}^n x_i y_i \; ,
\end{equation}
the averaged product between pairs of $x$ and $y$ values. Expanding
the fraction in \eqnref{whitemleslope} by $\frac{1}{n}$ we get
\begin{equation}
\label{meanfreeslope}
m = \frac{\frac{1}{n}\sum_{i=1}^n x_i y_i}{\frac{1}{n}\sum_{i=1}^n x_i^2}
  = \frac{\text{cov}(x, y)}{\sigma_x^2}
\end{equation}

Recall that the correlation coefficient $r_{x,y}$ is the covariance
normalized by the product of the standard deviations of $x$ and $y$:
\begin{equation}
\label{meanfreecorrcoef}
r_{x,y} = \frac{\text{cov}(x, y)}{\sigma_x\sigma_y}
\end{equation}
If furthermore the standard deviations of $x$ and $y$ are the same,
i.e. $\sigma_x = \sigma_y$, the slope of a line through the origin is
identical to the correlation coefficient.

This relation between slope and correlation coefficient holds, in
particular, for standardized data that have been made mean free and
have been normalized by their standard deviation, i.e. $x \mapsto (x -
\bar x)/\sigma_x$ and $y \mapsto (y - \bar y)/\sigma_y$. The resulting
numbers are called \entermde[z-value]{z-Wert}{$z$-values} or
\enterm[z-score]{$z$-scores}. Their mean equals zero and their
standard deviation one. $z$-scores are often used to make quantities
that differ in their units comparable. For standardized data the
denominators of both the slope \eqref{meanfreeslope} and the
correlation coefficient \eqref{meanfreecorrcoef} equal one. For
standardized data, both the slope of the regression line and the
correlation coefficient reduce to the covariance between the $x$ and
$y$ data:
\begin{equation}
m = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y}
\end{equation}
For standardized data the slope of the regression line is the same as the
correlation coefficient!
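
This identity is easy to check numerically. A small sketch (not part of the original text, with made-up data) standardizes the data and compares the slope \eqref{whitemleslope} with numpy's correlation coefficient:

import numpy as np

rng = np.random.RandomState(6)
x = rng.randn(200)
y = 0.7*x + 0.5*rng.randn(200)

zx = (x - np.mean(x))/np.std(x)          # standardize ("z-scores")
zy = (y - np.mean(y))/np.std(y)

slope = np.sum(zx*zy)/np.sum(zx**2)      # slope of a line through the origin
r = np.corrcoef(x, y)[0, 1]              # correlation coefficient
print(slope, r)                          # the two numbers agree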
@ -365,63 +406,100 @@ correlation coefficient!
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
Maximum likelihood estimators are not only a central concept for data
analysis. Neural systems face the very same problem. They also need to
estimate parameters of the internal and external environment based on
the activity of neurons.

In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such a
population code is the tuning of neurons in the primary visual cortex
(V1) to the orientation of a bar in the visual stimulus. Different
neurons respond best to different bar orientations. Traditionally,
such a tuning is measured by analyzing the neuronal response strength
(e.g. the firing rate) as a function of the orientation of a black bar
and is illustrated and summarized with the so called
\enterm{tuning-curve} (\determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).

\begin{figure}[tp]
\includegraphics[width=1\textwidth]{mlecoding}
\titlecaption{\label{mlecodingfig} Maximum likelihood estimation of
a stimulus parameter from neuronal activity.}{Top: Tuning curve
$r_i(\phi;c,\phi_i)$ of a specific neuron $i$ as a function of the
orientation $\phi$ of a stimulus, a dark bar in front of a white
background. The preferred stimulus $\phi_i$ of that neuron, the
one that evokes the strongest firing rate response, is a bar with
vertical orientation (arrow, $\phi_i=90$\,\degree). The width of
the red area indicates the variability of the neuronal activity
$\sigma_i$ around the tuning curve. Center: In a population of
neurons, each neuron may have a different tuning curve (colors). A
specific stimulus activates the individual neurons of the
population in a specific way (dots). Bottom: The log-likelihood of
the activity pattern has a maximum close to the real stimulus
orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what is the
stimulus? In our example, what is the orientation of the bar? In the
sense of maximum likelihood, a possible answer to this question would
be: the stimulus for which the particular activity pattern is most
likely given the tuning of the neurons and the noise (standard
deviation) of the responses.

Let's stay with the example of the orientation tuning in V1. The
tuning of the firing rate $r_i(\phi)$ of neuron $i$ to the preferred
bar orientation $\phi_i$ can be well described using a von Mises
function (the Gaussian function on a cyclic x-axis)
(\figref{mlecodingfig}):
\begin{equation}
\label{bartuningcurve}
r_i(\phi; c, \phi_i) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c > 0
\end{equation}
Note the factor two in the cosine, because the response of the neuron
is the same for a bar turned by 180\,\degree.

If we approximate the neuronal activity by a normal distribution
around the tuning curve with a standard deviation $\sigma_i$, then the
probability $p_i(r|\phi)$ of the $i$-th neuron having the observed
activity $r$, given a certain orientation $\phi$ of a bar is
\begin{equation}
p_i(r|\phi) = \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{1}{2}\left(\frac{r-r_i(\phi; c, \phi_i)}{\sigma_i}\right)^2} \; .
\end{equation}
The log-likelihood of the bar orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, \ldots, $r_n$ is thus
\begin{equation}
{\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi)
\end{equation}
The angle $\phi_{mle}$ that maximizes this likelihood is an estimate
of the true orientation of the bar (\figref{mlecodingfig}).
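
A compact decoding sketch (added for illustration, not the book's code) evaluates this log-likelihood on a grid of orientations and takes its maximum; for simplicity it assumes the same response noise for every neuron and made-up tuning parameters:

import numpy as np

def tuning(phi, phi_pref, c=1.0):
    # von Mises tuning curve of a neuron with preferred orientation phi_pref
    return c*np.exp(np.cos(2.0*(phi - phi_pref)))

rng = np.random.RandomState(7)
phi_pref = np.linspace(0.0, np.pi, 8, endpoint=False)   # preferred orientations
phi_true = 0.4*np.pi                                    # orientation of the bar
sigma = 0.3                                             # response noise
responses = tuning(phi_true, phi_pref) + sigma*rng.randn(len(phi_pref))

# log-likelihood of the population response on a grid of orientations:
phis = np.linspace(0.0, np.pi, 200)
loglik = np.zeros(len(phis))
for r, pp in zip(responses, phi_pref):
    loglik += (-0.5*np.log(2.0*np.pi*sigma**2)
               - 0.5*((r - tuning(phis, pp))/sigma)**2)
phi_mle = phis[np.argmax(loglik)]
print('true: %.3f, estimated: %.3f' % (phi_true, phi_mle))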
The noisiness of the neuron's responses as quantified by $\sigma_i$
usually is a function of the neuron's mean firing rate $r_i$,
\eqnref{bartuningcurve}: $\sigma_i = \sigma_i(r_i)$. This dependence
has a major impact on the maximum likelihood estimation. Usually, the
stronger the response of a neuron, i.e. the higher its firing rate,
the lower the noise. In this case, strong responses will have a
stronger influence on the position of the maximum of the
log-likelihood.

Whether neural systems really implement maximum likelihood estimators
is another question. There are many ways in which a stimulus property
can be read out from the activity of a population of neurons. The
simplest one is a ``winner takes all'' rule: the preferred stimulus
parameter of the neuron with the strongest response is the
estimate. Another possibility is to compute a population vector. The
estimated stimulus parameter is the sum of the preferred stimulus
parameters of all neurons in the population, weighted by the activity
of the neurons. In case of angular stimulus parameters, like the
orientation of the bar in our example, a vector pointing in the
direction of the angle is used instead of the angle to incorporate the
cyclic nature of the parameter.
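
For comparison, a sketch of the population vector readout for the cyclic orientation parameter (again an added illustration with made-up values; the doubling of the angles accounts for the 180\,\degree\ periodicity of bar orientations):

import numpy as np

def population_vector(responses, phi_pref):
    # sum of unit vectors pointing to the doubled preferred angles,
    # weighted by the responses of the neurons
    vx = np.sum(responses*np.cos(2.0*phi_pref))
    vy = np.sum(responses*np.sin(2.0*phi_pref))
    return 0.5*np.arctan2(vy, vx) % np.pi

rng = np.random.RandomState(8)
phi_pref = np.linspace(0.0, np.pi, 8, endpoint=False)
phi_true = 0.4*np.pi
responses = np.exp(np.cos(2.0*(phi_true - phi_pref))) + 0.3*rng.randn(len(phi_pref))
print('population vector estimate: %.3f' % population_vector(responses, phi_pref))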
Using maximum likelihood estimators for analyzing neural population
activity gives us an upper bound on how well a stimulus parameter is
encoded in the activity of the neurons. The brain would not be able to
do better.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


@ -5,11 +5,9 @@ import matplotlib.gridspec as gridspec
from plotstyle import *

rng = np.random.RandomState(4637281)

fig = plt.figure(figsize=cm_size(figure_width, 2.8*figure_height))
spec = gridspec.GridSpec(nrows=4, ncols=1, height_ratios=[4, 5, 1, 3], hspace=0.2,
                         **adjust_fs(fig, left=4.0))
ax = fig.add_subplot(spec[0, 0])
ax.set_xlim(0.0, np.pi)
@ -17,7 +15,7 @@ ax.set_xticks(np.arange(0.125*np.pi, 1.*np.pi, 0.125*np.pi))
ax.set_xticklabels([])
ax.set_ylim(0.0, 3.5)
ax.yaxis.set_major_locator(plt.NullLocator())
ax.text(-0.2, 0.5*3.5, 'Firing rate', rotation='vertical', va='center')
ax.annotate('Tuning curve',
            xy=(0.42*np.pi, 2.5), xycoords='data',
            xytext=(0.3*np.pi, 3.2), textcoords='data', ha='right',
@ -32,27 +30,34 @@ ax.text(0.52*np.pi, 0.7, 'preferred\norientation')
xx = np.arange(0.0, 2.0*np.pi, 0.01)
pp = 0.5*np.pi
yy = np.exp(np.cos(2.0*(xx+pp)))
# response variability decreases with increasing firing rate:
ss = 1.0/(0.75+2.0*yy)
ax.fill_between(xx, yy+ss, yy-ss, **fsBa)
ax.plot(xx, yy, **lsB)

# center panel: tuning curves of a population of neurons and their
# noisy responses to a bar of a specific orientation:
ax = fig.add_subplot(spec[1, 0])
ax.set_xlim(0.0, np.pi)
ax.set_xticks(np.arange(0.125*np.pi, 1.*np.pi, 0.125*np.pi))
ax.set_xticklabels([])
ax.set_ylim(-0.1, 3.5)
ax.yaxis.set_major_locator(plt.NullLocator())
ax.text(-0.2, 0.5*3.5, 'Firing rate', rotation='vertical', va='center')
xx = np.arange(0.0, 1.0*np.pi, 0.01)
prefphases = np.arange(0.125*np.pi, 1.*np.pi, 0.125*np.pi)
responses = []
sigmas = []
xresponse = 0.41*np.pi
ax.annotate('Orientation of bar',
            xy=(xresponse, -0.1), xycoords='data',
            xytext=(xresponse, 3.1), textcoords='data', ha='left', zorder=-10,
            arrowprops=dict(arrowstyle="->", relpos=(0.0,0.0)) )
for pp, ls, ps in zip(prefphases, [lsE, lsC, lsD, lsB, lsD, lsC, lsE],
                      [psE, psC, psD, psB, psD, psC, psE]) :
    yy = np.exp(np.cos(2.0*(xx+pp)))
    ax.plot(xx, yy, **ls)
    y = np.exp(np.cos(2.0*(xresponse+pp)))
    s = 1.0/(0.75+2.0*y)
    responses.append(y + rng.randn()*s)
    sigmas.append(s)
    ax.plot(xresponse, y, **ps)

responses = np.array(responses)
@ -68,25 +73,23 @@ ax = fig.add_subplot(spec[3, 0])
ax.set_xlim(0.0, np.pi)
ax.set_xticks(np.arange(0.125*np.pi, 1.*np.pi, 0.125*np.pi))
ax.set_xticklabels([])
ax.set_ylim(-210, 0)
ax.yaxis.set_major_locator(plt.NullLocator())
ax.set_xlabel('Orientation')
ax.text(-0.2, -100, 'Log-Likelihood', rotation='vertical', va='center')
phases = np.linspace(0.0, 1.1*np.pi, 100)
probs = np.zeros((len(responses), len(phases)))
# Gaussian probability of each neuron's response for every candidate orientation:
for k, (pp, r, sigma) in enumerate(zip(prefphases, responses, sigmas)) :
    y = np.exp(np.cos(2.0*(phases+pp)))
    probs[k,:] = np.exp(-0.5*((r-y)/sigma)**2.0)/np.sqrt(2.0*np.pi*sigma**2)
# log-likelihood of the population response as a function of orientation:
loglikelihood = np.sum(np.log(probs), 0)
maxp = phases[np.argmax(loglikelihood)]
ax.annotate('',
            xy=(maxp, -210), xycoords='data',
            xytext=(maxp, -10), textcoords='data',
            arrowprops=dict(arrowstyle="->", relpos=(0.5,0.5),
                            connectionstyle="angle3,angleA=80,angleB=90") )
ax.text(maxp+0.05, -150, 'most likely\norientation\ngiven the responses')
ax.plot(phases, loglikelihood, clip_on=False, **lsA)
plt.savefig('mlecoding.pdf')