[likelihood] further textual improvements
commit 934a1e976c (parent 1a9cddd451)
@@ -77,7 +77,7 @@ maximizes the likelihood of the data?
\titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
  the mean.}{Top: The measured data (blue dots) together with three
  different possible normal distributions with different means
  (arrows) the data could have originated from. Bottom left: the
  likelihood as a function of $\theta$, i.e. the mean. It is maximal
  at a value of $\theta = 2$. Bottom right: the
  log-likelihood. Taking the logarithm does not change the position
@@ -103,7 +103,10 @@ zero:
\end{eqnarray*}
Thus, the maximum likelihood estimator is the arithmetic mean. That
is, the arithmetic mean maximizes the likelihood that the data
originate from a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Equivalently, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.

\begin{exercise}{mlemean.m}{mlemean.out}
  Draw $n=50$ random numbers from a normal distribution with a mean of
@@ -113,18 +116,77 @@ originate from a normal distribution (\figref{mlemeanfig}).
  log-likelihood (given by the sum of the logarithms of the
  probabilities) for the mean as parameter. Compare the position of
  the maxima with the mean calculated from the data.
\end{exercise}
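
A minimal sketch of such a brute-force scan of the log-likelihood
(this is not the actual \texttt{mlemean.m} solution; the data and the
grid of candidate means are assumptions made for illustration):
\begin{lstlisting}
% Sketch: scan candidate means and evaluate the log-likelihood.
% Assumes normally distributed data with known sigma = 1.
n = 50;
data = randn(n, 1) + 2.0;             % samples with true mean 2
means = 0.0:0.01:4.0;                 % candidate values for the mean
loglik = zeros(size(means));
for k = 1:length(means)
    p = normpdf(data, means(k), 1.0); % probability of each data point
    loglik(k) = sum(log(p));          % log-likelihood of the whole dataset
end
[~, imax] = max(loglik);
fprintf('ML estimate: %.3f, arithmetic mean: %.3f\n', means(imax), mean(data))
\end{lstlisting}
The maximum of the scanned log-likelihood coincides with the
arithmetic mean up to the resolution of the grid.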


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
Consider normally distributed data with unknown mean and standard
deviation. From the considerations above we have just seen that the
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data fits
the data best in a maximum likelihood sense, i.e. the likelihood that
the data were generated from this distribution is the largest.
Fitting a Gaussian distribution to data is thus very simple: just
compute the two parameters of the Gaussian distribution, the mean and
the standard deviation, directly from the data.

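In code this boils down to two statistics of the data (a minimal
sketch; the example data are an assumption). Note that the maximum
likelihood estimate of the standard deviation normalizes by $n$
rather than $n-1$, which is requested with the second argument to
\matlabfun{std()}:
\begin{lstlisting}
% Maximum likelihood fit of a Gaussian distribution:
data = 2.0 + 3.0*randn(100, 1);  % example data (true mu=2, sigma=3)
mu = mean(data);                 % ML estimate of the mean
sigma = std(data, 1);            % ML estimate: normalized by n, not n-1
\end{lstlisting}
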
For non-Gaussian distributions, however, such simple analytical
expressions for the parameters do not exist, e.g. for the shape
parameter of a \enterm{Gamma-distribution}. How do we fit such a
distribution to some data? That is, how should we compute the values
of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by
minimizing the squared difference to a histogram of the measured
data. For several reasons this is, however, not the method of choice:
(i) Probability densities can only be positive, which leads, in
particular for small values, to asymmetric distributions. (ii) The
values of a histogram estimating the density are not independent,
because the integral over a density is unity. The two basic
assumptions of normally distributed and independent samples, which
are a prerequisite for the minimization of the squared difference
\eqnref{chisqmin} to be equivalent to a maximum likelihood
estimation, are violated. (iii) The histogram strongly depends on the
chosen bin size (\figref{mlepdffig}).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
  \titlecaption{\label{mlepdffig} Maximum likelihood estimation of a
    probability density.}{Left: the 100 data points drawn from a
    2nd-order Gamma-distribution. The maximum likelihood estimate of
    the probability density function is shown in orange, the true pdf
    in red. Right: normalized histogram of the data together with the
    true (red) and the fitted probability density functions. The fit
    was done by minimizing the squared difference to the histogram.}
\end{figure}

Instead we should stick with maximum-likelihood estimation. In
exactly the same way as we estimated the mean value of a Gaussian
distribution above, we can numerically fit the parameter of any type
of distribution directly to the data by maximizing the
likelihood. We simply search for the parameter $\theta$ of the
desired probability density function that maximizes the
log-likelihood. In general this is a non-linear optimization problem
that is solved with numerical methods such as gradient descent
\matlabfun{mle()}.
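
As an illustration, a minimal sketch of such a numerical fit for a
Gamma-distribution (the example data, the starting point, and the
trick of optimizing log-parameters to keep shape and scale positive
are choices made for this sketch; \matlabfun{mle()} or
\matlabfun{gamfit()} bundle these steps):
\begin{lstlisting}
% Sketch: fit a Gamma-distribution by maximizing the log-likelihood,
% i.e. by numerically minimizing the negative log-likelihood.
data = gamrnd(2.0, 1.0, 500, 1);   % example data: shape=2, scale=1
% optimize log-parameters so that shape and scale stay positive:
negloglik = @(q) -sum(log(gampdf(data, exp(q(1)), exp(q(2)))));
qhat = fminsearch(negloglik, [0.0, 0.0]);
fprintf('shape: %.3f, scale: %.3f\n', exp(qhat(1)), exp(qhat(2)))
\end{lstlisting}
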
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the
  maximum likelihood method to estimate the parameters of the gamma
  distribution from the data.
\end{exercise}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Curve fitting}

When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. With maximum likelihood we search
for the parameter value $\theta$ for which the likelihood that the data
were drawn from the corresponding function is maximal. If we assume
that the $y_i$ values are normally distributed around the function
values $f(x_i;\theta)$ with a standard deviation $\sigma_i$, the
@@ -182,7 +244,9 @@ respect to $\theta$ and equate it to zero:
  & = & \sum_{i=1}^n \frac{\text{d}}{\text{d}\theta} \left( \frac{y_i-\theta x_i}{\sigma_i} \right)^2 \nonumber \\
  & = & -2 \sum_{i=1}^n \frac{x_i}{\sigma_i} \left( \frac{y_i-\theta x_i}{\sigma_i} \right) \nonumber \\
  & = & -2 \sum_{i=1}^n \left( \frac{x_i y_i}{\sigma_i^2} - \theta \frac{x_i^2}{\sigma_i^2} \right) \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad \theta \sum_{i=1}^n \frac{x_i^2}{\sigma_i^2} & = & \sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2} \nonumber
\end{eqnarray}
\begin{eqnarray}
  \Leftrightarrow \quad \theta & = & \frac{\sum_{i=1}^n \frac{x_i y_i}{\sigma_i^2}}{\sum_{i=1}^n \frac{x_i^2}{\sigma_i^2}} \label{mleslope}
\end{eqnarray}
This is an analytical expression for the estimation of the slope
@@ -190,12 +254,12 @@ $\theta$ of the regression line (\figref{mleproplinefig}).

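A minimal sketch of this estimator (the data and the error model
$\sigma_i \propto x_i$ are assumptions made for illustration):
\begin{lstlisting}
% Sketch: ML estimate of the slope of a line through the origin,
% weighting each data point by its standard deviation sigma_i.
x = linspace(0.5, 5.0, 40);
sigma = 0.2*x;                      % assumed known measurement errors
y = 2.0*x + sigma.*randn(size(x));  % data scattered around y = 2x
theta = sum(x.*y./sigma.^2) / sum(x.^2./sigma.^2);  % eqn. mleslope
\end{lstlisting}
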
A gradient descent, as we have done in the previous chapter, is not
necessary for fitting the slope of a straight line, because the slope
can be directly computed via \eqnref{mleslope}. More generally, this
also holds for fitting the coefficients of linearly combined basis
functions, such as the slope $m$ and the y-intercept $b$ of a
straight line
\[ y = m \cdot x + b \]
or the coefficients $a_k$ of a polynomial
\[ y = \sum_{k=0}^N a_k x^k = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots \]
\matlabfun{polyfit()}.

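For example (a hypothetical snippet; the data are made up for
illustration), \matlabfun{polyfit()} solves this linear problem
directly:
\begin{lstlisting}
% Sketch: fitting a straight line and a cubic polynomial.
x = linspace(0.0, 4.0, 50);
y = 1.0 + 2.0*x + 0.3*randn(size(x));  % noisy line with m=2, b=1
pline = polyfit(x, y, 1);   % pline(1) is the slope m, pline(2) is b
pcubic = polyfit(x, y, 3);  % coefficients a_3 ... a_0, highest first
yfit = polyval(pline, x);   % evaluate the fitted line at the x values
\end{lstlisting}
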
@@ -223,79 +287,34 @@ case, the variance
\[ \sigma_x^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)^2 = \frac{1}{n} \sum_{i=1}^n x_i^2 = 1 \]
is the mean of the squared data and equals one.
The covariance between $x$ and $y$ also simplifies to
\[ \text{cov}(x, y) = \frac{1}{n} \sum_{i=1}^n (x_i - \bar x)(y_i - \bar y) = \frac{1}{n} \sum_{i=1}^n x_i y_i \; , \]
the averaged product between pairs of $x$ and $y$ values. Recall that
the correlation coefficient $r_{x,y}$,
\eqnref{correlationcoefficient}, is the covariance normalized by the
product of the standard deviations of $x$ and $y$,
respectively. Therefore, if the standard deviations equal one, the
correlation coefficient equals the covariance. Consequently, for
standardized data the slope of the regression line
\eqnref{whitemleslope} simplifies to
\begin{equation}
  \theta = \frac{1}{n} \sum_{i=1}^n x_i y_i = \text{cov}(x,y) = r_{x,y}
\end{equation}
For standardized data the slope of the regression line equals the
correlation coefficient!
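
A quick numerical check of this relation (the example data are
hypothetical; the slope is computed with \eqnref{mleslope} for
$\sigma_i = 1$):
\begin{lstlisting}
% Sketch: for standardized (z-scored) data the slope of the
% regression line through the origin equals the correlation coefficient.
x = randn(200, 1);
y = 0.8*x + 0.6*randn(200, 1);
zx = (x - mean(x)) / std(x, 1);    % zero mean, unit standard deviation
zy = (y - mean(y)) / std(y, 1);
theta = sum(zx.*zy) / sum(zx.^2);  % slope of the regression line
r = corrcoef(x, y);                % 2x2 matrix of correlation coefficients
fprintf('slope: %.3f, r: %.3f\n', theta, r(1, 2))
\end{lstlisting}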

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Neural coding}
In sensory systems certain aspects of the environment are encoded in
the neuronal activity of populations of neurons. One example of such a
population code is the tuning of neurons in the primary visual cortex
(V1) to the orientation of an edge or bar in the visual
stimulus. Different neurons respond best to different edge
orientations. Traditionally, such a tuning is measured by analyzing
the neuronal response strength (e.g. the firing rate) as a function of
the orientation of a black bar and is illustrated and summarized with
the so-called \enterm{tuning-curve} (\determ{Abstimmkurve},
figure~\ref{mlecodingfig}, top).

\begin{figure}[tp]
@@ -306,32 +325,35 @@ figure~\ref{mlecodingfig}, top).
    dark bar in front of a white background). The stimulus that evokes
    the strongest activity in that neuron is the bar with the vertical
    orientation (arrow, $\phi_i=90$\,\degree). The red area indicates
    the variability of the neuronal activity $p(r|\phi)$ around the
    tuning curve. Center: In a population of neurons, each neuron may
    have a different tuning curve (colors). A specific stimulus (the
    vertical bar) activates the individual neurons of the population
    in a specific way (dots). Bottom: The log-likelihood of the
    activity pattern will be maximized close to the real stimulus
    orientation.}
\end{figure}

The brain, however, is confronted with the inverse problem: given a
certain activity pattern in the neuronal population, what is the
stimulus (here the orientation of an edge)? In the sense of maximum
likelihood, a possible answer to this question would be: the stimulus
for which the particular activity pattern is most likely given the
tuning of the neurons.

Let's stay with the example of the orientation tuning in V1. The
tuning $\Omega_i(\phi)$ of each neuron $i$ to the preferred edge
orientation $\phi_i$ can be well described by a van-Mises function
(the Gaussian function on a cyclic x-axis) (\figref{mlecodingfig}):
\[ \Omega_i(\phi) = c \cdot e^{\cos(2(\phi-\phi_i))} \quad , \quad c \in \reZ \]
If we approximate the neuronal activity by a normal distribution
around the tuning curve with a standard deviation $\sigma=\Omega/4$,
which is proportional to $\Omega$, then the probability $p_i(r|\phi)$
of the $i$-th neuron showing the activity $r$ given a certain
orientation $\phi$ of an edge is given by
\[ p_i(r|\phi) = \frac{1}{\sqrt{2\pi}\Omega_i(\phi)/4} e^{-\frac{1}{2}\left(\frac{r-\Omega_i(\phi)}{\Omega_i(\phi)/4}\right)^2} \; . \]
The log-likelihood of the edge orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus
\[ {\cal L}(\phi|r_1, r_2, \ldots r_n) = \sum_{i=1}^n \log p_i(r_i|\phi) \]
The angle $\phi$ that maximizes this likelihood is then an estimate of
the orientation of the edge.
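
A minimal decoding sketch along these lines (the preferred
orientations, the noise model, and $c=1$ are assumptions made for
illustration):
\begin{lstlisting}
% Sketch: maximum likelihood decoding of the edge orientation from
% the activity pattern of a population of orientation-tuned neurons.
phis = 0.0:20.0:160.0;             % preferred orientations in degree
omega = @(phi, phipref) exp(cos(2.0*(phi - phipref)/180.0*pi));
rclean = omega(90.0, phis);        % tuning values for a 90 degree edge
ract = rclean + 0.25*rclean.*randn(size(rclean));  % sigma = Omega/4
phicand = 0.0:0.5:180.0;           % candidate orientations
loglik = zeros(size(phicand));
for k = 1:length(phicand)
    om = omega(phicand(k), phis);  % expected responses
    p = exp(-0.5*((ract - om)./(om/4.0)).^2) ./ (sqrt(2.0*pi)*om/4.0);
    loglik(k) = sum(log(p));       % log-likelihood of the pattern
end
[~, imax] = max(loglik);
fprintf('estimated orientation: %.1f degree\n', phicand(imax))
\end{lstlisting}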