[likelihood] improved text 1
@@ -4,111 +4,132 @@
\label{maximumlikelihoodchapter}
\exercisechapter{Maximum likelihood estimation}

The core task of statistics is to infer from measured data some
parameters describing the data. These parameters can be simply a mean,
a standard deviation, or any other parameter needed to describe the
distribution the data originate from, a correlation
coefficient, or some parameters of a function describing a particular
dependence between the data. The brain faces exactly the same
problem. Given the activity pattern of some neurons (the data) it
needs to infer some aspects (parameters) of the environment and the
internal state of the body in order to generate some useful
behavior. One possible approach to estimating parameters from data is
provided by \enterm[maximum likelihood estimator]{maximum likelihood estimators}
(\enterm[mle|see{maximum likelihood estimator}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}). They choose the parameters
such that they maximize the likelihood that the specific data values
originate from a specific distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum likelihood}

Let $p(x|\theta)$ (to be read as ``probability(density) of $x$ given
$\theta$.'') be the probability (density) distribution of data value $x$
given parameter values $\theta$. This could be the normal distribution
\begin{equation}
\label{normpdfmean}
p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as
parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots, x_n$
are independent of each other and originate from the same probability
density distribution (they are \enterm[i.i.d.|see{independent and
identically distributed}]{i.i.d.}, \enterm{independent and
identically distributed}), then the conditional probability
$p(x_1,x_2, \ldots, x_n|\theta)$ of observing the particular data
values $x_1, x_2, \ldots, x_n$ given some specific parameter values
$\theta$ of the probability density is given by the product of the
probability densities of each data value:
\begin{equation}
\label{probdata}
p(x_1,x_2, \ldots, x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
\ldots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \entermde{Likelihood}{likelihood} of the parameters $\theta$
given the observed data $x_1, x_2, \ldots, x_n$ is
\begin{equation}
\label{likelihood}
{\cal L}(\theta|x_1,x_2, \ldots, x_n) = p(x_1,x_2, \ldots, x_n|\theta) \; .
\end{equation}
Note that the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
L}(\theta|x_1,x_2, \ldots, x_n) \, d\theta \ne 1$). For given
observations $x_1, x_2, \ldots, x_n$, the likelihood
\eqnref{likelihood} is a function of the parameters $\theta$. This
function has a global maximum for some specific parameter values. At
this maximum the probability \eqnref{probdata} to observe the measured
data values is the largest.

Maximum likelihood estimators just find the parameter values
\begin{equation}
\theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots, x_n)
\end{equation}
that maximize the likelihood \eqnref{likelihood}.
$\text{argmax}_xf(x)$ is the value of the argument $x$ for which the
function $f(x)$ assumes its global maximum. Thus, we search for the
parameter values $\theta$ at which the likelihood ${\cal L}(\theta)$
reaches its maximum. For these parameter values the measured data most
likely originated from the corresponding distribution.
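
To get a first feeling for what this maximization means, the
likelihood can simply be evaluated for a whole range of candidate
parameter values. The following lines of Python are a minimal sketch
of this idea (they are not part of the exercises and assume NumPy and
SciPy as well as made-up example numbers):
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
x = rng.normal(2.0, 1.0, size=20)   # example data with true mean 2.0 and sigma 1.0

mus = np.linspace(0.0, 4.0, 401)    # candidate values for the mean
# likelihood of each candidate: product of the probability densities of all data values
likelihood = [np.prod(norm.pdf(x, loc=mu, scale=1.0)) for mu in mus]
mu_mle = mus[np.argmax(likelihood)] # candidate with the largest likelihood
print(mu_mle)
\end{verbatim}
The candidate with the largest likelihood should lie close to the true
mean the data were drawn from.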

The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons and reasons that
we discuss below, we instead search for the maximum of the logarithm
of the likelihood
(\entermde[likelihood!log-]{Likelihood!Log-}{log-likelihood})
\begin{eqnarray}
\theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
& = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
which is the sum of the logarithms of the probabilities of each
observation. Let's illustrate the concept of maximum likelihood
estimation on the arithmetic mean.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Arithmetic mean}
Suppose that the measurements $x_1, x_2, \ldots, x_n$ originate from a
normal distribution \eqnref{normpdfmean} and we do not know the
population mean $\mu$ of the normal distribution
(\figrefb{mlemeanfig}). In this setting $\mu$ is the only parameter
$\theta$. Which value of $\mu$ maximizes the likelihood of the data?

\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlemean}
\titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
the mean.}{Top: The measured data (blue dots) together with three
normal distributions differing in their means (arrows) from which
the data could have originated. Bottom left: the likelihood
as a function of the parameter $\mu$. For the data it is maximal
at a value of $\mu = 2$. Bottom right: the log-likelihood. Taking
the logarithm does not change the position of the maximum.}
\end{figure}

With the normal distribution \eqnref{normpdfmean} and applying
logarithmic identities, the log-likelihood \eqnref{loglikelihood} reads
\begin{eqnarray}
\log {\cal L}(\mu|x_1,x_2, \ldots, x_n)
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \nonumber \\
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\mu)^2}{2\sigma^2} \; .
\end{eqnarray}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. This is the second reason why it is useful to
maximize the log-likelihood. To calculate the maximum of the
log-likelihood, we need to take the derivative with respect to $\mu$
and set it to zero:
\begin{eqnarray}
\frac{\text{d}}{\text{d}\mu} \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n - \frac{\text{d}}{\text{d}\mu} \log \sqrt{2\pi \sigma^2} - \frac{\text{d}}{\text{d}\mu} \frac{(x_i-\mu)^2}{2\sigma^2} \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad \sum_{i=1}^n \frac{2(x_i-\mu)}{2\sigma^2} & = & 0 \nonumber \\
\Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0 \nonumber \\
\Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i \nonumber \\
\Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of
normally distributed data is the arithmetic mean. That is, the
arithmetic mean maximizes the likelihood that the data originate from
a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Equivalently, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.
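
The same result can be obtained numerically. A small Python sketch
(again assuming NumPy and SciPy; the data are made up) minimizes the
negative log-likelihood with a generic optimizer and compares the
result with the arithmetic mean:
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(2.0, 1.0, size=100)  # example data

def negloglik(mu):
    # negative log-likelihood of the mean, standard deviation fixed to 1.0
    return -np.sum(norm.logpdf(x, loc=mu, scale=1.0))

result = minimize_scalar(negloglik)
print(result.x, np.mean(x))         # the two values should agree
\end{verbatim}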
@@ -123,6 +144,17 @@ from a normal distribution with this standard deviation.
the maxima with the mean calculated from the data.
\end{exercise}

Comparing the values of the likelihood with those of the
log-likelihood shown in \figref{mlemeanfig} reveals the numerical
reason for taking the logarithm of the likelihood. The likelihood
values can get very small, because we multiply many, potentially small
probability densities with each other. The likelihood quickly gets
smaller than the smallest number a floating-point number of a computer
can represent. Try it by increasing the number of data values in the
exercise. Taking the logarithm avoids this problem. The log-likelihood
takes on well-behaved values that the computer can handle.
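
This numerical problem is easy to demonstrate with a few lines of
Python (a sketch assuming NumPy and SciPy; the number of data values
is chosen arbitrarily large):
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.0, size=10000)        # many data values

densities = norm.pdf(x, loc=2.0, scale=1.0) # probability density of each data value
print(np.prod(densities))                   # underflows to 0.0
print(np.sum(np.log(densities)))            # a perfectly representable number
\end{verbatim}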


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
@@ -130,32 +162,39 @@ Consider normally distributed data with unknown mean and standard
deviation. From the considerations above we have just seen that a
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data is
the Gaussian that fits the data best in a maximum likelihood sense,
i.e. the likelihood that the data were generated from this
distribution is the largest. Fitting a Gaussian distribution to data
is very simple: just compute the two parameters of the Gaussian
distribution $\mu$ and $\sigma$ as the arithmetic mean and the
standard deviation, respectively, directly from the data.

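In Python such a maximum likelihood fit of a Gaussian boils down to
two lines (a minimal sketch assuming NumPy and SciPy; comparing with
SciPy's built-in fit is optional):
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
x = rng.normal(-1.0, 0.5, size=200)  # example data

mu = np.mean(x)    # maximum likelihood estimate of the mean
sigma = np.std(x)  # maximum likelihood estimate of the standard deviation
print(mu, sigma)
print(norm.fit(x)) # scipy's maximum likelihood fit should give the same values
\end{verbatim}
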
For non-Gaussian distributions, for example a
\entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
\label{gammapdf}
p(x|\alpha,\beta) \sim x^{\alpha-1}e^{-\beta x} \; ,
\end{equation}
however, such simple analytical expressions for the parameters of the
distribution do not exist. This is the case, for example, for the
shape parameter $\alpha$ of the Gamma-distribution. How do we fit such
a distribution to some data? That is, how should we compute the
values of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by
minimization of the squared difference to a histogram of the measured
data in the same way as we fit a function to some data. For several
reasons this is, however, not the method of choice: (i) Probability
densities can only be positive which leads, for small values in
particular, to asymmetric distributions of the estimated histogram
around the true density. (ii) The values of a histogram estimating the
density are not independent because the integral over a density is
unity. The two basic assumptions of normally distributed and
independent samples, which are a prerequisite for turning the
minimization of the squared difference into a maximum likelihood
estimation (see next section), are violated. (iii) The estimation of
the probability density by means of a histogram strongly depends on
the chosen bin size (\figref{mlepdffig}).

|
\begin{figure}[t]
|
||||||
\includegraphics[width=1\textwidth]{mlepdf}
|
\includegraphics[width=1\textwidth]{mlepdf}
|
||||||
@@ -173,11 +212,10 @@ Instead we should stay with maximum-likelihood estimation. Exactly in
the same way as we estimated the mean value of a Gaussian distribution
above, we can numerically fit the parameters of any type of
distribution directly from the data by means of maximizing the
likelihood. We simply search for the parameter values of the desired
probability density function that maximize the log-likelihood. In
general this is a non-linear optimization problem that is solved with
numerical methods such as gradient descent \matlabfun{mle()}.

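As a sketch of how such a numerical fit could look in Python (assuming
NumPy and SciPy; the true parameter values and the starting point are
made up for illustration), the negative log-likelihood of the Gamma
distribution can be minimized with a general-purpose optimizer:
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import gamma

rng = np.random.default_rng(7)
# example data: alpha (shape) = 2.5, beta (rate) = 1.5
x = rng.gamma(2.5, 1.0/1.5, size=500)

def negloglik(theta):
    alpha, beta = theta
    if alpha <= 0.0 or beta <= 0.0:
        return np.inf                # keep the optimizer in the valid parameter range
    # negative sum of log probability densities; scipy uses scale = 1/rate
    return -np.sum(gamma.logpdf(x, a=alpha, scale=1.0/beta))

result = minimize(negloglik, x0=[1.0, 1.0], method='Nelder-Mead')
print(result.x)                      # estimates of alpha and beta
\end{verbatim}
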
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
Generate a sample of gamma-distributed random numbers and apply the
@@ -191,12 +229,16 @@ that is solved with numerical methods such as the gradient descent

When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. In
chapter~\ref{gradientdescentchapter} we simply assumed that ``best''
means minimizing the squared distance between the data and the
function. With maximum likelihood we search for the parameter value
$\theta$ for which the likelihood that the data were drawn from the
corresponding function is maximal.

If we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$,
the log-likelihood is
\begin{eqnarray}
\log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \nonumber \\
@@ -218,18 +260,17 @@ the position of the minimum:
\theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard
deviation is also called $\chi^2$ (chi squared). The parameter
$\theta$ which minimizes the squared differences is thus the one that
maximizes the likelihood that the data actually originate from the
given function. Therefore, minimizing $\chi^2$ is a maximum likelihood
estimation.

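This equivalence can be checked numerically. The following Python
sketch (assuming NumPy and SciPy, with a made-up straight line as the
function $f(x;\theta)$) minimizes $\chi^2$ and, independently, the
negative Gaussian log-likelihood, and should arrive at the same
parameter values:
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 40)
sigma = 0.5 + 0.05*x                       # measurement error of each data point
y = 2.0*x + 1.0 + rng.normal(0.0, sigma)   # data scattered around a straight line

def chi2(theta):
    m, b = theta
    return np.sum(((y - (m*x + b))/sigma)**2)

def negloglik(theta):
    m, b = theta
    return -np.sum(norm.logpdf(y, loc=m*x + b, scale=sigma))

print(minimize(chi2, [1.0, 0.0]).x)        # least squares estimate of slope and intercept
print(minimize(negloglik, [1.0, 0.0]).x)   # maximum likelihood estimate, same values
\end{verbatim}
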
From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. In case of other distributions, the log-likelihood
\eqnref{loglikelihood} needs to be adapted accordingly.

\begin{figure}[t]
\includegraphics[width=1\textwidth]{mlepropline}
@@ -377,7 +418,7 @@ orientation $\phi$ of an edge is given by
The log-likelihood of the edge orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, ... $r_n$ is thus
\begin{equation}
\log {\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi)
\end{equation}
The angle $\phi$ that maximizes this log-likelihood is then an estimate of
the orientation of the edge.
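
A minimal Python sketch of such a decoder could look as follows. Note
that the tuning curves and the Poisson spike-count statistics used
here are assumptions made purely for illustration and are not taken
from the model above:
\begin{verbatim}
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(11)
prefs = np.linspace(0.0, np.pi, 8, endpoint=False)  # preferred orientations of 8 neurons

def rates(phi):
    # assumed tuning curves: mean spike count as a function of edge orientation
    return 2.0 + 20.0*np.exp(np.cos(2.0*(phi - prefs)) - 1.0)

true_phi = 0.7
r = rng.poisson(rates(true_phi))             # observed spike counts of the population

phis = np.linspace(0.0, np.pi, 1000)
loglik = [np.sum(poisson.logpmf(r, rates(phi))) for phi in phis]
print(phis[np.argmax(loglik)])               # maximum likelihood estimate of the orientation
\end{verbatim}
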
@@ -52,7 +52,7 @@ for i, theta in enumerate(thetas) :
p=np.prod(ps,axis=0)
# plot it:
ax = fig.add_subplot(spec[1, 0])
ax.set_xlabel(r'Parameter $\mu$')
ax.set_ylabel('Likelihood')
ax.set_xticks(np.arange(1.6, 2.5, 0.4))
ax.annotate('Maximum',
@@ -68,7 +68,7 @@ ax.annotate('',
ax.plot(thetas, p, **lsAm)

ax = fig.add_subplot(spec[1, 1])
ax.set_xlabel(r'Parameter $\mu$')
ax.set_ylabel('Log-Likelihood')
ax.set_ylim(-50,-20)
ax.set_xticks(np.arange(1.6, 2.5, 0.4))
@@ -1,4 +1,5 @@
\chapter{Optimization and gradient descent}
\label{gradientdescentchapter}
\exercisechapter{Optimization and gradient descent}

Optimization problems arise in many different contexts. For example,