\label{maximumlikelihoodchapter}

\exercisechapter{Maximum likelihood estimation}

The core task of statistics is to infer from measured data some
parameters describing the data. These parameters can be simply a mean,
a standard deviation, or any other parameter needed to describe the
distribution the data are originating from, a correlation
coefficient, or some parameters of a function describing a particular
dependence between the data. The brain faces exactly the same
problem. Given the activity pattern of some neurons (the data) it
needs to infer some aspects (parameters) of the environment and the
internal state of the body in order to generate some useful
behavior. One possible approach to estimating parameters from data is
provided by \enterm[maximum likelihood estimator]{maximum likelihood
estimators} (\enterm[mle|see{maximum likelihood estimator}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}). They choose the parameters
such that they maximize the likelihood of the specific data values to
originate from a specific distribution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum likelihood}

Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given
$\theta$'') be the probability (density) distribution of data value $x$
given parameter values $\theta$. This could be the normal distribution
\begin{equation}
  \label{normpdfmean}
  p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as
parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots, x_n$
are independent of each other and originate from the same probability
density distribution (they are \enterm[i.i.d.|see{independent and
identically distributed}]{i.i.d.}, \enterm{independent and identically
distributed}), then the conditional probability
$p(x_1,x_2, \ldots, x_n|\theta)$ of observing the particular data
values $x_1, x_2, \ldots, x_n$ given some specific parameter values
$\theta$ of the probability density is given by the product of the
probability densities of each data value:
\begin{equation}
  \label{probdata}
  p(x_1,x_2, \ldots, x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
  \ldots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \entermde{Likelihood}{likelihood} of the parameters $\theta$
given the observed data $x_1, x_2, \ldots, x_n$ is
\begin{equation}
  \label{likelihood}
  {\cal L}(\theta|x_1,x_2, \ldots, x_n) = p(x_1,x_2, \ldots, x_n|\theta) \; .
\end{equation}
Note that the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
  L}(\theta|x_1,x_2, \ldots, x_n) \, d\theta \ne 1$). For given
observations $x_1, x_2, \ldots, x_n$, the likelihood
\eqref{likelihood} is a function of the parameters $\theta$. This
function has a global maximum for some specific parameter values. At
this maximum the probability \eqref{probdata} to observe the measured
data values is the largest.

Maximum likelihood estimators just find the parameter values
\begin{equation}
  \theta_{mle} = \text{argmax}_{\theta} {\cal L}(\theta|x_1,x_2, \ldots, x_n)
\end{equation}
that maximize the likelihood \eqref{likelihood}.
$\text{argmax}_xf(x)$ is the value of the argument $x$ for which the
function $f(x)$ assumes its global maximum. Thus, we search for the
parameter values $\theta$ at which the likelihood ${\cal L}(\theta)$
reaches its maximum. For these parameter values the measured data most
likely originated from the corresponding distribution.
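
For illustration, here is a minimal Python sketch of this brute-force
view of $\text{argmax}_{\theta}$: it evaluates the likelihood
\eqref{probdata} of a small, made-up data set on a grid of candidate
means (with the standard deviation fixed to $\sigma=1$ for simplicity)
and picks the candidate with the largest likelihood.
\begin{lstlisting}
import numpy as np

# hypothetical measured data values:
x = np.array([1.8, 2.3, 1.9, 2.4, 2.1])

# grid of candidate values for the parameter theta (here: the mean):
thetas = np.linspace(0.0, 4.0, 401)
sigma = 1.0   # assumed to be known and fixed

def normpdf(x, mu, sigma):
    # normal probability density:
    return np.exp(-(x - mu)**2 / (2.0*sigma**2)) / np.sqrt(2.0*np.pi*sigma**2)

# likelihood of each candidate theta (product of densities of all data values):
likelihood = np.array([np.prod(normpdf(x, theta, sigma)) for theta in thetas])

# maximum likelihood estimate:
theta_mle = thetas[np.argmax(likelihood)]
print(theta_mle)   # close to the arithmetic mean of x
\end{lstlisting}
On a sufficiently fine grid the estimate essentially coincides with the
arithmetic mean of the data, anticipating the result derived below.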

The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons and reasons that
we discuss below, we instead search for the maximum of the logarithm
of the likelihood
(\entermde[likelihood!log-]{Likelihood!Log-}{log-likelihood})
\begin{eqnarray}
  \theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
  & = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
  & = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
which is the sum of the logarithms of the probabilities of each
observation. Let's illustrate the concept of maximum likelihood
estimation on the arithmetic mean.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Arithmetic mean}

Suppose that the measurements $x_1, x_2, \ldots, x_n$ originate from a
normal distribution \eqref{normpdfmean} and we do not know the
population mean $\mu$ of the normal distribution
(\figrefb{mlemeanfig}). In this setting $\mu$ is the only parameter
$\theta$. Which value of $\mu$ maximizes the likelihood of the data?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
    the mean.}{Top: The measured data (blue dots) together with three
    normal distributions differing in their means (arrows) from which
    the data could have originated. Bottom left: the likelihood as a
    function of the parameter $\mu$. For the data it is maximal at a
    value of $\mu = 2$. Bottom right: the log-likelihood. Taking the
    logarithm does not change the position of the maximum.}
\end{figure}

With the normal distribution \eqref{normpdfmean} and applying
logarithmic identities, the log-likelihood \eqref{loglikelihood} reads
\begin{eqnarray}
  \log {\cal L}(\mu|x_1,x_2, \ldots, x_n)
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \nonumber \\
  & = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\mu)^2}{2\sigma^2} \; .
\end{eqnarray}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. This is the second reason why it is useful to
maximize the log-likelihood. To calculate the maximum of the
log-likelihood, we need to take the derivative with respect to $\mu$
and set it to zero:
\begin{eqnarray}
  \frac{\text{d}}{\text{d}\mu} \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n - \frac{\text{d}}{\text{d}\mu} \log \sqrt{2\pi \sigma^2} - \frac{\text{d}}{\text{d}\mu} \frac{(x_i-\mu)^2}{2\sigma^2} \;\; = \;\; 0 \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n \frac{2(x_i-\mu)}{2\sigma^2} & = & 0 \nonumber \\
  \Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0 \nonumber \\
  \Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i \nonumber \\
  \Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of
normally distributed data is the arithmetic mean. That is, the
arithmetic mean maximizes the likelihood that the data originate from
a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Equivalently, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.

the maxima with the mean calculated from the data.
\end{exercise}

Comparing the values of the likelihood with those of the
log-likelihood shown in \figref{mlemeanfig} reveals the numerical
reason for taking the logarithm of the likelihood. The likelihood
values can get very small, because we multiply many, potentially small
probability densities with each other. The likelihood quickly gets
smaller than the smallest number a computer's floating-point numbers
can represent. Try it by increasing the number of data values in the
exercise. Taking the logarithm avoids this problem. The
log-likelihood, in contrast, takes on well-behaved values that the
computer can handle.
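
A short sketch (with made-up, simulated data) makes this concrete:
already for a thousand data values the product of the probability
densities underflows to zero, whereas the sum of their logarithms is a
perfectly ordinary number.
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(2.0, 1.0, size=1000)   # hypothetical data set

mu = 2.0
sigma = 1.0
p = np.exp(-(x - mu)**2 / (2.0*sigma**2)) / np.sqrt(2.0*np.pi*sigma**2)

print(np.prod(p))          # likelihood: underflows to 0.0
print(np.sum(np.log(p)))   # log-likelihood: about -1400
\end{lstlisting}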

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}

Consider normally distributed data with unknown mean and standard
deviation. From the considerations above we have just seen that a
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data is
the Gaussian that fits the data best in a maximum likelihood sense,
i.e. the likelihood that the data were generated from this
distribution is the largest. Fitting a Gaussian distribution to data
is very simple: just compute the two parameters of the Gaussian
distribution, $\mu$ and $\sigma$, as the arithmetic mean and the
standard deviation, respectively, directly from the data.
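
In Python this maximum likelihood fit is literally two lines (a
sketch; the data vector is made up):
\begin{lstlisting}
import numpy as np

x = np.array([1.8, 2.3, 1.9, 2.4, 2.1])   # hypothetical data values

mu = np.mean(x)     # maximum likelihood estimate of the mean
sigma = np.std(x)   # np.std() with ddof=0 is the maximum likelihood estimate
\end{lstlisting}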

For non-Gaussian distributions, for example a
\entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
  \label{gammapdf}
  p(x|\alpha,\beta) \sim x^{\alpha-1}e^{-\beta x} \; ,
\end{equation}
however, such simple analytical expressions for the parameters of the
distribution do not exist. This is the case, for example, for the
shape parameter $\alpha$ of the Gamma-distribution. How do we fit such
a distribution to some data? That is, how should we compute the
values of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by
minimization of the squared difference to a histogram of the measured
data in the same way as we fit a function to some data. For several
reasons this is, however, not the method of choice: (i) Probability
densities can only be positive which leads, for small values in
particular, to asymmetric distributions of the estimated histogram
around the true density. (ii) The values of a histogram estimating the
density are not independent because the integral over a density is
unity. The two basic assumptions of normally distributed and
independent samples, which are a prerequisite for making the
minimization of the squared difference to a maximum likelihood
estimation (see next section), are violated. (iii) The estimation of
the probability density by means of a histogram strongly depends on
the chosen bin size (\figref{mlepdffig}).
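
The bin-size dependence is easy to see in a few lines of Python (a
sketch with simulated data; the gamma parameters are made up): the
same data yield clearly different density estimates for different
numbers of bins.
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(3)
x = rng.gamma(2.0, 1.0, size=200)   # hypothetical gamma-distributed data

# normalized histograms estimate the same density,
# but the estimates differ with the number of bins:
for nbins in [5, 20, 80]:
    h, edges = np.histogram(x, bins=nbins, density=True)
    print(nbins, h.max())           # peak of the estimated density
\end{lstlisting}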

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}

Instead we should stay with maximum-likelihood estimation. Exactly in
the same way as we estimated the mean value of a Gaussian distribution
above, we can numerically fit the parameter of any type of
distribution directly from the data by means of maximizing the
likelihood. We simply search for the parameter values of the desired
probability density function that maximize the log-likelihood. In
general this is a non-linear optimization problem that is solved with
numerical methods such as gradient descent \matlabfun{mle()}.
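
As a sketch of how such a numerical fit could look in Python (using
SciPy's gamma distribution and a generic optimizer; all parameter
values are made up), we minimize the negative log-likelihood:
\begin{lstlisting}
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(4)
x = rng.gamma(2.0, 0.5, size=500)   # hypothetical data: shape 2, scale 0.5

def neg_loglikelihood(p):
    shape, scale = p
    if shape <= 0.0 or scale <= 0.0:
        return np.inf                # keep the optimizer in the valid range
    return -np.sum(stats.gamma.logpdf(x, shape, loc=0.0, scale=scale))

# maximize the log-likelihood by minimizing its negative:
res = optimize.minimize(neg_loglikelihood, [1.0, 1.0], method='Nelder-Mead')
shape_mle, scale_mle = res.x
print(shape_mle, scale_mle)          # close to 2 and 0.5
\end{lstlisting}
SciPy's \texttt{stats.gamma.fit(x, floc=0)} performs essentially the
same maximization internally, analogous to \matlabfun{mle()}.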

\begin{exercise}{mlegammafit.m}{mlegammafit.out}
  Generate a sample of gamma-distributed random numbers and apply the

When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i, y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. In
chapter~\ref{gradientdescentchapter} we simply assumed that ``best''
means minimizing the squared distance between the data and the
function. With maximum likelihood we search for the parameter value
$\theta$ for which the likelihood that the data were drawn from the
corresponding function is maximal.

If we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$,
the log-likelihood is
\begin{eqnarray}
  \log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
  & = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \nonumber \\

the position of the minimum:
\begin{equation}
  \theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard
deviation is also called $\chi^2$ (chi squared). The parameter
$\theta$ which minimizes the squared differences is thus the one that
maximizes the likelihood that the data actually originate from the
given function. Therefore, minimizing $\chi^2$ is a maximum likelihood
estimation.

From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. In case of other distributions, the log-likelihood
\eqnref{loglikelihood} needs to be adapted accordingly.
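
This equivalence is easy to verify numerically. The following sketch
(made-up data and error bars, a straight line through the origin
$f(x;\theta)=\theta x$) evaluates both $\chi^2$ and the Gaussian
log-likelihood on a grid of slopes; the same $\theta$ minimizes the
one and maximizes the other.
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0.0, 10.0, 20)
sigmas = np.full(len(x), 0.5)              # known error bars
y = 2.0*x + rng.normal(0.0, sigmas)        # hypothetical data around f(x)=2x

thetas = np.linspace(1.0, 3.0, 2001)
chisq = np.array([np.sum(((y - t*x)/sigmas)**2) for t in thetas])
logl = np.array([np.sum(-0.5*((y - t*x)/sigmas)**2
                        - np.log(np.sqrt(2.0*np.pi)*sigmas)) for t in thetas])

# the same slope minimizes chi^2 and maximizes the log-likelihood:
print(thetas[np.argmin(chisq)], thetas[np.argmax(logl)])
\end{lstlisting}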

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}

The log-likelihood of the edge orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, \ldots, $r_n$ is thus
\begin{equation}
  {\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi)
\end{equation}
The angle $\phi$ that maximizes this likelihood is then an estimate of
the orientation of the edge.
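
As an illustration of such a population read-out, here is a
hypothetical sketch (the actual response model $p_i(r_i|\phi)$ is not
shown here, so the sketch simply assumes Gaussian tuning curves and
Gaussian response variability): the log-likelihood is evaluated on a
grid of candidate orientations and its argmax is taken as the estimate.
\begin{lstlisting}
import numpy as np

rng = np.random.default_rng(6)

# hypothetical population with Gaussian tuning curves for orientation:
prefs = np.linspace(0.0, np.pi, 12, endpoint=False)  # preferred orientations
tuning_width = 0.3
noise_sd = 0.1

def tuning(phi):
    # mean response of each neuron to an edge of orientation phi:
    d = np.angle(np.exp(2j*(phi - prefs))) / 2.0      # circular difference
    return np.exp(-0.5*(d/tuning_width)**2)

true_phi = 0.8
r = tuning(true_phi) + rng.normal(0.0, noise_sd, size=len(prefs))

# log-likelihood of candidate orientations (constant terms omitted):
phis = np.linspace(0.0, np.pi, 1000, endpoint=False)
logl = np.array([np.sum(-0.5*((r - tuning(phi))/noise_sd)**2) for phi in phis])

phi_mle = phis[np.argmax(logl)]
print(phi_mle)    # close to true_phi
\end{lstlisting}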

p=np.prod(ps,axis=0)
# plot it:
ax = fig.add_subplot(spec[1, 0])
ax.set_xlabel(r'Parameter $\mu$')
ax.set_ylabel('Likelihood')
ax.set_xticks(np.arange(1.6, 2.5, 0.4))
ax.annotate('Maximum',

ax.plot(thetas, p, **lsAm)

ax = fig.add_subplot(spec[1, 1])
ax.set_xlabel(r'Parameter $\mu$')
ax.set_ylabel('Log-Likelihood')
ax.set_ylim(-50,-20)
ax.set_xticks(np.arange(1.6, 2.5, 0.4))

\chapter{Optimization and gradient descent}
\label{gradientdescentchapter}
\exercisechapter{Optimization and gradient descent}

Optimization problems arise in many different contexts. For example,