[likelihood] improved text 1

Jan Benda, 2021-01-09 13:40:35 +01:00
commit 1d0913c600, parent 9bfc39b82c
3 changed files with 150 additions and 108 deletions


@@ -4,111 +4,132 @@
\label{maximumlikelihoodchapter}
\exercisechapter{Maximum likelihood estimation}
The core task of statistics is to infer from measured data some
parameters describing the data. These parameters can be simply a mean,
a standard deviation, or any other parameter needed to describe the
distribution the data originate from, a correlation coefficient, or
some parameters of a function describing a particular dependence
between the data. The brain faces exactly the same problem. Given the
activity pattern of some neurons (the data) it needs to infer some
aspects (parameters) of the environment and the internal state of the
body in order to generate some useful behavior. One common approach
for estimating parameters from data is provided by \enterm[maximum
likelihood estimator]{maximum likelihood estimators}
(\enterm[mle|see{maximum likelihood estimator}]{mle},
\determ{Maximum-Likelihood-Sch\"atzer}). They choose the parameters
such that they maximize the likelihood that the specific data values
originate from a specific distribution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Maximum likelihood}
Let $p(x|\theta)$ (to be read as ``probability (density) of $x$ given
$\theta$'') be the probability (density) distribution of data value
$x$ given parameter values $\theta$. This could be the normal
distribution
\begin{equation}
\label{normpdfmean}
p(x|\mu, \sigma) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\end{equation}
defined by the mean $\mu$ and the standard deviation $\sigma$ as
parameters $\theta$. If the $n$ observations $x_1, x_2, \ldots, x_n$
are independent of each other and originate from the same probability
density distribution (they are \enterm[i.i.d.|see{independent and
identically distributed}]{i.i.d.}, \enterm{independent and identically
distributed}), then the conditional probability $p(x_1,x_2, \ldots,
x_n|\theta)$ of observing the particular data values $x_1, x_2,
\ldots, x_n$ given some specific parameter values $\theta$ of the
probability density is given by the product of the probability
densities of each data value:
\begin{equation}
\label{probdata}
p(x_1,x_2, \ldots, x_n|\theta) = p(x_1|\theta) \cdot p(x_2|\theta)
\cdots p(x_n|\theta) = \prod_{i=1}^n p(x_i|\theta) \; .
\end{equation}
Vice versa, the \entermde{Likelihood}{likelihood} of the parameters
$\theta$ given the observed data $x_1, x_2, \ldots, x_n$ is
\begin{equation}
\label{likelihood}
{\cal L}(\theta|x_1,x_2, \ldots, x_n) = p(x_1,x_2, \ldots, x_n|\theta) \; .
\end{equation}
Note that the likelihood ${\cal L}$ is not a probability in the
classic sense since it does not integrate to unity ($\int {\cal
L}(\theta|x_1,x_2, \ldots, x_n) \, d\theta \ne 1$). For given
observations $x_1, x_2, \ldots, x_n$, the likelihood
\eqnref{likelihood} is a function of the parameters $\theta$. This
function has a global maximum for some specific parameter values. At
this maximum the probability \eqnref{probdata} of observing the
measured data values is largest.

Maximum likelihood estimators just find the parameter values
\begin{equation}
\theta_{mle} = \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n)
\end{equation}
that maximize the likelihood \eqnref{likelihood}.
$\text{argmax}_x f(x)$ is the value of the argument $x$ for which the
function $f(x)$ assumes its global maximum. Thus, we search for the
parameter values $\theta$ at which the likelihood ${\cal L}(\theta)$
reaches its maximum. For these parameter values the measured data most
likely originated from the corresponding distribution.
The position of a function's maximum does not change when the values
of the function are transformed by a strictly monotonically increasing
function such as the logarithm. For numerical reasons, and for reasons
that we discuss below, we instead search for the maximum of the
logarithm of the likelihood
(\entermde[likelihood!log-]{Likelihood!Log-}{log-likelihood}):
\begin{eqnarray}
\theta_{mle} & = & \text{argmax}_{\theta}\; {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log {\cal L}(\theta|x_1,x_2, \ldots, x_n) \nonumber \\
& = & \text{argmax}_{\theta}\; \log \prod_{i=1}^n p(x_i|\theta) \nonumber \\
& = & \text{argmax}_{\theta}\; \sum_{i=1}^n \log p(x_i|\theta) \label{loglikelihood}
\end{eqnarray}
which is the sum of the logarithms of the probability densities of
each observation. Let's illustrate the concept of maximum likelihood
estimation with the example of the arithmetic mean.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Arithmetic mean}
Suppose that the measurements $x_1, x_2, \ldots, x_n$ originate from a
normal distribution \eqnref{normpdfmean} and we do not know the
population mean $\mu$ of the normal distribution
(\figrefb{mlemeanfig}). In this setting $\mu$ is the only parameter
$\theta$. Which value of $\mu$ maximizes the likelihood of the data?
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlemean}
  \titlecaption{\label{mlemeanfig} Maximum likelihood estimation of
    the mean.}{Top: The measured data (blue dots) together with three
    normal distributions differing in their means (arrows) from which
    the data could have originated. Bottom left: the likelihood as a
    function of the parameter $\mu$. For the data shown it is maximal
    at a value of $\mu = 2$. Bottom right: the log-likelihood. Taking
    the logarithm does not change the position of the maximum.}
\end{figure}
Inserting the normal distribution \eqnref{normpdfmean} and applying
logarithmic identities, the log-likelihood \eqnref{loglikelihood}
reads
\begin{eqnarray}
\log {\cal L}(\mu|x_1,x_2, \ldots, x_n)
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma^2}}e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \nonumber \\
& = & \sum_{i=1}^n - \log \sqrt{2\pi \sigma^2} -\frac{(x_i-\mu)^2}{2\sigma^2} \; .
\end{eqnarray}
Since the logarithm is the inverse function of the exponential
($\log(e^x)=x$), taking the logarithm removes the exponential from the
normal distribution. This is the second reason why it is useful to
maximize the log-likelihood. To calculate the maximum of the
log-likelihood, we need to take the derivative with respect to $\mu$
and set it to zero:
\begin{eqnarray}
\frac{\text{d}}{\text{d}\mu} \log {\cal L}(\mu|x_1,x_2, \ldots, x_n) & = & \sum_{i=1}^n - \frac{\text{d}}{\text{d}\mu} \log \sqrt{2\pi \sigma^2} - \frac{\text{d}}{\text{d}\mu} \frac{(x_i-\mu)^2}{2\sigma^2} \;\; = \;\; 0 \nonumber \\
\Leftrightarrow \quad \sum_{i=1}^n \frac{2(x_i-\mu)}{2\sigma^2} & = & 0 \nonumber \\
\Leftrightarrow \quad \sum_{i=1}^n x_i - \sum_{i=1}^n \mu & = & 0 \nonumber \\
\Leftrightarrow \quad n \mu & = & \sum_{i=1}^n x_i \nonumber \\
\Leftrightarrow \quad \mu & = & \frac{1}{n} \sum_{i=1}^n x_i \;\; = \;\; \bar x
\end{eqnarray}
Thus, the maximum likelihood estimator of the population mean of
normally distributed data is the arithmetic mean. That is, the
arithmetic mean maximizes the likelihood that the data originate from
a normal distribution centered at the arithmetic mean
(\figref{mlemeanfig}). Equivalently, the standard deviation computed
from the data maximizes the likelihood that the data were generated
from a normal distribution with this standard deviation.
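This result is easy to check numerically. The following Python sketch
(the data, the seed, and the grid of candidate means are illustrative
assumptions) evaluates the log-likelihood on a grid of $\mu$ values
and confirms that the position of its maximum agrees with the
arithmetic mean:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=1)
sigma = 0.5
x = rng.normal(2.0, sigma, size=50)        # simulated observations

mus = np.linspace(1.0, 3.0, 1000)          # candidate values for the mean
# log-likelihood summed over all data values, for each candidate mu:
logl = np.array([np.sum(-0.5*np.log(2.0*np.pi*sigma**2)
                        - (x - mu)**2/(2.0*sigma**2)) for mu in mus])

print(mus[np.argmax(logl)])                # position of the maximum ...
print(np.mean(x))                          # ... agrees with the arithmetic mean
\end{verbatim}
Up to the resolution of the grid, both printed values coincide.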
@@ -123,6 +144,17 @@ from a normal distribution with this standard deviation.
the maxima with the mean calculated from the data.
\end{exercise}

Comparing the values of the likelihood with those of the
log-likelihood shown in \figref{mlemeanfig} reveals the numerical
reason for taking the logarithm of the likelihood. The likelihood
values can get very small, because we multiply many, potentially
small, probability densities with each other. The likelihood quickly
gets smaller than the smallest positive number a computer's
floating-point representation can hold. Try it by increasing the
number of data values in the exercise. Taking the logarithm avoids
this problem. The log-likelihood takes on well-behaved values that the
computer can handle without difficulty.
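A short sketch of this underflow (assuming NumPy and SciPy; the sample
size is an arbitrary choice) shows the product of the densities
collapsing to zero while the sum of their logarithms stays a perfectly
ordinary number:
\begin{verbatim}
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(seed=2)
x = rng.normal(2.0, 1.0, size=5000)   # many observations

p = norm.pdf(x, loc=2.0, scale=1.0)   # density of each single observation
print(np.prod(p))                     # product underflows to 0.0
print(np.sum(np.log(p)))              # log-likelihood stays finite
\end{verbatim}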
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Fitting probability distributions}
@@ -130,32 +162,39 @@ Consider normally distributed data with unknown mean and standard
deviation. From the considerations above we have just seen that a
Gaussian distribution with mean at the arithmetic mean and standard
deviation equal to the standard deviation computed from the data is
the Gaussian that fits the data best in a maximum likelihood sense,
i.e. the likelihood that the data were generated from this
distribution is the largest. Fitting a Gaussian distribution to data
is very simple: just compute the two parameters of the Gaussian
distribution, $\mu$ and $\sigma$, as the arithmetic mean and the
standard deviation, respectively, directly from the data.
For non-Gaussian distributions, for example a
\entermde[distribution!Gamma-]{Verteilung!Gamma-}{Gamma-distribution}
\begin{equation}
\label{gammapdf}
p(x|\alpha,\beta) \propto x^{\alpha-1}e^{-\beta x} \; ,
\end{equation}
such simple analytical expressions for the parameters of the
distribution do not exist. This is the case, for example, for the
shape parameter $\alpha$ of the Gamma-distribution. How do we fit such
a distribution to some data? That is, how should we compute the
values of the parameters of the distribution, given the data?

A first guess could be to fit the probability density function by
minimizing the squared difference to a histogram of the measured data,
in the same way as we fit a function to some data. For several
reasons this is, however, not the method of choice: (i) Probability
densities can only be positive, which leads, in particular for small
values, to asymmetric distributions of the estimated histogram around
the true density. (ii) The values of a histogram estimating the
density are not independent, because the integral over a density is
unity. The two basic assumptions of normally distributed and
independent samples, which are a prerequisite for turning the
minimization of the squared difference into a maximum likelihood
estimation (see next section), are violated. (iii) The estimation of
the probability density by means of a histogram strongly depends on
the chosen bin size (\figref{mlepdffig}).
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepdf}
@@ -173,11 +212,10 @@ Instead we should stay with maximum-likelihood estimation. Exactly in
the same way as we estimated the mean value of a Gaussian distribution
above, we can numerically fit the parameters of any type of
distribution directly from the data by maximizing the likelihood. We
simply search for the parameter values of the desired probability
density function that maximize the log-likelihood. In general this is
a non-linear optimization problem that is solved with numerical
methods such as gradient descent \matlabfun{mle()}.
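As an illustration of such a numerical fit, the following Python
sketch (assuming SciPy; the simulated data and the starting values are
arbitrary choices, and a Nelder-Mead simplex search is used here
instead of gradient descent) minimizes the negative log-likelihood of
a Gamma-distribution:
\begin{verbatim}
import numpy as np
from scipy.stats import gamma
from scipy.optimize import minimize

rng = np.random.default_rng(seed=3)
data = rng.gamma(shape=2.0, scale=1.5, size=500)  # simulated data

def neg_logl(params):
    # negative log-likelihood of the data for given shape and scale:
    a, scale = params
    if a <= 0.0 or scale <= 0.0:
        return np.inf               # keep the search in the valid range
    return -np.sum(gamma.logpdf(data, a, loc=0.0, scale=scale))

res = minimize(neg_logl, x0=[1.0, 1.0], method='Nelder-Mead')
print(res.x)                        # estimated shape and scale
print(gamma.fit(data, floc=0.0))    # SciPy's built-in ML fit for comparison
\end{verbatim}
Both estimates should be close to the shape and scale used to simulate
the data.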
\begin{exercise}{mlegammafit.m}{mlegammafit.out}
Generate a sample of gamma-distributed random numbers and apply the
@@ -191,12 +229,16 @@ that is solved with numerical methods such as the gradient descent
When fitting a function of the form $f(x;\theta)$ to data pairs
$(x_i|y_i)$ one tries to adapt the parameter $\theta$ such that the
function best describes the data. In
chapter~\ref{gradientdescentchapter} we simply assumed that ``best''
means minimizing the squared distance between the data and the
function. With maximum likelihood we search for the parameter value
$\theta$ for which the likelihood that the data were drawn from the
corresponding function is maximal.

If we assume that the $y_i$ values are normally distributed around the
function values $f(x_i;\theta)$ with a standard deviation $\sigma_i$,
the log-likelihood is
\begin{eqnarray}
\log {\cal L}(\theta|(x_1,y_1,\sigma_1), \ldots, (x_n,y_n,\sigma_n))
& = & \sum_{i=1}^n \log \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{-\frac{(y_i-f(x_i;\theta))^2}{2\sigma_i^2}} \nonumber \\
@@ -218,18 +260,17 @@ the position of the minimum:
\theta_{mle} = \text{argmin}_{\theta} \; \sum_{i=1}^n \left( \frac{y_i-f(x_i;\theta)}{\sigma_i} \right)^2 \;\; = \;\; \text{argmin}_{\theta} \; \chi^2
\end{equation}
The sum of the squared differences normalized by the standard
deviation is also called $\chi^2$ (chi squared). The parameter
$\theta$ that minimizes the squared differences is thus the one that
maximizes the likelihood that the data actually originate from the
given function. Therefore, minimizing $\chi^2$ is a maximum likelihood
estimation.

From the mathematical considerations above we can see that the
minimization of the squared difference is a maximum-likelihood
estimation only if the data are normally distributed around the
function. In case of other distributions, the log-likelihood
\eqnref{loglikelihood} needs to be adapted accordingly.
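This equivalence is easy to verify numerically. The sketch below
(assuming NumPy; the proportionality $f(x;\theta)=\theta x$, the
uniform $\sigma$, and the simulated data are illustrative assumptions)
evaluates both $\chi^2$ and the log-likelihood on a grid of candidate
slopes; the minimum of the one and the maximum of the other fall on
the same parameter value:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(seed=4)
x = np.linspace(0.0, 5.0, 40)
sigma = 0.4                          # same uncertainty for every data point
y = 1.8*x + rng.normal(0.0, sigma, size=len(x))  # data around theta*x

thetas = np.linspace(1.0, 3.0, 1000)             # candidate slopes
chi2 = np.array([np.sum(((y - t*x)/sigma)**2) for t in thetas])
logl = np.array([np.sum(-0.5*np.log(2.0*np.pi*sigma**2)
                        - (y - t*x)**2/(2.0*sigma**2)) for t in thetas])

print(thetas[np.argmin(chi2)])       # minimum of chi squared ...
print(thetas[np.argmax(logl)])       # ... equals maximum of the log-likelihood
\end{verbatim}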
\begin{figure}[t]
  \includegraphics[width=1\textwidth]{mlepropline}
@@ -377,7 +418,7 @@ orientation $\phi$ of an edge is given by
The log-likelihood of the edge orientation $\phi$ given the
activity pattern in the population $r_1$, $r_2$, \ldots $r_n$ is thus
\begin{equation}
\log {\cal L}(\phi|r_1, r_2, \ldots, r_n) = \sum_{i=1}^n \log p_i(r_i|\phi)
\end{equation}
The angle $\phi$ that maximizes this log-likelihood is then an
estimate of the orientation of the edge.
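A possible sketch of such a population decoder is shown below. The
Gaussian tuning curves and Poisson spike-count statistics are
hypothetical assumptions made only for this illustration (they are not
necessarily the distributions $p_i(r_i|\phi)$ used in the chapter),
and wrap-around of the orientation is ignored for simplicity:
\begin{verbatim}
import numpy as np
from scipy.stats import poisson

# hypothetical population of orientation-tuned neurons:
phi_pref = np.linspace(0.0, np.pi, 12, endpoint=False)  # preferred orientations

def rates(phi):
    # mean spike counts of all neurons for an edge of orientation phi:
    return 1.0 + 20.0*np.exp(-0.5*((phi - phi_pref)/0.4)**2)

rng = np.random.default_rng(seed=5)
phi_true = 1.2
r = rng.poisson(rates(phi_true))     # observed activity pattern r_1 ... r_n

phis = np.linspace(0.0, np.pi, 720)  # candidate orientations
logl = np.array([np.sum(poisson.logpmf(r, rates(phi))) for phi in phis])
print(phis[np.argmax(logl)])         # estimate close to phi_true
\end{verbatim}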


@@ -52,7 +52,7 @@ for i, theta in enumerate(thetas) :
p=np.prod(ps,axis=0)
# plot it:
ax = fig.add_subplot(spec[1, 0])
ax.set_xlabel(r'Parameter $\mu$')
ax.set_ylabel('Likelihood')
ax.set_xticks(np.arange(1.6, 2.5, 0.4))
ax.annotate('Maximum',
@@ -68,7 +68,7 @@ ax.annotate('',
ax.plot(thetas, p, **lsAm)
ax = fig.add_subplot(spec[1, 1])
ax.set_xlabel(r'Parameter $\mu$')
ax.set_ylabel('Log-Likelihood')
ax.set_ylim(-50,-20)
ax.set_xticks(np.arange(1.6, 2.5, 0.4))


@@ -1,4 +1,5 @@
\chapter{Optimization and gradient descent}
\label{gradientdescentchapter}
\exercisechapter{Optimization and gradient descent}
Optimization problems arise in many different contexts. For example,