\chapter{Optimization and gradient descent}

\exercisechapter{Optimization and gradient descent}

Optimization problems arise in many different contexts. For example,
to understand the behavior of a given neuronal system, the system is
probed with a range of input signals and then the resulting responses
are measured. This input-output relation can be described by a
model. Such a model can be a simple function that maps the input
signals to corresponding responses, it can be a filter, or a system of
differential equations. In any case, the model has some parameters
that specify how input and output signals are related. Which
combination of parameter values is best suited to describe the
input-output relation? The process of finding the best parameter
values is an optimization problem. For a simple parameterized function
that maps input to output values, this is the special case of a
\enterm{curve fitting} problem, where the average distance between the
curve and the response values is minimized. One basic numerical method
used for such optimization problems is the so-called gradient descent,
which is introduced in this chapter.

\begin{figure}[t]
  \includegraphics{cubicfunc}
  \titlecaption{Example data suggesting a cubic relation.}{The length
    $x$ and weight $y$ of $n=34$ male tigers (blue, left). Assuming a
    cubic relation between size and weight leaves us with a single
    free parameter, a scaling factor. The cubic relation is shown for
    a few values of this scaling factor (orange and red,
    right).}\label{cubicdatafig}
\end{figure}

For demonstrating the curve-fitting problem let's take the simple
example of weights and sizes measured for a number of male tigers
(\figref{cubicdatafig}). Weight $y$ is proportional to volume $V$ via
the density $\rho$. The volume $V$ of any object is proportional to
its length $x$ cubed. The factor $\alpha$ relating volume and size
cubed depends on the shape of the object and we do not know this
factor for tigers. For the data set we thus expect a cubic relation
between weight and length
\begin{equation}
  \label{cubicfunc}
  y = f(x; c) = c\cdot x^3
\end{equation}
where $c=\rho\alpha$, the product of a tiger's density and form
factor, is the only free parameter in the relation. We would like to
find out which value of $c$ best describes the measured data. In the
following we use this example to illustrate the gradient descent as a
basic method for finding such an optimal parameter.
\section{The error function --- mean squared error}

Before we go ahead finding the optimal parameter value we need to
specify what exactly we consider as an optimal fit. In our example we
search for the parameter that best describes the relation of $x$ and
$y$. What is meant by this? The length $x_i$ of each tiger is
associated with a weight $y_i$ and for each $x_i$ we have a
\emph{prediction} or \emph{estimation} $y^{est}(x_i)$ of the weight by
the model \eqnref{cubicfunc} for a specific value of the parameter
$c$. Prediction and actual data value ideally match (in a perfect
noise-free world), but in general the estimate and measurement are
separated by some distance or error $y_i - y^{est}(x_i)$. In our
example the estimate of the weight for the length $x_i$ is given by
equation \eqref{cubicfunc}, $y^{est}(x_i) = f(x_i;c)$. The best fitting
model with parameter $c$ is the one that somehow minimizes the
distances between observations $y_i$ and corresponding estimations
$y^{est}(x_i)$ (\figref{cubicerrorsfig}).

As a first guess we could simply minimize the sum of the distances,
$\sum_{i=1}^N y_i - y^{est}(x_i)$. This, however, does not work
because positive and negative errors would cancel out, no matter how
large they are, and sum up to values close to zero. Better is to sum
over the absolute distances: $\sum_{i=1}^N |y_i - y^{est}(x_i)|$. This
sum can only be small if all deviations are indeed small, no matter if
they are above or below the prediction. The sum of the squared
distances, $\sum_{i=1}^N (y_i - y^{est}(x_i))^2$, turns out to be an
even better choice. Instead of the sum we could also minimize the
average squared distance
\begin{equation}
  \label{meansquarederror}
  f_{mse}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}(x_i))^2
\end{equation}
This is known as the \enterm[mean squared error]{mean squared error}
(\determ[quadratischer Fehler!mittlerer]{mittlerer quadratischer
  Fehler}). Like the absolute distance, the square of the errors
is always positive and thus positive and negative error values do not
cancel each other out. In addition, the square weights large
deviations more strongly than small ones. In
chapter~\ref{maximumlikelihoodchapter} we show that minimizing the
mean squared error is equivalent to maximizing the likelihood that the
observations originate from the model, if the data are normally
distributed around the model prediction.

\begin{exercise}{meansquarederrorline.m}{}\label{mseexercise}
  Simulate $n=40$ tigers ranging from 2.2 to 3.9\,m in size and store
  these sizes in a vector \varcode{x}. Compute the corresponding
  predicted weights \varcode{yest} for each tiger according to
  \eqnref{cubicfunc} with $c=6$\,\kilo\gram\per\meter\cubed. From the
  predictions generate simulated measurements of the tigers' weights
  \varcode{y} by adding normally distributed random numbers to the
  predictions, scaled to a standard deviation of 50\,\kilo\gram.

  Compute the \emph{mean squared error} between \varcode{y} and
  \varcode{yest} in a single line of code.
\end{exercise}
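A minimal sketch of such a simulation could look as follows. The
variable names follow the exercise; drawing the sizes from a uniform
distribution is our own choice:
\begin{verbatim}
% simulate n = 40 tiger sizes between 2.2 m and 3.9 m:
n = 40;
x = 2.2 + (3.9 - 2.2) * rand(n, 1);
% predicted weights according to the cubic relation with c = 6 kg/m^3:
c = 6.0;
yest = c * x.^3;
% simulated measurements: predictions plus noise with 50 kg standard deviation:
y = yest + 50.0 * randn(n, 1);
% mean squared error in a single line:
mse = mean((y - yest).^2);
\end{verbatim}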
\section{Cost function}

The mean squared error is a so-called \enterm{objective function} or
\enterm{cost function} (\determ{Kostenfunktion}). A cost function
assigns to a model prediction $\{y^{est}(x_i)\}$ for a given data set
$\{(x_i, y_i)\}$ a single scalar value that we want to minimize. Here
we aim to adapt the model parameter to minimize the mean squared error
\eqref{meansquarederror}. In general, the \enterm{cost function} can
be any function that describes the quality of a fit by mapping the
data and the predictions to a single scalar value.

\begin{figure}[t]
  \includegraphics{cubicerrors}
  \titlecaption{Estimating the \emph{mean squared error}.}{The
    deviation (orange) between the prediction (red line) and the
    observations (blue dots) is calculated for each data point
    (left). Then the deviations are squared and the average is
    calculated (right).}
  \label{cubicerrorsfig}
\end{figure}

Replacing $y^{est}$ in the mean squared error \eqref{meansquarederror}
with our cubic model \eqref{cubicfunc}, the cost function reads
\begin{eqnarray}
  f_{cost}(c|\{(x_i, y_i)\}) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;c))^2 \label{msefunc} \\
  & = & \frac{1}{N} \sum_{i=1}^N (y_i - c x_i^3)^2 \label{msecube}
\end{eqnarray}
The optimization process tries to find a value for the factor $c$ such
that the cost function is minimized. With the mean squared error as
the cost function this optimization process is also called the method
of \enterm{least squares} (\determ[quadratischer
Fehler!kleinster]{Methode der kleinsten Quadrate}).

\begin{exercise}{meanSquaredErrorCubic.m}{}
  Implement the objective function \eqref{msecube} as a function
  \varcode{meanSquaredErrorCubic()}. The function takes three
  arguments. The first is a vector of $x$-values and the second
  contains the measurements $y$ for each value of $x$. The third
  argument is the value of the factor $c$. The function returns the
  mean squared error.
\end{exercise}
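A possible implementation is a direct translation of \eqnref{msecube}
into a single line of code:
\begin{verbatim}
function mse = meanSquaredErrorCubic(x, y, c)
% Mean squared error between the measurements y and the
% predictions of the cubic relation c*x^3 at the positions x.
  mse = mean((y - c * x.^3).^2);
end
\end{verbatim}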
\section{Graph of the cost function}

For each value of the parameter $c$ of the model we can use
\eqnref{msecube} to calculate the corresponding value of the cost
function. The cost function $f_{cost}(c|\{(x_i, y_i)\})$ is a
function $f_{cost}(c)$ that maps the parameter value $c$ to a scalar
error value. For a given data set we thus can simply plot the cost
function as a function of the parameter $c$ (\figref{cubiccostfig}).

\begin{exercise}{plotcubiccosts.m}{}
  Calculate the mean squared error between the data and the cubic
  function for a range of values of the factor $c$ using the
  \varcode{meanSquaredErrorCubic()} function from the previous
  exercise. Plot the graph of the cost function.
\end{exercise}
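The following sketch illustrates one way to do this; the range and
resolution of the $c$ values are arbitrary choices and the data
vectors \varcode{x} and \varcode{y} from above are assumed to exist:
\begin{verbatim}
% cost function for a range of c values:
cs = 2.0:0.1:8.0;
costs = zeros(size(cs));
for k = 1:length(cs)
    costs(k) = meanSquaredErrorCubic(x, y, cs(k));
end
% plot the graph of the cost function:
plot(cs, costs);
xlabel('c');
ylabel('mean squared error');
\end{verbatim}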
\begin{figure}[t]
  \includegraphics{cubiccost}
  \titlecaption{Minimum of the cost function.}{For a given data set
    the cost function, the mean squared error \eqnref{msecube}, as a
    function of the unknown parameter $c$ has a minimum close to the
    true value of $c$ that was used to generate the data
    (left). Simply taking the absolute minimum of the cost function
    computed for a pre-set range of values for the parameter $c$ has
    the disadvantage of limited precision (right) and the risk of
    entirely missing the global minimum if it lies outside the
    computed range.}\label{cubiccostfig}
\end{figure}

By looking at the plot of the cost function we can visually identify
the position of the minimum and thus estimate the optimal value for
the parameter $c$. How can we use the cost function to guide an
automatic optimization process?

The obvious approach would be to calculate the mean squared error for
a range of parameter values and then find the position of the minimum
using the \code{min()} function. This approach, however, has several
disadvantages: (i) The accuracy of the estimation of the best
parameter is limited by the resolution used to sample the parameter
space. The coarser the parameter space is sampled, the less precise is
the obtained position of the minimum (\figref{cubiccostfig}, right). (ii)
The range of parameter values might not include the absolute minimum.
(iii) In particular for functions with more than a single free
parameter it is computationally expensive to calculate the cost
function for each parameter combination at a sufficient
resolution. The number of combinations increases exponentially with
the number of free parameters. This is known as the \enterm{curse of
  dimensionality}.

So we need a different approach. We want a procedure that finds the
minimum of the cost function with a minimal number of computations and
to arbitrary precision.

\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
  \includegraphics[width=0.33\textwidth]{derivative}
  \hfill
  \begin{minipage}[b]{0.63\textwidth}
    The difference quotient
    \begin{equation}
      \label{difffrac}
      m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
    of a function $y = f(x)$ is the slope of the secant (red) defined
    by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ at
    distance $\Delta x$.

    The slope of the function $y=f(x)$ at the position $x$ (orange) is
    given by the derivative $f'(x)$ of the function at that position.
    It is defined by the difference quotient in the limit of
    infinitesimally (red and yellow) small distances $\Delta x$:
    \begin{equation}
      \label{derivative}
      f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
  \end{minipage}\vspace{2ex}

  It is not possible to calculate the exact value of the derivative
  \eqref{derivative} numerically. The derivative can only be estimated
  by computing the difference quotient \eqref{difffrac} using
  sufficiently small $\Delta x$.
\end{ibox}
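As a quick numerical illustration of the difference quotient
(Box~\ref{differentialquotientbox}); the function, the position, and
the step size are arbitrary choices:
\begin{verbatim}
% approximate the derivative of f(x) = x^3 at x0 = 2:
f = @(x) x.^3;
x0 = 2.0;
dx = 1e-4;
dfdx = (f(x0 + dx) - f(x0)) / dx;  % about 12.0006, close to 3*x0^2 = 12
\end{verbatim}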
\section{Gradient}

Imagine placing a ball at some point on the cost function
(\figref{cubiccostfig}). Naturally, it would roll down the slope and
eventually stop at the minimum of the error surface (if it had no
inertia). We will use this analogy to develop an algorithm to find our
way to the minimum of the cost function. The ball always follows the
steepest slope. Thus we need to figure out the direction of the slope
at the position of the ball.

In our one-dimensional example of a single free parameter the slope is
simply the derivative of the cost function with respect to the
parameter $c$. This derivative is called the
\entermde{Gradient}{gradient} of the cost function:
\begin{equation}
  \label{costderivative}
  \nabla f_{cost}(c) = \frac{{\rm d} f_{cost}(c)}{{\rm d} c}
\end{equation}
There is no need to calculate this derivative analytically, because it
can be approximated numerically by the difference quotient
(Box~\ref{differentialquotientbox}) for small steps $\Delta c$:
\begin{equation}
  \frac{{\rm d} f_{cost}(c)}{{\rm d} c} =
  \lim\limits_{\Delta c \to 0} \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c}
  \approx \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c}
\end{equation}
The derivative is positive for positive slopes. Since we want to go
down the hill, we choose the opposite direction.
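For the cubic model the derivative can, for comparison, also be worked
out analytically by applying the chain rule to \eqnref{msecube}:
\[ \frac{{\rm d} f_{cost}(c)}{{\rm d} c} = \frac{1}{N} \sum_{i=1}^N 2 \left(y_i - c x_i^3\right)\left(-x_i^3\right) = -\frac{2}{N} \sum_{i=1}^N x_i^3 \left(y_i - c x_i^3\right) \]
Such an expression is handy for checking a numerical estimate of the
gradient, but in the following we stick with the difference quotient.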
\begin{exercise}{meanSquaredGradientCubic.m}{}
  Implement a function \varcode{meanSquaredGradientCubic()} that
  takes the $x$- and $y$-data and the parameter $c$ as input
  arguments. The function should return the derivative of the mean
  squared error $f_{cost}(c)$ with respect to $c$ at the position $c$.
\end{exercise}
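One way to implement this uses the difference quotient with a small
step in $c$; the step size is an arbitrary but sufficiently small
choice:
\begin{verbatim}
function dmsedc = meanSquaredGradientCubic(x, y, c)
% Estimate the derivative of the mean squared error with respect
% to the parameter c by means of the difference quotient.
  dc = 1e-5;    % small step in the parameter c
  mse = meanSquaredErrorCubic(x, y, c);
  msestep = meanSquaredErrorCubic(x, y, c + dc);
  dmsedc = (msestep - mse) / dc;
end
\end{verbatim}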
\begin{exercise}{plotcubicgradient.m}{}
  Using the \varcode{meanSquaredGradientCubic()} function from the
  previous exercise, plot the derivative of the cost function as a
  function of $c$.
\end{exercise}
\section{Gradient descent}

Finally, we are able to implement the optimization itself. By now it
should be obvious why it is called the gradient descent method. All
ingredients are already there. We need: (i) the cost function
(\varcode{meanSquaredErrorCubic()}), and (ii) the gradient
(\varcode{meanSquaredGradientCubic()}). The gradient descent algorithm
works as follows (a minimal sketch of the resulting loop is shown
after the list):
\begin{enumerate}
\item Start with some arbitrary value $p_0$ for the parameter $c$.
\item \label{computegradient} Compute the gradient
  \eqnref{costderivative} at the current position $p_i$.
\item If the length of the gradient, the absolute value of the
  derivative \eqnref{costderivative}, is smaller than some threshold
  value, we assume to have reached the minimum and stop the search.
  We return the current parameter value $p_i$ as our best estimate of
  the parameter $c$ that minimizes the cost function.

  We are actually looking for the point at which the derivative of the
  cost function equals zero, but this cannot be found exactly because
  of numerical imprecision. We thus apply a threshold below which we
  are sufficiently close to zero.
\item \label{gradientstep} If the length of the gradient exceeds the
  threshold we take a small step into the opposite direction:
  \begin{equation}
    p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(p_i)
  \end{equation}
  where $\epsilon = 0.01$ is a factor linking the gradient to
  appropriate steps in the parameter space.
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
\end{enumerate}
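A minimal sketch of this loop for the cubic fit could look like the
following. The starting value, the threshold, and a smaller step size
$\epsilon$ (chosen here so that the iteration converges for the
simulated tiger data \varcode{x} and \varcode{y} from above) are our
own choices:
\begin{verbatim}
% gradient descent for the parameter c of the cubic fit:
p = 2.0;            % arbitrary starting value for the parameter c
epsilon = 0.0001;   % factor linking gradient to parameter steps
thresh = 0.1;       % threshold on the length of the gradient
gradient = meanSquaredGradientCubic(x, y, p);
while abs(gradient) > thresh
    p = p - epsilon * gradient;   % step down the slope
    gradient = meanSquaredGradientCubic(x, y, p);
end
cest = p;           % best estimate of the parameter c
\end{verbatim}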
\Figref{gradientdescentfig} illustrates the gradient descent --- the
path the imaginary ball has chosen to reach the minimum. Starting at
an arbitrary position we change the position as long as the gradient
at that position is larger than a certain threshold. If the slope is
very steep, the change in the position (the distance between the red
dots in \figref{gradientdescentfig}) is large.

\begin{figure}[t]
  \includegraphics{cubicmse}
  \titlecaption{Gradient descent.}{The algorithm starts at an
    arbitrary position. At each point the gradient is estimated and
    the position is updated as long as the length of the gradient is
    sufficiently large. The dots show the positions after each
    iteration of the algorithm.} \label{gradientdescentfig}
\end{figure}

\begin{exercise}{gradientDescent.m}{}
  Implement the gradient descent for the problem of fitting a straight
  line to some measured data. Reuse the data generated in
  exercise~\ref{errorsurfaceexercise}.
  \begin{enumerate}
  \item Store the error value for each iteration.
  \item Plot the error values as a function of the iterations, the
    number of optimization steps.
  \item Plot the measured data together with the best fitting straight line.
  \end{enumerate}\vspace{-4.5ex}
\end{exercise}
\begin{ibox}[tp]{\label{partialderivativebox}Partial derivative and gradient}
  Some functions depend on more than a single variable:
  \[ z = f(x,y) \]
  for example depends on $x$ and $y$. Using the partial derivatives
  \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
  and
  \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
  one can estimate the slope in the direction of each variable
  individually by using the respective difference quotient
  (Box~\ref{differentialquotientbox}). \vspace{1ex}

  \begin{minipage}[t]{0.5\textwidth}
    \mbox{}\\[-2ex]
    \includegraphics[width=1\textwidth]{gradient}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.46\textwidth}
    For example, the partial derivatives of
    \[ f(x,y) = x^2+y^2 \] are
    \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]

    The gradient is a vector that is constructed from the partial derivatives:
    \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
    This vector points into the direction of the steepest ascent of
    $f(x,y)$.
  \end{minipage}

  \vspace{0.5ex} The figure shows the contour lines of a bi-variate
  Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ and the gradient (thick
  arrows) and the corresponding two partial derivatives (thin arrows)
  for three different locations.
\end{ibox}

The \entermde{Gradient}{gradient} (Box~\ref{partialderivativebox}) of the
objective function is the vector
\begin{equation}
  \label{gradient}
  \nabla f_{cost}(m,b) = \left( \frac{\partial f(m,b)}{\partial m},
    \frac{\partial f(m,b)}{\partial b} \right)
\end{equation}
that points in the direction of the steepest ascent of the objective
function. The gradient is given by the partial derivatives
(Box~\ref{partialderivativebox}) of the mean squared error with
respect to the parameters $m$ and $b$ of the straight line.
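For the straight line $y = m x + b$ both components of this gradient
can again be estimated with difference quotients. The following sketch
illustrates this; the function name and the step size are our own
choices, and the data vectors \varcode{x} and \varcode{y} are assumed
to be given:
\begin{verbatim}
function gradmse = meanSquaredGradientLine(x, y, m, b)
% Numerical gradient of the mean squared error of the straight
% line y = m*x + b with respect to the parameters m and b.
  dp = 1e-5;   % small step for the difference quotients
  mse = mean((y - (m * x + b)).^2);
  msem = mean((y - ((m + dp) * x + b)).^2);
  mseb = mean((y - (m * x + b + dp)).^2);
  gradmse = [(msem - mse) / dp, (mseb - mse) / dp];
end
\end{verbatim}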
\section{Summary}

The gradient descent is an important numerical method for solving
optimization problems. It is used to find the minimum of an objective
function.

Curve fitting is a common application for the gradient descent method.
For the case of fitting straight lines to data pairs, the error
surface (using the mean squared error) has exactly one clearly defined
global minimum. In fact, the position of the minimum can be
analytically calculated as shown in the next chapter.

Problems that involve nonlinear computations on parameters, e.g. the
rate $\lambda$ in an exponential function $f(x;\lambda) = e^{\lambda
  x}$, do not have an analytical solution for the least squares. To
find the least squares for such functions numerical methods such as
the gradient descent have to be applied.

The suggested gradient descent algorithm can be improved in multiple
ways to converge faster. For example, one could adapt the step size to
the length of the gradient. These numerical tricks have already been
implemented in pre-defined functions. Generic optimization functions
such as \matlabfun{fminsearch()} have been implemented for arbitrary
objective functions, while the more specialized function
\matlabfun{lsqcurvefit()} is specifically designed for optimizations in
the least squares sense.
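For example, the cubic fit of this chapter could be handed over to
\matlabfun{fminsearch()} with an anonymous cost function. This is only
a sketch, assuming the data vectors \varcode{x} and \varcode{y} from
above and an arbitrary starting value for $c$:
\begin{verbatim}
% minimize the mean squared error of the cubic fit with fminsearch():
fcost = @(c) mean((y - c * x.^3).^2);  % anonymous cost function
cest = fminsearch(fcost, 1.0);         % start the search at c = 1
\end{verbatim}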
%\newpage
\begin{important}[Beware of secondary minima!]
  Finding the absolute minimum is not always as easy as in the case of
  fitting a straight line. Often, the error surface has secondary or
  local minima in which the gradient descent stops even though there
  is a more optimal solution, a global minimum that is lower than the
  local minimum. Starting from well-chosen initial positions is a good
  approach to avoid getting stuck in local minima. Also keep in mind
  that error surfaces tend to be simpler (fewer local minima) the fewer
  parameters are fitted from the data. Each additional parameter
  increases complexity and is computationally more expensive.
\end{important}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions