\chapter{Optimization and gradient descent}

\exercisechapter{Optimization and gradient descent}

Optimization problems arise in many different contexts. For example,
to understand the behavior of a given neuronal system, the system is
probed with a range of input signals and then the resulting responses
are measured. This input-output relation can be described by a
model. Such a model can be a simple function that maps the input
signals to corresponding responses, it can be a filter, or a system of
differential equations. In any case, the model has some parameters
that specify how input and output signals are related. Which
combination of parameter values is best suited to describe the
input-output relation? The process of finding the best parameter
values is an optimization problem. For a simple parameterized function
that maps input to output values, this is the special case of a
\enterm{curve fitting} problem, where the average distance between the
curve and the response values is minimized. One basic numerical method
used for such optimization problems is the so-called gradient descent,
which is introduced in this chapter.

\begin{figure}[t]
  \includegraphics{cubicfunc}
  \titlecaption{Example data suggesting a cubic relation.}{The length
    $x$ and weight $y$ of $n=34$ male tigers (blue, left). Assuming a
    cubic relation between size and weight leaves us with a single
    free parameter, a scaling factor. The cubic relation is shown for
    a few values of this scaling factor (orange and red,
    right).}\label{cubicdatafig}
\end{figure}

To demonstrate the curve-fitting problem let's take the simple example
of weights and sizes measured for a number of male tigers
(\figref{cubicdatafig}). Weight $y$ is proportional to volume $V$ via
the density $\rho$. The volume $V$ of any object is proportional to
its length $x$ cubed. The factor $\alpha$ relating volume and length
cubed depends on the shape of the object and we do not know this
factor for tigers. For the data set we thus expect a cubic relation
between weight and length
\begin{equation}
  \label{cubicfunc}
  y = f(x; c) = c\cdot x^3
\end{equation}
where $c=\rho\alpha$, the product of a tiger's density and form
factor, is the only free parameter in the relation. We would like to
find out which value of $c$ best describes the measured data. In the
following we use this example to illustrate the gradient descent as a
basic method for finding such an optimal parameter.

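As a quick illustration (a sketch only, not one of the chapter's
exercises), the cubic relation \eqnref{cubicfunc} can be evaluated in
MATLAB for a few hypothetical values of the factor $c$, similar to the
curves shown in \figref{cubicdatafig}:
\begin{verbatim}
% hypothetical tiger lengths in meters:
x = 2.0:0.1:4.0;
% plot the cubic relation for a few hypothetical scaling factors c:
for c = [2.0 4.0 6.0]      % in kg/m^3
    y = c * x.^3;          % predicted weights in kg
    plot(x, y);
    hold on;
end
hold off;
xlabel('length x / m');
ylabel('weight y / kg');
\end{verbatim}
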
\section{The error function --- mean squared error}

Before we go ahead and find the optimal parameter value we need to
specify what exactly we consider as an optimal fit. In our example we
search for the parameter that best describes the relation of $x$ and
$y$. What is meant by this? The length $x_i$ of each tiger is
associated with a weight $y_i$ and for each $x_i$ we have a
\emph{prediction} or \emph{estimation} $y^{est}(x_i)$ of the weight by
the model \eqnref{cubicfunc} for a specific value of the parameter
$c$. Prediction and actual data value ideally match (in a perfect
noise-free world), but in general the estimate and measurement are
separated by some distance or error $y_i - y^{est}(x_i)$. In our
example the estimate of the weight for the length $x_i$ is given by
equation \eqref{cubicfunc}, $y^{est}(x_i) = f(x_i;c)$. The best fitting
model with parameter $c$ is the one that somehow minimizes the
distances between observations $y_i$ and corresponding estimations
$y^{est}(x_i)$ (\figref{cubicerrorsfig}).

As a first guess we could simply minimize the sum of the distances,
$\sum_{i=1}^N (y_i - y^{est}(x_i))$. This, however, does not work
because positive and negative errors would cancel out, no matter how
large they are, and sum up to values close to zero. Better is to sum
over the absolute distances: $\sum_{i=1}^N |y_i - y^{est}(x_i)|$. This
sum can only be small if all deviations are indeed small, no matter
whether they are above or below the prediction. The sum of the squared
distances, $\sum_{i=1}^N (y_i - y^{est}(x_i))^2$, turns out to be an
even better choice. Instead of the sum we could also minimize the
average squared distance
\begin{equation}
  \label{meansquarederror}
  f_{mse}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}(x_i))^2 \; .
\end{equation}
This is known as the \enterm[mean squared error]{mean squared error}
(\determ[quadratischer Fehler!mittlerer]{mittlerer quadratischer
  Fehler}). Similar to the absolute distance, the square of the errors
is always positive and thus positive and negative error values do not
cancel each other out. In addition, the square punishes large
deviations more strongly than small ones. In
chapter~\ref{maximumlikelihoodchapter} we show that minimizing the
mean squared error is equivalent to maximizing the likelihood that the
observations originate from the model, if the data are normally
distributed around the model prediction.

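In MATLAB the mean squared error \eqref{meansquarederror} of measured
values and model predictions, stored in two vectors of equal length,
boils down to a single expression. A minimal sketch, assuming vectors
\varcode{y} and \varcode{yest}:
\begin{verbatim}
% mean squared error between measurements y and predictions yest:
mse = mean((y - yest).^2);
\end{verbatim}
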
\begin{exercise}{meansquarederrorline.m}{}\label{mseexercise}
  Simulate $n=40$ tigers ranging from 2.2 to 3.9\,m in size and store
  these sizes in a vector \varcode{x}. Compute the corresponding
  predicted weights \varcode{yest} for each tiger according to
  \eqnref{cubicfunc} with $c=6$\,\kilo\gram\per\meter\cubed. From the
  predictions generate simulated measurements of the tigers' weights
  \varcode{y}, by adding normally distributed random numbers, scaled
  to a standard deviation of 50\,\kilo\gram, to the predictions.

  Compute the \emph{mean squared error} between \varcode{y} and
  \varcode{yest} in a single line of code.
\end{exercise}

\section{Objective function}

The mean squared error is a so-called \enterm{objective function} or
\enterm{cost function} (\determ{Kostenfunktion}). A cost function
assigns to a model prediction $\{y^{est}(x_i)\}$ for a given data set
$\{(x_i, y_i)\}$ a single scalar value that we want to minimize. Here
we aim to adapt the model parameter to minimize the mean squared error
\eqref{meansquarederror}. In general, the \enterm{cost function} can
be any function that describes the quality of a fit by mapping the
data and the predictions to a single scalar value.

\begin{figure}[t]
  \includegraphics{cubicerrors}
  \titlecaption{Estimating the \emph{mean squared error}.}{The
    deviations (orange) between the prediction (red line) and the
    observations (blue dots) are calculated for each data point
    (left). Then the deviations are squared and the average is
    calculated (right).}
  \label{cubicerrorsfig}
\end{figure}

Replacing $y^{est}$ in the mean squared error \eqref{meansquarederror}
with our cubic model \eqref{cubicfunc}, the cost function reads
\begin{eqnarray}
  f_{cost}(c|\{(x_i, y_i)\}) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;c))^2 \label{msefunc} \\
  & = & \frac{1}{N} \sum_{i=1}^N (y_i - c x_i^3)^2 \label{msecube}
\end{eqnarray}
The optimization process tries to find a value for the factor $c$ such
that the cost function is minimized. With the mean squared error as
the cost function this optimization process is also called the method
of \enterm{least squares} (\determ[quadratischer
  Fehler!kleinster]{Methode der kleinsten Quadrate}).

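As a minimal sketch (assuming the measured lengths and weights are
stored in vectors \varcode{x} and \varcode{y}), the cost function
\eqnref{msecube} can be written as an anonymous function of the
parameter $c$ and evaluated for a range of candidate values:
\begin{verbatim}
% cost function of the cubic fit as an anonymous function of c,
% assuming vectors x (lengths) and y (measured weights):
fcost = @(c) mean((y - c*x.^3).^2);
% evaluate it for a range of candidate values of c:
cs = 0.5:0.5:10.0;
errors = zeros(size(cs));
for k = 1:length(cs)
    errors(k) = fcost(cs(k));
end
plot(cs, errors);
xlabel('factor c');
ylabel('mean squared error');
\end{verbatim}
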
\begin{exercise}{meanSquaredErrorCubic.m}{}
  Implement the objective function \eqref{msecube} as a function
  \varcode{meanSquaredErrorCubic()}. The function takes three
  arguments. The first is a vector of $x$-values and the second
  contains the measurements $y$ for each value of $x$. The third
  argument is the value of the factor $c$. The function returns the
  mean squared error.
\end{exercise}

\section{Error surface}

So far, our cubic model has only a single free parameter, the factor
$c$. The error surface and the gradient are easier to illustrate with
a model that has two free parameters. We therefore switch to the
problem of fitting a straight line $y = f(x; m, b) = m\cdot x + b$
with slope $m$ and intercept $b$ to some data pairs $(x_i, y_i)$. For
each combination of the two parameters $m$ and $b$ we can compute the
mean squared error \eqref{meansquarederror} between the data and the
corresponding straight line. The resulting cost function
$f_{cost}(m,b|\{(x_i, y_i)\})$ is a function $f_{cost}(m,b)$ that maps
the parameter values $m$ and $b$ to a scalar error value. The error
values describe a landscape over the $m$-$b$ plane, the error surface,
that can be illustrated graphically using a 3-d surface plot. $m$ and
$b$ are plotted on the $x$- and $y$-axis while the third dimension
indicates the error value (\figref{errorsurfacefig}).

\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{error_surface}
  \titlecaption{Error surface.}{The two model parameters $m$ and $b$
    define the base area of the surface plot. For each parameter
    combination of slope and intercept the error is calculated. The
    resulting surface has a minimum which indicates the parameter
    combination that best fits the data.}\label{errorsurfacefig}
\end{figure}

\begin{exercise}{errorSurface.m}{}\label{errorsurfaceexercise}
  Generate 20 data pairs $(x_i|y_i)$ that are linearly related with
  slope $m=0.75$ and intercept $b=-40$, using \varcode{rand()} for
  drawing $x$ values between 0 and 120 and \varcode{randn()} for
  jittering the $y$ values with a standard deviation of 15. Then
  calculate the mean squared error between the data and straight
  lines for a range of slopes and intercepts, analogous to the
  \varcode{meanSquaredErrorCubic()} function from the previous
  exercise. Illustrate the error surface using the \code{surface()}
  function. Consult the documentation to find out how to use
  \code{surface()}.
\end{exercise}

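A minimal sketch of how such an error surface could be computed and
displayed, assuming the data of the exercise are stored in vectors
\varcode{x} and \varcode{y} (the grid ranges are arbitrary choices):
\begin{verbatim}
% sample the parameter space of slopes and intercepts:
slopes = 0.0:0.05:1.5;
intercepts = -80.0:2.0:0.0;
errors = zeros(length(intercepts), length(slopes));
for i = 1:length(intercepts)
    for j = 1:length(slopes)
        yest = slopes(j)*x + intercepts(i);    % straight-line prediction
        errors(i, j) = mean((y - yest).^2);    % mean squared error
    end
end
% illustrate the error surface:
[M, B] = meshgrid(slopes, intercepts);
surface(M, B, errors);
xlabel('slope m');
ylabel('intercept b');
zlabel('mean squared error');
view(3);
\end{verbatim}
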
By looking at the error surface we can directly see the position of
the minimum and thus estimate the optimal parameter combination. How
can we use the error surface to guide an automatic optimization
process?

The obvious approach would be to calculate the error surface for many
combinations of slope and intercept values and then find the position
of the minimum using the \code{min} function. This approach, however,
has several disadvantages: (i) it is computationally very expensive to
calculate the error for each parameter combination. The number of
combinations increases exponentially with the number of free
parameters (also known as the ``curse of dimensionality''). (ii) The
accuracy with which the best parameters can be estimated is limited by
the resolution used to sample the parameter space. The coarser the
parameters are sampled, the less precise is the obtained position of
the minimum.

So we need a different approach. We want a procedure that finds the
minimum of the cost function with a minimal number of computations and
to arbitrary precision.

\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
  \includegraphics[width=0.33\textwidth]{derivative}
  \hfill
  \begin{minipage}[b]{0.63\textwidth}
    The difference quotient
    \begin{equation}
      \label{difffrac}
      m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
    of a function $y = f(x)$ is the slope of the secant (red) defined
    by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ with the
    distance $\Delta x$.

    The slope of the function $y=f(x)$ at the position $x$ (yellow) is
    given by the derivative $f'(x)$ of the function at that position.
    It is defined by the difference quotient in the limit of
    infinitesimally (orange) small distances $\Delta x$:
    \begin{equation}
      \label{derivative}
      f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
  \end{minipage}\vspace{2ex}

  The derivative \eqnref{derivative} cannot be calculated numerically,
  because the limit $\Delta x \to 0$ cannot be taken on a computer.
  The derivative can only be estimated using the difference quotient
  \eqnref{difffrac} with a sufficiently small $\Delta x$.
\end{ibox}

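As a small numerical illustration (a sketch, not one of the chapter's
exercises), the derivative of $f(x)=x^3$ at $x=2$ can be approximated
by the difference quotient \eqnref{difffrac} with a small step
$\Delta x$ and compared to the analytical value $f'(2)=12$:
\begin{verbatim}
f = @(x) x.^3;           % example function
x0 = 2.0;                % position at which the slope is needed
dx = 1e-6;               % small, but finite, step
dfdx = (f(x0 + dx) - f(x0)) / dx;    % difference quotient
fprintf('estimated: %.4f, analytical: %.4f\n', dfdx, 3.0*x0^2);
\end{verbatim}
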
\begin{ibox}[tp]{\label{partialderivativebox}Partial derivative and gradient}
  Some functions depend on more than a single variable, for example
  \[ z = f(x,y) \]
  depends on $x$ and $y$. Using the partial derivatives
  \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
  and
  \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
  one can estimate the slope in the direction of each variable
  individually by using the respective difference quotient
  (Box~\ref{differentialquotientbox}). \vspace{1ex}

  \begin{minipage}[t]{0.5\textwidth}
    \mbox{}\\[-2ex]
    \includegraphics[width=1\textwidth]{gradient}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.46\textwidth}
    For example, the partial derivatives of
    \[ f(x,y) = x^2+y^2 \] are
    \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]

    The gradient is a vector that is constructed from the partial derivatives:
    \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
    This vector points in the direction of the steepest ascent of
    $f(x,y)$.
  \end{minipage}

  \vspace{0.5ex} The figure shows the contour lines of a bi-variate
  Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$, the gradient (thick arrows),
  and the corresponding two partial derivatives (thin arrows) for
  three different locations.
\end{ibox}

\section{Gradient}

Imagine placing a small ball at some point on the error surface
(\figref{errorsurfacefig}). Naturally, it would roll down the steepest
slope and eventually stop at the minimum of the error surface (if it
had no inertia). We will use this picture to develop an algorithm that
finds its way to the minimum of the objective function. The ball
always follows the steepest slope. Thus we need to figure out the
direction of the steepest slope at the position of the ball.

The \entermde{Gradient}{gradient} (Box~\ref{partialderivativebox}) of the
objective function is the vector
\begin{equation}
  \label{gradient}
  \nabla f_{cost}(m,b) = \left( \frac{\partial f(m,b)}{\partial m},
  \frac{\partial f(m,b)}{\partial b} \right)
\end{equation}
that points in the direction of the steepest ascent of the objective
function. The gradient is given by the partial derivatives
(Box~\ref{partialderivativebox}) of the mean squared error with
respect to the parameters $m$ and $b$ of the straight line. There is
no need to calculate it analytically, because it can be estimated from
the partial derivatives using the difference quotient
(Box~\ref{differentialquotientbox}) for small steps $\Delta m$ and
$\Delta b$. For example, the partial derivative with respect to $m$
can be approximated as
\begin{equation}
  \frac{\partial f_{cost}(m,b)}{\partial m} = \lim\limits_{\Delta m \to 0}
  \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m}
  \approx \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \; .
\end{equation}
The length of the gradient indicates the steepness of the slope
(\figref{gradientquiverfig}). Since we want to go down the hill, we
choose the opposite direction.

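A sketch of this numerical estimate in MATLAB, assuming data vectors
\varcode{x} and \varcode{y} and current parameter values \varcode{m}
and \varcode{b}:
\begin{verbatim}
dm = 1e-7;                                % small step in m
msec = mean((y - (m*x + b)).^2);          % cost at (m, b)
msem = mean((y - ((m + dm)*x + b)).^2);   % cost at (m + dm, b)
dfdm = (msem - msec) / dm;                % difference quotient
\end{verbatim}
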
\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{error_gradient}
  \titlecaption{Gradient of the error surface.}{Each arrow points in
    the direction of the steepest ascent at different positions of
    the error surface shown in \figref{errorsurfacefig}. The contour
    lines in the background illustrate the error surface. Warm colors
    indicate high errors, colder colors low error values. Each
    contour line connects points of equal error.}\label{gradientquiverfig}
\end{figure}

\begin{exercise}{meanSquaredGradient.m}{}\label{gradientexercise}%
  Implement a function \varcode{meanSquaredGradient()}, that takes the
  $x$- and $y$-data and the set of parameters $(m, b)$ of a straight
  line as a two-element vector as input arguments. The function should
  return the gradient at the position $(m, b)$ as a vector with two
  elements.
\end{exercise}

\begin{exercise}{errorGradient.m}{}
  Extend the script of exercise~\ref{errorsurfaceexercise} to plot
  both the error surface and the gradients using the
  \varcode{meanSquaredGradient()} function from
  exercise~\ref{gradientexercise}. Vectors in space can be easily
  plotted using the function \code{quiver()}. Use \code{contour()}
  instead of \code{surface()} to plot the error surface.
\end{exercise}

\section{Gradient descent}

Finally, we are able to implement the optimization itself. By now it
should be obvious why it is called the gradient descent method. All
ingredients are already there. We need: (i) the cost function
(\varcode{meanSquaredError()}), and (ii) the gradient
(\varcode{meanSquaredGradient()}). The algorithm of the gradient
descent works as follows (a code sketch follows the list):
\begin{enumerate}
\item Start with some given combination of the parameters $m$ and $b$
  ($p_0 = (m_0, b_0)$).
\item \label{computegradient} Calculate the gradient at the current
  position $p_i$.
\item If the length of the gradient falls below a certain value, we
  assume that we have reached the minimum and stop the search. We are
  actually looking for the point at which the length of the gradient
  is zero, but finding exactly zero is impossible because of numerical
  imprecision. We thus apply a threshold below which we are
  sufficiently close to zero (e.g. \varcode{norm(gradient) < 0.1}).
\item \label{gradientstep} If the length of the gradient exceeds the
  threshold we take a small step into the opposite direction:
  \begin{equation}
    p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(m_i, b_i)
  \end{equation}
  where $\epsilon = 0.01$ is a factor linking the gradient to
  appropriate steps in the parameter space.
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
\end{enumerate}
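
A minimal sketch of this loop, assuming the data in \varcode{x} and
\varcode{y}, the \varcode{meanSquaredGradient()} function from
exercise~\ref{gradientexercise}, and arbitrary start values:
\begin{verbatim}
p = [-2.0, 10.0];      % arbitrary start values for slope and intercept
epsilon = 0.01;        % factor linking gradient to parameter steps
gradient = meanSquaredGradient(x, y, p);   % assumed to return a row vector
while norm(gradient) > 0.1
    p = p - epsilon * gradient;            % small step against the gradient
    gradient = meanSquaredGradient(x, y, p);
end
\end{verbatim}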

\Figref{gradientdescentfig} illustrates the gradient descent --- the
path the imaginary ball has taken to reach the minimum. Starting at an
arbitrary position on the error surface we change the position as long
as the length of the gradient at that position exceeds a certain
threshold. If the slope is very steep, the change in position (the
distance between the red dots in \figref{gradientdescentfig}) is
large.

\begin{figure}[t]
  \includegraphics{cubicmse}
  \titlecaption{Gradient descent.}{The algorithm starts at an
    arbitrary position. At each point the gradient is estimated and
    the position is updated as long as the length of the gradient is
    sufficiently large. The dots show the positions after each
    iteration of the algorithm.} \label{gradientdescentfig}
\end{figure}

\begin{exercise}{gradientDescent.m}{}
  Implement the gradient descent for the problem of fitting a straight
  line to some measured data. Reuse the data generated in
  exercise~\ref{errorsurfaceexercise}.
  \begin{enumerate}
  \item Store the error value for each iteration.
  \item Plot the error values as a function of the iterations, the
    number of optimization steps.
  \item Plot the measured data together with the best fitting straight line.
  \end{enumerate}\vspace{-4.5ex}
\end{exercise}

\section{Summary}

The gradient descent is an important numerical method for solving
optimization problems. It is used to find the minimum of an
objective function.

Curve fitting is a common application for the gradient descent method.
For the case of fitting straight lines to data pairs, the error
surface (using the mean squared error) has exactly one clearly defined
global minimum. In fact, the position of the minimum can be analytically
calculated as shown in the next chapter.

Problems that involve nonlinear computations on parameters, e.g. the
rate $\lambda$ in an exponential function $f(x;\lambda) = e^{\lambda
  x}$, do not have an analytical solution of the least-squares
problem. To find the least-squares solution for such functions,
numerical methods such as the gradient descent have to be applied.

The suggested gradient descent algorithm can be improved in multiple
ways to converge faster. For example, one could adapt the step size to
the length of the gradient. Such numerical tricks have already been
implemented in pre-defined functions. Generic optimization functions
such as \matlabfun{fminsearch()} work with arbitrary objective
functions, while the more specialized function
\matlabfun{lsqcurvefit()} is specifically designed for optimizations
in the least squares sense.

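For example, the straight-line fit could be handed over to
\matlabfun{fminsearch()} by wrapping the mean squared error into an
anonymous objective function of the parameter vector (a sketch,
assuming data vectors \varcode{x} and \varcode{y}):
\begin{verbatim}
% objective function: mean squared error of a line with parameters p = [m, b]:
fcost = @(p) mean((y - (p(1)*x + p(2))).^2);
p0 = [1.0, 0.0];                  % arbitrary start values
pbest = fminsearch(fcost, p0);    % simplex-based minimization
\end{verbatim}
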
%\newpage
\begin{important}[Beware of secondary minima!]
  Finding the absolute minimum is not always as easy as in the case of
  fitting a straight line. Often, the error surface has secondary or
  local minima in which the gradient descent gets stuck, even though
  there is a better solution, a global minimum that is lower than the
  local minimum. Starting from good initial positions is a good
  approach to avoid getting stuck in local minima. Also keep in mind
  that error surfaces tend to be simpler (fewer local minima) the
  fewer parameters are fitted from the data. Each additional parameter
  increases the complexity of the fit and makes it computationally
  more expensive.
\end{important}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions