\chapter{Optimization and gradient descent}
\exercisechapter{Optimization and gradient descent}

Optimization problems arise in many different contexts. For example,
to understand the behavior of a given system, the system is probed
with a range of input signals and then the resulting responses are
measured. This input-output relation can be described by a model. Such
a model can be a simple function that maps the input signals to the
corresponding responses, a filter, or a system of differential
equations. In any case, the model has some parameters that specify how
input and output signals are related. Which combination of parameter
values is best suited to describe the input-output relation? The
process of finding the best parameter values is an optimization
problem. For a simple parameterized function that maps input to output
values, this is the special case of a \enterm{curve fitting} problem,
where the average distance between the curve and the response values
is minimized. One basic numerical method for solving such optimization
problems is the so-called gradient descent, which is introduced in
this chapter.

%%% Another simple verbal example? Maybe from population ecology?

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{lin_regress}\hfill
  \titlecaption{Example data suggesting a linear relation.}{A set of
    input signals $x$, e.g. stimulus intensities, were used to probe a
    system. The system's outputs $y$ to these inputs are measured
    (left). Assuming a linear relation between $x$ and $y$ leaves us
    with two parameters, the slope (center) and the intercept with the
    $y$-axis (right panel).}\label{linregressiondatafig}
\end{figure}

The data plotted in \figref{linregressiondatafig} suggest a linear
relation between input and output of the system. We thus assume that a
straight line
\begin{equation}
  \label{straightline}
  y = f(x; m, b) = m\cdot x + b
\end{equation}
is an appropriate model to describe the system. The line has two free
parameters, the slope $m$ and the $y$-intercept $b$. We need to find
values for the slope and the intercept that best describe the measured
data. In this chapter we use this example to illustrate the gradient
descent and show how this method can be used to find the combination
of slope and intercept that best describes the system.
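
As a quick check of the notation, the model \eqref{straightline} can be
evaluated directly. The following minimal sketch computes the predicted
outputs for some inputs, assuming example values for the slope and the
intercept (the numbers are arbitrary and only serve as illustration):
\begin{lstlisting}
m = 0.75;        % assumed example slope
b = -40.0;       % assumed example intercept
x = 0:10:120;    % some input values
y = m * x + b;   % outputs predicted by the straight-line model
\end{lstlisting}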


\section{The error function --- mean squared error}

Before the optimization can be done we need to specify what exactly is
considered an optimal fit. In our example we search for the parameter
combination that best describes the relation between $x$ and $y$. What
is meant by this? Each input $x_i$ leads to a measured output $y_i$,
and for each $x_i$ the model provides a \emph{prediction} or
\emph{estimation} $y^{est}(x_i)$ of the output value. At each $x_i$,
estimation and measurement differ by the error $y_i - y^{est}(x_i)$.
In our example the estimation is given by the straight line,
$y^{est}(x_i) = f(x_i;m,b)$. The best fitting model with parameters
$m$ and $b$ is the one that minimizes the distances between the
observations $y_i$ and the estimations $y^{est}(x_i)$
(\figref{leastsquareerrorfig}).

As a first guess we could simply minimize the sum of the errors,
$\sum_{i=1}^N (y_i - y^{est}(x_i))$. This approach, however, does not
work, since a small sum can also be achieved if half of the
measurements lie above and the other half below the predicted line.
Positive and negative errors then cancel out and sum up to values
close to zero. A better approach is to sum over the absolute values of
the errors, $\sum_{i=1}^N |y_i - y^{est}(x_i)|$. This sum can only be
small if all deviations are indeed small, no matter whether they are
above or below the predicted line. Instead of the sum we could also
take the average
\begin{equation}
  \label{meanabserror}
  f_{dist}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N |y_i - y^{est}(x_i)|
\end{equation}
Instead of the averaged absolute errors, the \enterm[mean squared
error]{mean squared error} (\determ[quadratischer
Fehler!mittlerer]{mittlerer quadratischer Fehler})
\begin{equation}
  \label{meansquarederror}
  f_{mse}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}(x_i))^2
\end{equation}
is commonly used (\figref{leastsquareerrorfig}). Like the absolute
distance, the square of the errors, $(y_i - y^{est}(x_i))^2$, is always
positive, and thus positive and negative error values do not cancel
each other out. In addition, the square punishes large deviations more
strongly than small ones. In chapter~\ref{maximumlikelihoodchapter} we
show that minimizing the mean squared error is equivalent to
maximizing the likelihood that the observations originate from the
model, provided the data are normally distributed around the model
prediction.
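
The cancellation of positive and negative errors can be checked with a
few made-up numbers. The following sketch compares the plain sum of
the signed errors with the sum of their absolute values for deviations
that scatter symmetrically around the prediction:
\begin{lstlisting}
errors = [3.0, -2.5, 1.5, -2.0];  % made-up deviations y_i - y^est(x_i)
sum(errors)                       % close to zero despite large deviations
sum(abs(errors))                  % only small if all deviations are small
\end{lstlisting}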

\begin{exercise}{meanSquaredErrorLine.m}{}\label{mseexercise}%
  Given a vector of observations \varcode{y} and a vector with the
  corresponding predictions \varcode{y\_est}, compute the \emph{mean
    square error} between \varcode{y} and \varcode{y\_est} in a single
  line of code.
\end{exercise}


\section{Objective function}

The mean squared error is a so-called \enterm{objective function} or
\enterm{cost function} (\determ{Kostenfunktion}). A cost function
assigns to a model prediction $\{y^{est}(x_i)\}$ for a given data set
$\{(x_i, y_i)\}$ a single scalar value that we want to minimize. Here
we aim to adapt the model parameters such that the mean squared error
\eqref{meansquarederror} is minimized. In general, the \enterm{cost
  function} can be any function that quantifies the quality of the fit
by mapping the data and the predictions to a single scalar value.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{linear_least_squares}
  \titlecaption{Estimating the \emph{mean squared error}.}{The
    deviation (error, orange) between the prediction (red line) and
    the observations (blue dots) is calculated for each data point
    (left). Then the deviations are squared and the average is
    calculated (right).}
  \label{leastsquareerrorfig}
\end{figure}

Replacing $y^{est}$ in the mean squared error \eqref{meansquarederror}
with our model, the straight line \eqref{straightline}, the cost
function reads
\begin{eqnarray}
  f_{cost}(m,b|\{(x_i, y_i)\}) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;m,b))^2 \label{msefunc} \\
  & = & \frac{1}{N} \sum_{i=1}^N (y_i - m x_i - b)^2 \label{mseline}
\end{eqnarray}
The optimization process tries to find the slope $m$ and the intercept
$b$ such that the cost function is minimized. With the mean squared
error as the cost function this optimization process is also called
the method of \enterm[least square error]{least squares}
(\determ[quadratischer Fehler!kleinster]{Methode der kleinsten
  Quadrate}).
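
For illustration, the cost function \eqref{mseline} can be sketched as
an anonymous function of the two parameters with the data held fixed.
This is only a sketch of the idea; the following exercise asks for a
proper \varcode{meanSquaredError()} function that takes the data as
arguments:
\begin{lstlisting}
% x and y are assumed to be vectors holding the measured data:
fcost = @(m, b) mean((y - m .* x - b).^2);  % mean squared error of the line
fcost(0.5, -20.0)                           % error of one parameter combination
\end{lstlisting}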

\begin{exercise}{meanSquaredError.m}{}
  Implement the objective function \eqref{mseline} as a function
  \varcode{meanSquaredError()}.  The function takes three
  arguments. The first is a vector of $x$-values and the second
  contains the measurements $y$ for each value of $x$. The third
  argument is a 2-element vector that contains the values of
  parameters \varcode{m} and \varcode{b}. The function returns the
  mean square error.
\end{exercise}


\section{Error surface}
For each combination of the two parameters $m$ and $b$ of the model we
can use \eqnref{mseline} to calculate the corresponding value of the
cost function. The cost function $f_{cost}(m,b|\{(x_i, y_i)\})$ is a
function $f_{cost}(m,b)$ that maps the parameter values $m$ and $b$ to
a scalar error value. The error values describe a landscape over the
$m$-$b$ plane, the error surface, that can be illustrated graphically
as a 3-d surface plot. $m$ and $b$ are plotted on the $x$- and
$y$-axis while the third dimension indicates the error value
(\figref{errorsurfacefig}).

\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{error_surface}
  \titlecaption{Error surface.}{The two model parameters $m$ and $b$
    define the base area of the surface plot. For each parameter
    combination of slope and intercept the error is calculated. The
    resulting surface has a minimum which indicates the parameter
    combination that best fits the data.}\label{errorsurfacefig}
\end{figure}

\begin{exercise}{errorSurface.m}{}\label{errorsurfaceexercise}
  Generate 20 data pairs $(x_i|y_i)$ that are linearly related with
  slope $m=0.75$ and intercept $b=-40$, using \varcode{rand()} for
  drawing $x$ values between 0 and 120 and \varcode{randn()} for
  jittering the $y$ values with a standard deviation of 15.  Then
  calculate the mean squared error between the data and straight lines
  for a range of slopes and intercepts using the
  \varcode{meanSquaredError()} function from the previous exercise.
  Illustrate the error surface using the \code{surface()} function.
  Consult the documentation to find out how to use \code{surface()}.
\end{exercise}

By looking at the error surface we can directly see the position of
the minimum and thus estimate the optimal parameter combination. How
can we use the error surface to guide an automatic optimization
process?

The obvious approach would be to calculate the error surface for many
combinations of slope and intercept values and then find the position
of the minimum using the \code{min()} function. This approach,
however, has several disadvantages: (i) it is computationally
expensive to calculate the error for each parameter combination, and
the number of combinations increases exponentially with the number of
free parameters (also known as the ``curse of dimensionality'').
(ii) The accuracy with which the best parameters can be estimated is
limited by the resolution used to sample the parameter space. The
coarser the parameters are sampled, the less precisely the position of
the minimum can be determined.
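
To make these disadvantages concrete, here is a sketch of the
brute-force approach for our two-parameter problem. It evaluates the
cost function on a coarse grid of slopes and intercepts and locates
the minimum with \code{min()}. The precision of the result is limited
by the grid spacing, and the number of grid points grows rapidly with
every additional parameter. The \varcode{meanSquaredError()} function
from the exercise above and the data vectors \varcode{x} and
\varcode{y} are assumed to exist:
\begin{lstlisting}
slopes = -5:0.25:5;          % sampled slopes
intercepts = -100:1:100;     % sampled intercepts
errors = zeros(length(slopes), length(intercepts));
for i = 1:length(slopes)
    for j = 1:length(intercepts)
        errors(i, j) = meanSquaredError(x, y, [slopes(i), intercepts(j)]);
    end
end
[minerror, minindex] = min(errors(:));     % smallest error and its linear index
[i, j] = ind2sub(size(errors), minindex);  % row and column of the minimum
bestslope = slopes(i);
bestintercept = intercepts(j);
\end{lstlisting}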

So we need a different approach. We want a procedure that finds the
minimum of the cost function with a minimal number of computations and
to arbitrary precision.

\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
  \includegraphics[width=0.33\textwidth]{derivative}
  \hfill
  \begin{minipage}[b]{0.63\textwidth}
    The difference quotient
    \begin{equation}
      \label{difffrac}
      m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
    of a function $y = f(x)$ is the slope of the secant (red) defined
    by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ with the
    distance $\Delta x$.

    The slope of the function $y=f(x)$ at the position $x$ (yellow) is
    given by the derivative $f'(x)$ of the function at that position.
    It is defined by the difference quotient in the limit of
    infinitesimally (orange) small distances $\Delta x$:
    \begin{equation}
      \label{derivative}
      f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \end{equation}
  \end{minipage}\vspace{2ex} 

  The limit in the derivative, \eqnref{derivative}, cannot be
  evaluated numerically. Numerically, the derivative can only be
  approximated by the difference quotient, \eqnref{difffrac}, with a
  sufficiently small $\Delta x$.
\end{ibox}

\begin{ibox}[tp]{\label{partialderivativebox}Partial derivative and gradient}
  Some functions depend on more than a single variable. For example,
  \[ z = f(x,y) \]
  depends on both $x$ and $y$. Using the partial derivatives
  \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
  and
  \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
  one can estimate the slope in the direction of each variable
  individually by means of the respective difference quotient
  (Box~\ref{differentialquotientbox}).  \vspace{1ex}

  \begin{minipage}[t]{0.5\textwidth}
    \mbox{}\\[-2ex]
    \includegraphics[width=1\textwidth]{gradient}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.46\textwidth}
    For example, the partial derivatives of
    \[ f(x,y) = x^2+y^2 \] are 
    \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]

    The gradient is a vector that is constructed from the partial derivatives:
    \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
    This vector points in the direction of the steepest ascent of
    $f(x,y)$.
  \end{minipage}

  \vspace{0.5ex} The figure shows the contour lines of a bi-variate
  Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ and the gradient (thick
  arrows) and the corresponding two partial derivatives (thin arrows)
  for three different locations.
\end{ibox}


\section{Gradient}
Imagine placing a small ball at some point on the error surface
(\figref{errorsurfacefig}). Naturally, it would roll down the steepest
slope and eventually stop at the minimum of the error surface (if it
had no inertia). We will use this picture to develop an algorithm that
finds its way to the minimum of the objective function. The ball
always follows the steepest slope. Thus we need to figure out the
direction of the steepest slope at the position of the ball.

The \entermde{Gradient}{gradient} (Box~\ref{partialderivativebox}) of the
objective function is the vector
\begin{equation}
  \label{gradient}
  \nabla f_{cost}(m,b) = \left( \frac{\partial f(m,b)}{\partial m},
    \frac{\partial f(m,b)}{\partial b} \right)
\end{equation}
that points in the direction of the steepest ascent of the objective
function. The gradient is given by the partial derivatives
(Box~\ref{partialderivativebox}) of the mean squared error with
respect to the parameters $m$ and $b$ of the straight line. There is
no need to calculate these derivatives analytically, because they can
be estimated numerically by means of the difference quotient
(Box~\ref{differentialquotientbox}) with small steps $\Delta m$ and
$\Delta b$. For example, the partial derivative with respect to $m$
can be approximated as
\begin{equation}
  \frac{\partial f_{cost}(m,b)}{\partial m} = \lim\limits_{\Delta m \to
  0} \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m}
\approx \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \; . 
\end{equation}
The length of the gradient indicates the steepness of the slope
(\figref{gradientquiverfig}). Since we want to go down the hill, we
step in the opposite direction.
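
The finite-difference estimate can be tried out on a function whose
gradient is known analytically, for example the paraboloid from
Box~\ref{partialderivativebox}. The following sketch estimates the two
partial derivatives of $f(x,y)=x^2+y^2$ numerically; the result can be
compared to the exact gradient $(2x, 2y)$:
\begin{lstlisting}
f = @(x, y) x.^2 + y.^2;  % example function with gradient (2x, 2y)
x0 = 1.0;                 % position at which the gradient is estimated
y0 = 2.0;
h = 1e-6;                 % small step for the difference quotient
dfdx = (f(x0 + h, y0) - f(x0, y0)) / h;  % approximately 2*x0 = 2
dfdy = (f(x0, y0 + h) - f(x0, y0)) / h;  % approximately 2*y0 = 4
gradf = [dfdx, dfdy]
\end{lstlisting}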


\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{error_gradient}
  \titlecaption{Gradient of the error surface.}{Each arrow points in
    the direction of the steepest ascent at different positions of the
    error surface shown in \figref{errorsurfacefig}. The contour lines
    in the background illustrate the error surface. Warm colors
    indicate high errors, cold colors low error values. Each contour
    line connects points of equal error.}\label{gradientquiverfig}
\end{figure}

\begin{exercise}{meanSquaredGradient.m}{}\label{gradientexercise}%
  Implement a function \varcode{meanSquaredGradient()}, that takes the
  $x$- and $y$-data and the set of parameters $(m, b)$ of a straight
  line as a two-element vector as input arguments. The function should
  return the gradient at the position $(m, b)$ as a vector with two
  elements.
\end{exercise}

\begin{exercise}{errorGradient.m}{}
  Extend the script of exercises~\ref{errorsurfaceexercise} to plot
  both the error surface and gradients using the
  \varcode{meanSquaredGradient()} function from
  exercise~\ref{gradientexercise}. Vectors in space can be easily
  plotted using the function \code{quiver()}. Use \code{contour()}
  instead of \code{surface()} to plot the error surface.
\end{exercise}


\section{Gradient descent}
Finally, we are able to implement the optimization itself. By now it
should be obvious why it is called the gradient descent method. All
ingredients are already there. We need (i) the cost function
(\varcode{meanSquaredError()}) and (ii) the gradient
(\varcode{meanSquaredGradient()}). The gradient descent algorithm
works as follows:
\begin{enumerate}
\item Start with some given combination of the parameters $m$ and $b$
  ($p_0 = (m_0, b_0)$).
\item \label{computegradient} Calculate the gradient at the current
  position $p_i$.
\item If the length of the gradient falls below a certain value, we
  assume that we have reached the minimum and stop the search. We are
  actually looking for the point at which the length of the gradient
  is zero, but hitting exactly zero is impossible because of numerical
  imprecision. We thus apply a threshold below which we consider the
  gradient sufficiently close to zero
  (e.g. \varcode{norm(gradient) < 0.1}).
\item \label{gradientstep} If the length of the gradient exceeds the
  threshold we take a small step into the opposite direction:
  \begin{equation}
    p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(m_i, b_i)
  \end{equation}
  where $\epsilon = 0.01$ is a factor linking the gradient to
  appropriate steps in the parameter space.
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
\end{enumerate}
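
To illustrate the loop structure of this algorithm independently of
the line-fitting problem, here is a minimal sketch that descends on
the paraboloid $f(p) = p_1^2 + p_2^2$, whose gradient $(2 p_1, 2 p_2)$
is known analytically. For the straight-line problem, only the
gradient function needs to be exchanged for
\varcode{meanSquaredGradient()}:
\begin{lstlisting}
gradf = @(p) [2 * p(1); 2 * p(2)];  % gradient of f(p) = p(1)^2 + p(2)^2
p = [3.0; -4.0];                    % arbitrary starting position
epsilon = 0.01;                     % factor scaling the gradient to steps
while norm(gradf(p)) > 0.1          % stop when the gradient is small enough
    p = p - epsilon * gradf(p);     % take a step against the gradient
end
p                                   % close to the minimum at (0, 0)
\end{lstlisting}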

\Figref{gradientdescentfig} illustrates the gradient descent --- the
path the imaginary ball takes to reach the minimum. Starting at an
arbitrary position on the error surface, we update the position as
long as the length of the gradient at that position is larger than a
certain threshold. If the slope is very steep, the change in position
(the distance between the red dots in \figref{gradientdescentfig}) is
large.

\begin{figure}[t]
  \includegraphics[width=0.45\textwidth]{gradient_descent}
  \titlecaption{Gradient descent.}{The algorithm starts at an
    arbitrary position. At each point the gradient is estimated and
    the position is updated as long as the length of the gradient is
    sufficiently large. The dots show the positions after each
    iteration of the algorithm.} \label{gradientdescentfig}
\end{figure}

\begin{exercise}{gradientDescent.m}{}
  Implement the gradient descent for the problem of fitting a straight
  line to some measured data. Reuse the data generated in
  exercise~\ref{errorsurfaceexercise}.
  \begin{enumerate}
  \item Store for each iteration the error value.
  \item Plot the error values as a function of the iterations, the
    number of optimization steps.
  \item Plot the measured data together with the best fitting straight line.
  \end{enumerate}\vspace{-4.5ex}
\end{exercise}


\section{Summary}

The gradient descent is an important numerical method for solving
optimization problems. It is used to find the minimum of an objective
function.

Curve fitting is a common application for the gradient descent method.
For the case of fitting straight lines to data pairs, the error
surface (using the mean squared error) has exactly one clearly defined
global minimum. In fact, the position of the minimum can be analytically
calculated as shown in the next chapter.

Problems in which the parameters enter the model nonlinearly,
e.g. the rate $\lambda$ of an exponential function $f(x;\lambda) =
e^{\lambda x}$, do not have an analytical least-squares solution. To
find the least-squares fit for such functions, numerical methods such
as the gradient descent have to be applied.
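
To see why there is no analytical solution in this example, set the
derivative of the summed squared error with respect to $\lambda$ to
zero:
\[ \frac{\partial}{\partial \lambda} \sum_{i=1}^N \left( y_i -
  e^{\lambda x_i} \right)^2 = -2 \sum_{i=1}^N \left( y_i - e^{\lambda
    x_i} \right) x_i e^{\lambda x_i} = 0 \]
In contrast to the straight line, this equation cannot be rearranged
to isolate $\lambda$, and so its root has to be found numerically.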

The gradient descent algorithm described above can be improved in
multiple ways to converge faster. For example, one could adapt the
step size to the length of the gradient. Such numerical tricks have
already been implemented in pre-defined functions. Generic
optimization functions such as \matlabfun{fminsearch()} work with
arbitrary objective functions, while the more specialized function
\matlabfun{lsqcurvefit()} is specifically designed for optimizations
in the least-squares sense.
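
As a sketch of how such a pre-defined function could be applied to our
line-fitting example, \matlabfun{fminsearch()} expects an objective
function of a single parameter vector and a starting point. The
anonymous cost function and the starting values used here are just an
illustration:
\begin{lstlisting}
% x and y are the measured data, p = [m, b] collects the two parameters:
fcost = @(p) mean((y - p(1) .* x - p(2)).^2);
pbest = fminsearch(fcost, [0.0, 0.0])  % best slope and intercept
\end{lstlisting}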

%\newpage
\begin{important}[Beware of secondary minima!]
  Finding the absolute minimum is not always as easy as in the case of
  fitting a straight line. Often, the error surface has secondary or
  local minima in which the gradient descent gets stuck even though
  there is a better solution, the global minimum, with an even lower
  error. Starting from good initial positions helps to avoid getting
  stuck in local minima. Also keep in mind that error surfaces tend to
  be simpler (fewer local minima) the fewer parameters are fitted from
  the data. Each additional parameter increases the complexity of the
  error surface and the computational cost.
\end{important}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions