\chapter{Optimization and gradient descent}

\exercisechapter{Optimization and gradient descent}

Optimization problems arise in many different contexts. For example,
to understand the behavior of a given neuronal system, the system is
probed with a range of input signals and then the resulting responses
are measured. This input-output relation can be described by a
model. Such a model can be a simple function that maps the input
signals to corresponding responses, it can be a filter, or a system of
differential equations. In any case, the model has some parameters
that specify how input and output signals are related. Which
combination of parameter values is best suited to describe the
input-output relation? The process of finding the best parameter
values is an optimization problem. For a simple parameterized function
that maps input to output values, this is the special case of a
\enterm{curve fitting} problem, where the average distance between the
curve and the response values is minimized. One basic numerical method
used for such optimization problems is the so-called gradient descent,
which is introduced in this chapter.

\begin{figure}[t]
  \includegraphics{cubicfunc}
  \titlecaption{Example data suggesting a cubic relation.}{The length
    $x$ and weight $y$ of $n=34$ male tigers (blue, left). Assuming a
    cubic relation between size and weight leaves us with a single
    free parameter, a scaling factor. The cubic relation is shown for
    a few values of this scaling factor (orange and red,
    right).}\label{cubicdatafig}
\end{figure}

For demonstrating the curve-fitting problem let's take the simple
example of weights and sizes measured for a number of male tigers
(\figref{cubicdatafig}). Weight $y$ is proportional to volume $V$ via
the density $\rho$. The volume $V$ of any object is proportional to
its length $x$ cubed. The factor $\alpha$ relating volume and size
cubed depends on the shape of the object and we do not know this
factor for tigers. For the data set we thus expect a cubic relation
between weight and length
\begin{equation}
  \label{cubicfunc}
  y = f(x; c) = c\cdot x^3
\end{equation}
where $c=\rho\alpha$, the product of a tiger's density and form
factor, is the only free parameter in the relation. We would like to
find out which value of $c$ best describes the measured data. In the
following we use this example to illustrate the gradient descent as a
basic method for finding such an optimal parameter.
\section{The error function --- mean squared error}

Before we go ahead finding the optimal parameter value we need to
specify what exactly we consider as an optimal fit. In our example we
search for the parameter that best describes the relation of $x$ and
$y$. What is meant by this? The length $x_i$ of each tiger is
associated with a weight $y_i$ and for each $x_i$ we have a
\emph{prediction} or \emph{estimation} $y^{est}(x_i)$ of the weight by
the model \eqnref{cubicfunc} for a specific value of the parameter
$c$. Prediction and actual data value ideally match (in a perfect
noise-free world), but in general the estimate and measurement are
separated by some distance or error $y_i - y^{est}(x_i)$. In our
example the estimate of the weight for the length $x_i$ is given by
equation \eqref{cubicfunc}, $y^{est}(x_i) = f(x_i;c)$. The best fitting
model with parameter $c$ is the one that somehow minimizes the
distances between observations $y_i$ and corresponding estimations
$y^{est}(x_i)$ (\figref{cubicerrorsfig}).

As a first guess we could simply minimize the sum of the distances,
$\sum_{i=1}^N y_i - y^{est}(x_i)$. This, however, does not work
because positive and negative errors would cancel out, no matter how
large they are, and sum up to values close to zero. Better is to sum
over the absolute distances: $\sum_{i=1}^N |y_i - y^{est}(x_i)|$. This
sum can only be small if all deviations are indeed small, no matter if
they are above or below the prediction. The sum of the squared
distances, $\sum_{i=1}^N (y_i - y^{est}(x_i))^2$, turns out to be an
even better choice. Instead of the sum we could also minimize the
average squared distance
\begin{equation}
  \label{meansquarederror}
  f_{mse}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}(x_i))^2
\end{equation}
This is known as the \enterm[mean squared error]{mean squared error}
(\determ[quadratischer Fehler!mittlerer]{mittlerer quadratischer
  Fehler}). Like the absolute distance, the square of the errors
is always positive and thus positive and negative error values do not
cancel each other out. In addition, the square weights large
deviations more strongly than small ones. In
chapter~\ref{maximumlikelihoodchapter} we show that minimizing the
mean squared error is equivalent to maximizing the likelihood that the
observations originate from the model, if the data are normally
distributed around the model prediction.

\begin{exercise}{meansquarederrorline.m}{}\label{mseexercise}
  Simulate $n=40$ tigers ranging from 2.2 to 3.9\,m in size and store
  these sizes in a vector \varcode{x}. Compute the corresponding
  predicted weights \varcode{yest} for each tiger according to
  \eqnref{cubicfunc} with $c=6$\,\kilo\gram\per\meter\cubed. From the
  predictions generate simulated measurements of the tigers' weights
  \varcode{y} by adding normally distributed random numbers to the
  predictions, scaled to a standard deviation of 50\,\kilo\gram.

  Compute the \emph{mean squared error} between \varcode{y} and
  \varcode{yest} in a single line of code.
\end{exercise}
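A minimal sketch of such a simulation could look as follows. The
variable names follow the exercise; drawing the sizes from a uniform
distribution is our own choice:
\begin{verbatim}
% simulate n = 40 tiger sizes between 2.2 m and 3.9 m:
n = 40;
x = 2.2 + (3.9 - 2.2) * rand(n, 1);
% predicted weights according to the cubic relation with c = 6 kg/m^3:
c = 6.0;
yest = c * x.^3;
% simulated measurements: predictions plus noise with 50 kg standard deviation:
y = yest + 50.0 * randn(n, 1);
% mean squared error in a single line:
mse = mean((y - yest).^2);
\end{verbatim}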
\section{Cost function}

The mean squared error is a so-called \enterm{objective function} or
\enterm{cost function} (\determ{Kostenfunktion}). A cost function
assigns to a model prediction $\{y^{est}(x_i)\}$ for a given data set
$\{(x_i, y_i)\}$ a single scalar value that we want to minimize. Here
we aim to adapt the model parameter to minimize the mean squared error
\eqref{meansquarederror}. In general, the \enterm{cost function} can
be any function that describes the quality of a fit by mapping the
data and the predictions to a single scalar value.

\begin{figure}[t]
  \includegraphics{cubicerrors}
  \titlecaption{Estimating the \emph{mean squared error}.}{The
    deviation (orange) between the prediction (red line) and the
    observations (blue dots) is calculated for each data point
    (left). Then the deviations are squared and the average is
    calculated (right).}
  \label{cubicerrorsfig}
\end{figure}

Replacing $y^{est}$ in the mean squared error \eqref{meansquarederror}
with our cubic model \eqref{cubicfunc}, the cost function reads
\begin{eqnarray}
  f_{cost}(c|\{(x_i, y_i)\}) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;c))^2 \label{msefunc} \\
  & = & \frac{1}{N} \sum_{i=1}^N (y_i - c x_i^3)^2 \label{msecube}
\end{eqnarray}
The optimization process tries to find a value for the factor $c$ such
that the cost function is minimized. With the mean squared error as
the cost function this optimization process is also called the method
of \enterm{least squares} (\determ[quadratischer
Fehler!kleinster]{Methode der kleinsten Quadrate}).

\begin{exercise}{meanSquaredErrorCubic.m}{}
  Implement the objective function \eqref{msecube} as a function
  \varcode{meanSquaredErrorCubic()}. The function takes three
  arguments. The first is a vector of $x$-values and the second
  contains the measurements $y$ for each value of $x$. The third
  argument is the value of the factor $c$. The function returns the
  mean squared error.
\end{exercise}
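A possible implementation is a direct translation of \eqnref{msecube}
into a single line of code:
\begin{verbatim}
function mse = meanSquaredErrorCubic(x, y, c)
% Mean squared error between the measurements y and the
% predictions of the cubic relation c*x^3 at the positions x.
  mse = mean((y - c * x.^3).^2);
end
\end{verbatim}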
\section{Graph of the cost function}

For each value of the parameter $c$ of the model we can use
\eqnref{msecube} to calculate the corresponding value of the cost
function. The cost function $f_{cost}(c|\{(x_i, y_i)\})$ is a
function $f_{cost}(c)$ that maps the parameter value $c$ to a scalar
error value. For a given data set we thus can simply plot the cost
function as a function of the parameter $c$ (\figref{cubiccostfig}).

\begin{exercise}{plotcubiccosts.m}{}
  Calculate the mean squared error between the data and the cubic
  function for a range of values of the factor $c$ using the
  \varcode{meanSquaredErrorCubic()} function from the previous
  exercise. Plot the graph of the cost function.
\end{exercise}
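The following sketch illustrates one way to do this; the range and
resolution of the $c$ values are arbitrary choices and the data
vectors \varcode{x} and \varcode{y} from above are assumed to exist:
\begin{verbatim}
% cost function for a range of c values:
cs = 2.0:0.1:8.0;
costs = zeros(size(cs));
for k = 1:length(cs)
    costs(k) = meanSquaredErrorCubic(x, y, cs(k));
end
% plot the graph of the cost function:
plot(cs, costs);
xlabel('c');
ylabel('mean squared error');
\end{verbatim}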
\begin{figure}[t]
  \includegraphics{cubiccost}
  \titlecaption{Minimum of the cost function.}{For a given data set
    the cost function, the mean squared error \eqnref{msecube}, as a
    function of the unknown parameter $c$ has a minimum close to the
    true value of $c$ that was used to generate the data
    (left). Simply taking the absolute minimum of the cost function
    computed for a pre-set range of values for the parameter $c$ has
    the disadvantage of limited precision (right) and the risk of
    entirely missing the global minimum if it lies outside the
    computed range.}\label{cubiccostfig}
\end{figure}

By looking at the plot of the cost function we can visually identify
the position of the minimum and thus estimate the optimal value for
the parameter $c$. How can we use the cost function to guide an
automatic optimization process?

The obvious approach would be to calculate the mean squared error for
a range of parameter values and then find the position of the minimum
using the \code{min()} function. This approach, however, has several
disadvantages: (i) The accuracy of the estimation of the best
parameter is limited by the resolution used to sample the parameter
space. The coarser the parameter space is sampled, the less precise is
the obtained position of the minimum (\figref{cubiccostfig}, right). (ii)
The range of parameter values might not include the absolute minimum.
(iii) In particular for functions with more than a single free
parameter it is computationally expensive to calculate the cost
function for each parameter combination at a sufficient
resolution. The number of combinations increases exponentially with
the number of free parameters. This is known as the \enterm{curse of
  dimensionality}.

So we need a different approach. We want a procedure that finds the
minimum of the cost function with a minimal number of computations and
to arbitrary precision.

\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
  \includegraphics[width=0.33\textwidth]{derivative}
  \hfill
  \begin{minipage}[b]{0.63\textwidth}
    The difference quotient
    \begin{equation}
      \label{difffrac}
      m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
    of a function $y = f(x)$ is the slope of the secant (red) defined
    by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ at
    distance $\Delta x$.

    The slope of the function $y=f(x)$ at the position $x$ (orange) is
    given by the derivative $f'(x)$ of the function at that position.
    It is defined by the difference quotient in the limit of
    infinitesimally (red and yellow) small distances $\Delta x$:
    \begin{equation}
      \label{derivative}
      f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
  \end{minipage}\vspace{2ex}

  It is not possible to calculate the exact value of the derivative
  \eqref{derivative} numerically. The derivative can only be estimated
  by computing the difference quotient \eqref{difffrac} using
  sufficiently small $\Delta x$.
\end{ibox}
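As a quick numerical illustration of the difference quotient
(Box~\ref{differentialquotientbox}); the function, the position, and
the step size are arbitrary choices:
\begin{verbatim}
% approximate the derivative of f(x) = x^3 at x0 = 2:
f = @(x) x.^3;
x0 = 2.0;
dx = 1e-4;
dfdx = (f(x0 + dx) - f(x0)) / dx;  % about 12.0006, close to 3*x0^2 = 12
\end{verbatim}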
\section{Gradient}

Imagine placing a ball at some point on the cost function
(\figref{cubiccostfig}). Naturally, it would roll down the slope and
eventually stop at the minimum of the error surface (if it had no
inertia). We will use this analogy to develop an algorithm to find our
way to the minimum of the cost function. The ball always follows the
steepest slope. Thus we need to figure out the direction of the slope
at the position of the ball.

In our one-dimensional example of a single free parameter the slope is
simply the derivative of the cost function with respect to the
parameter $c$. This derivative is called the
\entermde{Gradient}{gradient} of the cost function:
\begin{equation}
  \label{costderivative}
  \nabla f_{cost}(c) = \frac{{\rm d} f_{cost}(c)}{{\rm d} c}
\end{equation}
There is no need to calculate this derivative analytically, because it
can be approximated numerically by the difference quotient
(Box~\ref{differentialquotientbox}) for small steps $\Delta c$:
\begin{equation}
  \frac{{\rm d} f_{cost}(c)}{{\rm d} c} =
  \lim\limits_{\Delta c \to 0} \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c}
  \approx \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c}
\end{equation}
The derivative is positive for positive slopes. Since we want to go
down the hill, we choose the opposite direction.
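For the cubic model the derivative can, for comparison, also be worked
out analytically by applying the chain rule to \eqnref{msecube}:
\[ \frac{{\rm d} f_{cost}(c)}{{\rm d} c} = \frac{1}{N} \sum_{i=1}^N 2 \left(y_i - c x_i^3\right)\left(-x_i^3\right) = -\frac{2}{N} \sum_{i=1}^N x_i^3 \left(y_i - c x_i^3\right) \]
Such an expression is handy for checking a numerical estimate of the
gradient, but in the following we stick with the difference quotient.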
\begin{exercise}{meanSquaredGradientCubic.m}{}
  Implement a function \varcode{meanSquaredGradientCubic()} that
  takes the $x$- and $y$-data and the parameter $c$ as input
  arguments. The function should return the derivative of the mean
  squared error $f_{cost}(c)$ with respect to $c$ at the position $c$.
\end{exercise}
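One way to implement this uses the difference quotient with a small
step in $c$; the step size is an arbitrary but sufficiently small
choice:
\begin{verbatim}
function dmsedc = meanSquaredGradientCubic(x, y, c)
% Estimate the derivative of the mean squared error with respect
% to the parameter c by means of the difference quotient.
  dc = 1e-5;    % small step in the parameter c
  mse = meanSquaredErrorCubic(x, y, c);
  msestep = meanSquaredErrorCubic(x, y, c + dc);
  dmsedc = (msestep - mse) / dc;
end
\end{verbatim}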
\begin{exercise}{plotcubicgradient.m}{}
  Using the \varcode{meanSquaredGradientCubic()} function from the
  previous exercise, plot the derivative of the cost function as a
  function of $c$.
\end{exercise}
\section{Gradient descent}

Finally, we are able to implement the optimization itself. By now it
should be obvious why it is called the gradient descent method. All
ingredients are already there. We need: (i) the cost function
(\varcode{meanSquaredErrorCubic()}), and (ii) the gradient
(\varcode{meanSquaredGradientCubic()}). The gradient descent algorithm
works as follows (a minimal sketch of the resulting loop is shown
after the list):
\begin{enumerate}
\item Start with some arbitrary value $p_0$ for the parameter $c$.
\item \label{computegradient} Compute the gradient
  \eqnref{costderivative} at the current position $p_i$.
\item If the length of the gradient, the absolute value of the
  derivative \eqnref{costderivative}, is smaller than some threshold
  value, we assume to have reached the minimum and stop the search.
  We return the current parameter value $p_i$ as our best estimate of
  the parameter $c$ that minimizes the cost function.

  We are actually looking for the point at which the derivative of the
  cost function equals zero, but this cannot be found exactly because
  of numerical imprecision. We thus apply a threshold below which we
  are sufficiently close to zero.
\item \label{gradientstep} If the length of the gradient exceeds the
  threshold we take a small step into the opposite direction:
  \begin{equation}
    p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(p_i)
  \end{equation}
  where $\epsilon = 0.01$ is a factor linking the gradient to
  appropriate steps in the parameter space.
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
\end{enumerate}
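A minimal sketch of this loop for the cubic fit could look like the
following. The starting value, the threshold, and a smaller step size
$\epsilon$ (chosen here so that the iteration converges for the
simulated tiger data \varcode{x} and \varcode{y} from above) are our
own choices:
\begin{verbatim}
% gradient descent for the parameter c of the cubic fit:
p = 2.0;            % arbitrary starting value for the parameter c
epsilon = 0.0001;   % factor linking gradient to parameter steps
thresh = 0.1;       % threshold on the length of the gradient
gradient = meanSquaredGradientCubic(x, y, p);
while abs(gradient) > thresh
    p = p - epsilon * gradient;   % step down the slope
    gradient = meanSquaredGradientCubic(x, y, p);
end
cest = p;           % best estimate of the parameter c
\end{verbatim}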
\Figref{gradientdescentfig} illustrates the gradient descent --- the
path the imaginary ball has chosen to reach the minimum. Starting at
an arbitrary position we change the position as long as the gradient
at that position is larger than a certain threshold. If the slope is
very steep, the change in the position (the distance between the red
dots in \figref{gradientdescentfig}) is large.

\begin{figure}[t]
  \includegraphics{cubicmse}
  \titlecaption{Gradient descent.}{The algorithm starts at an
    arbitrary position. At each point the gradient is estimated and
    the position is updated as long as the length of the gradient is
    sufficiently large. The dots show the positions after each
    iteration of the algorithm.} \label{gradientdescentfig}
\end{figure}

\begin{exercise}{gradientDescent.m}{}
  Implement the gradient descent for the problem of fitting a straight
  line to some measured data. Reuse the data generated in
  exercise~\ref{errorsurfaceexercise}.
  \begin{enumerate}
  \item Store the error value for each iteration.
  \item Plot the error values as a function of the iterations, the
    number of optimization steps.
  \item Plot the measured data together with the best fitting straight line.
  \end{enumerate}\vspace{-4.5ex}
\end{exercise}
\begin{ibox}[tp]{\label{partialderivativebox}Partial derivative and gradient}
  Some functions depend on more than a single variable:
  \[ z = f(x,y) \]
  for example depends on $x$ and $y$. Using the partial derivatives
  \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
  and
  \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
  one can estimate the slope in the direction of each variable
  individually by using the respective difference quotient
  (Box~\ref{differentialquotientbox}). \vspace{1ex}

  \begin{minipage}[t]{0.5\textwidth}
    \mbox{}\\[-2ex]
    \includegraphics[width=1\textwidth]{gradient}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.46\textwidth}
    For example, the partial derivatives of
    \[ f(x,y) = x^2+y^2 \] are
    \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]

    The gradient is a vector that is constructed from the partial derivatives:
    \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
    This vector points into the direction of the steepest ascent of
    $f(x,y)$.
  \end{minipage}

  \vspace{0.5ex} The figure shows the contour lines of a bi-variate
  Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ and the gradient (thick
  arrows) and the corresponding two partial derivatives (thin arrows)
  for three different locations.
\end{ibox}

The \entermde{Gradient}{gradient} (Box~\ref{partialderivativebox}) of the
objective function is the vector
\begin{equation}
  \label{gradient}
  \nabla f_{cost}(m,b) = \left( \frac{\partial f(m,b)}{\partial m},
    \frac{\partial f(m,b)}{\partial b} \right)
\end{equation}
that points in the direction of the steepest ascent of the objective
function. The gradient is given by the partial derivatives
(Box~\ref{partialderivativebox}) of the mean squared error with
respect to the parameters $m$ and $b$ of the straight line.
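For the straight line $y = m x + b$ both components of this gradient
can again be estimated with difference quotients. The following sketch
illustrates this; the function name and the step size are our own
choices, and the data vectors \varcode{x} and \varcode{y} are assumed
to be given:
\begin{verbatim}
function gradmse = meanSquaredGradientLine(x, y, m, b)
% Numerical gradient of the mean squared error of the straight
% line y = m*x + b with respect to the parameters m and b.
  dp = 1e-5;   % small step for the difference quotients
  mse = mean((y - (m * x + b)).^2);
  msem = mean((y - ((m + dp) * x + b)).^2);
  mseb = mean((y - (m * x + b + dp)).^2);
  gradmse = [(msem - mse) / dp, (mseb - mse) / dp];
end
\end{verbatim}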
\section{Summary}

The gradient descent is an important numerical method for solving
optimization problems. It is used to find the minimum of an objective
function.

Curve fitting is a common application for the gradient descent method.
For the case of fitting straight lines to data pairs, the error
surface (using the mean squared error) has exactly one clearly defined
global minimum. In fact, the position of the minimum can be
analytically calculated as shown in the next chapter.

Problems that involve nonlinear computations on parameters, e.g. the
rate $\lambda$ in an exponential function $f(x;\lambda) = e^{\lambda
  x}$, do not have an analytical solution for the least squares. To
find the least squares for such functions numerical methods such as
the gradient descent have to be applied.

The suggested gradient descent algorithm can be improved in multiple
ways to converge faster. For example, one could adapt the step size to
the length of the gradient. These numerical tricks have already been
implemented in pre-defined functions. Generic optimization functions
such as \matlabfun{fminsearch()} have been implemented for arbitrary
objective functions, while the more specialized function
\matlabfun{lsqcurvefit()} is specifically designed for optimizations in
the least squares sense.
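For example, the cubic fit of this chapter could be handed over to
\matlabfun{fminsearch()} with an anonymous cost function. This is only
a sketch, assuming the data vectors \varcode{x} and \varcode{y} from
above and an arbitrary starting value for $c$:
\begin{verbatim}
% minimize the mean squared error of the cubic fit with fminsearch():
fcost = @(c) mean((y - c * x.^3).^2);  % anonymous cost function
cest = fminsearch(fcost, 1.0);         % start the search at c = 1
\end{verbatim}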
%\newpage
\begin{important}[Beware of secondary minima!]
  Finding the absolute minimum is not always as easy as in the case of
  fitting a straight line. Often, the error surface has secondary or
  local minima in which the gradient descent stops even though there
  is a more optimal solution, a global minimum that is lower than the
  local minimum. Starting from well-chosen initial positions is a good
  approach to avoid getting stuck in local minima. Also keep in mind
  that error surfaces tend to be simpler (fewer local minima) the fewer
  parameters are fitted from the data. Each additional parameter
  increases complexity and is computationally more expensive.
\end{important}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions