\chapter{\tr{Optimization and gradient descent}{Optimierung und Gradientenabstieg}}
To understand the behaviour of a given system, scientists often probe the
system with input signals and then try to explain the responses
through a model. Typically the model has a few parameters that specify
how input and output signals are related. The question arises which
combination of parameters is best suited to describe the relation of
in- and output. The process of finding the best parameter set is called
optimization or \enterm{curve fitting}. One rather generic
approach to this problem is the so-called gradient descent method, which
will be introduced in this chapter.
\begin{figure}[t]
\includegraphics[width=1\textwidth]{lin_regress}\hfill
\titlecaption{Example data suggesting a linear relation.}{A set of
input signals $x$, e.g. stimulus intensities, was used to probe a
system. The system's outputs $y$ to these inputs are noted
(left). Assuming a linear relation between $x$ and $y$ leaves us
with two parameters, the slope (center) and the intercept with the
y-axis (right panel).}\label{linregressiondatafig}
\end{figure}
The data plotted in \figref{linregressiondatafig} suggests a linear
relation between input and output of the investigated system. We thus
assume that the linear equation
\[y = f(x; m, b) = m\cdot x + b \] is an appropriate model to describe the system.
The linear equation has two free parameters $m$ and $b$, which denote
the slope and the y-intercept, respectively. In this chapter we will
use this example to illustrate the methods behind curve
fitting approaches and apply them to find the combination
of slope and intercept that best describes the system.
\section{The error function --- mean square error}
Before the optimization can be done we need to specify what is
considered an optimal fit. In our example we search for the parameter
combination that describes the relation of $x$ and $y$ best. What is
meant by this? Each input $x_i$ leads to a measured output $y_i$, and for
each $x_i$ there is a \emph{prediction} or \emph{estimation}
$y^{est}_i$. For each $x_i$, estimation and measurement are separated by a
certain distance $y_i - y_i^{est}$. In our example the estimation is
given by the linear equation $y_i^{est} = f(x_i;m,b)$. The best fit of
the model with the parameters $m$ and $b$ minimizes the
distances between observations $y_i$ and estimations $y_i^{est}$
(\figref{leastsquareerrorfig}).
We could require that the sum $\sum_{i=1}^N (y_i - y^{est}_i)$ is
minimized. This approach, however, will not work, since a small sum
can also be achieved if half of the measurements lie above and the
other half below the predicted line. Positive and negative errors
would cancel out and sum up to values close to zero. A better
approach is to consider the sum of the absolute values of the distances,
$\sum_{i=1}^N |y_i - y^{est}_i|$. This total error can only be small if
all deviations are indeed small, no matter whether they are above or below
the predicted line. Instead of the sum we could also require that the
\emph{average} distance
\begin{equation}
\label{meanabserror}
f_{dist}(\{(x_i, y_i)\}|\{y^{est}_i\}) = \frac{1}{N} \sum_{i=1}^N |y_i - y^{est}_i|
\end{equation}
is small. Commonly, the \enterm{mean squared distance} or
\enterm{mean squared error}
\begin{equation}
\label{meansquarederror}
f_{mse}(\{(x_i, y_i)\}|\{y^{est}_i\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}_i)^2
\end{equation}
is used (\figref{leastsquareerrorfig}). Similar to the absolute
distance, the squared error $(y_i - y_i^{est})^2$ is always
positive, so that error values do not cancel out. In addition, the square
punishes large deviations more strongly than small ones.
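In MATLAB this computation reduces to a single vectorized expression. The
following is only a minimal sketch (the function name is illustrative and
not the solution of the next exercise):
\begin{lstlisting}
% Sketch: mean squared error between a vector of observations y
% and a vector of corresponding predictions y_est.
function mse = mseSketch(y, y_est)
  mse = mean((y - y_est).^2);  % squared deviations, averaged over all data points
end
\end{lstlisting}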
\begin{exercise}{meanSquareError.m}{}\label{mseexercise}%
Implement a function \code{meanSquareError()}, that calculates the
\emph{mean square distance} between a vector of observations ($y$)
and respective predictions ($y^{est}$). \pagebreak[4]
\end{exercise}
\section{\tr{Objective function}{Zielfunktion}}
$f_{cost}(\{(x_i, y_i)\}|\{y^{est}_i\})$ is a so-called
\enterm{objective function} or \enterm{cost function}. We aim to adapt
the model parameters such that the error (the mean squared error) and
thus the \emph{objective function} is minimized. In Chapter~\ref{maximumlikelihoodchapter}
we will show that the minimization of the mean square error is
equivalent to maximizing the likelihood that the observations
originate from the model (assuming a normal distribution of the data
around the model prediction).
\begin{figure}[t]
\includegraphics[width=1\textwidth]{linear_least_squares}
\titlecaption{Estimating the \emph{mean square error}.} {The
deviation (\enterm{error}, orange) between the prediction (red
line) and the observations (blue dots) is calculated for each data
point (left). Then the deviations are squared and the average is
calculated (right).}
\label{leastsquareerrorfig}
\end{figure}
The error or \enterm{cost function} is not necessarily the mean
square distance but can be any function that maps the predictions to a
scalar value describing the quality of the fit. In the optimization
process we aim for the parameter combination that minimizes the cost
(error).
%%% Simple verbal example? Maybe from population ecology?
Replacing $y^{est}$ with the linear equation (the model) in
\eqnref{meansquarederror} we obtain:
\begin{eqnarray}
f_{cost}(\{(x_i, y_i)\}|m,b) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;m,b))^2 \label{msefunc} \\
& = & \frac{1}{N} \sum_{i=1}^N (y_i - m x_i - b)^2 \label{mseline}
\end{eqnarray}
That is, the mean squared error is computed from the pairs $(x_i, y_i)$ and the
parameters $m$ and $b$ of the linear equation. The optimization
process will now adjust $m$ and $b$ such that they lead to the smallest
error, the method of the \enterm{least square error}.
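A minimal sketch of such an objective function for the straight line might
look as follows (the function and variable names are only illustrative):
\begin{lstlisting}
% Sketch: objective function of the straight-line model.
% p is a 2-element vector with the slope p(1) and the intercept p(2),
% x and y are the data vectors.
function err = lineCostSketch(p, x, y)
  y_est = p(1) .* x + p(2);     % model predictions
  err = mean((y - y_est).^2);   % mean squared error
end
\end{lstlisting}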
\begin{exercise}{lsqError.m}{}
Implement the objective function \code{lsqError()} that applies the
linear equation as a model.
\begin{itemize}
\item The function takes three arguments. The first is a 2-element
vector that contains the values of the parameters \varcode{m} and
\varcode{b}. The second is a vector of $x$-values; the third contains
the measurements for each value of $x$, the respective $y$-values.
\item The function returns the mean square error \eqnref{mseline}.
\item The function should call the function \code{meanSquareError()}
defined in the previous exercise to calculate the error.
\end{itemize}
\end{exercise}
\section{Error surface}
The two model parameters $m$ and $b$ span a two-dimensional plane. For each
combination of $m$ and $b$ we can use \eqnref{mseline} to calculate the
associated error. We thus consider the objective function $f_{cost}(\{(x_i,
y_i)\}|m,b)$ as a function $f_{cost}(m,b)$ that maps the parameters
$m$ and $b$ to an error value.
For each point in this parameter plane we thus get an error value that we can
illustrate graphically in a 3-d surface plot, the error
surface. $m$ and $b$ are plotted on the $x$- and $y$-axis while the
third dimension indicates the error value
(\figref{errorsurfacefig}).
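As a rough sketch, the error surface can be computed by evaluating
\eqnref{mseline} on a grid of parameter values (the grid ranges chosen here
are arbitrary; the data vectors \varcode{x} and \varcode{y} are assumed to
be in the workspace):
\begin{lstlisting}
% Sketch: evaluate the mean squared error on a grid of slopes and intercepts.
slopes = -5:0.25:5;
intercepts = -30:1:30;
errors = zeros(length(slopes), length(intercepts));
for i = 1:length(slopes)
  for j = 1:length(intercepts)
    y_est = slopes(i) .* x + intercepts(j);
    errors(i, j) = mean((y - y_est).^2);
  end
end
surf(intercepts, slopes, errors)   % 3-d plot of the error surface
xlabel('intercept b'); ylabel('slope m'); zlabel('mean squared error');
\end{lstlisting}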
\begin{figure}[t]
\includegraphics[width=0.75\columnwidth]{error_surface.pdf}
\titlecaption{Error surface.}{The two model parameters $m$ and $b$
define the base area of the surface plot. For each parameter
combination of slope and intercept the error is calculated. The
resulting surface has a minimum which indicates the parameter
combination that best fits the data.}\label{errorsurfacefig}
\end{figure}
\begin{exercise}{errorSurface.m}{}\label{errorsurfaceexercise}%
Load the dataset \textit{lin\_regression.mat} into the workspace (20
data pairs contained in the vectors \varcode{x} and
\varcode{y}). Implement a script \file{errorSurface.m} that
calculates the mean square error between the data and a linear model and
illustrates the error surface using the \code{surf()} function
(consult the help to find out how to use \code{surf()}).
\end{exercise}
By looking at the error surface we can directly see the position of
the minimum and thus estimate the optimal parameter combination. How
can we use the error surface to guide an automatic optimization
process?
The obvious approach would be to calculate the error surface on a fine grid
of parameter combinations and then find the position of its minimum. This
approach, however, has several disadvantages: (I) it is computationally very
expensive to calculate the error for each parameter combination, and the
number of combinations increases exponentially with the number of free
parameters (also known as the ``curse of dimensionality''). (II) the accuracy
with which the best parameters can be estimated is limited by the resolution
with which the parameter space was sampled. If the grid is too coarse, one
might miss the minimum.
We thus want a procedure that finds the minimum with a minimal number
of computations.
\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
\includegraphics[width=0.33\textwidth]{derivative}
\hfill
\begin{minipage}[b]{0.63\textwidth}
The difference quotient
\begin{equation}
\label{difffrac}
m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
\end{equation}
of a function $y = f(x)$ is the slope of the secant (red) defined
by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ with the
distance $\Delta x$.
The slope of the function $y=f(x)$ at the position $x$ (yellow) is
given by the derivative $f'(x)$ of the function at that position.
It is defined by the difference quotient in the limit of
infinitesimally (orange) small distances $\Delta x$:
\begin{equation}
\label{derivative}
f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \end{equation}
\end{minipage}\vspace{2ex}
It is not possible to calculate the derivative \eqnref{derivative}
numerically, since the limit $\Delta x \to 0$ cannot be taken on a
computer. The derivative can only be approximated by the difference
quotient \eqnref{difffrac} with a sufficiently small $\Delta x$.
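For example, a minimal numerical sketch (the function $f(x)=x^3$, the
position and the step size are arbitrary choices):
\begin{lstlisting}
% Sketch: approximate the derivative of f(x) = x^3 at x = 2
% by the difference quotient with a small step dx.
f = @(x) x.^3;
x = 2.0;
dx = 1e-6;
dfdx = (f(x + dx) - f(x)) / dx;   % close to the analytical value 3*x^2 = 12
\end{lstlisting}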
\end{ibox}
\begin{ibox}[t]{\label{partialderivativebox}Partial derivative and gradient}
Some functions depend on more than a single variable, for example
\[ z = f(x,y) \]
depends on both $x$ and $y$. Using the partial derivatives
\[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
and
\[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
one can estimate the slope in the direction of each variable
individually by using the respective difference quotient
(Box~\ref{differentialquotientbox}). \vspace{1ex}
\begin{minipage}[t]{0.44\textwidth}
\mbox{}\\[-2ex]
\includegraphics[width=1\textwidth]{gradient}
\end{minipage}
\hfill
\begin{minipage}[t]{0.52\textwidth}
For example, the partial derivatives of
\[ f(x,y) = x^2+y^2 \] are
\[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]
The gradient is the vector constructed from the partial derivatives:
\[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
This vector points into the direction of the steepest ascent of
$f(x,y)$.
\end{minipage}
\vspace{1ex} The figure shows the contour lines of a bi-variate
Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ together with the gradient (thick
arrow) and the two partial derivatives (thin arrows) at three
different locations.
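The partial derivatives, and with them the gradient, can also be
approximated numerically with difference quotients. A minimal sketch for
this example (the point and the step size are arbitrary choices):
\begin{lstlisting}
% Sketch: numerical gradient of f(x,y) = x^2 + y^2 at the point (x0, y0).
f = @(x, y) x.^2 + y.^2;
x0 = 1.0; y0 = 2.0;
h = 1e-6;                                 % small step for the difference quotients
dfdx = (f(x0 + h, y0) - f(x0, y0)) / h;   % approximately 2*x0
dfdy = (f(x0, y0 + h) - f(x0, y0)) / h;   % approximately 2*y0
gradient = [dfdx, dfdy];
\end{lstlisting}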
\end{ibox}
\section{Gradient}
Imagine placing a small ball at some point on the error surface
(\figref{errorsurfacefig}). Naturally, it would roll down the steepest
slope and stop at the minimum of the error surface (if it had no
inertia). We will use this picture to develop an algorithm that finds its
way to the minimum of the objective function. The ball always
follows the steepest slope. Thus we need to figure out the direction of
the steepest slope at the position of the ball.
The \enterm{gradient} (Box~\ref{partialderivativebox}) of the
objective function is the vector
\[ \nabla f_{cost}(m,b) = \left( \frac{\partial f(m,b)}{\partial m},
\frac{\partial f(m,b)}{\partial b} \right) \]
that points into the direction of the steepest ascent of the objective function. Since
we want to reach the minimum, we simply choose the opposite direction.
The gradient is given by the partial derivatives
(Box~\ref{partialderivativebox}) with respect to the parameters $m$
and $b$ of the linear equation. There is no need to calculate it
analytically; the partial derivatives can be estimated
using the difference quotient (Box~\ref{differentialquotientbox}) with
small steps $\Delta m$ and $\Delta b$. For example, the partial
derivative with respect to $m$ is approximated by
\[\frac{\partial f_{cost}(m,b)}{\partial m} = \lim\limits_{\Delta m \to
0} \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m}
\approx \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \;
. \]
The length of the gradient indicates the steepness of the slope
(\figref{gradientquiverfig}). Since we want to go down the hill, we
choose the opposite direction.
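A minimal sketch of such a numerical gradient estimate, assuming an
objective function like \code{lsqError()} from the previous exercise is
available (the function name and the step size are illustrative):
\begin{lstlisting}
% Sketch: estimate the gradient of the objective function at the
% parameter values p = [m, b] via difference quotients.
function grad = gradientSketch(p, x, y)
  h = 1e-6;                     % small steps in m and b
  fp = lsqError(p, x, y);       % error at the current position
  dfdm = (lsqError([p(1) + h, p(2)], x, y) - fp) / h;
  dfdb = (lsqError([p(1), p(2) + h], x, y) - fp) / h;
  grad = [dfdm, dfdb];
end
\end{lstlisting}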
\begin{figure}[t]
\includegraphics[width=0.75\columnwidth]{error_gradient}
\titlecaption{Gradient of the error surface.} {Each arrow points
into the direction of the steepest ascent at different positions
of the error surface shown in \figref{errorsurfacefig}. The
contour lines in the background illustrate the error surface. Warm
colors indicate high errors, cold colors low error values. Each
contour line connects points of equal
error.}\label{gradientquiverfig}
\end{figure}
\begin{exercise}{lsqGradient.m}{}\label{gradientexercise}%
Implement a function \code{lsqGradient()}, that takes the set of
parameters $(m, b)$ of the linear equation as a two-element vector
and the $x$- and $y$-data as input arguments. The function should
return the gradient at that position.
\end{exercise}
\begin{exercise}{errorGradient.m}{}
Use the functions from the previous
exercises~\ref{errorsurfaceexercise} and~\ref{gradientexercise} to
estimate and plot the error surface including the gradients. Choose
a subset of parameter combinations for which you plot the
gradient. Vectors in space can be easily plotted using the function
\code{quiver()}.
\end{exercise}
\section{Gradient descent}
Finally, we are able to implement the optimization itself. By now it
should be obvious why it is called the gradient descent method. All
ingredients are already there. We need: 1. the error function
(\code{meanSquareError()}), 2. the objective function
(\code{lsqError()}), and 3. the gradient (\code{lsqGradient()}). The
algorithm of the gradient descent is as follows (a code sketch is given after the list):
\begin{enumerate}
\item Start with any given combination of the parameters $m$ and $b$ ($p_0 = (m_0,
b_0)$).
\item \label{computegradient} Calculate the gradient at the current
position $p_i$.
\item If the length of the gradient falls below a certain value, we
assume that we have reached the minimum and stop the search. We are
actually looking for the point at which the length of the gradient
is zero, but finding exactly zero is impossible for numerical reasons. We
thus apply a threshold below which we consider ourselves sufficiently close to zero
(e.g. \varcode{norm(gradient) < 0.1}).
\item \label{gradientstep} If the length of the gradient exceeds the
threshold, we take a small step in the opposite direction, scaled by the
step size $\epsilon = 0.01$:
\[p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(m_i, b_i)\]
\item Repeat steps \ref{computegradient} --
\ref{gradientstep}.
\end{enumerate}
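Put together, the core of the algorithm is a simple loop. The following is
only a sketch, assuming \code{lsqGradient()} from
exercise~\ref{gradientexercise} returns the gradient as a two-element
vector and the data vectors \varcode{x} and \varcode{y} are in the
workspace; the starting values, the step size and the threshold are
arbitrary choices:
\begin{lstlisting}
% Sketch of the gradient descent loop.
p = [-2.0, 10.0];                      % arbitrary starting values for m and b
stepsize = 0.01;                       % the step size epsilon
gradient = lsqGradient(p, x, y);
while norm(gradient) > 0.1             % stop when the gradient is small enough
  p = p - stepsize .* gradient;        % step against the gradient
  gradient = lsqGradient(p, x, y);     % gradient at the new position
end
\end{lstlisting}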
\Figref{gradientdescentfig} illustrates the gradient descent (the path
the imaginary ball takes on its way to the minimum). Starting at an
arbitrary position on the error surface, we update the position as long
as the length of the gradient at that position exceeds a certain
threshold. Where the slope is very steep, the change in position (the
distance between the red dots in \figref{gradientdescentfig}) is
large.
\begin{figure}[t]
\includegraphics[width=0.6\columnwidth]{gradient_descent}
\titlecaption{Gradient descent.}{The algorithm starts at an
arbitrary position. At each point the gradient is estimated and
the position is updated as long as the length of the gradient is
sufficiently large. The dots show the positions after each
iteration of the algorithm.} \label{gradientdescentfig}
\end{figure}
\begin{exercise}{gradientDescent.m}{}
Implement the gradient descent for fitting the linear
equation to the measured data in the file \file{lin\_regression.mat}.
\begin{enumerate}
\item Store the error value for each iteration.
\item Create a plot that shows the error value as a function of the
number of optimization steps.
\item Create a plot that shows the measured data and the best fit.
\end{enumerate}
\end{exercise}
\section{Summary}
The gradient descent is an important numerical method for solving optimization
problems. It is used to find the minimum of an objective
function.
In the case of the linear equation the error surface (using the mean
squared error) has a single, clearly defined minimum. The position of the
minimum can even be calculated analytically. The next chapter will
introduce how this can be done without using the gradient descent
\matlabfun{polyfit()}.
Problems in which the parameters enter the model nonlinearly, e.g. the
rate $\lambda$ in the exponential function $f(x;\lambda) =
e^{\lambda x}$, usually do not have an analytical solution. To find minima
of such objective functions, numerical methods such as the gradient descent have
to be applied.
The suggested gradient descent algorithm can be improved in multiple
ways to converge faster. For example, one could adapt the step size to
the length of the gradient. Such numerical tricks have already been
implemented in pre-defined functions. Generic optimization functions
such as \matlabfun{fminsearch()} work with arbitrary
objective functions, while more specialized functions are specifically
designed for optimizations in the least square error sense
\matlabfun{lsqcurvefit()}.
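For example, \matlabfun{fminsearch()} only needs the objective function and
a starting point. A minimal sketch, assuming the objective function
\code{lsqError()} and the data vectors \varcode{x} and \varcode{y} from the
exercises are available (the starting values are arbitrary):
\begin{lstlisting}
% Sketch: let fminsearch() minimize the objective function.
pstart = [-2.0, 10.0];                              % starting values for m and b
pbest = fminsearch(@(p) lsqError(p, x, y), pstart); % best-fitting m and b
\end{lstlisting}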
%\newpage
\begin{important}[Beware of secondary minima!]
Finding the absolute minimum is not always as easy as in the case of
the linear equation. Often, the error surface has secondary or local
minima in which the gradient descent gets stuck even though there is a
better solution. Starting from a good initial position is one
approach to avoid getting stuck in local minima. Furthermore, it is
advisable to optimize as few parameters as possible, since each additional
parameter increases the complexity and the computational costs.
\end{important}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions