\chapter{\tr{Optimization and gradient descent}{Optimierung und Gradientenabstieg}}
To understand the behaviour of a given system, the sciences often
probe the system with input signals and then try to explain the
responses through a model. Typically, the model has a few parameters
that specify how input and output signals are related. The question
arises which combination of parameters is best suited to describe the
relation between input and output. The process of finding the best
parameter set is called optimization or \enterm{curve fitting}. One
rather generic approach to this problem is the so-called gradient
descent method, which will be introduced in this chapter.

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{lin_regress}\hfill
  \titlecaption{Example data suggesting a linear relation.}{A set of
    input signals $x$, e.g. stimulus intensities, was used to probe a
    system. The system's outputs $y$ to these inputs are noted
    (left). Assuming a linear relation between $x$ and $y$ leaves us
    with two parameters, the slope (center) and the intercept with the
    y-axis (right panel).}\label{linregressiondatafig}
\end{figure}

The data plotted in \figref{linregressiondatafig} suggest a linear
relation between input and output of the investigated system. We thus
assume that the linear equation
\[y = f(x; m, b) = m\cdot x + b \]
is an appropriate model to describe the system. The linear equation
has two free parameters, $m$ and $b$, which denote the slope and the
y-intercept, respectively. In this chapter we will use this example
to illustrate the methods behind curve fitting and apply them to find
the combination of slope and intercept that best describes the
system.
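
Such a model is straightforward to express in code. As a minimal
sketch (the names \varcode{f}, \varcode{xs}, and \varcode{yest} are
chosen here just for illustration), the prediction of the linear
model can be written as a function handle:
\begin{lstlisting}
f = @(x, m, b) m .* x + b;    % linear model: slope m, intercept b
xs = [0.5, 1.0, 1.5];         % some example inputs
yest = f(xs, 2.0, 1.0)        % predictions for slope 2 and intercept 1
\end{lstlisting}
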
\section{The error function --- mean square error}

Before the optimization can be done we need to specify what is
considered an optimal fit. In our example we search for the parameter
combination that describes the relation of $x$ and $y$ best. What is
meant by this? Each input $x_i$ leads to an output $y_i$, and for each
$x_i$ there is a \emph{prediction} or \emph{estimation}
$y^{est}_i$. For each $x_i$, estimation and measurement will have a
certain distance $y_i - y_i^{est}$. In our example the estimation is
given by the linear equation $y_i^{est} = f(x_i;m,b)$. The best fit of
the model with the parameters $m$ and $b$ leads to the minimal
distances between observation $y_i$ and estimation $y_i^{est}$
(\figref{leastsquareerrorfig}).

We could require that the sum $\sum_{i=1}^N y_i - y^{est}_i$ is
minimized. This approach, however, will not work, since a minimal sum
can also be achieved if half of the measurements lie above and the
other half below the predicted line. Positive and negative errors
would cancel out and sum up to values close to zero. A better
approach is to consider the absolute value of the distance
$\sum_{i=1}^N |y_i - y^{est}_i|$. The total error can only be small if
all deviations are indeed small, no matter whether they are above or
below the predicted line. Instead of the sum we could also require
that the \emph{average} distance
\begin{equation}
  \label{meanabserror}
  f_{dist}(\{(x_i, y_i)\}|\{y^{est}_i\}) = \frac{1}{N} \sum_{i=1}^N |y_i - y^{est}_i|
\end{equation}
should be small. Commonly, the \enterm{mean squared distance} or
\enterm{mean squared error}
\begin{equation}
  \label{meansquarederror}
  f_{mse}(\{(x_i, y_i)\}|\{y^{est}_i\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}_i)^2
\end{equation}
is used (\figref{leastsquareerrorfig}). Similar to the absolute
distance, the squared error $(y_i - y_i^{est})^2$ is always positive,
so that error values do not cancel out. The square further punishes
large deviations.

\begin{exercise}{meanSquareError.m}{}\label{mseexercise}%
  Implement a function \code{meanSquareError()} that calculates the
  \emph{mean squared error} between a vector of observations ($y$)
  and the respective predictions ($y^{est}$). \pagebreak[4]
\end{exercise}
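
One possible minimal sketch of such a function (the argument names
are chosen here for illustration) could look like this:
\begin{lstlisting}
function mse = meanSquareError(y, yest)
% mean squared error between observations y and predictions yest
    mse = mean((y - yest).^2);
end
\end{lstlisting}
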
\section{\tr{Objective function}{Zielfunktion}}

$f_{cost}(\{(x_i, y_i)\}|\{y^{est}_i\})$ is a so-called
\enterm{objective function} or \enterm{cost function}. We aim to adapt
the model parameters such that the error (the mean squared error) and
thus the \emph{objective function} is minimized. In
Chapter~\ref{maximumlikelihoodchapter} we will show that minimizing
the mean squared error is equivalent to maximizing the likelihood that
the observations originate from the model (assuming a normal
distribution of the data around the model prediction).

\begin{figure}[t]
  \includegraphics[width=1\textwidth]{linear_least_squares}
  \titlecaption{Estimating the \emph{mean squared error}.}{The
    deviation (\enterm{error}, orange) between the prediction (red
    line) and the observations (blue dots) is calculated for each data
    point (left). Then the deviations are squared and the average is
    calculated (right).}
  \label{leastsquareerrorfig}
\end{figure}

The error or \enterm{cost function} is not necessarily the mean
squared distance but can be any function that maps the predictions to
a scalar value describing the quality of the fit. In the optimization
process we aim for the parameter combination that minimizes the costs
(the error).

%%% Simple verbal example? Perhaps from population ecology?
Replacing $y^{est}$ with the linear equation (the model) in
\eqnref{meansquarederror} yields:
\begin{eqnarray}
  f_{cost}(\{(x_i, y_i)\}|m,b) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;m,b))^2 \label{msefunc} \\
  & = & \frac{1}{N} \sum_{i=1}^N (y_i - m x_i - b)^2 \label{mseline}
\end{eqnarray}

That is, the mean squared error is determined by the data pairs
$(x_i, y_i)$ and the parameters $m$ and $b$ of the linear
equation. The optimization process will now adjust $m$ and $b$ such
that the error becomes minimal. This is the method of the
\enterm{least square error}.

\begin{exercise}{lsqError.m}{}
  Implement the objective function \code{lsqError()} that applies the
  linear equation as a model.
  \begin{itemize}
  \item The function takes three arguments. The first is a 2-element
    vector that contains the values of the parameters \varcode{m} and
    \varcode{b}. The second is a vector of $x$-values and the third
    contains the measurements for each value of $x$, the respective
    $y$-values.
  \item The function returns the mean squared error \eqnref{mseline}.
  \item The function should call the function \code{meanSquareError()}
    defined in the previous exercise to calculate the error.
  \end{itemize}
\end{exercise}
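
Again, a minimal sketch of how such an objective function could be
written, assuming the \code{meanSquareError()} function from above is
available:
\begin{lstlisting}
function error = lsqError(parameter, x, y)
% mean squared error of the linear model with parameter = [m, b]
    yest = parameter(1) .* x + parameter(2);   % model predictions
    error = meanSquareError(y, yest);
end
\end{lstlisting}
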
\section{Error surface}

The two parameters of the model span a surface. For each combination
of $m$ and $b$ we can use \eqnref{mseline} to calculate the associated
error. We thus consider the objective function $f_{cost}(\{(x_i,
y_i)\}|m,b)$ as a function $f_{cost}(m,b)$ that maps the variables
$m$ and $b$ to an error value.

Thus, for each point in this parameter space we get an error value
that we can illustrate graphically using a 3-d surface plot, i.e. the
error surface. $m$ and $b$ are plotted on the $x$- and $y$-axis while
the third dimension indicates the error value
(\figref{errorsurfacefig}).

\begin{figure}[t]
  \includegraphics[width=0.75\columnwidth]{error_surface.pdf}
  \titlecaption{Error surface.}{The two model parameters $m$ and $b$
    define the base area of the surface plot. For each parameter
    combination of slope and intercept the error is calculated. The
    resulting surface has a minimum which indicates the parameter
    combination that best fits the data.}\label{errorsurfacefig}
\end{figure}

\begin{exercise}{errorSurface.m}{}\label{errorsurfaceexercise}%
  Load the dataset \file{lin\_regression.mat} into the workspace (20
  data pairs contained in the vectors \varcode{x} and
  \varcode{y}). Implement a script \file{errorSurface.m} that
  calculates the mean squared error between the data and a linear
  model and illustrates the error surface using the \code{surf()}
  function (consult the help to find out how to use \code{surf()}).
\end{exercise}
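
A possible sketch of such a script (the ranges of slopes and
intercepts are arbitrary choices for illustration):
\begin{lstlisting}
load('lin_regression.mat');                % provides the vectors x and y
slopes = -5:0.25:5;                        % candidate slopes m
intercepts = -30:1:30;                     % candidate intercepts b
errors = zeros(length(slopes), length(intercepts));
for i = 1:length(slopes)
    for j = 1:length(intercepts)
        errors(i, j) = lsqError([slopes(i), intercepts(j)], x, y);
    end
end
surf(intercepts, slopes, errors);          % rows: slopes, columns: intercepts
xlabel('intercept b');
ylabel('slope m');
zlabel('mean squared error');
\end{lstlisting}
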
By looking at the error surface we can directly see the position of
the minimum and thus estimate the optimal parameter combination. How
can we use the error surface to guide an automatic optimization
process?

The obvious approach would be to calculate the error surface and then
find the position of the minimum. This approach, however, has several
disadvantages: (I) it is computationally very expensive to calculate
the error for each parameter combination. The number of combinations
increases exponentially with the number of free parameters (also known
as the ``curse of dimensionality''). (II) The accuracy with which the
best parameters can be estimated is limited by the resolution with
which the parameter space is sampled. If the grid is too coarse, one
might miss the minimum.

We thus want a procedure that finds the minimum with a minimal number
of computations.

\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
  \includegraphics[width=0.33\textwidth]{derivative}
  \hfill
  \begin{minipage}[b]{0.63\textwidth}
    The difference quotient
    \begin{equation}
      \label{difffrac}
      m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
    of a function $y = f(x)$ is the slope of the secant (red) defined
    by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ with the
    distance $\Delta x$.

    The slope of the function $y=f(x)$ at the position $x$ (yellow) is
    given by the derivative $f'(x)$ of the function at that position.
    It is defined by the difference quotient in the limit of
    infinitesimally (orange) small distances $\Delta x$:
    \begin{equation}
      \label{derivative}
      f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
  \end{minipage}\vspace{2ex}

  The limit in \eqnref{derivative} cannot be taken numerically. The
  derivative can, however, be approximated by the difference quotient
  \eqnref{difffrac} with a sufficiently small $\Delta x$.
\end{ibox}

\begin{ibox}[t]{\label{partialderivativebox}Partial derivative and gradient}
  Some functions depend on more than a single variable:
  \[ z = f(x,y) \]
  depends, for example, on $x$ and $y$. Using the partial derivatives
  \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
  and
  \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
  one can estimate the slope in the direction of each variable
  individually by using the respective difference quotient
  (Box~\ref{differentialquotientbox}). \vspace{1ex}

  \begin{minipage}[t]{0.44\textwidth}
    \mbox{}\\[-2ex]
    \includegraphics[width=1\textwidth]{gradient}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.52\textwidth}
    For example, the partial derivatives of
    \[ f(x,y) = x^2+y^2 \] are
    \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]

    The gradient is a vector that is constructed from the partial derivatives:
    \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
    This vector points in the direction of the steepest ascent of
    $f(x,y)$.
  \end{minipage}

  \vspace{1ex} The figure shows the contour lines of a bi-variate
  Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ together with the gradient
  (thick arrow) and the two partial derivatives (thin arrows) at three
  different locations.
\end{ibox}
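
To get a feeling for these definitions, the gradient of the example
function $f(x,y) = x^2+y^2$ can be approximated numerically with
difference quotients and compared to the analytical result $(2x,
2y)$. A minimal sketch (test point and step size are arbitrary
choices):
\begin{lstlisting}
f = @(x, y) x.^2 + y.^2;              % example function
x = 1.5; y = -0.5;                    % test point
h = 1e-6;                             % small step for the difference quotients
dfdx = (f(x + h, y) - f(x, y)) / h;   % partial derivative with respect to x
dfdy = (f(x, y + h) - f(x, y)) / h;   % partial derivative with respect to y
gradient = [dfdx, dfdy]               % approximately [2*x, 2*y] = [3, -1]
\end{lstlisting}
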
\section{Gradient}
Imagine placing a small ball at some point on the error surface
\figref{errorsurfacefig}. Naturally, it would follow the steepest
slope and would stop at the minimum of the error surface (if it had no
inertia). We will use this picture to develop an algorithm that finds
its way to the minimum of the objective function. The ball always
follows the steepest slope. Thus we need to figure out the direction
of the steepest slope at the position of the ball.

The \enterm{gradient} (Box~\ref{partialderivativebox}) of the
objective function is the vector
\[ \nabla f_{cost}(m,b) = \left( \frac{\partial f_{cost}(m,b)}{\partial m},
  \frac{\partial f_{cost}(m,b)}{\partial b} \right) \]
that points in the direction of the steepest ascent of the objective
function. Since we want to reach the minimum, we simply choose the
opposite direction.

The gradient is given by the partial derivatives
(Box~\ref{partialderivativebox}) with respect to the parameters $m$
and $b$ of the linear equation. There is no need to calculate them
analytically; they can be estimated using the difference quotient
(Box~\ref{differentialquotientbox}) with small steps $\Delta m$ and
$\Delta b$. For example, the partial derivative with respect to $m$:
\[\frac{\partial f_{cost}(m,b)}{\partial m} = \lim\limits_{\Delta m \to
  0} \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m}
  \approx \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \; . \]

The length of the gradient indicates the steepness of the slope
(\figref{gradientquiverfig}). Since we want to go down the hill, we
choose the opposite direction.

\begin{figure}[t]
  \includegraphics[width=0.75\columnwidth]{error_gradient}
  \titlecaption{Gradient of the error surface.}{Each arrow points in
    the direction of the steepest ascent at different positions of
    the error surface shown in \figref{errorsurfacefig}. The contour
    lines in the background illustrate the error surface. Warm colors
    indicate high errors, colder colors low error values. Each
    contour line connects points of equal
    error.}\label{gradientquiverfig}
\end{figure}

\begin{exercise}{lsqGradient.m}{}\label{gradientexercise}%
  Implement a function \code{lsqGradient()} that takes the set of
  parameters $(m, b)$ of the linear equation as a two-element vector
  and the $x$- and $y$-data as input arguments. The function should
  return the gradient at that position.
\end{exercise}
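
One way to sketch such a function is to approximate both partial
derivatives with difference quotients (the step size \varcode{h} is
an arbitrary small value):
\begin{lstlisting}
function gradient = lsqGradient(parameter, x, y)
% estimate the gradient of the cost function at parameter = [m, b]
    h = 1e-6;                              % small steps in m and b
    dfdm = (lsqError([parameter(1) + h, parameter(2)], x, y) ...
            - lsqError(parameter, x, y)) / h;
    dfdb = (lsqError([parameter(1), parameter(2) + h], x, y) ...
            - lsqError(parameter, x, y)) / h;
    gradient = [dfdm, dfdb];
end
\end{lstlisting}
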
\begin{exercise}{errorGradient.m}{}
  Use the functions from the previous
  exercises~\ref{errorsurfaceexercise} and~\ref{gradientexercise} to
  estimate and plot the error surface including the gradients. Choose
  a subset of parameter combinations for which you plot the
  gradient. Vectors in space can be easily plotted using the function
  \code{quiver()}.
\end{exercise}
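
A rough sketch of the plotting part, assuming the vectors
\varcode{slopes} and \varcode{intercepts}, the matrix
\varcode{errors}, and the data \varcode{x} and \varcode{y} from the
error-surface script above:
\begin{lstlisting}
[B, M] = meshgrid(intercepts(1:5:end), slopes(1:5:end));  % coarser grid
U = zeros(size(B));                        % gradient along the intercept axis
V = zeros(size(B));                        % gradient along the slope axis
for k = 1:numel(B)
    g = lsqGradient([M(k), B(k)], x, y);
    V(k) = g(1);                           % derivative with respect to m
    U(k) = g(2);                           % derivative with respect to b
end
contour(intercepts, slopes, errors, 50);   % error surface as contour lines
hold on;
quiver(B, M, U, V);                        % arrows point uphill
hold off;
xlabel('intercept b');
ylabel('slope m');
\end{lstlisting}
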
\section{Gradient descent}
Finally, we are able to implement the optimization itself. By now it
should be obvious why it is called the gradient descent method. All
ingredients are already there. We need: 1. the error function
(\code{meanSquareError()}), 2. the objective function
(\code{lsqError()}), and 3. the gradient (\code{lsqGradient()}). The
algorithm of the gradient descent works as follows (a small code
sketch follows the list):

\begin{enumerate}
\item Start with any given combination of the parameters $m$ and $b$
  ($p_0 = (m_0, b_0)$).
\item \label{computegradient} Calculate the gradient at the current
  position $p_i$.
\item If the length of the gradient falls below a certain value, we
  assume to have reached the minimum and stop the search. We are
  actually looking for the point at which the length of the gradient
  is zero, but finding exactly zero is impossible for numerical
  reasons. We thus apply a threshold below which we are sufficiently
  close to zero (e.g. \varcode{norm(gradient) < 0.1}).
\item \label{gradientstep} If the length of the gradient exceeds the
  threshold we take a small step in the opposite direction
  ($\epsilon = 0.01$):
  \[p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(m_i, b_i)\]
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
\end{enumerate}
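
A minimal sketch of these steps (starting point, threshold, and step
size $\epsilon$ are arbitrary choices for illustration):
\begin{lstlisting}
position = [-2.0, 10.0];                 % arbitrary start values for m and b
stepsize = 0.01;                         % epsilon
threshold = 0.1;                         % minimal gradient length
gradient = lsqGradient(position, x, y);
while norm(gradient) > threshold
    position = position - stepsize .* gradient;  % step against the gradient
    gradient = lsqGradient(position, x, y);
end
position                                 % best-fitting slope and intercept
\end{lstlisting}
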
\Figref{gradientdescentfig} illustrates the gradient descent (the path
the imaginary ball has chosen to reach the minimum). Starting at an
arbitrary position on the error surface we change the position as long
as the gradient at that position is larger than a certain
threshold. If the slope is very steep, the change in position (the
distance between the red dots in \figref{gradientdescentfig}) is
large.

\begin{figure}[t]
  \includegraphics[width=0.6\columnwidth]{gradient_descent}
  \titlecaption{Gradient descent.}{The algorithm starts at an
    arbitrary position. At each point the gradient is estimated and
    the position is updated as long as the length of the gradient is
    sufficiently large. The dots show the positions after each
    iteration of the algorithm.} \label{gradientdescentfig}
\end{figure}

\begin{exercise}{gradientDescent.m}{}
  Implement the gradient descent for the problem of fitting the linear
  equation to the measured data in the file \file{lin\_regression.mat}.
  \begin{enumerate}
  \item Store the error value for each iteration.
  \item Create a plot that shows the error value as a function of the
    number of optimization steps.
  \item Create a plot that shows the measured data and the best fit.
  \end{enumerate}
\end{exercise}
\section{Summary}

The gradient descent is an important numerical method for solving
optimization problems. It is used to find the minimum of an objective
function.

In the case of the linear equation the error surface (using the mean
squared error) has a single, clearly defined minimum. The position of
the minimum can be calculated analytically. The next chapter will
introduce how this can be done without using the gradient descent
\matlabfun{polyfit()}.
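
For the data of this chapter this boils down to a single call
(\code{polyfit()} with degree one returns the slope and the intercept
of the best-fitting line):
\begin{lstlisting}
p = polyfit(x, y, 1);   % fit a line to the data in x and y
m = p(1);               % slope
b = p(2);               % intercept
\end{lstlisting}
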
Problems in which the parameters enter the model nonlinearly, e.g. the
rate $\lambda$ in the exponential function $f(x;\lambda) =
e^{\lambda x}$, do not have an analytical solution. To find the minima
of such objective functions, numerical methods such as the gradient
descent have to be applied.

The gradient descent algorithm suggested here can be improved in
multiple ways to converge faster. For example, one could adapt the
step size to the length of the gradient. Such numerical tricks have
already been implemented in pre-defined functions. Generic
optimization functions such as \matlabfun{fminsearch()} work with
arbitrary objective functions, while more specialized functions are
specifically designed for optimizations in the least square error
sense \matlabfun{lsqcurvefit()}.
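
For example, \code{fminsearch()} only needs an objective function and
a starting point. A brief sketch using the objective function from
above (the start values are an arbitrary choice):
\begin{lstlisting}
pstart = [1.0, 0.0];                                  % initial guess for m and b
pbest = fminsearch(@(p) lsqError(p, x, y), pstart);   % returns the best [m, b]
\end{lstlisting}
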
%\newpage
\begin{important}[Beware of secondary minima!]
  Finding the absolute minimum is not always as easy as in the case of
  the linear equation. Often, the error surface has secondary or local
  minima in which the gradient descent gets stuck, even though there
  is a better solution. Starting from good initial positions is a good
  approach to avoid getting stuck in local minima. Further, it is
  easier to optimize as few parameters as possible. Each additional
  parameter increases the complexity and the computational costs.
\end{important}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions