690 lines
34 KiB
TeX
690 lines
34 KiB
TeX
\chapter{Optimization and gradient descent}
|
|
\exercisechapter{Optimization and gradient descent}
|
|
|
|
Optimization problems arise in many different contexts. For example,
|
|
to understand the behavior of a given neuronal system, the system is
|
|
probed with a range of input signals and then the resulting responses
|
|
are measured. This input-output relation can be described by a
|
|
model. Such a model can be a simple function that maps the input
|
|
signals to corresponding responses, it can be a filter, or a system of
|
|
differential equations. In any case, the model has some parameters
|
|
that specify how input and output signals are related. Which
|
|
combination of parameter values are best suited to describe the
|
|
input-output relation? The process of finding the best parameter
|
|
values is an optimization problem. For a simple parameterized function
|
|
that maps input to output values, this is the special case of a
|
|
\enterm{curve fitting} problem, where the average distance between the
|
|
curve and the response values is minimized. One basic numerical method
|
|
used for such optimization problems is the so called gradient descent,
|
|
which is introduced in this chapter.
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics{cubicfunc}
|
|
\titlecaption{Example data suggesting a cubic relation.}{The length
|
|
$x$ and weight $y$ of $n=34$ male tigers (blue, left). Assuming a
|
|
cubic relation between size and weight leaves us with a single
|
|
free parameters, a scaling factor. The cubic relation is shown for
|
|
a few values of this scaling factor (orange and red,
|
|
right).}\label{cubicdatafig}
|
|
\end{figure}
|
|
|
|
For demonstrating the curve-fitting problem let's take the simple
|
|
example of weights and sizes measured for a number of male tigers
|
|
(\figref{cubicdatafig}). Weight $y$ is proportional to volume
|
|
$V$ via the density $\rho$. The volume $V$ of any object is
|
|
proportional to its length $x$ cubed. The factor $\alpha$ relating
|
|
volume and size cubed depends on the shape of the object and we do not
|
|
know this factor for tigers. For the data set we thus expect a cubic
|
|
relation between weight and length
|
|
\begin{equation}
|
|
\label{cubicfunc}
|
|
y = f(x; c) = c\cdot x^3
|
|
\end{equation}
|
|
where $c=\rho\alpha$, the product of a tiger's density and form
|
|
factor, is the only free parameter in the relation. We would like to
|
|
find out which value of $c$ best describes the measured data. In the
|
|
following we use this example to illustrate the gradient descent as a
|
|
basic method for finding such an optimal parameter.
|
|
|
|
|
|
\section{The error function --- mean squared error}
|
|
|
|
Before we go ahead finding the optimal parameter value we need to
|
|
specify what exactly we consider as an optimal fit. In our example we
|
|
search the parameter that describes the relation of $x$ and $y$
|
|
best. What is meant by this? The length $x_i$ of each tiger is
|
|
associated with a weight $y_i$ and for each $x_i$ we have a
|
|
\emph{prediction} or \emph{estimation} $y^{est}(x_i)$ of the weight by
|
|
the model \eqnref{cubicfunc} for a specific value of the parameter
|
|
$c$. Prediction and actual data value ideally match (in a perfect
|
|
noise-free world), but in general the estimate and measurement are
|
|
separated by some distance or error $y_i - y^{est}(x_i)$. In our
|
|
example the estimate of the weight for the length $x_i$ is given by
|
|
equation \eqref{cubicfunc} $y^{est}(x_i) = f(x_i;c)$. The best fitting
|
|
model with parameter $c$ is the one that somehow minimizes the
|
|
distances between observations $y_i$ and corresponding estimations
|
|
$y^{est}(x_i)$ (\figref{cubicerrorsfig}).
|
|
|
|
As a first guess we could simply minimize the sum of the distances,
|
|
$\sum_{i=1}^N y_i - y^{est}(x_i)$. This, however, does not work
|
|
because positive and negative errors would cancel out, no matter how
|
|
large they are, and sum up to values close to zero. Better is to sum
|
|
over absolute distances: $\sum_{i=1}^N |y_i - y^{est}(x_i)|$. This sum
|
|
can only be small if all deviations are indeed small no matter if they
|
|
are above or below the prediction. The sum of the squared distances,
|
|
$\sum_{i=1}^N (y_i - y^{est}(x_i))^2$, turns out to be an even better
|
|
choice. Instead of the sum we could also minimize the average distance
|
|
\begin{equation}
|
|
\label{meansquarederror}
|
|
f_{mse}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}(x_i))^2
|
|
\end{equation}
|
|
This is known as the \enterm[mean squared error]{mean squared error}
|
|
(\determ[quadratischer Fehler!mittlerer]{mittlerer quadratischer
|
|
Fehler}). Similar to the absolute distance, the square of the errors
|
|
is always positive and thus positive and negative error values do not
|
|
cancel each other out. In addition, the square punishes large
|
|
deviations over small deviations. In
|
|
chapter~\ref{maximumlikelihoodchapter} we show that minimizing the
|
|
mean squared error is equivalent to maximizing the likelihood that the
|
|
observations originate from the model, if the data are normally
|
|
distributed around the model prediction.
|
|
|
|
\begin{exercise}{meansquarederrorline.m}{}\label{mseexercise}
|
|
Simulate $n=40$ tigers ranging from 2.2 to 3.9\,m in size and store
|
|
these sizes in a vector \varcode{x}. Compute the corresponding
|
|
predicted weights \varcode{yest} for each tiger according to
|
|
\eqnref{cubicfunc} with $c=6$\,\kilo\gram\per\meter\cubed. From the
|
|
predictions generate simulated measurements of the tiger's weights
|
|
\varcode{y}, by adding normally distributed random numbers to the
|
|
predictions scaled to a standard deviation of 50\,\kilo\gram.
|
|
|
|
Compute the \emph{mean squared error} between \varcode{y} and
|
|
\varcode{yest} in a single line of code.
|
|
\end{exercise}
|
|
|
|
|
|
\section{Cost function}
|
|
|
|
The mean squared error is a so called \enterm{objective function} or
|
|
\enterm{cost function} (\determ{Kostenfunktion}). A cost function
|
|
assigns to a model prediction $\{y^{est}(x_i)\}$ for a given data set
|
|
$\{(x_i, y_i)\}$ a single scalar value that we want to minimize. Here
|
|
we aim to adapt the model parameter to minimize the mean squared error
|
|
\eqref{meansquarederror}. In general, the \enterm{cost function} can
|
|
be any function that describes the quality of a fit by mapping the
|
|
data and the predictions to a single scalar value.
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics{cubicerrors}
|
|
\titlecaption{Estimating the \emph{mean squared error}.} {The
|
|
deviation error (orange) between the prediction (red line) and the
|
|
observations (blue dots) is calculated for each data point
|
|
(left). Then the deviations are squared and the average is
|
|
calculated (right).}
|
|
\label{cubicerrorsfig}
|
|
\end{figure}
|
|
|
|
Replacing $y^{est}$ in the mean squared error \eqref{meansquarederror}
|
|
with our cubic model \eqref{cubicfunc}, the cost function reads
|
|
\begin{eqnarray}
|
|
f_{cost}(c|\{(x_i, y_i)\}) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;c))^2 \label{msefunc} \\
|
|
& = & \frac{1}{N} \sum_{i=1}^N (y_i - c x_i^3)^2 \label{msecube}
|
|
\end{eqnarray}
|
|
The optimization process tries to find a value for the factor $c$ such
|
|
that the cost function is minimized. With the mean squared error as
|
|
the cost function this optimization process is also called method of
|
|
\enterm{least squares} (\determ[quadratischer
|
|
Fehler!kleinster]{Methode der kleinsten Quadrate}).
|
|
|
|
\begin{exercise}{meanSquaredErrorCubic.m}{}
|
|
Implement the objective function \eqref{msecube} as a function
|
|
\varcode{meanSquaredErrorCubic()}. The function takes three
|
|
arguments. The first is a vector of $x$-values and the second
|
|
contains the measurements $y$ for each value of $x$. The third
|
|
argument is the value of the factor $c$. The function returns the
|
|
mean squared error.
|
|
\end{exercise}
|
|
|
|
|
|
\section{Graph of the cost function}
|
|
For each value of the parameter $c$ of the model we can use
|
|
\eqnref{msecube} to calculate the corresponding value of the cost
|
|
function. The cost function $f_{cost}(c|\{(x_i, y_i)\}|)$ is a
|
|
function $f_{cost}(c)$ that maps the parameter value $c$ to a scalar
|
|
error value. For a given data set we thus can simply plot the cost
|
|
function as a function of the parameter $c$ (\figref{cubiccostfig}).
|
|
|
|
\begin{exercise}{plotcubiccosts.m}{}
|
|
Calculate the mean squared error between the data and the cubic
|
|
function for a range of values of the factor $c$ using the
|
|
\varcode{meanSquaredErrorCubic()} function from the previous
|
|
exercise. Plot the graph of the cost function.
|
|
\end{exercise}
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics{cubiccost}
|
|
\titlecaption{Minimum of the cost function.}{For a given data set
|
|
the cost function, the mean squared error \eqnref{msecube}, as a
|
|
function of the unknown parameter $c$ has a minimum close to the
|
|
true value of $c$ that was used to generate the data
|
|
(left). Simply taking the absolute minimum of the cost function
|
|
computed for a pre-set range of values for the parameter $c$, has
|
|
the disadvantage to be limited in precision (right) and the
|
|
possibility to entirely miss the global minimum if it is outside
|
|
the computed range.}\label{cubiccostfig}
|
|
\end{figure}
|
|
|
|
By looking at the plot of the cost function we can visually identify
|
|
the position of the minimum and thus estimate the optimal value for
|
|
the parameter $c$. How can we use the cost function to guide an
|
|
automatic optimization process?
|
|
|
|
The obvious approach would be to calculate the mean squared error for
|
|
a range of parameter values and then find the position of the minimum
|
|
using the \code{min()} function. This approach, however has several
|
|
disadvantages: (i) The accuracy of the estimation of the best
|
|
parameter is limited by the resolution used to sample the parameter
|
|
space. The coarser the parameters are sampled the less precise is the
|
|
obtained position of the minimum (\figref{cubiccostfig}, right). (ii)
|
|
The range of parameter values might not include the absolute minimum.
|
|
(iii) In particular for functions with more than a single free
|
|
parameter it is computationally expensive to calculate the cost
|
|
function for each parameter combination at a sufficient
|
|
resolution. The number of combinations increases exponentially with
|
|
the number of free parameters. This is known as the \enterm{curse of
|
|
dimensionality}.
|
|
|
|
So we need a different approach. We want a procedure that finds the
|
|
minimum of the cost function with a minimal number of computations and
|
|
to arbitrary precision.
|
|
|
|
\section{Gradient}
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics{cubicgradient}
|
|
\titlecaption{Derivative of the cost function.}{The gradient, the
|
|
derivative \eqref{costderivative} of the cost function, is
|
|
negative to the left of the minimum (vertical line) of the cost
|
|
function, zero (horizontal line) at, and positive to the right of
|
|
the minimum (left). For each value of the parameter $c$ the
|
|
negative gradient (arrows) points towards the minimum of the cost
|
|
function (right).} \label{gradientcubicfig}
|
|
\end{figure}
|
|
|
|
\begin{ibox}[b]{\label{differentialquotientbox}Difference quotient and derivative}
|
|
\includegraphics[width=0.33\textwidth]{derivative}
|
|
\hfill
|
|
\begin{minipage}[b]{0.63\textwidth}
|
|
The difference quotient
|
|
\begin{equation}
|
|
\label{difffrac}
|
|
m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
|
|
\end{equation}
|
|
of a function $y = f(x)$ is the slope of the secant (red) defined
|
|
by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ at
|
|
distance $\Delta x$.
|
|
|
|
The slope of the function $y=f(x)$ at the position $x$ (orange) is
|
|
given by the derivative $f'(x)$ of the function at that position.
|
|
It is defined by the difference quotient in the limit of
|
|
infinitesimally (red and yellow) small distances $\Delta x$:
|
|
\begin{equation}
|
|
\label{derivative}
|
|
f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \end{equation}
|
|
\end{minipage}\vspace{2ex}
|
|
|
|
It is not possible to calculate the exact value of the derivative
|
|
\eqref{derivative} numerically. The derivative can only be estimated
|
|
by computing the difference quotient \eqref{difffrac} using
|
|
sufficiently small $\Delta x$.
|
|
\end{ibox}
|
|
|
|
Imagine to place a ball at some point on the cost function
|
|
\figref{cubiccostfig}. Naturally, it would roll down the slope and
|
|
eventually stop at the minimum of the error surface (if it had no
|
|
inertia). We will use this analogy to develop an algorithm to find our
|
|
way to the minimum of the cost function. The ball always follows the
|
|
steepest slope. Thus we need to figure out the direction of the slope
|
|
at the position of the ball.
|
|
|
|
In our one-dimensional example of a single free parameter the slope is
|
|
simply the derivative of the cost function with respect to the
|
|
parameter $c$ (\figref{gradientcubicfig}, left). This derivative is called
|
|
the \entermde{Gradient}{gradient} of the cost function:
|
|
\begin{equation}
|
|
\label{costderivative}
|
|
\nabla f_{cost}(c) = \frac{{\rm d} f_{cost}(c)}{{\rm d} c}
|
|
\end{equation}
|
|
There is no need to calculate this derivative analytically, because it
|
|
can be approximated numerically by the difference quotient
|
|
(Box~\ref{differentialquotientbox}) for small steps $\Delta c$:
|
|
\begin{equation}
|
|
\frac{{\rm d} f_{cost}(c)}{{\rm d} c} =
|
|
\lim\limits_{\Delta c \to 0} \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c}
|
|
\approx \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c}
|
|
\end{equation}
|
|
Choose, for example, $\Delta c = 10^{-7}$. The derivative is positive
|
|
for positive slopes. Since want to go down the hill, we choose the
|
|
opposite direction (\figref{gradientcubicfig}, right).
|
|
|
|
\begin{exercise}{meanSquaredGradientCubic.m}{}\label{gradientcubic}
|
|
Implement a function \varcode{meanSquaredGradientCubic()}, that
|
|
takes the $x$- and $y$-data and the parameter $c$ as input
|
|
arguments. The function should return the derivative of the mean
|
|
squared error $f_{cost}(c)$ with respect to $c$ at the position
|
|
$c$.
|
|
\end{exercise}
|
|
|
|
\begin{exercise}{plotcubicgradient.m}{}
|
|
Using the \varcode{meanSquaredGradientCubic()} function from the
|
|
previous exercise, plot the derivative of the cost function as a
|
|
function of $c$.
|
|
\end{exercise}
|
|
|
|
\section{Gradient descent}
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics{cubicmse}
|
|
\titlecaption{Gradient descent.}{The algorithm starts at an
|
|
arbitrary position. At each point the gradient is estimated and
|
|
the position is updated as long as the length of the gradient is
|
|
sufficiently large. The dots show the positions after each
|
|
iteration of the algorithm.} \label{gradientdescentcubicfig}
|
|
\end{figure}
|
|
|
|
Finally, we are able to implement the optimization itself. By now it
|
|
should be obvious why it is called the gradient descent method. From a
|
|
starting position on we iteratively walk down the slope of the cost
|
|
function against its gradient. All ingredients necessary for this
|
|
algorithm are already there. We need: (i) the cost function
|
|
(\varcode{meanSquaredErrorCubic()}), and (ii) the gradient
|
|
(\varcode{meanSquaredGradientCubic()}). The algorithm of the gradient
|
|
descent works as follows:
|
|
\begin{enumerate}
|
|
\item Start with some arbitrary value $p_0$ for the parameter $c$.
|
|
\item \label{computegradient} Compute the gradient
|
|
\eqnref{costderivative} at the current position $p_i$.
|
|
\item If the length of the gradient, the absolute value of the
|
|
derivative \eqnref{costderivative}, is smaller than some threshold
|
|
value, we assume to have reached the minimum and stop the search.
|
|
We return the current parameter value $p_i$ as our best estimate of
|
|
the parameter $c$ that minimizes the cost function.
|
|
|
|
We are actually looking for the point at which the derivative of the
|
|
cost function equals zero, but this is impossible because of
|
|
numerical imprecision. We thus apply a threshold below which we are
|
|
sufficiently close to zero.
|
|
\item \label{gradientstep} If the length of the gradient exceeds the
|
|
threshold we take a small step into the opposite direction:
|
|
\begin{equation}
|
|
\label{gradientdescent}
|
|
p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(p_i)
|
|
\end{equation}
|
|
where $\epsilon$ is a factor linking the gradient to
|
|
appropriate steps in the parameter space.
|
|
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
|
|
\end{enumerate}
|
|
|
|
\Figref{gradientdescentcubicfig} illustrates the gradient descent ---
|
|
the path the imaginary ball has chosen to reach the minimum. We walk
|
|
along the parameter axis against the gradient as long as the gradient
|
|
differs sufficiently from zero. At steep slopes we take large steps
|
|
(the distance between the red dots in \figref{gradientdescentcubicfig}
|
|
is large).
|
|
|
|
\begin{exercise}{gradientDescentCubic.m}{}\label{gradientdescentexercise}
|
|
Implement the gradient descent algorithm for the problem of fitting
|
|
a cubic function \eqref{cubicfunc} to some measured data pairs $x$
|
|
and $y$ as a function \varcode{gradientDescentCubic()} that returns
|
|
the estimated best fitting parameter value $c$ as well as two
|
|
vectors with all the parameter values and the corresponding values
|
|
of the cost function that the algorithm iterated trough. As
|
|
additional arguments that function takes the initial value for the
|
|
parameter $c$, the factor $\epsilon$ connecting the gradient with
|
|
iteration steps in \eqnref{gradientdescent}, and the threshold value
|
|
for the absolute value of the gradient terminating the algorithm.
|
|
\end{exercise}
|
|
|
|
\begin{exercise}{plotgradientdescentcubic.m}{}\label{plotgradientdescentexercise}
|
|
Use the function \varcode{gradientDescentCubic()} to fit the
|
|
simulated data from exercise~\ref{mseexercise}. Plot the returned
|
|
values of the parameter $c$ as as well as the corresponding mean
|
|
squared errors as a function of iteration step (two plots). Compare
|
|
the result of the gradient descent method with the true value of $c$
|
|
used to simulate the data. Inspect the plots and adapt $\epsilon$
|
|
and the threshold to make the algorithm behave as intended. Finally
|
|
plot the data together with the best fitting cubic relation
|
|
\eqref{cubicfunc}.
|
|
\end{exercise}
|
|
|
|
The $\epsilon$ parameter in \eqnref{gradientdescent} is critical. If
|
|
too large, the algorithm does not converge to the minimum of the cost
|
|
function (try it!). At medium values it oscillates around the minimum
|
|
but might nevertheless converge. Only for sufficiently small values
|
|
(here $\epsilon = 0.00001$) does the algorithm follow the slope
|
|
downwards towards the minimum.
|
|
|
|
The terminating condition on the absolute value of the gradient
|
|
influences how often the cost function is evaluated. The smaller the
|
|
threshold value the more often the cost is computed and the more
|
|
precisely the fit parameter is estimated. If it is too small, however,
|
|
the increase in precision is negligible, in particular in comparison
|
|
to the increased computational effort. Have a look at the derivatives
|
|
that we plotted in exercise~\ref{gradientcubic} and decide on a
|
|
sensible value for the threshold. Run the gradient descent algorithm
|
|
and check how the resulting $c$ parameter values converge and how many
|
|
iterations were needed. Then reduce the threshold (by factors of ten)
|
|
and check how this changes the results.
|
|
|
|
Many modern algorithms for finding the minimum of a function are based
|
|
on the basic idea of the gradient descent. Luckily these algorithms
|
|
choose $\epsilon$ in a smart adaptive way and they also come up with
|
|
sensible default values for the termination condition. On the other
|
|
hand, these algorithm often take optional arguments that let you
|
|
control how they behave. Now you know what this is all about.
|
|
|
|
\section{N-dimensional minimization problems}
|
|
|
|
So far we were concerned about finding the right value for a single
|
|
parameter that minimizes a cost function. Often we deal with
|
|
functions that have more than a single parameter, in general $n$
|
|
parameters. We then need to find the minimum in an $n$ dimensional
|
|
parameter space.
|
|
|
|
For our tiger problem, we could have also fitted the exponent $a$ of
|
|
the power-law relation between size and weight, instead of assuming a
|
|
cubic relation with $a=3$:
|
|
\begin{equation}
|
|
\label{powerfunc}
|
|
y = f(x; c, a) = f(x; \vec p) = c\cdot x^a
|
|
\end{equation}
|
|
We then could check whether the resulting estimate of the exponent
|
|
$a$ indeed is close to the expected power of three. The
|
|
power-law \eqref{powerfunc} has two free parameters $c$ and $a$.
|
|
Instead of a single parameter we are now dealing with a vector $\vec
|
|
p$ containing $n$ parameter values. Here, $\vec p = (c, a)$. All
|
|
the concepts we introduced on the example of the one dimensional
|
|
problem of tiger weights generalize to $n$-dimensional problems. We
|
|
only need to adapt a few things. The cost function for the mean
|
|
squared error reads
|
|
\begin{equation}
|
|
\label{ndimcostfunc}
|
|
f_{cost}(\vec p|\{(x_i, y_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;\vec p))^2
|
|
\end{equation}
|
|
|
|
\begin{figure}[t]
|
|
\includegraphics{powergradientdescent}
|
|
\titlecaption{Gradient descent on an error surface.}{Contour plot of
|
|
the cost function \eqref{ndimcostfunc} for the fit of a power law
|
|
\eqref{powerfunc} to some data. Here the cost function is a long
|
|
and narrow valley on the plane spanned by the two parameters $c$
|
|
and $a$ of the power law. The red line marks the path of the
|
|
gradient descent algorithm. The gradient is always perpendicular
|
|
to the contour lines. The algorithm quickly descends into the
|
|
valley and then slowly creeps on the shallow bottom of the valley
|
|
towards the global minimum where it terminates (yellow circle).
|
|
} \label{powergradientdescentfig}
|
|
\end{figure}
|
|
|
|
For two-dimensional problems the graph of the cost function is an
|
|
\enterm{error surface} (\determ{{Fehlerfl\"ache}}). The two parameters
|
|
span a two-dimensional plane. The cost function assigns to each
|
|
parameter combination on this plane a single value. This results in a
|
|
landscape over the parameter plane with mountains and valleys and we
|
|
are searching for the position of the bottom of the deepest valley
|
|
(\figref{powergradientdescentfig}).
|
|
|
|
\begin{ibox}[t]{\label{partialderivativebox}Partial derivatives and gradient}
|
|
Some functions depend on more than a single variable. For example, the function
|
|
\[ z = f(x,y) \]
|
|
depends on both $x$ and $y$. Using the partial derivatives
|
|
\[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
|
|
and
|
|
\[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
|
|
one can calculate the slopes in the directions of each of the
|
|
variables by means of the respective difference quotient
|
|
(see box~\ref{differentialquotientbox}). \vspace{1ex}
|
|
|
|
\begin{minipage}[t]{0.5\textwidth}
|
|
\mbox{}\\[-2ex]
|
|
\includegraphics[width=1\textwidth]{gradient}
|
|
\end{minipage}
|
|
\hfill
|
|
\begin{minipage}[t]{0.46\textwidth}
|
|
For example, the partial derivatives of
|
|
\[ f(x,y) = x^2+y^2 \vspace{-1ex} \] are
|
|
\[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]
|
|
|
|
The gradient is a vector with the partial derivatives as coordinates:
|
|
\[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
|
|
This vector points into the direction of the strongest ascend of
|
|
$f(x,y)$.
|
|
\end{minipage}
|
|
|
|
\vspace{0.5ex} The figure shows the contour lines of a bivariate
|
|
Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ and the gradient (thick
|
|
arrows) together with the corresponding partial derivatives (thin
|
|
arrows) for three different locations.
|
|
\end{ibox}
|
|
|
|
When we place a ball somewhere on the slopes of a hill, it rolls
|
|
downwards and eventually stops at the bottom. The ball always rolls in
|
|
the direction of the steepest slope. For a two-dimensional parameter
|
|
space of our example, the \entermde{Gradient}{gradient}
|
|
(box~\ref{partialderivativebox}) of the cost function is a vector
|
|
\begin{equation}
|
|
\label{gradientpowerlaw}
|
|
\nabla f_{cost}(c, a) = \left( \frac{\partial f_{cost}(c, a)}{\partial c},
|
|
\frac{\partial f_{cost}(c, a)}{\partial a} \right)
|
|
\end{equation}
|
|
that points into the direction of the strongest ascend of the cost
|
|
function. The gradient is given by the partial derivatives
|
|
(box~\ref{partialderivativebox}) of the mean squared error with
|
|
respect to the parameters $c$ and $a$ of the power law
|
|
relation. In general, for $n$-dimensional problems, the gradient is an
|
|
$n-$ dimensional vector containing for each of the $n$ parameters
|
|
$p_j$ the respective partial derivatives as coordinates:
|
|
\begin{equation}
|
|
\label{gradient}
|
|
\nabla f_{cost}(\vec p) = \left( \frac{\partial f_{cost}(\vec p)}{\partial p_j} \right)
|
|
\end{equation}
|
|
|
|
The iterative equation \eqref{gradientdescent} of the gradient descent
|
|
stays exactly the same, with the only difference that the current
|
|
parameter value $p_i$ becomes a vector $\vec p_i$ of parameter values:
|
|
\begin{equation}
|
|
\label{ndimgradientdescent}
|
|
\vec p_{i+1} = \vec p_i - \epsilon \cdot \nabla f_{cost}(\vec p_i)
|
|
\end{equation}
|
|
The algorithm proceeds along the negative gradient
|
|
(\figref{powergradientdescentfig}).
|
|
|
|
For the termination condition we need the length of the gradient. In
|
|
one dimension it was just the absolute value of the derivative. For
|
|
$n$ dimensions this is according to the \enterm{Pythagorean theorem}
|
|
(\determ[Pythagoras!Satz des]{Satz des Pythagoras}) the square root of
|
|
the sum of the squared partial derivatives:
|
|
\begin{equation}
|
|
\label{ndimabsgradient}
|
|
|\nabla f_{cost}(\vec p_i)| = \sqrt{\sum_{i=1}^n \left(\frac{\partial f_{cost}(\vec p)}{\partial p_i}\right)^2}
|
|
\end{equation}
|
|
The \code{norm()} function implements this given a vector with the
|
|
partial derivatives.
|
|
|
|
\subsection{Passing a function as an argument to another function}
|
|
|
|
So far, all our code for the gradient descent algorithm was tailored
|
|
to a specific function, the cubic relation \eqref{cubicfunc}. It would
|
|
be much better if we could pass an arbitrary function to our gradient
|
|
algorithm. Then we would not need to rewrite it every time anew.
|
|
|
|
This is possible. We can indeed pass a function as an argument to
|
|
another function. For this use the \code{@}-operator. As an example
|
|
let's define a function that produces a standard plot for a function:
|
|
|
|
\pageinputlisting[caption={Example function taking a function as argument.}]{funcPlotter.m}
|
|
|
|
This function can then be used as follows for plotting a sine wave. We
|
|
pass the built in \varcode{sin()} function as \varcode{@sin} as an
|
|
argument to our function:
|
|
|
|
\pageinputlisting[caption={Passing a function handle as an argument to a function.}]{funcplotterexamples.m}
|
|
|
|
|
|
\subsection{Gradient descent algorithm for arbitrary functions}
|
|
|
|
Now we are ready to adapt the gradient descent algorithm from
|
|
exercise~\ref{gradientdescentexercise} to arbitrary functions with $n$
|
|
parameters that we want to fit to some data.
|
|
|
|
\begin{exercise}{gradientDescent.m}{}
|
|
Adapt the function \varcode{gradientDescentCubic()} from
|
|
exercise~\ref{gradientdescentexercise} to implement the gradient
|
|
descent algorithm for any function \varcode{func(x, p)} that takes
|
|
as first argument the $x$-data values and as second argument a
|
|
vector with parameter values. The new function takes a vector $\vec
|
|
p_0$ for the initial parameter values and also returns the best
|
|
parameter combination as a vector. Use a \varcode{for} loop over the
|
|
two dimensions for computing the gradient.
|
|
\end{exercise}
|
|
|
|
For testing our new function we need to implement the power law
|
|
\eqref{powerfunc}:
|
|
|
|
\begin{exercise}{powerLaw.m}{}
|
|
Write a function that implements \eqref{powerfunc}. The function
|
|
gets as arguments a vector $x$ containing the $x$-data values and
|
|
another vector containing as elements the parameters for the power
|
|
law, i.e. the factor $c$ and the power $a$. It returns a vector
|
|
with the computed $y$ values for each $x$ value.
|
|
\end{exercise}
|
|
|
|
Now let's use the new gradient descent function to fit a power law to
|
|
our tiger data-set (\figref{powergradientdescentfig}):
|
|
|
|
\begin{exercise}{plotgradientdescentpower.m}{}
|
|
Use the function \varcode{gradientDescent()} to fit the
|
|
\varcode{powerLaw()} function to the simulated data from
|
|
exercise~\ref{mseexercise}. Plot the returned values of the two
|
|
parameters against each other. Compare the result of the gradient
|
|
descent method with the true values of $c$ and $a$ used to simulate
|
|
the data. Observe the norm of the gradient and inspect the plots to
|
|
adapt $\epsilon$ and the threshold if necessary. Finally plot the
|
|
data together with the best fitting power-law \eqref{powerfunc}.
|
|
\end{exercise}
|
|
|
|
|
|
\section{Fitting non-linear functions to data}
|
|
|
|
The gradient descent is a basic numerical method for solving
|
|
optimization problems. It is used to find the global minimum of an
|
|
objective function.
|
|
|
|
Curve fitting is a specific optimization problem and a common
|
|
application for the gradient descent method. For the case of fitting
|
|
straight lines to data pairs, the error surface (using the mean
|
|
squared error) has exactly one clearly defined global minimum. In
|
|
fact, the position of the minimum can be analytically calculated as
|
|
shown in the next chapter. For linear fitting problems numerical
|
|
methods like the gradient descent are not needed.
|
|
|
|
Fitting problems that involve nonlinear functions of the parameters,
|
|
e.g. the power law \eqref{powerfunc} or the exponential function
|
|
$f(t;\tau) = e^{-t/\tau}$, do in general not have an analytical
|
|
solution for the least squares. To find the least squares for such
|
|
functions numerical methods such as the gradient descent have to be
|
|
applied.
|
|
|
|
The suggested gradient descent algorithm requires manually tuned
|
|
values for $\epsilon$ and the threshold for terminating the iteration.
|
|
The algorithm can be improved in multiple ways to converge more
|
|
robustly and faster. Most importantly, $\epsilon$ is made dependent
|
|
on the changes of the gradient from one iteration to the next. These
|
|
and other numerical tricks have already been implemented in
|
|
pre-defined functions. Generic optimization functions such as
|
|
\mcode{fminsearch()} have been implemented for arbitrary objective
|
|
functions, while the more specialized function \mcode{lsqcurvefit()}
|
|
is specifically designed for optimizations in the least square error
|
|
sense.
|
|
|
|
\begin{exercise}{plotlsqcurvefitpower.m}{}
|
|
Use the \matlab-function \varcode{lsqcurvefit()} instead of
|
|
\varcode{gradientDescent()} to fit the \varcode{powerLaw()} function
|
|
to the simulated data from exercise~\ref{mseexercise}. Plot the data
|
|
and the resulting best fitting power law function.
|
|
\end{exercise}
|
|
|
|
|
|
\begin{important}[Beware of secondary minima!]
|
|
Finding the absolute minimum is not always as easy as in the case of
|
|
fitting a straight line. Often, the cost function has secondary or
|
|
local minima in which the gradient descent stops even though there
|
|
is a more optimal solution, a global minimum that is lower than the
|
|
local minimum. Starting from good initial positions is a good
|
|
approach to avoid getting stuck in local minima. Also keep in mind
|
|
that cost functions tend to be simpler (less local minima) the fewer
|
|
parameters are fitted from the data. Each additional parameter
|
|
increases complexity and is computationally more expensive.
|
|
\end{important}
|
|
|
|
|
|
\section{Evolution as an optimization problem}
|
|
|
|
Evolution is a biological implementation of an optimization
|
|
algorithm. The objective function is an organism's fitness. This needs
|
|
to be maximized (this is the same as minimizing the negative fitness).
|
|
The parameters of this optimization problem are all the many genes on
|
|
the DNA. This is a very high-dimensional optimization problem. By
|
|
cross-over and mutations a population of a species moves along the
|
|
high-dimensional parameter space. Selection processes make sure that
|
|
only organisms with higher fitness pass on their genes to the next
|
|
generations. In this way the algorithm is not directed towards higher
|
|
fitness, as the gradient descent method would be. Rather, some
|
|
neighborhood of the parameter space is randomly probed. That way it is
|
|
even possible to escape a local maximum and find a potentially better
|
|
maximum. For this reason, \enterm[genetic algorithm]{genetic
|
|
algorithms} try to mimic evolution in the context of
|
|
high-dimensional optimization problems, in particular with discrete
|
|
parameter values. In biological evolution, the objective function,
|
|
however, is not a fixed function. It may change in time by changing
|
|
abiotic and biotic environmental conditions, making this a very
|
|
complex but also interesting optimization problem.
|
|
|
|
How should a neuron or neural network be designed? As a particular
|
|
aspect of the general evolution of a species, this is a fundamental
|
|
question in the neurosciences. Maintaining a neural system is
|
|
costly. By their simple presence neurons incur costs. They need to be
|
|
built and maintained, they occupy space and consume
|
|
resources. Equipping a neuron with more ion channels also costs. And
|
|
neural activity makes it more costly to maintain concentration
|
|
gradients of ions. This all boils down to the consumption of more ATP,
|
|
the currency of metabolism. On the other hand each neuron provides
|
|
some useful function. In the end, neurons make the organism to behave
|
|
in some sensible way to increase the overall fitness of the
|
|
organism. On the level of neurons that means that they should
|
|
faithfully represent and process behaviorally relevant sensory
|
|
stimuli, make a sensible decision, store some important memory, or
|
|
initiate and control movements in a directed way. Unfortunately there
|
|
is a tradeoff. Better neural function usually involves higher
|
|
costs. More ion channels reduce intrinsic noise which usually favors
|
|
the precision of neural responses. Higher neuronal activity improves
|
|
the quality of the encoding of sensory stimuli. More neurons are
|
|
required for more complex computations. And so on.
|
|
|
|
Understanding why a neuronal system is designed in some specific way
|
|
requires to also understand these tradeoffs. The number of neurons,
|
|
the number and types of ion channels, the length of axons, the number
|
|
of synapses, the way neurons are connected, etc. are all parameters
|
|
that could be optimized. For the objective function the function of
|
|
the neurons needs to be quantified, for example by measures from
|
|
information or detection theory, and their dependence on the
|
|
parameters. From these benefits the costs need to be subtracted. And
|
|
then one is interested in finding the maximum of the resulting
|
|
objective function. Maximization (or minimization) problems are not
|
|
only a tool for data analysis, rather they are at the core of many ---
|
|
not only biological or neuroscientific --- problems.
|
|
|
|
|
|
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
\printsolutions
|