\chapter{Optimization and gradient descent}
\exercisechapter{Optimization and gradient descent}

Optimization problems arise in many different contexts. For example, to understand the behavior of a given neuronal system, the system is probed with a range of input signals and the resulting responses are measured. This input-output relation can be described by a model. Such a model can be a simple function that maps the input signals to corresponding responses, it can be a filter, or a system of differential equations. In any case, the model has some parameters that specify how input and output signals are related. Which combination of parameter values is best suited to describe the input-output relation? The process of finding the best parameter values is an optimization problem. For a simple parameterized function that maps input to output values, this is the special case of a \enterm{curve fitting} problem, where the average distance between the curve and the response values is minimized. One basic numerical method used for such optimization problems is the so-called gradient descent, which is introduced in this chapter.

\begin{figure}[t]
  \includegraphics{cubicfunc}
  \titlecaption{Example data suggesting a cubic relation.}{The length $x$ and weight $y$ of $n=34$ male tigers (blue, left). Assuming a cubic relation between size and weight leaves us with a single free parameter, a scaling factor. The cubic relation is shown for a few values of this scaling factor (orange and red, right).}\label{cubicdatafig}
\end{figure}

To demonstrate the curve-fitting problem, let's take the simple example of weights and sizes measured for a number of male tigers (\figref{cubicdatafig}). Weight $y$ is proportional to volume $V$ via the density $\rho$. The volume $V$ of any object is proportional to its length $x$ cubed. The factor $\alpha$ relating volume and length cubed depends on the shape of the object and we do not know this factor for tigers. For the data set we thus expect a cubic relation between weight and length
\begin{equation}
  \label{cubicfunc}
  y = f(x; c) = c\cdot x^3
\end{equation}
where $c=\rho\alpha$, the product of a tiger's density and form factor, is the only free parameter of the relation. We would like to find out which value of $c$ best describes the measured data. In the following we use this example to illustrate the gradient descent as a basic method for finding such an optimal parameter.

\section{The error function --- mean squared error}

Before we go ahead finding the optimal parameter value, we need to specify what exactly we consider an optimal fit. In our example we search for the parameter value that best describes the relation of $x$ and $y$. What is meant by this? The length $x_i$ of each tiger is associated with a weight $y_i$, and for each $x_i$ we have a \emph{prediction} or \emph{estimation} $y^{est}(x_i)$ of the weight by the model \eqnref{cubicfunc} for a specific value of the parameter $c$. In a perfect, noise-free world, prediction and measured value would match, but in general the estimate and the measurement are separated by some distance or error $y_i - y^{est}(x_i)$. In our example the estimate of the weight for the length $x_i$ is given by \eqnref{cubicfunc}: $y^{est}(x_i) = f(x_i;c)$. The best fitting model with parameter $c$ is the one that somehow minimizes the distances between the observations $y_i$ and the corresponding estimations $y^{est}(x_i)$ (\figref{cubicerrorsfig}).
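To make these quantities concrete, here is a minimal MATLAB sketch of how the predictions and the resulting errors could be computed. The variable names are chosen for illustration only: \varcode{x} and \varcode{y} are assumed to hold the measured lengths and weights, and \varcode{c} is some candidate value of the free parameter.

\begin{verbatim}
c = 6.0;            % some candidate value of the free parameter
yest = c * x.^3;    % weights predicted by the cubic model
errors = y - yest;  % signed distances between data and predictions
\end{verbatim}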
As a first guess we could simply minimize the sum of the distances, $\sum_{i=1}^N (y_i - y^{est}(x_i))$. This, however, does not work, because positive and negative errors would cancel out, no matter how large they are, and sum up to values close to zero. Better is to sum over the absolute distances: $\sum_{i=1}^N |y_i - y^{est}(x_i)|$. This sum can only be small if all deviations are indeed small, no matter whether they are above or below the prediction. The sum of the squared distances, $\sum_{i=1}^N (y_i - y^{est}(x_i))^2$, turns out to be an even better choice. Instead of the sum we could also minimize the average squared distance
\begin{equation}
  \label{meansquarederror}
  f_{mse}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}(x_i))^2
\end{equation}
This is known as the \enterm[mean squared error]{mean squared error} (\determ[quadratischer Fehler!mittlerer]{mittlerer quadratischer Fehler}). Like the absolute distance, the square of the errors is always positive, so positive and negative errors do not cancel each other out. In addition, the square penalizes large deviations more strongly than small ones. In chapter~\ref{maximumlikelihoodchapter} we show that minimizing the mean squared error is equivalent to maximizing the likelihood that the observations originate from the model, provided the data are normally distributed around the model prediction.

\begin{exercise}{meansquarederrorline.m}{}\label{mseexercise}
  Simulate $n=40$ tigers ranging from 2.2 to 3.9\,m in size and store these sizes in a vector \varcode{x}. Compute the corresponding predicted weights \varcode{yest} for each tiger according to \eqnref{cubicfunc} with $c=6$\,\kilo\gram\per\meter\cubed. From the predictions generate simulated measurements of the tigers' weights \varcode{y} by adding normally distributed random numbers with a standard deviation of 50\,\kilo\gram\ to the predictions. Compute the \emph{mean squared error} between \varcode{y} and \varcode{yest} in a single line of code.
\end{exercise}

\section{Objective function}

The mean squared error is a so-called \enterm{objective function} or \enterm{cost function} (\determ{Kostenfunktion}). A cost function assigns to a model prediction $\{y^{est}(x_i)\}$ for a given data set $\{(x_i, y_i)\}$ a single scalar value that we want to minimize. Here we aim to adapt the model parameter to minimize the mean squared error \eqref{meansquarederror}. In general, the \enterm{cost function} can be any function that describes the quality of a fit by mapping the data and the predictions to a single scalar value.

\begin{figure}[t]
  \includegraphics{cubicerrors}
  \titlecaption{Estimating the \emph{mean squared error}.}{The deviation (orange) between the prediction (red line) and the observations (blue dots) is calculated for each data point (left). Then the deviations are squared and averaged (right).}\label{cubicerrorsfig}
\end{figure}

Replacing $y^{est}$ in the mean squared error \eqref{meansquarederror} with our cubic model \eqref{cubicfunc}, the cost function reads
\begin{eqnarray}
  f_{cost}(c|\{(x_i, y_i)\}) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;c))^2 \label{msefunc} \\
  & = & \frac{1}{N} \sum_{i=1}^N (y_i - c x_i^3)^2 \label{msecube}
\end{eqnarray}
The optimization process tries to find a value for the factor $c$ such that the cost function is minimized.
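Continuing the sketch from above, evaluating this cost function for a single candidate value of $c$ boils down to squaring and averaging the errors. Again, all variable names are illustrative; the following exercise asks for a proper implementation as a function.

\begin{verbatim}
fcost = mean(errors.^2);          % cost of the cubic model for this c
% equivalently, written out in terms of the model:
fcost = mean((y - c * x.^3).^2);
\end{verbatim}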
With the mean squared error as the cost function, this optimization process is also called the method of \enterm{least squares} (\determ[quadratischer Fehler!kleinster]{Methode der kleinsten Quadrate}).

\begin{exercise}{meanSquaredErrorCubic.m}{}
  Implement the objective function \eqref{msecube} as a function \varcode{meanSquaredErrorCubic()}. The function takes three arguments. The first is a vector of $x$-values and the second contains the measurements $y$ for each value of $x$. The third argument is the value of the factor $c$. The function returns the mean squared error.
\end{exercise}

\section{Error surface}

The cost function of the cubic model \eqref{msecube} depends on a single parameter $c$ only. To illustrate the concept of an error surface we now consider a model with two parameters: a straight line $y = f(x; m, b) = m x + b$ with slope $m$ and intercept $b$ that is fitted to the data pairs. For each combination of the two parameters $m$ and $b$ we can insert the predictions of the straight line into the mean squared error \eqref{meansquarederror} and calculate the corresponding value of the cost function. The cost function $f_{cost}(m,b|\{(x_i, y_i)\})$ is a function $f_{cost}(m,b)$ that maps the parameter values $m$ and $b$ to a scalar error value. The error values describe a landscape over the $m$-$b$ plane, the error surface, that can be illustrated graphically using a 3-d surface-plot. $m$ and $b$ are plotted on the $x$- and $y$-axis while the third dimension indicates the error value (\figref{errorsurfacefig}).

\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{error_surface}
  \titlecaption{Error surface.}{The two model parameters $m$ and $b$ define the base area of the surface plot. For each parameter combination of slope and intercept the error is calculated. The resulting surface has a minimum which indicates the parameter combination that best fits the data.}\label{errorsurfacefig}
\end{figure}

\begin{exercise}{errorSurface.m}{}\label{errorsurfaceexercise}
  Generate 20 data pairs $(x_i, y_i)$ that are linearly related with slope $m=0.75$ and intercept $b=-40$, using \varcode{rand()} for drawing $x$ values between 0 and 120 and \varcode{randn()} for jittering the $y$ values with a standard deviation of 15. Then calculate the mean squared error between the data and straight lines for a range of slopes and intercepts, using a function \varcode{meanSquaredError()} that, analogously to \varcode{meanSquaredErrorCubic()} from the previous exercise, computes the mean squared error between the data and a straight line with slope $m$ and intercept $b$. Illustrate the error surface using the \code{surface()} function. Consult the documentation to find out how to use \code{surface()}.
\end{exercise}

By looking at the error surface we can directly see the position of the minimum and thus estimate the optimal parameter combination. How can we use the error surface to guide an automatic optimization process?

The obvious approach would be to calculate the error surface for many combinations of slope and intercept values and then find the position of the minimum using the \code{min()} function (a sketch of this approach follows below). This approach, however, has several disadvantages: (i) it is computationally very expensive to calculate the error for each parameter combination. The number of combinations increases exponentially with the number of free parameters (also known as the ``curse of dimensionality''). (ii) The accuracy with which the best parameters can be estimated is limited by the resolution used to sample the parameter space. The coarser the parameters are sampled, the less precise is the obtained position of the minimum.

So we need a different approach. We want a procedure that finds the minimum of the cost function with a minimal number of computations and to arbitrary precision.
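To make these disadvantages concrete, here is a minimal sketch of such a brute-force grid scan. It assumes data vectors \varcode{x} and \varcode{y} and a helper function \varcode{meanSquaredError(x, y, m, b)} that returns the mean squared error between the data and the straight line with slope \varcode{m} and intercept \varcode{b}; the name, signature, and grid resolution are chosen for illustration only.

\begin{verbatim}
slopes = -5.0:0.25:5.0;          % coarse grid of candidate slopes
intercepts = -200.0:10.0:200.0;  % coarse grid of candidate intercepts
errors = zeros(length(slopes), length(intercepts));
for i = 1:length(slopes)
    for j = 1:length(intercepts)
        errors(i, j) = meanSquaredError(x, y, slopes(i), intercepts(j));
    end
end
[minerror, minindex] = min(errors(:));     % smallest error on the grid
[si, bi] = ind2sub(size(errors), minindex);  % grid indices of the minimum
bestslope = slopes(si);                    % precision limited by the grid
bestintercept = intercepts(bi);
\end{verbatim}

Even with this coarse grid the cost function is evaluated well over a thousand times, and the position of the minimum is only known up to the grid spacing.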
\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
  \includegraphics[width=0.33\textwidth]{derivative}
  \hfill
  \begin{minipage}[b]{0.63\textwidth}
    The difference quotient
    \begin{equation}
      \label{difffrac}
      m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
    of a function $y = f(x)$ is the slope of the secant (red) defined by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ separated by the distance $\Delta x$. The slope of the function $y=f(x)$ at the position $x$ (yellow) is given by the derivative $f'(x)$ of the function at that position. It is defined by the difference quotient in the limit of infinitesimally (orange) small distances $\Delta x$:
    \begin{equation}
      \label{derivative}
      f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
  \end{minipage}\vspace{2ex}

  The derivative, \eqnref{derivative}, cannot be calculated exactly by numerical means. Numerically, it can only be approximated by the difference quotient, \eqnref{difffrac}, with a sufficiently small $\Delta x$.
\end{ibox}

\begin{ibox}[tp]{\label{partialderivativebox}Partial derivative and gradient}
  Some functions depend on more than a single variable:
  \[ z = f(x,y) \]
  for example depends on $x$ and $y$. Using the partial derivatives
  \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
  and
  \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
  one can calculate the slope along each of the variables individually, and estimate it by means of the respective difference quotient (Box~\ref{differentialquotientbox}). \vspace{1ex}

  \begin{minipage}[t]{0.5\textwidth}
    \mbox{}\\[-2ex]
    \includegraphics[width=1\textwidth]{gradient}
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.46\textwidth}
    For example, the partial derivatives of
    \[ f(x,y) = x^2+y^2 \]
    are
    \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]
    The gradient is the vector constructed from the partial derivatives:
    \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
    This vector points in the direction of the steepest ascent of $f(x,y)$.
  \end{minipage}

  \vspace{0.5ex} The figure shows the contour lines of a bi-variate Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$, the gradient (thick arrows), and the corresponding two partial derivatives (thin arrows) for three different locations.
\end{ibox}

\section{Gradient}

Imagine placing a small ball at some point on the error surface (\figref{errorsurfacefig}). Naturally, it would roll down the steepest slope and eventually stop at the minimum of the error surface (if it had no inertia). We will use this picture to develop an algorithm that finds its way to the minimum of the objective function. The ball always follows the steepest slope. Thus we need to figure out the direction of the steepest slope at the position of the ball.

The \entermde{Gradient}{gradient} (Box~\ref{partialderivativebox}) of the objective function is the vector
\begin{equation}
  \label{gradient}
  \nabla f_{cost}(m,b) = \left( \frac{\partial f_{cost}(m,b)}{\partial m}, \frac{\partial f_{cost}(m,b)}{\partial b} \right)
\end{equation}
that points in the direction of the steepest ascent of the objective function.
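As a quick numerical illustration of Box~\ref{partialderivativebox}, the gradient of the example function $f(x,y) = x^2 + y^2$ can be approximated with difference quotients and compared to the analytical result $(2x, 2y)$. The point and step size below are arbitrary choices for this sketch and are not part of the exercises.

\begin{verbatim}
f = @(x, y) x.^2 + y.^2;  % example function from the box
x0 = 1.0;                 % position at which to estimate the gradient
y0 = 2.0;
h = 1e-6;                 % small but finite step width
dfdx = (f(x0 + h, y0) - f(x0, y0)) / h;  % difference quotient along x
dfdy = (f(x0, y0 + h) - f(x0, y0)) / h;  % difference quotient along y
grad = [dfdx, dfdy];      % close to the analytical gradient [2, 4]
\end{verbatim}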
The gradient is given by the partial derivatives (Box~\ref{partialderivativebox}) of the mean squared error with respect to the parameters $m$ and $b$ of the straight line. There is no need to calculate these derivatives analytically, because they can be estimated numerically using the difference quotient (Box~\ref{differentialquotientbox}) with small steps $\Delta m$ and $\Delta b$. For example, the partial derivative with respect to $m$ can be computed as
\begin{equation}
  \frac{\partial f_{cost}(m,b)}{\partial m} = \lim\limits_{\Delta m \to 0} \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \approx \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \; .
\end{equation}
The length of the gradient indicates the steepness of the slope (\figref{gradientquiverfig}). Since we want to go down the hill, we choose the opposite direction.

\begin{figure}[t]
  \includegraphics[width=0.75\textwidth]{error_gradient}
  \titlecaption{Gradient of the error surface.}{Each arrow points in the direction of the steepest ascent at different positions of the error surface shown in \figref{errorsurfacefig}. The contour lines in the background illustrate the error surface. Warm colors indicate high error values, cold colors low error values. Each contour line connects points of equal error.}\label{gradientquiverfig}
\end{figure}

\begin{exercise}{meanSquaredGradient.m}{}\label{gradientexercise}%
  Implement a function \varcode{meanSquaredGradient()} that takes the $x$- and $y$-data and the set of parameters $(m, b)$ of a straight line as a two-element vector as input arguments. The function should return the gradient at the position $(m, b)$ as a vector with two elements.
\end{exercise}

\begin{exercise}{errorGradient.m}{}
  Extend the script of exercise~\ref{errorsurfaceexercise} to plot both the error surface and the gradients using the \varcode{meanSquaredGradient()} function from exercise~\ref{gradientexercise}. Vectors in space can be easily plotted using the function \code{quiver()}. Use \code{contour()} instead of \code{surface()} to plot the error surface.
\end{exercise}

\section{Gradient descent}

Finally, we are able to implement the optimization itself. By now it should be obvious why it is called the gradient descent method. All ingredients are already there. We need: (i) the cost function (\varcode{meanSquaredError()}), and (ii) the gradient (\varcode{meanSquaredGradient()}). The gradient descent algorithm works as follows:
\begin{enumerate}
\item Start with some given combination of the parameters $m$ and $b$ ($p_0 = (m_0, b_0)$).
\item \label{computegradient} Calculate the gradient at the current position $p_i$.
\item If the length of the gradient falls below a certain value, we assume that we have reached the minimum and stop the search. We are actually looking for the point at which the length of the gradient is zero, but finding exactly zero is impossible because of numerical imprecision. We thus apply a threshold below which we are sufficiently close to zero (e.g. \varcode{norm(gradient) < 0.1}).
\item \label{gradientstep} If the length of the gradient exceeds the threshold we take a small step in the opposite direction:
  \begin{equation}
    p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(m_i, b_i)
  \end{equation}
  where $\epsilon = 0.01$ is a factor linking the gradient to appropriate steps in the parameter space.
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
\end{enumerate}

\Figref{gradientdescentfig} illustrates the gradient descent --- the path the imaginary ball has chosen to reach the minimum. Starting at an arbitrary position on the error surface we change the position as long as the length of the gradient at that position is larger than a certain threshold. If the slope is very steep, the change in position (the distance between the red dots in \figref{gradientdescentfig}) is large.

\begin{figure}[t]
  \includegraphics{cubicmse}
  \titlecaption{Gradient descent.}{The algorithm starts at an arbitrary position. At each point the gradient is estimated and the position is updated as long as the length of the gradient is sufficiently large. The dots show the positions after each iteration of the algorithm.}\label{gradientdescentfig}
\end{figure}

\begin{exercise}{gradientDescent.m}{}
  Implement the gradient descent for the problem of fitting a straight line to some measured data. Reuse the data generated in exercise~\ref{errorsurfaceexercise}.
  \begin{enumerate}
  \item Store the error value for each iteration.
  \item Plot the error values as a function of the iterations, the number of optimization steps.
  \item Plot the measured data together with the best fitting straight line.
  \end{enumerate}\vspace{-4.5ex}
\end{exercise}

\section{Summary}

The gradient descent is an important numerical method for solving optimization problems. It is used to find the minimum of an objective function. Curve fitting is a common application of the gradient descent method. For the case of fitting straight lines to data pairs, the error surface (using the mean squared error) has exactly one clearly defined global minimum. In fact, the position of the minimum can be calculated analytically, as shown in the next chapter.

Problems in which the parameters enter the function nonlinearly, e.g. the rate $\lambda$ of an exponential function $f(x;\lambda) = e^{\lambda x}$, do not have an analytical least-squares solution. To find the least squares for such functions, numerical methods such as the gradient descent have to be applied.

The gradient descent algorithm suggested here can be improved in multiple ways to converge faster. For example, one could adapt the step size to the length of the gradient. Such numerical tricks have already been implemented in pre-defined functions. Generic optimization functions such as \matlabfun{fminsearch()} have been implemented for arbitrary objective functions, while the more specialized function \matlabfun{lsqcurvefit()} is specifically designed for optimizations in the least squares sense.

%\newpage
\begin{important}[Beware of secondary minima!]
  Finding the absolute minimum is not always as easy as in the case of fitting a straight line. Often, the error surface has secondary or local minima in which the gradient descent stops even though there is a more optimal solution, a global minimum that is lower than the local minimum. Starting from good initial positions is a good approach to avoid getting stuck in local minima. Also keep in mind that error surfaces tend to be simpler (fewer local minima) the fewer parameters are fitted from the data. Each additional parameter increases complexity and is computationally more expensive.
\end{important}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\printsolutions