\chapter{Optimization and gradient descent} \label{gradientdescentchapter} \exercisechapter{Optimization and gradient descent} Optimization problems arise in many different contexts. For example, to understand the behavior of a given neuronal system, the system is probed with a range of input signals and then the resulting responses are measured. This input-output relation can be described by a model. Such a model can be a simple function that maps the input signals to corresponding responses, it can be a filter, or a system of differential equations. In any case, the model has some parameters that specify how input and output signals are related. Which combination of parameter values are best suited to describe the input-output relation? The process of finding the best parameter values is an optimization problem. For a simple parameterized function that maps input to output values, this is the special case of a \enterm{curve fitting} problem, where the average distance between the curve and the response values is minimized. One basic numerical method used for such optimization problems is the so called gradient descent, which is introduced in this chapter. \begin{figure}[t] \includegraphics{cubicfunc} \titlecaption{Example data suggesting a cubic relation.}{The length $x$ and weight $y$ of $n=34$ male tigers (blue, left). Assuming a cubic relation between size and weight leaves us with a single free parameters, a scaling factor. The cubic relation is shown for a few values of this scaling factor (orange and red, right).}\label{cubicdatafig} \end{figure} For demonstrating the curve-fitting problem let's take the simple example of weights and sizes measured for a number of male tigers (\figref{cubicdatafig}). Weight $y$ is proportional to volume $V$ via the density $\rho$. The volume $V$ of any object is proportional to its length $x$ cubed. The factor $\alpha$ relating volume and size cubed depends on the shape of the object and we do not know this factor for tigers. For the data set we thus expect a cubic relation between weight and length \begin{equation} \label{cubicfunc} y = f(x; c) = c\cdot x^3 \end{equation} where $c=\rho\alpha$, the product of a tiger's density and form factor, is the only free parameter in the relation. We would like to find out which value of $c$ best describes the measured data. In the following we use this example to illustrate the gradient descent as a basic method for finding such an optimal parameter. \section{The error function --- mean squared error} Before we go ahead finding the optimal parameter value we need to specify what exactly we consider as an optimal fit. In our example we search the parameter that describes the relation of $x$ and $y$ best. What is meant by this? The length $x_i$ of each tiger is associated with a weight $y_i$ and for each $x_i$ we have a \emph{prediction} or \emph{estimation} $y^{est}(x_i)$ of the weight by the model \eqnref{cubicfunc} for a specific value of the parameter $c$. Prediction and actual data value ideally match (in a perfect noise-free world), but in general the estimate and measurement are separated by some distance or error $y_i - y^{est}(x_i)$. In our example the estimate of the weight for the length $x_i$ is given by equation \eqref{cubicfunc} $y^{est}(x_i) = f(x_i;c)$. The best fitting model with parameter $c$ is the one that somehow minimizes the distances between observations $y_i$ and corresponding estimations $y^{est}(x_i)$ (\figref{cubicerrorsfig}). As a first guess we could simply minimize the sum of the distances, $\sum_{i=1}^N y_i - y^{est}(x_i)$. This, however, does not work because positive and negative errors would cancel out, no matter how large they are, and sum up to values close to zero. Better is to sum over absolute distances: $\sum_{i=1}^N |y_i - y^{est}(x_i)|$. This sum can only be small if all deviations are indeed small no matter if they are above or below the prediction. The sum of the squared distances, $\sum_{i=1}^N (y_i - y^{est}(x_i))^2$, turns out to be an even better choice. Instead of the sum we could also minimize the average distance \begin{equation} \label{meansquarederror} f_{mse}(\{(x_i, y_i)\}|\{y^{est}(x_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}(x_i))^2 \end{equation} This is known as the \enterm[mean squared error]{mean squared error} (\determ[quadratischer Fehler!mittlerer]{mittlerer quadratischer Fehler}). Similar to the absolute distance, the square of the errors is always positive and thus positive and negative error values do not cancel each other out. In addition, the square punishes large deviations over small deviations. In chapter~\ref{maximumlikelihoodchapter} we show that minimizing the mean squared error is equivalent to maximizing the likelihood that the observations originate from the model, if the data are normally distributed around the model prediction. \begin{exercise}{meansquarederrorline.m}{}\label{mseexercise} Simulate $n=40$ tigers ranging from 2.2 to 3.9\,m in size and store these sizes in a vector \varcode{x}. Compute the corresponding predicted weights \varcode{yest} for each tiger according to \eqnref{cubicfunc} with $c=6$\,\kilo\gram\per\meter\cubed. From the predictions generate simulated measurements of the tiger's weights \varcode{y}, by adding normally distributed random numbers to the predictions scaled to a standard deviation of 50\,\kilo\gram. Compute the \emph{mean squared error} between \varcode{y} and \varcode{yest} in a single line of code. \end{exercise} \section{Cost function} The mean squared error is a so called \enterm{objective function} or \enterm{cost function} (\determ{Kostenfunktion}). A cost function assigns to a model prediction $\{y^{est}(x_i)\}$ for a given data set $\{(x_i, y_i)\}$ a single scalar value that we want to minimize. Here we aim to adapt the model parameter to minimize the mean squared error \eqref{meansquarederror}. In general, the \enterm{cost function} can be any function that describes the quality of a fit by mapping the data and the predictions to a single scalar value. \begin{figure}[t] \includegraphics{cubicerrors} \titlecaption{Estimating the \emph{mean squared error}.} {The deviation error (orange) between the prediction (red line) and the observations (blue dots) is calculated for each data point (left). Then the deviations are squared and the average is calculated (right).} \label{cubicerrorsfig} \end{figure} Replacing $y^{est}$ in the mean squared error \eqref{meansquarederror} with our cubic model \eqref{cubicfunc}, the cost function reads \begin{eqnarray} f_{cost}(c|\{(x_i, y_i)\}) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;c))^2 \label{msefunc} \\ & = & \frac{1}{N} \sum_{i=1}^N (y_i - c x_i^3)^2 \label{msecube} \end{eqnarray} The optimization process tries to find a value for the factor $c$ such that the cost function is minimized. With the mean squared error as the cost function this optimization process is also called method of \enterm{least squares} (\determ[quadratischer Fehler!kleinster]{Methode der kleinsten Quadrate}). \begin{exercise}{meanSquaredErrorCubic.m}{} Implement the objective function \eqref{msecube} as a function \varcode{meanSquaredErrorCubic()}. The function takes three arguments. The first is a vector of $x$-values and the second contains the measurements $y$ for each value of $x$. The third argument is the value of the factor $c$. The function returns the mean squared error. \end{exercise} \section{Graph of the cost function} For each value of the parameter $c$ of the model we can use \eqnref{msecube} to calculate the corresponding value of the cost function. The cost function $f_{cost}(c|\{(x_i, y_i)\}|)$ is a function $f_{cost}(c)$ that maps the parameter value $c$ to a scalar error value. For a given data set we thus can simply plot the cost function as a function of the parameter $c$ (\figref{cubiccostfig}). \begin{exercise}{plotcubiccosts.m}{} Calculate the mean squared error between the data and the cubic function for a range of values of the factor $c$ using the \varcode{meanSquaredErrorCubic()} function from the previous exercise. Plot the graph of the cost function. \end{exercise} \begin{figure}[t] \includegraphics{cubiccost} \titlecaption{Minimum of the cost function.}{For a given data set the cost function, the mean squared error \eqnref{msecube}, as a function of the unknown parameter $c$ has a minimum close to the true value of $c$ that was used to generate the data (left). Simply taking the absolute minimum of the cost function computed for a pre-set range of values for the parameter $c$, has the disadvantage to be limited in precision (right) and the possibility to entirely miss the global minimum if it is outside the computed range.}\label{cubiccostfig} \end{figure} By looking at the plot of the cost function we can visually identify the position of the minimum and thus estimate the optimal value for the parameter $c$. How can we use the cost function to guide an automatic optimization process? The obvious approach would be to calculate the mean squared error for a range of parameter values and then find the position of the minimum using the \code{min()} function. This approach, however has several disadvantages: (i) The accuracy of the estimation of the best parameter is limited by the resolution used to sample the parameter space. The coarser the parameters are sampled the less precise is the obtained position of the minimum (\figref{cubiccostfig}, right). (ii) The range of parameter values might not include the absolute minimum. (iii) In particular for functions with more than a single free parameter it is computationally expensive to calculate the cost function for each parameter combination at a sufficient resolution. The number of combinations increases exponentially with the number of free parameters. This is known as the \enterm{curse of dimensionality}. So we need a different approach. We want a procedure that finds the minimum of the cost function with a minimal number of computations and to arbitrary precision. \section{Gradient} \begin{figure}[t] \includegraphics{cubicgradient} \titlecaption{Derivative of the cost function.}{The gradient, the derivative \eqref{costderivative} of the cost function, is negative to the left of the minimum (vertical line) of the cost function, zero (horizontal line) at, and positive to the right of the minimum (left). For each value of the parameter $c$ the negative gradient (arrows) points towards the minimum of the cost function (right).} \label{gradientcubicfig} \end{figure} \begin{ibox}[b]{\label{differentialquotientbox}Difference quotient and derivative} \includegraphics[width=0.33\textwidth]{derivative} \hfill \begin{minipage}[b]{0.63\textwidth} The difference quotient \begin{equation} \label{difffrac} m = \frac{f(x + \Delta x) - f(x)}{\Delta x} \end{equation} of a function $y = f(x)$ is the slope of the secant (red) defined by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ at distance $\Delta x$. The slope of the function $y=f(x)$ at the position $x$ (orange) is given by the derivative $f'(x)$ of the function at that position. It is defined by the difference quotient in the limit of infinitesimally (red and yellow) small distances $\Delta x$: \begin{equation} \label{derivative} f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \end{equation} \end{minipage}\vspace{2ex} It is not possible to calculate the exact value of the derivative \eqref{derivative} numerically. The derivative can only be estimated by computing the difference quotient \eqref{difffrac} using sufficiently small $\Delta x$. \end{ibox} Imagine to place a ball at some point on the cost function \figref{cubiccostfig}. Naturally, it would roll down the slope and eventually stop at the minimum of the error surface (if it had no inertia). We will use this analogy to develop an algorithm to find our way to the minimum of the cost function. The ball always follows the steepest slope. Thus we need to figure out the direction of the slope at the position of the ball. In our one-dimensional example of a single free parameter the slope is simply the derivative of the cost function with respect to the parameter $c$ (\figref{gradientcubicfig}, left). This derivative is called the \entermde{Gradient}{gradient} of the cost function: \begin{equation} \label{costderivative} \nabla f_{cost}(c) = \frac{{\rm d} f_{cost}(c)}{{\rm d} c} \end{equation} There is no need to calculate this derivative analytically, because it can be approximated numerically by the difference quotient (Box~\ref{differentialquotientbox}) for small steps $\Delta c$: \begin{equation} \label{costderivativediff} \frac{{\rm d} f_{cost}(c)}{{\rm d} c} = \lim\limits_{\Delta c \to 0} \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c} \approx \frac{f_{cost}(c + \Delta c) - f_{cost}(c)}{\Delta c} \end{equation} Choose, for example, $\Delta c = 10^{-7}$. The derivative is positive for positive slopes. Since want to go down the hill, we choose the opposite direction (\figref{gradientcubicfig}, right). \begin{exercise}{meanSquaredGradientCubic.m}{}\label{gradientcubic} Implement a function \varcode{meanSquaredGradientCubic()}, that takes the $x$- and $y$-data and the parameter $c$ as input arguments. The function should return the derivative of the mean squared error $f_{cost}(c)$ with respect to $c$ at the position $c$. \end{exercise} \begin{exercise}{plotcubicgradient.m}{} Using the \varcode{meanSquaredGradientCubic()} function from the previous exercise, plot the derivative of the cost function as a function of $c$. \end{exercise} \section{Gradient descent} \begin{figure}[t] \includegraphics{cubicmse} \titlecaption{Gradient descent.}{The algorithm starts at an arbitrary position. At each point the gradient is estimated and the position is updated as long as the length of the gradient is sufficiently large. The dots show the positions after each iteration of the algorithm.} \label{gradientdescentcubicfig} \end{figure} Finally, we are able to implement the optimization itself. By now it should be obvious why it is called the gradient descent method. From a starting position on we iteratively walk down the slope of the cost function against its gradient. All ingredients necessary for this algorithm are already there. We need: (i) the cost function (\varcode{meanSquaredErrorCubic()}), and (ii) the gradient (\varcode{meanSquaredGradientCubic()}). The algorithm of the gradient descent works as follows: \begin{enumerate} \item Start with some arbitrary value $p_0$ for the parameter $c$. \item \label{computegradient} Compute the gradient \eqnref{costderivative} at the current position $p_i$. \item If the length of the gradient, the absolute value of the derivative \eqnref{costderivative}, is smaller than some threshold value, we assume to have reached the minimum and stop the search. We return the current parameter value $p_i$ as our best estimate of the parameter $c$ that minimizes the cost function. We are actually looking for the point at which the derivative of the cost function equals zero, but this is impossible because of numerical imprecision. We thus apply a threshold below which we are sufficiently close to zero. \item \label{gradientstep} If the length of the gradient exceeds the threshold we take a small step into the opposite direction: \begin{equation} \label{gradientdescent} p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(p_i) \end{equation} where $\epsilon$ is a factor linking the gradient to appropriate steps in the parameter space. \item Repeat steps \ref{computegradient} -- \ref{gradientstep}. \end{enumerate} \Figref{gradientdescentcubicfig} illustrates the gradient descent --- the path the imaginary ball has chosen to reach the minimum. We walk along the parameter axis against the gradient as long as the gradient differs sufficiently from zero. At steep slopes we take large steps (the distance between the red dots in \figref{gradientdescentcubicfig} is large). \begin{exercise}{gradientDescentCubic.m}{}\label{gradientdescentexercise} Implement the gradient descent algorithm for the problem of fitting a cubic function \eqref{cubicfunc} to some measured data pairs $x$ and $y$ as a function \varcode{gradientDescentCubic()} that returns the estimated best fitting parameter value $c$ as well as two vectors with all the parameter values and the corresponding values of the cost function that the algorithm iterated trough. As additional arguments that function takes the initial value for the parameter $c$, the factor $\epsilon$ connecting the gradient with iteration steps in \eqnref{gradientdescent}, and the threshold value for the absolute value of the gradient terminating the algorithm. \end{exercise} \begin{exercise}{plotgradientdescentcubic.m}{}\label{plotgradientdescentexercise} Use the function \varcode{gradientDescentCubic()} to fit the simulated data from exercise~\ref{mseexercise}. Plot the returned values of the parameter $c$ as as well as the corresponding mean squared errors as a function of iteration step (two plots). Compare the result of the gradient descent method with the true value of $c$ used to simulate the data. Inspect the plots and adapt $\epsilon$ and the threshold to make the algorithm behave as intended. Finally plot the data together with the best fitting cubic relation \eqref{cubicfunc}. \end{exercise} The $\epsilon$ parameter in \eqnref{gradientdescent} is critical. If too large, the algorithm does not converge to the minimum of the cost function (try it!). At medium values it oscillates around the minimum but might nevertheless converge. Only for sufficiently small values (here $\epsilon = 0.00001$) does the algorithm follow the slope downwards towards the minimum. Change $\epsilon$ by factors of ten to adapt it to a specific problem. The terminating condition on the absolute value of the gradient influences how often the cost function is evaluated. The smaller the threshold value the more often the cost is computed and the more precisely the fit parameter is estimated. If it is too small, however, the increase in precision is negligible, in particular in comparison to the increased computational effort. Have a look at the derivatives that we plotted in exercise~\ref{gradientcubic} and decide on a sensible value for the threshold. Run the gradient descent algorithm and check how the resulting $c$ parameter values converge and how many iterations were needed. Then reduce the threshold (by factors of ten) and check how this changes the results. Many modern algorithms for finding the minimum of a function are based on the basic idea of the gradient descent. Luckily these algorithms choose $\epsilon$ in a smart adaptive way and they also come up with sensible default values for the termination condition. On the other hand, these algorithm often take optional arguments that let you control how they behave. Now you know what this is all about. \section{N-dimensional minimization problems} So far we were concerned about finding the right value for a single parameter that minimizes a cost function. Often we deal with functions that have more than a single parameter, in general $n$ parameters. We then need to find the minimum in an $n$ dimensional parameter space. For our tiger problem, we could have also fitted the exponent $a$ of the power-law relation between size and weight, instead of assuming a cubic relation with $a=3$: \begin{equation} \label{powerfunc} y = f(x; c, a) = f(x; \vec p) = c\cdot x^a \end{equation} We then could check whether the resulting estimate of the exponent $a$ indeed is close to the expected power of three. The power-law \eqref{powerfunc} has two free parameters $c$ and $a$. Instead of a single parameter we are now dealing with a vector $\vec p$ containing $n$ parameter values. Here, $\vec p = (c, a)$. All the concepts we introduced on the example of the one dimensional problem of tiger weights generalize to $n$-dimensional problems. We only need to adapt a few things. The cost function for the mean squared error reads \begin{equation} \label{ndimcostfunc} f_{cost}(\vec p|\{(x_i, y_i)\}) = \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;\vec p))^2 \end{equation} \begin{figure}[t] \includegraphics{powergradientdescent} \titlecaption{Gradient descent on an error surface.}{Contour plot of the cost function \eqref{ndimcostfunc} for the fit of a power law \eqref{powerfunc} to some data. Here the cost function is a long and narrow valley on the plane spanned by the two parameters $c$ and $a$ of the power law. The red line marks the path of the gradient descent algorithm. The gradient is always perpendicular to the contour lines. The algorithm quickly descends into the valley and then slowly creeps on the shallow bottom of the valley towards the global minimum where it terminates (yellow circle). } \label{powergradientdescentfig} \end{figure} For two-dimensional problems the graph of the cost function is an \enterm{error surface} (\determ{{Fehlerfl\"ache}}). The two parameters span a two-dimensional plane. The cost function assigns to each parameter combination on this plane a single value. This results in a landscape over the parameter plane with mountains and valleys and we are searching for the position of the bottom of the deepest valley (\figref{powergradientdescentfig}). \begin{ibox}[t]{\label{partialderivativebox}Partial derivatives and gradient} Some functions depend on more than a single variable. For example, the function \[ z = f(x,y) \] depends on both $x$ and $y$. Using the partial derivatives \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \] and \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \] one can calculate the slopes in the directions of each of the variables by means of the respective difference quotients (see box~\ref{differentialquotientbox}). \vspace{1ex} \begin{minipage}[t]{0.5\textwidth} \mbox{}\\[-2ex] \includegraphics[width=1\textwidth]{gradient} \end{minipage} \hfill \begin{minipage}[t]{0.46\textwidth} For example, the partial derivatives of \[ f(x,y) = x^2+y^2 \vspace{-1ex} \] are \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\] The gradient is a vector with the partial derivatives as coordinates: \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \] This vector points into the direction of the strongest ascend of $f(x,y)$. \end{minipage} \vspace{0.5ex} The figure shows the contour lines of a bivariate Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ and the gradient (thick arrows) together with the corresponding partial derivatives (thin arrows) for three different locations. \end{ibox} When we place a ball somewhere on the slopes of a hill, it rolls downwards and eventually stops at the bottom. The ball always rolls in the direction of the steepest slope. For a two-dimensional parameter space of our example, the \entermde{Gradient}{gradient} (box~\ref{partialderivativebox}) of the cost function is a vector \begin{equation} \label{gradientpowerlaw} \nabla f_{cost}(c, a) = \left( \frac{\partial f_{cost}(c, a)}{\partial c}, \frac{\partial f_{cost}(c, a)}{\partial a} \right) \end{equation} that points into the direction of the strongest ascend of the cost function. The gradient is given by the partial derivatives (box~\ref{partialderivativebox}) of the mean squared error with respect to the parameters $c$ and $a$ of the power law relation. In general, for $n$-dimensional problems, the gradient is an $n-$ dimensional vector containing for each of the $n$ parameters $p_j$ the respective partial derivatives as coordinates: \begin{equation} \label{gradient} \nabla f_{cost}(\vec p) = \left( \frac{\partial f_{cost}(\vec p)}{\partial p_j} \right) \end{equation} Despite the fancy words this simply means that we need to calculate the derivatives in the very same way as we have done it for the case of a single parameter, \eqnref{costderivativediff}, for each parameter separately. The iterative equation \eqref{gradientdescent} of the gradient descent stays exactly the same, with the only difference that the current parameter value $p_i$ becomes a vector $\vec p_i$ of parameter values: \begin{equation} \label{ndimgradientdescent} \vec p_{i+1} = \vec p_i - \epsilon \cdot \nabla f_{cost}(\vec p_i) \end{equation} For each parameter we subtract the corresponding derivative multiplied with $\epsilon$. The algorithm proceeds along the negative gradient (\figref{powergradientdescentfig}). For the termination condition we need the length of the gradient. In one dimension it was just the absolute value of the derivative. For $n$ dimensions this is according to the \enterm{Pythagorean theorem} (\determ[Pythagoras!Satz des]{Satz des Pythagoras}) the square root of the sum of the squared partial derivatives: \begin{equation} \label{ndimabsgradient} |\nabla f_{cost}(\vec p_i)| = \sqrt{\sum_{i=1}^n \left(\frac{\partial f_{cost}(\vec p)}{\partial p_i}\right)^2} \end{equation} The \code{norm()} function implements this given a vector with the partial derivatives. \subsection{Passing a function as an argument to another function} So far, all our code for the gradient descent algorithm was tailored to a specific function, the cubic relation \eqref{cubicfunc}. It would be much better if we could pass an arbitrary function to our gradient algorithm. Then we would not need to rewrite it every time anew. This is possible. We can indeed pass a function as an argument to another function. For this use the \code{@}-operator. As an example let's define a function that produces a standard plot for a function: \pageinputlisting[caption={Example function taking a function as argument.}]{funcPlotter.m} This function can then be used as follows for plotting a sine wave. We pass the built in \varcode{sin()} function as \varcode{@sin} as an argument to our function: \pageinputlisting[caption={Passing a function handle as an argument to a function.}]{funcplotterexamples.m} \subsection{Gradient descent algorithm for arbitrary functions} Now we are ready to adapt the gradient descent algorithm from exercise~\ref{gradientdescentexercise} to arbitrary functions with $n$ parameters that we want to fit to some data. \begin{exercise}{gradientDescent.m}{} Adapt the function \varcode{gradientDescentCubic()} from exercise~\ref{gradientdescentexercise} to implement the gradient descent algorithm for any function \varcode{func(x, p)} that takes as first argument the $x$-data values and as second argument a vector with parameter values. The new function takes a vector $\vec p_0$ for the initial parameter values and also returns the best parameter combination as a vector. Use a \varcode{for} loop over the two dimensions for computing the gradient. \end{exercise} For testing our new function we need to implement the power law \eqref{powerfunc}: \begin{exercise}{powerLaw.m}{} Write a function that implements \eqref{powerfunc}. The function gets as arguments a vector $x$ containing the $x$-data values and another vector containing as elements the parameters for the power law, i.e. the factor $c$ and the power $a$. It returns a vector with the computed $y$ values for each $x$ value. \end{exercise} Now let's use the new gradient descent function to fit a power law to our tiger data-set: \begin{exercise}{plotgradientdescentpower.m}{} Use the function \varcode{gradientDescent()} to fit the \varcode{powerLaw()} function to the simulated data from exercise~\ref{mseexercise}. Plot the returned values of the two parameters against each other. Compare the result of the gradient descent method with the true values of $c$ and $a$ used to simulate the data. Observe the norm of the gradient and inspect the plots to adapt $\epsilon$ and the threshold if necessary. Finally plot the data together with the best fitting power-law \eqref{powerfunc}. \end{exercise} Note that in our specific example on tiger sizes and weights the simulated data look on a first glance like being linearly related (\figref{cubicdatafig}). The true cubic relation between weights and sizes is not that obvious, because of the limited range of tiger sizes. Nevertheless, the cost function has a minimum at the bottom of a valley that is very narrow in the direction of the expontent (\figref{powergradientdescentfig}). The exponent of about three is thus clearly defined by the data. \section{Fitting non-linear functions to data} The gradient descent is a basic numerical method for solving optimization problems. It is used to find the global minimum of an objective function. Curve fitting is a specific optimization problem and a common application for the gradient descent method. For the case of fitting straight lines to data pairs, the error surface (using the mean squared error) has exactly one clearly defined global minimum. In fact, the position of the minimum can be analytically calculated as shown in the next chapter. For linear fitting problems numerical methods like the gradient descent are not needed. Fitting problems that involve nonlinear functions of the parameters, e.g. the power law \eqref{powerfunc} or the exponential function $f(t;\tau) = e^{-t/\tau}$, do in general not have an analytical solution for the least squares. To find the least squares for such functions numerical methods such as the gradient descent have to be applied. The suggested gradient descent algorithm requires manually tuned values for $\epsilon$ and the threshold for terminating the iteration. The algorithm can be improved in multiple ways to converge more robustly and faster. Most importantly, $\epsilon$ is made dependent on the changes of the gradient from one iteration to the next. These and other numerical tricks have already been implemented in pre-defined functions. Generic optimization functions such as \mcode{fminsearch()} have been implemented for arbitrary objective functions, while the more specialized function \mcode{lsqcurvefit()} is specifically designed for optimizations in the least square error sense. \begin{exercise}{plotlsqcurvefitpower.m}{} Use the \matlab-function \varcode{lsqcurvefit()} instead of \varcode{gradientDescent()} to fit the \varcode{powerLaw()} function to the simulated data from exercise~\ref{mseexercise}. Plot the data and the resulting best fitting power law function. \end{exercise} \begin{important}[Beware of secondary minima!] Finding the absolute minimum is not always as easy as in the case of fitting a straight line. Often, the cost function has secondary or local minima in which the gradient descent stops even though there is a more optimal solution, a global minimum that is lower than the local minimum. Starting from good initial positions is a good approach to avoid getting stuck in local minima. Also keep in mind that cost functions tend to be simpler (less local minima) the fewer parameters are fitted from the data. Each additional parameter increases complexity and is computationally more expensive. \end{important} \section{Evolution as an optimization problem} Evolution is a biological implementation of an optimization algorithm. The objective function is an organism's fitness. This needs to be maximized (this is the same as minimizing the negative fitness). The parameters of this optimization problem are all the many genes on the DNA. This is a very high-dimensional optimization problem. By cross-over and mutations a population of a species moves along the high-dimensional parameter space. Selection processes make sure that only organisms with higher fitness pass on their genes to the next generations. In this way the algorithm is not directed towards higher fitness, as the gradient descent method would be. Rather, some neighborhood of the parameter space is randomly probed. That way it is even possible to escape a local maximum and find a potentially better maximum. For this reason, \enterm[genetic algorithm]{genetic algorithms} try to mimic evolution in the context of high-dimensional optimization problems, in particular with discrete parameter values. In biological evolution, the objective function, however, is not a fixed function. It may change in time by changing abiotic and biotic environmental conditions, making this a very complex but also interesting optimization problem. \subsection{Optimal design of neural systems} How should a neuron or neural network be designed? As a particular aspect of the general evolution of a species, this is a fundamental question in the neurosciences. Maintaining a neural system is costly. By their simple presence neurons incur costs. They need to be built and maintained, they occupy space and consume resources. Equipping a neuron with more ion channels also costs. And neural activity makes it more costly to maintain concentration gradients of ions. This all boils down to the consumption of more ATP, the currency of metabolism. On the other hand each neuron provides some useful function. In the end, neurons make the organism to behave in some sensible way to increase the overall fitness of the organism. On the level of neurons that means that they should faithfully represent and process behaviorally relevant sensory stimuli, make a sensible decision, store some important memory, or initiate and control movements in a directed way. Unfortunately there is a tradeoff. Better neural function usually involves higher costs. More ion channels reduce intrinsic noise which usually favors the precision of neural responses. Higher neuronal activity improves the quality of the encoding of sensory stimuli. More neurons are required for more complex computations. And so on. Understanding why a neuronal system is designed in some specific way requires to also understand these tradeoffs. The number of neurons, the number and types of ion channels, the length of axons, the number of synapses, the way neurons are connected, etc. are all parameters that could be optimized. For the objective function the function of the neurons needs to be quantified, for example by measures from information or detection theory, and their dependence on the parameters. From these benefits the costs need to be subtracted. And then one is interested in finding the maximum of the resulting objective function. Maximization (or minimization) problems are not only a tool for data analysis, rather they are at the core of many --- not only biological or neuroscientific --- problems. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \printsolutions