[regression] improved the chapter
the response values is minimized. One basic numerical method used for
such optimization problems is the so-called gradient descent, which is
introduced in this chapter.

%%% Another simple verbal example? Perhaps from population ecology?

\begin{figure}[t]
\includegraphics[width=1\textwidth]{lin_regress}\hfill
\titlecaption{Example data suggesting a linear relation.}{A set of
The data plotted in \figref{linregressiondatafig} suggest a linear
relation between input and output of the system. We thus assume that a
straight line
\begin{equation}
\label{straightline}
y = f(x; m, b) = m\cdot x + b
\end{equation}
is an appropriate model to describe the system. The line has two free
parameters, the slope $m$ and the $y$-intercept $b$. We need to find
values for the slope and the intercept that best describe the measured
data.

Before the optimization can be done we need to specify what exactly is
considered an optimal fit. In our example we search for the parameter
combination that describes the relation of $x$ and $y$ best. What is
meant by this? Each input $x_i$ leads to a measured output $y_i$ and
for each $x_i$ there is a \emph{prediction} or \emph{estimation}
$y^{est}_i$ of the output value by the model. At each $x_i$ estimation
and measurement have a distance or error $y_i - y_i^{est}$. In our
example the estimation is given by the equation $y_i^{est} =
f(x_i;m,b)$. The best fitting model with parameters $m$ and $b$ is the
one that minimizes the distances between observation $y_i$ and
estimation $y_i^{est}$ (\figref{leastsquareerrorfig}).

As a first guess we could simply minimize the sum $\sum_{i=1}^N y_i -
y^{est}_i$. This approach, however, will not work since a minimal sum
can also be achieved if half of the measurements lies above and the
other half below the predicted line. Positive and negative errors
would cancel out and then sum up to values close to zero. A better
approach is to sum over the absolute values of the distances:
$\sum_{i=1}^N |y_i - y^{est}_i|$. This sum can only be small if all
deviations are indeed small no matter if they are above or below the
predicted line. Instead of the sum we could also take the average
\begin{equation}
\label{meanabserror}
f_{dist}(\{(x_i, y_i)\}|\{y^{est}_i\}) = \frac{1}{N} \sum_{i=1}^N |y_i - y^{est}_i|
\end{equation}
For reasons that are explained in
chapter~\ref{maximumlikelihoodchapter}, instead of the averaged
absolute errors, the \enterm[mean squared error]{mean squared error}
(\determ[quadratischer Fehler!mittlerer]{mittlerer quadratischer
Fehler})
\begin{equation}
\label{meansquarederror}
f_{mse}(\{(x_i, y_i)\}|\{y^{est}_i\}) = \frac{1}{N} \sum_{i=1}^N (y_i - y^{est}_i)^2
\end{equation}
is commonly used (\figref{leastsquareerrorfig}). Similar to the
absolute distance, the square of the errors, $(y_i - y_i^{est})^2$, is
always positive and thus positive and negative error values do not
cancel each other out. In addition, the square punishes large
deviations over small deviations.

\begin{exercise}{meanSquaredErrorLine.m}{}\label{mseexercise}%
Given a vector of observations \varcode{y} and a vector with the
corresponding predictions \varcode{y\_est}, compute the \emph{mean
square error} between \varcode{y} and \varcode{y\_est} in a single
line of code.
\end{exercise}

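The mean squared error \eqref{meansquarederror} really is a one-liner in most array languages. A minimal Python sketch of the idea (the exercise itself asks for MATLAB; the data values here are made up):

```python
import numpy as np

# Made-up observations and corresponding model predictions.
y = np.array([2.0, 3.9, 6.1, 8.0])
y_est = np.array([2.1, 4.0, 5.9, 8.2])

# Mean squared error in a single line: the average of the squared residuals.
mse = np.mean((y - y_est) ** 2)
print(mse)
```

In MATLAB the same one-liner reads \varcode{mean((y - y\_est).\^{}2)}.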
\section{Objective function}

The mean squared error is a so-called \enterm{objective function} or
\enterm{cost function} (\determ{Kostenfunktion}), $f_{cost}(\{(x_i,
y_i)\}|\{y^{est}_i\})$. A cost function assigns to the given data set
$\{(x_i, y_i)\}$ and corresponding model predictions $\{y^{est}_i\}$ a
single scalar value that we want to minimize. Here we aim to adapt the
model parameters to minimize the mean squared error
\eqref{meansquarederror}. In chapter~\ref{maximumlikelihoodchapter} we
show that the minimization of the mean square error is equivalent to
maximizing the likelihood that the observations originate from the
model (assuming a normal distribution of the data around the model
prediction). The \enterm{cost function} does not have to be the mean
square error but can be any function that maps the data and the
predictions to a scalar value describing the quality of the fit. In
the optimization process we aim for the parameter combination that
minimizes the costs.

\begin{figure}[t]
\includegraphics[width=1\textwidth]{linear_least_squares}
\label{leastsquareerrorfig}
\end{figure}

%%% Simple verbal example? Perhaps from population ecology?
Replacing $y^{est}$ with our model, the straight line
\eqref{straightline}, yields
\begin{eqnarray}
f_{cost}(\{(x_i, y_i)\}|m,b) & = & \frac{1}{N} \sum_{i=1}^N (y_i - f(x_i;m,b))^2 \label{msefunc} \\
& = & \frac{1}{N} \sum_{i=1}^N (y_i - m x_i - b)^2 \label{mseline}
\end{eqnarray}
That is, the mean square error is given by the pairs $(x_i, y_i)$ of
measurements and the parameters $m$ and $b$ of the straight line. The
optimization process tries to find $m$ and $b$ such that the cost
function is minimized. With the mean squared error as the cost
function this optimization process is also called the method of the
\enterm[square error!least]{least square error} (\determ[quadratischer
Fehler!kleinster]{Methode der kleinsten Quadrate}).

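Written out as code, the cost function \eqref{mseline} takes a parameter pair and the data and returns a single number. A Python sketch of this (the function and variable names are our own choices):

```python
import numpy as np

def cost(m, b, x, y):
    # Mean squared error between the data y and the line m*x + b, eq. (mseline).
    return np.mean((y - m * x - b) ** 2)

# Data lying exactly on the line y = 2x + 1:
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

print(cost(2.0, 1.0, x, y))  # the true parameters give zero cost
print(cost(2.0, 0.0, x, y))  # a wrong intercept is punished
```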
\begin{exercise}{meanSquaredError.m}{}
Implement the objective function \varcode{meanSquaredError()} that
uses a straight line, \eqnref{straightline}, as a model. The
function takes three arguments. The first is a 2-element vector that
contains the values of parameters \varcode{m} and \varcode{b}. The
second is a vector of x-values, and the third contains the
measurements for each value of $x$, the respective $y$-values. The
function returns the mean square error \eqnref{mseline}.
\end{exercise}

\section{Error surface}
For each combination of the two parameters $m$ and $b$ of the model we
can use \eqnref{mseline} to calculate the corresponding value of the
cost function. We thus consider the cost function $f_{cost}(\{(x_i,
y_i)\}|m,b)$ as a function $f_{cost}(m,b)$, that maps the parameter
values $m$ and $b$ to an error value. The error values describe a
landscape over the $m$-$b$ plane, the error surface, that can be
illustrated graphically using a 3-d surface plot. $m$ and $b$ are
plotted on the $x$- and $y$-axis while the third dimension indicates
the error value (\figref{errorsurfacefig}).

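The error surface is nothing more than the cost function evaluated on a grid of parameter values. A Python sketch under the same assumptions as the error-surface exercise (20 noisy data points around $y = 0.75x - 40$; all names and grid ranges are our choices):

```python
import numpy as np

def cost(m, b, x, y):
    # Mean squared error between the data and the line m*x + b.
    return np.mean((y - m * x - b) ** 2)

# Noisy data around the line y = 0.75*x - 40, as in the exercise.
rng = np.random.default_rng(0)
x = 120.0 * rng.random(20)
y = 0.75 * x - 40.0 + 15.0 * rng.standard_normal(20)

# Evaluate the cost for every combination on an (m, b) grid -> error surface.
ms = np.linspace(-0.5, 2.0, 51)
bs = np.linspace(-80.0, 0.0, 51)
errors = np.array([[cost(m, b, x, y) for b in bs] for m in ms])

# The grid cell with the smallest error is a coarse estimate of the
# best-fitting slope and intercept:
i, j = np.unravel_index(np.argmin(errors), errors.shape)
print(ms[i], bs[j])
```

Note that the precision of this grid-search estimate is limited by the grid resolution, which is exactly the drawback discussed below.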
\begin{figure}[t]
\includegraphics[width=0.75\textwidth]{error_surface}
\titlecaption{Error surface.}{The two model parameters $m$ and $b$
define the base area of the surface plot. For each parameter
combination of slope and intercept the error is calculated. The
combination that best fits the data.}\label{errorsurfacefig}
\end{figure}

\begin{exercise}{errorSurface.m}{}\label{errorsurfaceexercise}
Generate 20 data pairs $(x_i|y_i)$ that are linearly related with
slope $m=0.75$ and intercept $b=-40$, using \varcode{rand()} for
drawing $x$ values between 0 and 120 and \varcode{randn()} for
jittering the $y$ values with a standard deviation of 15. Then
calculate the mean squared error between the data and straight lines
for a range of slopes and intercepts using the
\varcode{meanSquaredError()} function from the previous exercise.
Illustrate the error surface using the \code{surface()} function
(consult the help to find out how to use \code{surface()}).
\end{exercise}

By looking at the error surface we can directly see the position of
the minimum and thus estimate the optimal parameter combination. How
can we use the error surface to guide an automatic optimization
process?

The obvious approach would be to calculate the error surface and then
find the position of the minimum using the \code{min} function. This
approach, however, has several disadvantages: (i) it is computationally
very expensive to calculate the error for each parameter
combination. The number of combinations increases exponentially with
the number of free parameters (also known as the ``curse of
dimensionality''). (ii) the accuracy with which the best parameters
can be estimated is limited by the resolution used to sample the
parameter space. The coarser the parameters are sampled the less
precise is the obtained position of the minimum.

We want a procedure that finds the minimum of the cost function with a
minimal number of computations and to arbitrary precision.

\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
\includegraphics[width=0.33\textwidth]{derivative}
f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \end{equation}
\end{minipage}\vspace{2ex}

It is not possible to calculate the derivative, \eqnref{derivative},
numerically. The derivative can only be estimated using the
difference quotient, \eqnref{difffrac}, by using sufficiently small
$\Delta x$.
\end{ibox}

\begin{ibox}[tp]{\label{partialderivativebox}Partial derivative and gradient}
Some functions depend on more than a single variable. For example
\[ z = f(x,y) \]
depends on $x$ and $y$. Using the partial derivative
individually by using the respective difference quotient
(Box~\ref{differentialquotientbox}). \vspace{1ex}

\begin{minipage}[t]{0.5\textwidth}
\mbox{}\\[-2ex]
\includegraphics[width=1\textwidth]{gradient}
\end{minipage}
\hfill
\begin{minipage}[t]{0.46\textwidth}
For example, the partial derivatives of
\[ f(x,y) = x^2+y^2 \] are
\[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]

The gradient is a vector that is constructed from the partial derivatives:
\[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
This vector points into the direction of the steepest ascent of
$f(x,y)$.
\end{minipage}

\vspace{0.5ex} The figure shows the contour lines of a bi-variate
Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ and the gradient (thick
arrows) and the corresponding two partial derivatives (thin arrows)
for three different locations.
\end{ibox}

\section{Gradient}
Imagine placing a small ball at some point on the error surface
(\figref{errorsurfacefig}). Naturally, it would roll down the steepest
slope and eventually stop at the minimum of the error surface (if it
had no inertia). We will use this picture to develop an algorithm to
find our way to the minimum of the objective function. The ball will
always follow the steepest slope. Thus we need to figure out the
direction of the steepest slope at the position of the ball.

The \entermde{Gradient}{gradient} (Box~\ref{partialderivativebox}) of
the objective function is the vector
\begin{equation}
\label{gradient}
\nabla f_{cost}(m,b) = \left( \frac{\partial f(m,b)}{\partial m},
\frac{\partial f(m,b)}{\partial b} \right)
\end{equation}
that points into the direction of the steepest ascent of the objective
function. The gradient is given by partial derivatives
(Box~\ref{partialderivativebox}) of the mean squared error with
respect to the parameters $m$ and $b$ of the straight line. There is
no need to calculate it analytically because it can be estimated from
the partial derivatives using the difference quotient
(Box~\ref{differentialquotientbox}) for small steps $\Delta m$ and
$\Delta b$. For example, the partial derivative with respect to $m$
can be computed as
\[\frac{\partial f_{cost}(m,b)}{\partial m} = \lim\limits_{\Delta m \to
0} \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m}
\approx \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \; . \]

The length of the gradient indicates the steepness of the slope
(\figref{gradientquiverfig}). Since we want to go down the hill, we
choose the opposite direction.

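The two partial derivatives of the cost function can be estimated numerically exactly as described, with one difference quotient per parameter. A Python sketch (the step size and the names are our choices):

```python
import numpy as np

def cost(m, b, x, y):
    return np.mean((y - m * x - b) ** 2)

def gradient(m, b, x, y, d=1e-6):
    # Difference quotients approximate the two partial derivatives
    # of the cost with respect to m and b.
    dm = (cost(m + d, b, x, y) - cost(m, b, x, y)) / d
    db = (cost(m, b + d, x, y) - cost(m, b, x, y)) / d
    return np.array([dm, db])

# For data exactly on y = 2x + 1 the gradient vanishes at (m, b) = (2, 1):
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0
print(gradient(2.0, 1.0, x, y))  # close to [0, 0]
```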
\begin{figure}[t]
\includegraphics[width=0.75\textwidth]{error_gradient}
\titlecaption{Gradient of the error surface.} {Each arrow points
into the direction of the greatest ascent at different positions
of the error surface shown in \figref{errorsurfacefig}. The
error.}\label{gradientquiverfig}
\end{figure}

\begin{exercise}{meanSquaredGradient.m}{}\label{gradientexercise}%
Implement a function \varcode{meanSquaredGradient()}, that takes the
set of parameters $(m, b)$ of a straight line as a two-element
vector and the $x$- and $y$-data as input arguments. The function
should return the gradient at that position as a vector with two
elements.
\end{exercise}

\begin{exercise}{errorGradient.m}{}
Extend the script of exercise~\ref{errorsurfaceexercise} to plot
both the error surface and gradients using the
\varcode{meanSquaredGradient()} function from
exercise~\ref{gradientexercise}. Vectors in space can be easily
plotted using the function \code{quiver()}. Use \code{contour()}
instead of \code{surface()} to plot the error surface.
\end{exercise}

\section{Gradient descent}
Finally, we are able to implement the optimization itself. By now it
should be obvious why it is called the gradient descent method. All
ingredients are already there. We need: (i) the cost function
(\varcode{meanSquaredError()}), and (ii) the gradient
(\varcode{meanSquaredGradient()}). The algorithm of the gradient
descent works as follows:
\begin{enumerate}
\item Start with some given combination of the parameters $m$ and $b$
($p_0 = (m_0, b_0)$).
\item \label{computegradient} Calculate the gradient at the current
position $p_i$.
\item If the length of the gradient falls below a certain value, we
assume to have reached the minimum and stop the search. We are
actually looking for the point at which the length of the gradient
is zero, but finding zero is impossible because of numerical
imprecision. We thus apply a threshold below which we are
sufficiently close to zero (e.g. \varcode{norm(gradient) < 0.1}).
\item \label{gradientstep} If the length of the gradient exceeds the
threshold we take a small step into the opposite direction:
\[p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(m_i, b_i)\]
where $\epsilon = 0.01$ is a factor linking the gradient to
appropriate steps in the parameter space.
\item Repeat steps \ref{computegradient} -- \ref{gradientstep}.
\end{enumerate}

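The enumerated algorithm above can be sketched in a few lines of Python (a toy sketch with our own data and names; the exercise below implements it in MATLAB):

```python
import numpy as np

def cost(m, b, x, y):
    return np.mean((y - m * x - b) ** 2)

def gradient(m, b, x, y, d=1e-6):
    # Numerical gradient via difference quotients.
    dm = (cost(m + d, b, x, y) - cost(m, b, x, y)) / d
    db = (cost(m, b + d, x, y) - cost(m, b, x, y)) / d
    return np.array([dm, db])

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                   # data on a known line y = 2x + 1
p = np.array([0.0, 0.0])            # 1. start at some p0 = (m0, b0)
eps = 0.01                          # step-size factor
for _ in range(10000):
    g = gradient(p[0], p[1], x, y)  # 2. gradient at the current position
    if np.linalg.norm(g) < 0.1:     # 3. stop when the gradient is near zero
        break
    p = p - eps * g                 # 4. small step against the gradient
print(p)  # approaches (2, 1)
```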
\Figref{gradientdescentfig} illustrates the gradient descent --- the
path the imaginary ball has chosen to reach the minimum. Starting at
an arbitrary position on the error surface we change the position as
long as the gradient at that position is larger than a certain
threshold. If the slope is very steep, the change in the position (the
distance between the red dots in \figref{gradientdescentfig}) is
large.

\begin{figure}[t]
\includegraphics[width=0.55\textwidth]{gradient_descent}
\titlecaption{Gradient descent.}{The algorithm starts at an
arbitrary position. At each point the gradient is estimated and
the position is updated as long as the length of the gradient is
\end{figure}

\begin{exercise}{gradientDescent.m}{}
Implement the gradient descent for the problem of fitting a straight
line to some measured data. Reuse the data generated in
exercise~\ref{errorsurfaceexercise}.
\begin{enumerate}
\item Store the error value for each iteration.
\item Plot the error values as a function of the iterations, the
number of optimization steps.
\item Plot the measured data together with the best fitting straight line.
\end{enumerate}
\end{exercise}

\section{Summary}

The gradient descent is an important numerical method for solving
optimization problems. It is used to find the global minimum of an
objective function.

Curve fitting is a common application for the gradient descent method.
For the case of fitting straight lines to data pairs, the error
surface (using the mean squared error) has exactly one clearly defined
global minimum. In fact, the position of the minimum can be analytically
calculated as shown in the next chapter.

Problems that involve nonlinear computations on parameters, e.g. the
rate $\lambda$ in an exponential function $f(x;\lambda) = e^{\lambda
x}$, do not have an analytical solution for the least squares. To
find the least squares for such functions numerical methods such as
the gradient descent have to be applied.

The suggested gradient descent algorithm can be improved in multiple
ways to converge faster. For example one could adapt the step size to
the length of the gradient. These numerical tricks have already been
implemented in pre-defined functions. Generic optimization functions
such as \matlabfun{fminsearch()} have been implemented for arbitrary
objective functions, while the more specialized function
\matlabfun{lsqcurvefit()} is specifically designed for optimizations in
the least square error sense.

%\newpage
\begin{important}[Beware of secondary minima!]
Finding the absolute minimum is not always as easy as in the case of
fitting a straight line. Often, the error surface has secondary or
local minima in which the gradient descent stops even though there
is a more optimal solution, a global minimum that is lower than the
local minimum. Starting from good initial positions is a good
approach to avoid getting stuck in local minima. Also keep in mind
that error surfaces tend to be simpler (fewer local minima) the fewer
parameters are fitted from the data. Each additional parameter
increases complexity and is computationally more expensive.
\end{important}