[regression] translations 2

2018-10-11 16:55:24 +02:00
commit e500280c07
parent 51a8183f33
1 changed files with 164 additions and 155 deletions
--- a/regression/lecture/regression.tex
+++ b/regression/lecture/regression.tex
@@ -117,7 +117,7 @@ Replacing $y^{est}$ with the linear equation (the model) in
  & = & \frac{1}{N} \sum_{i=1}^N (y_i - m x_i - b)^2 \label{mseline}
 \end{eqnarray}

-That is, the meas square error given the pairs $(x_i, y_i)$ and the
+That is, the mean square error is a function of the pairs $(x_i, y_i)$ and the
 parameters $m$ and $b$ of the linear equation. The optimization
 process will now try to optimize $m$ and $b$ to lead to the smallest
 error, the method of the \enterm{least square error}.
@@ -163,66 +163,67 @@ third dimension is used to indicate the error value
  Load the dataset \textit{lin\_regression.mat} into the workspace (20
  data pairs contained in the vectors \varcode{x} and
  \varcode{y}). Implement a script \file{errorSurface.m}, that
-  calculates the mean square error between data and a linear model und
+  calculates the mean square error between data and a linear model and
  illustrates the error surface using the \code{surf()} function
  (consult the help to find out how to use \code{surf()}).
 \end{exercise}

-An der Fehlerfl\"ache kann direkt erkannt werden, bei welcher
-Parameterkombination der Fehler minimal, beziehungsweise die
-Parameterisierung optimal an die Daten angepasst ist. Wie kann die
-Fehlerfunktion und die durch sie definierte Fehlerfl\"ache nun benutzt
-werden, um den Optimierungsprozess zu leiten?
+By looking at the error surface we can directly see the position of
+the minimum and thus estimate the optimal parameter combination. How
+can we use the error surface to guide an automatic optimization
+process?

-Die naheliegenste Variante ist, von der Fehlerfl\"ache einfach den Ort
-des globalen Minimums zu bestimmen. Das ist im Allgemeinen jedoch zu
-rechenintensiv, da f\"ur jede m\"ogliche Kombination der Parameter der
-Fehler berechnet werden muss. Die Anzahl der n\"otigen Berechnungen
-steigt exponentiell mit der Anzahl der Parameter (``Fluch der
-Dimension''). Auch eine bessere Genauigkeit, mit der das Minimum
-bestimmt werden soll, erh\"oht die Anzahl der n\"otigen
-Berechnungen. Wir suchen also ein Verfahren, dass das Minimum der
-Kostenfunktion mit m\"oglichst wenigen Berechnungen findet.
+The obvious approach would be to calculate the error surface and then
+find the position of the minimum. This approach, however, has several
+disadvantages: (I) It is computationally very expensive to calculate
+the error for each parameter combination. The number of combinations
+increases exponentially with the number of free parameters (also known
+as the ``curse of dimensionality''). (II) The accuracy with which the
+best parameters can be estimated is limited by the resolution with
+which the parameter space was sampled. If the grid is too coarse, one
+might miss the minimum.

-\begin{ibox}[t]{\label{differentialquotientbox}Differenzenquotient und Ableitung}
+We thus want a procedure that finds the minimum with a minimal number
+of computations.
+
+\begin{ibox}[t]{\label{differentialquotientbox}Difference quotient and derivative}
  \includegraphics[width=0.33\textwidth]{derivative}
  \hfill
  \begin{minipage}[b]{0.63\textwidth}
-    Der Differenzenquotient 
+    The difference quotient
    \begin{equation}
      \label{difffrac}
      m = \frac{f(x + \Delta x) - f(x)}{\Delta x}
    \end{equation}
-    einer Funktion $y = f(x)$ ist die Steigung der Sekante (rot) durch
-    die beiden Punkte $(x,f(x))$ und $(x+\Delta x,f(x+\Delta x))$ mit
-    dem Abstand $\Delta x$.
+    of a function $y = f(x)$ is the slope of the secant (red) defined
+    by the points $(x,f(x))$ and $(x+\Delta x,f(x+\Delta x))$ with the
+    distance $\Delta x$.

-    Die Steigung einer Funktion $y=f(x)$ an einer Stelle $x$ (gelb) wird durch
-    die Ableitung $f'(x)$ der Funktion an dieser Stelle berechnet.  Die
-    Ableitung ist \"uber den Grenzwert (orange) des Differenzenquotienten f\"ur
-    unendlich kleine Abst\"ande $\Delta x$ definiert:
+    The slope of the function $y=f(x)$ at the position $x$ (yellow) is
+    given by the derivative $f'(x)$ of the function at that position.
+    It is defined as the limit (orange) of the difference quotient for
+    infinitesimally small distances $\Delta x$:
    \begin{equation}
      \label{derivative}
      f'(x) = \frac{{\rm d} f(x)}{{\rm d}x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \end{equation}
  \end{minipage}\vspace{2ex} 

-  Numerisch kann der Grenzwert \eqnref{derivative} nicht
-  gebildet werden. Die Ableitung kann nur durch den
-  Differenzenquotienten \eqnref{difffrac} mit gen\"ugend kleinem
-  $\Delta x$ angen\"ahert werden.
+  The limit \eqnref{derivative} cannot be calculated
+  numerically. The derivative can only be approximated by
+  the difference quotient \eqnref{difffrac} with a sufficiently
+  small $\Delta x$.
 \end{ibox}

-\begin{ibox}[t]{\label{partialderivativebox}Partielle Ableitungen und Gradient}
-  Bei Funktionen
+\begin{ibox}[t]{\label{partialderivativebox}Partial derivatives and gradient}
+  Some functions depend on more than a single variable. The function
  \[ z = f(x,y) \]
-  die von mehreren Variablen, z.B. $x$ und $y$ abh\"angen,
-  kann die Steigung in Richtung jeder dieser Variablen
-  mit den partiellen Ableitungen 
+  depends, for example, on $x$ and $y$. Using the partial derivatives
  \[ \frac{\partial f(x,y)}{\partial x} = \lim\limits_{\Delta x \to 0} \frac{f(x + \Delta x,y) - f(x,y)}{\Delta x} \]
-  und
+  and
  \[ \frac{\partial f(x,y)}{\partial y} = \lim\limits_{\Delta y \to 0} \frac{f(x, y + \Delta y) - f(x,y)}{\Delta y} \]
-  definiert \"uber den jeweiligen Differenzenquotienten
-  (Box~\ref{differentialquotientbox}) berechnet werden.  \vspace{1ex}
+  one can calculate the slope in the direction of each variable
+  individually. Both are defined via the respective difference quotient
+  (Box~\ref{differentialquotientbox}).  \vspace{1ex}

  \begin{minipage}[t]{0.44\textwidth}
    \mbox{}\\[-2ex]
@@ -230,172 +231,180 @@ Kostenfunktion mit m\"oglichst wenigen Berechnungen findet.
  \end{minipage}
  \hfill
  \begin{minipage}[t]{0.52\textwidth}
-    Z.B. lauten die partiellen Ableitungen von 
-    \[ f(x,y) = x^2+y^2 \]
+    For example, the partial derivatives of
+    \[ f(x,y) = x^2+y^2 \] are 
    \[ \frac{\partial f(x,y)}{\partial x} = 2x \; , \quad \frac{\partial f(x,y)}{\partial y} = 2y \; .\]

-    Der Gradient ist der aus den partiellen Ableitungen gebildete Vektor
+    The gradient is the vector constructed from the partial derivatives:
    \[ \nabla f(x,y) = \left( \begin{array}{c} \frac{\partial f(x,y)}{\partial x} \\[1ex] \frac{\partial f(x,y)}{\partial y} \end{array} \right) \]
-    und zeigt in Richtung des st\"arksten Anstiegs der Funktion $f(x,y)$.
+    This vector points in the direction of the steepest ascent of
+    $f(x,y)$.
  \end{minipage}

-  \vspace{1ex} Die Abbildung zeigt die Konturlinien einer bivariaten
-  Gau{\ss}glocke $f(x,y) = \exp(-(x^2+y^2)/2)$ und den Gradienten mit
-  seinen partiellen Ableitungen an drei verschiedenen Stellen.
+  \vspace{1ex} The figure shows the contour lines of a bivariate
+  Gaussian $f(x,y) = \exp(-(x^2+y^2)/2)$ and the gradient (thick
+  arrow) and the two partial derivatives (thin arrows) at three
+  different locations.
 \end{ibox}


 \section{Gradient}
+Imagine placing a small ball at some point on the error surface
+\figref{errorsurfacefig}. Naturally, it would follow the steepest
+slope and would stop at the minimum of the error surface (if it had no
+inertia). We will use this picture to develop an algorithm to find our
+way to the minimum of the objective function. The ball will always
+follow the steepest slope. Thus we need to figure out the direction of
+the steepest slope at the position of the ball.

-Wenn eine Kugel an einem beliebigen Startpunkt auf der Fehlerfl\"ache
-\figref{errorsurfacefig} losgelassen werden w\"urde, dann w\"urde sie
-entlang des steilsten Gef\"alles auf schnellsten Wege zum Minimum der
-Fehlerfl\"ache rollen und dort zum Stehen kommen (wenn sie keine
-Tr\"agheit besitzen w\"urde). Den Weg der Kugel wollen wir nun als
-Grundlage unseres Algorithmus zur Bestimmung des Minimums der
-Kostenfunktion verwenden. Da die Kugel immer entlang des steilsten
-Gef\"alles rollt, ben\"otigen wir Information \"uber die Richtung des
-Gef\"alles an der jeweils aktuellen Position.
+The \enterm{gradient} (Box~\ref{partialderivativebox}) of the
+objective function is the vector

-Der \determ{Gradient} (Box~\ref{partialderivativebox}) der Kostenfunktion
 \[ \nabla f_{cost}(m,b) = \left( \frac{\partial f(m,b)}{\partial m},
-  \frac{\partial f(m,b)}{\partial b} \right) \] bzgl. der beiden
-Parameter $m$ und $b$ der Geradengleichung ist ein Vektor, der in
-Richtung des steilsten Anstiegs der Kostenfunktion $f_{cost}(m,b)$ zeigt.
-Die L\"ange des Gradienten gibt die St\"arke des Anstiegs an
-(\figref{gradientquiverfig})).  Da wir aber abw\"arts zum Minimum
-laufen wollen, m\"ussen wir die dem Gradienten entgegengesetzte
-Richtung einschlagen.
+\frac{\partial f(m,b)}{\partial b} \right) \]
+
+that points in the direction of the steepest ascent of the objective function.
+Since we want to reach the minimum, we simply choose the opposite direction.
+
+The gradient is given by the partial derivatives
+(Box~\ref{partialderivativebox}) with respect to the parameters $m$
+and $b$ of the linear equation. There is no need to calculate them
+analytically; they can be estimated numerically
+using the difference quotient (Box~\ref{differentialquotientbox}) for
+small steps $\Delta m$ and $\Delta b$. For example, the partial
+derivative with respect to $m$ is approximated by

-Die partiellen Ableitungen m\"ussen nicht analytisch berechnet werden
-sondern k\"onnen numerisch entsprechend dem Differenzenquotienten
-(Box~\ref{differentialquotientbox}) mit kleinen Schrittweiten $\Delta
-m$ und $\Delta b$ angen\"ahert werden. z.B. approximieren wir die
-partielle Ableitung nach $m$ durch
 \[\frac{\partial f_{cost}(m,b)}{\partial m} = \lim\limits_{\Delta m \to
-  0} \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \approx \frac{f_{cost}(m + \Delta m, b) -
-  f_{cost}(m,b)}{\Delta m} \; . \]
+  0} \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m}
+\approx \frac{f_{cost}(m + \Delta m, b) - f_{cost}(m,b)}{\Delta m} \;
+. \]
+
+The length of the gradient indicates the steepness of the slope
+(\figref{gradientquiverfig}). Since we want to go down the hill, we
+choose the opposite direction.
+

 \begin{figure}[t]
  \includegraphics[width=0.75\columnwidth]{error_gradient}
-  \titlecaption{Gradient der Fehlerfl\"ache.} 
-  {Jeder Pfeil zeigt die Richtung und die
-    Steigung f\"ur verschiedene Parameterkombination aus Steigung und
-    $y$-Achsenabschnitt an. Die Konturlinien im Hintergrund
-    illustrieren die Fehlerfl\"ache. Warme Farben stehen f\"ur
-    gro{\ss}e Fehlerwerte, kalte Farben f\"ur kleine. Jede
-    Konturlinie steht f\"ur eine Linie gleichen
-    Fehlers.}\label{gradientquiverfig}
+  \titlecaption{Gradient of the error surface.}  {Each arrow points
+    in the direction of the steepest ascent at different positions
+    of the error surface shown in \figref{errorsurfacefig}. The
+    contour lines in the background illustrate the error surface. Warm
+    colors indicate high errors, cold colors low error values. Each
+    contour line connects points of equal
+    error.}\label{gradientquiverfig}
 \end{figure}

 \begin{exercise}{lsqGradient.m}{}\label{gradientexercise}%
-  Implementiere eine Funktion \code{lsqGradient()}, die den
-  Parametersatz $(m, b)$ der Geradengleichung als 2-elementigen Vektor
-  sowie die $x$- und $y$-Werte der Messdaten als Argumente
-  entgegennimmt und den Gradienten an dieser Stelle zur\"uckgibt.
+  Implement a function \code{lsqGradient()} that takes the set of
+  parameters $(m, b)$ of the linear equation as a two-element vector
+  and the $x$- and $y$-data as input arguments. The function should
+  return the gradient at that position.
 \end{exercise}

 \begin{exercise}{errorGradient.m}{}
-  Benutze die Funktion aus der vorherigen \"Ubung (\ref{gradientexercise}),
-  um f\"ur jede Parameterkombination aus der Fehlerfl\"ache
-  (\"Ubung \ref{errorsurfaceexercise}) auch den Gradienten zu
-  berechnen und darzustellen. Vektoren im Raum k\"onnen mithilfe der
-  Funktion \code{quiver()} geplottet werden.
+  Use the functions from the previous
+  exercises~\ref{errorsurfaceexercise} and~\ref{gradientexercise} to
+  estimate and plot the error surface including the gradients. Choose
+  a subset of parameter combinations for which you plot the
+  gradient. Vectors in space can be easily plotted using the function
+  \code{quiver()}.
 \end{exercise}


-\section{Gradientenabstieg}
+\section{Gradient descent}
+Finally, we are able to implement the optimization itself. By now it
+should be obvious why it is called the gradient descent method. All
+ingredients are already there. We need: 1. the error function
+(\code{meanSquareError()}), 2. the objective function
+(\code{lsqError()}), and 3. the gradient (\code{lsqGradient()}). The
+algorithm of the gradient descent is:

-Zu guter Letzt muss nur noch der \determ{Gradientenabstieg} implementiert
-werden. Die daf\"ur ben\"otigten Zutaten haben wir aus den
-vorangegangenen \"Ubungen bereits vorbereitet. Wir brauchen: 1. Die Fehlerfunktion
-(\code{meanSquareError()}), 2. die Zielfunktion (\code{lsqError()})
-und 3. den Gradienten (\code{lsqGradient()}).  Der Algorithmus
-f\"ur den Abstieg lautet:
 \begin{enumerate}
-\item Starte mit einer beliebigen Parameterkombination $p_0 = (m_0,
-  b_0)$.
-\item \label{computegradient} Berechne den Gradienten an der akutellen Position $p_i$.
-\item Wenn die L\"ange des Gradienten einen bestimmten Wert
-  unterschreitet, haben wir das Minum gefunden und k\"onnen die Suche
-  abbrechen.  Wir suchen ja das Minimum, bei dem der Gradient gleich
-  Null ist. Da aus numerischen Gr\"unden der Gradient nie exakt Null
-  werden wird, k\"onnen wir nur fordern, dass er hinreichend klein
-  wird (z.B. \varcode{norm(gradient) < 0.1}).
-\item \label{gradientstep} Gehe einen kleinen Schritt ($\epsilon =
-  0.01$) in die entgegensetzte Richtung des Gradienten:
+\item Start with any given combination of the parameters $m$ and $b$ ($p_0 = (m_0,
+  b_0)$).
+\item \label{computegradient} Calculate the gradient at the current
+  position $p_i$.
+\item If the length of the gradient falls below a certain value, we
+  assume that we have reached the minimum and stop the search. We are
+  actually looking for the point at which the length of the gradient
+  is zero, but reaching exactly zero is impossible for numerical reasons. We
+  thus apply a threshold below which we are sufficiently close to zero
+  (e.g. \varcode{norm(gradient) < 0.1}).
+\item \label{gradientstep} If the length of the gradient exceeds the
+  threshold, we take a small step in the direction opposite to the
+  gradient ($\epsilon = 0.01$):
  \[p_{i+1} = p_i - \epsilon \cdot \nabla f_{cost}(m_i, b_i)\]
-\item Wiederhole die Schritte \ref{computegradient} -- \ref{gradientstep}.
+\item Repeat steps \ref{computegradient} --
+  \ref{gradientstep}.
 \end{enumerate}

-Abbildung \ref{gradientdescentfig} zeigt den Verlauf des
-Gradientenabstiegs. Von einer Startposition aus wird die Position
-solange ver\"andert, wie der Gradient eine bestimmte Gr\"o{\ss}e
-\"uberschreitet. An den Stellen, an denen der Gradient sehr stark ist,
-ist auch die Ver\"anderung der Position gro{\ss} und der Abstand der
-Punkte in Abbildung \ref{gradientdescentfig} gro{\ss}.
+\Figref{gradientdescentfig} illustrates the gradient descent (the path
+the imaginary ball has chosen to reach the minimum). Starting at an
+arbitrary position on the error surface we change the position as long
+as the length of the gradient at that position exceeds a certain
+threshold. If the slope is very steep, the change in the position (the
+distance between the red dots in \figref{gradientdescentfig}) is
+large.

 \begin{figure}[t]
  \includegraphics[width=0.6\columnwidth]{gradient_descent}
-  \titlecaption{Gradientenabstieg.}{Es wird von einer beliebigen
-    Position aus gestartet und der Gradient berechnet und die Position
-    ver\"andert. Jeder Punkt zeigt die Position nach jedem
-    Optimierungsschritt an.} \label{gradientdescentfig}
+  \titlecaption{Gradient descent.}{The algorithm starts at an
+    arbitrary position. At each point the gradient is estimated and
+    the position is updated as long as the length of the gradient is
+    sufficiently large. The dots show the positions after each
+    iteration of the algorithm.} \label{gradientdescentfig}
 \end{figure}

 \setboolean{showexercisesolutions}{false}
 \begin{exercise}{gradientDescent.m}{}
-  Implementiere den Gradientenabstieg f\"ur das Problem der
-  Parameteranpassung der linearen Geradengleichung an die Messdaten in
-  der Datei \file{lin\_regression.mat}.
+  Implement the gradient descent for fitting the linear equation to
+  the measured data in the file \file{lin\_regression.mat}.
  \begin{enumerate}
-  \item Merke Dir f\"ur jeden Schritt den Fehler zwischen
-    Modellvorhersage und Daten.
-  \item Erstelle eine Plot, der die Entwicklung des Fehlers als
-    Funktion der Optimierungsschritte zeigt.
-  \item Erstelle einen Plot, der den besten Fit in die Daten plottet.
+  \item Store the error value for each iteration.
+  \item Create a plot that shows the error value as a function of the
+    number of optimization steps.
+  \item Create a plot that shows the measured data and the best fit.
  \end{enumerate}
 \end{exercise}


-\section{Fazit}
-Mit dem Gradientenabstieg haben wir eine wichtige Methode zur
-Bestimmung eines globalen Minimums einer Kostenfunktion
-kennengelernt. 
+\section{Summary}

-F\"ur den Fall des Kurvenfits mit einer Geradengleichung zeigt der
-mittlere quadratische Abstand als Kostenfunktion in der Tat ein
-einziges klar definiertes Minimum.  Wie wir im n\"achsten Kapitel
-sehen werden, kann die Position des Minimums bei Geradengleichungen
-sogar analytisch bestimmt werden, der Gradientenabstieg w\"are also
-gar nicht n\"otig \matlabfun{polyfit()}.
+The gradient descent is an important method for solving optimization
+problems. It is used to find the global minimum of an objective
+function.

-F\"ur Parameter, die nichtlinear in einer Funktion
-enthalten sind, wie z.B. die Rate $\lambda$ als Parameter in der
-Exponentialfunktion $f(x;\lambda) = \exp(\lambda x)$, gibt es keine
-analytische L\"osung, und das Minimum der Kostenfunktion muss
-numerisch, z.B. mit dem Gradientenabstiegsverfahren bestimmt werden.
+In the case of the linear equation the error surface (using the mean
+square error) shows a clearly defined minimum. The position of the
+minimum can be calculated analytically. The next chapter will
+show how this can be done without using the gradient descent
+\matlabfun{polyfit()}.

-Um noch schneller das Minimum zu finden, kann das Verfahren des
-Gradientenabstiegs auf vielf\"altige Weise verbessert
-werden. z.B. kann die Schrittweite an die St\"arke des Gradienten
-angepasst werden. Diese numerischen Tricks sind in bereits vorhandenen
-Funktionen implementiert.  Allgemeine Funktionen sind f\"ur beliebige
-Kostenfunktionen gemacht \matlabfun{fminsearch()}, w\"ahrend spezielle
-Funktionen z.B. f\"ur die Minimierung des quadratischen Abstands bei
-einem Kurvenfit angeboten werden \matlabfun{lsqcurvefit()}.
+Problems in which parameters enter the function nonlinearly, e.g. the
+rate $\lambda$ in the exponential function $f(x;\lambda) =
+\exp(\lambda x)$, do not have an analytical solution. To find the minima
+of such objective functions, numerical methods such as the gradient descent have
+to be applied.
+
+The suggested gradient descent algorithm can be improved in multiple
+ways to converge faster.  For example, one could adapt the step size to
+the length of the gradient. These numerical tricks have already been
+implemented in pre-defined functions. Generic optimization functions
+such as \matlabfun{fminsearch()} have been implemented for arbitrary
+objective functions while more specialized functions are specifically
+designed for optimizations in the least square error sense
+\matlabfun{lsqcurvefit()}.

 \newpage
-\begin{important}[Achtung Nebenminima!]
-  Das Finden des globalen Minimums ist leider nur selten so leicht wie
-  bei einem Geradenfit. Oft hat die Kostenfunktion viele Nebenminima,
-  in denen der Gradientenabstieg enden kann, obwohl das gesuchte
-  globale Minimum noch weit entfernt ist. Darum ist es meist sehr
-  wichtig, wirklich gute Startwerte f\"ur die zu bestimmenden
-  Parameter der Kostenfunktion zu haben. Auch sollten nur so wenig wie
-  m\"oglich Parameter gefittet werden, da jeder zus\"atzliche
-  Parameter den Optimierungsprozess schwieriger und
-  rechenaufw\"andiger macht.
+\begin{important}[Beware of secondary minima!]
+  Finding the absolute minimum is not always as easy as in the case of
+  the linear equation. Often, the error surface has secondary or local
+  minima in which the gradient descent stops even though there is a
+  better solution. Starting from good initial parameter values helps
+  to avoid getting stuck in local minima. Furthermore, one should fit
+  as few parameters as possible, since each additional parameter makes
+  the optimization more difficult and computationally more expensive.
 \end{important}

 \selectlanguage{english}
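
Note appended after the diff, not part of commit e500280c07: the gradient descent steps described in the translated section can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's solution. The lecture exercises are written for MATLAB; Python is used here only to keep the sketch self-contained. The function names (mean_square_error, lsq_gradient, gradient_descent), the difference-quotient step h, the max_steps safety limit, and the synthetic data standing in for lin_regression.mat are all hypothetical. The step size (0.01) and the stopping threshold (0.1) are the example values given in the text.

```python
import random


def mean_square_error(params, x, y):
    """Mean square error between the data and the line y = m*x + b."""
    m, b = params
    return sum((yi - m * xi - b) ** 2 for xi, yi in zip(x, y)) / len(x)


def lsq_gradient(params, x, y, h=1e-6):
    """Estimate the gradient with one difference quotient per parameter.

    h is an assumed small step (Delta m = Delta b = h).
    """
    m, b = params
    f0 = mean_square_error((m, b), x, y)
    dfdm = (mean_square_error((m + h, b), x, y) - f0) / h
    dfdb = (mean_square_error((m, b + h), x, y) - f0) / h
    return dfdm, dfdb


def gradient_descent(x, y, start=(0.0, 0.0), eps=0.01, threshold=0.1,
                     max_steps=100000):
    """Walk against the gradient until its length falls below threshold."""
    m, b = start
    errors = []  # error value at every iteration
    for _ in range(max_steps):  # safety limit, an assumption of this sketch
        errors.append(mean_square_error((m, b), x, y))
        gm, gb = lsq_gradient((m, b), x, y)
        if (gm ** 2 + gb ** 2) ** 0.5 < threshold:
            break  # gradient is short enough: assume the minimum is reached
        m -= eps * gm  # small step in the direction opposite to the gradient
        b -= eps * gb
    return (m, b), errors


# Synthetic stand-in for lin_regression.mat: 20 points around y = 2x + 1.
rng = random.Random(1)
x = [0.25 * i for i in range(20)]
y = [2.0 * xi + 1.0 + rng.gauss(0.0, 0.1) for xi in x]

(m_fit, b_fit), errors = gradient_descent(x, y)
print("m =", m_fit, "b =", b_fit, "iterations =", len(errors))
```

The list errors holds the error value of every iteration, so it can be plotted against the iteration number as asked for in the gradientDescent.m exercise. Because the search stops as soon as the length of the gradient drops below the threshold, the returned parameters are only close to the optimum; a smaller threshold gives a better estimate at the cost of more iterations.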