
\chapter{Code style}

\shortquote{Any code of your own that you haven't looked at for six or
  more months might as well have been written by someone
  else.}{Eagleson's law}

Cultivating a good code style is not just a matter of good taste; it
is a key ingredient for the readability and maintainability of code
and, in the end, facilitates the reproducibility of scientific
results. Programs should be written and structured in a way that
helps outsiders as well as the author (a few weeks or months after
the code was written) to understand the program's rationale. Clean
code pays off for the original author as well as for others who are
supposed to use the code.

Clean code addresses several issues:
\begin{enumerate}
\item The program's structure.
\item Naming of scripts and functions.
\item Naming of variables and constants.
\item Application of indentation and empty lines to define blocks.
\item Use of comments and inline documentation.
\item Delegation of repeated code to functions and dedicated
  subroutines.
\end{enumerate}

\section{Organization of programs on the file system}

While introducing scripts and functions we suggested a typical program
layout (box\,\ref{whenscriptsbox}). The idea is to create a single
entry point by having one script that controls the rest of the program
by calling functions that work on the data and managing the
results. Applying this structure makes it easy to understand the flow
of the program but two questions remain: (i) how to organize the files
on the file system and (ii) how to name them so that the controlling
script is easily identified among the other \entermde[m-file]{m-File}{m-files}.

Upon installation \matlab{} creates a folder called \file{MATLAB} in
the user space (Windows: My files, Linux: Documents, MacOS:
Documents). Since this folder is already part of the \matlab{} search
path (Box~\ref{matlabpathbox}), it is easiest to stick to it for the
moment. Of course, any other location can be specified as well. Generally,
it is of great advantage to store related scripts and functions within
the same folder on the hard drive. An easy approach is to create a
project-specific folder structure that contains a sub-folder for each
task (analysis) and to store all related \entermde[m-file]{m-File}{m-files}
there (figure~\ref{fileorganizationfig}). In these task-related folders
one may consider creating a further sub-folder to store results
(created figures, result data). On the project level a single script
(\file{analysis.m}) controls the whole process. In parallel to the
project folder we suggest creating an additional folder for functions
that are or may be relevant across different projects.

Within such a structure it is quite likely that programs in different
projects share the same name (e.g. a \varcode{load\_data.m}
function). Usually this will not lead to conflicts because
\matlab{} always starts searching for matching functions in the
current folder (more information on the \matlab{} search path in
Box~\ref{matlabpathbox}).
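
A project-level controlling script could look roughly like the
following sketch. The folder names \file{spike\_analysis},
\file{lfp\_analysis} and \file{shared\_functions} are made up for
illustration; the sketch only assumes that each task folder contains
a \file{main.m} script and that the folder for shared functions
lives next to the project folder.

\begin{lstlisting}[caption={Sketch of a controlling script (folder names are hypothetical).}]
% analysis.m: controlling script on the project level.
% Assumed layout (illustrative only):
%   my_project/analysis.m
%   my_project/spike_analysis/main.m
%   my_project/lfp_analysis/main.m
%   shared_functions/

% folder in which this script is stored
project_dir = fileparts(mfilename('fullpath'));

% make the functions shared across projects available
shared_dir = fullfile(project_dir, '..', 'shared_functions');
if exist(shared_dir, 'dir')
    addpath(shared_dir);
end

% run the entry point of each task, one after the other
tasks = {'spike_analysis', 'lfp_analysis'};
for k = 1:length(tasks)
    main_script = fullfile(project_dir, tasks{k}, 'main.m');
    if exist(main_script, 'file')
        run(main_script);  % changes into the task folder while running
    else
        fprintf('skipping %s: no main.m found\n', tasks{k});
    end
end
\end{lstlisting}

Note that scripts started with \code{run()} share the workspace of
the calling script. In this sketch a \file{main.m} script that
clears the workspace or redefines \varcode{k} or \varcode{tasks}
would break the loop.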

\begin{figure}[tp]
  \includegraphics[width=0.75\textwidth]{program_organization}
  \titlecaption{\label{fileorganizationfig} Possible folder structure
    for maintaining program code on the file system.}{For each project
    one maintains an individual folder in which analyses or tasks may
    be structured in sub-folders. Within each analysis a \file{main.m}
    script is the entry point for the analyses. On the project level
    there could be a single script that triggers and controls all
    analyses and tasks in the sub-folders. Functions that are of
    general interest across projects are best kept in a dedicated
    folder outside the project sub-structure.}
\end{figure}


\begin{ibox}[tp]{\label{matlabpathbox}\matlab{} search path}
  The \entermde{Suchpfad}{search path} defines where \matlab{} looks
  for scripts and functions. When a function is called from the command
  line, \matlab{} needs to figure out which function is meant and
  starts looking for it in the current folder. If this fails, it will
  go through all locations listed in the search path (see figure). The
  \entermde{Suchpfad}{search path} is basically a list of
  folders. \matlab{} will go through this list from beginning to end and
  the search will stop at the first match. This implies that the order
  in the search path may affect which version of functions that share
  the same name is used. Note: \matlab{} does not perform a recursive
  search. That is, a function that resides in a sub-folder that is not
  explicitly listed in the \entermde{Suchpfad}{search path} will not be found.

  \vspace{2ex}
  \includegraphics[width=0.9\textwidth]{search_path}
  \vspace{1.5ex}

  The search path can be managed from the command line by using the
  functions \code{addpath()} or \code{userpath()}. Alternatively, the
  \matlab{} UI offers a graphical tool for adding/removing paths, or
  changing the order of entries.

  The current working directory can be changed via the UI or from the
  command line using the command \code{cd} (for change directory). The
  current folder is shown in the current directory text field of the UI
  or can be requested using the command \code{pwd} (for print working
  directory). The function \code{which()} shows the full path of the
  function that is actually used. For example, finding out which \code{mean()}
  function is used gives a result similar to:
  \begin{lstlisting}[label=useofwhich, caption={Use of 'which'}]
    >> which('mean')
    /Applications/MATLAB2018b.app/toolbox/matlab/datafun/mean.m
  \end{lstlisting}
\end{ibox}
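
Managing the search path from the command line might look like the
following sketch. The folder name \file{shared\_functions} is an
assumption made for illustration; \code{addpath()}, \code{genpath()},
\code{rmpath()} and \code{which()} are built-in \matlab{} functions.

\begin{lstlisting}[caption={Sketch: managing the search path (folder name is hypothetical).}]
% hypothetical folder with functions shared across projects
shared_dir = fullfile(pwd, 'shared_functions');

if exist(shared_dir, 'dir')
    % put the folder at the front of the search path
    addpath(shared_dir);
    % alternative that also adds all its sub-folders:
    % addpath(genpath(shared_dir));
end

% list all functions called 'mean' that can be found
which('mean', '-all')

if exist(shared_dir, 'dir')
    % remove the folder from the search path again
    rmpath(shared_dir);
end
\end{lstlisting}

Because \code{addpath()} places new folders at the front of the
search path by default, functions in the added folder take
precedence over functions of the same name further down the list.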

\section{Naming things}
The dictum of good code style is: ``Program code must be readable.''
Expressive names are extraordinarily important in this respect. Even
if it is tricky to find expressive names that are not overly long,
naming should be taken seriously.

\matlab{} has a few rules about names: names must not start with a
number and they must not contain blanks or other special characters
such as German umlauts. Otherwise one is free to use whatever suits. The
names of pre-defined functions shipped with \matlab{} follow several
patterns:
\begin{itemize}
\item Names are mostly lowercase.
\item Names are often abbreviations (e.g. \code{xcorr()}
  stands for cross-correlation, \code{repmat()} for ``repeat matrix'').
\item Functions that convert between formats are named according to
  the pattern ``format2format'' (e.g. \code{num2str()} for ``number to string'' conversion).
\end{itemize}

There are other common patterns such as \emph{camelCase}, in which
the first character of each word after the first is capitalized. Other
conventions use the underscore to separate the individual words
(\emph{snake\_case}). A function that counts the number of action
potentials could be named \file{spikeCount.m} or
\file{spike\_count.m}.

The same naming rules apply for scripts and functions as well as
variables and constants.
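
Whether a name obeys these rules can be checked with the built-in
function \code{isvarname()}. It is meant for variable names, but
since the same rules apply to scripts and functions it can serve as
a quick check for both. The names used below are arbitrary examples.

\begin{lstlisting}[caption={Checking names with 'isvarname' (example names are arbitrary).}]
valid_snake = isvarname('spike_count');    % true: letters and underscore
valid_camel = isvarname('spikeCount');     % true: letters only
invalid_number = isvarname('2nd_trial');   % false: starts with a number
invalid_blank = isvarname('spike count');  % false: contains a blank
\end{lstlisting}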

\subsection{Naming scripts and functions}
\matlab{} will search the search path (Box \ref{matlabpathbox})
exclusively by name. This search is case-sensitive, which implies that
the files \file{test\_function.m} and \file{Test\_function.m} are two
different things. It is self-evident that choosing such names is
nonsensical because the tiny difference in the name contains no cue
about the difference between the two versions and the function names
themselves tell close to nothing about the purpose. Finding good names
is not trivial. Sometimes it is harder than the programming
itself. Choosing \emph{expressive names} that provide information about a
function's purpose, however, pays off!

\begin{important}[Naming scripts and functions]
  Names of functions and scripts should be expressive in the sense
  that the name provides information about the function's purpose.
  (\file{estimate\_firingrate.m} tells much more than
  \file{exercise1.m}). Choosing a good name replaces large parts of
  the documentation.
\end{important}

\subsection{Naming variables and constants}

While the names of scripts and functions describe the purpose, names
of variables describe the stored content. A variable storing the
average number of action potentials could be called\\
\varcode{average\_spike\_count}. If this variable is meant to store
multiple spike counts the plural form would be appropriate\\
(\varcode{average\_spike\_counts}).

The control variables used in the head of a \code{for} loop are often
simply named \varcode{i}, \varcode{j} or \varcode{k}. This somewhat
clashes with the statements made above, but since it is a very
common pattern, the meaning of such variables in the context of the
loop is quite obvious. This should, however, be the only exception to
the general rule of expressive naming.

\begin{important}[Naming of variables]
  The names of variables should be expressive. That is, the name
  itself should tell about the content of the variable. The name
  \varcode{spike\_count} tells much more about the stored information
  than \varcode{x}. Choosing a good variable name replaces additional
  comments.
\end{important}
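
The effect of expressive names can be illustrated with a small
sketch that computes the same quantity twice. The data and all
variable names are made up for this example.

\begin{lstlisting}[caption={The same computation with poor and with expressive names (made-up data).}]
% Hard to read: the names tell nothing about the content.
x = [0.12, 0.35, 0.81, 1.02, 1.47, 1.90];
t = 2.0;
r = length(x) / t;

% Easier to read: the names describe the content.
spike_times = [0.12, 0.35, 0.81, 1.02, 1.47, 1.90];  % seconds
trial_duration = 2.0;                                 % seconds
spike_count = length(spike_times);
firing_rate = spike_count / trial_duration;           % Hertz
\end{lstlisting}

Both versions compute the same number, but only the second one can
be understood without further explanation.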


\section{Code style}
Readability of program code depends strongly on whether or not a
consistent code style is applied. A program that is only randomly
indented or that contains lots of empty lines is very hard to read and
to comprehend. Even though the \matlab{} language (as many others)
does not enforce indentation, indentation is very powerful for
defining coherent blocks. The \matlab{} editor supports this with an
auto-indentation mechanism. A selected section of the code can be
automatically indented by pressing \keycode{Ctrl-I}.

Interspersing empty lines is very helpful to separate blocks of
code that belong together. Too many empty lines, however, lead to
hard-to-read code because the code might require more space than is
available on the screen, so that the overview is lost.

The following two listings show basically the same implementation of a
random walk\footnote{A random walk is a simple simulation of Brownian
  motion. In each simulation step an agent takes a step into a
  randomly chosen direction.} once in a rather chaotic version
(listing \ref{chaoticcode}) and once in a cleaner way (listing
\ref{cleancode}).

\begin{lstlisting}[label=chaoticcode, caption={Chaotic implementation of the random-walk.}]
num_runs = 10; max_steps = 1000;

positions = zeros(max_steps, num_runs);

for run = 1:num_runs


for step = 2:max_steps

x = randn(1);
if x<0
positions(step, run)= positions(step-1, run)+1;


elseif x>0
 positions(step,run)=positions(step-1,run)-1;
 end
end
end
\end{lstlisting}

\pagebreak[4]

\begin{lstlisting}[label=cleancode, caption={Clean implementation of the random-walk.}]
num_runs = 10;
max_steps = 1000;
positions = zeros(max_steps, num_runs);

for run = 1:num_runs
    for step = 2:max_steps
        x = randn(1);
        if x < 0
            positions(step, run) = positions(step-1, run) + 1;
        elseif x > 0
            positions(step, run) = positions(step-1, run) - 1;
        end
    end
end
\end{lstlisting}

\section{Using comments}

It is common to provide extra information about the meaning of program
code by adding comments. In \matlab{} comments are indicated by the
percent character \code{\%}. Anything that follows the percent
character in a line is ignored and considered a comment. When used
sparingly, comments can be immensely helpful. Comments
are short sentences that describe the meaning of the (following) lines
in the program code. During the initial implementation of a function
they can be used to guide the development, but too many comments tend to
bloat the code and decrease readability. By choosing expressive
variable and function names, most lines should be self-explanatory.

For example, stating the obvious does not really help and should be
avoided:\\ \varcode{ x = x + 2; \% add two to x}

\begin{important}[Using comments]
  \begin{itemize}
  \item Comments describe the rationale of the respective code block.
  \item Comments are good and helpful --- they must be true, however!
  \item A wrong comment is worse than a non-existent one!
  \item Comments must be maintained just as the code. Otherwise they
    may become wrong and worse than meaningless!
  \end{itemize}
  \widequote{Good code is its own best documentation. As you're about to add
    a comment, ask yourself, ``How can I improve the code so that this
    comment isn't needed?'' Improve the code and then document it to
    make it even clearer.}{Steve McConnell}
\end{important}
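
As an illustration, the following sketch of a one-dimensional random
walk uses a single comment that explains the rationale of a decision
instead of restating what each line does. The variable names are
chosen for this example only.

\begin{lstlisting}[caption={Sketch: a comment that explains the rationale.}]
num_steps = 1000;
position = zeros(num_steps, 1);

for step = 2:num_steps
    % randn() is symmetric around zero, therefore steps up and
    % steps down are equally likely
    if randn(1) < 0
        position(step) = position(step-1) + 1;
    else
        position(step) = position(step-1) - 1;
    end
end
\end{lstlisting}

A comment such as \varcode{\% loop over all steps} in front of the
\code{for} loop would add nothing that the code does not already
tell.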

\pagebreak[4]
\section{Documenting functions}
All pre-defined \matlab{} functions begin with a comment block that
describes the purpose of the function, the required and optional
arguments, and the values returned by the function. Using the
\code{help} command one can display these comments and learn how to
use the function properly. Self-written functions can and should be
documented in a similar way. Listing~\ref{localfunctions} shows a
well-documented function.

\begin{important}[Documenting functions]
  Functions must be properly documented, otherwise a user (including
  the author) must read and understand the function code, which is
  a waste of time!
  \begin{itemize}
  \item Describe with a few sentences the purpose of the function.
  \item Note the function head to illustrate the order of the arguments.
  \item For each argument state the purpose, the expected data type
    (number, vector, matrix, etc.) and, if applicable, the unit in
    which a provided number must be given (e.g. seconds if a time is
    expected).
  \item The same for all return values.
  \end{itemize}
\end{important}
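
A comment block that follows these points could look like the
following sketch. The function \code{firing\_rate()} is a
hypothetical example and is not one of the functions used elsewhere
in this chapter.

\begin{lstlisting}[caption={Sketch of a documented function (hypothetical example).}]
function rate = firing_rate(spike_times, duration)
% FIRING_RATE  Average firing rate of a single trial.
%
%   rate = firing_rate(spike_times, duration)
%
% Arguments:
%   spike_times: vector with the times of the spikes in seconds.
%   duration:    scalar, the duration of the trial in seconds.
%
% Returns:
%   rate: scalar, the average firing rate in Hertz.

    rate = length(spike_times) / duration;
end
\end{lstlisting}

If this function is stored in a file \file{firing\_rate.m} on the
search path, \code{help firing\_rate} displays the comment block.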


\section{Delegating tasks in functions}
Comments and empty lines are used to organize code into logical blocks
and to briefly explain what they do. Whenever one feels tempted to do
this, one could also consider delegating the respective task to a
function. In most cases this is preferable.

Not delegating tasks leads to very long \entermde[m-file]{m-File}{m-files},
which can be confusing. Such code is sometimes called ``spaghetti
code''. When this happens, it is high time to think about delegating
tasks to functions.

\begin{important}[Delegating to functions]
  When should one consider delegating tasks to specific functions?
  \begin{itemize}
  \item Whenever one needs more than two indentation levels to
    organize the code.
  \item Whenever the same lines of code are repeated more than once.
  \item Whenever one is tempted to use copy-and-paste.
  \end{itemize}
\end{important}
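
As a sketch of such a delegation, assume that several recorded
traces need to be shifted to zero mean and scaled to unit standard
deviation. Instead of copying the respective line for each trace,
it can be moved into a small function. The function name
\code{standardize\_trace()} is made up for this example.

\begin{lstlisting}[caption={Sketch: delegating a repeated computation to a function (hypothetical example).}]
function standardized = standardize_trace(trace)
% STANDARDIZE_TRACE  Shift a trace to zero mean and unit standard deviation.
%
%   standardized = standardize_trace(trace)
%
% Arguments:
%   trace: vector with the recorded values (any unit).
%
% Returns:
%   standardized: vector of the same size as trace, without unit.

    standardized = (trace - mean(trace)) / std(trace);
end
\end{lstlisting}

The calling code then reads, for example,
\varcode{voltage = standardize\_trace(voltage);} for each trace, and
a later change of the computation needs to be made in one place only.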

\subsection{Local and nested functions}
Generally, functions live in their own \entermde[m-file]{m-File}{m-files} that
have the same name as the function itself. Delegating tasks to
functions thus leads to a large set of \entermde[m-file]{m-File}{m-files}
which increases complexity and may lead to confusion. If the delegated
functionality is used in several places, it is still advisable to
do so. On the other hand, when the delegated functionality is only
used within the context of another function, \matlab{} allows one to define
\entermde[function!local]{Funktion!lokale}{local functions} and
\entermde[function!nested]{Funktion!verschachtelte}{nested functions}
within the same file. Listing \ref{localfunctions} shows an example of
a local function definition.

\pagebreak[3]
\lstinputlisting[label=localfunctions, caption={Example for local
  functions.}]{calculateSines.m}

\emph{Local functions} live in the same \entermde{m-File}{m-file} as
the main function and are only available in this context. Each local
function has its own \enterm{scope}, that is, the local function
cannot access (read or write) variables of the calling
function. Interaction with the local function requires passing all
required arguments and taking care of the return values of the
function.

\emph{Nested functions} are different in this respect. They are
defined within the body of the parent function (between the keywords
\code{function} and \code{end}) and have full access to all variables
defined in the parent function. Working with (in particular changing) the
parent's variables is handy on the one hand, but is also risky. One
should take care when defining nested functions.
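
The difference in scope can be made visible with a minimal sketch.
All function and variable names are chosen for illustration only.
Note that as soon as a file contains a nested function, every
function in that file must be closed with \code{end}.

\begin{lstlisting}[caption={Sketch: nested versus local function (hypothetical example).}]
function scope_demo()
% SCOPE_DEMO  Contrast a nested function with a local function.
    counter = 0;

    function increment()
        % nested function: shares 'counter' with the parent function
        counter = counter + 1;
    end

    increment();
    increment();
    fprintf('counter after nested calls: %d\n', counter);  % prints 2

    % local function: the value is passed in and returned explicitly
    counter = add_one(counter);
    fprintf('counter after local call: %d\n', counter);    % prints 3
end

function value = add_one(value)
% ADD_ONE  Local function with its own scope.
    value = value + 1;
end
\end{lstlisting}

The nested function changes \varcode{counter} without any visible
assignment in the parent function, which is exactly what makes
nested functions handy and risky at the same time.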


\section{Specifics when using scripts}
A similar problem as with nested functions arises when using scripts
(instead of functions). All variables that are defined within a script
become available in the global \enterm{workspace}
(\determ{Arbeitsbereich}). There is the risk of name conflicts, that
is, a called sub-script redefines or uses the same variable name and
may \emph{silently} change its content. The user will not be notified
about this change and the calling script may expect a completely
different content. Bugs that are based on such mistakes are hard to
find since the program itself looks perfectly fine.

To avoid such issues one should design scripts in a way that they
perform their tasks independent from other scripts and functions.

A common use case for a script could be to control the analyses made
on many datasets and to collect the results. A good script is still
not too long and is thus easy to comprehend.  Another advantage of
small task-related scripts is that they can be directly executed by
either calling them from the command line or pressing \keycode{F5} in
the editor. Should the script fail, there will be a proper error
message that provides important information to track and fix the bug.

\begin{important}[Structuring scripts]
  \begin{itemize}
  \item Similar to functions, scripts should solve one task and should
    not be too long.

  \item Scripts should work independently of existing variables in the
    global workspace.

  \item Often it is advisable to start a script by deleting
    variables (\code{clear}) from the workspace and most of the time
    it is also good to close all open figures (\code{close all}). Be
    careful if the respective script has been called by another one.

  \item Clean up the workspace at the end of a script. Delete
    (\code{clear}) all variables that are no longer needed.

  \item Consider writing functions instead of scripts.
  \end{itemize}
\end{important}


\section{Summary}

Program code must be readable. Names of variables, functions and
scripts should be expressive and describe their purpose (scripts and
functions) or their content (variables). Cultivating a personalized
code style is perfectly fine as long as it is consistent. Many
programming languages or communities have their own traditions. It is
advisable to adhere to these.

Repeated tasks should (to be read as must) be delegated to
functions. In cases in which a function is only applied locally and
is not of more general interest across projects, consider defining it as a
\entermde[function!local]{Funktion!lokale}{local function} or
\entermde[function!nested]{Funktion!verschachtelte}{nested
  function}. Taking care to increase readability and comprehensibility
pays off, even for the author!\footnote{Reading tip: Robert
  C. Martin: \textit{Clean Code: A Handbook of Agile Software
    Craftsmanship}, Prentice Hall}

\shortquote{Programs must be written for people to read, and only
  incidentally for machines to execute.}{Abelson / Sussman}

\shortquote{Any fool can write code that a computer can
  understand. Good programmers write code that humans can
  understand.}{Martin Fowler}

\shortquote{First, solve the problem. Then, write the code.}{John
  Johnson}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%\printsolutions