Wrote results rect-lp and log-hp :)

Finished some more figure captions.
2026-05-04 19:50:04 +02:00
parent 69f172ff2c
commit 16014c02a0
15 changed files with 1376 additions and 1232 deletions
--- a/main.tex
+++ b/main.tex
@@ -33,7 +33,7 @@
 %\bibstyle
 %\citation

-\title{Emergent intensity invariance in a physiologically inspired model of the grasshopper auditory system}
+\title{Emergent intensity invariance vs. signal-to-noise ratio at three consecutive processing stages along the grasshopper song recognition pathway}
 \author{Jona Hartling, Jan Benda}
 \date{}

@@ -403,7 +403,7 @@ pathway, logarithmic compression is achieved by conversion to decibel scale
    \db(t)\,=\,20\,\cdot\,\dec \frac{\env(t)}{\dbref}, \qquad \dbref\,=\,1
    \label{eq:log}
 \end{equation}
-relative to the maximum intensity $\dbref$ of the signal envelope $\env(t)$.
+relative to the common reference intensity $\dbref$.
 Both the receptor neurons~(\bcite{romer1976informationsverarbeitung};
 \bcite{gollisch2004input}; \bcite{fisch2012channel}) and, on a larger scale,
 the subsequent local interneurons~(\bcite{hildebrandt2009origin};
@@ -555,7 +555,7 @@ can be read out by a simple linear classifier.
 \end{figure}
 \FloatBarrier

-\section{Two mechanisms driving the emergence of intensity-invariant song representation}
+\section{Mechanisms driving the emergence of\\intensity-invariant song representation}

 % Still missing the SNR analysis. Should be able to write around it for now.
 The robustness of song recognition is tied to the degree of intensity
@@ -573,6 +573,54 @@ specific operations involved, as outlined in the following sections.

 \subsection{Full-wave rectification \& lowpass filtering}

+The first nonlinear transformation along the model pathway is the full-wave
+rectification of the tympanal signal $\filt(t)$ during the extraction of the
+signal envelope (Eq.\,\ref{eq:env}). Rectification transforms the distribution
+of $\filt(t)$ from an approximately zero-centered distribution with both
+positive and negative values into a strictly non-negative distribution. The
+signal envelope $\env(t)$ is then obtained by lowpass filtering the rectified
+$\filt(t)$. The effects of this transformation pair on SNR and potential
+intensity invariance were analyzed by rescaling and processing the input signal
+$\raw(t)$ and comparing standard deviations between the resulting $\filt(t)$
+and $\env(t)$, once for the noiseless case~(Fig.\,\ref{fig:rect-lp}a) and once
+for the noisy case~(Fig.\,\ref{fig:rect-lp}b). In addition, the cutoff
+frequency $\fc$ of the lowpass filter was varied to investigate the influence
+of different filter bandwidths. In the noiseless case, the standard deviations
+of $\filt(t)$ and $\env(t)$ are each reduced compared to the input $\raw(t)$ by
+a multiplicative factor. These factors are constant across all $\sca$, which
+results in a downward shift of the respective curve on a double-logarithmic
+scale, away from the diagonal~(Fig.\,\ref{fig:rect-lp}c). For $\filt(t)$, the
+reduction is a consequence of the bandpass filtering~(Eq.\,\ref{eq:bandpass})
+of $\raw(t)$. For $\env(t)$, the standard deviation is further reduced compared
+to $\filt(t)$. Rectification contributes much less to this reduction than
+lowpass filtering. The degree of reduction by lowpass filtering depends on the
+cutoff frequency $\fc$, with lower $\fc$ (narrow bandwidth) resulting in a
+stronger reduction. In the noisy case, the standard deviations of $\filt(t)$
+and $\env(t)$ can be related to the respective pure-noise reference standard
+deviation~(Fig.\,\ref{fig:rect-lp}d). Relative to this reference, each curve starts with a
+constant regime of SNR values near 1 for smaller $\sca$, which reflects the
+dominance of the noise component $\noc(t)$ over the song component $\soc(t)$ in
+the input $\raw(t)$. For larger $\sca$, all curves transition into a regime of
+linearly increasing SNR on a double-logarithmic scale. For $\filt(t)$, the
+linear part of the curve deviates only slightly from the diagonal. For
+$\env(t)$, however, the transition occurs at lower $\sca$ compared to
+$\filt(t)$, and the linear part of the curve is shifted leftward away from the
+diagonal, which means that higher SNR values are achieved for the same $\sca$.
+This effect is more pronounced for lower $\fc$ of the lowpass filter and is
+presumably caused by the attenuation of high-frequency components in the
+signal, which are more prominent in the noise component $\noc(t)$ than in the
+song component $\soc(t)$. The effect also appears relatively consistent across
+different species, although small variations based on different song structures
+and distributions exist~(Fig.\,\ref{fig:rect-lp}e). In summary, the standard
+deviation of $\env(t)$ has never been observed to transition into a saturation
+regime for larger $\sca$ but rather continues to increase proportionally to
+$\sca$ for all tested $\fc$, in both the noiseless and the noisy case and
+across different species. Consequently, the combination of rectification and
+lowpass filtering does not contribute to intensity invariance. However, this
+transformation pair does improve the SNR of $\env(t)$ relative to $\filt(t)$
+and thus provides subsequent processing stages with a more robust input
+representation and higher input SNR.
+
 \begin{figure}[!ht]
    \centering
    \includegraphics[width=\textwidth]{figures/fig_invariance_rect_lp.pdf}
@@ -605,73 +653,113 @@ specific operations involved, as outlined in the following sections.

 \subsection{Logarithmic compression \& spike-frequency adaptation}

-The first notable emergence of intensity invariance along the model pathway
-occurs during the transformation of the signal envelope $\env(t)$ into the
-logarithmically scaled envelope $\db(t)$ and then into the intensity-adapted
-envelope $\adapt(t)$. In order to disentangle the interplay of logarithmic
-compression and adaptation, $\env(t)$ can be rewritten as a synthetic mixture
+The second nonlinear transformation along the model pathway is the logarithmic
+compression of the signal envelope $\env(t)$ into $\db(t)$, Eq.\,\ref{eq:log},
+which is then followed by the highpass filtering of $\db(t)$,
+Eq.\,\ref{eq:highpass}, to obtain the intensity-adapted envelope $\adapt(t)$.
+The interplay of this transformation pair was analyzed by rescaling and
+processing the input signal $\filt(t)$ and comparing standard deviations
+between the resulting $\env(t)$, $\db(t)$, and $\adapt(t)$. It is necessary to
+use $\filt(t)$ as input for this analysis instead of $\env(t)$, because
+$\env(t)$ results from a nonlinear transformation and hence cannot be
+synthesized as an additive mixture of song component $\soc(t)$ and noise
+component $\noc(t)$. % <-- Sentence may be methods section material.
+However, it is much easier to conceive a mathematical description of the
+effects of logarithmic compression and adaptation if $\env(t)$ itself is
+assumed to be composed of $\soc(t)$ and $\noc(t)$. In the noiseless
+case~(Fig.\,\ref{fig:log-hp}a), $\env(t)$ takes the form of
+\begin{equation}
+    \env(t)\,=\,\sca\,\cdot\,\soc(t), \qquad \env(t)\,>\,0\enspace\forall\enspace t\,\in\,\mathbb{R}
+    \label{eq:toy_env_pure}
+\end{equation}
+The standard deviation of $\env(t)$ increases linearly with $\sca$ on a
+double-logarithmic scale and is slightly reduced~(Fig.\,\ref{fig:log-hp}c)
+compared to the input $\filt(t)$, which is consistent with the results of the
+previous analysis~(Fig.\,\ref{fig:rect-lp}c). By conversion of $\env(t)$ to
+decibel scale, $\sca$ turns from a multiplicative scale in linear space into an
+additive term, or offset, in logarithmic space:
+\begin{equation}
+    \db(t)\,=\,20\,\cdot\,\dec \left[\,\sca\,\cdot\,\soc(t)\,\right]\,=\,20\,\cdot\,\left[\dec \sca\,+\,\dec \soc(t)\right], \qquad \sca\,>\,0
+    \label{eq:toy_log_pure}
+\end{equation}
+The highpass filtering of $\db(t)$ can be approximated as a subtraction of the
+local signal offset within a suitable time interval $0 \ll \thp <
+\frac{1}{\fc}$:
+\begin{equation}
+    \begin{split}
+    \adapt(t)\,\approx\,\db(t)\,-\,20\,\cdot\,\dec \sca\,=\,20\,\cdot\,\dec \soc(t)
+    \end{split}
+    \label{eq:toy_highpass_pure}
+\end{equation}
+This eliminates $\sca$ from $\adapt(t)$ and thus renders it perfectly
+intensity-invariant, with a constant standard deviation of around 10\,dB across
+all $\sca>0$~(Fig.\,\ref{fig:log-hp}c). In contrast, in the noisy
+case~(Fig.\,\ref{fig:log-hp}b), $\env(t)$ takes the form of
 \begin{equation}
    \env(t)\,=\,\sca\,\cdot\,\soc(t)\,+\,\noc(t), \qquad \env(t)\,>\,0\enspace\forall\enspace t\,\in\,\mathbb{R}
-    \label{eq:toy_env}
+    \label{eq:toy_env_noise}
 \end{equation}
-of a song component $\soc(t)$ with variable multiplicative scale $\sca\geq0$
-and a fixed-scale noise component $\noc(t)$. Both $\soc(t)$ and $\noc(t)$ are
-assumed to have unit variance. By conversion of $\env(t)$ to decibel
-scale~(Eq.\,\ref{eq:log}), $\sca$ turns from a multiplicative scale in linear
-space into an additive term, or offset, in logarithmic space
+Similar to the previous analysis~(Fig.\,\ref{fig:rect-lp}d), the ratio of the
+standard deviation of $\env(t)$ to its pure-noise reference standard deviation
+on a double-logarithmic scale follows a constant regime for small $\sca$ and a
+linearly increasing regime for larger $\sca$~(Fig.\,\ref{fig:log-hp}d). Decibel
+conversion of $\env(t)$
 % \begin{equation}
 %     \begin{split}
-%         \db(t)\,&=\,\dec \frac{\alpha\,\cdot\,s(t)\,+\,\eta(t)}{\dbref}\\
-%         &=\,\dec \frac{\alpha}{\dbref}\,+\,\dec \left[s(t)\,+\,\frac{\eta(t)}{\alpha}\right], \qquad \sca\,>\,0
+%         \db(t)\,&=\,20\,\cdot\,\dec \left[\,\sca\,\cdot\,s(t)\,+\,\eta(t)\,\right]\\
+%         &=\,20\,\cdot\,\left(\dec \sca\,+\,\dec \left[s(t)\,+\,\frac{\eta(t)}{\sca}\right]\right), \qquad \sca\,>\,0
 %     \end{split}
-%     \label{eq:toy_log}
+%     \label{eq:toy_log_noise}
 % \end{equation}
 \begin{equation}
-    \begin{split}
-        \db(t)\,&=\,20\,\cdot\,\dec \left[\,\sca\,\cdot\,s(t)\,+\,\eta(t)\,\right]\\
-        &=\,20\,\cdot\,\left(\dec \sca\,+\,\dec \left[s(t)\,+\,\frac{\eta(t)}{\sca}\right]\right), \qquad \sca\,>\,0
-    \end{split}
-    \label{eq:toy_log}
+    \db(t)\,=\,20\,\cdot\,\left(\dec \sca\,+\,\dec \left[\soc(t)\,+\,\frac{\noc(t)}{\sca}\right]\right), \qquad \sca\,>\,0
+    \label{eq:toy_log_noise}
+\end{equation}
+allows for the separation of $\sca$ from $\soc(t)$ but introduces a scaling of
+$\noc(t)$ by the inverse of $\sca$, which remains present even after the offset
+subtraction:
+\begin{equation}
+    \begin{split}
+    \adapt(t)\,\approx\,20\,\cdot\,\dec\left[\soc(t)\,+\,\frac{\noc(t)}{\sca}\right]
+    \end{split}
+    \label{eq:toy_highpass_noise}
 \end{equation}
-which allows for its separation from $\soc(t)$ but introduces a scaling of
-$\noc(t)$ by the inverse of $\sca$. The subsequent
-highpass filtering~(Eq.\,\ref{eq:highpass}) of $\db(t)$ can then be
-approximated as a subtraction of the local offset within a suitable time
-interval $0 \ll \thp < \frac{1}{\fc}$:
 % \begin{equation}
 %     \begin{split}
-%     \adapt(t)\,\approx\,\db(t)\,-\,\dec \frac{\sca}{\dbref}\,=\,\dec\left[s(t)\,+\,\frac{\eta(t)}{\sca}\right], \qquad \sca\,>\,0
+%     \adapt(t)\,\approx\,\db(t)\,-\,20\,\cdot\,\dec \sca\,=\,20\,\cdot\,\dec\left[s(t)\,+\,\frac{\eta(t)}{\sca}\right]
 %     \end{split}
-%     \label{eq:toy_highpass}
+%     \label{eq:toy_highpass_noise}
 % \end{equation}
-\begin{equation}
-    \begin{split}
-    \adapt(t)\,\approx\,\db(t)\,-\,20\,\cdot\,\dec \sca\,=\,20\,\cdot\,\dec\left[s(t)\,+\,\frac{\eta(t)}{\sca}\right], \qquad \sca\,>\,0
-    \end{split}
-    \label{eq:toy_highpass}
-\end{equation}
-This means that $\sca$ cannot be entirely eliminated from $\adapt(t)$, only
-redistributed between $\soc(t)$ and $\noc(t)$. In consequence, if $\sca$ is
-sufficiently large ($\sca\gg1$), $\noc(t)$ is attenuated to the point of being
-negligible, so that $\adapt(t)$ represents $\soc(t)$ in a scale-free manner. If
-$\soc(t)$ and $\noc(t)$ are at similar scales ($\sca\approx1$), $\adapt(t)$
-largely resembles $\db(t)$. However, if $\sca$ is sufficiently small
-($\sca\ll1$), $\noc(t)$ masks $\soc(t)$ even after the intensity adaptation.
-Therefore, the effective intensity invariance of $\adapt(t)$ relative to
-$\env(t)$ is limited by the initial scaling of $\soc(t)$ relative to $\noc(t)$;
-that is, the signal-to-noise ratio (SNR) of $\env(t)$ with ($\sca>0$) and
-without ($\sca=0$) song component $\soc(t)$
-\begin{equation}
-    \text{SNR}(\sca)\,=\,\frac{\xvar}{\nvar}\,=\,\frac{\alpha^{2}\,\cdot\,\svar\,+\,\nvar}{\nvar}\,=\,\alpha^{2}\,+\,1, \qquad \svar\,=\,\nvar\,=\,1
-    \label{eq:toy_snr}
-\end{equation}
-which depends quadratically on $\sca$ if $\soc(t)\perp\noc(t)$. Overall, the
-combination of logarithmic compression and adaptation allows for the
-equalization of different sufficiently large song scales, which is essential
-for intensity-invariant song representation. However, this mechanism is unable
-to recover songs that have already sunken below the noise floor, which
-emphasizes the importance of a sufficiently high SNR at the intial reception of
-the signal for reliable song recognition.
+This means that, in the noisy case, $\sca$ cannot be entirely eliminated from
+$\adapt(t)$, only redistributed between $\soc(t)$ and $\noc(t)$. If $\sca$ is
+sufficiently large ($\sca\gg1$, saturation regime), $\noc(t)$ is attenuated to
+the point of being negligible, so that $\adapt(t)$ is a scale-free
+representation of $\soc(t)$. If $\sca\,\cdot\,\soc(t)$ and $\noc(t)$ are at similar scales
+($\sca\approx1$, transient regime), $\adapt(t)$ largely resembles $\db(t)$.
+Finally, if $\sca$ is sufficiently small ($0<\sca\ll1$, noise regime),
+$\noc(t)$ masks $\soc(t)$ even after the intensity adaptation. Accordingly, the
+effective intensity invariance of $\adapt(t)$ through logarithmic compression
+and adaptation is limited by the SNR of $\env(t)$: Songs that have already
+sunk into the noise floor at the level of $\env(t)$ cannot be recovered by
+subsequent processing steps, which emphasizes the importance of the SNR
+improvement by rectification and lowpass filtering during the previous
+processing step~(Fig.\,\ref{fig:rect-lp}d). The general pattern of noise
+regime, transient regime, and saturation regime remains consistent across
+different species~(Fig.\,\ref{fig:log-hp}e). However, the specific value of
+$\sca$ at which the saturation regime is reached (see appendix
+Fig.\,\ref{fig:app_log-hp_saturation}) and the maximum SNR value of $\adapt(t)$
+within the saturation regime vary considerably between and within species. For
+example, \textit{C. biguttulus} and \textit{C. mollis} display a noticeably
+lower maximum SNR of $\adapt(t)$ compared to other species. These differences
+are not to be underestimated, since the SNR of $\adapt(t)$ within the
+saturation regime determines the maximum input SNR for subsequent processing
+steps. In other words, the fact that $\adapt(t)$ eventually reaches a
+saturation regime is, of course, desirable in the context of intensity
+invariance, but it also means forgoing the higher SNR values that are
+achieved by $\env(t)$ for the same $\sca$ (up to several orders of magnitude,
+Fig.\,\ref{fig:log-hp}d). This trade-off between intensity invariance and SNR,
+as well as its consequences further downstream along the pathway, is
+addressed in the following sections.

 \begin{figure}[!ht]
    \centering
@@ -744,30 +832,6 @@ the signal for reliable song recognition.
 \end{figure}
 \FloatBarrier

-
-% \caption{\textbf{Rectification and lowpass filtering improves SNR
-%                  but does not contribute to intensity invariance.}
-%                  Input $\raw(t)$ consists of song component $\soc(t)$ scaled by
-%                  $\sca$ with optional noise component $\noc(t)$ and is
-%                  successively transformed into tympanal signal $\filt(t)$ and
-%                  envelope $\env(t)$. Different line styles indicate different
-%                  cutoff frequencies $\fc$ of the lowpass filter extracting
-%                  $\env(t)$.
-%                  \textbf{Top}:~Example representations of $\filt(t)$ and
-%                  $\env(t)$ for different $\sca$.
-%                  \textbf{a}:~Noiseless case.
-%                  \textbf{b}:~Noisy case.
-%                  \textbf{Bottom}:~Intensity metrics over a range of $\sca$.
-%                  \textbf{c}:~Noiseless case: Standard deviations of $\filt(t)$
-%                  and $\env(t)$.
-%                  \textbf{d}:~Noisy case: Ratios of standard deviations of
-%                  $\filt(t)$ and $\env(t)$ to the respective reference standard
-%                  deviation for input $\raw(t)=\noc(t)$.
-%                  \textbf{e}:~Ratios of standard deviations of $\env(t)$ as in
-%                  \textbf{b} for different species (averaged over songs and
-%                  recordings, see appendix Fig.\,\ref{fig:app_rect-lp}).
-%                 }
-
 \begin{figure}[!ht]
    \centering
    \includegraphics[width=\textwidth]{figures/fig_invariance_thresh_lp_species.pdf}
@@ -796,17 +860,67 @@ the signal for reliable song recognition.
                     curve span of the norm across all three $\mu_{f_i}$ per
                     species.
                     \textbf{d}:~Noiseless case.
-                     \textbf{e}:~Noisy case. Shaded areas
+                     \textbf{e}:~Noisy case. Shaded areas indicate the average
+                     minimum $\mu_{f_i}$ across all species-specific trajectories.
                     }
    \label{fig:thresh-lp_species}
 \end{figure}
 \FloatBarrier
-
+% \caption{\textbf{Rectification and lowpass filtering improves SNR
+%                  but does not contribute to intensity invariance.}
+%                  Input $\raw(t)$ consists of song component $\soc(t)$ scaled by
+%                  $\sca$ with optional noise component $\noc(t)$ and is
+%                  successively transformed into tympanal signal $\filt(t)$ and
+%                  envelope $\env(t)$. Different line styles indicate different
+%                  cutoff frequencies $\fc$ of the lowpass filter extracting
+%                  $\env(t)$.
+%                  \textbf{Top}:~Example representations of $\filt(t)$ and
+%                  $\env(t)$ for different $\sca$.
+%                  \textbf{a}:~Noiseless case.
+%                  \textbf{b}:~Noisy case.
+%                  \textbf{Bottom}:~Intensity metrics over a range of $\sca$.
+%                  \textbf{c}:~Noiseless case: Standard deviations of $\filt(t)$
+%                  and $\env(t)$.
+%                  \textbf{d}:~Noisy case: Ratios of standard deviations of
+%                  $\filt(t)$ and $\env(t)$ to the respective reference standard
+%                  deviation for input $\raw(t)=\noc(t)$.
+%                  \textbf{e}:~Ratios of standard deviations of $\env(t)$ as in
+%                  \textbf{b} for different species (averaged over songs and
+%                  recordings, see appendix Fig.\,\ref{fig:app_rect-lp}).
+%                 }
 \begin{figure}[!ht]
    \centering
    \includegraphics[width=\textwidth]{figures/fig_invariance_full_Omocestus_rufipes.pdf}
-    \caption{\textbf{Step-wise emergence of intensity invariant song
-                     representation along the model pathway.}
+    \caption{\textbf{Step-wise emergence of intensity-invariant song
+                     representation along the full model pathway.}
+                     Input $\raw(t)$ consists of song component $\soc(t)$
+                     scaled by $\sca$ with added noise component $\noc(t)$ and
+                     is processed up to the feature set $f_i(t)$. Different
+                     color shades indicate different types of Gabor kernels
+                     with specific lobe number $\kn$ and either $+$ or $-$
+                     sign, sorted (dark to light) first by increasing $\kn$ and
+                     then by sign~($1\,\leq\,\kn\,\leq\,4$; first $+$, then $-$
+                     for each $\kn$; five kernel widths $\kw$ of 1, 2, 4, 8,
+                     and $16\,$ms per type; 8 types, 40 kernels in total).
+                     \textbf{a}:~Example representations of $\filt(t)$,
+                     $\env(t)$, $\db(t)$, $\adapt(t)$, $c_i(t)$, and $f_i(t)$
+                     for different $\sca$.
+                     \textbf{b}:~Intensity metrics over $\sca$. For $c_i(t)$
+                     and $f_i(t)$, the median over kernels is shown. Dots
+                     indicate $95\,\%$ curve span for $\db(t)$, $\adapt(t)$,
+                     $c_i(t)$, and $f_i(t)$.
+                     \textbf{c}:~Average value $\mu_{f_i}$ of each feature
+                     $f_i(t)$ over $\sca$.
+                     \textbf{d}:~Ratios of intensity metrics to the respective
+                     reference value for input $\raw(t)=\noc(t)$. For $c_i(t)$
+                     and $f_i(t)$, the median over kernel-specific ratios is
+                     shown.
+                     \textbf{e}:~Ratios of standard deviation $\sigma_{c_i}$ of
+                     each $c_i(t)$.
+                     \textbf{f}:~Ratios of $\mu_{f_i}$.
+                     \textbf{g}:~Distributions of kernel-specific $\sca$ that
+                     correspond to $95\,\%$ curve span for $c_i(t)$ and
+                     $f_i(t)$. Dots indicate the values from \textbf{b}.
                     }
    \label{fig:pipeline_full}
 \end{figure}
@@ -816,7 +930,34 @@ the signal for reliable song recognition.
    \centering
    \includegraphics[width=\textwidth]{figures/fig_invariance_short_Omocestus_rufipes.pdf}
    \caption{\textbf{Step-wise emergence of intensity invariant song
-                     representation along the model pathway.}
+                     representation along the model pathway without logarithmic
+                     compression.}
+                     Input $\raw(t)$ consists of song component $\soc(t)$
+                     scaled by $\sca$ with added noise component $\noc(t)$ and
+                     is processed up to the feature set $f_i(t)$, skipping
+                     $\db(t)$. Different color shades indicate different types
+                     of Gabor kernels with specific lobe number $\kn$ and
+                     either $+$ or $-$ sign, sorted (dark to light) first by
+                     increasing $\kn$ and then by
+                     sign~($1\,\leq\,\kn\,\leq\,4$; first $+$, then $-$ for
+                     each $\kn$; five kernel widths $\kw$ of 1, 2, 4, 8, and
+                     $16\,$ms per type; 8 types, 40 kernels in total).
+                     \textbf{a}:~Example representations of $\filt(t)$,
+                     $\env(t)$, $\adapt(t)$, $c_i(t)$, and $f_i(t)$ for
+                     different $\sca$.
+                     \textbf{b}:~Intensity metrics over $\sca$. For $c_i(t)$
+                     and $f_i(t)$, the median over kernels is shown. Dots
+                     indicate $95\,\%$ curve span for $f_i(t)$.
+                     \textbf{c}:~Average value $\mu_{f_i}$ of each feature
+                     $f_i(t)$ over $\sca$.
+                     \textbf{d}:~Ratios of intensity metrics to the respective
+                     reference value for input $\raw(t)=\noc(t)$. For $c_i(t)$
+                     and $f_i(t)$, the median over kernel-specific ratios is
+                     shown.
+                     \textbf{e}:~Ratios of $\mu_{f_i}$.
+                     \textbf{f}:~Distribution of kernel-specific $\sca$ that
+                     correspond to $95\,\%$ curve span for $f_i(t)$. Dots
+                     indicate the value from \textbf{b}.
                     }
    \label{fig:pipeline_short}
 \end{figure}
@@ -825,7 +966,28 @@ the signal for reliable song recognition.
 \begin{figure}[!ht]
    \centering
    \includegraphics[width=\textwidth]{figures/fig_features_cross_species.pdf}
-    \caption{\textbf{Inter- and intraspecific feature variability.}
+    \caption{\textbf{Interspecific and intraspecific feature variability.}
+                     Average value $\mu_{f_i}$ of each feature $f_i(t)$ against
+                     its counterpart from a second feature set based on a
+                     different input $\raw(t)$. Each dot within a subplot
+                     represents a single feature $f_i(t)$. Different color
+                     shades indicate different types of Gabor kernels with
+                     specific lobe number $\kn$ and either $+$ or $-$ sign,
+                     sorted (dark to light) first by increasing $\kn$ and then
+                     by sign~($1\,\leq\,\kn\,\leq\,4$; first $+$, then $-$ for
+                     each $\kn$; five kernel widths $\kw$ of 1, 2, 4, 8, and
+                     $16\,$ms per type; 8 types, 40 kernels in total). Data is
+                     based on the analysis underlying
+                     Fig.\,\ref{fig:pipeline_full}.
+                     \textbf{Lower triangular}:~Interspecific comparisons
+                     between single songs of different species.
+                     \textbf{Upper triangular}:~Intraspecific comparisons
+                     between different songs of a single species (\textit{O.
+                     rufipes}).
+                     \textbf{Lower left}:~Distribution of correlation
+                     coefficients $\rho$ for each interspecific and
+                     intraspecific comparison. Dots indicate single $\rho$
+                     values.
                     }
    \label{fig:feat_cross_species}
 \end{figure}
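
The argument in the subsection on logarithmic compression and spike-frequency adaptation (a scale factor becomes an additive offset in decibel space, and the highpass filter removes that offset) can be checked numerically. The following is a minimal sketch, not the model code from the manuscript: the toy song envelope, the toy noise, the list of scales, and the helper names `log_compress` and `remove_offset` are assumptions made for illustration, and the highpass filter is replaced by subtracting the mean over the whole signal.

```python
import numpy as np


def log_compress(env, ref=1.0):
    """Decibel conversion of a strictly positive envelope (reference ref = 1)."""
    return 20.0 * np.log10(env / ref)


def remove_offset(db):
    """Crude stand-in for the highpass filter: subtract the global mean."""
    return db - np.mean(db)


rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 10_000, endpoint=False)

# Hypothetical strictly positive song envelope and noise component.
song = 1.0 + 0.9 * np.sin(2.0 * np.pi * 20.0 * t)
noise = np.abs(rng.normal(0.0, 1.0, t.size)) + 1e-3

# Song scales spanning noise-dominated to song-dominated input.
scales = np.array([1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3])

# Scale-free reference: adapted representation of the unscaled, noiseless song.
reference = remove_offset(log_compress(song))

# Noiseless case: the scale only adds an offset, which is subtracted again.
sd_pure = np.array(
    [np.std(remove_offset(log_compress(a * song))) for a in scales]
)

# Noisy case: the scale remains inside the logarithm as noise / scale.
adapted_noisy = [remove_offset(log_compress(a * song + noise)) for a in scales]
corr_noisy = np.array(
    [np.corrcoef(x, reference)[0, 1] for x in adapted_noisy]
)

for a, sd, rho in zip(scales, sd_pure, corr_noisy):
    print(f"scale={a:g}  sd_noiseless={sd:.3f} dB  corr_noisy={rho:.3f}")
```

In this sketch, the noiseless standard deviation is the same for every scale, while the correlation between the noisy adapted signal and the scale-free reference is close to zero for small scales and approaches one for large scales, which mirrors the noise regime and the saturation regime described in the text. The absolute numbers depend entirely on the toy signals and should not be compared to the values reported in the figures.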