
All supervised learning problems require retrieving the most reliable function \(\color{orange}{f}\) from observed data \(\{(x_1, y_1), \dots, (x_n, y_n) \}\), in order to make predictions when observing new input data \(x_{n+1}\).
\[y = \color{orange}{f}(x) + \epsilon\]
No restrictions on \(\color{orange}{f}\) but a prior distribution on a functional space: \(\color{orange}{f} \sim \mathcal{GP}(m(\cdot),C(\cdot,\cdot))\)
We can think of a Gaussian process as the extension to infinity of multivariate Gaussians:
\(x \sim \mathcal{N}(m , \sigma^2)\) in \(\mathbb{R}\), \(\begin{pmatrix}
x_1 \\
x_2 \\
\end{pmatrix} \sim \mathcal{N} \left(
\begin{pmatrix}
m_1 \\
m_2 \\
\end{pmatrix},
\begin{bmatrix}
C_{1,1} & C_{1,2} \\
C_{2,1} & C_{2,2}
\end{bmatrix} \right)\) in \(\mathbb{R}^2\)
We can think of a Gaussian process as the extension to infinity of multivariate Gaussians:
\[p(f_1, f_2, \ldots, f_N, \ldots \mid \mathbf{x}) = \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_N) \\ \vdots \end{bmatrix}, \begin{bmatrix} C(x_1,x_1) & C(x_1,x_2) & \cdots & C(x_1,x_N) & \cdots \\ C(x_2,x_1) & C(x_2,x_2) & \cdots & C(x_2,x_N) & \cdots \\ \vdots & \vdots & \ddots & \vdots & \\ C(x_N,x_1) & C(x_N,x_2) & \cdots & C(x_N,x_N) & \cdots \\ \vdots & \vdots & & \vdots & \ddots \end{bmatrix} \right)\]
Kernels (or covariance functions) \(C(\cdot, \cdot)\) define the covariance structure of functions sampled from a GP. Formally, we fill the covariance matrix \(C\) such that:
\[\operatorname{cov}(f(x), f(x^{\prime})) = C(x, x^{\prime}) = \left[C \right]_{x, x^{\prime}}.\]
While \(m(\cdot)\) is often assumed to be \(0\), the covariance structure is critical and defined through tailored kernels. For instance, the Squared Exponential (or RBF) kernel is expressed as: \[C_{SE}(x, x^{\prime}) = s^2 \exp \Bigg(-\dfrac{(x - x^{\prime})^2}{2 \ell^2}\Bigg)\]
To model phenomena exhibiting repeating patterns, one can leverage the Periodic kernel: \[C_{perio}(x, x^{\prime}) = s^2 \exp \Bigg(- \dfrac{ 2 \sin^2 \Big(\pi \frac{\mid x - x^{\prime}\mid}{p} \Big)}{\ell^2}\Bigg)\]
We can even consider linear regression as a particular GP problem, by using the Linear kernel: \[C_{lin}(x, x^{\prime}) = s_a^2 + s_b^2 (x - c)(x^{\prime} - c )\]
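As an illustration, here is a minimal NumPy sketch (not from the source) evaluating these three kernels on arrays of 1-D inputs; the function names and default hyper-parameter values are my own choices.

```python
import numpy as np

def kernel_se(x, xp, s=1.0, ell=1.0):
    """Squared Exponential (RBF) kernel C_SE(x, x')."""
    d = x[:, None] - xp[None, :]
    return s**2 * np.exp(-d**2 / (2 * ell**2))

def kernel_periodic(x, xp, s=1.0, ell=1.0, p=1.0):
    """Periodic kernel C_perio(x, x')."""
    d = np.abs(x[:, None] - xp[None, :])
    return s**2 * np.exp(-2 * np.sin(np.pi * d / p)**2 / ell**2)

def kernel_linear(x, xp, sa=1.0, sb=1.0, c=0.0):
    """Linear kernel C_lin(x, x'); recovers Bayesian linear regression."""
    return sa**2 + sb**2 * (x[:, None] - c) * (xp[None, :] - c)

x = np.linspace(0, 1, 5)
K = kernel_se(x, x)   # 5 x 5 covariance matrix, symmetric, diagonal = s^2
```

Each function returns the full covariance matrix between two sets of inputs, which is exactly the object \(\left[C \right]_{x, x^{\prime}}\) defined above.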
We can learn optimal values of hyper-parameters from data through maximum likelihood.
Given observed data \(\mathbf{y} = (y_1, \dots, y_n)^T\) at inputs \(\mathbf{x} = (x_1, \dots, x_n)^T\), we assume:
\[y_i = \color{orange}{f}(x_i) + \epsilon_i, \hspace{1cm} \forall i = 1, \dots, n\]
with \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\) an independent noise term.
The vector of observations \(\mathbf{y}\) follows a multivariate Gaussian distribution:
\[\mathbf{y} \sim \mathcal{N}(m(\mathbf{x}), C_{\theta}(\mathbf{x}, \mathbf{x}) + \sigma^2 I)\]
where \(C_{\theta}(\mathbf{x}, \mathbf{x})\) is the covariance matrix evaluated at inputs \(\mathbf{x}\), with hyper-parameters \(\theta\).
The log-marginal likelihood of observed data \(\mathbf{y}\) given inputs \(\mathbf{x}\) and hyper-parameters \(\theta\) is: \[\log p(\mathbf{y} \mid \mathbf{x}, \theta) = -\frac{1}{2} \mathbf{y}^T (C_{\theta} + \sigma^2 I )^{-1} \mathbf{y} - \frac{1}{2} \log |C_{\theta} + \sigma^2 I| - \frac{n}{2} \log 2\pi.\]
We can optimise hyper-parameters \(\theta\) by maximizing the log-marginal likelihood using gradient-based methods.
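As a hedged illustration (the function names and synthetic data are my own, not the source's code), this NumPy/SciPy sketch minimises the negative log-marginal likelihood above for an SE kernel with noise, using `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marglik(log_params, x, y):
    s, ell, sigma = np.exp(log_params)          # log-scale keeps them > 0
    d = x[:, None] - x[None, :]
    K = s**2 * np.exp(-d**2 / (2 * ell**2)) \
        + (sigma**2 + 1e-8) * np.eye(len(x))    # C_theta + sigma^2 I (+ jitter)
    try:
        L = np.linalg.cholesky(K)               # stable: K = L L^T
    except np.linalg.LinAlgError:
        return 1e10                             # penalise ill-conditioned K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                     # 0.5 * y^T K^-1 y
            + np.sum(np.log(np.diag(L)))        # 0.5 * log |K|
            + 0.5 * len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
x = np.linspace(0, 5, 30)
y = np.sin(x) + 0.1 * rng.standard_normal(30)
res = minimize(neg_log_marglik, x0=np.zeros(3), args=(x, y))
s, ell, sigma = np.exp(res.x)                   # fitted hyper-parameters
```

Optimising in log-scale is a common trick to enforce positivity of the hyper-parameters without constrained optimisation.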
The Gaussian property implies that unobserved points have no influence on inference:
\[ \int \underbrace{p(f_{\color{grey}{obs}}, f_{\color{purple}{mis}})}_{\mathcal{GP}(m, C)} \ \mathrm{d}f_{\color{purple}{mis}} = \underbrace{p(f_{\color{grey}{obs}})}_{\mathcal{N}(m_{\color{grey}{obs}}, C_{\color{grey}{obs}})} \]
This crucial trick allows us to learn function properties from finite sets of observations. More generally, Gaussian processes are closed under conditioning and marginalisation.
\[ \begin{bmatrix} f_{\color{grey}{o}} \\ f_{\color{purple}{m}} \\ \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} m_{\color{grey}{o}} \\ m_{\color{purple}{m}} \\ \end{bmatrix}, \begin{pmatrix} C_{\color{grey}{o, o}} & C_{\color{grey}{o}, \color{purple}{m}} \\ C_{\color{purple}{m}, \color{grey}{o}} & C_{\color{purple}{m, m}} \end{pmatrix} \right)\]
While marginalisation serves for training, conditioning leads to the key GP prediction formula:
\[f_{\color{purple}{m}} \mid f_{\color{grey}{o}} \sim \mathcal{N} \Big( m_{\color{purple}{m}} + C_{\color{purple}{m}, \color{grey}{o}} C_{\color{grey}{o, o}}^{-1} (f_{\color{grey}{o}} - m_{\color{grey}{o}}), \ \ C_{\color{purple}{m, m}} - C_{\color{purple}{m}, \color{grey}{o}} C_{\color{grey}{o, o}}^{-1} C_{\color{grey}{o}, \color{purple}{m}} \Big)\]
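The conditioning formula above translates almost line-by-line into code. A minimal sketch, assuming an SE kernel, noise-free observations, and a zero prior mean (\(m_{o} = m_{m} = 0\)):

```python
import numpy as np

def se(a, b, s=1.0, ell=0.5):
    """Squared Exponential kernel between two 1-D input arrays."""
    return s**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

x_o = np.array([0.0, 0.5, 1.0])          # observed inputs
f_o = np.sin(2 * np.pi * x_o)            # observed (noise-free) values
x_m = np.linspace(0, 1, 50)              # unobserved prediction inputs

C_oo = se(x_o, x_o) + 1e-8 * np.eye(len(x_o))   # jitter for stability
C_mo = se(x_m, x_o)
C_mm = se(x_m, x_m)

# Posterior mean and covariance from the conditioning formula
mean = C_mo @ np.linalg.solve(C_oo, f_o)
cov = C_mm - C_mo @ np.linalg.solve(C_oo, C_mo.T)
```

Note that the posterior mean interpolates the observations exactly here, since no noise term was added to `C_oo`.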




Try and play with the amazing interactive GP visualisation tool available at:
In the previous examples, we saw a variety of situations that naturally lead to the same mathematical formulation of the learning problem:
\[y_s = \color{orange}{f_s}(x_s) + \epsilon_s, \hspace{3cm} \forall s = 1, \dots, S\]
Learn \(\color{orange}{f_s}\), the underlying relationship between \(x_s\) and \(y_s\), for each source of data \(\color{orange}{s}\).
Multi-source regression, in contrast with single-source regression, implies that some information can be shared across data sources to improve learning and predictions.
The nature of measurements leads to a more philosophical distinction:
A nice overview of Multi-Output GPs concepts:
Instead of modelling multiple tasks (or individuals, or outputs) independently with single GPs, we could leverage the correlations existing across them.
A natural approach comes from defining multi-output kernels, explicitly modelling correlations between each task-input couple. A classical assumption is separable covariance structures:
\[k\left((i, x),\left(j, x^{\prime}\right)\right)=\operatorname{cov}\left(f_i(x), f_j\left(x^{\prime}\right)\right)=K^{\mathrm{O}}_{i, j} \times k_{\mathrm{I}}\left(x, x^{\prime}\right)\]
By stacking outputs into one vector \(\textbf{y} = \{\textbf{y}_1, \dots, \textbf{y}_M\}^{\intercal}\), the matrices built from those kernels can be represented through a Kronecker product:
\[\textbf{K} = \textbf{K}^{\mathrm{O}} \otimes \textbf{K}_{\mathrm{I}}\]
There exist many ways to define multi-output kernels; we will see a brief overview of a few important ones. In practice, it can be difficult to learn explicit inter-task correlations (i.e. the elements of \(\boldsymbol{K^{\mathrm{O}}}\)) from data, with a computational complexity for exact training of \(\mathcal{O}(\color{red}{O}^3N^3)\).
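The separable construction above is a one-liner in practice. A small illustrative sketch (the inter-output matrix `B` and the lengthscale are assumed values; `B` must be positive semi-definite for \(\textbf{K}\) to be a valid covariance):

```python
import numpy as np

O, N = 2, 4
B = np.array([[1.0, 0.6],
              [0.6, 1.0]])                 # K_O: inter-output covariances

x = np.linspace(0, 1, N)
d = x[:, None] - x[None, :]
K_I = np.exp(-d**2 / (2 * 0.3**2))         # input kernel K_I (SE)

K = np.kron(B, K_I)                        # (O*N) x (O*N) joint covariance
```

Block \((i, j)\) of `K` equals \(K^{\mathrm{O}}_{i,j} \times \textbf{K}_{\mathrm{I}}\), i.e. the covariance between outputs \(i\) and \(j\) across all input pairs.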
\[y_t = \mu_0 + f_t + \epsilon_t, \hspace{3cm} \forall t = 1, \dots, T\]
with \(\mu_0 \sim \mathcal{GP}(m_0, K_0)\) a shared mean process, \(f_t \sim \mathcal{GP}(0, \Sigma_{\theta_t})\) task-specific processes, and \(\epsilon_t \sim \mathcal{N}(0, \sigma_t^2 I)\) independent noise terms.
It follows that:
\[y_t \mid \mu_0 \sim \mathcal{GP}(\mu_0, \Sigma_{\theta_t} + \sigma_t^2 I), \ \perp \!\!\! \perp_t\]
\(\rightarrow\) Unified GP framework with a shared mean process \(\mu_0\), and task-specific process \(f_t\),
\(\rightarrow\) Naturally handles irregular grids of input data.
Goal: Learn the hyper-parameters, (and \(\mu_0\)’s hyper-posterior).
Difficulty: The likelihood depends on \(\mu_0\), and tasks are not independent.
E-step
\[ \begin{align} p(\mu_0(\color{grey}{\mathbf{x}}) \mid \textbf{y}, \hat{\Theta}) &\propto \mathcal{N}(\mu_0(\color{grey}{\mathbf{x}}); m_0(\color{grey}{\textbf{x}}), \textbf{K}_{0}^{\color{grey}{\textbf{x}}}) \times \prod\limits_{t =1}^T \mathcal{N}(\mathbf{y}_t; \mu_0( \color{purple}{\textbf{x}_t}), \boldsymbol{\Psi}_{\hat{\theta}_t, \hat{\sigma}_t^2}^{\color{purple}{\textbf{x}_t}}) \\ &= \mathcal{N}(\mu_0(\color{grey}{\mathbf{x}}); \hat{m}_0(\color{grey}{\textbf{x}}), \hat{\textbf{K}}^{\color{grey}{\textbf{x}}}), \end{align} \]
M-step
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \ \ \sum\limits_{t = 1}^{T}\left\{ \log \mathcal{N} \left( \mathbf{y}_t; \hat{m}_0(\color{purple}{\mathbf{x}_t}), \boldsymbol{\Psi}_{\theta_t, \sigma^2}^{\color{purple}{\mathbf{x}_t}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{purple}{\mathbf{x}_t}} {\boldsymbol{\Psi}_{\theta_t, \sigma^2}^{\color{purple}{\mathbf{x}_t}}}^{-1} \right) \right\}. \end{align*} \]
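Since the E-step above is a product of Gaussians, the hyper-posterior of \(\mu_0\) is available in closed form. A hedged NumPy sketch, simplified by assuming every task is observed on the same grid (the framework itself handles irregular grids by pooling inputs); all names are illustrative:

```python
import numpy as np

def hyper_posterior(m0, K0, ys, Psis):
    """Closed-form Gaussian hyper-posterior of mu_0 on a common grid.

    m0, K0 : prior mean vector / covariance matrix of mu_0,
    ys, Psis : per-task observation vectors and covariances Psi_t.
    """
    K0_inv = np.linalg.inv(K0)
    prec = K0_inv + sum(np.linalg.inv(P) for P in Psis)   # posterior precision
    K_hat = np.linalg.inv(prec)
    m_hat = K_hat @ (K0_inv @ m0
                     + sum(np.linalg.inv(P) @ y for y, P in zip(ys, Psis)))
    return m_hat, K_hat

n = 5
x = np.linspace(0, 1, n)
K0 = np.exp(-(x[:, None] - x[None, :])**2 / 0.5) + 1e-6 * np.eye(n)
Psis = [K0 + 0.1 * np.eye(n) for _ in range(3)]           # 3 tasks
ys = [np.sin(x) + 0.1 * np.cos(t + x) for t in range(3)]
m_hat, K_hat = hyper_posterior(np.zeros(n), K0, ys, Psis)
```

Each observed task sharpens the hyper-posterior: its covariance `K_hat` is smaller (in the positive semi-definite order) than the prior `K0`.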
Sharing the covariance structures or not offers a compromise between flexibility and parsimony:
Overall, the computational complexity is: \[ \mathcal{O}(\color{blue}{T} \times N_t^3 + N^3) \]
with \(N = \Big\vert \bigcup\limits_{t = 1}^{\color{blue}{T}} \mathbf{x}_t \Big\vert\), the number of distinct input locations across tasks
For a new task, we observe some data \(y_*(\textbf{x}_*)\). Let us recall:
\[y_* \mid \mu_0 \sim \mathcal{GP}(\mu_0, \boldsymbol{\Psi}_{\theta_*, \sigma_*^2}), \ \perp \!\!\! \perp_t\]
Goals:
Difficulties:
Defining a multi-task prior distribution by:
\[\begin{align} p(y_* (\textbf{x}_*^{p}) \mid \textbf{y}) &= \int p\left(y_* (\textbf{x}_*^{p}) \mid \textbf{y}, \mu_0(\textbf{x}_*^{p})\right) p(\mu_0 (\textbf{x}_*^{p}) \mid \textbf{y}) \ d \mu_0(\mathbf{x}^{p}_{*}) \\ &= \int \underbrace{ p \left(y_* (\textbf{x}_*^{p}) \mid \mu_0 (\textbf{x}_*^{p}) \right)}_{\mathcal{N}(y_*; \mu_0, \Psi_*)} \ \underbrace{p(\mu_0 (\textbf{x}_*^{p}) \mid \textbf{y})}_{\mathcal{N}(\mu_0; \hat{m}_0, \hat{K})} \ d \mu_0(\mathbf{x}^{p}_{*}) \\ &= \mathcal{N}( \hat{m}_0 (\mathbf{x}^{p}_{*}), \underbrace{\Psi_* + \hat{K}}_{\Gamma}) \end{align}\]
\[p \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{x}_{*}}) \\ y_*(\color{purple}{\mathbf{x}^{p}}) \\ \end{bmatrix} \mid \textbf{y} \right) = \mathcal{N} \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{x}_{*}}) \\ y_*(\color{purple}{\mathbf{x}^{p}}) \\ \end{bmatrix}; \ \begin{bmatrix} \hat{m}_0(\color{grey}{\mathbf{x}_{*}}) \\ \hat{m}_0(\color{purple}{\mathbf{x}^{p}}) \\ \end{bmatrix}, \begin{pmatrix} \Gamma_{\color{grey}{**}} & \Gamma_{\color{grey}{*}\color{purple}{p}} \\ \Gamma_{\color{purple}{p}\color{grey}{*}} & \Gamma_{\color{purple}{pp}} \end{pmatrix} \right)\]
\[p(y_*(\color{purple}{\mathbf{x}^{p}}) \mid y_*(\color{grey}{\mathbf{x}_{*}}), \textbf{y}) = \mathcal{N} \Big( y_*(\color{purple}{\mathbf{x}^{p}}); \ \hat{\mu}_{*}(\color{purple}{\mathbf{x}^{p}}) , \hat{\Gamma}_{\color{purple}{pp}} \Big)\]
with \(\hat{\mu}_{*}(\color{purple}{\mathbf{x}^{p}}) = \hat{m}_0(\color{purple}{\mathbf{x}^{p}}) + \Gamma_{\color{purple}{p}\color{grey}{*}} \Gamma_{\color{grey}{**}}^{-1} \big(y_*(\color{grey}{\mathbf{x}_{*}}) - \hat{m}_0(\color{grey}{\mathbf{x}_{*}})\big)\) and \(\hat{\Gamma}_{\color{purple}{pp}} = \Gamma_{\color{purple}{pp}} - \Gamma_{\color{purple}{p}\color{grey}{*}} \Gamma_{\color{grey}{**}}^{-1} \Gamma_{\color{grey}{*}\color{purple}{p}}\), obtained by the usual Gaussian conditioning.
Multi-task GP regression provides more reliable predictions when individual processes are sparsely observed, while greatly reducing the associated uncertainty.
A unique underlying mean process might be too restrictive.
\(\rightarrow\) Mixture of multi-task GPs:
\[y_t \mid \{\color{green}{Z_{tk}} = 1 \} = \mu_{\color{green}{k}} + f_t + \epsilon_t, \hspace{3cm} \forall t = 1, \dots, T\]
with \(\color{green}{Z_{t}} \sim \mathcal{M}(1, \color{green}{\boldsymbol{\pi}})\) the latent variables indicating the cluster membership of each task.
It follows that:
\[y_t \mid \{ \boldsymbol{\mu} , \color{green}{\boldsymbol{\pi}} \} \sim \sum\limits_{k=1}^K{ \color{green}{\pi_k} \ \mathcal{GP}\Big(\mu_{\color{green}{k}}, \Psi_t^\color{green}{k} \Big)}, \ \perp \!\!\! \perp_t\]
E step: \[ \begin{align} \hat{q}_{\boldsymbol{\mu}}(\boldsymbol{\mu}) &= \color{green}{\prod\limits_{k = 1}^K} \mathcal{N}(\mu_\color{green}{k};\hat{m}_\color{green}{k}, \hat{\textbf{C}}_\color{green}{k}) , \hspace{2cm} \hat{q}_{\boldsymbol{Z}}(\boldsymbol{Z}) = \prod\limits_{t = 1}^T \mathcal{M}(Z_t;1, \color{green}{\boldsymbol{\tau}_t}) \end{align} \] M step:
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \sum\limits_{k = 1}^{K}\sum\limits_{t = 1}^{T}\tau_{tk} \left\{ \log \mathcal{N} \left( \mathbf{y}_t; \ \hat{m}_k, \boldsymbol{\Psi}_{\color{blue}{\theta_t}, \color{blue}{\sigma_t^2}} \right) - \dfrac{1}{2} \textrm{tr}\left( \mathbf{\hat{C}}_k\boldsymbol{\Psi}_{\color{blue}{\theta_t}, \color{blue}{\sigma_t^2}}^{-1}\right) \right\} \\ & \hspace{1cm} + \sum\limits_{k = 1}^{K}\sum\limits_{t = 1}^{T}\tau_{tk}\log \color{green}{\pi_{k}} \end{align*} \]
\[ p(y_*(\mathbf{x}^{p}) \mid \color{green}{Z_{*k}} = 1, y_*(\mathbf{x}_{*}), \textbf{y}) = \mathcal{N} \Big( y_*(\mathbf{x}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\mathbf{x}^{p}) , \hat{\Gamma}_{pp}^\color{green}{k} \Big), \forall \color{green}{k}, \]
\(\hat{\mu}_{*}^\color{green}{k}(\mathbf{x}^{p}) = \hat{m}_\color{green}{k}(\mathbf{x}^{p}) + \Gamma^\color{green}{k}_{p*} {\Gamma^\color{green}{k}_{**}}^{-1} (y_*(\mathbf{x}_{*}) - \hat{m}_\color{green}{k} (\mathbf{x}_{*}))\)
\(\hat{\Gamma}_{pp}^\color{green}{k} = \Gamma_{pp}^\color{green}{k} - \Gamma_{p*}^\color{green}{k} {\Gamma^{\color{green}{k}}_{**}}^{-1} \Gamma^{\color{green}{k}}_{*p}\)
\[p(y_*(\textbf{x}^p) \mid y_*(\textbf{x}_*), \textbf{y}) = \color{green}{\sum\limits_{k = 1}^{K} \tau_{*k}} \ \mathcal{N} \big( y_*(\mathbf{x}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\textbf{x}^p) , \hat{\Gamma}_{pp}^\color{green}{k}(\textbf{x}^p) \big).\]
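Beyond sampling, the first two moments of this Gaussian mixture follow directly from the laws of total expectation and variance. A small sketch at a single new input, with assumed illustrative values for the membership probabilities and cluster-specific predictions:

```python
import numpy as np

tau = np.array([0.7, 0.3])     # membership probabilities tau_{*k}
mus = np.array([1.0, 3.0])     # cluster-specific predictive means
vars_ = np.array([0.2, 0.5])   # cluster-specific predictive variances

# Law of total expectation: E[y] = sum_k tau_k * mu_k
mix_mean = np.sum(tau * mus)
# Law of total variance: V[y] = sum_k tau_k * (var_k + mu_k^2) - E[y]^2
mix_var = np.sum(tau * (vars_ + mus**2)) - mix_mean**2
```

Note that `mix_var` exceeds the average within-cluster variance whenever the cluster means disagree, reflecting the extra uncertainty due to cluster membership.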
A unique mean process can struggle to capture the relevant signals in the presence of group structures.
By identifying the underlying clustering structure, we can discard unnecessary information and provide enhanced predictions as well as lower uncertainty.
Each cluster-specific prediction is weighted by its membership probability \(\color{green}{\tau_{*k}}\).
Uncertainty quantification provides valuable information to practitioners for making decisions based on probabilistic statements, such as the risk of being overweight at age 10.

Implemented as an R package MagmaClustR: https://arthurleroy.github.io/MagmaClustR
We have reviewed two different paradigms to tackle an identical mathematical problem:
Both work on separate parts of a GP model, make no assumptions about the other, and are theoretically compatible. They are even expected to synergise well, each tackling the other's limitations.
\[\begin{bmatrix} y_t^{1} \\ \vdots \\ y_t^{O} \\ \end{bmatrix} = \begin{bmatrix} \mu_0 \\ \vdots \\ \mu_0 \\ \end{bmatrix} + \begin{bmatrix} f_t^{1} \\ \vdots \\ f_t^{O} \\ \end{bmatrix} + \begin{bmatrix} \epsilon_t^{1} \\ \vdots \\ \epsilon_t^{O} \\ \end{bmatrix}, \hspace{3cm} \forall t = 1, \dots, T\]
All mathematical objects from Multi-Task GPs are now extended with stacked Outputs.
\[\Big[k_{\theta_t}(\mathbf{x}, \mathbf{x}^{\prime})\Big]_{o,o'} = \dfrac{S_{t,o} \ S_{t,o^{\prime}}}{(2\pi)^{D/2} \ |\Sigma|^{1/2}} \ \exp\Big(-\dfrac{1}{2}(\mathbf{x} - \mathbf{x}^{\prime})^T \Sigma^{-1}(\mathbf{x}-\mathbf{x}^{\prime})\Big)\]
These matrices are typically diagonal, filled with Output-specific and latent lengthscales, one per Input dimension. In the end, we need to optimise \(\color{red}{O}\times \color{blue}{T} \times (3D+1)\) hyper-parameters.
\[ \begin{align} p(\mu_0(\color{grey}{\mathbf{x}}) \mid \textbf{y}, \hat{\Theta}) = \mathcal{N}(\mu_0(\color{grey}{\mathbf{x}}); \hat{m}_0(\color{grey}{\textbf{x}}), \hat{\textbf{K}}^{\color{grey}{\textbf{x}}}), \end{align} \]
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \sum\limits_{t = 1}^{T}\left\{ \log \mathcal{N} \left( \mathbf{y}_t; \hat{m}_0(\color{purple}{\mathbf{x}_t^O}), \boldsymbol{\Psi}_{\theta_t, \sigma_t^2}^{\color{purple}{\mathbf{x}_t^O}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{purple}{\mathbf{x}_t^O}} {\boldsymbol{\Psi}_{\theta_t, \sigma_t^2}^{\color{purple}{\mathbf{x}_t^O}}}^{-1} \right) \right\}. \end{align*} \]
As before, we obtain a multi-task posterior predictive distribution in closed form:
\[p\left(\begin{bmatrix} y_*^{1}(\color{purple}{\mathbf{x}^{p}_1}) \\ \vdots \\ y_*^{O}(\color{purple}{\mathbf{x}^{p}_O}) \\ \end{bmatrix} \mid \begin{bmatrix} y_*^{1}(\color{grey}{\mathbf{x}_1^*}) \\ \vdots \\ y_*^{O}(\color{grey}{\mathbf{x}_O^*}) \\ \end{bmatrix}, \textbf{y} \right) = \mathcal{N} \left(\begin{bmatrix} y_*^{1}(\color{purple}{\mathbf{x}^{p}_1}) \\ \vdots \\ y_*^{O}(\color{purple}{\mathbf{x}^{p}_O}) \\ \end{bmatrix}; \ \hat{\mu}_{*}(\color{purple}{\mathbf{x}^{p}}) , \hat{\Gamma}_{\color{purple}{pp}} \right)\]
with:
For instance, this context could correspond to measuring both the weight and milk production of several cows over time, belonging to distinct unknown breeds (that we want to recover), and for which we want to predict missing values of one Output, or forecast future values of both.
Let’s remove [-0.7, -0.3] from the 2nd Output, and [0.7, 1] from both Outputs of the testing Task, and try to predict those testing points (and recover the underlying clusters).
Rule of thumb:
| Output nature | Framework | Complexity |
|---|---|---|
| Independent variables of interest | Single-Task | \(\color{red}{O}\times\mathcal{O}(N^3)\) |
| Correlated variables of interest | Multi-Output | \(\mathcal{O}(\color{red}{O}^3 N^3)\) |
| Occurrences of the same phenomenon | Multi-Task | \(\mathcal{O}(\color{blue}{T} \times N^3)\) |
| Occurrences of correlated variables | Multi-Task-Multi-Output | \(\mathcal{O}(\color{blue}{T}\times\color{red}{O}^3 N^3)\) |
We are currently working on some nice extensions with a great team of PhD students!
Some problems we are tackling (at least trying to):
Several sources of correlation might exist in a dataset (e.g. multiple genes and individuals).
\[y_{\color{blue}{i}\color{red}{j}} = \mu_{0} + f_\color{blue}{i} + g_\color{red}{j} + \epsilon_{\color{blue}{i}\color{red}{j}}\]
with \(\mu_0\) a common mean process, \(f_{\color{blue}{i}}\) and \(g_{\color{red}{j}}\) processes specific to each source of correlation (e.g. individual \(\color{blue}{i}\) and gene \(\color{red}{j}\)), and \(\epsilon_{\color{blue}{i}\color{red}{j}}\) a noise term.
Key idea for training: define \(\color{blue}{T}+\color{red}{P} + 1\) different hyper-posterior distributions for \(\mu_0\) by conditioning on the appropriate sub-sample of data.
\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{red}{j} = 1,\dots, \color{red}{P}}^{\color{blue}{i} = 1,\dots, \color{blue}{T}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{0}, \hat{K}_0 \Big).\]
\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{blue}{i} = 1,\dots, \color{blue}{T}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{\color{red}{j}}, \hat{K}_\color{red}{j} \Big), \ \forall \color{red}{j} \in 1, \dots, \color{red}{P}\]
\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{red}{j} = 1,\dots, \color{red}{P}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{\color{blue}{i}}, \hat{K}_\color{blue}{i} \Big), \forall \color{blue}{i} \in 1, \dots, \color{blue}{T} \]
Each sub-sample of data leads to a specific hyper-posterior distribution of the mean process \(\mu_0\).
Although sharing the same mean process, different tasks still lead to different predictions.
Multi-mean GP provides adaptive predictions according to the relevant context.