\[y = \color{orange}{f}(x) + \epsilon\]
No restrictions on \(\color{orange}{f}\) but a prior distribution on a functional space: \(\color{orange}{f} \sim \mathcal{GP}(m(\cdot),C(\cdot,\cdot))\)
While \(m\) is often assumed to be \(0\), the covariance structure is critical and is defined through tailored kernels. For instance, the Squared Exponential (or RBF) kernel is expressed as: \[C_{SE}(x, x^{\prime}) = s^2 \exp \Bigg(-\dfrac{(x - x^{\prime})^2}{2 \ell^2}\Bigg)\]
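For concreteness, here is a minimal NumPy sketch of this kernel (function and argument names are illustrative, not from any particular library):

```python
import numpy as np

def kernel_se(x, x_prime, s=1.0, ell=1.0):
    """Squared Exponential kernel: s^2 * exp(-(x - x')^2 / (2 * ell^2))."""
    # Pairwise squared distances between the two (1D) input vectors
    sq_dist = (x[:, None] - x_prime[None, :]) ** 2
    return s**2 * np.exp(-sq_dist / (2 * ell**2))
```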
The Gaussian property ensures that unobserved points have no influence on inference, since they can simply be marginalised out:
\[ \int \underbrace{p(f_{\color{grey}{obs}}, f_{\color{purple}{mis}})}_{\mathcal{GP}(m, C)} \ \mathrm{d}f_{\color{purple}{mis}} = \underbrace{p(f_{\color{grey}{obs}})}_{\mathcal{N}(m_{\color{grey}{obs}}, C_{\color{grey}{obs}})} \]
This crucial trick allows us to learn function properties from finite sets of observations. More generally, Gaussian processes are closed under conditioning and marginalisation.
\[\begin{bmatrix} f_{\color{grey}{o}} \\ f_{\color{purple}{m}} \\ \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} m_{\color{grey}{o}} \\ m_{\color{purple}{m}} \\ \end{bmatrix}, \begin{pmatrix} C_{\color{grey}{o, o}} & C_{\color{grey}{o}, \color{purple}{m}} \\ C_{\color{purple}{m}, \color{grey}{o}} & C_{\color{purple}{m, m}} \end{pmatrix} \right)\]
While marginalisation serves for training, conditioning leads to the key GP prediction formula:
\[f_{\color{purple}{m}} \mid f_{\color{grey}{o}} \sim \mathcal{N} \Big( m_{\color{purple}{m}} + C_{\color{purple}{m}, \color{grey}{o}} C_{\color{grey}{o, o}}^{-1} (f_{\color{grey}{o}} - m_{\color{grey}{o}}), \ \ C_{\color{purple}{m, m}} - C_{\color{purple}{m}, \color{grey}{o}} C_{\color{grey}{o, o}}^{-1} C_{\color{grey}{o}, \color{purple}{m}} \Big)\]
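A minimal NumPy sketch of this conditioning step, assuming a kernel such as `kernel_se` above and a callable prior mean `m` (all names are hypothetical):

```python
import numpy as np

def gp_condition(f_o, t_o, t_m, m, kernel, jitter=1e-8):
    """Posterior mean and covariance of f at t_m, given observations f_o at t_o."""
    C_oo = kernel(t_o, t_o) + jitter * np.eye(len(t_o))  # regularised observed block
    C_mo = kernel(t_m, t_o)                              # cross-covariance block
    C_mm = kernel(t_m, t_m)
    mean = m(t_m) + C_mo @ np.linalg.solve(C_oo, f_o - m(t_o))
    cov = C_mm - C_mo @ np.linalg.solve(C_oo, C_mo.T)
    return mean, cov
```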
Instead of modelling multiple tasks (or individuals, or outputs) independently with separate GPs, we can leverage the correlations existing across them.
A natural approach comes from defining multi-output kernels, explicitly modelling correlations between each task-input pair. A classical assumption is to consider separable kernels, such as:
\[k\left((i, t),\left(j, t^{\prime}\right)\right)=\operatorname{cov}\left(f_i(t), f_j\left(t^{\prime}\right)\right)=k_{\mathrm{o}}(i, j) k_{\mathrm{t}}\left(t, t^{\prime}\right)\]
By stacking outputs into one vector \(\textbf{y} = (\textbf{y}_1^{\intercal}, \dots, \textbf{y}_M^{\intercal})^{\intercal}\), the matrices built from those kernels can be represented through a Kronecker product:
\[\textbf{K} = \textbf{K}_{\mathrm{o}} \otimes \textbf{K}_{\mathrm{t}}\]
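Under this separability assumption, the joint covariance of the stacked vector can be assembled directly with a Kronecker product; a small sketch reusing the `kernel_se` helper above, with an arbitrary positive semi-definite task matrix:

```python
import numpy as np

M, N = 3, 50                      # number of tasks, number of time points
t = np.linspace(0, 1, N)
K_t = kernel_se(t, t)             # temporal covariance (N x N)
L = np.random.randn(M, M)
K_o = L @ L.T                     # any symmetric PSD inter-task matrix (M x M)
K = np.kron(K_o, K_t)             # joint covariance of the stacked vector (MN x MN)
```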
There exist many ways to define multi-output kernels, though it is in practice costly and unstable to explicitly learn the inter-task correlations (i.e. the elements of \(\boldsymbol{K_o}\)) from data. Moreover, the time complexity of exact training is \(\mathcal{O}(\color{blue}{M}^3N^3)\).
Credits: Neil Lawrence
\[y_i = \mu_0 + f_i + \epsilon_i\]
with:

\(\mu_0 \sim \mathcal{GP}(m_0(\cdot), K_{\theta_0}(\cdot, \cdot))\), the common mean process,

\(f_i \sim \mathcal{GP}(0, \Sigma_{\theta_i}(\cdot, \cdot))\), the individual-specific processes,

\(\epsilon_i \sim \mathcal{N}(0, \sigma_i^2 I)\), the independent noise terms.
It follows that:
\[y_i \mid \mu_0 \sim \mathcal{GP}(\mu_0, \Sigma_{\theta_i} + \sigma_i^2 I), \ \perp \!\!\! \perp_i\]
\(\rightarrow\) Unified GP framework with a common mean process \(\mu_0\) and individual-specific processes \(f_i\),
\(\rightarrow\) Naturally handles irregular grids of input data.
Goal: Learn the hyper-parameters (and \(\mu_0\)'s hyper-posterior).

Difficulty: The likelihood depends on \(\mu_0\), and individuals are not independent.
Each individual has its specific vector of inputs \(\color{purple}{\textbf{t}_i}\) associated with outputs \(\textbf{y}_i\). Handling the mean process \(\mu_0\) requires defining pooled vectors and additional notation.
While GPs are infinite-dimensional objects, tractable inference on a finite set of observations fully determines their overall properties.
E step: \[ \begin{align} p(\mu_0(\color{grey}{\mathbf{t}}) \mid \textbf{y}, \hat{\Theta}) &\propto \mathcal{N}(\mu_0(\color{grey}{\mathbf{t}}); m_0(\color{grey}{\textbf{t}}), \textbf{K}_{\hat{\theta}_0}^{\color{grey}{\textbf{t}}}) \times \prod\limits_{i =1}^M \mathcal{N}(\mathbf{y}_i; \mu_0( \color{purple}{\textbf{t}_i}), \boldsymbol{\Psi}_{\hat{\theta}_i, \hat{\sigma}_i^2}^{\color{purple}{\textbf{t}_i}}) \\ &= \mathcal{N}(\mu_0(\color{grey}{\mathbf{t}}); \hat{m}_0(\color{grey}{\textbf{t}}), \hat{\textbf{K}}^{\color{grey}{\textbf{t}}}), \end{align} \] M step:
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \ \ \log \mathcal{N} \left( \hat{m}_0(\color{grey}{\textbf{t}}); m_0(\color{grey}{\textbf{t}}), \mathbf{K}_{\theta_0}^{\color{grey}{\textbf{t}}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{grey}{\textbf{t}}} {\mathbf{K}_{\theta_0}^{\color{grey}{\textbf{t}}}}^{-1} \right) \\ & \ \ \ + \sum\limits_{i = 1}^{M}\left\{ \log \mathcal{N} \left( \mathbf{y}_i; \hat{m}_0(\color{purple}{\mathbf{t}_i}), \boldsymbol{\Psi}_{\theta_i, \sigma^2}^{\color{purple}{\mathbf{t}_i}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{purple}{\mathbf{t}_i}} {\boldsymbol{\Psi}_{\theta_i, \sigma^2}^{\color{purple}{\mathbf{t}_i}}}^{-1} \right) \right\}. \end{align*} \]
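For intuition, the E step admits a closed form in which the hyper-posterior precision is the sum of the prior precision and the individual precisions. A minimal NumPy sketch, under the simplifying assumption that all individuals are observed on a common grid (which sidesteps the pooled-grid bookkeeping):

```python
import numpy as np

def magma_e_step(y, m0, K0, Psi):
    """Hyper-posterior N(mu_0; m0_hat, K_hat) on a shared grid.

    y   : (M, N) outputs of the M individuals
    m0  : (N,) prior mean of mu_0
    K0  : (N, N) prior covariance K_{theta_0}
    Psi : (M, N, N) individual covariances Sigma_{theta_i} + sigma_i^2 * I
    """
    K0_inv = np.linalg.inv(K0)
    Psi_inv = np.linalg.inv(Psi)                        # batched inversion over i
    K_hat = np.linalg.inv(K0_inv + Psi_inv.sum(axis=0))
    m_hat = K_hat @ (K0_inv @ m0 + np.einsum('inm,im->n', Psi_inv, y))
    return m_hat, K_hat
```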
For a new individual, we observe some data \(y_*(\textbf{t}_*)\). Let us recall:
\[y_* \mid \mu_0 \sim \mathcal{GP}(\mu_0, \boldsymbol{\Psi}_{\theta_*, \sigma_*^2}), \ \perp \!\!\! \perp_i\]
Goals:
Difficulties:
Defining a multi-task prior distribution by integrating over \(\mu_0\)'s hyper-posterior:
\[\begin{align} p(y_* (\textbf{t}_*^{p}) \mid \textbf{y}) &= \int p\left(y_* (\textbf{t}_*^{p}) \mid \textbf{y}, \mu_0(\textbf{t}_*^{p})\right) p(\mu_0 (\textbf{t}_*^{p}) \mid \textbf{y}) \ d \mu_0(\mathbf{t}^{p}_{*}) \\ &= \int \underbrace{ p \left(y_* (\textbf{t}_*^{p}) \mid \mu_0 (\textbf{t}_*^{p}) \right)}_{\mathcal{N}(y_*; \mu_0, \Psi_*)} \ \underbrace{p(\mu_0 (\textbf{t}_*^{p}) \mid \textbf{y})}_{\mathcal{N}(\mu_0; \hat{m}_0, \hat{K})} \ d \mu_0(\mathbf{t}^{p}_{*}) \\ &= \mathcal{N}( \hat{m}_0 (\mathbf{t}^{p}_{*}), \underbrace{\Psi_* + \hat{K}}_{\Gamma}) \end{align}\]
\[p \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{t}_{*}}) \\ y_*(\color{purple}{\mathbf{t}^{p}}) \\ \end{bmatrix} \mid \textbf{y} \right) = \mathcal{N} \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{t}_{*}}) \\ y_*(\color{purple}{\mathbf{t}^{p}}) \\ \end{bmatrix}; \ \begin{bmatrix} \hat{m}_0(\color{grey}{\mathbf{t}_{*}}) \\ \hat{m}_0(\color{purple}{\mathbf{t}^{p}}) \\ \end{bmatrix}, \begin{pmatrix} \Gamma_{\color{grey}{**}} & \Gamma_{\color{grey}{*}\color{purple}{p}} \\ \Gamma_{\color{purple}{p}\color{grey}{*}} & \Gamma_{\color{purple}{pp}} \end{pmatrix} \right)\]
\[p(y_*(\color{purple}{\mathbf{t}^{p}}) \mid y_*(\color{grey}{\mathbf{t}_{*}}), \textbf{y}) = \mathcal{N} \Big( y_*(\color{purple}{\mathbf{t}^{p}}); \ \hat{\mu}_{*}(\color{purple}{\mathbf{t}^{p}}) , \hat{\Gamma}_{\color{purple}{pp}} \Big)\]
with:
\[\hat{\mu}_{*}(\color{purple}{\mathbf{t}^{p}}) = \hat{m}_0(\color{purple}{\mathbf{t}^{p}}) + \Gamma_{\color{purple}{p}\color{grey}{*}} \Gamma_{\color{grey}{**}}^{-1} \big(y_*(\color{grey}{\mathbf{t}_{*}}) - \hat{m}_0(\color{grey}{\mathbf{t}_{*}})\big)\]
\[\hat{\Gamma}_{\color{purple}{pp}} = \Gamma_{\color{purple}{pp}} - \Gamma_{\color{purple}{p}\color{grey}{*}} \Gamma_{\color{grey}{**}}^{-1} \Gamma_{\color{grey}{*}\color{purple}{p}}\]
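A NumPy sketch of this prediction step, assuming the blocks of \(\Gamma\) have already been evaluated on the observed and prediction grids (variable names are hypothetical):

```python
import numpy as np

def magma_predict(y_star, m0_hat_obs, m0_hat_pred, Gamma_oo, Gamma_po, Gamma_pp):
    """Condition the joint Gaussian on the new individual's observations."""
    alpha = np.linalg.solve(Gamma_oo, y_star - m0_hat_obs)
    mu_pred = m0_hat_pred + Gamma_po @ alpha
    cov_pred = Gamma_pp - Gamma_po @ np.linalg.solve(Gamma_oo, Gamma_po.T)
    return mu_pred, cov_pred
```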
Leroy et al. - MAGMA: Inference and Prediction using Multi-Task Gaussian Processes with Common Mean - Machine Learning - 2022
Leroy et al. - Cluster-Specific Predictions with Multi-Task Gaussian Processes - Journal of Machine Learning Research - 2023
A unique underlying mean process might be too restrictive.
\(\rightarrow\) Mixture of multi-task GPs:
\[y_i \mid \{\color{green}{Z_{ik}} = 1 \} = \mu_{\color{green}{k}} + f_i + \epsilon_i\]
with:

\(\mu_{\color{green}{k}} \sim \mathcal{GP}(m_{\color{green}{k}}(\cdot), C_{\color{green}{\gamma_k}}(\cdot, \cdot))\), the cluster-specific mean processes,

\(Z_i \sim \mathcal{M}(1, \color{green}{\boldsymbol{\pi}})\), the latent clustering variables,

\(f_i\) and \(\epsilon_i\), the individual-specific and noise processes, as before.
It follows that:
\[y_i \mid \{ \boldsymbol{\mu} , \color{green}{\boldsymbol{\pi}} \} \sim \sum\limits_{k=1}^K{ \color{green}{\pi_k} \ \mathcal{GP}\Big(\mu_{\color{green}{k}}, \Psi_i^\color{green}{k} \Big)}, \ \perp \!\!\! \perp_i\]
The integrated likelihood is no longer tractable, due to posterior dependencies between \( \boldsymbol{\mu} = \{\mu_\color{green}{k}\}_\color{green}{k}\) and \(\mathbf{Z}= \{Z_i\}_i\).
Variational inference still allows us to maintain closed-form approximations. For any distribution \(q\):
\[\log p(\textbf{y} \mid \Theta) = \mathcal{L}(q; \Theta) + KL \big( q \mid \mid p(\boldsymbol{\mu}, \boldsymbol{Z} \mid \textbf{y}, \Theta)\big)\]
Posterior independence is enforced through a mean-field approximation assumption:
\[q(\boldsymbol{\mu}, \boldsymbol{Z}) = q_{\boldsymbol{\mu}}(\boldsymbol{\mu})q_{\boldsymbol{Z}}(\boldsymbol{Z}).\]
Maximising the lower bound \(\mathcal{L}(q; \Theta)\) induces natural factorisations over clusters and individuals for the variational distributions.
E step: \[ \begin{align} \hat{q}_{\boldsymbol{\mu}}(\boldsymbol{\mu}) &= \color{green}{\prod\limits_{k = 1}^K} \mathcal{N}(\mu_\color{green}{k};\hat{m}_\color{green}{k}, \hat{\textbf{C}}_\color{green}{k}) , \hspace{2cm} \hat{q}_{\boldsymbol{Z}}(\boldsymbol{Z}) = \prod\limits_{i = 1}^M \mathcal{M}(Z_i;1, \color{green}{\boldsymbol{\tau}_i}) \end{align} \] M step:
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \sum\limits_{k = 1}^{K}\ \log \mathcal{N} \left( \hat{m}_k; \ m_k, \boldsymbol{C}_{\color{green}{\gamma_k}} \right) - \dfrac{1}{2} \textrm{tr}\left( \mathbf{\hat{C}}_k\boldsymbol{C}_{\color{green}{\gamma_k}}^{-1}\right) \\ & \hspace{1cm} + \sum\limits_{k = 1}^{K}\sum\limits_{i = 1}^{M}\tau_{ik}\ \log \mathcal{N} \left( \mathbf{y}_i; \ \hat{m}_k, \boldsymbol{\Psi}_{\color{blue}{\theta_i}, \color{blue}{\sigma_i^2}} \right) - \dfrac{1}{2} \textrm{tr}\left( \mathbf{\hat{C}}_k\boldsymbol{\Psi}_{\color{blue}{\theta_i}, \color{blue}{\sigma_i^2}}^{-1}\right) \\ & \hspace{1cm} + \sum\limits_{k = 1}^{K}\sum\limits_{i = 1}^{M}\tau_{ik}\log \color{green}{\pi_{k}} \end{align*} \]
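In practice, the variational parameters \(\color{green}{\boldsymbol{\tau}_i}\) of the E step are obtained by normalising per-cluster log-weights. A generic, numerically stable sketch (the exact log-weights follow from the variational update and are assumed precomputed here):

```python
import numpy as np

def responsibilities(log_w):
    """Turn per-individual log-weights into membership probabilities tau_ik.

    log_w : (M, K) array, e.g. log pi_k plus the expected log-likelihood of
            y_i under cluster k (up to an additive constant per individual).
    """
    log_w = log_w - log_w.max(axis=1, keepdims=True)  # stable softmax
    tau = np.exp(log_w)
    return tau / tau.sum(axis=1, keepdims=True)
```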
Sharing the covariance structures offers a compromise between flexibility and parsimony:
\[ p(y_*(\mathbf{t}^{p}) \mid \color{green}{Z_{*k}} = 1, y_*(\mathbf{t}_{*}), \textbf{y}) = \mathcal{N} \Big( y_*(\mathbf{t}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\mathbf{t}^{p}) , \hat{\Gamma}_{pp}^\color{green}{k} \Big), \forall \color{green}{k}, \]
with:
\[\hat{\mu}_{*}^\color{green}{k}(\mathbf{t}^{p}) = \hat{m}_\color{green}{k}(\mathbf{t}^{p}) + \Gamma^\color{green}{k}_{p*} {\Gamma^\color{green}{k}_{**}}^{-1} \big(y_*(\mathbf{t}_{*}) - \hat{m}_\color{green}{k} (\mathbf{t}_{*})\big)\]
\[\hat{\Gamma}_{pp}^\color{green}{k} = \Gamma_{pp}^\color{green}{k} - \Gamma_{p*}^\color{green}{k} {\Gamma^{\color{green}{k}}_{**}}^{-1} \Gamma^{\color{green}{k}}_{*p}\]
\[p(y_*(\textbf{t}^p) \mid y_*(\textbf{t}_*), \textbf{y}) = \color{green}{\sum\limits_{k = 1}^{K} \tau_{*k}} \ \mathcal{N} \big( y_*(\mathbf{t}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\textbf{t}^p) , \hat{\Gamma}_{pp}^\color{green}{k} \big).\]
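The predictive distribution is thus a Gaussian mixture. When a single point estimate and uncertainty measure are needed, its first two moments follow from the law of total variance; a sketch with hypothetical array shapes:

```python
import numpy as np

def mixture_moments(tau_star, mus, Gammas):
    """Mean and covariance of the mixture sum_k tau_k * N(mu_k, Gamma_k).

    tau_star : (K,) membership probabilities of the new individual
    mus      : (K, P) cluster-specific predictive means
    Gammas   : (K, P, P) cluster-specific predictive covariances
    """
    mean = np.einsum('k,kp->p', tau_star, mus)
    diff = mus - mean[None, :]
    cov = (np.einsum('k,kpq->pq', tau_star, Gammas)           # within-cluster
           + np.einsum('k,kp,kq->pq', tau_star, diff, diff))  # between-cluster
    return mean, cov
```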
By identifying the underlying clustering structure, MagmaClust discards unnecessary information and provides enhanced predictions with lower uncertainty.
Each cluster-specific prediction is weighted by its membership probability \(\color{green}{\tau_{*k}}\).
Implemented as an R package MagmaClustR: https://arthurleroy.github.io/MagmaClustR
Different sources of correlation might exist in the data (e.g. multiple genes and individuals).
\[y_{\color{blue}{i}\color{red}{j}} = \mu_{0} + f_\color{blue}{i} + g_\color{red}{j} + \epsilon_{\color{blue}{i}\color{red}{j}}\]
with \(\mu_0\) the common mean process, \(f_\color{blue}{i}\) and \(g_\color{red}{j}\) the processes capturing the two sources of correlation (e.g. individuals and genes, respectively), and \(\epsilon_{\color{blue}{i}\color{red}{j}}\) the noise term.
Key idea for training: define \(\color{blue}{M}+\color{red}{P} + 1\) different hyper-posterior distributions for \(\mu_0\) by conditioning on the adequate sub-sample of data.
\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{red}{j} = 1,\dots, \color{red}{P}}^{\color{blue}{i} = 1,\dots, \color{blue}{M}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{0}, \hat{K}_0 \Big).\]
\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{blue}{i} = 1,\dots, \color{blue}{M}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{\color{red}{j}}, \hat{K}_\color{red}{j} \Big), \ \forall \color{red}{j} \in 1, \dots, \color{red}{P}\]
\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{red}{j} = 1,\dots, \color{red}{P}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{\color{blue}{i}}, \hat{K}_\color{blue}{i} \Big), \forall \color{blue}{i} \in 1, \dots, \color{blue}{M} \]
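Each of these hyper-posteriors reuses the same closed form, applied to the adequate sub-sample; a sketch reusing the hypothetical `magma_e_step` helper above, assuming outputs on a shared grid with shape \((M, P, N)\):

```python
# y : (M, P, N) outputs; Psi : (M, P, N, N) covariances (hypothetical shapes)
M, P, N = y.shape
post_all   = magma_e_step(y.reshape(M * P, N), m0, K0, Psi.reshape(M * P, N, N))
post_per_j = [magma_e_step(y[:, j], m0, K0, Psi[:, j]) for j in range(P)]  # P of them
post_per_i = [magma_e_step(y[i], m0, K0, Psi[i]) for i in range(M)]        # M of them
```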
Each sub-sample of data leads to a specific hyper-posterior distribution of the mean process \(\mu_0\).
Although sharing the same mean process, different tasks still lead to different predictions.
Multi-mean GP provides adaptive predictions according to the relevant context.
\(\rightarrow\) All methods scale linearly with the number of tasks and clusters.

\(\rightarrow\) Parallel computing can be used to speed up training.

Overall, the computational complexity is:
And also: