Multi-Mean Gaussian Processes: A probabilistic framework for multi-correlated functional data

Arthur Leroy - Department of Computer Science, The University of Manchester

Functional Data Analysis Workshop - Lille - 14/03/2024

Let’s start simply with an illustrative example

Irregular time series (in number of observations and location),
Many different individuals,
A few observations per individual.

Gaussian process: a prior distribution over functions

\[y = \color{orange}{f}(x) + \epsilon\]

No restrictions on \(\color{orange}{f}\) but a prior distribution on a functional space: \(\color{orange}{f} \sim \mathcal{GP}(m(\cdot),C(\cdot,\cdot))\)

While \(m\) is often assumed to be \(0\), the covariance structure is critical and defined through tailored kernels. For instance, the Squared Exponential (or RBF) kernel is expressed as: \[C_{SE}(x, x^{\prime}) = s^2 \exp \Bigg(-\dfrac{(x - x^{\prime})^2}{2 \ell^2}\Bigg)\]

Gaussian process: what about a visual summary?

Powerful non parametric method offering probabilistic predictions,
Computational complexity in \(\mathcal{O}(N^3)\), with N the number of observations.

Forecasting with a unique GP

Multi-tAsk GP with common MeAn (Magma)

\[y_i = \mu_0 + f_i + \epsilon_i\]

with:

\(\mu_0 \sim \mathcal{GP}(m_0, K_{\theta_0}),\)
\(f_i \sim \mathcal{GP}(0, \Sigma_{\theta_i}), \ \perp \!\!\! \perp_i,\)
\(\epsilon_i \sim \mathcal{GP}(0, \sigma_i^2), \ \perp \!\!\! \perp_i.\)

It follows that:

\[y_i \mid \mu_0 \sim \mathcal{GP}(\mu_0, \Sigma_{\theta_i} + \sigma_i^2 I), \ \perp \!\!\! \perp_i\]

\(\rightarrow\) Unified GP framework with a common mean process \(\mu_0\), and individual-specific process \(f_i\),

\(\rightarrow\) Naturaly handles irregular grids of input data.

Goal: Learn the hyper-parameters, (and \(\mu_0\)’s hyper-posterior).
Difficulty: The likelihood depends on \(\mu_0\), and individuals are not independent.

EM algorithm

E step: \[ \begin{align} p(\mu_0(\color{grey}{\mathbf{t}}) \mid \textbf{y}, \hat{\Theta}) &\propto \mathcal{N}(\mu_0(\color{grey}{\mathbf{t}}); m_0(\color{grey}{\textbf{t}}), \textbf{K}_{\hat{\theta}_0}^{\color{grey}{\textbf{t}}}) \times \prod\limits_{i =1}^M \mathcal{N}(\mathbf{y}_i; \mu_0( \color{purple}{\textbf{t}_i}), \boldsymbol{\Psi}_{\hat{\theta}_i, \hat{\sigma}_i^2}^{\color{purple}{\textbf{t}_i}}) \\ &= \mathcal{N}(\mu_0(\color{grey}{\mathbf{t}}); \hat{m}_0(\color{grey}{\textbf{t}}), \hat{\textbf{K}}^{\color{grey}{\textbf{t}}}), \end{align} \] M step:

\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \ \ \log \mathcal{N} \left( \hat{m}_0(\color{grey}{\textbf{t}}); m_0(\color{grey}{\textbf{t}}), \mathbf{K}_{\theta_0}^{\color{grey}{\textbf{t}}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{grey}{\textbf{t}}} {\mathbf{K}_{\theta_0}^{\color{grey}{\textbf{t}}}}^{-1} \right) \\ & \ \ \ + \sum\limits_{i = 1}^{M}\left\{ \log \mathcal{N} \left( \mathbf{y}_i; \hat{m}_0(\color{purple}{\mathbf{t}_i}), \boldsymbol{\Psi}_{\theta_i, \sigma^2}^{\color{purple}{\mathbf{t}_i}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{purple}{\mathbf{t}_i}} {\boldsymbol{\Psi}_{\theta_i, \sigma^2}^{\color{purple}{\mathbf{t}_i}}}^{-1} \right) \right\}. \end{align*} \]

Prediction: the key idea

Defining a multi-task prior distribution by:

conditioning on training data,
integrating over \(\mu_0\)’s hyper-posterior distribution.

\[\begin{align} p(y_* (\textbf{t}_*^{p}) \mid \textbf{y}) &= \int p\left(y_* (\textbf{t}_*^{p}) \mid \textbf{y}, \mu_0(\textbf{t}_*^{p})\right) p(\mu_0 (\textbf{t}_*^{p}) \mid \textbf{y}) \ d \mu_0(\mathbf{t}^{p}_{*}) \\ &= \int \underbrace{ p \left(y_* (\textbf{t}_*^{p}) \mid \mu_0 (\textbf{t}_*^{p}) \right)}_{\mathcal{N}(y_*; \mu_0, \Psi_*)} \ \underbrace{p(\mu_0 (\textbf{t}_*^{p}) \mid \textbf{y})}_{\mathcal{N}(\mu_0; \hat{m}_0, \hat{K})} \ d \mu_0(\mathbf{t}^{p}_{*}) \\ &= \mathcal{N}( \hat{m}_0 (\mathbf{t}^{p}_{*}), \underbrace{\Psi_* + \hat{K}}_{\Gamma}) \end{align}\]

Prediction: additional steps

Multi-task prior:

\[p \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{t}_{*}}) \\ y_*(\color{purple}{\mathbf{t}^{p}}) \\ \end{bmatrix} \mid \textbf{y} \right) = \mathcal{N} \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{t}_{*}}) \\ y_*(\color{purple}{\mathbf{t}^{p}}) \\ \end{bmatrix}; \ \begin{bmatrix} \hat{m}_0(\color{grey}{\mathbf{t}_{*}}) \\ \hat{m}_0(\color{purple}{\mathbf{t}^{p}}) \\ \end{bmatrix}, \begin{pmatrix} \Gamma_{\color{grey}{**}} & \Gamma_{\color{grey}{*}\color{purple}{p}} \\ \Gamma_{\color{purple}{p}\color{grey}{*}} & \Gamma_{\color{purple}{pp}} \end{pmatrix} \right)\]

Multi-task posterior:

\[p(y_*(\color{purple}{\mathbf{t}^{p}}) \mid y_*(\color{grey}{\mathbf{t}_{*}}), \textbf{y}) = \mathcal{N} \Big( y_*(\color{purple}{\mathbf{t}^{p}}); \ \hat{\mu}_{*}(\color{purple}{\mathbf{t}^{p}}) , \hat{\Gamma}_{\color{purple}{pp}} \Big)\]

with:

\(\hat{\mu}_{*}(\color{purple}{\mathbf{t}^{p}}) = \hat{m}_0(\color{purple}{\mathbf{t}^{p}}) + \Gamma_{\color{purple}{p}\color{grey}{*}}\Gamma_{\color{grey}{**}}^{-1} (y_*(\color{grey}{\mathbf{t}_{*}}) - \hat{m}_0 (\color{grey}{\mathbf{t}_{*}}))\)
\(\hat{\Gamma}_{\color{purple}{pp}} = \Gamma_{\color{purple}{pp}} - \Gamma_{\color{purple}{p}\color{grey}{*}}\Gamma_{\color{grey}{**}}^{-1} \Gamma_{\color{grey}{*}\color{purple}{p}}\)

Leroy et al. - MAGMA: Inference and Prediction using Multi-Task Gaussian Processes with Common Mean - Machine Learning - 2022

A GIF is worth a thousand words

Magma + Clustering = MagmaClust

Leroy et al. - Cluster-Specific Predictions with Multi-Task Gaussian Processes - Journal of Machine Learning Research - 2023

\(\rightarrow\) Mixture of multi-task GPs:

\[y_i \mid \{\color{green}{Z_{ik}} = 1 \} = \mu_{\color{green}{k}} + f_i + \epsilon_i\]

with:

\(\color{green}{Z_{i}} \sim \mathcal{M}(1, \color{green}{\boldsymbol{\pi}}), \ \perp \!\!\! \perp_i,\)
\(\mu_{\color{green}{k}} \sim \mathcal{GP}(m_{\color{green}{k}}, \color{green}{C_{\gamma_{k}}})\ \perp \!\!\! \perp_{\color{green}{k}},\)
\(f_i \sim \mathcal{GP}(0, \Sigma_{\theta_i}), \ \perp \!\!\! \perp_i,\)
\(\epsilon_i \sim \mathcal{GP}(0, \sigma_i^2), \ \perp \!\!\! \perp_i.\)

It follows that:

\[y_i \mid \{ \boldsymbol{\mu} , \color{green}{\boldsymbol{\pi}} \} \sim \sum\limits_{k=1}^K{ \color{green}{\pi_k} \ \mathcal{GP}\Big(\mu_{\color{green}{k}}, \Psi_i^\color{green}{k} \Big)}, \ \perp \!\!\! \perp_i\]

Variational EM

E step: \[ \begin{align} \hat{q}_{\boldsymbol{\mu}}(\boldsymbol{\mu}) &= \color{green}{\prod\limits_{k = 1}^K} \mathcal{N}(\mu_\color{green}{k};\hat{m}_\color{green}{k}, \hat{\textbf{C}}_\color{green}{k}) , \hspace{2cm} \hat{q}_{\boldsymbol{Z}}(\boldsymbol{Z}) = \prod\limits_{i = 1}^M \mathcal{M}(Z_i;1, \color{green}{\boldsymbol{\tau}_i}) \end{align} \] M step:

\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \sum\limits_{k = 1}^{K}\ \mathcal{N} \left( \hat{m}_k; \ m_k, \boldsymbol{C}_{\color{green}{\gamma_k}} \right) - \dfrac{1}{2} \textrm{tr}\left( \mathbf{\hat{C}}_k\boldsymbol{C}_{\color{green}{\gamma_k}}^{-1}\right) \\ & \hspace{1cm} + \sum\limits_{k = 1}^{K}\sum\limits_{i = 1}^{M}\tau_{ik}\ \mathcal{N} \left( \mathbf{y}_i; \ \hat{m}_k, \boldsymbol{\Psi}_{\color{blue}{\theta_i}, \color{blue}{\sigma_i^2}} \right) - \dfrac{1}{2} \textrm{tr}\left( \mathbf{\hat{C}}_k\boldsymbol{\Psi}_{\color{blue}{\theta_i}, \color{blue}{\sigma_i^2}}^{-1}\right) \\ & \hspace{1cm} + \sum\limits_{k = 1}^{K}\sum\limits_{i = 1}^{M}\tau_{ik}\log \color{green}{\pi_{k}} \end{align*} \]

Prediction

Multi-task posterior for each cluster:

\[ p(y_*(\mathbf{t}^{p}) \mid \color{green}{Z_{*k}} = 1, y_*(\mathbf{t}_{*}), \textbf{y}) = \mathcal{N} \Big( y_*(\mathbf{t}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\mathbf{t}^{p}) , \hat{\Gamma}_{pp}^\color{green}{k} \Big), \forall \color{green}{k}, \]

\(\hat{\mu}_{*}^\color{green}{k}(\mathbf{t}^{p}) = \hat{m}_\color{green}{k}(\mathbf{t}^{p}) + \Gamma^\color{green}{k}_{p*} {\Gamma^\color{green}{k}_{**}}^{-1} (y_*(\mathbf{t}_{*}) - \hat{m}_\color{green}{k} (\mathbf{t}_{*}))\)
\(\hat{\Gamma}_{pp}^\color{green}{k} = \Gamma_{pp}^\color{green}{k} - \Gamma_{p*}^\color{green}{k} {\Gamma^{\color{green}{k}}_{**}}^{-1} \Gamma^{\color{green}{k}}_{*p}\)
Predictive multi-task GPs mixture:

\[p(y_*(\textbf{t}^p) \mid y_*(\textbf{t}_*), \textbf{y}) = \color{green}{\sum\limits_{k = 1}^{K} \tau_{*k}} \ \mathcal{N} \big( y_*(\mathbf{t}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\textbf{t}^p) , \hat{\Gamma}_{pp}^\color{green}{k}(\textbf{t}^p) \big).\]

An image is still worth many words

\(\rightarrow\) By identifying the underlying clustering structure, MagmaClust discards unnecessary information and provides enhanced predictions as well as a lower uncertainty.

\(\rightarrow\) Each cluster-specific prediction is weighted by its membership probability \(\color{green}{\tau_{*k}}\).

Feel free to test it on your data …

Implemented as an R package MagmaClustR: https://arthurleroy.github.io/MagmaClustR

… it also works with muti-dimensional inputs

Multi-mean Gaussian processes

Different sources of correlation might co-exist (e.g. multiple individuals with multiple genes)

\[y_{\color{blue}{i}\color{red}{j}} = \mu_{0} + f_\color{blue}{i} + g_\color{red}{j} + \epsilon_{\color{blue}{i}\color{red}{j}}\]

with:

\(\mu_{0} \sim \mathcal{GP}(m_{0}, {C_{\gamma_{0}}}), \ f_{\color{blue}{i}} \sim \mathcal{GP}(0, \Sigma_{\theta_{\color{blue}{i}}}), \ \epsilon_{\color{blue}{i}\color{red}{j}} \sim \mathcal{GP}(0, \sigma_{\color{blue}{i}\color{red}{j}}^2), \ \perp \!\!\! \perp_i\)
\(g_{\color{red}{j}} \sim \mathcal{GP}(0, \Sigma_{\theta_{\color{red}{j}}})\)

Key idea for training: define \(\color{blue}{M}+\color{red}{P}\) adaptive hyper-posterior distributions for \(\mu_0\) by conditioning over the adequate sub-sample of data.

\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{blue}{i} = 1,\dots, \color{blue}{M}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{\color{red}{j}}, \hat{K}_\color{red}{j} \Big), \ \forall \color{red}{j} \in 1, \dots, \color{red}{P}\]

\[p(\mu_0 \mid \{y_{\color{blue}{i}\color{red}{j}} \}_{\color{red}{j} = 1,\dots, \color{red}{P}}) = \mathcal{N}\Big(\mu_{0}; \ \hat{m}_{\color{blue}{i}}, \hat{K}_\color{blue}{i} \Big), \forall \color{blue}{i} \in 1, \dots, \color{blue}{M} \]

Multi-Mean GPs: multiple hyper-posterior mean processes

Each sub-sample of data leads to a specific hyper-posterior distribution of the mean process \(\mu_0\).

Multi-Mean GPs: an adaptive prediction

Although sharing the same mean process, different tasks still lead to different predictions.

Multi-mean GP provides adaptive predictions according to the relevant context.

Forecasting thousands of genenomic time series…

… with the adequate quantification of uncertainty

Leroy et al. - Longitudinal prediction of DNA methylation to forecast epigenetic outcomes - Preprint - 2023

Answer to the \(\mathbb{P}(\)first question\() \approx 1\)

\(\rightarrow\) All methods scale linearly with the number of tasks and clusters.

\(\rightarrow\) Parallel computing can be used to speed up training.

Overall, the computational complexity is:

Magma: \[ \mathcal{O}(\color{blue}{M} \times N_i^3 + N^3) \]
MagmaClust: \[ \mathcal{O}(\color{blue}{M} \times N_i^3 + \color{green}{K} \times N^3) \]
Multi-Mean Gaussian Processes: \[ \mathcal{O}(\color{blue}{M} \times \color{red}{P} \times N_{ij}^3 + (\color{blue}{M} + \color{red}{P}) \times N^3) \]

Thank you for your attention!

Multi-Mean Gaussian Processes: A probabilistic framework for multi-correlated functional data

Arthur Leroy - Department of Computer Science, The University of Manchester

Functional Data Analysis Workshop - Lille - 14/03/2024

Let’s start simply with an illustrative example

Gaussian process: a prior distribution over functions

Gaussian process: what about a visual summary?

Gaussian process: what about a visual summary?

Gaussian process: what about a visual summary?

Gaussian process: what about a visual summary?

Forecasting with a unique GP

Forecasting with a unique GP

Multi-tAsk GP with common MeAn (Magma)

EM algorithm

Prediction: the key idea

Prediction: additional steps

A GIF is worth a thousand words

A GIF is worth a thousand words

Magma + Clustering = MagmaClust

Variational EM

Prediction

An image is still worth many words

Feel free to test it on your data …

… it also works with muti-dimensional inputs

Multi-mean Gaussian processes

Multi-Mean GPs: multiple hyper-posterior mean processes

Multi-Mean GPs: an adaptive prediction

Forecasting thousands of genenomic time series…

… with the adequate quantification of uncertainty

Answer to the \(\mathbb{P}(\)first question\() \approx 1\)

Clustering and prediction performances

Cluster-specific predictions

Model selection performances of VBIC

Multi-Mean Gaussian Processes: A probabilistic framework for multi-correlated functional data

Arthur Leroy - Department of Computer Science, The University of Manchester

Functional Data Analysis Workshop - Lille - 14/03/2024

Let’s start simply with an illustrative example

Gaussian process: a prior distribution over functions

Gaussian process: what about a visual summary?

Gaussian process: what about a visual summary?

Gaussian process: what about a visual summary?

Gaussian process: what about a visual summary?

Forecasting with a unique GP

Forecasting with a unique GP

Multi-tAsk GP with common MeAn (Magma)

EM algorithm

Prediction: the key idea

Prediction: additional steps

A GIF is worth a thousand words

A GIF is worth a thousand words

Magma + Clustering = MagmaClust

Variational EM

Prediction

An image is still worth many words

Feel free to test it on your data …

… it also works with muti-dimensional inputs

Multi-mean Gaussian processes

Multi-Mean GPs: multiple hyper-posterior mean processes

Multi-Mean GPs: an adaptive prediction

Forecasting thousands of genenomic time series…

… with the adequate quantification of uncertainty

Related current projects: BMI evolution patterns

Answer to the \(\mathbb{P}(\)first question\() \approx 1\)

Clustering and prediction performances

Cluster-specific predictions

Model selection performances of VBIC