
If we observed the function at \(N\) instants, we can find coefficients \(\boldsymbol{\alpha}\) through least squares:
\[LS(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\left[f(t_i)-\sum_{b = 1}^{B} \alpha_{b} \phi_{b}\left(t_{i}\right)\right]^{2}\]
Depending on the context, we may expect different properties for our function.
Do we interpolate or smooth? Are the variations periodic (Fourier basis), multi-scale (wavelets) or polynomial (B-splines)? Answers to these questions can be found in this book.
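As a minimal sketch of the least-squares criterion above, here is a NumPy fit of a hypothetical noisy signal on a Fourier basis (the signal, basis size \(B\), and noise level are illustrative choices, not from the text):

```python
import numpy as np

# Observe a noisy function at N instants on [0, 1]
rng = np.random.default_rng(0)
N, B = 100, 7                      # number of observations, basis size
t = np.linspace(0, 1, N)
f_obs = np.sin(2 * np.pi * t) + 0.5 * np.cos(4 * np.pi * t) + 0.1 * rng.standard_normal(N)

def fourier_basis(t, B):
    """Design matrix Phi[i, b] = phi_b(t_i) for a Fourier basis."""
    cols = [np.ones_like(t)]
    for j in range(1, (B - 1) // 2 + 1):
        cols.append(np.sin(2 * np.pi * j * t))
        cols.append(np.cos(2 * np.pi * j * t))
    return np.column_stack(cols[:B])

Phi = fourier_basis(t, B)
# Least-squares coefficients: argmin_alpha || f_obs - Phi @ alpha ||^2
alpha, *_ = np.linalg.lstsq(Phi, f_obs, rcond=None)
f_hat = Phi @ alpha                # smoothed reconstruction of the curve
```

Since the true signal lies in the span of the basis here, the residuals are essentially the observation noise.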
As in multivariate statistics, functional PCA (FPCA) is a central method when studying functions. From the Karhunen-Loève theorem, we can express any centred stochastic process as an infinite linear combination of orthonormal eigenfunctions:
\[X(t)-\mathbb{E}[X(t)]=\sum_{q=1}^{\infty} \xi_{q} \varphi_{q}(t)\]
Truncating this decomposition represents trajectories through the directions explaining the most variance among our functions. The scores \(\xi_q\) are uncorrelated, and the eigenfunctions provide the most parsimonious decomposition in terms of basis functions.
One nice and surprising property of FDA methods is that they generally do not suffer from their infinite-dimensional nature. Intuitively, what is the true dimension of this object?
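The Karhunen-Loève expansion can be estimated empirically by eigendecomposing the sample covariance of curves observed on a common grid. A minimal sketch with simulated curves (the two-component signal is an illustrative choice, echoing the "true dimension" question above):

```python
import numpy as np

# FPCA sketch: estimate eigenfunctions from the empirical covariance
# of curves observed on a common grid (simulated, illustrative data).
rng = np.random.default_rng(4)
n_curves, N = 200, 50
t = np.linspace(0, 1, N)
# Curves driven by two random scores -> an effectively 2-dimensional object
scores = rng.standard_normal((n_curves, 2))
X = scores[:, [0]] * np.sin(2 * np.pi * t) + scores[:, [1]] * np.cos(2 * np.pi * t)

Xc = X - X.mean(axis=0)                  # centre the process
C = Xc.T @ Xc / n_curves                 # empirical covariance C(t_i, t_j)
eigvals, eigfuns = np.linalg.eigh(C)
eigvals, eigfuns = eigvals[::-1], eigfuns[:, ::-1]   # decreasing order

explained = eigvals / eigvals.sum()      # variance explained per eigenfunction
```

Here the first two eigenfunctions capture essentially all the variance, illustrating how a seemingly infinite-dimensional object can have a small true dimension.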
Clustering the performance curves of 100m freestyle swimmers into 5 groups using derivatives.
\[y = \color{orange}{f}(x) + \epsilon\]
where:
All supervised learning problems require retrieving the most reliable function \(\color{orange}{f}\), using observed data \(\{(x_1, y_1), \dots, (x_n, y_n) \}\), to perform predictions when observing new input data \(x_{n+1}\).
In the simplest (though very common in practice) case of linear regression, we assume that:
\[\color{orange}{f}(x) = a x + b\]
Finding the best function \(\color{orange}{f}\) reduces to computing the optimal parameters \(a\) and \(b\) from our dataset.
This well-known probability formula has massive implications for learning strategies:
\[\mathbb{P}(\color{red}{T} \mid \color{blue}{D}) = \dfrac{\mathbb{P}(\color{blue}{D} \mid \color{red}{T}) \times \mathbb{P}(\color{red}{T})}{\mathbb{P}(\color{blue}{D})}\]
with:
Bayes’ theorem indicates how to update our beliefs about \(\color{red}{T}\) when accounting for new data \(\color{blue}{D}\):
We generally use probability distributions to express our initial uncertainty about a quantity of interest (the bias of a coin, the average height of a human, …) as well as its probable posterior values.
Assume there exists some disease such that:
From Bayes’ theorem, the probability of having contracted the disease when the detection test is positive is:
\[\mathbb{P}(\color{red}{T} \mid \color{blue}{D}) = \dfrac{\mathbb{P}(\color{blue}{D} \mid \color{red}{T}) \times \mathbb{P}(\color{red}{T})}{\mathbb{P}(\color{blue}{D})} = \dfrac{0.99 \times 0.001}{0.99 \times 0.001 + (1-0.99) \times 0.999} \simeq 0.09\]
Hence, there is only a 9% chance of actually being sick despite a positive result on the detection test.
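The computation above is short enough to check directly, using the numbers from the example (prevalence 0.001, sensitivity and specificity 0.99):

```python
# Bayes' theorem on the detection-test example:
# P(T) = 0.001 (prevalence), P(D | T) = 0.99 (sensitivity),
# P(not D | not T) = 0.99 (specificity), hence P(D | not T) = 0.01.
p_T = 0.001
p_D_given_T = 0.99
p_D_given_notT = 1 - 0.99

# Law of total probability for the denominator P(D)
p_D = p_D_given_T * p_T + p_D_given_notT * (1 - p_T)

# Posterior probability of being sick given a positive test
p_T_given_D = p_D_given_T * p_T / p_D
print(round(p_T_given_D, 3))  # → 0.09
```

The counter-intuitive result comes from the denominator: false positives among the healthy majority dominate true positives among the rare sick.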
\[y = \color{orange}{f}(x) + \epsilon\]
No restrictions on \(\color{orange}{f}\) but a prior distribution on a functional space: \(\color{orange}{f} \sim \mathcal{GP}(m(\cdot),C(\cdot,\cdot))\)
We can think of a Gaussian process as the extension to infinity of multivariate Gaussians:
\(x \sim \mathcal{N}(m , \sigma^2)\) in \(\mathbb{R}\), \(\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N} \left( \begin{pmatrix} m_1 \\ m_2 \end{pmatrix}, \begin{bmatrix} C_{1,1} & C_{1,2} \\ C_{2,1} & C_{2,2} \end{bmatrix} \right)\) in \(\mathbb{R}^2\)
While \(m(\cdot)\) is often assumed to be \(0\), the covariance structure is critical and defined through tailored kernels. For instance, the Squared Exponential (or RBF) kernel is expressed as: \[C_{SE}(x, x^{\prime}) = s^2 \exp \Bigg(-\dfrac{(x - x^{\prime})^2}{2 \ell^2}\Bigg)\]
To model phenomena exhibiting repeating patterns, one can leverage the Periodic kernel: \[C_{perio}(x, x^{\prime}) = s^2 \exp \Bigg(- \dfrac{ 2 \sin^2 \Big(\pi \frac{\mid x - x^{\prime}\mid}{p} \Big)}{\ell^2}\Bigg)\]
We can even consider linear regression as a particular GP problem, by using the Linear kernel: \[C_{lin}(x, x^{\prime}) = s_a^2 + s_b^2 (x - c)(x^{\prime} - c )\]
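The three kernels above translate directly into code. A minimal NumPy sketch (the hyper-parameter values are illustrative defaults, not prescribed by the text):

```python
import numpy as np

# Squared Exponential (RBF) kernel, with scale s and lengthscale ell
def k_se(x, xp, s=1.0, ell=0.5):
    return s**2 * np.exp(-(x - xp)**2 / (2 * ell**2))

# Periodic kernel, with period p
def k_periodic(x, xp, s=1.0, ell=1.0, p=1.0):
    return s**2 * np.exp(-2 * np.sin(np.pi * np.abs(x - xp) / p)**2 / ell**2)

# Linear kernel, recovering linear regression as a GP
def k_linear(x, xp, s_a=1.0, s_b=1.0, c=0.0):
    return s_a**2 + s_b**2 * (x - c) * (xp - c)

# Broadcasting builds a full covariance matrix over a grid of inputs
X = np.linspace(0, 1, 5)
K = k_se(X[:, None], X[None, :])   # 5 x 5 covariance matrix
```

Note that the periodic kernel returns to its maximum whenever \(|x - x'|\) is a multiple of \(p\), which is exactly the repeating-pattern behaviour it is designed to encode.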
We can learn optimal values of hyper-parameters from data through maximum likelihood.
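As a sketch of maximum-likelihood training, we can evaluate the log marginal likelihood of a zero-mean GP (SE kernel plus observation noise) over a grid of lengthscales and keep the best one; the data, grid, and fixed scale/noise values are illustrative assumptions:

```python
import numpy as np

# Log marginal likelihood of a zero-mean GP with SE kernel + noise
def log_marginal_likelihood(x, y, ell, s=1.0, noise=0.1):
    K = s**2 * np.exp(-(x[:, None] - x[None, :])**2 / (2 * ell**2))
    K += noise**2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(40)

# Simple grid search over the lengthscale hyper-parameter
grid = np.linspace(0.05, 1.0, 20)
lmls = [log_marginal_likelihood(x, y, ell) for ell in grid]
ell_hat = grid[int(np.argmax(lmls))]
```

In practice one would optimise all hyper-parameters jointly with a gradient-based method, but the grid search shows the principle.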
The Gaussian property implies that unobserved points have no influence on inference:
\[ \int \underbrace{p(f_{\color{grey}{obs}}, f_{\color{purple}{mis}})}_{\mathcal{GP}(m, C)} \ \mathrm{d}f_{\color{purple}{mis}} = \underbrace{p(f_{\color{grey}{obs}})}_{\mathcal{N}(m_{\color{grey}{obs}}, C_{\color{grey}{obs}})} \]
This crucial trick allows us to learn function properties from finite sets of observations. More generally, Gaussian processes are closed under conditioning and marginalisation.
\[ \begin{bmatrix} f_{\color{grey}{o}} \\ f_{\color{purple}{m}} \\ \end{bmatrix} \sim \mathcal{N} \left( \begin{bmatrix} m_{\color{grey}{o}} \\ m_{\color{purple}{m}} \\ \end{bmatrix}, \begin{pmatrix} C_{\color{grey}{o, o}} & C_{\color{grey}{o}, \color{purple}{m}} \\ C_{\color{purple}{m}, \color{grey}{o}} & C_{\color{purple}{m, m}} \end{pmatrix} \right) \]
While marginalisation serves for training, conditioning leads to the key GP prediction formula:
\[f_{\color{purple}{m}} \mid f_{\color{grey}{o}} \sim \mathcal{N} \Big( m_{\color{purple}{m}} + C_{\color{purple}{m}, \color{grey}{o}} C_{\color{grey}{o, o}}^{-1} (f_{\color{grey}{o}} - m_{\color{grey}{o}}), \ \ C_{\color{purple}{m, m}} - C_{\color{purple}{m}, \color{grey}{o}} C_{\color{grey}{o, o}}^{-1} C_{\color{grey}{o}, \color{purple}{m}} \Big)\]
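The conditioning formula above can be implemented in a few lines. A minimal sketch with a zero mean, an SE kernel, and hypothetical noise-free observations of a sine curve (all illustrative choices):

```python
import numpy as np

# GP prediction by conditioning: observe f at x_obs, predict at x_mis.
def se_kernel(a, b, s=1.0, ell=0.3):
    return s**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

x_obs = np.array([0.1, 0.4, 0.6, 0.9])
f_obs = np.sin(2 * np.pi * x_obs)
x_mis = np.linspace(0, 1, 50)

K_oo = se_kernel(x_obs, x_obs) + 1e-8 * np.eye(len(x_obs))  # jitter
K_mo = se_kernel(x_mis, x_obs)
K_mm = se_kernel(x_mis, x_mis)

# Posterior mean and covariance (the conditioning formula, with m = 0)
alpha = np.linalg.solve(K_oo, f_obs)
post_mean = K_mo @ alpha
post_cov = K_mm - K_mo @ np.linalg.solve(K_oo, K_mo.T)
```

The posterior variance never exceeds the prior variance \(s^2\), and shrinks towards zero near the observed points.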
Several classical challenges:
In the previous examples, we saw a variety of situations that naturally lead to the same mathematical formulation of the learning problem:
\[y_s = \color{orange}{f_s}(x_s) + \epsilon_s, \hspace{3cm} \forall s = 1, \dots, S\]
Learn \(\color{orange}{f_s}\), the underlying relationship between \(x_s\) and \(y_s\), for each source of data \(\color{orange}{s}\).
Multi-task regression, in contrast with single-task regression, implies that some information can be shared across data sources to improve learning and predictions.
The nature of measurements leads to a more philosophical distinction:
\[y_t = \mu_0 + f_t + \epsilon_t, \hspace{3cm} \forall t = 1, \dots, T\]
with:

It follows that:
\[y_t \mid \mu_0 \sim \mathcal{GP}(\mu_0, \Sigma_{\theta_t} + \sigma_t^2 I), \ \perp \!\!\! \perp_t\]
\(\rightarrow\) Unified GP framework with a shared mean process \(\mu_0\), and task-specific process \(f_t\),
\(\rightarrow\) Naturally handles irregular grids of input data.
Goal: Learn the hyper-parameters (and \(\mu_0\)’s hyper-posterior).
Difficulty: The likelihood depends on \(\mu_0\), and tasks are not independent.
E-step
\[ \begin{align} p(\mu_0(\color{grey}{\mathbf{x}}) \mid \textbf{y}, \hat{\Theta}) &\propto \mathcal{N}(\mu_0(\color{grey}{\mathbf{x}}); m_0(\color{grey}{\textbf{x}}), \textbf{K}_{0}^{\color{grey}{\textbf{x}}}) \times \prod\limits_{t =1}^T \mathcal{N}(\mathbf{y}_t; \mu_0( \color{purple}{\textbf{x}_t}), \boldsymbol{\Psi}_{\hat{\theta}_t, \hat{\sigma}_t^2}^{\color{purple}{\textbf{x}_t}}) \\ &= \mathcal{N}(\mu_0(\color{grey}{\mathbf{x}}); \hat{m}_0(\color{grey}{\textbf{x}}), \hat{\textbf{K}}^{\color{grey}{\textbf{x}}}), \end{align} \]
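Under the simplifying assumption that all tasks share the same input grid and the same covariance \(\boldsymbol{\Psi}\) (a deliberate toy setting, not the general framework), the E-step hyper-posterior follows from Gaussian conjugacy, where precisions add up: \(\hat{\textbf{K}} = (\textbf{K}_0^{-1} + T\,\boldsymbol{\Psi}^{-1})^{-1}\) and \(\hat{m}_0 = \hat{\textbf{K}}(\textbf{K}_0^{-1} m_0 + \boldsymbol{\Psi}^{-1} \sum_t \mathbf{y}_t)\). A minimal NumPy sketch:

```python
import numpy as np

# E-step sketch: hyper-posterior of the mean process mu_0, assuming
# all T tasks share the same grid x and the same covariance Psi.
def se(a, b, s=1.0, ell=0.3):
    return s**2 * np.exp(-(a[:, None] - b[None, :])**2 / (2 * ell**2))

rng = np.random.default_rng(3)
T, N = 5, 30
x = np.linspace(0, 1, N)
K0 = se(x, x) + 1e-6 * np.eye(N)             # prior covariance of mu_0
m0 = np.zeros(N)                              # prior mean of mu_0
Psi = se(x, x, ell=0.2) + 0.05 * np.eye(N)    # task covariance + noise
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal((T, N))

# Gaussian conjugacy: the prior precision and the T task precisions add up
K0_inv = np.linalg.inv(K0)
Psi_inv = np.linalg.inv(Psi)
K_hat = np.linalg.inv(K0_inv + T * Psi_inv)
m_hat = K_hat @ (K0_inv @ m0 + Psi_inv @ y.sum(axis=0))
```

With irregular task-specific grids, the same logic applies after mapping each \(\mathbf{y}_t\) onto the pooled grid, which is what the general framework handles.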
M-step
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \ \ \sum\limits_{t = 1}^{T}\left\{ \log \mathcal{N} \left( \mathbf{y}_t; \hat{m}_0(\color{purple}{\mathbf{x}_t}), \boldsymbol{\Psi}_{\theta_t, \sigma^2}^{\color{purple}{\mathbf{x}_t}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{purple}{\mathbf{x}_t}} {\boldsymbol{\Psi}_{\theta_t, \sigma^2}^{\color{purple}{\mathbf{x}_t}}}^{-1} \right) \right\}. \end{align*} \]
Sharing the covariance structures or not offers a compromise between flexibility and parsimony:
Major interests:
Overall, the computational complexity is: \[ \mathcal{O}(\color{blue}{T} \times N_t^3 + N^3) \]
with \(N = \Big\vert \bigcup\limits_{t = 1}^\color{blue}{T} \mathbf{x}_t \Big\vert\), the number of distinct input points across tasks
\[p \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{x}_{*}}) \\ y_*(\color{purple}{\mathbf{x}^{p}}) \\ \end{bmatrix} \mid \textbf{y} \right) = \mathcal{N} \left( \begin{bmatrix} y_*(\color{grey}{\mathbf{x}_{*}}) \\ y_*(\color{purple}{\mathbf{x}^{p}}) \\ \end{bmatrix}; \ \begin{bmatrix} \hat{m}_0(\color{grey}{\mathbf{x}_{*}}) \\ \hat{m}_0(\color{purple}{\mathbf{x}^{p}}) \\ \end{bmatrix}, \begin{pmatrix} \Gamma_{\color{grey}{**}} & \Gamma_{\color{grey}{*}\color{purple}{p}} \\ \Gamma_{\color{purple}{p}\color{grey}{*}} & \Gamma_{\color{purple}{pp}} \end{pmatrix} \right)\]
\[p(y_*(\color{purple}{\mathbf{x}^{p}}) \mid y_*(\color{grey}{\mathbf{x}_{*}}), \textbf{y}) = \mathcal{N} \Big( y_*(\color{purple}{\mathbf{x}^{p}}); \ \hat{\mu}_{*}(\color{purple}{\mathbf{x}^{p}}) , \hat{\Gamma}_{\color{purple}{pp}} \Big)\]
with:
Multi-task GP regression provides more reliable predictions when individual processes are sparsely observed, while greatly reducing the associated uncertainty.
A unique underlying mean process might be too restrictive.
\(\rightarrow\) Mixture of multi-task GPs:
\[y_t \mid \{\color{green}{Z_{tk}} = 1 \} = \mu_{\color{green}{k}} + f_t + \epsilon_t, \hspace{3cm} \forall t = 1, \dots, T\]
with:

It follows that:
\[y_t \mid \{ \boldsymbol{\mu} , \color{green}{\boldsymbol{\pi}} \} \sim \sum\limits_{k=1}^K{ \color{green}{\pi_k} \ \mathcal{GP}\Big(\mu_{\color{green}{k}}, \Psi_t^\color{green}{k} \Big)}, \ \perp \!\!\! \perp_t\]
E step: \[ \begin{align} \hat{q}_{\boldsymbol{\mu}}(\boldsymbol{\mu}) &= \color{green}{\prod\limits_{k = 1}^K} \mathcal{N}(\mu_\color{green}{k};\hat{m}_\color{green}{k}, \hat{\textbf{C}}_\color{green}{k}) , \hspace{2cm} \hat{q}_{\boldsymbol{Z}}(\boldsymbol{Z}) = \prod\limits_{t = 1}^T \mathcal{M}(Z_t;1, \color{green}{\boldsymbol{\tau}_t}) \end{align} \] M step:
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \sum\limits_{k = 1}^{K}\sum\limits_{t = 1}^{T}\tau_{tk}\left\{ \log \mathcal{N} \left( \mathbf{y}_t; \ \hat{m}_k, \boldsymbol{\Psi}_{\color{blue}{\theta_t}, \color{blue}{\sigma_t^2}} \right) - \dfrac{1}{2} \textrm{tr}\left( \mathbf{\hat{C}}_k\boldsymbol{\Psi}_{\color{blue}{\theta_t}, \color{blue}{\sigma_t^2}}^{-1}\right) \right\} \\ & \hspace{1cm} + \sum\limits_{k = 1}^{K}\sum\limits_{t = 1}^{T}\tau_{tk}\log \color{green}{\pi_{k}} \end{align*} \]
\[ p(y_*(\mathbf{x}^{p}) \mid \color{green}{Z_{*k}} = 1, y_*(\mathbf{x}_{*}), \textbf{y}) = \mathcal{N} \Big( y_*(\mathbf{x}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\mathbf{x}^{p}) , \hat{\Gamma}_{pp}^\color{green}{k} \Big), \forall \color{green}{k}, \]
\(\hat{\mu}_{*}^\color{green}{k}(\mathbf{x}^{p}) = \hat{m}_\color{green}{k}(\mathbf{x}^{p}) + \Gamma^\color{green}{k}_{p*} {\Gamma^\color{green}{k}_{**}}^{-1} (y_*(\mathbf{x}_{*}) - \hat{m}_\color{green}{k} (\mathbf{x}_{*}))\)
\(\hat{\Gamma}_{pp}^\color{green}{k} = \Gamma_{pp}^\color{green}{k} - \Gamma_{p*}^\color{green}{k} {\Gamma^{\color{green}{k}}_{**}}^{-1} \Gamma^{\color{green}{k}}_{*p}\)
\[p(y_*(\textbf{x}^p) \mid y_*(\textbf{x}_*), \textbf{y}) = \color{green}{\sum\limits_{k = 1}^{K} \tau_{*k}} \ \mathcal{N} \big( y_*(\mathbf{x}^{p}); \ \hat{\mu}_{*}^\color{green}{k}(\textbf{x}^p) , \hat{\Gamma}_{pp}^\color{green}{k}(\textbf{x}^p) \big).\]
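Once the cluster-specific Gaussian predictions and membership probabilities \(\tau_{*k}\) are available, combining them into the mixture prediction is straightforward. A minimal sketch with toy values (the numbers for \(\tau\) and the per-cluster moments are purely illustrative):

```python
import numpy as np

# Mixture prediction sketch: combine cluster-specific Gaussian
# predictions at one new point, weighted by membership probabilities.
tau = np.array([0.7, 0.2, 0.1])      # membership probabilities, sum to 1
mu_k = np.array([1.0, 2.0, -1.0])    # cluster-specific predictive means
var_k = np.array([0.1, 0.2, 0.3])    # cluster-specific predictive variances

# Moments of the resulting Gaussian mixture
mix_mean = np.sum(tau * mu_k)
mix_var = np.sum(tau * (var_k + mu_k**2)) - mix_mean**2
```

By the law of total variance, the mixture variance exceeds the average within-cluster variance whenever the cluster means disagree, which is exactly the extra uncertainty induced by an ambiguous cluster membership.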
A unique mean process can struggle to capture relevant signals in the presence of group structures.
By identifying the underlying clustering structure, we can discard unnecessary information and provide enhanced predictions as well as lower uncertainty.
Each cluster-specific prediction is weighted by its membership probability \(\color{green}{\tau_{*k}}\).
Uncertainty quantification provides valuable information to practitioners to make decisions based on probabilistic statements, like the risk of overweight at 10 years.
Both work on separate parts of a GP, with no assumption on the other, and are theoretically compatible. They are even expected to synergise well, by tackling each other’s limitations.
\[\begin{bmatrix} y_t^{1} \\ \vdots \\ y_t^{O} \\ \end{bmatrix} = \begin{bmatrix} \mu_0 \\ \vdots \\ \mu_0 \\ \end{bmatrix} + \begin{bmatrix} f_t^{1} \\ \vdots \\ f_t^{O} \\ \end{bmatrix} + \begin{bmatrix} \epsilon_t^{1} \\ \vdots \\ \epsilon_t^{O} \\ \end{bmatrix}, \hspace{3cm} \forall t = 1, \dots, T\]
All mathematical objects from Multi-Task GPs are now extended with stacked Outputs.
\[\Big[k_{\theta_t}(\mathbf{x}, \mathbf{x}^{\prime})\Big]_{o,o^{\prime}} = \dfrac{S_{t,o} \ S_{t,o^{\prime}}}{(2\pi)^{D/2} \ |\Sigma|^{1/2}} \ \exp\Big(-\dfrac{1}{2}(\mathbf{x} - \mathbf{x}^{\prime})^T \Sigma^{-1}(\mathbf{x}-\mathbf{x}^{\prime})\Big)\]
with:
These matrices are typically diagonal, filled with Output-specific and latent lengthscales, one per Input dimension. In the end, we need to optimise \(\color{red}{O}\times \color{blue}{T} \times (3D+1)\) hyper-parameters.
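A minimal sketch of one \(O \times O\) block of this kernel, under the diagonal-\(\Sigma\) assumption described above (the values for \(S_{t,o}\) and the lengthscales are illustrative):

```python
import numpy as np

# One O x O block [k(x, x')]_{o,o'} of the multi-output kernel,
# with diagonal Sigma = diag(lengthscales**2), one per input dimension.
def mo_kernel_block(x, xp, S, lengthscales):
    """Return the O x O block for inputs x, xp in R^D.

    S            : array of O output-specific scale parameters S_{t,o}
    lengthscales : array of D lengthscales defining Sigma
    """
    D = len(lengthscales)
    diff = x - xp
    quad = np.sum(diff**2 / lengthscales**2)            # (x-x')^T Sigma^{-1} (x-x')
    norm = (2 * np.pi)**(D / 2) * np.prod(lengthscales)  # (2*pi)^{D/2} |Sigma|^{1/2}
    return np.outer(S, S) * np.exp(-0.5 * quad) / norm

S = np.array([1.0, 0.5])     # O = 2 outputs
ls = np.array([0.3, 0.4])    # D = 2 input dimensions
block = mo_kernel_block(np.zeros(2), np.zeros(2), S, ls)
```

The rank-one factor \(\texttt{np.outer(S, S)}\) is what couples the outputs: correlations between outputs \(o\) and \(o'\) scale with \(S_{t,o} S_{t,o'}\).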
\[ \begin{align} p(\mu_0(\color{grey}{\mathbf{x}}) \mid \textbf{y}, \hat{\Theta}) = \mathcal{N}(\mu_0(\color{grey}{\mathbf{x}}); \hat{m}_0(\color{grey}{\textbf{x}}), \hat{\textbf{K}}^{\color{grey}{\textbf{x}}}), \end{align} \]
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \sum\limits_{t = 1}^{T}\left\{ \log \mathcal{N} \left( \mathbf{y}_t; \hat{m}_0(\color{purple}{\mathbf{x}_t^O}), \boldsymbol{\Psi}_{\theta_t, \sigma_t^2}^{\color{purple}{\mathbf{x}_t^O}} \right) - \dfrac{1}{2} Tr \left( \hat{\mathbf{K}}^{\color{purple}{\mathbf{x}_t^O}} {\boldsymbol{\Psi}_{\theta_t, \sigma_t^2}^{\color{purple}{\mathbf{x}_t^O}}}^{-1} \right) \right\}. \end{align*} \]
Let’s remove \([-0.8, -0.2]\) from the 2nd Output, and \([0.7, 1]\) from both Outputs, for the testing Task.
Rule of thumb:
| Output nature | Framework | Complexity |
|---|---|---|
| Independent variables of interest | Single-Task | \(\mathcal{O}(\color{red}{O} \times N^3)\) |
| Correlated variables of interest | Multi-Output | \(\mathcal{O}(\color{red}{O}^3 N^3)\) |
| Occurrences of the same phenomenon | Multi-Task | \(\mathcal{O}(\color{blue}{T} \times N^3)\) |
| Occurrences of correlated variables | Multi-Task-Multi-Output | \(\mathcal{O}(\color{blue}{T}\times\color{red}{O}^3 N^3)\) |