Arthur Leroy
supervised by
A problem:
An opportunity:
Performances in competition for FFN members:
Are there different patterns of progression among swimmers?
Chapter 2
Leroy et al. - Functional Data Analysis in Sport Science: Example of Swimmers’ Progression Curves Clustering - Applied Sciences - 2018
Can we provide probabilistic predictions of future performances?
Chapter 3
Leroy et al. - Magma: Inference and Prediction with Multi-Task Gaussian Processes - Under submission in Statistics and Computing - https://github.com/ArthurLeroy/MAGMA
Could group structures improve the quality of predictions?
Chapter 4
Leroy et al. - Cluster-Specific Predictions with Multi-Task Gaussian Processes - Under submission in JMLR - https://github.com/ArthurLeroy/MAGMAclust
Are there different patterns of progression among swimmers?
A common representation as functional data is obtained through a B-spline basis decomposition.
Clustering the curves with the FunHDDC algorithm (Bouveyron & Jacques - 2011, Schmutz et al. - 2020) highlights different patterns of progression. These groups, which differ both in level and in trend, are consistent with sport experts' knowledge.
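As an illustration of this preprocessing, here is a minimal Python sketch, independent of the original analysis: each swimmer's irregular (age, performance) series is projected onto a common cubic B-spline basis, and the coefficient vectors are then clustered. A standard Gaussian mixture stands in for FunHDDC (available in R only); the number of knots, the age range and the toy data are illustrative choices.

```python
import numpy as np
from scipy.interpolate import BSpline
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
DEGREE, AGE_RANGE, N_INTERIOR = 3, (12.0, 20.0), 4   # illustrative basis settings

def bspline_coefficients(ages, perfs):
    """Least-squares projection of one swimmer's curve onto a common B-spline basis."""
    lo, hi = AGE_RANGE
    interior = np.linspace(lo, hi, N_INTERIOR + 2)[1:-1]
    knots = np.r_[[lo] * (DEGREE + 1), interior, [hi] * (DEGREE + 1)]
    order = np.argsort(ages)
    design = BSpline.design_matrix(ages[order], knots, DEGREE).toarray()
    coefs, *_ = np.linalg.lstsq(design, perfs[order], rcond=None)
    return coefs

# Toy data: irregularly observed performance curves for 30 swimmers.
coef_matrix = []
for _ in range(30):
    ages = np.sort(rng.uniform(12, 20, size=rng.integers(8, 15)))
    trend = rng.choice([-1.8, -1.0])                  # two artificial progression patterns
    perfs = 70 + trend * (ages - 12) + rng.normal(0, 0.8, ages.size)
    coef_matrix.append(bspline_coefficients(ages, perfs))
coef_matrix = np.vstack(coef_matrix)                  # swimmers x basis coefficients

labels = GaussianMixture(n_components=2, random_state=0).fit_predict(coef_matrix)
```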
This approach suffers from severe limitations such as:
Can we provide probabilistic predictions of future performances?
No restriction on \(f\), but a prior distribution over a functional space: \(f \sim \mathcal{GP}(0,C(\cdot,\cdot))\)
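As a toy illustration of such a prior, the following sketch draws a few functions from \(\mathcal{GP}(0, C)\) with an exponentiated quadratic covariance; the grid and hyper-parameter values are arbitrary.

```python
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=2.0):
    """Exponentiated quadratic covariance C(x, x')."""
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

t = np.linspace(12, 20, 100)                     # evaluation grid (e.g. ages)
C = se_kernel(t, t) + 1e-8 * np.eye(t.size)      # jitter for numerical stability
prior_draws = np.random.default_rng(1).multivariate_normal(
    np.zeros(t.size), C, size=5)                 # five functions drawn from GP(0, C)
```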
GPs provide an ideal modelling framework, although insufficient on their own for direct predictions.
\[y_i = \mu_0 + f_i + \epsilon_i\]
with: \(\mu_0 \sim \mathcal{GP}(m_0, K_{\theta_0})\) a common mean process, \(f_i \sim \mathcal{GP}(0, \Sigma_{\theta_i})\) the individual-specific processes, and \(\epsilon_i \sim \mathcal{N}(0, \sigma_i^2 I)\) the noise terms, all independent.
It follows that:
\[y_i \mid \mu_0 \sim \mathcal{GP}(\mu_0, \Sigma_{\theta_i} + \sigma_i^2 I), \ \perp \!\!\! \perp_i\]
\(\rightarrow\) A unified GP framework with a common mean process \(\mu_0\) and individual-specific processes \(f_i\),
\(\rightarrow\) Naturally handles irregular grids of input data.
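The generative model is easy to simulate, which also illustrates the irregular grids; below is a minimal sketch with arbitrary kernels and hyper-parameters, drawing one common mean process \(\mu_0\) and individual deviations \(f_i\) on individual-specific inputs.

```python
import numpy as np

rng = np.random.default_rng(2)

def se_kernel(x1, x2, variance=1.0, lengthscale=2.0):
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

grid = np.linspace(12, 20, 200)
mu0 = rng.multivariate_normal(
    np.zeros(grid.size), se_kernel(grid, grid, 4.0, 3.0) + 1e-8 * np.eye(grid.size))

data = []
for i in range(5):                                       # M = 5 individuals
    idx = np.sort(rng.choice(grid.size, size=rng.integers(6, 12), replace=False))
    t_i = grid[idx]                                      # individual-specific inputs
    f_i = rng.multivariate_normal(
        np.zeros(t_i.size), se_kernel(t_i, t_i, 0.5, 1.0) + 1e-8 * np.eye(t_i.size))
    y_i = mu0[idx] + f_i + rng.normal(0, 0.3, t_i.size)  # y_i = mu_0 + f_i + eps_i
    data.append((t_i, y_i))
```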
Goal: learn the hyper-parameters (and \(\mu_0\)'s hyper-posterior).
Difficulty: The likelihood depends on \(\mu_0\), and individuals are not independent.
Each individual has its own vector of inputs \(\textbf{t}_i\), associated with outputs \(\textbf{y}_i\).
Handling the mean process \(\mu_0\) requires pooled vectors, hence the additional notation:
While GPs are infinite-dimensional objects, tractable inference on a finite set of observations fully determines their overall properties.
However, handling distributions with differing dimensions constitutes a major technical challenge.
Assuming \(\hat{\Theta}\) is known, the hyper-posterior distribution of \(\mu_0\) is given by: \[ \begin{align} p(\mu_0(\color{red}{{\mathbf{t}}}) \mid \textbf{y}, \hat{\Theta}) &\propto \mathcal{N}(\mu_0(\color{red}{{\mathbf{t}}}); m_0(\color{red}{\textbf{t}}), \textbf{K}_{\hat{\theta}_0}^{\color{red}{\textbf{t}}}) \times \prod\limits_{i =1}^M \mathcal{N}({\mathbf{y}_i}; \mu_0( \color{blue}{\textbf{t}_i}), \boldsymbol{\Psi}_{\hat{\theta}_i, \hat{\sigma}_i^2}^{\color{blue}{\textbf{t}_i}}) \\ &= \mathcal{N}(\mu_0(\color{red}{{\mathbf{t}}}); \hat{m}_0(\color{red}{\textbf{t}}), \hat{\textbf{K}}^{\color{red}{\textbf{t}}}), \end{align} \]
with: \(\hat{\textbf{K}}^{\color{red}{\textbf{t}}} = {\left( {\textbf{K}_{\hat{\theta}_0}^{\color{red}{\textbf{t}}}}^{-1} + \sum\limits_{i=1}^{M} \tilde{\boldsymbol{\Psi}}_i^{-1} \right)}^{-1}\) and \(\hat{m}_0(\color{red}{\textbf{t}}) = \hat{\textbf{K}}^{\color{red}{\textbf{t}}} {\left( {\textbf{K}_{\hat{\theta}_0}^{\color{red}{\textbf{t}}}}^{-1} m_0(\color{red}{\textbf{t}}) + \sum\limits_{i=1}^{M} \tilde{\boldsymbol{\Psi}}_i^{-1} \tilde{\textbf{y}}_i \right)}\), where \(\tilde{\boldsymbol{\Psi}}_i^{-1}\) and \(\tilde{\textbf{y}}_i\) denote \({\boldsymbol{\Psi}_{\hat{\theta}_i, \hat{\sigma}_i^2}^{\color{blue}{\textbf{t}_i}}}^{-1}\) and \(\textbf{y}_i\) padded with zeros at the positions of \(\color{red}{\textbf{t}}\) not in \(\color{blue}{\textbf{t}_i}\).
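This hyper-posterior reduces to a few lines of linear algebra; the sketch below, independent of the linked implementation, assumes each \(\textbf{t}_i\) is a subset of the sorted pooled grid \(\textbf{t}\) and pads each \({\boldsymbol{\Psi}_i}^{-1}\) with zeros accordingly.

```python
import numpy as np

def magma_hyper_posterior(t_pooled, prior_mean, K0, inputs, outputs, psis):
    """Mean and covariance of p(mu_0(t) | y, Theta_hat) on the pooled grid t.

    inputs[i], outputs[i] : t_i (subset of t_pooled) and y_i
    psis[i]               : Psi_i = Sigma_theta_i(t_i, t_i) + sigma_i^2 I
    """
    precision = np.linalg.inv(K0)                 # K_theta0^{-1}
    weighted = precision @ prior_mean             # K_theta0^{-1} m_0(t)
    for t_i, y_i, psi_i in zip(inputs, outputs, psis):
        pos = np.searchsorted(t_pooled, t_i)      # positions of t_i within t
        psi_inv = np.linalg.inv(psi_i)
        precision[np.ix_(pos, pos)] += psi_inv    # zero-padded Psi_i^{-1}
        weighted[pos] += psi_inv @ y_i            # zero-padded Psi_i^{-1} y_i
    K_hat = np.linalg.inv(precision)
    m_hat = K_hat @ weighted
    return m_hat, K_hat
```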
Assuming the hyper-posterior \(p(\mu_0(\color{red}{\textbf{t}}) \mid \textbf{y}, \hat{\Theta})\) is known, the hyper-parameters are estimated by maximising the expected complete-data log-likelihood:
\[ \begin{align*} \hat{\Theta} &= \underset{\Theta}{\arg\max} \ \mathbb{E}_{\mu_0} \left[ \log p(\textbf{y}, \mu_0(\color{red}{\textbf{t}}) \mid \Theta ) \right], \\ \mathbb{E}_{\mu_0} \left[ \log p(\textbf{y}, \mu_0(\color{red}{\textbf{t}}) \mid \Theta ) \right] &= \log \mathcal{N} {\left( {\hat{m}_0}(\color{red}{\textbf{t}}); m_0(\color{red}{\textbf{t}}), \mathbf{K}_{\theta_0}^{\color{red}{\textbf{t}}} \right)} - \dfrac{1}{2} Tr {\left( {\hat{\mathbf{K}}}^{\color{red}{\textbf{t}}} {\mathbf{K}_{\theta_0}^{\color{red}{\textbf{t}}}}^{-1} \right)} \\ & \ \ \ + {\sum\limits_{i = 1}^{M}}{\left\{ \log \mathcal{N} {\left( {\mathbf{y}_i}; {\hat{m}_0}(\color{blue}{{\mathbf{t}_i}}), \boldsymbol{\Psi}_{\theta_i, \sigma_i^2}^{\color{blue}{{\mathbf{t}_i}}} \right)} - \dfrac{1}{2} Tr {\left( {\hat{\mathbf{K}}}^{\color{blue}{{\mathbf{t}_i}}} {\boldsymbol{\Psi}_{\theta_i, \sigma_i^2}^{\color{blue}{{\mathbf{t}_i}}}}^{-1} \right)} \right\}}. \end{align*} \]
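These two updates are alternated, EM-style, until convergence. Below is a hedged sketch of the \(\mu_0\)-related part of this maximisation, using an illustrative exponentiated quadratic kernel for \(\mathbf{K}_{\theta_0}\) and generic numerical optimisation rather than the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x1, x2, variance, lengthscale):
    diff = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def neg_criterion_theta0(log_params, t, m0, m_hat, K_hat):
    """Negative of log N(m_hat; m_0, K_theta0) - 0.5 tr(K_hat K_theta0^{-1}),
    i.e. the mu_0-related terms of the M-step criterion (constants dropped)."""
    variance, lengthscale = np.exp(log_params)        # enforce positivity
    K = se_kernel(t, t, variance, lengthscale) + 1e-6 * np.eye(t.size)
    K_inv = np.linalg.inv(K)
    diff = m_hat - m0
    _, logdet = np.linalg.slogdet(K)
    log_gauss = -0.5 * (logdet + diff @ K_inv @ diff)
    trace_term = -0.5 * np.trace(K_hat @ K_inv)
    return -(log_gauss + trace_term)

# With m_hat, K_hat from the E step (previous sketch), theta_0 could be updated by e.g.:
# result = minimize(neg_criterion_theta0, x0=np.log([1.0, 2.0]),
#                   args=(t, m0, m_hat, K_hat), method="L-BFGS-B")
# variance_hat, lengthscale_hat = np.exp(result.x)
```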
For a new individual, we observe some data \(y_*(\textbf{t}_*)\). Let us recall:
\[y_* \mid \mu_0 \sim \mathcal{GP}(\mu_0, \boldsymbol{\Psi}_{\theta_*, \sigma_*^2}), \ \perp \!\!\! \perp_i\]
Goals:
Difficulties:
A multi-task prior distribution is obtained by integrating over the hyper-posterior of \(\mu_0\):
\[\begin{align} p(y_* (\textbf{t}_*^{p}) \mid \textbf{y}) &= \int p\left(y_* (\textbf{t}_*^{p}) \mid \textbf{y}, \mu_0(\textbf{t}_*^{p})\right) p(\mu_0 (\textbf{t}_*^{p}) \mid \textbf{y}) \ d \mu_0({\mathbf{t}^{p}_{*}}) \\ &= \int \underbrace{ p \left(y_* (\textbf{t}_*^{p}) \mid \mu_0 (\textbf{t}_*^{p}) \right)}_{\mathcal{N}(y_*; \mu_0, \Psi_*)} \ \underbrace{p(\mu_0 (\textbf{t}_*^{p}) \mid \textbf{y})}_{\mathcal{N}(\mu_0; \hat{m}_0, \hat{K})} \ d \mu_0({\mathbf{t}^{p}_{*}}) \\ &= \mathcal{N}( \hat{m}_0 ({\mathbf{t}^{p}_{*}}), \Gamma) \end{align}\]
with:
\[\Gamma = \boldsymbol{\Psi}_{\theta_*, \sigma_*^2}^{{\mathbf{t}^{p}_{*}}} + \hat{K}^{{\mathbf{t}^{p}_{*}}}\]
\[p {\left( \begin{bmatrix} y_*(\color{red}{{\mathbf{t}_{*}}}) \\ y_*(\color{blue}{{\mathbf{t}^{p}}}) \\ \end{bmatrix} \mid \textbf{y} \right)} = \mathcal{N} {\left( \begin{bmatrix} y_*(\color{red}{{\mathbf{t}_{*}}}) \\ y_*(\color{blue}{{\mathbf{t}^{p}}}) \\ \end{bmatrix}; \ \begin{bmatrix} \hat{m}_0(\color{red}{{\mathbf{t}_{*}}}) \\ \hat{m}_0(\color{blue}{{\mathbf{t}^{p}}}) \\ \end{bmatrix}, \begin{pmatrix} \Gamma_{\color{red}{**}} & \Gamma_{\color{red}{*}\color{blue}{p}} \\ \Gamma_{\color{blue}{p}\color{red}{*}} & \Gamma_{\color{blue}{pp}} \end{pmatrix} \right)}\]
\[p(y_*(\color{blue}{{\mathbf{t}^{p}}}) \mid y_*(\color{red}{{\mathbf{t}_{*}}}), \textbf{y}) = \mathcal{N} \Big( y_*(\color{blue}{{\mathbf{t}^{p}}}); \ \hat{\mu}_{*}(\color{blue}{{\mathbf{t}^{p}}}) , \hat{\Gamma}_{\color{blue}{pp}} \Big)\]
with: \(\hat{\mu}_{*}(\color{blue}{{\mathbf{t}^{p}}}) = \hat{m}_0(\color{blue}{{\mathbf{t}^{p}}}) + \Gamma_{\color{blue}{p}\color{red}{*}} \Gamma_{\color{red}{**}}^{-1} \big( y_*(\color{red}{{\mathbf{t}_{*}}}) - \hat{m}_0(\color{red}{{\mathbf{t}_{*}}}) \big)\) and \(\hat{\Gamma}_{\color{blue}{pp}} = \Gamma_{\color{blue}{pp}} - \Gamma_{\color{blue}{p}\color{red}{*}} \Gamma_{\color{red}{**}}^{-1} \Gamma_{\color{red}{*}\color{blue}{p}}\).
This multi-task posterior distribution provides the desired probabilistic predictions.
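The conditioning step is standard Gaussian algebra; the generic sketch below computes \(\hat{\mu}_{*}(\textbf{t}^{p})\) and \(\hat{\Gamma}_{pp}\) from the blocks of \(\Gamma = \boldsymbol{\Psi}_{\theta_*, \sigma_*^2} + \hat{K}\) evaluated on \((\textbf{t}_{*}, \textbf{t}^{p})\), with illustrative argument names.

```python
import numpy as np

def multitask_gp_prediction(m0_obs, m0_pred, gamma_oo, gamma_op, gamma_pp, y_obs):
    """Condition the joint Gaussian of (y_*(t_*), y_*(t^p)) on the observed block.

    m0_obs, m0_pred : hat{m}_0 evaluated on t_* and t^p
    gamma_xx        : corresponding blocks of Gamma = Psi_* + hat{K}
    """
    weights = np.linalg.solve(gamma_oo, gamma_op)    # Gamma_**^{-1} Gamma_*p
    mu_hat = m0_pred + weights.T @ (y_obs - m0_obs)  # hat{mu}_*(t^p)
    cov_hat = gamma_pp - gamma_op.T @ weights        # hat{Gamma}_pp
    return mu_hat, cov_hat
```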
Could group structures improve the quality of predictions?
A unique underlying mean process might be too restrictive.
\(\rightarrow\) Mixture of multi-task GPs:
\[y_i \mid \{\color{red}{Z_{ik}} = 1 \} = \mu_{\color{red}{k}} + f_i + \epsilon_i\]
with: \(Z_i \sim \mathcal{M}(1, \color{red}{\boldsymbol{\pi}})\) the latent cluster labels, \(\mu_{\color{red}{k}} \sim \mathcal{GP}(m_k, C_{\gamma_k})\) the cluster-specific mean processes, \(f_i \sim \mathcal{GP}(0, \Sigma_{\theta_i})\) and \(\epsilon_i \sim \mathcal{N}(0, \sigma_i^2 I)\) as before, writing \(\Psi_i = \Sigma_{\theta_i} + \sigma_i^2 I\).
It follows that:
\[y_i \mid \{ { \boldsymbol{\mu} }, \color{red}{\boldsymbol{\pi}} \} \sim \sum\limits_{k=1}^K{ \color{red}{\pi_k} \ \mathcal{GP}\Big(\mu_{\color{red}{k}}, \Psi_i \Big)}, \ \perp \!\!\! \perp_i\]
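As before, the mixture model can be simulated directly; the sketch below draws cluster labels, cluster-specific mean processes and individual curves, with arbitrary sizes and hyper-parameters.

```python
import numpy as np

rng = np.random.default_rng(3)

def se_kernel(x, variance, lengthscale):
    diff = x[:, None] - x[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2) + 1e-8 * np.eye(x.size)

K, M = 3, 10                                          # illustrative sizes
t = np.linspace(12, 20, 50)
pi = np.array([0.5, 0.3, 0.2])                        # mixing proportions
mu = rng.multivariate_normal(np.zeros(t.size), se_kernel(t, 4.0, 3.0), size=K)

curves, labels = [], []
for i in range(M):
    z_i = rng.choice(K, p=pi)                         # Z_i ~ M(1, pi)
    f_i = rng.multivariate_normal(np.zeros(t.size), se_kernel(t, 0.5, 1.0))
    y_i = mu[z_i] + f_i + rng.normal(0, 0.3, t.size)  # y_i | Z_ik = 1
    curves.append(y_i)
    labels.append(z_i)
```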
The integrated likelihood is no longer tractable, due to posterior dependencies between \({ \boldsymbol{\mu} }= \{\mu_k\}_k\) and \({\mathbf{Z}}= \{Z_i\}_i\).
Variational inference still allows us to maintain closed-form approximations. For any distribution \(q\): \[\log p(\textbf{y} \mid \Theta) = \mathcal{L}(q; \Theta) + KL \big( q \mid \mid p(\boldsymbol{\mu}, \boldsymbol{Z} \mid \textbf{y}, \Theta)\big)\]
Posterior independence is enforced by a factorised approximation: \(q(\boldsymbol{\mu}, \boldsymbol{Z}) = q_{\boldsymbol{\mu}}(\boldsymbol{\mu})q_{\boldsymbol{Z}}(\boldsymbol{Z}).\)
Maximising the lower bound \(\mathcal{L}(q; \Theta)\) induces natural factorisations over clusters and individuals for the variational distributions.
The optimal variational distributions are analytical and factorise such as:
\[ \begin{align} \hat{q}_{\boldsymbol{\mu}}(\boldsymbol{\mu}) &= \prod\limits_{k = 1}^K \mathcal{N}(\mu_k;\hat{m}_k, \hat{\textbf{C}}_k) \\ \hat{q}_{\boldsymbol{Z}}(\boldsymbol{Z}) &= \prod\limits_{i = 1}^M \mathcal{M}(Z_i;1, \color{red}{\boldsymbol{\tau}_i}) \end{align} \]
with:
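In particular, the responsibilities follow from the standard mean-field update \(\tau_{ik} \propto \pi_k \exp\big(\mathbb{E}_{\mu_k}[\log \mathcal{N}(\textbf{y}_i; \mu_k(\textbf{t}_i), \boldsymbol{\Psi}_i)]\big)\); here is a small numerically stable sketch of this computation (not taken from the MagmaClust code).

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(y_i, psi_i, pi, m_hat_k, C_hat_k):
    """Variational membership probabilities tau_i of individual i.

    m_hat_k[k], C_hat_k[k] : hat{m}_k and hat{C}_k restricted to the inputs t_i
    psi_i                  : Psi_i = Sigma_theta_i + sigma_i^2 I on t_i
    """
    psi_inv = np.linalg.inv(psi_i)
    log_tau = np.array([
        np.log(pi[k])
        + multivariate_normal.logpdf(y_i, mean=m_hat_k[k], cov=psi_i)
        - 0.5 * np.trace(C_hat_k[k] @ psi_inv)
        for k in range(len(pi))
    ])
    log_tau -= log_tau.max()                 # guard against underflow
    tau = np.exp(log_tau)
    return tau / tau.sum()
```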
Optimising \(\mathcal{L}(\hat{q}; \Theta)\) w.r.t. \(\Theta\) induces independent maximisation problems:
\[ \begin{align*} \hat{\Theta} &= \text{arg}\max\limits_{\Theta} \mathbb{E}_{\boldsymbol{\mu},\boldsymbol{Z}} \left[ \log p(\textbf{y},\boldsymbol{\mu}, \boldsymbol{Z} \mid \Theta)\right], \\ \mathbb{E}_{\boldsymbol{\mu},\boldsymbol{Z}} \left[ \log p(\textbf{y},\boldsymbol{\mu}, \boldsymbol{Z} \mid \Theta)\right] &= {\sum\limits_{k = 1}^{K}} \left\{ \log \mathcal{N} {\left( {\hat{m}_k}; \ m_k, \boldsymbol{C}_{\color{red}{\gamma_k}} \right)} - \dfrac{1}{2} {\textrm{tr}\left( {\mathbf{\hat{C}}_k}\boldsymbol{C}_{\color{red}{\gamma_k}}^{-1}\right)} \right\} \\ & \hspace{1cm} + {\sum\limits_{k = 1}^{K}}{\sum\limits_{i = 1}^{M}}{\tau_{ik}} \left\{ \log \mathcal{N} {\left( {\mathbf{y}_i}; \ {\hat{m}_k}, \boldsymbol{\Psi}_{\color{blue}{\theta_i}, \color{blue}{\sigma_i^2}} \right)} - \dfrac{1}{2} {\textrm{tr}\left( {\mathbf{\hat{C}}_k}\boldsymbol{\Psi}_{\color{blue}{\theta_i}, \color{blue}{\sigma_i^2}}^{-1}\right)} \right\} \\ & \hspace{1cm} + {\sum\limits_{k = 1}^{K}}{\sum\limits_{i = 1}^{M}}{\tau_{ik}}\log \color{green}{{\pi_{k}}}. \end{align*} \]
Sharing the covariance structures offers a compromise between flexibility and parsimony:
\[ p(y_*(\textbf{t}^p) \mid y_*(\textbf{t}_*), \color{red}{Z_{*k}} = 1, \textbf{y}) = \mathcal{N} \big( \hat{\mu}_{*}^k (\textbf{t}^p) , \hat{\Gamma}_{pp}^{k} \big), \ \color{red}{\forall k}, \]
\[p(y_*(\textbf{t}^p) \mid y_*(\textbf{t}_*), \textbf{y}) = \color{red}{\sum\limits_{k = 1}^{K} \tau_{*k}} \ \mathcal{N} \big( \hat{\mu}_{*}^k(\textbf{t}^p) , \hat{\Gamma}_{pp}^k(\textbf{t}^p) \big).\]
Leroy et al. - 2020 - preprint
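For instance, the pointwise mean and variance of this Gaussian mixture predictive distribution combine the cluster-specific predictions as follows (a generic sketch with illustrative argument names).

```python
import numpy as np

def mixture_moments(tau_star, cluster_means, cluster_covs):
    """Pointwise mean and variance of sum_k tau_*k N(hat{mu}_*^k, hat{Gamma}_pp^k).

    tau_star      : (K,) membership probabilities of the new individual
    cluster_means : list of K arrays hat{mu}_*^k(t^p)
    cluster_covs  : list of K matrices hat{Gamma}_pp^k
    """
    means = np.stack(cluster_means)                           # (K, P)
    variances = np.stack([np.diag(c) for c in cluster_covs])  # (K, P)
    mix_mean = tau_star @ means
    mix_var = tau_star @ (variances + means ** 2) - mix_mean ** 2
    return mix_mean, mix_var
```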
After convergence of the VEM algorithm, a variational-BIC expression can be derived as:
Two important prediction steps of Magma have been omitted for clarity:
The computational complexity for learning is given by: