Traditional talent identification: \(\rightarrow\) Best young athlete + coach intuition
G. Boccia et al. (2017):
\(\simeq\) 60% of 16-year-old elite athletes do not maintain their performance level
Philip E. Kearney & Philip R. Hayes (2018):
\(\simeq\) only 10% of senior top-20 athletes were already in the top 20 before age 13
Performances of French Swimming Federation (FFN) members since 2002:
Functional data \(\simeq\) coefficients \(\alpha_k\) of a B-spline basis expansion:
\[y_i(t) = \sum\limits_{k=1}^{K}{\alpha_k B_k(t)}\]
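As a toy illustration, here is a minimal sketch of this representation in R with the base splines package (all data and parameter values below are made up):

```r
# Minimal sketch: project a noisy curve onto K = 8 B-spline basis functions
# and recover the alpha_k by least squares (toy data, illustrative values)
library(splines)
t <- seq(10, 20, length.out = 50)                # e.g. ages in years
y <- 60 - 0.8 * (t - 10) + rnorm(50, sd = 0.5)   # toy performance curve
B <- bs(t, df = 8)                               # 50 x 8 basis matrix B_k(t)
alpha <- coef(lm(y ~ B - 1))                     # estimated coefficients alpha_k
y_smooth <- B %*% alpha                          # reconstructed y_i(t)
```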
Clustering: the funHDDC algorithm (Gaussian mixture + EM), Bouveyron & Jacques (2011)
Using the multivariate version: curve + derivative \(\rightarrow\) information on both the performance level and the improvement trend
Leroy et al. (2018)
Bishop (2006) | Rasmussen & Williams (2006)
GPR: a kernel method to estimate \(f\) when:
\[y = f(x) +\epsilon\]
\(\rightarrow\) No restrictions on \(f\) but a prior probability:
\[f \sim \mathcal{GP}(0,C(\cdot,\cdot))\]
An example: the squared exponential kernel as covariance function: \[\operatorname{cov}(f(x),f(x'))= C(x,x') = \alpha \exp\left(- \dfrac{1}{2\theta^2} |x - x'|^2\right)\] Kernel definition \(\Rightarrow\) preferred properties of \(f\)
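For concreteness, a quick R sketch of this kernel and one draw from the resulting prior (parameter values are arbitrary):

```r
# Squared exponential kernel: alpha = variance, theta = length-scale
se_kernel <- function(x1, x2, alpha = 1, theta = 0.5) {
  alpha * exp(-0.5 * outer(x1, x2, "-")^2 / theta^2)
}
x <- seq(0, 1, length.out = 100)
C <- se_kernel(x, x) + 1e-8 * diag(100)  # jitter for numerical stability
f <- t(chol(C)) %*% rnorm(100)           # one sample path from GP(0, C)
```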
\(\textbf{y}_{N+1} = (y_1,...,y_{N+1})\) has the following prior density: \[\textbf{y}_{N+1} \sim \mathcal{N}(0, C_{N+1}), \ C_{N+1} = \begin{pmatrix} C_N & k_{N+1} \\ k_{N+1}^T & c_{N+1} \end{pmatrix}\]
When the joint density is Gaussian, so is the conditional density:
\[y_{N+1}|\textbf{y}_{N}, \textbf{x}_{N+1} \sim \mathcal{N}(k_{N+1}^T \color{red}{C_N^{-1}}\textbf{y}_{N}, \ c_{N+1}- k_{N+1}^T \color{red}{C_N^{-1}} k_{N+1})\]
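A hedged base-R sketch of this conditioning step (toy data, illustrative values):

```r
# GP posterior at a new input x_new, via the Gaussian conditioning above
se_kernel <- function(x1, x2, alpha = 1, theta = 0.5) {
  alpha * exp(-0.5 * outer(x1, x2, "-")^2 / theta^2)
}
x <- c(0.1, 0.4, 0.7); y <- sin(2 * pi * x)  # N = 3 observed points
x_new <- 0.5                                 # x_{N+1}
C_N  <- se_kernel(x, x) + 1e-6 * diag(3)     # C_N (with jitter)
k    <- se_kernel(x, x_new)                  # k_{N+1}
c_nn <- se_kernel(x_new, x_new)              # c_{N+1}
post_mean <- t(k) %*% solve(C_N, y)          # k^T C_N^{-1} y
post_var  <- c_nn - t(k) %*% solve(C_N, k)   # c - k^T C_N^{-1} k
```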
Key points:
Estimating a GP on each individual (\(O(\color{green}{N_i^3})\)):
\(\rightarrow\) Using the shared information between individuals (GPR-ME)
Shi & Wang (2008) | Shi & Choi (2011)
\[y_i(t) = \mu_0(t) + f_i(t) + \epsilon_i\] with:
GPFDA R package
Limits:
\[y_i(t) = \mu_0(t) + f_i(t) + \epsilon_i\] with:
- \(\mu_0 \sim \mathcal{GP}(m_0, K_{\theta_0})\), the common mean process
- \(f_i \sim \mathcal{GP}(0, \Sigma_{\theta_i})\), the individual-specific process
- \(\epsilon_i \sim \mathcal{N}(0, \sigma_i^2)\), the observation noise
It follows that:
\[y_i(\cdot) \vert \mu_0 \sim \mathcal{GP}\big(\mu_0(\cdot), \Sigma_{\theta_i}(\cdot,\cdot) + \sigma_i^2\big), \quad (y_i \vert \mu_0)_i \ \perp \!\!\! \perp\]
\(\rightarrow\) Shared information through \(\mu_0\) and its uncertainty
\(\rightarrow\) Unified non-parametric probabilistic framework
\(\rightarrow\) Effective even for irregular time series
\(\textbf{y} = (y_1^1,\dots,y_i^k,\dots,y_M^{N_M})^T\), the pooled vector of observations
\(\textbf{t} = (t_1^1,\dots,t_i^k,\dots,t_M^{N_M})^T\), the pooled vector of timestamps
\(\Theta = \{ \theta_0, (\theta_i)_i, (\sigma_i^2)_i \}\), the set of hyper-parameters
\(K\): covariance matrix from the process \(\mu_0\) evaluated on \(\textbf{t}\)
\(K = \left[ K_{\theta_0}(t_i^k, t_j^l) \right]_{(i,k), (j,l)}\)
\(\Sigma_i\): covariance matrix from the process \(f_i\) evaluated on \(\textbf{t}_i\)
\(\Sigma_i = \left[ \Sigma_{\theta_i}(t_i^k, t_i^l) \right]_{(k,l)} \ \ \forall i = 1, \dots, M\)
\(\Psi_i = \Sigma_i + \sigma_i^2 I_{N_i}\)
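A small R sketch of how these matrices could be assembled on irregular grids (kernel choices and all values are illustrative):

```r
# Build K on the pooled timestamps t, and Psi_i on each individual grid
se_kernel <- function(t1, t2, alpha = 1, theta = 0.5) {
  alpha * exp(-0.5 * outer(t1, t2, "-")^2 / theta^2)
}
t_list <- list(c(0.0, 0.3, 0.9), c(0.1, 0.5), c(0.2, 0.4, 0.6, 1.0))
t_all  <- unlist(t_list)               # pooled timestamps t
K      <- se_kernel(t_all, t_all)      # K_{theta_0} evaluated on t
sigma2 <- 0.1
Psi <- lapply(t_list, function(ti)     # Psi_i = Sigma_i + sigma_i^2 I
  se_kernel(ti, ti, theta = 0.2) + sigma2 * diag(length(ti)))
```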
Reminder of its simple definition:
\[ \mathbb{P}(T \vert D) = \dfrac{\mathbb{P}(D \vert T) \mathbb{P}(T)}{\mathbb{P}(D)} \] Powerful implication when it comes to learning from data:
Bayes' law tells you how, and what, you should learn about a theory \(T\) from data \(D\):
\(\rightarrow \mathbb{P}(T \vert D)\): what you should believe a posteriori
Computational burden \(\rightarrow\) among the solutions: empirical Bayes
Step E: Computing the posterior (knowing \(\Theta\))
\[ \begin{align} p(\mu_0(\textbf{t}) \vert \textbf{t}, \textbf{y}, \Theta) &\propto p(\textbf{y} \vert \textbf{t}, \mu_0(\textbf{t}), \Theta) \ p(\mu_0(\textbf{t}) \vert \textbf{t}, \Theta) \\ &\propto \prod\limits_{i =1}^M \mathcal{N}( \mu_0(\textbf{t}_i), \Psi_i) \ \mathcal{N}(m_0(\textbf{t}), K) \\ &= [\text{Insert here some PhD student ideas}] \\ &= \mathcal{N}( \hat{m}_0(\textbf{t}), \hat{K}) \end{align} \]
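For the curious, the hidden step is a product of Gaussian densities in \(\mu_0(\textbf{t})\); completing the square gives, sketched here under the simplifying assumption that every individual is observed on the full grid \(\textbf{t}\):
\[\hat{K} = \Big( K^{-1} + \sum\limits_{i=1}^{M} \Psi_i^{-1} \Big)^{-1}, \qquad \hat{m}_0(\textbf{t}) = \hat{K} \Big( K^{-1} m_0(\textbf{t}) + \sum\limits_{i=1}^{M} \Psi_i^{-1} \textbf{y}_i \Big)\]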
Step M: Estimating \(\Theta\) (knowing \(p(\mu_0)\))
\[\hat{\Theta} = \underset{\Theta}{\arg\max} \ \mathbb{E}_{\mu_0} [ \log \ p(\textbf{y}, \mu_0(\textbf{t}) \vert \textbf{t}, \Theta ) \ \vert \Theta]\]
Initialize hyperparameters
while(convergence criterion not met){
Alternate steps E and M}
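As a runnable (if simplistic) illustration, the E step can be computed in closed form on a common grid with the hyper-parameters held fixed; everything below is toy data:

```r
# One E step: posterior mean and covariance of mu_0 given all individuals
se_kernel <- function(t1, t2, alpha = 1, theta = 0.5) {
  alpha * exp(-0.5 * outer(t1, t2, "-")^2 / theta^2)
}
set.seed(1)
t <- seq(0, 1, length.out = 20)                            # common grid
M <- 5                                                     # individuals
y <- replicate(M, sin(2 * pi * t) + rnorm(20, sd = 0.3))   # 20 x M toy data
K   <- se_kernel(t, t) + 1e-8 * diag(20)                   # prior cov of mu_0
Psi <- se_kernel(t, t, theta = 0.2) + 0.3^2 * diag(20)     # shared Psi_i here
K_inv <- solve(K); Psi_inv <- solve(Psi)
K_hat <- solve(K_inv + M * Psi_inv)                # posterior covariance
m_hat <- K_hat %*% (Psi_inv %*% rowSums(y))        # posterior mean (m_0 = 0)
```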
Illustration: \(\mathbb{E} \left[ \mu_0(\textbf{t}) \vert Data \right] \pm CI_{0.95}\), plotted at EM iterations 0, 1, 2, 4 and 6 \(\rightarrow\) break and return.
\[ \forall i, \ \ y_i(t) = \mu_0(t) + f_i(t) + \epsilon_i \]
Suppose that, after the learning step, you observe some data \(y_*(\textbf{t}_*)\) from a new individual, and want to make predictions at timestamps \(\textbf{t}^p\).
Multi-task learning consists in improving performance by sharing information across individuals.
Without external information:
\(p(\begin{bmatrix} y_*^{\textbf{t}_*} \\ y_*^{\textbf{t}^p} \\ \end{bmatrix}) = \mathcal{N}( \begin{bmatrix} \mu_0^{\textbf{t}_*} \\ \mu_0^{\textbf{t}^p} \\ \end{bmatrix}, \begin{pmatrix} \Psi_*^{\textbf{t}_*,\textbf{t}_*} & \Psi_*^{\textbf{t}_*,\textbf{t}^p} \\ \Psi_*^{\textbf{t}^p,\textbf{t}_*} & \Psi_*^{\textbf{t}^p,\textbf{t}^p} \end{pmatrix})\)
Multi-task regression: conditioning on the other individuals' observations
Uncertainty on the mean process: integrating over \(\mu_0\)
\[\begin{align} p(y_* \vert \textbf{y}) &= \int p(y_*, \mu_0 \vert \textbf{y}) \ d \mu_0\\ &\underbrace{=}_{Bayes} \int p(y_* \vert \textbf{y}, \mu_0) p(\mu_0 \vert \textbf{y}) \ d \mu_0 \\ &\underbrace{=}_{(y_i \vert \mu_0)_i \perp \!\!\! \perp} \int p(y_* \vert \mu_0) p(\mu_0 \vert \textbf{y}) \ d \mu_0 \\ &= \int \mathcal{N}(y_*; \mu_0, \Psi_*) \mathcal{N}(\mu_0; \hat{m}_0, \hat{K}) \ d \mu_0 \\ &= \mathcal{N}( \hat{m}_0, \Gamma_* = \Psi_* + \hat{K}) \end{align}\]
\(\hat{\theta}_*, \hat{\sigma}_*^2 = \underset{\theta_*, \sigma_*^2}{\arg\max} \ \mathbb{E}_{\mu_0} [ \log \ p(\textbf{y}_*(\textbf{t}_*), \mu_0(\textbf{t}_*) \vert \theta_*, \sigma_*^2 )]\)
Prior: \(p(\begin{bmatrix} y_*^{\textbf{t}_*} \\ y_*^{\textbf{t}^p} \\ \end{bmatrix} \vert \textbf{y}) = \mathcal{N}( \begin{bmatrix} \hat{m}_0^{\textbf{t}_*} \\ \hat{m}_0^{\textbf{t}^p} \\ \end{bmatrix}, \begin{pmatrix} \Gamma_{**} & \Gamma_{*p} \\ \Gamma_{p*} & \Gamma_{pp} \end{pmatrix})\)
Posterior: \(p(y_*^{\textbf{t}^p} \vert y_*^{\textbf{t}_*}, \textbf{y}) = \mathcal{N} \Big( \hat{\mu}_{*}^{\textbf{t}^p} , \hat{\Gamma}_{*}^{\textbf{t}^p} \Big)\)
with:
\[\hat{\mu}_{*}^{\textbf{t}^p} = \hat{m}_0^{\textbf{t}^p} + \Gamma_{p*} \Gamma_{**}^{-1} \big( y_*^{\textbf{t}_*} - \hat{m}_0^{\textbf{t}_*} \big), \qquad \hat{\Gamma}_{*}^{\textbf{t}^p} = \Gamma_{pp} - \Gamma_{p*} \Gamma_{**}^{-1} \Gamma_{*p}\]
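A toy numeric sketch of this conditioning step (the \(\Gamma\) blocks below are made-up values; in practice they come from \(\Gamma_* = \Psi_* + \hat{K}\)):

```r
# Gaussian conditioning with the multi-task prior blocks
G_ss <- matrix(c(1.0, 0.6, 0.6, 1.0), 2, 2)  # Gamma_** on observed t_*
G_ps <- matrix(c(0.5, 0.7), 1, 2)            # Gamma_p* (1 prediction time)
G_pp <- matrix(1.0, 1, 1)                    # Gamma_pp
m_s <- c(61, 60); m_p <- 59                  # hat m_0 at t_* and t^p
y_s <- c(62, 61.5)                           # new individual's observations
mu_pred  <- m_p + G_ps %*% solve(G_ss, y_s - m_s)  # posterior mean
var_pred <- G_pp - G_ps %*% solve(G_ss, t(G_ps))   # posterior variance
```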
A unique underlying mean process might be insufficient
\(\rightarrow\) Mixture model of multitask GP:
\[\forall i , \forall k , \ \ y_i(t) \vert (Z_{ik} = 1) = \mu_k(t) + f_i(t) + \epsilon_i\] with:
It follows that:
\[y_i(\cdot) \vert \{ (\mu_k)_k, \boldsymbol{\pi} \} \sim \sum\limits_{k=1}^K{\pi_k \ \mathcal{GP}\Big(\mu_k(\cdot), \Psi_i(\cdot, \cdot) \Big)}\]
We need to learn the following quantities:
Unfortunately, the posterior distributions of \((\mu_k)_k\) and \((Z_i)_i\) depend on each other
\(\rightarrow\) Variational inference, to maintain closed-form approximations. For any distribution \(q\):
\(\log p(\textbf{y} \vert \Theta) = \mathcal{L}(q; \Theta) + KL \big( q \vert \vert p(\boldsymbol{\mu}, \boldsymbol{Z} \vert \textbf{y}, \Theta)\big)\)
Approximation assumption: \(q(\boldsymbol{\mu}, \boldsymbol{Z}) = q_{\boldsymbol{\mu}}(\boldsymbol{\mu})q_{\boldsymbol{Z}}(\boldsymbol{Z})\)
\(\rightarrow\) \(\mathcal{L}(q; \Theta)\) provides a lower bound to maximize
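For completeness, this lower bound is the usual evidence lower bound (ELBO):
\[\mathcal{L}(q; \Theta) = \mathbb{E}_{q} \big[ \log p(\textbf{y}, \boldsymbol{\mu}, \boldsymbol{Z} \vert \Theta) \big] - \mathbb{E}_{q} \big[ \log q(\boldsymbol{\mu}, \boldsymbol{Z}) \big]\]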
Step E: Optimize \(\mathcal{L}(q; \Theta)\) w.r.t. \(q\)
\(\hat{q}_{\boldsymbol{Z}}(\boldsymbol{Z}) = \prod\limits_{i = 1}^M \mathcal{M}(Z_i;1, \boldsymbol{\tau}_i)\)
\(\hat{q}_{\boldsymbol{\mu}}(\boldsymbol{\mu}) = \prod\limits_{k = 1}^K \mathcal{N}(\mu_k;\hat{m}_k, \hat{C}_k)\)
Step M: Optimize \(\mathcal{L}(q; \Theta)\) w.r.t. \(\Theta\)
\(\rightarrow\) Iterate on these steps until convergence
\(\rightarrow\) Also brings the multi-task aspect into the covariance structure, with a compromise between the number of hyper-parameters and flexibility:
EM for estimating \(Z_*, \theta_*,\) and \(\sigma_*^2\)
Multi-task prior: \(p(\begin{bmatrix} y_*^{\textbf{t}_*} \\ y_*^{\textbf{t}_p} \\ \end{bmatrix} \vert Z_{*k}=1 , \textbf{y}) = \mathcal{N}( \begin{bmatrix} \hat{m}_k^{\textbf{t}_*} \\ \hat{m}_k^{\textbf{t}_p} \\ \end{bmatrix}, \begin{pmatrix} \Gamma_{**}^k & \Gamma_{*p}^k \\ \Gamma_{p*}^k & \Gamma_{pp}^k \end{pmatrix}), \forall k\)
Multi-task posterior: \(p(y_*^{\textbf{t}^p} \vert y_*^{\textbf{t}_*}, Z_{*k} = 1, \textbf{y}) = \mathcal{N} \big( \hat{\mu}_{*k}^{\textbf{t}^p} , \hat{\Gamma}_{*k}^{\textbf{t}^p} \big), \ \forall k\)
Predictive multi-task GPs mixture: \(p(y_*^{\textbf{t}^p} \vert y_*^{\textbf{t}_*}, \textbf{y}) = \sum\limits_{k = 1}^{K} \tau_{*k} \ \mathcal{N} \big( \hat{\mu}_{*k}^{\textbf{t}^p} , \hat{\Gamma}_{*k}^{\textbf{t}^p} \big)\)
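As a sanity check on what this mixture means in practice, here is a toy R computation of its first two moments at a single prediction timestamp (all numbers invented):

```r
# Mean and variance of a Gaussian mixture predictive distribution
tau   <- c(0.7, 0.2, 0.1)     # tau_*k: posterior cluster probabilities
mu_k  <- c(62.1, 60.4, 58.9)  # per-cluster posterior means
var_k <- c(1.2, 2.0, 3.5)     # per-cluster posterior variances
mix_mean <- sum(tau * mu_k)                           # weighted mean
mix_var  <- sum(tau * (var_k + mu_k^2)) - mix_mean^2  # law of total variance
```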
| Method | Simu MSE | Simu \(CI_{95}\) | Real data MSE | Real data \(CI_{95}\) |
|---|---|---|---|---|
| GP | 87.5 (151.9) | 74.0 (32.7) | 25.3 (97.6) | 72.7 (37.1) |
| GPFDA | 31.8 (49.4) | 90.4 (18.1) | | |
| MAGMA | 18.7 (31.4) | 93.8 (13.5) | 3.8 (10.3) | 95.3 (15.9) |
Making a prediction: \(\mathbb{P}(\text{saying something wrong}) \simeq 1\).
A probabilistic prediction tells you how much:
Combine with sparse GP approximations
Extend to multivariate functional regression
Work on an online version
Develop a more sophisticated model selection tool
Integrate into the app and launch tests with the FFN
Listen to the good new ideas you are about to give me