Add css colors, improve crps slides

2025-05-25 14:35:46 +02:00
parent 42663f73b9
commit de5573e171
3 changed files with 400 additions and 104 deletions

296
index.qmd

@@ -86,10 +86,13 @@ my_bib <- ReadBib("assets/library.bib", check = FALSE)
col_lightgray <- "#e7e7e7"
col_blue <- "#000088"
col_smooth_expost <- "#a7008b"
col_smooth <- "#187a00"
col_pointwise <- "#008790"
col_constant <- "#dd9002"
col_optimum <- "#666666"
col_smooth <- "#187a00"
col_pointwise <- "#008790"
col_green <- "#61B94C"
col_orange <- "#ffa600"
col_yellow <- "#FCE135"
```
# CRPS Learning
@@ -308,9 +311,9 @@ Weights are updated sequentially according to the past performance of the $K$ ex
That is, a loss function $\ell$ is needed. This is used to compute the **cumulative regret** $R_{t,k}$
$$
R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)
$${#eq-regret}
\begin{equation}
R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)\label{eq:regret}
\end{equation}
The cumulative regret:
@@ -325,13 +328,15 @@ Popular loss functions for point forecasting @gneiting2011making:
$\ell_2$ loss:
$$\ell_2(x, y) = | x -y|^2$${#eq-elltwo}
\begin{equation}
\ell_2(x, y) = | x -y|^2 \label{eq:elltwo}
\end{equation}
Strictly proper for *mean* prediction
:::
::: {.column width="2%"}
::: {.column width="4%"}
:::
@@ -339,7 +344,9 @@ Strictly proper for *mean* prediction
$\ell_1$ loss:
$$\ell_1(x, y) = | x -y|$${#eq-ellone}
\begin{equation}
\ell_1(x, y) = | x -y| \label{eq:ellone}
\end{equation}
Strictly proper for *median* predictions
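
As a minimal sketch of how the two losses feed into the cumulative regret $R_{t,k}$, the snippet below uses toy data. The function names (`loss_l1`, `loss_l2`, `cumulative_regret`) and the equal-weight combination are illustrative assumptions, not part of the slides.

```r
# Point-forecast losses as defined on this slide
loss_l1 <- function(x, y) abs(x - y)
loss_l2 <- function(x, y) abs(x - y)^2

# Cumulative regret R_{t,k}: forecaster loss minus expert loss, summed over time.
# x_comb: length-T vector, x_experts: T x K matrix, y: length-T vector.
cumulative_regret <- function(x_comb, x_experts, y, loss) {
  inst <- loss(x_comb, y) - loss(x_experts, y)  # T x K instantaneous regret
  apply(inst, 2, cumsum)                        # cumulate over time per expert
}

# Toy data (assumed): two experts, three time steps
y         <- c(1, 2, 3)
x_experts <- cbind(expert_1 = c(1, 2, 3), expert_2 = c(0, 0, 0))
x_comb    <- rowMeans(x_experts)                # equal-weight combination

regret <- cumulative_regret(x_comb, x_experts, y, loss_l2)
regret
```

A positive entry means the expert has accumulated less loss than the combined forecast up to that time step.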
@@ -400,17 +407,9 @@ In stochastic settings, the cumulative Risk should be analyzed `r Citet(my_bib,
::::
## Optimality
## Optimal Convergence
In stochastic settings, the cumulative Risk should be analyzed @wintenberger2017optimal:
\begin{align}
\underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \qquad\qquad\qquad \text{ and } \qquad\qquad\qquad
\underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}}
\label{eq_def_cumrisk}
\end{align}
There are two problems that an algorithm should solve in iid settings:
<br/>
:::: {.columns}
@@ -423,14 +422,6 @@ There are two problems that an algorithm should solve in iid settings:
\end{equation}
The forecaster is asymptotically not worse than the best expert $\widehat{\mathcal{R}}_{t,\min}$.
:::
::: {.column width="2%"}
:::
::: {.column width="48%"}
### The convex aggregation problem
\begin{equation}
@@ -441,13 +432,14 @@ The forecaster is asymptotically not worse than the best convex combination $\wi
:::
::::
::: {.column width="2%"}
## Optimality
:::
Satisfying the convexity property \eqref{eq_opt_conv} comes at the cost of slower possible convergence.
::: {.column width="48%"}
Optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} `r Citet(my_bib, "wintenberger2017optimal")`:
According to @wintenberger2017optimal, an algorithm has optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} if
\begin{align}
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & =
@@ -466,104 +458,102 @@ Algorithms can statisfy both \eqref{eq_optp_select} and \eqref{eq_optp_conv} dep
- Regularity conditions on $Y_t$ and $\widehat{X}_{t,k}$
- The weighting scheme
## Optimality
According to @cesa2006prediction EWA \eqref{eq_ewa_general} satisfies the optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if the:
- Loss $\ell$ is exp-concave
- Learning-rate $\eta$ is chosen correctly
Those results can be converted to stochastic iid settings @kakade2008generalization, @gaillard2014second.
The optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the gradient trick. Thereby, the loss is linearized:
\begin{align}
\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
\end{align}
$\ell'$ is the subgradient of $\ell$ in its first coordinate evaluated at forecast combination $\widetilde{X}$.
Combining probabilistic forecasts calls for a probabilistic loss function
:::: {.notes}
We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \ref{eq_optp_select} and \ref{eq_optp_conv}.
:::
::::
## The Continuous Ranked Probability Score
##
:::: {.columns}
::: {.column width="48%"}
**An appropriate choice:**
### Optimal Convergence
\begin{align*}
\text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx
\label{eq_crps}
\end{align*}
<br/>
It's strictly proper @gneiting2007strictly.
EWA satisfies optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if:
Using the CRPS, we can calculate time-adaptive weight $w_{t,k}$. However, what if the experts' performance is not uniform over all parts of the distribution?
- Loss $\ell$ is exp-concave
- Learning-rate $\eta$ is chosen correctly
The idea: utilize this relation:
Those results can be converted to stochastic iid settings @kakade2008generalization, @gaillard2014second.
\begin{align*}
\text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) \, d p.
\label{eq_crps_qs}
\end{align*}
Optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the gradient trick:
\begin{align}
\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
\end{align}
$\ell'$ is the subgradient of $\ell$ at forecast combination $\widetilde{X}$.
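
The linearization above can be illustrated with the $\ell_2$ loss, whose derivative in the first argument is $2(x - y)$. This is a sketch under assumptions: the helper names and the toy numbers are hypothetical. For a convex loss, the regret computed from the linearized loss is an upper bound on the regret computed from the original loss.

```r
loss_l2 <- function(x, y) (x - y)^2
grad_l2 <- function(x, y) 2 * (x - y)     # derivative in the first argument

# Linearized loss: gradient evaluated at the combination, times the forecast
linearized_loss <- function(x, x_comb, y) grad_l2(x_comb, y) * x

# Toy values (assumed)
y         <- 1.0
x_experts <- c(0.2, 1.5, 3.0)
x_comb    <- mean(x_experts)

regret_true <- loss_l2(x_comb, y) - loss_l2(x_experts, y)
regret_lin  <- linearized_loss(x_comb, x_comb, y) -
  linearized_loss(x_experts, x_comb, y)

# For the l2 loss the gap equals (x_comb - x_experts)^2, so it is never negative
rbind(regret_true, regret_lin, gap = regret_lin - regret_true)
```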
:::
::: {.column width="2%"}
::: {.column width="4%"}
:::
::: {.column width="48%"}
to combine quantiles of the probabilistic forecasts individually using the quantile-loss (QL):
\begin{align*}
\text{QL}_p(q, y) & = (\mathbb{1}\{y < q\} -p)(q - y)
\end{align*}
### Probabilistic Setting
</br>
<br/>
**But is it optimal?**
**An appropriate choice:**
CRPS is exp-concave `r fontawesome::fa("check", fill ="#00b02f")`
\begin{equation*}
\text{CRPS}(F, y) = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx \label{eq:crps}
\end{equation*}
`r fontawesome::fa("arrow-right", fill ="#000000")` EWA \eqref{eq_ewa_general} with CRPS satisfies \eqref{eq_optp_select} and \eqref{eq_optp_conv}
It's strictly proper @gneiting2007strictly.
QL is convex, but not exp-concave `r fontawesome::fa("exclamation", fill ="#ffa600")`
Using the CRPS, we can calculate time-adaptive weights $w_{t,k}$. However, what if the experts' performance varies in parts of the distribution?
`r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition while almost keeping optimal convergence
`r fontawesome::fa("lightbulb", fill = col_yellow)` Utilize this relation:
\begin{equation*}
\text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) dp.\label{eq_crps_qs}
\end{equation*}
... to combine quantiles of the probabilistic forecasts individually using the quantile-loss QL.
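
The relation between the CRPS and the quantile loss can be checked numerically. The sketch below assumes a standard normal forecast $F$ and uses $\text{QL}_p(q, y) = (\mathbb{1}\{y < q\} - p)(q - y)$. The function names, the midpoint grid, and the observation `y_obs` are illustrative choices. The closed-form CRPS of the standard normal serves only as a reference value.

```r
# Quantile (pinball) loss QL_p(q, y) = (1{y < q} - p) * (q - y)
quantile_loss <- function(q, y, p) ((y < q) - p) * (q - y)

# CRPS of a N(0, 1) forecast as 2 * integral of QL over p (midpoint rule)
crps_via_ql <- function(y, n_grid = 10000) {
  p <- (seq_len(n_grid) - 0.5) / n_grid
  2 * mean(quantile_loss(qnorm(p), y, p))
}

# Closed-form CRPS of N(0, 1), used only for comparison
crps_norm_closed <- function(y) {
  y * (2 * pnorm(y) - 1) + 2 * dnorm(y) - 1 / sqrt(pi)
}

y_obs <- 0.3
c(quantile_based = crps_via_ql(y_obs), closed_form = crps_norm_closed(y_obs))
```

Because the integrand is evaluated separately for each $p$, weights can be learned per quantile and the resulting losses still add up to the CRPS.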
:::
::::
## CRPS-Learning Optimality
## CRPS Learning Optimality
::: {.panel-tabset}
## Almost Optimal Convergence
`r fontawesome::fa("exclamation", fill = col_orange)` QL is convex, but not exp-concave `r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition. It satisfies that there exists a $C>0$ such that for $x>0$ it holds that
For convex losses, BOAG satisfies that there exists a $C>0$ such that for $x>0$ it holds that
\begin{equation}
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right) \right) \geq
1-e^{x}
1-e^{-x}
\label{eq_boa_opt_conv}
\end{equation}
`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *convex aggregation* \eqref{eq_optp_conv} @wintenberger2017optimal.
The same algorithm satisfies that there exists a $C>0$ such that for $x>0$ it holds that
\begin{equation}
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq
C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq
1-e^{x}
1-2e^{-x}
\label{eq_boa_opt_select}
\end{equation}
if $Y_t$ is bounded, the considered loss $\ell$ is convex $G$-Lipschitz and weak exp-concave in its first coordinate.
if $Y_t$ is bounded, the considered loss $\ell$ is convex $G$-Lipschitz and weak exp-concave in its first coordinate.
This is for losses that satisfy **A1** and **A2**.
`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *selection* \eqref{eq_optp_select} @gaillard2018efficient.
`r fontawesome::fa("arrow-right", fill ="#000000")` We show that this holds for QL under feasible conditions.
## Conditions + Lemma
## CRPS-Learning Optimality
:::: {.columns}
@@ -624,8 +614,7 @@ QL is Lipschitz continuous:
::::
## CRPS-Learning Optimality
## Proposition + Theorem
:::: {.columns}
@@ -674,6 +663,13 @@ $$\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}
::::
::::
:::: {.notes}
We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \ref{eq_optp_select} and \ref{eq_optp_conv}.
::::
## A Probabilistic Example
@@ -797,13 +793,17 @@ ggplot() +
:::
## The Smoothing Procedure
## The Smoothing Procedures
::: {.panel-tabset}
## Penalized Smoothing
:::: {.columns}
::: {.column width="48%"}
We are using penalized cubic b-splines:
Penalized cubic B-Splines for smoothing weights:
Let $\varphi=(\varphi_1,\ldots, \varphi_L)$ be bounded basis functions on $(0,1)$. Then we approximate $w_{t,k}$ by
@@ -811,7 +811,7 @@ Let $\varphi=(\varphi_1,\ldots, \varphi_L)$ be bounded basis functions on $(0,1)
w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi
\end{align}
with parameter vector $\beta$. The latter is estimated penalized $L_2$-smoothing which minimizes
with parameter vector $\beta$. The latter is estimated by penalized $L_2$-smoothing, which minimizes
\begin{equation}
\| w_{t,k} - \beta' \varphi \|^2_2 + \lambda \| \mathcal{D}^{d} (\beta' \varphi) \|^2_2
@@ -820,7 +820,7 @@ with parameter vector $\beta$. The latter is estimated penalized $L_2$-smoothing
with differential operator $\mathcal{D}$
Smoothing can be applied ex-post or inside of the algorithm ( `r fontawesome::fa("arrow-right", fill ="#000000")` [Simulation](#simulation)).
Computation is easy, since we have an analytical solution
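
A minimal sketch of the analytical solution, assuming a cubic B-spline basis from the `splines` package and a first-order difference penalty on the coefficients in place of the differential operator. The grid `p_grid`, the basis size `n_basis`, the helper `hat_matrix`, and the toy weights `w_raw` are all illustrative assumptions.

```r
library(splines)

p_grid  <- seq(0.01, 0.99, by = 0.01)   # probability grid (assumed)
n_basis <- 10                           # number of basis functions L (assumed)

# Cubic B-spline basis; with intercept = TRUE every row sums to one
B <- matrix(bs(p_grid, df = n_basis, degree = 3, intercept = TRUE,
               Boundary.knots = c(0, 1)), nrow = length(p_grid))

# First-order difference penalty matrix on the coefficients (d = 1)
D1 <- diff(diag(n_basis), differences = 1)

# Hat matrix of the penalized least-squares problem: B (B'B + lambda D'D)^{-1} B'
hat_matrix <- function(lambda) {
  B %*% solve(crossprod(B) + lambda * crossprod(D1), t(B))
}

w_raw    <- 0.5 + 0.3 * sin(2 * pi * p_grid)   # toy pointwise weights
w_smooth <- hat_matrix(10) %*% w_raw           # moderate smoothing
w_const  <- hat_matrix(1e6) %*% w_raw          # heavy smoothing

range(w_const)   # nearly constant, close to mean(w_raw)
```

With $d = 1$ and a large $\lambda$, the penalty forces all coefficients toward a common value, so the smoothed weights collapse to the mean of the pointwise weights.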
:::
@@ -840,14 +840,119 @@ We receive the constant solution for high values of $\lambda$ when setting $d=1$
::::
## Basis Smoothing
:::: {.columns}
::: {.column width="48%"}
Represent weights as linear combinations of bounded basis functions:
\begin{equation}
w_{t,k} = \sum_{l=1}^L \beta_{t,k,l} \varphi_l = \boldsymbol \beta_{t,k}' \boldsymbol \varphi
\end{equation}
A popular choice is B-splines as local basis functions
$\boldsymbol \beta_{t,k}$ is calculated using a reduced regret matrix:
\begin{equation}
\underbrace{\boldsymbol r_{t}}_{\text{LxK}} = \frac{L}{P} \underbrace{\boldsymbol B'}_{\text{LxP}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{\text{PxK}}
\end{equation}
`r fontawesome::fa("arrow-right", fill ="#000000")` $\boldsymbol r_{t}$ is transformed from PxK to LxK
If $L = P$ it holds that $\boldsymbol \varphi = \boldsymbol{I}$
For $L = 1$ we receive constant weights
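
The dimension reduction of the regret matrix can be sketched as follows. Random numbers stand in for the actual quantile-loss regret, and the names `reduce_regret`, `regret_full`, and the chosen sizes are assumptions for illustration only.

```r
library(splines)

n_prob    <- 99   # P: number of probability levels (assumed)
n_basis   <- 5    # L: number of basis functions (assumed)
n_experts <- 2    # K: number of experts (assumed)
p_grid    <- seq_len(n_prob) / (n_prob + 1)

# P x L cubic B-spline basis matrix
B <- matrix(bs(p_grid, df = n_basis, degree = 3, intercept = TRUE,
               Boundary.knots = c(0, 1)), nrow = n_prob)

set.seed(1)
regret_full <- matrix(rnorm(n_prob * n_experts), n_prob, n_experts)  # P x K toy values

# r = (L / P) * B' (P x K regret), which has dimension L x K
reduce_regret <- function(B, regret_full) {
  (ncol(B) / nrow(B)) * crossprod(B, regret_full)
}

r_reduced  <- reduce_regret(B, regret_full)                  # L x K
r_identity <- reduce_regret(diag(n_prob), regret_full)       # L = P: unchanged
r_const    <- reduce_regret(matrix(1, n_prob, 1), regret_full)  # L = 1: average over P

dim(r_reduced)
```

The two special cases mirror the statements above: an identity basis leaves the regret untouched, and a single constant basis function averages it over all probability levels.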
:::
::: {.column width="2%"}
:::
::: {.column width="48%"}
Weights converge to the constant solution if $L\rightarrow 1$
<center>
<img src="/assets/crps_learning/weights_kstep.gif">
</center>
:::
::::
::::
---
## The Proposed CRPS-Learning Algorithm
```{r, fig.align="left", echo=FALSE, out.width = "1000px", cache = TRUE}
knitr::include_graphics("assets/crps_learning/algorithm_1.svg")
```
<br/>
::: {style="font-size: 85%;"}
:::: {.columns}
::: {.column width="43%"}
### Initialization:
Array of expert predictions: $\widehat{X}_{t,p,k}$
Vector of Prediction targets: $Y_t$
Starting Weights: $\boldsymbol w_0=(w_{0,1},\ldots, w_{0,K})$
Penalization parameter: $\lambda\geq 0$
B-spline and penalty matrices $\boldsymbol B$ and $\boldsymbol D$ on $\mathcal{P}= (p_1,\ldots,p_M)$
Hat matrix: $$\boldsymbol{\mathcal{H}} = \boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'$$
Cumulative Regret: $R_{0,k} = 0$
Range parameter: $E_{0,k}=0$
Starting pseudo-weights: $\boldsymbol \beta_0 = \boldsymbol B^{\text{pinv}}\boldsymbol w_0(\boldsymbol{\mathcal{P}})$
:::
::: {.column width="2%"}
:::
::: {.column width="55%"}
### Core:
for( t in 1:T ) {
&nbsp;&nbsp;&nbsp;&nbsp;$\widetilde{\boldsymbol X}_{t} = \text{Sort}\left( \boldsymbol w_{t-1}'(\boldsymbol P) \widehat{\boldsymbol X}_{t} \right)$ <b style="color: var(--col_grey_7);"># Prediction</b>
&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol r_{t} = \frac{L}{M} \boldsymbol B' \left({\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)$
&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol E_{t} = \max(\boldsymbol E_{t-1}, \boldsymbol r_{t}^+ + \boldsymbol r_{t}^-)$
&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol V_{t} = \boldsymbol V_{t-1} + \boldsymbol r_{t}^{ \odot 2}$
&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol \eta_{t} =\min\left( \left(-\log(\boldsymbol \beta_{0}) \odot \boldsymbol V_{t}^{\odot -1} \right)^{\odot\frac{1}{2}} , \frac{1}{2}\boldsymbol E_{t}^{\odot-1}\right)$
&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol R_{t} = \boldsymbol R_{t-1}+ \boldsymbol r_{t} \odot \left( \boldsymbol 1 - \boldsymbol \eta_{t} \odot \boldsymbol r_{t} \right)/2 + \boldsymbol E_{t} \odot \mathbb{1}\{-2\boldsymbol \eta_{t}\odot \boldsymbol r_{t} > 1\}$
&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol \beta_{t} = K \boldsymbol \beta_{0} \odot \boldsymbol {SoftMax}\left( - \boldsymbol \eta_{t} \odot \boldsymbol R_{t} + \log( \boldsymbol \eta_{t}) \right)$
&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol w_{t}(\boldsymbol P) = \underbrace{\boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'}_{\boldsymbol{\mathcal{H}}} \boldsymbol B \boldsymbol \beta_{t}$
}
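
The prediction step of the loop can be sketched in isolation. This covers only the weighted combination and the sorting, not the weight update. The function name `combine_quantiles` and the toy matrices are hypothetical.

```r
# Combine expert quantile forecasts at one time step.
# w, x_hat: P x K matrices (weights and expert quantiles per probability level).
combine_quantiles <- function(w, x_hat) {
  x_comb <- rowSums(w * x_hat)   # weighted combination per probability level
  sort(x_comb)                   # sorting removes quantile crossing
}

# Toy example (assumed): three probability levels, two experts
x_hat <- cbind(expert_1 = c(0.0, 0.5, 1.0),
               expert_2 = c(2.0, 2.1, 2.2))
w     <- cbind(expert_1 = c(0.0, 1.0, 0.5),
               expert_2 = c(1.0, 0.0, 0.5))

x_tilde <- combine_quantiles(w, x_hat)
x_tilde
```

Because the weights differ across probability levels, the raw combination here is not monotone in $p$, which is why the sorting step is needed.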
:::
::::
:::
## Simulation Study
@@ -1437,19 +1542,6 @@ BOA > 16 -->
## Outline
```{r, include=FALSE}
col_lightgray <- "#e7e7e7"
col_blue <- "#000088"
col_smooth_expost <- "#a7008b"
col_smooth <- "#187a00"
col_pointwise <- "#008790"
col_constant <- "#dd9002"
col_optimum <- "#666666"
col_green <- "#61B94C"
col_orange <- "#ffa600"
col_yellow <- "#FCE135"
```
<br/>
**Multivariate CRPS Learning**