Add css colors, improve crps slides

2025-05-25 14:35:46 +02:00
parent 42663f73b9
commit de5573e171
3 changed files with 400 additions and 104 deletions
--- a/index.qmd
+++ b/index.qmd
@@ -86,10 +86,13 @@ my_bib <- ReadBib("assets/library.bib", check = FALSE)
 col_lightgray <- "#e7e7e7"
 col_blue <- "#000088"
 col_smooth_expost <- "#a7008b"
-col_smooth <- "#187a00"
-col_pointwise <- "#008790"
 col_constant <- "#dd9002"
 col_optimum <- "#666666"
+col_smooth <- "#187a00"
+col_pointwise <- "#008790"
+col_green <- "#61B94C"
+col_orange <- "#ffa600"
+col_yellow <- "#FCE135"
 ```

 # CRPS Learning
@@ -308,9 +311,9 @@ Weights are updated sequentially according to the past performance of the $K$ ex

 That is, a loss function $\ell$ is needed. This is used to compute the **cumulative regret** $R_{t,k}$

-$$
-R_{t,k}  = \widetilde{L}_{t} - \widehat{L}_{t,k} =  \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)
-$${#eq-regret}
+\begin{equation}
+  R_{t,k}  = \widetilde{L}_{t} - \widehat{L}_{t,k} =  \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)\label{eq:regret}
+\end{equation}

 The cumulative regret:

@@ -325,13 +328,15 @@ Popular loss functions for point forecasting @gneiting2011making:

 $\ell_2$ loss:

-$$\ell_2(x, y) = | x -y|^2$${#eq-elltwo}
+\begin{equation}
+  \ell_2(x, y) = | x -y|^2 \label{eq:elltwo}
+\end{equation}

 Strictly proper for *mean* prediction

 :::

-::: {.column width="2%"}
+::: {.column width="4%"}

 :::

@@ -339,7 +344,9 @@ Strictly proper for *mean* prediction

 $\ell_1$ loss:

-$$\ell_1(x, y) = | x -y|$${#eq-ellone}
+\begin{equation}
+  \ell_1(x, y) = | x -y| \label{eq:ellone}
+\end{equation}

 Strictly proper for *median* predictions

@@ -400,17 +407,9 @@ In stochastic settings, the cumulative Risk should be analyzed `r Citet(my_bib,

 ::::

-## Optimality
+## Optimal Convergence

-In stochastic settings, the cumulative Risk should be analyezed @wintenberger2017optimal:
-
-\begin{align}
-    \underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \qquad\qquad\qquad \text{ and } \qquad\qquad\qquad
-    \underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}}
-    \label{eq_def_cumrisk}
-\end{align}
-
-There are two problems that an algorithm should solve in iid settings:
+<br/>

 :::: {.columns}

@@ -423,14 +422,6 @@ There are two problems that an algorithm should solve in iid settings:
 \end{equation}
 The forecaster is asymptotically not worse than the best expert $\widehat{\mathcal{R}}_{t,\min}$.

-:::
-
-::: {.column width="2%"}
-
-:::
-
-::: {.column width="48%"}
-
 ### The convex aggregation problem

 \begin{equation}
@@ -441,13 +432,14 @@ The forecaster is asymptotically not worse than the best convex combination $\wi

 :::

-::::
+::: {.column width="2%"}

-## Optimality
+:::

-Satisfying the convexity property \eqref{eq_opt_conv} comes at the cost of slower possible convergence.
+::: {.column width="48%"}
+
+Optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv}  `r Citet(my_bib, "wintenberger2017optimal")`: 

-According to @wintenberger2017optimal, an algorithm has optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} if

 \begin{align}
    \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & =
@@ -466,104 +458,102 @@ Algorithms can statisfy both \eqref{eq_optp_select} and \eqref{eq_optp_conv} dep
 - Regularity conditions on $Y_t$ and $\widehat{X}_{t,k}$
 - The weighting scheme

-## Optimality
-
-According to @cesa2006prediction EWA \eqref{eq_ewa_general} satisfies the optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if the:
- Loss $\ell$ is exp-concave
- Learning-rate $\eta$ is chosen correctly
-
-Those results can be converted to stochastic iid settings @kakade2008generalization, @gaillard2014second.
-
-The optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the kernel-trick. Thereby, the loss is linearized:
-\begin{align}
-\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
-\end{align}
-$\ell'$ is the subgradient of $\ell$ in its first coordinate evaluated at forecast combination $\widetilde{X}$.
-
-Combining probabilistic forecasts calls for a probabilistic loss function
-
-:::: {.notes}
-
-We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \ref{eq_optp_select} and \ref{eq_optp_conv}.
+:::

 ::::

-## The Continuous Ranked Probability Score
+##

 :::: {.columns}

 ::: {.column width="48%"}

-**An appropriate choice:**
+### Optimal Convergence

-\begin{align*}
-    \text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx
-    \label{eq_crps}
-\end{align*}
+<br/>

-It's strictly proper @gneiting2007strictly.
+EWA satisfies optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if:

-Using the CRPS, we can calculate time-adaptive weight $w_{t,k}$. However, what if the experts' performance is not uniform over all parts of the distribution?
+- Loss $\ell$ is exp-concave
+- Learning-rate $\eta$ is chosen correctly

-The idea: utilize this relation:
+Those results can be converted to stochastic iid settings @kakade2008generalization, @gaillard2014second.

-\begin{align*}
-    \text{CRPS}(F, y) = 2 \int_0^{1}  \text{QL}_p(F^{-1}(p), y) \, d p.
-    \label{eq_crps_qs}
-\end{align*}
+Optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the gradient trick, which linearizes the loss:
+
+\begin{align}
+\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
+\end{align}
+
+$\ell'$ is the subgradient of $\ell$ at forecast combination $\widetilde{X}$.

 :::

-::: {.column width="2%"}
+::: {.column width="4%"}

 :::

 ::: {.column width="48%"}

-to combine quantiles of the probabilistic forecasts individually using the quantile-loss (QL):
-\begin{align*}
-    \text{QL}_p(q, y) & = (\mathbb{1}\{y < q\} -p)(q - y)
-\end{align*}
+### Probabilistic Setting

-</br>
+<br/>

-**But is it optimal?**
+**An appropriate choice:**

-CRPS is exp-concave `r fontawesome::fa("check", fill ="#00b02f")`
+\begin{equation*}
+  \text{CRPS}(F, y) = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx \label{eq:crps}
+\end{equation*}

-`r fontawesome::fa("arrow-right", fill ="#000000")` EWA \eqref{eq_ewa_general} with CRPS satisfies \eqref{eq_optp_select} and \eqref{eq_optp_conv}
+It's strictly proper @gneiting2007strictly.

-QL is convex, but not exp-concave `r fontawesome::fa("exclamation", fill ="#ffa600")`
+Using the CRPS, we can calculate time-adaptive weights $w_{t,k}$. However, what if the experts' performance varies in parts of the distribution? 

-`r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition while almost keeping optimal convergence
+`r fontawesome::fa("lightbulb", fill = col_yellow)` Utilize this relation:
+
+\begin{equation*}
+    \text{CRPS}(F, y) = 2 \int_0^{1}  \text{QL}_p(F^{-1}(p), y) dp.\label{eq_crps_qs}
+\end{equation*}
+
+... to combine quantiles of the probabilistic forecasts individually using the quantile-loss QL.

 :::

 ::::

-## CRPS-Learning Optimality
+## CRPS Learning Optimality
+
+::: {.panel-tabset}
+
+## Almost Optimal Convergence
+
+
+`r fontawesome::fa("exclamation", fill = col_orange)` QL is convex, but not exp-concave `r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition. It satisfies that there exists a $C>0$ such that for $x>0$ it holds that

-For convex losses, BOAG satisfies that there exist a $C>0$ such that for $x>0$ it holds that
 \begin{equation}
    P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right)  \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right)  \right) \geq
-    1-e^{x}
+    1-e^{-x}
    \label{eq_boa_opt_conv}
 \end{equation}
+
 `r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *convex aggregation* \eqref{eq_optp_conv} @wintenberger2017optimal.

 The same algorithm satisfies that there exists a $C>0$ such that for $x>0$ it holds that
 \begin{equation}
    P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq
    C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq
-    1-e^{x}
+    1-2e^{-x}
    \label{eq_boa_opt_select}
 \end{equation}

-if $Y_t$ is bounded, the considered loss $\ell$ is convex $G$-Lipschitz and weak exp-concave in its first coordinate.
+if $Y_t$ is bounded and the considered loss $\ell$ is convex, $G$-Lipschitz, and weakly exp-concave in its first coordinate.

-This is for losses that satisfy **A1** and **A2**.
+`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *selection* \eqref{eq_optp_select} @gaillard2018efficient.
+
+`r fontawesome::fa("arrow-right", fill ="#000000")` We show that this holds for QL under feasible conditions.
+
+## Conditions + Lemma

-## CRPS-Learning Optimality

 :::: {.columns}

@@ -624,8 +614,7 @@ QL is Lipschitz continuous:

 ::::

-
-## CRPS-Learning Optimality
+## Proposition + Theorem

 :::: {.columns}

@@ -674,6 +663,13 @@ $$\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}

 ::::

+::::
+
+:::: {.notes}
+
+We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \ref{eq_optp_select} and \ref{eq_optp_conv}.
+
+::::

 ## A Probabilistic Example

@@ -797,13 +793,17 @@ ggplot() +

 :::

-## The Smoothing Procedure
+## The Smoothing Procedures
+
+::: {.panel-tabset}
+
+## Penalized Smoothing

 :::: {.columns}

 ::: {.column width="48%"}

-We are using penalized cubic b-splines:
+Penalized cubic B-Splines for smoothing weights:

 Let $\varphi=(\varphi_1,\ldots, \varphi_L)$ be bounded basis functions on $(0,1)$. Then we approximate $w_{t,k}$ by

@@ -811,7 +811,7 @@ Let $\varphi=(\varphi_1,\ldots, \varphi_L)$ be bounded basis functions on $(0,1)
 w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi
 \end{align}

-with parameter vector $\beta$. The latter is estimated penalized $L_2$-smoothing which minimizes
+with parameter vector $\beta$. The latter is estimated by penalized $L_2$-smoothing, which minimizes

 \begin{equation}
    \| w_{t,k} - \beta' \varphi  \|^2_2 + \lambda \| \mathcal{D}^{d}  (\beta' \varphi)  \|^2_2
@@ -820,7 +820,7 @@ with parameter vector $\beta$. The latter is estimated penalized $L_2$-smoothing

 with differential operator $\mathcal{D}$

-Smoothing can be applied ex-post or inside of the algorithm ( `r fontawesome::fa("arrow-right", fill ="#000000")` [Simulation](#simulation)).
+Computation is easy, since we have an analytical solution

 :::

@@ -840,14 +840,119 @@ We receive the constant solution for high values of $\lambda$ when setting $d=1$

 ::::

+## Basis Smoothing
+
+:::: {.columns}
+
+::: {.column width="48%"}
+
+Represent weights as linear combinations of bounded basis functions:
+
+\begin{equation}
+  w_{t,k} = \sum_{l=1}^L \beta_{t,k,l} \varphi_l = \boldsymbol \beta_{t,k}' \boldsymbol \varphi
+\end{equation}
+
+A popular choice is B-splines as local basis functions
+
+$\boldsymbol \beta_{t,k}$ is calculated using a reduced regret matrix:
+
+\begin{equation}
+  \underbrace{\boldsymbol r_{t}}_{\text{LxK}} = \frac{L}{P} \underbrace{\boldsymbol B'}_{\text{LxP}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{\text{PxK}}
+\end{equation}
+
+`r fontawesome::fa("arrow-right", fill ="#000000")` $\boldsymbol r_{t}$ is transformed from PxK to LxK
+
+If $L = P$ it holds that $\boldsymbol \varphi = \boldsymbol{I}$ 
+For $L = 1$ we receive constant weights
+
+:::
+
+::: {.column width="2%"}
+
+:::
+
+::: {.column width="48%"}
+
+Weights converge to the constant solution if $L\rightarrow 1$
+
+<center>
+<img src="/assets/crps_learning/weights_kstep.gif">
+</center>
+
+:::
+
+::::
+
+::::
+
 ---

 ## The Proposed CRPS-Learning Algorithm

-```{r, fig.align="left", echo=FALSE, out.width = "1000px", cache = TRUE}
-knitr::include_graphics("assets/crps_learning/algorithm_1.svg")
-```
+<br/>

+::: {style="font-size: 85%;"}
+
+:::: {.columns}
+
+::: {.column width="43%"}
+
+### Initialization:
+
+Array of expert predictions: $\widehat{X}_{t,p,k}$
+
+Vector of Prediction targets: $Y_t$
+
+Starting Weights: $\boldsymbol w_0=(w_{0,1},\ldots, w_{0,K})$ 
+
+Penalization parameter: $\lambda\geq 0$
+
+B-spline and penalty matrices $\boldsymbol B$ and $\boldsymbol D$ on $\mathcal{P}= (p_1,\ldots,p_M)$
+
+Hat matrix: $$\boldsymbol{\mathcal{H}} = \boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'$$
+
+Cumulative Regret: $R_{0,k} = 0$ 
+
+Range parameter: $E_{0,k}=0$
+
+Starting pseudo-weights: $\boldsymbol \beta_0 = \boldsymbol B^{\text{pinv}}\boldsymbol w_0(\boldsymbol{\mathcal{P}})$
+
+
+:::
+
+::: {.column width="2%"}
+
+:::
+
+::: {.column width="55%"}
+
+### Core:
+
+for( t in 1:T ) {
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\widetilde{\boldsymbol X}_{t} = \text{Sort}\left( \boldsymbol w_{t-1}'(\boldsymbol P) \widehat{\boldsymbol X}_{t} \right)$ <b style="color: var(--col_grey_7);"># Prediction</b>
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol r_{t} = \frac{L}{M} \boldsymbol B' \left({\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)$
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol E_{t}    = \max(\boldsymbol E_{t-1}, \boldsymbol r_{t}^+ + \boldsymbol r_{t}^-)$
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol V_{t}     = \boldsymbol V_{t-1} + \boldsymbol r_{t}^{ \odot 2}$
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol \eta_{t}  =\min\left( \left(-\log(\boldsymbol \beta_{0}) \odot \boldsymbol V_{t}^{\odot -1} \right)^{\odot\frac{1}{2}} , \frac{1}{2}\boldsymbol E_{t}^{\odot-1}\right)$
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol R_{t}     = \boldsymbol R_{t-1}+  \boldsymbol r_{t} \odot \left( \boldsymbol 1 - \boldsymbol \eta_{t} \odot \boldsymbol r_{t} \right)/2 + \boldsymbol E_{t} \odot \mathbb{1}\{-2\boldsymbol \eta_{t}\odot \boldsymbol r_{t} > 1\}$
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol \beta_{t}     = K \boldsymbol \beta_{0} \odot \boldsymbol {SoftMax}\left( -  \boldsymbol \eta_{t} \odot \boldsymbol R_{t} + \log( \boldsymbol \eta_{t}) \right)$
+
+&nbsp;&nbsp;&nbsp;&nbsp;$\boldsymbol w_{t}(\boldsymbol P) = \underbrace{\boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'}_{\boldsymbol{\mathcal{H}}} \boldsymbol B \boldsymbol \beta_{t}$
+
+} 
+
+:::
+
+::::
+
+:::

 ## Simulation Study

@@ -1437,19 +1542,6 @@ BOA     > 16 -->

 ## Outline

-```{r, include=FALSE}
-col_lightgray <- "#e7e7e7"
-col_blue <- "#000088"
-col_smooth_expost <- "#a7008b"
-col_smooth <- "#187a00"
-col_pointwise <- "#008790"
-col_constant <- "#dd9002"
-col_optimum <- "#666666"
-col_green <- "#61B94C"
-col_orange <- "#ffa600"
-col_yellow <- "#FCE135"
-```
-
 </br>

 **Multivariate CRPS Learning**