Add css colors, improve crps slides
index.qmd
@@ -86,10 +86,13 @@ my_bib <- ReadBib("assets/library.bib", check = FALSE)
col_lightgray <- "#e7e7e7"
col_blue <- "#000088"
col_smooth_expost <- "#a7008b"
col_smooth <- "#187a00"
col_pointwise <- "#008790"
col_constant <- "#dd9002"
col_optimum <- "#666666"
col_smooth <- "#187a00"
col_pointwise <- "#008790"
col_green <- "#61B94C"
col_orange <- "#ffa600"
col_yellow <- "#FCE135"
```

# CRPS Learning
@@ -308,9 +311,9 @@ Weights are updated sequentially according to the past performance of the $K$ exp

That is, a loss function $\ell$ is needed. This is used to compute the **cumulative regret** $R_{t,k}$

$$
R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)
$${#eq-regret}
\begin{equation}
R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)\label{eq:regret}
\end{equation}
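As a minimal R illustration of the cumulative regret above (not part of the slides; the loss, data, and expert are made up):

```r
# Minimal sketch of the cumulative regret R_{t,k}; loss, data, and expert are hypothetical.
loss <- function(x, y) abs(x - y)^2                     # e.g. the l2 loss

cum_regret <- function(x_tilde, x_hat_k, y, loss) {
  cumsum(loss(x_tilde, y) - loss(x_hat_k, y))           # R_{t,k} for t = 1, ..., T
}

set.seed(1)
y       <- rnorm(100)
x_hat_k <- y + rnorm(100, sd = 0.5)                     # hypothetical expert k
x_tilde <- 0.5 * x_hat_k                                # hypothetical combined forecast
tail(cum_regret(x_tilde, x_hat_k, y, loss), 1)          # cumulative regret at t = 100
```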
The cumulative regret:
@@ -325,13 +328,15 @@ Popular loss functions for point forecasting @gneiting2011making:

$\ell_2$ loss:

$$\ell_2(x, y) = | x -y|^2$${#eq-elltwo}
\begin{equation}
\ell_2(x, y) = | x -y|^2 \label{eq:elltwo}
\end{equation}

Strictly proper for *mean* prediction

:::

::: {.column width="2%"}
::: {.column width="4%"}

:::

@@ -339,7 +344,9 @@ Strictly proper for *mean* prediction

$\ell_1$ loss:

$$\ell_1(x, y) = | x -y|$${#eq-ellone}
\begin{equation}
\ell_1(x, y) = | x -y| \label{eq:ellone}
\end{equation}

Strictly proper for *median* predictions
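A quick numerical illustration of these propriety statements (illustration only, with simulated data): under a skewed target, the expected $\ell_2$ loss is minimized near the mean, the expected $\ell_1$ loss near the median.

```r
# Illustration only: minimizers of the expected l2 and l1 losses under a skewed target.
set.seed(1)
y <- rexp(1e5)                                           # mean = 1, median = log(2) ~ 0.69

optimize(function(x) mean((x - y)^2), c(0, 5))$minimum   # close to mean(y)
optimize(function(x) mean(abs(x - y)), c(0, 5))$minimum  # close to median(y)
```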
@@ -400,17 +407,9 @@ In stochastic settings, the cumulative Risk should be analyzed `r Citet(my_bib,

::::

## Optimality
## Optimal Convergence

In stochastic settings, the cumulative risk should be analyzed @wintenberger2017optimal:

\begin{align}
\underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \qquad\qquad\qquad \text{ and } \qquad\qquad\qquad
\underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}}
\label{eq_def_cumrisk}
\end{align}

There are two problems that an algorithm should solve in iid settings:
<br/>

:::: {.columns}
@@ -423,14 +422,6 @@ There are two problems that an algorithm should solve in iid settings:
\end{equation}
The forecaster is asymptotically not worse than the best expert $\widehat{\mathcal{R}}_{t,\min}$.

:::

::: {.column width="2%"}

:::

::: {.column width="48%"}

### The convex aggregation problem

\begin{equation}
@@ -441,13 +432,14 @@ The forecaster is asymptotically not worse than the best convex combination $\wi

:::

::::
::: {.column width="2%"}

## Optimality
:::

Satisfying the convexity property \eqref{eq_opt_conv} comes at the cost of slower possible convergence.
::: {.column width="48%"}

Optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} `r Citet(my_bib, "wintenberger2017optimal")`:

According to @wintenberger2017optimal, an algorithm has optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} if

\begin{align}
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & =
@@ -466,104 +458,102 @@ Algorithms can satisfy both \eqref{eq_optp_select} and \eqref{eq_optp_conv} dep
- Regularity conditions on $Y_t$ and $\widehat{X}_{t,k}$
- The weighting scheme

## Optimality

According to @cesa2006prediction, EWA \eqref{eq_ewa_general} satisfies the optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if:
- Loss $\ell$ is exp-concave
- Learning-rate $\eta$ is chosen correctly

Those results can be converted to stochastic iid settings @kakade2008generalization, @gaillard2014second.
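A minimal EWA sketch (illustration only, not the deck's definition \eqref{eq_ewa_general}): exponential weights based on each expert's cumulative loss, here with a hypothetical squared loss, a fixed learning rate, and simulated experts.

```r
# Illustrative EWA weights: experts are weighted by exp(-eta * cumulative loss).
ewa_weights <- function(X_hat, y, eta = 0.5, loss = function(x, y) (x - y)^2) {
  L <- colSums(loss(X_hat, y))          # cumulative loss of each expert
  w <- exp(-eta * L)
  w / sum(w)                            # normalize so the weights sum to one
}

set.seed(1)
y     <- rnorm(50)
X_hat <- cbind(y + rnorm(50, sd = 0.2), y + rnorm(50, sd = 1))  # two hypothetical experts
ewa_weights(X_hat, y)                   # most weight goes to the more accurate expert
```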
The optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the kernel-trick, which linearizes the loss:
\begin{align}
\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
\end{align}
$\ell'$ is the subgradient of $\ell$ in its first coordinate evaluated at forecast combination $\widetilde{X}$.
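For concreteness, a tiny sketch of this linearization for the $\ell_2$ case (an illustrative assumption; the deck applies it to the quantile loss later):

```r
# l2 example of the gradient trick: l'(x, y) = 2 (x - y), evaluated at the combination.
l2_grad  <- function(x_tilde, y) 2 * (x_tilde - y)
loss_lin <- function(x, x_tilde, y) l2_grad(x_tilde, y) * x   # linearized loss l^nabla(x, y)
```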
Combining probabilistic forecasts calls for a probabilistic loss function.

:::: {.notes}

We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \ref{eq_optp_select} and \ref{eq_optp_conv}.
:::

::::
## The Continuous Ranked Probability Score
##

:::: {.columns}

::: {.column width="48%"}

**An appropriate choice:**
### Optimal Convergence

\begin{align*}
\text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx
\label{eq_crps}
\end{align*}
<br/>

It's strictly proper @gneiting2007strictly.
EWA satisfies optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if:

Using the CRPS, we can calculate time-adaptive weight $w_{t,k}$. However, what if the experts' performance is not uniform over all parts of the distribution?
- Loss $\ell$ is exp-concave
- Learning-rate $\eta$ is chosen correctly

The idea: utilize this relation:
Those results can be converted to stochastic iid settings @kakade2008generalization, @gaillard2014second.

\begin{align*}
\text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) \, d p.
\label{eq_crps_qs}
\end{align*}
Optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the kernel-trick:

\begin{align}
\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
\end{align}

$\ell'$ is the subgradient of $\ell$ at forecast combination $\widetilde{X}$.

:::

::: {.column width="2%"}
::: {.column width="4%"}

:::
::: {.column width="48%"}

to combine quantiles of the probabilistic forecasts individually using the quantile-loss (QL):
\begin{align*}
\text{QL}_p(q, y) & = (\mathbb{1}\{y < q\} -p)(q - y)
\end{align*}
### Probabilistic Setting

</br>
<br/>

**But is it optimal?**
**An appropriate choice:**

CRPS is exp-concave `r fontawesome::fa("check", fill ="#00b02f")`
\begin{equation*}
\text{CRPS}(F, y) = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx \label{eq:crps}
\end{equation*}

`r fontawesome::fa("arrow-right", fill ="#000000")` EWA \eqref{eq_ewa_general} with CRPS satisfies \eqref{eq_optp_select} and \eqref{eq_optp_conv}
It's strictly proper @gneiting2007strictly.

QL is convex, but not exp-concave `r fontawesome::fa("exclamation", fill ="#ffa600")`
Using the CRPS, we can calculate time-adaptive weights $w_{t,k}$. However, what if the experts' performance varies across parts of the distribution?

`r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition while almost keeping optimal convergence
`r fontawesome::fa("lightbulb", fill = col_yellow)` Utilize this relation:

\begin{equation*}
\text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) dp.\label{eq_crps_qs}
\end{equation*}

... to combine quantiles of the probabilistic forecasts individually using the quantile-loss QL.

:::

::::
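A quick numerical check of the identity between the CRPS and the integrated quantile loss (a sketch assuming a standard normal forecast distribution; not the slides' code):

```r
# CRPS(F, y) vs. 2 * integral of QL_p(F^{-1}(p), y) dp, for F = standard normal.
ql <- function(q, y, p) (as.numeric(y < q) - p) * (q - y)   # quantile (pinball) loss

y      <- 0.7
p_grid <- seq(0.0005, 0.9995, length.out = 4000)
crps_via_ql <- 2 * mean(ql(qnorm(p_grid), y, p_grid))       # Riemann approximation

crps_closed <- y * (2 * pnorm(y) - 1) + 2 * dnorm(y) - 1 / sqrt(pi)  # N(0,1) closed form
c(crps_via_ql, crps_closed)                                 # both approx. 0.42
```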
## CRPS-Learning Optimality
## CRPS Learning Optimality

::: {.panel-tabset}

## Almost Optimal Convergence

`r fontawesome::fa("exclamation", fill = col_orange)` QL is convex, but not exp-concave `r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition. It satisfies that there exists a $C>0$ such that for $x>0$ it holds that

For convex losses, BOAG satisfies that there exists a $C>0$ such that for $x>0$ it holds that
\begin{equation}
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right) \right) \geq
1-e^{x}
1-e^{-x}
\label{eq_boa_opt_conv}
\end{equation}

`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t. *convex aggregation* \eqref{eq_optp_conv} @wintenberger2017optimal.

The same algorithm satisfies that there exists a $C>0$ such that for $x>0$ it holds that
\begin{equation}
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq
C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq
1-e^{x}
1-2e^{-x}
\label{eq_boa_opt_select}
\end{equation}

if $Y_t$ is bounded and the considered loss $\ell$ is convex, $G$-Lipschitz, and weakly exp-concave in its first coordinate.

This is for losses that satisfy **A1** and **A2**.
`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t. *selection* \eqref{eq_optp_select} @gaillard2018efficient.

`r fontawesome::fa("arrow-right", fill ="#000000")` We show that this holds for QL under feasible conditions.
## Conditions + Lemma
## CRPS-Learning Optimality

:::: {.columns}

@@ -624,8 +614,7 @@ QL is Lipschitz continuous:

::::

## CRPS-Learning Optimality
## Proposition + Theorem

:::: {.columns}

@@ -674,6 +663,13 @@ $$\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}

::::

::::

:::: {.notes}

We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \ref{eq_optp_select} and \ref{eq_optp_conv}.

::::
## A Probabilistic Example

@@ -797,13 +793,17 @@ ggplot() +

:::

## The Smoothing Procedure
## The Smoothing Procedures

::: {.panel-tabset}

## Penalized Smoothing

:::: {.columns}

::: {.column width="48%"}

We are using penalized cubic b-splines:
Penalized cubic B-Splines for smoothing weights:

Let $\varphi=(\varphi_1,\ldots, \varphi_L)$ be bounded basis functions on $(0,1)$. Then we approximate $w_{t,k}$ by

@@ -811,7 +811,7 @@ Let $\varphi=(\varphi_1,\ldots, \varphi_L)$ be bounded basis functions on $(0,1)
\begin{align}
w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi
\end{align}

with parameter vector $\beta$. The latter is estimated penalized $L_2$-smoothing which minimizes
with parameter vector $\beta$. The latter is estimated by penalized $L_2$-smoothing, which minimizes

\begin{equation}
\| w_{t,k} - \beta' \varphi \|^2_2 + \lambda \| \mathcal{D}^{d} (\beta' \varphi) \|^2_2
@@ -820,7 +820,7 @@ with parameter vector $\beta$. The latter is estimated penalized $L_2$-smoothing

with differential operator $\mathcal{D}$

Smoothing can be applied ex-post or inside of the algorithm ( `r fontawesome::fa("arrow-right", fill ="#000000")` [Simulation](#simulation)).
Computation is easy, since we have an analytical solution.

:::

@@ -840,14 +840,119 @@ We receive the constant solution for high values of $\lambda$ when setting $d=1$

::::
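A sketch of this penalized smoothing step in R (illustrative assumptions: cubic B-splines from the base `splines` package, a second-order difference penalty, and made-up pointwise weights; the analytical solution is the hat-matrix fit):

```r
library(splines)

P      <- 99
p_grid <- (1:P) / (P + 1)                                     # probability grid on (0, 1)
B      <- bs(p_grid, df = 12, degree = 3, intercept = TRUE)   # cubic B-spline basis (P x L)
D      <- diff(diag(ncol(B)), differences = 2)                # discrete penalty, order d = 2
lambda <- 10

# Analytical solution of the penalized least-squares problem: the hat matrix.
H <- B %*% solve(crossprod(B) + lambda * crossprod(D), t(B))

w_raw    <- plogis(5 * (p_grid - 0.5)) + rnorm(P, sd = 0.05)  # hypothetical raw weights w_{t,k}
w_smooth <- drop(H %*% w_raw)                                 # smoothed weights on the grid
```

Large $\lambda$ pushes the fit toward the null space of the penalty, which is why a first-order penalty ($d=1$) yields the constant solution in the limit, as noted above.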
## Basis Smoothing

:::: {.columns}

::: {.column width="48%"}

Represent weights as linear combinations of bounded basis functions:

\begin{equation}
w_{t,k} = \sum_{l=1}^L \beta_{t,k,l} \varphi_l = \boldsymbol \beta_{t,k}' \boldsymbol \varphi
\end{equation}

A popular choice is B-splines as local basis functions

$\boldsymbol \beta_{t,k}$ is calculated using a reduced regret matrix:

\begin{equation}
\underbrace{\boldsymbol r_{t}}_{\text{LxK}} = \frac{L}{P} \underbrace{\boldsymbol B'}_{\text{LxP}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{\text{PxK}}
\end{equation}

`r fontawesome::fa("arrow-right", fill ="#000000")` $\boldsymbol r_{t}$ is transformed from PxK to LxK

If $L = P$, it holds that $\boldsymbol \varphi = \boldsymbol{I}$
For $L = 1$, we receive constant weights

:::

::: {.column width="2%"}

:::

::: {.column width="48%"}

Weights converge to the constant solution if $L\rightarrow 1$

<center>
<img src="/assets/crps_learning/weights_kstep.gif">
</center>

:::

::::

::::
---

## The Proposed CRPS-Learning Algorithm

```{r, fig.align="left", echo=FALSE, out.width = "1000px", cache = TRUE}
knitr::include_graphics("assets/crps_learning/algorithm_1.svg")
```
<br/>

::: {style="font-size: 85%;"}

:::: {.columns}

::: {.column width="43%"}

### Initialization:

Array of expert predictions: $\widehat{X}_{t,p,k}$

Vector of prediction targets: $Y_t$

Starting weights: $\boldsymbol w_0=(w_{0,1},\ldots, w_{0,K})$

Penalization parameter: $\lambda\geq 0$

B-spline and penalty matrices $\boldsymbol B$ and $\boldsymbol D$ on $\mathcal{P}= (p_1,\ldots,p_M)$

Hat matrix: $$\boldsymbol{\mathcal{H}} = \boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'$$

Cumulative regret: $R_{0,k} = 0$

Range parameter: $E_{0,k}=0$

Starting pseudo-weights: $\boldsymbol \beta_0 = \boldsymbol B^{\text{pinv}}\boldsymbol w_0(\boldsymbol{\mathcal{P}})$

:::

::: {.column width="2%"}

:::

::: {.column width="55%"}

### Core:

for( t in 1:T ) {

$\widetilde{\boldsymbol X}_{t} = \text{Sort}\left( \boldsymbol w_{t-1}'(\boldsymbol P) \widehat{\boldsymbol X}_{t} \right)$ <b style="color: var(--col_grey_7);"># Prediction</b>

$\boldsymbol r_{t} = \frac{L}{M} \boldsymbol B' \left({\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\boldsymbol{\mathcal P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)$

$\boldsymbol E_{t} = \max(\boldsymbol E_{t-1}, \boldsymbol r_{t}^+ + \boldsymbol r_{t}^-)$

$\boldsymbol V_{t} = \boldsymbol V_{t-1} + \boldsymbol r_{t}^{ \odot 2}$

$\boldsymbol \eta_{t} =\min\left( \left(-\log(\boldsymbol \beta_{0}) \odot \boldsymbol V_{t}^{\odot -1} \right)^{\odot\frac{1}{2}} , \frac{1}{2}\boldsymbol E_{t}^{\odot-1}\right)$

$\boldsymbol R_{t} = \boldsymbol R_{t-1}+ \boldsymbol r_{t} \odot \left( \boldsymbol 1 - \boldsymbol \eta_{t} \odot \boldsymbol r_{t} \right)/2 + \boldsymbol E_{t} \odot \mathbb{1}\{-2\boldsymbol \eta_{t}\odot \boldsymbol r_{t} > 1\}$

$\boldsymbol \beta_{t} = K \boldsymbol \beta_{0} \odot \boldsymbol {SoftMax}\left( - \boldsymbol \eta_{t} \odot \boldsymbol R_{t} + \log( \boldsymbol \eta_{t}) \right)$

$\boldsymbol w_{t}(\boldsymbol P) = \underbrace{\boldsymbol B(\boldsymbol B'\boldsymbol B+ \lambda (\alpha \boldsymbol D_1'\boldsymbol D_1 + (1-\alpha) \boldsymbol D_2'\boldsymbol D_2))^{-1} \boldsymbol B'}_{\boldsymbol{\mathcal{H}}} \boldsymbol B \boldsymbol \beta_{t}$

}

:::

::::

:::
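A strongly simplified R sketch of this pipeline (illustration only, not the algorithm above: a plain exponential-weighting step on cumulative quantile losses stands in for the BOA recursion, and the data, experts, basis dimension, $\lambda$, and $\eta$ are all made up):

```r
library(splines)

set.seed(1)
n_t <- 300; K <- 2
p_grid <- seq(0.05, 0.95, by = 0.05); P <- length(p_grid)

y     <- rnorm(n_t)
X_hat <- array(dim = c(n_t, P, K))                                   # expert quantile forecasts
X_hat[, , 1] <- matrix(qnorm(p_grid, 0, 1),   n_t, P, byrow = TRUE)  # well calibrated expert
X_hat[, , 2] <- matrix(qnorm(p_grid, 0.5, 2), n_t, P, byrow = TRUE)  # miscalibrated expert

ql <- function(q, y, p) (as.numeric(y < q) - p) * (q - y)            # quantile loss

B <- bs(p_grid, df = 6, degree = 3, intercept = TRUE)                # B-spline basis
D <- diff(diag(ncol(B)), differences = 2)
H <- B %*% solve(crossprod(B) + 2 * crossprod(D), t(B))              # hat matrix, lambda = 2

w <- matrix(1 / K, P, K)                                             # pointwise weights w_{t,p,k}
cumloss <- matrix(0, P, K)
eta <- 2
pred <- matrix(NA_real_, n_t, P)

for (t in seq_len(n_t)) {
  pred[t, ] <- sort(rowSums(w * X_hat[t, , ]))                       # combine, then sort quantiles
  cumloss <- cumloss + sapply(1:K, function(k) ql(X_hat[t, , k], y[t], p_grid))
  w_raw   <- exp(-eta * cumloss)
  w_raw   <- w_raw / rowSums(w_raw)                                  # exponential weighting per p
  w       <- apply(w_raw, 2, function(col) drop(H %*% col))          # smooth weights across p
  w       <- pmax(w, 0); w <- w / rowSums(w)                         # keep weights on the simplex
}

round(w[c(1, 10, 19), ], 2)                                          # weights at p = 0.05, 0.5, 0.95
```

With these made-up experts, the weights should concentrate on the first (well calibrated) expert across all quantile levels.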
## Simulation Study

@@ -1437,19 +1542,6 @@ BOA > 16 -->

## Outline

```{r, include=FALSE}
col_lightgray <- "#e7e7e7"
col_blue <- "#000088"
col_smooth_expost <- "#a7008b"
col_smooth <- "#187a00"
col_pointwise <- "#008790"
col_constant <- "#dd9002"
col_optimum <- "#666666"
col_green <- "#61B94C"
col_orange <- "#ffa600"
col_yellow <- "#FCE135"
```

</br>

**Multivariate CRPS Learning**