---
|
|
title: "Data Science Methods for Forecasting in Energy and Economics"
|
|
date: 2025-07-10
|
|
author:
|
|
- name: Jonathan Berrisch
|
|
affiliations:
|
|
- ref: hemf
|
|
affiliations:
|
|
- id: hemf
|
|
name: University of Duisburg-Essen, House of Energy Markets and Finance
|
|
format:
|
|
revealjs:
|
|
embed-resources: true
|
|
footer: ""
|
|
logo: assets/logos_combined.png
|
|
theme: [default, clean.scss]
|
|
smaller: true
|
|
fig-format: svg
|
|
execute:
|
|
daemon: false
|
|
highlight-style: github
|
|
---
|
|
|
|
<!--
|
|
Render with: quarto preview /home/jonathan/git/PHD-Presentation/25_07_phd_defense/index.qmd --no-browser --port 6074
|
|
-->
|
|
|
|
## Outline
|
|
|
|
::: {.hidden}
|
|
$$
|
|
\newcommand{\A}{{\mathbb A}}
|
|
$$
|
|
:::
|
|
|
|
<br>
|
|
|
|
::: {style="font-size: 150%;"}
|
|
|
|
[{{< fa bars-staggered >}}]{style="color: #404040;"}   Introduction & Research Motivation
|
|
|
|
[{{< fa bars-staggered >}}]{style="color: #404040;"}   Overview of the Thesis
|
|
|
|
[{{< fa table >}}]{style="color: #404040;"}   Online Learning
|
|
|
|
[{{< fa circle-nodes >}}]{style="color: #404040;"}   Probabilistic Forecasting of European Carbon and Energy Prices
|
|
|
|
[{{< fa lightbulb >}}]{style="color: #404040;"}   Limitations
|
|
|
|
[{{< fa binoculars >}}]{style="color: #404040;"}   Contributions & Outlook
|
|
|
|
:::
|
|
|
|
## PhD Defense
|
|
|
|
```{r, setup, include=FALSE}
|
|
# Compile with: rmarkdown::render("crps_learning.Rmd")
|
|
library(latex2exp)
|
|
library(ggplot2)
|
|
library(dplyr)
|
|
library(tidyr)
|
|
library(purrr)
|
|
library(kableExtra)
|
|
knitr::opts_chunk$set(
|
|
dev = "svglite" # Use svg figures
|
|
)
|
|
library(RefManageR)
|
|
BibOptions(
|
|
check.entries = TRUE,
|
|
bib.style = "authoryear",
|
|
cite.style = "authoryear",
|
|
style = "html",
|
|
hyperlink = TRUE,
|
|
dashed = FALSE
|
|
)
|
|
my_bib <- ReadBib("assets/library.bib", check = FALSE)
|
|
col_lightgray <- "#e7e7e7"
|
|
col_blue <- "#000088"
|
|
col_smooth_expost <- "#a7008b"
|
|
col_smooth <- "#187a00"
|
|
col_pointwise <- "#008790"
|
|
col_constant <- "#dd9002"
|
|
col_optimum <- "#666666"
|
|
```
|
|
|
|
```{r xaringan-panelset, echo=FALSE}
|
|
xaringanExtra::use_panelset()
|
|
```
|
|
|
|
```{r xaringanExtra-freezeframe, echo=FALSE}
|
|
xaringanExtra::use_freezeframe(responsive = TRUE)
|
|
```
|
|
|
|
# Outline
|
|
|
|
- [Motivation](#motivation)
|
|
- [The Framework of Prediction under Expert Advice](#pred_under_exp_advice)
|
|
- [The Continuous Ranked Probability Score](#crps)
|
|
- [Optimality of (Pointwise) CRPS-Learning](#crps_optim)
|
|
- [A Simple Probabilistic Example](#simple_example)
|
|
- [The Proposed CRPS-Learning Algorithm](#proposed_algorithm)
|
|
- [Simulation Results](#simulation)
|
|
- [Possible Extensions](#extensions)
|
|
- [Application Study](#application)
|
|
- [Wrap-Up](#conclusion)
|
|
- [References](#references)
|
|
|
|
---
|
|
|
|
# Motivation
|
|
|
|
|
|
|
|
## Motivation
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
The Idea:
|
|
|
|
- Combine multiple forecasts instead of choosing one
|
|
|
|
- Combination weights may vary over **time**, over the **distribution** or **both**
|
|
|
|
2 Popular options for combining distributions:
|
|
|
|
- Combining across quantiles (this paper)
|
|
- Horizontal aggregation, vincentization
|
|
- Combining across probabilities
|
|
- Vertical aggregation
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
::: {.panel-tabset}
|
|
|
|
## Time
|
|
|
|
```{r, echo = FALSE, fig.height=6}
|
|
par(mfrow = c(3, 3), mar = c(2, 2, 2, 2))
|
|
set.seed(1)
|
|
# Data
|
|
X <- matrix(ncol = 3, nrow = 15)
|
|
X[, 1] <- seq(from = 8, to = 12, length.out = 15) + 0.25 * rnorm(15)
|
|
X[, 2] <- 10 + 0.25 * rnorm(15)
|
|
X[, 3] <- seq(from = 12, to = 8, length.out = 15) + 0.25 * rnorm(15)
|
|
# Weights
|
|
w <- matrix(ncol = 3, nrow = 15)
|
|
w[, 1] <- sin(0.1 * 1:15)
|
|
w[, 2] <- cos(0.1 * 1:15)
|
|
w[, 3] <- seq(from = -2, 0.25, length.out = 15)^2
|
|
w <- (w / rowSums(w))
|
|
# Vis
|
|
plot(X[, 1],
|
|
lwd = 4,
|
|
type = "l",
|
|
ylim = c(8, 12),
|
|
xlab = "",
|
|
ylab = "",
|
|
xaxt = "n",
|
|
yaxt = "n",
|
|
bty = "n",
|
|
col = "#2050f0"
|
|
)
|
|
plot(w[, 1],
|
|
lwd = 4, type = "l",
|
|
ylim = c(0, 1),
|
|
xlab = "",
|
|
ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#2050f0"
|
|
)
|
|
text(6, 0.5, TeX("$w_1(t)$"), cex = 2, col = "#2050f0")
|
|
arrows(13, 0.25, 15, 0.0, , lwd = 4, bty = "n")
|
|
plot.new()
|
|
plot(X[, 2],
|
|
lwd = 4,
|
|
type = "l", ylim = c(8, 12),
|
|
xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple"
|
|
)
|
|
plot(w[, 2],
|
|
lwd = 4, type = "l",
|
|
ylim = c(0, 1),
|
|
xlab = "",
|
|
ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple"
|
|
)
|
|
text(6, 0.6, TeX("$w_2(t)$"), cex = 2, col = "purple")
|
|
arrows(13, 0.5, 15, 0.5, , lwd = 4, bty = "n")
|
|
plot(rowSums(X * w), lwd = 4, type = "l", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#298829")
|
|
plot(X[, 3],
|
|
lwd = 4,
|
|
type = "l", ylim = c(8, 12),
|
|
xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4"
|
|
)
|
|
plot(w[, 3],
|
|
lwd = 4, type = "l",
|
|
ylim = c(0, 1),
|
|
xlab = "",
|
|
ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4"
|
|
)
|
|
text(6, 0.25, TeX("$w_3(t)$"), cex = 2, col = "#e423b4")
|
|
arrows(13, 0.75, 15, 1, , lwd = 4, bty = "n")
|
|
```
|
|
|
|
## Distribution
|
|
|
|
```{r, echo = FALSE, fig.height=6}
|
|
par(mfrow = c(3, 3), mar = c(2, 2, 2, 2))
|
|
set.seed(1)
|
|
# Data
|
|
X <- matrix(ncol = 3, nrow = 31)
|
|
|
|
X[, 1] <- dchisq(0:30, df = 10)
|
|
X[, 2] <- dnorm(0:30, mean = 15, sd = 5)
|
|
X[, 3] <- dexp(0:30, 0.2)
|
|
# Weights
|
|
w <- matrix(ncol = 3, nrow = 31)
|
|
w[, 1] <- sin(0.05 * 0:30)
|
|
w[, 2] <- cos(0.05 * 0:30)
|
|
w[, 3] <- seq(from = -2, 0.25, length.out = 31)^2
|
|
w <- (w / rowSums(w))
|
|
# Vis
|
|
plot(X[, 1],
|
|
lwd = 4,
|
|
type = "l",
|
|
xlab = "",
|
|
ylab = "",
|
|
xaxt = "n",
|
|
yaxt = "n",
|
|
bty = "n",
|
|
col = "#2050f0"
|
|
)
|
|
plot(X[, 2],
|
|
lwd = 4,
|
|
type = "l",
|
|
xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple"
|
|
)
|
|
plot(X[, 3],
|
|
lwd = 4,
|
|
type = "l",
|
|
xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4"
|
|
)
|
|
plot(w[, 1],
|
|
lwd = 4, type = "l",
|
|
ylim = c(0, 1),
|
|
xlab = "",
|
|
ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#2050f0"
|
|
)
|
|
text(12, 0.5, TeX("$w_1(x)$"), cex = 2, col = "#2050f0")
|
|
arrows(26, 0.25, 31, 0.0, , lwd = 4, bty = "n")
|
|
plot(w[, 2],
|
|
lwd = 4, type = "l",
|
|
ylim = c(0, 1),
|
|
xlab = "",
|
|
ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple"
|
|
)
|
|
text(15, 0.5, TeX("$w_2(x)$"), cex = 2, col = "purple")
|
|
arrows(15, 0.25, 15, 0, , lwd = 4, bty = "n")
|
|
plot(w[, 3],
|
|
lwd = 4, type = "l",
|
|
ylim = c(0, 1),
|
|
xlab = "",
|
|
ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4"
|
|
)
|
|
text(20, 0.5, TeX("$w_3(x)$"), cex = 2, col = "#e423b4")
|
|
arrows(5, 0.25, 0, 0, , lwd = 4, bty = "n")
|
|
plot.new()
|
|
plot(rowSums(X * w), lwd = 4, type = "l", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#298829")
|
|
```
|
|
|
|
:::
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
# The Framework of Prediction under Expert Advice
|
|
|
|
## The Framework of Prediction under Expert Advice
|
|
|
|
### The sequential framework
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Each day, $t = 1, 2, ... T$
|
|
- The **forecaster** receives predictions $\widehat{X}_{t,k}$ from $K$ **experts**
|
|
- The **forecaster** assigns weights $w_{t,k}$ to each **expert**
|
|
- The **forecaster** calculates her prediction:
|
|
\begin{equation}
|
|
\widetilde{X}_{t} = \sum_{k=1}^K w_{t,k} \widehat{X}_{t,k}.
|
|
\label{eq_forecast_def}
|
|
\end{equation}
|
|
- The realization for $t$ is observed
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
- The experts can be institutions, persons, or models
|
|
- The forecasts can be point-forecasts (i.e., mean or median) or full predictive distributions
|
|
- We do not need any assumptions concerning the underlying data
|
|
- `r Citet(my_bib, "cesa2006prediction")`
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
---
|
|
|
|
## The Regret
|
|
|
|
Weights are updated sequentially according to the past performance of the $K$ experts.
|
|
|
|
That is, a loss function $\ell$ is needed. This is used to compute the **cumulative regret** $R_{t,k}$
|
|
|
|
\begin{equation}
|
|
R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)
|
|
\label{eq_regret}
|
|
\end{equation}
|
|
|
|
The cumulative regret:
|
|
- Indicates the predictive accuracy of the expert $k$ until time $t$.
|
|
- Measures how much the forecaster *regrets* not having followed the expert's advice
|
|
|
|
Popular loss functions for point forecasting `r Citet(my_bib, "gneiting2011making")`:
|
|
|
|
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
- $\ell_2$-loss $\ell_2(x, y) = | x -y|^2$
|
|
- optimal for mean prediction
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
- $\ell_1$-loss $\ell_1(x, y) = | x -y|$
|
|
- optimal for median predictions
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
|
|
## Popular Aggregation Algorithms
|
|
|
|
#### The naive combination
|
|
|
|
\begin{equation}
|
|
w_{t,k}^{\text{Naive}} = \frac{1}{K}
|
|
\end{equation}
|
|
|
|
#### The exponentially weighted average forecaster (EWA)
|
|
|
|
\begin{align}
|
|
w_{t,k}^{\text{EWA}} & = \frac{e^{\eta R_{t,k}} }{\sum_{k = 1}^K e^{\eta R_{t,k}}}
|
|
=
|
|
\frac{e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }{\sum_{k = 1}^K e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }
|
|
\label{eq_ewa_general}
|
|
\end{align}
|
|
|
|
#### The polynomial weighted aggregation (PWA)
|
|
|
|
\begin{align}
|
|
w_{t,k}^{\text{PWA}} & = \frac{ 2(R_{t,k})^{q-1}_{+} }{ \|(R_t)_{+}\|^{q-2}_q}
|
|
\label{eq_pwa_general}
|
|
\end{align}
|
|
|
|
with $q\geq 2$ and $x_{+}$ denoting the vector of positive parts of $x$.
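
As an illustration, the naive and EWA weights can be computed in a few lines of R; the data, the $\ell_2$-loss, and the fixed learning rate are placeholder choices for this sketch, not the settings used later.

```{r, echo = TRUE, eval = FALSE}
# Sketch: naive and (recursive) EWA weights for K point-forecast experts
set.seed(1)
T <- 100; K <- 3
y <- rnorm(T, mean = 10)                                  # observations
X <- sapply(c(9, 10, 11), function(m) rnorm(T, mean = m)) # expert forecasts (T x K)
eta <- 0.1                                                # fixed learning rate (placeholder)

w_naive <- rep(1 / K, K)                                  # 1/K for every expert
w_ewa <- matrix(1 / K, nrow = T + 1, ncol = K)            # w_ewa[t, ] is used at time t
for (t in 1:T) {
  loss <- (X[t, ] - y[t])^2                               # l2-loss of each expert
  w_new <- exp(-eta * loss) * w_ewa[t, ]                  # recursive EWA update
  w_ewa[t + 1, ] <- w_new / sum(w_new)
}
```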
|
|
|
|
## Optimality
|
|
|
|
In stochastic settings, the cumulative risk should be analyzed `r Citet(my_bib, "wintenberger2017optimal")`:
|
|
|
|
\begin{align}
|
|
\underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}} \qquad\qquad\qquad \text{ and } \qquad\qquad\qquad
|
|
\underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}}
|
|
\label{eq_def_cumrisk}
|
|
\end{align}
|
|
|
|
There are two problems that an algorithm should solve in iid settings:
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
### The selection problem
|
|
\begin{equation}
|
|
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \stackrel{t\to \infty}{\rightarrow} a \quad \text{with} \quad a \leq 0.
|
|
\label{eq_opt_select}
|
|
\end{equation}
|
|
The forecaster is asymptotically not worse than the best expert $\widehat{\mathcal{R}}_{t,\min}$.
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
### The convex aggregation problem
|
|
|
|
\begin{equation}
|
|
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \stackrel{t\to \infty}{\rightarrow} b \quad \text{with} \quad b \leq 0 .
|
|
\label{eq_opt_conv}
|
|
\end{equation}
|
|
The forecaster is asymptotically not worse than the best convex combination $\widehat{X}_{t,\pi}$ in hindsight (**oracle**).
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Optimality
|
|
|
|
Satisfying the convexity property \eqref{eq_opt_conv} comes at the cost of slower possible convergence.
|
|
|
|
According to `r Citet(my_bib, "wintenberger2017optimal")`, an algorithm has optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} if
|
|
|
|
\begin{align}
|
|
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & =
|
|
\mathcal{O}\left(\frac{\log(K)}{t}\right)\label{eq_optp_select}
|
|
\end{align}
|
|
|
|
\begin{align}
|
|
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) & =
|
|
\mathcal{O}\left(\sqrt{\frac{\log(K)}{t}}\right)
|
|
\label{eq_optp_conv}
|
|
\end{align}
|
|
|
|
Algorithms can satisfy both \eqref{eq_optp_select} and \eqref{eq_optp_conv} depending on:
|
|
|
|
- The loss function
|
|
- Regularity conditions on $Y_t$ and $\widehat{X}_{t,k}$
|
|
- The weighting scheme
|
|
|
|
## Optimality
|
|
|
|
According to `r Citet(my_bib, "cesa2006prediction")`, EWA \eqref{eq_ewa_general} satisfies the optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if:
|
|
- Loss $\ell$ is exp-concave
|
|
- Learning-rate $\eta$ is chosen correctly
|
|
|
|
Those results can be converted to stochastic iid settings `r Citet(my_bib, "kakade2008generalization")` `r Citet(my_bib, "gaillard2014second")`.
|
|
|
|
The optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the gradient trick, which linearizes the loss:
|
|
\begin{align}
|
|
\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
|
|
\end{align}
|
|
$\ell'$ is the subgradient of $\ell$ in its first coordinate evaluated at forecast combination $\widetilde{X}$.
|
|
|
|
Combining probabilistic forecasts calls for a probabilistic loss function
|
|
|
|
:::: {.notes}
|
|
|
|
We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \ref{eq_optp_select} and \ref{eq_optp_conv}.
|
|
|
|
::::
|
|
|
|
## The Continuous Ranked Probability Score
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**An appropriate choice:**
|
|
|
|
\begin{align*}
|
|
\text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx
|
|
\label{eq_crps}
|
|
\end{align*}
|
|
|
|
It's strictly proper `r Citet(my_bib, "gneiting2007strictly")`.
|
|
|
|
Using the CRPS, we can calculate time-adaptive weights $w_{t,k}$. However, what if the experts' performance is not uniform over all parts of the distribution?
|
|
|
|
The idea: utilize this relation:
|
|
|
|
\begin{align*}
|
|
\text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) \, d p.
|
|
\label{eq_crps_qs}
|
|
\end{align*}
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
to combine quantiles of the probabilistic forecasts individually using the quantile-loss (QL):
|
|
\begin{align*}
|
|
\text{QL}_p(q, y) & = (\mathbb{1}\{y < q\} -p)(q - y)
|
|
\end{align*}
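
A short numerical sketch of this relation, approximating the CRPS of a standard normal forecast by averaging the quantile loss over a probability grid (the grid and the forecast are assumptions of this sketch only):

```{r, echo = TRUE, eval = FALSE}
# Sketch: CRPS approximated as 2 x mean quantile loss over a probability grid
ql <- function(q, y, p) (as.numeric(y < q) - p) * (q - y)   # QL_p(q, y)

p_grid <- 1:99 / 100                 # probability grid (placeholder)
y_obs  <- 0.3                        # one observation
q_fc   <- qnorm(p_grid)              # quantile forecast of a standard normal

2 * mean(ql(q_fc, y_obs, p_grid))    # approximates CRPS(F, y)
```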
|
|
|
|
</br>
|
|
|
|
**But is it optimal?**
|
|
|
|
CRPS is exp-concave `r fontawesome::fa("check", fill ="#00b02f")`
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` EWA \eqref{eq_ewa_general} with CRPS satisfies \eqref{eq_optp_select} and \eqref{eq_optp_conv}
|
|
|
|
QL is convex, but not exp-concave `r fontawesome::fa("exclamation", fill ="#ffa600")`
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition while almost keeping optimal convergence
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## CRPS-Learning Optimality
|
|
|
|
For convex losses, BOAG satisfies that there exists a $C>0$ such that for all $x>0$ it holds that
|
|
\begin{equation}
|
|
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right) \right) \geq
|
|
1-e^{-x}
|
|
\label{eq_boa_opt_conv}
|
|
\end{equation}
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *convex aggregation* \eqref{eq_optp_conv} `r Citet(my_bib, "wintenberger2017optimal")` .
|
|
|
|
The same algorithm satisfies that there exists a $C>0$ such that for all $x>0$ it holds that
|
|
\begin{equation}
|
|
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq
|
|
C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq
|
|
1-e^{-x}
|
|
\label{eq_boa_opt_select}
|
|
\end{equation}
|
|
|
|
if $Y_t$ is bounded and the considered loss $\ell$ is convex, $G$-Lipschitz, and weakly exp-concave in its first coordinate.

That is, it holds for losses that satisfy **A1** and **A2**.
|
|
|
|
## CRPS-Learning Optimality
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**A1**
|
|
|
|
For some $G>0$ it holds
|
|
for all $x_1,x_2\in \mathbb{R}$ and $t>0$ that
|
|
|
|
$$ | \ell(x_1, Y_t)-\ell(x_2, Y_t) | \leq G |x_1-x_2|$$
|
|
|
|
**A2** For some $\alpha>0$, $\beta\in[0,1]$ it holds
|
|
for all $x_1,x_2 \in \mathbb{R}$ and $t>0$ that
|
|
|
|
\begin{align*}
|
|
\mathbb{E}[
|
|
& \ell(x_1, Y_t)-\ell(x_2, Y_t) | \mathcal{F}_{t-1}] \leq \\
|
|
& \mathbb{E}[ \ell'(x_1, Y_t)(x_1 - x_2) |\mathcal{F}_{t-1}] \\
|
|
& +
|
|
\mathbb{E}\left[ \left. \left( \alpha(\ell'(x_1, Y_t)(x_1 - x_2))^{2}\right)^{1/\beta} \right|\mathcal{F}_{t-1}\right]
|
|
\end{align*}
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *selection* \eqref{eq_optp_select} `r Citet(my_bib, "gaillard2018efficient")`.
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**Lemma 1**
|
|
|
|
\begin{align}
|
|
2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min}
|
|
& \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\min}
|
|
\label{eq_risk_ql_crps_expert} \\
|
|
2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi}
|
|
& \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\pi} .
|
|
\label{eq_risk_ql_crps_convex}
|
|
\end{align}
|
|
|
|
Pointwise can outperform constant procedures
|
|
|
|
QL is convex but not exp-concave:
|
|
|
|
`r fontawesome::fa("arrow-right")` Almost optimal convergence w.r.t. *convex aggregation* \eqref{eq_boa_opt_conv} `r fontawesome::fa("check", fill ="#00b02f")` </br>
|
|
|
|
For almost optimal convergence w.r.t. *selection* \eqref{eq_boa_opt_select} we need to check **A1** and **A2**:
|
|
|
|
QL is Lipschitz continuous:
|
|
|
|
`r fontawesome::fa("arrow-right")` **A1** holds `r fontawesome::fa("check", fill ="#ffa600")` </br>
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
|
|
## CRPS-Learning Optimality
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Conditional quantile risk: $\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y_t) | \mathcal{F}_{t-1}]$.
|
|
|
|
`r fontawesome::fa("arrow-right")` convexity properties of $\mathcal{Q}_p$ depend on the
|
|
conditional distribution $Y_t|\mathcal{F}_{t-1}$.
|
|
|
|
**Proposition 1**
|
|
|
|
Let $Y$ be a univariate random variable with (Radon-Nikodym) $\nu$-density $f$, then for the second subderivative of the quantile risk
|
|
$\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y) ]$
|
|
of $Y$ it holds for all $p\in(0,1)$ that
|
|
$\mathcal{Q}_p'' = f.$
|
|
Additionally, if $f$ is a continuous Lebesgue-density with $f\geq\gamma>0$ for some constant $\gamma>0$ on its support $\text{spt}(f)$, then $\mathcal{Q}_p$ is $\gamma$-strongly convex.
|
|
|
|
Strong convexity with $\beta=1$ implies **A2** `r fontawesome::fa("check", fill ="#ffa600")` `r Citet(my_bib, "gaillard2018efficient")`
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
`r fontawesome::fa("arrow-right")` **A1** and **A2** give us almost optimal convergence w.r.t. selection \eqref{eq_boa_opt_select} `r fontawesome::fa("check", fill ="#00b02f")` </br>
|
|
|
|
**Theorem 1**
|
|
|
|
The gradient based fully adaptive Bernstein online aggregation (BOAG) applied pointwise for all $p\in(0,1)$ on $\text{QL}$ satisfies
|
|
\eqref{eq_boa_opt_conv} with minimal CRPS given by
|
|
|
|
$$\widehat{\mathcal{R}}_{t,\pi} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi}.$$
|
|
|
|
If $Y_t|\mathcal{F}_{t-1}$ is bounded
|
|
and has a pdf $f_t$ satisfying $f_t>\gamma>0$ on its support $\text{spt}(f_t)$, then \eqref{eq_boa_opt_select} holds with $\beta=1$ and
|
|
|
|
$$\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min}.$$
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
|
|
## A Probabilistic Example
|
|
|
|
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Simple Example:
|
|
|
|
|
|
\begin{align}
|
|
Y_t & \sim \mathcal{N}(0,\,1) \\
|
|
\widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\
|
|
\widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4)
|
|
\label{eq:dgp_sim1}
|
|
\end{align}
|
|
|
|
- True weights vary over $p$
|
|
- Figures show the ECDF and calculated weights using $T=25$ realizations
|
|
- Pointwise solution creates rough estimates
|
|
- Pointwise is better than constant
|
|
- Smooth solution is better than pointwise
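
The figures aside are based on pre-computed results; a minimal sketch of how the DGP and the experts' quantile forecasts could be simulated (interpreting the second parameter of $\mathcal{N}$ as a variance) is:

```{r, echo = TRUE, eval = FALSE}
# Sketch: simulate the DGP and the two experts' quantile forecasts
set.seed(1)
T <- 25
p_grid <- 1:99 / 100
y <- rnorm(T, mean = 0, sd = 1)                 # Y_t ~ N(0, 1)

q_expert1 <- qnorm(p_grid, mean = -1, sd = 1)   # F1 = N(-1, 1)
q_expert2 <- qnorm(p_grid, mean = 3, sd = 2)    # F2 = N(3, 4), i.e. sd = 2
```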
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
::: {.panel-tabset}
|
|
|
|
## CDFs
|
|
|
|
```{r, echo = FALSE, fig.width=7, fig.height=6, fig.align='center', cache = FALSE}
|
|
source("assets/01_common.R")
|
|
load("assets/crps_learning/01_motivation_01.RData")
|
|
ggplot(df, aes(x = x, y = y, xend = xend, yend = yend)) +
|
|
stat_function(
|
|
fun = pnorm, n = 10000,
|
|
args = list(mean = dev[2], sd = experts_sd[2]),
|
|
aes(col = "Expert 2"), size = 1.5
|
|
) +
|
|
stat_function(
|
|
fun = pnorm, n = 10000,
|
|
args = list(mean = dev[1], sd = experts_sd[1]),
|
|
aes(col = "Expert 1"), size = 1.5
|
|
) +
|
|
stat_function(
|
|
fun = pnorm,
|
|
n = 10000,
|
|
size = 1.5, aes(col = "DGP") # , linetype = "dashed"
|
|
) +
|
|
geom_point(aes(col = "ECDF"), size = 1.5, show.legend = FALSE) +
|
|
geom_segment(aes(col = "ECDF")) +
|
|
geom_segment(data = tibble(
|
|
x_ = -5,
|
|
xend_ = min(y),
|
|
y_ = 0,
|
|
yend_ = 0
|
|
), aes(x = x_, xend = xend_, y = y_, yend = yend_)) +
|
|
theme_minimal() +
|
|
theme(
|
|
text = element_text(size = text_size),
|
|
legend.position = "bottom",
|
|
legend.key.width = unit(1.5, "cm")
|
|
) +
|
|
ylab("Probability p") +
|
|
xlab("Value") +
|
|
scale_colour_manual(NULL, values = c("#969696", "#252525", col_auto, col_blue)) +
|
|
guides(color = guide_legend(
|
|
nrow = 2,
|
|
byrow = FALSE # ,
|
|
# override.aes = list(
|
|
# size = c(1.5, 1.5, 1.5, 1.5)
|
|
# )
|
|
)) +
|
|
scale_x_continuous(limits = c(-5, 7.5))
|
|
```
|
|
|
|
## Weights
|
|
|
|
```{r, echo = FALSE, fig.width=7, fig.height=6, fig.align='center', cache = FALSE}
|
|
source("assets/01_common.R")
|
|
load("assets/crps_learning/01_motivation_02.RData")
|
|
ggplot() +
|
|
geom_line(data = weights[weights$var != "1Optimum", ], size = 1.5, aes(x = prob, y = val, col = var)) +
|
|
geom_line(
|
|
data = weights[weights$var == "1Optimum", ], size = 1.5, aes(x = prob, y = val, col = var) # , linetype = "dashed"
|
|
) +
|
|
theme_minimal() +
|
|
theme(
|
|
text = element_text(size = text_size),
|
|
legend.position = "bottom",
|
|
legend.key.width = unit(1.5, "cm")
|
|
) +
|
|
xlab("Probability p") +
|
|
ylab("Weight w") +
|
|
scale_colour_manual(
|
|
NULL,
|
|
values = c("#969696", col_pointwise, col_p_constant, col_p_smooth),
|
|
labels = modnames[-c(3, 5)]
|
|
) +
|
|
guides(color = guide_legend(
|
|
ncol = 3,
|
|
byrow = FALSE,
|
|
title.hjust = 5,
|
|
# override.aes = list(
|
|
# linetype = c(rep("solid", 5), "dashed")
|
|
# )
|
|
)) +
|
|
ylim(c(0, 1))
|
|
```
|
|
|
|
::::
|
|
|
|
:::
|
|
|
|
:::
|
|
|
|
## The Smoothing Procedure
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
We are using penalized cubic b-splines:
|
|
|
|
Let $\varphi=(\varphi_1,\ldots, \varphi_L)$ be bounded basis functions on $(0,1)$ Then we approximate $w_{t,k}$ by
|
|
|
|
\begin{align}
|
|
w_{t,k}^{\text{smooth}} = \sum_{l=1}^L \beta_l \varphi_l = \beta'\varphi
|
|
\end{align}
|
|
|
|
with parameter vector $\beta$. The latter is estimated by penalized $L_2$-smoothing, which minimizes
|
|
|
|
\begin{equation}
|
|
\| w_{t,k} - \beta' \varphi \|^2_2 + \lambda \| \mathcal{D}^{d} (\beta' \varphi) \|^2_2
|
|
\label{eq_function_smooth}
|
|
\end{equation}
|
|
|
|
with differential operator $\mathcal{D}$
|
|
|
|
Smoothing can be applied ex-post or inside of the algorithm ( `r fontawesome::fa("arrow-right", fill ="#000000")` [Simulation](#simulation)).
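
A sketch of the corresponding hat matrix, built from a cubic B-spline basis and a first-order difference penalty as a stand-in for $\mathcal{D}^d$ (knot number and $\lambda$ are placeholders):

```{r, echo = TRUE, eval = FALSE}
# Sketch: penalized cubic B-spline smoothing of pointwise weights over p
library(splines)
p_grid <- 1:99 / 100
inner  <- seq(0, 1, length.out = 12)                   # equidistant knots (placeholder)
B <- splineDesign(c(rep(0, 3), inner, rep(1, 3)),
                  x = p_grid, ord = 4)                 # cubic B-spline basis
D <- diff(diag(ncol(B)), differences = 1)              # difference penalty, d = 1
lambda <- 10                                           # smoothing parameter (placeholder)

H <- B %*% solve(t(B) %*% B + lambda * t(D) %*% D) %*% t(B)   # hat matrix
w_pointwise <- runif(length(p_grid))                   # stand-in for raw weights w_{t,k}(p)
w_smooth <- as.numeric(H %*% w_pointwise)              # smoothed weights
```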
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
We obtain the constant solution for large values of $\lambda$ when setting $d=1$
|
|
|
|
<center>
|
|
<img src="assets/crps_learning/weights_lambda.gif">
|
|
</center>
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
|
|
# The Proposed CRPS-Learning Algorithm
|
|
|
|
---
|
|
|
|
## The Proposed CRPS-Learning Algorithm
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**Initialization:**
|
|
|
|
Array of expert predictions: $\widehat{X}_{t,k,p}$
|
|
|
|
Vector of Prediction targets: $Y_t$
|
|
|
|
Starting Weights: $w_0=(w_{0,1},\ldots, w_{0,K})$,
|
|
|
|
Penalization parameter: $\lambda\geq 0$
|
|
|
|
B-spline and penalty matrices $B$ and $D$ on $\mathcal{P}= (p_1,\ldots,p_M)$
|
|
|
|
Hat matrix: $$\mathcal{H} = B(B'B+ \lambda D'D)^{-1} B'$$
|
|
|
|
Cumulative Regret: $R_{0,k} = 0$
|
|
|
|
Range parameter: $E_{0,k}=0$
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**Core**:
|
|
|
|
for(t in 1:T) { for(p in $\mathcal{P}$) {
|
|
|
|
$\widetilde{X}_{t}(p) = \sum_{k=1}^K w_{t-1,k,p} \widehat{X}_{t,k}(p)$ [\# Prediction]{.grey}
|
|
|
|
for(k in 1:K){
|
|
|
|
$r_{t,k,p} = \text{QL}_p^{\nabla}(\widehat{X}_{t,k}(p),Y_t) - \text{QL}_p^{\nabla}(\widetilde{X}_{t}(p),Y_t)$
|
|
|
|
$E_{t,k,p} = \max(E_{t-1,k,p}, |r_{t,k,p}|)$
|
|
|
|
$\eta_{t,k,p}=\min\left(\frac{1}{2E_{t,k,p}}, \sqrt{\log(K)/ \sum_{i=1}^t r^2_{i,k,p}}\right)$
|
|
|
|
$R_{t,k,p} = R_{t-1,k,p} + \frac{1}{2} \left( r_{t,k,p} \left( 1+ \eta_{t,k,p} r_{t,k,p} \right) + 2E_{t,k,p} \mathbb{1}(\eta_{t,k,p}r_{t,k,p} > \frac{1}{2}) \right)$
|
|
|
|
$w_{t,k,p} = \eta_{t,k,p} \exp \left(- \eta_{t,k,p} R_{t,k,p} \right) w_{0,k,p} / \left( \frac{1}{K} \sum_{k = 1}^K \eta_{t,k,p} \exp \left( - \eta_{t,k,p} R_{t,k,p}\right) \right)$
|
|
|
|
} [\# k]{.grey} } [\# p]{.grey}
|
|
|
|
for(k in 1:K){
|
|
|
|
$w_{t,k} = \mathcal{H} w_{t,k}(\mathcal{P})$ [\# Smoothing]{.grey}
|
|
|
|
} [\# k]{.grey} } [\# t]{.grey}
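
A compact R transcription of this core loop for a single probability level $p$ (two simulated experts, no zero-guards; purely illustrative):

```{r, echo = TRUE, eval = FALSE}
# Sketch: BOAG update at one probability level p, mirroring the pseudo-code above
set.seed(1)
T <- 200; K <- 2; p <- 0.5
y <- rnorm(T)                                         # placeholder observations
Xhat <- matrix(c(qnorm(p, -1, 1), qnorm(p, 3, 2)),    # expert p-quantiles
               nrow = T, ncol = K, byrow = TRUE)

w <- w0 <- rep(1 / K, K)
R <- E <- r2sum <- rep(0, K)

for (t in 1:T) {
  Xtilde <- sum(w * Xhat[t, ])                        # combined p-quantile
  grad   <- as.numeric(y[t] < Xtilde) - p             # QL subgradient at the combination
  r      <- grad * (Xhat[t, ] - Xtilde)               # linearized instantaneous regret
  E      <- pmax(E, abs(r))
  r2sum  <- r2sum + r^2
  eta    <- pmin(1 / (2 * E), sqrt(log(K) / r2sum))
  R      <- R + 0.5 * (r * (1 + eta * r) + 2 * E * (eta * r > 0.5))
  w      <- eta * exp(-eta * R) * w0 / mean(eta * exp(-eta * R))
}
```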
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Simulation Study
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Data Generating Process of the [simple probabilistic example](#simple_example)
|
|
|
|
- Constant solution $\lambda \rightarrow \infty$
|
|
- Pointwise Solution of the proposed BOAG
|
|
- Smoothed Solution of the proposed BOAG
|
|
- Weights are smoothed during learning
|
|
- Smooth weights are used to calculate Regret, adjust weights, etc.
|
|
- Smooth ex-post solution
|
|
- Weights are smoothed after the learning
|
|
- Algorithm always uses non-smoothed weights
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
::: {.panel-tabset}
|
|
|
|
## QL Deviation
|
|
|
|

|
|
|
|
## CRPS vs. Lambda
|
|
|
|
CRPS Values for different $\lambda$ (1000 runs)
|
|
|
|

|
|
|
|
::::
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Simulation Study
|
|
|
|
The same simulation carried out for different algorithms (1000 runs):
|
|
|
|
<center>
|
|
<img src="assets/crps_learning/algos_constant.gif">
|
|
</center>
|
|
|
|
## Simulation Study
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**New DGP:**
|
|
|
|
\begin{align}
|
|
Y_t & \sim \mathcal{N}\left(\frac{\sin(0.005 \pi t )}{2},\,1\right) \\
|
|
\widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\
|
|
\widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4) \label{eq_dgp_sim2}
|
|
\end{align}
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` Changing optimal weights
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` Single run example depicted aside
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` No forgetting leads to long-term constant weights
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**Weights of expert 2**
|
|
|
|
```{r, echo = FALSE, fig.width=7, fig.height=5, fig.align='center', cache = FALSE}
|
|
load("assets/crps_learning/changing_weights.rds")
|
|
mod_labs <- c("Optimum", "Pointwise", "Smooth", "Constant")
|
|
names(mod_labs) <- c("TOptimum", "Pointwise", "Smooth", "Constant")
|
|
colseq <- c(grey(.99), "orange", "red", "purple", "blue", "darkblue", "black")
|
|
weights_preprocessed %>%
|
|
mutate(w = 1 - w) %>%
|
|
ggplot(aes(t, p, fill = w)) +
|
|
geom_raster(interpolate = TRUE) +
|
|
facet_grid(Mod ~ ., labeller = labeller(Mod = mod_labs)) +
|
|
theme_minimal() +
|
|
theme(
|
|
# plot.margin = unit(c(0.5, 0.5, 0.5, 0.5), "cm"),
|
|
text = element_text(size = 15),
|
|
legend.key.height = unit(1, "inch")
|
|
) +
|
|
scale_x_continuous(expand = c(0, 0)) +
|
|
xlab("Time t") +
|
|
scale_fill_gradientn(
|
|
limits = c(0, 1),
|
|
colours = colseq,
|
|
breaks = seq(0, 1, 0.2)
|
|
) +
|
|
ylab("Weight w")
|
|
```
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Simulation Results
|
|
|
|
The simulation using the new DGP carried out for different algorithms (1000 runs):
|
|
|
|
<center>
|
|
<img src="assets/crps_learning/algos_changing.gif">
|
|
</center>
|
|
|
|
## Possible Extensions
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**Forgetting**
|
|
|
|
- Only taking part of the old cumulative regret into account
|
|
- Exponential forgetting of past regret
|
|
|
|
\begin{align*}
|
|
R_{t,k} & = R_{t-1,k}(1-\xi) + \ell(\widetilde{F}_{t},Y_t) - \ell(\widehat{F}_{t,k},Y_t) \label{eq_regret_forget}
|
|
\end{align*}
|
|
|
|
**Fixed Shares** `r Citet(my_bib, "herbster1998tracking")`
|
|
|
|
- Adding fixed shares to the weights
|
|
- Shrinkage towards a constant solution
|
|
|
|
\begin{align*}
|
|
\widetilde{w}_{t,k} = \rho \frac{1}{K} + (1-\rho) w_{t,k}
|
|
\label{fixed_share_simple}.
|
|
\end{align*}
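
Both extensions are one-line modifications of the update; a sketch with placeholder values:

```{r, echo = TRUE, eval = FALSE}
# Sketch: regret forgetting and fixed-share shrinkage (xi and rho are placeholders)
xi  <- 0.01                              # forgetting rate
rho <- 0.05                              # fixed share

K <- 3
R <- c(0.4, -0.1, 0.2)                   # cumulative regrets so far (dummy values)
w <- c(0.6, 0.1, 0.3)                    # current weights (dummy values)
loss_comb    <- 1.0                      # loss of the combination at time t (dummy)
loss_experts <- c(0.9, 1.2, 1.1)         # losses of the experts at time t (dummy)

R <- R * (1 - xi) + (loss_comb - loss_experts)   # exponential forgetting of past regret
w <- rho / K + (1 - rho) * w                     # shrinkage towards the naive weights 1/K
```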
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**Non-Equidistant Knots**
|
|
|
|
- Non-equidistant spline-basis could be used
|
|
- Potentially improves the tail-behavior
|
|
- Destroys shrinkage towards constant
|
|
|
|
<center>
|
|
<img src="assets/crps_learning/uneven_grid.gif">
|
|
</center>
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Application Study: Overview
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="29%"}
|
|
|
|
Data:
|
|
|
|
- Forecasting European emission allowances (EUA)
|
|
- Daily month-ahead prices
|
|
- Jan 13 - Dec 20 (Phase III, 2092 Obs)
|
|
|
|
Combination methods:
|
|
|
|
- Naive, BOAG, EWAG, ML-PolyG, BMA
|
|
|
|
Tuning parameter grids:
|
|
|
|
- Smoothing Penalty: $\Lambda= \{0\}\cup \{2^x|x\in \{-4,-3.5,\ldots,12\}\}$
|
|
- Learning Rates: $\mathcal{E}= \{2^x|x\in \{-1,-0.5,\ldots,9\}\}$
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="69%"}
|
|
|
|
```{r, echo = FALSE, fig.width=7, fig.height=5, fig.align='center', cache = FALSE}
|
|
load("assets/crps_learning/overview_data.rds")
|
|
|
|
data %>%
|
|
ggplot(aes(x = Date, y = value)) +
|
|
geom_line(size = 1, col = col_blue) +
|
|
theme_minimal() +
|
|
ylab("Value") +
|
|
facet_wrap(. ~ name, scales = "free", ncol = 1) +
|
|
theme(
|
|
text = element_text(size = 15),
|
|
strip.background = element_blank(),
|
|
strip.text.x = element_blank()
|
|
) -> p1
|
|
|
|
data %>%
|
|
ggplot(aes(x = value)) +
|
|
geom_histogram(aes(y = ..density..), size = 1, fill = col_blue, bins = 50) +
|
|
ylab("Density") +
|
|
xlab("Value") +
|
|
theme_minimal() +
|
|
theme(
|
|
strip.background = element_rect(fill = col_lightgray, colour = col_lightgray),
|
|
text = element_text(size = 15)
|
|
) +
|
|
facet_wrap(. ~ name, scales = "free", ncol = 1, strip.position = "right") -> p2
|
|
|
|
overview <- cowplot::plot_grid(plotlist = list(p1, p2), align = "hv", axis = "tblr", rel_widths = c(0.65, 0.35))
|
|
overview
|
|
```
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Application Study: Experts
|
|
|
|
Simple exponential smoothing with additive errors (**ETS-ANN**):
|
|
|
|
\begin{align*}
|
|
Y_{t} = l_{t-1} + \varepsilon_t \quad \text{with} \quad l_t = l_{t-1} + \alpha \varepsilon_t \quad \text{and} \quad \varepsilon_t \sim \mathcal{N}(0,\sigma^2)
|
|
\end{align*}
|
|
|
|
Quantile regression (**QuantReg**): For each $p \in \mathcal{P}$ we assume:
|
|
|
|
\begin{align*}
|
|
F^{-1}_{Y_t}(p) = \beta_{p,0} + \beta_{p,1} Y_{t-1} + \beta_{p,2} |Y_{t-1}-Y_{t-2}|
|
|
\end{align*}
|
|
|
|
ARIMA(1,0,1)-GARCH(1,1) with Gaussian errors (**ARMA-GARCH**):
|
|
|
|
\begin{align*}
|
|
Y_{t} = \mu + \phi(Y_{t-1}-\mu) + \theta \varepsilon_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1)
|
|
\end{align*}
|
|
|
|
ARIMA(0,1,0)-I-EGARCH(1,1) with Gaussian errors (**I-EGARCH**):
|
|
|
|
\begin{align*}
|
|
Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \log(\sigma_t^2) = \omega + \alpha Z_{t-1} + \gamma (|Z_{t-1}|-\mathbb{E}|Z_{t-1}|) + \beta \log(\sigma_{t-1}^2) \quad \text{and} \quad Z_t \sim \mathcal{N}(0,1)
|
|
\end{align*}
|
|
|
|
ARIMA(0,1,0)-GARCH(1,1) with student-t errors (**I-GARCHt**):
|
|
|
|
\begin{align*}
|
|
Y_{t} = \mu + Y_{t-1} + \varepsilon_t \quad \text{with} \quad \varepsilon_t = \sigma_t Z_t, \quad \sigma_t^2 = \omega + \alpha \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2 \quad \text{and} \quad Z_t \sim t(0,1, \nu)
|
|
\end{align*}
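
For illustration, three of these experts could be estimated with standard R packages; the code below is a rough sketch on a simulated placeholder series, not the exact estimation setup of the study.

```{r, echo = TRUE, eval = FALSE}
# Sketch: fitting three experts with forecast, quantreg and rugarch (placeholder data)
library(forecast)
library(quantreg)
library(rugarch)

set.seed(1)
y <- 20 + as.numeric(arima.sim(model = list(ar = 0.9), n = 500))  # stand-in for EUA prices
n <- length(y)

# ETS-ANN: simple exponential smoothing with additive errors
fit_ets <- ets(y, model = "ANN")

# QuantReg: one quantile regression per probability p
df <- data.frame(
  y       = y[3:n],
  lag1    = y[2:(n - 1)],
  absdiff = abs(y[2:(n - 1)] - y[1:(n - 2)])
)
fit_qr <- rq(y ~ lag1 + absdiff, tau = 1:99 / 100, data = df)

# ARMA(1,1)-GARCH(1,1) with Gaussian errors
spec <- ugarchspec(
  mean.model         = list(armaOrder = c(1, 1), include.mean = TRUE),
  variance.model     = list(model = "sGARCH", garchOrder = c(1, 1)),
  distribution.model = "norm"
)
fit_garch <- ugarchfit(spec = spec, data = y)
```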
|
|
|
|
|
|
## Results
|
|
|
|
::: {.panel-tabset}
|
|
|
|
## Significance
|
|
|
|
```{r, echo = FALSE, fig.width=7, fig.height=5.5, fig.align='center', cache = FALSE, results='asis'}
|
|
load("assets/crps_learning/bernstein_application_study_estimations+learnings_rev1.RData")
|
|
|
|
quantile_loss <- function(X, y, tau) {
|
|
t(t(y - X) * tau) * (y - X > 0) + t(t(X - y) * (1 - tau)) * (y - X < 0)
|
|
}
|
|
QL <- FCSTN * NA
|
|
for (k in 1:dim(QL)[1]) {
|
|
QL[k, , ] <- quantile_loss(FCSTN[k, , ], as.numeric(yoos), Qgrid)
|
|
}
|
|
|
|
## TABLE AREA
|
|
|
|
KK <- length(mnames)
|
|
TTinit <- 1 ## without first, as all comb. are uniform
|
|
RQL <- apply(QL[1:KK, -c(1:TTinit), ], c(1, 3), mean)
|
|
dimnames(RQL) <- list(mnames, Qgrid)
|
|
RQLm <- apply(RQL, c(1), mean, na.rm = TRUE)
|
|
# sort(RQLm - RQLm[K + 1])
|
|
##
|
|
qq <- apply(QL[1:KK, -c(1:TTinit), ], c(1, 2), mean)
|
|
# t.test(qq[K + 1, ] - qq[K + 3, ])
|
|
# t.test(qq[K + 1, ] - qq[K + 4, ])
|
|
|
|
|
|
library(xtable)
|
|
Pall <- numeric(KK)
|
|
for (i in 1:KK) Pall[i] <- t.test(qq[K + 1, ] - qq[i, ], alternative = "greater")$p.val
|
|
|
|
Mall <- (RQLm - RQLm[K + 1]) * 10000
|
|
Mout <- matrix(Mall[-c(1:(K + 3))], 5, 6)
|
|
dimnames(Mout) <- list(moname, mtname)
|
|
|
|
Pallout <- format(round(Pall, 3), nsmall = 3)
|
|
Pallout[Pallout == "0.000"] <- "<.001"
|
|
Pallout[Pallout == "1.000"] <- ">.999"
|
|
|
|
MO <- K
|
|
IDX <- c(1:K)
|
|
OUT <- t(Mall[IDX])
|
|
OUT.num <- OUT
|
|
class(OUT.num) <- "numeric"
|
|
|
|
xxx <- OUT.num
|
|
xxxx <- OUT
|
|
table <- OUT
|
|
table_col <- OUT
|
|
i.p <- 1
|
|
for (i.p in 1:MO) {
|
|
xmax <- -min(Mall) * 5 # max(Mall)
|
|
xmin <- min(Mall)
|
|
cred <- rev(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .8, .5)) # , .5,0,0,0,1,1,1) ## red
|
|
cgreen <- rev(c(.5, .5, .55, .6, .65, .7, .75, .8, .85, .9, .95, 1, 1, .9)) # , .5,0,1,1,1,0,0) ## green
|
|
cblue <- rev(c(.55, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5)) # , .5,1,1,0,0,0,1) ## blue
|
|
crange <- c(xmin, xmax) ## range
|
|
## colors in plot:
|
|
fred <- round(approxfun(seq(crange[1], crange[2], length = length(cred)), cred)(pmin(xxx[, i.p], xmax)), 3)
|
|
fgreen <- round(approxfun(seq(crange[1], crange[2], length = length(cgreen)), cgreen)(pmin(xxx[, i.p], xmax)), 3)
|
|
fblue <- round(approxfun(seq(crange[1], crange[2], length = length(cblue)), cblue)(pmin(xxx[, i.p], xmax)), 3)
|
|
tmp <- format(round(xxx[, i.p], 3), nsmall = 3)
|
|
xxxx[, i.p] <- paste("\\cellcolor[rgb]{", fred, ",", fgreen, ",", fblue, "}", tmp, " {\\footnotesize (", Pallout[IDX[i.p]], ")}", sep = "")
|
|
table[, i.p] <- paste0(tmp, " (", Pallout[i.p], ")")
|
|
table_col[, i.p] <- rgb(fred, fgreen, fblue, maxColorValue = 1)
|
|
} # i.p
|
|
|
|
table_out <- kbl(table, align = rep("c", ncol(table)))
|
|
|
|
for (cols in 1:ncol(table)) {
|
|
table_out <- table_out %>%
|
|
column_spec(cols, background = table_col[, cols])
|
|
}
|
|
table_out %>%
|
|
kable_material()
|
|
```
|
|
|
|
```{r, echo = FALSE, fig.width=7, fig.height=5.5, fig.align='center', cache = FALSE, results='asis'}
|
|
MO <- 6
|
|
OUT <- Mout
|
|
OUT.num <- OUT
|
|
class(OUT.num) <- "numeric"
|
|
|
|
xxx <- OUT.num
|
|
xxxx <- OUT
|
|
i.p <- 1
|
|
table2 <- OUT
|
|
table_col2 <- OUT
|
|
for (i.p in 1:MO) {
|
|
xmax <- -min(Mall) * 5 # max(Mall)
|
|
xmin <- min(Mall)
|
|
cred <- rev(c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, .8, .5)) # , .5,0,0,0,1,1,1) ## red
|
|
cgreen <- rev(c(.5, .5, .55, .6, .65, .7, .75, .8, .85, .9, .95, 1, 1, .9)) # , .5,0,1,1,1,0,0) ## green
|
|
cblue <- rev(c(.55, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5, .5)) # , .5,1,1,0,0,0,1) ## blue
|
|
crange <- c(xmin, xmax) ## range
|
|
## colors in plot:
|
|
fred <- round(approxfun(seq(crange[1], crange[2], length = length(cred)), cred)(pmin(xxx[, i.p], xmax)), 3)
|
|
fgreen <- round(approxfun(seq(crange[1], crange[2], length = length(cgreen)), cgreen)(pmin(xxx[, i.p], xmax)), 3)
|
|
fblue <- round(approxfun(seq(crange[1], crange[2], length = length(cblue)), cblue)(pmin(xxx[, i.p], xmax)), 3)
|
|
tmp <- format(round(xxx[, i.p], 3), nsmall = 3)
|
|
xxxx[, i.p] <- paste("\\cellcolor[rgb]{", fred, ",", fgreen, ",", fblue, "}", tmp, " {\\footnotesize (", Pallout[K + 3 + 5 * (i.p - 1) + 1:5], ")}", sep = "")
|
|
table2[, i.p] <- paste0(tmp, " (", Pallout[K + 3 + 5 * (i.p - 1) + 1:5], ")")
|
|
table_col2[, i.p] <- rgb(fred, fgreen, fblue, maxColorValue = 1)
|
|
} # i.p
|
|
|
|
table_out2 <- kableExtra::kbl(table2, align = rep("c", ncol(table2)))
|
|
|
|
for (cols in 1:ncol(table2)) {
|
|
table_out2 <- table_out2 %>%
|
|
column_spec(1 + cols,
|
|
background = table_col2[, cols]
|
|
)
|
|
}
|
|
|
|
table_out2 %>%
|
|
kable_material() %>%
|
|
column_spec(1, bold = T)
|
|
```
|
|
|
|
## QL
|
|
|
|
```{r, echo = FALSE, fig.width=13, fig.height=5.5, fig.align='center', cache = FALSE}
|
|
|
|
##### Performance across probabilities
|
|
M <- length(mnames)
|
|
Msel <- c(1:K, K + 1, K + 1 + 2 + 1:4 * 5 - 2) ## experts + naive + smooth
|
|
modnames <- mnames[Msel]
|
|
|
|
tCOL <- c(
|
|
"#E6CC00", "#CC6600", "#E61A1A", "#99004D", "#F233BF",
|
|
"#666666", "#0000CC", "#1A80E6", "#1AE680", "#00CC00"
|
|
)
|
|
|
|
|
|
t(RQL) %>%
|
|
as_tibble() %>%
|
|
select(Naive) %>%
|
|
mutate(Naive = 0) %>%
|
|
mutate(p = 1:99 / 100) %>%
|
|
pivot_longer(-p, values_to = "Loss differences") -> dummy
|
|
|
|
t(RQL) %>%
|
|
as_tibble() %>%
|
|
select(mnames[Msel]) %>%
|
|
mutate(p = 1:99 / 100) %>%
|
|
pivot_longer(!p & !Naive) %>%
|
|
mutate(`Loss differences` = value - Naive) %>%
|
|
select(-value, -Naive) %>%
|
|
rbind(dummy) %>%
|
|
mutate(
|
|
p = as.numeric(p),
|
|
name = stringr::str_replace(name, "-P-smooth", ""),
|
|
name = factor(name, levels = stringr::str_replace(mnames[Msel], "-P-smooth", ""), ordered = T),
|
|
`Loss differences` = `Loss differences` * 1000
|
|
) %>%
|
|
ggplot(aes(x = p, y = `Loss differences`, colour = name)) +
|
|
geom_line(linewidth = 1) +
|
|
theme_minimal() +
|
|
theme(
|
|
text = element_text(size = text_size),
|
|
legend.position = "bottom"
|
|
) +
|
|
xlab("Probability p") +
|
|
scale_color_manual(NULL, values = tCOL) +
|
|
guides(colour = guide_legend(nrow = 2, byrow = TRUE))
|
|
```
|
|
|
|
## Cumulative Loss Difference
|
|
|
|
```{r, echo = FALSE, fig.width=13, fig.height=5.5, fig.align='center', cache = FALSE}
|
|
DQL <- t(apply(apply(QL[1:KK, -c(1:TTinit), ], c(1, 2), mean), 1, cumsum))
|
|
|
|
rownames(DQL) <- mnames
|
|
|
|
t(DQL) %>%
|
|
as_tibble() %>%
|
|
select(Naive) %>%
|
|
mutate(
|
|
`Difference of cumulative loss` = 0,
|
|
Date = ytime[-c(1:(TT + TTinit + 1))],
|
|
name = "Naive"
|
|
) %>%
|
|
select(-Naive) -> dummy
|
|
|
|
|
|
data <- t(DQL) %>%
|
|
as_tibble() %>%
|
|
select(mnames[Msel]) %>%
|
|
mutate(Date = ytime[-c(1:(TT + TTinit + 1))]) %>%
|
|
pivot_longer(!Date & !Naive) %>%
|
|
mutate(`Difference of cumulative loss` = value - Naive) %>%
|
|
select(-value, -Naive) %>%
|
|
rbind(dummy) %>%
|
|
mutate(
|
|
name = stringr::str_replace(name, "-P-smooth", ""),
|
|
name = factor(name, levels = stringr::str_replace(mnames[Msel], "-P-smooth", ""))
|
|
)
|
|
|
|
data %>%
|
|
ggplot(aes(x = Date, y = `Difference of cumulative loss`, colour = name)) +
|
|
geom_line(size = 1) +
|
|
theme_minimal() +
|
|
theme(
|
|
text = element_text(size = text_size),
|
|
legend.position = "bottom"
|
|
) +
|
|
scale_color_manual(NULL, values = tCOL) +
|
|
guides(colour = guide_legend(nrow = 2, byrow = TRUE))
|
|
```
|
|
|
|
## Weights (BOAG P-Smooth)
|
|
|
|
```{r, echo = FALSE, fig.width=13, fig.height=5.5, fig.align='center', cache = FALSE}
|
|
load("assets/crps_learning/weights_data.RData")
|
|
weights_data %>%
|
|
ggplot(aes(Date, p, fill = w)) +
|
|
geom_raster(interpolate = TRUE) +
|
|
facet_grid(Mod ~ .) +
|
|
theme_minimal() +
|
|
theme(
|
|
plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"),
|
|
text = element_text(size = text_size),
|
|
legend.key.height = unit(0.9, "inch")
|
|
) +
|
|
ylab("p") +
|
|
scale_fill_gradientn(
|
|
limits = c(0, 1),
|
|
colours = colseq,
|
|
breaks = seq(0, 1, 0.2)
|
|
) +
|
|
scale_x_date(expand = c(0, 0))
|
|
```
|
|
|
|
## Weights (Last)
|
|
|
|
```{r, echo = FALSE, fig.width=13, fig.height=5.5, fig.align='center', cache = FALSE}
|
|
load("assets/crps_learning/weights_example.RData")
|
|
weights %>%
|
|
ggplot(aes(x = p, y = weights, col = Model)) +
|
|
geom_line(size = 1.5) +
|
|
theme_minimal() +
|
|
theme(
|
|
plot.margin = unit(c(0.2, 0.3, 0.2, 0.2), "cm"),
|
|
text = element_text(size = text_size),
|
|
legend.position = "bottom",
|
|
legend.title = element_blank(),
|
|
panel.spacing = unit(1.5, "lines")
|
|
) +
|
|
scale_color_manual(NULL, values = tCOL[1:K]) +
|
|
facet_grid(. ~ K)
|
|
```
|
|
|
|
::::
|
|
|
|
## Wrap-Up
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Potential Downsides:
|
|
|
|
- Pointwise optimization can induce quantile crossing
|
|
- Can be solved by sorting the predictions
|
|
|
|
Upsides:
|
|
|
|
- Pointwise learning outperforms the Naive solution significantly
|
|
- Online learning is much faster than batch methods
|
|
- Smoothing further improves the predictive performance
|
|
- Asymptotically not worse than the best convex combination
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Important:
|
|
|
|
- The choice of the learning rate is crucial
|
|
- The loss function has to meet certain criteria
|
|
|
|
The [`r fontawesome::fa("github")` profoc](https://profoc.berrisch.biz/) R Package:
|
|
|
|
- Implements all algorithms discussed above
|
|
- Is written using RcppArmadillo `r fontawesome::fa("arrow-right", fill ="#000000")` it's fast
|
|
- Accepts vectors for most parameters
|
|
- The best parameter combination is chosen online
|
|
- Implements
|
|
- Forgetting, Fixed Share
|
|
- Different loss functions + gradients
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
<!-- :::: {.notes}
|
|
|
|
Execution Times:
|
|
|
|
T = 5000
|
|
|
|
Opera:
|
|
|
|
Ml-Poly > 157 ms
|
|
Boa > 212 ms
|
|
|
|
Profoc:
|
|
|
|
Ml-Poly > 17
|
|
BOA > 16 -->
|
|
|
|
# Multivariate Probabilistic CRPS Learning with an Application to Day-Ahead Electricity Prices
|
|
|
|
---
|
|
|
|
## Outline
|
|
|
|
```{r, include=FALSE}
|
|
col_lightgray <- "#e7e7e7"
|
|
col_blue <- "#000088"
|
|
col_smooth_expost <- "#a7008b"
|
|
col_smooth <- "#187a00"
|
|
col_pointwise <- "#008790"
|
|
col_constant <- "#dd9002"
|
|
col_optimum <- "#666666"
|
|
col_green <- "#61B94C"
|
|
col_orange <- "#ffa600"
|
|
col_yellow <- "#FCE135"
|
|
```
|
|
|
|
</br>
|
|
|
|
**Multivariate CRPS Learning**
|
|
|
|
- Introduction
|
|
- Smoothing procedures
|
|
- Application to multivariate electricity price forecasts
|
|
|
|
**The `profoc` R package**
|
|
|
|
- Package overview
|
|
- Implementation details
|
|
- Illustrative examples
|
|
|
|
## The Framework of Prediction under Expert Advice
|
|
|
|
### The sequential framework
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Each day, $t = 1, 2, ... T$
|
|
|
|
- The **forecaster** receives predictions $\widehat{X}_{t,k}$ from $K$ **experts**
|
|
- The **forecaster** assigns weights $w_{t,k}$ to each **expert**
|
|
- The **forecaster** calculates her prediction:
|
|
|
|
$$\widetilde{X}_{t}=\sum_{k=1}^K w_{t,k}\widehat{X}_{t,k}$$
|
|
|
|
- The realization for $t$ is observed
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
- The experts can be institutions, persons, or models
|
|
- The forecasts can be point-forecasts (i.e., mean or median) or full predictive distributions
|
|
- We do not need any assumptions concerning the underlying data
|
|
- `r Citet(my_bib, "cesa2006prediction")`
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## The Regret
|
|
|
|
Weights are updated sequentially according to the past performance of the $K$ experts.
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` A loss function $\ell$ is needed (to compute the **cumulative regret** $R_{t,k}$)
|
|
|
|
\begin{equation}
|
|
R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i)
|
|
\label{eq_regret}
|
|
\end{equation}
|
|
|
|
The cumulative regret:
|
|
- Indicates the predictive accuracy of expert $k$ until time $t$.
|
|
- Measures how much the forecaster *regrets* not having followed the expert's advice
|
|
|
|
Popular loss functions for point forecasting `r Citet(my_bib, "gneiting2011making")`:
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
- $\ell_2$-loss $\ell_2(x, y) = | x -y|^2$
|
|
- optimal for mean prediction
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
- $\ell_1$-loss $\ell_1(x, y) = | x -y|$
|
|
- optimal for median predictions
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
---
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
### Probabilistic Setting
|
|
|
|
An appropriate loss:
|
|
|
|
\begin{align*}
|
|
\text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx
|
|
\label{eq_crps}
|
|
\end{align*}
|
|
|
|
It's strictly proper `r Citet(my_bib, "gneiting2007strictly")`.
|
|
|
|
Using the CRPS, we can calculate time-adaptive weights $w_{t,k}$. However, what if the experts' performance varies in parts of the distribution?
|
|
|
|
`r fontawesome::fa("lightbulb", fill = col_yellow)` Utilize this relation:
|
|
|
|
\begin{align*}
|
|
\text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) \, d p.
|
|
\label{eq_crps_qs}
|
|
\end{align*}
|
|
|
|
... to combine quantiles of the probabilistic forecasts individually using the quantile-loss QL.
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
### Optimal Convergence
|
|
|
|
</br>
|
|
|
|
`r fontawesome::fa("exclamation", fill = col_orange)` exp-concavity of the loss is required for *selection* and *convex aggregation* properties
|
|
|
|
`r fontawesome::fa("exclamation", fill = col_orange)` QL is convex, but not exp-concave
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` The Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition.
|
|
|
|
Convergence rates of BOA are:
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *selection* `r Citet(my_bib, "gaillard2018efficient")`.
|
|
|
|
`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t *convex aggregation* `r Citet(my_bib, "wintenberger2017optimal")`.
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Multivariate CRPS Learning
|
|
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Additionally, we extend the **B-Smooth** and **P-Smooth** procedures to the multivariate setting:
|
|
|
|
- Basis matrices for reducing
  - the probabilistic dimension from $P$ to $\widetilde P$
  - the multivariate dimension from $D$ to $\widetilde D$
- Hat matrices
  - penalized smoothing across the $P$ and $D$ dimensions
|
|
|
|
We utilize the mean Pinball Score over the entire space for hyperparameter optimization (e.g., $\lambda$)
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
*Basis Smoothing*
|
|
|
|
Represent weights as linear combinations of bounded basis functions:
|
|
|
|
\begin{equation}
|
|
\underbrace{\boldsymbol w_{t,k}}_{D \times P} = \sum_{j=1}^{\widetilde D} \sum_{l=1}^{\widetilde P} \beta_{t,j,l,k} \varphi^{\text{mv}}_{j} \varphi^{\text{pr}}_{l} = \underbrace{\boldsymbol \varphi^{\text{mv}}}_{D \times \widetilde D} \boldsymbol \beta_{t,k} \underbrace{{\boldsymbol\varphi^{\text{pr}}}'}_{\widetilde P \times P} \nonumber
|
|
\end{equation}
|
|
|
|
A popular choice: B-Splines
|
|
|
|
$\boldsymbol \beta_{t,k}$ is calculated using a reduced regret matrix:
|
|
|
|
$\underbrace{\boldsymbol r_{t,k}}_{\widetilde P \times \widetilde D} = \boldsymbol \varphi^{\text{pr}} \underbrace{\left({\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widetilde{\boldsymbol X}_{t},Y_t)- {\boldsymbol{QL}}_{\mathcal{P}}^{\nabla}(\widehat{\boldsymbol X}_{t},Y_t)\right)}_{P \times D}\boldsymbol \varphi^{\text{mv}}$
|
|
|
|
If $\widetilde P = P$ it holds that $\boldsymbol \varphi^{pr} = \boldsymbol{I}$ (pointwise)
|
|
|
|
For $\widetilde P = 1$ we receive constant weights
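
A dimensional sketch of this reduction with random placeholder matrices (real applications would use B-spline bases; identity bases recover the pointwise case):

```{r, echo = TRUE, eval = FALSE}
# Sketch: reducing a P x D regret matrix with two basis matrices (placeholders)
P <- 99; D <- 24                # probability grid and hours
P_tilde <- 12; D_tilde <- 6     # reduced dimensions

phi_pr <- matrix(rnorm(P_tilde * P), P_tilde, P)      # stand-in for a B-spline basis
phi_mv <- matrix(rnorm(D * D_tilde), D, D_tilde)      # stand-in for a B-spline basis

regret_full    <- matrix(rnorm(P * D), P, D)          # QL-gradient regret differences, P x D
regret_reduced <- phi_pr %*% regret_full %*% phi_mv   # P_tilde x D_tilde
dim(regret_reduced)
```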
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Multivariate CRPS Learning
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
**Penalized smoothing:**
|
|
|
|
Let $\boldsymbol{\psi}^{\text{mv}}=(\psi_1,\ldots, \psi_{D})$ and $\boldsymbol{\psi}^{\text{pr}}=(\psi_1,\ldots, \psi_{P})$ be two sets of bounded basis functions on $(0,1)$:
|
|
|
|
\begin{equation}
|
|
\boldsymbol w_{t,k} = \boldsymbol{\psi}^{\text{mv}} \boldsymbol{b}_{t,k} {\boldsymbol{\psi}^{pr}}'
|
|
\end{equation}
|
|
|
|
with parameter matrix $\boldsymbol b_{t,k}$. The latter is estimated by penalized $L_2$-smoothing, which minimizes
|
|
|
|
\begin{align}
|
|
& \| \boldsymbol{\beta}_{t,d, k}' \boldsymbol{\varphi}^{\text{pr}} - \boldsymbol b_{t, d, k}' \boldsymbol{\psi}^{\text{pr}} \|^2_2 + \lambda^{\text{pr}} \| \mathcal{D}_{q} (\boldsymbol b_{t, d, k}' \boldsymbol{\psi}^{\text{pr}}) \|^2_2 + \nonumber \\
|
|
& \| \boldsymbol{\beta}_{t, p, k}' \boldsymbol{\varphi}^{\text{mv}} - \boldsymbol b_{t, p, k}' \boldsymbol{\psi}^{\text{mv}} \|^2_2 + \lambda^{\text{mv}} \| \mathcal{D}_{q} (\boldsymbol b_{t, p, k}' \boldsymbol{\psi}^{\text{mv}}) \|^2_2 \nonumber
|
|
\end{align}
|
|
|
|
with differential operator $\mathcal{D}_q$ of order $q$
|
|
|
|
Computation is easy since we have an analytical solution.
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
```{r, fig.align="center", echo=FALSE, out.width = "1000px"}
|
|
knitr::include_graphics("assets/mcrps_learning/algorithm.svg")
|
|
```
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Application
|
|
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
#### Data
|
|
|
|
- Day-Ahead electricity price forecasts from `r Citet(my_bib, "marcjasz2022distributional")`
|
|
- Produced using probabilistic neural networks
|
|
- 24-dimensional distributional forecasts
|
|
- Distribution assumptions: JSU and Normal
|
|
- 8 experts (4 JSU, 4 Normal)
|
|
- 27th Dec. 2018 to 31st Dec. 2020 (736 days)
|
|
- We extract 99 quantiles (percentiles)
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
#### Setup
|
|
|
|
Evaluation: Exclude first 182 observations
|
|
|
|
Extensions: Penalized smoothing | Forgetting
|
|
|
|
Tuning strategies:
|
|
|
|
- Bayesian Fix
  - Sophisticated Bayesian search algorithm
- Online
  - Dynamic, based on past performance
- Bayesian Online
  - First Bayesian Fix, then Online
|
|
|
|
Computation Time: ~30 Minutes
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
# Special Cases
|
|
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
::: {.panel-tabset}
|
|
|
|
## Constant
|
|
|
|
```{r, fig.align="center", echo=FALSE, out.width = "400"}
|
|
knitr::include_graphics("assets/mcrps_learning/constant.svg")
|
|
```
|
|
|
|
## Constant PR
|
|
|
|
```{r, fig.align="center", echo=FALSE, out.width = "400"}
|
|
knitr::include_graphics("assets/mcrps_learning/constant_pr.svg")
|
|
```
|
|
|
|
## Constant MV
|
|
|
|
```{r, fig.align="center", echo=FALSE, out.width = "400"}
|
|
knitr::include_graphics("assets/mcrps_learning/constant_mv.svg")
|
|
```
|
|
|
|
::::
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
::: {.panel-tabset}
|
|
|
|
## Pointwise
|
|
|
|
```{r, fig.align="center", echo=FALSE, out.width = "400"}
|
|
knitr::include_graphics("assets/mcrps_learning/pointwise.svg")
|
|
```
|
|
|
|
## Smooth
|
|
|
|
```{r, fig.align="center", echo=FALSE, out.width = "400"}
|
|
knitr::include_graphics("assets/mcrps_learning/smooth_best.svg")
|
|
```
|
|
|
|
::::
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Results
|
|
|
|
```{r, fig.align="center", echo=FALSE, out.width = "400"}
|
|
knitr::include_graphics("assets/mcrps_learning/tab_performance_sa.svg")
|
|
```
|
|
|
|
## Results
|
|
|
|
```{r, warning=FALSE, fig.align="center", echo=FALSE, fig.width=12, fig.height=6}
|
|
load("assets/mcrps_learning/pars_data.rds")
|
|
pars_data %>%
|
|
ggplot(aes(x = dates, y = value)) +
|
|
geom_rect(aes(
|
|
ymin = 0,
|
|
ymax = value * 1.2,
|
|
xmin = dates[1],
|
|
xmax = dates[182],
|
|
fill = "Burn-In"
|
|
)) +
|
|
geom_line(aes(color = name), linewidth = linesize, show.legend = FALSE) +
|
|
scale_colour_manual(
|
|
values = as.character(cols[5, c("pink", "amber", "green")])
|
|
) +
|
|
facet_grid(name ~ .,
|
|
scales = "free_y",
|
|
# switch = "both"
|
|
) +
|
|
scale_y_continuous(
|
|
trans = "log2",
|
|
labels = scaleFUN
|
|
) +
|
|
theme_minimal() +
|
|
theme(
|
|
# plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"),
|
|
text = element_text(size = text_size),
|
|
legend.key.width = unit(0.9, "inch"),
|
|
legend.position = "none"
|
|
) +
|
|
ylab(NULL) +
|
|
xlab("date") +
|
|
scale_fill_manual(NULL,
|
|
values = as.character(cols[3, "grey"])
|
|
)
|
|
```
|
|
|
|
## Results: Hour 16:00-17:00
|
|
|
|
```{r, fig.align="center", echo=FALSE, fig.width=12, fig.height=6}
|
|
load("assets/mcrps_learning/weights_h.rds")
|
|
weights_h %>%
|
|
ggplot(aes(date, q, fill = weight)) +
|
|
geom_raster(interpolate = TRUE) +
|
|
facet_grid(
|
|
Expert ~ . # , labeller = labeller(Mod = mod_labs)
|
|
) +
|
|
theme_minimal() +
|
|
theme(
|
|
# plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"),
|
|
text = element_text(size = text_size),
|
|
legend.key.height = unit(0.9, "inch")
|
|
) +
|
|
scale_x_date(expand = c(0, 0)) +
|
|
scale_fill_gradientn(
|
|
oob = scales::squish,
|
|
limits = c(0, 1),
|
|
values = c(seq(0, 0.4, length.out = 8), 0.65, 1),
|
|
colours = c(
|
|
cols[8, "red"],
|
|
cols[5, "deep-orange"],
|
|
cols[5, "amber"],
|
|
cols[5, "yellow"],
|
|
cols[5, "lime"],
|
|
cols[5, "light-green"],
|
|
cols[5, "green"],
|
|
cols[7, "green"],
|
|
cols[9, "green"],
|
|
cols[10, "green"]
|
|
),
|
|
breaks = seq(0, 1, 0.1)
|
|
) +
|
|
xlab("date") +
|
|
ylab("probability") +
|
|
scale_y_continuous(breaks = c(0.1, 0.5, 0.9))
|
|
```
|
|
|
|
## Results: Median
|
|
|
|
```{r, fig.align="center", echo=FALSE, fig.width=12, fig.height=6}
|
|
load("assets/mcrps_learning/weights_q.rds")
|
|
weights_q %>%
|
|
mutate(hour = as.numeric(hour) - 1) %>%
|
|
ggplot(aes(date, hour, fill = weight)) +
|
|
geom_raster(interpolate = TRUE) +
|
|
facet_grid(
|
|
Expert ~ . # , labeller = labeller(Mod = mod_labs)
|
|
) +
|
|
theme_minimal() +
|
|
theme(
|
|
# plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"),
|
|
text = element_text(size = text_size),
|
|
legend.key.height = unit(0.9, "inch")
|
|
) +
|
|
scale_x_date(expand = c(0, 0)) +
|
|
scale_fill_gradientn(
|
|
oob = scales::squish,
|
|
limits = c(0, 1),
|
|
values = c(seq(0, 0.4, length.out = 8), 0.65, 1),
|
|
colours = c(
|
|
cols[8, "red"],
|
|
cols[5, "deep-orange"],
|
|
cols[5, "amber"],
|
|
cols[5, "yellow"],
|
|
cols[5, "lime"],
|
|
cols[5, "light-green"],
|
|
cols[5, "green"],
|
|
cols[7, "green"],
|
|
cols[9, "green"],
|
|
cols[10, "green"]
|
|
),
|
|
breaks = seq(0, 1, 0.1)
|
|
) +
|
|
xlab("date") +
|
|
ylab("hour") +
|
|
scale_y_continuous(breaks = c(0, 8, 16, 24))
|
|
```
|
|
|
|
## Profoc R Package
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
### Probabilistic Forecast Combination - profoc
|
|
|
|
Available on [Github](https://github.com/BerriJ/profoc) and [CRAN](https://CRAN.R-project.org/package=profoc)
|
|
|
|
Main Function: `online()` for online learning.
|
|
- Works with multivariate and/or probabilistic data
|
|
- Implements BOA, ML-POLY, EWA (and the gradient versions)
|
|
- Implements many extensions like smoothing, forgetting, thresholding, etc.
|
|
- Various loss functions are available
|
|
- Various methods (`predict`, `update`, `plot`, etc.)
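
A minimal call, assuming `Y` holds the observations and `experts` the array of expert quantile forecasts in the layout expected by the package (everything else at its defaults):

```{r, echo = TRUE, eval = FALSE}
library(profoc)

# Y:       observations (vector, or T x D matrix in the multivariate case)
# experts: array of expert quantile forecasts matching Y and tau
mod <- online(
  y = Y,
  experts = experts,
  tau = 1:99 / 100   # probability grid
)

plot(mod)            # inspect the learned weights
# predict() and update() continue the learning as new data arrives
```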
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
### Speed
|
|
|
|
Large parts of profoc are implemented in C++.
|
|
|
|
<center>
|
|
<img src="assets/mcrps_learning/profoc_langs.png">
|
|
</center>
|
|
|
|
We use `Rcpp`, `RcppArmadillo`, and OpenMP.
|
|
|
|
We use `Rcpp` modules to expose a class to R
|
|
- Offers great flexibility for the end-user
|
|
- Requires very little knowledge of C++ code
|
|
- High-Level interface is easy to use
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Profoc - B-Spline Basis
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Basis specification `b_smooth_pr` is internally passed to `make_basis_mats()`:
|
|
|
|
```{r, echo = TRUE, eval = FALSE, cache = FALSE}
|
|
mod <- online(
|
|
y = Y,
|
|
experts = experts,
|
|
tau = 1:99 / 100,
|
|
b_smooth_pr = list(
|
|
knots = 9,
|
|
mu = 0.3, # NEW
|
|
sigma = 1,
|
|
nonc = 0,
|
|
tailweight = 1,
|
|
deg = 3
|
|
)
|
|
)
|
|
```
|
|
|
|
Knots are distributed using the generalized beta distribution.
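
A rough illustration of how such a non-equidistant knot sequence could be generated from beta-distribution quantiles; the exact generalized-beta parametrization used by `make_basis_mats()` may differ.

```{r, echo = TRUE, eval = FALSE}
# Sketch: non-equidistant knots from beta quantiles (not the exact profoc parametrization)
n_knots <- 9
knots <- qbeta(seq(0, 1, length.out = n_knots + 2), shape1 = 1.3, shape2 = 1.3)
knots  # denser in the centre for shapes > 1, denser in the tails for shapes < 1
```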
|
|
|
|
TODO: Add actual algorithm to backup slides
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
TODO
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
## Wrap-Up
|
|
|
|
:::: {.columns}
|
|
|
|
::: {.column width="48%"}
|
|
|
|
The [`r fontawesome::fa("github")` profoc](https://profoc.berrisch.biz/) R Package:
|
|
|
|
Profoc is a flexible framework for online learning.
|
|
|
|
- It implements several algorithms
|
|
- It implements several loss functions
|
|
- It implements several extensions
|
|
- Its high- and low-level interfaces offer great flexibility
|
|
|
|
Profoc is fast.
|
|
|
|
- The core components are written in C++
|
|
- The core components utilize OpenMP for parallelization
|
|
|
|
:::
|
|
|
|
::: {.column width="2%"}
|
|
|
|
:::
|
|
|
|
::: {.column width="48%"}
|
|
|
|
Multivariate Extension:
|
|
|
|
- Code is available now
|
|
- [Pre-Print](https://arxiv.org/abs/2303.10019) is available now
|
|
|
|
Get these slides:
|
|
|
|
<center>
|
|
<img src="assets/mcrps_learning/web_pres.png">
|
|
</center>
|
|
[https://berrisch.biz/slides/23_06_ecmi/](https://berrisch.biz/slides/23_06_ecmi/)
|
|
|
|
:::
|
|
|
|
::::
|
|
|
|
|
|
|
|
|
|
# References
|
|
|
|
```{r refs1, echo=FALSE, results="asis"}
|
|
PrintBibliography(my_bib, .opts = list(style = "text"))
|
|
```
|
|
|