--- title: "Data Science Methods for Forecasting in Energy and Economics" date: 2025-07-10 author: - name: Jonathan Berrisch affiliations: - ref: hemf affiliations: - id: hemf name: University of Duisburg-Essen, House of Energy Markets and Finance format: revealjs: embed-resources: true footer: "" logo: logos_combined.png theme: [default, clean.scss] smaller: true fig-format: svg execute: daemon: false highlight-style: github --- ## Outline ::: {.hidden} $$ \newcommand{\A}{{\mathbb A}} $$ :::
::: {style="font-size: 150%;"} [{{< fa bars-staggered >}}]{style="color: #404040;"}   Introduction & Research Motivation [{{< fa bars-staggered >}}]{style="color: #404040;"}   Overview of the Thesis [{{< fa table >}}]{style="color: #404040;"}   Online Learning [{{< fa circle-nodes >}}]{style="color: #404040;"}   Probabilistic Forecasting of European Carbon and Energy Prices [{{< fa lightbulb >}}]{style="color: #404040;"}   Limitations [{{< fa binoculars >}}]{style="color: #404040;"}   Contributions & Outlook ::: ## PHD DeFence ```{r, setup, include=FALSE} # Compile with: rmarkdown::render("crps_learning.Rmd") library(latex2exp) library(ggplot2) library(dplyr) library(tidyr) library(purrr) library(kableExtra) knitr::opts_chunk$set( dev = "svglite" # Use svg figures ) library(RefManageR) BibOptions( check.entries = TRUE, bib.style = "authoryear", cite.style = "authoryear", style = "html", hyperlink = TRUE, dashed = FALSE ) my_bib <- ReadBib("assets/library.bib", check = FALSE) col_lightgray <- "#e7e7e7" col_blue <- "#000088" col_smooth_expost <- "#a7008b" col_smooth <- "#187a00" col_pointwise <- "#008790" col_constant <- "#dd9002" col_optimum <- "#666666" ``` ```{r xaringan-panelset, echo=FALSE} xaringanExtra::use_panelset() ``` ```{r xaringanExtra-freezeframe, echo=FALSE} xaringanExtra::use_freezeframe(responsive = TRUE) ``` # Outline - [Motivation](#motivation) - [The Framework of Prediction under Expert Advice](#pred_under_exp_advice) - [The Continious Ranked Probability Scrore](#crps) - [Optimality of (Pointwise) CRPS-Learning](#crps_optim) - [A Simple Probabilistic Example](#simple_example) - [The Proposed CRPS-Learning Algorithm](#proposed_algorithm) - [Simulation Results](#simulation) - [Possible Extensions](#extensions) - [Application Study](#application) - [Wrap-Up](#conclusion) - [References](#references) --- class: center, middle, sydney-blue # Motivation name: motivation ## Motivation :::: {.columns} ::: {.column width="48%"} The Idea: - Combine multiple forecasts instead of choosing one - Combination weights may vary over **time**, over the **distribution** or **both** 2 Popular options for combining distributions: - Combining across quantiles (this paper) - Horizontal aggregation, vincentization - Combining across probabilities - Vertical aggregation ::: ::: {.column width="2%"} ::: ::: {.column width="48%"} ::: {.panel-tabset} ## Time ```{r, echo = FALSE, fig.height=6} par(mfrow = c(3, 3), mar = c(2, 2, 2, 2)) set.seed(1) # Data X <- matrix(ncol = 3, nrow = 15) X[, 1] <- seq(from = 8, to = 12, length.out = 15) + 0.25 * rnorm(15) X[, 2] <- 10 + 0.25 * rnorm(15) X[, 3] <- seq(from = 12, to = 8, length.out = 15) + 0.25 * rnorm(15) # Weights w <- matrix(ncol = 3, nrow = 15) w[, 1] <- sin(0.1 * 1:15) w[, 2] <- cos(0.1 * 1:15) w[, 3] <- seq(from = -2, 0.25, length.out = 15)^2 w <- (w / rowSums(w)) # Vis plot(X[, 1], lwd = 4, type = "l", ylim = c(8, 12), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#2050f0" ) plot(w[, 1], lwd = 4, type = "l", ylim = c(0, 1), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#2050f0" ) text(6, 0.5, TeX("$w_1(t)$"), cex = 2, col = "#2050f0") arrows(13, 0.25, 15, 0.0, , lwd = 4, bty = "n") plot.new() plot(X[, 2], lwd = 4, type = "l", ylim = c(8, 12), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple" ) plot(w[, 2], lwd = 4, type = "l", ylim = c(0, 1), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple" ) text(6, 0.6, TeX("$w_2(t)$"), cex = 2, col = "purple") arrows(13, 0.5, 15, 
0.5, , lwd = 4, bty = "n") plot(rowSums(X * w), lwd = 4, type = "l", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#298829") plot(X[, 3], lwd = 4, type = "l", ylim = c(8, 12), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4" ) plot(w[, 3], lwd = 4, type = "l", ylim = c(0, 1), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4" ) text(6, 0.25, TeX("$w_3(t)$"), cex = 2, col = "#e423b4") arrows(13, 0.75, 15, 1, , lwd = 4, bty = "n") ``` ## Distribution ```{r, echo = FALSE, fig.height=6} par(mfrow = c(3, 3), mar = c(2, 2, 2, 2)) set.seed(1) # Data X <- matrix(ncol = 3, nrow = 31) X[, 1] <- dchisq(0:30, df = 10) X[, 2] <- dnorm(0:30, mean = 15, sd = 5) X[, 3] <- dexp(0:30, 0.2) # Weights w <- matrix(ncol = 3, nrow = 31) w[, 1] <- sin(0.05 * 0:30) w[, 2] <- cos(0.05 * 0:30) w[, 3] <- seq(from = -2, 0.25, length.out = 31)^2 w <- (w / rowSums(w)) # Vis plot(X[, 1], lwd = 4, type = "l", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#2050f0" ) plot(X[, 2], lwd = 4, type = "l", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple" ) plot(X[, 3], lwd = 4, type = "l", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4" ) plot(w[, 1], lwd = 4, type = "l", ylim = c(0, 1), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#2050f0" ) text(12, 0.5, TeX("$w_1(x)$"), cex = 2, col = "#2050f0") arrows(26, 0.25, 31, 0.0, , lwd = 4, bty = "n") plot(w[, 2], lwd = 4, type = "l", ylim = c(0, 1), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "purple" ) text(15, 0.5, TeX("$w_2(x)$"), cex = 2, col = "purple") arrows(15, 0.25, 15, 0, , lwd = 4, bty = "n") plot(w[, 3], lwd = 4, type = "l", ylim = c(0, 1), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#e423b4" ) text(20, 0.5, TeX("$w_3(x)$"), cex = 2, col = "#e423b4") arrows(5, 0.25, 0, 0, , lwd = 4, bty = "n") plot.new() plot(rowSums(X * w), lwd = 4, type = "l", xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n", col = "#298829") ``` ::: ::: :::: # The Framework of Prediction under Expert Advice ## The Framework of Prediction under Expert Advice ### The sequential framework :::: {.columns} ::: {.column width="48%"} Each day, $t = 1, 2, ... T$ - The **forecaster** receives predictions $\widehat{X}_{t,k}$ from $K$ **experts** - The **forecaster** assings weights $w_{t,k}$ to each **expert** - The **forecaster** calculates her prediction: \begin{equation} \widetilde{X}_{t} = \sum_{k=1}^K w_{t,k} \widehat{X}_{t,k}. \label{eq_forecast_def} \end{equation} - The realization for $t$ is observedilities - Vertical aggregation ::: ::: {.column width="2%"} ::: ::: {.column width="48%"} - The experts can be institutions, persons, or models - The forecasts can be point-forecasts (i.e., mean or median) or full predictive distributions - We do not need any assumptions concerning the underlying data - `r Citet(my_bib, "cesa2006prediction")` ::: :::: --- ## The Regret Weights are updated sequentially according to the past performance of the $K$ experts. That is, a loss function $\ell$ is needed. This is used to compute the **cumulative regret** $R_{t,k}$ \begin{equation} R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i) \label{eq_regret} \end{equation} The cumulative regret: - Indicates the predictive accuracy of the expert $k$ until time $t$. 
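## The Sequential Framework: A Minimal Sketch

One round of the combination step \eqref{eq_forecast_def} in base R. The forecasts and weights below are made-up numbers for illustration only:

```{r, echo = TRUE}
# One round t: combine K = 3 expert point forecasts (illustrative values)
X_hat <- c(9.8, 10.0, 10.4) # expert predictions received at time t
w <- c(0.2, 0.5, 0.3)       # weights assigned by the forecaster (sum to 1)

X_tilde <- sum(w * X_hat)   # forecaster's prediction: sum_k w_k * X_hat_k
X_tilde                     # 10.08; afterwards the realization Y_t is observed
```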
## The Regret

Weights are updated sequentially according to the past performance of the $K$ experts. This requires a loss function $\ell$, which is used to compute the **cumulative regret** $R_{t,k}$:

\begin{equation}
R_{t,k} = \widetilde{L}_{t} - \widehat{L}_{t,k} = \sum_{i = 1}^t \left( \ell(\widetilde{X}_{i},Y_i) - \ell(\widehat{X}_{i,k},Y_i) \right)
\label{eq_regret}
\end{equation}

The cumulative regret:

- Indicates the predictive accuracy of expert $k$ up to time $t$
- Measures how much the forecaster *regrets* not having followed the expert's advice

Popular loss functions for point forecasting `r Citet(my_bib, "gneiting2011making")`:

:::: {.columns}
::: {.column width="48%"}
- $\ell_2$-loss $\ell_2(x, y) = | x -y|^2$
  - optimal for mean predictions
:::

::: {.column width="2%"}
:::

::: {.column width="48%"}
- $\ell_1$-loss $\ell_1(x, y) = | x -y|$
  - optimal for median predictions
:::
::::

## Popular Aggregation Algorithms

#### The naive combination

\begin{equation}
w_{t,k}^{\text{Naive}} = \frac{1}{K}
\end{equation}

#### The exponentially weighted average forecaster (EWA)

\begin{align}
w_{t,k}^{\text{EWA}} & = \frac{e^{\eta R_{t,k}} }{\sum_{k = 1}^K e^{\eta R_{t,k}}} = \frac{e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }{\sum_{k = 1}^K e^{-\eta \ell(\widehat{X}_{t,k},Y_t)} w^{\text{EWA}}_{t-1,k} }
\label{eq_ewa_general}
\end{align}

#### The polynomially weighted aggregation (PWA)

\begin{align}
w_{t,k}^{\text{PWA}} & = \frac{ 2(R_{t,k})^{q-1}_{+} }{ \|(R_t)_{+}\|^{q-2}_q}
\label{eq_pwa_general}
\end{align}

with $q\geq 2$ and $x_{+}$ the vector of positive parts of $x$.
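## EWA in Code: A Minimal Sketch

A toy run of the EWA update \eqref{eq_ewa_general} on simulated data. The data-generating process, the $\ell_2$ loss, and the learning rate $\eta$ are illustrative choices, not the setup used later in the thesis:

```{r, echo = TRUE}
# EWA via the recursive form: w_t proportional to exp(-eta * loss_t) * w_{t-1}
set.seed(1)
tmax <- 50
eta <- 0.5
y <- rnorm(tmax, mean = 10) # observations
X <- cbind(
  y + rnorm(tmax, mean = 1),   # expert 1: biased
  y + rnorm(tmax, mean = 0.2), # expert 2: almost unbiased
  rnorm(tmax, mean = 10)       # expert 3: ignores the data
)
w <- rep(1 / 3, 3) # start from the naive combination
for (t in 1:tmax) {
  X_tilde <- sum(w * X[t, ]) # forecaster's combination at time t
  loss <- (X[t, ] - y[t])^2  # ell_2 loss of each expert
  w <- w * exp(-eta * loss)  # exponential update ...
  w <- w / sum(w)            # ... followed by normalization
}
round(w, 3) # most weight ends up on the most accurate expert
```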
## Optimality

In stochastic settings, the cumulative risk should be analyzed `r Citet(my_bib, "wintenberger2017optimal")`:

\begin{align}
\underbrace{\widetilde{\mathcal{R}}_t = \sum_{i=1}^t \mathbb{E}[\ell(\widetilde{X}_{i},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Forecaster}}
\qquad\qquad\qquad \text{ and } \qquad\qquad\qquad
\underbrace{\widehat{\mathcal{R}}_{t,k} = \sum_{i=1}^t \mathbb{E}[\ell(\widehat{X}_{i,k},Y_i)|\mathcal{F}_{i-1}]}_{\text{Cumulative Risk of Experts}}
\label{eq_def_cumrisk}
\end{align}

There are two problems that an algorithm should solve in iid settings:

:::: {.columns}
::: {.column width="48%"}
### The selection problem

\begin{equation}
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \stackrel{t\to \infty}{\rightarrow} a \quad \text{with} \quad a \leq 0.
\label{eq_opt_select}
\end{equation}

The forecaster is asymptotically not worse than the best expert $\widehat{\mathcal{R}}_{t,\min}$.
:::

::: {.column width="2%"}
:::

::: {.column width="48%"}
### The convex aggregation problem

\begin{equation}
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \stackrel{t\to \infty}{\rightarrow} b \quad \text{with} \quad b \leq 0.
\label{eq_opt_conv}
\end{equation}

The forecaster is asymptotically not worse than the best convex combination $\widehat{X}_{t,\pi}$ in hindsight (**oracle**).
:::
::::

## Optimality

Satisfying the convexity property \eqref{eq_opt_conv} comes at the cost of a slower possible convergence rate.

According to `r Citet(my_bib, "wintenberger2017optimal")`, an algorithm has optimal rates with respect to selection \eqref{eq_opt_select} and convex aggregation \eqref{eq_opt_conv} if

\begin{align}
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) & = \mathcal{O}\left(\frac{\log(K)}{t}\right)\label{eq_optp_select}
\end{align}

\begin{align}
\frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) & = \mathcal{O}\left(\sqrt{\frac{\log(K)}{t}}\right) \label{eq_optp_conv}
\end{align}

Whether an algorithm satisfies both \eqref{eq_optp_select} and \eqref{eq_optp_conv} depends on:

- The loss function
- Regularity conditions on $Y_t$ and $\widehat{X}_{t,k}$
- The weighting scheme

## Optimality

According to `r Citet(my_bib, "cesa2006prediction")`, EWA \eqref{eq_ewa_general} satisfies the optimal selection convergence \eqref{eq_optp_select} in a deterministic setting if:

- The loss $\ell$ is exp-concave
- The learning rate $\eta$ is chosen correctly

These results can be transferred to stochastic iid settings (`r Citet(my_bib, "kakade2008generalization")`; `r Citet(my_bib, "gaillard2014second")`).

The optimal convex aggregation convergence \eqref{eq_optp_conv} can be satisfied by applying the gradient trick, which linearizes the loss:

\begin{align}
\ell^{\nabla}(x,y) = \ell'(\widetilde{X},y) x
\end{align}

where $\ell'$ is the subgradient of $\ell$ in its first coordinate, evaluated at the forecast combination $\widetilde{X}$.

Combining probabilistic forecasts calls for a probabilistic loss function.

:::: {.notes}
We apply Bernstein Online Aggregation (BOA). It lets us weaken the exp-concavity condition while almost keeping the optimalities \eqref{eq_optp_select} and \eqref{eq_optp_conv}.
::::

## The Continuous Ranked Probability Score

:::: {.columns}
::: {.column width="48%"}
**An appropriate choice:**

\begin{align*}
\text{CRPS}(F, y) & = \int_{\mathbb{R}} {(F(x) - \mathbb{1}\{ x > y \})}^2 dx
\label{eq_crps}
\end{align*}

It is strictly proper `r Citet(my_bib, "gneiting2007strictly")`.

Using the CRPS, we can calculate time-adaptive weights $w_{t,k}$. However, what if the experts' performance is not uniform over all parts of the distribution?

The idea: utilize the relation

\begin{align*}
\text{CRPS}(F, y) = 2 \int_0^{1} \text{QL}_p(F^{-1}(p), y) \, d p
\label{eq_crps_qs}
\end{align*}
:::

::: {.column width="2%"}
:::

::: {.column width="48%"}
to combine quantiles of the probabilistic forecasts individually using the quantile loss (QL):

\begin{align*}
\text{QL}_p(q, y) & = (\mathbb{1}\{y < q\} -p)(q - y)
\end{align*}

**But is it optimal?**

CRPS is exp-concave `r fontawesome::fa("check", fill ="#00b02f")`

`r fontawesome::fa("arrow-right", fill ="#000000")` EWA \eqref{eq_ewa_general} with CRPS satisfies \eqref{eq_optp_select} and \eqref{eq_optp_conv}

QL is convex, but not exp-concave `r fontawesome::fa("exclamation", fill ="#ffa600")`

`r fontawesome::fa("arrow-right", fill ="#000000")` Bernstein Online Aggregation (BOA) lets us weaken the exp-concavity condition while almost keeping optimal convergence
:::
::::
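## CRPS via Quantile Losses: A Numerical Check

A quick sanity check of the identity \eqref{eq_crps_qs} in base R, for a standard normal forecast distribution. The observation and the grid size are arbitrary illustrative choices:

```{r, echo = TRUE}
# CRPS(F, y) = 2 * integral of QL_p(F^{-1}(p), y) over p, here on a finite grid
ql <- function(q, y, p) (as.numeric(y < q) - p) * (q - y)

y <- 0.7            # observed value
p <- (1:999) / 1000 # quantile grid approximating the integral over p
q <- qnorm(p)       # forecast quantiles F^{-1}(p) of F = N(0, 1)
crps_via_ql <- 2 * mean(ql(q, y, p))

# closed-form CRPS of the standard normal, for comparison
crps_closed <- y * (2 * pnorm(y) - 1) + 2 * dnorm(y) - 1 / sqrt(pi)
c(crps_via_ql, crps_closed) # both approx. 0.42
```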
## CRPS-Learning Optimality

For convex losses, the gradient-based BOA (BOAG) satisfies that there exists a $C>0$ such that for all $x>0$

\begin{equation}
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\pi} \right) \leq C \log(\log(t)) \left(\sqrt{\frac{\log(K)}{t}} + \frac{\log(K)+x}{t}\right) \right) \geq 1-e^{-x}
\label{eq_boa_opt_conv}
\end{equation}

`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t. *convex aggregation* \eqref{eq_optp_conv} `r Citet(my_bib, "wintenberger2017optimal")`.

The same algorithm satisfies that there exists a $C>0$ such that for all $x>0$

\begin{equation}
P\left( \frac{1}{t}\left(\widetilde{\mathcal{R}}_t - \widehat{\mathcal{R}}_{t,\min} \right) \leq C\left(\frac{\log(K)+\log(\log(Gt))+ x}{\alpha t}\right)^{\frac{1}{2-\beta}} \right) \geq 1-e^{-x}
\label{eq_boa_opt_select}
\end{equation}

if $Y_t$ is bounded and the considered loss $\ell$ is convex, $G$-Lipschitz, and weakly exp-concave in its first coordinate, i.e., if it satisfies **A1** and **A2**.

## CRPS-Learning Optimality

:::: {.columns}
::: {.column width="48%"}
**A1** For some $G>0$ it holds for all $x_1,x_2\in \mathbb{R}$ and $t>0$ that
$$ | \ell(x_1, Y_t)-\ell(x_2, Y_t) | \leq G |x_1-x_2|$$

**A2** For some $\alpha>0$, $\beta\in[0,1]$ it holds for all $x_1,x_2 \in \mathbb{R}$ and $t>0$ that

\begin{align*}
\mathbb{E}[ & \ell(x_1, Y_t)-\ell(x_2, Y_t) | \mathcal{F}_{t-1}] \leq \\
& \mathbb{E}[ \ell'(x_1, Y_t)(x_1 - x_2) |\mathcal{F}_{t-1}] \\
& + \mathbb{E}\left[ \left. \left( \alpha(\ell'(x_1, Y_t)(x_1 - x_2))^{2}\right)^{1/\beta} \right|\mathcal{F}_{t-1}\right]
\end{align*}

`r fontawesome::fa("arrow-right", fill ="#000000")` Almost optimal w.r.t. *selection* \eqref{eq_optp_select} `r Citet(my_bib, "gaillard2018efficient")`.
:::

::: {.column width="2%"}
:::

::: {.column width="48%"}
**Lemma 1**

\begin{align}
2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\min} \label{eq_risk_ql_crps_expert} \\
2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi} & \leq \widehat{\mathcal{R}}^{\text{CRPS}}_{t,\pi} . \label{eq_risk_ql_crps_convex}
\end{align}

`r fontawesome::fa("arrow-right")` Pointwise procedures can outperform constant ones.

QL is convex but not exp-concave:

`r fontawesome::fa("arrow-right")` Almost optimal convergence w.r.t. *convex aggregation* \eqref{eq_boa_opt_conv} `r fontawesome::fa("check", fill ="#00b02f")`
For almost optimal convergence w.r.t. *selection* \eqref{eq_boa_opt_select}, we need to check **A1** and **A2**:

QL is Lipschitz continuous: `r fontawesome::fa("arrow-right")` **A1** holds `r fontawesome::fa("check", fill ="#ffa600")`
:::
::::

## CRPS-Learning Optimality

:::: {.columns}
::: {.column width="48%"}
Conditional quantile risk: $\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y_t) | \mathcal{F}_{t-1}]$.

`r fontawesome::fa("arrow-right")` The convexity properties of $\mathcal{Q}_p$ depend on the conditional distribution of $Y_t|\mathcal{F}_{t-1}$.

**Proposition 1**

Let $Y$ be a univariate random variable with (Radon-Nikodym) $\nu$-density $f$. Then, for the second subderivative of the quantile risk $\mathcal{Q}_p(x) = \mathbb{E}[ \text{QL}_p(x, Y) ]$ of $Y$, it holds for all $p\in(0,1)$ that $\mathcal{Q}_p'' = f.$ Additionally, if $f$ is a continuous Lebesgue-density with $f\geq\gamma>0$ for some constant $\gamma>0$ on its support $\text{spt}(f)$, then $\mathcal{Q}_p$ is $\gamma$-strongly convex.

Strong convexity with $\beta=1$ implies **A2** `r fontawesome::fa("check", fill ="#ffa600")` `r Citet(my_bib, "gaillard2018efficient")`
:::

::: {.column width="2%"}
:::

::: {.column width="48%"}
`r fontawesome::fa("arrow-right")` **A1** and **A2** give us almost optimal convergence w.r.t. selection \eqref{eq_boa_opt_select} `r fontawesome::fa("check", fill ="#00b02f")`

**Theorem 1**

The gradient-based fully adaptive Bernstein online aggregation (BOAG), applied pointwise for all $p\in(0,1)$ on $\text{QL}$, satisfies \eqref{eq_boa_opt_conv} with minimal CRPS given by
$$\widehat{\mathcal{R}}_{t,\pi} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\pi}.$$
If $Y_t|\mathcal{F}_{t-1}$ is bounded and has a pdf $f_t$ satisfying $f_t>\gamma >0$ on its support $\text{spt}(f_t)$, then \eqref{eq_boa_opt_select} holds with $\beta=1$ and
$$\widehat{\mathcal{R}}_{t,\min} = 2\overline{\widehat{\mathcal{R}}}^{\text{QL}}_{t,\min}.$$
:::
::::
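## Proposition 1: A Numerical Check

A quick check of $\mathcal{Q}_p'' = f$ for $Y \sim \mathcal{N}(0,1)$. The closed form $\mathcal{Q}_p(x) = x(\Phi(x) - p) + \varphi(x)$ used below follows from a short direct calculation for the standard normal; this specific example is illustrative, not from the thesis:

```{r, echo = TRUE}
# Quantile risk of Y ~ N(0, 1): Q_p(x) = E[QL_p(x, Y)] = x * (pnorm(x) - p) + dnorm(x)
Qp <- function(x, p) x * (pnorm(x) - p) + dnorm(x)

x <- 0.3
p <- 0.9 # p only enters linearly, so it drops out of the second derivative
h <- 1e-4
# second difference quotient of Q_p vs. the density f at x
c((Qp(x + h, p) - 2 * Qp(x, p) + Qp(x - h, p)) / h^2, dnorm(x)) # both approx. 0.381
```

## Pointwise Learning: A Minimal Sketch

A toy version of pointwise learning on the example of the next slide. For brevity, this sketch runs a plain exponential-weights update separately for each quantile level $p$; the thesis uses the more refined BOAG update instead:

```{r, echo = TRUE}
# One weight per expert and quantile level p, updated via the quantile loss
set.seed(1)
tmax <- 25
eta <- 2
p <- (1:19) / 20                            # quantile grid
Q <- cbind(qnorm(p, -1, 1), qnorm(p, 3, 2)) # expert quantiles: N(-1,1) and N(3,4)
ql <- function(q, y, p) (as.numeric(y < q) - p) * (q - y)

W <- matrix(0.5, length(p), 2) # uniform start, one weight pair per level p
for (t in 1:tmax) {
  y <- rnorm(1)                # Y_t ~ N(0, 1)
  loss <- cbind(ql(Q[, 1], y, p), ql(Q[, 2], y, p))
  W <- W * exp(-eta * loss)    # pointwise exponential update
  W <- W / rowSums(W)
}
round(cbind(p, W), 2) # the combination weights vary over p
```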
## A Probabilistic Example

:::: {.columns}
::: {.column width="48%"}
Simple example:

\begin{align}
Y_t & \sim \mathcal{N}(0,\,1) \\
\widehat{X}_{t,1} & \sim \widehat{F}_{1} = \mathcal{N}(-1,\,1) \\
\widehat{X}_{t,2} & \sim \widehat{F}_{2} = \mathcal{N}(3,\,4)
\label{eq:dgp_sim1}
\end{align}

- True weights vary over $p$
- Figures show the ECDF and calculated weights using $T=25$ realizations
- The pointwise solution creates rough estimates
- Pointwise is better than constant
- The smooth solution is better than pointwise
:::

::: {.column width="2%"}
:::

::: {.column width="48%"}
<!-- Placeholder: figure showing the ECDFs and the estimated combination weights -->
:::
::::

# References

```{r refs1, echo=FALSE, results="asis"}
PrintBibliography(my_bib, .opts = list(style = "text"))
```