Forecasting is predicting $X_{T+h}$, $h \geq 1$, given the available information $X_1, \ldots, X_T$.

We have several sources of error, like parameter uncertainty, model uncertainty, and ignorance about future errors

Our 1-step ahead forecast is defined as $\hat{X}_{T+1|T} = g(X_1, \ldots, X_T)$, some function of the observed data.
How do we get an optimal forecast? In practice, we minimize the mean squared error $\mathrm{MSE}(g) = E\big[(X_{T+1} - g(X_1, \ldots, X_T))^2\big]$.

We propose this will look like $\hat{X}_{T+1|T} = E[X_{T+1} \mid X_1, \ldots, X_T]$, the conditional expectation of $X_{T+1}$ given the data.
We can do some algebra with expectations to show that its MSE is a lower bound for any forecaster (with regard to MSE).
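A sketch of that algebra, writing $g$ for an arbitrary forecaster and $\hat{X}_{T+1|T} = E[X_{T+1} \mid X_1, \ldots, X_T]$:

$E\big[(X_{T+1} - g)^2\big] = E\big[(X_{T+1} - \hat{X}_{T+1|T})^2\big] + E\big[(\hat{X}_{T+1|T} - g)^2\big] \ \geq\ E\big[(X_{T+1} - \hat{X}_{T+1|T})^2\big],$

since the cross term $2\,E\big[(X_{T+1} - \hat{X}_{T+1|T})(\hat{X}_{T+1|T} - g)\big]$ vanishes after conditioning on $X_1, \ldots, X_T$.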

The same result holds for the h-step ahead forecast for general $h$, which is $\hat{X}_{T+h|T} = E[X_{T+h} \mid X_1, \ldots, X_T]$.

Example:
For an AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim WN(0, \sigma^2)$, we get

$\hat{X}_{T+1|T} = E[\phi X_T + \varepsilon_{T+1} \mid X_1, \ldots, X_T] = \phi X_T,$

and, iterating, $\hat{X}_{T+h|T} = \phi^h X_T$.

The forecast error is $X_{T+h} - \hat{X}_{T+h|T} = \sum_{j=0}^{h-1} \phi^j \varepsilon_{T+h-j}$, with variance $\sigma^2 \sum_{j=0}^{h-1} \phi^{2j}$.
If $\varepsilon_t$ is normal then our full forecast error is also normally distributed, which means we can build confidence intervals for the forecast, e.g. $\hat{X}_{T+h|T} \pm 1.96\,\sigma \sqrt{\sum_{j=0}^{h-1} \phi^{2j}}$ at the 95% level.

With an intercept, $X_t = c + \phi X_{t-1} + \varepsilon_t$, our forecasts look like

$\hat{X}_{T+h|T} = c\,(1 + \phi + \cdots + \phi^{h-1}) + \phi^h X_T.$

For $h \to \infty$ (with $|\phi| < 1$), we can derive $\hat{X}_{T+h|T} \to \frac{c}{1-\phi} = E[X_t]$, so long-horizon forecasts revert to the unconditional mean.
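A small numerical sketch of these formulas (the parameter values, last observation, and horizon below are hypothetical, not taken from the notes):

import numpy as np

# Hypothetical AR(1) with intercept: X_t = c + phi*X_{t-1} + eps_t, eps_t ~ N(0, sigma^2)
c, phi, sigma = 0.5, 0.8, 1.0
x_T = 2.3          # last observed value (illustrative)
H = 10             # forecast horizon

for h in range(1, H + 1):
    # h-step forecast: c*(1 + phi + ... + phi^(h-1)) + phi^h * x_T
    point = c * sum(phi ** j for j in range(h)) + phi ** h * x_T
    # forecast-error variance: sigma^2 * sum_{j=0}^{h-1} phi^(2j)
    var = sigma ** 2 * sum(phi ** (2 * j) for j in range(h))
    half = 1.96 * np.sqrt(var)   # 95% interval half-width, assuming normal errors
    print(f"h={h:2d}  forecast={point:6.3f}  95% interval=({point - half:6.3f}, {point + half:6.3f})")

As h grows, the point forecast approaches c / (1 - phi) and the forecast-error variance approaches sigma^2 / (1 - phi^2), the unconditional variance of the process.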

Example:
Let $X_t$ follow an AR model with given numerical parameter values and a given last observation (assuming stationarity). Plugging these into the formulas above yields the point forecasts $\hat{X}_{T+h|T}$ and the corresponding intervals.
Definition: The least squares estimator of a parameter minimizes the sum of squared residuals

Suppose we have a sample of size $T$ from an AR(p) process,

$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \varepsilon_t, \qquad t = p+1, \ldots, T,$

with $E[\varepsilon_t] = 0$ and $\mathrm{Var}(\varepsilon_t) = \sigma^2$.

In matrix form, we can write

$\mathbf{x} = \mathbf{Z}\boldsymbol{\phi} + \boldsymbol{\varepsilon},$ where the row for time $t$ of $\mathbf{Z}$ is $(X_{t-1}, \ldots, X_{t-p})$,

so the least squares estimator is $\hat{\boldsymbol{\phi}}_{LS} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{x}$.

For $p = 1$, this reduces to $\hat{\phi}_{LS} = \dfrac{\sum_{t=2}^{T} X_{t-1} X_t}{\sum_{t=2}^{T} X_{t-1}^2}$.
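A minimal numerical sketch of this least squares formula; the simulated data and the helper fit_ar_ls are illustrative, not from the notes:

import numpy as np

def fit_ar_ls(x, p):
    # Least squares fit of an AR(p) without intercept: phi_hat = (Z'Z)^{-1} Z'x
    x = np.asarray(x, dtype=float)
    T = len(x)
    # Row for time t contains the regressors (x_{t-1}, ..., x_{t-p})
    Z = np.column_stack([x[p - j - 1 : T - j - 1] for j in range(p)])
    y = x[p:]
    phi_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ phi_hat
    sigma2_hat = resid @ resid / (T - p)
    return phi_hat, sigma2_hat

# Illustrative check on simulated AR(1) data with phi = 0.7
rng = np.random.default_rng(0)
T, phi_true = 500, 0.7
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

phi_hat, sigma2_hat = fit_ar_ls(x, p=1)
print(phi_hat, sigma2_hat)   # phi_hat should come out close to 0.7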

With some reasonable assumptions, it can be shown that $\sqrt{T}\,(\hat{\boldsymbol{\phi}}_{LS} - \boldsymbol{\phi}) \xrightarrow{d} N\big(0,\ \sigma^2 \Gamma_p^{-1}\big)$ as $T \to \infty$, with asymptotic variance $\sigma^2 \Gamma_p^{-1}$, where $\Gamma_p = [\gamma(i-j)]_{i,j=1}^{p}$ is the autocovariance matrix. The assumptions are:

  • $\varepsilon_t$ is i.i.d. with $E[\varepsilon_t] = 0$ and $E[\varepsilon_t^2] = \sigma^2 < \infty$
  • $X_t$ is stationary

For $p = 1$, this reduces to $\sqrt{T}\,(\hat{\phi}_{LS} - \phi) \xrightarrow{d} N(0,\ 1 - \phi^2)$.

Since this limiting distribution does not depend on the distribution of $\varepsilon_t$ (only on its first two moments), this means LS is a robust estimator!

For large $T$, we have approximately

$\hat{\boldsymbol{\phi}}_{LS} \approx N\big(\boldsymbol{\phi},\ \hat{\sigma}^2 \hat{\Gamma}_p^{-1} / T\big),$

where $\hat{\Gamma}_p$ is the sample autocovariance matrix and $\hat{\sigma}^2$ is the residual variance estimate.
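For instance, in the AR(1) case this approximation gives $\mathrm{se}(\hat{\phi}) \approx \sqrt{(1 - \hat{\phi}^2)/T}$. A self-contained sketch with simulated, purely illustrative data:

import numpy as np

# Simulate an AR(1) with phi = 0.7 (values are illustrative)
rng = np.random.default_rng(1)
T, phi_true = 500, 0.7
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

# LS estimate for p = 1: sum x_{t-1} x_t / sum x_{t-1}^2
phi_hat = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)

# Large-T approximation: sqrt(T) * (phi_hat - phi) ~ N(0, 1 - phi^2)
se = np.sqrt((1 - phi_hat ** 2) / T)
print(f"phi_hat = {phi_hat:.3f}, approx 95% CI = ({phi_hat - 1.96 * se:.3f}, {phi_hat + 1.96 * se:.3f})")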

Definition: For a sample $X_1, \ldots, X_T$ with joint density $f(x_1, \ldots, x_T; \theta)$, the maximum likelihood estimator is $\hat{\theta}_{ML} = \arg\max_{\theta} f(X_1, \ldots, X_T; \theta)$.

In practice, we maximize the log likelihood $\ell(\theta) = \log f(X_1, \ldots, X_T; \theta)$.

Using the factorization $f(x_1, \ldots, x_T; \theta) = \prod_{t=1}^{T} f(x_t \mid x_{t-1}, \ldots, x_1; \theta)$, this writes a $T$-dimensional joint log density as a sum of univariate conditional log densities, $\ell(\theta) = \sum_{t=1}^{T} \log f(x_t \mid x_{t-1}, \ldots, x_1; \theta)$.

Example:
Now take $X_1, \ldots, X_T$ i.i.d. $N(\mu, \sigma^2)$. The log likelihood is

$\ell(\mu, \sigma^2) = -\frac{T}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(x_t - \mu)^2.$

We then derive the estimator by setting the derivative to 0:

$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{t=1}^{T}(x_t - \mu) = 0 \ \Rightarrow\ \hat{\mu}_{ML} = \bar{x}.$

For any $\sigma^2$, the value of $\mu$ that maximizes the above function is the one that minimizes $\sum_{t=1}^{T}(x_t - \mu)^2$. So effectively, we’ve reached the LS estimator!

For ARMA models, we use a similar approach to derive an ML estimator

Example:
Consider $X_t = \phi X_{t-1} + \varepsilon_t$ with $\varepsilon_t$ i.i.d. $N(0, \sigma^2)$.
We get the conditional log likelihood (conditioning on $X_1$)

$\ell(\phi, \sigma^2) = -\frac{T-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(x_t - \phi x_{t-1})^2.$

Again, the ML estimate of $\phi$ equals the LS estimate, essentially because the $\varepsilon_t$ are assumed to be normal
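A quick numerical sanity check of this equality, maximizing the conditional Gaussian log likelihood directly (the simulated data and starting values are illustrative):

import numpy as np
from scipy.optimize import minimize

# Simulate an AR(1) with phi = 0.6 (illustrative)
rng = np.random.default_rng(2)
T, phi_true = 300, 0.6
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

def neg_cond_loglik(params, x):
    # Negative conditional log likelihood of a Gaussian AR(1), conditioning on x_1
    phi, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    resid = x[1:] - phi * x[:-1]
    n = len(resid)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid ** 2) / sigma2

res = minimize(neg_cond_loglik, x0=np.array([0.0, 0.0]), args=(x,))
phi_ml = res.x[0]
phi_ls = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)
print(f"phi_ML = {phi_ml:.4f}, phi_LS = {phi_ls:.4f}")   # the two should agree closely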

Example:
Consider $X_t = \varepsilon_t + \theta \varepsilon_{t-1}$ with $\varepsilon_t$ i.i.d. $N(0, \sigma^2)$.

Assume $\varepsilon_0 = 0$. Then the errors can be recovered recursively:
$\varepsilon_t = x_t$ for $t = 1$ and $\varepsilon_t = x_t - \theta \varepsilon_{t-1}$ for $t = 2, \ldots, T$.

So we get the conditional log likelihood

$\ell(\theta, \sigma^2) = -\frac{T}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T} \varepsilon_t(\theta)^2,$

which has no closed-form maximizer in $\theta$ and is maximized numerically.
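A sketch of that numerical maximization, using the recursive residuals above (the simulated data and optimizer settings are illustrative):

import numpy as np
from scipy.optimize import minimize

# Simulate an MA(1): x_t = eps_t + theta*eps_{t-1} with theta = 0.5 (illustrative)
rng = np.random.default_rng(3)
T, theta_true = 400, 0.5
eps = rng.standard_normal(T + 1)
x = eps[1:] + theta_true * eps[:-1]

def neg_cond_loglik(params, x):
    # Conditional likelihood: eps_1 = x_1, eps_t = x_t - theta*eps_{t-1} (eps_0 assumed 0)
    theta, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    e = np.zeros(len(x))
    e[0] = x[0]
    for t in range(1, len(x)):
        e[t] = x[t] - theta * e[t - 1]
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(e ** 2) / sigma2

res = minimize(neg_cond_loglik, x0=np.array([0.0, 0.0]), args=(x,))
print(f"theta_ML = {res.x[0]:.3f}, sigma_ML = {np.exp(res.x[1]):.3f}")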

Model Selection

How do we select the model order, i.e. the number of parameters?

If we select an order that is too large, then we risk higher variance for our estimated model (overfitting)

If we select an order that is too small, then our model cannot possibly capture all the data’s nuances (underfitting)

We can strike a balance by minimizing an information criterion,

$IC(k) = \log\!\left(\frac{SSR(k)}{T}\right) + k\,\frac{C_T}{T}$

  • $SSR(k)$ is the sum of squared residuals based on $k$ parameters
  • For an AR(p) with intercept, $k = p + 1$
  • $C_T = 2$ gives the Akaike information criterion (AIC) and $C_T = \log T$ gives the Bayesian information criterion (BIC)

We can use this by choosing a maximal order $p_{\max}$ and testing each model $p = 0, 1, \ldots, p_{\max}$, keeping the one with the smallest criterion value

These can also be computed with the log likelihood: $\mathrm{AIC} = -2\,\ell(\hat{\theta}) + 2k$ and $\mathrm{BIC} = -2\,\ell(\hat{\theta}) + k \log T$ (for Gaussian errors this matches the SSR-based version above up to constants and scaling by $T$)
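A short sketch of the whole selection loop, using the SSR-based criterion with a hypothetical maximal order (the simulated AR(2) data are illustrative):

import numpy as np

rng = np.random.default_rng(4)
T = 500
x = np.zeros(T)
for t in range(2, T):                        # illustrative true model: AR(2)
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()

def ssr_ar(x, p):
    # Sum of squared LS residuals for an AR(p) with intercept
    if p == 0:
        return np.sum((x - x.mean()) ** 2)
    Z = np.column_stack([np.ones(len(x) - p)] +
                        [x[p - j - 1 : len(x) - j - 1] for j in range(p)])
    y = x[p:]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ beta) ** 2)

p_max = 6
for p in range(p_max + 1):
    k = p + 1                                # intercept plus p AR coefficients
    ssr = ssr_ar(x, p)
    aic = np.log(ssr / T) + k * 2 / T
    bic = np.log(ssr / T) + k * np.log(T) / T
    print(f"p={p}  AIC={aic:.4f}  BIC={bic:.4f}")
# Pick the order p with the smallest AIC or BIC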

As the sample size goes to infinity, BIC correctly estimates the true order (it is order-consistent), but AIC can outperform it in finite samples