Forecasting is predicting $X_{T+h}$, $h \geq 1$, given the available information $X_1, \ldots, X_T$.

We have several sources of error, like parameter uncertainty, model uncertainty, and ignorance about future errors

Our 1-step ahead forecast is defined as $\hat{X}_{T+1|T} = g(X_1, \ldots, X_T)$, some function of the observed data.
How do we get an optimal forecast? In practice, we minimize the mean squared error $\mathrm{MSE}(g) = E\big[(X_{T+1} - g(X_1, \ldots, X_T))^2\big]$.

We propose this will look like $\hat{X}_{T+1|T} = E[X_{T+1} \mid X_1, \ldots, X_T]$, the conditional expectation of $X_{T+1}$ given the data.
We can do some algebra with expectations to show that its MSE is a lower bound for any forecaster (with regard to MSE).
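A sketch of that algebra, writing $g$ for an arbitrary forecaster and $\hat{X}_{T+1|T} = E[X_{T+1} \mid X_1, \ldots, X_T]$:

$E\big[(X_{T+1} - g)^2\big] = E\big[(X_{T+1} - \hat{X}_{T+1|T})^2\big] + E\big[(\hat{X}_{T+1|T} - g)^2\big] \ \geq\ E\big[(X_{T+1} - \hat{X}_{T+1|T})^2\big],$

since the cross term $2\,E\big[(X_{T+1} - \hat{X}_{T+1|T})(\hat{X}_{T+1|T} - g)\big]$ vanishes after conditioning on $X_1, \ldots, X_T$.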

The same result holds for the h-step ahead forecast for general $h$, which is $\hat{X}_{T+h|T} = E[X_{T+h} \mid X_1, \ldots, X_T]$.

Example:
For an AR(1) process $X_t = \phi X_{t-1} + \varepsilon_t$ with $\varepsilon_t \sim WN(0, \sigma^2)$, we get

$\hat{X}_{T+1|T} = E[\phi X_T + \varepsilon_{T+1} \mid X_1, \ldots, X_T] = \phi X_T,$

and, iterating, $\hat{X}_{T+h|T} = \phi^h X_T$.

The forecast error is $X_{T+h} - \hat{X}_{T+h|T} = \sum_{j=0}^{h-1} \phi^j \varepsilon_{T+h-j}$, with variance $\sigma^2 \sum_{j=0}^{h-1} \phi^{2j}$.
If $\varepsilon_t$ is normal then our full forecast error is also normally distributed, which means we can build confidence intervals for the forecast, e.g. $\hat{X}_{T+h|T} \pm 1.96\,\sigma \sqrt{\sum_{j=0}^{h-1} \phi^{2j}}$ at the 95% level.

With an intercept, $X_t = c + \phi X_{t-1} + \varepsilon_t$, our forecasts look like

$\hat{X}_{T+h|T} = c\,(1 + \phi + \cdots + \phi^{h-1}) + \phi^h X_T.$

For $h \to \infty$ (with $|\phi| < 1$), we can derive $\hat{X}_{T+h|T} \to \frac{c}{1-\phi} = E[X_t]$, so long-horizon forecasts revert to the unconditional mean.
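A small numerical sketch of these formulas (the parameter values, last observation, and horizon below are hypothetical, not taken from the notes):

import numpy as np

# Hypothetical AR(1) with intercept: X_t = c + phi*X_{t-1} + eps_t, eps_t ~ N(0, sigma^2)
c, phi, sigma = 0.5, 0.8, 1.0
x_T = 2.3          # last observed value (illustrative)
H = 10             # forecast horizon

for h in range(1, H + 1):
    # h-step forecast: c*(1 + phi + ... + phi^(h-1)) + phi^h * x_T
    point = c * sum(phi ** j for j in range(h)) + phi ** h * x_T
    # forecast-error variance: sigma^2 * sum_{j=0}^{h-1} phi^(2j)
    var = sigma ** 2 * sum(phi ** (2 * j) for j in range(h))
    half = 1.96 * np.sqrt(var)   # 95% interval half-width, assuming normal errors
    print(f"h={h:2d}  forecast={point:6.3f}  95% interval=({point - half:6.3f}, {point + half:6.3f})")

As h grows, the point forecast approaches c / (1 - phi) and the forecast-error variance approaches sigma^2 / (1 - phi^2), the unconditional variance of the process.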

Example:
Let $X_t$ follow an AR model with given numerical parameter values and a given last observation (assuming stationarity). Plugging these into the formulas above yields the point forecasts $\hat{X}_{T+h|T}$ and the corresponding intervals.
Definition: The least squares estimator of a parameter minimizes the sum of squared residuals

Suppose we have a sample of size $T$ from an AR(p) process,

$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + \varepsilon_t, \qquad t = p+1, \ldots, T,$

with $E[\varepsilon_t] = 0$ and $\mathrm{Var}(\varepsilon_t) = \sigma^2$.

In matrix form, we can write

$\mathbf{x} = \mathbf{Z}\boldsymbol{\phi} + \boldsymbol{\varepsilon},$ where the row for time $t$ of $\mathbf{Z}$ is $(X_{t-1}, \ldots, X_{t-p})$,

so the least squares estimator is $\hat{\boldsymbol{\phi}}_{LS} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\mathbf{x}$.

For $p = 1$, this reduces to $\hat{\phi}_{LS} = \dfrac{\sum_{t=2}^{T} X_{t-1} X_t}{\sum_{t=2}^{T} X_{t-1}^2}$.
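A minimal numerical sketch of this least squares formula; the simulated data and the helper fit_ar_ls are illustrative, not from the notes:

import numpy as np

def fit_ar_ls(x, p):
    # Least squares fit of an AR(p) without intercept: phi_hat = (Z'Z)^{-1} Z'x
    x = np.asarray(x, dtype=float)
    T = len(x)
    # Row for time t contains the regressors (x_{t-1}, ..., x_{t-p})
    Z = np.column_stack([x[p - j - 1 : T - j - 1] for j in range(p)])
    y = x[p:]
    phi_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ phi_hat
    sigma2_hat = resid @ resid / (T - p)
    return phi_hat, sigma2_hat

# Illustrative check on simulated AR(1) data with phi = 0.7
rng = np.random.default_rng(0)
T, phi_true = 500, 0.7
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

phi_hat, sigma2_hat = fit_ar_ls(x, p=1)
print(phi_hat, sigma2_hat)   # phi_hat should come out close to 0.7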

With some reasonable assumptions, it can be shown that $\sqrt{T}\,(\hat{\boldsymbol{\phi}}_{LS} - \boldsymbol{\phi}) \xrightarrow{d} N\big(0,\ \sigma^2 \Gamma_p^{-1}\big)$ as $T \to \infty$, with asymptotic variance $\sigma^2 \Gamma_p^{-1}$, where $\Gamma_p = [\gamma(i-j)]_{i,j=1}^{p}$ is the autocovariance matrix. The assumptions are:

  • $\varepsilon_t$ is i.i.d. with $E[\varepsilon_t] = 0$ and $E[\varepsilon_t^2] = \sigma^2 < \infty$
  • $X_t$ is stationary

For $p = 1$, this reduces to $\sqrt{T}\,(\hat{\phi}_{LS} - \phi) \xrightarrow{d} N(0,\ 1 - \phi^2)$.

Since this limiting distribution does not depend on the distribution of $\varepsilon_t$ (only on its first two moments), this means LS is a robust estimator!

For large $T$, we have approximately

$\hat{\boldsymbol{\phi}}_{LS} \approx N\big(\boldsymbol{\phi},\ \hat{\sigma}^2 \hat{\Gamma}_p^{-1} / T\big),$

where $\hat{\Gamma}_p$ is the sample autocovariance matrix and $\hat{\sigma}^2$ is the residual variance estimate.
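For instance, in the AR(1) case this approximation gives $\mathrm{se}(\hat{\phi}) \approx \sqrt{(1 - \hat{\phi}^2)/T}$. A self-contained sketch with simulated, purely illustrative data:

import numpy as np

# Simulate an AR(1) with phi = 0.7 (values are illustrative)
rng = np.random.default_rng(1)
T, phi_true = 500, 0.7
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

# LS estimate for p = 1: sum x_{t-1} x_t / sum x_{t-1}^2
phi_hat = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)

# Large-T approximation: sqrt(T) * (phi_hat - phi) ~ N(0, 1 - phi^2)
se = np.sqrt((1 - phi_hat ** 2) / T)
print(f"phi_hat = {phi_hat:.3f}, approx 95% CI = ({phi_hat - 1.96 * se:.3f}, {phi_hat + 1.96 * se:.3f})")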

Definition: For a sample $X_1, \ldots, X_T$ with joint density $f(x_1, \ldots, x_T; \theta)$, the maximum likelihood estimator is $\hat{\theta}_{ML} = \arg\max_{\theta} f(X_1, \ldots, X_T; \theta)$.

In practice, we maximize the log likelihood $\ell(\theta) = \log f(X_1, \ldots, X_T; \theta)$.

Using the factorization $f(x_1, \ldots, x_T; \theta) = \prod_{t=1}^{T} f(x_t \mid x_{t-1}, \ldots, x_1; \theta)$, this writes a $T$-dimensional joint log density as a sum of univariate conditional log densities, $\ell(\theta) = \sum_{t=1}^{T} \log f(x_t \mid x_{t-1}, \ldots, x_1; \theta)$.

Example:
Now take $X_1, \ldots, X_T$ i.i.d. $N(\mu, \sigma^2)$. The log likelihood is

$\ell(\mu, \sigma^2) = -\frac{T}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T}(x_t - \mu)^2.$

We then derive the estimator by setting the derivative to 0:

$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{t=1}^{T}(x_t - \mu) = 0 \ \Rightarrow\ \hat{\mu}_{ML} = \bar{x}.$

For any $\sigma^2$, the value of $\mu$ that maximizes the above function is the one that minimizes $\sum_{t=1}^{T}(x_t - \mu)^2$. So effectively, we’ve reached the LS estimator!

For ARMA models, we use a similar approach to derive an ML estimator

Example:
Consider $X_t = \phi X_{t-1} + \varepsilon_t$ with $\varepsilon_t$ i.i.d. $N(0, \sigma^2)$.
We get the conditional log likelihood (conditioning on $X_1$)

$\ell(\phi, \sigma^2) = -\frac{T-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=2}^{T}(x_t - \phi x_{t-1})^2.$

Again, the ML estimate of $\phi$ equals the LS estimate, essentially because the $\varepsilon_t$ are assumed to be normal
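A quick numerical sanity check of this equality, maximizing the conditional Gaussian log likelihood directly (the simulated data and starting values are illustrative):

import numpy as np
from scipy.optimize import minimize

# Simulate an AR(1) with phi = 0.6 (illustrative)
rng = np.random.default_rng(2)
T, phi_true = 300, 0.6
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi_true * x[t - 1] + rng.standard_normal()

def neg_cond_loglik(params, x):
    # Negative conditional log likelihood of a Gaussian AR(1), conditioning on x_1
    phi, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    resid = x[1:] - phi * x[:-1]
    n = len(resid)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(resid ** 2) / sigma2

res = minimize(neg_cond_loglik, x0=np.array([0.0, 0.0]), args=(x,))
phi_ml = res.x[0]
phi_ls = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)
print(f"phi_ML = {phi_ml:.4f}, phi_LS = {phi_ls:.4f}")   # the two should agree closely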

Example:
Consider $X_t = \varepsilon_t + \theta \varepsilon_{t-1}$ with $\varepsilon_t$ i.i.d. $N(0, \sigma^2)$.

Assume $\varepsilon_0 = 0$. Then the errors can be recovered recursively:
$\varepsilon_t = x_t$ for $t = 1$ and $\varepsilon_t = x_t - \theta \varepsilon_{t-1}$ for $t = 2, \ldots, T$.

So we get the conditional log likelihood

$\ell(\theta, \sigma^2) = -\frac{T}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{t=1}^{T} \varepsilon_t(\theta)^2,$

which has no closed-form maximizer in $\theta$ and is maximized numerically.
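A sketch of that numerical maximization, using the recursive residuals above (the simulated data and optimizer settings are illustrative):

import numpy as np
from scipy.optimize import minimize

# Simulate an MA(1): x_t = eps_t + theta*eps_{t-1} with theta = 0.5 (illustrative)
rng = np.random.default_rng(3)
T, theta_true = 400, 0.5
eps = rng.standard_normal(T + 1)
x = eps[1:] + theta_true * eps[:-1]

def neg_cond_loglik(params, x):
    # Conditional likelihood: eps_1 = x_1, eps_t = x_t - theta*eps_{t-1} (eps_0 assumed 0)
    theta, log_sigma = params
    sigma2 = np.exp(2 * log_sigma)
    e = np.zeros(len(x))
    e[0] = x[0]
    for t in range(1, len(x)):
        e[t] = x[t] - theta * e[t - 1]
    n = len(x)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + 0.5 * np.sum(e ** 2) / sigma2

res = minimize(neg_cond_loglik, x0=np.array([0.0, 0.0]), args=(x,))
print(f"theta_ML = {res.x[0]:.3f}, sigma_ML = {np.exp(res.x[1]):.3f}")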

Model Selection

How do we select the model order, i.e. the number of parameters?

If we select an order that is too large, then we risk higher variance for our estimated model (overfitting)

If we select an order that is too small, then our model cannot possibly capture all the data’s nuances (underfitting)

We can strike a balance by minimizing an information criterion,

$IC(k) = \log\!\left(\frac{SSR(k)}{T}\right) + k\,\frac{C_T}{T}$

  • $SSR(k)$ is the sum of squared residuals based on $k$ parameters
  • For an AR(p) with intercept, $k = p + 1$
  • $C_T = 2$ gives the Akaike information criterion (AIC) and $C_T = \log T$ gives the Bayesian information criterion (BIC)

We can use this by choosing a maximal order $p_{\max}$ and testing each model $p = 0, 1, \ldots, p_{\max}$, keeping the one with the smallest criterion value

These can also be computed with the log likelihood: $\mathrm{AIC} = -2\,\ell(\hat{\theta}) + 2k$ and $\mathrm{BIC} = -2\,\ell(\hat{\theta}) + k \log T$ (for Gaussian errors this matches the SSR-based version above up to constants and scaling by $T$)
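A short sketch of the whole selection loop, using the SSR-based criterion with a hypothetical maximal order (the simulated AR(2) data are illustrative):

import numpy as np

rng = np.random.default_rng(4)
T = 500
x = np.zeros(T)
for t in range(2, T):                        # illustrative true model: AR(2)
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()

def ssr_ar(x, p):
    # Sum of squared LS residuals for an AR(p) with intercept
    if p == 0:
        return np.sum((x - x.mean()) ** 2)
    Z = np.column_stack([np.ones(len(x) - p)] +
                        [x[p - j - 1 : len(x) - j - 1] for j in range(p)])
    y = x[p:]
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ beta) ** 2)

p_max = 6
for p in range(p_max + 1):
    k = p + 1                                # intercept plus p AR coefficients
    ssr = ssr_ar(x, p)
    aic = np.log(ssr / T) + k * 2 / T
    bic = np.log(ssr / T) + k * np.log(T) / T
    print(f"p={p}  AIC={aic:.4f}  BIC={bic:.4f}")
# Pick the order p with the smallest AIC or BIC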

As the sample size goes to infinity, BIC correctly estimates the true order (it is order-consistent), but AIC can outperform it in finite samples