This document follows an explanation from LibreTexts.
Many concepts have elegant interpretations if real-valued random variables are viewed as vectors in a vector space. Variance, covariance, and moments take on nice definitions.
A random experiment is modeled by a probability space $(\Omega, \mathscr{F}, P)$, where $\Omega$ is the set of outcomes, $\mathscr{F}$ is the $\sigma$-algebra of events which defines the sample space in conjunction with $\Omega$ (we've looked at operations on sets), and $P$ is the probability measure on the sample space $(\Omega, \mathscr{F})$.
Our basic vector space $\mathscr{V}$ consists of all real-valued random variables defined on $(\Omega, \mathscr{F}, P)$. Random variables $X$ and $Y$ are equal if $P(X = Y) = 1$, so technically $\mathscr{V}$ consists of equivalence classes under this relation (think of how we can state the same random variable in many different ways).
We can define addition and scalar multiplication pointwise, exactly how we'd expect: $(X + Y)(\omega) = X(\omega) + Y(\omega)$ and $(cX)(\omega) = c\,X(\omega)$, meaning $\mathscr{V}$ is a well-defined vector space.
For brevity, I’m not going to be 100% precise with my notation here…
Definition: The $k$-norm of $X$ is defined as $\|X\|_k = \left(\mathbb{E}\left[|X|^k\right]\right)^{1/k}$ for $k \geq 1$. This value measures the size of $X$ in a certain sense. We get a few properties very naturally:

- $\|X\|_k \geq 0$, and $\|X\|_k = 0$ if and only if $P(X = 0) = 1$
- $\|cX\|_k = |c| \, \|X\|_k$ for $c \in \mathbb{R}$
- $\|X + Y\|_k \leq \|X\|_k + \|Y\|_k$ (the triangle inequality, which here is Minkowski's inequality)
This is a nice, well-behaved norm, so we denote by $\mathscr{L}_k$ the normed vector space of $X \in \mathscr{V}$ with $\|X\|_k < \infty$, with norm $\|\cdot\|_k$. In conventional notation, $L^k$ is used instead of $\mathscr{L}_k$.
If $j \leq k$ then $\|X\|_j \leq \|X\|_k$, so we can say that $\mathscr{L}_k \subseteq \mathscr{L}_j$.
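As a concrete illustration (my own sketch, not from the text), here is a minimal Python check that estimates $\|X\|_k$ by simple Monte Carlo sampling; the distribution and sample size are just assumptions for the example.

```python
import numpy as np

def k_norm(x, k):
    """Estimate the k-norm ||X||_k = (E[|X|^k])^(1/k) from a sample x."""
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # X ~ Exponential(1), chosen just for illustration

# For Exponential(1), E[|X|^k] = k!, so ||X||_k = (k!)^(1/k): roughly 1, 1.41, 1.82 for k = 1, 2, 3.
for k in (1, 2, 3):
    print(k, k_norm(x, k))
```

The printed values increase with $k$, matching the ordering $\|X\|_1 \leq \|X\|_2 \leq \|X\|_3$ just stated.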
Now that we have a well-defined norm, we can make a nice distance function.
Definition: The $k$-metric between $X$ and $Y$ is defined as $d_k(X, Y) = \|X - Y\|_k = \left(\mathbb{E}\left[|X - Y|^k\right]\right)^{1/k}$. It shares the norm's properties, since it's essentially the same thing.
This lets us define the standard deviation pretty easily as $\operatorname{sd}(X) = d_2(X, \mathbb{E}(X)) = \|X - \mathbb{E}(X)\|_2$ and the variance as $\operatorname{var}(X) = d_2^2(X, \mathbb{E}(X)) = \mathbb{E}\left[(X - \mathbb{E}(X))^2\right]$.
The root mean square error function is $\operatorname{rmse}(t) = d_2(X, t) = \sqrt{\mathbb{E}\left[(X - t)^2\right]}$. This function is minimized when $t = \mathbb{E}(X)$.
The mean absolute error is $\operatorname{mae}(t) = d_1(X, t) = \mathbb{E}\left[|X - t|\right]$. This function is minimized when $t$ is any median of $X$. This point is confusing, and the text doesn't explain it so well; the intuition is that a median splits the probability mass in half, so nudging $t$ away from a median moves it away from at least as much mass as it moves toward, and the expected absolute deviation can only stay the same or grow.
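A quick numerical check may help with this (my own sketch, not from the text): for a skewed sample, scan candidate values of $t$ and see where the RMSE and MAE curves bottom out.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=50_000)  # a skewed sample, so the mean and median differ

ts = np.linspace(0.0, 3.0, 601)  # candidate predictors t
rmse = np.array([np.sqrt(np.mean((x - t) ** 2)) for t in ts])
mae = np.array([np.mean(np.abs(x - t)) for t in ts])

print("RMSE minimized near t =", ts[np.argmin(rmse)], "; sample mean  =", x.mean())
print("MAE  minimized near t =", ts[np.argmin(mae)], "; sample median =", np.median(x))
```

For this distribution the two minimizers land near $1$ (the mean) and near $\ln 2 \approx 0.69$ (the median), so the two error functions really do pick out different "centers".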
Once we have a measure of distance, we also have a criterion for convergence. $X_n \to X$ in $k$th mean as $n \to \infty$ if $d_k(X_n, X) = \|X_n - X\|_k \to 0$ as $n \to \infty$.
When $k = 1$, we say that $X_n$ approaches $X$ in mean, and when $k = 2$ we say $X_n \to X$ in mean square.
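For a small worked example (my own, not from the text): suppose $X_n$ takes the value $1$ with probability $1/n$ and $0$ otherwise. Then

$$\|X_n - 0\|_k = \left(\mathbb{E}\left[|X_n|^k\right]\right)^{1/k} = \left(\tfrac{1}{n}\right)^{1/k} \to 0 \quad \text{as } n \to \infty,$$

so $X_n \to 0$ in $k$th mean for every $k \geq 1$.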
Convergence in $k$th mean implies that the norms converge, but not vice versa. Meaning $X_n \to X$ in $k$th mean implies $\|X_n\|_k \to \|X\|_k$ (this follows from the reverse triangle inequality, $\left|\,\|X_n\|_k - \|X\|_k\,\right| \leq \|X_n - X\|_k$).
The text also states that convergence in mean is stronger than convergence in probability, however this term is not defined there (for reference, convergence in probability means $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$ for every $\epsilon > 0$).
$\mathscr{L}_2$ is special because it's the space where the norm corresponds to an inner product.
Definition: The inner product of $X$ and $Y$ is defined as $\langle X, Y \rangle = \mathbb{E}(XY)$.
We get our properties for an inner product fairly easily:

- $\langle X, Y \rangle = \langle Y, X \rangle$
- $\langle aX + bY, Z \rangle = a \langle X, Z \rangle + b \langle Y, Z \rangle$ for $a, b \in \mathbb{R}$
- $\langle X, X \rangle \geq 0$, and $\langle X, X \rangle = 0$ if and only if $P(X = 0) = 1$
We can define covariance and correlation with this inner product.
Definition: $\operatorname{cov}(X, Y) = \langle X - \mathbb{E}(X), \, Y - \mathbb{E}(Y) \rangle = \mathbb{E}\left[(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))\right]$
Definition: $\operatorname{cor}(X, Y) = \dfrac{\langle X - \mathbb{E}(X), \, Y - \mathbb{E}(Y) \rangle}{\|X - \mathbb{E}(X)\|_2 \, \|Y - \mathbb{E}(Y)\|_2} = \dfrac{\operatorname{cov}(X, Y)}{\operatorname{sd}(X)\operatorname{sd}(Y)}$
So $X$ and $Y$ are uncorrelated if the centered variables $X - \mathbb{E}(X)$ and $Y - \mathbb{E}(Y)$ are orthogonal in $\mathscr{L}_2$.
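To see the geometry numerically (my own sketch, not from the text), the sample correlation of two variables can be compared against the cosine of the angle between the centered sample vectors, which is exactly what the inner-product view predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)
y = 0.6 * x + 0.8 * rng.normal(size=10_000)  # correlated with x by construction

xc, yc = x - x.mean(), y - y.mean()  # the centered variables

# Correlation as an inner product: <X - E(X), Y - E(Y)> / (||X - E(X)||_2 ||Y - E(Y)||_2),
# i.e. the cosine of the angle between the centered vectors.
cos_angle = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print("cosine of angle   :", cos_angle)
print("sample correlation:", np.corrcoef(x, y)[0, 1])
```

The two numbers agree up to floating point, and orthogonal centered vectors would give a cosine, hence a correlation, of zero.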
We can see that the inner product corresponds to the 2-norm, since $\langle X, X \rangle = \mathbb{E}(X^2) = \|X\|_2^2$, so $\|X\|_2$ also corresponds to the root mean square of $X$.
We call $\mathscr{L}_2$ a Hilbert space (a complete inner product space).
Theorem: The Cauchy-Schwarz inequality states that $|\langle X, Y \rangle| \leq \|X\|_2 \, \|Y\|_2$, or in more familiar terms, $|\mathbb{E}(XY)| \leq \sqrt{\mathbb{E}(X^2)}\,\sqrt{\mathbb{E}(Y^2)}$.
We can also write this like $|\operatorname{cov}(X, Y)| \leq \operatorname{sd}(X)\operatorname{sd}(Y)$ (apply the inequality to the centered variables), which is the same as $-1 \leq \operatorname{cor}(X, Y) \leq 1$.
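One standard way to see why the inequality holds (a sketch I'm adding, not from the text): for any real $t$,

$$0 \leq \mathbb{E}\left[(X - tY)^2\right] = \mathbb{E}(X^2) - 2t\,\mathbb{E}(XY) + t^2\,\mathbb{E}(Y^2),$$

so this quadratic in $t$ has at most one real root, its discriminant $4\,\mathbb{E}(XY)^2 - 4\,\mathbb{E}(X^2)\,\mathbb{E}(Y^2)$ must be $\leq 0$, and the inequality follows (assuming $\mathbb{E}(Y^2) > 0$; otherwise $Y = 0$ with probability $1$ and both sides are $0$).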
Discussions of best linear predictors work nicely when described as projections onto subspaces. Say $\mathscr{U}$ is a subspace of $\mathscr{L}_2$ and $X \in \mathscr{L}_2$; then the projection of $X$ onto $\mathscr{U}$ is the vector $V \in \mathscr{U}$ such that $X - V$ is perpendicular to $\mathscr{U}$, i.e. $\langle X - V, U \rangle = 0$ for all $U \in \mathscr{U}$.
This projection (if it exists) is unique. Furthermore, $\|X - V\|_2 \leq \|X - U\|_2$ for all $U \in \mathscr{U}$. This implies $V$ is the point of $\mathscr{U}$ that minimizes the distance to $X$.
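Why perpendicularity gives the minimizing property (a one-step argument I'm filling in): for any $U \in \mathscr{U}$, the vector $V - U$ lies in $\mathscr{U}$ and is therefore orthogonal to $X - V$, so

$$\|X - U\|_2^2 = \|(X - V) + (V - U)\|_2^2 = \|X - V\|_2^2 + \|V - U\|_2^2 \geq \|X - V\|_2^2.$$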
We say that $\mathscr{U} = \{a + bX : a, b \in \mathbb{R}\}$ is a subspace of $\mathscr{L}_2$, the subspace generated by $1$ and $X$ (linear combinations of these two vectors).
We write $Y = V + (Y - V)$, where $Y - V$ is perpendicular to $\mathscr{U}$. Then we can say that the best linear predictor of $Y$ given $X$ is the projection $V$ of $Y$ onto $\mathscr{U}$, which is spanned by $1$ and $X - \mathbb{E}(X)$ (an orthogonal pair, since $\langle 1, X - \mathbb{E}(X) \rangle = \mathbb{E}\left[X - \mathbb{E}(X)\right] = 0$).
So

$$V = \frac{\langle Y, 1 \rangle}{\langle 1, 1 \rangle} \, 1 + \frac{\langle Y, \, X - \mathbb{E}(X) \rangle}{\langle X - \mathbb{E}(X), \, X - \mathbb{E}(X) \rangle} \, (X - \mathbb{E}(X))$$
Or in the language of random variables,

$$L(Y \mid X) = \mathbb{E}(Y) + \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \left(X - \mathbb{E}(X)\right)$$
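As a sanity check (my own sketch, not from the text), the coefficients of the best linear predictor can be estimated from a sample and compared against an ordinary least-squares fit, which minimizes the same $\mathscr{L}_2$ distance and so should agree.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=20_000)
y = 3.0 + 0.5 * x + rng.normal(scale=0.7, size=20_000)  # a linear signal plus noise

# Best linear predictor: L(Y|X) = E(Y) + cov(X, Y) / var(X) * (X - E(X))
slope = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()
print("projection formula: intercept =", intercept, " slope =", slope)

# An ordinary least-squares line minimizes the same squared distance, so it should match.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)
print("np.polyfit        : intercept =", ols_intercept, " slope =", ols_slope)
```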