This document follows an explanation from LibreTexts.
Many concepts have elegant interpretations if real-valued random variables are viewed as vectors in a vector space. Variance, covariance, and moments take on nice definitions.
A random experiment is modeled by a probability space $(\Omega, \mathscr{F}, P)$, where $\Omega$ is the set of outcomes, $\mathscr{F}$ is the $\sigma$-algebra of events which defines the sample space in conjunction with $\Omega$ (we've looked at operations on sets), and $P$ is the probability measure on the sample space $(\Omega, \mathscr{F})$.
Our basic vector space $\mathscr{V}$ consists of all real-valued random variables defined on $(\Omega, \mathscr{F}, P)$. Random variables $X$ and $Y$ are equal if $P(X = Y) = 1$, so technically $\mathscr{V}$ consists of equivalence classes under this relation (think of how we can state the same random variable in many different ways).
We can define addition and scalar multiplication pointwise, exactly how we'd expect: $(X + Y)(\omega) = X(\omega) + Y(\omega)$ and $(cX)(\omega) = c\,X(\omega)$, meaning $\mathscr{V}$ is a well-defined vector space.
For brevity, I’m not going to be 100% precise with my notation here…
Definition: The $k$-norm of $X$ is defined as $\|X\|_k = \left(\mathbb{E}\left[|X|^k\right]\right)^{1/k}$ for $k \geq 1$. This value measures the size of $X$ in a certain sense. We get a few properties very naturally:

- $\|X\|_k \geq 0$, and $\|X\|_k = 0$ if and only if $P(X = 0) = 1$
- $\|cX\|_k = |c| \, \|X\|_k$ for $c \in \mathbb{R}$
- $\|X + Y\|_k \leq \|X\|_k + \|Y\|_k$ (the triangle inequality, which here is Minkowski's inequality)
This is a nice, well-behaved norm, so we denote by $\mathscr{L}_k$ the normed vector space of $X \in \mathscr{V}$ with $\|X\|_k < \infty$, with norm $\|\cdot\|_k$. In conventional notation, $L^k$ is used instead of $\mathscr{L}_k$.
If $j \leq k$ then $\|X\|_j \leq \|X\|_k$, so we can say that $\mathscr{L}_k \subseteq \mathscr{L}_j$.
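As a concrete illustration (my own sketch, not from the text), here is a minimal Python check that estimates $\|X\|_k$ by simple Monte Carlo sampling; the distribution and sample size are just assumptions for the example.

```python
import numpy as np

def k_norm(x, k):
    """Estimate the k-norm ||X||_k = (E[|X|^k])^(1/k) from a sample x."""
    return np.mean(np.abs(x) ** k) ** (1.0 / k)

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=100_000)  # X ~ Exponential(1), chosen just for illustration

# For Exponential(1), E[|X|^k] = k!, so ||X||_k = (k!)^(1/k): roughly 1, 1.41, 1.82 for k = 1, 2, 3.
for k in (1, 2, 3):
    print(k, k_norm(x, k))
```

The printed values increase with $k$, matching the ordering $\|X\|_1 \leq \|X\|_2 \leq \|X\|_3$ just stated.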
Now that we have a well-defined norm, we can make a nice distance function.
Definition: The $k$-metric between $X$ and $Y$ is defined as $d_k(X, Y) = \|X - Y\|_k = \left(\mathbb{E}\left[|X - Y|^k\right]\right)^{1/k}$. It shares the norm's properties, since it's essentially the same thing.
This lets us define the standard deviation pretty easily as $\operatorname{sd}(X) = d_2(X, \mathbb{E}(X)) = \|X - \mathbb{E}(X)\|_2$ and the variance as $\operatorname{var}(X) = d_2^2(X, \mathbb{E}(X)) = \mathbb{E}\left[(X - \mathbb{E}(X))^2\right]$.
The root mean square error function is $\operatorname{rmse}(t) = d_2(X, t) = \sqrt{\mathbb{E}\left[(X - t)^2\right]}$. This function is minimized when $t = \mathbb{E}(X)$.
The mean absolute error is $\operatorname{mae}(t) = d_1(X, t) = \mathbb{E}\left[|X - t|\right]$. This function is minimized when $t$ is any median of $X$. This point is confusing, and the text doesn't explain it so well; the intuition is that a median splits the probability mass in half, so nudging $t$ away from a median moves it away from at least as much mass as it moves toward, and the expected absolute deviation can only stay the same or grow.
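A quick numerical check may help with this (my own sketch, not from the text): for a skewed sample, scan candidate values of $t$ and see where the RMSE and MAE curves bottom out.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=50_000)  # a skewed sample, so the mean and median differ

ts = np.linspace(0.0, 3.0, 601)  # candidate predictors t
rmse = np.array([np.sqrt(np.mean((x - t) ** 2)) for t in ts])
mae = np.array([np.mean(np.abs(x - t)) for t in ts])

print("RMSE minimized near t =", ts[np.argmin(rmse)], "; sample mean  =", x.mean())
print("MAE  minimized near t =", ts[np.argmin(mae)], "; sample median =", np.median(x))
```

For this distribution the two minimizers land near $1$ (the mean) and near $\ln 2 \approx 0.69$ (the median), so the two error functions really do pick out different "centers".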
Once we have a measure of distance, we also have a criterion for convergence. $X_n \to X$ in $k$th mean as $n \to \infty$ if $d_k(X_n, X) = \|X_n - X\|_k \to 0$ as $n \to \infty$.
When $k = 1$, we say that $X_n$ approaches $X$ in mean, and when $k = 2$ we say $X_n \to X$ in mean square.
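For a small worked example (my own, not from the text): suppose $X_n$ takes the value $1$ with probability $1/n$ and $0$ otherwise. Then

$$\|X_n - 0\|_k = \left(\mathbb{E}\left[|X_n|^k\right]\right)^{1/k} = \left(\tfrac{1}{n}\right)^{1/k} \to 0 \quad \text{as } n \to \infty,$$

so $X_n \to 0$ in $k$th mean for every $k \geq 1$.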
Convergence in $k$th mean implies that the norms converge, but not vice versa. Meaning $X_n \to X$ in $k$th mean implies $\|X_n\|_k \to \|X\|_k$ (this follows from the reverse triangle inequality, $\left|\,\|X_n\|_k - \|X\|_k\,\right| \leq \|X_n - X\|_k$).
The text also states that convergence in mean is stronger than convergence in probability, however this term is not defined there (for reference, convergence in probability means $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$ for every $\epsilon > 0$).
$\mathscr{L}_2$ is special because it's the space where the norm corresponds to an inner product.
Definition: The inner product of $X$ and $Y$ is defined as $\langle X, Y \rangle = \mathbb{E}(XY)$.
We get our properties for an inner product fairly easily:

- $\langle X, Y \rangle = \langle Y, X \rangle$
- $\langle aX + bY, Z \rangle = a \langle X, Z \rangle + b \langle Y, Z \rangle$ for $a, b \in \mathbb{R}$
- $\langle X, X \rangle \geq 0$, and $\langle X, X \rangle = 0$ if and only if $P(X = 0) = 1$
We can define covariance and correlation with this inner product.
Definition: $\operatorname{cov}(X, Y) = \langle X - \mathbb{E}(X), \, Y - \mathbb{E}(Y) \rangle = \mathbb{E}\left[(X - \mathbb{E}(X))(Y - \mathbb{E}(Y))\right]$
Definition: $\operatorname{cor}(X, Y) = \dfrac{\langle X - \mathbb{E}(X), \, Y - \mathbb{E}(Y) \rangle}{\|X - \mathbb{E}(X)\|_2 \, \|Y - \mathbb{E}(Y)\|_2} = \dfrac{\operatorname{cov}(X, Y)}{\operatorname{sd}(X)\operatorname{sd}(Y)}$
So $X$ and $Y$ are uncorrelated if the centered variables $X - \mathbb{E}(X)$ and $Y - \mathbb{E}(Y)$ are orthogonal in $\mathscr{L}_2$.
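To see the geometry numerically (my own sketch, not from the text), the sample correlation of two variables can be compared against the cosine of the angle between the centered sample vectors, which is exactly what the inner-product view predicts.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10_000)
y = 0.6 * x + 0.8 * rng.normal(size=10_000)  # correlated with x by construction

xc, yc = x - x.mean(), y - y.mean()  # the centered variables

# Correlation as an inner product: <X - E(X), Y - E(Y)> / (||X - E(X)||_2 ||Y - E(Y)||_2),
# i.e. the cosine of the angle between the centered vectors.
cos_angle = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

print("cosine of angle   :", cos_angle)
print("sample correlation:", np.corrcoef(x, y)[0, 1])
```

The two numbers agree up to floating point, and orthogonal centered vectors would give a cosine, hence a correlation, of zero.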
We can see that the inner product corresponds to the 2-norm, since $\langle X, X \rangle = \mathbb{E}(X^2) = \|X\|_2^2$, so $\|X\|_2$ also corresponds to the root mean square of $X$.
We call $\mathscr{L}_2$ a Hilbert space (a complete inner product space).
Theorem: The Cauchy-Schwarz inequality states that $|\langle X, Y \rangle| \leq \|X\|_2 \, \|Y\|_2$, or in more familiar terms, $|\mathbb{E}(XY)| \leq \sqrt{\mathbb{E}(X^2)}\,\sqrt{\mathbb{E}(Y^2)}$.
We can also write this like $|\operatorname{cov}(X, Y)| \leq \operatorname{sd}(X)\operatorname{sd}(Y)$ (apply the inequality to the centered variables), which is the same as $-1 \leq \operatorname{cor}(X, Y) \leq 1$.
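One standard way to see why the inequality holds (a sketch I'm adding, not from the text): for any real $t$,

$$0 \leq \mathbb{E}\left[(X - tY)^2\right] = \mathbb{E}(X^2) - 2t\,\mathbb{E}(XY) + t^2\,\mathbb{E}(Y^2),$$

so this quadratic in $t$ has at most one real root, its discriminant $4\,\mathbb{E}(XY)^2 - 4\,\mathbb{E}(X^2)\,\mathbb{E}(Y^2)$ must be $\leq 0$, and the inequality follows (assuming $\mathbb{E}(Y^2) > 0$; otherwise $Y = 0$ with probability $1$ and both sides are $0$).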
Discussions of best linear predictors work nicely when described as projections onto subspaces. Say $\mathscr{U}$ is a subspace of $\mathscr{L}_2$ and $X \in \mathscr{L}_2$; then the projection of $X$ onto $\mathscr{U}$ is the vector $V \in \mathscr{U}$ such that $X - V$ is perpendicular to $\mathscr{U}$, i.e. $\langle X - V, U \rangle = 0$ for all $U \in \mathscr{U}$.
This projection (if it exists) is unique. Furthermore, $\|X - V\|_2 \leq \|X - U\|_2$ for all $U \in \mathscr{U}$. This implies $V$ is the point of $\mathscr{U}$ that minimizes the distance to $X$.
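Why perpendicularity gives the minimizing property (a one-step argument I'm filling in): for any $U \in \mathscr{U}$, the vector $V - U$ lies in $\mathscr{U}$ and is therefore orthogonal to $X - V$, so

$$\|X - U\|_2^2 = \|(X - V) + (V - U)\|_2^2 = \|X - V\|_2^2 + \|V - U\|_2^2 \geq \|X - V\|_2^2.$$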
We say that $\mathscr{U} = \{a + bX : a, b \in \mathbb{R}\}$ is a subspace of $\mathscr{L}_2$, the subspace generated by $1$ and $X$ (linear combinations of these two vectors).
We write $Y = V + (Y - V)$, where $Y - V$ is perpendicular to $\mathscr{U}$. Then we can say that the best linear predictor of $Y$ given $X$ is the projection $V$ of $Y$ onto $\mathscr{U}$, which is spanned by $1$ and $X - \mathbb{E}(X)$ (an orthogonal pair, since $\langle 1, X - \mathbb{E}(X) \rangle = \mathbb{E}\left[X - \mathbb{E}(X)\right] = 0$).
So

$$V = \frac{\langle Y, 1 \rangle}{\langle 1, 1 \rangle} \, 1 + \frac{\langle Y, \, X - \mathbb{E}(X) \rangle}{\langle X - \mathbb{E}(X), \, X - \mathbb{E}(X) \rangle} \, (X - \mathbb{E}(X))$$
Or in the language of random variables,

$$L(Y \mid X) = \mathbb{E}(Y) + \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)} \left(X - \mathbb{E}(X)\right)$$
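As a sanity check (my own sketch, not from the text), the coefficients of the best linear predictor can be estimated from a sample and compared against an ordinary least-squares fit, which minimizes the same $\mathscr{L}_2$ distance and so should agree.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=2.0, scale=1.5, size=20_000)
y = 3.0 + 0.5 * x + rng.normal(scale=0.7, size=20_000)  # a linear signal plus noise

# Best linear predictor: L(Y|X) = E(Y) + cov(X, Y) / var(X) * (X - E(X))
slope = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()
print("projection formula: intercept =", intercept, " slope =", slope)

# An ordinary least-squares line minimizes the same squared distance, so it should match.
ols_slope, ols_intercept = np.polyfit(x, y, deg=1)
print("np.polyfit        : intercept =", ols_intercept, " slope =", ols_slope)
```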