Recall, the expected value of a random variable $X$ is $E[X] = \sum_{x} x\,p(x)$ if $X$ is discrete and $E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$ if $X$ is continuous
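
As a quick concrete illustration (my own example, not from the notes): for a roll of a fair six-sided die, $E[X] = \sum_{k=1}^{6} k \cdot \tfrac{1}{6} = 3.5$.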

If $Y = aX + b$ for some constants $a$ and $b$, then $E[aX + b] = aE[X] + b$

If $X$ and $Y$ have a joint probability mass function $p(x, y)$, then $E[g(X, Y)] = \sum_{y}\sum_{x} g(x, y)\,p(x, y)$

The continuous analog to this is $E[g(X, Y)] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} g(x, y)\,f(x, y)\,dx\,dy$

We can apply this to find $E[X + Y] = E[X] + E[Y]$

An inductive argument shows $E[X_1 + X_2 + \cdots + X_n] = E[X_1] + E[X_2] + \cdots + E[X_n]$. This property is extremely useful, and is often referred to as linearity of expectation.
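
A minimal simulation sketch (my own, not from the notes) showing that linearity holds even for dependent variables; the die-roll setup is assumed purely for illustration:

```python
# Linearity of expectation holds even when X and Y are dependent.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(1, 7, size=100_000)  # rolls of a fair die
y = 7 - x                             # completely determined by x

# E[X + Y] should match E[X] + E[Y] = 3.5 + 3.5 = 7 despite the dependence.
print((x + y).mean(), x.mean() + y.mean())
```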

We can break down a binomial random variable $X \sim \mathrm{Binomial}(n, p)$ into Bernoulli random variables $X = X_1 + \cdots + X_n$ to show that $E[X] = np$

We can break down a negative binomial random variable $X$ (the number of trials needed to get $r$ successes) into geometric random variables with parameter $p$ (and expectation $1/p$) to show that $E[X] = r/p$

We can also break down a hypergeometric random variable $X$ (the number of white balls when $n$ balls are selected from $N$ balls, of which $m$ are white) into indicator variables: $X_i = 1$ if the $i$th ball selected is white and $X_i = 0$ otherwise. Any ball is equally likely to be the $i$th ball selected, so $E[X_i] = m/N$ and $E[X] = nm/N$
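
A rough numerical sketch (parameters chosen arbitrarily for illustration) comparing simulated means with the three formulas above:

```python
# Compare simulated means with E[Bin(n, p)] = np, E[NegBin(r, p)] = r/p,
# and E[Hypergeometric] = k*m/N (k balls drawn from N, of which m are white).
import numpy as np

rng = np.random.default_rng(1)
n, p, r = 10, 0.3, 5
N, m, k = 20, 8, 6

binom = rng.binomial(n, p, size=200_000)
# numpy's negative_binomial counts failures before the r-th success,
# so add r to get the total number of trials.
negbin = rng.negative_binomial(r, p, size=200_000) + r
hyper = rng.hypergeometric(m, N - m, k, size=200_000)

print(binom.mean(), n * p)       # ~3.0
print(negbin.mean(), r / p)      # ~16.67
print(hyper.mean(), k * m / N)   # ~2.4
```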

Example: Analyzing Quick-Sort
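
The notes don't include the details of this example here; the following is a standard indicator-variable sketch (in my own notation), not necessarily the exact argument given in lecture. Let $X_{ij} = 1$ if the $i$th and $j$th smallest elements are ever compared, and $0$ otherwise. Those two elements are compared exactly when the first pivot chosen from $\{i, i+1, \ldots, j\}$ (in rank order) is either $i$ or $j$, which happens with probability $\frac{2}{j-i+1}$. By linearity of expectation, the expected number of comparisons is
$$E\Big[\sum_{i<j} X_{ij}\Big] = \sum_{i<j} \frac{2}{j-i+1} \approx 2n\ln n$$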

Let $f$ be a function on a finite set $A$ and suppose we want to find $\max_{s \in A} f(s)$. We can find a lower bound probabilistically, since $\max_{s \in A} f(s) \ge E[f(S)]$ for any random element $S$ of $A$

The textbook uses this in an example to find a lower bound on the number of Hamiltonian paths
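
For reference (stated from memory, so it may differ from the textbook's version): if each game of a round-robin tournament on $n$ players is decided by a fair coin flip and $X$ counts the Hamiltonian paths, i.e. orderings $i_1, \ldots, i_n$ in which each player beats the next, then linearity of expectation gives $E[X] = n!/2^{n-1}$, so some tournament has at least $n!/2^{n-1}$ Hamiltonian paths.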

Observe: If $X$ is the number of events that occur from $A_1, A_2, \ldots, A_n$, then $\binom{X}{2}$ is the number of pairs of events that occur. Writing $I_i$ for the indicator of $A_i$, we can see that $\binom{X}{2} = \sum_{i<j} I_i I_j$

We get $E\!\left[\binom{X}{2}\right] = \sum_{i<j} P(A_i A_j)$, or equivalently $E[X^2] - E[X] = 2\sum_{i<j} P(A_i A_j)$

An extension of this is $E\!\left[\binom{X}{k}\right] = \sum_{i_1 < i_2 < \cdots < i_k} P(A_{i_1} A_{i_2} \cdots A_{i_k})$

These identities can help us calculate $E[X^2]$ and $\mathrm{Var}(X)$

We can also write $\mathrm{Var}(X) = E[X^2] - (E[X])^2 = 2\sum_{i<j} P(A_i A_j) + E[X] - (E[X])^2$
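
As a quick worked check (my own example): for a binomial random variable with $A_i = \{\text{trial } i \text{ is a success}\}$, $\sum_{i<j} P(A_i A_j) = \binom{n}{2} p^2$, so
$$\mathrm{Var}(X) = n(n-1)p^2 + np - (np)^2 = np(1-p)$$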

Definition: The covariance between $X$ and $Y$ is defined as $\mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$. Expanding this yields $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$.

If $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$; however, the converse is not true. For example, if $X$ is uniform on $\{-1, 0, 1\}$ and $Y = X^2$, then $\mathrm{Cov}(X, Y) = E[X^3] - E[X]E[X^2] = 0$ even though $X$ and $Y$ are clearly dependent

We have some simple properties:

1. $\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$
2. $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$
3. $\mathrm{Cov}(aX, Y) = a\,\mathrm{Cov}(X, Y)$
4. $\mathrm{Cov}\!\left(\sum_{i=1}^{n} X_i, \sum_{j=1}^{m} Y_j\right) = \sum_{i=1}^{n}\sum_{j=1}^{m} \mathrm{Cov}(X_i, Y_j)$

With 4., we can derive $\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2\sum_{i<j} \mathrm{Cov}(X_i, X_j)$

If $X_1, \ldots, X_n$ are pairwise independent, then $\mathrm{Var}\!\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i)$
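
A small numerical sketch (my own, with an arbitrary correlated pair) of the variance-of-a-sum identity above:

```python
# Check Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y) on correlated samples.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500_000)
y = 0.5 * x + rng.normal(size=500_000)   # deliberately correlated with x

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * cov_xy
print(lhs, rhs)   # the two agree up to sampling noise
```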

The correlation of two random variables $X$ and $Y$, denoted by $\rho(X, Y)$, is defined as $\rho(X, Y) = \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$.

We can prove that $-1 \le \rho(X, Y) \le 1$ pretty simply. Say $X$ and $Y$ have variances given by $\sigma_X^2$ and $\sigma_Y^2$. Then

$$0 \le \mathrm{Var}\!\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right) = 2\big[1 + \rho(X, Y)\big] \qquad\text{and}\qquad 0 \le \mathrm{Var}\!\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 2\big[1 - \rho(X, Y)\big],$$

which together give $-1 \le \rho(X, Y) \le 1$.

$\rho(X, Y)$ behaves similarly to covariance but is nicely bounded. It can also be viewed as a measure of the degree of linearity between $X$ and $Y$, since $\rho(X, Y) = 1$ implies $\mathrm{Var}\!\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 0$, which means $\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}$ is constant (and similarly $\rho(X, Y) = -1$ forces $\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}$ to be constant).

Conditional Expectation

Remember that we have the conditional probability mass function $p_{X \mid Y}(x \mid y) = P(X = x \mid Y = y) = \dfrac{p(x, y)}{p_Y(y)}$.
It makes sense to also define $E[X \mid Y = y] = \sum_x x\,P(X = x \mid Y = y) = \sum_x x\,p_{X \mid Y}(x \mid y)$

We denote $E[X \mid Y]$ as a function of $Y$ (whose value at $Y = y$ is $E[X \mid Y = y]$). This means $E[X \mid Y]$ is also a random variable.

Theorem: $E[X] = E\big[E[X \mid Y]\big]$
This implies $E[X] = \sum_y E[X \mid Y = y]\,P(Y = y)$ (or $E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y]\,f_Y(y)\,dy$ for a continuous $Y$), which is the expectation analog of the law of total probability

We can also interpret this theorem as taking a weighted average of the conditional expected value of $X$ given $Y = y$, weighted by the probability of each value $y$
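
A standard application (my own example, not necessarily the one from lecture): if $N$ is a nonnegative integer-valued random variable independent of the i.i.d. sequence $X_1, X_2, \ldots$, then conditioning on $N$ gives
$$E\Big[\sum_{i=1}^{N} X_i\Big] = E\Big[E\Big[\sum_{i=1}^{N} X_i \,\Big|\, N\Big]\Big] = E\big[N\,E[X_1]\big] = E[N]\,E[X_1]$$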

This theorem can also help us calculate probabilities: taking $X$ to be the indicator variable of an event $A$ gives $E[X] = P(A)$, so $P(A) = \sum_y P(A \mid Y = y)\,P(Y = y)$
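
For instance (my example): for independent continuous random variables $X$ and $Y$, conditioning on $Y$ gives
$$P(X < Y) = \int_{-\infty}^{\infty} P(X < Y \mid Y = y)\,f_Y(y)\,dy = \int_{-\infty}^{\infty} F_X(y)\,f_Y(y)\,dy$$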

Conditional variance is also well-defined, as $\mathrm{Var}(X \mid Y) = E\big[(X - E[X \mid Y])^2 \mid Y\big]$

To go with it, there's a useful identity: $\mathrm{Var}(X) = E\big[\mathrm{Var}(X \mid Y)\big] + \mathrm{Var}\big(E[X \mid Y]\big)$
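
Applied to the random-sum setup above (again my example), this identity gives
$$\mathrm{Var}\Big(\sum_{i=1}^{N} X_i\Big) = E[N]\,\mathrm{Var}(X_1) + \big(E[X_1]\big)^2\,\mathrm{Var}(N),$$
since $\mathrm{Var}\big(\sum_{i=1}^{N} X_i \mid N\big) = N\,\mathrm{Var}(X_1)$ and $E\big[\sum_{i=1}^{N} X_i \mid N\big] = N\,E[X_1]$.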

Note: If we want to predict $Y$ given $X$ with a function $g(X)$, our best predictor (in the sense of minimizing $E\big[(Y - g(X))^2\big]$) is $g(X) = E[Y \mid X]$

Definition: The moment generating function of $X$ is defined for all real values of $t$ as $M(t) = E\big[e^{tX}\big]$. It's called the moment generating function because all moments of $X$ can be obtained by differentiating $M(t)$ and evaluating at $t = 0$: $M^{(n)}(0) = E[X^n]$.

This assumes that $\frac{d}{dt}E\big[e^{tX}\big] = E\big[\frac{d}{dt}e^{tX}\big]$, i.e. that differentiation and expectation can be interchanged, which is generally true
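
As a quick worked example (mine, not from the notes): for $X \sim \mathrm{Binomial}(n, p)$,
$$M(t) = \sum_{k=0}^{n} e^{tk}\binom{n}{k}p^k(1-p)^{n-k} = \big(pe^t + 1 - p\big)^n,$$
so $M'(0) = n\big(pe^t + 1 - p\big)^{n-1} p e^t\,\big|_{t=0} = np = E[X]$, matching the earlier result.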

The moment generating function of jointly distributed random variables $X_1, \ldots, X_n$ is the multivariable function $M(t_1, \ldots, t_n) = E\big[e^{t_1 X_1 + \cdots + t_n X_n}\big]$. It can be proven that this function uniquely determines the joint distribution of $X_1, \ldots, X_n$. It can also be used to find the individual moment generating functions, since $M_{X_i}(t) = M(0, \ldots, 0, t, 0, \ldots, 0)$ with $t$ in the $i$th position.