Obtaining Estimators

Our goal is to estimate the parameters of a probabilistic model from the outcome of an experiment

Definition: A function of a random sample that approximates a parameter is called a statistic or an estimator. An estimator of $\theta$ is denoted $\hat{\theta}$. The maximum likelihood estimate (the value $\hat{\theta}$ takes on the observed data) is denoted $\theta_e$.

Definition: The probability of the observed sample, $L(\theta) = \prod_{i=1}^n f(x_i; \theta)$ (where $f$ is a pmf or pdf), is called the likelihood function

Poisson: $p_X(k; \lambda) = \frac{e^{-\lambda} \lambda^k}{k!}$, so $L(\lambda) = \prod_{i=1}^n \frac{e^{-\lambda} \lambda^{k_i}}{k_i!} = \frac{e^{-n\lambda}\, \lambda^{\sum_i k_i}}{\prod_i k_i!}$
Exponential: $f_Y(y; \lambda) = \lambda e^{-\lambda y}$, so $L(\lambda) = \prod_{i=1}^n \lambda e^{-\lambda y_i} = \lambda^n e^{-\lambda \sum_i y_i}$
Uniform on $[0, \theta]$: $L(\theta) = \frac{1}{\theta^n}$ if $0 \le y_i \le \theta$ for all $i$, or $0$ otherwise

Example: To estimate $\theta$ on the uniform distribution over $[0, \theta]$, we simply take $\hat{\theta} = \max_i y_i$
It’s pretty intuitive how this maximizes $L(\theta) = \theta^{-n}$: we want $\theta$ as small as possible while still keeping every observation inside $[0, \theta]$

Bernoulli is perhaps the most intuitive:
If we get $k$ heads in an experiment of $n$ trials, then $\hat{p} = \frac{k}{n}$

We can do this more generally by taking the derivative of our likelihood formula (using the product rule) and finding a local maximum

$L(p) = p^k (1-p)^{n-k}$, so $\frac{dL}{dp} = k\, p^{k-1} (1-p)^{n-k} - (n-k)\, p^k (1-p)^{n-k-1}$

Setting this to $0$ and dividing through by $p^{k-1} (1-p)^{n-k-1}$ gives $k(1-p) = (n-k)\,p$, so $\hat{p} = \frac{k}{n}$

Ta-Da! Math tells us something we already know but in a more complicated way :)

We can use regular calculus techniques (checking the second derivative or the boundary values) to verify that setting the derivative to $0$ results in a global maximum

We can also do something neat by maximizing the log-likelihood $\ln L(\theta)$, which gives us the same result (since $\ln$ is monotonic) in an easier way. This is smart, because it allows us to rewrite our product expression into a sum expression, and moves the exponents to coefficients.

$\ln L(p) = k \ln p + (n - k) \ln(1 - p)$

$\frac{d}{dp} \ln L(p) = \frac{k}{p} - \frac{n - k}{1 - p} = 0 \implies k(1 - p) = (n - k)\,p \implies \hat{p} = \frac{k}{n}$
This is almost always a good technique for analyzing likelihood functions
When we do this analysis on the Poisson distribution, we also get the sample mean: $\hat{\lambda} = \frac{1}{n} \sum_i k_i = \bar{k}$
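
As a quick sanity check (a sketch I’m adding here, not from the text, using numpy/scipy and simulated data with an arbitrary true $\lambda$), we can maximize the Poisson log-likelihood numerically and confirm it lands on the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
k = rng.poisson(lam=3.7, size=500)   # simulated counts; true lambda = 3.7 is arbitrary

def neg_log_likelihood(lam):
    # ln L(lambda) = -n*lambda + (sum k_i)*ln(lambda) - sum ln(k_i!)
    # the factorial term does not depend on lambda, so we drop it
    return -(-k.size * lam + k.sum() * np.log(lam))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50), method="bounded")
print(res.x, k.mean())   # numerical maximizer ~ sample mean
```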

Recall the example earlier on the uniform distribution. Since that likelihood is not maximized where a derivative vanishes, we had to take an order statistic of the trials, $\hat{\theta} = y_{(n)} = \max_i y_i$. A similar idea applies for the shifted exponential distribution $f_Y(y; \theta) = e^{-(y - \theta)}$, $y \ge \theta$, where $\hat{\theta} = y_{(1)} = \min_i y_i$.
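
Here’s a similar sketch (hypothetical data, arbitrary true $\theta$) for the uniform case: evaluating $L(\theta) = \theta^{-n}$ on a grid shows the maximizer sits at the largest observation (up to grid resolution):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.uniform(0.0, 5.0, size=20)   # true theta = 5.0 (arbitrary)

thetas = np.linspace(0.01, 10.0, 2000)
# L(theta) = theta^(-n) when theta >= max(y_i), and 0 otherwise
L = np.where(thetas >= y.max(), thetas ** (-y.size), 0.0)
print(thetas[np.argmax(L)], y.max())   # the argmax sits just above max(y), i.e. at y_(n)
```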

To do this in general with a multiparameter distribution, we set all partial derivatives of $\ln L$ to $0$ and solve the resulting system
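
For a multiparameter example, here’s a hedged sketch (simulated data, arbitrary true values) that maximizes the normal log-likelihood over $(\mu, \sigma)$ numerically; the closed-form maximizers are the sample mean and the square root of the $\frac{1}{n}$-version of the variance:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
y = rng.normal(loc=10.0, scale=2.0, size=1000)   # arbitrary true mu = 10, sigma = 2

def neg_log_likelihood(params):
    mu, sigma = params
    # -ln L(mu, sigma), dropping the constant (n/2)*ln(2*pi)
    return y.size * np.log(sigma) + ((y - mu) ** 2).sum() / (2 * sigma ** 2)

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], bounds=[(None, None), (1e-6, None)])
print(res.x)                      # ~ [y.mean(), y.std(ddof=0)]
print(y.mean(), y.std(ddof=0))
```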

The Method of Moments

The method of moments is often more tractable than the method of maximum likelihood when the underlying model has multiple parameters

Suppose we have $s$ unknown parameters $\theta_1, \dots, \theta_s$. The first $s$ theoretical moments of $Y$, if they exist, are $E[Y^j] = \int_{-\infty}^{\infty} y^j f_Y(y; \theta_1, \dots, \theta_s)\,dy$, $j = 1, \dots, s$. This gives us $s$ equations in the $s$ unknowns

We also have sample moments $m_j = \frac{1}{n} \sum_{i=1}^n y_i^j$

The sample moments can be used as approximations for the theoretical moments, giving us a system of $s$ equations whose solution $(\hat{\theta}_1, \dots, \hat{\theta}_s)$ is the method of moments estimate
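
As an illustration (my own sketch, not an example from the text): for a gamma distribution with shape $r$ and rate $\lambda$ we have $E[Y] = \frac{r}{\lambda}$ and $E[Y^2] = \frac{r(r+1)}{\lambda^2}$, so matching the first two sample moments gives $\hat{\lambda} = \frac{m_1}{m_2 - m_1^2}$ and $\hat{r} = \frac{m_1^2}{m_2 - m_1^2}$:

```python
import numpy as np

rng = np.random.default_rng(3)
# numpy parameterizes the gamma by shape and scale = 1/rate
y = rng.gamma(shape=4.0, scale=1 / 2.0, size=5000)   # true r = 4, lambda = 2 (arbitrary)

m1 = y.mean()            # first sample moment
m2 = (y ** 2).mean()     # second sample moment

# solve E[Y] = r/lambda and E[Y^2] = r(r+1)/lambda^2 with m1, m2 plugged in
lam_hat = m1 / (m2 - m1 ** 2)
r_hat = m1 ** 2 / (m2 - m1 ** 2)
print(r_hat, lam_hat)    # close to (4, 2)
```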

Interval Estimations

The issue with point estimates is that they do not tell us how reliable they are. Confidence intervals let us quantify the uncertainty in an estimator

We’ve already looked at confidence intervals in the context of the CLT
If we want the sample average to fall into an interval around $\mu$ with probability $1 - \alpha$, we can use the CLT to get $\left(\bar{y} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{y} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}}\right)$ (where $P(Z \ge z_{\alpha/2}) = \alpha/2$)

The same idea can apply to other distributions, like large-sample binomial random variables

Theorem: Let $k$ be the number of successes in $n$ independent trials (where $n$ is large). An approximate $100(1-\alpha)\%$ confidence interval for $p$ is $\left(\frac{k}{n} - z_{\alpha/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}},\ \frac{k}{n} + z_{\alpha/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}}\right)$

This is effectively the same equation listed above, as the standard deviation of a Bernoulli random variable is $\sqrt{p(1-p)}$
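
A small sketch (with made-up numbers for $n$ and $k$) that computes this interval:

```python
import numpy as np
from scipy.stats import norm

n, k = 1000, 540                  # hypothetical: 540 successes in 1000 trials
alpha = 0.05
p_hat = k / n
z = norm.ppf(1 - alpha / 2)       # z_{alpha/2}, about 1.96

half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat - half_width, p_hat + half_width)   # roughly (0.509, 0.571)
```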

Note that when estimating intervals from data, it’s more accurate to talk about the chance that the interval covers the true parameter, not the chance that the parameter falls in the interval. In other words, the true value of the parameter is a fixed fact, while our interval comes from a random procedure.

In the popular press, estimates include a margin of error, which is typically half the maximum width of a confidence interval. Since the interval is widest when $\frac{k}{n} = \frac{1}{2}$, the margin of error associated with an estimate $\frac{k}{n}$ is $d = \frac{z_{\alpha/2}}{2\sqrt{n}}$ (with $z_{\alpha/2} = 1.96$ for the usual 95% level)

It’s important to note that the values within a margin of error are not all equally likely: if a political survey reveals Bob is winning by a lead smaller than the margin of error, we cannot say that the race is a tie. With the numbers the text uses, Bob still has a much better than even chance of winning.

How do we choose $n$ for an experiment with the binomial distribution? We’d like the smallest $n$ such that $P\left(\left|\frac{X}{n} - p\right| < d\right) \ge 1 - \alpha$

We transform the inequality to obtain $P\left(-\frac{d\sqrt{n}}{\sqrt{p(1-p)}} \le Z \le \frac{d\sqrt{n}}{\sqrt{p(1-p)}}\right) \ge 1 - \alpha$, implying $n = \frac{z_{\alpha/2}^2\, p(1-p)}{d^2}$. Since $p(1-p) \le \frac{1}{4}$, we can use $n = \frac{z_{\alpha/2}^2}{4 d^2}$ as an acceptable (conservative) bound
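
For example, a sketch of the conservative sample-size calculation (the `required_n` helper is just something I’m defining here):

```python
import math
from scipy.stats import norm

def required_n(d, alpha=0.05):
    """Smallest n with P(|X/n - p| < d) >= 1 - alpha, using p(1-p) <= 1/4."""
    z = norm.ppf(1 - alpha / 2)
    return math.ceil(z ** 2 / (4 * d ** 2))

print(required_n(0.03))   # about 1068 respondents for a 3-point margin at 95% confidence
```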

Properties of Estimators

Maximum likelihood and the method of moments often yield different estimators. Which should we use? What’s a good estimator and is there a best one?

Biasedness

Definition: An estimator $\hat{\theta} = h(W_1, \dots, W_n)$ is unbiased for $\theta$ if $E[\hat{\theta}] = \theta$ for all $\theta$

If $\lim_{n \to \infty} E[\hat{\theta}_n] = \theta$ then we say $\hat{\theta}$ is asymptotically unbiased

We can sometimes construct an unbiased estimator directly from a biased estimator

Example: Is $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (Y_i - \bar{Y})^2$ unbiased for $\sigma^2$ (normal distribution)?

$E[\hat{\sigma}^2] = \frac{1}{n} E\left[\sum_{i=1}^n (Y_i - \bar{Y})^2\right] = \frac{1}{n} E\left[\sum_{i=1}^n Y_i^2 - n \bar{Y}^2\right] = \frac{1}{n}\left(n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right)\right) = \frac{n-1}{n}\, \sigma^2$

So $\hat{\sigma}^2$ is biased (but not asymptotically: the bias vanishes as $n \to \infty$)
The unbiased version is often called the sample variance, denoted $S^2 = \frac{1}{n-1} \sum_{i=1}^n (Y_i - \bar{Y})^2$

The sample standard deviation is denoted $S = \sqrt{S^2}$ and is the most common estimator for $\sigma$, although ironically it is biased (taking the square root introduces a downward bias).
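
A quick simulation sketch of the bias (simulated normal data, small $n$ to make the effect visible): the $\frac{1}{n}$ estimator averages to about $\frac{n-1}{n}\sigma^2$, $S^2$ averages to about $\sigma^2$, and $S$ comes in a little under $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials, sigma2 = 5, 200_000, 4.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))

biased = samples.var(axis=1, ddof=0)     # (1/n) * sum (y_i - ybar)^2
unbiased = samples.var(axis=1, ddof=1)   # (1/(n-1)) * sum (y_i - ybar)^2, i.e. S^2

print(biased.mean())             # ~ (n-1)/n * sigma^2 = 3.2
print(unbiased.mean())           # ~ sigma^2 = 4.0
print(np.sqrt(unbiased).mean())  # < 2.0: the sample std underestimates sigma
```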

So great, we’ve figured out how to unbias an estimator. Now how do we choose between unbiased estimators?

Efficiency

Besides biasedness, we can also measure the precision of an estimator in terms of its variance

Definition: Between two unbiased estimators, if $\mathrm{Var}(\hat{\theta}_1) < \mathrm{Var}(\hat{\theta}_2)$ then we say $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$. The relative efficiency of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$ is $\frac{\mathrm{Var}(\hat{\theta}_2)}{\mathrm{Var}(\hat{\theta}_1)}$

An obvious question (that Shahriar waved away in class for some reason) is whether we can identify an estimator with the smallest variance

Cramér-Rao Inequality: Let $f_Y(y; \theta)$ be a continuous pdf with continuous first-order and second-order partial derivatives in $\theta$. Also, suppose that the set of $y$ values where $f_Y(y; \theta) \ne 0$ does not depend on $\theta$. Then any unbiased estimator $\hat{\theta}$ based on a random sample of size $n$ satisfies $\mathrm{Var}(\hat{\theta}) \ge \left(n\, E\left[\left(\frac{\partial \ln f_Y(Y; \theta)}{\partial \theta}\right)^2\right]\right)^{-1} = \left(-n\, E\left[\frac{\partial^2 \ln f_Y(Y; \theta)}{\partial \theta^2}\right]\right)^{-1}$

Definition: Let $\Theta$ be the set of all unbiased estimators for $\theta$. We say that $\hat{\theta}^*$ is a best or minimum-variance estimator if $\hat{\theta}^* \in \Theta$ and $\mathrm{Var}(\hat{\theta}^*) \le \mathrm{Var}(\hat{\theta})$ for all $\hat{\theta} \in \Theta$.

The unbiased estimator $\hat{\theta}$ is efficient if $\mathrm{Var}(\hat{\theta})$ equals the Cramér-Rao lower bound. The efficiency of $\hat{\theta}$ is the ratio of the lower bound to $\mathrm{Var}(\hat{\theta})$.

Note that it’s possible for the best estimator to not be efficient (meaning no unbiased estimator reaches the Cramér-Rao lower bound)
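
As a worked check (my own sketch, not from the text): for the Poisson, $\frac{\partial \ln p_X(k;\lambda)}{\partial \lambda} = \frac{k}{\lambda} - 1$, so $E\left[\left(\frac{\partial \ln p_X}{\partial \lambda}\right)^2\right] = \frac{1}{\lambda}$ and the lower bound is $\frac{\lambda}{n}$, which is exactly the variance of the sample mean, so $\hat{\lambda} = \bar{k}$ is efficient. A simulation with arbitrary numbers agrees:

```python
import numpy as np

rng = np.random.default_rng(5)
lam, n, trials = 3.0, 40, 100_000
k = rng.poisson(lam, size=(trials, n))

lam_hat = k.mean(axis=1)    # the MLE, computed for each simulated experiment
print(lam_hat.var())        # empirical variance of the estimator
print(lam / n)              # Cramer-Rao lower bound, lambda / n = 0.075
```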

Sufficiency

Suppose that a random sample of size $n$ is taken from the Bernoulli pmf $p_X(k; p) = p^k (1-p)^{1-k}$, $k \in \{0, 1\}$. The maximum likelihood estimator is $\hat{p} = \frac{1}{n} \sum_{i=1}^n X_i$, so our maximum likelihood estimate is $p_e = \frac{1}{n} \sum_{i=1}^n k_i$

We’re interested in the conditional probability $P(X_1 = k_1, \dots, X_n = k_n \mid \hat{p} = p_e)$

The numerator is $P(X_1 = k_1, \dots, X_n = k_n) = p^{\sum_i k_i} (1 - p)^{n - \sum_i k_i} = p^{n p_e} (1 - p)^{n - n p_e}$ and the denominator is $P(\hat{p} = p_e) = P\left(\sum_i X_i = n p_e\right) = \binom{n}{n p_e} p^{n p_e} (1 - p)^{n - n p_e}$,
so our conditional probability equals $\binom{n}{n p_e}^{-1}$

Our estimator is sufficient, because $\binom{n}{n p_e}^{-1}$ is not a function of $p$
This conceptually tells us that everything our data can tell us about $p$ is contained in $\hat{p}$: once we condition on the value of $\hat{p}$, the joint pmf of the sample no longer depends on $p$, so the individual observations carry no additional information about it
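
Here’s a simulation sketch of that interpretation (my own, with arbitrary choices of $p$): conditioning Bernoulli sequences on their sum yields essentially the same (uniform) distribution over sequences whether $p = 0.3$ or $p = 0.7$:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(6)

def conditional_dist(p, n=4, total=2, trials=200_000):
    """Empirical distribution of the 0/1 sequence, conditioned on its sum."""
    x = rng.binomial(1, p, size=(trials, n))
    kept = x[x.sum(axis=1) == total]
    counts = Counter(map(tuple, kept.tolist()))
    m = sum(counts.values())
    return {seq: round(c / m, 3) for seq, c in sorted(counts.items())}

print(conditional_dist(0.3))
print(conditional_dist(0.7))
# both are roughly uniform (~1/6) over the C(4,2) = 6 sequences with two successes:
# given the sum, the data say nothing more about p
```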

Definition: The statistic $\hat{\theta} = h(X_1, \dots, X_n)$ is sufficient for $\theta$ if the likelihood factors as $L(\theta) = p_{\hat{\theta}}(\theta_e; \theta)\, b(k_1, \dots, k_n)$, where $p_{\hat{\theta}}$ is the pmf/pdf of $\hat{\theta}$ and $b$ does not depend on $\theta$

A one-to-one function of a sufficient statistic is also sufficient; therefore we can construct an unbiased estimator for $\theta$ from a sufficient but biased statistic for $\theta$ (e.g. by rescaling) without losing sufficiency

Sometimes this definition is difficult to use, since our statistic might have a difficult pdf. The following theorem is an alternative factorization criterion

Theorem: The statistic $\hat{\theta} = h(X_1, \dots, X_n)$ is sufficient for $\theta$ if and only if there are functions $g(\theta_e; \theta)$ and $b(k_1, \dots, k_n)$ such that $L(\theta) = g(\theta_e; \theta)\, b(k_1, \dots, k_n)$

This works because we are able to “convert” the $g(\theta_e; \theta)$ portion to include the pdf of $\hat{\theta}$, absorbing any leftover factor into $b$

If a sufficient estimator $\hat{\theta}$ exists for $\theta$, then the MLE maximizes $L(\theta) = g(\theta_e; \theta)\, b(k_1, \dots, k_n)$, and since only the $g$ factor depends on $\theta$, the maximizer is a function of $\theta_e$. It follows that maximum likelihood estimators are functions of sufficient estimators, which is an important justification for preferring them over method of moments estimators. (Would this imply they are often also sufficient themselves, given the statement above that one-to-one functions of sufficient statistics are sufficient?)

Rao-Blackwell: Any unbiased estimator $\hat{\theta}$ can be improved (or at least matched) by conditioning on a sufficient statistic $T$: the estimator $E[\hat{\theta} \mid T]$ is also unbiased and has variance no larger than $\mathrm{Var}(\hat{\theta})$. This lets us restrict our search for good estimators to functions of a sufficient statistic.

Consistency

It’s sometimes important to consider the asymptotic behavior of estimators, in case they have a problem that arises in the limit

For example, earlier we defined the term asymptotically unbiased

Definition: An estimator $\hat{\theta}_n$ is consistent for $\theta$ if it converges in probability to $\theta$, i.e. $\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| < \varepsilon) = 1$ for all $\varepsilon > 0$

We can state this equivalently by saying that for all $\varepsilon > 0$ and $\delta > 0$ there exists an $N$ such that for all $n \ge N$, $P(|\hat{\theta}_n - \theta| < \varepsilon) > 1 - \delta$

This can be used as a tool to determine the required number of samples $N$ given $\varepsilon$, $\delta$, and the distribution of $\hat{\theta}_n$

Chebyshev’s Inequality: Let $Y$ be any random variable with mean $\mu$ and variance $\sigma^2$. For any $\varepsilon > 0$, $P(|Y - \mu| < \varepsilon) \ge 1 - \frac{\sigma^2}{\varepsilon^2}$

Proof: In the continuous case,

$\sigma^2 = \int_{-\infty}^{\infty} (y - \mu)^2 f_Y(y)\,dy \ge \int_{|y - \mu| \ge \varepsilon} (y - \mu)^2 f_Y(y)\,dy \ge \varepsilon^2 \int_{|y - \mu| \ge \varepsilon} f_Y(y)\,dy = \varepsilon^2\, P(|Y - \mu| \ge \varepsilon)$

so $P(|Y - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$

Or as stated above, $P(|Y - \mu| < \varepsilon) \ge 1 - \frac{\sigma^2}{\varepsilon^2}$

Chebyshev’s inequality is a useful tool for proving consistency (specifically for unbiased estimators, I believe). If $\hat{\theta}_n$ is unbiased and $\mathrm{Var}(\hat{\theta}_n) \to 0$, it shows $P(|\hat{\theta}_n - \theta| < \varepsilon) \ge 1 - \frac{\mathrm{Var}(\hat{\theta}_n)}{\varepsilon^2} \to 1$, which is exactly the bound consistency requires

Related to Chebyshev’s inequality, the weak law of large numbers says that the sample mean $\bar{Y}$ is a consistent estimator for $\mu$

Also note that we can generally show that maximum likelihood estimators are consistent (under mild regularity conditions)
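
A closing simulation sketch of consistency (simulated normal data; $\mu$, $\sigma$, and $\varepsilon$ are arbitrary choices of mine): as $n$ grows, the empirical probability that $\bar{Y}$ lands within $\varepsilon$ of $\mu$ climbs toward $1$, and always sits at or above the Chebyshev bound $1 - \frac{\sigma^2}{n\varepsilon^2}$:

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma, eps = 5.0, 3.0, 0.5

for n in (10, 100, 1000):
    y = rng.normal(mu, sigma, size=(50_000, n))
    coverage = (np.abs(y.mean(axis=1) - mu) < eps).mean()
    chebyshev = max(1 - sigma ** 2 / (n * eps ** 2), 0.0)
    print(n, coverage, chebyshev)
# coverage climbs toward 1 as n grows, never falling below the Chebyshev bound
```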