Obtaining Estimators
Our goal is to estimate the parameters of a probabilistic model given the results of an experiment
Definition: A function of a random sample that approximates a parameter is called a statistic or an estimator. An estimator of $\theta$ is denoted $\hat{\theta}$. The maximum likelihood estimate is denoted $\hat{\theta}_{\mathrm{MLE}}$.
Definition: The probability of the observed trial, $L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$, is called the likelihood function (where $f$ is a pmf or pdf)
Poisson: $p(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$, so $L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{k_i} e^{-\lambda}}{k_i!} = \frac{\lambda^{\sum_i k_i} e^{-n\lambda}}{\prod_i k_i!}$
Exponential: $f(x; \lambda) = \lambda e^{-\lambda x}$, so $L(\lambda) = \prod_{i=1}^{n} \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda \sum_i x_i}$
Uniform: $L(\theta) = \frac{1}{\theta^n}$ if $0 \le x_i \le \theta$ for all $i$, or $0$ otherwise
Example: To estimate $\theta$ on the uniform distribution $[0, \theta]$, we simply take $\hat{\theta} = \max_i x_i$
It’s pretty intuitive how this maximizes $L(\theta)$: the likelihood $1/\theta^n$ shrinks as $\theta$ grows, so we take the smallest $\theta$ consistent with the data
Bernoulli is perhaps the most intuitive: if we get $k$ heads in an experiment of $n$ trials, then $\hat{p} = \frac{k}{n}$
We can do this more generally by taking the derivative of our likelihood formula (use the product rule) and finding a local maximum
$L(p) = p^k (1-p)^{n-k}$, so $\frac{dL}{dp} = k p^{k-1}(1-p)^{n-k} - (n-k)\, p^k (1-p)^{n-k-1} = 0 \implies \hat{p} = \frac{k}{n}$
Ta-Da! Math tells us something we already know but in a more complicated way :)
We can use regular calculus techniques to verify that setting the derivative of a function to $0$ results in a global maximum
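As a quick sanity check (a minimal sketch of my own, not from the text), we can evaluate the Bernoulli likelihood on a grid and confirm that it peaks at $k/n$:

```python
import numpy as np

k, n = 7, 10                      # say, 7 heads in 10 trials
p_grid = np.linspace(0.001, 0.999, 9_999)
likelihood = p_grid**k * (1 - p_grid)**(n - k)

# The grid maximum should land (up to grid resolution) at k/n = 0.7
print(p_grid[np.argmax(likelihood)])
```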
We can also do something neat by maximizing the log-likelihood $\ln L(\theta)$, which gives us the same result (since $\ln$ is monotonic) in an easier way. This is smart because it lets us rewrite a product expression into a sum expression, and moves the exponents down to coefficients.
This is almost always a good technique for analyzing likelihoods
When we do this analysis on the Poisson, we also get the sample mean: $\ln L(\lambda) = \left(\sum_i k_i\right)\ln\lambda - n\lambda - \sum_i \ln(k_i!)$, and setting the derivative to $0$ gives $\hat{\lambda} = \frac{1}{n}\sum_i k_i$
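Here is a small numerical check (my own sketch, with arbitrary simulated data) that maximizing the Poisson log-likelihood recovers the sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.poisson(lam=3.5, size=200)

# Log-likelihood up to an additive constant (the sum of log(k_i!) terms drops out)
lam_grid = np.linspace(0.01, 10, 100_000)
log_lik = data.sum() * np.log(lam_grid) - len(data) * lam_grid

print(lam_grid[np.argmax(log_lik)], data.mean())  # agree up to grid resolution
```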
Recall the example earlier on the uniform distribution. Since that likelihood is not differentiable (it drops to $0$ at $\theta = \max_i x_i$ and has no interior critical point), we had to take an order statistic on the trial, $\hat{\theta} = x_{(n)} = \max_i x_i$. A similar idea works for a shifted exponential $f(x; \theta) = e^{-(x - \theta)}$, $x \ge \theta$, where $\hat{\theta} = x_{(1)} = \min_i x_i$.
To do this in general with a distribution that has multiple parameters, we set all partial derivatives of the (log-)likelihood to $0$ and solve the resulting system
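For instance, here is a rough sketch (my own, with made-up parameter values) of fitting both parameters of a normal model by numerically driving the partial derivatives to zero, i.e. minimizing the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=500)

# Negative log-likelihood for a two-parameter normal model.
# Parameterize by log(sigma) so the optimizer never tries sigma <= 0.
def neg_log_lik(params):
    mu, log_sigma = params
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_lik, x0=[0.0, 0.0])   # default BFGS, numeric gradient
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Compare to the closed-form solutions of the partial-derivative equations
print(mu_hat, data.mean())
print(sigma_hat, data.std(ddof=0))   # the MLE uses the 1/n (biased) variance
```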
The Method of Moments
The method of moments is often more tractable than the method of maximum likelihood when the underlying model has multiple parameters
Suppose we have $s$ unknown parameters $\theta_1, \dots, \theta_s$. The first $s$ theoretical moments of $Y$, if they exist, are $E(Y^j) = \int_{-\infty}^{\infty} y^j f_Y(y; \theta_1, \dots, \theta_s)\, dy$ for $j = 1, \dots, s$. This gives us $s$ equations
We also have sample moments, $m_j = \frac{1}{n} \sum_{i=1}^{n} y_i^j$
The sample moments can be used as approximations for the theoretical moments, giving us a system of $s$ equations in the $s$ unknown parameters
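As an illustration (my own sketch, using a gamma model with made-up parameter values), the first two moment equations can be solved for the shape $r$ and rate $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated data: gamma with shape r = 3 and rate lam = 2 (i.e. scale = 1/2)
data = rng.gamma(shape=3.0, scale=0.5, size=10_000)

# First two sample moments
m1 = np.mean(data)
m2 = np.mean(data**2)

# Theoretical moments: E[Y] = r/lam and E[Y^2] = r(r + 1)/lam^2.
# Solving that 2x2 system in terms of m1 and m2:
lam_hat = m1 / (m2 - m1**2)
r_hat = m1**2 / (m2 - m1**2)

print(r_hat, lam_hat)   # should land near 3 and 2
```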
Interval Estimations
The issue with point estimates is that they do not tell us how reliable they are. Confidence intervals let us quantify the uncertainty in an estimator
We’ve already looked at confidence intervals in the context of CLT
If we want an interval that captures $\mu$ with probability $1 - \alpha$, we can use the CLT to get $\left( \bar{y} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{y} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$ (where $z_{\alpha/2}$ satisfies $P(Z \ge z_{\alpha/2}) = \alpha/2$)
The same idea can apply to other distributions, like large-sample binomial random variables
Theorem: Let $k$ be the number of successes in $n$ independent trials (where $n$ is large). A $100(1-\alpha)\%$ confidence interval for $p$ is $\left( \frac{k}{n} - z_{\alpha/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}},\ \frac{k}{n} + z_{\alpha/2} \sqrt{\frac{(k/n)(1 - k/n)}{n}} \right)$
This is effectively the same equation listed above, as the standard deviation of a Bernoulli random variable is $\sqrt{p(1-p)}$
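A minimal sketch of this interval in code (the helper name binomial_ci is my own):

```python
import numpy as np
from scipy.stats import norm

def binomial_ci(k, n, alpha=0.05):
    """Large-sample confidence interval for a binomial proportion p."""
    p_hat = k / n
    z = norm.ppf(1 - alpha / 2)                        # z_{alpha/2}
    half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

print(binomial_ci(k=540, n=1000))   # 95% interval around an observed 0.54
```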
Note that when estimating intervals from data, it is more accurate to speak of the chance that the interval covers the true parameter, not vice versa. In other words, the true value of the parameter is a fixed fact, while our interval comes from a random procedure.
In the popular press, estimates include a margin of error, which is typically half the maximum width of a $95\%$ confidence interval. The margin of error associated with an estimate $\frac{k}{n}$ is therefore $d = \frac{1.96}{2\sqrt{n}}$
It’s important to note that the values within a margin of error are not all equally likely: if a political survey shows Bob leading by less than the margin of error, we cannot conclude that the race is a tie. With the numbers the text uses, Bob still has a clearly better-than-even chance of winning.
How do we choose $n$ for an experiment with the binomial distribution? We’d like the smallest $n$ such that $P\left( \left| \frac{X}{n} - p \right| < d \right) \ge 1 - \alpha$
We transform the inequality (via the CLT) to obtain $\frac{d\sqrt{n}}{\sqrt{p(1-p)}} \ge z_{\alpha/2}$, implying $n \ge \frac{z_{\alpha/2}^2\, p(1-p)}{d^2}$. Since $p(1-p) \le \frac{1}{4}$, we can use $n = \frac{z_{\alpha/2}^2}{4 d^2}$ as an acceptable bound
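In code (a sketch; required_n is a name I made up):

```python
import math
from scipy.stats import norm

def required_n(d, alpha=0.05):
    """Smallest n guaranteeing k/n lands within d of p with probability >= 1 - alpha,
    using the worst-case bound p(1 - p) <= 1/4."""
    z = norm.ppf(1 - alpha / 2)
    return math.ceil(z**2 / (4 * d**2))

print(required_n(d=0.03))   # roughly 1068 respondents for a +/- 3% margin
```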
Properties of Estimators
Maximum likelihood and the method of moments often yield different estimators. Which should we use? What’s a good estimator and is there a best one?
Biasedness
Definition: An estimator $\hat{\theta}$ is unbiased if $E(\hat{\theta}) = \theta$ for all $\theta$
If $\lim_{n \to \infty} E(\hat{\theta}_n) = \theta$ then we say $\hat{\theta}$ is asymptotically unbiased
We can sometimes construct an unbiased estimator directly from a biased estimator
Example: Is the maximum likelihood estimator $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$ unbiased (normal distribution)?
We find $E(\hat{\sigma}^2) = \frac{n-1}{n} \sigma^2$, so $\hat{\sigma}^2$ is biased (but not asymptotically, since $\frac{n-1}{n} \to 1$)
The unbiased version is often called the sample variance, denoted $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$
The sample standard deviation is denoted $S$ and is the most common estimator for $\sigma$, although ironically it is biased (taking the square root introduces a downward bias).
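A quick simulation (my own sketch, with arbitrary parameter choices) that exhibits the $\frac{n-1}{n}$ bias and the residual bias of $S$:

```python
import numpy as np

rng = np.random.default_rng(2)
true_var = 4.0
n, trials = 10, 100_000

samples = rng.normal(0.0, np.sqrt(true_var), size=(trials, n))
mle_var = samples.var(axis=1, ddof=0)       # the 1/n estimator (biased)
sample_var = samples.var(axis=1, ddof=1)    # the 1/(n-1) estimator (unbiased)

print(mle_var.mean())                # about (n-1)/n * 4 = 3.6
print(sample_var.mean())             # about 4.0
print(np.sqrt(sample_var).mean())    # a bit below 2.0: S is still biased for sigma
```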
So great, we’ve figured out how to unbias an estimator. Now how do we choose between unbiased estimators?
Efficiency
Besides biasedness, we can also measure the precision of an estimator in terms of its variance
Definition: Between two unbiased estimators, if $\operatorname{Var}(\hat{\theta}_1) < \operatorname{Var}(\hat{\theta}_2)$ then we say $\hat{\theta}_1$ is more efficient than $\hat{\theta}_2$. The relative efficiency of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$ is $\frac{\operatorname{Var}(\hat{\theta}_2)}{\operatorname{Var}(\hat{\theta}_1)}$
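As a concrete (made-up) example, both the sample mean and the sample median are unbiased for the center of a normal distribution, but the mean is more efficient:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 25, 50_000

samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))
means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

# Both estimators center on mu = 0, but the mean's variance is smaller
print(means.var(), medians.var())
print(medians.var() / means.var())   # relative efficiency of the mean, roughly pi/2
```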
An obvious question (that Shahriar waved away in class for some reason) is whether we can identify an estimator with the smallest variance
Cramér-Rao Inequality: Let $f_Y(y; \theta)$ be a continuous pdf with continuous first-order and second-order derivatives in $\theta$. Also, suppose that the set of $y$ values where $f_Y(y; \theta) \ne 0$ does not depend on $\theta$. Then any unbiased estimator $\hat{\theta}$ based on a random sample of size $n$ satisfies $\operatorname{Var}(\hat{\theta}) \ge \left[ n \, E\!\left( \left( \frac{\partial \ln f_Y(Y; \theta)}{\partial \theta} \right)^2 \right) \right]^{-1}$
Definition: Let $\Theta$ be the set of all unbiased estimators for $\theta$. We say that $\hat{\theta}^*$ is a best or minimum-variance estimator if $\hat{\theta}^* \in \Theta$ and $\operatorname{Var}(\hat{\theta}^*) \le \operatorname{Var}(\hat{\theta})$ for all $\hat{\theta} \in \Theta$.
The unbiased estimator $\hat{\theta}$ is efficient if $\operatorname{Var}(\hat{\theta})$ equals the Cramér-Rao lower bound for $\theta$. The efficiency of $\hat{\theta}$ is the ratio of the lower bound to $\operatorname{Var}(\hat{\theta})$.
Note that it’s possible for the best estimator to not be efficient (meaning no estimators can reach the Cramér-Rao lower bound)
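For contrast, here is a worked case (my own check, not from the notes) where the bound is attained. For a Bernoulli sample, $\ln p_X(k; p) = k \ln p + (1 - k)\ln(1 - p)$, so

$$E\!\left[ \left( \frac{\partial \ln p_X(X; p)}{\partial p} \right)^2 \right] = E\!\left[ \left( \frac{X}{p} - \frac{1 - X}{1 - p} \right)^2 \right] = \frac{1}{p} + \frac{1}{1 - p} = \frac{1}{p(1 - p)},$$

giving a Cramér-Rao lower bound of $\frac{p(1-p)}{n}$. Since $\operatorname{Var}\!\left( \frac{1}{n} \sum_i X_i \right) = \frac{p(1-p)}{n}$ matches it exactly, the sample proportion is an efficient estimator of $p$.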
Sufficiency
Suppose that a random sample of size $n$ is taken from the Bernoulli pdf $p_X(k; p) = p^k (1-p)^{1-k}$, $k \in \{0, 1\}$. The maximum likelihood estimator is $\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i$, so our maximum likelihood estimate is $p_e = \frac{1}{n} \sum_{i=1}^{n} k_i$
We’re interested in the conditional probability $P(X_1 = k_1, \dots, X_n = k_n \mid \hat{p} = p_e)$
Since $\hat{p} = p_e$ exactly when $\sum_i k_i = n p_e$, the joint probability in the numerator is $p^{n p_e} (1-p)^{n - n p_e}$ and $P(\hat{p} = p_e) = \binom{n}{n p_e} p^{n p_e} (1-p)^{n - n p_e}$, so our conditional probability equals $1 / \binom{n}{n p_e}$
Our estimator $\hat{p}$ is sufficient, because $1 / \binom{n}{n p_e}$ is not a function of $p$
This conceptually tells us that everything our data can tell us about $p$ is contained in $\hat{p}$: once we condition on $\hat{p}$, the joint pdf of the sample no longer depends on $p$
Definition: The statistic $\hat{\theta} = h(X_1, \dots, X_n)$ is sufficient for $\theta$ if the likelihood factors as $L(\theta) = p_{\hat{\theta}}(h(k_1, \dots, k_n); \theta)\, b(k_1, \dots, k_n)$, where $p_{\hat{\theta}}$ is the pdf of $\hat{\theta}$ and $b$ does not depend on $\theta$
A one-to-one function of a sufficient statistic is also sufficient; therefore we can construct an unbiased estimator for $\theta$ from a sufficient but biased statistic without losing sufficiency
Sometimes this definition is difficult to use, since our statistic might have a difficult pdf. The following theorem is an alternative factorization criterion
Theorem: The statistic $\hat{\theta} = h(X_1, \dots, X_n)$ is sufficient for $\theta$ if and only if there are functions $g$ and $b$ such that $L(\theta) = g(h(k_1, \dots, k_n); \theta)\, b(k_1, \dots, k_n)$, where $b$ does not depend on $\theta$
This works because we are able to “convert” the $g$ portion to include the pdf of $\hat{\theta}$, absorbing the leftover factor (which does not involve $\theta$) into $b$
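For example (a worked sketch of my own using the factorization criterion), for a Poisson sample

$$L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{k_i} e^{-\lambda}}{k_i!} = \underbrace{\lambda^{\sum_i k_i} e^{-n\lambda}}_{g\left(\sum_i k_i;\ \lambda\right)} \cdot \underbrace{\frac{1}{\prod_i k_i!}}_{b(k_1, \dots, k_n)},$$

so $\sum_i k_i$ (and hence the sample mean) is sufficient for $\lambda$.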
If a sufficient estimator $\hat{\theta}$ exists for $\theta$, then an MLE maximizes $L(\theta) = g(\hat{\theta}; \theta)\, b(k_1, \dots, k_n)$ and is therefore a function of $\hat{\theta}$. It follows that maximum likelihood estimators are functions of sufficient estimators, which is an important justification for preferring them over method of moments estimators. (Would this imply they are often also sufficient, given the one-to-one statement above?)
Rao-Blackwell: Given any unbiased estimator, conditioning it on a sufficient statistic produces another unbiased estimator whose variance is no larger. This lets us restrict our search for good estimators to functions of a sufficient statistic.
Consistency
It’s sometimes important to consider the asymptotic behavior of estimators, in case they have a problem that arises in the limit
For example, earlier we defined the term asymptotically unbiased
Definition: An estimator $\hat{\theta}_n$ is consistent for $\theta$ if it converges in probability to $\theta$, i.e. $\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| < \varepsilon) = 1$ for all $\varepsilon > 0$
We can define this equivalently by saying that for every $\varepsilon > 0$ and $\delta > 0$ there exists an $N$ such that for all $n \ge N$, $P(|\hat{\theta}_n - \theta| < \varepsilon) > 1 - \delta$
This can be used as a tool to find the required number of samples given $\varepsilon$, $\delta$, and the variance of the estimator
Chebyshev’s Inequality: Let $X$ be any random variable with mean $\mu$ and variance $\sigma^2$. For any $\varepsilon > 0$, $P(|X - \mu| \ge \varepsilon) \le \frac{\sigma^2}{\varepsilon^2}$
Proof: In the continuous case, $\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx \ge \int_{|x - \mu| \ge \varepsilon} (x - \mu)^2 f(x)\, dx \ge \varepsilon^2 \int_{|x - \mu| \ge \varepsilon} f(x)\, dx = \varepsilon^2 P(|X - \mu| \ge \varepsilon)$
Or, equivalently (in the form matching the consistency definition above), $P(|X - \mu| < \varepsilon) \ge 1 - \frac{\sigma^2}{\varepsilon^2}$
Chebyshev’s inequality is a useful tool for proving consistency (I believe specifically for unbiased estimators, where the mean of $\hat{\theta}_n$ is $\theta$). We can use it to establish the bound $P(|\hat{\theta}_n - \theta| < \varepsilon) \ge 1 - \frac{\operatorname{Var}(\hat{\theta}_n)}{\varepsilon^2}$, so it suffices to show $\operatorname{Var}(\hat{\theta}_n) \to 0$
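A small simulation (my own sketch, with arbitrary values) comparing the empirical probability $P(|\bar{X} - \mu| \ge \varepsilon)$ against the Chebyshev bound $\frac{\sigma^2}{n \varepsilon^2}$ as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2, eps = 1.0, 0.1

for n in [10, 100, 1_000, 10_000]:
    # For normal data the sample mean is exactly N(mu, sigma^2 / n), so draw it directly
    means = rng.normal(0.0, np.sqrt(sigma2 / n), size=100_000)
    empirical = np.mean(np.abs(means) >= eps)   # P(|X_bar - mu| >= eps), with mu = 0
    chebyshev = sigma2 / (n * eps**2)           # Var(X_bar) / eps^2
    print(n, empirical, min(chebyshev, 1.0))
```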
Related to Chebyshev’s inequality, the weak law of large numbers says that the sample mean is a consistent estimator for $\mu$ (its proof is exactly the argument above, since $\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} \to 0$)
Also note that we can generally show that maximum likelihood estimators are consistent (under standard regularity conditions)