Parameter Estimation

If we assume that our data is generated by a random variable $X$ with an underlying distribution $P(X)$ that is defined by a parameter, such as the mean $\theta$, then we can try to model the data-generation process, or make inferences, based on an estimate of that parameter.

The better our estimate of the parameters, the better our fit of the distribution gets, and the better we can make guesses about unseen data. For example, if we say our data is generated by a Gaussian random variable with mean $\theta$ and variance $\sigma$,

$$N(\theta, \sigma)$$

we would try to estimate $\theta$ and $\sigma$ from the observed samples.
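As a minimal sketch of what that looks like in practice, using NumPy (the "true" parameter values and variable names below are invented for illustration):

```python
import numpy as np

# Sketch: pretend nature draws from a Gaussian with mean theta = 2.0
# and variance sigma = 2.25 (i.e. standard deviation 1.5).
rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=1000)

# Estimate the parameters from the observed samples alone.
theta_hat = samples.mean()       # estimate of the mean theta
sigma_hat = samples.var(ddof=0)  # estimate of the variance sigma

print(f"theta ~ {theta_hat:.3f}, sigma ~ {sigma_hat:.3f}")
```

With more samples, these estimates should drift closer to the true values.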

Frequentist vs Bayesian

Depending on whether you believe that the parameters $\theta$ and $\sigma$ are fixed or random variables themselves, there are two major approaches to parameter estimation.

Frequentists

Frequentists believe $\theta$ is fixed but unknown. Then, by the central limit theorem and the law of large numbers, observing the frequencies in long-running experiments should eventually let us infer $\theta$ with a small enough error.
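For instance, a toy simulation (the "true" value below is invented):

```python
import numpy as np

# Sketch: the observed frequency converges to the fixed but unknown
# theta as the number of trials grows (law of large numbers).
rng = np.random.default_rng(2)
theta_true = 0.35  # invented "true" parameter

for n in [10, 1_000, 100_000]:
    flips = rng.random(n) < theta_true
    print(f"n = {n:>6}: observed frequency = {flips.mean():.4f}")
```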

Bayesians

Bayesians believe $\theta$ is itself a random variable: parameters are random variables. This allows us to use Bayes' theorem to infer the posterior

$$P(\theta \mid X)$$

or, more generally,

$$P(\text{model} \mid X).$$

Maximum Likelihood Estimation

Imagine that we have a box with 5 balls inside. Some are blue and some are red. We sample the box three times: each time, we note down the color of the ball, put the ball back in the box, shake it, and draw again. What is the probability that we draw a blue ball? Out of the five balls in the box, $\theta$ are blue. So we have a $\theta$-in-five chance of picking a blue ball:

$$\frac{\theta}{5}$$

Viewed as a function of $\theta$, the number of blue balls in the box, this is the probability of drawing blue on a single trial; the likelihood function of the whole experiment, which returns the probability of seeing our observed draws for each $\theta$, is built from it.

Observed: 2 blue balls, 1 red ball in 3 draws.

In MLE we look for the value of $\theta$ that maximizes the probability of our sample:

$$L(x_1, x_2, \dots, x_n; \theta) = P(X_1 = x_1, X_2 = x_2, \dots, X_n = x_n; \theta) = \left(\frac{\theta}{5}\right)^{\text{blue}} \times \left(1 - \frac{\theta}{5}\right)^{\text{red}}$$

The maximum is at $\theta = 3$, where the likelihood is $\left(\frac{3}{5}\right)^{2} \times \frac{2}{5} = 0.144$.
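Since $\theta$ can only take the values 0 through 5, we can simply evaluate the likelihood at each one and pick the largest; a quick sketch (variable names are mine):

```python
# Evaluate the likelihood of 2 blue + 1 red in 3 draws for every
# possible number of blue balls theta in {0, 1, ..., 5}.
blue, red = 2, 1

likelihoods = {
    theta: (theta / 5) ** blue * (1 - theta / 5) ** red
    for theta in range(6)
}

for theta, L in likelihoods.items():
    print(f"theta = {theta}: L = {L:.4f}")

# The MLE is the theta with the highest likelihood: theta = 3, L = 0.1440.
theta_mle = max(likelihoods, key=likelihoods.get)
print(f"MLE: theta = {theta_mle}")
```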

Continuous case: for continuous data the likelihood is the joint density instead,

$$f_X(x_1, x_2, \dots, x_n; \theta).$$

[Figure: the likelihood $\left(\frac{\theta}{5}\right)^{\text{blue}} \times \left(1 - \frac{\theta}{5}\right)^{\text{red}}$ evaluated for each $\theta$, as the outcome of 3 consecutive independent Bernoulli trials, with blue = 2 and red = 1.]
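As an illustration of the continuous case (my own sketch using SciPy, with the standard deviation fixed at 1 to keep the problem one-dimensional):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Sketch: maximize the Gaussian log-likelihood over theta (the mean).
rng = np.random.default_rng(1)
data = rng.normal(loc=4.0, scale=1.0, size=200)

def neg_log_likelihood(theta):
    # For independent samples, log f(x_1, ..., x_n; theta) is a sum of
    # log densities; we negate it because the optimizer minimizes.
    return -norm.logpdf(data, loc=theta, scale=1.0).sum()

result = minimize_scalar(neg_log_likelihood)
print(f"theta_mle ~ {result.x:.3f}")  # lands on the sample mean, data.mean()
```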

Maximum a Posteriori

So earlier, we assumed that $\theta$ was fixed, and we tried to find the value for $\theta$ that made our observation most likely by maximizing the likelihood: the probability of seeing that outcome given a particular $\theta$.

Now we treat $\theta$ as a random variable, and we assume that we have some prior knowledge of its distribution. We seek to update our belief (the posterior) with the observation, using Bayes' theorem.

$$P(\theta \mid X) = \frac{P(X \mid \theta)\,P(\theta)}{P(X)}$$

In MAP estimation we find the value of $\theta$ that maximizes the posterior above, which is simply the likelihood $P(X \mid \theta)$ times the prior (our initial guess) $P(\theta)$, divided by the evidence $P(X)$. This is one of the most beautiful equations that I know, because the more you think about it, the more it makes sense. But I digress.

In the algebraic sense, this is just a fraction. Notice that $P(X)$ doesn't depend on $\theta$ at all, so viewed as a function of $\theta$ it looks like this:

$$P(X \mid \theta)\,P(\theta) \times c$$
where $c$ is just some constant. If you aren't convinced, below is a function multiplied by various constants $c$. The red line indicates the maximum.

[Figure: $f(x) \times c$ for several values of $c$; the red line marks the maximum, which stays at the same $x$ regardless of $c$.]
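A tiny numeric check of the same claim (purely illustrative):

```python
import numpy as np

# Scaling a function by a positive constant does not move its argmax.
x = np.linspace(0, 1, 101)
f = x**2 * (1 - x)  # any function works; this one mirrors the ball likelihood

for c in [0.5, 1.0, 10.0]:
    print(f"c = {c}: argmax at x = {x[np.argmax(c * f)]:.2f}")  # same x each time
```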

So we maximize

$$P(X \mid \theta)\,P(\theta)$$
with respect to $\theta$ instead. Now there is an interesting case where $P(\theta)$ is constant. We go back to the logic above and find out that we can safely ignore it and just maximize $P(X \mid \theta)$ with respect to $\theta$; in other words, MAP with a uniform prior reduces to MLE.

This should make intuitive sense when thinking Bayesian. The posterior, our updated confidence in the guess given the observation, is the prior confidence in the guess times the probability of the observation given that guess. The stronger we believe in our original guess of $\theta$, the more evidence it takes to pull the posterior away from it. And if $P(\theta)$ is a constant $c$, then $\theta$ is a uniform random variable and we consider all values of $\theta$ equally likely a priori.
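To make MAP concrete, here is the ball example again with a prior over $\theta$; the prior weights below are invented purely for illustration:

```python
# MAP for the ball example: the posterior is proportional to likelihood * prior.
blue, red = 2, 1

def likelihood(theta):
    return (theta / 5) ** blue * (1 - theta / 5) ** red

# Invented prior: suppose we strongly believe boxes rarely contain many blue balls.
prior = {0: 0.35, 1: 0.30, 2: 0.20, 3: 0.08, 4: 0.05, 5: 0.02}

# P(X) is the same for every theta, so comparing likelihood * prior is enough.
unnormalized_posterior = {t: likelihood(t) * p for t, p in prior.items()}
theta_map = max(unnormalized_posterior, key=unnormalized_posterior.get)
print(f"MAP estimate: theta = {theta_map}")  # the prior pulls it down to 2

# With a uniform prior the constant factors out and MAP reduces to MLE.
uniform = {t: likelihood(t) / 6 for t in range(6)}
print(f"uniform-prior MAP: theta = {max(uniform, key=uniform.get)}")  # back to 3
```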