If we assume that our data is generated by a random variable whose underlying distribution is defined by a parameter, such as the mean $\mu$, then we can try to model the data-generation process, or make inferences, based on an estimate of that parameter.
The better our estimate of the parameters, the better the fitted distribution, and the better our guesses about unseen data. For example, if we say our data is generated by a Gaussian random variable with mean $\mu$ and variance $\sigma^2$, we would try something like the example below.
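Here is a minimal sketch of that idea, assuming the data really is Gaussian and using NumPy's sample mean and sample standard deviation as the estimates of $\mu$ and $\sigma$:

```python
# A minimal sketch: draw samples from a Gaussian with hidden "true" parameters,
# then estimate those parameters from the observed data alone.
import numpy as np

rng = np.random.default_rng(0)

true_mu, true_sigma = 2.0, 1.5               # the unknown parameters of the data-generating process
data = rng.normal(true_mu, true_sigma, size=1_000)

mu_hat = data.mean()                         # sample mean estimates mu
sigma_hat = data.std(ddof=1)                 # sample standard deviation estimates sigma

print(f"estimated mean: {mu_hat:.3f}, estimated std: {sigma_hat:.3f}")
```

The more data we observe, the closer these estimates tend to get to the true parameters.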
Depending on whether you believe that the parameters, $\theta$, are fixed or are random variables themselves, there are two major approaches to parameter estimation.
If we believe $\theta$ is fixed but unknown (the frequentist view), then, by the central limit theorem and the law of large numbers, observing frequencies over long runs of experiments should eventually let us infer $\theta$ with a small enough error.
Alternatively, we believe $\theta$ is a random variable (the Bayesian view): parameters are themselves random variables. This allows us to use Bayes' theorem to infer $\theta$.
Imagine that we have a box with 5 balls inside. Some are blue and some are red. We sample the box three times; each time, we note down the color of the ball, put it back in the box, shake it, and draw again. What is the probability that we draw a blue ball? If $\theta$ of the five balls in the box are blue, we have a $\theta$-in-five chance of picking a blue ball:

$$P(\text{blue} \mid \theta) = \frac{\theta}{5}$$
Above is the likelihood function: it returns the probability of seeing our experiment for each value of $\theta$, the number of blue balls in the jar.
Observed: 2 blue balls, 1 red ball in 3 draws.
In MLE (maximum likelihood estimation) we look for the value of $\theta$ that maximizes the probability of our sample:

$$\hat{\theta}_{\text{MLE}} = \underset{\theta}{\arg\max}\; P(D \mid \theta)$$
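To make this concrete, here is a small sketch for the ball example. It assumes the three draws are independent (we draw with replacement), so the probability of the observed sequence of 2 blue and 1 red draws is $(\theta/5)^2 (1 - \theta/5)$, and it simply tries every possible $\theta$ from 0 to 5:

```python
# Likelihood of observing 2 blue draws and 1 red draw (with replacement)
# for each possible number of blue balls theta in a box of 5 balls.
N_BALLS = 5

def likelihood(theta, n_blue=2, n_red=1):
    """P(observation | theta): independent draws with P(blue) = theta / N_BALLS."""
    p_blue = theta / N_BALLS
    return p_blue ** n_blue * (1 - p_blue) ** n_red

likelihoods = {theta: likelihood(theta) for theta in range(N_BALLS + 1)}
theta_mle = max(likelihoods, key=likelihoods.get)

print(likelihoods)   # likelihood of the sample for theta = 0..5
print(theta_mle)     # the theta that makes the observation most likely
```

With 2 blue balls out of 3 draws, the likelihood peaks at $\theta = 3$.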
So earlier, we assumed that $\theta$ was fixed, and we tried to find the value of $\theta$ that made our observation most likely by maximizing the likelihood, the probability of seeing that outcome given a particular $\theta$.
Now we treat $\theta$ as a random variable, and we assume that we have some prior knowledge of its distribution. We seek to update our belief (the posterior) in light of the observation using Bayes' theorem:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$
In MAP (maximum a posteriori) estimation we find the value of $\theta$ that maximizes the posterior above, which is simply the likelihood $P(D \mid \theta)$ times the prior $P(\theta)$ (our initial guess) divided by the evidence $P(D)$. This is one of the most beautiful equations that I know, because the more you think about it, the more it makes sense. But I digress.
In the algebraic sense, this is just a fraction; notice that the evidence $P(D)$ doesn't really depend on $\theta$. So, viewed as a function of $\theta$, it looks like this:

$$P(\theta \mid D) \propto P(D \mid \theta)\, P(\theta)$$
So we maximize the product of likelihood and prior:

$$\hat{\theta}_{\text{MAP}} = \underset{\theta}{\arg\max}\; P(D \mid \theta)\, P(\theta)$$
This should make intuitive sense when thinking Bayesian: the posterior, our updated confidence in the guess given the observation, is the prior confidence in the guess times the probability of the observation given the guess. The stronger we believe in our original guess of $\theta$, the more that belief shapes the posterior. If $P(\theta)$ is a constant, then $\theta$ is a uniform random variable and we think all values of $\theta$ are equally likely a priori; in that case the prior doesn't change the argmax, and MAP reduces to MLE.
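As a sketch of how this plays out in the ball example (the priors below are made up purely for illustration), we can multiply the likelihood from before by a prior over $\theta$ and take the argmax of the unnormalized posterior:

```python
# MAP estimation for the ball example: posterior is proportional to likelihood * prior.
N_BALLS = 5

def likelihood(theta, n_blue=2, n_red=1):
    p_blue = theta / N_BALLS
    return p_blue ** n_blue * (1 - p_blue) ** n_red

def map_estimate(prior):
    """Return the theta that maximizes likelihood(theta) * prior[theta]."""
    posterior = {theta: likelihood(theta) * prior[theta] for theta in range(N_BALLS + 1)}
    return max(posterior, key=posterior.get)

uniform_prior = {theta: 1 / (N_BALLS + 1) for theta in range(N_BALLS + 1)}
# A made-up prior that strongly favours boxes with few blue balls.
skewed_prior = {0: 0.35, 1: 0.3, 2: 0.2, 3: 0.1, 4: 0.04, 5: 0.01}

print(map_estimate(uniform_prior))  # 3: with a flat prior, MAP matches the MLE
print(map_estimate(skewed_prior))   # 2: the estimate is pulled toward the prior belief
```

With the uniform prior the MAP estimate matches the MLE ($\theta = 3$); with a prior that strongly favours boxes with few blue balls, it is pulled down to $\theta = 2$.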