Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation
These two terms are sometimes used interchangeably, but there is an important difference between them.
Maximum Likelihood Estimation (MLE)
Define the Likelihood Function:
- Given data $X = \{x_1, x_2, \ldots, x_n\}$ and model parameters $\theta$, the likelihood function is $P(X|\theta)$.
- For i.i.d. data, the likelihood is the product of the individual likelihoods: \(P(X|\theta) = \prod_{i=1}^n P(x_i|\theta)\)
Log-Likelihood:
- To simplify calculations, we use the log-likelihood: \(\log P(X|\theta) = \sum_{i=1}^n \log P(x_i|\theta)\)
Maximize the Log-Likelihood:
- Find the parameter values $\theta$ that maximize the log-likelihood function: \(\hat{\theta}_{\text{MLE}} = \arg \max_\theta \log P(X|\theta)\)
- This involves taking the derivative of the log-likelihood with respect to $\theta$, setting it to zero, and solving for $\theta$: \(\frac{\partial}{\partial \theta} \log P(X|\theta) = 0\)
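When no closed-form solution exists, the same maximization can be done numerically. Below is a minimal sketch in Python (not part of the derivation above), assuming a simple Bernoulli coin-flip model so the result can be checked against the known closed-form MLE $\hat{p} = \bar{x}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=100)  # simulated coin flips, assumed for illustration

def neg_log_likelihood(p):
    # Negative Bernoulli log-likelihood: -sum_i [x_i log p + (1 - x_i) log(1 - p)]
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Minimizing the negative log-likelihood is equivalent to maximizing the log-likelihood
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, data.mean())  # the numerical optimum should match the closed form p_hat = sample mean
```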
Example: MLE for Gaussian Distribution
- Data: $X = \{x_1, x_2, \ldots, x_n\}$
- Model: $x_i \sim \mathcal{N}(\mu, \sigma^2)$
- Parameters: $\theta = (\mu, \sigma^2)$
Likelihood Function:
\[P(X|\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)\]
Log-Likelihood:
\[\log P(X|\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]
Maximize the Log-Likelihood:
- For $\mu$: \(\frac{\partial}{\partial \mu} \log P(X|\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i\)
- For $\sigma^2$: \(\frac{\partial}{\partial \sigma^2} \log P(X|\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (x_i - \mu)^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2\)
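As a quick sanity check, the two estimators above can be computed directly on synthetic data (a minimal sketch; the true parameters 5.0 and 4.0 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic Gaussian data

mu_hat = x.mean()                        # MLE of the mean: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of the variance: note the 1/n normalizer

print(mu_hat, sigma2_hat)  # should be close to the true values 5.0 and 4.0
```

Note that the MLE of the variance uses a $1/n$ normalizer rather than $1/(n-1)$, so it is biased but consistent.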
Maximum A Posteriori (MAP) Estimation
Define the Posterior Probability:
- Use Bayes’ theorem to define the posterior probability of the parameters given the data: \(P(\theta|X) = \frac{P(X|\theta) P(\theta)}{P(X)}\)
- Here, $P(\theta)$ is the prior distribution of the parameters, and $P(X)$ is the marginal likelihood.
Log-Posterior:
- To simplify calculations, we use the log-posterior: \(\log P(\theta|X) = \log P(X|\theta) + \log P(\theta) - \log P(X)\)
- Since $P(X)$ does not depend on $\theta$, it can be ignored for maximization: \(\log P(\theta|X) \propto \log P(X|\theta) + \log P(\theta)\)
Maximize the Log-Posterior:
- Find the parameter values $\theta$ that maximize the log-posterior: \(\hat{\theta}_{\text{MAP}} = \arg \max_\theta \left( \log P(X|\theta) + \log P(\theta) \right)\)
- This involves taking the derivative of the log-posterior with respect to $\theta$, setting it to zero, and solving for $\theta$: \(\frac{\partial}{\partial \theta} \left( \log P(X|\theta) + \log P(\theta) \right) = 0\)
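Numerically, this is the same as the MLE procedure with one extra term for the log-prior. A minimal sketch, continuing the hypothetical Bernoulli example from earlier with an assumed Beta(2, 2) prior (under which the MAP estimate also has a known closed form to check against):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=20)  # small sample so the prior has visible influence
alpha, beta = 2.0, 2.0                # Beta(2, 2) prior on p, chosen for illustration

def neg_log_posterior(p):
    # log-posterior (up to a constant) = log-likelihood + log-prior
    log_lik = np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))
    log_prior = (alpha - 1) * np.log(p) + (beta - 1) * np.log(1 - p)
    return -(log_lik + log_prior)

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
# For a Beta prior, the MAP estimate is the posterior mode, which has a closed form:
closed_form = (data.sum() + alpha - 1) / (len(data) + alpha + beta - 2)
print(res.x, closed_form)  # the two should agree
```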
Example: MAP for Gaussian Distribution with Normal Prior
- Data: $X = \{x_1, x_2, \ldots, x_n\}$
- Model: $x_i \sim \mathcal{N}(\mu, \sigma^2)$
- Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$
- Parameters: $\mu$, with $\sigma^2$ assumed known
Prior Probability:
\[P(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp \left( -\frac{(\mu - \mu_0)^2}{2\tau^2} \right)\]
Log-Likelihood:
\[\log P(X|\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]
Log-Prior:
\[\log P(\mu) = -\frac{1}{2} \log(2\pi\tau^2) - \frac{(\mu - \mu_0)^2}{2\tau^2}\]
Log-Posterior:
\[\log P(\mu|X) \propto -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2} \log(2\pi\tau^2) - \frac{(\mu - \mu_0)^2}{2\tau^2}\]
Maximize the Log-Posterior:
- Take the derivative with respect to $\mu$: \(\frac{\partial}{\partial \mu} \log P(\mu|X) = \frac{n}{\sigma^2} \left( \frac{1}{n} \sum_{i=1}^n x_i - \mu \right) - \frac{\mu - \mu_0}{\tau^2} = 0\)
- Solve for $\mu$: \(\hat{\mu}_{\text{MAP}} = \frac{\frac{n}{\sigma^2} \bar{x} + \frac{1}{\tau^2} \mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\)
where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is the sample mean.
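The result is a precision-weighted average of the sample mean $\bar{x}$ and the prior mean $\mu_0$. A minimal sketch (all parameter values are assumptions for illustration, with $\sigma^2$ treated as known):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0                                   # known noise variance (assumed)
x = rng.normal(3.0, np.sqrt(sigma2), size=25)  # synthetic data

mu0, tau2 = 0.0, 1.0  # prior N(mu0, tau2) on mu, chosen for illustration
n, xbar = len(x), x.mean()

# Precision-weighted average of the sample mean and the prior mean
mu_map = (n / sigma2 * xbar + mu0 / tau2) / (n / sigma2 + 1 / tau2)
print(xbar, mu_map)  # the MAP estimate is shrunk from x-bar toward mu0
```

As $n$ grows, or as $\tau^2 \to \infty$ (a flat prior), the prior's weight vanishes and $\hat{\mu}_{\text{MAP}}$ approaches the MLE $\bar{x}$.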
Summary
- MLE: maximizes the likelihood of the observed data without using any prior information.
- MAP: maximizes the posterior probability, which combines the likelihood of the observed data with prior information. With a flat (uniform) prior, MAP reduces to MLE.