Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation
These two terms are sometimes used interchangeably, but there is an important difference between them.
Maximum Likelihood Estimation (MLE)
Define the Likelihood Function:
- Given data $X = \{x_1, x_2, \ldots, x_n\}$ and model parameters $\theta$, the likelihood function is $P(X|\theta)$.
- For i.i.d. data, the likelihood is the product of the individual likelihoods: \(P(X|\theta) = \prod_{i=1}^n P(x_i|\theta)\)
Log-Likelihood:
- To simplify calculations, we use the log-likelihood: \(\log P(X|\theta) = \sum_{i=1}^n \log P(x_i|\theta)\)
Maximize the Log-Likelihood:
- Find the parameter values $\theta$ that maximize the log-likelihood function: \(\hat{\theta}_{\text{MLE}} = \arg \max_\theta \log P(X|\theta)\)
- This involves taking the derivative of the log-likelihood with respect to $\theta$, setting it to zero, and solving for $\theta$: \(\frac{\partial}{\partial \theta} \log P(X|\theta) = 0\)
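When no closed-form solution exists, the same maximization can be done numerically. Below is a minimal sketch in Python (not part of the derivation above), assuming a simple Bernoulli coin-flip model so the result can be checked against the known closed-form MLE $\hat{p} = \bar{x}$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=100)  # simulated coin flips, assumed for illustration

def neg_log_likelihood(p):
    # Negative Bernoulli log-likelihood: -sum_i [x_i log p + (1 - x_i) log(1 - p)]
    return -np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))

# Minimizing the negative log-likelihood is equivalent to maximizing the log-likelihood
res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, data.mean())  # the numerical optimum should match the closed form p_hat = sample mean
```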
Example: MLE for Gaussian Distribution
- Data: $X = \{x_1, x_2, \ldots, x_n\}$
- Model: $x_i \sim \mathcal{N}(\mu, \sigma^2)$
- Parameters: $\theta = (\mu, \sigma^2)$
Likelihood Function:
\[P(X|\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)\]
Log-Likelihood:
\[\log P(X|\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]
Maximize the Log-Likelihood:
- For $\mu$: \(\frac{\partial}{\partial \mu} \log P(X|\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i\)
- For $\sigma^2$: \(\frac{\partial}{\partial \sigma^2} \log P(X|\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (x_i - \mu)^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2\)
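As a quick sanity check, the two estimators above can be computed directly on synthetic data (a minimal sketch; the true parameters 5.0 and 4.0 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=10_000)  # synthetic Gaussian data

mu_hat = x.mean()                        # MLE of the mean: the sample mean
sigma2_hat = np.mean((x - mu_hat) ** 2)  # MLE of the variance: note the 1/n normalizer

print(mu_hat, sigma2_hat)  # should be close to the true values 5.0 and 4.0
```

Note that the MLE of the variance uses a $1/n$ normalizer rather than $1/(n-1)$, so it is biased but consistent.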
Maximum A Posteriori (MAP) Estimation
Define the Posterior Probability:
- Use Bayes’ theorem to define the posterior probability of the parameters given the data: \(P(\theta|X) = \frac{P(X|\theta) P(\theta)}{P(X)}\)
- Here, $P(\theta)$ is the prior distribution of the parameters, and $P(X)$ is the marginal likelihood.
Log-Posterior:
- To simplify calculations, we use the log-posterior: \(\log P(\theta|X) = \log P(X|\theta) + \log P(\theta) - \log P(X)\)
- Since $P(X)$ does not depend on $\theta$, it can be ignored for maximization: \(\log P(\theta|X) \propto \log P(X|\theta) + \log P(\theta)\)
Maximize the Log-Posterior:
- Find the parameter values $\theta$ that maximize the log-posterior: \(\hat{\theta}_{\text{MAP}} = \arg \max_\theta \left( \log P(X|\theta) + \log P(\theta) \right)\)
- This involves taking the derivative of the log-posterior with respect to $\theta$, setting it to zero, and solving for $\theta$: \(\frac{\partial}{\partial \theta} \left( \log P(X|\theta) + \log P(\theta) \right) = 0\)
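Numerically, this is the same as the MLE procedure with one extra term for the log-prior. A minimal sketch, continuing the hypothetical Bernoulli example from earlier with an assumed Beta(2, 2) prior (under which the MAP estimate also has a known closed form to check against):

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=20)  # small sample so the prior has visible influence
alpha, beta = 2.0, 2.0                # Beta(2, 2) prior on p, chosen for illustration

def neg_log_posterior(p):
    # log-posterior (up to a constant) = log-likelihood + log-prior
    log_lik = np.sum(data * np.log(p) + (1 - data) * np.log(1 - p))
    log_prior = (alpha - 1) * np.log(p) + (beta - 1) * np.log(1 - p)
    return -(log_lik + log_prior)

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
# For a Beta prior, the MAP estimate is the posterior mode, which has a closed form:
closed_form = (data.sum() + alpha - 1) / (len(data) + alpha + beta - 2)
print(res.x, closed_form)  # the two should agree
```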
Example: MAP for Gaussian Distribution with Normal Prior
- Data: $X = \{x_1, x_2, \ldots, x_n\}$
- Model: $x_i \sim \mathcal{N}(\mu, \sigma^2)$
- Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$
- Parameters: $\mu$, with $\sigma^2$ assumed known
Prior Probability:
\[P(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp \left( -\frac{(\mu - \mu_0)^2}{2\tau^2} \right)\]
Log-Likelihood:
\[\log P(X|\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]
Log-Prior:
\[\log P(\mu) = -\frac{1}{2} \log(2\pi\tau^2) - \frac{(\mu - \mu_0)^2}{2\tau^2}\]
Log-Posterior:
\[\log P(\mu|X) \propto -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2} \log(2\pi\tau^2) - \frac{(\mu - \mu_0)^2}{2\tau^2}\]
Maximize the Log-Posterior:
- Take the derivative with respect to $\mu$: \(\frac{\partial}{\partial \mu} \log P(\mu|X) = \frac{n}{\sigma^2} \left( \frac{1}{n} \sum_{i=1}^n x_i - \mu \right) - \frac{\mu - \mu_0}{\tau^2} = 0\)
- Solve for $\mu$: \(\hat{\mu}_{\text{MAP}} = \frac{\frac{n}{\sigma^2} \bar{x} + \frac{1}{\tau^2} \mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\)
where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is the sample mean.
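The result is a precision-weighted average of the sample mean $\bar{x}$ and the prior mean $\mu_0$. A minimal sketch (all parameter values are assumptions for illustration, with $\sigma^2$ treated as known):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2 = 4.0                                   # known noise variance (assumed)
x = rng.normal(3.0, np.sqrt(sigma2), size=25)  # synthetic data

mu0, tau2 = 0.0, 1.0  # prior N(mu0, tau2) on mu, chosen for illustration
n, xbar = len(x), x.mean()

# Precision-weighted average of the sample mean and the prior mean
mu_map = (n / sigma2 * xbar + mu0 / tau2) / (n / sigma2 + 1 / tau2)
print(xbar, mu_map)  # the MAP estimate is shrunk from x-bar toward mu0
```

As $n$ grows, or as $\tau^2 \to \infty$ (a flat prior), the prior's weight vanishes and $\hat{\mu}_{\text{MAP}}$ approaches the MLE $\bar{x}$.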
Summary
- MLE: maximizes the likelihood of the observed data without using any prior information.
- MAP: maximizes the posterior probability, which combines the likelihood of the observed data with prior information. With a flat (uniform) prior, MAP reduces to MLE.