Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) Estimation

People sometimes use these two terms interchangeably, but there is a real difference between them.

Maximum Likelihood Estimation (MLE)

  1. Define the Likelihood Function:

    • Given data $X = \{x_1, x_2, \ldots, x_n\}$ and model parameters $\theta$, the likelihood function is $P(X|\theta)$.
    • For i.i.d. data, the likelihood is the product of individual likelihoods: \(P(X|\theta) = \prod_{i=1}^n P(x_i|\theta)\)
  2. Log-Likelihood:

    • To simplify calculations, we use the log-likelihood: \(\log P(X|\theta) = \sum_{i=1}^n \log P(x_i|\theta)\)
  3. Maximize the Log-Likelihood:

    • Find the parameter values $\theta$ that maximize the log-likelihood function: \(\hat{\theta}_{\text{MLE}} = \arg \max_\theta \log P(X|\theta)\)
    • This involves taking the derivative of the log-likelihood with respect to $\theta$, setting it to zero, and solving for $\theta$: \(\frac{\partial}{\partial \theta} \log P(X|\theta) = 0\)
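
When step 3 has no closed-form solution, the maximization is done numerically. The sketch below is a minimal illustration, assuming NumPy and SciPy are available; the Poisson model and the simulated counts are purely illustrative. It minimizes the negative log-likelihood, which is equivalent to maximizing the log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

# Illustrative i.i.d. count data; the true rate 3.0 is used only to simulate.
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=1000)

def neg_log_likelihood(lam):
    """Negative log-likelihood: -sum_i log P(x_i | lam) for i.i.d. Poisson data."""
    return -poisson.logpmf(x, mu=lam).sum()

# Minimizing the negative log-likelihood maximizes the log-likelihood.
result = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 50.0), method="bounded")
print(result.x, x.mean())  # for the Poisson model the MLE is the sample mean
```

For the Poisson model the MLE is also available in closed form, $\hat{\lambda}_{\text{MLE}} = \bar{x}$, so the numerical optimum should agree with the sample mean.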

Example: MLE for Gaussian Distribution

  • Data: $X = \{x_1, x_2, \ldots, x_n\}$
  • Model: $x_i \sim \mathcal{N}(\mu, \sigma^2)$
  • Parameters: $\theta = (\mu, \sigma^2)$
  1. Likelihood Function:

    \[P(X|\mu, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{(x_i - \mu)^2}{2\sigma^2} \right)\]
  2. Log-Likelihood:

    \[\log P(X|\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]
  3. Maximize the Log-Likelihood:

    • For $\mu$: \(\frac{\partial}{\partial \mu} \log P(X|\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^n (x_i - \mu) = 0 \implies \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i\)
    • For $\sigma^2$: \(\frac{\partial}{\partial \sigma^2} \log P(X|\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (x_i - \mu)^2 = 0 \implies \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \hat{\mu})^2\)
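
These closed-form estimators are one-liners in code. A minimal sketch, assuming NumPy (the simulated sample is illustrative):

```python
import numpy as np

# Illustrative sample; any 1-D array of observations works here.
rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=1000)

mu_hat = x.mean()                        # (1/n) * sum(x_i)
sigma2_hat = ((x - mu_hat) ** 2).mean()  # (1/n) * sum((x_i - mu_hat)^2)
print(mu_hat, sigma2_hat)
```

Note that the MLE of the variance divides by $n$ rather than $n - 1$, so it is the biased estimator, not the usual sample variance.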

Maximum A Posteriori (MAP) Estimation

  1. Define the Posterior Probability:

    • Use Bayes’ theorem to define the posterior probability of the parameters given the data: \(P(\theta|X) = \frac{P(X|\theta) P(\theta)}{P(X)}\)
    • Here, $P(\theta)$ is the prior distribution of the parameters, and $P(X)$ is the marginal likelihood.
  2. Log-Posterior:

    • To simplify calculations, we use the log-posterior: \(\log P(\theta|X) = \log P(X|\theta) + \log P(\theta) - \log P(X)\)
    • Since $P(X)$ does not depend on $\theta$, it can be dropped for maximization: \(\log P(\theta|X) = \log P(X|\theta) + \log P(\theta) + \text{const}\)
  3. Maximize the Log-Posterior:

    • Find the parameter values $\theta$ that maximize the log-posterior: \(\hat{\theta}_{\text{MAP}} = \arg \max_\theta \left( \log P(X|\theta) + \log P(\theta) \right)\)
    • This involves taking the derivative of the log-posterior with respect to $\theta$, setting it to zero, and solving for $\theta$: \(\frac{\partial}{\partial \theta} \left( \log P(X|\theta) + \log P(\theta) \right) = 0\)
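
As a concrete illustration of numerically maximizing a log-posterior, here is a minimal sketch using SciPy for a Beta-Binomial model (a coin with a Beta prior on its heads probability); the data and the Beta(2, 2) prior are illustrative choices. As noted above, $\log P(X)$ is constant in the parameter and is simply dropped.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta

# Illustrative coin-flip data: 7 heads out of 10 tosses.
n, k = 10, 7
a, b = 2.0, 2.0  # Beta(2, 2) prior on the heads probability p

def neg_log_posterior(p):
    """Negative (log-likelihood + log-prior); log P(X) is a constant and dropped."""
    log_lik = k * np.log(p) + (n - k) * np.log(1.0 - p)  # binomial log-likelihood up to a constant
    log_prior = beta.logpdf(p, a, b)
    return -(log_lik + log_prior)

result = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
# Known closed form for the Beta-Binomial MAP estimate: (k + a - 1) / (n + a + b - 2)
print(result.x, (k + a - 1) / (n + a + b - 2))
```

With a flat prior ($a = b = 1$) the MAP estimate reduces to the MLE $k/n$; in general, MAP with a uniform prior coincides with MLE.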

Example: MAP for Gaussian Distribution with Normal Prior

  • Data: $X = \{x_1, x_2, \ldots, x_n\}$
  • Model: $x_i \sim \mathcal{N}(\mu, \sigma^2)$
  • Prior: $\mu \sim \mathcal{N}(\mu_0, \tau^2)$
  • Parameter: $\theta = \mu$, with $\sigma^2$ assumed known
  1. Prior Probability:

    \[P(\mu) = \frac{1}{\sqrt{2\pi\tau^2}} \exp \left( -\frac{(\mu - \mu_0)^2}{2\tau^2} \right)\]
  2. Log-Likelihood:

    \[\log P(X|\mu, \sigma^2) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2\]
  3. Log-Prior:

    \[\log P(\mu) = -\frac{1}{2} \log(2\pi\tau^2) - \frac{(\mu - \mu_0)^2}{2\tau^2}\]
  4. Log-Posterior:

    \[\log P(\mu|X) = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 - \frac{1}{2} \log(2\pi\tau^2) - \frac{(\mu - \mu_0)^2}{2\tau^2} + \text{const}\]
  5. Maximize the Log-Posterior:

    • Take the derivative with respect to $\mu$: \(\frac{\partial}{\partial \mu} \log P(\mu|X) = \frac{n}{\sigma^2} \left( \frac{1}{n} \sum_{i=1}^n x_i - \mu \right) - \frac{\mu - \mu_0}{\tau^2} = 0\)
    • Solve for $\mu$: \(\hat{\mu}_{\text{MAP}} = \frac{\frac{n}{\sigma^2} \bar{x} + \frac{1}{\tau^2} \mu_0}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}\)

      where $\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$ is the sample mean.
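
The result is a precision-weighted average of the sample mean and the prior mean. A minimal sketch of the formula in code (all numbers are illustrative, and $\sigma^2$ is treated as known):

```python
import numpy as np

# Illustrative setup: known noise variance sigma^2, prior N(mu_0, tau^2) on mu.
rng = np.random.default_rng(2)
sigma2 = 4.0          # known likelihood variance
mu0, tau2 = 0.0, 1.0  # prior mean and variance
x = rng.normal(loc=3.0, scale=np.sqrt(sigma2), size=20)

n, xbar = len(x), x.mean()
# Precision-weighted average from the derivation above:
mu_map = (n / sigma2 * xbar + mu0 / tau2) / (n / sigma2 + 1.0 / tau2)
print(xbar, mu_map)  # the MAP estimate is shrunk from the sample mean toward mu_0
```

As $n$ grows (or as $\tau^2 \to \infty$), the likelihood term dominates and $\hat{\mu}_{\text{MAP}} \to \bar{x}$, recovering the MLE; a small $\tau^2$ (a confident prior) pulls the estimate toward $\mu_0$.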

Summary

  • MLE: Maximizes the likelihood of the observed data without considering prior information.
  • MAP: Maximizes the posterior probability, which combines the likelihood of the observed data with prior information.