
I am given an example:

Suppose 1000 observations are drawn from a mixture of $N(0,1)$ and $N(5,2)$ with mixing parameters $\pi_{1}=0.2$ and $\pi_{2}=0.8$, respectively. Suppose we know only $\sigma$ and want to estimate $\mu$ and $\pi$. How does one go about using Gaussian mixture models to estimate these parameters? I know I have to use the EM algorithm, but I do not know where to start. I want to use this simple example to get a better understanding of how it works.

What are $\mu$ and $\sigma$? – Did Apr 30 at 21:49
They are the population mean and standard deviation of the normal distributions from which the points were generated. Once the points are generated, the idea is to estimate $\mu$ and $\pi$ given that you know $\sigma$. – phil12 Apr 30 at 22:00
But the distributions $N(0,1)$ and $N(5,2)$ and the mixing distribution $\pi$ fully determine these, don't they? So, in the end, what exactly are you supposed to know? – Did Apr 30 at 22:03
The point is to generate the points using those parameters, and then estimate them from the data alone to see how well they match the true parameters. For example, using MATLAB or Python, you can generate these data, and the idea is to write an algorithm that estimates these parameters given that you only know $\sigma$. – phil12 Apr 30 at 22:04
OK. But somehow you know $\sigma$, right? Which seems strange... – Did Apr 30 at 22:06

1 Answer


The canonical thing to do would be to add latent variables $C_i$, $i = 1, 2, ..., n$. Letting $X_1, X_2, ..., X_n$, with $n = 1000$, be your data, you would set $X_i | C_i = 1 \sim N(0, 1)$ and $X_i | C_i = 0 \sim N(5, 2)$, with $C_i = 1$ with probability $\pi = 0.2$ and $C_i = 0$ with probability $1 - \pi$.
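
For concreteness, here is a minimal Python/NumPy sketch of simulating data from this augmented model. The variable names are my own, and I am reading the second parameter of $N(5,2)$ as the variance, so the standard deviation is $\sqrt{2}$:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000
    pi_true = 0.2  # P(C_i = 1), the weight on the N(0, 1) component

    # Draw the latent labels first, then each observation from its component.
    C = rng.binomial(1, pi_true, size=n)
    x = np.where(C == 1,
                 rng.normal(0.0, 1.0, size=n),           # N(0, 1)
                 rng.normal(5.0, np.sqrt(2.0), size=n))  # N(5, 2), 2 read as variance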

The likelihood after augmenting becomes $$L = \prod_{i=1}^n [\pi N(x_i | \mu_1, 1)]^{C_i} [(1 - \pi) N(x_i | \mu_2, 2)]^{1 - C_i},$$ and the log-likelihood becomes $$\ell = \sum_{i = 1}^n C_i[\log(\pi) + \log(N(x_i | \mu_1, 1))] + (1 - C_i) [\log(1 - \pi) + \log(N(x_i | \mu_2, 2))].$$
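
This complete-data log-likelihood translates directly into code. A sketch, assuming SciPy is available (the function name is illustrative):

    import numpy as np
    from scipy.stats import norm

    def complete_log_likelihood(x, C, mu1, mu2, pi):
        # Sum over i of C_i [log(pi) + log N(x_i | mu1, 1)]
        #             + (1 - C_i) [log(1 - pi) + log N(x_i | mu2, 2)]
        term1 = C * (np.log(pi) + norm.logpdf(x, loc=mu1, scale=1.0))
        term2 = (1 - C) * (np.log(1 - pi) + norm.logpdf(x, loc=mu2, scale=np.sqrt(2.0)))
        return np.sum(term1 + term2)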

We don't know the values of the $C_i$, so we can't maximize this directly; EM works by replacing this maximization problem with maximizing an expected log-likelihood. To apply EM, we first initialize $\mu_1, \mu_2$, and $\pi$ to $\mu_1^{(0)}, \mu_2^{(0)}, \pi^{(0)}$. Next, we take the expectation of $\ell$ with respect to the conditional distribution of $[C_1, ..., C_n | X_1, ..., X_n]$ evaluated at the values of the parameters we just initialized. With a little effort it can be shown that $E[C_i | X_1, ..., X_n]$ under this distribution is $$\rho_i^{(0)} = \frac{\pi^{(0)} N(x_i | \mu_1^{(0)}, 1)}{\pi^{(0)} N(x_i |\mu_1^{(0)}, 1) + (1 - \pi^{(0)}) N(x_i | \mu_2^{(0)}, 2)}.$$ Plugging this into $E[\ell | X_1, X_2, ..., X_n]$ gives $$E[\ell | X_1, X_2, ..., X_n] = \sum_{i = 1}^n \rho_i^{(0)}[\log(\pi) + \log(N(x_i | \mu_1, 1))] + (1 - \rho_i^{(0)}) [\log(1 - \pi) + \log(N(x_i | \mu_2, 2))].$$ Now we take the gradient of this expression with respect to $\mu_1, \mu_2, \pi$, set it equal to $0$, and solve for $\mu_1, \mu_2, \pi$ in terms of the $x_i$ and $\rho_i^{(0)}$. The solutions furnish $\mu_1^{(1)}$, $\mu_2^{(1)}$, and $\pi^{(1)}$, and we start back over with these new values. Iterate until convergence.
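
In code, the E-step quantity $\rho_i$ is just a pointwise ratio of weighted densities. A sketch (again with an illustrative function name):

    import numpy as np
    from scipy.stats import norm

    def responsibilities(x, mu1, mu2, pi):
        # rho_i = E[C_i | X], evaluated at the current parameter values
        w1 = pi * norm.pdf(x, loc=mu1, scale=1.0)
        w2 = (1.0 - pi) * norm.pdf(x, loc=mu2, scale=np.sqrt(2.0))
        return w1 / (w1 + w2)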

In the language of the EM algorithm, the preceding steps break into an E-step and an M-step. In the E-step of the $(k+1)$st iteration, we calculate $\rho_i^{(k)}$ for $i = 1, 2, ..., n$ in terms of $\mu_1^{(k)}$, $\mu_2^{(k)}$, and $\pi^{(k)}$. In the M-step we set the derivative of the expected log-likelihood to $0$ and solve, which turns out to give $$\pi^{(k+1)} = \frac{\sum_{i=1}^n \rho_i^{(k)}}{n},$$ $$\mu_1^{(k+1)} = \frac{\sum_{i = 1}^n \rho_i^{(k)} x_i}{\sum_{i = 1}^n \rho_i^{(k)}},$$ $$\mu_2^{(k+1)} = \frac{\sum_{i = 1}^n (1 - \rho_i^{(k)})x_i}{\sum_{i = 1}^n (1 - \rho_i^{(k)})}.$$ Intuitively, $\rho_i$ is the conditional probability that $C_i = 1$ given the data; occasionally you'll hear $\rho_i$ referred to as the responsibility of a mixture component for observation $i$. So we average the responsibilities to get the estimate of $\pi$, take a weighted average of the $X_i$ according to their probability of having $C_i = 1$ to get $\mu_1$, and tell a similar story for $\mu_2$.
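
Putting the two steps together, a bare-bones EM loop might look like the following; the starting values, iteration cap, and stopping rule are illustrative choices, not part of the algorithm itself:

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, mu1=1.0, mu2=4.0, pi=0.5, max_iter=200, tol=1e-8):
        for _ in range(max_iter):
            # E-step: responsibilities rho_i at the current parameter values
            w1 = pi * norm.pdf(x, loc=mu1, scale=1.0)
            w2 = (1.0 - pi) * norm.pdf(x, loc=mu2, scale=np.sqrt(2.0))
            rho = w1 / (w1 + w2)
            # M-step: the closed-form updates derived above
            pi_new = rho.mean()
            mu1_new = np.sum(rho * x) / np.sum(rho)
            mu2_new = np.sum((1.0 - rho) * x) / np.sum(1.0 - rho)
            done = max(abs(pi_new - pi), abs(mu1_new - mu1), abs(mu2_new - mu2)) < tol
            pi, mu1, mu2 = pi_new, mu1_new, mu2_new
            if done:
                break
        return mu1, mu2, pi

On data simulated as in the first sketch, the estimates should land near $\mu_1 \approx 0$, $\mu_2 \approx 5$, and $\pi \approx 0.2$, up to the two components swapping labels if the starting values are chosen badly.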
