
I'm trying to follow a tutorial paper on generalizing Dirichlet distribution finite mixture models to Dirichlet process infinite mixture models.

Some context to clarify my questions (at the end of this post) follows:

They start by deriving a Gibbs-sampling algorithm for a finite Gaussian mixture model with a Dirichlet distribution prior (Section 2). In Section 3, they extend this to the infinite mixture case.


I get stuck following the derivation in Section 3.2 ("When $K$ approaches infinity"). They start from

$$ p(c_1, \dots, c_n ~|~ \alpha) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)} \prod_{k=1}^K \frac{\Gamma(n_k + \alpha / K)}{\Gamma(\alpha/K)} $$

where $c_i$ is a variable such that $c_i = k$ indicates the $i$th observation $y_i$ belongs to cluster $k$, $\alpha$ is the symmetric Dirichlet distribution concentration parameter, $n$ is the total number of observations, $n_k$ is the number of observations in cluster $k$, $K$ is the total number of clusters, and $\Gamma(\cdot)$ is the gamma function.
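
For concreteness, here is a small numerical sketch (my own, not from the paper) that evaluates this joint prior in log space; the counts and $\alpha$ below are made-up example values, and `gammaln` is SciPy's log-gamma function:

```python
import numpy as np
from scipy.special import gammaln

def log_joint_prior(counts, alpha):
    """Log of p(c_1, ..., c_n | alpha) under a symmetric Dirichlet prior.

    counts : length-K array of n_k, the number of observations in each cluster
    alpha  : concentration parameter
    """
    counts = np.asarray(counts, dtype=float)
    K = len(counts)
    n = counts.sum()
    return (gammaln(alpha) - gammaln(n + alpha)
            + np.sum(gammaln(counts + alpha / K) - gammaln(alpha / K)))

# Example: n = 6 observations spread over K = 3 clusters
print(log_joint_prior([3, 2, 1], alpha=1.0))
```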

The derivations below then show that

$$ p(c_i = k ~|~\mathbf{c}_{-i}, \alpha) = \frac{n_{-i, k}+\alpha/K}{n-1+\alpha} $$

where $-i$ denotes all indicators except the $i$th, and "$n_{-i,k}$ represents the number of observations nested within the $k$th cluster excluding $y_i$".
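
As a sanity check (my own sketch, not from the paper), this conditional can be verified by brute force: hold $\mathbf{c}_{-i}$ fixed, evaluate the joint prior for each possible value of $c_i$ using the `log_joint_prior` helper sketched above, and normalize:

```python
import numpy as np

def conditional_prior(c, i, K, alpha):
    """p(c_i = k | c_{-i}, alpha) for k = 0..K-1, computed two ways."""
    n = len(c)

    # Closed form: (n_{-i,k} + alpha/K) / (n - 1 + alpha)
    counts_minus_i = np.bincount(np.delete(c, i), minlength=K)
    closed = (counts_minus_i + alpha / K) / (n - 1 + alpha)

    # Brute force: evaluate the joint for each choice of c_i, then normalize
    # (log_joint_prior is the helper from the snippet above)
    logs = np.empty(K)
    for k in range(K):
        c_k = c.copy()
        c_k[i] = k
        logs[k] = log_joint_prior(np.bincount(c_k, minlength=K), alpha)
    brute = np.exp(logs - logs.max())
    brute /= brute.sum()
    return closed, brute

c = np.array([0, 0, 1, 2, 1, 0])   # toy assignments, clusters indexed 0..2
closed, brute = conditional_prior(c, i=3, K=3, alpha=1.0)
print(np.allclose(closed, brute))  # True: the two computations agree
```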

The derivation (from Appendix A.4) is included below:

[Image: first part of derivation from Appendix A.4]

[Image: second part of derivation from Appendix A.4]

The questions I'm stuck on:

  • What does "$n_{-i,k}$ represents the number of observations nested within the $k$th cluster excluding $y_i$" mean? Wouldn't this number just be equal to $n_k - 1$? Why the special notation?
  • I can follow the first 4 lines ($=$ signs) of the derivation, but I get stuck at the jump from line 4 to line 5. For example, in the equation just above the text "a more intuitive explanation...", it seems that they have dropped the product operator. How did they accomplish this?

1 Answer

  1. Not all the time: $n_{-i,k}$ equals $n_k - 1$ only when observation $i$ was assigned to cluster $k$ at the previous iteration; otherwise $n_{-i,k}$ is just $n_k$. We write $n_{-i,k}$ instead of $n_k - 1$ in the Gibbs sampler because the algorithm iterates many times, updating the assignment of each observation $i$ on every sweep; writing $n_k - 1$ would suggest the number of observations assigned to cluster $k$ is fixed. (See the sketch after this list.)
  2. The product operator only appears in the joint distribution; once we condition on $c_i = k$, only the $k$th factor matters. Concretely, write the conditional as a ratio of joints,
$$ p(c_i = k ~|~ \mathbf{c}_{-i}, \alpha) = \frac{p(c_i = k, \mathbf{c}_{-i} ~|~ \alpha)}{p(\mathbf{c}_{-i} ~|~ \alpha)}, $$
where the denominator is the same formula applied to the $n - 1$ observations excluding $y_i$. Every factor with index $j \neq k$ is identical in the numerator and denominator and cancels, so the product disappears. The surviving terms then collapse via the recurrence $\Gamma(x+1) = x\,\Gamma(x)$:
$$ \frac{\Gamma(n-1+\alpha)}{\Gamma(n+\alpha)} \cdot \frac{\Gamma(n_{-i,k} + \alpha/K + 1)}{\Gamma(n_{-i,k} + \alpha/K)} = \frac{n_{-i,k} + \alpha/K}{n - 1 + \alpha}. $$
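
To make point 1 concrete, here is a minimal sketch of the counting step inside one Gibbs sweep (my own illustration; the variable names are made up, and a full mixture sampler would also multiply in the data likelihood, which is omitted here). The current assignment of $y_i$ is removed before counting, which is exactly what $n_{-i,k}$ encodes:

```python
import numpy as np

def gibbs_sweep(c, K, alpha, rng):
    """One Gibbs pass over the indicators, using only the prior term.

    For each i, the count of i's *current* cluster is decremented before
    sampling, giving n_{-i,k}: it equals n_k - 1 for that one cluster and
    n_k for every other cluster -- hence the dedicated notation.
    """
    n = len(c)
    counts = np.bincount(c, minlength=K)
    for i in range(n):
        counts[c[i]] -= 1                # remove y_i: counts are now n_{-i,k}
        p = (counts + alpha / K) / (n - 1 + alpha)
        c[i] = rng.choice(K, p=p)        # resample c_i from the prior conditional
        counts[c[i]] += 1                # add y_i back under its new cluster
    return c

c = np.array([0, 0, 1, 2, 1, 0])
print(gibbs_sweep(c, K=3, alpha=1.0, rng=np.random.default_rng(0)))
```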