Sunday, April 7, 2013

[ammai] week6 Latent Dirichlet allocation


Paper: “Latent Dirichlet allocation,” D. Blei, A. Ng, and M. Jordan, Journal of Machine Learning Research, 3:993–1022, January 2003. Known as LDA.

        This work extends pLSA, which has two serious disadvantages:
1. It is unclear how to assign a probability to a document outside the training set.
2. The number of parameters grows linearly with the size of the corpus, which causes serious overfitting (a rough parameter count is sketched below).
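
To make the second point concrete, here is a rough parameter count (the topic, vocabulary, and corpus sizes are my own toy numbers, not from the paper): pLSA needs k·V topic-word probabilities plus k·M per-document mixture weights, while LDA needs only k·V plus the k Dirichlet parameters.

```python
# Rough parameter-count comparison between pLSA and LDA.
# k = topics, V = vocabulary size, M = documents (toy sizes, my own choice).
def plsa_params(k, V, M):
    return k * V + k * M   # topic-word probs + per-document mixture weights

def lda_params(k, V):
    return k * V + k       # topic-word probs (beta) + Dirichlet parameters (alpha)

k, V = 100, 20_000
for M in (1_000, 10_000, 100_000):
    print(f"M={M:>7,}: pLSA={plsa_params(k, V, M):>12,}  LDA={lda_params(k, V):>12,}")
```

Only the pLSA count grows with the corpus size M; the LDA count stays fixed.
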
        To deal with the problems above, this paper contributes the following solutions:
1. Providing a probabilistic model at the level of documents (a toy version of the generative process is sketched after this list).
2. Considering a mixture model that captures the exchangeability of both words and documents, in order to satisfy the "bag-of-words" assumption.
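
To make the document-level model concrete, here is a minimal sketch of LDA's generative process in NumPy; the sizes and hyperparameters below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

k, V, N = 3, 8, 10                        # topics, vocabulary, words per doc (toy sizes)
alpha = np.full(k, 0.5)                   # Dirichlet prior over topic mixture weights
beta = rng.dirichlet(np.ones(V), size=k)  # topic-word distributions, one row per topic

def generate_document():
    theta = rng.dirichlet(alpha)          # 1. draw this document's topic mixture
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)        # 2. draw a topic for each word position
        words.append(rng.choice(V, p=beta[z]))  # 3. draw a word from that topic
    return words

print(generate_document())                # a "document" as a list of word indices
```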

So the model becomes:

$$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta$$

The first term on the right-hand side, p(θ | α), models the per-document topic distribution. LDA then treats the mixture weights θ as a k-parameter random variable drawn from a Dirichlet prior, which makes the number of parameters independent of the number of documents.
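
The integral over θ has no closed-form solution; as a rough numerical check (with toy sizes and a toy document of my own choosing), it can be approximated by Monte Carlo, averaging the bracketed product over samples of θ drawn from the Dirichlet prior:

```python
import numpy as np

rng = np.random.default_rng(1)

k, V = 3, 8                               # toy sizes, my own choice
alpha = np.full(k, 0.5)
beta = rng.dirichlet(np.ones(V), size=k)  # p(w | z, beta), shape (k, V)
doc = [0, 3, 3, 5]                        # a toy document as word indices

def marginal_likelihood(doc, alpha, beta, n_samples=100_000):
    # Approximate p(w | alpha, beta) by averaging
    # prod_n sum_z p(z | theta) p(w_n | z, beta) over theta ~ Dirichlet(alpha).
    thetas = rng.dirichlet(alpha, size=n_samples)  # shape (S, k)
    per_word = thetas @ beta[:, doc]               # (S, N): sum_z theta_z * beta[z, w_n]
    return per_word.prod(axis=1).mean()

print(f"p(w | alpha, beta) ≈ {marginal_likelihood(doc, alpha, beta):.3e}")
```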

        In the experiments section, the paper reports good performance. However, according to Wikipedia and "On an Equivalence between PLSI and LDA" (SIGIR 2003), the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.
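
A quick way to play with this claim (a sketch using scikit-learn's LatentDirichletAllocation and a toy corpus of my own, not the paper's setup): setting doc_topic_prior to 1.0 corresponds to the uniform Dirichlet prior under which the equivalence holds.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus of my own; any list of strings works.
corpus = [
    "cats and dogs are pets",
    "dogs chase cats",
    "stocks and bonds are investments",
    "investors buy stocks and bonds",
]

X = CountVectorizer().fit_transform(corpus)

# doc_topic_prior=1.0 is the uniform Dirichlet prior (alpha = 1),
# the setting under which the SIGIR 2003 paper relates pLSA and LDA.
lda = LatentDirichletAllocation(n_components=2, doc_topic_prior=1.0, random_state=0)
doc_topics = lda.fit_transform(X)
print(doc_topics.round(2))  # per-document topic mixtures
```

With α = 1 the Dirichlet density is constant on the simplex, so the prior adds nothing beyond pLSA's likelihood term, which is the intuition behind the equivalence.
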
