Paper: “Latent Dirichlet allocation,” D. Blei,
A. Ng, and M. Jordan, Journal of Machine Learning Research, 3:993–1022, January
2003. Known as LDA.
This work extends pLSA, which has two serious disadvantages:
1. It is unclear how to assign a probability to a document outside the training set.
2. The number of parameters grows linearly with the size of the corpus, which causes serious overfitting (see the rough count below).
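A rough, hypothetical parameter count (my own illustration, not numbers from the paper) makes the second point concrete: with k topics, a vocabulary of size V, and M training documents, pLSA keeps a k-dimensional mixture for every document on top of the k topic-word distributions, so the count grows with M, while LDA's corpus-level parameters do not depend on M.

# Illustrative parameter counts; k, V, M and the function names are my own.
def plsa_param_count(k, V, M):
    # k*V topic-word probabilities + k*M per-document mixture weights
    return k * V + k * M

def lda_param_count(k, V):
    # k*V topic-word probabilities + k Dirichlet parameters (alpha)
    return k * V + k

for M in (1000, 10000, 100000):
    print(M, plsa_param_count(k=100, V=10000, M=M), lda_param_count(k=100, V=10000))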
To deal with these problems, the paper contributes the following solutions:
1. It provides a probabilistic model at the level of documents.
2. It uses a mixture model that captures the exchangeability of both words and documents, consistent with the "bag-of-words" assumption.
So the per-document model becomes:

p(w | α, β) = ∫ p(θ | α) [ ∏_{n=1..N} ∑_{z_n} p(z_n | θ) p(w_n | z_n, β) ] dθ

The first term on the right-hand side, p(θ | α), models the distribution over a document's topic mixture. LDA treats the mixture weights θ as a k-parameter random variable drawn from a Dirichlet, which makes the number of parameters independent of the number of documents.
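A minimal sketch of the generative process behind this equation, written with numpy (toy sizes and the function name generate_document are my own, not from the paper):

import numpy as np

rng = np.random.default_rng(0)

k, V = 3, 20                               # number of topics, vocabulary size (toy values)
alpha = np.ones(k)                         # k Dirichlet parameters over topic mixtures
beta = rng.dirichlet(np.ones(V), size=k)   # k topic-word distributions

def generate_document(n_words):
    theta = rng.dirichlet(alpha)           # theta ~ Dir(alpha): the k-parameter mixture weights
    words = []
    for _ in range(n_words):
        z = rng.choice(k, p=theta)         # topic assignment z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])       # word w_n ~ Multinomial(beta_z)
        words.append(w)
    return words

print(generate_document(10))

Note that only alpha and beta are corpus-level parameters; theta is drawn per document rather than stored as a parameter.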
In the experiments reported in the paper, LDA performs well. However, according to Wikipedia and "On an Equivalence between PLSI and LDA" (SIGIR 2003), the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.
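A brief way to see that connection (my paraphrase of the equivalence result, not a full derivation): with a uniform prior, p(θ | α) is constant over the simplex, so

argmax_θ p(θ | α) ∏_n p(w_n | θ, β) = argmax_θ ∏_n p(w_n | θ, β)

i.e. MAP estimation of the per-document mixture θ in LDA optimizes the same objective as fitting pLSA's mixture weights p(z | d); the two models then differ mainly in whether θ is integrated out or point-estimated.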