Thursday, March 21, 2013

[ammai] week5 Probabilistic latent semantic indexing

Paper: “Probabilistic latent semantic indexing,” T. Hofmann, SIGIR 1999. Also known as pLSI/pLSA.


This paper improves on SVD-based LSA by taking a probabilistic view: it defines a generative model and fits it with the tempered EM algorithm (TEM). To overcome the problems of synonymy and polysemy in the tf-idf representation, LSA maps words and documents into a latent semantic space, but it still loses information in the face of polysemy. pLSA instead builds on a solid statistical foundation, the likelihood principle, and defines an explicit generative model.

P(d) is the probability of document d.
P(w) is the probability of word w.
Assume z is the hidden topic (latent class).

The joint model is

P(d, w) = P(d) P(w|d), where P(w|d) = Σ_z P(w|z) P(z|d)   ... (1)

or, in the equivalent symmetric form,

P(d, w) = Σ_z P(z) P(d|z) P(w|z)   ... (2)

For (2), the author assumes that d and w are conditionally independent given z, which means the words in a document are "generated" independently once the topic is chosen.
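As a quick sanity check of the symmetric form (2), here is a minimal NumPy sketch; the corpus sizes and random initialization are made up purely for illustration. It verifies that the mixture defines a proper joint distribution over (d, w):

```python
import numpy as np

# Hypothetical sizes for illustration: 4 documents, 6 words, 2 latent topics.
n_docs, n_words, n_topics = 4, 6, 2
rng = np.random.default_rng(0)

# Randomly initialized parameters, normalized so each is a
# proper probability distribution.
P_z = rng.random(n_topics); P_z /= P_z.sum()          # P(z)
P_d_given_z = rng.random((n_docs, n_topics))
P_d_given_z /= P_d_given_z.sum(axis=0)                # columns: P(d|z)
P_w_given_z = rng.random((n_words, n_topics))
P_w_given_z /= P_w_given_z.sum(axis=0)                # columns: P(w|z)

# Symmetric form (2): P(d, w) = sum_z P(z) P(d|z) P(w|z)
P_dw = np.einsum("k,ik,jk->ij", P_z, P_d_given_z, P_w_given_z)

# A joint distribution over all (d, w) pairs must sum to 1.
print(round(P_dw.sum(), 6))  # → 1.0
```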
Then we set the objective function to be the log-likelihood of the observed pairs:

L = Σ_d Σ_w n(d, w) log P(d, w)
where n(d, w) is the number of occurrences of the pair (d, w).
To solve this optimization problem, the work applies the EM algorithm.
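As a small sketch of evaluating this objective (the count matrix and the model joint below are made-up toy numbers, not from the paper):

```python
import numpy as np

# n(d, w): toy term-count matrix (rows = documents, columns = words),
# and a toy model joint P(d, w) — both invented for illustration.
n = np.array([[2, 0, 1],
              [0, 3, 1]])
P_dw = np.array([[0.25, 0.05, 0.10],
                 [0.05, 0.40, 0.15]])

# Log-likelihood L = sum_{d,w} n(d, w) * log P(d, w),
# summing only over pairs that actually occur (n > 0).
mask = n > 0
L = np.sum(n[mask] * np.log(P_dw[mask]))
print(L)
```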

E-step:

P(z|d, w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')

M-step:

P(w|z) ∝ Σ_d n(d, w) P(z|d, w)
P(d|z) ∝ Σ_w n(d, w) P(z|d, w)
P(z) ∝ Σ_d Σ_w n(d, w) P(z|d, w)

In tempered EM (TEM), the terms in the E-step are additionally raised to a power β < 1 to avoid overfitting.
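The E-step and M-step above can be sketched end-to-end in NumPy. Everything here (corpus size, random counts, initialization, the 50 iterations) is an illustrative assumption rather than the paper's setup; the point is that each EM iteration never decreases the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
n_docs, n_words, n_topics = 5, 8, 3
n = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)  # toy counts n(d, w)

# Random normalized initial parameters for the symmetric model.
P_z = np.full(n_topics, 1.0 / n_topics)
P_d_given_z = rng.random((n_docs, n_topics)); P_d_given_z /= P_d_given_z.sum(0)
P_w_given_z = rng.random((n_words, n_topics)); P_w_given_z /= P_w_given_z.sum(0)

def log_likelihood():
    P_dw = np.einsum("k,ik,jk->ij", P_z, P_d_given_z, P_w_given_z)
    mask = n > 0
    return np.sum(n[mask] * np.log(P_dw[mask]))

history = []
for step in range(50):
    # E-step: P(z|d,w) ∝ P(z) P(d|z) P(w|z), normalized over z.
    post = np.einsum("k,ik,jk->ijk", P_z, P_d_given_z, P_w_given_z)
    post /= post.sum(axis=2, keepdims=True)

    # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w).
    expected = n[:, :, None] * post               # shape (docs, words, topics)
    P_w_given_z = expected.sum(axis=0); P_w_given_z /= P_w_given_z.sum(0)
    P_d_given_z = expected.sum(axis=1); P_d_given_z /= P_d_given_z.sum(0)
    P_z = expected.sum(axis=(0, 1)); P_z /= P_z.sum()

    # The log-likelihood is non-decreasing across EM iterations.
    history.append(log_likelihood())
```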

Then, we can describe the geometry of pLSA. (The figure of the topic sub-simplex is omitted here.)

The topic distributions P(w|z_k) span a sub-simplex that contains every document distribution P(w|d), so the P(w|z_k) act as a basis set.
Thus, pLSA uses a lower dimension than the previous work does, and its dimensions correspond to interpretable hidden topics.
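A tiny sketch of this point, with toy numbers: any P(w|d) the model produces is a convex combination of the P(w|z_k), so it stays inside the sub-simplex they span:

```python
import numpy as np

# Toy topic-word distributions P(w|z_k): two topics over four words
# (columns sum to 1).
P_w_given_z = np.array([[0.5, 0.1],
                        [0.3, 0.1],
                        [0.1, 0.4],
                        [0.1, 0.4]])

# Mixing weights P(z_k|d) for one document: a point on the topic simplex.
P_z_given_d = np.array([0.7, 0.3])

# P(w|d) = sum_k P(w|z_k) P(z_k|d): a convex combination of the topic
# distributions, hence itself a valid distribution over words.
P_w_given_d = P_w_given_z @ P_z_given_d
print(P_w_given_d)
```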
