Thursday, March 21, 2013

[ammai] week5 Probabilistic latent semantic indexing

Paper: “Probabilistic latent semantic indexing,” T. Hofmann, SIGIR 1999. Also known as pLSI/pLSA.


This paper improves on SVD-based LSA by taking a probabilistic view: it defines a generative model and fits it with the tempered EM algorithm (TEM). To overcome the problems of synonymy and polysemy in the tf-idf representation, LSA maps words and documents into a latent semantic space, but it still loses information in the face of polysemy. pLSA instead builds on a solid statistical foundation, the likelihood principle, and defines an explicit generative model.

P(d) is the probability of document d.
P(w) is the probability of word w.
Assume z is the hidden topic (latent class).

The joint model is

P(d, w) = P(d) P(w|d), where P(w|d) = Σ_z P(w|z) P(z|d)   ... (1)

or, in the equivalent symmetric form,

P(d, w) = Σ_z P(z) P(d|z) P(w|z)   ... (2)

For (2), the author assumes that d and w are conditionally independent given z, which means the words in a document are "generated" independently once the topic is chosen.
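As a quick sanity check of the symmetric form (2), here is a minimal NumPy sketch; the corpus sizes and random initialization are made up purely for illustration. It verifies that the mixture defines a proper joint distribution over (d, w):

```python
import numpy as np

# Hypothetical sizes for illustration: 4 documents, 6 words, 2 latent topics.
n_docs, n_words, n_topics = 4, 6, 2
rng = np.random.default_rng(0)

# Randomly initialized parameters, normalized so each is a
# proper probability distribution.
P_z = rng.random(n_topics); P_z /= P_z.sum()          # P(z)
P_d_given_z = rng.random((n_docs, n_topics))
P_d_given_z /= P_d_given_z.sum(axis=0)                # columns: P(d|z)
P_w_given_z = rng.random((n_words, n_topics))
P_w_given_z /= P_w_given_z.sum(axis=0)                # columns: P(w|z)

# Symmetric form (2): P(d, w) = sum_z P(z) P(d|z) P(w|z)
P_dw = np.einsum("k,ik,jk->ij", P_z, P_d_given_z, P_w_given_z)

# A joint distribution over all (d, w) pairs must sum to 1.
print(round(P_dw.sum(), 6))  # → 1.0
```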
Then we set the objective function to be the log-likelihood of the observed pairs:

L = Σ_d Σ_w n(d, w) log P(d, w)
where n(d, w) is the number of occurrences of the pair (d, w).
To solve this optimization problem, the work applies the EM algorithm.
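As a small sketch of evaluating this objective (the count matrix and the model joint below are made-up toy numbers, not from the paper):

```python
import numpy as np

# n(d, w): toy term-count matrix (rows = documents, columns = words),
# and a toy model joint P(d, w) — both invented for illustration.
n = np.array([[2, 0, 1],
              [0, 3, 1]])
P_dw = np.array([[0.25, 0.05, 0.10],
                 [0.05, 0.40, 0.15]])

# Log-likelihood L = sum_{d,w} n(d, w) * log P(d, w),
# summing only over pairs that actually occur (n > 0).
mask = n > 0
L = np.sum(n[mask] * np.log(P_dw[mask]))
print(L)
```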

E-step:

P(z|d, w) = P(z) P(d|z) P(w|z) / Σ_z' P(z') P(d|z') P(w|z')

M-step:

P(w|z) ∝ Σ_d n(d, w) P(z|d, w)
P(d|z) ∝ Σ_w n(d, w) P(z|d, w)
P(z) ∝ Σ_d Σ_w n(d, w) P(z|d, w)

In tempered EM (TEM), the terms in the E-step are additionally raised to a power β < 1 to avoid overfitting.
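The E-step and M-step above can be sketched end-to-end in NumPy. Everything here (corpus size, random counts, initialization, the 50 iterations) is an illustrative assumption rather than the paper's setup; the point is that each EM iteration never decreases the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
n_docs, n_words, n_topics = 5, 8, 3
n = rng.integers(0, 5, size=(n_docs, n_words)).astype(float)  # toy counts n(d, w)

# Random normalized initial parameters for the symmetric model.
P_z = np.full(n_topics, 1.0 / n_topics)
P_d_given_z = rng.random((n_docs, n_topics)); P_d_given_z /= P_d_given_z.sum(0)
P_w_given_z = rng.random((n_words, n_topics)); P_w_given_z /= P_w_given_z.sum(0)

def log_likelihood():
    P_dw = np.einsum("k,ik,jk->ij", P_z, P_d_given_z, P_w_given_z)
    mask = n > 0
    return np.sum(n[mask] * np.log(P_dw[mask]))

history = []
for step in range(50):
    # E-step: P(z|d,w) ∝ P(z) P(d|z) P(w|z), normalized over z.
    post = np.einsum("k,ik,jk->ijk", P_z, P_d_given_z, P_w_given_z)
    post /= post.sum(axis=2, keepdims=True)

    # M-step: re-estimate parameters from expected counts n(d,w) P(z|d,w).
    expected = n[:, :, None] * post               # shape (docs, words, topics)
    P_w_given_z = expected.sum(axis=0); P_w_given_z /= P_w_given_z.sum(0)
    P_d_given_z = expected.sum(axis=1); P_d_given_z /= P_d_given_z.sum(0)
    P_z = expected.sum(axis=(0, 1)); P_z /= P_z.sum()

    # The log-likelihood is non-decreasing across EM iterations.
    history.append(log_likelihood())
```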

Then, we can describe the geometry of pLSA. (The figure of the topic sub-simplex is omitted here.)

The topic distributions P(w|z_k) span a sub-simplex that contains every document distribution P(w|d), so the P(w|z_k) act as a basis set.
Thus, pLSA uses a lower dimension than the previous work does, and its dimensions correspond to interpretable hidden topics.
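A tiny sketch of this point, with toy numbers: any P(w|d) the model produces is a convex combination of the P(w|z_k), so it stays inside the sub-simplex they span:

```python
import numpy as np

# Toy topic-word distributions P(w|z_k): two topics over four words
# (columns sum to 1).
P_w_given_z = np.array([[0.5, 0.1],
                        [0.3, 0.1],
                        [0.1, 0.4],
                        [0.1, 0.4]])

# Mixing weights P(z_k|d) for one document: a point on the topic simplex.
P_z_given_d = np.array([0.7, 0.3])

# P(w|d) = sum_k P(w|z_k) P(z_k|d): a convex combination of the topic
# distributions, hence itself a valid distribution over words.
P_w_given_d = P_w_given_z @ P_z_given_d
print(P_w_given_d)
```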
