This paper improves on SVD-based LSA from a probabilistic point of view, using a generative model and the tempered EM algorithm (TEM).
To overcome the problems of synonymy and polysemy in the tf-idf method, LSA represents documents and terms in a latent semantic
space. But some information is still lost when facing polysemy. pLSA instead provides a
solid statistical foundation based on the likelihood principle and defines a
generative model.
P(d) is the probability of document d.
P(w) is the probability of word w.
Let z be the hidden topic variable.
For (2), the author assumes that the observation pairs (d, w) are generated
independently, and that w and d are conditionally independent given the topic z,
which means the words in a document are "generated" independently.
Then we set the objective function to be the log-likelihood

L = sum_d sum_w n(d, w) log P(d, w)

where n(d, w) is the number of occurrences of the pair (d, w), and
P(d, w) = P(d) sum_z P(w|z) P(z|d) by the generative model above.
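This objective is easy to compute directly. A minimal NumPy sketch of the model and its log-likelihood; the array names (n_dw, P_d, P_w_z, P_z_d) and the toy sizes are my own, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
D, W, K = 4, 6, 2  # documents, vocabulary size, hidden topics (toy sizes)

# n_dw[d, w]: count of word w in document d
n_dw = rng.integers(1, 5, size=(D, W)).astype(float)

# Model parameters: each distribution normalized along its probability axis
P_d = np.full(D, 1.0 / D)                                 # P(d), uniform here
P_w_z = rng.random((W, K)); P_w_z /= P_w_z.sum(axis=0)    # P(w|z)
P_z_d = rng.random((K, D)); P_z_d /= P_z_d.sum(axis=0)    # P(z|d)

def log_likelihood(n_dw, P_d, P_w_z, P_z_d):
    # P(d, w) = P(d) * sum_z P(w|z) P(z|d)
    P_dw = P_d[:, None] * (P_w_z @ P_z_d).T   # shape (D, W)
    # objective: L = sum_{d,w} n(d, w) * log P(d, w)
    return np.sum(n_dw * np.log(P_dw + 1e-12))

print(log_likelihood(n_dw, P_d, P_w_z, P_z_d))
```

EM then adjusts P(w|z) and P(z|d) to increase this quantity.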
To solve the optimization, this work applies the EM algorithm.
E-step: compute the posterior of the hidden topic,

P(z|d, w) = P(w|z) P(z|d) / sum_z' P(w|z') P(z'|d)

(in TEM, the numerator is raised to a temperature power beta before normalizing).

M-step: re-estimate the parameters from the expected counts,

P(w|z) ∝ sum_d n(d, w) P(z|d, w)
P(z|d) ∝ sum_w n(d, w) P(z|d, w)
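The two steps can be sketched in NumPy as follows. This follows the standard pLSA updates, with an optional temperature beta for TEM (beta = 1 recovers plain EM); variable names and toy data are my own assumptions:

```python
import numpy as np

def em_step(n_dw, P_w_z, P_z_d, beta=1.0):
    """One (tempered) EM iteration for pLSA.

    n_dw:  (D, W) word counts; P_w_z: (W, K) P(w|z); P_z_d: (K, D) P(z|d).
    """
    # E-step: posterior P(z|d,w) proportional to [P(w|z) P(z|d)]^beta
    post = (P_w_z[None, :, :] * P_z_d.T[:, None, :]) ** beta   # (D, W, K)
    post /= post.sum(axis=2, keepdims=True) + 1e-12

    # M-step: re-estimate parameters from expected counts
    weighted = n_dw[:, :, None] * post                 # n(d,w) * P(z|d,w)
    P_w_z = weighted.sum(axis=0)                       # sum over d -> (W, K)
    P_w_z /= P_w_z.sum(axis=0, keepdims=True)          # normalize over w
    P_z_d = weighted.sum(axis=1).T                     # sum over w -> (K, D)
    P_z_d /= P_z_d.sum(axis=0, keepdims=True)          # normalize over z
    return P_w_z, P_z_d

# Toy run on random counts
rng = np.random.default_rng(1)
D, W, K = 4, 6, 2
n_dw = rng.integers(1, 5, size=(D, W)).astype(float)
P_w_z = rng.random((W, K)); P_w_z /= P_w_z.sum(axis=0)
P_z_d = rng.random((K, D)); P_z_d /= P_z_d.sum(axis=0)
P_w_z, P_z_d = em_step(n_dw, P_w_z, P_z_d, beta=0.9)
```

Lowering beta below 1 flattens the posterior, which is what TEM uses to avoid overfitting.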
Then, we can describe the geometry of pLSA.
It makes sense that the P(w|zk) span the space containing the P(w|d). So
the P(w|zk) form a basis set.
Thus, pLSA uses a lower dimension than the previous work does, and it
can represent hidden topics.
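The geometric claim above can be checked numerically: every document distribution P(w|d) = sum_z P(w|z) P(z|d) is a convex combination of the K columns P(w|zk), so each document is summarized by only K topic weights instead of W word weights. A small sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
W, K, D = 6, 2, 3
P_w_z = rng.random((W, K)); P_w_z /= P_w_z.sum(axis=0)   # topic-word basis
P_z_d = rng.random((K, D)); P_z_d /= P_z_d.sum(axis=0)   # convex weights

# Each column P(w|d) mixes the K basis columns P(w|z) with weights P(z|d),
# so it lies inside the simplex spanned by the P(w|zk).
P_w_d = P_w_z @ P_z_d    # shape (W, D); columns are valid distributions
print(P_w_d.sum(axis=0))
```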