Thursday, March 21, 2013

[ammai] week5 Probabilistic latent semantic indexing

Paper: “Probabilistic latent semantic indexing,” T. Hofmann, SIGIR 1999. Known as pLSI/pLSA.


This paper improves on SVD-based LSA from a probabilistic point of view, using a generative model and the tempered EM algorithm (TEM). To overcome the problems of synonymy and polysemy in the tf-idf method, LSA represents documents and terms in a latent semantic space; however, it still loses some information when dealing with polysemy. pLSA instead provides a solid statistical foundation based on the likelihood principle and defines a proper generative model.

P(d) is the probability of document d.
P(w) is the probability of word w.
Assume z is a hidden topic (latent class).

The generative model defines the joint probability as

(1) P(d, w) = P(d) P(w|d)
(2) P(w|d) = Σ_z P(w|z) P(z|d)

For (2), the author assumes that the (d, w) pairs are generated independently and that w and d are conditionally independent given z; in other words, the words in a document are "generated" independently once the topics are chosen.
The objective function to be maximized is the log-likelihood

L = Σ_d Σ_w n(d, w) log P(d, w)

where n(d, w) is the number of occurrences of the pair (d, w).
To solve this optimization problem, the work applies the EM algorithm.

E-step: compute the posterior probability of the latent topic for each observed pair,

P(z|d, w) = P(w|z) P(z|d) / Σ_z' P(w|z') P(z'|d)

M-step: re-estimate the parameters using these posteriors,

P(w|z) ∝ Σ_d n(d, w) P(z|d, w)
P(z|d) ∝ Σ_w n(d, w) P(z|d, w)
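A minimal sketch of these E/M updates in NumPy (plain EM only; the tempering of TEM is omitted, and the toy count matrix is made up for illustration):

import numpy as np

def plsa(n_dw, K=2, iters=50, seed=0):
    """Plain EM for pLSA on a document-word count matrix n_dw of shape (D, W)."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    # Random initialization of P(z|d) and P(w|z), each row normalized
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: P(z|d,w) proportional to P(w|z) P(z|d), shape (D, W, K)
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        p_z_dw = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate P(w|z) and P(z|d), weighted by the counts n(d, w)
        weighted = n_dw[:, :, None] * p_z_dw
        p_w_z = weighted.sum(axis=0).T
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=1)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_z_d, p_w_z

# Toy example: 4 documents over a 6-word vocabulary
n_dw = np.array([[3, 2, 0, 0, 1, 0],
                 [4, 1, 0, 1, 0, 0],
                 [0, 0, 5, 2, 0, 1],
                 [0, 1, 3, 3, 0, 2]], dtype=float)
p_z_d, p_w_z = plsa(n_dw, K=2)
print(p_z_d.round(2))   # topic mixture P(z|d) for each document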



Then we can describe the geometry of pLSA (the paper illustrates it with a picture of the topic simplex).

Since P(w|d) = Σ_k P(w|z_k) P(z_k|d), each document-specific word distribution P(w|d) lies in the convex hull spanned by the topic distributions P(w|z_k); in this sense the P(w|z_k) form a basis set.
Thus it uses a lower dimension than the previous work does, and the dimensions can be interpreted as hidden topics.

Wednesday, March 13, 2013

[ammai] week4 Aggregating local descriptors into a compact image representation



Paper: “Aggregating local descriptors into a compact image representation,” H. Jégou et al., Proc. IEEE CVPR’10.

Note:

    This paper proposes a system for very-large-scale image search that addresses accuracy, efficiency, and memory usage together.
   
    The two main contributions are:
1. A method to represent an image with excellent accuracy at a reasonable vector dimensionality.
2. A joint optimization of the trade-off between dimensionality reduction and the indexing algorithm.

      After extracting the SIFT descriptors, instead of using millions of centroids to build the dictionary as in Bag-of-Words, only a few centroids (e.g., 256) are used to build the descriptor.

      Each feature is still assigned to its nearest centroid. After assignment, however, we do not count the number of features belonging to each centroid as in BoW; instead, we accumulate the differences (residuals) between the features and the centroid they are assigned to. So each image can be represented by a 128 × k-dimensional vector (k being the number of centroids) rather than a very high-dimensional vector.
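      A minimal sketch of this aggregation step (a VLAD-style encoding with random data standing in for SIFT descriptors and a pre-trained vocabulary; the final L2 normalization is a simplification of my own):

import numpy as np

def vlad_encode(descriptors, centroids):
    """Aggregate local descriptors (N x 128) against k centroids (k x 128)."""
    k, d = centroids.shape
    # Assign each descriptor to its nearest centroid
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    assign = dists.argmin(axis=1)
    # Accumulate the residuals (descriptor - centroid) for each centroid
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            vlad[i] = (members - centroids[i]).sum(axis=0)
    vlad = vlad.ravel()                                  # 128 * k dimensional image vector
    return vlad / (np.linalg.norm(vlad) + 1e-12)         # L2 normalization

rng = np.random.default_rng(0)
descs = rng.random((500, 128)).astype(np.float32)        # stand-in for SIFT descriptors
cents = rng.random((256, 128)).astype(np.float32)        # stand-in for the 256 centroids
print(vlad_encode(descs, cents).shape)                   # (32768,) = 128 x 256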

      After generating these vectors, the paper proposes a method to compress them. To achieve good accuracy after quantization, the number of quantization cells k should be large; however, a single quantizer with that many centroids is intractable to train and store.

      So this work uses a product quantization method. First, the vector is split into m equal parts, and one quantizer is learned for each of the m sub-vectors. The good news is that each sub-quantizer now only needs ks = k^(1/m) centroids to reach the expected accuracy, since the effective number of cells is ks^m = k.
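      A minimal sketch of product quantization under these assumptions (sub-quantizers trained with scikit-learn's KMeans; m, ks, and the random data are just for illustration):

import numpy as np
from sklearn.cluster import KMeans

def train_pq(vectors, m=8, ks=16, seed=0):
    """Train m sub-quantizers with ks centroids each (ks**m effective cells)."""
    sub_dim = vectors.shape[1] // m
    codebooks = []
    for i in range(m):
        sub = vectors[:, i * sub_dim:(i + 1) * sub_dim]
        km = KMeans(n_clusters=ks, n_init=4, random_state=seed).fit(sub)
        codebooks.append(km.cluster_centers_)
    return codebooks

def pq_encode(vec, codebooks):
    """Encode one vector as m small centroid indices."""
    sub_dim = len(vec) // len(codebooks)
    return [int(np.argmin(np.linalg.norm(cb - vec[i * sub_dim:(i + 1) * sub_dim], axis=1)))
            for i, cb in enumerate(codebooks)]

rng = np.random.default_rng(0)
data = rng.random((1000, 64)).astype(np.float32)
codebooks = train_pq(data, m=8, ks=16)
print(pq_encode(data[0], codebooks))    # 8 indices of 4 bits each -> a 32-bit code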

      In addition to quantization, it is better to also perform dimensionality reduction, and the natural choice is traditional PCA. However, two problems arise from PCA:

      1. The variance of each dimension differs after PCA, so the first principal components dominate the quantization error (see the short sketch after this list).
      2. There is a trade-off between the projection error and the quantization error.
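      A quick illustrative sketch of problem 1 (synthetic correlated data standing in for the image vectors; it only shows that the variance concentrates in the first few components after PCA):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.random((2000, 64)) @ rng.random((64, 64))   # synthetic correlated features
proj = PCA(n_components=16).fit_transform(x)
print(proj.var(axis=0).round(2))   # variance falls off quickly: the first components
                                   # would dominate a uniform per-dimension quantizer's error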

Comments:

      1. The way it quantizes the features is by a 'product' of sub-quantizers. I think it is an impressive way to do this, but it is a little bit complicated.

      2. Using the signed value of the difference rather than its absolute value is another surprising thing. It is probably because the authors want to preserve the direction of the residual vectors in each dimension.

[ammai] week3 Efficient visual search of videos cast as text retrieval

Paper: “Efficient visual search of videos cast as text retrieval,” J. Sivic and A. Zisserman, IEEE TPAMI, 2009.

Note:


    This paper addresses object search in video frames. It introduces the concept of “visual words”, which is now a well-known technique, and casts the problem of object retrieval as text retrieval, which has been studied for a long time. In other words, it applies text-retrieval techniques to the problem, and the results look good.

    The following briefly describes how the proposed system works.




    In text retrieval, there are several standard techniques, such as stemming, vector weighting (tf-idf), inverted indexing, similarity comparison, and ranking.

    The “visual words” are then constructed as follows.

    First, it finds invariant descriptors in each frame. Following earlier work, two kinds of interest regions are detected, Shape Adapted (SA) and Maximally Stable (MS), both of which are elliptical regions of interest. Each region is then described by a 128-dimensional SIFT descriptor.

    Second, to check whether a region belongs to a stable object or is just noise, each region is tracked across frames using a simple constant-velocity dynamical model. Regions that do not survive for at least 3 frames are discarded.

    Third, and most importantly, the visual-word vocabulary is built. Vector quantization is used here, and each resulting cluster becomes a “visual word”. In the paper the descriptors are quantized by K-means clustering, and a k-d tree is used to reduce the search time during quantization. After quantization there are around 16,000 clusters. (In the implementation, the Mahalanobis distance is used in K-means for better accuracy.)
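    A minimal sketch of this vocabulary-building and quantization step (scikit-learn KMeans with plain Euclidean distance instead of the paper's Mahalanobis distance, a much smaller vocabulary, and random data standing in for SIFT descriptors):

import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_descs = rng.random((5000, 128)).astype(np.float32)   # stand-in for training SIFT descriptors

# Build the visual vocabulary by clustering the training descriptors
vocab = KMeans(n_clusters=200, n_init=4, random_state=0).fit(train_descs).cluster_centers_

# A k-d tree over the centroids speeds up assigning new descriptors to visual words
tree = cKDTree(vocab)
frame_descs = rng.random((300, 128)).astype(np.float32)    # descriptors from one frame
_, word_ids = tree.query(frame_descs)                      # visual-word index per descriptor
print(word_ids[:10])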

    Fourth, each frame is represented by the visual words clustered in Step 3. Each cluster produced by K-means is treated as a visual word, so the intra-cluster variation is analogous to the variation a stem covers in text retrieval. Each frame is then represented as a roughly 16,000-dimensional vector of the frequencies with which each visual word occurs.

    Fifth, the tf-idf weight of each visual word is computed, and words that occur too frequently are removed; that is, visual words occurring too many times are treated as “stop words”.
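    A minimal sketch of this weighting step (standard tf-idf; the stop-word fraction and the toy counts are my own choices):

import numpy as np

def tfidf(counts, stop_fraction=0.05):
    """counts: (frames x vocabulary) matrix of visual-word occurrence counts."""
    n_frames, vocab_size = counts.shape
    tf = counts / (counts.sum(axis=1, keepdims=True) + 1e-12)   # term frequency per frame
    df = (counts > 0).sum(axis=0)                               # document (frame) frequency
    idf = np.log(n_frames / (df + 1e-12))
    weights = tf * idf
    # Treat the most frequent visual words as stop words and zero them out
    stop = np.argsort(df)[-int(stop_fraction * vocab_size):]
    weights[:, stop] = 0.0
    return weights

counts = np.random.default_rng(0).integers(0, 5, size=(10, 100)).astype(float)
print(tfidf(counts).shape)   # (10, 100) weighted frame vectors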

    In the retrieval system, the ranking is determined by the cosine similarity between the L2-normalized frame vectors. After this first ranking, the candidate list is re-ranked by spatial consistency voting (illustrated by a figure in the paper).
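    A minimal sketch of the first ranking stage (cosine similarity over the weighted frame vectors; the spatial re-ranking step is not shown):

import numpy as np

def rank_frames(query_vec, frame_vecs, top=5):
    """Rank frames by cosine similarity of L2-normalized vectors."""
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    f = frame_vecs / (np.linalg.norm(frame_vecs, axis=1, keepdims=True) + 1e-12)
    sims = f @ q                        # cosine similarity with every frame
    order = np.argsort(-sims)[:top]     # best-matching frames first
    return order, sims[order]

rng = np.random.default_rng(0)
frames = rng.random((100, 50))
query = frames[42] + 0.01 * rng.random(50)      # a query close to frame 42
print(rank_frames(query, frames)[0])            # frame 42 should come out on top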





Comments:
        1. The similarity is calculated with the L2 (cosine) distance. Even though the “stop words” have been removed, I still think there could be a better way to measure the distance, because each visual word may have a different importance.
        2. Is the data set it used large enough? I doubt it. The system is only tested on 3 movies, which makes me wonder whether the results are a special case.

Monday, March 11, 2013

First Article

This blog is for the 2013 Spring ammai class taught by Winston Hsu.
Mainly for writing down some thoughts about the course readings.

Yeah!