Wednesday, March 13, 2013

[ammai] week3 Efficient visual search of videos cast as text retrieval

Paper: “Efficient visual search of videos cast as text retrieval,” J. Sivic and A. Zisserman, IEEE TPAMI, 2009.

Note:


    This paper addresses object search in video frames. It introduces the concept of “visual words”, now a well-known technique, and casts object retrieval as text retrieval, a problem that has been studied for a long time. In other words, it applies text-retrieval techniques to the visual problem, and the results look good.

    The following briefly describes how the proposed system works.




    In text retrieval there are several standard techniques, such as stemming, tf-idf vector weighting, inverted indexing, similarity comparison, and ranking.
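
    To make these borrowed ideas concrete, here is a toy inverted index with tf-idf scoring over a few tiny text documents. This is purely illustrative and not the authors' implementation; the document names and contents are made up.

```python
# Toy text-retrieval machinery the paper borrows: term frequencies, an
# inverted file mapping each word to the documents containing it, and
# idf weighting.  Documents and words here are made up for illustration.
import math
from collections import Counter, defaultdict

docs = {
    "d1": "the quick brown fox".split(),
    "d2": "the lazy dog".split(),
    "d3": "the quick dog".split(),
}

# term frequency per document
tf = {d: Counter(words) for d, words in docs.items()}

# inverted index: word -> set of documents that contain it
inverted = defaultdict(set)
for d, words in docs.items():
    for w in words:
        inverted[w].add(d)

# inverse document frequency
N = len(docs)
idf = {w: math.log(N / len(ds)) for w, ds in inverted.items()}

# score only the documents that share at least one word with the query
query = "quick dog".split()
scores = Counter()
for w in query:
    for d in inverted.get(w, ()):
        scores[d] += tf[d][w] * idf[w]

print(scores.most_common())  # d3 ranks first: it contains both query words
```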

    The “visual words” introduced by the paper play the role of text words in this framework.

    First, it finds invariant descriptors in each frame. Following prior work, two kinds of interest regions are detected - Shape Adapted (SA) and Maximally Stable (MS) regions, both of which are elliptical regions of interest. Each region is then described by a 128-dimensional SIFT descriptor.
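
    A rough sketch of this extraction step with OpenCV is shown below. The paper's affine-covariant SA/MS detectors are not reproduced exactly here: plain SIFT keypoints stand in for them (MSER is shown as a rough counterpart of the MS regions), but the output is still a set of 128-dimensional descriptors per frame. The file name "frame.jpg" is a placeholder.

```python
# Stand-in for the feature extraction step using OpenCV.  The paper uses
# affine-covariant Shape Adapted and Maximally Stable elliptical regions;
# here plain SIFT keypoints (plus MSER regions) are used instead, which
# still yields the 128-D descriptors the rest of the pipeline consumes.
import cv2

frame = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder frame

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(frame, None)

# MSER regions are close in spirit to the paper's Maximally Stable regions
mser = cv2.MSER_create()
regions, _ = mser.detectRegions(frame)

print(descriptors.shape)  # (num_keypoints, 128)
```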

    Second, to check whether a region corresponds to a stable object or is just noise, each region is tracked across frames using a simple velocity dynamical model. Regions that do not survive for at least 3 frames are rejected as useless.
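
    The following is a very crude sketch of the "reject short tracks" idea, assuming regions are linked frame-to-frame by nearest descriptor rather than by the paper's constant-velocity tracker. All names and the random data are illustrative only.

```python
# Minimal sketch: link detections across consecutive frames by nearest
# descriptor (a crude stand-in for the paper's dynamical-model tracker)
# and keep only regions that persist for at least MIN_TRACK_LEN frames.
import numpy as np

MIN_TRACK_LEN = 3

def link_tracks(frames_desc, max_dist=200.0):
    """frames_desc: list of (num_regions_i, 128) descriptor arrays, one per frame."""
    tracks = [[d] for d in frames_desc[0]]          # open one track per region in frame 0
    for desc in frames_desc[1:]:
        used = set()
        for track in tracks:
            dists = np.linalg.norm(desc - track[-1], axis=1)
            j = int(np.argmin(dists))
            if dists[j] < max_dist and j not in used:
                track.append(desc[j])               # extend the track into this frame
                used.add(j)
    return [t for t in tracks if len(t) >= MIN_TRACK_LEN]

# Random descriptors standing in for three consecutive frames
rng = np.random.default_rng(0)
frames = [rng.normal(size=(10, 128)).astype(np.float32) for _ in range(3)]
stable = link_tracks(frames)
print(len(stable), "tracks survived at least", MIN_TRACK_LEN, "frames")
```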

    Third, and most importantly, the visual-word vocabulary is built. The descriptors are simply vector-quantized, and the resulting clusters become the “visual words”. In the paper the quantization is done by K-means clustering, and a k-d tree is built to reduce the search time when quantizing. After quantization there are around 16,000 clusters. (For accuracy, the implementation uses the Mahalanobis distance inside K-means.)
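
    A small sketch of vocabulary construction follows, assuming plain Euclidean K-means (the paper uses a Mahalanobis distance) and a tiny vocabulary instead of ~16,000 words. Random vectors stand in for real SIFT descriptors.

```python
# Vocabulary construction sketch: cluster sampled SIFT descriptors with
# K-means and put the cluster centres ("visual words") into a k-d tree so
# new descriptors can be quantized quickly.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(5000, 128)).astype(np.float32)  # stand-in for real SIFT data

K = 50  # the paper uses roughly 16,000 clusters; kept tiny for the demo
kmeans = MiniBatchKMeans(n_clusters=K, random_state=0, n_init=3).fit(descriptors)
vocabulary = kmeans.cluster_centers_            # (K, 128) visual words

tree = cKDTree(vocabulary)                      # fast nearest-word lookup
new_desc = rng.normal(size=(10, 128)).astype(np.float32)
_, word_ids = tree.query(new_desc)              # visual-word id for each descriptor
print(word_ids)
```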

    Fourth, each frame is represented by the visual words obtained in Step 3. The clusters determined by K-means are what we call visual words, so the intra-cluster variation is analogous to the variation absorbed by a stem in text retrieval. Each frame is then represented as a roughly 16,000-dimensional vector of the frequencies with which each visual word occurs.
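
    A minimal sketch of this bag-of-visual-words representation, again with a toy random vocabulary and made-up sizes:

```python
# Build the bag-of-visual-words vector for one frame: quantize each of the
# frame's descriptors to its nearest visual word and count how often each
# word occurs.  Vocabulary and descriptors are random stand-ins.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
vocab_size = 50
vocabulary = rng.normal(size=(vocab_size, 128)).astype(np.float32)
tree = cKDTree(vocabulary)

def frame_to_bow(frame_descriptors):
    """Return a (vocab_size,) vector of visual-word frequencies for one frame."""
    _, word_ids = tree.query(frame_descriptors)
    return np.bincount(word_ids, minlength=vocab_size).astype(np.float32)

frame_desc = rng.normal(size=(300, 128)).astype(np.float32)   # one frame's descriptors
bow = frame_to_bow(frame_desc)
print(bow.shape, int(bow.sum()))   # (50,) 300
```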

    Fifth, the tf-idf weight of each visual word is computed, and words that occur too frequently are removed. In other words, the most frequent visual words are treated as “stop words”.
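
    The sketch below shows one way to apply this weighting to a frame-by-word count matrix. The 95% frequency cut-off for the stop list is an arbitrary choice for the demo, not the paper's exact threshold.

```python
# tf-idf weighting of the frame-by-word frequency matrix, plus a stop list
# built from the most frequent visual words.  The matrix here is random.
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(1.0, size=(100, 50)).astype(np.float32)   # frames x visual words

# stop list: drop the most frequent words across the whole database
totals = counts.sum(axis=0)
keep = totals <= np.quantile(totals, 0.95)
counts = counts[:, keep]

# tf: word count normalised by the number of words in the frame
tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-9)

# idf: log(#frames / #frames containing the word)
df = np.maximum((counts > 0).sum(axis=0), 1)
idf = np.log(counts.shape[0] / df)

tfidf = tf * idf                                               # weighted frame vectors
print(tfidf.shape)
```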

    In the retrieval system, frames are ranked by the cosine similarity between the L2-normalized tf-idf vectors of the query and of each frame. After this first ranking, the candidate list is re-ranked by spatial consistency voting.
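
    A last sketch of the first-stage ranking: once the tf-idf vectors are L2-normalized, a dot product is exactly the cosine similarity. The spatial-consistency re-ranking stage is omitted here, and the vectors are random placeholders.

```python
# Ranking sketch: L2-normalize the tf-idf vectors so a dot product is the
# cosine similarity, then sort frames by similarity to the query vector.
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), 1e-9)

rng = np.random.default_rng(3)
database = l2_normalize(rng.random((100, 50)).astype(np.float32))   # tf-idf frame vectors
query = l2_normalize(rng.random(50).astype(np.float32))             # tf-idf query vector

scores = database @ query                 # cosine similarity with each frame
ranking = np.argsort(-scores)             # best-matching frames first
print(ranking[:5], scores[ranking[:5]])
```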





Comments:
        1. The similarity is computed with an L2-normalized (cosine) distance. Even though the “stop words” have been removed, I still think there could be a better way to define the distance, because each visual word might have a different importance.
        2. Is the data set large enough? I doubt it. The method is only tested on three movies, which makes me wonder whether the results are a special case.
