Note:
This paper addresses
object search in video frames. It introduces the concept of "visual words",
now a well-known technique, and transforms the problem of object
retrieval into a text-retrieval problem, which has been studied for a long time. In
other words, it applies text-retrieval techniques to object retrieval, and the
results look promising.
The following briefly describes how the proposed system works.
Text retrieval relies on
several techniques, such as stemming, tf-idf weighting, inverted indexing,
similarity comparison, and ranking algorithms. The "visual word" is the analogue
that lets these techniques carry over to video.
First, the system finds invariant
descriptors in each frame. Following prior work, it detects
two kinds of interest regions, Shape Adapted (SA) and Maximally Stable (MS), which
are both elliptical regions of interest. Each region is then described
by a 128-dimensional SIFT descriptor.
Second, to check whether
a region is a stable object or just noise, regions are tracked across frames
using a simple velocity dynamical model. Any region that does not survive for at least 3
frames is rejected.
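A toy sketch of that stability filter (the names and data layout here are my own, not the paper's; the paper tracks regions with a dynamical model, while this just checks track length):

```python
# Hypothetical sketch: reject regions whose tracks span fewer than 3 frames.
def filter_unstable(tracks, min_frames=3):
    """tracks maps a region id to its list of (frame, x, y) observations."""
    return {rid: obs for rid, obs in tracks.items() if len(obs) >= min_frames}

tracks = {
    "a": [(0, 10, 10), (1, 11, 10), (2, 12, 11)],  # survives 3 frames -> kept
    "b": [(0, 50, 40)],                            # seen once -> treated as noise
}
stable = filter_unstable(tracks)
print(sorted(stable))  # prints ['a']
```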
Third, and most
important of all, the visual-word vocabulary is built. The descriptors are simply
vector-quantized, and the resulting clusters are the "visual words". In the paper,
the quantization is done with K-means clustering, and a k-d tree data structure is
used to reduce the search time during assignment. After the
quantization, there are around 16,000 clusters. (In the implementation, for better accuracy,
the Mahalanobis distance is used in K-means.)
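A minimal sketch of the vector-quantization step, using plain Lloyd's K-means with Euclidean distance on toy 2-D points; the paper uses the Mahalanobis distance on 128-D SIFT descriptors with ~16,000 clusters, and a k-d tree instead of the brute-force nearest-centre search below. The deterministic initialization is purely for the demo:

```python
import numpy as np

def kmeans(X, centers, iters=10):
    centers = centers.copy()
    for _ in range(iters):
        # Assign every descriptor to its nearest centre; each centre index
        # is a "visual word". (A k-d tree would speed up this lookup.)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centre as the mean of its assigned descriptors.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Three well-separated toy blobs stand in for SIFT descriptors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.1, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])
centers, labels = kmeans(X, X[[0, 50, 100]])  # one seed point per blob
# Every descriptor in a blob should map to the same visual word.
print(len(set(labels[:50].tolist())))  # prints 1
```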
Fourth, each frame is represented
by the visual words clustered in Step 3. A cluster produced by
K-means is treated as a visual word, so the
intra-cluster variation is analogous to the variation of a word stem
in text retrieval. Each frame is
then represented as a 16,000-dimensional vector of the frequencies with which each visual
word occurs.
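The frame representation is just a bag-of-visual-words histogram; a small sketch (8 words here stand in for the ~16,000 clusters):

```python
import numpy as np

def frame_histogram(word_ids, vocab_size):
    # Count how many times each visual word occurs in the frame.
    return np.bincount(word_ids, minlength=vocab_size)

h = frame_histogram(np.array([0, 2, 2, 5]), vocab_size=8)
print(h.tolist())  # prints [1, 0, 2, 0, 0, 1, 0, 0]
```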
Fifth, the tf-idf weight of
each visual word is calculated, and words that occur too
frequently are removed. That is, words that occur too many times are treated as "stop words".
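The standard tf-idf weighting can be sketched as follows, with each frame stored as a plain dict of word counts (the function and variable names are mine): the weight of word i in frame d is (n_id / n_d) * log(N / n_i), where n_id is the count of word i in the frame, n_d the total words in the frame, n_i the number of frames containing word i, and N the number of frames.

```python
import math

def tfidf_vectors(docs, vocab_size):
    """docs: one {word_id: count} dict per frame. Returns tf-idf vectors."""
    N = len(docs)
    # n_i: number of frames in which word i occurs at least once.
    n_i = [sum(1 for d in docs if d.get(i, 0) > 0) for i in range(vocab_size)]
    vecs = []
    for d in docs:
        n_d = sum(d.values())  # total visual words in this frame
        vecs.append([(d.get(i, 0) / n_d) * math.log(N / n_i[i]) if n_i[i] else 0.0
                     for i in range(vocab_size)])
    return vecs

docs = [{0: 2, 1: 1}, {1: 3}]
v = tfidf_vectors(docs, vocab_size=3)
# Word 1 occurs in every frame, so its idf (log 2/2) is zero; word 0 is distinctive.
print(v[0][0] > 0.0, v[0][1] == 0.0)  # prints True True
```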
In the retrieval system, the ranking is determined
by the cosine similarity of the frames' L2-normalized representative vectors. After
this first ranking, the candidate list is re-ranked by spatial consistency voting.
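The first-stage ranking can be sketched like this (the spatial-consistency re-ranking, which votes on matched regions with consistent neighbourhoods, is omitted; function names are mine):

```python
import numpy as np

def rank_frames(query, frames):
    """query: a tf-idf vector; frames: one tf-idf vector per row.
    Returns frame indices sorted by cosine similarity, best first."""
    q = query / np.linalg.norm(query)
    F = frames / np.linalg.norm(frames, axis=1, keepdims=True)
    scores = F @ q  # cosine similarity of unit vectors is a dot product
    return np.argsort(-scores)

query = np.array([1.0, 0.0, 1.0])
frames = np.array([[0.0, 1.0, 0.0],   # shares no words with the query
                   [1.0, 0.0, 0.9],   # very similar
                   [1.0, 1.0, 1.0]])  # partially similar
print(rank_frames(query, frames).tolist())  # prints [1, 2, 0]
```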
Comments:
1. The similarity is computed as an L2 (cosine) distance between the tf-idf vectors. Even though the "stop words" are removed, I still think there could be a better way to measure the distance, because different visual words may carry different importance.
2. Is the data set it uses large enough? I doubt it. It tests only 3 movies, which makes me wonder whether the results are a special case.