Wednesday, June 12, 2013

[ammai] week 15 Mairal et al. Online dictionary learning for sparse coding. ICML 2009

Paper:
Mairal et al. Online dictionary learning for sparse coding. ICML 2009
Sparse coding is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on learning the basis set, also called the dictionary, to adapt it to specific data, and the approach is shown to work well in image processing domains.


The following is the algorithm.
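Since the algorithm figure is not reproduced here, below is a minimal sketch of the online procedure as I understand it from the paper: alternate a sparse-coding step (Lasso) with a block-coordinate-descent dictionary update using the accumulated statistics A and B. The use of scikit-learn's Lasso solver and all parameter values are my own illustrative choices, not the authors' code.

import numpy as np
from sklearn.linear_model import Lasso

def online_dictionary_learning(X, n_atoms=100, lam=0.1, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape                         # m: signal dimension, n: number of samples
    D = rng.standard_normal((m, n_atoms))
    D /= np.linalg.norm(D, axis=0)         # unit-norm atoms (the constraint set)
    A = np.zeros((n_atoms, n_atoms))       # accumulates alpha alpha^T
    B = np.zeros((m, n_atoms))             # accumulates x alpha^T
    # note: scikit-learn scales the quadratic term by 1/(2m), so this alpha
    # is only roughly comparable to the paper's lambda
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=1000)

    for t in range(n_iter):
        x = X[:, rng.integers(n)]          # draw one sample at random
        alpha = lasso.fit(D, x).coef_      # sparse-coding step
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        # dictionary update: block coordinate descent on the quadratic surrogate
        for j in range(n_atoms):
            if A[j, j] < 1e-12:
                continue
            u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
            D[:, j] = u / max(np.linalg.norm(u), 1.0)   # project back to unit ball
    return D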
In conclusion, this paper contributes:
1. It casts the dictionary learning problem as the optimization of a smooth non-convex objective function over a convex constraint set.
2. The online iterative algorithm the authors propose solves this problem by efficiently minimizing, at each step, a quadratic surrogate of the empirical cost over the set of constraints.
3. The algorithm is faster than previous approaches to dictionary learning on both small and large datasets of natural images, which means it is scalable.

[ammai] week14 Hive - A Warehousing Solution Over a Map-Reduce Framework

Paper:
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy: Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB, 2009.


To deal with large amounts of data/images, Hadoop provides a map-reduce framework that helps. However, the map-reduce programming model is low level and is hard to maintain and develop with.

In this paper, the authors propose a higher-level data warehousing solution called Hive. It supports queries expressed in a SQL-like language, HiveQL, which are compiled into map-reduce jobs executed on Hadoop.
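As a rough illustration (not Hive's actual implementation), the sketch below shows how a simple HiveQL aggregation could be expressed as a map phase and a reduce phase of the kind Hive generates jobs for; the table and column names are hypothetical.

from itertools import groupby
from operator import itemgetter

# conceptually: SELECT status, COUNT(1) FROM status_updates GROUP BY status

def map_phase(rows):
    # emit (group key, 1) for every input row
    for row in rows:
        yield (row["status"], 1)

def reduce_phase(pairs):
    # pairs are grouped by key; sum the counts per key
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

rows = [{"status": "ok"}, {"status": "error"}, {"status": "ok"}]
print(list(reduce_phase(map_phase(rows))))   # [('error', 1), ('ok', 2)]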

    The following figure is the architecture of Hive.

      
    

In conclusion, there are several components:
1. Hive provides external interfaces for both the user command line and a web UI.
2. The Hive Thrift Server exposes a simple client API to execute HiveQL statements. Thrift is a framework for cross-language services; a server written in one language can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python, etc.
3. The Metastore is the system catalog. All other components of Hive interact with the metastore.
4. The Driver manages the life cycle of a HiveQL statement during compilation, optimization, and execution.
5. The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates the statement into a plan which consists of a DAG of map-reduce jobs.
6. The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order.

The metastore's storage system should be optimized for online transactions with random accesses and updates. The metastore therefore uses a traditional relational database (like MySQL or Oracle) rather than HDFS.

[ammai] week 13 A global geometric framework for nonlinear dimensionality reduction

Paper:
J.B. Tenenbaum, "A global geometric framework for nonlinear dimensionality reduction," Science, 2000 (ISOMAP)

This work (ISOMAP) aims to reduce high-dimensional data to a low dimension. The main difference between ISOMAP and other reduction algorithms (e.g., PCA, MDS) is that it is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations.

The following are the steps of how it works:

It measures distances in two ways. For near neighbors it uses the Euclidean distance, while the distance to non-neighbors is the length of the shortest path in the neighborhood graph built in Step 1. After the distances are determined, ISOMAP applies MDS to find the meaningful dimensions.
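A minimal sketch of these steps, assuming a k-nearest-neighbor graph that is connected and classical MDS for the last step (parameter values are illustrative), might look like:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, n_components=2):
    # Step 1: neighborhood graph with Euclidean edge weights
    G = kneighbors_graph(X, n_neighbors, mode="distance")
    # Step 2: geodesic distances = shortest paths on the graph (assumes a connected graph)
    D = shortest_path(G, method="D", directed=False)
    # Step 3: classical MDS on the geodesic distance matrix
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))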

ISOMAP's global coordinates provide a simple way to analyze and manipulate high-dimensional observations in terms of their intrinsic nonlinear degrees of freedom.

[ammai] week 12 Nonlinear dimensionality reduction by locally linear embedding

Paper:
"Nonlinear dimensionality reduction by locally linear embedding" Roweis & Saul, Science, 2000.

   
Locally linear embedding (also known as LLE) is a clever scheme for finding low-dimensional global coordinates when the data lie on a manifold embedded in a high-dimensional space. The trick is to do a different linear dimensionality reduction at each point (because locally a manifold looks linear) and then combine these with minimal discrepancy.
The LLE procedure has three steps: it builds a neighborhood for each point in the data; finds the weights for linearly approximating the data in that neighborhood; and finally finds the low-dimensional coordinates best reconstructed by those weights. These low-dimensional coordinates are then returned.
The following figure shows what I mentioned before.
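In addition to the figure, a minimal NumPy sketch of the three steps might look like the following; the small regularization on the local Gram matrix is a common practical choice, not necessarily what the authors used, and the parameter values are illustrative.

import numpy as np

def lle(X, n_neighbors=10, n_components=2, reg=1e-3):
    n = X.shape[0]
    # Step 1: k nearest neighbors for each point (excluding the point itself)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = np.argsort(dist, axis=1)[:, 1:n_neighbors + 1]

    # Step 2: weights that best reconstruct each point from its neighbors
    W = np.zeros((n, n))
    for i in range(n):
        Z = X[neighbors[i]] - X[i]                      # shift neighbors to the origin
        C = Z @ Z.T                                     # local Gram matrix
        C += reg * np.trace(C) * np.eye(n_neighbors)    # regularize for numerical stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, neighbors[i]] = w / w.sum()                # weights sum to one

    # Step 3: bottom eigenvectors of M = (I - W)^T (I - W), skipping the trivial one
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, 1:n_components + 1]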

The experiments are shown as follows:

One important application is image retrieval. In image retrieval, the dimension of the features describing an image might be large, with a nonlinear distribution. If we apply LLE to the features, we can find the correlations and reduce the dimension while preserving the meaning.

[ammai] week 11 A Survey on Transfer Learning

Paper:
S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE TKDE, 2010.

Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned; it is now widely applied in machine learning. The need for transfer learning may arise when data easily become outdated. In this case, the labeled data obtained in one time period may not follow the same distribution in a later time period.





[ammai] week9/10 Semi-Supervised Hashing for Scalable Image Retrieval

Paper:
    J. Wang et al, "Semi-Supervised Hashing for Scalable Image Retrieval," CVPR, 2010.


     
SSH (Semi-Supervised Hashing) is based on two ideas:
(1) map images with the same label into nearby hash buckets;
(2) make each hash bucket balanced.


Assume that M is the set of image pairs that share the same label, and C is the set of image pairs with different labels. The following objective function addresses the first criterion: for pairs with the same label, maximize the probability that they fall into nearby buckets; for pairs with different labels, minimize that probability.
Besides, they add constraints so that each bucket is balanced and the hash functions are orthogonal.



Because this work learns a projection matrix W for hashing, J becomes...
The authors then relax the hard constraint into a soft one like...

, which maximizes the variance of the mapped space.
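If I recall the paper's relaxed solution correctly, the projection directions can be obtained from the top eigenvectors of an "adjusted covariance" that combines the labeled pairwise term and the variance term. A rough sketch (variable names and the value of eta are illustrative) follows.

import numpy as np

def ssh_projections(X, X_l, S, n_bits=32, eta=1.0):
    # X:   (d, n) all data, column-wise, assumed zero-centered
    # X_l: (d, l) labeled data; S: (l, l) with +1 for same-label pairs,
    #             -1 for different-label pairs, 0 otherwise
    M = X_l @ S @ X_l.T + eta * (X @ X.T)          # adjusted covariance
    eigvals, eigvecs = np.linalg.eigh(M)
    W = eigvecs[:, np.argsort(eigvals)[::-1][:n_bits]]
    return W                                        # hash bits of x: sign(W.T @ x)

# usage: bits = np.sign(W.T @ x_query) for a zero-centered query vector x_query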
     
The results in the authors' experiments are impressive, as shown below....
    

SSH can maintain semantic consistency, unlike the LSH method, and it does not require as much time as learning-based methods such as RBM and SH.
    



[ammai] week8 Learning to Rank: From Pairwise Approach to Listwise Approach

Paper: 

Learning to rank: from pairwise approach to listwise approach,

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li, ICML 2007

This work proposes a method called the list-wise approach for learning to rank, which performs much better than the pairwise approaches proposed before.

Pairwise learning has several weaknesses.
First, the objective of learning is formalized as minimizing errors in the classification of document pairs, rather than minimizing errors in the ranking of documents.
Second, the training process is costly due to the pairwise computation over instances.
Third, the assumption that document pairs are drawn i.i.d. is too strong.
Last, the number of generated document pairs varies largely from query to query, which results in a model biased toward queries with more document pairs.


Proposed Method:


For each query q and a list of corresponding documents d0, d1, ..., dn, we can express the documents with feature vectors X0, X1, ..., Xn and their scores Y0, Y1, ..., Yn. Then we can combine the list of feature vectors and the list of scores to form an "instance" (X, Y). Combining instances (X, Y) from different queries, we obtain the training set.


Then, define a ranking function f(X); for each feature vector X, it outputs a score Z. Applying this function to the list of feature vectors gives a list of ranking scores Z0, Z1, ..., Zn. Given a training set, the goal is to find the ranking function that minimizes a list-wise loss function L(Y, Z).

The remaining work is to define the loss function and the optimization method.

They propose a probabilistic method to calculate the list-wise loss function. Specifically, they transform both the scores of the documents assigned by a ranking function and the explicit or implicit judgments of the documents given by humans into probability distributions. They can then use any metric between probability distributions as the loss function. Two models are considered for the transformation: one is referred to as permutation probability and the other as top-one probability.
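A minimal sketch of ListNet with top-one probabilities, a linear scoring function, and plain gradient descent might look like the following; the learning rate and epoch count are illustrative, and the original work uses a neural network rather than the purely linear model shown here.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def train_listnet(queries, n_features, lr=0.01, epochs=100):
    # queries: list of (X, y), where X is (n_docs, n_features) and y holds
    # the ground-truth relevance scores of the documents for that query
    w = np.zeros(n_features)
    for _ in range(epochs):
        for X, y in queries:
            p_y = softmax(y)               # target top-one probabilities
            p_z = softmax(X @ w)           # model top-one probabilities
            grad = X.T @ (p_z - p_y)       # gradient of the cross-entropy loss
            w -= lr * grad
    return w

# ranking unseen documents: order them by descending score X_new @ w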

The major contributions of this paper include:
(1) proposal of the list-wise approach,
(2) formulation of the list-wise loss function on the basis of probability models,
(3) development of the ListNet method (which uses a neural network as the model and gradient descent as the optimization algorithm),
(4) empirical verification of the effectiveness of the approach.

Tuesday, June 11, 2013

[ammai] week7 Support Vector Learning for Ordinal Regression

Given a set of samples S = {(xi, yi)}, where xi is the feature vector of a sample and yi is its label, most machine learning problems can be classified into one of three categories, depending on the property of the label:
  1. Classification: yi belongs to a finite set of discrete values, where the values are unordered.
  2. Regression: yi lies in a continuous metric space, where the value of yi carries relevancy information about the sample.
  3. Ordinal regression: yi belongs to a finite set of discrete values as in classification, but there exists an ordering relationship among the values as in regression. It can be interpreted as a ranking problem.
This paper gives a formal formulation and solution of the ordinal regression problem.

Given a sample set S = {(xi, yi)}, the regression problem aims to find the best hypothesis that minimizes the risk function R(h). The expectation of the loss function, E[L(·)], is used here as the risk function.

Since only the order matters, the loss function L(y1, y2, y1', y2') gives 1 when the order is incorrect and 0 otherwise. We can then redefine the sample set S' using pairs of the original variables: S' = {((x1, x2), sign(y1 - y2))}.


When using this expression, the output value is in the set {-1, 0, +1}, which transforms the ordinal regression problem into a classification problem.


On the other hand, one can also use a real-valued function to express the ordinal hypothesis. For each hypothesis h, we can use a function U such that


The U function and the theta thresholds can then be solved for by maximizing the margin on the training data. The procedure is the same as in the well-known Support Vector Machine, which requires solving a standard QP problem.



For any unseen data list, we calculate the pairwise feature differences and multiply by w. Then, we are able to get the pairwise relationships of the data.
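A minimal sketch of this pairwise reduction, using scikit-learn's LinearSVC as a stand-in for the QP solver (my own simplification, not the paper's exact formulation with thresholds), might look like:

import numpy as np
from sklearn.svm import LinearSVC

def fit_pairwise_ranker(X, y):
    # form difference vectors between samples with different ranks,
    # labeled by the sign of the rank difference
    diffs, labels = [], []
    n = len(y)
    for i in range(n):
        for j in range(n):
            if y[i] != y[j]:
                diffs.append(X[i] - X[j])
                labels.append(np.sign(y[i] - y[j]))
    svm = LinearSVC(fit_intercept=False)
    svm.fit(np.asarray(diffs), np.asarray(labels))
    return svm.coef_.ravel()                 # weight vector w

# ranking unseen data: sort by descending w @ x;
# the pairwise order of x1, x2 is given by sign(w @ (x1 - x2))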

Although the algorithm is simple and easy to understand, the pairwise computation is expensive. Thus, scalability is the major concern for future work.

Sunday, April 7, 2013

[ammai] week6 Latent Dirichlet allocation


Paper: “Latent Dirichlet allocation,” D. Blei, A. Ng, and M. Jordan, Journal of Machine Learning Research, 3:993–1022, January 2003. Known as LDA.

This work extends pLSA. pLSA has two serious disadvantages:
1. For a document outside the training set, it is ambiguous (or hard) to assign a probability.
2. The number of parameters grows linearly with the size of the corpus, which causes serious overfitting problems.
To deal with the problems above, this paper contributes the following solutions:
1. Providing a probabilistic model at the level of documents.
2. Considering a mixture model that captures the exchangeability of both words and documents, in order to fulfill the "bag-of-words" assumption.

So the model becomes:
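(The formula image is not reproduced here; from the LDA paper, the marginal likelihood of a document that this refers to should be the following.)

p(w | α, β) = ∫ p(θ | α) [ Π_{n=1..N} Σ_{z_n} p(z_n | θ) p(w_n | z_n, β) ] dθ

Here p(θ | α) is the Dirichlet prior over the document's topic mixture θ.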

The first term on the right-hand side, p(θ | α), models the document's topic distribution. LDA treats these mixture weights as a k-parameter random variable, which makes the number of parameters independent of the number of documents.

In the experimental part of this paper, it shows good performance. However, according to Wikipedia and "On an Equivalence between PLSI and LDA" (SIGIR 2003), the pLSA model is equivalent to the LDA model under a uniform Dirichlet prior distribution.

Thursday, March 21, 2013

[ammai] week5 Probabilistic latent semantic indexing

Paper: “Probabilistic latent semantic indexing,” T. Hofmann, SIGIR 1999. Known as pLSI/pLSA.


This paper tries to improve the SVD-based LSA by taking a probabilistic view, using a generative model and the tempered EM algorithm (TEM). To overcome the problems of synonymy and polysemy in the tf-idf method, LSA represents terms in a latent semantic space. But some information is still lost when facing polysemy. pLSA proposes a solid statistical foundation based on the likelihood principle and defines a generative model.

P(d) is the probability of document d.
P(w) is the probability of word w.
Assume z is the hidden topic.

 

For (2), the author assumes that the (d, w) pairs are generated independently and that words and documents are conditionally independent given the topic z, which means the words in a document are "generated" independently.
Then we set the objective function to be ....
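(The formula image is missing; the objective from the paper should be the log-likelihood below.)

L = Σ_d Σ_w n(d, w) log P(d, w),   with   P(d, w) = P(d) Σ_z P(w | z) P(z | d)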



where n(d, w) is the number of occurrences of the pair (d, w).
To solve the optimization, this work applies the EM algorithm.

E-Steps:


M-Steps:



Then we can describe the geometry of pLSA, shown in the picture below.


It makes sense that the P(w|zk) span the space of the P(w|d), so the P(w|zk) form a basis set. Thus, it uses a lower dimension than previous work does, and it can represent hidden topics.
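Putting the E-step and M-step together, a minimal NumPy sketch of the pLSA EM iterations (assuming a term-document count matrix n with K topics; initialization and iteration count are illustrative) might look like:

import numpy as np

def plsa(n, K=10, n_iter=50, seed=0):
    # n: (D, W) term-document count matrix, n[d, w] = n(d, w)
    rng = np.random.default_rng(seed)
    D, W = n.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)   # P(z|d)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(1, keepdims=True)   # P(w|z)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(w|z) P(z|d)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]          # shape (D, K, W)
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate P(w|z) and P(z|d) from expected counts n(d,w) P(z|d,w)
        exp_counts = n[:, None, :] * post
        p_w_z = exp_counts.sum(0)
        p_w_z /= p_w_z.sum(1, keepdims=True)
        p_z_d = exp_counts.sum(2)
        p_z_d /= p_z_d.sum(1, keepdims=True)
    return p_z_d, p_w_z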

Wednesday, March 13, 2013

[ammai] week4 Aggregating local descriptors into a compact image representation



Paper: “Aggregating local descriptors into a compact image representation,” Herve Jegou, et al., Proc. IEEE CVPR’10.

Note:

This paper proposes a system for very-large-scale image search, addressing accuracy, efficiency, and memory usage.
   
The two main contributions are:
1. A method to represent the image with excellent accuracy and reasonable vector dimensionality.
2. Optimization of the trade-off between the indexing algorithm and dimensionality reduction.

After generating the SIFT descriptors, instead of using millions of centroids to build the dictionary as in Bag of Words, only a few centroids (e.g., 256) are used to build the descriptor.

Then we still assign each feature to one of the centroids. After assignment, we don't count the number of features belonging to each centroid as in BoW; instead, we accumulate the differences between each centroid and the features assigned to it. So each image can be represented by a 128 × k-dimensional vector (k being the number of centroids) rather than a very high-dimensional vector.
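A minimal sketch of this aggregation step (assuming 128-dimensional SIFT descriptors and centroids from a small k-means codebook, e.g. k = 256) might look like:

import numpy as np

def vlad(descriptors, centroids):
    # descriptors: (n, 128) local SIFT descriptors of one image
    # centroids:   (k, 128) visual-word centroids
    k, d = centroids.shape
    v = np.zeros((k, d))
    assign = np.argmin(
        np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=-1),
        axis=1)
    for i, c in enumerate(assign):
        v[c] += descriptors[i] - centroids[c]       # accumulate signed residuals
    v = v.ravel()                                   # 128 * k dimensional vector
    return v / (np.linalg.norm(v) + 1e-12)          # L2 normalization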

After generating the vectors, the paper proposes a method to compress them. To achieve good accuracy after quantization, the number of clusters k should be large enough; however, that is intractable.

So this work uses a product quantization method. First, divide the feature vector into m equal parts. Then we need m quantizers, one per sub-vector. The good news is that now we only need ks = k^(1/m) centroids for each quantizer to reach the expected accuracy.
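A rough sketch of product quantization under these assumptions (sub-codebooks learned with scikit-learn's KMeans; m and ks are illustrative) might be:

import numpy as np
from sklearn.cluster import KMeans

def train_pq(X, m=8, ks=256, seed=0):
    # X: (n, d) database vectors, with d divisible by m
    sub_dim = X.shape[1] // m
    codebooks = []
    for i in range(m):
        sub = X[:, i * sub_dim:(i + 1) * sub_dim]
        codebooks.append(KMeans(n_clusters=ks, n_init=4, random_state=seed).fit(sub))
    return codebooks

def encode_pq(x, codebooks):
    # each vector is stored as m small centroid indices
    sub_dim = len(x) // len(codebooks)
    return [int(cb.predict(x[i * sub_dim:(i + 1) * sub_dim].reshape(1, -1))[0])
            for i, cb in enumerate(codebooks)]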

In addition to quantization, it is better to do dimensionality reduction. We might like to adopt the traditional PCA; however, two problems arise from PCA:

1. The variance of each dimension is different after PCA, so the first principal component becomes the bottleneck of the quantization error.
2. There is a trade-off between projection error and quantization error.

Comments:

1. The way it quantizes the features is by "product". I think it is an impressive way to do such things, but it's a little bit complicated.

2. Using the signed value of the difference rather than its absolute value is another surprising choice. It might be because the authors want to retain the vector's direction in each dimension.

[ammai] week3 Efficient visual search of videos cast as text retrieval

Paper: “Efficient visual search of videos cast as text retrieval,” J. Sivic, and A. Zisserman, IEEE TPAMI, 2009.

Note:


This paper is about object search in video frames. It introduces the concept of "visual words", which is a well-known technique now, and transforms the problem of object retrieval into text retrieval, which has been studied for a long time. In other words, it applies text retrieval techniques to the problem, and the outcomes look good.

The following briefly describes how the proposed system works.




In text retrieval there are several techniques, such as stemming, vector weighting (tf-idf), inverted indexing, similarity comparison, and ranking algorithms.

The "visual words" are built as follows.

First, it finds invariant descriptors in each frame. This paper follows other work in detecting two kinds of interest regions, Shape Adapted (SA) and Maximally Stable (MS), which are both elliptical regions of interest. These regions are then described by 128-dimensional SIFT descriptors.

Second, to check whether a region is a stable object or just noise, each region is tracked across frames using a velocity dynamical model. Regions that do not survive for at least 3 frames are discarded.

Third, and most important of all, the visual word vocabulary is built. It simply uses vector quantization here, and the result is the "visual words". In the paper, the visual words are obtained by K-means clustering, and a k-d tree data structure is used to reduce the search time during quantization. After quantization there are around 16,000 clusters. (In the implementation, for accuracy, the Mahalanobis distance is used in K-means.)

Fourth, each frame is represented by the visual words clustered in Step 3. The clusters determined by K-means in Step 3 are what we call visual words, so the intra-cluster variations are analogous to the variations of a stem in text retrieval. Each frame is then represented as a 16,000-dimensional vector of visual-word frequencies.

Fifth, the tf-idf weight of each visual word is calculated, and words that occur too frequently are removed. Namely, words that occur too many times are treated as "stop words".

In the retrieval system, the ranking is determined by the cosine similarity between the L2-normalized representative vectors of the frames. After the first ranking, the candidate list is re-ranked by spatial consistency voting, as shown below.
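As a rough end-to-end sketch of this pipeline (plain nearest-centroid assignment, tf-idf weighting, cosine ranking; the vocabulary size and variable names are simplifying assumptions, and the spatial re-ranking step is omitted):

import numpy as np

def frame_histogram(descriptors, centroids):
    # assign each local descriptor to its nearest visual word and count occurrences
    words = np.argmin(
        np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=-1),
        axis=1)
    return np.bincount(words, minlength=len(centroids)).astype(float)

def tfidf_matrix(histograms):
    # histograms: (n_frames, n_words) raw visual-word counts
    tf = histograms / (histograms.sum(axis=1, keepdims=True) + 1e-12)
    df = (histograms > 0).sum(axis=0)                   # document frequency per word
    idf = np.log(len(histograms) / (df + 1e-12))
    return tf * idf

def rank_frames(query_vec, frame_vecs):
    # cosine similarity between L2-normalized tf-idf vectors
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    F = frame_vecs / (np.linalg.norm(frame_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(F @ q))                         # best-matching frames first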





Comments:
1. The similarity is calculated with L2-normalized vectors. Although the "stop words" have been removed, I still think there is a better way to determine the distance, because each visual word might have a different importance.
2. Is the dataset it used large enough? I doubt it. It only tests on 3 movies, which makes me wonder whether the result is a special case.

Monday, March 11, 2013

First Article

This is the blog for the 2013 ammai Spring class taught by Winston Hsu.
It is mainly for writing down some thoughts.

Yeah!