Wednesday, June 12, 2013

[ammai] week 15 Mairal et al. Online dictionary learning for sparse coding. ICML 2009

Paper:
Mairal et al. Online dictionary learning for sparse coding. ICML 2009
 Sparse coding is widely used in machine learning, neuroscience, signal processing, and statistics. This paper focuses on learning the basis set, also called the dictionary, to adapt it to specific data, an approach that has proven to work well in image processing domains.


    The following is the algorithm.
     In conclusion, this paper contributes:
    1. It casts the dictionary learning problem as the optimization of a smooth non-convex objective function.
    2. The online iterative algorithm the authors propose solves this problem by efficiently minimizing, at each step, a quadratic surrogate of the empirical cost over the set of constraints.
    3. The algorithm is faster than previous approaches to dictionary learning on both small and large datasets of natural images, which shows that it is scalable.
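The algorithm above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the paper's implementation: it substitutes a simple ISTA loop for the LARS-lasso solver the authors use in the sparse coding step, and the function names are my own.

```python
import numpy as np

def sparse_code(D, x, lam=0.1, n_iter=50):
    # ISTA: a simple stand-in for the paper's LARS-lasso sparse coding step.
    alpha = np.zeros(D.shape[1])
    L = np.linalg.norm(D, 2) ** 2 + 1e-8          # Lipschitz constant of the gradient
    for _ in range(n_iter):
        g = D.T @ (D @ alpha - x)                 # gradient of the quadratic term
        alpha = alpha - g / L
        alpha = np.sign(alpha) * np.maximum(np.abs(alpha) - lam / L, 0.0)
    return alpha

def online_dictionary_learning(X, n_atoms=8, lam=0.1):
    d = X.shape[1]
    rng = np.random.default_rng(0)
    D = rng.standard_normal((d, n_atoms))
    D /= np.linalg.norm(D, axis=0)                # unit-norm atoms
    A = np.zeros((n_atoms, n_atoms))              # sufficient statistics: sum of alpha alpha^T
    B = np.zeros((d, n_atoms))                    # sufficient statistics: sum of x alpha^T
    for x in X:                                   # one sample at a time (online)
        alpha = sparse_code(D, x, lam)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        for j in range(n_atoms):                  # block coordinate descent on columns
            if A[j, j] > 1e-8:
                u = (B[:, j] - D @ A[:, j]) / A[j, j] + D[:, j]
                D[:, j] = u / max(np.linalg.norm(u), 1.0)  # project onto the constraint set
    return D
```

The dictionary update minimizes the quadratic surrogate built from the accumulated statistics A and B, which is what makes the per-sample cost low.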

[ammai] week14 Hive - A Warehousing Solution Over a Map-Reduce Framework

Paper:
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy: Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB, 2009.


      To deal with large amounts of data/images, Hadoop provides a map-reduce framework. But the map-reduce programming model is low level, and programs written directly in it are hard to develop and maintain.

      In this paper, the authors propose a higher-level data warehousing solution called Hive. It supports queries expressed in a SQL-like language, HiveQL, which are compiled into map-reduce jobs executed on Hadoop.

    The following figure is the architecture of Hive.

      
    

     In conclusion, there are several components:
1. Hive provides external interfaces for both the user command line and a web UI.
2. The Hive Thrift Server exposes a simple client API to execute HiveQL statements. Thrift is a framework for cross-language services; a server written in one language can also support clients in other languages. The Thrift Hive clients generated in different languages are used to build common drivers like JDBC (Java), ODBC (C++), and scripting drivers written in PHP, Perl, Python, etc.
3. The Metastore is the system catalog. All other components of Hive interact with the metastore.
4. The Driver manages the life cycle of a HiveQL statement during compilation, optimization, and execution.
5. The Compiler is invoked by the driver upon receiving a HiveQL statement. The compiler translates the statement into a plan consisting of a DAG of map-reduce jobs.
6. The driver submits the individual map-reduce jobs from the DAG to the Execution Engine in topological order.

    The metastore's storage system should be optimized for online transactions with random accesses and updates. The metastore therefore uses a traditional relational database (such as MySQL or Oracle) rather than HDFS.

[ammai] week 13 A global geometric framework for nonlinear dimensionality reduction

Paper:
J. B. Tenenbaum, V. de Silva, and J. C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, 2000 (ISOMAP)

This work (ISOMAP) aims to reduce high-dimensional data to a low dimension. The main difference between ISOMAP and other reduction algorithms (e.g., PCA, MDS) is that it is capable of discovering the nonlinear degrees of freedom that underlie complex natural observations.

The following are the steps of how it works:

It measures distance in two ways: for near neighbors it uses the Euclidean distance, while the distance to non-neighbors is the length of the shortest path through the neighborhood graph built in Step 1. After the distances are determined, ISOMAP applies MDS to find the meaningful dimensions.
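These steps can be sketched compactly in NumPy/SciPy. This is a minimal illustration with names of my own choosing, assuming a dense k-nearest-neighbor graph and classical MDS via eigendecomposition:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, n_components=2):
    # Step 1: build a k-nearest-neighbor graph weighted by Euclidean distance.
    n = len(X)
    D = cdist(X, X)
    G = np.full((n, n), np.inf)                   # inf marks a missing edge
    for i in range(n):
        idx = np.argsort(D[i])[:n_neighbors + 1]  # includes the point itself
        G[i, idx] = D[i, idx]
    # Step 2: geodesic distances = shortest paths through the graph.
    geo = shortest_path(G, method="D", directed=False)
    # Step 3: classical MDS on the geodesic distance matrix.
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * H @ (geo ** 2) @ H                 # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_components] # top eigenpairs
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0.0))
```

For example, points sampled along a helix in 3D unroll into a nearly one-dimensional embedding, since their geodesic distances follow the arc length rather than the straight-line distance.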

ISOMAP's global coordinates provide a simple way to analyze and manipulate high-dimensional observations in terms of their intrinsic nonlinear degrees of freedom.

[ammai] week 12 Nonlinear dimensionality reduction by locally linear embedding

Paper:
"Nonlinear dimensionality reduction by locally linear embedding" Roweis & Saul, Science, 2000.

   
    Locally linear embedding (also known as LLE) is a clever scheme for finding low-dimensional global coordinates when the data lie on a manifold embedded in a high-dimensional space. The trick is to do a different linear dimensionality reduction at each point (because locally a manifold looks linear) and then combine these with minimal discrepancy.
    The LLE procedure has three steps: it builds a neighborhood for each point in the data; finds the weights for linearly approximating the data in that neighborhood; and finally finds the low-dimensional coordinates best reconstructed by those weights. These low-dimensional coordinates are then returned.
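The three steps can be sketched as follows. This is a minimal NumPy version with illustrative names, not the authors' implementation:

```python
import numpy as np

def lle(X, n_neighbors=5, n_components=2, reg=1e-3):
    n = len(X)
    # Step 1: find each point's nearest neighbors.
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(D[i])[1:n_neighbors + 1]     # skip the point itself
        # Step 2: weights that best reconstruct X[i] from its neighbors.
        Z = X[idx] - X[i]
        C = Z @ Z.T                                   # local covariance
        C += reg * np.trace(C) * np.eye(n_neighbors)  # regularize for stability
        w = np.linalg.solve(C, np.ones(n_neighbors))
        W[i, idx] = w / w.sum()                       # weights sum to 1
    # Step 3: embedding = bottom eigenvectors of (I - W)^T (I - W).
    M = (np.eye(n) - W).T @ (np.eye(n) - W)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, 1:n_components + 1]                # drop the constant eigenvector
```

The eigenvector with the smallest eigenvalue is the constant vector (because each row of W sums to 1), so it is discarded and the next eigenvectors give the coordinates.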
The following figure shows what I mentioned before.

    The experiments are shown as follows:

    One important application is image retrieval. In image retrieval, the dimension of the features describing an image can be large, with a nonlinear distribution. If we apply LLE to the features, we can find the correlations and reduce the dimension while preserving meaning.

[ammai] week 11 A Survey on Transfer Learning

Paper:
S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE TKDE, 2010.

Transfer learning is the improvement of learning in a new task through the transfer of knowledge from a related task that has already been learned. It is now used in computer-based machine learning tasks.
The need for transfer learning may arise when data easily become outdated. In this case, the labeled data obtained in one time period may not follow the same distribution in a later time period.





[ammai] week9/10 Semi-Supervised Hashing for Scalable Image Retrieval

Paper:
    J. Wang et al, "Semi-Supervised Hashing for Scalable Image Retrieval," CVPR, 2010.


     
     SSH (Semi-Supervised Hashing) is based on two ideas:
     (1) assign images with similar labels to nearby hash buckets;
     (2) keep each hash bucket balanced.


     Assume that M is the set of image pairs that have the same label, and C is the set of image pairs that have different labels. The objective function addresses the first criterion: for pairs with the same label, maximize the probability that they fall into nearby buckets; for pairs with different labels, minimize that probability.
     Besides, they add a regularization constraint so that the bucket sizes stay balanced: the hash functions should be orthogonal.



      Because this work learns a projection matrix W for hashing, J becomes...
      The authors then relax the hard orthogonality constraint into a soft one like...,
      which maximizes the variance of the mapped space.
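Under this relaxation, W can be obtained from the top eigenvectors of an adjusted covariance matrix built from the labeled pairs and the data variance. A rough NumPy sketch, with my own simplified notation (the exact weighting of the fitting and variance terms follows the spirit, not the letter, of the paper):

```python
import numpy as np

def ssh_projections(X, similar_pairs, dissimilar_pairs, n_bits=8, eta=1.0):
    """Sketch of SSH: learn a projection matrix W from zero-centered data X (n, d).

    Relaxed objective: maximize tr(W^T (X^T S X + eta * X^T X) W) with W^T W = I,
    which is solved by taking the top eigenvectors.
    """
    n, d = X.shape
    S = np.zeros((n, n))
    for i, j in similar_pairs:
        S[i, j] = S[j, i] = 1.0          # same-label pairs attract
    for i, j in dissimilar_pairs:
        S[i, j] = S[j, i] = -1.0         # different-label pairs repel
    M = X.T @ S @ X + eta * (X.T @ X)    # fitting term + variance regularizer
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:n_bits]]   # top eigenvectors

def hash_codes(X, W):
    return (X @ W > 0).astype(int)       # sign of projections -> binary codes
```

Because the eigenvectors are orthonormal, the learned hash functions automatically satisfy the orthogonality that the balance regularizer asks for.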
     
     The results in the authors' experiments are excellent, as follows....
    

     SSH can maintain semantic consistency, unlike the LSH method, and does not require as much training time as learning-based methods like RBM and SH.
    



[ammai] week8 Learning to Rank: From Pairwise Approach to Listwise Approach

Paper: 

Learning to rank: from pairwise approach to listwise approach,

Zhe Cao, Tao Qin, Tie-Yan Liu, Ming-Feng Tsai, and Hang Li April 2007

In this work, the authors propose a list-wise approach to learning to rank, which performs much better than the previously proposed pairwise approach.

In pairwise learning, there are several weaknesses.
First, the objective of learning is formalized as minimizing errors in the classification of document pairs, rather than minimizing errors in the ranking of documents.
Second, the training process is costly due to the pairwise computation over instances.
Third, the assumption that document pairs are drawn i.i.d. is too strong.
Last, the number of generated document pairs varies largely from query to query, which results in a model biased toward queries with more document pairs.


Proposed Method:


For each query q and a list of corresponding documents d0, d1, ..., dn, we can express the documents with feature vectors X0, X1, ..., Xn and their scores Y0, Y1, ..., Yn. We can then combine the list of feature vectors and the list of scores to form an "instance" (X, Y). Combining the instances (X, Y) from different queries, we obtain the training set.


Then, define a ranking function f(X): for each feature vector X, it outputs a score Z. Applying this function to our list of feature vectors gives a list of ranking scores Z0, Z1, ..., Zn. Given a training set, the goal is to find the ranking function that minimizes a list-wise loss function L(Y, Z).

The remaining task is to choose the loss function and the optimization method.

They propose a probabilistic method to compute the list-wise loss function. Specifically, they transform both the scores assigned to the documents by a ranking function and the explicit or implicit human judgments of the documents into probability distributions. Any metric between probability distributions can then be used as the loss function. They consider two models for the transformation: one is referred to as the permutation probability and the other as the top-one probability.
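The top-one probability and the resulting cross-entropy loss are easy to write down. A small NumPy sketch with illustrative names:

```python
import numpy as np

def top_one_prob(scores):
    # Top-one probability: softmax over the scores of a single list.
    e = np.exp(scores - scores.max())    # subtract max for numerical stability
    return e / e.sum()

def listnet_loss(y, z):
    # Cross entropy between the judgment and model top-one distributions.
    p_y, p_z = top_one_prob(y), top_one_prob(z)
    return -np.sum(p_y * np.log(p_z))
```

The loss is smallest when the model's score distribution matches the judgments, which is what ListNet minimizes with gradient descent over a neural network's parameters.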

The major contributions of this paper include:
(1) proposal of the list-wise approach;
(2) formulation of the list-wise loss function on the basis of probability models;
(3) development of the ListNet method (which uses a neural network as the model and gradient descent as the optimization algorithm);
(4) empirical verification of the effectiveness of the approach.

Tuesday, June 11, 2013

[ammai] week7 Support Vector Learning for Ordinal Regression

Given a set of samples S = {(xi, yi)}, where xi is the feature vector of a sample and yi is the label, most machine learning problems can be classified into one of three categories, depending on the properties of the label:
  1. Classification: yi belongs to a finite set of discrete values, where the values are unordered.
  2. Regression: yi lies in a continuous metric space, where the value of yi carries relevancy information about the sample.
  3. Ordinal regression: yi belongs to a finite set of discrete values as in classification, but there exists an ordering relationship among the values as in regression. It can be interpreted as a ranking problem.
This paper gives a formal formulation and solution of the ordinal regression problem.

Given a sample set S = {(xi, yi)}, the regression problem aims to find the best hypothesis that minimizes the risk function R(h). The common expected loss E[L] is used here as the risk function.

Since only the order matters, the loss function L(y1, y2, y1', y2') gives 1 when the order is incorrect and 0 otherwise. We can then redefine the sample set S' using pairs of the original variables: S' = {((x1, x2), sign(y1 - y2))}.


When using this expression, the output value is in the set {-1, 0, +1}, which transforms the ordinal regression problem into a classification problem.


On the other hand, one can also use a real-valued function to express the ordinal hypothesis. For each hypothesis h, we can use a utility function U such that


The U function and the theta thresholds can then be solved for by maximizing the margin on the training data. The procedure is then the same as in the well-known Support Vector Machine, which requires solving a standard QP problem.



For any unseen data list, we calculate the pairwise feature differences and multiply by w. Then, we are able to get the pairwise relationships of the data.
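That prediction step can be written directly. A tiny sketch, where w is the learned weight vector (names are illustrative):

```python
import numpy as np

def pairwise_order(w, x1, x2):
    # Predict the ordering of two samples from the learned weight vector w:
    # +1 if x1 ranks above x2, -1 if below, 0 if tied.
    return int(np.sign(w @ (x1 - x2)))
```

Applying this to every pair of an unseen list recovers its full ranking, which also makes the quadratic cost of the pairwise computation explicit.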

Although the algorithm is simple and easy to understand, the pairwise computation is expensive. Thus, scalability is the major concern for future work.