Parallel Aggregation Queries over Star Schema: a Hierarchical Encoding Scheme and Efficient Percentile Computing as a Case

Xiongpai Qin 1,2, Huiju Wang 1,2, Xiaoyong Du 1,2, Shan Wang 1,2

1 School of Information, Renmin University of China, Beijing, 100872, P.R.China 2 Ministry of Education Key Lab of Data Engineering and Knowledge Engineering (RUC), Beijing, 100872, P.R.China

[email protected], [email protected], {duyong, swang}@ruc.edu.cn

Abstract—Big data analysis is a major challenge today. Cloud computing is attracting more and more big data analysis applications because of its good scalability and fault tolerance. Some aggregation functions, such as SUM, can be computed in parallel because they satisfy the distributive law of addition. Unfortunately, some statistical functions are not naturally parallelizable; that is, they do not satisfy the distributive law of addition. In this paper we focus on the percentile computing problem. We propose an iterative, prediction-based parallel algorithm for a distributed system, where the prediction is made with a sampling technique. Experimental results verify the efficiency of our algorithm.

Keywords: Hierarchical Encoding; Percentile; Iterative

I. INTRODUCTION

Big data analysis is a major challenge we have encountered recently [1]. For example, Wal-Mart sells more than 267 million products to customers worldwide through more than 6,000 stores every day [2]; the size of its data warehouse is 4 PB and is increasing rapidly. Cloud computing is widely used in the area of big data processing and analysis. Cloud computing uses the so-called Scale-Out method, which expands the system by adding more computing nodes and rewriting the software to achieve high parallelism. Compared with the Scale-Up method, which expands the system by adding or replacing CPUs, memory, or hard drives in a single node, the Scale-Out method is a more economical solution. Moreover, its high scalability and fault tolerance attract more and more applications.

Evaluating aggregation functions and some statistical measures is fundamental to data analysis. Typical aggregation functions include SUM, MIN, MAX, AVERAGE, and COUNT. They are naturally parallelizable, because these functions (operations) satisfy the distributive law of addition, or can be transformed into a set of operations that do. For example, MAX(∪ Mi) = MAX(∪ MAX(Mi)). That means, in a master-slave distributed system, each worker (slave) node can compute its local maximum individually and transfer it to the master node; the master node then takes the maximum of all local maxima, which is obviously the global maximum. As for AVERAGE, the master node asks the worker nodes for their SUMs and COUNTs, after which the global AVERAGE can be computed as AVERAGE_global = Σ(i=1..n) SUM(i) / Σ(i=1..n) COUNT(i), where n is the total number of worker nodes.
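
To make this two-phase pattern concrete, the following is a minimal Python sketch (an illustration only, with in-memory lists standing in for worker partitions):

    def worker_partials(partition):
        # each worker computes its local SUM, COUNT, and MAX
        return sum(partition), len(partition), max(partition)

    def master_combine(partials):
        # the master combines the per-worker partial results into global values
        total_sum = sum(s for s, _, _ in partials)
        total_count = sum(c for _, c, _ in partials)
        global_max = max(m for _, _, m in partials)
        return total_sum / total_count, global_max   # global AVERAGE and MAX

    partitions = [[1.0, 5.0, 3.0], [7.0, 2.0], [4.0, 6.0, 8.0]]
    print(master_combine([worker_partials(p) for p in partitions]))   # (4.5, 8.0)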

Unfortunately, some other important statistical measures, such as MEDIAN (PERCENTILE is a generalization of MEDIAN), cannot be evaluated in this way. It is therefore interesting to consider how to evaluate such statistical measures efficiently in a distributed computing system. This paper focuses on the evaluation of the PERCENTILE function in a distributed environment. The PERCENTILE function returns the value at a given percentage point of a data set. The common approach is to sort the whole data set first and then pick the PERCENTILE value, but in a distributed setting this introduces a very high cost of data transfer between nodes.

In this paper, we propose an iterative, prediction-based parallel algorithm for computing PERCENTILE over a large data set deployed on a distributed master-slave cluster. The prediction is based on a sampling technique, and the whole process is carried out iteratively. A sketch of the computing process is as follows: the worker nodes first provide a set of sample data to the master node; the master node predicts the percentile based on the samples and scatters the predicted value to every worker node; each worker node then prunes its local data depending on whether the prediction is an over-estimate or an under-estimate. The algorithm ends after several iterations. Although sample data has to be transferred between the master node and the worker nodes, this cost is negligible compared with the traditional sorting-based approach.

II. RELATED WORK

As mentioned earlier, it is easy to design parallel algorithms for aggregation functions that satisfy the distributive law of addition. Percentile computing, however, is not naturally parallelizable, so an efficient distributed parallel algorithm must be designed specifically for it. [6] studied MEDIAN computing in a cloud environment and proposed two algorithms, based on the MapReduce technique and the GridBatch technique respectively. The overall process is as follows: first, obtain the upper and lower bounds of each chunk through data preprocessing; then partition all the data into the chunks; finally, locate the chunk that contains the target median according to the number of data points in each chunk, and obtain the median value by sorting the data in that chunk. This algorithm performs a global sort and introduces considerable network transmission cost during the data partitioning step. In contrast, our proposal involves no global sort; local sorting and searching are sufficient.

This work is supported by MOE grant no.708004.


These local operations can be done in parallel. In addition, we use a data sampling technique to predict the target percentile value, which accelerates the convergence of the algorithm and significantly reduces network transmission cost.

[7] discussed a multi-way parallel algorithm for percentile computing, but it is designed for shared-memory systems, whereas our proposal targets a shared-nothing architecture. [8] proposed four parallel selection algorithms that are also applicable to percentile computing. [9] designed an adaptive parallel aggregation algorithm for different selectivities. Our proposal is much simpler yet efficient; the thinking behind it is the KISS philosophy (keep it simple and straightforward). The algorithm incurs little network transmission cost and can scale to a large cluster system.

Experiments on the percentile computing algorithm are conducted on a data set with a "flattened-out" star schema [3], which borrows some ideas from the universal relation [4]. The difference is that the "flattened-out" star schema does not put all dimension information into the fact table, but only the hierarchy information that most aggregation queries use. Moreover, this information is not copied directly but encoded. The "flattened-out" star schema is therefore more space-efficient than a universal relation. The BLINK [5] prototype proposed by IBM researchers pre-joins the dimension tables with the fact table to form a wide table; query processing becomes much simpler, table scanning is parallelized, and constant response time for aggregation is achieved. Due to the denormalization of the data, data redundancy is the major disadvantage of that scheme. The "flattened-out" star schema does not incur as much data redundancy as BLINK.

III. THE HIERARCHICAL ENCODING SCHEME

Star schema (snowflake schema is an extension of star schema) is the most widely used data organization method in data warehouses, and it relies heavily on star joins to answer aggregation queries. In a "flattened-out" star schema, the data is preprocessed during loading: the hierarchy information in the dimension tables is put into the fact table by encoding each hierarchy as a bit string. After preprocessing, the fact table is detached from the dimension tables and the schema is "flattened out". The join between dimension tables and the fact table is eliminated, so high performance can be expected when processing big data on a large cluster.

A. Preprocessing of the Star Schema Using a Hierarchical Encoding Scheme

Each dimension of a star schema has a hierarchy tree, and there is a pre-order path for each node in this tree. Each node on the path corresponds to an order number within its level of the hierarchy; we call this order number the local hierarchy code. At least ⌈log2 M⌉ bits are reserved in the binary representation of the code, where M is the cardinality of the domain of that hierarchy level.

A dimension hierarchy code is the concatenation of the local hierarchy codes of the nodes along the path from the root to the lowest-level node; we represent it as a bit string. The compound dimension code is the combination of the dimension codes of all dimensions of the data warehouse subject. During data loading, the star schema is preprocessed as follows: 1) each dimension table is encoded according to the dimension hierarchy code definition; 2) a compound dimension code is produced by interleaving the dimension hierarchy codes; 3) the foreign keys of the fact table are replaced with the compound dimension code.
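
The following is a minimal Python sketch of the per-dimension encoding (illustrative only; the level order numbers and domain cardinalities are assumed to be known, and the interleaving of several dimension codes into the compound code is omitted):

    import math

    def local_bits(cardinality):
        # at least ceil(log2 M) bits per hierarchy level
        return max(1, math.ceil(math.log2(cardinality)))

    def dimension_code(path, cardinalities):
        # path[i] is the order number (local hierarchy code) of the node at level i,
        # cardinalities[i] is the domain size M of level i
        code, width = 0, 0
        for order, card in zip(path, cardinalities):
            bits = local_bits(card)
            code = (code << bits) | order      # append the fixed-width local code
            width += bits
        return code, width

    # e.g. a 3-level time hierarchy: year index 3, quarter 2, month 7
    code, width = dimension_code([3, 2, 7], [16, 4, 12])
    print(bin(code), width)   # 0b11100111, 10 bits (leading zeros are not shown)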

B. Query Processing

After the preprocessing described above, the fact table contains all the dimension hierarchy information, which is sufficient for most data-warehouse-style aggregation queries (other queries can be processed with a query rewriting technique). Thus there is no need to join the dimension tables with the fact table when answering queries, which greatly speeds up query processing. Aggregation queries are rewritten to drop the join and use the compound dimension code instead; the transformations for different types of predicates and the query optimization techniques can be found in [3], to which interested readers are referred for details.

C. Summary of the Encoding Technique

1) Scalability: based on hierarchy encoding, the dimension hierarchy information is put into the fact table and the join between dimension tables and the fact table is avoided, which ensures high scalability of the system.

2) Performance: the data is preprocessed during loading, which simplifies subsequent query processing. This is a practical strategy because a data warehouse follows the usage pattern of "write once, query many times"; it is therefore worth paying some cost while loading the data, and the overhead is eventually paid back during later use of the data warehouse.

IV. PREDICTION BASED ITERATIVE PARALLEL ALGORITHM FOR PERCENTILE COMPUTING

Figure 1. Messages and Data Exchanges between the Master Node and Worker Nodes

The computing process of a percentile consists of the following stages (a sketch of the whole loop is given after the list):

1) The master node requests each worker node to provide a sample of its data so that it can predict the value of the percentile.

2) The prediction is sent to all worker nodes; every worker node searches for the value locally and reports the position of the prediction in its local sorted array of the target column, namely the number of data points that are less than the prediction and the value that is closest to the prediction.

3) After the master node has collected the reports from all worker nodes, it can determine whether the previous prediction is an over-estimate or an under-estimate, and it informs all worker nodes. Each worker node prunes its local data array accordingly.

4) After several iterations of the above steps, the number of remaining data points across all worker nodes falls below a preset threshold; the master node then requests each worker node to send back its remaining data points and computes the target percentile value by sorting them.

The details of each stage are presented in the following sections.
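
As an overall illustration of these stages, the following is a self-contained, single-process Python sketch. It is a simplified model, not the actual implementation: each worker is a locally sorted list, pruning narrows an index window per worker, and Num_expect is kept as a global rank instead of being adjusted after pruning (the adjustment variant is sketched in Section IV.B).

    import bisect
    import random

    def percentile_iterative(partitions, p, sample_ratio=0.0003, threshold=10000):
        workers = [sorted(part) for part in partitions]        # local sort only
        windows = [[0, len(w)] for w in workers]               # un-pruned range per worker
        num_expect = int(sum(len(w) for w in workers) * p)     # global rank of the P% value
        while sum(hi - lo for lo, hi in windows) > threshold:
            # stage 1: each worker sends a small sample of its remaining data
            sample = []
            for w, (lo, hi) in zip(workers, windows):
                if hi > lo:
                    k = max(1, int((hi - lo) * sample_ratio))
                    sample.extend(random.sample(w[lo:hi], k))
            sample.sort()
            # stage 2: predict the value whose rank within the remaining data
            # matches the target, by reading the proportional sample position
            remaining = sum(hi - lo for lo, hi in windows)
            rank_in_rest = num_expect - sum(lo for lo, _ in windows)
            prediction = sample[min(len(sample) - 1, len(sample) * rank_in_rest // remaining)]
            # stage 3: workers report how many local points lie below the prediction
            num_less = sum(bisect.bisect_left(w, prediction) for w in workers)
            if num_less == num_expect:
                return prediction
            # stage 4: under-estimate -> LEFT-pruning, over-estimate -> RIGHT-pruning
            for i, w in enumerate(workers):
                pos = bisect.bisect_left(w, prediction)
                if num_less < num_expect:
                    windows[i][0] = max(windows[i][0], pos)
                else:
                    windows[i][1] = min(windows[i][1], pos)
        # final stage: gather the small remainder, sort, and pick the target rank
        rest = sorted(x for w, (lo, hi) in zip(workers, windows) for x in w[lo:hi])
        return rest[num_expect - sum(lo for lo, _ in windows)]

    if __name__ == "__main__":
        random.seed(1)
        data = [[random.random() for _ in range(100000)] for _ in range(8)]
        print(percentile_iterative(data, 0.8))   # close to 0.8 for uniform data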

A. Data Sampling and Prediction

During the processing of queries that involve percentile computing, the master node dispatches sub-queries to the worker nodes, and each worker node reports its maximum value (Max_local), its minimum value (Min_local), and a sample of its local data to the master. The default sampling ratio is 0.03%; to avoid samples that are too small or too large, the sample size is confined to min(7000, max(2000, 0.03% * Num_global)). On a data set containing 20 million data points, a sample ratio of 0.03% can lead to a prediction error of up to 6% [10]. In the algorithm, each worker node cuts away some data points after each iteration, and the next iteration samples only the remaining data; since the sampling ratio is constant, the prediction becomes more and more accurate. In this work, uniform sampling is used.
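
The sample-size bound can be written as a one-line helper (illustrative):

    def sample_size(num_global, ratio=0.0003):
        # clamp 0.03% of the (remaining) data to the range [2000, 7000]
        return int(min(7000, max(2000, ratio * num_global)))

    print(sample_size(20_000_000))   # 6000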

After all worker samples have arrived, the master node can compute the global maximum value (Max_global), the global minimum value (Min_global), and the total number of data points in the data set (Num_global). The master node then merges and sorts the samples and predicts the value of the target percentile. The naive prediction method is as follows: suppose there are n worker nodes and the total number of data points is Σ(i=1..n) Num_local(i); then Num_expect = (Σ(i=1..n) Num_local(i)) * P% is the global position of the expected P% percentile. In the sorted sample array, the position Num_sample * P% holds the value that can be used as the prediction (Num_sample is the total number of data points in the sample collected by the master node).
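
A minimal sketch of this naive prediction (illustrative), which simply reads position Num_sample * P% in the merged, sorted sample:

    def naive_predict(samples, p):
        # samples: list of per-worker sample lists; p: target percentile in [0, 1]
        merged = sorted(v for s in samples for v in s)
        return merged[min(len(merged) - 1, int(len(merged) * p))]

    print(naive_predict([[1, 9, 4], [2, 8], [7, 3, 6, 5]], 0.5))   # 5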

The naive algorithm above has one major shortcoming: as the number of worker nodes increases, the memory needed on the master node to hold the sample data grows correspondingly. To reduce this memory overhead, we adopt a histogram-based prediction algorithm. First, the master node asks each worker node to report its maximum value, minimum value, and number of data points. After all worker nodes have reported, the master node builds an empty histogram from this information and then requests data samples from all worker nodes. As soon as a sample from any worker node reaches the master node, it is used to refine the histogram and is then thrown away, so the memory overhead on the master node is kept small. In the experiments we use the so-called Equi-Sum(V, F) histogram, namely the equi-height histogram [11], which gives better predictions than an equi-width histogram and is also rather simple. In an equi-height histogram the frequencies of all buckets are the same, while the widths (V_high - V_low, where V_high is the upper bound of a bucket and V_low is its lower bound) differ. The number of buckets is set to 1024. For how to make a prediction using an equi-height histogram, readers can refer to the references.
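
The following is a simplified Python sketch of equi-height-histogram prediction (illustrative assumptions: the bucket boundaries are taken as the quantiles of one sample batch, and the target is linearly interpolated within its bucket; the streaming refinement of the histogram from arriving samples is not reproduced here):

    import random

    def equi_height_buckets(sample, num_buckets=1024):
        s = sorted(sample)
        # boundaries at the sample quantiles -> equal frequency in every bucket
        return [s[min(len(s) - 1, i * len(s) // num_buckets)] for i in range(num_buckets + 1)]

    def predict_from_histogram(bounds, p):
        # every bucket holds the same fraction of the data, so the target rank p
        # falls into bucket floor(p * B); interpolate linearly inside that bucket
        b = min(len(bounds) - 2, int(p * (len(bounds) - 1)))
        frac = p * (len(bounds) - 1) - b
        return bounds[b] + frac * (bounds[b + 1] - bounds[b])

    sample = [random.random() for _ in range(5000)]
    bounds = equi_height_buckets(sample, 64)
    print(predict_from_histogram(bounds, 0.8))   # roughly 0.8 for uniform data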

B. Pruning of Local Data and the Next Round of Iteration

The master node sends the predicted value of the P% percentile to all worker nodes. The worker nodes work in parallel to find the position of the prediction in their local data sets; binary search or another search method can be used. Each worker node then reports a pair <Num_less, Value_local> to the master, where Num_less denotes the number of local data points that are less than the prediction value, and Value_local denotes the minimum local value that is greater than or equal to the prediction value.
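
A minimal sketch of the worker-side report (illustrative), using binary search over the locally sorted column:

    import bisect

    def locate(sorted_local, prediction):
        # returns (Num_less, Value_local) as described above
        pos = bisect.bisect_left(sorted_local, prediction)
        value_local = sorted_local[pos] if pos < len(sorted_local) else None
        return pos, value_local

    print(locate([1, 3, 5, 7, 9], 6))   # (3, 7)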

After the master node receives the reported pairs from all worker nodes, there are three cases to deal with: if Num_total = Σ(i=1..n) Num_less(i) < Num_expect, the prediction is an under-estimate; if Num_total = Σ(i=1..n) Num_less(i) > Num_expect, the prediction is an over-estimate; and if Num_total = Σ(i=1..n) Num_less(i) = Num_expect, the prediction already points to the target percentile value and the algorithm can end. In practice we expect most predictions to be either under-estimates or over-estimates.

Figure 2. Pruning of Over-estimate and Under-estimate

In the case of an under-estimate, the master node tells all worker nodes to perform LEFT-pruning, i.e., on every worker node the data points that are less than the prediction value are cut away. When the previous prediction is an over-estimate, every worker node is told to perform RIGHT-pruning. After LEFT-pruning or RIGHT-pruning, the algorithm begins the next round of iteration. It is worth mentioning that after data pruning, Num_expect on the master node must be adjusted accordingly: the number of data points pruned away below the prediction is subtracted from Num_expect.

Subsequent sampling is based on the remaining data, which is much smaller in scale, so the prediction accuracy is expected to improve and the convergence of the algorithm is accelerated.
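
A minimal sketch of the pruning step with the Num_expect adjustment (illustrative; only LEFT-pruning removes points below the target, so only its count is subtracted):

    import bisect

    def prune(sorted_local, prediction, under_estimate):
        pos = bisect.bisect_left(sorted_local, prediction)
        if under_estimate:                      # LEFT-pruning: drop values < prediction
            return sorted_local[pos:], pos      # keep the right part, report what was cut
        return sorted_local[:pos], 0            # RIGHT-pruning: nothing below the target is cut

    local, cut = prune([1, 3, 5, 7, 9], 6, under_estimate=True)
    num_expect = 10 - cut                       # e.g. a previous Num_expect of 10 becomes 7
    print(local, num_expect)                    # [7, 9] 7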

During the iterations of the algorithm, we hope that under-estimates and over-estimates occur alternately, so that the algorithm converges to the final result quickly. Because real data can have diverse distribution characteristics, the unwanted situation of always over-estimating or always under-estimating may occur, which leads to an inefficient, one-sided approach toward the target percentile.

We improve the algorithm to make it more adaptive. The strategy is as follows: if two under-estimates or two over-estimates occur in succession, the algorithm adjusts the prediction parameters so that the next prediction is expected to be an estimate in the opposite direction. To avoid over-adjusting, the step grows by a factor of 2^p (p = 1, 2, 3, 4, ...) in subsequent predictions; the initial step is Step = MAX(1, S/128), where S is the frequency of a bucket.

Figure 3. Prediction Adjustment after Two Successive Under-estimates

For example, as shown in Figure 3, suppose that the frequency of every bucket is S (S1, S2, S3) and the bucket boundaries in iteration 1 are D1,0, D1,1, D1,2, and so on. When D1,i < Num_expect < D1,i+1, the prediction falls in the bucket [D1,i, D1,i+1]. Since two under-estimates have occurred in succession, in the following prediction the predicted position is shifted rightward by 2^p times the initial step.
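
A small illustrative helper for this adjustment (how p maps to the number of successive same-direction misses is an assumption of this sketch):

    def adjust(predicted_rank, direction, same_direction_count, bucket_freq):
        step = max(1, bucket_freq // 128)            # initial step: MAX(1, S/128)
        if same_direction_count < 2:
            return predicted_rank                    # no adjustment needed yet
        factor = 2 ** (same_direction_count - 1)     # 2, 4, 8, ... on each further miss
        # two under-estimates in a row -> shift right (direction = +1), and vice versa
        return predicted_rank + direction * factor * step

    print(adjust(500, +1, 3, 1024))   # 500 + 4 * 8 = 532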

C. Sorting of the Remaining Data and Searching for the Final Result

When providing a sample, every worker node also reports the lower bound (Index_low) and upper bound (Index_high) of its local un-pruned data to the master node. When the master node learns from the Index_high and Index_low values of all worker nodes that the amount of remaining data is below a preset threshold, namely Σ(i=1..n) (Index_high(i) - Index_low(i)) < Num_threshold, it stops iterating, requests all worker nodes to send back their un-pruned data, and sorts that data to find the final answer.

D. Summary of the Percentile Computing Algorithm

The algorithm for percentile computing proposed here has the following properties:

1) The algorithm is based on a "flattened-out" star schema, so during query processing there is no need to exchange messages between worker nodes (no joins between the fact table and dimension tables).

2) The algorithm uses a sample of the data to predict the target percentile value. Through prediction and pruning, the remaining data set becomes smaller and smaller, the accuracy of the prediction increases, and the algorithm converges quickly.

3) Making predictions with a histogram keeps the memory overhead low; in the meantime, the improved version of the algorithm avoids a one-sided approach toward the target percentile: under-estimates and over-estimates alternately cut away futile data and squeeze the search space down to a manageable range.

V. EXPERIMENTS

A. Simulation System, Data Set, and Workload

The percentile computing algorithm is verified by simulation; some ideas are borrowed from [12] and [13]. The simulation system models the processing power of each node, the network, and the working logic of the algorithm. In these experiments we use only synthetic data sets (in the near future we will test the algorithm on real-life data). The schema includes two dimension tables (a time dimension and a customer dimension) and one fact table (revenue); each dimension table has a three-level hierarchy, each hierarchy level is encoded with 8 bits, and each dimension table has two other attributes. The fact table has three measures, each occupying 8 bytes (a double-precision number). After encoding, each tuple of the fact table (the data layout is row-wise) is 30 bytes: 6 bytes for the compound hierarchical code and 24 bytes for the three measures. The data set contains 1 billion tuples and is about 33 GB in size. The workload contains one query that finds the revenue of 80% of customers grouped by city.

In the following experiments, the convergence of the algorithm refers to the number of un-pruned tuples remaining after each iteration.

B. Data Size and Convergence of the Algorithm & Impact of P%

The first experiment studies the relationship between data size and algorithm convergence by finding the 80% percentile at the most detailed granularity. Figure 4 shows the number of iterations and the corresponding convergence (number of remaining tuples). The initial number of tuples is set to 0.1 billion, 0.5 billion, and 1 billion; the target column is uniformly distributed, and the cluster comprises 9 nodes (one master node and eight worker nodes). For the 1-billion-tuple initial data set, only 5 iterations are needed to converge (the remaining tuples on the worker nodes are then sent to the master node for the final computation of the percentile).

Figure 4. Data Size and Algorithm Convergence

In this group of experiments, we also study the impact of P% on algorithm convergence. P% is set to 20%, 50%, and 80%, with the other parameters unchanged. The outcome is shown in Figure 5: in all cases the algorithm converges after 5 iterations, regardless of P%. The result shows that the algorithm adapts well to different values of P%.

Figure 5. Impact of P%

C. Sample Ratio and Algorithm Convergence

Figure 6. Sample Ratio vs. Algorithm Convergence

Another experiment examines how the sample ratio influences algorithm convergence. The initial data set contains 1 billion uniformly distributed tuples, and 8 worker nodes are used. Figure 6 shows the convergence of the algorithm for different sample ratios (0.01%, 0.02%, 0.03%, 0.04%). A lower sample ratio leads to higher prediction error and slower convergence; interestingly, the convergence trends for the 0.03% and 0.04% sample ratios are almost identical. To reduce the network transmission overhead, we therefore choose 0.03% as the default sample ratio. Although sample ratios of 0.01% and 0.02% converge more slowly, only 2-3 additional iterations are needed for the algorithm to converge. When memory on the master node is scarce or concurrency is high, a lower sample ratio can be used to save memory on the master node.

D. Data Skew and Algorithm Convergence

Uniform sampling cannot capture data skew efficiently (in future work we will try other sampling techniques); the purpose of this experiment is to show how the algorithm converges under different degrees of data skew. The initial data set contains 1 billion tuples, 8 worker nodes are used, and the sample ratio is set to 0.03%. A Zipfian [14] distribution is used to model data skew, with the Z parameter set to 0.1, 1, and 2 to vary the degree of skew. From the result shown in Figure 7, we see that when the data is highly skewed the algorithm converges slowly during the first several iterations; after 6 iterations the convergence accelerates, and 3 more iterations are sufficient to reach the final merging and searching step.

Figure 7. Data Skew and Algorithm Convergence

E. Group Number and Algorithm Performance

To study the influence of the number of groups on the performance of the algorithm, we vary the group number from 10 to 1280 (10, 20, 40, 80, 160, 320, 640, 1280) and run a series of experiments. The data scale and other parameters are the same as in sub-section C. When the group number is below 320, there are no obvious differences in convergence speed; as the group number grows larger, the algorithm converges noticeably more slowly, and the higher the group number, the slower the convergence. The reason is that prediction accuracy decreases as the group number grows, because a fixed-size sample is divided among more groups. Improving the algorithm in this respect is future work; in the meantime, most aggregation queries in data warehouse workloads produce result sets with fewer than 100 groups, so the algorithm can still improve the performance of most queries.

F. Prediction-based Percentile Computing vs. MapReduce-based Percentile Computing

The final experiment shows the running time and space overhead of the algorithm; the MapReduce-based percentile computing algorithm [6] is used as the counterpart for comparison. The data size is 1 billion tuples, the data is uniformly distributed, 8 worker nodes are used, and P% is set to 50%. The running time of our algorithm is 57 seconds, much less than the 130 seconds of the MapReduce-based algorithm. The network transmission overhead of our algorithm is 16.8 MB, whereas the MapReduce-based algorithm incurs a much larger overhead of 730 MB. The reason is that the network transmission of our algorithm consists only of the messages and data samples exchanged between the master node and the worker nodes, while the MapReduce-based algorithm essentially performs a partial sort of all the data by distributing every data point into preset chunks, which transfers a large volume of data during the shuffle phase of MapReduce. The authors of [6] also proposed a GridBatch-based algorithm that outperforms the MapReduce-based one by 30%; even compared with the GridBatch-based algorithm, the algorithm in this paper is expected to be superior.

Figure 8. Running Time and Space Overhead of the Algorithm (Prediction based vs. MapReduce based)

VI. DISCUSSION

(1) The experimental results show that our percentile computing algorithm is more efficient than the MapReduce-style algorithm and its variant proposed in [6], but this does not mean that we reject what the MapReduce computing paradigm does well, such as node management and fault tolerance. We are working on a deep integration of our data encoding [3] and query processing techniques with Hadoop (an open-source MapReduce implementation). Although this work is somewhat similar to HadoopDB [15][16], which is ongoing at Yale University, our integration differs from HadoopDB in many aspects, such as data storage, index techniques, and query optimization. (2) The "flattened-out" star schema is in fact a semi-denormalization technique. It is not true that normalization is always good and denormalization is always bad; it depends. Faced with big data, system scalability is the top concern, so the storage redundancy incurred by denormalization is tolerable, not to mention that compression techniques can be used to reduce the storage overhead [3].

VII. CONCLUSION AND FUTURE WORK

Percentile computing is not naturally distributable; this paper presents an iterative parallel algorithm for it. The algorithm predicts the target percentile value using a small sample of data from the worker nodes and converges quickly. Owing to these properties, the algorithm is highly scalable and incurs only an affordable network transmission overhead.

There is still room to improve the algorithm. Our work in the near future includes: (1) improving the performance of the algorithm when the group number is large; (2) testing the algorithm on real data sets, identifying its bottlenecks, and removing them; (3) replacing uniform sampling and the equi-height histogram with more accurate sampling techniques and histograms, such as wavelet-based histograms, to further improve performance; (4) studying more precisely how data skew affects the convergence speed, which is also on the to-do list of our follow-up work.

REFERENCES

[1] Science Staff. Dealing with Data: Challenges and Opportunities. Science, 2011, 331(6018): 692-693.
[2] Randal E. Bryant. Data-Intensive Supercomputing: The Case for DISC. Tech Report CMU-CS-07-128, School of Computer Science, Carnegie Mellon University, 2007.
[3] Huiju Wang, Xiongpai Qin, Yansong Zhang, Shan Wang, Zhanwei Wang. LinearDB: A Relational Approach to Make Data Warehouse Scale like MapReduce. Proceedings of DASFAA 2011, pp. 306-320.
[4] Tzy-Hey Chang, Edward Sciore. A Universal Relation Data Model with Semantic Abstractions. IEEE Transactions on Knowledge and Data Engineering, 1992, 4(1): 23-33.
[5] Vijayshankar Raman, Garret Swart, Lin Qiao, Frederick Reiss, Vijay Dialani, Donald Kossmann, Inderpal Narang, Richard Sidle. Constant-Time Query Processing. Proceedings of ICDE 2008, pp. 60-69.
[6] Huan Liu, D. Orban. Computing Median Values in a Cloud Environment using GridBatch and MapReduce. IEEE International Conference on Cluster Computing (CLUSTER) 2010, pp. 1-10.
[7] B. R. Iyer, G. R. Ricard, P. J. Varman. Percentile Finding Algorithm for Multiple Sorted Runs. Proceedings of VLDB 1989, pp. 135-144.
[8] Akihiro Fujiwara, H. Katsuki, Michiko Inoue, Toshimitsu Masuzawa. Parallel Selection Algorithms with Analysis on Clusters. ISPAN 1999, pp. 388-393.
[9] Ambuj Shatdal, Jeffrey F. Naughton. Adaptive Parallel Aggregation Algorithms. Proceedings of the 1995 ACM SIGMOD, pp. 104-114.
[10] Surajit Chaudhuri, Rajeev Motwani. Random Sampling for Histogram Construction: How Much Is Enough? Proceedings of the 1998 ACM SIGMOD, pp. 436-447.
[11] Nicolas Bruno, Surajit Chaudhuri, Luis Gravano. STHoles: A Multidimensional Workload-Aware Histogram. Proceedings of the 2001 ACM SIGMOD, pp. 211-222.
[12] Suhel Hammoud, Maozhen Li, Yang Liu, Nasullah Khalid Alham, Zelong Liu. MRSim: A Discrete Event Based MapReduce Simulator. Seventh International Conference on Fuzzy Systems and Knowledge Discovery (FSKD) 2010, pp. 2993-2997.
[13] Guanying Wang, Ali Raza Butt, Prashant Pandey, Karan Gupta. A Simulation Approach to Evaluating Design Decisions in MapReduce Setups. 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) 2009, pp. 1-11.
[14] Bryce Cutt, Ramon Lawrence. Using Intrinsic Data Skew to Improve Hash Join Performance. Information Systems, 2009, 34(6): 493-510.
[15] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Proceedings of VLDB 2009, pp. 922-933.
[16] Azza Abouzied, Kamil Bajda-Pawlikowski, Jiewen Huang, Daniel J. Abadi, Avi Silberschatz. HadoopDB in Action: Building Real World Applications. Proceedings of SIGMOD 2010, pp. 1111-1114.
