Pushing Predicates into Recursive SQL Common Table Expressions

Marta Burzanska, Krzysztof Stencel, and Piotr Wisniewski

Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Torun, Poland

{quintria,stencel,pikonrad}@mat.umk.pl

Abstract. A recursive SQL-1999 query consists of a recursive CTE (Common Table Expression) and a query which uses it. If such a recursive query is used in the context of a selection predicate, this predicate can possibly be pushed into the CTE, thus limiting the breadth and/or depth of the recursive search. This can happen, e.g., after the definition of a view containing a recursive query has been expanded in place. In this paper we propose a method of pushing predicates and other query operators into a CTE. This allows executing the query with smaller temporary data structures, since query operators external to the CTE can be computed on the fly together with the CTE. Our method is inspired by deforestation (a.k.a. program fusion), which has been successfully applied in functional programming languages.

1 Introduction

Query execution and optimisation is a well-elaborated topic. However, the optimisation of recursive queries introduced by SQL-1999 is not yet advanced. A number of techniques are known in the general setting (e.g. magic sets [1]), but they are not applied to SQL-1999. Since recursive query processing is very time-consuming, new execution and optimisation methods for such queries are needed.

It seems promising to push selection predicates from the context in which a recursive query is used into the query itself (in fact, into its CTE). The method of predicate move-around [2] is very interesting. It allows pushing and pulling predicates to places where their execution promises the biggest gain in terms of query running time. However, this method applies to non-recursive queries only. Recursive queries are much more complex, since predicates external to them must be applied to all nodes reached during the execution, but not necessarily to all visited nodes. It could be useful to push such predicates into the initial step or into the recursive step. However, we cannot do this straightforwardly, since a predicate that holds for the result does not need to hold for all nodes visited on the path to the result. In this paper we propose a method of pushing predicates into CTEs that is subtle enough not to change the semantics of the query.

Together with pushing predicates, our method also tries to push other operators into the recursive CTE, so that as much of the computation as possible is performed on the fly together with the recursive processing. This reduces the space needed for temporary data structures and the time needed to store and retrieve data from them. This part of our optimisation method is inspired by the deforestation developed for functional languages [3]. This method is also known as program fusion, because the basic idea behind it is to fuse together two functions of which one consumes an intermediate structure generated by the other. This algorithm has been successfully implemented in the Glasgow Haskell Compiler (GHC [4]) and proved to be very effective. It has to be mentioned, however, that GHC is not equipped with the original deforestation technique. The algorithm of [3], although showing great potential, was still too complicated and did not cover all possible intermediate structures. This is why many papers on enhancements of deforestation have been prepared. The most universal, and at the same time the simplest, is known as short-cut fusion, cheap deforestation or the foldr-build rule [5,6]. Unfortunately, it is not suitable for dealing with recursive functions. The problem of deforesting recursive functions has been addressed in [7].

J. Grundspenkis, T. Morzy, and G. Vossen (Eds.): ADBIS 2009, LNCS 5739, pp. 194–205, 2009. © Springer-Verlag Berlin Heidelberg 2009

There has been work done on how to translate operators of an object query language into their foldr equivalents. Although most of these works have dealt only with OQL operators, they are successful in showing that OQL can be efficiently optimised with short-cut deforestation [8]. Still, the issue of optimising recursive queries remains open. One of the works in this field is [9]. It presents three optimization techniques, i.e. deleting duplicates, early evaluation of row selection conditions and defining an enhanced index.

This paper is organized as follows. In Section 2 we show an example which illustrates the possible gains of our method. In Section 3 we explain some small utility optimisation steps used by our method. Section 4 explains and justifies the main optimisation step of pushing selection predicates into a CTE. Section 5 shows the measured gain of our optimisation method together with the original query execution plan and the plan after optimisation. We show plans and measurements for IBM DB2. Section 6 concludes.

2 Motivating Example

Let us consider a database table Emp that consists of the attributes (EID ⊂ Z, ENAME ⊂ String, MGR ⊂ Z, SALARY ⊂ R). The column eid is the primary key, while mgr is a foreign key which references eid. The column mgr stores data on the managers of individual employees. Top managers have NULL in this column. We also define a recursive view which shows the transitive subordinate-manager relationship, i.e. it returns pairs of eids such that the first component of the pair is a subordinate, while the second is his/her manager. Since 1999 one can formulate this query in standard SQL:

CREATE VIEW subordinates (seid, meid) AS
WITH subs(seid, meid) AS (
    SELECT e.eid AS seid, e.eid AS meid
    FROM Emp e
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid
    FROM Emp e3, subs s
    WHERE e3.mgr = s.seid )
SELECT * FROM subs;

This view can then be used to find the total salary of all subordinate employees of, say, Smith:

SELECT SUM(e2.salary)
FROM subordinates s2
     JOIN Emp e2 ON (e2.eid = s2.seid)
     JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

A naive execution of such a query consists in materializing the whole transitive subordinate relationship. However, we need only the small fraction of this relationship which concerns Smith and her subordinates.

In order to avoid materializing the whole view, we start from a standard technique of query modification. We expand the view definition in-line:
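The view and the query above can be tried on a toy instance. The following is a minimal sketch that uses Python's built-in sqlite3 module instead of the DB2 system measured later in the paper. The six sample employees are invented for illustration, and the CTE is written with SQLite's WITH RECURSIVE spelling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp (
    eid    INTEGER PRIMARY KEY,
    ename  TEXT,
    mgr    INTEGER REFERENCES Emp(eid),
    salary REAL
);
-- Hypothetical sample data: Jones is the only top manager.
INSERT INTO Emp VALUES
    (1, 'Jones', NULL, 100),
    (2, 'Smith', 1,     80),
    (3, 'Adams', 2,     50),
    (4, 'Baker', 2,     40),
    (5, 'Clark', 3,     30),
    (6, 'Davis', 1,     60);

CREATE VIEW subordinates (seid, meid) AS
WITH RECURSIVE subs(seid, meid) AS (
    SELECT e.eid AS seid, e.eid AS meid FROM Emp e
    UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid
    FROM Emp e3, subs s
    WHERE e3.mgr = s.seid
)
SELECT * FROM subs;
""")

# Total salary of Smith and everybody below her in the hierarchy.
total = conn.execute("""
    SELECT SUM(e2.salary)
    FROM subordinates s2
         JOIN Emp e2 ON (e2.eid = s2.seid)
         JOIN Emp e1 ON (e1.eid = s2.meid)
    WHERE e1.ename = 'Smith'
""").fetchone()[0]

# Size of the whole relationship versus the part that concerns Smith.
view_rows = conn.execute("SELECT COUNT(*) FROM subordinates").fetchone()[0]
smith_rows = conn.execute("""
    SELECT COUNT(*)
    FROM subordinates s2 JOIN Emp e1 ON (e1.eid = s2.meid)
    WHERE e1.ename = 'Smith'
""").fetchone()[0]

print(total, view_rows, smith_rows)
```

On this sample the view holds 15 pairs, of which only 4 have Smith as the manager, which is the waste the rest of the paper sets out to avoid.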

WITH subs(seid, meid) AS (
    SELECT e.eid AS seid, e.eid AS meid
    FROM Emp e
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid
    FROM Emp e3, subs s
    WHERE e3.mgr = s.seid )
SELECT SUM(e2.salary)
FROM subs s2
     JOIN Emp e2 ON (e2.eid = s2.seid)
     JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

The execution of this query can be significantly improved if we manage to push the predicate e1.ename = 'Smith' into the first part of the CTE. In this paper we show a general method of identifying and optimising queries which allow such a push.

After this first improvement it is possible to get rid of the join with e1 and to push the join with e2, as well as the retrieval of the salary, into the CTE. After all these changes we get the following form of our query:

WITH subs(seid, meid, salary) AS (
    SELECT e.eid AS seid, e.eid AS meid, e.salary
    FROM Emp e
    WHERE e.ename = 'Smith'
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid, e3.salary
    FROM Emp e3, subs s
    WHERE e3.mgr = s.seid )
SELECT SUM(s2.salary)
FROM subs s2;

The result of the predicate push and the query fusion is satisfactory. Now we traverse only Smith's hierarchy. Further optimisation is not possible by rewriting an SQL query into another SQL query (SQL-1999 severely limits the form of recursive CTEs).
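The effect of the rewriting can be checked on a small instance. The sketch below assumes invented sample data and SQLite (via Python's sqlite3) in place of DB2. It runs the expanded query and the rewritten query side by side and also counts how many rows each CTE produces.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT,
                  mgr INTEGER REFERENCES Emp(eid), salary REAL);
-- Hypothetical sample data.
INSERT INTO Emp VALUES
    (1, 'Jones', NULL, 100), (2, 'Smith', 1, 80), (3, 'Adams', 2, 50),
    (4, 'Baker', 2, 40), (5, 'Clark', 3, 30), (6, 'Davis', 1, 60);
""")

# CTE of the expanded view: starts from every employee.
UNPUSHED_CTE = """
WITH RECURSIVE subs(seid, meid) AS (
    SELECT e.eid, e.eid FROM Emp e
    UNION ALL
    SELECT e3.eid, s.meid FROM Emp e3, subs s WHERE e3.mgr = s.seid
)
"""

# CTE after pushing the predicate and the salary column: starts from Smith.
PUSHED_CTE = """
WITH RECURSIVE subs(seid, meid, salary) AS (
    SELECT e.eid, e.eid, e.salary FROM Emp e WHERE e.ename = 'Smith'
    UNION ALL
    SELECT e3.eid, s.meid, e3.salary FROM Emp e3, subs s WHERE e3.mgr = s.seid
)
"""

def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

original_sum = scalar(UNPUSHED_CTE + """
    SELECT SUM(e2.salary)
    FROM subs s2
         JOIN Emp e2 ON (e2.eid = s2.seid)
         JOIN Emp e1 ON (e1.eid = s2.meid)
    WHERE e1.ename = 'Smith'""")
pushed_sum = scalar(PUSHED_CTE + "SELECT SUM(s2.salary) FROM subs s2")

original_cte_rows = scalar(UNPUSHED_CTE + "SELECT COUNT(*) FROM subs")
pushed_cte_rows = scalar(PUSHED_CTE + "SELECT COUNT(*) FROM subs")

print(original_sum, pushed_sum, original_cte_rows, pushed_cte_rows)
```

Both queries return the same sum, while the rewritten CTE produces only the rows of Smith's subtree.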

However, we need to accumulate neither eids nor salaries. We just need one temporary structure, i.e. a numeric register to sum the salaries on the fly as we traverse the hierarchy. This is the most robust plan (traverse the hierarchy and accumulate salaries). This is a simple application of deforestation and can be done by a DBMS at the level of query execution plans, even if it is not expressible in SQL-1999.
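The idea of summing on the fly can be imitated outside the DBMS. The following sketch is only an illustration of the principle, not the execution plan of any real system: a hypothetical helper `sum_subtree_salaries` walks the hierarchy and keeps a single running total. Besides the register it still needs a traversal frontier, and it assumes the hierarchy is acyclic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT,
                  mgr INTEGER REFERENCES Emp(eid), salary REAL);
-- Hypothetical sample data.
INSERT INTO Emp VALUES
    (1, 'Jones', NULL, 100), (2, 'Smith', 1, 80), (3, 'Adams', 2, 50),
    (4, 'Baker', 2, 40), (5, 'Clark', 3, 30), (6, 'Davis', 1, 60);
""")

def sum_subtree_salaries(conn, root_name):
    """Sum the salaries of root_name and of all direct and indirect subordinates.

    No (seid, meid, salary) rows are stored: each visited employee only
    updates the running total. The hierarchy is assumed to be acyclic.
    """
    total = 0.0
    # Initial step: the employees selected by the pushed predicate.
    frontier = conn.execute(
        "SELECT eid, salary FROM Emp WHERE ename = ?", (root_name,)
    ).fetchall()
    while frontier:
        eid, salary = frontier.pop()
        total += salary
        # Recursive step: direct subordinates of the employee just visited.
        frontier.extend(conn.execute(
            "SELECT eid, salary FROM Emp WHERE mgr = ?", (eid,)
        ).fetchall())
    return total

smith_total = sum_subtree_salaries(conn, "Smith")
print(smith_total)
```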

3 Utility Optimisations

The first step that should be done after expanding the view definition is purely syntactic. We add alias names for tables lacking them, and we change aliases that are assigned more than once, so that all tables have different aliases. This is done by a simple replacement of alias names (α-conversion).

The second technique is the elimination of vain joins. This technique is usually applied after some other query transformation. When, in one of the parts of the CTE or in the main part of the query, a table is joined by its primary key to a foreign key of another table, but is not used anywhere besides the joining condition, it may be deleted. This is done by removing this table from the FROM clause and at the same time removing the join condition in which it is used. There is one subtle issue. A row whose foreign key is NULL cannot be matched by the join, so the join with the removed table also plays the role of the selection predicate IS NOT NULL. Thus, if the foreign key is not constrained to be NOT NULL, the selection predicate that the foreign key IS NOT NULL must be added. If the schema declares the foreign key to be NOT NULL, this condition is redundant and is not added.
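The NULL subtlety of vain-join elimination can be seen on a small instance. This is a sketch with invented data, run on SQLite through Python's sqlite3; it assumes referential integrity, i.e. every non-NULL mgr value refers to an existing eid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT,
                  mgr INTEGER REFERENCES Emp(eid), salary REAL);
-- Hypothetical sample data; Jones has no manager (mgr IS NULL).
INSERT INTO Emp VALUES
    (1, 'Jones', NULL, 100), (2, 'Smith', 1, 80), (3, 'Adams', 2, 50),
    (4, 'Baker', 2, 40), (5, 'Clark', 3, 30), (6, 'Davis', 1, 60);
""")

def eids(sql):
    """Return the first column of the result as a list."""
    return [row[0] for row in conn.execute(sql)]

# The table m is used only in the join condition: a vain join.
with_join = eids(
    "SELECT e.eid FROM Emp e, Emp m WHERE m.eid = e.mgr ORDER BY e.eid")

# Dropping the join without any compensation changes the result.
join_dropped = eids("SELECT e.eid FROM Emp e ORDER BY e.eid")

# Dropping the join and adding IS NOT NULL preserves the result.
rewritten = eids(
    "SELECT e.eid FROM Emp e WHERE e.mgr IS NOT NULL ORDER BY e.eid")

print(with_join, join_dropped, rewritten)
```

The uncompensated version additionally returns the top manager, whose foreign key is NULL; the version with IS NOT NULL agrees with the original join.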

Another simple conversion is self-join elimination when the join is one-to-one (primary key to primary key). When encountering such a self-join, we choose one of the two aliases used in this join and substitute every occurrence of the other alias (besides its definition and the joining condition) by the chosen one. When this is done, we can delete the self-joining condition and the redundant occurrence of the doubled table from the FROM clause. This technique is illustrated by the following example. Starting from the query:

WITH subs(seid, meid, salary) AS (
    SELECT e.eid AS seid, e.eid AS meid, e2.salary AS salary
    FROM Emp e, Emp e2
    WHERE e.eid = e2.eid
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid, e2.salary AS salary
    FROM Emp e3, subs s, Emp e2
    WHERE (e3.mgr = s.seid) AND e3.eid = e2.eid )
SELECT SUM(s2.salary)
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

Using self-join elimination we obtain the query:

WITH subs(seid, meid, salary) AS (
    SELECT e.eid AS seid, e.eid AS meid, e.salary AS salary
    FROM Emp e
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid, e2.salary AS salary
    FROM Emp e3, subs s, Emp e2
    WHERE (e3.mgr = s.seid) AND e3.eid = e2.eid )
SELECT SUM(s2.salary)
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

Self-join elimination can be applied to both parts of the CTE definition and to the main part of the query. In the example above it was applied to the first part of the CTE.
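That the elimination preserves the result can be checked mechanically. The sketch below uses invented data and SQLite via Python's sqlite3. It compares a query with primary-key self-joins in both parts of the CTE against a version in which the self-join has been eliminated from both parts, one step further than the example in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT,
                  mgr INTEGER REFERENCES Emp(eid), salary REAL);
-- Hypothetical sample data.
INSERT INTO Emp VALUES
    (1, 'Jones', NULL, 100), (2, 'Smith', 1, 80), (3, 'Adams', 2, 50),
    (4, 'Baker', 2, 40), (5, 'Clark', 3, 30), (6, 'Davis', 1, 60);
""")

OUTER = """
SELECT SUM(s2.salary)
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith'
"""

# Both parts of the CTE join Emp to itself on the primary key.
WITH_SELF_JOINS = """
WITH RECURSIVE subs(seid, meid, salary) AS (
    SELECT e.eid, e.eid, e2.salary
    FROM Emp e, Emp e2
    WHERE e.eid = e2.eid
    UNION ALL
    SELECT e3.eid, s.meid, e2.salary
    FROM Emp e3, subs s, Emp e2
    WHERE e3.mgr = s.seid AND e3.eid = e2.eid
)
""" + OUTER

# e2 replaced by e in the first part and by e3 in the recursive part.
SELF_JOINS_ELIMINATED = """
WITH RECURSIVE subs(seid, meid, salary) AS (
    SELECT e.eid, e.eid, e.salary
    FROM Emp e
    UNION ALL
    SELECT e3.eid, s.meid, e3.salary
    FROM Emp e3, subs s
    WHERE e3.mgr = s.seid
)
""" + OUTER

before = conn.execute(WITH_SELF_JOINS).fetchone()[0]
after = conn.execute(SELF_JOINS_ELIMINATED).fetchone()[0]
print(before, after)
```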

4 Predicate Push into CTE

In this section we describe the main part of our technique, i.e. how to find predicates which can be pushed into a CTE and how to rewrite the query to push the selected predicates into the CTE.

In subsequent steps we analyse each table joined to the result of a CTE. Such a table may simply be used in the query surrounding the CTE, or may turn out to be joined with the CTE after, e.g., the expansion of a view definition (as in the example from Section 2). In the following paragraphs we will call such a table “marked for analysis”.

Let us assume that we have marked for analysis a table that does not appear in any predicate besides the join conditions. If this table is joined to the CTE using its primary key, we can mark it for pushing into the CTE. This table's alias may appear in three parts of the query surrounding the CTE: in the SELECT clause pointing to specific columns, in the condition joining it with the CTE, or in the condition joining it with some other table. Let us analyse those cases.

The first case is the simplest: we just need to push the columns into both SELECT statements inside the CTE. To do so, we follow a short procedure. After copying the table declaration into both inner FROM clauses, we copy the column references into both inner SELECT clauses, assigning new alias names to those columns. We then expand the CTE's header with the new column aliases. Finally, in the outer SELECT clause we replace the marked table's alias with the outer alias of the CTE.

The second case is when the marked table's alias occurs in the condition joining the marked table with the CTE. The first step is to copy the joining condition into the first part of the CTE. While doing this, we need to translate the CTE's column used for joining into its equivalent within the first part. Let us assume that the joining column from the CTE was named cte_alias.Col1. In the first SELECT clause of the CTE we have: alias1.some_column AS Col1. Having this information, we substitute the column name cte_alias.Col1 with alias1.some_column. We proceed analogously when copying the join condition into the recursive part of the CTE.

The third case, when the marked alias occurs within a join condition that does not involve the CTE's alias, is very similar to the case of copying column names from the SELECT clause. First, we push the columns of the marked table used in that condition into the CTE (according to the procedure described above). Second, we replace those columns' names by the corresponding CTE columns.

All three cases are illustrated by the following example. Consider the query:

WITH subs(seid, meid) AS (
    SELECT e.eid AS seid, e.eid AS meid
    FROM Emp e
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid
    FROM Emp e3, subs s
    WHERE e3.mgr = s.seid )
SELECT e2.salary, d1.name
FROM subs s2 JOIN Emp e2 ON (e2.eid = s2.seid)
             JOIN Emp e1 ON (e1.eid = s2.meid)
             JOIN Dept d1 ON (e2.dept = d1.did)
WHERE e1.ename = 'Smith';

The table to be analysed is Emp e2. This table is used in two join conditions (with the CTE and with the Dept table) and once in the SELECT clause. Therefore we copy the table name into both FROM clauses of the CTE definition, and we copy the join condition with the CTE and the column references twice, once into each part. Then we replace the aliases as described above. Finally, we remove the marked table with its references from the outer query. The resulting query is of the form:

WITH subs(seid, meid, dept, salary) AS (
    SELECT e.eid AS seid, e.eid AS meid,
           e2.dept AS dept, e2.salary AS salary
    FROM Emp e, Emp e2
    WHERE e2.eid = e.eid
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid,
           e2.dept AS dept, e2.salary AS salary
    FROM Emp e3, subs s, Emp e2
    WHERE e3.mgr = s.seid AND e2.eid = e3.eid )
SELECT s2.salary, d1.name
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
             JOIN Dept d1 ON (s2.dept = d1.did)
WHERE e1.ename = 'Smith';

This form may undergo further optimisations such as self-join elimination. One thing has to be mentioned: if the marked table is not joined with the CTE, it should be skipped and returned to later, after other modifications to the CTE.
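The equivalence of the query before and after pushing Emp e2 can be tested on a small instance. The sketch below makes several assumptions: it runs on SQLite via Python's sqlite3, the rows are invented, Emp is given an extra dept column, and the outer query before the push joins Dept through e2.dept, in line with the description of the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dept (did INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT,
                  mgr INTEGER REFERENCES Emp(eid), salary REAL,
                  dept INTEGER REFERENCES Dept(did));
-- Hypothetical sample data.
INSERT INTO Dept VALUES (10, 'Sales'), (20, 'Research');
INSERT INTO Emp VALUES
    (1, 'Jones', NULL, 100, 10), (2, 'Smith', 1, 80, 10),
    (3, 'Adams', 2, 50, 20), (4, 'Baker', 2, 40, 20),
    (5, 'Clark', 3, 30, 10);
""")

# Emp e2 is still joined in the outer query.
BEFORE_PUSH = """
WITH RECURSIVE subs(seid, meid) AS (
    SELECT e.eid, e.eid FROM Emp e
    UNION ALL
    SELECT e3.eid, s.meid FROM Emp e3, subs s WHERE e3.mgr = s.seid
)
SELECT e2.salary, d1.name
FROM subs s2 JOIN Emp e2 ON (e2.eid = s2.seid)
             JOIN Emp e1 ON (e1.eid = s2.meid)
             JOIN Dept d1 ON (e2.dept = d1.did)
WHERE e1.ename = 'Smith'
ORDER BY e2.salary
"""

# Emp e2, its join condition and its columns moved into both CTE parts.
AFTER_PUSH = """
WITH RECURSIVE subs(seid, meid, dept, salary) AS (
    SELECT e.eid, e.eid, e2.dept, e2.salary
    FROM Emp e, Emp e2
    WHERE e2.eid = e.eid
    UNION ALL
    SELECT e3.eid, s.meid, e2.dept, e2.salary
    FROM Emp e3, subs s, Emp e2
    WHERE e3.mgr = s.seid AND e2.eid = e3.eid
)
SELECT s2.salary, d1.name
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
             JOIN Dept d1 ON (s2.dept = d1.did)
WHERE e1.ename = 'Smith'
ORDER BY s2.salary
"""

before = conn.execute(BEFORE_PUSH).fetchall()
after = conn.execute(AFTER_PUSH).fetchall()
print(before == after, after)
```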

Now let us analyse the situation when a table from the outer query is referenced within a predicate. It should be marked for pushing into the CTE and moved into the CTE as described above, but without deleting it from its original place. We have to check whether moving the predicate into the CTE is possible. There are many predicates for which pushing them into the CTE would restrict the CTE too much, resulting in a loss of data. During our research on recursive queries we found that a predicate can be pushed into the CTE only if we can isolate a subtree of the result tree that contains only elements fulfilling the predicate, and no node outside this subtree fulfils the predicate. This can only be verified by checking for the existence of a tree invariant.
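A small experiment shows the kind of data loss meant here. In this sketch (invented data, SQLite via Python's sqlite3) the outer predicate restricts the table joined through seid, the column that changes in every recursive step. Pushing it naively into the initial step gives a different answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT,
                  mgr INTEGER REFERENCES Emp(eid), salary REAL);
-- Hypothetical sample data: Clark reports to Adams, Adams to Smith,
-- Smith to Jones.
INSERT INTO Emp VALUES
    (1, 'Jones', NULL, 100), (2, 'Smith', 1, 80), (3, 'Adams', 2, 50),
    (4, 'Baker', 2, 40), (5, 'Clark', 3, 30), (6, 'Davis', 1, 60);
""")

# Correct query: all (subordinate, manager) pairs whose subordinate is Clark.
OUTER_PREDICATE = """
WITH RECURSIVE subs(seid, meid) AS (
    SELECT e.eid, e.eid FROM Emp e
    UNION ALL
    SELECT e3.eid, s.meid FROM Emp e3, subs s WHERE e3.mgr = s.seid
)
SELECT COUNT(*)
FROM subs s2 JOIN Emp e2 ON (e2.eid = s2.seid)
WHERE e2.ename = 'Clark'
"""

# Incorrect rewriting: the same predicate pushed into the initial step.
WRONGLY_PUSHED = """
WITH RECURSIVE subs(seid, meid) AS (
    SELECT e.eid, e.eid FROM Emp e WHERE e.ename = 'Clark'
    UNION ALL
    SELECT e3.eid, s.meid FROM Emp e3, subs s WHERE e3.mgr = s.seid
)
SELECT COUNT(*) FROM subs
"""

correct_rows = conn.execute(OUTER_PREDICATE).fetchone()[0]
wrong_rows = conn.execute(WRONGLY_PUSHED).fetchone()[0]
print(correct_rows, wrong_rows)
```

The pairs that link Clark to her managers are reached only through rows that do not satisfy the predicate, so restricting the initial step removes them.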

So a general method for pushing a predicate into a CTE is based on checking the CTE for the existence of a tree invariant and, if one is found, checking whether the predicate can be attached to the CTE through this invariant. To perform this check we use induction rules. We start by analysing the tuple schema generated in the initial step of the CTE materialisation. We need to fetch the metadata on the tables used in the FROM clauses. First we create the schema of the initial tuples: we simply use the SELECT clause and fill the columns with the values found in this clause. Next we analyse the FROM clause and the join predicates of the recursive step, and from the metadata we create the general schema of the tuple that would be produced from a standard input tuple. Analysing the SELECT clause, we perform the proper projection onto the newly generated tuple schema, thus creating the schema of a tuple that would result from the recursive step. By comparing the input and output tuples we may pinpoint the tuple element which is the loop invariant. If there is no loop invariant, we cannot push the predicates. If there is an invariant, then in order to push the predicate we have to check whether it is a restriction on a table joined to the invariant (or to one of the invariants, if there are several). An easy observation shows that it is sufficient to push the predicate only into the initial step, because, by induction, it will be satisfied in all of the following steps.

Let us now observe how this method works on an example. Let us analyse the following query (with the join condition already pushed):
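The comparison of input and output tuples can be approximated by a purely syntactic check. The following sketch is a simplification and not the induction procedure itself: it assumes the recursive SELECT list is available as (alias, column) pairs, and the helper names `invariant_columns` and `can_push` are hypothetical. A CTE column counts as invariant when the recursive step copies it unchanged from the CTE's own alias.

```python
def invariant_columns(cte_columns, recursive_select, cte_alias):
    """Return the CTE columns that the recursive step passes through unchanged.

    cte_columns      -- column names from the CTE header, in order
    recursive_select -- for each output position, the (alias, column) pair
                        projected by the recursive SELECT clause
    cte_alias        -- alias under which the CTE references itself
    """
    return [
        column
        for column, (alias, source) in zip(cte_columns, recursive_select)
        if alias == cte_alias and source == column
    ]

def can_push(predicate_join_column, invariants):
    """A predicate may be pushed into the initial step if the table it
    restricts is joined to the CTE through an invariant column."""
    return predicate_join_column in invariants

# subs(seid, meid, salary) with the recursive step
#   SELECT e3.eid, s.meid, e3.salary FROM Emp e3, subs s WHERE e3.mgr = s.seid
cte_columns = ["seid", "meid", "salary"]
recursive_select = [("e3", "eid"), ("s", "meid"), ("e3", "salary")]

invariants = invariant_columns(cte_columns, recursive_select, "s")
print(invariants)
print(can_push("meid", invariants))  # predicate on a table joined via meid
print(can_push("seid", invariants))  # predicate on a table joined via seid
```

A real implementation would also have to consult the metadata, for instance to follow functional dependencies, which this sketch ignores.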

WITH subs(seid, meid, salary) AS (
    SELECT e.eid AS seid, e.eid AS meid, e.salary AS salary
    FROM Emp e, Emp e1
    WHERE e1.eid = e.eid
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid, e3.salary AS salary
    FROM Emp e3, subs s, Emp e1
    WHERE e3.mgr = s.seid AND e1.eid = s.meid )
SELECT SUM(s2.salary)
FROM subs s2 JOIN Emp e1 ON (e1.eid = s2.meid)
WHERE e1.ename = 'Smith';

The table Emp e1 occurs in the predicate e1.ename = 'Smith'. In the CTE definition we reference the table Emp four times and the CTE itself once. From the metadata we know that the Emp table consists of the attributes (EID, ENAME, MGR, SALARY) and that the EID attribute is the primary key. This means that every tuple belonging to the relation Emp has the form (e, n_e, m_e, s_e). All of the tuple's elements are functionally dependent on the first element. By analysing the SELECT clauses of the CTE we know that its attributes are (SEID ⊂ Z, MEID ⊂ Z, SALARY ⊂ R). The initial step generates tuples of the form:

(e, e, s_e)

Let us assume that the tuple (a, b, c) ∈ CTE. During the recursive step the following tuples are generated from this tuple:

((a, b, c), (e1, n_e1, a, s_e1), (b, n_b, m_b, s_b))

Next, by projection on the 4th, 2nd and 7th elements, we get the tuple:

(e1, b, s_e1)

Comparing this tuple with the initial tuple template, we see that the second element is a tree invariant, so we may attach to it a table with a predicate limiting the size of the result collection. Because the predicate e1.ename = 'Smith' references a table that is joined to the element b, it can be pushed into the initial step of the CTE. Because all of the information from the outer query connected with Emp e1 has been included in the CTE definition, it may be removed from the outer query. Using the transformations described in Section 3 to simplify the recursive step, we get as a result:

WITH subs(seid, meid, salary) AS (
    SELECT e.eid AS seid, e.eid AS meid, e.salary AS salary
    FROM Emp e
    WHERE e.ename = 'Smith'
  UNION ALL
    SELECT e3.eid AS seid, s.meid AS meid, e3.salary AS salary
    FROM Emp e3, subs s
    WHERE e3.mgr = s.seid )
SELECT SUM(s2.salary)
FROM subs s2;

This way we have obtained a query which traverses only a fraction of the whole hierarchy. It is the final query of our motivating example (see Section 2). The predicate e1.ename = 'Smith' has been successfully pushed into the CTE. The general procedure of optimising a recursive SQL query is first to push in all the predicates and columns possible and then to use the simplification techniques described in Section 3.


5 Measured Improvement

In this section we present the results of tests performed on the motivating example of this paper. The tests were performed on two machines: the first one is equipped with an Intel Core 2 Duo U2500 processor and 2 GB of RAM (let us call it machine A), the other one has a Phenom X4 9350e processor and 2 GB of RAM (let us call it machine B). Each of them has IBM DB2 v. 9.5 installed on the MS Vista operating system. The test data is stored in a table Emp(eid, ename, mgr, salary) and consists of 1000 records. This means that the size of the whole materialised hierarchy can be counted in hundreds of thousands of rows (its upper bound is about half the square of the size of the Emp table). The hierarchy itself was created in such a way as to eliminate cycles (as is common for a company hierarchy). The tests were performed in two series. The first one tested the case when the Emp table had an index only on the primary key. In the second series indices were placed on both the primary key and the ename column.
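The bound on the size of the materialised hierarchy is reached by a single chain, where every employee reports to the previous one. The sketch below is an illustration on SQLite with generated data, not the paper's DB2 test set; it counts the rows of the full CTE for a chain of n employees and compares them with n(n+1)/2. The helper `closure_size` and the index name are assumptions made for the example.

```python
import sqlite3

def closure_size(n):
    """Count the rows of the full subordinate CTE for a chain of n employees."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Emp (eid INTEGER PRIMARY KEY, ename TEXT, "
                 "mgr INTEGER, salary REAL)")
    # Index on mgr so that the recursive step does not scan the whole table.
    conn.execute("CREATE INDEX emp_mgr ON Emp(mgr)")
    conn.executemany(
        "INSERT INTO Emp VALUES (?, ?, ?, ?)",
        [(i, f"emp{i}", i - 1 if i > 1 else None, 1.0)
         for i in range(1, n + 1)],
    )
    return conn.execute("""
        WITH RECURSIVE subs(seid, meid) AS (
            SELECT e.eid, e.eid FROM Emp e
            UNION ALL
            SELECT e3.eid, s.meid FROM Emp e3, subs s WHERE e3.mgr = s.seid
        )
        SELECT COUNT(*) FROM subs
    """).fetchone()[0]

n = 100
rows = closure_size(n)
print(rows, n * (n + 1) // 2)
```

By the same formula, a chain of 1000 employees would give 500,500 rows, which matches the order of magnitude quoted above.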

Fig. 1. Basic query's plan using an index on the Emp table's primary key. It includes five full table scans, one additional index scan and two hash joins that also take some time to perform.

Fig. 2. Optimized query's plan with an index on the Emp table's primary key. This plan has no need for hash joins; one full table scan and one index scan have also been eliminated.

Let us start by analysing the case when the set of tests was performed on the Emp table that had an index placed on its primary key. The original query was estimated at 1728.34 timeron units and was evaluated in 2.5 s on machine A. The query obtained using the method described in this paper (further called the optimised query) was estimated by the DBMS at 1654.71 timeron units. The evaluation plan for the original query (Fig. 1) indicates the use of many full table scans in the process of materializing the hierarchy and also two full table scans in the outer select subquery. This indicates, firstly, that the DBMS does not possess any means to optimise the query using already implemented algorithms. Secondly, the bigger the Emp table, the more dramatically the runtime and resource consumption increase. The only benefit of having an index placed on the primary key was in the initial step of materializing the CTE. Globally this is a small profit, because the initial step in the original query still consists of 1000 records, and the greatest resource consumption takes place during the recursive steps. In comparison, the evaluation plan for the optimised query (Fig. 2), although it also contains full table scans, benefited from the elimination of two hash joins (HSJOIN) that needed full table scans in order to attach the Emp table to the materialized CTE. On machine A this query was evaluated in under 1 s. The time was so small because the initial step of the CTE was materialized not for all of the 1000 records, but only for a few.

Fig. 3. Basic query's plan using indices on the Emp table's primary key and ename column. In comparison to Fig. 1, one of the full table scans has been replaced by a less costly index scan. Two hash joins and four other full table scans still remain.

Fig. 4. Optimized query's plan using indices on the Emp table's primary key and ename column. In comparison to Fig. 3, one full table scan, one index scan and two hash joins have been eliminated. This plan also has the fewest full table scans and join operations, and is therefore the least time-consuming.

The second set of tests was performed with indices placed on both the primary key and the ename column. The original query was evaluated in 2 s and its cost was estimated at 1681.38 timeron units. For the optimised query the corresponding results were 1615.31 timeron units and an evaluation time under 1 s. As in the previous case, the index placed on the primary key was used only in the initial step of materializing the CTE. The index placed on the ename column was used to reduce the number of records attached to the materialized hierarchy. This way the hash join took less time to evaluate. Nevertheless, the evaluation plan still contains many full scans that deal with a huge amount of data. For the optimised query the index placed on the primary key is not used, but the index placed on the ename column sped up the materialization of the initial step. The results of the tests are collected for comparison in Table 1. It is worth noting that the timeron cost of the original query, despite indexing, is greater than that of the optimised query. Based on this estimation, the profit of our method is about 4 percent. It may not seem much, but when thinking of bigger initial tables, this is quite a good result. What is more, because this is a method of rewriting SQL into SQL, further optimisation (like the placement of indices) may be performed.

Table 1. Results of the described tests in timeron units and real-time measurements

tests        measure    original  optimised  opt/orig
one index    real time  2.5 s     < 1 s      < 40%
one index    timeron    1728.34   1654.71    95.7%
two indices  real time  2 s       < 1 s      < 50%
two indices  timeron    1681.38   1615.31    96%
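The opt/orig ratios for the timeron estimates in Table 1 follow directly from the reported costs. The short sketch below recomputes them; only the four timeron values come from the measurements, the variable names are illustrative.

```python
# Estimated costs in timeron units: (original query, optimised query).
timeron = {
    "one index": (1728.34, 1654.71),
    "two indices": (1681.38, 1615.31),
}

# Ratio of optimised to original cost, and the estimated saving.
ratios = {test: opt / orig for test, (orig, opt) in timeron.items()}
savings = {test: 1.0 - ratio for test, ratio in ratios.items()}

for test in timeron:
    print(f"{test}: opt/orig = {ratios[test]:.1%}, "
          f"estimated saving = {savings[test]:.1%}")
```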

6 Conclusion

In this paper we have shown an optimisation method for recursive SQL queries. The method consists of selecting the predicates which can be pushed into the CTE and moving them. The condition that needs to be satisfied is the existence of a tree invariant. The benefit of using our method depends on the selectivity of the predicates being pushed and on the recursion depth. A highly selective filter condition which may indirectly reduce the number of recursion steps will significantly improve the evaluation time. Even experiments with small tables proved the high potential of the method, since even for such a small number of rows the reduction of the execution time is substantial.

References

1. Bancilhon, F., Maier, D., Sagiv, Y., Ullman, J.D.: Magic sets and other strange ways to implement logic programs. In: PODS, pp. 1–15. ACM, New York (1986)

2. Levy, A.Y., Mumick, I.S., Sagiv, Y.: Query optimization by predicate move-around. In: Bocca, J.B., Jarke, M., Zaniolo, C. (eds.) VLDB, pp. 96–107. Morgan Kaufmann, San Francisco (1994)

3. Wadler, P.: Deforestation: Transforming programs to eliminate trees. Theor. Comput. Sci. 73(2), 231–248 (1990)

4. Jones, S.P., Tolmach, A., Hoare, T.: Playing by the rules: rewriting as a practical optimisation technique in GHC. In: Haskell Workshop, ACM SIGPLAN, pp. 203–233 (2001)

5. Gill, A.J., Launchbury, J., Jones, S.L.P.: A short cut to deforestation. In: FPCA, pp. 223–232 (1993)

6. Johann, P.: Short cut fusion: Proved and improved. In: Taha, W. (ed.) SAIG 2001. LNCS, vol. 2196, pp. 47–71. Springer, Heidelberg (2001)

7. Ohori, A., Sasano, I.: Lightweight fusion by fixed point promotion. In: Hofmann, M., Felleisen, M. (eds.) POPL, pp. 143–154. ACM, New York (2007)

8. Grust, T., Scholl, M.H.: Query deforestation. Technical report, Faculty of Mathematics and Computer Science, Database Research Group, University of Konstanz (1998)

9. Ordonez, C.: Optimizing recursive queries in SQL. In: SIGMOD 2005: Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data, pp. 834–839. ACM, New York (2005)
