OLAP expressions are an extremely powerful tool in SQL that ...

42
OLAP expressions are an extremely powerful tool in SQL that enable advanced reporting features such as ranking, counting, averaging, adding, and more within a set of data processed in an SQL statement. This feature allows for data to be aggregated based upon values in a query in a manner very similar to coding control breaks in a program process. This allows for entire programs, or even applications, to be replaced by much more flexible and portable SQL statements. Reduce programming time and complexity, and improve flexibility and performance, by deploying OLAP expressions. This session will show you how! 1

Transcript of OLAP expressions are an extremely powerful tool in SQL that ...

OLAP expressions are an extremely powerful tool in SQL that enable advanced

reporting features such as ranking, counting, averaging, adding, and more within a

set of data processed in an SQL statement. This feature allows for data to be

aggregated based upon values in a query in a manner very similar to coding control

breaks in a program process. This allows for entire programs, or even applications,

to be replaced by much more flexible and portable SQL statements. Reduce

programming time and complexity, and improve flexibility and performance, by

deploying OLAP expressions. This session will show you how!

1

2

3

If you’re keep up with IT news in recent times you’ll easily agree that analytics is a

hot topic. The amount of data stored in our operational systems is increasing on a

daily basis, and management is quickly learning that this information can and

should be quickly harnessed in order for the business to make quick decisions

concerning things such as sales directions, talent acquisition, cost containment, and

more! One of the biggest challenges is to formulate answers to these questions that

utilize the most current information, are inexpensive and easy to create, and can

deliver the answers quickly. Many times great expense is incurred in moving data,

creating data warehouses, and using specialized software to produce various reports.

In addition to this, many times these reporting tools issue complex and redundant

SQL to the data server that can result in excessive reporting costs.

Having OLAP functionality built into the DB2 engine can help reduce some of the

operational and software costs associated with getting answers to complex

questions. This functionality can be used in data warehouses, but also against OLTP

databases with equal results. One more tool in the IT department’s tool box for

answering complex business questions.

4

Analytics is a widely growing segment of database (and non-database) processing.

DB2 has the ability to perform analytics via built-in expressions. Once again, this

means that instead of purchasing an expensive product, or writing thousands of lines

of code, you can simply write an SQL statement that does the processing for you

and creates output that is report ready!

This type of processing is called Online Analytical Processing, OLAP. The

constructs within the DB2 engine can be referred to as:

• OLAP expressions

• OLAP specification

• OLAP functions

• Window functions

5

DB2 provides for several OLAP specific functions, as well as a host of aggregate

functions in support of OLAP expressions. Each of these functions returns a scalar

result to the row being processed. The operations supporting OLAP processing can

process a single row, multiple rows, or an entire result set in the calculation of the

scalar value returned.

A feature of this type of processing is the window. This window is a logical

grouping of data within the result set, and the default window is the entire result set.

Within a window OLAP processing can number or rank rows based upon an

ordering. In addition, aggregation of values within an entire window or via a

grouping within a window can be performed. Multiple OLAP functions can be

specified in a SELECT clause mixing numbering, ranking, and aggregation. This

results in some extremely powerful and flexible data analytics within the SQL

language.

6

The key aspects to OLAP processing are the concepts of windowing and ordering.

As stated before a window is a portion or grouping of the data in the result set. If no

window is specified then the default window is the entire result set, and any

ordering is applied to the entire result. If a window is specified then any ordering is

within that window, and thus any calculations are based only upon the data in that

window. You can specify many OLAP expressions in a single query, each of which

can have its own independent windowing and ordering.

7

The first OLAP expression to explore is the numbering specification. Row

numbering is the easiest concept to understand as it does exactly what its name

implies, numbers rows in the output. Since windowing and ordering can be applied

to row number, it is the perfect function to use to learn about these features since

numbering is extremely easy to understand.

Numbering is enabled via the ROW_NUMBER() function. There are no parameters

to this function. One extremely important thing to remember is that row numbering

is arbitrary to the final ordering of the result. You can number within windows and

you can also apply an order to the numbering. However, the numbering itself is

done arbitrarily. Despite the limited functionality this function can be extremely

useful for things such as determining the minimum and maximum row according to

an order, data sampling, and pagination (although there are some performance

implications).

8

OLAP specification is best taught by example. Let’s start first with a simple process and add to it as

we go along.

OLAP specification allows for numbering of the result set. This numbering can be according to a

specified order, or not. It can also be applied to something called a “partition” or “window” of the

result table. The entire result set can be a window, and that’s what is happening in this example.

Here we are selecting data from the employee table, returning the lastname and salary of our

employees. We’ve specified that the result will be ordered by the lastname column. We’ve also

specified the ROW_NUMBER() window function in the final SELECT of the statement. The

ROW_NUMBER() function tells DB2 that the output row is to be numbered according to the ordering

applied to the function, starting with the number 1 and continuing by adding 1 to the number for each

additional row returned. If no ORDER BY is specified in the window then the numbering is arbitrary

with respect to the order of the result table. Here specifically we said:

ROW_NUMBER() OVER()

We have specified no window and no ordering, and so the rows are number arbitrarily in the result set.

The ORDER BY clause of the final SELECT (the only SELECT in this example) has no meaning for

the numbering. So don’t be fooled by a coincidental numbering in the order of the result.

9

In this example we have specified:

ROW_NUMBER() OVER(ORDER BY SALARY DESC)

There is no window specified and so the numbering is over the entire result set.

However, we have specified the order in which the rows are to be numbered in the

result set. So the rows are numbered in the entire result set in the order of the

SALARY column by descending value. Each row returned gets a number one

greater than the previous row. Also notice that the ORDER BY clause of the final

result table is dictating an order by LASTNAME. So the numbering is in the

different sequence (SALARY DESC) than the result set (LASTNAME ASC).

Already it’s becoming clear that we can create some outstanding reports simply

from SQL. Cool!

10

In this example we have numbered the result over the entire result set, and so our

window is the entire result table. We have numbered according to the SALARY

column descending, and also ordered the result by the SALARY column

descending. So our result table is in the same order as the numbers.

11

This example demonstrates a numbering of the entire result set over one order

(SALARY DESC) and the ordering of that result set in a different order

(WORKDEPT ASC, SALARY DESC).

12

It’s critical to the understanding of OLAP processing to understand the idea of

windows, keeping in mind that windows can also be called partitions or groups.

Basically a window is a logical grouping of data based upon a key value. That key

value is determined by the specification of one or more expressions derived from

the columns of the table or tables referenced in the FROM clause. For example:

PARTITION BY WORKDEPT

Will create one window for each department in the employee table. The window

function being applied is then applied inside each window defined by each key

value. Any ordering specified within the expression is applied within the scope of

each window. In the following example the ordering of employees within a

department will be by the date they were hired

PARTITION BY WORKDEPT ORDER BY HIREDATE

13

In this example partitioning, also called windowing, has been introduced. In the

specification of what the numbering will be over is:

OVER(PARTITION BY EMP.WORKDEPT ORDER BY EMP.SALARY DESC)

This tells DB2 that the result table is to be divided up by the values of the

WORKDEPT column and within each of those “windows” the numbering of the

rows will be based upon the SALARY column in descending sequence. So, the

numbering is no longer over the entire result set, but instead it is established afresh

inside each partition or window.

The result table is also ordered by the same two columns in the same sequence as

specified by the ORDER BY clause of the final SELECT (the only SELECT in this

case). So the numbering of the rows appears consistent with the ordering of the

output.

The numbering of the output is simply that. There is no respect to the data in the

result table and the next number is simply 1 more than the previous row within the

window. So, even though Nicholls and Natz have the same salary they do not

receive the row number.

14

Ranking differs from numbering in that if two or more rows within the window are

not distinct they will receive the same rank. So, while numbering is based upon the

number of rows that precede the current row, ranking is based upon the number of

rows that strictly precede the current row. That is, a rank represents the number of

rows that precede the current row based upon the values as defined in the ordering

within the window. Thus, if two or more rows have the same set of values (are not

distinct from each other) then there will be gaps in the ranking.

In our example here within the window represented by department number C01,

Nicholls and Natz have the same salary. Thus the number of rows that strictly

precede them is 2 (Kwan and Quintana) and thus they both get the rank of 3. If there

was another single person in the department with a lower salary than Nicholls and

Natz then their rank would be 5 as the tie between Nicholls and Natz created a gap.

15

DENSE_RANK() works much like RANK(), except that it closes the gaps that

RANK() would otherwise create. In this example we did not specify a window and

so the entire result set is the window. The first query ranks the people working for

the company by salary descending. As you can see in the result Nicholls and Natz

are tied with the same salary within the window with a rank of 11, and so the next

rank assigned to Jones is 13 due to the gap caused by the tie. The DENSE_RANK()

window function will close these gaps, and so in the second example Jones receives

a rank of 12.

16

Let’s take a look at the power of OLAP processing applied to a common business

activity…to determine which employee is offered the voluntary separation package.

The boss is interested in offering early retirement to the oldest employees, but is

also interested in saving the company as much money as possible. So, he’d like to

see first the employees that are oldest along with those that are highest paid. A

complete list of employees is desired and so we can do that in a single query using

OLAP expressions. This query here lists all employees and ranks them in two ways,

by birthdate ascending and by salary descending. There are no windows so the

application of these rankings are across the entire set of employees.

17

In the result set the employees have been ordered by their age and so the age

ranking correlates to the result order. The salary ranking doesn’t correlate to the age

ranking as well as the boss had hoped, and so this first run at finding the oldest and

highest paid employee is not bearing enough fruit to make a decision.

18

In response to the lack of decision making information in the previous request, the

boss comes back and requests more information. In addition to the rankings

company wide, rankings by department are also desired. Perhaps there are highly

paid older employees relative to those other employees in each department that can

be offered the package. In response two new OLAP expressions are added to the

existing query. These new expressions include windowing by the department so that

the ordering and ranking are applied for each value of the department code. The

powerful OLAP processing built into DB2 can process distinct windows within the

same statement.

The boss doesn’t like a really cluttered report, and that last report had too many

numbers. So, the query containing the OLAP expressions is placed in a nested table

expression and the result is filtered to return only the highest paid and oldest

employees in each department. While the equivalent filtered result could be

returned using subqueries it would be far more complicated and potentially more

expensive to do so.

19

Now the boss is excited! This looks like a pretty good list of potential candidates for

the package. Some bosses would pull the trigger at this point, but the smarter ones

may realize that there is a catch. How many employees are actually in these

departments? We’d hate to lay off everyone in a department so we better make sure

we’re covered before letting the axe fall.

20

New to DB2 10 for z/OS, but available on all currently in service versions of DB2

for LUW, aggregate OLAP functionality takes this type of processing and reporting

to a new level. Common aggregate functions can be applied within windows to

enable complex and diverse reporting simply by running SQL statements. To add

further dimension to this type of reporting is the concept of aggregation groups

which allow for further refinement of aggregation within a window.

This really opens the door to the ability to create complex reports with varying

degree of analytics incorporated into the SQL statement!

21

Let’s take a look at our previous example of the company that is looking to reduce

their employee headcount but do it in somewhat of an intelligent manner by looking

for the older and higher paid employees in each department such that an early

retirement package can be offered. The previous incarnation of this query used

OLAP functionality to find the top two employees in each department ranked by

ages and salary. Then it filtered to return any employee that fell into the range of the

top two for age and salary. However, there was a piece of critical information

lacking from that query. What if a department that had employees designation for

the package had only one or two employees? This question can be easily answered

by adding yet another OLAP expression to the existing query. In this example a

COUNT function has been added, specifying a window based on the department

column. This basically returns an employee count for each department much like if

a separate statement was issued such as:

SELECT WORKDEPT, COUNT(*) FROM EMP GROUP BY WORKDEPT;

The difference is that this information is returned row by row along with the results

of the other OLAP expressions (which could also be accomplished using a scalar

fullselect in the SELECT list). Nonetheless, the addition of one relatively simple

OLAP expression gets the needed information into the report.

22

This final report contains critical information relative to the impact of employee

headcount reduction on a department level. All in one statement!

23

In the case of aggregate functions in an OLAP expression there can be further

refinement of the range of values used for the computation within a window. This

grouping is specified using either the RANGE or ROW keywords to specify the

range over which the aggregation is applied. This enables “moving” values inside

the window.

24

It’s important to understand how the aggregation group is controlled depending

upon a number of rows or a range of rows. The ROWS keyword is used to designate

that the set of rows to base an aggregation upon is a count of the number of rows

before and after the current row being processed. It’s as simple as that. Using

ROWS is most significant if the key value supplied in the ordering is distinct. That

is, there is one row per unique value in the aggregation group. The RANGE

keyword is used to indicate that the set of rows is not based upon counting, but

instead a key value. There are significant restrictions to the key value used in that it

has to be numeric and the data type comparable to the range values provided. The

RANGE keyword is best if there are multiple rows per key value, as long as you can

make sure that key value is numeric.

25

There are several keywords used to designation how to determine the scope of the

aggregation within a window. If no grouping is defined then the scope is the entire

window. If a grouping is desired than a start and end value is determined using

various keywords that designate a position relative to the current row being

processed. It’s best at this point to think of the processing of the SQL statement as a

program loop, where that loop is processing the set of rows in a window one at a

time.

The choices for determining the group are either unbounded, meaning no limit from

the start or end of the window to the current row, or a certain set of rows either

before or after the current position. The number of rows designated before or after

the current position is dependent upon the use of RANGE or ROW, and if no start or

end position is specified then the position of the current row is used.

26

I, being a home brewer and DB2 consultant, have combined both of my passions

and have started recording some of my brewing activities in DB2. My first effort

into this has been to set up a simple table that contains the date of a brew, the name

of the beer, the style of beer, and the quantity of beer brewed in gallons. This simple

record can be analyzed to determine various trends in brewing, as well as how much

of each beer has been brewed. In this example I wanted to simply demonstrate the

difference between using ROW and RANGE in an aggregation within a window.

This query produces multiple totals of beer brewed over two different windows. The

first expression totals the quantity of beer brewed by month. Since there are several

years of brewing recorded each month total reflects multiple years. The second two

totals are calculated over a window that is the entire result set. The first of these

totals the current brew date total as well as the previous brew date. Provided that

there is NOT more than one beer brewed per day then this total reflects the total of

the last two brews. The last total is ordered by the month of brewing and totals the

current and previous month value. RANGE is used because there are multiple rows

per month and the number of those rows are unknown.

Two additional things to note here is that there is a lack of the FOLLOWING clause

in the aggregation group. This means that the end point of the group aggregation is

the current row. The second thing to note is that since RANGE is used the key value

in the ORDER BY specification has to be numeric, as is the result of the

MONTH(BREW_DATE) function invocation.

27

This result demonstrates row based aggregation versus range based aggregation.

While the row based is simply a calculation based upon the current and previous

row, the range based uses the value comparison to determine which rows to

aggregate.

28

This example is from the DB2 for LUW sample database. It uses aggregation to

calculate the average sale quantity company wide on a month-by-month basis. So, a

window is established for each month of sales, and the AVG function is used to

calculate the average sale quantity for each month. A SUM function is also used to

calculate total sales. There is no window specified for this total, but an aggregation

group has been specified designating an unbounded start (from the start of the

window) and the current row. This provides a running total of sales in the result.

A variety of functions, windows, and grouping can be specified to produce all sorts

of running values!

29

This result produces a list of sales from our company’s sales table. Along with the list of every sale are some important key

metrics that will help our marketing team focus attention. The first is the average sale by month. That is, for each month in

which my sales data span show me the average sale for that month. So, along with the list of sales is the average sale amount

for the month. The second is a running total of sales over the entire result set of data. So, in one query we have detailed

information along with important metrics regarding that detailed information. Sweet! This can inspire to create some

extraordinary reports using SQL. Let’s break down our OLAP expressions: The first produces an average

CAST(AVG(CAST(SALES AS DEC(3,1))) OVER ( PARTITION BY YEAR(SALES_DATE),

MONTH(SALES_DATE)) AS DEC(3,1)) The casting is because SALES is actually an integer and we need a decimal

result.

OVER(PARTITION BY YEAR(SALES_DATE), MONTH(SALES_DATE)) tells DB2 that the average

will be calculated within a window that is based on the year and month of the sales

date. So, calculate the average sale for each month! No ordering is needed within

the window because it’s a single value being calculated for each window. The second

produces a running total:

SUM(SALES) OVER (ORDER BY SALES_DATE ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT

ROW) There is no partitioning clause and so the window is the entire result table.

The SUM function indicates a calculated total over the entire window ordered by the

SALES_DATE column. The window-order-clause is followed by a window-aggregation-

group-clause, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW Which tells DB2 that

the group of values to sum will be based upon all preceding rows up to the current

row. The corresponding ORDER BY in the final SELECT makes sure that the running

total makes sense.

30

This query is valid on both DB2 for z/OS and DB2 for LUW. It takes the time

recorded by employees working on projects and calculates the average time entry

quantity recorded each month, as well as the running total of all time recorded.

31

32

OLAP processing can be a performance gain or a detriment to performance. As with

anything else the query performance is relative to the task accomplished and the

alternatives to accomplishing the same tasks by other means:

1. All data returned and program loops

2. Multiple queries issued by the program each collecting a different value

3. Using more traditional SQL, including grouping, subqueries, table expressions,

etc.

4. The portability of the report process (some programming languages might not

be very portable across platforms)

5. The amount of time it takes to code the solution

For OLAP expressions themselves there are potentially big workfile and sort

consumers within DB2. So, you need to make sure that there are adequate shared

resources (workfile, CPU, I/O) to handle the OLAP queries. Using the EXPLAIN

facility and running benchmarks are critical to understanding the impact of OLAP

processing.

33

This explain (DB2 for z/OS) shows a relatively simple OLAP query and two sorts

and workfiles allocated in support of the query. One sort is in support of the OLAP

expression and the second is to order the result set.

Be mindful of the fact that on DB2 for z/OS workfiles are allocated as needed, but

are not released until the query terminates (except for child tasks in parallel queries)

so workfile utilization can be significant. On DB2 for LUW they workfiles can be

truncated when no longer needed, as controlled by the

DB2_SMS_TRUNC_TMPTABLE_THRESH environment variable. As additional

OLAP expressions are added to a query, additional sorts and workfile allocations

will occur.

34

As we added OLAP expressions DB2 can also add sorts and workfile allocations in

response. In this query there are two sorts in support of the two OLAP expressions,

and one sort in support of the final result. This is true for DB2 for z/OS and DB2 for

LUW.

35

OLAP processing cannot only be a choice for business analytics and reporting, but

could also be a potential performance improvement for OLTP and/or batch

transaction processing. In this example here we can see two queries that return the

most recent accounting history record for a history table. The original query uses a

correlated subquery to find the most recent row for each primary key. Since there is

no filtering by key value then the entire table is processed and the correlated

subquery executed for each row in the history table. In comparison, the equivalent

OLAP query in a nested table expression is used to read the entire table once and

rank the rows. Subsequent filtering returns the same result as the correlated

subquery example. Which is the better performer? Well, that depends of course, but

in this example the OLAP query was a significant performance improvement over

the subquery.

36

Here is an example of the same situation using the DB2 sample database. The

queries are finding the oldest employee in each department, one via correlated

subquery and the other using OLAP.

37

An EXPLAIN of the subquery solution shows the execution of the correlated

subquery in a separate query block. This separate query block is executed once per

row processed in the outer portion of the query (query block one), but no sorts or

workfiles are utilized.

38

An EXPLAIN of the OLAP based solution shows two query blocks, but in this case

there is only a single execution of each block. The second query block shows the

table being read and sorted to perform the OLAP processing. The first query block

shows the workfile being read to produce the final result.

Which is better? I don’t know! Try them both, explain them, and benchmark them.

The thing to keep in mind is exactly how much data is being processed. If a lot of

data is going to be processed and there is little or no filtering then the OLAP

solution may be better provided there are adequate workfile resources available. If

there is significant filtering of the data in the outer portion of the query and/or

workfile availability is limited, then the subquery solution should be better given

appropriate indexing.

39

Here is another great example of using the power of OLAP processing to retrieve

some sample data. In this particular case a sample of employee data is desired base

upon certain rules. The rule applied here is that a sample of two employees per

department is desired. Rather than running a series of queries or a single query with

complicated subqueries, only one OLAP query can be used to get the desired result.

40

OLAP processing in DB2 is extremely powerful, and an important tool that can be

utilized quickly to perform complex data analytics. The OLAP specification takes

some time to get used to, and so you need to reserve some time for programmer

education and experimentation. Once this knowledge is acquired this type of

processing can be used to quickly answer complex business questions, but can also

be a performance advantage for certain types of reports. This is especially true in

situations where several queries can be replaced by a single query.

41

Dan Luksetich is a senior DB2 DBA consultant. He works as a DBA, application

architect, presenter, author, and teacher. Dan has been in the information technology

business for over 28 years, and has worked with DB2 for over 23 years. He has been

a COBOL and BAL programmer, DB2 system programmer, DB2 DBA, and DB2

application architect. His experience includes major implementations on z/OS, AIX,

i Series, and Linux environments. Dan's experience includes: Application design

and architecture, database administration, complex SQL, SQL tuning, DB2

performance audits, replication, disaster recovery, stored procedures, UDFs, and

triggers. Dan works everyday on some of the largest and most complex DB2

implementations in the world. He is a certified DB2 DBA, system administrator,

and application developer, and has worked with the teams that have developed

several DB2 for z/OS certification exams. He is the author of several DB2 related

articles as well as co-author of the DB2 9 for z/OS Certification Guide and the DB2

10 for z/OS Certification Guide.

42