Lecture 9, Feb 26th, 2014. Query Optimization (ii)

CS 600.316/416 Database Systems


Selectivity = #outputs / #inputs

Rank operator chains by:

Query costing: how long does this query take, or how many I/O operations does it need?

HISTOGRAMS

Also: V-Optimal histograms, minimizing cumulative weighted variance of buckets
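As a rough illustration of how a histogram feeds selectivity estimation, the sketch below builds an equi-width histogram and estimates the selectivity of a range predicate. It assumes values are spread uniformly inside each bucket. The function names are hypothetical, and this is not the V-Optimal construction, which chooses bucket boundaries to minimize variance.

```python
def build_equi_width(values, num_buckets):
    """Return (low, bucket_width, counts) for an equi-width histogram."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0  # avoid zero width if all values are equal
    counts = [0] * num_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts


def estimate_range_selectivity(hist, a, b, total):
    """Estimate the selectivity of a <= x < b.

    Assumption: values are uniformly spread within each bucket, so a bucket
    contributes in proportion to how much of it overlaps [a, b).
    """
    lo, width, counts = hist
    estimated_rows = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        estimated_rows += count * overlap / width
    return estimated_rows / total


values = list(range(100))
hist = build_equi_width(values, 10)
half = estimate_range_selectivity(hist, 0, 49.5, len(values))
```

For this uniform input the estimate for the lower half of the value range comes out at about 0.5. On skewed data the uniform-within-bucket assumption is where the error comes from.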

SAMPLING

Current Support of Sampling in DB2: The Rand() Function

RAND() returns a uniform random number between 0 and 1:

SELECT * FROM <original query> WHERE RAND() < 0.01

Advantage: The user can easily specify the desired sample size.

Disadvantage: RAND() provides no optimization, because the filter is applied to the query result, so the full query is still evaluated.

Current Support of Sampling in DB2: The TABLESAMPLE operator

•  Can place a sampling clause after any SQL table reference:
   SELECT … FROM T TABLESAMPLE BERNOULLI(10.0) WHERE …

•  General form of the sampling clause: TABLESAMPLE samplingMethod(p)

•  samplingMethod is one of the two:
   –  BERNOULLI: row-level Bernoulli sampling
   –  SYSTEM: page-level (efficient) sampling method

•  p = inclusion probability for each row (%) = (expected) sampling fraction
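The difference between the two sampling methods can be sketched in a few lines. This is a simplified in-memory model, not DB2's implementation: a "page" is assumed to be a fixed-size run of consecutive rows, and the function names and the rows_per_page parameter are hypothetical.

```python
import random


def bernoulli_sample(rows, p_percent, rng):
    """Row-level sampling: each row is kept independently with probability p."""
    p = p_percent / 100.0
    return [row for row in rows if rng.random() < p]


def system_sample(rows, p_percent, rows_per_page, rng):
    """Page-level sampling: each page is kept or skipped as a whole.

    Skipped pages never need to be read, which is why this is cheaper,
    but rows on the same page are no longer sampled independently.
    """
    p = p_percent / 100.0
    sample = []
    for start in range(0, len(rows), rows_per_page):
        if rng.random() < p:
            sample.extend(rows[start:start + rows_per_page])
    return sample


rng = random.Random(0)
rows = list(range(1000))
row_level = bernoulli_sample(rows, 10.0, rng)
page_level = system_sample(rows, 10.0, 50, rng)
```

Both samples have an expected size of 10% of the table, but the page-level sample always arrives in whole pages.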

Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.)

•  Advantage: By pushing sampling to the bottom of a query tree, can provide huge performance improvements

•  Disadvantage: How to extrapolate the sampling rate at the base table to the query result?
   –  Joins
   –  Group by
   –  COUNT(DISTINCT)
   –  Subqueries
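A small worked example of the extrapolation problem: scaling by 1/p works for a plain COUNT but not for COUNT(DISTINCT). The data is made up, and a deterministic every-10th-row selection stands in for a 10% random sample so the numbers are reproducible.

```python
def scale_up(sample_value, p_percent):
    """Naive extrapolation: divide the sample aggregate by the sampling fraction."""
    return sample_value * 100.0 / p_percent


# Hypothetical column: 1000 rows holding 10 distinct values, 100 rows each.
rows = [i // 100 for i in range(1000)]

# Stand-in for a 10% sample (assumption: systematic, not random).
sample = rows[::10]

estimated_count = scale_up(len(sample), 10.0)          # matches the true COUNT
estimated_distinct = scale_up(len(set(sample)), 10.0)  # overestimates COUNT(DISTINCT)
true_distinct = len(set(rows))
```

Here the scaled COUNT recovers the true 1000 rows, while the scaled COUNT(DISTINCT) reports 100 against a true value of 10, because the sample already contains every distinct value.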

On Random Sampling over Joins

Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya

Slides by Srikantha Nema

Terminologies

•  SAMPLE(R, f) is an SQL operation

•  When a query Q is evaluated, we obtain relation R

•  f is a fraction of a relation R

Difficulty of Join Sampling

$R_1(A, B) = \{(a_1, b_0), (a_2, b_1), (a_2, b_2), \ldots, (a_2, b_k)\}$

$R_2(A, C) = \{(a_2, c_0), (a_1, c_1), (a_1, c_2), \ldots, (a_1, c_k)\}$

$\mathrm{SAMPLE}(R_1 \bowtie R_2, f) \overset{?}{=} \mathrm{SAMPLE}(R_1, f) \bowtie \mathrm{SAMPLE}(R_2, f)$
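The skewed relations above can be used to see why joining two samples is not the same as sampling the join. The sketch below assumes the relations have the shape described in Chaudhuri et al. (one rare tuple plus k tuples sharing the other join value); the helper names are hypothetical.

```python
def make_relations(k):
    """Build R1(A, B) and R2(A, C) with one rare tuple and k frequent tuples each."""
    r1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
    r2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
    return r1, r2


def join_on_a(r1, r2):
    """Nested-loop equi-join on attribute A."""
    return [(a, b, c) for (a, b) in r1 for (a2, c) in r2 if a == a2]


k = 100
r1, r2 = make_relations(k)
full_join = join_on_a(r1, r2)  # k tuples via a1 plus k tuples via a2

# Any pair of samples that misses both rare tuples joins to nothing,
# even if it keeps every other tuple of R1 and R2.
s1 = [t for t in r1 if t != ("a1", "b0")]
s2 = [t for t in r2 if t != ("a2", "c0")]
join_of_samples = join_on_a(s1, s2)

# With independent row-level sampling at fraction f, both rare tuples
# are missed with probability (1 - f)^2.
f = 0.01
prob_both_missed = (1 - f) ** 2
```

At f = 0.01 the join of the two samples is empty at least 98% of the time, whereas a 1% sample of the join itself is expected to contain about 2 of its 200 tuples.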

Classification of the Join Sampling Problem

•  Case A
   –  No information is available for either $R_1$ or $R_2$

•  Case B
   –  No information is available for $R_1$, but indexes and/or statistics are available for $R_2$

•  Case C
   –  Indexes/statistics are available for $R_1$ and $R_2$

Algorithms for Sampling

•  Unweighted Sequential WR Sampling
   –  Black-Box U1
   –  Black-Box U2

•  Weighted Sequential WR Sampling
   –  Black-Box WR1
   –  Black-Box WR2

Unweighted Sequential WR Sampling

Black-Box U2

Black-Box U1

Binomial distribution B(n, p): $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
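The slide bodies for the two black boxes are not reproduced in this transcript, so the following is only a sketch of how they are commonly described for the Chaudhuri et al. paper: U1 assumes the relation size n is known and uses binomial draws, while U2 does not need n and keeps a reservoir of r independent slots. Function names are hypothetical, and the binomial draw is a simple sum of Bernoulli trials rather than an optimized generator.

```python
import random


def binomial(n, p, rng):
    """Draw from B(n, p) as the number of successes in n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def black_box_u1(tuples, n, r, rng):
    """With-replacement sample of size r in one pass, assuming n = |R| is known."""
    out = []
    remaining = r
    for i, t in enumerate(tuples):
        if remaining == 0:
            break
        # Tuple i gets each still-unassigned sample slot with probability 1/(n - i).
        copies = binomial(remaining, 1.0 / (n - i), rng)
        out.extend([t] * copies)
        remaining -= copies
    return out


def black_box_u2(tuples, r, rng):
    """With-replacement sample of size r in one pass without knowing |R|."""
    reservoir = [None] * r
    seen = 0
    for t in tuples:
        seen += 1
        for j in range(r):
            # Each slot independently switches to the new tuple with probability 1/seen.
            if rng.random() < 1.0 / seen:
                reservoir[j] = t
    return reservoir


rng = random.Random(42)
data = list(range(20))
sample_u1 = black_box_u1(data, len(data), 7, rng)
sample_u2 = black_box_u2(data, 7, rng)
```

U1 can emit sample tuples as the stream goes by, whereas U2 can only report its reservoir once the whole input has been seen.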

Weighted Sequential Sampling

•  Black-Box WR1

•  Black-Box WR2
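The weighted variants can be sketched the same way, with each tuple's inclusion probability proportional to a weight w(t) instead of uniform. This is a sketch under assumptions, not the slides' own pseudocode: WR1 is taken to need the total weight W up front, WR2 is taken to track the running weight, and integer weights are assumed so the final probability is exactly 1. All names are hypothetical.

```python
import random


def binomial(n, p, rng):
    """Draw from B(n, p) as the number of successes in n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def black_box_wr1(tuples, weight, total_weight, r, rng):
    """Weighted with-replacement sample of size r, assuming total weight W is known."""
    out = []
    remaining = r
    seen_weight = 0
    for t in tuples:
        if remaining == 0:
            break
        w = weight(t)
        if w == 0:
            continue
        # Share of the not-yet-seen weight that belongs to this tuple.
        p = min(1.0, w / (total_weight - seen_weight))
        copies = binomial(remaining, p, rng)
        out.extend([t] * copies)
        remaining -= copies
        seen_weight += w
    return out


def black_box_wr2(tuples, weight, r, rng):
    """Weighted with-replacement sample of size r without knowing W in advance."""
    reservoir = [None] * r
    seen_weight = 0
    for t in tuples:
        w = weight(t)
        if w == 0:
            continue
        seen_weight += w
        for j in range(r):
            # Each slot switches to t with probability w(t) / (weight seen so far).
            if rng.random() < w / seen_weight:
                reservoir[j] = t
    return reservoir


weights = {"x": 0, "y": 3, "z": 1}
items = list(weights)
rng = random.Random(7)
sample_wr1 = black_box_wr1(items, weights.get, sum(weights.values()), 5, rng)
sample_wr2 = black_box_wr2(items, weights.get, 5, rng)
```

In the join setting the weight of a tuple of R1 would be the number of tuples in R2 that share its join value.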

Sampling Strategies (old)

•  Strategy Naïve-Sample

•  Strategy Olken-Sample
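The two older strategies can be contrasted in a short sketch. The details are assumptions based on how these strategies are usually described: Naïve-Sample materializes the whole join and then samples it, while Olken-Sample uses an index on R2 and the maximum join-value frequency in R2 to accept or reject candidate pairs. Relations are lists of (A, other) pairs and the function names are hypothetical.

```python
import random
from collections import defaultdict


def naive_sample(r1, r2, r, rng):
    """Compute the full join, then draw r tuples with replacement."""
    join = [(a, b, c) for (a, b) in r1 for (a2, c) in r2 if a == a2]
    if not join:
        return []
    return [rng.choice(join) for _ in range(r)]


def olken_sample(r1, r2, r, rng):
    """Rejection sampling over the join without materializing it."""
    index = defaultdict(list)  # stands in for an index on R2.A
    for a, c in r2:
        index[a].append(c)
    if not any(a in index for a, _ in r1):
        return []  # empty join: nothing could ever be accepted
    max_freq = max(len(matches) for matches in index.values())
    out = []
    while len(out) < r:
        a, b = rng.choice(r1)      # uniform tuple from R1
        matches = index.get(a, [])
        # Accept with probability m2(a) / M so every join tuple is equally likely.
        if matches and rng.random() < len(matches) / max_freq:
            out.append((a, b, rng.choice(matches)))
    return out


r1 = [("a1", "b0"), ("a2", "b1"), ("a2", "b2")]
r2 = [("a2", "c0"), ("a1", "c1"), ("a1", "c2")]
rng = random.Random(3)
full = [(a, b, c) for (a, b) in r1 for (a2, c) in r2 if a == a2]
sample_naive = naive_sample(r1, r2, 5, rng)
sample_olken = olken_sample(r1, r2, 5, rng)
```

The rejection step is what makes Olken-Sample expensive on skewed data: when most join values are far less frequent than the maximum, most candidates are thrown away.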

New strategies for join Sampling

•  Strategy Stream-Sample

•  Strategy Group-Sample

•  Strategy Frequency-Partition-Sample

Strategy Frequency-Partition-Sample
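As an illustration of the first of these strategies only, the sketch below follows the usual description of Stream-Sample: draw a weighted WR sample of R1 where each tuple's weight is its number of join partners in R2, then pair every sampled tuple with one random partner. Python's random.choices is used here as a stand-in for a weighted black box, and the function name is hypothetical. Group-Sample and Frequency-Partition-Sample are not shown.

```python
import random
from collections import defaultdict


def stream_sample(r1, r2, r, rng):
    """WR sample of size r from R1 join R2, using frequency statistics on R2 only."""
    index = defaultdict(list)  # stands in for an index / statistics on R2.A
    for a, c in r2:
        index[a].append(c)
    # Weight of each R1 tuple = number of matching tuples in R2.
    weights = [len(index.get(a, [])) for a, _ in r1]
    if sum(weights) == 0:
        return []  # empty join
    # Step 1: weighted with-replacement sample of R1 (stand-in for a WR black box).
    s1 = rng.choices(r1, weights=weights, k=r)
    # Step 2: join each sampled tuple with one uniformly chosen partner from R2.
    return [(a, b, rng.choice(index[a])) for a, b in s1]


r1 = [("a1", "b0"), ("a2", "b1"), ("a2", "b2")]
r2 = [("a2", "c0"), ("a1", "c1"), ("a1", "c2")]
rng = random.Random(5)
full = [(a, b, c) for (a, b) in r1 for (a2, c) in r2 if a == a2]
sample = stream_sample(r1, r2, 6, rng)
```

Because the weights already account for how many partners each R1 tuple has, no candidate is rejected, which is the main contrast with the Olken-style accept/reject loop.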

Experimental Evaluation 1

Summary

•  Selectivity estimation
   –  Makes extensive use of statistics in the DBMS (e.g. # records, # distinct values, extremal values)
   –  Statistics can be:
      •  Precomputed (e.g. histograms and statistics tables)
      •  Generated on-the-fly (e.g. sampling)
   –  Handling bias during sampling for joins and group-by aggregates

•  Selectivity estimation generalizes to approximate query processing
   –  Trade off speed for accuracy

References

•  This slide deck uses material from:
   –  Amol Deshpande’s CMSC424 http://www.cs.umd.edu/class/fall2009/cmsc424/lecture-queryoptimization.pdf
   –  Torsten Grust’s INF 4141 http://db.inf.uni-tuebingen.de/files/teaching/ws1011/db2/db2-selectivity.pdf
   –  S. Chaudhuri, R. Motwani, V. R. Narasayya: On Random Sampling over Joins. SIGMOD Conference 1999: 263-274
   –  S. Acharya, P. B. Gibbons, V. Poosala: Congressional Samples for Approximate Answering of Group-By Queries. SIGMOD Conference 2000: 487-498
