Lecture 9, Feb 26th, 2014. Query Optimization (ii) - Piazza
Selectivity = #outputs / #inputs
Rank operator chains by:
Query costing: how long, or how many IOPS, does this query take?
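As a small illustration of the selectivity definition above, the following Python sketch applies a predicate to a list of rows and reports #outputs / #inputs. The function name `selectivity` and the sample rows are hypothetical, chosen only for this example.

```python
def selectivity(rows, predicate):
    """Return #outputs / #inputs for a filter applied to rows."""
    rows = list(rows)
    if not rows:
        return 0.0  # convention chosen here for an empty input
    outputs = [r for r in rows if predicate(r)]
    return len(outputs) / len(rows)


# Hypothetical input: 10 rows with an integer "age" column.
rows = [{"age": a} for a in (18, 22, 25, 31, 35, 40, 44, 52, 60, 71)]
sel = selectivity(rows, lambda r: r["age"] >= 40)
print(sel)  # 5 of the 10 rows qualify
```

A selectivity close to 0 means the operator discards most of its input, which is why low-selectivity operators are usually worth running early.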
Current Support of Sampling in DB2: The Rand() Function
RAND() returns a uniform random number between 0 and 1:
SELECT * FROM original query WHERE RAND() < 0.01
Advantage: the user can easily specify the size of the sample they want.
Disadvantage: the RAND() operator does not provide any optimization, because it is applied to the query result after the original query has already been evaluated.
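The disadvantage can be made concrete with a rough Python simulation. This is only a sketch under stated assumptions: `expensive_query` is a hypothetical stand-in for the original query, and the row counter stands in for query cost. It is not how DB2 executes RAND().

```python
import random


def expensive_query(table, stats):
    """Stand-in for the original query: touches every row of the table."""
    for row in table:
        stats["rows_processed"] += 1
        yield row


def rand_filter(rows, fraction, rng):
    """Mimic WHERE RAND() < fraction applied to the query result."""
    return [row for row in rows if rng.random() < fraction]


rng = random.Random(0)          # fixed seed so the run is repeatable
table = list(range(100_000))    # hypothetical table of 100,000 rows
stats = {"rows_processed": 0}
sample = rand_filter(expensive_query(table, stats), 0.01, rng)

print(stats["rows_processed"])  # the full table was still processed
print(len(sample))              # roughly 1% of the rows survive
```

The sample is small, but the work done is the same as running the query without sampling.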
Current Support of Sampling in DB2: The TABLESAMPLE operator
• A sampling clause can be placed after any SQL table reference:
  SELECT … FROM T TABLESAMPLE BERNOULLI(10.0) WHERE …
• General form of the sampling clause: TABLESAMPLE samplingMethod(p)
• samplingMethod is one of two options:
  – BERNOULLI: row-level Bernoulli sampling
  – SYSTEM: page-level (efficient) sampling method
• p = inclusion probability for each row (%) = (expected) sampling fraction
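The difference between the two sampling methods can be sketched in Python. This is an illustrative model only: the function names, the in-memory list standing in for a table, and the page size of 100 rows are all assumptions, not DB2 internals.

```python
import random


def bernoulli_sample(rows, p, rng):
    """Row-level sampling: keep each row independently with probability p."""
    return [row for row in rows if rng.random() < p]


def system_sample(rows, p, rng, page_size=100):
    """Page-level sampling: keep or drop whole pages with probability p."""
    sample = []
    for start in range(0, len(rows), page_size):
        if rng.random() < p:
            sample.extend(rows[start:start + page_size])
    return sample


rng = random.Random(42)
rows = list(range(10_000))  # hypothetical table: 100 pages of 100 rows
p = 0.10                    # corresponds to TABLESAMPLE ...(10.0)

row_sample = bernoulli_sample(rows, p, rng)
page_sample = system_sample(rows, p, rng)
print(len(row_sample), len(page_sample))
```

Page-level sampling makes one random decision per page and can skip unselected pages entirely, which is where its efficiency comes from. The price is that rows stored on the same page always enter or leave the sample together.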
Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.)
• Advantage: pushing sampling to the bottom of a query tree can provide huge performance improvements.
• Disadvantage: how should the sampling rate at the base table be extrapolated to the query result? This is hard for:
  – Joins
  – Group by
  – COUNT(DISTINCT)
  – Subqueries
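The extrapolation problem can be illustrated with a short Python sketch. The data is synthetic and the variable names are hypothetical. The sketch compares scaling by 1/p for COUNT(*) against the same naive scaling for COUNT(DISTINCT).

```python
import random

rng = random.Random(7)
p = 0.10  # sampling fraction, i.e. TABLESAMPLE BERNOULLI(10.0)

# Hypothetical column: 50,000 rows drawn from only 20 distinct values.
column = [rng.randrange(20) for _ in range(50_000)]
sample = [v for v in column if rng.random() < p]

# COUNT(*): scaling the sample count by 1/p gives an unbiased estimate.
count_estimate = len(sample) / p

# COUNT(DISTINCT): the same scaling is badly wrong, because nearly every
# distinct value already shows up in the sample.
naive_distinct_estimate = len(set(sample)) / p

print(len(column), round(count_estimate))
print(len(set(column)), round(naive_distinct_estimate))
```

Simple counts scale linearly with the sampling rate, but the number of distinct values does not, so a single scale-up factor cannot be applied uniformly to every aggregate.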
Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya
On Random Sampling over Joins
Slides by Srikantha Nema
Terminologies
• SAMPLE(R, f) is an SQL operation
• When a query Q is evaluated, we obtain relation R
• f is a fraction of a relation R
Difficulty of Join Sampling
$$R_1(A,B) = \{(a_1,b_0),\ (a_2,b_1),\ (a_2,b_2),\ \ldots,\ (a_2,b_k)\}$$
$$R_2(A,C) = \{(a_2,c_0),\ (a_1,c_1),\ (a_1,c_2),\ \ldots,\ (a_1,c_k)\}$$
$$\mathrm{SAMPLE}(R_1 \bowtie R_2,\ f) \ \overset{?}{=}\ \mathrm{SAMPLE}(R_1, f_1) \bowtie \mathrm{SAMPLE}(R_2, f_2)$$
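The difficulty can be explored with a small Python experiment on the two example relations above. This is a sketch under assumptions: k = 1000, the same fraction f is used for both inputs, and Bernoulli sampling stands in for SAMPLE. The helper names `join` and `bernoulli` are hypothetical.

```python
import random

k = 1000
# The example relations: R1(A, B) and R2(A, C).
R1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
R2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]


def join(r1, r2):
    """Equi-join on the first attribute A (naive nested loop)."""
    return [(a, b, c) for (a, b) in r1 for (a2, c) in r2 if a == a2]


def bernoulli(rows, f, rng):
    """Keep each row independently with probability f."""
    return [row for row in rows if rng.random() < f]


rng = random.Random(1)
f = 0.01

full_join = join(R1, R2)                       # 2k tuples
sample_of_join = bernoulli(full_join, f, rng)  # about f * 2k tuples
join_of_samples = join(bernoulli(R1, f, rng), bernoulli(R2, f, rng))

print(len(full_join), len(sample_of_join), len(join_of_samples))
```

The join has 2k tuples, so a sample of the join is expected to hold about f · 2k of them. The join of the two samples, however, is empty unless the single tuple (a1, b0) or (a2, c0) happens to survive sampling, and each survives only with probability f. Sampling the inputs independently therefore does not yield a random sample of the join.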
Classification of Join Sampling problem
• Case A
  – No information is available for either $R_1$ or $R_2$
• Case B
  – No information is available for $R_1$, but indexes and/or statistics are available for $R_2$
• Case C
  – Indexes/statistics are available for $R_1$ and $R_2$
Algorithms for Sampling
• Unweighted Sequential WR Sampling
  – Black-Box U1
  – Black-Box U2
• Weighted Sequential WR Sampling
  – Black-Box WR1
  – Black-Box WR2
Unweighted Sequential WR Sampling
Black-Box U2
Black-Box U1
Binomial distribution: for $X \sim B(n,p)$, $\Pr[X = k] = \binom{n}{k}\, p^k (1-p)^{n-k}$
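The slide bodies for the two black boxes are not present in this transcript, so the following Python sketch is only an assumption about how binomial draws can drive one-pass, unweighted with-replacement (WR) sampling when the relation size n is known in advance. The names `binomial_pmf`, `binomial_draw`, and `sequential_wr_sample` are hypothetical, and the sketch should not be read as the exact Black-Box U1 pseudocode.

```python
import math
import random


def binomial_pmf(k, n, p):
    """P[X = k] for X ~ B(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_draw(n, p, rng):
    """Draw from B(n, p) as the number of successes in n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def sequential_wr_sample(relation, r, rng):
    """One pass over `relation`, emitting a with-replacement sample of size r.

    Assumed scheme: with x samples still to place and n - i tuples still
    unseen (including the current one), the current tuple receives
    X ~ B(x, 1 / (n - i)) copies.
    """
    n = len(relation)
    x = r
    out = []
    for i, t in enumerate(relation):
        if x == 0:
            break
        copies = binomial_draw(x, 1 / (n - i), rng)
        out.extend([t] * copies)
        x -= copies
    return out


rng = random.Random(3)
sample = sequential_wr_sample(list(range(100)), 10, rng)
print(sample)
print(binomial_pmf(2, 4, 0.5))  # C(4, 2) / 2^4
```

For the last tuple the success probability is 1, so every remaining sample slot is filled and the output always has exactly r entries.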
New strategies for join Sampling
• Strategy Stream-Sample
• Strategy Group-Sample
• Strategy Frequency-Partition-Sample
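The transcript lists these strategies by name only. As a rough illustration of the general idea behind sampling a join without materializing it, here is a Python sketch in the spirit of Stream-Sample as described in the cited Chaudhuri, Motwani, and Narasayya paper. It assumes frequency statistics (an index) exist for R2. It also swaps in `random.Random.choices` for the weighted sequential black boxes, so it is not a streaming implementation. The function name `stream_sample` is hypothetical.

```python
import random
from collections import defaultdict


def stream_sample(R1, R2, r, rng):
    """Return r tuples of R1 JOIN R2 (on the first attribute), sampled
    uniformly with replacement, without materializing the join."""
    # Index R2 by join value; len(index[a]) is the frequency of a in R2.
    index = defaultdict(list)
    for a, c in R2:
        index[a].append(c)

    # Weight each R1 tuple by how many R2 tuples it joins with.
    weights = [len(index.get(a, ())) for a, _ in R1]
    if sum(weights) == 0:
        return []  # empty join

    # Weighted with-replacement sample of R1, then one random partner each.
    picks = rng.choices(R1, weights=weights, k=r)
    return [(a, b, rng.choice(index[a])) for a, b in picks]


# Small instance of the earlier example relations (k = 3).
R1 = [("a1", "b0"), ("a2", "b1"), ("a2", "b2"), ("a2", "b3")]
R2 = [("a2", "c0"), ("a1", "c1"), ("a1", "c2"), ("a1", "c3")]
rng = random.Random(5)
sample = stream_sample(R1, R2, 50, rng)
print(sample[:5])
```

An R1 tuple with join value a is picked with probability proportional to its number of partners in R2, and one of those partners is then chosen uniformly. The two factors cancel, so every tuple of the join is equally likely.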
Summary
• Selectivity estimation
  – Makes extensive use of statistics in the DBMS (e.g. # records, # distinct values, extremal values)
  – Statistics can be:
    • Precomputed (e.g. histograms and statistics tables)
    • Generated on-the-fly (e.g. sampling)
  – Handling bias during sampling for joins and group-by aggregates
• Selectivity estimation generalizes to approximate query processing
  – Trade off speed for accuracy
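To show how precomputed statistics such as # distinct values and extremal values feed into an estimate, here is a minimal Python sketch. It relies on the textbook uniformity assumptions, which are stated here as assumptions and are not taken from the slides. The catalog numbers and function names are hypothetical.

```python
def eq_selectivity(n_distinct):
    """Estimate selectivity of `col = const`, assuming uniform values."""
    return 1.0 / n_distinct


def range_selectivity(const, low, high):
    """Estimate selectivity of `col > const` from the extremal values,
    assuming values are spread uniformly over [low, high]."""
    if const <= low:
        return 1.0
    if const >= high:
        return 0.0
    return (high - const) / (high - low)


# Hypothetical catalog statistics for one column.
stats = {"n_records": 10_000, "n_distinct": 50, "low": 0, "high": 200}

est_eq = eq_selectivity(stats["n_distinct"]) * stats["n_records"]
est_gt = range_selectivity(150, stats["low"], stats["high"]) * stats["n_records"]
print(est_eq, est_gt)
```

When the data is skewed these uniform estimates can be far off, which is one reason histograms and sampling are used instead.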
References
• This slide deck uses material from:
  – Amol Deshpande's CMSC424: http://www.cs.umd.edu/class/fall2009/cmsc424/lecture-queryoptimization.pdf
  – Torsten Grust's INF 4141: http://db.inf.uni-tuebingen.de/files/teaching/ws1011/db2/db2-selectivity.pdf
  – S. Chaudhuri, R. Motwani, V. R. Narasayya: On Random Sampling over Joins. SIGMOD Conference 1999: 263-274
  – S. Acharya, P. B. Gibbons, V. Poosala: Congressional Samples for Approximate Answering of Group-By Queries. SIGMOD Conference 2000: 487-498