Lecture 9, Feb 26th, 2014. Query Optimization (ii) - Piazza

CS 600.316/416 Database Systems Lecture 9, Feb 26th, 2014. Query Optimization (ii)


Selectivity = #outputs / #inputs
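As a small illustration of the definition above, the following Python sketch computes the selectivity of a filter. The `ages` column and the predicate `age > 30` are made-up examples, not part of the lecture.

```python
def selectivity(num_inputs: int, num_outputs: int) -> float:
    """Fraction of input tuples that an operator passes on."""
    if num_inputs == 0:
        return 0.0  # convention chosen for this sketch only
    return num_outputs / num_inputs

# Hypothetical column of 10 rows; the predicate keeps rows with age > 30.
ages = [22, 35, 41, 19, 30, 58, 27, 33, 45, 29]
passed = [a for a in ages if a > 30]
sel = selectivity(len(ages), len(passed))
print(sel)
```

A selectivity below 1 means the operator shrinks its input, which is why selective operators are usually worth running early.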

Rank  operator  chains  by:  

Query costing: how long does this query take, or how many I/O operations does it need?

HISTOGRAMS  

Also: V-Optimal histograms, which minimize the cumulative weighted variance of the buckets
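One way to read "cumulative weighted variance" is the sum, over all buckets, of bucket size times the variance of the frequencies in that bucket, which equals the total squared error around each bucket mean. Under that assumption, a minimal dynamic-programming sketch in Python looks as follows. The function name `v_optimal` and the example frequencies are illustrative only.

```python
from itertools import accumulate

def v_optimal(freqs, num_buckets):
    """Split freqs into contiguous buckets minimising total squared error.

    Returns (total_error, starts), where starts holds the first index
    of each bucket. Assumes len(freqs) >= num_buckets >= 1.
    """
    n = len(freqs)
    prefix = [0] + list(accumulate(freqs))
    prefix_sq = [0] + list(accumulate(f * f for f in freqs))

    def sse(i, j):
        """Squared error of freqs[i:j] around its mean (= size * variance)."""
        s = prefix[j] - prefix[i]
        return (prefix_sq[j] - prefix_sq[i]) - s * s / (j - i)

    inf = float("inf")
    # best[b][j]: minimal error when freqs[:j] is covered by b buckets
    best = [[inf] * (n + 1) for _ in range(num_buckets + 1)]
    cut = [[0] * (n + 1) for _ in range(num_buckets + 1)]
    best[0][0] = 0.0
    for b in range(1, num_buckets + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                cand = best[b - 1][i] + sse(i, j)
                if cand < best[b][j]:
                    best[b][j] = cand
                    cut[b][j] = i
    # Walk the cut table backwards to recover the bucket start indices.
    starts, j = [], n
    for b in range(num_buckets, 0, -1):
        j = cut[b][j]
        starts.append(j)
    return best[num_buckets][n], starts[::-1]

err, starts = v_optimal([10, 10, 10, 1, 1, 1, 50, 50], 3)
print(err, starts)
```

The triple loop costs O(B·n²) for n values and B buckets, which is acceptable for statistics that are precomputed offline.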

SAMPLING  

Current Support of Sampling in DB2: The Rand() Function

RAND() returns a uniform random number between 0 and 1:

SELECT * FROM original query WHERE rand() < 0.01

Advantage: The user can easily specify the sample size they want.

Disadvantage: RAND() provides no optimization, because the filter is applied to the query result after the query has been fully evaluated.

Current Support of Sampling in DB2: The TABLESAMPLE operator

•  Can place a sampling clause after any SQL table reference: SELECT … FROM T TABLESAMPLE BERNOULLI(10.0) WHERE …

•  General form of the sampling clause: TABLESAMPLE samplingMethod(p)

•  samplingMethod is one of:
   –  BERNOULLI: row-level Bernoulli sampling
   –  SYSTEM: page-level (efficient) sampling

•  p = inclusion probability for each row, in percent = (expected) sampling fraction
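The difference between the two sampling methods can be simulated in a few lines of Python. This is only a sketch of the idea, not of DB2's implementation: the table is modelled as a list of pages, each page a list of rows, and the page and row counts are made up.

```python
import random

def bernoulli_sample(pages, p_percent, rng):
    """Row-level sampling: each row is kept independently with probability p."""
    p = p_percent / 100.0
    return [row for page in pages for row in page if rng.random() < p]

def system_sample(pages, p_percent, rng):
    """Page-level sampling: each page is kept or skipped as a whole."""
    p = p_percent / 100.0
    return [row for page in pages if rng.random() < p for row in page]

rng = random.Random(42)
# Hypothetical table: 1,000 pages of 100 rows each.
pages = [[(pg, r) for r in range(100)] for pg in range(1000)]
rows_b = bernoulli_sample(pages, 10.0, rng)
rows_s = system_sample(pages, 10.0, rng)
print(len(rows_b), len(rows_s))
```

Both samples contain about 10% of the rows in expectation. The row-level version has to look at every row, and therefore read every page, while the page-level version only reads the pages it selects. The price is that rows from the same page always appear together, so the sample is less random when values are clustered on disk.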

Current Support of Sampling in DB2: The TABLESAMPLE operator (cont.)

•  Advantage: Pushing sampling to the bottom of the query tree can yield large performance improvements.

•  Disadvantage: How should the sampling rate at the base table be extrapolated to the query result? This is hard for:
   –  Joins
   –  Group by
   –  Count (DISTINCT)
   –  Subqueries
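The extrapolation problem can be seen with a simple simulation. The Python sketch below uses a hypothetical single-column table and a 10% row-level sample. Scaling by 1/p recovers COUNT(*) well, but the same scaling applied to COUNT(DISTINCT) is far off.

```python
import random

rng = random.Random(7)
p = 0.10  # sampling fraction (10%)

# Hypothetical table: 100,000 rows, one column with 100 distinct values.
table = [i % 100 for i in range(100_000)]
sample = [v for v in table if rng.random() < p]

true_count = len(table)
true_distinct = len(set(table))

est_count = len(sample) / p            # scaling by 1/p works for COUNT(*)
naive_distinct = len(set(sample)) / p  # the same scaling fails for COUNT(DISTINCT)

print(true_count, round(est_count))
print(true_distinct, round(naive_distinct))
```

Here the sample already contains every distinct value, so multiplying by 1/p overestimates the distinct count by roughly a factor of ten. With many rare values the error goes the other way, which is why no single scaling rule works.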

On Random Sampling over Joins

Surajit Chaudhuri, Rajeev Motwani, Vivek Narasayya

Slides by Srikantha Nema

Terminologies  

•  SAMPLE(R, f) is an SQL operation

•  R is the relation obtained when a query Q is evaluated

•  f is the fraction of relation R to be sampled

 

Difficulty  of  Join  Sampling  

$$R_1(A, B) = \{(a_1, b_0), (a_2, b_1), (a_2, b_2), \ldots, (a_2, b_k)\}$$

$$R_2(A, C) = \{(a_2, c_0), (a_1, c_1), (a_1, c_2), \ldots, (a_1, c_k)\}$$

$$\mathrm{SAMPLE}(R_1 \bowtie R_2, f) \overset{?}{=} \mathrm{SAMPLE}(R_1, f) \bowtie \mathrm{SAMPLE}(R_2, f)$$
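A short simulation suggests why sampling the inputs first does not work for this pair of relations. The Python sketch below builds R1 and R2 as defined above and, as an assumption for illustration, implements SAMPLE as a Bernoulli sample that keeps each tuple with probability f. The values of `k`, `f` and `trials` are arbitrary.

```python
import random

def build_relations(k):
    """R1(A, B) and R2(A, C) with the skew pattern from the slide."""
    r1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
    r2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
    return r1, r2

def join(r1, r2):
    """Equi-join on the first attribute A."""
    index = {}
    for a, c in r2:
        index.setdefault(a, []).append(c)
    return [(a, b, c) for a, b in r1 for c in index.get(a, [])]

def sample(rel, f, rng):
    """Bernoulli sample: keep each tuple with probability f (an assumption)."""
    return [t for t in rel if rng.random() < f]

rng = random.Random(1)
k, f, trials = 200, 0.01, 500
r1, r2 = build_relations(k)
full = join(r1, r2)

# How often is the join of the two samples completely empty?
empty = sum(
    1 for _ in range(trials)
    if not join(sample(r1, f, rng), sample(r2, f, rng))
)
print(len(full), f * len(full), empty / trials)
```

The full join has 2k tuples, so a sample with fraction f should contain about 2kf tuples. The join of the samples is non-empty only if one of the two rare tuples, (a1, b0) or (a2, c0), happens to be sampled, so in most trials it returns nothing at all.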

Classification of Join Sampling problem

•  Case A: No information is available for either $R_1$ or $R_2$

•  Case B: No information is available for $R_1$, but indexes and/or statistics are available for $R_2$

•  Case C: Indexes/statistics are available for $R_1$ and $R_2$

Algorithms  for  Sampling  

•  Unweighted Sequential WR (with-replacement) Sampling
   –  Black-Box U1
   –  Black-Box U2

•  Weighted Sequential WR Sampling
   –  Black-Box WR1
   –  Black-Box WR2

Unweighted Sequential WR Sampling

Black-Box U2

Black-Box U1

Binomial distribution: for $X \sim B(n, p)$, $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$
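The slide bodies for the two black boxes did not survive extraction. The Python sketch below follows the general idea of sequential with-replacement sampling as described in the cited paper by Chaudhuri et al., and the details are assumptions rather than a copy of the slides. U1 assumes the input size n is known and uses a binomial draw per tuple. U2 assumes n is unknown and keeps r independent one-element reservoirs.

```python
import random

def black_box_u1(stream, n, r, rng):
    """WR sample of size r from a stream whose length n is known in advance."""
    out = []
    remaining = r  # copies still to be handed out
    for i, t in enumerate(stream):
        if remaining == 0:
            break
        p = 1.0 / (n - i)  # chance that one pending draw lands on t
        # Binomial(remaining, p), drawn here as a sum of Bernoulli trials.
        copies = sum(1 for _ in range(remaining) if rng.random() < p)
        out.extend([t] * copies)
        remaining -= copies
    return out

def black_box_u2(stream, r, rng):
    """WR sample of size r from a stream of unknown length."""
    slots = [None] * r  # stays None only if the stream is empty
    seen = 0
    for t in stream:
        seen += 1
        for j in range(r):
            # Each slot independently switches to t with probability 1/seen.
            if rng.random() < 1.0 / seen:
                slots[j] = t
    return slots

rng = random.Random(0)
data = list(range(20))
s1 = black_box_u1(data, len(data), 8, rng)
s2 = black_box_u2(data, 8, rng)
print(s1)
print(s2)
```

U1 emits its sample in stream order and can stop as soon as all r copies are assigned, so it needs no buffer. U2 must hold r slots in memory until the stream ends.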

Weighted Sequential Sampling

•  Black-Box WR1

•  Black-Box WR2
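The weighted black boxes can be sketched in the same way. The Python code below is an assumption based on the general technique in the cited paper, not a transcription of the slides: each tuple t carries a non-negative integer weight w(t), WR1 assumes the total weight W is known up front, and WR2 assumes it is not.

```python
import random

def black_box_wr1(stream, total_weight, r, rng):
    """Weighted WR sample of size r; total_weight must be the sum of all weights."""
    out, remaining, seen_weight = [], r, 0
    for t, w in stream:
        if remaining == 0:
            break
        if w > 0:
            # Probability that one pending draw lands on t, given it skipped earlier tuples.
            p = min(1.0, w / (total_weight - seen_weight))
            copies = sum(1 for _ in range(remaining) if rng.random() < p)
            out.extend([t] * copies)
            remaining -= copies
        seen_weight += w
    return out

def black_box_wr2(stream, r, rng):
    """Weighted WR sample of size r when the total weight is not known in advance."""
    slots, seen_weight = [None] * r, 0
    for t, w in stream:
        if w <= 0:
            continue  # zero-weight tuples can never be sampled
        seen_weight += w
        p = w / seen_weight
        for j in range(r):
            if rng.random() < p:
                slots[j] = t
    return slots

rng = random.Random(3)
weighted = [("x", 0), ("y", 3), ("z", 1)]  # hypothetical tuples with weights
w1 = black_box_wr1(weighted, sum(w for _, w in weighted), 200, rng)
w2 = black_box_wr2(weighted, 200, rng)
print(w1.count("y"), w1.count("z"), w2.count("y"), w2.count("z"))
```

In both samples "y" should appear about three times as often as "z", and "x" never appears because its weight is zero.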

Sampling  Strategies  (old)  

•  Strategy Naïve-Sample

•  Strategy Olken-Sample
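The two older strategies can be sketched in Python as follows. This is a reading of the usual descriptions, stated as an assumption: Naïve-Sample materialises the whole join and samples from it, while Olken-Sample draws a random tuple from R1, draws a random matching tuple from R2 through an index, and accepts the pair with probability m2(v)/M, where m2(v) is the number of R2 tuples with join value v and M is the largest such count. The relations and function names are illustrative.

```python
import random

def build_index(r2):
    """Index R2 on its first attribute (the join attribute)."""
    index = {}
    for t2 in r2:
        index.setdefault(t2[0], []).append(t2)
    return index

def naive_sample(r1, r2, r, rng):
    """Compute the full join, then draw r tuples with replacement."""
    index = build_index(r2)
    full = [(t1, t2) for t1 in r1 for t2 in index.get(t1[0], [])]
    return [rng.choice(full) for _ in range(r)]

def olken_sample(r1, r2, r, rng):
    """Rejection sampling over the join; needs an index and max frequency on R2."""
    index = build_index(r2)
    if not any(t1[0] in index for t1 in r1):
        raise ValueError("join is empty, nothing to sample")
    max_freq = max(len(matches) for matches in index.values())
    out = []
    while len(out) < r:
        t1 = rng.choice(r1)
        matches = index.get(t1[0], [])
        # Accept with probability m2(v) / M so every join tuple is equally likely.
        if rng.random() < len(matches) / max_freq:
            out.append((t1, rng.choice(matches)))
    return out

rng = random.Random(5)
k = 50
r1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
r2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
s_naive = naive_sample(r1, r2, 40, rng)
s_olken = olken_sample(r1, r2, 40, rng)
print(len(s_naive), len(s_olken))
```

Olken-Sample avoids computing the join, but on skewed data most iterations are rejected: in this example only about 2 of every 51 attempts produce a sample tuple.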

New  strategies  for  join  Sampling  

•  Strategy Stream-Sample

•  Strategy Group-Sample

•  Strategy Frequency-Partition-Sample
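As an illustration of the first new strategy, the Python sketch below follows the usual description of Stream-Sample, stated here as an assumption: draw a weighted with-replacement sample of R1 in which each tuple's weight is the number of matching tuples in R2, then pair every sampled tuple with one random match. For brevity the weighted draw uses `random.choices` from the standard library in place of a sequential black box, and the relations are made up.

```python
import random

def stream_sample(r1, r2, r, rng):
    """Sample r join tuples without rejection, using R2 frequencies as weights."""
    index = {}
    for t2 in r2:
        index.setdefault(t2[0], []).append(t2)
    # Weight of t1 = number of R2 tuples it joins with (0 if none).
    weights = [len(index.get(t1[0], [])) for t1 in r1]
    if sum(weights) == 0:
        raise ValueError("join is empty, nothing to sample")
    # Stand-in for a weighted sequential WR black box.
    s1 = rng.choices(r1, weights=weights, k=r)
    return [(t1, rng.choice(index[t1[0]])) for t1 in s1]

rng = random.Random(11)
k = 50
r1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, k + 1)]
r2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, k + 1)]
s = stream_sample(r1, r2, 40, rng)
print(sum(1 for t1, _ in s if t1[0] == "a1"), "of", len(s), "samples use a1")
```

Because the weights already account for how many join results each R1 tuple produces, no draw is wasted, in contrast to rejection sampling. Roughly half of the samples use a1 and half use a2, matching their equal shares of the full join.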

Strategy Frequency-Partition-Sample

Experimental Evaluation 1

Experimental Evaluation 2

Experimental Evaluation 3

Summary  

•  Selectivity estimation
   –  Makes extensive use of statistics in the DBMS (e.g. # records, # distinct values, extremal values)
   –  Statistics can be:
      •  Precomputed (e.g. histograms and statistics tables)
      •  Generated on-the-fly (e.g. sampling)
   –  Handling bias during sampling for joins and group-by aggregates

•  Selectivity estimation generalizes to approximate query processing
   –  Trade-off between speed and accuracy

References  

•  This slide deck uses material from:
   –  Amol Deshpande’s CMSC424: http://www.cs.umd.edu/class/fall2009/cmsc424/lecture-queryoptimization.pdf
   –  Torsten Grust’s INF 4141: http://db.inf.uni-tuebingen.de/files/teaching/ws1011/db2/db2-selectivity.pdf
   –  S. Chaudhuri, R. Motwani, V. R. Narasayya: On Random Sampling over Joins. SIGMOD Conference 1999: 263-274
   –  S. Acharya, P. B. Gibbons, V. Poosala: Congressional Samples for Approximate Answering of Group-By Queries. SIGMOD Conference 2000: 487-498