Post on 01-Jun-2018
8/9/2019 Lecture Bwt
1/64
The Burrows-WheelerThe Burrows-WheelerTransformTransform
Sen Zhang
8/9/2019 Lecture Bwt
2/64
TransformTransform
• What is the definition for “transform”?
– To change the nature, function, or condition of; convert.
– To change markedl the a!!earance or form of
• "ossless and reversi#le
– $ the %a, to transform is sim!le, a kid can do it.
– To !ut them #ack is a !ro#lem.
– Think of a & ears old #a#, he !rett much can transformanthing, disassem#le anthing, #ut '
• There e(ist efficient reverse algorithms that can retrieve the originalte(t from the transformed te(t.
)
8/9/2019 Lecture Bwt
3/64
&
What is BWT?What is BWT?
• The $urro%s and Wheeler transform *$WT+ is a#lock sorting lossless and reversi#le datatransform.
• The $WT can !ermute a te(t into a ne%seuence %hich is usuall more “com!ressi#le”.
• Surfaced not long ago, -/, # 0ichael$urro%s and 1avid Wheeler.
• The transformed te(t can #e #etter com!ressed
%ith fast locall2ada!tive algorithms, such as run2length2encoding *or move2to2front coding+ incom#ination %ith 3uffman coding *or arithmeticcoding+.
8/9/2019 Lecture Bwt
4/64
/
OutlineOutline
• What does $WT stand for?• Wh $WT?
– 1ata 4om!ression algorithms – 56" – 3uffman coding – 4om#ine them
– What is left out? – $ring the realit closer to idealit
• Ste!s of $WT• $WT is reversi#le and lossless• Ste!s to inverse
• 7ariants of $WT – ST• When %as $WT initiall !ro!osed?• Where are the inventors of the algorithms?• 8our home%ork9
8/9/2019 Lecture Bwt
5/64
:
Why BWT?Why BWT?
• 5un length encoding – 5e!lacing a long series of a re!eated character %ith a
count of the re!etition. Sueeing to a num#er and a
character.•
8/9/2019 Lecture Bwt
6/64
@
Bridge reality and idealityBridge reality and ideality
• $WT can transform a te(t into a seuencethat is easier to com!ress.
– 4loser to idealit *%hat is e(!ected # 5"6+.
• 4om!ression on the transformed te(tim!roves the com!ression !erformance
8/9/2019 Lecture Bwt
7/64
>
PreliminariesPreliminaries
•
8/9/2019 Lecture Bwt
8/64
F
How to transform?How to transform?
• Three ste!s
– Gorm a H=H matri( # cclicall rotating *left+the given te(t to form the ro%s of the matri(.
– Sort the matri( according to the al!ha#eticorder.
– 6(tract the last column of the matri(.
8/9/2019 Lecture Bwt
9/64
Ine e(am!le
ho% the $WT transforms mississippi .• TJmississi!!iC
8/9/2019 Lecture Bwt
10/64
-K
Step 1: form the matrixStep 1: form the matrix
• The H = H smmetric matri(, 0I, originallconstructed from the te(ts o#tained # rotatingthe te(t CTC. – The matri( I0 has S as its first ro%, i.e. I0L-,
-MHNJT. – The rest ro%s of I0 are constructed # a!!ling
successive cclic left2shifts to T, i.e. each of theremaining ro%s, a ne% te(t TOi is o#tained #cclicall shifting the !revious te(t TOBi2-D one column
to the left.• The matri( I0 o#tained is sho%n in the ne(t
slide.
8/9/2019 Lecture Bwt
11/64
• < te(t T is a seuence of characters dra%n from the al!ha#et.• Without loss of generalit, a te(t T of length CHC is denoted as
(O-(O)(O&...(OBH2-DC, %here ever character (Oi is in the al!ha#et,A, for i in L-, H2-N. The last character of the te(t is a sentinel, %hichis the le(icogra!hicall greatest character in the al!ha#et and
occurs e(actl once in the te(t.•
--
8/9/2019 Lecture Bwt
12/64
-)
Step 1 form the matrixStep 1 form the matrix
Girst treat the in!ut string as a cclic string and construct H= H matri( from it.
8/9/2019 Lecture Bwt
13/64
-&
m i s s i s s i p p i
i s s i s s i p p i ms s i s s i p p i m is i s s i p p i m i si s s i p p i m i s ss s i p p i m i s s i
s i p p i m i s s i si p p i m i s s i s sp p i m i s s i s s ip i m i s s i s s i pi m i s s i s s i p p
m i s s i s s i p p i
Step 1: form the matrixStep 1: form the matrix
8/9/2019 Lecture Bwt
14/64
-/
Step !: transform the matrixStep !: transform the matrix
• Ho%, %e sort all the ro%s of the matri( I0 in ascending order %ith theleftmost element of each ro% #eing the most significant !osition.
• 4onseuentl, %e o#tain the transformed matri( 0 as given #elo%.
i p p i m i s s i s si s s i p p i m i s s
i s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i ss s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
"ompletely sorted from the leftmost #olumn to the rightmost #olumn$
8/9/2019 Lecture Bwt
15/64
-:
Step %: get the transformed textStep %: get the transformed text
• The $urro%s Wheeler transform is the lastcolumn in the sorted list, together %ith thero% num#er %here the original string ends
u!.
8/9/2019 Lecture Bwt
16/64
-@
Step %: get the transformed textStep %: get the transformed text
• Grom the a#ove transform, " is easil o#tained # taking the trans!ose of the lastcolumn of 0 together %ith the !rimar inde(. – / – "J s s m ! C ! i s s i i i
i p p i m i s s i s s
i s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i s
s i s s i p p i m i ss s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
•Hotice ho% there are & iPs in a ro% and ) consecutive sPs and
another ) consecutive sQs 2 this makes the te(t easier to com!ress,than the original string “mississi!!iC”.
8/9/2019 Lecture Bwt
17/64
What is the &enefit?What is the &enefit?
• The transformed te(t is more amena#le tosu#seuent com!ression algorithms.
->
8/9/2019 Lecture Bwt
18/64
-F
'ny pro&lem?'ny pro&lem?
• t sounds cool, #ut '
• s the transformation reversi#le?
8/9/2019 Lecture Bwt
19/64
-
BWT is re(ersi&le and losslessBWT is re(ersi&le and lossless
• The remarka#le thing a#out the $WT is not onl that it generates amore easil com!ressi#le out!ut, #ut also that it is reversible, i.e. itallo%s the original te(t to #e re2generated from the last column dataand the !rimar inde(.
8/9/2019 Lecture Bwt
20/64
)K
BWT is re(ersi&le and losslessBWT is re(ersi&le and lossless
mississippi$
)ndex * and ssmppissiii
BWT
Inverse BWT
mississippi$
??? 3o% to achieve the goal?
8/9/2019 Lecture Bwt
21/64
)-
The intuitionThe intuition
•
8/9/2019 Lecture Bwt
22/64
+or )BWT+or )BWT
• The order is distri#uted and hidden in theout!ut themselves999
))
8/9/2019 Lecture Bwt
23/64
)&
The tri#, isThe tri#, is
• Where to start? Who is the first one to ask? – The last one.
• Ginding immediate !receding character
– $ finding immediate !receding ro% of the currentro%.
• < loo! is needed to recover all.
• 6ach iteration involves t%o matters• 5ecover the current !eo!le *# inde(+• n addition to that, to !oint out the ne(t !eo!le *# inde(+ to
kee! the loo! running.
8/9/2019 Lecture Bwt
24/64
)/
• T%o matters• 5ecover the current !eo!le *# inde(+
– "Lcurrentinde(N, so %hat is the currentinde(?
• n addition to that, to !oint out the ne(t !eo!le *#inde(+
– currentinde( J ne% inde(;
– RR ho% to u!date currentinde(, %e need a u!datingmethod.
W t t , h i thWe want to ,now where is the
8/9/2019 Lecture Bwt
25/64
):
We want to ,now where is theWe want to ,now where is the
pre#eding #hara#ter of a gi(enpre#eding #hara#ter of a gi(en
#hara#ter$#hara#ter$i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i m
i m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i ss s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
$ased on the alread kno%n !rimar inde(, /, %e kno%, "L/N, i.e. C isthe first character to retrieve, #ack%ardl, #ut our uestion is %hichcharacter is the ne(t character to retrieve?
We want to ,now where is theWe want to ,now where is the
8/9/2019 Lecture Bwt
26/64
)@
We want to ,now where is theWe want to ,now where is the
pre#eding #hara#ter of a gi(enpre#eding #hara#ter of a gi(en
#hara#ter$#hara#ter$i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i m
i m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i ss s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
We kno% that the ne(t character is going to #e iQ?$ut "L@NJ"LNJ "L-KN J "L--N JiQ. Which inde( should #e chosen?
8/9/2019 Lecture Bwt
27/64
)>
• We kno% that the ne(t character is goingto #e iQ?
• $ut "L@NJ"LNJ "L-KN J "L--N JiQ. Whichinde( should #e chosen?
•
8/9/2019 Lecture Bwt
28/64
)F
The solutionThe solution
• The solution turns out to #e ver sim!leM
– sing "G ma!!ing9
– 4ontinue to see %hat "G ma!!ing is?
8/9/2019 Lecture Bwt
29/64
)
)n(erse BW-Transform)n(erse BW-Transform
•
8/9/2019 Lecture Bwt
30/64
&K
and + and +
i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p p
m i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i s
s s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
8/9/2019 Lecture Bwt
31/64
&-
+ mapping+ mapping
i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i ss s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
>F/:--@K
-K
-)&
8/9/2019 Lecture Bwt
32/64
&)
)n(erse BW-Transform:)n(erse BW-Transform:
.e#onstru#tion of T.e#onstru#tion of T
• Start %ith TLN #lank.
• "et u J Hnitialie nde( J the !rimar inde( */ in our
case+• TLuN J "Linde(N.We kno% that "Linde(N is the last character of T#ecause 0Lthe !rimar inde(N ends %ith C.
• Gor each i J u2-, ', - doMs J "GLsN *threading #ack%ards+TLiN J "LsN *read off the ne(t letter #ack+
8/9/2019 Lecture Bwt
33/64
&&
)n(erse BW-Transform:)n(erse BW-Transform:
.e#onstru#tion of T.e#onstru#tion of T
• Girst ste!M
s = 4 T = [.._ _ _ _ _ $]
• Second ste!M
s = LF[4] = 11 T = [.._ _ _ _ i $]• Third ste!M
s = LF[11] = 3 T = [.._ _ _ p i $]
• Gourth ste!Ms = LF[3] = 5 T = [.._ _ p p i $]
•
8/9/2019 Lecture Bwt
34/64
&/
Who #an retrie(e the data?Who #an retrie(e the data?
• Ulease com!lete it9
8/9/2019 Lecture Bwt
35/64
&:
Why does + mapping wor,?Why does + mapping wor,?
i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i s
s s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
>F/:--@K
-K
-)&
? Which one
8/9/2019 Lecture Bwt
36/64
&@
Why does + mapping wor,?Why does + mapping wor,?
i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i s
s s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
>F/:--@K
-K
-)&
? Wh not this?
8/9/2019 Lecture Bwt
37/64
&>
Why does + mapping wor,?Why does + mapping wor,?
i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i s
s s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
>F/:--@K
-K
-)&
? Wh this?
8/9/2019 Lecture Bwt
38/64
&F
Why does + mapping wor,?Why does + mapping wor,?
i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i s
s s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
>F/:--@K
-K
-)&
? Wh this?
8/9/2019 Lecture Bwt
39/64
&
Why does + mapping wor,?Why does + mapping wor,?
i p p i m i s s i s si s s i p p i m i s si s s i s s i p p i mi m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i p p i m i s s i ss i s s i p p i m i s
s s i p p i m i s s is s i s s i p p i m i m i s s i s s i p p i
/
>F/:--@K
-K
-)&
? Wh this?
8/9/2019 Lecture Bwt
40/64
/K
The mathemati# explanationThe mathemati# explanation
• T-JS-VU
• T)JS)VU
• f T-ET), S-ES)• Ho%, let us reverse S and U
• UVS-J T-Q
• UVS)JT)Q• Since S-ES), %e kno% T-QET)Q
8/9/2019 Lecture Bwt
41/64
/-
• The secret is hidden in the sorting strategthe for%ard com!onent.
– Sorting strateg !reserves the relative order
in #oth last column and first column.
8/9/2019 Lecture Bwt
42/64
/)
• We had assumed %e have the matri(. $utactuall %e donQt.
• I#servation, %e onl need t%o columns.
•
8/9/2019 Lecture Bwt
43/64
/&
• Girst, %e kno% all of the characters in theoriginal message, even if thePre !ermutedin the %rong order. This ena#les us to
reconstruct the first column.
8/9/2019 Lecture Bwt
44/64
//
• iven onl this information, ou can easilreconstruct the first column. The lastcolumn tells ou all the characters in the
te(t, so Xust sort these characters to getthe first column.
8/9/2019 Lecture Bwt
45/64
/:
)n(erse BW-Transform:)n(erse BW-Transform:
"onstru#tion of ""onstru#tion of "
• Store in 4LcN the num#er of occurrences inT of the characters B-, ', c2-D.
• n our e(am!leM
T = mississi!!iC
i 4, m 1, p 2, s 4, $ 1
C = [0 4 5 7 11]• Hotice that 4LcN V m is the !osition of the
mth occurrence of c in G *if an+.
8/9/2019 Lecture Bwt
46/64
/@
)n(erse BW-Transform:)n(erse BW-Transform:
"onstru#ting the +-mapping"onstru#ting the +-mapping
• Wh and ho% the "G2ma!!ing? – Hotice that for ever ro% of 0, "LiN directl !recedes
GLiN in the te(t *thanks to the cclic shifts+.
• "et "LiN J c, let r i #e the num#er of occurrencesof c in the !refi( "L-,iN, and let 0LXN #e the r i2th
ro% of 0 that starts %ith c. Then the character inthe first column G corres!onding to "LiN is located
at GLXN.• 3o% to use this fact in the "G2ma!!ing?
) BW T f
8/9/2019 Lecture Bwt
47/64
/>
)n(erse BW-Transform:)n(erse BW-Transform:
"onstru#ting the +-mapping"onstru#ting the +-mapping
• So, define "GL-'HN as
"GLiN J 4L"LiNN V r i.
• 4L"LiNN gets us the !ro!er offset to theeroth occurrence of "LiN, and the additionof r i gets us the r i2th ro% of 0 that starts
%ith c.
8/9/2019 Lecture Bwt
48/64
/F
)n(erse BW-Transform)n(erse BW-Transform
• 4onstruct 4L-'YAYN, %hich stores in 4LiNthe cumulative num#er of occurrences in Tof character i.
• 4onstruct an "G2ma!!ing "GL-'HN %hichma!s each character in " to the characteroccurring in G using onl " and 4.
• 5econstruct T #ack%ards # threadingthrough the "G2ma!!ing and reading thecharacters off of ".
8/9/2019 Lecture Bwt
49/64
/
'nother example'nother example
-. 8ou are given and in!ut string “a#a#c”
*a+ sing $urro%s2Wheeler, create all cclicshifts of the string
*#+ sorted order *#+ Iut!ut " and the !rimar inde(.*g+ iven ", determine G and "G *and sho%ho% ou do it+.*h+ 1ecode the original string using inde(, ",and "G *and sho% ho% ou do it+.
8/9/2019 Lecture Bwt
50/64
:K
Pros and #ons of BWTPros and #ons of BWT
• UrosM – The transformed te(t does enXo a com!ression2favora#le
!ro!ert %hich tends to grou! identical characters together sothat the !ro#a#ilit of finding a character close to anotherinstance of the same character is increased su#stantiall.
– 0ore im!ortantl, there e(ist efficient and smart algorithms torestore the original string from the transformed result.
• 4onsM – the need of sorting all the conte(ts u! to their full lengths of CHC
is the main cause for the su!er2linear time com!le(it of $WT.
– Su!er2linear time algorithms are not hard%are friendl.
8/9/2019 Lecture Bwt
51/64
Blo#, wiseBlo#, wise
• t %orks on #locks of certain t!ical sie.
:-
' i d l ith' i d l ith
8/9/2019 Lecture Bwt
52/64
:)
'n impro(ed algorithm'n impro(ed algorithm
-S#hindler Transforms-S#hindler Transforms• To address the a#ove dra%#acks, a slightl
different transform, called ST, %as !ro!osed. – %hich can sort the te(ts # using onl their first CkC
characters *%here CkC can #e a value far less thanCHC+, #ut still render itself reversi#le.
• The ke idea of ST is a t%o2hierarch !rioritsorting scheme, %hich can #e easil achieved
using the radi( sort. – the le(icogra!hical sorting criterion.
– the !ositional sorting criterion.
8/9/2019 Lecture Bwt
53/64
:&
ST transformST transform
i p p i m i s s i s si s s i s s i p p i m
i s s i p p i m i s si m i s s i s s i p pm i s s i s s i p p i p i m i s s i s s i pp p i m i s s i s s is i s s i p p i m i s
s i p p i m i s s i ss s i s s i p p i m is s i p p i m i s s i m i s s i s s i p p i
• "et O/ #e the same matri( as defined for the BWT.• nder k2order ST, O/ is transformed to /0, # sorting all its ro%s
according to their first , leftmost characters, i.e. ,2order conte(ts, onl.• n case that an t%o ,2order conte(ts are eual, the tie is resolved # their
relative !ositions in the original O/.
Only partially sorted on the leftmost two #olumns
8/9/2019 Lecture Bwt
54/64
:/
Pros and "ons of STPros and "ons of ST
• UrosM
– Gaster than $WT
– 3ard%are im!lementation friendl
• 4onsM
– The currentl kno%n a!!roach to inverse STis #ased on a hashing function.
– The relationshi! #et%een inverse ST andinverse $WT is not %ell studied.
' li ti h' li ti h
8/9/2019 Lecture Bwt
55/64
::
'n appli#ation s#heme'n appli#ation s#heme
in data #ommuni#ation systemin data #ommuni#ation system
8/9/2019 Lecture Bwt
56/64
:@
"on#lusions"on#lusions
• The $W transform makes the te(t *string+more amena#le to com!ression.
– $WT in itself does not modif the data stream.
t Xust reorders the sm#ols inside the data#locks.
– 6valuation of the !erformance actuall is
su#Xect to information model assumed.
8/9/2019 Lecture Bwt
57/64
:>
BW Transform SummaryBW Transform Summary
•
8/9/2019 Lecture Bwt
58/64
:F
)ssues left out)ssues left out
• 3o% a#out if all characters in the al!ha#et set a!!ear inthe te(t, i.e. no sentinel can #e used?
• 1o ou need to com!are H !ositions?
• 3o% a#out the in!ut data is not ascii encoded, #ut an
image, or a #iological seuence *1H
8/9/2019 Lecture Bwt
59/64
:
homewor,homewor,
• The $WT algorithms
– Gor%ard Transform
– $ack%ard Transform
• 6ither in the Windo%s environment or the"inu( environment
l f ixamples of running your
8/9/2019 Lecture Bwt
60/64
@K
xamples of running yourxamples of running your
program in the #ommand lineprogram in the #ommand line
• #%t –f te(t- te(t)
– Transfer te(t- to te(t)
• #%t –i te(t) te(t&
– nverse te(t) to te(t&
How to (erify the #orre#tness ofHow to (erify the #orre#tness of
8/9/2019 Lecture Bwt
61/64
@-
How to (erify the #orre#tness ofHow to (erify the #orre#tness of
your algorithms$your algorithms$
• $ecause the #%t is reversi#le andlossless, if our im!lementation is correct,te(t& should #e the same as te(t-.
– 8our can manuall verif te(t- and te(t&
–
8/9/2019 Lecture Bwt
62/64
@)
.e2uirements.e2uirements
• Stage -M use a fi(ed string or acce!t astring from ke#oard to test thecorrectness of our algorithms. *FK !oints+
• Stage )M then e(!and our solution to readthe string from a given file. *)K !oints+Hotice that te(t) should #e a #inar file,
for the first data is inde(, then follo%ed #ascii code.
8/9/2019 Lecture Bwt
63/64
@&
How to sort the matrixHow to sort the matrix
• -. the sim!lest %a – Whatever sorting algorithm ou feel comforta#le
– 0ake each ro% a string, then do string com!arison•
4 string, need to kno% ho% functions for string com!arison• 4!! string, need to kno% ho% to ho% to use string class.
• 8ou use %hichever %a ou feel the most comforta#le.
• ). radi( sort
• &. suffi( arra
3nowledge to &e pra#ti#ed for3nowledge to &e pra#ti#ed for
8/9/2019 Lecture Bwt
64/64
3nowledge to &e pra#ti#ed for3nowledge to &e pra#ti#ed for
the homewor,the homewor,
•