Huffman Coding - UET Taxila


Transcript of Huffman Coding - UET Taxila

Huffman Encoding & Decoding

Huffman Coding

• Developed in 1952
• Originated as a research paper at MIT
• Used in digital data transmissions
  – Fax machines
  – Modems
  – Computer networks
  – Video compression

“This is gopher”

• In binary it would look like a 112-bit string beginning
  – “11001111101111010000011001111101111…”
• That is 14 characters, each 8 bits in length, or
• 14 × 8 = 112 bits to send the message
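The 112-bit figure is easy to verify; a minimal sketch (the exact bits depend on the character encoding, so only the length is checked):

```python
# Fixed-length (ASCII) cost of sending the message: 8 bits per character.
message = "This is gopher"

bits = "".join(f"{ord(ch):08b}" for ch in message)  # 8-bit code per char

print(len(message))  # 14 characters
print(len(bits))     # 14 * 8 = 112 bits
```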

Introduction

• Huffman coding is a compression technique.
• In normal text, not all characters occur with the same frequency!
• Yet all characters are allocated the same amount of space
  – 1 char = 1 byte, be it 'e' or 'x'

The Basic Algorithm

• Code word lengths are no longer fixed, as they are in ASCII
• Code word lengths vary and will be shorter for the more frequently used characters

The Basic Algorithm

1. Scan the text to be compressed and tally the occurrences of all characters.
2. Sort or prioritize the characters based on their number of occurrences in the text.
3. Build the Huffman code tree based on the prioritized list.
4. Perform a traversal of the tree to determine all code words.
5. Scan the text again and create the new file using the Huffman codes.

Building a Tree: Scan the original text

• Consider the following short text:

  Eerie eyes seen near lake.

• Count up the occurrences of all characters in the text

Building a Tree: Scan the original text

Eerie eyes seen near lake.
• What characters are present?

  E e r i space y s n a l k .

Building a Tree: Scan the original text

Eerie eyes seen near lake.
• What is the frequency of each character in the text?

  Char   Freq.    Char   Freq.    Char   Freq.
  E      1        y      1        k      1
  e      8        s      2        .      1
  r      2        n      2
  i      1        a      2
  space  4        l      1
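Step 1 of the algorithm (tallying occurrences) is one line with Python's collections.Counter; a minimal sketch that reproduces the table above:

```python
from collections import Counter

text = "Eerie eyes seen near lake."
freq = Counter(text)  # maps each character to its number of occurrences

print(freq["e"])  # 8
print(freq[" "])  # 4
print(freq["E"])  # 1
```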

Building a Tree: Prioritize characters

• Create binary tree nodes with the character and frequency of each character
• Place the nodes in a priority queue
  – The lower the occurrence, the higher the priority in the queue

Building a Tree

• The queue after inserting all nodes:

  E:1  i:1  y:1  l:1  k:1  .:1  r:2  s:2  n:2  a:2  sp:4  e:8

• While more than one node remains, dequeue the two nodes with the lowest frequencies, make them the children of a new internal node whose frequency is their sum, and enqueue the new node.
• The original slides draw the queue and the partial trees after each merge; the sequence of merges is:

  1. E(1) + i(1) → 2
  2. y(1) + l(1) → 2
  3. k(1) + .(1) → 2
  4. r(2) + s(2) → 4
  5. n(2) + a(2) → 4
  6. (E i)(2) + (y l)(2) → 4
  7. (k .)(2) + sp(4) → 6
  8. (r s)(4) + (n a)(4) → 8
  9. (E i y l)(4) + (k . sp)(6) → 10
  10. e(8) + (r s n a)(8) → 16
  11. (E i y l k . sp)(10) + (e r s n a)(16) → 26

• After enqueueing this last node there is only one node left in the priority queue.

Building a Tree

• Dequeue the single node left in the queue.
• This tree contains the new code words for each character.
• The frequency of the root node should equal the number of characters in the text.

  Eerie eyes seen near lake. → 26 characters, and the root frequency is 26.
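The merge loop above maps directly onto a binary heap as the priority queue. A minimal sketch using Python's heapq; tie-breaking among equal frequencies differs from the slides, so the exact tree shape may vary, but the root frequency always equals the character count:

```python
import heapq
from collections import Counter

def build_tree(text):
    """Return the root entry (frequency, tie_breaker, node) of a Huffman tree.

    A node is either a leaf character or a (left, right) pair of child nodes.
    """
    freq = Counter(text)
    # Unique tie_breaker keeps heap comparisons from ever reaching the node.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # lowest frequency
        f2, _, right = heapq.heappop(heap)  # second lowest
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    return heap[0]

root = build_tree("Eerie eyes seen near lake.")
print(root[0])  # 26: root frequency equals the number of characters
```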

Encoding the File: Traverse Tree for Codes

• Perform a traversal of the tree to obtain the new code words
• Going left is a 0, going right is a 1
• A code word is only completed when a leaf node is reached

Encoding the File: Traverse Tree for Codes

  Char    Code
  E       0000
  i       0001
  y       0010
  l       0011
  k       0100
  .       0101
  space   011
  e       10
  r       1100
  s       1101
  n       1110
  a       1111
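Reading the codes off the tree is a depth-first traversal. A sketch, with the slides' final tree hand-written as nested Python tuples (an assumed representation: leaves are characters, internal nodes are (left, right) pairs):

```python
# The slides' final tree: left subtree of weight 10, right subtree of weight 16.
tree = (
    ((("E", "i"), ("y", "l")), (("k", "."), " ")),
    ("e", (("r", "s"), ("n", "a"))),
)

def assign_codes(node, prefix="", table=None):
    table = {} if table is None else table
    if isinstance(node, str):      # leaf: the code word is complete
        table[node] = prefix
    else:
        left, right = node
        assign_codes(left, prefix + "0", table)   # going left is a 0
        assign_codes(right, prefix + "1", table)  # going right is a 1
    return table

codes = assign_codes(tree)
print(codes["e"])  # 10
print(codes["E"])  # 0000
print(codes[" "])  # 011
```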

Encoding the File

• Rescan the text and encode the file using the new code words:

  Eerie eyes seen near lake.

  000010110000011001110001010110101111011010111001111101011111100011001111110100100101

• Why is there no need for a separator character?

Encoding the File: Results

• 84 bits to encode the text
• ASCII would take 8 × 26 = 208 bits

• If 5 bits per character were enough, the total would be 5 × 26 = 130 bits; the savings would not be as great.
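Given the table, encoding is a single lookup per character; a minimal sketch that also reproduces the bit counts:

```python
codes = {
    "E": "0000", "i": "0001", "y": "0010", "l": "0011",
    "k": "0100", ".": "0101", " ": "011",  "e": "10",
    "r": "1100", "s": "1101", "n": "1110", "a": "1111",
}

text = "Eerie eyes seen near lake."
encoded = "".join(codes[ch] for ch in text)  # concatenate code words in order

print(len(encoded))   # 84 bits with the Huffman codes
print(8 * len(text))  # 208 bits in 8-bit ASCII
```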

Decoding the File

• How does the receiver know what the codes are?
• Once the receiver has the tree, it scans the incoming bit stream:
  – 0 ⇒ go left
  – 1 ⇒ go right
• Each time a leaf is reached, output its character and restart at the root.

  000010110000011001110001010110101111011010111001111101011111100011001111110100100101
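The 0-goes-left / 1-goes-right rule gives a very small decoder. A sketch, again writing the tree as nested tuples (an assumed representation of the slides' drawing):

```python
tree = (
    ((("E", "i"), ("y", "l")), (("k", "."), " ")),
    ("e", (("r", "s"), ("n", "a"))),
)

def decode(bits, tree):
    out = []
    node = tree
    for bit in bits:
        node = node[0] if bit == "0" else node[1]  # 0 => left, 1 => right
        if isinstance(node, str):  # leaf reached: emit it, restart at root
            out.append(node)
            node = tree
    return "".join(out)

bits = ("000010110000011001110001010110101111011010111001"
        "111101011111100011001111110100100101")
print(decode(bits, tree))  # Eerie eyes seen near lake.
```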

Example 2:

Start:   a:45  b:13  c:12  d:16  e:9  f:5

Step 1:  merge f:5 and e:9 → 14      remaining: a:45  b:13  c:12  d:16  14
Step 2:  merge c:12 and b:13 → 25    remaining: a:45  d:16  14  25
Step 3:  merge 14 and d:16 → 30      remaining: a:45  25  30
Step 4:  merge 25 and 30 → 55        remaining: a:45  55
Step 5:  merge a:45 and 55 → 100

Labelling each left branch 0 and each right branch 1 gives the code words:

  a: 0    c: 100    b: 101    d: 111    f: 1100    e: 1101
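A quick check of these code words (not in the original slides): weighting each code length by its symbol's frequency gives the cost of encoding a 100-character message:

```python
freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
code = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

# Total bits = sum over symbols of (occurrences * code length)
total_bits = sum(freq[s] * len(code[s]) for s in freq)

print(total_bits)        # 224 bits for 100 characters
print(total_bits / 100)  # 2.24 bits/char, vs. 3 for a fixed-length code
```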

Huffman Coding Example-1

• 4- Repeat this step until there is only one tree:
  Choose the two trees with the smallest weights, call these trees T1 and T2. Create a new tree whose root has a weight equal to the sum of the weights T1 + T2 and whose left subtree is T1 and whose right subtree is T2.
• 5- The single tree left after the previous step is an optimal encoding tree.


Huffman Coding Example-2

• Character (or symbol) frequencies
  – A: 20% (.20)
    • e.g., 'A' occurs 20 times in a 100 character document, 1000 times in a 5000 character document, etc.
  – B: 9% (.09)
  – C: 15% (.15)
  – D: 11% (.11)
  – E: 40% (.40)
  – F: 5% (.05)
• Also works if you use character counts
• Must know the frequency of every character in the document

Huffman Coding Example-2

• Here are the symbols and their associated frequencies:

  C: .15   A: .20   D: .11   F: .05   B: .09   E: .4

• Now we combine the two least common symbols (those with the smallest frequencies) to make a new symbol string and corresponding frequency.

Huffman Coding Example-2

• Here's the result of combining symbols once: F (.05) and B (.09) become FB (.14).

  C: .15   A: .20   D: .11   FB: .14   E: .4

• Now we repeat until we have combined all the symbols into a single string.

Huffman Coding Example-2

• Repeating the combination step yields the single string ABCDEF with frequency 1.0; the tree built along the way is:

  ABCDEF (1.0)
    ├ ABCDF (.6)
    │   ├ BFD (.25)
    │   │   ├ BF (.14): B (.09), F (.05)
    │   │   └ D (.11)
    │   └ AC (.35): A (.20), C (.15)
    └ E (.4)

Huffman Coding Example-2

• Now we assign 0s/1s to each branch
• Codes (reading from top to bottom)
  – A: 010
  – B: 0000
  – C: 011
  – D: 001
  – E: 1
  – F: 0001
• Note
  – None are prefixes of another
• Decode this:
  – 0100111100010000
  – What's the first character? The second?
  – Try decoding right to left
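Because no code is a prefix of another, a decoder can simply accumulate bits until they match a code word; no separator is ever needed. A sketch that checks the exercise:

```python
code = {"A": "010", "B": "0000", "C": "011", "D": "001", "E": "1", "F": "0001"}
decode_map = {bits: sym for sym, bits in code.items()}  # code word -> symbol

def decode(bits):
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_map:  # a complete code word (prefix-free property)
            out.append(decode_map[buf])
            buf = ""
    return "".join(out)

print(decode("0100111100010000"))  # ACEEFB
```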

Huffman Coding: Limitations

• Knowledge of source statistics is rarely available in practice:
  – In file compression, files can be from a wide variety of applications
  – Each source exhibits different statistics
• This calls for a source coding algorithm that does not depend on source statistics!

Cost estimation: Huffman coding

• abracadabra frequencies:
  – a: 5, b: 2, c: 1, d: 1, r: 2
• Huffman code:
  – a: 0, b: 100, c: 1010, d: 1011, r: 11
  – bits: 5 × 1 + 2 × 3 + 1 × 4 + 1 × 4 + 2 × 2 = 23
• Follow the tree to decode: Θ(n)
• Time to encode?
  – Compute frequencies: O(n)
  – Build heap: O(1) assuming the alphabet has constant size
  – Encode: O(n)
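The abracadabra arithmetic can be verified directly, and the encoding pass really is a single O(n) sweep of table lookups:

```python
text = "abracadabra"
code = {"a": "0", "b": "100", "c": "1010", "d": "1011", "r": "11"}

encoded = "".join(code[ch] for ch in text)  # one lookup per character

print(len(encoded))   # 23 bits
print(8 * len(text))  # 88 bits in 8-bit ASCII
```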

Last words

• Best algorithms compress text to 75% of original size, but humans can compress to 10%
• Humans have far better modeling algorithms because they have better pattern recognition and higher-level patterns to recognize
• Intelligence = pattern recognition = data compression?