Huffman Coding - UET Taxila
Huffman Coding
• Developed in 1952
• Research paper for MIT
• Used in digital data transmissions
  – Fax machines
  – Modems
  – Computer networks
  – Video compression
"This is gopher"
• In binary it would look like this
  – "11001111101111010000011001111101111…" (full bit string truncated here)
  – That is 14 characters, each 8 bits in length, or
  – 14 * 8 = 112 bits to send the message
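The fixed-width arithmetic above can be checked directly; a minimal Python sketch using the built-in `ord` and `format`:

```python
# Fixed-width (8-bit) encoding cost of the example message.
message = "This is gopher"

# Each character maps to one 8-bit pattern.
bits = "".join(format(ord(ch), "08b") for ch in message)

print(len(message))  # 14 characters
print(len(bits))     # 14 * 8 = 112 bits
```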
Introduction
• Huffman coding is a compression technique.
• In normal text, not all characters occur with the same frequency!
• Yet all characters are allocated the same amount of space
  – 1 char = 1 byte, be it e or x
The Basic Algorithm
• Code word lengths are no longer fixed, as in ASCII
• Code word lengths vary and will be shorter for the more frequently used characters
The Basic Algorithm
1. Scan text to be compressed and tally occurrence of all characters.
2. Sort or prioritize characters based on number of occurrences in text.
3. Build Huffman code tree based on prioritized list.
4. Perform a traversal of tree to determine all code words.
5. Scan text again and create new file using the Huffman codes.
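The five steps can be sketched end to end in Python with the standard `collections.Counter` and `heapq` modules. This is a minimal sketch, not the slides' own code; the function names are illustrative:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Steps 1-4: tally characters, prioritize, build the tree, read off codes."""
    freq = Counter(text)                      # step 1: tally occurrences
    # Step 2: a min-heap acts as the priority queue; entries are
    # (frequency, tiebreaker, node) so ties never compare nodes directly.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:                      # step 3: merge the two smallest
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    codes = {}
    def walk(node, prefix):                   # step 4: traverse for code words
        if isinstance(node, str):
            codes[node] = prefix or "0"       # degenerate one-symbol case
        else:
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
    walk(heap[0][2], "")
    return codes

def encode(text, codes):                      # step 5: rescan and emit the bits
    return "".join(codes[ch] for ch in text)

codes = huffman_codes("Eerie eyes seen near lake.")
print(len(encode("Eerie eyes seen near lake.", codes)))  # 84 bits
```

Tie-breaking in the heap can change which of several equally good trees is built, but every Huffman tree for these frequencies has the same total cost, so the 84-bit result is stable.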
Building a Tree
Scan the original text
• Consider the following short text:
  Eerie eyes seen near lake.
• Count up the occurrences of all characters in the text
Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What characters are present?
  E e r i space y s n a l k .
Building a Tree
Scan the original text
Eerie eyes seen near lake.
• What is the frequency of each character in the text?

  Char   Freq.    Char   Freq.    Char   Freq.
  E      1        y      1        k      1
  e      8        s      2        .      1
  r      2        n      2
  i      1        a      2
  space  4        l      1
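The tally in the table can be reproduced with Python's `collections.Counter`:

```python
from collections import Counter

freq = Counter("Eerie eyes seen near lake.")

# The counts match the table above (counting is case-sensitive: E and e differ).
print(freq["e"])                                   # 8
print(freq[" "])                                   # 4
print(freq["r"], freq["s"], freq["n"], freq["a"])  # 2 2 2 2
print(sum(freq.values()))                          # 26 characters in total
```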
Building a Tree
Prioritize characters
• Create binary tree nodes with character and frequency of each character
• Place nodes in a priority queue
  – The lower the occurrence, the higher the priority in the queue
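A minimal sketch of this priority queue using Python's `heapq`: the lowest frequency pops first (ties are broken by the character, which the slides don't specify).

```python
import heapq
from collections import Counter

freq = Counter("Eerie eyes seen near lake.")

# One (frequency, character) node per distinct character.
heap = [(f, ch) for ch, f in freq.items()]
heapq.heapify(heap)

# Lowest occurrence = highest priority: all the 1s come out first,
# then the 2s, then space (4), then e (8).
order = [heapq.heappop(heap) for _ in range(len(heap))]
print([f for f, _ in order])  # [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 8]
```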
Building a Tree
• The queue after inserting all nodes:
  E:1  i:1  y:1  l:1  k:1  .:1  r:2  s:2  n:2  a:2  sp:4  e:8
• Repeatedly dequeue the two nodes of lowest frequency, make them the children of a new node whose frequency is their sum, and enqueue the new node. [The original slides show the queue and partial trees after each step; the diagrams are omitted here.] The merges proceed as follows:
  – E:1 + i:1 → 2
  – y:1 + l:1 → 2
  – k:1 + .:1 → 2
  – r:2 + s:2 → 4
  – n:2 + a:2 → 4
  – (E,i):2 + (y,l):2 → 4
  – (k,.):2 + sp:4 → 6
  – (r,s):4 + (n,a):4 → 8
  – (E,i,y,l):4 + (k,.,sp):6 → 10
  – e:8 + (r,s,n,a):8 → 16
  – 10 + 16 → 26
• After enqueueing this last node there is only one node left in the priority queue.
• Dequeue the single node left in the queue. This tree contains the new code words for each character.
• Frequency of the root node should equal the number of characters in the text:
  Eerie eyes seen near lake.  (26 characters)
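The merge sequence above can be replayed with a min-heap over just the frequencies; this sketch checks that the internal-node sums, and hence the root, come out as shown:

```python
import heapq

# Frequencies from the table: six 1s, four 2s, space = 4, e = 8.
heap = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 4, 8]
heapq.heapify(heap)

sums = []
while len(heap) > 1:
    a = heapq.heappop(heap)   # smallest remaining frequency
    b = heapq.heappop(heap)   # second smallest
    heapq.heappush(heap, a + b)
    sums.append(a + b)

print(sums)     # [2, 2, 2, 4, 4, 4, 6, 8, 10, 16, 26]
print(heap[0])  # 26 = number of characters in the text
```

The sum of all internal-node weights (2+2+2+4+4+4+6+8+10+16+26 = 84) is also the total number of bits the code will use, since each character's frequency is counted once for every merge on its path to the root.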
Encoding the File
Traverse Tree for Codes
• Perform a traversal of the tree to obtain new code words
• Going left is a 0, going right is a 1
• A code word is only completed when a leaf node is reached
[tree diagram omitted]
Encoding the File
Traverse Tree for Codes

  Char    Code
  E       0000
  i       0001
  y       0010
  l       0011
  k       0100
  .       0101
  space   011
  e       10
  r       1100
  s       1101
  n       1110
  a       1111
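The table can be reproduced by walking the tree from the slides (hand-encoded here as nested pairs, left child first) and appending 0 for each left step and 1 for each right step:

```python
# Tree from the slides as nested (left, right) pairs; leaves are characters.
tree = (
    ((("E", "i"), ("y", "l")), (("k", "."), " ")),   # left subtree, weight 10
    ("e", (("r", "s"), ("n", "a"))),                 # right subtree, weight 16
)

def codes_from(node, prefix="", out=None):
    """Depth-first walk: going left appends 0, going right appends 1."""
    if out is None:
        out = {}
    if isinstance(node, str):
        out[node] = prefix        # code word completed at a leaf
    else:
        codes_from(node[0], prefix + "0", out)
        codes_from(node[1], prefix + "1", out)
    return out

codes = codes_from(tree)
print(codes["E"], codes["e"], codes[" "])  # 0000 10 011
```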
Encoding the File
• Rescan text and encode file using new code words
  Eerie eyes seen near lake.
  000010110000011001110001010110101111011010111001111101011111100011001111110100100101
• Why is there no need for a separator character?
  – No code word is a prefix of any other, so each code word ends unambiguously
Encoding the File
Results
• 84 bits to encode the text
• ASCII would take 8 * 26 = 208 bits
  000010110000011001110001010110101111011010111001111101011111100011001111110100100101
• If 5 bits per character are needed: total bits = 5 * 26 = 130. Savings not as great.
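The totals can be checked from the frequency table and the code lengths:

```python
# (frequency, code length) per character, from the tables above.
freq_len = {
    "E": (1, 4), "i": (1, 4), "y": (1, 4), "l": (1, 4), "k": (1, 4), ".": (1, 4),
    " ": (4, 3), "e": (8, 2),
    "r": (2, 4), "s": (2, 4), "n": (2, 4), "a": (2, 4),
}

huffman_bits = sum(f * n for f, n in freq_len.values())
print(huffman_bits)  # 84 bits with the Huffman code
print(8 * 26)        # 208 bits with 8-bit ASCII
print(5 * 26)        # 130 bits with a fixed 5-bit code
```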
Decoding the File
• How does the receiver know what the codes are?
• Once the receiver has the tree, it scans the incoming bit stream
  – 0 ⇒ go left
  – 1 ⇒ go right
[tree diagram omitted]
  000010110000011001110001010110101111011010111001111101011111100011001111110100100101
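A sketch of the receiver side: encode with the code table, then decode by walking the tree bit by bit (0 = go left, 1 = go right), emitting a character at each leaf:

```python
# Tree and code table as on the previous slides; leaves are characters.
tree = (
    ((("E", "i"), ("y", "l")), (("k", "."), " ")),
    ("e", (("r", "s"), ("n", "a"))),
)
codes = {"E": "0000", "i": "0001", "y": "0010", "l": "0011", "k": "0100",
         ".": "0101", " ": "011", "e": "10", "r": "1100", "s": "1101",
         "n": "1110", "a": "1111"}

text = "Eerie eyes seen near lake."
bits = "".join(codes[ch] for ch in text)   # sender: encode

decoded = []
node = tree
for b in bits:                             # receiver: walk the tree
    node = node[0] if b == "0" else node[1]
    if isinstance(node, str):              # leaf: code word complete
        decoded.append(node)
        node = tree                        # restart at the root
print("".join(decoded) == text)  # True: the bit stream round-trips
print(len(bits))                 # 84
```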
Example 2:
Start:   a:45  b:13  c:12  d:16  e:9  f:5
Step 1:  merge f:5 + e:9 → 14;  queue is now  a:45  b:13  c:12  d:16  14
Step 2:  merge c:12 + b:13 → 25;  queue is now  a:45  d:16  14  25
Example: Huffman code construction (2)
Step 3:  merge 14 + d:16 → 30;  queue is now  a:45  25  30
Example: Huffman code construction (3)
Step 4:  merge 25 + 30 → 55;  queue is now  a:45  55
Example: Huffman code construction (4)
Step 5:  merge a:45 + 55 → 100;  only one tree remains
Resulting code words (0 = left, 1 = right):
  a: 0   c: 100   b: 101   d: 111   f: 1100   e: 1101
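Running the same min-heap procedure on these frequencies reproduces the five merge steps; this sketch also tracks the resulting code lengths (leaf depths):

```python
import heapq

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}

# (frequency, tiebreaker, node); nodes are characters or (left, right) pairs.
heap = [(f, i, ch) for i, (ch, f) in enumerate(sorted(freqs.items()))]
heapq.heapify(heap)
steps = []
n = len(heap)
while len(heap) > 1:
    f1, _, left = heapq.heappop(heap)
    f2, _, right = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, n, (left, right)))
    n += 1
    steps.append(f1 + f2)
print(steps)  # [14, 25, 30, 55, 100], matching steps 1-5

depths = {}
def walk(node, d=0):
    if isinstance(node, str):
        depths[node] = d          # depth = length of the code word
    else:
        walk(node[0], d + 1)
        walk(node[1], d + 1)
walk(heap[0][2])
print(depths)  # a gets 1 bit; b, c, d get 3 bits; e, f get 4 bits
```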
Huffman Coding Example-1
• 4- Repeat this step until there is only one tree:
  Choose two trees with the smallest weights, call these trees T1 and T2. Create a new tree whose root has a weight equal to the sum of the weights T1 + T2 and whose left subtree is T1 and whose right subtree is T2.
• 5- The single tree left after the previous step is an optimal encoding tree.
Huffman Coding Example-2
• Character (or symbol) frequencies
  – A: 20% (.20)
    • e.g., 'A' occurs 20 times in a 100 character document, 1000 times in a 5000 character document, etc.
  – B: 9% (.09)
  – C: 15% (.15)
  – D: 11% (.11)
  – E: 40% (.40)
  – F: 5% (.05)
• Also works if you use character counts
• Must know frequency of every character in the document
Huffman Coding Example-2
• Here are the symbols and their associated frequencies:
  C: .15  A: .20  D: .11  F: .05  B: .09  E: .4
• Now we combine the two least common symbols (those with the smallest frequencies) to make a new symbol string and corresponding frequency.
Huffman Coding Example-2
• Here's the result of combining symbols once: F (.05) + B (.09) → FB (.14)
  C: .15  A: .20  D: .11  FB: .14  E: .4
• Now we repeat until we have combined all the symbols into a single string.
Huffman Coding Example-2
• Repeating the merges:
  – FB (.14) + D (.11) → BFD (.25)
  – C (.15) + A (.20) → AC (.35)
  – BFD (.25) + AC (.35) → ABCDF (.6)
  – ABCDF (.6) + E (.4) → ABCDEF (1.0)
[tree diagram omitted]
Huffman Coding Example-2
• Now we assign 0s/1s to each branch
• Codes (reading from top to bottom)
  – A: 010
  – B: 0000
  – C: 011
  – D: 001
  – E: 1
  – F: 0001
• Note
  – None are prefixes of another
• Decode this
  – 0100111100010000
  – What's the first character? The second?
  – Try decoding right to left
[tree diagram omitted]
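The prefix property is exactly what makes left-to-right decoding unambiguous; a short sketch that checks it and decodes the challenge string greedily:

```python
codes = {"A": "010", "B": "0000", "C": "011", "D": "001", "E": "1", "F": "0001"}
decode_map = {v: k for k, v in codes.items()}

# Prefix property: no code word is a prefix of another.
words = list(codes.values())
assert not any(a != b and b.startswith(a) for a in words for b in words)

bits = "0100111100010000"
out, buf = [], ""
for b in bits:              # left to right: emit as soon as a code word matches
    buf += b
    if buf in decode_map:
        out.append(decode_map[buf])
        buf = ""
print("".join(out))  # ACEEFB
```

Read right to left the same bits admit more than one parse, which is why the decoder must scan in the direction the prefix property guarantees.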
Huffman Coding: Limitations
• Knowledge of source statistics is rarely available in practice:
  – In file compression, files can be from a wide variety of applications
  – Each source exhibits different statistics
• Needs a source coding algorithm that does not depend on source statistics!
Cost estimation: Huffman coding
• abracadabra frequencies:
  – a: 5, b: 2, c: 1, d: 1, r: 2
• Huffman code:
  – a: 0, b: 100, c: 1010, d: 1011, r: 11
  – bits: 5 * 1 + 2 * 3 + 1 * 4 + 1 * 4 + 2 * 2 = 23
• Follow the tree to decode: Θ(n)
• Time to encode?
  – Compute frequencies: O(n)
  – Build heap: O(1) assuming alphabet has constant size
  – Encode: O(n)
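The 23-bit total can be verified directly from the code table:

```python
code = {"a": "0", "b": "100", "c": "1010", "d": "1011", "r": "11"}

encoded = "".join(code[ch] for ch in "abracadabra")
print(len(encoded))  # 5*1 + 2*3 + 1*4 + 1*4 + 2*2 = 23 bits
```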
Last words
• Best algorithms compress text to 75% of original size, but humans can compress to 10%
• Humans have far better modeling algorithms because they have better pattern recognition and higher-level patterns to recognize
• Intelligence = pattern recognition = data compression?