A Survey on Bitmap Index Compression Algorithms for Big Data

17
TSINGHUA SCIENCE AND TECHNOLOGY ISSNll 1007-0214 ll 0?/?? ll pp???-??? Volume 18, Number 2, April 2013 A Survey of Bitmap Index-Compression Algorithms for Big data Zhen Chen * , Yu-hao Wen, Wen-xun Zheng, Jia-hui Chang, Guo-dong Peng, Yin-jun Wu, Ge Ma, Mourad Hakmaoui, and Junwei Cao. Abstract: With the growing popularity of Internet applications and the widespread use of mobile Internet, Internet traffic has maintained rapid growth over the past two decades. Internet traffic archival systems (ITAS) for packets or flow records have become more and more widely used in network monitoring, network troubleshooting, and user behavior and experience analysis. In this paper, we survey bitmap-index compression algorithms for traffic archival systems. The current state-of-the-art bitmap-index encoding schemes include: BBC, WAH, PLWAH, EWAH, PWAH, CONCISE, and COMPAX. Based on differences in Segmentation, Chunking, Merge Compress, and NI (Near Identical) features, we provide a thorough categorization of the state-of-the-art bitmap compression algorithms. We also propose some new bitmap encoding algorithms-SECOMPAX, ICX, MASC, PLWAH+-and show the state diagrams for their encoding algorithms. We then evaluate their CPU and GPU implementations with a real Internet trace from CAIDA. Finally, we summarize and discuss the future direction of bitmap-index compression algorithms. Key words: Internet Traffic, Big Data, Traffic archival, Network Security, Bitmap Index, Bitmap Encoding algorithm 1 Introduction 1.1 Big Data The Internet has brought with it access to enormous quantities of rapidly changing data, in all fields. As the leading search engine, Google provides powerfully customizable search capabilities to individuals [1] . Google records each users search behavior, including their Web access path and the access time for each page. The system is able to process more than 34000 requests per second. Important scientific events can generate staggering quantities of data, at enormous rates of growth. For example, the experiments of the European Large Hadron Collider (LHC) produced more than 15 petabytes of data at up to 1.5 gigabytes per second [2] . Network monitors, communication services, sensor networks, and financial services produce unlimited continuously streaming data, growing in real time. Streaming data is characterized by massive data rates. it is not feasible to use single linear scanning for randomly accessing and storing the stream data locally. OnTraditional database storage is a collection of static relational data records for persistent data storage and complex queries, and its linear complexity algorithm has been unable to meet the requirements of the storage, query and analysis of massive data streams.It is critical to make timely decisions . Therefore, efficiently handling large data streams is important. The collection, archiving, indexing, and analysis of data streams have become a major focus of database research. 1.2 Internet traffic data With the commercial popularity of Internet applications and mobile wireless networks Internet traffic is growing at an accelerating pace. A report from Cisco [3] predicts that Internet traffic will grow four-fold from 2011 to 2016, and reach 1.3 zettabytes (a zettabyte is one triillion gigabytes) in 2016. On the Internet, information is transmitted in packets, and transmitted along different paths in one or more networks, and reassembled at the destination. Internet data transmission is based on TCP/IP protocol, which divides the network packets into IP packets, TCP/UDP packets and application packets due to different information they contain. Figure 1 shows the format of original data acquired

Transcript of A Survey on Bitmap Index Compression Algorithms for Big Data

TSINGHUA SCIENCE AND TECHNOLOGYISSNll1007-0214ll0?/??llpp???-???Volume 18, Number 2, April 2013

A Survey of Bitmap Index-Compression Algorithms for Big data

Zhen Chen∗, Yu-hao Wen, Wen-xun Zheng, Jia-hui Chang, Guo-dong Peng, Yin-jun Wu, Ge Ma, MouradHakmaoui, and Junwei Cao.

Abstract: With the growing popularity of Internet applications and the widespread use of mobile Internet, Internet

traffic has maintained rapid growth over the past two decades. Internet traffic archival systems (ITAS) for packets

or flow records have become more and more widely used in network monitoring, network troubleshooting, and user

behavior and experience analysis. In this paper, we survey bitmap-index compression algorithms for traffic archival

systems. The current state-of-the-art bitmap-index encoding schemes include: BBC, WAH, PLWAH, EWAH, PWAH,

CONCISE, and COMPAX. Based on differences in Segmentation, Chunking, Merge Compress, and NI (Near

Identical) features, we provide a thorough categorization of the state-of-the-art bitmap compression algorithms.

We also propose some new bitmap encoding algorithms-SECOMPAX, ICX, MASC, PLWAH+-and show the state

diagrams for their encoding algorithms. We then evaluate their CPU and GPU implementations with a real Internet

trace from CAIDA. Finally, we summarize and discuss the future direction of bitmap-index compression algorithms.

Key words: Internet Traffic, Big Data, Traffic archival, Network Security, Bitmap Index, Bitmap Encoding algorithm

1 Introduction

1.1 Big Data

The Internet has brought with it access to enormousquantities of rapidly changing data, in all fields. Asthe leading search engine, Google provides powerfullycustomizable search capabilities to individuals[1].Google records each users search behavior, includingtheir Web access path and the access time for each page.The system is able to process more than 34000 requestsper second. Important scientific events can generatestaggering quantities of data, at enormous rates ofgrowth. For example, the experiments of the EuropeanLarge Hadron Collider (LHC) produced more than 15petabytes of data at up to 1.5 gigabytes per second[2].Network monitors, communication services, sensornetworks, and financial services produce unlimitedcontinuously streaming data, growing in real time.Streaming data is characterized by massive data rates.it is not feasible to use single linear scanning forrandomly accessing and storing the stream data locally.OnTraditional database storage is a collection of staticrelational data records for persistent data storage and

complex queries, and its linear complexity algorithmhas been unable to meet the requirements of thestorage, query and analysis of massive data streams.Itis critical to make timely decisions . Therefore,efficiently handling large data streams is important.The collection, archiving, indexing, and analysis ofdata streams have become a major focus of databaseresearch.

1.2 Internet traffic data

With the commercial popularity of Internetapplications and mobile wireless networks Internettraffic is growing at an accelerating pace. A reportfrom Cisco [3]predicts that Internet traffic will growfour-fold from 2011 to 2016, and reach 1.3 zettabytes(a zettabyte is one triillion gigabytes) in 2016. Onthe Internet, information is transmitted in packets,and transmitted along different paths in one or morenetworks, and reassembled at the destination. Internetdata transmission is based on TCP/IP protocol, whichdivides the network packets into IP packets, TCP/UDPpackets and application packets due to differentinformation they contain.

Figure 1 shows the format of original data acquired

2 Tsinghua Science and Technology, December 2014, 18(2): 000-000

•Zhen Chen and Junwei Cao are working in Research Instituteof Information Technology and Tsinghua National Laboratoryfor Information Science and Technology (TNList), TsinghuaUniversity, Beijing 100084, China.E-mail:zhenchen, [email protected]

•Yuhao Wen is studying in Department of ElectronicEngineering and Tsinghua National Laboratory for InformationScience and Technology (TNList), Tsinghua University,Beijing 100084, China.E-mail:filippo [email protected]

•Wenxun Zheng is studying in Department of Physics andTsinghua National Laboratory for Information Science andTechnology (TNList), Tsinghua University, Beijing 100084,China.E-mail:[email protected]

• Jiahui Chang is studying in Department of AerospaceEngineering and Tsinghua National Laboratory for InformationScience and Technology (TNList), Tsinghua University,Beijing 100084, China.E-mail:[email protected]

•Guodong Peng is studying in Department of MechanicalEngineering and Tsinghua National Laboratory for InformationScience and Technology (TNList), Tsinghua University,Beijing 100084, China.E-mail:[email protected]

•Yinjun Wu is studying in Department of Automation andTsinghua National Laboratory for Information Science andTechnology (TNList), Tsinghua University, Beijing 100084,China.E-mail:[email protected]

•Ge Ma is studying in Department of Automation and TsinghuaNational Laboratory for Information Science and Technology(TNList), Tsinghua University, Beijing 100084, China.E-mail:[email protected]

•Mourad Hakmaoui is studying in Department of ComputerScience and Technology and Tsinghua National Laboratoryfor Information Science and Technology (TNList), TsinghuaUniversity, Beijing 100084, China.E-mail:[email protected]

∗To whom correspondence should be addressed.Manuscript received: 2014-4-15; revised: 2014-11-1;accepted: 2013-12-10

Fig. 1 Pcap file format.

from a physical link to the Internet. Pcap is anInternet packet sniffing API, which stores captureddata as Pcap (Packet CAPture) files. Raw networkpackets are stored sequentially in Pcap files. ThePcap file header describes the properties of Pcap files(e.g., file size). Packet header contains the descriptioninformation (e.g., timestamps, and size). The packetpayload includes the contents of the complete TCP /IP network packets, see Fig.2. Figure 2.a shows an IPpacket format, including packet header and payload. Ithas the following fields: version (4 bits), header length(4 bits), type of services (8 bits), total length (16 bits),identification (16 bits), flags (3 bits), fragment offset (13bits), TTL (8 bits), protocol (8 bits), header checksum(16 bits), source IP address (32 bits), destination IPaddress (32 bits).

Fig. 2 TCP/IP packet format.

Figure 2.b shows the format of a TCP header formats,including the packet header and payload. It hasthe following fields: source port number (16 bits),destination port number (16 bits), sequence number(32 bits), acknowledgement number (32 bits), dataoffset (4 bits), reserved (6 bits), emergency bits URG,confirm bit ACK, reset bit RST, synchronization bitsSYN, termination bits FIN, window size(16 bits),checksum(16 bits), urgent pointer (16 bits), option field,and padding field. Network flow records describenetwork message and transmission characteristics. The

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 3

record can describe a UDP(User Datagram Protocol)connection or a TCP(Transmission Control Protocol)connection. The flow usually refers to a quintuple of thesource IP address, source port, destination IP address,destination port, protocol.

Table 1 Network flow record structure (Router)

Number Field Remark1 SrcIP Source IP address2 SrcPort Source Port3 DstIP Destination IP4 DstPort Destination Port5 Proto Layer 3 protocol (e.g., TCP UDP)6 TCPflags Cumulative TCP flags (e.g., SYN ACK FIN)7 SrcAS Source Autonomous System Number8 DestAS Destination Autonomous System Number9 Octs # Bytes exchanged in the flow10 Pkts # Packets exchanged in the flow11 Start Flow start time12 Duration Flow duration13 others

A network flow record structure obtained from anInternet router is shown in Table 1. It has the followingfields: source AS number, destination AS number,start and end time, and number of network packets.This is the Netflow V5 flow record format,developedby CISCO; it is the most popular flow record format.Netflow determines which flow packets belong tobased on seven fields: source IP address, destinationIP address, source port number, target port number,third-layer protocol type, TOS byte (DSCP), networkequipment input (or output), and logical networkport (ifIndex). CISCO’s Netflow V9, also knownas IPFIX (IP Flow Information Export), became anIETF standard. (The Internet Engineering Task Forcedevelops and promotes voluntary Internet standards.)It defines how IP flow information is to be formattedand transferred from an exporter to a collector. MobileInternet operators obtain Call Detail Records (CDR)traffic information; their format shown in Table 2.It includes terminal attributes, location area code, CInumber, service gateway, start time, end time, TCP/IPinformation, HTTP application information, etc.

1.3 Traffic archiving and retrieval

Network monitoring has been the core function ofnetwork management, network fault diagnosis, andnetwork security. In addition to real-time firewalls andintrusion detection systems, traffic-archiving systemsare important to network protection. An Internet traffic-archiving system (ITAS) captures packet or flow records

for subsequent analysis and processing. Such systemshave many important applications. The land-basedInternet embraces innumerable corporate networks andmany others, and its growth is accelerating. Giventhe number of these networks and the magnitude oftraffic they carry, managing them, diagnosing networkfaults, and maintaining normal operation is a staggeringchallenge. There are many companies providing aserver hosted by the IDC (Internet Data Center). Thenumber of such local networks is too large and networkbandwidth is generally 100 1000Mbps. How tomanage the huge data traffic, improve the productionefficiency, diagnose network faults, and maintainnormal operation remains as a non-trivial issue.Inaddition, data provided by China Unicom[4]shows thatcurrent mobile Internet users daily generate morethan 30 billion records and 8.4TB of data in thetelecommunications business, resulting in trillions ofrecords, requiring petabyte storage capacity. Withmobile Internet users doubling about every six months,the number of Internet records they generate will furtherincrease[5]. Network security has become a majorchallenge, and technologies such as intrusion detectionsystems, signature detection, and security scanningtechnology have arisen to meet it[6]. But Internet datavolumes have exceeded current real-time detection andanalysis capabilities. It has therefore become essentialto capture Internet traffic for subsequent analysis[7−9].This paper is organized as follows: The structure andfunction of traffic archive systems is introduced inSection 2. Section 3 describes the key technologiesof packet capture, compression, and bitmap indexencoding of network flow archiving systems. Weconclude the paper and discuss future work in Section4.

2 Internet traffic archiving systems (ITAS)

2.1 ITAS structures and functions

Internet traffic archiving systems, or ITAS, typicallyinclude flow-data acquisition, index storage, and queryprocessing. There are two categories of flow data-acquisition, corresponding to two types of data: (1)Packet-level network data (2) Flow-level network dataA typical traffic-archiving system structure is shown inFig.3.

Figure 3 describes a typical traffic-archiving system.After being captured, network data can be routed toan index module in real time, or stored to disk for

4 Tsinghua Science and Technology, December 2014, 18(2): 000-000

Table 2 Flow record format (wireless base station)

Number Field Remark1 Cell phone number May not contain a prefix such as +86,0086,862 Location Area Code LAC3 CI number Select the first CI when a network switches4 Terminal IMEI/ IMSI5 Start time YYYY-MM-DD HH:MM:SS.12345676 Duration (in seconds)7 RAT Type 1 represents 3G, 2 represents 2G8 Traffic ( in bytes) Upstream/ Downstream/total9 Terminal IP Traffic( in bytes)10 Source port/ Destination port11 APN 3gwap,3gnet,uniwap,uninet,cmwap,cmnet12 SGSN/ GGSN IP First access IP13 Http protocol User Agent/Content-type/URL/Status Code14 Others

Fig. 3 The structure of an Internet traffic-archiving system.

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 5

subsequent processing. The index module updates theindex with the new data, typically compressing it toreduce storage requirement. The indices are stored byrows or columns after indices compression. In orderto reduce storage overhead, we adopt the method ofstorage compression in the process of storage. Whena query arrives, the query processor calls the currentindex and returns the corresponding result.

2.2 Key technologies in ITAS

2.2.1 High-performance packet/flow capturetechnologies

Internet data quantities and rates are very large,and growing. There are tens of millions of 64-bytepackets to deal with each second, and similar orders ofmagnitude of flow records from routers. In a 10Gbpslink, for example, the network will reach 14 millionpackets per second with 64 bytes per packet. Evenafter polymerization, there are still millions of flowrecords. Though only dealing with flow data, a centralpoint of an operator would generate millions of recordsper second which will reach 30 billion per day.[10−13].Therefore, how to capture and store high arriving speedof packets and streams in real time is a major challenge.

2.2.2 High-performance packet/flow storagetechnologies

In many ITAS, packets and flow records are storedin relational databases by rows, This consumesmuch storage and exhibits inefficient retrievalcharacteristics. Current systems have begun tostore the data in columns, compressing it to reducestorage overhead. A popular compression method isLZO (LempelZivOberhumer), known for its focus onperformance. Other new methods such as RasterZipinclude information-flow optimization.

2.2.3 High-performance packet/flow indexingtechnologies

The accepted approach to indexing large dataquantities is to employ an inverted index; it is widelyused, for example, in search engines. However, amore efficient approach involves bitmap indexwhichis a general data retrieval technology widely used bysearch engines. Inverted index is not as efficient asbitmap index in dealing with the network flow queries,especially value retrieval. The index space for anenormous datastore will itself by huge; consequently,companies such as Google have paid much attentionto its compression[14,15]. In order to create an

efficient index, the general index space should befragmented (i.e., shard or segmentation), and the indexfile should be compressed at the same time. Bitmapcompression encoding algorithm is an effective methodfor compressing the index, to solve the index-spaceexplosion issue.

2.3 Categories of ITAS

We compare and summarize the features of Internettraffic-archiving systems in Table 3. Raw data can bestored in a database by rows or by columns. In bitmapindexes, the usual choice is by columns. For example,NET-FLi, an efficient on-the-fly compression, archivingand Indexing system for streaming network traffic,stores data in columns, and uses the LZO compressionalgorithm to reduce the storage size. Traffic archivingentails packet capture, storage, indexing, and querying.Different traffic-archiving systems focus on differentaspects. This paper takes the packet-capture andindex-compression algorithm as a starting point forunderstanding traffic-archiving systems.

3 Key technologies of bitmap-indexcompression coding

Efficient indexing of network packets or flow is centralto network archive systems. Index information has thefollowing attributes: (1) Massive data: The numberof index messages is massive, even for brief periods.(2) High rates of incoming data: To keep up with therate of packet influx, systems must be highly efficient.(3) Fixed data structure: The index information foreach network packet has a fixed format and a fixedlength. (4) Appending without modification: Networkpacket index information will increase only. Oncethe information is generated, it cant be changed.(5) High repeatability: Data items are frequentlyrepeated. Due to the characteristics of network flowdatarelational databases are not appropriate for thistask. Traditional relational databasesare optimized forhandling frequent changes. They use B or B+ trees,which are not particularly suitable for indexing networkflow information. Moreover, relational databasesneed additional high-speed indexing technology. Acommon technology of large-data retrieval systems isthe inverted index.. The core data structure of theInverted index is the posting list. Sequence of integersof a KEY is stored in inverted lists, such as a timestampand offset, and the most typical KEY is the positionlist, which shows the position where a word appears.

6 Tsinghua Science and Technology, December 2014, 18(2): 000-000

Table 3 A comparison of different traffic-archiving systems

Data format Indexing StoragePacket Flow record Bitmap Hash Tree Column based Row based

TelegraphCQ[16] no yes PostgreSQL(DBS) no noGigascope[17] no yes no no yes no no

MIND[18] no yes no yes no no noHyperion[19] yes no BloomFilter Multi-level indexing no no no

FastBit[20]+TelegraphCQ no yes WAH, PostgreSQL(DBS) yes noTimeMachine[21] yes no no yes no no yes

NET-FLi no yes COMPAX no no LZO noNetStore[22] no yes no no no Segmentation no

RASTERZip[23] no yes yes no no yes noPcapIndex[24] yes no COPMAX no no yes no

TifaFlow[25−27] yes no yes no no no no

Due to the characteristics of network flow data, netflow archiving systems use bitmap indexing rather thaninverted indexes.

3.1 Introduction to bitmap indexing

The concept of bitmap index was first introducedby Professor Israel Spiegler and Rafi Maayan in1985[28]. Patrick O’Neil published a paper aboutthe first commercial database product to implementa bitmap index in 1987[29−34]. It uses a sequenceof bits to indicate the the presence or absence ofan item in the indexed data. With bitmap indexing,logical operations (AND/OR/NOT/XOR, etc.) can beused to answer complex queries. Bitmap index wasdesigned for scientific data and database, which isusually generated by scientific instruments or scientificsimulation. Scientific data are extremely large and donot change. Bitmap index database solve the problemon how to quickly identify a small amount of data in amass of scientific data, while the traditional relationaldatabase is not suitable for this work. The technologiesused in bitmap index databases are bitmap indexing,bitmap compression, and classification. In Bitmapindex database, the data are stored in column and acorresponding bitmap index is created. A bitmap indexexample is shown in Table 4.

Table 4 An example of a bitmap index

RowID ColumnBitmap

=1 =2 =3 =41 1 1 0 0 02 2 0 1 0 03 4 0 0 0 14 3 0 0 1 05 2 0 1 0 06 4 0 0 0 1

Currently, representative bitmap index compressionalgorithms are BBC [35], WAH [36,37], PLWAH[38], EWAH [39], CONCISE [40], PWAH [41] andCOMPAX [42], UCB [43], VLC [44], VAL-WAH [45],RoaringBitmap [46], RLH [47], and DFWAH[48]. Thereare improved versions of PLWAH (PLWAH withadaptive counter), such as PLWAH+[49], and APLWAH.An improved version of COMPAX (COMPAX +oLSH and COMPAX2) has also appeared. Figure 4shows the various bitmap-index coding algorithms inchronological order.

Fig. 4 The advancement of Bitmap index compressionalgorithm.

3.2 Categories of bitmap-index compressionalgorithms

To summarize the characteristics of bitmap-indexcompression algorithms, we compare the bitmap-indexcompression algorithms based on different concepts,named Segmentation, Chunking, Merge Compress, andNear Identical, as shown in Table 5.

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 7

Table 5 Bitmap-index encoding algorithms

Concepts Segmentation Chunking Merge Compress Near Identical

Methods 4K 31/63bit Piggyback LFL/FLFDirty Bits( dirty/clean) Dirty Bits( dirty/clean)NI-0 NI-1 NI-0 NI-1

BBC no 8 yes no yes yes no noWAH no 31 no no no no no no

EWAH no 31 no no no no no noPWAH no 63 yes no no no no no

PLWAH no yes yes no yes no no noCONCISE no yes yes no yes no no noCOMPAX yes 31 no yes no no no yes

SECOMPAX yes 31 yes yes no no yes yesICX yes 31 yes yes no no yes yes

MASC yes no no yes no no yes yesPLWAH+ yes no yes yes yes yes no no

3.3 Bitmap-index compression encodingalgorithms

3.3.1 BBC (Byte-aligned Bitmap Compression)Antoshenkov [35] proposed the byte-aligned bitmap

compression method in 1995, the bitmap bytes areclassified as gaps and maps. Gaps containing only zerosor only ones and maps containing a mixture of both.Continuous gaps are encoded by their byte length anda fill bit is used to differ 0 and 1 gaps. Map bytesare directly compiled after the control byte withoutencoding. A pair (gap, map) is encoded into a singleatom which is composed as control byte + gap length +map. Control byte can be divided into four categories:

1. The bitmap bytes which contain 0-3 gaps and 0-15maps can be encoded as follows:

1 [fill bit] [fill length (2 bits) ] [tail length (4 bits)]A fill bit is used to represent all 0s or all 1s; fill length

is used to represent the length of gap; and tail length isused to indicate the length of a map.

2. Bitmap bytes that contains 0-3 gaps, with 1 bitdifferent between the gap and the map, can be encodedas follows:

01 [fill bit] [fill length (2 bits) ] [odd bit position (3bits)]

Odd-bit position is the different 1 bit (dirty bit)position between the map and the gap.

3. Bitmap bytes that contains more than 3 gaps and0-15 maps can be encoded as follows:

001 [fill bit] [tail length (4 bits)]Gap lengths follow the control byte.4. Bitmap bytes that contains more than 3 gaps with

1 bit different between the gap and the map can beencoded as follows:

0001 [fill bit] [odd bit position (3 bits)]

3.3.2 WAH (Word-Aligned Hybrid)WAH is the default algorithm the of Fastbit bitmap

compression database. It makes the source stream into31 bits (63 bits for WAH 64) as a group. There are twotypes of group: Literal, which contains 0 and 1 in 31bits; and Fill, which contains all 0s or all 1s in 31 bits.Literal Group: The type flag is 0; the remaining 31 bitsare the original Literal group. Fill Group: Divided into1-Fill and 0-Fill. The type flag is 1, and the second bitis the sign of Fill type (0-Fill is 0, 1-Fill is 1). Theremaining 30 bits are a counter, which indicates thenumber of consecutive 0-Fill (or 1-Fill) groups.

3.3.3 PLWAH (Position List Word-AlignedHybrid)

PLWAH (Position List Word-Aligned Hybrid)PLWAH groups the source stream into 31-bit units.

Here, too, there are Literal and Fill groups, but thereare some changes in compression. In WAH, the Fill-Group and Literal-Group have an added flag, andeach group codes as 32 bits. For Fill-Groups, thelower 25 bits are a counter. By introducing thedefinitions of nearly identical and position list, PLWAHfurther enriches the codebook, increasing the encodingtypes, and improving compression efficiency. A slightimprovement over PLWAH is achieved by the methodof PLWAH + adaptive counter.When there is a largenumber of consecutive 0s or 1s, so that a word cantfully accommodate a counter, this method uses the sametype of two consecutive fill-word, making two countersinto a single large counter. The first fill-word recordsthe 25 least-significant bits, and the second fill-wordrecords the 25 most-significant bits. Meanwhile, theposition list of the first fill-word is empty, and theposition list of the second fill-word is calculated as

8 Tsinghua Science and Technology, December 2014, 18(2): 000-000

usual. There is thus no need to extend the word-digit tocomplete continuous-stream coding tasks, which savesindex space.3.3.4 EWAH (Enhanced Word-Aligned Hybrid)

Fig. 5 EWAH encoding method.

EWAH defines a 32-bit field that contains consecutive0s or 1s as a clean segment, and defines a 32-bit fieldthat contains mixed 0s and 1s as a dirty segment. TheEWAH algorithm also uses two types of words, wherethe first type is a 32-bit verbatim word; the secondtype of word is a marker word, whose first bit indicateswhich clean word will follow, and 16 bits are used tostore the number of clean words. The remaining 15 bitsare used to store the number of dirty words followingthe clean words. EWAH bitmaps begin with a markerword, shown in Fig.5. Because EWAH uses only 16bits to store the number of clean words, it may be lessefficient than WAH when there are many consecutivesequences of clean words. But the seriousness of thisproblem is limited, and EWAH is feasible. Three waysare proposed to alleviate this compression overheadover clean words. First, more than half of the bitscan be allocated to encode the runs of clean words.Second, when a marker word indicates a run of 216clean words, the next word will be used to indicatethe number of remaining clean words. Finally, thiscompression penalty is less relevant when using 64-bitwords instead of 32-bit words. When there are few dirtywords in the bitmaps, WAH might be preferable. Evenconsidering EWAH and WAH indexes of similar sizes,each EWAH marker word needs to be accessed threetimes to determine the running bits and two runninglengths, while no word needs to be accessed more thantwice with WAH. When there are lots of long runs ofdirty words in the bitmaps, EWAH might be preferable.EWAH will access each dirty word at most once, whileWAH needs to check the first bit of each dirty word to

ascertain it is a dirty word. The EWAH decoder canskip a sequence of dirty words, while the WAH decodermust access them all.

3.3.5 CONCISE (Compressed n ComposableInteger Set)

CONCISE is based on WAH. In a compressed 32-bit segment, when the leftmost bit is 1, the following31 bits are uncompressed; when the leftmost bit is 0,it means fill, and the second bit indicates the type offill. The following 5 bits are the position of a flipped bitwithin the first 31-bit block of the fill. The remaining25 bits count the number of 31-blocks that composethe fill, minus one. When position bits equal 0 (binary00000), the word is a pure fill. Otherwise, position bitsindicate the bit to switch within the first 31-bit block ofthe sequence represented by the fill word. That meansthat 1 (binary 00001) indicates that it has to flip therightmost bit, while 31 (binary 11111) indicates thatit has to flip the leftmost bit. In Fig.6, for example,words #0, #3, and #5 are Literal words, and words#1, #2, and #4 are fill words. Word #0 is used torepresent integers in the range 030; word #1 for integersin the range 3192; and word #2 for integers in the range931022. Word #2 also indicates that integers in therange 941022 are missing, but 93 is in the set, sinceposition bits say that the first number of the missingnumbers sequence is an exception. Word #3 for integersin the range 10231053, word #4 for integers in the range10541,040,187,391, and word #5 for integers in therange 1,040,187,3921,040,187,422.

Fig. 6 CONCISE encoding method.

3.3.6 CONCISE (PWAH (Partitioned Word-Aligned Hybrid)

Based on WAH, PWAH extends word length to 64bits, and makes a word into P pieces, starting with a P-bit header, which is used to indicate the type of fill orLiteral. The length of Literal can be indicated flexibly

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 9

Fig. 7 PWAH-4 bitmap encoding method.

to save space.There are three kinds of PWAH algorithmPWAH-2,

-4, and -8. PWAH-2 uses WAH compression coding.In the experiments, PWAH-8 is used more frequentlythan the others. Combined with the Nuutila algorithm,PWAH-8 can be effective in solving the accessibilityquery problem of large-scale graphs.

3.3.7 CONCISE (COMPAX (COMPressedAdaptive indeX)

There are more signs in COMPAX. The definitionsof Literal and Fill are same as with WAH and PLWAH.Every 31 bits are a chunk, and the chunks are dividedinto four groups according to the following features: 1.Literal-Fill-Literal (LFL) 2. Fill-Literal-Fill (FLF) 3.Fill (F) 4. Literal (L) The COMPAX+oLSH (online-locality-sensitive-hashing) compression method is thesame as that of COMPAX, but this compression methoddeals with the input stream first, which improvescompression rate considerably.

3.3.8 SECOMPAX/ICX and MASC/PLWAH+1. New ideas in bitmap-index encodingSECOMPAX/ICX/MASC/PLWAH+ are proposed

based on COMPAX/PLWAH. SECOMPAX/ICX isbased on COMPAX2, introduces the concept of NI-1 Literal, and extends the number of dirty bytes to2; while MASC/PLWAH+ is based on PLWAH, andintroduces new features from run-length encoding. Interms of piggyback (the combination of two symbols)and FLF/LFL (the combination of three symbols), thecurrent bitmap-indexing encoding algorithm can beclassified as two types. According to these categories,the state-of-the-art bitmap-index encoding algorithmsare shown as follows in Fig.8.

(1) SECOMPAX (Scope Extended COMPAX)SECOMPAX [50] is based on COMPAX2, an enhancedversion of COMPAX. The differences are as follows:a. Nearly identical Literal (NI-L) SECOMPAX extendsthe dirty byte concept to consist of 0-NI-L and 1-NI-Li.e., a Literal word can be nearly identical to a 0-fillword or a 1-fill word. This maintains symmetry in the

codebook. b. LFL and FLF extended type LFL: Inaddition to the 0-NI-L + F + 0-NI-L types, there arethree others: 0-NI-L + F + 1-NI-L, 1-NI-L + F + 0-NI-L, and 1-NI-L + F + 1-NI-L. FLF: In addition to the 0F +0-NI-L + 0F1F+0-NI-L+1F types, there are six others:0F + 1-NI-L + 0F, 1F + 1-NI-L + 1F; 1F + 1-NI-L +0F, 0F + 1-NI-L + 1F; 1F + 0-NI-L + 0F, and 0F + 0-NI-L + 1F; By the inclusion of new codewords, somekinds of origin sequences that COMPAX cannot furthercompress can be dealt with by SECOMPAX: a. In theform of FLF, if two fill words are of different types, or aLiteral word is nearly identical to fill word, COMPAXdoes not try to encode them as FLF. COMPAX encodesthem as an F, an L, and an F, while SECOMPAX canencode them as FLF. Obviously, in this circumstance,the result encoded by COMPAX is twice as long asthose encoded by SECOMPAX. b. In the form of LFL,COMPAX can only deal with: both of the two Literal-Words are nearly identical to 0-Fill words. If one orboth of the Literal words are identical to a 1-Fill word,COMPAX encodes them as an L-Word, an F-Word, andan L-Word (L+F+L, 3 words in total), not as LFL (1word in total). However, SECOMPAX can encode allof those circumstances as LFL, which saves a great dealof room compared with COMPAX.

(2) ICX (Improved COMPAX) The ICX schemeis proposed, to further compress the bitmap byconsidering the possible two dirty-bytes cases, whichare not considered in the COMPAX scheme. And weextend PLWAH nearly identically concept to considerone or two dirty bytes case, and represent these caseswith NI (nearly identical) bits in the new codebook.By this new codebook, more possible LF combinationscan be represented, and they can be easily compressedinto one word. ICX can achieve a better compressionratio than WAH and COMPAX. In the cases wherethe number of 1s is comparable to the number of 0s,ICX performs especially better than both WAH andCOMPAX, since they havent taken those cases intoconsideration.

10 Tsinghua Science and Technology, December 2014, 18(2): 000-000

Fig. 8 New ideas in Bitmap Index encoding.

(3) MASC (MAximized Stride with Carrier) MASC(MAximized Stride with Carrier) is proposed tofurther improve the compression performance withoutimpairing query performance. MASC uses as long astride size as possiblenot limited to 31 bits as in PLWAHand COMPAXto encode the consecutive zero bits andnonzero bits in a more compact way. MASC recordsorigin bitmap index sequences into a new format.MASC is a new extended version of PLWAH. Theconcept carrier in MASC and piggybacked in PLWAHare similar. However, the carrier can carry at most30 nonzero bits, while PLWAH can piggyback onlya single nonzero bit. In addition, we generalize theconcept of Literal word and eventually obsolete it. Asa consequence, several (no more than 30) nonzero bitscan be carried by the former 0-fill word and output a 1-carried 0-fill word, while PLWAH has to encode themin a Literal word, or two Literal words in the worstcase, when consecutive nonzero bits are located in twoadjacent chunks. Considering zero bits and nonzerobits distribution in real data sets, they usually appear inbatchesespecially after being sorted by the hash valueof each record. Thus MASC can perform better thanPLWAH.

(4) PLWAH+ (Position List Word-Aligned HybridPlus) We propose the PLWAH+ (Position List WordAligned Hybrid Plus) bitmap compression schemebased on PLWAH. First, we add the definition ofan LF word that can piggyback more NI words,which is not considered in PLWAH. According to theexperiment, with about a 20% reduction in the amountof Literal words, it can save 3% or more of the storage,compared to PLWAH. And the result shows that the

PLWAH+ is more suitable for streaming network traffic.Furthermore, the definition of the NI Chunk is enlarged,which performs better in some cases, where the ratioof set bits is not at a low level. PLWAH+ furtherclassifies NI into two kindsNearly Identical 0 fill word(NI-0 word) and Nearly Identical 1 fill word (NI-1word). According to a number of tests, a prediction canbe made that PLWAH+ is more suitable for complexdatabases than PLWAH.

2. Encoding algorithm state diagramsThe state diagrams of SECOMPAX, ICX, MASC,

and PLWAH+ are shown in Fig.9-Fig.12. Figure 9shows the state diagram of the SECOMPAX encodingalgorithm. The states of the encoder represent thedifferent word types that have been stored. The symbolsA, B, and C represent: A: Fill, B: NI-Fill, C: Literal.Symbols labeling the edges are introduced as follows:(x, y) means the next character is x, and the output isyexcept if y=null, output nothing, and if y is underlined,the output is a codeword as a whole.

Figure 10 introduces the encoding state diagram forICX. As ICX handles new cases when the numberof dirty bytes is 2, there are more edges than inSECOMPAX.

Figure 11 shows the state diagram of the MASCencoding algorithm, wherein states 0: 0-fill; 1: 1-fill;and C: 1-carried-0-fiil (0-fill word carried 1s). Themeaning of transfer edge (x, y, z) is as follows: x: inputbit (0 or 1); y: output bit; z: the counter. If label(x, y) does not occur, the counter will increase by 1,by default. As MASC encodes the maximu length ofconsecutive 0s or 1s, the counter is not in multiples of31, as in WAH.

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 11

Fig. 9 Encoding algorithm of SECOMPAX.

Fig. 10 Encoding Automaton in ICX.

Fig. 11 Encoding Automaton in MASC.

The PLWAH+s FSM is shown in Fig.12, whereinStart is the start state of a compression procedure,symbol F stands for codeword Fill word, symbol Pstands for codeword NI word, and symbol L stands forcodeword Literal word.

Fig. 12 Encoding Automaton in PLWAH+.

The pair (x, y) labels the edge to stand for an actiontaken at a shift of states, which means when x is theinput from a WAH result, the state moves along thecorresponding edge and y is the output to the finalresult. In particular, if y equals null, it outputs nothing.If y has one symbol like FP, it outputs a combination ofte two codewordsi.e., FL. If y has two symbols, such asF L, it outputs two codewords, F and L.

3. Implementation and evaluationWe evaluate SECOMPAX, ICX, MASC, MASC

and PLWAH+ with a real Internet traffic trace fromCAIDA[51]. This Internet traffic trace was anonymizedand captured from a core router by CAIDA at the end of2013. There are 13,581,181 packets in this trace. First,we reorder packets with the mechanism based on theprinciple of locality-based hashing used in [52]. Thenwe convert five-tuples ¡SIP, sport, DIP, dport, proto¿ tobitmaps, and compress the bitmaps in each column witha fixed block size of 4 Kbit, which is also used in [52].In these experiments, source IP (4 bytes), destinationIP (4 bytes), source port (2 bytes), and destination port(2 bytes) are compressed with PLWAH, COMPAX, andSECOMPAX schemes. For SrcIP, there are four bytes,and each byte expands to 256 columnar files. There are1024 columnar files, which contain 13,581,181 bits (thenumber of packets in the trace). The situation is similarfor the DstIP and Ports fields.

12 Tsinghua Science and Technology, December 2014, 18(2): 000-000

Fig. 13 Bitmap-index encoding withPLWAH/COMPAX2/SECOMPAX/ICX/MASC.

Figure 13 shows the average size of a compressedcolumnar file with PLWAH, COMPAX2, andSECOMPAX, ICX, and MASC, where each originalcolumnar file has 13,581,181 bits (the number ofpackets in the trace). Compared with PLWAH,COMPAX2 and SECOMPAX reduce the size of theindex for a source IP address by 6.74%, and fora destination IP by 6.05%. While compared withCOMPAX2, SECOMPAX reduces the size of thesource IP by 4.01%, and destination IP by 3.97%. ICXand MASC show further reduction in compressingthe bitmap files. A detailed comparison is illustratedfor source IP (4 bytes), destination IP (4 bytes),source port (2 bytes), and destination port (2 bytes)in Fig.14-Fig.17. From Fig.14-Fig.17, it is clear thatSECOMPAX/ICX/MASC have a better compressionratio and smaller space consumption than COMPAX2and PLWAH, especially in Byte0 in SrcIP and DstIP.Compared with PLWAH, SECOMPAX can reducethe size of the index for SrcIP Byte 0 by 7.62% andDstIP by 8.32%. Among these, MASC has the bestperformance, as the improvement it shows reachesabout 16%-18%.

4. GPU implementationUsually, the encoding and decoding operation in

the process of compressing runs in a CPU. To furtheraccelerate the compression, we propose a GPU-basedsolution, to offload the indexing and parallelize theencoding. As the CPU is also responsible foroverall system operation, in consideration of totalperformance issues, it is better to offload the bitmapindexing operation to a GPU. Witold Andrzejewski[53,54] introduced GPU-based WAH and PLWAH in[30,31]. However, those implementations could not avoidextending the original data into bitmap form before

Fig. 14 Size after compression using five encoding schemesin DstIP.

Fig. 15 Size after compression using five encoding schemesin DstPort.

Fig. 16 Size after compression using five encoding schemesin SrcIP.

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 13

Fig. 17 Size after compression using five encoding schemesin srcPort.

processing, which increased memory consumption anddecreased performance. A new way to compressbitmaps without extending the original data wasintroduced in [39], which will be also used in thispaper. Especially in [55], Fusco et al. evaluate theGPU-based WAH and PLWAH with a sequence ofrandom integers to mimic the five-tuples of Internettrace, and prove the potential of a GPU to achievethe speed of indexing a million of packets persecond. GPU-based SECOMPAX, ICX, and MASCalgorithms are implemented with Thrust, a C++ libraryprovided with the NVIDIA SDK, designed to enhancecode productivity andmore importantlyperformanceand portability across NVIDIA GPUs. To evaluatethe performance of our implementation, we used asimilarly priced CPU and GPU: a 3.4-GHz Intel i7-2600K processor with 8 Mb of cache and an NVIDIAGTX-760 GPU, fitted in a PCI-e Gen 2.0 slot. Theinput data we used is an anonymous Internet trace dataset from CAIDA of 13,581,810 packets. We extracttheir five-tuples (source IP, destination IP, source port,destination port, protocol number). For simulatingcircumstances in practice, we cut those packets by 3,968(128*31), and created bitmap indexes for 14 (bytes) *3,968 one at a time.

Table 6 GPU/CPU bitmap encoding scheme environmentsfor comparison.

dataset 13,581,810 recordsGPU GeForce GTX 760 (1152 cores)CPU Intel i7-2600K@ 3.4 GHz(quad cores)OS Ubuntu 13.04 64 bit

We construct and compress a bitmap index usingGPU-MASC, and compare its result with the encoding

result of MASC. The memory consumption is the same,and is shown in Figure 18. However, GPU-MASCcan build bitmap indexes for 128*31 packets in 8.057milliseconds, while MASC takes 157.3 milliseconds,because MASC cannot encode in parallel on a CPU.Thus, GPU-MASC improves encoding speed by 19.5times.

Fig. 18 Encoding-speed comparison between GPU-MASCand CPU-MASC.

Based on Fig.19, the throughput of GPU-MASC is492,491 packets per second. However, GPU-MASCconstructs and encodes bitmap indexes for all 14 bytesin the five tuple for Internet trace packets, while otheralgorithms on GPU only construct two bytes of the five-tuple, one at a time. So the equivalent throughput forGPU-MASC is 3,447,437 packets per second.

Fig. 19 Encoding time of GPU-WAH and GPU-MASC.

4 Conclusion and future works

We introduce and analyze the traffic-archiving systemsand the key technologies of bitmap-index encoding.First, we provide an introduction to classical traffic-archiving systems in recent years; then we offer asurvey of bitmap-index encoding algorithms. We

14 Tsinghua Science and Technology, December 2014, 18(2): 000-000

summarize bitmap-index encoding algorithms in termsof Segmentation, Chunking, Merge Compress andNI (Near Identical). We also propose somenew bitmap encoding algorithmsSECOMPAX, ICX,MASC, PLWAH+, and show their state diagrams forencoding procedures. We also evaluate their CPUand GPU implementations with a real Internet tracefrom CAIDA. Finally, we summarize and discussthe future direction of bitmap-index compressionalgorithms. With the rapid growth of Internettraffic, traffic-archiving systems and bitmap-indexcompression scheme research will face new challengesand risks, which must be overcome to design moreefficient high-speed indexing technologies. Theseimproved bitmap-index compression schemes will haveimportant research value, which will provide powerfultechnical support for high-performance network-trafficarchiving systems.

Acknowledgements

We are grateful to the teachers and students both inNSLAB and QoSlab. The authors would like to thankProf. Jun Li of NSLAB from RIIT for his guidance.

This work is supported by Ministry of Scienceand Technology of China under the National KeyBasic Research and Development (973) Program ofChina (Nos. 2012CB315801 and 2011CB302805),the National Natural Science Foundation of China A3Program (No. 61161140320) and the National NaturalScience Foundation of China (No. 61233016). Thiswork is also supported by Intel Research Council withthe title of Security Vulnerability Analysis based onCloud Platform with Intel IA Architecture.

References

[1] Google, Web History, http://www.google.com/psearch,2008.08.10.

[2] Business Intelligence Lowdown,http://www.businessintelligeneelowdown.corn/2007/02/top 10 largest .html,2008.08.10.

[3] Cisco Visual Networking Index:Forecast and Methodology, 20132018,http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white paper c11-481360.html, 2014.06.10.

[4] W. Huang, Z. Chen, W. Dong, H. Li, B. Cao, and J.Cao, Mobile Internet Big Data Platform in China Unicom,Tsinghua Science and Technology, 1, 010, 2014.

[5] Z. Chen, W. Huang, and J. Cao, Big Data Engineering forInternet Traffic, Beijing: Tsinghua University Press, 2014.

[6] R. Lin, B. Wu, F. Yang, Y. Zhao, and J. Hou, An efficientadaptive failure detection mechanism for cloud platform

based on volterra series, Communications, China, 11(4),1-12.

[7] Y. Meng, Z. Luan, and D. Qian, Differentiatingdata collection for cloud environment monitoring,Communications, China,11(4), 13-24.

[8] L. Pu, J. Xu, B. Yu, and J. Zhang, Smart cafe: A mobilelocal computing system based on indoor virtual cloud,Communications, China,11(4), 38-49.

[9] M. Li, A. Lukyanenko, S. Tarkoma, and A. Yla-Jaaski,MPTCP incast in data center networks, Communications,China, 11(4), 25-37.

[10] L. Rizzo, Device polling support for FreeBSD,http://info.iet.unipi.it/ luigi/polling/, 2002.

[11] L. Deri, Improving passive packet capture: Beyond devicepolling, In Proceedings of SANE, 2004, pp. 85-93.

[12] L. Rizzo, netmap: a novel framework for fast packet I/O,In Proceedings of the 2012 USENIX conference on AnnualTechnical Conference, USENIX Association, June 2012,pp. 9-9.

[13] A. Papadogiannakis, M. Polychronakis, and E. P.Markatos, Scap: stream-oriented network traffic captureand analysis for high-speed networks. In Proceedings ofthe 2013 conference on Internet measurement conference,October 2013, pp. 441-454.

[14] L. A. Barroso, J. Clidaras, and U. Hlzle, TheDatacenter as a Computer: An Introductionto the Design of Warehouse-Scale Machines,Synthesis Lectures on Computer Architecture ,doi:10.2200/S00516ED2V01Y201306CAC024.

[15] J. Dean, Challenges in Building Large-Scale InformationRetrieval Systems, Web Search and Data MiningConference (WSDM) keynote, February 2009.

[16] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J.Franklin, J. M. Hellerstein, W. Hong, and M. A.Shah, TelegraphCQ: continuous dataflow processing, InProceedings of the 2003 ACM SIGMOD internationalconference on Management of data ACM, June 2003, pp.668-668.

[17] C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk,Gigascope: a stream database for network applications,In Proceedings of the 2003 ACM SIGMOD internationalconference on Management of data ACM, June 2003, pp.647-651.

[18] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, andW. H. Hong, Advanced indexing techniques for wide-area network monitoring, In Data Engineering Workshops,2005. 21st International Conference on IEEE, April 2005,pp. 1184-1184.

[19] P. Desnoyers, and P. J. Shenoy, Hyperion: High VolumeStream Archival for Retrospective Querying, In USENIXAnnual Technical Conference, June 2007, pp. 45-58.

[20] K. Wu, FastBit: an efficient indexing technology foraccelerating data-intensive science, Journal of Physics:Conference Series, Vol. 16. No. 1. IOP Publishing, 2005.

[21] G. Maier, R. Sommer, H. Dreger, A. Feldmann, V.Paxson, and F. Schneider, Enriching network securityanalysis with time travel, In ACM SIGCOMM ComputerCommunication Review, ACM, August 2008, Vol. 38, No.4, pp. 183-194.

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 15

[22] P. Giura and N. Memon, NetStore: an efficient storageinfrastructure for network forensics and monitoring, InRecent Advances in Intrusion Detection, Springer Berlin,January 2010, pp. 277-296.

[23] F. Fusco, M. Vlachos, and X. Dimitropoulos, RasterZip:compressing network monitoring data with support forpartial decompression, In Proceedings of the 2012 ACMconference on Internet measurement conference, ACM,November 2012, pp. 51-64.

[24] F. Fusco, X. Dimitropoulos, M. Vlachos, and L.Deri, PcapIndex: an index for network packet traceswith legacy compatibility, ACM SIGCOMM ComputerCommunication Review, 42(1), 47-53, 2012.

[25] J. Li, S. Ding, M. Xu, F. Y. Han, X. Guan,and Z. Chen, TIFA: Enabling Real-Time Queryingand Storage of Massive Stream Data, In Networkingand Distributed Computing (ICNDC), 2011 SecondInternational Conference on, IEEE, pp. 61-64.

[26] Z. Chen, L. Ruan, J. Cao, Y. Yu, and X. Jiang, TIFAflow:enhancing traffic archiving system with flow granularityfor forensic analysis in network security, Tsinghua Scienceand Technology 18, no. 4, 2013.

[27] Z. Chen, X. Shi, L. Ruan, F. Xie and J. Li, High SpeedTraffic Archiving System for Flow Granularity Storage andQuerying, ICCCN 2012 workshop on PMECT, 2012.

[28] I.Spiegler, and R.Maayan, Storage and retrievalconsiderations of binary data bases, Informationprocessing and management, 21(3), 233-254, 1985.

[29] P. Cheng, bitmap index techniques and its researchadvancement, Science and technologies information,026,134-135, 2010.

[30] Z. Huang, W. Lv, and J. Huang, Improved BLASTalgorithm based on bitmap indexes and B+ tree, ComputerEngineering and Applications, 49(11), pp.118-120, 2013.

[31] B. YANG, Y. Qi, Y. Xue and J. Li, Bitmap data structure:Towards high-performance network algorithms designing,Computer Engineering and Applications, 45(15), 2009.

[32] H. Garcia-Molina, J. D. Ullman, and J. Widom, DatabaseSystem implementation, Second Edition, Prentice Hall,2009.

[33] P. E. O’Neil, Model 204 architecture and performance,High Performance Transaction Systems, Springer BerlinHeidelberg, 39-59, 1989.

[34] P. O’Neil, and D. Quass, Improved query performancewith variant indexes, In ACM Sigmod Record, vol. 26, no.2, pp. 38-49. ACM, 1997.

[35] G. Antoshekov, Byte-aligned bitmap compression, InData Compression Conference, 1995, DCC’95, IEEE,Proceedings p. 476.

[36] K. Wu, E. J. Otoo, and Arie Shoshani, Compressingbitmap indexes for faster search operations, In Scientificand Statistical Database Management, 2002, Proceedings,14th International Conference on, pp. 99-108. IEEE, 2002.

[37] K. Wu, E. J. Otoo, and A. Shoshani, Optimizing bitmapindices with efficient compression, ACM Transactions onDatabase Systems (TODS), 31(1), 1-38, 2006.

[38] F. Deli‘ege and T. B. Pedersen, Position list word alignedhybrid: optimizing space and performance for compressedbitmaps, In Proceeding of the 13th InternationalConference on Extending Database Technology, 2010.

[39] D. Lemire, O. Kaser, and K. Aouiche, Sorting improvesword-aligned bitmap indexes, Data & KnowledgeEngineering, 69(1), 3-28, 2010.

[40] A. Colantonio and R. Di Pietro, Concise: Compressedncomposable integer set, Information Processing Letters,110(16), 644-650, 2010.

[41] S. J. van Schaik and O. de Moor, A memory efficientreachability data structure through bit vector compression,In Proceedings of the 2011 ACM SIGMOD InternationalConference on Management of data, pp. 913-924. ACM,2011.

[42] F. Fusco, M. P. Stoecklin, and M. Vlachos, Net-fli: on-the-fly compression, archiving and indexing of streamingnetwork traffic, Proceedings of the VLDB Endowment, 3(1-2), 1382-1393, 2010.

[43] G. Canahuate, M. Gibas, and H. Ferhatosmanoglu,Update conscious bitmap indices, 19th IEEE InternationalConference on Scientific and Statistical DatabaseManagement SSBDM’07, 2007.

[44] F. Corrales, D. Chiu, and J. Sawin, Variable LengthCompression for Bitmap Indices, in DEXA11, pp. 381395,Springer-Verlag, 2011.

[45] G. Guzun, G. Canahuate, D.Chiu, and J. Sawin, Atunable compression framework for bitmap indices, IEEEInternational Conference on Data Engineering (ICDE2014), pp. 484-495, 2014.

[46] S. Chambi, D. Lemire, O. Kaser, and R. Godin, Betterbitmap performance with Roaring bitmaps, arXiv preprintarXiv:1402.6407 (2014).

[47] M. Stabno and R. Wrembel, RLH: Bitmap compressiontechnique based on run-length and Huffman encoding,Information Systems 34, no. 4, pp.400-414, 2009.

[48] A. Schmidt, D. Kimmig, and M. Beine, A Proposal of aNew Compression Scheme of Medium-Sparse Bitmaps,International Journal On Advances in Software, 4(3 and4), 401-411.

[49] J. Chang, Z. Chen, W. Zheng, Y. Wen, J. Cao, andW. L. Huang, PLWAH+: a bitmap index compressingscheme based on PLWAH, In Proceedings of the tenthACM/IEEE symposium on Architectures for networkingand communications systems, ACM, 2014, pp. 257-258.

[50] Y. Wen, Z. Chen, G. Ma, J. Cao, W. Zheng, G. Peng, andW. L. Huang, SECOMPAX: A bitmap index compressionalgorithm, In Computer Communication and Networks(ICCCN), 2014 23rd International Conference on, IEEE,pp. 1-7.

[51] M. F. Oberhumer, The Lempel-Ziv-Oberhumer Packer,http://www.lzop.org/, November 2010.

[52] F. Reiss, K. Stockinger, K. Wu, A. Shoshani, andJ. M. Hellerstein, Enabling real-time querying oflive and historical stream data, In Scientific andStatistical Database Management, 2007. SSBDM’07. 19thInternational Conference on, IEEE, July 2007, pp. 28-28.

16 Tsinghua Science and Technology, December 2014, 18(2): 000-000

[53] W. Andrzejewski, and R. Wrembel, GPU-WAH: ApplyingGPUs to compressing bitmap indexes with word alignedhybrid, In Database and Expert Systems Applications,Springer Berlin Heidelberg, January, 2010, pp. 315-329.

[54] W. Andrzejewski, and R. Wrembel, GPU-PLWAH: GPU-based implementation of the PLWAH algorithm for

compressing bitmaps, Control & Cybernetics, 40(3), 2011.[55] F. Fusco, M. Vlachos, X. Dimitropoulos, and L. Deri,

Indexing million of packets per second using GPUs,In Proceedings of the 2013 conference on Internetmeasurement conference, pp. 327-332. ACM, 2013.

Zhen Chen is an associate professorin Research Institute of InformationTechnology in Tsinghua University. Hereceived his B.E. and Ph.D. degrees fromXidian University in 1998 and 2004. Heonce worked as postdoctoral researcherin Network Institute of Department ofComputer Science in Tsinghua University

during 2004 to 2006. He is also a visiting scholar in UCBerkeley ICSI in 2006. He joined Research Institute ofInformation Technology in Tsinghua University since 2006.His research interests include network architecture, computersecurity, and data analysis. He has published around 80 academicpapers.

Yuhao Wen is an undergraduate studentstudying in Department of ElectronicEngineering at Tsinghua University. Hisresearch interests include big data andnetworks.

Wenxun Zheng is an undergraduatestudent studying in Department ofPhysics at Tsinghua University. Hisresearch interests is now on bitmap indexcompression.

Jiahui Chang is an undergraduate studentstudying in Department of AerospaceEngineering at Tsinghua University. Hemajors in Engineering Mechanics.

Guodong Peng is an undergraduatestudent studying in Department ofMechanical Engineering at TsinghuaUniversity. His research interests includemechanical design and network security.

Yinjun Wu is an undergraduate studentstudying in Department of Automation atTsinghua University. His research interestsinclude bitmap indexing algorithms.

Ge Ma is currently a master studentstudying in Department of Automation atTsinghua University. He got his bachelordegree from Tsinghua University in 2013.His research interests include futurenetwork and bitmap index compressing.

Mourad Hakmaoui is a postgraduatestudent studying for PhD in Departmentof Computer Science and Technologies atTsinghua University. He got his masterand bachelor degrees from University ofMohammed 5th-Morocco in 2008 and2012, respectively. His research interestsinclude enhance bitmap indices and fast

query for Big Data.

Zhen Chen et al.: A Survey on Bitmap Index Compression Algorithms for Internet Traffic Archival Systems 17

Junwei Cao is currently Professor andDeputy Director of Research Instituteof Information Technology, TsinghuaUniversity, China. He is also Director ofOpen Platform and Technology Division,Tsinghua National Laboratory forInformation Science and Technology. Hisresearch is focused on advanced computing

technology and applications. Before joining Tsinghua in 2006,Junwei Cao was a Research Scientist of Massachusetts Institute

of Technology, USA. Before that he worked as a researchstaff member of NEC Europe Ltd., Germany. Junwei Caogot his PhD in computer science from University of Warwick,UK, in 2001. He got his master and bachelor degrees fromTsinghua University in 1998 and 1996, respectively. Junwei Caohas published over 130 academic papers and books, cited byinternational researchers for over 3000 times. Junwei Cao is aSenior Member of the IEEE Computer Society and a Member ofthe ACM and CCF.