
Design and Implementation of a Scalable and

Flexible Traffic Analysis Platform

Yuuki Takano, Ryosuke Miura, Shingo Yasuda,

Kunio Akashi, and Tomoya Inoue

Application-level network traffic analysis and sophisticated analysis techniques, such as machine learning

and stream data processing for network traffic, require considerable computational resources. In addition,

developing an application protocol analyzer is a tedious and time-consuming task. Therefore, we propose a

scalable and flexible traffic analysis platform (SF-TAP) for the efficient and flexible application-level stream

analysis of high-bandwidth network traffic. By using our flexible modular platform, developers can easily

implement multicore scalable application-level stream analyzers. Furthermore, as SF-TAP is horizontally

scalable, it manages high-bandwidth network traffic. To achieve this scalability, we separate the network

traffic based on traffic flows, and forward the separated flows to multiple SF-TAP cells, each comprising

a traffic capturer and application-level analyzers. This study discusses the design, implementation and

detailed evaluation of SF-TAP.

1 Introduction

Application-level network-traffic control, such as
network traffic engineering and intrusion detection
systems (IDSs), usually involves complicated analy-

ses requiring considerable computational resources.

Therefore, the present paper proposes a scalable

and flexible traffic analysis platform (SF-TAP) that

Design and Implementation of a Scalable and Flexible Traffic Analysis Platform (in Japanese).
Yuuki Takano, Graduate School of Engineering, Osaka University.
Yuuki Takano, Ryosuke Miura, Shingo Yasuda, Cybersecurity Laboratory, National Institute of Information and Communications Technology.
Shingo Yasuda, Cyber Training Laboratory, National Institute of Information and Communications Technology.
Kunio Akashi, Tomoya Inoue, Network Testbed Research and Development Laboratory, National Institute of Information and Communications Technology.
Computer Software, Vol. 36, No. 3 (2019), pp. 85–103. [Software Paper] Received August 5, 2018.

runs on commodity hardware.

Overall, application-level network traffic analy-

sis is hampered by two problems. First, there

are numerous application protocols, with new

protocols being defined and implemented every

year. However, implementing application protocol
parsers and analyzers is a tedious and time-consuming
task. A straightforward implementa-

tion of the parser and analyzer, which is crucial for

application-level network traffic analysis, has pre-

viously been attempted using domain-specific lan-

guage (DSL) approaches. For example, the parser

BinPAC [28] was proposed for application protocols

built into Bro IDS software. Wireshark [44] and

Suricata [39] bind the Lua language to implement
analyzers. Unfortunately, most researchers and
developers tend to adopt specific programming
languages for specific purposes, such as machine
learning, but DSLs may not be suitable for other
domains.

The second problem in application-level network

traffic analysis is the low scalability of conventional

software. Traditional network traffic-analysis ap-

plications such as tcpdump [41], Wireshark, and


Snort [38] are single-threaded and cannot exploit

multiple CPU cores during the analysis. To im-

prove the utilization of CPU cores, researchers have

proposed various software solutions. Examples
include GASPP [43], which exploits GPUs for
high-bandwidth, flow-based traffic analysis, and
SCAP [29], which utilizes multiple CPU cores and
is implemented as a Linux kernel module. Although
transmission control protocol (TCP) flows must be
reconstructed efficiently, the efficiency of the parsing
and analyzing programs is even more critical because
these components perform deep analyses such as
pattern matching or machine learning, which are
computationally demanding. Therefore, when analyz-

ing high-bandwidth traffic, both the TCP flow re-

construction and traffic-analyzing components re-

quire multicore scaling. Moreover, for simple, cost-

effective deep analysis of high-bandwidth network

traffic, the corresponding application-level analysis

platform must have horizontal scalability.

To mitigate the abovementioned issues, we have

designed and implemented an SF-TAP for high-

bandwidth application-level traffic analysis. SF-

TAP adopts a flow abstraction mechanism that

abstracts the network flows as files, similarly to

Plan 9 [32], UNIX’s /dev, and the BSD packet fil-

ter (BPF) [22]. Through the developed interfaces,

logic developers can rapidly and flexibly implement

application-level analyzers in any language. Fur-

thermore, the modular architecture can separate

the L3/L4-level controlling and application-level

analyzing components. This design enables flexible

implementation, dynamic updates, and multicore

scalability of the analyzing components.

This paper was originally published at the 29th
Large Installation System Administration Conference
(LISA 2015) [40], and we have updated it as follows.
• This paper presents a comprehensive evaluation
(Section 5. 4), whereas the LISA paper reported
only evaluations of each component.
• We adopted netmap [34] to improve the performance
of the flow abstractor, a component of SF-TAP.
• We eliminated a bottleneck of the flow abstractor
that did not appear before netmap was adopted
(Appendix B).
• We completely re-evaluated the flow abstractor
(Section 5. 2) after applying the above fixes.

For scientific reproducibility purposes, our proof-

of-concept implementation is distributed on the

Web (see [36]) under BSD licensing. Therefore, it is

freely available for use and modification.

2 Design Principles

This section discusses the design principles of SF-

TAP, i.e., the abstraction of network flows, multi-

core and horizontal scalabilities, and modularity.

We first describe the high-level architecture of

SF-TAP. As shown in Fig. 1, SF-TAP has two

main components: the cell incubator and the flow

abstractor. A group of one flow abstractor and sev-

eral analyzers is called a cell. The cell incubator

provides the horizontal scalability; that is, it cap-

tures the network traffic, separates it based on flow

identifiers consisting of source and destination IP

addresses and port numbers, and forwards the sep-

arated flows to specific target cells. Whereas con-

ventional approaches using packet capture (pcap)

or other methods cannot manage high-bandwidth

network traffic [42], our approach has successfully

managed 10 Gbps of network traffic using netmap,

multiple threads, and lightweight locks.

Furthermore, by separating the network traf-

fic, we can manage and analyze high-bandwidth

network traffic on multiple computers. The flow

abstractor receives flows from the cell incubator,

reconstructs the TCP flows, and forwards them

to multiple application-level analyzers, thus pro-

viding multicore scalability. This multicore and

horizontally scalable architecture enables efficient

application-level traffic analysis, which typically re-

quires considerable computational resources.

2. 1 Flow Abstraction

Existing DSL-based applications, such as the
aforementioned Wireshark, Bro, and Suricata, are not always ap-

propriate because different programming languages

are suitable for different requirements. For exam-

ple, programming languages suitable for string ma-

nipulation, such as Perl and Python, should be used

for text-based protocols. Conversely, programming

languages suitable for binary manipulation, such as

C and C++, should be used for binary-based pro-

tocols, whereas programming languages equipped


[Fig. 1 Highly Abstracted Architecture of SF-TAP: traffic from the Internet enters the cell incubator over 10 GbE; the incubator separates it across multiple SF-TAP cells (horizontal scaling), and within each cell a flow abstractor distributes flows to analyzers across the CPU cores (core scaling).]

with machine learning libraries should be used for

machine learning.

To ensure flexibility in the application-level net-

work traffic analysis, our approach abstracts the

network flows into files using abstraction interfaces,

similarly to Plan 9, UNIX's /dev, and BPF. Using

these abstraction interfaces, IDS developers, traffic

engineers, and other analysts can implement ana-

lyzers in their preferred languages.

2. 2 Horizontal Scalability

Application-level analyzers require substantial

computational resources. For example, HTTP mes-

sages are analyzed by string parsing, URLs are fil-

tered in real-time by pattern matching based on

regular expressions, and the specific features of net-

work traffic are extracted by machine learning tech-

niques. In general, these processes consume much

CPU time.

To address this problem, our horizontally scal-

able architecture allocates multiple computers to

the analyzers (the most avid consumers of compu-

tational resources) in high-bandwidth application-

level traffic analysis.

2. 3 Modular Architecture

Modularity is an important feature of our archi-

tecture, as it grants the multicore scalability and

flexibility of SF-TAP. It also separates the cap-

turing and analyzing components of the system.

In traditional network traffic analysis applications

such as Snort and Bro, which are monolithic, any

update must halt the operation of all components,

including the traffic capturer. In the proposed

modular architecture, the traffic analysis compo-

nents are updated without interfering with other

components. This is especially important in the

research and development phase, when the analyz-

ing components are frequently updated. The mod-

ularity also ensures that bugs in applications still

under development do not negatively affect other

applications.

2. 4 Commodity Hardware

According to Martins et al. [21], hardware ap-

pliances are relatively inflexible and incompati-

ble with new functions. Furthermore, hardware

appliances are very expensive and are not easily

scaled horizontally. To address these problems,

researchers are developing software-based alterna-

tives running on commodity hardware, such as net-

work function virtualization (NFV) [26]. Our pro-

posed software-based approach also runs in com-

modity hardware environments, achieving similar

flexibility and scalability to NFV.

3 Design

This section describes the design of SF-TAP.


[Fig. 2 Architecture of SF-TAP: the capturer plane (L2 bridge, mirrored traffic, L3/L7 sniffer, SSL proxy, etc.) feeds the separator plane, where the SF-TAP cell incubator (packet forwarder, IP fragment handler, flow separator) forwards the separated traffic to SF-TAP cells. Within a cell, the abstractor plane runs the flow abstractor (IP packet defragmenter, flow identifier, TCP and UDP handler, flow classifier with filter and classifier rules), which exposes the HTTP, TLS, TCP/UDP default, and L7 loopback interfaces; the analyzer plane hosts application protocol analyzers (HTTP analyzer, HTTP proxy, TLS analyzer, IDS/IPS, forensics, DBs, etc.).]

3. 1 Design Overview

Fig. 2 shows the detailed architecture of the pro-

posed SF-TAP. SF-TAP consists of four planes:

the capturer, separator, abstractor, and analyzer

planes. The capturer plane, which captures the

network , comprises a port-mirroring mechanism of

L2/L3 switches, an SSL proxy that sniffs the plain

text, and various minor components. The sepa-

rator plane provides horizontal scalability for high-

bandwidth network traffic analysis. This plane sep-

arates the network traffic at the L3/L4 level, for-

warding the flows to multiple cells. Each cell com-

prises an abstractor plane and an analyzer plane.

The abstractor plane abstracts the network flow.

Specifically, it defragments the IP fragmentations,

identifies the L3 and L4 flows, reconstructs the

TCP streams, detects the application protocol us-

ing regular expressions, and outputs the flows to

the appropriate abstraction interfaces. The ana-

lyzer plane is the plane developed by SF-TAP users.

The analyzers can be implemented in any program-

ming language, and are developed by accessing the

interfaces provided by the abstractor plane.

The components of the capturer plane are well

known, and the analyzers of the analyzer plane

are developed by SF-TAP users. Therefore, the

present study focuses on the separator and abstrac-

tor planes.

The following subsections describe the flow ab-

stractor and cell incubator designs, and present an

HTTP analyzer as an example.

3. 2 Cell Incubator Design

The cell incubator shown in Fig. 2 is a software-

based network traffic balancer that mirrors and sep-

arates the network traffic based on the flows, thus

working as an L2 bridge. The cell incubator con-

sists of a packet forwarder, an IP fragment handler,

and a flow separator.

The packet forwarder receives L2 frames and for-

wards them to the IP fragment handler. If required,

it also forwards frames to other network interface

controllers (NICs), such as the L2 bridge, negating

the need for hardware-based network traffic mirror-

ing.

The IP fragment handler identifies the frag-

mented packets based on the given flows, and for-

wards them to the flow separator. The IP fragment

handler calculates a hash value of IP addresses and

port numbers of a non-fragmented packet, h1 =

hash(srcIP⊕dstIP⊕srcPort⊕dstPort), for load bal-

ancing. However, port numbers are not available
in fragmented packets other than the first packet
of a fragmented sequence. Therefore, when
encountering the first fragment, which carries both
the port numbers and the fragmentation identifier,
the IP fragment handler calculates h1 together with
another hash value,
h2 = hash(srcIP⊕dstIP⊕fragmentation identifier),
and associates h1 with h2. Consequently, the
fragments following the first can be associated with
h1 because h2 can always be calculated from them.
A limitation is that reordered fragmented packets
may not be handled properly.
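The following is a minimal C++ sketch of this two-hash association, assuming the XOR combinations given above are fed to std::hash; the type and variable names are illustrative, not those of the actual implementation.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>

    // h1: flow hash for load balancing, computable only when ports are known
    uint32_t h1(uint32_t src_ip, uint32_t dst_ip,
                uint16_t src_port, uint16_t dst_port) {
        return std::hash<uint32_t>{}(src_ip ^ dst_ip ^ src_port ^ dst_port);
    }

    // h2: fragment hash, computable from every fragment of a datagram
    uint32_t h2(uint32_t src_ip, uint32_t dst_ip, uint16_t frag_id) {
        return std::hash<uint32_t>{}(src_ip ^ dst_ip ^ frag_id);
    }

    // Recorded on the first fragment (which still carries the L4 ports):
    // maps h2 to h1 so that later fragments, for which only h2 is
    // computable, are forwarded to the same SF-TAP cell.
    std::unordered_map<uint32_t, uint32_t> frag_assoc;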

The flow separator forwards the packets to mul-

tiple SF-TAP cells using the flow information

(namely, the IP addresses and port numbers of the

source and destination). The destination SF-TAP

cell is determined by the hash value of the flow

identifier.


 1  http:
 2    up: '^[-a-zA-Z]+ .+ HTTP/1\.(0\r?\n|1\r?\n([-a-zA-Z]+: .+\r?\n)+)'
 3    down: '^HTTP/1\.[01] [1-9][0-9]{2} .+\r?\n'
 4    proto: TCP          # TCP or UDP
 5    if: http            # path to UNIX domain socket
 6    priority: 100       # priority
 7    balance: 4          # balanced by 4 IFs
 8
 9  torrent_tracker:      # BitTorrent Tracker
10    up: '^GET .*(announce|scrape).*\?.*info_hash=.+&.+ HTTP/1\.(0\r?\n|1\r?\n([-a-zA-Z]+: .+\r?\n)+)'
11    down: '^HTTP/1\.[01] [1-9][0-9]{2} .+\r?\n'
12    proto: TCP
13    if: torrent_tracker
14    priority: 90        # priority
15
16  dns_udp:
17    proto: UDP
18    if: dns
19    port: 53
20    priority: 200

Fig. 3 Example of Flow Abstractor Configuration

$ ls -R /tmp/sf-tap
loopback7=  tcp/  udp/

/tmp/sf-tap/tcp:
default=  http2=       ssh=
dns=      http3=       ssl=
ftp=      http_proxy=  torrent_tracker=
http0=    irc=         websocket=
http1=    smtp=

/tmp/sf-tap/udp:
default=  dns=  torrent_dht=

Fig. 4 Directory Structure of the Flow Abstraction Interface

3. 3 Flow Abstractor Design

In this subsection, we describe the design of the

flow abstractor schematized in Fig. 2, and explain

its main mechanisms using an example configura-

tion file (see Fig. 3). For human readability, the

flow abstractor is configured in the data serial-

ization language YAML [45]. The flow abstractor

consists of four components. In top-to-bottom se-

quence, these components are the IP packet defrag-

menter, the flow identifier, the TCP and UDP han-

dler, and the flow classifier.

3. 3. 1 Flow Reconstruction

The flow abstractor defragments the fragmented

IP packets and reconstructs the TCP streams.

Thereby, analyzer developers need not implement

the complicated reconstruction logic required for

application-level analysis.

The IP packet defragmenter shown in Fig. 2 de-

fragments the IP fragmented packets. The defrag-

mented IP packets are forwarded to the flow iden-

tifier, which identifies the L3/L4-level flows. The

flows are identified as 5-tuples containing the source

and destination IP addresses, the source and des-

tination port numbers, and a hop count (described

in Section 3. 3. 2). Once the flows have been iden-

tified, the TCP streams are reconstructed by the

TCP and UDP handler. Finally, the flows are for-

warded to the flow classifier.

3. 3. 2 Flow Abstraction Interface

The flow abstractor provides the interfaces that

abstract the flows at the application level (for ex-

ample, the TLS and HTTP interfaces shown in

Fig. 2). The HTTP, BitTorrent tracker [3], and do-

main name system (DNS) interfaces are defined in

Fig. 3.

The flow classifier classifies the flows of various

application protocols, forwarding them to the flow

abstraction interfaces, which are implemented via

a UNIX domain socket, as shown in Fig. 4. The

file names of the interfaces are defined by if items

in Fig. 3; for example, lines 5, 13, and 18 in this fig-

ure define the HTTP, BitTorrent tracker, and DNS

interfaces as http, torrent_tracker, and dns,
respectively. By providing independent interfaces for
each application protocol, SF-TAP allows the
analyzers to be implemented in any programming language.
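As a minimal sketch of this language-agnostic usage, the following C++ skeleton attaches to the HTTP interface from Fig. 4 (assuming, as the socket files suggest, that the flow abstractor listens on the UNIX domain socket and analyzers connect to it) and streams the abstracted events to the standard output; error handling is reduced to the essentials.

    #include <cstdio>
    #include <cstring>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main() {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        sockaddr_un addr = {};
        addr.sun_family = AF_UNIX;
        std::strncpy(addr.sun_path, "/tmp/sf-tap/tcp/http",
                     sizeof(addr.sun_path) - 1);
        if (connect(fd, (sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("connect"); return 1;
        }

        // Read abstracted flow events (headers and bodies; see Fig. 5).
        // A real analyzer would parse and analyze them here.
        char buf[4096];
        for (ssize_t n; (n = read(fd, buf, sizeof(buf))) > 0; )
            fwrite(buf, 1, (size_t)n, stdout);
        close(fd);
        return 0;
    }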

Furthermore, we designed a special interface for

flow injection called the L7 loopback interface, i.e.,

L7 Loopback I/F (see Fig. 2). This interface is con-

venient for encapsulating protocols such as HTTP
proxies. As an HTTP proxy can encapsulate other
protocols within HTTP, the encapsulated traffic

should also be analyzed at the application level.

In this situation, the encapsulated traffic is eas-


1 ip1=192.168.0.1,ip2=192.168.0.2,port1=62918,port2=80,hop=0,l3=ipv4,l4=tcp,event=CREATED
2 ip1=192.168.0.1,ip2=192.168.0.2,port1=62918,port2=80,hop=0,l3=ipv4,l4=tcp,event=DATA,from=2,match=down,len=1398
3
4 1398[bytes] Binary Data
5
6 ip1=192.168.0.1,ip2=192.168.0.2,port1=62918,port2=80,hop=0,l3=ipv4,l4=tcp,event=DESTROYED

Fig. 5 Example output of the Flow Abstraction Interface

ily analyzed by re-injecting the encapsulated traffic

into the flow abstractor via the L7 loopback inter-

face. The flow abstractor manages the re-injected

traffic in the same manner as non-encapsulated
traffic. Although the implemen-

tation of the application-level analysis of encapsu-

lated traffic generally tends to be rather complex,

this re-injection mechanism simplifies the imple-

mentation.

The L7 loopback interface may cause infinite re-

injections. To avoid this problem, we introduce a

hop count and a corresponding hop limitation. The

flow abstractor drops the injected traffic when its

hop count exceeds the hop limitation, thus limiting

the number of re-injections.
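A tiny sketch of this check follows; the limit value here is only an example.

    // Accept a re-injected flow only while its hop count is within the
    // hop limitation; otherwise the flow abstractor drops it. The limit
    // of 4 is illustrative.
    bool accept_reinjected(int hop_count, int hop_limit = 4) {
        return hop_count <= hop_limit;
    }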

Besides flow abstraction and the L7 loopback in-

terface, the flow abstractor provides default inter-

faces that capture the unknown or unclassified net-

work traffic.

3. 3. 3 TCP Session Abstraction

Fig. 5 is a representative output of the flow ab-

straction interface. The flow abstractor first out-

puts a header containing information on the flow

identifier and the abstracted TCP event, then out-

puts the body (if it exists).

Line 1 of Fig. 5 indicates that a TCP ses-

sion was established between 192.168.0.1:62918

and 192.168.0.2:80. Line 2 indicates that 1398

bytes of data were sent from 192.168.0.2:80 to

192.168.0.1:62918. The from values distinguish the

source and destination addresses, and the len value

denotes the data length. Lines 3–5 indicate that the

transmitted outputs are binary data. Finally, line

6 indicates that the TCP session was disconnected.

The match value in this figure denotes which
pattern, i.e., up or down in Fig. 3, matched during
protocol detection.

To reduce the complexity of managing a TCP ses-

sion, the flow abstractor abstracts the TCP states

as three events, i.e., CREATED, DATA, and DE-

STROYED. Accordingly, analyzer developers can

easily manage their TCP sessions and concentrate

on the application-level analysis.
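As a minimal sketch, a header line in the format of Fig. 5 can be parsed into key-value pairs as follows; the function name is ours, not part of SF-TAP.

    #include <map>
    #include <sstream>
    #include <string>

    // Split "ip1=...,ip2=...,event=DATA,len=1398" into key-value pairs.
    std::map<std::string, std::string> parse_header(const std::string &line) {
        std::map<std::string, std::string> kv;
        std::istringstream ss(line);
        std::string field;
        while (std::getline(ss, field, ',')) {
            auto pos = field.find('=');
            if (pos != std::string::npos)
                kv[field.substr(0, pos)] = field.substr(pos + 1);
        }
        return kv; // kv["event"] is CREATED, DATA, or DESTROYED; for DATA,
                   // kv["len"] bytes of body follow the header line
    }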

3. 3. 4 Application-level Protocol Detec-

tion

The flow classifier shown in Fig. 2 is an

application-level protocol classifier, which detects

protocols using regular expressions and a port num-

ber. The items up and down in Fig. 3 are the regu-

lar expressions in application-level protocol detec-

tion. When the upstream and downstream flows

match the regular expressions, the flows are out-

putted to the specified interface described in a con-

figuration file. Although application-level proto-

cols can be detected by alternative methods such

as Aho–Corasick and Bayesian filtering, we adopt

regular expressions because of their generality and

high expressive power. Meanwhile, the port num-

ber distinguishes the application-level flows from

other flows. For example, line 19 of Fig. 3 classifies
DNS traffic by its port number, 53.

The priority item in Fig. 3 dictates the prior-

ity rules in cases of ambiguity; here, lower priority

values are prioritized over higher ones. For exam-

ple, because BitTorrent tracker adopts HTTP com-

munication, it shares the same protocol format as

HTTP and ambiguity occurs if the rules for HTTP

and BitTorrent tracker have the same priority. The

priority ruling removes this ambiguity. The config-

uration in Fig. 3 prioritizes BitTorrent tracker over

HTTP.
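For illustration, the sketch below matches a reconstructed upstream buffer against the HTTP up pattern of Fig. 3 using RE2 [33], the regular expression library the flow abstractor depends on; the function is ours, not the actual classifier code.

    #include <re2/re2.h>
    #include <string>

    // True if the upstream flow looks like an HTTP/1.x request, per the
    // "up" pattern on line 2 of Fig. 3 (backslashes doubled for C++).
    bool matches_http_up(const std::string &upstream) {
        static const RE2 up_re(
            "^[-a-zA-Z]+ .+ HTTP/1\\.(0\\r?\\n|1\\r?\\n([-a-zA-Z]+: .+\\r?\\n)+)");
        return RE2::PartialMatch(upstream, up_re);
    }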

3. 4 Load Balancing via the Flow Abstrac-

tion Interface

In general, the distribution of application
protocols in a network is biased. Con-


{
  "client": {
    "port": "61906",
    "ip": "192.168.11.12",
    "header": {
      "host": "www.nsa.gov",
      "user-agent": "Mozilla\/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko\/20100101 Firefox\/31.0",
      "connection": "keep-alive",
      "pragma": "no-cache",
      "cache-control": "no-cache"
    },
    "method": {
      "method": "GET",
      "uri": "\/",
      "ver": "HTTP\/1.1"
    },
    "trailer": {}
  },
  "server": {
    "port": "80",
    "ip": "23.6.116.226",
    "header": {
      "connection": "keep-alive",
      "content-length": "6268",
      "date": "Sat, 16 Aug 2014 11:38:25 GMT",
      "content-encoding": "gzip",
      "vary": "Accept-Encoding",
      "x-powered-by": "ASP.NET",
      "server": "Microsoft-IIS\/7.5",
      "content-type": "text\/html"
    },
    "response": {
      "ver": "HTTP\/1.1",
      "code": "200",
      "msg": "OK"
    },
    "trailer": {}
  }
}

Fig. 6 Example output of the HTTP Analyzer

sequently, if each application protocol is analyzed

by only one process, the computational load will

be concentrated on a particular analyzer process.

To prevent this problem, we introduce a load-

balancing mechanism to the flow abstraction inter-

faces.

The configuration of the load-balancing mecha-

nism is specified on line 7 of Fig. 3. The balance

value is 4, indicating that the HTTP flows are sep-

arated and outputted to four balancing interfaces.

The interfaces http0=, http1=, http2=, and http3=

in Fig. 4 are the balancing interfaces. By intro-

ducing one-to-many interfaces, we can easily scale

non-multithreaded analyzers to CPU cores.
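The exact mapping from flows to balancing interfaces is not spelled out here; a plausible sketch, which keeps all packets of a flow on the same interface, hashes the flow identifier modulo the balance value:

    #include <cstdint>
    #include <string>

    // Pick one of the balancing interfaces http0= .. http3= of Fig. 4.
    // flow_hash would be derived from the flow identifier; balance is the
    // value configured on line 7 of Fig. 3.
    std::string balanced_if(uint32_t flow_hash, int balance) {
        return "/tmp/sf-tap/tcp/http" + std::to_string(flow_hash % balance);
    }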

3. 5 HTTP Analyzer Design

This subsection describes the design of the HTTP

analyzer, a type of application-level analyzer. The

HTTP analyzer reads the flows from the abstrac-

tion interface of the HTTP provided by the flow

abstractor, and then serializes the results in JSON

format to provide a standard output. Fig. 6 shows

an example output of the HTTP analyzer. The

HTTP analyzer reads the flows and outputs the re-

sults as streams.

4 Implementation

This section describes our proof-of-concept
implementation of SF-TAP.

4. 1 Implementation of the Cell Incubator

The cell incubator must be capable of manag-

ing high-bandwidth network traffic, which cannot

be handled by conventional methods such as libp-

cap. Therefore, our cell incubator implementation

exploits netmap for packet capturing and forward-

ing.

The cell incubator is implemented in C++ and

is available on FreeBSD and Linux. In inline mode,

the cell incubator works as an L2 bridge; in mirror-

ing mode, it only receives and separates L2 frames

without bridging the NICs (unlike the L2 bridge).

Users can select the desired mode when deploying

the cell incubator.

The network traffic is separated using the hash

values of each flow’s source and destination IP ad-

dresses and port numbers. Thus, each flow is for-

warded to a uniquely determined NIC.

Because we adopted netmap, the NICs used by

the cell incubator must be netmap-available. In

general, receive-side scaling (RSS) is enabled on

netmap-available NICs, and there are multiple re-

ceiving and sending queues on the NICs. Thus,

the cell incubator generates a thread for each

queue to balance the CPU load and achieve high-

throughput packet management. However, the


sending queues are shared among threads, requir-

ing exclusive, computationally expensive controls.

To reduce the CPU load of the exclusive controls,

we adopt a lock mechanism that uses the compare-

and-swap instruction.
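A minimal sketch of such a lock using C++11 atomics, in which compare_exchange compiles down to a compare-and-swap instruction; this is illustrative rather than the exact SF-TAP code:

    #include <atomic>

    class cas_spinlock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            bool expected = false;
            // Spin until the CAS succeeds in flipping false -> true.
            while (!locked.compare_exchange_weak(expected, true,
                                                 std::memory_order_acquire))
                expected = false; // CAS wrote the current value; reset and retry
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };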

4. 2 Implementation of the Flow Abstrac-

tor

The flow abstractor was implemented in C++,

and requires Boost [4], libpcap [41], libevent [17],

RE2 [33], and yaml-cpp [46]. It is available on Linux,
*BSD, and Mac OS X. The flow abstractor is multi-

threaded, with traffic capture, flow reconstruction,

and application protocol detection executed by dif-

ferent threads. For simplicity and clarity, the data

transfer among the threads follows the producer-

consumer pattern.

The flow abstractor implements a garbage collec-

tor for zombie TCP connections. More specifically,

TCP connections may disconnect without a FIN

or RST packet because of PC or network troubles.

Garbage collection is controlled by timers, adopt-

ing a partial garbage-collection algorithm to avoid

long-term locking.

Because synchronizing the threads demands

heavy CPU loads, our flow abstractor transfers bulk

data among the threads when a specified amount of

data is collected in the producer’s queue, or when

a specified time has elapsed.
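A minimal sketch of this size-or-timeout flush rule on the producer side; the thresholds and the queue type are illustrative:

    #include <chrono>
    #include <iterator>
    #include <mutex>
    #include <vector>

    // The producer appends to a local batch without locking and hands the
    // whole batch to the shared queue at once, so the lock is taken once
    // per batch instead of once per packet.
    template <typename T>
    struct bulk_producer {
        std::vector<T> batch;
        std::chrono::steady_clock::time_point last_flush =
            std::chrono::steady_clock::now();
        std::mutex &qmtx;
        std::vector<T> &shared_q;

        bulk_producer(std::mutex &m, std::vector<T> &q) : qmtx(m), shared_q(q) {}

        void push(T v) {
            batch.push_back(std::move(v));
            auto now = std::chrono::steady_clock::now();
            if (batch.size() >= 1024 ||                            // size trigger
                now - last_flush >= std::chrono::milliseconds(50)) { // time trigger
                std::lock_guard<std::mutex> g(qmtx);
                shared_q.insert(shared_q.end(),
                                std::make_move_iterator(batch.begin()),
                                std::make_move_iterator(batch.end()));
                batch.clear();
                last_flush = now;
            }
        }
    };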

The packets are captured using libpcap or netmap

[34], which is a kernel bypass technology. Kernel by-

pass technologies, such as netmap and DPDK [9],

achieve higher throughput than libpcap, but these
mechanisms require considerable CPU resources
because they poll network devices to increase the
throughput. (Netmap can also manage packets by
blocking and waiting, but this mode increases the
latency.) Accordingly, they take CPU resources away
from the application-level analyzers. Furthermore,

as netmap and DPDK exclusively attach to NICs,

they prevent the attachment of other programs,

such as tcpdump, to the same NICs. This restric-

tion impedes network operations and the develop-

ment or debugging of network software. To allevi-

ate these disadvantages, the present flow abstractor

adopts both libpcap and netmap.

Whether the abstraction interfaces emit a text or a
binary header is configured in the flow abstractor's
configuration file. For script languages like

[Fig. 7 Experimental Network of the Cell Incubator: two 10 GbE lines (α, β) pass through the cell incubator, which forwards the separated flows to twelve 1 GbE lines (γ).]

Python, a text header is suitable for easily parsing

the headers, whereas a binary header improves the

performance.

4. 3 Implementation of the HTTP Ana-

lyzer

For demonstration and evaluation, we implement

an HTTP analyzer. The HTTP analyzer is im-

plemented in Python, and manages TCP sessions

using Python’s dictionary data structure. The

HTTP analyzer could also easily be implemented in
other lightweight languages. Note that the Python im-

plementation was used for the performance evalua-

tions in Section 5.

5 Experimental Evaluation

This section experimentally evaluates SF-TAP.

5. 1 Performance Evaluation of the Cell

Incubator

The cell-incubator experiments were conducted

on a PC with DDR3 16 GB memory and an Intel

Xeon E5-2470 v2 processor (10 cores, 2.4 GHz, 25

MB cache) with the FreeBSD 10.1 operating sys-

tem. The computer was equipped with four In-

tel quad-port 1 GbE NICs and an Intel dual-port

10 GbE NIC. For these evaluations, we generated
network traffic consisting of 64- to 1024-byte L2
frames on the 10 GbE lines using a hardware
application-level traffic generator†1. The cell
incubator separated the traffic into different
flows, and forwarded the separated flows to the

twelve 1 GbE lines. The experimental network is

displayed in Fig. 7.

During the experiments, the cell incubator was

operated in three modes: (1) mirroring mode using

†1 Ixia’s PerfectStorm One 10GE


[Fig. 8 Forwarding performance of the Cell Incubator (10 Gbps): forwarding rate [Mpps] versus fragment size [bytes] (64, 128, 256, 512, 1024), with series for the ideal rate, (1) α→γ, (2) α→β, and (3) α→β and α→γ.]

[Fig. 9 CPU load average of the Cell Incubator (64-byte frames): CPU utilization [%] per CPU (No. 0–19) at 5.95, 10.42, and 14.88 Mpps.]

[Fig. 10 Resources of CPU 15 consumed by the Cell Incubator (64-byte frames): system and user CPU utilization [%] over 100 s of elapsed time at (a) 5.95 Mpps, (b) 10.42 Mpps, and (c) 14.88 Mpps.]

port mirroring on the L2 switch (in this mode, the

cell incubator captured the packets at α and for-

warded them to γ); (2) inline mode with the pack-

ets forwarded directly from α to β, bypassing the

1 GbE NICs; and (3) inline mode with the packets

captured at α and forwarded to both β and γ.

Fig. 8 shows the forwarding performance of the

cell incubator for different fragment sizes of 10

Gbps HTTP traffic. In mirroring mode (mode 1),

the cell incubator forwarded packets up to 12.49

Mpps. When operated as an L2 bridge (mode 2),

the cell incubator forwarded packets up to 11.60

Mpps, and when forwarding packets to β and γ

(mode 3), it delivered up to 11.44 Mpps to each des-

tination. The performance was better in mirroring
mode than in inline mode because the inline mode
delivers packets to two NICs. However, the inline
mode is more suitable for specific purposes such as
IDS, because the same packets are dropped at β and
γ; that is, the inline mode captures exactly the
packets that are actually transmitted.

Fig. 9 shows the CPU load averages of the cell

incubator when forwarding 64-byte frames in the

inline mode. At 5.95 and 10.42 Mpps, no packets

were dropped during forwarding. The upper limit

of dropless forwarding was approximately 10.42

Mpps. This indicates that forwarding consumed

the resources of several CPUs, but especially those

of the 15th CPU.

The load on CPU 15 at different traffic rates is
plotted in Fig. 10. At 5.95 Mpps,

the average load was approximately 50%, but at

10.42 Mpps, the loads approached 100%. At 14.88

Mpps, the CPU resources were completely con-

sumed. This limitation in forwarding performance

was probably caused by the bias introduced by the

flow director [13] of Intel’s NIC and its driver. This

bias is network-flow dependent, and uncorrectable

because the flow director cannot currently be con-

trolled by the user programs on FreeBSD. Note that

the fairness problem of the RSS queues could be

resolved by a correct implementation, which is


Table 1 HTTP traffic in the evaluations. bps, pps, and cps denote bits per second, packets per second, and connections per second, respectively.

Gbps   Kpps    Kcps
   1    224    18.0
   2    467    35.9
   3    701    53.9
   4    935    71.9
   5  1,169    89.9
   6  1,403   107.9
   7  1,638   126.0
   8  1,870   143.8
   9  2,101   161.7

[Fig. 11 Flow Abstractor with discard processes: HTTP traffic enters the flow abstractor, which distributes it to 16 discard processes.]

earmarked for future work.

Finally, the memory utilization of the cell incuba-

tor depends on the memory-allocation strategy of

netmap. In the current implementation of the cell

incubator, the experiments required approximately

700 MB of memory.

5. 2 Performance Evaluation of the Flow

Abstractor

This subsection presents the experimental re-

sults of the flow abstractor. Experiments were

executed on a PC with DDR4 512 GB memory,

an Intel Xeon E7-4830v3 processor (12 cores, 24

threads with hyper-threading, 2.1 GHz, 30 MB

cache) × 4, an Intel XL710QDA2 (dual port 40

Gbps QSFP+) network interface card, and the

Ubuntu 16.04 (Linux Kernel 4.4.0) operating sys-

tem, and the evaluated HTTP traffic volumes are

listed in Table 1. The HTTP traffic was generated

by the hardware application-level traffic generator

[Fig. 12 CPU load of the Flow Abstractor versus bandwidth: stacked system and user CPU utilization [×100%] at 1–9 Gbps (36–144 Kcps) of HTTP traffic.]

[Fig. 13 Packet drop versus bandwidth: packet dropping rates of tcpdump, Snort, and the libpcap- and netmap-based flow abstractor versus HTTP traffic bandwidth.]

as described in Section 5. 1. The HTTP traffic was
separated and passed to 16 discard processes, as
shown in Fig. 11. The discard processes, which
exercise the UNIX domain socket processing, simply
read and discard the data arriving on the UNIX
domain sockets.

The stacked graph in Fig. 12 plots the aver-

age CPU load of the netmap-based flow abstractor

when capturing HTTP traffic with different band-

widths. In this experiment, the flow abstractor

was configured to use a binary header for the ab-

straction interfaces. As shown in the figure, the

CPU consumption linearly increased with increas-

ing bandwidth. This result is expected because our
implementation adopts a spinlock-based reader-writer
lock to reduce the overhead of lock contention.

Fig. 13 compares the packet dropping rates of

tcpdump, Snort, and the flow abstractor when


/* Assumes a pcap_t *handle obtained from pcap_open_live(). */
struct pcap_pkthdr header;
const u_char *packet;

for (;;) {
    /* Grab a packet */
    packet = pcap_next(handle, &header);
    if (packet == NULL)
        continue;
    /* Do some tasks: parsing, printing, etc. */
}

Fig. 14 Pseudocode of single-threaded

packet handling

capturing HTTP traffic with different bandwidths.

“tcpdump -n > /dev/null” captures packets with

libpcap, parses them without DNS resolving, and

outputs the results to the standard output. With
packet parsing, tcpdump could not handle HTTP
traffic above 200 Mbps and tended to drop packets. Snort

also captures packets with libpcap, and recon-

structs the TCP sessions using the Stream5 en-

gine. In this evaluation, we set the Stream5 en-

gine parameters to their maximum possible values.

The Stream5 engine was configured as described in
Appendix A. Snort completely handled 1.5 Gbps of
HTTP traffic without packet dropping, but

began dropping packets at higher traffic rates. The

packet dropping rate reached 60% when the traffic

increased to 5 Gbps. As with tcpdump, the packet
dropping was caused by the single-threaded
architecture.

Fig. 14 shows pseudocode of single-threaded
packet handling. As shown in this figure, packet
handling is basically divided into two parts, grabbing
packets and processing them, and the overall
performance decreases as the overhead of the
processing part increases. For example, because
printing to the standard output involves a system
call and is relatively time-consuming, “tcpdump
-n > /dev/null” captures packets without dropping
only up to 200 Mbps. For the same reason, Snort
also drops packets at traffic rates above 1.5 Gbps.

Fig. 13 shows that “tcpdump -nw /dev/null”

captures packets without DNS resolving with

libpcap, and writes them without parsing to

/dev/null. Both “tcpdump -nw /dev/null” and

the libpcap-based flow abstractor completely han-

dled the 2 Gbps HTTP traffic without packet drop-

ping, but began dropping packets at 3 Gbps HTTP

traffic. Both applications exhibit similar trends due

to their similar packet handling operations; the flow

abstractor captures packets with one thread and

passes them to other threads; that is, the overhead
of the tasks other than grabbing packets, described
in Fig. 14, was removed from the capturing thread.
However, HTTP traffic over 3 Gbps was not
completely handled because packet capturing was
performed in only one

thread. In contrast, the netmap-based flow abstrac-

tor completely handled the 9 Gbps HTTP traffic

without packet dropping. Such remarkable per-

formance is attributable to netmap, which enables

packet-capture through multiple threads, and the

multithreaded design and implementation of the

flow abstractor. As shown in Section 4. 2, the flow

abstractor adopts the bulk data transfer technique

to transfer data between threads efficiently. In
addition, we implemented the spinlock-based
reader-writer lock described in Appendix B because
the conventional reader-writer lock implementation
was a main bottleneck of the flow abstractor.

The flow abstractor mainly consists of three kinds
of threads: the capturing thread (netmap or pcap),
the TCP handling thread, and the threads for regular
expression matching and UNIX domain socket output.

Fig. 15 shows the CPU load of the threads of the

flow abstractor. Here, we varied the bandwidth of

HTTP traffic, the header format of the abstract

interfaces, and the method for capturing the L2

frames.

In panels (a)-(d) of Fig. 15, the flow abstractor

was implemented with netmap and a binary header

was used for the abstraction interfaces. The HTTP-

traffic rate was varied from 6 to 9 Gbps. As shown

in the graphs, the regex & UNIX domain socket
threads collectively consumed more than 60% of the
flow abstractor's CPU load, because the most
CPU-intensive functions were executed in these
threads.

In panels (e) and (f) of Fig. 15, the flow abstractor
was also implemented with netmap. The conditions
were the same as in panels (a) to (d), except that
the HTTP-traffic rate was 1 Gbps for both (e) and
(f), and a text header instead of the binary header
was used in (f). As shown

in the figures, the text header, which is processed in

the thread of UNIX domain socket, is more compu-

tationally expensive in comparison with the binary


[Fig. 15 CPU loads of the Flow Abstractor threads (netmap/pcap capture, TCP handling, regex & UNIX domain socket) over 600 s: (a) netmap, binary header, HTTP 6 Gbps; (b) netmap, binary header, HTTP 7 Gbps; (c) netmap, binary header, HTTP 8 Gbps; (d) netmap, binary header, HTTP 9 Gbps; (e) netmap, binary header, HTTP 1 Gbps; (f) netmap, text header, HTTP 1 Gbps; (g) pcap, binary header, HTTP 1 Gbps; (h) pcap, text header, HTTP 1 Gbps.]

Table 2 CPU usage of the Flow Abstractor functions. The functions marked with * are mainly used in the threads of the regex & UNIX domain socket.

function                       usage [%]
*free                              19.7
*RE2's matching                    10.9
poll (netmap)                      10.8
malloc                              9.4
lll lock elision                    8.8
*write (UNIX domain socket)         8.3

header. The binary header is thus more suitable

for high-bandwidth traffic, whereas the text header
is suitable for the debugging or development phase
because of its human readability.

In panels (g) and (h) of Fig. 15, the flow
abstractor was implemented with pcap instead of
netmap, using the binary and the text header,
respectively, at an HTTP-traffic rate of 1 Gbps.
The figures reveal that netmap is more
computationally expensive than pcap because netmap
polls packet queues to handle high-pps traffic.
Therefore, pcap is more suitable when handling
low-rate traffic, and netmap is more suitable when
handling high-rate traffic.

Table 2 shows the computationally expensive

functions of the flow abstractor when capturing 9

Gbps HTTP traffic with netmap and the binary

header for the abstraction interfaces. The functions

prefixed with * are mainly used in the threads of

the regex & UNIX domain socket. This table re-

veals that read-write locks, memory (de)allocation,

polling netmap, matching of regular expressions,

and writing to UNIX domain sockets are the main

bottlenecks.

5. 3 HTTP Analyzer and Load Balancing

A main feature of the flow abstractor is the
multicore scalability of its application protocol
analyzers.

In this subsection, we show the effectiveness of the

load-balancing mechanism of the flow abstractor

through various experiments, applying the HTTP

analyzer as a heavy application-level analyzer. To

improve the performance, the flow abstractor ap-

plied jemalloc [11] for dynamic memory allocation.

A binary header was used for the abstraction in-

terface. These experiments were conducted on the

PC described in subsection 5. 2.

Fig. 16 shows the CPU loads of the HTTP an-

alyzer and flow abstractor when generating HTTP

requests by using the hardware application-level

traffic generator described in Section 5. 1. In this


[Fig. 16 CPU load balancing of HTTP analyzers with 150 Mbps, 36.3 Kpps, and 2.68 Kcps of HTTP traffic: CPU utilization [%] over 60 s of the SF-TAP flow abstractor and the CPython HTTP analyzer processes for (a) 2, (b) 4, and (c) 8 HTTP analyzer processes.]

[Fig. 17 Comprehensive evaluation of SF-TAP with 14 cells: the cell incubator separates 6 Gbps of HTTP traffic into 14 SF-TAP cells, each running a flow abstractor and 8 HTTP analyzer processes.]

figure, the average HTTP traffic generation was 150

Mbps, 36.3 Kpps and 2.68 Kcps, where bps, pps,

and cps denote bits per second, packets per second,

and connections per second, respectively. Panels

(a), (b) and (c) of Fig. 16 show the CPU loads

when executing the load balancing using two, four,

and eight HTTP analyzer processes, respectively.

As shown in Fig. 16 (a), two HTTP analyzer

processes could manage up to approximately 150

Mbps, with each process consuming from 80 to

100% of the CPU resources. Note that a single

HTTP analyzer process could not handle the 150

Mbps HTTP traffic because of CPU saturation.

With four and eight HTTP analyzer processes
(Fig. 16 (b) and (c), respectively), each process
consumed approximately 60% and 30% of the CPU
resources, respectively. Consequently, we conclude

[Fig. 18 Average CPU utilization rates of the Flow Abstractor, the Cell Incubator, and the HTTP Analyzer: CPU utilization [×100%] per component.]

that the load-balancing mechanism is remarkably

efficient in multicore applications. In our exper-

iments, the HTTP analyzer was implemented in

Python, a relatively slow interpreted language in
which a single process cannot run on multiple CPUs
in parallel through multithreading. Nevertheless, the 150

Mbps HTTP traffic was adequately handled by the

multi-CPU analyzer.

5. 4 Comprehensive Evaluation of SF-

TAP

In this subsection, we comprehensively evalu-

ate SF-TAP. The evaluation topology is shown

in Fig. 17. The cell incubator separated 6 Gbps

HTTP traffic, described in Section 5. 2, into 14 SF-

TAP cells running the flow abstractor and 8 HTTP

analyzer processes. The HTTP traffic was also gen-

erated by the hardware appliance described in Sec-

tion 5. 2. The cell incubator was run on the PC

described in subsection 5. 1, and the SF-TAP cells

were implemented on PCs with DDR3 32 GB memory
and a Xeon E5-2609 v2 processor (8 cores, 2.5 GHz,
10 MB cache).

[Fig. 19 CPU requirements of the SF-TAP cells: CPU utilization [%] for each cell#-cpu# across the 14 cells.]

Fig. 18 shows the average CPU utilization rates

per process of the flow abstractor, cell incubator,

and HTTP Analyzer. The HTTP analyzer requires

many more CPU resources than the other two com-

ponents. This result underscores the importance of
scalability in analyzing application protocols.

Fig. 19 shows the CPU usage of the SF-TAP cells
across the 8 × 14 = 112 HTTP analyzer processes.

The CPU loads of the processes were well balanced.

In this experiment, we used the HTTP analyzer,

which simply parses HTTP traffic, as an
application-level analyzer; in actual cases, more
complex processing, such as pattern matching and
machine learning, is performed. Application-level
analyzers tend to consume much more computational
resources than the flow abstractor and the cell
incubator, but as this experiment shows, SF-TAP
balances the CPU load required by the analyzers
well.

6 Discussion

This section discusses the application of SF-TAP to pervasive monitoring, an important issue on the Internet [12]. Countermeasures against pervasive monitoring require cryptographic protocols such as secure sockets layer/transport layer security (SSL/TLS) rather than traditional insecure protocols such as HTTP and FTP. However, cryptographic protocols render IDSs ineffective, consequently incurring other security risks.

Mobile devices, which are widely used in modern society, lack sufficient machine power to run host-based IDSs. Therefore, as future solutions for the Internet, researchers should consider new approaches such as an IDS cooperating with an SSL/TLS proxy.

There are two ways to monitor encrypted traffic. An SSL/TLS proxy server intercepts encrypted traffic by enforcing the installation of a self-signed certificate on clients. However, a proxy with a self-signed certificate may impair authenticity because it can not only monitor encrypted traffic but also easily falsify it. In addition, such a proxy disables the public key pinning extension for HTTP [10].

Sharing the secret key is another approach to monitoring encrypted traffic. mcTLS [24] and BlindBox [37] share the key to encrypted traffic among multiple parties, and GINTATE [23], which we are currently studying, shares the key with clients. Compared with the proxy approach, the secret-key-sharing approach cannot easily falsify traffic. In other words, both the proxy and the secret-key-sharing approach affect confidentiality, and the proxy approach also affects authenticity and integrity, whereas the secret-key-sharing approach does not easily compromise authenticity and integrity.

From an architectural perspective, recursive encapsulation, such as TLS over TLS, and a cooperation mechanism between IDSs and DPI systems should also be studied. GINTATE handles recursively encrypted traffic by adopting the application-level loopback interface of the flow abstractor. There are several ways to achieve the cooperation; GINTATE provides a file-based abstraction interface, the same as the flow abstractor, to achieve modularity.

From a confidentiality perspective, the design of secrecy levels should also be studied. For example, highly confidential encrypted traffic, such as online banking, should not be inspected. How to decide the secrecy level and how to handle it are major issues for DPI systems.


7 Related Work

Traditional packet-capturing applications in-

clude the widely used wireshark [44] and tcpdump

[41]. The network traffic-capturing application lib-

nids [18] reassembles TCP streams. However, these

single-threaded applications cannot exploit the ad-

vantages of multiple CPU cores, and are hence un-

suitable for high-bandwidth network traffic analy-

sis.

SCAP [29] and GASPP [43] have been proposed

for flow-level and high-bandwidth network traf-

fic analyses. Implemented within a Linux kernel,

SCAP exploits the zero-copy mechanism and allo-

cates threads for the RX and TX queues of the NIC to

boost the throughput. SCAP also adopts a mech-

anism called subzero-copy packet transfer, using

analyzers that can selectively analyze the required

network traffic. GASPP is a GPGPU-based flow-

level analysis engine based on netmap [34]; thus,

it achieves high-throughput data transfers between

the NIC and CPU memory.

DPDK [9], netmap [34], and PF RING [31] have

been proposed for high-bandwidth packet-capture

implementations. Owing to the many data transfers
and software interrupts that occur among the NIC,
the kernel, and user space, traditional methods
cannot easily capture 10 Gbps network traffic.
These technologies effectively reduce the frequency
of memory copies and software interrupts, enabling
wire-speed traffic capture.

L7 filter [16], nDPI [25], libprotoident [19], and

PEAFLOW [8] have been proposed for application-

level network traffic classification. These meth-

ods detect the application protocols using Aho–

Corasick or regular expressions. PEAFLOW

achieves accurate classification using the parallel
programming framework FastFlow.

IDS applications such as Snort [38], Bro [5] and

Suricata [39] reconstruct the TCP flows and per-

form the application-level analysis. Bro applies the

DSL BinPAC [28] for protocol parsing. However,

both Snort and Bro are single-threaded and cannot

manage high-bandwidth network traffic. On the

other hand, the multi-threaded Suricata does han-

dle high-bandwidth network traffic. To overcome

the single-threaded architecture, Bro comes with

scripts to deploy it in a cluster environment [7],
which has been shown to handle 100 Gbps traffic [6].
Whereas the

cluster environment adopts hardware appliances,

SF-TAP uses only commodity hardware.

Schneider et al. [35] proposed a horizontally scal-

able architecture that separates the flows of 10

Gbps traffic, similarly to SF-TAP. They verified

their architecture only on 1 Gbps network traffic
and have yet to verify it on 10 Gbps network traffic.
They provided neither an implementation handling
10 Gbps traffic, nor an evaluation of it, nor any
mechanisms or interfaces for traffic analysis
applications.

Open vSwitch [27] is a software switch that controls network traffic based on its flows, similarly to the cell incubator in our proposed system, but its ovs-ofctl does not reassemble IP-fragmented packets. Some filtering mechanisms, such as iptables [14] and pf [30], also implement flow-based control of network traffic but lack a reassembly mechanism. Furthermore, these methods are less scalable than SF-TAP and suffer from performance issues.

Click [15], SwitchBlade [2], and ServerSwitch [20] adopt modular architectures, providing flexible and programmable network functions for network switches. SF-TAP borrows these ideas and incorporates a modular architecture for network traffic analysis.

SF-TAP also borrows the concept of BPF [22], a well-known packet-capture mechanism that abstracts network traffic as files, similarly to UNIX's /dev. By abstracting network flows as files, SF-TAP achieves its modularity and multicore scalability.

8 Conclusions

Application-level network traffic analysis and sophisticated analysis techniques, such as machine learning and stream-data processing of network traffic, require considerable computational resources. To address these problems, we proposed a scalable and flexible traffic analysis platform called SF-TAP for sophisticated application-level analysis of high-bandwidth network traffic in real time.

SF-TAP consists of four planes: the separator plane, abstractor plane, capturer plane, and analyzer plane. Network traffic is captured by the capturer plane and separated into distinct flows at the separator plane, thus achieving horizontal scalability. The separated network traffic is forwarded to multiple SF-TAP cells, which constitute the abstractor and analyzer planes.
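Conceptually, the flow-based separation amounts to a symmetric hash over a flow's endpoints so that both directions of a connection reach the same cell. The following minimal sketch shows the idea; the hash construction and cell count are illustrative assumptions, not the cell incubator's actual implementation.

#include <cstdint>
#include <functional>

// Sketch: flow-based separation of traffic across N SF-TAP cells.
// The hash is made symmetric (order-independent in src/dst) so that
// both directions of a TCP/UDP flow land on the same cell.
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  protocol;
};

std::size_t select_cell(const flow_key &k, std::size_t num_cells) {
    // XOR of per-endpoint hashes is symmetric: swapping src and dst
    // yields the same value, so A->B and B->A map to the same cell.
    uint64_t a = (uint64_t(k.src_ip) << 16) | k.src_port;
    uint64_t b = (uint64_t(k.dst_ip) << 16) | k.dst_port;
    uint64_t h = std::hash<uint64_t>{}(a) ^ std::hash<uint64_t>{}(b) ^
                 std::hash<uint64_t>{}(k.protocol);
    return h % num_cells;
}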

We provided the cell incubator and flow abstractor implementations for the separator and abstractor planes, respectively. As an example, we also implemented an HTTP analyzer in the analyzer plane. Traffic acquisition by the capturer plane uses well-known technologies, such as port mirroring on L2 switches.

The flow abstractor, with its modular architecture, abstracts the network traffic into files, similarly to Plan 9, UNIX's /dev, and BPF. Owing to these abstraction features and this modularity, application-level analyzers for SF-TAP can be easily developed in many programming languages and rendered multicore scalable. The scalability of the HTTP analyzer to multiple CPU cores was experimentally demonstrated in a Python example.
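To give a flavor of this interface, a minimal analyzer only needs to open the file-like interface and consume flow data; the sketch below assumes a hypothetical UNIX domain socket path such as /tmp/sf-tap/tcp/http, standing in for whatever interface the flow abstractor is configured to expose.

#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Sketch: a minimal SF-TAP analyzer that reads HTTP flow data from the
// flow abstractor's file-like interface. The socket path is a
// hypothetical example; the real path depends on the configuration.
int main() {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, "/tmp/sf-tap/tcp/http",
                 sizeof(addr.sun_path) - 1);

    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
        perror("connect");
        return 1;
    }

    // Consume flow events; several analyzer processes can attach to
    // separate interfaces to scale across CPU cores.
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, static_cast<size_t>(n), stdout);

    close(fd);
    return 0;
}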

The flow abstractor performs bulk data transfers among multiple threads. Our experiments confirmed that the multi-threaded flow abstractor can manage 7 Gbps of HTTP traffic without dropping packets, whereas tcpdump and Snort can manage only up to 3.0 and 1.5 Gbps, respectively.

The scalability of the HTTP analyzer to multiple CPU cores was assisted by the flow abstraction interfaces. In our experiments, when two HTTP analyzer processes were run, each consumed 100% of a CPU core, whereas with eight processes, each consumed only approximately 30%.

The cell incubator provides the horizontal scalability of SF-TAP. To manage high-bandwidth network traffic, it separates the traffic based on its flows and forwards the separated flows to the cells, each comprising one flow abstractor and several application-level analyzers. In experimental tests, the cell incubator managed approximately 12.49 Mpps in the mirroring mode and 11.44 Mpps in the inline mode.

In future work, we will tackle the problem of DPI over encrypted traffic. From a privacy perspective, all traffic should be encrypted; however, to keep networks safe, traffic must also be inspectable. We need to address this antinomic issue. Additionally, we will study the design of secrecy levels that decide whether traffic should be inspected; this should help avoid inspecting highly confidential encrypted traffic, such as online banking.

Acknowledgments

We would like to thank the staff of StarBED for supporting our research, the WIDE Project for supporting our experiments, and Mr. Nobuyuki Kanaya for his feedback and comments.

References

[1] Amdahl, G. M.: Validity of the single processor approach to achieving large scale computing capabilities, American Federation of Information Processing Societies: Proceedings of the AFIPS '67 Spring Joint Computer Conference, April 18-20, 1967, Atlantic City, New Jersey, USA, AFIPS Conference Proceedings, Vol. 30, AFIPS / ACM / Thomson Book Company, Washington D.C., 1967, pp. 483–485.
[2] Anwer, M. B., Motiwala, M., Tariq, M. M. B., and Feamster, N.: SwitchBlade: a platform for rapid deployment of network protocols on programmable hardware, Proceedings of the ACM SIGCOMM 2010 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, New Delhi, India, August 30 - September 3, 2010, Kalyanaraman, S., Padmanabhan, V. N., Ramakrishnan, K. K., Shorey, R., and Voelker, G. M. (eds.), ACM, 2010, pp. 183–194.
[3] BitTorrent, http://www.bittorrent.com/.
[4] Boost C++ Library, http://www.boost.org/.
[5] The Bro Network Security Monitor, http://www.bro.org/.
[6] 100G Intrusion Detection, https://commons.lbl.gov/display/cpp/100G+Intrusion+Detection.
[7] Bro Cluster Architecture, https://www.bro.org/sphinx-git/cluster/index.html.
[8] Danelutto, M., Deri, L., Sensi, D. D., and Torquati, M.: Deep Packet Inspection on Commodity Hardware using FastFlow, PARCO, Bader, M., Bode, A., Bungartz, H.-J., Gerndt, M., Joubert, G. R., and Peters, F. J. (eds.), Advances in Parallel Computing, Vol. 25, IOS Press, 2013, pp. 92–99.
[9] Intel® DPDK: Data Plane Development Kit, http://dpdk.org/.
[10] Evans, C., Palmer, C., and Sleevi, R.: Public Key Pinning Extension for HTTP, April 2015. RFC 7469.
[11] Evans, J.: A scalable concurrent malloc(3) implementation for FreeBSD, Proc. of the BSDCan Conference, Ottawa, Canada, April 2006.
[12] Farrell, S. and Tschofenig, H.: Pervasive Monitoring Is an Attack, May 2014. RFC 7258.
[13] Intel®: High Performance Packet Processing, https://networkbuilders.intel.com/docs/network_builders_RA_packet_processing.pdf.
[14] netfilter/iptables project homepage - The netfilter.org "iptables" project, http://www.netfilter.org/projects/iptables/.

[15] Kohler, E., Morris, R., Chen, B., Jannotti, J., and Kaashoek, M. F.: The Click modular router, ACM Trans. Comput. Syst., Vol. 18, No. 3 (2000), pp. 263–297.
[16] L7-filter — ClearFoundation, http://l7-filter.clearfoundation.com/.
[17] libevent, http://libevent.org/.
[18] libnids, http://libnids.sourceforge.net/.
[19] WAND Network Research Group: libprotoident, http://research.wand.net.nz/software/libprotoident.php.
[20] Lu, G., Guo, C., Li, Y., Zhou, Z., Yuan, T., Wu, H., Xiong, Y., Gao, R., and Zhang, Y.: ServerSwitch: A Programmable and High Performance Platform for Data Center Networks, Proceedings of the 8th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2011, Boston, MA, USA, March 30 - April 1, 2011, Andersen, D. G. and Ratnasamy, S. (eds.), USENIX Association, 2011, pp. 15–28.
[21] Martins, J., Ahmed, M., Raiciu, C., Olteanu, V. A., Honda, M., Bifulco, R., and Huici, F.: ClickOS and the Art of Network Function Virtualization, Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2014, Seattle, WA, USA, April 2-4, 2014, Mahajan, R. and Stoica, I. (eds.), USENIX Association, 2014, pp. 459–473.
[22] McCanne, S. and Jacobson, V.: The BSD Packet Filter: A New Architecture for User-level Packet Capture, Proceedings of the USENIX Winter 1993 Technical Conference, San Diego, California, USA, January 1993, USENIX Association, 1993, pp. 259–270.
[23] Miura, R., Takano, Y., Miwa, S., and Inoue, T.: GINTATE: Scalable and Extensible Deep Packet Inspection System for Encrypted Network Traffic: Session Resumption in Transport Layer Security Communication Considered Harmful to DPI, Proceedings of the Eighth International Symposium on Information and Communication Technology, Nha Trang City, Viet Nam, December 7-8, 2017, ACM, 2017, pp. 234–241.
[24] Naylor, D., Schomp, K., Varvello, M., Leontiadis, I., Blackburn, J., Lopez, D. R., Papagiannaki, K., Rodríguez, P. R., and Steenkiste, P.: Multi-Context TLS (mcTLS): Enabling Secure In-Network Functionality in TLS, Computer Communication Review, Vol. 45, No. 5 (2015), pp. 199–212.
[25] nDPI, http://www.ntop.org/products/ndpi/.
[26] Leading operators create ETSI standards group for network functions virtualization, http://www.etsi.org/index.php/news-events/news/644-2013-01-isg-nfv-created.
[27] Open vSwitch, http://openvswitch.github.io/.
[28] Pang, R., Paxson, V., Sommer, R., and Peterson, L. L.: binpac: a yacc for writing application protocol parsers, Internet Measurement Conference, Almeida, J. M., Almeida, V. A. F., and Barford, P. (eds.), ACM, 2006, pp. 289–300.
[29] Papadogiannakis, A., Polychronakis, M., and Markatos, E. P.: Scap: stream-oriented network traffic capture and analysis for high-speed networks, Internet Measurement Conference, IMC'13, Barcelona, Spain, October 23-25, 2013, Papagiannaki, K., Gummadi, P. K., and Partridge, C. (eds.), ACM, 2013, pp. 441–454.
[30] PF: The OpenBSD Packet Filter, http://www.openbsd.org/faq/pf/.
[31] PF_RING, http://www.ntop.org/products/pf_ring/.
[32] Pike, R., Presotto, D. L., Dorward, S., Flandrena, B., Thompson, K., Trickey, H., and Winterbottom, P.: Plan 9 from Bell Labs, Computing Systems, Vol. 8, No. 2 (1995), pp. 221–254.
[33] RE2, https://github.com/google/re2.
[34] Rizzo, L. and Landi, M.: netmap: memory mapped access to network devices, SIGCOMM, Keshav, S., Liebeherr, J., Byers, J. W., and Mogul, J. C. (eds.), ACM, 2011, pp. 422–423.
[35] Schneider, F., Wallerich, J., and Feldmann, A.: Packet Capture in 10-Gigabit Ethernet Environments Using Contemporary Commodity Hardware, PAM, Uhlig, S., Papagiannaki, K., and Bonaventure, O. (eds.), Lecture Notes in Computer Science, Vol. 4427, Springer, 2007, pp. 207–217.
[36] SF-TAP: Scalable and Flexible Traffic Analysis Platform, https://github.com/SF-TAP.
[37] Sherry, J., Lan, C., Popa, R. A., and Ratnasamy, S.: BlindBox: Deep Packet Inspection over Encrypted Traffic, Computer Communication Review, Vol. 45, No. 5 (2015), pp. 213–226.
[38] Snort :: Home Page, https://www.snort.org/.
[39] Suricata — Open Source IDS / IPS / NSM engine, http://suricata-ids.org/.
[40] Takano, Y., Miura, R., Yasuda, S., Akashi, K., and Inoue, T.: SF-TAP: Scalable and Flexible Traffic Analysis Platform Running on Commodity Hardware, 29th Large Installation System Administration Conference, LISA 2015, Washington, D.C., USA, November 8-13, 2015, USENIX Association, 2015, pp. 25–36.
[41] TCPDUMP/LIBPCAP public repository, http://www.tcpdump.org/.
[42] Vasiliadis, G., Antonatos, S., Polychronakis, M., Markatos, E. P., and Ioannidis, S.: Gnort: High Performance Network Intrusion Detection Using Graphics Processors, Recent Advances in Intrusion Detection, 11th International Symposium, RAID 2008, Cambridge, MA, USA, September 15-17, 2008, Proceedings, Lippmann, R., Kirda, E., and Trachtenberg, A. (eds.), Lecture Notes in Computer Science, Vol. 5230, Springer, 2008, pp. 116–134.

[43] Vasiliadis, G., Koromilas, L., Polychronakis, M., and Ioannidis, S.: GASPP: A GPU-Accelerated Stateful Packet Processing Framework, 2014 USENIX Annual Technical Conference, USENIX ATC '14, Philadelphia, PA, USA, June 19-20, 2014, Gibson, G. and Zeldovich, N. (eds.), USENIX Association, 2014, pp. 321–332.

[44] Wireshark - Go Deep., https://www.wireshark.org/.
[45] The Official YAML Web Site, http://yaml.org/.
[46] A YAML parser and emitter in C++, https://github.com/jbeder/yaml-cpp.

A Stream5 Configuration

Fig. 20 shows the Stream5 configuration of Snort used in the evaluation described in subsection 5.2.

preprocessor stream5_global: track_tcp yes, \
    memcap 1073741824, \
    track_udp yes, \
    track_icmp no, \
    max_tcp 1048576, \
    max_udp 131072, \
    max_active_responses 2, \
    min_response_seconds 5
preprocessor stream5_tcp: policy windows, detect_anomalies, require_3whs 180, \
    overlap_limit 10, small_segments 3 bytes 150, timeout 180, \
    ports client all, \
    ports both all

Fig. 20 Stream5 configuration in Snort

B Spinlock-based reader–writer locking

According to Amdahl's law [1], the achievable speedup of a parallel program is limited by its serial fraction: S(n) = 1 / ((1 − p) + p/n), where p is the parallelizable fraction and n is the number of cores. Reducing the lock overhead, which serializes execution, therefore significantly improves application performance. We found that the reader–writer lock function pthread_rwlock was a major bottleneck when implementing the flow abstractor. Thus, our implementation replaces pthread_rwlock with a spinlock-based reader–writer lock.
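As a rough illustration of the technique (a minimal sketch, not SF-TAP's actual code), a reader–writer lock can be built from a single atomic counter that spins rather than sleeps; writer preference and fairness are omitted for brevity.

#include <atomic>
#include <thread>

// Sketch of a spinlock-based readers-writer lock.
// State: 0 = free, n > 0 = n active readers, -1 = one active writer.
class spin_rwlock {
    std::atomic<int> m_state{0};

public:
    void read_lock() {
        for (;;) {
            int s = m_state.load(std::memory_order_relaxed);
            if (s >= 0 &&
                m_state.compare_exchange_weak(s, s + 1,
                                              std::memory_order_acquire))
                return;                 // reader admitted
            std::this_thread::yield();  // a writer holds the lock; spin
        }
    }

    void read_unlock() {
        m_state.fetch_sub(1, std::memory_order_release);
    }

    void write_lock() {
        for (;;) {
            int expected = 0;           // only lockable when free
            if (m_state.compare_exchange_weak(expected, -1,
                                              std::memory_order_acquire))
                return;                 // writer admitted
            std::this_thread::yield();  // spin until readers drain
        }
    }

    void write_unlock() {
        m_state.store(0, std::memory_order_release);
    }
};

Under read-heavy workloads such as the flow abstractor's flow-table lookups, spinning avoids the system calls and context switches that can make pthread_rwlock expensive, at the cost of burning CPU cycles while waiting.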

Fig. 21 compares the CPU utilization rates of the flow abstractor under spinlock-based reader–writer locking and under pthread_rwlock. As shown in the figure, the spinlock-based reader–writer lock significantly reduced the CPU utilization of the flow abstractor, confirming the importance of lock overhead in multi-threaded applications.

[Fig. 21: CPU utilization of the flow abstractor with spinlock-based reader–writer locking and with pthread_rwlock locking. Two panels ("spinlock-based readers-writer lock" and "pthread_rwlock"); x-axis: bandwidth [Gbps], 1–8; y-axis: CPU utilization [×100%], 0–70; series: system and user.]

Yuuki Takano

received his B.E. from the National Institution for Academic Degrees and Quality Enhancement of Higher Education, Japan, in 2003. He also received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2005 and 2011, respectively. He is currently a specially appointed associate professor at Osaka University. His research interests include computer systems, computer networks, network security, and programming languages.

Ryosuke Miura

received his B.E. from the Tokyo University of Science, Japan, in 2003. He also received his M.S. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2007. He is currently a technical researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and network security.

Shingo Yasuda

received his B.I. from Bunkyo University, Japan, in 2003. He also received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2005 and 2014, respectively. He is currently a senior researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and cyber ranges.

Kunio Akashi

received his B.E. from Osaka Electro-Communication University, Japan, in 2008. He also received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2010 and 2017, respectively. He is currently a researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and network testbeds.

Tomoya Inoue

received his B.E. from the National Institution for Academic Degrees and Quality Enhancement of Higher Education, Japan, in 2003. He received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2006 and 2012, respectively. He is currently a researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and network testbeds.