Sanshusha pLATEX2ε: c01_takano : 2019/7/12(18:58)
Design and Implementation of a Scalable and
Flexible Traffic Analysis Platform
Yuuki Takano, Ryosuke Miura, Shingo Yasuda,
Kunio Akashi, and Tomoya Inoue
Application-level network traffic analysis and sophisticated analysis techniques, such as machine learning
and stream data processing for network traffic, require considerable computational resources. In addition,
developing an application protocol analyzer is a tedious and time-consuming task. Therefore, we propose a
scalable and flexible traffic analysis platform (SF-TAP) for the efficient and flexible application-level stream
analysis of high-bandwidth network traffic. By using our flexible modular platform, developers can easily
implement multicore scalable application-level stream analyzers. Furthermore, as SF-TAP is horizontally
scalable, it manages high-bandwidth network traffic. To achieve this scalability, we separate the network
traffic based on traffic flows, and forward the separated flows to multiple SF-TAP cells, each comprising
a traffic capturer and application-level analyzers. This study discusses the design, implementation and
detailed evaluation of SF-TAP.
1 Introduction
Application-level network-traffic control tasks, such as network traffic engineering and intrusion detection systems (IDSs), usually perform complicated analyses requiring considerable computational resources.
Therefore, the present paper proposes a scalable
and flexible traffic analysis platform (SF-TAP) that
Design and Implementation of a Scalable and Flexible Traffic Analysis Platform.
Yuuki Takano, Graduate School of Engineering, Osaka University.
Yuuki Takano, Ryosuke Miura, and Shingo Yasuda, Cybersecurity Laboratory, National Institute of Information and Communications Technology.
Shingo Yasuda, Cyber Training Laboratory, National Institute of Information and Communications Technology.
Kunio Akashi and Tomoya Inoue, Network Testbed Research and Development Laboratory, National Institute of Information and Communications Technology.
Computer Software, Vol. 36, No. 3 (2019), pp. 85–103.
[Software paper] Received August 5, 2018.
runs on commodity hardware.
Overall, application-level network traffic analy-
sis is hampered by two problems. First, there
are numerous application protocols, with new
protocols being defined and implemented every
year. Moreover, implementing application-protocol parsers and analyzers is a tedious and time-consuming task. Parsers and analyzers, which are crucial for application-level network traffic analysis, have previously been implemented using domain-specific language (DSL) approaches. For example, the parser
BinPAC [28] was proposed for application protocols
built into Bro IDS software. Wireshark [44] and
Suricata [39] bind Lua language to implement an-
alyzers. Unfortunately, most researchers and de-
velopers tend to adopt specific programming lan-
guages for specific purposes, such as machine learn-
ing, but DSLs may not be suitable for other domains.
The second problem in application-level network
traffic analysis is the low scalability of conventional
software. Traditional network traffic-analysis ap-
plications such as tcpdump [41], Wireshark, and
Snort [38] are single-threaded and cannot exploit
multiple CPU cores during the analysis. To im-
prove the utilization of CPU cores, researchers have
proposed various software solutions. An exam-
ple is GASPP [43], developed for high-bandwidth,
flow-based traffic analysis. This software exploits
GPUs and the Security Content Automation Proto-
col (SCAP) [29], which utilizes multiple CPU cores
and implements a Linux kernel module. The trans-
mission control protocol (TCP) flows must be re-
constructed efficiently, the efficiency of the parser
or analyzing programs is more critical because these
components perform deep analyses such as pat-
tern matching or machine learning, which are com-
putationally demanding. Therefore, when analyz-
ing high-bandwidth traffic, both the TCP flow re-
construction and traffic-analyzing components re-
quire multicore scaling. Moreover, for simple, cost-
effective deep analysis of high-bandwidth network
traffic, the corresponding application-level analysis
platform must have horizontal scalability.
To mitigate the abovementioned issues, we have
designed and implemented an SF-TAP for high-
bandwidth application-level traffic analysis. SF-
TAP adopts a flow abstraction mechanism that
abstracts the network flows as files, similarly to
Plan 9 [32], UNIX’s /dev, and the BSD packet fil-
ter (BPF) [22]. Through the developed interfaces,
logic developers can rapidly and flexibly implement
application-level analyzers in any language. Fur-
thermore, the modular architecture can separate
the L3/L4-level controlling and application-level
analyzing components. This design enables flexible
implementation, dynamic updates, and multicore
scalability of the analyzing components.
This paper was originally published at the 29th Large Installation System Administration Conference (LISA 2015) [40]; we have updated it as follows.
• This paper presents a comprehensive evaluation (Section 5. 4), whereas the LISA paper reported only evaluations of each component.
• We adopted netmap [34] to improve the performance of the flow abstractor, a component of SF-TAP.
• We eliminated a bottleneck of the flow abstractor that did not appear before adopting netmap (Appendix B).
• The flow abstractor was completely re-evaluated (Section 5. 2) because of the abovementioned fixes.
For scientific reproducibility, our proof-of-concept implementation is distributed on the Web (see [36]) under a BSD license; it is therefore freely available for use and modification.
2 Design Principles
This section discusses the design principles of SF-
TAP, i.e., the abstraction of network flows, multi-
core and horizontal scalabilities, and modularity.
We first describe the high-level architecture of
SF-TAP. As shown in Fig. 1, SF-TAP has two
main components: the cell incubator and the flow
abstractor. A group of one flow abstractor and sev-
eral analyzers is called a cell. The cell incubator
provides the horizontal scalability; that is, it cap-
tures the network traffic, separates it based on flow
identifiers consisting of source and destination IP
addresses and port numbers, and forwards the sep-
arated flows to specific target cells. Whereas con-
ventional approaches using packet capture (pcap)
or other methods cannot manage high-bandwidth
network traffic [42], our approach has successfully managed 10 Gbps of network traffic using netmap,
multiple threads, and lightweight locks.
Furthermore, by separating the network traf-
fic, we can manage and analyze high-bandwidth
network traffic on multiple computers. The flow
abstractor receives flows from the cell incubator,
reconstructs the TCP flows, and forwards them
to multiple application-level analyzers, thus pro-
viding multicore scalability. This multicore and
horizontally scalable architecture enables efficient
application-level traffic analysis, which typically re-
quires considerable computational resources.
2. 1 Flow Abstraction
Existing applications based on DSL-based ap-
proaches, such as the aforementioned Wireshark,
Bro, and Suricata applications, are not always ap-
propriate because different programming languages
are suitable for different requirements. For exam-
ple, programming languages suitable for string ma-
nipulation, such as Perl and Python, should be used
for text-based protocols. Conversely, programming
languages suitable for binary manipulation, such as
C and C++, should be used for binary-based pro-
tocols, whereas programming languages equipped
Fig. 1 Highly Abstracted Architecture of SF-TAP
with machine learning libraries should be used for
machine learning.
To ensure flexibility in the application-level net-
work traffic analysis, our approach abstracts the
network flows into files using abstraction interfaces,
similarly to Plan 9, UNIX's /dev, and BPF. Using
these abstraction interfaces, IDS developers, traffic
engineers, and other analysts can implement ana-
lyzers in their preferred languages.
2. 2 Horizontal Scalability
Application-level analyzers require substantial
computational resources. For example, HTTP mes-
sages are analyzed by string parsing, URLs are fil-
tered in real-time by pattern matching based on
regular expressions, and the specific features of net-
work traffic are extracted by machine learning tech-
niques. In general, these processes consume much
CPU time.
To address this problem, our horizontally scal-
able architecture allocates multiple computers to
the analyzers (the most avid consumers of compu-
tational resources) in high-bandwidth application-
level traffic analysis.
2. 3 Modular Architecture
Modularity is an important feature of our archi-
tecture, as it grants the multicore scalability and
flexibility of SF-TAP. It also separates the cap-
turing and analyzing components of the system.
In traditional network traffic analysis applications
such as Snort and Bro, which are monolithic, any
update must halt the operation of all components,
including the traffic capturer. In the proposed
modular architecture, the traffic analysis compo-
nents are updated without interfering with other
components. This is especially important in the
research and development phase, when the analyz-
ing components are frequently updated. The mod-
ularity also ensures that bugs in applications still
under development do not negatively affect other
applications.
2. 4 Commodity Hardware
According to Martins et al. [21], hardware ap-
pliances are relatively inflexible and incompati-
ble with new functions. Furthermore, hardware
appliances are very expensive and are not easily
scaled horizontally. To address these problems,
researchers are developing software-based alterna-
tives running on commodity hardware, such as net-
work function virtualization (NFV) [26]. Our pro-
posed software-based approach also runs in com-
modity hardware environments, achieving similar
flexibility and scalability to NFV.
3 Design
This section describes the design of SF-TAP.
Fig. 2 Architecture of SF-TAP
3. 1 Design Overview
Fig. 2 shows the detailed architecture of the pro-
posed SF-TAP. SF-TAP consists of four planes:
the capturer, separator, abstractor, and analyzer
planes. The capturer plane, which captures the
network traffic, comprises a port-mirroring mechanism of
L2/L3 switches, an SSL proxy that sniffs the plain
text, and various minor components. The sepa-
rator plane provides horizontal scalability for high-
bandwidth network traffic analysis. This plane sep-
arates the network traffic at the L3/L4 level, for-
warding the flows to multiple cells. Each cell com-
prises an abstractor plane and an analyzer plane.
The abstractor plane abstracts the network flows. Specifically, it reassembles fragmented IP packets,
identifies the L3 and L4 flows, reconstructs the
TCP streams, detects the application protocol us-
ing regular expressions, and outputs the flows to
the appropriate abstraction interfaces. The ana-
lyzer plane is the plane developed by SF-TAP users.
The analyzers can be implemented in any program-
ming language, and are developed by accessing the
interfaces provided by the abstractor plane.
The components of the capturer plane are well
known, and the analyzers of the analyzer plane
are developed by SF-TAP users. Therefore, the
present study focuses on the separator and abstrac-
tor planes.
The following subsections describe the flow ab-
stractor and cell incubator designs, and present an
HTTP analyzer as an example.
3. 2 Cell Incubator Design
The cell incubator shown in Fig. 2 is a software-
based network traffic balancer that mirrors and sep-
arates the network traffic based on the flows, thus
working as an L2 bridge. The cell incubator con-
sists of a packet forwarder, an IP fragment handler,
and a flow separator.
The packet forwarder receives L2 frames and for-
wards them to the IP fragment handler. If required,
it also forwards frames to other network interface
controllers (NICs), acting as an L2 bridge and negating
the need for hardware-based network traffic mirror-
ing.
The IP fragment handler identifies fragmented packets, associates them with their flows, and forwards them to the flow separator. For load balancing, the IP fragment handler calculates a hash value over the IP addresses and port numbers of a non-fragmented packet, h1 = hash(srcIP ⊕ dstIP ⊕ srcPort ⊕ dstPort). However, port numbers are available only in the first packet of a fragmented series. Therefore, when encountering the first fragment, which carries a fragmentation identifier, the IP fragment handler calculates both h1 and another hash value, h2 = hash(srcIP ⊕ dstIP ⊕ fragmentation identifier), and associates h1 with h2. Consequently, the fragments following the head can be associated with h1 because h2 can be calculated from them. A limitation is that reordered fragments may not be handled properly.
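The hash-association scheme above can be sketched as follows. This is an illustrative model rather than the cell incubator's actual C++ code; the hash function and the dictionary-based packet representation are assumptions.

```python
import hashlib

def _h(*parts):
    # Stand-in for the incubator's hash function; any stable hash works here.
    return int(hashlib.sha1("|".join(map(str, parts)).encode()).hexdigest(), 16)

class FragmentHandler:
    """Associates later IP fragments (which lack an L4 header) with the flow
    hash h1 computed from the first fragment, via the fragmentation identifier."""

    def __init__(self):
        self.h2_to_h1 = {}  # h2 (fragmentation-id based) -> h1 (port based)

    def flow_hash(self, pkt):
        if pkt.get("frag_id") is None:
            # Non-fragmented packet: ports are available, compute h1 directly.
            return _h(pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"])
        h2 = _h(pkt["src_ip"], pkt["dst_ip"], pkt["frag_id"])
        if "src_port" in pkt:
            # First fragment: carries the L4 header, so associate h1 with h2.
            h1 = _h(pkt["src_ip"], pkt["dst_ip"], pkt["src_port"], pkt["dst_port"])
            self.h2_to_h1[h2] = h1
            return h1
        # Subsequent fragment: look up h1 via h2 (fails if fragments are reordered).
        return self.h2_to_h1.get(h2)
```

As in the paper, fragments that arrive before their series' first packet cannot be resolved, which is exactly the reordering limitation noted above.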
The flow separator forwards the packets to mul-
tiple SF-TAP cells using the flow information
(namely, the IP addresses and port numbers of the
source and destination). The destination SF-TAP
cell is determined by the hash value of the flow
identifier.
1 http:
2   up: '^[-a-zA-Z]+ .+ HTTP/1\.(0\r?\n|1\r?\n([-a-zA-Z]+: .+\r?\n)+)'
3   down: '^HTTP/1\.[01] [1-9][0-9]{2} .+\r?\n'
4   proto: TCP # TCP or UDP
5   if: http # path to UNIX domain socket
6   priority: 100 # priority
7   balance: 4 # balanced by 4 IFs
8
9 torrent_tracker: # BitTorrent Tracker
10   up: '^GET .*(announce|scrape).*\?.*info_hash=.+&.+ HTTP/1\.(0\r?\n|1\r?\n([-a-zA-Z]+: .+\r?\n)+)'
11   down: '^HTTP/1\.[01] [1-9][0-9]{2} .+\r?\n'
12   proto: TCP
13   if: torrent_tracker
14   priority: 90 # priority
15
16 dns_udp:
17   proto: UDP
18   if: dns
19   port: 53
20   priority: 200
Fig. 3 Example of Flow Abstractor Configuration
1 $ ls -R /tmp/sf-tap
2 loopback7= tcp/ udp/
3
4 /tmp/sf-tap/tcp:
5 default= http2=        ssh=
6 dns=     http3=        ssl=
7 ftp=     http_proxy=   torrent_tracker=
8 http0=   irc=          websocket=
9 http1=   smtp=
10
11 /tmp/sf-tap/udp:
12 default= dns= torrent_dht=
Fig. 4 Directory Structure of the Flow Abstraction Interface
3. 3 Flow Abstractor Design
In this subsection, we describe the design of the
flow abstractor schematized in Fig. 2, and explain
its main mechanisms using an example configura-
tion file (see Fig. 3). For human readability, the
flow abstractor is configured in the data serial-
ization language YAML [45]. The flow abstractor
consists of four components. In top-to-bottom se-
quence, these components are the IP packet defrag-
menter, the flow identifier, the TCP and UDP han-
dler, and the flow classifier.
3. 3. 1 Flow Reconstruction
The flow abstractor defragments the fragmented
IP packets and reconstructs the TCP streams.
Thereby, analyzer developers need not implement
the complicated reconstruction logic required for
application-level analysis.
The IP packet defragmenter shown in Fig. 2 reassembles fragmented IP packets. The reassembled packets are forwarded to the flow iden-
tifier, which identifies the L3/L4-level flows. The
flows are identified as 5-tuples containing the source
and destination IP addresses, the source and des-
tination port numbers, and a hop count (described
in Section 3. 3. 2). Once the flows have been iden-
tified, the TCP streams are reconstructed by the
TCP and UDP handler. Finally, the flows are for-
warded to the flow classifier.
3. 3. 2 Flow Abstraction Interface
The flow abstractor provides the interfaces that
abstract the flows at the application level (for ex-
ample, the TLS and HTTP interfaces shown in
Fig. 2). The HTTP, BitTorrent tracker [3], and do-
main name system (DNS) interfaces are defined in
Fig. 3.
The flow classifier classifies the flows of various
application protocols, forwarding them to the flow
abstraction interfaces, which are implemented via
a UNIX domain socket, as shown in Fig. 4. The
file names of the interfaces are defined by if items
in Fig. 3; for example, lines 5, 13, and 18 in this fig-
ure define the HTTP, BitTorrent tracker, and DNS
interfaces as http, torrent_tracker, and dns, re-
spectively. Because independent interfaces are provided for each application protocol, analyzers can be implemented in any programming language.
Furthermore, we designed a special interface for
flow injection called the L7 loopback interface, i.e.,
L7 Loopback I/F (see Fig. 2). This interface is con-
venient for encapsulated protocols such as HTTP proxying. As an HTTP proxy can encapsulate other protocols within HTTP, the encapsulated traffic
should also be analyzed at the application level.
In this situation, the encapsulated traffic is eas-
1 ip1=192.168.0.1,ip2=192.168.0.2,port1=62918,port2=80,hop=0,l3=ipv4,l4=tcp,event=CREATED
2 ip1=192.168.0.1,ip2=192.168.0.2,port1=62918,port2=80,hop=0,l3=ipv4,l4=tcp,event=DATA,from=2,match=down,len=1398
3
4 1398[bytes] Binary Data
5
6 ip1=192.168.0.1,ip2=192.168.0.2,port1=62918,port2=80,hop=0,l3=ipv4,l4=tcp,event=DESTROYED
Fig. 5 Example output of the Flow Abstraction Interface
ily analyzed by re-injecting the encapsulated traffic
into the flow abstractor via the L7 loopback inter-
face. The flow abstractor manages the re-injected traffic in the same manner as non-encapsulated traffic. Although the application-level analysis of encapsulated traffic generally tends to be complex to implement, this re-injection mechanism simplifies the implementation.
The L7 loopback interface may cause infinite re-
injections. To avoid this problem, we introduce a
hop count and a corresponding hop limitation. The
flow abstractor drops the injected traffic when its
hop count exceeds the hop limitation, thus limiting
the number of re-injections.
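The guard described above can be modeled in a few lines; the limit value and the flow representation here are illustrative assumptions, not the flow abstractor's actual configuration.

```python
HOP_LIMIT = 4  # illustrative value; the actual limit is a configuration parameter

def process(flow, reinjector):
    """Model of the re-injection guard: an analyzer decapsulates traffic and
    re-injects it with an incremented hop count; the abstractor drops flows
    whose hop count exceeds the limit, bounding the number of re-injections."""
    if flow["hop"] > HOP_LIMIT:
        return None  # dropped by the flow abstractor
    inner = reinjector(flow)  # returns decapsulated traffic, or None
    if inner is not None:
        inner["hop"] = flow["hop"] + 1
        return process(inner, reinjector)
    return flow
```

Even a pathological analyzer that re-injects everything terminates, because each round of re-injection increments the hop count.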
Besides flow abstraction and the L7 loopback in-
terface, the flow abstractor provides default inter-
faces that capture the unknown or unclassified net-
work traffic.
3. 3. 3 TCP Session Abstraction
Fig. 5 is a representative output of the flow ab-
straction interface. The flow abstractor first out-
puts a header containing information on the flow
identifier and the abstracted TCP event, then out-
puts the body (if it exists).
Line 1 of Fig. 5 indicates that a TCP ses-
sion was established between 192.168.0.1:62918
and 192.168.0.2:80. Line 2 indicates that 1398
bytes of data were sent from 192.168.0.2:80 to
192.168.0.1:62918. The from values distinguish the
source and destination addresses, and the len value
denotes the data length. Lines 3–5 indicate that the
transmitted outputs are binary data. Finally, line
6 indicates that the TCP session was disconnected.
The match value in this figure denotes the pattern (up or down in Fig. 3) that matched during protocol detection.
To reduce the complexity of managing a TCP ses-
sion, the flow abstractor abstracts the TCP states
as three events, i.e., CREATED, DATA, and DE-
STROYED. Accordingly, analyzer developers can
easily manage their TCP sessions and concentrate
on the application-level analysis.
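Because the interface emits flows as comma-separated headers followed by an optional body (Fig. 5), a client in any language needs only a few lines to consume the three events. A Python sketch, with the field names taken from the output format shown in Fig. 5:

```python
def parse_header(line):
    """Parse one flow abstractor header line, e.g.
    'ip1=...,ip2=...,port1=...,port2=...,hop=0,l3=ipv4,l4=tcp,event=DATA,from=2,match=down,len=1398'
    into a dict of its key=value fields."""
    return dict(field.split("=", 1) for field in line.strip().split(","))

def handle(header_line, on_created, on_data, on_destroyed):
    """Dispatch one abstracted TCP event to the appropriate callback."""
    hdr = parse_header(header_line)
    event = hdr["event"]
    if event == "CREATED":
        on_created(hdr)
    elif event == "DATA":
        on_data(hdr, int(hdr["len"]))  # a body of len bytes follows the header
    elif event == "DESTROYED":
        on_destroyed(hdr)
    return hdr
```

An analyzer reading a flow abstraction interface thus reduces to a loop over header lines and fixed-length body reads.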
3. 3. 4 Application-level Protocol Detec-
tion
The flow classifier shown in Fig. 2 is an
application-level protocol classifier, which detects
protocols using regular expressions and a port num-
ber. The items up and down in Fig. 3 are the regu-
lar expressions in application-level protocol detec-
tion. When the upstream and downstream flows
match the regular expressions, the flows are out-
putted to the specified interface described in a con-
figuration file. Although application-level proto-
cols can be detected by alternative methods such
as Aho–Corasick and Bayesian filtering, we adopt
regular expressions because of their generality and
high expressive power. Meanwhile, the port num-
ber distinguishes the application-level flows from
other flows. For example, line 19 of Fig. 3 classifies
DNS flows by their port number, 53.
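Using the regular expressions from the example configuration in Fig. 3, the regex-based classification step can be sketched as follows (simplified: the real classifier also handles port-based rules and priorities).

```python
import re

# Up/downstream patterns for HTTP, taken from the example configuration (Fig. 3)
HTTP_UP = re.compile(r'^[-a-zA-Z]+ .+ HTTP/1\.(0\r?\n|1\r?\n([-a-zA-Z]+: .+\r?\n)+)')
HTTP_DOWN = re.compile(r'^HTTP/1\.[01] [1-9][0-9]{2} .+\r?\n')

def classify_http(up_data, down_data):
    """Return 'http' when both flow directions match the configured patterns."""
    if HTTP_UP.match(up_data) and HTTP_DOWN.match(down_data):
        return "http"
    return None
```

For example, `classify_http("GET / HTTP/1.1\r\nHost: example.com\r\n", "HTTP/1.1 200 OK\r\n")` classifies the flow as HTTP, whereas arbitrary binary data matches neither pattern.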
The priority item in Fig. 3 dictates the prior-
ity rules in cases of ambiguity; here, lower priority
values are prioritized over higher ones. For example, because the BitTorrent tracker protocol adopts HTTP communication, it shares its protocol format with HTTP, so a BitTorrent tracker flow matches both rules. The priority ruling removes this ambiguity; the configuration in Fig. 3 prioritizes BitTorrent tracker over HTTP.
3. 4 Load Balancing via the Flow Abstrac-
tion Interface
In general, the occurrence frequency of each application protocol in a network is biased. Con-
1 {
2   "client": {
3     "port": "61906",
4     "ip": "192.168.11.12",
5     "header": {
6       "host": "www.nsa.gov",
7       "user-agent": "Mozilla\/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko\/20100101 Firefox\/31.0",
8       "connection": "keep-alive",
9       "pragma": "no-cache",
10       "cache-control": "no-cache"
11     },
12     "method": {
13       "method": "GET",
14       "uri": "\/",
15       "ver": "HTTP\/1.1"
16     },
17     "trailer": {}
18   },
19   "server": {
20     "port": "80",
21     "ip": "23.6.116.226",
22     "header": {
23       "connection": "keep-alive",
24       "content-length": "6268",
25       "date": "Sat, 16 Aug 2014 11:38:25 GMT",
26       "content-encoding": "gzip",
27       "vary": "Accept-Encoding",
28       "x-powered-by": "ASP.NET",
29       "server": "Microsoft-IIS\/7.5",
30       "content-type": "text\/html"
31     },
32     "response": {
33       "ver": "HTTP\/1.1",
34       "code": "200",
35       "msg": "OK"
36     },
37     "trailer": {}
38   }
39 }
Fig. 6 Example output of the HTTP Analyzer
sequently, if each application protocol is analyzed
by only one process, the computational load will
be concentrated on a particular analyzer process.
To prevent this problem, we introduce a load-
balancing mechanism to the flow abstraction inter-
faces.
The configuration of the load-balancing mecha-
nism is specified on line 7 of Fig. 3. The balance
value is 4, indicating that the HTTP flows are sep-
arated and outputted to four balancing interfaces.
The interfaces http0=, http1=, http2=, and http3=
in Fig. 4 are the balancing interfaces. By intro-
ducing one-to-many interfaces, we can easily scale non-multithreaded analyzers across multiple CPU cores.
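On the abstractor side, choosing a balancing interface reduces to hashing the flow identifier; a sketch, where the hash function is an assumption and the interface naming follows Fig. 4:

```python
import hashlib

def balancing_interface(ip1, port1, ip2, port2, name="http", balance=4):
    """Pick one of the balanced interfaces (http0= .. http3= in Fig. 4)
    for a flow. Hashing the flow identifier pins a flow to one interface,
    so each single-threaded analyzer always sees complete flows."""
    # Sort the endpoints so both directions of a flow map to the same interface.
    key = "|".join(sorted([f"{ip1}:{port1}", f"{ip2}:{port2}"]))
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return f"{name}{h % balance}"
```

The same idea underlies the cell incubator's flow separation; only the targets differ (UNIX domain sockets here, SF-TAP cells there).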
3. 5 HTTP Analyzer Design
This subsection describes the design of the HTTP
analyzer, a type of application-level analyzer. The
HTTP analyzer reads the flows from the abstrac-
tion interface of the HTTP provided by the flow
abstractor, and then serializes the results in JSON
format to provide a standard output. Fig. 6 shows
an example output of the HTTP analyzer. The
HTTP analyzer reads the flows and outputs the re-
sults as streams.
4 Implementation
This section describes our proof-of-concept implementation of SF-TAP.
4. 1 Implementation of the Cell Incubator
The cell incubator must be capable of manag-
ing high-bandwidth network traffic, which cannot
be handled by conventional methods such as libp-
cap. Therefore, our cell incubator implementation
exploits netmap for packet capturing and forward-
ing.
The cell incubator is implemented in C++ and
is available on FreeBSD and Linux. In inline mode,
the cell incubator works as an L2 bridge; in mirror-
ing mode, it only receives and separates L2 frames
without bridging the NICs.
Users can select the desired mode when deploying
the cell incubator.
The network traffic is separated using the hash
values of each flow’s source and destination IP ad-
dresses and port numbers. Thus, each flow is for-
warded to a uniquely determined NIC.
Because we adopted netmap, the NICs used by
the cell incubator must be netmap-available. In
general, receive-side scaling (RSS) is enabled on
netmap-available NICs, and there are multiple re-
ceiving and sending queues on the NICs. Thus,
the cell incubator generates a thread for each
queue to balance the CPU load and achieve high-
throughput packet management. However, the
sending queues are shared among threads, requiring exclusive access control, which is computationally expensive.
To reduce the CPU load of the exclusive controls,
we adopt a lock mechanism that uses the compare-
and-swap instruction.
4. 2 Implementation of the Flow Abstrac-
tor
The flow abstractor was implemented in C++,
and requires Boost [4], libpcap [41], libevent [17], RE2 [33], and yaml-cpp [46]. It is available on Linux,
*BSD, and MacOS X. The flow abstractor is multi-
threaded, with traffic capture, flow reconstruction,
and application protocol detection executed by dif-
ferent threads. For simplicity and clarity, the data
transfer among the threads follows the producer-
consumer pattern.
The flow abstractor implements a garbage collec-
tor for zombie TCP connections. More specifically,
TCP connections may disconnect without a FIN or RST packet because of host or network failures.
Garbage collection is controlled by timers, adopt-
ing a partial garbage-collection algorithm to avoid
long-term locking.
Because synchronizing the threads imposes a heavy CPU load, our flow abstractor transfers bulk
data among the threads when a specified amount of
data is collected in the producer’s queue, or when
a specified time has elapsed.
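The bulk-transfer rule described above (flush when the producer's buffer holds enough items or a deadline passes) can be sketched as follows; the batch size and timeout values are illustrative.

```python
import queue
import time

class BulkProducer:
    """Producer side of the producer-consumer hand-off: instead of one queue
    operation per packet, items are buffered and pushed in bulk when either
    the batch fills up or a timeout elapses, reducing synchronization cost."""

    def __init__(self, out_queue, batch_size=128, timeout=0.05):
        self.out = out_queue
        self.batch_size = batch_size
        self.timeout = timeout
        self.buf = []
        self.last_flush = time.monotonic()

    def push(self, item):
        self.buf.append(item)
        if (len(self.buf) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.timeout):
            self.flush()

    def flush(self):
        if self.buf:
            self.out.put(self.buf)  # one lock/wake-up for the whole batch
            self.buf = []
        self.last_flush = time.monotonic()
```

The consumer then dequeues whole batches, so the per-item synchronization cost is amortized over the batch size.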
The packets are captured using libpcap or netmap
[34], which is a kernel bypass technology. Kernel by-
pass technologies, such as netmap and DPDK [9],
achieve higher throughput than libpcap, but these mechanisms require considerable CPU resources because they access network devices via polling to increase throughput. Netmap can also manage packets by blocking and waiting, but this mode increases latency. Accordingly, these mechanisms take CPU resources away from the application-level analyzers. Furthermore,
as netmap and DPDK exclusively attach to NICs,
they prevent the attachment of other programs,
such as tcpdump, to the same NICs. This restric-
tion impedes network operations and the develop-
ment or debugging of network software. To allevi-
ate these disadvantages, the present flow abstractor
adopts both libpcap and netmap.
Whether the abstraction interfaces emit text or binary headers is configured in the flow abstractor's configuration file. For script languages like
[Fig. 7 Experimental Network of the Cell Incubator: 10 GbE × 2 (α, β) and 1 GbE × 12 (γ)]
Python, a text header is suitable for easily parsing
the headers, whereas a binary header improves the
performance.
4. 3 Implementation of the HTTP Ana-
lyzer
For demonstration and evaluation, we implement
an HTTP analyzer. The HTTP analyzer is im-
plemented in Python, and manages TCP sessions
using Python’s dictionary data structure. The
HTTP analyzer could also easily be implemented in other
lightweight languages. Note that the Python im-
plementation was used for the performance evalua-
tions in Section 5.
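A minimal model of that dictionary-based session management, keyed by the flow identifier fields of the abstraction-interface header (Section 3. 3. 3); the analyzer-side state layout is an assumption.

```python
sessions = {}  # flow key -> per-direction byte buffers

def flow_key(hdr):
    # hdr is a parsed header dict with the fields shown in Fig. 5
    return (hdr["ip1"], hdr["port1"], hdr["ip2"], hdr["port2"], hdr["hop"])

def on_event(hdr, payload=b""):
    """Track TCP sessions through the CREATED/DATA/DESTROYED events."""
    key = flow_key(hdr)
    if hdr["event"] == "CREATED":
        sessions[key] = {"1": b"", "2": b""}  # one buffer per endpoint
    elif hdr["event"] == "DATA":
        state = sessions.setdefault(key, {"1": b"", "2": b""})
        state[hdr["from"]] += payload  # append data from the sending endpoint
    elif hdr["event"] == "DESTROYED":
        sessions.pop(key, None)  # session ended; parse and emit results here
```

The real analyzer parses the buffered client and server data as HTTP and serializes the result as in Fig. 6; the skeleton above is the session-tracking core.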
5 Experimental Evaluation
This section experimentally evaluates SF-TAP.
5. 1 Performance Evaluation of the Cell
Incubator
The cell-incubator experiments were conducted on a PC with 16 GB of DDR3 memory and an Intel Xeon E5-2470 v2 processor (10 cores, 2.4 GHz, 25 MB cache), running the FreeBSD 10.1 operating system. The computer was equipped with four In-
tel quad-port 1 GbE NICs and an Intel dual-port
10 GbE NIC. For these evaluations, we generated network traffic consisting of 64- to 1024-byte L2 frames on the 10 GbE lines using a hardware application-level traffic generator †1. The cell incubator separated the traffic into different
flows, and forwarded the separated flows to the
twelve 1 GbE lines. The experimental network is
displayed in Fig. 7.
During the experiments, the cell incubator was
operated in three modes: (1) mirroring mode using
†1 Ixia’s PerfectStorm One 10GE
[Fig. 8 Forwarding performance of the Cell Incubator (10 Gbps); x-axis: fragment size [bytes] (64–1024), y-axis: forwarding rate [Mpps]; series: ideal, (1) α→γ, (2) α→β, (3) α→β, (3) α→γ]
[Fig. 9 CPU Load average of the Cell Incubator (64-byte frames) at 5.95, 10.42, and 14.88 Mpps; x-axis: CPU No. 0–19, y-axis: CPU utilization [%]]
[Fig. 10 Resources of CPU 15 consumed by the Cell Incubator (64-byte frames); panels: (a) 5.95 Mpps, (b) 10.42 Mpps, (c) 14.88 Mpps; x-axis: elapsed time [s], y-axis: CPU utilization [%], split into system and user]
port mirroring on the L2 switch (in this mode, the
cell incubator captured the packets at α and for-
warded them to γ); (2) inline mode with the pack-
ets forwarded directly from α to β, bypassing the
1 GbE NICs; and (3) inline mode with the packets
captured at α and forwarded to both β and γ.
Fig. 8 shows the forwarding performance of the
cell incubator for different fragment sizes of 10
Gbps HTTP traffic. In mirroring mode (mode 1),
the cell incubator forwarded packets up to 12.49
Mpps. When operated as an L2 bridge (mode 2),
the cell incubator forwarded packets up to 11.60
Mpps, and when forwarding packets to β and γ
(mode 3), it delivered up to 11.44 Mpps to each des-
tination. The performance was better in mirroring
mode than in inline mode because
the inline mode delivers packets to two NICs. How-
ever, the inline mode is more suitable for specific
purposes such as IDS, because it drops the same
packets at β and γ; that is, the inline mode cap-
tures all of the transmitted packets.
Fig. 9 shows the CPU load averages of the cell
incubator when forwarding 64-byte frames in the
inline mode. At 5.95 and 10.42 Mpps, no packets
were dropped during forwarding. The upper limit
of dropless forwarding was approximately 10.42
Mpps. This indicates that forwarding consumed
the resources of several CPUs, but especially those
of the 15th CPU.
The loads on CPU 15 at different traf-
fic rates are plotted in Fig. 10. At 5.95 Mpps,
the average load was approximately 50%, but at
10.42 Mpps, the loads approached 100%. At 14.88
Mpps, the CPU resources were completely con-
sumed. This limitation in forwarding performance
was probably caused by the bias introduced by the
flow director [13] of Intel’s NIC and its driver. This
bias is network-flow dependent, and uncorrectable
because the flow director cannot currently be con-
trolled by the user programs on FreeBSD. Note that
the fairness problem of the RSS queues could be
resolved by the correct implementation, which is
94 コンピュータソフトウェア
Table 1 HTTP traffic in the evaluations. bps,
pps, and cps denote bits per second,
packets per second, and connections
per second, respectively.
Gbps Kpps Kcps
1 224 18.0
2 467 35.9
3 701 53.9
4 935 71.9
5 1,169 89.9
6 1,403 107.9
7 1,638 126.0
8 1,870 143.8
9 2,101 161.7
[Fig. 11 diagram: HTTP traffic enters the Flow Abstractor and is separated into 16 discard processes.]
Fig. 11 Flow Abstractor with
discard processes
earmarked for future work.
Finally, the memory utilization of the cell incuba-
tor depends on the memory-allocation strategy of
netmap. In the current implementation of the cell
incubator, the experiments required approximately
700 MB of memory.
5. 2 Performance Evaluation of the Flow
Abstractor
This subsection presents the experimental re-
sults of the flow abstractor. Experiments were
executed on a PC with DDR4 512 GB memory,
an Intel Xeon E7-4830v3 processor (12 cores, 24
threads with hyper-threading, 2.1 GHz, 30 MB
cache) × 4, an Intel XL710QDA2 (dual port 40
Gbps QSFP+) network interface card, and the
Ubuntu 16.04 (Linux Kernel 4.4.0) operating sys-
tem, and the evaluated HTTP traffic volumes are
listed in Table 1. The HTTP traffic was generated
by the hardware application-level traffic generator
[Fig. 12: stacked plot of the CPU utilization (system/user, ×100%) of the Flow Abstractor versus bandwidth (1-9 Gbps), with connection rates from 36 to 144 Kcps annotated.]
Fig. 12 CPU load of the Flow Abstractor
versus bandwidth
Fig. 13 Packet drop versus bandwidth
as described in Section 5. 1. The HTTP traffic was
separated and passed to 16 discard processes, as
shown in Fig. 11. The discard processes, which
exercise the UNIX domain socket handling, simply
read and discard the data from the UNIX
domain socket.
The stacked graph in Fig. 12 plots the average
CPU load of the netmap-based flow abstractor
when capturing HTTP traffic at different bandwidths.
In this experiment, the flow abstractor
was configured to use a binary header for the abstraction
interfaces. As shown in the figure, the
CPU consumption increased linearly with increasing
bandwidth. This result is expected, because
our implementation adopts a spinlock-based reader-
writer lock to reduce the overhead of lock contention.
Fig. 13 compares the packet dropping rates of
tcpdump, Snort, and the flow abstractor when
Vol. 36 No. 3 Aug. 2019 95
for (;;) {
    /* Grab a packet */
    packet = pcap_next(handle, &header);
    /* Do some tasks: parsing, printing, etc. */
}
Fig. 14 Pseudocode of single-threaded
packet handling
capturing HTTP traffic with different bandwidths.
“tcpdump -n > /dev/null” captures packets with
libpcap, parses them without DNS resolution, and
writes the results to the standard output. With
packet parsing enabled, tcpdump could not handle
200 Mbps HTTP traffic and tended to drop packets.
Snort also captures packets with libpcap, and
reconstructs the TCP sessions using the Stream5
engine. In this evaluation, we set the Stream5
engine parameters to their maximum possible values;
the Stream5 engine was configured as described in
Appendix A. Snort completely handled 1.5 Gbps
HTTP traffic without packet dropping, but began
dropping packets at higher traffic rates. The packet
dropping rate reached 60% when the traffic increased
to 5 Gbps. As with tcpdump, the packet drops
were caused by the single-threaded architecture.
Fig. 14 shows pseudocode for single-threaded
packet handling. As shown in the figure, packet
handling is basically divided into two parts, grabbing
packets and processing them, and the overall
performance decreases as the overhead of the tasks
other than grabbing packets increases. For example,
because printing to the standard output, which
requires a system call, is a relatively time-consuming
task, “tcpdump -n > /dev/null” captures packets
without dropping only up to 200 Mbps. For the
same reason, Snort also drops packets at traffic
rates above 1.5 Gbps.
Fig. 13 also shows the results for “tcpdump -nw
/dev/null”, which captures packets with libpcap
without DNS resolution, and writes them, without
parsing, to /dev/null. Both “tcpdump -nw /dev/null”
and the libpcap-based flow abstractor completely
handled 2 Gbps HTTP traffic without packet dropping,
but began dropping packets at 3 Gbps. The two
applications exhibit similar trends because their
packet handling operations are similar: the flow
abstractor captures packets with one thread and
passes them to other threads. Namely, the overhead
of the tasks other than grabbing packets, described
in Fig. 14, was eliminated. However, HTTP traffic
over 3 Gbps was not completely handled because
packet capturing was performed in only one
thread.
thread. In contrast, the netmap-based flow abstrac-
tor completely handled the 9 Gbps HTTP traffic
without packet dropping. Such remarkable per-
formance is attributable to netmap, which enables
packet-capture through multiple threads, and the
multithreaded design and implementation of the
flow abstractor. As shown in Section 4. 2, the flow
abstractor adopts the bulk data transfer technique
to transfer data between threads to handle. In ad-
dition to that, we implemented a spinlock-based
reader-writer locking described in Appendix B be-
cause the conventional implementation of reader-
writer locking was a main bottleneck of the flow
abstractor.
The flow abstractor mainly consists of three types
of threads: the capturing thread (netmap or pcap),
the TCP handling thread, and the threads for regular
expression matching and UNIX domain sockets.
Fig. 15 shows the CPU loads of the threads of the
flow abstractor. Here, we varied the bandwidth of
the HTTP traffic, the header format of the abstraction
interfaces, and the method for capturing the L2
frames.
In panels (a)-(d) of Fig. 15, the flow abstractor
was implemented with netmap, and a binary header
was used for the abstraction interfaces. The HTTP-
traffic rate was varied from 6 to 9 Gbps. As shown
in the graphs, the threads of the regex & UNIX do-
main socket collectively consumed more than 60%
of the flow abstractor's CPU load, as some CPU-
intensive functions were executed in these
threads.
In panels (e) and (f) of Fig. 15, the flow abstrac-
tor was also implemented with netmap, and the bi-
nary header was used for the abstraction interfaces.
These conditions were the same as in panels (a) to (d),
but the HTTP-traffic rate was 1 Gbps for both (e)
and (f), and a text header was used in (f). As shown
in the figures, the text header, which is processed in
the thread of the UNIX domain socket, is more compu-
tationally expensive than the binary
[Fig. 15 panels, each plotting the CPU utilization (×100%) of the netmap/pcap, TCP, and regex & UNIX domain socket threads versus elapsed time: (a) netmap, binary header, HTTP 6 Gbps; (b) netmap, binary header, HTTP 7 Gbps; (c) netmap, binary header, HTTP 8 Gbps; (d) netmap, binary header, HTTP 9 Gbps; (e) netmap, binary header, HTTP 1 Gbps; (f) netmap, text header, HTTP 1 Gbps; (g) pcap, binary header, HTTP 1 Gbps; (h) pcap, text header, HTTP 1 Gbps.]
Fig. 15 CPU loads of the Flow Abstractor threads
Table 2 CPU usage of the Flow Abstractor
functions. The functions starting
with * are mainly used in the
threads of the regex & UNIX
domain socket.
function usage [%]
*free 19.7
*RE2’s matching 10.9
poll (netmap) 10.8
malloc 9.4
__lll_lock_elision 8.8
*write (UNIX domain socket) 8.3
header. The binary header is thus more suitable
for high-bandwidth traffic, whereas the text header
is suitable for the debugging or development phase
because of its human readability.
In panels (g) and (h) of Fig. 15, the flow
abstractor was implemented with pcap instead of
netmap, the binary and the text header were re-
spectively used, and the HTTP-traffic rate was 1
Gbps. The figures reveal that netmap is more com-
putationally expensive than pcap because netmap
polls the packet queues to handle high-pps traffic.
Therefore, pcap is more suitable than netmap for
low-rate traffic, and netmap is more suitable than
pcap for high-rate traffic.
Table 2 shows the most computationally expensive
functions of the flow abstractor when capturing 9
Gbps HTTP traffic with netmap and the binary
header for the abstraction interfaces. The functions
prefixed with * are mainly used in the threads of
the regex & UNIX domain socket. The table re-
veals that reader–writer locks, memory (de)allocation,
netmap polling, regular expression matching,
and writing to UNIX domain sockets are the main
bottlenecks.
5. 3 HTTP Analyzer and Load Balancing
A main feature of the flow abstractor is the multi-
core scalability of its application protocol analyzers.
In this subsection, we show the effectiveness of the
load-balancing mechanism of the flow abstractor
through various experiments, applying the HTTP
analyzer as a heavy application-level analyzer. To
improve performance, the flow abstractor ap-
plied jemalloc [11] for dynamic memory allocation.
A binary header was used for the abstraction in-
terface. These experiments were conducted on the
PC described in subsection 5. 2.
Fig. 16 shows the CPU loads of the HTTP an-
alyzer and flow abstractor when generating HTTP
requests by using the hardware application-level
traffic generator described in Section 5. 1. In this
[Fig. 16 panels, each plotting the CPU utilization (%) of the SF-TAP Flow Abstractor and the CPython HTTP analyzer processes versus elapsed time: (a) 2 HTTP analyzer processes; (b) 4 HTTP analyzer processes; (c) 8 HTTP analyzer processes.]
Fig. 16 CPU load balancing of HTTP analyzers with 150 Mbps,
36.3 Kpps and 2.68 Kcps of HTTP traffic
[Fig. 17 diagram: 6 Gbps of HTTP traffic enters the Cell Incubator, which separates it into 14 SF-TAP cells, each running the Flow Abstractor and 8 HTTP analyzer processes.]
Fig. 17 Comprehensive evaluation of
SF-TAP with 14 cells
figure, the average HTTP traffic generation was 150
Mbps, 36.3 Kpps, and 2.68 Kcps, where bps, pps,
and cps denote bits per second, packets per second,
and connections per second, respectively. Panels
(a), (b) and (c) of Fig. 16 show the CPU loads
when executing the load balancing using two, four,
and eight HTTP analyzer processes, respectively.
As shown in Fig. 16 (a), two HTTP analyzer
processes could manage up to approximately 150
Mbps, with each process consuming from 80 to
100% of the CPU resources. Note that a single
HTTP analyzer process could not handle the 150
Mbps HTTP traffic because of CPU saturation.
With four and eight HTTP analyzer processes
(Fig. 16 (b) and (c), respectively), each process
consumed approximately 60% and 30% of the CPU
resources, respectively. Consequently, we conclude
[Fig. 18: bar chart of the average CPU utilization (×100%) of the SF-TAP Cell Incubator, the SF-TAP Flow Abstractor, and the HTTP Analyzer.]
Fig. 18 Average CPU utilization rates of the
Flow Abstractor, the Cell Incubator,
and the HTTP Analyzer
that the load-balancing mechanism is remarkably
efficient in multicore applications. In our exper-
iments, the HTTP analyzer was implemented in
Python, a relatively slow interpreted language,
which cannot run on multiple CPUs in parallel
through multi-threading. Nevertheless, the 150
Mbps HTTP traffic was adequately handled by the
analyzer processes running on multiple CPUs.
5. 4 Comprehensive Evaluation of SF-
TAP
In this subsection, we comprehensively evalu-
ate SF-TAP. The evaluation topology is shown
in Fig. 17. The cell incubator separated the 6 Gbps
HTTP traffic described in Section 5. 2 into 14 SF-
TAP cells, each running the flow abstractor and 8
HTTP analyzer processes. The HTTP traffic was
also generated by the hardware appliance described
in Section 5. 1. The cell incubator was run on the PC
described in subsection 5. 1, and the SF-TAP cells
were implemented on PCs with DDR3 32 GB mem-
Fig. 19 CPU requirements of the SF-TAP cells
ory and a Xeon E5-2609 v2 processor (8 cores, 2.5
GHz, 10 MB cache).
Fig. 18 shows the average CPU utilization rates
per process of the flow abstractor, cell incubator,
and HTTP Analyzer. The HTTP analyzer requires
many more CPU resources than the other two com-
ponents. This result implies a significant role for
scalability in analyzing application protocols.
Fig. 19 shows the CPU usage of the SF-TAP
cells when running 8 × 14 = 112 HTTP analyzer
processes. The CPU loads of the processes were well
balanced. In this experiment, we used the HTTP
analyzer, which simply parses HTTP traffic, as the
application-level analyzer, but more complex processing,
such as pattern matching and machine learning, is
performed in actual cases. Application-level analyzers
tend to consume much more computational
resources than the flow abstractor and the cell incu-
bator, but as this experiment shows, SF-TAP balances
the CPU load required by the analyzers well.
6 Discussion
This section discusses the application of SF-TAP
to pervasive monitoring, an important problem on
the Internet [12]. Countermeasures against pervasive
monitoring require cryptographic protocols such
as secure sockets layer/transport layer security (SS-
L/TLS) rather than traditional protocols such as
HTTP and FTP, which are insecure. However,
cryptographic protocols render IDSs ineffective, conse-
quently incurring other security risks.
Mobile devices, which are widely used in mod-
ern society, lack sufficient computing power to run
host-based IDSs. Therefore, as future solu-
tions for the Internet, researchers should consider
new approaches such as IDSs cooperating with an
SSL/TLS proxy.
There are two ways to monitor encrypted traf-
fic. An SSL/TLS proxy server intercepts encrypted
traffic by enforcing the installation of a self-signed cer-
tificate on clients. However, a proxy with a self-
signed certificate may impair authenticity, be-
cause the proxy can not only monitor encrypted traffic
but also falsify it easily. In addition, the
proxy disables the public key pinning extension for
HTTP [10].
Sharing the secret key is another approach to
monitoring encrypted traffic. mcTLS [24] and Blind-
Box [37] share the key to the encrypted traffic among
multiple parties, and GINTATE [23], which we are cur-
rently studying, shares the key with clients. Compared
with the proxy approach, the secret-key-sharing ap-
proach cannot falsify traffic easily. In other words,
both the proxy and the secret-key-sharing approaches
affect confidentiality, and the proxy approach also
affects authenticity and integrity, whereas the se-
cret-key-sharing approach does not easily compro-
mise authenticity and integrity.
From an architectural perspective, recursive encap-
sulation, such as TLS over TLS, and a cooperation
mechanism between IDSs and DPI systems should
also be studied. GINTATE can handle recursively
encrypted traffic by adopting the application-level
loopback interface of the flow ab-
stractor. There are several ways to achieve such co-
operation; GINTATE provides a file-based abstrac-
tion interface, the same as the flow abstractor, to achieve
modularity.
From a confidentiality perspective, the design of
secrecy levels should also be studied. For example,
highly confidential encrypted traffic, such as online
banking, should not be inspected. How to decide
the secrecy level and how to handle it are big issues
for DPI systems.
7 Related Work
Traditional packet-capturing applications in-
clude the widely used Wireshark [44] and tcpdump
[41]. The network traffic-capturing library lib-
nids [18] reassembles TCP streams. However, these
single-threaded applications cannot exploit the ad-
vantages of multiple CPU cores, and are hence un-
suitable for high-bandwidth network traffic analy-
sis.
SCAP [29] and GASPP [43] have been proposed
for flow-level, high-bandwidth network traf-
fic analysis. Implemented within the Linux kernel,
SCAP exploits a zero-copy mechanism and allo-
cates threads to the RX and TX queues of the NIC to
boost throughput. SCAP also adopts a mech-
anism called subzero-copy packet transfer, in which
analyzers can selectively analyze only the required
network traffic. GASPP is a GPGPU-based flow-
level analysis engine based on netmap [34]; thus,
it achieves high-throughput data transfers between
the NIC and CPU memory.
DPDK [9], netmap [34], and PF_RING [31] have
been proposed as high-bandwidth packet-capture
implementations. Owing to the many data trans-
fers and software interrupts that occur among the NIC,
the kernel, and user space, traditional methods cannot eas-
ily capture 10 Gbps network traffic. These frameworks
effectively reduce the frequency of mem-
ory copies and software interrupts, enabling wire-
speed traffic capture.
L7-filter [16], nDPI [25], libprotoident [19], and
PEAFLOW [8] have been proposed for application-
level network traffic classification. These meth-
ods detect application protocols using the Aho–
Corasick algorithm or regular expressions. PEAFLOW
achieves accurate classification using the parallel
programming framework FastFlow.
IDS applications such as Snort [38], Bro [5], and
Suricata [39] reconstruct TCP flows and per-
form application-level analysis. Bro applies the
DSL binpac [28] for protocol parsing. However,
both Snort and Bro are single-threaded and cannot
manage high-bandwidth network traffic; in con-
trast, the multi-threaded Suricata does han-
dle high-bandwidth network traffic. To overcome
the single-threaded architecture, Bro comes with
scripts to deploy it in a cluster environment [7], which
has achieved handling of 100 Gbps traffic [6]. Whereas the
cluster environment adopts hardware appliances,
SF-TAP uses only commodity hardware.
Schneider et al. [35] proposed a horizontally scal-
able architecture that separates the flows of 10
Gbps traffic, similarly to SF-TAP. However, they
verified their architecture only on 1 Gbps network
traffic; they provided neither an implementation nor
an evaluation handling 10 Gbps traffic, nor any
mechanisms or interfaces for traffic analysis
applications.
Open vSwitch [27] is a software switch that con-
trols network traffic based on its flows, similarly
to the cell incubator in our proposed system, but
its ovs-ofctl does not reassemble IP-fragmented
packets. Some filtering mechanisms, such as iptables
[14] and pf [30], also implement flow-based control of
network traffic, but lack a reassembling mechanism.
Furthermore, these methods are less scalable than
SF-TAP, and suffer from performance issues.
The Click [15], SwitchBlade [2], and ServerSwitch
[20] applications adopt modular architectures, pro-
viding flexible and programmable network func-
tions for network switches. SF-TAP borrows these
ideas and incorporates a modular architecture for
network traffic analysis.
SF-TAP also borrows the concept of BPF [22],
a well-known packet-capture mechanism that ab-
stracts network traffic as files, similarly to
UNIX's /dev. By abstracting network flows as files,
SF-TAP achieves its modularity and multicore
scalability.
8 Conclusions
Application-level network traffic analysis and
sophisticated analysis techniques, such as ma-
chine learning and stream-data processing of net-
work traffic, require considerable computational re-
sources. To resolve these problems, we proposed a
scalable and flexible traffic analysis platform called
SF-TAP for sophisticated application-level analysis
of high-bandwidth network traffic in real-time.
SF-TAP consists of four planes: the separator
plane, abstractor plane, capturer plane, and an-
alyzer plane. Network traffic is captured by the
capturer plane and separated into different flows at
the separator plane, thus achieving horizontal scal-
ability. The separated network traffic is forwarded
to multiple SF-TAP cells, which constitute the ab-
stractor and analyzer planes.
We provided cell incubator and flow abstractor
implementations for the separator and abstractor
planes, respectively. As an example, we also imple-
mented an HTTP analyzer in the analyzer plane.
Traffic acquisition by the capturer plane uses well-
known technologies, such as port mirroring of L2
switches.
The flow abstractor (with its modular archi-
tecture) abstracts the network traffic into files,
similarly to Plan 9, UNIX's /dev, and BPF. Owing
to the abstraction features and modularity,
application-level analyzers in SF-TAP can be eas-
ily developed in many programming languages, and
rendered as multicore scalable. The scalability of
the HTTP analyzer to multiple CPU cores was ex-
perimentally demonstrated in a Python example.
The flow abstractor performs bulk data trans-
fers among multiple threads. Our experiments con-
firmed that the multi-threaded flow abstractor can
manage 9 Gbps HTTP traffic without dropping
packets, whereas tcpdump and Snort can manage
only up to approximately 3.0 and 1.5 Gbps, respectively.
The scalability of the HTTP analyzer to multi-
ple CPU cores was assisted by the flow abstraction
interfaces. In our experiments, each of two HTTP
analyzer processes consumed 100% of the CPU re-
sources, but each of eight processes consumed only
30% of the CPU resources.
The cell incubator provides the horizontal scala-
bility of SF-TAP. To manage high-bandwidth net-
work traffic, it separates the network traffic based
on the flows, and forwards the separated flows to
the cells (each comprising one flow abstractor and
several application-level analyzers). In experimen-
tal tests of the mirroring and inline modes, the cell
incubator managed approximately 12.49 Mpps and
11.44 Mpps, respectively.
In the future, we will tackle the problems of DPI
over encrypted traffic. From the perspective of pri-
vacy, all traffic should be encrypted, but all
traffic should also be inspected to keep networks safe.
We need to tackle this antinomic issue. Addition-
ally, we will study the design of secrecy levels to de-
cide whether traffic should be inspected; this
should help avoid inspecting highly confidential
encrypted traffic, such as online banking.
Acknowledgments
We would like to thank the staff of StarBED
for supporting our research, WIDE Project for
supporting our experiments, and Mr. Nobuyuki
Kanaya for feedback and comments.
References
[ 1 ] Amdahl, G. M.: Validity of the single pro-
cessor approach to achieving large scale comput-
ing capabilities, American Federation of Informa-
tion Processing Societies: Proceedings of the AFIPS
’67 Spring Joint Computer Conference, April 18-
20, 1967, Atlantic City, New Jersey, USA, AFIPS
Conference Proceedings, Vol. 30, AFIPS / ACM /
Thomson Book Company, Washington D.C., 1967,
pp. 483–485.
[ 2 ] Anwer, M. B., Motiwala, M., Tariq, M. M. B.,
and Feamster, N.: SwitchBlade: a platform for
rapid deployment of network protocols on pro-
grammable hardware, Proceedings of the ACM SIG-
COMM 2010 Conference on Applications, Tech-
nologies, Architectures, and Protocols for Computer
Communications, New Delhi, India, August 30 -
September 3, 2010, Kalyanaraman, S., Padmanab-
han, V. N., Ramakrishnan, K. K., Shorey, R., and
Voelker, G. M.(eds.), ACM, 2010, pp. 183–194.
[ 3 ] BitTorrent, http://www.bittorrent.com/.
[ 4 ] Boost C++ Library, http://www.boost.org/.
[ 5 ] The Bro Network Security Monitor, http://
www.bro.org/.
[ 6 ] 100G Intrusion Detection, https://commons.lbl.
gov/display/cpp/100G+Intrusion+Detection.
[ 7 ] Bro Cluster Architecture, https://www.bro.org/
sphinx-git/cluster/index.html.
[ 8 ] Danelutto, M., Deri, L., Sensi, D. D., and
Torquati, M.: Deep Packet Inspection on Commod-
ity Hardware using FastFlow, PARCO, Bader, M.,
Bode, A., Bungartz, H.-J., Gerndt, M., Joubert,
G. R., and Peters, F. J.(eds.), Advances in Parallel
Computing, Vol. 25, IOS Press, 2013, pp. 92–99.
[ 9 ] Intel® DPDK: Data Plane Development Kit,
http://dpdk.org/.
[10] Evans, C., Palmer, C., and Sleevi, R.: Pub-
lic Key Pinning Extension for HTTP, April 2015.
RFC7469.
[11] Evans, J.: A scalable concurrent malloc (3) im-
plementation for FreeBSD, Proc. of the BSDCan
Conference, Ottawa, Canada, April 2006.
[12] Farrell, S. and Tschofenig, H.: Pervasive Moni-
toring Is an Attack, May 2014. RFC7258.
[13] Intel®, High Performance Packet Processing,
https://networkbuilders.intel.com/docs/
network_builders_RA_packet_processing.pdf.
[14] netfilter/iptables project homepage - The net-
filter.org "iptables" project, http://www.netfilter.
org/projects/iptables/.
[15] Kohler, E., Morris, R., Chen, B., Jannotti, J.,
and Kaashoek, M. F.: The click modular router,
ACM Trans. Comput. Syst., Vol. 18, No. 3(2000),
pp. 263–297.
[16] L7-filter — ClearFoundation, http://l7-filter.
clearfoundation.com/.
[17] libevent, http://libevent.org/.
[18] libnids, http://libnids.sourceforge.net/.
[19] WAND Network Research Group: libprotoident,
http://research.wand.net.nz/software/
libprotoident.php.
[20] Lu, G., Guo, C., Li, Y., Zhou, Z., Yuan, T.,
Wu, H., Xiong, Y., Gao, R., and Zhang, Y.: Server-
Switch: A Programmable and High Performance
Platform for Data Center Networks, Proceedings of
the 8th USENIX Symposium on Networked Systems
Design and Implementation, NSDI 2011, Boston,
MA, USA, March 30 - April 1, 2011, Andersen,
D. G. and Ratnasamy, S.(eds.), USENIX Associa-
tion, 2011, pp. 15–28.
[21] Martins, J., Ahmed, M., Raiciu, C., Olteanu,
V. A., Honda, M., Bifulco, R., and Huici, F.:
ClickOS and the Art of Network Function Virtu-
alization, Proceedings of the 11th USENIX Sympo-
sium on Networked Systems Design and Implemen-
tation, NSDI 2014, Seattle, WA, USA, April 2-4,
2014, Mahajan, R. and Stoica, I.(eds.), USENIX
Association, 2014, pp. 459–473.
[22] McCanne, S. and Jacobson, V.: The BSD Packet
Filter: A New Architecture for User-level Packet
Capture, Proceedings of the Usenix Winter 1993
Technical Conference, San Diego, California, USA,
January 1993, USENIX Association, 1993, pp. 259–
270.
[23] Miura, R., Takano, Y., Miwa, S., and Inoue, T.:
GINTATE: Scalable and Extensible Deep Packet
Inspection System for Encrypted Network Traffic:
Session Resumption in Transport Layer Security
Communication Considered Harmful to DPI, Pro-
ceedings of the Eighth International Symposium on
Information and Communication Technology, Nha
Trang City, Viet Nam, December 7-8, 2017, ACM,
2017, pp. 234–241.
[24] Naylor, D., Schomp, K., Varvello, M., Leon-
tiadis, I., Blackburn, J., Lopez, D. R., Papagian-
naki, K., Rodríguez, P. R., and Steenkiste, P.:
Multi-Context TLS (mcTLS): Enabling Secure In-
Network Functionality in TLS, Computer Commu-
nication Review, Vol. 45, No. 5(2015), pp. 199–212.
[25] nDPI, http://www.ntop.org/products/ndpi/.
[26] Leading operators create ETSI standards group
for network functions virtualization, http://www.
etsi.org/index.php/news-events/news/644-2013-01
-isg-nfv-created.
[27] Open vSwitch, http://openvswitch.github.io/.
[28] Pang, R., Paxson, V., Sommer, R., and Pe-
terson, L. L.: binpac: a yacc for writing applica-
tion protocol parsers, Internet Measurement Con-
ference, Almeida, J. M., Almeida, V. A. F., and
Barford, P.(eds.), ACM, 2006, pp. 289–300.
[29] Papadogiannakis, A., Polychronakis, M., and
Markatos, E. P.: Scap: stream-oriented network
traffic capture and analysis for high-speed net-
works, Internet Measurement Conference, IMC’13,
Barcelona, Spain, October 23-25, 2013, Papagian-
naki, K., Gummadi, P. K., and Partridge, C.(eds.),
ACM, 2013, pp. 441–454.
[30] PF: The OpenBSD Packet Filter, http://www.
openbsd.org/faq/pf/.
[31] PF_RING, http://www.ntop.org/products/
pf_ring/.
[32] Pike, R., Presotto, D. L., Dorward, S., Flan-
drena, B., Thompson, K., Trickey, H., and Win-
terbottom, P.: Plan 9 from Bell Labs, Computing
Systems, Vol. 8, No. 2(1995), pp. 221–254.
[33] RE2: https://github.com/google/re2.
[34] Rizzo, L. and Landi, M.: netmap: memory
mapped access to network devices, SIGCOMM, Ke-
shav, S., Liebeherr, J., Byers, J. W., and Mogul,
J. C.(eds.), ACM, 2011, pp. 422–423.
[35] Schneider, F., Wallerich, J., and Feldmann, A.:
Packet Capture in 10-Gigabit Ethernet Environ-
ments Using Contemporary Commodity Hardware,
PAM, Uhlig, S., Papagiannaki, K., and Bonaven-
ture, O.(eds.), Lecture Notes in Computer Science,
Vol. 4427, Springer, 2007, pp. 207–217.
[36] SF-TAP: Scalable and Flexible Traffic Analysis
Platform, https://github.com/SF-TAP.
[37] Sherry, J., Lan, C., Popa, R. A., and Ratnasamy,
S.: BlindBox: Deep Packet Inspection over En-
crypted Traffic, Computer Communication Review,
Vol. 45, No. 5(2015), pp. 213–226.
[38] Snort :: Home Page, https://www.snort.org/.
[39] Suricata — Open Source IDS / IPS / NSM en-
gine, http://suricata-ids.org/.
[40] Takano, Y., Miura, R., Yasuda, S., Akashi, K.,
and Inoue, T.: SF-TAP: Scalable and Flexible Traf-
fic Analysis Platform Running on Commodity Hard-
ware, 29th Large Installation System Administra-
tion Conference, LISA 2015, Washington, D.C.,
USA, November 8-13, 2015., USENIX Association,
2015, pp. 25–36.
[41] TCPDUMP/LIBPCAP public repository, http:
//www.tcpdump.org/.
[42] Vasiliadis, G., Antonatos, S., Polychronakis,
M., Markatos, E. P., and Ioannidis, S.: Gnort:
High Performance Network Intrusion Detection Us-
ing Graphics Processors, Recent Advances in In-
trusion Detection, 11th International Symposium,
RAID 2008, Cambridge, MA, USA, September 15-
17, 2008. Proceedings, Lippmann, R., Kirda, E.,
and Trachtenberg, A.(eds.), Lecture Notes in Com-
puter Science, Vol. 5230, Springer, 2008, pp. 116–
134.
[43] Vasiliadis, G., Koromilas, L., Polychronakis,
preprocessor stream5_global: track_tcp yes, \
    memcap 1073741824, \
    track_udp yes, \
    track_icmp no, \
    max_tcp 1048576, \
    max_udp 131072, \
    max_active_responses 2, \
    min_response_seconds 5
preprocessor stream5_tcp: policy windows, detect_anomalies, require_3whs 180, \
    overlap_limit 10, small_segments 3 bytes 150, timeout 180, \
    ports client all, \
    ports both all
Fig. 20 Stream5 configuration in Snort
M., and Ioannidis, S.: GASPP: A GPU-
Accelerated Stateful Packet Processing Frame-
work, 2014 USENIX Annual Technical Conference,
USENIX ATC ’14, Philadelphia, PA, USA, June
19-20, 2014., Gibson, G. and Zeldovich, N.(eds.),
USENIX Association, 2014, pp. 321–332.
[44] Wireshark - Go Deep., https://www.wireshark.
org/.
[45] The Official YAML Web Site, http://yaml.org/.
[46] A YAML parser and emitter in C++, https:
//github.com/jbeder/yaml-cpp.
A Stream5 Configuration
Fig. 20 shows the Stream5 configuration of Snort
in the evaluation described in subsection 5. 2.
B Spinlock-based reader–writer locking

According to Amdahl's law [1], reducing lock overhead, which serializes otherwise parallel work, can significantly improve the performance of a multi-threaded application. We found that the reader–writer lock function pthread_rwlock was a main bottleneck when implementing the flow abstractor. Our implementation therefore replaces pthread_rwlock with a spinlock-based reader–writer lock.
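The cited Amdahl's law quantifies why shrinking the serialized (locked) fraction of the work matters; the following is the standard formulation of the law, not a formula from the paper itself:

```latex
% Speedup S(n) on n cores when a fraction p of the work is parallelizable
% and the remaining (1 - p) is serialized (e.g., spent under a lock):
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}
% Example: with p = 0.95 and n = 16, S(16) = 1 / (0.05 + 0.95/16) \approx 9.1,
% whereas perfect scaling would give 16; the serial 5% dominates the loss.
```

Because the serial term (1 - p) bounds the speedup by 1/(1 - p) regardless of the core count, even a modest reduction in lock overhead raises the attainable speedup substantially.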
Fig. 21 compares the CPU utilization rates of the flow abstractor with spinlock-based reader–writer locking and with pthread_rwlock. As the figure shows, the spinlock-based reader–writer lock significantly reduced the CPU utilization rate of the flow abstractor, confirming the importance of lock overhead in multi-threaded applications.
[Fig. 21 CPU utilization of the flow abstractor with spinlock-based reader–writer locking and with pthread_rwlock locking; both panels plot bandwidth (1–8 Gbps) on the x-axis against CPU utilization (x100 %) on the y-axis, split into system and user time.]
Yuuki Takano
received his B.E. from the National Institution for Academic Degrees and Quality Enhancement of Higher Education, Japan, in 2003. He also received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2005 and 2011, respectively. He is currently a specially appointed associate professor at Osaka University. His research interests include computer systems, computer networks, network security, and programming languages.
Ryosuke Miura
received his B.E. from the Tokyo University of Science, Japan, in 2003. He also received his M.S. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2007. He is currently a technical researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and network security.
Shingo Yasuda
received his B.I. from Bunkyo University, Japan, in 2003. He also received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2005 and 2014, respectively. He is currently a senior researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and cyber ranges.
Kunio Akashi
received his B.E. from Osaka Electro-Communication University, Japan, in 2008. He also received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2010 and 2017, respectively. He is currently a researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and network testbeds.
Tomoya Inoue
received his B.E. from the National Institution for Academic Degrees and Quality Enhancement of Higher Education, Japan, in 2003. He received his M.S. and Ph.D. in Computer Science from the Japan Advanced Institute of Science and Technology, Japan, in 2006 and 2012, respectively. He is currently a researcher at the National Institute of Information and Communications Technology, Japan. His research interests include computer systems, computer networks, and network testbeds.