Small-Ruleset Regular Expression Matching on GPGPUs: Quantitative Performance Analysis and Optimization
Jamin Naghmouchi 1,2    Daniele Paolo Scarpazza 1    Mladen Berekovic 2
1 IBM T.J. Watson Research Center, Business Analytics & Math Dept., Yorktown Heights, NY, USA ([email protected], [email protected])
2 Institut für Datentechnik und Kommunikationsnetze, Technische Universität Braunschweig, Braunschweig, Germany ([email protected])
ABSTRACT
We explore the intersection between an emerging class of architectures and a prominent workload: GPGPUs (General-Purpose Graphics Processing Units) and regular expression matching, respectively. It is a challenging task because this workload, with its irregular, non-coalesceable memory access patterns, is very different from the regular, numerical workloads that run efficiently on GPGPUs.
Small-ruleset expression matching is a fundamental building block for search engines, business analytics, natural language processing, XML processing, compiler front-ends and network security. Despite the abundant power that GPGPUs promise, little work has investigated their potential and limitations with this workload, and how to best utilize the memory classes that GPGPUs offer.
We describe an optimization path of the kernel of flex (the popular, open-source regular expression scanner generator) to four nVidia GPGPU models, with decisions based on quantitative micro-benchmarking, performance counters and simulator runs.
Our solution achieves a tokenization throughput that exceeds the results obtained by the GPGPU-based string matching solutions presented so far, and compares well with solutions obtained on any architecture.
Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming—Parallel Programming; F.2.2 [Analysis of Algorithms]: Nonnumerical Algorithms—Pattern matching
General Terms
Algorithms, Design, Performance
1. INTRODUCTION
With the advent of “Web 2.0” applications, the volume of unstructured data that Internet and enterprise applications produce and consume has been growing at extraordinary rates. The tools we use to access, transform and protect these data are search engines, business analytics suites, Natural-Language Processing (NLP) tools, XML processors and Intrusion Detection Systems (IDSs). These tools rely crucially on some form of Regular Expression (regexp) scanning.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICS’10, June 2–4, 2010, Tsukuba, Ibaraki, Japan.
Copyright 2010 ACM 978-1-4503-0018-6/10/06 ...$10.00.
We focus on tokenization: a form of small-ruleset regexp matching used to divide a character stream into tokens like English words, E-mail addresses, company names, URLs, phone numbers, IP addresses, etc. Tokenization is the first stage of any search engine indexer (where it consumes between 14 and 20% of the execution time [1]) and any XML processing tool (where it can absorb 30% [2, 3]). It is also part of NLP tools and programming language compilers.
The further growth of unstructured-data applications depends on whether fast, scalable tokenizers are available. Architectures are offering more cores per socket [4, 5] and wider SIMD (Single-Instruction Multiple-Data) units. For example, Intel is increasing per-chip core counts from the current 4–8 to the 16–48 of Larrabee [6], and SIMD width from the current 128 bits of SSE (Streaming SIMD Extension [7]) to the 256 bits of AVX (Advanced Vector eXtensions [8]) and the 512 bits of LRBni [9].
nVidia GPGPUs [10] employ hundreds of light-weight cores that juggle thousands of threads, in an attempt to mask the latency to an uncached main memory. Despite this promising amount of parallelism, little work has explored the potential of GPGPUs for text processing tasks, whereas traditional multi-core architectures have received abundant attention [11, 12, 13, 14, 15].
Filling this gap is the objective of this paper. It is a challenging task because tokenization is far from the numerical, array-based applications that traditionally map well to GPGPUs. Unlike numerical kernels that aim at fully coalesced memory accesses, our workload never enjoys coalescing. Also, automaton-based algorithms have been named embarrassingly sequential [16] for their inherent lack of parallelism.
Our optimization reasoning relies on performance figures that are not available from the manufacturer or independent publications. We determine these figures with micro-benchmarks specifically designed for the purpose.
We start our optimization from a naïve port to GPGPUs of a tokenizer kernel produced by flex [17]. We analyze compute operations and memory accesses, and explore data-layout improvements on a quantitative basis, with the help of benchmarks, profiling, performance counters, static analysis of the disassembled bytecode, and simulator [18] runs.
On a GTX280 device, we achieve a typical tokenizing throughput on realistic (Wikipedia) data of 1.185 Gbyte/s per device, and a peak scanning throughput of 6.92 Gbyte/s (i.e., 3.62× and 8.59× speedups over naïve GPGPU ports, respectively). This performance is 20.1× faster than the original, unmodified flex tokenizer running in 4 threads on a modern commodity processor, and 49.8× faster than a single-threaded flex.
The limitations of our approach are the size of the rule set and the need for a large number of independent input streams. The first limitation derives from our mapping of automata state tables to (small) core-local memories and caches. This constraint does not fit applications requiring large state spaces like IDS or content-based traffic filtering. The second constraint is due to the high number of threads (approx. 4,000–6,000) necessary to keep a GPGPU well utilized at any time. Traditional CPUs have fewer cores and threads, and reach full utilization with fewer input streams.
2. THE GPGPU ARCHITECTURE AND PROGRAMMING MODEL
We briefly introduce the architecture and programming model of nVidia GPGPUs of the CUDA (Compute-Unified Device Architecture) family. We focus primarily on the GTX280 device, but these concepts apply broadly to the other devices we consider (Table 1). For more detailed information, see the technical documentation [19] and the relevant research papers [10, 20, 21, 22, 18].
2.1 Compute cores and memory hierarchy
A CUDA GPGPU is a hierarchical collection of cores, as in Figure 1. Cores are called Scalar Processors (SPs). A Streaming Multiprocessor (SM) contains 8 SPs and associated resources (e.g., a common instruction unit, shared memory, a common register file, and an L1 cache). Three SMs, together with texture units and L2 caches, constitute a Thread Processing Cluster (TPC). The GTX280 has 10 TPCs, connected to memory controllers and an L3 cache.
Part of the internals we report are unofficial and derive from Papadopoulou et al. [22]. nVidia often does not disclose the internals of its devices, possibly in an attempt to discourage non-portable optimizations. Nevertheless, the high-performance computing community pursues efficiency even at the cost of device-specific, low-level optimizations.
The memory hierarchy includes a shared register file, a block of shared memory and a global memory. The register file is statically partitioned among threads at compile time.
In the general case, GPGPUs do not mitigate memory latency by using caches (except for the Fermi models, not available at the time this article was composed). Rather, they maintain a large number of threads, and mask latencies by switching to a different, ready group of threads. The cores are designed to perform inexpensive context switches. The L1–L3 caches are used only for instructions, constants and textures. In detail, the constant memory is a programmer-designated area of global memory, up to 64 kbytes, initialized by the host and not modifiable by code running on the device; accesses to this area are cached. A visual representation of memory classes and latencies is in Figure 2. In Section 4, we analyze quantitatively the performance of these classes of memory.
2.2 Programming Model
The CUDA programming model does not explicitly associate threads to cores. Rather, the programmer provides a kernel of code that describes the operations of a single thread, and a rectangular grid that defines the thread count and numbering.
Threads are organized in a hierarchy: a grid of blocks of warps of threads. A warp is a collection of 32 threads that run on the 8 SPs of an SM concurrently (each SP runs 4 threads). A block is a collection of threads, of programmer-defined size (up to 512). We found no benefit in defining blocks whose size is not a multiple of 32 threads, therefore we regard a block as a group of warps. A grid is an array of blocks, having 1, 2 or 3 dimensions. Image processing and physics kernels map naturally to 2D and 3D grids, but our text-based workload does not need more than 1D. Therefore we treat this grouping as a linear array of blocks and refer to this degree of freedom only as “number of blocks” from now on.
2.3 Compilation and Execution Model
With CUDA, GPGPUs are coprocessor devices to which the programmer offloads portions of the code called kernels. The programmer can write single-source C/C++ hybrid applications, where the kernels are limited to a subset of the C language with no recursion and no function pointers. The NVCC compiler separates host code from device code, and forwards host code to an external compiler like GNU GCC. The device source code is compiled to a virtual instruction set called PTX [23]. At program startup, the device driver applies final optimizations and translates PTX code into real binary code. The real instruction set and the compilation process are undocumented.
Threads in a warp execute in lockstep (with a shared program counter) and therefore have no independent control flows. Control flow statements in the source map to predicated instructions: the hardware nullifies the instructions of non-taken branches.
The memory accesses of the threads in each half-warp can coalesce into a single access that serves 16 threads concurrently, provided that the target addresses respect stride and contiguity constraints. Uncoalesced accesses are much less efficient than coalesced ones. Due to the lack of regularity of our workload, its memory accesses never coalesce.
3. REGULAR-EXPRESSION MATCHING AND TOKENIZATION

A regular expression describes a set of strings in terms of an alphabet and operators like choice ‘|’, optionality ‘?’, and unbounded repetition ‘*’. Regexps are a powerful abstraction: one expression can denote an infinite class of strings with desired properties.
Finding matches between a stream of text and a set of regexps is a common task. Antivirus scanners find matches between user files and a set of regexps that denote the signatures of malicious content. IDSs do the same on live network traffic. These examples exhibit large rule sets (since threat signatures can number in the tens of thousands) and low match rates (because the majority of traffic and files are usually legitimate). In this work, we rather focus on the small-ruleset, high-match-rate regexp matching involved in the tokenization stage of search engines, XML and NLP processors, and compilers. Also, tokenizers implement a different matching semantics: they only accept non-overlapping, maximum-length matches.
Regexp matching is performed with a Deterministic Finite Automaton (DFA), often generated automatically with tools like flex [17]. Flex takes as input a rule set like the one in Figure 3, and produces the C code of an automaton that matches those rules, as in Figure 4. Our GPGPU-based tokenizer implements this example. The flex-generated DFA has 174 states, which illustrates in practice what we mean by small rule set.
In the listing, the while loop scans the input. At each iteration, the automaton reads one input character, determines the next state and whether it is an accepting one, performs the associated semantic actions, and transitions to the new state. In tokenizers, the usual semantic actions add the accepted token to an output table.
The memory accesses of this DFA are illustrated in Figure 5. The DFA reads characters sequentially from the input (1) with no reuse, except for backup transitions, discussed below; the input is read-only and not bounded in size. The automaton accesses the STT (2) and the accept table (3); both accesses are, at a first approximation,
Figure 1: A GTX280 contains 10 Thread Processing Clusters (TPCs), each of which groups 3 Streaming Multiprocessors (SMs). One SM contains 8 Scalar Processors (SPs), i.e., 10 × 3 × 8 = 240 cores. An SM also contains Special Function Units (SFUs) and Double-Precision Units (DPUs), which we do not use in this work.
Table 1: Architectural characteristics of the compute devices we employ in our experiments.

Architecture       Revision   SMs   Total cores   Clock Rate   Global Memory   Constant Memory   Shared Memory   Registers per Block
GeForce GTX280     1.3        30    240           1.30 GHz     1.00 Gbytes     64 kbytes         16 kbytes       16,384
GeForce GTX8800    1.0        16    128           1.40 GHz*    0.75 Gbytes     64 kbytes         16 kbytes       8,192
Tesla C870         1.1        16    128           1.35 GHz     1.50 Gbytes     64 kbytes         16 kbytes       8,192
Quadro FX3700      1.0        14    112           1.24 GHz     0.50 Gbytes     64 kbytes         16 kbytes       8,192

In all devices a warp is 32 threads and the maximum number of threads per block is 512. (* Overclocked specimen; factory clock rate was 1.35 GHz.)
[Figure 2 diagram: memories visible to a Scalar Processor (SP), with round-trip read latencies: Shared Memory (16 kbytes) and L1 (2 kbytes), 6 cycles; L2 (8 kbytes), ~81 cycles; L3 (32 kbytes), ~220 cycles; Global Memory (1 Gbyte), >406 cycles.]
Figure 2: Round-trip read latencies to the memories on a GTX280 GPGPU, from the point of view of a Scalar Processor (SP), expressed in clock cycles for a 1.30 GHz device [22]. Color coding is consistent with Fig. 1, but some blocks were omitted for clarity.
LETTER      [a-z]
DIGIT       [0-9]
P           ("_"|[,-/])
HAS_DIGIT   ({LETTER}|{DIGIT})*{DIGIT}({LETTER}|{DIGIT})*
ALPHA       {LETTER}+
ALPHANUM    ({LETTER}|{DIGIT})+
APOSTROPHE  {ALPHA}("’"{ALPHA})+
ACRONYM     {ALPHA}"."({ALPHA}".")+
COMPANY     {ALPHA}("&"|"@"){ALPHA}
EMAIL       {ALPHANUM}(("."|"-"|"_"){ALPHANUM})*"@"{ALPHANUM}(("."|"-"){ALPHANUM})+
HOST        {ALPHANUM}("."{ALPHANUM})+
NUM         {ALPHANUM}{P}{HAS_DIGIT}|{HAS_DIGIT}{P}{ALPHANUM}|
            {ALPHANUM}({P}{HAS_DIGIT}{P}{ALPHANUM})+|
            {HAS_DIGIT}({P}{ALPHANUM}{P}{HAS_DIGIT})+|
            {ALPHANUM}{P}{HAS_DIGIT}({P}{ALPHANUM}{P}{HAS_DIGIT})+|
            {HAS_DIGIT}{P}{ALPHANUM}({P}{HAS_DIGIT}{P}{ALPHANUM})+
STOPWORD    "a"|"an"|"and"|"are"|"as"|"at"|"be"|"but"|"by"|"for"|"if"|
            "in"|"into"|"is"|"it"|"no"|"not"|"of"|"on"|"or"|"s"|"such"|
            "t"|"that"|"the"|"their"|"then"|"there"|"these"|"they"|
            "this"|"to"|"was"|"will"|"with"
KEPT_AS_IS  {ALPHANUM}|{COMPANY}|{EMAIL}|{HOST}|{NUM}
%%
{STOPWORD}|.|\n   /* ignore */;
{KEPT_AS_IS}      emit_token (yytext);
{ACRONYM}         emit_acronym (yytext);
{APOSTROPHE}      emit_apostrophe (yytext);
%%
Figure 3: An example tokenizer rule set, specified in flex, similar to the one of Lucene [24], the open-source search engine library. The rules accept words, company names, email addresses, host names and numbers as a class of tokens; they also recognize acronyms and apostrophe expressions as distinct classes.
const flex_int16_t yy_nxt[][...] = { ... };  /* next-state table */
const flex_int16_t yy_accept[ ]  = { ... };  /* accept table */

/* ... */
while ( 1 ) {
    yy_bp = yy_cp;
    yy_current_state = yy_start;  /* initial state */
    while ( (yy_current_state =
             yy_nxt[ yy_current_state ][ *yy_cp ]) > 0 )
    {
        if ( yy_accept[yy_current_state] ) {
            (yy_last_accepting_state) = yy_current_state;
            (yy_last_accepting_cpos)  = yy_cp;
        }
        ++yy_cp;
    }
    yy_current_state = -yy_current_state;
yy_find_action:
    yy_act = yy_accept[yy_current_state];
    switch ( yy_act )
    {
    case 0:  /* back-up transition */
        yy_cp = (yy_last_accepting_cpos) + 1;
        yy_current_state = (yy_last_accepting_state);
        goto yy_find_action;
    case 1:  /* ignore */              break;
    case 2:  emit_token(yytext);       break;
    case 3:  emit_acronym(yytext);     break;
    case 4:  emit_apostrophe(yytext);  break;
    /* ... */
    }
}
Figure 4: The core of the tokenizer generated by flex, corresponding to the rule set of Figure 3. In the code, yy_nxt contains the State Transition Table (STT). Array yy_accept marks the accepting states and the semantic actions (rule numbers) associated with them. At any time, the characters between pointers yy_bp and yy_cp are the input partial match candidate that the automaton is considering.
[Figure 5 diagram: the automaton (state: yy_current_state, yy_bp, yy_cp, token_table_p) scanning the input text "... its biggest annual gain since 2003, ...". Access 1: *yy_cp reads the input text; Access 2: yy_nxt[yy_current_state][*yy_cp] reads the state transition table; Access 3: yy_accept[yy_current_state] reads the accept table; Access 4: (*token_table_p) = ... writes an output token table entry with fields start_p, stop_p, file_id, rule_id.]
Figure 5: The memory accesses of a tokenizing automaton. Access 1 is a mostly-linear read with no reuse. Accesses 2 and 3 are mostly-random accesses. Access 4 is a linear write with no reuse. Reads are drawn in green, writes in red, computation in blue.
random, and both tables are limited in size. The uncompressed 16-bit STT for the example above, with ASCII-128 inputs, occupies |states| × |input alphabet| × (16 bits) = 174 × 128 × 2 bytes ≈ 44 kbytes.
Since flex-based tokenizers implement the longest-of-the-leftmost matching semantics, upon a valid match the DFA can postpone the action in search of a longer match. If the attempt fails, the automaton backs up to the last valid match and re-processes the rest of the input again.
Accesses (1–4) never coalesce because, at any time, the DFAs running in the different threads find themselves in uncorrelated states, accessing independent portions of the STT. Not even the input pointers advance with constant strides, since each DFA might at any time incur a backup and jump back to an arbitrary input location.
STTs naturally contain redundant information. One STT compression technique consists in removing duplicate columns, by mapping the input characters into a smaller number of equivalence classes. The number of classes needed depends on the rule set. Our example above needs 29 classes, thus reducing the STT size by a factor of 128/29 = 4.4. Equivalence classes come at the price of one additional lookup, since each input character must now be mapped to the class it belongs to.
4. A TOKENIZATION-ORIENTED PERFORMANCE CHARACTERIZATION
On a GPGPU, the different memory classes have significantly different performance, which varies depending on the characteristics of the accesses: coalescing, stride, block size, caching, spread, locality, alignment, bank and controller congestion, etc. Actual performance can be significantly lower than peak values advertised by the manufacturer or measured in ideal conditions. For example, the GTX280 is advertised with a bandwidth of 120 Gbyte/s, and the BWtest benchmark in the SDK reports 114 Gbyte/s. In practice, parallel threads accessing single contiguous characters from device memory experience <1% of that (Figure 6, right).
We measure the throughput of the GPGPU memories with access patterns relevant to tokenization, with the use of micro-benchmarks. We also measure gaps: sustained operation inter-arrival times at the device level. E.g., if the read gap is 1 ns, the device can complete one read every nanosecond. Gaps are not round-trip times (RTTs): the RTT of a device memory read can take 340–690 ns. The RTT is of lesser importance here because the hardware is precisely designed to mask it as much as possible.

Figure 6: When threads access their individual, linear input streams with single-byte reads to global memory, they experience a very low bandwidth (0.87 Gbyte/s), much lower than the ideal (120 Gbyte/s). Each line connects experiments with the same number of blocks, and growing number of warps per block (1...16); e.g., the rightmost data point represents 30 blocks × 16 warps × 32 threads/warp = 15,360 threads.

Figure 7: Accessing inputs in blocks of 16 bytes enjoys a more constant gap (left) and improves the throughput (right) up to 17.7×. Compare these values with the figures above: figure axes use consistent scaling to ease comparison.
Our benchmarks consider all combinations of thread block sizes and counts permitted by the available registers per thread (a scarce resource). This wide-spectrum analysis is necessary because these choices impact performance in a way that is difficult to predict analytically. Indeed, two choices of block count and size deliver different performance even if they correspond to the same total thread count. All data refer to the GTX280 device.
4.1 Accessing the input
Benchmarks show that the access of input characters (one byte at a time) by DFAs running in concurrent threads is inefficient. In the best conditions (approx. 4,000 active threads), the gap reaches a minimum of 1.15 ns, and the throughput reaches 0.87 Gbyte/s, far from the ideal 120 Gbyte/s (Figure 6).
In these charts, each line (of a fixed color and point style) connects values corresponding to a fixed number of blocks of threads, with warps per block growing left-to-right from 1 warp to 16 warps (i.e., 32...512 threads per block). Different lines correspond to different numbers of total blocks.
When the threads load inputs in blocks of 16 characters, the best gap decreases to 1.04 ns, and the throughput increases up to 15.43 Gbyte/s (Figure 7). This is a 17.7× improvement, very apparent when comparing Figures 6 (right) and 7 (right); the axes use the same scale to ease comparison.
4.2 Generating output
Similarly, threads can write their output entries one field at a time or in a single block. We assume four-field token entries (starting and ending positions of the token in memory, a document identifier and a token class identifier), each field a 32-bit integer.
Benchmarks show a 2.02× advantage of blocked over single-value writes (Figures 8 and 9, respectively). 32-bit writes achieve a minimum gap of 0.71 ns and a maximum throughput of 5.67 Gbyte/s. Blocked 16-byte writes have longer gaps (1.40 ns) but a throughput up to 11.48 Gbyte/s.

Figure 8: Single-value independent 32-bit writes of linear output streams to global memory obtain a minimum gap of 0.71 ns (left) and a maximum throughput of 5.67 Gbyte/s (right). This is subject to improvement as shown below.

Figure 9: By writing outputs in blocks of 16 bytes, tokenizers can increase the write throughput up to 11.48 Gbyte/s, a two-fold improvement. Compare these values with the figure above; the figure axes use consistent scaling to ease comparison.
4.3 Accessing tables of constants
All DFAs access common constant tables in Accesses 2 and 3 (Figure 5), in a pattern that we approximate with randomized, pointer-chasing benchmarks with variable Working Sets (WS). In our implementation we use 16-bit state identifiers.
Global, constant and shared memories exhibit very different performance levels, up to 6.25 Gbyte/s, 49.87 Gbyte/s and 117.6 Gbyte/s, respectively (see Figures 10, 11, 12). Faster memories are limited in capacity: the constant memory is 64 kbytes, the shared memory is 16 kbytes.
Global and shared memories are uncached, and their performance is independent from the working set. The constant memory is cached (see the L1–L3 hierarchy in Figure 2), and its performance depends strongly on the WS size. With a 2-kbyte WS (i.e., the size of L1), the throughput peaks at 49.87 Gbyte/s, but as the WS grows the throughput falls rapidly (Figure 11). With a WS of around 6 kbytes or more, the constant memory is outperformed by the global memory.
5. OPTIMIZATION RESULTS

Our parallelization maps one DFA to each thread, and a list of input documents to each DFA. Each thread processes its documents sequentially, appending output to its private token table. Host code distributes the input and allocates output tables. Load balancing is simple because tokenization times are proportional to document sizes (which are known in advance). We don't account for this setup time because it can be overlapped with GPGPU execution by using double-buffered runs.
We assume that the number of documents to be tokenized and the statistical distribution of their sizes are suitable to achieve good load balancing. Our experiments employed the articles in Wikipedia, which satisfy this assumption. In different circumstances, our approach may not guarantee good load balancing and/or device utilization.
We present five optimized versions of the naïve implementation, corresponding to the different data mappings to GPGPU memory classes represented in Figure 13. Our optimizations reduce the number and latency of memory accesses, as Figure 14 shows.

Figure 10: Accessing constant tables in global memory. Accesses are 16-bit, concurrent, random, uncoalesced reads. Global memory performance is independent from working set size, and peaks at a relatively low thread count (2,000).

Figure 11: Accesses to tables in constant memory are faster (up to 49.87 Gbyte/s) than global memory, provided that tables are small enough to fit it (64 kbytes) and that the actual Working Set (WS) is very small (2–4 kbytes). Larger working sets thrash the cache hierarchy and degrade performance.

Figure 12: The best throughput is achieved when storing tables in shared memory (up to 117.6 Gbyte/s). Performance is independent from the working set size. Tables must be small enough to fit the shared memory (16 kbytes).

Figure 13: How our optimized versions place data structures into the GPGPU memory classes. Reads are drawn in green, writes in red, and computation is denoted in blue.

Figure 14: Single-thread timing analysis of the code of Versions 0–5 for one DFA iteration. Latencies refer to the GTX280 device and are in clock cycles. This analysis shows memory latencies before they are overlapped with computation. Notes: compute instructions (in blue) are statically timed with their nominal latencies [22]. Instructions are obtained by decompiling the PTX bytecode with decuda. Latencies (reads in green, writes in red) are nominal, and not representative of high utilization. Diagrams show the control flow of a DFA transition without back-up that generates output. Output generation code is shown even if it is conditional. Numbers are in good accordance (max error = 11.04%) with single-thread profiling experiments on the device. For clarity, accesses to literal constants in constant memory are not shown; their contribution is accumulated to the blue column, assuming latencies of an L1 hit.
We report typical, best-case and worst-case tokenizing throughputs of our versions on all considered GPGPUs. Figure 21 and Table 2 summarize the results for the GTX280 only. The typical throughput is obtained by our tokenizer on ASCII HTML articles from Wikipedia. The best-case scenario employs the same tokenizer on non-matching input; in these conditions, the tokenizer never spends time producing output or backing up. This represents peak performance in an IDS-like low-hit-rate scanning scenario. The worst-case scenario uses an input set that maximizes the time spent in output generation.
In the charts, each line connects experiments with the same number of blocks, with the number of warps per block growing from left to right. For each data point, the total number of threads is given by (no. blocks) × (no. warps/block) × (32 threads/warp).
Version 0 is a naïve port of the code generated by flex. We place inputs, outputs and STTs in global memory. The STT is uncompressed. Threads read their inputs one character at a time. We make no attempt to reduce the control flow or compute operations. Output entries are stored to global memory with four distinct 32-bit operations. The throughput is shown in Figure 15.
Version 1 places the STT in constant memory. Our reference STT (41 kbytes in size) fits without compression. The typical performance increases by 26%, as Figure 16 shows. This improve-
[Chart: tokenization throughput in Mbytes/s (0–1400) vs. total number of threads (0k–14k) for Version 0 on GTX280, GTX8800, C870 and FX3700; best-, typical- and worst-case curves, one line per block count (1–30 blocks of warps).]
Figure 15: Performance of the naïve implementation across the GPGPU architectures we considered.
[Chart: tokenization throughput in Mbytes/s (0–1400) vs. total number of threads (0k–14k) for Version 1 on GTX280, GTX8800, C870 and FX3700; best-, typical- and worst-case curves, one line per block count (1–30 blocks of warps).]
Figure 16: Performance of implementation Version 1 across the GPGPU architectures we considered.
[Chart: tokenization throughput in Mbytes/s (0–1400) vs. total number of threads (0k–14k) for Version 2 on GTX280, GTX8800, C870 and FX3700; best-, typical- and worst-case curves, one line per block count (1–30 blocks of warps).]
Figure 17: Performance of implementation Version 2 across the GPGPU architectures we considered.
[Chart: tokenization throughput in Mbytes/s (0–1400) vs. total number of threads (0k–14k) for Version 3 on GTX280, GTX8800, C870 and FX3700; best-, typical- and worst-case curves, one line per block count (1–30 blocks of warps).]
Figure 18: Performance of implementation Version 3 across the GPGPU architectures we considered.
[Chart: tokenization throughput in Mbytes/s (0–1400) vs. total number of threads (0k–14k) for Version 4 on GTX280, GTX8800, C870 and FX3700; best-, typical- and worst-case curves, one line per block count (1–30 blocks of warps).]
Figure 19: Performance of implementation Version 4 across the GPGPU architectures we considered.
[Chart: tokenization throughput in Mbytes/s (0–1400) vs. total number of threads (0k–14k) for Version 5 on GTX280, GTX8800, C870 and FX3700; best-, typical- and worst-case curves, one line per block count (1–30 blocks of warps).]
Figure 20: Performance of implementation Version 5 across the GPGPU architectures we considered.
[Chart, left: typical throughput in Mbyte/s (0–1200) across Versions 0–5 on GTX280. Chart, right: best-, typical- and worst-case throughput in Mbyte/s (0–7000) across the same versions on GTX280.]
Figure 21: How the throughput in the typical case (left) and in all the cases (right) varies across the optimization steps.
Optimization Step                                     Typical     Best-case   Worst-case  Registers   Top          Cumulative
                                                      Throughput  Throughput  Throughput  per         Utilization  Speed-up
                                                      (Mbyte/s)   (Mbyte/s)   (Mbyte/s)   Thread                   (typical)
(0) Naïve baseline implementation                     327.4       805.6       181.5       21          50 %         = 1.00×
(1) STT in constant memory                            414.6       840.1       252.4       21          50 %         1.27×
(2) STT in shared memory                              423.8       853.1       246.9       19          50 %         1.29×
(3) Blocked input                                     856.0       6,921.9     449.0       25          50 %         2.61×
(4) Blocked output                                    960.5       6,924.4     464.3       26          50 %         2.93×
(5) Single-access STT                                 1,185.5     5,591.3     958.7       23          50 %         3.62×

For comparison (throughput, Mbyte/s):
flex -CF on Intel Xeon E5430 (1 thread) @ 2.66 GHz                23.8
flex -CF on Intel Xeon E5430 (4 threads) @ 2.66 GHz               58.9
[25] Multi-pattern matching on nVidia GT8600 @ 1.2 GHz            287
[11] Optimized XML scanning on Intel Xeon E5472 @ 3.00 GHz        2,880
[14] SIMD-based tokenization on IBM Cell/B.E. @ 3.20 GHz          1,780
[26] Multipattern matching on 128-CPU Cray XMT @ 0.5 GHz          3,500
Table 2: Performance of our tokenizer across its optimization steps on GTX280, compared against an unoptimized flex on a recent Intel Xeon processor and previous results in literature.
ment is better than what Figure 11 suggests because our actual working set is much smaller than the STT size: flex DFAs most frequently visit a small subset of the states. This is not adequately approximated by our pointer-chasing micro-benchmarks.
Version 2 places the STT in shared memory. This requires compression to fit the shared memory (16 kbytes). The increase in typical performance is only 2% (Figure 17) because the compression mechanism requires one more access (to the equivalence-class table) per state transition. Two lookups in shared memory are only marginally faster than one in constant memory under our working-set conditions (compare Figures 11 and 12).
Version 3 reads inputs in 16-byte blocks. DFA iterations extract single characters from the blocks and issue loads only when the target address is not covered by the current block. To transfer blocks, we employ the uint4 CUDA-native data type, originally intended for four 32-bit integers. Dereferencing a uint4 pointer takes one PTX instruction (ld.global.v4.u32). No char16 native type exists, nor a CUDA intrinsic to extract characters from a uint4; for that purpose, we employ a construct like union { uint4 block; char array[16]; }. Figure 18 illustrates the result: a 2.02× typical-throughput increase.
Version 4 adopts output blocking. It writes an output entry with one aligned store. If aligned, a uint4 write translates into one st.global.v4.u32 PTX instruction. For that, we use the make_uint4(x,y,z,w) CUDA intrinsic. Version 4 improves the typical performance of Version 3 by 12%, as Figure 19 shows.
Version 5 uses a merged STT that encodes the information of yy_nxt and yy_accept in each 16-bit STT value. It also has optimized control flow and pointer arithmetic. This removes one table lookup per iteration, and increases typical performance by 23% with respect to Version 4 (Figure 20).
In typical conditions, Version 5 on the GTX280 is 49.8× faster than a single-threaded instance of unmodified flex running on an Intel Xeon E5430 @ 2.66 GHz, and 20.1× faster than a 4-thread parallelized flex tokenizer running on the same processor.
In certain conditions, our tokenizer delivers a higher throughput than the device-host bandwidth provided by the PCI Express bus. This would be a limit only if the tokenizer were used in isolation. However, tokenizers are frequently the first stage of a pipeline that can operate locally (e.g., in device memory), without the need for host-device transfers except for the final results.
6. RELATED WORK
Many authors have commented on the CUDA programming model; a comprehensive discussion was presented by Nickolls et al. [10]. Hong and Kim [20] presented an analytical model to estimate the performance of CUDA applications on the basis of their parallelism and the structure of their memory accesses. Bakhoda et al. [18] made available a GPGPU simulator. Papadopoulou et al. [22] performed a microbenchmark-based investigation to discover unpublished GPGPU internals. Volkov and Demmel [21] analyzed and optimized dense linear algebra kernels on GPGPUs, achieving almost optimal utilization of the arithmetic units. Their workload has little similarity with ours, but we believe we share their white-box, bottom-up analysis approach.
While no work has explicitly explored information indexing or retrieval algorithms on GPGPUs, the work performed on automaton-based string matching is related to ours and worth mentioning. Vasiliadis et al. presented Gnort [25], a network intrusion detection engine based on the Aho-Corasick algorithm [27]. Gnort delivers an end-to-end filtering throughput of 287 Mbyte/s on an nVidia GT8600 device, with a dictionary of a few thousand signatures. The authors store STTs in texture memory, in an attempt to leverage the caches. Smith et al. [28] also presented an intrusion detection solution, which relies on XFAs, a DFA variant designed to recognize regexps compactly. Their implementation delivers a throughput of 156 Mbyte/s. Goyal et al. [29] propose a DFA-based regexp matching implementation that delivers 50 Mbyte/s.
We now compare our results with relevant ones on traditional multi-core architectures. Pasetto et al. [11] presented a flexible tool that performs small-ruleset regexp matching at 2.88 Gbyte/s per chip on an Intel Xeon E5472. Scarpazza and Russell [14] presented a SIMD tokenizer that delivers 1.00–1.78 Gbyte/s on one IBM Cell/B.E. chip, and extended their approach to Intel Larrabee [15]. Villa et al. [26] demonstrated a large-dictionary Aho-Corasick-based string matcher that delivers 3.5 Gbyte/s on a Cray XMT. Iorio and van Lunteren [12] proposed a BFSM-based [30] string matcher for automata whose STT fits in the register file, achieving 4 Gbyte/s on an IBM Cell chip.
Not only do some of these solutions outperform ours; they have the additional advantage of not needing a high number of independent input streams to achieve high device utilization.
7. CONCLUSIONS
We have found that the workload of automaton-based tokenization with small rule sets maps to GPGPUs better than its irregular memory access patterns suggest, even though this irregularity prevents the task from exploiting coalesced memory accesses.
We have characterized the performance of the different GPGPU memory classes when subject to the typical access patterns of tokenizers. We have shown how software designers can speed up their workloads significantly by mapping each data structure to the most appropriate GPGPU memory class, and by restructuring accesses to use larger blocks.
We have employed this knowledge to craft an optimized tokenizer implementation that compares well with the results reported in the literature.
Acknowledgments
We thank Rajesh Bordawekar of IBM T.J. Watson for his support and valuable suggestions; we thank Oreste Villa of the Pacific Northwest National Laboratory for his helpful comments.
8. REFERENCES
[1] Daniele Paolo Scarpazza and Gordon W. Braudaway. Workload characterization and optimization of high-performance text indexing on the Cell processor. In IEEE Intl. Symposium on Workload Characterization (IISWC'09), Austin, Texas, USA, October 2009.
[2] M. Nicola and J. John. XML parsing: A threat to database performance. In CIKM. ACM, 2003.
[3] Eric Perkins, Margaret Kostoulas, Abraham Heifets, Morris Matsa, and Noah Mendelsohn. Performance analysis of XML APIs. In XML 2005 Conference and Exposition, November 2005.
[4] Intel Corp. Tera-scale research prototype – connecting 80 simple cores on a single test chip, October 2006.
[5] Intel Corp. Single-chip cloud computer; techresearch.intel.com/articles/tera-scale/1826.htm, December 2009.
[6] Larry Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam Lake, Jeremy Sugerman, Robert Cavin, Roger Espasa, Ed Grochowski, Toni Juan, and Pat Hanrahan. Larrabee: A many-core x86 architecture for visual computing. In ACM SIGGRAPH 2008, pages 1–15, New York, NY, USA, 2008. ACM.
[7] Intel Corp. Intel SSE4 Programming Reference, Reference Number: D91561-001, April 2007.
[8] Nadeem Firasta, Mark Buxton, Paula Jinbo, Kaveh Nasri, and Shihjong Kuo. Intel AVX: New Frontiers in Performance Improvements and Energy Efficiency, March 2008.
[9] Michael Abrash. A first look at the Larrabee new instructions (LRBni). Dr. Dobb's Journal, April 2009.
[10] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalableparallel programming with CUDA. ACM Queue, 6(2):40–53, 2008.
[11] Davide Pasetto, Fabrizio Petrini, and Virat Agarwal. Tools for very fast regularexpression matching. IEEE Computer, 43(3):50–58, March 2010.
[12] Francesco Iorio and Jan van Lunteren. Fast pattern matching on the Cell Broadband Engine. In 2008 Workshop on Cell Systems and Applications (WCSA), affiliated with the 2008 Intl. Symposium on Computer Architecture (ISCA'08), Beijing, China, June 2008.
[13] Robert D. Cameron and Dan Lin. Architectural support for SWAR text processing with parallel bit streams: the inductive doubling principle. In 14th Intl. Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'09), pages 337–348, New York, NY, USA, 2009. ACM.
[14] Daniele Paolo Scarpazza and Gregory F. Russell. High-performance regular expression scanning on the Cell/B.E. processor. In 23rd Intl. Conference on Supercomputing (ICS'09), Yorktown Heights, New York, USA, June 2009.
[15] Daniele Paolo Scarpazza. Is Larrabee for the rest of us? – Can non-numerical application developers take advantage of the new LRBni instructions? Dr. Dobb's Journal; http://www.ddj.com/architect/221601028, November 2009.
[16] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December 2006.
[17] Vern Paxson. flex – a fast lexical analyzer generator, 1988.
[18] Ali Bakhoda, George Yuan, Wilson W. L. Fung, Henry Wong, and Tor M. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In IEEE Intl. Symposium on Performance Analysis of Systems and Software (ISPASS'09), Boston, MA, April 2009.
[19] nVidia. nVidia CUDA programming guide, version 2.3, August 2009.
[20] Sunpyo Hong and Hyesoon Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Comput. Archit. News, 37(3):152–163, 2009.
[21] Vasily Volkov and James W. Demmel. Benchmarking GPUs to tune dense linear algebra. In ACM/IEEE Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SuperComputing'08), pages 1–11, Austin, TX, November 2008.
[22] Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Henry Wong. Micro-benchmarking the GT200 GPU. Technical report, Computer Group, ECE, University of Toronto, 2009.
[23] nVidia. nVidia compute, PTX: Parallel Thread Execution, ISA version 1.4, September 2009.
[24] The Apache Software Foundation. Lucene, http://lucene.apache.org.
[25] Giorgos Vasiliadis, Spyros Antonatos, Michalis Polychronakis, Evangelos P. Markatos, and Sotiris Ioannidis. Gnort: High performance network intrusion detection using graphics processors. In 11th Intl. Symposium on Recent Advances in Intrusion Detection (RAID'08), volume 5230 of Lecture Notes in Computer Science, pages 116–134, Cambridge, MA, September 2008. Springer.
[26] Oreste Villa, Daniel Chavarria, and Kristyn Maschhoff. Input-independent, scalable and fast string matching on the Cray XMT. In 23rd IEEE Intl. Parallel & Distributed Processing Symposium (IPDPS'09), 2009.
[27] Alfred V. Aho and Margaret J. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333–340, 1975.
[28] Randy Smith, Neelam Goyal, Justin Ormont, Karthikeyan Sankaralingam, and Cristian Estan. Evaluating GPUs for network packet signature matching. In IEEE Intl. Symposium on Performance Analysis of Systems and Software (ISPASS'09), Boston, MA, April 2009.
[29] Neelam Goyal, Justin Ormont, Randy Smith, Karthikeyan Sankaralingam, and Cristian Estan. Signature matching in network processing using SIMD/GPU architectures. Technical Report 1628, University of Wisconsin at Madison, January 2008.
[30] Jan van Lunteren. High-performance pattern-matching for intrusion detection. In 25th IEEE Intl. Conference on Computer Communications (INFOCOM 2006), pages 1–13, April 2006.