Graphics Hardware (2008), David Luebke and John D. Owens (Editors)

A Hardware Processing Unit for Point Sets

Simon Heinzle, Gaël Guennebaud, Mario Botsch, Markus Gross

ETH Zurich

Abstract

We present a hardware architecture and processing unit for point sampled data. Our design is focused on fundamental and computationally expensive operations on point sets including k-nearest neighbors search, moving least squares approximation, and others. Our architecture includes a configurable processing module allowing users to implement custom operators and to run them directly on the chip. A key component of our design is the spatial search unit based on a kd-tree performing both kNN and εN searches. It utilizes stack recursions and features a novel advanced caching mechanism allowing direct reuse of previously computed neighborhoods for spatially coherent queries. In our FPGA prototype, both modules are multi-threaded, exploit full hardware parallelism, and utilize a fixed-function data path and control logic for maximum throughput and minimum chip surface. A detailed analysis demonstrates the performance and versatility of our design.

Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Hardware Architecture]: Graphics processors

1. Introduction

In recent years researchers have developed a variety of powerful algorithms for the efficient representation, processing, manipulation, and rendering of unstructured point-sampled geometry [GP07]. A main characteristic of such point-based representations is the lack of connectivity. It turns out that many point processing methods can be decomposed into two distinct computational steps. The first one includes the computation of some neighborhood of a given spatial position, while the second one is an operator or computational procedure that processes the selected neighbors. Such operators include fundamental, atomic ones like weighted averages or covariance analysis, as well as higher-level operators, such as normal estimation or moving least squares (MLS) approximations [Lev01, ABCO∗01]. Very often, the spatial queries to collect adjacent points constitute the computationally most expensive part of the processing. In this paper, we present a custom hardware architecture to accelerate both spatial search and generic local operations on point sets in a versatile and resource-efficient fashion.

Spatial search algorithms and data structures are very well investigated [Sam06] and are utilized in many different applications. The most commonly used computations include the well known k-nearest neighbors (kNN) and the Euclidean neighbors (εN), defined as the set of neighbors within a given radius. kNN search is of central importance for point processing since it automatically adapts to the local point sampling rates.

Among the variety of data structures to accelerate spatial search, kd-trees [Ben75] are the most commonly employed ones in point processing, as they balance time and space efficiency very well. Unfortunately, hardware acceleration for kd-trees is non-trivial. While the SIMD design of current GPUs is very well suited to efficiently implement most point processing operators, a variety of architectural limitations leave GPUs less suited for efficient kd-tree implementations. For instance, recursive calls are not supported due to the lack of managed stacks. In addition, dynamic data structures, like priority queues, cannot be handled efficiently. They produce incoherent branching and either consume a lot of local resources or suffer from the lack of flexible memory caches. Conversely, current general-purpose CPUs feature a relatively small number of floating point units combined with a limited ability of their generic caches to support the particular memory access patterns generated by the recursive traversals in spatial search. The resulting inefficiency of kNN implementations on GPUs and CPUs is a central motivation for our work.

In this paper, we present a novel hardware architecture dedicated to the efficient processing of unstructured point sets.


Its core comprises a configurable, kd-tree based neighbor search module (implementing both kNN and εN searches) as well as a programmable processing module. Our spatial search module features a novel advanced caching mechanism that specifically exploits the spatial coherence inherent in our queries. The new caching system allows saving up to 90% of the kd-tree traversals, depending on the application. The design includes a fixed function data path and control logic for maximum throughput and lightweight chip area. Our architecture takes maximum advantage of hardware parallelism and involves various levels of multi-threading and pipelining. The programmability of the processing module is achieved through the configurability of FPGA devices and a custom compiler.

Our lean, lightweight design can be seamlessly integrated into existing massively multi-core architectures like GPUs. Such an integration of the kNN search unit could be done in a similar manner to the dedicated texturing units, where neighborhood queries would be issued directly from running kernels (e.g., from vertex/fragment shaders or CUDA programs). The programmable processing module together with the arithmetic hardware compiler could be used for embedded devices [Vah07, Tan06], or for co-processors to a CPU using front side bus FPGA modules [Int07].

The prototype is implemented on FPGAs and provides a driver to invoke the core operations conveniently and transparently from high level programming languages. Operating at a rather low frequency of 75 MHz, its performance competes with CPU reference implementations. When scaling the results to frequencies realistic for ASICs, we are able to beat CPU and GPU implementations by an order of magnitude while consuming very modest hardware resources.

Our architecture is geared toward efficient, generic point processing by supporting two fundamental operations: cached kd-tree-based neighborhood searches and generic meshless operators, such as MLS projections. These concepts are widely used in computer graphics, making our architecture applicable to research fields as diverse as point-based graphics, computational geometry, global illumination, and meshless simulations.

2. Related Work

A key feature of meshless approaches is the lack of explicit neighborhood information, which typically has to be evaluated on the fly. The large variety of spatial data structures for point sets [Sam06] evidences the importance of efficient access to neighbors in point clouds. A popular and simple approach is to use a fixed-size grid, which, however, does not prune the empty space. More advanced techniques, such as the grid file [NHS84] or locality-preserving hashing schemes [IMRV97], provide better use of space, but to achieve high performance their grid size has to be carefully aligned to the query range.

The quadtree [FB74] imposes a hierarchical access structure onto a regular grid using a d-dimensional d-ary search tree. The tree is constructed by splitting the space into 2^d regular subspaces. The kd-tree [Ben75], the most popular spatial data structure, splits the space successively into two half-spaces along one dimension. It thus combines efficient space pruning with a small memory footprint. Very often, the kd-tree is used for k-nearest neighbors search on point data of moderate dimensionality because of its optimal expected-time complexity of O(log(n) + k) [FBF77, Fil79], where n is the number of points. Extensions of the initial concept include the kd-B-tree [Rob81], a bucket variant of the kd-tree, where the partition planes do not need to pass through the data points. In the following, we will use the term kd-tree to describe this class of spatial search structures.

Approximate kNN queries on the GPU have been presented by Ma et al. [MM02] for photon mapping, where a locality-preserving hashing scheme similar to the grid file was applied for sorting and indexing point buckets. In the work of Purcell et al. [PDC∗03], a uniform grid constructed on the GPU was used to find the nearest photons; however, this access structure only performs well for similarly sized search radii.

In the context of ray tracing, various hardware implementations of kd-tree ray traversal have been proposed. These include dedicated units [WSS05, WMS06] and GPU implementations based either on a stack-less [FS05, PGSS07] or, more recently, a stack-based approach [GPSS07]. Most of these algorithms accelerate their kd-tree traversal by exploiting spatial coherence using packets of multiple rays [WBWS01]. However, this concept is not geared toward the more generic pattern of kNN queries and does not address the neighbor sort as a priority list.

In order to take advantage of spatial coherence in nearest neighbor queries, we introduce a coherent neighbor cache system, which allows us to directly reuse previously computed neighborhoods. This caching system, as well as the kNN search on the kd-tree, are presented in detail in the next section.

3. Spatial Search and Coherent Cache

In this section we will first briefly review kd-tree based neighbor search and then present how to take advantage of the spatial coherence of the queries using our novel coherent neighbor cache algorithm.

3.1. Neighbor Search Using kd-Trees

The kd-tree [Ben75] is a multidimensional search tree for point data. It splits space along a splitting plane that is perpendicular to one of the coordinate axes, and hence can be considered a special case of binary space partitioning trees [FKN80].


Figure 1: The kd-tree data structure: The left image shows a point-sampled object and the spatial subdivision computed by a kd-tree. The right image displays the kd-tree; points are stored in the leaf nodes.

In its original version, every node of the tree stores a point, and the splitting plane hence has to pass through that point. A more commonly used approach is to store points, or buckets of points, in the leaf nodes only. Figure 1 shows an example of a balanced 2-dimensional kd-tree. Balanced kd-trees can always be constructed in O(n log^2 n) for n points [OvL80].
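
For illustration, such a construction can be sketched in C++ as a recursive median split with points stored in leaf buckets. The types and the bucket size parameter are assumptions of this sketch, not the construction code used by our host-side preprocessing.

#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

struct Point { float x[3]; };                     // assumed point type

struct KdNode {
    bool  is_leaf = false;
    int   split_dim = 0;                          // splitting axis
    float split_value = 0.0f;                     // position of the splitting plane
    std::unique_ptr<KdNode> left, right;
    std::vector<Point> bucket;                    // points are stored in leaves only
};

std::unique_ptr<KdNode> build(std::vector<Point> pts, int depth, std::size_t bucket_size)
{
    auto node = std::make_unique<KdNode>();
    if (pts.size() <= bucket_size) {              // small enough: create a leaf bucket
        node->is_leaf = true;
        node->bucket  = std::move(pts);
        return node;
    }
    int dim = depth % 3;                          // cycle through the coordinate axes
    auto mid = pts.begin() + pts.size() / 2;      // median split keeps the tree balanced
    std::nth_element(pts.begin(), mid, pts.end(),
        [dim](const Point& a, const Point& b) { return a.x[dim] < b.x[dim]; });
    node->split_dim   = dim;
    node->split_value = mid->x[dim];
    node->left  = build(std::vector<Point>(pts.begin(), mid), depth + 1, bucket_size);
    node->right = build(std::vector<Point>(mid, pts.end()),   depth + 1, bucket_size);
    return node;
}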

kNN Search

The k-nearest neighbors search in a kd-tree is performed as follows (Listing 1): We traverse the tree recursively down the half spaces in which the query point is contained until we hit a leaf node. When a leaf node is reached, all points contained in that cell are sorted into a priority queue of length k. In a backtracking stage, we recursively ascend and descend into the other half spaces if the distance from the query point to the farthest point in the priority queue is greater than the distance of the query point to the cutting plane. The priority queue is initialized with elements of infinite distance.

Point query;              // query point
PriorityQueue pqueue;     // priority queue of length k

void find_nearest(Node node) {
    if (node.is_leaf) {
        // Loop over all points contained in the leaf's bucket
        // and sort them into the priority queue.
        for (each point p in node)
            if (distance(p, query) < pqueue.max())
                pqueue.insert(p);
    } else {
        partition_dist = distance(query, node.partition_plane);
        // Decide whether to go left or right first.
        if (partition_dist > 0) {
            find_nearest(node.left);
            // Take the other branch only if it is close enough.
            if (pqueue.max() > abs(partition_dist))
                find_nearest(node.right);
        } else {
            find_nearest(node.right);
            if (pqueue.max() > abs(partition_dist))
                find_nearest(node.left);
        }
    }
}

Listing 1: Recursive search of the kNN in a kd-tree.

εN Search

An εN search, also called ball or range query, aims to find all the neighbors around a query point q_i within a given radius r_i.

Figure 2: The principle of our coherent neighbor cache algorithm. (a) In the case of kNN search, the neighborhood of q_i is valid for any query point q_j within the tolerance distance e_i. (b) In the case of εN search, the extended neighborhood of q_i can be reused for any ball query (q_j, r_j) which is inside the extended ball (q_i, αr_i).

However, in most applications it is desirable to bound the maximum number of found neighbors. Then, the ball query is equivalent to a kNN search where the maximum distance of the selected neighbors is bounded by r_i. In the above algorithm, this behavior is trivially achieved by initializing the priority queue with placeholder elements at a distance r_i.
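
A minimal sketch of this trick, under an assumed QueueEntry type, is shown below.

#include <vector>

struct QueueEntry { int point_index; float distance; };   // point_index = -1 marks a placeholder

// Seed the bounded priority queue of Listing 1 with k placeholders at distance r:
// the search then never accepts a point farther than r, so it answers the ball
// query (q, r) with at most k results; placeholders remaining at the end simply
// mean that fewer than k neighbors exist within the radius.
std::vector<QueueEntry> make_ball_query_queue(int k, float r)
{
    return std::vector<QueueEntry>(k, QueueEntry{-1, r});
}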

Note that in high-level programming languages, the stack stores all important context information upon a recursive function call and reconstructs the context when the function terminates. As we will discuss subsequently, this stack has to be implemented and managed explicitly in a dedicated hardware architecture.
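
The following C++ sketch shows the same search with the recursion replaced by an explicitly managed, fixed-size stack, mirroring the organization used by our hardware (Section 4.2). All types are simplified stand-ins, and the priority queue only tracks distances here.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point3 { float x[3]; };

struct TreeNode {
    bool is_leaf = false;
    int split_dim = 0;
    float split_value = 0.0f;
    const TreeNode* left = nullptr;
    const TreeNode* right = nullptr;
    std::vector<Point3> bucket;                  // points, leaf nodes only
};

struct BoundedQueue {
    // Simplified stand-in for the sorted priority queue of length k
    // (only distances are kept; the hardware also stores point references).
    std::vector<float> dist;                     // kept sorted, back() is the farthest
    BoundedQueue(int k, float r) : dist(k, r) {} // placeholders at distance r
    float max() const { return dist.back(); }
    void insert(float d) {
        dist.pop_back();                         // evict the current farthest
        dist.insert(std::lower_bound(dist.begin(), dist.end(), d), d);
    }
};

struct StackEntry { const TreeNode* other_branch; float plane_dist; };

void find_nearest_iterative(const TreeNode* root, const Point3& query, BoundedQueue& pqueue)
{
    StackEntry stack[16];                        // bounded by the maximum tree depth
    std::size_t top = 0;
    const TreeNode* node = root;
    while (node != nullptr) {
        if (node->is_leaf) {                     // --- leaf processing ---
            for (const Point3& p : node->bucket) {
                float d2 = 0.0f;
                for (int i = 0; i < 3; ++i) {
                    float t = p.x[i] - query.x[i];
                    d2 += t * t;
                }
                float d = std::sqrt(d2);
                if (d < pqueue.max()) pqueue.insert(d);
            }
            node = nullptr;
        } else {                                 // --- node traversal ---
            float dist = query.x[node->split_dim] - node->split_value;
            const TreeNode* near_child = (dist <= 0.0f) ? node->left  : node->right;
            const TreeNode* far_child  = (dist <= 0.0f) ? node->right : node->left;
            stack[top++] = StackEntry{far_child, dist};  // remember the path not taken
            node = near_child;
        }
        // --- stack recursion: backtrack until a far subtree must still be visited ---
        while (node == nullptr && top > 0) {
            StackEntry e = stack[--top];
            if (pqueue.max() > std::fabs(e.plane_dist))
                node = e.other_branch;
        }
    }
}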

3.2. Coherent Neighbor Cache

Several applications, such as up-sampling or surface reconstruction, issue densely sampled queries. In these cases, it is likely that the neighborhoods of multiple query points are the same. The coherent neighbor cache (CNC) exploits this spatial coherence to avoid multiple computations of similar neighborhoods. The basic idea is to compute slightly more neighbors than necessary, and use this extended neighborhood for subsequent, spatially close queries.

Assume we query the kNN of the point q_i (Figure 2a). Instead of looking for the k nearest neighbors, we compute the k+1 nearest neighbors N_i = {p_1, ..., p_{k+1}}. Let e_i be half the difference of the distances between the query point and the two farthest neighbors: e_i = (||p_{k+1} − q_i|| − ||p_k − q_i||)/2. Then, e_i defines a tolerance radius around q_i such that the kNN of any point inside this ball are guaranteed to be equal to N_i \ {p_{k+1}}.

In practice, the cache stores a list of the m most recently used neighborhoods N_i together with their respective query point q_i and tolerance radius e_i. Given a new query point q_j, if the cache contains an N_i such that ||q_j − q_i|| < e_i, then N_j = N_i is reused; otherwise a full kNN search is performed.
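
A software sketch of this cache, under assumed types, is given below; the hardware keeps the entries in an LRU-ordered register structure rather than a linked list, and when an entry is reused the consumer simply drops the farthest of the stored k+1 neighbors.

#include <cmath>
#include <cstddef>
#include <list>
#include <vector>

struct Vec3 { float x, y, z; };

static float dist(const Vec3& a, const Vec3& b) {
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

struct CacheEntry {
    Vec3 q;                        // query point the neighborhood was built for
    float tolerance;               // e_i = (||p_{k+1} - q|| - ||p_k - q||) / 2
    std::vector<int> neighbors;    // indices of the k+1 nearest neighbors
};

class CoherentNeighborCache {
public:
    explicit CoherentNeighborCache(std::size_t m) : capacity_(m) {}

    // Returns a cached neighborhood valid for q, or nullptr on a miss.
    // On a hit, the caller uses the first k of the k+1 stored neighbors.
    const CacheEntry* lookup(const Vec3& q) {
        for (auto it = entries_.begin(); it != entries_.end(); ++it)
            if (dist(q, it->q) < it->tolerance) {
                entries_.splice(entries_.begin(), entries_, it);  // mark as most recently used
                return &entries_.front();
            }
        return nullptr;            // miss: caller performs a full kd-tree search
    }

    // Called after a full (k+1)-NN search to replace the least recently used entry.
    void insert(CacheEntry entry) {
        if (entries_.size() == capacity_) entries_.pop_back();
        entries_.push_front(std::move(entry));
    }

private:
    std::size_t capacity_;
    std::list<CacheEntry> entries_;
};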

c© The Eurographics Association 2008.

Heinzle et al. / A Hardware Processing Unit for Point Sets

In order to further reduce the number of cache misses, it is possible to compute even more neighbors, i.e., the k+c nearest ones. However, for c ≠ 1 the extraction of the true kNN would then require sorting the set N_i at each cache hit, which consequently would prevent the sharing of such a neighborhood by multiple processing threads.

Moreover, we believe that in many applications it is preferable to tolerate some approximation in the neighborhood computation. Given any positive real ε, a data point p is a (1+ε)-approximate k-nearest neighbor (AkNN) of q if its distance from q is within a factor of (1+ε) of the distance to the true k-th nearest neighbor. As we show in our results, computing AkNN is sufficient in most applications. This tolerance mechanism is accomplished by computing the value of e_i as follows:

e_i = ( ||p_{k+1} − q_i|| · (1 + ε) − ||p_k − q_i|| ) / (2 + ε).    (1)

The extension of the caching mechanism to ball queries is depicted in Figure 2b. Let r_i be the query radius associated with the query point q_i. First, an extended neighborhood of radius αr_i with α > 1 is computed. The resulting neighborhood N_i can be reused for any ball query (q_j, r_j) with ||q_j − q_i|| < αr_i − r_j. Finally, the subsequent processing operators have to check the distance of each neighbor to the query point in order to remove the wrongly selected neighbors. The value of α is a tradeoff between the cache hit rate and the overhead of computing the extended neighborhood. Again, if an approximate result is sufficient, then a (1+ε)-AkNN-like mechanism can be accomplished by reusing N_i if the following coherence test holds: ||q_j − q_i|| < (αr_i − r_j) · (1 + ε).
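
The reuse conditions of this section can be summarized as a few plain functions; this is a hedged software rendering of the formulas above, with eps = 0 recovering the exact tests. dist_qj_qi denotes ||q_j − q_i||, and dist_k and dist_k_plus_1 are the distances of the k-th and (k+1)-th neighbors found for q_i.

// Tolerance radius of a cached kNN neighborhood, Equation (1); eps = 0 gives
// the exact case e_i = (||p_{k+1} - q_i|| - ||p_k - q_i||) / 2.
float knn_tolerance(float dist_k, float dist_k_plus_1, float eps) {
    return (dist_k_plus_1 * (1.0f + eps) - dist_k) / (2.0f + eps);
}

// Reuse test for a kNN query q_j against the cache entry built around q_i.
bool knn_cache_hit(float dist_qj_qi, float tolerance) {
    return dist_qj_qi < tolerance;
}

// Reuse test for a ball query (q_j, r_j) against an extended neighborhood of
// radius alpha * r_i computed around q_i; eps = 0 gives the exact test.
bool ball_cache_hit(float dist_qj_qi, float alpha, float r_i, float r_j, float eps) {
    return dist_qj_qi < (alpha * r_i - r_j) * (1.0f + eps);
}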

4. A Hardware Architecture for Generic Point Processing

In this section we will describe our hardware architecture implementing the algorithms introduced in the previous section. In particular, we will focus on the design decisions and features underlying our processing architecture, while the implementation details will be described in Section 5.

4.1. Overview

Our architecture is designed to provide an optimal compromise between flexibility and performance. Figure 3 shows a high-level overview of the architecture. The two main modules, the neighbor search module and the processing module, can both be operated separately or in tandem. A global thread control unit manages user input and output requests as well as the module's interface to high level programming languages, such as C++.

The core of our architecture is the configurable neighbor search module, which is composed of a kd-tree traversal unit and a coherent neighbor cache unit.

Figure 3: High-level overview of our architecture. The two modules can be operated separately or in tandem.

We designed this module to support both kNN and εN queries with maximal sharing of resources. In particular, all differences are managed locally by the coherent neighbor cache unit, while the kd-tree traversal unit works regardless of the kind of query. This module is designed with fixed function data paths and control logic for maximum throughput and moderate chip area consumption. We furthermore designed every functional unit to take maximum advantage of hardware parallelism. Multi-threading and pipelining were applied to hide memory and arithmetic latencies. The fixed function data path also allows for minimal thread-storage overhead. All external memory accesses are handled by a central memory manager and supported by data and kd-tree caches.

In order to provide optimal performance on limited hardware, our processing module is also implemented using a fixed function data path design. Programmability is achieved through the configurability of FPGA devices and by using a custom hardware compiler. The integration of our architecture with existing or future general purpose computing units, like GPUs, is discussed in Section 6.2.

A further fundamental design decision is that the kd-tree construction is currently performed by the host CPU and transferred to the subsystem. This decision is justified given that the tree construction can be accomplished in a preprocess for static point sets, whereas neighbor queries have to be carried out at runtime for most point processing algorithms. Our experiments have also shown that for moderately sized dynamic data sets, the kd-tree construction times are negligible compared to the query times.

Before going more into detail, it is instructive to describe the procedural and data flows of a generic operator applied to some query points. After a request is issued for a given query point, the coherent neighbor cache is checked. If a cached neighborhood can be reused, a new processing request is generated immediately. Otherwise, a new neighbor search thread is issued. Once a neighbor search has terminated, the least recently used neighbor cache entry is replaced with the attributes of the found neighbors and a processing thread is generated. The processing thread loops over the neighbors and writes the results into the delivery buffer, from where they are eventually read back by the host.


Figure 4: Top level view of the kd-tree traversal unit.

In all subsequent figures, blue indicates memory while green stands for arithmetic and control logic.

4.2. kd-Tree Traversal Unit

The kd-tree traversal unit is designed to receive a query (q, r) and to return at most the k nearest neighbors of q within a radius r. The value of k is assumed to be constant for a batch of queries.

This unit starts a query by initializing the priority queue with empty elements at distance r, and then performs the search following the algorithm of Listing 1. While this algorithm is a highly sequential operation, we can identify three main blocks to be executed in parallel, due to their independence in terms of memory access. As depicted in Figure 4, these blocks include node traversal, stack recursion, and leaf processing.

The node traversal unit traverses the path from the current node down to the leaf cell containing the query point. Memory access patterns include reading of the kd-tree data structure and writing to a dedicated stack. This stack is explicitly managed by our architecture and contains all traversal information for backtracking. Once a leaf is reached, all points contained in that leaf node need to be inserted and sorted into a priority queue of length k. Memory access patterns include reading point data from external memory and read-write access to the priority queue. After a leaf node has been left, backtracking is performed by recurring up the stack until a new downward path is identified. The only memory access is reading the stack.

4.3. Coherent Neighbor Cache Unit

The coherent neighbor cache unit (CNC), depicted in Figure 5, maintains a list of the m most recently used neighborhoods in least recently used (LRU) order.

Search mode       | kNN                    | εN
c_i               | e_i (Equation 1)       | distance of top element
Skip top element  | always                 | if it is the empty element
Coherence test    | ||q_j − q_i|| < c_i    | ||q_j − q_i|| < c_i − r_j
Generated query   | (q_j, ∞)               | (q_j, α·r_j)

Table 1: Differences between kNN and εN modes.

Figure 5: Top level view of the coherent neighbor cache unit.

Figure 6: Top level view of our programmable processing module.

For each cache entry, the list of neighbors N_i, its respective query position q_i, and a generic scalar comparison value c_i, as defined in Table 1, are stored. The coherence check unit uses the latter two values to determine possible cache hits and issues a full kd-tree search otherwise.

The neighbor copy unit updates the neighborhood caches with the results from the kd-tree search and computes the comparison value c_i according to the current search mode. For correct kNN results, the top element corresponding to the (k+1)NN needs to be skipped. In the case of εN queries all empty elements are discarded. The subtle differences between kNN and εN are summarized in Table 1.

4.4. Processing Module

The processing module, depicted in Figure 6, is composed of three customizable blocks: an initialization step, a loop kernel executed sequentially for each neighbor, and a finalization step. The three steps can be globally iterated multiple times, where the finalization step controls the termination of the loop. This outer loop is convenient to implement, e.g., an iterative MLS projection procedure. Listing 2 shows an instructive control flow of the processing module.

Vertex neighbors[k];      // custom type
OutputType result;        // custom type
int count = 0;
do {
    init(query_data, neighbors[k], count);
    for (i = 1..k)
        kernel(query_data, neighbors[i]);
} while (finalization(query_data, count++, &result));

Listing 2: Top-level algorithm implemented by the processing module.


All modules have access to the query data (position, radius, and custom attributes) and exchange data through a shared register bank. The initialization step furthermore has access to the farthest neighbor, which can be especially useful to, e.g., estimate the sampling density. All modules operate concurrently, but on different threads.

The three customizable blocks are specified using a pseudo assembly language supporting various kinds of arithmetic operations, comparison operators, reads and conditional writes to the shared register bank, and fixed size loops achieved using loop unrolling. Our arithmetic compiler then generates hardware descriptions for optimized fixed function data paths and control logic.
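
As a concrete illustration, the decomposition of a simple operator into the three blocks can be sketched in C++. The example below computes a weighted average of the neighbor positions (roughly one step of a plane-fit MLS projection); the types, the Gaussian-like weight function, and the function signatures are assumptions of this sketch and do not correspond to the pseudo assembly language actually consumed by our compiler.

#include <cmath>

struct Vertex { float x, y, z; };
struct Query  { Vertex position; float radius; };
struct State  { float sum[3]; float weight_sum; };   // stand-in for the shared register bank
struct Output { Vertex centroid; };

void init(const Query&, State& s, int /*iteration*/) {
    s = State{{0.0f, 0.0f, 0.0f}, 0.0f};             // reset the accumulators
}

void kernel(const Query& q, const Vertex& p, State& s) {
    float dx = p.x - q.position.x, dy = p.y - q.position.y, dz = p.z - q.position.z;
    float d2 = dx * dx + dy * dy + dz * dz;
    float w  = std::exp(-d2 / (q.radius * q.radius)); // assumed smooth weight function
    s.sum[0] += w * p.x;  s.sum[1] += w * p.y;  s.sum[2] += w * p.z;
    s.weight_sum += w;
}

bool finalize(const Query&, State& s, int /*iteration*/, Output* out) {
    // Assumes at least one neighbor contributed a non-zero weight.
    out->centroid = { s.sum[0] / s.weight_sum,
                      s.sum[1] / s.weight_sum,
                      s.sum[2] / s.weight_sum };
    return false;                                    // no further outer iteration needed
}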

5. Prototype Implementation

This section describes the prototype implementation of the presented architecture using Field Programmable Gate Arrays (FPGAs). We will focus on the key issues and non-trivial implementation aspects of our system. At the end of the section, we will also briefly sketch some possible optimizations of our current prototype, and describe our GPU based reference implementation that will be used for comparisons in the results section.

5.1. System Setup

The two modules, neighbor search and processing, are mapped onto two different FPGAs. Each FPGA is equipped with a 64 bit DDR DRAM interface and both are integrated into a PCI carrier board, with a designated PCI bridge for host communication. The two FPGAs communicate via dedicated LVDS DDR communication units. The nearest neighbors information is transmitted through a 64 bit channel, and the results of the processing operators are sent back using a 16 bit channel. The architecture would actually fit onto a single Virtex2 Pro chip, but we strove to cut down the computation times for the mapping, placing, and routing steps in the FPGA synthesis. The communication does not degrade performance and adds only a negligible latency to the overall computation.

5.2. kd-Tree Traversal Unit

We will now revisit the kd-tree traversal unit of the neighbor search module in Figure 4 and discuss its five major units from an implementation perspective.

There are up to 16 parallel threads operating in the kd-tree traversal unit. A detailed view of the stack, stack recursion, and node traversal units is presented in Figure 7. The node traversal unit pushes the paths not taken during traversal onto a shared stack for subsequent backtracking. The maximum size of the stack is bounded by the maximum depth of the tree. Compared to general purpose architectures, however, our stacks are lightweight and stored on-chip to maximize performance.

Figure 7: Detailed view of the sub-units and storage of the node traversal, stack recursion and stack units.

Figure 8: Detailed view of the leaf processing and parallel priority queue units. The sub-units load external data, compute distances, and resort the parallel priority queues.

They only store pointers to the tree paths not taken as well as the distance of the query point to the bounding plane (2+4 bytes). Our current implementation includes 16 parallel stacks, one for each thread and each being of size 16, which allows us to write to any stack and read from any other one in parallel.

The leaf processing and priority queue units are presented in greater detail in Figure 8. Leaf points are sorted into one of the 16 parallel priority queues according to their distance to the query point. The queues store the distances as well as pointers to the point data cache. Our implementation uses a fully sorted parallel register bank of length 32 and allows the insertion of one element in a single cycle. Note that this fully sorted bank works well for small queue sizes because of the associated small propagation paths. For larger k, a constant time priority queue [WHA∗07] could be used.
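
A functional C++ model of such a register bank might look as follows; the hardware performs all comparisons and shifts of one insertion in parallel within a single cycle, whereas the sequential loop below only reproduces the result.

#include <limits>

struct BankEntry {
    float distance = std::numeric_limits<float>::infinity();
    int   point_ref = -1;                        // index into the point data cache
};

struct SortedRegisterBank {
    static const int kLength = 32;
    BankEntry entry[kLength];                    // kept sorted by ascending distance

    float max() const { return entry[kLength - 1].distance; }

    void insert(float distance, int point_ref) {
        if (distance >= max()) return;           // farther than everything kept
        int i = kLength - 1;
        while (i > 0 && entry[i - 1].distance > distance) {
            entry[i] = entry[i - 1];             // shift larger entries down by one slot
            --i;
        }
        entry[i] = BankEntry{distance, point_ref};
    }
};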

For ease of implementation, the kd-tree structure is currently stored linearly on-chip in the spirit of a heap, which can also be considered a cache. The maximum tree depth is 14. The internal nodes and the leaves are stored separately; points are associated only with leaf nodes. The 2^14 − 1 internal nodes store the splitting plane (32 bits), the dimension (2 bits), and a flag indicating leaf nodes (1 bit). This additional bit allows us to support unbalanced kd-trees as well. The 2^14 leaf nodes store begin and end pointers to the point buckets in the off-chip DRAM (25 bits each). The total storage requirement of the kd-tree is 170 kBytes.
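
The stated field widths could be rendered as the following C++ sketch; the actual bit packing in on-chip RAM is an implementation detail of the prototype and may differ.

#include <cstdint>

struct InternalNode {                 // 2^14 - 1 internal nodes
    float    split_plane;             // splitting plane position (32 bits)
    uint8_t  split_dim : 2;           // splitting dimension (2 bits)
    uint8_t  leaf_flag : 1;           // flag marking leaf nodes (1 bit), which also
                                      // allows unbalanced kd-trees
};

struct LeafNode {                     // 2^14 leaf nodes
    uint64_t bucket_begin : 25;       // first point of the bucket in off-chip DRAM
    uint64_t bucket_end   : 25;       // last point of the bucket in off-chip DRAM
};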


5.3. Coherent Neighbor Cache Unit

The CNC unit (Figure 5) includes 8 cached neighborhoods and a single coherence check sub-unit, testing for cache hits in an iterative manner. The cache manager unit maintains the list of caches in LRU order, and synchronizes between the processing module and the kd-tree search unit using a multi-reader write lock primitive. For a higher number of caches, the processing time increases linearly. Note, however, that additional coherence test units could be used to reduce the number of iterations.

As the neighbor search unit processes multiple queries in parallel, it is important to carefully align the processing order. Subsequent queries are usually spatially coherent and would eventually be issued concurrently to the neighbor search unit. To prevent this problem, we interleave the queries. In the current system this task is left to the user, which allows the interleaving to be optimally aligned to the nature of the queries.

5.4. Processing Module

The processing module (Figure 6) is composed of three custom units managed by a processing controller. These units communicate through a multi-threaded quad-port bank of 16 registers. The assignment of the four ports to the three custom units is automatically determined by our compiler.

Depending on the processing operator, our compiler might produce a high number of floating point operations, thus leading to significant latency, which is, however, hidden by pipelining and multi-threading. Our implementation allows for a maximum of 128 threads operating in parallel and is able to process one neighbor per clock cycle. In addition, a more fine-grained multi-threading scheme iterating over 8 sub-threads is used to hide the latency of the accumulation in the loop kernel.
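
A software analogue of this latency-hiding scheme is to split the accumulation of the loop kernel into independent partial sums, which breaks the serial dependency chain in the same spirit as the 8 sub-threads; the sketch below is only illustrative.

// Rotating over 8 independent partial accumulators removes the dependency of
// each addition on the previous one; the partial results are combined once at
// the end, analogous to merging the sub-thread results.
float accumulate(const float* values, int count) {
    float partial[8] = {0, 0, 0, 0, 0, 0, 0, 0};
    for (int i = 0; i < count; ++i)
        partial[i % 8] += values[i];
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}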

5.5. Resource Requirements and Extensions

Our architecture was designed using minimal on-chip resources. As a result, the neighbor search module is very lean and uses only a small number of arithmetic units, as summarized in Table 2. The number of arithmetic units in the processing module depends entirely on the processing operator.

Arithmetic unit | kd-tree traversal | CNC | Covariance analysis | SPSS projection
Add/Sub         | 6                 | 6   | 38                  | 29
Mul             | 4                 | 6   | 49                  | 32
Div             | 0                 | 0   | 2                   | 3
Sqrt            | 0                 | 2   | 2                   | 2

Table 2: Usage of arithmetic resources for the two units of our neighbor search module, and two processing examples.

Data              | Current prototype         | Off-chip kd-tree & shared data cache
Thread data       | 1.36 kB (87 B/thread)     | 1.36 kB
Traversal stack   | 2 kB (8 × 16 B/thread)    | 2 kB
kd-tree           | 170 kB (depth: 14)        | 16 kB (cache)
Priority queue    | 3 kB (6 × 32 B/thread)    | 4 kB
DRAM manager      | 5.78 kB                   | 5.78 kB
Point data cache  | 16 kB (p-queue unit)      | 16 kB (shared cache)
Neighbor caches   | 8.15 kB (1044 B/cache)    | 1.15 kB (148 B/cache)
Total             | 206.3 kB                  | 46.3 kB

Table 3: On-chip storage requirements for our current and planned, optimized versions of the neighbor search module.

Their complexity is therefore limited by the resources of the targeted FPGA device. Table 2 shows two such examples.

The prototype has been partitioned into the neighbor search module, integrated on a Virtex2 Pro 100 FPGA, and the processing module, integrated on a Virtex2 Pro 70 FPGA. The neighbor search FPGA uses 23,397 slice flip-flops and 33,799 LUTs. The processing module, for the example of an MLS projection procedure based on plane fits [AA04], requires 31,389 slice flip-flops and 35,016 LUTs.

The amount of on-chip RAM required by the current prototype is summarized in Table 3, leaving out the buffers for PCI transfer and inter-chip communication, which are not relevant for the architecture. This table also includes a planned variant using only one bigger FPGA. Using only one chip, the architecture could then be equipped with generic shared caches to access the external memory; the tree structure would also be stored off-chip, alleviating the current limitation on the maximum tree depth. Furthermore, such caches would make our current point data cache obsolete and reduce the neighbor cache footprint by storing references to the point attributes. Finally, this would not only optimize on-chip RAM usage, but also reduce the memory bandwidth required to access the point data for the leaf processing unit, and hence speed up the overall process.

5.6. GPU Implementation

For comparison, we implemented a kd-tree based kNN search algorithm using NVIDIA's CUDA. Similar to the FPGA implementation, the lightweight stacks and priority queues are stored in local memory. Storing the stacks and priority queues in fast shared memory would limit the number of threads drastically and degrade performance compared to using local memory. The neighbor lists are written to and read from a large buffer stored in global memory. We implemented the subsequent processing algorithms as a second, distinct kernel. Owing to the GPU's SIMD design, implementing a CNC mechanism is not feasible and would only decrease performance.


Figure 9: A series of successive smoothing operations on the Igea model. The model size is 134k points. A neighborhood of size 16 has been used for the MLS projections. Only the MLS software code has been replaced by FPGA driver calls.

Figure 10: A series of successive simulation steps of the 2D breaking dam scenario using SPH. The simulation uses adaptive particle resolutions between 1900 and 4600 particles, and performs kNN queries with up to 30 nearest neighbors. Only the kNN software code has been replaced by FPGA driver calls.

Figure 11: Performance of kNN and MLS projection as a function of the number of neighbors k. The input points have been used as query points.

6. Results and Discussions

To demonstrate the versatility of our architecture, we implemented and analyzed several meshless processing operators. These include a few core operators which are entirely performed on-chip: covariance analysis, iterative MLS projection based on either plane fit [AA04] or spherical fit [GG07], and a meshless adaptation of a non-iterative smoothing operator for surfaces [JDD03]. We integrated these core operators into more complex procedures, such as an MLS based resampling procedure, as well as a normal estimation procedure based on covariance analysis [HDD∗92].

We also integrated our prototype into existing publicly available software packages. For instance, in PointShop 3D [ZPKG02] the numerous MLS calls for smoothing, hole filling, and resampling [WPK∗04] have been replaced by calls to our drivers. See Figure 9 for the smoothing operation. Furthermore, an analysis of a fluid simulation research code [APKG07] based on smoothed particle hydrodynamics (SPH) showed that all the computations involving neighbor search and processing can easily be accelerated, while the host would still be responsible for collision detection and kd-tree updates (Figure 10).

6.1. Performance Analysis

Both FPGAs, a Virtex2 Pro 100 and a Virtex2 Pro 70, operate at a clock frequency of 75 MHz. We compare the performance of our architecture to similar CPU and GPU implementations, optimized for each platform, on a 2.2 GHz Intel Core 2 Duo equipped with an NVIDIA GeForce 8800 GTS GPU. Our results were obtained for a surface data set of 100k points in randomized order and with a dummy operator that simply reads the neighbor attributes. Note that our measurements do not include transfer costs, since our hardware device lacks an efficient transfer interface between the host and the device.

General Performance Analysis

Figure 11 demonstrates the high performance of our design for generic, incoherent kNN queries. The achieved on-chip FPGA query performance is about 68% and 30% of the throughput of the CPU and GPU implementations, respectively, although our FPGA clock rate is 30 times lower than that of the CPU, and it consumes considerably fewer resources.


Figure 12: Number of ball queries per second for an increasing level of coherence.

Figure 13: Number of kNN queries per second for an increasing level of coherence (k = 30). Approximate kNN (AkNN) results were obtained with ε = 0.1.

Moreover, with the MLS projection enabled, our prototype exhibits the same performance as with the kNN queries only, and outperforms the CPU implementation.

Finally, note that when the CNC is disabled our architecture produces fully sorted neighborhoods for free, which is beneficial for a couple of applications. As shown in Figure 11, adding such a sort to our CPU and GPU implementations has a non-negligible impact, in particular for the GPU.

Our FPGA prototype, integrated into the fluid simulation [APKG07], achieved half of the performance of the CPU implementation, because up to 30 neighbors per query have to be read back over the slow PCI transfer interface. In the case of the smoothing operation of PointShop 3D [WPK∗04], our prototype achieved a speedup of a factor of 3, including PCI communication. The reasons for this speedup are twofold: first, for MLS projections, only the projected positions need to be read back. Second, the kd-tree of PointShop 3D is not as highly optimized as the reference CPU implementation used for our other comparisons. Note that with an ASIC implementation and a more advanced PCI Express interface, the performance of our system would be considerably higher.

The major bottleneck of our prototype is the kd-tree traversal, and it did not outperform the processing module for all tested operators. In particular, the off-chip memory accesses in the leaf processing unit represent the limiting bottleneck in the kd-tree traversal. Consequently, the nearest neighbor search does not scale well above 4 threads, although up to 16 threads are supported.

Coherent Neighbor Cache Analysis

In order to evaluate the efficiency of our coherent neighbor cache with respect to the level of coherence, we implemented a resampling algorithm that generates b × b query points for each input point, uniformly spread over its local tangent plane [GGG08]. All results for the CNC were obtained with 8 caches.

The best results were obtained for ball queries (Figure 12), where even an up-sampling pattern of 2 × 2 is enough to save up to 75% of the kd-tree traversals, thereby showing the CNC's ability to significantly speed up the overall computation. Figure 13 depicts the behavior of the CNC with both exact and (1+ε)-approximate kNN. Whereas the cache hit rate remains relatively low for exact kNN (especially with such a large neighborhood), a tolerance of ε = 0.1 already allows saving more than 50% of the kd-tree traversals. For incoherent queries, the CNC results in a slight overhead due to the search of larger neighborhoods. The GPU implementation does not include a CNC, but owing to its SIMD design and texture caches, its performance drops significantly as the level of coherence decreases.

While these results clearly demonstrate the general usefulness of our CNC algorithm, they also show the CNC hardware implementation to be slightly less effective than the CPU-based CNC. The reasons for this behavior are twofold. First, from a theoretical point of view, increasing the number of threads while keeping the same number of caches decreases the cache hit rate. This behavior could be compensated by increasing the number of caches. Second, our current prototype consistently checks all caches, while our CPU implementation stops at the first cache hit. With a high cache hit rate, as in Figure 12, we observed speedups of a factor of 2 using this optimization on the CPU.

An analysis of the (1+ε)-approximate kNN in terms of cache hit rate and relative error can be found in Figures 14 and 15. These results show that already small values of ε are sufficient to significantly increase the percentage of cache hits, while maintaining a very low error for the MLS projection. In fact, the error is of the same order as the numerical error of the MLS projection itself. Even larger tolerance values like ε = 0.5 lead to visually acceptable results, which is due to the weight function of the MLS projection giving low influence to the farthest neighbors.


6.2. GPU Integration

Our results show that the GPU implementation of kNN search is only slightly faster than our current FPGA prototype and CPU implementation. Moreover, with an MLS projection operator on top of a kNN search, we observe a projection rate between 0.4M and 3.4M projections per second, while the same hardware is able to perform up to 100M projections per second using precomputed neighborhoods [GGG08]. In fact, the kNN search consumes more than 97% of the computation time.

This poor performance is partly due to the divergence in the tree traversal, but even more importantly due to the priority queue insertion in O(log k), which leads to many incoherent execution paths. On the other hand, our design optimally parallelizes the different steps of the tree traversal and allows the insertion of one neighbor into the priority queue in a single cycle.

These results motivate the integration of our lightweight neighbor search module into such a massively multi-core architecture. Indeed, a dedicated ASIC implementation of our module could be further optimized to run at a much higher frequency, improving the performance by more than an order of magnitude. Such an integration could be done in a similar manner to the dedicated texturing units of current GPUs.

In such a context, our processing module would then be replaced by more generic computing units. Nevertheless, we emphasize that the processing module still exhibits several advantages. First, it allows optimal use of FPGA devices as co-processors to CPUs or GPUs, which can be expected to become more and more common in the upcoming years. Second, unlike the SIMD design of the GPU's microprocessors, our custom design with three sub-kernels allows for optimal throughput, even in the case of varying neighbor loops.

7. Conclusion

We presented a novel hardware architecture for efficient nearest-neighbor searches and generic meshless processing operators. In particular, our kd-tree based neighbor search module features a novel, dedicated caching mechanism exploiting the spatial coherence of the queries. Our results show that neighbor searches can be accelerated efficiently by identifying independent parts in terms of memory access. Our architecture is implemented in a fully pipelined and multi-threaded manner, and its lightweight design suggests that it could be easily integrated into existing computing or graphics architectures, and hence be used to speed up applications that depend heavily on data structure operations. When scaled to realistic clock rates, our implementation achieves speedups of an order of magnitude compared to reference implementations. Our experimental results demonstrate the high benefit of dedicated neighbor search hardware.

Figure 14: Error versus efficiency of the (1+ε)-approximate k-nearest neighbor mechanism as a function of the tolerance value ε. The error corresponds to the average distance for a model of unit size. The percentages of cache hits were obtained with 8 caches and patterns of 8×8.

Figure 15: MLS reconstructions with, from left to right, exact kNN (34 s), and AkNN with ε = 0.2 (12 s) and ε = 0.5 (10 s). Without our CNC the reconstruction takes 54 s.

A current limitation of the design is its off-chip tree construction. An extended architecture could construct or update the tree on-chip to avoid expensive host communication. We would also like to investigate configurable spatial data structure processors in order to support a wider range of data structures and applications.

References

[AA04] ALEXA M., ADAMSON A.: On normals and projection operators for surfaces defined by point sets. In Symposium on Point-Based Graphics (2004), pp. 149–155.

[ABCO∗01] ALEXA M., BEHR J., COHEN-OR D., FLEISHMAN S., LEVIN D., SILVA C. T.: Point set surfaces. In IEEE Visualization (2001), pp. 21–28.

[APKG07] ADAMS B., PAULY M., KEISER R., GUIBAS L. J.: Adaptively sampled particle fluids. ACM Transactions on Graphics 26, 3 (2007), 48:1–48:7.

[Ben75] BENTLEY J. L.: Multidimensional binary search trees used for associative searching. Communications of the ACM 18, 9 (1975), 509–517.

[FB74] FINKEL R. A., BENTLEY J. L.: Quad trees: A data structure for retrieval on composite keys. Acta Informatica 4, 1 (1974), 1–9.

[FBF77] FRIEDMAN J. H., BENTLEY J. L., FINKEL R. A.: An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software 3, 3 (1977), 209–226.

[Fil79] FILHO Y. V. S.: Average case analysis of region search in balanced k-d trees. Information Processing Letters 8, 5 (1979).

[FKN80] FUCHS H., KEDEM Z. M., NAYLOR B. F.: On visible surface generation by a priori tree structures. In Computer Graphics (Proceedings of SIGGRAPH) (1980), pp. 124–133.

[FS05] FOLEY T., SUGERMAN J.: Kd-tree acceleration structures for a GPU raytracer. In Graphics Hardware (2005), pp. 15–22.

[GG07] GUENNEBAUD G., GROSS M.: Algebraic point set surfaces. ACM Transactions on Graphics 26, 3 (2007), 23:1–23:9.

[GGG08] GUENNEBAUD G., GERMANN M., GROSS M.: Dynamic sampling and rendering of algebraic point set surfaces. Computer Graphics Forum 27, 2 (2008), 653–662.

[GP07] GROSS M., PFISTER H.: Point-Based Graphics. The Morgan Kaufmann Series in Computer Graphics. Morgan Kaufmann Publishers, 2007.

[GPSS07] GÜNTHER J., POPOV S., SEIDEL H.-P., SLUSALLEK P.: Realtime ray tracing on GPU with BVH-based packet traversal. In Symposium on Interactive Ray Tracing (2007), pp. 113–118.

[HDD∗92] HOPPE H., DEROSE T., DUCHAMP T., MCDONALD J., STUETZLE W.: Surface reconstruction from unorganized points. In Computer Graphics (Proceedings of SIGGRAPH) (1992), pp. 71–78.

[IMRV97] INDYK P., MOTWANI R., RAGHAVAN P., VEMPALA S.: Locality-preserving hashing in multidimensional spaces. In Symposium on Theory of Computing (1997), pp. 618–625.

[Int07] INTEL: Intel QuickAssist technology accelerator abstraction layer white paper. In Platform-level Services for Accelerators, Intel Whitepaper (2007).

[JDD03] JONES T. R., DURAND F., DESBRUN M.: Non-iterative, feature-preserving mesh smoothing. ACM Transactions on Graphics 22, 3 (2003), 943–949.

[Lev01] LEVIN D.: Mesh-independent surface interpolation. In Advances in Computational Mathematics (2001), pp. 37–49.

[MM02] MA V. C. H., MCCOOL M. D.: Low latency photon mapping using block hashing. In Graphics Hardware (2002), pp. 89–99.

[NHS84] NIEVERGELT J., HINTERBERGER H., SEVCIK K. C.: The grid file: An adaptable, symmetric multikey file structure. ACM Transactions on Database Systems 9, 1 (1984), 38–71.

[OvL80] OVERMARS M., VAN LEEUWEN J.: Dynamic multi-dimensional data structures based on quad- and k-d trees. Tech. Rep. RUU-CS-80-02, Institute of Information and Computing Sciences, Utrecht University, 1980.

[PDC∗03] PURCELL T. J., DONNER C., CAMMARANO M., JENSEN H. W., HANRAHAN P.: Photon mapping on programmable graphics hardware. In Graphics Hardware (2003), pp. 41–50.

[PGSS07] POPOV S., GÜNTHER J., SEIDEL H.-P., SLUSALLEK P.: Stackless kd-tree traversal for high performance GPU ray tracing. Computer Graphics Forum 26, 3 (2007), 415–424.

[Rob81] ROBINSON J. T.: The K-D-B-tree: A search structure for large multidimensional dynamic indexes. In International Conference on Management of Data (1981), pp. 10–18.

[Sam06] SAMET H.: Multidimensional point data. In Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006, pp. 1–190.

[Tan06] TANURHAN Y.: Processors and FPGAs quo vadis. IEEE Computer Magazine 39, 11 (2006), 106–108.

[Vah07] VAHID F.: It's time to stop calling circuits hardware. IEEE Computer Magazine 40, 9 (2007), 106–108.

[WBWS01] WALD I., BENTHIN C., WAGNER M., SLUSALLEK P.: Interactive rendering with coherent ray tracing. Computer Graphics Forum 20, 3 (2001), 153–164.

[WHA∗07] WEYRICH T., HEINZLE S., AILA T., FASNACHT D., OETIKER S., BOTSCH M., FLAIG C., MALL S., ROHRER K., FELBER N., KAESLIN H., GROSS M.: A hardware architecture for surface splatting. ACM Transactions on Graphics 26, 3 (2007), 90:1–90:11.

[WMS06] WOOP S., MARMITT G., SLUSALLEK P.: B-kd trees for hardware accelerated ray tracing of dynamic scenes. In Graphics Hardware (2006), pp. 67–77.

[WPK∗04] WEYRICH T., PAULY M., KEISER R., HEINZLE S., SCANDELLA S., GROSS M.: Post-processing of scanned 3D surface data. In Symposium on Point-Based Graphics (2004), pp. 85–94.

[WSS05] WOOP S., SCHMITTLER J., SLUSALLEK P.: RPU: A programmable ray processing unit for realtime ray tracing. ACM Transactions on Graphics 24, 3 (2005), 434–444.

[ZPKG02] ZWICKER M., PAULY M., KNOLL O., GROSS M.: Pointshop 3D: An interactive system for point-based surface editing. ACM Transactions on Graphics 21, 3 (2002), 322–329.

© The Eurographics Association 2008.