A Dynamic Load-Balancing Parallel Search for Enumerative Robot Path Planning




Babak Taati & Michael Greenspan & Kamal Gupta

Received: 5 July 2006 / Accepted: 8 July 2006 / Published online: 26 September 2006
© Springer Science + Business Media B.V. 2006

Abstract We present a parallel formulation for enumerative search in high dimensional spaces and apply it to planning paths for a 6-dof manipulator robot. Participating processors perform local A* search towards the goal configuration. To exploit all the processors at their maximum capacity at all times, a dynamic load-balancing scheme matches idle and busy processors for load transfer. For comparison purposes, we have also implemented an existing parallel static load-balancing formulation based on regular domain decomposition. Both methods achieved almost linear speed-up in our experiments. The two methods follow different search strategies in parallel and the implementation of the existing method (with tuned space decomposition) was more time efficient on average. However, the planning time of that method is highly dependent on the distribution of the search space among the processors and its tuned decomposition varies for different obstacle placements. Empirical selection of the space decomposition parameters for the existing method does not guarantee minimal planning time in all environments and leads to slower planning than our dynamic load-balancing method in some cases. The performance of the developed dynamic method is independent of the obstacle placements and the method can achieve consistent speed-up in all environments.

Key words dynamic load sharing · parallel robot path planning

J Intell Robot Syst (2006) 47: 55–85
DOI 10.1007/s10846-006-9067-z

B. Taati (*) · M. Greenspan
Department of Electrical and Computer Engineering, Queen's University, Kingston, Ontario, Canada
e-mail: [email protected]

M. Greenspan
School of Computing, Queen's University, Kingston, Ontario, Canada

K. Gupta
School of Engineering Science, Simon Fraser University, Burnaby, BC, Canada

1 Introduction

The basic robot path planning problem is to find a collision-free path for the robot, from an initial 'start' configuration to a desired 'goal' configuration, in an environment with obstacles. A path is defined as a connected set of configurations, and a collision-free path implies that the robot is not in collision with surrounding obstacles or with itself in any of these configurations. The path planning problem is an essential component of autonomous robotics and has applications in autonomous vehicles, flexible manufacturing, tele-operation, medical surgery, space robotics, molecular biology, computer graphics, and game development [21]. In the industrial robotics field, much of the current research focuses on achieving efficient planning for robots that are now widely in use, which typically have around 5–10 (often 6) degrees of freedom. In this research, we have focused on deterministic path planning algorithms, with the aim of achieving better planning time for industrial robots using parallel processing.

The notion of configuration space (C-space) [22, 25] is widely used in robotics because it equates the path planning problem to a search in a d-dimensional C-space. The workspace of a robot is the physical three-dimensional (or two-dimensional for planar robots) environment that the robot moves within. The configuration space of a d degree-of-freedom (dof) robot is a d-dimensional space, where each dimension corresponds to a dof. Each configuration specifies the complete whole-arm geometry of the robot relative to a fixed coordinate system. The collision-causing configurations form the subspace of the C-space known as the C-Obstacles, and the remainder of the C-space is called C-Free and is collision-free. Figure 1 illustrates the physical and configuration space for a simple 2-dof planar robot. Path planning is equivalent to finding a continuous curve, the path, in the configuration space from the start configuration to the goal configuration such that all the points on the path lie within C-Free.

Figure 1 The physical and configuration space for a 2-dof planar robot with a bar as an obstacle


For a 6-dof robot, the C-space is six-dimensional, and it is the relatively high dimensionality of the search space that is the main reason behind the long search times of planners. For instance, if each joint can rotate 200° and its motion is discretized into relatively coarse steps of 2°, the search space will contain 10^12 cells. It is well known that path planning is a hard problem (finding the shortest path in 3-D is NP-hard [21]) and the worst case time complexity of the complete deterministic algorithms (i.e., those that are guaranteed to find a collision-free path if it exists) is exponential in the number of degrees of freedom of the robot and polynomial in the geometric complexity of its environment (e.g., the number of polyhedra or geometric primitives with which the environment can be modeled). Although a great deal of progress has been made in developing practical path planning methods [13], most of the current deterministic path planners are relatively slow for 6-dof industrial robots and unsuitable for industry applications, where near real-time planning is required. Although probabilistic methods have achieved significant reductions in planning times, these methods are not complete, and they may occasionally fail to find a path even if one exists [18]. Proposed methods for improving the planning time can be categorized into three main branches: use of heuristics, performing pre-computation, and parallel processing, with parallel processing being the least explored of these three alternatives. A brief primer on parallel processing and some of its terminology is presented in the appendix.
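The cell count above follows directly from the discretization: 200°/2° gives 100 grid positions per joint, and six joints give 100^6 cells. A quick arithmetic check, using the numbers from the text:

```python
# Search-space size for the 6-dof example above:
# each joint sweeps 200 degrees in coarse 2-degree steps.
joint_range_deg = 200
step_deg = 2
dof = 6

positions_per_joint = joint_range_deg // step_deg  # 100 positions per joint
total_cells = positions_per_joint ** dof           # 100^6 grid cells

print(total_cells)  # 1000000000000, i.e., 10^12
```

Even at this coarse resolution, enumerating the full grid up front is clearly infeasible, which motivates the implicit C-space treatment discussed later.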

The focus of this work is on implementing a parallel formulation for speeding up a deterministic enumerative path planning algorithm. The performance of an existing parallel static load-balancing method (PSLB) [15], which essentially partitions the C-space into several hypercubes and assigns them to different processors, is highly dependent on the shape and position of the obstacles and the hypercube size. Empirical selection of the size does not necessarily guarantee a balanced load among the processors and optimal time performance.

We have developed a dynamic load-balancing scheme, the performance of which is relatively independent of the location and size of the obstacles within the environment. This does not mean that the suggested method can solve all planning problems in equal time, but rather that it can achieve consistent speed-up for problems of various levels of difficulty. The dynamic load-balancing method does not assign specific portions of the C-space to any processor in advance. Instead, it exploits a dynamic load assignment technique that tries to balance the processors' workloads during execution based on demand. Unlike many parallel graph search techniques [1], the developed method does not require explicit representation of the graph and therefore can be applied to the robot path planning problem, where only an implicit graph is available due to the huge size of the robot's C-space.

The dynamic method is inspired by a technique originally developed by Kumar and Rao [20] for parallel depth-first tree search under a shared-memory model. A number of modifications to the original method were needed to adapt it to the path planning problem, which follows a grid search rather than a tree search. Modifications were also required to allow the method to perform efficiently on a distributed-memory cluster computer, which is a more common form of parallel computer than the shared-memory model for which it was initially designed. This method has been applied to planning paths for a 6-dof manipulator robot, and extensive tests with up to 16 processors have shown that it maintains a highly balanced load among the processors and achieves near linear speed-up.

A possible argument against the use of parallel processing is that it only improves the performance linearly while the complexity of deterministic path planning is exponential in the number of degrees of freedom. While this limits the potential improvement of such techniques for much higher dof robots, or for applications such as molecular biology where very high (20–50) dof cases occur, the performance enhancement for industrial robots


(4 to 10 dof) can be significant. In particular, the primary goal of our research has been to achieve fast planning times, ideally on the order of a few seconds, for the 6-dof robots that are widely in use in industry. If such a result is realized, the parallel planner can be practical for many of the robots used in industry today.

The remainder of this paper is organized as follows: Section 2 briefly reviews different classes of planning algorithms and an existing parallel formulation from each class. The domain decomposition parallel formulation of Henrich et al. [15] for enumerative path planning is explained in Section 3. Section 4 explains the dynamic load-balancing scheme for tree search developed by Kumar and Rao [20] and our adaptation of it for robot path planning. Section 5 presents a summary of our experimental results, which compare the two parallel formulations' planning time, speed-up, and load balance. We conclude the paper in Section 6 with suggestions for future research and enhancement of the dynamic method, following which the appendix presents a brief primer on parallel processing and different load-balancing approaches.

2 Previous Work

Latombe [22], Gupta and Del Pobil [13], and more recently Choset et al. [6] and LaValle [23] provide in-depth analyses of the path planning problem and some of the major existing planners. Planners are generally categorized into three main classes: roadmap approaches, cell decomposition approaches, and potential field based methods. (See [21] for an alternative categorization.) This section briefly reviews these categories and a few of the available parallel formulations. Henrich [14] provides a more detailed review of some of the existing parallel planners.

2.1 Roadmap Based Methods

Roadmap based methods are used mostly to ease the computational burden of the search in high dimensional C-spaces. Most of these methods try to build a subset of the C-space in the form of a connected graph, the roadmap, and perform the search in that graph instead of the entire C-space.

A class of sampling techniques, such as Ariadne's clew algorithm [26], probabilistic roadmaps [18], and rapidly exploring random trees [23, 24], has been particularly successful in efficiently capturing the connectivity of C-Free through a roadmap graph and performing queries on that graph instead of the entire C-space.

The Probabilistic Roadmap Method (PRM) by Kavraki et al. [18] has gained special attention partly because of its ability to perform fast multiple queries. PRM enables efficient planning in stationary environments by placing random landmarks in the search space and connecting them with a local planner (learning phase) and then performing searches on the constructed graph (the roadmap) rather than the entire workspace (query phase).
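The two PRM phases described above can be sketched as follows. This is an illustrative toy for a point robot in a unit-square C-space, not the authors' (or Kavraki et al.'s) implementation: the sample count, the k-nearest connection rule, and the straight-line local planner are all assumptions made for the sketch.

```python
import math
import random

def segment_free(a, b, in_collision, steps=20):
    """Local planner: accept the straight segment a-b if evenly spaced
    sample points along it are all collision-free."""
    return not any(
        in_collision((a[0] + (b[0] - a[0]) * t / steps,
                      a[1] + (b[1] - a[1]) * t / steps))
        for t in range(steps + 1))

def prm_learning(in_collision, n_samples=100, k=5, seed=0):
    """Learning phase: sample random free configurations (landmarks) and
    link each to its k nearest neighbors when the local planner succeeds."""
    rng = random.Random(seed)
    nodes = []
    while len(nodes) < n_samples:
        q = (rng.random(), rng.random())       # unit-square C-space
        if not in_collision(q):
            nodes.append(q)
    edges = {i: set() for i in range(n_samples)}
    for i, q in enumerate(nodes):
        near = sorted(range(n_samples),
                      key=lambda j: math.dist(q, nodes[j]))[1:k + 1]
        for j in near:
            if segment_free(q, nodes[j], in_collision):
                edges[i].add(j)
                edges[j].add(i)
    return nodes, edges

def prm_query(edges, start_idx, goal_idx):
    """Query phase: breadth-first search on the roadmap graph only."""
    frontier, parent = [start_idx], {start_idx: None}
    while frontier:
        i = frontier.pop(0)
        if i == goal_idx:
            path = []
            while i is not None:
                path.append(i)
                i = parent[i]
            return path[::-1]
        for j in edges[i]:
            if j not in parent:
                parent[j] = i
                frontier.append(j)
    return None  # may fail even when a path exists: only probabilistically complete
```

Once the (expensive) learning phase has run, any number of queries can be answered on the same roadmap as long as the environment does not change.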

Note, however, that PRM is not a complete algorithm, only a probabilistically complete one. It may fail even if a collision-free path exists, particularly when the path needs to pass through narrow regions in the C-space. Moreover, the run-times and the paths found may vary from one run to another due to the pseudo-random nature of the algorithm, leading to great variability in results.

The most time consuming part of planning a path in PRM is the learning phase and, as long as the environment is quiescent, several (relatively) fast queries can be performed after


the roadmap is built. This has led Amato and Dale [2] to develop a (trivial and scalable) parallel formulation for PRM that only parallelizes the learning phase. They have reported consistent speed-up (with an average efficiency of 60%) for up to 16 processors on a shared-memory cluster for a 3-dof mobile robot.

2.2 Cell Decomposition Methods

Cell decomposition methods discretize the C-space onto a grid network of connected cells that are each free or blocked. Free cells are subsets of C-Free, and blocked cells have at least one configuration in them that puts the robot in collision. The connected grid formed by the free cells captures the connectivity of C-Free. This grid can be searched to find a path from the cell that contains the start configuration to the cell that contains the goal configuration. These methods are resolution complete, i.e., they will find a collision-free path if one exists within the resolution of the grid. For most industrial robots the size of the C-space is very large (generally six-dimensional) and checking all the cells for collision prior to the search is impractically slow. In these cases, the cells are checked for collision on an 'as needed' basis as the search proceeds (implicit C-space). Searching such a large C-space is also a highly time consuming task. With the current computational power of standard CPUs, these cell-decomposition based algorithms are impractical for robots with more than 4 dofs [13].
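The 'as needed' classification of cells can be realized with a memoized lookup, so each cell's expensive geometric collision test runs at most once, however often the search revisits the cell. A minimal sketch; the collision predicate passed in is a placeholder for a real robot/obstacle intersection test:

```python
class ImplicitCSpace:
    """Grid cells are classified lazily: a cell's collision status is
    computed on first access and cached, so the full C-space is never
    enumerated up front."""

    def __init__(self, robot_in_collision_at):
        self._check = robot_in_collision_at  # expensive geometric test
        self._cache = {}                     # cell tuple -> free/blocked
        self.checks_run = 0                  # how many real tests were done

    def is_free(self, cell):
        if cell not in self._cache:
            self.checks_run += 1
            self._cache[cell] = not self._check(cell)
        return self._cache[cell]
```

Only cells actually touched by the search front ever pay the collision-check cost, which is what makes grid search feasible in a 10^12-cell space.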

Finding a path depends highly on the resolution of the grid: a coarse discretization can reduce the chance of finding a path, but it has the benefit of reducing the computational effort of the search. To exploit the desired features of both coarse and fine resolutions (the speed of the coarse resolution and the quality of the fine resolution), some researchers have suggested hierarchical searches [5, 12], which start with a coarse resolution and gradually refine it.

Parallelizing cell decomposition methods, and specifically exhaustive enumerative searches, is the focus of this research. Section 3 reviews PSLB [15] (an existing parallel formulation) and Section 4 offers a novel dynamic load-balancing scheme. In general, unlike the probabilistic methods where trivial parallelization often works efficiently, parallel cell decomposition methods seem to require continuous communication and load-balancing between processing units during the search.

As mentioned in the introduction, parallel implementations would at best achieve a linear speed-up (in the number of processors) over their sequential counterpart. As a result, for high dof problems (e.g., 50-dof), parallel implementations of deterministic algorithms may not be the answer. However, for applications in industrial robotics, where a large majority of robots are ~6-dof, parallel implementations of deterministic algorithms (as we show in this paper) are successful in reducing the planning time to a few seconds, thus making it desirable to use them.

Using deterministic algorithms has certain advantages. One is that different runs of the algorithm on the same scenario will yield the same run-time and the same path, whereas probabilistic algorithms can vary greatly in their run-times and in the actual path found, even for the same scenario. In addition, exhaustive enumeration techniques are resolution complete. That is, for a given robot at a given resolution, they can either find a path or determine that no path exists at that resolution. Since industrial robots often have specific known joint resolutions, the search resolution can be set to their joint resolution if required. With parallel implementations, if we can reduce this worst case planning time to acceptable limits (for robots with 6 or 7 dofs), then these planners can be used in practice with guaranteed run-time performance.


Various parallel formulations have been offered for graph search algorithms. However, most of these formulations are not applicable to path planning, for different reasons. For instance, many of these formulations require explicit knowledge about the graph [1] and cannot be applied to path planning where only the implicit C-space is available. These formulations can be applied to planning paths for low-dof robots (where calculating the explicit C-space is possible) but are not suitable for planning for high-dof robots. Some other formulations [20] are only valid (or efficient) for searching tree structures. Qin and Henrich [29] offer a non-deterministic approach for parallelizing a grid-based planner based on random placements of subgoals in the C-space. Henrich [14] provides a review of parallel cell decomposition planners.

2.3 Potential Field Methods

Potential field methods ([3, 19]) are local and generally incomplete methods that use some sort of information (e.g., the location of the obstacles and the goal configuration) to construct an artificial field and let the robot be affected by the artificial forces caused by the field. Challou et al. [4] have offered a trivial parallel implementation of [3] and have tested it with a 6-dof robot on up to 512 processors.

3 Parallel Static Load-Balancing

Henrich et al. [15] have applied a regular domain decomposition parallel grid search to path planning. The 6-D configuration space is partitioned into several hypercubes, and each hypercube is assigned to a processor, which performs a local A* [27] search within it. Since the number of hypercubes is higher than the number of processors, each processor is assigned multiple hypercubes. To increase the probability of having a balanced load, hypercubes are cyclically assigned to processors; that is, neighboring hypercubes are assigned to different processors. The term static refers to the fact that the mapping of the C-space to the processors is performed once at initialization and does not change during the search. That is, a specific portion of the space is assigned to a processor for the entire duration of the planning.

At the beginning, all the processors' search queues are empty except for that of the processor that has the start configuration. The processor with the start configuration expands this cell (i.e., explores its neighboring cells) and continues the A* search towards the goal. As soon as the search reaches the border of another hypercube, instead of putting the newly encountered cells in its local search queue, the processor sends them to the processor that owns that neighboring hypercube. The search then continues along parallel fronts in different processors.
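The routing decision at a hypercube boundary reduces to an ownership function. The sketch below assumes an axis-aligned decomposition with equal hypercube side `h` and a round-robin (cyclic) mapping of hypercubes to processors; the function and parameter names are ours, for illustration, not taken from [15]:

```python
def hypercube_index(cell, h, cubes_per_axis):
    """Which hypercube (in mixed-radix order) contains this grid cell."""
    idx = 0
    for coord, n in zip(cell, cubes_per_axis):
        idx = idx * n + (coord // h)
    return idx

def owner(cell, h, cubes_per_axis, num_procs):
    """Cyclic (round-robin) assignment: consecutive hypercube indices go
    to different processors, which tends to spread difficult regions."""
    return hypercube_index(cell, h, cubes_per_axis) % num_procs

def route(cell, my_rank, h, cubes_per_axis, num_procs, local_queue, send):
    """Enqueue a newly expanded cell locally, or ship it across the
    hypercube boundary to the processor that owns it."""
    dest = owner(cell, h, cubes_per_axis, num_procs)
    if dest == my_rank:
        local_queue.append(cell)
    else:
        send(dest, cell)
```

In a real planner, `send` would be a message-passing call and the receiver would merge incoming cells into its local A* OPEN queue.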

The first processor to reach the goal configuration announces its success to all the other processors. The path is likely divided among the searching processors and needs to be retrieved before the search ends successfully. Alternatively, if all the processors' search queues become empty and the goal was not found, then the algorithm ends in failure without having found a path to the goal.

3.1 Efficiency of the Parallel Formulation

Since the hypercubes are selected such that there is no overlap between them, there is no work duplication among the processors and the parallel formulation is therefore always maximally orthogonal.


Each hypercube is assigned to a particular processor for the entire duration of the search. The load distribution is therefore highly dependent on the initial mapping of the hypercubes to the processors, as well as the shape and location of the obstacles and the locations of the start and goal configurations. The load tends to be more evenly distributed when the difficult portions of the C-space are divided among many of the processors. However, if all the obstacles happen to lie within the hypercubes belonging to only one or a small number of processors, then the rest of the processors may not contribute to the solution and the speedup could be minimal.

The size of the hypercubes, as well as the method of assigning them to the processors, has a great effect on the load distribution. Henrich et al. have chosen to decompose the search space into many hypercubes (more than the number of processors) and cyclically assign them to the processors. This way, each processor has more than one (preferably non-adjacent) hypercube assigned to it. Intuitively, small hypercubes result in a higher number of messages but provide a more balanced load, whereas larger hypercubes reduce the amount of interprocessor communication but can cause an imbalanced load distribution.

Messages are passed when a local search reaches the boundary between two hypercubes. The communication load is therefore proportional to the surface size of the boundaries between the hypercubes, and we use the term area to denote the measure of the hypercube surface. This surface area rises as the size of the hypercubes decreases, increasing the number of messages accordingly. Conversely, if the hypercubes are large, then a relatively low number of messages will be required. However, as stated in the previous section, large hypercubes result in a load imbalance and reduce the performance of the search. The size of the hypercubes is therefore a major factor in the trade-off between the communication load and the work distribution. The optimum point for this trade-off depends on the particular space being searched. The size that leads to the lowest planning time on average (based on a set of experiments) is herein referred to as the tuned hypercube size.
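The trade-off can be made concrete. If a d-dimensional grid of side S is cut into hypercubes of side s, there are (S/s)^d cubes, each with surface area 2d·s^(d-1), so the total boundary area, and with it the potential message traffic, scales as 2d·S^d/s: halving the hypercube side doubles it. An illustrative check of that scaling (the specific numbers are ours, not from [15]):

```python
def total_boundary_area(S, s, d):
    """Total surface area of all hypercube faces when a d-dimensional grid
    of side S is decomposed into cubes of side s (S divisible by s).
    Equals 2*d*S**d // s, i.e., inversely proportional to the cube side."""
    num_cubes = (S // s) ** d
    area_per_cube = 2 * d * s ** (d - 1)
    return num_cubes * area_per_cube

# Illustrative 6-D space of side 120: halving s doubles the boundary area.
for s in (60, 30, 15):
    print(s, total_boundary_area(120, s, 6))
```

(This simple count includes exterior faces and counts shared interior faces once per adjacent cube, but the inverse-proportionality to the cube side is exactly the trade-off described above.)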

3.2 Decomposition Effect on the Planning Time

Decomposing the C-space into hypercubes changes the nature of the progression of the search. Although each processor performs a local A* search in the hypercubes assigned to it, the resulting search can vary greatly from the pure sequential A*.

Figure 2 illustrates this difference in a two dimensional search. The obstacle is shown in black and the explored cells are marked. The start configuration is the shaded cell on the left side and the goal configuration is the dark cell on the right side of the obstacle. In (a), the sequential A* search needs to explore a large area behind the obstacle before finding a path around it. In PSLB, as shown in (b), the space is partitioned into 12 hypercubes that are cyclically assigned to 4 processors (shown in different shades of gray). Even though each processor follows a local A* search, the visited cells differ from those of the sequential search. In this example the parallel version explores fewer cells than the sequential search, and simulating the same space decomposition and hypercube assignment on a single processor results in a speed-up over the pure sequential A*. Note that the found path is no longer guaranteed to be optimal, as it is with the sequential A*.

In our experiments, when this parallel search was simulated on a single processor it achieved faster planning than the pure sequential search in most cases. It was also observed that this effect was maximal for scenarios where the robot had to find detours around obstacles or find its way out of traps. For example, in environment H_CASE_B (illustrated in Figure 7), the planning time of the sequential search was 13.3 times longer than that of the


search with two processors and decomposed C-space. The effect caused relatively less speed-up in scenarios where the robot had to find its way through narrow passages. For instance, in environment BOOKSHELF_B (Figure 7), where the robot had to find its way through a narrow passage, the sequential search took only 1.9 times longer than the two-processor search (i.e., the efficiency was lower than 100% due to parallelization overhead).

These speed-ups and slow-downs can be explained by the fact that the overall search strategy in PSLB is similar to a two-level hierarchical search. Figure 3 illustrates this similarity in a 2D example where decomposing the C-space into smaller hypercubes leads to a drastic speed-up over the pure sequential search. The figure depicts a case where a pure sequential A* search gets trapped inside a U-shaped obstacle but the parallel search running on a single processor can easily find a path around the obstacle when the search space is partitioned into six hypercubes. In the figure, the U-shaped obstacle is shown in black. The serial search (a) must exhaust the entire trap (the shaded area) before finding a path around the obstacle. The dashed line is a straight line connecting the start (S) and goal (G) configurations and the solid line is the optimal path, i.e., minimum length, to the goal. Note that the search time is proportional to the shaded area, and in higher dimensional cases the trap areas can become very large and sequentially determining a path can be computationally very expensive. In the parallel case (b) the search space is decomposed

Figure 3 The two-level hierarchy caused by space decomposition can avoid traps

Figure 2 Sequential A* and its parallel version using PSLB lead to different parts of C-space being searched


into six hypercubes that are assigned to processors P1 to P6. Processor P5 starts the search towards the goal and passes a cell to processor P3. P3 needs to search the entire trap (the gray area) before it can find a path around it. However, while P3 is expanding cells inside the trap, P5 continues to expand its cells and passes some cells to processor P6, where a straight line to the goal is collision-free. The search time is much smaller than in the serial case and is almost proportional to the length of the found path.

Decomposition of the C-space into hypercubes resembles a two-level hierarchical search (similar to [5]). If a path exists through the coarse resolution of the hypercubes, and if the hypercube assignment is such that the processors that this path encounters are not busy with other parts of the search space, then the search can find a solution very quickly without exhaustively visiting all cells in the individual hypercubes.

Figure 3 also illustrates two important properties of the speed-up caused by the decomposition of the search space. The first is that this kind of speed-up can occur even if all the obstacles happen to lie in one hypercube, or in hypercubes belonging to one processor. Intuitively, it might seem that in these cases other processors might not be able to contribute to the search and the one processor which contains the obstacle has to perform all the difficult, time consuming calculations. However, as the figure depicts, an alternative path may exist that includes cells from other hypercubes. Other processors can therefore contribute to the search and can find an alternate path, which is different from the optimal (i.e., minimum length) path.

The second property is that the path does not necessarily have the optimal (i.e., minimal) length. In the parallel case, even if the admissibility criteria of the A* search are satisfied, the search might find a sub-optimal path. Fortunately, in many path planning applications, fast planning has a higher priority than finding an optimal length path.

4 Dynamic Load-Balancing

The dynamic load-balancing scheme is based on the parallel depth-first tree search method of Kumar and Rao [20]. In that method, each processor performs a depth-first search locally in its respective search tree. If a processor finishes searching its tree and has not found the goal, it then requests additional work from other processors. Initially, one processor is assigned the start configuration and has the entire graph to search while all other processors are idle. The remaining processors request work and the search space is divided equally among them.

Procedure Search()   // at processor i
{
    while (not terminated)
    {
        if (stack[i] == empty) GetWork();
        while (stack[i] ≠ empty)
        {
            DFS(stack[i]);
            GetWork();
        }
        TerminationTest();
    }
}

Procedure GetWork()
{
    for (j = 0 to noOfTries)
    {
        target = (target + 1) mod N;
        if (stack[target] not empty)
        {
            Lock stack[target];
            Get work from target;
            Unlock stack[target];
            Return;
        }
    }
}

Figure 4 The pseudo-code for Kumar and Rao's parallel depth-first search for trees and its GetWork procedure


Every time an idle processor requests additional work from another processor, the requested processor shares some of its work with the idle processor by splitting its search space into two equal halves and giving one half to the requesting processor. The pseudo-code for the search (and the GetWork routine) is shown in Figure 4 (adapted from [20]). The DFS() function performs depth-first search until the search stack is empty. Figure 5 illustrates load sharing between two processors participating in a tree search.

Load-balancing in our method is based on the Request & Share concept of Kumar and Rao, which tries to keep the work distribution among the processors as evenly balanced as possible at all times. Cell decomposition based path planning requires a grid search; therefore, some modifications were required before using Kumar and Rao's method for path planning. Furthermore, the original method performs best on shared-memory clusters, and our method is designed to perform on distributed-memory networks. There was also a need for a mechanism for maintaining orthogonality, to eliminate (or reduce) the amount of overlap between different processors' search spaces. This section explains the main issues regarding the adaptation of Kumar and Rao's method to path planning.
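The Request & Share step can be sketched as a stack split. Handing over alternate entries (rather than the top or bottom half) is one plausible policy for giving the idle processor cells drawn from all depths of the donor's frontier; this particular splitting rule is our illustration, not necessarily the exact policy of [20] or of our planner:

```python
def split_workload(donor_stack):
    """Split a donor's open-cell stack into two roughly equal halves:
    the donor keeps every other cell, and the rest are shipped to the
    requesting (idle) processor."""
    kept = donor_stack[::2]      # donor retains these
    shipped = donor_stack[1::2]  # sent to the idle processor
    return kept, shipped
```

In a distributed-memory setting, `shipped` would be serialized into a message to the requesting processor rather than handed over through a locked shared stack.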

4.1 Search Algorithm: A*

Kumar and Rao chose a depth-first search partly because of its low memory requirements. Although memory efficiency is valuable, time efficiency is of greater concern here.

A* [27] is chosen as the heart of our search algorithm because of its effective use of heuristics. This algorithm assigns a value function (g + w·h) to cells in the grid based on their distance from the start configuration (g) and their estimated distance to the goal configuration (h). These values are calculated on an as-needed basis, i.e., only for cells in the OPEN priority queue of the A* search.
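The value function g + w·h with a Manhattan heuristic can be sketched as a small weighted grid A*; the grid, the free-cell predicate, and the function names below are illustrative, not the planner's actual code. Setting w > 1 biases the search toward the goal, trading path optimality for speed:

```python
import heapq

def manhattan(a, b):
    """Heuristic h: grid distance ignoring obstacles."""
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_astar(start, goal, is_free, w=1.0):
    """Grid A* expanding cells in order of g + w*h; cell values are
    computed lazily, only for cells that enter the OPEN queue."""
    open_q = [(w * manhattan(start, goal), 0, start)]
    g = {start: 0}
    parent = {start: None}
    while open_q:
        _, gc, cell = heapq.heappop(open_q)
        if cell == goal:                  # reconstruct the found path
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        for axis in range(len(cell)):     # 2d axis-aligned neighbors
            for delta in (-1, 1):
                nb = list(cell)
                nb[axis] += delta
                nb = tuple(nb)
                if is_free(nb) and gc + 1 < g.get(nb, float("inf")):
                    g[nb] = gc + 1
                    parent[nb] = cell
                    heapq.heappush(
                        open_q, (gc + 1 + w * manhattan(nb, goal), gc + 1, nb))
    return None  # no path exists at this grid resolution
```

With w = 1 and the admissible Manhattan heuristic this returns a shortest path on the grid; larger w typically expands fewer cells but can return longer paths, which is the trade-off discussed in the text.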

The search can be weighted towards the goal configuration by adjusting the weight w, where a high value for w implies a search highly weighted towards the goal. This generally improves the search time but increases the length of the found path. The Manhattan distance is used to measure the distance of a cell from the start configuration and to estimate its distance to the goal configuration. Pure sequential A* maintains the optimality1 of the found path as long as the admissibility criteria [27] are satisfied. However, in many robot path planning applications, the time efficiency of finding a path is of a higher priority than the optimality of the determined path. The A* parameters can be adjusted so that a path is found faster, with the trade-off being non-optimality of the resulting path. Furthermore, it should be emphasized that when the search algorithm is performed in parallel, even if the

Figure 5 Load sharing between two processors

1 The term optimal path is used in this text as a path with the shortest length. Other possible measures of optimality include smoothness, energy cost, and time cost.


admissibility criteria are satisfied locally (i.e., in each processor), the overall search algorithm varies from that of a serial A* algorithm and the found path will not in general have the optimal length.
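The weighted value function described above can be sketched as follows. This is a minimal illustration under our own naming (`Config`, `manhattan`, and `cellValue` are not the paper's identifiers), with the discretized 6-dof joint space reduced to an integer array.

```cpp
#include <array>
#include <cstdlib>

// A cell in the discretized 6-dof joint-space grid.
using Config = std::array<int, 6>;

// Manhattan distance, used both for g (distance travelled from the start)
// and for h (estimated distance remaining to the goal).
int manhattan(const Config& a, const Config& b) {
    int d = 0;
    for (int i = 0; i < 6; ++i) d += std::abs(a[i] - b[i]);
    return d;
}

// Value function f = g + w*h that orders cells in the OPEN priority queue.
// A large w (e.g., w = 100) weights the search heavily towards the goal,
// trading path optimality for planning speed.
int cellValue(int g, const Config& cell, const Config& goal, int w) {
    return g + w * manhattan(cell, goal);
}
```

With w = 1 and an admissible h this reduces to standard A*; with w = 100 the ordering of the OPEN queue is dominated by the goal-distance estimate.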

4.2 Matchmaking

Matchmaking is defined as the process of linking an idle processor (one with an empty search stack) to a non-idle processor for the purpose of load sharing [32]. Balance among processors occurs when all of the processors have non-empty stacks. At the beginning of the search, only one of the processors has the start configuration in its local search queue, and all other processors have empty stacks. Therefore, the load distribution is initially briefly at its most unbalanced (one processor with 100% of the load and the others with 0%). As the search progresses there might be other situations where one processor exhausts all cells in its local stack without finding the goal configuration and ends up with an empty stack again, while other processors still have non-empty stacks. This can happen when a processor is trapped in a narrow dead-end in the C-space and can no longer generate additional cells to explore. The developed load-balancing technique identifies the processors with empty stacks (idle processors) and finds a match for them among the processors with non-empty stacks (donor processors). After a match is found, the donor processor shares some of its workload with the idle processor by removing some cells off its local stack and sending them to the idle processor.

Dynamic load-balancing could be performed in either a centralized or distributed manner. In a distributed manner, work transfer could be receiver-initiated or sender-initiated (see Appendix). Our approach uses a distributed method, similar to asynchronous round robin, and its work transfer is receiver-initiated.

A distributed load-balancing approach was chosen to avoid a bottleneck at the matchmaker processor. Work-stealing was chosen over work-sharing to minimize the idle time of processors with empty stacks. There is one major difference between the developed method and most of the existing methods in the literature: in the developed method, we have exploited a property of Beowulf clusters that enables fast message broadcasts.

In most Beowulf clusters that are developed exclusively for parallel computing, a fast switch connects the processors and simulates a star network, in which each processor is connected to all other processors through a communication link [34]. In such networks, broadcasting a message to all other processors takes the same amount of time that is required for sending a message to a single processor. To take advantage of these fast broadcasts, the developed method differs from other distributed dynamic load-balancing schemes, such as asynchronous round robin or nearest neighbor polling. Many of these methods are developed for large-scale networks, where fast broadcasts are not feasible, each processor is connected to only a handful of other processors, and any communication with other processors has to go through these direct neighbors. The nearest neighbor polling method, for instance, matches donor processors to idle processors that are direct neighbors. In our implementation, every time a processor faces an empty stack, it broadcasts a message to all other processors and announces its idleness. Of the recipients of the message, one processor with a non-empty stack agrees to be the donor and responds to the idle processor. This distributed approach is referred to as the announce protocol in this document.

A potential pitfall of the announce protocol is the possibility that multiple processors could simultaneously respond to a share request, and various approaches are possible to prevent multiple donors from answering a single request. One possibility is for each self-designated donor to double check with the idle processor to see if it still needs work before sharing its stack. The first potential donor that contacts the idle processor receives a positive response and the sharing occurs. All the donor candidates that double check with the requestor after the sharing occurs will subsequently receive a negative response. This method requires a large number of unnecessary messages since, except at the beginning of the search, most processors will be non-idle and will all try to respond to all requests. When all processors respond, 2 × (n − 2) messages are sent in an n-processor network: (n − 2) offers from all the potential donor processors, except the true donor processor and the idle processor itself, and (n − 2) negative responses.

In another approach, which we call the double announce protocol, the first non-idle processor that decides to share its stack with the idle processor broadcasts its decision to all other processors (including the idle processor) prior to sharing. Upon receipt of this broadcasted message, the idle processor enters into receiving mode and all other processors notice that the request has been serviced and that there is no need for further sharing. This method requires only (n − 1) messages for matching a donor to the idle processor, roughly half that of the previous method. Furthermore, the messages sent by the donor can be broadcast, which is generally more efficient than sending individual messages.
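The two message counts can be checked with a pair of one-line helpers (the function names are ours; n is the number of processors):

```cpp
// Messages needed to match one donor to one idle processor when every
// non-idle processor answers the announce: (n - 2) offers plus (n - 2)
// negative replies.
int announceMessages(int n) { return 2 * (n - 2); }

// Messages under the double announce protocol: a single donor broadcast
// reaching the other (n - 1) processors.
int doubleAnnounceMessages(int n) { return n - 1; }
```

For the eight-processor searches reported later, this is 12 messages versus 7 per matchmaking event.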

Owing to the non-deterministic nature of the network, the order in which the messages are sent is not necessarily preserved: a processor might receive a response to a request before receiving the request itself. In another scenario, the response to a request might take a long time to transmit and a second processor might decide to become the donor in the meantime. By storing some minimal information about each request, the developed method accommodates these cases without causing the algorithm to fail.

In the developed method, each request (the first announcement) contains two numbers: the id of the idle processor, and the request number, which counts the requests that have been broadcast by this idle processor. These two numbers form a request tag for each request. For example, if processor j receives a message as REQUEST(i, m), it knows that it is the mth request from processor i. A response to a request (the second announcement) contains three numbers: the id of the responding processor and the request tag. For instance, RESPONSE(j, (i, m)) is processor j's response to the mth request of processor i.

Each processor keeps an array of request tags, each associated with one of the processors. For example, in an n-processor search, the ith processor keeps (n − 1) request tags, each matching one of the searching processors (except the local processor).

The processors store the incoming messages in a queue and poll their message queue after each local expansion in a message handling procedure. After an idle processor (i) broadcasts a work request (by sending out a REQUEST(i, m) message) and a non-idle processor assigns itself as the donor (by broadcasting a RESPONSE(j, (i, m)) message), the remaining (n − 2) processors face different possibilities based on the order in which they receive these two messages and the time when they poll their message queue. These three events can occur in six possible different orders, which are categorized as follows: Normal Send/Receive (two cases); Hyper Responding/Receiving (one case); Early Response (one case); and Early Poll (two cases).
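The request-tag bookkeeping can be sketched as follows. This is our own minimal reconstruction, not the paper's code: it records, per idle processor, the highest request number already seen serviced, which is enough to ignore stale requests in the early-response orderings.

```cpp
#include <map>

// REQUEST(i, m): the m-th work request broadcast by idle processor i.
struct Request  { int idleId; int reqNum; };
// RESPONSE(j, (i, m)): processor j answers the m-th request of processor i.
struct Response { int donorId; int idleId; int reqNum; };

// Per-processor table of request tags. A response, seen in any order,
// marks that request (and any earlier ones from the same processor)
// as serviced, so late-arriving requests are ignored on receipt.
class RequestTags {
    std::map<int, int> serviced_;  // idle processor id -> last serviced reqNum
public:
    // True if the request is fresh and this processor may offer to donate;
    // false if a matching response was already seen (early response cases).
    bool onRequest(const Request& r) {
        return r.reqNum > serviced_[r.idleId];
    }
    void onResponse(const Response& r) {
        int& last = serviced_[r.idleId];
        if (r.reqNum > last) last = r.reqNum;
    }
};
```

The hyper-responding case is the one this table cannot prevent: a processor that polls the request before the response arrives sees it as fresh and donates anyway, so the idle processor may receive work from more than one donor.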

Normal send and normal receive occur when the request, its response, and message handling happen in one of the two orders listed in Table I. In both these cases the processor is informed that processor j has responded to processor i's work request. The request tags table is updated and the search continues.

Early response occurs when the request, its response, and message handling happen in the order shown in Table II. That is, the processor notices the response to a request before receiving the request. In these cases, the request tags table is updated and the request is ignored upon receipt. A variation of the early response case is when a processor receives the response to the nth request before receiving the mth request, when n > m. Table III illustrates the order of the events in a sample case with n = m + 1. The message handling procedure updates the request tags table and the mth through nth requests are ignored upon receipt.

Hyper responding and hyper receiving occur when the request, its response, and message handling happen in the order shown in Table IV. That is, at least one of the remaining (n − 2) processors handles the request before receiving the message that announces the fact that this request has already been responded to. This processor updates its request tags table, announces itself as a donor, and starts sharing its stack with the idle processor. In this case, the idle processor receives work from two (or more) processors.

Early poll occurs when the request, its response, and message handling happen in one of the two orders listed in Table V. That is, a message handling happens before receiving the request and its response. In these cases the processor does not notice anything, and a later poll of the message queue (after receiving one or both of the messages) puts this case into one of the previous four possibilities.

4.3 Cycle Prevention

The original method of Kumar and Rao was developed for depth-first tree searches, which have no cycling conditions, i.e., visiting a cell more than once through different paths. In its application here to grid searches, cycling must be explicitly prevented. A straightforward way to prevent cycling is to keep track of the cells that have been visited in a hash table. Whenever a cell is visited, its hash table entry is checked using a key based upon the cell's unique coordinates. If an entry does not already exist for a cell's key, then it is being visited for the first time. A new hash entry is created containing the cell's coordinates and some additional information, such as the partial path length (from the start cell) and the predecessor cell. Alternately, if when a cell is visited its hash table entry already exists, then the new visit is ignored if the partial path length is not smaller than the partial path length of the previous visit of this cell. If the new partial path length is smaller than that stored in the hash entry, then the entry's fields are updated. The use of a sufficiently large table and the selection of an appropriate hash function to generate the keys serves to keep the number of hash table collisions as low as possible. The hash function is chosen empirically (based on the cell configuration, modulus a large prime number) such that it generates unique keys for cells in a local neighborhood, thereby further reducing the occurrence of hash table collisions. To handle collisions, the cells that create the same key are stored in a list.

Table I Normal send and normal receive

Order  Event
1      Receive REQUEST(i, m)
2      Receive RESPONSE(j, (i, m))
3      Message handling
or
1      Receive RESPONSE(j, (i, m))
2      Receive REQUEST(i, m)
3      Message handling

Table II Early response

Order  Event
1      Receive RESPONSE(j, (i, m))
2      Message handling
3      Receive REQUEST(i, m)
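The neighborhood-distinct key generation described in Section 4.3 can be sketched as follows. The multiplier and prime below are illustrative placeholders, not the empirically chosen values from the paper.

```cpp
#include <array>
#include <cstdint>

using Config = std::array<int, 6>;

// Illustrative hash key: mix the six joint coordinates and reduce modulus
// a large prime, so that cells in a local neighborhood (differing by one
// grid step in one joint) receive distinct keys. Cells whose keys do
// collide would be chained in a list at the same table slot.
std::uint64_t hashKey(const Config& c) {
    const std::uint64_t kPrime = 1000003;  // a large prime (assumed value)
    std::uint64_t key = 0;
    for (int q : c) key = key * 131u + static_cast<std::uint64_t>(q);
    return key % kPrime;
}
```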

4.4 Orthogonality

Different processors will have different cells in their local stacks after the initial sharing, but they all follow the same A* strategy toward the goal, so they will quickly end up exploring the same overlapping cell spaces and become non-orthogonal. Each processor keeps its own local hash table but would not know of the cells visited by other processors and would visit them again.

Barring any further interprocessor communication, one possible way to increase orthogonality is to keep a central hash table in a designated processor, and make all other processors communicate with the designated hash manager each time they visit a cell. This scheme has two main drawbacks, the first being the large amount of message passing that would be required. The second drawback is that the hash manager has to service all processors and will become a critical resource subject to bottleneck.

A distributed approach was selected as an alternative to the central hash management method. The distributed method does not fully prevent work duplication and only tries to reduce, rather than eliminate, the amount of overlap between processors' search spaces. In the distributed method, each processor keeps track of its visited cells in a local hash table, and to reduce overlap, each processor occasionally broadcasts its newly visited cells to all other processors. The required communication load is much lower than with central hash management and it does not suffer from bottlenecks.
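The per-processor bookkeeping for these occasional broadcasts might look as follows. This is a sketch under our own naming: cells are reduced to integer ids, and the actual PVM broadcast call is left to the caller.

```cpp
#include <set>
#include <vector>

// Tracks locally visited cells and batches the new ones for an
// occasional broadcast, e.g., after every 20 expansions.
class HashBroadcaster {
    std::set<long> visited_;    // local hash table (cell ids only)
    std::vector<long> unsent_;  // visited since the last broadcast
    int period_;                // expansions between broadcasts
    int expansions_ = 0;
public:
    explicit HashBroadcaster(int period) : period_(period) {}

    // Record a locally expanded cell. Every period_ expansions the
    // accumulated batch is returned for the caller to broadcast.
    std::vector<long> expand(long cell) {
        if (visited_.insert(cell).second) unsent_.push_back(cell);
        if (++expansions_ % period_ == 0) {
            std::vector<long> batch;
            batch.swap(unsent_);
            return batch;
        }
        return {};
    }

    // Merge a batch from another processor's broadcast. Cells already in
    // the local table are skipped so they are not re-inserted, and since
    // received cells never enter unsent_, they are never re-broadcast.
    int receive(const std::vector<long>& batch) {
        int inserted = 0;
        for (long c : batch)
            if (visited_.insert(c).second) ++inserted;
        return inserted;
    }
};
```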

Figure 6 emphasizes the necessity of these occasional hash broadcasts in a simple two-dimensional search with two processors. The example illustrates that without hash broadcasts, even if visited cells are broadcast during sharing, a large amount of overlap between individual processors' workloads can occur. Initially processor 1 starts with the start configuration in its local stack and processor 2 starts with an empty stack. Although after sharing the two processors have different cells in their stacks, they both expand their local cells towards the goal and explore the same configurations. Explored cells in each processor (the ones in the CLOSED list [27]) are shown in dark shade and the cells in the OPEN list (to be explored) are shown in light textured shade.

Table III Early response (a more complicated case)

Order  Event
1      Receive RESPONSE(j, (i, m + 1))
2      Message handling
3      Receive REQUEST(i, m)
4      Receive REQUEST(i, m + 1)
5      Receive RESPONSE(k, (i, m))

Table IV Hyper responding and hyper receiving

Order  Event
1      Receive REQUEST(i, m)
2      Message handling
3      Receive RESPONSE(j, (i, m))


The communication between the processors is very fast (270 Mbit/s) [34] and hash broadcasts are very efficient. In a typical search, only a negligible fraction (less than 1%) of the total search time is spent on hash broadcasts. The receipt and processing of a hash broadcast message, however, can take longer. Each received cell needs to be checked against the local hash table before being inserted, to make sure that existing cells are not re-inserted into the local table.

To prevent processors from re-broadcasting cells received from other processors' hash broadcasts, all the received cells are tagged. This tag is also used in processing the cells that are received in stack shares and will be discussed in more detail in the next section.

4.5 Sharing

In a shared-memory cluster, stack sharing could be done very efficiently by simply splitting the stack into two halves and passing the receiver processor a pointer to one half. When there is no shared memory (as is the case in this work), the donor processor must send all the shared data to the receiver through the network. However, as the network is very fast, and because sharing is an infrequent event, the actual time spent on sharing is usually very small. Experimental results confirm that only a minute fraction (less than 0.1%) of the total search time is typically spent on sharing and receiving.

When cells are shared they are removed from the donor's OPEN list (i.e., the list of cells that are to be explored) and sent to the receiver. Along with the cell's configuration, some additional information, such as the cell's partial path length, originating processor, and predecessor cell, is also communicated to the receiver processor. When the shared cell leads to the goal configuration, the parent cell information is needed for retrieving the path.

As the Occasional Hash Broadcast method does not fully prevent load duplication, there might be cases where the shared cells have already been visited in the receiver. If the shared cell and its parent cell are both new to the receiver processor, the cell is put in the receiver's OPEN list and both the cell and its predecessor are put in the local hash table. If the shared cell has already been explored, however, it is ignored. There are also cases where the shared cell is new, but its predecessor cell is not. In these cases, the cell is still put in the local OPEN list and the local hash table. However, some extra care is required to ensure that backtracking is possible in case the shared cell leads to the goal configuration. If the parent cell exists in the receiver processor because of an earlier hash broadcast, it cannot be used for backtracking and the parent cell received with the shared cell must replace it. Table VI covers all the possibilities and describes the action taken in each case. (Owner is a property of a cell that identifies the processor that the cell belongs to.)
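The four cases of Table VI reduce to a small decision function. The following is our own sketch, with string results standing in for the actual hash-table and OPEN-list updates.

```cpp
#include <string>

// Decide what to do with a shared cell received from a donor, following
// the cases of Table VI. cellKnown / parentKnown: whether each is already
// in the receiver's local hash table; parentOwner: the owner field of the
// existing parent entry, where -1 tags a cell received in a hash broadcast.
std::string onSharedCell(bool cellKnown, bool parentKnown, int parentOwner) {
    if (cellKnown) return "ignore";           // old cell: do nothing
    if (!parentKnown) return "insert-both";   // new cell and new parent
    // New cell, old parent: a parent known only from a hash broadcast
    // cannot be used for backtracking, so the received parent's owner and
    // cost must replace it; otherwise the local parent entry is kept.
    return parentOwner == -1 ? "insert-cell-replace-parent"
                             : "insert-cell-keep-parent";
}
```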

Table V Early poll

Order  Event
a)
1      Message handling
2      Receive RESPONSE(j, (i, m))
3      Receive REQUEST(i, m)
or
b)
1      Message handling
2      Receive REQUEST(i, m)
3      Receive RESPONSE(j, (i, m))


5 Experimental Results

Both PSLB and our dynamic load-balancing methods were implemented and tested on various environments with a 6-dof manipulator robot model. To our knowledge, there does not exist any accepted set of standardized test cases for robotic motion planning of 6-dof manipulators, although examples such as the alpha puzzle for free-flying objects have been used [34]. The environments which we present are believed to be both reasonably challenging and comprehensive, consist of situations such as getting in and out of a narrow region, and are based upon other similar problems encountered in the literature [5, 7, 15].

Figure 6 Work redundancy when two processors search in a 2D space

Figure 7 illustrates the start and the goal configurations of eight of these test environments. The test cases were designed to represent a variety of typical and challenging planning scenarios, e.g., finding paths through narrow passages (BOOKSHELF_A, BOOKSHELF_B, and MAZE environments), finding detours around obstacles (WALL and H_CASE_B environments), and finding paths in environments with dead-end traps (E_TRAP, H_CASE_A, and U_CASE environments). Extensive tests were run on these eight environments and the results are presented in this section.

The tests were performed on Bugaboo, Simon Fraser University's Beowulf cluster [34]. At the time of the experiments, the cluster consisted of 96 dual-processor nodes (AMD Athlon MP 1.2 GHz) connected through a three-way channel-bonded fast Ethernet switch (bandwidth 269 Mbit/s, latency 55 ms). The processors have since been upgraded to Athlon MP 2800+ processors (2.133 GHz) and the memory on each node was doubled to 1 GB; however, the experiments reported in this paper were performed before these upgrades. The code is developed in C++ under Linux and the Parallel Virtual Machine library (PVM) [7] was used as the message passing interface.

Running multiple threads on a single processor simulates the parallel version and minimizes the differences between the parallel and the sequential case. The results from such a simulation are not guaranteed to be identical to when the program is executed purely sequentially, due to the nondeterministic nature of the cluster (message delays, etc.) and the fact that in the parallel program the processors run asynchronously [10]. These nondeterministic factors can also cause the same parallel program to vary in runtime performance if executed several times on the same cluster, and can lead to minor acceleration anomalies when measuring the speed-up of the parallel program. To reduce this effect, each experiment in this research was performed four times and the average values are presented.

Table VI Different possibilities that can happen when receiving cells

Received cell: New; Parent cell: New
- Put both the shared cell and its parent in the local hash table
- Put the shared cell in the OPEN list (with the received cost determining its position in the sorted list)
- Set the parent's owner as the received parent's owner
- Set the shared cell's owner to the local id

Received cell: New; Parent cell: Old
- Put the shared cell in the local hash table and in the OPEN list
- If the existing parent's owner is -1 (i.e., it is a cell previously received in a hash broadcast): set the existing parent's owner as the received parent's owner and set the existing parent's cost equal to the received parent's cost
- Else if the existing parent's owner is not -1 (i.e., the parent cell belongs to this processor and is not the result of a hash receive): keep the existing parent's cost and owner
- Set the shared cell's cost as the parent's cost plus the cost of stepping from the parent to the cell

Received cell: Old; Parent cell: New
- Do nothing (putting the received parent cell in the local hash table is not necessary since the cell information will be received again later through a hash broadcast)

Received cell: Old; Parent cell: Old
- Do nothing

Figure 7 Start and goal configurations in our test environments

Collision detection is one of the most time-consuming tasks in path planning [13]. It was typical in these experiments for around 90% of the planning time to be consumed on collision detection. The amount of time that each collision check takes depends on several factors, including the selection of the collision detection algorithm and its implementation, the complexity and dof of the robot, and the geometric complexity of the environment. In this research, the V-Collide collision detection library (RAPID) [16] (with its MPK [9] interface) was used in both the dynamic method and the implementation of PSLB. Each collision check takes between 0.6 ms and 4.2 ms (on a 1.2 GHz processor), depending on the complexity of the obstacle and the configuration of the robot. Planning time is roughly proportional to the number of collision checks, and therefore the number of collision checks can be used as a normalized factor to measure the performance of a planner. The number of collision checks performed by different processors is used as a measure of load balance: an equal (or nearly equal) number of collision checks in all the participating processors is equivalent to a balanced load. Obstacle information must be duplicated in each processor at the beginning of the search in both methods so that collision checks can be performed locally.

5.1 PSLB

5.1.1 Parameter Setting

The search parameters for PSLB are the hypercube size b and the weight w towards the goal configuration. Henrich et al. [15] have chosen b = 16° (that is, a hypercube of 16^6 cells for the 6-dof robot) and w = 99, a search heavily weighted towards the goal configuration.

In this research, each environment was tested with various values of w and b in an attempt to tune the search parameters to achieve reduced planning time. Selection of w = 100 led to the fastest planning time in the highest number of cases. Different test environments achieved their best time performance at different hypercube sizes, but b = 45° provided the best results on average.

Figure 8 illustrates the planning time of PSLB with eight processors on two sample environments for different values of b. Since the hypercube size that leads to the reduced planning time depends on the obstacle placement, empirical selection of the size does not necessarily guarantee the best time performance in other environments. For example, as illustrated in Figure 8, the tuned value for the WALL environment is b ≈ 45°, which leads to a planning time much larger than the minimal planning time in the BOOKSHELF_B environment, which is achieved at b ≈ 35°.

The tuned value for hypercube size depends on many parameters, including the discretization resolution, the message delay of the network, and the obstacle settings. Since these parameters are different in the experiments performed by Henrich et al. than in this research, it is not unexpected that the tuned hypercube size would be different in the two experiments.

5.1.2 Speed-Up

PSLB has achieved an average efficiency of 97.4% on the test environments. The average ratio of standard deviation to mean speed-up is 16%. Figure 9 illustrates the search time and the speed-up charts for a sample case with 2, 4, 8, and 16 processors, with w = 100 and b = 20°. Each test is performed four times and the average planning time is reported. The experiments do not include tests on a single processor because the cluster on which the experiments are done consists of dual-processor nodes. When multiple threads are initiated on a single node, half of them are automatically assigned to the second processor.

Table VII shows the planning time of PSLB when eight threads are run on two, four, and eight processors. Each test is performed four times and the average planning time is reported.

The results show close to linear speed-up in most cases. Perceived super-linear speed-up was observed in a few cases (in environments WALL, H_CASE_A, and H_CASE_B). These environments were tested eight more times (with two, four, and eight processors) and, when the average planning time of all 12 tests was used, two (out of five) of them measured sub-linear speed-up and the other three showed efficiencies closer to 100%. This suggests that a portion of these acceleration anomalies was due to the unavoidable discrepancies between the parallel and the sequential version (because of the previously discussed network nondeterminism). The cache effect [10] is also a possible explanation for a portion of the speed-up anomalies.

Figure 8 Planning time of PSLB on the WALL (dashed line) and BOOKSHELF_B (solid line) environments with eight processors and different values for the hypercube size

Figure 9 Planning time for the U_CASE test environment

5.1.3 Load Balance

The number of collision checks per processor was measured in different test environments and with various numbers of participating processors. Table VIII shows the average and the standard deviation of the number of collision checks per processor on various test environments for a seven-processor search. The rather high value of the overall standard deviation (31.3% of the average) suggests that PSLB has not achieved a good load balance in many cases.

5.2 Dynamic Load-Balancing Method

The dynamic method has achieved a consistent speed-up with an average efficiency of 95.9%. The experiments have also shown that the dynamic method maintains a more balanced load among processors than PSLB.

5.2.1 Parameter Setting

Extensive experiments were performed with different parameter settings on several environments in order to empirically set the values for each of the search parameters.

Table VII Planning time (s) of PSLB on different environments with two, four, and eight processors

Environment    2      4      8
WALL          25.6   12.5    6.1
BOOKSHELF_A   13.5    7.5    3.6
BOOKSHELF_B  185.8   96.1   48.4
E_TRAP        36.1   19.0    9.8
U_CASE        40.5   20.9   10.4
H_CASE_A      11.7    5.8    2.9
H_CASE_B       1.6    0.9    0.3
MAZE          62.5   31.2   17.7

Table VIII The average and the standard deviation of the number of collision checks per processor in PSLB on a seven-processor search

Test environment   Average   Standard deviation
WALL               1,961        68.8
BOOKSHELF_A          554.3     376.8
BOOKSHELF_B        9,343     1,246.7
E_TRAP             6,589     1,724.5
U_CASE             2,125       912.9
H_CASE_A             752.3      88.1
H_CASE_B              25.4      20.2
MAZE               4,580       252.1

76 J Intell Robot Syst (2006) 47: 55–85

The parameter values that provided the best planning time on average are shown in Table IX.

It should be noted that the last two parameters in the table set the overall search strategy and are not specific to the parallel formulation. The first parameter sets the hash broadcast period, and small values for it lead to greater orthogonality in the search. Since the network switch in most parallel clusters, including ours, is very fast, this parameter is set to the very small value of broadcasting after every 20 cell expansions. Even at this very small value, the communication overhead is very small and its effect is outweighed by the benefits gained from the increase in orthogonality.

The second parameter, the minimum stack size for sharing, prevents processors with too few cells in their stack from becoming donor processors. This is to ensure that the donor processors do not share too many of their cells and will not have empty stacks of their own after a few expansions. The value of this parameter has little effect on the overall performance of the search as long as it is not too low (e.g., 5) or too high (e.g., 100,000, which would delay the initial sharing at the beginning of the search).

5.2.2 Speed-Up

The dynamic method has achieved almost linear, consistent speed-up. Figure 10 shows the planning time for two sample cases with 2, 4, 8, and 16 processors. Each experiment is tested four times and the average planning time is reported. Figure 11 illustrates the path found for one of these test environments.

Table X shows the planning time of the dynamic load-balancing method when eight threads were run on two, four, and eight processors. Each test was performed three times and the average planning time is reported. It should be noted that, due to parallelization overhead, the planning time of the two-processor search would generally be larger than half of that of the pure sequential search on a single processor.

Table IX Tuned values for the search parameters in the dynamic method

Parameter                        Tuned value
Hash broadcast period            After every 20 expansions
Minimum stack size for sharing   100 cells
Sharing option                   Share cells with highest g + w·h value
Weight towards the goal (w)      100

Figure 10 Planning time for the MAZE and WALL test environments

The method has achieved an overall efficiency of 95.9%. The standard deviation to mean ratio of the speed-ups ranges from 5% to 31%, with an average of 16%. Super-linear speed-up was observed in a few cases (in environments BOOKSHELF_B, H_CASE_A, and MAZE). These environments were tested eight more times and, similar to the case with PSLB, when the average planning time of all 12 tests was used, three out of four cases showed sub-linear speed-up.

5.2.3 Load Balance

Table XI shows the average and the standard deviation of the number of collision checks per processor on a seven-processor search. The low overall standard deviation (5.6% of the average) confirms that the method maintains a highly balanced load among the processing units.

5.3 Discussion

The experiments showed that, when eight threads are initiated on two, four, and eight processors, both methods achieved near-linear speed-up in most cases and the difference between the efficiencies of the two methods was only 1.5 percentage points. Similar results were observed when 16 threads were initiated on 2, 4, 8, and 16 processors. Figure 12 illustrates the speed-up charts of the two methods in two sample cases with up to eight processors, where the two curves virtually lie on each other.

Figure 11 The planned path for the BOOKSHELF_B test environment. The start configuration is shown on the left and the goal configuration is shown on the right. Three intermediate configurations along the path are also shown

Table X  Planning time (s) of the dynamic search on different environments with two, four, and eight processors

Environment    Number of processors
               2        4       8
WALL           220.6    117.6   66.9
BOOKSHELF_A    14.8     7.6     3.8
BOOKSHELF_B    328.1    171.3   78.0
E_TRAP         1,756.4  925.6   502.3
U_CASE         83.8     43.9    22.4
H_CASE_A       52.9     26.9    12.9
H_CASE_B       2.3      1.2     0.66
MAZE           589.7    271.2   145.9
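As a sanity check on the table, the relative speed-up and efficiency can be computed with the two-processor times as baseline (the convention used in Figure 12). The snippet below does this for a few rows; the times are copied from Table X, and the subset of environments is chosen only for brevity.

```python
# Planning times (s) from Table X, indexed by processor count.
times = {
    "WALL":        {2: 220.6, 4: 117.6, 8: 66.9},
    "BOOKSHELF_A": {2: 14.8,  4: 7.6,   8: 3.8},
    "MAZE":        {2: 589.7, 4: 271.2, 8: 145.9},
}

def relative_speedup(t, n):
    """S(n) measured against the two-processor case: t(2) / t(n)."""
    return t[2] / t[n]

def relative_efficiency(t, n):
    """Linear speed-up from 2 to n processors would be n / 2."""
    return relative_speedup(t, n) / (n / 2)

for env, t in times.items():
    print(env,
          round(relative_speedup(t, 8), 2),
          round(relative_efficiency(t, 8), 2))
```

For MAZE the relative efficiency slightly exceeds 1, consistent with the super-linear cases noted in the text.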


While the achieved speed-up of the dynamic method was almost linear (similar to that of PSLB), PSLB (with parameters tuned for the specific environment) had shorter planning times in all cases. The planning time of PSLB was 37.7% of that of the dynamic method on average. Figure 13 compares the planning time of the two methods in two sample cases with up to eight processors.

Experiments with other values of w (i.e., searches that are less weighted towards the goal configuration) have also shown near-linear speed-up for both methods and shorter planning time for PSLB (with parameters tuned for the specific environment). Figure 14 compares the planning time and the speed-up of the two methods with w = 10 on the H_CASE_B environment, where both methods show almost linear speed-up and the planning time of PSLB is 60.4% of that of the dynamic method on average.

The difference between the planning times is due to the fact that the two methods follow different search strategies. The dynamic method, like sequential A*, focuses all its energy on the cells that seem most promising and can therefore get trapped inside local minima. PSLB decomposes the C-space into smaller sections and can find paths around obstacles and traps more easily, due to an effect similar to a coarse-to-fine search strategy, as described in Section 3.2.

It should also be noted that the tuned value of the hypercube size was selected based on experiments on the same environments that the dynamic method was tested on. However, it was shown (Section 5.1.1) that the tuned hypercube size for PSLB depends on the obstacle placement, and empirical selection of the size does not guarantee the optimal

Table XI  The average and the standard deviation of the number of collision checks per processor in the dynamic load-balancing method

Test environment   Average    Standard deviation
WALL               16,243     521.4
BOOKSHELF_A        756.1      54.3
BOOKSHELF_B        14,578     749.3
E_TRAP             71,516     3,169
U_CASE             5,683.4    559.1
H_CASE_A           4,317.4    366.2
H_CASE_B           71,516     3,169
MAZE               26,760     549.4

Figure 12 Comparing the speed-up of the dynamic method and PSLB on two sample environments. The speed-up of the dynamic method is shown with a dashed line and the speed-up of PSLB is shown with a solid line. Speed-up of the four- and eight-processor cases is measured against the case with two processors


time performance for other environments. This means that the time performance of PSLB (in comparison with the dynamic method) might not be as good as the experiments have shown if it is tested on environments that it has not been tuned for. The following example illustrates this case.

A new test case (BOOKSHELF_C) is shown in Figure 15. (The same obstacle used in environments BOOKSHELF_A and BOOKSHELF_B is used, but the start and goal configurations have been changed.) Figure 16 illustrates the planning time of PSLB with eight processors (the solid line) with respect to different values of the hypercube size. The figure shows that the selection of b = 25° provides the shortest planning time (20.8 s). The dynamic method's planning time with eight processors for the same environment is 22.2 s (the dashed line), which is longer than the planning time of PSLB with the tuned hypercube size. However, if b = 45° is selected (as mentioned in Section 5.1.1, this value provided the best results on average for different environments), PSLB's planning time is 30.1 s, which is longer than that of the dynamic method. Furthermore, as the graph indicates, PSLB will have poorer time performance than the dynamic method in this environment for the vast majority of values of b (the hypercube dimension).

Figure 13 Planning time of the dynamic method and PSLB for two sample environments. The planning time of the dynamic method is shown with a dashed line and the planning time of PSLB is shown with a solid line

Figure 14 Planning time and speed-up of the two methods on the H_CASE_B environment with w = 10. The planning time and the speed-up of the dynamic method are shown with dashed lines and those of PSLB with solid lines


The experiments have shown that the dynamic method maintains a more balanced load among the processors, and that it can achieve faster planning than PSLB when the luxury of tuning parameters for the specific environment being searched is unavailable. The standard deviation of the number of collision checks per processor is used as a quantitative measure of load balance. The experimental samples presented in Sections 5.1.3 and 5.2.3 show that, for a seven-processor search, the overall standard deviation of the number of collision checks per processor is 31.3% of the average for PSLB, while the same value for the dynamic method is a mere 5.6%.

6 Conclusions and Future Work

A dynamic load-balancing parallel scheme for an exhaustive enumerative planner has been presented. The method was implemented on a distributed memory cluster and was tested on various environments with a 6-dof robot. The experiments have shown that the planner

Figure 15 Test environment BOOKSHELF_C

Figure 16 Planning time of PSLB with eight processors on the BOOKSHELF_C environment with respect to the hypercube dimension


maintains a highly balanced load among the processors and achieves consistent, almost linear speed-up independent of obstacle placement.

The parallel static load-balancing formulation of Henrich et al. [15] was implemented and used as a benchmark to measure the performance of the dynamic method. The experiments demonstrated that both methods achieved almost linear speed-up. For tuned parameter values, the time performance of PSLB was better in all cases because it follows a different search strategy than the dynamic method. The strategy in PSLB is roughly similar to a two-level hierarchical search that, in many cases, enables the planner to find paths quickly through the coarse resolution of the hypercubes. However, PSLB's performance was found to be highly dependent on the size of the hypercubes, and the optimal size (i.e., the one that leads to minimal planning time) depends on the obstacle arrangement in the environment. Therefore, empirical selection of the hypercube size (i.e., tuning it based on a set of experiments) does not guarantee good performance in all environments. In contrast, the performance of the dynamic method does not depend on the obstacle placement, and the issue of hypercube size does not arise since no space decomposition is used. The method achieves consistent speed-up in all environments.

We therefore strongly believe that a parallel planner combining the environment-independent high efficiency of the dynamic method with a multi-level hierarchical search, such as that employed by the SANDROS planner [5], will lead to a robust (parameters will not require fine tuning) and more time-efficient (perhaps even close to real time) parallel planner.

The first step towards this planner would be incorporating an adaptive discretization resolution into the current implementation of the dynamic method. The collision detection routine currently used provides a binary output that determines whether or not a configuration is in collision. Other efficient collision detection methods can also provide the distance of the robot to the nearest obstacle for free configurations [11]. If such data is available, the step size of the robot can be adjusted so that longer steps are taken when the robot is far from obstacles and smaller steps are taken when it is adjacent to them. This adaptive step size will enable the planner to focus its energy on the areas that are most difficult and may therefore drastically improve performance.
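The adaptive step-size rule described above can be sketched as a simple function of clearance. The function name and all numeric parameters below are hypothetical tuning values chosen for illustration, not values proposed in the paper; the clearance would come from a distance-reporting collision checker such as the one in [11].

```python
def adaptive_step(clearance, base_step=0.01, gain=0.5, max_step=0.2):
    """Take longer steps when the robot is far from obstacles and shorter
    steps near them.  `clearance` is the distance to the nearest obstacle;
    the step grows linearly with clearance and is capped at `max_step`."""
    return min(max_step, base_step + gain * clearance)
```

In a planner, the step would be recomputed at every expanded configuration, so the search naturally spends its fine-resolution effort only in the difficult, cluttered regions.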

Appendix

Parallel Processing: A Brief Primer

Parallel processing is the simultaneous use of more than one processor, called a multiprocessor or a cluster computer, to perform a task. Usually, cooperation among the processors is maintained through a network that enables communication and message passing among them. Multiprocessors are divided into two main categories: shared memory and distributed memory systems. Shared memory systems utilize a memory unit accessible by all the processors, the implementation of which requires complex architectures and is relatively costly. In distributed memory clusters, each processor has a local memory unit and the processors can only communicate with each other through a network. By comparison, distributed memory systems are easy to set up and cost efficient, and are therefore the preferred choice for high-performance parallel computing in many applications. Furthermore, an application developed for a distributed memory system can


be ported to shared memory environments with minimal changes, while the opposite is notgenerally true.

The efficiency of a parallel program largely depends on its load distribution and communication overhead. To achieve maximum efficiency, all the processors need to work at their full capacity at all times. Some algorithms (especially probabilistic ones) are easily parallelizable and require minimal communication between processors, but in most cases a load-balancing scheme is required to maintain an even distribution of work among the processors. Load-balancing moves tasks from heavily loaded processors to less busy ones through the communication network. The process of load-balancing itself incurs some computational cost and, depending on the amount of information that needs to be shuffled between processors to achieve balance, may also tax the network resources. Therefore, there is often a trade-off between having low communication overhead and having a balanced load.

Load-balancing can be either static or dynamic. In static load-balancing, the workload is divided among the processors in a fixed manner that does not change during runtime. Dynamic approaches, on the other hand, attempt to balance the workload by shifting the load among processors at runtime.

There are two possible approaches for implementing a dynamic load-balancing mechanism: centralized and distributed [30]. In centralized approaches, one designated processor, called the matchmaker or the coordinator, has the responsibility of locating idle processors and finding a donor processor for each of them [33]. In contrast, in a distributed load-balancing system, each processor is responsible for maintaining a balanced load and keeping an estimate of the load distribution among the other processors.
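The matchmaker role described above can be sketched in a few lines. This is an illustrative single-process model under assumed semantics (processors report loads, idle processors ask for a donor), not the message-passing protocol of any particular system; the class and method names are hypothetical.

```python
import collections

class Matchmaker:
    """Centralized coordinator: busy processors advertise their load,
    idle processors request work, and the matchmaker pairs each idle
    processor with the most heavily loaded donor."""
    def __init__(self):
        self.loads = {}                  # processor id -> reported load
        self.idle = collections.deque()  # processors still waiting for work

    def report_load(self, proc, load):
        self.loads[proc] = load

    def request_work(self, proc):
        """Called by an idle processor; returns a donor id, or None if no
        processor currently has surplus work."""
        self.loads[proc] = 0
        donors = [p for p, l in self.loads.items() if l > 0 and p != proc]
        if not donors:
            self.idle.append(proc)       # queue until a donor appears
            return None
        return max(donors, key=lambda p: self.loads[p])
```

Because every match request flows through this one object, it is easy to see why the matchmaker becomes a contention point and a single point of failure as the cluster grows.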

Centralized approaches, such as global round robin, often require rather few messages for finding a match, but the matchmaker becomes a critical resource and is susceptible to becoming a bottleneck, affecting the planner's efficiency [28]. This becomes an issue particularly in large-scale clusters with many processors and can limit the scalability of a parallel formulation [30]. Furthermore, the matchmaker becomes what is known as a single point of failure and renders the entire system failure prone: if a failure occurs on the matchmaker processor, or its network link, it will affect or stop the entire job scheduling process [17].

Distributed approaches avoid the problem of contention and are less sensitive to the failure of a single processor. However, they often require a larger amount of interprocessor communication, and this can in turn limit their scalability. Asynchronous (local) round robin, nearest-neighbor load sharing, and random polling are among the common decentralized load-balancing schemes [10]. In distributed approaches, load transfer between nodes can be receiver-initiated (work-stealing) or sender-initiated (work-sharing). A detailed survey of some of the general load-balancing algorithms is provided in [31]. In recent years, hybrid systems have been developed that combine centralized and distributed methods in an attempt to inherit the desirable properties of each approach and enhance scalability [30, 32].
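To make the receiver-initiated (work-stealing) variant concrete, the sketch below models random polling in a single process: an idle processor polls peers in random order until one has surplus work, then takes half of its local stack. The function name, the half-split rule, and the in-memory `stacks` dictionary are illustrative assumptions, not the message-passing implementation used by any of the cited schemes.

```python
import random

def random_polling_steal(my_id, peers, stacks, rng=random.Random(0)):
    """Receiver-initiated random polling: the idle processor `my_id`
    polls the other peers in random order and steals half of the first
    victim stack that has more than one work item."""
    candidates = [p for p in peers if p != my_id]
    rng.shuffle(candidates)
    for victim in candidates:
        if len(stacks[victim]) > 1:      # victim has work to spare
            half = len(stacks[victim]) // 2
            stolen, stacks[victim] = stacks[victim][:half], stacks[victim][half:]
            stacks[my_id].extend(stolen)
            return victim
    return None                          # no peer had surplus work
```

The half-split heuristic keeps both processors busy after the transfer; a sender-initiated (work-sharing) scheme would instead have the overloaded processor push work out on its own initiative.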

Another factor affecting the efficiency of a parallel program is orthogonality. Two processors are said to be working orthogonally if there is no duplication among their tasks. The more orthogonal the workloads of different processors, the higher the efficiency that is achieved.

We use speed-up and efficiency as quantitative measures of the performance enhancement achieved by our parallel formulation. The speed-up of a parallel implementation measures how much faster it runs on multiple processors than on a single processor. The


speed-up for n processors, S(n), can be calculated as S(n) = t(1)/t(n), where t(1) is the time it takes to perform the computation on a single processor, and t(n) the time when n processors are used. Values of S(n) < n, S(n) = n, and S(n) > n represent sub-linear, linear, and super-linear speed-up, respectively. The efficiency of the implementation is defined as E(n) = S(n)/n. A scalable parallel formulation is one that can maintain a high efficiency for large values of n.
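These two definitions translate directly into code. The numbers in the usage example are made up purely to illustrate the formulas (a 100 s sequential job that takes 30 s on four processors after parallelization overhead); they are not measurements from the paper.

```python
def speedup(t1, tn):
    """S(n) = t(1) / t(n)."""
    return t1 / tn

def efficiency(t1, tn, n):
    """E(n) = S(n) / n; values close to 1 indicate that all processors
    are effectively utilized."""
    return speedup(t1, tn) / n

s4 = speedup(100.0, 30.0)        # about 3.33: sub-linear, since S(4) < 4
e4 = efficiency(100.0, 30.0, 4)  # about 0.83
```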

The speed-up of multiple-processor searches is measured against the case where the parallel algorithm is simulated on a single processor. Speed-up achieved on multiple processors would always be sub-linear or, at most, linear, assuming no extraneous effects such as the cache effect [10]. In fact, if the communication delay is anything but zero, the speed-up is sub-linear and the efficiency is smaller than one. Efficiencies close to 1 (or 100%) indicate a good parallel formulation and effective utilization of all the processors.

In special cases, super-linear speed-up can be achieved due to the condition known as the cache effect. This occurs when the initial problem is large and is broken into small pieces, one for each processor. Most processors have more effective memory management for smaller problems (due to the fast access time of a limited amount of cache memory) and perform better on them. If this enhanced performance for smaller problems exceeds the inefficiencies caused by communication delays, super-linear speed-up can be realized.

References

1. Akl, S.G.: The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs (1989)
2. Amato, N.A., Dale, L.K.: Probabilistic roadmap methods are embarrassingly parallel. In: Proc. IEEE Int. Conf. Robot. Autom., pp. 688–694, Detroit (1999)
3. Barraquand, J., Latombe, J.C.: Robot motion planning: a distributed representation approach. Int. J. Rob. Res. 10(6), 628–649 (1991)
4. Challou, D., Boley, D., Gini, M., Kumar, V.: A parallel formulation of informed randomized search for robot motion planning problems. In: Proc. IEEE Int. Conf. Robot. Autom., pp. 709–714, Nagoya, Japan (1995)
5. Chen, P.C., Hwang, Y.K.: SANDROS: a dynamic graph search algorithm for motion planning. IEEE Trans. Robot. Autom. 14(3), 390–403 (1998)
6. Choset, H., Lynch, K.M., Hutchinson, S., Kantor, G., Burgard, W., Kavraki, L.E., Thrun, S.: Principles of Robot Motion: Theory, Algorithms, and Implementation. MIT Press, Cambridge, Massachusetts (2005)
7. Geist, G.A., Beguelin, A., Dongarra, J.J., Jiang, W., Manchek, R., Sunderam, V.S.: PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, Cambridge, Massachusetts (1994)
8. Gini, M.: Parallel search algorithms for robot motion planning. In: Workshop on Practical Motion Planning in Robotics: Current Approaches and Future Directions, IEEE Int. Conf. Robot. Autom. (1996)
9. Gipson, I., Gupta, K.K., Greenspan, M.: MPK: an open extensible motion planning kernel. J. Intell. Robot. Syst. 8(18), 433–443 (2001)
10. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing (and accompanying text: Search Algorithms for Discrete Optimization Problems). Addison-Wesley (2003)
11. Greenspan, M., Burtnyk, N.: Obstacle count independent real-time collision avoidance. In: Proc. IEEE Int. Conf. Robot. Autom., vol. 2, pp. 1073–1080, Minneapolis (1996)
12. Gupta, K.: Fast collision avoidance for manipulator arms: a sequential search strategy. IEEE Trans. Robot. Autom. 6(5), 522–532 (1990)
13. Gupta, K., Del Pobil, A.P.: Practical Motion Planning in Robotics: Current Approaches and Future Directions. Wiley (1998)
14. Henrich, D.: Fast motion planning by parallel processing – a review. J. Intell. Robot. Syst. 45–69 (1997)
15. Henrich, D., Wurl, C., Wörn, H.: 6 DOF path planning in dynamic environments – a parallel on-line approach. In: Proc. IEEE Int. Conf. Robot. Autom., pp. 330–335, Leuven, Belgium (1998)
16. Hudson, T.C., Lin, M.C., Cohen, J., Gottschalk, S., Manocha, D.: V-COLLIDE: accelerated collision detection for VRML. In: Carey, R., Strauss, P. (eds.) VRML 97: Second Symposium on the Virtual Reality Modeling Language, pp. 117–124, New York (1997)
17. James, H.A.: Scheduling in Metacomputing Systems. PhD Thesis, University of Adelaide (1999)
18. Kavraki, L.E., Svestka, P., Latombe, J.C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional configuration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580 (1996)
19. Khatib, O.: Real-time obstacle avoidance for manipulators and mobile robots. Int. J. Rob. Res. 5(1), 90–98 (1986)
20. Kumar, V., Rao, V.N.: Scalable parallel formulations of depth-first search. In: Kumar, V. (ed.) Parallel Algorithms for Machine Intelligence and Vision. Springer, Berlin Heidelberg New York (1990)
21. Latombe, J.C.: Motion planning: a journey of robots, molecules, digital actors, and other artifacts. Int. J. Rob. Res. 18(11), 1119–1128 (1999)
22. Latombe, J.C.: Robot Motion Planning. Kluwer, Boston (1991)
23. LaValle, S.M.: Planning Algorithms. Cambridge Univ. Press (2006)
24. LaValle, S.M.: Rapidly-Exploring Random Trees: A New Tool for Path Planning. Iowa State University Computer Science Dept. Technical Report 98-11 (1998)
25. Lozano-Pérez, T.: Spatial planning: a configuration space approach. IEEE Trans. Comput. 32(2), 108–120 (1983)
26. Mazer, E., Ahuactzin, J.M., Talbi, E.G., Bessière, P., Chatroux, T.: Parallel motion planning with the Ariadne's Clew algorithm. In: Proc. Int. Conf. Robotics, vol. 2, pp. 1373–1380, Kyoto, Japan (1994)
27. Nilsson, N.J.: Principles of Artificial Intelligence. Morgan Kaufmann, San Francisco, California (1980)
28. Prasad, R., Moritz, C.A.: Efficient search techniques in the billion transistor era. In: Proc. International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas (2001)
29. Qin, C., Henrich, D.: Path planning for industrial robot arms – a parallel randomized approach. In: Proc. International Symposium on Intelligent Robotic Systems, pp. 65–72, Lisbon, Portugal (1996)
30. Qureshi, K., Hatanaka, M.: An introduction to load balancing for parallel raytracing on HDC systems. Curr. Sci. 78(7), 818–820 (2000)
31. Shirazi, B.A., Kavi, K.M., Hurson, A.R.: Scheduling and Load Balancing in Parallel and Distributed Systems. IEEE Computer Society Press (1995)
32. Sigdel, K., Bertels, K., Pourebrahimi, B., Vassiliadis, S., Shuai, L.S.: A framework for adaptive matchmaking in distributed computing. In: Proc. GRID Workshop, Krakow, Poland (2005)
33. Widell, N.: Migration algorithms for automated load balancing. In: Proc. 16th IASTED International Conference on Parallel and Distributed Computing and Systems, Cambridge (2004)
34. High Performance Computing at SFU, http://www.hpc.sfu.ca
35. Motion Planning Puzzles at Texas A&M Univ., http://parasol-www.cs.tamu.edu/dsmft/benchmarks/mp/
