Parallel Simulation of Fluid Slip in a Microchannel

Jingyu Zhou, Luoding Zhu, Linda Petzold, and Tao Yang
Department of Computer Science

University of California, Santa Barbara
{jzhou, zhuld, petzold, tyang}@cs.ucsb.edu

Abstract

Fluid flow in channels has traditionally been assumed to satisfy a no-slip condition along the channel walls. However, it has recently been found experimentally that the flow in microchannels can slip along hydrophobic (water-repelling) walls. The slip can have important physical consequences for such flows. The physical mechanism underlying this phenomenon is not well understood. An accurate, physically realistic model that can be simulated rapidly is critical for obtaining a better understanding of these results, and ultimately for modeling and optimizing the flow in microdevices to achieve desired objectives.

This paper investigates the parallel simulation of fluid slip along microchannel walls using the multicomponent lattice Boltzmann method (LBM) with domain decomposition. Because of the high complexity of microscale simulation, even a parallel computation of fluid slip can take days or weeks. Any slowness in the participating nodes of a cluster can drag down the entire computation substantially, due to the frequent node synchronization involved in each computational phase of the algorithm. We augment the parallel LBM algorithm with filtered dynamic remapping of lattice points. This filtered scheme uses lazy remapping and over-redistribution strategies to balance the computational speed of participating nodes and to minimize the performance impact of slow nodes on synchronized phases. Our experimental results indicate that the proposed technique can greatly speed up fluid slip simulation on a non-dedicated cluster over a long period of execution time.

This work was supported by NSF/ITR ACI-0086061.

1 Introduction

In fluid mechanics, a no-slip boundary condition is usually assumed for viscous flow over a solid surface. This means that the fluid velocity at the surface is the same as the velocity of the surface (zero if the surface is at rest). However, it has recently been found experimentally that the no-slip assumption may not be valid for some microscale flows, for example flows in micro-electro-mechanical systems (MEMS). Many researchers [6, 20, 15, 40, 41, 26, 1, 7, 48, 24, 35, 38, 36, 37] have investigated fluid slip phenomena for microscale flows. Apparent fluid slip at micro-device walls can have a significant influence on the mass and heat transfer in the system and has drawn much recent attention. Among these results is a recent laboratory experiment performed by Tretheway and Meinhart [36]. They found an apparent fluid slip with respect to the main stream flow velocity in a three-dimensional microchannel with hydrophobic walls (the walls are coated with a material that repels water molecules).¹ The mechanism of fluid slip on hydrophobic surfaces is not yet well understood.

¹For a comprehensive review of fluid slippage over hydrophobic surfaces, see Vinogradova [39] and the references therein. The hydrophobicity of a solid surface is not well understood; we refer interested readers to [22, 19, 17, 3, 4, 28, 38].


A possible mechanism for generating the observed slip has been proposed in [37] (see Section 2 for details).

We investigated the proposed mechanism by simulating the water-air two-phase system with the multicomponent lattice Boltzmann method for flow in a three-dimensional hydrophobic microchannel [47]. Microscale simulation of fluid slip is computation-intensive. It can take hundreds of days on a fast single-processor machine, especially when high resolution is required. Previous work has focused on the parallelization of the single-component LBM with static domain decomposition. Satofuka [29] parallelized the single-component LBM in two and three dimensions via domain decomposition. Kandhai et al. [18] adopted the Orthogonal Recursive Bisection method to parallelize the single-component LBM. Suviola [34] parallelized the single-component LBM for simulating suspension flow. Previous work (e.g. [33]) has focused mainly on static decomposition of the grid into equal sub-volumes, based on slice, box, or cubic partitioning. Here we report on our parallelization of the multicomponent LBM via domain decomposition with filtered dynamic lattice point remapping.

For the multicomponent LBM, we use domain decomposition, which divides the problem domain into a number of sub-domains. Computation on the sub-domains is performed by separate processes that execute the same functions on different data. Because the sub-domains are connected at their boundaries, processes belonging to neighboring sub-domains must synchronize on their boundaries. These synchronizations divide the computation into phases. While full linear speedup for the parallel multicomponent LBM can be achieved on a dedicated cluster, the parallelized code can still take weeks on a modest-sized cluster for a high-resolution microscale simulation. In a production environment for scientific simulation, it is very likely that a cluster is shared with other tasks. Phase-based synchronization with slow nodes drags down the entire LBM computation significantly in a non-dedicated environment. For example, a single heavily-loaded node can increase the program's execution time by a factor of two to three, as we will show later.

It is clear that we need to improve the self-adaptiveness of the LBM algorithm in responding to the slowness of some nodes. The challenge for dynamic load balancing is to tolerate transient spikes and to avoid the excessive communication needed for data remapping. We propose a scheme called filtered remapping, which adopts lazy migration to minimize unnecessary re-balancing, and uses an over-redistribution strategy during lattice point migration to proactively minimize the impact of slow nodes. Our experimental results show that the proposed scheme can balance the workload in the presence of slow nodes and effectively handle transient spikes, with significant improvement compared to other methods.

This paper is structured as follows. Section 2 presents the multicomponent LBM and its parallelization with domain decomposition. Section 3 gives the motivation for and details of filtered dynamic lattice point remapping. Section 4 provides the simulation and evaluation results. Finally, Section 5 concludes the paper.

2 Fluid Slip and the Lattice Boltzmann Method

For fluid flows at the micro or nano scale, the lattice Boltzmann method provides a more physically realistic simulation approach than the Navier-Stokes (N-S) equations. When the Knudsen number² is not small (which is often the case for problems at the micro or nano scale), the N-S equations are not valid, but the lattice Boltzmann method still is. In our work, the parallelized multicomponent lattice Boltzmann method with filtered dynamic load rebalancing was applied to investigate a possible generating mechanism for apparent fluid slip [37] at hydrophobic microchannel walls. The key idea of the mechanism is as follows. The water used in the experiment [36] was not treated to remove dissolved or entrained gases. It is conjectured that the gases and the water vapor may form a depleted thin layer between the hydrophobic surfaces and the water.

²Defined as the ratio of the mean free path of the molecules to the characteristic length of the problem.


This depleted water region could generate the observed apparent fluid slip, because the overall density in the thin layer is much lower than that of water.

We investigated the above mechanism by simulating the water-air two-phase system with the parallelized multicomponent lattice Boltzmann method for flow in a three-dimensional hydrophobic microchannel. The hydrophobic walls were modeled by applying a force in a region very close to the walls. This force is repulsive to the water molecules and neutral to the air molecules. Note that, unlike the methods of Molecular Dynamics or Direct Simulation Monte Carlo, there are no "molecules" in the lattice Boltzmann method. The introduced hydrophobic force acts on the single-particle velocity distribution functions, instead of on the molecules themselves. These forces decay exponentially away from the wall. The initial water-air mixture is assumed to be uniform. The initial density of the air in the water is calculated under standard conditions. The multicomponent lattice Boltzmann model we use is standard, see [30, 31, 32, 23], except that we introduced the additional hydrophobic wall forces into the formulation.

2.1 Lattice Boltzmann Method

The lattice Boltzmann method [25, 5, 2, 8, 27, 21, 43, 14, 16] is an alternative to traditional numerical methods for simulating fluid flows governed by the Navier-Stokes equations. It solves for the single-particle velocity distribution functions that satisfy a simplified Boltzmann equation, instead of solving for macroscopic quantities like fluid velocity and pressure. Compared to the direct discretization of the incompressible Navier-Stokes equations, the lattice Boltzmann method has several advantages (see [27] for details). For example, the simplified Boltzmann equation is a scalar linear differential equation, while the Navier-Stokes equations are a system of nonlinear differential equations. Among these comparative advantages is the local nature of both the collision and streaming operators. This method is therefore very natural to parallelize.

The multicomponent lattice Boltzmann method used in our work is the S-C (Shan-Chen) model [11, 30, 31, 32]. A brief account of this method is as follows. Let \sigma denote a fluid component. For each fluid component \sigma (\sigma = 1, 2 in our case), a single-particle velocity distribution function f^\sigma(x, \xi, t) is introduced, which solves the LBGK model for that component:

\frac{\partial f^\sigma}{\partial t} + \xi \cdot \nabla f^\sigma = -\frac{1}{\tau^\sigma}\left(f^\sigma - f^{\sigma,eq}\right).

Here \tau^\sigma and f^{\sigma,eq} are the relaxation time and the equilibrium distribution function for component \sigma, respectively.

After discretization in particle velocity space \xi and in time t, the multicomponent LBE is obtained:

f_i^\sigma(x + e_i, t + 1) = f_i^\sigma(x,t) - \frac{1}{\tau^\sigma}\left(f_i^\sigma(x,t) - f_i^{\sigma,eq}(x,t)\right),

where f_i^\sigma is the distribution function for the \sigma component along the direction e_i. Note that the discretization in \xi is the same for each component.

An interparticle interaction potential, which is intended to model the interaction between different components, can be defined as

V(x, x') = \sum_{\sigma} \sum_{\sigma'} G_{\sigma\sigma'}(x, x')\, \psi^\sigma(x)\, \psi^{\sigma'}(x').

Here the Green's function G_{\sigma\sigma'}(x, x') characterizes the nature of the interaction between different components (attractive or repulsive, and its strength). The choice of \psi determines the equation of state of the system under study. By selecting different functions G and \psi, various fluid mixtures and multiphase flows can be simulated.

The equilibrium distribution f_i^{\sigma,eq} can be written as

f_i^{\sigma,eq}(x,t) = \rho^\sigma(x,t)\, w_i \left[ 1 + 3\, e_i \cdot u^\sigma(x,t) + \tfrac{9}{2}\left(e_i \cdot u^\sigma(x,t)\right)^2 - \tfrac{3}{2}\, u^\sigma(x,t) \cdot u^\sigma(x,t) \right],

where w_i is the weight. The mass density of component \sigma is defined by

\rho^\sigma(x,t) = \sum_i m^\sigma f_i^\sigma(x,t),

where m^\sigma is the molecular mass of component \sigma. The velocity u^\sigma is computed via

\rho^\sigma(x,t)\, u^\sigma(x,t) = \rho^\sigma(x,t)\, \bar{u}(x,t) + \tau^\sigma \frac{dp^\sigma}{dt}(x,t) + \tau^\sigma F^\sigma(x),

where the average velocity \bar{u} is defined by

\bar{u}(x,t) \sum_\sigma \frac{\rho^\sigma(x,t)}{\tau^\sigma} = \sum_\sigma \frac{m^\sigma}{\tau^\sigma} \sum_i f_i^\sigma(x,t)\, e_i.

Here dp^\sigma/dt is the net rate of momentum change, which can be computed in terms of the interaction potential:

\frac{dp^\sigma}{dt}(x,t) = -\nabla V = -\psi^\sigma(x) \sum_{\sigma'} \sum_i G^i_{\sigma\sigma'}\, \psi^{\sigma'}(x + e_i)\, e_i.

The forces F^\sigma(x) are the hydrophobic forces we introduced to simulate the hydrophobic walls. Our choice of F^\sigma(x) is as follows (index w denotes the fluid in the model that simulates the water, and index a the fluid that simulates the air/water vapor):

F^a(x) = 0,
F^w(x) = \left(0,\; g_2(y),\; g_3(z)\right),

where y is the distance from the side walls along the inward normal direction, and z has a similar meaning for the top and bottom walls. g_2(y) and g_3(z) are of the form c_1 \exp(-y/c_2), where c_1 and c_2 are constants to be determined.

The macroscopic quantities are connected to the distribution functions by the following relations:

\rho(x,t) = \sum_\sigma \rho^\sigma(x,t),
(\rho u)(x,t) = \sum_\sigma m^\sigma \sum_i f_i^\sigma e_i + \frac{1}{2} \sum_\sigma \frac{dp^\sigma}{dt}(x,t).

The dimensionless viscosity of the system is defined by

\nu = \frac{1}{6}\left(2 \sum_\sigma \frac{\rho^\sigma}{\rho}\, \tau^\sigma - 1\right).
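To illustrate how local the update above is, the following is a minimal sketch (not the authors' code) of the per-component BGK collision step implied by the discrete LBE: each distribution value is relaxed toward its equilibrium using only data stored at its own lattice site. The flattened array layout and the names f, feq, tau, ncomp, ndir, and nsites are illustrative assumptions.

/* Sketch of the per-component BGK collision step of the multicomponent LBE.
 * f and feq are flattened arrays indexed by (component, direction, site);
 * tau[c] is the relaxation time of component c. All names are assumptions. */
void collide(double *f, const double *feq, const double *tau,
             int ncomp, int ndir, long nsites)
{
    for (int c = 0; c < ncomp; c++) {
        for (int i = 0; i < ndir; i++) {
            long base = ((long)c * ndir + i) * nsites;
            for (long s = 0; s < nsites; s++)
                /* relax toward equilibrium: f <- f - (f - feq)/tau */
                f[base + s] -= (f[base + s] - feq[base + s]) / tau[c];
        }
    }
}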

Figure 1: The three-dimensional lattice for the D3Q19 model. Each node has 19 possible movement directions.

2.2 Parallelization of LBM

We performed the parallelization of the multicomponent LBM in three dimensions via domain decomposition. Because of the special geometry of our application (the x direction is much longer than the y and z directions), a one-dimensional decomposition along the x direction was used. The microchannel was partitioned into blocks along the x direction, and each block was assigned to a processor. In each phase, every processor must send the distribution functions of the two components at its boundaries along the lattice directions with a positive x component to its right neighbor, and those along the directions with a negative x component to its left neighbor. In the meantime, each processor must wait to receive the corresponding distribution functions of the two components from its right and left neighbors; see Figure 1.

In each phase of the LBM computation, a lattice point requires a constant number of floating-point operations. Our application's problem domain is a box of size N_x \times N_y \times N_z, so the time complexity of each phase of the program is O(N_x \times N_y \times N_z). Figure 2 gives the pseudo-code of the parallelized lattice Boltzmann method. Different processors are assigned different starting and ending indices on the x axis.

4

1  s = starting index on X axis;
2  e = ending index on X axis;
3  for (phase = 1; phase <= TOTAL_PHASES; phase++) {
4      Compute collision from s to e;
5      Compute streaming from s to e;
6
7      /* communication */
8      Exchange distribution func. with neighbors;
9
10     Compute bounce back;
11     yz_direction(s, e);
12
13     /* communication */
14     Exchange number density with neighbors;
15
16     Compute force from s to e;
17     Compute velocity from s to e;
18
19     /* lattice points remapping */
20     if ( phase % REMAPPING_INTERVAL == 0 ) {
21         t_local = estimate_time();
22
23         /* communication */
24         Exchange t_local with neighbors;
25
26         Compute remapping amount;
27
28         /* communication */
29         Redistribute data;
30
31         Update s and e;
32     }
33 }

Figure 2: Pseudo-code of the parallelized lattice Boltzmann method.

Lines 4 through 17 represent the computation of one phase. The velocity computed on line 17 is used in the collision computation of line 4 in the next iteration. The distribution function and the number density data on the boundaries must be exchanged with neighboring nodes in each phase (line 8 and line 14, respectively). The application requires a large number of simulation steps. Every few phases, the program performs a remapping of the lattice points (lines 20 to 32) to improve performance by adjusting the load on different processors and reducing the synchronization overhead.
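For concreteness, the boundary exchange on line 8 (and, analogously, on line 14) can be expressed with non-blocking MPI point-to-point calls. The sketch below is only an illustration under assumed buffer names and sizes, not the authors' implementation.

/* Sketch of the per-phase boundary exchange in the 1-D slab decomposition:
 * each process swaps one x-slab of data with its left and right neighbors.
 * Buffer names and the 'count' argument are illustrative assumptions. */
#include <mpi.h>

static void exchange_boundaries(double *send_left, double *recv_left,
                                double *send_right, double *recv_right,
                                int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    /* The first and last processes in the linear array have no outer neighbor. */
    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    MPI_Request reqs[4];
    MPI_Irecv(recv_left,  count, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(recv_right, count, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(send_right, count, MPI_DOUBLE, right, 0, comm, &reqs[2]);
    MPI_Isend(send_left,  count, MPI_DOUBLE, left,  1, comm, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}

Using non-blocking sends and receives lets both neighbor exchanges proceed concurrently before a single wait, which is the natural fit for the phase-based synchronization described above.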

3 Filtered Dynamic Remapping of Lattice Points

We have parallelized the fluid slip code using MPI. In this section we discuss the motivation for dynamic lattice point remapping and then present a filtered remapping method to minimize the impact of slow nodes.

3.1 Motivation

While the code can achieve an almost perfect speedup on a dedicated cluster, the parallel code still takes a long time (weeks) to complete the simulation for even a small microchannel. Any machine slowness due to resource sharing or operational error can dramatically slow down the entire computation, because of the synchronization among nodes conducted at each phase of the computation: the other nodes must wait for the slowest one in order to synchronize.

Figure 3: Increased time caused by competing jobs. Left: execution time (seconds) versus disturbance level. Right: overhead (%) versus disturbance level.

To find out whether short load spikes or long load spikes have the most impact, we performed a series of experiments. In these experiments, our MPI-based fluid slip program used 20 nodes and 600 phases, representing a small percentage of the phases required in the simulation described in Section 4. One of the nodes was slowed down by a CPU-intensive competing job that ran throughout the execution of our MPI program. Every 10 seconds, it spent a certain percentage of the time competing for CPU resources and slept the rest of the time. We varied that percentage from 0 to 100% and plotted the results in Figure 3. On the left, the execution time of our MPI program is plotted with respect to different disturbance levels. The increased time per program phase is plotted on the right.


The curve shows that the overhead increases roughly linearly when the disturbance is less than 60% (which corresponds to 6 seconds in each 10-second period) and increases sharply after that point. This suggests that short load spikes have a relatively small impact on the overall performance. However, the sharp increase after 60% means that if the high load persists longer, the MPI program's performance suffers greatly. Considering that the time for running our MPI program under perfect conditions is about 250 seconds for this selected case, the overhead caused by persistent competing jobs can be several times higher.

The significant slowdown caused by even a single slow node is due to a ripple effect in the LBM code with domain decomposition. Since all neighboring nodes must synchronize in each phase, after one phase the neighbors of the slowest node are slowed down; after two phases, nodes at distance two are slowed down; the slowdown continues until all processes are affected, within 10 to 20 phases depending on the position of the slowest node.

These experiments suggest that the dominant performance slowdown is caused by a persistent slow node and that transient load spikes have less performance impact. Thus, the remainder of this section focuses on improving performance in the case of persistent slow nodes.

3.2 Self-Adaptiveness with Dynamic Data Remapping

The above experimental results show that the impact of a persistent slow node must be minimized to reduce program execution time. Dynamic load balancing [42] is a common technique used to schedule computation in an adaptive manner such that fast nodes perform more work. We first describe dynamic lattice mapping used to achieve load balancing in general and explain the dynamic mapping challenge in our context, and then propose a solution called filtered dynamic remapping.

3.3 Issues for Dynamic Mapping of Lattice Points

Since the computational complexity of our application at each processor is proportional to the number of lattice points assigned to the processor, remapping of lattice points is needed to achieve load balancing in the presence of slow and fast nodes. Each node needs to monitor program execution and adaptively reduce its number of lattice points if the machine becomes slower compared to others in the cluster. Figure 4 illustrates the dynamic data remapping operation. Initially, lattice points are evenly distributed among all nodes (Figure 4-a). When node 2 becomes slower, some of its lattice points are shifted to the neighboring nodes 1 and 3 (Figure 4-b).

Figure 4: Example of data remapping: the lattice-point distribution before and after remapping.

There are three important issues that we need to address for remapping. Our solutions for the first and the third issues are different from what has been proposed in the load balancing literature [42].

- Performance prediction. Each machine needs to predict how long it would take to process its existing lattice points in the next phase, based on its performance in the previous phases. Earlier results have found that CPU load can be predicted using simple models [9], and that future load is closer to the most recent data [46, 13]. Our finding is that prediction that depends mainly on the performance data of the most recent phase can result in excessive data migration (we call this migration oscillation) when the cluster sharing pattern changes rapidly. Our design choice is to use the harmonic average of the processing times of the last several phases, which minimizes the chance of excessive remapping.

- Global vs. local information exchange. We consider two different approaches for dynamic data remapping, i.e., global or local information exchange. In the global approach, load information is exchanged among all nodes and a global decision is made to distribute lattice points evenly among all nodes, such that all nodes are expected to complete in the same time before the next phase starts.

  The major problem with this global solution is that the synchronization overhead can be significant. This is because group communication among all nodes is required for making a global decision. Also, a node may have to shift data to multiple destinations or to receive new lattice points from multiple sources. Since the load prediction may not be accurate, the benefits of global load balancing guided by prediction may not pay off, considering the excessive communication overhead involved.

  We have followed the previous work in distributed load sharing [42], where balancing is decided locally among neighboring nodes.

- Conservative redistribution vs. over-redistribution. When fast nodes and slow nodes are identified among neighboring nodes and lattice point migration is planned, a decision must be made on the number of lattice points to be transferred. The previous approach for general load balancing using local information exchange [42] has taken a conservative view: when a node is considered light and a heavily-loaded node needs to shift a certain amount of load, typically only a fraction of that amount is actually transferred. The reason is that a light node may be considered "light" by everybody, and this node could be overloaded by receiving load from multiple sources independently.

  In our setting, with a linear array of processors and with neighbor data dependence in the LBM synchronization, we find that over-redistribution, which aggressively shifts points, performs much better than conservative redistribution. We discuss the reason below.

We call our overall approach filtered dynamic lattice point mapping. Our design choices are based on the following strategies.

- Lazy remapping. Frequent re-balancing of the load can hurt overall performance in two ways. First, it can cause excessive data migration when the load changes rapidly. Second, remapping can introduce an unbalanced load if the estimation is not accurate, and can lead to greater communication overhead.

  Our strategy in designing the filtered approach is to remap lazily, which reduces the synchronization cost and also allows the system to respond carefully to usage spikes in a cluster. We use this strategy in two aspects of our filtered remapping method: 1) we use the average load index of the last several phases to guide prediction, instead of using the load index of the most recent phase only; 2) we avoid moving data points from a fast node to a slow node even if the slow node has many fewer lattice points. The reason is that slow nodes can result in sluggish communication, and the benefit of exploiting slow nodes for their computing power is not significant considering that communication to these nodes can be very slow.

- Over-redistribution of lattice points from confirmed slow nodes. When a node is detected to be slow with high confidence, it is better to minimize the use of such a node, since it not only processes data at a low speed but also drags down the entire computation through sluggish communication. Once the system finds that a node is really slow, our strategy is to aggressively shift data from the slow node to faster nodes.


3.4 Filtered Dynamic Remapping

Performance prediction. Let the execution times for the last n phases at a node x be t_x^1, t_x^2, \ldots, t_x^n. The predicted time (load index) for the next phase, T_x, is defined as the harmonic average

T_x = \frac{n}{\frac{1}{t_x^1} + \frac{1}{t_x^2} + \cdots + \frac{1}{t_x^n}}.

Thus, if there is a load spike during the last phase, no migration will be made unless this machine has really been slow for the last n phases (in our experiments, n = 10).
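A minimal sketch of this predictor (our illustration, with assumed names) is:

/* Load index: harmonic average of the last n per-phase execution times.
 * A single transient spike in one phase barely changes the prediction. */
double predict_phase_time(const double *times, int n)
{
    double inv_sum = 0.0;
    for (int i = 0; i < n; i++)
        inv_sum += 1.0 / times[i];
    return (double)n / inv_sum;   /* n = 10 in our experiments */
}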

Load index exchange and remapping condition checkup. Only neighboring nodes in the linear array of cluster nodes communicate. We describe the formula for making a migration decision among three neighboring nodes; the formula is similar for the first and last nodes in the linear array.

Given three neighboring nodes k-1, k, and k+1, we decide whether some data points should be moved from node k to k+1 as follows. Let N_{k-1}, N_k, N_{k+1} be the numbers of lattice points on nodes k-1, k, k+1, and let T_{k-1}, T_k, T_{k+1} be their predicted times, respectively. Let N'_{k-1}, N'_k, N'_{k+1} be the intended numbers of lattice points after remapping. We define the processing speed of node k as

S_k = \frac{N_k}{T_k}.

The intention of remapping is to balance these three nodes in the next phase through this local information exchange on behalf of node k. The objective is

\frac{N'_{k-1}}{S_{k-1}} = \frac{N'_k}{S_k} = \frac{N'_{k+1}}{S_{k+1}} = \frac{N_{k-1} + N_k + N_{k+1}}{S_{k-1} + S_k + S_{k+1}}.

N'_{k-1}, N'_k, N'_{k+1} can be computed from the above equation. When N'_{k+1} > N_{k+1}, the number of data points that need to be remapped from node k to k+1 is N'_{k+1} - N_{k+1}. The condition N'_{k+1} > N_{k+1} can be rewritten as

\frac{N_{k-1} + N_k + N_{k+1}}{S_{k-1} + S_k + S_{k+1}} > T_{k+1}.

A similar condition can be derived for the tentative amount of points redistributed from node k to k-1.

Remapping decision and over-redistribution. An additional consideration is whether we should actually move \Delta N_{k+1} points from node k to k+1, where \Delta N_{k+1} = N'_{k+1} - N_{k+1}. The following remapping condition needs to be met: \Delta N_{k+1} is greater than a threshold and S_{k+1} > S_k. Namely, we do not move a small number of points, and we do not move points from a fast node to a slow node. In the experimental setting discussed in Section 4, the minimal unit of migration is one 2D plane of size 200 x 20 for a microchannel with 400 x 200 x 20 lattice points; thus we use 4,000 as the threshold.

If node k is considered to be slow and redistribution of lattice points from k is necessary, we shift data aggressively from this slow node to a faster node. Given three consecutive nodes k-1, k, and k+1, let S_{k-1}, S_k, and S_{k+1} be their processing speeds, respectively. Instead of \Delta N_{k+1}, \alpha \Delta N_{k+1} lattice points will be redistributed from node k to k+1. The scaling factor we use is \alpha = S_{k+1}/S_k.

In some rare cases, the local information exchange among nodes k-1, k, and k+1 concludes that node k needs to shift points to k+1, while node k+1 concludes, based on the local information exchange among nodes k, k+1, and k+2, that it needs to shift points from k+1 to k. In this case, a conflict resolution step is performed between nodes k and k+1 to redistribute a proper amount of lattice points.
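The following sketch summarizes our reading of the decision above for the right-hand neighbor; the function name points_to_shift_right and the constant MIGRATION_THRESHOLD are illustrative assumptions rather than the authors' code.

/* Decide how many lattice points node k should shift to node k+1.
 * n[0..2]: lattice-point counts of nodes k-1, k, k+1.
 * t[0..2]: predicted per-phase times of nodes k-1, k, k+1.
 * Returns 0 if the filtered remapping condition is not met. */
#include <math.h>

#define MIGRATION_THRESHOLD 4000   /* one 200 x 20 plane in the test problem */

long points_to_shift_right(const long n[3], const double t[3])
{
    double s[3];                           /* processing speeds S_i = N_i / T_i */
    for (int i = 0; i < 3; i++)
        s[i] = (double)n[i] / t[i];

    double total_n = (double)(n[0] + n[1] + n[2]);
    double total_s = s[0] + s[1] + s[2];

    /* Balanced target: node k+1 should hold S_{k+1} * total_N / total_S points. */
    double delta = s[2] * total_n / total_s - (double)n[2];

    /* Filter: move only a worthwhile amount, and never toward a slower node. */
    if (delta <= MIGRATION_THRESHOLD || s[2] <= s[1])
        return 0;

    /* Over-redistribution: scale by the receiver-to-sender speed ratio. */
    double alpha = s[2] / s[1];
    return (long)floor(alpha * delta);
}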

4 Simulation of Fluid Slip and Parallel Performance

We performed the simulation of fluid slip in a 0.1 x 1 x 2 micron microchannel (Figure 5 shows a diagram of the microchannel). The grid spacing is 5 nm. The nondimensional hydrophobic wall force used in the simulation is 0.2; the corresponding physical force decays exponentially away from the walls. The appropriate magnitude for this force is not well understood. For the current simulation, the force function was chosen so that the simulation results would be consistent with experimental observations. The main simulation results are as follows.

Figure 5: Diagram of the microchannel. The flow is in the x direction; the channel length (x) is 2 microns, the width (y) is 1 micron, and the depth (z) is 0.1 micron.

We will first discuss the simulation results, which match the experimental observations well. Then we will present the performance results for the parallel simulation of this fluid slip instance.

4.1 Simulation Results

Figure 6 shows fluid densities close to the two side walls as a function of distance from the side wall at the cross-section x = 1 micron and z = 50 nm. The horizontal axis is the density, and the vertical axis is the distance from the side wall. Figure 6 (A) shows the density of the fluid used to simulate water in the model along the y-direction (in the middle of the z direction) on a cross-section in the middle of the channel (x-direction). Figure 6 (B) shows the density of the fluid used to simulate water vapor/air. We can see that the density of water is decreased and that of water vapor/air is increased close to the walls. Sakurai et al. [28] have also observed a drastic decrease of the water molecule number density at a monolayer-water interface in molecular dynamics simulations of water between hydrophobic surfaces. Our results are consistent with theirs.

Figure 7 shows the normalized streamwise velocity profile, and a local blowup, along the y-direction at the cross-section x = 1 micron for z = 50 nm. The horizontal axis is the normalized velocity, and the vertical axis is the position from the side wall (unit: microns). The solid line (in A and B) is the velocity profile when no wall forces are present. The dotted line (in part A), or the dashed line (in part B), is the case where wall forces are introduced.

Figure 6: Fluid densities near the side wall along the y-direction, as a function of distance from the side wall, at the cross-section x = 1 micron and z = 50 nm. The region close to the side wall is shown. The horizontal axis is the density, and the vertical axis is the distance from the side wall (nanometers). Graph (A) is the density profile for water (g/cm3); graph (B) is the density profile for air/water vapor (x 10^-4 g/cm3).

In contrast to the former case, the latter results in apparent slip at the walls (see Figure 7 (B) for the local blowup near the side wall). We can see from Figures 6 and 7 that in the region very close to the walls, the water density decreases and the water vapor/air density rises. This enables fluid slip at the walls, compared to the solid lines in Figure 7, which illustrate the case where no hydrophobic wall forces were applied.

4.2 Parallel Simulation with Filtered Dynamic Lattice Mapping

We have conducted parallel simulation of the above fluid flow instance on a Linux cluster of 32 PC nodes, each with 3 GB RAM and dual 2.6 GHz Xeon processors, connected by a Gigabit Ethernet switch. For the simulation setting we have used 400 x 200 x 20 lattice points with slice decomposition on 20 cluster nodes, i.e., a 20 x 200 x 20 lattice for each node. The total running time for this problem with 20,000 LBM steps (phases) on a single machine is 43.56 hours. We chose this problem size because it allows us to repeat various experiments quickly.

Figure 7: Normalized streamwise velocity profiles (u/u0) along the y-direction at the cross-section x = 1 micron for z = 50 nm. The solid lines are the velocity profiles when no wall forces are used; the dotted or dashed lines are the case where wall forces are introduced. Graph (A) is the normalized velocity as a function of distance from the side wall (y-direction, microns). Graph (B) shows the normalized velocity profile near the side channel wall.

Simulation at higher resolution would take a much longer time.³

With a dedicated cluster, our parallel code achieves almost full linear speedup as the number of nodes varies; the speedup is 18.97 with 20 nodes. Our focus in the experiments is to study the performance of filtered dynamic lattice mapping and to compare it with other approaches under two types of workload. 1) Fixed slow nodes: a fixed set of nodes is shared with other jobs; in this setting, a background job runs on each selected node, taking 70% of the CPU resource. 2) Transient spikes: a node is randomly chosen and a background job is executed that takes 70% of the CPU resource for a length varying from 1 to 4 seconds; after 10 seconds, another node is randomly chosen to execute a background job. This random process repeats every 10 seconds.

³A real application can require about 500,000 LBM phases to reach the steady state. Our current application setting represents the simplest possible geometry, the smallest physical dimensions, and the coarsest resolution necessary to study the slip phenomenon. To study the effects of slip in a typical MEMS device would require substantially more computation.

Figure 8: Speedup (left) and normalized efficiency (right) with 20,000 phases, plotted against the number of slow nodes (0 to 5), with and without remapping.

4.2.1 Overall Performance with Fixed Slow Nodes

The left part of Figure 8 shows the speedup of the parallel code with filtered dynamic remapping when the number of slow nodes varies from 0 to 5. The speedup is defined as the execution time of the sequential program divided by the execution time of the parallel program. When the cluster is dedicated, the speedup is close to 19. When there is one slow node, the speedup is about 16, whereas the parallel code without dynamic remapping performs poorly with more slow nodes and its speedup drops dramatically. With five slow nodes, the parallel code with filtered dynamic remapping can still achieve a speedup of 13.

We use the normalized efficiency, defined as speedup/(20 - 0.7m), to examine the utilization of this non-dedicated cluster with m slow nodes, in which a background job takes 70% of the CPU resource at each slow node. The right part of Figure 8 shows the normalized efficiency metric, indicating that our code achieves very good resource utilization: about 90% when the number of slow nodes is less than four, and 80% for five slow nodes. In comparison, the utilization is low without remapping.
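As a rough check under this definition, with five slow nodes the effective capacity is 20 - 0.7 x 5 = 16.5 node-equivalents, so the measured speedup of 13 corresponds to a normalized efficiency of 13/16.5, or about 0.79, consistent with the roughly 80% figure above.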

4.2.2 Execution profiling of different approaches with one fixed slow node

We provide the execution profile and cost distribution of different approaches with one slow node, which allows us to understand the behavior and effectiveness of the proposed optimization strategies.

Figure 9: Execution profile and cost distribution (computation, communication, and remapping time per node, in seconds) for the Dedicated, No-remap, Conservative, and Filtered schemes, for 600 phases.

The following three approaches are compared with filtered remapping: 1) Dedicated scheme, where there is no slow node and dynamic data remapping is disabled; 2) No-remapping scheme, where one node is slow and dynamic data remapping is not used; 3) Conservative scheme, which is the same as the filtered scheme except that over-redistribution is not used.

The results are shown in Figure 9 for 20 cluster nodes. This experiment runs 600 phases. With 20 dedicated nodes, the computation takes about 251 seconds. When no remapping is used and node 9 is a shared node, the total time increases from 251 seconds to 717 seconds, an increase of 185.6%. The "ripple" effect of a slow node drags down the entire computation significantly.

The remapping method with conservative redistribution balances the computation nicely, but it does not effectively reduce the synchronization overhead caused by sluggish communication on node 9. The filtered approach aggressively redistributes the lattice points of node 9 to other nodes and uses only 313.0 seconds. The overall increase in parallel time is only 24.7% compared to the dedicated case. This method reduces the time of non-remapping by 56.3% and that of conservative redistribution by 39%. From this figure, one can see that filtered remapping moves most of the lattice points from node 9 to its neighbors using the over-redistribution strategy, and then shifts these points further to other nodes. The slow node (node 9) accounts for most of the time spent in communication in this LBM-based simulation.

Finally, it should be noted that the cost of remapping in both the filtered and conservative schemes is low, as the profile shows. This is because both of them use lazy remapping.

4.2.3 Comparison of different approaches with multiple slow nodes

We further compare filtered remapping with conservative redistribution and no remapping when the number of slow nodes varies from zero to five. In this comparison, we also include the performance of remapping using global information exchange, which tries to reassign lattice points to nodes in proportion to their speeds. This global method employs lazy remapping but does not use the over-redistribution strategy.

Results are shown in Figure 10. The filtered approach performs best. It outperforms the conservative approach by up to 39%, and the no-remapping method by up to 57.8%. Global remapping performs well with one slow node; however, once the number of slow nodes exceeds two, the global method is worse than the others. The reason is that the global approach incurs too much communication overhead, and slow nodes still take on too many lattice points.

4.2.4 Tolerance of transient spikes

In this experiment, we compare how different algorithms tolerate transient load spikes caused by background jobs executed on randomly selected nodes every 10 seconds, as discussed previously.

Figure 10: Execution time of 600 phases versus the number of slow nodes (0 to 5) for the no-remapping, filtered remapping, conservative remapping, and global remapping techniques.

The number of LBM phases for this experiment is 100.

Table 1 lists the execution slowdown ratios of the four approaches compared to the dedicated case. The results show that the filtered, conservative, and no-remapping methods have similar performance and are much better than global remapping. No remapping performs well because every node has an equal chance of becoming slow, so there is no benefit to rebalancing. The filtered method uses the lazy remapping strategy, which can effectively tolerate the transient spikes. The conservative redistribution algorithm uses the same lazy remapping strategy and thus performs similarly. The performance of the global approach is worse than the others because it involves too much global synchronization, which is sensitive to the existence of slow nodes.

Spike length   No-Remap   Global   Filtered   Cons.
1 s             7.4%       5.8%     6.7%      10.9%
2 s            11.9%      37.2%    15.6%      16.0%
3 s            23.7%      40.9%    23.3%      24.9%
4 s            35.6%      49.5%    38.1%      39.8%

Table 1: Slowdown ratio of different algorithms compared to the dedicated case for various lengths of disturbance, for 100 phases. Spike length is in seconds.

5 Concluding Remarks

We have parallelized the multicomponent lattice Boltzmann method for simulation of fluid slip in a microchannel. While domain decomposition can parallelize the code with a fairly good speedup in a dedicated environment, the long computation time and the synchronization overhead at each phase of this computation require us to improve load balancing through dynamic data remapping. Our experiments have shown that the proposed filtered remapping technique can effectively redistribute the workload among computing nodes and significantly reduce the overhead resulting from slow nodes in a partially heavily loaded cluster. For the tested cases, filtered remapping outperforms no remapping by up to 57.8%, and by even more compared to a global strategy. The over-redistribution strategy used in filtered remapping improves performance by up to 39% compared to a conservative approach.

Previous work in load balancing has studied process migration and scheduling independent of application characteristics, e.g. [10, 42, 45]. Another category of research in load balancing has focused on application-specific strategies. For example, Hamdi and Lee [12] studied load redistribution for parallel image processing with a centralized approach. The main contribution of our load balancing work is a filtered approach that uses lazy remapping and over-redistribution of data from slow nodes, exploiting the application characteristics of LBM-based simulation.

Prediction of CPU usage has been studied in [44, 9, 46]. In a study of Unix load from a large set of computational settings, Dinda and O'Hallaron [9] concluded that load is consistently predictable with simple linear models. In [46], it was shown that giving more weight to the most recent data can improve the accuracy of load predictions. For fluid slip, instead of relying on the most recent performance data, we have used the harmonic average of the last several sampled times for usage prediction. This helps to minimize the impact of transient load spikes.


References

[1] J. Barrat and L. Bocquet. Large slip effect at a nonwetting fluid-solid interface. Phys. Rev. Lett., 82:4671, 1999.

[2] P. L. Bhatnagar, E. P. Gross, and M. Krook. A model for collision processes in gases, I: small amplitude processes in charged and neutral one-component systems. Phys. Rev., 94:511, 1954.

[3] N. F. Bunkin, O. A. Kiseleva, A. V. Lobeyev, and T. G. Movchan. Effect of salts and dissolved gas on optical cavitation near hydrophobic and hydrophilic surfaces. Langmuir, 13:3024-3028, 1997.

[4] D. Chandler. Two faces of water. Nature, 417:491, May 2002.

[5] S. Y. Chen, H. D. Chen, and W. M. D. Martinez. Phys. Rev. Lett., 67:3776, 1991.

[6] C. H. Choi, K. J. A. Westin, and K. S. Breuer. Apparent slip flows in hydrophilic and hydrophobic microchannels. Submitted, 2003.

[7] N. Churaev, V. Sobolev, and A. Somov. Slippage of liquids over lyophobic solid surfaces. J. Colloid Interface Sci., 97:574, 1984.

[8] D. H. Rothman and S. Zaleski. Lattice gas models of phase separation: interfaces, phase transitions and multiphase flow. Rev. Mod. Phys., 66:1417, 1994.

[9] P. Dinda and D. O'Hallaron. An Evaluation of Linear Models for Host Load Prediction. In Proc. of HPDC-8, 1999.

[10] E. Frachtenberg, D. Feitelson, F. Petrini, and J. Fernandez. Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources. In IPDPS 2003, April 2003.

[11] D. Grunau, S. Y. Chen, and K. Eggert. A lattice Boltzmann model for multiphase fluid flows. Phys. of Fluids A, 5:2557, 1993.

[12] M. Hamdi and C.-K. Lee. Dynamic Load Balancing of Data Parallel Applications on a Distributed Network. In Proc. of the 9th International Conference on Supercomputing, pages 170-179, 1995.

[13] M. Harchol-Balter and A. B. Downey. Exploiting Process Lifetime Distributions for Dynamic Load Balancing. In Proceedings of the 1996 ACM SIGMETRICS, pages 13-24, May 1996.

[14] X. He and L. S. Luo. Theory of the lattice Boltzmann method: from the Boltzmann equation to the lattice Boltzmann equation. Phys. Rev. E, 56:6811, 1997.

[15] R. G. Horn, O. I. Vinogradova, M. E. Mackay, and N. Phan-Thien. Hydrodynamic slippage inferred from thin film drainage measurements in a solution of nonadsorbing polymer. J. Chem. Phys., 112(14), April 2000.

[16] S. Hou. Lattice Boltzmann Method for Incompressible Viscous Flow. PhD thesis, Kansas State University, 1995.

[17] D. M. Huang and D. Chandler. The hydrophobic effect and the influence of solute-solvent attractions. J. Phys. Chem., 106:2047-2053, 2002.

[18] D. Kandhai, A. Koponen, A. Hoekstra, M. Kataja, J. Timonen, and P. Sloot. Lattice-Boltzmann Hydrodynamics on Parallel Systems. Computer Physics Communications, 111(1-3):14-26, 1998.

[19] K. Lum, D. Chandler, and J. D. Weeks. Hydrophobicity at small and large length scales. J. Phys. Chem., 103:4570-4577, 1999.

[20] D. Lumma, A. Best, A. Gansen, F. Feuillebois, J. O. Radler, and O. Vinogradova. Flow profile near a wall measured by double-focus fluorescence cross-correlation. Phys. Rev. E, 112(14), 2000.

[21] L. S. Luo. Unified theory of the lattice Boltzmann models for nonideal gases. Phys. Rev. Lett., 81:1618, 1998.

[22] L. Maibaum and D. Chandler. A coarse-grained model of water confined in a hydrophobic tube. J. Phys. Chem., 107:1189-1193, 2003.

[23] N. S. Martys and H. Chen. Simulation of multicomponent fluids in complex three-dimensional geometries by the lattice Boltzmann method. Physical Review E, 53(1):743, 1996.

[24] R. Pit, H. Hervet, and L. Leger. Direct experimental evidence of slip in hexadecane: solid interfaces. Phys. Rev. Lett., 85:980, 2000.

[25] Y. H. Qian. Lattice Gas and Lattice Kinetic Theory Applied to the Navier-Stokes Equations. PhD thesis, University Pierre et Marie Curie, Paris, 1990.

[26] E. Ruckenstein and P. Rajora. On the no-slip boundary condition of hydrodynamics. J. Colloid Interface Sci., 96:488, 1983.

[27] S. Y. Chen and G. D. Doolen. Lattice Boltzmann method for fluid flows. Annu. Rev. Fluid Mech., 30:329, 1998.

[28] M. Sakurai, H. Tamagawa, K. Ariga, T. Kunitake, and Y. Inoue. Molecular dynamics simulation of water between hydrophobic surfaces. Implication for the long-range hydrophobic force. Chem. Phys. Lett., 289:567, 1998.

[29] N. Satofuka and T. Nishioka. Parallelization of lattice Boltzmann method for incompressible flow computations. Comput. Mech., 23:164, 1999.

[30] X. Shan and H. Chen. Lattice Boltzmann model for simulating flows with multiple phases and components. Physical Review E, 47(3):1815, 1993.

[31] X. Shan and H. Chen. Simulation of nonideal gases and liquid-gas phase transitions by the lattice Boltzmann equation. Phys. Rev. E, 49:2941, 1994.

[32] X. Shan and G. D. Doolen. Multicomponent lattice Boltzmann model with interparticle interaction. J. Stat. Phys., 81:379, 1995.

[33] P. A. Skordos. Parallel Simulation of Subsonic Fluid Dynamics on a Cluster of Workstations. In Proc. of HPDC-4, 1995.

[34] T. Suviola. Parallelization of a lattice Boltzmann suspension flow solver. Applied Parallel Computing, Lecture Notes in Computer Science, 2367:603, 2002.

[35] P. A. Thompson and S. M. Troian. A general boundary condition for liquid flow at solid surfaces. Nature, 389(6649):360-362, 1997.

[36] D. C. Tretheway and C. D. Meinhart. Apparent fluid slip at hydrophobic microchannel walls. Physics of Fluids, 14(3), 2002.

[37] D. C. Tretheway and C. D. Meinhart. A generating mechanism for apparent fluid slip in hydrophobic microchannels. Submitted to Physics of Fluids, August 2003.

[38] O. I. Vinogradova. Possible implications of hydrophobic slippage on the dynamic measurements of hydrophobic forces. J. Phys.: Condens. Matter, 8:9491, 1996.

[39] O. I. Vinogradova. Slippage of water over hydrophobic surfaces. Int. J. Miner. Process., 56:31-60, 1999.

[40] K. Watanabe, Yanuar, and H. Mizunuma. Slip of Newtonian fluids at solid boundary. JSME Int. J., 41:525, 1998.

[41] K. Watanabe, Yanuar, and H. Mizunuma. Drag reduction of Newtonian fluid in a circular pipe with a highly water-repellent wall. J. Fluid Mech., 381:225, 1999.

[42] M. H. Willebeek and A. P. Reeves. Strategies for Dynamic Load Balancing on Highly Parallel Computers. IEEE Trans. on Parallel and Distributed Systems, 4(9), 1993.

[43] D. A. Wolf-Gladrow. Lattice-Gas Cellular Automata and Lattice Boltzmann Models: An Introduction. Springer, Berlin, 2000.

[44] R. Wolski, N. Spring, and J. Hayes. Predicting the CPU Availability of Time-shared Unix Systems on the Computational Grid. In Proc. of HPDC-8, 1999.

[45] C. Xu, F. Lau, and R. Diekmann. Decentralized Remapping of Data Parallel Applications in Distributed Memory Multiprocessors. Concurrency: Practice and Experience, 9(12):1351-1376, Dec. 1997.

[46] L. Yang, I. Foster, and J. M. Schopf. Homeostatic and Tendency-based CPU Load Predictions. In Proceedings of IPDPS 2003, April 2003.

[47] L. Zhu, D. Tretheway, L. Petzold, and C. Meinhart. Simulation of fluid slip at hydrophobic microchannel walls by the lattice Boltzmann method. Submitted to J. Comput. Phys., 2003.

[48] Y. Zhu and S. Granick. Rate-dependent slip of Newtonian liquid at smooth surfaces. Phys. Rev. Lett., 87, 2001.
