
Performance assessment of four cluster interconnects on identical hardware: hints for cluster builders

H. Pourreza, M.R. Eskicioglu and P.C.J. Graham*
Department of Computer Science, University of Manitoba, Winnipeg, MB, Canada R3T 2N2
E-mail: {pourreza,rasit,pgraham}@cs.umanitoba.ca
*Corresponding author

Abstract: With the current popularity of cluster computing, the importance of understanding the capabilities and performance potential of the various network interconnects they use is growing. In this paper, we extend work done by other researchers by presenting the results of a performance analysis of multiple network interconnects running identical applications on identical cluster nodes. We made repeated, isolated timing runs of a number of standard cluster computing benchmarks (including the NAS parallel benchmarks (Bailey et al., 1991) and the Pallas benchmarks (Pallas Corp., 1999)) as well as some real-world parallel applications on first and second generation Myrinet (Boden et al., 1995), SCI (IEEE, 1992), Fast (100 Mbps) and Gigabit (1000 Mbps) Ethernet, and we report the results. Such results are particularly valuable for computational scientists who are now, thanks to the low cost of such systems and the growing availability of relatively simple cluster tools (e.g. (Ferri, 2002)), able to build their own clusters but who may have relatively limited technical expertise concerning the available cluster interconnects.

Keywords: High performance network interconnects, benchmarking, NAS, NPB.

Reference to this paper should be made as follows: Pourreza, H., Eskicioglu, M.R. and Graham, P.C.J. (xxxx) ‘Performance assessment of four cluster interconnects on identical hardware: hints for cluster builders’, Int’l. J. High Performance Computing and Networking, Vol. x, No. x, pp.xxx–xxx.

Biographical notes: Hossein Pourreza is a PhD candidate, Dr. Eskicioglu is an Assistant Professor and Dr. Graham is an Associate Professor, all in the Computer Science department at the University of Manitoba.

1 INTRODUCTION

Great interest has arisen over the last 5 to 10 years in the use of clusters for High Performance Computing (HPC). Since the first Beowulf paper (Becker et al., 1995) was published, a huge number of researchers who previously could not afford to exploit parallel computing in their research have decided to build and learn to program their own cluster-based parallel computers. Originally, the compute nodes in such clusters were connected using simple networking hardware (sometimes channel bonding multiple network interface cards (NICs) together to increase the available bandwidth). As the popularity of clusters has grown, a number of special purpose, high-bandwidth and/or low-latency networks (e.g. Myrinet, SCI and GigaNet) have been developed and are now being actively used in clusters to expand the range of parallel problems that may be solved using them. Some of these special purpose interconnects are quite expensive while others are less so. At the same time, the cost of commodity “fast” Ethernet has become extremely low and the cost of gigabit Ethernet has now decreased to where it is affordable for almost all cluster builders. The variety of available interconnects makes their performance evaluation an important issue for the computational scientists (e.g. physicists, chemists and engineers doing physical systems simulations) who are assembling their own compute clusters.

Unfortunately, much of the work done in this area has, in some sense, been “comparing apples to oranges”. Most existing performance analyses have been forced to compare results for interconnects from different cluster systems because any given cluster seldom has more than one or two interconnects available. In this paper, we present the results of an “apples vs. apples” comparison of four interconnects, all running the same applications on the same cluster nodes. This makes our results particularly valuable in the comparison of these specific cluster interconnects since there are no confounding factors due to the use of different cluster nodes and/or systems software.

The rest of this paper is organized as follows. In Section 2 a review of key related work in cluster performance assessment is provided. Section 3 describes the pertinent characteristics of the environment in which the experiments were done. The experiments themselves and our experimental methodology are described in Section 4. A discussion of the results obtained is provided in Section 5. Section 6 ends the paper by presenting some conclusions and directions for future work.

2 RELATED WORK

There have been a number of papers written on the subject of cluster performance. These papers have had many different goals, ranging from comparisons with SMP machines to the assessment of specific cluster components (as is done in this paper). An unfortunate characteristic of many of the papers is that the results are often narrow or, in some sense, incomplete. This is due, in part, to the relative newness and evolving nature of cluster computing and the interconnects being used. It is also due to the differing viewpoints of the papers’ authors and their reasons for doing the underlying research. We do not attempt to provide a complete survey of work related to cluster performance assessment but, instead, seek to highlight the types of performance analyses done, explain briefly why they are of limited usefulness and set the stage for a better assessment of cluster interconnect performance as is provided in this paper.

Many previous papers on cluster performance evaluation have focused on the use of clusters for either single applications or specific purposes. For example, Banikazemi et al. (2001) focus on the use of cluster system interconnects for implementing the TreadMarks (Keleher et al., 1994) software distributed shared memory (DSM) system. While valuable for researchers in specific fields, such papers are not of much use to computational science researchers wanting to build their own clusters since the results reported seldom apply to a wide range of applications. Greater breadth is necessary to help cluster builders select the appropriate components with which to build their parallel compute clusters.

Another group of papers has reported the comparative performance of a variety of different cluster systems for one or more applications. Such papers are typified by, for example, the work of Lan and Deshikachar (2003) and van der Steen (2003). Unfortunately, the results from such studies are also of limited use to new cluster builders since it is difficult or sometimes even impossible to factor out the effects of differences in processors, motherboards, and other system components used in the different cluster systems. Results for specific cluster systems have, we believe, relatively short lifetimes. We prefer to evaluate the performance of specific components (in this paper, the interconnection network) to produce specific results which are likely to be of value for longer periods and for a broader range of parallel applications.

We also believe that there is no substitute for doing performance analysis on real hardware running real applications. A relatively large number of simulation-based papers have also been written addressing various aspects of cluster performance, including some comparing the performance of specific network interconnects. For example, Chen et al. (2000) describe a simulation, using OpNet (http://www.opnet.com), that compares Myrinet and Gigabit Ethernet for use as cluster interconnects. By definition, simulations abstract away unimportant details. Unfortunately, such abstraction may lead to results that are missing relevant details if the simulation is not very carefully designed. For example, ignoring the impact of processor to NIC transfer speed when using a low-latency network such as Myrinet may significantly skew the simulation results. It is all too easy to inadvertently fail to consider such fine-granularity details, especially when using a large, well-known simulation system such as OpNet or NS-2, where we are often tempted to implicitly trust the simulator’s implementation details. If the implementation details fail to capture all the necessary characteristics, inaccurate results may follow. Chen et al.’s work does not appear to suffer from this sort of problem, and they do provide a cost-benefit analysis for the two networks they simulate, which is of great potential value for prospective cluster builders.

The performance analysis we report in this paper is most similar in character to earlier work done by Hsieh et al. (2000), who describe a comparison of GigaNet and Myrinet interconnects for small-scale clusters of SMP nodes based, primarily, on programs from the NAS parallel benchmarks (NPB) (Bailey et al., 1991). This work is similar to ours although, in this paper, we choose not to exploit the SMP nature of our cluster nodes for some experiments (since multiprocessor cluster nodes are still relatively uncommon) and we do not report results using GigaNet for reasons described later. Instead, we provide results for Myrinet, SCI (Scalable Coherent Interface), and commodity fast and gigabit Ethernet networks. Hsieh et al. generated their results using GigaNet and Myrinet on identical cluster hardware running meaningful test programs (the NPB benchmarks are derived from real-world programs). Such results are directly relevant and of great value to cluster builders.

The results reported in this paper improve on existing related work in the following ways:

• They are based on timing runs done on real hardware using real programs.

• The cluster node hardware is identical for all runs using all network interconnects. This means that the results do not have to be carefully “interpreted” to account for possible performance variations due to differences in system components other than the network interconnect.

• We report on four different network interconnects rather than only two, and the performance evaluation work on which the paper is based actually considers (to varying extents) six different network interconnects.

• In addition to the key benchmark suites used in other papers, we also tested a number of real-world applications. This allows us to be more confident that we are not seeing only results that can be expected of particularly well-tuned benchmark code. Results for real applications are potentially more indicative of what new cluster users can expect from their applications than results from well-established benchmarks.

3 THE EXPERIMENTAL ENVIRONMENT

All the experiments were conducted using an 8 node (16 processor) Linux cluster (running RedHat 9.0, kernel 2.4.18smp and gcc 3.2.2) in the Parallel and Distributed Systems Lab (PDSL) at the University of Manitoba. Each node contains dual Pentium III 550MHz processors with 512MB of shared SDRAM memory and local disks. (All I/O activity in the experiments was done to local disks to remove the effects of NFS access.) Each node also has first and second generation Myrinet, GigaNet, fast Ethernet, gigabit Ethernet and point to point SCI (Dolphin WulfKit) network interface cards (NICs). All NICs except the SCI NICs are connected to dedicated switches. The SCI NICs are interconnected in a 4x2 mesh configuration. The fast Ethernet network is also used for “maintenance” traffic (i.e. NFS, NIS, etc.) but, during the experiments, steps were taken to ensure that such traffic would be minimal and thus would not have a serious impact on the timing results obtained using the fast Ethernet network. Although the nodes are dual processor, except where explicitly noted, only a single processor was used for running the applications whose timing results are reported in this paper. Only results from the best performing version of MPI that was available for each interconnect are reported. While it should have had little or no bearing on the outcome of the timing runs, for completeness, we note that the cluster host node was a 2.4GHz Pentium IV with 1GB of DDRAM memory and a 200GB hard disk.

Unfortunately, while each cluster node has a GigaNet interconnect, we are unable to present results for this interconnect here because we could not obtain access to a suitable MPI implementation for GigaNet. Additionally, specific results using the second generation Myrinet NICs are omitted. This is because the motherboards in the cluster nodes do not support 64-bit transfers, which is one of the key benefits of second generation Myrinet. Through experimentation, we discovered that without such wide transfers, there is only a negligible difference between the two generations of Myrinet in the cluster we used to run the experiments. We plan to report results with second generation Myrinet (as well as InfiniBand and Quadrics) in a future paper after a cluster upgrade.

4 METHODOLOGY

The bulk of the applications run in our performance evaluation were from the NAS Parallel Benchmarks (NPB) (Bailey et al., 1991) version 2.4, and from the Pallas MPI Benchmarks (PMB) (Pallas Corp., 1999) version 2.2.1. Additionally, we selected certain “real-world” applications (ones not from any common parallel benchmark suite) to increase our confidence in the results obtained using the benchmarks. Among these applications were PSTSWM (http://www.csm.ornl.gov/chammp/pstswm/), a shallow water model commonly used as part of larger Global Climate Models (GCMs), and Gromacs (Lindahl et al., 2001) (http://www.gromacs.org/), a molecular dynamics package.

The NPB benchmark suite consists of a number of parallel programs implemented using MPI for execution on clusters that are intended to exhibit a range of communication patterns and intensity. Some of the benchmark programs are extremely compute bound while others are communication bound. All programs are derived from real-world code (either being largely complete applications or what the benchmark designers see as key software “kernels”). Each program also has multiple “classes” which correspond to the size/complexity of the data being operated on (with class A being the smallest and class C the largest) or which correspond to other characteristics of the problem instances for specific benchmarks. Among other things, the different classes allow for the assessment of parallel performance as the problem size increases. The specific programs that are reported on in this paper are: CG (Conjugate Gradient), FT (an FFT-based partial differential equation solver), IS (Integer Sort), LU (LU Decomposition), MG (Multi Grid) and SP (a solver of multiple independent systems of equations). The program EP (Embarrassingly Parallel), which requires no inter-process communication, was also run during our tests but the results are as expected and not of interest in a paper assessing the performance of network interconnects. The interested reader is referred to (Bailey et al., 1991) for additional information about the programs that constitute the NAS parallel benchmarks.

All applications were run 16 times (using their default parameters) and the results averaged to produce the results reported here. All timings were done in isolation from all other work and logins. (No other applications were running while the timings were being taken, no other users were allowed to be logged in, and the operating systems on all nodes were running only cluster-essential software daemons.) Both public domain compilers (e.g. gcc and g77 version 3.2.2) and the corresponding Intel compilers for the IA32 (i.e. Pentium family) of processors were available for use, but the results reported here are only for programs compiled using the GNU compilers. This decision was made since we felt that the GNU compilers would be more generally available to, and hence more likely to be used by, most cluster builders. The only impact of using the Intel compilers would be to increase the computational speed of the applications, thereby increasing the gap between the speed of computation and that of communication and making the applications more communication bound. In general, using compilers that can produce higher quality code will slightly increase a cluster user’s desire to have a higher performance network interconnect.
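As a simple illustration of the averaging step, the following sketch computes the mean and the run-to-run variation over a set of repeated wall-clock timings. The paper itself states only that the 16 runs were averaged; the timing values and the helper name summarize below are hypothetical.

/* Illustrative helper for the averaging step: given the wall-clock times of
 * the 16 repeated runs, report the mean and the run-to-run variation.
 * (The paper reports averages; the variance check is our own addition.) */
#include <math.h>
#include <stdio.h>

#define NRUNS 16

static void summarize(const double t[NRUNS])
{
    double sum = 0.0, sq = 0.0;
    for (int i = 0; i < NRUNS; i++) { sum += t[i]; sq += t[i] * t[i]; }
    double mean = sum / NRUNS;
    double var  = sq / NRUNS - mean * mean;      /* population variance */
    printf("mean %.3f s, std dev %.3f s (%.1f%% of mean)\n",
           mean, sqrt(var), 100.0 * sqrt(var) / mean);
}

int main(void)
{
    /* Hypothetical timings (seconds) for one benchmark/interconnect pair. */
    double runs[NRUNS] = { 41.2, 40.9, 41.5, 41.0, 41.3, 40.8, 41.1, 41.4,
                           41.0, 41.2, 40.9, 41.3, 41.1, 41.2, 41.0, 41.1 };
    summarize(runs);
    return 0;
}

A small relative standard deviation across the repeated runs (as in this made-up example) is what justifies reporting the mean alone.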

5 RESULTS

In this section we present some results from our experiments using a number of graphs showing the performance of a selection of benchmarks and applications running under various conditions over the available cluster interconnects. We divide the presentation of the results into a number of sub-sections to focus our discussion and to improve the clarity of presentation.

5.1 Basic Performance Results

We begin by presenting results from the Pallas benchmarks that help to characterize the base performance of the various interconnects on the experimental cluster using MPI. These include results for bandwidth and latency as well as MPI’s MPI_Allreduce operation.

Figure 1 shows the results of using the Pallas Ping-Pong benchmark to assess interconnect bandwidth. This benchmark assesses bandwidth by exchanging packets whose sizes vary from 0 to 4MB. The reported bandwidth also includes the overhead of using MPI (i.e. raw link bandwidth would be expected to be higher than this). It is noteworthy that for smaller messages (up to about 10K bytes) SCI offers better results whereas for larger messages, Myrinet is preferable. This behaviour can be explained by the greater raw bandwidth provided by SCI and the fact that the cost of its point to point rather than switched structure does not factor into the simple Ping-Pong benchmark (which operates only between two directly connected cluster nodes).

Figure 1: Bandwidth results using the Pallas Ping-Pong benchmark (bandwidth in Mbps vs. message size in bytes for fast Ethernet, gigabit Ethernet, Myrinet and SCI).

Figure 2: Latency results using the Pallas Ping-Pong benchmark (latency in microseconds vs. message size in bytes, log-scale axes, for the same four interconnects).

The latency of each network, calculated again using the Pallas Ping-Pong benchmark, is shown in Figure 2 (which uses log-scale axes). In this case, the size of the data transmitted varied from 1 byte to 4MB. It is noteworthy that the latency incurred using Myrinet and SCI is significantly below that of the other interconnects. This is expected and would be important for cluster users whose applications exhibit a high communication to computation ratio, particularly if the messages exchanged are small. This observation is also consistent with the bandwidth results already presented.
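For readers unfamiliar with the Ping-Pong test, the following is a minimal sketch of the idea in C with MPI (not the Pallas code itself; the message sizes, repetition count and output format are illustrative only): rank 0 sends a buffer to rank 1 and waits for it to be echoed back, so half the round-trip time gives the one-way latency and the bytes moved per unit time give the bandwidth.

/* Minimal ping-pong microbenchmark in the spirit of the Pallas test.
 * Run with at least two MPI processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int reps = 100;
    for (int size = 1; size <= 4 * 1024 * 1024; size *= 2) {
        char *buf = malloc(size);
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {            /* send, then wait for the echo */
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {     /* echo the message back */
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double elapsed = MPI_Wtime() - start;
        if (rank == 0) {
            double latency_us = (elapsed / reps) / 2.0 * 1e6;  /* one-way */
            double mbps = (2.0 * size * reps * 8.0) / elapsed / 1e6;
            printf("%9d bytes  %10.2f usec  %9.2f Mbps\n",
                   size, latency_us, mbps);
        }
        free(buf);
    }
    MPI_Finalize();
    return 0;
}

Note that, as with the Pallas figures above, such a measurement includes the full MPI software stack, which is exactly what the application-level results in the rest of the paper see.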

Finally, Figure 3 shows the results of using MPI’s MPI_Allreduce operation with 8 processors for message sizes varying from 4 bytes to 4MB. MPI_Allreduce is an important, fundamental operation since it is commonly used in many parallel MPI programs. The results, as expected, show that the high-performance, low-latency interconnects (SCI and Myrinet) consistently outperform the commodity Ethernet networks. This effect is, again, most important for supporting frequent, small messages.

Figure 3: MPI_Allreduce results with 8 processors (time in microseconds vs. message size in bytes for the four interconnects).
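The structure of such a collective measurement is straightforward; a hedged sketch (again not the Pallas code, with an illustrative message size and repetition count) corresponding to one data point in Figure 3 might look like this:

/* Time MPI_Allreduce at one message size. The count and repetition
 * values are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int count = 1024;                 /* 1024 doubles = 8 KB per call */
    const int reps  = 100;
    double *in  = malloc(count * sizeof(double));
    double *out = malloc(count * sizeof(double));
    for (int i = 0; i < count; i++) in[i] = rank + i;

    MPI_Barrier(MPI_COMM_WORLD);            /* start all ranks together */
    double start = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Allreduce(in, out, count, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double avg_usec = (MPI_Wtime() - start) / reps * 1e6;

    if (rank == 0)
        printf("MPI_Allreduce of %d doubles on %d ranks: %.1f usec per call\n",
               count, nprocs, avg_usec);

    free(in);
    free(out);
    MPI_Finalize();
    return 0;
}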

5.2 Results from the NAS Parallel Benchmarks

In this subsection we present the results of the key benchmarks we ran from the NAS parallel benchmark suite (NPB version 2.4). We begin by presenting results (in Figure 4) obtained by running the three different classes (i.e. problem sizes) of the CG benchmark on the entire 8 node cluster for each of the four interconnects. The results show consistent behaviour across the problem sizes, with only minor variance between classes and highly consistent relative performance for the various interconnects at all sizes. Similar results were obtained for the other NPB benchmark programs. As expected, fast Ethernet is significantly slower than the other interconnects. It is also noteworthy that with CG (a realistic, application-oriented benchmark) the overhead of the point to point routing used in the SCI interconnect begins to hurt its performance relative to, in particular, Myrinet. For this application, the performance of SCI is nearly indistinguishable from that of the much more cost-effective gigabit Ethernet.

Figure 4: Performance of CG (in MFLOPS) on 8 processors for benchmark classes A, B and C on each of the four interconnects.

We now turn to presenting individual results for the various NPB benchmark programs for a single problem class. (We have selected class A for presentation, arbitrarily, because of the largely similar results obtained for the other classes.) Results are shown in Figures 5, 6, 7, 8 and 9 for the CG, FT, IS, LU and MG benchmarks (respectively) running on each of the four interconnects. All of the graphs show results normalized to the performance of fast Ethernet.
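The normalization used in these graphs is, we assume, the obvious one (the paper states only that results are normalized to fast Ethernet); for an interconnect "net" at processor count p,

\mathrm{NormalizedMFLOPS}_{\mathrm{net}}(p) \;=\; \frac{\mathrm{MFLOPS}_{\mathrm{net}}(p)}{\mathrm{MFLOPS}_{\mathrm{FastEth}}(p)}

so fast Ethernet always appears at 1.0 and a value of, say, 1.5 means the interconnect delivered 50% more MFLOPS than fast Ethernet at the same processor count.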

Figure 5: Performance of CG (class A) on different numbers of processors (normalized MFLOPS on 2, 4 and 8 processors for each interconnect).

Figure 6: Performance of FT (class A) on different numbers of processors (normalized MFLOPS on 2, 4 and 8 processors for each interconnect).

The benchmarks FT, IS and CG (to some extent) are communication bound (Wong et al., 1999) and, accordingly, the difference between fast Ethernet and Myrinet (or SCI) is more evident. LU and MG, on the other hand, are compute bound and, as a result, we see very similar performance for all the interconnects. As expected, the differences between the interconnects become more pronounced as the number of processors, and hence the amount of communication, increases. Due to the large message sizes (in some cases several megabytes (Wong et al., 1999)), Myrinet outperforms SCI noticeably.

Figure 7: Performance of IS (class A) on different numbers of processors (normalized MFLOPS).

Figure 8: Performance of LU (class A) on different numbers of processors (MFLOPS).

Figure 9: Performance of MG (class A) on different numbers of processors (normalized MFLOPS).

5.3 Results Using “Real” Applications

While the NPB benchmarks are, in general, based on real code and are widely accepted, there is no substitute for doing performance assessment using real applications. In addition to the fact that real applications may have different characteristics from “well-tuned” benchmark code, they are also complete in that they include the entire program, including I/O and other “hard to parallelize” code. If such applications display different performance results then they are clearly useful, but even if the results obtained are consistent with the benchmark results they are valuable since they increase user confidence in the overall set of results. We now present performance results obtained using two real-world parallel applications and compare those results to the core NPB results just presented.

Figures 10 and 11 show the impact of increasing the number of processors available to run the Gromacs parallel program (results are shown in normalized MFLOPS). Figure 12 shows the running time (in seconds) of the PSTSWM program for a small problem size. We were unable to run the program for medium or large problem sizes because of memory constraints on the cluster nodes.

Figure 10: Performance of Gromacs (DPPC) on different numbers of processors.

Figure 11: Performance of Gromacs (Villin) on different numbers of processors.

Figure 12: Performance of PSTSWM (small problem size) on different numbers of processors (running time in seconds).

It is worth noting that the results from these real-world applications are largely consistent with those from the NAS parallel benchmark suite. This adds further support to the generally held belief that the NPB suite accurately reflects the characteristics of a wide range of parallel programs (or at least those typified by computation and communication patterns similar to those of the tested applications). It would therefore seem that cluster builders wanting to evaluate cluster configurations, but who do not yet have their application(s) parallelized, can rely on the NPB suite fairly safely.

5.4 Network Speedups

In this subsection, we turn our attention to assessing expected program speedup using the various interconnects for a range of parallel programs with varying communication characteristics.
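Throughout this subsection, speedup is used in the usual sense (the paper does not restate the definition); for p processors,

S(p) \;=\; \frac{T(1)}{T(p)}, \qquad E(p) \;=\; \frac{S(p)}{p}

where T(p) is the execution time on p processors and E(p) is the parallel efficiency. The ideal speedup is S(p) = p; super-linear speedup (S(p) > p) is possible in practice, typically because the per-processor working set shrinks enough to fit in cache.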

Figures 13 and 14 show the speedup for two significantly different (in terms of communication behaviour) class A programs from the NPB suite. FT (Figure 13) is communication bound and its speedup is observed to be highly sub-linear, especially on fast Ethernet. LU (Figure 14), on the other hand, is compute bound and shows linear (and in some cases super-linear) speedup.

Figure 13: Speedup of the FT benchmark (class A) on the different interconnects, with the ideal speedup shown for reference.

Figure 14: Speedup of the LU benchmark (class A) on the different interconnects, with the ideal speedup shown for reference.

Speedups for the NAS benchmarks on each specific interconnect (as well as the ideal possible speedup) are shown in Figures 15, 16, 17 and 18. The BT and SP benchmarks require a square number of processors, so only the results for 4 and 16 processors are shown. (Of course, this means that the results from running on 16 processors will also include any SMP effects, since the test bed cluster consists of only 8 dual-processor machines.)

Figure 15: Speedup of the NAS parallel benchmarks (BT, CG, FT, IS, LU, MG and SP) on fast Ethernet.

Considering these figures, it is clear that, again, fast Ethernet performs significantly worse than the other interconnects. Despite this, its performance does yield fairly reasonable speedups for a number of applications, as shown in Figure 15. It is also interesting to note that for several applications, the faster interconnects (in some cases including gigabit Ethernet) provide quite close to ideal speedups. Perhaps most importantly, considering Figures 15 through 18 it is clear that the variance in the achievable speedup across different benchmark programs is far greater for the commodity interconnects (with, naturally, fast Ethernet being the worst). This means that for cluster builders who are unable to characterize the behaviour of their programs in advance, a high performance, low latency interconnect is the safest, albeit more expensive, choice. (Of course, it is always preferable to be able to assess your applications directly on specific hardware before committing to purchasing expensive equipment.)

Figure 16: Speedup of the NAS parallel benchmarks (BT, CG, FT, IS, LU, MG and SP) on gigabit Ethernet.

Figure 17: Speedup of the NAS parallel benchmarks (BT, CG, FT, IS, LU, MG and SP) on Myrinet.

Figure 18: Speedup of the NAS parallel benchmarks (BT, CG, FT, IS, LU, MG and SP) on SCI.

5.5 Impact of the use of Dual Processor (SMP) Nodes

An increasingly common trend in building low cost clusters is to build CLUMPS (i.e. “CLUsters of MultiProcessorS”). This has the advantage of amortizing the relatively high cost of the NICs over a larger number of processors. Of course, it also has the disadvantage of forcing the provided bandwidth to be shared by the multiple processors. In cases where the communication to computation ratio is high, this may make CLUMPS less appealing, but for many cluster builders with more moderate communication demands the ever decreasing cost and increasing prevalence of 2-way and 4-way SMP nodes makes them an appealing option. Accordingly, it is important to consider such systems in any interconnect performance evaluation.

Figures 19, 20 and 21 show the performance of the NPB LU and FT benchmarks and the Gromacs application (respectively) when running on 1, 2, 4 and 8 dual processor SMP cluster nodes. The impact of SMP on the FT benchmark is much more evident than with the other two programs, which suggests that the benefits to be had by exploiting SMP cluster nodes will vary with application characteristics. Certainly, applications that are communication bound stand to benefit more from the use of SMP nodes since the communication that takes place between processes running on processors in the same SMP node can be optimized by the MPI implementation. This is, however, not a guarantee of better performance in all cases since having some processes complete early only to wait (for synchronization purposes) on other, still-running processes will offer little benefit in terms of improvement in overall application execution time.

Figure 19: SMP effect on the LU benchmark (class A), MFLOPS on 1, 2, 4 and 8 dual processor nodes.

Figure 20: SMP effect on the FT benchmark (class A), normalized MFLOPS on 1, 2, 4 and 8 dual processor nodes.

Figure 21: SMP effect on the Gromacs (Villin) application, MFLOPS on 1, 2, 4 and 8 dual processor nodes.

5.6 Communication Characteristics of the “Real” Applications

One of the most important characteristics that must be considered in evaluating an application’s likely benefit from using a given cluster interconnect is its communication behaviour. This is most commonly characterized by the number and size of messages exchanged, per process, during its execution. (Although other factors, including message frequency and aggregate message size, should often also be considered.) The following four figures illustrate the messaging behaviour of the two real-world applications considered in this paper in terms of message size and number.
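The paper does not say how these per-process message counts and sizes were gathered; one common way is to interpose on MPI calls through the MPI profiling interface (PMPI). The sketch below is our own illustration under that assumption: it counts only MPI_Send traffic and would need additional wrappers (MPI_Isend, MPI_Sendrecv, the collectives) to cover a real application.

/* Minimal PMPI-based message counter: intercept MPI_Send, tally the number
 * of messages and bytes, and report per-rank averages at MPI_Finalize.
 * Illustrative only; real codes also use non-blocking sends and collectives. */
#include <mpi.h>
#include <stdio.h>

static long msg_count  = 0;
static long byte_count = 0;

int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(datatype, &type_size);
    msg_count  += 1;
    byte_count += (long)count * type_size;
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stderr, "rank %d: %ld messages, average size %.1f KB\n",
            rank, msg_count,
            msg_count ? (byte_count / 1024.0) / msg_count : 0.0);
    return PMPI_Finalize();
}

Linking such a wrapper ahead of the MPI library (or preloading it as a shared object) lets the counts be gathered without modifying the application source.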

Figure 22 shows the average message size sent by processes in the Gromacs program when run on different numbers of processors. Note that, due to technical limitations of the cluster used, the ‘D.lzm’ component of the Gromacs program could not be run with 16 processors, which is why the corresponding trace on the graph is abbreviated. The message size decreases across the graph because the size of the data being operated on by each process decreases, so the amount of information that must be exchanged between processes (i.e. the message size) also decreases accordingly. This decrease in message size is necessarily accompanied by a corresponding increase in the number of messages sent since there are more processes (running on more cluster nodes) sending the smaller messages. This phenomenon is clearly shown in Figure 23. Again, the line for the ‘D.lzm’ component is necessarily truncated. Figures 24 and 25 show related results for the PSTSWM program. Such communication behaviour for expected problem sizes should clearly be considered in making an interconnect selection decision.

Figure 22: Average message size (Kbytes) sent by processes in the Gromacs application (D.villin, D.lzm and D.dppc) vs. number of processors.

Figure 23: Number of messages sent by processes in the Gromacs application (D.villin, D.lzm and D.dppc) vs. number of processors.

Figure 24: Average message size (Kbytes) sent by processes in the PSTSWM application (small and medium problem sizes) vs. number of processors.

Figure 25: Number of messages sent by processes in the PSTSWM application (small and medium problem sizes) vs. number of processors.

6 CONCLUSIONS AND FUTURE WORK

In this paper we presented the results of a comprehensive performance assessment of four different interconnects used in a small-scale cluster of dual processor SMP nodes. The applications used to do the performance assessment included both “well-tuned” parallel benchmarks (e.g. NPB 2.4) and real-world applications (e.g. Gromacs). Because the results are very general in nature, we believe that they will be of immediate use to cluster builders in selecting an appropriate cluster interconnect technology for their intended application(s).

A general characteristic of the benchmarks/applications used in our analyses is that they tend to use coarse grained communication to reduce messaging overhead. This is because, traditionally, network bandwidth has been relatively low and latency has been high. As a result, the benefits of very low latency interconnects such as SCI and Myrinet are, perhaps, understated in the results presented. While such networks are more expensive than gigabit and fast Ethernet, the added hardware cost may be paid for by savings in programmer time and effort during program parallelization.

The MPI library for gigabit Ethernet uses TCP/IP and incurs the related overhead, compared to, for example, the use of the GM library for MPI over Myrinet. Nevertheless, in many applications the results obtained using gigabit Ethernet in our small cluster are comparable to the higher-end interconnects (Myrinet and SCI). Unfortunately, the limited size of our cluster did not allow us to properly investigate the scalability aspects of the interconnects. This is a particular concern with gigabit Ethernet where (in traditional Ethernet applications) it is common to “chain” switches together. If large, dedicated gigabit Ethernet switches are available, we expect that they should provide scalability which is directly comparable to other switched interconnects (e.g. Myrinet and GigaNet).

Overall, our results suggest that gigabit Ethernet would normally make a cost-effective and appropriate choice of interconnect for most cluster builders. This excludes, perhaps, those who expect their applications to exhibit very aggressive communication and who should therefore consider a faster interconnect (e.g. Myrinet, SCI, GigaNet, etc.). Although fast Ethernet is cheap and readily available and, in the presented results, showed reasonable performance (at least for compute bound applications), we conclude that it is no longer a good choice for building clusters. This is particularly true when the rapidly decreasing cost and significantly better performance of gigabit Ethernet (as demonstrated in this paper) are considered.

Finally, our experiments also show that, at least for the applications we considered (Gromacs and PSTSWM), the performance results of special-purpose parallel benchmarks such as NPB are indeed accurate indicators of expected cluster-based parallel processing performance. Hence, cluster builders can, in most cases, as expected, safely rely on the results of these benchmarks.

Our performance evaluation of cluster interconnects is ongoing. In the short term, we will expand our work to include results for GigaNet, to assess the performance of a wider range of real-world applications (including some which have had only limited or naive parallelization) and to quantify the effects of using more efficient compilers. In the long term, once we have upgraded our cluster, we will revise our results and assess the impact of both significantly faster processors and the availability of the 64-bit host-to-NIC transfers supported by the newer Myrinet (and other) NICs. Finally, we will do a separate assessment of scalability on a larger cluster and carefully integrate those results with our revised results.

ACKNOWLEDGEMENT

This research was supported in part by the Natural Sciences and Engineering Research Council of Canada under grant number OGP-0194227.

REFERENCES

Bailey, D. H., Barszcz, E., Barton, J. T., Browning, D. S., Carter, R. L., Dagum, L., Fatoohi, R. A., Frederickson, P. O., Lasinski, T. A., Schreiber, R. S., Simon, H. D., Venkatakrishnan, V., and Weeratunga, W. K. (1991). The NAS Parallel Benchmarks. Intl. Journal of Supercomputing Applications, 5(3):66–73.


Banikazemi, M., Liu, J., Panda, D. K., and Sadayappan, P. (2001). Implementing TreadMarks over Virtual Interface Architecture on Myrinet and Gigabit Ethernet: Challenges, Design Experience, and Performance Evaluation. In Proc. of Int’l Conf. on Parallel Processing (ICPP’01), pages 167–174.

Becker, D. J., Sterling, T., Savarese, D., Dorband, J. E., Ranawak, U. A., and Packer, C. V. (1995). Beowulf: A Parallel Workstation for Scientific Computation. In Proc. of the 24th Intl. Conference on Parallel Processing, pages 11–14.

Boden, N., Cohen, D., Felderman, R., Kulawik, A., Seitz, C., Seizovic, J., and Su, W. (1995). Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, 15(1):29–36.

Chen, H., Wyckoff, P., and Moor, K. (2000). Cost/Performance Evaluation of Gigabit Ethernet and Myrinet as Cluster Interconnect. In Proc. OPNETWORK 2000.

Ferri, R. (2002). The OSCAR Revolution. Linux Journal, (98):90–94.

Hsieh, J., Leng, T., Mashayekhi, V., and Rooholamini, R. (2000). Architectural and Performance Evaluation of GigaNet and Myrinet Interconnects on Clusters of Small-Scale SMP Servers. In Proc. of the 2000 ACM/IEEE Conference on Supercomputing (CDROM), page 18.

IEEE (1992). IEEE Standard for Scalable Coherent Interface (SCI). IEEE Std 1596-1992.

Keleher, P., Dwarkadas, S., Cox, A. L., and Zwaenepoel, W. (1994). TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proc. of the Winter 94 Usenix Conference, pages 115–131. USENIX.

Lan, Z. and Deshikachar, P. (2003). Performance Analysis of a Large-Scale Cosmology Application on Three Cluster Systems. In Proc. of the 2003 IEEE International Conference on Cluster Computing (Cluster’03), pages 56–63.

Lindahl, E., Hess, B., and van der Spoel, D. (2001). GROMACS 3.0: a package for molecular simulation and trajectory analysis. Journal of Molecular Modelling, 7(98):306–317.

Pallas Corp. (1999). The Pallas MPI Benchmarks. [Online 1999]. Available at: http://www.pallas.com/e/products/pmb/index.htm.

van der Steen, A. J. (2003). An Evaluation of Some Beowulf Clusters. Cluster Computing, 6(4):287–297.

Wong, F., Martin, R., Arpaci-Dusseau, R., and Culler, D. (1999). Architectural Requirements and Scalability of the NAS Parallel Benchmarks. In Proc. of the 1999 ACM/IEEE Conference on Supercomputing (CDROM), page 41.
