Design Flow of a Dedicated Computer Cluster Customized for a Distributed Genetic Algorithm Application




Design Flow of a Dedicated Computer Cluster Customized for a Distributed Genetic Algorithm Application

Alexandra Aguiar, Márcio Kreutz, Rafael Santos and Tatiana Santos

Universidade de Santa Cruz do Sul, Santa Cruz do Sul, RS, Brazil

[email protected], {kreutz, rsantos, tatianas}@unisc.br

Abstract— In the past few years, computer grids and clusters of computers have been widely used to keep up with the high computational performance required by high-end applications. They are especially attractive due to their good performance at relatively low cost compared to powerful servers and supercomputers. The same scenario is found in the embedded world, where highly specialized tasks are usually partitioned among dedicated processors composing distributed systems. This work focuses on the architectural specialization of cluster machines by analysing application behavior and optimizing instruction-set architectures. The motivation for this work relies on the observation that tasks found in embedded software present a behavior that can normally be implemented by a subset of an instruction-set architecture. This opens opportunities for optimization by removing the instructions that are not needed. As a consequence, processors become specialized, since their resources better fit the performance and power consumption constraints of the application.

This work proposes a design flow to adapt cluster machines to the constraints of embedded applications, where high flexibility and performance are achieved by hardware customization and its subsequent distribution. Moreover, a case study with the entire design flow description as well as the synthesis results is presented.

I. INTRODUCTION

Typically, the performance of applications can be improved either by distributing their tasks among processing elements or by specializing them to dedicated hardware.

Both approaches have pros and cons. Some applications are more naturally distributed in software, while others are more likely to fit a hardware implementation. This clearly depends on the nature of the application. In recent years, however, the use of computer clusters and grids has become an attractive alternative for the execution of applications that demand high computational power, because of their price-performance ratio. These systems often meet the processing demands of most distributed applications.

This work takes as its main goal finding a compromise between both approaches: processors are connected by a network fabric in a cluster fashion, while their instruction sets are targeted to dedicated tasks of embedded applications.

The memory model used in cluster systems requires that applications exchange messages through a networked pool of nodes [1]. Even though current Gigabit Ethernet cards are found at relatively low prices and offer significantly higher bandwidth, some applications do not map well to this model.

On top of that, despite the fact that network cards are constantly improving, some applications also need resources at the processor level that are not found in commodity microprocessors. This occurs especially in applications for which a hardware implementation would be the best approach. Nevertheless, even in such cases distribution would certainly help to increase performance.

So, a combination of dedicated processors coupled with high-speed networks may be an interesting alternative for those applications that need specific processor resources and high-speed communication capabilities.

In this work, the design flow of an application-specific computer cluster machine is presented, where processors are customized according to the constraints of embedded applications and integrated with an Ethernet-based communication stack. The proposed design flow targets applications that allow a smooth and efficient distribution, having tasks that can easily be fitted into small, optimized processors.

Finally, this research provides the hardware infrastructure to connect the nodes as well as a design flow to build the application and integrate the system. In other words, this work proposes a development flow and infrastructure which allow optimized processors to execute a given distributed and/or parallel application.

The remainder of this paper is organized as follows: Section II presents related work, while Section III presents in detail the concept of a cluster targeted to application-specific constraints. Section IV describes a case study with a distributed genetic algorithm. Section V describes how synthesis and validation were performed in this research, while Section VI shows the results achieved. Finally, Section VII presents the conclusions, final remarks and future work.

II. RELATED WORK

Previous works have studied the use of application-specific devices in cluster systems, as these devices have become more popular over the years.

Yeh et al. [2] proposed the use of FPGAs to build a switch fabric. Jones et al. [3] included a reconfigurable off-the-shelf computing card in each node of a cluster.


Other implementations, such as Dandalis et al. [4], proposed the use of reconfigurable network cards to implement specific protocols such as IPsec. Sass et al. [5] proposed the use of an intelligent network card (INIC) capable of processing messages and injecting them into the network, alleviating the pressure on the processor in order to enable full exploitation of the bandwidth and latencies of modern networks. Underwood et al. [6] then presented a cost analysis of an Adaptable Computing Cluster based on the INIC project.

More recently, Jacob et al. [7] proposed the CARMA framework for reconfigurable clusters as a tool for managing different configuration schemes, and Willians et al. [8] presented a reconfigurable cluster-on-chip architecture and supporting libraries for developing multi-core reconfigurable systems on chip using the MPI (Message Passing Interface) standard.

The dedicated cluster architecture proposed in this research differs from the other projects in two main aspects. First, our approach aims at customizing the processor itself as a function of application constraints. This is done through an automated flow which generates an optimized microprocessor starting from a Java application. The microprocessor is tailored for the application, enabling the optimizations necessary for that application. Furthermore, only the hardware resources needed by the application are synthesized.

Second, the generated microprocessor is integrated with the communication module (a TCP/IP stack), which implements transmit/receive buffers that can be accessed at the same frequency as the processor. The latter allows full exploitation of high-speed network communication.

The proposed design flow allows the programmer to concentrate on the development of the application, i.e., the algorithm. Thus, the use of an automated flow enables fast development at a higher abstraction level not found in other projects.

The result is a tightly coupled device which integrates the processor and communication into one single FPGA. It is important to mention that the goal is not to propose a device that will replace conventional cluster nodes (PCs), nor a reconfigurable node that is complementary to a PC host (as opposed to the work discussed earlier [3][4][6][5]).

On the other hand, the goal is to enable fast development of distributed applications that require customized processors in order to achieve small area, low power and high performance through processor instruction-set and network latency optimization. It is also important to highlight that, once the application is partitioned, each cluster node may comprise a highly optimized processor, according to the task previously allocated to it. The allocation of partitioned tasks is assumed to be done at an earlier design stage and is not within the scope of this research.

It is also important to consider that the focus of this work, at its current development status, is a proof of concept for a design flow devoted to generating optimized cluster architectures, not the development of a complete cluster machine comprising an arbitrary number of nodes.

III. DESIGNING A DEDICATED CLUSTER NODE

One of the main goals of this work is to provide a design flow aimed at designing optimized cluster machines by tuning their instruction sets according to the constraints of embedded software tasks. Figure 1 presents the design flow proposed in this work.

The first step concerns the FemtoJava core generation through the SASHIMI tool [9]. This core is then integrated with the communication infrastructure provided by the LwIP library (TCP/IP stack) and the integration layer, as described in steps (2) and (3).

Fig. 1. Design flow for the dedicated cluster development

The following sections discuss each step of this flow in detail.

A. Creating a FemtoJava Core

The first step toward the dedicated cluster implementation is the application definition and its subsequent distribution. The target application must be implemented in the Java language, according to the SASHIMI tool constraints.

The SASHIMI tool is an environment which synthesizes applications described in the Java language into application-specific VHDL microcontrollers. Thus, the main advantage of using the SASHIMI tool is the automatic generation of a microcontroller adapted for a given application described in such a high-level language.

Thus, the tool automatically verifies which instructions used in the Java description are essential to the hardware implementation and then generates the customized FemtoJava microcontroller. The system was initially proposed in [9], but it was later improved and the newest version can generate cores with pipeline and VLIW support. More details about the tool and the constraints for writing synthesizable Java code may also be found in [9].
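To make the tool's input concrete, the fragment below is a hypothetical Java class in the restricted style that such a synthesis flow typically accepts (static methods, integer arithmetic, fixed-size arrays); the class and method names are illustrative and are not taken from the SASHIMI distribution.

    // Hypothetical, synthesis-friendly Java task: static methods, int
    // arithmetic and fixed-size arrays only. Only the bytecode instructions
    // actually used here would have to be supported by the generated core.
    public class FitnessTask {

        // Weighted sum of a candidate solution; needs only loads, stores,
        // integer add/multiply, compare and branch instructions.
        static int evaluate(int[] candidate, int[] weights) {
            int score = 0;
            for (int i = 0; i < candidate.length; i++) {
                score += candidate[i] * weights[i];
            }
            return score;
        }

        public static void main(String[] args) {
            int[] candidate = {1, 0, 1, 1};
            int[] weights = {3, 5, 2, 7};
            System.out.println(evaluate(candidate, weights)); // prints 12
        }
    }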


The idea of using an automatic flow to implement the embedded core is mainly due to the fast development cycle it provides. Also, this alternative was used in order to make the implementation of the dedicated cluster feasible even for researchers with little knowledge of hardware design, as the application may be described directly in the Java language. The integration layer was also developed in order to easily integrate the TCP/IP stack and the FemtoJava core. Any variation in the application affects only this step of the flow. For each application, a new core must be developed using the SASHIMI tool, but the remaining modules do not change, since they present a proper interface to connect with any FemtoJava core created by the SASHIMI tool.

So, this first step provides the customized core, which is later integrated with the remaining blocks (TCP/IP stack and integration layer). The following sections discuss both the integration layer and the TCP/IP stack.

B. The Integration Layer

In order to complete the communication infrastructure, some additional logic is necessary to provide synchronization between the TCP/IP stack and the FemtoJava core. This logic, developed in VHDL, is implemented mainly through buffers and Finite State Machines (FSMs).

Thus, when the FemtoJava core needs to communicate, it sends a request to the FSM responsible for the communication between the FemtoJava and the TCP/IP stack. This FSM handles the request and places the data to be sent in a FIFO buffer (send FIFO). A second FSM, responsible for sending, reads the new data placed in the buffer and sends it to the stack. The stack packs the data and sends it over the network.

The receiving process is similar, but it starts when new data is unpacked by the TCP/IP stack. The stack then sends a request to the FSM responsible for receiving the data. This FSM handles the request and places the data in a FIFO buffer (receive FIFO). The FSM responsible for the communication between the FemtoJava and the TCP/IP stack reads the received data and sends it to the FemtoJava core.
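The integration layer itself is written in VHDL; purely as a behavioral illustration of the handshake just described, the Java sketch below models the send path, with a bounded queue standing in for the send FIFO and two threads standing in for the two FSMs. All names are invented for the illustration.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Behavioral model only (not the actual VHDL): the send FIFO becomes a
    // bounded queue; the two FSMs become threads moving data core -> stack.
    public class SendPathModel {
        static final BlockingQueue<Integer> sendFifo = new ArrayBlockingQueue<>(16);

        // Stands in for the FSM servicing send requests from the FemtoJava core.
        static void coreRequest(int word) throws InterruptedException {
            sendFifo.put(word); // blocks when the FIFO is full, like a stalled handshake
        }

        // Stands in for the FSM that drains the FIFO towards the TCP/IP stack.
        static void senderFsm() {
            try {
                while (true) {
                    int word = sendFifo.take();
                    System.out.println("to TCP/IP stack: " + word); // placeholder for the real transfer
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Thread sender = new Thread(SendPathModel::senderFsm);
            sender.setDaemon(true);
            sender.start();
            for (int i = 0; i < 4; i++) coreRequest(i); // the "core" queues four words
            Thread.sleep(100);                          // let the sender drain the FIFO
        }
    }

The receive path mirrors this structure, with a second queue filled on the stack side and read on the core side.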

Figure 2 shows the structures of the integration layer between the FemtoJava core and the TCP/IP stack, which is detailed in the next section.

C. The TCP/IP Stack

The TCP/IP stack is implemented using the LwIP (Lightweight IP) communication library described in [10]. This library, written in the C language, runs on one of the PowerPC cores hardwired in the FPGA used as the development platform in this work.

The stack provided by the LwIP library supports transfer rates of 10/100 Mb/s and operates in full-duplex mode, since the send and receive entities are implemented to work independently. This research uses the LwIP library through its socket API functions, which allow thread programming in order to send and receive data while the FemtoJava core executes in parallel.
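On the node this is done with the LwIP socket API in C on the embedded PowerPC; as a language-neutral illustration of the same pattern (one thread sending, one receiving, around a single connection), the Java sketch below uses standard sockets. The peer address and port are placeholders.

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.net.Socket;

    // Illustration of the send/receive threading pattern described in the text;
    // the real node uses LwIP socket calls in C, not java.net.
    public class NodeLink {
        public static void main(String[] args) throws Exception {
            // Placeholder address/port of the neighbouring cluster node.
            try (Socket peer = new Socket("192.168.0.2", 5000)) {
                DataOutputStream out = new DataOutputStream(peer.getOutputStream());
                DataInputStream in = new DataInputStream(peer.getInputStream());

                // Sender thread: forwards locally produced data to the peer node.
                Thread sender = new Thread(() -> {
                    try {
                        for (int i = 0; i < 4; i++) out.writeInt(i);
                        out.flush();
                    } catch (Exception e) { e.printStackTrace(); }
                });

                // Receiver thread: consumes data arriving from the peer node.
                Thread receiver = new Thread(() -> {
                    try {
                        for (int i = 0; i < 4; i++) System.out.println("received " + in.readInt());
                    } catch (Exception e) { e.printStackTrace(); }
                });

                sender.start();
                receiver.start();
                sender.join();
                receiver.join();
            }
        }
    }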

It is important to notice that the physical layer is implemented by an ASIC [11] available on the development board.

So, the network packets are sent/received by this ASIC, which then passes them to LwIP. LwIP provides the data link, network and transport layers, passing the resulting TCP packets to the application layer, implemented directly in hardware by the FemtoJava core.

D. Integrating the Node

The infrastructure, i.e., the communication stack and integration layer, is ready to be connected to the embedded application previously developed in Java and synthesized into a VHDL FemtoJava microcontroller. As discussed before, this infrastructure does not vary should the embedded application change.

Figure 2 shows how the entire node is integrated, i.e., how all blocks are connected in order to implement the processing node. Basically, the FemtoJava core and the TCP/IP stack are both connected to the integration layer, which is responsible for synchronizing the communication between these two modules.

Fig. 2. System integration

Each node may be responsible for a given part of the application, according to the distribution already performed at the software level.

IV. CASE STUDY: A DISTRIBUTED EVOLUTIONARY ALGORITHM

A real scientific application was used to apply the flow and design a dedicated cluster machine.

This application, widely used in the pharmaceutical industry, uses spectrographic techniques in order to dose and further characterize anti-hypertensives [12]. These techniques, however, result in a large number of variables and, as a consequence, the dosage process, i.e., the determination of the necessary chemical elements, is very slow. This occurs because a combinatorial analysis is required to choose the best combination of variables among all.

An alternative to speed up this process is to employ Evolutionary Algorithms (EAs), since they have been recognized as a powerful approach to solving optimization problems [13]. Evolutionary Algorithms are inspired by simple models of biological evolution. They are known as robust optimization algorithms based on the collective learning process within a population of individuals.

Each individual represents a search point in the representation space R. By the iterative processing of an evolution loop consisting of selection, mutation, and recombination, a randomly initialized population evolves toward better regions of the search space. The fitness function f delivers the quality information necessary for the selection process to favor individuals with higher fitness for reproduction. The reproduction process consists of the recombination mechanism, responsible for the mixing of parental information, and mutation, which introduces undirected innovation into the population.
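As a minimal sketch of such an evolution loop (not the authors' implementation; fitness, operators and parameters are purely illustrative), one generation over bit-string individuals can be written in Java as:

    import java.util.Arrays;
    import java.util.Random;

    // Minimal EA sketch: tournament selection, one-point crossover, bit-flip
    // mutation over bit-string individuals, with a toy "count the ones" fitness.
    public class SimpleEA {
        static final Random rnd = new Random(42);

        static int fitness(int[] ind) {               // toy fitness function f
            return Arrays.stream(ind).sum();
        }

        static int[] tournament(int[][] pop) {        // selection favors higher fitness
            int[] a = pop[rnd.nextInt(pop.length)], b = pop[rnd.nextInt(pop.length)];
            return fitness(a) >= fitness(b) ? a : b;
        }

        static int[] crossover(int[] p1, int[] p2) {  // recombination mixes parental information
            int cut = rnd.nextInt(p1.length);
            int[] child = Arrays.copyOf(p1, p1.length);
            System.arraycopy(p2, cut, child, cut, p2.length - cut);
            return child;
        }

        static void mutate(int[] ind, double rate) {  // mutation adds undirected innovation
            for (int i = 0; i < ind.length; i++)
                if (rnd.nextDouble() < rate) ind[i] ^= 1;
        }

        public static void main(String[] args) {
            int[][] pop = new int[20][16];            // randomly initialized population
            for (int[] ind : pop)
                for (int i = 0; i < ind.length; i++) ind[i] = rnd.nextInt(2);

            for (int gen = 0; gen < 50; gen++) {      // evolution loop
                int[][] next = new int[pop.length][];
                for (int i = 0; i < pop.length; i++) {
                    int[] child = crossover(tournament(pop), tournament(pop));
                    mutate(child, 0.05);
                    next[i] = child;
                }
                pop = next;
            }
            System.out.println("best fitness: "
                    + Arrays.stream(pop).mapToInt(SimpleEA::fitness).max().getAsInt());
        }
    }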

The EA applied to the spectrographic technique used in this work was first developed in the Java language and was manually distributed according to the island model briefly discussed below.

In the island model, fully described in [14], each node has its own subpopulation and performs all typical tasks required by an EA (analysis/selection, mutation and recombination). This means that the total population of a sequential algorithm may be divided into smaller subpopulations distributed for execution among the different nodes available.

Figure 3 shows the island model distribution.

Fig. 3. EA distribution according to the island model

Thus, the EA implemented according to the island model performs the following tasks:

• Analysis – in this step, all population individuals are evaluated according to a given mathematical function;

• Crossover – in this step, the recombination of individuals is performed;

• Mutation – this part of the algorithm applies mutation to the current population, generating the new population;

• Migration – during this step, each node sends to and receives from the remaining nodes a given number of individuals. After this procedure, these new individuals are integrated into its own population (a sketch of this step follows the list).
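A minimal sketch of this migration step, assuming the same bit-string representation as in the previous sketch and an already established connection to the neighbouring island (the names and the emigrant-selection policy are illustrative only):

    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.util.Arrays;
    import java.util.Comparator;

    // Illustrative island-model migration: send a few of the best local
    // individuals to the neighbour and integrate the ones received in return.
    public class Migration {

        static void migrate(int[][] population, int emigrants,
                            DataOutputStream toNeighbour,
                            DataInputStream fromNeighbour) throws Exception {
            // Sort a shallow copy by fitness (count of ones) so the best emigrate.
            int[][] sorted = population.clone();
            Arrays.sort(sorted, Comparator.comparingInt((int[] ind) -> -Arrays.stream(ind).sum()));

            // Send the selected emigrants, gene by gene.
            for (int i = 0; i < emigrants; i++)
                for (int gene : sorted[i]) toNeighbour.writeInt(gene);
            toNeighbour.flush();

            // Receive the same number of immigrants and let them replace the
            // worst local individuals (the tail of the sorted copy).
            for (int i = 0; i < emigrants; i++) {
                int[] immigrant = new int[population[0].length];
                for (int g = 0; g < immigrant.length; g++) immigrant[g] = fromNeighbour.readInt();
                sorted[sorted.length - 1 - i] = immigrant;
            }
            System.arraycopy(sorted, 0, population, 0, population.length);
        }
    }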

Although both models were completely developed in software, previous works have shown that the island model is much more effective for this application [15]. Thus, only this model was developed for the cluster.

It is possible to observe that each island may be implemented in a different node. So, the dedicated cluster in this case is composed of the replication of several single nodes, implemented according to the methodology described here. After this process, the prototyped nodes must be connected through a traditional network and the embedded application is ready to run on the cluster.

As discussed before, after developing the distributed algorithm in the Java language, the SASHIMI tool [9] was used in order to generate the FemtoJava cores used in each node of the cluster. Those cores were later connected with the TCP/IP stack and integration layer in order to complete each node.

All synthesis and validation strategies are described in the next section.

V. SIMULATION, SYNTHESIS AND VALIDATION

After using the SASHIMI tool [9] to generate the FemtoJava core for the distributed EA, simulations were performed using Mentor Graphics ModelSim in order to validate the core. The reference data for the validation was extracted from the same Java model of the algorithm that was used as the SASHIMI tool input.

The TCP/IP stack as well as the integration layer were also simulated in Mentor Graphics ModelSim. The reference data, in this case, was extracted from a program written in C that generates network packets with known payload data. The core and the stack were then integrated and new simulations were performed using Mentor Graphics ModelSim.

The prototyping targeted the XUP-V2P board [16], designed by Digilent, Inc., which has a XC2VP30 Virtex-II Pro FPGA. Besides the powerful FPGA (with two hardwired PowerPC 405 processors), this board has SDRAM, as well as several other useful resources and interfaces such as RS-232, Ethernet, XSVGA output, and so on.

After the simulation, the synthesis and debugging/verification processes were started using the Xilinx ISE Foundation software [17], as the target device belongs to the Xilinx Virtex-II Pro FPGA family. Synthesis results are shown in the next section.

In order to validate the prototype, the Xilinx EDK Platform Studio tool was used to create the entire programming environment to support verification [18]. This includes an interface that connects the prototyped node to the RS-232 interface, which allowed the verification of the data produced by each node against that produced in simulation.

Thus, the bitstream produced by the Xilinx ISE Foundation software was downloaded to the FPGAs through a USB2 programming interface (JTAG). This download included the node as well as the verification interface generated by the Xilinx EDK Platform Studio. Also, the remaining available PowerPC processor was used to verify the prototyped nodes. The results generated by a given node were received by the PowerPC, which sent these data through the RS-232 interface to a host PC. Then, these results were compared with the results achieved by simulation.

VI. RESULTS

The following sections present a summary of the synthesis results as well as the performance achieved by the designed system.

A. Synthesis

The entire node, i.e., FemtoJava core, TCP/IP network stack and integration layer, was completely synthesized using the Xilinx ISE Foundation software for the XC2VP30 Virtex-II Pro FPGA. Furthermore, a non-distributed version of the application was also developed and synthesized to the FPGA. This second embedded system does not have any support for parallel execution, i.e., the application executes a sequential version of the algorithm and it was not integrated with the communication stack.

The post place-and-route synthesis results for both systems are summarized in Table I.

TABLE I
SYNTHESIS RESULTS (SYNTHESIS FOR XC2VP30)

                          Distributed      Sequential
Maximum Frequency         100.59 MHz       102.06 MHz
Power Consumption         1.1 W            1.1 W
Area (# of LUTs)          8,354 (30%)      5,394 (19%)

It is possible to observe that the entire node of the dedicated cluster occupies around 30% of the available LUTs, meaning that there is still enough room to optimize the embedded application, if needed. These optimizations, however, are not within the scope of this work. Also, the sequential version is significantly smaller. Nevertheless, this version does not use the communication infrastructure, which occupies a large part of the LUTs available in the FPGA, around 19%.

The maximum frequency achieved in both cases is above 100 MHz. This is a significant result, considering the FPGA technology used in this research. Moreover, regarding the distributed version, the embedded application is synchronized with the integration layer as well as the communication stack, and this frequency is more than enough to keep the process effective. It is also interesting to observe that the node communication infrastructure does not limit the frequency, as both versions achieved similar results.

Additionally, it is important to notice the low power consumption in both cases (sequential and distributed), which was around 1.1 W. This result points out how power consumption may be reduced when using dedicated hardware, as a state-of-the-art GPP may consume over 100 W [19]. This is especially attractive in comparison with regular clusters, since one of the main problems of the latter approach is its high power consumption and heat dissipation.

B. Performance

In order to validate and compare the results achieved by the embedded version, the distributed application was also run on a regular cluster. However, it is not the intention of this work to broadly compare dedicated and regular clusters using different configurations and numbers of nodes. The main goal of this work was to develop and validate an effective flow to generate a dedicated cluster for a given application, and the comparison with regular clusters only illustrates the approach. Thus, both the regular and the dedicated clusters used in the experiments shown in this section have two nodes.

The conventional cluster used to perform the experiments is based on two Intel Pentium 4 nodes, each one running at 2.8 GHz with 512 MBytes of memory. The interconnection between them is a direct Fast Ethernet link, similar to the FPGA cluster architecture. The application run on this cluster was the same Java application which was earlier synthesized through the SASHIMI tool in order to create the customized FemtoJava core. As discussed before, the nodes are similar and each one implements an island of the EA.

On the other hand, the dedicated cluster used is based on two nodes developed according to the methodology previously described in this work. It is important to observe that each node has the same functionality as the ones run on the conventional cluster, as they were generated from that same Java application.

Table II shows results for the evolutionary algorithm execution time in both the conventional and the application-specific cluster. These results were taken considering the average of 20 executions of the algorithm on each cluster. In order to obtain the execution time results in the conventional cluster case, system calls were used. For the dedicated cluster, the number of clock pulses necessary to complete an execution was accumulated and later multiplied by the clock period. Thus, the table shows the average execution time as well as the best performance achieved by the experiments in each case (dedicated and traditional), besides the relation between those results.

TABLE II
DEDICATED VS. TRADITIONAL CLUSTERS
Comparison results between the dedicated and traditional clusters

                            Dedicated    Traditional    Difference (%)
Average exec. time (s)      14.16        22.53          37.15
Lowest exec. time (s)       5.36         19.5           72.51
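For clarity, the measurement conventions implied above can be written out; this reading is ours, not stated as formulas by the authors. On the dedicated cluster the execution time follows from the accumulated cycle count and the clock period, and the difference column corresponds to the relative reduction with respect to the traditional cluster:

t_{ded} = N_{cycles} \cdot T_{clk} = \frac{N_{cycles}}{f_{clk}}, \qquad
\Delta(\%) = \frac{t_{trad} - t_{ded}}{t_{trad}} \times 100

For example, (22.53 - 14.16)/22.53 \times 100 \approx 37.15 and (19.5 - 5.36)/19.5 \times 100 \approx 72.51, matching Table II.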

It is possible to observe that the dedicated cluster achieved remarkable results, considering a cluster comprising two nodes. On average, the traditional cluster takes more than 22 seconds to find the EA fitness, while the dedicated cluster takes less than 15 seconds. This means that, on average, the gain of the dedicated cluster over the traditional one is greater than 35%. Moreover, the dedicated cluster performance may be more than 70% better when comparing the lowest execution times of each case. This points out that, in fact, the latencies produced by unnecessary software and hardware layers can significantly harm a given system's performance. Besides that, the results show that the dedicated cluster, which runs at 100 MHz, provides a speed-up of 1.5 times over a Pentium-based cluster running at 2.8 GHz, which means that the dedicated nodes are 42 times more efficient than a superscalar processor running at the same frequency. This happens because the dedicated cluster runs on FPGAs, which are devices that exploit parallel or spatial execution instead of the temporal or sequential approach used in the Pentium-based cluster. Moreover, the concurrent execution provided by FPGA devices represents an advantage for the algorithm's operations, which exploit it by increasing the parallelism of the execution.
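The 42x figure appears to combine the measured speed-up with the clock-frequency ratio; under that reading (our reconstruction, not spelled out by the authors):

S = \frac{t_{trad}}{t_{ded}} \approx 1.5, \qquad
\frac{f_{Pentium}}{f_{dedicated}} = \frac{2.8\ \text{GHz}}{100\ \text{MHz}} = 28, \qquad
S \times 28 \approx 1.5 \times 28 = 42

That is, at equal clock frequency the dedicated node would be roughly 42 times faster per cycle.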

VII. CONCLUSION AND FUTURE WORK

This work proposed a design flow to efficiently distribute embedded systems across dedicated cluster machines.

Even though there are several studies in the fields of distributed software, computer clusters and embedded systems, none of them proposed an approach joining these technologies.

According to the proposed flow, the nodes are developed in the Java language and the synthesizable VHDL is generated automatically through the SASHIMI design flow. After that, the application core must be integrated with the TCP/IP stack as well as an integration layer.

The application implemented as a first case study was an Evolutionary Algorithm (EA) applied to spectrographic analysis, which is a well-known problem in the pharmaceutical industry. After following the design flow, the nodes were entirely implemented and integrated. The verification process is complete and the performance results are remarkable for a 2-node cluster.

The dedicated node using the XUP-V2P board achieves around 100 MHz and consumes only 30% of the target device, a Xilinx XC2VP30 Virtex-II Pro. Moreover, the frequency achieved is considered good for the device used. There is still room for new optimizations in the architecture, as it was fully placed using just 30% of the FPGA LUTs. As an automatic flow was used to generate the FemtoJava core, some manual optimizations may easily be performed. However, which optimizations to implement, as well as their impact on node performance, will be addressed in future work.

The general performance of the experiments executed in this work is also very impressive. On average, using the dedicated cluster is more than 35% better than running the same algorithm on a conventional cluster. The best case, however, achieves more than 70% improvement over the conventional cluster. More experiments with different cluster configurations and numbers of nodes still need to be performed, but these results validate the idea of an embedded distributed application running on an application-specific cluster.

Also as future work, alternatives to TCP/IP for communication will be studied and analyzed. Moreover, new applications will be developed to verify the effectiveness of this approach.

ACKNOWLEDGMENT

The authors gratefully acknowledge UNISC and FAPERGS for their support in the form of scholarships and grants.

REFERENCES

[1] A. S. Tanenbaum, Distributed Systems: Principles and Paradigms, 2002.
[2] C.-C. Yeh, C.-H. Wu, and J.-Y. Juang, "Design and implementation of a multicomputer interconnection network using FPGAs," in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California: IEEE, 1995, pp. 30–39.
[3] M. Jones, L. Scharf, J. Scott, C. Twaddle, M. Yaconis, K. Yao, and P. Athanas, "Implementing an API for distributed adaptive computing systems," in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California: IEEE, 2000, pp. 222–230.
[4] A. Dandalis, V. K. Prasanna, and J. D. P. Rolim, "An adaptive cryptographic engine for IPsec architectures," in IEEE Symposium on Field-Programmable Custom Computing Machines. Napa Valley, California: IEEE, 2000, pp. 132–141.
[5] R. Sass, K. Underwood, and W. Ligon, Design of Adaptable Computing Cluster. The Military and Aerospace Programmable Logic Device (MAPLD) International Conference, 2001.
[6] K. Underwood, R. Sass, and W. Ligon, Cost Effectiveness of an Adaptable Computing Cluster. ACM/IEEE Supercomputing, 2001.
[7] A. M. Jacob, I. A. Troxel, and A. D. George, Distributed Configuration Management for Reconfigurable Cluster Computing. HCS Research Lab, University of Florida, 2004.
[8] J. Willians, I. Syed, J. Wu, and N. Bergmann, "A reconfigurable cluster-on-chip architecture with MPI communication layer," in 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Los Alamitos, California: IEEE, 2006, pp. 351–352.
[9] S. Ito, L. Carro, and R. Jacobi, "Sashimi and FemtoJava: making Java work for microcontroller applications," IEEE Design & Test of Computers, pp. 100–110, 2001.
[10] A. Dunkels, "Design and implementation of the lwIP TCP/IP stack," Swedish Institute of Computer Science, Tech. Rep., February 2001.
[11] Intel, "Intel LXT972A single-port 10/100 Mbps PHY transceiver," available at <http://www.intel.com/design/network/products/lan/datashts/24918603.pdf>. Accessed in May 2007.
[12] J. Coates, "Vibrational spectroscopy: Instrumentation for infrared and Raman spectroscopy," Applied Spectroscopy Reviews, vol. 33, 1998.
[13] L. A. N. Lorena and J. C. Furtado, "Constructive genetic algorithm for clustering problems," Evolutionary Computation, vol. 9, no. 3, pp. 309–327, 2001.
[14] E. Alba and J. M. Troya, A Useful Review on Coarse Grain Parallel Genetic Algorithms, Universidad de Málaga, Campus de Teatinos (2.2.A.6), 29071-Málaga (España), 1997.
[15] A. Aguiar, C. Both, M. Kreutz, R. dos Santos, and T. dos Santos, "Implementação de algoritmos genéticos paralelos aplicados a fármacos," in XXVI Congresso da Sociedade Brasileira de Computação - WPerformance. Campo Grande - MS: Sociedade Brasileira de Computação, July 2006.
[16] Xilinx University Program Virtex-II Pro Development System - Hardware Reference Manual, Xilinx Inc., 2006.
[17] Xilinx ISE 8.2i - Software Manual, Xilinx Inc., 2006.
[18] Xilinx Embedded Development Kit (EDK) 8.2 - Software Manual, Xilinx Inc., 2006.
[19] "Intel Pentium 4 processor - thermal management," available at <http://www.intel.com/support/processors/pentium4/sb/CS-007999.htm>. Accessed in December 2006.