HPRA: A Pro-Active Hotspot-Preventive High-Performance Routing Algorithm for Networks-on-Chips

Elena Kakoulli†, Vassos Soteriou†, Theocharis Theocharides‡

†Department of EECEI, Cyprus University of Technology ‡KIOS Research Center, Department of ECE, University of Cyprus

{elena.kakoulli,vassos.soteriou}@cut.ac.cy [email protected]

Abstract—The inherent spatio-temporal unevenness of traffic flows in Networks-on-Chips (NoCs) can cause unforeseen, and in some cases severe, forms of congestion, known as hotspots. Hotspots reduce the NoC’s effective throughput, and in the worst-case scenario the entire network can be brought to an unrecoverable halt as a hotspot(s) spreads across the topology. To alleviate this problematic phenomenon, several adaptive routing algorithms employ online load-balancing functions, aiming to reduce the possibility of hotspots arising. Most, however, work passively, merely distributing traffic as evenly as possible among alternative network paths, and they cannot guarantee the absence of network congestion, as their reactive capability in reducing hotspot formation(s) is limited. In this paper we present a new pro-active Hotspot-Preventive Routing Algorithm (HPRA), which uses the advance knowledge gained from network-embedded Artificial Neural Network (ANN) hotspot predictors to guide packet routing across the network in an effort to mitigate any unforeseen near-future occurrences of hotspots. These ANNs are trained offline, and during multicore operation they gather online buffer-utilization data to predict about-to-be-formed hotspots, promptly informing the HPRA routing algorithm to take appropriate action in preventing hotspot formation(s). Evaluation results across two synthetic traffic patterns, and traffic benchmarks gathered from a chip multiprocessor architecture, show that HPRA can reduce network latency and improve network throughput by up to 81% when compared against several existing state-of-the-art congestion-aware routing functions. Hardware synthesis results demonstrate the efficacy of the HPRA mechanism.

I. INTRODUCTION

Networks-on-Chips (NoCs) [5] are the interconnect of preference in state-of-the-art multicore Systems-on-Chips (SoCs) [2] and Chip Multiprocessors (CMPs), such as Intel’s 48-core Single-Chip Cloud computer [13]. NoCs are replacing shared-communication media as they scale efficiently with increasing topology sizes, and their attributes of expandability, modularity, fault tolerance, and energy efficiency aid in quickly designing, testing, and verifying ultra-high-performance multicore computing systems.

Inherently unpredictable application traffic patterns can cause hotspot formation(s), temporally and spatially, across an NoC topology. This adverse phenomenon of traffic hotspots arises when NoC routers or modules in a multicore system occasionally receive packetized traffic from other network traffic producers at a faster rate than they can eject it, as interconnecting links and input/output ports are bandwidth-restricted, and as routed flits constantly compete for network resources (buffers, channels, etc.) [20]. Even a single traffic sender or receiver can cause a hotspot. Hotspots can also be produced by factors such as the lack of traffic balancing under oblivious routing algorithms, non-optimal application mapping onto a multicore chip, application migration, and network-resource demands that occur unpredictably and dynamically [3].

Wormhole Flow-Control (WFC), employed in most NoCs [6], where packetized messages are broken down into smaller logical units called flits in an effort to save on buffer sizing requirements, intensifies the detrimental effect of hotspots on NoC performance. Under WFC, the spreading of packets in a pipelined mode across several routers, as flits advance towards their destination, produces backpressure at upstream buffers, causing them to quickly fill up in a domino-style mode. Hence, a hotspot(s) can quickly span several portions of the topology at a time, causing further message blocking to propagate spatially across several routers. This NoC resource over-utilization can produce irreversible traffic blockage which may force the entire NoC to stall indefinitely, a state under which the NoC becomes inoperable. Hotspot formations are especially unpredictable in general-purpose best-effort parallel on-chip systems such as CMPs, which are considered in this paper, where application patterns cannot be pre-determined and are highly spatio-temporally variable during system operation, unlike in special-purpose SoCs where traffic patterns may be known a priori to system operation [31].
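The domino-style backpressure described above can be illustrated with a toy discrete-time sketch (not from the paper; the buffer depth, chain length, and `step` helper are illustrative assumptions): a chain of wormhole routers with finite flit buffers whose ejection port is blocked fills up hop by hop back towards the source.

```python
# Sketch (illustrative, not the paper's model): backpressure in a chain of
# wormhole routers. Each router has a small flit buffer; when the downstream
# buffer is full, flits stall and buffers fill towards the source.

BUF_DEPTH = 4

def step(buffers, sink_blocked):
    """Advance one cycle: move one flit forward wherever space allows."""
    # Drain the last buffer into the sink unless the sink (hotspot) is blocked.
    if buffers[-1] and not sink_blocked:
        buffers[-1].pop(0)
    # Move flits one hop downstream, last router first, only if there is space.
    for i in range(len(buffers) - 2, -1, -1):
        if buffers[i] and len(buffers[i + 1]) < BUF_DEPTH:
            buffers[i + 1].append(buffers[i].pop(0))
    # The source keeps injecting one flit per cycle while it has space.
    if len(buffers[0]) < BUF_DEPTH:
        buffers[0].append("flit")

buffers = [[] for _ in range(4)]          # 4-router chain
for _ in range(40):
    step(buffers, sink_blocked=True)      # ejection port congested
print([len(b) for b in buffers])          # every buffer saturates
```

With the sink blocked, no flit ever leaves the chain, so all buffers end up full and injection itself stalls, mirroring the throughput collapse the text describes.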

Even under the use of load-balancing adaptive routing functions [16], substantial effective-throughput degradation in an NoC can be observed. The development of congestion-management techniques, as a means to safeguard the scalability of NoCs and hence the performance sustainability of their hosting general-purpose CMPs and application-driven SoCs, has been identified as a major research challenge in a number of recent significant surveys [3], [19]. Such new schemes will enable designers and architects to lay the roadmap for future multicore chip design. Current techniques, such as dynamic congestion management in the form of adaptive routing protocols [6], [8], [14], [17], [18], [26], application scheduling [3], and the addition of extra buffering at router input ports [21] to house delayed flits in an attempt to improve NoC throughput in the presence of bursty traffic that may cause hotspots to form, are not always sufficient, as NoC congestion is a complex and unpredictable phenomenon.

In this article we present a new congestion-preventive pro-active routing function, termed the Hotspot-Preventive Routing Algorithm (HPRA). Instead of passively measuring current network statistics, such as link and buffer utilization, in an attempt to reactively balance out traffic to improve or sustain network throughput, like most existing routing algorithms [6], HPRA pro-actively prevents the unforeseen formation of NoC hotspots or elevated congestion that may occur in the near future during network operation. This pro-active hotspot prevention is achieved with the use of advance information sourced via Artificial Intelligence (AI) principles that are utilized during network operation to continuously predict the formation of traffic hotspots or congestion. AI principles are chosen because of their adaptability to changing traffic conditions and their ability to learn small network spatio-temporal variations which can lead to online congestion, and hence to progressively improve their ability to forecast the next hotspot occurrence in advance.

An NoC-embedded Artificial Neural Network (ANN) hardware mechanism, from our previous work [16], is used to dynamically foresee these potential hotspot formations. Here, the routing algorithm utilizes this advance information to partially or completely throttle hotspot-destined traffic, gradually allowing portions or the entirety of such traffic to reach their destinations, while continuously balancing out traffic that is not hotspot-destined. The latter traffic category is balanced spatially across the topology via the use of real-time statistics gathered from the network, which are used to choose the least congested progressive path. These ANNs are trained offline using synthetic hotspot traffic [16], and during NoC operation they gather online buffer-utilization data to predict about-to-be-formed hotspots, promptly informing the HPRA routing algorithm to take appropriate action in alleviating hotspot formation(s). In categorizing traffic as hotspot- and non-hotspot-destined, the algorithm offloads hotspot router receivers and their surrounding network paths, while allowing non-hotspot-causing traffic to reach its destination(s); hotspot and non-hotspot traffic utilize two separate injection queues at each router, managed by a controller. In this way HPRA allows non-adversarial traffic to continuously flow in the network in a monitored manner, and once the effect of a hotspot(s) is reduced, as determined with the use of online network-statistics gathering, or is eradicated altogether, hotspot-destined traffic is again allowed to flow in the network until the same hotspot starts to build up again, at which point it is re-throttled. This process is repeated with future hotspot predictions.

978-1-4673-3052-7/12/$31.00 ©2012 IEEE
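The throttle-and-release behaviour described above can be sketched as follows. This is an illustrative software model, not the paper's hardware: the function names, the Python-list injection queues, and the 0.5 release threshold are all assumptions of ours.

```python
# Hypothetical sketch of the per-router injection-port controller described
# in the text: hotspot-destined (HSD) traffic is held back while a hotspot is
# predicted and congestion is high, and released once congestion subsides,
# after which the cycle repeats on the next prediction.

def hsd_throttled(ann_predicts_hotspot: bool, buffer_utilization: float,
                  release_threshold: float = 0.5) -> bool:
    """Decide, once per control interval, whether the hotspot-destined
    injection queue is throttled at this router (threshold is illustrative)."""
    return ann_predicts_hotspot and buffer_utilization >= release_threshold

def inject(queue_hsd, queue_nonhsd, ann_predicts_hotspot, buffer_utilization):
    """One injection step with the two separate per-router queues."""
    sent = []
    if queue_nonhsd:                        # NonHSD traffic always flows
        sent.append(queue_nonhsd.pop(0))
    if queue_hsd and not hsd_throttled(ann_predicts_hotspot,
                                       buffer_utilization):
        sent.append(queue_hsd.pop(0))       # HSD flows only when safe
    return sent
```

For example, with a hotspot predicted and high utilization, only the NonHSD queue drains; once the prediction clears, the held-back HSD packet is injected.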

The proposed HPRA mechanism is designed with manageable hardware overheads that amount to an 11.4% overall network overhead. The ANN mechanism is designed as an independent processing element that is embedded into the base NoC topology of a multicore system. Evaluation results across two adversarial synthetic traffic patterns, and traffic benchmarks gathered from the TRIPS CMP [24], show that HPRA can reduce network latency and improve network throughput by up to 81% when compared against several existing state-of-the-art congestion-aware routing functions (see Section V). In addition, the use of HPRA across two 2-D mesh-based network sizes demonstrates its scalability, versatility and adaptability.

In the following, Section II presents a survey of related work. Section III introduces ANNs, and outlines the proposed ANN architecture and the data it processes, and Section IV provides details of HPRA. Next, Section V outlines our experimental setup and evaluates the performance of HPRA against several other state-of-the-art congestion-aware routing algorithms, while Section VI presents HPRA hardware synthesis results. Section VII concludes this article.

II. BACKGROUND AND RELATED WORK

Hotspot-management schemes in NoCs fall under two main categories: (1) implicit approaches, mostly dealing with network-congestion reduction via workload load-balancing, and (2) explicit approaches that specifically target reducing the impact of hotspots upon an NoC. The schemes under category (1) apply equally to off-chip interconnection networks and on-chip systems, and their range is vast; hence the interested reader is urged to refer to survey works such as [3], [6]. Load-balancing selection functions, which mainly account for localized or de-centralized online statistics gathering, used to direct routing decisions and at the same time reduce hardware overheads, aim at distributing uneven traffic among alternative lightly-loaded router output ports or routing paths spanning the NoC topology. Their main target is to sustain the effective network throughput under adversarial (i.e., highly unevenly distributed) traffic patterns, but they do not explicitly handle hotspots.

In the on-chip domain, the Balanced Adaptive Routing Protocol (BARP) [17] uses hybrid local-global information regarding network congestion to distribute traffic evenly among the shortest routing paths to avoid congestion emergencies. DyXY [18] provides adaptive routing based on congestion conditions measured in the topological proximity of an NoC router. Following, [20] proposes a memory-less switch for NoCs that misroutes packets to non-ideal routes during instances of elevated localized congestion levels. The work by Gratz et al. [11] aims to enhance global load-balancing in adaptive routing, using a lightweight technique that informs a congestion-aware routing scheme about traffic levels in network regions beyond the locality of adjacent network routers. The work in [27] develops a routing algorithm to support multiple concurrent applications in a congestion-propagation NoC, where the Destination-Based Adaptive Routing (DBAR) scheme leverages both local and global network information. The authors in [23] propose a minimum-based Destination-Adaptive Routing (DAR) strategy where every node estimates the delay to every other node in the NoC to drive routing decisions. DyAD [14] adopts buffer-occupation statistics to signal congestion among neighboring nodes, which then act accordingly towards increasing NoC throughput. Finally, the work in [32] investigates the impact of input selection at NoC routers and presents a contention-aware technique to enhance NoC routing efficiency. All these works measure online network congestion and then react to adverse conditions to reduce or correct the impact of elevated network-throughput demands by spatially spreading traffic among alternative routes. All these schemes merely respond to already elevated network delays in packet-transmission progression, either locally or semi-globally, and exploit no advance information about potential hotspots in the network topology.

Next, most of the explicit hotspot-management efforts under category (2) focus on reducing the number of routed packets destined to highly utilized routers, using various schemes detailed next. They all assume that the spatial locations of hotspots are either (a) known a priori, in which case the hotspot-management techniques pro-actively aim at reducing possible hotspot formations, or (b) unknown, in which case the techniques react spatially and temporally to reduce the possibility of hotspot occurrences.

Various hotspot-prevention works fall under the domain of large-scale off-chip interconnected computers. The scheme in [12] discards hotspot-bound network packets, which are later re-transmitted to their destinations. Next, the work in [1] reacts temporally and throttles traffic at the router injection ports, preventing further traffic from being routed into a hotspot(s), while the work in [25] presents an analytical model of hotspot traffic in k-ary n-cube networks. Finally, the schemes in [9] use separate buffering for efficiently handling hotspot-destined traffic, or use a large number of virtual channels.

Due to the limited hardware resources in NoCs, and the need to sustain high performance, most of the hotspot-related schemes applicable to off-chip networks, such as packet dropping and retransmission, are unsuitable for NoCs. Hotspot prevention is mostly based on spatial techniques via the use of adaptive routing, which aims to route packets around heavily used paths, such as the work in [7]. Next, [21] describes a predictive closed-loop flow-control mechanism that utilizes NoC traffic-source and router models to control the packet injection rate of traffic producers to achieve a consistent packet flow. The work by Walter et al. [31] uses a low-cost end-to-end credit-based allocation technique in NoCs to throttle and regulate hotspot-destined traffic. The work in [30] utilizes injection throttling to improve the throughput of a network under high load. Finally, [4] presents a congestion-control strategy for NoCs based on a best-effort communication service that monitors link utilizations as a congestion measure; these measurements are then transported to a model-predictive controller that bounds latency and offers bandwidth availability at links.

III. ANNS: INFORMATION PROCESSING AND PREDICTION

An Artificial Neural Network (ANN) is an information-processing paradigm, inspired by the way biological neuron systems process information, and is composed of a large number of highly interconnected nodes (neurons). Neurons work in unison to solve specific problems. A neuron takes a set of inputs and multiplies each input by a weight value, which is determined in the training stage, accumulating the result until all the inputs are received. A threshold value is then subtracted from the accumulated result, and this result is then used to compute the output of the neuron based on an activation function. The neuron output is then propagated to the neurons of the next layer, which perform the same operation with the new set of inputs and their own weights. This is repeated for all the layers of an ANN. ANNs learn by training, and are typically trained for specific usage in applications


Fig. 1. ANN monitoring system: four base ANN units, each monitoring a 4×4 region in an 8×8 2-dimensional mesh NoC topology.

such as pattern recognition or data classification. They have also been successfully used as prediction and forecasting mechanisms in several areas, as they are able to determine hidden and strongly non-linear dependencies, even when there is significant noise in the data set [15]. ANNs have been used as branch-prediction mechanisms in computer architecture, as forecasting mechanisms for stocks, and in several other prediction applications [15]. A neural network can be realized in hardware by using interconnected neuron hardware computation engines, each of which is composed of a multiplier-accumulator (MAC) unit and a Look-Up Table which implements the activation function. Training weights are held in memory, but overall the complexity of the hardware lies in the interconnection structure; fortunately, the discrete operation of the ANN allows for effective resource-sharing and pipelining, hence for small-sized ANNs the neuron complexity is not an issue [15] and the interconnect can be addressed by efficient dataflow management.
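The neuron computation described above (multiply-accumulate, threshold subtraction, activation) can be sketched in a few lines. This is a generic illustration, not the trained predictor of [16]; the weights, thresholds, and sigmoid activation shown are placeholders.

```python
import math

# Sketch of the neuron operation described in the text: multiply-accumulate
# the inputs with trained weights, subtract a threshold, then apply an
# activation function (realized as a look-up table in the hardware design).

def neuron(inputs, weights, threshold,
           activation=lambda a: 1.0 / (1.0 + math.exp(-a))):
    acc = sum(x * w for x, w in zip(inputs, weights))  # MAC stage
    return activation(acc - threshold)

def layer(inputs, weight_rows, thresholds):
    """One fully connected ANN layer; its outputs feed the next layer."""
    return [neuron(inputs, w, t) for w, t in zip(weight_rows, thresholds)]

# Example: a tiny 2-layer network over three buffer-utilization inputs
# (placeholder weights, not trained values from the paper).
hidden = layer([0.9, 0.7, 0.8],
               [[0.5, 0.4, 0.3], [0.2, 0.9, 0.1]], [0.3, 0.2])
out = layer(hidden, [[1.0, 1.0]], [0.8])
hotspot_flag = out[0] > 0.5   # interpreted as a hotspot / no-hotspot decision
```

In the hardware realization, the `activation` call corresponds to the LUT lookup and the weighted sum to the shared MAC unit.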

In our previous work [16], we introduced the concept of using ANNs as a forecasting mechanism that can be effectively integrated, using dedicated hardware, within the underlying NoC infrastructure. In contrast to other forms of forecasting mechanisms, ANNs offer significantly better accuracy when dealing with non-linear forecasting models, something which is observed with NoC hotspots. We illustrated how a generic ANN is implemented in hardware, and how ANNs can be designed to monitor NoC traffic and predict hotspot formations, and provided the necessary implementation details. Moreover, we illustrated how a base ANN engine that monitors a 4×4 mesh or torus NoC can be designed to monitor a variety of NoC sizes and topologies, and how it can scale to facilitate large NoCs through a combination of base ANNs and voting mechanisms. Further discussion related to the ANN implementation and architecture is found in our previous work [16]; here we only provide the specific implementation details concerning the NoC targeted in this work (an 8×8 mesh).

It is assumed, therefore, that the proposed algorithm utilizes an ANN which has been implemented in hardware, and which can accurately detect and inform the underlying interconnect system about potential hotspots, providing the coordinates (i.e., routers) of all the predicted hotspots in the network. This work, therefore, utilizes the forecast given by the ANN to pro-actively regulate network traffic, in an effort to prevent hotspots from forming and thus improve the overall network performance. The ANN can subsequently be trained and configured to monitor the interconnection traffic in discrete time intervals, and detect pro-actively whether an observed traffic pattern indicates a high probability of a hotspot forming within the network. The ANN utilized in this work has two operation modes: off-line training (where the weights are computed) and on-line prediction (where the computed weights are programmed into the ANN, which uses the real-time traffic to predict hotspot formation). The ANN designed in this work is adapted to an 8×8 mesh architecture; thus, the overall network uses four base ANN engines, as Fig. 1 shows, and a sequence of OR-based voting mechanisms for the border neurons (shown in Fig. 2). The base ANN receives the average buffer-utilization values for each router port, for all routers monitored by

Fig. 2. ANN monitoring fragmentation and boundary routers.

the base ANN in 50-cycle intervals, and it forecasts potential hotspot formation across the routers that it monitors. Given that hotspots typically span more than one router, and given that each base ANN monitors a 4×4 region, the border routers between regions are considered to be hotspots if at least one of their neighboring routers is also considered to be a hotspot. The ANN outcome is therefore propagated through dedicated 8-bit links to all routers within the monitored region during each discrete interval, and is used by the hotspot-preventing HPRA algorithm. The 8-bit links are also used to carry the utilization rates to the ANN during each interval, thus minimizing the hardware overheads.
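The OR-based border voting can be sketched as below. This is an illustrative software model, assuming an 8×8 mesh split into four 4×4 regions with seams at rows/columns 3 and 4; the function name and grid encoding are ours, not the paper's.

```python
# Sketch of the OR-based border voting described in the text: each base ANN
# flags hotspot routers inside its own 4x4 region; a router on a region
# border is additionally marked as a hotspot if at least one of its mesh
# neighbours (possibly in the adjacent region) is flagged.

def or_vote_borders(flags):
    """flags: 8x8 boolean grid of per-region ANN hotspot predictions."""
    n = len(flags)
    voted = [row[:] for row in flags]
    for x in range(n):
        for y in range(n):
            on_border = x in (3, 4) or y in (3, 4)   # seams of the 4x4 regions
            if not on_border or voted[x][y]:
                continue
            neighbours = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
            if any(0 <= i < n and 0 <= j < n and flags[i][j]
                   for i, j in neighbours):
                voted[x][y] = True
    return voted

# A hotspot flagged at router (4, 3) also marks its border neighbour (3, 3),
# which lies in the adjacent region.
flags = [[False] * 8 for _ in range(8)]
flags[4][3] = True
voted = or_vote_borders(flags)
```

Because the vote reads the original `flags` grid, a single pass suffices and flags do not cascade beyond the immediate border neighbours.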

IV. HPRA ROUTING ALGORITHM IMPLEMENTATION DETAILS

In this section we detail the implementation of HPRA and its input-buffered NoC router-based micro-architecture. We also describe in detail how the hotspot-prediction information from the ANNs embedded into the NoC architecture guides HPRA in alleviating the formation of destination-router traffic hotspots. Two variants of the HPRA algorithm are demonstrated.

A. Limitations of Oblivious and Adaptive Routing Algorithms

Oblivious routing algorithms [6] can lead to non-uniform network-resource utilization, as many workloads exhibit intense spatio-temporal communication-pattern variations. Adaptive routing can aid in distributing traffic evenly across the topology. However, certain traffic permutations, such as transpose and hotspot traffic, and real applications [24], cannot be well-balanced across the network globally, and, more importantly, most adaptive routing algorithms react to traffic imbalances, as they use current network metrics to direct routing, instead of preventing congestion a priori. Hence, most NoC adaptive routing algorithms still exhibit limitations, as the traffic cannot be evened out both across time and space.

B. Details and Demonstration of HPRA

HPRA overcomes the limitations of the above conventional adaptive routing algorithms and routers, in an effort to extend the network’s effective throughput while reducing latency in the presence of adversarial traffic, i.e., hotspots. We note that we assume destination hotspots, i.e., traffic hotspots that occur at destination (or ejection) router nodes when many router traffic producers send traffic to a single network router destination or a subset of destinations.

HPRA is a scalable, versatile, and adaptable congestion-preventive pro-active routing function. Instead of passively measuring current network statistics, such as buffer utilization, in an attempt to reactively balance out traffic to improve or sustain network throughput [11], [17], [18], like most existing routing algorithms [6], HPRA pro-actively prevents the unforeseen formation of NoC hotspots that may occur dynamically in the near future. This pro-active hotspot-prevention mode is achieved with the use of advance information sourced via Artificial Intelligence (AI) principles that are utilized during network operation to continuously predict the formation of traffic hotspots. These AI principles are applied in the form of NoC-embedded hardware Artificial Neural Networks (ANNs)


Fig. 3. (a) Base wormhole flow-control NoC router with 2 VCs at each input port, with HPRA augmentations and the ANN processing-element predictor; (b) HPRA_a routing algorithm, with the two possible progressive packet routes and free-VC (VC_free) status flow (only in the Y or X dimension towards the source router) in the packet flow’s reverse-path direction; and (c) HPRA_b routing algorithm, with the two possible progressive packet routes and free-VC status flow (XY or YX towards the source router) in the packet flow’s reverse-path direction.

which can predict possible future hotspot formations with relatively good accuracy (see Section III; from our previous work [16]). The advance hotspot knowledge gained with the ANNs is used to guide packet routing across the NoC topology in an effort to mitigate or prevent any unforeseen near-future occurrences of hotspots.

HPRA works on two main axes: (1) it balances out traffic that is not destined towards routers that are predicted to act, or currently act, as hotspots, and (2) as stated earlier, it utilizes advance information from the NoC-embedded ANNs that report in advance near-future possible hotspot formations. To achieve these two modes it differentiates traffic into two categories: (1) traffic that is non-hotspot-router-destined (NonHSD), and (2) traffic that is hotspot-router-destined (HSD), as determined by the independent ANN hardware processing engines embedded into the NoC (see Section III).

NonHSD traffic is routed through, and continuously balanced spatially across, the NoC topology based on real-time utilization statistics that correlate well with downstream congestion and that are inexpensive to compute. Specifically, the aggregate count of free Virtual Channels (VC_free), an indicator of congestion levels, along the perimeter of the minimum rectangle outlined by the source-destination router node pair is used to choose the least congested progressive path towards the NonHSD packet's destination. There are two such sub-modes of operation, dubbed HPRA_a and HPRA_b, as Fig. 3-(b, c) demonstrates. Under HPRA_a, the mechanism aggregates and propagates the VC_free count of downstream routers to adjacent upstream routers, measured independently along both the X and Y dimensions tangential to the source router, where downstream VC_free counts flow backwards to the source router, as Fig. 3-(b) shows. Once at the source, each packet is directed towards the one of the two routes on which the VC_free count is the greatest, an indication of lower congestion levels, similar to the 01TURN routing function [26] (01TURN, however, does not use congestion metrics and is oblivious to traffic levels; either dimension is randomly chosen once at the source). When the same packet reaches the opposite corner router, it starts to progressively route along the remaining dimension. If, for example, HPRA_a first selects to route along the X dimension, once this path is exhausted it then progressively traverses the Y dimension. No such choice is taken when either the X or Y offset is zero, since the routing algorithm is always progressive. The difference with HPRA_b is that the two entire progressive paths along the perimeter of the minimum rectangle are each accounted separately for their VC_free counts, as Fig. 3-(c) shows; the two paths are XY and YX, i.e., first traversing the X dimension and then the Y dimension, or the reverse, respectively. This selection function is slightly more expensive to compute; we assume overlay wires of 6-bit width to transport VC_free information (for 2-VC per-port routers in an 8×8 mesh), which interconnect each pair of neighboring routers (see "Free VC Count" in Fig. 3-(a)). This overhead is taken into account in Section VI. HPRA_b offers a slight performance advantage over HPRA_a, as the results of Section V-A show, due to the expanded spatial congestion measurement gathering that is fed back to the source routers.
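The HPRA_b selection step described above can be sketched as follows. This is an illustrative software model, not the Verilog implementation; the function name and the per-hop list encoding of the fed-back VC_free counts are assumptions.

```python
# Hypothetical sketch of HPRA_b's source-side path selection: the source
# router compares the aggregate free-VC (VC_free) counts fed back along the
# two progressive paths (XY vs. YX) on the minimum source-destination
# rectangle, and injects the packet onto the less congested one.

def select_path_hpra_b(vc_free_xy, vc_free_yx):
    """Pick 'XY' or 'YX': the progressive path whose aggregate downstream
    free-VC count is greater (i.e., less congested); ties go to XY."""
    return 'XY' if sum(vc_free_xy) >= sum(vc_free_yx) else 'YX'
```

HPRA_a's per-dimension variant would compare the counts aggregated along only the X and Y segments tangential to the source, rather than the two full paths.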

In categorizing traffic as HSD and NonHSD, HPRA offloads hotspot router receivers and their surrounding network paths, while allowing non-hotspot-causing (i.e., NonHSD) traffic to ceaselessly reach its destination(s) in a balanced mode. HSD and NonHSD traffic are each housed in a distinct injection queue at each router (two injection queues per router), so that each such traffic flow can be separately monitored and controlled. Fig. 3-(a) shows these two injection queues, which are managed by a controller. This traffic categorization mechanism works as follows: when a CMP processing element is about to inject a packet into the network, the packet's destination coordinates are checked. If they match the coordinates of any hotspot router, or of an about-to-become hotspot router (as determined by the ANN predictors, 50 cycles in advance as computed in [16]; see Section III), the packet is housed in the hotspot-destined First-Come First-Served (FCFS) injection queue; otherwise, the packet is categorized as NonHSD and placed in the non-hotspot-destined FCFS queue. In the latter case the NonHSD routing load-balancing rules outlined earlier are carried out, and NonHSD traffic flow is governed by regular wormhole flow control [6].
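The categorization step can be modeled in a few lines. This is a behavioral sketch only; the class and field names, and the set-based hotspot lookup, are illustrative assumptions rather than the paper's hardware design.

```python
# Illustrative model of the injection-port categorization: a packet's
# destination is checked against the set of routers currently flagged (or
# ANN-predicted) as hotspots, and the packet joins the HSD or NonHSD
# FCFS injection queue accordingly.
from collections import deque

class InjectionPortController:
    """Two FCFS injection queues per router: hotspot-destined (HSD) and
    non-hotspot-destined (NonHSD)."""

    def __init__(self):
        self.hsd_queue = deque()
        self.non_hsd_queue = deque()

    def inject(self, packet, hotspot_routers):
        """Classify `packet` (a dict carrying 'dest' coordinates) against
        the set of current or ANN-predicted hotspot routers."""
        if packet['dest'] in hotspot_routers:
            self.hsd_queue.append(packet)
        else:
            self.non_hsd_queue.append(packet)
```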

The gist behind this traffic differentiation is to monitor the amount of HSD traffic that is destined to flow to hotspot router receivers, so that no further congestion is created. With this scheme in place, the danger of further severely congesting, or even stalling indefinitely, the entire network as a hotspot(s) spatially spreads across the NoC topology is alleviated. The advance prediction of possible hotspot formation(s) gained from the ANN prediction engine(s) is utilized to either partially or completely throttle HSD traffic, gradually allowing portions or the entirety of such HSD traffic to reach their destinations. To determine whether HSD traffic should be throttled altogether or allowed to be partially or completely injected into the network, a second statistic is utilized, that of the Aggregate Buffer Utilization (ABU) at the destination hotspot router, a summation of



Fig. 4. Latency-throughput curves for various routing algorithms under synthetic transpose traffic applied to an 8×8 2D mesh-connected NoC.

all flit-occupied VC buffers at all router input ports. This information is fed back to the source router (refer to the "Buffer Utilization" signal in Fig. 3-(a)); only the source router is allowed to throttle HSD traffic, and no intermediate router is allowed to carry out this throttling procedure, as traffic throttling can only be applied at HSD injection queues. The ABU metric is compared to the total input port buffering capacity of the hotspot router; when the current ABU at this hotspot destination router exceeds a threshold set at 50% (a value determined via extensive empirical evaluation), no HSD traffic from any source router is allowed to be injected into the network; otherwise HSD traffic is injected into the network, until this threshold limit is again exceeded. ABU is monitored on a cycle-by-cycle basis, enabling fine-grained HSD injection or throttling decisions.
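The 50% ABU throttling rule reduces to a single comparison. The sketch below is illustrative; the function name is an assumption, and the example capacity of 40 flits corresponds to a 2-VC router with 5-flit buffers at four input ports, as configured in this paper's baseline.

```python
# Sketch of the source-side throttling rule: hotspot-destined (HSD)
# injection is blocked whenever the hotspot router's Aggregate Buffer
# Utilization (ABU) exceeds 50% of its total input-port buffering capacity.

def hsd_injection_allowed(abu_flits, total_buffer_flits, threshold=0.5):
    """True when HSD traffic may be injected at the source router;
    injection is blocked while the destination's ABU exceeds the threshold."""
    return abu_flits <= threshold * total_buffer_flits
```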

Lastly, in case HSD traffic is de-throttled while NonHSD traffic concurrently exists in its own injection queue at the same source router, and a hotspot still exists or is predicted to persist, the Injection Port Controller (see Fig. 3-(a)) prioritizes the injection of HSD traffic over NonHSD traffic, as the former category has already suffered a delay at injection (during throttling). If the same hotspot is predicted to be eradicated, then the injection of either NonHSD or HSD traffic is randomly carried out on a per-flit basis. The entire above process is repeated until the same hotspot again starts to build up or is predicted to occur in the near future.
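This priority rule can be summarized as a small decision function. A behavioral sketch under stated assumptions (function and argument names are illustrative, not from the paper):

```python
# Possible rendering of the de-throttling priority rule: while the hotspot
# persists (or is predicted to), queued HSD traffic is served first; once
# the hotspot is predicted eradicated, the controller picks between the
# two non-empty queues at random, per flit.
import random

def pick_queue(hotspot_active, hsd_nonempty, non_hsd_nonempty, rng=random):
    """Select which injection queue serves the next flit."""
    if hsd_nonempty and (hotspot_active or not non_hsd_nonempty):
        return 'HSD'      # HSD was delayed during throttling: serve it first
    if non_hsd_nonempty and not hsd_nonempty:
        return 'NonHSD'
    if hsd_nonempty and non_hsd_nonempty:
        return rng.choice(['HSD', 'NonHSD'])  # hotspot gone: random per flit
    return None           # both queues empty
```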

V. EXPERIMENTAL SETUP AND RESULTS

To evaluate the performance of the HPRA routing scheme we implemented a detailed cycle-accurate simulator that supports k-ary 2D mesh topologies with four-stage pipelined routers, each with up to eight Virtual Channels (VCs) per port, each consisting of 5-flit buffers. The pipeline stages comprise: (1) routing computation to compute the next-hop packet route, (2) VC arbitration to allocate virtual channels at the downstream router, (3) switch arbitration to allocate per-flit switch bandwidth at the crossbar to traversing flits, and (4) crossbar (switch) traversal, followed by physical link traversal (taken to be 1 cycle in duration but not considered part of the router pipeline). Packets are composed of five 32-bit flits, except for the TRIPS applications in which two packet sizes exist [24] (see Section V-C). Each router consists of four incoming and four outgoing unidirectional channels. Simulation length was taken to be half a million cycles for the transpose and hotspot synthetic traffic patterns (Sections V-A and V-B, respectively), and the first 10 million cycles of each TRIPS benchmark (Section V-C).

A. Results With Synthetic Transpose Traffic

We first evaluate the effectiveness of HPRA using synthetic transpose traffic, an adversarial form of traffic as it overloads the two orthogonal diagonals of the network topology, to gain insight into the relative strengths of our proposed routing algorithm when compared to several existing state-of-the-art congestion-aware routing algorithms. Transpose traffic uses a uniform random injection process, with each router having the same probability of injecting, as well as ejecting, a packet into/from an 8×8 mesh network topology.

Fig. 4 compares the sustainable throughput performance of the two versions of our algorithm, HPRA_a and HPRA_b (see Section IV for details), against Duato's fully adaptive Protocol [8], the DyXY congestion-oriented adaptive routing algorithm [18], the near-optimal 01TURN adaptive routing algorithm [26], and the Regional Congestion Awareness (RCA) algorithm [11], which aims to enhance global load-balancing in adaptive routing using a technique that informs a congestion-aware routing scheme about traffic levels in network regions beyond the locality of adjacent network routers; we compare against the 1D-Local version of the RCA algorithm, a locally adaptive router that uses the vc congestion metric collected over the span of the independent rows and columns of the 2D mesh NoC topology in which the source and destination routers reside. We also run the standard non-adaptive Dimension-Order Routing (DOR-XY) function for comparison. We note that RCA [11] and 01TURN [26] were originally designed to work with 8 Virtual Channels (VCs), as compared to our lightweight HPRA approach which uses just 2 VCs at minimum, hence requiring significantly less overhead both in packet buffering and in the arbitration mechanism's design complexity. Nevertheless, we also run additional versions of 01TURN with 2 VCs and RCA with both 2 and 4 VCs for comparison.

The performance of the two HPRA schemes, as depicted in Fig. 4, shows improvement in throughput across all traffic patterns with no sacrifice in latency; HPRA outperforms DOR-XY, the DyXY congestion-adaptive routing [18], and Duato's fully adaptive routing [8]. HPRA_a and HPRA_b, both using just 2 VCs, also outperform, in terms of sustainable throughput, the near-optimal 01TURN adaptive routing algorithm [26], which uses a whopping 8 VCs, by approximately 37.5%. HPRA_a and HPRA_b outperform the RCA congestion-aware routing algorithm with 8 VCs by about 14% and 13%, respectively. Lastly, HPRA_a and HPRA_b outperform RCA with 2 VCs by 63.5% and 65%, respectively. HPRA_b is marginally better than HPRA_a, as it considers the congestion levels across the entire span of the two alternative progressive XY and YX source-destination paths (see Section IV).

B. Results with Synthetic Hotspot Traffic

To determine the effectiveness of HPRA in eradicating hotspots, we devise a hotspot model with well-defined spatially-located hotspot formations of relatively short temporal duration, centered on two randomly chosen concurrent router receiver hotspots located at different coordinates in an 8×8 mesh network. Using this model we synthesize a number of traces that exhibit a range of network throughput demands upon the NoC, until the saturation point of HPRA and of the other competing routing algorithms is determined.

Our hotspot model uses the Uniform Random Traffic (URT) model as a base, where all NoC nodes have an equal probability of sending and receiving a packet to/from another node per unit time, except that for short pre-specified and periodically occurring time intervals just two arbitrarily selected nodes each receive, with equal probability, 10% (20% of total traffic, combined) of the network traffic from



Fig. 5. Latency-throughput curves for various routing algorithms under synthetic hotspot traffic applied to an 8×8 2D mesh-connected NoC.

any remaining non-hotspot sender nodes; the remaining percentage of traffic remains as URT traffic. Specifically, we define Time Frame Windows (TFWs) of 3,000 cycles (set empirically), during any time within which the two hotspots can occur simultaneously for a duration of 800 cycles. No other hotspots can occur within a TFW, and during the rest of the duration of the TFW the traffic behaves purely as URT. The injection rate during both the URT and hotspot phases is of constant periodicity, i.e., the injection rate is steady, set at a pre-defined value. This model stress-tests our prediction mechanism, as hotspots occur unexpectedly and abruptly within a TFW; also, since the hotspots are of relatively long duration in relation to the duration of a TFW (800 cycles out of every 3,000 cycles), it also tests the ability of HPRA to handle hotspot traffic while sustaining high network throughput under these adversarial conditions.
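The hotspot traffic model above can be sketched as a destination-selection routine. This is a simplified illustration: for brevity the 800-cycle burst is pinned to the start of each TFW, whereas in the paper's model the burst may occur at any point within the window; all names are assumptions.

```python
# Sketch of the synthetic hotspot model: within each 3,000-cycle Time Frame
# Window (TFW), an 800-cycle burst directs 10% of traffic to each of two
# hotspot receivers (20% combined); all remaining traffic (and all traffic
# outside the burst) is Uniform Random Traffic (URT).
import random

def pick_destination(cycle, hotspots, nodes, rng=random,
                     tfw=3000, burst=800, p_hot=0.10):
    """Return a destination node for a packet injected at `cycle`."""
    in_burst = (cycle % tfw) < burst
    if in_burst and rng.random() < 2 * p_hot:
        return rng.choice(hotspots)   # one of the two hotspot receivers
    return rng.choice(nodes)          # plain URT destination
```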

Fig. 5 compares the sustainable throughput of HPRA_b (see Section IV for details) against Duato's fully adaptive Protocol [8], the DyXY [18] (2 VCs per input port) and 01TURN [26] (with 2 and 8 VCs per input port) adaptive routing algorithms, and the Regional Congestion Awareness (RCA) algorithm [11] (a locally adaptive router) using its 1D-Local version in tandem with the vc congestion metric as used in Section V-A. We run versions of both RCA and HPRA_b with 2, 4, and 8 VCs for full comparison purposes.

The HPRA_b scheme's performance, as depicted in Fig. 5, shows improvement in throughput, in some cases substantial, across all traffic patterns with no sacrifice in latency; HPRA_b outperforms the DyXY congestion-adaptive routing [18] and Duato's fully adaptive routing [8]. HPRA_b also outperforms, in terms of sustainable throughput, the near-optimal 01TURN adaptive routing algorithm [26] using 2 VCs by approximately 9.6%, 40%, and 81% when HPRA_b utilizes 2, 4, and 8 VCs per input port, respectively; HPRA_b also outperforms 01TURN by 18% when both algorithms use 8 VCs per input port. Lastly, HPRA_b with 2 VCs per input port outperforms the RCA congestion-aware routing algorithm with the same VC count by approximately 16.1%, by 11.8% when both algorithms utilize 4 VCs at each input port, and by 8.8% when both utilize 8 VCs at each input port. The conclusion here is that HPRA_b performs relatively better than RCA when fewer (and more reasonable) numbers of VCs are in use. Given the resource-constrained environment of NoCs, where it is preferable to use fewer VCs and less buffering, HPRA_b is a better choice than RCA in handling hotspot traffic.

C. Results with TRIPS CMP Applications

The TRIPS [24] prototype CMP consists of two large, tiled, distributed processing cores, with each core containing an execution array of ALU tiles, distributed register file tiles, and partitioned local L1 cache tiles interconnected via a 5×5 mesh network. The two 5×5 mesh-interconnected cores are connected by a second, 4×10 2D mesh network to a shared, distributed static-NUCA L2 cache. The two multi-tile processor cores communicate through the on-chip

secondary cache system using an embedded 4×10 wormhole-routed mesh network, abbreviated OCN (On-Chip Network) [10]. The OCN is optimized for cache-line-sized transfers with support for other memory-related operations, acting as an inter-core fabric for the two processors, the L2 cache, DRAM, I/O and DMA traffic [24].

To determine the performance of HPRA, we used a representative set of 9 benchmarks from the SPEC CPU2000 suite [29], gathered from the cycle-accurate TRIPS processor simulator to generate OCN requests, compiled using the Scale compiler [28] (appropriately modified for TRIPS's use). There are two memory system transaction types, writes and reads, consisting of request and reply packets. Write transactions from the processor to the L2 bank consist of a five-flit request packet containing the evicted dirty L1 cache line, answered with a 1-flit acknowledgement packet to the processor, while read transactions consist of a single-flit read request packet from the processor, replied to by the L2 bank with a 5-flit packet containing the requested cache line.
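The two OCN transaction types map to flit counts as follows; the dictionary encoding below is an illustrative summary of the text, not an artifact of the TRIPS simulator.

```python
# Flit counts for the OCN memory transactions described above: a write is a
# 5-flit request (evicted dirty L1 line) answered by a 1-flit
# acknowledgement; a read is a 1-flit request answered by a 5-flit reply
# carrying the requested cache line.
OCN_PACKET_FLITS = {
    ('write', 'request'): 5,  # carries the evicted dirty L1 cache line
    ('write', 'reply'):   1,  # acknowledgement to the processor
    ('read',  'request'): 1,  # single-flit read request
    ('read',  'reply'):   5,  # carries the requested cache line
}
```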

The TRIPS 4×10 mesh OCN was covered using three base 4×4 mesh-covering ANNs, as in our previous work [16]; an OR-based voting engine was used to combine the results from the ANNs that affect the overlapping routers. Fig. 6 compares latency results for various TRIPS OCN [24] benchmarks using HPRA_b with 2 VCs and various other routing algorithms: deterministic DOR-XY (note that some of the benchmark latencies under DOR-XY are over 40 cycles and are clipped in Fig. 6), Duato's fully adaptive Protocol [8], the near-optimal 01TURN adaptive routing algorithm [26], and the Regional Congestion Awareness (RCA) algorithm [11] using its 1D-Local version (as used in Sections V-A and V-B). Although HPRA_b with 2 VCs shows better sustainable NoC performance under all TRIPS benchmarks, this improvement is marginal; HPRA_b is on average 1.07%, 0.43%, and 1.88% better-performing than Duato's Protocol (2 VCs), 01TURN (2 VCs), and RCA (2 VCs), respectively. The best performance gains are with equake at 1.59% when compared to Duato's Protocol, again equake at 1.38% when compared to 01TURN adaptive routing, and mgrid at 2.29% when compared to RCA congestion-aware routing. The reason for these small improvements is that the TRIPS applications contain very light hotspots, and the 4×10 mesh OCN, due to its small size in the X dimension (spanning just three hops), offers little opportunity to exploit NoC path diversity; hence, due to both of these reasons, the HPRA routing algorithm cannot offer a substantial performance advantage compared to the aforementioned routing functions.

VI. HARDWARE SYNTHESIS RESULTS

The overall interconnection system was implemented in Verilog and synthesized using Synopsys Design Vision with a commercial 65 nm CMOS technology, targeting a 500 MHz operating frequency and a 1 V power supply voltage. For the purposes of comparison, we also synthesized a three-stage pipelined 2-Virtual Channel (VC) router comprising (1) routing, (2) parallelized VC arbitration and speculative switch allocation, and (3) switch traversal, with a 100-flit



Fig. 6. Latency results for various TRIPS OCN [24] benchmarks using HPRA_b with 2 VCs and various other routing algorithms.

FCFS injection port queue. The hotspot-destined injection queue was set at a 50-flit depth (see Section IV), as experimental measurements showed that even under the worst-case NoC operating conditions, i.e., just below the saturation point, no more than 45 flits of buffering were required to support hotspot-destined traffic throttling. Link traversal lasts for one cycle but is not part of the router pipeline. The speculative switch allocation prioritizes non-speculative requests over speculative ones, while prioritized matrix arbitration, look-ahead routing and credit-based wormhole flow control [22] are utilized.

For a targeted 8×8 mesh NoC, synthesis results indicate a hardware overhead of 4.2% for the ANN prediction system (all hardware related to the ANN, including the 8-bit dedicated links between routers; see Section III), and an overhead of 7.2% for the hotspot-preventive algorithm hardware, the majority of which lies in the extra buffer space required in the injection ports of each router. The overall hardware overhead, thus, is 11.4%. Furthermore, assuming an average 30% switching activity, the power overhead of the overall hotspot prevention system (including the ANN and the dedicated hardware within each router) is estimated at approximately 9%. The base router (with the injection buffers) was estimated at 351 mW, whereas the modified router consumes 383 mW. However, as the hotspot-preventive algorithm is likely to improve overall system performance and reduce the average packet delay, it is anticipated that the overall energy savings will be greater when a hotspot-preventive mechanism is present. Quantifying this is left as future work.
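As a quick sanity check, the quoted overheads are mutually consistent (values taken directly from the text above):

```python
# Area: the ANN predictor and the routing-algorithm hardware overheads sum
# to the quoted 11.4% total; power: (383 - 351) / 351 gives roughly the
# quoted 9% overhead.
ann_overhead = 4.2            # % area, ANN prediction system
algo_overhead = 7.2           # % area, hotspot-preventive routing hardware
total_area_overhead = ann_overhead + algo_overhead

base_mw, modified_mw = 351, 383
power_overhead = 100 * (modified_mw - base_mw) / base_mw
```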

VII. CONCLUSIONS AND FUTURE WORK

This paper presented the Hotspot-Preventive Routing Algorithm (HPRA), which uses advance congestion-level knowledge, based on Artificial Intelligence principles embedded in the system hardware, to pro-actively guide packet routing across the NoC in an effort to mitigate any unforeseen near-future occurrences of hotspots. Artificial Neural Networks are used to predict about-to-be-formed hotspots during multicore operation, promptly informing HPRA to take appropriate action in preventing hotspot formation(s). Extensive evaluation results show that HPRA can reduce network latency and improve network throughput by up to 81% when compared to several existing congestion-aware routing schemes, while requiring a modest 11.4% hardware overhead. Future work involves the evaluation of HPRA in a full-system simulation environment where dependencies among packet types exist, while optimizing its hardware requirements.

ACKNOWLEDGEMENT

This work falls under the Cyprus Research Promotion Foundation's Framework Programme for Research, Technological Development and Innovation 2009-10, co-funded by the Republic of Cyprus and the European Regional Development Fund, and specifically under Grant /0311/06.

REFERENCES

[1] E. Baydal et al. A Family of Mechanisms for Congestion Control in Wormhole Networks. IEEE Trans. on Parallel and Distributed Systems, Vol. 16, No. 9, pp. 772–784, Sept. 2005.

[2] D. Bertozzi and L. Benini. Xpipes: A Network-on-Chip Architecture for Gigascale Systems-on-Chip. IEEE Circuits and Systems Magazine, Vol. 4, No. 2, pp. 18–31, March-May 2004.

[3] T. Bjerregaard and S. Mahadevan. A Survey of Research and Practices of Network-on-Chip. ACM Computing Surveys, Vol. 38, No. 1, pp. 1–51, March 2006.

[4] J. W. van den Brand et al. Congestion-Controlled Best-Effort Communication for Networks-on-Chip. Proc. 10th ACM/IEEE Design, Automation and Test in Europe Conf. and Exhibition, pp. 948–953, April 2007.

[5] W. J. Dally and B. Towles. Route Packets not Wires: On-Chip Interconnection Networks. Proc. IEEE Design Automation Conf., pp. 684–689, June 2001.

[6] W. J. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., ISBN: 9780122007514, 2004.

[7] M. Daneshtalab et al. NoC Hot Spot Minimization Using AntNet Dynamic Routing Algorithm. Proc. IEEE Int'l Conf. on Application-specific Systems, Architectures and Processors, pp. 33–38, Dec. 2006.

[8] J. Duato. A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks. IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 12, pp. 1320–1331, Dec. 1993.

[9] J. Duato et al. A New Scalable and Cost-Effective Congestion Management Strategy for Lossless Multistage Interconnection Networks. Proc. IEEE Int'l Symp. on High-Performance Computer Architecture, pp. 108–119, Feb. 2005.

[10] P. Gratz et al. Implementation and Evaluation of On-Chip Network Architectures. Proc. IEEE Int'l Conf. on Computer Design, pp. 477–484, Oct. 2006.

[11] P. Gratz et al. Regional Congestion Awareness for Load Balance in Networks-on-Chip. Proc. IEEE Int'l Symp. on High-Performance Computer Architecture, pp. 203–214, Feb. 2008.

[12] W. S. Ho and D. L. Eager. A Novel Strategy for Controlling Hot-Spot Congestion. Proc. IEEE Int'l Conf. on Parallel Processing, pp. 14–18, Aug. 1989.

[13] J. Howard et al. A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling. IEEE Journal of Solid-State Circuits, Vol. 46, No. 1, pp. 173–183, Jan. 2011.

[14] J. Hu and R. Marculescu. DyAD - Smart Routing for Networks-on-Chip. Proc. IEEE/ACM Design Automation Conf., pp. 260–263, June 2004.

[15] A. K. Jain et al. Artificial Neural Networks: A Tutorial. IEEE Computer, Vol. 29, No. 3, pp. 31–44, March 1996.

[16] E. Kakoulli et al. Intelligent Hotspot Prediction for Network-on-Chip-Based Multicore Systems. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 31, No. 3, pp. 418–431, March 2012.

[17] P. Lotfi-Kamran et al. BARP - A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs. Proc. ACM/IEEE Design, Automation and Test in Europe Conf. and Exhibition, pp. 1408–1413, March 2008.

[18] M. Li et al. DyXY - A Proximity Congestion-Aware Deadlock-Free Dynamic Routing Method for Network on Chip. Proc. IEEE Design Automation Conf., pp. 849–852, July 2006.

[19] R. Marculescu et al. Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, Vol. 28, No. 1, pp. 3–21, Jan. 2009.

[20] E. Nilsson et al. Load Distribution with the Proximity Congestion Awareness in a Network on Chip. Proc. IEEE/ACM Design, Automation and Test in Europe Conf. and Exhibition, pp. 11126–11127, March 2003.

[21] U. Y. Ogras and R. Marculescu. Analysis and Optimization of Prediction-Based Flow Control in Networks-on-Chip. ACM Trans. on Design Automation of Electronic Systems, Vol. 13, No. 1, pp. 1–28, Jan. 2008.

[22] L.-S. Peh and W. J. Dally. A Delay Model and Speculative Architecture for Pipelined Routers. Proc. IEEE Int'l Symp. on High-Performance Computer Architecture, pp. 255–266, Jan. 2001.

[23] R. S. Ramanujam and B. Lin. Destination-Based Adaptive Routing on 2D Mesh Networks. Proc. ACM/IEEE Symp. on Architectures for Networking and Communications Systems, pp. 19–31, Oct. 2010.

[24] K. Sankaralingam et al. Distributed Microarchitectural Protocols in the TRIPS Prototype Processor. Proc. IEEE/ACM Int'l Symp. on Microarchitecture, pp. 480–491, Dec. 2006.

[25] H. Sarbazi-Azad et al. An Analytical Model of Fully-Adaptive Wormhole-Routed k-Ary n-Cubes in the Presence of Hot Spot Traffic. IEEE Trans. on Computers, Vol. 50, No. 7, pp. 623–634, July 2001.

[26] D. Seo et al. Near-Optimal Worst-Case Throughput Routing for Two-Dimensional Mesh Networks. Proc. IEEE/ACM Int'l Symp. on Computer Architecture, pp. 432–443, June 2005.

[27] M. Sheng et al. DBAR: An Efficient Routing Algorithm to Support Multiple Concurrent Applications in Networks-on-Chip. Proc. IEEE/ACM Int'l Symp. on Computer Architecture, pp. 413–424, 2011.

[28] A. Smith et al. Compiling for EDGE Architectures. Proc. IEEE/ACM Int'l Symp. on Code Generation and Optimization, pp. 185–195, March 2006.

[29] Standard Performance Evaluation Corporation: http://www.spec.org.

[30] M. Thottethodi et al. Self-Tuned Congestion Control for Multiprocessor Networks. Proc. Int'l Symp. on High-Performance Computer Architecture, pp. 107–118, Jan. 2001.

[31] I. Walter et al. Access Regulation to Hot-Modules in Wormhole NoCs. Proc. ACM/IEEE Annual Symp. on Networks-on-Chip, pp. 137–148, May 2007.

[32] D. Wu et al. Improving Routing Efficiency for Network-on-Chip Through Contention-Aware Input Selection. Proc. IEEE Asia and South Pacific Conf. on Design Automation, pp. 6–10, March 2006.
