An Interconnect Architecture for Networking Systems on Chips
Wear-out Leveling in Network on Chips (NoCs) - OhioLINK ...
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of Wear-out Leveling in Network on Chips (NoCs) - OhioLINK ...
Wear-out Leveling in Network on Chips (NoCs)
A thesis presented to
the faculty of
the Russ College of Engineering and Technology of Ohio University
In partial fulfillment
of the requirements for the degree
Master of Science
Juman Alshraiedeh
April 2017
© 2017 Juman Alshraiedeh. All Rights Reserved.
2
This thesis titled
Wear-out Leveling in Network on Chips (NoCs)
by
JUMAN ALSHRAIEDEH
has been approved for
the School of Electrical Engineering and Computer Science
and the Russ College of Engineering and Technology by
Avinash Karanth Kodi
Professor of Electrical Engineering and Computer Science
Dennis Irwin
Dean, Russ College of Engineering and Technology
3
Abstract
ALSHRAIEDEH, JUMAN, M.S., April 2017, Electrical Engineering
Wear-out Leveling in Network on Chips (NoCs) (64 pp.)
Director of Thesis: Avinash Karanth Kodi
According to Moore’s Law, the number of transistors on a single chip doubles every
two years, allowing tens or even hundreds of cores to be integrated. As multi-cores
communicate with memory, the underlying Network-on-Chip (NoC) experiences
different stress levels due to asymmetric traffic patterns and complex routing algorithms.
Unfortunately, the growth in the number of transistors in NoCs will significantly
impact both reliability (physical failures) and aging (uneven utilization) due to the
increasing effects of Electromigration (EM), Hot carrier injection (HCI) and Negative
Bias Temperature Instability (NBTI). A novel in-flight, adaptive, routing algorithm is
presented to reduce the accumulation of EM, HCI, and NBTI effects on the lifetime of
NoC. The proposed routing algorithm is based on a new metric called Packet-Per-Port
(P3) which equalizes the stress throughout the network. The net impact is that the network
components such as the links and the routers will age evenly and thereby improve NoC
reliability and maximize the lifetime of the chip. Results indicated that for Splash-2
traces, we observe 7.3% to 13.7% energy per bit reduction and up to 6.23% improvement
in the transistor reliability when compared to Dimensional Order Routing (DOR) on 8x8
mesh. P3 was also compared to age-adaptive routing proposed in [1] and we observe 1%
improvement in reliability and up to 11% energy per bit reduction.
4
I dedicate this thesis to my husbond and son; thanks for all your endless support. I could
not have done it without you
5
Acknowledgments
This work is partially supported by the National Science Foundation (NSF) grants CCF-
1054339 (CAREER), CCF-1318981, CCF-1420718, and CCF-1513606. I would like to
thank my parents and family for all the love and support. Many thanks to Dr. Avinash
Kodi for his guidance and support. Last but not least, I would like to thank my committee
members Dr. Kaya, Dr. Dill, and Dr. Ingram for their time and feedback.
6
Table of Contents
Page
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111.2 Causes of Age Degradation . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2.1 Negative Bias Temperature Instability . . . . . . . . . . . . . . . . 181.2.2 Hot Carrier Injection . . . . . . . . . . . . . . . . . . . . . . . . . 191.2.3 Electromigration . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201.3.1 Use it or Lose it: Wear-out and Lifetime in Future Chip
Multiprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211.3.2 Toward Graceful Aging Degradation in NoCs Through an
Adaptive Routing Algorithm . . . . . . . . . . . . . . . . . . . . . 221.4 Proposed Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Routing in NoCs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.1 Dimensional Order Routing . . . . . . . . . . . . . . . . . . . . . . . . . 262.2 Regional Congestion Awareness . . . . . . . . . . . . . . . . . . . . . . . 272.3 Packet Per Port (P3) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3.1 P3 Routing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 292.3.2 Router Micro-Architecture . . . . . . . . . . . . . . . . . . . . . . 302.3.3 Routing Example . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373.1 Network Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.1.1 Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393.1.2 Latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 503.1.3 Throughput . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 513.1.4 Aging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 543.1.5 Energy per Bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7
4 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Appendix: Routers/Links Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . 63
8
List of Tables
Table Page
2.1 P3 Routing Algorithm. [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Cache and Core Parameter Used for Splash-2 Traces Extraction. [3] . . . . . . 38
9
List of Figures
Figure Page
1.1 Common NoC topologies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121.2 Intel Teraflop processors. [4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . 141.3 IBM Power7 connection. [5] . . . . . . . . . . . . . . . . . . . . . . . . . . . 151.4 Switching techniques used in NoCs. [6] . . . . . . . . . . . . . . . . . . . . . 161.5 Threshold voltage changes through stress and recovery cycle. [7] Copyright
©2006, IEEE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191.6 Use it or Lose it proposal router micro-architecture. [8] . . . . . . . . . . . . . 21
2.1 Routing protocol taxonomy. [9] . . . . . . . . . . . . . . . . . . . . . . . . . 252.2 XY/YX turns roles. Arrows show the possible turns while dash lines mean
prohibited turns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272.3 Regional congestion branches. [10] Copyright©2008, IEEE . . . . . . . . . . 282.4 Router Microarchitecture. Left side shows the smaller crossbars connections
to the other parts of the router, the right top corner shows the congestioninformation collection and calculation. P3 circuit calculation is explained inthe right bottom corner. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Routing Example shows the routing path of a packet routed from router 15 torouter 1. Each graph describes the status of the network at the time of routingand the routing decision. a) shows routing the packet from R 15 to R14 basedon P3 routing algorithm, b) packet get routed from R 14 to R13, c) packetget routed from R13 to R9, d) packet get routed from R9 to 5, e) packet getrouted from R5 to R4, f) packet get routed to it final destination R1 from R4.[2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1 Load distribution for bit reversal traffic for 4x4 mesh under a) P3, b) DOR, c)RCA, d) AGE-ADAP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Load distribution for Matrix transpose traffic for 4x4 mesh under a) P3, b)DOR, c) RCA, d) AGE-ADAP. . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Load distribution for Uniform traffic for 4x4 mesh under a) P3, b) DOR, c)RCA, d) AGE-ADAP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4 Load distribution for complement traffic for 4x4 mesh under a) P3, b) DOR,c) RCA, d) AGE-ADAP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5 Load distribution for uniform traffic for 8x8 mesh under a) P3, b) DOR, c) RCA. 453.6 Load distribution for complement traffic for 8x8 mesh under a) P3, b) DOR ,
c) RCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463.7 Load distribution for bit reversal traffic for 8x8 mesh under a) P3, b) DOR, c)
RCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473.8 Load distribution for matrix transpose traffic for 8x8 mesh under a) P3, b)
DOR, c) RCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
10
3.9 Load distribution for Radiocity from splash-2 traces for 8x8 mesh under a)P3, b) DOR, c) RCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.10 Load distribution for FFM from splash-2 traces for 8x8 mesh under a) P3, b)DOR, c) RCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.11 Average network latency for 4 x 4 mesh for a few select synthetic trafficpatterns - (a) Uniform , (b) Complement , (c) Matrix Transpose , (d) BitReversal.[2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.12 Average network latency for 8 x 8 mesh for a few select synthetic trafficpatterns - (a) Uniform , (b) Complement , (c) Matrix Transpose, and (d) BitReversal.[2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.13 Average network latency for 8 x 8 mesh for splash-2 traces . . . . . . . . . . . 543.14 Network throughput - (a) 4x4 mesh under synthetic traffic , (b) 8x8 mesh
under synthetic traffic patterns, and (c) 8x8 mesh under Splach-2 traces(normalize to DOR). (notation: Uni: Uniform, Comp: Complement, Mt-Tr: MatrixTranspose, Bt-Re: Bit Reversal, Per-sh: Perfect Shuffle, BF: Butterfly, GEO-MEAN:Geometric Mean). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.15 The aging effects on transistor reliability. [2] . . . . . . . . . . . . . . . . . . . 563.16 Energy per Bit for Splash-2 traces normalized to DOR. [2] . . . . . . . . . . . 57
A.1 Load distribution for Water from splash-2 traces for 8x8 mesh under a) P3, b)DOR , c) RCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
A.2 Load distribution for Radix from splash-2 traces for 8x8 mesh under a) P3, b)DOR, c) RCA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
11
1 Introduction
1.1 Motivation
As the ever-growing size of applications require faster and more computing resources,
research is rapidly proposing techniques to improve performance (latency, power, and cost)
of next generation processing cores and the underlying communication fabric to connect
the cores to the memory. While single core processors continue to improve in performance
dramatically (instruction per cycle, performance per Watt) dramatically, increasing the
clock frequency or voltage is challenging due to temperature and power limitations of a
single-chip. As technology scaling has enabled integration of billions of transistors on a
single chip, scaling the number of processing cores to achieve higher performance is the
approach of the industry. Starting with the first dual-core processor developed by Intel in
1998 (Celeron D), the technology has considerably improved whereas today, six multicore
processors are available for personal use as in Intel Core i7 970 and Extreme editions [11].
In the second quarter of 2016, Intel launched Intel Xeon E7 v4 family with 4-24 cores
on chip for workstation, and the system delivered high performance, high reliability, and
in-memory secured computation [12]. The number of processors scales even further for
graphical processing units (GPUs), to 512 processors that can be integrated on chip as in
GF100 512-core (Nivida) and this number of streaming multicores is also increasing. The
critical problem now is how to connect processors to communicate between cores, memory
and I/O devices to achieve the best performance under thermal limits of the chip.
Network on Chips NoCs are the defacto communication fabric used for data transfer
between several cores on the chip. NoCs will define the network topology, switching,
and routing techniques as well as determine the quality of service (QoS) for the network.
Network topologies are divided into two main subdivisions: switched networks -when
disjoint portions of the media are shared via switching, and shared networks where the
12
(a) Bus topology (b) Ring topology (c) Mesh topology (d) Toric topology
Figure 1.1: Common NoC topologies.
network media is shared by all devices on the network [6]. The number of nodes, the
connectivity pattern, and the overall cost in terms of (latency and power) determine the
appropriate network for the chosen application. Bus topology is a basic shared network
topology that restricts the number of simultaneous communication that can take place and
therefore it is bandwidth-limited. Although the bus topology is bandwidth-limited, it is
suitable for low number of cores to be connected anywhere between 2-4 nodes. For high
number of cores, crossbars and higher-radix (degree) switch topologies are more suitable.
Meshes and tori topologies showed promising performance as the bandwidth scales with
the number of nodes in these topologies; however the limited bisectional bandwidth
(maximum bandwidth available when the network is split into two equal halves [13]) affects
the performance gain as the number of processors is scaled. Interconnection networks
should be designed such that maximum amount of information can be transferred within
the least amount of power and with high reliability. The critical aspects that determine
the suitability of the topology are: link bandwidth, bisectional bandwidth, and diameter
(the shortest path between the most distant nodes for a given topology). For 2-D mesh
topology, the bandwidth scales linearly with the number of processors (N) N, but the
bi-sectional bandwidth does not scale as well with the number of processors (N12 )which
13
limits the performance under high load. The linearly growing diameter for 2-D mesh is
2(√
N − 1) which affects power consumption negatively due to extra network hops(links
and router power). On the other hand torus topology provides better bandwidth scalability
4N when compared to N for mesh, and better bisectional bandwidth, 2N12 and smaller
diameter (√
N) when compared to 2-D mesh architecture. Torus improves performance but
at the cost of power and area since torus requires wrap-around which increases the area
and decreases the diameter. The length of wrap-around links is affected by the number of
processors; research suggested to fold the torus to reduce the area to maximize the benefit
of torus [14]. Ring topology, mesh, and torus are all implemented in industry today. For
example TeraFlops was an Intel prototype with 80 cores connected in the mesh topology
(see Figure 1.2), whereas IBM Power7 is a CMP system with 8 cores connected in a ring
network (see Figure 1.3). In February 2015 IBM introduced Power8 which has 12 cores
fabricated using 22 nm technology connected in ring topology [5]. Many topologies have
been developed through the years to connect multicores taking into account that they have
to be mapped onto a two dimensional plane. Research in emerging technology showed
that this may change with the development of the die-stacking and the three-dimensional
integration [15].
NoCs performance relies on the router microarchitecture which decides on the routing,
switching, and arbitration. Routing scheme defines all possible paths that a packet may
take, switching resolves the order of transmission of packets, while arbitration resolves
the conflict when multiple packets request the same resource. Routing algorithms seek to
improve the performance of the network by lowering the latency and power consumption
while improving the throughput of the system. Routing algorithm is explained in more
detail in the next chapter.
Switching in NoCs can either be packet or circuit switching [16]. Packet switching
defines the granularity of the transmission to one packet. In circuit switching, the entire
14
Figure 1.2: Intel Teraflop processors. [4].
path is reserved until the communication ends which reduces the throughput of the network.
Figure 1.4 shows the switching techniques. For Wormhole Switching the header flit (the
minimum size of information transfer through the network) is the only flit that suffers
from latency, other flits follow the path of the header flit. In Store and Forward (S&F)
the entire packet has to be stored before it is forwarded which implies that large buffers
are needed. Unlike S&F, Virtual Cut-Through switching divides the packet into flits and
forwards it as soon as the next router has enough space. However, virtual cut-through
switching requires the same buffer space as S&F, and therefore none of the NoCs designs
have adopted these approaches. Virtual channels (VCs) with buffers are used to improve
the switching performance of the network by allowing multiple packets to occupy the same
15
Figure 1.3: IBM Power7 connection. [5]
channel as long as VCs avoid the deadlock/livelock situation. Arbitration occurs before
switching where a request is put in a schedule to avoid starvation and high latency. There
are multiple arbitration mechanisms proposed for NoCs such as round robin, first come first
serve (FCFS) and priority and each technique has its own pros and cons [15].
Transistors size is continuously shrinking and further scaling to 7 nm or less have been
proposed [17]. However, thinner transistors are less robust and have higher probability
of physical failure [17]. The boundaries at which transistor is considered to be working
properly is shrinking with the transistor size due to the physical characteristics of the
material. The silicon doping level and the alloy have direct impact on the electron mobility
and thereby on the reliability of the transistor. As wires get thinner they get affected more
by crosstalk, coupling, soft error, and process variations [18]. According to International
Technology Roadmap for Semiconductor (ITRS) for 2013, it is important to address the
reliability with thinner transistors to achieve the desired performance for multicore systems
[17]. Improving the NoCs reliability through redundancy is limited by area and power
consumption. Typically, the acknowledgment control signal and packet time stamp are used
to recover and correct soft errors. Research continues to investigate emerging materials and
16
Figure 1.4: Switching techniques used in NoCs. [6]
fabrication techniques to reduce the possibility of hard errors to improve Mean-Time-To-
Failure (MTTF) [15]. MTTF is defined as the expected time before system failure, and is
mathematically calculated as shown in equation 1.1
MTT F =
∫ ∞
0R(t)dt (1.1)
where R(t) is the reliability function.
The continuous scaling of transistors impacts the aging process negatively since aging
is connected to the transistor utilization and fabrication. Aging can be controlled but
not eliminated since it is a result of some physical phenomena. Even though aging
does not cause immediate failures, it shifts critical parameters such as threshold voltage
(Vth) and gate delay causing the transistors to fail. For example, routers fail to operate
under the specified timing constraints when the cumulative gate delay exceeds 10% from
the fabricated value, causing a system failure if no recovery mechanism is implemented.
17
Although aging is connected to physical phenomena, research has showed that the rate at
which router ages depends on the router utilization. Routing algorithms are responsible for
distributing the load throughout the network. Directional Order Routing (DOR) is the basic
routing algorithm where packets travel first in the x-direction, and then in the y-direction.
A prior study has showed that routers in the middle experience a higher traffic, and it is
almost double the traffic when compared to the peripheral routers. The higher aging rate
will dominate the chip aging rate, and therefore to maximize the lifetime of chip it would
be better if all the network components experience even utilization.
1.2 Causes of Age Degradation
Age degradation of transistors is a natural result due to usage and intrinsic property of
the material as well as the fabrication methodology. Today, there are no ideal material
where all defects can be eliminated or where novel materials can retain its production
condition eternally (lifetime of chip). With time, transistors age and deviate from their
production condition. It is possible to have minimal aging effects if aging was considered
at fabrication and changes in the material during runtime due to utilization were taking into
account.
Shrinking transistors and aggressive voltage scaling have reduced the lifetime of
transistors which results in reducing the lifetime of the chip. It is important first to identify
aging sources and reduce the impact as much as possible. Aging will produce a permanent
fault after certain period of time and if age-aware decisions/designs are incorporated, then
by improving the operating conditions, we may prolong the time before a permanent fault
will be observed in the system. Therefore, the goal here is not to eliminate the faults due
to aging but to design runtime decisions that delay the onset of faults and thereby improve
the lifetime of the chip.
18
1.2.1 Negative Bias Temperature Instability
Metal oxide semiconductor field effect transistors (MOSFET) is the technology used
to illustrate the functions in the Integrated Circuit (IC). MOSFET was chosen for ICs
because of its low static power consumption and high resistance to the field noise. Negative
Bias Temperature Instability (NBTI) which is the major cause for aging, primarily affects
the functionality of P-MOSFET. N-MOSFET effected by the temperature instability but the
effects are minimal when compared to p-MOSFET effects. NBTI decreases the electrons
mobility, drain current and conductivity, forcing the voltage threshold to go higher. The
cause of NBTI is not clear, but generally speaking when the logical zero is applied at the
gate for multiple consecutive cycles, interface traps start to build up causing the temperature
to elevate. Multiple diffusion models model the NBTI effects on the transistors, reaction
and diffusion model (R-D) [19] modeled NBTI degradation in two phases: reaction in
terms of chemical reactions and diffusion in terms of hydrogen molecules phases. Through
reaction phase degradation is linear with the stress time t, In the diffusion phase the
degradation is time dependent, but not linear. The hydrogen diffuses from the interface at
t0.25 rate. The number of traps generated through diffusion phase can be model as 1.2[19]:
4Nit(t) ≈
√AB
t0.25 (1.2)
where A and B are material related variables.
Figure[1.5] shows the effect of stress and recovery cycles and how the recovery is
not perfect when stressed with NBTI effects. This causes the voltage threshold to change
(4Vth). With longer stress time, the degradation loss is bigger. It is important to allow
some recovery time from NBTI, but what is more important is to eliminate longer stress
time. The duty cycle (the percentage of the cycle in which the signal is inactive) is a critical
measure to slow down aging. As shown in equation 1.3 proven by [8], duty cycle (β) has a
direct relationship to the lifetime of the transistor under NBTI.
19
Figure 1.5: Threshold voltage changes through stress and recovery cycle. [7] Copyright
©2006, IEEE
TT FNBT I = [ANBT I(1 − ββ
)n exp(nEaNBT I
kT)]
1n [8] (1.3)
1.2.2 Hot Carrier Injection
Hot Carrier Injection (HCI) is a semiconductor phenomenon, where electrons or holes
gain enough kinetic energy to leave the conduction/valence band. In MOSFET transistors
HCI occurs when carriers transfer high kinetic energy to holes and electrons allowing them
to leave the band leaving traps in the interface and therefore the connection efficiency
is reduced. HCI is correlated to the switching activity (αS A) in MOSFET transistors,
according to the equation 1.4 which shows that the lifetime of transistors is inversely
related to switching activity. The equation shows that increasing the switching activity
will increase the age degradation. However notice that the switching activity αS A does not
20
increase by reducing the duty cycle β [8]. The NBTI age degradation is related to 11−β while
HCI age degradation is related to αS A. Small improvement on the duty cycle β tends to
have a large impact on reducing NBTI degradation with small effect on HCI degradation,
see equations 1.5, 1.6.
TT FHCI = AHCI1
dg fαS A(Isub)−N ′exp(
EaHCI
kT) [8] (1.4)
δdg−NBT I = A(β
1 − β
n
tnexp(EaHCI
kT) [8] (1.5)
δdg−HCI = A(Isub)m)(dg fαS At)n [8] (1.6)
1.2.3 Electromigration
When electrons are passing through metal interconnection there is a possibility that
an electron will hit an ion and lose some of its energy resulting in creation of a void.
Electromigration can result in loss of connection because of the voids generated from the
collision which will decrease the efficiency of the conductor. Since the size of transistor is
shrinking, electromigration will have a bigger impact on the lifetime. According to [20],
65 nm processors have three times the failure rate when compared to 180 nm processors.
Electromigration is believed to dominate the IC failure and a lot of research on improving
the IC materials and conditions are already undergoing. Additional improvement to
maximize the lifetime of ICs is to level the degradation rate across the network. Distributing
the load throughout the network will decrease the age degradation rate across the entire
network.
1.3 Related Work
It is nearly impossible to eliminate age related degradation; however controlling the
amount of degradation and reverting aging effects will increase the lifetime of chips.
Previous work looked into aging problems, and has developed different mechanisms with
few of the approaches requiring hardware design changes while others require routing
21
algorithm modifications. In what follows, I will discuss few prior work related to age-
related avoidance and recovery.
1.3.1 Use it or Lose it: Wear-out and Lifetime in Future Chip Multiprocessors
After characterizing HCI and NBTI, it was found that network working under medium
utilization suffer the least aging effects when compared to low or high utilization network.
Under the assumption that real traffic have usually low utilization, Kim et al. [8] designed
wear-resistant router microarchitecture.
Figure 1.6: Use it or Lose it proposal router micro-architecture. [8]
They proposed a router micro-architecture as shown in Figure 1.6 where they added
additional exercise mode circuit to generate random cycles if the router is idle. They
proved that increasing the switching activity has lower impact on aging than higher duty
cycle. The proposal has been tested under Random number generators (ROM) and Linear
Feedback Shift Register (LFSR) and the results showed that LFSR was not able to produce
enough randomness to the network, while (ROM) with different length generator out-
22
performs LFSR and doubles the lifetime. The proposal weakness is in increasing the power
consumption in very large scale systems.
1.3.2 Toward Graceful Aging Degradation in NoCs Through an Adaptive Routing
Algorithm
As routers and links are not going to be used equally in the network, and the highly
used router will have a higher likelihood to fail, Bhardwaj,K et al. [1] proposed to route
packets taking into consideration both aging, and congestion. They used the property of
mesh topology that there are multiple shortest paths between source and destination that are
not in the same row or column. The authors alternatively routed packets through these paths
based on the age and congestion scores. The authors defined a new metric Traffic Threshold
per Epoch (TTpE) , where this metric is calculated off-line to define the maximum amount
of traffic each link or router should experience in different epochs. TTpE is used as an
aging metric while Regional Congestion Awareness (RCA) [10] is used as congestion score.
Packets get routed to the path that is least aged and least congested among all other available
paths by using the look-up table. Bhardwaj,K et al.[1] also implemented a recovery cycle
after a pre-calculated threshold, Traffic Threshold per Epoch (TTpE). When routers reach
their epoch threshold, they are forced to stay idle until the end of the epoch, and this will
allow some stress recovery.
They implemented a recovery cycle in the form of relaxation. If a router reaches its
TTpE value for the epoch, it will stop accepting any new packets until this epoch ends. By
implementing this recovery cycle, there will be some increase on latency but it will relax
the link and the router to reduce the aging degradation.
The need for an off-line calculation that depends on load distribution is one of the
drawbacks of the proposed idea. TTpE has to be set for every application or system. The
other drawback is the use of look-up table as it limits the scalability of the design and loses
23
the ability to adapt to changes in-flight. In the follow-up work, Bhardwaj,K et al. in [21]
considers the power consumption to improve the design [1].
1.4 Proposed Design
In this thesis, I propose an in-flight adaptive routing algorithm which works on aging
links and routers of the network at the same pace to reduce the aging effects and increase the
lifetime of the chip. The algorithm works by distributing the load throughout the network
with minimal effects on the network performance. We propose Packet Per Port (P3) as an
aging metric which is paired with an adaptive routing algorithm to deliver better distribution
of the network load. The algorithm takes advantage of the fully adaptive routing to avoid
the aged link and routers in-flight. P3 connects the routers/links age to the time they are
being used making it possible to control both ages at once to improve the lifetime of chip.
Smaller dependent crossbars [22] are used to further improve the performance and reduce
the power consumption of routers. The use of the simple but effective aging metric (P3)
allows us to improve the transistor reliability by 3.6% to 6.27% when compared to DOR.
We obtain up to 13.8% reduction in power consumption. The major contributions of this
work are as follows:
1. A new adaptive routing algorithm that will increase the lifetime of the chip by better
stress distribution, thus allowing minimum aging effects on the network without
sacrificing latency or throughput.
2. We propose to reduce the router power together with improving reliability by
implementing smaller, independent crossbars in the router instead of one large
monolithic crossbar.
24
3. The simulation results show that we can improve transistor reliability by 3.6% to
6.27% when compared to DOR while saving 13.8% in power on a 8 x 8 mesh
architecture.
The rest of the chapters in this thesis are organized as the following: Chapter 2
discusses routing in NoCs and introduces P3 routing algorithm and design. Chapter 3
evaluates the performance of the proposed design. Chapter 4 concludes and proposes future
work.
25
2 Routing in NoCs
Routing algorithms dictate the path a packet takes from source to destination. The
basic characteristic of any routing algorithm includes connectivity and deadlock/livelock
freedom. Connectivity is the ability to send/receive packets from any source/destination on
the network. Deadlock free is the elimination of any circular dependencies that will clog
the delivery of packets while livelock freedom ensures the forward progression of packets.
Other characteristics such as adaptivity and fault tolerance improve the performance of the
routing algorithm.
Figure 2.1: Routing protocol taxonomy. [9]
As Figure 2.1 shows the classification of routing algorithms. The classification
is based on the number of destinations, routing decisions, implementations, adaptivity,
progressiveness, minimality, and number of paths [6]. Routing algorithms can be
categorized into two main groups: deterministic (oblivious) routing and adaptive routing.
26
Deterministic routing is used to route packets regardless of the network status and is only
based on the destination address. In deterministic routing, packets take the same path
from source to destination. In oblivious routing, packets may take different paths, however
congestion information is not considered. On the other hand, in adaptive routing algorithms
packets take different paths between the same source/destination using congestion and
traffic condition. In this chapter, I will explain two routing algorithms: Dimensional Order
Routing (DOR) and Regional Congestion Awareness (RCA). DOR is an oblivious routing
algorithm and it is considered as the basic routing algorithm for mesh and torus topologies.
RCA is an adaptive routing algorithm which routes around congestion to avoid long latency.
2.1 Dimensional Order Routing
Dimensional Order Routing Algorithm (DOR) is based on a rule that packets cannot
leave the current dimension ( X or Y) until reaching the zero offset. After reaching the zero
offset, it can switch to the next dimension which ensures the deadlock/live lock freedom.
From the definition of DOR, only one turn is allowed from one dimension to the other
through the entire path. For 2-D mesh, DOR can be XY routing where packets are routed
in the X-dimension and then in the Y-dimension, or it can be YX routing (the opposite). In
both cases, DOR prohibits two of the four possible turns which depends on the direction
(XY or YX) to avoid deadlocks/livelocks. As shown in Figure 2.2 there is no circular
dependency since a circle cannot be formed. The prohibition of the turns is based on the
routing algorithm and ensures that the network is fully connected. Figure 2.2a (a) shows the
possible and prohibited turns for XY routing, Figure 2.2a (b) is for YX routing. DOR is a
simple and effective routing algorithm and DOR is the most common routing algorithm for
2-D mesh topology. DOR shows good performance in terms of latency and throughput, but
it does not consider the network status that could translate to high latency due to congestion.
27
(a) XY prohibited turns (b) YX prohibited turns
Figure 2.2: XY/YX turns roles. Arrows show the possible turns while dash lines mean
prohibited turns.
2.2 Regional Congestion Awareness
Regional Congestion Awareness (RCA) is an adaptive routing algorithm takes
congestion into consideration while making the routing decisions. It is proposed by Gratz
et al. [10], where they use the number of free virtual channel and crossbar demand to
speculate the network bottlenecks and avoid them. The proposal has three techniques
RCA-1D, RCA-fanin, and RCA-quadrant. The difference is the flow of the congestion
information; RCA-1D collect the congestion information of each direction independently,
while RCA-fanin provide a regional information because of its coarse view, RCA-quadrant
aim to increase the accuracy of the RCA-fanin by providing two separate scores for the
congestion, one for each axes. Even though RCA-1D is the simplest implementation, it
provides the most promising results.
Results show that when the traffic is uniform or complement, RCA increases the load
at the middle routers while keeping the peripheral routers at low utilization. The benefits
of RCA are shown when the load through the network is not well distributed.
28
Figure 2.3: Regional congestion branches. [10] Copyright©2008, IEEE
The remaining sections will present a novel routing algorithm that takes aging as well
as congestion in choosing the best routing path. The algorithm works on distributing the
load throughout the network to lower the overall aging rate of the network.
2.3 Packet Per Port (P3)
Oblivious routing algorithms like DOR that ignore the network load when choosing
the output port of the router are simple, but create hot spots due to routers experiencing
unequal utilization. On the other hand, fully adaptive routing algorithms can distribute the
load better on routers and links and equalize the utilization; however the penalty shows up
in routing complexity, excess power and latency consumption due to additional hops. For
the above reasons, we propose a new routing algorithm that is fully adaptive and yet, at
the same time uses minimal number of hops to transfer a packet. The algorithm speculates
the congestion information and calculates the next router utilization locally. The proposed
routing algorithm takes advantages of both worlds (oblivious and adaptive) to improve the
lifetime of chips by reducing the NBTI/EM effects on routers/links while delivering high
performance. The overall objective of the proposed routing algorithm is to select the least
utilized and least congested link/router from all candidate links/routers. In our design, we
prioritize least utilized link/router over least congested link/router as the goal is to improve
29
the reliability due to aging. However, one could easily pick least congested link/router if
the goal is higher performance.
2.3.1 P3 Routing Algorithm
The proposed routing algorithm makes routing decisions based on two primarily
scores: P3 and congestion score. P3 is the metric that provides the utilization information of
the next router locally. P3 is a directional metric that keeps track of all the link traversals.
Each router has four cardinal directions (+x, -x, +y, and -y) and therefore has four P3
values associated with each router. Age-aware routing [1] uses the delay variation in the
router’s critical path to select the output port, while P3 provides a more immediate reaction
since the decision is based on the link/router utilization. The number of free buffer slots
propagates dimensionally, either row-wise or column-wise to form the congestion score.
To minimize the lag time between the transmission of the congestion information and
the decision making, I confined the congestion information to the boundaries of source
and destination nodes within a grid. Every node receives congestion scores from each
node along the dimension rather than one cumulative score as in 1D-RCA. The node then
crunches the number to decide the congestion score. For example in a 4x4 mesh, the last
node will receive three congestion scores and depending on the packet destination, the
node will weigh the remaining nodes differently depending on the source and destination
boundaries. The sending router requests and receives congestion information from the two
possible routing directions (x and y) to reach the destination router. However, in 1D RCA
the congestion information is not bounded by the area of interest, as the router receives
congestion information along the entire row or column of the NoC which increases the lag
time between receiving the congestion information and making the decision. Moreover,
traffic intensities could change due to variation in application behavior making 1D-RCA
slower to respond to transient traffic changes. Therefore, our proposed routing algorithm
30
improves on reducing the lag time to receive congestion score and improves the congestion
score calculation as it is limited to source and destination nodes and not the entire network.
The trade-off is that instead of one single score, we receive multiple scores periodically
which incurs marginal power penalty. The routing algorithm selects the least utilized
as well as the least congested link with higher priority given to least utilization. The
route computation includes two stages of finding the possible routes and calculating the
congestion scores. Any source/destination combinations that are not located in the same
row/column have two possible links/routers one in each axis. The number of free input
buffer slots is used to calculate the propagation congestion. By using the 1-D RCA as
described in equations [ 2.1, 2.2 ]. P3is used as the aging metric. At each router, the
links/routers utilization are determined as XP3 = P3[Router][Xdirection] along the x-axis
and YP3 = P3[Router][Ydirection] along the y-axis. Both aging scores and the dimensional
congestion scores are fed to the routing algorithm.
XCong =
xdest∑N=xsrc
XrouterFBS
|xo f f set|(2.1)
YCong =
ydest∑N=ysrc
YrouterFBS
|yo f f set|(2.2)
where Xrouter is the next router on the X-axis, Yrouter is the next router on Y-axis and
FBS is Free Buffer Slots.
2.3.2 Router Micro-Architecture
Independently, smaller crossbars have been implemented to optimize the power
consumption and reduce the area as the wires are shorter and can deliver higher throughput
[22]. The router’s age is determined by the delay variation in the critical path, and
therefore the use of smaller crossbars will not impact aging. However, the use of multi-
crossbars within the router micro-architecture increases the probability of an output port
being occupied when compared to the conventional router [22]. Further, smaller crossbars
31
Table 2.1: P3 Routing Algorithm. [2]
Routing Algorithm
Inputs: Coordinate of current router (Xsrc,Ysrc)
and of destination router (Xdest,Ydest)
XCong, YCong, XP3, YP3
1. Calculate Xoffset = Xdest - Xsrc
2. Calculate Yoffset = Ydest - Ysrc
3. no. of hops = abs(Xoffest) + abs(Yoffset)
4. Determine the Virtual Network [X/Y , -X/Y , X/-Y , -X/-Y]
the packet is going to be routed in
5. If(XP3 > YP3) Output port = Ydirection
Else If( XP3 < YP3 ) Output port = Xdirection
Else If( XP3 == YP3 )
If(XCong <= YCong) = Output port = Xdirection
Else Output port = Ydirection
6. (Output port == Xdirection) ? Xoffset−− : Yoffset−−
7. (Output port == Xdirection) ? P3[Router][Xdirection]++
: P3[Router][Ydirection]++
8. While (no. of hops != 0 )
32
can better distribute the load and improve utilization of all wires within the crossbar. The
proposed micro-architecture with split crossbars along with congestion and aging score
calculation is shown in Figure 2.4. Virtual Channel (VC) allocation is critical in NoCs as
it prohibits any circular dependencies and therefore prevents any deadlock/livelock. In
routers with multi-crossbars, there are four quadrants (+x,+y), (+x,-y), (-x,+y), and (-
x,-y), where the packets can be adaptively routed to the destination, assuming that the
source is the reference point. As Figure 2.4 shows, we require four VCs to eliminate
deadlocks/livelocks with each VC allocated to one quadrant. The VC allocation is based on
the destination quadrant, if the destination is in the (+x,+y) quadrant, VC1 gets allocated.
If the destination is in the (+x,-y), VC2 gets allocated and so on. This separation creates
four Virtual Networks (VN) accordingly, and packets get injected to one VN and cannot
exit until the packet reaches its destination. Forcing packets to stay within the assigned VN
allows packets to move freely within the assigned VN and be routed adaptively.
The main target of P3 routing algorithm is to improve lifetime reliability of NoCs.
While lifetime is improved, this come at a cost of possible increase in latency due to
congestion. As describe above, P3 gives higher priority to aging over congestion and
our results showed that there is insignificant increase on the network latency as packets
sometimes get routed to highly congested routers. P3 routing algorithm were able to
improve the lifetime of chips and reduce power consumption with minimal cost on latency
and throughput.
2.3.3 Routing Example
In this section, I will illustrate an example of the proposed P3 routing algorithm. As
shown in Figure 2.5, the example depicts a 4x4 mesh architecture, however it can be applied
to 8 x 8 architecture as well. In this example router 15 (the top right corner) wants to
communicate with router 0 (bottom left corner). At the source node as shown in Figure
33
Figure 2.4: Router Microarchitecture. Left side shows the smaller crossbars connections
to the other parts of the router, the right top corner shows the congestion information
collection and calculation. P3 circuit calculation is explained in the right bottom corner.
34
Figure 2.5: Routing Example shows the routing path of a packet routed from router 15
to router 1. Each graph describes the status of the network at the time of routing and
the routing decision. a) shows routing the packet from R 15 to R14 based on P3 routing
algorithm, b) packet get routed from R 14 to R13, c) packet get routed from R13 to R9, d)
packet get routed from R9 to 5, e) packet get routed from R5 to R4, f) packet get routed to
it final destination R1 from R4. [2]
2.5(a), R15 (located at [3,3]), wants to communicate with the destination router R0 (located
at [0,0]) and the x and y offsets are [-3,-3]. I use red color for highly congested links and
higher P3 values, blue color for lower P3 values, black colored routers for routers through
which packets have traversed, black solid lines for links that the packet has, and black
35
dashed lines for next possible links. Based on the colors code (from the Figure 2.5) I see
that R14 on Y-axis is utilized higher than the R11 on the X-axis. R15 receives congestion
information from both -X, and -Y axes, R15 senses congestion (between R12 and R13 on
the X-axis, and between R7 and R11 on the Y-axis). Since congestion calculation is based
on the distance between the congestion and the source, XCong will be lower than the YCong.
The routing algorithm routes the packet to the least utilized and the least congested link and
P3[R15][−X] will be increased by 1. As shown in Figure 2.5 (b) for R14, the next router on
the X-axis is R13, and on the Y-axis is R10. In terms of utilization, R13 is preferred and in
terms of congestion R10 is preferred. As we prioritize utilization over congestion, R13 is
selected by the routing algorithm and P3[R14][−X] is increased by 1. As shown in Figure
2.5 (c), R12 and R9 are the next possible candidate routers for the packet. The algorithm
selects to make a turn to the Y-direction since R12 has higher utilization when compared
to R9 and increase P3[R13][−Y] by 1. Even though it senses congestion along the Y-axis,
priority is given to utilization rather than congestion. As shown in Figure 2.5 (d), the two
possible routers (R8 and R5) are potential candidates for the packet, the algorithm decides
to take the least utilized router as it is R5 and increase P3[R9][−Y] by 1. As shown in Figure
2.5 (e), R4 is less utilized than R1 however the congestion on the X-axis is higher than the
Y-axis. The algorithm avoids the highly utilized router and selects to route the packet to
X-axis and increase P3[R5][−X] by 1. As shown in Figure 2.5 (f), the packet has to take
Y-direction to reach router 0 which is the destination router and P3[R4][−Y] get increased
by 1. As this example illustrates, the routing algorithm decides to pick the least congested
and least utilized link with higher priority given to least utilization to improve performance
and reliability.
P3 routing algorithm works on leveling the load through out the network to reduce
aging effects on routers/links and thus improve the reliability of NoC’s and increase the
lifetime of the chip with minimal cost. However, since P3 has visibility to the neighbors
36
utilization packet gets routed to the least aged and/or least congestion to avoids highly aged
router/link and allows for some time to recover. In some cases, packets reach no return
point where it will have to travel the highly aged router/link. This is only allowed when
packets reach one of its zero offset and have to switch to the other direction. It could be
avoided if P3 does not take the shortest path and/or if P3 increases the utilization visibility.
Since shortest path decision saves on power consumption I would not recommend it. A
further improvement could be increasing the utilization visibility to improve the routing
decision by eliminating the number of times packets had to travel through highly aged
router/link. The addition would increase the lifetime of chip tremendously.
37
3 Performance Evaluation
The performance of P3 is evaluated in compared to Dimensional Order Routing
(DOR), Regional Congestion Awareness (RCA), which is an adaptive routing algorithm
that makes the routing decisions to avoid congestion, and Age-Aware Adaptive Routing
(AGE-ADAP) proposed in [1] where packets are being routed to the least aged and least
congested path using look-up table.
I use a cycle accurate network simulator that models the router microarchitecture
presented in section 2.3.2. Performance evaluation is done on 4x4 and 8x8 mesh with
different traffic characteristic to provide full analysis [23]. Synthetic traffic are chosen
to show the strengths and weaknesses of the proposed routing algorithm, two standard
synthetic traffic complement and uniform. Plus, two permutation synthetic traffic: perfect
shuffle and matrix transpose. Finally, we evaluate P3 on real traffic (Splash-2). To simulate
the synthetic traffic, simulator was warmed up then packets start injected and measurements
were collected. I tested the designs for as low as 0.05 network load and until reaching the
network saturation point. Splash-2 traces were collected from real application using SIMIC
simulator supported by GEMS simulator for memory package. Table 3.1 shows the core
caches parameter used to extract Splash-2 traces. While we compared all three algorithms
to P3 on a 4 x 4 mesh, we did not compare AGE-ADAP for 8 x 8 mesh. This is due to the
fact that AGE-ADAP was designed on a 4x4 mesh with a concentration of 4 and our design
is applicable to designs without any concentration. For the simulation, we used four VCs
allocated separately.
To model the aging effects we modeled a 45 nm transistor on Synopsys HSPICE using
HCI/NBTI long term degradation and Predictive Technology Models (PTM) [24] for 10
years. We observed the change in threshold voltage to model the aging factor as presented
by [25].
38
Table 3.1: Cache and Core Parameter Used for Splash-2 Traces Extraction. [3]
Parameter Value
L1/L2 coherence MOESI
L2 cache size/assoc 4MB/16-way
L2 cache line size 64
L2 access latency (cycles) 4
L1 cache size/assoc 64KB/4-way
L1 cache line size 64
L1 access latency (cycles) 2
Core Frequency (GHz) 5
Threads (core) 2
Issue policy In-order
Memory Size (GB) 4
Memory Controllers 16
Memory Latency (cycle) 160
Directory Latency (cycle) 80
39
3.1 Network Performance
In this chapter I will go over the performance (latency, throughput, and power) and
reliability matrices to evaluate P3 routing algorithm.
3.1.1 Utilization
Aging effects can be reduced by leveling the load throughout the network and thereby
improve the utilization (the percent of packets transfer through the router/link of the total
number of packet). P3 routing algorithm works on leveling the utilization throughout the
network to achieve longer lifetime. I calculated the utilization of each router/link of 4x4
mesh and compared the result to DOR, RCA, and AGE-ADAP routing algorithm. The
following Figures show how the load is distributed throughout the network, where each
point presents the percentage of the total packets that get to be routed through specific
coordinate of a specific router. Each router is connected to four coordinates (+X,-X,+Y,-
Y) and each router/link is presented in these figures to show how they suffer under basic
routing from large variation in the utilization while P3 was able to minimize the variation
in utilization and distribute the load more evenly. Figures 3.1, 3.2 show the improvement
of P3 over other routing algorithms for matrix transpose and perfect shuffle traffic patterns.
P3 increases the utilization of the low utilized router/link and decreases the utilization of
the highly utilized router/link. The load distribution for bit reversal and matrix transpose
under DOR routing shows that router/link utilization range is approximately three times
of P3 utilization range. Some routers/links that are physically there did not get to be used
under DOR; however all the physical links get to be used in P3 and this best usage of
the routers/links is what improves the utilization distribution. Comparing P3 to RCA and
AGE-ADAP shows that P3 has approximately 20% tighter range than RCA. In AGE-ADAP
avoiding the high congestion path allows to improve the load distribution. However; the
improvement in utilization falls behind P3 because of how the algorithm calculated the
40
aging score. It is noticeable that the links connected to the same router have very close
utilization values.
(a) P3 utilization under bit reversal traffic (b) DOR utilization under bit reversal traffic
(c) RCA utilization under bit reversal traffic (d) AGE-ADAP utilization under bit reversal
traffic
Figure 3.1: Load distribution for bit reversal traffic for 4x4 mesh under a) P3, b) DOR, c)
RCA, d) AGE-ADAP.
41
(a) P3 utilization under matrix transpose traffic (b) DOR utilization under matrix transpose traffic
(c) RCA utilization under matrix transpose traffic (d) AGE-ADAP utilization under matrix
transpose traffic
Figure 3.2: Load distribution for Matrix transpose traffic for 4x4 mesh under a) P3, b)
DOR, c) RCA, d) AGE-ADAP.
42
On the other hand, P3 is unable to improve the utilization of uniform and complement
traffic. As shown in Figures 3.3,3.4. As any adaptive routing algorithm, when the load is
evenly distributed, the algorithm manages the load on the peripheral routers better than it
does on the core routers. As it is shown in the Figures 3.3,3.4, P3 managed the load at the
peripheral routers and lost control at the middle routers.
The utilization for 8x8 mesh under uniform, complement, bit reversal, and matrix
transpose. figures 3.5, 3.6,3.7, 3.8, show how routers/links utilization distribution is better
for 8x8 mesh when compared to 4x4 mesh. P3 was able to reduce the highly utilized
(core) routers under complement traffic which was not possible for 4x4 mesh. This is
a result of increasing the number of shortest paths available which allows more turns to
avoid aged and congested routers. P3 was able to approximately reduce the routers/links
utilization range by 50% when compared to DOR. Even though P3 utilization range is
close to RCA utilization, the number of routers/links that suffer high or low utilization is
less which reduces the possibility of system failure. Figure 3.9,3.10 and appendix A show
the utilization under splash-2 traces which were improves when compared to DOR and
RCA by approximately 10% and 5% respectively.
43
(a) P3 utilization under uniform traffic (b) DOR utilization under uniform traffic
(c) RCA utilization under uniform traffic (d) AGE-ADAP utilization under uniform traffic
Figure 3.3: Load distribution for Uniform traffic for 4x4 mesh under a) P3, b) DOR, c)
RCA, d) AGE-ADAP.
44
(a) P3 utilization under complement traffic (b) DOR utilization under complement traffic
(c) RCA utilization under complement traffic (d) AGE-ADAP utilization under complement
traffic
Figure 3.4: Load distribution for complement traffic for 4x4 mesh under a) P3, b) DOR, c)
RCA, d) AGE-ADAP.
45
(a) P3 utilization under uniform traffic
(b) DOR utilization under uniform traffic (c) RCA utilization under uniform traffic
Figure 3.5: Load distribution for uniform traffic for 8x8 mesh under a) P3, b) DOR, c)
RCA.
46
(a) P3 utilization under complement traffic
(b) DOR utilization under complement traffic (c) RCA utilization under complement traffic
Figure 3.6: Load distribution for complement traffic for 8x8 mesh under a) P3, b) DOR ,
c) RCA.
47
(a) P3 utilization under bit reversal traffic
(b) DOR utilization under bit reversal traffic (c) RCA utilization under bit reversal traffic
Figure 3.7: Load distribution for bit reversal traffic for 8x8 mesh under a) P3, b) DOR, c)
RCA.
48
(a) P3 utilization under matrix transpose traffic
(b) DOR utilization under matrix transpose (c) RCA utilization under matrix transpose
Figure 3.8: Load distribution for matrix transpose traffic for 8x8 mesh under a) P3, b)
DOR, c) RCA.
49
(a) P3
(b) DOR (c) RCA
Figure 3.9: Load distribution for Radiocity from splash-2 traces for 8x8 mesh under a) P3,
b) DOR, c) RCA.
50
(a) P3
(b) DOR (c) RCA
Figure 3.10: Load distribution for FFM from splash-2 traces for 8x8 mesh under a) P3, b)
DOR, c) RCA.
3.1.2 Latency
Latency is a very critical measurement that shows the performance of the network.
It is the average time a packet would spend in the network before it will be delivered to
the destination. Figure 3.11 shows the network latency for 4 x 4 mesh architecture for
matrix transpose and bit reversal traffic pattern. DOR provides the best performance for
complement and uniform traffic since the load is well distributed throughout the network.
This is due to the fact that adaptive routing algorithms will try to balance the load on the
peripheral routers which results in traffic diverted to the core routers where congestion
builds up. However, since RCA is only concerned with congestion, it outperforms P3 when
51
the network load is uneven as shown in 3.11. It must be noted that RCA does not distribute
the load evenly throughout the network and the packet simply chooses the least congested
path. Even though P3 does not deliver the lowest latency, P3 improves the distribution of
load which will increase the lifetime of the chip. As P3 is concerned with load distribution
and congestion avoidance, latency impact is minimal. AGE-ADAP is able to outperform
P3 in terms of latency, as it is not a fully adaptive algorithm, however AGE-ADAP does not
distribute the load as P3.
For 8 x 8 mesh, P3 shows better load distribution as well as improves the network
latency. As shown in Figures 3.12(c) and 3.12(d), P3 outperforms DOR for permutation
traffic, 19% for matrix transpose and 77% bit reversal because it is able to avoid congestion
and load balance the traffic throughout the network. However, RCA showed better network
latency as it does not consider utilization, and only avoids congestion.
Latency results show that the network will not suffer from a high latency nor
starvation, as figures [3.11, 3.12, 3.13] show the effect on latency is very minimal and this
is due to the fact that the used path is still the shortest path and the congestion is avoided
as much as possible.
3.1.3 Throughput
For 4 x 4 mesh the results indicates that P3 did not affect the throughput for all
tested traffic patterns except for complement traffic where P3 reduces the throughput by
10%. However for 8 x 8 mesh as shown in Figure 3.14(a), P3 is able to improve the
throughput by 36% when compared to DOR for bit reversal traffic. As complement traffic
is well distributed throughout the network, throughput is affected because of the increased
congestion at the core routers. Real traffic throughput as shown in Figure 3.14(b) indicates
that the throughput degradation is marginal and depends on traffic distribution.
52
(a) Uniform Traffic (b) Complement Traffic
(c) Matrix Transpose (d) Bit Reversal
Figure 3.11: Average network latency for 4 x 4 mesh for a few select synthetic traffic
patterns - (a) Uniform , (b) Complement , (c) Matrix Transpose , (d) Bit Reversal.[2]
53
(a) Uniform (b) Complement
(c) Matrix Transpose (d) Bit Reversal
Figure 3.12: Average network latency for 8 x 8 mesh for a few select synthetic traffic
patterns - (a) Uniform , (b) Complement , (c) Matrix Transpose, and (d) Bit Reversal.[2]
54
Figure 3.13: Average network latency for 8 x 8 mesh for splash-2 traces
3.1.4 Aging
Aging in NoCs can be determined by the slow shift in the threshold voltage Vth and this
will result in an increase in gate delay. As NoCs is a latency sensitive network, increasing
the critical path delay by more than 10% is considered as a system failure [26]. The latency
induced will exceed the router pipeline which will create erroneous packet reception. Based
on reaction-diffusion, transistor delay is shifted by Vth according to Alpha Power Law[27]:
dg = αVdd
µ(Vdd − Vth)ϕ(3.1)
P3 will control the increase of the threshold voltage by dividing the load throughout
the network to reduce the age degradation. P3 was able to achieve a better utilization and
a shorter duty cycle. We ran Synopsys HSPICE for a transistor model under PTM for 10
years and collected the Vth to calculate aging[25] as described Aging = 4VthVth
.By increasing
55
(a) 4x4 mesh (synthetic traffic) (b) 8x8 mesh (synthetic traffic)
(c) 8x8 mesh (Splash-2 traces) [2]
Figure 3.14: Network throughput - (a) 4x4 mesh under synthetic traffic , (b) 8x8 mesh
under synthetic traffic patterns, and (c) 8x8 mesh under Splach-2 traces (normalize to
DOR). (notation: Uni: Uniform, Comp: Complement, Mt-Tr: Matrix Transpose, Bt-Re: Bit Reversal,
Per-sh: Perfect Shuffle, BF: Butterfly, GEO-MEAN: Geometric Mean).
56
Figure 3.15: The aging effects on transistor reliability. [2]
the load on the peripheral routers and decreasing the load on the core routers, P3 showed
the best load distribution through out the network. As shown in Figure [3.15], P3 is able to
increase the lifetime reliability by decreasing the diffusion of the threshold voltage.
3.1.5 Energy per Bit
The use of multi-crossbars within the routers and the distributed routing algorithm
paid off for P3 in terms of energy per bit. Figure 3.16 shows that P3 consumes the least
energy per bit for Splash-2 traces, P3 reduces the energy per bit between 7.3% and 13.7%.
All simulation has been done on DSENT simulator [28].
58
4 Conclusions and FutureWork
In this research, I propose an in-flight adaptive routing algorithm that improves the
lifetime of the network by leveling the load on the network. By distributing the load
on all links and routers, our evaluation showed that we can reduce the delay variation
and improve the reliability of the chip. Our results also indicate that we can reduce the
power consumption when compared to state-of-the-art routing algorithms with marginal
performance degradation. P3 improves the lifetime of chip by avoiding highly aged
routers/links, however it is based on the utilization of the next hop only. With limited
visibility to the next router/link utilization P3 shows promising results in term of improving
the lifetime of chips with minimal impact on the latency and throughput. In the future, P3
performance could be boosted by increasing the utilization visibility to find the optimal
path. P3 values could be transferred throughout the network with RCA. This improvement
would result in routing packets to the optimized path which reduces the impact on latency
and throughput.
This research covered two sources of age degradation; Negative Bias Temperature
Instability (NBTI) and Hot Carrier Injection (HCI). The results showed how those two
would improve the lifetime reliability of chip. Electromigration (EM) was not included
because the impact of EM on 45 nm transistors is mainly controlled by doping level
of silicon. As chips are designed with 14 nm transistor and smaller, the effects of EM
degradation on chip are becoming more critical and could dominate the lifetime of chip.
As of today, EM effects have been handled through improvement of the material and the
doping level of silicon. If a routing algorithm or router microarchitecture is designed to
account for EM effects, the lifetime of chip would be improved tremendously.
59
References
[1] Bhardwaj , K. and Chakraborty, K. and Roy, S., “Towards graceful aging degradationin nocs through an adaptive routing algorithm,” in Design Automation Conference(DAC), 2012 49th ACM/EDAC/IEEE, June 2012, pp. 382–391.
[2] J. Alshraiedeh and A. Kodi, “An adaptive routing algorithm to improve lifetimereliability in nocs architecture,” in 2016 IEEE International Symposium on Defectand Fault Tolerance in VLSI and Nanotechnology Systems (DFT), Sept 2016, pp.127–130.
[3] D. DiTomaso, A. Kodi, and A. Louri, “Qore: A fault tolerant network-on-chiparchitecture with power-efficient quad-function channel (qfc) buffers,” in 2014 IEEE20th International Symposium on High Performance Computer Architecture (HPCA),Feb 2014, pp. 320–331.
[4] Teraflop processors. [Online]. Available: http://www.intel.com/pressroom/kits/teraflops/
[5] Power7 processors. [Online]. Available: http://arstechnica.com/gadgets/2009/09/
ibms-8-core-power7-twice-the-muscle-half-the-transistors/
[6] Dally, William and Towles, Brian, Principles and Practices of InterconnectionNetworks. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2003.
[7] Bhardwaj, S. and Wenping Wang and Vattikonda, R. and Yu Cao and Vrudhula, S.,“Predictive modeling of the nbti effect for reliable design,” in Custom IntegratedCircuits Conference, 2006. CICC ’06. IEEE, Sept 2006, pp. 189–192.
[8] Kim, Hyungjun and Vitkovskiy, Arseniy and Gratz, Paul V. and Soteriou,Vassos, “Use it or lose it: Wear-out and lifetime in future chipmultiprocessors,” in Proceedings of the 46th Annual IEEE/ACM InternationalSymposium on Microarchitecture, ser. MICRO-46. New York, NY, USA: ACM,2013, pp. 136–147. [Online]. Available: http://doi.acm.org/10.1145/2540708.2540721
[9] J. Duato, S. Yalamanchili, and N. Lionel, Interconnection Networks: An EngineeringApproach. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2002.
[10] Gratz, P. and Grot, B. and Keckler, S.W., “Regional congestion awareness for loadbalance in networks-on-chip,” in High Performance Computer Architecture, 2008.HPCA 2008. IEEE 14th International Symposium on, Feb 2008, pp. 203–214.
[11] Intel core i7. [Online]. Available: http://ark.intel.com/products/47933/
Intel-Core-i7-970-Processor-12M-Cache-3 20-GHz-4 80-GTs-IntelQPI?wapkw=
core+i7+970/
60
[12] Xeon processor e7 v4 family. [Online]. Available: http://ark.intel.com/products/family/93797/Intel-Xeon-Processor-E7-v4-Family/
[13] K. Bolding and L. Snyder, Eds., PCRCW ’94: Proceedings of the First InternationalWorkshop on Parallel Computer Routing and Communication. London, UK, UK:Springer-Verlag, 1994.
[14] W. J. Dally and B. Towles, “Route packets, not wires: on-chip interconnectionnetworks,” in Design Automation Conference, 2001. Proceedings, 2001, pp. 684–689.
[15] L. Benini and G. D. Micheli, “Chapter 5 - network and transportlayers in networks on chip*,” in Networks on Chips: Topology and Tools,ser. Systems on Silicon, L. Benini, , and G. D. Micheli, Eds. SanFrancisco: Morgan Kaufmann, 2006, pp. 147 – 202. [Online]. Available:http://www.sciencedirect.com/science/article/pii/B9780123705211500060
[16] R. S. Ankur Agarwal, Cyril Iskander, “Survey of network on chip (noc) architecturesand contributions,” Journal of engineering, Computing and Architecture, vol. 3, no. 1,pp. 21–27, 2009.
[17] “ITRS International Technology Roadmap for Semiconductors. Process integration,devices, and structures (PIDS),” 2009.
[18] R. Hegde and N. R. Shanbhag, “Toward achieving energy efficiency in presence ofdeep submicron noise,” IEEE Transactions on Very Large Scale Integration (VLSI)Systems, vol. 8, no. 4, pp. 379–391, Aug 2000.
[19] S. Ogawa and N. Shiono, “Generalized diffusion-reaction model forthe low-field charge-buildup instability at the si-sio2 interface,” Phys.Rev. B, vol. 51, pp. 4218–4230, Feb 1995. [Online]. Available:http://link.aps.org/doi/10.1103/PhysRevB.51.4218
[20] Cher Ming Tan, Electromigration in ULSI interconnects. New Jersey, CA, USA:World Scientific Publishing Co.Pte.Ltd, 2010.
[21] b. Bhardwaj, K. and Chakraborty, K. and Roy, S., “An milp-based aging-aware routingalgorithm for nocs,” March 2012, pp. 326–331.
[22] A. Kodi, R. Morris, D. DiTomaso, A. Sarathy, and A. Louri, “Co-design of channelbuffers and crossbar organizations in nocs architectures,” in Computer-Aided Design(ICCAD), 2011 IEEE/ACM International Conference on, Nov 2011, pp. 219–226.
[23] A. K. Kodi and A. Louri, “Optisim: A system simulation methodology for opticallyinterconnected hpc systems,” IEEE Micro, vol. 28, no. 5, pp. 22–36, Sept 2008.
[24] “Predictive technology model,” http://ptm.asu.edu/.
61
[25] Y. Wang, S. Cotofana, and L. Fang, “A unified aging model of nbti and hci degradationtowards lifetime reliability management for nanoscale mosfet circuits,” in NanoscaleArchitectures (NANOARCH), 2011 IEEE/ACM International Symposium on, June2011, pp. 175–180.
[26] J.W. McPherson, Reliability Physics and Engineering: Time-To-Failure Modeling.Spring, 2010.
[27] T. Sakurai and A. R. Newton, “Alpha-power law mosfet model and its applications tocmos inverter delay and other formulas,” IEEE Journal of Solid-State Circuits, vol. 25,no. 2, pp. 584–594, Apr 1990.
[28] C. Sun, C. H. O. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L. S. Peh, andV. Stojanovic, “Dsent - a tool connecting emerging photonics with electronics foropto-electronic networks-on-chip modeling,” in Networks on Chip (NoCS), 2012 SixthIEEE/ACM International Symposium on, May 2012, pp. 201–210.
[29] Ancajas, D.M. and Bhardwaj, K. and Chakraborty, K. and Roy, S., “Wearoutresilience in nocs through an aging aware adaptive routing algorithm,” Very LargeScale Integration (VLSI) Systems, IEEE Transactions on, vol. 23, no. 2, pp. 369–373,Feb 2015.
[30] A. Kodi, A. Sarathy, A. Louri, and J. Wang, “Adaptive inter-router links for low-power, area-efficient and reliable network-on-chip (noc) architectures,” in DesignAutomation Conference, 2009. ASP-DAC 2009. Asia and South Pacific, Jan 2009,pp. 1–6.
[31] S. Kumar, A. Jantsch, J. P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja,and A. Hemani, “A network on chip architecture and design methodology,” in VLSI,2002. Proceedings. IEEE Computer Society Annual Symposium on, 2002, pp. 105–112.
[32] L. Benini and G. D. Micheli, “Networks on chips: a new soc paradigm,” Computer,vol. 35, no. 1, pp. 70–78, Jan 2002.
[33] A. K. Mishra, N. Vijaykrishnan, and C. R. Das, “A case for heterogeneous on-chipinterconnects for cmps,” SIGARCH Comput. Archit. News, vol. 39, no. 3, pp. 389–400, Jun. 2011. [Online]. Available: http://doi.acm.org/10.1145/2024723.2000111
[34] Dieter K. Schroder, “Negative bias temperature instability: What do we understand?”in Microelectronics Reliability 47 (2007), December 2006, p. 841852.
[35] K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, “The case for asingle-chip multiprocessor,” SIGOPS Oper. Syst. Rev., vol. 30, no. 5, pp. 2–11, Sep.1996. [Online]. Available: http://doi.acm.org/10.1145/248208.237140
62
[36] S. Murali, T. Theocharides, N. Vijaykrishnan, M. J. Irwin, L. Benini, andG. D. Micheli, “Analysis of error recovery schemes for networks on chips,”IEEE Des. Test, vol. 22, no. 5, pp. 434–442, Sep. 2005. [Online]. Available:http://dx.doi.org/10.1109/MDT.2005.104
63
Appendix: Routers/Links Utilization
A.1 Splash-2 Router/Links Utilization
(a) P3
(b) DOR (c) RCA
Figure A.1: Load distribution for Water from splash-2 traces for 8x8 mesh under a) P3, b)
DOR , c) RCA.
64
(a) P3
(b) DOR (c) RCA
Figure A.2: Load distribution for Radix from splash-2 traces for 8x8 mesh under a) P3, b)
DOR, c) RCA.