Cubic Ring Networks: A Polymorphic Topology for Network-on-Chip
Bilal Zafar, Jeff Draper and Timothy Pinkston
Ming Hsieh Department of Electrical Engineering
University of Southern California, Los Angeles, CA
ICPP '10: 39th International Conference on Parallel Processing
San Diego, CA
Sep. 16, 2010
In the beginning … (2001)
• IBM released its first commercial dual-core CPU (Power4)
• Bill Dally presented “Route Packets Not Wires” at DAC
And, now …
• Multi-core CMPs with 6-12 cores commercially available
• Many-core CMP research chips
  o Intel Teraflops Research Chip (80 cores; 2D Mesh)
  o MIT RAW Chip (16 cores; 2D Mesh)
  o UT Austin TRIPS (2x25 tiles; 2D Mesh)
CMPs & Network-on-Chip: Then & Now
9/18/2010 2
Power
• Too much power in the network: 20-36% [Kundu, OCIN Workshop 2006]
• Leakage power is high, especially in buffers
Performance
• Bandwidth is not a problem; sufficient wire bandwidth
• Latency per hop is high
Resilience
• Fixed routing causes a single point of failure
• Expensive to exploit path diversity (blame deadlock avoidance)
Network-on-Chip Challenges
Three main classes of methods to reduce power [M. Horowitz]
• Cheat
  o Lower the performance of the design
• Reduce Waste
  o Stop wasting energy on stuff that doesn’t produce results
• Reformulate Problem
  o Reduce work to be done
Power Reduction
Three main classes of methods to reduce power [M. Horowitz]
• Cheating
  o Lower the performance of the design = Increase latency
    – Dynamic Voltage and Frequency Scaling of links [Shang03, Sing04]
    – Dynamic Link width reduction [Alonso04]
• Reduce Waste
  o Stop wasting energy on stuff that doesn’t produce results = Clock gating
    – Leakage reduction in buffers [Chen03]
• Problem Reformulation
  o Reduce Work
Power Reduction in NoCs
Our Approach: Decrease Bandwidth, Power-Gate Segments
Assumptions:
• Physical k-ary n-cube torus network
• Turning off links & ports saves power
Key Insight
• 8-ary 2-cube Torus
• 64 Routers
• Each Router has 5 ports (4 Network + 1 Local)
• No. of Bi-directional Links = 128
• Average Distance = 4
Key Insight: Illustration
[Figure: Average Distance vs. Turned Off Links, shown with Y-Links Removed = 7]
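The baseline numbers for the 8-ary 2-cube torus can be reproduced with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes the average distance is taken over all source/destination pairs with the source itself included, which is the convention that yields exactly 4.

```python
from itertools import product


def ring_distance(a, b, k):
    """Shortest hop count between positions a and b on a k-node ring."""
    d = abs(a - b) % k
    return min(d, k - d)


def torus_stats(k, n):
    """Return (routers, bidirectional links, average distance) for a k-ary n-cube torus."""
    routers = k ** n
    links = n * routers  # one "+" link per router per dimension (valid for k > 2)
    # The torus is node-symmetric, so distances from node 0 represent every source.
    total = sum(
        sum(ring_distance(0, c, k) for c in coord)
        for coord in product(range(k), repeat=n)
    )
    return routers, links, total / routers


print(torus_stats(8, 2))  # (64, 128, 4.0)
```

Excluding the zero-length source-to-itself pair would give a slightly larger average, so the convention matters when comparing against other published figures.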
Let’s Exploit This!
Power-Bandwidth tradeoff:
• Increase in off links/ports is linear
• Increase in average distance is non-linear
• Power-Bandwidth Tradeoff works across different sizes
Power-Bandwidth Tradeoff
Latency Reduction
Bandwidth Increase
What?
• A network topology and routing function …
Why?
• … designed to operate at multiple power-performance points to meet the bandwidth demands of the application …
How?
• … by dynamically reconfiguring the logical topology.
Cubic Ring Network
A Cubic Ring topology (logical) is obtained from a torus topology (physical) by removing a subset of rings …
• in all but one dimension
• in a hierarchical fashion, where each level is connected to the next higher level (if there exists one) through at least one node
Example: Consider the 4x4x4 torus … morphed into one possible cRing configuration
Topology: Informal Description
k-ary n-cube Torus: n-dimensional, radix-k
• $k^n$ nodes connected via $n k^n$ bi-directional links organized as $n k^{n-1}$ rings
k-ary n-cube R-ring Cubic Ring
• $R = \{r_{n-1}, r_{n-2}, \ldots, r_1, r_0\}$, each $r_i$ is a k-bit string corresponding to a specific set of one or more torus rings in the i-th dimension
• The value of each bit of $r_i$ indicates the presence (if ‘1’) or absence (if ‘0’) of the corresponding set of rings in the cRing
• $r_0[l] = 1$ for all $0 \le l \le k-1$ … all rings in one dimension are connected
• $r_i[l] = 1$ for all $0 \le i \le n-1$, for at least one value of $l$ … each level of the hierarchy is connected to the next higher level through at least one node
Topology: Formal Description
Examples:
• 4-ary 2-cube R-cRing with R = {1000, 1111}
• 4-ary 2-cube R-cRing with R = {0101, 1111}
Topology: Formal Description
Examples:
• 4-ary 2-cube R-cRing with R = {1000, 1111}
• 4-ary 2-cube R-cRing with R = {1010, 1111}
• 4-ary 3-cube R-cRing with R = {0001, 0101, 1111}
Topology: Formal Description
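The two constraints in the formal description can be checked mechanically. The following is a minimal sketch, not the authors' code: `is_valid_cring` is a hypothetical name, R is assumed to be given as a list of bit strings ordered from the highest dimension down to r0 (the order used on the slides), and only the bit-level constraints are checked, not which physical rings each bit selects.

```python
def is_valid_cring(R, k):
    """Check the two constraints on R = [r_{n-1}, ..., r_1, r_0].

    Each r_i must be a k-bit string; r_0 must be all ones (every ring in
    one dimension is present) and every r_i must contain at least one '1'
    (each level reaches the next higher level through at least one node).
    """
    if not R:
        return False
    for r in R:
        if len(r) != k or set(r) - {"0", "1"}:
            return False
    r0 = R[-1]  # last entry is r_0 in the slide's ordering
    return r0 == "1" * k and all("1" in r for r in R)


print(is_valid_cring(["1000", "1111"], 4))          # True
print(is_valid_cring(["0001", "0101", "1111"], 4))  # True
print(is_valid_cring(["0000", "1111"], 4))          # False: no ring left in dimension 1
```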
Routing derives naturally from the topology
• Messages travel up the hierarchy and then down the hierarchy
• cRing routing is NOT up/down routing
  o Rings at each level, not nodes
Example:
• Source = {0,1,1}
• Destination = {2,3,2}
• Route: s → u1 → u2 → u3 → u4 → u5 → u6 → d
  o Source local ring hop: s → u1
  o Source intermediate ring hop: u1 → u2
  o Global ring hop: u2 → u3 → u4
  o Destination intermediate ring hop: u4 → u5
  o Destination local ring hop: u5 → u6 → d
Routing Function
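To make the up-then-down idea concrete, here is a simplified two-dimensional model. It is a sketch under stated assumptions rather than the routing function from the talk: all x rings (rows) are on, `y_rings` is a hypothetical set of column indices whose y ring is on, a message rides its source row to the best such column, takes that y ring to the destination row, then rides the destination row, and every ring is traversed in its shorter direction. It is not expected to reproduce the talk's measured figures.

```python
def ring_distance(a, b, k):
    """Shortest hop count between positions a and b on a k-node ring."""
    d = abs(a - b) % k
    return min(d, k - d)


def cring_hops(src, dst, k, y_rings):
    """Hop count between (x, y) nodes in the assumed 2D cRing model."""
    (sx, sy), (dx, dy) = src, dst
    if sy == dy:
        # Same local (x) ring: no global ring is needed.
        return ring_distance(sx, dx, k)
    # Local ring to column c, global ring to the destination row, local ring to dst.
    return min(
        ring_distance(sx, c, k) + ring_distance(sy, dy, k) + ring_distance(c, dx, k)
        for c in y_rings
    )


def average_distance(k, y_rings):
    """Average hop count over all ordered pairs, source itself included."""
    nodes = [(x, y) for x in range(k) for y in range(k)]
    total = sum(cring_hops(s, d, k, y_rings) for s in nodes for d in nodes)
    return total / len(nodes) ** 2


print(average_distance(8, set(range(8))))  # 4.0, identical to the full torus
print(average_distance(8, {0, 4}))
print(average_distance(8, {0}))
```

With every y ring on, the model collapses to torus distance; removing rings only ever adds detour hops on the local rings, which is the effect the power-bandwidth tradeoff relies on.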
Proof of Deadlock-Freedom
• No cycles within rings
  o Guaranteed by Bubble Flow Control (BFC) [Carrion’97]
  o A message can be injected into a ring iff after injection there is at least one empty message buffer (“bubble”) in the ring in the direction of the message
  o Applies to newly injected and turning messages
• No deadlocks in the up segment (VC0)
  o Dimensions are travelled in increasing order
  o For example: x+/x- → y+/y- → z+/z-
• No deadlocks in the down segment (VC1)
  o Dimensions are travelled in decreasing order (e-cube routing)
  o For example: z+/z- → y+/y- → x+/x-
• No deadlocks in the network
  o VC0 → VC1 and once in VC1 messages sink at their destination
Deadlock-Freedom
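The two ingredients of the argument can be written as simple predicates. This is an illustrative sketch only: the function names and the `(vc, dim)` hop encoding are assumptions made here, and buffers are counted in whole messages.

```python
def can_enter_ring(free_message_buffers):
    """Bubble condition: after the message takes one buffer, at least one
    empty message buffer (the bubble) must remain in that ring direction."""
    return free_message_buffers - 1 >= 1


def follows_cring_order(hops):
    """hops: list of (vc, dim) pairs in travel order.

    VC0 hops must come first with non-decreasing dimensions (up segment),
    then VC1 hops with non-increasing dimensions (down segment).
    """
    vcs = [vc for vc, _ in hops]
    if vcs != sorted(vcs):  # a VC1 -> VC0 transition is never allowed
        return False
    up = [dim for vc, dim in hops if vc == 0]
    down = [dim for vc, dim in hops if vc == 1]
    return up == sorted(up) and down == sorted(down, reverse=True)


print(can_enter_ring(2))  # True: one bubble remains after injection
print(can_enter_ring(1))  # False: injection would consume the last bubble
print(follows_cring_order([(0, 0), (0, 1), (0, 2), (1, 1), (1, 0)]))  # True
print(follows_cring_order([(1, 0), (0, 1)]))  # False
```

The bubble check applies both to newly injected messages and to messages turning from one ring into another, as stated on the slide.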
Router Power Estimate
• Router Verilog, synthesized using Synopsys Design Compiler for TSMC 90nm
• Two virtual channels, 64-bit flit size, 8-flit input buffers
• Fixed configurations: upper limit of the power-savings
Power
Performance Evaluation
Evaluated Networks
  o Flit-level simulation using detailed network simulator, SICOSYS
  o 4-stage Bubble Adaptive Router, 8-flit input buffers, 2-flit packets
  o 4/8-ary 2-cube Torus Network with Bubble Adaptive routing
  o 4/8-ary 2-cube R-cRing w/ R = {0001, 1111}, {0101, 1111} and {0111, 1111}
  o 4/8-ary 2-cube Torus Network with West-Last East-Last routing
On/Off Links with West-Last East-Last Routing [Soteriou’04]
• Alternate Row-Column on/off links
• Each router must have one out-going link in each dimension connected
• West-Last East-Last routing
Performance Evaluation: 64 nodes
                    Torus   WLEL   cRing, ry = {00000001}   cRing, ry = {000100001}   cRing, ry = {01001001}
# On Links          128     64     72                       80                        88
# 3-port Routers    0       0      56                       48                        40
# 5-port Routers    64      64     8                        16                        24
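The cRing columns of the table follow from counting rings. The sketch below assumes the two-dimensional case, where each bit of ry selects exactly one y ring of k links, all k x rings stay on, and a router keeps its two y ports only if its y ring is on; `cring_2d_resources` is a hypothetical helper name.

```python
def cring_2d_resources(k, r_y):
    """Return (on links, 3-port routers, 5-port routers) for a k-ary 2-cube cRing."""
    if len(r_y) != k or set(r_y) - {"0", "1"}:
        raise ValueError("r_y must be a k-bit string of 0s and 1s")
    on_y_rings = r_y.count("1")
    on_links = k * k + k * on_y_rings  # k x rings of k links, plus k links per on y ring
    five_port = k * on_y_rings         # routers on an on y ring: 4 network + 1 local
    three_port = k * k - five_port     # remaining routers: 2 network + 1 local
    return on_links, three_port, five_port


print(cring_2d_resources(8, "00000001"))  # (72, 56, 8)
print(cring_2d_resources(8, "01001001"))  # (88, 40, 24)
print(cring_2d_resources(8, "11111111"))  # (128, 0, 64), i.e. the full torus
```

Any 8-bit ry with two ones gives (80, 48, 16), matching the middle cRing column.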
Independent of Size …
• With 4 Global Rings, < 5% increase in latency
Power-Bandwidth Tradeoff
For a 16x16 Torus
• About 37% off links
• Less than 3% increase in average distance
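The off-link percentage is consistent with keeping 4 of the 16 y rings on. The arithmetic below is a sketch under that assumption (each y ring carries k links, all x rings stay on); `off_link_fraction` is a hypothetical helper name.

```python
def off_link_fraction(k, global_rings_on):
    """Fraction of links turned off in a k-ary 2-cube when only some y rings stay on."""
    total_links = 2 * k * k                # n * k^n links with n = 2
    off_links = (k - global_rings_on) * k  # each removed y ring drops k links
    return off_links / total_links


print(off_link_fraction(16, 4))  # 0.375
```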
Failed Resource:
• Network link (fully or partially)
• Router port: input buffer, VC control, routing unit, output VC control, link control
Resolution:
• Disable the ring involving the failed resource
• Assign the disabled ring to highest dimension
Fault Coverage
• Network remains fully-connected as long as all links in at least one dimension are working
• When no dimension has all links connected, disconnect a ring (in lowest dimension). Does not affect routing elsewhere
For Fault Tolerance
Conclusion
• Dynamically reconfigurable NoCs offer two key advantages:
  o Power-Bandwidth tradeoff, rather than Power-Latency
  o Same mechanism used for power-reduction & fault tolerance
• Cubic Ring Networks are an example of how efficient dynamically reconfigurable NoCs can be implemented
• Must select topology, routing and flow control with an eye toward deadlock-free dynamic reconfiguration
Continuing Work
  o Dynamic Reconfiguration Scheme
  o When to trigger reconfiguration?
  o Quantify fault tolerance capability of cRings
Conclusion & Continuing Work