Cubic Ring Networks: A Polymorphic Topology for Network-on-Chip

Cubic Ring Networks: A Polymorphic Topology for Network-on-Chip. Bilal Zafar, Jeff Draper and Timothy Pinkston, Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA. ICPP '10: 39th International Conference on Parallel Processing, San Diego, CA, Sep. 16, 2010


In the beginning … (2001)
• IBM released its first commercial dual-core CPU (Power4)
• Bill Dally presents “Route Packets Not Wires” at DAC

And, now …
• Multi-core CMPs with 6-12 cores commercially available
• Many-core CMP research chips
  o Intel Teraflops Research Chip (80 cores; 2D Mesh)
  o MIT RAW Chip (16 cores; 2D Mesh)
  o UT Austin TRIPS (2x25 tiles; 2D Mesh)

CMPs & Network-on-Chip: Then & Now

9/18/2010 2

Power
• Too much power in the network: 20-36% [Kundu, OCIN Workshop 2006]
• Leakage power is high, especially in buffers

Performance
• Bandwidth is not a problem; sufficient wire bandwidth
• Latency per hop is high

Resilience
• Fixed routing causes single point-of-failure
• Expensive to exploit path diversity (blame deadlock avoidance)

Network-on-Chip Challenges

Three main classes of methods to reduce power [M. Horowitz]
• Cheat
  o Lower the performance of the design
• Reduce Waste
  o Stop wasting energy on stuff that doesn’t produce results
• Reformulate Problem
  o Reduce work to be done

Power Reduction


Three main classes of methods to reduce power [M. Horowitz]
• Cheating
  o Lower the performance of the design = Increase latency
    – Dynamic Voltage and Frequency Scaling of links [Shang03, Sing04]
    – Dynamic Link width reduction [Alonso04]
• Reduce Waste
  o Stop wasting energy on stuff that doesn’t produce results = Clock gating
    – Leakage reduction in buffers [Chen03]
• Problem Reformulation
  o Reduce Work

Decrease Bandwidth
Power-Gate Segments

Power Reduction in NoCs

Our Approach

Assumptions:
• Physical k-ary n-cube torus network
• Turning off links & ports saves power

Key Insight

• 8-ary 2-cube Torus
• 64 Routers
• Each Router has 5 ports (4 Network + 1 Local)
• No. of Bi-directional Links = 128
• Average Distance = 4

Key Insight: Illustration

[Figure: 4-ary 2-cube Torus; plot of Average Distance vs. Turned Off Links]
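The torus figures quoted on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name `torus_stats` is hypothetical, and the average distance is taken over all source/destination offsets including the zero offset, which is an assumption that happens to give the value 4 quoted for the 8-ary 2-cube.

```python
def torus_stats(k: int, n: int) -> dict:
    """Basic properties of a k-ary n-cube torus (k > 2 assumed)."""
    # Mean hop count around one k-node ring, averaged over all k offsets
    # (offset 0 included; this convention is an assumption).
    ring_avg = sum(min(d, k - d) for d in range(k)) / k
    return {
        "nodes": k ** n,
        "links": n * k ** n,          # bi-directional links
        "rings": n * k ** (n - 1),
        "ports_per_router": 2 * n + 1,  # 2 per dimension + 1 local
        "avg_distance": n * ring_avg,
    }


print(torus_stats(8, 2))
```

For the 8-ary 2-cube this yields 64 routers, 128 bi-directional links, 5 ports per router and an average distance of 4.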

Key Insight: Illustration

[Figure sequence: Y-Links Removed = 1 through 7; plot of Average Distance vs. Turned Off Links]

Let’s Exploit This!

Power-Bandwidth tradeoff:
• Increase in off links/ports is linear
• Increase in average distance is non-linear
• Power-Bandwidth Tradeoff works across different sizes

Power-Bandwidth Tradeoff
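One way to explore the tradeoff above is to measure average distance by breadth-first search while Y rings are switched off. This is a minimal sketch under stated assumptions: it uses shortest-path hop counts on the remaining graph (not the routed path length), includes self-pairs in the average, and spreads the remaining Y rings roughly evenly across columns. The function names and the placement rule are hypothetical, not taken from the slides.

```python
from collections import deque


def cring_2d_neighbors(k, y_ring_columns):
    """Adjacency of a k-ary 2-cube whose Y rings exist only in the given columns."""
    adj = {}
    for x in range(k):
        for y in range(k):
            nbrs = [((x + 1) % k, y), ((x - 1) % k, y)]  # X rings are always on
            if x in y_ring_columns:
                nbrs += [(x, (y + 1) % k), (x, (y - 1) % k)]
            adj[(x, y)] = nbrs
    return adj


def average_distance(adj):
    """Mean shortest-path hop count over all ordered pairs (self-pairs included)."""
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        assert len(dist) == len(adj), "network is disconnected"
        total += sum(dist.values())
    return total / (len(adj) ** 2)


def sweep(k):
    """Average distance as Y rings are turned off, keeping at least one."""
    results = []
    for removed in range(k):
        m = k - removed                              # Y rings left on
        columns = {(i * k) // m for i in range(m)}   # assumed even spacing
        results.append((removed, average_distance(cring_2d_neighbors(k, columns))))
    return results


for removed, avg in sweep(8):
    print(f"Y rings removed = {removed}: average distance = {avg:.3f}")
```

Each removed Y ring turns off the same number of links, so the power side of the tradeoff is linear by construction; the printed distances show how the other side behaves for this particular placement.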

Latency Reduction

Bandwidth Increase

What?
• A network topology and routing function …

Why?
• … designed to operate at multiple power-performance points to meet the bandwidth demands of the application …

How?
• … by dynamically reconfiguring the logical topology.

Cubic Ring Network

The Topology

The Routing

The Results

Agenda



A Cubic Ring topology (logical) is obtained from a torus topology (physical) by removing a subset of rings …
• in all but one dimension
• in a hierarchical fashion, where each level is connected to the next higher level (if one exists) through at least one node

Example: Consider the 4x4x4 torus … morphed into one possible cRing configuration

Topology: Informal Description

k-ary n-cube Torus: n-dimensional, radix-k
• $k^n$ nodes connected via $n k^n$ bi-directional links organized as $n k^{n-1}$ rings

k-ary n-cube R-ring Cubic Ring
• $R = \{r_{n-1}, r_{n-2}, \ldots, r_1, r_0\}$; each $r_i$ is a k-bit string corresponding to a specific set of one or more torus rings in the i-th dimension
• The value of each bit of $r_i$ indicates the presence (if ‘1’) or absence (if ‘0’) of the corresponding set of rings in the cRing
• $r_0[l] = 1$ for all $0 \le l \le k-1$ … all rings in one dimension are connected
• $r_i[l] = 1$ for at least one value of $l$, for all $0 \le i \le n-1$ … each level of the hierarchy is connected to the next higher level through at least one node

Topology: Formal Description
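The two constraints on R can be checked mechanically. The sketch below assumes R is given as a list of bit strings in the order used on the slide, [r_{n-1}, …, r_1, r_0]; the function name `is_valid_cring` is hypothetical.

```python
def is_valid_cring(k: int, n: int, R: list[str]) -> bool:
    """Check the two constraints from the formal description.

    R is listed as [r_{n-1}, ..., r_1, r_0], each entry a k-bit string.
    """
    if len(R) != n:
        return False
    if any(len(r) != k or set(r) - {"0", "1"} for r in R):
        return False
    r0 = R[-1]
    all_rings_in_dim0 = all(bit == "1" for bit in r0)   # r_0[l] = 1 for all l
    every_level_linked = all("1" in r for r in R)       # some r_i[l] = 1 per i
    return all_rings_in_dim0 and every_level_linked


print(is_valid_cring(4, 2, ["0001", "1111"]))          # one Y ring kept
print(is_valid_cring(4, 3, ["0001", "0101", "1111"]))  # three-level hierarchy
print(is_valid_cring(4, 2, ["0000", "1111"]))          # no link to the next level
```

The check is independent of which end of the string is bit 0, since both constraints only ask whether all bits, or at least one bit, are set.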

Examples:
• 4-ary 2-cube R-cRing with R = {0001, 1111}

Topology: Formal Description

Examples:
• 4-ary 2-cube R-cRing with R = {1000, 1111}
• 4-ary 2-cube R-cRing with R = {0101, 1111}

Topology: Formal Description

Examples:
• 4-ary 2-cube R-cRing with R = {1000, 1111}
• 4-ary 2-cube R-cRing with R = {1010, 1111}
• 4-ary 3-cube R-cRing with R = {0001, 0101, 1111}

Topology: Formal Description

The Topology

The Routing

The Results

Agenda



Routing derives naturally from the topology
• Messages travel up the hierarchy and then down the hierarchy
• cRing routing is NOT up/down routing
  o Rings at each level, not nodes

Example:
• Source = {0,1,1}
• Destination = {2,3,2}
• Route: s → u1 → u2 → u3 → u4 → u5 → u6 → d
  o Source local ring hop: s → u1
  o Source intermediate ring hop: u1 → u2
  o Global ring hop: u2 → u3 → u4
  o Destination intermediate ring hop: u4 → u5
  o Destination local ring hop: u5 → u6 → d

Routing Function
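The up-then-down pattern is easiest to see in two dimensions, where the hierarchy has only local (X) rings and global (Y) rings. The sketch below is a simplified 2D illustration, not the routing function from the talk: nodes are (x, y) tuples, the set `y_ring_columns` of columns that keep their Y ring is a hypothetical encoding, the turn column is chosen to minimize X hops, and virtual-channel assignment is not modeled.

```python
def ring_dist(a: int, b: int, k: int) -> int:
    d = (b - a) % k
    return min(d, k - d)


def ring_step(a: int, b: int, k: int) -> int:
    """Direction (+1 or -1) of a shortest way from a to b on a k-node ring."""
    return 1 if (b - a) % k <= k // 2 else -1


def walk(path, dim, target, k):
    """Append hops along one dimension until that coordinate equals target."""
    node = list(path[-1])
    if node[dim] == target:
        return
    step = ring_step(node[dim], target, k)
    while node[dim] != target:
        node[dim] = (node[dim] + step) % k
        path.append(tuple(node))


def route_2d(src, dst, k, y_ring_columns):
    """Local ring hop(s), then global ring hop(s), then local ring hop(s)."""
    path = [tuple(src)]
    if src[1] == dst[1]:
        walk(path, 0, dst[0], k)           # same local ring: stay on it
        return path
    col = min(
        sorted(y_ring_columns),
        key=lambda c: ring_dist(src[0], c, k) + ring_dist(c, dst[0], k),
    )
    walk(path, 0, col, k)                  # up: source local ring
    walk(path, 1, dst[1], k)               # global ring
    walk(path, 0, dst[0], k)               # down: destination local ring
    return path


# 4-ary 2-cube with a single Y ring, assumed here to sit in column 0
print(route_2d((1, 1), (3, 2), 4, {0}))
```

In the 3D example on the slide the same pattern gains one more level: an intermediate ring hop on each side of the global ring hop.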

Proof of Deadlock-Freedom
• No cycles within rings
  o Guaranteed by Bubble Flow Control (BFC) [Carrion’97]
  o A message can be injected into a ring iff, after injection, there is at least one empty message buffer (“bubble”) in the ring in the direction of the message
  o Applies to newly injected and turning messages
• No deadlocks in the up segment (VC0)
  o Dimensions are travelled in increasing order
  o For example: x+/x- → y+/y- → z+/z-
• No deadlocks in the down segment (VC1)
  o Dimensions are travelled in decreasing order (e-cube routing)
  o For example: z+/z- → y+/y- → x+/x-
• No deadlocks in the network
  o VC0 → VC1, and once in VC1 messages sink at their destination

Deadlock-Freedom
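The two rules in this argument can be written as small predicates. This is a sketch under assumptions: buffers are counted in message-sized units, a route is summarized as a list of (vc, dimension) segments in travel order, and both function names are hypothetical.

```python
def bubble_allows_entry(free_buffers_in_direction: int) -> bool:
    """BFC rule: after the message takes one buffer, one bubble must remain.

    Applies both to newly injected messages and to messages turning into a ring.
    """
    return free_buffers_in_direction - 1 >= 1


def legal_cring_order(segments) -> bool:
    """Check VC and dimension ordering for a list of (vc, dim) segments.

    VC0 (up): dimensions must not decrease.
    VC1 (down): dimensions must not increase.
    A message may move VC0 -> VC1 but never back.
    """
    prev_vc, prev_dim = 0, None
    for vc, dim in segments:
        if vc not in (0, 1) or vc < prev_vc:
            return False
        if vc == prev_vc and prev_dim is not None:
            if vc == 0 and dim < prev_dim:
                return False
            if vc == 1 and dim > prev_dim:
                return False
        prev_vc, prev_dim = vc, dim
    return True


# x -> y -> z on VC0, then y -> x on VC1 (dimensions numbered x=0, y=1, z=2)
print(legal_cring_order([(0, 0), (0, 1), (0, 2), (1, 1), (1, 0)]))
print(bubble_allows_entry(1))  # the only free buffer would be consumed
```

The slide does not say which virtual channel carries the global ring hop, so the checker accepts the highest dimension on either one.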

The Topology

The Routing

The Results

Agenda


Router Power Estimate
• Router Verilog, synthesized using Synopsys Design Compiler for TSMC 90nm
• Two virtual channels, 64-bit flit size, 8-flit input buffers
• Fixed configurations: upper limit of the power savings

Power

Performance Evaluation

Evaluated Networks
o Flit-level simulation using a detailed network simulator, SICOSYS
o 4-stage Bubble Adaptive Router, 8-flit input buffers, 2-flit packets
o 4/8-ary 2-cube Torus Network with Bubble Adaptive routing
o 4/8-ary 2-cube R-cRing w/ R = {0001, 1111}, {0101, 1111} and {0111, 1111}
o 4/8-ary 2-cube Torus Network with West-Last East-Last routing

On/Off Links with West-Last East-Last Routing [Soteriou’04]
• Alternate Row-Column on/off links
• Each router must have one out-going link in each dimension connected
• West-Last East-Last routing

Performance Evaluation: 64 nodes

Configuration             # On Links   # 3-port Routers   # 5-port Routers
Torus                     128          0                  64
WLEL                      64           0                  64
cRing, ry = {00000001}    72           56                 8
cRing, ry = {00010001}    80           48                 16
cRing, ry = {01001001}    88           40                 24
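The cRing rows of the table follow directly from the number of Y rings left on. The sketch below assumes a k-ary 2-cube in which every X ring stays on and each ‘1’ in ry enables one Y ring of k links; the function name is hypothetical, and the 8-bit string 00010001 is an assumed reading of the second cRing configuration (two Y rings, matching its 80 on links).

```python
def cring_2d_resources(k: int, ry: str) -> dict:
    """Link and router-port counts for a k-ary 2-cube cRing.

    ry is a k-bit string; each '1' marks a Y ring that is switched on.
    """
    assert len(ry) == k and set(ry) <= {"0", "1"}
    y_rings = ry.count("1")
    five_port = y_rings * k              # routers on a Y ring keep all ports
    return {
        "on_links": k * k + y_rings * k,  # k X rings of k links + k per Y ring
        "three_port": k * k - five_port,  # 2 X ports + 1 local port
        "five_port": five_port,
    }


for ry in ("00000001", "00010001", "01001001", "11111111"):
    print(ry, cring_2d_resources(8, ry))
```

With all eight bits set the counts reduce to the torus row (128 links, 64 five-port routers).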

Independent of Size …
• With 4 Global Rings, < 5% increase in latency

For a 16x16 Torus
• About 37% off links
• Less than 3% increase in average distance

Power-Bandwidth Tradeoff
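The off-link percentage can be checked with simple counting. The sketch below assumes the 16x16 figure refers to a configuration with four global (Y) rings left on, which the slide does not state explicitly for that bullet; the function name is hypothetical.

```python
def off_link_fraction(k: int, global_rings: int) -> float:
    """Fraction of torus links switched off in a k-ary 2-cube cRing."""
    total = 2 * k * k                   # n * k^n links with n = 2
    on = k * k + global_rings * k       # all X rings + the remaining Y rings
    return (total - on) / total


print(f"{off_link_fraction(16, 4):.1%}")  # 192 of 512 links off
```

Under that assumption 192 of 512 links are off, or 37.5%, in line with the "about 37%" on the slide.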

Failed Resource:
• Network link (fully or partially)
• Router port: input buffer, VC control, routing unit, output VC control, link control

Resolution:
• Disable the ring involving the failed resource
• Assign the disabled ring to highest dimension

Fault Coverage
• Network remains fully-connected as long as all links in at least one dimension are working
• When no dimension has all links connected, disconnect a ring (in lowest dimension). Does not affect routing elsewhere

For Fault Tolerance
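The resolution steps can be sketched as bookkeeping over per-dimension ring status. Everything here is a hypothetical illustration: `rings_ok[d][j]` is an assumed layout in which entry j of dimension d is True while that ring is usable, and the function only reports whether some dimension still has every ring working and could therefore serve as the fully connected dimension.

```python
def handle_link_fault(rings_ok, failed_dim, failed_ring):
    """Disable the ring containing a failed resource.

    Returns (base_dim, updated), where base_dim is the first dimension whose
    rings all still work (a candidate for the fully connected dimension),
    or None if no such dimension remains.
    """
    updated = [list(row) for row in rings_ok]
    updated[failed_dim][failed_ring] = False
    base_dim = next((d for d, row in enumerate(updated) if all(row)), None)
    return base_dim, updated


# 4-ary 2-cube, everything healthy, then a link fails in ring 2 of dimension 0
healthy = [[True] * 4, [True] * 4]
base, state = handle_link_fault(healthy, failed_dim=0, failed_ring=2)
print(base, state)
```

When `base_dim` is None, no dimension has all links working, which corresponds to the second Fault Coverage case, where a ring has to be disconnected.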

Conclusion
• Dynamically reconfigurable NoCs offer two key advantages:
  o Power-Bandwidth tradeoff, rather than Power-Latency
  o Same mechanism used for power reduction & fault tolerance
• Cubic Ring Networks are an example of how efficient dynamically reconfigurable NoCs can be implemented
• Must select topology, routing and flow control with an eye toward deadlock-free dynamic reconfiguration

Continuing Work
o Dynamic Reconfiguration Scheme
o When to trigger reconfiguration?
o Quantify fault tolerance capability of cRings

Conclusion & Continuing Work

Questions?
