
Dynamic Voltage/Frequency Scaling and Power-Gating of Network-on-Chip with

Machine Learning

A thesis presented to

the faculty of

the Russ College of Engineering and Technology of Ohio University

In partial fulfillment

of the requirements for the degree

Master of Science

Mark A. Clark

May 2019

© 2019 Mark A. Clark. All Rights Reserved.


This thesis titled

Dynamic Voltage/Frequency Scaling and Power-Gating of Network-on-Chip with

Machine Learning

by

MARK A. CLARK

has been approved for

the School of Electrical Engineering and Computer Science

and the Russ College of Engineering and Technology by

Avinash Karanth

Professor of Electrical Engineering and Computer Science

Dennis Irwin

Dean, Russ College of Engineering and Technology


Abstract

CLARK, MARK A., M.S., May 2019, Electrical Engineering

Dynamic Voltage/Frequency Scaling and Power-Gating of Network-on-Chip with

Machine Learning (89 pp.)

Director of Thesis: Avinash Karanth

Network-on-chip (NoC) continues to be the preferred communication fabric in

multicore and manycore architectures as the NoC seamlessly blends the resource efficiency

of the bus with the parallelization of the crossbar. However, without adaptable power

management the NoC suffers from excessive static power consumption at higher core

counts. Static power consumption will increase proportionally as the size of the NoC

increases to accommodate higher core counts in the future. NoC also suffers from excessive

dynamic energy as traffic loads fluctuate throughout the execution of an application. Power-

gating (PG) and Dynamic Voltage and Frequency Scaling (DVFS) are two highly effective

techniques proposed in the literature to reduce static power and dynamic energy in the NoC

respectively. DVFS is a popular technique that allows dynamic energy to be saved but may

potentially lead to a loss in throughput. Power-gating allows static power to be saved but

can introduce new problems incurred by isolating network routers. Further complications

include the introduction of long wake-up delays and break-even times. However, both

DVFS and power-gating are critical for realizing energy proportional computing as core

counts race into the hundreds for multi-cores.

In this thesis, we propose two distinct but related techniques that enable energy-

proportional computing for NoC. We first propose LEAD - Learning-enabled Energy-

Aware Dynamic voltage/frequency scaling for NoC architectures. LEAD applies machine

learning (ML) techniques to enable improvements in both energy and performance with

reduced overhead cost. This allows LEAD to enact a proactive energy management

strategy that relies on an offline trained regression model while also providing a wide


variety of voltage/frequency (VF) pairs. In this work, we will refer to various VF pairs

as modes. LEAD groups each router and the router’s outgoing links locally into the

same V/F domain, allowing energy management at a finer granularity without additional

timing complications and overhead. We then build on LEAD and propose DozzNoC, an

adaptable power management technique that effectively combines LEAD with a partially

non-blocking power-gating technique. This allows DozzNoC to target both static power

and dynamic energy simultaneously, thereby enabling energy proportional computing. Our

ML DVFS techniques from LEAD are applied on top of a partially non-blocking power-

gated scheme that uses real-valued wake-up/switching delays. DozzNoC also allows

independently power-gated or voltage scaled routers such that each router and its outgoing

links share the same voltage/frequency domain.

We evaluate both LEAD and DozzNoC using trace files generated from PARSEC 2.1

and Splash-2 benchmark suites. Trace files are gathered at various network sizes and across

two different network topologies. For a 64 core 4 × 4 concentrated mesh (CMesh) network,

simulation results show that LEAD can achieve an average of 17% dynamic energy savings

for an average loss of only 4% throughput. Our simulation results for DozzNoC on an 8 ×

8 mesh network show that for an average decrease of 7% in throughput, we can achieve an

average dynamic energy savings of 25% and an average static power reduction of 53%.


Acknowledgments

I thank my advisor, Dr. Avinash Karanth, for the support, guidance, and motivation he

provided. I also want to thank the many wonderful friends I made throughout my time at

Ohio University, even if we have since gone our own ways in life.


Table of Contents

Abstract
Acknowledgments
List of Tables
List of Figures
List of Acronyms

1 Introduction
  1.1 Integrated Circuits to Multicores
  1.2 Energy Proportional Computing and NoC
  1.3 Dynamic Voltage and Frequency Scaling for Multicores
  1.4 Power-gating for Multicores
  1.5 Benefits of Machine Learning
  1.6 Major Contributions
  1.7 Thesis Organization

2 LEAD: Offline Trained Proactive DVFS for NoC
  2.1 Related Works
  2.2 LEAD Architecture
    2.2.1 Operating V/F Modes
  2.3 DVFS Models and Implementation
    2.3.1 DVFS Implementation
  2.4 Machine Learning for DVFS

3 DozzNoC: Combination of ML based DVFS and Power-Gating for NoC
  3.1 Related Works
  3.2 DozzNoC Architecture
    3.2.1 Operational States
  3.3 Power-Gated DVFS Models
  3.4 Machine Learning for PG-DVFS

4 Performance Evaluation
  4.1 LEAD Simulation Methodology
    4.1.1 LEAD Model Variants
    4.1.2 LEAD Mode Breakdown
  4.2 LEAD ML Simulation Methodology
    4.2.1 LEAD Feature Engineering
    4.2.2 LEAD ML Accuracy
  4.3 DozzNoC Simulation Methodology
    4.3.1 DozzNoC Model Variants
    4.3.2 DozzNoC Mode Breakdown
  4.4 DozzNoC ML Simulation Methodology
    4.4.1 DozzNoC Feature Engineering
  4.5 LEAD Results
    4.5.1 LEAD Energy and Throughput
  4.6 DozzNoC Results
    4.6.1 DozzNoC Throughput, Static Power, and Dynamic Energy

5 Conclusions and Future Work

References


List of Tables

3.1 DozzNoC's Reduced Feature Set [16]
4.1 LEAD Benchmarks
4.2 Multi2sim Parameters
4.3 Dynamic Energy Per Hop (Modes 1-5) [17] © 2018 ACM
4.4 Full LEAD Feature Set
4.5 Full LEAD Feature Set (Cont.)
4.6 LEAD-τ Mode Selection Accuracy [17] © 2018 ACM
4.7 DozzNoC Benchmarks
4.8 Static Power and Dynamic Energy Per Hop for Active State Operational Modes [16]


List of Figures

1.1 Rapid growth of processor performance from increased clock speed to multi-core processors [46].
1.2 Various network topologies ranging from the bus to a hypercube.
1.3 Depiction of static power becoming the majority of power consumption in the NoC as technology size decreases [9].
1.4 Depiction of DVFS being applied at various granularities ranging from per network to per element.
1.5 An example of power-gating applied to the NoC where the router modification, handshaking, and router pipeline are shown from Power-Punch [13].
2.1 An example DVS link is shown in part (a), while a history-based DVS algorithm is shown in part (b) [51].
2.2 A Threshold and PI controller Finite State Machine (FSM) are shown in (a), while a Greedy controller FSM is shown in (b) [30].
2.3 An example of simultaneous power, temperature, and performance management using Q-learning [52].
2.4 We apply LEAD to a CMesh with 16 routers and 64 cores. We use on-chip voltage regulators that can adjust the supply voltage between 0.8V and 1.2V, allowing us to apply DVFS to individual routers and their corresponding links [17] © 2018 ACM.
2.5 The architecture as well as all additional units required for reactive or proactive mode selection are shown in (a). A simple voltage regulator setup that allows the selection of voltage levels in the range of 0.8V to 1.2V for every router and its associated outgoing links is shown in (b) [17] © 2018 ACM.
2.6 LEAD-τ uses a predicted input buffer utilization to select the optimal mode per epoch. LEAD-∆ uses a predicted change in input buffer utilization to move in the direction of the optimal mode per epoch. LEAD-G incorporates both energy and throughput into the label and moves up/down adjacent modes (based on exploration direction) such that energy divided by throughput is minimized [17] © 2018 ACM.
3.1 The difference in network component sizes between a Single-NoC and a Multi-NoC [19].
3.2 The NoRD network bypass ring is shown in (a), the router bypass datapath is shown in (b), and the network interface datapath is shown in (c) [11].
3.3 Four separate DarkNoC layers are shown in (a), a single DarkNoC layer is shown in (b), and DarkNoC routers are shown in (c) [8].
3.4 We apply DozzNoC to both a CMesh in (a), as well as a mesh in (b). The router microarchitecture for a mesh topology is shown in (c) [16].
3.5 Our version of Power Punch acts as a baseline and is shown in (a). LEAD-τ has only one state, the active state; however, active routers may operate at one of five different voltage levels (b). DozzNoC has three states, and when a router is active it may operate at one of five different voltage levels as shown in (c) [16].
3.6 DozzNoC mode selection is shown in (a), while LEAD-τ mode selection is shown in (b) [16].
3.7 A walkthrough example showing the difference in active state mode selection across two epochs for a dummy network [16].
4.1 The throughput loss and dynamic energy savings across multiple LEAD-τ variants with an epoch size of 500 cycles [17] © 2018 ACM.
4.2 The time routers and links spent in each mode for all LEAD models across all test traces [17] © 2018 ACM.
4.3 The change in throughput, dynamic energy, and latency of DozzNoC at various epoch sizes compared to a baseline DozzNoC model with an epoch size of 100 cycles [16].
4.4 A breakdown of predicted operational modes while in an active state for an 8 × 8 mesh network for DozzNoC (a), LEAD-τ (b), and ML+TURBO (c) [16].
4.5 The results of DozzNoC single feature mode selection accuracy testing [16].
4.6 The throughput, latency, dynamic energy savings, static power savings and EDP of DozzNoC-5 versus DozzNoC-41 [16].
4.7 The throughput (a) and Normalized Dynamic Energy (b) for all LEAD models compared against baseline and greedy [17] © 2018 ACM.
4.8 DozzNoC throughput for CMesh architecture at epoch size of 500 cycles (a). DozzNoC normalized static and dynamic energy for CMesh at epoch size of 500 cycles at high load (b). DozzNoC normalized static and dynamic energy for CMesh at epoch size of 500 cycles at low load (c) [16].
4.9 DozzNoC throughput for mesh architecture at epoch size of 500 cycles (a). DozzNoC normalized static and dynamic energy for mesh at epoch size of 500 cycles at high load (b). DozzNoC normalized static and dynamic energy for mesh at epoch size of 500 cycles at low load (c) [16].


List of Acronyms

BW - Buffer Write

CMesh - Concentrated Mesh

DOR - Dimension Order Routing

DozzNoC - Partially non-blocking Power-Gating with ML based DVFS

DVS - Dynamic Voltage Scaling

DVFS - Dynamic Voltage and Frequency Scaling

HPC - High-Performance Computing

HVt - High Threshold Voltage

IC - Integrated Circuit

LEAD - Learning-enabled Energy-Aware Dynamic voltage/frequency scaling

LVt - Low Threshold Voltage

ML - Machine Learning

MOSFET - Metal-Oxide-Semiconductor Field-Effect Transistor

NoC - Network-on-Chip

NVt - Normal Threshold Voltage

PG - Power-Gating

RC - Route Computation

RL - Reinforcement Learning

SA - Switch Allocation

SCC - Single-chip Cloud Computer

SSI - Small-Scale Integration

ST - Switch Traversal

T-Breakeven - Breakeven Time

T-Wakeup - Wake-up Delay

ULSI - Ultra-Large-Scale Integration


VA - Virtual Channel Allocation

VC - Virtual Channel

VFI - Voltage Frequency Island


1 Introduction

1.1 Integrated Circuits to Multicores

The modern multi-core processor¹ is a result of several technological advancements that began with the development of the integrated circuit (IC). The IC, a collection of microelectronic components or circuits fabricated onto the same microchip [36], was enabled largely by the invention of the transistor. A transistor is a semiconductor device that acts like an electronic switch in most cases [23, 53, 60] but may also be used for signal amplification. At the time of its invention over 70 years ago, the transistor was a large device that replaced the even larger vacuum tube. The number of transistors contained within an IC was negligible, in the small-scale integration (SSI) range. However, in the

next few decades, MOSFET (metal-oxide-semiconductor field-effect transistor) scaling

following Moore's Law [42] allowed the size of the transistor to be drastically

reduced. Moore's Law [42, 50] is an observation by Gordon Moore that the number of transistors per square inch on an IC doubles every year, a rate later revised to doubling every 18 months. This allowed the number of transistors packed

onto ICs to grow into the ultra-large-scale integration (ULSI) range. In 1974, Robert H. Dennard noted that as transistor size decreased, power density remained constant [54]. The eventual breakdown of Dennard Scaling meant that, as technology size decreased, there would be an ever-increasing need for power management strategies. At larger technology sizes, dynamic

power constituted the majority of the power consumed by the chip. Therefore, early

power management strategies focused on the reduction of dynamic energy, the energy

that is expended due to transistor switching [7, 20, 41, 51]. However, power consumption

has not remained proportional to area as Dennard failed to consider leakage current. At

¹ Some material, including figures, sentences, and paragraphs, is used verbatim from prior publications [17, 26] with permission © 2018 ACM and © 2018 IEEE, as well as from a submitted publication awaiting decision [16].


smaller technology sizes, leakage current has caused static power to become the dominant

source of power consumption in the IC [10, 25, 28, 43]. Modern integrated circuits known

as microprocessors contain several billions of transistors. Examples of microprocessors

include the many core TILE-Gx [47], the Kalray MPPA-256 [21], and the 336-core Ambric

Am2045 [33]. With these astronomical number of transistors, there is an urgent need for

power management strategies. Such strategies must enable energy proportional computing

through the reduction of both static power and dynamic energy.

The multi-core processor is an advanced microprocessor that contains multiple

independent processing units. Each individual processing unit can execute separate

instructions from multiple threads in parallel. Processing cores/threads can also work

together on the same application or task by dividing parallel instructions among the cores.

The popularity of multi-core processors is a direct result of the ever-increasing demand

for computational power to run high-performance computing (HPC) applications. Such

applications range from simulations in astrophysics or biology, to large artificial neural

networks in machine learning, to cloud computing [34, 38, 49].

There are essentially two different methods of increasing the computational power

of the processor. The first involves increasing clock speed. Instructions are executed on

the rising or falling edge of the clock; therefore, if the clock frequency is increased, the

throughput of the processor will rise. This can be accomplished by increasing the power

consumption of the processor as power and frequency are directly related. However, due

to Moore’s Law and Dennard Scaling, there is an upper limit to this growth. Eventually

the amount of power consumed to raise the clock frequency will outweigh any additional

performance benefits. Intel and other chip manufacturers saw this phenomenon around 3-4

GHz as processor designers ran into the power-wall [46]. As this upper limit was reached,

a new method emerged to increase computational throughput. Instead of increasing

the clock frequency, several lower frequency processing units were built onto the same


Figure 1.1: Rapid growth of processor performance from increased clock speed to multi-core processors [46].

microprocessor and used in parallel. This allows multiple applications to be executed

at the same time, and it allows multiple cores to work together on parallel applications.

Starting with dual-core designs, multi-core processors have already reached hundreds of cores in modern supercomputers. This rapid rise in processor performance from single-core to

multi-core processors is shown in Figure 1.1. This rising core count has led to the need for

a highly scalable interconnect fabric.

Initially, a bus or crossbar would fulfill the role of interconnect between cores.

However, these communication fabrics have severe drawbacks when they are scaled into

higher core counts. A bus interconnect only allows one on-chip component at a time to use

the shared system bus. This system bus is composed of a control bus, an address bus, and a

data bus, while each core contains a CPU, memory, or I/O device. Each additional core brings additional devices that must share the same system bus. This can severely cripple


any potential performance gains. This means that while the bus is extremely resource

efficient, its performance is not scalable into higher core counts. The crossbar switch is

a non-blocking interconnect. A crossbar allows multiple components to use dedicated

channels without interrupting the connectivity of other components. However, this type

of interconnect uses vast amounts of wires and switch points at higher core counts. This

implies that the crossbar interconnect is not easily scalable in terms of on-chip area and

resource efficiency. Therefore, a newer communication fabric known as the Network-

on-Chip (NoC) has emerged that seeks to combine the resource efficiency of the bus with

the parallelizable nature of the crossbar.

1.2 Energy Proportional Computing and NoC

The NoC allows several cores to communicate and work in parallel through the usage

of multiple routers and links. These routers and links may be arranged in various physical

layouts or network topologies which belong to two types of networks. The first type of

network is called a direct network in which every node is both a terminal and a switch.

This differs from an indirect network in which nodes may be either switches or terminals

[1]. The simplest direct network is the mesh topology wherein routers and links form a

grid-like pattern. Each core is connected to a dedicated router which is used to send and

receive data among cores. A concentrated mesh (CMesh) topology is similar to a mesh in

that routers and links are arranged in a grid-like pattern. However, the CMesh differs from

a mesh because multiple cores share a common router. The idea is to balance resource

efficiency and network performance with varying concentration factors. Torus is another

popular network topology similar to a mesh; however, it contains wrap-around links. Wrap-

around links enable fewer hops between distant cores which in turn decreases network

latency. While a torus network can increase performance over a mesh, it has a larger

footprint due to increased numbers of wires. In larger torus networks, the wrap-around


Figure 1.2: Various network topologies ranging from the bus to a hypercube.

link may be significantly longer than normal links, leading to uneven data arrival times.

Hypercubes are multi-dimensional and can range from n-dimensional meshes to k-ary n-

cubes. Hypercubes allow increased network bisection bandwidth with increased design

complexity. At higher dimensions, the design can be difficult to physically implement with

the number of communication ports and links per processor increasing logarithmically.

These various network topologies are shown in Figure 1.2.

The two most essential components of the NoC are the routers and links which allow

for the storing, switching, and routing of data between cores [18]. Data is sent in the form

of packets and a packet is often split into smaller units called flits. The flit is the smallest

unit of data sent through the network, and there are three different types of flits. The head

flit traverses the network first and secures the data path. Body flits and tail flits follow once

the path is secured, with the tail flit deallocating the path. Routers are composed of several

units such as input/output buffers and a crossbar that are used in different stages of the router


Figure 1.3: Depiction of static power becoming the majority of power consumption in the NoC as technology size decreases [9].

pipeline. A simple 5-stage pipeline starts with a buffer write (BW) stage wherein flits

are written to an input virtual channel. The next stage is the route computation (RC) stage

where the route is computed. Route computation may be done statically or dynamically in

several different ways. If the routing decision is static (deterministic), then downstream data paths are fixed, and the route is pre-determined. If the routing decision is dynamic (adaptive), then the downstream data paths are not fixed, and the route is determined based

on the current state of the network. The next stage is the virtual channel allocation (VA)

stage where a downstream virtual channel is allocated, then during the switch allocation

(SA) stage packets compete for the crossbar. Finally, in the switch traversal (ST) stage the

packet is switched across the crossbar and the stages repeat in the downstream router.
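
For clarity, the short Python sketch below (our own illustration, not a cycle-accurate model; stage behavior is condensed into comments) walks through the five pipeline stages just described:

    from enum import Enum

    # The five canonical router pipeline stages described above. Note that
    # RC, VA, and SA are performed for the head flit; body and tail flits
    # inherit the reserved path and only pass through BW and ST.
    class Stage(Enum):
        BW = "Buffer Write: flit is written to an input virtual channel"
        RC = "Route Computation: output port computed (e.g., XY DOR)"
        VA = "Virtual Channel Allocation: downstream VC is allocated"
        SA = "Switch Allocation: head flit competes for the crossbar"
        ST = "Switch Traversal: flit crosses the crossbar to the output"

    # A head flit visits every stage in order; the sequence then repeats
    # at the downstream router until the tail flit deallocates the path.
    for stage in Stage:
        print(stage.name, "-", stage.value)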

Previous research has shown that the interconnect fabric consumes as much as 30%

or more of the total chip power [27]. We have also seen how the interconnect can account

for over 50% of the total chip dynamic power [39]. At large technology sizes, static power

consumption is a relatively small portion of the total power. However, as technology size

has decreased, NoC static power consumption has risen from 17.9% of total power at 65nm

to 74% at 22nm [3, 11, 14, 45, 56]. This growing percentage of static power consumption


in the NoC is shown in Figure 1.3. If these trends continue, we can expect static power

consumption to dominate total power consumption of multicore chips at 14nm and below.

Naturally, there have been many attempts to reduce both static and dynamic power; these energy proportional computing methods will be introduced in Sections 1.3 and 1.4.

Energy proportional computing states that the power consumed by the electronic circuit or

microprocessor should be proportional to the amount of work being done [19]. This means

that processors under higher work-load are expected to consume more static and dynamic

power than idle processors. On the other hand, idle processors or those under light loads

should consume little to no power. This concept can be applied to the NoC such that an

idle NoC consumes substantially less static and dynamic power than a fully active NoC.

1.3 Dynamic Voltage and Frequency Scaling for Multicores

Dynamic voltage and frequency scaling (DVFS) is a popular technique that allows

supply voltage and clock frequency of various chip components to be dynamically altered

at run-time. DVFS may be applied to the cores and the entire NoC, or it may be

applied individually to various NoC components such as routers and links. The supply

voltage and clock frequency are increased/decreased in proportion to network load with

the key goal of saving dynamic power while meeting strict performance requirements

[4, 7, 20, 41, 51, 59, 63, 64]. The relationship between transistor dynamic power and

supply voltage/clock frequency is given as [37]:

P_dynamic = C · V² · A · f    (1.1)

where C is load capacitance, V is supply voltage, f is frequency, and A is activity factor.
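
To make this relationship concrete, the sketch below (our own worked example; the load capacitance and activity factor are hypothetical values, not taken from [37]) evaluates Equation 1.1 at the lowest and highest V/F pairs used later in this thesis:

    # Worked example of Equation 1.1: P_dynamic = C * V^2 * A * f.
    # C and A are illustrative placeholder values only.
    C = 1e-9   # load capacitance in farads (assumed)
    A = 0.5    # activity factor, fraction of capacitance switching (assumed)

    def p_dynamic(v_supply, freq):
        """Dynamic power in watts at supply voltage v_supply (V) and clock freq (Hz)."""
        return C * v_supply**2 * A * freq

    low = p_dynamic(0.8, 1.00e9)    # lowest mode:  0.8 V / 1 GHz
    high = p_dynamic(1.2, 2.25e9)   # highest mode: 1.2 V / 2.25 GHz
    print(high / low)               # ~5.1x more dynamic power in the highest mode

Because power scales with the square of the voltage and linearly with frequency, dropping from the highest to the lowest mode cuts dynamic power by roughly a factor of five in this example, which is why DVFS is so attractive when network load is low.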

The supply voltage can be increased/decreased at run-time, and scaling techniques will

decrease the supply voltage when network demand is low and increase the supply voltage

when network demand is high. At low network loads, performance loss is tolerable to save

dynamic energy, while at high network loads any loss in performance can result in network


Figure 1.4: Depiction of DVFS being applied at various granularities ranging from per network to per element.

saturation, dropped packets, and increased network contention. A smart voltage switching

algorithm selects optimal voltage levels at both low and high network demand. The optimal

voltage level will maximize dynamic energy savings while minimizing performance loss.

Recent work has shown that machine learning can be applied to the voltage switching

logic to improve voltage level selection through predictions of future network states and

parameters [16, 17]. This will be discussed in greater detail in section 1.5.

Dynamic voltage and frequency scaling may be applied to various NoC components

such as input ports, routers, links, buffers, crossbars, and other shared resources at various

granularities ranging from coarse to fine grain. A coarse-grained approach could apply

DVFS to all routers and links at the same time so that they all share the same voltage

level. This differs from a more fine-grained approach which might apply DVFS to

individual routers and links allowing each to operate at different voltage levels. An example

showing various voltage frequency island (VFI) granularities is shown in Figure 1.4. It is

often assumed that fine-grain DVFS schemes offer a greater potential for energy savings.

However, there is concern that the overhead cost of providing separate voltage domains for

each router/link could offset any potential savings [26]. Prior research has also used various


network parameters to quantify and measure network traffic including link utilization

[51], VFI utilization [30], network slack [24], buffer utilization [20], and cache-coherence

properties [31].

1.4 Power-gating for Multicores

Power-gating is another very popular and well-researched energy proportional

computing technique that seeks to save static power. This is accomplished by switching

off the supply voltage to various NoC components such as routers and links in proportion

to network load [8, 10–13, 19, 25, 28, 43, 48]. The key goal of power-gating is to maximize

the static power savings of powered off network components with minimal impact on

network performance. High static power is a direct result of transistor leakage power as

shown in the formula below [37]:

P_static = V · k · e^(−q·V_th / (a·k_a·T))    (1.2)

where V is the supply voltage, V_th is the threshold voltage, T is the temperature, and the

remaining parameters fluctuate with design. Power-gating is challenging to implement

successfully due to large wake-up delays (T-Wakeup), minimum breakeven time (T-

Breakeven), and potential loss in network connectivity. T-Wakeup is the wake-up delay

of the powered off network component and represents the time needed to fully charge local

voltage levels up to Vdd [16]; other work has estimated this to be around 10 cycles [11–

13, 19] but it is largely hardware dependent. This differs from the breakeven time, which

refers to the minimum time that a component must be powered off to ensure a net savings

in static power when it is switched back on [16]. Other work has estimated T-Breakeven to

be around 12 cycles [19]; however, this too is largely hardware dependent.
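
As a minimal sketch of how these two delays constrain a power-gating controller (our own illustration; the 10- and 12-cycle figures are the hardware-dependent estimates cited above), a router might be gated only when its predicted idle period amortizes both costs:

    # Sketch of a conservative power-gating decision using the cycle
    # estimates cited above; both values are hardware dependent.
    T_WAKEUP = 10     # cycles to recharge local voltage rails up to Vdd
    T_BREAKEVEN = 12  # minimum off-time needed for a net static power savings

    def should_power_gate(predicted_idle_cycles: int) -> bool:
        # Gate only if the router is expected to stay off long enough to
        # cover the breakeven time plus the wake-up delay, so that it is
        # back online before traffic arrives and still saves energy.
        return predicted_idle_cycles >= T_BREAKEVEN + T_WAKEUP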

A successful power-gating model will ensure that (i) only unused or lightly used

components are power-gated, (ii) power-gated components do not cause loss in network

connectivity and are woken before they cause blocking, and (iii) power-gated components


Figure 1.5: An example of power-gating applied to the NoC where the router modification, handshaking, and router pipeline are shown from Power-Punch [13].

meet or exceed T-Breakeven ensuring static power savings are maximized [16]. Power-

gating models that follow these three rules have the greatest potential to maximize static

power savings while minimizing performance loss. Many new power-gating techniques

also operate at a cycle-by-cycle time granularity, as this increases potential power savings

across small periods of idleness. The critical challenge of maintaining network connectivity

when individual routers or links are powered off can be tackled in many ways. Such

methods range from the addition of escape channels to packet re-routing to early wake-

up. The underlying idea behind all methods is to minimize network performance impact by

hiding wake-up latency and avoiding router blocking. The router architecture, handshaking

protocol, and router pipeline of a typical power-gated router is shown in Figure 1.5.

Previous research has broken the NoC into multiple sub-networks that can be powered

on/off in proportion to network load [19]. Each sub-network always maintains full

connectivity alleviating deadlock and live-lock complications. Another work leverages

the amount of dark-silicon at smaller technology sizes to create multiple fully connected

NoCs made from high threshold voltage (HVt), normal threshold voltage (NVt), and low threshold voltage (LVt) cells. This allows the most energy-efficient NoC to be selected


that meets network demands [8]. Other research has focused on maximizing router off

time by updating routing tables and re-routing packets around powered-off components

[45]. Another work minimized the effects of router blocking by sending wake-up signals

to power-up downstream routers before packets arrive and wait to hop across them [13].

The key goal behind all this prior research is to ensure maximal static power savings with

minimal performance loss while maintaining network connectivity [16].

1.5 Benefits of Machine Learning

Older DVFS and power-gating techniques were reactive, i.e. they reacted to changes

in traffic demand after network loads had already shifted. Such approaches often rely on

old/outdated values of network parameters, resulting in performance loss and suboptimal

dynamic energy savings. Recent work has begun to incorporate machine learning

algorithms and other advanced techniques that allow accurate predictions of future network

parameters. Accurately predicting the future needs of the network enables proactive voltage

switching and more optimal voltage level selection. Ultimately proactive voltage switching

improves energy/performance trade-offs [15, 22, 29, 35, 52, 62].

Machine learning is a rapidly growing field in computer science and refers to a

collection of techniques that allow algorithms to recognize patterns in data and make data-driven predictions or decisions. Three large fields in machine learning include

supervised learning, unsupervised learning, and reinforcement learning [6]. Supervised

learning is the most well-known technique and involves supplying a label during training.

During unsupervised learning, labels are not supplied during training. Reinforcement

Learning (RL) is a more complex type of ML. RL is a powerful state-action-reward

technique that allows an agent to learn the set of actions that garner the highest cumulative

reward.


RL is also the most common type of machine learning applied to DVFS; however,

this method is often trained online as data becomes available. Online training can result in

high runtime overhead and suboptimal initial agent performance. This thesis will instead

explore offline-trained supervised learning, eliminating training/validation overhead while

maintaining low run-time computational overhead. We will discuss how proactive versions

of each DVFS model were developed, how features and labels were extracted, and how

labels were supplied during training and validation to train models. We will also discuss

how the trained weights are subsequently used at run-time to govern DVFS voltage level

selection.
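
As a minimal sketch of this offline-trained supervised approach (our illustration only; the feature matrix and label here are random placeholders for the router features and labels discussed in later chapters), a linear regression model can be fit offline on logged traces, after which run-time prediction reduces to a single dot product per epoch:

    import numpy as np

    # Offline phase: fit linear regression weights on logged per-epoch
    # router features against a label such as next-epoch buffer utilization.
    # The data below is synthetic placeholder data.
    rng = np.random.default_rng(0)
    X = rng.random((1000, 4))                        # 4 features per epoch (assumed)
    y = X @ np.array([0.5, 0.2, 0.2, 0.1]) + 0.05    # synthetic label
    X_aug = np.hstack([X, np.ones((1000, 1))])       # append a bias column
    weights, *_ = np.linalg.lstsq(X_aug, y, rcond=None)

    # Run-time phase: one dot product per router per epoch, so the
    # computational overhead added to the control logic is tiny.
    def predict(features):
        return float(np.append(features, 1.0) @ weights)

Because the weights are frozen after offline training, no training or validation cost is paid at run-time; the per-epoch overhead is one multiply-accumulate per feature.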

1.6 Major Contributions

In this thesis, we propose several different ML enhanced energy proportional

computing techniques. These techniques enable more optimal energy/performance trade-

offs for the NoC. The key goal behind our DVFS and DVFS+PG models is to save power

and energy without impacting the performance of the NoC. Learning-enabled Energy-

Aware Dynamic voltage/frequency scaling (LEAD) is a collection of offline-trained linear

regression based DVFS techniques. LEAD models are proactive, using only local router

information when calculating labels. LEAD scales the router and the router’s outgoing

links simultaneously to avoid inefficient use of network bandwidth or excess energy

consumption. LEAD-τ predicts future buffer utilization, LEAD-∆ predicts change in buffer

utilization between the current and future epoch, and LEAD-G predicts change in energythroughput2 .

Based on these predicted values, voltage levels are selected on a per-router basis without

the need for global coordination [17].

In this thesis we also propose DozzNoC, an adaptable power-management technique

that effectively combines power-gating (to target low-network activity) and DVFS (to

target variability in network load) with supervised ML. Each router in DozzNoC has three


operational states; while in an active state, DozzNoC routers operate at one of five different

voltage levels using DVFS. While in the inactive state the router is power-gated. While

in the wakeup state the router's local voltage level is charged up to Vdd. To minimize the

performance penalty due to powered off network components, DozzNoC implements a

partially non-blocking power-gated design [16].

For a 4 × 4 CMesh architecture, our simulation results show that LEAD-τ achieves an

average of 17% savings in total dynamic energy for a minimal loss of 2-4% in throughput

for real traffic patterns [17]. When applied to an 8 × 8 mesh network, DozzNoC achieves

an average reduction of 25% in dynamic energy and 53% in static power for a loss of 7%

in throughput [16]. The major contributions of this work are as follows:

• Machine Learning+DVFS: LEAD and DozzNoC both apply linear regression

based ML techniques that enable proactive DVFS using a minimal amount of router

features so as to maximize energy savings with minimal impact on performance.

Offline training and local router features ensure minimal overhead and design

scalability [16].

• Power-Gating+DVFS: DozzNoC simultaneously combines partially non-blocking

power-gating with DVFS techniques. This allows power-gating of NoC routers

during periods of low network activity (to save static power) and DVFS during

periods of medium to high network activity (to reduce dynamic energy consumption)

[16].

1.7 Thesis Organization

The organization of this thesis is as follows: In Chapter 2, we will present an in-

depth analysis on current state-of-the-art DVFS techniques. Then we will discuss our

proposed LEAD models. We will introduce our router architecture including additional

LEAD specific components as well as the network topology. Then we will delve into the


specific V/F pairs used in this work and introduce our various LEAD models and show

how they are used for proactive mode selection. In the latter part of Chapter 2 we will

also discuss the machine learning aspect of the design and detail how training, validation,

and testing are performed for our LEAD models. In Chapter 3 we will present an in-depth

analysis on current state-of-the-art power-gating techniques. We will discuss DozzNoC,

our proposed power-gated DVFS model that builds on LEAD via the addition of partially

non-blocking power-gating. We will introduce the different operational states and discuss

the various models used to compare against DozzNoC. Then, we will delve into the ML

aspect of DozzNoC and discuss the reduced feature set as well as label generation and

overhead. In Chapter 4, we will discuss the simulation methodology for both LEAD and

DozzNoC. This will include all LEAD and DozzNoC model variants as well as an in-depth explanation of the feature engineering and ML evaluation. The final two sections

of Chapter 4 will present the throughput, latency, dynamic energy, and static power results

for LEAD and DozzNoC, respectively. Finally, Chapter 5 will conclude the thesis, summarize our contributions, and outline plans for possible future work.


2 LEAD: Offline Trained Proactive DVFS for NoC

The main focus of this chapter is the introduction of LEAD, machine learning

enhanced dynamic voltage and frequency scaling of NoC routers and links. While

previously introduced in section 1.3, a more detailed discussion of select prior DVFS

techniques will be presented in the first section of this chapter. The next section will focus

on our proposed LEAD architecture and network topology, followed by a section detailing

the specific LEAD models and their implementation. The final section of this chapter will

focus on the application of ML to enable proactive mode selection, including the feature

set, label, and ML overhead.

2.1 Related Works

There is a large volume of work focusing on the application of DVFS to both the

core and uncore with a recent rise in the application of ML techniques. Dynamic voltage

scaling (DVS) scales only the supply voltage. While scaling only voltage can lead to

energy savings, device delay will rise. Naturally most DVS schemes correct this problem

by also increasing/decreasing the frequency proportionally to the supply voltage (DVFS).

Most prior work applies DVFS on a per-router or per-core level [4, 7, 41, 59] as a finer

granularity is expected to yield higher energy savings. However, it is uncertain whether per-

core DVFS and finer VFI granularities can provide the necessary power savings to mitigate

increased voltage regulator overhead and complexity. Much of this additional overhead is

due to voltage drop of inefficient on-chip voltage regulators [26]. However, newer voltage

regulator frameworks have alleviated this concern using a hierarchy of on-chip and off-

chip voltage regulators with many modern approaches achieving 90% energy savings [2] or

more [26]. One such work seeks to better explore this issue [30] and proposes an in-depth

analysis of energy-efficiency gains as a direct result of using per-core DVFS. This work

used real workloads with many different DVFS control algorithms. These algorithms all


have different logic for voltage switching, which can have huge impacts on energy savings

and performance loss. The first model, Threshold, measures VFI utilization over a window

of several cycles and uses this to predict VFI utilization over the next window. Based on

this prediction, voltage levels are increased, decreased, or held constant. The PI based

algorithm uses a proportional-integral controller. These per-VFI PI controllers compute

utilization and voltage levels the same way as Threshold. Both the Threshold and PI models

are shown in Figure 2.2 (a). The final comparative model is called Greedy, an adaptation

of a Greedy-search method proposed in [40]. The Greedy model uses explorative logic to

either increase or decrease VFI voltage levels to move in the direction of the most energy-

efficient mode every time window. This model is the basis for LEAD-G and the full logic is

shown in Figure 2.2 (b). The results of this work showed that the Greedy control algorithm

could achieve a 38.2% reduction in energy/throughput², resulting in high energy reduction.

The voltage regulator hierarchy is also important to the design of the DVFS algorithm

as it directly impacts the number of voltage levels and VFIs allowable in the network.

Off-chip voltage regulators have high energy-efficiency and do not consume precious chip

space, but they also have very large switching latencies, often in the microsecond range.

This is not suitable for voltage scaling of cores or other uncore components as high

switching latencies could result in the loss of thousands of cycles of processor performance.

However, on-chip voltage regulators have much smaller switching latencies, often in the

nanosecond range. On-chip voltage regulators may also be combined with off-chip voltage

regulators in a highly energy-efficient voltage regulator hierarchy. One work [24] presents

an in-depth analysis of fine-grain DVFS using on-chip voltage regulators and presents

the trade-offs associated with both on-chip and off-chip voltage regulation. They find

that scaling of voltage and frequency of individual off-chip memory accesses in memory

intensive workloads can result in substantial energy savings. They propose both a proactive

voltage scaling technique that tries to scale up voltage before a memory access returns


from memory, and a reactive scheme that upscales voltage after a memory access returns

from memory. For memory intensive workloads they find that proactive scaling results

in an average of 12% energy savings for 0.08% performance loss, while the reactive model

results in an average of 23% energy reduction for 6% performance loss. We have also

seen how scaling the links and router together can result in higher energy savings [55]. In

this work, DVS with proportional frequency scaling is applied to both the links and CPU

using sparse matrix computations. The main goal of the design is to achieve maximal

energy savings with minimal impact on execution time across various applications. These

matrices are represented as tree-based parallel sparse computations. This approach applies

load-balancing techniques and then exploits load imbalances across parallel processors to

determine the optimal voltage level for each processor and its set of communication links,

but only applies such techniques to processors that are not in the critical path. The authors

develop three separate models, one that scales only the CPU voltage levels (CPU-VS), one

that scales only the link voltage levels (LINK-VS), and one that scales both the CPU and

link voltage levels (CPU-LINK-VS). The authors found that they could save on average

13-17% more energy by integrating CPU and link voltage scaling.

Another work [51] uses DVS to scale the voltage and frequency of routers and links

using a history-based approach. DVS links require adaptive power-supply regulators that

can increase/decrease both frequency and voltage of the link, but most modern designs

instead use multi-voltage supplies. A typical DVS link is shown in Figure 2.1 (a). The

proposed history based DVFS approach keeps track of link utilization over a window of

H cycles and combines both a short-term and long-term utilization to create a weighted

average. Using this weighted average, the future link utilization is predicted and the DVS

algorithm increases/decreases voltage levels accordingly with the main goal of achieving

power savings without decreasing performance in the network. If the network is predicted

to have high link utilization, then the link speed is increased (voltage increase); if the


Figure 2.1: An example DVS link is shown in part (a), while a history-based DVS algorithm is shown in part (b) [51].

network is predicted to have low link utilization, the link speed is decreased (voltage

decrease), and if the future needs are the same, then link speed remains unchanged.

This algorithm is shown in greater detail in Figure 2.1 (b). Another work [20] uses the

Single-chip Cloud Computer (SCC), a multi-core research platform, to dynamically control

the voltage/frequency on a per-core basis by monitoring buffer queue occupancy. Because queue occupancy varies widely with workload, the authors implement state-based feedback control. The feedback controller seeks to exploit workload variations to increase/decrease the voltage levels accordingly. First, queue occupancy is monitored, then the average arrival

rate and service rates are updated. Next the state feedback values are computed, and the

best frequency divider is selected. Lastly, the frequency and voltage level of each VFI are

updated. Due to the nature of this power management policy, it is implemented in software

and must be run on all cores separately. This work shows the criticality of smart power

management strategies.

Many older reactive DVFS policies, including the ones mentioned above, rely on using

old or current network parameter values to select voltage levels for future epochs. This can


Figure 2.2: A Threshold and PI controller Finite State Machine (FSM) are shown in (a), while a Greedy controller FSM is shown in (b) [30].

become problematic if future needs are not accurately represented, especially since voltage

levels are switched on the scale of 100s-1000s of cycles. If the optimal voltage level isn’t

selected, then the network will not have the opportunity to choose a new voltage level for

many cycles, meaning hundreds or thousands of cycles of excess power consumption or

lost performance. To ensure that the optimal voltage level is chosen, the voltage switching

logic must be supplied with an accurate estimate of future network needs to help combat

under/over estimation. If future network needs are over estimated, then the switching

logic will choose a voltage level that exceeds future needs, leading too little to no power

savings. If future network needs are under estimated, then the switching logic will choose a

voltage level that cannot meet future needs and performance will suffer. One work [31] uses

cache coherent protocols to determine future network states. This work differs from prior

reactionary techniques that focused on measuring current network parameters to estimate

future network needs because of the high prediction accuracy (87%) of cache-coherence

properties. In a cache-coherent NoC, traffic messages are defined by the coherence protocol

and must adhere to a strict set of communication rules. These protocol messages (ACKs, NACKs,

data requests, etc.) are sent from source to destination using end-to-end communication.


Figure 2.3: An example of simultaneous power, temperature, and performance management using Q-learning [52].

Because traffic messages in a cache-coherent NoC are regular and predictable, they are

used to better predict future network needs than typical parameters such as link/router

utilization, round-trip time, latency, or network slack. Due to this inherently exploitable

nature, the authors can rely on accurate predictions of future network demand which leads

to more optimal voltage level selection.

Newer methods have begun to incorporate machine learning techniques as they can

lead to highly accurate predictions of future network needs. RL algorithms can even

learn optimal control policies as data becomes available. One such work [22] uses online

learning based DVFS in a single-tasking and multi-tasking environment and compares

the potential energy savings of each. Another work [52] uses Q-learning based DVFS

to manage temperature, performance, and energy in a processor as shown in Figure 2.3.
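
For reference, the core of such an RL-based controller reduces to the standard tabular Q-learning update sketched below (a generic sketch; the actual state, action, and reward design of [52] is more involved):

    from collections import defaultdict

    # Q maps (state, action) pairs, e.g., (traffic level, V/F mode), to values.
    Q = defaultdict(float)

    def q_update(s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
        # Move Q(s, a) toward the observed reward plus the discounted value
        # of the best action available from the next state; over time the
        # agent learns the action sequence with the highest cumulative reward.
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])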


2.2 LEAD Architecture

LEAD is built on a CMesh topology using a hierarchy of off-chip and on-chip voltage

regulators that allow the selection of multiple voltage modes as shown in Figure 2.4. Our

network consists of 16 routers, 64 cores, and 48 unidirectional links corresponding to a

more area/energy efficient network. We propose per router DVFS such that the router

as well as the outgoing links are scaled simultaneously to operate at the same mode of

operation. Each router consists of 8 input ports, 8 output ports, and 4 virtual channels

per port while each processor has an individual L1 cache and each router has an L2 cache

shared among the four cores connected to each router. DVFS is not applied to the cores,

LLC or other uncore components. When a packet is first generated, the packet is stored in

the input buffer. The output port is computed using XY dimension-order routing (DOR)

in the route computation (RC) stage of the router pipeline. After a virtual channel is

allocated, the head packet competes for the output channel in the switch allocation (SA)

stage. After successfully competing and being awarded the channel, the packet is sent

across the crossbar to the destination port in the switch traversal (ST) stage. The proposed

router microarchitecture is shown in Figure 2.5(a) [17].

2.2.1 Operating V/F Modes

Each LEAD model uses five modes of operation with voltage and frequency levels

similar to those proposed in previous work [31]. The supply voltage changes in 100

mV steps with proportional changes in clock frequency which allows decreased voltage

regulator framework complexity. The V/F pairs our models use include {0.8 V/1 GHz,

0.9 V/1.5 GHz, 1.0 V/1.8 GHz, 1.1 V/2 GHz and 1.2 V/2.25 GHz} which correspond to

modes 1-5 as shown in Figure 2.5(b). We carefully chose five modes to balance overhead

and energy savings because having a large number of V/F pairs leads to increased voltage

regulator overhead without the guarantee of increased energy savings, whereas too few


Figure 2.4: We apply LEAD to a CMesh with 16 routers and 64 cores. We use on-chip voltage regulators that can adjust the supply voltage between 0.8V and 1.2V, allowing us to apply DVFS to individual routers and their corresponding links [17] © 2018 ACM.

modes will not allow the DVFS algorithm to exploit traffic load variation and save energy

at run-time [17]. Power-gating introduces several unique challenges (deadlocks, breakeven

time, loss in throughput) and while this chapter will not focus on power-gating, we have

implemented a power-gated version of LEAD which will be further discussed in Chapter

3.

2.3 DVFS Models and Implementation

In this portion of our work we focus on measuring the impact of different mode

selection models on dynamic energy savings and performance. In this chapter we propose

three machine learning based models: LEAD-τ, LEAD-∆, and LEAD-G. LEAD-G is based


Figure 2.5: The architecture as well as all additional units required for reactive or proactive mode selection are shown in (a). A simple voltage regulator setup that allows the selection of voltage levels in the range of 0.8V to 1.2V for every router and its associated outgoing links is shown in (b) [17] © 2018 ACM.

on an already proposed reactive model called Greedy. This Greedy model is presented in

[30] as an adaptation of a Greedy search method presented in earlier work [40]. Greedy

and LEAD-G are used strictly for comparative purposes. All three of these models do

not apply power-gating, but we will introduce an additional LEAD model that does apply

power-gating in chapter 3 [17].

Baseline: The baseline model always operates all routers in mode 5 (highest V/F

pair) and does not apply DVFS to the network. This model has the highest throughput and

lowest latency but has no dynamic energy savings [17]. We use this as our baseline because

it has the highest performance and we want to ensure energy savings with as little loss in

performance as possible.

LEAD-τ: This model starts each router in the lowest mode of operation, mode 1.

It then chooses the router's mode for the next epoch based on the predicted input buffer


utilization of the router for the next epoch. If the router’s buffers are predicted to be less

than 5% full, then the lowest mode is chosen. If the buffers are predicted to be between 5%

and 10% full, then mode 2 is chosen. If the buffers are predicted to be between 10% and

20% full, then the third mode is chosen. If the buffers are predicted to be between 20% and

25% full, then the router will operate in the fourth mode. Finally, if the buffers are predicted

to be greater than 25% full for the next epoch, then the router will operate in the highest

mode of operation, mode 5. For larger epoch sizes the thresholds are reduced for more

aggressive scaling to counter a larger theoretical maximum. This theoretical maximum is

calculated as a worst-case time variant sum over the duration of an epoch. For simplicity

sake we will show the thresholds as if they are all for epoch size of 100 cycles. The LEAD-

τ model assumes a voltage regulator scheme that allows the transition from any mode to

any mode in one cycle without the need to transition into adjacent modes [17]. While this

is not practical, real-valued switching delay is introduced into the models we present in

Chapter 3. This model emphasizes the importance of being able to select the optimal mode

at any given epoch versus other designs which are constrained to only being allowed to

transition into adjacent modes [17].
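
The threshold rule above reduces to a few lines, sketched below (mode selection logic only, using the 100-cycle-epoch thresholds; the predicted utilization itself comes from the offline-trained regression model described in section 2.4):

    # LEAD-tau mode selection: map the predicted input buffer utilization
    # (fraction full, 0.0-1.0) for the next epoch to one of the five modes.
    def lead_tau_mode(predicted_util: float) -> int:
        if predicted_util < 0.05:   # less than 5% full
            return 1
        if predicted_util < 0.10:   # between 5% and 10% full
            return 2
        if predicted_util < 0.20:   # between 10% and 20% full
            return 3
        if predicted_util < 0.25:   # between 20% and 25% full
            return 4
        return 5                    # greater than 25% full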

LEAD-∆: This model starts each router in the highest mode of operation, mode 5. Every epoch, routers transition one mode up or down based on the predicted change in input

buffer utilization between the current and future epochs. Mode transitions only occur if this

predicted change in buffer utilization falls within certain carefully selected criteria. These

criteria must ensure that small variations in network traffic over a short time span do not

govern mode selection over an entire epoch; however, it is critical that the router still be

able to adequately adapt to long-term changes in network traffic patterns. The buffers must

be predicted to increase by at least 6% of their maximum utilization over the next epoch to

warrant a mode transition into a higher mode, and they must be predicted to decrease by at

least 3% of their maximum utilization to warrant a mode transition into a lower mode for


the next epoch. We ensure dynamic energy savings by requiring the predicted change in

buffer utilization needed to move down a voltage level to be smaller in magnitude than the change needed to

move up. The LEAD-∆ model is used to compare the trade-offs associated with being able

to transition only into adjacent modes at every epoch and still assumes that each transition

takes one cycle. This model is more suited to gradual traffic changes where adjacent mode

transitions are optimal [17].
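A sketch of this adjacent-mode rule follows (same illustrative conventions as above; the +6%/-3% thresholds are the ones just described):

    def lead_delta_mode(current_mode, predicted_delta):
        """Move one mode up or down based on the predicted change in input
        buffer utilization (fraction of maximum) over the next epoch.
        The asymmetric thresholds (+6% up, -3% down) bias the model
        toward dynamic energy savings."""
        if predicted_delta >= 0.06:
            return min(current_mode + 1, 5)   # adjacent transition up
        if predicted_delta <= -0.03:
            return max(current_mode - 1, 1)   # adjacent transition down
        return current_mode                   # small variations are ignored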

LEAD-G: This model (based on prior work [30], [40]) explores to find the mode that minimizes a predicted future energy/throughput². LEAD-G adds explorative logic and introduces both dynamic energy and throughput into the label in the hopes of better balancing the trade-off between the two. LEAD-G starts each router in the highest mode of operation and in a downwards explorative direction. If the predicted change in energy/throughput² between the current and future epoch is less than or equal to 0, then the router will move one mode further in the current exploration direction (downward/upward). If the predicted change in energy/throughput² is greater than 0, then the router is put into a hold phase. The hold phase lasts 2 epochs, a similar duration to prior work [30], and during the hold phase the router cannot increase/decrease its voltage level. After the hold phase expires, the exploration direction is flipped and the model begins to explore in the opposite direction until the predicted energy/throughput² is greater than 0 again. This model seeks to minimize the predicted energy/throughput² and

assumes that routers may only transition into adjacent modes. The logic behind all three

LEAD models is further explained in Figure 2.6 [17].
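To make the hold/flip behavior concrete, a small per-router controller might look like the sketch below (a reading of the description above, not the simulator's exact bookkeeping):

    class LeadG:
        """Explore adjacent modes while the predicted change in
        energy/throughput^2 is non-positive; otherwise hold for two
        epochs, then flip the exploration direction."""
        def __init__(self):
            self.mode = 5        # start in the highest mode
            self.direction = -1  # start exploring downward
            self.hold = 0        # remaining hold epochs

        def next_mode(self, predicted_delta):
            # predicted_delta: predicted change in energy/throughput^2
            # between the current and future epoch
            if self.hold > 0:
                self.hold -= 1
                if self.hold == 0:
                    self.direction = -self.direction  # flip after hold
                return self.mode
            if predicted_delta <= 0:
                self.mode = min(max(self.mode + self.direction, 1), 5)
            else:
                self.hold = 2                         # enter hold phase
            return self.mode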

2.3.1 DVFS Implementation

As shown in Figure 2.5(a), LEAD uses four components per router in order to perform reactive (non-ML) or proactive (ML) mode selection. We have not mentioned how reactive

mode selection is done, but it is a simple process wherein each LEAD model uses current

values to govern mode selection instead of labels (predicted future values). This shall be


Figure 2.6: LEAD-τ uses a predicted input buffer utilization to select the optimal mode per epoch. LEAD-∆ uses a predicted change in input buffer utilization to move in the direction of the optimal mode per epoch. LEAD-G incorporates both energy and throughput into the label and moves up/down adjacent modes (based on exploration direction) such that energy divided by throughput squared is minimized [17] ©2018 ACM.

further discussed in Chapter 4. The first additional router component is called Feature

Extract. It gathers local router/link features and supplies them to the Non-ML Model

unit. For reactive mode selection, the Non-ML Model unit takes key values supplied from

Feature Extract and selects the appropriate mode for the router and the router’s outgoing

links. For proactive mode selection, we require the addition of two new units. The

first component, Label, takes the features supplied by Feature Extract and applies Ridge

Regression to generate a corresponding label. This label is then supplied to the ML Model

unit to select the appropriate mode for the router and the outgoing links [17].


2.4 Machine Learning for DVFS

We use machine learning to train different Ridge Regression algorithms corresponding

to LEAD-τ, LEAD-∆, and LEAD-G. There are two arrays of values needed when using

regression, the first being the feature set and the second being the weight vector. Each

feature has a corresponding weight, a scalar factor representing the impact on predicting

the output. The Ridge Regression equation is shown below:

E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(\mathbf{x}_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \sum_{k=1}^{K} w_k^2 \quad (2.1)

Where,

\mathbf{x}_n = (x_{n,1}, x_{n,2}, \ldots, x_{n,K}) \quad (2.2)

\mathbf{w} = (w_1, w_2, \ldots, w_K)^T \quad (2.3)

y(\mathbf{x}_n, \mathbf{w}) = \sum_{k=1}^{K} x_{n,k} \, w_k \quad (2.4)

In the Ridge Regression equation listed above, we minimize the sum of squared errors

between our predicted label y(x_n, w) and the actual label t_n. The feature vector x_n (containing K features), as well as the supplied labels t_n, is used to train the system offline such that a corresponding weight vector w (also of size K) is created. During tuning, different values of λ are tried in the regularization term (λ/2) Σ_{k=1}^{K} w_k² until the best-fitting solution for the training

data is found. This validation phase is important as it helps reduce model complexity

and combat over-fitting. We used a total of 14 different trace files; 6 for training, 3 for

validation, and 5 for testing [17].
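Since the model is linear, training reduces to solving the regularized normal equations. The NumPy sketch below illustrates the offline phase (the thesis performs this step in Matlab; the candidate λ grid and function names here are illustrative):

    import numpy as np

    def train_ridge(X_tr, t_tr, X_val, t_val, lambdas=(0.01, 0.1, 1.0, 10.0)):
        """Closed-form ridge regression: w = (X^T X + lam*I)^-1 X^T t.
        The lambda with the lowest validation error is kept."""
        K = X_tr.shape[1]
        best_w, best_err = None, float("inf")
        for lam in lambdas:
            w = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(K), X_tr.T @ t_tr)
            err = np.mean((X_val @ w - t_val) ** 2)  # validation MSE
            if err < best_err:
                best_w, best_err = w, err
        return best_w

At test time, generating a label is then a single dot product, y = x · w, which is where the per-epoch overhead discussed below comes from.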

Feature Set: The feature set is directly related to prediction accuracy and overhead

cost. The feature set must be kept as small as possible because every new feature

increases computational overhead during label generation. Features must also be carefully

chosen such that accurate predictions of future network needs can be gleaned from them.

Our feature set for this work is composed of 39 network parameters (buffer utilization,


incoming/outgoing link utilization per direction, request/response packets, etc.) as well as

a label local to each of the 16 routers [17]. The full LEAD feature set is shown in Chapter

4, while a reduced feature set obtained after extensive feature engineering and testing will

be presented in Chapter 3.

Label: A reactive version of each LEAD model is used to govern mode selection for

all training and validation trace files so that features and labels can be extracted. This data is

then used to train/validate each LEAD model. While the same features are used to train all

LEAD models, the label supplied for training is unique to each LEAD model. Therefore,

after training/validation each LEAD model will be composed of a uniquely trained set of

weights. The label for LEAD-τ is the future input buffer utilization of the router for the next epoch. The label for LEAD-∆ is the difference between the router's current and future buffer utilization. The label for LEAD-G is the difference between the router's current and future energy/throughput². These labels are supplied along with the corresponding features to train

and validate all ML algorithms offline [17].

ML Overhead: A trained linear regression algorithm uses a series of additions and

multiplies to calculate a label; thus, the overhead cost can be simplified to the timing,

power, and area cost required to execute a set number of additions and multiplies. The

energy cost of a single 16-bit floating point add is estimated to be 0.4 pJ and the area cost is 1360 µm² [32]. The energy cost of a multiply is estimated to be 1.1 pJ and the area cost is 1640 µm² [32]. The total energy overhead cost is 58.1 pJ (one multiply per feature followed by additions to reduce the products). The total area overhead cost is 0.12 mm², and the total timing

cost is 3-4 cycles [17]. We use larger epoch sizes of 500 cycles and 1000 cycles to ensure

that such overhead costs are kept relatively small as labels only need to be calculated once

per epoch.
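As a sanity check on these totals (assuming one multiply per feature and additions to reduce the 39 products): the energy is 39 × 1.1 pJ + 38 × 0.4 pJ = 42.9 pJ + 15.2 pJ = 58.1 pJ per label, and the area is 39 × 1640 µm² + 38 × 1360 µm² = 115,640 µm² ≈ 0.12 mm².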


3 DozzNoC: Combination of ML-based DVFS and

Power-Gating for NoC

This chapter focuses on the introduction of DozzNoC, a power-gated LEAD model.

DozzNoC enables energy proportional computing by applying both power-gating and

DVFS to network routers and links. DozzNoC applies power-gating at times of network

idleness and DVFS at times of low to high network traffic; this will be discussed in greater detail in Section 3.2. While previously introduced in Section 1.4, a more detailed discussion

of select prior power-gated techniques will be presented in the first section of this

chapter. The second section will focus on our proposed DozzNoC architecture and network

topology, followed by an in-depth explanation of the various DozzNoC states, as well as

how states and operational modes are selected. The last section of this chapter will detail

how we have applied machine learning to DozzNoC, heavily drawing from our LEAD

models. We will also detail the reduced feature set, label, and reduced ML overhead.

3.1 Related Works

Much prior PG research has focused on the application of runtime power-gating to

both the core and various uncore components as well as the NoC. Power-gating is used to

completely power off unused or lightly used cores or other network components. Successful

power-gating techniques can drastically reduce leakage power, making them critical for

energy proportional computing. Most designs that apply power-gating to the NoC assume

a single network. If a downstream router or link is power-gated, network connectivity will

be lost. Typically, this scenario results in performance loss and introduces the risk of deadlocks and livelocks. Therefore, preventative measures such as data re-routing or early

wake-up must be taken.


Other research has focused on the benefits and trade-offs associated with applying

power-gating at various time granularities. Coarse-grain power-gating methods that use

large re-configuration windows of 10K cycles or more have several major drawbacks. The

first drawback comes from cutting network connectivity for large periods of time, which

can result in massive performance loss if counter measures such as bypass paths or packet

re-routing are not implemented. The second drawback comes from the inability to exploit

small periods of router idleness of 10-100 cycles that are common across real traffic patterns

and applications. Both problems are solved with smaller re-configuration windows. Such

techniques employ header transistors with high threshold voltage that cut power when the

sleep signal is asserted. Fine-grain power-gating can achieve up to a 10x reduction in leakage power; therefore, I will focus on newer research that uses smaller re-configuration windows.

One such work, Catnap, seeks to avoid loss of connectivity by breaking the network up into several smaller but fully connected networks in a Multi-NoC architecture [19]. This

Multi-NoC design is built on a 256-core CMesh with 8 memory controllers. This design

partitions wires, buffers, and other network components into several subnetworks built

from proportionally smaller components. This Multi-NoC concept is visualized in Figure

3.1. Figure 3.1 also highlights the differences between a Single-NoC and a Multi-NoC

composed of four subnetworks. If the link width in a Single-NoC were 512 bits, it would

equal 128 bits in a Multi-NoC. This same concept is also applied to the router and other

network components. If all four subnetworks are active, then total bandwidth in the Multi-

NoC would remain unchanged from the original larger Single-NoC. This allows Catnap

to power-gate entire subnetworks proportional to network demand without loss in network

connectivity. Catnap uses a unique subnet-selection policy to determine what subnetwork

new packets are injected into, and a unique power-gating policy to determine when a

subnetwork should be switched on/off. Catnap routers can be in three different power

states: active, sleep, and wake-up. In the active state, routers are on and operational and can


Figure 3.1: The difference in network component sizes between a Single-NoC and a Multi-NoC [19].

be used to send/receive data. In the sleep state, routers are powered off and cannot be used.

In the wake-up state, local voltage levels are being charged up to Vdd and the router cannot

be used until the wake-up delay has been met, which the authors estimate to be 10 cycles.

There is also the energy cost to wake up a powered-off router, which was estimated to be around 12 cycles' worth of leakage energy. Catnap demonstrates that a power-gated Multi-

NoC design can consume about 44% less power than a bandwidth proportional Single-NoC

for a minimal 5% loss in performance.

Normally a router cannot be used to send/receive or forward data unless it is powered

on, however one work seeks to decouple the node’s ability to send/receive packets from

the power state of the router [11]. This is accomplished by providing separate decoupling bypass paths that avoid wake-ups, alleviating the need for packets to wait until the

router has been woken up. Packets have the option to choose between normal paths and

bypass paths if routers are power-gated. This approach allows the network interface to


Figure 3.2: The NoRD network bypass ring is shown in (a), the router bypass datapath is shown in (b), and the network interface datapath is shown in (c) [11].

forward packets directly into specific input ports of the router via bypass paths, forming

a fully connected unidirectional ring. The network level bypass channel, the additional

router level bypass path, and the network interface bypass path are shown in Figure 3.2(a-

c) respectively. NoRD operates while trying to fulfill two key objectives. The first objective

is to maximize net energy savings which is done by maximizing the amount of router off

time. NoRD must ensure that during fragmented idle periods packets have a bypass path

around powered off routers, ensuring they are not woken prematurely. The second objective

is to minimize performance loss, which is done by ensuring that wake-up delay is reduced

or hidden entirely. This can be accomplished by routing around powered off routers. If the

wake-up delay is not properly hidden, it can compound across multiple hops. This work

estimates typical wake-up delay to range from 10-20 cycles depending on clock frequency.

The authors also estimate the break-even time to be approximately 10 cycles, slightly less

than Catnap’s estimated 12 cycles. However, these delays and break-even times are largely

hardware dependent. NoRD shows that by decoupling the node and router, static energy

can be reduced by an additional 29.9% over traditional power-gating techniques that fail to

exploit periods of fragmented traffic.


Another work focuses on minimizing the performance loss incurred from wake-up

delays by sending wake-up signals to downstream routers [13]. The authors of this work

accomplish that goal using two key mechanisms. The first ensures that power control

signals utilize existing network slack to wake up initial routers in the packet's path. The

second mechanism ensures that wake-up signals stay sufficiently far ahead of the packet

to “punch through” blocked routers. Router blocking occurs when a packet arrives at a

router that is powered off and must wait for it to be turned on to switch across it. As

previously mentioned, this delay can compound if multiple routers are powered off in the packet's path, resulting in massive performance loss. However, some of this delay

can be hidden simply by sending a wake-up signal to downstream routers once the route is

calculated. Power Punch demonstrates how early wake-up signaling enables power-gated

schemes to save up to 83% of router static energy with minimal impact on execution time.

DarkNoC [8] is difficult to classify as a DVFS scheme. Instead, I will discuss this work alongside power-gated schemes like Catnap, which used several fully connected subnetworks.

This is because DarkNoC focuses on leveraging the amount of dark silicon on future chips

to create multiple fully connected NoCs. Each different NoC is built with different ratios

of various transistor types to leverage associated advantages. Low threshold voltage cells

(LVt) allow for faster switching at the cost of increased leakage power, therefore LVt cells

are used along critical paths where decreased latency is most valued. Normal voltage

threshold cells (NVt) allow for normal switching speeds and normal leakage power. High

voltage threshold cells (HVt) allow for low leakage power at the cost of slow switching

speeds, therefore HVt cells are used along non-critical paths to enable power savings.

Each layer is optimized for a specific voltage/frequency range, but all are architecturally

identical. Figure 3.3 shows the different DarkNoC layers and how routers are optimized. At

any given time, only the most energy efficient layer that adequately meets network demands

is powered on, the rest remain off. DarkNoC improves EDP by up to 56% compared to state-


Figure 3.3: Four separate DarkNoC layers are shown in (a), a single DarkNoC layer is shown in (b), and DarkNoC routers are shown in (c) [8].

of-the-art DVFS techniques; however, due to the nature of this design it would require four

fully connected NoCs, potentially quadrupling the area overhead of the NoC.

3.2 DozzNoC Architecture

DozzNoC improves upon LEAD with enough versatility to be applicable to multiple

network topologies. We specifically apply DozzNoC to both a CMesh and a mesh network

topology unlike LEAD which was only applied to a CMesh [16]. When applying DozzNoC

to a CMesh we use a 4 × 4 NoC with a concentration factor of 4. When applying DozzNoC

to a mesh we use an 8 × 8 network with 64 cores. We can easily switch between other

network topologies and scale to increased numbers of routers because we use only local

router features to select voltage levels. Using only local router features is critical because

we eliminate the need for global coordination. This means adding a new router is as simple

as adding another voltage regulator and will incur minimal amounts of extra computational

overhead. The difference between applying DozzNoC to a mesh and CMesh network can

be seen in Figure 3.4(a,b). When applying DozzNoC to a CMesh network, we use larger

routers with 8 input ports, 8 output ports, and 4 virtual channels per port. When applying


Figure 3.4: We apply DozzNoC to both a CMesh in (a), as well as a mesh in (b). The router microarchitecture for a mesh topology is shown in (c) [16].

DozzNoC to a mesh network, we use smaller routers with 5 input ports, 5 output ports,

and 4 virtual channels per port. Processor count remains unchanged between topologies

and each core has a separate L1 cache, however, in a CMesh the L2 cache is shared among

cores. The router microarchitecture for DozzNoC remains unchanged from LEAD. Newly

generated packets sit in the input buffers of the routers. This location corresponds to the

packet's injection source. Next, the output port is computed using look-ahead routing in

the route computation stage. We then use XY dimension order routing to select the output

ports. We also use this information to ensure that downstream routers are not allowed to

be powered-off, and if they are off, to wake them up for a partially non-blocking power-

gated scheme. Naturally it would be more difficult to design a partially non-blocking

power-gated scheme if we did not use XY routing, but it is still achievable if downstream

routers can be determined and woken up in advance. Next, a virtual channel is allocated

to the packet and it competes for the output channel. Lastly, the packet is sent across the

crossbar and into the destination port after a successful switch traversal [16]. Our proposed router microarchitecture is largely unchanged from LEAD; therefore, we will not replicate

previously presented material. However, we have removed one redundant component. The


new DozzNoC router microarchitecture is shown in Figure 3.4(c), while Figure 3.4(a)

shows DozzNoC being applied to a CMesh and Figure 3.4(b) shows DozzNoC being

applied to a mesh.
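Because XY dimension-order routing is deterministic, the downstream router that must be secured or woken is easy to compute one hop ahead. A minimal sketch (coordinates and the function name are illustrative):

    def xy_next_hop(cur, dst):
        """XY dimension-order routing: correct the X coordinate first,
        then Y. cur and dst are (x, y) router coordinates; returns the
        next (downstream) router, or None if the packet has arrived."""
        cx, cy = cur
        dx, dy = dst
        if cx != dx:
            return (cx + (1 if dx > cx else -1), cy)
        if cy != dy:
            return (cx, cy + (1 if dy > cy else -1))
        return None  # at destination; eject through the local port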

3.2.1 Operational States

Routers and links under DozzNoC are given three states of operation as shown in

Figure 3.5. These three states include an inactive state, an active state, and a wakeup state.

While these states appear to be similar to the states one would observe from a normal

power-gated scheme, the active state is unique [16]. All wake-up and switching latencies

are estimated using the equivalent capacitance of a router and its outgoing links; however, the

voltage regulator design as well as the specific values for voltage switching and wake-up

delay were created and evaluated by another team of researchers. While the switching and

wake-up delays are based on real values, they will not be discussed in this thesis, nor will

the voltage regulator design.

Inactive State: In this state, the power supply to an individual router and its outgoing links is reduced to 0 V; thus, the router and associated links are completely switched off. While in an inactive state, the router and its outgoing links consume no power. However,

inactive components may not be used to send/receive packets and cannot be used to hop

packets across them. The router can transition from an active state to an inactive state in a

single cycle, but it must satisfy specific conditions before it is allowed to switch off. For

this work we ensure that the router's buffers have been empty for several consecutive cycles

and that it is not a downstream router before we allow the router to be switched off [16].

Wakeup State: A router that is in the process of transitioning from an inactive state to

an active state goes into a wakeup state. While in the wakeup state, the router consumes

the same amount of power as if it were on, as it is supplied with one of five supply voltages (Vdd).

However, components that are in this state may not be used to send/receive packets and it


Figure 3.5: Our version of Power Punch acts as a baseline and is shown in (a). LEAD-τ has only one state, the active state; however, active routers may operate at one of five different voltage levels (b). DozzNoC has three states, and when a router is active it may operate at one of five different voltage levels as shown in (c) [16].

may not be used to hop packets across them until the wake-up delay has been satisfied. This

is similar to prior work that also called this a "wake-up" state [19]. A router can transition from an inactive state to a wakeup state in a single cycle, but the router must wait in the wakeup state for a set number of cycles before it can be considered fully on and functional.

In a power delivery system this is called the wake-up time, and it is defined as the interval

from the instant of a voltage change until the local voltage level settles to meet the supply

voltage level [16].

Active State: A router that is in an active state can operate at one of five different

voltage levels. These different V/F pairs are referred to as various modes of operation in

which the supply voltage and clock frequency are proportionally increased/decreased. The

V/F pairs our DozzNoC model uses in this work are {0.8 V/1 GHz, 0.9 V/1.5 GHz, 1.0

V/1.8 GHz, 1.1 V/2 GHz and 1.2 V/2.25 GHz} which correspond to being in the active

state in modes 3-7. We start the numbering at mode 3 because we consider mode 1 to be

the inactive state and mode 2 to be the wakeup state. These V/F pairs are similar to those

used in other works [17, 31]. A key difference in this portion of our work is that we use real

valued switching delays obtained from a unique SIMO voltage regulator design supplied


Figure 3.6: DozzNoC mode selection is shown in (a), while LEAD-τ mode selection is shown in (b) [16].

by other researchers in [16]. We chose not to add/remove modes for several reasons. First,

we wanted to maintain a fair comparison against prior work that used a similar number of

V/F pairs. Second, we did not wish to add unnecessary design complexity [16].
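One way to picture the resulting numbering is the small encoding below (an illustrative constant table following the convention just described, with modes 1 and 2 reserved for the non-active states):

    INACTIVE, WAKEUP = 1, 2        # reserved state "modes"
    ACTIVE_MODES = {               # modes 3-7: (voltage in V, frequency in GHz)
        3: (0.8, 1.0),
        4: (0.9, 1.5),
        5: (1.0, 1.8),
        6: (1.1, 2.0),
        7: (1.2, 2.25),
    }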

3.3 Power-Gated DVFS Models

This research focuses on combining the static power savings of power-gating with the

dynamic energy savings of DVFS. To that end, we propose two machine learning models

built on a power-gated framework that use supervised learning to train the regression

algorithm. These two algorithms correspond to DozzNoC and ML+TURBO. We also

compare to our best performing LEAD model from prior work that also used machine

learning and DVFS, LEAD-τ. All three machine learning models use the same threshold

based DVFS mode selection logic. This logic looks at the predicted future buffer utilization

and compares it to a theoretical maximum to determine what mode should be selected for

the next epoch. The state transition logic for all three ML models is shown in Figure


3.6. DozzNoC and ML+TURBO use the state selection logic from Figure 3.6(a), while LEAD-τ never transitions out of the active state. When the router is in the active state, all three comparative ML models use the logic in Figure 3.6(b) to select the optimal voltage mode.

ML+TURBO was added to see the impact on static power and dynamic energy when the

highest mode is chosen instead of a lower predicted mode [16].

Baseline: The baseline model starts with all routers operating in the active state at the

highest voltage level, mode 7. The Baseline does not allow the transition of a router into

any other state. Therefore, all routers are always active and always operate in mode 7. This

model will always have the highest throughput and the lowest latency as it incurs no router

wake-up delay and no voltage level switching delay. However, the baseline offers no static

power savings and no dynamic energy savings and is mostly used as an upper bound on

performance and power cost [16].

Power Punch (PG): The Power Punch model is based on the prior work from which the name is taken [13]. This model is not an exact replica of Power Punch; rather, it behaves similarly in that look-ahead routing is used to wake up downstream routers and minimize

the impact of router blocking. Therefore, it is used for comparative purposes against a

state-of-the-art power-gating technique. This model operates routers individually in one of

three states. Either a router is inactive, waking up, or active. If the router is inactive, then

it consumes no power but cannot be used to send/receive packets. If a router is waking

up, it consumes the same amount of power as if it was active, but it cannot be used to

send/receive packets. Finally, if a router is active, then it will operate at the highest mode

of operation, mode 7. For a router to transition from an active state to an inactive state,

it must be idle for at least T-Idle consecutive cycles. A router is considered idle only if

its input buffers are empty and it is not a downstream router. This secondary condition

was developed to make this model as non-blocking in nature as possible so that a fairer

comparison to Power Punch can be made. We use XY look-ahead DOR routing, which


allows us to easily know the next router in a packet's path so that downstream routers can

be secured. When a router is in a secured state, it cannot be switched off. If it is currently

off, it will immediately transition into a wakeup state where it will stay until the wake-up

delay has been met. The main purpose of this model is to compare the static power savings

of a state-of-the-art power-gating technique against a design that combines power-gating

and DVFS [16].

DozzNoC (ML+PG+DVFS): The DozzNoC design uses the same underlying

partially non-blocking power-gated design proposed earlier wherein all routers may be in

one of three states. In the inactive state, a router consumes no power but cannot be used

to send/receive packets. Transitioning to an inactive state takes a single cycle, and in order

to transition to an inactive state the router must first be in an active state and idle for more

than T-Idle consecutive cycles. The second state is the wakeup state. A router transitions

to a wakeup state when it is off and needs to be woken up. A router in the wakeup state

cannot be used to send/receive packets until the wake-up delay has been met. The length of

the wakeup state depends on the voltage level of the active state and is based on real values

presented in prior work [16]. The final state is the active state. A router that is in an active

state may send/receive packets and operate like normal. However, the key difference is that

DozzNoC uses predictive machine learning techniques to determine the optimal voltage

level for a router that is in an active state. This means that DozzNoC dynamically adjusts

the supply voltage to select the optimal active state operational mode. In order to do this,

we predict future input buffer utilization of a router and then compare this to the theoretical

maximum utilization to determine the optimal voltage level. This optimal voltage level

must meet network performance demands while still ensuring dynamic energy savings.

This DVFS design relies on aggressive voltage scaling that minimizes potential loss in

throughput. For an epoch size of 100 cycles, if we predict the buffers to be less than 5% of

the maximum over the next epoch, we select the lowest voltage level for the active state to


Figure 3.7: A walkthrough example showing the difference in active state mode selection across two epochs for a dummy network [16].

operate at, mode 3. Mode 3 is considered the lowest mode of operation because mode 1

corresponds to the inactive state and mode 2 corresponds to the wakeup state. If the buffers

are predicted to be between 5% and 10% of the maximum we select mode 4, if the buffers

are predicted to be between 10% and 20% we select mode 5, if the buffers are between 20%

and 25% we select mode 6, and finally if the buffers are predicted to be more than 25% full

we select mode 7. These thresholds are slightly different at higher epoch sizes as they need

to be more aggressive to counteract a higher theoretical maximum [16].

In Figure 3.7 we show a walk-through example detailing how routers and links in

a dummy network would transition between different states on a per-cycle basis. We


also show how the optimal active state voltage level is updated every epoch. R-Idle is the number of consecutive cycles that a router has been idle, while R-Wakeup counts the number of cycles a router has spent in the wakeup state. For example, if in the first epoch (E_k) DozzNoC predicts that the input buffers will be greater than 25% full for the next epoch, then it determines that the optimal active mode of operation is the highest mode. This is highlighted with router 1 operating at the highest mode until it is turned off, and router 2 waking up into the highest mode after the wake-up delay is satisfied (R-Wakeup = 0). Router idleness (R-Idle) and the number of cycles a router has been waking up (R-Wakeup) are measured every cycle. A router will only switch off if it has been idle for several consecutive cycles (T-Idle), meaning the buffers are empty and it is not a downstream router. Once a router is off, it must wait several cycles corresponding to the wake-up delay before it is usable (T-Wakeup). This delay is tracked by setting R-Wakeup = T-Wakeup and decrementing R-Wakeup every cycle. Once the router has finished waking up (R-Wakeup = 0), it may be used. If at the next epoch (E_k+1) DozzNoC predicts

that the input buffers will be less than 5% full for the next epoch, then it will subsequently

determine that the optimal active mode of operation is the lowest mode. Thus router 1 will

now operate in the lowest mode when active and router 2 will operate in the lowest mode

after it successfully transitions from the wakeup state to the active state. When the router

has been idle for T-Idle consecutive cycles it is turned off, as can be observed for router 1. When a router is no longer considered idle (R-Idle = 0), it transitions from the inactive state to the wakeup state, as shown by router 3. While this example network has only four

routers, the actual network will have 16 routers for a CMesh and 64 routers for a mesh.

The three states (inactive, wakeup, and active) remain the same as in the dummy network. However, in this example a router in the active state has only three modes of operation (low, medium, and high), while the real network has five operational modes. Also, for simplicity, all routers in the dummy network share the same active mode of operation, while in the real network each individual router selects operational modes based on local router labels [16].
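A per-cycle update consistent with this walkthrough is sketched below (the dataclass, parameter names, and the simplified wake-up trigger are our illustrative simplifications of the simulator's bookkeeping):

    from dataclasses import dataclass

    @dataclass
    class RouterState:
        state: str = "active"   # "active", "inactive", or "wakeup"
        idle_cycles: int = 0    # R-Idle
        wakeup_cycles: int = 0  # R-Wakeup

    def step(r, buffers_empty, is_downstream, T_IDLE, T_WAKEUP):
        """One simulation cycle of DozzNoC state bookkeeping for router r."""
        if r.state == "active":
            if buffers_empty and not is_downstream:
                r.idle_cycles += 1                # accumulate R-Idle
                if r.idle_cycles > T_IDLE:
                    r.state = "inactive"          # switch off in one cycle
            else:
                r.idle_cycles = 0
        elif r.state == "inactive":
            if is_downstream or not buffers_empty:
                r.state = "wakeup"                # secured, or traffic arrived
                r.wakeup_cycles = T_WAKEUP        # R-Wakeup = T-Wakeup
        elif r.state == "wakeup":
            r.wakeup_cycles -= 1                  # count down every cycle
            if r.wakeup_cycles == 0:
                r.state = "active"                # usable at the predicted mode
                r.idle_cycles = 0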

LEAD-τ (ML+DVFS): LEAD-τ [17] is used for comparative purposes so that

throughput loss and dynamic energy savings of DozzNoC can be compared to prior work

that only applied machine learning-based DVFS to the NoC. While using this model, the

router can only be in the active state and may never transition to an inactive or wakeup state.

While in the active state LEAD-τ uses the same mode selection logic as DozzNoC where

future input buffer utilization is predicted, and an optimal active voltage level is selected.

This model may transition from any voltage level to any other voltage level within the range

of 0.8V to 1.2V. The main purpose for including this model is to compare how a stand-alone

machine learning DVFS design performs against a machine learning design that has DVFS

and power-gating [16].

ML+TURBO: This model seeks to apply power-gating and DVFS to the NoC in a

similar fashion to DozzNoC. It uses the same three states of operation as DozzNoC; the

inactive state, the wakeup state, and the active state. When a router and its links are active,

a prediction of the future input buffer utilization is used to govern the voltage level. The

key difference between this model and DozzNoC is that every third time we predict that a router should be at any active mode other than mode 3 or mode 7, we instead select mode

7. The key goal of this model is to improve throughput and static power savings at the cost

of dynamic energy since we opt for the highest operational mode even if we predict a lower

mode sufficient to meet network demand [16].
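Read as "every third such prediction", the TURBO rule admits a compact sketch (the counter-based bookkeeping is one plausible reading of the description, not a verified detail of the implementation):

    def ml_turbo_mode(predicted_mode, counters, router_id):
        """Pass modes 3 and 7 through unchanged; bump every third
        intermediate prediction (modes 4-6) up to mode 7."""
        if predicted_mode in (4, 5, 6):
            counters[router_id] = counters.get(router_id, 0) + 1
            if counters[router_id] % 3 == 0:
                return 7
        return predicted_mode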

3.4 Machine Learning for PG-DVFS

Machine Learning enables us to use proactive mode selection techniques for DozzNoC

and all comparative models. This allows DozzNoC to use ML techniques to select

operating voltage levels while in the active state. The underlying ML equations have not


changed between LEAD and DozzNoC. We use the same Ridge Regression optimization

formulation as in Section 2.4, and the feature vector and weight vector are trained and

validated in the same way. The major difference for DozzNoC is that extensive feature

engineering allows us to reduce the number of features to only 5.

Feature Set: The feature set is carefully crafted such that prediction accuracy is

maximized while overhead is kept to a minimum. This is accomplished by selecting local

router features that give the greatest insight into network performance while also keeping

the relative number of features small. For DozzNoC we determined the features that were

most relevant to accurate label prediction exhaustively, as shown in Figure 4.5. This will be

further explained in the results section. We found that it is possible to reduce the number

of features (starting with the same 41 features proposed in prior work) without having a

significant impact on the overall performance, as shown in Figure 4.6. Each additional

feature equates to more run-time computational overhead because the number of additions

and multiplications necessary to generate the label increases. The original feature set

proposed in [17] contained 41 features in total as well as a label; however, we have reduced

this to only five critical features. These five features are listed in detail in Table 3.1 [16].

Table 3.1: DozzNoC’s Reduced Feature Set [16]

Feature Set:

Feature 1: Array of 1’s

Feature 2: Requests Sent by All Cores

Feature 3: Requests Received by All Cores

Feature 4: Router Total Off Time

Feature 5: Current Input Buffer Utilization

Label: Future Input Buffer Utilization


Label: Label generation for DozzNoC remains unchanged from LEAD; however, LEAD used 40 features, while DozzNoC uses only 5. Like LEAD, DozzNoC and all

comparative models are trained offline and are ready to use at test time [16].

Machine Learning Overhead: After a model has been trained, the weight vector is

exported to the network simulator where it can be used to select voltage levels when routers

are active. The additional overhead incurred from machine learning can be broken down

into the timing, area, and energy cost to execute a series of additions and multiplications

as this is how a label is calculated. Each feature is multiplied by its equivalent weight and

then the results are summed to generate a label. Prior work [32] has already estimated the

cost to do these operations and they do not change from LEAD ML overhead. Prior work

that used 41 features calculated the total energy overhead cost to be 61.1 pJ, the total area overhead cost to be 0.122 mm², and the total timing cost to be 3-4 cycles. We will show

in Chapter 4 that the feature set can be reduced to 5 features without causing a significant

impact on model performance. Therefore, the overhead to generate a label for DozzNoC

and all comparative models can be reduced to only 5 multiplications and 4 additions. This

equates to a total energy overhead cost of 7.1 pJ, a total area overhead cost of 0.013 mm²,

and a total timing cost of 3-4 cycles per router. Our epoch size is 500 cycles, with label

generation occurring on a per router basis once per epoch. However, if the additional

energy overhead were too large, it could be further reduced by increasing the size of the

epoch. If the area overhead were too large, it could be reduced by forcing routers to share

computational resources, which would increase the timing overhead proportional to the

amount of shared resources [16].
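As with LEAD, these reduced totals follow directly from the per-operation costs in [32]: 5 × 1.1 pJ + 4 × 0.4 pJ = 5.5 pJ + 1.6 pJ = 7.1 pJ per label, and 5 × 1640 µm² + 4 × 1360 µm² = 13,640 µm² ≈ 0.013 mm².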


4 Performance Evaluation

4.1 LEAD Simulation Methodology

In this chapter we evaluate the static power and dynamic energy savings as well as

the performance trade-offs for our various proposed LEAD (LEAD-τ, LEAD-∆, LEAD-

G) and DozzNoC (DozzNoC, LEAD-τ, ML+TURBO) models. For fair comparison,

we specifically apply all LEAD models to a 4×4 CMesh topology with 64 cores. We

evaluate all LEAD models across 14 different real traffic traces gathered using Multi2sim.

Multi2sim [58] is a full-system simulator that uses CPU benchmarks from PARSEC 2.1

[5] and SPLASH2 [61] in order to generate cycle-accurate traces. These traces are used as

input for our in-house network simulator, where reactive versions of each LEAD model are

run such that features and labels can be extracted. The 14 traces used to train, validate, and test each LEAD model are shown in Table 4.1.

Table 4.1: LEAD Benchmarks

Training        Validation    Testing
Barnes          Raytrace      Lu (non-contiguous)
Blackscholes    Swaptions     Lu (contiguous)
Bodytrack       Ferret        Radix
Dedup                         Fluidanimate
FFT                           Canneal
Ocean

The goal of using trace files generated with PARSEC 2.1 and SPLASH2 benchmarks

for these simulations is to evaluate the throughput, latency, and dynamic energy of various

machine learning enhanced voltage switching models across real traffic patterns. LEAD is

implemented in our in-house network simulator; therefore, each line of the trace file need


only consist of relevant packet data. This data includes packet injection source, packet

destination, packet type, and injection cycle. Packet source and destination information

must be known to calculate the route. It is also important that we know the packet type

(request or response) because only requests are used as inputs. Our network simulator

will automatically generate a response packet once a request has been received. It is also

crucial that we know the injection time since our results must be cycle accurate. Specific

Multi2sim parameters used to gather trace files are shown in Table 4.2.

Table 4.2: Multi2sim Parameters

Cores 64

Routers 16

Concentration Factor 4

L1/L2 Coherence Protocol MOESI

Private L1/L2 Yes

Private LLC No

L1 Instruction Cache Size (kB) 32

L1 Data Cache Size (kB) 64

L2 Cache Size (kB) 256

L3 Cache Size (MB) 8

Main Memory Size (GB) 16

We used DSENT [57] to model routers and links so that the resulting dynamic power

results could be converted into energy per hop. The energy per hop is gathered for all 5

different V/F pairs at a 22 nm technology node assuming a 128-bit flit width [17]. The

total dynamic energy is calculated as the cost to send a specific number of packets through

the network, where every hop is recorded and the cost per hop is known. Each entry

in a trace file contains a single packet's injection information; therefore, total packets sent


equal total requests in the trace file. Naturally the number of requests may vary slightly

between trace files, however, the number of packets sent per trace is still constant across

all LEAD models. The energy per packet is calculated as the cost to traverse a router and

an outgoing link. Routers and links may operate at one of five different operating voltages

corresponding to modes 1-5, thus energy per hop will change accordingly. This energy

cost is listed starting from lowest mode to highest mode: {25 pJ/packet, 32 pJ/packet, 39

pJ/packet, 47 pJ/packet, 57 pJ/packet} [17]. The voltage and frequency of each mode as

well as packet traversal energy is shown in Table 4.3.

Table 4.3: Dynamic Energy Per Hop (Modes 1-5) [17] ©2018 ACM

Mode 1    0.8 V    1.0 GHz     25 pJ/packet
Mode 2    0.9 V    1.5 GHz     32 pJ/packet
Mode 3    1.0 V    1.8 GHz     39 pJ/packet
Mode 4    1.1 V    2.0 GHz     47 pJ/packet
Mode 5    1.2 V    2.25 GHz    57 pJ/packet
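Accumulating total dynamic energy is then a matter of charging the mode-dependent cost on every recorded hop, as in the sketch below (the per-hop table is Table 4.3; the hop-recording interface is illustrative):

    # pJ per router+link traversal, indexed by mode 1-5 (Table 4.3)
    ENERGY_PER_HOP_PJ = {1: 25, 2: 32, 3: 39, 4: 47, 5: 57}

    def total_dynamic_energy_pj(hop_modes):
        """hop_modes: iterable giving, for every hop taken by every packet,
        the mode the traversed router was operating in."""
        return sum(ENERGY_PER_HOP_PJ[m] for m in hop_modes)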

4.1.1 LEAD Model Variants

For the LEAD-τ model, we needed to determine the ideal input buffer utilization

thresholds for switching between modes. To determine these ideal mode transition

thresholds, we conducted an exhaustive study. Figure 4.1 shows a small set of our threshold

studies on a single trace file. Each model variant’s name (shown on the x-axis) consists

of 4 values that correspond to the mode transition thresholds. For example, 5/10/20/25

implies a transition from mode 1 to mode 2 when buffer utilization exceeds 5%, from

mode 2 to mode 3 when the buffer utilization exceeds 10%, from mode 3 to mode 4 when

buffer utilization exceeds 20%, and finally a transition to mode 5 when buffer utilization

is above 25%. From Figure 4.1, we observe that the set of thresholds that lead to the


Figure 4.1: The throughput loss and dynamic energy savings across multiple LEAD-τ variants with an epoch size of 500 cycles [17] ©2018 ACM.

best performance (lowest throughput loss and maximum energy savings) was 5/10/20/25

which yielded 15.51% energy savings while showing 5.35% throughput loss across a single

training trace file. Therefore, we use 5/10/20/25 threshold values for our LEAD-τ model

[17]. For the LEAD-∆ and LEAD-G models, similar testing occurred. LEAD-∆ model

variants changed the thresholds needed to transition up or down a voltage level, while

LEAD-G model variants changed the number of hold cycles before the model could change

explorative directions. Only the best performing models are shown in the results section.

4.1.2 LEAD Mode Breakdown

In Figure 4.2 we show the amount of time spent in each of the five different modes

across all five test applications for the best performing LEAD models. Every epoch the

router's current voltage level is recorded, and the counter corresponding to that router's voltage level is incremented. This is done for all 16 routers every epoch

so that total network time spent in each mode can be calculated. LEAD-τ has the unique

ability to switch from any-to-any mode, which allows it to avoid mode 4 altogether. We

can observe how LEAD-τ routers spend most of their time in the highest mode of operation

but can switch into lower modes to exploit idle traffic patterns. This ensures that LEAD-τ


Figure 4.2: The time routers and links spent in each mode for all LEAD models across all test traces [17] ©2018 ACM.

minimizes loss in throughput while maximizing energy savings. Both LEAD-∆ and LEAD-

G show significant amounts of time spent in the lower modes 2 and 1 respectively. However,

LEAD-∆ routers have a very even distribution of time spent in each of the five different

modes of operation, corresponding to the gradually changing nature of the model. LEAD-G

routers show the exact opposite behavior of LEAD-τ, wherein they stay in the lowest

mode of operation as much as possible to maximize energy savings. The LEAD-G model

does not conserve performance at high network load. All three models and their behaviors

will be discussed further in the following sections.

4.2 LEAD ML Simulation Methodology

Each LEAD model is trained using Ridge Regression, a machine learning algorithm

that minimizes the sum of squared errors and adds a regularization term. We must gather

training and validation data as well as test data to evaluate model performance. LEAD uses

supervised machine learning, thus both features and corresponding labels must be gathered

for training and validation. Labels are unique to each of the three different LEAD models


and will be discussed in greater detail in the following section. To gather training data, a

reactive version of each LEAD model is used to govern mode selection. Then, at every

epoch (100-1000 cycles) features as well as the previous epochs corresponding label are

exported to a text file for every router.

From Table 4.1, we observe that six trace files are used to gather features and target

labels, three are used to validate the trained models, and the remaining five are used at

test time to evaluate model generalization. This process must be repeated for each LEAD

model as each will have a unique reactive version used to gather training/validation data,

and each will have unique target label. This results in each LEAD model having a unique

set of trained weights which are exported for use during test time. It is important that we

tune the lambda hyper-parameter during the validation phase as it controls the amount of L2

regularization used to combat over-fitting. If the model over-fits, it will generalize poorly.

This means the model will perform very well on the training data but will perform poorly

at test time. Model training and tuning occurs in Matlab where we have implemented

our Ridge Regression algorithm. After this phase concludes, trained weights are exported

back to our in-house network simulator where they are used on test traces to predict router

specific target labels every epoch. These labels are used to govern mode selection and

enable proactive voltage switching. The five trace files used at test time are completely

untouched during training/validation. Seeing them in advance would be equivalent to

cheating; generalization performance could then not be accurately measured.

4.2.1 LEAD Feature Engineering

The process of feature engineering involves selecting specific network parameters to

use as inputs for label generation while keeping overhead to a minimum. If the label is

the future state of the network, then features should contain relevant network information

that will help predict this future state. For example, the label for LEAD-τ is input buffer


utilization over the next epoch; therefore, the most critical feature is the input buffer

utilization over the current epoch. Knowing the current value of a parameter helps pattern

recognition software predict the future value of that parameter immensely. We use a total

of 39 features, ranging from requests sent and received in all directions, responses sent and received in all directions, and incoming and outgoing link utilization in all directions, to current

input buffer utilization, current change in input buffer utilization, current slack, and current

voltage level.

The first feature is an array of 1's; this is used for weight normalization. The

next several features are requests/responses sent and received in every direction. These features are important because they provide the algorithm with useful direction-specific traffic insights. This adds directional traffic flow information to label generation.

Incoming/outgoing link utilization helps determine whether the performance needs of the

network are increasing or decreasing. These features are critical to determine whether

router modes should be increased or decreased to meet such changing needs. However, the

most critical feature for predicting future input buffer utilization is knowing the current

input buffer utilization. Therefore, feature 36 is mainly added to improve performance

of the LEAD-τ model. Likewise, knowing the current change in input buffer utilization

between the last and current epoch is very helpful when trying to predict this change for

the next epoch. Therefore, feature 37 was added largely to help performance of the LEAD-

∆ model; however, feature 36 is still important for this label as well. Similarly, feature

38 was added to help label generation of the LEAD-G model. The feature set does not

change between LEAD models, only the corresponding label changes. Therefore, we tried

to select enough key features such that each of the three various labels can be accurately

generated at test time using the same feature set. Each feature contains only local router

information with the full list of features given in two separate Tables 4.4 and 4.5. The

feature set is split into two different tables because of page size limitations. It is crucial


Table 4.4: Full LEAD Feature Set

LEAD Feats: (1-20)

Feature 1: Array of 1's
Feature 2: Requests Sent by All Cores
Feature 3: Responses Sent by All Cores
Feature 4: Requests Received by All Cores
Feature 5: Responses Received by All Cores
Feature 6: Requests Sent by Cores per Direction (+X)
Feature 7: Responses Sent by Cores per Direction (+X)
Feature 8: Requests Received by Cores per Direction (+X)
Feature 9: Responses Received by Cores per Direction (+X)
Feature 10: Outgoing Link Utilization per Direction (+X)
Feature 11: Incoming Link Utilization per Direction (+X)
Feature 12: Requests Sent by Cores per Direction (-X)
Feature 13: Responses Sent by Cores per Direction (-X)
Feature 14: Requests Received by Cores per Direction (-X)
Feature 15: Responses Received by Cores per Direction (-X)
Feature 16: Outgoing Link Utilization per Direction (-X)
Feature 17: Incoming Link Utilization per Direction (-X)
Feature 18: Requests Sent by Cores per Direction (+Y)
Feature 19: Responses Sent by Cores per Direction (+Y)
Feature 20: Requests Received by Cores per Direction (+Y)

that the total number of features be kept minimal because each additional feature incurs

increased run-time computational overhead.


Table 4.5: Full LEAD Feature Set (Cont.)

LEAD Feats: (21-40)

Feature 21: Responses Received by Cores per Direction (+Y)
Feature 22: Outgoing Link Utilization per Direction (+Y)
Feature 23: Incoming Link Utilization per Direction (+Y)
Feature 24: Requests Sent by Cores per Direction (-Y)
Feature 25: Responses Sent by Cores per Direction (-Y)
Feature 26: Requests Received by Cores per Direction (-Y)
Feature 27: Responses Received by Cores per Direction (-Y)
Feature 28: Outgoing Link Utilization per Direction (-Y)
Feature 29: Incoming Link Utilization per Direction (-Y)
Feature 30: Requests Sent by Cores per Direction (core)
Feature 31: Responses Sent by Cores per Direction (core)
Feature 32: Requests Received by Cores per Direction (core)
Feature 33: Responses Received by Cores per Direction (core)
Feature 34: Outgoing Link Utilization per Direction (core)
Feature 35: Incoming Link Utilization per Direction (core)
Feature 36: Current Input Buffer Utilization
Feature 37: Current Change in Input Buffer Utilization
Feature 38: Current Slack
Feature 39: Voltage Level
Feature 40: Label

4.2.2 LEAD ML Accuracy

Machine learning research typically uses RMSE to determine model accuracy; however, we define and use mode selection accuracy to evaluate both our LEAD models


and our DozzNoC models. We use mode selection accuracy because there is a large range

of acceptable label values per mode. Table 4.6 shows the achieved mode selection accuracy

for various applications for the LEAD-τ model. To calculate mode selection accuracy, we

start by recording the label every epoch. Then, during the following epoch, we measure the

actual buffer utilization and compare to the label from the previous epoch. We determine

what mode should have been chosen based on the actual utilization, and what mode was

chosen using the label. If both the label and actual utilization would lead to the same

mode being selected, then the selection was accurate; otherwise, the selection was inaccurate. All

accurate selections are divided by the total number of inaccurate and accurate selections to

generate mode selection accuracy. We achieve an average of 86% accuracy across all test

benchmarks, with the Canneal application showing almost 95% accuracy.

Table 4.6: LEAD-τ Mode Selection Accuracy [17] ©2018 ACM

Models         Lu       Ls       Radix    Fluid    Canneal
LEAD-τ 500     88.3%    82.3%    62.7%    68%      95.9%
LEAD-τ 1000    83.2%    73.2%    56.6%    53.2%    95.9%
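The metric itself reduces to comparing, per epoch, the mode implied by the label with the mode implied by the measured utilization. A standalone sketch (mode_of would be a threshold mapping such as the LEAD-τ one; names are illustrative):

    def mode_selection_accuracy(predicted_utils, actual_utils, mode_of):
        """A selection counts as accurate when the label and the measured
        utilization for that epoch imply the same mode."""
        hits = sum(1 for p, a in zip(predicted_utils, actual_utils)
                   if mode_of(p) == mode_of(a))
        return hits / len(predicted_utils)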

4.3 DozzNoC Simulation Methodology

DozzNoC simulation methodology is almost identical to LEAD's. First, we use the

same cycle-accurate full-system simulator, Multi2sim [58], to gather trace files. We use

the same industry-standard benchmarks from both PARSEC 2.1 [5] and SPLASH2 [61]

to create trace files containing cycle accurate network traffic. The full set of benchmarks

used for DozzNoC is the same as the benchmarks used in LEAD, and are shown in Table

The main difference is that DozzNoC is evaluated on time-compressed and non-time-compressed trace files. DozzNoC is also evaluated across both a mesh with 64 cores and

a CMesh with 16 routers and a concentration factor of 4. The goal of evaluating DozzNoC


across different network topologies is to see the trade-off in performance and energy savings

across topologies. Concentrated designs have lower network area overhead, but this can

also hinder potential static power savings. This is because power-gated schemes that use

shared components can be powered off less often. We also use time-compressed and non-

time-compressed trace files to simulate different traffic loads. Time-compressed trace files

simulate high network load. The idea is to decrease packet injection times so that packets

are injected into the network as fast as the scheme will allow. The second set of trace files

are not time compressed and simulate low-normal network load. The goal behind using

compressed and uncompressed traces is to ensure that DozzNoC performs well at times of

network saturation, as well as network idleness. All DozzNoC benchmarks can be seen in

Table 4.7. Traces are supplied to our in-house network simulator and are used as input for

real traffic patterns to gather training and validation data for our various models. This will

be discussed further in following sections. The goal of using PARSEC 2.1 and SPLASH2

benchmarks for these simulations is to increase the accuracy of the performance results

during the testing phase. Synthetic traffic is less reliable for measuring performance because models can be designed to exploit synthetic traffic patterns, as such patterns are generated using a predictable algorithm.

Table 4.7: DozzNoC Benchmarks

                Time Compressed                             Non-Time Compressed
Training        Validation    Testing           Training        Validation    Testing
Barnes          Raytrace      Lu (non-cont.)    Barnes          Raytrace      Lu (non-cont.)
Blackscholes    Swaptions     Lu (cont.)        Blackscholes    Swaptions     Lu (cont.)
Bodytrack       Ferret        Radix             Bodytrack       Ferret        Radix
Dedup                         Fluidanimate      Dedup                         Fluidanimate
FFT                           Canneal           FFT                           Canneal
Ocean                                           Ocean


DSENT [57] is used once more to model the router and the links as well as to

obtain their respective static power/dynamic energy costs. The static power as well as

the dynamic energy of the router and its outgoing links for a CMesh are shown in Table

4.8. This table has three columns of significant importance, the first being the static power

in Joules/second. This is directly from DSENT and is an estimate of the static power to

operate a router and its outgoing links. We have converted this cost to a per cycle basis

so that total static power and dynamic energy can be calculated with our in-house network

simulator. Every cycle a router is operational in mode 1 will cost 66.7% less static power

than if it were operational in mode 5 as reflected in column 4. The final column shows

the dynamic energy per hop for a packet traversing a router in each of the five different

operational modes. It should be noted that this cost is identical to the energy per hop

from LEAD. The network latency of a CMesh is lower than a mesh as there are fewer

hops from source to destination. However, the power/energy cost per hop is higher for a

CMesh because it has more components and larger crossbars. Thus, the CMesh network

is used as a worst case for any power/energy costs and a best case for latency. We use the

same power/energy cost per hop for both a mesh and CMesh for simplicity sake. When

determining voltage switching latency and wake up delay, CMesh routers have a higher

equivalent capacitance, which in turn means they will have larger delays. However, we

once again assume the same switching delay and wake-up delay for both a CMesh and

a mesh router for simplicity. These delays are estimated for the five different modes of

operation at a technology size of 22nm with 128-bit flit width [16].
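
As a sketch of how these DSENT-derived costs feed into the simulator's bookkeeping, the following Python fragment (our own illustration, assuming per-epoch counters that the real simulator would maintain) accumulates normalized static power per active cycle and dynamic energy per hop using the values from Table 4.8; a power-gated cycle simply contributes no static power term.

    # Costs from Table 4.8, indexed by operational mode 1..5.
    STATIC_POWER_PER_CYCLE = [0.667, 0.750, 0.833, 0.917, 1.0]  # normalized
    DYNAMIC_ENERGY_PER_HOP = [25.1, 31.8, 39.2, 47.5, 56.5]     # pJ/hop

    def account_epoch(active_cycles_by_mode, hops_by_mode):
        # active_cycles_by_mode[i]: cycles this router spent active in mode i+1
        # hops_by_mode[i]: packet-hops through this router while in mode i+1
        static = sum(c * STATIC_POWER_PER_CYCLE[i]
                     for i, c in enumerate(active_cycles_by_mode))
        dynamic = sum(h * DYNAMIC_ENERGY_PER_HOP[i]
                      for i, h in enumerate(hops_by_mode))
        return static, dynamic  # (normalized static units, pJ)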

4.3.1 DozzNoC Model Variants

We tested many different variants of the DozzNoC model, just as we tested many different variants of our LEAD models. We did not test different active-state operational threshold values; instead, we kept these the same as those used in LEAD-τ. This was done


Table 4.8: Static Power and Dynamic Energy Per Hop for Active State Operational Modes [16]

Voltage   Frequency   Static Power   Static Power         Dynamic Energy
(Volts)   (GHz)       (J/s)          (per cycle, norm.)   (pJ/hop)
0.8       1.0         .036           .667                 25.1
0.9       1.5         .041           .750                 31.8
1.0       1.8         .045           .833                 39.2
1.1       2.0         .050           .917                 47.5
1.2       2.25        .054           1.0                  56.5

partly for fair comparative purposes, but also partly because we had already performed extensive threshold testing for LEAD-τ. Instead, we opted to test how DozzNoC model variants performed at various epoch sizes. This testing focused on the trade-offs between coarse time-grained and fine time-grained DVFS applied to a partially non-blocking power-gated scheme. Figure 4.3 shows how DozzNoC variants with different epoch sizes compare against a baseline. The baseline DozzNoC model selects new active-state operational modes every 100 cycles; that is, an active router operates at one of five operational modes, re-calculated every 100 cycles. Our other model variants use epoch sizes of 500 cycles, 1000 cycles, 5000 cycles, and 10,000 cycles. The percent change in throughput, latency, and dynamic energy across all test traces shows that epoch sizes of 500 and 1000 cycles lead to the best model performance. Features and labels are exported every epoch; therefore, to ensure that sufficient training/validation data is generated (improving generalization performance), we opt for an epoch size of 500 cycles.

Figure 4.3: The change in throughput, dynamic energy, and latency of DozzNoC at various epoch sizes compared to a baseline DozzNoC model with an epoch size of 100 cycles [16].

We also tested several variants of the ML+TURBO model. This was accomplished by changing how often ML+TURBO forces routers and links into the highest mode of operation: for example, every 2 re-configurable windows or every 3. In the following sections we will evaluate only the best performing models for DozzNoC, LEAD-τ, and ML+TURBO.
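
As an illustration of this TURBO override, the sketch below is our own hypothetical code; the mode numbering (modes 1-5, as in Table 4.8) and the period are placeholders. It promotes intermediate mode predictions to the highest mode on every n-th re-configurable window.

    LOWEST_MODE, HIGHEST_MODE = 1, 5  # active-state modes, as in Table 4.8

    def ml_turbo_mode(predicted_mode, window_index, turbo_period=3):
        # Every `turbo_period`-th re-configurable window, a prediction
        # that is neither the lowest nor the highest mode is forced to
        # the highest mode; otherwise the ML prediction is used as-is.
        turbo_window = (window_index % turbo_period == 0)
        if turbo_window and LOWEST_MODE < predicted_mode < HIGHEST_MODE:
            return HIGHEST_MODE
        return predicted_mode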

4.3.2 DozzNoC Mode Breakdown

In Figure 4.4 we show the total network distribution of predicted active modes of operation for DozzNoC, LEAD-τ, and ML+TURBO. This means that when a DozzNoC router is in the active state, it will operate at one of the five operational modes (Mode 3-Mode 7). DozzNoC active-state voltage levels are updated every epoch, but a router may transition between active, wake-up, and inactive states at any point within an epoch. Therefore, the predicted operational modes over an epoch do not directly reflect time spent in each mode; rather, they are a proportional reflection. For LEAD-τ, routers are never in an inactive or wake-up state, as power-gating is not applied. Therefore, the predicted operational modes reflect the actual network mode breakdown, much like the breakdown for the other LEAD models presented in Figure 4.2. Because LEAD-τ routers do not apply power-gating, they are always active in the operational mode determined by the generated label. The baseline and Power Punch schemes are not shown, as they do not use DVFS logic to select optimal active modes. Instead, the baseline model is always operational in the highest mode, and the Power Punch model is a power-gating scheme with only one operational mode: when Power Punch is in the active state, it always operates at mode 7 [16].

Figure 4.4: A breakdown of predicted operational modes while in an active state for an 8×8 mesh network for DozzNoC (a), LEAD-τ (b), and ML+TURBO (c) [16].
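
For clarity, the sketch below gives a minimal version of the per-router state machine implied above: an idle counter triggers power-gating, and a gated router pays a wake-up delay before becoming active again. The class and both constants are our own illustrative placeholders, not the tuned DozzNoC parameters, and break-even-time accounting is omitted.

    class RouterPowerState:
        IDLE_THRESHOLD = 8   # placeholder: idle cycles before gating
        WAKEUP_DELAY = 12    # placeholder: cycles to restore power

        def __init__(self):
            self.state = "ACTIVE"
            self.idle_cycles = 0
            self.wakeup_left = 0

        def tick(self, has_traffic):
            # Advance the state machine by one cycle.
            if self.state == "ACTIVE":
                self.idle_cycles = 0 if has_traffic else self.idle_cycles + 1
                if self.idle_cycles >= self.IDLE_THRESHOLD:
                    self.state = "INACTIVE"  # power-gate the router
            elif self.state == "INACTIVE":
                if has_traffic:
                    self.state = "WAKEUP"
                    self.wakeup_left = self.WAKEUP_DELAY
            elif self.state == "WAKEUP":
                self.wakeup_left -= 1
                if self.wakeup_left == 0:
                    self.state = "ACTIVE"
                    self.idle_cycles = 0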

4.4 DozzNoC ML Simulation Methodology

The ML simulation methodology used for DozzNoC, LEAD-τ, and ML+TURBO in this section is very similar to that used for LEAD. We must first develop reactive versions of each model, which rely on current input buffer utilization to select voltage levels while the router is in an active state. This allows us to run our network simulator and export features and a label every epoch. The main difference between these reactive models and the reactive versions presented in LEAD is that these must be adaptable to different network topologies and must handle various traffic loads. Once again, the gathered features and labels are used to train our various mode selection models using supervised Ridge Regression. Our trace breakdown is the same as LEAD's, using a total of six trace files for training, three for validation, and the final five for testing.

Figure 4.5: The results of DozzNoC single feature mode selection accuracy testing [16].

The main difference between the DozzNoC comparative models and our LEAD models is that we must repeat this process for both time-compressed and non-time-compressed trace files. This means that we will have twice as many uniquely trained algorithms for DozzNoC. The validation phase for DozzNoC remains unchanged from LEAD. After training and validation, we test the trained algorithm by exporting the trained weights for use in our in-house network simulator, where they are used to generate labels. This time the label (future input buffer utilization) does not change between models; however, the process must still be repeated for DozzNoC, LEAD-τ, and ML+TURBO, as each will have a unique weight array. The LEAD-τ model logic used in these tests does not change from the version presented in Section 4.1. However, we still need to repeat the training/validation process for this model, as it must be evaluated on both time-compressed and non-time-compressed trace files as well as on both mesh and CMesh network topologies.
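
A condensed sketch of this offline pipeline is shown below. It assumes the per-epoch features and labels have been exported to NumPy arrays (the file names are illustrative) and uses scikit-learn's Ridge as one concrete realization of the supervised Ridge Regression described above; the regularization values searched are placeholders.

    import numpy as np
    from sklearn.linear_model import Ridge

    # Features/labels exported once per epoch by the reactive model runs.
    X_train = np.load("train_features.npy"); y_train = np.load("train_labels.npy")
    X_val = np.load("val_features.npy");     y_val = np.load("val_labels.npy")

    # Pick the regularization strength with the lowest validation error,
    # then export the trained weights for the in-house network simulator.
    models = [Ridge(alpha=a).fit(X_train, y_train) for a in (0.1, 1.0, 10.0)]
    best = min(models, key=lambda m: np.mean((m.predict(X_val) - y_val) ** 2))
    np.savetxt("trained_weights.txt", np.append(best.coef_, best.intercept_))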

4.4.1 DozzNoC Feature Engineering

For DozzNoC, we wanted to ensure that our feature set consisted only of relevant network information. Therefore, we performed a series of experiments in which a select number of features were used to train, validate, and test model performance. First, we tested single-feature mode selection accuracy, as shown in Figure 4.5. This experiment consisted of using only a single feature for training, validation, and testing. We then evaluated how accurately each single-feature DozzNoC model selected operational modes using mode selection accuracy.

Figure 4.6: The throughput, latency, dynamic energy savings, static power savings and EDP of DozzNoC-5 versus DozzNoC-41 [16].

We took the five features with the highest single-feature mode selection accuracy and created a reduced feature set. This reduced feature set was used to evaluate model throughput, latency, dynamic energy, static power, and EDP against a model that used 41 features, as can be seen in Figure 4.6. The reduced-feature-set model is called DozzNoC-5, as it uses only the five features with the highest mode selection accuracy, while the original DozzNoC model that uses all 41 features is called DozzNoC-41. These 41 features are largely pulled directly from the original LEAD feature set; however, some LEAD-model-specific features were removed, and some power-gating-specific features were added. Newly added features include current router off time and network total off time. From Figure 4.6 we can observe that there is almost no difference in throughput, latency, dynamic energy savings, static power savings, or EDP between the two DozzNoC model variants. This allows us to reduce the run-time overhead incurred by label generation every epoch to only 5 multiplies and 4 additions per router, far less than the original 41 multiplies and 40 additions [16].
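
The per-router, per-epoch label generation for DozzNoC-5 thus reduces to the five-term dot product sketched below. The function name is our own, and the example feature names are illustrative; the exact retained set comes from the Figure 4.5 experiments.

    def predict_label(weights, features):
        # DozzNoC-5 label generation: 5 multiplies and 4 additions per
        # router per epoch. `features` holds the five retained values,
        # e.g. input buffer utilization and router/network off times.
        return (weights[0] * features[0] + weights[1] * features[1]
                + weights[2] * features[2] + weights[3] * features[3]
                + weights[4] * features[4])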


4.5 LEAD Results

In this section we compare the throughput and dynamic energy of all LEAD models against a baseline model that does not apply DVFS and against the reactive Greedy model proposed in [40]. All LEAD models are evaluated on a CMesh topology with 16 routers and 64 cores. The goal is to evaluate dynamic energy savings in high-load environments.

4.5.1 LEAD Energy and Throughput

Figure 4.7 shows the average dynamic energy and throughput across all five test traces for all LEAD models. The best performing LEAD-τ model is evaluated at epoch sizes of both 500 and 1000 cycles; the other LEAD models show only their best performing variants at an epoch size of 500 cycles. The epoch size determines how often mode re-configuration is performed as well as how often features and labels are extracted. The dynamic energy is normalized to a baseline that does not apply DVFS to make visualizing energy savings and performance loss as simple as possible. At epoch sizes of 500 and 1000 cycles, LEAD-τ reduces total dynamic energy by 13-17% for a minimal loss in throughput of 2-4%. The LEAD-∆ model decreases dynamic energy by 34-35%, far more than LEAD-τ; however, these energy savings come at a substantial throughput loss of 40-43%. LEAD-G performs the best in terms of dynamic energy reduction, saving a significant 42% dynamic energy for an almost even throughput trade-off of 42%. This makes sense, since LEAD-G seeks to move routers in the direction that minimizes $\frac{\text{energy}}{\text{throughput}^2}$; if the numerator and denominator terms are equally weighted, we should see an even trade-off between the two. We see that LEAD-τ greatly improves total dynamic energy savings at both select epoch sizes with minimal loss in throughput. LEAD-G may not have the highest performance, but overall it saves significantly more dynamic energy than LEAD-∆ and achieves an even trade-off between dynamic energy savings and loss in throughput.

Figure 4.7: The throughput (a) and normalized dynamic energy (b) for all LEAD models compared against baseline and greedy [17] ©2018 ACM.

In almost all networks, and especially networks under high load, LEAD-τ would be the optimal mode selection model. However, if the network rarely becomes congested, then LEAD-G would be the best option [17].
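
As a sketch of the greedy objective, the fragment below (our own illustration) evaluates the energy/throughput² metric for the current mode and its neighbors and moves one step toward the minimum. The estimate() callback, which predicts energy and throughput for a candidate mode, is assumed rather than shown.

    def lead_g_step(mode, estimate, n_modes=5):
        # Greedy LEAD-G step: move one mode in the direction that
        # minimizes energy / throughput**2.
        def cost(m):
            energy, throughput = estimate(m)
            return energy / throughput ** 2

        candidates = [m for m in (mode - 1, mode, mode + 1)
                      if 1 <= m <= n_modes]
        return min(candidates, key=cost)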

4.6 DozzNoC Results

In this section we evaluate the throughput, normalized dynamic energy, and normalized static power of DozzNoC, LEAD-τ, and ML+TURBO. These results are then compared against a baseline model with neither power-gating nor DVFS and against a state-of-the-art power-gating model based on Power Punch. DozzNoC and all comparative models are evaluated on both a CMesh topology with 16 routers and 64 cores at high network load, and a mesh topology with 64 routers and 64 cores at low network load. This allows a model's performance and power/energy savings to be evaluated across multiple topologies and multiple network load environments.


4.6.1 DozzNoC Throughput, Static Power, and Dynamic Energy

For traditional DVFS designs, the focus is often the trade-off between performance loss and dynamic energy savings. This differs from traditional power-gated designs, which usually focus on the trade-off between performance loss and static power savings. For our work, the results must focus on the trade-off between performance loss and both dynamic energy savings and static power savings. Therefore, we compare a baseline model that implements neither power-gating nor DVFS against four other models to showcase these trade-offs. The baseline model is always active and always operates routers and links at the highest voltage level, while the Power Punch design operates routers and links at mode 7 when in the active state. The baseline shows how a model that applies neither DVFS nor power-gating would perform and acts as an upper bound on performance. The Power Punch model shows the trade-offs associated with state-of-the-art power-gating techniques that employ partial or full non-blocking techniques [16].

Overall network throughput and static power/dynamic energy across all five test traces for the CMesh topology at high network load are shown in Figure 4.8. The same information for a mesh topology at low network load is shown in Figure 4.9. These two figures showcase the trade-offs between model performance and static power/dynamic energy savings across multiple network topologies and network loads. For a mesh topology at an epoch size of 500 cycles, our version of Power Punch achieves an average of 47% static power savings for an increase of 5% in latency and a throughput loss of 9%. For a CMesh topology, the static power savings drop to an average of 32% for a latency increase of 3% and a throughput loss of 5%. LEAD-τ shows the trade-offs associated with machine-learning-based proactive DVFS and gives DozzNoC a direct comparison to our best performing LEAD model. For a mesh topology at an epoch size of 500 cycles, LEAD-τ achieves an average of 25% dynamic energy savings and 25% static power savings for a 1% latency increase and a 3% loss in throughput.

Figure 4.8: DozzNoC throughput for CMesh architecture at epoch size of 500 cycles (a). DozzNoC normalized static and dynamic energy for CMesh at epoch size of 500 cycles at high load (b). DozzNoC normalized static and dynamic energy for CMesh at epoch size of 500 cycles at low load (c) [16].

We note that static power savings are obtainable while using only DVFS because lower voltage levels consume proportionally less static power than the baseline, which always operates at the highest voltage level. This means that if the simulation takes the same amount of time to execute the same number of instructions for both the baseline and the DVFS model, the DVFS model will save static power, as it operated at lower modes throughout the course of the simulation [16].

DozzNoC highlights our novel power-gated DVFS design, which seeks to save both static power and dynamic energy. For a mesh topology with an epoch size of 500 cycles, our DozzNoC model saves on average 53% static power and 25% dynamic energy. This comes at the cost of increasing latency by 3% and decreasing throughput by 7%. For a CMesh network, DozzNoC saves on average 39% static power and 18% dynamic energy for a latency increase of 2% and a throughput loss of 5%.

Figure 4.9: DozzNoC throughput for mesh architecture at epoch size of 500 cycles (a). DozzNoC normalized static and dynamic energy for mesh at epoch size of 500 cycles at high load (b). DozzNoC normalized static and dynamic energy for mesh at epoch size of 500 cycles at low load (c) [16].

The ML+TURBO model is an experimental model designed to show the trade-off between dynamic energy savings and static power savings. Every third epoch, if our ML+TURBO model determined that a router should operate in a mode other than the lowest or highest mode, we instead forced that router to operate in the highest mode. The goal was to see whether we could give up some dynamic energy savings to obtain a greater increase in static power savings through faster simulations. For a mesh topology with an epoch size of 500 cycles, ML+TURBO saved on average 52% static power and 21% dynamic energy for a latency increase of 3% and a throughput loss of 7%. For a CMesh topology, ML+TURBO saved on average 37% static power and 14% dynamic energy for a latency increase of 2% and a throughput loss of 4%. When compared to our DozzNoC model, we note that not only did ML+TURBO show a slight loss in static power savings, but it also showed a slight loss in dynamic energy savings. This is because the highest mode of operation consumes the most dynamic energy and has the highest static power cost. Also, operating in the highest mode does not necessarily mean that the simulation will end sooner, because packet injection is based on real-valued cycle times [16].


5 Conclusions and Future Work

In this thesis, we presented and evaluated two novel energy-proportional computing techniques applicable to the NoC. The first technique was proactive, machine-learning-enhanced DVFS on a per-router basis in the form of our various LEAD models. In the sections that presented and evaluated our LEAD models, we showed how proactive mode selection techniques such as LEAD-τ can lead to substantial dynamic energy savings of 17% on average with a minimal performance loss of 2-4%. LEAD-∆ highlights an unequal trade-off between dynamic energy savings and performance, with a 34-35% dynamic energy reduction for a 40-43% loss in throughput. This showed the opposite side of the spectrum: how a poorly tuned or otherwise underwhelming mode selection model can lead to a sub-optimal trade-off between dynamic energy savings and performance. LEAD-G showed a mode selection model that achieved very good energy savings of 42% for a matching throughput loss of 42%, resulting in an even trade-off in saturated networks [17].

The second energy-proportional computing technique we presented and evaluated in this thesis was DozzNoC. DozzNoC is a combination of our best performing LEAD model, LEAD-τ, with a partially non-blocking power-gated framework. In the sections that presented and evaluated DozzNoC and its various comparative models, we showed how it is possible to target both static power and dynamic energy. The power-gated portion of this design can be operated on a very fine time grain to ensure that break-even times, wake-up times, and idle counters are accounted for, while the DVFS portion can be operated on a larger time grain. This ensures that switching delays and computational overhead costs are minimized. The LEAD-τ model and the Power Punch model were used for comparative purposes and highlighted the individual trade-offs associated with using either a modern partially non-blocking power-gated scheme or a proactive mode selection model for DVFS. At high network load on the mesh network topology, our version of Power Punch achieved an average of 47% static power savings at a cost of 9% throughput with no reduction in dynamic energy. Our best performing LEAD model, LEAD-τ, applied to the same topology achieved a dynamic energy reduction of 25% and a static power reduction of 25% at a cost of only 3% throughput. Finally, our DozzNoC model achieved an average static power reduction of 53% and an average dynamic energy reduction of 25% for only a 7% loss in throughput. These results showcase how it is possible to combine the underlying ideas behind a partially or fully non-blocking power-gated design and smart proactive DVFS to save both dynamic energy and static power. These savings can be realized for minimal loss in performance and minimal run-time computational overhead [16].

There are many avenues for future work that could improve various aspects of both LEAD and DozzNoC. One example of future LEAD work would be the application of more advanced machine learning techniques such as deep learning and reinforcement learning. The goal of this work would be to analyze the overhead and energy savings associated with both online and offline learning, as well as the implementation of more dynamic models that can easily adapt to new network load environments and topologies. Another example of future LEAD work would be monitoring bit error rate and adding noise so that reliability trade-offs can be evaluated. This work would mainly focus on lower voltage levels, where a lower supply voltage may lead to an increase in bit error rate; the goal would be to quantify the error rate and determine the reliability impact on energy savings and performance loss. Future DozzNoC work may include the development of novel non-blocking strategies or other modifications of the power-gating scheme. Router/link off time could be maximized through the addition of either packet re-routing or bypass channels. The goal would be to evaluate the static power and energy savings of a DozzNoC model with both reinforcement/deep learning techniques and novel non-blocking power-gating strategies.


References

[1] M. Baboli, N. Husin, and M. Marsono. A comprehensive evaluation of direct and indirect network-on-chip topologies. In Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management, January 2014.

[2] Y. Bai, V. W. Lee, and E. Ipek. Voltage regulator efficiency aware power management. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), April 2017.

[3] A. Banerjee, R. Mullins, and S. Moore. A power and energy exploration of network-on-chip architectures. In First International Symposium on Networks-on-Chip (NOCS'07), May 2007.

[4] Andrea Bianco, Paolo Giaccone, and Nanfang Li. Exploiting dynamic voltage and frequency scaling in networks on chip. In 2012 IEEE 13th International Conference on High Performance Switching and Routing, pages 229-234, June 2012.

[5] C. Bienia and K. Li. PARSEC 2.0: A new benchmark suite for chip-multiprocessors. In Proc. of the 5th Annual Workshop on Modeling, Benchmarking and Simulation, June 2009.

[6] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag, 2006.

[7] Paul Bogdan, Radu Marculescu, Siddharth Jain, and Rafael Gavila. An optimal control approach to power management for multi-voltage and frequency islands multiprocessor platforms under highly variable workloads. In International Symposium on Networks on Chip (NoCS), pages 35-42, May 2012.

[8] Haseeb Bokhari, Haris Javaid, Muhammad Shafique, Jorg Henkel, and Sri Parameswaran. darkNoC: Designing energy-efficient network-on-chip with multi-Vt cells for dark silicon. In 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1-6, June 2014.

[9] R. Boyapati, J. Huang, N. Wang, K. H. Kim, K. H. Yum, and E. J. Kim. Fly-Over: A light-weight distributed power-gating mechanism for energy-efficient networks-on-chip. In International Parallel and Distributed Processing Symposium (IPDPS), July 2017.

[10] M. R. Casu, M. K. Yadav, and M. Zamboni. Power-gating technique for network-on-chip buffers. Electronics Letters, pages 1438-1440, November 2013.

[11] Lizhong Chen and Timothy Pinkston. NoRD: Node-router decoupling for effective power-gating of on-chip routers. In 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pages 270-281, December 2012.

[12] Lizhong Chen, Lihang Zhao, Ruisheng Wang, and Timothy Pinkston. MP3: Minimizing performance penalty for power-gating of Clos network-on-chip. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA), pages 296-307, February 2014.

[13] Lizhong Chen, Di Zhu, Massoud Pedram, and Timothy Pinkston. Power Punch: Towards non-blocking power-gating of NoC routers. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA), pages 378-389, July 2015.

[14] X. Chen and L. Peh. Leakage power modeling and optimization in interconnection networks. In Proceedings of the 2003 International Symposium on Low Power Electronics and Design, August 2003.

[15] Xi Chen, Zheng Xu, Hyungjun Kim, Paul Gratz, Jiang Hu, Michael Kishinevsky, Umit Ogras, and Raid Ayoub. Dynamic voltage and frequency scaling for shared resources in multicore processor designs. In 50th ACM/EDAC/IEEE Design Automation Conference (DAC), July 2013.

[16] Mark Clark, Yingping Chen, Avinash Karanth, Brian D. Ma, and Ahmed Louri. DozzNoC: Reducing static and dynamic energy in network-on-chips with low-latency voltage regulators using machine learning. Submitted to Transactions, 2018.

[17] Mark Clark, Avinash Kodi, Razvan Bunescu, and Ahmed Louri. LEAD: Learning-enabled energy-aware dynamic voltage/frequency scaling in NoCs. In The 55th Annual Design Automation Conference (DAC), June 2018.

[18] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2003.

[19] Reetuparna Das, Satish Narayanasamy, Sudhir Satpathy, and Ronald Dreslinski. Catnap: Energy proportional multiple network-on-chip. In ISCA '13: Proceedings of the 40th Annual International Symposium on Computer Architecture, pages 320-331, June 2013.

[20] Radu David, Paul Bogdan, and Radu Marculescu. Dynamic power management for multicores: Case study using the Intel SCC. In International Conference on VLSI and System-on-Chip (VLSI-SoC), pages 147-152, October 2012.

[21] Benoit Dupont de Dinechin, Pierre Guironnet de Massas, Guillaume Lager, Clement Leger, Benjamin Orgogozoa, Jerome Reybert, and Thierry Strudela. A distributed run-time environment for the Kalray MPPA-256 integrated manycore processor. In Intl. Conference on Computational Science (ICCS), Vol. 18, 2013.

[22] Gaurav Dhiman and Tajana Rosing. Dynamic voltage frequency scaling for multi-tasking systems using online learning. In ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), August 2007.

[23] C. Enz and Y. Cheng. MOS transistor modeling for RF IC design. IEEE Journal of Solid-State Circuits, February 2000.

[24] Stijn Eyerman and Lieven Eeckhout. Fine-grained DVFS using on-chip regulators. ACM Transactions on Architecture and Code Optimization (TACO), April 2011.

[25] Hossein Farrokhbakht, Mohammadkazem Taram, Behnam Khaleghi, and Shaahin Hessabi. TooT: An efficient and scalable power-gating method for NoC routers. In 2016 Tenth IEEE/ACM International Symposium on Networks-on-Chip (NOCS), pages 1-8, August 2016.

[26] Quintin Fettes, Mark Clark, Razvan Bunescu, Avinash Karanth, and Ahmed Louri. Dynamic voltage and frequency scaling in NoCs with supervised and reinforcement learning techniques. IEEE Transactions on Computers, October 2018.

[27] A. Flores, J. L. Aragon, and M. Acacio. An energy consumption characterization of on-chip interconnection networks for tiled CMP architectures. Journal of Supercomputing, 45, September 2008.

[28] Kyle Hale, Boris Grot, and Stephen Keckler. Segment gating for static energy reduction in networks-on-chip. In 2009 2nd International Workshop on Network on Chip Architectures, pages 57-62, December 2009.

[29] Richard Hay. Machine learning based DVFS for energy efficient execution of multithreaded workloads. Dissertations and Theses Technical Reports - Computer Science, November 2014.

[30] Sebastian Herbert and Diana Marculescu. Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. In Proceedings of the 2007 International Symposium on Low Power Electronics and Design (ISLPED), August 2007.

[31] R. Hesse and N. E. Jerger. Improving DVFS in NoCs with coherence prediction. In NOCS '15: Proceedings of the 9th International Symposium on Networks-on-Chip, September 2015.

[32] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10-14, February 2014.

[33] Brad Hutchings, Brent Nelson, Stephen West, and Reed Curtis. Optical flow on the Ambric massively parallel processor array (MPPA). In 17th IEEE Symposium on Field Programmable Custom Computing Machines, April 2009.

[34] Keith R. Jackson, Lavanya Ramakrishnan, Krishna Muriki, Shane Canon, Shreyas Cholia, John Shalf, Harvey J. Wasserman, and Nicholas J. Wright. Performance analysis of high performance computing applications on the Amazon Web Services cloud. In IEEE Second International Conference on Cloud Computing Technology and Science (CLOUDCOM), 2010.

[35] Rahul Jain, Preeti Panda, and Sreenivas Subramoney. Machine learned machines: Adaptive co-optimization of caches, cores, and on-chip network. In Design, Automation and Test in Europe Conference and Exhibition (DATE), April 2016.

[36] R. M. Warner Jr. and B. L. Grung. Transistors: Fundamentals for the Integrated-Circuit Engineer. 1984.

[37] S. Kaxiras and M. Martonosi. Computer Architecture Techniques for Power-Efficiency. Morgan and Claypool Publishers, 2008.

[38] R. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, 1987.

[39] N. Magen, A. Kolodny, U. Weiser, and N. Shamir. Interconnect-power dissipation in a microprocessor. In SLIP '04: Proceedings of the 2004 International Workshop on System Level Interconnect Prediction, pages 7-13, February 2004.

[40] G. Magklis, P. Chaparro, J. Gonzalez, and A. Gonzalez. Independent front-end and back-end dynamic voltage scaling for a GALS microarchitecture. In Proceedings of the 2006 International Symposium on Low Power Electronics and Design (ISLPED), 2006.

[41] Asit K. Mishra, Reetuparna Das, Soumya Eachempati, Ravi Iyer, N. Vijaykrishnan, and Chita R. Das. A case for dynamic frequency tuning in on-chip networks. In Proceedings of the International Symposium on Microarchitecture (MICRO), pages 292-303, 2009.

[42] G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, January 1998.

[43] Nasim Nasirian, Reza Soosahabi, and Magdy Bayoumi. Traffic-aware power-gating scheme for network-on-chip routers. In 2016 IEEE Dallas Circuits and Systems Conference (DCAS), pages 1-4, October 2016.

[44] B. Parhami. Computer Architecture: From Microprocessors to Supercomputers. Oxford University Press, 2005.

[45] Ritesh Parikh, Reetuparna Das, and Valeria Bertacco. Power-aware NoCs through routing and topology reconfiguration. In DAC '14: Proceedings of the 51st Annual Design Automation Conference, pages 1-6, June 2014.

[46] D. A. Patterson and J. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface, 5th ed. Morgan Kaufmann, 2013.

[47] Carl Ramey. TILE-Gx100 manycore processor: Acceleration interfaces and architecture. In 2011 IEEE Hot Chips 23 Symposium (HCS), August 2011.

[48] Ahmad Samih, Ren Wang, Anil Krishna, Christian Maciocco, Charlie Tai, and Yan Solihin. Energy-efficient interconnect via router parking. In 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA), pages 508-519, February 2013.

[49] K. Y. Sanbonmatsu and C. S. Tung. High performance computing in biology: Multimillion atom simulations of nanoscale systems. Journal of Structural Biology, October 2006.

[50] R. R. Schaller. Moore's law: past, present and future. IEEE Spectrum, Volume 34, June 1997.

[51] Li Shang, Li-Shiuan Peh, and Niraj K. Jha. Analysis of dynamic voltage/frequency scaling in chip-multiprocessors. Computer Architecture Letters, page 6, January 2002.

[52] Hao Shen, Jun Lu, and Qinru Qiu. Learning based DVFS for simultaneous temperature, performance and energy management. In 13th International Symposium on Quality Electronic Design (ISQED), March 2012.

[53] W. Shockley and R. Brown. Electrons and Holes in Semiconductors with Applications to Transistor Electronics. 1963.

[54] L. Shulenburger, A. Landahl, J. Moussa, O. Parekh, J. Wendt, and J. Aidun. Benchmarking adiabatic quantum optimization for complex network analysis. Technical Report SAND2015-3025, Sandia National Laboratories, June 2015.

[55] Seung Son, K. Malkowski, Guilin Chen, M. Kandemir, and P. Raghavan. Integrated link/CPU voltage scaling for reducing energy consumption of parallel sparse matrix applications. In Proceedings of the 20th IEEE International Parallel and Distributed Processing Symposium, April 2006.

[56] C. Sun, C. Chen, G. Kurian, L. Wei, J. Miller, A. Agarwal, L. Peh, and V. Stojanovic. DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In IEEE/ACM Sixth International Symposium on Networks-on-Chip, May 2012.

[57] Chen Sun, C.-H. O. Chen, G. Kurian, Lan Wei, J. Miller, A. Agarwal, Li-Shiuan Peh, and V. Stojanovic. DSENT - a tool connecting emerging photonics with electronics for opto-electronic networks-on-chip modeling. In 2012 Sixth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), pages 201-210, 2012.

[58] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. Multi2Sim: A simulation framework for CPU-GPU computing. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12), pages 335-344, 2012.

[59] Saeeda Usman, Samee Khan, and Sikandar Khan. A comparative study of voltage/frequency scaling in NoC. In IEEE International Conference on Electro-Information Technology (EIT 2013), pages 1-5, May 2013.

[60] R. Widlar. New developments in IC voltage regulators. In IEEE International Solid-State Circuits Conference, Digest of Technical Papers, February 1970.

[61] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. of the 22nd International Symposium on Computer Architecture, June 1995.

[62] Sheng Yang, Rishad Shafik, Geoff Merrett, Edward Stott, Joshua Levine, James Davis, and Bashir Al-Hashimi. Adaptive energy minimization of embedded heterogeneous systems using regression-based learning. In 25th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), September 2015.

[63] Jia Zhan, Nikolay Stoimenov, Jin Ouyang, Lothar Thiele, Vijaykrishnan Narayanan, and Yuan Xie. Optimizing the NoC slack through voltage and frequency scaling in hard real-time embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, pages 1632-1643, November 2014.

[64] Davide Zoni, Federico Terraneo, and William Fornaciari. A DVFS cycle accurate simulation framework with asynchronous NoC design for power-performance optimizations. Journal of Signal Processing Systems, pages 357-371, June 2016.
