Resilient Energy-Constrained Microprocessor Architectures


Resilient Energy-Constrained Microprocessor Architectures

for obtaining the academic degree of

Doctor of Engineering

Approved

Dissertation

Department of Informatics

Karlsruhe Institute of Technology (KIT)

by

Anteneh Gebregiorgis

from Zalanbesa, Ethiopia

Oral Exam Date: 03 May 2019

Adviser: Prof. Dr. Mehdi Baradaran Tahoori, KIT

Co-adviser: Prof. Dr. Said Hamdioui, Delft University of Technology

To My Family


Acknowledgments

First of all, I would like to express my sincere gratitude to my adviser, Prof. Mehdi Tahoori, who has helped me relentlessly and involved me in various research projects during my PhD journey. I would also like to extend my gratitude to my co-adviser, Prof. Said Hamdioui, for his support and motivation.

When I found myself experiencing the feeling of fulfillment, I realized that although only my name appears on the cover of this dissertation, my loving wife made an immense contribution to the successful completion of my journey. Thus, I extend my earnest gratitude to my wife, Tirhas, for her continued and unfailing love, support, and understanding during my pursuit of the Ph.D. degree, which made everything possible. She sacrificed most of her wishes to support and encourage me unconditionally, for which I am deeply indebted. Finally, I acknowledge the people who mean a lot to me, my family, for showing faith in me and giving me the liberty to follow my desire. I salute you all for the selfless love, care, pain, and sacrifice you endured to shape my life.

I would like to thank my colleagues at the Chair of Dependable Nano Computing for their thoughts, help, and companionship. I would especially like to thank my friends Rajendra Bishnoi, Saber Golanbari, and Arun Kumar Vijayan, from whom I have learned a lot, not only in research but also in other aspects of life.


Anteneh Gebregiorgis

Kaiserallee 109

76185 Karlsruhe

Hiermit erkläre ich an Eides statt, dass ich die von mir vorgelegte Arbeit selbstständig verfasst habe, dass ich die verwendeten Quellen, Internet-Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit - einschließlich Tabellen, Karten und Abbildungen -, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

Karlsruhe, May 2019

©Anteneh Gebregiorgis


ABSTRACT

In the past years, we have seen a tremendous increase in small and battery-powered devices

for sensing, wireless communication, data processing, classification, and recognition tasks.

Collectively referred to as the Internet of Things (IoT), all these devices have a huge impact on

several aspects of our day-to-day life. Since IoT devices need to be portable and lightweight,

they depend on batteries or energy harvested from the environment as their primary energy source. As a result, IoT devices have to operate within a limited energy envelope in order to extend the supply duration of their energy source. In order to meet the stringent energy budget of battery-powered IoT devices, extremely low-energy design has become a standard requirement. In this regard, supply voltage downscaling has been used as an effective approach for reducing the energy consumption of Complementary Metal Oxide Semiconductor (CMOS) circuits and enabling ultra-low power operation. Although aggressive supply voltage downscaling is a popular approach for extremely low-power operation, it reduces the performance significantly. In this context, operating in the near-threshold voltage domain (commonly known as NTC) can provide a better trade-off between performance and energy saving, as it achieves up to 10× energy saving at the cost of a linear performance reduction. However, the broad applicability

of NTC is hindered by several barriers, such as the increase in functional failure of memory

components, performance variation, and higher sensitivity to variation effects.

For NTC processors, the wide extent of variation and the higher functional failure rate pose a daunting challenge to assuring the timing certainty of logic blocks, such as the pipeline stages of a processor core, and the stability of memory elements (caches and registers). Moreover, due to the reduction in the noise margin of memory components, susceptibility to runtime reliability issues, such as aging and soft errors, is also increasing with supply voltage downscaling. These challenges limit NTC's potential and force designers to use large timing margins in order to ensure reliable operation of different architectural blocks, which leads to significant overheads. Therefore, analyzing and mitigating variation-induced timing failures of pipeline stages and memory failures during early design phases plays a crucial role in the design of resilient and energy-efficient microprocessor architectures.

This thesis provides cost-effective cross-layer solutions to improve the resiliency and energy-

efficiency of energy-constrained pipelined microprocessors operating in the near-threshold volt-

age domain. Different architecture-level solutions for logic and memory components of a

pipelined processor are presented in this thesis. The solutions provided in this thesis address

the three main NTC challenges, namely the increased sensitivity to process variation, the higher memory failure rate, and performance uncertainties. Additionally, this thesis demonstrates

how to exploit emerging computing paradigms, such as approximate computing, in order to

further improve the energy efficiency of NTC designs.


ZUSAMMENFASSUNG

In den letzten Jahren wurde eine starke Zunahme an miniaturisierten und batteriebetriebenen elektronischen Systemen beobachtet, welche zur kabellosen Kommunikation, Datenverarbeitung, Mustererkennung oder für Sensoranwendungen eingesetzt werden. Alles deutet darauf hin, dass solche Internet-of-Things-Anwendungen (IoT) einen starken Einfluss auf unseren Alltag haben werden. Da IoT-Systeme portabel und leichtgewichtig sein müssen, ist ein Batteriebetrieb oder eine autarke Energiegewinnung aus der Umgebung Voraussetzung. Aus diesem Grund ist es notwendig, den Leistungsverbrauch des IoT-Systems zu reduzieren und die Energiequelle wenig zu belasten, um eine möglichst lange Versorgung zu gewährleisten. Um die stringenten Energie-Budgets von batteriebetriebenen IoT-Systemen einzuhalten, ist eine Niedrigenergiebauweise zur Standardvoraussetzung geworden. In diesem Zusammenhang wurden effiziente Verfahren entwickelt, welche die Versorgungsspannung reduzieren, um den Leistungsverbrauch von Complementary Metal Oxide Semiconductor (CMOS) Schaltungen zu senken und eine Ultra-Low-Power-Operation zu ermöglichen. Obwohl diese Reduzierung des Energieverbrauchs durch die starke Verringerung der Versorgungsspannung sehr effektiv ist, verschlechtert sich auf der anderen Seite die Performance der Schaltung signifikant. Bei der Absenkung der Versorgungsspannung in den nahen Schwellspannungsbereich (auch bezeichnet als Near-Threshold Computing (NTC)) wird jedoch der Leistungsverbrauch um das 10-Fache verringert, auf Kosten einer Performance-Reduzierung mit lediglich linearem Verlauf. Damit einher gehen jedoch weitere Probleme, wie zum Beispiel funktionale Ausfälle von Speicherbausteinen, Performance-Variationen und eine höhere Sensitivität gegenüber weiteren Variationen, welche eine breite Anwendung dieses Verfahrens erschweren.

Für NTC-Prozessoren führen diese erhöhten Variationen und funktionalen Ausfälle zu untragbaren Nebeneffekten, da zeitliche Anforderungen von Logikbausteinen, wie zum Beispiel Pipeline-Stufen des Prozessorkerns, sowie die Stabilität von Speicherbausteinen (Caches und Register) nicht eingehalten werden. Des Weiteren erhöht sich, da eine Reduzierung der Noise-Margin von Speicherelementen auftritt, mit der Absenkung der Versorgungsspannung auch die Anfälligkeit gegenüber Alterungseffekten und Soft-Errors, welche den verlässlichen Betrieb während der Laufzeit beeinträchtigen. Diese Probleme limitieren das Potential von NTC-Verfahren und zwingen Entwickler, zeitliche Anforderungen aufzulockern, um den verlässlichen Betrieb von verschiedenen Komponenten der Gesamtarchitektur zu garantieren, was zu einem bedeutenden Mehraufwand führt. Daher ist eine Analyse und Verbesserung der variationsinduzierten zeitlichen Verstöße von Pipeline-Stufen und der Speicherausfälle in frühen Stadien des Entwurfs entscheidend, um verlässliche und energieeffiziente Mikroprozessorarchitekturen zu ermöglichen.

Diese Arbeit liefert einen schichtübergreifenden und kosteneffizienten Ansatz zur Verbesserung der Verlässlichkeit und Energieeffizienz von energiebeschränkten Mikroprozessoren mit Pipeline, welche im nahen Schwellspannungsbereich betrieben werden. Lösungen auf verschiedenen Architekturebenen für Logik- und Speicherkomponenten eines Prozessors mit Pipeline werden in der hier vorliegenden Arbeit behandelt. Die erarbeiteten Lösungen adressieren die drei hauptsächlichen Herausforderungen von NTC-Anwendungen, nämlich die Erhöhung der Sensitivität gegenüber Prozessvariationen, höhere Ausfallwahrscheinlichkeiten von Speicherbausteinen und Performance-Schwankungen. Ergänzend wird demonstriert, wie aufkommende Paradigmen wie das Approximate Computing eingesetzt werden, um die Energieeffizienz von NTC-Designs weiter zu verbessern.


Table of Contents

ABSTRACT x

Glossary xiii

Acronyms xv

List of Figures xvii

List of Tables xxi

1 Introduction 1

1.1 Problem statement and objective . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Thesis contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2.1 Cross-layer memory reliability analysis and mitigation technique . . . . 4

1.2.2 Pipeline stage delay balancing and optimization techniques . . . . . . . 5

1.2.3 Exploiting approximate computing . . . . . . . . . . . . . . . . . . . . . 6

1.3 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Background and State-of-The-Art 7

2.1 Near-Threshold Computing (NTC) for energy-efficient designs . . . . . . . . . . 8

2.1.1 NTC basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.1.2 NTC application domains . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Challenges for NTC operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1 Performance reduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.2 Increased sensitivity to variation effects . . . . . . . . . . . . . . . . . . 12

2.2.3 Functional failure and reliability issues of NTC memory components . . 14

2.3 Existing techniques to overcome NTC barriers . . . . . . . . . . . . . . . . . . . 18

2.3.1 Solutions addressing performance reduction . . . . . . . . . . . . . . . . 18

2.3.2 Solutions addressing variability . . . . . . . . . . . . . . . . . . . . . . . 19

2.3.3 Solutions addressing memory failures . . . . . . . . . . . . . . . . . . . . 21

2.4 Emerging technologies and computing paradigm for extreme energy efficiency . 24

2.4.1 Non-volatile processor design . . . . . . . . . . . . . . . . . . . . . . . . 24

2.4.2 Exploiting approximate computing for NTC . . . . . . . . . . . . . . . . 24

2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Reliable Cache Design for NTC Operation 27

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3.2 Cross-layer reliability analysis framework for NTC caches . . . . . . . . . . . . 28

3.2.1 System FIT rate extraction . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.2 Cross-layer SNM and SER estimation . . . . . . . . . . . . . . . . . . . 30

3.2.3 Experimental evaluation and trade-off analysis . . . . . . . . . . . . . . 33


3.3 Voltage scalable memory failure mitigation scheme . . . . . . . . . . . . . . . . 41

3.3.1 Motivation and idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.3.2 Built-In Self-Test (BIST) based runtime operating voltage adjustment . 43

3.3.3 Error tolerant block mapping . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3.4 Evaluation of voltage scalable mitigation scheme . . . . . . . . . . . . . 45

3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4 Reliable and Energy-Efficient Microprocessor Pipeline Design 49

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.2 Variation-aware pipeline stage balancing . . . . . . . . . . . . . . . . . . . . . . 50

4.2.1 Pipelining background and motivational example . . . . . . . . . . . . . 51

4.2.2 Variation-aware pipeline stage balancing flow . . . . . . . . . . . . . . . 52

4.2.3 Coarse-grained balancing for deep pipelines . . . . . . . . . . . . . . . . 55

4.2.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.3 Fine-grained Minimum Energy Point (MEP) tuning for energy-efficient pipeline

design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.3.2 Fine-grained MEP analysis basics and challenges . . . . . . . . . . . . . 62

4.3.3 Motivation and problem statement for pipeline stage-level MEP assignment 64

4.3.4 Lagrange multiplier based two-phase hierarchical pipeline stage-level MEP

tuning technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

4.3.5 Implementation issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

4.3.6 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.3.7 Comparison with related works . . . . . . . . . . . . . . . . . . . . . . . 80

4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5 Approximate Computing for Energy-Efficient NTC Design 85

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85

5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.2.1 Embracing errors in approximate computing . . . . . . . . . . . . . . . 86

5.2.2 Related works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.3 Error propagation aware timing relaxation . . . . . . . . . . . . . . . . . . . . . 87

5.3.1 Motivation and idea . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

5.3.2 Variation-induced timing error propagation analysis . . . . . . . . . . . 88

5.3.3 Mixed-timing logic synthesis flow . . . . . . . . . . . . . . . . . . . . . . 91

5.4 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.4.2 Energy efficiency analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.4.3 Application to image processing . . . . . . . . . . . . . . . . . . . . . . 96

5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

6 Conclusion and Remarks 99

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6.2 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

Bibliography 101

xii

Glossary

3D three-dimensional integrated circuit made of vertically stacked silicon wafers.

AVF Architectural Vulnerability Factor is the probability that an error in a memory structure

propagates to the data path. AVF = vulnerable period / total program execution period.

ECC Error Correction Code (ECC) enables error checking and correction of data that is being read or transmitted, when necessary.

FinFET Fin Field Effect Transistor is a non-planar, three-dimensional transistor.

FIT Failures in Time is a standard reliability metric defined as the failure rate per billion hours of operation.

IoT Internet of Things is the network of devices that contain sensors, actuators, and connec-

tivity, which allows the devices to connect and interact.

LLC Last Level Cache is the lowest-level cache that is usually shared by all the functional

units on the chip.

SeaMicro Low-Power, High-Bandwidth Micro-server Solutions.

SP Signal Probability is the probability of storing logic 1 in the SRAM cell.


Acronyms

ABB Adaptive Body Bias.

AI Artificial Intelligence.

BIST Built-In Self Test.

BTI Bias Temperature Instability.

CMOS Complementary Metal Oxide Semiconductor.

CMP Chip Multiprocessing.

CPU Central Processing Unit.

DCT Discrete Cosine Transform.

DSP Digital Signal Processing.

DVFS Dynamic Voltage and Frequency Scaling.

EDA Electronic Design Automation.

FBB Forward Body Bias.

GPU Graphics Processing Unit.

IPC Instruction Per Cycle.

ITRS International Technology Roadmap for Semiconductors.

LER Line Edge Roughness.

LUT Look-Up Table.

MEP Minimum Energy Point.

NMOS Negative-channel Metal Oxide Semiconductor.

NTC Near Threshold Computing.

PDP Power Delay Product.

PMOS Positive-channel Metal Oxide Semiconductor.

RBB Reverse Body Bias.


RDF Random Dopant Fluctuations.

SER Soft Error Rate.

SNM Static Noise Margin.

SRAM Static Random Access Memory.

SSTA Statistical Static Timing Analysis.

VLSI Very Large Scale Integrated Circuits.


List of Figures

1.1 Intel Chips Transistor count per unit area, Millions of transistor per millimeter

square (MTr/mm2) and voltage scaling of different technology nodes. . . . . . . 2

1.2 Thesis contribution summary of solutions addressing memory failure, variability

effect on pipeline stages, and exploiting emerging computing paradigm. . . . . 5

2.1 Dynamic, leakage and total energy trends of supply voltage downscaling at

different voltage levels for an inverter chain implemented with saed 32nm library. 9

2.2 Supply voltage downscaling induced performance reduction (delay increase) of

an inverter chain implementation evaluated for wide operating voltage range. . 12

2.3 Process variation induced performance/delay variation of the b01 circuit across

wide supply voltage range. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 Energy-consumption characteristics of inverter chain implementation of three

different process corners: typical (TT) with regular Vth (RVT), slow (SS) with
high Vth (HVT = RVT + ∆Vth), and fast (FF) with low Vth (LVT = RVT -

∆Vth), where ∆Vth=25mV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.5 Schematic diagram of 6T SRAM cell, where WL= word-line, BL=bit-line and

RL=read-line. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.6 Write margin (in terms of write latency) comparison of 6T and 8T SRAM cell

operating in near-threshold voltage domain (0.5V). . . . . . . . . . . . . . . . . 16

2.7 Interdependence of reliability failure mechanisms and their impact on the system

Failure In-Time (FIT) rate in NTC. . . . . . . . . . . . . . . . . . . . . . . . . 18

2.8 Variation-induced timing error detection and correction techniques for combina-

tional circuits (a) razor flip-flop based timing error detection and correction,

(b) shadow flip-flop for time borrowing, and (c) adaptive body biasing to adjust

circuit timing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2.9 Alternative bit-cell designs to improve read disturb, stability and yield of SRAM

cells in NTC domain (a) Differential 7-Transistor (7T) bit-cell design, (b) Read/write

decoupled 8-Transistor (8T) bit-cell design, and (c) Robust and read/write de-

coupled 10-Transistor (10T) bit-cell design. . . . . . . . . . . . . . . . . . . . . 22

3.1 Cross-layer impact of memory system and workload application on system-level

reliability (Failure-In-Time (FIT rate)) of NTC memory components, and their

interdependence. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2 Holistic cross-layer reliability estimation framework to analyze the impact of

aging and process variation effects on soft error rate. . . . . . . . . . . . . . . . 29


3.3 SNM degradation in the presence of process variation and aging after 3 years

of operation, aging+PV-induced SNM degradation at NTC is 2.5× higher than

the super-threshold domain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 SER rate of fresh and aged 6T and 8T SRAM cells for various Vdd values. . . . 33

3.5 SER of 6T and 8T SRAM cells in the presence of process variation and aging

effects after 3 years of operation. . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.6 Workload effects on aging-induced SNM degradation in the presence of process

variation for 6T and 8T SRAM cell based cache after 3 years of operation (a)

6T SRAM based cache (b) 8T SRAM based cache. . . . . . . . . . . . . . . . . 35

3.7 Workload effect on SER rate of 6T SRAM cell based cache memory for wide

supply voltage range. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.8 Impact of cache organization on SNM degradation in near-threshold (NTC) and

super-threshold (ST) in the presence of process variation and aging effect after

3 years of operation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.9 FIT rate and performance design space of various cache configurations in the

super-threshold voltage domain by considering average workload effect (the blue

italic font indicates optimal configuration). . . . . . . . . . . . . . . . . . . . . 38

3.10 FIT rate and performance design space of 6T and 8T designs for various cache

configurations in the near-threshold voltage domain by considering average

workload effect (the blue italic font indicates optimal configuration). . . . . . . 39

3.11 FIT rate and performance trade-off analysis of near-threshold 6T and 8T caches

for various cache configurations and average workload effect in the presence of

process variation and aging effects. . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.12 Energy consumption profile of 6T and 8T based 4K 4-way cache for wide sup-

ply voltage value ranges averaged over the selected workloads from SPEC2000

benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

3.13 Error-free minimum operating voltage distribution of 8 MB cache, Set size =

128 Byte (a) block size=32 Bytes (4 blocks per set) and (b) block size=64 Bytes

(two blocks per set), the cache is modeled as 45nm node in CACTI. . . . . . . 42

3.14 Cache access control flowchart equipped with BIST and block mapping logic. . 44

3.15 Error tolerant cache block mapping scheme (mapping failing blocks to marginal

blocks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.16 Comparison of voltage downscaling in the presence of block disabling and ECC

induced overheads for gzip, parser and mcf applications from SPEC2000 bench-

mark (a) energy comparison (b) Performance in IPC comparison . . . . . . . . 46

4.1 Variation induced delay increase in super and near threshold voltages for OpenSPARC

core (refer to Table 4.1 in Section 4.2.4 for setup of OpenSPARC). . . . . . . . 50

4.2 Impact of process variation on the delays of different pipeline stages in NTC. . 51

4.3 Variation-aware pipeline stage delay balancing synthesis flow for NTC. . . . . . 52

4.4 Time constraint modification using pseudo divide and conquer. . . . . . . . . . 54


4.5 Pipeline merging and signal control illustration (a) before (b) after merging

where S2 = S2+S3 and S3 = S4. . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4.6 Nominal and variation-induced delays of OpenSPARC pipeline stages (a) nomi-

nal delay balanced baseline design, (b) guard band reduction (delay optimized)

of nominal delay balanced, and (c) statistical delay balanced (power optimized). 57

4.7 Power (energy) improvement of variation-aware balancing over the baseline de-

sign for OpenSPARC core. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.8 Nominal and variation-induced Delays of FabScalar pipeline stages (a) base-

line design, the gray boxes indicate the stages to be merged, (b) merged and

optimized design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.9 Power (energy) improvement of variation-aware FabScalar design over the base-

line design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.10 Dynamic to leakage energy ratio of pipeline stages of FabScalar core under

different workloads synthesized using 0.5V saed 32nm library. . . . . . . . . . . 61

4.11 MEP supply voltage movement characteristics of Regular Vth (RVT), High Vth

(RVT+∆Vth) and Low Vth (RVT-∆Vth) inverter chain implementations for dif-

ferent activity rates in saed 32nm library where ∆Vth = 25mV. . . . . . . . . . 63

4.12 Energy vs MEP supply voltage for a 3-stage pipeline core with Regular Vth

(RVT), High Vth (RVT+∆Vth) and Low Vth in saed 32nm library, and ∆Vth =

25mV. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

4.13 Energy gain comparison of core-level vs stage-level MEP assignment for a 3-

stage pipeline core with Regular Vth (RVT), High Vth (RVT+∆Vth), and Low

Vth in saed 32nm library and ∆Vth = 25mV, target frequency = 67MHz. . . . 66

4.14 Algorithm for solving MEP of pipeline stages by using Lagrangian function and

linear algebra. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

4.15 Illustrative example for clustering of the MEP voltages of different pipeline

stages (a) MEP distribution on Vth, Vdd space (b) Clustering of the MEPs into

3 Vdd and 3 Vth groups. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.16 Dual purpose flip-flop (voltage level conversion and pipeline stage register), the

gates shaded in red are driven by VDDL. . . . . . . . . . . . . . . . . . . . . . . 74

4.17 Comparing the energy efficiency improvement of the proposed micro-block (pipeline

stage) level MEP [Proposed] and macro-block level MEP [Related work] over

baseline design of Fabscalar core. . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.18 Comparing the energy efficiency improvement of the proposed micro-block (pipeline

stage) level MEP [Proposed] and macro-block level MEP [Related work] over base-

line design of OpenSPARC core. . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.19 Effect of Vdd scaling on the energy efficiency of different pipeline stages (a)

pipeline stage level MEP (Vth) (b) core-level MEP [Related work] Vth (normalized
to one cycle). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78


4.20 Energy-saving and area overhead trade-off for different voltage islands for Fab-

Scalar core (three voltage islands provides better energy-saving and overhead

trade-off for FabScalar core). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.1 Example of FIR filter circuit showing path classification and timing constraint

relaxation of paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.2 Variation-induced path delay distribution with the assumption of normal dis-

tribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.3 Timing error generation, propagation and masking probability from error site

to primary output of a circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.4 Proposed timing error propagation aware mixed-timing logic synthesis framework. 91

5.5 Energy efficiency improvement for different levels of approximation (% of relaxed

paths). The curves correspond to the relaxation amount as a % of the clock. . 94

5.6 Effective system error for different levels of accuracy (% relaxed paths). The

curves correspond to the amount by which the selected paths are relaxed, as a

% of the clock. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

5.7 Energy efficiency improvement of different optimization schemes. . . . . . . . . 95

5.8 Post-synthesis Monte Carlo simulation flow. . . . . . . . . . . . . . . . . . . . . 96

5.9 PSNR distribution for different approximation levels for 35% relaxation slack. . 96

5.10 Comparison of output quality of different levels of approximation (% of relaxed

paths). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97


List of Tables

3.1 Experimental setup, configuration and evaluated benchmark applications . . . 34

3.2 ECC overhead analysis for different block sizes and correction capabilities . . . . 43

3.3 Minimum scalable voltage analysis for different ECC schemes . . . . . . . . . . 46

4.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2 PDP of baseline, delay optimized, and power optimized designs . . . . . . . . . 58

4.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

4.4 Energy efficiency comparison of the pipeline stage-level MEP assignment with

core-level MEP assignment and voltage over-scaling technique . . . . . . . . . . 82


1 Introduction

The number of Complementary Metal Oxide Semiconductor (CMOS) transistors on a chip has

increased exponentially for the past five decades, as predicted by Gordon Moore in the year

1965, commonly referred to as Moore’s law [1]. The tremendous increase in transistor count driven by technology downscaling has served as a mainstay for the semiconductor industry’s relentless

progress in improving the performance of computing devices from generation to generation [1].

The rapid increase in computing power has touched almost all aspects of our day-to-day life

such as education, health care, and security systems. As a result, embedded devices such

as wearable and hand-held devices, smartphones, navigation systems, Artificial Intelligence

(AI) systems, and critical components in aerospace and automotive systems have become abundant

at an affordable cost [2]. If Moore’s law continues to hold, we can expect further exciting

developments for all segments of society. For example, it can facilitate the availability of large

health monitoring devices, such as diagnostic imaging devices, that presently are limited in use

due to their immense cost and size [2]. With scaling to the nanoscale era, atomic scale processes

become extremely important and small imperfections have a large impact on both device

performance and reliability, which limits the extent of further scaling. In order to sustain the benefits

of technology downscaling, new silicon-based technologies, such as FinFET devices [3, 4] and

three-dimensional (3D) integration [5], are providing new paths for increasing the transistor

count per chip area. However, due to different barriers, the increase in integration density no

longer translates into a proportionate increase in performance or energy efficiency [6].

Although microprocessors have enjoyed significant performance improvement while main-

taining constant power density, the conventional supply voltage downscaling has slowed down

in recent years. The slower pace of supply voltage downscaling limits the processor frequency that can be sustained while meeting the power-density constraint of Very Large Scale Integrated (VLSI) chips. This stagnation of supply voltage downscaling marks the end of what is commonly referred to as Dennard scaling [7]. Figure 1.1 shows the technology scaling driven increase in transistor count and

supply voltage downscaling potential of different technology nodes in the nanometer regime.

As shown in the figure, the supply voltage stagnates around 1V for all technology nodes be-

yond the 90nm node. As a consequence, the dynamic energy remains constant across different

technology nodes while leakage energy continues to increase. The supply voltage stagnation

leads to a higher power density, which restricts further technology downscaling and integration density due to thermal and cooling limits. Hence, FinFET and 3D based improvements in integration density have a minimal impact on the performance and energy efficiency

improvement [3, 5]. Moreover, the emergence of Internet of Things (IoT) applications has

increased the need for extremely low-power design, which, in turn, increases the pressure for

supply voltage downscaling.

To overcome these challenges, processor designers have added more cores without a signifi-

cant increase in the operating frequency, leading to the prevalence of Chip Multiprocessing (CMP) and, ultimately, to the dark-silicon phenomenon [7]. However, because the number of cores has been increasing geomet-


Figure 1.1: Transistor count per unit area of Intel chips, in millions of transistors per square millimeter (MTr/mm2), and supply voltage scaling of different technology nodes.

rically with each process node while die area has remained fixed, the total chip power has

again started to increase, despite the relatively flat core operating frequencies [8]. In practice,

the power dissipation of chip multiprocessing is constrained by thermal cooling limits, forcing some cores to be idle and underutilized. Hence, the underutilized cores must be powered off to satisfy the energy budget, limiting the number of cores that can be active simultaneously and reducing the achievable throughput of modern chip multiprocessing [7].

Therefore, a computing paradigm shift has become a crucial step to overcome these chal-

lenges and unlock the potential of energy-constrained wearable and hand-held devices for different application segments [9]. In this regard, aggressive supply voltage scaling down to the near-threshold voltage domain, commonly referred to as Near-Threshold Computing (NTC),

in which the supply voltage (Vdd) is set to be close to the transistor threshold voltage, has

emerged as a promising paradigm to overcome the power limitation and enable energy-efficient

operation of nanoscale devices [9, 6]. Since the dynamic power decreases quadratically with

a decrease in the supply voltage, operating in the near-threshold domain reduces the energy consumption substantially (up to an order of magnitude). This substantial energy reduction makes NTC a promising design approach for extremely low-energy domains such as mobile devices and energy-harvested embedded devices, particularly in the scope of IoT applications [6, 10].

In comparison to the conventional super-threshold voltage operation, computations in the

near-threshold voltage domain are performed in a very energy-efficient manner. Unfortunately,

the performance drops linearly in the near-threshold domain [6, 11]. However, the performance

reduction at NTC can be compensated for by taking advantage of the available resources (e.g., multi-core chips) in combination with the inherent application parallelism [12]. In addition to performance reduction, however, NTC comes with its own set of challenges that affect the energy efficiency and hinder its widespread applicability. NTC faces three key challenges

that must be addressed in order to harness its benefits [6, 10, 11]. These challenges include


1) performance loss, 2) increase in sensitivity to variation effects as well as increased timing

failure of logic units, and 3) higher functional failure rate of memory components. Moreover,

the contribution of leakage energy is increasing significantly with supply voltage downscaling,

which makes it a substantial part of the energy consumed by NTC designs. Therefore, over-

coming these challenges and improving the energy efficiency of NTC designs is a formidable

challenge requiring a synergistic approach combining different architectural and circuit-level

design techniques addressing storage elements and logic units [10, 12, 13].

1.1 Problem statement and objective

Although performance degradation is an important issue in the near-threshold domain, a variety of circuit and architecture-level solutions, such as parallel architectures and voltage boosting, have been proposed to fully or partially recover the performance lost under NTC operation [12, 14]. For processors operating in the near-threshold domain, wide variation extent

and higher functional failure rate are posing a daunting challenge for guaranteeing timing cer-

tainty of logic blocks and stability of storage elements (caches and registers) [10]. Moreover,

due to the reduction in the noise margin of memory components, susceptibility to runtime

reliability issues, such as transistor aging and soft errors, is also increasing with supply volt-

age downscaling [13]. These challenges limit NTC's potential and force designers to add large timing margins to ensure reliable operation of different architectural blocks, which imposes significant performance and power overheads [15]. Therefore, analyzing and mitigating variation-induced timing failures of pipeline stages and memory failures during early design phases plays a crucial role in the design of resilient and energy-efficient microprocessor architectures.

Historically, process variation has always been a critical aspect of semiconductor fabrica-

tion [16]. However, the lithography process of nanoscale technology nodes is worsening the

impact of process variation [17]. As a result, systematic and random process variations are pos-

ing a significant challenge [17]. Traditional methods to deal with the variability issue mainly focus

on adding design margins [10]. Although such approaches are useful in the super-threshold

domain, they are wasteful and inadequate when the supply voltage is scaled down to the

near-threshold voltage domain [6]. As a result, such methods have significant performance

and power overheads when applied to NTC designs [6, 10]. Therefore, an effective variabil-

ity mitigation technique for NTC should address the variability effects at different levels of

abstraction, including device, circuit, and architecture-levels with the consideration of the

running workload characteristics.

The variation sensitivity of NTC also increases the functional failure rate; in particular, it compromises the state of Static Random Access Memory (SRAM) cells by making them favor one state over the other [18, 15, 19]. Moreover, process variation increases the susceptibility of SRAM cells to runtime failures such as aging and soft errors [18]. As a result, variation-induced functional failure of SRAM cells is increasing exponentially with supply voltage downscaling [18, 20]. For instance, a typical 65nm SRAM cell has a failure probability of ≈ 10⁻⁷ in the super-threshold voltage domain, which is easily addressed using simple Error Correction Code (ECC) mechanisms [21]. However, the failure rate increases by five orders of magnitude in the

near-threshold voltage domain (e.g., 500mV) [21, 20], in which ECC based solutions become

expensive. Therefore, robust cross-layer approaches, ranging from the architecture to circuit-

levels, are crucial to address the variation-induced functional failure of memory components,


and improve the resiliency and energy efficiency of NTC designs [18].
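To put the scale of the failure-rate increase quoted above into perspective, the following back-of-the-envelope Python sketch estimates the expected number of faulty cells in a cache; the cell failure probabilities are the approximate values cited above, while the 32 KB cache size and the resulting counts are purely illustrative assumptions.

# Illustrative estimate of faulty SRAM cells in a cache (all parameters assumed).
cache_bits = 32 * 1024 * 8            # a 32 KB cache contains 262,144 bit cells
p_fail_st  = 1e-7                     # approx. cell failure probability at super-threshold (65nm)
p_fail_ntc = p_fail_st * 1e5          # about five orders of magnitude higher near threshold

for label, p in [("super-threshold", p_fail_st), ("near-threshold", p_fail_ntc)]:
    print(f"{label}: expected faulty cells = {cache_bits * p:.2f}")

# Roughly 0.03 faulty cells per cache at super-threshold versus ~2600 near threshold,
# which is why simple per-word ECC quickly becomes too expensive at NTC.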

The goal of this thesis is to improve the resiliency and energy efficiency of energy-constrained

pipelined NTC microprocessors by using cost-effective cross-layer solutions. This thesis presents

different circuit and architecture-level solutions for logic and memory components of a pipelined

processor, addressing the three main NTC challenges, namely the increased sensitivity to process variation, the higher memory failure rate, and performance uncertainties. Additionally, this thesis

demonstrates how to exploit emerging computing paradigms, such as approximate computing,

in order to further improve the energy efficiency of NTC designs.

1.2 Thesis contributions

Different reliability and energy efficiency challenges of NTC designs are explored in this thesis.

To address the reliability and energy efficiency issues, a cross-layer NTC memory reliability

analysis framework consisting of accurate circuit-level models of aging, process variation, and

soft error is developed. The framework integrates the circuit-level models with architecture-

level memory organization, and all the way to the system level workload effects. The cross-

layer framework is useful to explore the impact of the reliability failure mechanisms, their

interdependence, and workload effects on the reliability of memory arrays. The framework

is also applicable for design space exploration as it helps to understand how the reliability

issues change from the super-threshold to the near-threshold voltage domain. In addition

to the framework, different architecture-level solutions to improve the resiliency and energy

efficiency of the logic units (pipeline stages) of NTC processors are also presented in this thesis.

These solutions include variation-aware pipeline stage balancing and fine-grained minimum

energy point operation. Moreover, this thesis explores the potentials of emerging computing

paradigms, such as approximate computing, for further energy efficiency improvement of NTC

designs. As shown in Figure 1.2, the overall contributions of this thesis are classified into three

main categories: 1) cross-layer memory reliability analysis and mitigation, 2) variation-aware

pipeline stage optimization, and 3) exploiting approximate computing for energy-efficient NTC

design.

1.2.1 Cross-layer memory reliability analysis and mitigation technique

It has been widely studied that functional failure of memory components is a crucial issue

in the design of resilient energy-constrained processors. In order to address this issue, a

cross-layer reliability analysis and mitigation framework is developed in this thesis [18]. The

framework first determines the combined effect of aging, soft error, and process variation on the

reliability of NTC memories (e.g., caches and registers). Then, a voltage scalable mitigation

scheme is developed to design resilient and energy-efficient memory architecture for NTC

operation [22, 20]. The cross-layer reliability analysis framework can be used:

• To study the combined effect of aging, soft error, and process variation at different levels

of abstraction. Additionally, the framework is useful for circuit and architecture-level

design space exploration.

• To understand how the impact of reliability issues changes from the super-threshold to

the near-threshold voltage domain.


• To estimate the memory failure rate and the error-free supply voltage downscaling potential

of memory arrays for different organizations.

• To develop error-tolerant mitigation techniques addressing aging- and variation-induced

failures of memory arrays operating in the near-threshold voltage domain.

1.2.2 Pipeline stage delay balancing and optimization techniques

To address the variation-induced timing uncertainty of pipelined NTC processors, different architecture-level optimization and delay balancing techniques are presented in this thesis [23, 24].

• Variation-aware pipeline balancing [24]: the variation-aware pipeline balancing technique is a design-time solution to improve the energy efficiency and performance of

pipelined processors operating in the near-threshold voltage domain. This technique

adopts an iterative variation-aware synthesis flow in order to balance the delay of pipeline

stages in the presence of extreme delay variation.

• Pipeline stage level Minimum Energy Point (MEP) design [23]: The increas-

ing demand for energy reduction has motivated various researchers to investigate the

optimum supply voltage for minimizing power consumption while satisfying the spec-

ified performance constraints. For this purpose, a fine-grained (pipeline stage-level)

energy-optimal supply and threshold voltage (Vdd, Vth) pair assignment technique for

an energy-efficient microprocessor pipeline design is developed in this thesis. The pipeline

stage-level MEP assignment is a design-time solution to optimize the pipeline stages in-

dependently by considering their structure and activity rate variation.
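To give a concrete flavor of such a stage-level (Vdd, Vth) assignment, the following Python sketch performs a brute-force search over a small voltage grid for a toy three-stage pipeline under a shared cycle-time constraint. The per-stage energy and delay models and all constants are invented placeholders; they are not the Lagrange-multiplier formulation, voltage-island clustering, or library data used in Chapter 4.

import itertools, math

# Toy per-stage models (assumed): energy and delay as functions of (Vdd, Vth).
def stage_energy(vdd, vth, activity, t_clk):
    e_dyn  = activity * 1e-12 * vdd ** 2                  # switched-capacitance term
    e_leak = 0.1 * math.exp(-vth / 0.04) * vdd * t_clk    # leakage integrated over one cycle
    return e_dyn + e_leak

def stage_delay(vdd, vth):
    return 1e-9 * vdd / max(vdd - vth, 0.05) ** 1.5       # alpha-power-law style delay

activities = [0.9, 0.4, 0.6]                              # assumed activity rates of three stages
vdd_grid   = [0.45, 0.55, 0.65, 0.75]
vth_grid   = [0.30, 0.35, 0.40]
T_CLK      = 8e-9                                         # assumed target cycle time

best = None
for assign in itertools.product(itertools.product(vdd_grid, vth_grid), repeat=len(activities)):
    if all(stage_delay(v, t) <= T_CLK for (v, t) in assign):          # performance constraint
        energy = sum(stage_energy(v, t, a, T_CLK) for (v, t), a in zip(assign, activities))
        if best is None or energy < best[0]:
            best = (energy, assign)

print("Per-stage (Vdd, Vth) assignment:", best[1], "| total energy [J]:", best[0])

In this toy model the stages decouple, so each stage simply picks its own optimum; the full technique additionally clusters the stage-level MEPs into a few voltage islands and inserts level-converting pipeline registers, which this sketch ignores.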

Figure 1.2: Thesis contribution summary of solutions addressing memory failure, variability effect on pipeline stages, and exploiting emerging computing paradigm.


1.2.3 Exploiting approximate computing

Approximate computing has emerged as a promising alternative for energy-efficient designs [25,

26]. Approximate computing exploits the inherent error resiliency of applications to achieve a desirable trade-off between performance/energy efficiency and output quality [27, 25, 28]. In

order to exploit the advantages of approximate computing, a framework leveraging the error

tolerance potential of approximate computing is developed to improve the energy efficiency of

NTC designs. In the framework, the control logic portion of a design is identified first and

is protected from approximation. Then, the approximable and non-approximable portions

of the data-flow portion are identified with the help of an error propagation analysis tool. Afterward, a mixed-timing logic synthesis flow, which applies a tight timing constraint to the non-approximable portion and a relaxed timing constraint to the approximable part, is used to synthesize the design.
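As a toy illustration of why relaxing the timing constraint of approximable paths can be traded against a bounded error probability, the Python sketch below evaluates the chance that a path whose variation-induced delay is assumed to be normally distributed (as in Chapter 5) misses a tight versus a relaxed clock period; the mean, standard deviation, and the two periods are invented placeholder numbers.

from math import erf, sqrt

def timing_error_prob(mu, sigma, period):
    # P(delay > period) for a normally distributed path delay (assumed model)
    return 0.5 * (1.0 - erf((period - mu) / (sigma * sqrt(2.0))))

mu, sigma = 4.0e-9, 0.3e-9           # assumed mean and standard deviation of a path delay [s]
tight, relaxed = 5.0e-9, 4.2e-9      # guard-banded vs. relaxed clock period [s]

for label, period in [("tight (non-approximable)", tight), ("relaxed (approximable)", relaxed)]:
    p_err = timing_error_prob(mu, sigma, period)
    print(f"{label}: period = {period * 1e9:.1f} ns, timing-error probability = {p_err:.4f}")

# The relaxed constraint accepts a small, quantifiable error probability on approximable paths,
# which allows the synthesis tool to use slower, less energy-hungry cells on those paths.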

1.3 Thesis outline

The remainder of the thesis is organized in five chapters. A short introduction of each chapter

is given as follows:

Chapter 2 presents a detailed analysis of near-threshold computing. The chapter motivates

the need for NTC operation first. Then, the merits and challenges of NTC are discussed in

detail. In addition, the chapter discusses the strengths and shortcomings of the state-of-the-art

techniques addressing NTC challenges.

Chapter 3 presents the reliability analysis and mitigation framework for near-threshold mem-

ories. The chapter first discusses the main reliability challenges of memory elements operating

at different supply voltage levels (from super-threshold to the near-threshold voltage domain).

Then, the modeling and cross-layer analysis of the reliability failure mechanisms and their interdependence are presented. Based on this analysis, an energy-efficient mitigation scheme is presented to address the reliability failures of NTC memories. Finally, a summary closes the chapter.

Chapter 4 presents different architecture-level optimization techniques developed to improve the energy efficiency and balance the delay of the pipeline stages of pipelined NTC processors.

First, the chapter discusses the impact of process variation on the delay of pipeline stages.

Then, a variation-aware balancing technique is presented to balance the delay of the pipeline

stages in the presence of extreme variation effect. Additionally, since variation has a negative

impact on the leakage power of NTC designs, the chapter presents an analytical Minimum

Energy Point (MEP) operation technique that determines optimal supply and threshold voltage

pair assignment of pipeline stages.

Chapter 5 presents the framework to exploit the potentials of approximate computing for

energy-efficient NTC designs. The chapter first discusses the error tolerance nature of ap-

proximate computing for timing relaxation of NTC designs. Then, a mixed-timing synthesis

framework is presented to exploit the inherent error tolerance nature of approximate com-

puting. Finally, Chapter 6 presents the conclusions of the thesis, and points out potential

directions for future research.


2 Background and State-of-The-Art

Power consumption is one of the most significant roadblocks of technology downscaling ac-

cording to a recent report by the International Technology Roadmap for Semiconductors

(ITRS) [29]. Power delivery and heat removal capabilities are already limiting the perfor-

mance improvement of modern microprocessors, and will continue to restrict the performance

severely [30]. Although the dynamic power of transistors is decreasing with technology down-

scaling, the overall power density goes up due to the increase in leakage power, leading to an

increase in the overall energy consumption of modern circuits.

The most effective knob to reduce the energy consumption of nanoscale microprocessors

is to lower their supply voltage (Vdd). Although supply voltage downscaling results in

a quadratic reduction in the dynamic power consumption, the operating frequency is also

reduced, and hence the task completion latency increases significantly [31, 32]. Despite its

performance reduction, supply voltage downscaling has been widely adopted in various Dy-

namic Voltage and Frequency Scaling (DVFS) techniques. However, since it prolongs the execution time, it is not generally applicable to high-performance applications [32]. In order

to address this issue, parallel execution is used to counteract the downside of supply voltage

downscaling by increasing the instruction throughput. In the parallel execution approach, the

user program (task) is parallelized in order to run on multiple cores (CMP) [7]. However, the

increase in power density and heat removal complexity limits the number of cores that can be

active simultaneously, which eventually limits the maximum attainable throughput of modern

CMPs. Therefore, a paradigm shift is necessary in order to improve the performance and energy efficiency of microprocessor designs in the nanoscale era [7]. To this end, aggressive

downscaling of the supply voltage to the near-threshold voltage domain has emerged as an

effective approach to improve the energy efficiency of nanoscale processors with an acceptable

performance reduction.

This chapter defines and explores near-threshold computing (aka NTC), a design paradigm in

which the supply voltage is set to be close to the threshold voltage of the transistor. Operating

in the near-threshold voltage domain retains much of the energy savings of supply voltage

downscaling with more favorable performance and variability characteristics. This energy

and performance trade-off makes near-threshold voltage operation applicable to a broad range of power-constrained computing segments, from sensors to high-performance servers. This

chapter first presents a detailed discussion of NTC operation. Then, the main challenges

of NTC operation are discussed in detail, followed by a discussion of the state-of-the-art

solutions to overcome the challenges of NTC operation.


2.1 Near-Threshold Computing (NTC) for energy-efficient designs

2.1.1 NTC basics

The power consumption of CMOS circuits has three main components: dynamic power, leakage power, and short-circuit power [33]. Equation (2.1) shows the contribution of the three

power components to the overall power consumption of CMOS circuits.

Ptotal = Pdynamic + Pshort + Pleakage (2.1)

Dynamic power (Pdynamic) of CMOS devices mainly stems from the charging and discharging

of the internal node capacitance when the output of a CMOS gate is switching [34]. The

switching is a strong function of the input signal switching activity and the operating clock

frequency [35, 36]. Therefore, the dynamic power of CMOS circuits is modeled as shown in

Equation (2.2) [35, 36].

Pdynamic = α × Cload × Vdd² × f (2.2)

where α is the input signal switching activity, Cload is the load capacitance, Vdd is the supply

voltage, and f is the operating clock frequency.

Leakage power, the power dissipated due to the current leaked when CMOS circuits are

in an idle state, is another critical component of the total power consumption of CMOS circuits [37]. The leakage power is also dependent on the supply voltage of the CMOS circuit, and

it is expressed as shown in Equation (2.3) [37].

Pleakage = Ileak × Vdd (2.3)

where Ileak is the leakage current during the idle state of CMOS circuits.

Although dynamic and leakage powers are the dominant components of the CMOS circuit

power consumption, the short-circuit power is also an important component, as it has a strong

dependency on the supply voltage [34]. However, since the dynamic power, which contributes

almost 80% of the total CMOS power consumption, has a quadratic dependency on the supply

voltage, reducing the supply voltage reduces the power consumption quadratically as shown

in Equation (2.2) [21, 24, 38]. Moreover, since the other components of total power (leakage

and short circuit powers) have a linear relation to the supply voltage, they are also reduced

linearly with supply voltage downscaling [10, 38].
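As a minimal numerical illustration of Equations (2.1)-(2.3), the short Python sketch below evaluates the dynamic and leakage power of a hypothetical circuit at a nominal and a halved supply voltage; the switching activity, load capacitance, clock frequency, and leakage current are arbitrary placeholder values, not data from this thesis.

def dynamic_power(alpha, c_load, vdd, f):
    return alpha * c_load * vdd ** 2 * f      # Eq. (2.2): quadratic in Vdd

def leakage_power(i_leak, vdd):
    return i_leak * vdd                       # Eq. (2.3): linear in Vdd

alpha, c_load, f, i_leak = 0.2, 1e-9, 1e9, 1e-3   # assumed activity, 1 nF, 1 GHz, 1 mA

for vdd in (1.0, 0.5):                        # nominal versus halved supply voltage
    p_dyn, p_leak = dynamic_power(alpha, c_load, vdd, f), leakage_power(i_leak, vdd)
    print(f"Vdd = {vdd:.1f} V: Pdyn = {p_dyn:.3f} W, Pleak = {p_leak * 1e3:.2f} mW")

# Halving Vdd cuts the dynamic power by 4x but the leakage power only by 2x, so the leakage
# share of the total grows as the supply voltage is scaled down. (In practice the clock
# frequency f must also be reduced at lower Vdd, which is the performance cost discussed above.)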

Therefore, supply voltage downscaling is the most effective method to improve the energy ef-

ficiency of CMOS circuits. Since CMOS circuits can function properly at very low voltages even

when the Vdd drops below the threshold voltage (Vth), they provide huge voltage downscaling

potential to reduce the energy consumption. With such broad voltage downscaling potential,

it has become an important issue to determine the optimal operation region in which the power

consumption of CMOS circuits is reduced significantly with minimal impact on other circuit

characteristics [39]. Thus, scaling the supply voltage down to the near-threshold voltage regime

(Vdd ≈ Vth), known as NTC, provides more than 10× energy reduction at the expense of linear

performance reduction. However, further scaling the supply voltage down to the sub-threshold


Figure 2.1: Dynamic, leakage, and total energy trends (normalized energy per cycle) of supply voltage downscaling at different voltage levels for an inverter chain implemented with the saed 32nm library.

voltage domain (Vdd << Vth) reduces the energy saving potential due to the rapid increase

in the leakage power and delay of CMOS circuits. Therefore, in the sub-threshold voltage

domain, the increase in leakage energy eventually dominates any reduction in dynamic energy, which increases the overall energy consumption compared to the near-threshold voltage

domain.

To demonstrate the benefits and drawbacks of supply voltage downscaling, the dynamic

energy and leakage energy characteristics of an inverter chain implemented using the saed

32nm library are extracted for different supply voltage values, as shown in Figure 2.1. The figure

shows that the dynamic energy of the inverter chain decreases quadratically with supply voltage

downscaling. As shown in the figure, the leakage energy has a rather minimal contribution to

the total energy in the super-threshold voltage domain. However, its contribution increases

when the supply voltage is scaled down to the near-threshold voltage domain, and becomes

dominant in the sub-threshold voltage domain. As a result, the leakage increase in the sub-

threshold voltage domain nullifies the energy reduction benefits of supply voltage downscaling.

An essential consideration for operating in the near-threshold domain is that the optimal

operating point is usually set close to the transistor threshold voltage; however, the exact

optimal point varies from design to design depending on several parameters. Determining

the exact optimal point is referred to as Minimum Energy Point (MEP) operation, and it is

discussed in detail in Chapter 4 of this thesis.

2.1.2 NTC application domains

Graphics based workload applications are among the primary beneficiaries of NTC operation.

Since graphics processors (e.g., image processing units, and DSP accelerators commonly found


in hand-held devices such as tablets and smart-phones) are inherently throughput focused, indi-

vidual thread performance has rather minimal importance [40]. Thus, highly parallel ultra-low

power processing units and DSP accelerators operating in the near-threshold voltage domain

can deliver the required performance and throughput with a limited energy budget [41]. As a

result, the battery supply duration of those hand-held devices can be improved significantly by

taking advantage of NTC operation. Similarly, Graphics Processing Units (GPUs) for mobile

devices, which usually run at a relatively low frequency, benefit from NTC operation [24, 6].
Moreover, since GPUs are mostly power limited, an improvement in energy efficiency with the
help of NTC operation can be directly translated into performance gain, as NTC enables
powering on multiple GPU units at the same time and improving the throughput without exceeding

the thermal constraints [7, 42].

Additionally, the inherent error tolerance nature of various workload applications can be ex-

ploited to further improve the energy efficiency with the help of emerging computing paradigms

such as approximate computing [25, 43, 27, 44]. Floating point intensive workload applications

are inherently tolerant to various inaccuracies [25, 43]. The inherent error tolerance nature of

floating-point intensive workload applications makes them the best fit for NTC operation with

less emphasis on computation accuracy in order to improve the energy efficiency [43]. The

issue of exploiting inherent error tolerance nature of applications with the help of approximate

computing for energy-efficient NTC designs is discussed in detail in Chapter 5 of this thesis.

With the increasing demand for Internet of Things (IoT) applications, sensor-based systems

consisting of single or multiple nodes are becoming abundant in our day-to-day life [45]. A

sensor node typically consists of a data processing and storage unit, off-chip communication,

sensing elements, and a power source [46]. These devices are usually placed in remote areas

(e.g., to collect weather data) or implanted in the human body, such as pacemakers designed

to sense and adjust the rhythm of the heart. Since these devices are mainly powered through

a battery or harvested energy, energy reduction is crucial while performance is not a major

constraint [46, 45]. Therefore, energy efficiency is the critical limiting constraint in the design

of IoT-based battery-powered sensor node devices. As a result, those devices benefit from

energy-efficient NTC design techniques.

The energy efficiency achieved by near-threshold voltage operation techniques varies
from one application to another. For instance, general purpose processors for laptop

and desktop computers are less likely to benefit from NTC operation due to their demand

for higher performance. Thus, while energy efficiency is essential for longer battery supply

duration of laptop computers, sacrificing performance is not desirable from their overall design

purpose [32]. In such application domains user programs are executed in a time sliced manner,

and hence, the responsiveness of the system is mostly determined by the latency [15, 32].

Therefore, NTC is not beneficial for such application domains, as modern CPUs designed

for laptop and desktop applications usually need to run at higher frequencies (e.g., 2-4GHz).

One potential way of adopting NTC for these application domains is to increase
the throughput via parallel execution while the frequency is sacrificed [7, 12]. However, due to
the limited parallel portion of workload applications (commonly known as Amdahl’s

law) the throughput improvement of parallel execution cannot fully regain the performance

reduction imposed by NTC operation. Moreover, these systems are extremely cost sensitive,

and hence, the extra area and power overheads of the additional units can nullify the benefits

gained through NTC based parallel execution.


Similarly, high-performance processors designed for high-end server applications are not the best
fit for NTC operation, mainly because several applications require high single-threaded per-

formance with a limited response time [47]. Massive server farms are often power hungry, and

if single thread performance is not critical, then the performance is traded-off for energy effi-

ciency improvement by using controlled supply voltage downscaling [48, 49]. Hence, massively

parallel workloads, like those targeted by low-power cloud servers (e.g., SeaMicro), benefit

from NTC operation [50]. However, they have several challenges, such as the need for software

redesign to handle clustered configurations, to be addressed in order to harness the full-fledged

NTC benefits.

2.2 Challenges for NTC operation

Although NTC is a promising way to provide better trade-off for performance and energy

efficiency, its widespread applicability is limited by the challenges that come along with it. The

main challenges for NTC operation are 1) performance reduction, 2) increase in sensitivity to

variation effects, and 3) higher functional failure rate of storage elements. Therefore, these

three key challenges must be addressed adequately in order to get the full NTC benefits.

2.2.1 Performance reduction

In smaller technology nodes, the transistor threshold voltage is scaled down slowly in order to

reduce the leakage power of the transistor. This necessitates keeping the supply voltage
considerably higher than the transistor threshold voltage in order to achieve better

performance. The dwindling threshold voltage scaling slowed down the pace of supply voltage

downscaling, which eventually leads to energy inefficiency [51]. As a consequence, energy-

efficient operation in the NTC regime comes at the cost of performance reduction. To study the

performance reduction of NTC operation, it is crucial to investigate the delay characteristics of

CMOS circuits operating in the near-threshold voltage domain. For this purpose, the voltage

downscaling induced delay increase of an inverter chain implemented using the saed 32nm

library is studied for different supply voltage values (from super-threshold down to the sub-

threshold voltage domain) given in Figure 2.2. As shown in the figure, the delay increases

linearly when the supply voltage is scaled down to the near-threshold regime (1.1V to 0.5V).

With further downscaling to the sub-threshold domain, however, the circuit delay increases

exponentially due to the exponential dependence of the transistor drain current on the node

voltages (VGS and VDS) as shown in Equation (2.4) [52].

IDS = IS × e^((VGS − Vth)/(n·VT)) × (1 − e^(−VDS/VT)) (2.4)

where Vth is the threshold voltage, VT is the thermal voltage, and n is a process dependent term

called slope factor, and it is typically in the range of 1.3-1.5 for modern CMOS processes [52].

VGS and VDS parameters are the gate-to-source and drain-to-source voltages, respectively.

The parameter IS is the specific current which is given by Equation (2.5) [52].

IS = 2·n·µ·Cox·(W/L)·VT^2 (2.5)


[Plot: inverter-chain delay in ns (log scale, 10^0 to 10^4) vs. supply voltage (0.2-1.1 V).]

Figure 2.2: Supply voltage downscaling induced performance reduction (delay increase) of an inverter

chain implementation evaluated for wide operating voltage range.

where µ is the carrier mobility, Cox is the gate capacitance per unit area, and W/L is the transistor

aspect ratio [52].
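To make the exponential behavior of Equations (2.4) and (2.5) concrete, the following minimal Python sketch evaluates the weak-inversion drain current and a relative C·Vdd/Ion delay estimate for a few supply voltages. All device parameters (Vth, µ·Cox, W/L, n) are assumed, generic values for illustration only; they are not the saed 32nm characterization data.

    # Illustrative sketch of Equations (2.4) and (2.5); all parameter values are assumed.
    import math

    V_T = 0.026        # thermal voltage kT/q at room temperature [V]
    n = 1.4            # slope factor, typically 1.3-1.5 [52]
    Vth = 0.45         # assumed threshold voltage [V]
    mu_Cox = 200e-6    # assumed mobility * Cox [A/V^2]
    W_over_L = 4.0     # assumed transistor aspect ratio

    I_S = 2.0 * n * mu_Cox * W_over_L * V_T ** 2      # Equation (2.5)

    def i_ds(v_gs, v_ds):
        """Weak-inversion drain current of Equation (2.4)."""
        return I_S * math.exp((v_gs - Vth) / (n * V_T)) * (1.0 - math.exp(-v_ds / V_T))

    for vdd in (0.6, 0.5, 0.4, 0.3):                  # near- to sub-threshold supplies
        i_on = i_ds(vdd, vdd)                         # transistor fully on: Vgs = Vds = Vdd
        print(f"Vdd={vdd:.1f} V  Ion={i_on:.2e} A  relative delay={vdd / i_on:.2e}")

The rapidly shrinking Ion (and thus rapidly growing relative delay) below Vth is exactly the exponential delay increase discussed above.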

Although the performance reduction observed in NTC is not as severe as the reduction in the

sub-threshold voltage domain, it is still one of the formidable challenges for the widespread ap-

plicability of NTC designs. There have been several recent advances in circuit and architecture-

level techniques to regain some of the loss in performance [6, 15, 21]. Most of these techniques

mainly focus on massive parallelism with an NTC oriented memory hierarchy design. The

data transfer and routing challenges in these architectures are addressed by the use of 3D inte-

gration [15, 5]. Additionally, the use of deeply pipelined processor architectures can effectively

regain the performance loss of NTC design as it enables increasing the operating frequency [53].

However, the data dependency among instruction streams as well as conditional and uncon-

ditional branch instructions results in frequent pipeline flushing which eventually reduces the

overall throughput.

2.2.2 Increased sensitivity to variation effects

Another primary challenge for operating at reduced supply voltage values is the increase in

sensitivity to process variation which affects the circuit delay and energy efficiency signifi-

cantly [54, 55]. As a result, NTC designs display a dramatic increase in performance uncer-

tainty. Based on the nature of manufacturing, process variation is classified into two main

categories: local (intra-die) and global (inter-die) variations [17, 16, 56]. Local variation is

defined as the change (difference) in the parameters of the transistors in a single die, and it

can be systematic or random [17]. Random local variation due to Random Dopant Fluctua-

tions (RDF) and Line Edge Roughness (LER) results in variation in the transistor threshold


voltage [16]. For nanoscale designs operating in the near-threshold voltage domain, the impact

of random local variation on circuit performance is becoming increasingly important [57, 58].

The primary reasons behind this trend are the reduction in transistor gate dimensions, reduced

pace of gate oxide thickness scaling, as well as dopant-ion fluctuations [17, 58].

To illustrate the impact of local process variation on the performance uncertainty of NTC

circuits, a Monte Carlo simulation based circuit delay analysis is performed for the b01 circuit

from the ITC’99 benchmark suite [59]. The circuit has been synthesized with the Nangate 45nm
Open Cell Library [60], characterized for different supply voltages, ranging from 0.4V to 0.9V,
with the Cadence Liberate Variety statistical characterization tool [61]. The Monte Carlo analysis is

done using 1000 samples by considering local variation induced threshold voltage shift as shown

in Figure 2.3.
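The following minimal Python sketch mimics the idea of such a Monte Carlo delay analysis under assumptions: a local Vth shift is sampled per trial and converted to a gate delay with the simple weak-inversion drive model from Equation (2.4). It is an illustration only, not the Cadence Liberate Variety flow, and the nominal Vth and sigma values are assumed.

    # Simplified Monte Carlo delay-variation sketch (assumed parameters, not the thesis flow).
    import math, random

    random.seed(0)
    V_T, n = 0.026, 1.4
    VTH_NOM = 0.45          # assumed nominal threshold voltage [V]
    SIGMA_VTH = 0.03        # assumed local Vth sigma [V]

    def gate_delay(vdd, vth):
        # delay ~ Vdd / Ion with the current expression of Equation (2.4)
        i_on = math.exp((vdd - vth) / (n * V_T)) * (1.0 - math.exp(-vdd / V_T))
        return vdd / i_on

    for vdd in (0.9, 0.5):
        samples = [gate_delay(vdd, random.gauss(VTH_NOM, SIGMA_VTH)) for _ in range(1000)]
        mean = sum(samples) / len(samples)
        sigma = (sum((d - mean) ** 2 for d in samples) / len(samples)) ** 0.5
        print(f"Vdd = {vdd} V: 3*sigma/mean delay variation = {3.0 * sigma / mean:.2f}")

Even with this crude model, the relative delay spread grows sharply as Vdd approaches Vth, which is the trend reported in Figure 2.3.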

Figure 2.3 shows that local process variation has minimal impact on the circuit delay when

operating at higher supply voltage values. However, the impact of local variation increases

exponentially when the supply voltage is scaled down to the near/sub-threshold voltage do-

mains. For example, the delay variation due to process variation alone increases by 6× from

≈20% in the super-threshold voltage domain (0.8V and above) to 120% at NTC (0.5V). The

delay variation even increases by 8× when the supply voltage is further scaled down to the

sub-threshold voltage domain.

Similar to the local variation, global (inter-die) variation affects the performance of dif-

ferent chips. Global variation is usually related to the process corners, which are often
characterized through process Monte Carlo analysis [62]. Process corners are provided by the foundry, and are typically determined
by library characterization data [63]. A process corner is represented statistically (e.g., ±3σ) to

designate Fast (FF), Typical (TT), and Slow (SS) corners. These corners are used to represent

global process variation that designers must consider in their designs. Global (process corner)

[Plot: mean (µ) and standard deviation (3σ) of delay in ns vs. Vdd (0.4-0.9 V) for the b01 circuit.]

Figure 2.3: Process variation induced performance/delay variation of the b01 circuit across a wide supply

voltage range.


[Plot: energy (µJ, 0.5-4) vs. Vdd (0.4-0.9 V) for the high-Vth, regular-Vth, and low-Vth implementations.]

Figure 2.4: Energy-consumption characteristics of an inverter chain implemented with three different process corners; typical (TT) with regular Vth (RVT), slow (SS) with high Vth (HVT = RVT + ∆Vth), and fast (FF) with low Vth (LVT = RVT − ∆Vth), where ∆Vth = 25mV.

variation causes significant change in the duty cycle and signal slew rate, and significantly

affects the energy efficiency of chips [63, 62]. In order to illustrate this scenario, the impact of

global process variation (process corner) on the energy consumption of an inverter chain imple-

mented using three different process corners of the saed 32nm library is evaluated as shown

in Figure 2.4. The three process corners of the inverter chain are implemented using the SS,

TT, and FF process corners. In the TT corner implementation, both NMOS and PMOS tran-

sistor types are in a typical (T) process corner (Regular Vth). The FF corner implementation

represents the condition where both NMOS and PMOS transistors are faster, with lower Vth

value (Low Vth) when compared to the typical process corner (TT). Similarly, the SS process

corner implementation represents the condition where both NMOS and PMOS transistors are

slower, with higher Vth value (High Vth) than the TT process corner.

The energy consumption result given in Figure 2.4 shows that for a given activity rate, the

energy consumption varies for different process corners with different threshold voltage values.

For all implementations, the overall energy consumption reduces with a decrease in the supply

voltage (e.g., Vdd range of 0.9-0.65 V for SS corner (HVT)). When the supply voltage is reduced

further (e.g., Vdd≤0.6V for HVT), the propagation delay increases rapidly which increases the

overall energy consumption.

The increased local and global variation induced performance and energy fluctuation of

NTC circuits poses a daunting challenge that can force designers to pass over low-voltage design

entirely. In the super-threshold voltage domain, those variation issues are easily addressed by

adding conservative margins. For NTC designs, however, local and global process variations

reduce the performance of chips by up to 10×. Hence, conservative margin approaches are

inefficient for NTC operation due to the wide variation extent.

2.2.3 Functional failure and reliability issues of NTC memory components

The increase in sensitivity to process variation of NTC circuits affects not only the performance

but also functionality. Notably, the mismatch in device strength due to process variation


affects the state of positive feedback loop based storage elements (SRAM cells) [21, 64, 65].

The mismatch in the transistors makes SRAM cells incline toward one state over the other, a

characteristic that leads to hard functional failure or soft timing failure [22, 13]. The variation-

induced functional failure rate of SRAM cells is more pronounced in the nanoscale era as highly

miniaturized devices are used to satisfy the density requirements [66]. SRAM cells

suffer from three main unreliability sources: 1) aging effects, 2) radiation-induced soft error,

and 3) variation-induced functional failures [18]. The SRAM cell susceptibility to these issues

increases with supply voltage downscaling.

A) Aging effects in SRAM cells

Accelerated transistor aging is one of the main reliability concerns in CMOS devices. Among

various mechanisms, Bias Temperature Instability (BTI) is the primary aging mechanism in

nanoscale devices [67]. BTI gradually increases the threshold voltage of a transistor over a

long period, which in turn increases the gate delay [67]. BTI-induced threshold voltage shift

is a strong function of temperature as it has an exponential dependency. Hence, BTI-induced

aging rate is higher at high operating voltage and temperature values. In SRAM cells, BTI

reduces the Static Noise Margin (SNM)1 of an SRAM cell, and makes it more susceptible to

failures. BTI-induced SNM degradation is higher when the cell stores the same value for a

longer period (e.g., storing ‘0’ at node ‘A’ of the SRAM cell shown in Figure 2.5). Hence, the

effect of BTI on an SRAM cell is a strong function of the cell’s Signal Probability (SP)2.

[Schematic: 6T SRAM cell with cross-coupled inverters (P1/N1 and P2/N2), access transistors on the word-line WL, complementary bit-lines BL, and storage nodes VL = 1 and VR = 0.]

Figure 2.5: Schematic diagram of 6T SRAM cell, where WL= word-line, BL=bit-line and RL=read-line.

B) Process variation in SRAM cells

Variation in transistor parameters such as channel length, channel width, and threshold voltage

results in a mismatch in the strength of the transistors in an SRAM cell, and in extreme cases

it makes the cell fail [6]. The variation-induced memory failure rate increases significantly
with supply voltage downscaling; for instance, SRAM cells operating at NTC (0.5V) have a 5× higher failure rate than cells operating at the nominal voltage [6]. Process variation affects

several aspects of SRAM cells, and the main variation-induced SRAM cell failures are:

Read failure: Read failure (or read disturb) is a phenomenon where the stored value is distorted

during read operation. For example, when reading the value of the cell shown in Figure 2.5,

1 SNM is the minimum amount of DC noise that leads to a loss of the stored value.
2 SP is the probability of storing logic ‘1’ in the SRAM cell.


(VL=‘1’ and VR=‘0’), due to the voltage difference between the access transistor NR and

pull-down transistor N2, the voltage at node VR increases [68, 69]. If this voltage is higher

than the trip voltage (Vtrip) of the left inverter, then the stored value of the cell is changed.

Hence, the condition for read failure can be expressed as [70]:
read failure = { 1, if VR > Vtrip; 0, otherwise }

where Vtrip=VP1-VN1 (here VP1 and VN1 indicate the voltages of the PMOS and NMOS

transistors of the left inverter shown in Figure 2.5 where P1 and N1 are the corresponding

PMOS and NMOS transistors of the inverter).

Write failure: Write failure occurs when the cell is not able to write/ change its state

with the applied write voltage. For example, during a write operation (e.g., writing ‘0’ to the

SRAM cell shown in Figure 2.5), the node VL is discharged through the bit-line BL. Write

failure occurs when the node VL is not reduced to be lower than Vtrip of the right inverter

(VR) [70, 69]. In the standard 6T SRAM cell, write failure is a challenging issue as the cell

cannot be optimized without reducing its read margin [70, 69, 68]. However, this is improved

with the help of read/write assist circuitries or differential read/write access as it is done in

the 7T, 8T, and 10T SRAM cell designs [64, 65, 19]. In order to illustrate the write failure

issue, the write margin behaviors of 6T and 8T NTC SRAM cells are studied and compared

in Figure 2.6. As shown in the figure, the 6T SRAM cell has a smaller write margin as it has

longer write latency. On the other hand, the short write latency of the 8T design enables it

to have a relatively larger write margin. The improvement in the write margin is because the

8T cell is optimized to improve the write operation without affecting its read operation, as the

write and read operations are decoupled.

Hold failure: Hold failure, commonly known as a metastability issue, is a reliability issue that
occurs when the SRAM cell is not able to retain the stored value for a long period [22, 70]. This
problem happens during standby mode: if the voltage at node VL or VR is small (i.e., a small
SNM value), the stored value is easily destroyed by a noise voltage from various sources
such as particle strikes and leakage current [22, 70].

[Waveforms: word-line (WL) voltage and storage-node (A, B) voltages of the 6T and 8T cells (0-0.5 V) during a write at 0.5 V, over a time window of roughly 9.8-11 ns.]

Figure 2.6: Write margin (in terms of write latency) comparison of 6T and 8T SRAM cell operating in

near-threshold voltage domain (0.5V).


C) Soft error rate in SRAM cells

In SRAM cells, a soft error is a transient phenomenon that occurs when charged particles pene-
trate the cell’s cross junction, creating an aberrant charge that changes the state of the cell [71].

The primary source of soft errors is related to cosmic ray events such as neutrons and alpha

particles. Atmospheric neutrons are one of the higher flux components, and their reaction

has a high energy transfer. Thus, neutrons are the most likely cosmic radiations to cause soft

errors [72, 18]. Neutrons do not generate electron-hole pairs directly. However, their inter-

action with the Si-atoms generates secondary particles. These secondary particles produce

charges/electron-hole pairs [72]. If the generated charges are larger than the critical charge3

of an SRAM cell, then the internal value of the cell is inverted; this phenomenon is commonly
referred to as a soft error.

The radiation-induced Soft Error Rate (SER) of an SRAM cell increases significantly with a de-
crease in the supply voltage. Previous experiments have shown that the radiation-induced
SER increases by 50% for just a 20% decrease in the supply voltage [73]. Moreover, the SER of

NTC designs is affected by variation and aging-induced SNM degradation.

D) Interdependence and combined effects

Analyzing failures based on a particular reliability failure mechanism is insufficient for esti-

mating the system level reliability as the interdependence among different failure mechanisms

(such as aging, soft error, and process variation) has a considerable impact on the overall

system reliability [74, 18, 22]. Figure 2.7 shows how the interdependence between different

reliability mechanisms (aging, SER, and process variation) affects the overall system reliabil-

ity of memory components in terms of Failure In Time (FIT rate). As shown in the figure,

variation-induced threshold voltage shift increases both aging and SER by reducing the SNM

and critical charge of the cell. Similarly, aging-induced SNM degradation increases the sen-

sitivity of SRAM cell to soft errors. The problem is more pronounced when the SRAM cell

is operating at NTC domain due to the wide variation extent and higher sensitivity to aging

effects [18]. It has been observed that aging causes ≈5% SNM and critical charge degradation at
NTC, while process variation induced SNM degradation reaches as high as 60% [18]. In the
super-threshold voltage domain (1.0V), however, the aging effect increases by 3× to ≈15%,
while the variation effect is reduced significantly.

Moreover, the running workload affects the aging rate and SER of memory components, as

it determines the signal probability and the Architectural Vulnerability Factor (AVF)4 of the

memory elements [18]. Therefore, to overcome these reliability challenges and improve the

overall system reliability, combined analysis of the reliability failure mechanisms at different

levels of abstraction is imperative. Besides, the cross-layer analysis should consider the im-

pact of workload on signal probability as well as architectural vulnerability factor of memory

components, and their circuit-level consequences on critical charge and SNM degradation.

3 The critical charge is the minimum amount of charge required to upset the stored value of an SRAM cell.
4 AVF is the probability that an error in a memory structure propagates to the data path. AVF = vulnerable period / total program execution period.


[Diagram: interdependence of supply voltage scaling, performance, aging, SER, and process variation (PV), and their combined effect of increasing the FIT rate.]

Figure 2.7: Interdependence of reliability failure mechanisms and their impact on the system Failure

In-Time (FIT) rate in NTC.

E) Technology scaling effects on SRAM reliability

Reliability has been an essential issue with the miniaturization of CMOS technology, as differ-

ent design-time and runtime failures are among the limiting factors of technology scaling [75].

At smaller technology nodes, process variation increases the permanent and transient failures

of memory components significantly [6, 76]. Authors in [6] show that SRAM cell failure rate

increases by more than 2× with downscaling from 90nm to 65nm technology node. Similarly,

authors in [77] demonstrated that technology downscaling increases the radiation-induced soft

error rate of SRAM cells significantly.

2.3 Existing techniques to overcome NTC barriers

2.3.1 Solutions addressing performance reduction

As discussed in Section 2.2.1, performance reduction poses a significant challenge

for the widespread applicability of NTC operation. As a result, the low clock frequency

dictates the use of highly-parallel architectures to improve the throughput of NTC designs [78,

41]. Moreover, in order to maximize the energy efficiency of NTC operation, configurable

architectures are essential and must rely on efficient control, such as vector execution for

Single Instruction Multiple Data (SIMD) [79, 80]. Unfortunately, due to process variation,

SIMD architectures with a large number of processing units exacerbate the timing variability

problem, and hence limit their performance improvement [81, 82].

Voltage boosting, increasing the supply voltage to increase the frequency of cores, is another

technique used to improve the performance of NTC operation. Various 2D [14] and 3D [15]

architectures have utilized voltage boosting to improve the performance of their design. The

3D based multi-core NTC design work in [15], for instance, adopts dividing tasks into serial
and parallel portions, and uses a clustered boosting architecture to increase the supply voltage of
one core by disabling the remaining cores in the cluster in order to accelerate the execution of
the serial portion of a program. For real applications, however, the process of dividing a task into

serial and parallel portions incurs parallelization overhead [42, 83]. Moreover, as shown by

Amdahl’s law [42] the speedup achieved by task division is limited by the sequential portion.


Therefore, the performance does not improve with the increase in parallelization as the serial

portion eventually dominates.
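As a brief worked example of this limit (with an assumed 10% serial fraction): Amdahl's law bounds the speedup by 1/(0.1 + 0.9/N), which is only about 6.4× for N = 16 boosted-parallel cores and can never exceed 10× no matter how many NTC cores are added.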

Increasing the number of pipeline stages is another alternative for power-performance trade-

off in NTC [24]. In deeply pipelined architectures, the amount of logic and latches in

a single stage is rather minimal, and hence the stages have shorter Fan-out of four (FO4)

delay [84], where FO4 delay is the delay of one inverter driving four equal-sized inverters.

Thus, deep-pipelined NTC designs can potentially run at higher clock frequency without a

significant increase on the energy consumption [85, 24, 82]. However, deep-pipelined NTC

designs face two main challenges. The first challenge is the reduction in throughput due to data
dependency and branch misprediction induced pipeline flushing and stalling. The second

challenge is that deeper pipeline stages have higher sensitivity to variation effects. Since the

pipeline stages have shorter logic depth, the impact of variation on their delay is higher [53].

In shallow pipeline design, however, the stages have longer logic depth, and hence, they exhibit

significantly reduced variability due to the averaging effect.

2.3.2 Solutions addressing variability

Higher sensitivity to global and local process variation poses a formidable challenge for NTC

designs. Therefore, design time and runtime solutions addressing variability issues are crucial

for a highly energy-efficient NTC operation. This subsection assesses the existing techniques

for dealing with variability effect at NTC and their shortcomings. The existing solutions are

broadly classified into three main groups. These groups are 1) Timing error detection and

correction techniques, 2) Time-borrowing techniques, and 3) Device optimization and tuning

techniques.

A) Timing error detection and correction techniques

Timing error detection and correction approaches aim at reducing the excessive guard band

for NTC, or enabling better than worst-case design by using additional circuitries to detect

timing errors in a circuit [86, 87, 88, 89]. These classes of techniques are typically reactive runtime
approaches, as they wait for the timing error to happen and roll back the computation [55].

Architectural level shadow flip-flop based techniques have been proposed to detect and

correct timing errors at runtime, commonly known as Razor flip-flops [87, 88, 89]. These Razor

based techniques employ shadow flip-flops with a delayed clock to speculatively operate tasks

without adding conservative timing margins as shown in Figure 2.8(a). When an error is

detected, i.e., when the outputs of the main and shadow flip-flops are different, then it stalls

the operation and reloads the correct value from the shadow flip-flop into the main flip-flop.

However, these Razor-based techniques do not scale well since the global signal has to be
propagated across the entire circuit in a single clock cycle. Moreover, the variation extent
at NTC results in more timing errors, forcing the use of shadow flip-flops for almost all paths,
which adds significant area and power overheads. Bubble Razor [90, 86] partially addresses
the problem by utilizing a two-phase latch as opposed to the shadow flip-flop. When an
error is detected, it propagates bubbles to the neighboring latches to clock-gate the erroneous
signals and stop them from propagating. Bubble Razor implements a local stalling scheme by using one extra
cycle for the correct data to arrive. Although Bubble Razor offers 1-cycle error correction, its


[Schematics: (a) Razor flip-flop [88] with a main FF and a shadow FF clocked by a delayed clock, comparing outputs to flag an error; (b) soft-edge clocking [91] with a transparency window t between the master (CLKM) and slave (CLKS) clocks; (c) adaptive body biasing [92] applied to the NMOS and PMOS bodies between Vdd and GND.]

Figure 2.8: Variation-induced timing error detection and correction techniques for combinational circuits: (a) Razor flip-flop based timing error detection and correction, (b) soft-edge clocking for time borrowing, and (c) adaptive body biasing to adjust circuit timing.

effectiveness is limited at NTC due to the wide variation extent.

B) Time-borrowing techniques

Another potential approach to address variation-induced timing error at NTC is to use clock

stretching and time-borrowing techniques [91, 93, 94]. Time-borrowing techniques correct the

timing errors by using Soft-edge Flip-Flop (SFF), which has a small transparency window, or

softness [91, 93, 95]. In these approaches, a tunable inverter chain is used in a master-slave

flip-flop to delay the master clock edge with respect to the slave edge in order to create the

transparency window (t) as shown in Figure 2.8(b). The transparency window is used to mask

timing errors in the logic paths that were too slow for the normal clock period [91, 93]. The

transparency window allows time borrowing within an edge-triggered flip-flop. Hence, time-

borrowing (soft-edge) flip-flop balances the delay of shorter and longer paths, and is effective

at mitigating random delay variation in the super-threshold voltage domain. However, for

NTC operation, it requires a very conservative time-borrowing margin which forces designers

to stretch the clock cycle significantly, and eventually introduce high performance and energy

overheads. Authors in [94] presented a more systematic time-borrowing method that uses a

special flip-flop (with a time-borrowing detection) and a clock shifter. The time-borrowing

flip-flop uses clock shifter circuits to allow time borrowing on the critical paths, and generates
the time-borrowing signal for the clock shifter to stretch the clock period dynamically. It pays
back the borrowed time in the next clock cycle; therefore, no error recovery is needed, and it


has less performance overhead as the clock is only stretched when timing errors are detected.

However, due to the wide variation extent at NTC, the clock needs to be stretched significantly

which increases the complexity and overheads.

C) Device optimization and tuning techniques

Device tuning and optimization approaches change the electrical characteristics (e.g., power

and delay) of a circuit by dynamically tuning CMOS transistor parameters such as threshold

voltage [92, 96]. In this regard, Adaptive Body Biasing (ABB) is the most widely used runtime

approach to carefully tune the threshold voltage of a transistor [97]. ABB carefully selects an

appropriate positive or negative bias voltage, and applies it to the body of a transistor to tune
the threshold voltage of the transistor [92, 96], as shown in Figure 2.8(c). The applied bias
voltage can be negative (Forward Body Bias (FBB)), which reduces the threshold voltage (Vth), or
positive (Reverse Body Bias (RBB)), which increases the Vth [92, 96]. Decreasing the Vth improves

the performance (lower delay) at the expense of additional leakage power while increasing the

Vth reduces both performance and leakage power. Therefore, slow circuit blocks are forward

biased, whereas leaky circuit blocks are reverse biased. These approaches are dynamic in

nature, and enable post-fabrication tuning of circuit parameters.
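As a rough illustration of this policy, the following Python sketch selects a bias direction per circuit block from its timing slack and leakage; the thresholds and function name are hypothetical and only meant to show the forward-bias-slow / reverse-bias-leaky decision described above.

    # Hypothetical ABB policy sketch: forward-bias slow blocks, reverse-bias leaky blocks.
    def select_body_bias(slack_ns: float, leakage_uw: float,
                         leak_limit_uw: float = 50.0) -> str:
        if slack_ns < 0.0:              # block misses timing -> lower Vth to speed it up
            return "FBB"
        if leakage_uw > leak_limit_uw:  # block burns too much static power -> raise Vth
            return "RBB"
        return "ZBB"                    # zero body bias: leave the block untouched

    print(select_body_bias(slack_ns=-0.3, leakage_uw=20.0))   # -> FBB
    print(select_body_bias(slack_ns=0.2, leakage_uw=80.0))    # -> RBB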

Circuit optimization is another method that utilizes design-time optimizations to reduce

the timing error of circuits [98]. It is effective in voltage downscaling based approaches where

there is a wall of equally critical paths that are subjected to timing errors [98]. To address

this issue, the work in [81] focuses on timing variation-aware circuit optimizations such as gate

sizing (W/L ratio) and use of multiple Vth cells to balance the delay distribution of the paths.

These device optimization and tuning techniques are orthogonal to the solutions proposed in

this thesis, and can be applied together for further energy efficiency improvement.

2.3.3 Solutions addressing memory failures

With the increase in reliability challenges, various researchers have focused on developing

mitigation schemes to address reliability and process variation issues of memory components

independently [74]. In the super-threshold voltage regime, several works are available in the

literature to address these reliability issues such as [67, 72, 99, 100, 101, 102, 103, 104], to

name a few. In NTC, however, most of the existing techniques address performance loss and

variability of logic components by using design-time solutions such as multiple supply and

threshold voltage assignment [23]. Existing techniques to address memory failure at NTC

are classified into four main categories: alternative bit-cell design, heterogeneous cache design,

strong error correction, and cache capacity reduction based redundancy solutions.

A) Alternative SRAM bit-cell topologies for NTC operation

The conventional 6T SRAM cell design shown in Figure 2.5 has been commonly used as the

basic unit of memory arrays operating in the super-threshold voltage domain [22, 18]. Due

to higher sensitivity to variation effects at NTC, however, the standard 6T SRAM design

faces several challenges. Operating below nominal voltage reduces the write, read, and hold

margins of the 6T SRAM cell, and eventually leads to functional failures. Among these, read


[Schematics: (a) 7T SRAM cell design [64], (b) 8T SRAM cell design [106], and (c) 10T SRAM cell design [19], each built around a cross-coupled 6T-style core (P1/N1, P2/N2) with additional access/read transistors and separate read lines (RL/RWL/RBL).]

Figure 2.9: Alternative bit-cell designs to improve the read disturb, stability, and yield of SRAM cells in the NTC domain: (a) differential 7-transistor (7T) bit-cell design, (b) read/write decoupled 8-transistor (8T) bit-cell design, and (c) robust and read/write decoupled 10-transistor (10T) bit-cell design.

stability (or read upset), which cannot be solved without significant transistor up-sizing, is

the fundamental problem of the 6T SRAM cell at NTC [105]. Various alternative SRAM cell

designs have been proposed in order to address the reliability issues and improve the robustness

of SRAM cells operating in the near-threshold voltage domain [4, 64, 106, 19]. A 7-transistor

(7T) SRAM cell design that utilizes differential read access is proposed in [64] to improve the

read stability as shown in Figure 2.9(a). Authors in [107, 65, 108, 106] proposed an 8T SRAM

design (as shown in Figure 2.9(b)) to decouple the read and write accesses. The read/write

decoupling allows separate sizing and optimization of the 8T SRAM cell to improve its read

and write margins without affecting one another. A 10T SRAM cell design (as shown in

Figure 2.9(c)) has been proposed in [19] to improve the read and write margins of near/sub-

threshold SRAM cells by using separate read and write word-lines. However, these alternative

designs have a negative impact on memory array density as they have two or more additional

transistors over the standard 6T design.

B) Heterogeneous cache design

Various researchers proposed the use of mixed SRAM cells (i.e., conventional (e.g., 6T) and 8T

or 10T) for designing reliable and high-performance caches [109, 110]. This scheme makes it possible
to combine both designs (e.g., 6T and 8T SRAM cell designs), and harness their benefits while

suppressing their shortcomings. Authors in [109] presented a heterogeneous cache designed


using both 6T and 10T cells. They show how an 8-way cache is designed using 6T cells

only, or with six 6T and two 10T ways in order to make different area-reliability trade-offs at

NTC.

Similarly, a low-voltage Last Level Cache (LLC) architecture is proposed in [110]. The low-

voltage LLC exploits DVFS and workload characteristics of various applications to achieve

better performance and energy efficiency trade-offs. When the workload has higher cache

activity, the processor spends significant time in high frequency/voltage mode. In this scenario,

the LLC portions with smaller cell size are activated. At low voltages, however, only larger

cells (cells with up-sized transistors) are used to achieve low failure rates. This approach

achieves better performance and energy efficiency trade-off as the performance penalty of

having reduced LLC capacity is small when the processor runs at a lower frequency. Although

these heterogeneous design techniques mitigate design-time variability effects, they fail to

address runtime reliability issues such as aging and soft-errors which have significant impact.

C) Error Correction Codes (ECC)

Error detection and correction codes are widely used at the system level to prevent the propaga-

tion of erroneous data [111]. Simple ECC schemes such as single error correction are sufficient

to protect caches operating in the super-threshold voltage domain [112]. Due to the

wide variation extent at NTC, cache memories require more robust ECC schemes to detect and

correct multiple bit errors [113]. However, such sophisticated ECC schemes incur significant

area and power overheads which potentially nullify the energy gains of NTC operation.

Authors in [113] proposed a turbo product codes based Forward Error Correction (FEC)
technique to enable low-voltage operation of caches. In their approach, the cache trades off
some cache capacity to store the error correction information of the adopted ECC scheme. A

technique to mitigate the overhead of robust ECC schemes for enabling reliable low-voltage

operation is presented in [114]. The authors use a fast mechanism to predict ECC information,

and the robust error correction scheme is employed in parallel to verify the correctness of the

predicted value. Then, the predicted ECC value is fed to the subsequent stages. When the

predicted value is the same as the output of the strong error correction, the prediction
hides the latency of the strong error correction. When the value is mispredicted, instructions
are flushed and restarted using the corrected value. Thus, in their approach, misprediction

imposes an additional delay and energy overhead which affects the energy efficiency of NTC

caches significantly.

D) Cache capacity reduction based redundancy approaches

A generic and relatively low-cost solution to cache failures is to use column/row redun-
dancy [115]. Although redundancy is a low-cost solution to address memory failures, it is not
adequate for NTC caches, as the memory failure rate is higher and multiple simultaneously failing rows and
columns cannot be effectively handled with low-cost redundancy [13]. Similarly,

cache capacity reduction is another technique used to address memory failures in NTC by

disabling the faulty cache lines or blocks. The techniques presented in [116, 66] disable the

faulty cache blocks, and map their access to the fault-free blocks. These techniques address

permanent failures as the failing portions are disabled. However, they cannot address runtime


failures such as soft errors which are high at NTC.

Authors in [117] proposed two techniques to enable ultra-low voltage cache operation. The

first technique, referred to as word-disabling, disables several words that have one or more

failing cells. Then, it combines the non-failing words in two consecutive ways to form a logical

cache line, and the position of failing/non-failing words is stored in the tag. Their second

approach, called bit-fix, uses some portion of the cache ways to store both the location and

correct value of defective bits in other cache ways. The limitation of these techniques is that

the cache size is reduced significantly. Additionally, these techniques do not protect the

cache from runtime failures.

2.4 Emerging technologies and computing paradigms for extreme

energy efficiency

Despite the recent advances in semiconductor technology and the development of energy-

efficient computing techniques, the overall energy consumption is still increasing significantly.

To further reduce the energy consumption of different circuits, designers are looking for emerg-

ing technologies and computing paradigms. Thus, non-volatile memory technologies and ap-

proximate computing have emerged as viable alternatives for energy-efficient design [118, 27,

119, 120].

2.4.1 Non-volatile processor design

Leakage power, the power consumed statically when devices are not operating, has become

one of the most pressing design issues [119]. Until recently, it was considered a second-order effect;
however, nowadays it is starting to dominate the total power consumption of modern System-on-

Chips (SoCs) [118]. Therefore, leakage power reduction is extremely important, especially

for the design of battery-powered hand-held devices. Non-volatile on-chip storage technologies

play a vital role in dealing with this issue by enabling normally-off computing capabilities.

In this regard, several non-volatile processor designs have been proposed using various non-

volatile memory technologies such as Flash, Ferroelectric Random Access Memory (FRAM),

Resistive Random Access Memory (RRAM), Phase-Change Random Access Memory (PCRAM)

and spintronic technologies [121, 122, 119, 123]. Among these technologies, spintronic tech-

nology, in particular, Spin-Orbit Torque (SOT), is the most promising candidate as it has

the edge over other non-volatile technologies in terms of fast accesses, high endurance, and

better scalability [124]. In addition to that, this technology has various other advantageous

features, such as high density, CMOS compatibility, and immunity to radiation-induced soft

errors [124].

2.4.2 Exploiting approximate computing for NTC

As computer systems become pervasive, their interaction with the physical world and data

processing requirements are increasing significantly. Consequently, a large number of applica-

tions such as recognition, mining, and synthesis (RMS) applications, have emerged in a broad

computing platform spectrum [120]. Fortunately, such applications are inherently error re-


silient. The inherent error tolerance nature of such applications motivates designers to exploit

approximate computing to improve the energy efficiency of their designs [120]. Approximate

computing deliberately allows “acceptable errors” of the computing process in order to achieve

significant improvement in the energy efficiency [27]. This approach is discussed in detail in

Chapter 5 of this thesis.

2.5 Summary

Aggressive supply voltage downscaling to the near-threshold voltage regime (NTC) is a promis-

ing approach to reduce the energy consumption of circuits. However, NTC comes with its own

set of challenges, most importantly performance reduction, variation effect, and higher func-

tional failure rate. These challenges are the main bottlenecks for the widespread applicability

of NTC. Although various techniques have been proposed to address these issues and improve

the energy efficiency, holistic cross-layer analysis and solutions addressing both logic and mem-

ory components are of decisive importance. This thesis provides various cross-layer solutions

to mitigate the impact of process variation and runtime failures in combinational logic and

memory components operating in the near-threshold voltage domain. The solutions provided

in this thesis target both cache unit and pipeline stages of pipelined processor architecture to

enable resilient and energy-efficient operation of NTC microprocessors.


3 Reliable Cache Design for NTC Operation

Near-threshold computing plays a vital role in reducing the energy consumption of modern

VLSI circuits. However, NTC designs suffer from high functional failure rate of memory

components. Understanding the characteristics of the functional failures and variability effects

is of decisive importance in order to mitigate their effects, and get the full NTC benefits. This

chapter presents a comprehensive cross-layer reliability analysis framework to assess the impact

of soft error, aging, and variation-induced failures on NTC caches. The goal of this chapter

is first to quantify the reliability of cache memories designed using different SRAM cells and

evaluate the voltage downscaling potential of caches. Then, a reliability and performance trade-

off is performed to determine the optimal cache organization for NTC operation. Moreover,

the chapter presents a proper mitigation scheme to enable reliable operation of NTC caches.

3.1 Introduction

SRAM based memory elements have been the prominent limiting factor in the near-threshold

voltage domain, as the supply voltage of SRAM cells does not scale down as easily as that of
combinational logic. The supply voltage downscaling limitation is due to the significant

increase in the failure rate of SRAM cells operating at lower supply voltage values, which in

turn severely affects the yield. Various state-of-the-art solutions addressing this issue have

been discussed in Chapter 2. These solutions, such as variation tolerant SRAM cell design [4,

64, 106] and heterogeneous cache design [109], improve the robustness of cache memories.

However, the improvement comes at the cost of increased area and power overheads. Moreover,

these approaches mostly ignore the impact of runtime failure mechanisms, such as aging and

soft error, on the reliability of memory components. Therefore, design-time reliability failure

analysis and mitigation schemes are crucial for the reliable operation of near-threshold caches.

Analyzing failures based on a particular reliability failure mechanism is insufficient for esti-

mating the system-level reliability, as the interdependence among different failure mechanisms

has a considerable impact on the overall system reliability. Moreover, the running workload

affects the aging and SER of memory components as it determines the SP and AVF of the

memory elements. Therefore, performing a combined analysis on the reliability failure mech-

anisms across different layers of abstraction (as shown in Figure 3.1) is crucial, and it helps

designers to choose the most reliable components at each abstraction layer, and tackle the

reliability challenges of NTC operation.

For this purpose, a comprehensive cross-layer reliability analysis framework addressing the

combined effect of aging, process variation, and soft error on the reliability of NTC cache

designs is presented in this chapter. Moreover, the chapter presents the advantages and limi-

tations of two different NTC SRAM cell designs (namely, 6T and 8T cells) in terms of reliability

(SER and SNM) improvement, area, and energy overheads. The framework presented in this


[Diagram: abstraction layers (system, architecture, circuit, device) linking the memory system (cache organization, SRAM cell, PV) and workload to AVF, SNM, critical charge, SER, and the resulting FIT rate.]

Figure 3.1: Cross-layer impact of memory system and workload application on system-level reliability

(Failure-In-Time (FIT rate)) of NTC memory components, and their interdependence.

chapter helps to explore the cross-layer impact of different reliability failure mechanisms, and

it is useful to study the combined effect of workload and cache organization on the SER and

SNM of cache memories. The framework is also helpful to understand how the reliability

issues change from super-threshold to the near-threshold voltage domain. Furthermore, it is

important for architectural-level design space exploration to find the best cache organization

for better reliability and performance trade-offs of NTC caches. Based on the comprehensive

analysis using the framework, a memory failure mitigation scheme is developed to improve the

energy efficiency of NTC caches.

3.2 Cross-layer reliability analysis framework for NTC caches

The comprehensive cross-layer reliability estimation framework that abstracts the impact of

workload, cache organization, and reliability failure mechanisms at different levels of abstrac-

tion is illustrated in Figure 3.2. The reliability analysis and simulation conducted in this work

use the symmetric six-transistor (6T) and 8T SRAM cells shown in Figures 2.5 and 2.9(b). In

this work, the device-level critical charge characterization is modeled according to the analyt-

ical model presented in [71].

This section presents the cross-layer reliability estimation framework in a top-down manner.

The system-level Failure In-Time (FIT) rate and SNM extraction are described in Section 3.2.1

followed by the cross-layer SNM and SER estimation in Section 3.2.2.

3.2.1 System FIT rate extraction

The system-level FIT rate of a cache memory is the sum of the FIT rate of each row (cache

line). The row FIT rate is calculated as the product of the row-wise SER (extracted based

on the circuit-level SER information) and its Architectural Vulnerability Factor (AVF). Cache

AVF is a metrics used to determine the probability that an error in a cache memory propagates

to the datapath, and results in a visible error in a program’s final output [125]. Equation (3.1)


[Diagram: framework flow. At the system level (GEM5), the workload and processor model produce trace data, which, together with the cache organization, drive the AVF analysis and Signal Probability (SP) calculation. At the device level, the SRAM and BTI models provide SNM and Vth (µ and σ) in the presence of aging and process variation, and the critical charge is estimated as a function of Vth and Vdd. At the circuit level, the critical charge and flux distribution feed the SER estimation; the resulting SER distribution, AVF, and SP values feed the system FIT rate and SNM calculation.]

Figure 3.2: Holistic cross-layer reliability estimation framework to analyze the impact of aging and

process variation effects on soft error rate.

shows the system-level FIT rate calculation of cache memories.

FITsystem = ∑_{i=0}^{N−1} AVFi × SERi (3.1)

where N is the total number of rows in the cache.
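As a minimal sketch of this calculation (with assumed per-row numbers), the system FIT rate of Equation (3.1) can be computed as:

    # Sketch of Equation (3.1): FIT_system = sum_i AVF_i * SER_i over all N cache rows.
    def system_fit(avf_per_row, ser_per_row):
        assert len(avf_per_row) == len(ser_per_row)
        return sum(avf * ser for avf, ser in zip(avf_per_row, ser_per_row))

    # Assumed example values for four cache lines (SER in FIT per row).
    print(system_fit([0.20, 0.05, 0.60, 0.10], [1.2e-3, 1.0e-3, 1.5e-3, 0.9e-3]))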

A) Architecture-level AVF analysis

One step in determining the failure rate of a memory (cache) due to soft errors is to determine
the AVF value of the memory. The AVF of a memory array is measured by the ratio of vulnerable
periods (time intervals in which the memory content is exposed to a particle strike) to the total

program execution period, and the probability of the erroneous value being propagated [125].

Hence, the vulnerability factor of a memory array is computed based on the liveness analysis

commonly known as Architectural Correct Execution (ACE) analysis which is the ratio of

ACE (vulnerable) cycles to the total number of operational cycles [126]. Therefore, the AVF

value of a memory array with M cells is computed as shown in Equation (3.2).

AVFarray = (∑_{i=0}^{M−1} ACEi) / (T × M) (3.2)

where T is the total number of cycles.
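A minimal sketch of Equation (3.2), assuming the per-cell ACE cycle counts are already available from the trace analysis:

    # Sketch of Equation (3.2): AVF_array = (sum_i ACE_i) / (T * M).
    def array_avf(ace_cycles_per_cell, total_cycles):
        m = len(ace_cycles_per_cell)
        return sum(ace_cycles_per_cell) / (total_cycles * m)

    # Assumed example: 4 cells over a 1000-cycle workload window.
    print(array_avf([300, 0, 950, 120], total_cycles=1000))   # -> 0.3425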

B) Architecture-level SNM analysis

Aging-induced SNM degradation of an SRAM cell strongly depends on the Signal Probability

(SP) of the cell. Thus, BTI-induced SNM degradation is minimized when the signal probability

of the cell is balanced (close to 0.5) [67]. In order to determine the aging-induced SNM

degradation, the worst case SP of the memory row is obtained as the maximum SP distance


from 0.5 (D = |SP − 0.5|) as shown in Equation (3.3). Then, the worst-case SP is used by the

SNM estimation tool given in Figure 3.2 to determine the corresponding aging-induced SNM

degradation.

SPworst-case = MAX_{i=1..Z} Di (3.3)

where Di = |SPi − 0.5| and Z is the total number of cells in the memory row.
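A minimal sketch of Equation (3.3), assuming the per-cell signal probabilities of a row are given:

    # Sketch of Equation (3.3): worst-case SP distance of a memory row.
    def worst_case_sp_distance(signal_probabilities):
        return max(abs(sp - 0.5) for sp in signal_probabilities)

    # Assumed example: the cell with SP = 0.05 dominates (distance 0.45).
    print(worst_case_sp_distance([0.48, 0.70, 0.55, 0.05]))   # -> 0.45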

In order to extract the AVF and SNM of a cache unit, first, it is necessary to extract the trace

of the data stored in the cache, read-write accesses, and the duration (number of cycles) of the

running workload. Once the information is available, the reliability analysis tool uses it along

with the cache organization to determine the AVF and SP of the cache memory according to

Equations (3.2) and (3.3), and generates the SNM LUT for different signal probability values.

The cache organization (size and associativity) has significant impact on the SER and SNM

of the cache, as it determines the hit ratio and the duration data is stored in a cache entry.

Hence, different cache size and associativity combinations result in different SER and SNM

values for the same workload application. Additionally, SER and SNM are highly dependent

on the running workload. In order to explore the impact of cache organization and workload,

various organizations and workload applications are investigated.

3.2.2 Cross-layer SNM and SER estimation

A) SNM degradation estimation

Device-level aging analysis

BTI-induced aging degrades the carrier mobility of CMOS transistors, and leads to transistor

threshold voltage (Vth) shift. In an SRAM cell, the Vth shift reduces the noise tolerance margin

of the cell, and makes it more susceptible to failures. In the reliability analysis framework, the

BTI-induced threshold voltage shift of the transistors in an SRAM cell is evaluated at device-

level using a Reaction-Diffusion (RD) model [127]. Then, the device-level Vth shift results are

used to estimate the corresponding SNM degradation of an SRAM cell at the circuit-level.

Circuit-level SNM estimation

The SNM of an SRAM cell is extracted by conducting a circuit-level SPICE simulation. The

SPICE simulation uses device-level aging and architecture-level SP results to determine the

SNM of the SRAM cell. Finally, the SNM degradation of a particular SP value is obtained

according to Equation (3.4).

DEGSP = (SNMSP − SNMfresh) / SNMfresh × 100% (3.4)

where SNMSP is the SNM of the SRAM cell for a particular signal probability value and

SNMfresh is the SNM of a fresh (new) SRAM cell.
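A minimal sketch of Equation (3.4), using assumed SNM values (in mV) in place of the SPICE results:

    # Sketch of Equation (3.4): relative SNM degradation for a given signal probability.
    def snm_degradation(snm_sp_mv, snm_fresh_mv):
        return (snm_sp_mv - snm_fresh_mv) / snm_fresh_mv * 100.0

    # Assumed example values: the aged cell loses about 15.5% of its fresh SNM.
    print(snm_degradation(snm_sp_mv=142.0, snm_fresh_mv=168.0))   # -> -15.48 (%)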

Aging and process variation induced SNM degradation analysis

BTI-induced SNM degradation of an SRAM cell depends not only on the cell signal probability,

but also on process parameters, such as channel length and oxide thickness, which are highly


affected by manufacturing variabilities. Due to the lower operating temperature at NTC, aging alone has a relatively small impact on the SNM degradation of near-threshold voltage SRAM cells.

However, in combination with variation-induced threshold voltage shift, aging degrades the

SNM of SRAM cells significantly.

Figure 3.3 shows the worst case aging (SP=0.0) and variation-induced SNM degradation of 6T and 8T SRAM cells after three years of operation over a wide supply voltage range. The

obtained SNM degradation confirms the analytical expectation as the SNM degradation in

NTC is 2.5× higher than the degradation in the super-threshold voltage domain (as shown

by the gray boxes). While the use of 8T instead of 6T SRAM cells in the super-threshold voltage domain yields only a limited improvement in SNM degradation (7.7%), it achieves more than 14%

reduction in the SNM degradation in the near-threshold voltage domain.

B) SER estimation

The SER of an SRAM cell depends on two main factors: the critical charge of the cell and the flux rate of the particle strike. To determine the SRAM cell SER, first, the critical charge of an SRAM

cell is obtained from a circuit-level model. Then, the SER value is calculated by combining

the critical charge, flux distribution, and the area sensitive to strike.

Device-level critical charge characterization

The sensitivity of an SRAM cell to radiation-induced soft errors is determined by the critical

charge (Qcritical) of the cell, as it determines the minimum amount of charge required to

alter the state of the cell. The Qcritical of an SRAM cell depends on several factors such as

supply voltage, threshold voltage, and strength of the transistors of the SRAM cell [128]. The

critical charge of an SRAM cell is computed using analytical models or circuit simulators. An

analytical model developed in [71] is used to determine the Qcritical.

Figure 3.3: SNM degradation in the presence of process variation and aging after 3 years of operation; aging+PV-induced SNM degradation at NTC is 2.5× higher than in the super-threshold domain (SNM degradation vs. Vdd for 6T and 8T cells).


As shown in Figure 3.2, the SPICE model of an SRAM cell along with the BTI model

is employed to evaluate the impact of BTI on the threshold voltage (Vth) of the transistors

of an SRAM cell. The BTI analysis uses the SP values of the memory array from higher

(architecture-level) analysis to determine the BTI-induced Vth shift of the running workload.

In this way, the aging effect of the workload is incorporated into the framework. Once the

fresh and aged Vth values are available, the impact of process variation is incorporated as a normal distribution (µ ± 3σ) of the transistor threshold voltage, where µ is the mean Vth value and σ is the standard deviation, obtained using an industry-standard, measurement-based model (the “Pelgrom model”) given in Equation (3.5) [17]. Finally, all these parameters

are used by the model given in [71] to extract the Qcritical.

\sigma_{\Delta V_{th}} = \frac{A_{VT}}{\sqrt{L \times W}}        (3.5)

where L and W are the length and width of the transistor, and A_{VT} is a process-specific parameter (the “Pelgrom coefficient”).
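To make the variation modeling step concrete, the following Python sketch evaluates the Pelgrom mismatch sigma from Equation (3.5) and draws Monte Carlo Vth samples from the normal distribution described above; the numeric values (A_VT, transistor dimensions, nominal Vth) are placeholders, not the parameters used in this work.

import random

def pelgrom_sigma(a_vt, length, width):
    """Equation (3.5): sigma of the Vth mismatch for a transistor of size L x W."""
    return a_vt / (length * width) ** 0.5

def sample_vth(mu_vth, sigma, n_samples, seed=0):
    """Draw Vth samples from the normal distribution N(mu, sigma) used for process variation."""
    rng = random.Random(seed)
    return [rng.gauss(mu_vth, sigma) for _ in range(n_samples)]

if __name__ == "__main__":
    sigma = pelgrom_sigma(a_vt=3.5e-9, length=45e-9, width=90e-9)  # placeholder A_VT in V*m
    samples = sample_vth(mu_vth=0.30, sigma=sigma, n_samples=10000)
    print(f"sigma = {sigma * 1000:.1f} mV, 3-sigma corner = {0.30 + 3 * sigma:.3f} V")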

Circuit-level SER analysis

The circuit-level SER analysis is conducted using the SER extraction module of the framework

given in Figure 3.2. First, the critical charge of the SRAM cell is extracted using the device-

level model [71]. Afterward, the critical charge along with the neutron-induced flux distribution

is used to determine the SER of the cell using an experimentally verified empirical model given

in Equation (3.6) [129]. As shown in Equation (3.6), the SER of an SRAM cell has an inverse

exponential relation with its critical charge (Qcritical). Hence, the higher the Qcritical, the lower

the SER will be.

SER \propto F \cdot A \cdot e^{-Q_{critical}/Q_S}        (3.6)

where F is the neutron flux in particles/cm²·s with energy higher than 1 MeV [130], A is the area sensitive to a strike in cm², and Q_S is the charge collection efficiency.

The main observations from Equation (3.6) are:

• The SER of an SRAM cell has an inverse exponential relation to its critical charge.

Hence, a small decrease in the Qcritical leads to an exponential increase in the cell SER.

• For the same atmospheric neutron flux, a small drift in Qcritical leads to a significant increase in the SER. Furthermore, transistor up-sizing increases the area sensitive to particle strikes and hence the SER.
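A small numerical sketch of Equation (3.6) illustrates these observations: halving the critical charge multiplies the SER by a large factor, while doubling the sensitive area only doubles it. The flux, area, and charge-collection values below are arbitrary placeholders, not measured parameters.

import math

def relative_ser(flux, area, q_critical, q_s):
    """SER up to a constant factor: F * A * exp(-Qcritical / Qs), per Equation (3.6)."""
    return flux * area * math.exp(-q_critical / q_s)

if __name__ == "__main__":
    flux, area, q_s = 56.5, 1e-8, 2.0  # placeholder flux, sensitive area, collection efficiency
    nominal = relative_ser(flux, area, q_critical=10.0, q_s=q_s)
    reduced_q = relative_ser(flux, area, q_critical=5.0, q_s=q_s)     # e.g. after Vdd downscaling
    upsized = relative_ser(flux, 2 * area, q_critical=10.0, q_s=q_s)  # doubled sensitive area
    print(f"halving Qcritical -> SER x {reduced_q / nominal:.1f}")
    print(f"doubling the area -> SER x {upsized / nominal:.1f}")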

SER of 6T and 8T SRAM cells

In the conventional 6T SRAM cell, the cell must maintain the stored value and remain stable during read/write accesses. Maintaining SRAM cell stability is challenging when the cell operates in the near-threshold voltage domain, as the cell mainly suffers from read-disturb.

To address this issue, either a read-write assist circuitry should be employed, or the pull-down

(NMOS) transistors of the SRAM cell should be strengthened by transistor up-sizing [53].

However, the up-sizing also increases the area of the cell that is sensitive to soft errors. Since

the read-disturb of the 6T SRAM cell is worst when it operates at lower voltage values,


Figure 3.4: SER of fresh and aged 6T and 8T SRAM cells for various Vdd values.

transistor up-sizing cannot adequately mitigate the read disturb issue which makes the 6T

design less desirable for near-threshold voltage operation.

This issue is addressed by using alternative SRAM cell designs (such as 8T [107] and 10T [19] SRAM cells). For example, the read failure issue is solved in the 8T design by decoupling the read and write lines using two additional NMOS access transistors. The decoupling makes it possible to downsize the pull-down NMOS transistors and reduce the area sensitive to soft errors.

Therefore, alternative SRAM designs (e.g., 8T) are recommended for NTC operation, which is

verified by studying the reliability and energy efficiency improvement of the 8T SRAM design

over the conventional 6T design. The transistor sizing specified in [107] is used for the design

of the 6T and 8T SRAM cells used in this study.

Figure 3.4 shows the fresh and aged SER of the 6T and 8T SRAM designs for different

supply voltage values. In the super-threshold voltage domain (0.9V-1.1V), the 6T and 8T designs have negligible differences in their SER. In NTC, however, the 6T design has a higher SER than the 8T design due to transistor up-sizing, which increases the area sensitive to radiation. The combined effect of aging and process variation on 6T and 8T SRAM cells is shown in Figure 3.5: the variation effect has a severe impact at NTC, as the SER of the 6T and 8T SRAM cell designs in the near-threshold voltage domain is 4× higher than their SER in the super-threshold voltage domain.

Figure 3.5: SER of 6T and 8T SRAM cells in the presence of process variation and aging effects after 3 years of operation.

3.2.3 Experimental evaluation and trade-off analysis

A) Experimental setup

The reliability analysis is conducted using an ALPHA implementation of an embedded in-

order core on the Gem5 architectural simulator [131]. Since cache memories are the main

focus, various cache sizes (4 KByte-16 KByte) and a wide associativity range, from simple direct-mapped to 4-way set-associative caches, are assessed to perform a reliability and performance trade-off analysis. The evaluation is conducted using several workload applications from the

SPEC2000 CPU benchmark suite [132]. The workload applications were executed for 5 million

cycles by fast-forwarding to the memory intensive phases. The experimental setup used in this

work is presented in Table 3.1.

The BTI-induced Vth shift is extracted by assuming 10% BTI-induced aging after three years of operation [133]. First, the 45nm 6T and 8T SRAM cells are modeled using the PTM model. Afterward, the BTI-induced Vth shift LUT and the corresponding SNM degradation for various SP values (0.0-1.0) are obtained using SPICE simulation. The impact of process variation is considered as a normal distribution of the transistor threshold voltage with mean µ equal to the nominal Vth (300mV) and standard deviation σ obtained using the Pelgrom model given in Equation (3.5).

Table 3.1: Experimental setup, configuration and evaluated benchmark applications

Simulation environment: Gem5

Core configuration      | Near-threshold        | Super-threshold
Processor model         | Embedded              | Embedded
Architecture            | Single in-order core  | Single in-order core
ISA                     | ALPHA                 | ALPHA
Supply voltage          | 0.5V                  | 1.1V
Frequency               | 100MHz                | 1GHz
Technology node         | 45nm PTM              | 45nm PTM

L1 cache configuration  | Near-threshold        | Super-threshold
Sizes                   | 4, 8, and 16 KByte    | 4, 8, and 16 KByte
Associativity           | 1, 2, and 4 way       | 1, 2, and 4 way
Replacement policy      | LRU                   | LRU
SRAM cells              | 6T and 8T             | 6T

Benchmark               | SPEC2000              | SPEC2000

To demonstrate the effect of soft error, neutron-induced soft errors are considered as they

are the dominant soft error mechanisms at terrestrial altitudes. In order to ensure the proper

functionality of both 6T and 8T SRAM cells in the near-threshold voltage domain, their tran-

sistors are sized according to the transistor sizing used to model and fabricate near-threshold

6T and 8T SRAM cells specified in [107]. It should be noted that the L1 cache is used for illustration purposes only, as most embedded NTC processors have a limited cache hierarchy. However, the framework is generic, and it is applicable to any cache level, such as L2 and L3.

B) Workload effect analysis

As discussed in Section 3.2.2, the BTI-induced SNM degradation of an SRAM cell depends strongly on the cell’s signal probability and the residency time of valid data, which varies from one

workload application to another. Similarly, the SER of memory components is dependent on

the data residency period which is commonly measured using AVF. Hence, for SER analysis,

the AVF of different workloads is obtained based on the workload application’s data residency

period. In order to show the effect of workload variation on SER and SNM degradation, the

AVF and signal probabilities of the cache memory are extracted by running different workload

applications from the SPEC2000 benchmark suite. Then, the corresponding SNM and SER of

the cache memory are obtained using the SER and SNM models presented in Section 3.2.2.

Aging and variation-induced SNM degradation

SNM degradation affects the metastability of SRAM cells. The metastability of an SRAM cell determines the stability of the stored value, and it is highly dependent on the worst case SNM

degradation [67]. Therefore, for any workload application, the aging-induced SNM degradation

should be evaluated based on the first cell to fail (worst case SNM degradation).

The impact of workload on the SNM degradation of 6T and 8T based caches across a wide supply voltage range is shown in Figures 3.6(a) and 3.6(b), respectively. For both cases, the

SNM degradation increases significantly with supply voltage downscaling. Although the aging

Figure 3.6: Workload effects on aging-induced SNM degradation in the presence of process variation for 6T and 8T SRAM cell based caches after 3 years of operation: (a) 6T SRAM based cache, (b) 8T SRAM based cache (SNM degradation vs. Vdd for Applu, Bzip2, Lucas, Mesa, Parser, and their average).


Figure 3.7: Workload effect on the SER of a 6T SRAM cell based cache memory for a wide supply voltage range (failures in 5M cycles vs. Vdd for Parser, Mesa, Bzip2, Lucas, Applu, and their average).

rate is slower at lower supply voltage values due to the lower temperature, the wide variation

extent in NTC leads to higher aging sensitivity. Hence, in NTC the impact of process variation

on SNM is more severe and leads to a significant increase in the aging sensitivity of SRAM

cells.

Soft error rate analysis

In order to analyze the impact of workload variation on the soft error rate of cache memories,

the architectural vulnerability factor of each workload is extracted and combined with the

circuit-level information. Figure 3.7 shows the impact of the SPEC2000 workload applications on the SER of the 6T SRAM based cache. As shown in the figure, for all workload

applications the SER increases significantly with supply voltage downscaling. For example,

the SER of all workload applications increases by five orders of magnitude when the supply

voltage is downscaled from the super-threshold voltage (1.1V) to the near-threshold voltage

domain (0.5V).

Additionally, the workload variation has a considerable impact on the soft error rate. For

example, the SER of Bzip2 is almost two orders of magnitude higher than the SER of Mesa

and Parser workload applications. The workload variation impact is observed because Bzip2

application has higher locality and hit rate which increases the data residency period when

compared to the other workload applications. Although the higher hit rate of Bzip2 leads to

a better performance measured in Instructions Per Cycle (IPC), it has a significant impact

on the soft error rate of the cache. Hence, it is essential to exploit the workload variation in order to downscale the supply voltage of the cache memory on a per-application basis for a given target error rate. For a target FIT rate of, e.g., 10^{-2}, the cache has to operate at 0.6V

for Mesa and Parser workload applications. However, for Bzip2 the cache has to operate at a

higher voltage (0.7V) for the same target error rate.


C) Cache organization impact on system FIT rate

Cache organization has a significant impact on the performance of embedded processors [134].

Similarly, the organization has an impact on the reliability of cache units. In NTC, the

reliability impact of cache organization is even more pronounced. Hence, a proper cache size

and associativity selection should consider both performance and reliability as target metrics.

The system failure probability (FIT rate and SNM) of a cache unit is highly dependent on the

architectural vulnerability factor and the values stored in the cache as well as their residency time intervals, which in turn are a strong function of the read-write accesses of the cache. All of these parameters are influenced by cache size and associativity.

The performance and reliability impacts of different cache organizations in the near and

super-threshold voltage domains are evaluated using the configurations described in Table 3.1.

For near-threshold voltage (0.5V) the processor core frequency is set to 100MHz, and the

cache latency is set to 1 cycle as gate delay is the dominant factor in the near-threshold

voltage domain [135]. In the super-threshold voltage domain, however, the cache latency and

interconnect delay have a significant impact on the overall delay. Thus, the cache hit latency is

set to 2 cycles for 4K and 8K cache sizes and 3 cycles for the 16K cache size [136].

Cache organization and SNM degradation

Since cache organization determines the data residency period, it has a direct impact on

the SNM degradation. Figure 3.8 illustrates the impact of cache organization on the SNM

degradation of near and super-threshold voltage 6T and 8T SRAM cell based memory arrays

in the presence of process variation and aging effects after three years of operation. The figure

shows that a smaller cache size with higher associativity (4k-4w) suffers less SNM degradation, as the data resides in the cache for a shorter duration.

Figure 3.8: Impact of cache organization on SNM degradation in the near-threshold (NTC) and super-threshold (ST) domains in the presence of process variation and aging effects after 3 years of operation (SNM degradation for 4k/8k/16k direct-mapped, 2-way, and 4-way configurations; series: 6T ST, 6T NTC, 8T NTC).


Figure 3.9: FIT rate and performance (IPC) design space of various cache configurations in the super-threshold voltage domain, considering the average workload effect (the blue italic font indicates the optimal configuration).

Cache organization and SER FIT rate

The cache size and associativity also affect the ACE cycles of cache lines and their failure

probabilities. The impact of the cache organization on the FIT rate and performance (IPC) varies across supply voltage domains. In the super-threshold voltage domain, an increase in cache size and associativity improves the performance. However, from a FIT rate point of

view, an increase in the cache size has a negative impact on FIT rate as it increases the

AVF of the cache. Smaller cache sizes, however, have lower performance and better FIT

rate. Figure 3.9 shows the design space of FIT rate and performance (IPC) impact of various

cache organizations in the super-threshold voltage domain. In the figure, the FIT rate and

performance optimal configuration is (8k-4w) as indicated by the blue italic font in Figure 3.9.

In the near-threshold voltage domain, the performance is mainly dominated by the delay of

the logic unit, and the memory failure rate is significantly higher. Therefore, it is essential to select a cache organization that prioritizes reliability (FIT rate and SNM) over performance.

Hence, in NTC a smaller cache size with higher associativity gives the best reliability and

performance trade-off. Figure 3.10 shows the design space for the FIT rate and performance

trade-off for 6T and 8T designs in NTC.

D) Reliability-aware optimal cache organization

The experimental results reported in Figures 3.8, 3.9, 3.10, and 3.11 show that an increase in the cache associativity improves the performance and reliability (both FIT rate and SNM). Hence, in the super-threshold voltage domain, a medium cache size (e.g., 8 KByte) with higher

associativity has a better reliability and performance trade-off. In NTC, however, smaller

cache sizes with higher associativity are preferable for two main reasons: 1) The performance

is mainly dominated by the processor core, not by the cache units and hence, cache latency is


Figure 3.10: FIT rate and performance (IPC) design space of 6T and 8T designs for various cache configurations in the near-threshold voltage domain, considering the average workload effect (the blue italic font indicates the optimal configuration).

not an important issue. 2) The soft error rate and SNM degradation are higher in NTC than

in the super-threshold voltage domain. Hence, the cache size is reduced by half to obtain a

better reliability and performance trade-off in NTC.

In the NTC domain, the selection of an optimal cache organization for the 6T SRAM cell

based caches is different from the 8T based caches, depending on the FIT rate and performance

requirement. For example, for a target tolerable FIT rate of 350 at NTC (as shown by the

dotted line in Figures 3.11(a) and 3.11(b)), only 4 KByte 4-way associative cache organization

is within the acceptable zone for the 6T-based cache. In the 8T-based cache, however, three

additional cache organizations (4K-dm, 4k-2w and 8k-4w) are within the acceptable zone.

Hence, the 8k-4w cache is used in the 8T-based cache to get ≈10% performance improvement

without violating the reliability constraint.

Implementing the suggested cache organizations for a specific supply voltage value (only near-threshold or only super-threshold) is straightforward. For caches that are expected to operate

in both super and near-threshold voltage domains, the reliability-performance optimum cache

organization in the super-threshold voltage (e.g., 4-way 8 KByte in this case) is preferable.

Then, when switching to the near-threshold voltage domain, some portion of the cache is

disabled (power gated) in order to maintain the reliability-performance trade-off at NTC.

E) Overall energy saving analysis of 6T and 8T caches

The energy saving potential of supply voltage downscaling is evaluated by extracting the aver-

age energy consumption profile of the 4K-Byte 4-way set associative cache (i.e., the reliability

performance optimal cache configuration) using 6T and 8T implementations. The energy con-


Figure 3.11: FIT rate and performance trade-off analysis of near-threshold 6T and 8T caches for various cache configurations and average workload effect in the presence of process variation and aging effects: (a) near-threshold 6T, (b) near-threshold 8T.

sumption of the cache memory consists of three components: the peripheral circuitry, the row and column decoders, and the bit-cell array. Since the en-

ergy consumption of the periphery and row/column decoder is independent of the bit-cell used,

they are assumed to be uniform for both 6T and 8T based caches. Hence, the energy-saving

comparison is done based only on the energy consumption of the bit-cell array.

Figure 3.12 compares the total energy consumption of the 6T and 8T based cache memories

for a wide supply voltage range. As shown in the figure, the 8T based cache has slightly

higher energy consumption in the super-threshold voltage domain (0.7V-1.1V) than the 6T

based cache. The slightly higher energy consumption is because of the additional transistors

used for read/write decoupling. However, due to the increase in the failure rate in the near-

threshold domain, the 6T based cache consumes more energy than the corresponding 8T based

implementation. The energy cost of the higher failure rate is considered as an increase in the

read/write latency of the cache. This shows that addressing the failures of the 6T cache in NTC incurs an additional energy cost, which makes it less attractive for operation at lower supply voltage values (e.g., below 0.6V).

F) Reliability improvement and area overhead analysis of 8T based caches

In a near-threshold voltage SRAM design, the 8T cell improves the soft error rate in the

presence of aging and variation effects by up to 25%. Similarly, the SNM is improved by

≈15% using 8T SRAM cells in NTC caches. However, the 8T SRAM design is expected to have a 30% area overhead compared to the 6T design due to the two additional access transistors.

In practice, however, the overhead is much less. Since the 6T SRAM has to be up-sized to

increase its read stability, the up-sizing increases the cell area of the 6T design to the extent

of being larger than the area of 8T design, as experimentally demonstrated in [107].


Figure 3.12: Energy consumption profile of the 6T and 8T based 4K 4-way caches over a wide supply voltage range, averaged over the selected workloads from the SPEC2000 benchmarks.

3.3 Voltage scalable memory failure mitigation scheme

As shown in the analysis presented in Section 3.2, process variation has a significant impact

on the failure rate of memory components operating in the near-threshold voltage domain.

Hence, addressing variation-induced memory failures plays an essential role in harnessing NTC

benefits. One way of mitigating variation-induced memory failures is by determining the

voltage downscaling potential of cache memories without surpassing the tolerable/correctable

error margins. For this purpose, the operating voltage of caches should be gracefully reduced

so that the number of failing bits due to permanent and transient failures remains tolerable.

This section presents a BIST based voltage scalable mitigation technique to determine an

error-free supply voltage downscaling potential of caches at runtime. In order to reduce the

runtime configuration complexities, the cache organization parameters, such as size, associativity, and block size, are determined at design time. In this work, the block size is considered as

the smallest unit used to transfer data to and from the cache. Then, a BIST based runtime

cache operating voltage downscaling analysis is performed for a given cache organization. To

illustrate the impact of block size selection, the voltage downscaling potential of two block

sizes is studied.

3.3.1 Motivation and idea

Due to the wide variation extent in NTC, different memory cells have different SNM values; as a result, their minimum operating voltages for proper functionality vary significantly. The

cells with smaller SNM values need to operate at a higher supply voltage than the cells with

larger SNM values. Therefore, the supply voltage of some cells (cells with smaller SNM value)

should be scaled down more conservatively than the cells with larger SNM in order to maintain

the overall reliability. This idea is exploited in order to minimize the effect of process variation


Figure 3.13: Error-free minimum operating voltage distribution of an 8 KByte cache with a 128 Byte set size: (a) block size = 32 Bytes (4 blocks per set) and (b) block size = 64 Bytes (two blocks per set); the cache is modeled as a 45nm node in CACTI.

and determine error tolerant/error-free voltage downscaling potential of near-threshold caches.

Since cache memories are divided into several blocks, block size selection has a significant

impact on the supply voltage downscaling potential of cache memories. Hence, one needs to analyze the impact of process variation and the supply voltage downscaling potential of cache memories on a per-block basis.

Cache block size has a substantial impact on the miss rate and miss penalty of caches at the

same time. In order to reduce the cache miss rate and its associated penalty, a larger block

size is preferable as it improves locality and reduces the miss rate. From a reliability point of

view, however, larger block sizes have a wider variation extent and, as a result, more failing cells in NTC, which makes the entire block fail. These failures force the cache memory to operate at a much higher voltage (i.e., more conservative scaling), leading to a significant reduction in the energy efficiency. This is addressed by decreasing the cache block size in order to reduce the cache operating voltage, as the variation extent is smaller in comparison to larger block sizes.
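The effect of block size on the minimum scalable voltage can be illustrated with a small Monte Carlo sketch: the minimum error-free voltage of a block is the maximum of its cells' minimum voltages, so larger blocks sample more cells and their worst cell is statistically worse. The per-cell voltage distribution below is a placeholder assumption, not the CACTI/Pelgrom-based model used in this chapter.

import random

def block_vdd_min(cells_per_block, rng):
    """Minimum error-free voltage of one block = worst (max) per-cell minimum voltage."""
    return max(rng.gauss(0.45, 0.02) for _ in range(cells_per_block))

def average_vdd_min(block_bytes, n_blocks=1000, seed=1):
    rng = random.Random(seed)
    cells = block_bytes * 8  # bits per block
    return sum(block_vdd_min(cells, rng) for _ in range(n_blocks)) / n_blocks

if __name__ == "__main__":
    for size in (16, 32, 64):
        print(f"{size:>2} Byte blocks: average minimum Vdd ≈ {average_vdd_min(size):.3f} V")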

To exploit this fact, the impact of block size selection on the supply voltage downscaling

potential of a near-threshold voltage 8KB cache is evaluated as shown in Figure 3.13. The

cache is modeled in CACTI [137] with 128 Byte set size and two different block sizes, and

the impact of process variation is modeled using the threshold voltage variation model given

in Equation (3.5). As shown in the figure, the smaller block size (Figure 3.13(a)) has a narrower variation extent, and hence, it has more supply voltage downscaling potential than its larger

block size counterpart (Figure 3.13(b)) at design time. During operation time, the supply

voltage downscaling potential of the larger block size cache is reduced further due to various

runtime factors such as aging-induced SNM degradation and SER. Moreover, smaller block

sizes have lower multiple bit failure rates, and hence, simpler ECC schemes are adopted at

a minimum cost [138]. Table 3.2 shows the ECC overhead comparison for 64 and 32 Byte block sizes according to [138]. The table shows that dividing the cache into smaller blocks has an advantage in terms of ECC overhead.

Table 3.2: ECC overhead analysis for different block sizes and correction capabilities

ECC scheme | Block size=64 Byte                      | Block size=32 Byte
           | Area overhead | Storage | Latency       | Area overhead | Storage | Latency
SECDED     | 13k gates     | 11 bits | 2 cycles      | ≈4k gates     | 10 bits | 1 cycle
DECTED     | >50k gates    | 21 bits | 4 cycles      | ≈10k gates    | 19 bits | 2 cycles
4EC5ED     | ≈60k gates    | 41 bits | 15 cycles     | ≈50k gates    | 37 bits | 9 cycles

Therefore, appropriate cache block size selection

should consider both performance and reliability effects at the same time in order to achieve

maximum performance while operating within the tolerable reliability margin. Once the cache

block size is determined, the cache supply voltage should be tuned at runtime to incorporate

the runtime reliability effects such as aging. For this purpose, a BIST based supply voltage

tuning is used, and its concept is discussed in the following subsection.

3.3.2 Built-In Self-Test (BIST) based runtime operating voltage adjustment

Built-In Self Test (BIST) is a widely used technique to test VLSI system on chip [139]. Since

memory components occupy majority of the chip area, BIST plays a significant role in testing

large and complex memory arrays easily [139, 140]. In order to determine the runtime supply

voltage downscaling potential of caches, it is essential to assume a cache memory is equipped

with BIST infrastructure to test the entire memory.

In a conventional BIST, the BIST controller generates the test addresses and test patterns

(finite number of read/write operations). Then, the test is performed, and the test result is

compared with the expected response to determine the failing cells [140]. In this case, however,

since the BIST module has to determine the minimum scalable voltage of each block, the test

controller has to be modified in order to iteratively test and generate the minimum scalable

voltages of each block. The goal is first to determine the error-free minimum scalable voltage

of each cache block with/without error correction hardware. Then, the cache operating voltage

is determined based on the block with higher operational voltage as shown in Equation (3.7),

such that the runtime memory failure is minimized.

V_{dd}^{cache} = \max_{0 \le i \le N-1} V_{dd}^{B_i}        (3.7)

where N is the total number of cache blocks, and V_{dd}^{B_i} is the runtime minimum scalable voltage of block B_i obtained using the iterative BIST.

Algorithm 1 presents the iterative BIST technique used to determine the minimum scalable

voltage of cache memory by considering permanent and runtime memory failures. The algo-

rithm takes cache size (Cs), operating voltage (Vdd), block size (Bs), and tolerance margin

(Fm) as its input. Then, the number of cache blocks is determined by dividing the cache size

by the block size (Step 2). Afterward, the minimum scalable voltage of each block is obtained

by gradually reducing the operating voltage, and conducting block-level BIST to determine

the total number of failing bits at each operating voltage level (Steps 3-10). It should be noted


Algorithm 1 Runtime cache operating voltage adjustment
1: function Cache-Vdd-Scaling(Cs, Vdd, Bs, Fm)    ▷ Cs = cache size, Vdd = operating voltage, Bs = block size, Fm = tolerable margin of failing bits
2:   Bt ← Cs / Bs                                 ▷ Bt = total number of cache blocks
3:   for block i ← 1 to Bt do
4:     Fc ← 0                                     ▷ Fc = failing-cell counter
5:     Vdd_new ← Vdd                              ▷ Vdd_new = voltage used to perform BIST
6:     while Fc ≤ Fm do
7:       Perform BIST using Vdd_new
8:       Fc ← number of failing cells in block i
9:       Vdd_new ← Vdd_new − ∆Vdd                 ▷ reduce the operating voltage by ∆Vdd
10:    end while
11:  end for
12:  Vdd_cache ← max over 1 ≤ i ≤ Bt of Vdd_new,i ▷ Vdd_new,i = new operating voltage of block i

that, the supply voltage is reduced as long as the number of failing bits per block is within

the tolerable/ correction capability of the adopted error correction scheme. For example, a

cache memory equipped with a Single Error Correction Double Error Detection (SECDED)

infrastructure tolerates two failing bits per block (hence Fm=2) as SECDED corrects only one

bit and detects two erroneous bits at a time. Hence, whenever two failing bits are detected the

error-free version is loaded from the lower-level memory which makes SECDED sufficient solu-

tion for tolerating two failing bits per block. Finally, the algorithm determines the operating

voltage of the cache based on the block with the highest voltage as shown in Step 12.
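The following Python sketch mirrors Algorithm 1 under the SECDED setting (Fm = 2). The run_bist routine is a hypothetical stand-in for the on-chip BIST hardware that returns the number of failing bits of a block at a given voltage; the voltage step ∆Vdd, the starting voltage, and the toy failure model are placeholder assumptions.

def min_scalable_voltage(block_id, vdd_start, delta_vdd, f_margin, run_bist):
    """Lower the voltage while the failing-bit count of the block stays within f_margin."""
    vdd = vdd_start
    while run_bist(block_id, vdd - delta_vdd) <= f_margin:
        vdd -= delta_vdd
    return vdd

def cache_vdd_scaling(cache_size, vdd_start, block_size, f_margin, run_bist, delta_vdd=0.01):
    """Equation (3.7): the cache operates at the highest per-block minimum scalable voltage."""
    n_blocks = cache_size // block_size
    per_block = [min_scalable_voltage(b, vdd_start, delta_vdd, f_margin, run_bist)
                 for b in range(n_blocks)]
    return max(per_block), per_block

if __name__ == "__main__":
    import random
    rng = random.Random(0)
    limits = {}
    # Toy BIST model: each block has a random minimum error-free voltage.
    def fake_bist(block_id, vdd):
        limit = limits.setdefault(block_id, rng.uniform(0.44, 0.52))
        return 0 if vdd >= limit else 3  # 3 failing bits once below the block's limit
    v_cache, v_blocks = cache_vdd_scaling(8 * 1024, 1.1, 32, f_margin=2, run_bist=fake_bist)
    print(f"cache Vdd = {v_cache:.2f} V (over {len(v_blocks)} blocks)")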

The overall flow of the cache access control logic along with the BIST infrastructure as well

as mapping logic is presented in Figure 3.14. The cache controller first decodes the address

and identifies the requested block. Then, it determines if the requested block is functional or

failing block for the specified operating voltage. If the requested block is functional, then a

conventional block access is performed. In case the requested block is a failing one, the error

tolerant block mapping scheme is employed to redirect the access request.

Since this approach considers the effect of permanent and transient failure mechanisms, it is

orthogonal with different dynamic cache mitigation schemes such as block disabling [116, 66]

and strong ECC schemes [138]. For energy-critical systems, block disabling technique is applied

in combination with this approach to downscale the cache operating voltage aggressively by

disabling the failing blocks at lower operating voltages, at the cost of a performance reduction (increase in miss rate).

Figure 3.14: Cache access control flowchart equipped with BIST and block mapping logic.

Figure 3.15: Error tolerant cache block mapping scheme (mapping failing blocks to marginal blocks).

3.3.3 Error tolerant block mapping

Once the minimum scalable voltages of the cache blocks are determined, the next task is

to disable the failing blocks, and map their read/write accesses to the corresponding non-

failing blocks in order to ensure reliable cache operation. Additionally, in order to reduce

the vulnerability to runtime failures (such as noise and soft errors), the non-failing blocks are

stored in a stack frame sorted by their minimum scalable voltage values. Since the marginal

blocks (blocks with less voltage downscaling potential) are more sensitive to runtime failures,

they are stored at the top of the stack. Then, access to a disabled block is mapped to the

marginal blocks in the stack. The mapping reduces the soft error vulnerability of the marginal blocks by reducing their data residency period. Since a stack is a linear data structure

in which the insertion and deletion operations are performed at only one end commonly known

as “top”, the marginal blocks need to be at the top (upper half) of the stack to ensure their

fast replacement.

The mapping process is illustrated in Figure 3.15 by using an illustrative example. As shown

in the figure, the cache blocks are divided into three categories: i) red blocks are failing blocks.

ii) yellow blocks are marginal blocks (non-failing but with limited supply voltage downscaling

potential). iii) blue blocks are robust blocks (i.e., non-failing with higher supply voltage

downscaling potential). Hence, the marginal blocks are stored at the top of the stack frame.

Then, when a disabled (failing) block is requested (e.g., B5) its access request is mapped to

a marginal block at the top of the stack frame (e.g., B4), and the stack pointer is updated

to point to the next element in the stack. This process continues until all the disabled blocks

are mapped. It should be noted that once a block is mapped, it is removed from the mapping

stack when updating the stack pointer. For example, when block B5 is mapped to block B4,

then, block B4 is removed from the stack as shown by the empty slot in Figure 3.15.
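A minimal Python sketch of the mapping policy described above: non-failing blocks are ordered by their minimum scalable voltage so that the marginal blocks sit at the top of the stack, and each disabled block is redirected to the block popped from the top. The block identifiers and voltages below are invented for the example and happen to reproduce the B5→B4 and B12→B1 mappings of Figure 3.15.

def build_mapping(vdd_min_per_block, vdd_operating):
    """Map each failing block (Vdd_min above the operating voltage) to a marginal block."""
    failing = [b for b, v in vdd_min_per_block.items() if v > vdd_operating]
    # Non-failing blocks sorted so that the most marginal (highest Vdd_min) end up on top.
    stack = sorted((b for b in vdd_min_per_block if b not in failing),
                   key=lambda b: vdd_min_per_block[b])
    mapping = {}
    for block in failing:
        if not stack:
            raise RuntimeError("not enough non-failing blocks to remap")
        mapping[block] = stack.pop()  # pop the top of the stack (most marginal block)
    return mapping

if __name__ == "__main__":
    vdd_min = {"B0": 0.45, "B1": 0.47, "B4": 0.49, "B5": 0.55, "B12": 0.53}  # invented values
    print(build_mapping(vdd_min, vdd_operating=0.50))  # -> {'B5': 'B4', 'B12': 'B1'}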

3.3.4 Evaluation of voltage scalable mitigation scheme

A) Variation-aware voltage scaling analysis

The supply voltage scalability of three different block sizes (16, 32, and 64 Byte) with different

error correction schemes is compared in order to analyze the impact of block size selection on

the supply voltage downscaling potential of cache memories with and without error correction


schemes. The error-free (correctable-error) minimum voltage of the three block sizes is studied for an 8 KByte cache memory with no-ECC, parity, and Single Error Correction Double Error

Detection (SECDED) configurations. Table 3.3 shows the supply voltage downscaling potential

of the studied block sizes. For all ECC schemes (given in Table 3.3), the cache operating voltage

has to be downscaled more conservatively when the block size is larger (64 Bytes). However,

larger block sizes help to reduce the cache miss rate that results in a better cache performance.

Therefore, for an aggressive supply voltage downscaling, the block size should be selected as

small as possible by making performance and energy-saving trade-off analysis.

Table 3.3: Minimum scalable voltage analysis for different ECC schemes

ECC scheme | Minimum scalable voltage [V]
           | Block size=16 Byte | Block size=32 Byte | Block size=64 Byte
No-ECC     | 0.50               | 0.53               | 0.54
Parity     | 0.47               | 0.51               | 0.53
SECDED     | 0.43               | 0.48               | 0.50

B) Energy and performance evaluation of the voltage scalable cache with different ECC schemes

The average energy reduction and performance comparison of voltage scaled cache memory

with and without ECC are given in Figures 3.16(a) and 3.16(b) by running selected workloads

(gzip, parser and mcf ) from the SPEC2000 benchmark. The energy results in Figure 3.16(a)

are extracted from CACTI by considering block disabling, and ECC induced delay and energy

overheads. As shown in the figure, supply voltage downscaling improves the energy efficiency

significantly. However, the overheads of this scheme, namely ECC energy overhead, block dis-

abling induced cache miss rate, and ECC encoding/decoding delay overhead outweigh the energy gain of supply voltage downscaling when the cache operating voltage is below 0.7V. Therefore, the energy per access of Double Error Correction Triple Error Detection (DECTED) is higher than that of SECDED when the supply voltage is scaled down to 0.7V or below. Similarly,

Figure 3.16(b) shows the cache performance (IPC) is reduced significantly with the supply

voltage downscaling as more blocks are disabled for reliable operation.

Figure 3.16: Comparison of voltage downscaling in the presence of block disabling and ECC-induced overheads for the gzip, parser, and mcf applications from the SPEC2000 benchmark: (a) energy per access comparison, (b) performance (normalized IPC) comparison; series: No-ECC, SECDED, DECTED.


3.4 Summary

Embedded microprocessors, particularly for battery-powered mobile applications, and energy-

harvested Internet of Things (IoT) are expected to meet stringent energy budgets. In this

regard, operating in the near-threshold voltage domain provides better performance and energy

efficiency trade-offs. However, NTC faces various challenges, among which the increase in the functional failure rate of memory components is the dominant issue. This chapter analyzed the combined

effect of aging, process variation, and soft error on the reliability of cache memories in super and

near-threshold voltage domains. It is observed that the combined effect of process variation and

aging has a massive impact on the soft error rate and SNM degradation of NTC memories.

Experimental results show that process variation and aging-induced SNM degradation is 2.5× higher in NTC than in the super-threshold voltage domain, while the SER is 8× higher. The

use of 8T instead of 6T SRAM cells reduces the system-level SNM degradation and SER by 14% and 22%, respectively. Additionally, workload and cache organization have a significant impact

on the FIT rate and SNM degradation of memory components. This chapter demonstrated

that the reliability and performance optimal cache organization changes when going from the

super-threshold voltage to the near-threshold voltage domain.

4 Reliable and Energy-Efficient Microprocessor Pipeline Design

This chapter presents techniques to mitigate the variation-induced delay imbalance and energy inefficiency of pipelined processor cores. The techniques are variation-aware pipeline stage design and minimum energy point operation of pipeline stages for the energy-efficient design of pipelined NTC processor cores. The first technique illustrates the variation-induced delay imbalance of pipeline stages and develops a pipeline stage delay balancing strategy in the presence of extreme variation. The second technique studies the energy-optimal operation of pipeline stages and assigns pipeline stage-level optimal supply and threshold voltage pairs, which are determined based on the structure, functionality, and workload-dependent activity rate of the pipeline stages.

4.1 Introduction

For a processor operating in NTC, the wide variation extent affects the delay of the pipeline

stages significantly [82, 90]. Due to the difference in the logic depth and type of gates used,

the impact varies widely among different pipeline stages of NTC processor [56]. Moreover,

circuit parameters such as threshold voltage and activity rate have a direct impact on the

delay, dynamic, and leakage power consumption of the pipeline stages. The dynamic power

consumption of pipeline stages increases linearly with an increase in the activity rate. Similarly,

transistor threshold voltage has a significant impact on the leakage power consumption and

delay of pipeline stages. Therefore, adding timing margins for delay variation, as is done in the super-threshold voltage domain, is not an effective approach for NTC operation. Hence,

addressing variation effects and determining the optimum supply and threshold voltage pairs

is of decisive importance for energy-efficient NTC operation [141, 11].

For this purpose, this chapter presents two pipeline stage designing and balancing method-

ologies to tackle the impact of process variation, and improve the energy efficiency of pipelined

NTC processors. The first approach employs a variation-aware pipeline stage balancing and

synthesis technique to balance the delay of pipeline stages in the presence of extreme delay

variation. The variation-aware balancing approach, presented in Section 4.2, improves the

energy efficiency of the pipeline stages significantly by using slower, but energy-efficient gates

in the implementation of the faster pipeline stages so that their delay is balanced with the

delay of the slower pipeline stages. The second technique presents a fine-grained pipeline

stage level Minimum Energy Point (MEP), supply and threshold voltage pair, assignment for

pipeline stages in order to improve their energy efficiency. The MEP assignment determines

the energy-optimal supply and threshold voltage (Vdd, Vth) pairs of pipeline stages based

on their structure and activity rates. The optimal supply and threshold pairs assignment of

individual pipeline stages is useful to attain extreme energy efficiency at NTC.


Figure 4.1: Variation-induced delay increase of the pipeline stages (IF, TSel, Dec, Exe, Mem, WB) in the super-threshold (ST) and near-threshold (NTC) voltage domains for the OpenSPARC core (refer to Table 4.1 in Section 4.2.4 for the OpenSPARC setup).

The rest of the chapter is organized as follows. Section 4.2 presents the variation-aware

processor pipeline optimization technique, in which the pipeline stages are balanced by con-

sidering the impact of process variation during earlier design phases. Section 4.3 presents a

fine-grained (pipeline stage level) energy-optimal MEP, supply and threshold voltage (Vdd,

Vth) pair, assignment technique for an energy-efficient microprocessor pipeline design. Finally,

the chapter is summarized in Section 4.4.

4.2 Variation-aware pipeline stage balancing

The impact of variation on the delay of pipeline stages in NTC is illustrated by comparing the

variation-induced delay increase of the OpenSPARC core at the super-threshold (1.1V) and

near-threshold (0.45V) voltage domains as shown in Figure 4.1. The pipeline stages of the

processor core are synthesized with 45nm Nangate library and the impact of process variation

is incorporated as threshold voltage variation, and it is modeled by the Pelgrom model [17].

As shown in Figure 4.1, the variation-induced delay increase of pipeline stages is <20% in

the super-threshold voltage domain, while it escalates to 73% at NTC. Moreover, there is

substantial variation between the delays of the pipeline stages in the near-threshold voltage

domain.

In order to address the variation-induced delay imbalance of pipeline stages, a variation-aware pipeline stage balancing technique, which is a design-time solution that improves the energy efficiency and minimizes the performance uncertainty of pipelined NTC processors, is presented in this section. In the super-threshold voltage domain, design-time balancing techniques such as [142] do not necessarily have to be variation-aware, as the variation impact is easily

addressed by adding timing margins. However, due to the wide variation extent at NTC,

delay variation cannot be effectively handled by adding conservative timing margins. Hence,

variation-aware pipeline balancing techniques are crucial to improve the energy efficiency, and


Figure 4.2: Impact of process variation on the delays of different pipeline stages (IF, Dec, RR, Exe, Mem, WB) in NTC: (a) nominal delay balanced, (b) guard band reduction, (c) variation-aware balanced (nominal delay vs. variation-induced delay per stage).

avoid pessimistic margins of NTC designs. Therefore, an iterative variation-aware synthesis

flow is adopted in order to balance the delay of the pipeline stages in the presence of extreme

delay variations. By doing so, the variation-aware balancing approach has dual advantages:

i) It assures timing certainty and avoids the need for conservative timing margins. ii) It

improves the energy efficiency by balancing the delays of the pipeline stages using slower, but

energy-efficient gates in the implementation of the fast pipeline stages.

For this purpose, the pipeline stages are optimized and synthesized independently using NTC

standard cell library, and Statistical Static Timing Analysis (SSTA) is performed to obtain the

statistical delays of the pipeline stages. Afterward, the statistical delays of the pipeline stages

are provided to the synthesis tool in order to iteratively balance the pipeline stages accordingly.

The variation-aware synthesis flow enables the synthesis tool to use energy-efficient gates (e.g.,

INVX1) in the faster stages in order to improve the energy efficiency without violating any

timing requirements, while faster gates (e.g., INVX32) are used to implement the slower (timing

critical) stages.

4.2.1 Pipelining background and motivational example

Pipelining is mainly used to increase the clock frequency and instruction throughput of mod-

ern processors. Thus, balancing the delay of the pipeline stages is helpful to achieve better

performance. However, due to the wide variation extent in the near-threshold voltage domain,

the classical design-time pipeline balancing approaches are not sufficient anymore, unless a

new variation-aware balancing objective is incorporated.

As shown by the example in Figure 4.2, the energy efficiency of a nominal delay balanced

design is improved by two different approaches; one way of improving the energy efficiency is to

minimize the delays of the pipeline stages by using delay as the primary target of optimization.

Another way is to keep the delay constant, and minimize the power consumption by using

slower, but energy-efficient gates in the implementation of the faster pipeline stages. For

instance, the pipeline stages in Figure 4.2(a) are balanced based on their nominal delays (the

green bars) with TD as the target delay. However, due to process variation (the gray bars),

the actual post manufacturing delays (TOP ) of the pipeline stages are highly unbalanced. The

post-manufacturing delay imbalance has two consequences; on the one hand, it leads to a large

timing guard band requirement which affects the performance significantly; on the other hand,

some of the pipeline stages (e.g., WB) are over-designed with a considerably large timing slack

that leads to higher leakage energy consumption, which eventually reduces the overall energy efficiency.

Hence, in order to minimize the leakage energy consumption and improve the energy effi-

ciency of the design, one should consider the impact of process variation during early design

phases. In Figure 4.2(b), the design is optimized to reduce the post-manufacturing delay TOP

to Tnew by using delay as the main target of the optimization. In Figure 4.2(c), however, the

pipeline stages are balanced based on their statistical delays instead of their nominal delays

(i.e., post-manufacturing statistical delay (TD), information is incorporated during synthesis

phase). Hence, in Figure 4.2(c) the worst-case delay is kept constant and the delays of the

faster stages are increased by using slower, but energy-efficient gates in order to improve the

energy efficiency.

4.2.2 Variation-aware pipeline stage balancing flow

A) Variation-aware cell library for NTC

Before discussing the variation-aware pipeline stage balancing flow, it is essential to discuss

the appropriate variation-aware cell library used in the balancing flow. The variation-aware

cell library is a library that extends the standard cell library with variation information of the

standard cell parameters such as delay, power, capacitance, and current of the standard gates.

The variation-aware library is characterized using Cadence Virtuoso Variety tool based on the

SPICE netlist of the standard gates and the threshold voltage variation information obtained

from the Pelgrom model [17]. It should be noted that the variation-aware library can be also

characterized using other tools, such as PrimeTime Advanced On-Chip Variation (AOCV)

tool [143]. The characterized variation-aware cell library is used by the SSTA tool to obtain

the statistical delay values of the standard cells. The statistical delay values of the standard

cells are used to create a combined variation-aware library which is used by the variation-aware

synthesis flow shown in Figure 4.3.

Figure 4.3: Variation-aware pipeline stage delay balancing synthesis flow for NTC.

Algorithm 2 Iterative optimization algorithm for variation-aware pipeline stage balancing
1: function Iterative-Optimization(Stage-RTL, PV-library, Constraint-file)
2:   ▷ Synthesisflow is a tcl script that invokes the Synopsys design compiler
3:   ▷ D_i = statistical delay of stage i
4:   for stage i ← 1 to n do
5:     ▷ balance nominal delays as shown in Figure 4.3
6:     D_i = Synthesisflow(RTL-code_i, library, Constraints)
7:   end for
8:   ▷ Obtain D_i^PV using the SSTA tool
9:   D_i^PV = SSTA(synthesized netlist_i, library, variation library) for all stages
10:  D_max^PV = max over 1 ≤ i ≤ n of D_i^PV       ▷ D_i^PV = D_i^nom + ∆D_i
11:  for stage i ← 1 to n−1 do                     ▷ balance the stages based on their statistical delay
12:    while D_i^PV < D_max^PV do
13:      update constraint-file
14:      Synthesisflow(RTL-code_i, library, constraints)
15:      D_i^PV = SSTA(synthesized netlist_i, library, variation library)
16:    end while
17:  end for
18:  return timing and power profiles of the balanced design

B) Iterative variation-aware pipeline stage optimization

The variation-aware synthesis flow presented in Figure 4.3 is the core part of the iterative

optimization technique. As shown in the figure, the synthesis tool is provided with the RTL

description of the pipeline stages along with their constraint files. The constraint files contain

the design constraints (such as area and power optimization requirements) and timing assign-

ments of the pipeline stages. Then, the synthesis tool uses the standard cell library database

(.db file) to synthesize the design and generate the synthesized netlist that satisfies the specified

constraints. Afterward, the SSTA tool uses the variation-aware library (.lib file) to generate

the statistical timing and power reports of the synthesized netlist. At this stage, the statisti-

cal delay of the pipeline stage (obtained from the timing report) is compared with the target

delay. If the delay is less than the target delay (i.e., positive slack), then the constraint file is

updated by relaxing the timing constraint, and the synthesis flow is repeated by using slower,

but energy-efficient gates (e.g., INVX1). However, if the delay is larger than the target delay

(i.e., negative slack), the constraint file is modified by tightening the timing constraint, and

the synthesis flow is repeated by using faster gates (e.g., INVX8) in order to meet the timing

constraint.

Algorithm 2 presents the iterative pipeline stage optimization. First, the pipeline stages

are balanced based on their nominal delays (Steps 4-7). The Synthesisflow function is a tcl

script that invokes the synthesis tool (Synopsys design compiler in this case) to synthesize the

behavioral code of the pipeline stages satisfying the specified constraints. Then, the statistical

delays of the stages are extracted using the SSTA tool, and the critical delay D_max^PV is updated accordingly (Steps 9-10). Once the worst-case statistical delay D_max^PV is obtained, the algorithm re-balances the delays of the pipeline stages by using D_max^PV as the target delay. A tcl script is used to automatically modify the constraint file for the next iteration by analyzing the timing

53

4 Reliable and Energy-Efficient Microprocessor Pipeline Design

Slack S0

TD

Step-0

Step-1

Step-2

Step-3

T1 = T0+ S0/2

S1

T2 = T1 - S1/2 S2

T3 = T2 + S2/2

Critical Stage

optimized Stage

T0

T1

T2

T3

Figure 4.4: Time constraint modification using pseudo divide and conquer.

slack of the current iteration. In the iterative synthesis, the remaining n-1 stages (where n is

the total number of stages) are optimized iteratively (Steps 11-16) in order to balance their

delays, DPVi's, with the delay of the critical stage (DPVmax).
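To make this flow concrete, the following Python sketch outlines how Steps 11-16 could be scripted around the EDA tools. The callables synthesize, extract_delay, and relax_constraint are hypothetical wrappers for the tcl synthesis script, the SSTA tool, and the constraint update of Section C; they are assumptions of the sketch, not the actual scripts used in this work.

def rebalance_stages(stages, d_pv_max, synthesize, extract_delay, relax_constraint, max_iters=20):
    """Minimal sketch of Steps 11-16 of Algorithm 2.

    stages          : objects with .rtl, .netlist and .constraint attributes
    d_pv_max        : statistical delay of the critical stage (target delay)
    synthesize      : callable(rtl, constraint) -> netlist (wraps the tcl synthesis flow)
    extract_delay   : callable(netlist) -> statistical stage delay (wraps the SSTA tool)
    relax_constraint: callable(constraint, delay, target) -> updated constraint file
    """
    for stage in stages:
        delay = extract_delay(stage.netlist)
        for _ in range(max_iters):
            if delay >= d_pv_max:
                break  # stage delay has reached the critical statistical delay
            # relax the timing constraint so slower, more energy-efficient gates are used
            stage.constraint = relax_constraint(stage.constraint, delay, d_pv_max)
            stage.netlist = synthesize(stage.rtl, stage.constraint)
            delay = extract_delay(stage.netlist)
    return stages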

Algorithm stopping condition: For each pipeline stage, the iterative optimization flow

considers the delay of the stage as balanced, and stops the iteration if and only if the timing

slack is positive (i.e., no timing violation) and one of the following conditions is satisfied:

• Dmax - Di ≤ ε: the delay of the stage is nearly balanced with the target delay. This

condition guarantees that the stage has a reasonably balanced delay with respect to the

critical stage. The value of ε can be changed to meet design specific requirements such

as area and power minimization.

• Di cannot be improved anymore: in this case, the delay of the pipeline stage cannot

be improved further by applying more synthesis iterations. Hence, the delay of the last

iteration (Di−1) such that Di−1<Dmax is used as its delay.

C) Timing constraint modification

Modifying the constraint file is an integral part of the iterative algorithm, as it affects the

runtime of the algorithm. Thus, a pseudo divide and conquer approach in which the timing

constraint is tightened or relaxed by half of the timing slack of the previous iteration is used to

iteratively modify the constraint file. The pseudo divide and conquer approach is illustrated

in Figure 4.4, using an illustrative example for balancing the delay of a representative stage

with the delay of the critical stage. As shown in Figure 4.4, the statistical delay of the critical

stage (TD), the statistical delay of representative stage (T0) and the slack of the representative

stage (S0) are extracted during the first synthesis iteration (Step-0). It should be noted that

the slack Si in the ith iteration is either positive or negative. If the slack is positive, the timing

constraint of the next iteration is relaxed; otherwise, tight timing constraint is applied in order

to satisfy the timing requirement of the design.

In the example given in Figure 4.4, since S0 is a positive slack, the timing constraint is

relaxed by half of the slack (i.e., T1= T0 + S02 ) which allows the synthesis tool to use slower,

but energy-efficient gates. Then, during the next synthesis iteration (Step-1), the synthesis

tool generates the synthesized design and timing report, from which the timing slack (S1) is

obtained. Now since the slack S1 is negative, the timing constraint of the next iteration is

tightened (T2= T1 - S12 ), and the synthesis tool uses faster gates to generate a design that


satisfy the timing requirement (Step-2). The iterative timing constraint modification is applied

until the delay of the stage is fully/ nearly balanced to the target delay, or the slack is positive,

but the delay cannot be improved further.
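As an illustration only, the half-slack update together with the stopping condition of Section B can be written as a short routine. The callable synthesize_delay stands in for one synthesis-plus-SSTA iteration, and the eps value is a placeholder.

def balance_to_target(t0, synthesize_delay, target_delay, eps=0.01, max_iters=20):
    """Pseudo divide and conquer constraint modification (Figure 4.4).

    synthesize_delay(t) is an assumed callable that synthesizes the stage with
    timing constraint t and returns its achieved statistical delay.
    """
    t = t0
    for _ in range(max_iters):
        delay = synthesize_delay(t)
        slack = target_delay - delay        # positive: faster than the critical stage
        if 0 <= slack <= eps:
            break                           # nearly balanced with the target delay
        # move the constraint by half of the slack; with a signed slack the relax
        # (positive) and tighten (negative) cases of Figure 4.4 collapse into one update
        t = t + slack / 2.0
    return t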

D) Algorithm runtime analysis

The runtime of Algorithm 2 is mainly determined by: i) the variation-aware synthesis time; and ii) the number of synthesis iterations. The former (the variation-aware synthesis time) is determined

by the synthesis tool, and is improved by using different synthesis optimization techniques.

For example, in the Synopsys design compiler the -map_effort low option can be used to specify low optimization effort for faster synthesis. The latter, the number of synthesis iterations, is determined by how fast the algorithm reaches the target delay (i.e., Di ≈ Dmax)

when balancing the delays of the pipeline stages. Since the runtime is highly determined by

the technique used to modify the timing constraints, the pseudo divide and conquer approach

discussed in the above subsection is used to modify the timing constraint. In comparison to a

naive incremental approach where the timing constraint is modified by adding or subtracting

the entire slack (Ti=Ti−1±Si−1), the pseudo divide and conquer approach converges to the

optimal solution with fewer iterations as it always gets closer to the solution. The algorithm

runtime also varies from design to design depending on the number and complexity of the

pipeline stages in the design.

4.2.3 Coarse-grained balancing for deep pipelines

Pipeline merging is a coarse-grained configuration technique used to balance the delays of

pipeline stages, particularly for deep pipelines. The idea is to selectively merge consecutive

small (fast) stages by disabling the intermediate registers between the consecutive pipeline

stages [144]. However, pipeline stage merging is a very challenging and complex task, and it may not be applicable to arbitrary consecutive pipeline stages due to several complexities. For example,

there are interrupt, stall, forward, and feedback loop signals generated by a pipeline stage

which are fed to the previous/next stages in order to control the proper instruction execution.

During pipeline stage merging, however, the generation and propagation of these signals can

be delayed or completely masked, which affects the proper functionality of the merged design.

Hence, in order to maintain the proper functionality, the generation and propagation flow of

these control signals should be handled appropriately. Moreover, the delay constraint of the

design should be maintained during the merging process in order not to affect the operational

clock frequency of the design. The example given in Figure 4.5 illustrates the cases of feedback

loop control signal handling and delay constraints. However, the same concept applies to other

Figure 4.5: Pipeline merging and signal control illustration with stages S1-S4: (a) before and (b) after merging, where the merged S2 = S2 + S3 and S3 = S4.


cases as well.

• Maintaining feedback signals: For example, consider the design in Figure 4.5(a) which

has four stages (S1-S4) with a feedback signal loop from stage 2 to stage 1. During the merging

of stages S2 and S3 to create S2, the feedback signals have to be maintained in order to

have a fully functional design as shown in Figure 4.5(b).

• Delay constraint: The delay of the merged stage should not exceed the critical stage

delay. This constraint assures that the frequency is not affected by the merging process.

It is observed that some of the pipeline stages of the FabScalar core (such as Dispatch, Issue Queue, Write Back, and Retire) are faster than the other stages, which leads to higher energy inefficiency. Therefore, the inter-stage delay variation is minimized by using pipeline merging before applying the variation-aware balancing flow. Hence, the Rename and Dispatch stages are merged and named the RD stage. Similarly, the Issue Queue and Register Read stages are merged to form the IRR stage. However, the Write Back and Retire stages are not merged due to the complexity of the bypass and out-of-order commit signals. For brevity, only the merging process of the Rename and Dispatch stages is discussed.

In the baseline 4-issue out-of-order design, the rename stage receives four decoded instruc-

tions from the previous (decode) stage, and it performs logical to physical register mapping

(renaming). Afterward, the renamed instructions are dispatched to the issue and load/store

queue in the dispatch stage. To merge these stages a new top-level module, RD, is created

with the input ports of Rename and output ports of Dispatch stages. Then, the Rename and

Dispatch stages are instantiated and connected via wires within the top-level module to create

a single combined stage. Finally, the newly merged stages are added to the pipeline of the

FabScalar core and the core is synthesized, and debugged for functional verification before

using it in the case study.

4.2.4 Experimental results

A) Experimental setup

To illustrate the effectiveness of pipeline stage merging and variation-aware delay balancing

on deep and shallow pipeline designs, two case studies are conducted using OpenSPARC [145]

and FabScalar [146] cores. The architectural description and simulation setup of the two

processor cores are presented in Table 4.1. The impact of process variation on the delays of the pipeline stages is extracted by an SSTA tool that uses a variation-aware library characterized by the Cadence Virtuoso Variety tool as discussed in Section 4.2.2.

Table 4.1: Experimental setup

Configuration                OpenSPARC T1      FabScalar
Processor model              Embedded          Embedded
Architecture                 in-order core     out-of-order
Nominal voltage frequency    1.4 GHz           800 MHz
NTC frequency                230 MHz           67 MHz
Supply voltage               0.45V             0.45V
Technology node              45nm Nangate      45nm Nangate
Pipeline depth               6                 11


Figure 4.6: Nominal and variation-induced delays [ns] of the OpenSPARC pipeline stages (IF, TSEL, DEC, EXE, MEM, WB): (a) nominal delay balanced baseline design, (b) guard band reduction (delay optimized) of the nominal delay balanced design, and (c) statistical delay balanced (power optimized) design.

The power results are obtained

using Synopsys design compiler.

B) Case study using OpenSPARC core

OpenSPARC T1 is an open source version of the industrial UltraSPARC T1 processor, a

64-bit multi-core processor that runs at up to 1.4 GHz when operating in the super-

threshold voltage domain (Vdd=1.1V). When operating at NTC (Vdd=0.45V), it runs up to

230MHz. As discussed in Section 4.2.2, the energy efficiency is improved either by reducing

the delay or by keeping the delay constant, and reducing the power consumption using energy-

efficient gates.

Balancing for delay improvement

The target of timing margin reduction, delay improvement, is to improve the statistical post-

manufacturing delays of the pipeline stages in order to boost the overall frequency. The

delay balancing optimization approach is demonstrated in Figure 4.6. In the baseline design

(Figure 4.6(a)), the stages are designed to have balanced nominal delays of ≈2.6ns. Due to

process variation, the stages become highly unbalanced, and the delay of the critical stage

becomes 4.3ns, which limits the core to run at a frequency of ≈230MHz. However, the core

frequency is improved by tightening the timing constraint, and optimizing the delay of the

decode stage. Figure 4.6(b) shows the delays of the pipeline stages after optimizing the critical

stage (decode stage). As shown in Figure 4.6(b), the critical stage delay is 3.75ns which is

≈15% lower than the baseline delay. Hence the core can run at a higher frequency (267MHz),

which gives 15% improvement over the baseline frequency. As shown in Table 4.2, the 15%

delay reduction of the delay optimized design leads to 13% improvement in the Power Delay

Product (PDP) of the design.

Balancing for power improvement

In this approach, the energy efficiency is improved by using slower, but more energy-efficient

gates in the faster pipeline stages. In order to reduce the power consumption,


Figure 4.7: Power (energy) improvement of variation-aware balancing over the baseline design for the OpenSPARC core, per stage (IF, TSel, Dec, Exe, Mem, WB) and in total.

the delay of the faster pipeline stages is increased and balanced with the delay of the slowest (critical) stage.

Although the power balancing approach increases the delay of all individual stages except the

critical stage (decode stage), it does not affect the overall clock frequency as the frequency is

determined by the delay of the critical stage (decode stage). The power balancing scenario

is illustrated in Figure 4.6 with a comparison to the baseline design (nominal delay balanced

design) and delay balanced designs.

The energy efficiency of the nominal delay balanced design is improved using the variation-

aware pipeline stage balancing flow. Hence, the pipeline stages are balanced based on their

statistical delays (4.3ns as the target critical delay). The usage of such a relaxed timing constraint

enables the synthesis tool to use slower, but more energy-efficient gates in the fast stages. This

optimization is illustrated in Figure 4.6(c). In the optimized design (Figure 4.6(c)), all the

pipeline stages have nearly balanced statistical delays which makes the design more energy-

efficient as it uses slower gates in the faster stages. Hence, in comparison to the baseline design,

the variation-aware design is more energy-efficient.

The power improvement of the variation-aware design over the baseline is shown in Fig-

ure 4.7. Figure 4.7 shows that the power consumption of all stages is improved except for the

decode stage as it is the critical stage. The Write Back (WB) and Thread Select (TSel) stages

have significant improvement (>70%) since they have considerably larger slack in the baseline

design. In total, the variation-aware delay balancing approach improves the power consump-

tion of the OpenSPARC core by 55%. Additionally, since PDP is a function of power and delay, the achieved power improvement translates into a PDP improvement.

Table 4.2: PDP of baseline, delay optimized, and power optimized designs

Design            IF      TSel    Dec     Exe     Mem     WB      Avg
Baseline design   0.232   0.256   0.316   3.117   1.036   0.318   0.881
Delay optimized   0.202   0.22    0.303   2.718   0.903   0.277   0.776
% Improvement     14.8    16.4    4.2     14.6    14.7    14.8    13.25
Power optimized   0.144   0.147   0.316   1.995   0.649   0.185   0.572
% Improvement     62.0    74.1    0       56.2    59.6    72.1    54.2


Figure 4.8: Nominal and variation-induced delays [ns] of the FabScalar pipeline stages: (a) baseline design (FS1, FS2, DEC, REN, DIS, ISQ, RR, EXE, LSU, WB, RET), where the gray boxes indicate the stages to be merged, (b) merged and optimized design with the RD and IRR stages.

As shown in

Table 4.2, the power optimized design improves the average PDP by 54.2% which is 4× better

than the PDP improvement achieved by delay optimization (timing margin reduction).

C) Case study using FabScalar core

FabScalar is an open source pipelined out-of-order core developed by academia. In NTC

FabScalar runs up to 67 MHz. The frequency of Fabscalar is limited by the load store unit

(LSU). Since FabScalar is highly unbalanced design, pipeline stage merging is necessary before

applying the proposed variation-aware pipeline balancing technique.

Delay balancing for power improvement

To balance the delay of the FabScalar core, first, the faster pipeline stages of the baseline design

are merged, and the merged design is validated as discussed in Section 4.2.3. Rename (Ren)

and Dispatch (Dis) stages are merged to create the RenameDispatch (RD) stage, while the Issue Queue (ISQ) and Register Read (RR) stages are merged to form the IssueRegisterRead (IRR) stage. Hence, the baseline design has 11 stages, whereas the merged design has 9 stages as

shown in Figure 4.8. Afterward, the proposed variation-aware balancing flow is employed to

balance the delay of the pipeline stages.

Figure 4.8 shows the nominal and variation-induced delays of the FabScalar pipeline stages.

As shown in Figure 4.8(a) the stages have highly unbalanced delays with the LSU stage being

the critical stage having a delay of 15ns. However, due to process variation, the execute stage

becomes critical with a delay of ≈20ns. The problem is addressed in Figure 4.8(b) by applying

pipeline merging, and balancing the stages based on their statistical delays. Figure 4.8(a)

shows that the pipeline stages of the baseline design have a highly unbalanced nominal and statistical delay distribution, while the stages in the proposed approach (Figure 4.8(b)) are well balanced in the

presence of extreme variation.


Figure 4.9: Power (energy) improvement of the variation-aware FabScalar design over the baseline design, per stage (FS1, FS2, DEC, RD, IRR, EXE, LSU, WB, RET) and in total.

Power improvement

Unlike the optimization of the OpenSPARC core, optimizing the FabScalar core has a two-fold gain in power. On the one hand, the design is simplified and the power consumption is minimized

by disabling the stage flip-flops of the merged stages. On the other hand, the synthesis tool

uses slower, but more energy-efficient gates in the stages that have positive time slacks to

improve the overall energy efficiency.

The power improvement of the variation-aware pipeline balancing and merging of FabScalar

core is shown in Figure 4.9. Since LSU is the critical stage, it has no improvement over the

baseline. As shown in the figure, the variation-aware balancing technique improves the total

power consumption of the core by ≈85%. Since energy is the product of power and delay, and the clock period is unchanged, the 85% power reduction improves the energy efficiency by the same amount.

4.3 Fine-grained Minimum Energy Point (MEP) tuning for energy-efficient pipeline design

4.3.1 Background

With supply voltage downscaling, not only is the dynamic power reduced quadratically, but the leakage power is also reduced in a linear fashion, leading to a significant energy reduction. However, in addition to the delay increase, the practical energy benefit of supply voltage downscaling is limited by several barriers, such as an increase in sensitivity to process and performance variations [147, 18]. Therefore, downscaling the supply voltage beyond a certain point no longer improves the energy efficiency, as the delay increases significantly, leading to an increase in the leakage energy which eventually dominates the energy consumption. The

transistor threshold voltage is increased in order to reduce the leakage power; however, this

increases the circuit delay, and eventually reduces the energy efficiency. Therefore, both supply

and threshold voltages should be tuned simultaneously in order to co-optimize the delay and

power consumption of a circuit. Hence, the supply voltage, leading to the maximum energy

efficiency, known as Minimum Energy Point (MEP), is obtained by tuning both supply and

threshold voltages at the same time [148].


Figure 4.10: Dynamic to leakage energy ratio (scale 0.8 to 6.4) of the pipeline stages (Fs1, Fs2, Dec, Ren, Dis, Isq, Regread, Exe, Lsu, Wback, Ret) of the FabScalar core under different workloads (Bzip, Gap, Gzip, Mcf, Parser, Vortex), synthesized using the 0.5V saed 32nm library.

For energy-constrained devices such as IoT applications, where performance is not the primary concern, MEP operation plays a significant role in minimizing the energy consumption and hence increasing the supply duration of their energy source (battery or harvested energy).

In addition to the supply and threshold voltages, the MEP of a circuit depends on vari-

ous technology, design, and operation parameters, such as transistor size, activity rate and

structure of the circuit [149]. For macro-structures, such as pipelined processors, consisting

of several micro blocks, the activity rate1 and circuit structure (number and type of gates of

individual blocks) have a significant impact on the MEP. The dynamic power is determined

by the supply voltage, activity rate, and structure of the circuit. However, the leakage power

is a strong function of the circuit structure, supply voltage, and threshold voltage. Therefore,

tracking the MEP not only requires optimizing the supply and threshold voltages, but also a comprehensive analysis and understanding of the micro-block structures and their intrinsic inter-block activity rate variation under the same or different workload applications [39].

Although operating at MEP is an effective solution for energy-constrained designs, the state-

of-the-art MEP techniques [148, 150, 83, 151] are applied either at core-level or circuit-level.

The circuit-level solutions [151] are significantly under-optimized when applied to a pipelined

processor due to their lack of high-level information including functionality, structure, and

accurate activity rate of the pipeline stages. Similarly, the core-level MEP solutions [148, 150]

ignore the impact of the difference in the circuit structure and intrinsic activity rate variations

among different processor components such as pipeline stages.

To analyze the impact of functionality, workload, and inter block/stage activity rate variation

on energy efficiency, the dynamic and leakage power consumptions of the pipeline stages are

studied using FabScalar core synthesized with saed 32nm NTC library (0.5V). The heatmap

in Figure 4.10 shows the ratios of the dynamic to leakage power consumptions of the pipeline

stages of FabScalar core for six different workload applications from the SPEC2000 benchmark

suite. Due to the difference in their structure and functionality, the pipeline stages have wide

activity rate variation. The intrinsic inter-stage activity rate variation leads to a wide spread in the dynamic to leakage power ratio (D/L ratio) of the pipeline stages as

shown in Figure 4.10.

1Switching activity or toggle rate


The D/L ratio of the pipeline stages has a significant impact on the MEP of the individual

pipeline stages. For example, since dynamic power is dominant in the stages with higher

D/L ratios (e.g., writeback (WB)), lower supply and threshold voltage values are required for

a better energy efficiency. However, stages with lower D/L ratios (e.g., execute (Exe)) are

dominated by leakage power, in which relatively higher supply and threshold voltage values

improve the energy efficiency of these stages. Another important observation is that the

functionality of the pipeline stages has more influence on the D/L ratio of the stages than the

impact of activity rate variation due to running different workloads. Therefore, a design-time

solution based on the stage functionality is more effective than a complicated runtime tracking

of workload variation to adjust the MEP.
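The rule of thumb above can be summarized in a few lines of Python. The two thresholds and the corner names in the sketch below are illustrative assumptions, not values taken from the FabScalar characterization.

def suggest_corner(d_over_l, leak_dominated=1.5, dyn_dominated=3.5):
    """Map a stage's dynamic-to-leakage (D/L) ratio to a voltage-corner hint.
    The two thresholds are placeholders for illustration only."""
    if d_over_l >= dyn_dominated:
        return "lower Vdd / lower Vth"     # dynamic energy dominates (writeback-like stages)
    if d_over_l <= leak_dominated:
        return "higher Vdd / higher Vth"   # leakage dominates (execute-like stages)
    return "regular Vdd / regular Vth"     # balanced contributions (decode-like stages)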

For a given circuit with a specific activity rate, the MEP supply voltage varies depending

on the threshold voltage. The MEP variation shows the MEP is not only dependent on the

supply voltage, but it also depends on threshold voltage as it has a strong impact on the

leakage power consumption. Hence, the threshold voltage serves as an additional knob for

energy optimization [152]. Various industrial and academic researches [153, 152, 154] have

explored the leakage power reduction potentials of threshold voltage tuning, by employing a

multi-threshold design technique at gate and circuit-levels. Therefore, simultaneous tuning of

supply and threshold voltages is of decisive importance in order to co-optimize the dynamic

and leakage powers.

4.3.2 Fine-grained MEP analysis basics and challenges

A) Minimum energy point analysis

For a given activity rate and threshold voltage value, the energy consumption is decreased

significantly by reducing the supply voltage. However, when the supply voltage is reduced, the delay of the circuit increases remarkably and the leakage energy grows accordingly, which diminishes the overall energy gain. Similarly, for a fixed activity rate and supply voltage, the leakage power is reduced by increasing the threshold voltage. However, as the threshold voltage approaches the supply voltage, the delay increases significantly, leading to an increase in the leakage energy, which eventually reduces the overall energy saving. These phenomena indicate that there exists a unique MEP, a Vdd, Vth pair, that results in the minimum energy consumption. The MEP of a design has voltage and energy components. The voltage component (consisting of the supply and threshold voltages) is the value at which the MEP occurs, while the energy component is the energy consumed at that point. Therefore, MEP analysis should consider the

movement of both supply and threshold voltage components with respect to the resulting

energy consumption of the design.
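A straightforward way to expose both components of the MEP is to sweep the (Vdd, Vth) space and record the minimum-energy pair, as in the sketch below. Here energy_of is an assumed callable (an analytical model or characterized library data), and the voltage grids are placeholders.

def find_mep(energy_of, vdd_grid, vth_grid):
    """Return (Vdd, Vth, energy) of the minimum energy point over a voltage grid."""
    best = None
    for vdd in vdd_grid:
        for vth in vth_grid:
            if vth >= vdd:
                continue  # skip corners where the threshold approaches or exceeds the supply
            energy = energy_of(vdd, vth)
            if best is None or energy < best[2]:
                best = (vdd, vth, energy)
    return best

# Placeholder grids: 0.35 V to 0.70 V supply, 0.20 V to 0.40 V threshold
vdd_grid = [0.35 + 0.05 * i for i in range(8)]
vth_grid = [0.20 + 0.025 * i for i in range(9)]
# mep = find_mep(energy_of, vdd_grid, vth_grid)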

To validate the above idea and illustrate the dependency of the MEP supply voltage on

the threshold voltage (Vth) and activity rate, an experiment is conducted using three different

implementations of an inverter chain for a wide range of activity rates. First the three instances

of the inverter chain are designed with three different process corners (threshold voltage values),

namely typical corner (TT) with Regular Vth (RVT), slow corner (SS) with High Vth (HVT),

and fast corner (FF) with Low Vth (LVT) from the saed 32nm library. Afterward, the MEP

Vdd movements of these designs are tracked for different activity rates as shown in Figure 4.11.



Figure 4.11: MEP supply voltage (0.4V to 0.9V) as a function of the activity rate (0.0 to 1.0) for Regular Vth (RVT), High Vth (RVT+∆Vth), and Low Vth (RVT-∆Vth) inverter chain implementations in the saed 32nm library, where ∆Vth = 25mV.

Figure 4.11 shows that at lower activity rates the slow corner (HVT, high Vth) implementation has a lower MEP supply voltage than both the typical corner (RVT) and fast corner (LVT) implementations. However, when the activity rate increases (≥0.3), the MEP of the high Vth implementation increases rapidly, while the MEPs of the regular and low Vth implementations decrease. Another

when the activity rate is higher than 0.8. The drop in MEP of the high Vth implementation is

because the leakage energy remains constant while the dynamic energy always escalates with

an increase in the supply voltage.

B) Analytical delay and energy approximation for MEP

In order to analytically approximate the minimum energy point, (Vdd, Vth) pair of a circuit,

the dynamic power (Pd), leakage power (Pl), and delay (D) are determined as a function of

the supply voltage (Vdd), threshold voltage (Vth), and activity rate (β) of the circuit. Then,

the energy consumption is analytically approximated as the product of total power and delay

(i.e., E = (Pd + Pl) × D). For CMOS circuits, dynamic power is the power consumed by

charging and discharging of node capacitance, and it has a linear relation to the activity rate

and quadratic relation to the supply voltage.

In current technology nodes operating in the near-threshold voltage domain, the leakage

power is dominated by the sub-threshold drain-to-source current. Hence, the traditional alpha-

power law based leakage power model cannot sufficiently depict this phenomenon. For this

purpose, the authors in [155] presented more accurate trans-regional drain current based leakage and delay approximation models for NTC, by evaluating the dominant terms for circuit delay and energy in three different operating regions: strong, weak, and moderate inversion. In order to obtain an analytical model that works across these three regions, a second-degree polynomial approximation and curve fitting are used to incorporate the drain current effects in the three regions [155]. Therefore, the usage of such accurate near-threshold delay and energy models is vital to analytically approximate the MEP and to reduce


the synthesis time and library characterization effort of synthesis based MEP extraction.
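A sketch of this curve-fitting idea is shown below: delay samples spanning the weak, moderate, and strong inversion regions are fitted with a second-degree polynomial in the log domain using numpy. The sample points are placeholders for SPICE-characterized data, and the resulting coefficients are not those of the model in [155].

import numpy as np

# Placeholder (Vdd, delay) samples covering weak, moderate and strong inversion
vdd = np.array([0.30, 0.40, 0.50, 0.60, 0.80, 1.00])
delay_ns = np.array([38.0, 9.5, 4.2, 2.6, 1.4, 1.0])

# Second-degree polynomial fit of log(delay) vs Vdd so one expression spans all regions
coeffs = np.polyfit(vdd, np.log(delay_ns), deg=2)

def delay_model(v):
    """Analytical delay approximation [ns] evaluated at supply voltage v."""
    return np.exp(np.polyval(coeffs, v))

print(delay_model(0.45))  # quick check at a near-threshold supply voltage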

C) Multi-threshold fabrication and challenges

Multi-threshold CMOS devices are used to reduce the leakage energy while maintaining the

performance requirements. A multi-threshold design uses high threshold voltage transistors

in the non-critical paths for leakage reduction, while low threshold voltage transistors are

used in the critical paths to maintain the performance requirements [152]. These high and

low threshold voltage transistors are used at different levels of abstraction, such as at a gate,

circuit, and architecture level [153, 154]. Multi-threshold circuits are fabricated by implanting

different amounts of dopant ions on the surface of the substrate. The implanted ions increase

or decrease the amount of charge accumulated at the surface of the channel region, which in

turn shifts the transistor threshold voltage [156].

Although the usage of multi-threshold design reduces the leakage energy significantly, the

maximum number of threshold voltages in a design is typically limited, e.g., only three [152],

due to different fabrication and design challenges. For example, during the fabrication process,

additional masks are required to control the dopant ions, which increases the fabrication

cost and effort. Additionally, multi-threshold designs impose several design challenges, such as

proximity problems during place and route, and an increase in the optimization effort required

by computer-aided design tools.

4.3.3 Motivation and problem statement for pipeline stage-level MEP assignment

A) Motivational example

To show the advantages of fine-grained (pipeline stage-level) MEP, (Vdd, Vth) pair assignment,

let us consider a simplified version of the Fabscalar core with only three stages, i.e., decode,

execute, and writeback stages. As shown in Figure 4.10, the dynamic to leakage ratio (D/L) of

the writeback stage is much higher than the D/L ratios of the decode and execute stages (i.e.,

D/Lwb >> D/Ldec >> D/Lexe). The higher D/L ratio indicates that the writeback stage is

dominated by dynamic energy while leakage is dominant in the execute stage. In the case of

the decode stage, however, both dynamic and leakage energies have almost equal contribution.

Hence, assigning core-level (single) MEP for a pipelined processor with such wide D/L ratio

variation is highly energy inefficient.

In order to compare the energy efficiency of the core-level and pipeline stage-level MEPs,

first, the three stages are synthesized using a core-level MEP (regular Vth, Vdd=0.5V pair).

For this configuration, the total energy per cycle of the core, i.e., the sum of the three stages,

is 1.92 µJ as shown by the global MEP curve in Figure 4.12. For energy efficiency improvement

of the simplified core, the pipeline stages should be designed, and operate at different (pipeline

stage-level) MEPs that are determined based on their structure and activity rates. In order

to maintain the operating frequency (i.e., pipeline stage-level delay ≤ core-level delay), the

stage-level MEPs of the pipeline stages are extracted by using the baseline delay as the upper

bound of the delay of the pipeline stages. The delay constraint is useful to improve the overall

energy efficiency by reducing the power consumption. Therefore, the writeback stage should

operate at lower supply and threshold voltages (a Vdd_wb, Vth_wb pair) to decrease its dynamic energy.


Figure 4.12: Energy per cycle (uJ) vs MEP supply voltage for a 3-stage pipeline core with Regular Vth (RVT), High Vth (RVT+∆Vth), and Low Vth in the saed 32nm library, and ∆Vth = 25mV: the global MEP (Regular Vth, 0.5V) yields 1.92 uJ, while the stage-level MEPs (Writeback: Low Vth at 0.4V; Decode: Regular Vth at 0.55V; Execute: High Vth at 0.6V) yield per-stage energies of 0.33 to 0.41 uJ.

In order to reduce the leakage energy, both the execute and decode stages should operate

at relatively higher supply and threshold voltage pairs. Hence, the MEPs of the pipeline stages are related as follows:

Vdd_exe > Vdd_dec > Vdd_wb   and   Vth_exe > Vth_dec > Vth_wb

Thus, in the pipeline stage-level MEP assignment, the writeback stage is implemented using

a low Vth, low Vdd (0.4V) library, while a high Vth, high Vdd (0.6V) library is used for the execute stage. Similarly, a regular Vth, Vdd=0.55V library is used for the decode stage as shown in

Figure 4.12. From the figure, it is clear that the stages obtain their minimum energy when

operating at their local MEPs, which leads to a reduction in the overall energy.

Figure 4.13 shows the energy efficiency improvement of the pipeline stage-level MEP over the

core-level MEP assignment. For all stages, the pipeline stage-level MEP assignment consumes

less energy per cycle than the core-level MEP counterpart. As a result, the total energy per

cycle of the pipeline stage-level MEP implementation is 1.13 µJ which is 40% less than the

core-level MEP.
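The 40% figure follows directly from the two totals reported above, as the short check below confirms.

core_level = 1.92    # uJ per cycle with a single core-level MEP
stage_level = 1.13   # uJ per cycle with per-stage MEPs
print(f"{1.0 - stage_level / core_level:.0%}")  # about 41%, i.e. roughly 40% less energy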

B) Problem statement

Energy-constrained devices usually have specific minimum performance requirements. For

such systems, the MEP, (Vdd, Vth) pair, should be selected so that the minimum system

performance requirement is satisfied. Therefore, determining the minimum target delay (Td)

requirement is crucial to optimize the MEP of the pipeline stages of a processor core. Hence,

for a given target delay Td, the MEP optimization problem is formulated as a function of Vdd,


Figure 4.13: Energy per cycle (uJ) comparison of core-level vs stage-level MEP assignment for the Decode, Execute, and Writeback stages and in total, for a 3-stage pipeline core with Regular Vth (RVT), High Vth (RVT+∆Vth), and Low Vth in the saed 32nm library, ∆Vth = 25mV, target frequency = 67MHz.

Vth, and activity rate (β) as given in Equation (4.1):

\text{Minimize } \sum_{i=1}^{N} P_t^i \quad \text{subject to} \quad \max_{1 \le i \le N} D_i \le T_d \qquad (4.1)

where N is the number of pipeline stages, D is the delay, Pt is the total power (Pt = Pd + Pl)

of the stages approximated analytically [155], and Td is the target delay constraint, which is

determined by the application designer or obtained from the baseline design. Since the total

power and delay are functions of Vdd, Vth, and activity rate (β), the energy minimization

problem is simplified into a dual variable optimization problem in order to approximate the

optimal Vdd, Vth pair. The target delay constraint (Td) in Equation (4.1) is necessary in order

to ensure that the performance requirement is satisfied while reducing the overall energy (i.e.,

reducing power without exceeding the delay constraint so that solving Equation (4.1) gives the

optimal MEP for that particular delay constraint). It has been observed that the energy-delay

curve is flat in the MEP region and hence the delay of a design can be reduced to gain substantial

performance improvement without introducing energy overhead [149]. Thus, the delay of the

pipeline stages is co-optimized in order to track the MEP that satisfies the given performance

requirement.

Since the optimization problem given in Equation (4.1) is an equality constrained multi-variable optimization problem with two unknowns (Vdd and Vth), the Lagrange multiplier method is applied to solve the multi-variable problem and determine the MEPs of the pipeline stages subject to a target delay constraint. The Lagrange multiplier method is a powerful technique for solving equality constrained multi-variable optimization problems without the need to solve the constraints explicitly.


4.3.4 Lagrange multiplier based two-phase hierarchical pipeline stage-level MEP tuning technique

A) Phase-one: analytical MEP approximation

The analytical MEP approximation uses the activity rate and circuit structure (gate compo-

nents and effective capacitance) information of the pipeline stages to determine their individual

MEPs, Vdd Vth pairs. Since the energy and delay of a pipeline stage are strongly dependent on

the supply and threshold voltages, a Lagrange multiplier based optimization strategy is used to

analytically approximate the stage-level MEPs of the pipeline stages. The usage of an analytical approximation model helps to reduce the solution space analytically, and minimizes the library

characterization effort required by the synthesis based energy and delay extraction approach

given in the second phase of the proposed MEP extraction.

Lagrange solver

Since the objective is to minimize the overall energy consumption of a pipelined processor,

the optimization problem is simplified to minimizing the power consumption of the individual

pipeline stages subjected to a delay constraint as shown below.

\text{minimize } P_t(V_{dd_1}, V_{th_1}, V_{dd_2}, V_{th_2}, \ldots, V_{dd_N}, V_{th_N}) \quad \text{subject to} \quad D(V_{dd_1}, V_{th_1}, V_{dd_2}, V_{th_2}, \ldots, V_{dd_N}, V_{th_N}) \le T_d

For a pipelined processor with N stages, the optimal Vdd, Vth pairs of the stages are obtained

using the Lagrange multiplier technique. First, the Lagrangian multiplier technique is dis-

cussed for solving the optimization problem of a single block (pipeline stage). Then, a similar

technique is applied to the other stages in order to obtain their MEP’s independently. For

simplicity, the term P it (Vidd, Vi

th) is used to represent the total power of a pipeline stage i

(P it = P id + P il ). Similarly, the term Dic(V

idd, Vi

th) is used to represent the delay constraint

of the ith pipeline stage. The Lagrangian based solution has three main steps. The Lagrange

function is defined first using a constant multiplier (λ). The constant λ is required to show

the rate of change of the solution (Vdd and Vth) with respect to the equality constraint (Td).

Changing the constant λ leads to a different solution that satisfies the optimization constraint.

Then, the gradient of the function is determined by differentiating the Lagrange function with

respect to the unknown variables (Vdd and Vth). The problem is converted into a system of

linear equations by further differentiating the gradient of the function. Finally, the Vdd, Vth pair is approximated by solving the linear equations.

Step 1: Lagrange function definition

A Lagrange function (Li) of stage i is defined by introducing a new variable λi, commonly

known as Lagrange multiplier as follows:

L^i(V_{dd}^i, V_{th}^i, \lambda^i) = P_t^i(V_{dd}^i, V_{th}^i) - \lambda^i \big( D_c^i(V_{dd}^i, V_{th}^i) - T_d^i \big) \qquad (4.2)


Step 2: Gradient function formulation

For minimum power consumption, set the gradient of L^i equal to zero (i.e., \nabla L^i = 0), and solve for V_dd^i and V_th^i using the partial derivatives of L^i with respect to V_dd^i and V_th^i. Hence:

\frac{\partial L^i}{\partial V_{dd}^i} = \frac{\partial \big( P_t^i(V_{dd}^i, V_{th}^i) - \lambda^i ( D_c^i(V_{dd}^i, V_{th}^i) - T_d^i ) \big)}{\partial V_{dd}^i} = 0 \qquad (4.3)

By applying fundamental rules of differentiation, Equation (4.3) is simplified further as follows:

\frac{\partial P_t^i(V_{dd}^i, V_{th}^i)}{\partial V_{dd}^i} - \frac{\partial \big( \lambda^i \, D_c^i(V_{dd}^i, V_{th}^i) \big)}{\partial V_{dd}^i} = 0 \qquad (4.4)

Similarly, the optimal V_th^i is obtained by differentiating L^i with respect to V_th^i, as shown in Equations (4.5) and (4.6).

\frac{\partial L^i}{\partial V_{th}^i} = \frac{\partial \big( P_t^i(V_{dd}^i, V_{th}^i) - \lambda^i ( D_c^i(V_{dd}^i, V_{th}^i) - T_d^i ) \big)}{\partial V_{th}^i} = 0 \qquad (4.5)

Simplifying Equation (4.5) in the same way gives:

\frac{\partial P_t^i(V_{dd}^i, V_{th}^i)}{\partial V_{th}^i} - \frac{\partial \big( \lambda^i \, D_c^i(V_{dd}^i, V_{th}^i) \big)}{\partial V_{th}^i} = 0 \qquad (4.6)

The differentiation results in Equations (4.4) and (4.6) are simplified into a system of linear equations with two unknowns (V_dd^i and V_th^i).

Step 3: Combining the equalities

In the previous steps, the multi-variable optimization problem is simplified into a system of

linear equations by using Lagrange function. The next step is to solve the linear equations

obtained at step 2, and determine the optimal (Vidd, Vi

th) pair of stage i. A solution to the

linear system is an assignment of values to the variables such that all the equations are satisfied

simultaneously. The linear system with two equations and two unknowns (Vdd and Vth) is easily solved using linear algebra. This analytical solution is applied to all pipeline

stages to obtain their optimal Vdd and Vth pairs based on their state dependent parameters

such as the activity rate (β). It should be noted that the system of linear equations has multiple

solutions that satisfy the optimization constraint but lead to different λ values. Since the value

of λ does not impact the optimization, the algorithm iteratively solves the linear equations in

order to obtain the optimal MEP (Vdd,Vth pair) of the pipeline stages.

Generalized implementation flow

The mathematical solution for the optimization problem is implemented as shown in Fig-

ure 4.14. First, the baseline total power and delay values of the pipeline stages are approxi-

mated analytically. Then, the worst case (overall) delay is determined as the maximum delay

of the individual pipeline stages. Afterward, the multi-variable optimization problem (formu-

lated in Equation (4.1)) is simplified to a system of linear equations by applying the Lagrangian

function, and the optimal Vdd, Vth pairs are approximated by solving the simplified system of

linear equations.


Figure 4.14: Algorithm for solving the MEP of the pipeline stages by using the Lagrangian function and linear algebra. For each stage, the flow takes the fitting parameters, the Vdd, Vth pairs, the average effective capacitance (C), and the average activity rate (β) as inputs; obtains the initial N energy and delay (D) values (N = total stages) and the target (worst-case) delay Td = max_{1≤i≤N} Di; employs the Lagrange multiplier to simplify the multi-variable optimization problem to the system of equalities ∇L(Vdd, Vth, λ), i.e., ∂L(Vdd, Vth, λ)/∂Vdd and ∂L(Vdd, Vth, λ)/∂Vth; solves the linear equations to obtain the optimal Vdd, Vth pairs; obtains the delay (d) using the new Vdd, Vth; and outputs the optimal Vdd, Vth pairs if d ≤ Td, otherwise it iterates.

Once the optimal Vdd, Vth pair is obtained, the new delay value is approximated using the

new Vdd, Vth pair and compared to the delay constraint. If the constraint is satisfied the Vdd,

Vth pair is considered as the optimal MEP approximation for the particular stage. Otherwise,

the algorithm re-solves the linear equations in order to get a new Vdd, Vth pair that satisfies the

specified delay constraint. The flow is applied to all pipeline stages in order to approximate

their MEP Vdd, Vth pairs independently.
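For prototyping, the per-stage flow of Figure 4.14 can be approximated with an off-the-shelf constrained optimizer instead of the hand-derived Lagrangian system; the sketch below uses SciPy's SLSQP method, and the power and delay expressions inside it are simple stand-in toy models whose coefficients are assumptions, not the trans-regional models of [155].

import numpy as np
from scipy.optimize import minimize

def mep_for_stage(beta, c_eff, t_d, x0=(0.5, 0.25)):
    """Approximate one stage's MEP (Vdd, Vth) subject to the delay budget t_d."""
    k = 1e-9      # toy delay coefficient [s*V]
    i0 = 1.5e-4   # toy leakage prefactor [A]
    s = 0.035     # toy sub-threshold slope [V]

    def delay(x):
        vdd, vth = x
        return k / max(vdd - vth, 1e-3)    # guard against non-physical Vdd <= Vth probes

    def total_power(x):                    # Pt = Pd + Pl
        vdd, vth = x
        p_dyn = beta * c_eff * vdd**2 / delay(x)
        p_leak = i0 * vdd * np.exp(-vth / s)
        return p_dyn + p_leak

    res = minimize(total_power, x0, method="SLSQP",
                   bounds=[(0.25, 1.0), (0.10, 0.45)],
                   constraints=[{"type": "ineq", "fun": lambda x: t_d - delay(x)}])
    return res.x                           # approximated (Vdd, Vth) pair

# Example: activity rate 0.1, 1 pF effective capacitance, 15 ns delay budget
# print(mep_for_stage(beta=0.1, c_eff=1e-12, t_d=15e-9))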


B) Second-level optimization: synthesis based MEP clustering

Due to the difference in the structure and activity rate of the pipeline stages, the analytical

MEP approximation technique provides a distinct supply and threshold voltage pair approx-

imation for each pipeline stage. However, these approximated supply and threshold voltage

pairs might have different values for each pipeline stage, which makes the practical imple-

mentation of the pipeline stage-level MEP assignment challenging. Therefore, the number of

Vdd, Vth pairs per design should be limited in order to avoid the barriers of multi-supply and

multi-threshold designs. Thus, clustering the MEPs of the pipeline stages into smaller groups

is essential in order to minimize the number of Vdd, Vth islands in a design. The second-level

synthesis based optimization helps to limit the number of Vdd, Vth pairs of the pipeline stages,

depending on their energy and delay values extracted with the help of an industrial synthesis

tool. Therefore, before presenting the synthesis based clustering approach, it is crucial to start

by discussing the complexity and implementation challenges of a multiple Vdd, Vth design.

Afterward, the industrial synthesis tool based accurate estimation of delay and energy values

of the pipeline stages is performed in order to determine the optimal MEP clusters of the

pipeline stages.

MEP clustering to voltage groups/ islands

To illustrate the need for grouping the pipeline stage MEPs into a limited (smaller) number of Vdd, Vth islands,

let us consider the MEP distribution of K (e.g., K = 12) pipeline stages given in the supply

and threshold voltage space of Figure 4.15(a). As shown in the figure, each pipeline stage has

a unique MEP which demands the designer to use 12 distinct Vdd, Vth pairs. Please note that

each symbol in the figure represents the MEP of a single pipeline stage. The different colors

and shapes are used to represent Vdd and Vth islands as shown in Figure 4.15(b). Hence,

stages with the same color (e.g., red) have relatively close threshold voltage values, and similar

symbol (e.g., star) indicates the stages with close Vdd values which are clustered into one

voltage island.

Assume a feasible multi-supply multi-threshold design has only n supply voltage and m

threshold voltage pairs where n << K and m << K. The values of n and m are deter-

mined by considering various factors including fabrication cost and implementation constraints.

Therefore, to cluster the pipeline stages into n Vdd and m Vth islands, the energy and delay

information of the pipeline stages should be accurately extracted by synthesizing the stages

using commercial EDA tools such as Synopsys design compiler. For this purpose, various li-

braries are characterized according to the analytically approximated MEP Vdd, Vth pairs of

the pipeline stages. The MEP of the pipeline stages obtained using the synthesis tool should

be sorted based on their supply and threshold voltage values. Since the pipeline depth, K,

is limited due to several barriers such as limited degree of instruction-level parallelism and

pipeline hazards (e.g., a maximum of 30 stages for commercial processors [157, 85]), a sim-

ple sorting algorithm is applied to sort the stages with respect to their supply and threshold

voltage values.

The sorted supply and threshold voltages of the pipeline stages are clustered into n supply

and m threshold voltage groups. To cluster the sorted MEPs of the pipeline stages given in Figure 4.15(a), let us consider that a design can have only three supply voltage and three threshold voltage islands (i.e., n = m = 3).


Figure 4.15: Illustrative example for clustering the MEP voltages of different pipeline stages in the Vth (0.2V to 0.45V) vs Vdd (0.2V to 0.55V) space: (a) MEP distribution of the pipeline stages, (b) clustering of the MEPs into 3 Vdd islands and 3 Vth islands.


Algorithm 3 MEP grouping for cost reduction

1:  function MEP-clustering(K libraries (L_K) characterized for K different Vdd, Vth pairs, S (number of islands), ε)
2:      Divide L_K into S groups of size K/S;        ▷ here boundaries are multiples of K/S
3:      Obtain the total energy (Et) using a single library per group;
4:      do
5:          Move the boundaries by ε;
6:          new boundaries ← current boundaries ± ε;
7:          Obtain Et_new;
8:          if Et_new < Et then
9:              Et ← Et_new;
10:     while Et_new ≤ Et
11:     return the clustered S islands;

The division/clustering of the supply and threshold voltages is depicted in Figure 4.15(b), where the threshold voltage islands are

marked using red, gray and green colors while the three supply voltage islands are indicated

by different shapes (i.e., the stages with the same symbol are assigned the same Vdd island).

In the clustering process, the decision on the total number of islands and their boundaries has a direct impact on the implementation overhead and energy efficiency of the pipeline stages.

Therefore, the selection of n and m should be according to:

Maximize n,m | Ic ≤ Tc

where Ic is implementation cost (overhead) of n,m islands and Tc is the target cost.

Synthesis algorithm for clustering

To determine the optimal boundaries (division points) of the supply and threshold voltage

islands, an iterative synthesis algorithm is developed as shown in Algorithm 3. The inputs to the algorithm are K libraries (L_K) characterized for K different Vdd, Vth pairs, the number of voltage islands (S, where S = n or S = m), and the shifting scale ε. The algorithm first divides the list elements into S groups with boundaries at multiples of K/S and assigns a single supply and threshold voltage pair per group to obtain the baseline total energy (Et) (Steps 2 and 3). Then, the division boundaries are shifted iteratively by a small interval ε in order to find the energy-optimal clustering (Steps 4-10).

The iterative process enables the pipeline stages neighboring a division point to be grouped to the most beneficial side. At this stage, different clustering techniques such as K-means

clustering can be applied to cluster the pipeline stages into different groups. However, since the

dataset and number of clusters are small, the adopted grouping algorithm effectively clusters

the pipeline stages with minimum runtime (O(n×k)), where n is the number of entities (pipeline

stages) to cluster and k is the number of clusters. K-means clustering, however, takes O(nk+1)

time. Moreover, due to the smaller dataset size, the clustering outcome of the computationally

costly K-means clustering algorithm may not differ from the result of the adopted simple

algorithm.
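A minimal sketch of this boundary-shifting idea, in a discrete form, is given below: starting from equal-sized groups over the sorted stage MEPs, each boundary is nudged one position at a time while the move keeps reducing the total energy. The callable energy_of_grouping, which stands for the synthesis-based energy extraction, is an assumption of the sketch.

def cluster_by_boundary_shift(sorted_stages, num_islands, energy_of_grouping):
    """Greedy, discrete variant of Algorithm 3.

    sorted_stages      : pipeline stages sorted by their approximated MEP voltage
    num_islands        : S, the allowed number of voltage islands
    energy_of_grouping : callable(stages, boundaries) -> total energy of that clustering
    """
    k = len(sorted_stages)
    boundaries = [i * k // num_islands for i in range(1, num_islands)]  # equal-sized groups
    best_energy = energy_of_grouping(sorted_stages, boundaries)
    improved = True
    while improved:
        improved = False
        for i in range(len(boundaries)):
            for step in (-1, 1):                 # nudge boundary i left or right by one stage
                trial = list(boundaries)
                trial[i] += step
                lo = boundaries[i - 1] + 1 if i > 0 else 1
                hi = boundaries[i + 1] - 1 if i + 1 < len(boundaries) else k - 1
                if not lo <= trial[i] <= hi:
                    continue                     # keep every group non-empty and ordered
                energy = energy_of_grouping(sorted_stages, trial)
                if energy < best_energy:
                    boundaries, best_energy = trial, energy
                    improved = True
    return boundaries, best_energy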


Once the pipeline stages are clustered, the cluster libraries (Vdd, Vth pairs) are used by the

synthesis tool (e.g., Synopsys design compiler) to synthesize the pipeline stages, and extract

their energy and delay values. To further enhance the energy efficiency, various design optimizations and post-synthesis tuning steps are applicable at this stage. Note that although the pipeline

stages are operating at different supply and threshold voltage islands, the clock frequency

remains the same for all stages as it is determined by the target delay (Td).

4.3.5 Implementation issues

A) Memory subsystem for low voltage operation

As discussed in Chapter 3, the functional failure rate of memory components is a challenging

issue for low voltage operation. At lower voltage values, SRAM cells suffer from variation

induced read, write, and hold failures. Additionally, runtime issues, such as soft error and

aging, also significantly increase the failure rate of memory components. The problem of

SRAM cells is aggravated as the SNM of the cells is reduced significantly due to aging and

variation effects. As a result, the supply voltage downscaling potential of embedded memories

is limited by either read SNM (cell stability during a read operation) or write SNM (stability

during a write operation). Hence, memory components should operate at relatively higher supply voltage values than the logic blocks.

In order to guarantee reliable memory operation at lower supply voltage values, an 8T SRAM

cell can be used to implement the memory array. For pipeline stage registers, dual-purpose low

power flip-flops are used to serve as storage elements as well as voltage level converter between

two pipeline stages in a different voltage islands. The usage of a dual-purpose flip-flop helps to

avoid the use of complex voltage level converters and the overhead associated with them. In

order to further enhance the energy efficiency, the low voltage cache implementations discussed

in Chapter 3 can be used for pipeline stage-level MEP designs. However, since pipeline design

is the main focus of this chapter, the energy results and comparisons are reported for the pipeline

stages of the cores used in the case studies.

B) Implementation challenges of multi-threshold design

Voltage level conversion

In a multi-voltage design, if a low voltage (VDDL) gate drives a high voltage (VDDH) gate, it

generates a leakage current as the PMOS in the VDDH gate will not be completely turned-

off [158, 159]. The leakage current affects the output signal swing of the VDDH gate. In order

to address this issue, voltage level converters are placed on the boundary between low voltage

and high voltage regions to prevent the leakage, and restore the VDDH swing with a minimal

area and power overheads.

In the stage-level MEP approach, the level converters should be at the corresponding stage

registers between the VDDL and VDDH stages. Therefore, the overhead is minimized by using

level converting flip-flops [158, 160, 161] (as shown in Figure 4.16) [158] between VDDL and

VDDH stages, which serve both as level converters and stage registers. In comparison to a conventional flip-flop (e.g., the C2MOS flip-flop [162]), the level converting flip-flop has four

additional transistors. It is worth mentioning that level converters are not required when a


Figure 4.16: Dual purpose flip-flop (voltage level conversion and pipeline stage register) with data input D, output Q, clock CLK, and high supply VDD_H; the gates shaded in red are driven by VDDL.

VDDL gate is driven by a VDDH gate, as the PMOS in the VDDL gate will be completely turned off by the high input voltage.

Threshold voltage limitations per design

The goal of a multi-threshold circuit is to selectively scale the threshold voltage for low-power

operation. As presented in Section 4.3.1, granularity (number of threshold voltage domains)

determines the fabrication cost of multi-threshold circuits. Various industrial and academic

research works demonstrated that the feasible maximum number of threshold voltages per

design is three [152, 156]. Therefore, the usage of more than 3 Vdd and Vth islands per design

is not an efficient approach from the perspective of fabrication cost, overhead, and design challenges.

As a result, a maximum of 3 voltage islands per design is considered as a de facto standard

for multi-threshold multi-supply voltage designs. The limitation is elaborated more in Sec-

tion 4.3.6 with the help of Figure 4.20, which illustrates the energy gain and area overhead of different numbers of voltage islands per design. For this purpose, the Vth group size (S) of the adopted clustering given in Algorithm 3 is limited to 3. However, body biasing techniques [163] are useful for further threshold voltage tuning, and increase the granularity at a low cost. Thus, combining

body biasing and multi-threshold devices helps to create more fine-grained threshold voltage

islands.


C) Forward and feedback loop signals and their challenges

Although pipeline stages are designed to be sequential and separated by intermediate stage registers, in practice feedback loop signals are generated by different stages to control

the proper instruction execution. These signals introduce timing challenges in the design of

a multi-supply/ multi-threshold voltage system. Inter-stage feedback signals have no impact

on the timing of the stages as their values are latched, and made available in the subsequent

clock cycle. However, intra-stage feedback signals can potentially affect the timing of circuits.

Hence, the stages with intra-stage feedback signals require a more extensive timing analysis in

order to avoid timing problems and set an appropriate clock latency.

D) Physical design and power planning

In a multiple supply voltage design, a voltage island represents a design block with unique

powering and supply voltage rails. The circuit components within a voltage island are primarily

powered from the island voltage. Hence, a dedicated supply voltage rail is required to power-on

different voltage islands. In the conventional design flow, this is achieved during partitioning

and floorplanning phases of VLSI design flow. During partitioning and floorplanning, different

voltage islands are identified, and a dedicated supply voltage rail is assigned to each voltage

island [164, 78]. This technique is applied in order to add dedicated power rail for the pipeline

stages (voltage islands), and simplify the power planning issue. However, it requires additional effort in the design phase as the design has to satisfy the timing requirements [164].

4.3.6 Experimental results

A) Setup and implementation flow

To validate the effectiveness of the pipeline stage-level MEP assignment technique, the La-

grange multiplier based analytical MEP solver is implemented in C++. Different EDA tools

such as Synopsys design compiler are employed to synthesize the stages and extract the power

and delay results of the pipeline stages. For a proof of concept, FabScalar and OpenSPARC

cores are used as a case study, and their results are compared with a state-of-the-art coarse-

grained core-level MEP technique presented in [165], as well as baseline near-threshold voltage implementations of the two processor cores. The experimental setup, tools, and configurations

Table 4.3: Experimental setup

  Configuration      OpenSPARC                   FabScalar
  Architecture       In-order                    Out-of-order
  Technology         Saed 32nm                   Saed 32nm
  Simulation tool    ModelSim                    NCSim
  Synthesis tool     Synopsys design compiler    Synopsys design compiler
  Workload           Regression                  SPEC2000
  Target frequency   230 MHz                     67 MHz
  Cache latency      2 cycles                    1 cycle


used in this work are summarized in Table 4.3. Since the cores are designed to operate in

the near threshold voltage domain, the target frequencies used in this work are 230MHz and

67MHz for OpenSPARC and FabScalar, respectively. Therefore, the target delay (Td) used in

Equation (4.1) is 4.35ns for OpenSPARC core, and 15ns for FabScalar core.
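Assuming the target delay is simply the reciprocal of the target frequency, these two values follow directly (a worked check, not an additional result):

T_d = 1/f: \quad 1/(230\,\text{MHz}) \approx 4.35\,\text{ns}, \qquad 1/(67\,\text{MHz}) \approx 14.9\,\text{ns} \approx 15\,\text{ns}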

The impact of various workload applications on the energy consumption of the two processor cores is analyzed. For FabScalar, six different workload applications, namely Bzip, Gap, Gzip, Mcf, Parser, and Vortex, are used from the SPEC2000 benchmark suite [132]. The activity rate profiles (SAIF files) of the stages are extracted by conducting post-synthesis simulation using the Cadence NCSim simulation engine [166]. Then, the SAIF files are used as an input to the power compiler for accurate power estimation. In the case of OpenSPARC, several regression tests are used to extract the switching activity profiles of the pipeline stages. The results show that the workload variation has a negligible impact on the energy consumption of the pipeline stages.

B) Case studies for energy efficiency improvement

The impact of activity rate and circuit structure variation on the MEP and energy efficiency of the FabScalar and OpenSPARC cores is investigated on a per-pipeline-stage basis. As discussed in the previous sections, since workload variation has minimal impact on the activity rate variation in particular, and on the MEP in general, analyzing the energy efficiency improvement by considering average workload effects is sufficient. For comparison with the related works, a macro-block (core) level MEP implementation of the two processor cores according to [165] is studied. This core-level MEP solves for the optimal (Vdd, Vth) pair of a processor by considering the average activity rate. However, such an approach is not efficient, as the inherent inter-block activity rate and structural variation of the pipeline stages are not considered.

Figure 4.17: Comparing the energy efficiency improvement of the proposed micro-block (pipeline stage) level MEP [Proposed] and the macro-block level MEP [Related work] over the baseline design of the FabScalar core (energy per cycle in µJ for the Fs1, Fs2, Dec, Ren, Dis, Isq, RR, Exe, LSU, WB, and Ret stages and the total).


Energy efficiency improvement of FabScalar core

Figure 4.17 compares the energy consumption of the pipeline stages of FabScalar core im-

plemented using baseline NTC library, macro-block (core-level) MEP [165], and micro-block

(pipeline stage) level MEP techniques. In general, both core-level [165] and pipeline stage-level

MEP approaches improve the total energy consumption of the core by 30% and 47% respec-

tively. When compared to the core-level MEP [165], the pipeline stage-level MEP has ≈24%

improvement in energy efficiency.

As discussed in the previous sections, due to the intrinsic activity rate variation of the pipeline stages, the core-level MEP achieves a smaller improvement, as it uses the same (Vdd, Vth) pair for all pipeline stages regardless of their variations. The core-level assignment adversely affects the dynamic and leakage energy consumption of several pipeline stages, which eventually reduces the overall energy efficiency. When the intrinsic variation is considered during MEP optimization, the energy efficiency of the pipeline stages is improved significantly with minimal area and power overheads.

Energy efficiency improvement of OpenSPARC core

Similar to the FabScalar core, the effectiveness of the pipeline stage-level MEP assignment

technique is verified using OpenSPARC core which has fewer pipeline stages when compared

to the FabScalar core. Figure 4.18 compares the energy consumption of the pipeline stages

of OpenSPARC core implemented using baseline NTC library, core-level MEP and pipeline

Figure 4.18: Comparing the energy efficiency improvement of the proposed micro-block (pipeline stage) level MEP [Proposed] and the macro-block level MEP [Related work] over the baseline design of the OpenSPARC core (energy per cycle in µJ for the Fetch, Decode, Pick, Execute, Memory, and Wback stages and the total).


Figure 4.19: Effect of Vdd scaling on the energy efficiency of different pipeline stages: (a) pipeline stage-level MEP Vth, (b) core-level MEP [related work] Vth (normalized energy per operation for one cycle; curves for the Fetch, Decode, Pick, Execute, Memory, and Wback stages versus Vdd).


stage-level MEP techniques. The figure shows that the core-level and pipeline stage-level MEP approaches improve the total energy consumption of the core by 32% and 48%, respectively. In comparison to the core-level MEP [165], the pipeline stage-level MEP assignment technique has ≈20% improvement in energy efficiency. Since OpenSPARC has fewer stages and less activity rate variation than FabScalar, the energy improvement of the pipeline stage-level MEP assignment technique over the core-level MEP is slightly lower for the OpenSPARC core (20%) than the improvement achieved for the FabScalar core (24%).

The impact of supply voltage tuning on the energy consumption of the individual pipeline stages of the OpenSPARC core is compared for the core-level and stage-level MEP techniques, as shown in Figure 4.19. It should be noted that the optimal threshold voltage of both techniques is obtained first; then a supply voltage sweep is used to compare the energy consumption behavior of the individual pipeline stages over a wide voltage range. Hence, for the pipeline stage-level MEP assignment technique, there are three activity rate dependent optimal threshold voltage values for the pipeline stages, while there is only one optimal threshold voltage value for the core-level MEP technique, as it is determined for the entire core.

Figure 4.19(a) shows the energy consumption of the pipeline stages of the OpenSPARC core using the pipeline stage-level MEP assignment approach. Since the stage-level MEP has three different Vth classes, i.e., HVT for the Decode and Pick stages, RVT for the Fetch and Write-back stages, and LVT for the Execute and Memory stages, the optimal supply voltages of the stages also vary. In the case of the core-level MEP approach shown in Figure 4.19(b), however, the pipeline stages are less energy-efficient, as the contribution of leakage energy increases differently for each stage. Therefore, the intrinsic activity rate variation of the pipeline stages has a significant impact on the energy efficiency. Hence, considering the impact of variation during the early design phases is of decisive importance for the design of energy-efficient systems.

C) Overhead analysis

Area overhead

An important challenge when using multi-voltage designs is the overhead induced by level converters. However, two potential solutions are employed to minimize the overhead of voltage level converters, depending on the difference between the VDDH and VDDL regions. The authors in [167] presented a voltage level converter free dual-voltage design. This technique is applied to the pipeline stage-level MEP assignment technique, which has fine-grained voltage islands. However, for a higher VDD granularity (i.e., voltage islands with more than 100 mV difference in their Vdd), the voltage level converting flip-flops presented in Section 4.3.5 are employed with minimum overhead. In comparison to the total chip area (area of all stages and stage registers), the usage of voltage level converting flip-flops between VDDL and VDDH regions has a small area overhead (up to 1.6%) for few voltage islands (≤3), as shown in Figure 4.20. However, as the number of voltage islands increases, the overhead of the level converting flip-flops also increases.

As shown in the figure, the energy saving and the area overhead both increase with the number of voltage islands. However, with a further increase in the number of voltage islands (>3), the energy saving saturates while the overhead escalates. Hence, a maximum of three voltage levels provides the better energy saving and overhead trade-off.


Figure 4.20: Energy-saving and area overhead trade-off for different numbers of voltage islands for the FabScalar core (energy gain in % and area overhead in % versus the number of voltage islands; three voltage islands provide the better energy-saving and overhead trade-off).

Performance overhead

At the system level, the performance (runtime of applications) of microprocessors is affected by various factors such as cache misses, branch mispredictions, kernel calls, and interrupts. The effect of these factors on the system performance (IPC) varies from one design to another, depending on the clock frequency and the memory latency. The variation in application runtime affects the energy savings of the pipeline stage-level MEP assignment approach. For a given design with a fixed memory latency, increasing the processor clock period (reducing the frequency) increases the IPC of the design, as the number of clock cycles required for a memory access decreases. The impact of increasing the frequency (reducing the delay from the target delay value Td) on the IPC of the pipeline stage-level MEP design is studied, and the result shows an ≈6% IPC reduction when compared to the baseline design. However, in comparison to the significant energy savings, the 6% IPC reduction of the stage-level MEP assignment is easily tolerated.

4.3.7 Comparison with related works

There have been significant research works discussing different techniques to reduce the power

consumption of modern CMOS devices [168, 24]. Most of these works are based on supply

voltage downscaling as it is one of the more effective methods to reduce power consumption. All

these techniques are broadly classified into three main categories, namely timing speculation

techniques, better-than-worst-case designs, and near-threshold computing techniques.

A) Timing speculation techniques

In order to reduce timing margins and ensure error-free operation under supply voltage downscaling, timing speculation techniques [89, 87, 90] employ dynamic timing error detection and correction. The core concept of these techniques is to use redundant flip-flops (commonly known as Razor flip-flops), which are driven by a delayed clock, in order to detect timing errors. Whenever a timing error is detected, an error signal is raised, and a rollback mechanism is used to ensure correct execution. However, due to the increasing number of timing errors, these techniques are not effective in terms of performance and energy efficiency when the supply voltage is downscaled aggressively to the near/sub-threshold voltage domain.

B) Better-than-worst-case techniques

Several better-than-worst-case designs have been proposed to reduce power consumption by embracing quality degradation of designs. Better-than-worst-case techniques are mainly utilized in the signal processing domain by exploiting the noise-tolerant nature of applications. For example, an algorithmic noise tolerance based technique [169] exploits the error tolerance capability of applications to reduce the power consumption. However, the level of supply voltage downscaling and the power saving benefit of such better-than-worst-case designs are limited by the error rate and the correction overheads.

C) Near threshold techniques

Near-threshold designs rely on downscaling the supply voltage (Vdd) close to the transistor threshold voltage in order to reduce the dynamic power quadratically. Various researchers have proposed different NTC based techniques in order to minimize the power consumption of CMOS devices [148, 6, 18]. Unfortunately, the energy efficiency of these techniques is limited by several factors, such as performance degradation, variability effects, and the increased contribution of leakage energy. Identifying and operating at the minimum energy point is crucial in order to overcome these challenges [165, 83, 151]. The authors in [151] demonstrated that the minimum energy point of a given circuit can vary depending on the activity rate of the circuit. Hence, for a pipelined processor with different functional blocks, using a single minimum energy point is less energy-efficient, as the functional blocks have a wide variation in their activity rates. Therefore, applying a global MEP to a pipelined processor is not effective. Instead, a pipeline stage-level optimal MEP assignment, as presented in this chapter, improves the energy efficiency of pipelined processor cores significantly.

In order to demonstrate the advantages of the pipeline stage-level MEP assignment tech-

nique, it is compared with the core-level MEP technique as well as slack redistribution based

voltage over-scaling techniques. The core-level MEP technique is discussed in the previous

subsection. The second technique presented in [170] is based on slack distribution and voltage

over-scaling for energy reduction. The voltage over-scaling technique iteratively reduces the

supply voltage by tolerating timing errors or using error correction mechanisms.

Table 4.4 compares the energy consumption of the different pipeline stages optimized using the stage-level MEP, core-level MEP, and voltage over-scaling techniques. The table shows that, for all stages, both MEP optimization methods have smaller energy consumption than the voltage over-scaling approach. This is because the timing errors in the voltage over-scaling approach increase when the supply voltage is reduced, which limits the level of voltage downscaling. The table also shows the sub-optimality of the core-level MEP [165] and the voltage over-scaling technique [170] when compared to the pipeline stage-level MEP assignment approach.


Table 4.4: Energy efficiency comparison of the pipeline stage-level MEP assignment with core-level MEP assignment and the voltage over-scaling technique

                     Energy per cycle (µJ)
  Stages             Stage-level MEP    Core-level MEP    Voltage over-scaling
  Fetch              1.07               1.52 (42.1%)      2.05 (91.5%)
  Decode             0.38               0.45 (18.4%)      0.98 (157.8%)
  Pick               0.09               0.10 (11.1%)      0.197 (118.8%)
  Execute            13.95              18.23 (30.6%)     21.7 (55.5%)
  Memory             4.56               5.33 (16.8%)      6.85 (50.2%)
  WriteBack          2.45               3.08 (25.7%)      4.61 (88.2%)
  Total              22.5               28.71 (27.6%)     36.38 (61.7%)
  Area overhead      1.6%               0%                2.7%

Hence, the total energy of the core-level MEP [165] is 27.6% higher than that of the stage-level MEP technique. Similarly, voltage over-scaling [170] consumes more than 60% additional energy. Moreover, the table shows that the pipeline stage-level MEP assignment technique has a smaller area overhead than the voltage over-scaling technique [170]. Although the core-level MEP [165] has zero area overhead, its energy efficiency is much lower than that of the stage-level MEP technique (27% less efficient), which eventually nullifies the area overhead savings.

4.4 Summary

Energy reduction has become the primary concern in the design of energy-constrained applications, such as energy-harvesting devices for IoT applications. In order to address the variation-induced timing uncertainty and improve the overall energy efficiency of pipelined NTC processors, two architectural-level optimization techniques are presented in this chapter.

The first solution employs a variation-aware pipeline balancing technique, which is a design-

time solution to improve the energy efficiency and minimize performance uncertainty of NTC

microprocessors. The variation-aware balancing technique adopts an iterative synthesis flow

in order to balance the stages in the presence of extreme delay variation. The pipeline stages

are synthesized independently using variation-aware NTC standard cell library, and SSTA

is performed to obtain their statistical delay information. Afterward, the statistical delays

of the stages are provided to the synthesis tool in order to iteratively balance the pipeline

stages accordingly. Experimental results show that the variation-aware pipeline balancing

technique improves the energy efficiency of OpenSPARC and FabScalar cores by 55% and

85%, respectively.

The second technique illustrated the dependency of the MEP on the threshold voltage, structure, and activity rate of a circuit. Thus, it demonstrated the limitations and energy inefficiency of using a single (core-level) MEP for pipelined processor cores, in which the pipeline stages have intrinsic structure and activity rate variation. The issue is addressed by using a pipeline


stage-level MEP tuning technique, in which the pipeline stages are designed to operate at

different MEPs by using an activity rate dependent supply and threshold voltage tuning. In

the pipeline stage-level MEP assignment approach, first, the MEPs of the pipeline stages are

approximated analytically. Then, to reduce the implementation cost and overhead, the MEPs

of the pipeline stages are clustered into smaller groups. The effectiveness of the pipeline stage-

level MEP assignment technique is validated by using FabScalar and OpenSPARC cores, and

simulation results show that the stage-level MEP technique improves the energy efficiency of

both FabScalar and OpenSPARC cores by almost 50%.


5 Approximate Computing for Energy-Efficient NTC Design

Most energy-constrained portable and handheld devices employ various signal processing algorithms and architectures, which are an integral part of many multimedia applications, such as video, audio, and image processing. Fortunately, these multimedia applications are inherently error-resilient and leave room for erroneous computation with the aim of improving the performance and energy efficiency of the design. In this context, approximate computing is adopted to aggressively reduce the supply voltage of CMOS circuits by exploiting the inherent error tolerance of multimedia applications. This chapter shows how to exploit approximate computing to improve the energy efficiency of near-threshold designs.

5.1 Introduction

Approximate computing is a promising approach to tolerate timing errors and obtain the utmost benefit from NTC, as it embraces non-important timing errors to improve energy efficiency [25]. Approximate computing relies on the ability of applications to tolerate a loss in quality in exchange for performance and energy efficiency improvements [25]. Since approximate computing relaxes the traditional exact computation requirements, it naturally fits with NTC by effectively tolerating most of the variation-induced timing errors and improving the performance/energy efficiency [26]. Therefore, approximate NTC is an attractive alternative in energy-constrained domains, such as signal processing and big data analysis, as many applications in these domains have inherent error tolerance [25, 26].

Recent works have explored the use of circuit and algorithmic level approximate computing

for low-power designs [171, 172, 173, 174]. These techniques mainly rely on supply voltage

downscaling and exploit the algorithmic noise tolerance to embrace the resulting errors. Such

techniques are sufficient in the super-threshold voltage domain as process variation has a small

impact and hence, the resulting errors are effectively tolerated by approximate computing. In

the near-threshold voltage domain, however, such techniques are not sufficient anymore as the

variation-induced timing errors surpass the error tolerance capability of various applications.

Other works in [43, 44, 175, 176] employ logic simplification techniques to design inaccurate

low-power functional components such as adders and multipliers. However, on top of their

inherent inaccuracy, the wide range of variation-induced timing errors in NTC aggravates the error rate and eventually leads to an unacceptable output quality.

In order to apply approximate computing to NTC designs, it is imperative to identify the

approximable and non-approximable portions of the design based on a quantitative analysis

of timing errors, and their structural and functional propagation probabilities. Therefore, a

detailed analysis of statistical timing errors, as well as structural and functional error prop-

agation probabilities is of decisive importance for approximate NTC design. This chapter


presents a technique to exploit approximate computing in order to improve the energy effi-

ciency of NTC designs. The proposed approximate NTC design technique uses an integrated

framework consisting of statistical timing error analysis and error propagation analysis tools.

In the framework, first, the control logic part of a design is identified and protected from

approximation. Then, the data flow portion of the design is classified into approximable and

non-approximable portions. Afterward, a mixed-timing logic synthesis which applies a tight

timing constraint for the non-approximable portion, and a relaxed timing constraint for the

approximable part is used to synthesize the design. The mixed-timing logic synthesis has two

features:

• It forces the synthesis tool to use faster gates in order to guarantee timing certainty of

the non-approximable portion by applying a tight timing constraint.

• In contrast to the tight constraint, a relaxed timing constraint is used in the approximable portion of the design, so that timing errors are embraced for energy saving. The relaxed timing constraint allows the synthesis tool to use more energy-efficient gates.

Experimental results show that exploiting approximate computing for NTC improves the energy efficiency of a Discrete Cosine Transform (DCT) application by more than 30%. Furthermore, the effectiveness of the approximate computing based NTC design is demonstrated using an image compression application, and it is observed that the energy efficiency improvement is achieved with an acceptable output quality reduction.

The rest of the chapter is organized as follows: approximate computing background and

related works are presented in Section 5.2. Error propagation aware timing relaxation frame-

work for approximate NTC is presented in Section 5.3, followed by the experimental results in

Section 5.4. Finally, Section 5.5 presents the chapter summary.

5.2 Background

5.2.1 Embracing errors in approximate computing

A) Approximate computing basics

Digital Signal Processing (DSP) is the building block of several multimedia applications. Most of these multimedia applications implement image and video processing algorithms, where the final output is either an image or a video. The perceptual limitations of human vision allow the outputs of these algorithms to be numerically approximate. This relaxation of the accuracy requirement leaves room for imprecise or approximate computation. Hence, approximate computation is exploited to realize low-power designs at different levels of abstraction, such as algorithmic, architecture, and logic level approximation.

Reliable and accurate operations dictate the performance and energy efficiency of modern computing platforms. However, several applications in the domains of signal processing, machine vision, and big data analysis have an intrinsic tolerance to inaccuracy, as inaccurate computations may have only a minor impact on the final output quality. Therefore, approximate computing exploits the inherent error resiliency of applications to achieve a desirable trade-off between performance/energy efficiency and output quality. Since approximate computing relaxes the need for exact computation, it adds another dimension to the energy-performance


optimization space, enabling a better performance and energy efficiency trade-off by embracing timing and memory errors.

B) Approximation in near-threshold computing

Variation-induced timing errors prevent NTC designs from operating at their designated nominal frequency. Hence, to guarantee error-free operation, NTC designs have to operate at a lower frequency by adding a large timing guard-band, which affects both energy efficiency and performance. In several applications, however, the effects of variation-induced errors are tolerated by using approximate computing. As shown in this chapter, combining approximate computing and NTC has a significant impact on improving the energy efficiency of various designs with inherent error tolerance capabilities.

5.2.2 Related works

Most of the works in the field of approximate computing rely on algorithmic noise tolerance [171, 172, 173, 174], voltage downscaling, and the design of inaccurate functional components (such as adders and multipliers) [43, 44, 175, 176]. However, all these works target the nominal voltage domain, where the impact of variation is minimal. The authors in [28] employed the concept of decoupling control and data operations onto different cores, in which the control cores are operated at a higher voltage. However, their technique requires voltage level converters and a dedicated power delivery network, which impose significant overhead. Additionally, they provide no protection for data operations and rely on a checkpoint-and-rollback mechanism for error recovery. These choices affect the performance and energy efficiency significantly. In the proposed error propagation-aware timing relaxation approach, however, there is no such overhead, and all control and important data operations are performed in an error-free manner, as errors are only allowed in the approximable data-flow portion of the design.

5.3 Error propagation aware timing relaxation

5.3.1 Motivation and idea

Variation-induced timing errors significantly reduce the performance and energy efficiency of

NTC designs. In this regard, leveraging approximate computing along with NTC improves

the energy efficiency of designs by embracing the timing errors of NTC designs. From a timing criticality perspective, circuit paths are categorized as critical (longer delay) and non-critical

(shorter delay) paths. Similarly, they are classified as important (error sensitive) and non-

important (error tolerant) paths based on the error generation and propagation probabilities.

The “importance” of the paths depends on their structural and functional error propagation

probabilities. This concept is elaborated using an example circuit by classifying the paths

into different groups. For example, Figure 5.1(a) shows the classification of FIR filter circuit

paths into different groups (critical, non-critical, important and non-important) based on their

timing criticality and importance. These classifications are used to relax the timing constraint

of the non-important and non-critical group in order to improve the energy efficiency and

performance of the circuit.


Figure 5.1: Example of an FIR filter circuit showing path classification and timing constraint relaxation of paths: (a) path distribution over the critical/non-critical and important/non-important groups, (b) path relaxation showing the nominal and variation-induced delays of paths P1-P8 relative to Top and Trelax.

Figure 5.1(b) shows the variation-induced delay distribution and timing constraint relaxation

of non-important representative paths. In the conventional design approach, regardless of

their importance and criticality, all paths should be designed and operate at a designated

clock period (Top) which is defined according to the delay of the longest path of the circuit

including variation guard-band. However, it is observed that not all paths of a circuit are

equally important to the final output. Therefore, the energy efficiency and performance are

improved by relaxing the timing constraint of the less critical/ important paths. For this

purpose, first, the data-flow paths/ endpoints of a circuit are classified into approximable

(non-critical or non-important) and non-approximable (critical and important) groups based

on their timing criticality and correctness (error propagation probability). Then, a mixed-

timing logic synthesis is applied to improve the energy efficiency of the circuit.

To illustrate the mixed-timing logic synthesis scenario, let us consider the delay of the paths

given in Figure 5.1(b). Based on timing error and propagation analysis, assume the paths P1,

P6, and P7 are non-approximable, while paths P2, P3, P4, P5, and P8 are approximable.

Therefore, in the mixed-timing logic synthesis, paths P1, P6, and P7 are subjected to a tight

timing constraint (Top) in order to ensure timing correctness of the non-approximable portion.

However, a loose timing constraint Trelax is used for the remaining approximable portion for

energy efficiency improvement. The relaxation amount for the approximable paths depends

on the residual timing errors of the paths, which is obtained from a statistical timing error

analysis, as well as the target error rate for the entire circuit.

5.3.2 Variation-induced timing error propagation analysis

Since the objective is to relax the timing constraint of the approximable paths and to use tight

timing constraint for the non-approximable paths, the identification of the approximable and

non-approximable portions of a circuit is crucial. Hence, a statistical timing error analysis

tool is used to analyze the occurrence probability and distribution of variation-induced timing

errors. Afterward, an error-propagation probability analysis tool is employed to study the

propagation probability of timing errors to the important portion of the circuit.


Figure 5.2: Variation-induced path delay distribution under the assumption of a normal distribution (nominal delay, variation-induced shift, the operational clock Tclk, and the resulting timing error region).

A) Statistical timing error rate analysis

In a static timing analysis, where paths have discrete delay values, the timing error rate is

obtained by analyzing the arrival time of the signals. Thus, for a given circuit with a clock

period of Tclk, the timing error probability (Ep) of the circuit is calculated as a weighted sum

of the paths with a delay greater than the clock period (Tclk) over the total number of paths

as shown in Equation (5.1).

E_p = \frac{\sum_{i=1}^{N} P_i}{M} \qquad (5.1)

where N is the number of paths with delay Di>Tclk, and M is the total number of paths in

the circuit.

Due to the wide variation extent in NTC, however, the delay of the paths becomes a dis-

tribution rather than a discrete value, which makes the weighted sum based error probability

calculation an optimistic approach. Therefore, an accurate error probability estimation based

on Statistical Static Timing Analysis (SSTA) is required. For this purpose, an SSTA based

timing error analysis framework is developed to determine timing error of NTC designs. In

the statistical timing error analysis framework, the variation-induced delay shift is modeled

by a normal distribution (e.g., µ ± 3σ). To show the operation of the statistical timing error

analysis, let us consider the delay distributions of tight and relaxed paths (indicated by the

green and yellow shades) shown in Figure 5.2. Assume (µT , σT ) are the mean and standard

deviation of the tight constraint (green shade), and (µR, σR) are the mean and standard de-

viation pairs for the relaxed path (yellow shade). Therefore, the timing error probability of

the relaxed path is the probability that the path delay is longer than the operational clock (Tclk), which is equivalent to the triangular area shown by the grid pattern in Figure 5.2. Hence,

the statistical timing error probability (EP ) is modeled mathematically as the integral of the

probability density function over the interval [Tclk, µR+3σR], as shown in Equation (5.2) [177].

E_P \approx \int_{T_{clk}}^{\mu_R + 3\sigma_R} f(x \mid \mu_R, \sigma_R)\, dx \qquad (5.2)

where the probability density function f(x) is given in Equation (5.3) [177]:

f(x \mid \mu_R, \sigma_R^2) = \frac{1}{\sigma_R \sqrt{2\pi}} \exp\left(-\frac{(x - \mu_R)^2}{2\sigma_R^2}\right) \qquad (5.3)
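As a concrete illustration of Equations (5.1)-(5.3), the following C++ sketch evaluates the statistical timing error probability of a relaxed path from its (µR, σR) pair and the operational clock, using the normal CDF (via std::erf) to approximate the integral over [Tclk, µR + 3σR]; it also includes the simple static estimate of Equation (5.1). The function names are illustrative and are not part of the actual analysis tool.

#include <cmath>
#include <cstddef>
#include <vector>

// Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
double normalCdf(double z) { return 0.5 * (1.0 + std::erf(z / std::sqrt(2.0))); }

// Statistical timing error probability of one relaxed path (Equation (5.2)):
// probability mass of its delay distribution between Tclk and muR + 3*sigmaR.
double statisticalErrorProbability(double muR, double sigmaR, double tclk) {
    if (sigmaR <= 0.0) return (muR > tclk) ? 1.0 : 0.0;   // degenerate distribution
    double upper = normalCdf(3.0);                        // CDF at muR + 3*sigmaR
    double lower = normalCdf((tclk - muR) / sigmaR);      // CDF at Tclk
    return (upper > lower) ? (upper - lower) : 0.0;
}

// Static timing error probability of a circuit (Equation (5.1)): sum of the
// occurrence probabilities Pi of paths whose discrete delay exceeds Tclk,
// divided by the total number of paths M.
double staticErrorProbability(const std::vector<double>& delay,
                              const std::vector<double>& pathProb, double tclk) {
    if (delay.empty()) return 0.0;
    double sum = 0.0;
    for (std::size_t i = 0; i < delay.size(); ++i)
        if (delay[i] > tclk) sum += pathProb[i];
    return sum / static_cast<double>(delay.size());
}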


Figure 5.3: Timing error generation, propagation, and masking probability from the error site to the primary output of a circuit (erroneous, error-free, propagating, and masking combinational blocks between flip-flop stages, with the error propagation and masked error paths from input to output).

B) Structural error propagation analysis

In a circuit with various combinational and sequential components, timing errors are mani-

fested as delayed signals which cannot be latched by storage elements, such as flip-flops or

latches, during a specific clock period. The delayed signals lead to the latching of an erroneous value, with the possibility of propagating to the primary output or to consecutive flip-flops in the next cycles. Therefore, an erroneous flip-flop value is either:

• Propagated to the primary output, where it gives an erroneous result or causes a system failure.

• Masked in one of the stages before it reaches the primary output. Thus, the error has

no observable effect, and the system functionality is not affected by the error.

To illustrate the propagation and masking probability of timing errors, let us consider the

circuit shown in Figure 5.3. As shown in the figure, a timing error from the erroneous com-

binational blocks (blocks indicated by red) could lead to the storage of wrong values in the

sequential elements (flip-flops). The erroneous data can be propagated further to the pri-

mary output (as indicated by the error propagation forward cone, purple blocks) or may not

propagate due to logic masking (as shown by the error masking backward cone from the yel-

low blocks). Since sequential circuits are the error sites for timing errors, variation-induced

timing error is modeled and analyzed using the existing sequential circuit error modeling tech-

niques [178]. Therefore, a model for soft error rate estimation of sequential circuits is adopted

to analyze the propagation probabilities of timing errors and their effect on the system func-

tionality [179]. Hence, for a design with k primary outputs the system failure probability

(observability of an error from any reachable primary output) due to a wrong flip-flop value

is obtained using Equation (5.4) [179]:

S_{fp} = 1 - \prod_{j=1}^{k} \left[ 1 - E_P \times PO_j \times P_{prop} \right] \qquad (5.4)

where E_P is the statistical timing error probability obtained using Equation (5.2), PO_j is the reachability of primary output j from the error site, and P_prop is the error propagation probability.
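A minimal C++ sketch of Equation (5.4) is given below; it assumes the per-output reachability values PO_j and the propagation probability P_prop are already available from the structural error propagation analysis, and the function name is hypothetical.

#include <vector>

// System failure probability due to a wrong flip-flop value (Equation (5.4)):
// the error is observable if it reaches at least one of the k primary outputs.
double systemFailureProbability(double errorProb,                        // E_P from Equation (5.2)
                                const std::vector<double>& reachability,  // PO_j per primary output
                                double pProp) {                           // P_prop
    double survive = 1.0;  // probability that no primary output observes the error
    for (double poj : reachability)
        survive *= (1.0 - errorProb * poj * pProp);
    return 1.0 - survive;
}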


Figure 5.4: Proposed timing error propagation aware mixed-timing logic synthesis framework (pre-synthesis simulation; timing error analysis framework with statistical timing error, structural and functional error propagation, and system failure analysis; and mixed-timing logic synthesis with Synopsys design compiler / power compiler, the Cadence SSTA engine, and iterative constraint update and re-synthesis).

C) Functional error propagation analysis

In many signal processing algorithms, not all variables and instructions are equally important to the final output [180]. For example, control flow instructions are more important than data flow instructions, and hence any small error in these instructions has a catastrophic effect, such as system crashing or freezing. Similarly, in data flow instructions, the higher-order output bits are more critical for output accuracy than the lower-order bits and the fractional portion of the mantissa in floating point values. Therefore, for data-intensive applications, the energy efficiency is improved further by using functionality based approximation (i.e., by approximating the lower-order bits).

To utilize the functionality based approximation, first, analyzing the role of the bit positions in the correctness of the final output of a given design (e.g., the DCT application detailed in Section 5.4.3) is essential. From this analysis, a weight vector (W_i) is determined and assigned to the primary outputs to indicate their importance to the output accuracy and the level of approximation. For example, for an integer or the mantissa of a floating point value with an N-bit wide primary output, the weight vector is assigned within the closed interval [0, 1], where 1 indicates an important bit (no approximation) and 0 represents the less important output bits (see Equation (5.5)).

\forall i \in \{1, \ldots, N-1\}: \quad W_i = \frac{1}{2^{N-1-i}} \qquad (5.5)

Therefore, the weight vector of the primary output bits is coupled with the error propagation

model presented in Equation (5.4) to increase the level of approximation.

In order to maintain correct program execution flow, protecting all bit positions of variables

related to control flow instructions is essential. Hence, for all control flow related outputs the

weight is set to 1 to indicate that approximation is not allowed as there is no tolerance for

control flow errors.
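The weight assignment can be sketched in C++ as follows; the sketch assumes bit N-1 is the most significant bit, extends the pattern of Equation (5.5) down to bit 0, and pins control-flow related outputs to a weight of 1, as described above. The function name is illustrative.

#include <cmath>
#include <vector>

// Weight vector for an N-bit primary output: data-flow bits get W_i = 1 / 2^(N-1-i)
// (Equation (5.5)), so the MSB gets weight 1 and lower-order bits get smaller
// weights; control-flow related outputs keep weight 1 for every bit position.
std::vector<double> outputWeights(int nBits, bool isControlFlow) {
    std::vector<double> w(nBits, 1.0);
    if (isControlFlow) return w;                     // no approximation allowed
    for (int i = 0; i < nBits; ++i)
        w[i] = std::pow(0.5, nBits - 1 - i);         // bit N-1 -> 1.0, bit 0 -> smallest weight
    return w;
}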

5.3.3 Mixed-timing logic synthesis flow

The overall flow of the timing error propagation aware mixed-timing logic synthesis framework

is presented in Figure 5.4. The analysis starts with a pre-synthesis simulation to extract the Value Change Dump (VCD) file for accurate power estimation. Then, the design is synthesized to obtain the baseline power consumption, timing, and variation-induced delay information. Afterward, the statistical timing error rate, structural error propagation, and functional error


propagation analysis are conducted to identify the approximable portion of the design. Finally,

the approximable portion information is used to update the timing constraint and perform

mixed-timing logic synthesis and power optimization iteratively.

In the mixed-timing logic synthesis, the path endpoints (i.e., flip-flops and primary outputs) are assigned tight or relaxed timing constraints based on the statistical timing error analysis as well as the structural and functional error propagation analysis presented in the previous subsection. Therefore, once the approximable and non-approximable endpoints are identified, the mixed-timing logic synthesis assigns a tight timing constraint (Ttight) to the non-approximable portion of the design and a loose constraint to the approximable portion. Finally, the design operates with a clock period of Ttight.

A) Tight timing constraint for a non-approximable portion

For all endpoints within this group, a tight timing constraint (Ttight) is assigned so that the operational clock always bounds the variation-induced delay. The tight timing constraint guarantees post-manufacturing timing certainty and error-free operation of the important portion of the design, and it forces the synthesis tool to reduce the delay of the paths by using faster gates. In order to consider the impact of process variation at design time, the SSTA tool is provided with a variation-characterized standard cell library so that the operational clock is sufficient even in the presence of variation.

In cases where a non-approximable endpoint has a statistical delay smaller than the tight delay constraint (i.e., Di < Ttight), the endpoint is relaxed for energy improvement by using slower but more energy-efficient gates. However, if there is a timing violation in any of the non-approximable endpoints, the tight constraint (operational clock) is relaxed by a small interval δ (i.e., Ttight + δ), and the synthesis flow resumes iteratively, ensuring timing correctness of the non-approximable portion.

B) Loose timing constraint for an approximable portion

Unlike the non-approximable portion, the approximable part of a design is allowed to violate the operational clock by using a loose timing constraint (Tloose) during synthesis, which increases the path delay in exchange for energy efficiency improvement. In other words, harmless timing errors are allowed systematically in order to save energy. Therefore, since the timing constraint is not critical for this group (Tloose > Ttight), the synthesis tool is free to save energy by using energy-efficient gates. It should be noted that the relaxation amount for the approximable paths depends on the target system error rate and on the effective timing error rate obtained by the statistical, structural, and functional propagation analysis.

Mixed-timing synthesis runtime analysis

To obtain a basic mixed-timing design, the flow depicted in Figure 5.4 has to iterate twice in

the best case (once with tight timing, Ttight, and once with loose timing constraint, Tloose). If

there are timing violations in the non-approximable portion (Ttight), then the tight constraint

is relaxed by a small interval δ (i.e., Ttight + δ) iteratively. Hence, the number of iterations

becomes higher; for example, the mixed-timing optimization of the DCT circuit took six iterations.
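The iteration just described can be summarized with the following C++ sketch; synthesize() and hasViolationInNonApprox() stand in for the actual EDA tool invocations (synthesis and SSTA) and are hypothetical placeholders.

#include <functional>

struct MixedTimingResult { double tTight; double tLoose; int iterations; };

// Driver loop for the mixed-timing flow of Figure 5.4: synthesize with the tight
// constraint on the non-approximable endpoints and the loose constraint on the
// approximable ones; if the non-approximable portion violates timing, relax the
// operational clock by delta and retry.
MixedTimingResult runMixedTimingFlow(
        double tTight, double tLoose, double delta, int maxIter,
        const std::function<void(double, double)>& synthesize,
        const std::function<bool(double)>& hasViolationInNonApprox) {
    int iter = 0;
    while (iter < maxIter) {
        synthesize(tTight, tLoose);
        ++iter;
        if (!hasViolationInNonApprox(tTight))
            break;                                  // non-approximable portion meets timing
        tTight += delta;                            // relax the operational clock
        if (tLoose < tTight) tLoose = tTight;       // keep Tloose >= Ttight
    }
    return {tTight, tLoose, iter};
}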


5.4 Experimental results

5.4.1 Experimental setup

The timing error propagation aware mixed-timing logic synthesis framework, composed of the statistical timing error analysis and functional error propagation analysis, is implemented in C++. The analysis tool presented in [179] is integrated into the error propagation-aware timing relaxation framework for structural error propagation analysis. Synopsys design compiler is employed for synthesis and optimization. Timing analysis in the presence of variation effects is conducted by the Cadence SSTA tool, which uses a variation-characterized saed 32nm NTC library (0.5 V supply voltage). For power analysis, first, ModelSim is used to extract the switching activities in Value Change Dump (VCD) format, which is then parsed into a Switching Activity Interchange Format (SAIF) file. Then, the SAIF file is used by the synthesis tool to obtain an accurate power estimation.

The effectiveness of the technique is demonstrated using a DCT as a case study. The DCT is widely used in image compression techniques to transform a signal (image data) from the spatial domain into the frequency domain. The Lena image is used as a representative benchmark for image compression.

5.4.2 Energy efficiency analysis

To determine the optimal level of approximation in terms of energy efficiency and effective

system timing error rate, the statistical error rate (Ep) (Equation (5.2)), structural propagation

(Sfp) (Equation (5.4)), and functional propagation (FP) (Equation (5.5)) probabilities of all

paths are obtained first. Afterward, for each path, the effective error rate is determined as

the product of its Sfp and FP values as shown in Equation (5.6). Since the structural and

functional error propagation probabilities are considered independent, i.e., a failure occurs

when an error is propagated to a functionally important output, the joint probability is the product of the structural and functional probabilities, i.e., P(S_fp & FP) = P(S_fp) × P(FP).

\forall i \in \text{paths}: \quad E_{eff,i} = S_{fp,i} \times FP_i \qquad (5.6)

Once the effective error rates are obtained, the paths are sorted in ascending order of their effective error rate (Eeff), where paths with a small Eeff are more suitable for approximation. Then, different percentages of the sorted paths are relaxed (starting with the paths that have the smallest Eeff) to determine the optimal approximable portion. Hence, the percentage of relaxed paths is swept from 0% to 100% in steps of 10% to study the impact of the relaxed paths. For example, if the relaxed paths (approximable portion) amount to 10%, this means that the 10% of the paths with the smallest Eeff are relaxed. Hereinafter, the term "relaxed paths" refers to the percentage of paths that are subjected to the loose timing constraint during circuit synthesis. Since the paths of the DCT have unbalanced delays, i.e., some paths are longer while others are shorter even with a tighter clock, the energy efficiency improvement does not always increase with the number of relaxed paths. Therefore, increasing the number of relaxed paths no longer improves the energy efficiency beyond a certain point (e.g., 70%), and the improvement eventually saturates.
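The selection of the relaxed paths can be sketched in C++ as follows; the structure and function names are illustrative, and the per-path S_fp and FP values are assumed to come from the analyses of Section 5.3.2.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

struct PathInfo {
    double sFp;   // structural failure probability, Equation (5.4)
    double fp;    // functional importance weight, Equation (5.5)
    double eEff;  // effective error rate, Equation (5.6)
};

// Compute E_eff for every path, sort the paths in ascending E_eff order, relax
// the requested fraction of them, and return their indices together with the
// resulting effective system error rate (sum of the relaxed paths' E_eff values).
std::pair<std::vector<std::size_t>, double>
selectRelaxedPaths(std::vector<PathInfo>& paths, double relaxedFraction) {
    for (PathInfo& p : paths)
        p.eEff = p.sFp * p.fp;                                   // Equation (5.6)

    std::vector<std::size_t> order(paths.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return paths[a].eEff < paths[b].eEff; });

    std::size_t count = static_cast<std::size_t>(relaxedFraction * paths.size());
    if (count > paths.size()) count = paths.size();
    std::vector<std::size_t> relaxed(order.begin(), order.begin() + count);

    double systemEER = 0.0;                                      // SystemEER of Section 5.4.2
    for (std::size_t idx : relaxed) systemEER += paths[idx].eEff;
    return {relaxed, systemEER};
}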


Figure 5.5: Energy efficiency improvement (%) for different levels of approximation (% of relaxed paths). The curves correspond to the relaxation amount as a % of the clock (slacks of 10%, 20%, 25%, 35%, and 50%).

A) Optimal relaxation slack analysis

The impact of using different slacks on the energy efficiency and effective system error rate of a given circuit for different approximation levels is quantified through extensive error analysis and simulation. In this analysis, the energy efficiency improvement and timing error rate of 10%, 20%, 25%, 35%, and 50% relaxation slacks are presented. It should be noted that a 10% relaxation means that the relaxed clock is 10% longer than the tight (operational) clock, i.e., Trelax = 1.1 × Ttight; the other relaxation slacks of 20%, 25%, 35%, and 50% have an analogous relation to the operational clock.

The energy efficiency improvement and effective system timing error rate of the five relaxation slacks for different percentages of relaxed paths of the DCT circuit are given in Figures 5.5 and 5.6. As shown in Figure 5.5, the energy efficiency improvement increases with an increase in the percentage of relaxed paths. However, it saturates when the relaxed paths exceed 70%. The saturation occurs because some paths cannot be optimized further regardless of the optimization effort.

The effective system timing error rate (see Figure 5.6) is obtained as the sum of the effective error rates (Eeff) of the relaxed paths, i.e., SystemEER = \sum_{i=1}^{N} E_{eff,i}, where N is the total number of relaxed paths. The two figures show that the 35% relaxation slack provides a better trade-off between energy efficiency and error rate than the other relaxation slacks. Additionally, for all relaxation slacks, there is a sharp increase in the effective system timing error rate when the relaxed paths exceed 60%, while the energy efficiency improvement is saturating. Hence, 60% relaxed paths provide a good balance between energy and effective system error rate.


Figure 5.6: Effective system error rate for different levels of accuracy (% of relaxed paths). The curves correspond to the amount by which the selected paths are relaxed, as a % of the clock (slacks of 10%, 20%, 25%, 35%, and 50%).

B) Energy efficiency improvement of different optimization methods

Depending on the classification scheme used (timing criticality or error propagation), the approximable portion of a design varies from one scheme to another. For example, if timing criticality is the only metric, then all non-critical paths are approximated regardless of their error propagation probability and their impact on the final output. Therefore, the energy efficiency improvement of three optimization schemes, namely timing criticality based, importance based (structural and functional error propagation), and a combination of both, is studied.

Figure 5.7 shows the energy improvement of the timing criticality, importance, and combined criticality and importance based optimization methods for the same system error rate (0.08, or 60%

Figure 5.7: Energy efficiency improvement (%) of the different optimization schemes (timing criticality, importance, and combined timing criticality and importance) for relaxation slacks of 10%, 20%, 25%, 35%, and 50%.


Figure 5.8: Post-synthesis Monte Carlo simulation flow (input image partitioning into 8×8 blocks, pixel extraction and down-sampling, DCT operation on the synthesized netlist with random timing error injection through a modified SDF file, image reconstruction, and PSNR analysis).

relaxed paths) of the DCT using different relaxation slacks. The figure shows that, for all slacks, applying only timing criticality based or only importance based optimization yields a smaller energy saving, as the level of approximation is limited. However, it is observed that a better energy saving (more than 30% over a non-approximate NTC design) is obtained when all optimization techniques (timing, structural, and functional) are applied in combination, as they provide more approximation freedom.

5.4.3 Application to image processing

Approximate computing is widely employed in many DSP algorithms used in audiovisual applications, as they are inherently error-resilient. Among the various multimedia systems, an image compression algorithm is used to show the effectiveness of the timing error propagation-aware approximation technique. However, it should be noted that the proposed error propagation-aware timing relaxation technique is generic and can be applied to any error-resilient system that offers approximation freedom. The DCT algorithm linearly transforms input data into the frequency domain, where it is represented by a set of coefficients. It is an integral part of different multimedia applications such as JPEG.

Figure 5.9: PSNR distribution (dB) for different approximation levels (20% to 100% relaxed paths) for a 35% relaxation slack.


Since variation-induced timing errors are stochastic in nature, a Monte Carlo based post-synthesis simulation flow is developed to quantify the effect of statistical timing errors on the output quality of image compression when using the error propagation-aware timing relaxation approach. Figure 5.8 presents the post-synthesis simulation framework, which is composed of preprocessing, post-synthesis simulation, and post-processing. First, the input grayscale image is preprocessed by converting it to a pixel array and partitioning it into 8×8 blocks. Then, one thousand Monte Carlo post-synthesis simulations are performed by modifying the Standard Delay Format (SDF) file and randomly injecting timing errors based on the SSTA delay distributions of the relaxed paths. Finally, in the post-processing phase, the compressed images are decompressed by applying the inverse DCT (IDCT), and the corresponding Peak Signal to Noise Ratio (PSNR) values are extracted. Figure 5.9 shows the PSNR distribution for different percentages of relaxed paths with a 35% relaxation slack. Figure 5.10 shows examples of the output images for the baseline (a non-approximate NTC implementation) and for 20%, 60%, and 80% relaxed paths. Since the 80% relaxed paths have a higher error rate, the output image given in Figure 5.10(d) has a lower quality than the output images presented in Figures 5.10(a), 5.10(b), and 5.10(c).
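For completeness, the PSNR extraction used in the post-processing phase can be sketched as follows, assuming 8-bit grayscale images of equal size; the function name is illustrative and the formula is the standard PSNR = 10 log10(255^2 / MSE).

#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// PSNR (in dB) between an original and a reconstructed 8-bit grayscale image.
double psnr(const std::vector<std::uint8_t>& original,
            const std::vector<std::uint8_t>& reconstructed) {
    double mse = 0.0;
    for (std::size_t i = 0; i < original.size(); ++i) {
        double diff = static_cast<double>(original[i]) - static_cast<double>(reconstructed[i]);
        mse += diff * diff;
    }
    mse /= static_cast<double>(original.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();  // identical images
    return 10.0 * std::log10((255.0 * 255.0) / mse);
}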

Figure 5.10: Comparison of output quality for different levels of approximation (% of relaxed paths): (a) baseline (0%), PSNR = 34.1, SEER = 0.0; (b) 20% relaxation, PSNR = 28.1, SEER = 0.01; (c) 60% relaxation, PSNR = 21.2, SEER = 0.08; (d) 80% relaxation, PSNR = 16.5, SEER = 0.29.

5.5 Summary

The widespread applicability of NTC is hindered by several barriers, such as the increase in variation-induced timing errors. These problems are addressed in the scope of approximate computing by tolerating non-important variation-induced timing errors. This chapter presented a framework that exploits the error tolerance potential of approximate computing for energy-efficient NTC designs. In the framework, a statistical timing error analysis, as well as structural and functional error propagation analyses, is performed to identify the approximable portion of a design. Then, a mixed-timing logic synthesis is employed to improve the energy efficiency by embracing errors in the approximable portion of the design. Experimental results show that the technique improves the energy efficiency of the approximate NTC design by more than 30%.


6 Conclusion and Remarks

6.1 Conclusions

With the abundance of battery-powered devices for IoT applications, the demand for energy

efficiency and longer battery lifetime has increased significantly. The most effective way to

reduce the energy consumption of CMOS circuits is to decrease the dynamic and static powers

consumed by the circuits. Dynamic and leakage power reduction is achieved by aggressively

downscaling the supply voltage of CMOS circuits. Therefore, supply voltage downscaling is

a promising approach to realize energy-efficient operation for energy-constrained application

domains such as IoT. This dissertation studied the potentials of supply voltage downscaling

to reduce the energy consumption with a main emphasis on near-threshold computing (NTC),

where the supply voltage is set to be close to the threshold voltage of a transistor, for energy-

efficient operation. NTC enables energy savings on the order of 10×, with a linear performance

degradation, providing a much better energy/performance trade-off. However, in addition to

performance reduction, NTC comes with its own set of challenges that hinder its widespread

applicability. The main challenges for energy-efficient NTC operation are: 1) performance

loss; 2) increased sensitivity to variation effects; and 3) high functional failure rate of storage

elements. This dissertation started with the analysis of the main NTC barriers and state-of-the-

art solutions to overcome them. Then, the dissertation provided different architecture-level

solutions to tackle those barriers and improve the reliability and energy efficiency of NTC

designs. In general, the main contributions of this dissertation are summarized as follows:

• Memory failure analysis framework for wide supply voltage ranges is presented. The

framework helps to study the impact of different memory failure mechanisms and their

interdependence across different supply voltage values (from the super-threshold to near-

threshold voltage domain). Moreover, the dissertation presented memory failure mitiga-

tion techniques in order to enable reliable NTC operation.

• The impact of process variation on the performance and energy efficiency of NTC proces-

sor pipeline stages is studied, and a variation-aware pipeline delay balancing technique

is adopted to reduce the impact of process variation, and improve the energy efficiency

of pipeline stages operating in the near-threshold voltage domain. In order to further

improve the energy efficiency of pipeline stages, a fine-grained Minimum Energy Point

(MEP) tuning technique is presented in the dissertation. In the presented MEP assign-

ment technique, the individual pipeline stages of a processor are designed to operate at local MEPs obtained through pipeline stage-level supply and threshold voltage tuning (a minimal illustrative sketch of this MEP search follows this list).

• The dissertation also explored the potential of an emerging computing paradigm, approximate computing, in improving the energy efficiency of NTC designs. In particular, the dissertation demonstrated how to achieve a desirable trade-off between the energy efficiency and the output quality of NTC designs with the help of approximate computing.
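As referenced above, the following toy sketch illustrates the local MEP idea: sweep a single pipeline stage's supply voltage under a simple analytical delay/energy model and report the voltage at which the total energy per operation is minimized. The model form, constants, and function name are assumptions for illustration only and do not reproduce the dissertation's characterization flow.

```python
import numpy as np

def energy_per_op(vdd, c_eff=1.0, i_leak=0.02, vth=0.35):
    """Toy energy model for one pipeline stage (arbitrary units)."""
    delay = vdd / np.maximum(vdd - vth, 1e-3) ** 1.5  # alpha-power-law-style delay growth
    e_dyn = c_eff * vdd ** 2                          # dynamic energy per operation
    e_leak = i_leak * vdd * delay                     # leakage energy over the (longer) cycle
    return e_dyn + e_leak

# Sweep the stage supply voltage and report the local minimum energy point (MEP).
vdd = np.linspace(0.4, 1.0, 61)
energy = energy_per_op(vdd)
print(f"local MEP near Vdd = {vdd[np.argmin(energy)]:.2f} (model units)")
```

With these illustrative constants the minimum falls near the threshold voltage, which is the qualitative behavior the MEP tuning technique exploits.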


In conclusion, the methodologies and solutions presented in this dissertation are viable alter-

natives towards resilient and energy-efficient NTC processor design. Other published works

not included in this dissertation [118, 38] also explore the potential of power gating and non-volatile microprocessor design for energy-efficient ultra-low-power processors. Therefore,

all these contributions are crucial steps in the design and implementation of resilient and

energy-constrained microprocessor architectures.

6.2 Remarks

It should be noted that the solutions and methodologies presented in this dissertation are

generic and can be applied to a broad range of designs targeting resilience and energy efficiency. Among the various application domains, brain-inspired neuromorphic computing benefits from the solutions presented in this dissertation. Neuromorphic systems, which are widely employed in a variety of classification and recognition tasks, have a high computational energy demand; hence, their energy-efficient implementation is of great interest.

Therefore, energy-efficient design approaches such as the solutions presented in this disser-

tation are crucial for the widespread applicability and the deployment of cognitive tasks in

energy-constrained embedded and IoT platforms.


Bibliography

[1] Chris A Mack. Fifty years of moore’s law. IEEE Transactions on semiconductor manu-

facturing, 2011.

[2] Chris Mack. The multiple lives of moore’s law. IEEE Spectrum, 2015.

[3] Chi-Wen Liu and Chao-Hsiung Wang. Finfet device and method of manufacturing same,

May 13 2014. US Patent 8,723,272.

[4] Yen-Huei Chen, Wei-Min Chan, Wei-Cheng Wu, Hung-Jen Liao, Kuo-Hua Pan, Jhon-

Jhy Liaw, Tang-Hsuan Chung, Quincy Li, Chih-Yung Lin, Mu-Chi Chiang, et al. A 16

nm 128 mb sram in high-k metal-gate finfet technology with write-assist circuitry for

low-vmin applications. IEEE Journal of Solid-State Circuits, 2015.

[5] Abbas Sheibanyrad, Frederic Petrot, Axel Jantsch, et al. 3D integration for NoC-based

SoC Architectures. Springer, 2011.

[6] Ronald G Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, and Trevor

Mudge. Near-threshold computing: Reclaiming moore’s law through energy efficient

integrated circuits. Proceedings of the IEEE, 2010.

[7] Hadi Esmaeilzadeh, Emily Blem, R St Amant, Karthikeyan Sankaralingam, and Doug

Burger. Dark silicon and the end of multicore scaling. IEEE Micro, 2012.

[8] Timothy Normand Miller. Architectural Solutions for Low-power, Low-voltage, and Un-

reliable Silicon Devices. PhD thesis, The Ohio State University, 2012.

[9] Himanshu Kaul, Mark Anders, Steven Hsu, Amit Agarwal, Ram Krishnamurthy, and

Shekhar Borkar. Near-threshold voltage (ntv) design: opportunities and challenges. In

Proceedings of the 49th Annual Design Automation Conference, 2012.

[10] Mohammad Saber Golanbari, Anteneh Gebregiorgis, Fabian Oboril, Saman Kiamehr,

and Mehdi Baradaran Tahoori. A cross-layer approach for resiliency and energy effi-

ciency in near threshold computing. In Computer-Aided Design (ICCAD), IEEE/ACM

International Conference on, 2016.

[11] Ulya R Karpuzcu, Nam Sung Kim, and Josep Torrellas. Coping with parametric variation

at near-threshold voltages. IEEE Micro, 2013.

[12] Ulya R Karpuzcu, Abhishek Sinkar, Nam Sung Kim, and Josep Torrellas. Energysmart:

Toward energy-efficient manycores for near-threshold computing. In High Performance

Computer Architecture (HPCA), IEEE 19th International Symposium on, 2013.

[13] Anteneh Gebregiorgis and Mehdi B Tahoori. Reliability and performance challenges of

ultra-low voltage caches: A trade-off analysis. In IEEE 24th International Symposium

on On-Line Testing And Robust System Design (IOLTS), 2018.

[14] Nathaniel Pinckney, Ronald G Dreslinski, Korey Sewell, David Fick, Trevor Mudge,


Dennis Sylvester, and David Blaauw. Limits of parallelism and boosting in dim silicon.

IEEE Micro, 2013.

[15] Ronald G Dreslinski, David Fick, Bharan Giridhar, Gyouho Kim, Sangwon Seo, Matthew

Fojtik, Sudhir Satpathy, Yoonmyung Lee, Daeyeon Kim, Nurrachman Liu, et al. Cen-

tip3de: A 64-core, 3d stacked near-threshold system. IEEE Micro, 2013.

[16] Kelin J Kuhn. Reducing variation in advanced logic technologies: Approaches to process

and design for manufacturability of nanoscale cmos. In Electron Devices Meeting, IEDM.

IEEE International, 2007.

[17] Kelin J Kuhn, Martin D Giles, David Becher, Pramod Kolar, Avner Kornfeld, Roza

Kotlyar, Sean T Ma, Atul Maheshwari, and Sivakumar Mudanai. Process technology

variation. IEEE Transactions on Electron Devices, 2011.

[18] Anteneh Gebregiorgis, Saman Kiamehr, Fabian Oboril, Rajendra Bishnoi, and Mehdi B

Tahoori. A cross-layer analysis of soft error, aging and process variation in near threshold

computing. In Design, Automation & Test in Europe Conference & Exhibition (DATE),

2016.

[19] Benton H Calhoun and Anantha Chandrakasan. A 256kb sub-threshold sram in 65nm

cmos. In Solid-State Circuits Conference. ISSCC. Digest of Technical Papers. IEEE

International, 2006.

[20] Anteneh Gebregiorgis and Mehdi B Tahoori. Reliability analysis and mitigation of near

threshold caches. In Mixed Signals Testing Workshop (IMSTW), International, 2017.

[21] Ronald Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, and Trevor Mudge. Near thresh-

old computing: Overcoming performance degradation from aggressive voltage scaling. In

Proc. Workshop Energy-Efficient Design, 2009.

[22] Anteneh Gebregiorgis, Rajendra Bishnoi, and Mehdi B Tahoori. A comprehensive reli-

ability analysis framework for ntc caches: A system to device approach. IEEE Transac-

tions on Computer-Aided Design of Integrated Circuits and Systems, 2018.

[23] Anteneh Gebregiorgis and Mehdi B Tahoori. Fine-grained energy-constrained micro-

processor pipeline design. IEEE Transactions on Very Large Scale Integration (VLSI)

Systems, 2018.

[24] Anteneh Gebregiorgis, Mohammad Saber Golanbari, Saman Kiamehr, Fabian Oboril,

and Mehdi B Tahoori. Maximizing energy efficiency in ntc by variation-aware micro-

processor pipeline optimization. In Proceedings of the International Symposium on Low

Power Electronics and Design, 2016.

[25] Jie Han and Michael Orshansky. Approximate computing: An emerging paradigm for

energy-efficient design. In Test Symposium (ETS), 18th IEEE European, 2013.

[26] Haider AF Almurib, T Nandha Kumar, and Fabrizio Lombardi. Inexact designs for

approximate low power addition by cell replacement. In Design, Automation & Test in

Europe Conference & Exhibition (DATE), 2016.

[27] Anteneh Gebregiorgis, Saman Kiamehr, and Mehdi B Tahoori. Error propagation aware

timing relaxation for approximate near threshold computing. In Proceedings of the 54th

Annual Design Automation Conference, 2017.


[28] Ismail Akturk, Nam Sung Kim, and Ulya R Karpuzcu. Decoupled control and data

processing for approximate near-threshold voltage computing. IEEE Micro, 2015.

[29] Mark Neisser and Stefan Wurm. Itrs lithography roadmap: 2015 challenges. Advanced

Optical Technologies, 2015.

[30] Rich McGowen, Christopher A Poirier, Chris Bostak, Jim Ignowski, Mark Millican,

Warren H Parks, and Samuel Naffziger. Power and temperature control on a 90-nm

itanium family processor. IEEE Journal of Solid-State Circuits, 2006.

[31] Vivek De, Sriram Vangal, and Ram Krishnamurthy. Near threshold voltage (ntv) com-

puting: Computing in the dark silicon era. IEEE Design & Test, 2017.

[32] Ali Pahlevan, Javier Picorel, Arash Pourhabibi Zarandi, Davide Rossi, Marina Zapater,

Andrea Bartolini, Pablo G Del Valle, David Atienza, Luca Benini, and Babak Falsafi.

Towards near-threshold server processors. In Design, Automation & Test in Europe

Conference & Exhibition (DATE), 2016, 2016.

[33] Nam Sung Kim, Todd Austin, David Blaauw, Trevor Mudge, Krisztian Flautner, Jie S Hu,

Mary Jane Irwin, Mahmut Kandemir, and Vijaykrishnan Narayanan. Leakage current:

Moore’s law meets static power. Computer, 2003.

[34] Pushpa Saini and Rajesh Mehra. Leakage power reduction in cmos vlsi circuits. Inter-

national Journal of Computer Applications, 2012.

[35] Andrew B Kahng, Bin Li, Li-Shiuan Peh, and Kambiz Samadi. Orion 2.0: A fast

and accurate noc power and area model for early-stage design space exploration. In

Proceedings of the conference on Design, Automation and Test in Europe, 2009.

[36] Liang Wang and Kevin Skadron. Dark vs. dim silicon and near-threshold computing

extended results. University of Virginia Department of Computer Science Technical

Report TR-2013-01, 2012.

[37] Xuning Chen and Li-Shiuan Peh. Leakage power modeling and optimization in inter-

connection networks. In Proceedings of the 2003 international symposium on Low power

electronics and design, 2003.

[38] Mohammad Saber Golanbari, Anteneh Gebregiorgis, Elyas Moradi, Saman Kiamehr,

and Mehdi B Tahoori. Balancing resiliency and energy efficiency of functional units

in ultra-low power systems. In Proceedings of the 23rd Asia and South Pacific Design

Automation Conference, 2018.

[39] Yogesh K Ramadass and Anantha P Chandrakasan. Minimum energy tracking loop with

embedded dc-dc converter delivering voltages down to 250mv in 65nm cmos. In IEEE

International Solid-State Circuits Conference. Digest of Technical Papers, 2007.

[40] Surhud Khare and Shailendra Jain. Prospects of near-threshold voltage design for green

computing. In VLSI Design and 2013 12th International Conference on Embedded Sys-

tems (VLSID), 2013 26th International Conference on, 2013.

[41] Ronald Dreslinski Jr. Near threshold computing: From single core to many-core energy

efficient architectures. 2011.

[42] Mark D Hill and Michael R Marty. Amdahl’s law in the multicore era. Computer, 2008.

[43] Vaibhav Gupta, Debabrata Mohapatra, Anand Raghunathan, and Kaushik Roy. Low-


power digital signal processing using approximate adders. IEEE Transactions on

Computer-Aided Design of Integrated Circuits and Systems, 2013.

[44] Zhixi Yang, Ajaypat Jain, Jinghang Liang, Jie Han, and Fabrizio Lombardi. Approximate

xor/xnor-based adders for inexact computing. In Nanotechnology (IEEE-NANo), 13th

IEEE Conference on, 2013.

[45] Farzad Samie, Lars Bauer, and Jorg Henkel. Iot technologies for embedded computing:

A survey. In Proceedings of the Eleventh IEEE/ACM/IFIP International Conference on

Hardware/Software Codesign and System Synthesis, 2016.

[46] Imran Khan, Fatna Belqasmi, Roch Glitho, Noel Crespi, Monique Morrow, and Paul Po-

lakos. Wireless sensor network virtualization: A survey. IEEE Communications Surveys

& Tutorials, 2016.

[47] Trevor Mudge. Power: A first-class architectural design constraint. Computer, 2001.

[48] David Meisner, Brian T Gold, and Thomas F Wenisch. Powernap: eliminating server

idle power. In ACM sigplan notices, 2009.

[49] Wes Felter, Karthick Rajamani, Tom Keller, and Cosmin Rusu. A performance-

conserving approach for reducing peak power consumption in server systems. In Pro-

ceedings of the 19th annual international conference on Supercomputing, 2005.

[50] Ashutosh Dhodapkar, Gary Lauterbach, Sean Lie, Dhiraj Mallick, Jim Bauman, Sundar

Kanthadai, Toru Kuzuhara, Gene Shen, Min Xu, and Chris Zhang. Seamicro sm10000-64

server: building datacenter servers using cell phone chips. In Hot Chips 23 Symposium

(HCS), 2011 IEEE, 2011.

[51] Thomas N Theis and H-S Philip Wong. The end of moore’s law: A new beginning for

information technology. Computing in Science & Engineering, 2017.

[52] Christian Piguet. Low-power electronics design. CRC press, 2004.

[53] Mingoo Seok, Gregory Chen, Scott Hanson, Michael Wieckowski, David Blaauw, and

Dennis Sylvester. Cas-fest 2010: Mitigating variability in near-threshold computing.

IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2011.

[54] Saman Kiamehr, Mojtaba Ebrahimi, Mohammad Saber Golanbari, and Mehdi B

Tahoori. Temperature-aware dynamic voltage scaling to improve energy efficiency of

near-threshold computing. IEEE Transactions on Very Large Scale Integration (VLSI)

Systems, 2017.

[55] David Bull, Shidhartha Das, Karthik Shivashankar, Ganesh S Dasika, Krisztian Flaut-

ner, and David Blaauw. A power-efficient 32 bit arm processor using timing-error detec-

tion and correction for transient-error tolerance and adaptation to pvt variation. IEEE

Journal of Solid-State Circuits, 2011.

[56] Bo Liu, Hamid Reza Pourshaghaghi, Sebastian Moreno Londono, and Jose Pineda

de Gyvez. Process variation reduction for cmos logic operating at sub-threshold sup-

ply voltage. In 14th Euromicro Conference on Digital System Design, 2011.

[57] Juergen Pille, Chad Adams, Todd Christensen, Scott R Cottier, Sebastian Ehrenreich,

Fumihiro Kono, Daniel Nelson, Osamu Takahashi, Shunsako Tokito, Otto Torreiter,

et al. Implementation of the cell broadband engine in 65 nm soi technology featuring


dual power supply sram arrays supporting 6 ghz at 1.3 v. IEEE Journal of Solid-State

Circuits, 2008.

[58] Xiaoyao Liang, David Brooks, and Gu-Yeon Wei. Process variation tolerant circuit with

voltage interpolation and variable latency, February 23 2010. US Patent 7,667,497.

[59] Luis Basto. First results of itc’99 benchmark circuits. IEEE Design & Test of Computers,

2000.

[60] Jesper Knudsen. Nangate 45nm open cell library. CDNLive, EMEA, 2008.

[61] Liberate variety statistical characterization. https://www.cadence.com/content/cadence-www/global/en_US/home/tools/custom-ic-analog-rf-design/library-characterization/variety-statistical-characterization-solution.html. Accessed: 2019-01-14.

[62] Sparsh Mittal. A survey of architectural techniques for managing process variation. ACM

Computing Surveys (CSUR), 2016.

[63] Mohammad Saber Golanbari, Saman Kiamehr, Fabian Oboril, Anteneh Gebregiorgis,

and Mehdi B Tahoori. Post-fabrication calibration of near-threshold circuits for energy

efficiency. In Quality Electronic Design (ISQED), 2017 18th International Symposium

on, 2017.

[64] Ramy E Aly, Md Ibrahim Faisal, and Magdy A Bayoumi. Novel 7t sram cell for low

power cache design. In SOC Conference. Proceedings. IEEE International, 2005.

[65] Leland Chang, Robert K Montoye, Yutaka Nakamura, Kevin A Batson, Richard J Eick-

emeyer, Robert H Dennard, Wilfried Haensch, and Damir Jamsek. An 8t-sram for vari-

ability tolerance and low-voltage operation in high-performance caches. IEEE Journal

of Solid-State Circuits, 2008.

[66] Amit Agarwal, Bipul Chandra Paul, Hamid Mahmoodi, Animesh Datta, and Kaushik

Roy. A process-tolerant cache architecture for improved yield in nanoscale technologies.

IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2005.

[67] Anteneh Gebregiorgis, Mojtaba Ebrahimi, Saman Kiamehr, Fabian Oboril, Said Ham-

dioui, and Mehdi B Tahoori. Aging mitigation in memory arrays using self-controlled

bit-flipping technique. In Design Automation Conference (ASP-DAC), 20th Asia and

South Pacific, 2015.

[68] Koichi Takeda, Yasuhiko Hagihara, Yoshiharu Aimoto, Masahiro Nomura, Yoetsu

Nakazawa, Toshio Ishii, and Hiroyuki Kobatake. A read-static-noise-margin-free sram

cell for low-vdd and high-speed applications. IEEE journal of solid-state circuits, 2006.

[69] Evelyn Grossar, Michele Stucchi, Karen Maex, and Wim Dehaene. Read stability and

write-ability analysis of sram cells for nanometer technologies. IEEE Journal of Solid-

State Circuits, 2006.

[70] Saibal Mukhopadhyay, Hamid Mahmoodi, and Kaushik Roy. Modeling of failure prob-

ability and statistical design of sram array for yield enhancement in nanoscaled cmos.

IEEE transactions on computer-aided design of integrated circuits and systems, 2005.

[71] Shah M Jahinuzzaman, Mohammad Sharifkhani, and Manoj Sachdev. An analytical

model for soft error critical charge of nanometric srams. IEEE Transactions on Very


Large Scale Integration (VLSI) Systems, 2009.

[72] Mojtaba Ebrahimi, Adrian Evans, Mehdi B Tahoori, Razi Seyyedi, Enrico Costenaro,

and Dan Alexandrescu. Comprehensive analysis of alpha and neutron particle-induced

soft errors in an embedded processor at nanoscales. In Proceedings of the conference on

Design, Automation & Test in Europe, 2014.

[73] Jorge Tonfat, Jose Rodrigo Azambuja, Gabriel Nazar, Paolo Rech, Christopher Frost,

Fernanda Lima Kastensmidt, Luigi Carro, Ricardo Reis, Juliano Benfica, Fabian Vargas,

et al. Analyzing the influence of voltage scaling for soft errors in sram-based fpgas.

In Radiation and Its Effects on Components and Systems (RADECS), 14th European

Conference on, 2013.

[74] Hussam Amrouch, Victor M van Santen, Thomas Ebi, Volker Wenzel, and Jorg Henkel.

Towards interdependencies of aging mechanisms. In Proceedings of the IEEE/ACM In-

ternational Conference on Computer-Aided Design, 2014.

[75] Jorg Henkel, Lars Bauer, Nikil Dutt, Puneet Gupta, Sani Nassif, Muhammad Shafique,

Mehdi Tahoori, and Norbert Wehn. Reliable on-chip systems in the nano-era: Lessons

learnt and future trends. In Proceedings of the 50th Annual Design Automation Confer-

ence, 2013.

[76] I Chatterjee, B Narasimham, NN Mahatme, BL Bhuva, RA Reed, RD Schrimpf,

JK Wang, N Vedula, B Bartz, and C Monzel. Impact of technology scaling on sram

soft error rates. IEEE Transactions on Nuclear Science, 2014.

[77] Guillaume Hubert, Laurent Artola, and D Regis. Impact of scaling on the soft error

sensitivity of bulk, fdsoi and finfet technologies due to atmospheric radiation. Integration,

2015.

[78] Wai-Kei Mak and Jr-Wei Chen. Voltage island generation under performance require-

ment for soc designs. In Proceedings of the Asia and South Pacific Design Automation

Conference, 2007.

[79] Ronny Krashinsky, Christopher Batten, Mark Hampton, Steve Gerding, Brian Pharris,

Jared Casper, and Krste Asanovic. The vector-thread architecture. ACM SIGARCH

Computer Architecture News, 2004.

[80] Mark Woh, Sangwon Seo, Scott Mahlke, Trevor Mudge, Chaitali Chakrabarti, and Krisz-

tian Flautner. Anysp: anytime anywhere anyway signal processing. In ACM SIGARCH

Computer Architecture News, 2009.

[81] Keith A Bowman, James W Tschanz, Shih-Lien L Lu, Paolo A Aseron, Muhammad M

Khellah, Arijit Raychowdhury, Bibiche M Geuskens, Carlos Tokunaga, Chris B Wilker-

son, Tanay Karnik, et al. A 45 nm resilient microprocessor core for dynamic variation

tolerance. IEEE Journal of Solid-State Circuits, 2011.

[82] Eric Chun, Zeshan Chishti, and TN Vijaykumar. Shapeshifter: Dynamically changing

pipeline width and speed to address process variations. In Proceedings of the 41st annual

IEEE/ACM International Symposium on Microarchitecture, 2008.

[83] Bo Zhai, David Blaauw, Dennis Sylvester, and Krisztian Flautner.

Theoretical and practical limits of dynamic voltage scaling. In Proceedings of the 41st

annual Design Automation Conference, 2004.


[84] Tilak Agerwala and Siddhartha Chatterjee. Computer architecture: Challenges and

opportunities for the next decade. IEEE Micro, 2005.

[85] Viji Srinivasan, David Brooks, Michael Gschwind, Pradip Bose, Victor Zyuban, Philip N

Strenski, and Philip G Emma. Optimizing pipelines for power and performance.

In Microarchitecture,(MICRO-35). Proceedings. 35th Annual IEEE/ACM International

Symposium on, 2002.

[86] Matthew Fojtik, David Fick, Yejoong Kim, Nathaniel Pinckney, David Money Harris,

David Blaauw, and Dennis Sylvester. Bubble razor: Eliminating timing margins in an

arm cortex-m3 processor in 45 nm cmos using architecturally independent error detection

and correction. IEEE Journal of Solid-State Circuits, 2013.

[87] Shidhartha Das, David Roberts, Seokwoo Lee, Sanjay Pant, David Blaauw, Todd Austin,

Krisztian Flautner, and Trevor Mudge. A self-tuning dvs processor using delay-error

detection and correction. IEEE Journal of Solid-State Circuits, 2006.

[88] Dan Ernst, Nam Sung Kim, Shidhartha Das, Sanjay Pant, Rajeev Rao, Toan Pham,

Conrad Ziesler, David Blaauw, Todd Austin, Krisztian Flautner, et al. Razor: A low-

power pipeline based on circuit-level timing speculation. In Proceedings of the 36th

annual IEEE/ACM International Symposium on Microarchitecture, 2003.

[89] Shidhartha Das, Carlos Tokunaga, Sanjay Pant, Wei-Hsiang Ma, Sudherssen Kalaiselvan,

Kevin Lai, David M Bull, and David T Blaauw. Razorii: In situ error detection and

correction for pvt and ser tolerance. IEEE Journal of Solid-State Circuits, 2009.

[90] Matthew Fojtik, David Fick, Yejoong Kim, Nathaniel Pinckney, David Harris, David

Blaauw, and Dennis Sylvester. Bubble razor: An architecture-independent approach

to timing-error detection and correction. In Solid-State Circuits Conference Digest of

Technical Papers (ISSCC), IEEE International, 2012.

[91] Mihir R Choudhury, Vikas Chandra, Robert C Aitken, and Kartik Mohanram. Time-

borrowing circuit designs and hardware prototyping for timing error resilience. IEEE

transactions on Computers, 2014.

[92] Klaus Von Arnim, Eduardo Borinski, Peter Seegebrecht, Horst Fiedler, Ralf Brederlow,

Roland Thewes, Joerg Berthold, and Christian Pacha. Efficiency of body biasing in

90-nm cmos for low-power digital circuits. IEEE Journal of Solid-State Circuits, 2005.

[93] Mihir Choudhury, Vikas Chandra, Kartik Mohanram, and Robert Aitken. Timber: Time

borrowing and error relaying for online timing error resilience. In Design, Automation

& Test in Europe Conference & Exhibition (DATE), 2010.

[94] Kwanyeob Chae, Saibal Mukhopadhyay, Chang-Ho Lee, and Joy Laskar. A dynamic tim-

ing control technique utilizing time borrowing and clock stretching. In Custom Integrated

Circuits Conference (CICC), IEEE, 2010.

[95] Michael Wieckowski, Young Min Park, Carlos Tokunaga, Dong Woon Kim, Zhiyoong

Foo, Dennis Sylvester, and David Blaauw. Timing yield enhancement through soft edge

flip-flop based design. In Custom Integrated Circuits Conference, 2008. CICC. IEEE,

2008.

[96] Siva Narendra, James Tschanz, Joseph Hofsheier, Bradley Bloechel, Sriram Vangal,

Yatin Hoskote, Stephen Tang, Dinesh Somasekhar, Ali Keshavarzi, Vasantha Erraguntla,


et al. Ultra-low voltage circuits and processor in 180nm to 90nm technologies with a

swapped-body biasing technique. In Solid-State Circuits Conference. Digest of Technical

Papers. ISSCC IEEE International, 2004.

[97] Le Yan, Jiong Luo, and Niraj K Jha. Combined dynamic voltage scaling and adaptive

body biasing for heterogeneous distributed real-time embedded systems. In Proceedings

of the 2003 IEEE/ACM international conference on Computer-aided design, 2003.

[98] Shankar Ganesh Ramasubramanian, Swagath Venkataramani, Adithya Parandhaman,

and Anand Raghunathan. Relax-and-retime: A methodology for energy-efficient recovery

based design. In Proceedings of the 50th Annual Design Automation Conference, 2013.

[99] V Huard, D Angot, and Florian Cacho. From bti variability to product failure rate: A

technology scaling perspective. In Reliability Physics Symposium (IRPS), IEEE Inter-

national, 2015.

[100] Anteneh Gebregiorgis, Fabian Oboril, Mehdi B Tahoori, and Said Hamdioui. Instruction

cache aging mitigation through instruction set encoding. In Quality Electronic Design

(ISQED), 17th International Symposium on, 2016.

[101] Abbas BanaiyanMofrad, Mojtaba Ebrahimi, Fabian Oboril, Mehdi B Tahoori, and Nikil

Dutt. Protecting caches against multi-bit errors using embedded erasure coding. In Test

Symposium (ETS), 20th IEEE European, 2015.

[102] Mohammad Sadeghi and Hooman Nikmehr. Aging mitigation of l1 cache by exchanging

instruction and data caches. Integration, 2018.

[103] Nezam Rohbani, Mojtaba Ebrahimi, Seyed-Ghassem Miremadi, and Mehdi B Tahoori.

Bias temperature instability mitigation via adaptive cache size management. IEEE

Transactions on Very Large Scale Integration (VLSI) Systems, 2017.

[104] Dominic Oehlert, Arno Luppold, and Heiko Falk. Mitigating data cache aging through

compiler-driven memory allocation. In Proceedings of the 21st International Workshop

on Software and Compilers for Embedded Systems, 2018.

[105] Yang Chen, Zuochang Ye, and Yan Wang. Yield driven design and optimization for

near-threshold voltage sram cells. In Semiconductor Technology International Conference

(CSTIC), China, 2016.

[106] Chengzhi Jiang, Dayu Zhang, Song Zhang, He Wang, Zhong Zhuang, and Faming Yang.

A yield-driven near-threshold 8-t sram design with transient negative bit-line scheme.

In 7th IEEE International Conference on Electronics Information and Emergency Com-

munication (ICEIEC), 2017.

[107] Yasuhiro Morita, Hidehiro Fujiwara, Hiroki Noguchi, Yusuke Iguchi, Koji Nii, Hiroshi

Kawaguchi, and Masahiko Yoshimoto. An area-conscious low-voltage-oriented 8t-sram

design under dvs environment. In VLSI Circuits, IEEE Symposium on, 2007.

[108] Jaydeep P Kulkarni, John Keane, Kyung-Hoae Koo, Satyanand Nalam, Zheng Guo, Eric

Karl, and Kevin Zhang. 5.6 mb/mm2 1r1w 8t sram arrays operating down to 560 mv

utilizing small-signal sensing with charge shared bitline and asymmetric sense amplifier

in 14 nm finfet cmos technology. J. Solid-State Circuits, 2017.

[109] Bojan Maric, Jaume Abella, and Mateo Valero. Adam: An efficient data management

mechanism for hybrid high and ultra-low voltage operation caches. In Proceedings of the


great lakes symposium on VLSI, 2012.

[110] Hamid Reza Ghasemi, Stark C Draper, and Nam Sung Kim. Low-voltage on-chip cache

architecture using heterogeneous cell sizes for high-performance processors. In High

Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium

on, 2011.

[111] Charles W Slayman. Cache and memory error detection, correction, and reduction

techniques for terrestrial servers and workstations. IEEE Transactions on Device and

Materials Reliability, 2005.

[112] Charles Slayman. Soft error trends and mitigation techniques in memory devices. In

Reliability and Maintainability Symposium (RAMS), Proceedings-Annual, 2011.

[113] Timothy N Miller, Renji Thomas, James Dinan, Bruce Adcock, and Radu Teodorescu.

Parichute: Generalized turbocode-based error correction for near-threshold caches. In

Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitec-

ture, 2010.

[114] Henry Duwe, Xun Jian, and Rakesh Kumar. Correction prediction: Reducing error

correction latency for on-chip memories. In High Performance Computer Architecture

(HPCA), IEEE 21st International Symposium on, 2015.

[115] Stanley E Schuster. Multiple word/bit line redundancy for semiconductor memories.

IEEE Journal of Solid-State Circuits, 1978.

[116] Chris Wilkerson, Hongliang Gao, Alaa R Alameldeen, Zeshan Chishti, Muhammad Khel-

lah, and Shih-Lien Lu. Trading off cache capacity for reliability to enable low voltage

operation. In ACM SIGARCH computer architecture news, 2008.

[117] Farrukh Hijaz, Qingchuan Shi, and Omer Khan. A private level-1 cache architecture

to exploit the latency and capacity tradeoffs in multicores operating at near-threshold

voltages. In Computer Design (ICCD), IEEE 31st International Conference on, 2013.

[118] Anteneh Gebregiorgis, Rajendra Bishnoi, and Mehdi B Tahoori. Spintronic normally-off

heterogeneous system-on-chip design. In Design, Automation & Test in Europe Confer-

ence & Exhibition (DATE), 2018.

[119] Albert Lee, Chieh-Pu Lo, Chien-Chen Lin, Wei-Hao Chen, Kuo-Hsiang Hsu, Zhibo

Wang, Fang Su, Zhe Yuan, Qi Wei, Ya-Chin King, et al. A reram-based nonvolatile

flip-flop with self-write-termination scheme for frequent-off fast-wake-up nonvolatile pro-

cessors. IEEE Journal of Solid-State Circuits, 2017.

[120] Yen-Kuang Chen, Jatin Chhugani, Pradeep Dubey, Christopher J Hughes, Daehyun

Kim, Sanjeev Kumar, Victor W Lee, Anthony D Nguyen, and Mikhail Smelyanskiy.

Convergence of recognition, mining, and synthesis workloads and its implications. Pro-

ceedings of the IEEE, 2008.

[121] Yiqun Wang, Yongpan Liu, Shuangchen Li, Daming Zhang, Bo Zhao, Mei-Fang Chiang,

Yanxin Yan, Baiko Sai, and Huazhong Yang. A 3us wake-up time nonvolatile processor

based on ferroelectric flip-flops. In ESSCIRC (ESSCIRC), Proceedings of the, 2012.

[122] S Bartling, S Khanna, M Clinton, S Summerfelt, J Rodriguez, and H McAdams. An

8mhz 75 ua/mhz zero-leakage non-volatile logic-based cortex-m0 mcu soc exhibiting 100% digital state retention at vdd=0v with <400ns wakeup and sleep transitions. In Solid-State Circuits Conference Digest of Tech-


nical Papers (ISSCC), 2013.

[123] H Koike, T Ohsawa, S Ikeda, T Hanyu, H Ohno, T Endoh, N Sakimura, R Nebashi,

Y Tsuji, A Morioka, et al. A power-gated mpu with 3-microsecond entry/exit delay

using mtj-based nonvolatile flip-flop. In Solid-State Circuits Conference (A-SSCC), IEEE

Asian, 2013.

[124] Fabian Oboril, Rajendra Bishnoi, Mojtaba Ebrahimi, and Mehdi B Tahoori. Evalua-

tion of hybrid memory technologies using sot-mram for on-chip cache hierarchy. IEEE

Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2015.

[125] Vilas Sridharan and David R Kaeli. Using hardware vulnerability factors to enhance avf

analysis. ACM SIGARCH Computer Architecture News, 2010.

[126] Mark Wilkening, Vilas Sridharan, Si Li, Fritz Previlon, Sudhanva Gurumurthi, and

David R Kaeli. Calculating architectural vulnerability factors for spatial multi-bit tran-

sient faults. In Proceedings of the 47th Annual IEEE/ACM International Symposium on

Microarchitecture, 2014.

[127] Kjell O Jeppson and Christer M Svensson. Negative bias stress of mos devices at high

electric fields and degradation of mnos devices. Journal of Applied Physics, 1977.

[128] Jose Manuel Cazeaux, Daniele Rossi, Martin Omana, Cecilia Metra, and Abhijit Chat-

terjee. On transistor level gate sizing for increased robustness to transient faults. In

On-Line Testing Symposium, IOLTS. 11th IEEE International, 2005.

[129] Peter Hazucha and Christer Svensson. Impact of cmos technology scaling on the atmo-

spheric neutron soft error rate. IEEE Transactions on Nuclear science, 2000.

[130] Jean-Luc Autran, S Serre, Daniela Munteanu, S Martinie, S Semikh, S Sauze, S Uznanski,

G Gasiot, and P Roche. Real-time soft-error testing of 40 nm srams. In Proc. IEEE

IRPS, 2012.

[131] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali Saidi,

Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh Sardashti,

et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 2011.

[132] John L Henning. Spec cpu2000: Measuring cpu performance in the new millennium.

Computer, 2000.

[133] Taniya Siddiqua, Sudhanva Gurumurthi, and Mircea R Stan. Modeling and analyzing

nbti in the presence of process variation. In Quality Electronic Design (ISQED), 12th

International Symposium on, 2011.

[134] Oluleye Olorode and Mehrdad Nourani. Improving performance in sub-block caches with

optimized replacement policies. ACM Journal on Emerging Technologies in Computing

Systems (JETC), 2015.

[135] Hu Chen, Dieudonne Manzi, Sanghamitra Roy, and Koushik Chakraborty. Opportunistic

turbo execution in ntc: exploiting the paradigm shift in performance bottlenecks. In

Proceedings of the 52nd Annual Design Automation Conference, 2015.

[136] Understanding cpu caching and performance. http://arstechnica.com/gadgets/2002/07/caching/2/, 2015.

[137] Premkishore Shivakumar and Norman P Jouppi. Cacti 3.0: An integrated cache timing,


power, and area model. Technical Report, Compaq Computer Corporation, 2001.

[138] Alaa R Alameldeen, Ilya Wagner, Zeshan Chishti, Wei Wu, Chris Wilkerson, and Shih-

Lien Lu. Energy-efficient cache design using variable-strength error-correcting codes. In

ACM SIGARCH Computer Architecture News, 2011.

[139] Said Hamdioui, Zaid Al-Ars, Georgi Nedeltchev Gaydadjiev, and Adrianus Van de Goor.

Generic march element based memory built-in self test, December 9 2014. US Patent

8,910,001.

[140] Albert Au, Artur Pogiel, Janusz Rajski, Piotr Sydow, Jerzy Tyszer, and Justyna Zawada.

Quality assurance in memory built-in self-test tools. In DDEC, 2014.

[141] Nathaniel Pinckney, Korey Sewell, Ronald G Dreslinski, David Fick, Trevor Mudge,

Dennis Sylvester, and David Blaauw. Assessing the performance limits of parallelized

near-threshold computing. In Proceedings of the 49th Annual Design Automation Con-

ference, 2012.

[142] Fabian Oboril and Mehdi B Tahoori. Aging-aware design of microprocessor instruction

pipelines. TCAD, 2014.

[143] Sunil Walia. Primetime® advanced ocv technology. Synopsys, Inc, 2009.

[144] Jinson Koppanalil, Prakash Ramrakhyani, Sameer Desai, Anu Vaidyanathan, and Eric

Rotenberg. A case for dynamic pipeline scaling. In Proceedings of the international

conference on Compilers, architecture, and synthesis for embedded systems, 2002.

[145] Opensparc t1. http://www.oracle.com/technetwork/systems/opensparc/index.html. Accessed: 2016-01-30.

[146] Niket Choudhary, Salil Wadhavkar, Tanmay Shah, Hiran Mayukh, Jayneel Gandhi,

Brandon Dwiel, Sandeep Navada, Hashem Najaf-abadi, and Eric Rotenberg. Fabscalar:

Automating superscalar core design. IEEE Micro, 2012.

[147] Sparsh Mittal. A survey of architectural techniques for near-threshold computing. ACM Journal on Emerging Technologies in Computing Systems (JETC), 2015.

[148] Bo Zhai, Ronald G Dreslinski, David Blaauw, Trevor Mudge, and Dennis Sylvester.

Energy efficient near-threshold chip multi-processing. In Proceedings of the international

symposium on Low power electronics and design, 2007.

[149] Dejan Markovic, Cheng C Wang, Louis P Alarcon, Tsung-Te Liu, and Jan M Rabaey.

Ultralow-power design in near-threshold region. Proceedings of the IEEE, 2010.

[150] Benton H Calhoun and Anantha Chandrakasan. Characterizing and modeling minimum

energy operation for subthreshold circuits. In Proceedings of the international symposium

on Low power electronics and design, 2004.

[151] Alice Wang, Anantha P Chandrakasan, and Stephen V Kosonocky. Optimal supply and

threshold scaling for subthreshold cmos circuits. In Proceedings. IEEE Computer Society

Annual Symposium on, 2002.

[152] Farzan Fallah and Massoud Pedram. Standby and active leakage current control and

minimization in cmos vlsi circuits. IEICE transactions on electronics, 2005.

[153] J Buurma and L Cooke. Low-power design using multiple vth asic libraries. SoC central,

Aug, 2004.


[154] Frank Sill, Frank Grassert, and Dirk Timmermann. Low power gate-level design with

mixed-vth (mvt) techniques. In Proceedings of the 17th symposium on Integrated circuits

and system design, 2004.

[155] Sean Keller, David Money Harris, and Alain J Martin. A compact transregional model

for digital cmos circuits operating near threshold. IEEE Transactions on Very Large

Scale Integration (VLSI) Systems, 2014.

[156] Kangguo Cheng, Bruce B Doris, Ali Khakifirooz, Qing Liu, Nicolas Loubet, and Scott

Luning. Simplified multi-threshold voltage scheme for fully depleted soi mosfets, Decem-

ber 22 2015. US Patent 9,219,078.

[157] Allan Hartstein and Thomas R Puzak. The optimum pipeline depth for a microprocessor.

In Computer Architecture, Proceedings. 29th Annual International Symposium on, 2002.

[158] Jizhong Shen, Liang Geng, Guangping Xiang, and Jianwei Liang. Low-power level con-

verting flip-flop with a conditional clock technique in dual supply systems. Microelec-

tronics Journal, 2014.

[159] Peiyi Zhao, Jason B McNeely, Pradeep K Golconda, Soujanya Venigalla, Nan Wang,

Magdy A Bayoumi, Weidong Kuang, and Luke Downey. Low-power clocked-pseudo-

nmos flip-flop for level conversion in dual supply systems. IEEE Transactions on Very

Large Scale Integration (VLSI) Systems, 2009.

[160] Hamid Mahmoodi-Meimand and Kaushik Roy. Self-precharging flip-flop (spff): A new

level converting flip–flop. In Solid-State Circuits Conference, ESSCIRC. Proceedings of

the 28th European, 2002.

[161] Radu Zlatanovici. Contention-free level converting flip-flops for low-swing clocking, 2015.

US Patent 9,071,238.

[162] Nikola Nedovic, Marko Aleksic, and Vojin G Oklobdzija. Comparative analysis of double-

edge versus single-edge triggered clocked storage elements. In IEEE international sym-

posium on circuits and systems, 2002.

[163] Benton H Calhoun, Frank A Honore, and Anantha Chandrakasan. Design methodology

for fine-grained leakage control in mtcmos. In Proceedings of the international symposium

on Low power electronics and design, 2003.

[164] David E Lackey, Paul S Zuchowski, Thomas R Bednar, Douglas W Stout, Scott W

Gould, and John M Cohn. Managing power and performance for system-on-chip designs

using voltage islands. In Proceedings of the IEEE/ACM international conference on

Computer-aided design, 2002.

[165] Benton H Calhoun, Alice Wang, and Anantha Chandrakasan. Modeling and sizing

for minimum energy operation in subthreshold circuits. IEEE Journal of Solid-State

Circuits, 2005.

[166] NcSim. http://www.cadence.com/, 2016.

[167] Mohammad Reza Kakoee, Ashoka Sathanur, Antonio Pullini, Jos Huisken, and Luca

Benini. Automatic synthesis of near-threshold circuits with fine-grained performance

tunability. In Proceedings of the 16th ACM/IEEE international symposium on Low

power electronics and design, 2010.


[168] Scott Hanson, Bo Zhai, Kerry Bernstein, David Blaauw, Andres Bryant, Leland Chang,

Koushik K Das, Wilfried Haensch, Edward J Nowak, and Dennis M Sylvester. Ultralow-

voltage, minimum-energy cmos. IBM journal of research and development, 2006.

[169] Sai Zhang and Naresh R Shanbhag. Embedded algorithmic noise-tolerance for signal pro-

cessing and machine learning systems via data path decomposition. IEEE Transactions

on Signal Processing, 2016.

[170] Andrew B Kahng, Seokhyeong Kang, Rakesh Kumar, and John Sartori. Slack redis-

tribution for graceful degradation under voltage overscaling. In Design Automation

Conference (ASP-DAC), 15th Asia and South Pacific, 2010.

[171] Vinay K Chippa, Debabrata Mohapatra, Anand Raghunathan, Kaushik Roy, and Sri-

mat T Chakradhar. Scalable effort hardware design: Exploiting algorithmic resilience

for energy efficiency. In Proceedings of the 47th Design Automation Conference, 2010.

[172] Jongsun Park, Jung Hwan Choi, and Kaushik Roy. Dynamic bit-width adaptation in

dct: An approach to trade off image quality and computation energy. IEEE Trans. VLSI

Syst., 2010.

[173] Byonghyo Shim, Srinivasa R Sridhara, and Naresh R Shanbhag. Reliable low-power

digital signal processing via reduced precision redundancy. IEEE Transactions on Very

Large Scale Integration (VLSI) Systems, 2004.

[174] Rajamohana Hegde and Naresh R Shanbhag. Soft digital signal processing. IEEE Trans-

actions on Very Large Scale Integration (VLSI) Systems, 2001.

[175] Ajay K Verma, Philip Brisk, and Paolo Ienne. Variable latency speculative addition: A

new paradigm for arithmetic circuit design. In Proceedings of the conference on Design,

automation and test in Europe, 2008.

[176] Khaing Yin Kyaw, Wang Ling Goh, and Kiat Seng Yeo. Low-power high-speed multiplier

for error-tolerant application. In Electron Devices and Solid-State Circuits (EDSSC),

IEEE International Conference of, 2010.

[177] Karlin. A first course in stochastic processes. Tutorial, 2014.

[178] G. Asadi and M.B. Tahoori. An analytical approach for soft error rate estimation in

digital circuits. In ISCAS, 2005.

[179] H. Asadi and M.B. Tahoori. Soft error modeling and protection for sequential elements.

In DFT, 2005.

[180] Adrian Sampson, Werner Dietl, Emily Fortuna, Danushen Gnanapragasam, Luis Ceze,

and Dan Grossman. Enerj: Approximate data types for safe and general low-power

computation. In ACM SIGPLAN Notices, 2011.
