
The Pennsylvania State University

The Graduate School

Department of Computer Science and Engineering

IMPLICATIONS OF FUTURE TECHNOLOGIES

ON THE DESIGN OF FPGAs

A Thesis in

Computer Science and Engineering

by

Aman Gayasen

© 2006 Aman Gayasen

Submitted in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

December 2006

The thesis of Aman Gayasen was reviewed and approved∗ by the following:

Mahmut Kandemir
Associate Professor of Computer Science and Engineering
Thesis Co-Adviser
Co-Chair of Committee

Vijaykrishnan Narayanan
Associate Professor of Computer Science and Engineering
Thesis Co-Adviser
Co-Chair of Committee

Mary Jane Irwin
Professor of Computer Science and Engineering

Vittal Prabhu
Associate Professor of Industrial and Manufacturing Engineering

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

∗Signatures are on file in the Graduate School.

Abstract

The Field Programmable Gate Array (FPGA) industry is going through an exciting phase. The market leaders, Xilinx and Altera, announce new products almost every year, and their CAD tools keep adding new features. The growing popularity of FPGAs demands that we sustain their improvement. This thesis explores new technologies for continuing the improvement of FPGAs in the future.

In this thesis, we study the effect of three main future technologies. First, we evaluate FPGA designs for scaled CMOS technologies at 65nm and below. The main problems here are leakage, temperature, and process variation. Second, we look at 3-D stacking of multiple dies within a package. Since this technology is still being perfected, several of its parameters remain open to exploration. For example, the properties of the vias that provide communication among the different layers (inter-layer vias), especially their pitch and length, are very different from those of the other wires. This introduces an asymmetry in the FPGA fabric, and we try to answer how this asymmetry influences the FPGA architecture. Furthermore, stacking multiple layers increases the power density, which increases the junction temperature. This thesis studies the impact of stacking on temperature and proposes a thermal-aware organization of FPGAs. Finally, we look at some technologies that are still in their infancy, such as molecular switches, carbon nanotubes, and silicon nanowires. Specifically, we explore the use of such technologies to implement the interconnect fabric in an FPGA.

Table of Contents

List of Tables
List of Figures
Acknowledgments

Chapter 1. Introduction
    1.1 FPGA Architectures

Chapter 2. Related Work

Chapter 3. Reducing Leakage Energy in FPGAs
    3.1 Using Sleep Transistors
        3.1.1 RCP: Region-Constrained Placement
        3.1.2 Combining RCP and Time-Based Control
        3.1.3 Experimentation
            3.1.3.1 Time-Based Leakage Control
        3.1.4 Results and Analysis
            3.1.4.1 Time-Based Leakage Control
        3.1.5 Summary
    3.2 Dual-Vdd FPGA
        3.2.1 Architecture
            3.2.1.1 Fully Programmable (FP)
            3.2.1.2 Logic Programmable (LP)
            3.2.1.3 Level Conversion
        3.2.2 Methodology
            3.2.2.1 Vdd Assignment
            3.2.2.2 Power Estimation
        3.2.3 Results and Analysis
            3.2.3.1 FP Architecture
            3.2.3.2 LP Architecture
        3.2.4 Summary

Chapter 4. Three-Dimensional FPGAs
    4.1 Background
        4.1.1 2-D Switch Boxes
        4.1.2 3-D Technology Overview
    4.2 3-D Detailed Routing Architecture
        4.2.1 Switch Box Topology
        4.2.2 Experimentation
            4.2.2.1 Architecture and Technology Parameters
            4.2.2.2 Experimentation Flow
            4.2.2.3 Area Model
        4.2.3 Results and Analysis
    4.3 Thermal Issues in 3-D FPGAs
        4.3.1 Thermal Characterization of FPGAs: 2-D to 3-D
        4.3.2 Thermal-Aware 3-D FPGA Organization
    4.4 Summary

Chapter 5. Technology Alternatives for Nanoscale FPGA Interconnects
    5.1 Nanotechnology Primitives
        5.1.1 Related Work
    5.2 Nanoscale FPGA Architectures
        5.2.1 Arch1: Using non-lithographic nano-wires and molecular switches
        5.2.2 Arch2: FPGA using lithographic wires and molecular switches
    5.3 Comparative Evaluation
        5.3.1 Results
    5.4 Summary

Chapter 6. Summary and Future Directions
    6.1 Future Directions

References

List of Tables

3.1 Characteristics of benchmark designs
3.2 Comparison of High-to-Low and Low-to-High algorithms (LC at CLB inputs, Vddh = 1.1V, Vddl = 0.8V)
4.1 Via properties
4.2 Power densities in 4VFX100 (Freq: 500MHz)
4.3 Effect of stacking on temperature
4.4 Parameters for temperature estimation in HS3d
4.5 Thermal-aware 3-D FPGA design

List of Figures

1.1 Traditional FPGA architecture
1.2 Virtex-2 FPGA architecture
3.1 FPGA containing sleep transistors
3.2 Leakage energy breakdown
3.3 (a) Horizontal and (b) vertical styles of RCP on an XC2V40 FPGA for a region size of 2×4 slices. Required number of slices is 100 (13 regions)
3.4 Different placements for an example design. In part (c), each module is bounded by a polygon
3.5 Experimental flow
3.6 Average leakage power savings for RCP and normal placement
3.7 Leakage power savings for RCP for a 4×16 region for all designs
3.8 Average clock frequency for RCP
3.9 Average leakage energy savings for RCP and normal placement
3.10 Leakage power savings for time-based leakage control
3.11 Leakage energy savings for time-based leakage control
3.12 Supply transistors used for programmable Vdd
3.13 Fully programmable dual-Vdd architecture (FP)
3.14 Logic programmable dual-Vdd architecture (LP)
3.15 Level converter circuit
3.16 Experimental flow
3.17 Distribution of path delays
3.18 Power consumption for different Vddl values. Vddh=1.1V
3.19 Power consumption for different architectures and algorithms. Vddh=1.1V, Vddl=0.9V
3.20 Average power breakdown between logic and routing resources. Vddh=1.1V, Vddl=0.9V
3.21 Average power consumption for different critical path delay tolerances. Vddh=1.1V, Vddl=0.9V
3.22 Critical path delay for LP FPGA with different extents of Vddl resources. Vddh=1.1V, Vddl=0.9V
3.23 Energy consumption in LP FPGAs. Vddh=1.1V, Vddl=0.9V
4.1 2-D switch boxes. X0, Y0, X1, Y1 mark their sides
4.2 Two kinds of stacking
4.3 3-D FPGA
4.4 3-D switch boxes for H=4, V=2
4.5 Experimentation flow
4.6 Comparing 2-D and 3-D FPGAs
4.7 Comparing the switch boxes for 5-layer FPGA
4.8 Comparing the switch boxes for different via technologies for 5-layer FPGA
4.9 Comparing the switch boxes for different process nodes for 5-layer FPGA
4.10 Virtex-4 FX100 device (not to scale)
4.11 Thermal profile of 4VFX100
4.12 Effect of stacking on peak temperature
4.13 3-D FPGA organizations
5.1 FPGA using nano-wires and molecular switches
5.2 3-D organization of nano-wires
5.3 Critical path delays in the 3 architectures
5.4 Dependence of performance on molecular switch’s ON resistance
5.5 Resistance and capacitance values of single-length NiSi nano-wires
5.6 Performance of a design (misex3) using metal nano-wires

Acknowledgments

I am grateful to my advisers, Dr. Vijay and Dr. Kandemir, for their support throughout my Ph.D. Without their guidance, in both professional and personal matters, I would never have completed this thesis. I am also thankful to the members of MDL for creating a friendly work environment.

Some of the work in this thesis was done with the help of other MDL students. While it is impossible to thank everyone who might have influenced my research indirectly, I would like to thank those who worked closely with me on several projects. Yuh-Fang and Ki-Yong helped me with the FPGA power work. Priya helped with the thermal work. Suresh worked with me on several projects. I also enjoyed some enlightening conversations with Vijay Sai and Greg Link. During my last semester at Penn State, I also worked with Soumya, Prasanth, and Sungmin. My neighbor in the lab, Jooheung, was a constant source of inspiration. I also wish to thank Ing-Chao for the ping-pong games; they helped me focus when I was under stress.

Outside Penn State, I frequently collaborated with Tim Tuan and Arif Rahman of Xilinx Research Labs. I am grateful for their help. They, and Satyaki Das, were excellent mentors during my internships at Xilinx.

Chapter 1

Introduction

Field Programmable Gate Arrays (FPGAs) are Integrated Circuits (ICs) containing programmable logic and interconnect elements. The “Field” in FPGA denotes their ability to be programmed by the end user. The “Gate Array” signifies their similarity to conventional mask-programmed gate arrays. FPGAs belong to a broader category of field-programmable devices, called Programmable Logic Devices (PLDs), which also includes Programmable Logic Arrays (PLAs), Programmable Array Logic (PAL) devices, and Complex PLDs (CPLDs). While PLAs and PALs can implement only two-level logic, both FPGAs and CPLDs can implement multi-level logic. CPLDs consist of multiple PAL elements interconnected through a programmable switch matrix. In contrast, FPGAs contain many small programmable logic elements connected using a mixture of short and long wires and programmable switches. While CPLDs offer more predictable timing, they lag FPGAs in logic capacity. Because of their large capacities and superior device utilization, FPGAs are among the most popular programmable devices.

FPGAs present significant advantages over microprocessors as well as ASICs. Compared to microprocessors, they offer higher performance for parallel applications. Compared to ASICs, they offer a simpler design flow and lower Non-Recurring Engineering (NRE) costs. Therefore, they are suited for small-to-medium production volumes and for products where time-to-market is critical. Furthermore, the regular structure of FPGAs makes them highly amenable to shrinking geometries, and therefore they are usually at the forefront of new process technologies. Consequently, by using FPGAs, designers can get the advantages of advanced process technologies without worrying about the complexities that accompany technology scaling. For all these reasons, FPGAs are poised to be among the most popular devices of the future.

At the time of their introduction in the mid-eighties, FPGAs were primarily used for prototyping and for implementing glue logic. However, over the following twenty years, and especially since the late nineties, their market has diversified significantly. The inclusion of embedded processors, memory, and DSP blocks provides a complete platform for creating embedded systems [1, 2]. Their inherent parallelism, coupled with an increase in size and a decrease in logic delays, allows them to be used as hardware accelerators for high-performance applications (e.g., [3, 4, 5]). Researchers are also working to create scalable supercomputers using arrays of off-the-shelf FPGAs [6]. Furthermore, the introduction of low-cost FPGAs by both Xilinx [1] and Altera [2] has enabled the use of FPGAs in consumer markets. Technology research group Gartner Dataquest forecasts that the market for programmable logic devices, which incorporates reconfigurable computing with FPGAs, will double in a period of five years, from $3.1 billion in 2005 to $6.2 billion in 2010 [7]. To maintain this growth in the FPGA market, FPGAs must consistently improve in performance, size, and features. This thesis explores future technologies that will be crucial in sustaining this improvement.

Future technologies can be divided into the following three categories.

• The first category consists of scaled CMOS technologies, as predicted by the ITRS [8], which projects that the industry will move to 22nm technology in 2016. These technologies will face severe power and reliability problems. Leakage power, which until the 130nm node was only a minor component of the total power, has already become a severe problem in sub-100nm technologies. Furthermore, because of increased power densities on the die, the die temperature is also increasing. Higher temperature in turn causes a plethora of problems, including an increase in leakage power and a reduction in silicon lifetime. In severe cases, heat could also melt the package and cause total failure. Besides power and temperature, variability, both manufacturing-induced and long-term, is a serious problem. Defect rates are also expected to increase for smaller technologies. In this study, we focus on reducing leakage power and temperature in an FPGA.

• The second category comprises evolutionary technologies, such as stacking multiple device layers to create a three-dimensional (3-D) IC. Three-D stacking helps reduce wire lengths, which translates into a reduction in FPGA area and power consumption. A timing-driven placement and routing tool can also exploit 3-D to reduce the critical path delay. The vertical connections in a 3-D technology are much larger than the metal wires in a 2-D chip, so we would normally try to reduce their number. A key challenge for 3-D FPGAs is therefore designing the routing architecture so that it uses the vertical connections judiciously. Temperature is a further major concern, because stacking multiple layers increases the effective power density. Our results show that going from a single layer to a 4-layer FPGA could increase the peak temperature by a factor of 2.4 (see Chapter 4).

• The final category contains non-lithographic technologies, such as carbon nanotubes, silicon nanowires, and molecular switches. We broadly call them nanotechnologies. Although several scientists are working on them, these technologies are still in their initial stages of development. The key question here is, “What are the desired properties in such technologies for them to be better than scaled CMOS?” Answering this question lets us give valuable feedback to the chemists and material scientists who are developing these technologies, and also set reasonable expectations for nanotechnologies.

The remainder of the thesis is organized as follows. Chapter 2 reviews the existing literature related to this study. Chapter 3 discusses the challenges in reducing leakage energy in future CMOS technology nodes and presents two techniques to reduce leakage. Chapter 4 develops a detailed routing architecture for 3-D FPGAs and also studies their thermal issues. Chapter 5 explores nanotechnology alternatives for implementing the interconnect fabric in an FPGA. Finally, Chapter 6 summarizes the contributions of this thesis and presents possible directions for future research in this field.

1.1 FPGA Architectures

Figure 1.1 shows the traditional island-style FPGA architecture. It consists of a 2-D array of configurable logic blocks (CLBs) in a sea of routing wires. The CLBs typically contain multiple Look-Up Tables (LUTs) and Flip-Flops (FFs). The routing wires connect among themselves through programmable switches, forming switch blocks. Similarly, these wires also connect to the CLBs, forming connection blocks.

Fig. 1.1. Traditional FPGA architecture (two panels, (a) and (b))
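To make the role of a LUT concrete, the following is a minimal behavioral sketch, not a model of any vendor's circuit. It treats a k-input LUT as 2^k stored configuration bits addressed by the inputs. The class name `LUT`, the helper `from_function`, and the choice of a 4-input parity function are illustrative assumptions that do not come from the text.

```python
from itertools import product


class LUT:
    """Behavioral model of a k-input look-up table (illustrative only)."""

    def __init__(self, k, config_bits):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, k, func):
        # "Program" the LUT by tabulating func over all 2**k input patterns.
        bits = [int(bool(func(*inputs))) for inputs in product((0, 1), repeat=k)]
        return cls(k, bits)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        # The inputs form an address; the first input is the most significant bit.
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]


# Hypothetical example: a 4-input LUT programmed as a parity (XOR) function.
parity = LUT.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)
print(parity.evaluate(1, 0, 1, 1))  # 1, because an odd number of inputs are high
```

Reprogramming the same LUT with a different bit pattern yields a different logic function, which is the sense in which the logic is configurable.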

Modern FPGAs have a more complex architecture than the one shown in Figure 1.1. An example is the Virtex-2 FPGA, shown in Figure 1.2. It stores the configuration information in SRAM cells, each of which typically consists of 6 transistors. The basic logic element in Virtex-2 is called a slice. A slice consists of 2 LUTs, 2 FFs, fast carry logic, and some wide MUXes [1]. A CLB in turn consists of 4 slices and an interconnect switch matrix. The interconnect switch matrix consists of large multiplexers (as large as 32-to-1) controlled by configuration SRAM cells. Note that Figure 1.2 is not drawn to scale; in reality, the interconnect switches account for nearly 70% of the CLB area. The FPGA contains an array of such CLBs along with block RAMs (BRAMs), multipliers, and IO blocks, as depicted in Figure 1.2. Altera’s FPGAs use a similar technology.
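As a rough worked example of these numbers, the sketch below counts the LUTs and FFs in one CLB and the SRAM needed just to hold the LUT contents. The slice, CLB, and 6-transistor figures come from the text; the 4-input LUT size is an assumption made here for illustration. The count deliberately ignores the configuration bits of the switch matrix and other CLB settings.

```python
# Figures stated in the text for Virtex-2.
SLICES_PER_CLB = 4
LUTS_PER_SLICE = 2
FFS_PER_SLICE = 2
TRANSISTORS_PER_SRAM_CELL = 6

# ASSUMPTION (not stated in the text): each LUT has 4 inputs.
ASSUMED_LUT_INPUTS = 4


def clb_lut_budget(lut_inputs=ASSUMED_LUT_INPUTS):
    """Count LUTs, FFs, and the SRAM needed only for LUT contents in one CLB."""
    luts = SLICES_PER_CLB * LUTS_PER_SLICE
    ffs = SLICES_PER_CLB * FFS_PER_SLICE
    lut_bits = luts * 2 ** lut_inputs  # a k-input LUT stores 2**k bits
    return {
        "luts": luts,
        "ffs": ffs,
        "lut_config_bits": lut_bits,
        "lut_sram_transistors": lut_bits * TRANSISTORS_PER_SRAM_CELL,
    }


print(clb_lut_budget())
```

Under the 4-input assumption, the eight LUTs of a CLB hold 128 configuration bits, or 768 transistors of SRAM, before any routing configuration is counted.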

Fig. 1.2. Virtex-2 FPGA architecture

A different kind of FPGA is the one-time-programmable antifuse-based FPGA offered by Actel. Actel and Lattice also offer flash-based FPGAs. In this study, we focus only on SRAM-based FPGAs.

Chapter 2

Related Work

Ever since Xilinx introduced the first FPGAs in the mid-1980s, FPGAs have been a popular topic for research. Their programmability offers interesting avenues for creativity. Researchers at the University of Toronto performed early research on FPGA architecture [9]. They used area, delay, and the area-delay product as metrics to evaluate FPGA architectures. They also developed tools to enable FPGA architecture exploration [10].

Meanwhile, in the late 1990s, researchers at Berkeley started looking at energy consumption in FPGAs. Energy consumption was becoming crucial because of the growing demand for FPGAs in embedded devices. They proposed low-swing interconnect circuits and an interconnect architecture optimized to reduce the energy-delay product [11]. Some studies also analyzed the dynamic power consumption of commercial FPGAs: first of a Xilinx XC4003A FPGA [12] and, more recently, of Virtex-2 [13]. Both studies observed that the interconnect fabric consumes the majority of the dynamic power. Similar to the early study at Toronto (which focused on area and delay), some researchers studied the influence of architecture parameters, such as LUT size, cluster size, and segment length, on power consumption [14, 15]. Studies have also tried to reduce dynamic power through modifications to the CAD tools, ranging from clustering [16] and place and route [17] to bitstream manipulation [18]. The bitstream manipulation technique modified the LUT configuration bits to reduce dynamic power [18]. Recently, Lamoureux and Wilton [19] proposed a complete power-driven CAD flow and studied the interaction among the different CAD stages.

All the above studies focused on dynamic power consumption. With shrinking transistor sizes, leakage power is also becoming important. FPGA researchers recognized this, and the past two years have therefore seen several studies on FPGA leakage power (see [20] for a survey). Since FPGAs use many transistors to provide programmability, their leakage power consumption is significantly higher than that of a custom circuit implementing the same functionality. Tuan and Lai [21] performed a detailed analysis of leakage power in Xilinx CLBs. They concluded that leakage must be significantly reduced to enable the use of FPGAs in mobile applications.

Several techniques to reduce leakage in FPGAs have also been proposed. Two of them use sleep transistors [22, 23]. While researchers at MIT [22] proposed a fine-grained leakage control scheme that embeds sleep transistors within the CLB circuit, Gayasen et al. [23] advocated a coarser region-based leakage control and proposed constraining the design to a minimum number of regions to reduce leakage. At the circuit level, Azizi and Najm [24] presented low-leakage circuits for LUTs. Since the leakage of routing muxes depends strongly on their input values, Srinivasan et al. [25] presented circuits that reduce leakage in the routing fabric by setting desired values at the inputs of the unused routing muxes. Lodi et al. [26] developed low-leakage circuits for the FPGA routing switch. Rahman and Polavarapuv [27] evaluated several low-leakage design techniques for FPGAs. One of them was the use of a heterogeneous routing fabric, consisting of a mixture of high and low threshold voltage (Vt) transistors. Since a high Vt reduces the leakage current at the expense of an increase in delay, the router needs to pick the correct resources based on the available slack. This idea was later applied to a more commercial architecture [28], where detailed experiments helped the authors decide which resources to slow down without affecting performance.
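The idea behind region-based leakage control can be illustrated with a small calculation: if a design is constrained to as few regions as possible, every remaining region can be put to sleep. The sketch below is a simplified lower bound that assumes perfect packing of slices into regions. The function names are hypothetical, the 100-slice design and 2×4-slice region follow one reading of the Figure 3.3 caption in the List of Figures, and the 256-slice device capacity is an assumption used only for illustration.

```python
import math


def regions_needed(design_slices, region_rows, region_cols):
    """Lower bound on occupied regions if the design is packed perfectly."""
    region_size = region_rows * region_cols
    return math.ceil(design_slices / region_size)


def idle_regions(device_slices, design_slices, region_rows, region_cols):
    """Regions left unused, and therefore candidates for being put to sleep."""
    region_size = region_rows * region_cols
    total_regions = device_slices // region_size
    used = regions_needed(design_slices, region_rows, region_cols)
    if used > total_regions:
        raise ValueError("design does not fit in the device")
    return total_regions - used


# A design needing 100 slices, with regions of 2x4 slices.
print(regions_needed(100, 2, 4))     # 13
# ASSUMPTION: a 256-slice device (hypothetical capacity for illustration).
print(idle_regions(256, 100, 2, 4))  # 19
```

A real placer may need more regions than this bound once timing and routability constraints are considered, which is why the trade-off has to be evaluated experimentally.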

At the CAD level, Hassan et al. [29] proposed a low-leakage packing algorithm that packs LUTs exhibiting similar idleness together so that they can be shut down together. Anderson et al. [30] presented a no-cost technique to reduce leakage by selecting the polarities of logic signals appropriately. A similar technique was proposed in [31], but with asymmetric SRAM cells [32]. Chapter 3 presents our region-based leakage control technique.

Researchers have previously proposed dual-Vdd techniques for ASICs [33, 34]. A dual-Vdd ASIC uses the high Vdd (Vddh) only to supply the timing-critical blocks, and saves power on the non-critical ones by supplying them with a lower Vdd (Vddl). Li et al. [35] first applied the idea to FPGAs. They fixed the voltages of the logic blocks and attempted to place the design such that timing-critical blocks used the high Vdd. This approach did not provide enough power savings unless some performance degradation was allowed. Therefore, a programmable-Vdd FPGA was proposed next [36, 37], in which the circuit blocks could be programmed to run on Vddh or Vddl. In [36], only the Vdd of logic blocks could be programmed. All routing resources remained at Vddh, and the emphasis was on reducing dynamic power while keeping the leakage constant. Gayasen et al. applied the programmable-Vdd idea to routing resources as well as logic, and reduced both dynamic and leakage power [37, 38]. Later, Lin et al. [39] evaluated several variants of the dual-Vdd architecture, and also improved the voltage assignment algorithm by formulating it as a linear programming problem [40]. All these approaches required two power supplies and two power grids. To eliminate these overheads, Anderson and Najm [41] proposed a circuit that used the threshold drop across an NMOS transistor to locally generate an alternate power supply for every routing switch. Chen et al. [42] also presented a cut enumeration algorithm targeting low-power technology mapping for FPGA architectures with dual supply voltages. In Chapter 3, we present our dual-Vdd architectures.
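The basic dual-Vdd idea, lowering the supply of blocks that have timing slack, can be sketched with a simple greedy pass over a block-level timing graph. This is an illustrative sketch and not the assignment algorithm of this thesis or of any cited work. The function names, the uniform `slowdown` factor for blocks at Vddl, the example circuit, and the omission of level-converter delay are all assumptions.

```python
def critical_path_delay(delays, fanin, order):
    """Longest path through a DAG; `order` is a topological order of the blocks."""
    arrival = {}
    for block in order:
        start = max((arrival[p] for p in fanin.get(block, ())), default=0.0)
        arrival[block] = start + delays[block]
    return max(arrival.values())


def assign_vdd(delay_high, fanin, order, slowdown=1.5, tolerance=0.0):
    """Greedily move blocks to Vddl while the delay budget is still met."""
    budget = critical_path_delay(delay_high, fanin, order) * (1.0 + tolerance)
    delays = dict(delay_high)
    low_vdd_blocks = set()
    for block in order:
        delays[block] = delay_high[block] * slowdown  # tentatively use Vddl
        if critical_path_delay(delays, fanin, order) <= budget:
            low_vdd_blocks.add(block)
        else:
            delays[block] = delay_high[block]          # revert to Vddh
    return low_vdd_blocks


# Hypothetical circuit: a and b feed c, which feeds d. Delays are at Vddh.
delay_high = {"a": 2.0, "b": 1.0, "c": 1.0, "d": 1.0}
fanin = {"c": ["a", "b"], "d": ["c"]}
order = ["a", "b", "c", "d"]
print(assign_vdd(delay_high, fanin, order))                  # {'b'}
print(sorted(assign_vdd(delay_high, fanin, order, tolerance=0.25)))  # ['a', 'b']
```

With no delay tolerance, only the off-critical block `b` can move to the low supply; allowing a 25% increase in critical path delay frees a second block, which mirrors the observation that fixed-voltage schemes save little power unless some performance degradation is allowed.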

Several studies have recognized the overheads of the programmable interconnect fabric in an FPGA. The interconnect resources take almost 70% of the die area and consume the major part of FPGA power [21, 13]. Furthermore, for most designs, they also constitute more than 50% of the critical path delay. Therefore, FPGA interconnect merits special attention. To reduce the interconnect area, researchers have proposed 3-D FPGAs, consisting of multiple stacked 2-D FPGAs. More than a decade ago, Alexander et al. [43] presented a 3-D FPGA that used package-level integration to stack multiple 2-D FPGAs interconnected using solder bumps. The minimum pitch of these vertical interconnects was 100µm. Campenhout et al. [44] proposed opto-electronic FPGAs, in which the inter-chip communication used optical links. The optical links provide a large vertical channel density. The Rothko 3-D FPGA [45] was a 3-D extension of the Triptych sea-of-gates architecture [46], consisting of routing and logic blocks. The 3-D integration was done at the wafer level, and inter-layer communication used metal vias. A dynamically reconfigurable 3-D FPGA was presented in [47]; it consisted of three physical layers: a routing and logic block layer, a routing layer, and a memory layer. Recently, Lin et al. [48] analyzed the performance benefits of a monolithically stacked 3-D FPGA. Their 3-D integration technology provided very fine vias, which allowed them to stack the configuration memory on top of the rest of the FPGA (logic blocks and interconnects).

Researchers have also looked at theoretical models for 3-D FPGAs. Rahman et al. [49] presented an analytical model for predicting interconnect requirements in 3-D FPGAs, and estimated a reduction of over 50% in channel width, interconnect delay, and power dissipation compared to 2-D FPGAs. Kwon et al. [50] recently extended this model to incorporate clustered logic blocks (similar to Virtex-2 [1]).

On the CAD front, Ababei et al. [51, 52] recently presented a partitioning-based placement algorithm for 3-D FPGAs, which primarily focused on reducing the number of inter-layer vias. However, their router was not timing-driven.

Although several researchers have proposed 3-D FPGAs, the detailed routing architecture of a 3-D FPGA remains unexplored. Ababei et al. [51] assumed a subset switch block. Although Wu et al. [53] designed universal 3-D switch blocks, they used track count as the sole metric of quality. Furthermore, they assumed that the number of inter-layer vias is the same as the horizontal channel width. In today’s technology, especially if we stack more than two layers, the vias are much thicker than the horizontal wires (1µm vs. 0.1µm), which makes this assumption impractical. In Chapter 4, we propose 3-D switch block designs that take these special via properties into account [54].

Three-D technology is known to suffer from thermal issues, because stacking multiple layers increases the effective power density in the package. Package designers have been considering thermal issues in 2-D ICs for a long time. Instead of considering variations in temperature across the die, they designed the package to support the worst-case specifications of the design. As designing the package for the worst-case junction temperature became too expensive, researchers started looking at design-level solutions to reduce the temperature. Dynamic thermal management (DTM) techniques use thermal sensors to monitor the junction temperature and control the power consumption of the design on the basis of that temperature [55]. Common techniques include clock gating and voltage and frequency scaling, applied when the temperature rises beyond a threshold.
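A threshold-based DTM policy of the kind described above can be sketched as a small control loop. This is an illustrative sketch only and is not taken from [55]. The 85°C threshold, the 5°C hysteresis band, the four discrete performance levels, and the function names are all assumptions.

```python
def dtm_step(temperature_c, level, threshold_c=85.0, hysteresis_c=5.0, max_level=3):
    """Return the next performance level (0 = most throttled, max_level = full speed)."""
    if temperature_c > threshold_c and level > 0:
        return level - 1  # too hot: scale frequency/voltage down one step
    if temperature_c < threshold_c - hysteresis_c and level < max_level:
        return level + 1  # comfortably cool: restore one step of performance
    return level          # inside the hysteresis band: hold the current level


def run_dtm(sensor_readings_c, level=3, **policy):
    """Apply the policy to a sequence of sensor readings and record each level."""
    trace = []
    for reading in sensor_readings_c:
        level = dtm_step(reading, level, **policy)
        trace.append(level)
    return trace


# Hypothetical sensor trace in degrees Celsius.
print(run_dtm([70, 88, 90, 84, 79, 70]))  # [3, 2, 1, 1, 2, 3]
```

The hysteresis band keeps the controller from oscillating between levels when the temperature hovers near the threshold. Each throttled step costs performance, which is the drawback that floorplanning and placement approaches try to avoid.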

Thermal-aware floorplanning is another design-level solution. Here, the floorplanner tries to reduce hotspots on the die by distributing the temperature uniformly [56, 57]. These works have mostly focused on microprocessors. Thermal placement is a similar technique applied at the placement stage. Chen and Sapatnekar [58] proposed a partition-driven algorithm for standard-cell thermal placement. Thermal floorplanning and placement are particularly attractive because they impact performance less than DTM does.

On the modeling front, several researchers have developed tools for estimating the die temperature. Among them, HotSpot [59] is an architecture-level thermal simulator that can perform transient as well as steady-state temperature estimation. HS3d [60] is another architecture-level tool that performs only steady-state temperature estimation, but is orders of magnitude faster than HotSpot. Since this work considers only steady-state temperatures, we use HS3d.

Recently, some researchers have proposed solutions for thermal issues in 3-D ICs

too. Cong et al. [61] suggested a thermal-driven floorplanning for 3-D. Goplen and Sap-

atnekar [62] also proposed a temperature-driven placement algorithm for 3-D standard

cell ASICs. Studies have also indicated that careful insertion of thermal vias can reduce

the peak temperature [63, 64].

Thermal issues in FPGAs are relatively unexplored. Some researchers have pro-

posed the use of distributed sensors for monitoring temperatures in FPGAs [65, 66].

They, however, considered only CLBs in the fabric and, consequently, observed very

little temperature variation across the die. In Chapter 4 we characterize the thermal

profile of a real platform FPGA [67], and then observe the effect of stacking on temper-

ature. We also suggest alternate organizations to reduce the temperature.

In the long term, even 3-D may not provide the desired performance. There-

fore, we need to explore alternate technologies. Studies have looked at using some

non-lithographic technologies to manufacture FPGAs. DeHon [68], Goldstein [69], and

Tour [70] have previously proposed programmable architectures using some form of nano-

structures that are made using self-assembly. Goldstein tried to make crossbar-based de-

vices by aligning nano-wires in two planes at right angles to each other. The crosspoints

contained molecules that provided programmable logic as well as interconnections. It

suffered from problems of signal-degradation, as there was no way to restore the signal

using only two terminal devices. DeHon overcame this problem by using SiNW based

FETs to restore the signals, and proposed a PLA structure. However, the logic function-

ality in that architecture was limited to OR (and inversion). Tour, instead, proposed

replacing the logic blocks by nanocells and connecting them using metal wires. This

suffered from the problem of training these nanocells, which were assumed to consist of a

randomly connected mass of molecules. Furthermore, since the bottleneck in current

FPGAs lies in the interconnect, Tour’s architecture does not help solve this problem.

All the above architectures propose drastic changes in the existing CMOS tech-

nology as well as the design methodologies. In Chapter 5, we propose an architecture

that blends with existing technology easily, and preserves all the design methodologies

and flexibility in logic functionality [71].

Chapter 3

Reducing Leakage Energy in FPGAs

With the development of FPGAs in new technologies - 90nm and below, optimiz-

ing leakage power1 is becoming imperative. As the transistor feature sizes and threshold

voltages reduce, and the number of transistors used in FPGAs increases, the overall leakage

power is rapidly increasing. Consequently, the leakage problem is anticipated to

be a major obstacle for FPGA applications in both high performance and low-power

embedded designs. Due to this trend, we need to focus on leakage power optimizations

going beyond prior power optimization techniques for FPGAs that focus primarily on

reducing the dynamic energy [11, 12].

3.1 Using Sleep Transistors

The flexibility provided by the FPGA structures in placing different applications

results in a large portion of the components being unutilized [21]. In fact, the typical

logic utilization for the designs in our experiments is 62%. A similar trend holds for larger

benchmark suites of greater than 100 designs in different target devices [21]. These

unutilized resources in an FPGA are good candidates for leakage optimizations.

Reducing leakage power has already been the focus of optimization in various

non-FPGA architectures. These optimizations have ranged from circuit to software

1 Unlike dynamic power, which is expended only when the hardware component in question is exercised, leakage power is spent even if the component is idle.

approaches [72, 73, 74]. Among these techniques, a popular one to reduce both the

subthreshold and gate leakage components is to switch off the power supply to the circuit

by introducing a high-threshold voltage sleep transistor between the circuit and its supply

rail. The sizing of this sleep transistor has an impact on both the performance and the

area overheads imposed. Specifically, the transistor should be large for better performance.

However, a larger transistor increases the area penalty and diminishes the leakage reduction

(as wider transistors leak more). The optimal sizing of these sleep transistors has been

the focus of prior efforts and the peak current required by the supply gated circuit serves

as the reference for this sizing [75]. Since the peak currents of different portions of

a circuit do not normally occur simultaneously, prior work has used the approach of

controlling a clustered group of circuits together with a single sleep transistor [75]. This

optimization helps to reduce the area penalty as compared to using sleep transistors with

each individual sub-circuit.

It should be noted that sleep transistors can be used to control leakage in FPGAs

as well. An obvious approach would be to place unused CLBs into low-power states using

sleep transistors (see Figure 3.1). However, such a fine-grain (at individual CLB level)

power management of the FPGA fabric can introduce a significant area penalty, which

may not be tolerated in many designs. Instead, in this work, we propose a strategy,

whereby the FPGA fabric is divided into regions, each of which can be independently

controlled through a sleep transistor. A region is a rectangular array of CLBs, and is

the minimum unit of power management. This approach is similar to the clustering

technique mentioned in the paragraph above. Our experimental results indicate that the

area of the CLB arrays, including the sleep transistor area overhead, can be reduced by

Fig. 3.1. FPGA containing sleep transistors

5% when moving from using regions with 4 logic slices to regions with 256 logic slices.

By selecting a suitable region size, one can control the area overheads and at the same

time achieve large leakage savings. Based on this region concept, we also propose a

placement technique, referred to as Region-Constrained Placement (RCP), that tries to

use a minimum number of regions for a given application, thereby increasing the number

of unused regions that can be switched off. A key observation from our results is that

the leakage power savings obtained using RCP on an FPGA with coarse-grain regions is

larger than that obtained using normal placement employed on an FPGA with fine-grain

regions.

The maximum savings that can be obtained from the leakage management scheme

discussed above is limited by the volume of the unused regions. Consequently, we also

utilize a time-based control scheme that reduces leakage even in the utilized portions of

the FPGA by switching off/on the power supply, exploiting the idleness in portions of

the design. Specifically, the time-based scheme dynamically turns off power supply to

all regions containing only idle modules. We investigate combinations of the time-based

control scheme with two variants of RCP: (i) module-level RCP that places each module

of the design that exhibits a distinct idleness profile using RCP individually and turns

off power supply to all regions containing only idle modules, and (ii) design-level RCP

that places the entire design using RCP and turns off power supply to all regions that

contain only idle modules. Our experiments show that the time-based RCP scheme can

provide additional energy savings as compared to statically switching off only unused

portions.

The leakage distribution in a Xilinx FPGA in 90nm technology, with the exception

of the BRAMs and multipliers, was shown to be 38% in the configuration SRAMs,

34% in the interconnect matrix, 16% in LUTs and 12% in other logic [21]. Since many

of the techniques proposed for saving leakage energy in on-chip memory can be applied

to BRAMs (and because they are not used by most of our designs), our leakage optimizations

in this work do not target them. In order to reduce the leakage energy in the

configuration SRAM, we increased the threshold voltage of the configuration SRAM to

obtain a 98% reduction in leakage energy while increasing configuration time by 20%.

Since configuration time is not critical in most of our target designs, this tradeoff for

power savings is reasonable. The resulting leakage breakdown in our system is shown

in Figure 3.2. The focus of this work is on reducing the leakage energy in the LUTs,

arithmetic logic and flip flops that account for 45% of the total leakage energy. While

our work focuses on the slices, the technique can be extended to switching off the routing

resources as well. This is a part of our planned future work.

Fig. 3.2. Leakage energy break-down

Fig. 3.3. (a) Horizontal and (b) vertical styles of RCP on an XC2V40 FPGA for a region size of 2×4 slices. Required number of slices is 100 (13 regions)

In order to provide leakage control, the FPGA is divided into regions. A region

consists of one or more neighboring slices (potentially across different CLBs), and is the

minimum power management unit (granularity). Sleep transistors are embedded into the

FPGA fabric controlling the power supply to the individual regions. In this architecture,

the control bit for the power switch (See Figure 3.1) of the region determines whether

the region is supply-gated or not. The control bits of the different regions are set during

the configuration of the FPGA. The area overhead associated with the control bits (and

the associated wiring) is proportional to the number of regions, while their impact on

leakage energy is relatively small due to the use of high threshold transistors for the

configuration bits. Thus, the area overhead favors a smaller number of large regions.

An important issue in the design of this architecture is the sizing of the power

switches. The power switches should be large enough to support the peak current re-

quirements of the logic slices that they control to have negligible impact on performance.

Since the peak current for a larger region is less than the sum of the peak currents of

smaller regions constituting the larger region, it is possible to have a smaller area over-

head when moving to larger regions with similar performance. In order to show this

impact, we experimented with two different region sizes of 256 slices and 4 slices using

XPower [1]. A single region of 256 slices had a peak current that was 68% of the sum

of peak currents of 64 regions each of 4 slices constituting the same area. Next, we per-

formed SPICE simulations to estimate the sleep transistor size for various region sizes.

Assuming a slice area of 5000 sq. micron (from custom layout), it was estimated that

the area penalty for a region size of 4 slices was around 15%, while that for 256 was 10%.

This motivates the need for using large region sizes.
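The area figures above can be turned into a short calculation. The following is a minimal sketch, not part of the original tool flow: it uses only the slice area and the two area penalties quoted in the text, and the function names are hypothetical.

```python
SLICE_AREA_UM2 = 5000.0  # slice area in sq. micron, as quoted in the text

# Sleep-transistor area penalty as a fraction of slice area,
# for the two region sizes reported in the text.
AREA_PENALTY = {4: 0.15, 256: 0.10}


def area_per_slice(region_slices: int) -> float:
    """Slice area plus its share of the sleep-transistor area (sq. micron)."""
    return SLICE_AREA_UM2 * (1.0 + AREA_PENALTY[region_slices])


def relative_area_saving(small: int, large: int) -> float:
    """Fractional reduction of gated-array area when moving from small to large regions."""
    return 1.0 - area_per_slice(large) / area_per_slice(small)


if __name__ == "__main__":
    print(area_per_slice(4), area_per_slice(256))
    print(round(100 * relative_area_saving(4, 256), 2), "% smaller")
```

With these numbers the penalty drops by 5 percentage points of slice area, which corresponds to roughly 4.3% of the gated array area.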

The amount of leakage reduction due to the introduction of the power switch is

also influenced by the sizing and threshold voltage of the sleep transistor and whether

a PMOS or NMOS transistor is used to gate the VDD or ground power supply rail.

The leakage reduction varies from 85-98% based on these factors, incurring performance

degradation varying from 0-30% [76]. In our experiments, we use a PMOS gate switch

providing 90% leakage reduction.

3.1.1 RCP: Region-Constrained Placement

The placement of the design has a significant impact on the ability to supply-

gate the logic slices in our region-based architecture. Because the PAR tool in the

normal design flow has no notion of regions, it tends to scatter the utilized slices across

different regions (see Figure 3.4(a)). Since the regions with partially used slices cannot

be supply-gated, the potential for leakage energy savings reduces. Thus, we propose

a new region constrained placement strategy, RCP, that takes into account the region

concept explicitly.

The basic principle of RCP is to constrain the placement of the design to specific

regions of the FPGA (See Figure 3.4(b)) and leave some regions of the FPGA completely

unused, so that they can be supply-gated. This in turn helps to maximize the potential

leakage savings. In our implementation of RCP, we place the design into contiguous

regions to the extent possible and utilize two different styles: horizontal and vertical

placements as shown in Figure 3.3. While the horizontal and vertical placements utilize

the same number of logic slices, they do not provide similar performance results due

to asymmetry in the target Virtex-II architecture. For example, there are fast carry

chains running vertically in the FPGA, but not horizontally. Furthermore, there are

more slices in a column than in a row in all Virtex-II parts (except XC2V40, which has

16 slices in both directions). While we confine the utilization of logic slices to specified

regions, the constraints on routing resources are kept “soft” in order to circumvent issues

with routing congestion, routing of IO signals, and unroutability. This permits the

use of routing resources outside the regions that have logic placed in them. As part of

our future work, we plan to investigate a supply-gating mechanism that also switches off

interconnect muxes.
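The region selection behind RCP can be sketched as follows. This is an illustrative sketch only: the text does not specify the exact fill order, so row-major order for the horizontal style and column-major order for the vertical style are assumptions, and all function names are hypothetical.

```python
import math


def regions_needed(used_slices: int, region_w: int, region_h: int) -> int:
    """Minimum number of regions able to hold the design's slices."""
    return math.ceil(used_slices / (region_w * region_h))


def rcp_regions(used_slices, region_w, region_h, fpga_w, fpga_h, style):
    """Return the (column, row) indices of the regions allotted to the design.

    'horizontal' fills a row of regions before moving to the next row;
    'vertical' fills a column of regions before moving to the next column.
    Assumes the device dimensions are multiples of the region dimensions.
    """
    cols, rows = fpga_w // region_w, fpga_h // region_h
    n = regions_needed(used_slices, region_w, region_h)
    if n > cols * rows:
        raise ValueError("design does not fit in the device")
    if style == "horizontal":
        order = [(c, r) for r in range(rows) for c in range(cols)]
    elif style == "vertical":
        order = [(c, r) for c in range(cols) for r in range(rows)]
    else:
        raise ValueError("style must be 'horizontal' or 'vertical'")
    return order[:n]


if __name__ == "__main__":
    # The Figure 3.3 case: 100 slices, 2x4 regions, 16x16-slice XC2V40.
    print(regions_needed(100, 2, 4))
    print(rcp_regions(100, 2, 4, 16, 16, "horizontal"))
```

The remaining regions are left empty and become candidates for supply gating.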

3.1.2 Combining RCP and Time-Based Control

Fig. 3.4. Different placements for an example design: (a) traditional, (b) RCP, (c) module-level RCP. In part (c), each module is bounded by a polygon

It should be observed that RCP is essentially a static technique where the unuti-

lized FPGA space (regions) can be shut off at configuration time (before the execution).

While it is easy to implement, it may not be as effective in designs that occupy a large

portion of the FPGA space (which in turn limits the potential leakage savings). However,

for the designs with modules that remain inactive over significant durations of time,

we can employ a time-based control scheme that reduces leakage even in the utilized

portions of the FPGA by switching off/on the power supply, exploiting the idleness in

portions of the design. Specifically, the time-based scheme turns off power supply to

all regions containing only idle modules. We investigate combinations of the time-based

control scheme with two variants of RCP: (i) module-level RCP that places each module

of the design that exhibits a distinct idleness profile using RCP individually, and turns

off power supply to all regions containing only idle modules, and (ii) design-level RCP

that places the entire design using RCP and turns off power supply to all regions that

contain only idle modules.

We can implement the idea of time-based control as follows. The gate voltage of a

sleep transistor is still controlled by a configuration bit. However, instead of configuring

this bit statically when the design is loaded on the FPGA, dynamic reconfiguration

[77] of these control bits is used to switch a sleep transistor on or off. In order to

limit the overhead of reconfiguring these control bits, the sleep transistor should not

change state very frequently. Furthermore, it may be useful to support reconfiguring just

these control bits, whereas the minimum reconfigurable block in current Virtex-II

technology is a frame [77]. The reconfiguration time for one frame varies from

2 µs for the smallest to 23 µs for the largest FPGA. Support for reconfiguring

only the sleep transistor configuration bits can reduce this time, but may

increase the area overhead due to the additional configuration circuitry.

With increasing FPGA sizes, it is possible to envision an entire system on FPGA.

In such designs, many parts of the design may remain inactive for long durations. Time-

based control seems to be a very promising approach for such designs. Figure 3.4(c) shows

an example design placement using module-level RCP for time-based leakage control. We

see from this figure that modules of the design get placed on non-overlapping regions,

thus maximizing the number of regions that can be dynamically switched off. Note that

this slightly decreases the statically unused portion on the FPGA (because in order to

ensure the inter-module region exclusivity needed for module-level RCP, some regions

can only be partially filled). Still, our experiments show a significant increase in leakage

savings due to module-level RCP.

3.1.3 Experimentation

In order to investigate the energy savings due to the proposed approach, we

selected a set of applications and used the Xilinx Virtex-II FPGA as our target hardware.

The selected applications include 14 publicly available reference designs provided by

Xilinx, 4 designs from ITC’99 benchmark suite, 3 academic designs and 14 commercial

designs available internally at Xilinx. Table 3.1 provides the important characteristics of

each application and lists the number of slices, IO blocks (IOBs), block RAMs (BRAMs)

and multipliers (MULTs) used in the designs along with the target FPGA device used

for the mapping. Note that, on average, only 62% of the slices were used. Industry6

is an extreme case: although only 4% of the slices are used, the I/O

requirements of the design prevent it from being mapped to a smaller FPGA.

These designs were then implemented using the experimental flow illustrated in

Figure 3.5 to evaluate the energy savings possible due to the proposed optimizations.

The specific steps in this design flow are elaborated below.

All the designs were synthesized for area-optimization from their HDL represen-

tation using the Xilinx Synthesis Technology (XST). This synthesis step produced a

gate-level netlist. Next, the designs were mapped on to the smallest possible Virtex-II

FPGA device, setting the place and route effort level high (level 5). After the mapping

and completion of place and route (PAR), an NCD file that contains the placed and

routed design is generated. The map process also generates a MAP report which is used

to implement RCP. The maximum clock frequency for the design was estimated by using

the post-PAR static timing analysis tool, TRACE on the mapped design. The NCD file

was translated to an ASCII file in XDL format using the xdl tool. This ASCII file was

processed using a customized tool developed for this project to determine the unused

regions of the FPGA given the region sizes. Using this information, the leakage savings

possible in the standard placement process was obtained (assuming that the regions that

are completely unused are switched off).
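The region analysis performed on the placed design can be sketched as below. This is not the customized tool itself, only a minimal illustration of the idea: the input is assumed to be a list of (x, y) coordinates of used slices (in the real flow these would come from the XDL file), and the function name is hypothetical.

```python
def unused_regions(used_slices, fpga_w, fpga_h, region_w, region_h):
    """Count the regions that contain no used slice.

    used_slices: iterable of (x, y) slice coordinates occupied by the design.
    Returns (unused_region_count, total_region_count).
    Assumes the device dimensions are multiples of the region dimensions.
    """
    cols, rows = fpga_w // region_w, fpga_h // region_h
    # A region is occupied as soon as a single slice inside it is used.
    occupied = {(x // region_w, y // region_h) for x, y in used_slices}
    total = cols * rows
    return total - len(occupied), total


if __name__ == "__main__":
    scattered = [(i, i) for i in range(16)]                   # 16 slices on a diagonal
    packed = [(x, y) for x in range(2) for y in range(8)]     # 16 slices in two regions
    print(unused_regions(scattered, 16, 16, 2, 4))
    print(unused_regions(packed, 16, 16, 2, 4))
```

The two example placements use the same number of slices, yet the scattered one leaves fewer regions completely unused, which is the effect RCP is meant to avoid.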

In order to determine the leakage savings using RCP, the synthesized gate-level netlist

(NGC file) was re-used. The MAP report from the normal mapping was used to determine

the number of logic slices used in the design. Based on the number of slices

obtained and the size of the regions, a User Constraints File (UCF) was created to re-

strict the placement to a specific number of regions. Different UCFs were created for

horizontal and vertical styles of RCP, and for different region sizes. Mapping and place

and route with the specified constraints produce an NCD file. Similar to the

Fig. 3.5. Experimental Flow

normal placement scheme, the maximum clock frequency for the design is estimated by

using TRACE. Leakage energy savings are evaluated in this case by assuming that power

supply to all unused regions is turned off. As explained earlier, an estimated 45% of the

total leakage happens in the logic slices (Figure 3.2). Furthermore, as explained in the

beginning of this chapter, leakage reduces to 10% of the original using supply-gating with

PMOS transistors. Thus, if for some design, 25% of the slices can be switched off, then

the leakage power is reduced by (25×0.9×0.45) = 10.125%. Furthermore, suppose after

RCP, the clock frequency degrades to 97% of original. Then, the new leakage energy

(Power-Delay-Product, PDP) will be (89.875/0.97) = 92.65%.
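The arithmetic in this paragraph can be written as a small helper. The sketch below only restates the calculation from the text (45% of leakage in the slices, 90% reduction from supply gating); the function names are hypothetical.

```python
SLICE_LEAKAGE_SHARE = 0.45  # fraction of total leakage in the logic slices (Figure 3.2)
GATING_REDUCTION = 0.90     # fraction of leakage removed in a gated region (PMOS switch)


def leakage_power_saving(fraction_off: float) -> float:
    """Percent of total leakage power saved when a fraction of the slices is gated."""
    return 100.0 * fraction_off * GATING_REDUCTION * SLICE_LEAKAGE_SHARE


def leakage_energy(fraction_off: float, freq_ratio: float) -> float:
    """Leakage energy (power-delay product) as a percent of the baseline.

    freq_ratio is the clock frequency after RCP divided by the original one;
    a slower clock stretches execution time and therefore the leakage duration.
    """
    remaining_power = 100.0 - leakage_power_saving(fraction_off)
    return remaining_power / freq_ratio


if __name__ == "__main__":
    print(leakage_power_saving(0.25))            # the 25% example from the text
    print(round(leakage_energy(0.25, 0.97), 2))  # with the clock at 97% of original
```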

Our experiments were performed for different region choices. Region-widths of

2, 4, 8, 16 slices, and heights of 2, 4, 8, 16 slices were considered. Thus, a total of 16

different region choices were explored. Furthermore, as explained earlier, two styles of

RCP: horizontal and vertical were explored. Thus, a total of 32 different varieties of

RCP were explored.
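The explored design space can be enumerated directly. This is a small illustrative sketch; the variable names are hypothetical.

```python
from itertools import product

WIDTHS = (2, 4, 8, 16)     # region widths in slices
HEIGHTS = (2, 4, 8, 16)    # region heights in slices
STYLES = ("horizontal", "vertical")

# Every (width, height) pair is one region choice.
region_choices = list(product(WIDTHS, HEIGHTS))

# Every region choice is tried with both RCP styles.
rcp_varieties = list(product(WIDTHS, HEIGHTS, STYLES))

if __name__ == "__main__":
    print(len(region_choices), len(rcp_varieties))
```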

3.1.3.1 Time-based leakage control

The experiments for time-based leakage control were performed using an academic

design implementing an Adaptive Viterbi Algorithm (AVA) decoder [78]. The design

consists of 3 AVA decoders of varying constraint lengths (4, 6, and 9). Different decoders

are selected depending on the noise levels in the transmission channel. If the noise level

is high, then the decoder with a larger constraint length is selected. In [78], the authors

utilize reconfiguration to switch between decoders of different constraint lengths. We

modified the design by statically mapping 3 different sizes of decoders on the FPGA,

and selecting the right decoder depending on noise in the channel. For this work, we

assumed that an input coming into the FPGA decides which decoder to choose. The

design was mapped onto an XC2V1500-bg575. The resource usage was 5469 slices (71%),

90 IOBs (22%), 0 BRAM and 0 multiplier. The three different decoders occupied 718,

1846, and 2854 slices respectively. Another module, which remained active all the time

(branch metric generator) occupied 51 slices. The advantage of this design is that the

decoding can be done much more rapidly if the channel is not noisy. The drawback is

that at any given time, two decoders are sitting idle. This creates scope for switching off

the idle decoders. We estimated and compared the leakage savings for this design for

design-level RCP, module-level RCP and normal placement, assuming run-time leakage

control. We also compared savings obtained from run-time leakage control with static

control. In order to estimate the leakage savings from run-time control, we assumed that

each of the 3 decoders is active for equal durations. Thus, each of the three decoders can

be switched off for two-thirds of the total time.
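The equal-duration assumption gives a simple upper bound on how many used slices are idle on average. The sketch below is an idealization that ignores region granularity and any overlap between modules inside a region; the slice counts come from the text, the decoders are keyed by constraint length in the order listed there, and the function name is hypothetical.

```python
from fractions import Fraction

# Slices per decoder, keyed by constraint length (order as given in the text).
DECODER_SLICES = {4: 718, 6: 1846, 9: 2854}
ALWAYS_ON_SLICES = 51  # branch metric generator, active all the time


def average_idle_slices(decoder_slices, active_share):
    """Time-averaged number of used slices that belong to idle decoders.

    active_share maps each decoder to the fraction of time it is active.
    """
    return sum(n * (1 - active_share[k]) for k, n in decoder_slices.items())


equal_share = {k: Fraction(1, 3) for k in DECODER_SLICES}
idle = average_idle_slices(DECODER_SLICES, equal_share)
used = sum(DECODER_SLICES.values()) + ALWAYS_ON_SLICES

if __name__ == "__main__":
    print(idle, "of", used, "used slices idle on average")
```

Under this idealization, roughly two-thirds of the used slices could be gated at any time; the savings achievable in practice depend on how the modules map onto regions.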

3.1.4 Results and Analysis

Figure 3.6 plots the average estimated leakage power savings by switching off the

unused regions in FPGA. The savings are represented as percent of total leakage (that

occurs without any switching off). A region represented as 2_4 means that the region is

2 slices wide and 4 slices high. Plots for RCP as well as without RCP have been shown.

For both RCP and normal placement, leakage savings decrease with increase in region

size. However, the decrease for RCP is very small compared to normal placement. As is

Fig. 3.6. Average leakage power savings for RCP and normal placement.

Fig. 3.7. Leakage power savings for RCP for a 4×16 region for all designs.

Fig. 3.8. Average clock frequency for RCP.

Fig. 3.9. Average leakage energy savings for RCP and normal placement.

evident from the plots, RCP clearly outperforms normal placement. Especially for large

region sizes, RCP provides more than 6 times the savings of normal placement. This

happens because, although the number of slices used is the same in both cases, in the case

of normal placement they are scattered across regions. Larger regions can accentuate

this problem.

We observed that the leakage power savings are strongly dependent on the re-

source usage of a design. Figure 3.7 plots the variation, across all designs, of leakage

power savings for a single region choice. It shows that the leakage power savings vary

significantly depending on the design. For some designs, there is no leakage saving be-

cause those designs occupy all the regions of the FPGA. Leakage power is reduced by

more than 20% for 40% of the designs.

However, the constraint on the placement due to RCP can influence the timing

of the signals. Figure 3.8 plots the average estimated clock frequencies achieved using

RCP, expressed as a percentage of frequency estimated for normal placement. A region

represented as 2_4_h refers to horizontal style of RCP with region of height 4 slices

and width 2 slices. Similarly, a region represented as 2_4_v refers to vertical style with

the same region size and shape. The plot shows that for all regions, the average clock

frequency is within 8% of original clock frequency.

The performance penalty can result in longer execution time and consequently

increase the duration of leakage. To capture this impact, Figure 3.9 plots average esti-

mated leakage energy savings for RCP as well as for normal placement. We note that

except for very fine-grain regions, RCP always results in higher leakage energy savings.

The difference between the two increases for large region sizes. Again note that small

region size incurs larger area overhead due to larger effective sleep transistor size, more

routing and control signals, and more configuration bits (which increases configuration

time too).

Fig. 3.10. Leakage power savings for time-based leakage control.

Fig. 3.11. Leakage energy savings for time-based leakage control.

3.1.4.1 Time-based Leakage Control

Figure 3.10 plots leakage power savings for dynamic and static leakage controls

for the AVA decoder design. The savings from dynamic control are shown for module-level

RCP (modules get placed in non-overlapping regions), design-level RCP, and for

normal placement. The savings from static leakage control are shown for design-level

RCP, and for normal placement.

It is observed that time-based leakage control results in very large savings com-

pared to static control. Furthermore, among the different placement strategies for time-

based control, module-level RCP outperforms the others. Design-level RCP performs

better than normal placement in most cases, but in some cases normal placement results

in larger savings. This happens because in case of normal placement, the 3 different

modules are placed slightly separated (because the placer has a larger area available to

place the modules). Therefore, only a few regions are common among the different mod-

ules. In case of design-level RCP, the placer has a smaller area in which to fit the entire

design. This increases the overlap among the 3 modules, thus disabling the dynamic

switching-off of those regions.

Figure 3.11 plots leakage energy savings for time-based control (module-level RCP,

design-level RCP, normal placement with no RCP) and static control (RCP, normal

placement with no RCP) for the AVA decoder design. It is observed that time-based

control results in very large savings compared to static control. Also, in all but two

cases, module-level RCP results in the largest energy savings.

It must be observed that the plots shown above do not account for additional

overhead for dynamic reconfiguration of the control bits. However, even assuming that

reconfiguration incurs a 10% increase in overall execution time and consequent leakage

energy penalty, we find that module-level RCP with time-based leakage control provides

19% (27% without reconfiguration overhead) more leakage savings than a normal

placement with static leakage control.

Fig. 3.12. Supply transistors used for programmable Vdd, panels (a), (b), and (c)

3.1.5 Summary

Our work demonstrates that switching off parts of an FPGA can result in significant

leakage savings in most designs. The savings can be further increased by using Region

Constrained Placement (RCP). Furthermore, if RCP is used then the switch-off granu-

larity need not be very fine, since leakage savings decrease very gradually with increasing

region size. Thus, considering the area overhead of having very small regions, a large

region size coupled with RCP looks to be a practical choice. Module-level RCP is a

promising enhancement for designs in which some modules stay inactive for significant

durations of time.

3.2 Dual-Vdd FPGA

Reducing the supply voltage (Vdd) is an effective technique for reducing both

dynamic and static power. Dynamic power varies quadratically with supply voltage,

while both sub-threshold leakage (due to Drain Induced Barrier Lowering, DIBL) and

gate leakage vary exponentially. However, reducing the supply voltage negatively affects

the circuit performance. Dual-Vdd is a popular technique to reap the benefits of voltage

scaling without its performance penalty. The timing-critical blocks in the design operate

on the normal Vdd (or Vddh), while non-critical blocks operate on a second supply rail

with a lower voltage (or Vddl). While dual-Vdd ICs have been successfully used in low-

power ASICs and custom ICs [34], no commercial FPGA today uses multiple Vdd’s for

power reduction.2
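The quadratic dependence of dynamic power on supply voltage can be made concrete with a short sketch. It assumes the usual proportionality of dynamic power to C·Vdd²·f with capacitance, frequency, and activity held constant; the voltages in the example (1.0 V and 0.8 V) are hypothetical values chosen for illustration, not the ones used in this work, and the function names are hypothetical as well.

```python
def dynamic_power_ratio(vddl: float, vddh: float) -> float:
    """Dynamic power of a block at vddl relative to the same block at vddh."""
    if not 0 < vddl <= vddh:
        raise ValueError("expected 0 < vddl <= vddh")
    return (vddl / vddh) ** 2


def mixed_dynamic_power(fraction_low: float, vddl: float, vddh: float) -> float:
    """Relative dynamic power when a fraction of identical blocks runs at vddl.

    Assumes every block contributes equally to the total dynamic power.
    """
    if not 0 <= fraction_low <= 1:
        raise ValueError("fraction_low must be between 0 and 1")
    return fraction_low * dynamic_power_ratio(vddl, vddh) + (1 - fraction_low)


if __name__ == "__main__":
    print(dynamic_power_ratio(0.8, 1.0))       # one block moved to the low rail
    print(mixed_dynamic_power(0.5, 0.8, 1.0))  # half of the blocks on the low rail
```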

The difficulty of designing a dual-Vdd FPGA is that the optimal Vdd assign-

ment changes from one design to another. Consequently, if logic blocks are statically

determined to be operating at low or high Vdd, the placement and routing algorithms

need to be modified accordingly (e.g., [35]). However, static assignment of Vdd to the

blocks may prevent the ability to reduce power or to meet timing constraints for some

designs. In contrast, the use of Vdd-programmability for each block helps to tune the

number of high and low Vdd blocks as desired by the application. In this approach,

the challenge is in determining the Vdd assignments to each block. The need for level

converters wherever a low-Vdd block drives a high-Vdd block and the associated delay

and energy overheads are important considerations when performing these Vdd assign-

ments. Furthermore, positioning of the level converters influences the ability to assign

lower Vdd’s to the routing blocks.
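The level-converter requirement can be expressed as a simple check over the netlist. The sketch below is illustrative only: it represents the design as a list of (driver, sink) connections and a dictionary of Vdd assignments, both hypothetical data structures, and it is not the Vdd assignment algorithm itself.

```python
def level_converter_edges(edges, vdd):
    """Return the connections on which a low-Vdd block drives a high-Vdd block.

    edges: list of (driver, sink) block names.
    vdd:   dict mapping each block name to "H" (Vddh) or "L" (Vddl).
    """
    return [(d, s) for d, s in edges if vdd[d] == "L" and vdd[s] == "H"]


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
    vdd = {"a": "H", "b": "L", "c": "H", "d": "L"}
    # Only b -> c crosses from the low rail to the high rail.
    print(level_converter_edges(edges, vdd))
```

An assignment algorithm can use such a count to weigh the power saved by lowering a block's Vdd against the delay and energy of the converters it introduces.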

In our programmable dual-Vdd architecture, the Vdd of a circuit block is selected

between Vddh and Vddl by using two high-Vt transistors (supply transistors) connecting

the block to the supplies (see Figure 3.12). This circuit was previously used by [36]. The

2 Xilinx Virtex-II FPGAs use different supply voltages for I/O and the core. Pass transistors used for interconnects are also supplied higher gate voltages to eliminate the Vt drop. However, this is not targeted to reduce power.

state (ON/OFF) of each supply transistor is controlled by a configuration bit, which is

set by the Vdd assignment algorithm. The configuration bits are set either to connect

the block to one of the power supplies or completely disconnect the block from both

the power supply lines when the block is unused or idle. We evaluate the effectiveness

of different Vdd assignment algorithms and implementation choices for an island-style

FPGA architecture designed in 65nm technology. Our results demonstrate that one of

the Vdd assignment techniques provides an average power saving of 61% across different

MCNC benchmarks.
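The three supply states of a block described above can be summarized in a small sketch. It abstracts each supply transistor as simply on or off (the actual gate polarity of the configuration bits is not modelled), and the names are hypothetical. The case with both transistors on is not a configuration described in the text and is rejected here.

```python
from enum import Enum


class Supply(Enum):
    OFF = "off"     # block disconnected from both rails (unused or idle)
    VDDH = "Vddh"   # block connected to the high supply
    VDDL = "Vddl"   # block connected to the low supply


def supply_state(high_on: bool, low_on: bool) -> Supply:
    """Map the on/off state of the two supply transistors to the block's supply."""
    if high_on and low_on:
        raise ValueError("both supply transistors on: not a valid configuration")
    if high_on:
        return Supply.VDDH
    if low_on:
        return Supply.VDDL
    return Supply.OFF


if __name__ == "__main__":
    print(supply_state(True, False), supply_state(False, True), supply_state(False, False))
```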

3.2.1 Architecture

We propose two types of dual-Vdd architectures. The first, Fully Programmable

(FP), architecture allows all logic blocks (CLBs) and routing resources to be indepen-

dently programmed as Vddh or Vddl. The second, Logic Programmable (LP), gives that

flexibility only for CLBs, and fixes the voltages of the routing resources. Both the archi-

tectures are built on cluster-based island-style FPGAs, with the configuration stored in

SRAM cells. The basic logic element (BLE) consists of a 4-input LUT and a flip-flop.

Multiple BLEs cluster together to form a CLB (see Figure 3.13).

In both architectures, level conversion takes place only at CLB pins. For this

purpose, CLB pins have level converters (LCs) attached to them. A multiplexer allows

the level converter to be bypassed if level conversion is not needed at that pin. Placing the

level converter only at CLB pins reduces the complexity of the routing fabric, and also

limits the area and leakage overhead of level converters.

(a) Dual-Vdd CLB (b) Dual-Vdd routing mux

Fig. 3.13. Fully programmable dual-Vdd architecture (FP)

3.2.1.1 Fully Programmable (FP)

The FP architecture facilitates configurable supply voltage for logic blocks and

routing multiplexers. Figure 3.13(a) shows how the CLB is configured using high-Vt

supply transistors to operate at two different voltages.

We experimented with two variants of FP, differing in the placement of the level

converters. While the first version places LCs at the output pins of CLBs, the second

places them at CLB input pins. Figure 3.13(a) shows the first case, where only the

output pins of a CLB have LCs attached to them. In this case, a net with multiple

fanouts operates at high Vdd if any one of the CLBs driven by this net is at high Vdd

(since the signal’s voltage level does not change in the routing fabric). This limits the

number of routing muxes that can operate at low Vdd, and therefore is less effective in

reducing routing power compared to the case when LCs are attached to CLB input pins.

However, the drawback of keeping LCs at input pins of CLBs (apart from area penalty)

is that a larger number of LCs are needed, which increases the leakage in logic blocks.

Our results support this reasoning, but show that overall leakage is lower for the second

case.
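The two level-converter placements imply different rules for the voltage of a net. The following is a minimal sketch of those rules as described in this section; the function names and the string labels "Vddh"/"Vddl" are illustrative assumptions, not part of the implementation used in this work.

```python
def net_vdd_lc_at_outputs(sink_vdds):
    """LCs only at CLB outputs: the signal level cannot change in the
    routing fabric, so the net runs at Vddh if any driven CLB is at Vddh."""
    return "Vddh" if any(v == "Vddh" for v in sink_vdds) else "Vddl"


def net_vdd_lc_at_inputs(driver_vdd):
    """LCs at CLB inputs: the routing muxes driven by a CLB take the
    voltage of that CLB, and each sink converts the level if it needs to."""
    return driver_vdd
```

With LCs at outputs, a single high-Vdd sink forces the whole net (and its routing muxes) to Vddh, which is why that variant leaves fewer routing muxes at low Vdd.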

Figure 3.13(b) shows a routing multiplexer (mux) in the FP architecture. The

multiplexer’s output is connected to a level-restoring buffer to restore the Vt-drop

through the NMOS-based multiplexer. Note that the same set of supply transistors

controls the voltage of configuration SRAM cells and the level-restoring buffer. Since

the configuration SRAM is not timing critical, the supply transistors need to be sized

just enough to supply the maximum current needed by the level-restoring buffer.

If a circuit block (CLB or routing mux) is completely unused, then in order to

save leakage, it is desirable to completely switch off that block. This is achieved by

keeping a separate configuration bit for every supply transistor3. Although this incurs

more area overhead, it results in significant leakage savings, since resource utilization in

an FPGA is typically low [21, 23].
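As an illustration of the per-transistor configuration bits, the sketch below decodes two bits into the three supply states of a block. The names and the bit polarity (True meaning "transistor on") are assumptions made for this example; treating "both on" as illegal is also an assumption of the sketch.

```python
from enum import Enum


class Supply(Enum):
    OFF = "off"      # both supply transistors off: block is in sleep state
    VDDL = "Vddl"
    VDDH = "Vddh"


def supply_state(bit_h: bool, bit_l: bool) -> Supply:
    """Decode one configuration bit per supply transistor (hypothetical encoding)."""
    if bit_h and bit_l:
        # Both transistors on would tie the block to both supply rails at once.
        raise ValueError("both supply transistors enabled")
    if bit_h:
        return Supply.VDDH
    if bit_l:
        return Supply.VDDL
    return Supply.OFF
```

A single shared bit could only choose between Vddh and Vddl; the second bit is what makes the fully-off state available for unused blocks.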

Due to the area overhead of level converters and supply transistors (and associated

configuration SRAM cells), the dual-Vdd FPGA takes approximately 50% more area

than a single-Vdd FPGA.

The majority of leakage in an FPGA occurs in the configuration SRAM cells.

[23] have previously shown that by increasing the threshold voltage of the configuration

SRAM, its leakage can be reduced by 98%, while increasing configuration time by 20%.

Since configuration time is not critical in most of our target designs, this tradeoff for

power savings is reasonable. For applications where configuration time is crucial, we have

3 In case of a routing mux, we need to pull down the control signals when the mux is unused. The pull-down transistors can be sized very small.

proposed the use of Asymmetric SRAM cells [31]. In order to see the effect of dual-Vdd

on power consumption, we have neglected the configuration SRAM leakage both for the

single supply design, and for the dual supply design (since the reduction of configuration

SRAM leakage is achieved by increasing its threshold voltage, and is equally applicable

to both single and dual supply designs).

3.2.1.2 Logic Programmable (LP)

The LP architecture facilitates configurable supply only to logic blocks (see Fig-

ure 3.14). The routing resources run at supplies fixed at the time of device fabrication.

The routing switches contain sleep transistors to cut off their power supply when not

used.

The FP dual-Vdd FPGA of the previous section results in a large area penalty

of about 50%. A key observation is that most of the area is consumed by the routing

resources. By fixing the supply voltages of routing resources, an LP FPGA eliminates

the supply transistors and associated configuration SRAM cells in the routing fabric.

Instead, we need only one sleep transistor per routing switch. This sleep transistor is

controlled by the SRAM cell that controls the state of the routing switch. This more

than halves the area cost of supply transistors in the routing fabric. Compared with

a single Vdd FPGA, the area penalty for an LP FPGA is close to 20%. This circuit

is similar to one of the circuits in [39], with the difference that in our case the supply

voltage could be either Vddh or Vddl while they fixed the supply to Vddh for routing.

Every logic block still has its own supply transistors, and can be independently

programmed to function at Vddh or Vddl. In order to further reduce the area penalty

Fig. 3.14. Logic programmable dual-Vdd architecture (LP)

due to these supply transistors, we share the supply transistors among multiple logic

blocks. Since not all CLBs draw their maximum current at the same time,

the supply transistor can be sized smaller than the sum of independent supply transistors.

Hence, the area overhead of supply transistors is reduced.

Level conversion still occurs only at CLB pins. However, unlike FP, we do not

have the flexibility to set the Vdd of nets to match that of logic blocks connected to

them. Therefore, we need to allow for level conversion at both input and output pins of

CLBs.

The LP architectures are especially suited for low-cost applications with low power

requirements.

3.2.1.3 Level Conversion

Level converters have been studied widely ever since multi-Vdd circuits were pro-

posed [33, 79]. The area, delay and power overheads of level converters prohibit random

Fig. 3.15. Level converter circuit

Vdd assignment to logic elements of a circuit. For the present work, we have used the

level converter circuit shown in Figure 3.15, and a 65nm Berkeley Predictive SPICE

model [80] to simulate it. For an FPGA architecture where level converters are placed

at CLB input pins, four level converters are required per BLE. For a Vddh of 1.1V and

Vddl of 0.9V, the LC delay is almost 17% of the delay of an LUT, and as much as 41% of

the clock-to-Q delay of the flip-flop. This significant delay in the LC prohibits the use of

many LCs within a logical path of the circuit. In contrast to delay, power consumption

in an LC was observed to be negligible (< 1%) compared to a BLE. This allows us to

place LCs at all pins of a CLB and still save power.

3.2.2 Methodology

We used VPR and its power model [10, 14] for this work. MCNC benchmarks

were used to evaluate the dual-Vdd architecture and Vdd assignment algorithms. The

architecture of FP FPGA closely resembles a modern FPGA. The LUT size of 4, and

cluster size of 8 LUTs are the same as a Xilinx Virtex-II device. The routing channel

Fig. 3.16. Experimental Flow

consists of 200 tracks, with buffered segments of lengths 1, 2, 6 and “long”. The switch

block used a Wilton topology [81].

For LP, however, we simplified the fabric to resemble the one used by [82]. The

CLB consists of 4 BLEs. The routing fabric consists of only length-four segments, which

has been shown to be the best for area and speed by [82]. We further changed the

switch block topology to Subset. These simplifications made it easier to implement the

LP architecture in VPR. A Subset switch block connects only segments of the same

type. In an LP FPGA, we wanted no connections from a Vddl routing resource to a

Vddh resource because the routing switches did not have any level converters. Using a

Subset switch block made it easier to guarantee this (by creating a type for segments

at a particular Vdd). This, however, also does not allow connections from Vddh to

Vddl routing resources, and therefore, the power savings we report here for LP could

be improved. For the purpose of comparison of FP with LP, this restriction is justified

because we do not allow such connections for FP either. Furthermore, we chose all

segments to be of length 4 because we did not want nets to solely use longer or shorter

wires. Because of the Subset topology, only wires of the same type would connect, and

therefore, a length 6 wire will not connect to a length 2 wire (which does not resemble a

modern commercial FPGA architecture, such as Virtex-II). Despite these simplifications,

we believe our results to be indicative of other segmented routing architectures as well.

Circuit simulations were performed in SPICE using 65nm BSIM4 device models.

Delays of BLE and LC were obtained from these simulations. Power consumption, both

static and dynamic, of the LC was also obtained through SPICE simulations. Figure 3.16

shows the experimental flow. The flow deviates from a normal VPR flow after the place

and route stage. We first assign voltage to all CLBs using algorithms that are discussed

below, and then estimate power of the design placed and routed on the target dual-Vdd

architecture. Assigning voltages after routing makes the timing analysis more accurate,

since all the routing delays get incorporated in the timing graph.

3.2.2.1 Vdd Assignment

In order to be effective, a dual-Vdd scheme requires that paths in the circuit vary

in their delays. If all paths have the same delay, then all circuit elements will require high

Vdd to maintain the performance of the design.

Figure 3.17 shows the distribution of path delays averaged over MCNC bench-

marks. We observe that path delays in a circuit vary considerably. Therefore, a dual-Vdd

scheme can be expected to reduce the power consumption significantly. Figure 3.17 also

shows the path delays after using our dual-Vdd assignment algorithms.

Algorithm 1 Algorithm for Vdd assignment: Low-to-High (assuming LCs at CLB input pins)

  Assign Vddl to all CLBs and routing muxes
  P ← list of all paths in the design
  T ← longest path delay when all blocks operate at Vddh
  Td ← xT, x ≥ 1 is a user-defined performance metric
  critical_path ← Pi ∈ P | delay(Pi) > Td
  for all CLBs do
    criticality(CLB) ← # paths passing through CLB
  end for
  while critical_path not empty do
    Pk ← path ∈ critical_path with maximum delay
    N ← all blocks through which Pk flows
    Sort N based on criticality (first entry has most paths)
    while delay(Pk) > Td do
      Ni ← first(N)
      N ← N - Ni
      Assign Vddh to Ni and all routing muxes driven by Ni
      update delay of all paths passing through Ni
    end while
    critical_path ← critical_path - Pk
  end while

Algorithm 2 Algorithm for Vdd assignment: High-to-Low (assuming LCs at CLB input pins)

  Assign Vddh to all CLBs and routing muxes
  P ← list of all paths in the design
  T ← longest path delay when all blocks operate at Vddh
  Td ← xT, x ≥ 1 is a user-defined performance metric
  vddl_delay(Pi) ← delay(Pi) when all blocks in Pi are at Vddl
  critical_path ← Pi ∈ P | vddl_delay(Pi) > Td
  for all CLBs do
    criticality(CLB) ← # paths passing through CLB
  end for
  while critical_path not empty do
    Pk ← path ∈ critical_path with maximum delay
    N ← all blocks through which Pk flows
    Sort N based on criticality (last entry has most paths)
    while (delay(Pk) < Td) & (N not empty) do
      Ni ← first(N)
      N ← N - Ni
      Assign Vddl to Ni and all routing muxes driven by Ni
      calculate delays of all paths flowing through Ni
      if any of the delays > Td then
        reset Ni and all routing muxes driven by Ni to Vddh
      else
        update delays of all paths flowing through Ni
      end if
    end while
    critical_path ← critical_path - Pk
  end while

Fig. 3.17. Distribution of path delays

We use the heuristic shown in Algorithm 1 for Vdd assignment. Initially we assign

low Vdd to all CLBs in the FPGA, and find those paths whose delays become greater

than the desired clock time period. We call such paths “critical”. Those CLBs which

do not belong to any of the critical paths can be kept at low voltage without affecting

performance of the design. Some of the remaining CLBs and routing muxes need to

operate at high-Vdd so that the design’s performance target is met. The order in which

these CLBs are analyzed is crucial for the performance of the heuristic. We define

“criticality” of a CLB as the number of critical paths that pass through this CLB. The

CLBs within a path are analyzed in decreasing order of their criticalities. We started

with CLBs on the most critical path, and proceeded to smaller paths in decreasing order

of their delay. Algorithm 1 handles the case when LCs are at CLB inputs. In that case

all routing muxes driven by a CLB have the same voltage as the CLB. For the other

situation, when LCs are at CLB outputs, the voltage of routing muxes driving a CLB is

the same as that of the CLB.
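The Low-to-High heuristic can be sketched on a toy timing model as follows. This is only an illustration under simplifying assumptions: a path's delay is taken as the sum of per-block delays (routing muxes are folded into the block that drives them), paths are given explicitly as lists, and the names low_to_high, d_high and d_low are hypothetical. It is not the VPR-based implementation used for the experiments.

```python
def low_to_high(paths, d_high, d_low, x=1.0):
    """Sketch of Low-to-High Vdd assignment on a toy timing model.

    paths  : list of paths, each a list of block names
    d_high : block -> delay at Vddh
    d_low  : block -> delay at Vddl
    x      : allowed slowdown factor (x >= 1)
    """
    vdd = {b: "L" for b in d_low}            # start with every block at Vddl

    def delay(path):
        return sum(d_high[b] if vdd[b] == "H" else d_low[b] for b in path)

    T = max(sum(d_high[b] for b in p) for p in paths)   # all-Vddh critical delay
    Td = x * T                                          # delay target
    critical = [p for p in paths if delay(p) > Td]
    # criticality = number of critical paths through each block
    crit = {b: sum(b in p for p in critical) for b in d_low}

    while critical:
        pk = max(critical, key=delay)        # slowest remaining critical path
        # raise blocks to Vddh, most critical first, until pk meets the target
        for b in sorted(pk, key=lambda blk: crit[blk], reverse=True):
            if delay(pk) <= Td:
                break
            vdd[b] = "H"
        critical.remove(pk)
    return vdd
```

Because x ≥ 1, a path with all of its blocks at Vddh always meets Td, so the inner loop is guaranteed to terminate.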

In order to enumerate all paths whose delays become larger than the required

clock time period, we used the algorithm proposed by [83]. It maintains all paths in

a heap data structure with their delays as the keys. Each path also maintains all the

branch-points in the path in increasing order of their branch-slacks 4.
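To illustrate only the "heap keyed by delay" idea, the sketch below pops already-known paths in decreasing order of delay until the delay target is met. It is not the algorithm of [83], which generates paths incrementally from branch points; the function name and the dictionary input are assumptions made for this example.

```python
import heapq


def paths_exceeding(path_delays, Td):
    """Return (path, delay) pairs with delay > Td, slowest first.

    path_delays: dict mapping a path identifier to its delay.
    heapq is a min-heap, so delays are negated to pop the largest first.
    """
    heap = [(-d, name) for name, d in path_delays.items()]
    heapq.heapify(heap)
    result = []
    while heap and -heap[0][0] > Td:
        neg_d, name = heapq.heappop(heap)
        result.append((name, -neg_d))
    return result
```

The loop stops at the first path that meets the target, so paths that are not critical are never visited.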

We also experimented with a variant (High-to-Low) of the above algorithm, in

which all the CLBs are initially kept at high Vdd and then some of them are changed

to low Vdd (see Algorithm 2). Before changing a CLB to low-Vdd, we need to make

sure that this will not increase the delay of some other path in the circuit above the

desired clock period. The number of low Vdd blocks using both versions, for Vddh of

1.1V and Vddl of 0.8V (for 65nm technology) is shown in Table 3.2. For 10 out of 15

designs, the High-to-Low (h2l) version performs better than Low-to-High (l2h). This

happens because, in the h2l case, when the CLBs on a particular path are checked to see

whether they can run at low Vdd, the algorithm continues to examine all the other

CLBs on the path even after it fails to change the Vdd of some CLB. In contrast, in

the l2h case, the algorithm keeps changing CLBs on a path to high Vdd (in decreasing

order of criticality), till the delay of the path is less than the required clock period. This

sometimes causes the path’s delay to be reduced more than what was necessary.
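A corresponding sketch of the High-to-Low variant is given below, under the same toy assumptions (path delay is the sum of per-block delays, paths are listed explicitly, and all names are hypothetical). Two simplifications are made here: blocks that lie on no critical path are moved to Vddl up front, and every block on a critical path is tried once, with the move undone if any path through the block would exceed the target.

```python
def high_to_low(paths, d_high, d_low, x=1.0):
    """Sketch of High-to-Low Vdd assignment on a toy timing model."""
    vdd = {b: "H" for b in d_high}           # start with every block at Vddh

    def delay(path):
        return sum(d_low[b] if vdd[b] == "L" else d_high[b] for b in path)

    Td = x * max(sum(d_high[b] for b in p) for p in paths)
    # a path is critical if it would miss Td with all of its blocks at Vddl
    critical = [p for p in paths if sum(d_low[b] for b in p) > Td]
    crit = {b: sum(b in p for p in critical) for b in d_high}

    for b in vdd:                            # blocks on no critical path are safe
        if crit[b] == 0:
            vdd[b] = "L"

    remaining = list(critical)
    while remaining:
        pk = max(remaining, key=delay)
        for b in sorted(pk, key=lambda blk: crit[blk]):   # least critical first
            if vdd[b] == "L":
                continue
            vdd[b] = "L"                     # tentatively lower the block
            if any(delay(p) > Td for p in paths if b in p):
                vdd[b] = "H"                 # undo: some path would miss Td
        remaining.remove(pk)
    return vdd
```

Unlike the Low-to-High sketch, a failed attempt on one block does not stop the scan of the remaining blocks on the path, which mirrors the explanation given above for why h2l often ends with more low-Vdd CLBs.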

For the LP FPGA, the core of the Vdd assignment algorithm remains the same

as that for FP. The main differences lie in the way the routing segments are handled.

Since their Vdd’s are fixed, the assignment algorithm does not assign voltages to them.

4 Branch slack is defined as the decrease in path delay if a particular branch point is used to generate a new path.

Additionally, since this architecture allows level conversion at both inputs and outputs

of the logic blocks, we modify the assignment algorithm accordingly.

3.2.2.2 Power Estimation

After all logic blocks have been assigned appropriate supply voltages, we esti-

mate the power consumption of the entire FPGA. We concentrate only on the power

consumption in the core of the FPGA, and do not try to optimize or estimate IO power

consumption. Furthermore, we did not estimate the power consumption in the global

routing grid used for clock distribution.

In order to estimate dynamic power, VPR’s power model calculates transition

densities at all internal nodes of the FPGA, assuming that all inputs to the FPGA

have the same static probability (default: 0.5). Capacitances are estimated from the

capacitance values of a MOSFET, and that of wires and switches, all of which need to

be provided in the architecture file taken by VPR as an input. We used the Berkeley

Predictive 65nm technology parameters for our experimentation.

We modified VPR’s dynamic power model to include dual supply voltages. The

dynamic power of a circuit element scales by a factor of $(V_{ddl}/V_{ddh})^2$ when its voltage is reduced from

Vddh to Vddl. SPICE simulations of an LC provided its energy values for different pairs

of Vddh and Vddl. We used these energy values and the transition density at the input

of an LC to calculate its dynamic power.
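As a worked example of the quadratic dependence, the sketch below computes the dynamic power scaling factor for a block moved from Vddh to Vddl, and an LC's dynamic power from its energy per transition and input transition density. The function names and the choice of units (joules per transition, transitions per second) are assumptions for illustration.

```python
def dynamic_power_scale(vddl, vddh):
    """Ratio of dynamic power at Vddl to dynamic power at Vddh (P ~ V^2)."""
    return (vddl / vddh) ** 2


def lc_dynamic_power(energy_per_transition, transition_density):
    """LC dynamic power = energy per input transition x transitions per unit time."""
    return energy_per_transition * transition_density
```

For Vddh = 1.1V and Vddl = 0.9V the factor is (0.9/1.1)^2 ≈ 0.67, so a block moved to Vddl consumes about two thirds of its original dynamic power.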

VPR has a basic leakage model, which calculates sub-threshold leakage due

to weak inversion. However, in a 65nm technology, two more effects, namely, DIBL and

gate leakage become significant, and need to be included in the leakage estimation. We

also modified the leakage model to take into account multiple supply voltages, and sleep

modes. Specifically, the following modifications were made to VPR’s leakage estimation.

1. Gate leakage and sub-threshold leakage due to DIBL were included in the leakage

estimation. In order to estimate leakage of a single MOSFET, we used results from

SPICE simulations. BSIM4 device models for 65nm were used. Simulations were

performed for various supply voltages to get leakage numbers for different voltages.

These numbers were incorporated into the power model of VPR to estimate gate

leakage of the entire FPGA.

2. We estimated average leakage in a routing multiplexer by halving the worst case

leakage, as discussed in [27]. To verify the results, we simulated multiplexers of

various sizes and structures and found our leakage estimate to be very close to the

SPICE results.

3. In the dual-Vdd FPGA, unused logic blocks and routing muxes are kept in a sleep

state by switching off both the supply transistors. Circuit simulations in SPICE

showed that in sleep mode, leakage of a circuit block reduces to 10% of the original

(high Vdd) leakage.

4. To estimate level converter leakage, we obtained the leakage number for one level

converter from SPICE simulations, and multiplied this by the number of level

converters in the FPGA.
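The four modifications above can be combined into a simple bookkeeping sketch. The per-block leakage values would come from SPICE characterization; here they are plain inputs, and the function names, the state labels and the tuple layout are assumptions made for this example rather than VPR's actual data structures.

```python
def mux_average_leakage(worst_case_leakage):
    """Average routing-mux leakage, estimated as half the worst case."""
    return 0.5 * worst_case_leakage


def total_leakage(blocks, lc_leakage, n_lc, sleep_factor=0.10):
    """Sum leakage over circuit blocks and level converters.

    blocks: iterable of (state, leak_at_vddh, leak_at_vddl),
            with state one of "H", "L" or "sleep".
    A sleeping block leaks sleep_factor times its high-Vdd leakage.
    """
    total = 0.0
    for state, leak_h, leak_l in blocks:
        if state == "H":
            total += leak_h
        elif state == "L":
            total += leak_l
        elif state == "sleep":
            total += sleep_factor * leak_h
        else:
            raise ValueError(f"unknown state: {state}")
    return total + n_lc * lc_leakage
```

The default sleep_factor of 0.10 reflects the 10% figure reported in item 3; all other numbers passed to the function are up to the caller.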

Fig. 3.18. Power consumption for different Vddl’s. Vddh=1.1V.

Fig. 3.19. Power consumption for different architectures and algorithms. Vddh=1.1V, Vddl=0.9V

Fig. 3.20. Average power breakdown between logic and routing resources. Vddh=1.1V, Vddl=0.9V

Fig. 3.21. Average power consumption for different critical path delay tolerances. Vddh=1.1V, Vddl=0.9V

3.2.3 Results and Analysis

In this section, we first evaluate the FP architecture (Figures 3.18, 3.19, 3.20,

3.21) and then compare it with LP (Figures 3.22, 3.23).

3.2.3.1 FP Architecture

Power in the dual-Vdd architecture strongly depends on the values of Vddh and

Vddl. In order to understand this dependence, and to come up with a good voltage

choice, we fixed the high-Vdd at 1.1V and varied Vddl from 0.8V to 1.0V. Figure 3.18

shows the power consumption for different Vddl values (using High-to-Low Algorithm,

LC at CLB inputs). Note that for 11 (out of 15) designs, a Vddl value of 0.9V results in

maximum power savings. When Vddl is increased to 1.0V, although the number of CLBs

on low Vdd increases, the total power consumption increases. This happens because the

power consumption of the circuit elements at 1.0V is significantly higher than at 0.9V.

Interestingly, when we reduce Vddl to 0.8V, power consumption again increases because

the number of CLBs and routing muxes on low Vdd becomes too low. Therefore, for

all other results in this section, we use a Vddl of 0.9V. For this case, the average power

reduction is close to 61%.

Figure 3.19 shows the power consumption of the designs for the two algorithms

— High-to-Low (h2l) and Low-to-High (l2h), and level converter placements — at CLB

outputs (LCo) or inputs (LCi). (h2lLCi denotes High-to-Low algorithm with LC at

CLB Inputs.) Note that for most designs, the High-to-Low algorithm outperforms the

Low-to-High algorithm. This is expected because, as shown above (see Table 3.2), the

High-to-Low algorithm resulted in a larger number of low-Vdd CLBs. Furthermore, the

placement of LCs at CLB inputs saves more power (average: 61%) than their placement

at outputs (average: 57%). This happens because LC leakage is not large enough to

overshadow the gains we get in the routing power by placing LCs at CLB inputs.

Figure 3.20 shows the static and dynamic power consumption in both logic and

routing resources for the different algorithms and LC placements. An important obser-

vation is that not all components of power are reduced by the same factor. The reduction

in dynamic power is much less than that in leakage. For example, using High-to-Low

algorithm and placing LC at CLB inputs saves 24% dynamic power and 76% leakage

power. This can be attributed to two factors. First, in an FPGA since there exist a

large number of unused circuit elements, it is possible to reduce the leakage in them by

switching them off. Second, leakage varies exponentially with supply voltage, but dy-

namic power varies only quadratically with supply voltage. Note that leakage in routing

resources reduces to less than 17% of the original, because in most designs it is possible

to put a large number of routing muxes in sleep state, as they are sparsely used. Another

trend to note is that the logic portion of leakage is larger when LCs are placed at CLB

inputs (LCi) than when they are placed at CLB outputs (LCo). This implies that the

larger overall power saving for the LCi case comes entirely from the routing resources.

Figure 3.21 shows what happens when we modify the Vdd assignment algorithm

to allow some degradation in the performance of the design. In the figure, a delay value

of 110% denotes 10% performance penalty. Note that these delay values may increase

after circuit implementation due to the use of supply transistors, and due to a possible

increase of wire lengths (since total CLB area and consequently inter-CLB distances

increase). Using h2lLCi, a 10% decrease in performance increases the average power

saving by around 4%, but beyond 20%, the power remains almost constant.

3.2.3.2 LP Architecture

For LP architectures, since we hard-wire the supply voltages of routing fabrics,

the critical path delay of the design may get affected. Therefore, we first look at the

impact of LP on the delays of all designs. Figure 3.22 shows both the average and worst-

case delays for the benchmark designs. Restricting the maximum increase in delay due

to LP to 20%, we decide to keep 50% of the routing resources on low Vdd. Note that

the average increase in delay for this architecture is only 3% over the FP architecture. The

slightly irregular variation in delay happens due to the heuristic nature of the router.

In this delay comparison, we do not include the increase in delay because of resistance

of the supply transistors, delay through the mux at CLB pins that selects between

Vddl or the level-converted signal, and because of an increase in the wire lengths as a

consequence of an increase in the FPGA area. The increase, however, is minimal, and is

highly dependent on the circuit implementation. For example, [84] demonstrate effective

supply-gating of circuits with a performance penalty of less than 10%. [36] observed

a penalty of 5% for dual-Vdd circuits when they used regular-Vt gate-boosted supply

transistors.

We realized that if the FPGA has too many routing resources, it is possible that

none of the low voltage resources get used, and the delay of the design remains the

same as that for single Vdd FPGA (if the router is timing-driven). To avoid such a

scenario, we first found the minimum channel width for every design using VPR, and

then used 130% of the minimum as the channel width. This is different from the above

FP experiments. However, while comparing FP with LP, we used the LP channel width

for both architectures. Also note that the CLB here consists of 4 BLEs instead of the 8

in the FP experiments.

Figure 3.23 shows the total FPGA energy (power-delay product) obtained using

this architecture for different spatial granularities. h2l-50-2x1 on the x-axis refers to the

architecture where 50% of the routing resources are at Vddl, and the supply transistors

are shared among CLBs in clusters of dimension 2 × 1. We compare energy instead of

power because the critical path delays of designs mapped on LP FPGAs are different

from those on FP FPGAs. The Vdd assignment algorithm remains h2l (High-to-Low)

for all of them. Compared with FP, LP increases the energy by about 4.1%. The routing

energy increases because we do not change their supply voltage. However, the energy

used by logic blocks decreases by about 1.5%, because, due to the presence of LCs at

both CLB inputs and outputs, we have more flexibility in assigning Vdd’s to them. We

further observe that the use of 4 × 4 clusters increases the total energy by about 12%

(compared with FP).
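Since energy is used here as the power-delay product, a design with nearly the same power can still cost more energy if its critical path is longer. A minimal sketch, with hypothetical function names and made-up normalized numbers in the usage note:

```python
def energy(power, critical_path_delay):
    """Energy metric used for the FP/LP comparison: power x delay."""
    return power * critical_path_delay


def relative_increase(value, reference):
    """Fractional increase of value over reference."""
    return (value - reference) / reference
```

For instance, with normalized (purely illustrative) values, energy(1.01, 1.03) is about 1.04 against a reference of energy(1.0, 1.0) = 1.0, so relative_increase reports roughly a 4% energy increase even though power alone rose by only 1%.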

3.2.4 Summary

We presented two types of dual-Vdd FPGA. The fully programmable (FP) FPGA

reduced the total energy by about 60% on an average at the expense of about 50% area

penalty. The logic programmable (LP) FPGA reduced the total energy by 57.3% with

about 20% area increase compared to single supply FPGA. LP, however, resulted in an

average increase of 3% in the critical path delay over FP.

We also explored different Vdd assignment algorithms and level converter place-

ments for FP architecture. Experiments demonstrated that high-to-low algorithm cou-

pled with placement of level converters at the input pins of CLBs resulted in maximum

power savings. The dynamic power was reduced by 24%, while the reduction in static

power was close to 76%.

In the future, the implementation of the LP dual-Vdd architecture can be modified

to allow connections from Vddh to Vddl resources in the routing fabric. Further, the

routing architecture can be improved to use different lengths of segments.

Table 3.1. Characteristics of benchmark designs

#   Design                #Slices       #IOBs      #BRAMs  #MULTs   FPGA device
1   xapp248               96(37%)       17(19%)    0       0        XC2V40-cs144
2   xapp270\des           4,723(92%)    189(58%)   0       0        XC2V1000-fg456
3   xapp270\triple-des    14,273(99%)   301(62%)   0       0        XC2V3000fg676
4   xapp288\ser decoder   50(19%)       20(22%)    0       0        XC2V40-cs144
5   xapp288\par decoder   107(41%)      28(31%)    0       0        XC2V40-cs144
6   xapp289               614(39%)      166(83%)   0       0        XC2V250-fg456
7   xapp298               70(27%)       16(18%)    0       0        XC2V40-cs144
8   xapp299               973(31%)      262(99%)   1(2%)   0        XC2V500-fg456
9   xapp610               1,369(89%)    20(21%)    0       8(33%)   XC2V250-cs144
10  xapp611               1,534(99%)    20(21%)    0       16(66%)  XC2V250-cs144
11  xapp615               1,155(75%)    45(48%)    0       2(8%)    XC2V250-cs144
12  xapp621               1,305(84%)    29(31%)    0       0        XC2V250-cs144
13  xapp625\video         254(99%)      63(71%)    0       0        XC2V40-cs144
14  xapp645               550(10%)      278(85%)   0       0        XC2V1000-fg456
15  itc99\b04             110(42%)      21(23%)    0       0        XC2V40-cs144
16  itc99\b05             259(50%)      39(42%)    0       0        XC2V80-cs144
17  itc99\b12             214(83%)      13(13%)    0       0        XC2V40-cs144
18  itc99\b14             2,432(79%)    88(51%)    0       2(6%)    XC2V500-fg256
19  ava\k4                724(47%)      85(92%)    0       0        XC2V250-cs144
20  ava\k7                2,034(66%)    85(49%)    0       0        XC2V500-fg256
21  ava\k9                2,895(94%)    85(49%)    0       0        XC2V500-fg256
22  industry1             1,954(38%)    279(64%)   0       0        XC2V1000-ff896
23  industry2             2,488(80%)    185(70%)   0       0        XC2V500-fg456
24  industry3             2,513(81%)    132(50%)   0       0        XC2V500-fg456
25  industry4             5,777(75%)    182(34%)   0       0        XC2V1500-ff896
26  industry5             5,153(67%)    65(12%)    0       0        XC2V1500-ff896
27  industry6             206(4%)       287(66%)   0       0        XC2V1000-ff896
28  industry7             3,505(68%)    251(58%)   0       0        XC2V1000-ff896
29  industry8             1,602(52%)    60(22%)    0       0        XC2V500-fg456
30  industry9             2,280(44%)    293(67%)   0       0        XC2V1000-ff896
31  industry10            3,663(71%)    224(51%)   0       0        XC2V1000-ff896
32  industry11            4,364(85%)    172(40%)   0       0        XC2V1000-ff896
33  industry12            97(37%)       80(90%)    0       0        XC2V40-fg256
34  industry13            411(80%)      84(70%)    0       0        XC2V80-fg256
35  industry14            1,288(83%)    186(93%)   2(8%)   0        XC2V250-fg456

    Average               61.89%        50.82%     0.29%   3.23%    -

Table 3.2. Comparison of High-to-Low and Low-to-High algorithms (LC at CLB inputs, Vddh = 1.1V, Vddl = 0.8V)

Design      # CLBs    # Vddl CLBs
                      Low-to-High    High-to-Low
alu4        191       51             65
apex2       235       54             74
apex4       158       46             26
bigkey      214       24             81
des         200       87             127
dsip        172       12             31
elliptic    451       339            327
ex1010      575       177            185
ex5p        133       30             24
misex3      175       18             16
pdc         572       400            405
s38584.1    806       724            739
seq         219       61             58
spla        462       215            226
tseng       131       102            114

Fig. 3.22. Critical path delay for LP FPGA with different extents of Vddl resources. Vddh=1.1V, Vddl=0.9V.

Fig. 3.23. Energy consumption in LP FPGAs. Vddh=1.1V, Vddl=0.9V

Chapter 4

Three-Dimensional FPGAs

As transistors become faster and designs get larger, the delay incurred in the

interconnecting metal wires becomes significant. Consequently, reducing the wire-length

is crucial for future technologies. Three-dimensional (3-D) integration is a promising

technique for reducing wire lengths. By stacking multiple silicon wafers interconnected

with fine vias, the average wire length in the designs gets significantly reduced, which

improves their performance. Other gains, such as reduced design footprint and the ability

to integrate different technologies, further favor 3-D ICs.

Field Programmable Gate Arrays (FPGAs) are consistently improving in capacity

and performance, and are now among the most popular devices in the market. With

their regular structure, they also scale easily to future technologies. However, the large

overheads of their programmable interconnect are severely limiting their growth. The

programmable interconnect resources take almost 70% of the die area, and consume the

major part of FPGA power. Furthermore, for most designs, they also constitute more

than 50% of the critical path delay. Therefore, a reduction in the interconnect resources,

by going to 3-D, will greatly benefit FPGAs.

The advantages of 3-D FPGAs have evoked significant interest, and several stud-

ies have looked at them in the past. More than a decade ago, Alexander et al. [43]

presented a 3-D FPGA that used package-level integration to stack multiple 2-D FPGAs

interconnected using solder bumps. The minimum pitch of these vertical interconnects

was 100µm. Campenhout et al. [44] proposed opto-electronic FPGAs, in which the

inter-chip communication used optical links. The optical links provide a large vertical

channel density. The Rothko 3-D FPGA [45] was a 3-D extension of the Triptych sea-

of-gates architecture [46], consisting of routing and logic blocks. The 3-D integration

was done at the wafer-level and inter-layer communication used metal vias. A dynami-

cally reconfigurable 3-D FPGA was presented in [47], which consisted of three physical

layers: routing and logic block layer, routing layer, and memory layer. Recently, Lin et

al. [48] analyzed the performance benefits of a monolithically stacked 3-D FPGA. Their

3-D integration technology provided very fine vias, which allowed them to stack the

configuration memory on top of the rest of the FPGA (logic blocks and interconnects).

Researchers have also looked at theoretical models for 3-D FPGAs. Rahman et

al. [49] presented an analytical model for predicting interconnect requirements in 3-D

FPGAs, and estimated over 50% reduction in channel width, interconnect delay, and

power dissipation, when compared to 2-D FPGAs. Kwon et al. [50] recently extended

this model to incorporate clustered logic blocks (similar to Virtex-2 [1]).

On the CAD front, Ababei et al. [51, 52] recently presented a partitioning-based

placement algorithm for 3-D FPGAs, which primarily focused on reducing the inter-layer

vias. However, their router was not timing-driven.

Although several researchers have proposed 3-D FPGAs, the detailed routing

architecture of a 3-D FPGA remains unexplored. Ababei et al. [51] assumed a subset

switch block (see definition in Section 4.1.1). Although Wu et al. [53] designed universal

3-D switch blocks, they used track count as the sole metric of quality. Furthermore,

they assumed that the number of inter-layer vias is the same as the horizontal channel

width. In today’s technology, especially if we stack more than two layers, the vias are

much thicker than the horizontal wires (1µm vs. 0.1µm), which makes this assumption

impractical.

This chapter consists of two main sections. In Section 4.2, we explore six 3-D

switch box (SB) topologies for the case when the vias are fewer than the horizontal

wires. These switch boxes range from a simple extension of the 2-D subset SB (used

in prior studies [51]) to 3-D universal SBs with additional flexibility for the inter-layer

vias. Section 4.1 gives a brief overview of 2-D switch boxes and 3-D technology. The

switch box topologies explored in this study are described in Section 4.2.1. Section 4.2.2

explains the experimentation methodology, and Section 4.2.3 analyzes the exploration

results. Using detailed area and delay models, we estimate the impact of these topologies on FPGA area,

delay, and area-delay product. The results indicate that the area-delay product (ADP)

depends heavily on the SB topology: our best SB reduces ADP by 10% compared to the

subset SB.

Section 4.3 analyzes (Section 4.3.1) and mitigates (Section 4.3.2) the thermal issues

in 2-D and 3-D FPGAs. A thermal-aware 3-D FPGA design reduces the peak temperature

by about 16C.

Finally, Section 4.4 summarizes the contributions of this chapter.


(a) Subset (b) Universal

Fig. 4.1. 2-D switch boxes. X0, Y0, X1, Y1 mark their sides.

4.1 Background

4.1.1 2-D Switch Boxes

Our study will focus on island-style SRAM-based FPGAs. FPGAs from Xilinx

and Altera belong to this category. The logic block (CLB) consists of Look-Up-Tables

(LUTs) and Flip-Flops (FFs). Routing wires (tracks) and programmable switches con-

stitute the routing channel. Channel width refers to the number of tracks in a channel.

The CLBs connect to the channel through connection boxes. The routing wires connect

among themselves through switch boxes.

Switch box topology refers to the connectivity provided by the switch box. Re-

searchers have explored several topologies [85, 86, 81, 87, 88] (see Figure 4.1). The

subset (also called disjoint) topology, used in Xilinx XC4000 FPGAs, connects tracks of

the same number in all four directions. This divides the channel into disjoint sets of


tracks, and a net uses the same track number for its route. Universal topology provides

more flexibility than disjoint. It facilitates connectivity for all possible global routes of

two-terminal nets.
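The subset connectivity described above can be sketched in a few lines. This is a minimal illustration, not code from VPR: the side names follow Figure 4.1, while the function names and the representation of a switch as a pair of (side, track) terminals are our own assumptions.

```python
from itertools import combinations

SIDES = ("X0", "Y0", "X1", "Y1")  # side labels as in Figure 4.1


def subset_switch_box(width):
    """Return the switches of a 2-D subset (disjoint) switch box.

    Each switch is a frozenset of two (side, track) terminals. Track t on
    one side connects only to track t on each of the other three sides.
    """
    return {
        frozenset({(a, t), (b, t)})
        for t in range(width)
        for a, b in combinations(SIDES, 2)
    }


def connections_of(switches, terminal):
    """Count the switches attached to one (side, track) terminal."""
    return sum(1 for s in switches if terminal in s)


sb = subset_switch_box(4)
print(len(sb))                         # 6 switches per track, 24 in total
print(connections_of(sb, ("X0", 2)))   # 3: one per other side
```

Because every switch joins two terminals with the same track number, a net that enters on track t can never leave on a different track, which is the disjoint property noted above.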

Research has shown that the universal switch box results in fewer tracks in the

channel [89]. Hyper-universal switch boxes provide even greater flexibility, and facilitate

the connectivity for all possible global routes of k-terminal nets [90]. However, they use

more switches than universal switch boxes.

(a) Face-to-Face (b) Face-to-Back

Fig. 4.2. Two kinds of stacking

4.1.2 3-D Technology Overview

3-D chip design is a promising methodology to alleviate many interconnect prob-

lems. Current state of the art chips are two-dimensional, which means that they have


Table 4.1. Via properties

         Thickness   Pitch   Height
Via 1    1um         3um     10um
Via 2    2um         5um     20um
Via 3    5um         10um    50um

only one active layer that contains all the devices. Note that although no transistor

(device) is stacked on top of another transistor (device), the metal wires interconnecting

these devices typically span multiple layers, with the higher layers occupied by global

wires. 3-D ICs extend this concept to the devices by stacking multiple device layers in

the vertical dimension.

Several technologies, such as beam recrystallization, silicon epitaxial growth, pro-

cessed wafer bonding, and solid phase crystallization, enable the vertical integration

of multiple device layers [91]. Among these technologies, wafer bonding is particularly

promising. It involves the bonding of two fully processed wafers (on which the devices

and interconnects have already been fabricated). Since the individual wafers are fabri-

cated separately, it is possible to integrate completely different technologies, and have

a very large number of layers. The inter-layer vias in this technology can be as fine as

1µm × 1µm at a 3µm pitch [92]. The wafers can be bonded in two ways: face-to-face

or face-to-back. In the former, a wafer is inverted to bond with another wafer (see Fig-

ure 4.2 (a)). This reduces the area overhead of the inter-layer vias because they do not

need to pass through the Silicon substrate. However, this limits the number of layers

to only two. The second way, face-to-back, does not invert the wafer (see Figure 4.2

(b)). Consequently, it can integrate more than two layers of Silicon. However, since the


inter-layer vias now need to pass through the Si layer, they take up die space. In this

study, we evaluate these two wafer-bonding techniques for 3-D FPGA integration.

Since the wafer-bonding 3-D technology is still being perfected, several meth-

ods are being explored. These methods result in different via dimensions and wafer

thicknesses. For this study, we explore three different methods, which result in the via

dimensions shown in Table 4.1. Via 1 reflects the process from Tezzaron [92], which

uses a wafer thickness of 10um. Because they are so thin, these wafers lack mechanical

strength, and require the use of handle wafers during processing. At the other extreme is

via 3 that uses 50um wafers, which reflects the process in [93]. A larger wafer thickness

imparts mechanical strength to the wafers, and eliminates the need for handle wafers.

Via 2 reflects an intermediate process that we use to illustrate the trends due to via

dimensions. An integration technology from MIT uses SOI wafers to reduce the device

layer thickness to less than a micron [94]. We do not model this technology in this work.

4.2 3-D Detailed Routing Architecture

We extend the island-style architecture of 2-D FPGAs to 3-D (see Figure 4.3).

The CLB consists of LUTs and FFs. The switch box is modified to connect the inter-

layer vias (ILVs) to the horizontal wires (CHANX and CHANY), and also with other

ILVs. The ILVs form channels in the vertical direction (CHANZ). The architecture

is symmetric in the X and Y directions, i.e., CHANX and CHANY contain the same

number of tracks. CHANZ, however, differs from CHANX and CHANY in its width,

which is influenced by the via density provided by the 3-D technology. We use V to


Fig. 4.3. 3-D FPGA


(a) Subset (b) Subset-split

(c) Subset-twist (d) Subset-more

(e) Universal-twist (f) Universal-more

Fig. 4.4. 3-D switch boxes for H=4, V=2.


refer to the number of vias (i.e. vertical channel width) and H for the horizontal channel

width. Figure 4.3 shows the case when H = V = 3.

CHANZ differs from CHANX and CHANY in another respect too. The length

of these vias depends on the wafer thickness, which is typically much smaller than the

average 2-D wire length (e.g., wafer thickness = 10um for Tezzaron’s process [92], length

of a wire spanning 4 CLBs = 150um in a 65nm process). These differences between

vertical and horizontal channels must be accounted for to design a good 3-D FPGA.

Next, we describe the various 3-D architectures we explored. Where appropriate, we

also discuss how technology parameters influence our design.

4.2.1 Switch Box Topology

The flexibility, Fs, of a switch box (SB) refers to the number of wires to which

each incoming wire can connect. Previous studies have shown that for a 2-D FPGA, an

Fs of 3 provides good routability [85]. In such SBs, a track connects to one track on

each of the other three sides of the SB. Subset and universal topologies are examples of

such SBs (see Figure 4.1).

These 2-D SBs are extended to 3-D by adding two more faces, which contain

terminals for vertical wires – one for going up, and another for going down. Since the

vias will be fewer than the horizontal wires, the two vertical faces will contain fewer

terminals than the other four. We use V to refer to the number of vias (i.e. vertical

channel width) and H for the horizontal channel width.

Figure 4.4 shows the SBs we created for this study for H=4 and V =2. Normally,

the 3-D SB is visualized as a cube, where each face of the cube represents one of the


directions. However, for ease of illustration, we have flattened the SB and shown it as

a hexagon, where each side represents a direction: North (Y0), South (Y1), East (X1),

West (X0), Top (Z1), or Bottom (Z0). Furthermore, we show only the connections to

the vertical faces (Z0 and Z1). For all SBs, the horizontal wires (CHANX and CHANY)

use either the subset or universal connections among themselves. These connections

were described in Section 4.1.1 and illustrated in Figure 4.1. For clarity, we do not

show the horizontal connections in Figure 4.4. The first four SBs use subset connections

among the horizontal wires, and the last two use universal. Figure 4.4 also tabulates the

connections from the vertical faces, where Xi,j refers to the jth terminal on the Xi face

of the SB.

The first SB (subset, see Figure 4.4 (a)) is an extension of the 2-D subset SB. This

SB connects the same track number on all sides. Consequently, the entire routing fabric

gets divided into disjoint subsets, and a net uses the same track number for its entire

route. Note that only the first V of the H horizontal wires connect to the vias. While

these wires have a flexibility of 5 (3 connections to the other horizontal directions, and 2

to the vertical ones), the other wires connect to only horizontal tracks (flexibility = 3).

Apart from decreasing the routing flexibility, this results in a difference in the capacitive

loads of the horizontal wires: large for the first V wires, and small for the rest.
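The uneven flexibility of the 3-D subset SB can be made concrete with a small sketch. The function name is hypothetical; the counts follow directly from the description above (3 horizontal connections for every track, plus 2 vertical connections for the first V tracks).

```python
def subset_3d_flexibility(h, v):
    """Flexibility of each horizontal track in the 3-D subset SB.

    h is the horizontal channel width (H) and v the number of vias (V).
    """
    if not 0 <= v <= h:
        raise ValueError("expected 0 <= V <= H")
    # Every track reaches the 3 other horizontal sides; only the first V
    # tracks additionally reach the via going up and the via going down.
    return [3 + (2 if track < v else 0) for track in range(h)]


print(subset_3d_flexibility(4, 2))  # [5, 5, 3, 3]
```

For H=4 and V=2, the case drawn in Figure 4.4, half of the tracks carry the extra switches and hence the larger capacitive load.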

The second SB (subset-split, see Figure 4.4 (b)) modifies the subset SB by allowing

the first V horizontal tracks to connect to the vias going above, and the last V to those

going below. This implies that now there are twice as many horizontal wires that connect

to the vertical wires. Therefore, if nets do not fanout at the SB, then this SB provides

greater flexibility to the vertical directions. A limitation, however, is that the first V


can only go above, and the last V, only below. Consequently, if a net needs to fanout to

both Top and Bottom, then it needs to use two horizontal tracks (compared to one for

subset). This SB distributes the capacitive loads on the horizontal tracks more evenly

than the Subset SB.

The subset-split SB, although more flexible than subset, suffers from the “disjoint”

property of subset SBs: the entire routing fabric is divided into disjoint subsets, and a

net can use only one of those subsets. This disjoint subset consists of vertical track i,

and horizontal tracks i and H − i − 1 (where i ∈ {0, 1, ..., V − 1}). In order to improve

upon this, we modified the connections to the vertical faces as shown in Figure 4.4 (c).

Now, terminal Z0,0 connects to track 1 on the side X0, but track 0 on side X1. This

allows the net to switch tracks at the SBs. We call this SB subset-twist.
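The disjoint property of the subset-split SB noted above can be checked with a small union-find sketch. It assumes, following the description of the disjoint subsets, that vertical track i attaches to horizontal track i through the via going up and to horizontal track H − i − 1 through the via going down, and that all layers use the same SB so they can be collapsed into one set of nodes. The node labels and function name are ours.

```python
def subset_split_components(h, v):
    """Group the routing resources of a subset-split fabric into disjoint sets.

    Nodes are ("h", t) for horizontal track number t (the subset pattern
    keeps the track number on every horizontal side) and ("v", i) for
    vertical track i. Requires 2 * v <= h so the first and last V tracks
    do not overlap.
    """
    if 2 * v > h:
        raise ValueError("expected 2 * V <= H")
    parent = {("h", t): ("h", t) for t in range(h)}
    parent.update({("v", i): ("v", i) for i in range(v)})

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        parent[find(a)] = find(b)

    for i in range(v):
        union(("v", i), ("h", i))          # first V tracks: vias going up
        union(("v", i), ("h", h - i - 1))  # last V tracks: vias going down

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


for group in subset_split_components(4, 2):
    print(group)
```

For H=4 and V=2 this yields two groups, one containing horizontal tracks 0 and 3 with vertical track 0, and one containing horizontal tracks 1 and 2 with vertical track 1. A net confined to one group cannot borrow resources from the other, which is the limitation that subset-twist is meant to remove.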

The main objective of the subset-twist SB is to improve the flexibility in the

vertical direction. Another way to achieve this is by adding more switches to the vertical

faces – the approach used by the next, subset-more SB (see Figure 4.4 (d)). Here, the

vertical terminal i connects to both i and H − i − 1 terminals on the horizontal faces

(where i ∈ {0, 1, ..., V − 1}). The extra switches have a two-fold effect. On the one hand,

they improve the flexibility in the vertical direction, and on the other, they increase the

area of the SB and the capacitive loads on the wires.

The next two switch boxes use universal connections among the horizontal wires.

The vertical connections in the universal-twist SB are identical to the subset-twist SB

(see Figure 4.4 (e)). However, due to universal connections among the horizontal wires,

it provides greater flexibility. The last SB, universal-more, further increases the flexibility

by adding more switches to the vertical faces. For example, in Figure 4.4 (f), track 0 on


side Z0 connects to both tracks 1 and 3 on the X0 side. These extra switches improve

the flexibility in the vertical direction, but also increase the area of the SB and the

capacitive loads on the wires.

4.2.2 Experimentation

We modified VPR [10], an FPGA place and route tool available from University

of Toronto, to model our 3-D FPGA architectures. We refer to this tool as 3-D VPR.

It uses simulated annealing to place the logic blocks and then routes the nets using a

modified path-finder algorithm. Both placement and routing are timing-driven, i.e., they

try to reduce the delays of critical paths.

The 2-D placement algorithm of VPR optimized the following cost function.

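\[
Cost_{2D} = \alpha \cdot Cost_{timing} + (1 - \alpha) \cdot Cost_{cong-2D}
\]

\[
Cost_{timing} = \sum_{i=1}^{N_{nets}} \left[ \sum_{j=1}^{num\_sinks(i)} delay(i, j) \right]
\]

\[
Cost_{cong-2D} = \sum_{i=1}^{N_{nets}} q(i) \left[ \frac{bb_x(i)}{C_{av,x}(i)^{\beta}} + \frac{bb_y(i)}{C_{av,y}(i)^{\beta}} \right]
\]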

where Nnets is the number of nets in the design, num_sinks(i) is the number of sink

pins of net i, and delay(i, j) is the estimated delay from the source of net i to sink number j.

For each net i, bbx(i) and bby(i) denote the x and y spans of its bounding box, respec-

tively. The q(i) factor compensates for the fact that the bounding box wire length model

underestimates the wiring necessary to connect nets with more than three terminals. Its


value depends on the number of terminals of net i. Cav,x(i) and Cav,y(i) are the average

channel capacities in the x and y directions respectively, over the bounding box of net i.

The value of β adjusts the weight given to congestion in the cost function. The larger

the value of β, the more wiring in narrow channels is penalized relative to wiring in wider

channels. A value of 1 has been previously found to work best, and is used in this work.

To the 2-D cost function, we add a term, Costspanz , to reduce the vertical span

of the nets. This is similar to what was proposed in [51], except that, similar to the

congestion cost terms for x and y directions, we incorporate congestion in Costspanz .

\[
Cost_{3D} = \alpha \cdot Cost_{timing} + \beta \cdot Cost_{cong-2D} + \gamma \cdot Cost_{spanz}
\]

\[
Cost_{spanz} = \sum_{i=1}^{N_{nets}} q(i) \left[ \frac{bb_z(i)}{C_{av,z}(i)^{\beta}} \right]
\]

By varying the values for α, β, and γ for two of the benchmark designs, we found α = 0.5,

β = 0.1, and γ = 0.4 to give the smallest critical path delay. Hence, we use these values

in all our experimentation.
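As an illustration of how the three terms combine, the following sketch evaluates the 3-D cost for a list of nets. It is not the 3-D VPR implementation: the `Net` container and all names are hypothetical, any normalization that the actual tool may apply to the terms is omitted, and the congestion exponent is passed separately (as `exponent`, set to 1 as stated above) to keep it apart from the weight on the 2-D congestion term.

```python
from dataclasses import dataclass


@dataclass
class Net:
    sink_delays: list  # delay(i, j) for every sink j of the net
    bb: tuple          # bounding-box spans (bbx, bby, bbz)
    cap: tuple         # average channel capacities (Cav_x, Cav_y, Cav_z)
    q: float = 1.0     # compensation factor q(i)


def cost_3d(nets, alpha=0.5, beta=0.1, gamma=0.4, exponent=1.0):
    """Weighted sum of timing, 2-D congestion, and vertical-span costs."""
    timing = sum(sum(n.sink_delays) for n in nets)
    cong_2d = sum(
        n.q * (n.bb[0] / n.cap[0] ** exponent + n.bb[1] / n.cap[1] ** exponent)
        for n in nets
    )
    span_z = sum(n.q * n.bb[2] / n.cap[2] ** exponent for n in nets)
    return alpha * timing + beta * cong_2d + gamma * span_z


# Made-up numbers, purely to show the arithmetic:
# 0.5 * 3.0 + 0.1 * (4/8 + 2/8) + 0.4 * (1/2)
net = Net(sink_delays=[1.0, 2.0], bb=(4, 2, 1), cap=(8, 8, 2))
print(cost_3d([net]))
```

Raising the weight on the vertical-span term discourages nets from crossing layers, which trades via usage against horizontal wire length.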

4.2.2.1 Architecture and Technology Parameters

The logic blocks in our experiments consist of 4 4-input LUTs and 4 FFs, with 10

inputs and 4 outputs. All the inputs are equivalent, and so are the outputs, that is, every

input pin can internally drive any LUT input. The pins are uniformly distributed around

the sides of the CLB. Each output pin connects to 25% of the tracks in the adjacent

channel, and every input pin connects to 60% of the adjacent tracks. All horizontal


segments (CHANX and CHANY) in the routing fabric span 1 CLB, and are driven by

tri-state buffers.

The vertical channel (CHANZ) has vias that span only a single layer. When

these vias are very short (10um), we use minimum size pass transistor switches to drive

them. However, for the case when they are 50um high, we use a 5X tri-state buffer switch

to drive them. In contrast, the buffers driving the CHANX and CHANY segments are

always 5X the minimum, and consist of two stages.

We calculated the resistance and capacitance values for the vias and horizontal

wires by using the Predictive Technology Model (PTM) [95]. Timing parameters for

switches were derived from Spice simulations using 65nm BPTM.

We explored a spectrum of 3-D technologies: with the via properties shown in

Table 4.1, number of layers varying from 2 to 5, and either face-to-face (f2f) or face-

to-back (f2b) bonding technology. The finest vias of 1um thickness are in line with

Tezzaron’s process [92], while the coarsest ones (of 5um thickness) reflect the

process from [93].

4.2.2.2 Experimentation Flow

Figure 4.5 shows the experimentation flow. A design in blif format is packed into

clusters (CLBs) of 4-LUTs using T-VPack. On the basis of the number of CLBs in

the design, 3-D VPR creates the smallest FPGA fabric that would contain the design.

It takes the number of layers as an input, and finds the minimum square size of one

layer, assuming that all layers contain the same number of CLBs. The packed netlist

is then placed and routed using 3-D VPR to find the minimum number of vias for a


Fig. 4.5. Experimentation flow

large horizontal channel width (= 80 for 5 layers). The router performs a binary search

over the number of vias to find the minimum value. Fixing the number of vias to 130%

of the minimum value, we re-route the design to find the minimum possible channel

width. Thus, this flow gives priority to reducing the number of vias instead of channel

width, which makes sense because the vias take more area than the horizontal wires.

However, most FPGAs provide more than the minimum number of channels to ensure

good performance for the worst case too. On similar lines, we add 30% to the minimum

via and channel-width numbers while evaluating the FPGA. Using these values (which

may be different for every design), we re-route the design to obtain the critical path

delay of the routed design. This flow is repeated for every switch-block type for all the

20 MCNC benchmark designs.
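The search procedure in this flow can be sketched as follows. This is a simplified outline under stated assumptions, not the 3-D VPR code: `route_ok(vias, width)` is a hypothetical stand-in for a full place-and-route run that reports whether the design routes, routability is assumed to be monotonic in both arguments, the upper bound on vias is arbitrary, and rounding the 130% figures up is our own choice.

```python
import math


def grid_side(num_clbs, layers):
    """Smallest square side such that `layers` equal layers hold all CLBs."""
    per_layer = -(-num_clbs // layers)  # ceiling division
    side = math.isqrt(per_layer)
    return side if side * side >= per_layer else side + 1


def min_feasible(lo, hi, routable):
    """Binary search for the smallest n in [lo, hi] with routable(n) true."""
    if not routable(hi):
        raise ValueError("design does not route even at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if routable(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def scale_up(n, percent=130):
    """Scale n to `percent` percent, rounding up (integer arithmetic)."""
    return (n * percent + 99) // 100


def explore(route_ok, wide_channel=80, max_vias=80):
    """Minimize vias first, then channel width, then add the 30% margin."""
    min_vias = min_feasible(1, max_vias, lambda v: route_ok(v, wide_channel))
    vias = scale_up(min_vias)
    min_width = min_feasible(1, wide_channel, lambda w: route_ok(vias, w))
    return {
        "min_vias": min_vias,
        "min_width": min_width,
        "eval_vias": vias,
        "eval_width": scale_up(min_width),
    }


# Toy routability rule, for illustration only.
print(explore(lambda vias, width: vias >= 10 and vias + width >= 40))
```

The final timing run would then use `eval_vias` and `eval_width`, which is where the critical path delay reported for each design comes from.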


4.2.2.3 Area Model

VPR estimates area by counting the number of transistors in the fabric. This

works because the 2-D FPGA area is transistor-dominated. In case of 3-D, however,

we must add the via areas to the transistor areas. The two types of 3-D integration

technologies discussed in Section 4.1.2 need different area models. In case of face-to-face

(f2f) bonding, the inter-layer vias (ILVs) do not pass through the Silicon (see Figure 4.2).

Consequently, they do not take any die area. In contrast, the face-to-back (f2b) bonding

requires vias to pass through the Silicon (through-vias). In this case, every via con-

sumes some Si area. We incorporate the area overhead of these through-vias in our area

estimates.

While comparing the area of two architectures, we estimate the total FPGA area

and divide it by the number of CLBs in the fabric to estimate the area per CLB. Thus,

the area numbers in the next section include the area for one logic block (CLB), and the

routing resources (horizontal wires, switches, and vias) associated with it.
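A minimal sketch of this area accounting is shown below. It is only an approximation under our own assumptions: the function and parameter names are hypothetical, and each through-via is assumed to occupy a square of side equal to the via pitch, which the text above does not specify.

```python
def area_per_clb(transistor_area_um2, num_clbs, vias_per_clb=0,
                 via_pitch_um=0.0, bonding="f2f"):
    """Estimate the area per CLB (in um^2), including through-via overhead.

    transistor_area_um2 is the total transistor area of the fabric.
    """
    if bonding not in ("f2f", "f2b"):
        raise ValueError("bonding must be 'f2f' or 'f2b'")
    via_area = 0.0
    if bonding == "f2b":
        # Assumption: one pitch x pitch square of silicon per through-via.
        via_area = num_clbs * vias_per_clb * via_pitch_um ** 2
    # With f2f bonding the vias do not pass through silicon: no overhead.
    return (transistor_area_um2 + via_area) / num_clbs


# Made-up numbers, purely to show the two cases.
print(area_per_clb(1000.0, 10, vias_per_clb=2, via_pitch_um=3.0, bonding="f2b"))
print(area_per_clb(1000.0, 10, vias_per_clb=2, via_pitch_um=3.0, bonding="f2f"))
```

Under this model the f2b overhead grows with the square of the via pitch, while the f2f estimate is independent of it.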

4.2.3 Results and Analysis

Here, we show the results for two extremes of 3-D integration: first, a simple stack

of two layers; and second, a more aggressive stack of 5 layers. Together they capture

the trends seen by varying the number of layers in a 3-D FPGA. While the two-layer

FPGA can be fabricated using f2f or f2b wafer bonding, the 5-layer FPGA must be

fabricated using f2b. For all these technology points, we evaluate the effects of different

via dimensions shown in Table 4.1. The metric we primarily look at to evaluate an

architecture is the area-delay product (ADP), because it is inversely proportional to the


Fig. 4.6. Comparing 2-D and 3-D FPGAs

Fig. 4.7. Comparing the switch boxes for 5-layer FPGA


throughput of the device [96]. In all the figures in this section, we plot the geometric

means over 20 MCNC benchmarks.

The first step towards evaluating 3-D FPGAs is comparing them with 2-D FPGAs.

Figure 4.6 shows the average area (per CLB), delay, and ADP for 1, 2, and 5 layers

in 65nm technology. For both 2 and 5 layers, it shows the results for the three via

technologies of Table 4.1. The key ‘2-layers-f2f-3um’ in the figure refers to the use of

2 device layers, stacked using f2f bonding with vias at 3um pitch (via 1 in Table 4.1).

Figure 4.6 uses the same switch box (universal-twist) for all cases.

The area is estimated as explained in Section 4.2.2.3. Note that area reduces as

we increase the number of layers, or reduce the pitch of the vias. The smallest area is

obtained when five layers are used with 3um-pitch vias, in which case, the CLB’s area is

only 84% of the single-layer case. Furthermore, we observe that the area of the 2-layer

FPGA using f2f bonding remains constant with increasing via pitches. This happens

because the vias in this case are accommodated within the transistors’ footprint, and

the CLB area is determined by the transistors.

The critical path delay also reduces with increasing number of layers (second set

of bars in Figure 4.6). The 5-layer FPGA with 5um-pitch vias (best case) reduces the

delay by 24.7% compared with the single layer case, and by 14% compared with the

2-layer case. This happens because interconnect lengths (and hence delays) reduce as

we increase the number of layers. F2f and f2b technologies do not have any significant

impact on the delay.

The reduction of area and delay in 3-D combine to significantly reduce the area-

delay product of the FPGA (third set of bars in Figure 4.6). The 5-layer FPGA reduces


the area-delay product by 36% (for 3um pitch vias), while a 2-layer FPGA does so by

about 20%, when compared to a single-layer FPGA. These results justify the interest in

3-D FPGAs, and demonstrate that we can obtain significant improvements even by the

relatively simple integration of two FPGA layers.
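As a rough consistency check on the 5-layer figures quoted above (our own arithmetic, not an additional experimental result), note that the geometric mean of a product equals the product of the geometric means, so the ADP ratio should be close to the product of the area and delay ratios. Taking the 84% area and the 24.7% delay reduction reported above:

```latex
\frac{ADP_{5\text{-layer}}}{ADP_{1\text{-layer}}}
  \approx 0.84 \times (1 - 0.247)
  = 0.84 \times 0.753
  \approx 0.63
```

This corresponds to a reduction of roughly 37%, in line with the 36% reported. The small gap may stem from rounding, or from the area and delay figures being quoted for different via pitches.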

Now, we explore the different switch boxes to find which one gives the best values

for area, delay, and area-delay product. Figure 4.7 shows the results for 5 layers, using

65nm process and 3um-pitch vias (via 1 in Table 4.1). The results for 2 layers follow a

similar trend. The first set of bars in Figure 4.7 compares the flexibility in the vertical

direction of the various SBs by looking at the minimum number of vias they take for

the designs to route. Observe that the universal-more type of SB provides the greatest

flexibility (minimum number of vias). In fact, it uses only 49% of the vias needed by the

subset SB. It also results in the minimum channel width among all the SBs. However,

the total area is determined by both the vias and the number of transistors in the fabric.

Since universal-more uses extra switches to increase flexibility, we observe that the total

area taken by the FPGA using universal-more SB is larger than that of the one with

universal-twist SB. This indicates that the universal-twist SB provides greater flexibility

per switch than the universal-more SB.

While the area reduces to 88% when the universal-twist SB is used instead of the

subset SB, the critical path delay does not show such a strong variation. This happens

because the timing-driven router of 3-D VPR gives less weight to congestion for timing-

critical nets, which implies that they almost always take the shortest possible route.

The smallest delay is obtained for the subset-split case. Note that adding more switches

to the SB increases the delay, which is explained by the larger parasitic capacitances


due to these switches. Because the variation in delay is not much, the trend for area-

delay product is similar to that for area. The universal-twist offers the lowest area-delay

product, 91% of that for the subset SB.

Next, we explore how the via properties affect the choice of SB for the 5-layer

FPGA. Figure 4.8 compares the area-delay product for different SBs for the three via

technologies of Table 4.1. The x-axis is labeled as <via pitch>-<via height>. Intuitively,

as the vias become larger, we will prefer the SB that provides the minimum number of

vias. Figure 4.8 demonstrates this trend. As vias become larger, the difference between

the area-delay products for universal-twist and universal-more (which produces the min-

imum number of vias) reduces. This happens because, as vias become larger, the area

taken by the vias starts dominating the total area. However, even for 10um-pitch vias

(the largest case), the universal-twist SB continues to provide the smallest area-delay

product.

We also look at the effect of technology scaling on the performance of our SBs in

a 5-layer FPGA (see Figure 4.9). The vias are assumed to remain at 3um pitch while

the CMOS technology scales from 65nm to 45nm and 32nm. Again, the universal-twist

remains the best SB for all process nodes. Since the via dimensions remain constant

among the different process nodes, the area penalty due to through-vias increases as

transistor dimensions shrink. Consequently, the universal-more SB (which gives the

minimum number of vias) improves as process scales. However, even for the 32nm node,

the universal-twist SB remains the best from an area-delay product perspective.


Fig. 4.8. Comparing the switch boxes for different via technologies for 5-layer FPGA

Fig. 4.9. Comparing the switch boxes for different process nodes for 5-layer FPGA


4.3 Thermal Issues in 3-D FPGAs

Junction temperature is a growing concern in integrated circuits. Improvements

in fabrication technology, circuit design, architecture, and tools, have all contributed to-

wards an increase in logic density as well as clock frequency. Increased logic density and

performance have in turn led to an increase in power densities, which manifests itself in

the form of high temperatures. FPGAs are following a similar trend. Recent articles on

thermal management from leading FPGA manufacturers ([97, 98]) clearly indicate the

growing importance of thermal issues in FPGA designs. Since three-dimensional integra-

tion increases the effective power density, 3-D ICs suffer from even higher temperatures.

Die temperature must be controlled because it impacts the timing, leakage power,

package design, and lifetime of the device. Circuits run slower when they are hot,

and their lifetime reduces exponentially with increasing temperature. Besides, plastic

packages can only withstand relatively low temperatures. Furthermore, leakage power

increases exponentially with temperature, which can cause a thermal runaway.

All these factors have forced chip manufacturers to employ techniques to con-

trol the die temperature. These techniques can be divided into two categories, namely

package level, and design level.

Package designers have been considering thermal issues for a long time. Heat

sinks, spreaders, and fans are the most common examples of package level techniques.

Instead of considering variations in the temperatures on the die, they design the package

to support the worst case specifications of the design. They typically provide the user

with the thermal resistance (θJA) of the package, which is used to estimate the junction


temperature (TJ ) using

\[
T_J = T_A + \theta_{JA} \cdot Power \qquad (4.1)
\]

where TA is the ambient temperature, and Power refers to the total power consumed

by the chip.
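Equation 4.1 translates directly into code. The function name and the numbers in the example are illustrative assumptions, not values taken from a particular package datasheet.

```python
def junction_temperature(ambient_c, theta_ja_c_per_w, power_w):
    """Worst-case junction temperature in C from Equation 4.1."""
    return ambient_c + theta_ja_c_per_w * power_w


# Hypothetical package: 25 C ambient, 2.0 C/W, 10 W total power.
print(junction_temperature(25.0, 2.0, 10.0))  # 45.0
```

Since the estimate uses only the total power, it cannot capture hotspots caused by uneven power density across the die.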

As designing the package for the worst case junction temperature started becom-

ing too expensive, researchers started looking at design level solutions to reduce the

temperature. A common example is dynamic thermal management (DTM), where the

design is run at a reduced power (and performance) if the chip temperature increases

beyond a previously set threshold. Thermal sensors measure the temperature, and power

is reduced by lowering the clock frequency or the supply voltage, and clock-gating [55].
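The threshold behavior of DTM can be sketched as a single control step. This is a simplified illustration, not the scheme of [55]: the threshold, frequency limits, step size, and the recovery policy of raising the frequency again once the chip cools are all our own assumptions.

```python
def dtm_step(temp_c, freq_mhz, threshold_c=85.0,
             f_min=100.0, f_max=500.0, step=50.0):
    """Return the clock frequency for the next interval.

    Throttle by one step when the sensed temperature exceeds the
    threshold; otherwise recover by one step, within [f_min, f_max].
    """
    if temp_c > threshold_c:
        return max(f_min, freq_mhz - step)
    return min(f_max, freq_mhz + step)


print(dtm_step(90.0, 500.0))  # 450.0: too hot, slow down
print(dtm_step(80.0, 450.0))  # 500.0: cool enough, speed back up
```

The performance cost is visible directly: every interval spent above the threshold runs at a lower clock frequency.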

Design level techniques can also aid in removing the heat generated by the design.

For example, thermal-aware floorplanning tries to reduce the hotspots on the die by

distributing the temperature uniformly [56, 57]. Researchers have mostly focused on

microprocessors in these works. Thermal placement is a similar technique applied at the

placement stage. Chen and Sapatnekar [58] proposed a partition-driven algorithm for

standard cell thermal placement. Thermal floorplanning and placement are particularly

attractive because they impact the performance less than DTM.

On the modeling front, several researchers have developed tools for estimating the

die temperature. Among them, HotSpot [59] is an architecture-level thermal simulator,

which can perform transient as well as steady-state temperature estimation. HS3d [60] is

another architecture-level tool that performs only steady state temperature estimation,

but is orders of magnitude faster than HotSpot. Both HS3d and HotSpot provide the


flexibility to set several package and die parameters, such as the spreader thickness,

package-to-air thermal resistance (r_convec), and substrate thickness. Since in this work

we look at only steady state temperatures, we use HS3d.

Recently, some researchers have proposed solutions for thermal issues in 3-D ICs

too. Cong et al. [61] suggested a thermal-driven floorplanning for 3-D. Goplen and Sap-

atnekar [62] also proposed a temperature-driven placement algorithm for 3-D standard

cell ASICs. Studies have also indicated that careful insertion of thermal vias can reduce

the peak temperature [63, 64].

Thermal issues in FPGAs are relatively unexplored. Some researchers have pro-

posed the use of distributed sensors for monitoring temperatures in FPGAs [65, 66].

They, however, considered only CLBs in the fabric, and consequently, observed very

little temperature variations across the die. In contrast, we focus on platform FPGAs,

containing embedded circuit blocks including high-speed transceivers, multipliers, DLLs,

and memories (see Figure 4.10) [1, 2]. Here, we first characterize the temperature distri-

bution in a modern 2-D FPGA, and then observe how it changes when we stack multiple

such layers. We further propose changes in the placement of hard blocks in a 3-D FPGA

to reduce the die temperature.

4.3.1 Thermal-Characterization of FPGAs: 2-D to 3-D

Most modern FPGAs incorporate hard blocks in the fabric (e.g., Virtex-4, see

Figure 4.10). Table 4.2 shows the power densities for the blocks in a Virtex-4 FPGA.

Observe that the power densities vary from 0.78 for the DSP blocks to 11.46 for the

DCMs. This vast range results in large temperature variations within the FPGA die


Fig. 4.10. Virtex-4 FX100 device (not to scale)

Table 4.2. Power densities in 4VFX100 (Freq: 500MHz)

Block type            Power density (normalized to CLB)
DSP                   0.78
CLB                   1.00
PPC                   1.32
IOB                   2.33
BRAM, Dual Port       3.85
BRAM, Single Port     1.93
MGT, Transceiver      7.75
MGT, Transmitter      4.22
MGT, Receiver         4.11
PMCD                  11.4
DCM, High Freq        11.46
DCM, Low Freq         9.84


[Plot: temperature (C), on a scale from 74 to 90, versus X location (cm) and Y location (cm)]

Fig. 4.11. Thermal profile of 4VFX100

Fig. 4.12. Effect of stacking on peak temperature


Table 4.3. Effect of stacking on temperature

#Layers   3-D Tech   Vias                 Temperature (C)
                                          Peak     Average   Min
1         -          -                    89.4     79.4      75.0
2         Ref [92]   No via               133.7    114.1     105.3
2         Ref [92]   Via 1 (Table 4.1)    133.4    113.9     105.1
2         Ref [92]   Max vias             133.2    113.9     105.2
2         Ref [93]   No via               132.9    114.2     105.6
2         Ref [93]   Via 3                132.6    114.0     105.5
2         Ref [93]   Max vias             132.4    114.0     105.5
3         Ref [92]   No via               178.2    149.1     135.9
3         Ref [92]   Via 1                177.2    148.5     135.4
3         Ref [92]   Max vias             176.5    148.5     135.6
3         Ref [93]   No via               175.8    149.3     136.9
3         Ref [93]   Via 3                174.8    148.6     136.4
3         Ref [93]   Max vias             174.4    148.6     136.5
4         Ref [92]   No via               222.8    184.3     166.7
4         Ref [92]   Via 1                220.7    183.0     165.7
4         Ref [92]   Max vias             219.4    183.0     166.1
4         Ref [93]   No via               218.4    184.7     168.6
4         Ref [93]   Via 3                216.4    183.4     167.6
4         Ref [93]   Max vias             215.6    183.4     167.9

Table 4.4. Parameters for temperature estimation in HS3d

Parameter              Value
Ambient temperature    45C
r_convec               0.5C/W
Substrate thickness    500 um
Spreader thickness     1 mm
Sink thickness         6.9 mm
Glue thickness         2 um


(see Figure 4.11). The hotspots occur near the MGTs and DCMs, which are about 14C

above the coolest portions.

Table 4.3 shows the temperatures for 3-D FPGAs consisting of identical FPGA

layers of 4VFX100. The temperatures were estimated using HS3d [60] with the parameters

listed in Table 4.4. The r_convec value of 0.5C/W reflects the thermal resistance

of a high-end package with a moderate heat sink. We estimated temperatures for two

extremes of 3-D technologies: one with very thin layers and fine vias (Tezzaron’s process,

Via 1 of Table 4.1), and another with 5um vias and 50um layers (Via 3 of Table 4.1). For

both these technology nodes, we also varied the number of inter-layer thermal vias be-

tween the two extremes of no thermal vias to the maximum possible number of thermal

vias. Table 4.3 shows the temperatures for these two corners along with a more realistic

number based on the via pitches in Table 4.1.

As expected, the peak temperature increases with increase in the number of layers

– from 89.4C for a 2-D FPGA to 220.7C for a 4-layer FPGA using Tezzaron’s process.

The intra-package temperature variation also increases with increase in the number of

layers, from 14.4C for a 2-D FPGA to 55.0C for a 4-layer FPGA. This large variation

in temperature indicates that the peak temperature could be reduced by distributing the

hot blocks more evenly across the fabric. Interestingly, 3-D technology parameters change

the temperatures only minutely. For a 4-layer FPGA, layer thickness changes the peak

temperature by up to 4.4C, while thermal vias could decrease the peak temperature

by up to 3.4C. Figure 4.12 shows the effect of stacking on temperature, as well as the

possible variations because of 3-D technology parameters.
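The figures quoted in this paragraph follow directly from Table 4.3. As a check, the short Python sketch below recomputes them from a subset of the table rows. The dictionary layout and helper names are illustrative choices made here and are not part of the original experiments.

```python
# Rows transcribed from Table 4.3: (peak, average, minimum) temperatures in C.
# The keys and helper names are illustrative, not from the thesis.
TABLE_4_3 = {
    ("1 layer", "-", "-"): (89.4, 79.4, 75.0),
    ("4 layers", "Ref [92]", "No via"): (222.8, 184.3, 166.7),
    ("4 layers", "Ref [92]", "Via 1"): (220.7, 183.0, 165.7),
    ("4 layers", "Ref [92]", "Max vias"): (219.4, 183.0, 166.1),
    ("4 layers", "Ref [93]", "No via"): (218.4, 184.7, 168.6),
}


def variation(row):
    """Intra-package variation: peak minus minimum temperature."""
    peak, _average, minimum = row
    return round(peak - minimum, 1)


var_2d = variation(TABLE_4_3[("1 layer", "-", "-")])
var_4layer = variation(TABLE_4_3[("4 layers", "Ref [92]", "Via 1")])

# Effect of layer thickness: same via setting, different 3-D technology.
thickness_effect = round(
    TABLE_4_3[("4 layers", "Ref [92]", "No via")][0]
    - TABLE_4_3[("4 layers", "Ref [93]", "No via")][0], 1)

# Effect of thermal vias: same technology, no vias versus maximum vias.
via_effect = round(
    TABLE_4_3[("4 layers", "Ref [92]", "No via")][0]
    - TABLE_4_3[("4 layers", "Ref [92]", "Max vias")][0], 1)

print(var_2d, var_4layer, thickness_effect, via_effect)
```

The four values correspond to the 14.4C and 55.0C variations and to the 4.4C and 3.4C sensitivities discussed above.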


Table 4.5. Thermal-aware 3-D FPGA design

FPGA Design               Temperature (C)
                          Peak     Average  Minimum
2-D                       86.9     78.0     73.8
2-layer stacked           128.48   111.11   102.78
2-layer thermal           112.92   111.19   110.30
2-layer thermal inverted  112.94   111.23   110.33

Fig. 4.13. 3-D FPGA organizations: (a) 2-layer stacked, (b) 2-layer thermal; each panel shows Layer 1 and Layer 2


4.3.2 Thermal-Aware 3-D FPGA Organization

Recently, a study proposed alternate organizations for a 2-D FPGA to reduce the

intra-die temperature variations [67]. Using a fully utilized Virtex-4 FX100 FPGA as

an example, it demonstrated a reduction in peak die temperature of about 6C. Since

temperature variation is larger in a 3-D FPGA, we would expect thermal organiza-

tion to have a greater impact. To demonstrate this, we design a thermal-aware 2-layer

FPGA. For ease of experimentation, we consider only 4 types of blocks in the FPGA,

namely, CLB, BRAM, DSP, and MGT. These blocks consume the majority of the area

in 4VFX100. The peak temperature for a 2-D FPGA containing these blocks is 86.9C.

In the first case, we stack two identical such layers to form a 2-layer stacked FPGA (see

Figure 4.13(a)). The peak temperature for this FPGA is 128.5C. Note that stacking

the hot blocks significantly increases the power density, and therefore, the temperature.

Hence, next, we keep all the MGTs, DSPs, and BRAMs on a single layer. The second

layer now consists only of CLBs (see Figure 4.13(b)). This change in floorplan can be im-

plemented easily with the column-based modular architecture of Virtex-4 (ASMBL) [1].

This reduces the peak temperature to 112.9C (2-layer thermal in Table 4.5). The tem-

perature variation also drops from 25.7C for the stacked design to only 2.6C for the

thermal-aware design.
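The variation and peak-temperature numbers above can be recomputed from Table 4.5. The following Python sketch does so; the names are illustrative and the script is only a consistency check, not part of the thermal simulation flow.

```python
# Rows transcribed from Table 4.5: (peak, average, minimum) temperatures in C.
TABLE_4_5 = {
    "2-D": (86.9, 78.0, 73.8),
    "2-layer stacked": (128.48, 111.11, 102.78),
    "2-layer thermal": (112.92, 111.19, 110.30),
}


def variation(design):
    """Peak minus minimum temperature for the named design."""
    peak, _average, minimum = TABLE_4_5[design]
    return round(peak - minimum, 2)


stacked_variation = variation("2-layer stacked")
thermal_variation = variation("2-layer thermal")

# Reduction in peak temperature obtained by the thermal-aware organization.
peak_reduction = round(
    TABLE_4_5["2-layer stacked"][0] - TABLE_4_5["2-layer thermal"][0], 2)

print(stacked_variation, thermal_variation, peak_reduction)
```

The peak reduction of about 15.6C is the value that is rounded to 16C in the chapter summary.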

In the previous experiments, the heat sink is attached closest to the layer consum-

ing the maximum power. Previous studies have suggested that this should be preferred.

In fact, researchers have proposed thermal-aware 3-D floorplanning that tries to place

the hot blocks closer to the sink [61]. In order to see the effect of sink placement, we


attached it to the layer containing only CLBs in the 2-layer thermal organization. Ta-

ble 4.5 also shows the temperature for this case (2-layer thermal inverted). We observe

that the temperature increases only very slightly because of this change. This happens

because the vertical distances are small compared to the horizontal dimensions of the

FPGA.

4.4 Summary

This chapter demonstrated that 3-D FPGAs can provide significant advantages

over 2-D by reducing the interconnect area and the total area-delay product. The 3-D

FPGA with 5 layers and 3um-pitch vias reduces the area-delay product of a 2-D FPGA

by 36%.

We designed and evaluated several switch boxes for 3-D FPGAs, and showed that

the area-delay product depends heavily on the switch box topology. In 65nm technology,

the area-delay product for our universal-twist switch box is 15% lower than that of the

subset switch box for 5um-pitch vias. We further showed that the universal switch boxes

become even better with scaling process technology, as well as with larger vias. However,

adding more switches to the universal SB does not provide any benefit.

Three-D integration, however, increases the die temperature. Our experiments

indicate that the peak temperature for a 4-layer FPGA is 2.4 times that of a single-layer

FPGA. However, the large variation in temperature within the 3-D package allows us

to re-organize the 3-D FPGA to reduce the peak temperature. For a 2-layer FPGA,

the peak temperature reduced by 16C when the design was altered to create a more

uniform temperature profile.


Chapter 5

Technology Alternatives for Nanoscale

FPGA Interconnects

The previous decade has seen large-scale concerted efforts to develop nano-scale

technologies that will help sustain Moore’s law. Innovations in lithographic CMOS

technologies have indicated that it would be possible to scale CMOS at least until

the second half of the next decade. However, conventional lithographic techniques

suffer from increasing fabrication costs, which may ultimately limit their application.

Recently, a (comparatively) low cost and reliable nano-imprint lithography technique

has been proposed [99, 100] which raises the hopes of obtaining cost-effective nano-

scale fabrication. However, at present, this imprint technique is limited to very regular

structures, and is unlikely to produce the complex structures that current lithography

can produce. While nano-imprint as well as conventional lithography are top-down

techniques, there are several bottom-up assembly techniques [101] in which molecules

assemble to form nano-structures. Although these techniques are expected to be very

low cost, they suffer from yield issues and are limited to very simple geometries.

Modern high-end FPGAs contain a variety of resources, and are not restricted to

a simple array of logic blocks consisting of Look-Up Tables (LUTs) connected using pro-

grammable switch blocks. In current FPGAs, apart from the basic programmable blocks,

there exist RAM modules, some hard-coded blocks (e.g. multipliers), and even some full

processors (e.g. PowerPC processors). Apart from them, the basic programmable logic


block itself has been augmented to contain non-LUT structures, like fast carry-chain

circuits. There have been advances in the interconnect architecture too. Modern FP-

GAs consist of segments of different lengths, each with different connectivity. However,

it is widely accepted that the interconnect is the major bottleneck in FPGAs. The in-

terconnect multiplexers in Xilinx’s Virtex-2 FPGAs take around 70% of the CLB area.

Furthermore, even after careful timing-driven packing and placement, interconnects are

the dominant source of delay for most designs. In addition, the power consumption

in a typical FPGA-mapped design is dominated (> 70%) by the

interconnect resources [13].

In this chapter, we explore different solutions to the interconnect problem in the

nano-scale regime. We explore nano-wires of different widths and materials as inter-

connect. We also explore replacing the pass-transistor switches in current FPGAs by

molecular switches [101, 102] that provide reprogrammable connections between wires.

This alleviates the need for SRAM cells to control the state of the switch, since these

molecules store the state within themselves. This is similar to anti-fuse FPGAs, but, in

contrast to anti-fuse technology, these molecules are reprogrammable. Furthermore, we

expect the structure of the CLB to be more difficult to realize efficiently in a technology

more amenable to regular structures. Therefore, the logic blocks in our architecture are

fabricated using lithographic techniques.

5.1 Nanotechnology Primitives

Several nano-structure fabrication techniques have been proposed over the past

few years. Among them, Nano-imprint [99, 100] and Dip Pen Nano-lithography (DPN) [103]


are the most promising techniques. In case of nano-imprint technology [99, 100], e-beam

lithography (or a similar technique) is used to create a mould, which is subsequently

used to emboss the circuit on other chips for mass production. The mould can be made

very fine, and the technique is expected to scale down to feature sizes of a few nanometers.

DPN [103], in contrast, uses an Atomic Force Microscope (AFM) to write the circuit

on the die. Although inherently slower than nano-imprint, using multiple AFM tips

improves the writing speed significantly. This has been demonstrated to produce very

small features, and is expected to fabricate features smaller than 10nm. Directed self-

assembly [101] is another approach towards making nano-structures. Although this may

be the cheapest way to make circuits, it suffers from very high defect rates.

Note that all these (nano-imprint, DPN and self-assembly) technologies are ex-

pected to be limited to very simple geometries. It has been shown that it is possible to

get sets of parallel wires using any of the above techniques. Therefore, we propose to use

them (preferably nano-imprint) to make only wires in the FPGA. These wires could be

made using a single crystal of metal-silicide (e.g., NiSi nano-wires [104]) or made out of

metal. Carbon nanotube wires could also be considered, although a recent work claimed

that carbon nanotubes may not be better than metal wires with respect to reducing

interconnect delays [105].

In addition to the wires, we also need some sort of programmable switches to

provide programmable connection among the wires and between wires and logic pins. In

the FPGAs of Xilinx and Altera, these are made using pass transistors and SRAM cells,

while Actel’s Axcelerator FPGAs use one-time programmable anti-fuse material. At the nano-scale,

we can use single-molecule switches that exhibit reversible switching behavior [70].


These molecules self-assemble at the cross-points of nano-wires, and can be switched

between ON and OFF states by the application of a voltage bias. It is desirable that these

switches have very low ON resistance and a very large OFF resistance. ON resistances

of hundreds of ohms and OFF-to-ON ratios of 1000 have been observed recently [102].

Note that very fast switching is not essential for FPGAs, because these

switches will not be configured very frequently and the FPGA configuration time is

normally not critical.

Early work in molecular switching suffered from filament formation due to the

small gap separating the nano-wires. Consequently, the switching behavior observed was

due to the metallic filament rather than the molecule. Chemists at several research institutions

are targeting this problem. One such (as yet unpublished) work from our collaborating

chemists can increase the vertical separation among wires to 30nm and uses nano-spheres

to provide programmable connections. In line with this work, we experiment with a fixed

vertical separation between nano-wires of 30nm.

5.1.1 Related Work

DeHon [68], Goldstein [69], and Tour [70] have previously proposed programmable architectures

using some form of nano-structures that are made using self-assembly. Goldstein

tried to make crossbar-based devices by aligning nano-wires in two planes at right

angles to each other. The crosspoints contained molecules that provided programmable

logic as well as interconnections. It suffered from problems of signal-degradation, as

there was no way to restore the signal using only two terminal devices. DeHon overcame

this problem by using SiNW based FETs to restore the signals, and proposed a PLA


structure. However, the logic functionality in that architecture was limited to OR (and

inversion).

Tour, instead, proposed replacing the logic blocks by nanocells and connecting

them using metal wires. This approach suffered from the problem of training these nanocells, which

were assumed to consist of a randomly connected mass of molecules. Furthermore, since

the bottleneck in current FPGAs lies in the interconnect, Tour’s architecture does not

help solve this problem.

All the above architectures propose drastic changes in the existing CMOS tech-

nology as well as the design methodologies. We propose an architecture that blends with

existing technology easily, and preserves all the design methodologies and flexibility in

logic functionality.

5.2 Nanoscale FPGA Architectures

We explored FPGA architectures with varying degrees of nanoscale integration in

the interconnect fabric. The logic block in all architectures is assumed to be made using

22nm lithography (which [8] predicts to be available in 2016). In the first architecture, we

consider the inter-CLB wires to be made using some nano-fabrication technology and the

interconnect switches to be made using self-assembled molecular switches. Both metal

and metal-silicide nano-wires are explored. Note that this organization needs decoders

to address the (nano) wires. In the second architecture, we assume inter-CLB copper

wires fabricated using advanced lithography but keep molecular switches to connect

them. In order to make the exploration tractable, we limit the inter-CLB metal wires

to only two levels (M3 and M4). The main difference between arch1 and arch2 is the


attainable wire pitch (down to 10nm for arch1, 54nm for arch2). Finally, we compare these

architectures with the current island-style FPGA architecture containing pass-transistor

switches (arch3), scaled to the 22nm technology node.

5.2.1 Arch1: Using non-lithographic nano-wires and

molecular switches

Figure 5.1 shows the proposed architecture, and figure 5.2 shows how the different

technologies are stacked together. The logic block remains in silicon, and uses M1 and

M2 layers for local connections. The IO pins of the logic block are in M2 layer, and the

nano-wires are on top of this. Molecular switches provide programmable connections

between nano-wires and between nano-wires and logic blocks. Note that each layer in

figure 5.2(a) is isolated from its adjacent layers by a dielectric.

The salient features of this architecture are described below.

Interconnect wires

A good interconnect material must have a low resistivity, a large current-carrying

capacity, and the ability to be made at small pitches. A low resistivity is needed to have

small delay, which is determined by the RC product. While copper wires are expected

to have a resistivity of 2.2µΩ-cm at the 22nm technology node [8], NiSi nanowires have

been shown to have resistivities of around 10µΩ-cm [104]. Even with poorer resistivities,

NiSi nanowires may be preferred due to their ability to sustain a current density of up

to a hundred times that of copper (> 1 × 10^8 A/cm^2). Some nano-fabrication technology

may be needed to fabricate wires at pitches of less than 10nm (the wire pitch at the 22nm node is predicted to be 54nm).


We experimented with different routing architectures, consisting of different seg-

ment lengths. It has been previously shown that a segmented routing architecture is

better than non-segmented ones [81]. The logic block (8 LUT+FFs) in 22nm technology

is expected to be around 12.5µm x 12.5µm. In addition to this, the decoders take some

space. Therefore, a single-length wire in our architecture needs to run 25µm, a double

length wire 38µm, triple-length wire 50µm. Assuming 50µm as the limit for the length

of these wires, we investigate architectures having a maximum segment length of 3 logic

blocks.
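The wire lengths quoted above are consistent with a wire of segment length n covering n + 1 logic-block pitches of 12.5µm. This reading is an assumption made here for illustration, since the text does not state the formula and the decoder overhead is not quantified. Under that assumption, the lengths and the 50µm limit can be checked as follows.

```python
LOGIC_BLOCK_SIDE_UM = 12.5   # logic block side at 22nm, from the text
MAX_WIRE_LENGTH_UM = 50.0    # assumed upper limit on wire length, from the text


def wire_run_um(segment_length):
    """Physical run of a wire spanning `segment_length` logic blocks.

    Assumption: a length-n wire covers n + 1 logic-block pitches.
    """
    return (segment_length + 1) * LOGIC_BLOCK_SIDE_UM


runs = {n: wire_run_um(n) for n in (1, 2, 3)}

# Longest segment length whose physical run stays within the 50um limit.
max_segment = max(n for n in range(1, 10)
                  if wire_run_um(n) <= MAX_WIRE_LENGTH_UM)

print(runs, max_segment)
```

The double-length run evaluates to 37.5µm, which the text rounds to 38µm.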

Interface to CMOS

The problem of interfacing such nano-structures with the structures made using

traditional lithography was addressed in [68]. These nano-wires can be accessed with a

decoder made using advanced nano-imprint technology. [106] also proposes a stochastic

approach to addressing these wires, and claims that we can uniquely address these wires

with high probability if the number of wires is large. [68] proposed the use of √N control

signals for a decoder that is used to address N wires. We use a similar technique, and

therefore account for 15 decoder control signals for 200 wires in the FPGA channel. Note

that these decoders are needed only to configure the switches, and are switched off at

operation time.
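The count of 15 control signals is the square root of the 200 channel wires, rounded up to the next integer. A minimal Python sketch of this arithmetic is given below; the function name is hypothetical, and the rounding-up rule is our reading of how 15 was obtained.

```python
import math


def decoder_control_signals(n_wires):
    """Smallest integer k with k * k >= n_wires (ceiling of the square root)."""
    root = math.isqrt(n_wires)
    return root if root * root == n_wires else root + 1


signals = decoder_control_signals(200)
print(signals)
```

Using `math.isqrt` keeps the computation in integer arithmetic, so perfect squares are not rounded up by floating-point error.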

Programmable Switches


As described in section 5.2, arch1 uses molecular switches that can be made to

assemble at the cross-points of the wires. After this, these switches can be configured to

make the desired connections by applying the correct voltages at the wires (similar to

anti-fuse FPGAs).

Configuring the FPGA

The logic functionality of this FPGA can be easily programmed using SRAM cells.

Programming the routing is similar to anti-fuse FPGAs, except that we need decoders

to address the nano-wires. The main concept is that the wires should be activated in

some particular order to avoid affecting wrong switches. [107] presents a way to program

the anti-fuses in an anti-fuse FPGA, which is directly applicable to our architecture too.

Initially, all the molecular switches are off and all the wires are pre-charged to a voltage

Vp/2. This is required to ensure that the voltage difference of Vp is applied only to

the desired switch. Then the two wires that need to be connected through a switch are

addressed using a decoder and pulled to Vp and ground respectively, thus applying a

voltage difference of Vp to the molecular switch that needs to be turned on. Note that

Vp needs to be larger than the operating voltage. Experiments with molecular switches

have shown a value of 1.75V [70], which is more than double the operating voltage

at the 22nm node.
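The pre-charge scheme described above can be illustrated with a small model of a crossbar. The Python sketch below computes the voltage across every cross-point when all wires sit at Vp/2 and one row/column pair is driven to Vp and ground. The grid size, the function name, and the row/column terminology are assumptions introduced for this illustration only.

```python
def switch_biases(n_rows, n_cols, sel_row, sel_col, vp):
    """Voltage across each cross-point switch during programming.

    All wires are pre-charged to vp / 2; the selected row is then pulled
    to vp and the selected column to ground.
    """
    row_v = [vp / 2] * n_rows
    col_v = [vp / 2] * n_cols
    row_v[sel_row] = vp
    col_v[sel_col] = 0.0
    return [[row_v[r] - col_v[c] for c in range(n_cols)]
            for r in range(n_rows)]


VP = 1.75  # programming voltage reported in [70]
biases = switch_biases(3, 4, sel_row=1, sel_col=2, vp=VP)

selected = biases[1][2]
# Largest bias seen by any switch other than the selected one.
worst_unselected = max(biases[r][c]
                       for r in range(3) for c in range(4)
                       if (r, c) != (1, 2))
print(selected, worst_unselected)
```

Only the selected switch sees the full Vp. Switches sharing a wire with it see Vp/2, and all others see zero, which is why the pre-charge step prevents unintended programming as long as Vp/2 is below the switching threshold.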

We also envision a possibility of using the CLB logic itself to program the molecu-

lar switches. In order to do that, the configuration will need to go through the following

steps. First the global clock resources need to be configured. Next, the CLB (logic)

is configured to drive appropriate control signals to the address decoder. Note that


since different CLBs cannot communicate at this stage, all control signals need to be

synchronized with the global clock signal. Furthermore, since the configuration time is

usually not critical, we can afford to minimize the configuration logic (that needs to fit

within a single CLB). Next the routing (molecular) switches are programmed followed

by configuration of the CLBs to implement the user design. Note that this configuration

methodology will greatly simplify the programming circuitry when compared to anti-fuse

FPGAs.

Capacitance and Area Estimation

Capacitance of a single-length wire (a wire that spans adjacent CLBs), $C_{1\text{-}wire}$, in arch1 is estimated as follows.

$$C_{1\text{-}wire} = 4 \times N_{channel} \times C_{nano\text{-}jn} + (2 \times N_{clb\text{-}pins} + 2 \times N_{decoder}) \times C_{micro\text{-}jn} + 2 \times C_{couple}$$

where $N_{channel}$ is the number of wires in the FPGA channel (channel width), $C_{nano\text{-}jn}$ is the junction capacitance between two nano-wires, $N_{clb\text{-}pins}$ is the number of IO pins in the logic block, $N_{decoder}$ refers to the number of control signals in the decoders, $C_{micro\text{-}jn}$ is the junction capacitance between a lithographic wire and a nano-wire, and $C_{couple}$ is the coupling capacitance with an adjacent wire.
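The estimate above is a weighted sum of three capacitance terms. The Python sketch below evaluates it for one set of inputs. The channel width (200) and decoder signal count (15) come from the text; the pin count of 40 and all three capacitance values are placeholder assumptions chosen only to show the relative weight of each term.

```python
def single_length_wire_capacitance(n_channel, n_clb_pins, n_decoder,
                                   c_nano_jn, c_micro_jn, c_couple):
    """Capacitance of a single-length arch1 wire, following the estimate above."""
    nano_term = 4 * n_channel * c_nano_jn
    micro_term = (2 * n_clb_pins + 2 * n_decoder) * c_micro_jn
    couple_term = 2 * c_couple
    return nano_term + micro_term + couple_term


c_wire = single_length_wire_capacitance(
    n_channel=200,        # channel width used in the text
    n_clb_pins=40,        # assumption: hypothetical pin count
    n_decoder=15,         # decoder control signals, from the text
    c_nano_jn=1.5e-18,    # placeholder value in farads
    c_micro_jn=5.5e-18,   # placeholder value in farads
    c_couple=2.5e-15,     # placeholder value in farads
)
print(f"{c_wire * 1e15:.3f} fF")
```

With these placeholder inputs the coupling term is the largest contribution, followed by the 800 nano-wire junctions.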


The junction capacitance between any two wires, $C_{junc}$, is calculated using [68]

$$C_{junc} = \frac{2\pi\epsilon L}{\ln(2h/r)},$$

where $\epsilon$ is the permittivity of the dielectric separating the wires (we assumed SiO2), $r$ is the radius of the wires, and $h$ is the separation between the wires.

For $C_{nano\text{-}jn}$, $L = 2r$ and $h$ was kept at 30nm; for calculating $C_{micro\text{-}jn}$,

$L$ was changed to the lithographic metal half pitch (54nm for 22nm node).
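To give a feel for the magnitudes involved, the Python sketch below evaluates the junction capacitance expression, 2πεL / ln(2h/r), for the two cases described. The wire radius of 7.5nm (from the 15nm diameter used later in the results), the relative permittivity of 3.9 for SiO2, and the reuse of the same h and r for the lithographic junction are assumptions made for this example.

```python
import math

EPS0 = 8.854e-12      # vacuum permittivity in F/m
EPS_R_SIO2 = 3.9      # relative permittivity of SiO2 (assumed dielectric)


def junction_capacitance(length_m, separation_m, radius_m, eps_r=EPS_R_SIO2):
    """C = 2*pi*eps*L / ln(2h / r), in farads."""
    eps = eps_r * EPS0
    return 2 * math.pi * eps * length_m / math.log(2 * separation_m / radius_m)


r = 7.5e-9    # assumption: wire radius for a 15nm-diameter nano-wire
h = 30e-9     # vertical separation used in the text

c_nano_jn = junction_capacitance(2 * r, h, r)      # L = 2r
c_micro_jn = junction_capacitance(54e-9, h, r)     # L = 54nm

print(f"C_nano-jn  = {c_nano_jn * 1e18:.2f} aF")
print(f"C_micro-jn = {c_micro_jn * 1e18:.2f} aF")
```

Under these assumptions both junction capacitances are on the order of attofarads, so their contribution to the wire capacitance comes mainly from the large number of junctions per wire.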

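$C_{couple}$ was estimated using the equation for two long parallel cylindrical conductors.

$$C_{couple} = \frac{\pi\epsilon L}{\ln\left(\frac{D}{2a} + \sqrt{\left(\frac{D}{2a}\right)^2 - 1}\right)}$$

where $D$ is the spacing between the axes of the two cylinders, $a$ is the radius of each cylinder, and $L$ is the length of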

the cylinders (wires). We observed that the coupling capacitance calculated using the

above equation was always larger than the capacitance calculated using Berkeley device

group’s interconnect model [80], and therefore used the above as a pessimistic value.
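The coupling expression for two parallel cylinders, πεL / ln(D/2a + sqrt((D/2a)^2 - 1)), can be evaluated numerically as sketched below in Python. The example geometry (15nm diameter, 10nm spacing, hence a 25nm axis-to-axis distance, over a 25µm run) is taken from values quoted elsewhere in the chapter, but combining them here is our assumption, as is the SiO2 permittivity of 3.9.

```python
import math

EPS0 = 8.854e-12      # vacuum permittivity in F/m
EPS_R_SIO2 = 3.9      # assumed relative permittivity of the dielectric


def coupling_capacitance(length_m, axis_spacing_m, radius_m, eps_r=EPS_R_SIO2):
    """Capacitance between two long parallel cylinders, in farads."""
    x = axis_spacing_m / (2 * radius_m)
    denom = math.log(x + math.sqrt(x * x - 1))
    return math.pi * eps_r * EPS0 * length_m / denom


# Assumed geometry: 15nm-diameter wires at 10nm spacing, 25um long.
radius = 7.5e-9
axis_spacing = 25e-9
c_couple = coupling_capacitance(25e-6, axis_spacing, radius)

# The logarithm in the denominator is the inverse hyperbolic cosine of D / 2a.
x = axis_spacing / (2 * radius)
same_as_acosh = math.isclose(math.log(x + math.sqrt(x * x - 1)), math.acosh(x))

print(f"C_couple = {c_couple * 1e15:.2f} fF", same_as_acosh)
```

For this geometry the result is a few femtofarads per neighbor, roughly three orders of magnitude above a single junction capacitance.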

The area of the arch1 FPGA is equal to the area of the logic blocks plus the area of the decoders

when the pitch of the nano-wires is within 25nm. For larger wire pitches, the area is determined

by the wires and is proportional to the square of the wire pitch. Note that when

the area of the device increases, the lengths of the wires also increase and, consequently, the wire

capacitance and resistance per CLB length change.
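One way to express this area behavior is as a piecewise function of wire pitch. The Python sketch below is a simplified illustration: the base area value is a hypothetical placeholder, and making the two regimes meet exactly at the 25nm threshold is an assumption, since the text gives only the qualitative trend.

```python
def arch1_tile_area_um2(wire_pitch_nm, base_area_um2, threshold_nm=25.0):
    """Tile area as a function of nano-wire pitch.

    Up to the threshold pitch, the area is set by the logic block and
    decoders (base_area_um2). Beyond it, the wires dominate and the area
    grows with the square of the pitch.
    """
    if wire_pitch_nm <= threshold_nm:
        return base_area_um2
    return base_area_um2 * (wire_pitch_nm / threshold_nm) ** 2


BASE_AREA_UM2 = 200.0  # hypothetical logic + decoder area per tile

areas = {pitch: arch1_tile_area_um2(pitch, BASE_AREA_UM2)
         for pitch in (10, 25, 50)}
print(areas)
```

Doubling the pitch from 25nm to 50nm quadruples the tile area in this model, and the wire length per CLB doubles with it.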


5.2.2 Arch2: FPGA using lithographic wires and molecular switches

Arch1 described in the previous subsection needs decoders for addressing the

nano-wires, which increases the complexity of the fabrication process. Therefore, we also

explore an FPGA, which uses conventional lithographic metal wires as the interconnect,

with molecular switches at their cross-points (as in the previous architecture). Note that

assuming a channel width of 200 (same as arch3, and similar to commercial SRAM-based

FPGAs), the area of the CLB will be determined by the wires instead of the logic. For

22nm technology, ITRS predicts a wire pitch of 54nm. For a channel width of 200, we

will need 400 wires within the CLB pitch. This comes out to be 400 × 54nm = 21.6µm

long. In addition to that, we will need space for the logic pins, which comes to 40 ×

54nm = 2.16µm. Therefore, the CLB dimensions in this case are projected to be 23.76µm

x 23.76µm, which is only slightly smaller than the current Xilinx CLB scaled to 22nm

technology (25µm x 25µm).
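The CLB dimension for arch2 follows from simple arithmetic on the wire pitch. The Python sketch below reproduces it and compares the resulting tile area with the 25µm scaled CLB; the variable names are illustrative.

```python
WIRE_PITCH_NM = 54            # ITRS wire pitch at the 22nm node, from the text
CHANNEL_WIDTH = 200
WIRES_ACROSS_CLB = 2 * CHANNEL_WIDTH   # 400 wires within the CLB pitch, per the text
LOGIC_PINS = 40
SCALED_XILINX_CLB_UM = 25.0   # side of the current CLB scaled to 22nm

wire_span_um = WIRES_ACROSS_CLB * WIRE_PITCH_NM / 1000
pin_span_um = LOGIC_PINS * WIRE_PITCH_NM / 1000
clb_side_um = wire_span_um + pin_span_um

# Ratio of arch2 tile area to the scaled SRAM-based tile area.
area_ratio = (clb_side_um / SCALED_XILINX_CLB_UM) ** 2

print(wire_span_um, pin_span_um, round(clb_side_um, 2), round(area_ratio, 3))
```

The area ratio of about 0.90 corresponds to an area saving of roughly 10% relative to the scaled SRAM-based FPGA.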

5.3 Comparative Evaluation

We used VPR [10] to model the various FPGA architectures and evaluate their

performance.

Modeling Arch1 in VPR

In order to model arch1 in VPR, we added a new type of switch box that allows

a wire to connect only to the wires at right angles to it. This was done because in arch1,

molecules assemble only at wire cross-over points and not between two wires running in

the same direction. In order to account for the large defect rates expected at this scale,


we started by assuming that only half of the switches are operational. However, because of the

very large number of programmable switches in our architecture (even when only

half of the switches are visible), VPR takes extremely long (> 2 days on a SunBlade-2000

for a 191-CLB design) to finish the placement and routing of the designs. In order

to facilitate experimenting with multiple designs, we limited the number of switches in

VPR to only about 1% of the total physically present switches. Consequently, in VPR,

the CLB outputs have switches to only half of the wires in the channel, and a wire

can connect to only 4 other wires in the switch box, two in each of the perpendicular

directions. The performance we obtained by limiting the number of switches was not

very different from that obtained by keeping all the switches for the few designs we

initially experimented with. Since the flexibility provided by our switch box is still

greater than that of the switch boxes built into VPR, we expect that our switch box is not very

limiting, and that similar results will be obtained when all switches are considered. Note that

since we still counted the junction capacitances between all crossing wires, our results

for the proposed architectures should be considered a pessimistic (lower) bound on performance, and could be

enhanced by improvements in the tools. We used MCNC benchmark circuits for all

experimentation. These designs varied in size from 131 to 806 CLBs. In order to have

reasonable performance, we kept the routing as segmented with 20% single-length, 30%

double-length, and 50% triple-length wires.

5.3.1 Results

Figure 5.3 shows the critical path delays of all the designs when mapped to the

three architectures. The results for arch1 use a spacing s of 10nm between the nanowires


and a wire diameter of 15nm. The lithographic wire pitch was kept as 54nm, as predicted

by [8] for the 22nm node. The resistance of the molecular switch was assumed to be 1kΩ,

and the material for the nano-wire was assumed to be copper (resistivity=2.2µΩ-cm [8]).

Note that the delay is maximum for arch3 (lithographic, SRAM-based), and the delays

for arch1 and arch2 are comparable. However, the area of the arch1 FPGA is only about

30% of the arch2 FPGA. The average reduction in critical path delay was 30% for arch2

and 32% for arch1, when compared to arch3.

The performance of the designs (mapped on arch1 and arch2 FPGAs) strongly

depends on the molecular switch resistance. For our experimentation we assumed that

the off resistance of the switch is sufficiently high to consider it as an open circuit. Results

for molecular ON resistances varying from 100Ω to 100kΩ (a typical value is around 10kΩ

today) are shown in figure 5.4. It is observed that the delay of the circuit increases very

sharply beyond 10kΩ. In fact the delay becomes as large as 20X for arch1 when the

molecular resistance is 100kΩ. The delay for arch1 using NiSi nano-wires remains

larger than that of arch3 for all values of molecular resistance. This happens due to the very large

resistance of these wires. Note that these NiSi nano-wires can support large current

densities, while the metal nano-wires may in reality be limited by electro-migration.

Figure 5.5 shows the variation of resistance and capacitance of single-length NiSi

nano-wires with wire dimensions. The notation R-25 means resistance for nano-wires

with a pitch of 25nm. The plot shows results for wire pitches ranging from 25nm to

55nm. Note that as the wire pitch is increased, the area of the FPGA increases, thereby

increasing the wire length. Therefore, we can see a slight increase in the wire resistance

when the pitch is increased even when the width of the wire remains the same. The


capacitance value at 50nm width clearly reaches unacceptable limits (>20fF). At the

other extreme, the resistance values are very large (>100 kΩ) when the width of the

wire is reduced to 5nm. Note that looking at the RC product of the wire alone is not

expected to give an indication of the performance of the FPGA, since every net will

go through some molecular switches (with resistances) and into the input pins of logic

blocks (with capacitances).
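A first-order estimate helps explain the resistance trend. The Python sketch below applies R = ρL/A to a wire of circular cross-section. The circular shape, the fixed 25µm length, and the neglect of surface and grain-boundary scattering are simplifying assumptions made here; this is not the model used to produce Figure 5.5.

```python
import math


def wire_resistance_ohm(resistivity_uohm_cm, length_um, diameter_nm):
    """R = rho * L / A for a cylindrical wire, using bulk-like resistivity."""
    rho_ohm_m = resistivity_uohm_cm * 1e-8          # 1 uOhm-cm = 1e-8 Ohm-m
    area_m2 = math.pi * (diameter_nm * 1e-9 / 2) ** 2
    return rho_ohm_m * (length_um * 1e-6) / area_m2


NISI_UOHM_CM = 10.0    # NiSi resistivity quoted in the text [104]
COPPER_UOHM_CM = 2.2   # copper resistivity at the 22nm node quoted in the text [8]

r_nisi_5nm = wire_resistance_ohm(NISI_UOHM_CM, 25, 5)
r_nisi_15nm = wire_resistance_ohm(NISI_UOHM_CM, 25, 15)
r_cu_15nm = wire_resistance_ohm(COPPER_UOHM_CM, 25, 15)

print(f"NiSi 5nm:  {r_nisi_5nm / 1e3:.1f} kOhm")
print(f"NiSi 15nm: {r_nisi_15nm / 1e3:.1f} kOhm")
print(f"Cu 15nm:   {r_cu_15nm / 1e3:.1f} kOhm")
```

Even this simple estimate places a 5nm NiSi wire above 100kΩ over a single-length run, in line with the values reported above, and shows the inverse-square dependence on wire width.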

Figure 5.6 shows the variation of performance of arch1 with varying wire dimen-

sions for the design misex3; other designs showed a similar behavior. Note that per-

formance of arch1 is inferior to arch3 when the molecular resistance is 100kΩ or 10kΩ.

However, as the molecular resistance reduces, arch1 starts performing better than arch3.

The figure is divided into vertical sections of separate wire pitches. For every wire pitch,

we experimented with several wire dimensions. Note that for Rswitch=100kΩ, delay

increases monotonically (except 5 30 → 10 25) with width of the wire for a fixed pitch.

This happens because the large switch resistance makes the net delay very sensitive to

capacitance of the wire. With the delay of the design being dominated by the routing de-

lay (and because the logic delay remains almost same for different wire dimensions), the

delay of the design increases with capacitance. The other extreme occurs when Rswitch

is 100 Ω, in which case the delay decreases with increase in width due to reduction in

wire resistance. Rswitch values of 10kΩ and 1kΩ show intermediate behavior.
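The two extremes can be illustrated with a first-order Elmore delay model of a switch driving a distributed RC wire and a load. This is a sketch under stated assumptions, not the timing model used by VPR in our experiments, and the wire and load values below are hypothetical numbers chosen only so that the wider wire has lower resistance and higher capacitance.

```python
def elmore_delay_s(r_switch, r_wire, c_wire, c_load):
    """Elmore delay of a switch in series with a distributed RC wire and a load."""
    return r_switch * (c_wire + c_load) + r_wire * (0.5 * c_wire + c_load)


C_LOAD = 1e-15  # hypothetical input-pin capacitance in farads

# Hypothetical wires at a fixed pitch: (resistance in ohms, capacitance in farads).
THIN_WIRE = (20e3, 5e-15)    # narrow: high resistance, low capacitance
WIDE_WIRE = (2e3, 10e-15)    # wide: low resistance, high capacitance


def delays(r_switch):
    """Return (thin-wire delay, wide-wire delay) for the given switch resistance."""
    return tuple(elmore_delay_s(r_switch, r, c, C_LOAD)
                 for r, c in (THIN_WIRE, WIDE_WIRE))


thin_hi, wide_hi = delays(100e3)   # Rswitch = 100 kOhm
thin_lo, wide_lo = delays(100.0)   # Rswitch = 100 Ohm

print(thin_hi < wide_hi, thin_lo > wide_lo)
```

With a 100kΩ switch the first term dominates, so the wider and more capacitive wire is slower. With a 100Ω switch the wire resistance term dominates, so the wider wire is faster, which matches the trend described above.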

5.4 Summary

In this chapter we explored several nano-scale interconnect technologies for FPGAs.

First, we replaced the FPGA interconnect fabric by nano-structures: lithographic wires


by nano-wires made using nano-imprint technology, and switches by molecular switches.

Second, we used lithographic wires connected using programmable molecular switches. The results

for these two were compared with current FPGA architecture containing pass transistor

switches, scaled to 22nm.

We found that the first architecture provided the best performance with the least

area. The area reduced to 30% of the scaled architecture, and the critical path delay

reduced by 32% on an average. The second architecture improved the performance over

the scaled FPGA, but area reduction was only 10%. Using NiSi nano-wires instead of

metal nano-wires was not good for performance, but may be useful to counter electro-

migration. The resistance of the molecular switch was found to be a crucial factor in the

performance of the design, and values lower than 10kΩ were observed to be critical for

performance.

This kind of exploratory research is highly interdisciplinary, and building success-

ful nanoscale devices requires synergy between the architects and the chemists. One of

the motivations of this work was to communicate the requirements on these nanoscale technologies

to the chemists who are developing them. From the results we conclude that

molecular switches with on-resistances of around 1kΩ are needed for good performance.

Furthermore, materials with lower resistivities than NiSi nanowires must be explored for

fabricating nano-wires. Architectural improvements, and throughput-oriented designs

may utilize the area benefits of nanotechnologies to provide faster application run-times

even with higher molecule and wire resistances.


Fig. 5.1. FPGA using nano-wires and molecular switches

Fig. 5.2. 3-D organization of nano-wires, panels (a) and (b)


Fig. 5.3. Critical path delays in the 3 architectures

Fig. 5.4. Dependence of performance on molecular switch’s ON resistance


Fig. 5.5. Resistance and Capacitance values of single-length NiSi nano-wires

Fig. 5.6. Performance of a design (misex3) using metal nano-wires


Chapter 6

Summary and Future Directions

The growing popularity of FPGAs demands ways to sustain their growth in the future.

In this thesis, we looked at three main future technologies: scaled CMOS, 3-D integration,

and nanotechnology.

Leakage Energy in Scaled CMOS

Leakage energy is increasing exponentially with shrinking feature sizes. The prob-

lem is more severe in FPGAs because they use extra transistors to provide programma-

bility. This, however, also gives us an opportunity to save leakage: since a majority of

these transistors are unused or idle for any given design, we can save leakage by simply

cutting off their power supplies. This is similar to the use of sleep transistors in ASICs,

but is simpler because the state of the sleep transistor can be fixed at the configuration

time. In contrast, ASICs normally control the sleep transistor dynamically, based on the

usage of the circuit block. Such a dynamic control can also be used in FPGAs if the

design has modules that remain idle for long periods of time. However, even without

the dynamic control, we can save significant leakage by using sleep transistors controlled

by configuration SRAM cells.

Another, more expensive (in terms of area overhead) approach is the use of mul-

tiple supply voltages in the FPGA. We proposed an architecture that reduces both

dynamic and leakage power by using two supply voltages (high: Vddh, low: Vddl).


Through the use of supply transistors, circuit blocks on the FPGA can run on either of

the two Vdd’s. Extra configuration bits store the state of the supply transistors, which

sets the Vdd for every circuit block. A Vdd assignment algorithm assigns values to these

configuration bits, such that timing constraints are not violated. We integrated this

algorithm with VPR (an FPGA place-and-route tool from Toronto [10]), and obtained

a 61% reduction in total FPGA power. For the case when all routing muxes can be

individually programmed to either Vddh or Vddl, the FPGA area increases by about

50%. In order to reduce this area penalty, we present another architecture where the

voltages of the routing resources are fixed (at either Vddh or Vddl) at the time of fab-

rication. The router then routes the critical nets through the resources that are on high

Vdd, thus maintaining the performance of the design. We observe that this architecture

reduces the FPGA energy by 57.3% with an area penalty of only about 20%, and the

performance for all the benchmark designs remains within 20% of the original (when all

resources are at Vddh). We further reduce the area penalty by controlling the Vdd of

multiple logic blocks from the same set of supply transistors.
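
To make the idea of a timing-constrained Vdd assignment concrete, the following Python sketch shows one simple greedy scheme: each block is tentatively moved to the low supply, and the move is kept only if every path through that block still meets the clock period. This is an illustrative sketch, not the algorithm integrated with VPR in this thesis; the path-list timing model, the per-block delays, the visiting order, and the omission of level-converter delay are all assumptions.

```python
def assign_vdd(paths, delay_h, delay_l, period):
    """Greedily assign Vddl to blocks without violating timing.

    paths   : list of paths, each a list of block names (assumed model)
    delay_h : block name -> delay when powered from Vddh
    delay_l : block name -> delay when powered from Vddl (slower)
    period  : timing constraint that every path must meet
    """
    low = set()

    def path_delay(path):
        # Sum block delays using the current supply assignment.
        return sum(delay_l[b] if b in low else delay_h[b] for b in path)

    for block in sorted(delay_h):  # fixed order keeps the result deterministic
        low.add(block)             # tentatively lower this block's supply
        if any(path_delay(p) > period for p in paths if block in p):
            low.discard(block)     # revert: a path through it became too slow
    return {b: ("Vddl" if b in low else "Vddh") for b in sorted(delay_h)}


if __name__ == "__main__":
    paths = [["a", "b"], ["c"]]
    delay_h = {"a": 1.0, "b": 1.0, "c": 1.0}
    delay_l = {"a": 1.5, "b": 1.5, "c": 1.5}
    print(assign_vdd(paths, delay_h, delay_l, period=2.5))
```

In this example only one of the two blocks on the longer path can be slowed down, while the block on the short path always can. A practical flow would order the blocks by expected power savings and would account for level converters between voltage domains.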

Three-Dimensional Integration of FPGAs

Three-dimensional (3-D) integration is a promising technique for reducing wire

lengths in an integrated circuit. Three-D is especially attractive for FPGAs because

the interconnect dominates their total area, delay, and power. To shorten the interconnect,

we design a 3-D FPGA that uses wafer bonding to stack multiple programmable

fabrics. The inter-layer vias in this technology are much larger than the horizontal wires,

which forces us to minimize the number of such vias. Consequently, inter-layer vias are

fewer than the horizontal wires. Designing an efficient switch block for such an FPGA

is a challenging task.

We design multiple switch boxes for 3-D FPGAs, and analyze their benefits and

drawbacks. Using detailed area-timing models and a 3-D FPGA place and route tool,

we evaluate the different switch box topologies. The results indicate that the area-delay

product depends heavily on the switch box topology. Compared with the subset switch

block, our best switch block reduces the area-delay product by 9% for 3 µm pitch vias

and by over 15% for 5 µm pitch vias in a 65nm technology. We also demonstrate that,

compared with a 2-D FPGA, a 5-layer 3-D FPGA reduces the area-delay product by

more than 35%.
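
Since the area-delay product is the product of two quantities, a given reduction can arise from many combinations of area and delay changes. The short Python sketch below computes the relative reduction from an area ratio and a delay ratio (new value divided by baseline). The function name and the example ratios are purely illustrative and are not measurements from this thesis.

```python
def area_delay_reduction(area_ratio, delay_ratio):
    """Relative reduction in area-delay product.

    area_ratio  : new area divided by baseline area
    delay_ratio : new delay divided by baseline delay
    Returns 1 - area_ratio * delay_ratio (positive means improvement).
    """
    if area_ratio <= 0 or delay_ratio <= 0:
        raise ValueError("ratios must be positive")
    return 1.0 - area_ratio * delay_ratio


if __name__ == "__main__":
    # Hypothetical example: 10% less area combined with 30% less delay.
    print(area_delay_reduction(0.9, 0.7))
```

Note that the two reductions do not simply add: the combined improvement in the example is slightly smaller than the sum of the individual ones.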

Besides the influence on the FPGA routing architecture, 3-D increases the die

temperature of FPGAs. We analyzed the thermal issues in 3-D FPGAs, and proposed

an FPGA organization that reduced the peak temperature of a 2-layer FPGA by 16°C.

Nano-Scale Technologies

Recently, chemists and material scientists have made significant progress in de-

veloping alternate technologies for creating features at the nano-scale [99, 103]. They

have successfully fabricated wires separated by only a few nanometers (nanowires) [104].

Similarly, they are developing reprogrammable molecular switches [102]. The scaling of

CMOS is expected to saturate some time during the next decade, which mandates the

exploration of alternate technologies for chip fabrication.

In this project, we explore different technologies to implement FPGA interconnect.

We explore nanowires of different widths and materials, and also explore replacing

the pass-transistor switches in current FPGAs with molecular switches that provide reprogrammable

connections between wires. This eliminates the need for SRAM cells to control

the state of the switch, since these molecules store the state within themselves. This is

similar to anti-fuse FPGAs, but, in contrast to anti-fuse technology, these molecules are

reprogrammable. Furthermore, we expect the structure of the CLB to be more difficult

to realize efficiently in these nanotechnologies because they are amenable only to regular

structures. Therefore, the logic blocks in our architecture are fabricated using a 22nm

CMOS process (scaled from the 65nm values).

We compare this architecture to traditional SRAM-based FPGAs that use pass transistors

as switches (scaled to 22nm), and show that using nanowires and molecular switches

reduces the FPGA area by 70%.

6.1 Future Directions

Process variation is increasing significantly in CMOS technologies. Tackling pro-

cess variations effectively in FPGAs would be an interesting project for future study.

Since FPGAs can be reprogrammed, we can potentially reconfigure them to avoid an

increase in the delay of the critical paths. Furthermore, including the hard blocks in the power

reduction mechanisms will be needed for deploying power-efficient versions of the cur-

rent FPGAs. Since the various hard blocks can be combined and configured in multiple

modes, the tools can decide on a power-efficient mode for these blocks. Altera’s Quartus II

already performs some such optimizations. The dual-Vdd technique proposed in this

thesis can also be improved, especially by improving the routing and placement tools to

be aware of the voltage clusters. This will improve the results shown in Chapter 3.

Since previous studies for 2-D have shown that the optimal switch box changes

when the routing fabric contains long segments, our study of 3-D switch boxes must

be extended for FPGAs containing segments of different lengths. Furthermore, the 3-D

FPGA can include some direct connections among neighboring CLBs. Since the number

of neighbors increases when we go to 3-D, direct connections may have a larger impact

than for 2-D.

Nanotechnologies, being in their infancy, need continued effort before they can be realized

in real implementations. A detailed power and thermal model would help in evaluating

their benefits further. Besides this, the architecture can be improved to be more

robust to the high defect rates common to self-assembly.

References

[1] “Xilinx Products Documentation,” http://www.xilinx.com/literature.

[2] “Altera Product Datasheets,” http://www.altera.com/literature.

[3] “Cray XD1 Supercomputer,” http://www.cray.com/products/xd1.

[4] “SGI RASC Technology,” http://www.sgi.com/products/rasc.

[5] E. Lattanzi, A. Gayasen, M. Kandemir, V. Narayanan, L. Benini, and A. Bogliolo.

“Improving Java Performance by Dynamic Method Migration on FPGAs”. In

Int. J. Embedded Systems (to appear).

[6] “FPGA High Performance Computing Alliance,” http://www.fhpca.org.

[7] PLD Sector Overview by Jefferies and Company, Dec 2005.

[8] International technology roadmap for semiconductors. <http://public.itrs.net>.

[9] S. Brown, R. Francis, J. Rose, and Z. Vranesic. “Field-Programmable Gate Arrays”.

Kluwer Academic Publishers, May 1992.

[10] V. Betz and J. Rose. “VPR: A New Packing, Placement and Routing Tool for

FPGA Research”. In International Workshop on Field-programmable Logic and

Applications, 1997.

[11] V. George, H. Zhang, and J. Rabaey. “The design of a low energy FPGA”. In

Proceedings of International Symposium on Low Power Electronics and Design,

1999.

[12] E. Kusse and J. Rabaey. “Low-Energy Embedded FPGA Structures”. In Proceed-

ings of International Symposium on Low Power Electronics and Design, 1998.

[13] L. Shang, A. S. Kaviani, and K. Bathala. “Dynamic power consumption in

Virtex™-II FPGA family”. In Proceedings of ACM/SIGDA International Symposium

on Field-programmable gate arrays, 2002.

[14] K. Poon, A. Yan, and S. Wilton. “A flexible Power Model for FPGAs”. In Proceed-

ings of International Conference on Field Programmable Logic and Applications,

2002.

[15] F. Li, D. Chen, L. He, and J. Cong. “Architecture Evaluation for Power-Efficient

FPGAs”. In Proceedings of ACM/SIGDA International Symposium on Field-

programmable gate arrays, 2003.

[16] A. Singh and M. Marek-Sadowska. “Efficient Circuit Clustering for Area and Power

Reduction in FPGAs”. In Proceedings of ACM/SIGDA International Symposium

on Field-programmable gate arrays, 2002.

[17] Kaushik Roy. “Power-Dissipation Driven FPGA Place and Route Under Timing

Constraints”. In IEEE Trans. on Circuits and Systems-I, Vol. 46, No. 5, May 1999.

[18] B. Kumthekar, L. Benini, E. Macii, and F. Somenzi. “In-place power optimiza-

tion for LUT-based FPGAs”. In Proceedings of ACM/IEEE Design Automation

Conference, 1998.

[19] J. Lamoureux and S.J.E. Wilton. “On the Interaction between Power-Aware FPGA

CAD Algorithms”. In IEEE International Conference on Computer Aided Design,

November 2003.

[20] A. Gayasen and N. Vijaykrishnan. “Architecture and Design Flow Optimizations

for Power-Aware FPGAs”. In VLSI Handbook, CRC Press (to appear).

[21] T. Tuan and B. Lai. “Leakage Power Analysis of a 90nm FPGA”. In Proceedings

of Custom Integrated Circuits Conference, 2003.

[22] B. Calhoun, F. Honore, and A. Chandrakasan. “Design Methodology for Fine-

grained Leakage Control in MTCMOS”. In Proceedings of International Symposium

on Low Power Electronics and Design, 2003.

[23] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan.

“Reducing leakage energy in FPGAs using region-constrained placement”. In Pro-

ceedings of International Symposium on Field-programmable gate arrays, 2004.

[24] N. Azizi and F. N. Najm. “Look-Up Table Leakage Reduction for FPGAs”. In

Proceedings of Custom Integrated Circuits Conference, 2005.

[25] S. Srinivasan, A. Gayasen, N. Vijaykrishnan, M. J. Irwin, and T. Tuan. “Leakage

Control in FPGA Routing Fabric”. In Proceedings of Asia and South Pacific Design

Automation Conference, 2005.

[26] A. Lodi, L. Ciccarelli, and R. Giansante. “Combining Low-Leakage Techniques for

FPGA Routing Design”. In Proceedings of ACM/SIGDA international symposium

on Field-programmable gate arrays, 2005.

[27] A. Rahman and V. Polavarapuv. “Evaluation of Low-Leakage Design Techniques

for Field Programmable Gate Arrays”. In Proceedings of ACM/SIGDA Interna-

tional Symposium on Field-programmable gate arrays, 2004.

[28] A. Rahman, S. Das, T. Tuan, and A. Rahut. “Heterogeneous Routing Architecture

for Low-Power FPGA Fabric”. In Custom Integrated Circuits Conference, 2005.

[29] H. Hassan, M. Anis, and M. Elmasry. “LAP: a logic activity packing methodology

for leakage power-tolerant FPGAs”. In Proceedings of International Symposium

on Low Power Electronics and Design, 2005.

[30] J. H. Anderson, F. Najm, and T. Tuan. “Active Leakage Power Optimization

for FPGAs”. In Proceedings of ACM/SIGDA International Symposium on Field-

programmable gate arrays, 2004.

[31] S. Srinivasan, A. Gayasen, N. Vijaykrishnan, Y. Xie, M. J. Irwin, and T. Tuan.

“Improving Soft-Error Tolerance of FPGA Configuration Bits”. In Proceedings of

International Conference on Computer Aided Design, 2004.

[32] N. Azizi, F. N. Najm, A. Moshovos. “Low Leakage Asymmetric-cell SRAM”. In

IEEE Intl. Symposium on Low Power Electronic Devices, 2002.

[33] K. Usami and M. Horowitz. “Clustered voltage scaling technique for low-power

design”. In Proceedings of International Symposium on Low Power Electronics and

Design, 1995.

[34] M. Takahashi et al. “A 60-mW MPEG4 Video Codec Using Clustered Voltage

Scaling with Variable Supply-Voltage Scheme”. In IEEE Journal of Solid-State

Circuits, Vol. 33, No. 11, Nov 1998.

[35] F. Li, Y. Lin, L. He, and J. Cong. “Low-power FPGA using Pre-Defined Dual-

Vdd/Dual-Vt Fabrics”. In Proceedings of ACM/SIGDA International Symposium

on Field-programmable gate arrays, 2004.

[36] F. Li, Y. Lin, and L. He. “FPGA Power Reduction Using Configurable Dual-Vdd”.

In Proceedings of ACM/IEEE Design Automation Conference, 2004.

[37] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan.

“A Dual-Vdd Low Power FPGA Architecture”. In Proceedings of ACM/SIGDA

International Conference on Field-Programmable Logic and Applications, 2004.

[38] A. Gayasen, S. Srinivasan, N. Vijaykrishnan, M. Kandemir. “Design of Power-

Aware FPGA Fabrics”. In Int. J. Embedded Systems (to appear).

[39] Y. Lin, F. Li, and L. He. “Circuits and Architectures for Field Programmable Gate

Array with Configurable Supply Voltage”. In IEEE Trans. on VLSI Systems, vol.

13, no. 9, pp. 1035-1047, Sep 2005.

[40] Y. Lin and L. He. “Leakage efficient chip-level dual-vdd assignment with time

slack allocation for FPGA power reduction”. In Proceedings of ACM/IEEE Design

Automation Conference, 2005.

[41] J. H. Anderson and F. Najm. “Low Power Programmable Routing Circuitry for

FPGAs”. In Proceedings of IEEE International Conference on Computer-Aided

Design, 2004.

[42] D. Chen, J. Cong, F. Li, and L. He. “Low-Power Technology Mapping for FPGA

Architectures with Dual Supply Voltages”. In Proceedings of International Sym-

posium on Field-programmable gate arrays, 2004.

[43] A. J. Alexander, J. P. Cohoon, Jared L. Colflesh, J. Karro, and G. Robins, “Three-

Dimensional Field-Programmable Gate Arrays,” In Proceedings of the Interna-

tional ASIC Conference, pp. 253-256, 1995.

[44] J. V. Campenhout, H. V. Marck, J. Depreitere, and J. Dambre, “Optoelectronic

FPGAs,” In IEEE Journal of Selected Topics in Quantum Electronics, pp. 306-315,

1999.

[45] M. Leeser, W. M. Meleis, M. M. Vai, S. Chiricescu, W. Xu, and P. M. Zavracky,

“Rothko: A Three-Dimensional FPGA,” In IEEE Design and Test of Computers,

pp. 16-23, 1998.

[46] G. Borriello, C. Ebeling, S. A. Hauck, S. Burns, “The triptych FPGA architecture,”

In IEEE Transactions on VLSI Systems, Vol. 3, No. 4, pp. 491-501, 1995.

[47] S. Chiricescu, M. Leeser, and M. M. Vai, “Design and analysis of a dynamically

reconfigurable three-dimensional FPGA,” In IEEE Transactions on VLSI Systems,

Vol. 9, No. 1, pp. 186-196, 2001.

[48] M. Lin, A. El Gamal, Y. Lu, and S. Wong, “Performance benefits of monolith-

ically stacked 3D-FPGA,” Proceedings of the international Symposium on Field

Programmable Gate Arrays, 2006.

[49] A. Rahman, S. Das, A. Chandrakasan, and R. Reif, “Wiring requirement and three-

dimensional integration of field-programmable gate arrays,” In Proceedings of the

international workshop on System-level interconnect prediction, 2001.

[50] Y-S Kwon, P. Lajevardi, A. Chandrakasan, F. Honore, and D. E. Troxel, “A 3-D

FPGA wire resource prediction model validated using a 3-D placement and routing

tool,” In Proceedings of the international workshop on System-level interconnect

prediction, 2005.

[51] C. Ababei, H. Mogal, and K. Bazargan, “Three-dimensional Place and Route for

FPGAs,” In Proceedings of the Asia South-Pacific Design Automation Conference,

2005.

[52] C. Ababei, Y. Feng, B. Goplen, H. Mogal, T. Zhang, K. Bazargan, and S. Sapat-

nekar, “Placement and Routing in 3D Integrated Circuits,” In IEEE Design and

Test, Vol. 22, No. 6, pp. 520-531, Nov-Dec 2005.

[53] G-M. Wu, M. Shyu, and Y-W. Chang, “Universal switch blocks for three-

dimensional FPGA design,” In Proceedings of ACM/SIGDA international sym-

posium on Field-programmable gate arrays, 1999.

[54] A. Gayasen, N. Vijaykrishnan, M. Kandemir, A. Rahman. “Switch Box Archi-

tectures for Three-Dimensional FPGAs”. In Proceedings of Field-Programmable

Custom Computing Machines (FCCM), Apr 2006.

[55] D. Brooks and M. Martonosi, “Dynamic Thermal Management for High-

Performance Microprocessors,” In Proceedings of the 7th International Symposium

on High-Performance Computer Architecture, 2001.

[56] K. Sankaranarayanan, S. Velusamy, M. Stan, and K. Skadron, “A Case for

Thermal-Aware Floorplanning at the Microarchitectural Level,” In the Journal

of Instruction-Level Parallelism, Vol. 7, Oct 2005 (http://www.jilp.org/vol7).

[57] Y. Han, I. Koren, and C. A. Moritz, “Temperature aware floorplanning,” In Second

Workshop on Temperature-Aware Computer Systems(TACS-2), held in conjunc-

tion with ISCA-32, June 2005.

[58] G. Chen and S. Sapatnekar, “Partition-driven standard cell thermal placement,”

In Proceedings of the International Symposium on Physical Design, 2003.

[59] K. Skadron, M. R. Stan, et al., “Temperature-Aware Microarchitecture,” In Pro-

ceedings of the 30th International Symposium on Computer Architecture (ISCA),

2003.

[60] G. M. Link and N. Vijaykrishnan, “Thermal Trends in Emerging Technolo-

gies,” In Proceedings of the International Symposium on Quality Electronic Design

(ISQED), 2006.

[61] J. Cong, J. Wei, and Y. Zhang. “A Thermal-Driven Floorplanning Algorithm

for 3D ICs”. In Proceedings of the International Conference on Computer-Aided

Design, Nov 2004.

[62] B. Goplen and S. S. Sapatnekar. “Efficient Thermal Placement of Standard Cells

in 3D ICs using a Force Directed Approach”. In Proceedings of the International

Conference on Computer-Aided Design, 2003.

[63] J. Cong and Y. Zhang. “Thermal Via Planning for 3-D IC’s”. In Proceedings of

the International Conference on Computer Aided Design, Nov 2005.

[64] B. Goplen and S. S. Sapatnekar. “Thermal Via Placement in 3D ICs”. Proceedings

of the ACM International Symposium on Physical Design, 2005.

[65] S. Lopez-Buedo, J. Garrido, and E. Boemo, “Dynamically Inserting, Operating,

and Eliminating Thermal Sensors of FPGA-based Systems,” In IEEE Transactions

on Components and Packaging Technologies (CPM), Vol.25, No.4, pp.561-566, De-

cember 2002.

[66] S. Velusamy et al., “Monitoring Temperature in FPGA based SoCs,” In Proceedings

of the International Conference on Computer Design (ICCD), 2005.

[67] P. Sundararajan, A. Gayasen, N. Vijaykrishnan, and T. Tuan. “Thermal Characterization

and Optimization in Platform FPGAs”. In Proceedings of the International

Conference on Computer Aided Design, Nov 2006.

[68] Andre DeHon and Michael J. Wilson, “Nanowire-Based Sublithographic Pro-

grammable Logic Arrays,” In Proc. of International Symposium on Field Pro-

grammable Gate Arrays, 2004.

[69] S.C. Goldstein and M. Budiu, “Nanofabrics: Spatial computing using molecular

electronics,” In Proceedings of the International Symposium on Computer Archi-

tecture (ISCA 2001), July 2001.

[70] J. M. Tour, W. L. Van Zandt, C. P. Husband, S. M. Husband, L. S. Wilson, P.

D. Franzon, D. P. Nackashi, “Nanocell Logic Gates for Molecular Computing,” In

IEEE Transactions on Nanotechnology 2002, 1, 100-109.

[71] A. Gayasen, N. Vijaykrishnan, and M. J. Irwin. “Exploring Technology Alterna-

tives for Nano-Scale FPGA Interconnects”. In Proceedings of the Design Automa-

tion Conference, 2005.

[72] A. Chandrakasan, W. Bowhill, and F. Fox. “Design of High-Performance Micro-

processor Circuits”. IEEE Press, 2001.

[73] Z. Chen, M. Johnson, L. Wei, and K. Roy. “Estimation of standby leakage power in

CMOS circuits considering accurate modeling of transistor stacks”. In Proceedings

of International Symposium on Low Power Electronics and Design, 1998.

[74] W. Zhang, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and V. De. “Compiler

support for reducing leakage energy consumption”. In Proceedings of the 6th Design

Automation and Test in Europe Conference, 2003.

[75] J. Kao, S. Narendra, and A. Chandrakasan. “MTCMOS Hierarchical Sizing Based

on Mutual Exclusive Discharge Patterns”. In Proceedings of the Design Automation

Conference, 1998.

[76] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. Vijaykumar. “Gated-Vdd: A Circuit

Technique to Reduce Leakage in Deep-Submicron Cache Memories”. In Proceedings

of International Symposium on Low Power Electronics and Design, 2000.

[77] Xilinx Application Note. “Two Flows for Partial Reconfiguration: Module Based

or Difference Based”. http://direct.xilinx.com/bvdocs/publications/xapp290.pdf.

[78] S. Swaminathan, R. Tessier, D. Goeckel, and W. Burleson. “A Dynamically Reconfigurable

Adaptive Viterbi Decoder”. In Proceedings of ACM/SIGDA international

symposium on Field-programmable gate arrays, 2002.

[79] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and

S. Kulkarni. “Pushing ASIC performance in a power envelope”. In Proceedings of

the Design Automation Conference, 2003.

[80] “Berkeley Predictive Technology Model,” http://www-device.eecs.berkeley.edu/~ptm/interconnect.html.

[81] S. Wilton, “Architecture and Algorithms for Field-Programmable Gate Arrays

with Embedded Memory,” PhD thesis, University of Toronto, 1997.

[82] V. Betz and J. Rose. “FPGA Routing Architecture: Segmentation and Buffering

to Optimize Speed and Density”. In Proceedings of ACM/SIGDA International

Symposium on Field-programmable gate arrays, 1999.

[83] Y-C. Ju and R. A. Saleh. “Incremental Techniques for the Identification of Stati-

cally Sensitizable Critical Paths”. In Proceedings of the Design Automation Con-

ference, 1991.

[84] T. Inukai, M. Takamiya, K. Nose, H. Kawaguchi, T. Hiramoto, and T. Sakurai.

“Boosted Gate MOS (BGMOS): Device/Circuit Cooperation Scheme to Achieve

Leakage-Free Giga-Scale Integration”. In Proceedings of Custom Integrated Circuits

Conference, 2000.

[85] J. Rose and S. Brown, “Flexibility of interconnection structures for field-

programmable gate arrays,” In IEEE J. Solid-State Circuits, vol. 26, pp. 277-282,

Mar 1991.

[86] G. Lemieux, S. Brown, and D. Vranesic, “On two-step routing for FPGAS,” In

Proceedings of the international symposium on Physical design, 1997.

[87] M. I. Masud and S. Wilton, “A new switch block for segmented FPGAs,” In

Proceedings of the International Workshop on Field Programmable Logic and Ap-

plications, 1999.

[88] P. Hallschmid and S. Wilton, “Detailed routing architectures for embedded pro-

grammable logic IP cores,” In Proceedings of the ACM/SIGDA International Sym-

posium on Field-Programmable Gate Arrays, 2001.

[89] Y.-W. Chang, D. F. Wong, and C. K. Wong, “Universal switch blocks for FPGA

design,” In ACM Transactions Design Automation of Electronic Systems, 1(1):80-

101, Jan. 1996.

[90] H. Fan, J. Liu, Y.-L. Wu, and C.-C. Cheung, “On optimum switch box designs for

2-D FPGAs,” In Proceedings of the Design Automation Conference (DAC), 2001.

[91] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: A novel chip

design for improving deep submicron interconnect performance and systems-on-

chip integration,” In Proceedings of the IEEE, Vol. 89, May 2001, pp. 602-633.

[92] S. Gupta, M. Hilbert, S. Hong, and R. Patti, “Techniques for Producing 3D ICs

with High-Density Interconnect,” 2005. Available from Tezzaron Semiconductor.

[93] Y. Yamaji, T. Ando, et al., “Thermal Characterization of Bare-Die Stacked Mod-

ules with Cu through-vias,” In Electronic Components and Technology Conference,

2001.

[94] C. S. Tan and R. Reif, “Multi-Layer Silicon Layer Stacking Based on Copper Wafer

Bonding,” In Electrochemical and Solid-State Letters, 8(6):G147-G149, 2005.

[95] “Predictive Technology Model,” http://www.eas.asu.edu/~ptm.

[96] V. Betz, J. Rose, and A. Marquardt, “Architecture and CAD for Deep-Submicron

FPGAs,” Kluwer Academic Publishers, February 1999.

[97] A. Telikepalli, “Designing for Power Budgets and Effective

Thermal Management,” In Xcell Journal, Issue 56, 2006.

(http://www.xilinx.com/publications/xcellonline/xcell 56)

[98] “Thermal Management for 90-nm FPGAs,” Application Note 358, Altera Corpo-

ration.

[99] Yong Chen et al., “Nanoscale molecular-switch devices fabricated by imprint lithography,”

In Applied Physics Letters 82 (2003), no. 10, 1610–1612.

[100] M. C. McAlpine, R. S. Friedman, and C. M. Lieber, “Nanoimprint Lithography

for Hybrid Plastic Electronics,” In Nano Letters 3, 443-445, 2003.

[101] Brent A. Mantooth and Paul S. Weiss, “Fabrication, Assembly, and Characteri-

zation of Molecular Electronic Components,” In Proceedings of the IEEE, Vol 91,

No. 11, Nov 2003.

[102] D. R. Stewart, D. A. A. Ohlberg, P. A. Beck, Y. Chen, R. S. Williams, J. O. Jeppe-

sen, K. A. Nielsen, J. F. Stoddart, “Molecule-Independent Electrical Switching in

Pt/Organic Monolayer/Ti Devices,” In Nano Letters 2004 Vol. 4, No. 1, 133-136.

[103] Jung-Hyurk Lim, Chad A. Mirkin, “Electrostatically Driven Dip-Pen Nanolithog-

raphy of Conducting Polymers,” In Adv. Mat., 2002, 14(20), 1474-1477.

[104] Yue Wu, Jie Xiang, Chen Yang, Wei Lu, and C. M. Lieber, “Single-crystal metallic

nanowires and metal/semiconductor nanowire heterostructures,” In Nature, Vol.

430, Jul 2004.

[105] Arijit Raychowdhury, Kaushik Roy, “A Circuit Model for Carbon Nanotube Inter-

connects: Comparative Study with Cu Interconnects for Scaled Technologies,” In

Proc. of International Conference on Computer Aided Design, 2004.

[106] Andre DeHon, Patrick Lincoln, John E. Savage, “Stochastic Assembly of Sublitho-

graphic Nanoscale Interfaces,” In IEEE Transactions on Nanotechnology, Vol. 2,

No. 3, Sep 2003.

[107] Jonathan Greene, Esmat Hamdy, and Sam Beal, “Antifuse Field Programmable

Gate Arrays,” In Proceedings of the IEEE, Vol. 81, No. 7, Jul 1993.

Vita

Aman Gayasen was born in India on August 8, 1979. In 2001, he received the

B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Delhi.

After graduation, he joined Ikos Systems (now Mentor Graphics), India as a software

engineer working on a synthesis tool. He joined the Ph.D. program in Computer Science

and Engineering at the Pennsylvania State University, University Park in August 2002.

He worked as a teaching assistant from August 2002 to May 2003. Since then, he has

been employed as a research assistant in Computer Science and Engineering. He worked

as a summer intern at Xilinx Research Labs, San Jose during the summers of 2004 and

2005.

He received the National Talent Search Scholarship (a Govt. of India undergrad-

uate fellowship) from 1995 to 2001. The Pennsylvania State University awarded him the

Robert M. Owens Memorial Scholarship in 2005.

His research interests include VLSI design, CAD for VLSI, reconfigurable archi-

tectures, and nanotechnology. He is a student member of IEEE and ACM.