The Pennsylvania State University
The Graduate School
Department of Computer Science and Engineering
IMPLICATIONS OF FUTURE TECHNOLOGIES
ON THE DESIGN OF FPGAs
A Thesis in
Computer Science and Engineering
by
Aman Gayasen
© 2006 Aman Gayasen
Submitted in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy
December 2006
The thesis of Aman Gayasen was reviewed and approved∗ by the following:
Mahmut Kandemir
Associate Professor of Computer Science and Engineering
Thesis Co-Adviser
Co-Chair of Committee
Vijaykrishnan Narayanan
Associate Professor of Computer Science and Engineering
Thesis Co-Adviser
Co-Chair of Committee
Mary Jane Irwin
Professor of Computer Science and Engineering
Vittal Prabhu
Associate Professor of Industrial and Manufacturing Engineering
Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering
∗Signatures are on file in the Graduate School.
Abstract
The Field Programmable Gate Array (FPGA) industry is going through an excit-
ing phase. The market leaders, Xilinx and Altera, announce new products almost every
year. Their CAD tools also keep adding new features. The growing popularity of FPGAs
demands that we sustain this growth. This thesis explores new technologies
for continuing the improvement of FPGAs in the future.
In this thesis, we study the effect of three main future technologies. First, we
evaluate FPGA designs for scaled CMOS technologies — 65nm and below. The main
problems here are leakage, temperature, and process variation. Second, we look at
3-D stacking of multiple dies within a package. Since this technology is still being
perfected, we have several parameters to play with. For example, the properties of
the vias that provide communication among the different layers (inter-layer vias) are
very different from the other wires, especially pitch and length. This brings about an
asymmetry in the FPGA fabric. How this influences the FPGA architecture is a question
we try to answer. Furthermore, stacking multiple layers increases the power density,
which increases the junction temperature. This thesis studies the impact of stacking
on temperature, and proposes thermal-aware organization of FPGAs. Finally, we look
at some technologies that are still in their infancy, such as molecular switches, carbon
nanotubes, and silicon nanowires. Specifically, we explore the use of such technologies
to implement the interconnect fabric in an FPGA.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 FPGA Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 3. Reducing Leakage Energy in FPGAs . . . . . . . . . . . . . . . . . . 15
3.1 Using Sleep Transistors . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 RCP: Region-Constrained Placement . . . . . . . . . . . . . . 21
3.1.2 Combining RCP and Time-Based Control . . . . . . . . . . . 22
3.1.3 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.1.3.1 Time-based leakage control . . . . . . . . . . . . . . 27
3.1.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.4.1 Time-based Leakage Control . . . . . . . . . . . . . 31
3.1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Dual-Vdd FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1.1 Fully Programmable (FP) . . . . . . . . . . . . . . . 36
3.2.1.2 Logic Programmable (LP) . . . . . . . . . . . . . . . 38
3.2.1.3 Level Conversion . . . . . . . . . . . . . . . . . . . . 39
3.2.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2.1 Vdd Assignment . . . . . . . . . . . . . . . . . . . . 42
3.2.2.2 Power Estimation . . . . . . . . . . . . . . . . . . . 47
3.2.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.3.1 FP Architecture . . . . . . . . . . . . . . . . . . . . 51
3.2.3.2 LP Architecture . . . . . . . . . . . . . . . . . . . . 53
3.2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Chapter 4. Three-Dimensional FPGAs . . . . . . . . . . . . . . . . . . . . . . . 59
4.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.1 2-D Switch Boxes . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.1.2 3-D Technology Overview . . . . . . . . . . . . . . . . . . . . 63
4.2 3-D Detailed Routing Architecture . . . . . . . . . . . . . . . . . . . 65
4.2.1 Switch Box Topology . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Experimentation . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.2.1 Architecture and Technology Parameters . . . . . . 72
4.2.2.2 Experimentation Flow . . . . . . . . . . . . . . . . . 73
4.2.2.3 Area Model . . . . . . . . . . . . . . . . . . . . . . . 75
4.2.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Thermal Issues in 3-D FPGAs . . . . . . . . . . . . . . . . . . . . . 81
4.3.1 Thermal-Characterization of FPGAs: 2-D to 3-D . . . . . . 83
4.3.2 Thermal-Aware 3-D FPGA Organization . . . . . . . . . . . 89
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Chapter 5. Technology Alternatives for Nanoscale FPGA Interconnects . . . . . 91
5.1 Nanotechnology Primitives . . . . . . . . . . . . . . . . . . . . . . . . 92
5.1.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.2 Nanoscale FPGA Architectures . . . . . . . . . . . . . . . . . . . . . 95
5.2.1 Arch1: Using non-lithographic nano-wires and molecular switches 96
5.2.2 Arch2: FPGA using lithographic wires and molecular switches 101
5.3 Comparative Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Chapter 6. Summary and Future Directions . . . . . . . . . . . . . . . . . . . . 109
6.1 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
List of Tables
3.1 Characteristics of benchmark designs . . . . . . . . . . . . . . . . . . . . 56
3.2 Comparison of High-to-Low and Low-to-High algorithms (LC at CLB inputs,
Vddh = 1.1V, Vddl = 0.8V) . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.1 Via properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.2 Power densities in 4VFX100 (Freq : 500MHz) . . . . . . . . . . . . . . 84
4.3 Effect of stacking on temperature . . . . . . . . . . . . . . . . . . . . . . 86
4.4 Parameters for temperature estimation in HS3d . . . . . . . . . . . . . 86
4.5 Thermal-aware 3-D FPGA design . . . . . . . . . . . . . . . . . . . . . 88
List of Figures
1.1 Traditional FPGA architecture . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Virtex-2 FPGA architecture . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1 FPGA containing sleep transistors . . . . . . . . . . . . . . . . . . . . . 17
3.2 Leakage energy breakdown . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 a) Horizontal and b) Vertical styles of RCP on an XC2V40 FPGA for a
region size of 2×4 slices. Required number of slices is 100 (13 regions) 19
3.4 Different placements for an example design. In part (c), each module is
bounded by a polygon . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Experimental Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6 Average leakage power savings for RCP and normal placement. . . . . . 29
3.7 Leakage power savings for RCP for 4× 16 region for all designs. . . . . . 29
3.8 Average clock frequency for RCP. . . . . . . . . . . . . . . . . . . . . . . 29
3.9 Average leakage energy savings for RCP and normal placement. . . . . . 29
3.10 Leakage power savings for time-based leakage control. . . . . . . . . . . 31
3.11 Leakage energy savings for time-based leakage control. . . . . . . . . . . 31
3.12 Supply transistors used for programmable Vdd . . . . . . . . . . . . . . 33
3.13 Fully programmable dual-Vdd architecture (FP) . . . . . . . . . . . . . 36
3.14 Logic programmable dual-Vdd architecture (LP) . . . . . . . . . . . . . 39
3.15 Level converter circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.16 Experimental Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.17 Distribution of path delays . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.18 Power consumption for different Vddl’s. Vddh=1.1V. . . . . . . . . . . . 49
3.19 Power consumption for different architectures and algorithms. Vddh=1.1V,
Vddl=0.9V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.20 Average power breakdown between logic and routing resources. Vddh=1.1V,
Vddl=0.9V . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.21 Average power consumption for different critical path delay tolerances.
Vddh=1.1V, Vddl=0.9V . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3.22 Critical path delay for LP FPGA with different extents of Vddl resources.
Vddh=1.1V, Vddl=0.9V. . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.23 Energy consumption in LP FPGAs. Vddh=1.1V, Vddl=0.9V . . . . . . 58
4.1 2-D switch boxes. X0, Y0, X1, Y1 mark their sides. . . . . . . . . . . . 62
4.2 Two kinds of stacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3 3-D FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.4 3-D switch boxes for H=4, V=2. . . . . . . . . . . . . . . . . . . . . . . 67
4.5 Experimentation flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.6 Comparing 2-D and 3-D FPGAs . . . . . . . . . . . . . . . . . . . . . . 76
4.7 Comparing the switch boxes for 5-layer FPGA . . . . . . . . . . . . . . 76
4.8 Comparing the switch boxes for different via technologies for 5-layer FPGA 80
4.9 Comparing the switch boxes for different process nodes for 5-layer FPGA 80
4.10 Virtex-4 FX100 device (not to scale) . . . . . . . . . . . . . . . . . . . . 84
4.11 Thermal profile of 4VFX100 . . . . . . . . . . . . . . . . . . . . . . . . 85
4.12 Effect of stacking on peak temperature . . . . . . . . . . . . . . . . . . . 85
4.13 3-D FPGA organizations . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1 FPGA using nano-wires and molecular switches . . . . . . . . . . . . . 106
5.2 3-D organization of nano-wires . . . . . . . . . . . . . . . . . . . . . . . 106
5.3 Critical path delays in the 3 architectures . . . . . . . . . . . . . . . . . 107
5.4 Dependence of performance on molecular switch’s ON resistance . . . . 107
5.5 Resistance and Capacitance values of single-length NiSi nano-wires . . . 108
5.6 Performance of a design (misex3) using metal nano-wires . . . . . . . . 108
Acknowledgments
I am grateful to my advisers, Dr. Vijay and Dr. Kandemir, for their support
throughout my Ph.D. Without their guidance, both in professional and personal matters,
I would never have completed this thesis. I am also thankful to members of MDL for
creating a friendly work environment.
Some of the work in this thesis was done with the help of other MDL students.
While it is impossible to thank everyone who might have influenced my research indi-
rectly, I am attempting to thank those who worked closely with me on several projects.
Yuh-Fang and Ki-Yong helped me with the FPGA power work. Priya helped with the
thermal work. Besides them, Suresh worked with me on several projects. I also enjoyed
some enlightening conversations with Vijay Sai and Greg Link. During the last semester
at Penn State, I also worked with Soumya, Prasanth, and Sungmin. Besides them, my
neighbor in the lab, Jooheung, was a constant source of inspiration. I also wish to thank
Ing-Chao for the ping-pong games; they helped me focus when I was under stress.
Outside Penn State, I frequently collaborated with Tim Tuan and Arif Rahman
of Xilinx Research Labs. I am grateful for their help. They, and Satyaki Das, were
excellent mentors during my internships at Xilinx.
Chapter 1
Introduction
Field Programmable Gate Arrays (FPGAs) are Integrated Circuits (ICs) contain-
ing programmable logic and interconnect elements. The “Field” in FPGA denotes their
ability to be programmed by the end-user. The “Gate Array” signifies their similar-
ity to conventional mask-programmed gate arrays. FPGAs belong to a broader cate-
gory of field-programmable devices, called Programmable Logic Devices (PLDs), which
include PLA (Programmable Logic Arrays), PAL (Programmable Array Logic), and
CPLD (Complex PLD). While PLAs and PALs can implement only two-level logic, both
FPGAs and CPLDs can implement multi-level logic. CPLDs consist of multiple PAL
elements interconnected through a programmable switch matrix. In contrast, FPGAs
contain several small programmable logic elements connected using a mixture of short
and long wires and programmable switches. While CPLDs offer more predictable tim-
ing, they lag FPGAs in logic capacity. Because of their large capacities and superior
device utilization, FPGAs are among the most popular programmable devices.
FPGAs present significant advantages over microprocessors as well as ASICs.
Compared to microprocessors, they offer higher performance for parallel applications.
Compared to ASICs, they offer a simpler design flow and lower Non-Recurring Engineer-
ing (NRE) costs. Therefore, they are suited for small-to-medium volumes of production
and for products where time-to-market is critical. Furthermore, the regular structure of
FPGAs makes them highly amenable to shrinking geometries, and therefore, they usu-
ally are at the forefront of new technologies. Consequently, by using FPGAs, designers
can get the advantages of advanced top-of-the-line process technologies without worry-
ing about the complexities that accompany technology scaling. For all of the above
reasons, FPGAs are poised to be among the most popular devices of the future.
At the time of their introduction in the mid-eighties, FPGAs were primarily used
for prototyping and to implement glue logic. However, over a period of twenty years,
especially since the late nineties, their market has diversified significantly. The inclusion
of embedded processors, memory, and DSP blocks provides the complete platform to
create embedded systems [1, 2]. Their inherent parallelism, coupled with an increase in
size and decrease in logic delays, allows them to be used as hardware accelerators for
high performance applications (e.g., [3, 4, 5]). Researchers are also working to create scalable
supercomputers using an array of off-the-shelf FPGAs [6]. Furthermore, the introduction
of low-cost FPGAs by both Xilinx [1] and Altera [2] has enabled the use of FPGAs in
consumer markets. Technology research group Gartner Dataquest forecasts that the
market for programmable logic devices, which incorporates reconfigurable computing
with FPGAs, will double in a period of five years, from $3.1 billion in 2005 to $6.2
billion in 2010 [7]. In order to maintain this growth in the FPGA market, FPGAs
must consistently improve in performance, size, and features. This thesis explores future
technologies that will be crucial in sustaining this improvement.
Future technologies can be divided into the following three categories.
• The first category consists of scaled CMOS technologies, as predicted by the ITRS [8].
The ITRS predicts that the industry will move to 22nm technology in 2016. These tech-
nologies will face severe power and reliability problems. Leakage power, which
until 130nm was only a minor component of the total power, has already become
a severe problem in sub-100nm technologies. Furthermore, because of increased
power densities on the die, the die temperature is also increasing. Higher tempera-
ture in turn causes a plethora of problems, including an increase in leakage power
and a reduction in silicon lifetime. In severe cases, heat could also melt the package
and cause total disruption. Besides power and temperature, variability, both in man-
ufacturing and over the long term, is a serious problem. Defect rates are also expected to
increase for smaller technologies. In this study, we focus on reducing leakage power
and temperature in an FPGA.
• The second category comprises evolutionary technologies, such as stacking multi-
ple device layers to create a three-dimensional (3-D) IC. Three-D stacking is helpful
in reducing wire lengths, which translates into reduction in the FPGA area and
power consumption. A timing-driven placement and routing tool can also use 3-D
to reduce the critical path delay. The vertical connections in a 3-D technology are
much larger than the metal wires in a 2-D chip. Therefore, we would normally try
to reduce their number. A key challenge for 3-D FPGAs is designing the routing
architecture such that we use the vertical connections judiciously. Further, tem-
perature is a major concern here, because stacking multiple layers increases the
effective power density. Our results show that going from a single layer to a 4-layer
FPGA could increase the peak temperature by a factor of 2.4 (see Chapter 4).
• The final category contains non-lithographic technologies, such as carbon nan-
otubes, silicon nanowires, and molecular switches. We broadly call them nan-
otechnologies. Although several scientists are working on them, these technologies
are still in their initial stages of development. The key question here is, “What
are the desired properties in such technologies for them to be better than scaled
CMOS?” With this information, we give valuable feedback to the chemists and
material scientists who are developing these technologies, and also set reasonable
expectations for nanotechnologies.
The remainder of the thesis is organized as follows. Chapter 2 reviews the existing
literature related to this study. Chapter 3 discusses the challenges in reducing leakage
energy in future CMOS technology nodes, and presents two techniques to reduce leakage.
Chapter 4 develops a detailed routing architecture for 3-D FPGAs, and also studies
thermal issues in them. Chapter 5 explores nanotechnology alternatives to implement
the interconnect fabric in an FPGA. Finally, Chapter 6 summarizes the contributions of
this thesis, and presents possible directions for future research in this field.
1.1 FPGA Architectures
Figure 1.1 shows the traditional island-style FPGA architecture. It consists of
a 2-D array of configurable logic blocks (CLBs) in a sea of routing wires. The CLBs
typically contain multiple Look-Up Tables (LUTs) and Flip-Flops (FFs). The routing
wires connect among themselves through programmable switches, forming a switch block.
Similarly, these wires also connect to the CLBs, forming connection blocks.
Fig. 1.1. Traditional FPGA architecture
The modern FPGA has a more complex architecture than the one shown in Fig-
ure 1.1. An example of a modern FPGA is the Virtex-2 FPGA, shown in Figure 1.2.
It stores the configuration information in SRAM cells, each of which typically consists
of 6 transistors. The basic logic element in Virtex-2 is called a slice. A slice consists
of 2 LUTs, 2 FFs, fast carry logic, and some wide MUXes [1]. A CLB in turn consists
of 4 slices and an interconnect switch matrix. The interconnect switch matrix consists
of large multiplexers (as large as 32-to-1) controlled by configuration SRAM cells. Note
that Figure 1.2 is not drawn to scale, and in reality the interconnect switches account
for nearly 70% of the CLB area. The FPGA contains an array of such CLBs along with
block RAMs (BRAMs), multipliers and IO blocks as depicted in Figure 1.2. Altera’s
FPGAs are also similar in technology to Virtex-2.
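To make the LUT-based logic element concrete, the following minimal Python sketch models a k-input LUT as 2^k configuration bits addressed by the input values, which is how an SRAM-based LUT is generally understood to operate. The class and function names are hypothetical, and the 4-input LUT size is an assumption made for illustration; only the counts of 2 LUTs per slice and 4 slices per CLB are taken from the description above.

```python
from typing import Sequence

LUTS_PER_SLICE = 2   # from the text: a slice has 2 LUTs
SLICES_PER_CLB = 4   # from the text: a CLB has 4 slices


class LUT:
    """A k-input look-up table: 2**k configuration bits, one per input pattern."""

    def __init__(self, k: int, config_bits: Sequence[int]):
        if len(config_bits) != 2 ** k:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.k = k
        self.config_bits = list(config_bits)

    def evaluate(self, inputs: Sequence[int]) -> int:
        """Use the inputs as an address (inputs[0] is the least significant bit)."""
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        address = sum(bit << i for i, bit in enumerate(inputs))
        return self.config_bits[address]


def lut_config_bits_per_clb(k: int) -> int:
    """Configuration bits held by the LUTs of one CLB (routing bits excluded)."""
    return SLICES_PER_CLB * LUTS_PER_SLICE * 2 ** k


# Program a 4-input LUT (k = 4 is an assumption) as a 4-input AND gate:
# only the all-ones address (15) stores a 1.
and4 = LUT(4, [0] * 15 + [1])
```

Reprogramming the same LUT to a different function only requires different configuration bits, which is the sense in which the logic is programmable. The sketch deliberately ignores the routing multiplexers, which the text notes account for most of the CLB area.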
Fig. 1.2. Virtex-2 FPGA architecture
A different kind of FPGA is the antifuse-based FPGA offered by Actel, which
is one-time programmable. Actel and Lattice also offer flash-based FPGAs. In this
study, we focus only on SRAM-based FPGAs.
Chapter 2
Related Work
Ever since the first FPGAs were introduced by Xilinx in the mid-80s, they have
been a popular topic for research. Their programmability offers interesting avenues for
creativity. Researchers at the University of Toronto performed early research on FPGA ar-
chitecture [9]. They used the area, delay, and area-delay product as metrics to evaluate
FPGA architectures. They also developed tools to allow FPGA architecture explo-
ration [10].
Meanwhile, in the late 90s, researchers at Berkeley started looking at the en-
ergy consumption in FPGAs. Energy consumption was becoming crucial because of
the growing demand for the use of FPGAs in embedded devices. They proposed low-
swing interconnect circuits and an interconnect architecture optimized to reduce the
energy-delay product [11]. Some studies also analyzed the dynamic power consumption
of commercial FPGAs — first of a Xilinx XC4003A FPGA [12], and, more recently, of
Virtex-2 [13]. Both studies observed that the interconnect fabric consumes the majority
of the dynamic power. Similar to the early study at Toronto (which focused on area and
delay), some researchers studied the influence of architecture parameters, such as LUT
size, cluster size, and segment length, on power consumption [14, 15]. Studies have also
tried to reduce the dynamic power through modifications in the CAD tools, ranging from
clustering [16] and place-and-route [17] to bitstream manipulation [18]. The bitstream ma-
nipulation technique modified the LUT configuration bits to reduce dynamic power [18].
Recently, Lamoureux and Wilton [19] proposed a complete power-driven CAD flow, and
studied the interaction among the different CAD stages.
All the above studies focused on dynamic power consumption. With shrinking
transistor sizes, leakage power is also becoming important. FPGA researchers recognized
this, and therefore, the past two years have seen several studies on FPGA leakage power
(see [20] for a survey). Since FPGAs use several transistors to provide programmability,
their leakage power consumption is significantly higher than that of a custom circuit implement-
ing the same functionality. Tuan and Lai [21] performed a detailed analysis of leakage
power in Xilinx CLBs. They concluded that leakage must be significantly reduced to
enable the use of FPGAs in mobile applications.
Several techniques to reduce leakage in FPGAs have also been proposed. Two
of them proposed the use of sleep transistors [22, 23]. While researchers at MIT [22]
proposed a fine-grained leakage control scheme, embedding sleep transistors within the
CLB circuit, Gayasen et al. [23] advocated a coarser region-based leakage control, and
proposed constraining the design to a minimum number of regions to reduce leakage. At
the circuit-level, Azizi and Najm [24] presented low-leakage circuits for LUTs. Since the
leakage of routing muxes depends strongly on their input values, Srinivasan et al. [25]
presented circuits to reduce leakage in the routing fabric by setting desired values at the
inputs of the unused routing muxes. Lodi et al. [26] developed low leakage circuits for
the FPGA routing switch. Rahman and Polavarapuv [27] evaluated several low-leakage
design techniques for FPGAs. One of them was the use of a heterogeneous routing fabric,
consisting of a mixture of high and low threshold (Vt) transistors. Since high Vt reduces
the leakage current at the expense of an increase in delay, the router needs to pick
the correct resources based on the slack available. This idea was proposed for a more
commercial architecture later [28], where detailed experiments helped them decide which
resources to slow down without affecting performance.
At the CAD level, Hassan et al. [29] proposed a low-leakage packing algorithm
that packed the LUTs exhibiting similar idleness together so that they could be shut
down together. Anderson et al. [30] presented a no-cost technique to reduce leakage by
selecting the polarities of logic signals appropriately. A similar technique was proposed
in [31], but with Asymmetric SRAM cells [32]. Chapter 3 presents our region-based
leakage control technique.
Researchers have previously proposed dual-Vdd techniques for ASICs [33, 34].
The dual-Vdd ASIC uses high-Vdd (Vddh) only to supply the timing-critical blocks, and
saves power on the non-critical ones by supplying them a lower Vdd (Vddl). Li et al. [35]
first applied the idea to FPGAs. They fixed the voltages of logic blocks and attempted
to place the design such that timing-critical blocks used high Vdd. This approach did
not provide enough power savings unless some performance degradation was allowed.
Therefore, a programmable Vdd FPGA was next proposed [36, 37], where the circuit
blocks could be programmed to run on Vddh or Vddl. In [36], the Vdd of only logic blocks
could be programmed. All routing resources remained at Vddh, and the emphasis was on
reducing dynamic power while keeping the leakage constant. Gayasen et al. applied the
programmable Vdd idea to routing resources as well as logic, and reduced both dynamic
and leakage powers [37, 38]. Later, Lin et al. [39] evaluated several variants of the dual-
Vdd architecture, and also improved the voltage assignment algorithm by formulating it
as a linear programming problem [40]. All these approaches required two power supplies
and two power grids. To eliminate these overheads, Anderson and Najm [41] proposed
a circuit that utilized the threshold drop across an NMOS transistor to locally generate
an alternate power supply for every routing switch. Chen et al. [42] also presented a cut
enumeration algorithm targeting low power technology mapping for FPGA architectures
with dual supply voltages. In Chapter 3, we present our dual-Vdd architectures.
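As a rough illustration of why a lower supply voltage saves power, the standard first-order model of CMOS dynamic power is quadratic in the supply voltage. This is a textbook approximation, not a result taken from the works cited above. The symbols $\alpha$ (switching activity), $C$ (switched capacitance), and $f$ (clock frequency) are introduced here for illustration, and the voltages 1.1 V and 0.9 V are used as example values because they appear in the figure captions of Chapter 3:

```latex
P_{\mathrm{dyn}} = \alpha\, C\, V_{dd}^{2}\, f
\qquad\Longrightarrow\qquad
\frac{P_{\mathrm{dyn}}(V_{ddl})}{P_{\mathrm{dyn}}(V_{ddh})}
  = \left(\frac{V_{ddl}}{V_{ddh}}\right)^{2}
  = \left(\frac{0.9}{1.1}\right)^{2} \approx 0.67
```

Under this model, a block moved from Vddh to Vddl at the same activity and frequency would dissipate roughly one third less dynamic power. The savings achievable in practice also depend on level-conversion overheads and on how many blocks have enough timing slack to tolerate the lower supply.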
Several studies have recognized the overheads of the programmable interconnect
fabric in an FPGA. The interconnect resources take almost 70% of the die area and
consume the major part of FPGA power [21, 13]. Furthermore, for most designs, they
also constitute more than 50% of the critical path delay. Therefore, FPGA interconnect
merits special attention. In order to reduce the interconnect area, researchers have
proposed 3-D FPGAs, consisting of multiple stacked 2-D FPGAs. More than a decade
ago, Alexander et al. [43] presented a 3-D FPGA that used package-level integration to
stack multiple 2-D FPGAs interconnected using solder bumps. The minimum pitch of
these vertical interconnects was 100µm. Campenhout et al. [44] proposed opto-electronic
FPGAs, in which the inter-chip communication used optical links. The optical links
provide a large vertical channel density. The Rothko 3-D FPGA [45] was a 3-D extension
of the Triptych sea-of-gates architecture [46], consisting of routing and logic blocks. The
3-D integration was done at the wafer-level and inter-layer communication used metal
vias. A dynamically reconfigurable 3-D FPGA was presented in [47], which consisted
of three physical layers: routing and logic block layer, routing layer, and memory layer.
Recently, Lin et al. [48] analyzed the performance benefits of a monolithically stacked 3-
D FPGA. Their 3-D integration technology provided very fine vias, which allowed them
to stack the configuration memory on top of the rest of the FPGA (logic blocks and
interconnects).
Researchers have also looked at theoretical models for 3-D FPGAs. Rahman et
al. [49] presented an analytical model for predicting interconnect requirements in 3-D
FPGAs, and estimated over 50% reduction in channel width, interconnect delay, and
power dissipation, when compared to 2-D FPGAs. Kwon et al. [50] recently extended
this model to incorporate clustered logic blocks (similar to Virtex-2 [1]).
On the CAD front, Ababei et al. [51, 52] recently presented a partitioning-based
placement algorithm for 3-D FPGAs, which primarily focused on reducing the inter-layer
vias. However, their router was not timing-driven.
Although several researchers have proposed 3-D FPGAs, the detailed routing
architecture of a 3-D FPGA remains unexplored. Ababei et al. [51] assumed a subset
switch block. Although Wu et al. [53] designed universal 3-D switch blocks, they used
track count as the sole metric of quality. Furthermore, they assumed that the number
of inter-layer vias is the same as the horizontal channel width. In today’s technology,
especially if we stack more than two layers, the vias are much thicker than the horizontal
wires (1µm vs. 0.1µm), which makes this assumption impractical. In Chapter 4, we
propose 3-D switch block designs considering the special via properties [54].
Three-D technology is known to suffer from thermal issues — stacking multiple
layers increases the effective power density in the package. Package designers have been
considering thermal issues in 2-D ICs for a long time. Instead of considering variations in
the temperatures on the die, they designed the package to support the worst case speci-
fications of the design. As designing the package for the worst case junction temperature
started becoming too expensive, researchers started looking at design level solutions to
reduce the temperature. Dynamic thermal management (DTM) techniques use thermal
sensors to monitor the junction temperature and control the power consumption of the
design on the basis of the temperature [55]. Common techniques include clock gating,
and voltage and frequency scaling when the temperature increases beyond a threshold.
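The threshold-based control described above can be sketched as a simple loop. This is a hypothetical illustration, not the mechanism of [55]: the function name, the frequency levels, the sensor readings, and the 85 °C threshold are all assumptions chosen for the example.

```python
from typing import List, Sequence


def dtm_frequency_trace(
    temps_c: Sequence[float],
    threshold_c: float,
    freq_levels_mhz: Sequence[int],
) -> List[int]:
    """Return the frequency chosen after each temperature sample.

    freq_levels_mhz is ordered from fastest to slowest. Each sample above the
    threshold steps one level slower; each sample at or below it steps one
    level faster.
    """
    level = 0  # start at the fastest setting
    trace = []
    for temp in temps_c:
        if temp > threshold_c:
            level = min(level + 1, len(freq_levels_mhz) - 1)
        else:
            level = max(level - 1, 0)
        trace.append(freq_levels_mhz[level])
    return trace


# Hypothetical sensor readings (deg C), an 85 C threshold, and three frequency levels.
trace = dtm_frequency_trace([70, 90, 95, 92, 80, 75], 85.0, [500, 400, 300])
```

The sketch shows the performance cost inherent in this style of control: the design runs below its fastest setting whenever the sensor reports a temperature above the threshold.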
Thermal-aware floorplanning is another design-level solution. Here, the floorplan
tries to reduce the hotspots on the die by distributing the temperature uniformly [56, 57].
Researchers have mostly focused on microprocessors in these works. Thermal placement
is a similar technique applied at the placement stage. Chen and Sapatnekar [58] proposed
a partition-driven algorithm for standard cell thermal placement. Thermal floorplanning
and placement are particularly attractive because they impact the performance less than
DTM.
On the modeling front, several researchers have developed tools for estimating the
die temperature. Among them, HotSpot [59] is an architecture level thermal simulator,
which can perform transient as well as steady-state temperature estimation. HS3d [60] is
another architecture level tool that performs only steady state temperature estimation,
but is orders of magnitude faster than HotSpot. Since in this work we look at only steady
state temperatures, we use HS3d.
Recently, some researchers have proposed solutions for thermal issues in 3-D ICs
too. Cong et al. [61] suggested a thermal-driven floorplanning for 3-D. Goplen and Sap-
atnekar [62] also proposed a temperature-driven placement algorithm for 3-D standard
cell ASICs. Studies have also indicated that careful insertion of thermal vias can reduce
the peak temperature [63, 64].
Thermal issues in FPGAs are relatively unexplored. Some researchers have pro-
posed the use of distributed sensors for monitoring temperatures in FPGAs [65, 66].
They, however, considered only CLBs in the fabric, and consequently, observed very
little temperature variation across the die. In Chapter 4 we characterize the thermal
profile of a real platform FPGA [67], and then observe the effect of stacking on temper-
ature. We also suggest alternate organizations to reduce the temperature.
In the long term, even 3-D may not provide the desired performance. There-
fore, we need to explore alternate technologies. Studies have looked at using some
non-lithographic technologies to manufacture FPGAs. DeHon [68], Goldstein [69], and
Tour [70] have previously proposed programmable architectures using some form of nano-
structures that are made using self-assembly. Goldstein tried to make crossbar-based de-
vices by aligning nano-wires in two planes at right angles to each other. The crosspoints
contained molecules that provided programmable logic as well as interconnections. It
suffered from problems of signal-degradation, as there was no way to restore the signal
using only two terminal devices. DeHon overcame this problem by using SiNW based
FETs to restore the signals, and proposed a PLA structure. However, the logic function-
ality in that architecture was limited to OR (and inversion). Tour, instead, proposed
replacing the logic blocks with nanocells and connecting them using metal wires. This
approach suffered from the problem of training these nanocells, which were assumed to consist of a
randomly connected mass of molecules. Furthermore, since the bottleneck in current
FPGAs lies in the interconnect, Tour’s architecture does not help solve this problem.
All the above architectures propose drastic changes in the existing CMOS tech-
nology as well as the design methodologies. In Chapter 5, we propose an architecture
that blends with existing technology easily, and preserves all the design methodologies
and flexibility in logic functionality [71].
Chapter 3
Reducing Leakage Energy in FPGAs
With the development of FPGAs in new technologies (90nm and below), optimizing
leakage power1 is becoming imperative. As transistor feature sizes and threshold
voltages shrink, and the number of transistors used in FPGAs increases, the overall
leakage power is rapidly increasing. Consequently, the leakage problem is anticipated to
be a major obstacle for FPGA applications in both high performance and low-power
embedded designs. Due to this trend, we need to focus on leakage power optimizations
going beyond prior power optimization techniques for FPGAs that focus primarily on
reducing the dynamic energy [11, 12].
3.1 Using Sleep Transistors
The flexibility provided by the FPGA structures in placing different applications
results in a large portion of the components being unutilized [21]. In fact, the typical
logic utilization for the designs in our experiments is 62%. A similar trend holds for larger
benchmark suites of greater than 100 designs in different target devices [21]. These
unutilized resources in an FPGA serve as a good candidate for leakage optimizations.
Reducing leakage power has already been the focus of optimization in various
non-FPGA architectures. These optimizations have ranged from circuit to software
1 Unlike dynamic power, which is expended only when the hardware component in question is exercised, leakage power is spent even if the component is idle.
approaches [72, 73, 74]. Among these techniques, a popular one to reduce both the
subthreshold and gate leakage components is to switch off the power supply to the circuit
by introducing a high-threshold voltage sleep transistor between the circuit and its supply
rail. The sizing of this sleep transistor has an impact on both the performance and the
area overheads imposed. Specifically, the transistor should be large for better performance.
However, a larger transistor increases the area penalty and weakens the leakage reduction
(as wider transistors leak more). The optimal sizing of these sleep transistors has been
the focus of prior efforts, and the peak current required by the supply-gated circuit serves
as the reference for this sizing [75]. Since the peak currents for different portions of
a circuit do not normally occur simultaneously, prior work has used the approach of
controlling a clustered group of circuits together with a single sleep transistor [75]. This
optimization helps to reduce the area penalty as compared to using sleep transistors with
each individual sub-circuit.
It should be noted that sleep transistors can be used to control leakage in FPGAs
as well. An obvious approach would be to place unused CLBs into low-power states using
sleep transistors (see Figure 3.1). However, such a fine-grain (at individual CLB level)
power management of the FPGA fabric can introduce a significant area penalty, which
may not be tolerated in many designs. Instead, in this work, we propose a strategy,
whereby the FPGA fabric is divided into regions, each of which can be independently
controlled through a sleep transistor. A region is a rectangular array of CLBs, and is
the minimum unit of power management. This approach is similar to the clustering
technique mentioned in the paragraph above. Our experimental results indicate that
the area of the CLB arrays, including the sleep transistor area overhead, can be reduced by
5% when moving from using regions with 4 logic slices to regions with 256 logic slices.
By selecting a suitable region size, one can control the area overheads and at the same
time achieve large leakage savings. Based on this region concept, we also propose a
placement technique, referred to as Region-Constrained Placement (RCP), that tries to
use a minimum number of regions for a given application, thereby increasing the number
of unused regions that can be switched off. A key observation from our results is that
the leakage power savings obtained using RCP on an FPGA with coarse-grain regions is
larger than that obtained using normal placement employed on an FPGA with fine-grain
regions.
The maximum savings that can be obtained from the leakage management scheme
discussed above is limited by the volume of the unused regions. Consequently, we also
utilize a time-based control scheme that reduces leakage even in the utilized portions of
the FPGA by switching off/on the power supply, exploiting the idleness in portions of
the design. Specifically, the time-based scheme dynamically turns off power supply to
all regions containing only idle modules. We investigate combinations of the time-based
control scheme with two variants of RCP: (i) module-level RCP that places each module
of the design that exhibits a distinct idleness profile using RCP individually and turns
off power supply to all regions containing only idle modules, and (ii) design-level RCP
that places the entire design using RCP and turns off power supply to all regions that
contain only idle modules. Our experiments show that the time-based RCP scheme can
provide additional energy savings as compared to statically switching off only unused
portions.
The leakage distribution in a Xilinx FPGA in 90nm technology, with the exception
of the BRAMs and multipliers, was shown to be 38% in the configuration SRAMs,
34% in the interconnect matrix, 16% in LUTs and 12% in other logic [21]. Since many
of the techniques proposed for saving leakage energy in on-chip memory can be applied
to BRAMs (and because they are not used by most of our designs), our leakage
optimizations in this work do not target them. In order to reduce the leakage energy in the
configuration SRAM, we increased the threshold voltage of the configuration SRAM to
obtain a 98% reduction in leakage energy while increasing configuration time by 20%.
Since configuration time is not critical in most of our target designs, this tradeoff for
power savings is reasonable. The resulting leakage breakdown in our system is shown
in Figure 3.2. The focus of this work is on reducing the leakage energy in the LUTs,
arithmetic logic and flip flops that account for 45% of the total leakage energy. While
our work focuses on the slices, the technique can be extended to switching off the routing
resources as well. This is a part of our planned future work.
Fig. 3.2. Leakage energy break-down
Fig. 3.3. (a) Horizontal and (b) vertical styles of RCP on an XC2V40 FPGA for a region size of 2×4 slices. Required number of slices is 100 (13 regions).
In order to provide leakage control, the FPGA is divided into regions. A region
consists of one or more neighboring slices (potentially across different CLBs), and is the
minimum power management unit (granularity). Sleep transistors are embedded into the
FPGA fabric controlling the power supply to the individual regions. In this architecture,
the control bit for the power switch (See Figure 3.1) of the region determines whether
the region is supply-gated or not. The control bits of the different regions are set during
the configuration of the FPGA. The area overhead associated with the control bits (and
the associated wiring) is proportional to the number of regions, while their impact on
leakage energy is relatively small due to the use of high threshold transistors for the
configuration bits. Thus, the area overhead favors a smaller number of large regions.
An important issue in the design of this architecture is the sizing of the power
switches. The power switches should be large enough to support the peak current re-
quirements of the logic slices that they control to have negligible impact on performance.
Since the peak current for a larger region is less than the sum of the peak currents of
smaller regions constituting the larger region, it is possible to have a smaller area over-
head when moving to larger regions with similar performance. In order to show this
impact, we experimented with two different region sizes of 256 slices and 4 slices using
XPower [1]. A single region of 256 slices had a peak current that was 68% of the sum
of peak currents of 64 regions each of 4 slices constituting the same area. Next, we per-
formed SPICE simulations to estimate the sleep transistor size for various region sizes.
Assuming a slice area of 5000 sq. micron (from custom layout), it was estimated that
the area penalty for a region size of 4 slices was around 15%, while that for 256 was 10%.
This motivates the need for using large region sizes.
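The sizing argument above rests on the observation that the peaks of individual sub-regions rarely coincide. The following is a minimal sketch of that observation only. The current traces are invented for illustration (they are not XPower measurements), and the function names are hypothetical.

```python
# Illustrative only: the traces below are invented, not XPower measurements.

def peak_of_sum(traces):
    """Peak current seen by one shared sleep transistor (traces summed per time step)."""
    return max(sum(step) for step in zip(*traces))

def sum_of_peaks(traces):
    """Total peak current if every sub-region has its own sleep transistor."""
    return sum(max(trace) for trace in traces)

# Hypothetical current draw (arbitrary units) of four small regions over five time steps.
traces = [
    [4, 1, 1, 2, 1],
    [1, 5, 1, 1, 2],
    [2, 1, 3, 1, 1],
    [1, 1, 2, 4, 1],
]

shared = peak_of_sum(traces)     # sizing reference for one large region
separate = sum_of_peaks(traces)  # sizing reference for four small regions
ratio = shared / separate        # below 1 whenever the peaks do not coincide
```

In this toy data the shared switch needs to support only half of the summed individual peaks; the 68% figure reported above plays the same role for the measured designs.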
The amount of leakage reduction due to the introduction of the power switch is
also influenced by the sizing and threshold voltage of the sleep transistor and whether
a PMOS or NMOS transistor is used to gate the VDD or ground power supply rail.
The leakage reduction varies from 85-98% based on these factors, incurring performance
degradation varying from 0-30% [76]. In our experiments, we use a PMOS gate switch
providing 90% leakage reduction.
3.1.1 RCP: Region-Constrained Placement
The placement of the design has a significant impact on the ability to supply-gate
the logic slices in our region-based architecture. Because the PAR tool in the
normal design flow has no notion of regions, it tends to scatter the utilized slices across
different regions (see Figure 3.4(a)). Since the regions with partially used slices cannot
be supply-gated, the potential for leakage energy savings reduces. Thus, we propose
a new region constrained placement strategy, RCP, that takes into account the region
concept explicitly.
The basic principle of RCP is to constrain the placement of the design to specific
regions of the FPGA (See Figure 3.4(b)) and leave some regions of the FPGA completely
unused, so that they can be supply-gated. This in turn helps to maximize the potential
leakage savings. In our implementation of RCP, we place the design into contiguous
regions to the extent possible and utilize two different styles: horizontal and vertical
placements as shown in Figure 3.3. While the horizontal and vertical placements utilize
the same number of logic slices, they do not provide similar performance results due
to asymmetry in the target Virtex-II architecture. For example, there are fast carry
chains running vertically in the FPGA, but not horizontally. Furthermore, there are
more slices in a column than in a row in all Virtex-II parts (except XC2V40, which has
16 slices in both directions). While we confine the utilization of logic slices to specified
regions, the constraints on routing resources are kept “soft” in order to circumvent issues
with routing congestion, routing of IO signals, and unroutability. This permits the
use of routing resources outside the regions that have logic placed in them. As part of
our future work, we plan to investigate a supply-gating mechanism that also switches off
interconnect muxes.
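As a rough illustration of how the regions for RCP might be chosen, the sketch below computes the minimum number of regions for a design and picks contiguous regions in either a row-major or a column-major order. The fill orders are our assumption of what the horizontal and vertical styles amount to; in the actual flow the restriction is expressed through placement constraints, and the function names here are hypothetical.

```python
import math

def regions_needed(num_slices, region_w, region_h):
    """Minimum number of regions that can hold num_slices slices."""
    return math.ceil(num_slices / (region_w * region_h))

def rcp_regions(num_slices, region_w, region_h, fabric_w, fabric_h, style):
    """Pick contiguous regions row by row ('horizontal') or column by column ('vertical').

    Returns a list of (col, row) region indices. All dimensions are in slices, and
    the fabric is assumed to tile exactly into regions (an assumption of this sketch).
    """
    cols, rows = fabric_w // region_w, fabric_h // region_h
    need = regions_needed(num_slices, region_w, region_h)
    if need > cols * rows:
        raise ValueError("design does not fit in the fabric")
    if style == "horizontal":
        order = [(c, r) for r in range(rows) for c in range(cols)]
    elif style == "vertical":
        order = [(c, r) for c in range(cols) for r in range(rows)]
    else:
        raise ValueError("style must be 'horizontal' or 'vertical'")
    return order[:need]

# A 100-slice design on a 16x16-slice fabric with regions 2 slices wide and 4 high.
horizontal = rcp_regions(100, 2, 4, 16, 16, "horizontal")
vertical = rcp_regions(100, 2, 4, 16, 16, "vertical")
```

For this example, 13 of the 32 regions are selected in either style, leaving 19 regions that can be supply-gated.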
3.1.2 Combining RCP and Time-Based Control
(a) Traditional (b) RCP (c) Module-level RCP
Fig. 3.4. Different placements for an example design. In part (c), each module is bounded by a polygon.
It should be observed that RCP is essentially a static technique where the unuti-
lized FPGA space (regions) can be shut off at configuration time (before the execution).
While it is easy to implement, it may not be as effective in designs that occupy a large
portion of the FPGA space (which in turn limits the potential leakage savings). However,
for the designs with modules that remain inactive over significant durations of time,
we can employ a time-based control scheme that reduces leakage even in the utilized
portions of the FPGA by switching off/on the power supply, exploiting the idleness in
portions of the design. Specifically, the time-based scheme turns off power supply to
all regions containing only idle modules. We investigate combinations of the time-based
control scheme with two variants of RCP: (i) module-level RCP that places each module
of the design that exhibits a distinct idleness profile using RCP individually, and turns
off power supply to all regions containing only idle modules, and (ii) design-level RCP
that places the entire design using RCP and turns off power supply to all regions that
contain only idle modules.
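The rule that a region can be switched off only when it contains no active module can be sketched as follows. The region-to-module maps below are hypothetical examples rather than placements from our experiments, and the function name is an assumption.

```python
def gateable_regions(region_modules, active_modules):
    """Return the regions that may be switched off: those containing no active module.

    region_modules maps a region id to the set of module names with logic in it;
    an empty set means the region is statically unused.
    """
    active = set(active_modules)
    return {region for region, mods in region_modules.items() if not (mods & active)}

# Hypothetical module-level RCP result: no region is shared between modules.
module_level = {"r0": {"A"}, "r1": {"A"}, "r2": {"B"}, "r3": {"C"}, "r4": set()}
# Hypothetical design-level RCP result: modules overlap inside regions.
design_level = {"r0": {"A", "B"}, "r1": {"B", "C"}, "r2": {"A", "C"}, "r3": set(), "r4": set()}

off_module_level = gateable_regions(module_level, {"A"})
off_design_level = gateable_regions(design_level, {"A"})
```

With module A active, regions that hold only B or only C can be gated. Once idle and active modules share a region, that region has to stay powered.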
We can implement the idea of time-based control as follows. The gate voltage of a
sleep transistor is still controlled by a configuration bit. However, instead of configuring
this bit statically when the design is loaded on the FPGA, dynamic reconfiguration
[77] of these control bits is used to switch a sleep transistor on or off. In order to
limit the overhead of reconfiguring these control bits, the sleep transistor should not
change state very frequently. Furthermore, support for just reconfiguring these control
bits may be useful as opposed to the minimum reconfigurable block in current Virtex-
II technology, which is a frame [77]. Reconfiguration time for one frame varies from
2 µs for the smallest to 23 µs for the largest FPGA. Support for
reconfiguring only the sleep transistor configuration bits can reduce this time, but may
increase area overheads due to the added configuration circuitry.
With increasing FPGA sizes, it is possible to envision an entire system on FPGA.
In such designs, many parts of the design may remain inactive for long durations. Time-
based control seems to be a very promising approach for such designs. Figure 3.4(c) shows
an example design placement using module-level RCP for time-based leakage control. We
see from this figure that modules of the design get placed on non-overlapping regions,
thus maximizing the number of regions that can be dynamically switched off. Note that
this slightly decreases the statically unused portion on the FPGA (because in order to
ensure the inter-module region exclusivity needed for module-level RCP, some regions
can only be partially filled). Still, our experiments show a significant increase in leakage
savings due to module-level RCP.
3.1.3 Experimentation
In order to investigate the energy savings due to the proposed approach, we
selected a set of applications and used the Xilinx Virtex-II FPGA as our target hardware.
The selected applications include 14 publicly available reference designs provided by
Xilinx, 4 designs from ITC’99 benchmark suite, 3 academic designs and 14 commercial
designs available internally at Xilinx. Table 3.1 provides the important characteristics of
each application and lists the number of slices, IO blocks (IOBs), block RAMs (BRAMs)
and multipliers (MULTs) used in the designs along with the target FPGA device used
for the mapping. Note that on average only 62% of the slices were used. Industry6
is an extreme case: although only 4% of the slices are used, the I/O
requirements of the design prevent it from being mapped to a smaller FPGA.
These designs were then implemented using the experimental flow illustrated in
Figure 3.5 to evaluate the energy savings possible due to the proposed optimizations.
The specific steps in this design flow are elaborated below.
All the designs were synthesized for area-optimization from their HDL represen-
tation using the Xilinx Synthesis Technology (XST). This synthesis step produced a
gate-level netlist. Next, the designs were mapped on to the smallest possible Virtex-II
FPGA device, setting the place and route effort level high (level 5). After the mapping
and completion of place and route (PAR), an NCD file that contains the placed and
routed design is generated. The map process also generates a MAP report which is used
to implement RCP. The maximum clock frequency for the design was estimated by using
the post-PAR static timing analysis tool, TRACE on the mapped design. The NCD file
was translated to an ASCII file in XDL format using the xdl tool. This ASCII file was
processed using a customized tool developed for this project to determine the unused
regions of the FPGA given the region sizes. Using this information, the leakage savings
possible in the standard placement process was obtained (assuming that the regions that
are completely unused are switched off).
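The region-counting step can be illustrated with a small sketch. It is not the customized tool used in this project: XDL parsing is omitted, slice locations are assumed to be already available as (x, y) coordinates, and the function name is hypothetical.

```python
def unused_region_fraction(used_slices, fabric_w, fabric_h, region_w, region_h):
    """Fraction of regions that contain no used slice.

    used_slices is an iterable of (x, y) slice coordinates. Fabric and region
    dimensions are in slices, and the fabric is assumed to tile exactly into regions.
    """
    total = (fabric_w // region_w) * (fabric_h // region_h)
    occupied = {(x // region_w, y // region_h) for x, y in used_slices}
    return (total - len(occupied)) / total

# Four used slices on an 8x8 fabric with 4x4 regions (toy data).
scattered = [(0, 0), (7, 0), (0, 7), (7, 7)]  # one slice in every region
packed = [(0, 0), (1, 0), (0, 1), (1, 1)]     # all slices in a single region

frac_scattered = unused_region_fraction(scattered, 8, 8, 4, 4)
frac_packed = unused_region_fraction(packed, 8, 8, 4, 4)
```

The same four slices leave no region free when scattered and three of four regions free when packed, which is the effect that motivates RCP.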
In order to determine the leakage savings using RCP, the synthesized gate-level netlist
(NGC file) was re-used. The MAP report from the normal mapping was used to de-
termine the number of logic slices used in the design. Based on the number of slices
obtained and the size of the regions, a User Constraints File (UCF) was created to re-
strict the placement to a specific number of regions. Different UCFs were created for
horizontal and vertical styles of RCP, and for different regions. The mapping and place
and route obtained using the specified constraints produces an NCD file. Similar to the
normal placement scheme, the maximum clock frequency for the design is estimated by
using TRACE. Leakage energy savings is evaluated in this case by assuming that power
supply to all unused regions is turned off. As explained earlier, an estimated 45% of the
total leakage happens in the logic slices (Figure 3.2). Furthermore, as explained in the
beginning of this chapter, leakage reduces to 10% of the original using supply-gating with
PMOS transistors. Thus, if for some design, 25% of the slices can be switched off, then
the leakage power is reduced by (25×0.9×0.45) = 10.125%. Furthermore, suppose after
RCP, the clock frequency degrades to 97% of original. Then, the new leakage energy
(Power-Delay-Product, PDP) will be (89.875/0.97) = 92.65%.
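The arithmetic of this example can be written out as a short sketch. The default factors 0.9 and 0.45 are the gating efficiency and slice leakage share quoted above; the function names are ours.

```python
def leakage_power_saving(off_fraction, gating_efficiency=0.9, slice_share=0.45):
    """Percent of total leakage power saved by supply-gating a fraction of the slices."""
    return 100.0 * off_fraction * gating_efficiency * slice_share

def relative_leakage_energy(power_saving_pct, freq_ratio):
    """Leakage energy after optimization, as a percent of the original (power x delay)."""
    return (100.0 - power_saving_pct) / freq_ratio

saving = leakage_power_saving(0.25)              # 25% of the slices switched off
energy = relative_leakage_energy(saving, 0.97)   # clock at 97% of the original
```

Here `saving` evaluates to 10.125 and `energy` to roughly 92.65, matching the numbers in the text.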
Our experiments were performed for different region choices. Region-widths of
2, 4, 8, 16 slices, and heights of 2, 4, 8, 16 slices were considered. Thus, a total of 16
different region choices were explored. Furthermore, as explained earlier, two styles of
RCP: horizontal and vertical were explored. Thus, a total of 32 different varieties of
RCP were explored.
3.1.3.1 Time-based leakage control
The experiments for time-based leakage control were performed using an academic
design implementing an Adaptive Viterbi Algorithm (AVA) decoder [78]. The design
consists of 3 AVA decoders of varying constraint lengths (4, 6, and 9). Different decoders
are selected depending on the noise levels in the transmission channel. If the noise level
is high, then the decoder with a larger constraint length is selected. In [78], the authors
utilize reconfiguration to switch between decoders of different constraint lengths. We
modified the design by statically mapping 3 different sizes of decoders on the FPGA,
and selecting the right decoder depending on noise in the channel. For this work, we
assumed that an input coming into the FPGA decides which decoder to choose. The
design was mapped onto an XC2V1500-bg575. The resource usage was 5469 slices (71%),
90 IOBs (22%), 0 BRAM and 0 multiplier. The three different decoders occupied 718,
1846, and 2854 slices respectively. Another module, which remained active all the time
(branch metric generator) occupied 51 slices. The advantage of this design is that the
decoding can be done much more rapidly if the channel is not noisy. The drawback is
that at any given time, two decoders are sitting idle. This gives a scope for switching-off
the unused decoders. We estimated and compared the leakage savings for this design for
design-level RCP, module-level RCP and normal placement, assuming run-time leakage
control. We also compared savings obtained from run-time leakage control with static
control. In order to estimate the leakage savings from run-time control, we assumed that
each of the 3 decoders is active for equal durations. Thus, any of the three decoders can
be switched off for two-thirds of the total time.
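To make the equal-duration assumption concrete, the sketch below computes the time-averaged share of used slices that belong to idle decoders. It is an idealized bound only: it ignores region granularity and modules sharing a region, so it does not reproduce the plotted savings. The slice counts come from the text; the function name is hypothetical.

```python
DECODER_SLICES = [718, 1846, 2854]  # the three AVA decoders (from the text)
ALWAYS_ON_SLICES = 51               # branch metric generator, active all the time

def avg_idle_fraction(module_slices, always_on, active_share):
    """Time-averaged fraction of used slices that sit in idle modules.

    active_share[i] is the fraction of time during which module i is the active one.
    Idealized: every idle module is assumed to be fully gateable.
    """
    used = sum(module_slices) + always_on
    idle = sum(s * (1.0 - share) for s, share in zip(module_slices, active_share))
    return idle / used

total_used = sum(DECODER_SLICES) + ALWAYS_ON_SLICES
idle_fraction = avg_idle_fraction(DECODER_SLICES, ALWAYS_ON_SLICES, [1 / 3, 1 / 3, 1 / 3])
```

Under these assumptions, about two-thirds of the 5469 used slices are idle on average, which bounds what time-based control could gate in the utilized part of the device.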
3.1.4 Results and Analysis
Figure 3.6 plots the average estimated leakage power savings by switching off the
unused regions in FPGA. The savings are represented as percent of total leakage (that
occurs without any switching off). A region represented as 2 4 means that the region is
2 slices wide and 4 slices high. Plots for RCP as well as without RCP have been shown.
For both RCP and normal placement, leakage savings decrease as the region
size increases. However, the decrease for RCP is very small compared to that for normal placement. As is
Fig. 3.6. Average leakage power savings for RCP and normal placement.
Fig. 3.7. Leakage power savings for RCP for 4×16 region for all designs.
Fig. 3.8. Average clock frequency for RCP.
Fig. 3.9. Average leakage energy savings for RCP and normal placement.
evident from the plots, RCP clearly outperforms normal placement. Especially for large
region sizes, RCP provides more than 6 times the savings of normal placement. This
happens because, although the number of slices used is the same in both cases, with
normal placement they are scattered across regions. Larger regions accentuate
this problem.
We observed that the leakage power savings are strongly dependent on the re-
source usage of a design. Figure 3.7 plots the variation, across all designs, of leakage
power savings for a single region choice. It shows that the leakage power savings vary
significantly depending on the design. For some designs, there is no leakage saving be-
cause those designs occupy all the regions of the FPGA. Leakage power is reduced by
more than 20% for 40% of the designs.
However, the constraint on the placement due to RCP can influence the timing
of the signals. Figure 3.8 plots the average estimated clock frequencies achieved using
RCP, expressed as a percentage of the frequency estimated for normal placement. A region
represented as 2 4 h refers to the horizontal style of RCP with a region of height 4 slices
and width 2 slices. Similarly, a region represented as 2 4 v refers to the vertical style with
the same region size and shape. The plot shows that for all regions, the average clock
frequency is within 8% of original clock frequency.
The performance penalty can result in longer execution time and consequently
increase the duration of leakage. To capture this impact, Figure 3.9 plots average esti-
mated leakage energy savings for RCP as well as for normal placement. We note that
except for very fine-grain regions, RCP always results in higher leakage energy savings.
The difference between the two increases for large region sizes. Again note that small
region size incurs larger area overhead due to larger effective sleep transistor size, more
routing and control signals, and more configuration bits (which increases configuration
time too).
Fig. 3.10. Leakage power savings for time-based leakage control.
Fig. 3.11. Leakage energy savings for time-based leakage control.
3.1.4.1 Time-based Leakage Control
Figure 3.10 plots leakage power savings for dynamic and static leakage controls
for the AVA decoder design. The savings from dynamic control are shown for a module
level RCP (modules get placed in non-overlapping regions), design level RCP, and for
normal placement. The savings from static leakage control are shown for design-level
RCP, and for normal placement.
It is observed that time-based leakage control results in very large savings com-
pared to static control. Furthermore, among the different placement strategies for time-
based control, module-level RCP outperforms the others. Design-level RCP performs
better than normal placement in most cases, but in some cases normal placement results
in larger savings. This happens because in case of normal placement, the 3 different
modules are placed slightly separated (because the placer has a larger area available to
place the modules). Therefore, only a few regions are common among the different mod-
ules. In case of design-level RCP, the placer has a smaller area in which to fit the entire
design. This increases the overlap among the 3 modules, thus disabling the dynamic
switching-off of those regions.
Figure 3.11 plots leakage energy savings for time-based control (module-level RCP,
design-level RCP, normal placement with no RCP) and static control (RCP, normal
placement with no RCP) for the AVA decoder design. It is observed that time-based
control results in very large savings compared to static control. Also, in all but two
cases, module-level RCP results in the largest energy savings.
It must be observed that the plots shown above do not account for additional
overhead for dynamic reconfiguration of the control bits. However, even assuming that
reconfiguration incurs a 10% increase in overall execution time and consequent leakage
energy penalty, we find that module-level RCP with time-based leakage control provides
19% (27% without reconfiguration overhead) more leakage savings than a normal
placement with static leakage control.
(a) (b) (c)
Fig. 3.12. Supply transistors used for programmable Vdd
3.1.5 Summary
Our work demonstrates that switching off parts of FPGA can result in significant
leakage savings in most designs. The savings can be further increased by using Region
Constrained Placement (RCP). Furthermore, if RCP is used then the switch-off granu-
larity need not be very fine, since leakage savings decrease very gradually with increasing
region size. Thus, considering the area overhead of having very small regions, a large
region size coupled with RCP looks to be a practical choice. Module-level RCP is a
promising enhancement for designs in which some modules stay inactive for significant
durations of time.
3.2 Dual-Vdd FPGA
Reducing the supply voltage (Vdd) is an effective technique for reducing both
dynamic and static power. Dynamic power varies quadratically with supply voltage,
while both sub-threshold leakage (due to Drain Induced Barrier Lowering, DIBL) and
gate leakage vary exponentially. However, reducing the supply voltage negatively affects
the circuit performance. Dual-Vdd is a popular technique to reap the benefits of voltage
scaling without its performance penalty. The timing-critical blocks in the design operate
on the normal Vdd (or Vddh), while non-critical blocks operate on a second supply rail
with a lower voltage (or Vddl). While dual-Vdd ICs have been successfully used in low-
power ASICs and custom ICs [34], no commercial FPGA today uses multiple Vdd’s for
power reduction.2
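The quadratic dependence of dynamic power on supply voltage gives a quick feel for the benefit of moving a non-critical block to Vddl. The sketch below assumes equal switching activity, capacitance and frequency at both supplies; the voltage values are hypothetical and chosen only for illustration.

```python
def dynamic_power_ratio(vdd_low, vdd_high):
    """Dynamic power at vdd_low relative to vdd_high (quadratic dependence on Vdd).

    Assumes identical switching activity, load capacitance and clock frequency.
    """
    return (vdd_low / vdd_high) ** 2

# Hypothetical supply values, not taken from this work.
ratio = dynamic_power_ratio(0.8, 1.0)
```

With these example values, a block on the lower rail would dissipate about 64% of the dynamic power it dissipates on the higher rail.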
The difficulty of designing a dual-Vdd FPGA is that the optimal Vdd assign-
ment changes from one design to another. Consequently, if logic blocks are statically
determined to be operating at low or high Vdd, the placement and routing algorithms
need to be modified accordingly (e.g., [35]). However, static assignment of Vdd to the
blocks may limit the ability to reduce power or to meet timing constraints for some
designs. In contrast, the use of Vdd-programmability for each block helps to tune the
number of high and low Vdd blocks as desired by the application. In this approach,
the challenge is in determining the Vdd assignments to each block. The need for level
converters wherever a low-Vdd block drives a high-Vdd block and the associated delay
and energy overheads are important considerations when performing these Vdd assign-
ments. Furthermore, positioning of the level converters influences the ability to assign
lower Vdd’s to the routing blocks.
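The level-converter requirement can be checked for any candidate Vdd assignment with a sketch like the one below. The block names and the assignment are hypothetical; the only rule encoded is the one stated above, namely that a converter is needed wherever a low-Vdd block drives a high-Vdd block.

```python
def level_converter_edges(connections, vdd):
    """Return the driver-to-sink connections that need a level converter.

    connections: iterable of (driver, sink) block names.
    vdd: dict mapping each block name to "high" or "low".
    """
    return [(d, s) for d, s in connections if vdd[d] == "low" and vdd[s] == "high"]

# Hypothetical four-block chain a -> b -> c -> d with an example assignment.
connections = [("a", "b"), ("b", "c"), ("c", "d")]
vdd = {"a": "high", "b": "low", "c": "high", "d": "low"}

needed = level_converter_edges(connections, vdd)
```

A Vdd assignment algorithm could use such a count to weigh the delay and energy overhead of converters against the power saved by lowering a block's supply.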
In our programmable dual-Vdd architecture, the Vdd of a circuit block is selected
between Vddh and Vddl by using two high-Vt transistors (supply transistors) connecting
the block to the supplies (see Figure 3.12). This circuit was previously used by [36]. The
2 Xilinx Virtex-II FPGAs use different supply voltages for I/O and the core. Pass transistors used for interconnects are also supplied higher gate voltages to eliminate the Vt drop. However, this is not targeted to reduce power.
state (ON/OFF) of each supply transistor is controlled by a configuration bit, which is
set by the Vdd assignment algorithm. The configuration bits are set either to connect
the block to one of the power supplies or completely disconnect the block from both
the power supply lines when the block is unused or idle. We evaluate the effectiveness
of different Vdd assignment algorithms and implementation choices for an island-style
FPGA architecture designed in 65nm technology. Our results demonstrate that one of
the Vdd assignment techniques provides an average power saving of 61% across different
MCNC benchmarks.
3.2.1 Architecture
We propose two types of dual-Vdd architectures. The first, Fully Programmable
(FP), architecture allows all logic blocks (CLBs) and routing resources to be indepen-
dently programmed as Vddh or Vddl. The second, Logic Programmable (LP), gives that
flexibility only for CLBs, and fixes the voltages of the routing resources. Both the archi-
tectures are built on cluster-based island-style FPGAs, with the configuration stored in
SRAM cells. The basic logic element (BLE) consists of a 4-input LUT and a flip-flop.
Multiple BLEs cluster together to form a CLB (see Figure 3.13).
In both architectures, level conversion takes place only at CLB pins. For this
purpose, CLB pins have level converters (LCs) attached to them. A multiplexer allows
the level converter to be bypassed if level conversion is not needed at that pin. Placing the
level converter only at CLB pins reduces the complexity of the routing fabric, and also
limits the area and leakage overhead of level converters.
(a) Dual-Vdd CLB (b) Dual-Vdd routing mux
Fig. 3.13. Fully programmable dual-Vdd architecture (FP)
3.2.1.1 Fully Programmable (FP)
The FP architecture facilitates configurable supply voltage for logic blocks and
routing multiplexers. Figure 3.13(a) shows how the CLB is configured using high-Vt
supply transistors to operate at two different voltages.
We experimented with two variants of FP, differing in the placement of the level
converters. While the first version places LCs at the output pins of CLBs, the second
places them at CLB input pins. Figure 3.13(a) shows the first case, where only the
output pins of a CLB have LCs attached to them. In this case, a net with multiple
fanouts operates at high Vdd if any one of the CLBs driven by this net is at high Vdd
(since the signal’s voltage level does not change in the routing fabric). This limits the
number of routing muxes that can operate at low Vdd, and therefore is less effective in
reducing routing power compared to the case when LCs are attached to CLB input pins.
However, the drawback of keeping LCs at input pins of CLBs (apart from area penalty)
is that a larger number of LCs are needed, which increases the leakage in logic blocks.
Our results support this reasoning, but show that overall leakage is lower for the second
case.
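The fanout rule for the output-side placement of level converters can be sketched as follows. The rule that a net runs at high Vdd if any driven CLB is at high Vdd is from the text; treating every other net as eligible for low Vdd is a simplifying assumption of this sketch, and the function names are hypothetical.

```python
def net_vdd_output_lc(sink_vdds):
    """Vdd of a net when level converters sit only at CLB output pins.

    The net must run at high Vdd if any CLB it drives is at high Vdd, because
    the signal level cannot change inside the routing fabric.
    """
    return "high" if any(v == "high" for v in sink_vdds) else "low"

def low_vdd_net_count(nets):
    """Count the nets (each given as a list of sink Vdd labels) routable at low Vdd."""
    return sum(1 for sinks in nets if net_vdd_output_lc(sinks) == "low")

# Hypothetical nets, described by the Vdd of the CLBs they drive.
nets = [["low", "low"], ["low", "high"], ["high"]]
low_nets = low_vdd_net_count(nets)
```

Only the first net can stay at low Vdd here; a single high-Vdd sink on the second net forces all of its routing muxes to high Vdd, which is the limitation that input-side level converters remove.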
Figure 3.13(b) shows a routing multiplexer (mux) in the FP architecture. The
multiplexer’s output is connected to a level-restoring buffer to restore the Vt-drop
through the NMOS-based multiplexer. Note that the same set of supply transistors
controls the voltage of configuration SRAM cells and the level-restoring buffer. Since
the configuration SRAM is not timing critical, the supply transistors need to be sized
just enough to supply the maximum current needed by the level-restoring buffer.
If a circuit block (CLB or routing mux) is completely unused, then in order to
save leakage, it is desirable to completely switch off that block. This is achieved by
keeping a separate configuration bit for every supply transistor (see footnote 3). Although this incurs
more area overhead, it results in significant leakage savings, since resource utilization in
an FPGA is typically low [21, 23].
Due to the area overhead of level converters and supply transistors (and associated
configuration SRAM cells), the dual-Vdd FPGA takes approximately 50% more area
than a single-Vdd FPGA.
The majority of leakage in an FPGA occurs in the configuration SRAM cells.
[23] have previously shown that by increasing the threshold voltage of the configuration
SRAM, its leakage can be reduced by 98%, while increasing configuration time by 20%.
Since configuration time is not critical in most of our target designs, this tradeoff for
power savings is reasonable. For applications where configuration time is crucial, we have
proposed the use of Asymmetric SRAM cells [31]. In order to see the effect of dual-Vdd
on power consumption, we have neglected the configuration SRAM leakage both for the
single supply design, and for the dual supply design (since the reduction of configuration
SRAM leakage is achieved by increasing its threshold voltage, and is equally applicable
to both single and dual supply designs).
Footnote 3: In the case of a routing mux, we need to pull down the control signals when the mux is unused. The pull-down transistors can be sized very small.
3.2.1.2 Logic Programmable (LP)
The LP architecture facilitates configurable supply only to logic blocks (see Fig-
ure 3.14). The routing resources run at supplies fixed at the time of device fabrication.
The routing switches contain sleep transistors to cut off their power supply when not
used.
The FP dual-Vdd FPGA of the previous section results in a large area penalty
of about 50%. A key observation is that most of the area is consumed by the routing
resources. By fixing the supply voltages of routing resources, an LP FPGA eliminates
the supply transistors and associated configuration SRAM cells in the routing fabric.
Instead, we need only one sleep transistor per routing switch. This sleep transistor is
controlled by the SRAM cell that controls the state of the routing switch. This more
than halves the area cost of supply transistors in the routing fabric. Compared with
a single Vdd FPGA, the area penalty for an LP FPGA is close to 20%. This circuit
is similar to one of the circuits in [39], with the difference that in our case the supply
voltage could be either Vddh or Vddl while they fixed the supply to Vddh for routing.
Every logic block still has its own supply transistors, and can be independently
programmed to function at Vddh or Vddl. In order to further reduce the area penalty
Fig. 3.14. Logic programmable dual-Vdd architecture (LP)
due to these supply transistors, we share the supply transistors among multiple logic
blocks. Since not all CLBs normally draw their maximum current at the same time,
the supply transistor can be sized smaller than the sum of independent supply transistors.
Hence, the area overhead of supply transistors is reduced.
Level conversion still occurs only at CLB pins. However, unlike FP, we do not
have the flexibility to set the Vdd of nets to match that of logic blocks connected to
them. Therefore, we need to allow for level conversion at both input and output pins of
CLBs.
The LP architectures are especially suited for low-cost applications with low power
requirements.
3.2.1.3 Level Conversion
Level converters have been studied widely ever since multi-Vdd circuits were pro-
posed [33, 79]. The area, delay and power overheads of level converters prohibit random
Fig. 3.15. Level converter circuit
Vdd assignment to logic elements of a circuit. For the present work, we have used the
level converter circuit shown in Figure 3.15, and a 65nm Berkeley Predictive SPICE
model [80] to simulate it. For an FPGA architecture where level converters are placed
at CLB input pins, four level converters are required per BLE. For a Vddh of 1.1V and
Vddl of 0.9V, the LC delay is almost 17% of the delay of an LUT, and as much as 41% of
the clock-to-Q delay of the flip-flop. This significant delay in the LC prohibits the use of
many LCs within a logical path of the circuit. In contrast to delay, power consumption
in an LC was observed to be negligible (< 1%) compared to a BLE. This allows us to
place LCs at all pins of a CLB and still save power.
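The delay cost of chaining LCs can be estimated with a back-of-the-envelope sketch. It uses only the 17% LC-to-LUT delay ratio reported above and assumes, purely for illustration, that a path consists of LUT delays alone (routing and flip-flop delays are ignored, so the real relative overhead would be smaller). The function name and parameters are hypothetical.

```python
LC_TO_LUT_DELAY = 0.17   # LC delay relative to one LUT delay (Vddh = 1.1V, Vddl = 0.9V)

def relative_lc_overhead(num_luts, num_lcs, lc_ratio=LC_TO_LUT_DELAY):
    """Fractional increase in the logic delay of a path of `num_luts` LUTs
    when `num_lcs` level converters are inserted along it."""
    if num_luts <= 0:
        raise ValueError("path must contain at least one LUT")
    return num_lcs * lc_ratio / num_luts

# Hypothetical path with five LUT levels and one, three, or five LCs on it.
for lcs in (1, 3, 5):
    print(lcs, f"{relative_lc_overhead(5, lcs):.1%}")
```

Under these assumptions, one LC per LUT level adds the full 17% to the logic delay, which illustrates why many LCs on a single logical path are undesirable.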
3.2.2 Methodology
We used VPR and its power model [10, 14] for this work. MCNC benchmarks
were used to evaluate the dual-Vdd architecture and Vdd assignment algorithms. The
architecture of the FP FPGA closely resembles a modern FPGA. The LUT size of 4 and the
cluster size of 8 LUTs are the same as in a Xilinx Virtex-II device. The routing channel
Fig. 3.16. Experimental Flow
consists of 200 tracks, with buffered segments of lengths 1, 2, 6 and “long”. The switch
block used a Wilton topology [81].
For LP, however, we simplified the fabric to resemble the one used by [82]. The
CLB consists of 4 BLEs. The routing fabric consists of only length-four segments, which
has been shown to be the best for area and speed by [82]. We further changed the
switch block topology to Subset. These simplifications made it easier to implement the
LP architecture in VPR. A Subset switch block connects only segments of the same
type. In an LP FPGA, we wanted no connections from a Vddl routing resource to a
Vddh resource because the routing switches did not have any level converters. Using a
Subset switch block made it easier to guarantee this (by creating a type for segments
at a particular Vdd). This, however, also does not allow connections from Vddh to
Vddl routing resources, and therefore, the power savings we report here for LP could
be improved. For the purpose of comparison of FP with LP, this restriction is justified
because we do not allow such connections for FP either. Furthermore, we chose all
segments to be of length 4 because we did not want nets to solely use longer or shorter
wires. Because of the Subset topology, only wires of the same type would connect, and
therefore, a length 6 wire will not connect to a length 2 wire (which does not resemble a
modern commercial FPGA architecture, such as Virtex-II). Despite these simplifications,
we believe our results to be indicative of other segmented routing architectures as well.
Circuit simulations were performed in SPICE using 65nm BSIM4 device models.
Delays of BLE and LC were obtained from these simulations. Power consumption, both
static and dynamic, of the LC was also obtained through SPICE simulations. Figure 3.16
shows the experimental flow. The flow deviates from a normal VPR flow after the place
and route stage. We first assign voltage to all CLBs using algorithms that are discussed
below, and then estimate power of the design placed and routed on the target dual-Vdd
architecture. Assigning voltages after routing makes the timing analysis more accurate,
since all the routing delays get incorporated in the timing graph.
3.2.2.1 Vdd Assignment
In order to be effective, a dual-Vdd scheme requires that paths in the circuit vary
in their delays. If all paths have the same delay, then all circuit elements will require high
Vdd to maintain the performance of the design.
Figure 3.17 shows the distribution of path delays averaged over MCNC bench-
marks. We observe that path delays in a circuit vary considerably. Therefore, a dual-Vdd
scheme can be expected to reduce the power consumption significantly. Figure 3.17 also
shows the path delays after using our dual-Vdd assignment algorithms.
Algorithm 1 Algorithm for Vdd assignment: Low-to-High (assuming LCs at CLB input pins)
  Assign Vddl to all CLBs and routing muxes
  P ← list of all paths in the design
  T ← longest path delay when all blocks operate at Vddh
  Td ← xT, x ≥ 1 is a user-defined performance metric
  critical_path ← Pi ∈ P | delay(Pi) > Td
  for all CLBs do
    criticality(CLB) ← # paths passing through CLB
  end for
  while critical_path not empty do
    Pk ← path ∈ critical_path with maximum delay
    N ← all blocks through which Pk flows
    Sort N based on criticality (first entry has most paths)
    while delay(Pk) > Td do
      Ni ← first(N)
      N ← N - Ni
      Assign Vddh to Ni and all routing muxes driven by Ni
      update delay of all paths passing through Ni
    end while
    critical_path ← critical_path - Pk
  end while
Algorithm 2 Algorithm for Vdd assignment: High-to-Low (assuming LCs at CLB input pins)
  Assign Vddh to all CLBs and routing muxes
  P ← list of all paths in the design
  T ← longest path delay when all blocks operate at Vddh
  Td ← xT, x ≥ 1 is a user-defined performance metric
  vddl_delay(Pi) ← delay(Pi) when all blocks in Pi are at Vddl
  critical_path ← Pi ∈ P | vddl_delay(Pi) > Td
  for all CLBs do
    criticality(CLB) ← # paths passing through CLB
  end for
  while critical_path not empty do
    Pk ← path ∈ critical_path with maximum delay
    N ← all blocks through which Pk flows
    Sort N based on criticality (last entry has most paths)
    while (delay(Pk) < Td) & (N not empty) do
      Ni ← first(N)
      N ← N - Ni
      Assign Vddl to Ni and all routing muxes driven by Ni
      calculate delays of all paths flowing through Ni
      if any of the delays > Td then
        reset Ni and all routing muxes driven by Ni to Vddh
      else
        update delays of all paths flowing through Ni
      end if
    end while
    critical_path ← critical_path - Pk
  end while
Fig. 3.17. Distribution of path delays
We use the heuristic shown in Algorithm 1 for Vdd assignment. Initially we assign
low Vdd to all CLBs in the FPGA, and find those paths whose delays become greater
than the desired clock time period. We call such paths “critical”. Those CLBs which
do not belong to any of the critical paths can be kept at low voltage without affecting
performance of the design. Some of the remaining CLBs and routing muxes need to
operate at high-Vdd so that the design’s performance target is met. The order in which
these CLBs are analyzed is crucial for the performance of the heuristic. We define
“criticality” of a CLB as the number of critical paths that pass through this CLB. The
CLBs within a path are analyzed in decreasing order of their criticalities. We started
with CLBs on the most critical path, and proceeded to smaller paths in decreasing order
of their delay. Algorithm 1 handles the case when LCs are at CLB inputs. In that case
all routing muxes driven by a CLB have the same voltage as the CLB. For the other
situation, when LCs are at CLB outputs, the voltage of routing muxes driving a CLB is
the same as that of the CLB.
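The Low-to-High heuristic of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions that are not part of the original flow: paths are given explicitly as lists of block names, each block has one delay at Vddh and one at Vddl, routing muxes are folded into the CLB that drives them, and criticality counts critical paths only. All function and variable names are hypothetical.

```python
def path_delay(path, vdd, d_high, d_low):
    """Sum of block delays along a path under the current Vdd assignment."""
    return sum(d_high[b] if vdd[b] == "H" else d_low[b] for b in path)

def low_to_high(paths, d_high, d_low, x=1.0):
    """Start with every block at Vddl and raise blocks on critical paths."""
    blocks = {b for p in paths for b in p}
    all_high = dict.fromkeys(blocks, "H")
    td = x * max(path_delay(p, all_high, d_high, d_low) for p in paths)

    vdd = dict.fromkeys(blocks, "L")
    critical = [p for p in paths if path_delay(p, vdd, d_high, d_low) > td]
    criticality = {b: sum(b in p for p in critical) for b in blocks}

    while critical:
        # Handle the currently slowest critical path first.
        pk = max(critical, key=lambda p: path_delay(p, vdd, d_high, d_low))
        # Most critical blocks first; blocks already raised earlier are skipped.
        for b in sorted((b for b in pk if vdd[b] == "L"),
                        key=lambda b: -criticality[b]):
            if path_delay(pk, vdd, d_high, d_low) <= td:
                break
            vdd[b] = "H"
        critical.remove(pk)
    return vdd

# Hypothetical design: three paths, unit delay at Vddh, 1.5 units at Vddl.
paths = [["a", "b", "c"], ["a", "d"], ["e"]]
d_high = dict.fromkeys("abcde", 1.0)
d_low = dict.fromkeys("abcde", 1.5)
print(low_to_high(paths, d_high, d_low, x=1.0))
print(low_to_high(paths, d_high, d_low, x=1.2))
```

With no delay tolerance (x = 1.0) all three blocks on the longest path must be raised to Vddh, while a 20% tolerance lets one of them stay at Vddl.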
In order to enumerate all paths whose delays become larger than the required
clock time period, we used the algorithm proposed by [83]. It maintains all paths in
a heap data structure with their delays as the keys. Each path also maintains all the
branch-points in the path in increasing order of their branch-slacks (see footnote 4).
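The idea of keeping candidate paths in a heap keyed by delay can be illustrated with a simplified sketch. Note that this is not the algorithm of [83], which tracks branch points and branch slacks; it is a plain best-first enumeration over a DAG, assuming delays are attached to nodes only. The graph encoding and all names are hypothetical.

```python
import heapq
from functools import lru_cache

def paths_longer_than(graph, node_delay, sources, threshold):
    """Return (delay, path) pairs, slowest first, for all source-to-sink
    paths whose delay exceeds `threshold`. `graph` maps node -> successors."""
    @lru_cache(maxsize=None)
    def longest_from(n):
        succ = graph.get(n, ())
        return node_delay[n] + (max(longest_from(s) for s in succ) if succ else 0)

    # Heap entries: (-upper bound on delay, partial path, delay before last node).
    heap = [(-longest_from(s), (s,), 0) for s in sources]
    heapq.heapify(heap)
    found = []
    while heap:
        neg_bound, prefix, before = heapq.heappop(heap)
        if -neg_bound <= threshold:
            break                       # every remaining path is at most this long
        last = prefix[-1]
        succ = graph.get(last, ())
        if not succ:                    # reached a sink: the bound is the exact delay
            found.append((-neg_bound, list(prefix)))
            continue
        done = before + node_delay[last]
        for s in succ:
            heapq.heappush(heap, (-(done + longest_from(s)), prefix + (s,), done))
    return found

# Hypothetical four-node circuit with two paths of delay 7 and 4.
graph = {"in": ["a", "b"], "a": ["out"], "b": ["out"], "out": []}
delay = {"in": 1, "a": 5, "b": 2, "out": 1}
print(paths_longer_than(graph, delay, ["in"], threshold=5))
```

Because each heap key is the delay of the best completion of a partial path, complete paths are popped in non-increasing delay order and the search can stop as soon as the key falls to the threshold.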
We also experimented with a variant (High-to-Low) of the above algorithm, in
which all the CLBs are initially kept at high Vdd and then some of them are changed
to low Vdd (see Algorithm 2). Before changing a CLB to low-Vdd, we need to make
sure that this will not increase the delay of some other path in the circuit above the
desired clock period. The number of low Vdd blocks using both versions, for Vddh of
1.1V and Vddl of 0.8V (for 65nm technology) is shown in Table 3.2. For 10 out of 15
designs, the High-to-Low (h2l) version performs better than Low-to-High (l2h). This
happens because, in the case of h2l, when the CLBs on a particular path are analyzed to determine
whether they can run at low Vdd, the algorithm continues to look at all the other
CLBs on the path even after it failed to change the Vdd of some CLB. In contrast, in
the l2h case, the algorithm keeps changing CLBs on a path to high Vdd (in decreasing
order of criticality), till the delay of the path is less than the required clock period. This
sometimes causes the path’s delay to be reduced more than what was necessary.
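The High-to-Low variant of Algorithm 2 can be sketched in the same spirit. The sketch below is illustrative only and makes several assumptions of its own: paths are explicit lists of block names, routing muxes are folded into their driving CLB, blocks on no critical path are lowered up front (safe as long as the Vddl delay is never smaller than the Vddh delay), and every block on a path gets one trial even after an earlier trial on that path failed. All names are hypothetical.

```python
def path_delay(path, vdd, d_high, d_low):
    """Sum of block delays along a path under the current Vdd assignment."""
    return sum(d_high[b] if vdd[b] == "H" else d_low[b] for b in path)

def high_to_low(paths, d_high, d_low, x=1.0):
    """Start with every block at Vddh and lower blocks where timing allows."""
    blocks = {b for p in paths for b in p}
    vdd = dict.fromkeys(blocks, "H")
    td = x * max(path_delay(p, vdd, d_high, d_low) for p in paths)

    all_low = dict.fromkeys(blocks, "L")
    critical = [p for p in paths if path_delay(p, all_low, d_high, d_low) > td]
    criticality = {b: sum(b in p for p in critical) for b in blocks}

    # Assumption: blocks on no critical path can be lowered immediately.
    for b in blocks:
        if criticality[b] == 0:
            vdd[b] = "L"

    while critical:
        pk = max(critical, key=lambda p: path_delay(p, vdd, d_high, d_low))
        # Least critical blocks first; a failed trial does not stop the scan.
        for b in sorted((b for b in pk if vdd[b] == "H"),
                        key=lambda b: criticality[b]):
            vdd[b] = "L"
            if any(path_delay(p, vdd, d_high, d_low) > td for p in paths if b in p):
                vdd[b] = "H"            # revert: some path would exceed td
        critical.remove(pk)
    return vdd

# Same hypothetical design as used for the Low-to-High illustration.
paths = [["a", "b", "c"], ["a", "d"], ["e"]]
d_high = dict.fromkeys("abcde", 1.0)
d_low = dict.fromkeys("abcde", 1.5)
print(high_to_low(paths, d_high, d_low, x=1.0))
print(high_to_low(paths, d_high, d_low, x=1.2))
```

The inner loop shows the behavior described above: when lowering one block fails, the remaining blocks on the path are still tried, so the path's slack is used as fully as this greedy order permits.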
For the LP FPGA, the core of the Vdd assignment algorithm remains the same
as that for FP. The main differences lie in the way the routing segments are handled.
Since their Vdd’s are fixed, the assignment algorithm does not assign voltages to them.
Footnote 4: Branch slack is defined as the decrease in path delay if a particular branch point is used to generate a new path.
Additionally, since this architecture allows level conversion at both inputs and outputs
of the logic blocks, we modify the assignment algorithm accordingly.
3.2.2.2 Power Estimation
After all logic blocks have been assigned appropriate supply voltages, we esti-
mate the power consumption of the entire FPGA. We concentrate only on the power
consumption in the core of the FPGA, and do not try to optimize or estimate IO power
consumption. Furthermore, we did not estimate the power consumption in the global
routing grid used for clock distribution.
In order to estimate dynamic power, VPR’s power model calculates transition
densities at all internal nodes of the FPGA, assuming that all inputs to the FPGA
have the same static probability (default: 0.5). Capacitances are estimated from the
capacitance values of a MOSFET, and that of wires and switches, all of which need to
be provided in the architecture file taken by VPR as an input. We used the Berkeley
Predictive 65nm technology parameters for our experimentation.
We modified VPR’s dynamic power model to include dual supply voltages. The
dynamic power of a circuit element scales by a factor of $(V_{ddl}/V_{ddh})^2$ when its voltage is reduced from
Vddh to Vddl. SPICE simulations of an LC provided its energy values for different pairs
of Vddh and Vddl. We used these energy values and the transition density at the input
of an LC to calculate its dynamic power.
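The quadratic dependence of dynamic power on supply voltage is easy to evaluate for the voltages considered in this work. The sketch below is only an illustration; the function names are hypothetical, and the LC power formula assumes that the transition density is expressed as transitions per second.

```python
def dynamic_power_scale(vddl, vddh):
    """Factor by which dynamic power scales when the supply drops from vddh to vddl."""
    return (vddl / vddh) ** 2

def lc_dynamic_power(energy_per_transition, transitions_per_second):
    """Dynamic power of one LC: energy per input transition times transition rate."""
    return energy_per_transition * transitions_per_second

# Scale factors for the three Vddl values explored, with Vddh = 1.1V.
for vddl in (0.8, 0.9, 1.0):
    print(f"Vddl={vddl:.1f}V: dynamic power x {dynamic_power_scale(vddl, 1.1):.3f}")
```

For Vddh = 1.1V, a block moved to 0.9V therefore keeps roughly two thirds of its dynamic power, and a block moved to 0.8V keeps slightly more than half.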
VPR has a basic leakage model, which calculates sub-threshold leakage due
to weak inversion. However, in a 65nm technology, two more effects, namely, DIBL and
gate leakage become significant, and need to be included in the leakage estimation. We
also modified the leakage model to take into account multiple supply voltages, and sleep
modes. Specifically, the following modifications were made to VPR’s leakage estimation.
1. Gate leakage and sub-threshold leakage due to DIBL were included in the leakage
estimation. In order to estimate leakage of a single MOSFET, we used results from
SPICE simulations. BSIM4 device models for 65nm were used. Simulations were
performed for various supply voltages to get leakage numbers for different voltages.
These numbers were incorporated into the power model of VPR to estimate gate
leakage of the entire FPGA.
2. We estimated average leakage in a routing multiplexer by halving the worst case
leakage, as discussed in [27]. To verify the results, we simulated multiplexers of
various sizes and structures and found our leakage estimate to be very close to the
SPICE results.
3. In the dual-Vdd FPGA, unused logic blocks and routing muxes are kept in a sleep
state by switching off both the supply transistors. Circuit simulations in SPICE
showed that in sleep mode, leakage of a circuit block reduces to 10% of the original
(high Vdd) leakage.
4. To estimate level converter leakage, we obtained the leakage number for one level
converter from SPICE simulations, and multiplied this by the number of level
converters in the FPGA.
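The bookkeeping behind items 2 to 4 above can be summarized in a short sketch. It is not VPR code: the data layout, the function names, and all leakage numbers in the example are hypothetical placeholders in arbitrary units, and applying the 10% sleep factor to the averaged mux leakage is an assumption made here for simplicity.

```python
SLEEP_FACTOR = 0.10      # sleep-mode leakage relative to high-Vdd leakage (item 3)
MUX_AVG_FACTOR = 0.5     # average mux leakage taken as half the worst case (item 2)

def block_leakage(kind, state, leak_table):
    """Leakage of one block. leak_table[(kind, 'high' or 'low')] holds per-block
    leakage; for kind == 'mux' the table entries are worst-case values."""
    level = "high" if state == "sleep" else state
    leak = leak_table[(kind, level)]
    if kind == "mux":
        leak *= MUX_AVG_FACTOR
    if state == "sleep":
        leak *= SLEEP_FACTOR
    return leak

def fpga_leakage(blocks, leak_table, lc_leak, num_lcs):
    """Total leakage: all blocks plus one leakage value per level converter (item 4)."""
    core = sum(block_leakage(kind, state, leak_table) for kind, state in blocks)
    return core + lc_leak * num_lcs

# Placeholder numbers, not simulation results.
leak_table = {("clb", "high"): 10.0, ("clb", "low"): 6.0,
              ("mux", "high"): 4.0, ("mux", "low"): 2.4}
blocks = [("clb", "high"), ("clb", "low"), ("clb", "sleep"),
          ("mux", "high"), ("mux", "sleep")]
total = fpga_leakage(blocks, leak_table, lc_leak=0.5, num_lcs=4)
print(total)
```

Keeping the per-state leakage values in a lookup table mirrors the approach of feeding SPICE-derived numbers into the power model rather than computing leakage analytically.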
Fig. 3.18. Power consumption for different Vddl’s. Vddh=1.1V.
Fig. 3.19. Power consumption for different architectures and algorithms. Vddh=1.1V, Vddl=0.9V.
Fig. 3.20. Average power breakdown between logic and routing resources. Vddh=1.1V, Vddl=0.9V.
Fig. 3.21. Average power consumption for different critical path delay tolerances. Vddh=1.1V, Vddl=0.9V.
3.2.3 Results and Analysis
In this section, we first evaluate the FP architecture (Figures 3.18, 3.19, 3.20,
3.21) and then compare it with LP (Figures 3.22, 3.23).
3.2.3.1 FP Architecture
Power in the dual-Vdd architecture strongly depends on the values of Vddh and
Vddl. In order to understand this dependence, and to come up with a good voltage
choice, we fixed the high-Vdd at 1.1V and varied Vddl from 0.8V to 1.0V. Figure 3.18
shows the power consumption for different Vddl values (using High-to-Low Algorithm,
LC at CLB inputs). Note that for 11 (out of 15) designs, a Vddl value of 0.9V results in
maximum power savings. When Vddl is increased to 1.0V, although the number of CLBs
on low Vdd increases, the total power consumption increases. This happens because the
power consumption of the circuit elements at 1.0V is significantly higher than at 0.9V.
Interestingly, when we reduce Vddl to 0.8V, power consumption again increases because
the number of CLBs and routing muxes on low Vdd becomes too low. Therefore, for
all other results in this section, we use a Vddl of 0.9V. For this case, the average power
reduction is close to 61%.
Figure 3.19 shows the power consumption of the designs for the two algorithms
— High-to-Low (h2l) and Low-to-High (l2h), and level converter placements — at CLB
outputs (LCo) or inputs (LCi). (h2lLCi denotes High-to-Low algorithm with LC at
CLB Inputs.) Note that for most designs, the High-to-Low algorithm outperforms the
Low-to-High algorithm. This is expected because, as shown above (see Table 3.2), the
High-to-Low algorithm resulted in larger number of low-Vdd CLBs. Furthermore, the
placement of LCs at CLB inputs saves more power (average: 61%) than their placement
at outputs (average: 57%). This happens because LC leakage is not large enough to
overshadow the gains we get in the routing power by placing LCs at CLB inputs.
Figure 3.20 shows the static and dynamic power consumption in both logic and
routing resources for the different algorithms and LC placements. An important obser-
vation is that not all components of power are reduced by the same factor. The reduction
in dynamic power is much less than that in leakage. For example, using High-to-Low
algorithm and placing LC at CLB inputs saves 24% dynamic power and 76% leakage
power. This can be attributed to two factors. First, in an FPGA since there exist a
large number of unused circuit elements, it is possible to reduce the leakage in them by
switching them off. Second, leakage varies exponentially with supply voltage, but dy-
namic power varies only quadratically with supply voltage. Note that leakage in routing
resources reduces to less than 17% of the original, because in most designs it is possible
to put a large number of routing muxes in sleep state, as they are sparsely used. Another
trend to note is that the logic portion of leakage is larger when LCs are placed at CLB
inputs (LCi) than when they are placed at CLB outputs (LCo). This implies that the
larger overall power saving for the LCi case comes entirely from the routing resources.
Figure 3.21 shows what happens when we modify the Vdd assignment algorithm
to allow some degradation in the performance of the design. In the figure, a delay value
of 110% denotes 10% performance penalty. Note that these delay values may increase
after circuit implementation due to the use of supply transistors, and due to a possible
increase of wire lengths (since total CLB area and consequently inter-CLB distances
increase). Using h2lLCi, a 10% decrease in performance increases the average power
saving by around 4%, but beyond 20%, the power remains almost constant.
3.2.3.2 LP Architecture
For LP architectures, since we hard-wire the supply voltages of routing fabrics,
the critical path delay of the design may get affected. Therefore, we first look at the
impact of LP on the delays of all designs. Figure 3.22 shows both the average and worst-
case delays for the benchmark designs. Restricting the maximum increase in delay due
to LP to 20%, we decide to keep 50% of the routing resources on low Vdd. Note that
the average increase in delay for this architecture is only 3% over the FP architecture. The
slightly irregular variation in delay happens due to the heuristic nature of the router.
In this delay comparison, we do not include the increase in delay because of resistance
of the supply transistors, delay through the mux at CLB pins that selects between
Vddl or the level-converted signal, and because of an increase in the wire lengths as a
consequence of an increase in the FPGA area. The increase, however, is minimal, and is
highly dependent on the circuit implementation. For example, [84] demonstrate effective
supply-gating of circuits with a performance penalty of less than 10%. [36] observed
a penalty of 5% for dual-Vdd circuits when they used regular-Vt gate-boosted supply
transistors.
We realized that if the FPGA has too many routing resources, it is possible that
none of the low voltage resources get used, and the delay of the design remains the
same as that for single Vdd FPGA (if the router is timing-driven). To avoid such a
scenario, we first found the minimum channel width for every design using VPR, and
then used 130% of the minimum as the channel width. This is different from the above
FP experiments. However, while comparing FP with LP, we used the LP channel width
for both architectures. Also note that the CLB here consists of 4 BLEs instead of the 8
in the FP experiments.
Figure 3.23 shows the total FPGA energy (power-delay product) obtained using
this architecture for different spatial granularities. h2l-50-2x1 on the x-axis refers to the
architecture where 50% of the routing resources are at Vddl, and the supply transistors
are shared among CLBs in clusters of dimension 2 × 1. We compare energy instead of
power because the critical path delays of designs mapped on LP FPGAs are different
from those on FP FPGAs. The Vdd assignment algorithm remains h2l (High-to-Low)
for all of them. Compared with FP, LP increases the energy by about 4.1%. The routing
energy increases because the supply voltage of the routing resources is fixed. However, the energy
used by logic blocks decreases by about 1.5%, because, due to the presence of LCs at
both CLB inputs and outputs, we have more flexibility in assigning Vdd’s to them. We
further observe that the use of 4 × 4 clusters increases the total energy by about 12%
(compared with FP).
3.2.4 Summary
We presented two types of dual-Vdd FPGA. The fully programmable (FP) FPGA
reduced the total energy by about 60% on average at the expense of about 50% area
penalty. The logic programmable (LP) FPGA reduced the total energy by 57.3% with
about 20% area increase compared to single supply FPGA. LP, however, resulted in an
average increase of 3% in the critical path delay over FP.
We also explored different Vdd assignment algorithms and level converter placements
for the FP architecture. Experiments demonstrated that the High-to-Low algorithm, coupled
with placement of level converters at the input pins of CLBs, resulted in maximum
power savings. The dynamic power was reduced by 24%, while the reduction in static
power was close to 76%.
In the future, the implementation of the LP dual Vdd architecture can be modified
to allow connections from Vddh to Vddl resources in the routing fabric. Further, the
routing architecture can be improved to use different lengths of segments.
Table 3.1. Characteristics of benchmark designs
#   Design                #Slices       #IOBs      #BRAMs  #MULTs   FPGA device
1   xapp248               96(37%)       17(19%)    0       0        XC2V40-cs144
2   xapp270\des           4,723(92%)    189(58%)   0       0        XC2V1000-fg456
3   xapp270\triple-des    14,273(99%)   301(62%)   0       0        XC2V3000fg676
4   xapp288\ser decoder   50(19%)       20(22%)    0       0        XC2V40-cs144
5   xapp288\par decoder   107(41%)      28(31%)    0       0        XC2V40-cs144
6   xapp289               614(39%)      166(83%)   0       0        XC2V250-fg456
7   xapp298               70(27%)       16(18%)    0       0        XC2V40-cs144
8   xapp299               973(31%)      262(99%)   1(2%)   0        XC2V500-fg456
9   xapp610               1,369(89%)    20(21%)    0       8(33%)   XC2V250-cs144
10  xapp611               1,534(99%)    20(21%)    0       16(66%)  XC2V250-cs144
11  xapp615               1,155(75%)    45(48%)    0       2(8%)    XC2V250-cs144
12  xapp621               1,305(84%)    29(31%)    0       0        XC2V250-cs144
13  xapp625\video         254(99%)      63(71%)    0       0        XC2V40-cs144
14  xapp645               550(10%)      278(85%)   0       0        XC2V1000-fg456
15  itc99\b04             110(42%)      21(23%)    0       0        XC2V40-cs144
16  itc99\b05             259(50%)      39(42%)    0       0        XC2V80-cs144
17  itc99\b12             214(83%)      13(13%)    0       0        XC2V40-cs144
18  itc99\b14             2,432(79%)    88(51%)    0       2(6%)    XC2V500-fg256
19  ava\k4                724(47%)      85(92%)    0       0        XC2V250-cs144
20  ava\k7                2,034(66%)    85(49%)    0       0        XC2V500-fg256
21  ava\k9                2,895(94%)    85(49%)    0       0        XC2V500-fg256
22  industry1             1,954(38%)    279(64%)   0       0        XC2V1000-ff896
23  industry2             2,488(80%)    185(70%)   0       0        XC2V500-fg456
24  industry3             2,513(81%)    132(50%)   0       0        XC2V500-fg456
25  industry4             5,777(75%)    182(34%)   0       0        XC2V1500-ff896
26  industry5             5,153(67%)    65(12%)    0       0        XC2V1500-ff896
27  industry6             206(4%)       287(66%)   0       0        XC2V1000-ff896
28  industry7             3,505(68%)    251(58%)   0       0        XC2V1000-ff896
29  industry8             1,602(52%)    60(22%)    0       0        XC2V500-fg456
30  industry9             2,280(44%)    293(67%)   0       0        XC2V1000-ff896
31  industry10            3,663(71%)    224(51%)   0       0        XC2V1000-ff896
32  industry11            4,364(85%)    172(40%)   0       0        XC2V1000-ff896
33  industry12            97(37%)       80(90%)    0       0        XC2V40-fg256
34  industry13            411(80%)      84(70%)    0       0        XC2V80-fg256
35  industry14            1,288(83%)    186(93%)   2(8%)   0        XC2V250-fg456
Average                   61.89%        50.82%     0.29%   3.23%    -
Table 3.2. Comparison of High-to-Low and Low-to-High algorithms (LC at CLB inputs, Vddh = 1.1V, Vddl = 0.8V)
Design     # CLBs   # Vddl CLBs (Low-to-High)   # Vddl CLBs (High-to-Low)
alu4       191      51                          65
apex2      235      54                          74
apex4      158      46                          26
bigkey     214      24                          81
des        200      87                          127
dsip       172      12                          31
elliptic   451      339                         327
ex1010     575      177                         185
ex5p       133      30                          24
misex3     175      18                          16
pdc        572      400                         405
s38584.1   806      724                         739
seq        219      61                          58
spla       462      215                         226
tseng      131      102                         114
Fig. 3.22. Critical path delay for LP FPGA with different extents of Vddl resources. Vddh=1.1V, Vddl=0.9V.
Chapter 4
Three-Dimensional FPGAs
As transistors become faster and designs get larger, the delay incurred in the
interconnecting metal wires becomes significant. Consequently, reducing the wire-length
is crucial for future technologies. Three-dimensional (3-D) integration is a promising
technique for reducing wire lengths. By stacking multiple silicon wafers interconnected
with fine vias, the average wire length in the designs gets significantly reduced, which
improves their performance. Other gains, such as reduced design footprint and the ability
to integrate different technologies, further favor 3-D ICs.
Field Programmable Gate Arrays (FPGAs) are consistently improving in capacity
and performance, and are now among the most popular devices in the market. With
their regular structure, they also scale easily to future technologies. However, the large
overheads of their programmable interconnect are severely limiting their growth. The
programmable interconnect resources take almost 70% of the die area, and consume the
major part of FPGA power. Furthermore, for most designs, they also constitute more
than 50% of the critical path delay. Therefore, a reduction in the interconnect resources,
by going to 3-D, will greatly benefit FPGAs.
The advantages of 3-D FPGAs have evoked significant interest, and several stud-
ies have looked at them in the past. More than a decade ago, Alexander et al. [43]
presented a 3-D FPGA that used package-level integration to stack multiple 2-D FPGAs
interconnected using solder bumps. The minimum pitch of these vertical interconnects
was 100µm. Campenhout et al. [44] proposed opto-electronic FPGAs, in which the
inter-chip communication used optical links. The optical links provide a large vertical
channel density. The Rothko 3-D FPGA [45] was a 3-D extension of the Triptych sea-
of-gates architecture [46], consisting of routing and logic blocks. The 3-D integration
was done at the wafer-level and inter-layer communication used metal vias. A dynami-
cally reconfigurable 3-D FPGA was presented in [47], which consisted of three physical
layers: routing and logic block layer, routing layer, and memory layer. Recently, Lin et
al. [48] analyzed the performance benefits of a monolithically stacked 3-D FPGA. Their
3-D integration technology provided very fine vias, which allowed them to stack the
configuration memory on top of the rest of the FPGA (logic blocks and interconnects).
Researchers have also looked at theoretical models for 3-D FPGAs. Rahman et
al. [49] presented an analytical model for predicting interconnect requirements in 3-D
FPGAs, and estimated over 50% reduction in channel width, interconnect delay, and
power dissipation, when compared to 2-D FPGAs. Kwon et al. [50] recently extended
this model to incorporate clustered logic blocks (similar to Virtex-2 [1]).
On the CAD front, Ababei et al. [51, 52] recently presented a partitioning-based
placement algorithm for 3-D FPGAs, which primarily focused on reducing the inter-layer
vias. However, their router was not timing-driven.
Although several researchers have proposed 3-D FPGAs, the detailed routing
architecture of a 3-D FPGA remains unexplored. Ababei et al. [51] assumed a subset
switch block (see definition in Section 4.1.1). Although Wu et al. [53] designed universal
3-D switch blocks, they used track count as the sole metric of quality. Furthermore,
they assumed that the number of inter-layer vias is the same as the horizontal channel
width. In today’s technology, especially if we stack more than two layers, the vias are
much thicker than the horizontal wires (1µm vs. 0.1µm), which makes this assumption
impractical.
This chapter consists of two main sections. In Section 4.2, we explore six 3-D
switch box (SB) topologies for the case when the vias are fewer than the horizontal
wires. These switch boxes range from a simple extension of the 2-D subset SB - used
in prior studies [51] - to 3D universal SBs with additional flexibility for the inter-layer
vias. Section 4.1 gives a brief overview of 2-D switch boxes and 3-D technology. The
switch box topologies explored in this study are described in Section 4.2. Section 4.2.2
explains the experimentation methodology, and Section 4.2.3 analyzes the exploration
results. Using detailed area and delay models, we estimate their impact on FPGA area,
delay, and area-delay product. The results indicate that the area-delay product (ADP)
depends heavily on the SB topology: our best SB reduces ADP by 10% compared to the
subset SB.
Section 4.3 analyzes (Section 4.3.1) and reduces (Section 4.3.2) the thermal issues
in 2-D and 3-D FPGAs. A thermal-aware 3-D FPGA design reduces the peak temperature
by about 16°C.
Finally, Section 4.4 summarizes the contributions of this chapter.
(a) Subset (b) Universal
Fig. 4.1. 2-D switch boxes. X0, Y0, X1, Y1 mark their sides.
4.1 Background
4.1.1 2-D Switch Boxes
Our study will focus on island-style SRAM-based FPGAs. FPGAs from Xilinx
and Altera belong to this category. The logic block (CLB) consists of Look-Up-Tables
(LUTs) and Flip-Flops (FFs). Routing wires (tracks) and programmable switches con-
stitute the routing channel. Channel width refers to the number of tracks in a channel.
The CLBs connect to the channel through connection boxes. The routing wires connect
among themselves through switch boxes.
Switch box topology refers to the connectivity provided by the switch box. Re-
searchers have explored several topologies [85, 86, 81, 87, 88] (see Figure 4.1). The
subset (also called disjoint) topology, used in Xilinx XC4000 FPGAs, connects tracks of
the same number in all four directions. This divides the channel into disjoint sets of
tracks, and a net uses the same track number for its route. Universal topology provides
more flexibility than disjoint. It facilitates connectivity for all possible global routes of
two-terminal nets.
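The connectivity of the subset topology is simple enough to enumerate directly. The sketch below lists the programmable connections of a 2-D subset switch box for a given channel width, using the side labels X0, Y0, X1, Y1 of Figure 4.1. The function name and the tuple encoding of a track endpoint are hypothetical; the universal topology is not covered here.

```python
from itertools import combinations

SIDES = ("X0", "Y0", "X1", "Y1")

def subset_switch_box(channel_width):
    """List the programmable connections of a 2-D subset (disjoint) switch box.
    Each connection joins track t on one side to track t on another side."""
    return [((a, t), (b, t))
            for t in range(channel_width)
            for a, b in combinations(SIDES, 2)]

# Example with three tracks per channel.
switches = subset_switch_box(3)
print(len(switches))
print(switches[:6])
```

Every connection pairs equal track numbers, so a net that enters the routing fabric on track t can never leave it, which is exactly the partition into disjoint track sets described above.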
Research has shown that the universal switch box results in fewer tracks in the
channel [89]. Hyper-universal switch boxes provide even greater flexibility, and facilitate
the connectivity for all possible global routes of k-terminal nets [90]. However, they use
more switches than universal switch boxes.
(a) Face-to-Face (b) Face-to-Back
Fig. 4.2. Two kinds of stacking
4.1.2 3-D Technology Overview
3-D chip design is a promising methodology to alleviate many interconnect prob-
lems. Current state of the art chips are two-dimensional, which means that they have
Table 4.1. Via properties
         Thickness   Pitch   Height
Via 1    1um         3um     10um
Via 2    2um         5um     20um
Via 3    5um         10um    50um
only one active layer that contains all the devices. Note that although no transistor
(device) is stacked on top of another, the metal wires interconnecting
these devices typically span multiple layers, with the higher layers occupied by global
wires. 3-D ICs extend this concept to the devices by stacking multiple device layers in
the vertical dimension.
Several technologies, such as beam recrystallization, silicon epitaxial growth, processed
wafer bonding, and solid phase crystallization, enable the vertical integration
of multiple device layers [91]. Among these technologies, wafer bonding is particularly
promising. It involves the bonding of two fully processed wafers (on which the devices
and interconnects have already been fabricated). Since the individual wafers are fabri-
cated separately, it is possible to integrate completely different technologies, and have
a very large number of layers. The inter-layer vias in this technology can be as fine as
1µm × 1µm at a 3µm pitch [92]. The wafers can be bonded in two ways: face-to-face
or face-to-back. In the former, a wafer is inverted to bond with another wafer (see Fig-
ure 4.2 (a)). This reduces the area overhead of the inter-layer vias because they do not
need to pass through the Silicon substrate. However, this limits the number of layers
to only two. The second way, face-to-back, does not invert the wafer (see Figure 4.2
(b)). Consequently, it can integrate more than two layers of Silicon. However, since the
inter-layer vias now need to pass through the Si layer, they take up die space. In this
study, we evaluate these two wafer-bonding techniques for 3-D FPGA integration.
Since the wafer-bonding 3-D technology is still being perfected, several meth-
ods are being explored. These methods result in different via dimensions and wafer
thicknesses. For this study, we explore three different methods, which result in the via
dimensions shown in Table 4.1. Via 1 reflects the process from Tezzaron [92], which
uses a wafer thickness of 10um. Because they are so thin, these wafers lack mechanical
strength, and require the use of handle wafers during processing. At the other extreme is
via 3 that uses 50um wafers, which reflects the process in [93]. A larger wafer thickness
imparts mechanical strength to the wafers, and eliminates the need for handle wafers.
Via 2 reflects an intermediate process that we use to illustrate the trends due to via
dimensions. An integration technology from MIT uses SOI wafers to reduce the device
layer thickness to less than a micron [94]. We do not model this technology in this work.
4.2 3-D Detailed Routing Architecture
We extend the island-style architecture of 2-D FPGAs to 3-D (see Figure 4.3).
The CLB consists of LUTs and FFs. The switch box is modified to connect the inter-
layer vias (ILVs) to the horizontal wires (CHANX and CHANY), and also with other
ILVs. The ILVs form channels in the vertical direction (CHANZ). The architecture
is symmetric in the X and Y directions, i.e., CHANX and CHANY contain the same
number of tracks. CHANZ, however, differs from CHANX and CHANY in its width,
which is influenced by the via density provided by the 3-D technology. We use V to
(a) Subset (b) Subset-split
(c) Subset-twist (d) Subset-more
(e) Universal-twist (f) Universal-more
Fig. 4.4. 3-D switch boxes for H=4, V=2.
refer to the number of vias (i.e. vertical channel width) and H for the horizontal channel
width. Figure 4.3 shows the case when H = V = 3.
CHANZ differs from CHANX and CHANY in another respect too. The length
of these vias depends on the wafer thickness, which is typically much smaller than the
average 2-D wire length (e.g., wafer thickness = 10um for Tezzaron’s process [92], length
of a wire spanning 4 CLBs = 150um in a 65nm process). These differences between
vertical and horizontal channels must be accounted for to design a good 3-D FPGA.
Next, we describe the various 3-D architectures we explored. Where appropriate, we
also discuss how technology parameters influence our design.
4.2.1 Switch Box Topology
The flexibility, Fs, of a switch box (SB) refers to the number of wires to which
each incoming wire can connect. Previous studies have shown that for a 2-D FPGA, an
Fs of 3 provides good routability [85]. In such SBs, a track connects to one track on
each of the other three sides of the SB. Subset and universal topologies are examples of
such SBs (see Figure 4.1).
These 2-D SBs are extended to 3-D by adding two more faces, which contain
terminals for vertical wires – one for going up, and another for going down. Since the
vias will be fewer than the horizontal wires, the two vertical faces will contain fewer
terminals than the other four: V on each vertical face, versus H on each horizontal face.
Figure 4.4 shows the SBs we created for this study for H=4 and V =2. Normally,
the 3-D SB is visualized as a cube, where each face of the cube represents one of the
directions. However, for ease of illustration, we have flattened the SB and shown it as
a hexagon, where each side represents a direction: North (Y0), South (Y1), East (X1),
West (X0), Top (Z1), or Bottom (Z0). Furthermore, we show only the connections to
the vertical faces (Z0 and Z1). For all SBs, the horizontal wires (CHANX and CHANY)
use either the subset or universal connections among themselves. These connections
were described in Section 4.1.1 and illustrated in Figure 4.1. For clarity, we do not
show the horizontal connections in Figure 4.4. The first four SBs use subset connections
among the horizontal wires, and the last two use universal. Figure 4.4 also tabulates the
connections from the vertical faces, where Xi,j refers to the jth terminal on the Xi face
of the SB.
The first SB (subset, see Figure 4.4 (a)) is an extension of the 2-D subset SB. This
SB connects the same track number on all sides. Consequently, the entire routing fabric
gets divided into disjoint subsets, and a net uses the same track number for its entire
route. Note that only the first V of the H horizontal wires connect to the vias. While
these wires have a flexibility of 5 (3 connections to the other horizontal directions, and 2
to the vertical ones), the other wires connect to only horizontal tracks (flexibility = 3).
Apart from decreasing the routing flexibility, this results in a difference in the capacitive
loads of the horizontal wires: large for the first V wires, and small for the rest.
The second SB (subset-split, see Figure 4.4 (b)) modifies the subset SB by allowing
the first V horizontal tracks to connect to the vias going above, and the last V to those
going below. This implies that now there are twice as many horizontal wires that connect
to the vertical wires. Therefore, if nets do not fanout at the SB, then this SB provides
greater flexibility to the vertical directions. A limitation, however, is that the first V
can only go above, and the last V, only below. Consequently, if a net needs to fanout to
both Top and Bottom, then it needs to use two horizontal tracks (compared to one for
subset). This SB distributes the capacitive loads on the horizontal tracks more evenly
than the Subset SB.
The subset-split SB, although more flexible than subset, suffers from the “disjoint”
property of subset SBs: the entire routing fabric is divided into disjoint subsets, and a
net can use only one of those subsets. This disjoint subset consists of vertical track i,
and horizontal tracks i and H − i − 1 (where i ∈ {0, 1, ..., V − 1}). In order to improve
upon this, we modified the connections to the vertical faces as shown in Figure 4.4 (c).
Now, terminal Z0,0 connects to track 1 on the side X0, but track 0 on side X1. This
allows the net to switch tracks at the SBs. We call this SB subset-twist.
The main objective of the subset-twist SB is to improve the flexibility in the
vertical direction. Another way to achieve this is by adding more switches to the vertical
faces – the approach used by the next, subset-more SB (see Figure 4.4 (d)). Here, the
vertical terminal i connects to both i and H − i − 1 terminals on the horizontal faces
(where i ∈ {0, 1, ..., V − 1}). The extra switches have a two-fold effect. On the one hand,
they improve the flexibility in the vertical direction, and on the other, they increase the
area of the SB and the capacitive loads on the wires.
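The via-to-horizontal connections of the subset, subset-split, and subset-more SBs described above can be sketched in Python. This is only an illustration, not the 3-D VPR implementation: the function names and the dictionary layout are assumptions made here, and the subset-twist pattern is left out because its exact permutation is given only in Figure 4.4.

```python
def vertical_connections(topology, H, V):
    """Map each vertical terminal (face, i) to the horizontal track
    indices it connects to on the horizontal faces.

    "Z1" is the Top face (vias going up), "Z0" the Bottom face.
    Hypothetical helper; only via-to-horizontal links are modeled.
    """
    if not 0 < V <= H:
        raise ValueError("expected 0 < V <= H")
    conns = {}
    for i in range(V):
        mirror = H - i - 1
        if topology == "subset":
            # Same track number on every face.
            up, down = {i}, {i}
        elif topology == "subset-split":
            # First V tracks go up, last V tracks go down.
            up, down = {i}, {mirror}
        elif topology == "subset-more":
            # Extra switches: both i and H - i - 1 in each direction.
            up, down = {i, mirror}, {i, mirror}
        else:
            raise ValueError(f"unsupported topology: {topology}")
        conns[("Z1", i)] = up
        conns[("Z0", i)] = down
    return conns


def tracks_with_vertical_access(conns):
    """Horizontal track indices that reach at least one via."""
    return set().union(*conns.values())


if __name__ == "__main__":
    for name in ("subset", "subset-split", "subset-more"):
        c = vertical_connections(name, H=4, V=2)
        print(name, sorted(tracks_with_vertical_access(c)))
```

For H=4 and V=2, the sketch shows two horizontal tracks with via access for subset and four for subset-split, matching the statement that subset-split doubles the number of horizontal wires that connect to vertical wires.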
The next two switch boxes use universal connections among the horizontal wires.
The vertical connections in the universal-twist SB are identical to the subset-twist SB
(see Figure 4.4 (e)). However, due to universal connections among the horizontal wires,
it provides greater flexibility. The last SB, universal-more, further increases the flexibility
by adding more switches to the vertical faces. For example, in Figure 4.4 (f), track 0 on
side Z0 connects to both tracks 1 and 3 on the X0 side. These extra switches improve
the flexibility in the vertical direction, but also increase the area of the SB and the
capacitive loads on the wires.
4.2.2 Experimentation
We modified VPR [10], an FPGA place and route tool available from University
of Toronto, to model our 3-D FPGA architectures. We refer to this tool as 3-D VPR.
It uses simulated annealing to place the logic blocks and then routes the nets using a
modified path-finder algorithm. Both placement and routing are timing-driven, i.e., they
try to reduce the delays of critical paths.
The 2-D placement algorithm of VPR optimizes the following cost function:
\[
\mathrm{Cost}_{2D} = \alpha \cdot \mathrm{Cost}_{timing} + (1 - \alpha) \cdot \mathrm{Cost}_{\mathrm{cong-2D}}
\]
\[
\mathrm{Cost}_{timing} = \sum_{i=1}^{N_{nets}} \left[ \sum_{j=1}^{\mathrm{num\_sinks}(i)} \mathrm{delay}(i, j) \right]
\]
\[
\mathrm{Cost}_{\mathrm{cong-2D}} = \sum_{i=1}^{N_{nets}} q(i) \left[ \frac{bb_x(i)}{C_{av,x}(i)^{\beta}} + \frac{bb_y(i)}{C_{av,y}(i)^{\beta}} \right]
\]
where N_nets is the number of nets in the design, num_sinks(i) is the number of sink
pins of net i, and delay(i, j) is the estimated delay from the source of net i to sink number j.
For each net i, bb_x(i) and bb_y(i) denote the x and y spans of its bounding box, respectively.
The q(i) factor compensates for the fact that the bounding box wire length model
underestimates the wiring necessary to connect nets with more than three terminals. Its
value depends on the number of terminals of net i. C_av,x(i) and C_av,y(i) are the average
channel capacities in the x and y directions respectively, over the bounding box of net i.
The value of β adjusts the weight given to congestion in the cost function. The larger
the value of β, the more wiring in narrow channels is penalized relative to wiring in wider
channels. A value of 1 has been previously found to work best, and is used in this work.
To the 2-D cost function, we add a term, Cost_spanz, to reduce the vertical span
of the nets. This is similar to what was proposed in [51], except that, similar to the
congestion cost terms for x and y directions, we incorporate congestion in Cost_spanz.
\[
\mathrm{Cost}_{3D} = \alpha \cdot \mathrm{Cost}_{timing} + \beta \cdot \mathrm{Cost}_{\mathrm{cong-2D}} + \gamma \cdot \mathrm{Cost}_{spanz}
\]
\[
\mathrm{Cost}_{spanz} = \sum_{i=1}^{N_{nets}} q(i) \left[ \frac{bb_z(i)}{C_{av,z}(i)^{\beta}} \right]
\]
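By varying the values of α, β, and γ for two of the benchmark designs, we found α = 0.5,
β = 0.1, and γ = 0.4 to give the smallest critical path delay. Hence, we use these values
in all our experiments. Note that β in Cost3D is the weight of the 2-D congestion term and
is distinct from the exponent β inside the congestion terms, which is kept at 1.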
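As a rough illustration of how the three terms combine, the 3-D cost can be sketched in Python. This is a minimal sketch, not the 3-D VPR code: the net representation (a dictionary with `delays`, `q`, `bb`, and `cav` keys) and the parameter names are assumptions, and any normalization of the terms performed by the actual tool is not modeled.

```python
def placement_cost_3d(nets, w_timing=0.5, w_cong=0.1, w_spanz=0.4, exponent=1.0):
    """Evaluate Cost3D for a list of nets (illustrative data layout).

    Each net is a dict with:
      delays: estimated source-to-sink delays, one per sink
      q:      bounding-box correction factor q(i)
      bb:     (bb_x, bb_y, bb_z) bounding-box spans
      cav:    (C_av_x, C_av_y, C_av_z) average channel capacities
    w_timing, w_cong, w_spanz play the roles of alpha, beta, gamma;
    exponent is the congestion exponent (1 in the text).
    """
    timing = sum(sum(n["delays"]) for n in nets)
    cong_2d = sum(
        n["q"] * (n["bb"][0] / n["cav"][0] ** exponent
                  + n["bb"][1] / n["cav"][1] ** exponent)
        for n in nets
    )
    span_z = sum(n["q"] * n["bb"][2] / n["cav"][2] ** exponent for n in nets)
    return w_timing * timing + w_cong * cong_2d + w_spanz * span_z


if __name__ == "__main__":
    # Made-up numbers, chosen only to exercise the formula.
    net = {"delays": [1.0, 2.0], "q": 1.0, "bb": (4, 2, 1), "cav": (8, 4, 2)}
    print(placement_cost_3d([net]))
```

Naming the weights separately from the exponent avoids the overloaded use of β in the equations.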
4.2.2.1 Architecture and Technology Parameters
The logic blocks in our experiments consist of four 4-input LUTs and four FFs, with 10
inputs and 4 outputs. All the inputs are equivalent, and so are the outputs, that is, every
input pin can internally drive any LUT input. The pins are uniformly distributed around
the sides of the CLB. Each output pin connects to 25% of the tracks in the adjacent
channel, and every input pin connects to 60% of the adjacent tracks. All horizontal
segments (CHANX and CHANY) in the routing fabric span 1 CLB, and are driven by
tri-state buffers.
The vertical channel (CHANZ) has vias that span only a single layer. When
these vias are very short (10um), we use minimum size pass transistor switches to drive
them. However, for the case when they are 50um high, we use a 5X tri-state buffer switch
to drive them. In contrast, the buffers driving the CHANX and CHANY segments are
always 5X the minimum, and consist of two stages.
We calculated the resistance and capacitance values for the vias and horizontal
wires by using the Predictive Technology Model (PTM) [95]. Timing parameters for
switches were derived from Spice simulations using 65nm BPTM.
We explored a spectrum of 3-D technologies: the via properties shown in
Table 4.1, the number of layers varying from 2 to 5, and either face-to-face (f2f) or
face-to-back (f2b) bonding. The finest vias (1um thickness) are in line with
Tezzaron’s process [92], while the coarsest ones (5um thickness) reflect the
process in [93].
4.2.2.2 Experimentation Flow
Figure 4.5 shows the experimentation flow. A design in blif format is packed into
clusters (CLBs) of 4-LUTs using T-VPack. On the basis of the number of CLBs in
the design, 3-D VPR creates the smallest FPGA fabric that would contain the design.
It takes the number of layers as an input, and finds the minimum square size of one
layer, assuming that all layers contain the same number of CLBs. The packed netlist
is then placed and routed using 3-D VPR to find the minimum number of vias for a
Fig. 4.5. Experimentation flow
large horizontal channel width (= 80 for 5 layers). The router performs a binary search
over the number of vias to find the minimum value. Fixing the number of vias to 130%
of the minimum value, we re-route the design to find the minimum possible channel
width. Thus, this flow gives priority to reducing the number of vias instead of channel
width, which makes sense because the vias take more area than the horizontal wires.
However, most FPGAs provide more than the minimum number of channels to ensure
good performance for the worst case too. On similar lines, we add 30% to the minimum
via and channel-width numbers while evaluating the FPGA. Using these values (which
may be different for every design), we re-route the design to obtain the critical path
delay of the routed design. This flow is repeated for every switch-block type for all the
20 MCNC benchmark designs.
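The sizing part of this flow can be sketched as two binary searches followed by the 30% margin. The sketch below is illustrative only: `routes(width, vias)` is a hypothetical stand-in for a 3-D VPR routing attempt, routability is assumed to be monotone in both arguments, and rounding the 130% values up to the next integer is an assumption, since the text does not state the rounding rule.

```python
def min_passing(lo, hi, passes):
    """Smallest n in [lo, hi] for which passes(n) is True, or None.

    Assumes passes is monotone: once it succeeds, it succeeds for larger n.
    """
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if passes(mid):
            best, hi = mid, mid - 1
        else:
            lo = mid + 1
    return best


def with_margin(n, percent=130):
    """percent% of n, rounded up, using integer arithmetic."""
    return (percent * n + 99) // 100


def size_fabric(routes, wide_width=80, max_vias=64):
    """Find the via count and channel width used for evaluation.

    routes(width, vias) -> bool is a hypothetical routing oracle.
    """
    # Step 1: minimum vias at a generous horizontal channel width.
    min_vias = min_passing(0, max_vias, lambda v: routes(wide_width, v))
    if min_vias is None:
        raise RuntimeError("design does not route within max_vias")
    # Step 2: minimum channel width with vias fixed at 130% of the minimum.
    eval_vias = with_margin(min_vias)
    min_width = min_passing(1, wide_width, lambda w: routes(w, eval_vias))
    if min_width is None:
        raise RuntimeError("design does not route within wide_width")
    # Step 3: add 30% to the channel width as well before the final route.
    return {"min_vias": min_vias, "eval_vias": eval_vias,
            "min_width": min_width, "eval_width": with_margin(min_width)}


if __name__ == "__main__":
    toy = lambda width, vias: vias >= 10 and width >= 20  # toy oracle
    print(size_fabric(toy))
```

Searching for vias first mirrors the priority given in the text: vias cost more area than horizontal wires, so their count is minimized before the channel width.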
4.2.2.3 Area Model
VPR estimates area by counting the number of transistors in the fabric. This
works because the 2-D FPGA area is transistor-dominated. In case of 3-D, however,
we must add the via areas to the transistor areas. The two types of 3-D integration
technologies discussed in Section 4.1.2 need different area models. In case of face-to-face
(f2f) bonding, the inter-layer vias (ILVs) do not pass through the Silicon (see Figure 4.2).
Consequently, they do not take any die area. In contrast, the face-to-back (f2b) bonding
requires vias to pass through the Silicon (through-vias). In this case, every via con-
sumes some Si area. We incorporate the area overhead of these through-vias in our area
estimates.
While comparing the area of two architectures, we estimate the total FPGA area
and divide it by the number of CLBs in the fabric to estimate the area per CLB. Thus,
the area numbers in the next section include the area for one logic block (CLB), and the
routing resources (horizontal wires, switches, and vias) associated with it.
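A minimal sketch of this per-CLB area estimate follows. It is not the model used in the thesis: the function name, the use of square microns as the unit, and the assumption that each through-via occupies a square of side equal to the via pitch are all illustrative choices made here.

```python
def area_per_clb(transistor_area_um2, n_clbs, n_vias=0,
                 via_pitch_um=0.0, bonding="f2f"):
    """Estimate fabric area per CLB (illustrative model).

    f2f: inter-layer vias do not pass through silicon, so they add no area.
    f2b: each through-via is assumed to take via_pitch_um ** 2 of die area.
    """
    if n_clbs <= 0:
        raise ValueError("n_clbs must be positive")
    if bonding == "f2f":
        via_area = 0.0
    elif bonding == "f2b":
        via_area = n_vias * via_pitch_um ** 2
    else:
        raise ValueError(f"unknown bonding style: {bonding}")
    return (transistor_area_um2 + via_area) / n_clbs


if __name__ == "__main__":
    # Made-up numbers: 10 CLBs, 20 through-vias at 3um pitch.
    print(area_per_clb(1000.0, 10, n_vias=20, via_pitch_um=3.0, bonding="f2b"))
    print(area_per_clb(1000.0, 10, n_vias=20, via_pitch_um=3.0, bonding="f2f"))
```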
4.2.3 Results and Analysis
Here, we show the results for two extremes of 3-D integration: first, a simple stack
of two layers; and second, a more aggressive stack of 5 layers. Together they capture
the trends seen by varying the number of layers in a 3-D FPGA. While the two-layer
FPGA can be fabricated using f2f or f2b wafer bonding, the 5-layer FPGA must be
fabricated using f2b. For all these technology points, we evaluate the effects of different
via dimensions shown in Table 4.1. The metric we primarily look at to evaluate an
architecture is the area-delay product (ADP), because it is inversely proportional to the
throughput of the device [96]. In all the figures in this section, we plot the geometric
means over 20 MCNC benchmarks.
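The averaging step can be sketched as follows. The function names and the input layout (one area and delay pair per benchmark) are assumptions for illustration. One convenient property of the geometric mean is that the mean of the per-benchmark ADPs equals the product of the mean area and the mean delay, so either order of computation gives the same number.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("need at least one value, all positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mean_adp(results):
    """Geometric-mean area-delay product.

    results: iterable of (area_per_clb, critical_path_delay), one pair
    per benchmark (hypothetical layout).
    """
    return geometric_mean(area * delay for area, delay in results)


if __name__ == "__main__":
    runs = [(2.0, 3.0), (8.0, 3.0)]  # made-up values
    print(mean_adp(runs))
    print(geometric_mean(a for a, _ in runs) * geometric_mean(d for _, d in runs))
```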
The first step towards evaluating 3-D FPGAs is comparing them with 2-D FPGAs.
Figure 4.6 shows the average area (per CLB), delay, and ADP for 1, 2, and 5 layers
in 65nm technology. For both 2 and 5 layers, it shows the results for the three via
technologies of Table 4.1. The key ‘2-layers-f2f-3um’ in the figure refers to the use of
2 device layers, stacked using f2f bonding with vias at 3um pitch (via 1 in Table 4.1).
Figure 4.6 uses the same switch box (universal-twist) for all cases.
The area is estimated as explained in Section 4.2.2.3. Note that area reduces as
we increase the number of layers, or reduce the pitch of the vias. The smallest area is
obtained when five layers are used with 3um-pitch vias, in which case, the CLB’s area is
only 84% of the single-layer case. Furthermore, we observe that the area of the 2-layer
FPGA using f2f bonding remains constant with increasing via pitches. This happens
because the vias in this case are accommodated within the transistors’ footprint, and
the CLB area is determined by the transistors.
The critical path delay also reduces with increasing number of layers (second set
of bars in Figure 4.6). The 5-layer FPGA with 5um-pitch vias (best case) reduces the
delay by 24.7% compared with the single layer case, and by 14% compared with the
2-layer case. This happens because interconnect lengths (and hence delays) reduce as
we increase the number of layers. F2f and f2b technologies do not have any significant
impact on the delay.
The reduction of area and delay in 3-D combine to significantly reduce the area-
delay product of the FPGA (third set of bars in Figure 4.6). The 5-layer FPGA reduces
the area-delay product by 36% (for 3um pitch vias), while a 2-layer FPGA does so by
about 20%, when compared to a single-layer FPGA. These results justify the interest in
3-D FPGAs, and demonstrate that we can obtain significant improvements even by the
relatively simple integration of two FPGA layers.
Now, we explore the different switch boxes to find which one gives the best values
for area, delay, and area-delay product. Figure 4.7 shows the results for 5 layers, using
65nm process and 3um-pitch vias (via 1 in Table 4.1). The results for 2 layers follow a
similar trend. The first set of bars in Figure 4.7 compares the flexibility in the vertical
direction of the various SBs by looking at the minimum number of vias they take for
the designs to route. Observe that the universal-more type of SB provides the greatest
flexibility (minimum number of vias). In fact, it uses only 49% of the vias needed by the
subset SB. It also results in the minimum channel width among all the SBs. However,
the total area is determined by both the vias and the number of transistors in the fabric.
Since universal-more uses extra switches to increase flexibility, we observe that the total
area taken by the FPGA using universal-more SB is larger than that of the one with
universal-twist SB. This indicates that the universal-twist SB provides greater flexibility
per switch than the universal-more SB.
While using the universal-twist SB instead of the subset SB reduces the area to 88%,
the critical path delay does not show such a strong variation. This happens
because the timing-driven router of 3-D VPR gives less weight to congestion for timing-
critical nets, which implies that they almost always take the shortest possible route.
The smallest delay is obtained for the subset-split case. Note that adding more switches
to the SB increases the delay, which is explained by the larger parasitic capacitances
due to these switches. Because the variation in delay is not much, the trend for area-
delay product is similar to that for area. The universal-twist offers the lowest area-delay
product, 91% of that for the subset SB.
Next, we explore how the via properties affect the choice of SB for the 5-layer
FPGA. Figure 4.8 compares the area-delay product for different SBs for the three via
technologies of Table 4.1. The x-axis is labeled as <via pitch>-<via height>. Intuitively,
as the vias become larger, we will prefer the SB that provides the minimum number of
vias. Figure 4.8 demonstrates this trend. As vias become larger, the difference between
the area-delay products for universal-twist and universal-more (which produces the min-
imum number of vias) reduces. This happens because, as vias become larger, the area
taken by the vias starts dominating the total area. However, even for 10um-pitch vias
(the largest case), the universal-twist SB continues to provide the smallest area-delay
product.
We also look at the effect of technology scaling on the performance of our SBs in
a 5-layer FPGA (see Figure 4.9). The vias are assumed to remain at 3um pitch while
the CMOS technology scales from 65nm to 45nm and 32nm. Again, the universal-twist
remains the best SB for all process nodes. Since the via dimensions remain constant
among the different process nodes, the area penalty due to through-vias increases as
transistor dimensions shrink. Consequently, the universal-more SB (which gives the
minimum number of vias) improves as process scales. However, even for the 32nm node,
the universal-twist SB remains the best from an area-delay product perspective.
Fig. 4.8. Comparing the switch boxes for different via technologies for 5-layer FPGA
Fig. 4.9. Comparing the switch boxes for different process nodes for 5-layer FPGA
4.3 Thermal Issues in 3-D FPGAs
Junction temperature is a growing concern in integrated circuits. Improvements
in fabrication technology, circuit design, architecture, and tools, have all contributed to-
wards an increase in logic density as well as clock frequency. Increased logic density and
performance have in turn led to an increase in power densities, which manifests itself in
the form of high temperatures. FPGAs are following a similar trend. Recent articles on
thermal management from leading FPGA manufacturers ([97, 98]) clearly indicate the
growing importance of thermal issues in FPGA designs. Since three-dimensional integra-
tion increases the effective power density, 3-D ICs suffer from even higher temperatures.
Die temperature must be controlled because it impacts the timing, leakage power,
package design, and lifetime of the device. Circuits run slower when they are hot,
and their lifetime reduces exponentially with increasing temperature. Besides, plastic
packages can only withstand relatively low temperatures. Furthermore, leakage power
increases exponentially with temperature, which can cause a thermal runaway.
All these factors have forced chip manufacturers to employ techniques to con-
trol the die temperature. These techniques can be divided into two categories, namely
package level, and design level.
Package designers have been considering thermal issues for a long time. Heat
sinks, spreaders, and fans are the most common examples of package level techniques.
Instead of considering variations in the temperatures on the die, they design the package
to support the worst case specifications of the design. They typically provide the user
with the thermal resistance (θJA) of the package, which is used to estimate the junction
temperature (T_J) using
\[
T_J = T_A + \theta_{JA} \cdot \mathrm{Power} \tag{4.1}
\]
where T_A is the ambient temperature, and Power refers to the total power consumed
by the chip.
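Equation 4.1 is a one-line calculation; a small sketch makes the units explicit. The function name and the example numbers are hypothetical and are not taken from any datasheet.

```python
def junction_temperature(t_ambient_c, theta_ja_c_per_w, power_w):
    """Equation 4.1: T_J = T_A + theta_JA * Power.

    t_ambient_c:      ambient temperature in C
    theta_ja_c_per_w: junction-to-ambient thermal resistance in C/W
    power_w:          total chip power in W
    """
    return t_ambient_c + theta_ja_c_per_w * power_w


if __name__ == "__main__":
    # Illustrative values only: 45C ambient, 2 C/W package, 10 W design.
    print(junction_temperature(45.0, 2.0, 10.0))
```

Because this estimate uses only the total power, it yields a single worst-case number and says nothing about temperature variation across the die.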
As designing the package for the worst case junction temperature started becom-
ing too expensive, researchers started looking at design level solutions to reduce the
temperature. A common example is dynamic thermal management (DTM), where the
design is run at a reduced power (and performance) if the chip temperature increases
beyond a previously set threshold. Thermal sensors measure the temperature, and power
is reduced by lowering the clock frequency or the supply voltage, and clock-gating [55].
Design level techniques can also aid in removing the heat generated by the design.
For example, thermal-aware floorplanning tries to reduce the hotspots on the die by
distributing the temperature uniformly [56, 57]. Researchers have mostly focused on
microprocessors in these works. Thermal placement is a similar technique applied at the
placement stage. Chen and Sapatnekar [58] proposed a partition-driven algorithm for
standard cell thermal placement. Thermal floorplanning and placement are particularly
attractive because they impact the performance less than DTM.
On the modeling front, several researchers have developed tools for estimating the
die temperature. Among them, HotSpot [59] is an architecture-level thermal simulator,
which can perform transient as well as steady-state temperature estimation. HS3d [60] is
another architecture-level tool that performs only steady state temperature estimation,
but is orders of magnitude faster than HotSpot. Both HS3d and HotSpot provide the
flexibility to set several package and die parameters, such as the spreader thickness,
package-to-air thermal resistance (r_convec), and substrate thickness. Since in this work
we look at only steady state temperatures, we use HS3d.
Recently, some researchers have proposed solutions for thermal issues in 3-D ICs
too. Cong et al. [61] suggested a thermal-driven floorplanning for 3-D. Goplen and Sap-
atnekar [62] also proposed a temperature-driven placement algorithm for 3-D standard
cell ASICs. Studies have also indicated that careful insertion of thermal vias can reduce
the peak temperature [63, 64].
Thermal issues in FPGAs are relatively unexplored. Some researchers have pro-
posed the use of distributed sensors for monitoring temperatures in FPGAs [65, 66].
They, however, considered only CLBs in the fabric, and consequently, observed very
little temperature variations across the die. In contrast, we focus on platform FPGAs,
containing embedded circuit blocks including high-speed transceivers, multipliers, DLLs,
and memories (see Figure 4.10) [1, 2]. Here, we first characterize the temperature distri-
bution in a modern 2-D FPGA, and then observe how it changes when we stack multiple
such layers. We further propose changes in the placement of hard blocks in a 3-D FPGA
to reduce the die temperature.
4.3.1 Thermal-Characterization of FPGAs: 2-D to 3-D
Most modern FPGAs incorporate hard blocks in the fabric (e.g., Virtex-4, see
Figure 4.10). Table 4.2 shows the power densities for the blocks in a Virtex-4 FPGA.
Observe that the power densities vary from 0.78 for the DSP blocks to 11.46 for the
DCMs. This vast range results in large temperature variations within the FPGA die
Fig. 4.10. Virtex-4 FX100 device (not to scale)
Table 4.2. Power densities in 4VFX100 (Freq: 500MHz)
Block type            Power density (normalized to CLB)
DSP                   0.78
CLB                   1.00
PPC                   1.32
IOB                   2.33
BRAM  Dual Port       3.85
      Single Port     1.93
MGT   Transceiver     7.75
      Transmitter     4.22
      Receiver        4.11
PMCD                  11.4
DCM   High Freq       11.46
      Low Freq        9.84
Fig. 4.11. Thermal profile of 4VFX100 (axes: X location (cm), Y location (cm), Temperature (C))
Fig. 4.12. Effect of stacking on peak temperature
Table 4.3. Effect of stacking on temperature
#Layers  3-D Tech  Vias                Temperature (C)
                                       Peak    Average  Min
1        -         -                   89.4    79.4     75.0
2        Ref [92]  No via              133.7   114.1    105.3
2        Ref [92]  Via 1 (Table 4.1)   133.4   113.9    105.1
2        Ref [92]  Max vias            133.2   113.9    105.2
2        Ref [93]  No via              132.9   114.2    105.6
2        Ref [93]  Via 3               132.6   114.0    105.5
2        Ref [93]  Max vias            132.4   114.0    105.5
3        Ref [92]  No via              178.2   149.1    135.9
3        Ref [92]  Via 1               177.2   148.5    135.4
3        Ref [92]  Max vias            176.5   148.5    135.6
3        Ref [93]  No via              175.8   149.3    136.9
3        Ref [93]  Via 3               174.8   148.6    136.4
3        Ref [93]  Max vias            174.4   148.6    136.5
4        Ref [92]  No via              222.8   184.3    166.7
4        Ref [92]  Via 1               220.7   183.0    165.7
4        Ref [92]  Max vias            219.4   183.0    166.1
4        Ref [93]  No via              218.4   184.7    168.6
4        Ref [93]  Via 3               216.4   183.4    167.6
4        Ref [93]  Max vias            215.6   183.4    167.9
Table 4.4. Parameters for temperature estimation in HS3d
Parameter              Value
Ambient temperature    45C
r_convec               0.5C/W
Substrate thickness    500 um
Spreader thickness     1 mm
Sink thickness         6.9 mm
Glue thickness         2 um
(see Figure 4.11). The hotspots occur near the MGTs and DCMs, which are about 14C
above the coolest portions.
Table 4.3 shows the temperatures for 3-D FPGAs consisting of identical FPGA
layers of 4VFX100. The temperatures were estimated using HS3d [60] with the parameters
listed in Table 4.4. The r_convec value of 0.5C/W reflects the thermal resistance
of a high-end package with a moderate heat sink. We estimated temperatures for two
extremes of 3-D technologies: one with very thin layers and fine vias (Tezzaron’s process,
Via 1 of Table 4.1), and another with 5um vias and 50um layers (Via 3 of Table 4.1). For
both these technology nodes, we also varied the number of inter-layer thermal vias be-
tween the two extremes of no thermal vias to the maximum possible number of thermal
vias. Table 4.3 shows the temperatures for these two corners along with a more realistic
number based on the via pitches in Table 4.1.
As expected, the peak temperature increases with increase in the number of layers
– from 89.4C for a 2-D FPGA to 220.7C for a 4-layer FPGA using Tezzaron’s process.
The intra-package temperature variation also increases with increase in the number of
layers, from 14.4C for a 2-D FPGA to 55.0C for a 4-layer FPGA. This large variation
in temperature indicates that the peak temperature could be reduced by distributing the
hot blocks more evenly across the fabric. Interestingly, 3-D technology parameters change
the temperatures only minutely. For a 4-layer FPGA, layer thickness changes the peak
temperature by up to 4.4C, while thermal vias could decrease the peak temperature
by up to 3.4C. Figure 4.12 shows the effect of stacking on temperature, as well as the
possible variations because of 3-D technology parameters.
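The figures quoted in this paragraph follow directly from Table 4.3. The short sketch below recomputes them from the relevant rows; the dictionary keys are labels invented here for readability, and only the rows needed for the quoted numbers are included.

```python
# (peak, average, minimum) temperatures in C, copied from Table 4.3.
TABLE_4_3 = {
    "2-D": (89.4, 79.4, 75.0),
    "4-layer ref92 no-via": (222.8, 184.3, 166.7),
    "4-layer ref92 via1": (220.7, 183.0, 165.7),
    "4-layer ref92 max-vias": (219.4, 183.0, 166.1),
    "4-layer ref93 no-via": (218.4, 184.7, 168.6),
}


def spread(row):
    """Peak minus minimum temperature, rounded to 0.1C."""
    peak, _, minimum = row
    return round(peak - minimum, 1)


def peak_delta(a, b):
    """Difference in peak temperature between two rows, rounded to 0.1C."""
    return round(TABLE_4_3[a][0] - TABLE_4_3[b][0], 1)


if __name__ == "__main__":
    print("2-D variation:", spread(TABLE_4_3["2-D"]))
    print("4-layer variation:", spread(TABLE_4_3["4-layer ref92 via1"]))
    print("layer thickness effect:",
          peak_delta("4-layer ref92 no-via", "4-layer ref93 no-via"))
    print("thermal via effect:",
          peak_delta("4-layer ref92 no-via", "4-layer ref92 max-vias"))
```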
Table 4.5. Thermal-aware 3-D FPGA design
FPGA Design                Temperature (C)
                           Peak     Average  Minimum
2-D                        86.9     78.0     73.8
2-layer stacked            128.48   111.11   102.78
2-layer thermal            112.92   111.19   110.30
2-layer thermal inverted   112.94   111.23   110.33
(a) 2-layer stacked (b) 2-layer thermal
Fig. 4.13. 3-D FPGA organizations (each panel shows Layer 1 and Layer 2)
4.3.2 Thermal-Aware 3-D FPGA Organization
Recently, a study proposed alternate organizations for a 2-D FPGA to reduce the
intra-die temperature variations [67]. Using a fully utilized Virtex-4 FX100 FPGA as
an example, it demonstrated a reduction in peak die temperature of about 6C. Since
temperature variation is larger in a 3-D FPGA, we would expect thermal organiza-
tion to have a greater impact. To demonstrate this, we design a thermal-aware 2-layer
FPGA. For ease of experimentation, we consider only 4 types of blocks in the FPGA,
namely, CLB, BRAM, DSP, and MGT. These blocks consume the majority of the area
in 4VFX100. The peak temperature for a 2-D FPGA containing these blocks is 86.9C.
In the first case, we stack two identical such layers to form a 2-layer stacked FPGA (see
Figure 4.13(a)). The peak temperature for this FPGA is 128.5C. Note that stacking
the hot blocks significantly increases the power density, and therefore, the temperature.
Hence, next, we keep all the MGTs, DSPs, and BRAMs on a single layer. The second
layer now consists only of CLBs (see Figure 4.13(b)). This change in floorplan can be im-
plemented easily with the column-based modular architecture of Virtex-4 (ASMBL) [1].
This reduces the peak temperature to 112.9C (2-layer thermal in Table 4.5). The tem-
perature variation also drops from 25.7C for the stacked design to only 2.6C for the
thermal-aware design.
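The temperature variations quoted above follow directly from Table 4.5 as peak minus minimum temperature. The short sketch below only re-derives them from the table entries; the function names are illustrative and are not part of the thermal simulation flow used for the experiments.

```python
# Temperatures (in C) copied from Table 4.5: (peak, average, minimum).
TABLE_4_5 = {
    "2-D": (86.9, 78.0, 73.8),
    "2-layer stacked": (128.48, 111.11, 102.78),
    "2-layer thermal": (112.92, 111.19, 110.30),
    "2-layer thermal inverted": (112.94, 111.23, 110.33),
}


def temperature_spread(design: str) -> float:
    """Peak minus minimum temperature of one FPGA organization."""
    peak, _average, minimum = TABLE_4_5[design]
    return peak - minimum


def peak_reduction(before: str, after: str) -> float:
    """Drop in peak temperature when moving from one organization to another."""
    return TABLE_4_5[before][0] - TABLE_4_5[after][0]


if __name__ == "__main__":
    for name in TABLE_4_5:
        print(f"{name:26s} spread = {temperature_spread(name):5.1f} C")
    drop = peak_reduction("2-layer stacked", "2-layer thermal")
    print(f"peak reduction (stacked -> thermal) = {drop:.1f} C")
```

The spread of the stacked design (about 25.7C) collapses to about 2.6C in the thermal-aware design, while the peak drops by roughly 16C.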
In the previous experiments, the heat sink is attached closest to the layer consum-
ing the maximum power. Previous studies have suggested that this should be preferred.
In fact, researchers have proposed thermal-aware 3-D floorplanning that tries to place
the hot blocks closer to the sink [61]. In order to see the effect of sink placement, we
attached it to the layer containing only CLBs in the 2-layer thermal organization. Ta-
ble 4.5 also shows the temperature for this case (2-layer thermal inverted). We observe
that the temperature increases only very slightly because of this change. This happens
because the vertical distances are small compared to the horizontal dimensions of the
FPGA.
4.4 Summary
This chapter demonstrated that 3-D FPGAs can provide significant advantages
over 2-D by reducing the interconnect area and the total area-delay product. The 3-D
FPGA with 5 layers and 3um-pitch vias reduces the area-delay product of a 2-D FPGA
by 36%.
We designed and evaluated several switch boxes for 3-D FPGAs, and showed that
the area-delay product depends heavily on the switch box topology. In 65nm technology,
the area-delay product for our universal-twist switch box is 15% lower than that of the
subset switch box for 5um-pitch vias. We further showed that the universal switch boxes
become even better with scaling process technology, as well as with larger vias. However,
adding more switches to the universal SB does not provide any benefit.
Three-D integration, however, increases the die temperature. Our experiments
indicate that the peak temperature for a 4-layer FPGA is 2.4 times that of a single-layer
FPGA. However, the large variation in temperature within the 3-D package allows us
to re-organize the 3-D FPGA to reduce the peak temperature. For a 2-layer FPGA,
the peak temperature dropped by 16C when the design was altered to create a more
uniform temperature profile.
Chapter 5
Technology Alternatives for Nanoscale FPGA Interconnects
The previous decade has seen large-scale concerted efforts to develop nano-scale
technologies that will help sustain Moore’s law. Innovations in lithographic CMOS
technologies have indicated that it should be possible to scale CMOS until at least
the second half of the next decade. However, conventional lithographic techniques
suffer from increasing fabrication costs, which may ultimately limit their application.
Recently, a (comparatively) low cost and reliable nano-imprint lithography technique
has been proposed [99, 100] which raises the hopes of obtaining cost-effective nano-
scale fabrication. However, at present, this imprint technique is limited to very regular
structures, and is unlikely to produce the complex structures that current lithography
can produce. While nano-imprint as well as conventional lithography are top-down
techniques, there are several bottom-up assembly techniques [101] in which molecules
assemble to form nano-structures. Although these techniques are expected to be very
low cost, they suffer from yield issues and are limited to very simple geometries.
Modern high-end FPGAs contain a variety of resources, and are not restricted to
a simple array of logic blocks consisting of Look-Up Tables (LUTs) connected using pro-
grammable switch blocks. In current FPGAs, apart from the basic programmable blocks,
there exist RAM modules, some hard-coded blocks (e.g. multipliers), and even some full
processors (e.g. PowerPC processors). Apart from them, the basic programmable logic
block itself has been augmented to contain non-LUT structures, like fast carry-chain
circuits. There have been advances in the interconnect architecture too. Modern FP-
GAs consist of segments of different lengths, each with different connectivity. However,
it is widely accepted that the interconnect is the major bottleneck in FPGAs. The in-
terconnect multiplexers in Xilinx’s Virtex-2 FPGAs take around 70% of the CLB area.
Furthermore, even after careful timing-driven packing and placement, interconnects are
the dominant source of delay for most designs. In addition, the interconnect
resources account for more than 70% of the power consumed by a typical
FPGA-mapped design [13].
In this chapter, we explore different solutions to the interconnect problem in the
nano-scale regime. We explore nano-wires of different widths and materials as inter-
connect. We also explore replacing the pass-transistor switches in current FPGAs by
molecular switches [101, 102] that provide reprogrammable connections between wires.
This alleviates the need for SRAM cells to control the state of the switch, since these
molecules store the state within themselves. This is similar to anti-fuse FPGAs, but, in
contrast to anti-fuse technology, these molecules are reprogrammable. Furthermore, we
expect the structure of the CLB to be more difficult to realize efficiently in a technology
more amenable to regular structures. Therefore, the logic blocks in our architecture are
fabricated using lithographic techniques.
5.1 Nanotechnology Primitives
Several nano-structure fabrication techniques have been proposed over the past
few years. Among them, Nano-imprint [99, 100] and Dip Pen Nano-lithography (DPN) [103]
are the most promising techniques. In case of nano-imprint technology [99, 100], e-beam
lithography (or a similar technique) is used to create a mould, which is subsequently
used to emboss the circuit on other chips for mass production. The mould can be made
very fine, and the technique is expected to scale down to feature sizes of a few nanometers.
DPN [103], in contrast, uses an Atomic Force Microscope (AFM) to write the circuit
on the die. Although inherently slower than nano-imprint, using multiple AFM tips
improves the writing speed significantly. This has been demonstrated to produce very
small features, and is expected to fabricate features smaller than 10nm. Directed self-
assembly [101] is another approach towards making nano-structures. Although this may
be the cheapest way to make circuits, it suffers from very high defect rates.
Note that all these (nano-imprint, DPN and self-assembly) technologies are ex-
pected to be limited to very simple geometries. It has been shown that it is possible to
get sets of parallel wires using any of the above techniques. Therefore, we propose to use
them (preferably nano-imprint) to make only wires in the FPGA. These wires could be
made using a single crystal of metal-silicide (e.g., NiSi nano-wires [104]) or made out of
metal. Carbon nanotube wires could also be considered, although a recent work claimed
that carbon nanotubes may not be better than metal wires with respect to reducing
interconnect delays [105].
In addition to the wires, we also need some sort of programmable switches to
provide programmable connection among the wires and between wires and logic pins. In
the FPGAs of Xilinx and Altera, these are made using pass transistors and SRAM cells,
while Actel's Axcelerator FPGAs use one-time programmable anti-fuse material. At the
nano-scale, we can use single-molecule switches that exhibit reversible switching behavior [70].
These molecules self-assemble at the cross-points of nano-wires, and can be switched
between ON and OFF states by the application of a voltage bias. It is desirable that these
switches have very low ON resistance and a very large OFF resistance. ON resistances
of hundreds of ohms and OFF-to-ON ratios of 1000 have been observed recently [102].
Note that very fast switching is not essential for FPGAs, because these
switches will not be configured very frequently and the FPGA configuration time is
normally not critical.
Early work in molecular switching suffered from filament formation due to the
small gap separating the nano-wires. Consequently, the switching behavior observed was
due to the metallic filament rather than the molecule. Chemists at several research institutions
are targeting this problem. One such (as yet unpublished) work from our collaborating
chemists can increase the vertical separation among wires to 30nm and uses nano-spheres
to provide programmable connections. In line with this work, we experiment with a fixed
vertical separation between nano-wires of 30nm.
5.1.1 Related Work
DeHon [68], Goldstein [69], and Tour [70] have previously proposed programmable
architectures using some form of nano-structures made by self-assembly. Goldstein
tried to make crossbar-based devices by aligning nano-wires in two planes at right
angles to each other. The crosspoints contained molecules that provided programmable
logic as well as interconnections. This approach suffered from signal degradation,
because there was no way to restore the signal using only two-terminal devices. DeHon
overcame this problem by using SiNW-based FETs to restore the signals, and proposed a PLA
structure. However, the logic functionality in that architecture was limited to OR (and
inversion).
Tour, instead, proposed replacing the logic blocks by nanocells and connecting
them using metal wires. This approach suffered from the problem of training these nanocells, which
were assumed to consist of a randomly connected mass of molecules. Furthermore, since
the bottleneck in current FPGAs lies in the interconnect, Tour’s architecture does not
help solve this problem.
All the above architectures propose drastic changes in the existing CMOS tech-
nology as well as the design methodologies. We propose an architecture that blends with
existing technology easily, and preserves all the design methodologies and flexibility in
logic functionality.
5.2 Nanoscale FPGA Architectures
We explored FPGA architectures with varying degrees of nanoscale integration in
the interconnect fabric. The logic block in all architectures is assumed to be made using
22nm lithography (which [8] predicts to be available in 2016). In the first architecture, we
consider the inter-CLB wires to be made using some nano-fabrication technology and the
interconnect switches to be made using self-assembled molecular switches. Both metal
and metal-silicide nano-wires are explored. Note that this organization needs decoders
to address the (nano) wires. In the second architecture, we assume inter-CLB copper
wires fabricated using advanced lithography but keep molecular switches to connect
them. In order to make the exploration tractable, we limit the inter-CLB metal wires
to only two levels (M3 and M4). The main difference between arch1 and arch2 is the
attainable wire pitch (down to 10nm for arch1, 54nm for arch2). Finally, we compare these
architectures with the current island-style FPGA architecture containing pass-transistor
switches (arch3), scaled to the 22nm technology node.
5.2.1 Arch1: Using non-lithographic nano-wires and molecular switches
Figure 5.1 shows the proposed architecture, and figure 5.2 shows how the different
technologies are stacked together. The logic block remains in silicon, and uses M1 and
M2 layers for local connections. The IO pins of the logic block are in M2 layer, and the
nano-wires are on top of this. Molecular switches provide programmable connections
between nano-wires and between nano-wires and logic blocks. Note that each layer in
figure 5.2(a) is isolated from its adjacent layers by a dielectric.
The salient features of this architecture are described below.
Interconnect wires
A good interconnect material must have a low resistivity, a large current-carrying
capacity, and the ability to be made at small pitches. A low resistivity is needed to have
small delay, which is determined by the RC product. While copper wires are expected
to have a resistivity of 2.2µΩ-cm at the 22nm technology node [8], NiSi nanowires have
been shown to have resistivities of around 10µΩ-cm [104]. Even with their higher resistivity,
NiSi nanowires may be preferred due to their ability to sustain a current density of up
to a hundred times that of copper (> 1 × 10^8 A/cm^2). Some nano-fabrication technology
may be needed to fabricate wires at pitches of less than 10nm (the wire pitch at the
22nm node is predicted to be 54nm).
We experimented with different routing architectures, consisting of different seg-
ment lengths. It has been previously shown that a segmented routing architecture is
better than non-segmented ones [81]. The logic block (8 LUT+FFs) in 22nm technology
is expected to be around 12.5µm x 12.5µm. In addition to this, the decoders take some
space. Therefore, a single-length wire in our architecture needs to run 25µm, a
double-length wire 38µm, and a triple-length wire 50µm. Assuming 50µm as the limit for the length
of these wires, we investigate architectures having a maximum segment length of 3 logic
blocks.
Interface to CMOS
The problem of interfacing such nano-structures with the structures made using
traditional lithography was addressed in [68]. These nano-wires can be accessed with a
decoder made using advanced nano-imprint technology. [106] also proposes a stochastic
approach to addressing these wires, and claims that we can uniquely address these wires
with high probability if the number of wires is large. [68] proposed the use of √N control
signals for a decoder that addresses N wires. We use a similar technique, and
therefore account for 15 decoder control signals (√200 ≈ 14.1, rounded up) for 200 wires in the FPGA channel. Note
that these decoders are needed only to configure the switches, and are switched off at
operation time.
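The decoder sizing rule can be written as a one-line calculation. The sketch below assumes that the number of control signals is the square root of the number of wires rounded up to the next integer; the text states only the √N scaling and the resulting value of 15 for 200 wires, so the rounding rule and the function name are assumptions made for illustration.

```python
import math


def decoder_control_signals(num_wires: int) -> int:
    """Smallest integer k with k * k >= num_wires, i.e. ceil(sqrt(num_wires)).

    Assumption: the sqrt(N) rule is rounded up to a whole number of signals.
    """
    if num_wires < 0:
        raise ValueError("num_wires must be non-negative")
    root = math.isqrt(num_wires)  # exact integer floor of the square root
    return root if root * root == num_wires else root + 1


if __name__ == "__main__":
    for n in (100, 200, 400):
        print(f"{n} wires -> {decoder_control_signals(n)} control signals")
```

For the channel width of 200 used in this chapter, this gives the 15 control signals accounted for above.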
Programmable Switches
As described in section 5.2, arch1 uses molecular switches that can be made to
assemble at the cross-points of the wires. After this, these switches can be configured to
make the desired connections by applying the correct voltages at the wires (similar to
anti-fuse FPGAs).
Configuring the FPGA
The logic functionality of this FPGA can be easily programmed using SRAM cells.
Programming the routing is similar to anti-fuse FPGAs, except that we need decoders
to address the nano-wires. The main concept is that the wires should be activated in
some particular order to avoid affecting wrong switches. [107] presents a way to program
the anti-fuses in an anti-fuse FPGA, which is directly applicable to our architecture too.
Initially, all the molecular switches are off and all the wires are pre-charged to a voltage
Vp/2. This is required to ensure that the voltage difference of Vp is applied only to
the desired switch. Then the two wires that need to be connected through a switch are
addressed using a decoder and pulled to Vp and ground respectively, thus applying a
voltage difference of Vp to the molecular switch that needs to be turned on. Note that
Vp needs to be larger than the operating voltage. Experiments with molecular switches
have shown a value of 1.75V [70], which is more than double the operating voltage
at the 22nm node.
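The pre-charge scheme described above can be illustrated with a small model of a crossbar. The sketch assumes an idealized array in which every horizontal wire crosses every vertical wire, and in which a switch turns on only when the full Vp appears across it; the function names and the threshold behavior are hypothetical simplifications, not a model of the actual molecules.

```python
def crosspoint_voltages(n_rows, n_cols, target_row, target_col, vp=1.75):
    """Voltage magnitude across every crosspoint while one switch is programmed.

    All wires are pre-charged to Vp/2; the two wires of the target switch
    are then pulled to Vp and ground respectively.
    """
    rows = [vp / 2] * n_rows   # horizontal wires, pre-charged
    cols = [vp / 2] * n_cols   # vertical wires, pre-charged
    rows[target_row] = vp      # addressed horizontal wire pulled to Vp
    cols[target_col] = 0.0     # addressed vertical wire pulled to ground
    return [[abs(r - c) for c in cols] for r in rows]


def switches_turned_on(voltages, threshold):
    """Crosspoints whose voltage reaches the (assumed) programming threshold."""
    return [(i, j)
            for i, row in enumerate(voltages)
            for j, v in enumerate(row)
            if v >= threshold]


if __name__ == "__main__":
    v = crosspoint_voltages(4, 4, target_row=1, target_col=2)
    for row in v:
        print(row)
    print("programmed:", switches_turned_on(v, threshold=1.75))
```

Crosspoints that share one wire with the target see only Vp/2, and all other crosspoints see no voltage difference, which is why the pre-charge to Vp/2 confines the full Vp to the desired switch.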
We also envision the possibility of using the CLB logic itself to program the molecular
switches. To do so, the configuration needs to go through the following
steps. First, the global clock resources are configured. Next, the CLB (logic)
is configured to drive appropriate control signals to the address decoder. Note that
since different CLBs cannot communicate at this stage, all control signals need to be
synchronized with the global clock signal. Furthermore, since the configuration time is
usually not critical, we can afford to minimize the configuration logic (that needs to fit
within a single CLB). Next, the routing (molecular) switches are programmed, followed
by configuration of the CLBs to implement the user design. Note that this configuration
methodology will greatly simplify the programming circuitry when compared to anti-fuse
FPGAs.
Capacitance and Area Estimation
The capacitance of a single-length wire (a wire that spans adjacent CLBs), C1-wire, in arch1 is estimated as follows.
\[
C_{1\text{-}wire} = 4 N_{channel} C_{nano\text{-}jn} + (2 N_{clb\text{-}pins} + 2 N_{decoder}) C_{micro\text{-}jn} + 2 C_{couple}
\]
where Nchannel is the number of wires in the FPGA channel (channel width), Cnano-jn
is the junction capacitance between two nano-wires, Nclb-pins is the number of IO pins
in the logic block, Ndecoder refers to the number of control signals in the decoders,
Cmicro-jn is the junction capacitance between a lithographic wire and a nano-wire, and
Ccouple is the coupling capacitance with an adjacent wire.
The junction capacitance between any two wires, Cjunc, is calculated using [68]
\[
C_{junc} = \frac{2 \pi \epsilon L}{\ln(2h/r)},
\]
where ε is the permittivity of the dielectric separating the wires (we assumed SiO2), r is
the radius of the wires and h is the separation between the wires.
For Cnano-jn, L = 2r and h was kept as 30nm, and for calculating Cmicro-jn,
L was changed to the lithographic metal half pitch (54nm for 22nm node).
Ccouple was estimated using the equation for two long parallel cylindrical conductors:
\[
C_{couple} = \frac{\pi \epsilon L}{\ln\left(\frac{D}{2a} + \sqrt{\left(\frac{D}{2a}\right)^{2} - 1}\right)}
\]
where D is the spacing between the axes of the two cylinders, a is the radius of the
cylinders, and L is the length of the cylinders (wires). We observed that the coupling
capacitance calculated using the above equation was always larger than the capacitance
calculated using Berkeley device group’s interconnect model [80], and therefore used the
above as a pessimistic value.
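The capacitance model of this subsection can be evaluated numerically as sketched below. The junction formula is taken to be the usual wire-over-plane expression 2πεL / ln(2h/r). The relative permittivity of SiO2 (3.9), the 40 logic pins (the figure used for arch2 in Section 5.2.2), the 15nm wire diameter, and the 25nm axis-to-axis spacing are assumptions assembled from other parts of this chapter, so the printed values are illustrative rather than the numbers used in the experiments.

```python
import math

EPS0 = 8.854e-12        # vacuum permittivity, F/m
EPS_SIO2 = 3.9 * EPS0   # assumed relative permittivity of SiO2


def junction_capacitance(length, h, r, eps=EPS_SIO2):
    """C_junc = 2*pi*eps*L / ln(2h/r) for wire radius r and separation h."""
    return 2 * math.pi * eps * length / math.log(2 * h / r)


def coupling_capacitance(length, d, a, eps=EPS_SIO2):
    """Two parallel cylinders of radius a whose axes are a distance d apart."""
    x = d / (2 * a)
    return math.pi * eps * length / math.log(x + math.sqrt(x * x - 1))


def single_wire_capacitance(n_channel, n_clb_pins, n_decoder,
                            c_nano_jn, c_micro_jn, c_couple):
    """C_1-wire as the sum of junction and coupling contributions."""
    return (4 * n_channel * c_nano_jn
            + (2 * n_clb_pins + 2 * n_decoder) * c_micro_jn
            + 2 * c_couple)


if __name__ == "__main__":
    r = 7.5e-9          # wire radius (15nm diameter, assumed)
    h = 30e-9           # vertical separation between wire layers
    axis_gap = 25e-9    # assumed: 10nm spacing + 15nm diameter
    wire_len = 25e-6    # single-length wire
    c_nano = junction_capacitance(2 * r, h, r)
    c_micro = junction_capacitance(54e-9, h, r)
    c_couple = coupling_capacitance(wire_len, axis_gap, r)
    total = single_wire_capacitance(200, 40, 15, c_nano, c_micro, c_couple)
    print(f"C_nano-jn  = {c_nano:.3e} F")
    print(f"C_micro-jn = {c_micro:.3e} F")
    print(f"C_couple   = {c_couple:.3e} F")
    print(f"C_1-wire   = {total:.3e} F")
```

Under these assumptions, the coupling term dominates the junction terms, which is consistent with treating the coupling estimate as the pessimistic part of the model.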
The area of the arch1 FPGA equals the area of the logic blocks plus the area of the decoders
when the pitch of the nano-wires is within 25nm. For larger wire pitches, the area is determined
by the wires and is proportional to the square of the wire pitch. Note that when
the area of the device increases, the lengths of the wires also increase and, consequently, the wire
capacitance and resistance per CLB length change.
5.2.2 Arch2: FPGA using lithographic wires and molecular switches
Arch1, described in the previous subsection, needs decoders for addressing the
nano-wires, which increases the complexity of the fabrication process. Therefore, we also
explore an FPGA that uses conventional lithographic metal wires as the interconnect,
with molecular switches at their cross-points (as in the previous architecture). Note that
assuming a channel width of 200 (same as arch3, and similar to commercial SRAM-based
FPGAs), the area of the CLB will be determined by the wires instead of the logic. For
22nm technology, ITRS predicts a wire pitch of 54nm. For a channel width of 200, we
will need 400 wires within the CLB pitch, which occupy 400 × 54nm = 21.6µm.
In addition, we need space for the logic pins, which amounts to 40 × 54nm = 2.16µm.
Therefore, the CLB dimensions in this case are projected to be 23.76µm
x 23.76µm, which is only slightly smaller than the current Xilinx CLB scaled to 22nm
technology (25µm x 25µm).
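The arithmetic behind the arch2 CLB dimensions can be reproduced as follows. The sketch uses only the numbers quoted above (channel width of 200, 40 logic pins, 54nm pitch, and a 25µm reference CLB); the function names are illustrative.

```python
def arch2_clb_side_um(channel_width=200, logic_pins=40, wire_pitch_nm=54.0):
    """Side length of the arch2 CLB tile when the wires set its size."""
    wires = 2 * channel_width              # 400 wires within the CLB pitch
    return (wires + logic_pins) * wire_pitch_nm / 1000.0


def area_ratio(side_um, reference_side_um=25.0):
    """Tile area relative to a square reference tile."""
    return (side_um / reference_side_um) ** 2


if __name__ == "__main__":
    side = arch2_clb_side_um()
    print(f"arch2 CLB side = {side:.2f} um")
    print(f"area relative to the scaled Xilinx CLB = {area_ratio(side):.3f}")
```

The resulting tile is about 90% of the area of the scaled SRAM-based CLB, which matches the roughly 10% area reduction reported for arch2 in Section 5.4.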
5.3 Comparative Evaluation
We used VPR [10] to model the various FPGA architectures and evaluate their
performance.
Modeling Arch1 in VPR
In order to model arch1 in VPR, we added a new type of switch box that allows
a wire to connect only to the wires at right angles to it. This was done because in arch1,
molecules assemble only at wire cross-over points and not between two wires running in
the same direction. In order to account for the large defect rates expected at this scale,
we initially assumed that only half of the switches are operational. However, because of the
very large number of programmable switches in our architecture (even when only
half of the switches are visible), VPR takes extremely long (> 2 days on a SunBlade-2000,
for a 191 CLB design) to finish the placement and routing of the designs. In order
to facilitate experimenting with multiple designs, we limited the number of switches in
VPR to only about 1% of the total physically present switches. Consequently, in VPR,
the CLB outputs have switches to only half of the wires in the channel, and a wire
can connect to only 4 other wires in the switch box, two in each of the perpendicular
directions. For the few designs we initially experimented with, the performance obtained
with the limited number of switches was not very different from that obtained with all
the switches. Since the flexibility provided by our switch box is still
greater than that of the switch boxes built into VPR, we expect that our switch box is not very
limiting, and that similar results would be obtained with all switches. Note that
since we still counted the junction capacitances between all crossing wires, our results
for the proposed architectures should be considered a lower bound on performance, and could be
improved by better tools. We used MCNC benchmark circuits for all
experimentation. These designs varied in size from 131 to 806 CLBs. In order to have
reasonable performance, we kept the routing as segmented with 20% single-length, 30%
double-length, and 50% triple-length wires.
5.3.1 Results
Figure 5.3 shows the critical path delays of all the designs when mapped to the
three architectures. The results for arch1 use a spacing s of 10nm between the nanowires
and a wire diameter of 15nm. The lithographic wire pitch was kept as 54nm, as predicted
by [8] for the 22nm node. The resistance of the molecular switch was assumed to be 1kΩ,
and the material for the nano-wire was assumed to be copper (resistivity=2.2µΩ-cm [8]).
Note that the delay is maximum for arch3 (lithographic, SRAM-based), and the delays
for arch1 and arch2 are comparable. However, the area of the arch1 FPGA is only about
30% of the arch2 FPGA. The average reduction in critical path delay was 30% for arch2
and 32% for arch1, when compared to arch3.
The performance of the designs (mapped on arch1 and arch2 FPGAs) strongly
depends on the molecular switch resistance. For our experimentation we assumed that
the off resistance of the switch is sufficiently high to consider it as an open circuit. Results
for varying molecular on resistance from 100 Ω to 100 kΩ (typical value is around 10kΩ
today) are shown in figure 5.4. It is observed that the delay of the circuit increases very
sharply beyond 10kΩ. In fact the delay becomes as large as 20X for arch1 when the
molecular resistance is 100kΩ. The delay for arch1 using NiSi nano-wires remains
larger than that of arch3 for all values of molecular resistance, because of the very large
resistance of these wires. Note that these NiSi nano-wires can support large current
densities, while the metal nano-wires may in reality be limited by electro-migration.
Figure 5.5 shows the variation of resistance and capacitance of single-length NiSi
nano-wires with wire dimensions. The notation R-25 means resistance for nano-wires
with a pitch of 25nm. The plot shows results for wire pitches ranging from 25nm to
55nm. Note that as the wire pitch is increased, the area of the FPGA increases, thereby
increasing the wire length. Therefore, we can see a slight increase in the wire resistance
when the pitch is increased even when the width of the wire remains the same. The
capacitance value at 50nm width clearly reaches unacceptable limits (>20fF). At the
other extreme, the resistance values are very large (>100 kΩ) when the width of the
wire is reduced to 5nm. Note that looking at the RC product of the wire alone is not
expected to give an indication of the performance of the FPGA, since every net will
go through some molecular switches (with resistances) and into the input pins of logic
blocks (with capacitances).
Figure 5.6 shows the variation of performance of arch1 with varying wire dimen-
sions for the design misex3; other designs showed a similar behavior. Note that per-
formance of arch1 is inferior to arch3 when the molecular resistance is 100kΩ or 10kΩ.
However, as the molecular resistance reduces, arch1 starts performing better than arch3.
The figure is divided into vertical sections of separate wire pitches. For every wire pitch,
we experimented with several wire dimensions. Note that for Rswitch=100kΩ, delay
increases monotonically (except 5 30 → 10 25) with width of the wire for a fixed pitch.
This happens because the large switch resistance makes the net delay very sensitive to
capacitance of the wire. With the delay of the design being dominated by the routing de-
lay (and because the logic delay remains almost same for different wire dimensions), the
delay of the design increases with capacitance. The other extreme occurs when Rswitch
is 100 Ω, in which case the delay decreases with increase in width due to reduction in
wire resistance. Rswitch values of 10 and 1 kΩ show intermediate behavior.
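The two extremes described above can be reproduced qualitatively with a first-order estimate. The sketch below is not the VPR delay model used for the results: it assumes a single switch driving one single-length copper wire of circular cross-section, an Elmore-style delay of Rswitch·C + 0.5·Rwire·C, and the capacitance expressions of Section 5.2.1 with an assumed SiO2 permittivity, 40 logic pins, and a 25nm pitch. All names and parameter values are illustrative.

```python
import math

EPS = 3.9 * 8.854e-12    # assumed SiO2 permittivity, F/m
RHO_CU = 2.2e-8          # copper resistivity, 2.2 uOhm-cm expressed in Ohm-m
WIRE_LEN = 25e-6         # single-length wire
H = 30e-9                # vertical separation between wire layers
LITHO_HALF_PITCH = 54e-9
N_CHANNEL, N_PINS, N_DECODER = 200, 40, 15


def wire_resistance(diameter, rho=RHO_CU, length=WIRE_LEN):
    """Resistance of a wire with a circular cross-section (assumed shape)."""
    a = diameter / 2
    return rho * length / (math.pi * a * a)


def wire_capacitance(diameter, pitch):
    """C_1-wire from the junction and coupling formulas of Section 5.2.1."""
    a = diameter / 2
    ln_term = math.log(2 * H / a)
    c_nano = 2 * math.pi * EPS * (2 * a) / ln_term
    c_micro = 2 * math.pi * EPS * LITHO_HALF_PITCH / ln_term
    x = pitch / (2 * a)
    c_couple = math.pi * EPS * WIRE_LEN / math.log(x + math.sqrt(x * x - 1))
    return (4 * N_CHANNEL * c_nano
            + (2 * N_PINS + 2 * N_DECODER) * c_micro
            + 2 * c_couple)


def net_delay(r_switch, diameter, pitch):
    """Elmore-style estimate: a switch driving a distributed RC wire."""
    c = wire_capacitance(diameter, pitch)
    return r_switch * c + 0.5 * wire_resistance(diameter) * c


if __name__ == "__main__":
    pitch = 25e-9
    for r_switch in (100e3, 100.0):
        delays = [net_delay(r_switch, d, pitch) for d in (5e-9, 10e-9, 15e-9)]
        print(f"Rswitch = {r_switch:g} Ohm:",
              ", ".join(f"{t:.2e} s" for t in delays))
```

With a 100kΩ switch the Rswitch·C term dominates, so widening the wire (more capacitance) increases the estimated delay; with a 100Ω switch the wire resistance dominates and the delay falls as the wire gets wider.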
5.4 Summary
In this chapter we explored several nano-scale interconnect technologies for FPGAs.
First, we replaced the FPGA interconnect fabric by nano-structures: lithographic wires
by nano-wires made using nano-imprint technology, and switches by molecular switches.
Second, we used lithographic wires connected using molecular switches. The results
for these two were compared with current FPGA architecture containing pass transistor
switches, scaled to 22nm.
We found that the first architecture provided the best performance with the least
area. The area reduced to 30% of the scaled architecture, and the critical path delay
reduced by 32% on an average. The second architecture improved the performance over
the scaled FPGA, but area reduction was only 10%. Using NiSi nano-wires instead of
metal nano-wires was not good for performance, but may be useful to counter electro-
migration. The resistance of the molecular switch was found to be a crucial factor in the
performance of the design, and values lower than 10kΩ were observed to be critical for
performance.
This kind of exploratory research is highly interdisciplinary, and building successful
nanoscale devices requires synergy between the architects and the chemists. One of
the motivations of this work was to convey the requirements on these nanoscale technologies
to the chemists who are actually developing them. From the results we conclude that
molecular switches with on-resistances of around 1kΩ are needed for good performance.
Furthermore, materials with lower resistivities than NiSi nanowires must be explored for
fabricating nano-wires. Architectural improvements and throughput-oriented designs
may utilize the area benefits of nanotechnologies to provide faster application run-times
even with higher molecule and wire resistances.
Fig. 5.1. FPGA using nano-wires and molecular switches
Fig. 5.2. 3-D organization of nano-wires, panels (a) and (b)
Fig. 5.3. Critical path delays in the 3 architectures
Fig. 5.4. Dependence of performance on molecular switch’s ON resistance
Fig. 5.5. Resistance and Capacitance values of single-length NiSi nano-wires
Fig. 5.6. Performance of a design (misex3) using metal nano-wires
Chapter 6
Summary and Future Directions
The growing popularity of FPGAs demands ways to sustain their growth in the future.
In this thesis, we looked at three main future technologies: scaled CMOS, 3-D integration,
and nanotechnology.
Leakage Energy in Scaled CMOS
Leakage energy is increasing exponentially with shrinking feature sizes. The prob-
lem is more severe in FPGAs because they use extra transistors to provide programma-
bility. This, however, also gives us an opportunity to save leakage: since a majority of
these transistors are unused or idle for any given design, we can save leakage by simply
cutting off their power supplies. This is similar to the use of sleep transistors in ASICs,
but is simpler because the state of the sleep transistor can be fixed at the configuration
time. In contrast, ASICs normally control the sleep transistor dynamically, based on the
usage of the circuit block. Such a dynamic control can also be used in FPGAs if the
design has modules that remain idle for long periods of time. However, even without
the dynamic control, we can save significant leakage by using sleep transistors controlled
by configuration SRAM cells.
Another, more expensive (in terms of area overhead) approach is the use of multiple
supply voltages in the FPGA. We proposed an architecture that reduces both
dynamic and leakage power by using two supply voltages (high: Vddh, low: Vddl).
Through the use of supply transistors, circuit blocks on the FPGA can run on either of
the two Vdd’s. Extra configuration bits store the state of the supply transistors, which
sets the Vdd for every circuit block. A Vdd assignment algorithm assigns values to these
configuration bits, such that timing constraints are not violated. We integrated this
algorithm with VPR (an FPGA place-and-route tool from Toronto [10]), and obtained
a 61% reduction in total FPGA power. For the case when all routing muxes can be
individually programmed to either Vddh or Vddl, the FPGA area increases by about
50%. In order to reduce this area penalty, we present another architecture where the
voltages of the routing resources are fixed (at either Vddh or Vddl) at the time of fab-
rication. The router then routes the critical nets through the resources that are on high
Vdd, thus maintaining the performance of the design. We observe that this architecture
reduces the FPGA energy by 57.3% with an area penalty of only about 20%, and the
performance for all the benchmark designs remains within 20% of the original (when all
resources are at Vddh). We further reduce the area penalty by controlling the Vdd of
multiple logic blocks from the same set of supply transistors.
Three-Dimensional Integration of FPGAs
Three-dimensional (3-D) integration is a promising technique for reducing wire
lengths in an integrated circuit. Three-D is especially attractive for FPGAs because
the interconnect dominates their total area, delay, and power. In order to achieve this
goal, we design a 3-D FPGA, which uses wafer bonding to stack multiple programmable
fabrics. The inter-layer vias in this technology are much larger than the horizontal wires,
which forces us to minimize the number of such vias. Consequently, inter-layer vias are
fewer than the horizontal wires. Designing an efficient switch block for such an FPGA
is a challenging task.
We design multiple switch boxes for 3-D FPGAs, and analyze their benefits and
drawbacks. Using detailed area-timing models and a 3-D FPGA place and route tool,
we evaluate the different switch box topologies. The results indicate that the area-delay
product depends heavily on the switch box topology. Compared with the subset switch
block, our best switch block reduces the area-delay product by 9% for 3um pitch vias
and by over 15% for 5um pitch vias in a 65nm technology. We also demonstrate that,
compared with a 2-D FPGA, a 5-layer 3-D FPGA reduces the area-delay product by
more than 35%.
Besides the influence on the FPGA routing architecture, 3-D increases the die
temperature of FPGAs. We analyzed the thermal issues in 3-D FPGAs, and proposed
an FPGA organization that reduced the peak temperature of a 2-layer FPGA by 16C.
Nano-Scale Technologies
Recently, chemists and material scientists have made significant progress in de-
veloping alternate technologies for creating features at the nano-scale [99, 103]. They
have successfully fabricated wires separated by only a few nanometers (nanowires) [104].
Similarly, they are developing reprogrammable molecular switches [102]. The scaling of
CMOS is expected to saturate some time during the next decade, which mandates the
exploration of alternate technologies for chip fabrication.
In this work, we explore different technologies to implement the FPGA interconnect.
We explore nano-wires of different widths and materials, and also explore replacing
the pass-transistor switches in current FPGAs with molecular switches that provide
reprogrammable connections between wires. This eliminates the need for SRAM cells to control
the state of the switch, since these molecules store the state within themselves. This is
similar to anti-fuse FPGAs, but, in contrast to anti-fuse technology, these molecules are
re-programmable. Furthermore, we expect the structure of the CLB to be more difficult
to realize efficiently in these nanotechnologies because they are amenable only to regular
structures. Therefore, the logic blocks in our architecture are fabricated using a 22nm
CMOS process (scaled from the 65nm values).
We compare this architecture with traditional SRAM-based FPGAs that use pass transistors
as switches (scaled to 22nm), and show that using nanowires and molecular switches
reduces FPGA area by 70%.
6.1 Future Directions
Process variation is increasing significantly in CMOS technologies. Tackling process
variation effectively in FPGAs would be an interesting project for future study.
Since FPGAs can be reprogrammed, we can potentially reconfigure them to avoid an
increase in the delay of the critical paths. Furthermore, the power reduction mechanisms
will need to include the hard blocks in order to deploy power-efficient versions of
current FPGAs. Since the various hard blocks can be combined and configured in multiple
modes, the tools can select a power-efficient mode for these blocks. Altera’s Quartus II
already performs some such optimizations. The dual-Vdd technique proposed in this
thesis can also be improved, especially by making the placement and routing tools
aware of the voltage clusters. This would improve the results shown in Chapter 3.
Since previous studies for 2-D have shown that the optimal switch box changes
when the routing fabric contains long segments, our study of 3-D switch boxes must
be extended to FPGAs containing segments of different lengths. Furthermore, the 3-D
FPGA can include some direct connections among neighboring CLBs. Since the number
of neighbors increases when we go to 3-D, direct connections may have a larger impact
than for 2-D.
Nanotechnologies, being in their infancy, need a continuous effort to be realized
in real implementations. A detailed power and thermal model would help in evaluating
their benefits further. Besides this, the architecture can be improved to be more
robust to the high defect rates common in self-assembly.
References
[1] “Xilinx Products Documentation,” http://www.xilinx.com/literature.
[2] “Altera Product Datasheets,” http://www.altera.com/literature.
[3] “Cray XD1 Supercomputer,” http://www.cray.com/products/xd1.
[4] “SGI RASC Technology,” http://www.sgi.com/products/rasc.
[5] E. Lattanzi, A. Gayasen, M. Kandemir, V. Narayanan, L. Benini, and A. Bogliolo.
“Improving Java Performance by Dynamic Method Migration on FPGAs”. In
Int. J. Embedded Systems (to appear).
[6] “FPGA High Performance Computing Alliance,” http://www.fhpca.org.
[7] PLD Sector Overview by Jefferies and Company, Dec 2005.
[8] International technology roadmap for semiconductors. <http://public.itrs.net>.
[9] S. Brown, R. Francis, J. Rose, and Z. Vranesic. “Field-Programmable Gate Arrays”.
Kluwer Academic Publishers, May 1992.
[10] V. Betz and J. Rose. “VPR: A New Packing, Placement and Routing Tool for
FPGA Research”. In International Workshop on Field-programmable Logic and
Applications, 1997.
[11] V. George, H. Zhang, and J. Rabaey. “The design of a low energy FPGA”. In
Proceedings of International Symposium on Low Power Electronics and Design,
1999.
[12] E. Kusse and J. Rabaey. “Low-Energy Embedded FPGA Structures”. In Proceed-
ings of International Symposium on Low Power Electronics and Design, 1998.
[13] L. Shang, A. S. Kaviani, and K. Bathala. “Dynamic power consumption in
Virtex[tm]-II FPGA family”. In Proceedings of ACM/SIGDA International Sym-
posium on Field-programmable gate arrays, 2002.
[14] K. Poon, A. Yan, and S. Wilton. “A flexible Power Model for FPGAs”. In Proceed-
ings of International Conference on Field Programmable Logic and Applications,
2002.
[15] F. Li, D. Chen, L. He, and J. Cong. “Architecture Evaluation for Power-Efficient
FPGAs”. In Proceedings of ACM/SIGDA International Symposium on Field-
programmable gate arrays, 2003.
[16] A. Singh and M. Marek-Sadowska. “Efficient Circuit Clustering for Area and Power
Reduction in FPGAs”. In Proceedings of ACM/SIGDA International Symposium
on Field-programmable gate arrays, 2002.
[17] Kaushik Roy. “Power-Dissipation Driven FPGA Place and Route Under Timing
Constraints”. In IEEE Trans. on Circuits and Systems-I, Vol. 46, No. 5, May 1999.
[18] B. Kumthekar, L. Benini, E. Macii, and F. Somenzi. “In-place power optimiza-
tion for LUT-based FPGAs”. In Proceedings of ACM/IEEE Design Automation
Conference, 1998.
[19] J. Lamoureux and S.J.E. Wilton. “On the Interaction between Power-Aware FPGA
CAD Algorithms”. In IEEE International Conference on Computer Aided Design,
November 2003.
[20] A. Gayasen and N. Vijaykrishnan. “Architecture and Design Flow Optimizations
for Power-Aware FPGAs”. In VLSI Handbook, CRC Press (to appear).
[21] T. Tuan and B. Lai. “Leakage Power Analysis of a 90nm FPGA”. In Proceedings
of Custom Integrated Circuits Conference, 2003.
[22] B. Calhoun, F. Honore, and A. Chandrakasan. “Design Methodology for Fine-
grained Leakage Control in MTCMOS”. In Proceedings of International Symposium
on Low Power Electronics and Design, 2003.
[23] A. Gayasen, Y. Tsai, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan.
“Reducing leakage energy in FPGAs using region-constrained placement”. In Pro-
ceedings of International Symposium on Field-programmable gate arrays, 2004.
[24] N. Azizi and F. N. Najm. “Look-Up Table Leakage Reduction for FPGAs”. In
Proceedings of Custom Integrated Circuits Conference, 2005.
[25] S. Srinivasan, A. Gayasen, N. Vijaykrishnan, M. J. Irwin, and T. Tuan. “Leakage
Control in FPGA Routing Fabric”. In Proceedings of Asia and South Pacific Design
Automation Conference, 2005.
[26] A. Lodi, L. Ciccarelli, and R. Giansante. “Combining Low-Leakage Techniques for
FPGA Routing Design”. In Proceedings of ACM/SIGDA international symposium
on Field-programmable gate arrays, 2005.
[27] A. Rahman and V. Polavarapuv. “Evaluation of Low-Leakage Design Techniques
for Field Programmable Gate Arrays”. In Proceedings of ACM/SIGDA Interna-
tional Symposium on Field-programmable gate arrays, 2004.
[28] A. Rahman, S. Das, T. Tuan, and A. Rahut. “Heterogeneous Routing Architecture
for Low-Power FPGA Fabric”. In Custom Integrated Circuits Conference, 2005.
[29] H. Hassan, M. Anis, and M. Elmasry. “LAP: a logic activity packing methodology
for leakage power-tolerant FPGAs”. In Proceedings of International Symposium
on Low Power Electronics and Design, 2005.
[30] J. H. Anderson, F. Najm, and T. Tuan. “Active Leakage Power Optimization
for FPGAs”. In Proceedings of ACM/SIGDA International Symposium on Field-
programmable gate arrays, 2004.
[31] S. Srinivasan, A. Gayasen, N. Vijaykrishnan, Y. Xie, M. J. Irwin, and T. Tuan.
“Improving Soft-Error Tolerance of FPGA Configuration Bits”. In Proceedings of
International Conference on Computer Aided Design, 2004.
[32] N. Azizi, F. N. Najm, A. Moshovos. “Low Leakage Asymmetric-cell SRAM”. In
IEEE Intl. Symposium on Low Power Electronic Devices, 2002.
[33] K. Usami and M. Horowitz. “Clustered voltage scaling technique for low-power
design”. In Proceedings of International Symposium on Low Power Electronics and
Design, 1995.
[34] M. Takahashi et al. “A 60-mW MPEG4 Video Codec Using Clustered Voltage
Scaling with Variable Supply-Voltage Scheme”. In IEEE Journal of Solid-State
Circuits, Vol. 33, No. 11, Nov 1998.
[35] F. Li, Y. Lin, L. He, and J. Cong. “Low-power FPGA using Pre-Defined Dual-
Vdd/Dual-Vt Fabrics”. In Proceedings of ACM/SIGDA International Symposium
on Field-programmable gate arrays, 2004.
[36] F. Li, Y. Lin, and L. He. “FPGA Power Reduction Using Configurable Dual-Vdd”.
In Proceedings of ACM/IEEE Design Automation Conference, 2004.
[37] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan.
“A Dual-Vdd Low Power FPGA Architecture”. In Proceedings of ACM/SIGDA
International Conference on Field-Programmable Logic and Applications, 2004.
[38] A. Gayasen, S. Srinivasan, N. Vijaykrishnan, M. Kandemir. “Design of Power-
Aware FPGA Fabrics”. In Int. J. Embedded Systems (to appear).
[39] Y. Lin, F. Li, and L. He. “Circuits and Architectures for Field Programmable Gate
Array with Configurable Supply Voltage”. In IEEE Trans. on VLSI Systems, vol.
13, no. 9, pp. 1035-1047, Sep 2005.
[40] Y. Lin and L. He. “Leakage efficient chip-level dual-vdd assignment with time
slack allocation for FPGA power reduction”. In Proceedings of ACM/IEEE Design
Automation Conference, 2005.
[41] J. H. Anderson and F. Najm. “Low Power Programmable Routing Circuitry for
FPGAs”. In Proceedings of IEEE International Conference on Computer-Aided
Design, 2004.
[42] D. Chen, J. Cong, F. Li, and L. He. “Low-Power Technology Mapping for FPGA
Architectures with Dual Supply Voltages”. In Proceedings of International Sym-
posium on Field-programmable gate arrays, 2004.
[43] A. J. Alexander, J. P. Cohoon, Jared L. Colflesh, J. Karro, and G. Robins, “Three-
Dimensional Field-Programmable Gate Arrays,” In Proceedings of the Interna-
tional ASIC Conference, pp. 253-256, 1995.
[44] J. V. Campenhout, H. V. Marck, J. Depreitere, and J. Dambre, “Optoelectronic
FPGAs,” In IEEE Journal of Selected Topics in Quantum Electronics, pp. 306-315,
1999.
[45] M. Leeser, W. M. Meleis, M. M. Vai, S. Chiricescu, W. Xu, and P. M. Zavracky,
“Rothko: A Three-Dimensional FPGA,” In IEEE Design and Test of Computers,
pp. 16-23, 1998.
[46] G. Borriello, C. Ebeling, S. A. Hauck, S. Burns, “The triptych FPGA architecture,”
In IEEE Transactions on VLSI Systems, Vol. 3, No. 4, pp. 491-501, 1995.
[47] S. Chiricescu, M. Leeser, and M. M. Vai, “Design and analysis of a dynamically
reconfigurable three-dimensional FPGA,” In IEEE Transactions on VLSI Systems,
Vol. 9, No. 1, pp. 186-196, 2001.
[48] M. Lin, A. El Gamal, Y. Lu, and S. Wong, “Performance benefits of monolithically
stacked 3D-FPGA,” In Proceedings of the International Symposium on Field
Programmable Gate Arrays, 2006.
[49] A. Rahman, S. Das, A. Chandrakasan, and R. Reif, “Wiring requirement and
three-dimensional integration of field-programmable gate arrays,” In Proceedings of the
international workshop on System-level interconnect prediction, 2001.
[50] Y-S Kwon, P. Lajevardi, A. Chandrakasan, F. Honore, and D. E. Troxel, “A 3-D
FPGA wire resource prediction model validated using a 3-D placement and routing
tool,” In Proceedings of the international workshop on System-level interconnect
prediction, 2005.
[51] C. Ababei, H. Mogal, and K. Bazargan, “Three-dimensional Place and Route for
FPGAs,” In Proceedings of the Asia South-Pacific Design Automation Conference,
2005.
[52] C. Ababei, Y. Feng, B. Goplen, H. Mogal, T. Zhang, K. Bazargan, and S. Sapat-
nekar, “Placement and Routing in 3D Integrated Circuits,” In IEEE Design and
Test, Vol. 22, No. 6, pp. 520-531, Nov-Dec 2005.
[53] G-M. Wu, M. Shyu, and Y-W. Chang, “Universal switch blocks for three-
dimensional FPGA design,” In Proceedings of ACM/SIGDA international sym-
posium on Field-programmable gate arrays, 1999.
[54] A. Gayasen, N. Vijaykrishnan, M. Kandemir, A. Rahman. “Switch Box Archi-
tectures for Three-Dimensional FPGAs”. In Proceedings of Field-Programmable
Custom Computing Machines (FCCM), Apr 2006.
[55] D. Brooks and M. Martonosi, “Dynamic Thermal Management for High-
Performance Microprocessors,” In Proceedings of the 7th International Symposium
on High-Performance Computer Architecture, 2001.
[56] K. Sankaranarayanan, S. Velusamy, M. Stan, and K. Skadron, “A Case for
Thermal-Aware Floorplanning at the Microarchitectural Level,” In the Journal
of Instruction-Level Parallelism, Vol. 7, Oct 2005 (http://www.jilp.org/vol7).
[57] Y. Han, I. Koren, and C. A. Moritz, “Temperature aware floorplanning,” In Second
Workshop on Temperature-Aware Computer Systems (TACS-2), held in conjunction
with ISCA-32, June 2005.
[58] G. Chen and S. Sapatnekar, “Partition-driven standard cell thermal placement,”
In Proceedings of the International Symposium on Physical Design, 2003.
[59] K. Skadron, M. R. Stan, et al., “Temperature-Aware Microarchitecture,” In Pro-
ceedings of the 30th International Symposium on Computer Architecture (ISCA),
2003.
[60] G. M. Link and N. Vijaykrishnan, “Thermal Trends in Emerging Technolo-
gies,” In Proceedings of the International Symposium on Quality Electronic Design
(ISQED), 2006.
[61] J. Cong, J. Wei, and Y. Zhang. “A Thermal-Driven Floorplanning Algorithm
for 3D ICs”. In Proceedings of the International Conference on Computer-Aided
Design, Nov 2004.
[62] B. Goplen and S. S. Sapatnekar. “Efficient Thermal Placement of Standard Cells
in 3D ICs using a Force Directed Approach”. In Proceedings of the International
Conference on Computer-Aided Design, 2003.
[63] J. Cong and Y. Zhang. “Thermal Via Planning for 3-D IC’s”. In Proceedings of
the International Conference on Computer Aided Design, Nov 2005.
[64] B. Goplen and S. S. Sapatnekar. “Thermal Via Placement in 3D ICs”. Proceedings
of the ACM International Symposium on Physical Design, 2005.
[65] S. Lopez-Buedo, J. Garrido, and E. Boemo, “Dynamically Inserting, Operating,
and Eliminating Thermal Sensors of FPGA-based Systems,” In IEEE Transactions
on Components and Packaging Technologies (CPM), Vol. 25, No. 4, pp. 561-566,
December 2002.
[66] S. Velusamy et al., “Monitoring Temperature in FPGA based SoCs,” In Proceedings
of the International Conference on Computer Design (ICCD), 2005.
[67] P. Sundararajan, A. Gayasen, N. Vijaykrishnan, and T. Tuan. “Thermal
Characterization and Optimization in Platform FPGAs”. In Proceedings of the International
Conference on Computer Aided Design, Nov 2006.
[68] Andre DeHon and Michael J. Wilson, “Nanowire-Based Sublithographic Pro-
grammable Logic Arrays,” In Proc. of International Symposium on Field Pro-
grammable Gate Arrays, 2004.
[69] S.C. Goldstein and M. Budiu, “Nanofabrics: Spatial computing using molecular
electronics,” In Proceedings of the International Symposium on Computer Archi-
tecture (ISCA 2001), July 2001.
[70] J. M. Tour, W. L. Van Zandt, C. P. Husband, S. M. Husband, L. S. Wilson, P.
D. Franzon, D. P. Nackashi, “Nanocell Logic Gates for Molecular Computing,” In
IEEE Transactions on Nanotechnology 2002, 1, 100-109.
[71] A. Gayasen, N. Vijaykrishnan, and M. J. Irwin. “Exploring Technology Alterna-
tives for Nano-Scale FPGA Interconnects”. In Proceedings of the Design Automa-
tion Conference, 2005.
[72] A. Chandrakasan, W. Bowhill, and F. Fox. “Design of High-Performance Micro-
processor Circuits”. IEEE Press, 2001.
[73] Z. Chen, M. Johnson, L. Wei, and K. Roy. “Estimation of standby leakage power in
CMOS circuits considering accurate modeling of transistor stacks”. In Proceedings
of International Symposium on Low Power Electronics and Design, 1998.
[74] W. Zhang, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and V. De. “Compiler
support for reducing leakage energy consumption”. In Proceedings of the 6th Design
Automation and Test in Europe Conference, 2003.
[75] J. Kao, S. Narendra, and A. Chandrakasan. “MTCMOS Hierarchical Sizing Based
on Mutual Exclusive Discharge Patterns”. In Proceedings of the Design Automation
Conference, 1998.
[76] M. Powell, S. Yang, B. Falsafi, K. Roy, and T. Vijaykumar. “Gated-Vdd: A Circuit
Technique to Reduce Leakage in Deep-Submicron Cache Memories”. In Proceedings
of International Symposium on Low Power Electronics and Design, 2000.
[77] Xilinx Application Note. “Two Flows for Partial Reconfiguration: Module Based
or Difference Based”. http://direct.xilinx.com/bvdocs/publications/xapp290.pdf.
[78] S. Swaminathan, R. Tessier, D. Goeckel, and W. Burleson. “A Dynamically
Reconfigurable Adaptive Viterbi Decoder”. In Proceedings of ACM/SIGDA international
symposium on Field-programmable gate arrays, 2002.
[79] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and
S. Kulkarni. “Pushing ASIC performance in a power envelope”. In Proceedings of
the Design Automation Conference, 2003.
[80] “Berkeley Predictive Technology Model,”
http://www-device.eecs.berkeley.edu/~ptm/interconnect.html.
[81] S. Wilton, “Architecture and Algorithms for Field-Programmable Gate Arrays
with Embedded Memory,” PhD thesis, University of Toronto, 1997.
[82] V. Betz and J. Rose. “FPGA Routing Architecture: Segmentation and Buffering
to Optimize Speed and Density”. In Proceedings of ACM/SIGDA International
Symposium on Field-programmable gate arrays, 1999.
[83] Y-C. Ju and R. A. Saleh. “Incremental Techniques for the Identification of Stati-
cally Sensitizable Critical Paths”. In Proceedings of the Design Automation Con-
ference, 1991.
[84] T. Inukai, M. Takamiya, K. Nose, H. Kawaguchi, T. Hiramoto, and T. Sakurai.
“Boosted Gate MOS (BGMOS): Device/Circuit Cooperation Scheme to Achieve
Leakage-Free Giga-Scale Integration”. In Proceedings of Custom Integrated Circuits
Conference, 2000.
[85] J. Rose and S. Brown, “Flexibility of interconnection structures for field-
programmable gate arrays,” In IEEE J. Solid-State Circuits, vol. 26, pp. 277-282,
Mar 1991.
[86] G. Lemieux, S. Brown, and D. Vranesic, “On two-step routing for FPGAs,” In
Proceedings of the international symposium on Physical design, 1997.
[87] M. I. Masud and S. Wilton, “A new switch block for segmented FPGAs,” In
Proceedings of the International Workshop on Field Programmable Logic and Ap-
plications, 1999.
[88] P. Hallschmid and S. Wilton, “Detailed routing architectures for embedded pro-
grammable logic IP cores,” In Proceedings of the ACM/SIGDA International Sym-
posium on Field-Programmable Gate Arrays, 2001.
[89] Y.-W. Chang, D. F. Wong, and C. K. Wong, “Universal switch blocks for FPGA
design,” In ACM Transactions Design Automation of Electronic Systems, 1(1):80-
101, Jan. 1996.
[90] H. Fan, J. Liu, Y.-L. Wu, and C.-C. Cheung, “On optimum switch box designs for
2-D FPGAs,” In Proceedings of the Design Automation Conference (DAC), 2001.
[91] K. Banerjee, S. J. Souri, P. Kapur, and K. C. Saraswat, “3-D ICs: A novel chip
design for improving deep submicron interconnect performance and systems-on-
chip integration,” In Proceedings of the IEEE, Vol. 89, May 2001, pp. 602-633.
[92] S. Gupta, M. Hilbert, S. Hong, and R. Patti, “Techniques for Producing 3D ICs
with High-Density Interconnect,” 2005. Available from Tezzaron Semiconductor.
[93] Y. Yamaji, T. Ando, et al., “Thermal Characterization of Bare-Die Stacked Mod-
ules with Cu through-vias,” In Electronic Components and Technology Conference,
2001.
[94] C. S. Tan and R. Reif, “Multi-Layer Silicon Layer Stacking Based on Copper Wafer
Bonding,” In Electrochemical and Solid-State Letters, 8(6):G147-G149, 2005.
[95] Predictive Technology Model, http://www.eas.asu.edu/~ptm
[96] V. Betz, J. Rose, and A. Marquardt, “Architecture and CAD for Deep-Submicron
FPGAs,” Kluwer Academic Publishers, February 1999.
[97] A. Telikepalli, “Designing for Power Budgets and Effective Thermal
Management,” In Xcell Journal, Issue 56, 2006.
(http://www.xilinx.com/publications/xcellonline/xcell 56)
[98] “Thermal Management for 90-nm FPGAs,” Application Note 358, Altera Corpo-
ration.
[99] Yong Chen et al., “Nanoscale molecular-switch devices fabricated by imprint
lithography,” In Applied Physics Letters 82 (2003), no. 10, 1610–1612.
[100] M. C. McAlpine, R. S. Friedman, and C. M. Lieber, “Nanoimprint Lithography
for Hybrid Plastic Electronics,” In Nano Letters 3, 443-445, 2003.
[101] Brent A. Mantooth and Paul S. Weiss, “Fabrication, Assembly, and Characteri-
zation of Molecular Electronic Components,” In Proceedings of the IEEE, Vol 91,
No. 11, Nov 2003.
[102] D. R. Stewart, D. A. A. Ohlberg, P. A. Beck, Y. Chen, R. S. Williams, J. O. Jeppe-
sen, K. A. Nielsen, J. F. Stoddart, “Molecule-Independent Electrical Switching in
Pt/Organic Monolayer/Ti Devices,” In Nano Letters 2004 Vol. 4, No. 1, 133-136.
[103] Jung-Hyurk Lim, Chad A. Mirkin, “Electrostatically Driven Dip-Pen Nanolithog-
raphy of Conducting Polymers,” In Adv. Mat., 2002, 14(20), 1474-1477.
[104] Yue Wu, Jie Xiang, Chen Yang, Wei Lu, and C. M. Lieber, “Single-crystal metallic
nanowires and metal/semiconductor nanowire heterostructures,” In Nature, Vol.
430, Jul 2004.
[105] Arijit Raychowdhury, Kaushik Roy, “A Circuit Model for Carbon Nanotube Inter-
connects: Comparative Study with Cu Interconnects for Scaled Technologies,” In
Proc. of International Conference on Computer Aided Design, 2004.
[106] Andre DeHon, Patrick Lincoln, John E. Savage, “Stochastic Assembly of Sublitho-
graphic Nanoscale Interfaces,” In IEEE Transactions on Nanotechnology, Vol. 2,
No. 3, Sep 2003.
[107] Jonathan Greene, Esmat Hamdy, and Sam Beal, “Antifuse Field Programmable
Gate Arrays,” In Proceedings of the IEEE, Vol. 81, No. 7, Jul 1993.
Vita
Aman Gayasen was born in India on August 8, 1979. In 2001, he received the
B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Delhi.
After graduation, he joined Ikos Systems (now Mentor Graphics), India as a software
engineer working on a synthesis tool. He joined the Ph.D. program in Computer Science
and Engineering at the Pennsylvania State University, University Park in August 2002.
He worked as a teaching assistant from August 2002 to May 2003. Since then, he has
been employed as a research assistant in Computer Science and Engineering. He worked
as a summer intern at Xilinx Research Labs, San Jose during the summers of 2004 and
2005.
He received the National Talent Search Scholarship (a Govt. of India undergrad-
uate fellowship) from 1995 to 2001. The Pennsylvania State University awarded him the
Robert M. Owens Memorial Scholarship in 2005.
His research interests include VLSI design, CAD for VLSI, reconfigurable archi-
tectures, and nanotechnology. He is a student member of IEEE and ACM.