
Measuring and Navigating the Gap Between FPGAs and ASICs

by

Ian Carlos Kuon

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering
University of Toronto

Copyright © 2008 by Ian Carlos Kuon

Abstract

Measuring and Navigating the Gap Between FPGAs and ASICs

Ian Carlos Kuon

Doctor of Philosophy

Graduate Department of Electrical and Computer Engineering

University of Toronto

2008

Field-programmable gate arrays (FPGAs) have enjoyed increasing use due to their low non-recurring engineering (NRE) costs and their straightforward implementation process. However, it is recognized that they have higher per unit costs, poorer performance and increased power consumption compared to custom alternatives, such as application-specific integrated circuits (ASICs). This thesis investigates the extent of this gap and it examines the trade-offs that can be made to narrow it.

The gap between 90 nm FPGAs and ASICs was measured for many benchmark circuits. For circuits that only make use of general-purpose combinational logic and flip-flops, the FPGA-based implementation requires 35 times more area on average than an equivalent ASIC. Modern FPGAs also contain “hard” specific-purpose circuits such as multipliers and memories and these blocks are found to narrow the average gap to 18 for our benchmarks or, potentially, as low as 4.7 when the hard blocks are heavily used. The FPGA was found to be on average between 3.4 and 4.6 times slower than an ASIC and this gap was not influenced significantly by hard memories and multipliers. The dynamic power consumption is approximately 14 times greater on average on the FPGA than on the ASIC but hard blocks showed promise for reducing this gap. This is one of the most comprehensive analyses of the gap performed to date.

The thesis then focuses on exploring the area and delay trade-offs possible through architecture, circuit structure and transistor sizing. These trade-offs can be used to selectively narrow the FPGA to ASIC gap but past explorations have been limited in their scope as transistor sizing was typically performed manually. To address this issue, an automated transistor sizing tool for FPGAs was developed. For a range of FPGA architectures, this tool can produce designs optimized for various design objectives and the quality of these designs is comparable to past manual designs.

With this tool, the trade-off possibilities of varying both architecture and transistor sizing were explored and it was found that there is a wide range of useful trade-offs between area and delay. This range of 2.1× in delay and 2.0× in area is larger than was observed in past pure architecture studies. It was found that lookup table (LUT) size was the most useful architectural parameter for enabling these trade-offs.

Acknowledgements

Thanks must certainly first go to my supervisor, Jonathan Rose. His guidance and enthusiasm have been crucial throughout this work. Even more important, though, has been his support and concern for me as a person. I am fortunate to have worked with him.

I would also like to thank the others that have passed through Jonathan's research group. The search for clarity has been made easier by both this supportive group and the general population of Pratt 392.

My work would not have been possible without the resources and information provided by CMC Microsystems. The efforts of Eugenia Distefano and Jaro Pristupa are also appreciated since their timely support ensured that computer problems never slowed me down.

This work was improved by the opportunities to present it at Actel, Altera and Xilinx as they provided essential information and feedback. In particular, Richard Cliff of Altera provided information that was crucial to much of this work. I also benefited greatly from the experience I gained through my internships at Altera.

My inherent cheapness might have barred me from graduate school. Therefore, I am lucky to have been generously funded during my PhD by an NSERC Canada Graduate Scholarship, a Mary Beatty Scholarship, a Rogers Scholarship, a Government of Ontario/Montrose Werry Scholarship in Science and Technology, my supervisor (whose funds for me came from an NSERC Discovery Grant and Altera), my parents and my wife. I am extremely thankful to all these sources for ensuring that, despite now spending 11 years in school (or 23 depending how you count), I have never wanted for anything.

I greatly appreciate the support and patience of my wife throughout this work. My many years in graduate school would have felt even longer without her.

Finally, the encouragement and support of my parents was essential for this work. Much of the credit for any of my achievements is owed to them.

Contents

List of Tables
List of Figures
List of Acronyms

1 Introduction
  1.1 Measuring the FPGA to ASIC Gap
  1.2 Navigating the Gap
  1.3 Organization

2 Background
  2.1 FPGA Architecture
    2.1.1 Logic Block Architecture
    2.1.2 Routing Architecture
    2.1.3 Heterogeneity
  2.2 FPGA Circuit Design
  2.3 FPGA Transistor Sizing
  2.4 FPGA Assessment Methodology
    2.4.1 FPGA CAD Flow
    2.4.2 Area Model
    2.4.3 Performance Measurement
  2.5 Automated Transistor Sizing
    2.5.1 Static Transistor Sizing
    2.5.2 Dynamic Sizing
    2.5.3 Hybrid Approaches to Sizing
    2.5.4 FPGA-Specific Sizing
  2.6 FPGA to ASIC Gap

3 Measuring the Gap
  3.1 Comparison Methodology
    3.1.1 Benchmark Circuit Selection
  3.2 FPGA CAD Flow
  3.3 ASIC CAD Flow
    3.3.1 ASIC Synthesis
    3.3.2 ASIC Placement and Routing
    3.3.3 Extraction and Timing Analysis
  3.4 Comparison Metrics
    3.4.1 Area
    3.4.2 Delay
    3.4.3 Power
  3.5 Measurement Results
    3.5.1 Area
    3.5.2 Delay
    3.5.3 Dynamic Power Consumption
    3.5.4 Static Power Consumption
  3.6 Summary

4 Automated Transistor Sizing
  4.1 Uniqueness of FPGA Transistor Sizing Problem
    4.1.1 Programmability
    4.1.2 Repetition
  4.2 Optimization Tool Inputs
    4.2.1 Logical Architecture Parameters
    4.2.2 Electrical Architecture Parameters
    4.2.3 Optimization Objective
  4.3 Optimization Metrics
    4.3.1 Area Model
    4.3.2 Performance Modelling
  4.4 Optimization Algorithm
    4.4.1 Phase 1 – Switch-Level Transistor Models
    4.4.2 Phase 2 – Sizing with Accurate Models
  4.5 Quality of Results
    4.5.1 Comparison with Past Routing Optimizations
    4.5.2 Comparison with Past Logic Block Optimization
    4.5.3 Comparison to Exhaustive Search
    4.5.4 Optimizer Run Time
  4.6 Summary

5 Navigating the Gap through Area and Delay Trade-offs
  5.1 Area and Performance Measurement Methodology
    5.1.1 Performance Measurement
    5.1.2 Area Measurement
  5.2 Transistor Sizing Trade-offs
  5.3 Definition of “Interesting” Trade-offs
  5.4 Trade-offs with Transistor Sizing and Architecture
    5.4.1 Impact of Elasticity Threshold Factor
  5.5 Logical Architecture Trade-offs
    5.5.1 LUT Size
    5.5.2 Cluster Size
    5.5.3 Segment Length
  5.6 Circuit Structure Trade-offs
    5.6.1 Buffer Positioning
    5.6.2 Multiplexer Implementation
  5.7 Trade-offs and the Gap
    5.7.1 Comparison with Commercial Families
  5.8 Summary

6 Conclusions and Future Work
  6.1 Contributions
  6.2 Future Work
    6.2.1 Measuring the Gap
    6.2.2 Navigating the Gap
  6.3 Concluding Remarks

Appendices

A FPGA to ASIC Comparison Details
  A.1 Benchmark Information
  A.2 FPGA to ASIC Comparison Data

B Representative Delay Weighting
  B.1 Benchmark Statistics
  B.2 Representative Delay Weights

C Multiplexer Implementations
  C.1 Multiplexer Designs
  C.2 Evaluation of Multiplexer Designs

D Architectures and Results from Trade-off Investigation

E Logical Architecture to Transistor Sizing Process

Bibliography

List of Tables

2.1 Altera Stratix II Memory Blocks [16]
3.1 Summary of Process Characteristics
3.2 Benchmark Summary
3.3 Area Ratio (FPGA/ASIC)
3.4 Area Gap Estimation with Full Heterogeneous Block Usage
3.5 Area Ratio (FPGA/ASIC) – FPGA Area Measurement Accounting for Logic Blocks with Partial Utilization
3.6 Critical Path Delay Ratio (FPGA/ASIC) – Fastest Speed Grade
3.7 Critical Path Delay Ratio (FPGA/ASIC) – Slowest Speed Grade
3.8 Impact of Retiming on FPGA Performance with Heterogeneous Blocks
3.9 Dynamic Power Consumption Ratio (FPGA/ASIC)
3.10 Dynamic Power Consumption Ratio (FPGA/ASIC) for Different Measurement Methods
3.11 Static Power Consumption Ratio (FPGA/ASIC) at 25 °C with Typical Silicon
3.12 Static Power Consumption Ratio (FPGA/ASIC) at 85 °C with Worst Case Silicon
3.13 FPGA to ASIC Gap Measurement Summary
4.1 Logical Architecture Parameters Supported by the Optimization Tool
4.2 Area Model versus Layout Area
4.3 Comparison of Routing Driver Optimizations
4.4 Comparison of Logic Cluster Delays from [18] for 180 nm with K = 4
4.5 Comparison of LUT Delays from [18] for 180 nm with N = 4
4.6 Comparison of Logic Cluster Delays from [130] for 350 nm with K = 4
4.7 Comparison of LUT Delays from [130] for 350 nm with N = 4
4.8 Comparison of Logic Cluster Delays from [14] for 350 nm CMOS with K = 4 and N = 4
4.9 Exhaustive Search Comparison
5.1 Comparison of Delay Measurements between HSPICE and VPR for 20 Circuits
5.2 Architecture Parameters
5.3 Area and Delay Changes from Transistor Sizing and Past Architectural Studies
5.4 Range of Parameters Considered for Transistor Sizing and Architecture Investigation
5.5 Optimization Objectives
5.6 Span of Different Sizings/Architecture
5.7 Span of Interesting Designs with Varied LUT Sizes
5.8 Span of Interesting Designs with Varied Cluster Sizes
5.9 Span of Interesting Designs with Varied Segment Lengths
5.10 Comparison of Multiplexer Implementations
5.11 Number of Transistors per Input for Various Multiplexer Widths
5.12 Potential Impact of Area and Delay Trade-offs on Soft Logic FPGA to ASIC Gap
5.13 Area and Delay Trade-off Ranges Compared to Commercial Devices
A.1 Benchmark Descriptions
A.2 FPGA and ASIC Operating Frequencies
A.3 FPGA and ASIC Dynamic Power Consumption
A.4 FPGA and ASIC Static Power Consumption – Typical
A.5 FPGA and ASIC Static Power Consumption – Worst Case
A.6 Impact of Retiming on FPGA Performance
B.1 Normalized Usage of FPGA Components
B.2 Usage of LUT Inputs
B.3 Representative Path Weighting Test Weights
D.1 Parameters Considered for Design Space Exploration
D.2 Architectures and Partial Results from Design Space Exploration
E.1 Architecture Parameters
E.2 Transistor Sizes for Example Architecture

List of Figures

2.1 Generic FPGA
2.2 Basic Logic Elements (BLEs) and Logic Clusters [14]
2.3 Heterogeneous FPGAs
2.4 Altera Stratix II Logic Element [16]
2.5 Altera Stratix II DSP Block [16]
2.6 Routing Architecture Parameters [14]
2.7 Routing Segment Lengths
2.8 Routing Driver Styles
2.9 Multiplexer Usage
2.10 Implementation of Four Input Multiplexer and Buffer
2.11 Alternate Multiplexer Implementations
2.12 FPGA CAD Flow
2.13 Minimum Width Transistor Area
3.1 ASIC CAD Flow
3.2 Area Gap Compared to Benchmark Sizes for Soft-Logic Benchmarks
3.3 Effect of Hard Blocks on Area Gap
3.4 Area Gap vs. Average FPGA Interconnect Usage
3.5 Effect of Hard Blocks on Delay Gap
3.6 Speed Gap Compared to Benchmark Sizes for Logic Only Benchmarks
3.7 Speed Gap Compared to the Area Gap
3.8 Effect of Hard Blocks on Power Gap
4.1 Repeated Equivalent Parameters
4.2 FPGA Optimization Path
4.3 FPGA Optimization Methodology
4.4 Switch-level RC Transistor Model
4.5 Example of a Routing Track modelled using RC Transistor Models
4.6 Pseudocode for Phase 2 of Transistor Sizing Algorithm
4.7 Test Structure for Routing Track Optimization
4.8 Logic Cluster Structure and Timing Paths
5.1 Performance Measurement Methodology
5.2 Area Delay Space
5.3 Determining Designs that Offer Interesting Trade-offs
5.4 Area Delay Space with Interesting Region Highlighted
5.5 Full Area Delay Space
5.6 Impact of Elasticity Factor on Area, Delay and Area-Delay Ranges
5.7 Area Delay Space with Varied LUT Sizing
5.8 Area Delay Space with Varied Cluster Sizes
5.9 Area Delay Space with Varied Routing Segment Lengths
5.10 Buffer Positioning around Multiplexers
5.11 Area Delay Trade-offs with Varied Pre-Multiplexer Inverter Usage
5.12 Transistor Counts for Varied Multiplexer Implementations
5.13 Transistor Counts for Varied Multiplexer Implementations using a Single Configuration Bit Output
5.14 Area Delay Trade-offs with Varied Multiplexer Implementations
B.1 Input-dependant Delays through the LUT
B.2 Area and Delay with Varied Representative Path Weightings
C.1 Two Level 16-input Multiplexer Implementations
C.2 Area Delay Trade-offs with Varied 16-input Multiplexer Implementations
C.3 Area Delay Trade-offs with Varied 32-input Multiplexer Implementations
E.1 Terminology for Transistor Sizes

List of Acronyms

ALM Adaptive Logic Module

ALUT adaptive lookup table

ASIC application-specific integrated circuit

BLE Basic Logic Element

CAD computer-aided design

CMP Circuits Multi-Projets

CLB Cluster-based Logic Block

CMOS Complementary Metal Oxide Semiconductor

DFT Design for Testability

FPGA Field-programmable gate array

HDL hardware description language

LAB Logic Array Block

LUT lookup table

MPGA Mask-programmable Gate Array

MWTA minimum-width transistor areas

NMOS n-channel MOSFET

NRE non-recurring engineering

PLL phase-locked loop

PMOS p-channel MOSFET

QIS Quartus II Integrated Synthesis

SRAM Static random access memory

VCD Value Change Dump


Chapter 1

Introduction

Field-programmable gate arrays (FPGAs) have become a standard medium for implementing digital circuits and they are now used in a wide variety of markets including telecommunications, automotive systems, high-performance computers and consumer electronics. The primary advantages of FPGAs are that they offer lower non-recurring engineering (NRE) costs and faster time to market than more customized approaches such as full-custom or application-specific integrated circuit (ASIC) design. This provides digital circuit designers with access to many of the benefits of the latest process technologies without the expense and effort that accompany these technologies when custom design is used.

This simplified access to new technologies is possible because of the pre-fabricated and programmable nature of FPGAs. With pre-fabrication, the challenges associated with the latest processes are almost entirely shifted to the FPGA manufacturer whereas, for custom fabrication, significant time and money must be spent on large teams of engineers to address issues such as signal integrity, power distribution, process variability and soft errors. Once these challenges are addressed and a design is finalized, the benefits of FPGAs become even clearer since, due to the programmability of an FPGA, the design can be implemented on an FPGA in seconds and the only cost of this implementation is that of the FPGA itself. In contrast, for ASIC or full-custom designs, it takes months and millions of dollars to create the masks defining the design and then fabricate the silicon implementation [1]. The combined effect of these factors is that, while an ASIC design cycle easily takes a year and a full-custom design even longer, the FPGA-based design cycle can be completed in months for at least an order of magnitude lower costs.

However, FPGAs suffer from a number of significant limitations. Compared to the non-programmable alternatives, FPGAs have much higher per unit costs, lower performance and higher power consumption. Higher per unit costs arise because, compared to custom designs, FPGAs require more silicon area to implement the same functionality. This increased area not only affects costs, it also limits the size of the designs that can be implemented with FPGAs. Lower performance can drive up costs as well as more parallelism (and hence greater area) may be needed to achieve a performance target or, worse, it simply may not be possible to achieve the desired performance on an FPGA. Similarly, higher power consumption often precludes FPGAs from power-sensitive markets. Together, this area, performance and power gap limits the markets for FPGAs.

Since this gap limits the use of FPGAs, research into FPGAs and their architecture has focused, implicitly or explicitly, on narrowing the gap. As a result, significant improvements have been made in industry and academia at improving FPGAs and reducing the gap relative to their alternatives; however, the gap itself has not been studied extensively. Its magnitude has only been measured through limited anecdotal or point comparisons [2, 3, 4, 5]. As well, it has not been fully appreciated, at least academically, that through varied architecture and electrical design, FPGAs can be created with a wide range of area, delay and performance characteristics. These possibilities create a large design space within which trade-offs can be made to reduce area at the expense of performance or to improve performance at the expense of area. However, the extent to which such trade-offs can be used to selectively narrow this gap is largely unknown.

Considering such trade-offs and thereby navigating the gap has become particularly important as the use of FPGAs has expanded beyond their traditional markets. This broader range of markets has made it necessary to develop multiple distinct FPGA families that cater to the varied needs of these markets and, indeed, it has become a standard trend for FPGA manufacturers to offer both a high-performance/high-cost family [6, 7] and a lower-cost/low-performance family [8, 9]. If the FPGA market expands further, it is likely that a greater number of FPGA families will be necessary and, therefore, it is useful to examine the range of possible designs and the extent to which the gap can be managed through varied design choices.

The goal of this work is to improve the understanding of the area, performance and power gaps faced by FPGAs. This is done by first measuring the gap between FPGAs and ASICs. It will be shown that this gap is large and the latter portion of this work explores the design of FPGAs and how best to navigate the gap.


1.1 Measuring the FPGA to ASIC Gap

It has long been accepted that FPGAs suffer in terms of area, performance and power consumption relative to the many more customized alternatives such as full custom design, ASICs, Mask-programmable Gate Arrays (MPGAs) and structured ASICs. In this dissertation, the gap between a modern FPGA and a standard cell ASIC will be quantified. ASICs are selected as the comparison point because they are currently the standard alternative to FPGAs when lower cost, better performance or lower power is desired. Full custom design is typically only possible for extremely high volume products and structured ASICs are not in widespread use. Measurements of the FPGA to ASIC gap are useful for both the FPGA designers and architects who aim to narrow this gap and the system designers who select the implementation platform for their design.

This comparison is non-trivial given the wide range of digital circuit applications and the complexity of modern FPGAs and ASICs. An experimental approach, that will be described in detail, is used to perform the comparison. One of the challenges, that also makes this comparison interesting, is that FPGAs no longer consist of a homogeneous array of programmable logic elements. Instead, modern FPGAs have added hard special-purpose blocks, such as multipliers, memories and processors [6, 7], that are not fully programmable and are often ASIC-like in their construction. The selection of the functionality to include in these hard blocks is one of the central questions faced by FPGA architects and this dissertation quantitatively explores the impact of these blocks on the area, performance and power gaps. This is the first published work to perform a detailed analysis of these gaps for modern FPGAs.

1.2 Navigating the Gap

Simply measuring the FPGA to ASIC gap is the first step towards understanding the changes that can help narrow it. Given the complexity of modern FPGAs it is often not possible for any single innovation to universally narrow the area, performance and power gaps. Instead, as FPGAs inhabit a large design space comprising the wide range of architectural and electrical design possibilities, trade-offs within this space that narrow one dimension of the gap at the expense of another must be considered. Navigating the gap through the exploration and exploitation of these trade-offs is the second focus of this dissertation.

Exploring the breadth of the design space requires that all aspects of FPGA design be considered from the architectural level, which defines the logical behaviour of an FPGA, down to the transistor-level. With such a broad range of possibilities to consider, detailed manual optimization at the transistor-level is not feasible. Therefore, to enable this exploration, an automated transistor sizing tool was developed. Transistor-level design of FPGAs has unique challenges due to the programmable nature of FPGAs which means that the eventual critical paths are unknown at the design time of the FPGA itself. An additional challenge is that architectural requirements constrain the optimizations possible at the transistor-level. These challenges are described and investigated during the design of the optimization tool.

With this transistor-level design tool and a previously developed architectural exploration tool, VPR [10], it is possible to explore a range of architectures, circuit topologies and transistor sizings. The trade-offs that are possible, particularly between performance and area, will be examined to determine the magnitude of the trade-offs possible, the most effective parameters for making trade-offs and the impact of these trade-offs on the FPGA to ASIC gap.

1.3 Organization

The remainder of this thesis is organized as follows. Chapter 2 provides related background information on FPGA architecture, FPGA computer-aided design (CAD) tools, past measurements of the gaps between FPGAs and ASICs and automated transistor sizing.

Chapter 3 focuses on measuring the gap between FPGAs and ASICs. It describes the empirical process used to quantify the area, performance and power gaps and then presents the measurements obtained using that process. These results are analyzed in detail to investigate the impact of a number of factors including the use of hard special-purpose blocks.

The remainder of this work, in Chapters 4 and 5, is centred on navigating the FPGA to ASIC gap. The transistor-level design tool developed to aid this investigation is described in Chapter 4. That tool is used in Chapter 5 to explore the trade-offs that are possible in FPGA design. Throughout this exploration the implications for the gap between FPGAs and ASICs are considered.

Finally, Chapter 6 concludes with a summary of the primary contributions of this work and possible avenues for future research. The appendices following that chapter provide much of the raw data underlying the work presented in this thesis.

Chapter 2

Background

One goal of this thesis is to measure and understand the FPGA to ASIC gap. The gap is affected by many aspects of FPGA design including the FPGA's architecture, the circuit structures used to implement the architectural features, and the sizing of the transistors within those circuits. In this chapter, the terminology and the conventional design approaches for these three areas are summarized. As well, the standard methodology for assessing the quality of an FPGA is reviewed. Such accurate assessments require the complete transistor-level design of the FPGA. However, transistor-level design is an arduous task and prior approaches to automated transistor sizing will be reviewed in this chapter. Finally, previous attempts at measuring the FPGA to ASIC gap are reviewed. This review will describe the issues that necessitated the more accurate comparison performed as part of this thesis.

2.1 FPGA Architecture

FPGAs have three primary components: logic blocks which implement logic functions, I/O blocks which provide the off-chip interface, and routing that programmably makes the connections between the various blocks. Figure 2.1 illustrates the use of these components to create a complete FPGA. The global structure and functionality of these components comprise what is termed the architecture, or more specifically the logical architecture, of an FPGA and, in this section, the major architectural parameters for both the logic block and the routing are reviewed. I/O block architecture will not be examined as it is not explored in this thesis. This review will primarily focus on defining the architectural terms that will be explored in this work.

Figure 2.1: Generic FPGA (an array of logic blocks, programmable routing and I/O blocks)

2.1.1 Logic Block Architecture

It is the logic block that implements the logical functionality of a circuit and, therefore, its architecture significantly impacts the area, performance and power consumption of an FPGA. Logic blocks have conventionally and most commonly been built around a lookup table (LUT) with K inputs that can implement any digital function of K inputs and one output [11, 12, 8, 13]. Each LUT is generally paired with a flip-flop to form a Basic Logic Element (BLE) [14] as illustrated in Figure 2.2(a). The output from each logic element is programmably selected either from the LUT or the flip-flop. Modern FPGAs have added many features to their logic elements including additional logic to improve arithmetic operations [7, 6] and LUTs that can also be configured to be used as memories [6, 7] or shift registers [7]. LUTs have also evolved away from simple K input, one output structures to fracturable structures that allow larger LUTs to be split into multiple smaller LUTs. For example, a 6-LUT can be split into two independent 4-LUTs [15, 16, 6, 17]. The specific features of the commercial FPGAs that will be used for a portion of the work in this thesis will be described at the end of this section.
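As a concrete illustration of the BLE just described, the following minimal behavioural sketch (Python, with hypothetical names; it is not code from the thesis) models a K-input LUT as a 2^K-entry truth table paired with a flip-flop, with a configuration bit selecting the registered or combinational output.

    # Minimal behavioural sketch of a Basic Logic Element (BLE): a K-input LUT
    # paired with a flip-flop, with the output programmably selected from either.
    # Illustrative only; class and member names are hypothetical.
    class BLE:
        def __init__(self, k, truth_table, registered=False):
            assert len(truth_table) == 2 ** k, "one configuration bit per input pattern"
            self.k = k
            self.truth_table = truth_table   # the LUT's SRAM configuration bits
            self.registered = registered     # output-select configuration bit
            self.ff = 0                      # flip-flop state

        def lut(self, inputs):
            # The K input bits form an address into the truth table.
            address = sum(bit << i for i, bit in enumerate(inputs))
            return self.truth_table[address]

        def clock_edge(self, inputs):
            # On a rising clock edge the flip-flop captures the LUT output.
            self.ff = self.lut(inputs)

        def output(self, inputs):
            return self.ff if self.registered else self.lut(inputs)

    # A 4-LUT configured as a 4-input AND gate with a combinational output.
    and4 = BLE(4, [1 if address == 0b1111 else 0 for address in range(16)])
    print(and4.output([1, 1, 1, 1]))   # -> 1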

Figure 2.2: Basic Logic Elements (BLEs) and Logic Clusters [14]. Panels: (a) Basic Logic Element (BLE); (b) Logic Cluster.

A BLE by itself could be used as a logic block in the array of Figure 2.1 but it is now more common for BLEs to be grouped into logic clusters of N BLEs as shown in Figure 2.2(b). (These logic clusters will also be referred to as Cluster-based Logic Blocks (CLBs).) This is advantageous because it is frequently possible for input and output signals to be shared within the cluster [14, 18]. Specifically, it has been observed for logic clusters with N BLEs containing K-input LUTs that setting the number of inputs, I, as

    I = (K/2)(N + 1)    (2.1)

is sufficient to enable all the BLEs to be used in nearly all cases [18]. The intra-cluster routing connecting the I logic block inputs to the BLE inputs is shown to be a full cross-bar in Figure 2.2 and, for simplicity, this work will assume such a configuration. However, it has been found that such flexibility is not necessary [19] and is no longer common in modern FPGAs [20].
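As a quick numeric illustration of Equation 2.1 (a hypothetical helper, not code from the thesis):

    # Worked illustration of Equation 2.1: cluster inputs I for N BLEs with K-input LUTs.
    def cluster_inputs(k, n):
        return k * (n + 1) // 2   # I = (K/2)(N + 1), rounded down to a whole pin count

    for n in (1, 8, 10):
        print(f"K=4, N={n} -> I = {cluster_inputs(4, n)}")
    # K=4, N=1 -> I = 4
    # K=4, N=8 -> I = 18   (far fewer than the N*K = 32 BLE input pins inside the cluster)
    # K=4, N=10 -> I = 22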

LUT-based logic blocks make up the soft logic fabric of an FPGA. While an FPGA could be constructed purely from homogeneous soft logic, modern FPGAs generally incorporate other types of logic blocks such as multipliers [7, 6, 8, 9], memories [7, 6, 8, 9], and processors [7]. This heterogeneous mixture of logic blocks is illustrated in Figure 2.3. These alternate logic blocks only perform specific logic operations, such as multiplication, that could have also been implemented using the soft logic fabric of the FPGA and, therefore, these blocks are considered to be hard logic. The selection of what to include as hard logic on an FPGA is one of the central questions of FPGA architecture because such blocks can provide area, performance and power benefits when used but waste area when not used. In this thesis, the impact of these hard blocks on the FPGA to ASIC gap will be examined. That investigation, in Chapter 3, focuses on one particular FPGA, the Altera Stratix II [16], and the logic block architecture of this FPGA is now briefly reviewed.

Figure 2.3: Heterogeneous FPGAs (a soft logic array with embedded multiplier and memory blocks)

Logic Block Architecture of the Altera Stratix II

The Stratix II [16], like most modern FPGAs, contains a heterogeneous mixture of soft and hard logic blocks. The soft logic block, known as a Logic Array Block (LAB), is built as a cluster of eight logic elements. These logic elements are referred to as Adaptive Logic Modules (ALMs) and a high-level view of these elements is illustrated in Figure 2.4. This logic element contains a number of additional features not found in the standard BLE described earlier. In particular to improve the performance of arithmetic operations there are dedicated adder blocks, labelled adder0 and adder1, in the figure. The carry in input to adder0 in the figure is driven by the carry out pin of the preceding logic element. This path is known as a carry chain and enables fast propagation of carry signals in multi-bit arithmetic operations. Two registers are present in the ALM because the combinational logic block can generate multiple outputs. The combinational block itself is a 6-input LUT with additional logic and inputs that enable a number of alternate configurations including the ability to implement two 4-LUTs each with four unique inputs or various other combinations of 4-, 5- and 6-LUTs with shared inputs. To reflect this ability to implement two logic functions the ALM is considered to be composed of two adaptive lookup tables (ALUTs) and these ALUTs will be used as a measure of the size of a circuit as they roughly correspond to the functionality of a 4-LUT.

Figure 2.4: Altera Stratix II Logic Element [16]

To complement the soft logic, there are four different types of hard logic blocks. Three of these blocks known as the M512, M4K and M-RAM blocks implement memories with nominal sizes of 512 bits, 4 kbits and 512 kbits respectively. To allow these memories to be used in a wide range of designs the depth and width can be programmably selected from a range of sizes. The largest memory can, for example, be used in a number of configurations ranging from 64K words by 8 bits to 4K words by 144 bits. The full listing of possible configurations for the three block types is provided in Table 2.1.

Table 2.1: Altera Stratix II Memory Blocks [16]

Memory Block   Configurations
M512           512×1, 256×2, 128×4, 64×8, 64×9, 32×16, 32×18
M4K            4K×1, 2K×2, 1K×4, 512×8, 512×9, 256×16, 256×18, 128×32, 128×36
M-RAM          64K×8, 64K×9, 32K×16, 32K×18, 16K×32, 16K×36, 8K×64, 8K×72, 4K×128, 4K×144

The other hard block used in the Stratix II is known as a DSP block and is designed to perform multiplication, multiply-add or multiply-accumulate operations. Again to broaden the usefulness of this block a number of different configurations are possible and the basic structure of the block that enables this flexibility is shown in Figure 2.5. Specifically, a single DSP block can perform eight 9x9 multiplications or four 18x18 multiplications or a single 36x36 multiplication. Depending on the size and number of multipliers used, addition or accumulation can also be performed in the block.
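To make the relationship between these configurations concrete, the sketch below (a hypothetical helper, not a tool from the thesis) picks the smallest block type from Table 2.1 whose depth and width can hold a requested memory in a single block.

    # Hypothetical helper illustrating Table 2.1: each Stratix II memory block
    # offers a fixed capacity that can be traded between depth and width.
    STRATIX_II_RAM = {
        "M512":  ["512x1", "256x2", "128x4", "64x8", "64x9", "32x16", "32x18"],
        "M4K":   ["4Kx1", "2Kx2", "1Kx4", "512x8", "512x9", "256x16", "256x18",
                  "128x32", "128x36"],
        "M-RAM": ["64Kx8", "64Kx9", "32Kx16", "32Kx18", "16Kx32", "16Kx36",
                  "8Kx64", "8Kx72", "4Kx128", "4Kx144"],
    }

    def parse(cfg):
        depth, width = cfg.split("x")
        mult = 1024 if depth.endswith("K") else 1
        return int(depth.rstrip("K")) * mult, int(width)

    def smallest_block_for(depth, width):
        """Return the smallest block type (and configuration) that covers the
        requested depth and width in a single block, or None if none fits."""
        for block in ("M512", "M4K", "M-RAM"):
            for cfg in STRATIX_II_RAM[block]:
                d, w = parse(cfg)
                if d >= depth and w >= width:
                    return block, cfg
        return None

    print(smallest_block_for(256, 16))   # -> ('M4K', '256x16')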

2.1.2 Routing Architecture

Programmable routing is necessary to make connections amongst the logic and I/O blocks. A number of global routing topologies have been proposed and used including row-based [21, 2], hierarchical [22, 23, 24] and island-style [14, 6, 7]. This thesis focuses exclusively on island-style FPGAs as it is currently the dominant approach [6, 7, 8, 9]. An island-style topology was illustrated in Figures 2.1 and 2.3.

A number of parameters define the flexibility of these island-style FPGAs and these parameters are illustrated in Figure 2.6. In this architecture, the routing network is organized into channels running between each logic block, and each individual routing resource within these channels is known as a track or routing segment. From an FPGA user's perspective each track can be viewed simply as a wire; however, the physical implementation of the track need not be just a wire. The number of tracks in a channel is the channel width, W. Each track has a logical length, L, that is defined as the number of logic blocks spanned by the track. This is illustrated in Figure 2.7. Connections between routing tracks are made at the intersection of the channels in a switch block. The number of tracks that any track can connect to in a switch block is the switch block flexibility, Fs. The specific tracks to which each track connects is defined by the switch box pattern and a number of patterns, such as disjoint and Wilton [25] patterns, have been used or analyzed [26].

Figure 2.5: Altera Stratix II DSP Block [16]

Figure 2.6: Routing Architecture Parameters [14] (channel width W, track length L, switch block flexibility Fs, and connection block flexibilities Fc,in and Fc,out)

Figure 2.7: Routing Segment Lengths (length 1, length 2 and length 4 tracks)

The number of tracks within the channel that can connect to a logic block input is the input connection block flexibility, Fc,in and the number of tracks to which a logic block output can connect is the output connection block flexibility, Fc,out. While this output connection block is shown as distinct from the switch block, the two are actually merged in many commercial architectures [7, 6].
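These parameters are often bundled into a single architecture description; the sketch below is a hypothetical grouping of the terms defined above (it is not the VPR architecture-file format, just an illustrative container).

    # Hypothetical container for the island-style routing parameters defined above.
    from dataclasses import dataclass

    @dataclass
    class RoutingArchitecture:
        W: int            # channel width: number of tracks per routing channel
        L: int            # logical segment length: logic blocks spanned by a track
        Fs: int           # switch block flexibility: tracks reachable at a switch block
        Fc_in: float      # input connection block flexibility, as a fraction of W
        Fc_out: float     # output connection block flexibility, as a fraction of W
        single_driver: bool = True   # single-driver tracks (Section 2.1.2)

    # Roughly the situation drawn in Figure 2.6: W = 4, Fs = 3, Fc,in = 2/4, Fc,out = 1/4.
    example = RoutingArchitecture(W=4, L=1, Fs=3, Fc_in=0.5, Fc_out=0.25)
    print(example)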

One significant attribute of the routing architecture is the nature of the connections driving each routing track. In the past, approaches that allow each routing track to be driven from multiple points along the track were common [14]. These multi-driver designs required some form of tri-state mechanism on all potential drivers. A single driver approach is now widely used instead [27, 6, 7]. These different styles are illustrated in Figure 2.8. The single driver approach, while reducing the flexibility of the individual routing tracks, was found to be advantageous for both area and performance reasons [20, 28] because it allows standard inverters to drive each routing track instead of the tri-state buffers or pass transistors required for the multi-driver approaches. Single-driver routing is the only type of routing that will be considered in this work.

Figure 2.8: Routing Driver Styles. Panels: (a) Multiple Driver Routing; (b) Single Driver Routing.

2.1.3 Heterogeneity

One aspect of FPGA architecture that transects the preceding discussions about logic block and routing architectures is the introduction of heterogeneity to FPGAs. At the highest-level FPGAs appear very regular as was shown in Figure 2.1 but such regularity need not be the case. One example of this was described in Section 2.1.1 and illustrated in Figure 2.3 in which heterogeneous logic blocks were added to the FPGA. Similarly, routing resources can also vary throughout the FPGA and ideas such as having more routing in the centre of the FPGA or near the periphery have been investigated in the past [14].

There are two possible forms of heterogeneity: tile-based and resource-based. Tile-based heterogeneity refers to the selection of logic blocks and routing parameters across the FPGA. It is termed tile-based because FPGAs are generally constructed as an array of tiles with each tile containing a single logic block and any adjacent routing. A single tile can be replicated to create a complete FPGA [29] (ignoring boundary issues) if the routing and logic block architecture is kept constant. Alternatively, additional tiles, with varied logic blocks or routing, can be intermingled in the array as desired; however, such tiles must be used efficiently to justify both their design and silicon costs. While the previous example of heterogeneity in Figure 2.3 added heterogeneous features with differing functions, it is also possible for this heterogeneity to be introduced between tiles that are functionally identical by varying other characteristics such as their performance.

The other source of architectural heterogeneity, at the resource-level, occurs within each tile. Both the logic block and the routing are composed of individual resources such as BLEs and routing tracks respectively. Each of these individual resources could potentially have its own unique characteristics or some or all of the resources could be defined similarly. In this latter case, resources that are to be architecturally similar will be called logically equivalent. Again, the determination of which resources to make logically equivalent requires a balance between the design costs of making resources unique with the potential benefits of introducing non-uniformity.

While this thesis will not extensively explore issues of heterogeneity, maintaining logical equivalence has significant electrical design implications. All resources that are logically equivalent must present the same behaviour (ignoring differences due to the process variations after manufacturing) and, therefore, must have the same implementation at the transistor-level.

2.2 FPGA Circuit Design

The architectural parameters described in the previous section define the logical behaviour of an FPGA but, for any architecture, there are a multitude of possible circuit-level implementations. This section reviews the standard design practices for these circuits in FPGAs. There are a number of restrictions that are placed on the FPGAs that are considered in this work which limit the circuit structures that must be considered. First, only SRAM-based FPGAs, in which programmability is implemented using static memory bits, will be used in this work because this approach is the dominant approach used in industry [6, 7, 8, 9]. As well, with the exception of the measurements of the FPGA to ASIC gap, homogeneous soft logic-based FPGAs with BLEs as shown in Figure 2.2(a) will be assumed. Finally, as mentioned previously, only single-driver routing will be considered in this work.

Figure 2.9: Multiplexer Usage. Panels: (a) Multiplexer as a lookup table (LUT); (b) Multiplexer as programmable routing.

Given these restrictions, the only required circuit structures are inverters, multiplexers, flip-flops and memory bits. Of these components, the flip-flops found in the BLEs are only a small portion of the design and a standard master-slave arrangement can be assumed [30]. The memory bits comprise a significant portion of the FPGA as they store the configuration for all the programmable elements. These memory bits are implemented using a standard six-transistor SRAM cell [14]. Similarly, the design of inverters is straightforward and they are added as needed for buffering or logical inversion purposes.

This leaves multiplexers, which are used both to implement logic and to enable programmable routing; these two uses are illustrated in Figure 2.9. Due to this varied usage, multiplexers may range in size from 2 inputs to 30 or more inputs. As FPGAs are replete with such multiplexers, their implementation affects the area and performance of an FPGA significantly and, therefore, to reduce area and improve performance multiplexers are generally implemented using NMOS pass transistor networks.

Figure 2.10: Implementation of Four Input Multiplexer and Buffer

The use of only NMOS pass transistors poses a potential problem since an NMOS pass transistor with a gate voltage of VDD is unable to pass a signal at VDD from source to drain. Left unaddressed, this could lead to excessive static power consumption because the reduced output voltage from the multiplexer prevents the PMOS device in the following inverter from being fully cut off. A standard remedy for this issue is the use of level-restoring PMOS pull-up transistors [15] as shown in Figure 2.10. An alternative solution of gate boosting, in which the gate voltage is raised above the standard VDD, has also been used [14]. Another less common alternative is to use complementary transmission gates (with an NMOS and a PMOS) to construct the multiplexer tree [31, 32]; however, such an approach could typically only be used selectively as use throughout the FPGA would incur a massive area penalty.
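For reference, a standard first-order expression (not stated in the thesis) for the degraded high level that these remedies compensate for is

    V_out,high = V_DD - V_tn

where V_tn is the NMOS threshold voltage, made larger still by the body effect; the level-restoring pull-up or the boosted gate voltage restores the multiplexer output to a full V_DD so the downstream inverter's PMOS device turns off completely.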

There are also a number of alternatives for implementing the pass transistor tree. An example of this is shown in Figure 2.11 in which two different implementations of an 8-input multiplexer are shown. The different approaches trade off the number of memory bits required with both the number of pass transistors required in the design and the number through which a signal must pass. A range of strategies have been used including fully encoded [14] (which uses the minimal number of memory bits), three-level partially-decoded structures [33] and two-level partially-decoded structures [15, 31, 34]. This issue of multiplexer structure is one area that is explored in this thesis.

In that exploration, multiplexers will be classified according to the number of pass transistors a signal must travel through from input to output. At one extreme, a one-level (or equivalently one-hot) multiplexer has only a single pass transistor on the signal path and a two-level or three-level multiplexer has two or three pass transistors respectively. For simplicity, it will be assumed that the multiplexers are homogeneous with all multiplexer inputs passing through the same number of pass transistors. However, it is often necessary or useful as an optimization [15] to have shorter paths through the multiplexer.

Figure 2.11: Alternate Multiplexer Implementations. Panels: (a) Two-level 8-input multiplexer; (b) Three-level 8-input multiplexer.

These varied implementation approaches are generally only used for the routing multiplexers. The multiplexers used to implement LUTs are typically constructed using a fully encoded structure [14]. This avoids the need for any additional decode logic on any user signal paths.
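A rough counting sketch (hypothetical, assuming the width W is a power of two and evenly sized groups; real multiplexers often mix levels and share decode circuitry differently) makes the trade-off between pass-transistor count, configuration bits and path length concrete:

    import math

    def one_level(w):
        # One-hot: one pass transistor and one SRAM bit per input; 1 transistor on the path.
        return {"pass_transistors": w, "sram_bits": w, "levels": 1}

    def two_level(w):
        # Partially decoded: groups of roughly sqrt(W) share first-level selects and a
        # second level picks the group; 2 transistors on the path.
        s = int(round(math.sqrt(w)))
        return {"pass_transistors": w + math.ceil(w / s),
                "sram_bits": s + math.ceil(w / s),
                "levels": 2}

    def fully_encoded(w):
        # Binary tree: minimal SRAM (log2 W bits) but about 2W - 2 transistors and
        # the longest path, log2 W pass transistors in series.
        n = int(math.log2(w))
        return {"pass_transistors": 2 * w - 2, "sram_bits": n, "levels": n}

    for style in (one_level, two_level, fully_encoded):
        print(f"{style.__name__:14s}", style(16))
    # one_level      {'pass_transistors': 16, 'sram_bits': 16, 'levels': 1}
    # two_level      {'pass_transistors': 20, 'sram_bits': 8, 'levels': 2}
    # fully_encoded  {'pass_transistors': 30, 'sram_bits': 4, 'levels': 4}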

2.3 FPGA Transistor Sizing

Finally, after considering the circuit structures to use within the FPGA, the sizing of the transistors within these circuits must be optimized as this also directly affects the area, performance and power consumption of an FPGA. This optimization has historically been performed manually [14, 35, 18, 28]. In these past works [14, 35, 18, 28], each resource, such as a routing track, is individually optimized and the sizing which minimizes the area-delay product for that resource is selected. As this is a laborious process, sizing was only performed once for one particular architecture and then, for architectural studies, the same sizings were generally used as architectural parameters were varied.

The optimization goal for transistor sizing is frequently selected as minimizing the area-delay product since this maximizes the throughput of the FPGA assuming that the applications implemented on the FPGA are perfectly parallelizable [35]. However, alternative approaches such as minimizing delay assuming a fixed "feasible" area [36] or minimizing delay only [31, 32] have been used in the past. Such optimization assumed an architecture in which the routing resources were all logically equivalent but another possibility is to introduce resource-based heterogeneity by making some resources faster than other resources. It has been found that sizing 20 % of the routing resources for speed and the remainder for minimal area-delay product yielded performance results similar to when only speed was optimized but with significantly less area [35]. Similar conclusions about the benefits of heterogeneously sizing some resources for speed were also reached in [37, 36]. The relative amount of resources that can be made slower depends on the relative speed differences. In a set of industrial designs it was observed that approximately 80 % of the resources could tolerate a 25 % slowdown while approximately 70 % could tolerate a 75 % slowdown [33, 38].
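As a toy illustration of the area-delay-product objective used in these sizings (the candidate numbers below are invented; a real flow would obtain them from an area model and circuit simulation):

    # Toy illustration of selecting a transistor sizing by minimum area-delay product.
    candidates = {
        # drive strength -> (area in minimum-width transistor areas, delay in ps); invented values
        1: (100, 400),
        2: (130, 290),
        4: (190, 230),
        8: (310, 205),
    }

    best = min(candidates, key=lambda size: candidates[size][0] * candidates[size][1])
    print(best, candidates[best])   # -> 2 (130, 290): smallest area * delay among the candidates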

While transistor sizing certainly has a significant impact on the quality and efficiency of an FPGA, most works have focused exclusively on the optimization of the routing resources [31, 32, 36, 35] and do not consider the optimization of the complete FPGA. As well, only a few discrete objectives such as area-delay or delay have typically been considered instead of the large continuous range of possibilities that actually exist. Such broad exploration was not possible because, with the exception of [31, 32], transistor sizes were optimized manually. This greatly limited the ability to explore a wide range of designs and the optimization tool developed as part of this thesis will address this limitation.

2.4 FPGA Assessment Methodology

All these previously described aspects of FPGA design can have a significant effect on the area, performance and power consumption of an FPGA. However, it is challenging to accurately measure these qualities for any particular FPGA. The standard method used has been an experimental approach [14] in which benchmark designs are implemented on the FPGA by processing them through a complete CAD flow. From that implementation the area, performance and the power consumption of each benchmark design can be measured and then the effective area, performance, and power consumption of the FPGA design can be determined by compiling the results across a set of benchmark circuits. The details of this evaluation process are reviewed in this section.

2.4.1 FPGA CAD Flow

The CAD flow used for much of this work1 is the standard academic CAD flow for FPGAs

[14, 18, 28, 39] and is shown in Figure 2.12. The process illustrated in the figure takes

1 The work in Chapter 3 makes use of commercial CAD tools and the details of that process will be outlined in that chapter.


benchmark circuits and information about the FPGA design as inputs and determines

the area and critical path for each circuit. (Power consumption could be measured with

well-known modifications to the CAD tools [40, 41, 42] but the primary focus of this

work will be area and performance.) The required information about the FPGA design

includes both Logical Architecture definitions that describe the target configuration of

the attributes detailed in Section 2.1 and Electrical Design Information that reflects the

area and performance of the FPGA based on the circuit structure and sizing decisions.

The first step in the process is synthesis and technology mapping which optimizes and

maps the circuit into LUTs of the appropriate size [43, 44, 45, 46]. In the more general

case of an FPGA with a variety of hard and soft logic blocks the synthesis process would

also identify and map structures to be implemented using hard logic [47]. As only soft

logic is assumed in Chapters 4 and 5 of this work, such additional steps are not required

and the synthesis and mapping process is performed using SIS [48] with FlowMap [45].

The technology mapped LUTs and any flip-flops are then grouped into logic clusters in

the packing stage which is performed using T-VPack [49, 14].

Next, the logic clusters are placed onto the FPGA fabric which involves determining

the physical position for each block within the array of logic blocks. The goal in placing

the blocks is to create a design that minimizes wirelength requirements and maximizes

speed, if the tool is timing driven, and this problem has been the focus of extensive

study [50, 51, 52, 53, 54]. After the positions of the logic blocks are finalized, routing is

performed to determine the specific routing resources that will be used to connect the logic

block inputs and outputs. Again, the goal is to minimize the resources required and, if

timing-driven, to maximize the speed of the design [55, 56]. In this thesis, both placement

and routing will be performed with VPR [10] used in its timing driven mode. An updated

version of VPR that can handle the single-driver routing described in Section 2.1.2 is used

in this work. The details regarding how area and performance are typically measured are provided in the following sections.

2.4.2 Area Model

One important output from the previously described CAD flow is the area required for

each benchmark circuit. Two factors impact this area: the number of logic blocks required

and the size in silicon of those logic blocks and their adjacent routing. The first term, the


Figure 2.12: FPGA CAD Flow. Benchmark Circuits, the Logical Architecture and the Electrical Design Information are the inputs to Synthesis & Mapping (SIS + FlowMap), Clustering (T-VPack), and Placement and Routing (VPR), which produce the Area and Delay results.

number of blocks is easily determined after packing while determining the second term,

the silicon area, is significantly more involved.

The most accurate area estimate would require the complete layout of every FPGA

design but this is clearly not feasible if a large number of designs are to be considered.

Simpler approaches such as counting the number of configuration bits or the number of

transistors in a design have been used but they are inaccurate as they fail to capture the

effect of circuit topology and transistor sizing choices on the silicon area. A compromise

approach is to consider the full transistor-level design of the entire FPGA but use an

easily calculated estimate of each transistor’s laid out area. One such approach, known

as a minimum-width transistor area model, was first described in [14] and will serve as

the foundation for the area models in this thesis.

The basis for this model is a minimum-width transistor area which is the area re-

quired to enclose a minimum-width (and minimum-length) transistor2 and the white

space around it such that another transistor could be adjacent to this area while still

satisfying all appropriate design rules. This is illustrated in Figure 2.13. The area for

2 The minimum width of a transistor is taken to be the minimum width in which the diffusion area is rectangular as shown in Figure 2.13. This width is generally set by contact size and spacing rules and, therefore, it is greater than the absolute minimum width permitted by a process.


Figure 2.13: Minimum Width Transistor Area. The perimeter of the minimum-width transistor area encloses a minimum-width, minimum-length transistor plus the minimum horizontal and vertical spacing around it.

each transistor, in minimum-width transistor areas (MWTA), is then calculated as,

$$\text{Minimum-width transistor areas}(\text{width}) = 0.5 + \frac{\text{width}}{2 \cdot \text{Minimum Width}} \qquad (2.2)$$

where width is the total width of the transistor. The total silicon area is simply the

sum of the areas for each transistor. To enable process independent comparisons, the

total area is typically reported in minimum-width transistor areas [14, 18, 28] and not

as an absolute area in square microns. Since FPGAs are typically created as an array of

replicated tiles, the total silicon area can be computed as the product of the number of

tiles used and the area of each tile.
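As a small illustration of how this model is applied, the sketch below evaluates Equation 2.2 for a handful of transistors and then scales by the tile count; the widths, minimum width and tile count are made-up values and the functions are not part of any thesis tool.

    # Minimal sketch of the minimum-width transistor area (MWTA) model of
    # Equation 2.2; all numeric values below are illustrative only.

    def mwta(width, min_width):
        """Area of one transistor, in minimum-width transistor areas."""
        return 0.5 + width / (2.0 * min_width)

    def tile_area(transistor_widths, min_width):
        """Total tile area is the sum of the per-transistor areas."""
        return sum(mwta(w, min_width) for w in transistor_widths)

    # Hypothetical tile with transistors of 1x, 2x and 4x minimum width,
    # replicated across a 10 x 10 array of identical tiles.
    widths = [1.0, 2.0, 4.0]          # in units of the minimum width
    per_tile = tile_area(widths, min_width=1.0)
    total = 100 * per_tile            # number of tiles times area per tile
    print(per_tile, total)

Note that a transistor at exactly minimum width evaluates to 1.0 MWTA, which is the intended normalization of the model.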

This approach to area modelling will serve as the basis for the area model used in

this work; however, as will be described in Chapter 4, some improvements will be made

to account for factors such as densely laid out configuration memory bits.

2.4.3 Performance Measurement

Just as important as the area measurements are the performance measurements of the

FPGA. Performance is measured based on each circuit’s critical path delay as determined

by VPR [10, 14]. Delay modelling within VPR uses an Elmore delay-based model that

is augmented, using the approach from [57], to handle buffers together with the RC-tree

delays predicted by the standard Elmore model [58, 59]. With this model the delay for


a path, $T_D$, is given by

$$T_D = \sum_{i \,\in\, \text{source-sink path}} \left( R_i \cdot C(\text{subtree}_i) + T_{\text{buffer},i} \right) \qquad (2.3)$$

where $i$ is an element along the path, $R_i$ is the equivalent resistance of element $i$, $C(\text{subtree}_i)$ is the downstream dc-connected capacitance from element $i$, and $T_{\text{buffer},i}$ is the intrinsic delay of the buffer in element $i$ if it is present [14]. While the Elmore

model has long been known to be limited in its accuracy [60, 61], the accuracy in this case

was found to be reasonable [14] and, more importantly, it had been previously observed

that it provided high fidelity despite the inaccuracies [61, 62]. However, it is recognized

in [14] that the most accurate and ideal approach would be full time-domain simula-

tion with SPICE. This approach of SPICE simulation will be used for most performance

measurements in this thesis as will be described in Chapter 4.
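As a small illustration of Equation 2.3, the following sketch simply accumulates the Elmore terms along one path; the resistances, capacitances and buffer delays are made-up values, not data from VPR or any real FPGA.

    # Minimal sketch of the path delay of Equation 2.3 with made-up numbers.

    def path_delay(elements):
        """Each element is (R_i, C_subtree_i, T_buffer_i); T_buffer_i is 0
        when the element contains no buffer."""
        return sum(r * c + t_buf for (r, c, t_buf) in elements)

    # Hypothetical three-element path: a driving buffer, a pass transistor
    # and a wire segment (ohms, farads, seconds).
    path = [
        (500.0, 20e-15, 30e-12),   # buffer: R, downstream C, intrinsic delay
        (1000.0, 10e-15, 0.0),     # pass transistor
        (200.0,  5e-15, 0.0),      # wire segment
    ]
    print(path_delay(path))        # total Elmore delay of the path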

Irrespective of the specific delay model used, a necessary input is the properties of the

transistors whose behaviour is being modelled. Therefore, just as detailed transistor-level

design was necessary for the accurate area models described previously, this same level of

detail is also required for these delay models. From the transistor-level design, intrinsic

buffer delays and equivalent resistances and capacitances are determined and used as

inputs to the Elmore model.

2.5 Automated Transistor Sizing

Clearly, detailed transistor-level optimization is necessary to obtain accurate area and

delay measurements for an FPGA design. One of the goals of this work is to explore a

wide range of different FPGA designs, and, therefore, manual optimization of transistor

sizes as was done in the past [14, 28, 18] is not appropriate for this work. Instead, it is

necessary to develop automated approaches to transistor sizing. Relevant work from

this area will be reviewed in this section; however, almost all prior work in this area is

focused on sizing for custom designs.

Automated approaches to transistor sizing can generally be classified as either dy-

namic or static. Dynamic approaches rely on time-domain simulations with a simulator

such as HSPICE but, due to the computational demands of such simulation, only the

delay of user specified and stimulated paths is optimized. Static approaches, based on


static timing analysis techniques, automatically find the worst paths but generally must

rely on simplified delay models.

2.5.1 Static Transistor Sizing

The central issues in static tuning are the selection of a transistor model and the algorithm

for performing the sizing. Early approaches [63] used the Elmore delay model along with

a simple transistor model consisting of gate, drain and source capacitances proportional

to the transistor width and a source-to-drain resistance inversely proportional to the

transistor width. The delay of a path through a circuit is the sum of the delays of each

gate along the path. For this simple model, this path delay is a posynomial function3

and a posynomial function can be transformed into a convex function. The delay of an

entire circuit is the maximum over all the combinational paths in the circuit and since the

maximum operation preserves convexity, the critical path delay can also be transformed

into a convex function. The advantage of the problem being convex is that any local

minimum is guaranteed to be the global minimum.
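A brief sketch of the standard argument, summarized here rather than quoted from the cited works: with the change of variables $x_i = e^{y_i}$ used in geometric programming, a posynomial becomes

$$f(y) = \sum_{k=1}^{K} \exp\!\left( a_{1k} y_1 + \cdots + a_{nk} y_n + \ln c_k \right),$$

a sum of exponentials of affine functions of $y$, which is convex; and since the pointwise maximum of convex functions is convex, the critical path delay inherits this property.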

The knowledge that this optimization problem is convex was used in the development

of one of the first algorithmic attempts at transistor sizing [63]. The algorithm starts

with minimum transistor sizes throughout the design. Static timing analysis is performed

to identify any paths that fail to meet the timing constraints. Each of these failing paths

is then traversed backwards from the end of the path to the start. Each transistor on the

path is analyzed and the transistor which provides the largest delay reduction per area

increment is increased. The process repeats until all constraints are met. This approach

for sizing was implemented in a program called TILOS. For four circuits sized using

TILOS, with the largest consisting of 896 transistors, the delay was improved by 60 % on

average and the area increased by 16 % on average compared to the result before sizing.
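The structure of this greedy loop can be sketched as follows, using a toy model in which each transistor's delay is inversely proportional to its size and area equals size; the functions and values are illustrative only and do not reproduce TILOS itself.

    # Illustrative sketch of a TILOS-style greedy sizing loop on a toy model.

    def path_delay(path, sizes):
        return sum(1.0 / sizes[t] for t in path)      # toy delay: 1/size per device

    def tilos_like_sizing(paths, timing_target, step=0.5, max_iters=1000):
        transistors = {t for p in paths for t in p}
        sizes = {t: 1.0 for t in transistors}          # start at minimum size
        for _ in range(max_iters):
            failing = [p for p in paths if path_delay(p, sizes) > timing_target]
            if not failing:
                return sizes                           # all constraints met
            worst = max(failing, key=lambda p: path_delay(p, sizes))

            def sensitivity(t):
                # Delay reduction per unit area if this transistor grows by `step`.
                return (1.0 / sizes[t] - 1.0 / (sizes[t] + step)) / step

            # Walk the worst path from sink to source and grow the most
            # profitable transistor; sizes only ever increase.
            best = max(reversed(worst), key=sensitivity)
            sizes[best] += step
        return sizes                                   # gave up (non-optimal)

    # Toy example: two overlapping three-transistor paths.
    print(tilos_like_sizing([["a", "b", "c"], ["b", "c", "d"]], timing_target=2.0))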

However, the TILOS algorithm fails to guarantee an optimal solution. This occurs

despite the convex nature of the problem because the TILOS algorithm can terminate

3 A posynomial resembles a polynomial except that all the coefficient terms and the variables are restricted to the positive real numbers while the exponents can be any real number. More precisely, a posynomial function with $K$ terms is a function, $f : \mathbb{R}^n \rightarrow \mathbb{R}$, as follows

$$f(x) = \sum_{k=1}^{K} c_k \, x_1^{a_{1k}} x_2^{a_{2k}} \cdots x_n^{a_{nk}} \qquad (2.4)$$

where $c_k > 0$ and $a_{1k}, \ldots, a_{nk} \in \mathbb{R}$ [64].


with a solution that is not a minimum. Such a situation can be caused by the combination

of three factors: 1) TILOS only considered the most critical path, 2) it only increased

the transistor sizes and 3) the definition of delay as the maximum of all possible paths

through a combinational block may result in discontinuous sensitivity measurements

(since an adjustment of one transistor size on the critical path may cause a different path

to become critical) which could lead to excessively large transistors on the former critical

path [65]. Due to these problems, examples have been encountered in which the circuit

is not sized correctly [65].

Numerous algorithmic improvements have been made to address this shortcoming.

One approach again leveraged the convex nature of the problem and solved the prob-

lem with an interior point method which guaranteed an optimal solution to the sizing

problem [66]. However, the run time of this approach was unsatisfactory. An alternate

approach based on Lagrangian relaxation was estimated to be 600 times faster for a

circuit containing 832 transistors [67]. With this new method, an optimal solution is

still guaranteed. Another improvement on the original algorithm for producing optimal

solutions was the use of an iterative relaxation method that also achieved significant

run-time improvements [68]. This performance was only 2–4 times slower than a TILOS

implementation but delivered area savings of up to 16.5 % relative to the TILOS-based

approach.

While these algorithmic improvements were significant since they provided optimal

solutions with reasonable run times, this optimality is dependent on the delay and tran-

sistor models used. Unfortunately, the linear models used above have long been known

to be inaccurate [60]. More recently, the error with the Elmore delay models relative

to HSPICE has been found to be up to 28 % [69]. One factor that contributes to this

inaccuracy is that these models assume ideal (zero) transition times on all signals. This

transition time issue was partially addressed by including the effect of non-zero transi-

tion times in the delay model [66] but even with such improvements the models remained

inaccurate.

Another approach for addressing any inaccuracies was to use generalized posynomi-

als4 which improve the accuracy of the device models but retain the convexity of the

optimization problem [70]. To do this, delays for individual cells were curve fit to a

4 Generalized posynomials are expressions consisting of a summation of positive product terms. The product terms are the product of generalized posynomials of a lower degree raised to a positive real power. The zeroth order generalized posynomial is defined as a regular posynomial [64, 70].


generalized posynomial expression with the transistor widths, input transition times and

output load as variables. To reduce computation time requirements, this approach de-

composed all gates into a set of primitives. With these new models, convex optimizers

or TILOS-like algorithms could still be used for optimization. The accuracy was found

to be at worst 6 % when compared to SPICE for a specific test circuit.
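A minimal sketch of this kind of delay fit is shown below, assuming a simple fixed-exponent posynomial template and synthetic data points; it illustrates only the idea of fitting with non-negative coefficients and does not reproduce the actual models or primitives of [70].

    # Sketch of fitting a simple posynomial delay model of the form
    #   delay ~ c1 * (load / width) + c2 * slew + c3
    # with non-negative coefficients, using synthetic data points.
    import numpy as np
    from scipy.optimize import nnls

    # Synthetic (width, input slew, output load, measured delay) samples.
    samples = np.array([
        # width, slew,  load,  delay
        [1.0,   0.05,  2.0,   2.30],
        [2.0,   0.05,  2.0,   1.35],
        [1.0,   0.10,  4.0,   4.45],
        [2.0,   0.10,  4.0,   2.50],
        [4.0,   0.02,  1.0,   0.55],
    ])
    width, slew, load, delay = samples.T

    # Each basis term is itself a posynomial in the variables, so a
    # non-negative combination of them stays posynomial (and hence convex
    # after the usual log transformation).
    basis = np.column_stack([load / width, slew, np.ones_like(width)])
    coeffs, residual = nnls(basis, delay)
    print("fitted coefficients:", coeffs, "residual:", residual)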

One possibility besides convex curve fitting is the use of piecewise convex functions

[71]. With such an approach, the data is divided into smaller regions and each region is

modelled by an independent convex model. This improves the accuracy and also allows

the model to cover a larger range of input conditions. However, the lack of complete

convexity means that different and potentially non-optimal algorithms must be used for

sizing.

The difficulties in modelling are particularly problematic for FPGAs as the frequent

use of NMOS-only pass transistors adds additional complexity that is not encountered

as frequently in typical custom designs. This necessitates the consideration of dynamic

sizing approaches that perform accurate simulations.

2.5.2 Dynamic Sizing

With the difficulties in the transistor modelling necessary to enable static sizing ap-

proaches, the often considered alternative is dynamic simulation-based sizing. The pri-

mary advantage of such an approach is that the accuracy and modelling issues are avoided

because the circuit can be accurately simulated using foundry-provided device models.

The disadvantage, and the reason full simulation is generally not used with static anal-

ysis techniques, is that massive computational resources are required which limits the

size of the circuits that can be optimized. As well, with the complex device models such

as the BSIM3 [72] or BSIM4 [73] models commonly used to capture modern transistor

behaviour, it is generally not possible to ascribe properties such as convexity to the op-

timization problem. Instead, the optimization space is exceedingly complex with many

local minima making it unlikely that optimal results will be obtained.

The first dynamic-based approaches simply automated the use of SPICE [74, 65]. An

improvement on this is to use a fast SPICE simulator with gradient-based optimization

[75]. Fast SPICE simulators are transistor-level simulators that use techniques such as

hierarchical partitioning and event-driven algorithms to outperform conventional SPICE

simulators with minimal losses in accuracy. For the optimizer in [75] known as Jiffy-


Tune, a fast SPICE simulator called SPECS was used with the LANCELOT nonlinear

optimization package. The selection of simulator is significant because, with SPECS, the

sensitivity to the parameters being tuned can be efficiently computed. The non-linear

solver, LANCELOT, uses a trust-region method to solve the optimization problem. Using

new methods for the gradient computation, the capabilities of the optimizer are extended

to handle circuits containing up to 18 854 transistors. The authors report that the run-

time of the optimizer is similar to that which would have been required for a single full

SPICE simulation. While such capacity increases are encouraging, the size of circuits that

can be optimized is still somewhat limited and, therefore, alternative hybrid approaches

have also been considered.

2.5.3 Hybrid Approaches to Sizing

An alternative to purely static or dynamic methods is simulation-based static timing

analysis. This is used in EinsTuner which is a tool developed by IBM for static-analysis

based circuit optimization [76]. The tool is designed to perform non-linear optimization

on circuits consisting of parametrized gates. Each gate is modelled at the transistor

level using SPECS, the fast SPICE simulator. As described previously, SPECS can

easily compute gradient information with respect to parameters such as transistor widths,

output load and input slew. Thus, for each change in a gate’s size and input/output

conditions, the simulator is used to compute the cell delays, slews and gradients. Using

this gradient information, the LANCELOT nonlinear optimization package is used to

perform the actual optimization. Various optimization objectives are possible such as

minimizing the arrival time of all the paths through the combinational circuit subject to

an area constraint, minimizing a weighted sum of delays and area or minimizing the area

subject to a timing constraint. Given the expense of simulation and gradient calculations,

LANCELOT was modified to ensure more rapid convergence. Using EinsTuner, the

performance of a set of well-tuned circuits ranging in size from 6 to 2 796 transistors is

further improved by 20 % on average with no increase in area. This optimizer was further

updated to avoid creating a large number of equally critical paths which improves the

operation of the optimizer in the presence of manufacturing uncertainty [77].


2.5.4 FPGA-Specific Sizing

There has been at least one work that considered automated transistor sizing specifically

for FPGAs [31, 32]. This work focused exclusively on the optimization of the transistor

sizes for individual routing tracks. With this focus on a single resource, only one circuit

path had to be considered and, therefore, static analysis techniques were unnecessary.

A number of different methodologies were considered involving either simulation with

HSPICE or Elmore delay modelling and, in each case, the best sizing (for the given

delay model) for the optimizable parameters was found through an exhaustive search.

As the intent of the work was to investigate the usefulness of repeater insertion in routing

interconnect, such exhaustive searches were appropriate; however, for this thesis, since

the aim is to consider the design of a complete FPGA, exhaustive searching is not feasible

and alternative approaches will be considered and described in Chapter 4.
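To illustrate the flavour of such an exhaustive search, the toy sketch below enumerates a small discrete space of buffer sizes and segment counts and picks the best area-delay product under a made-up buffered-wire model; the real studies [31, 32] evaluated candidate track designs with HSPICE or Elmore models of actual routing tracks.

    # Illustrative exhaustive search over a small, discrete design space for a
    # single routing track. The delay/area models and candidate values are
    # made up for the example.
    import itertools

    R_WIRE, C_WIRE = 2000.0, 100e-15          # assumed per-segment wire parasitics

    def track_delay(buf_size, n_segments):
        # Toy model: buffer resistance falls and its capacitance grows with size.
        r_buf, c_buf, t_int = 1000.0 / buf_size, 5e-15 * buf_size, 20e-12
        seg = t_int + r_buf * (C_WIRE + c_buf) + R_WIRE * (C_WIRE / 2 + c_buf)
        return n_segments * seg

    def track_area(buf_size, n_segments):
        return n_segments * (1.0 + buf_size)   # toy area cost per segment

    best = min(itertools.product([1, 2, 4, 8, 16], [1, 2, 4]),
               key=lambda p: track_delay(*p) * track_area(*p))
    print("best (buffer size, segments) by area-delay product:", best)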

2.6 FPGA to ASIC Gap

The preceding sections have provided a basic overview of FPGA architectures and their

design and evaluation practices. Despite the many architectural and design improvements

that have been incorporated into FPGAs, they continue to be recognized as requiring

more silicon area, offering lower performance and consuming more power than more

customized approaches. One of the goals of this thesis is to quantify these differences

focusing in particular on the FPGA to ASIC gap. There have been some past attempts at

measuring these differences and these attempts are reviewed in this section. Throughout

this discussion and the rest of this thesis, the gap will be specified as the number of times

worse an FPGA is for the specified attribute compared to an ASIC.

One of the earliest statements quantifying the gap between FPGAs and pre-fabricated

media was by Brown et al. [2]. That work reported the logic density gap between FPGAs

and Mask-programmable Gate Arrays (MPGAs) to be between 8 and 12 times, and the

circuit performance gap to be approximately a factor of 3. The basis for these numbers

was a superficial comparison of the largest available gate counts in each technology, and

the anecdotal reports of the approximate typical operating frequencies in the two tech-

nologies at the time. While the latter may have been reasonable, the former potentially

suffered from optimistic gate counting in FPGAs as there is no standard method for

determining the number of gates that can be implemented in a LUT.


MPGAs are no longer commonly used and standard cell ASICs are now the more common

implementation medium. These standard cell implementations are reported to be on

the order of 33 % to 62 % smaller and 9 % to 13 % faster than MPGA implementations

[78]. Combined with the FPGA to MPGA comparison above, these estimates suggest

an area gap between FPGAs and standard-cell ASICs of 12 to 38. However, the reliance

of these estimates on only five circuits in [78] and the use of potentially suspect gate

counts in [2] makes this estimate of the area gap unreliable. Combining the MPGA:ASIC

and FPGA:MPGA delay gap estimates, the overall delay gap of FPGAs to ASICs is

approximately 3.3 to 3.5 times. Setting aside the reliance on anecdotal evidence [2], the

past comparison is dated because it does not consider the impact of hard blocks such as

multipliers and block memories that, as described in Section 2.1.1, are now common [6, 7].

The comparison performed in this thesis addresses this issue by explicitly considering the

impact of such blocks.

More recently, a detailed comparison of FPGA and ASIC implementations was per-

formed by Zuchowski et al. [3]. They found that the delay of an FPGA LUT was

approximately 12 to 14 times the delay of an ASIC gate. Their work found that this

ratio has remained relatively constant across CMOS process generations from 0.25 µm

to 90 nm. ASIC gate density was found to be approximately 45 times greater than that

possible in FPGAs when measured in terms of kilo-gates per square micron. Finally,

the dynamic power consumption of a LUT was found to be over 500 times greater than

the power of an ASIC gate. Both the density and the power consumption exhibited

variability across process generations but the cause of such variability was unclear. The

main issue with this work is that it also depends on the number of gates that can be

implemented by a LUT. In this thesis, this issue is handled by instead focusing on the

area, speed and power consumption of application circuits.

Wilton et al. [4] also examined the area and delay penalty of using programmable

logic. The approach taken for the analysis was to replace part of a non-programmable

design with programmable logic. They examined the area and delay of the programmable

implementation relative to the non-programmable circuitry it replaced. This was only

performed for a single module in the design consisting of the next state logic for a chip

testing interface. They estimated that when the same logic is implemented on an FPGA

fabric and directly in standard cells, the FPGA implementation is 88 times larger. They

measured the delay ratio of FPGAs to ASICs to be 2.0 times. This thesis improves on this

by comparing more circuits and using an actual commercial FPGA for the comparison.


Compton and Hauck [5] have also measured the area differences between FPGA and

standard cell designs. They implemented multiple circuits from eight different applica-

tion domains, including areas such as radar and image processing, on the Xilinx Virtex-II

FPGA, in standard cells on a 0.18 µm CMOS process from TSMC, and on a custom con-

figurable platform. Since the Xilinx Virtex-II is designed in 0.15 µm CMOS technology,

the area results are scaled up to allow direct comparison with 0.18 µm CMOS. Using

this approach, they found that the FPGA implementation is only 7.2 times larger on

average than a standard cell implementation. The authors believe that one of the key

factors in narrowing this gap is the availability of heterogeneous blocks such as memory

and multipliers in modern FPGAs and these claims are quantified in this thesis.

While this thesis focuses on the gap between FPGAs and ASICs, it is noteworthy

that the area, speed and power penalty of FPGAs is even larger when compared to the

best possible custom implementation using full-custom design. It has been observed that

full-custom designs tend to be 3 to 8 times faster than comparable standard cell ASIC

designs [1]. In terms of area, a full-custom design methodology has been found to achieve

14.5 times greater density than a standard cell ASIC methodology [79] and the power

consumption of standard cell designs has been observed as being between 3 to 10 times

greater than full-custom designs [80, 81].

Given this large ASIC to custom design gap, it is clear that FPGAs are far from the

most efficient implementation. The remainder of this thesis will focus on measuring the

extent of these inefficiencies and exploring the trade-offs that can be made to narrow the

gap. The deficiencies in the past measurements of the FPGA to ASIC gap necessitate

the more thorough comparison that will be described in the subsequent chapter.

Chapter 3

Measuring the Gap

The goal of this research is to explore the area, performance and power consumption gap

between FPGAs and standard cell ASICs. The first step in this process is measuring

the FPGA to ASIC gap. In the previous chapter, we described how all prior published

attempts to make this comparison were superficial since none of those works focused ex-

clusively on measuring this gap. In this chapter, we present a detailed methodology used

to measure this gap and the resulting measurements. A key contribution is the analysis

of the impact of logic block architecture, specifically the use of heterogeneous hard logic

blocks, on the area, performance and power gap. These quantitative measurements of

the FPGA to ASIC gap will benefit both FPGA architects, who aim to narrow the gap,

and system designers, who select implementation media based on their knowledge of the

gap. As well, this measurement of the gap motivates the latter half of the work in this

thesis which explores the trade-offs that can be made to selectively narrow one dimension

of the gap at the expense of another.

The FPGA to ASIC comparison described in this chapter will compare a 90 nm

CMOS SRAM-programmable FPGA to a 90 nm CMOS standard cell ASIC. An SRAM-

based FPGA is used because such FPGAs dominate the market and limiting the scope

of the comparison was necessary to make this comparison tractable. Similarly, a CMOS

standard cell implementation is the standard approach for ASIC designs [1, 82]. The

use of newer “structured ASIC” platforms [83, 84] is not as widespread or mature as the

market continues to rapidly evolve. This comparison will focus primarily on core logic.

It is true that I/O area constraints and power demands can be crucial considerations;

however, the core programmable logic of an FPGA remains fundamentally important.



A fair comparison between two very different implementation platforms is challenging.

To ensure that the results are understood in the proper context, we carefully describe

the comparison process used. The specific benchmarks used can also significantly impact

the results and, as will be seen in our results, the magnitude of the FPGA to ASIC

gap can vary significantly from circuit to circuit and application to application. Given

this variability, we perform the comparison using a large set of benchmark designs from

a range of application domains. However, using a large set of designs means that it is

not feasible to individually optimize each design. A team of designers focusing on any

single design could likely optimize the area, performance and power consumption of a

design more thoroughly but this is true of both the ASIC and FPGA implementations.

Therefore, this focus on multiple designs instead of single point comparisons (which as

described in Chapter 2 was typically done historically) increases the usefulness of these

measurements.

This chapter begins by describing the implementation media and the benchmarks

that will be used in the comparison. The details of the FPGA and ASIC implementation

and measurement processes are then reviewed. Finally the measurements of the area,

performance and power gap are presented and a number of issues impacting this gap are

examined.

3.1 Comparison Methodology

As described in Chapter 2, past measurements of the gaps between FPGAs and ASICs

have been based on simple estimates or single-point comparisons. In this work, the gap

is measured more definitively using an empirical method that includes the results from

many benchmark designs. Each benchmark design is implemented in an FPGA and using

a standard cell methodology. The silicon area, maximum operating frequency and power

consumption of the two implementations are compared to quantify the area, delay and

power gaps between FPGAs and ASICs.

Both the ASIC and FPGA-based implementations are built using 90 nm CMOS tech-

nology. For the FPGA, the Altera Stratix II [15, 16] FPGA, whose logic block archi-

tecture was described in Section 2.1.1, was selected based on the availability of specific

device data [85]. This device is fabricated using TSMC’s Nexsys 90 nm process [86].

The IC process we use for the standard cells is STMicroelectronic’s CMOS090 Design

Platform [87]. Standard cell libraries provided by STMicroelectronics are used. Since the


Table 3.1: Summary of Process Characteristics

Parameter                                TSMC 90 nm                       STMicroelectronics 90 nm
                                         Value        Source              Value        Source
Metal 1 Half-Pitch                       125 nm       Measured [93]       140 nm       Published [87]
Minimum Gate Length                      55 nm        Measured [93] (a)   65 nm        Published [87]
Number of Metal Layers                   9 Cu/1 Al    Measured [93]       7 Cu/1 Al    Published [87, 94] (b)
SRAM Bit Cell Size (Ultra High Density)  0.99 µm²     Published [92]      0.99 µm²     Published [90]
SRAM Bit Cell Size (High Density)        1.15 µm²     Published [92]      1.15 µm²     Published [90]
Nominal Core Voltage                     1.2 V                            1.2 V

(a) Published reports have indicated that the nominal minimum gate length was 59 nm [93] or 65 nm [89].

(b) The process allows for between 6 and 9 Cu layers [87]. The specific design kit available to us uses 7 layers. We will use all these layers and assume that the additional metal layers could be used to improve power and ground distribution.

Altera Stratix II is implemented using a multi-Vt process [88], we will assume a dual-Vt

process for the ASIC to ensure a fair comparison. Unfortunately, the TSMC and STMi-

croelectronics processes are certainly not identical; however, they share many similar

characteristics. These characteristics are summarized in Table 3.1. Different parameters

are listed in each row and the values of these parameters in the two processes are indi-

cated. The source of these values is labelled as either “Measured” which indicates that

the particular characteristic was measured by a third party or “Published” which means

that a foundry’s publications were used as the source of the data. Clearly, both processes

have similar minimum nominal poly lengths and metal 1 pitches [87, 89] and, in both

processes, SRAM bit cell sizes of 0.99 µm² and 1.15 µm² have been reported [90, 91, 92].

Given these similarities, it appears acceptable to compare the FPGA and ASIC despite

the different design platforms (and this is the best option available to us). The results

from both platforms will assume a nominal supply voltage of 1.2 V.

3.1.1 Benchmark Circuit Selection

It is important to ensure the benchmark designs are suitable for this empirical FPGA to

ASIC comparison. In particular, it is undesirable to use benchmarks that were designed

for a specific ASIC or FPGA platform as that could potentially unfairly bias the compari-


son. For this work, benchmarks were drawn from a range of sources including OpenCores1

and designs developed for projects at the University of Toronto [95, 96, 97, 47, 98, 99, 100].

All the benchmarks were written in either Verilog or VHDL. In general, the designs were

targeted for implementation on FPGAs. While none of the designs appeared to be heav-

ily optimized for a particular FPGA, this use of FPGA-focused designs does raise the

possibility of a bias in favour of FPGAs. However, this would be the typical result for

FPGA designers changing to target an ASIC. As well, this is necessary because we were

unable to obtain many ASIC-targeted designs (this is not surprising given the large cost of ASIC development, which makes it undesirable to publicly release such designs). From

the available sources, the specific benchmarks to use were selected based on two critical

factors.

The first was to ensure that the Verilog or VHDL was synthesized similarly by the

different tools used for the FPGA and the ASIC implementations. Different tools were

used because we did not have access to a single synthesis tool that could adequately tar-

get both platforms. The preferred approaches to verifying that the synthesis was similar

in both cases are post-synthesis simulation and/or formal verification. Unfortunately,

verification through simulation was not possible due to the lack of test benches for most

designs and the lack of readily available formal verification tools prevented such tech-

niques from being explored. Instead, we compared the number of registers inferred by

the two synthesis processes, which we describe in Sections 3.2 and 3.3.1. We rejected

any design in which the register counts deviated by more than 5 %. Some differences in

the register count are tolerated because different implementations are appropriate on the

different platforms. For example, FPGA designs tend to use one-hot encoding for state

machines because of the low incremental cost for flip-flops.

Secondly, it was important to ensure that some of the designs use the block memories

and dedicated multipliers on the Stratix II. This is important because one of the aims of

this work is to analyze the improvements possible when these hard dedicated blocks are

used. However, not all designs will use such features which made it essential to ensure

that the set of benchmarks includes both cases when these hard structures are used, and

are not used.

Based on these two factors, the set of benchmarks in Table 3.2 were selected for use in

this work. Brief descriptions and the source of each benchmark are given in Appendix A.

1 OpenCores is an open source hardware effort which collects and archives a wide range of user-created cores at http://www.opencores.org/.


To provide an indication of the size of the benchmarks, the table lists the number of Altera

Stratix II ALUTs (recall from Chapter 2 that an ALUT is “half” of a Stratix II Adaptive

Logic Module (ALM) and it is roughly equivalent to a 4-input LUT), 9x9 multipliers

and memory bits used by each design. The column labelled “Total 9x9 Multipliers”

indicates the number of these 9x9 multipliers (which, as described in Section 2.1.1, are

the smallest possible division of the Stratix II DSP Block) that are used throughout

the design including those used to implement the larger 18x18 or 36x36 multiplications

supported by the DSP block. Similarly, the number of memory bits indicates the number

of bits used across the three hard logic memory block sizes.

While every attempt was made to obtain benchmarks that are as large as possible

to reflect the realities of modern systems, the final set of benchmarks used for this work

are modest in size compared to the largest designs that can be implemented on the

largest Stratix II FPGA which contains 143 520 ALUTs, 768 9x9 multipliers and 9 383 040

memory bits [16]. This is a concern and various efforts to compensate for these modest

sizes are described later in this chapter. Despite these attempts to address potential size

issues, it is possible that with larger benchmarks different results would be obtained and,

in particular, there is the possibility that the results obtained will be somewhat biased

against FPGAs since FPGAs are engineered to handle significantly larger circuits than

those used in this work. This issue is examined in greater detail in Section 3.5.1.

The following sections will describe the processes used to implement these benchmarks

on the FPGA and as an ASIC.

3.2 FPGA CAD Flow

The benchmark designs were implemented on Altera Stratix II devices using the Altera

Quartus II v5.0SP1 software for all stages of the CAD flow. (This was the most recent

version of the software available at the time this work was completed.) Synthesis was

performed using Quartus II Integrated Synthesis (QIS) with all the settings left at their

default values. The default settings perform “balanced” optimization which focuses on

speed for timing critical portions of the design and area optimization for non-critical

sections. The defaults also allow the tool to infer the use of DSP blocks and memory

blocks automatically from the hardware description language (HDL).

Placement and routing with Quartus II was performed using the “Standard Fit”

effort level. This effort setting forces the tool to obtain the best possible timing results


Table 3.2: Benchmark Summary

Design          ALUTs     Total 9x9 Multipliers     Memory Bits

booth               68          0                          0
rs encoder         703          0                          0
cordic18         2 105          0                          0
cordic8            455          0                          0
des area           595          0                          0
des perf         2 604          0                          0
fir restruct       673          0                          0
mac1             1 885          0                          0
aes192           1 456          0                          0
fir3                84          4                          0
diffeq             192         24                          0
diffeq2            288         24                          0
molecular        8 965        128                          0
rs decoder1        706         13                          0
rs decoder2        946          9                          0
atm             16 544          0                      3 204
aes                809          0                     32 768
aes inv            943          0                     34 176
ethernet         2 122          0                      9 216
serialproc         680          0                      2 880
fir24            1 235         50                         96
pipe5proc          837          8                      2 304
raytracer       16 346        171                     54 758

regardless of timing constraints [101] and, hence, no timing constraints were placed on

any design in the reported results2. The final delay measurements were obtained using

the Quartus Timing Analyzer. As will be described in Section 3.4, area is measured

according to the number of logic clusters used and, therefore, we set the packer to cluster

elements into as few LABs as possible without significantly impacting speed. This is

done using special variables provided by Altera that mimic the effect of implementing

our design on a highly utilized FPGA. In addition to this, we used the LogicLock feature

of Quartus II to restrict the placement of a design to a rectangular region of LABs, DSP

blocks and memories [101]. By limiting the size of the region for each benchmark, the

implementation will more closely emulate the results expected for larger designs that

heavily utilize a complete FPGA. We allow Quartus II to automatically size the region

because we found that this automatic sizing generally delivered results with greater or

2 To verify that this effort setting has the desired effect, the results obtained were compared to the operating frequency obtained when the clocks in the designs were constrained to an unattainable 1 GHz. Both approaches yielded similar results.


equal density than when we manually defined the region sizes to be nearly square with

slightly more LABs than necessary.

The selection of a specific Stratix II device is performed by the placement and routing

tool but we restrict the tool to use the fastest or the slowest speed grade parts depending

on the specific comparison being performed. These speed grades exist because most

FPGAs, including the Altera Stratix II, are speed binned which means that parts are

tested after manufacturing and sold based on their speed. The fastest FPGA speed grade

is a valid comparison point since those faster parts are available off-the-shelf. However,

exclusively using the fast speed grade devices favours the FPGA since ASICs generally

are not speed-binned [1]. (Alternatively, it could be argued that this is fair as one of the

advantages of FPGAs is that the diverse markets they serve make it effective to perform

speed binning.) As will be described later, the ASIC delay is measured assuming worst-

case temperature, voltage and process conditions. Comparing the ASIC results to the

slowest FPGA speed grade addresses this issue and allows for an objective comparison of

the FPGA and ASIC at worst case temperature, voltage and process. When presenting

the results, we will explicitly note which FPGA devices (fastest or slowest) were used.

Even within the same speed grade, the selection of a specific Stratix II part can have

a significant impact on the cost of an FPGA-based design and, for industrial designs,

the smallest (and cheapest) part would typically be selected. However, this issue is not

as important for our comparison because, as will be described later, the comparison

optimistically (for the FPGA) ignores the problem of device size granularity.

Finally, the reported operating frequency of a design is known to vary depending on

the random seed given to the placement tool. To reduce the impact of this variability

on our results, the entire FPGA CAD flow is repeated five times using five different

placement seeds. All the results (area, speed and power) are taken based on the placement

and routing that resulted in the fastest operating frequency.
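This seed sweep amounts to the small loop sketched below; run_flow is a hypothetical wrapper around the vendor tools (it is not a real Quartus II command or API) and is stubbed here with made-up numbers so the sketch is runnable.

    # Sketch of the placement-seed sweep: run the flow with several seeds and
    # keep the fastest result. run_flow() is a hypothetical placeholder.

    def run_flow(design, seed):
        # Placeholder: would invoke synthesis, placement and routing with the
        # given seed and return results parsed from the tool reports.
        fake_fmax = 200.0 + (hash((design, seed)) % 20)   # stand-in result
        return {"fmax": fake_fmax, "seed": seed}

    def best_of_seeds(design, seeds=(1, 2, 3, 4, 5)):
        runs = [run_flow(design, s) for s in seeds]
        return max(runs, key=lambda r: r["fmax"])   # keep fastest placement

    print(best_of_seeds("raytracer"))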

3.3 ASIC CAD Flow

While the FPGA CAD flow is straightforward, the CAD flow for creating the standard

cell ASIC implementations is significantly more complicated. Our CAD flow is based on

Synopsys and Cadence tools for synthesis, placement, routing, extraction, timing analysis,

and power analysis. The steps involved along with the tools used are shown in Figure 3.1.

The CAD tools were provided through CMC Microsystems (http://www.cmc.ca).


Figure 3.1: ASIC CAD Flow. The RTL design description passes through Synthesis (Synopsys Design Compiler), Placement and Routing (Cadence SOC Encounter), Extraction (Synopsys Star-RCXT), Timing Analysis (Synopsys PrimeTime), Simulation (Cadence NC-Sim) and Power Analysis (Synopsys PrimePower), producing the area, delay and power results.


A range of sources were used for determining how to properly use these tools. These

sources included vendor documentation, tutorials created by CMC Microsystems and

tool demonstration sessions provided by the vendors. In the following sections, all the

significant steps in this CAD flow will be described.

3.3.1 ASIC Synthesis

Synthesis for the ASIC implementation was completed using Synopsys Design Compiler

V-2004.06-SP1. All the benchmarks were synthesized using a common compile script

that performed a top-down compilation. This approach preserves the design hierarchy

and ensures that any inter-block dependencies are handled automatically [102, 103]. This

top-down approach is reasonable in terms of CPU time and memory size because all the

benchmarks have relatively modest sizes.

The compile script begins by analyzing the HDL source files for each benchmark.

Elaboration and linking of the top level module is then performed. After linking, the

following constraints are applied to the design. All the clocks in a design are constrained

to a 2 GHz operating frequency. This constraint is unattainable but, by over-constraining

the design, we aim to create the fastest design possible. In addition, an area constraint

of 0 units is also placed on the design. This constraint is also unattainable but this is a

standard practice for enabling area optimization [102].

The version of the STMicroelectronics 90 nm design kit available to us contains four

standard cell libraries. Two of the libraries contain general-purpose standard cells. One

version of the library uses low leakage high-Vt transistors while the other uses higher

performing standard-Vt transistors. The other set of two libraries includes more complex

gates and is also available in high and standard-Vt versions. For compilation with Design

Compiler, all four libraries were set as target libraries meaning that the tool is free to

select cells from any of these libraries as it sees fit. The process from STMicroelectronics

also has the option for low-Vt transistors; however, standard cell libraries based on these

transistors were not available to us at the time of this work. Such cells would have offered

even greater performance at the expense of static power consumption.

Once the specific target cells and the clock and area constraints are specified, the design is

compiled with Design Compiler. The compilation was performed using the “high-effort”

setting. After the compile completed, an additional high-effort incremental compilation


is performed. This incremental compilation maintains or improves the performance of

the design by performing various gate-level optimizations [103].

Virtually all modern ASIC designs require Design for Testability (DFT) techniques

to simplify post-manufacturing tests. At a minimum, scan chains are typically used

to facilitate these tests [104]. This requires that all the sequential cells in the design

are replaced by scan-equivalent implementations. Accordingly, for all compilations with

Design Compiler, the Test Ready Compile option is used which automatically replaces

sequential elements with scan-equivalent versions. Such measures were not needed for

the FPGA-based implementation because testing is performed by the manufacturer.

After the high effort compilations are complete, the timing constraints are adjusted.

The desired clock period is changed to the delay that was obtained under the unattainable

constraints. With this new timing constraint, a final high effort compilation is performed.

“Sequential area recovery” optimizations are enabled for this compile which allows Design

Compiler to save area by remapping sequential elements that are not on a critical or

near-critical path. After this final compilation is complete, the scan-enabled flip flops are

connected to form the scan chains. The final netlist and the associated constraints are

then saved for use during placement and routing.

For circuits that used memory, the appropriate memory cores were generated by

STMicroelectronics using their custom memory compilers. CMC Microsystems and Cir-

cuits Multi-Projets (CMP) (http://cmp.imag.fr) coordinated the generation of these

memory cores with STMicroelectronics. When selecting from the available memories,

we chose compilers that delivered higher speeds instead of higher density or lower power

consumption. The memories were set to be as square as possible.

3.3.2 ASIC Placement and Routing

The synthesized netlist is next placed and routed with Cadence SOC Encounter GPS

v4.1.5. The placement and routing CAD flow was adapted from that described in the

Encounter Design Flow Guide and Tutorial [105]. The key steps in this flow are described

below.

The modest sizes of the benchmarks allow us to implement each design as an indi-

vidual block and the run times and memory usage were reasonable despite the lack of

design partitioning. For larger benchmarks, hierarchical chip floor-planning steps could


well have been necessary. Hierarchical design flows can result in lower quality designs

but are necessary to achieve acceptable run times and to enable parallel design efforts.

Before placement, a floorplan must be created. For this floorplan we selected a target

row utilization3 of 85 % and a target aspect ratio of 1.0. The 85 % target utilization was

selected to minimize any routing problems. Higher utilizations tend to make placement

and routing significantly more challenging [107]. Designs with large memory macro blocks

proved to be more difficult to place and route; therefore, the target utilization was lowered

to 75 % for those designs.

After the floorplan is created under these constraints, placement is performed. This

placement is timing-driven and optimization is performed based on the worst-case tim-

ing models. Scan chain reordering is performed after placement to reduce the wirelength

required for the scan chain. The placement is further optimized using Encounter’s “opt-

Design” macro command which performs optimizations such as buffer additions, gate

resizing and netlist restructuring. Once these optimizations are complete, the clock tree

is inserted. Based on the new estimated clock delays from the actual clock tree, setup and

hold time violations are then corrected. Finally, filler cells are added to the placement

in preparation for routing.

Encounter’s Nanoroute engine is used for routing. The router is configured to use

all seven metal layers available in the STMicroelectronics process used for this work.

Once the routing completes, metal fill is added to satisfy metal density requirements.

Detailed extraction is then performed. This extraction is not of the same quality as

the sign-off extraction but is sufficient for guiding the later timing-driven optimizations.

The extracted parasitic information is used to drive post-routing optimizations that aim

to improve the critical path of the design. These in-place optimizations include drive-

strength adjustments. After these optimizations, routing is again performed and the

design is checked for connectivity or design rule violations. The design is then saved in

various forms as required for the subsequent steps of the CAD flow.

3.3.3 Extraction and Timing Analysis

In our design environment, the parasitic extraction performed within SOC Encounter

GPS is not sufficiently accurate for the final sign-off timing and power analysis. Therefore,

3 Row utilization is the total area required for standard cells relative to the total area available for placement of the standard cells [106].


after placement and routing is complete, the final sign-off quality extraction is performed

using Synopsys Star-RCXT V-2004.06. This final extraction is saved for use during

the timing and power analysis that is performed using Synopsys PrimeTime SI version

X-2005.06 and Synopsys PrimePower version V-2004.06SP1 respectively.

3.4 Comparison Metrics

After implementing each design as an ASIC and using an FPGA, the area, delay and

power of each implementation were compared. The specific measurement approach can

significantly impact results; therefore, in this section, the measurement methodology for

each of the metrics is described in detail.

3.4.1 Area

The area for the standard cell implementation is defined in this work to be the final core

area of the placed and routed design. This includes the area for any memory macros

that may be required for a design. The area of the inputs and outputs is intentionally

excluded because the focus in this work is on the differences in the core logic.

Measuring the area of the FPGA implementation is less straightforward because the

benchmark designs used in this work generally do not fully utilize the logic on an FPGA.

To include the entire area of an FPGA that is not fully utilized would artificially quantize

the area measured to the vendor device sizes and would completely obscure the effects

we wish to measure. Instead, for the area measurements, only the silicon area for any

logic resources used by a design is included. The area of a design is computed as the

number of LABs, DSP blocks and memory blocks each multiplied by the silicon area of

that specific block. Again, the area of I/O’s is excluded to allow us to focus on the core

programmable logic. The silicon areas for each block were provided by Altera [85]. These

areas include the routing resources that surround each of the blocks. The entire area of

a block (such as a memory or LAB) is included in the area measurement regardless of

whether only a portion of the block is used. This block level granularity is potentially

pessimistic and in Section 3.5.1 the impact of this choice is examined. To avoid disclosing

any proprietary information, absolute areas are not reported and only the ratio of the

FPGA area to ASIC area will be presented.
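In other words, the FPGA area metric reduces to a weighted count of used blocks; the sketch below illustrates this with placeholder per-block areas (the real values, provided by Altera, are not public) and assumes the three Stratix II memory block families as block types.

    # Sketch of the FPGA area metric: number of used LABs, DSP blocks and
    # memory blocks, each weighted by that block's silicon area (including
    # its adjacent routing). The per-block areas are placeholders.

    BLOCK_AREA = {          # hypothetical relative silicon areas per block
        "LAB": 1.0,
        "DSP": 12.0,
        "M512": 2.0,
        "M4K": 5.0,
        "MRAM": 150.0,
    }

    def fpga_core_area(block_counts):
        """block_counts maps block type -> number of blocks used by the design."""
        return sum(BLOCK_AREA[b] * n for b, n in block_counts.items())

    # Example: a design using 120 LABs, 4 DSP blocks and 6 M4K memory blocks.
    print(fpga_core_area({"LAB": 120, "DSP": 4, "M4K": 6}))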


This approach (of only considering the resources used) may also be considered opti-

mistic for the following reasons: first, it ignores the fact that FPGAs unlike ASICs are

not available in arbitrary sizes and, instead, a designer must select one particular discrete

size even if it is larger than required for the design. This optimism is acceptable because

we are focusing on the cost of the programmable fabric itself. As well, we optimistically

measure the area used for the hard logic blocks such as the multipliers and the memories.

In commercial FPGAs, the ratio of logic to memories to multipliers is fixed and a designer

must tolerate this ratio regardless of the needs of their particular design. For the area

calculations in this work, these fixed ratios are ignored and the area for a heterogeneous

structure is only included as needed. This implies that we will measure the best case

impact of these hard blocks.

3.4.2 Delay

The critical path of each ASIC and FPGA design is obtained from static timing analysis

assuming worst case operating conditions. This determines the maximum clock frequency

for each design. For the ethernet benchmark, which contains multiple clocks, the geometric

mean of all the clocks in each implementation is compared. For the FPGA, timing

analysis was performed using the timing analyzer integrated in Altera Quartus II4. Timing

analysis for the ASIC was performed using Synopsys PrimeTime SI which accounts for

signal integrity effects such as crosstalk when computing the delay. The use of different

timing analysis tools for the FPGA and the ASIC is a potential source of error in the delay

gap measurements since the tools may differ in their analysis and that may contribute to

timing differences that are not due to the differences in the underlying implementation

platforms. However, both tools are widely used in their respective domains and their

results are indicative of the results for typical users.
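Where a design has several clocks, the single figure of merit is the geometric mean of the clock frequencies; a small sketch with made-up frequencies follows.

    # Sketch of reducing a multi-clock design to one figure of merit via the
    # geometric mean of its clock-domain frequencies (values are illustrative).
    import math

    def geometric_mean(values):
        return math.exp(sum(math.log(v) for v in values) / len(values))

    fpga_clocks_mhz = [155.0, 210.0, 95.0]    # hypothetical per-clock fmax values
    asic_clocks_mhz = [520.0, 760.0, 330.0]
    print(geometric_mean(asic_clocks_mhz) / geometric_mean(fpga_clocks_mhz))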

3.4.3 Power

Power is an important issue for both FPGA and ASIC designs but it is challenging to

fairly compare measurements between the platforms. This section describes in detail the

method used to measure the power consumption of the designs. For these measurements

we separate the dynamic and static contributions to the power consumption both to

4 The timing analyzer used is now known as the Quartus II Classic Timing Analyzer.


simplify the analysis and because, as will be described later, a conclusive comparison of

the static power consumptions remained elusive.

It is important to note that in these measurements we aim to compare the power

consumption gap as opposed to energy consumption gap. To make this comparison

fair, we compare the power with both the ASIC and the FPGA performing the same

computation over the same time interval. An analysis of the energy consumption gap

would have to reflect the slower operating frequencies of the FPGA. The slower frequency

means that more time or more parallelism would be required to perform the same amount

of work as the ASIC design. To simplify the analysis in this work, only the power

consumption gap was considered.

Also, it is significant that we perform this comparison using the same implementations

for the FPGA and the ASIC that were used for the delay measurements. For those

measurements, every circuit is designed to operate at the highest speed possible. This

is done because our goal is to measure the power gap between typical ASIC and FPGA

implementations as opposed to the largest possible power gap. Our results would likely

be different if we performed the comparison using an ASIC designed to operate at the

same frequency as the FPGA since power saving techniques could be applied to the ASIC.

Dynamic and Static Power Measurement

The preferred measurement approach, particularly for dynamic power measurements, is

to stimulate the post-placed and routed design with vectors representing typical usage

of the design. This approach is used when appropriate testbenches are available and

the results gathered using this method are labelled accordingly. However, in most cases,

appropriate testbenches are not available and we are forced to rely on a less accurate

approach of assuming constant toggle rates and static probabilities for all the nets in

each design.

The dynamic power measurements are taken assuming worst-case process, 85 °C and

1.2 V. Both the FPGA and ASIC implementations are simulated at the same operating

frequency of 33 MHz. This frequency was selected since it was a valid operating frequency

for all the designs on both platforms. Performing the comparison assuming the same

frequency of operation for both the ASIC and FPGA ensures that both implementations

perform the same amount of computation.


For the FPGA implementation, an exported version of the placed and routed design

was simulated using Mentor Modelsim 6.0c when the simulation-based method was pos-

sible. That simulation was used to generate a Value Change Dump (VCD) file containing

the switching activities of all the circuit nodes. Based on this information, the Quartus

II Power Analyzer measured the static and dynamic power consumption of the design.

Glitch filtering was enabled for this computation which ignores any transitions that do

not fully propagate through the routing network. Altera recommends using this setting

to ensure accurate power estimates [101]. Only core power (supplied by VCCINT) was

recorded because we are only interested in the power consumption differences of the core

programmable fabric. The power analyzer separates the dynamic and static contributions

to the total power consumption.

For the standard cell implementation, the placed and routed netlist was simulated

with back annotated timing using Cadence NC-Sim 5.40. Again, a VCD file was generated

to capture the state and transition information for the nets in the design. This file, along

with parasitic information extracted by Star-RCXT, is used to perform power analysis

with the Synopsys PrimePower tool, version V-2004.06SP1. PrimePower automatically

handles glitches by scaling the dynamic power consumption when the interval between

toggles is less than the rise and fall delays of the net. The tool also splits the power

consumption up into static and dynamic components.

In most cases, proper testbenches were not available and, for those designs, power

measurements were taken assuming all the nets in the design toggle at the same frequency

and have the same static probability. This approach does not accurately reflect the true

power consumption of a design but should be reasonable since the measurements are

only used for a relative comparison of an FPGA versus an ASIC. However, it should be

recognized that this approach may cause the power consumption of the clock networks

to be less than typically observed. Both the Quartus II Power Analyzer and Synopsys

PrimePower also offered the ability to use statistical vectorless estimation techniques in

which toggle rates and static probabilities are propagated statistically from source nodes

to the remaining nodes in the design. However, the two power estimation tools produced

significantly different activity estimates when using this statistical method and, therefore,

it was decided to use the constant toggle rate method instead.
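To illustrate what the constant toggle rate assumption amounts to, the sketch below shows a first-principles dynamic power estimate in which every net is assigned the same toggle rate and the per-net switching energies are summed. The capacitances and toggle fraction are hypothetical, and the sketch is not a description of the vendor tools' internal algorithms.

    # Minimal sketch of a constant-toggle-rate dynamic power estimate.  Each
    # transition of a net dissipates roughly 0.5*C*V^2, so with a toggle rate
    # f_t (transitions per second) the per-net power is 0.5*C*V^2*f_t.
    VDD = 1.2                 # core supply voltage (V), as in the measurements
    CLOCK_HZ = 33e6           # common operating frequency used for both platforms
    TOGGLE_FRACTION = 0.125   # hypothetical: transitions per clock cycle per net

    def dynamic_power(net_capacitances_f, toggle_fraction=TOGGLE_FRACTION,
                      vdd=VDD, clock_hz=CLOCK_HZ):
        """Estimate dynamic power assuming every net toggles at the same rate."""
        toggle_rate = toggle_fraction * clock_hz  # transitions per second
        return sum(0.5 * c * vdd ** 2 * toggle_rate for c in net_capacitances_f)

    # Example with three hypothetical net capacitances (in farads).
    print(dynamic_power([5e-15, 20e-15, 120e-15]))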


Dynamic and Static Power Comparison Methodology

Directly comparing the dynamic power consumption between the ASIC and the FPGA is

reasonable but the static power measurements on the FPGA require adjustments before a

fair comparison is possible to account for the fact that the benchmarks do not fully utilize

the FPGA device. Accordingly, the static power consumption reported by the Quartus

Power Analyzer is scaled by the fraction of the core FPGA area used by the particular

design. The fairness of this decision is arguable since end users would be restricted to

the fixed available sizes and would therefore incur the static power consumption of any

unused portions of the device. However, the discrete nature of the device sizes obscures

the underlying differences in the programmable logic that we aim to measure. Given

the arbitrary nature of the FPGA sizes and the existence of fine-grained programmable

lower power modes in modern FPGAs [6, 17] this appears to be a reasonable approach

to enable a fair comparison.

An example may better illustrate these static power adjustments. Assume a hypo-

thetical FPGA in which one LAB and one DSP block out of a possible 10 LABs and 2

DSP blocks are used. If the silicon area of the LAB and DSP block is 51 µm2 and the

area of all the LABs and DSP blocks is 110 µm2 then we would scale the total static

power consumption of the chip by 51/110 = 0.46. This adjustment assumes that leakage

power is approximately proportional to the total transistor width of a design which is

reasonable [108] and that the area of a design is a linear function of the total transistor

width which is also reasonable as FPGAs tend to be active area limited [14].
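The adjustment in this example is a simple proportional scaling, sketched below using the numbers from the hypothetical FPGA above; the chip-level static power value is likewise a placeholder.

    # Sketch of the static power adjustment: the chip-wide static power is
    # scaled by the fraction of the core area occupied by the design.
    def scaled_static_power(chip_static_mw, used_area_um2, total_core_area_um2):
        """Scale chip-level static power by the fraction of the core that is used."""
        return chip_static_mw * (used_area_um2 / total_core_area_um2)

    # One LAB and one DSP block (51 um^2) out of all LABs and DSP blocks
    # (110 um^2): the design is charged 51/110 = 0.46 of the chip's static
    # power.  The 100 mW chip-level figure is a hypothetical placeholder.
    print(scaled_static_power(100.0, 51.0, 110.0))  # -> ~46 mW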

3.5 Measurement Results

All the benchmarks were implemented using the flow described in Sections 3.2 and 3.3.

Area, delay and power measurements were then taken using the approach described in

Section 3.4 and, in this section, the results for each of these metrics will be examined.

3.5.1 Area

The area gap between FPGAs and ASICs for the twenty-three benchmark circuits is

summarized in Table 3.3. The gap is reported as the factor by which the area of the

FPGA implementation is larger than the ASIC implementation. As a key goal of this

work is to investigate the effect of heterogeneous memory and multiplier blocks on the


Table 3.3: Area Ratio (FPGA/ASIC)

Name            Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth               33
rs encoder          32
cordic18            19
cordic8             25
des area            42
des perf            17
fir restruct        28
mac1                43
aes192              47
fir3                45            17
diffeq              41            12
diffeq2             39            14
molecular           47            36
rs decoder1         54            58
rs decoder2         41            37
atm                                              70
aes                                              24
aes inv                                          19
ethernet                                         34
serialproc                                       36
fir24                                                                9.5
pipe5proc                                                             23
raytracer                                                             26

Geometric Mean      35            25             33                   18

gap, the results in the table are separated into four categories based on which com-

binations of heterogeneous resources are used. Those benchmarks that used only soft

logic are labelled “Logic Only.” (Recall from Chapter 2 that the soft logic block in

the Stratix II is the LAB.) Those that used soft logic and hard DSP blocks contain-

ing multiplier-accumulators are labelled “Logic and DSP.” Those that used soft logic

and memory blocks are labelled “Logic and Memory,” and, finally, those that used all

three are labelled “Logic, DSP and Memory.” We implemented the benchmarks that

contained multiplication operations with and without the hard DSP blocks so results for

these benchmarks appear in two columns, to enable a direct measurement of the benefit

of these blocks.

In viewing Table 3.3, first, consider those circuits that only use the soft logic: the

area required to implement these circuits in FPGAs compared to standard cell ASICs is


on average5 a factor of 35 times larger, with the different designs ranging from a factor

of 17 to 54 times larger. This is significantly larger than the area gap suggested by [2],

which used extant gate counts as its source. It is much closer to the numbers suggested

by [3].

The range in the area gap from 17 to 54 times is clearly significant but the reason

for this variability is unclear. One potential reason for these differences was thought

to be the varying sizes of the benchmarks. It is known that FPGAs are architected to

handle relatively large designs and, therefore, it was postulated that the area gap would

shrink for larger designs that can take increasing advantage of the resources included

to handle those larger circuits. This idea was tested by comparing the area gap to the

size of the circuit measured in ALUTs and the results are plotted in Figure 3.2. Only

the soft-logic benchmarks are included to keep the analysis focused on benchmark sizes

and not issues surrounding the use of heterogeneous hard blocks. For these benchmarks,

there does not appear to be a relationship between benchmark size and the area gap and,

therefore, benchmark size does not appear to be the primary cause of the varying area

gap measurements. However, additional analysis on the effects of benchmark size on the

area gap is performed later in this section.

Another factor that could cause the variability in the area gap measurements between

designs is the capability of a LUT to implement a wide range of logic functions. For

example, a 2-input LUT can implement all possible two input functions including a 2-

input NAND, a 2-input AND or a 2-input XOR. The static CMOS implementations of

those gates would require 4 transistors, 6 transistors and 10 transistors respectively. Such

implementations would be used in the standard cell gates and, therefore, depending on

the specific logic function the area gap between the LUT and the standard cell gate will

vary significantly. As the LUT is the primary resource for implementing logic in the

soft logic portion of the FPGA, the characteristics of the logic implemented in those

LUTs may significantly affect the area gap. This potential source of the wide ranging

measurements was not investigated but it likely explains at least part of the variations

observed in the measurements.

5The results are averaged using the geometric mean. The geometric mean for n positive numbers a1, a2, . . . , an is the nth root of their product (i.e. (a1 a2 · · · an)^(1/n)). This is a better measure of the average gap than alternative averages such as the arithmetic mean because the gap measurement is a multiplicative factor. For example, if two designs had an area gap of 0.25 and 4, then clearly the geometric mean of 1 would be more indicative of the typical result than the arithmetic mean of 2.125.
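For reference, the averaging used throughout the tables in this section can be computed as in the short sketch below; the example values are those from the footnote.

    import math

    def geometric_mean(values):
        """Geometric mean: the nth root of the product of n positive numbers."""
        return math.exp(sum(math.log(v) for v in values) / len(values))

    # The footnote's example: gaps of 0.25 and 4 average to 1,
    # versus an arithmetic mean of 2.125.
    print(geometric_mean([0.25, 4]))   # -> 1.0
    print(sum([0.25, 4]) / 2)          # -> 2.125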


Figure 3.2: Area Gap Compared to Benchmark Sizes for Soft-Logic Benchmarks (scatter plot; x-axis: Benchmark Size (ALUTs), y-axis: Area Gap)

The third, fourth and fifth columns of Table 3.3 report the impact of the hard het-

erogeneous blocks. It can be seen that these blocks do significantly reduce this area gap.

The benchmarks that make use of the hard multiplier-accumulators, in column three,

are on average only 25 times larger than an ASIC. When hard memories are used, the

average of 33 times larger is slightly lower than the average for regular logic and when

both multiplier-accumulators and memories are used, we find the average is 18 times.

Comparing the area gap between the benchmarks that make use of the hard multiplier-

accumulator blocks and those same benchmarks when the hard blocks are not used best

demonstrates the significant reduction in FPGA area when such hard blocks are avail-

able. In all but one case the area gap is significantly reduced6. This reduced area gap

was expected because these heterogeneous blocks are fundamentally similar to an ASIC

implementation with the only difference being that the FPGA implementation requires a

6The area gap of the rs decoder1 increases when the multiplier-accumulator blocks are used. This surprising result is attributed to the benchmark's exclusive use of 5 bit by 5 bit multiplications and these are more efficiently implemented (from a silicon area perspective) in regular logic instead of the Stratix II's 9x9 multiplier blocks.


Figure 3.3: Effect of Hard Blocks on Area Gap (scatter plot; x-axis: Heterogeneous Content (% of Total FPGA Area), y-axis: Area Gap)

programmable interface to the outside blocks and routing. Hence, compared to soft logic

blocks which have both programmable logic and routing, these heterogeneous blocks are

less programmable.

It is noteworthy that there is also significant variability in the area gap for the bench-

marks that make use of the heterogeneous blocks. One contributor to this variability is

the varying amount of heterogeneous content. The classification system used in Table 3.3

is binary in that a benchmark either makes use of a hard structure or it does not but this

fails to recognize that the benchmarks differ in the extent to which the heterogeneous

blocks are used. An alternative approach is to consider the fraction of a design’s FPGA

area that is used by heterogeneous blocks. The area gap is plotted versus this measure

of heterogeneous content in Figure 3.3. The figure demonstrates the expected trend that

as designs make use of more heterogeneous blocks the area gap tends to decline. It is not

quantified in the figure but the reduction in the area gap is accompanied by a decrease

in the degree of programmability possible in the FPGA.


While these results demonstrate the importance of these heterogeneous blocks in

improving the competitiveness of FPGAs, it is important to recall that for these hetero-

geneous blocks, the analysis is optimistic for the FPGAs. As described earlier, we only

consider the area of blocks that are used, and we ignore the effect of the fixed ratio of

logic to heterogeneous blocks that a user is forced to tolerate and pay for. Therefore,

the measurements will favour FPGAs for designs that do not fully utilize the available

heterogeneous blocks. This is the case for many of the benchmarks used in this work,

particularly the benchmarks with memory. However, this is also potentially unfair to the

FPGAs since FPGA manufacturers likely tailor the ratios of regular logic to multiplier

and memory blocks to the ratios seen in their customers' designs. If it is assumed that

the ratios closely match, then bounds on the area gap can be estimated.

Approximate Bounds

The previous results demonstrated the trend that the area gap shrinks when an increasing

proportion of heterogeneous blocks is used. However, these results were based on bench-

marks that only partially exploited the available heterogeneous blocks and, instead, if all

the heterogeneous blocks were used, the resulting area gap could be significantly lower.

We can not directly determine the potential area gap in that case because no actual

benchmarks that fully used the heterogeneous blocks were available to us. However, with

a few assumptions, it is possible to estimate a bound on the area gap.

We will base this estimate on the assumption that all the core logic blocks are used

on a Stratix II device including both the soft logic blocks (LABs) and the heterogeneous

memory and multiplier blocks (DSP, M512, M4K and M-RAM blocks). The silicon

area for all these blocks on the FPGA is known but the ASIC area to obtain the same

functionality must be estimated. This area will be calculated by considering each logic

block type individually and estimating the area gap for that block (and its routing)

relative to an ASIC. Based on those area gaps, the ASIC area estimate can be computed

by determining each logic block’s equivalent ASIC area and then summing those areas

to get the total area.

The area gap estimates for each block type are summarized in Table 3.4. (Recall that

the functionality of these logic blocks was described in Section 2.1.1.) The estimate of 35

for the LAB (or soft logic) is based on the results described previously in this section. The

DSP, M512 and M4K blocks are assumed to have an area gap of 2. This assumption is


Table 3.4: Area Gap Estimation with Full Heterogeneous Block Usage

Block           Estimated Area Gap
LAB                     35
DSP Block                2
M512 Block               2
M4K Block                2
M-RAM Block              1

Gap with 100 % Core Utilization: 4.7

based on the knowledge that, while the logic functionality itself is implemented similarly

in both the FPGA and the ASIC, the FPGA implementation requires additional area

for the programmable routing. The M-RAM block is a large 512 kbit memory and the

area overhead of the programmable interconnect is assumed to be negligible hence the

area gap of 1. (This is potentially optimistic as the M-RAM is a dual ported memory

and, when only a single port is required, the ASIC implementation would be considerably

smaller than the FPGA.) With these estimated area gaps, the full chip area gap can be

calculated as a weighted sum based on the silicon areas required for each of the block

types on the FPGA. Based on this calculation, the area gap could potentially be as low as

4.7 when all the hard blocks are used. (To avoid disclosing proprietary information, the

specific weights that produced this bound can not be disclosed.) Clearly, heterogeneous

blocks can play a significant role in narrowing the area gap.
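The arithmetic behind this weighted bound can be sketched as follows. The per-block gaps are those of Table 3.4, while the area fractions are hypothetical placeholders (the proprietary Stratix II weights that produced the 4.7 figure cannot be shown), so the sketch demonstrates only the form of the calculation.

    # Sketch of the full-chip area gap bound with 100 % core utilization.  Each
    # block type's ASIC-equivalent area is its FPGA area divided by its gap;
    # the bound is total FPGA area over total ASIC-equivalent area.
    BLOCK_GAPS = {"LAB": 35, "DSP": 2, "M512": 2, "M4K": 2, "M-RAM": 1}  # Table 3.4

    # Hypothetical fractions of core area per block type (placeholders only;
    # NOT the proprietary weights behind the 4.7 figure in the text).
    AREA_FRACTIONS = {"LAB": 0.70, "DSP": 0.10, "M512": 0.04, "M4K": 0.08, "M-RAM": 0.08}

    def full_chip_gap(fractions, gaps):
        """Area gap assuming every core block on the device is fully used."""
        asic_equivalent = sum(f / gaps[b] for b, f in fractions.items())
        return 1.0 / asic_equivalent  # FPGA area (normalized to 1) / ASIC area

    print(full_chip_gap(AREA_FRACTIONS, BLOCK_GAPS))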

While the focus of this work is on the core logic within the FPGA, it is worth noting

that the peripheral logic consumes a sizeable portion of the FPGA area. For one of

the smaller members of the original Stratix family (1S20), the periphery was reported

to consume 43 % of the overall area [109]. This peripheral area contains both the I/O

blocks that interface off-chip and other circuitry such as phase-locked loops (PLLs). If it

is assumed that, like the hard logic blocks on the FPGA, these peripheral blocks are more

efficient than soft logic then these blocks may further narrow the gap. This is especially

true for blocks such as PLLs which would be implemented similarly in both an ASIC and

an FPGA. However, some of the peripheral circuitry may be unnecessary in an ASIC im-

plementation. This is particularly true for circuitry related to the configuration memory

such as the drivers used to program the memory or the voltage regulators used to gener-

ate the appropriate voltages required for programming. This reduces the area savings of

the efficient hard peripheral blocks and, therefore, we will assume that on average all the

peripheral circuitry on the FPGA is twice as large as its ASIC implementation. Then,


if we assume all the peripheral circuitry is used and we use the earlier calculations for

the core logic, the FPGA to ASIC gap further shrinks to approximately 3.2 on average

across the Stratix II devices. Smaller devices benefit most from these assumptions as

the proportion of the peripheral area is larger and, under these assumptions, the area

gap narrows to 2.8 for the smallest Stratix II device. It clearly appears possible that

the full chip area gap will shrink if the FPGA’s peripheral circuitry can be implemented

efficiently; however, the assumed area gap for the periphery, while seemingly reasonable,

is unsubstantiated. It is left for future work to more thoroughly analyze the full chip

area gap by accurately exploring the area gap for the peripheral circuitry and the focus

of this work will now return to the core area gap measurements.

Impact of Benchmark Size on FPGA Area Measurements

One concern with the core area gap measurements is that they are significantly affected by

the size of the benchmark circuits. As described previously, in comparison to the largest

Stratix II devices, the benchmarks are relatively small in size. This is an issue because

the architecture of the Stratix II was designed to accommodate the implementation of

large circuits on those large FPGAs.

Earlier in this section, this issue was partially investigated by comparing the area

gap to the size of the benchmarks measured in the number of ALUTs used. No obvious

relationship between the circuit size and the area gap was observed. However, that

analysis did not examine the extent to which the FPGA architecture was exercised and,

in particular, the usage of the routing was not investigated. This issue of routing is

important because larger circuits generally require more routing in the form of greater

channel widths. The channel width for an FPGA family is typically determined based on

the needs of the largest and most routing-intensive circuits that can fit on the largest

device in the FPGA family. With the smaller circuits used in this work, it is possible

that the routing is not used as extensively and, therefore, a non-trivial portion of the

routing in the FPGA may be unused. This can cause the gap measurements to be biased

against the FPGA because, in the ASIC implementation, there is no unused routing.

It is useful to first investigate the theoretical impact on the area gap of this unused

FPGA routing. It has been reported that in modern FPGAs, such as the Stratix II,

the area for the soft logic, excluding the routing, is 40 % of the total area [47]. (In the

work in Chapters 4 and 5, we observed a similar trend for architectures with the large


LUT sizes now seen in high-performance FPGAs.) This leaves 60 % of the area for all

forms of routing. The routing into and inside of a logic block can be a sizeable portion

of this routing area. Based on our experiences with the work that will be described in

Chapters 4 and 5, it is common for at least a third of the total routing area to be used

by the routing into and within the logic block. The usage of this routing will primarily

depend on the utilization of the logic block. Fortunately, the FPGA CAD flow we use was

developed to ensure that the logic block was used as it would be used in large circuits.

Therefore, this routing should in general be highly used irrespective of the overall size of

the benchmark circuits.

This leaves the routing between logic blocks as the only potentially under-utilized

resource. We estimate that this inter-block routing accounts for at most only 40 % of

the total FPGA area. However, these resources typically can not be fully used. The

FPGA CAD software [101] indicates that using more than 50 % of the routing resources

in the FPGA will make it difficult to perform routing. Similarly, it has been observed

in academic studies that the average utilization of these resources is typically between

56 % and 71 % [110, 111, 112, 113]. Clearly, a sizeable portion of this routing is unused

regardless of the benchmark size. If it is assumed that at most 60 % of the routing can

be used on average then this means that at most 60 % × 40 % = 24 % of the FPGA area

would be unused by circuits with trivial routing needs. That translates into an area gap

of 27 instead of 35 for the soft logic circuits (since 35 × (1 − 0.24) ≈ 27). While that is a significant reduction in the

area gap, it is clear that, even in the worst case with trivial benchmarks, the FPGA to

ASIC area gap would still be large. Furthermore, the benchmarks used in this work were

small but many were not trivially small and, therefore, it is useful to examine the actual

usage of the routing resources by the benchmarks.

This was done for all the benchmarks used to measure the area gap. As described pre-

viously, the FPGA CAD flow used LogicLock regions to restrict each design to a portion

of the FPGA and, therefore, it is only meaningful to consider the utilization within that

portion of the FPGA. Unfortunately, that specific routing utilization information is not

readily available from Quartus II. Instead, the average routing utilization was computed

as the total number of resources used divided by the number of resources available within

the region of the FPGA that was used. The resulting average utilization will be some-

what optimistic as it includes routing elements that were used to connect the LogicLock

region with the I/O pins. To partially account for this when calculating the average, it is


Figure 3.4: Area Gap vs. Average FPGA Interconnect Usage (scatter plot; x-axis: Average FPGA Inter-Block Routing Utilization (%), y-axis: Area Gap)

assumed that each I/O pin used two routing segments outside of the LogicLock region.

The area gap is plotted against this average routing utilization in Figure 3.4.
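A sketch of how this average utilization figure is formed is given below. The resource counts are hypothetical, and the correction of two routing segments per I/O pin follows the assumption stated above.

    # Sketch of the average inter-block routing utilization metric.  The counts
    # are hypothetical; the correction assumes each I/O pin consumed two routing
    # segments outside the LogicLock region, as stated above.
    def routing_utilization(segments_used, segments_available_in_region, num_io_pins):
        """Fraction of the inter-block routing used within the LogicLock region."""
        in_region_used = max(segments_used - 2 * num_io_pins, 0)
        return in_region_used / segments_available_in_region

    # Example: 4,200 segments reported used, 12,000 available in the region,
    # and 60 I/O pins -> roughly 34 % utilization.
    print(routing_utilization(4200, 12000, 60))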

In the figure, it can be seen that the utilization is generally below the typical maximum

utilizations of between 50 % and 70 %. Nevertheless, a reasonable portion of the routing

is used in most cases and, therefore, the earlier worst-case estimate for the area gap

reduction due to underutilization of the routing was excessively pessimistic. Equally

significant is that increasing routing utilization has no apparent effect on the area gap

and the correlation coefficient of the area gap and the utilization is -0.2. Clearly, there

are other effects that impact the area gap more significantly than the routing utilization

or the benchmark size.

It should also be noted that benchmark size only has a modest effect on the routing

demands of a circuit once beyond a certain threshold. In [109], it was shown that for

benchmarks between 5000 logic elements7 and 25000 logic elements there was only a

modest increase in the required channel width. It is expected that this region of small

increases with circuit size would continue for even larger circuits. Some of the larger

7A logic element is approximately equivalent to the ALUTs used as a circuit size measure in this work.


circuits used in this work fall in this region and, given the relatively flat increase in channel width across the region, the behaviour of these large circuits should match those of the largest

circuits that can be implemented on the FPGA.

Based on this examination, it appears that the small sizes of the benchmarks used for

this work has not unduly influenced the results. In the worst case, it was estimated that

the impact on the results would be less than 24 % and, in practice, the impact should be

smaller since the benchmarks, while small, were not unreasonably small in terms of their

routing demands. Clearly, there must be other factors that affect the results and some

of these issues are explored in the following section.

Other Considerations

Besides the sizes of the benchmark circuits, there are a number of other factors that can

affect the measurements of the area gap. One factor is the approach used to determine

the area of a design on an FPGA. As described earlier, the approach used in this work is

to include the area for any resource used at the LAB, memory block or DSP block level.

If any of these blocks is even partially used, the entire area of the block (including the

surrounding routing) is included in the area measurement. This implicitly assumes the

FPGA CAD tools attempt to minimize LAB usage which is generally not the case for

designs that are small relative to the device on which they are implemented. The special

configuration of the Quartus II tools used in this work mitigated this problem.

An alternative to measuring area by the number of LABs used is to instead consider

the fraction of a LAB utilized based on the number of ALMs used in a LAB. The

area gap results using this area metric are summarized in Table 3.5. With this FPGA

area measurement technique, the area gap in all cases is reduced. The average area

gap for circuits implemented in LUT-based logic only is now 32 and the averages for

the cases when heterogeneous blocks are used have also become smaller. However, such

measurements are an optimistic lower bound on the area gap because they assume that all

LABs can be fully utilized. As well, it ignores the impact such packing could have on

the speed of a circuit.
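A sketch of this alternative, ALM-based accounting is shown below. It assumes 8 ALMs per LAB, as in the Stratix II, and reuses hypothetical block areas; memory and DSP blocks are still charged in full.

    # Sketch of the alternative area metric that charges LABs by the fraction
    # of their ALMs actually used (8 ALMs per Stratix II LAB).  Block areas are
    # hypothetical placeholders; hard blocks are still charged in full.
    ALMS_PER_LAB = 8
    BLOCK_AREA_UM2 = {"LAB": 1000.0, "DSP": 9000.0, "M4K": 3000.0}  # hypothetical

    def fractional_fpga_area(alms_used, hard_block_counts):
        """LAB area charged by ALM usage; memory and DSP blocks charged in full."""
        lab_area = (alms_used / ALMS_PER_LAB) * BLOCK_AREA_UM2["LAB"]
        hard_area = sum(BLOCK_AREA_UM2[block] * count
                        for block, count in hard_block_counts.items())
        return lab_area + hard_area

    # Example: 700 ALMs used (87.5 fully packed LAB equivalents), 2 DSPs, 4 M4Ks.
    print(fractional_fpga_area(700, {"DSP": 2, "M4K": 4}))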

These measurement alternatives for the FPGA do not apply to the ASIC area mea-

surements. However, the ASIC area may be impacted by issues related to the absolute

size of the benchmarks used in this work. The density of the ASIC may decrease for

larger designs because additional white space and larger buffers may be needed to main-


Table 3.5: Area Ratio (FPGA/ASIC) – FPGA Area Measurement Accounting for Logic Blocks with Partial Utilization

Name            Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth               32
rs encoder          31
cordic18            19
cordic8             25
des area            41
des perf            17
fir restruct        27
mac1                43
aes192              47
fir3                28            17
diffeq              32            11
diffeq2             32            14
molecular           40            36
rs decoder1         44            57
rs decoder2         36            37
atm                                              69
aes                                              24
aes inv                                          19
ethernet                                         33
serialproc                                       36
fir24                                                                8.5
pipe5proc                                                             22
raytracer                                                             26

Geometric Mean      32            24             32                   17

tain speed and signal integrity for the longer wires inherent to larger designs. The FPGA

is already designed to handle those larger designs and, therefore, it would not face the

same area overhead for such designs. As well, with larger designs, hierarchical floorplan-

ning techniques, in which the design is split into smaller blocks that are individually

placed and routed, may become necessary for the ASIC. Such techniques often add

area overhead because the initial area budgets for each block are typically conservative

to avoid having to make adjustments to the global floorplan later in the design cycle.

As well, it may be desirable to avoid global routing over placed and routed blocks to

simplify design rule checking and this necessitates the addition of white space between

blocks for the global routing. This further decreases the density of the ASIC

design; however, the FPGA would not suffer from the same effects. These factors may

be another reason why large benchmarks may have a narrower FPGA to ASIC area gap

but it is unlikely that these factors would lead to substantially different results.


As described earlier, the focus in this comparison is on the area gap between FPGAs

and ASICs for the core area only. This area gap is important because it can have a

significant impact on the cost difference between FPGAs and ASICs but other factors

can also be important. One such factor is the peripheral circuitry, which as discussed

previously may narrow the gap when fully utilized. The previous discussion of a bound

on the gap assumed that both the core and periphery logic were used fully but that need

not be the case. In particular, many small designs could be pad limited which would

mean that the die area would be set by the requirements for the I/O pads not by the core

logic area. In those cases, the additional core area required for an FPGA is immaterial.

Ultimately, area is important because of the strong influence it has on the cost of a

device. The package costs, however, are also a factor that can reduce the significance of

the core area gap. For small devices, the cost of the package can be a significant fraction

of the total cost for a packaged FPGA. The costs for silicon are then less important and,

therefore, the large area gap between FPGAs and ASICs may not lead to a large cost

difference between the two implementation approaches.

Clearly, while the measurements reported in this section indicated that the area gap

is large, there are a number of factors that may effectively narrow the gap. However, area

is only one dimension of the gap between FPGAs and ASICs and the following section

examines delay.

3.5.2 Delay

The speed gap for the benchmarks used in this work is given in Table 3.6. (The absolute

frequency measurements for each benchmark can be found in Appendix A). The table

reports the ratio between the FPGA’s critical path delay relative to the ASIC for each of

the benchmark circuits. The results in the table are for the fastest speed grade FPGAs.

As was done for the area comparison, the results are categorized according to the types

of heterogeneous blocks that were used on the FPGA.

Table 3.6 shows that, for circuits with soft logic only, the average FPGA circuit is 3.4

times slower than the ASIC implementation. This generally confirms the earlier estimates

from [2], which were based on anecdotal evidence of circa-1991 maximum operating speeds

of the two technologies. However, these results deviate substantially from those reported

in [3], which is based on an apples-to-oranges LUT-to-gate comparison.


Table 3.6: Critical Path Delay Ratio (FPGA/ASIC) – Fastest Speed Grade

Name            Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth              5.0
rs encoder         3.8
cordic18           3.7
cordic8            1.9
des area           2.0
des perf           3.1
fir restruct       4.0
mac1               3.8
aes192             4.4
fir3               3.9           3.5
diffeq             4.0           4.1
diffeq2            3.9           4.0
molecular          4.6           4.7
rs decoder1        2.5           2.9
rs decoder2        2.2           2.4
atm                                             2.9
aes                                             3.8
aes inv                                         4.3
ethernet                                        4.3
serialproc                                      2.8
fir24                                                               2.6
pipe5proc                                                           2.9
raytracer                                                           3.5

Geometric Mean     3.4           3.5            3.5                  3.0

The circuits that make use of the hard DSP multiplier-accumulator blocks are on

average 3.5 times slower in the FPGA than in an ASIC and, in general, the use of the

hard block multipliers appeared to slow down the design as can be seen by comparing the

second and third column of Table 3.6. This result is surprising since intuition suggests

the faster hard multipliers would result in faster overall circuits.

We examined each of the circuits that did not benefit from the hard multipliers

to determine the reason this occurred. For the molecular benchmark, the delays with

and without the DSP blocks were similar because there are more multipliers in the

benchmark than there are DSP blocks. As a result, even when DSP blocks are used

the critical path on the FPGA is through a multiplier implemented in regular

logic blocks. For the rs decoder1 and rs decoder2 benchmarks, only small 5x5 bit and

8x8 bit multiplications are performed and the DSP blocks which are based on 9x9 bit

multipliers do not significantly speed up such small multiplications. In such cases where

the speed improvement is minor, the extra routing that can be necessary to accommodate


the fixed positions of the hard multiplier blocks may eliminate the speed advantage of the

hard multipliers. Finally, the diffeq and diffeq2 benchmarks perform marginally slower

when the DSP blocks are used. These benchmarks contain two unpipelined stages of

32x32 multiplication that do not map well to the hard 36x36 multiplication blocks and it

appears that implementation in the regular logic clusters is efficient in such a case. With a

larger set of benchmark circuits it seems likely that more benchmarks that could benefit

from the use of the hard multipliers would have been encountered, particularly if any

designs were tailored specifically to the Stratix II DSP block’s functionality. However,

based on the current results, it appears that the major benefit of these hard DSP blocks

is not the performance improvement, if any, but rather the significant improvement in

area efficiency.

For the circuits that make use of the block memory, the FPGA-based designs are on average 3.5 times slower. The benefit of the memory blocks appears to be similar to that of the DSP blocks in that they narrow the speed gap only slightly, if at all, and their primary benefit is improved area efficiency. For the few circuits using both memory

and multipliers, the FPGA is on average 3.0 times slower. This is an improvement over

the soft logic only results but it is inappropriate to draw a strong conclusion from this

given that the improvement is relatively small and that the result is from only three

benchmarks.

To better demonstrate the limited benefit of heterogeneous blocks in narrowing the

speed gap, Figure 3.5 plots the speed gap against the amount of heterogeneous content

in a design. As described previously, the amount of heterogeneous content is measured as

the fraction of the area used in the FPGA design for the hard memory and DSP blocks.

Unlike the results seen for the area gap, as the amount of hard content is increased the

delay gap does not narrow appreciably.

Given that heterogeneous content does not appear to impact the speed gap, this gives rise to

the question of what causes the large range in the measurement results. As was done

for the area gap, the speed gap for soft logic only circuits was compared to the size of

the circuits measured in ALUTs. The results are plotted in Figure 3.6 and, again, it

appears that there is no significant relationship between the speed gap and the size of

the benchmark. The speed gap was also compared to the area gap to see if there was any

relationship and these results are plotted in Figure 3.7 as the speed gap versus the area

gap. There does not appear to be any relationship between the two gaps. Therefore,

despite these investigations, the reason for the wide range in the speed gap measurements


Figure 3.5: Effect of Hard Blocks on Delay Gap (scatter plot; x-axis: Heterogeneous Content (% of Total FPGA Area), y-axis: Delay Gap)

is unknown. As with the area gap, it may be partly due to specific logical characteristics

of the circuits but it is left to future work to determine what such factors may be.

Speed Grades

As described earlier, the FPGA delay measurements presented thus far employ the fastest

speed grade parts. Comparing to the fastest speed grade is useful for understanding

the best case disparity between FPGAs and ASICs but it is not entirely fair. ASICs

are generally designed for the worst case process and it may be fairer to compare the

ASIC performance to that of the slowest FPGA speed grade. Table 3.7 presents this

comparison. For soft logic only circuits, the ASIC performance is 4.6 times greater than

that of the slow speed grade FPGA. When the circuits make use of the DSP blocks the gap is

4.6 times and when memory blocks are used the performance difference is 4.8 times. For

the circuits that use both the memory and the multipliers, the average is 4.1 times. As

expected, the slower speed grade parts cause a larger performance gap between ASICs

and FPGAs.


Figure 3.6: Speed Gap Compared to Benchmark Sizes for Logic Only Benchmarks (scatter plot; x-axis: Benchmark Size (ALUTs), y-axis: Speed Gap)

Figure 3.7: Speed Gap Compared to the Area Gap (scatter plot; x-axis: Area Gap, y-axis: Speed Gap)


Table 3.7: Critical Path Delay Ratio (FPGA/ASIC) – Slowest Speed Grade

Name            Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth              6.7
rs encoder         5.3
cordic18           5.1
cordic8            2.5
des area           2.8
des perf           4.1
fir restruct       5.2
mac1               5.2
aes192             6.0
fir3               5.3           4.6
diffeq             5.5           5.4
diffeq2            5.3           5.4
molecular          6.2           6.3
rs decoder1        3.4           3.7
rs decoder2        3.0           3.0
atm                                             4.0
aes                                             5.1
aes inv                                         5.7
ethernet                                        5.6
serialproc                                      3.8
fir24                                                               3.8
pipe5proc                                                           3.9
raytracer                                                           4.7

Geometric Mean     4.6           4.6            4.8                  4.1

Retiming and Heterogeneous Blocks

While the CAD flows described in Sections 3.2 and 3.3 aimed to produce the fastest

designs possible, there are a number of other non-standard optimizations that could

potentially further improve performance. Since this is true for both the FPGA and the

ASIC, it is likely that any such optimizations would not impact the gap measurements

due to their relative nature. However, one optimization in particular, retiming, warranted

further investigation as it has been suggested as playing a significant role in improving

the performance in designs with heterogeneous blocks [114].

Retiming involves moving registers within a user’s circuit in a manner that improves

performance (or power and area if desired) while preserving the external behaviour of the

circuit. When performance improvement is desired, retiming amounts to positioning the

registers within a design such that the logic delays between the registers are balanced.

For FPGAs with heterogeneous blocks, retiming may be particularly important because


the introduction of those heterogeneous blocks may lead to significant delay imbalances

as some portions of the circuit become faster when implemented in the dedicated block

while other portions are still implemented in the slower soft logic. With retiming those

imbalances could be lessened and the overall performance improved. In [114], significant

performance improvements are obtained with retiming and these gains are attributed

to the reduction of the delay imbalances introduced by the use of heterogeneous blocks

within the circuit.

Since [114] only considered a small number of benchmarks, we investigated the role of

retiming with heterogeneous blocks for our larger benchmark set. For this work, Quartus

II 7.1 was used and, in addition to the settings described in Section 3.2, the physical

synthesis register retiming option was enabled and the tool was configured to use its

“extra” effort setting. The LogicLock feature was disabled since operating frequency

was the primary concern. The results with these settings were compared to a baseline

case which did not use the physical synthesis options but did disable the LogicLock

functionality.

The performance improvement with retiming is given in Table 3.8 (see footnote 8). The table indi-

cates the average improvement in maximum operating frequency for each class of bench-

mark. The row labelled “All Circuits” gives the average results across all the benchmarks

and there is a performance improvement with retiming of 5.9 %. If the benchmark cat-

egories are considered, the “Logic-only Circuits” have an average improvement of 4.0 %

which is in fact larger than the improvements for the “Logic and DSP” and “Logic and

Memory” categories which improved by 3.7 % and 1.9 % respectively.

The “Logic, Memory and DSP” designs appear to benefit tremendously from retiming;

however, this large gain comes almost exclusively from two of the twelve designs in that

category. Those two designs, which are in fact closely related as they were created for a

single application, had frequency improvements of approximately 100 %. Accompanying

those large performance improvements was a significant increase in the number of registers

added to the circuit. The increase in registers for each class of benchmarks is listed in

the third column of Table 3.8 and the doubling of registers in those two benchmarks

is out of line from the other benchmarks. Given the unusual results with these two

benchmarks, excluding them from the comparison appears to be appropriate and the final

8Since both the retiming and the baseline CAD flows were performed using the same tools, the full set of benchmarks was used including those that were rejected from the FPGA to ASIC comparisons. These full results can be found in Appendix A.


Table 3.8: Impact of Retiming on FPGA Performance with Heterogeneous Blocks

Benchmark Category                          Geometric Mean          Geometric Mean
                                            Operating Frequency     Register Count
                                            Increase (%)            Increase (%)
All Circuits                                     5.9 %                  9.7 %
Logic-only Circuits                              4.0 %                   11 %
Logic and DSP Circuits                           3.7 %                  4.3 %
Logic and Memory Circuits                        1.9 %                  2.7 %
Logic, Memory and DSP Circuits                    18 %                   22 %
Logic, Memory and DSP Circuits (subset)          3.1 %                  4.6 %

row of the table labelled “Logic, Memory and DSP Circuits (subset)” excludes those two

designs. We then see an average improvement of only 3.1 % which is again below the

improvement achieved in logic only circuits. It is possible that the results from the two

excluded benchmarks were valid as there does not appear to be anything abnormal in

the circuits to explain the significant improvements they achieved. An investigation with

more benchmarks is needed in the future to more thoroughly examine whether these

benchmarks were atypical as we assumed or were in fact indicative of the improvements

possible with retiming.

Based on these results (excluding the two outliers), retiming does not appear to offer

additional performance benefits to designs using heterogeneous blocks. Therefore, the

earlier conclusion that the performance gap between FPGAs and ASICs is not signifi-

cantly impacted by heterogeneous blocks remains valid. It should be emphasized that,

while retiming did clearly offer improved performance for all the FPGA designs on av-

erage, similar improvements for the ASIC designs could likely be achieved through the

addition of retiming to the standard cell CAD flow. For that reason, FPGA to ASIC mea-

surements are taken using the standard CAD flows from Sections 3.2 and 3.3 that did

not make use of retiming.

3.5.3 Dynamic Power Consumption

The last dimension of the gap between FPGAs and ASICs is that of power consumption.

As mentioned previously, to simplify this analysis, the dynamic power and static power

consumption are considered separately and this section will focus on the dynamic power

consumption. In Table 3.9, we list the ratio of FPGA dynamic power consumption to


Table 3.9: Dynamic Power Consumption Ratio (FPGA/ASIC)

Name           Method   Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
booth          Sim          26
rs encoder     Sim          52
cordic18       Const        6.3
cordic8        Const        5.7
des area       Const        27
des perf       Const        9.3
fir restruct   Const        9.6
mac1           Const        19
aes192         Sim          12
fir3           Const        12           7.5
diffeq         Const        15            12
diffeq2        Const        16            12
molecular      Const        15            16
rs decoder1    Const        13            16
rs decoder2    Const        11            11
atm            Const                                     15
aes            Sim                                       13
aes inv        Sim                                       12
ethernet       Const                                     16
serialproc     Const                                     16
fir24          Const                                                         5.3
pipe5proc      Const                                                         8.2
raytracer      Const                                                         8.3

Geometric Mean              14            12            14                   7.1

ASIC power consumption for the benchmark circuits. Again, we categorize the results

based on which hard FPGA blocks were used. As described in Section 3.4.3, two ap-

proaches are used for power consumption measurements and the table indicates which

method was used. “Sim” means that the simulation-based method (with full simulation

vectors) was used and “Const” indicates that a constant toggle rate and static probability

was applied to all nets in the design.

The results indicate that on average FPGAs consume 14 times more dynamic power

than ASICs when the circuits contain only soft logic. The simulation-based results are

compared to the constant-toggle rate measurements in Table 3.10 for the few circuits for

which this was possible. The results for each specific benchmark do differ substantially in

some cases; however, overall the ranges of the measurements are similar and there is no
obvious bias towards under- or over-prediction. Therefore, while the constant toggle rate

method was not the preferred measurement approach, its results appear to be satisfactory.


Table 3.10: Dynamic Power Consumption Ratio (FPGA/ASIC) for Different Measurement Methods

Name         Simulation-Based Measurements   Constant Toggle Rate Measurements
booth                     26                                30
rs encoder                52                                25
aes192                    12                                30
aes                       13                               9.5
aes inv                   12                               6.8

When we examine the results for designs that include hard blocks such as DSP blocks

and memory blocks, we observe that the gap is 12, 14 and 7.1 times for the cases when

multipliers, memories and both memories and multipliers are used, respectively. The area

savings that these hard blocks enabled suggested that some power savings should occur

because a smaller area difference implies less interconnect and fewer excess transistors

which in turn means that the capacitive load on the signals in the design will be less.

With a lower load, dynamic power consumption is reduced and we observe this in general.

In particular, we note that the circuits with multiplication operations consume equal or less power when the area-efficient DSP blocks are used than when those same circuits are implemented without them. The exceptions are rs decoder1, which suffered

from an inefficient use of the DSP blocks described in Section 3.5.1 and molecular.

In Figure 3.8, the power gap is plotted against the amount of heterogeneous content

in a design (with heterogeneous content again measured in terms of area). The chart

suggests that as designs use more heterogeneous resources, there is a slight reduction in

the FPGA to ASIC dynamic power gap. Such a relationship was expected because of

the previously shown reduction in the area gap with increased hard content.

Other Considerations

The clock network in the FPGA is designed to handle much larger circuits than were

used for this comparison. As a result, for these modestly sized benchmarks, the dynamic

power consumption of this large network may be disproportionately large. With larger

designs, the incremental power consumption of the clock network may be relatively small

and the dynamic power gap could potentially narrow as it becomes necessary to construct

equally large clock networks in the ASIC.


Figure 3.8: Effect of Hard Blocks on Power Gap (scatter plot; x-axis: Heterogeneous Content (% of Total FPGA Area), y-axis: Dynamic Power Gap)

It is also important to recognize that core dynamic power consumption is only one

contributor to a device’s total dynamic power consumption. The other source of dynamic

power is the input/output cells. Past studies have estimated that I/O power consumption

is approximately 7–14 % of total dynamic power consumption [115, 116] but this can
be very design dependent. While the dynamic power consumption gap for the I/Os

was not measured in this work, we anticipate that it would not be as large as the core

logic dynamic power gap because, like the multipliers and memories, I/O cells are hard

blocks with only limited programmability. Therefore, including the effect of I/O power

consumption is likely to narrow the overall dynamic power gap.

3.5.4 Static Power Consumption

In addition to the dynamic power, we measured the static power consumption of the

designs for both the FPGA and the ASIC implementations; however, as will be described,

we were unable to definitively quantify the size of the gap. We performed static power


measurements for both typical silicon at 25 °C and worst-case silicon at 85 °C. For these

power measurements, the worst case silicon is the fast process corner. To account for the

fact that the provided worst case standard cell libraries were characterized for a higher

temperature, the standard cell results were scaled by a factor determined from HSPICE

simulations of a small sample of cells. We did not need to scale the results for typical

silicon. Also, as described in Section 3.4.3, the FPGA static power measurements are

adjusted to reflect that only a portion of each FPGA is used in most cases.

Despite these adjustments, we did not obtain meaningful results for the static power

consumption comparison when the power was very small. Therefore, any results where

the static power consumption for the standard cell implementation was less than 0.1 mW

(in the typical case) are excluded from the comparison. Based on these restrictions, the

results from this comparison, with the lower power benchmarks removed, are given in

Tables 3.11 and 3.12 for the typical and worst cases respectively. The tables list the ratio

of the static power measurement for the FPGA relative to the ASIC and, as was done

for the dynamic power measurements, the measurement method, either simulation-based

(“Sim”) or constant toggle-based (“Const”), is indicated in the second column of the

table.

Clearly, the typical and worst case results deviate significantly. For soft logic only

designs, on average the FPGA-based implementations consumed 81 times9 more static

power than the equivalent ASIC when measured for typical conditions and typical silicon

but this difference was only 5.1 times under worst case conditions for worst case silicon.

Similar discrepancies can be seen for the benchmarks with heterogeneous blocks.

Unfortunately, neither set of measurements offers a conclusive measure of the static

power consumption gap. Designers are generally most concerned about worst-case condi-

tions which makes the typical-case measurements uninformative and potentially subject

to error since more effort is likely spent by the foundries and vendors ensuring the ac-

curacy of the worst-case models. However, the worst-case results measured in this work

suffer from error introduced by our temperature scaling. As well, static power, which

is predominantly due to sub-threshold leakage for these processes [117], is very process

dependent and this makes it difficult to ensure a fair comparison given the available in-

formation. In particular, we do not know the confidence level of either worst-case leakage

9For the subset of benchmarks in Table 3.11 that are soft logic-only and do not have a DSP implementation, which are the only soft logic results given in Table 3.12, the average static power consumption gap is 74.


Table 3.11: Static Power Consumption Ratio (FPGA/ASIC) at 25 °C with Typical Silicon

Name           Method   Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
rs encoder     Sim          50
cordic18       Const        77
des area       Const        91
des perf       Const        51
fir restruct   Const        69
mac1           Const        86
aes192         Sim          85
diffeq         Const        86            25
diffeq2        Const        91            32
molecular      Const        84            69
rs decoder2    Const       130           120
atm            Const                                    230
aes            Sim                                       33
aes inv        Sim                                       31
ethernet       Const                                    170
fir24          Const                                                          13
pipe5proc      Const                                                         160
raytracer      Const                                                          97

Geometric Mean              81            51            80                    59

estimate. These estimates are influenced by a variety of factors including the maturity of

a process and, therefore, a comparison of leakage estimates from two different foundries,

as we attempt to do here, may reflect the underlying differences between the foundries

and not the differences between FPGAs and ASICs that we seek to measure. Another

issue that makes comparison difficult is that, if static power is a concern for either FPGAs

or ASICs, manufacturers may opt to test the power consumption and eliminate any parts

which exceed a fixed limit. Both business and technical factors could impact those fixed

limits. Given all these factors, to perform a comparison in which we could be confident,

we would need to perform HSPICE simulations using identical process models. We did

not have these same concerns about dynamic power because process and temperature

variations have significantly less impact on dynamic power.

Despite our inability to reliably measure the static power consumption gap, the re-

sults do provide some useful information. In particular, we did find that, as expected,

the static power gap and the area gap are somewhat correlated. The correlation coeffi-

cient of the area gap to the static power gap is 0.73 and 0.76 for the typical and worst

case measurements respectively. This was expected because transistor width is generally


Table 3.12: Static Power Consumption Ratio (FPGA/ASIC) at 85 °C with Worst Case Silicon

Name           Method   Logic Only   Logic & DSP   Logic & Memory   Logic, Memory & DSP
rs encoder     Sim         3.4
cordic18       Const       5.1
des area       Const       6.3
des perf       Const       3.5
fir restruct   Const       4.7
mac1           Const       5.9
aes192         Sim         6.7
diffeq         Const                    1.7
diffeq2        Const                    2.2
molecular      Const                    5.5
rs decoder2    Const                    8.2
atm            Const                                      17
aes            Sim                                       2.7
aes inv        Sim                                       2.5
ethernet       Const                                      13
fir24          Const                                                         1.0
pipe5proc      Const                                                         5.9
raytracer      Const                                                         N/A

Geometric Mean             5.0           3.5             6.2                  2.4

proportional to the static power consumption [108] and the area gap partially reflects

the difference in total transistor width between an FPGA and an ASIC. This relation-

ship is important because it demonstrates that hard blocks such as multipliers and block

memories, which reduced the area gap, reduce the static power consumption gap as well.

While the static power consumption gap is correlated with the area gap, it is poten-

tially noteworthy that the two gaps are not closer in magnitude. There are a number of

potential reasons for this difference. One is that there are portions of the FPGA, such

as the PLLs and large clock network buffers, which may contribute to the static power

consumption but are not present in the ASIC design. Our measurement method of reduc-

ing the static power according to the area used does not eliminate such factors; instead,

it only amortizes the power consumption of those additional features across the whole

device. Another source of difference between the area and static power consumption

gaps may be that the FPGA and the ASIC use different ratios of low-leakage high-Vt

transistors to leakier standard-Vt and/or low-Vt transistors. For instance a significant

portion of the area gap is due to the configuration memories in the FPGA but those

memories can make use of high-Vt devices as they are not performance critical. Given


Table 3.13: FPGA to ASIC Gap Measurement Summary

Metric           Logic Only   Logic & DSP   Logic & Memory   Logic, DSP & Memory
Area             35           25            33               18
Performance      3.4–4.6      3.4–4.6       3.5–4.8          3.0–4.1
Dynamic Power    14           12            14               7.1
Static Power     Inconclusive

the combination of these factors and the measurement challenges described previously,

the deviation between the static power and area gaps is somewhat understandable.

3.6 Summary

In this chapter, we have presented empirical measurements quantifying the gap between

FPGAs and ASICs for core logic and these results are summarized in Table 3.13. As

shown in the table, we found that for circuits implemented purely using soft logic, an

FPGA is on average approximately 35 times larger, between 3.4 to 4.6 times slower

and 14 times more power hungry for dynamic power as compared to a standard cell

implementation. While this core logic area gap may not be a concern for I/O lim-

ited designs, for core-area limited designs this large area gap contributes significantly

to the higher costs for FPGAs. When it is desired to match the performance of an

ASIC with an FPGA implementation, the area gap is effectively larger because ad-

ditional parallelism must be added to the FPGA-based design to achieve the same

throughput. If it is assumed that ideal speedup is possible (with 2 instances yielding

2× performance, 3 instances 3× performance, and so on) then the effective area gap is

Area Gap × Performance Gap = 35 × 3.4 = 119. Clearly, this massive gap prevents the

use of FPGAs in any cost-sensitive markets with demanding performance requirements.

The large power gap of 14 also detracts significantly from FPGAs and is one factor that

largely limits them to non-mobile applications.

As well, as described in Chapter 2, it is well known that ASICs are not the most

efficient implementation possible. If the ASIC to custom design gap is also considered

using the numbers from [80, 81, 1, 79], then compared to full custom design the soft logic

of an FPGA is potentially 508 times larger, 10.2 times slower and 42 times more power

hungry.


While heterogeneous blocks, in the form of memory and multipliers, were found to

significantly reduce the area gap and at least partially narrow the power consumption

gap, their effect on performance was minimal. Therefore, expanding the market for

FPGAs requires further work addressing these large gaps. The remainder of this thesis

focuses on understanding methods for narrowing this gap through appropriate electrical

design choices.

Chapter 4

Automated Transistor Sizing

The large area, performance and power gap between FPGAs and ASICs reported in the

previous chapter clearly demonstrates the need for continued research aimed at narrowing

this gap. While narrowing the gap will certainly require innovative improvements to

FPGA architectures, it is also instructive to gain a more thorough understanding of the

existing gap and the trade-offs that can be made with current architectures. This offers a

complementary approach for closing the gap. The navigation of the gap by exploring these

trade-offs is the focus of the remainder of this thesis. This exploration will consider the

three central aspects of FPGA design: logical architecture, circuit design and transistor

sizing. The challenge for such an exploration is that transistor sizing for FPGAs has been

performed manually in most past works [14, 18, 28] and that has limited the scope of

those previous investigations. To enable broader exploration in this work, an automated

approach for transistor sizing of FPGAs was developed and that is the subject of this

chapter.

Transistor sizing is important because accurate assessment of an FPGA’s area, perfor-

mance and power consumption requires detailed transistor-level information. With the

past manually sized designs, it was not feasible to optimize a design for each architecture

being investigated. Instead, a single carefully optimized design was created and then

only a portion of the design would be optimized when a new architecture was considered.

For example, in [18], as different LUT sizes were considered, the LUT delays were opti-

mized again but the remainder of the design was left unchanged. This means that many

potentially significant architecture and circuit design interactions were ignored. In [18],

the delay of a routing track of fixed logical length was taken to be constant but other

architectural parameters can significantly affect that routing track’s physical length and



its delay. An automated transistor sizing tool will ensure that these important effects

are considered by optimizing the transistor-level design for each unique architecture. In

addition, an automated sizing tool enables a new axis of exploration, that of area and

performance trade-offs through transistor sizing. Previously, only a single sizing, such as

that which minimized the area-delay product, would have been considered. Exploration

of varied sizings has become particularly relevant because the market for FPGAs has

expanded to include sectors that have different area/cost and performance requirements.

The remainder of this chapter will describe the automated sizing tool developed to

enable these explorations. It was appropriate to develop a custom tool to perform this

sizing because the transistor sizing of FPGAs has unique attributes that require special

handling and this chapter first reviews these issues. The inputs and the metrics used

by the optimizer are then described in detail. Next, the optimization algorithm itself

is presented and, finally, the quality of results obtained using the optimization tool are

assessed through comparisons with past works.

4.1 Uniqueness of FPGA Transistor Sizing Problem

The optimization problem of transistor sizing for FPGAs is on the surface similar to the

problem faced by any custom circuit designer. It involves minimizing some objective

function such as area, delay or a product of area and delay, subject to a number of

constraints. Examples of typical constraints include: a requirement that transistors are

greater than minimum size, an area constraint limiting the maximum area of the design or

a delay constraint specifying the maximum delay. While this is a standard optimization

problem, the unique features of programmable circuit design create additional challenges

but also offer opportunities for simplification.

4.1.1 Programmability

The most significant unique feature of FPGA design optimization is that there is no

well-defined critical path. Different designs implemented on an FPGA will have different

critical paths that use the resources on the FPGA in varying proportions. Therefore, there

is no standard or useful definition for the “delay” of an FPGA; yet, a delay measurement

is necessary if the performance of an FPGA is to be optimized. Architectural studies have

addressed this challenge by using the full experimental process described in Section 2.4


Figure 4.1: Repeated Equivalent Parameters (a routing track connecting several logic blocks; the buffer widths wbuf,p and wbuf,n and the multiplexer width wmux,n recur at the logically equivalent multiplexers labelled x, y and z)

to assess the quality of an FPGA design. However, such an approach is not suitable for

transistor sizing as it is not feasible to evaluate the impact of every sizing change using

that full experimental flow. Instead, simplifications must be made and the handling of

this issue will be described in Section 4.3.2.

4.1.2 Repetition

The other feature of FPGA design is the significant number of logically equivalent com-

ponents. A simple example of this is shown in Figure 4.1 in which a single routing track

connects to equivalent tracks in different tiles and within each tile. Modern FPGAs con-

tain thousands of equivalent tiles each with potentially hundreds of equivalent resources

[6, 7] and, therefore, the number of logically equivalent components is large. Breaking

that equivalency by altering the sizes of some resources can be advantageous [35, 36, 37]

but such changes alter the logical architecture of the FPGA. Therefore, for transistor

sizing purposes, identical sizes for logically equivalent components must be maintained

as described in Section 2.1.3.

This requirement to maintain identical sizes has two significant effects: the first is

that it reduces the flexibility available during optimization. Figure 4.1 illustrates an

example of this where it would be advantageous to minimize intermediate loads, such

as that imposed by the multiplexer at y, when optimizing the delay from point x to z.

However, since the multiplexer at y is logically equivalent to the multiplexer at z, any

reduction in those transistor sizes might also increase the delay through the multiplexer

at z. Clearly, during optimization, the conflicting desires to reduce intermediate loads

must be balanced with the potential detrimental delay impact of such sizings.


Table 4.1: Logical Architecture Parameters Supported by the Optimization Tool

Parameter                               Possible Settings
Logic Block                             Fully Populated Clusters of BLEs
LUT Size                                No Restriction
Cluster Size                            No Restriction
Number of Cluster Inputs                No Restriction
Input Connection Block Flexibility      No Restriction
Output Connection Block Flexibility     No Restriction
Routing Structure                       Single Driver Only
Routing Channel Width                   Multiples of Twice the Track Length
Routing Track Length                    No Restriction

The other effect of the requirement to maintain logical equivalency is that the number

of parameters that require optimization is greatly reduced. This enables approaches to

be considered that would not normally be possible when optimizing a design containing

the hundreds of millions of transistors now found in modern FPGAs [118]. Both of these

effects are considered in the optimization strategy developed in this chapter.

4.2 Optimization Tool Inputs

We have developed a custom tool to perform transistor sizing in the face of these unique

aspects of FPGA design. The goal of this tool is to determine the transistor sizings for an

FPGA design and to do this for a broad range of FPGAs with varied logical architectures,

electrical architectures, and optimization objectives. Accordingly, parameters describing

these variables must be provided as inputs to the tool. This section describes these inputs

and the range of parameters the tool is designed to handle.

4.2.1 Logical Architecture Parameters

The logical architecture of an FPGA comprises all the parameters that define its func-

tionality. The specific parameters that will be considered and any restrictions on their

values are summarized in Table 4.1. (The meaning of these parameters was described

in Section 2.1.) This list of parameters that can be explored includes the typical ar-

chitecture parameters such as LUT size, cluster size, routing channel width and routing

segment length that have been investigated in the past [14, 18].

However, a significant restriction in the parameters is that only soft logic and, fur-

thermore, only one class of logic blocks will be considered. This restriction still allows


many architectural issues within this class of logic blocks to be investigated but it means

that larger changes including the use and architecture of hard blocks such as multipliers

or memories can not be investigated. This restriction is necessary to keep the scope

of this work tractable because the design of hard logic blocks often has its own unique

challenges, particularly in the case of memory. Ignoring the hard logic blocks is accept-

able as the soft logic comprises a large portion of the FPGA’s area [109] and soft logic

continues to significantly impact the overall area, performance and power consumption

of an FPGA as shown in Chapter 3. As well, while the design of the hard logic block

itself will not be considered, the design of the soft logic routing, which we will consider,

could be reused for hard logic blocks.

The other architectural restrictions relate to the routing. Only single driver routing

will be considered since, as described in Section 2.1.2, this is now the standard approach

for commercial FPGAs. It is also conventional to assume a uniform design and layout for

the FPGA. With a regular design, a single tile containing both logic and routing can be

replicated to create the full FPGA. However, this desire for regularity limits the number

of routing tracks in each channel to multiples of twice the track length. (Multiples of

twice the track length are required due to the single driver topology as it is necessary to

have an equal number of tracks running in each direction.)

4.2.2 Electrical Architecture Parameters

The logical architecture parameters discussed above define the functionality of the cir-

cuitry. However, there are a number of different possible electrical implementations that

can be used to implement that functionality and exploring these possibilities is another

goal of this work. As the FPGAs we will consider are composed solely of multiplexers,

buffers and flip-flops, the primary electrical issues to consider are multiplexer imple-

mentation and buffer placement relative to those multiplexers. Flip-flops consume a

relatively small portion of the FPGA area and are not a significant portion of typical

critical paths; therefore, their implementation choices will not be examined. The de-

sign of flip-flops can significantly affect power consumption but, as will be described in

the following section, this work focuses exclusively on area and delay trade-offs. For

multiplexers, there are a number of alternative electrical architectures that have been

used including fully encoded multiplexers [14], three-level partially-decoded structures

[33] or two-level partially-decoded structures [15, 31, 34]. The approaches offer different


trade-offs between the delay through the multiplexer and the area required for the mul-

tiplexer. There are also issues to consider regarding the placement of buffers as it has

varied between placement at the input to multiplexers (in addition to the output) [18]

or simply at the output [15, 31]. Again, there are possible trade-offs that can be made

between performance and area. These implementation choices are left as inputs to allow

the impact of these parameters to be explored in Chapter 5.

4.2.3 Optimization Objective

Finally, the optimizer will also be used to explore the trade-offs possible through varied

transistor sizings. The most obvious such trade-offs are between area and performance as

improved performance often requires (selectively) increasing transistor sizes and thereby

area. We have chosen not to explore power consumption trade-offs for a number of rea-

sons. First, power consumption is closely related to area for many architectural changes

[39] and we confirmed this in our own architectural and transistor-sizing investigations.

The exploration of power would therefore add little to the breadth of the trade-offs ex-

plored but it requires CAD tools that support power optimization. The CAD tool flow

described in Section 2.4.1 that is used in this work does not support power optimiza-

tion and, therefore, extensive work would be required to add such support. Also, while

there are approaches such as power gating that alter the relationship between area and

power [119], such techniques require architectural changes. That would necessitate more

extensive changes to the CAD tools beyond those necessary to enable power optimization.

This leaves the exploration of area and performance trade-offs. To explore such

trade-offs, a method is needed for varying the emphasis placed on area or delay during

optimization. The optimizer could be set to aim for an arbitrary area or delay constraint

while minimizing the delay or area, respectively. However, such an approach does not

provide an intuitive feeling of the trade-offs being made. A more intuitive approach is

to have the optimizer minimize a function that reflects the desired importance of area

and delay and, for example, in past works [14, 28, 18], the optimization goal has been

to minimize the area-delay product. A more general form of such an approach is to

minimize

Area^b · Delay^c    (4.1)

with b and c greater than or equal to zero. With this form, area and delay after op-

timization can be varied by altering the b and c coefficients. This approach provides a


better sense of the desired circuit design objective and also allows for direct comparisons

with designs that optimized the area-delay product. Therefore, this is the form of the

objective function that will be used in this work and appropriate values of b and c will be

provided as inputs to the optimization process. This objective function clearly requires

quantitative estimates of the area and delay of the design and the process for generating

these measurements is described in the following section.
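To make the effect of the b and c exponents concrete, the short Python sketch below evaluates the objective of Equation 4.1 for two hypothetical sizings; the specific numbers and names are illustrative only and are not drawn from the optimizer itself.

    def objective(area, delay, b=1.0, c=1.0):
        # Evaluate the Area^b * Delay^c objective of Equation 4.1.
        return (area ** b) * (delay ** c)

    # Two hypothetical sizings: one smaller but slower, one larger but faster.
    small_slow = (100.0, 2.0)    # (area in arbitrary units, delay in ns)
    large_fast = (150.0, 1.5)

    # A balanced objective (b = c = 1) favours the smaller design (200.0 vs 225.0),
    # while a delay-weighted objective (c = 3) favours the faster one (800.0 vs 506.25).
    print(objective(*small_slow), objective(*large_fast))
    print(objective(*small_slow, c=3), objective(*large_fast, c=3))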

4.3 Optimization Metrics

The sizing tool must appropriately optimize the area and performance of the FPGA given

the logical architecture, electrical architecture and optimization objective inputs. To do

this, the optimizer needs to have measures of area and performance and, since thousands

or more different designs will be considered, these measurements must be performed

efficiently. The issues of programmability described in Section 4.1 suggest that a full

experimental flow is necessary to obtain accurate area and performance measurements,

but efficiency dictates that simpler approaches are used. Therefore, proxies for the area

and performance of an FPGA were developed and are described in this section.

4.3.1 Area Model

The goal of the area model is to estimate the area of the FPGA based on its transistor-

level design with little manual effort or computation time. The manual effort per area

estimate must be low because of the large number of designs that will be considered.

Similarly, it is necessary to keep the computation time low to prevent area calculations

from dominating the optimization run times.

The desire for low effort clearly precludes the use of the most accurate approach for

measuring the area of a design which would be to lay out the full design. An alternative

of automated layout of the FPGA [120, 121, 122, 123] is also not appropriate both

because these approaches require manually designed cells as inputs1 and because the

tools require significant computational power. An alternative is to use area models that

are based simply on the number of transistors or the total width of the transistors. Such

1 Standard cell libraries are not a suitable alternative because they would severely limit the specific transistor sizings and circuit structures that could be considered. As well, it has been found that they introduce a significant area overhead [29, 122].


models allow the area measurement to be calculated easily from a transistor-level netlist

but these approaches are not sufficiently accurate.

Instead, we will use a hybrid approach that estimates the area of each transistor

based on its width and then determines the area of the full design by summing the

estimated transistor areas. This approach, known as the minimum-width transistor areas

(MWTA) model [14], was described in Section 2.4.2. The original form of this model

was developed based on observations made from 0.35 µm and 0.40 µm CMOS processes.

These technologies are no longer current and, therefore, an updated MWTA model was

developed. In developing this new model, two goals were considered: first it should

reflect the design rules of modern processes and second it should incorporate the effects

of modern design practices. To ensure that the model is sufficiently accurate, the area of

manually created layouts will be compared to the area predicted by the model.

The original model [14] for estimating the area based on the transistor width, w, was

Area(w) = (β + w / (α · wminimum)) · AreaMWTA    (4.2)

with β = 0.5 and α = 2. We observed that these particular values of coefficients2 no

longer accurately reflected the number of minimum-width transistor areas required as

the width of a transistor is increased and, therefore, new values were determined based

on the design rules for the target process. The particular values used are not reported

to avoid disclosing information obtained under non-disclosure agreements3. These up-

dated coefficient values ensure that the model reflects current design rules but further

adjustments are necessary to capture the impact of standard layout practices.

The first such enhancement is necessary to reflect the impact of fingering on the

area of a transistor. This is an issue because performance-critical devices are typically

implemented using multi-fingered layouts as this reduces the diffusion capacitances and

thereby improves performance. However, the number of fingers in a device layout can

have a significant effect on the area as we observed that the α term for the 2-finger layout

was 32 % larger than the α factor for a non-fingered layout. To account for this, the area

2 Due to the requirement that a minimum width device has an area of one minimum-width transistor area, the two parameters α and β are not independent. Their values must satisfy the following relationship: α = 1/(1 − β).
3 While we cannot disclose the α and β coefficients for our target process, we can report that, for the deep-submicron version (DEEP) of the Scalable CMOS design rules [124], α = 2.8 and β = 0.653. These new values agree with the general trend we observe that α > 2 and β > 0.5.


model will use a different set of β and α coefficients when a transistor’s width is large

enough to permit the use of two fingers. A maximum of two fingers will be assumed for

simplicity.
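As an illustration of how this model can be applied, the following Python sketch computes per-transistor and total areas in minimum-width transistor areas. The single-finger coefficients are the SCMOS DEEP values quoted in footnote 3; the two-finger coefficients (assumed to share β but with an α roughly 32 % larger, as observed above) and the fingering threshold are placeholders, since the coefficients for our target process cannot be disclosed.

    # Minimal sketch of the updated MWTA-style area model (Equation 4.2).
    # Widths are expressed in multiples of the minimum transistor width.
    W_MIN = 1.0
    FINGER_THRESHOLD = 4.0 * W_MIN   # assumed width at which a 2-finger layout is used
    COEFFS = {
        1: {"alpha": 2.8,        "beta": 0.653},   # SCMOS DEEP values from footnote 3
        2: {"alpha": 2.8 * 1.32, "beta": 0.653},   # ~32 % larger alpha assumed for 2 fingers
    }

    def transistor_area(width, area_mwta=1.0):
        # Area of one transistor, in minimum-width transistor areas (MWTAs).
        fingers = 2 if width >= FINGER_THRESHOLD else 1
        a = COEFFS[fingers]["alpha"]
        b = COEFFS[fingers]["beta"]
        return (b + width / (a * W_MIN)) * area_mwta

    def design_area(widths, area_mwta=1.0):
        # Total estimated area is the sum of the per-transistor estimates.
        return sum(transistor_area(w, area_mwta) for w in widths)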

Another issue that is not reflected in the original MWTA model is that the layouts of
some portions of an FPGA are heavily optimized because they are instantiated repeatedly

and cover a significant portion of the area of an FPGA. This is particularly true of the

configuration memory cells. This cell is used throughout the FPGA and it is also used

identically in every single FPGA design we will consider. Therefore, to obtain a more

accurate estimate of the memory cell’s area, we manually laid it out4. When laying

out the cell, it was apparent that there are significant area-saving opportunities possible

through the sharing of diffusion regions between individual bits. As bits usually occur in

groups such as when controlling a multiplexer, it is reasonable to assume such sharing is

generally possible. Our estimate of the typical configuration memory bit area therefore

assumes that diffusion sharing is possible in one dimension.

After these changes to the MWTA model, the estimated area was compared to the

actual area for three manually drawn designs. These three designs were a 2-level 16-input

multiplexer, a 2-level 32-input multiplexer5 and a 3-LUT. To improve the accuracy of

the estimate, the AreaMWTA factor in Equation 4.2 (which should be the minimum area

required for a minimum-width transistor as was shown in Figure 2.13) was scaled. The

scale factor was selected to minimize the absolute error of the predicted areas relative to

the actual areas for the three designs.

The estimated areas including the impact of the scaling factor are compared to the

actual areas in Table 4.2 for the three test cells. The cell being compared is indicated

in the first column and the area from the manual layout is given in the second column.

The third and fourth columns indicate the estimated area and the error in this estimate

relative to the actual area when the original MWTA model is used. The last two columns

provide the area estimate and the error when the updated model is used. Clearly, the new

model offers improved accuracy but error remains. However, the results were considered

4 The cell was laid out using standard design rules. The use of the standard design rules and the assumption of the need for body contacts in every cell means that each bit is larger than the bit area in commercial memories [90, 91, 92]. While relaxed design rules are common for commercial memories [104], it will conservatively be assumed that such relaxed rules are not possible given the distributed nature of the configuration memory throughout the FPGA.
5 The 32-input multiplexer was sized differently than the 16-input multiplexer. As well, since it is also a 2-level multiplexer, the layout does not reuse any portion of the 16-input multiplexer layout.


Table 4.2: Area Model versus Layout Area

Cell                             Actual Area (µm²)   MWTA Model [14]            New Model
                                                     Area (µm²)   Error         Area (µm²)   Error
32-input Multiplexer & Buffer    164.2               226.6        38.0 %        209.6        27.7 %
16-input Multiplexer & Buffer    67.3                64.0         −4.97 %       67.2         0.2 %
3-input LUT                      48.6                43.4         −8.67 %       48.6         0.0 %

sufficiently accurate for this work. There is, nevertheless, room for future work to develop

improved area modelling.

The final area metric for optimization purposes will be the area of a single tile (which

includes both the routing and the logic block) as determined using this area estimation

model. It should also be noted that the area estimates serve an additional purpose beyond

the direct area metric used for optimization. These estimates are also used to determine

the physical length of the programmable routing tracks. This is done by estimating the

X-Y dimensions of the tile from the area estimate for the full tile. The estimates of these

interconnect lengths are needed to accurately assess the performance of the FPGA. The

following section describes the use of these interconnect segments and the modelling of

delay.
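A small sketch of this length calculation is given below, under the simplifying assumption that the tile is square; the thesis only states that the X-Y dimensions are derived from the tile area, so the function and variable names are illustrative.

    import math

    def tile_dimension(tile_area_um2):
        # Side length (in um) of a tile, assuming a roughly square layout.
        return math.sqrt(tile_area_um2)

    def routing_track_length(tile_area_um2, logical_length):
        # Physical wire length (in um) of a track spanning `logical_length` tiles.
        return logical_length * tile_dimension(tile_area_um2)

    # Example: a 2500 um^2 tile is roughly 50 um on a side, so a length-4 track is ~200 um.
    print(routing_track_length(2500.0, 4))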

4.3.2 Performance Modelling

The performance model used by the optimizer must reflect the performance of user de-

signs when implemented on the FPGA. It certainly is not feasible to perform the full

CAD flow with multiple benchmark designs for each change to the sizing of individual

transistors in the FPGA. Instead, a delay metric that can be directly measured by the

optimizer is needed. One potential solution is to take the circuitry of one or more critical

paths when implemented on the FPGA and use some function of the delays of those

paths as the performance metric. Such an approach is not ideal, however, because it

only reflects the usage of the FPGA by a few circuits. These sample circuits may only

use some of the resources on the FPGA or may use some resources more than is typi-

cal. The number of circuits could be expanded but that would cause the simulation and

optimization time to increase considerably.

Instead, to ensure a reasonable run-time, an alternative approach was developed. A

single artificial path was created and this path contains all the resources that could be on

the critical path of an application circuit. A simplified version of this path is illustrated


in Figure 4.2. (The figure does not illustrate any of the additional loads that are on

the path shown but those loads are included where appropriate in the path used by the

optimizer.) This artificial path is the shortest path that contains all the unique resources

on the FPGA. To ensure realistic delay measurements, as shown in the figure, non-ideal

interconnect is assumed for both the routing tracks and the intra-cluster tracks. These

interconnect segments are assumed to be minimum width metal 3 layer wires and the

length of these tracks is set based on the estimated area.

This single path ensures that simulation times will be reasonable; however, an obvious

issue with this single artificial path is that it is unlikely to be representative of typical
critical paths in the FPGA. Therefore, the delay of this artificial path would not be
an effective gauge of the performance of the FPGA. However, the delay of this path

contains the delays of all the components that could be on the critical path. These

individual delays are labelled in the figure. Trouting,i is the delay of routing segment of

type i and it includes the delay of the multiplexer, buffer and interconnect. The delay

through the multiplexer and buffer into the logic block is TCLB in (Recall that a CLB

is a Cluster-based Logic Block) and the delay from the intra-cluster routing line to the

input of the LUT is referred to as TBLE in. The delay through the LUT depends on the

particular input of the LUT that was used. The inputs are numbered from slowest (1)

to fastest (LUT Size) and, hence, the delay through the LUT is TLUT,i where i is the

number of the LUT input. Finally, the delay from the output of the LUT to a routing

track input is TCLB out.

On their own the individual component delays are not useful measures of the FPGA

performance but, if those component delays are appropriately combined, then it is pos-

sible to obtain a representative measure of the FPGA performance. This representative

delay, Trepresentative, will be used as the performance metric and, to compute this delay,

each delay term, Tx, is assigned a weight wx. The representative delay is then calculated

as follows:

Trepresentative = Σ (i = 1 to Num Segments) wrouting,i · Trouting,i + Σ (i = 1 to LUT Size) wLUT,i · TLUT,i
                + wCLB in · TCLB in + wCLB out · TCLB out + wBLE in · TBLE in    (4.3)


Figure 4.2: FPGA Optimization Path (the path passes through routing drivers of each segment type, the CLB input, the BLE input and the inputs of the 4-LUT, with RC models for the metal interconnect; the labelled delays are Trouting,1, Trouting,2, TCLB In, TBLE In, TLUT,i and TCLB Out)


The specific weights were set based on the frequency with which each resource was used on

average in the critical paths of a set of benchmark circuits. For the interested reader, the

impact of these weights on the optimization process is further discussed in Appendix B.
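For illustration, the sketch below computes the representative delay of Equation 4.3 from per-component delays and weights; all of the numbers are placeholders rather than measured weights or delays.

    # Sketch of the weighted representative-delay calculation of Equation 4.3.
    def representative_delay(routing, lut, w_clb_in, t_clb_in,
                             w_clb_out, t_clb_out, w_ble_in, t_ble_in):
        # `routing` and `lut` are lists of (weight, delay) pairs, one per routing
        # segment type and one per LUT input respectively.
        total = sum(w * t for w, t in routing)
        total += sum(w * t for w, t in lut)
        total += w_clb_in * t_clb_in + w_clb_out * t_clb_out + w_ble_in * t_ble_in
        return total

    # Hypothetical example: two routing segment types and a 4-LUT (delays in seconds).
    routing = [(3.0, 250e-12), (1.5, 400e-12)]
    lut = [(0.5, 300e-12), (0.5, 280e-12), (0.5, 260e-12), (0.5, 240e-12)]
    print(representative_delay(routing, lut, 1.0, 150e-12, 1.0, 120e-12, 2.0, 90e-12))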

4.4 Optimization Algorithm

With the inputs and optimization metrics defined, the optimization process can now be

described. This optimization involves selecting the sizes of the transistors that can be

adjusted, w1 . . . wn, according to an objective function, f (w1 . . . wn), of the form shown

in Equation 4.1. The optimization problem can then be stated as follows:

min   f(w1, ..., wn)
s.t.  wi ≥ wmin,   i = 1, ..., n    (4.4)

The optimization tool must perform this optimization for any combination of the param-

eters detailed in Section 4.2 to enable the exploration of design trade-offs that will be

performed in Chapter 5. Based on this focus, the goal is to obtain good quality results in

a reasonable amount of time. The run time will only be evaluated subjectively; however,

the quality of the results will be tested quantitatively in Section 4.5 by comparing the

results obtained using this tool to past works.

Two issues were considered when developing the optimization methodology: the tran-

sistor models and the algorithm used to perform the sizing. Simple transistor models

such as switch-level RC device models enable fast run-times and there are many straight-

forward optimization algorithms that can be used with these models. However, these

models are widely recognized as being inaccurate [60, 61]. The use of more accurate

models increases the computation time significantly as the model itself requires more

computation and the optimization algorithms used with these models also typically re-

quire complex computations. Neither of these two extremes appeared suitable on its own

for our requirements. Therefore, we adopted a hybrid approach that first uses simple

(and inaccurate) models to get the approximate sizes of all the devices. Those sizes are

then further optimized using accurate device models. We believe we can avoid the need

for complex optimization algorithms despite using the accurate models because the first

phase of optimization will have ensured that sizes are reasonable. This two step process

is illustrated in Figure 4.3 and described in detail below.


Figure 4.3: FPGA Optimization Methodology (the logical architecture, electrical architecture and optimization objective are inputs to Phase 1, which sizes with RC models, and Phase 2, which refines the sizes with HSPICE-based optimization to produce the final optimized sizes)

4.4.1 Phase 1 – Switch-Level Transistor Models

For this first phase of optimization, the goal is to quickly optimize the design using

simple transistor models. One of the simplest possible approaches is to use switch-

level resistor and capacitor models. With such models the delay of a circuit can be

easily computed using the standard Elmore delay model [58, 59]. The optimization

of circuits using these models has been well-studied [66, 63, 67, 125] and it has been

recognized that delay modelled in this way is a posynomial function [63]. The expression

for area is also generally a posynomial function6. Therefore, the optimization objective

as a product of these posynomial area and delay functions is also a posynomial [64].

The optimization problem is then one of minimizing a posynomial function and such

an optimization problem can be mapped to a convex optimization problem [64]. This

provides the useful property that any local minimum solution obtained is in fact the global

minimum. Given these useful characteristics and the mature nature of this problem,

switch-level RC models were selected for use in this phase of optimization.

The algorithm to use for the optimization can be relatively simple since there is

no danger of being trapped in a sub-optimal local minimum. Accordingly, the TILOS

6 The use of multiple α and β coefficients in the calculation of the area of a transistor as described in Section 4.3.1 means that our expression for area is no longer a posynomial. Since having a posynomial function is desirable, for this phase of the optimization only, the α and β coefficients in the area model are fixed to their two-fingered values.


Figure 4.4: Switch-level RC Transistor Model (each device is represented by a gate capacitance Cg, a source-to-drain resistance Rsd and source/drain diffusion capacitances Cd)

algorithm [63] (described in Section 2.5) was selected; however, as will be described,

some modifications were made to the algorithm. Before describing these modifications,

the transistor-level model will be reviewed in greater detail.

Switch-Level Transistor Models

The switch-level models used for this phase of the optimization treat transistors as a

gate capacitance, Cg, a source to drain resistance, Rsd, and source and drain diffusion

capacitances, Cd. This model is illustrated in Figure 4.4. All the capacitances (Cg and

Cd) are varied linearly according to the transistor width with different proportionality

constants, Cdiff and Cgate, for gate and diffusion capacitances. For both capacitances,

a single effective value is used for both PMOS and NMOS devices and, therefore, for

either device type of width, w, the diffusion and gate capacitances are calculated as

Cd = Cdiff ·w and Cg = Cgate ·w, respectively.

The source to drain resistance is varied depending on both the type of transistor,

PMOS or NMOS, and its use within the circuit because both factors can significantly

affect the effective resistance. The source to drain resistor for PMOS and NMOS devices

when their source terminals are connected to VDD and VSS respectively is modelled with

resistances that are inversely proportional to the transistor's width. To reflect the
different conductances of the devices, different proportionality constants are used such that

for an NMOS of width, w, Rsd = Rn/w and for a PMOS of width w, Rsd = Rp/w. (This

same model would also be used if PMOS or NMOS devices were used as part of a larger

pull-up or pull-down network respectively; however, since only inverters are used within

our FPGA such cases did not have to be considered.) NMOS devices are also used as

pass transistors within multiplexers as described in Section 2.2. Such devices pass low

voltage signals well and, therefore, falling transitions through these devices are modelled

identically to NMOS devices that were connected to VSS. However, those devices do not


pass signals that are near VDD well and a different resistance calculation is used for those

rising transitions. That resistance is calculated by fitting simulated data of such a device

to a curve of the form

Resistance(w) = Rn,1 / w^b    (4.5)

where Rn,1 and b are the constants determined through curve fitting. This fitting need

only be performed once for each process technology7. Using a different resistance calcu-

lation for pass transistor devices has been previously proposed [60]. It is unnecessary to

consider the transmission of logic zero signals through PMOS devices because, in general,

the circuits we will explore do not use PMOS pass transistors.

The use of these resistance and capacitance models is demonstrated in Figure 4.5.

In Figure 4.5(a) the routing track being modelled is shown and Figures 4.5(b) and (c)

illustrate the resistor and capacitor representation of that routing track for a falling and

rising transition respectively8. Based on these models, the delay for each transition can be
computed using the Elmore delay [58, 59] as described in Section 2.4.3.
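The sketch below shows how such an Elmore delay estimate can be assembled for the track of Figure 4.5, with an inverter driving a wire loaded by NMOS pass-transistor multiplexer inputs. The technology constants are placeholders, the wire resistance and capacitance are omitted for brevity, and the 0.69 factor converts the RC product to a 50 % delay; none of this is the exact code used by the optimizer.

    # Simplified switch-level RC model and Elmore delay for the routing track of Figure 4.5.
    RN, RP = 10e3, 20e3              # ohm per unit width: NMOS pull-down, PMOS pull-up (placeholders)
    RN1, B = 12e3, 0.9               # fitted constants of Equation 4.5 for pass transistors (placeholders)
    CGATE, CDIFF = 1e-15, 0.8e-15    # F per unit width: gate and diffusion capacitance (placeholders)

    def r_pass(w, rising):
        # Pass-transistor resistance: Equation 4.5 for rising outputs, Rn/w for falling.
        return RN1 / (w ** B) if rising else RN / w

    def elmore_track_delay(w_p, w_n, w_pass, n_loads, rising=True):
        # 0.69 * sum over nodes of (upstream resistance * capacitance at that node).
        r_drive = (RP / w_p) if rising else (RN / w_n)
        # Capacitance on the track itself: driver diffusion plus each pass-transistor diffusion.
        c_track = CDIFF * (w_p + w_n) + n_loads * CDIFF * w_pass
        delay = 0.69 * r_drive * c_track
        # One pass transistor is on and drives the next buffer's input gate.
        c_far = CDIFF * w_pass + CGATE * (w_p + w_n)
        delay += 0.69 * (r_drive + r_pass(w_pass, rising)) * c_far
        return delay

    print(elmore_track_delay(w_p=4.0, w_n=2.0, w_pass=1.5, n_loads=8))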

While these RC transistor models are computationally easy to calculate and provide

the useful property that the delay is a posynomial function, they are severely limited in

their accuracy [60, 61]. One frequently recognized limitation is the failure to consider the

impact of the input slope on the effective source to drain resistance and, while a number of

approaches have been proposed to remedy this [60, 126], the inaccuracies remain partic-

ularly for the latest technologies. Therefore, instead of developing improved switch level

models, the subsequent phase of optimization will refine the design using accurate device

models. Before describing that optimization, the TILOS-based optimization algorithm

will be reviewed.

TILOS-based algorithm

The task in this first phase of optimization is to optimize transistor sizes using the

previously described switch-level RC models. A TILOS-based [63] algorithm is used

for this optimization. Changes were made to the original TILOS algorithm to address

some of its known deficiencies and to adapt to the design environment employed in this

work. The basic algorithm along with the changes are described in this section. In

7 This resistance model does not affect the posynomial nature of the delay since Rn,1 is always positive.
8 The falling or rising nature of the transition refers to the transition on the routing track itself. Clearly, for the example shown in the figure, since the track is driven by a single inverter, the input transition would be the inverse transition.


Figure 4.5: Example of a Routing Track modelled using RC Transistor Models. Panels: (a) Simple Routing Track; (b) Resistance and Capacitance for Falling Transition; (c) Resistance and Capacitance for Rising Transition.

this discussion, the algorithm will be described as changing parameter values, not specific
transistor sizes, to emphasize that transistors are not sized independently since preserving

logical equivalency requires groups of transistors to have matching sizes.

The TILOS-based phase of the optimization begins with all the transistors set to min-

imum size. For each parameter, the improvement in the objective function per change in

area is measured. This improvement per amount of area increase is termed the param-

eter’s sensitivity. With only a single representative path to optimize, the sensitivity of

every parameter must be measured since they can all affect the delay. Like the original

TILOS algorithm, the value of the parameter with the greatest sensitivity is increased.

In addition to this, the algorithm was modified to also decrease the size of the param-

eter with the most negative sensitivity. Negative sensitivity means that increasing the

parameter, increases the objective function. Therefore, decreasing the parameter im-

proves (reduces) the objective function. This eliminates one of the limitations of TILOS

which can prevent it from achieving optimal results. After the adjustments are made,

the process repeats and all the sensitivities are again measured.


The sensitivity in the original TILOS implementation was computed analytically9

[63]. For this work, we compute the sensitivity numerically as follows:

Sensitivity(w) = − [Objective(w + δw) − Objective(w)] / [Area(w + δw) − Area(w)]    (4.6)

where w is the width of the transistor or transistors whose sensitivity is being measured.

This numerical computation of the sensitivity requires multiple evaluations of the ob-

jective function which means that the computational demands for this approach may

be higher than an approach relying on analytic methods. However, this was not a con-

cern since this phase of optimization was not a significant bottleneck compared to the

following more computationally intensive phase.
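The following sketch shows how one iteration of this modified TILOS loop might look: the sensitivity of Equation 4.6 is evaluated numerically for every parameter, the most positively sensitive parameter is increased and the most negatively sensitive one is decreased. The `objective` and `area` callables stand in for the RC-model evaluations and the names are illustrative; the oscillation guard discussed below is omitted here.

    def sensitivity(params, i, objective, area, delta):
        # Numerical sensitivity of Equation 4.6: objective improvement per unit of added area.
        bumped = dict(params)
        bumped[i] = params[i] + delta
        return -(objective(bumped) - objective(params)) / (area(bumped) - area(params))

    def tilos_step(params, objective, area, delta, step, w_min=1.0):
        sens = {i: sensitivity(params, i, objective, area, delta) for i in params}
        best = max(sens, key=sens.get)    # most positive: growing it helps most per unit area
        worst = min(sens, key=sens.get)   # most negative: shrinking it improves the objective
        new_params = dict(params)
        if sens[best] > 0:
            new_params[best] += step
        if sens[worst] < 0 and worst != best:
            new_params[worst] = max(new_params[worst] - step, w_min)   # respect minimum width
        return new_params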

In the later phases of optimization, only discrete sizes for each parameter are con-

sidered to reduce the size of the design space that must be explored. For example, in a

90 nm technology we would only consider sizes to the nearest 100 nm. Using these large

quantized transistor sizes with the numerical sensitivity calculations during the TILOS

optimization could lead to a sub-optimal result; however, to avoid this, the TILOS phase

of the algorithm does not maintain the predefined discrete sizes and, instead, uses sizing

adjustments that are one hundredth the size of the standard increment. This is the size

used for δw in the above expression. Once TILOS completes, sizes are rounded to the

nearest discrete size.

The modification to the algorithm that allows parameters to increase and decrease

in size has one significant side effect. It is possible for a parameter to oscillate from

having the greatest positive sensitivity to having the largest negative sensitivity. Without

any refinement, the algorithm would then alternate between increasing the size of the

parameter before decreasing it in the next cycle. Clearly, this is an artifact arising from

the numerical sensitivity measurements and the quantized adjustments to parameter

values. To address this, the last iteration in which a parameter was changed is recorded.

No changes in the opposite direction are permitted for a fixed number of iterations i.e.

if a parameter was increased in one iteration, it can not be decreased in the subsequent

iteration. This phenomenon of oscillatory changes in parameters can also occur amongst

a group of parameters and, for this reason, the number of iterations between changes must
be made larger than the value of two that one might otherwise expect. We found experimentally that requiring

9 An analytic expression for the derivative of the delay function with respect to the transistor width was determined which allowed the sensitivity at the current width to be computed directly [63].


the number of iterations between changes to be one tenth the total number of parameters,

yielded satisfactory results. This approach also impacts the way in which this algorithm

terminates and that is reviewed in the following section.
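One possible way to implement this guard is sketched below; the thesis describes the rule but not this exact bookkeeping, so the class and method names are illustrative.

    class OscillationGuard:
        # After a parameter moves in one direction, block a move in the opposite
        # direction for `lockout` iterations (one tenth of the parameter count).
        def __init__(self, num_parameters):
            self.lockout = max(2, num_parameters // 10)
            self.last_move = {}   # parameter -> (iteration, direction: +1 or -1)

        def allowed(self, param, direction, iteration):
            prev = self.last_move.get(param)
            if prev is None:
                return True
            prev_iteration, prev_direction = prev
            # Only reversals inside the lockout window are blocked.
            return direction == prev_direction or iteration - prev_iteration >= self.lockout

        def record(self, param, direction, iteration):
            self.last_move[param] = (iteration, direction)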

Termination Criteria

The original TILOS algorithm terminates when the constraints are satisfied or when any

additional size increases worsen the performance of the circuit. These criteria were

suitable for the original algorithm but, due to the modifications made in the present

work, the termination criteria must also be modified. The issue with the first criterion

of terminating when the constraints are satisfied is that it is not applicable in our work

because the optimization problem is always one of minimizing an objective function with

the only constraints being the minimum width requirements of transistors. The second

criterion of stopping when no size increases are advantageous is also not useful due to

the capability of the current algorithm to decrease sizes as well.

Therefore, new termination criteria were necessary for use with the new algorithm.

The approach that was used is to terminate the algorithm when all the parameters either:

1. can not be adjusted, due to restrictions on oscillatory changes, or

2. offer no appreciable improvement from sizing changes.

With these requirements, there is the possibility that the algorithm will terminate before

achieving an optimal solution. This did not prove to be a major concern in practice in

part because the optimal solution would only be optimal for this simple device model.

The near-optimal result obtained with our algorithm provides a more than adequate

starting point for the next phase of optimization which will further refine the sizes.

4.4.2 Phase 2 – Sizing with Accurate Models

The sizes determined using the RC switch-level models in the previous phase are then

optimized with delay measurements taken based on more accurate device models. There

are a number of possible models that could be used ranging from improvements on the

basic RC model to foundry-supplied device models. We opted for the foundry supplied

models to ensure a high level of accuracy. It is feasible to consider using such models

because the circuit to be simulated (the single path described in Section 4.3.2) contains at

most thousands of transistors. This relatively modest transistor count also means that the

simulation using these device models can be performed using the full quality simulation of


Synopsys HSPICE [127]. We use HSPICE because runtime is not our primary concern.

If shorter runtimes were desired, then simulation with fast SPICE simulators such as

Synopsys HSIM [128] or Nanosim [129] could be used instead. This decision to use

HSPICE and the most accurate device models does mean that simulation will be compute

intensive and, therefore, an optimization algorithm that requires relatively few
simulations must be selected.

The task for the optimization algorithm is to take the transistor sizing created in the

previous phase of optimization and produce a new sizing based on the results obtained

from simulation with HSPICE. The underlying optimization problem is unchanged from

that defined in Equation 4.4 and, therefore, the final sizing produced by the algorithm will

further reduce the optimization objective function. Since the initial phase of optimization

ensured that the input transistor sizes are reasonable, a relatively simple optimization

algorithm was adopted. The approach selected was a greedy iterative algorithm that is

summarized as pseudocode in Figure 4.6 and is described in greater detail below.

This algorithm begins with all the parameters, P , set to the values determined in

the previous optimization phase. The current value of the ith parameter will be denoted

P (i). For each parameter, i, a number of possible values, PossibleParameterV alues(i),

around the current value of the parameter are considered. Specifically, for transistors in

buffers, forty possible values are examined and, for transistors in multiplexers, twenty

possible values are considered. The reduction in the number of multiplexer transistor

sizes was made simply because it was observed that the multiplexer transistor sizes did

not increase in size as much. The size of the increments between test values depends on

the target technology. For the 90 nm process used for most of this work, an increment of 100 nm

was used. This somewhat coarse granularity allowed a relatively large region of different

values to be considered. It is certainly possible that with a smaller granularity slight

improvements in the final design could be made. However, for the broad exploration

that will be undertaken in this work, a coarse granularity was considered appropriate.

The optimization path described in Section 4.3.2 is then simulated with all the pos-

sible values as shown in the loop from lines 5 to 8 in the Figure 4.6. For each parameter

value, the representative delay, Di(k), is calculated. Similarly, the area, Ai(k), is also

determined for each parameter value. From the delay and area measurements, the ob-

jective function given in Equation 4.1 is calculated (with the ComputeObjectiveFunction

function in the pseudocode) and a value for the parameter is selected with the goal of

minimizing the objective function. However, to prevent minor transistor modelling issues


Input: Parameter values, P, from first phase of optimization
Output: Final optimized parameter values, P
 1  begin
 2      repeat
 3          ParameterChanged = false;
 4          for i ← 1 to NumberOfParameters do
 5              for k ∈ {PossibleParameterValues(i)} do
 6                  Di(k) = Delay from simulation with P(i) = k;
 7                  Ai(k) = Area with P(i) = k;
 8              end
 9              BestValue = min{PossibleParameterValues(i)};
10              MinObjectiveFunctionValue = ComputeObjectiveFunction(Di(BestValue), Ai(BestValue));
11              for j = ({PossibleParameterValues(i)} sorted smallest to largest) do
12                  CurrObjectiveValue = ComputeObjectiveFunction(Di(j), Ai(j));
13                  if CurrObjectiveValue < MinObjectiveFunctionValue and Di(j) < 0.9999 · Di(BestValue) then
14                      MinObjectiveFunctionValue = CurrObjectiveValue;
15                      BestValue = j;
16                  end
17              end
18              if BestValue ≠ P(i) then
19                  ParameterChanged = true;
20                  P(i) = BestValue;
21              end
22          end
23          Reduce PossibleParameterValues;
24      until not ParameterChanged;
25  end

Figure 4.6: Pseudocode for Phase 2 of Transistor Sizing Algorithm

or numerical noise from unduly influencing the optimization, the absolute minimum is

not necessarily accepted as the best parameter value. An alternative approach is needed

because, particularly for pure delay optimization, numerical or modelling issues would

occasionally lead to unrealistic sizings if the absolute minimum was simply used.

To avoid such issues, the specific approach used for selecting the best parameter value

starts by examining the results for the parameter value that produced the smallest area

design. The result for the objective function at this parameter value is used as the starting

minimum value of the objective function and is denoted as MinObjectiveFunctionV alue.

The next largest parameter value is then considered to see if it yielded a design that


reduced the objective function; however, the objective function for this parameter value

will only be taken as a new minimum if it offers a non-trivial improvement in the objective

function. Specifically, the delay with the new parameter value must improve by at least

0.01 %. If the improvement satisfied this requirement, then the current value of the

objective function is taken as the minimum. If the improvement was insufficient or the

delay was in fact worse, then the minimum objective function value is left unchanged.

The process then repeats for the next largest parameter value. This whole process for

selecting the best parameter value is captured in lines 8 to 16 of the pseudocode. After

considering all the simulated parameter values, the parameter value that produced the

best minimum objective function is selected. This approach to selecting the minimum

also has the effect of ensuring that if two parameter values led to the same values for the

objective function then the parameter value with the smallest area would be selected.

Once the current best value of the parameter has been determined, this new value of

the parameter will be used when the next parameters are evaluated. This whole process

repeats for the next parameter.
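For reference, the selection rule of lines 8 to 16 of the pseudocode can be expressed compactly as follows; this is a re-statement of the figure, with candidate values assumed to be sorted from smallest to largest, not additional functionality.

    def select_best_value(candidates, delay, area, objective_fn):
        # `candidates` are the possible parameter values sorted smallest to largest;
        # `delay` and `area` map each candidate to its simulated delay and estimated area.
        best = candidates[0]
        best_obj = objective_fn(delay[best], area[best])
        for k in candidates[1:]:
            obj = objective_fn(delay[k], area[k])
            # Accept a larger value only if the objective improves and the delay
            # improves by at least 0.01 % over the current best.
            if obj < best_obj and delay[k] < 0.9999 * delay[best]:
                best, best_obj = k, obj
        return best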

Once all the parameters have been examined in this way, the entire process then

repeats again. It was found that in each subsequent pass through all the parameters

the range of values considered for each parameter could be reduced slightly with little

impact on the final results. Specifically, the number of values is reduced by 25 %. For

example, if ten values were evaluated for a parameter in the current iteration then this

would be reduced to eight (due to rounding) for the next iteration. This range reduction

was implemented as it significantly reduced the amount of simulation required and it is

represented in the pseudocode with the “Reduce PossibleParameterValues” step.

The process repeats over all the parameters until one complete pass is made without

a change to any parameter. At this point the algorithm terminates and the final sizing

of the design has been determined.

Parameter Grouping

During the development of this algorithm, it was hypothesized that this greedy algorithm

would be limited if it considered transistor sizes individually because it can be advanta-

geous to adjust the sizes of closely connected transistors in tandem. For this reason, the

parameters considered during optimization also include parameters that affect groups of

transistors and, in particular, this is done for the buffers in the design. For example,


in a two stage buffer, one optimization parameter linearly affects the sizes of all four

transistors in the design. Similarly, the two transistors in the second inverter stage can

be increased in size together. The two stage buffer is still described by four parameters

which means we retain the freedom to adjust each transistor size individually as well.

This is useful as it can enable improvements such as those possible by skewing the p-n

ratios to offset the slow rise times introduced by multiplexers implemented using NMOS

pass transistors.

Parameter Ordering

The described algorithm considers each parameter sequentially and, as the algorithm

progresses, updated values are used for the previously examined parameters. It seemed

possible that the ordering of the parameters could impact the optimization results. This

issue was examined by optimizing the same design with the same parameters but with

different orderings. The possibilities examined included random orderings and orderings

crafted to deal with the parameters in order of their subjective importance. In all cases,

similar results were obtained and, therefore, it was concluded that the ordering of the

parameters does not have a significant effect on the results from the optimizer.

To further test that these potential issues were adequately resolved and to determine

the overall quality of the optimization methodology, the following section compares the

results obtained from this methodology with past designs.

4.5 Quality of Results

The goal in creating the optimization tool based on the previously described algorithm

is to enable the exploration of the performance and area trade-offs that are possible

in the design of FPGAs. To ensure the validity of that exploration, the quality of the

results produced by the optimization tool was tested through comparison with past works.

Specifically, the post-optimization delays will be compared to work that considered the

transistor sizing of the routing resources within the FPGA [31, 32] and of the logic block

[14, 130, 18]. In addition to this, the performance of the optimizer will be compared to an

exhaustive search for a simplified problem for which exhaustive searching was possible.


Figure 4.7: Test Structure for Routing Track Optimization (the path shows the input stimulus, the wire length being driven, and the delay to be optimized)

4.5.1 Comparison with Past Routing Optimizations

The routing used to programmably interconnect the logic blocks has a significant impact

on the performance and area of an FPGA and its design has been the focus of extensive

study [28, 14, 35, 31, 32]. Most past studies focused on designs using multi-driver routing, and such results are not directly comparable to our work, which exclusively

considered single driver routing. However, the design of single driver routing was explored

in [31, 32] and the results of that work will be compared to the results obtained using

designs generated by our optimizer.

Optimization purely for delay was the primary focus of [31, 32] and, specifically, the

delay to be optimized is shown in Figure 4.7 along with the circuitry used for waveform

shaping and output loading. The delay optimization was performed using the process

described in Section 2.5 in which the sizing of the buffers shown in the figure was opti-

mized using an exhaustive search to determine the overall buffer size and the number of

inverter stages to use. The size ratio between the inverters within the buffer was then

determined analytically.

For comparison purposes, our optimizer was configured to also operate on the path

illustrated in Figure 4.7. To match the procedure used in [31, 32], the delay to be

optimized was set to the delay shown in the figure instead of the representative delay

typically used by the optimizer. Similarly, the target technology was set to be TSMC’s

180 nm 1.8 V CMOS process [131] to conform with the process used in [31, 32].

As was done in [31, 32], optimization was performed for a number of interconnect

segment lengths and the results obtained by our optimizer are compared to [31, 32] in

Table 4.3. The first column of the table indicates the physical length of the interconnect

line driven by the buffer whose size was optimized. The results from [32] are then listed

in the second column labelled “As Published”. All the delays in the table are listed

in ps/mm and are the average of the rising and falling transitions. As was shown in


Table 4.3: Comparison of Routing Driver Optimizations

                                     Delay (ps/mm)
    Wirelength    From [31, 32]                  Present Work
      (mm)        As Published   Replicated    Minimum Mux   Mux Optimization
      0.5             408            410            409            262
      1               260            258            257            203
      2               192            189            193            159
      3               184            178            179            160
      4               191            181            176            168

Figure 4.7, the resistance and capacitance of the metal interconnect lines was modelled

and, since the specific manner in which this interconnect was modelled was not described

in [31, 32], slightly different results were obtained when the exact sizings reported in [31,

32] were simulated. These re-simulated delays are listed in the table in the column labelled

“Replicated”. Clearly, the differences are minor as they are at worst 11 ps/mm and

could also be caused by slightly different simulator versions (HSPICE Version A-2007.09

is used in our work). These replicated results are included to provide a fair comparison

with the column labelled “Minimum Mux” which indicates the results obtained by our

optimizer. The results from our work closely match the results obtained by [31, 32] and

vary between being 2.8 % faster to being 2.1 % slower as compared to the replicated delay

results. Clearly, our optimizer is able to produce designs that are comparable to those

from [31, 32].

Our optimizer also provides the added benefit that it can be used to perform more

thorough optimization. In [32], the multiplexers were assumed to use minimum width

transistors and, for the above comparison, this restriction was preserved. However, with

our sizing tool, it is possible to also consider the optimization of those transistor sizes.

The results when such optimization is permitted are listed in the column labelled “Mux

Optimization” in Table 4.3. Performance improvements of up to 36 % are observed when

the multiplexer transistor widths are increased. Clearly, the optimizer is able to both

deliver results on par with prior investigations while also enabling a broader optimization.

However, this comparison only considered the optimization of the routing drivers. The

logic block is also important and a comparison of its optimization is performed in the

next section.


Figure 4.8: Logic Cluster Structure and Timing Paths (a routing track feeding the input connection block of a logic cluster of N BLEs, each containing a K-LUT and flip-flop, with intra-cluster tracks and BLE input multiplexers; the timing points A, B, C and D are marked)

4.5.2 Comparison with Past Logic Block Optimization

The transistor-level design of the logic block has been examined in a number of past

works [14, 130, 18] and the designs created by our optimizer will be compared to these

past studies. For these previous investigations, the design of a complete logic cluster, as

shown in Figure 4.8, was performed. The goal in the transistor-level design of the logic

block was to produce a design with minimal area-delay product. However, only delay

measurements for a number of paths through the logic block will be compared as areas

were not reported.

For this comparison, our optimizer was set to perform sizing to minimize the design’s

area-delay product based on the area and delay measurements described previously in

this chapter. This means that the sizing of both the logic block and routing will be

performed whereas in the past works [14, 130, 18] only the logic block was considered.

Our approach is preferable because it attempts to ensure that area and delay are balanced

between the routing and the logic block.

The prior work considered the sizing for a number of different cluster sizes (N) and

LUT sizes (K) and the delays for these different cluster and LUT sizes will be compared.

In all cases the number of inputs to the cluster is set according to Equation 2.1. The

routing network will be built using length 4 wires and the channel width (W ) is set to

be 20 % more than the number of tracks needed to route the 20 largest MCNC bench-

mark circuits [132]. Fc,out was set to W/N and Fc,in was set to the lowest value that

did not cause a significant increase to the channel width. We implement all multiplexers


having more than 4 inputs as 2-level multiplexers and the multiplexer gates are driven

at nominal VDD (i.e. gate boosting is not performed and, to compensate, a PMOS level

restorer is used). While these choices for multiplexer implementation are now standard

[31, 32, 15], they are different than the choices that were made in [14, 130, 18] and the

impact of these differences will be discussed later in this section. Based on these assump-

tions, the optimizer was set to size the FPGA in the appropriate process technology and

the delays between points A, B, C, and D in Figure 4.8 will be compared to the past

results [18, 130, 14].

The delays obtained by the optimizer are compared to [18] in Table 4.4. For this

work, TSMC’s 180 nm 1.8 V CMOS process [131] was used. Table 4.4(a) lists the results

reported in [18] for a range of cluster sizes as indicated in the first column of the table.

All the results are for 4-LUT architectures. The remaining columns of the table list the

delays between the specified timing points10. The rightmost column labelled A to D is a

combination of the other timing points and it is a measure of the complete delay from a

routing track through the logic block to the driver of a routing track.

The delays from the designs created by the optimizer are tabulated in Table 4.4(b)

(All the delays are the maximum of the rise and fall delays for typical silicon under

typical conditions). The percentage improvement in delay obtained by the optimizer

relative to the delay from [18] is listed in Table 4.4(c). Clearly, the delays between A and

B, between B and C, and between D and C are significantly better with our optimizer

while the delay from C to D is worse. (The delays for C to D in both cases are for the

fastest input through the LUT. This will be the case for all delays involving the LUT

unless noted otherwise.) While the increases in delay are a potential concern, they may

be in part due to the specific positioning of the timing points relative to any buffers

and, therefore, the most meaningful comparison is for the delay from A to D which is the

complete path through the logic block. For that delay, the optimizer consistently delivers

modest improvements. The other potential cause of the significant differences observed

between the two sets of designs is the different area and delay metrics used. Unlike our

results, the optimization in [18] did not consider the delay or area of the full FPGA and,

therefore, area and delay may have been allocated differently when our optimizer was

used.

10 For readers familiar with timing as specified in the VPR [10] architecture file, delay A to B is T_ipin_cblock, delay B to C is T_clb_ipin_to_sblk_ipin, C to D is T_comb and D to C is T_sblk_opin_to_sblk_ipin.


Table 4.4: Comparison of Logic Cluster Delays from [18] for 180 nm with K = 4

(a) Delays from [18]

    Cluster Size (N)   A to B   B to C   C to D   D to C   A to D
                        (ps)     (ps)     (ps)     (ps)     (ps)
    1                    377      180      376    N/A (a)     933
    2                    377      221      385      221       983
    4                    377      301      401      301      1079
    6                    377      332      397      332      1106
    8                    377      331      396      331      1104
    10                   377      337      387      337      1101

(b) Delays with Present Work

    Cluster Size (N)   A to B   B to C   C to D   D to C   A to D
                        (ps)     (ps)     (ps)     (ps)     (ps)
    1                    156        0      444    N/A (a)     599
    2                    273      150      509      132       932
    4                    299      155      536      141       990
    6                    286      157      565      142      1009
    8                    317      152      538      133      1007
    10                   308      159      526      147       993

(c) Percent Improvement of Present Work over [18]

    Cluster Size (N)   A to B   B to C   C to D   D to C   A to D
    1                     59%     100%     -18%   N/A (a)     36%
    2                     28%      32%     -32%      40%     5.2%
    4                     21%      49%     -34%      53%     8.3%
    6                     24%      53%     -42%      57%     8.8%
    8                     16%      54%     -36%      60%     8.8%
    10                    18%      53%     -36%      56%      10%

(a) The "cluster" of size one is implemented without any intra-cluster routing and, therefore, there is no direct path from D to C.

For completeness, the delays from C to D for a range of LUT sizes are compared in

Table 4.5. The first column indicates the LUT size and the second and third columns list

the delay for C to D from [18] and our optimizer respectively. The fourth column lists the

percent improvement obtained by our optimizer. As observed with the data in Table 4.4,

the optimizer generally delivers slower delays than were reported in [18] but, again, this

may be caused by issues such as buffer positioning or area and delay measurement.

A comparison with the delays reported in [130] was also performed. In this case, the

target technology was TSMC’s 350 nm 3.3 V CMOS process [133]. The same optimization


Table 4.5: Comparison of LUT Delays from [18] for 180 nm with N = 4

    LUT Size (K)   From [18] (ps)   This Work (ps)   Percent Improvement (%)
    2                    199              463                 -133%
    3                    283              511                  -80%
    4                    401              536                  -34%
    5                    534              552                   -3%
    6                    662              600                    9%
    7                    816              717                   12%

process used in the previous comparison was used for this comparison and the results are

summarized in Table 4.6 for clusters ranging in size from 1 to 10. The published delays

from [130] are listed in Table 4.6(a) while the delays from the designs created by our

optimizer are given in Table 4.6(b). The improvement in percentage for our optimizer’s

designs is shown in Table 4.6(c). Again, while some of the delays are lower and others

larger, the most meaningful comparison is for the delay from A to D which is given in

the rightmost column of the table. For that delay path, improvements of between 11 %

and 22 % were observed. The delays of the C to D path for a range of LUT sizes are

summarized in Table 4.7 and, as before, for this portion of the delay, the design created

by our optimizer is slower than the previously published delays. Again, part of this

difference may be due to the positioning of buffers relative to the timing points since,

as was seen in Table 4.6, comparable delays for the overall path through the logic block

were observed.

The logic block delays obtained from the optimizer using TSMC’s 350 nm 3.3 V

CMOS process [133] were also compared to the results from [14], as shown in Table 4.8. For this comparison, the logic block consisted of 4-LUTs in clusters of size 4.

As in the previous tables, timing is reported relative to the points labelled in Figure 4.8.

The second row of the table lists those delays as reported in [14] and row labelled “This

Work” summarizes the delays obtained from the sizing created by our optimizer. The

percent improvement in delay obtained by the optimizer is given in the last row of the

table. As observed in the comparison with [18] and [130], improvements are seen for a

number of the timing paths while a slow down is observed for a portion of the path.

Again, the most useful comparison is the timing through the entire block from A to D

which is listed in the last column of the table. For that delay, a slight improvement in

the delay obtained by the optimizer is observed.


Table 4.6: Comparison of Logic Cluster Delays from [130] for 350 nm with K = 4

(a) Delays from [130]

    Cluster Size (N)   A to B   B to C   C to D   D to C   A to D
                        (ps)     (ps)     (ps)     (ps)     (ps)
    1                    760      140      438      140      1338
    2                    760      649      438      649      1847
    4                    760      761      438      761      1959
    6                    760      849      438      849      2047
    8                    760      892      438      892      2090
    10                   760      912      438      912      2110

(b) Delays with Present Work

    Cluster Size (N)   A to B   B to C   C to D   D to C   A to D
                        (ps)     (ps)     (ps)     (ps)     (ps)
    1                    319        0      753      N/A      1072
    2                    436      325      877      289      1638
    4                    512      316      836      300      1664
    6                    448      336      812      322      1596
    8                    474      332      866      313      1672
    10                   510      365      847      321      1722

(c) Percent Improvement of Present Work over [130]

    Cluster Size (N)   A to B   B to C   C to D   D to C   A to D
    1                     58%     100%     -72%      N/A      20%
    2                     43%      50%    -100%      55%    11.3%
    4                     33%      58%     -91%      61%    15.1%
    6                     41%      60%     -85%      62%    22.0%
    8                     38%      63%     -98%      65%    20.0%
    10                    33%      60%     -93%      65%      18%

Table 4.7: Comparison of LUT Delays from [130] for 350 nm with N = 4

    LUT Size (K)   From [130] (ps)   This Work (ps)   Percent Improvement (%)
    2                     100               760                -660%
    3                     294               756                -157%
    4                     438               836                 -91%
    5                     562               862                 -53%
    6                     707              1013                 -43%
    7                     862              1065                 -24%


Table 4.8: Comparison of Logic Cluster Delays from [14] for 350 nm CMOS with K = 4 and N = 4

                            A to B   B to C   C to D   D to C   A to D
                             (ps)     (ps)     (ps)     (ps)     (ps)
    From [14]                 1040      340      465      620     1845
    This Work                  512      316      836      300     1664
    Percent Improvement        51%     7.1%     -80%      52%      10%

While clearly the optimizer was able to create designs with overall performance com-

parable to the previously published works, there are differences between the works that

should be noted. Most significantly, the circuit structures used for the multiplexer differ

substantially. Two-level multiplexers were used in the designs created by the optimizer

(as is now standard [15]) but, in [14, 130, 18], fully encoded multiplexers with gate

boosting were used11. This means that, for all multiplexers with more than 4 inputs12,

the previous designs would be implemented using more than 2-levels of pass transistors.

However, the delay impact of the additional transistor levels is partially offset by the

use of gate boosting. Another difference is in the optimization process used. While the

designs produced with our optimizer were designed to minimize the area-delay product

for the whole FPGA, such an approach was not used in [14, 130, 18] which optimized the

area-delay product at a much finer granularity. This difference could lead to a different

allocation of area and performance across the FPGA.

We believe that the similarity of these results validates our optimization process. The previously created designs were the result of months of careful

human effort. Our optimizer was able to match those results with a small fraction of the

effort. However, before putting this optimizer to use, one additional set of experiments

will be performed to further test the quality of the optimizer.

4.5.3 Comparison to Exhaustive Search

The results obtained using the optimizer were also compared to the best results obtained

from an exhaustive search with a goal of minimum delay for both the optimizer and the

11 It is not stated in [18, 134, 130] that fully encoded multiplexers with gate boosting are used but this has been confirmed through private communications with the author.

12 The CLB input multiplexer has Fc,in inputs (and that would generally be on the order of 10 or greater) and the BLE input multiplexer has N + I = N + (K/2)(N + 1) inputs. Therefore, both of these multiplexers would generally have more than 4 inputs.


Table 4.9: Exhaustive Search Comparison

    Test      Routing Track Delay (ps)         %
    Case      Exhaustive    This Work     Difference
    W            221.4        221.9         -0.2%
    X            226.9        230.1         -1.4%
    Y            224.9        224.9          0.0%
    Z            226.4        226.4          0.0%

exhaustive search. It is only possible to exhaustively optimize the sizes of a small num-

ber of transistors for this comparison because the number of test cases quickly grows

unreasonably large for the exhaustive search. Furthermore, to make the simulation time

reasonable, the path under optimization was simplified to be a properly loaded routing

segment similar to that shown in Figure 4.7. This simplified path has four transistors

in the buffer and two transistors in the multiplexer whose size could be adjusted. For

the comparison, only three of these transistor sizes will be optimized to keep the run

times reasonable. This comparison was performed using a 90 nm 1.2 V CMOS process

[87] from STMicroelectronics.
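The exhaustive reference search amounts to the following sketch, in which simulate_delay() is a hypothetical stand-in for an HSPICE run on the loaded routing segment and the candidate width lists are placeholders, not the values actually swept:

    from itertools import product

    def exhaustive_min_delay(width_choices, simulate_delay):
        """width_choices: one list of candidate widths per adjustable transistor."""
        best_delay, best_widths = float("inf"), None
        for widths in product(*width_choices):    # grows multiplicatively per transistor
            d = simulate_delay(widths)
            if d < best_delay:
                best_delay, best_widths = d, widths
        return best_delay, best_widths

Because the number of simulations grows as the product of the list lengths, only three transistor sizes could be left adjustable while keeping the exhaustive run times manageable.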

While only three transistor sizes were optimized in each test, we did vary the set

of specific transistors whose size could be adjusted. The delay for the routing segment

from our sizer and the exhaustive search were compared for four different combinations

of adjustable transistors. The two results were consistently within 1.2 % of each other

and the results are listed in Table 4.9. Each row contains the data for the optimization

performed using a different subset of adjustable parameters from the six possible perfor-

mance impacting parameters. The second column of the table reports the delay result

obtained using an exhaustive search of all possible values of the adjustable parameters

and the third column indicates the delay when those same parameters were sized by our

optimizer. The percentage difference is given in the fourth column. For these cases, our

optimization tool was on average 30.6 times faster than the exhaustive search. For a

larger number of adjustable sizes, the exhaustive search quickly becomes infeasible and

the speedup would grow significantly.

4.5.4 Optimizer Run Time

The preceding comparisons have shown that the optimization tool is able to achieve its

stated aim of producing realistic results across a range of design parameters. The other


goal for this work was to have subjectively reasonable run times. For the experiments

performed in this chapter the run time varied from 0.4 hours to 28 hours when

running on an Intel Xeon 5160 processor. The wide range in reported run times is due to

the various factors that affect the execution times. The two most significant factors are

the architectural parameters, as that impacts the artificial path used by the optimizer,

and the target technology, as the transistor models used by HSPICE affect its execution

time. Of these parameters, the LUT size is the most significant as it increases both the size of the LUTs and the number of LUTs included in the artificial path, because each additional LUT input adds another path that must be optimized. Therefore, the

longest run times were for 7-LUT architectures. For smaller LUT sizes, the run times

were significantly reduced. Given this, the execution time required for the optimization

tool was considered satisfactory as it will permit the exploration of a broad range of

parameters in Chapter 5.

4.6 Summary

This chapter has presented an algorithm and tool that performs transistor sizing for

FPGAs across a wide range of architectures, circuit designs and optimization objectives.

It has been shown that the optimizer is able to produce designs that are comparable to

past work but with significantly less effort thanks to the automated approach. The next

chapter will make use of this optimizer to explore the design space for FPGAs and the

trade-offs that can be made to narrow the FPGA to ASIC gap.

Chapter 5

Navigating the Gap through Area

and Delay Trade-offs

The measurement and analysis of the FPGA to ASIC gap in Chapter 3 found that there

is significant room for improvement in the area, performance and power consumption

of FPGAs. Whether it is possible to close the gap between FPGAs and ASICs is an

important open question. Our analysis in Chapter 3 (by necessity) focused on a single

FPGA design but there are in fact a multitude of different FPGA designs that can be

created by varying the logical architecture, circuit design and transistor-level sizing of

the device. The different designs within that rich design space offer trade-offs between

area and performance but exploring these trade-offs is challenging because accuracy ne-

cessitates that each design be implemented down to the transistor-level. The automated

transistor sizing tool described in the previous chapter makes such exploration feasible

and this chapter investigates the area and delay trade-offs that can be made within the

design space.

The goal in exploring these trade-offs is two-fold: the primary goal is to determine

the extent to which these trade-offs can be used to selectively narrow the FPGA to

ASIC gap. This could allow the creation of smaller and slower or faster and larger

FPGAs. It has become particularly relevant to consider such trade-offs as the market

for FPGAs has broadened to include both sectors with high performance requirements

such as high-speed telecommunications and sectors that are more cost focused such as

consumer electronics. Understanding the possible trade-offs could allow FPGAs to be

created that are better tailored to their end market. To aid such investigations, the second


goal of this exploration is to determine the parameters either at the logical architecture

level or the circuit design level that can best enable these trade-offs.

We explore these trade-offs in the context of general purpose FPGAs that are not

designed for a specific domain of applications. Application-specific FPGAs have been

suggested in the past [135] and they likely do offer additional opportunities for making

design trade-offs. However, general purpose FPGAs continue to dominate the market

[6, 136, 7, 8, 137, 9] and are required in both cost-oriented and performance-oriented

markets.

This chapter will first describe the methodology used to measure the area and perfor-

mance of an FPGA design. To ensure the accuracy of the measurements, this methodol-

ogy is different than that used by the optimizer described in the previous chapter. The

trade-offs that are possible for a single architecture are then examined. It will be seen

that some trade-offs are not useful and, therefore, in Section 5.3, the criteria used to

determine whether such trade-offs are interesting are introduced. Using those criteria,

a large design space with varied architecture and transistor sizings is examined in Sec-

tion 5.4 to quantify the range of trade-offs possible. The logical architecture and circuit

structure parameters are then examined individually to determine which parameters are

most useful for making area and delay trade-offs. Finally, the impact of these trade-offs

on the gap between FPGAs and ASICs is examined in Section 5.7.

5.1 Area and Performance Measurement Methodology

As described in Chapter 4, the inherent programmability of FPGAs means that until

an FPGA is programmed with an end-user’s design there is no definitive measure of the

performance or area of the FPGA. Only after a circuit is implemented on an FPGA is

it possible to measure the performance of the circuit and area consumed in the FPGA in

a meaningful manner. To explore the area and performance trade-offs accurately, a full

experimental process, as described in Section 2.4, is necessary and the specific process

that will be used is described in this section.


Figure 5.1: Performance Measurement Methodology (benchmark circuits pass through the full CAD flow of SIS, T-VPack and VPR; the resulting SPICE netlist of the critical path, the transistor-level design and the timing model feed HSPICE, which reports the delay)

5.1.1 Performance Measurement

The performance of a particular FPGA implementation is measured experimentally using

the 20 largest MCNC benchmarks [132, 138]. Each benchmark circuit is implemented

through a complete CAD flow on a proposed FPGA fabric and a final delay measurement

is generated as an output. The geometric mean delay of all the circuits is then used as

the figure of merit for the performance of the FPGA implementation. This mean delay

will be referred to as the effective delay of the FPGA. The steps involved in generating

this delay measurement are illustrated in Figure 5.1.

Synthesis, packing, placement and routing of the benchmark circuit onto the FPGA

is done using SIS with FlowMap [139], T-VPack [140] and VPR [10] (an updated version

of VPR that handles unidirectional routing is used). The placement and routing process

is repeated with 10 different seeds for placement and the placement and routing with

the best performance is used for the final results. The tools cannot directly make use of

the transistor size definitions of the FPGA fabric and, instead, a simplified timing model

must be provided. This timing model is encapsulated in VPR’s architecture file and

includes fixed delays for both the routing tracks and the paths within the logic block.


Table 5.1: Comparison of Delay Measurements between HSPICE and VPR for 20 Circuits

    Design        Average          Standard Deviation   VPR to HSPICE
                  HSPICE / VPR     HSPICE / VPR         Delay Correlation
    Delay             1.05              0.123                0.971
    Area Delay        1.32              0.0921               0.977
    Area              0.939             0.0474               0.990

We generate this file automatically through simulation of the circuit design with the

appropriate transistor sizes.

After placement and routing is complete, VPR performs timing analysis to deter-

mine the critical path of the design implemented on the FPGA. While this provides an

approximate measure of the FPGA’s performance, it is not sufficiently accurate for our

purposes since the relatively simple delay model does not accurately capture the complex

behaviour of transistors in current technologies. To address this, we have created a mod-

ified version of VPR that emits the circuitry of the critical path. Any elements that load

this path are also included to ensure the delay measurement is accurate. This circuit

is then simulated with the appropriate transistor sizes and structures using Synopsys

HSPICE. The delay as measured by HSPICE is used to define the performance of this

benchmark implemented on this particular FPGA implementation.
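The per-benchmark measurement loop can be summarised with the sketch below. The run_* helpers are hypothetical wrappers around SIS with FlowMap, T-VPack, the modified VPR and HSPICE (the real command lines are not reproduced here), and selecting the best of the ten seeds by the VPR-reported delay before the single final HSPICE run is an assumption about how the steps are ordered, not a statement of the exact scripts used.

    import math

    def effective_delay(benchmarks, arch_file, seeds=range(10)):
        delays = []
        for bench in benchmarks:
            netlist = run_synthesis(bench)            # SIS with FlowMap (hypothetical wrapper)
            packed = run_packing(netlist)             # T-VPack (hypothetical wrapper)
            # place and route with 10 seeds; keep the best run (judged here by VPR delay)
            runs = [run_vpr(packed, arch_file, seed) for seed in seeds]
            vpr_delay, critical_path_netlist = min(runs)
            delays.append(run_hspice(critical_path_netlist))   # final delay from HSPICE
        # the geometric mean across the benchmark set is the effective delay
        return math.exp(sum(math.log(d) for d in delays) / len(delays))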

To determine if the additional step of simulation with HSPICE was needed, a compari-

son was performed between the VPR reported critical path delay and the delay when that

same path was measured using HSPICE. This was done for one architecture using three

different FPGA transistor sizings. The twenty benchmark circuits were implemented on

these three different FPGA designs and the results of the comparison are summarized in

Table 5.1. The three different designs were created by changing the optimization objec-

tive of the design. In one case, the optimization exclusively aimed to minimize delay and

this is labelled as the “Delay” design in the first column of the table. For another design,

the objective was minimal area-delay product for the design and this is labelled as “Area

Delay”. Finally, area was minimized in the “Area” design. The second column reports

the average value across the twenty benchmarks of the delay from HSPICE divided by

the delay from VPR. With the average varying between 0.939 and 1.32 it is clear that

the delay model in VPR does not accurately reflect the delays of the different designs

and, therefore, simulation with HSPICE is essential to properly measuring the delays of

the different designs.


The inaccuracy in the VPR delay model is a potential concern because the underlying

timing analysis used during routing is performed with the inaccurate timing model. As

a result, poor routing choices may be made since the timing analysis may incorrectly

predict the design’s critical path. Fully addressing this concern would require a complete

overhaul of VPR’s timing analysis engine but, fortunately, it was observed that such

extreme measures were not required. Table 5.1 also provides the correlation between the

VPR and HSPICE critical path delay measurements in the column labelled “VPR to

HSPICE Delay Correlation”. For every design, the two delay measurement approaches

are well correlated which demonstrates that, while the VPR model may be inaccurate

in predicting the delays of different transistor-level designs, for any particular design the

VPR model has a relatively high fidelity. An alternative measure of this can be seen

in the standard deviation across all the benchmarks of the HSPICE delay divided by

VPR delay measurements and this is listed in the table under the heading “Standard

Deviation HSPICE / VPR”. The standard deviations indicate that there is relatively low

variability across the measurements. In relative terms, the standard deviation is at most

12 % of the mean. Given this, it was considered reasonable to continue to rely on VPR’s

timing model for intermediate timing analyses; however, HSPICE measurements will be

used for the final performance measurement.
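The statistics reported in Table 5.1 can be reproduced from paired per-benchmark measurements along the following lines. This is a sketch only, assuming two equal-length lists of critical-path delays; statistics.correlation requires Python 3.10 or later.

    import statistics

    def compare_delay_models(hspice_delays, vpr_delays):
        ratios = [h / v for h, v in zip(hspice_delays, vpr_delays)]
        return {
            "average_ratio": statistics.mean(ratios),      # second column of Table 5.1
            "stdev_ratio": statistics.stdev(ratios),       # third column
            "correlation": statistics.correlation(vpr_delays, hspice_delays),  # fourth column
        }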

5.1.2 Area Measurement

The area model described in Section 4.3.1 was designed to predict the area of an FPGA

tile based on its transistor-level design. While considering only the tile area was accept-

able when focused purely on transistor sizing changes, if architectural changes are to be

considered (as they will be in this chapter) then the area metric must accurately capture

the different capabilities of the tiles from different architectures. This is crucial because

both LUT size and cluster size have a significant impact on the amount of logic that can

be implemented in each logic block and, therefore, the amount of logic in each tile.

To account for the varied logic capabilities, the effective area for a design is calculated

as the product of the tile area and the number of tiles required to implement the bench-

mark circuits. The count of required tiles only includes tiles in which the logic block is

used. Since each benchmark is placed onto an M ×M grid of tiles with M set to the

smallest value that will fit the benchmark, there may be tiles in which the routing is used

but not the logic block. Such tiles are not included as it would cause the tile count to be poorly quantized and thereby limit the precision of the area measurements.

Table 5.2: Architecture Parameters

    Parameter                       Value
    LUT Size, k                     4
    Cluster Size, N                 10
    Number of Cluster Inputs, I     22
    Tracks per Channel, W           104
    Track Length, L                 4
    Interconnect Style              Unidirectional
    Driver Style                    Single Driver
    Fc,input                        0.2
    Fc,output                       0.1
    Pads per row/column             4

For example,

if a design required 20 × 20 = 400 logic blocks on one architecture and 401 logic blocks

on another then if all the tiles were counted 21× 21 = 441 tiles would have been consid-

ered necessary and instead of a 0.25 % area increase a 10.25 % increase would have been

measured. A fine-grained approach to counting the tiles conflicts with the conventional

perception of FPGA users that FPGAs are only available in a small number of discrete

sizes (on the order of approximately ten [6, 7]). However, this approach is necessary to

allow the impact of changes to the FPGA to be assessed with sufficient precision and it

is the standard for architectural experiments [18, 14]. This effective area measurement

will be used in the remainder of this section.
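A minimal sketch of this metric and of the quantization argument above, using the 400-versus-401 logic block example from the text:

    import math

    def effective_area(tile_area_um2, used_logic_blocks):
        # only tiles whose logic block is actually used are counted
        return tile_area_um2 * used_logic_blocks

    def full_grid_tiles(used_logic_blocks):
        # the M x M grid that would be charged if every tile were counted
        m = math.ceil(math.sqrt(used_logic_blocks))
        return m * m

    print(401 / 400 - 1)                                      # 0.0025: the 0.25 % real increase
    print(full_grid_tiles(401) / full_grid_tiles(400) - 1)    # 0.1025: the 10.25 % jump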

5.2 Transistor Sizing Trade-offs

With the area and performance measurements methodologies now defined, the trade-offs

between area and performance can be explored. We start this exploration by focusing

on a single architecture with 4-LUTs in clusters of size 10. Its architectural parameters

are fully described in Table 5.2; this architecture will serve as the baseline architecture

for future experiments. It uses two-level multiplexers for all multiplexers with more

than four inputs and the designs are implemented down to the transistor-level using

STMicroelectronics’ 90 nm 1.2 V CMOS process [87] (This process will be used for all the

work in this section). Given this architecture and multiplexer implementation strategy,

the optimizer described in Chapter 4 was used to create a range of designs with varied

transistor sizings that offer different trade-offs between area and performance.


The range of results is plotted in Figure 5.2. The Y-axis in the figure, the effective

delay, is the geometric mean delay across the benchmark circuits and the X-axis is the

area required for all the benchmarks. Each point in the figure indicates the area and

performance of a particular FPGA design. The different points are created by varying the

input parameters, b and c, that specify the objective function (Area^b · Delay^c) to optimize. At one extreme, area is the only concern (b = 1, c = 0) and, at the other extreme, delay is the only concern (b = 0, c = 1). In between these extremes, various other combinations of b and c are used. As described in Section 5.1, the area and performance are measured using the full FPGA CAD flow. Clearly, transistor sizing enables a large range of area

and delay possibilities. The range in these trade-offs is quantified as follows:

    Area Range = Area of Largest Design / Area of Smallest Design,          (5.1)

    Delay Range = Delay of Slowest Design / Delay of Fastest Design.        (5.2)

The largest design should also be the fastest design and the smallest design should be the slowest design. The area and delay ranges from Figure 5.2 are then 2.2 × and 8.0 ×, respectively, using this definition for the ranges.
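These two definitions amount to the following small helper, shown only as a sketch; the input is whatever list of (effective area, effective delay) points is being summarised:

    def area_delay_ranges(designs):
        """designs: list of (effective area, effective delay) points."""
        areas = [a for a, _ in designs]
        delays = [d for _, d in designs]
        return max(areas) / min(areas), max(delays) / min(delays)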

To provide a relative sense of the size of these ranges, Table 5.3 compares this area-

delay range to the range seen when architectural parameters have been varied in past

studies [134, 14]. Those studies considered the cluster size, LUT size and segment length

and the table lists the delay and area range when each of these attributes was varied. For

each of these parameters on their own, delay ranges between 1.6 and 2.2 and area ranges

between 1.5 and 1.6 were observed. The largest range is obtained when cluster size and

LUT sizes are both varied. In that case, ranges of 3.2 × and 1.7 × were observed in delay

and area respectively [134]. While the area range is of a similar magnitude to that seen

from transistor sizing, the delay range from architectural changes is considerably smaller

than that from transistor sizing indicating the significant effect transistor sizing can have

on performance.

The full range of transistor sizing possibilities illustrates the important role sizing

plays in determining performance trade-offs but reasonable architects and designers would

not consider this full range useful. At the area-optimized and the delay-optimized ex-

tremes, the trade-off between area and delay is severely unbalanced. This is particularly


Figure 5.2: Area Delay Space (effective delay in seconds, the geometric mean delay as measured by HSPICE, versus effective area in µm²)

Table 5.3: Area and Delay Changes from Transistor Sizing and Past Architectural Studies

    Variable                        Delay Range   Area Range
    Transistor Sizing (Full)            8.0           2.2
    Cluster Size (1-10) [134]           1.6           1.5
    LUT Size (2-7) [134]                2.2           1.5
    Segment Length (1-16) [14]          1.6           1.6
    Cluster & LUT Size [134]            3.2           1.7

true near the minimal area sizing where the large negative slope seen in Figure 5.2 indi-

cates that, for a slight increase in area, a significant reduction in delay can be obtained.

Quantitatively, there is a 14 % reduction in delay for only a 0.02 % increase in area.

Clearly, a reasonable designer would always accept that minor area increase to gain that

significant reduction in delay. Therefore, to ensure only realistic trade-offs are considered,

the range of trade-offs must be restricted and this restriction is described in the following

section.


5.3 Definition of “Interesting” Trade-offs

The goal in exploring the area and performance trade-offs is to understand how the gap

between FPGAs and ASICs can be selectively narrowed by exploiting these trade-offs.

However, the trade-offs considered must be useful and, as seen in the previous section, an

imbalance between the area and delay trade-offs occurs at the extremes of the transistor

sizing trade-off curve shown earlier. Selecting the regions in which the trade-offs are

useful is a somewhat arbitrary decision. Intuitively, this region is where the elasticity

[141], defined as

    elasticity = (d delay / d area) · (area / delay)          (5.3)

is neither too small nor too large. Since we do not have a differentiable function relating

the delay and area for an architecture, we approximate the elasticity as:

    elasticity = (% change in delay) / (% change in area).          (5.4)

An elasticity of -1 means that a 1 % area increase achieves a 1 % performance im-

provement. Clearly, a 1-for-1 trade-off between area and delay is useful. However, based

on conversations with a commercial FPGA architect [142], we will view the trade-offs

as useful and interesting when at most a 3 % area increase is required for a 1 % delay

reduction (an elasticity of -1/3) and when a 1 % area decrease increases delay by at most

3 % (an elasticity of -3). This factor of three that determines the degree to which area

and delay trade-offs can be imbalanced will be called the elasticity threshold factor. All

points within the range of elasticities set by the threshold factor will make up what we

call the interesting range of trade-offs. With this restriction, designs are removed both

because too much area is required for too small a performance improvement and because

too much performance is sacrificed for too small an area reduction. While this restric-

tion only explicitly considers delay and area, it has the effect of eliminating designs with

excessive power consumption because those designs would generally also have significant

area demands.

This approach is appropriate for considering the interesting regions of a single area-

delay curve. A more involved approach is necessary when considering discrete designs,

such as those from [134, 14], or multiple different trade-off curves. In such cases, the

process for determining the interesting design is as follows: first the set of potentially

interesting designs is determined by examining the designs ordered by their area. Starting


from the minimum area design, each design is considered in turn. A design is added to

the set of potentially interesting designs if its delay is lower than all the other designs

currently in the potentially interesting set. This first step eliminates all designs that cannot be interesting because other designs provide better performance for less area. The

next step will apply the area-delay trade-off criterion to determine which designs are

interesting.

Two possibilities must be considered when evaluating whether a design is interesting.

These two possibilities are illustrated in Figure 5.3 through four examples. In these

examples, we will determine if the three designs labelled A, B and C are in the interesting

set. Design B is first compared to design A using the -1/3 elasticity requirement as shown

in Figures 5.3(a) and 5.3(b). If the delay improvement in B relative to A is too small

compared to the additional area required as it is in Figure 5.3(a), then design B would

be rejected. In Figure 5.3(b) the delay improvement is sufficiently large and design B

could be accepted as interesting. However, the design must also be compared to design

C. In this case, the -3 elasticity requirement is used. If the delay of B relative to C is too

large compared to the area savings of B relative to C the design would not be included

in the interesting set. Such a case is shown in Figure 5.3(c). An example in which design

B is interesting based on this test is illustrated in Figure 5.3(d). A design whose delay

satisfies both the -1/3 and the -3 requirements is included in the interesting set. At the

boundaries of minimum area or minimum delay (i.e. design A and design C respectively

if these were the only three designs being considered) only the one applicable elasticity

threshold must be satisfied.

When examining more than three designs, the process is the same except the com-

parison designs A and C need not be actual designs. Instead, those two points represent

the minimum and maximum interesting delays possible for the areas required for designs

A and C respectively. Equivalently, designs A and C are the largest or smallest designs

respectively that satisfied the -1/3 or -3 elasticity threshold. If no such designs exist then

the minimum area or delay of actual designs, respectively, would be used.
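The selection procedure can be sketched as follows. This is an approximation of the description above: elasticities are computed with the percentage-change form of Equation 5.4, and each design is checked against its immediate smaller and larger neighbours on the area-ordered Pareto front rather than against the extrapolated points A and C discussed in the text.

    def elasticity(smaller, larger):
        """smaller and larger are (area, delay) points, ordered by area."""
        pct_darea = (larger[0] - smaller[0]) / smaller[0]
        pct_ddelay = (larger[1] - smaller[1]) / smaller[1]
        return pct_ddelay / pct_darea

    def interesting_set(designs, threshold=3.0):
        # Step 1: keep only designs that are faster than every smaller design.
        pareto = []
        for d in sorted(designs):                    # ordered by area
            if not pareto or d[1] < pareto[-1][1]:
                pareto.append(d)
        # Step 2: apply the -1/threshold and -threshold elasticity tests.
        keep = []
        for i, b in enumerate(pareto):
            ok = True
            if i > 0:                                # compare with the next smaller design
                ok = ok and elasticity(pareto[i - 1], b) <= -1.0 / threshold
            if i + 1 < len(pareto):                  # compare with the next larger design
                ok = ok and elasticity(b, pareto[i + 1]) >= -threshold
            if ok:
                keep.append(b)
        return keep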

With this restriction to the interesting region, the range of trade-offs is decreased to

a range of 1.41 in delay from slowest to fastest and 1.47 in area from largest to smallest.

Figure 5.4 plots the data shown previously in Figure 5.2 but with the interesting points

highlighted. Clearly, there is a significant reduction in the effective design space but the

range is still appreciable and it demonstrates that there is a range of designs for a specific architecture that can be useful. Applying these same criteria to the past investigation of


Figure 5.3: Determining Designs that Offer Interesting Trade-offs. Each panel plots delay versus area for designs A, B and C: (a) design B is not interesting (elasticity = -1/3 test), (b) design B may be interesting (elasticity = -1/3 test), (c) design B is not interesting (elasticity = -3 test), (d) design B may be interesting (elasticity = -3 test).

LUT size and cluster size [134], we find that the range of useful trade-offs is 1.17 for delay

from fastest to slowest and 1.11 for area from largest to smallest. This space is smaller

than the range observed for transistor sizing changes of a single architecture. From the

perspective of designing FPGAs for different points in the design space, transistor sizing

appears to be the more powerful tool. However, architecture and transistor sizing need

not be considered independently and, in the following section, we examine the size of the

design space when these attributes are varied in tandem.


Figure 5.4: Area Delay Space with Interesting Region Highlighted (the full design space of Figure 5.2 with the points in the interesting region marked)

5.4 Trade-offs with Transistor Sizing and Architecture

For each logical architecture, a range of different transistor sizings, each with different

performance and area, is possible. In the previous section, only a single architecture was

considered, but now we explore varying the transistor sizes for a range of architectures.

We considered a range of architectures with varied routing track lengths (L), cluster

sizes (N) and LUT sizes (K). Table 5.4 lists the specific values that were considered

for each of these parameters. Not every possible combination of these parameter values

was considered and the full list of architectures that were considered can be found in

Appendix D. A comparison between architectures is most useful if the architectures

present the same ease of routing. Therefore, as each parameter is varied, it is necessary

to adjust other related architectural parameters such as the channel width (W ) and the

input/output pin flexibilities (Fc,in, Fc,out).

We determine appropriate values for the channel width (which is one factor affecting

the ease of routing) experimentally by finding the minimum width needed to route our


Table 5.4: Range of Parameters Considered for Transistor Sizing and Architecture Investigation

    Parameter                    Values Considered
    LUT Size (K)                 2–7
    Cluster Size (N)             2, 4, 6, 8, 10, 12
    Routing Track Length (L)     1, 2, 4, 6, 8, 10

Table 5.5: Optimization Objectives

    Area^1·Delay^1     Area^2·Delay^1
    Area^3·Delay^1     Area^4·Delay^1
    Area^6·Delay^1     Area^8·Delay^1
    Area^10·Delay^1    Area^1·Delay^2
    Area^1·Delay^3     Area^1·Delay^4
    Area^1·Delay^6     Area^1·Delay^8
    Area^1·Delay^10    Area^1·Delay^0
    Area^0·Delay^1

benchmark circuits. The minimum channel width is increased by 20 % and rounded to

the nearest multiple of twice the routing segment length1 to get the final width. The

input pin flexibility (Fc,in) is determined experimentally as the minimum flexibility which

does not significantly increase the channel width requirements. The output flexibility is

set to 1/N, where N is the cluster size. For each architecture, a range of transistor sizing

optimization objectives were considered. The typical objective functions used are listed

in Table 5.5. The results for all these architectures and sizes are plotted in Figure 5.5.

Again, each point in the figure indicates the delay and area of a particular combination

of architecture and transistor sizing. In total, 60 logical architectures were considered.

With the different sizings for each architecture, this gives a total of 1331 distinctly sized

architectures. The delay in all cases is the geometric mean delay across the benchmark

circuits and the area is the total area required to implement all the benchmarks.
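The channel-width rule just described can be written as a small helper. The example input of 84 tracks below is purely illustrative and is not a value taken from the experiments.

    def channel_width(min_routable_width, segment_length):
        # round to a multiple of 2*L so each tile has equal tracks driving in each direction
        step = 2 * segment_length
        return max(step, step * round(1.2 * min_routable_width / step))

    print(channel_width(84, 4))   # 1.2 * 84 = 100.8, nearest multiple of 8 -> 104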

The goal in considering this large number of architectures is to determine the range

of the area and performance trade-offs. However, trade-offs that are severely imbalanced

must be eliminated using the process described in the previous section. The smallest

(and slowest), the fastest (and largest) and the minimum area-delay designs from the

1 The rounding of the channel width is necessary to ensure that the complete FPGAs can be created by replicating a single tile. It is necessary to round to twice the segment length because, with the single-driver routing topology, there must be an equal number of tracks driving in each direction.


Figure 5.5: Full Area Delay Space (effective delay versus effective area for all points, with the smallest interesting, minimum area-delay and fastest interesting designs labelled)

interesting set of designs are labelled in Figure 5.5. Clearly, there are both faster de-

signs and smaller designs but such designs require too much area or sacrifice too much

performance, respectively.

Compared to conventional experiments which would have only considered the mini-

mum area-delay point useful we see that in fact there are a wide range of designs that

are interesting when different design objectives are considered. The span of these designs

is of particular interest and is summarized in Table 5.6. We see that there is a range

of 2.03 × in area from the largest design to the smallest design. In terms of delay, the

range is 2.14 × from the slowest design relative to the fastest design. It is clear that when

creating new FPGAs there is a great deal of freedom in the area and delay trade-offs that

can be made and, as can be seen in Table 5.6, both transistor sizing and architecture

are key to achieving this full range. Before investigating the impact of the individual

architectural parameters, we investigate the effect of the elasticity threshold factor that

determined which designs were deemed to offer interesting trade-offs.


Table 5.6: Span of Different Sizings/Architecture

                             Area        Delay   Area-Delay   Architecture
                             (1E8 µm²)   (ns)    (µm² · s)
    Fastest Interesting        0.761      3.29      0.251     N=8, K=6, L=4
    Min. Area Delay            0.451      4.64      0.209     N=8, K=4, L=6
    Smallest Interesting       0.375      7.06      0.265     N=8, K=3, L=4
    Range                      2.03       2.14      1.27

5.4.1 Impact of Elasticity Threshold Factor

The area and delay ranges described previously were determined using the requirement

that trade-offs in area and delay differ by at most a factor of three. While this factor

of three threshold was selected based on the advice of a commercial architect, it is a

somewhat arbitrary threshold and it is useful to explore the impact of this factor on the

range.

To explore this issue, the elasticity threshold factor, which determines the set of inter-

esting designs, was varied. The resulting area, delay and area-delay ranges are plotted in

Figure 5.6 for the complete set of designs. As expected, increasing the threshold factor

increases the range of trade-offs since a larger factor permits a greater degree of imbal-

ance in the trade-offs between area and delay. The range does not increase indefinitely

and, for threshold factors greater than 6, there are only minor increases in the range.
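The sweep behind Figure 5.6 amounts to re-running the selection at each threshold and recomputing the ranges, for example by reusing the interesting_set() and area_delay_ranges() sketches given earlier (this is an illustration of the procedure, not the script actually used):

    def ranges_vs_threshold(designs, thresholds=(1, 2, 3, 4, 6, 8, 10, 12)):
        out = {}
        for t in thresholds:
            kept = interesting_set(designs, threshold=t)
            out[t] = area_delay_ranges(kept)       # (area range, delay range) per threshold
        return out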

The maximum value for the area range in Figure 5.6 is 3.1. This is larger than the

maximum range reported for a single architecture in Section 5.2 which is not surprising

as the additional architectures used in this section broaden the range of possible designs.

However, it is somewhat unexpected that the maximum delay range of 3.8 as seen in the

figure is considerably smaller than the unrestricted range of 8.0 reported in Section 5.2.

It is possible that the maximum delay range seen here could be further enlarged through

the addition of yet more architectures and designs. However, with the current set of

architectures and designs, the delay range is smaller because the additional architectures

offer substantially improved performance at the low area region of the design space. As

a result, the small and excessively slow designs seen in Section 5.2 are never useful.

The figure also demonstrated that the area and delay ranges are highly sensitive to the

elasticity threshold factor as a slight reduction or increase away from the value of three

used in this work could cause a substantial change to the area and delay ranges. This is a

potential concern since it suggests that changes to the optimizer or the architectures used


Figure 5.6: Impact of Elasticity Factor on Area, Delay and Area-Delay Ranges (the area, delay and area-delay ranges plotted against the elasticity threshold factor, with the chosen threshold of three marked)

could lead to changes in the reported ranges. Unfortunately, this sensitivity is inherent to

this problem and we will continue to use an elasticity threshold factor of 3 to determine

the designs that offer interesting trade-offs.

5.5 Logical Architecture Trade-offs

In the previous section, the range of possible area and delay trade-offs was quantified and

the impact of these trade-offs was examined. However, how these trade-offs are made

was not explored. In this section, three architectural parameters will be investigated to

better understand their usefulness for making area and delay trade-offs.


Table 5.7: Span of Interesting Designs with Varied LUT Sizes

                             Area        Delay   Area-Delay   LUT Size
                             (1E8 µm²)   (ns)    (µm² · s)
    Smallest Interesting       0.341     11.2       0.384     K=2
    Min. Area Delay            0.492      4.56      0.224     K=4
    Fastest Interesting        0.725      3.44      0.249     K=6
    Range                      2.1        3.3       1.7

5.5.1 LUT Size

First, we examine the impact of LUT size on area and delay. Delay versus area curves

are plotted in Figure 5.7 for architectures with clusters of size 10, routing segments of

length 4 and LUT sizes ranging from 2 to 6. The plotted effective delay is again the

geometric mean delay for the benchmarks and the area is the area required for all the

benchmark designs. The different curves in the figure plot the results for the different

LUT sizes. Within each curve, only transistor sizing is changed and that is accomplished

by varying the optimization objective input to the optimizer.

In the figure, all the curves intersect each other and, depending on the area, the

best delay is obtained from different LUT sizes. In fact, each LUT size is best at some

point in the design space. This indicates that, visually, LUT size is highly useful for

making trade-offs because performance can be improved with increasing area (and LUT

size.) When these designs are analyzed using the previously described interesting trade-

off requirements, we also find that there are designs from every architecture that satisfy

the requirements. The boundaries of the interesting region are summarized in Table 5.7.

The table lists the three main points within the space: the smallest interesting design,

the minimum area delay design and the fastest interesting design. For each of these

designs, the area, delay, area-delay product and LUT size are listed. The three designs

all have different LUT sizes which confirms the earlier observations. As well, the range

of trade-offs is large with an area range of 2.1 and a delay range of 3.3 when all these

architectures are considered2.

2 The ranges are larger than the ranges reported in Section 5.4 because the additional architectures used for the previous range numbers cause some of the architectures used here to fall outside of the interesting region. However, the measurement of the range is still useful for comparing against the ranges for other architectural changes.


Figure 5.7: Area Delay Space with Varied LUT Sizing (one transistor sizing curve for each LUT size, K = 2 to 6)

5.5.2 Cluster Size

The role of cluster size was also examined and, in Figure 5.8, the area and delay results

for varied transistor sizings of architectures with cluster sizes ranging from 2 to 12 are

plotted. In all the architectures, routing segments of length 4 and LUTs of size 4 are

used. It can be seen in the figure that the difference in area is relatively small between

the curves of different cluster sizes. The delay differences are also relatively minor and,

for most of the design space, a cluster size of 10 offers the lowest delay. Clearly, cluster

size provides much less leverage for trade-offs than LUT size. There is some opportunity

for trade-offs at the area and delay extremes as large cluster sizes are best for low delay

and small cluster sizes are best for low area designs. Table 5.8 summarizes the area,

delay, area-delay product and cluster size for boundaries of the interesting region for

these designs. From the table, it is apparent that the magnitude of the possible trade-

offs is significantly reduced compared to LUT size as the area range and delay range are

both only 1.7.

[Figure 5.8: Area Delay Space with Varied Cluster Sizes. Effective Delay (s) (geometric mean delay as measured by HSPICE) versus Effective Area (µm²); one curve per cluster size, N = 2, 4, 6, 8, 10 and 12.]

Table 5.8: Span of Interesting Designs with Varied Cluster Sizes

                       Area (1E8 µm²)   Delay (ns)   Area-Delay (µm²·s)   Cluster Size
Smallest Interesting   0.402            6.63         0.266                N=4
Min. Area-Delay        0.492            4.56         0.224                N=10
Fastest Interesting    0.665            3.89         0.258                N=10
Range                  1.7              1.7          1.2

It should be noted that for all cluster sizes fully populated intra-cluster routing is

assumed. As described in Section 2.1, full connectivity has been found to be unnecessary

[19]. With depopulated intra-cluster routing, it is possible that the usefulness of cluster

size for making trade-offs would be improved.

5.5.3 Segment Length

Figure 5.9 plots the transistor sizing curves for architectures with 4-LUT clusters of size

10 with the routing segment lengths varying from 1 to 8. It is immediately clear that

[Figure 5.9: Area Delay Space with Varied Routing Segment Lengths. Effective Delay (s) (geometric mean delay as measured by HSPICE) versus Effective Area (µm²); one curve per segment length, L = 1, 2, 4, 6 and 8.]

Table 5.9: Span of Interesting Designs with Varied Segment Lengths

                       Area (1E8 µm²)   Delay (ns)   Area-Delay (µm²·s)   Segment Length
Smallest Interesting   0.431            5.79         0.250                L=4
Min. Area-Delay        0.492            4.56         0.224                L=4
Fastest Interesting    0.665            3.89         0.258                L=4
Range                  1.5              1.5          1.2

the length-1 and length-2 architectures are not useful in terms of area and delay trade-

offs. Similar conclusions have been made in past investigations [14]. From the trade-off

perspective, the remaining segment lengths are all very similar. In Table 5.9, the area,

delay and area-delay characteristics of the boundary designs from the interesting space

are summarized. Based on these designs, the interesting area and delay ranges are both

1.5 which is smaller than the ranges seen for cluster size and LUT size. Clearly, segment

length is not a powerful tool for adjusting area and delay as a single segment length

generally offers universally improved performance.


5.6 Circuit Structure Trade-offs

While varied logical architecture is the most frequently considered possibility for trading

off area and performance, the circuit-level design of the FPGA presents another possible

source of trade-offs. In this section, we investigate two circuit topology issues: the place-

ment of buffers before multiplexers and the structure of the multiplexers themselves. Our

goal is to determine if either topology issue can be leveraged to enable useful area and

performance trade-offs.

5.6.1 Buffer Positioning

As discussed in Section 4.2.2, one circuit question that has not been fully resolved is

whether to use buffers prior to multiplexers in the routing structures. For example,

in Figure 5.10(a), a buffer could be placed at positions a and b to isolate the routing

track from the multiplexers, as shown in Figure 5.10(b). In terms of delay, the potential

advantage of the pre-multiplexer buffer is that it reduces the load on the routing tracks

because only a single buffer can be used to drive the multiple multiplexers that connect to

the track in a given region. (For example, at both positions a and b in Figure 5.10.) The

disadvantage is the addition of another stage of delay. Both logical architecture, which

affects the number of multiplexers connecting to a track, and electrical design, which

determines the size (and hence load) of the transistors in the multiplexers relative to the

size of the transistors in the buffer, may impact the decision to use the pre-multiplexer

buffers.

We investigated this issue for one particular architecture consisting of 4-LUT clusters

of size 10 with length 4 routing tracks to determine the best approach. As was done

previously, the effective area and delay was determined using the full experimental flow

for a range of varied transistor sizings without a buffer, with a single inverter, and with

a two-inverter buffer. Figure 5.11 plots the area delay curves for each of these cases.

It is interesting to consider the full area-delay space because, for different transistor sizings, the buffers might become useful. However, in Figure 5.11,

we see that across the range of the design space the fastest delay for any given area is

obtained without using the buffers. For this architecture, no pre-multiplexer buffering is

appropriate. Similar results were obtained for other cluster sizes as well.

[Figure 5.10: Buffer Positioning around Multiplexers. (a) Routing Track without Pre-Multiplexer Buffers; (b) Routing Track with Pre-Multiplexer Buffers. Each routing track drives the multiplexers of several logic blocks, and positions a and b mark where buffers may be inserted.]

5.6.2 Multiplexer Implementation

The implementation of multiplexers throughout the FPGA is another circuit design issue

that has not been conclusively explored. With few exceptions [31, 32], multiplexers have

been implemented using NMOS-only pass transistors [14, 28, 18, 33, 15] as described in

Section 2.2. However, this still leaves a wide range of possibilities in the structure of

those NMOS-only multiplexers. The approaches generally differ in the number of levels

of pass transistors through which an input signal must pass. Before exploring the impact

of multiplexer design choices on the overall area and performance of an FPGA, we first

examine the multiplexer design choices in isolation to provide a better understanding of

the potential impact of the choices.

[Figure 5.11: Area Delay Trade-offs with Varied Pre-Multiplexer Inverter Usage. Effective Delay (s) (geometric mean delay as measured by HSPICE) versus Effective Area (µm²); curves for no inverters, 1 inverter and 2 inverters.]

General Multiplexer Design Analysis

The three most frequently considered possibilities for a multiplexer are fully-encoded,

2-level and 1-level (or one-hot). Some of their main properties are summarized in Ta-

ble 5.10. The different design styles are given in the different columns of the table and the various properties are summarized in each row. Recall from Section 2.2 that one key

parameter of multiplexers is the number of levels of pass transistors that are traversed

from input to output. This characteristic for the three designs is summarized in the

row of the table labelled “Pass Transistor Levels.” The number of levels is constant

for the 1-level and 2-level designs but, in the fully encoded multiplexer, the number of

inputs to the multiplexer, X, determines the number of levels, ⌈log₂ X⌉. The number

of configuration bits required to control each multiplexer design is indicated in the row

labelled “Configuration Bits.” Clearly, the benefit of the fully-encoded multiplexer is

that it requires only ⌈log₂ X⌉ configuration memory bits compared to 2√X bits for the


Table 5.10: Comparison of Multiplexer Implementations (X = Number of Multiplexer Inputs)

Property                 Fully Encoded   2-level   1-level
Pass Transistor Levels   ⌈log₂ X⌉        2         1
Configuration Bits       ⌈log₂ X⌉        2√X       X
Pass Transistors         2X − 2          X + √X    X

2-level multiplexer3 and X bits for the 1-level multiplexer. Finally, the row labelled

“Pass Transistors” lists the total number of pass transistors required for the different

multiplexer styles. A fully encoded multiplexer design is worse by this metric as it needs

2X − 2 pass transistors compared to X + √X and X pass transistors for the 2-level and

1-level designs respectively.

To better illustrate the impact of these differences in the number of configuration

memory bits and the number of pass transistors, Figure 5.12 plots the total number of

transistors (including both configuration bits and pass transistors) per multiplexer input

as a function of the input width of the multiplexer. The transistor count required per

input for the 1-level multiplexer is constant with 6 transistors required for the configura-

tion bit and 1 pass transistor per input. For the 2-level multiplexer, for each width the

topology that yielded the lowest transistor count was used. The number of transistors

required per input tends to decrease as the width of the multiplexer increases. A similar

trend can be seen with the fully encoded multiplexers. These results are also summarized

in Table 5.11. The table lists the various input widths and, for each width, the number

of transistors is given for each of the design styles. For the 2-level and fully encoded

results, the results depend on how the configuration bit is used. The previously plotted

data assumed that both data and its complement were used from each bit. These results are summarized

in the columns labelled “2 O/Bit.” (Note that the 1-level design only uses one output

from each bit and, hence, its results are labelled as “1 O/Bit.”) To better illustrate

the differences between the fully encoded designs and the 2-level designs, the final two

columns of the table report the area savings, in transistor count, when the fully encoded

design is used instead of the 2-level design. For the larger multiplexers, the savings are

relatively constant at around 15 %.

3 This is only an approximation of the number of bits required for the 2-level multiplexer because the number of bits used at each level of the multiplexer must be a natural number. There are also different implementations of 2-level multiplexers that will require more memory bits.

[Figure 5.12: Transistor Counts for Varied Multiplexer Implementations. Total Transistors per Input versus Number of Inputs to Multiplexer; curves for 1-stage, 2-stage and fully encoded designs.]

The use of two outputs from each configuration memory bit is a potentially risky

design practice as it exposes both nodes of the memory bit to noise, which complicates

the design of the bit cell. A more conservative approach would be to only use one output

from the bit cell and produce the inverted signal using an additional static inverter. If

such an approach is used the gap between the 2-level and fully encoded design shrinks

further as can be seen in Figure 5.13. These results are also given in Table 5.11 in the

columns labelled “1 O/Bit.” In this case, the difference between the designs is much

smaller and, for the large multiplexer sizes, it is around 6 % at worst. The number of

transistors required for a 3-level multiplexer is not shown but it would generally fall

between the 2-level and fully encoded designs. While clearly a fully encoded multiplexer

does reduce the number of transistors required for its implementation, these gains are

relatively modest. However, there is the potential for useful area and performance trade-

offs with the 2-level design which should generally be faster.
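As a rough cross-check of the counts above, the following Python sketch recomputes the 1-level and fully encoded columns of Table 5.11 from the formulas in Table 5.10. It assumes a 6-transistor SRAM configuration cell and a 2-transistor static inverter wherever a complement must be regenerated from a single-output cell; it is an illustrative reconstruction rather than the counting used to produce the table, and the 2-level column is not recomputed here because the lowest-cost 2-level topology was chosen separately for each width.

    # Illustrative reconstruction (assumed 6T SRAM cell, 2T inverter): per-input
    # transistor totals for the 1-level and fully encoded multiplexer styles.
    import math

    SRAM = 6       # transistors per configuration cell (assumed)
    INVERTER = 2   # transistors to regenerate a complemented select signal (assumed)

    def one_level_per_input(x):
        # X pass transistors and X one-hot bits; only the true output of each
        # bit is needed, so the total is a constant 7 transistors per input.
        return (x + SRAM * x) / x

    def fully_encoded_per_input(x, two_outputs_per_bit=True):
        # 2X - 2 pass transistors in a binary tree plus ceil(log2 X) select
        # bits; single-output bits need one extra inverter each.
        levels = math.ceil(math.log2(x))
        per_bit = SRAM if two_outputs_per_bit else SRAM + INVERTER
        return (2 * x - 2 + levels * per_bit) / x

    for x in (4, 8, 16, 32):
        print(x, one_level_per_input(x),
              round(fully_encoded_per_input(x, True), 1),    # "2 O/Bit" column
              round(fully_encoded_per_input(x, False), 1))   # "1 O/Bit" column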

[Figure 5.13: Transistor Counts for Varied Multiplexer Implementations using a Single Configuration Bit Output. Total Transistors per Input versus Number of Inputs to Multiplexer; curves for 1-stage, 2-stage and fully encoded designs.]

Area Delay Trade-offs using Varied Multiplexer Designs

Area and delay trade-offs in the design of an FPGA were explored for four different

multiplexer design styles. The first style uses 1-level multiplexers for all the multiplexers

in the FPGA. (While the LUT is constructed using a multiplexer, its implementation was

always the standard fully encoded structure.) The second style uses 2-level multiplexers in

every case except for multiplexers with 2 inputs. The third style uses 3-level multiplexers

except for multiplexers with four or fewer inputs, which are implemented using 1-level or 2-level multiplexers according to their width. Finally, the fourth approach uses fully

encoded multiplexers.

These approaches were applied in the design of an FPGA with a logical architecture as

described in Table 5.2. As in the previous investigations, transistor sizing is performed for

a range of design objectives for each multiplexer implementation strategy. The resulting


Table 5.11: Number of Transistors per Input for Various Multiplexer Widths

Number of   Transistors per Input                                        Improvement of Fully
Inputs      1-level    2-level              Fully Encoded                Encoded vs. 2-level
            1 O/Bit    2 O/Bit   1 O/Bit    2 O/Bit   1 O/Bit            2 O/Bit   1 O/Bit
3           7          5.3       6.7        5.3       6.7                0%        0%
4           7          4.5       5.5        4.5       5.5                0%        0%
5           7          6.0       6.4        5.2       6.4                13%       0%
6           7          5.3       5.7        4.7       5.7                13%       0%
7           7          5.4       5.7        4.3       5.1                21%       10%
8           7          5.0       5.3        4         4.8                20%       10%
9           7          5.3       5.3        4.4       5.3                17%       0%
10          7          4.8       5.0        4.2       5                  13%       0%
11          7          5.0       5.0        4         4.7                20%       5%
12          7          4.8       4.8        3.8       4.5                19%       5%
13          7          4.8       4.8        3.7       4.3                23%       10%
14          7          4.6       4.6        3.6       4.1                22%       9%
15          7          4.4       4.4        3.5       4                  21%       9%
16          7          4.3       4.3        3.4       3.9                21%       9%
17          7          4.3       4.3        3.6       4.2                15%       1%
18          7          4.2       4.2        3.6       4.1                15%       1%
19          7          4.0       4.0        3.5       4                  13%       0%
20          7          3.9       3.9        3.4       3.9                13%       0%
21          7          4.0       4.0        3.3       3.8                17%       5%
22          7          3.8       3.8        3.3       3.7                14%       2%
23          7          3.7       3.7        3.2       3.7                14%       2%
24          7          3.7       3.7        3.2       3.6                14%       2%
25          7          3.6       3.6        3.1       3.5                13%       2%
26          7          3.6       3.6        3.1       3.5                15%       4%
27          7          3.6       3.6        3.0       3.4                15%       4%
28          7          3.5       3.5        3.0       3.4                14%       4%
29          7          3.4       3.4        3.0       3.3                13%       3%
30          7          3.4       3.4        2.9       3.3                13%       3%
31          7          3.4       3.4        2.9       3.2                15%       6%
32          7          3.4       3.4        2.9       3.2                15%       6%

area-delay trade-off curves are shown in Figure 5.14. Note that for this architecture the

largest multiplexer is the BLE Input multiplexer with 32 inputs.

The data in the figure indicates that the 1-level multiplexers offer a potential speed

advantage but the area cost for this speed-up is significant. Based on the previously de-

fined criteria of interesting designs, the 1-level multiplexer design does not offer a useful

trade-off. Similarly, the 3-level design offers area savings but these area savings are not

sufficient to justify the diminished performance. The fully encoded multiplexer designs

suffer significantly in terms of delay and, while they do yield the absolute smallest de-

signs, the area savings never overcome the delay penalty. Clearly, the 2-level multiplexer

implementation strategy is the most effective and, for that reason, all the work in this

[Figure 5.14: Area Delay Trade-offs with Varied Multiplexer Implementations. Effective Delay (s) (geometric mean delay as measured by HSPICE) versus Effective Area (µm²); curves for 1-level, 2-level, 3-level and fully encoded multiplexers.]

chapter used the 2-level multiplexer topology. For any given multiplexer size, there are

a number of different 2-level topologies. The impact of these topologies is analyzed in

Appendix C and it was found to be relatively modest. Therefore, varied 2-level strategies

are not explored further.

5.7 Trade-offs and the Gap

The previous sections of this chapter have demonstrated that there is a wide range

of interesting area and delay trade-offs that can be made through varied architecture

and transistor sizing. One goal in examining the trade-offs was to understand how these

trade-offs could be used to selectively narrow the area and delay gaps and, in this section,

we investigate the impact of the trade-off ranges observed on the gap measurements from

Chapter 3.

The preceding work in this chapter has demonstrated that the design space for FPGAs is large and, by simply varying the transistor sizing of a design, the area and delay of an FPGA can be altered dramatically. This presents a challenge to exploring the impact

of the trade-off ranges because the area and delay gap measurements were performed for


a single commercial FPGA family, the Stratix II, and the trade-off decisions made by

the Stratix II’s designers and architects to conserve area or improve performance are not

known. As a result, the specific point occupied by this family within the large design

space is unknown.

To address this issue, we consider a range of circumstances for the possible trade-offs

that could be applied to the area delay gap. For example, in one case, it is assumed that

the Stratix II was designed to be at the performance extreme of the interesting region.

Based on that assumption, it could be possible to narrow the area gap by trading off

performance for area savings and create a design that is instead positioned at the area

extreme of the region. We compute the possible narrowed gap by applying the trade-off

range factors determined previously in this chapter. This is done as follows:

Area Gap with Trade-off = Measured Area Gap / Area Range    (5.5)

Delay Gap with Trade-off = Measured Delay Gap · Delay Range.    (5.6)

These trade-offs clearly narrow the gap in only one dimension and in the other dimension

the gap grows larger. If we were to assume that the Stratix II was designed with a greater

focus on area, then the trade-offs would be applied in the opposite manner with the area

gap growing and the delay gap narrowing by the area range and delay range factors

respectively.

The results for a variety of cases are summarized in Table 5.12. The row labelled

“Baseline” repeats the area and delay gap measurements for soft-logic only as reported

in Chapter 3. The subsequent rows list the area and delay gaps when the area and delay

trade-offs are used. The “Starting Point” column refers to the position within the design

space that the FPGA occupies before making any trade-offs and the “Ending Point”

describes the position in the design space after making the trade-offs. Three positions

within the design space are considered: Area, Delay and Area-Delay. The Area and

Delay points refer to the smallest and fastest positions (that still satisfy the interesting

trade-off requirements) in the design space respectively and the Area-Delay point refers

to the point within the design space with minimal Area-Delay. For the example described

above, the starting point would be the “Delay” point and the ending point would be the

“Area” point. When making trade-offs from the Area point to the Delay point or vice


Table 5.12: Potential Impact of Area and Delay Trade-offs on Soft Logic FPGA to ASIC Gap

Starting Point   Ending Point   Area Gap   Delay Gap
Baseline                        35         3.4
Delay            Area           18         7.1
Delay            Area-Delay     21         4.8
Area-Delay       Area           29         5.2
Area-Delay       Delay          59         2.4
Area             Delay          70         1.6

versa, the full area and delay range factors would be applied. For trade-offs involving

the Area-Delay point, only the range to or from that point would be considered.

For example, if starting at the “Delay” point and ending at the “Area-Delay” point the

partial ranges would be calculated as:

Partial Area Range = Area of Largest Design / Area of Area-Delay Design,    (5.7)

Partial Delay Range = Delay of Area-Delay Design / Delay of Fastest Design    (5.8)

and these ranges would be applied as per Equations 5.5 and 5.6 to determine the gap

after making the trade-offs.
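As a minimal worked example of Equations 5.5 and 5.6 (an illustration only, not part of the experimental flow), the first trade-off row of Table 5.12 can be reproduced from the Chapter 3 soft-logic gaps and the overall trade-off ranges found in this chapter:

    # Moving from the delay-optimized extreme to the area-optimized extreme.
    measured_area_gap = 35.0    # soft-logic area gap from Chapter 3
    measured_delay_gap = 3.4    # soft-logic delay gap from Chapter 3
    area_range, delay_range = 2.0, 2.1   # overall trade-off ranges from this chapter

    area_gap_after = measured_area_gap / area_range     # Equation 5.5, about 18
    delay_gap_after = measured_delay_gap * delay_range  # Equation 5.6, about 7.1
    print(round(area_gap_after), round(delay_gap_after, 1))  # the "Delay -> Area" row of Table 5.12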

From the data in the table, it is clear that leveraging the trade-offs can allow the area

and delay gap to vary significantly. In particular, it is most interesting to consider starting

from the delay optimized point in the design space because the Stratix II is Altera’s higher

performance/higher cost FPGA family [16] at the 90 nm technology node. In that case,

the area gap can be shrunk to 18 for soft logic and, if such trade-offs were combined

with the appropriate use of heterogeneous blocks, the overall area gap would shrink even further. The row in the table with an “Area” starting point and a “Delay” ending point

suggests that the delay gap could be narrowed (at the expense of area); however, this is

unlikely to be possible as the Stratix II is sold as a high performance part which suggests

its designers were not focused primarily on conserving area.

The extent of the trade-offs possible with power consumption was not examined ex-

tensively. However, it was observed that, for a number of different architectures and sizings, changes in power consumption were closely related to changes in area. If this

relationship is assumed to apply in general, then it follows that the range in power con-


Table 5.13: Area and Delay Trade-off Ranges Compared to Commercial Devices

              This Work                  Commercial Devices
              (Area-Delay vs. Delay)     (Cyclone II vs. Stratix II)
Delay Range   1.41                       1.40
Area Range    1.68                       N/A

sumption from greatest consumption to least consumption varies by the same factor of

2.0 as the area range. This estimated power range can also be applied to the power

gap measurements from Chapter 3. If we apply the trade-offs from the delay optimized

extreme of the design space to the area optimized extreme, then the dynamic power con-

sumption gap could potentially narrow from 14 down to 7.0. While this highlights that

it may be possible to reduce the power gap significantly, this is only a very approximate

estimate of the possible changes and future work is needed to more accurately assess

these possibilities.

5.7.1 Comparison with Commercial Families

While the reduction in the area gap is useful, the impact on the delay gap is also sig-

nificant. It is useful to compare these trade-offs to those found in commercial FPGA

families. Altera has two 90 nm FPGA families, the high performance/high cost Stratix

II [16] and the lower cost/lower performance Cyclone II [143]. For the benchmarks used

in Chapter 3, the Stratix II was on average approximately 40 % faster than the Cyclone

II. This means that the delay range between these parts was 1.40. This closely matches

the delay range of 1.41 we observed between the largest/fastest design and the minimal

area-delay design. This result is summarized in Table 5.13.

Unfortunately, the core area for the Cyclone II is not known and, therefore, a direct

area comparison is not possible. However, an approximate measure of the area difference

can be made by comparing the prices of the different parts. The parts with the most

similar logic capabilities are the Stratix II 2S30 and Cyclone II 2C35, and the Stratix II

2S15 and Cyclone II 2C15. Using prices from http://www.buyaltera.com/ (an autho-

rized distribution site for Altera parts) for these sets of parts, the Stratix II was found

to have a price that is 4–6 times higher than the Cyclone II4. This only provides an

4 Both sets of parts have a similar number of (equivalent) logic elements but the Stratix II contained more memory and more multipliers; therefore, the range should be considered an upper bound on the price difference.


approximate indication of the area difference as price is affected by a number of factors

including profit margins, package costs, I/O differences and other differences in core logic

capabilities and, therefore, the true area difference is certainly smaller than the price dif-

ference suggests. Given these caveats, the 1.68 × area difference we observed between

the largest design and minimal area-delay design appears realistic.

5.8 Summary

In this chapter, we have explored the trade-offs between area and delay that are possible in

the design of FPGAs when both architecture and transistor sizing are varied. Compared

to past pure architectural studies, it was found that varying the transistor sizing of a

single architecture offers a greater range of possible trade-offs between area and delay

than was possible by only varying the architecture. By varying the architecture along

with the transistor sizings, we see that performance could be usefully varied by a factor

of 2.1 and area by a factor of 2.0. These trade-off ranges can be used to selectively

shrink the gap between FPGAs and ASICs to create slower and smaller FPGAs or faster

and larger FPGAs as desired. Specifically, for the soft logic of the FPGA the area gap

could shrink as low as 18 by taking advantage of these trade-offs. When making these

trade-offs, LUT size was found to be by far the most useful architectural parameter.

Chapter 6

Conclusions and Future Work

The focus of this thesis is on gaining a better understanding of the area, performance

and power consumption gap between FPGAs and ASICs. The first step in doing this

was to measure the gap. It was found that heterogeneous hard blocks can be useful

tools in narrowing the area gap but, regardless, the area, performance and power gap

for soft logic remains large. To address this large gap, the latter portion of this thesis

explored the opportunities to trade-off area and delay through varied transistor-level and

architectural trade-offs. Such trade-offs allow the gap to be navigated by improving one

attribute at the expense of another. The most significant contributions that were made

in this work are summarized in the following section.

6.1 Contributions

One significant result from this thesis has been the most thorough analysis to date of the area, performance and power consumption differences between FPGAs and ASICs.

It was found that designs implemented using only the soft logic of an FPGA used 35

times more area, were 3.4–4.6 times slower and used 14 times more dynamic power than

equivalent ASIC implementations. When designs also employed hard memory and multi-

plier blocks, it was observed that the area gap could be shrunk considerably. Specifically,

it was found that the area gap was 25 for circuits that used hard multiplier blocks, 33

for circuits that used hard memory blocks and 18 for circuits that used both multipliers

and memory blocks. These reductions in the area gap occurred even though none of the

benchmark circuits used all the available hard blocks on an FPGA. If it is optimistically

assumed that all hard blocks in the target FPGA were fully used then the area gap could


potentially shrink as low as 4.7 when only the core logic is considered or as low as 2.8

when the peripheral circuitry is also optimistically assumed to be fully used. Contrary to

popular perception, it was found that hard blocks did not offer significant performance

benefits as the average performance gap for circuits that used memory and multiplier

hard blocks was only 3.0–4.1. The hard blocks did appear to enable appreciable im-

provements to the dynamic power gap which was measured to be on average 7.1 for the

circuits that used both multiplier and memory hard blocks. This work was published in

[144, 145].

The automated transistor sizing tool for FPGAs developed as part of this dissertation

led to a number of contributions. This work was the first to consider the automated

transistor sizing of complete FPGAs and this raised a number of previously unexplored

issues. One such issue in particular is the impact of an FPGA’s programmability on

the transistor-level optimization choices. Due to the programmability, it is not known

what the critical paths of the FPGA will be when the FPGA is being designed. One

effective solution, the use of a representative path delay, was developed and described in

this work. In terms of optimization algorithms, a two-phased approach to optimization

that optimized sizes first based on RC transistor models and then using full simulation

with accurate models was developed. It was shown that this approach produced results

on par or better than past manual designs. As a direct contribution to the community, a

range of optimized designs created by the sizing tool was released at http://www.eecg.utoronto.ca/vpr/architectures. These contributions were summarized in [146].

Finally, this thesis also demonstrated the large range of trade-offs that can be made

between area and performance. In past investigations, trade-offs had generally been

achieved through logical architecture changes but this work found that a significantly

wider range of trade-offs were possible when logical architecture and transistor sizing

changes were explored together. This broader range of trade-offs is significant as it

indicated the possibility of selectively narrowing the FPGA to ASIC gap. The analysis

of the trade-offs was also unique in that a quantitative method was used to determine if

a trade-off was useful and interesting. This work was published in [147].

6.2 Future Work

The outcomes of this research suggest a number of directions for future research both in

understanding the gap between FPGAs and ASICs and in narrowing it.


6.2.1 Measuring the Gap

The measurements of the FPGA to ASIC gap described in Chapter 3 offered one of the

most thorough measurements of this gap to date; however, further research in specific

areas could be useful to improve the understanding of this gap. One of the issues raised in

Chapter 3 was the size of the benchmarks used in the comparison. The largest benchmark

used 9 656 ALMs while the largest currently announced FPGA with 272 440 ALMs [17]

offers over an order of magnitude more resources. Since FPGAs are architected to handle

those larger circuits, it could be informative to measure the gap using benchmarks that

fully exercise the capacity of the largest FPGAs. Additionally, it could also be interesting

to measure the gap with benchmarks that make more use of the hard blocks in the FPGA

since the benchmark circuits for the current work often did not use the hard blocks

extensively.

Another issue is that the measurement of the gap focused only on core logic and it

could be informative to extend the work to consider the I/O portions of the design. It

is possible that for designs that demand a large number of I/O’s the area gap could

effectively shrink if the design is pad limited. As well, for more typical core-limited

designs, I/O blocks may also impact the gap as they are essentially a form of hard block.

While an optimistic analysis of this effect was performed in Chapter 3, a more thorough

analysis as was done for the core logic could provide greater insight into the impact of

the I/O on the FPGA to ASIC gap. This could be particularly useful as the architecture

of the I/O’s has not been studied extensively and, with more quantitative assessments of

its role, new architectural enhancements may be discovered.

The area, performance and power consumption gap was also only measured at one

technology node, 90 nm CMOS, and there are many reasons why it could be useful to

explore the gap in other technology nodes. In [3], some measurements of the area,

performance and power gap were made in technologies ranging from 250 nm down to

90 nm and, as described in Section 2.6, there was significant variability observed between

technology nodes, particularly for area and performance. The reason for this variability is

unknown and more accurate measurements might uncover the reasons for the differences.

Knowledge of the cause of these differences could point to possibilities for improving

FPGAs.

Furthermore, it would also be interesting to remeasure the gap in more modern tech-

nologies. Recent FPGAs in 40 nm and 65 nm CMOS have added programmable power


capabilities [17, 6] which allow portions of the FPGA to be programmably slowed down

to reduce leakage power. The programmability is accomplished through the use of body

biasing [148] which adds area overhead as additional wells and spacing are necessary to

support these adjustable body biases. There has been some work that has considered the

area impact of these schemes [149, 150] but to date no direct comparisons with ASICs

have been reported. Such comparisons could be interesting since, while body biasing

may be necessary to combat leakage power in ASICs, the fine-grained programmability

present in the FPGA would not be required and, hence, the area gap may potentially be

larger in the latest technologies.

This thesis focused exclusively on SRAM-based FPGAs as they dominate the mar-

ket; however, the development of flash and antifuse-based FPGAs [151, 152, 153] has

continued. Measuring the gap for such FPGAs would be interesting as they potentially

offer area savings. In addition to this, there are also a number of single or near-single

transistor one-time programmable memory designs that promise full compatibility with

standard CMOS processes [154, 155]. (The lack of CMOS compatibility has been one

of the major issues limiting the use of the flash and traditional antifuse-based FPGAs.)

While no current FPGAs make use of these memories as their sole configuration memory

storage, they could potentially be useful in future FPGAs and investigating their impact

on the area, performance and power of FPGAs relative to ASICs could be informative.

The gap measurements were centred on three axes, area, performance and power con-

sumption, for the core logic. However, these measurements are only indirect measures of

the true variables that affect FPGA usage which are their system-level cost, performance

and power consumption. The addition of measurements that include the impact of the

I/O portion of the design as suggested previously would partially address this issue as

it would provide a better measure of system-level performance and power consumption.

However, silicon area is not always a reliable measure of system-level cost. For small de-

vices, the costs of the package can be a significant portion of the total device cost. Since

these costs may be similar for both ASICs and FPGAs, this could reduce the effective

cost gap between the implementation media. As well, it has long been known that yield

decreases at an exponential rate with increasing area [156] and that causes greater than

linear cost increases with increased area. Some FPGAs are able to mitigate this issue

through the use of redundancy [157, 158, 159] to correct faults. Such techniques increase

area but presumably lower costs. In contrast, the irregular nature of logic in ASICs likely

prevents the use of such techniques in ASICs. Clearly, there is a great deal of complexity


to the area and cost relationship and a more detailed analysis of these area and cost

issues could be informative.

Finally, as described in Chapter 3, measurements of the static power consumption gap

were inconclusive and more definitive measures of that axis would be useful in the future.

However, there are many challenges to getting reliable and comparable static power

measurements. One of the central challenges is that it is difficult to compare results

for parts from different foundries unless the accuracy of the estimates is well defined.

This is crucial as the goal in the static power comparison would be to compare FPGA

and ASIC technology and not any underlying foundry-specific issues. This particular

issue could be addressed by performing the comparison with both the FPGA and ASIC

implemented in the same process from the same foundry. However, this would not fully

address all issues because the FPGA manufacturers, due to technical or business factors,

may eliminate parts with unacceptable leakage. The removal of those leaky parts would

reduce the static power measurements for the FPGA but, since the same could also be

done for an ASIC, any static power consumption comparison must ensure that the results

are not influenced by such issues. Therefore, a fair comparison may require SPICE level

simulations of both the FPGA and the ASIC using identical process technology libraries.

A fair comparison such as that would certainly be useful as static power consumption

has become a significant concern in the latest process technology nodes.

6.2.2 Navigating the Gap

In addition to the avenues for future research in measuring the gap, there are also op-

portunities to explore in selectively narrowing the gap through design trade-offs. The

simplest extension would be to consider an even broader range of logical architectures.

One potential avenue is the use of routing segments with a mix of segment lengths, which

is common in commercial FPGAs [17, 7]. As well, changes to the logic cluster such as

intra-cluster depopulation [19, 20], the use of arithmetic carry chains and the addition

of dedicated arithmetic logic within the logic block warrant exploration. These ideas

have all been adopted in high-performance FPGAs [20, 15, 7, 17] but the impact of these

approaches on the area-delay design space and the associated trade-offs have not been

investigated.

Future research is also needed to investigate the impact of hard blocks such as mul-

tipliers and memories on the trade-offs that are possible. It was seen in Chapter 3 that


these hard blocks offer significant area benefits but that work only considered a sin-

gle architecture. With varied logical architectures and transistor sizings the role of hard

blocks throughout the design space could be better understood. A notation and language

that captures the issues of the supply and the demand of these blocks was introduced in

[160, 47] and that framework would certainly be useful for explorations of the impact of

hard blocks on the design space.

There is also clearly work that can be done exploring trade-offs of area and per-

formance with power consumption. As described previously, in many cases power and

area are closely related; however, there are a number of techniques that can alter this

relationship. In particular, the use of programmable body biasing [149, 150, 148] or

programmable VDD connections [161, 162, 163] can change the area, performance and

power relationships. While many of these ideas have been studied independently, the

use of these techniques has not been examined throughout the design space. For exam-

ple, it is not clear which techniques are useful for area constrained designs or what the

performance impact is with these techniques when no area increase is permitted.

An additional dimension for trade-offs that was not explored in this work was that of

the time required to implement (synthesize, place and route) designs on an FPGA. This

time, typically referred to as compile time, can be significantly impacted by architectural

changes such as altering the cluster size [164] or the number of routing resources. This

could enable interesting trade-offs as area savings could be made at the expense of in-

creased compile time but future research is needed to determine if any of these trade-offs

are viable. This will become particularly important as single-processor performance is

no longer growing at the same rate as FPGAs are increasing in size. If that discrepancy

is left unaddressed, it will lead to increased compile times and that could then threaten

to diminish one of the key advantages of FPGAs which is their fast design time.

There are also many opportunities for further research into the optimizer used to

perform transistor sizing. While the optimizer described in Chapter 4 delivered results

that were better than or at worst comparable to past manually optimized designs, there is still

room for future improvement. In particular, little attention was paid to the run time of

the tool and research to develop alternative algorithms that lower the run time require-

ments would be useful. Another possibility for future research is the investigation of new

approaches to optimization that better handle the programmability of FPGAs. This

could allow optimization to be performed on hard blocks such as multipliers. Ultimately,

an improved optimizer could prove useful in the design of commercial FPGAs.


6.3 Concluding Remarks

This thesis has demonstrated that, while a large area, performance and power consump-

tion gap exists between FPGAs and ASICs, there is the potential to selectively narrow

these gaps through architectural and transistor-level changes. As described in the previ-

ous section there are many promising areas for future research that may provide a deeper

understanding of both the magnitude of the FPGA to ASIC gap and the trade-offs that

can be used to narrow it. This coupled with innovation in the architecture and design of

FPGAs may enable the broader use of FPGAs.

Appendix A

FPGA to ASIC Comparison Details

This appendix provides information on the benchmarks used for the FPGA to ASIC

comparisons in Chapter 3. As well, some of the absolute data from that comparison

is provided; however, area results are not included as that would disclose confidential

information.

A.1 Benchmark Information

Information about each of the benchmarks used in the FPGA to ASIC comparisons is

listed in Table A.1. For each benchmark, a brief description of what the benchmark does

is given along with information about its source. Most of the benchmarks were obtained

from OpenCores (http://www.opencores.org/) while the remainder of the benchmarks

came from either internal University of Toronto projects [98, 99, 95, 96] or external bench-

mark projects at http://www.humanistic.org/∼hendrik/reed-solomon/index.html

or http://www.engr.scu.edu/mourad/benchmark/RTL-Bench.html. As noted in the

table, in some cases, the benchmarks were not obtained directly from these sources and,

instead, were modified as part of the work performed in [47]. The modifications included

the removal of FPGA vendor-specific constructs and the correction of any compilation

issues in the designs.

A.2 FPGA to ASIC Comparison Data

The results in Chapter 3 were given only in relative terms. This section provides the raw

data underlying these relative comparisons. Table A.2 and Table A.3 list the maximum


Table A.1: Benchmark Descriptions

Benchmark      Description
booth          32-bit serial Booth-encoded multiplier created by the author
rs encoder     (255,239) Reed Solomon encoder from OpenCores
cordic18       18-bit CORDIC algorithm implementation from OpenCores
cordic8        8-bit CORDIC algorithm implementation from OpenCores
des area       DES Encryption/Decryption designed for area from OpenCores with modifications from [47]
des perf       DES Encryption/Decryption designed for performance from OpenCores with modifications from [47]
fir restruct   8-bit 17-tap finite impulse response filter with fixed coefficients from http://www.engr.scu.edu/mourad/benchmark/RTL-Bench.html with modifications from [47]
mac1           Ethernet Media Access Control (MAC) block from OpenCores with modifications from [47]
aes192         AES Encryption/Decryption with 192-bit keys from OpenCores
fir3           8-bit 3-tap finite impulse response filter from OpenCores with modifications from [47]
diffeq         Differential equation solver from OpenCores with modifications from [47]
diffeq2        Differential equation solver from OpenCores with modifications from [47]
molecular      Molecular dynamics simulator [95]
rs decoder1    (31,19) Reed Solomon decoder from http://www.humanistic.org/∼hendrik/reed-solomon/index.html with modifications from [47]
rs decoder2    (511,503) Reed Solomon decoder from http://www.humanistic.org/∼hendrik/reed-solomon/index.html with modifications from [47]
atm            High speed 32x32 ATM packet switch based on the architecture from [100]
aes            AES Encryption with 128-bit keys from OpenCores
aes inv        AES Decryption with 128-bit keys from OpenCores
ethernet       Ethernet Media Access Control (MAC) block from OpenCores
serialproc     32-bit RISC processor with serial ALU [99, 98]
fir24          16-bit 24-tap finite impulse response filter from OpenCores with modifications from [47]
pipe5proc      32-bit RISC processor with 5 pipeline stages [99, 98]
raytracer      Image rendering engine [96]

operating frequency and dynamic power, respectively, for each design for both the FPGA

and ASIC. Finally, Tables A.4 and A.5 report the FPGA and ASIC absolute static power

measurements for each benchmark at typical and worst case conditions respectively. The

static power measurements for the FPGAs include the adjustments to account for the

partial utilization of each device as described in Section 3.4.3. Finally, Table A.6 sum-

marizes the results when retiming was used with the FPGA CAD flow as described in

Section 3.5.2. The benchmark size (in ALUTs), the operating frequency increase and the

total register increase are listed for each of the benchmarks.


Table A.2: FPGA and ASIC Operating Frequencies

Benchmark      Maximum Operating Frequency (MHz)
               FPGA      ASIC
booth          188.71    934.58
rs encoder     288.52    1098.90
cordic18       260.08    961.54
cordic8        376.08    699.30
des area       360.49    729.93
des perf       321.34    1000.00
fir restruct   194.55    775.19
mac1           153.21    584.80
aes192         125.75    549.45
fir3           278.40    961.54
diffeq         78.23     318.47
diffeq2        70.58     281.69
molecular      89.01     414.94
rs decoder1    125.27    358.42
rs decoder2    101.24    239.23
atm            319.28    917.43
aes            213.22    800.00
aes inv        152.28    649.35
ethernet       168.58    704.23
serialproc     142.27    393.70
fir24          249.44    645.16
pipe5proc      131.03    378.79
raytracer      120.35    416.67


Table A.3: FPGA and ASIC Dynamic Power Consumption

Benchmark      Dynamic Power Consumption (W)
               FPGA        ASIC
booth          5.10E-03    1.71E-04
rs encoder     4.63E-02    1.88E-03
cordic18       6.75E-02    1.08E-02
cordic8        1.39E-02    2.44E-03
des area       3.50E-02    1.32E-03
des perf       1.22E-01    1.31E-02
fir restruct   2.47E-02    2.56E-03
mac1           8.94E-02    4.63E-03
aes192         1.04E-01    3.50E-03
fir3           7.91E-03    1.06E-03
diffeq         4.53E-02    3.86E-03
diffeq2        5.18E-02    4.16E-03
molecular      4.55E-01    2.76E-02
rs decoder1    3.48E-02    2.20E-03
rs decoder2    4.74E-02    4.29E-03
atm            5.59E-01    3.71E-02
aes            6.32E-02    6.71E-03
aes inv        7.65E-02    1.13E-02
ethernet       9.17E-02    5.91E-03
serialproc     3.42E-02    2.16E-03
fir24          1.18E-01    2.22E-02
pipe5proc      5.11E-02    6.23E-03
raytracer      8.99E-01    1.08E-01

Table A.4: FPGA and ASIC Static Power Consumption – Typical

Benchmark      Static Power Consumption (W)
               FPGA        ASIC
rs encoder     1.31E-02    2.61E-04
cordic18       4.43E-02    5.73E-04
des area       1.14E-02    1.25E-04
des perf       5.52E-02    1.08E-03
fir restruct   1.40E-02    2.03E-04
mac1           3.52E-02    4.08E-04
aes192         1.61E-02    1.90E-04
diffeq2        1.15E-02    3.63E-04
molecular      1.27E-01    1.83E-03
rs decoder1    1.74E-02    7.47E-05
rs decoder2    2.31E-02    1.91E-04
atm            2.46E-01    1.08E-03
aes            1.67E-02    5.06E-04
aes inv        2.06E-02    6.68E-04
ethernet       5.11E-02    2.94E-04
fir24          2.18E-02    1.66E-03
pipe5proc      2.06E-02    1.27E-04
raytracer      1.69E-01    1.74E-03


Table A.5: FPGA and ASIC Static Power Consumption – Worst Case

Benchmark      Static Power Consumption (W)
               FPGA        ASIC
rs encoder     3.46E-02    1.00E-02
cordic18       1.17E-01    2.27E-02
des perf       1.45E-01    4.16E-02
fir restruct   3.70E-02    7.86E-03
mac1           9.28E-02    1.56E-02
aes192         5.00E-02    7.51E-03
diffeq         2.45E-02    1.44E-02
diffeq2        3.04E-02    1.40E-02
molecular      3.95E-01    7.19E-02
rs decoder1    4.60E-02    3.02E-03
rs decoder2    6.10E-02    7.46E-03
atm            7.70E-01    4.61E-02
aes            5.21E-02    1.93E-02
aes inv        6.42E-02    2.58E-02
ethernet       1.35E-01    1.07E-02
fir24          6.80E-02    6.52E-02
pipe5proc      5.44E-02    9.20E-03
raytracer      7.14E-01    N/A


Table A.6: Impact of Retiming on FPGA Performance

Benchmark               Benchmark    ALUTs   Operating Frequency   Register Count
                        Category             Increase (%)          Increase (%)
des area                Logic        469     1.2                   0.0
booth                   Logic        34      0.0                   0.0
rs encoder              Logic        683     0.0                   0.0
fir scu rtl             Logic        615     14                    89
fir restruct1           Logic        619     11                    64
fir restruct            Logic        621     15                    76
mac1                    Logic        1852    0.0                   0.0
cordic8                 Logic        251     0.0                   0.0
mac2                    Logic        6776    0.0                   0.0
md5 1                   Logic        2227    23                    21
aes no mem              Logic        1389    0.0                   0.0
raytracer framebuf v1   Logic        301     3.0                   0.0
raytracer bound         Logic        886     0.0                   0.0
raytracer bound v1      Logic        889     0.0                   0.0
cordic                  Logic        907     0.0                   0.0
aes192                  Logic        1090    9.7                   30
md5 2                   Logic        858     10                    13
cordic                  Logic        1278    0.0                   0.0
des perf                Logic        1840    −0.5                  1.0
cordic18                Logic        1169    0.0                   0.0
aes inv no mem          Logic        1962    0.0                   0.0
fir3                    DSP          52      −14                   −40
diffeq                  DSP          219     0.0                   0.0
iir                     DSP          284     0.0                   0.0
iir1                    DSP          218     0.0                   0.0
diffeq2                 DSP          222     0.0                   0.0
rs decoder1             DSP          418     5.4                   7.5
rs decoder2             DSP          535     −0.3                  11
raytracer gen v1        DSP          1625    0.0                   0.0
raytracer gen           DSP          1706    0.0                   0.0
molecular               DSP          6289    1.3                   14
molecular2              DSP          6557    24                    71
stereovision1           DSP          2934    36                    19
stereovision3           Memory       82      10                    9.3
serialproc              Memory       671     −2.0                  16
raytracer framebuf      Memory       457     12                    0.0
aes                     Memory       675     0.0                   0.0
aes inv                 Memory       813     0.0                   0.0
ethernet                Memory       1650    −0.6                  4.1
faraday dma             Memory       1987    0.5                   0.9
faraday risc            Memory       2596    −1.0                  1.3
faraday dsp             Memory       7218    −2.9                  −0.1
stereovision0 v1        Memory       2919    −1.6                  0.2
atm                     Memory       10514   4.7                   1.1
stereovision0           Memory       19969   3.7                   0.4
oc54 cpu                DSP & Mem    1543    0.0                   0.0
pipe5proc               DSP & Mem    746     5.5                   49
fir24                   DSP & Mem    821     −7.4                  −3.3
fft256 nomem            DSP & Mem    966     0.0                   0.0
raytracer top           DSP & Mem    11438   14                    0.0
raytracer top v1        DSP & Mem    11424   11                    −0.3
raytracer               DSP & Mem    13021   3.0                   −0.6
fft256                  DSP & Mem    27479   0.0                   0.0
stereovision2 v1        DSP & Mem    27097   117                   131
stereovision2           DSP & Mem    27691   97                    124

Appendix B

Representative Delay Weighting

The programmability of FPGAs means that the eventual critical paths are not known

at design time. However, a delay measurement is necessary if the performance of an

FPGA is to be optimized. A solution described in Section 4.3.2 was to create a path

containing all the possible critical path components. The delays of the components were

then combined as a weighted sum to reflect the typical usage of each component and that

weighted sum, which was termed the representative delay, was used as a measure of the

FPGA's performance during optimization. This appendix investigates the selection of the

weights used to compute the representative delay. As a starting point, the behaviour of

benchmark circuits is analyzed. That analysis provided one set of possible weights that

are then tested along with other possible weightings in Section B.2. The results from the

different weightings are compared and conclusions are made.

B.1 Benchmark Statistics

The representative delay is intended to capture the behaviour of typical circuits imple-

mented on the FPGA. Therefore, to determine appropriate values for the delay weight-

ings, it is useful to examine the characteristics of benchmark circuits. The focus in this

examination will be on how frequently the various components of the FPGA appear on

the critical paths of circuits. In particular, for the architecture we will consider, there are

four primary components whose usage effectively determines the usage of all the compo-

nents of the FPGA. These four components are the routing segments, the CLB1 inputs,

1 Recall that a Cluster-based Logic Block (CLB) is the only type of logic block considered in this work.


Table B.1: Normalized Usage of FPGA Components

Benchmark   LUTs   Routing Segments   CLB Inputs   CLB Outputs
alu4        0.20   0.43               0.14         0.23
apex2       0.17   0.49               0.15         0.20
apex4       0.17   0.46               0.17         0.20
bigkey      0.12   0.53               0.18         0.18
clma        0.19   0.44               0.14         0.22
des         0.17   0.46               0.17         0.20
diffeq      0.34   0.13               0.13         0.39
dsip        0.12   0.53               0.18         0.18
elliptic    0.25   0.31               0.15         0.29
ex1010      0.16   0.55               0.12         0.18
ex5p        0.16   0.47               0.18         0.18
frisc       0.25   0.25               0.21         0.28
misex3      0.18   0.42               0.18         0.21
pdc         0.14   0.59               0.12         0.15
s298        0.22   0.33               0.20         0.25
s38417      0.22   0.33               0.18         0.27
s38584.1    0.20   0.34               0.22         0.24
seq         0.18   0.44               0.18         0.21
spla        0.14   0.54               0.16         0.16
tseng       0.26   0.26               0.17         0.31

Minimum     0.12   0.13               0.12         0.15
Maximum     0.34   0.59               0.22         0.39
Average     0.19   0.42               0.17         0.23

the CLB Outputs and the LUT. The usage of the LUT will be examined in detail later in

this section.

The usage of these key components was tracked for the critical paths of the 20 MCNC

benchmark circuits [138] when implemented on the standard baseline architecture de-

scribed in Table 5.2. For each benchmark, the number of times each of the components

appear on the critical path was recorded. These numbers were normalized to the total

number of components on the benchmark’s critical path to allow for comparison across

benchmarks with different lengths of critical paths and the results are summarized in Ta-

ble B.1. The final three rows of the table indicate the minimum, maximum and average

normalized usage of each component. Clearly, there is a great deal of variation between

the benchmarks, in particular, in the relative demands placed on the LUTs versus the

routing segments. The optimization of an FPGA must attempt to balance these different

needs and, therefore, it seems appropriate to consider using these average path statistics

to determine the representative delay weights. Before examining the use of these weights,

the LUT usage will be more thoroughly investigated.
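The normalization itself is straightforward; the small Python sketch below (with made-up component counts, purely for illustration) shows the calculation that produces each row of Table B.1:

    # Each critical-path component count is divided by the total number of
    # components on that benchmark's critical path (the counts below are invented).
    def normalize(counts):
        total = sum(counts.values())
        return {name: round(n / total, 2) for name, n in counts.items()}

    print(normalize({"lut": 12, "routing_segment": 26, "clb_input": 10, "clb_output": 14}))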

[Figure B.1: Input-dependent Delays through the LUT. A LUT implemented as a fully encoded multiplexer of SRAM-controlled pass transistors, with its fast and slow select inputs and its output labelled.]

LUT Usage

In the previous results, the usage of the LUT was assumed to be the same in all cases.

However, in reality, the specific input to the LUT that is used has a significant effect on

the delay of a signal through the LUT. The reason for these differences is the imple-

mentation of the LUT as a fully encoded multiplexer structure and this is illustrated in

Figure B.1. These speed differences can be significant and, therefore, it is advantageous

to use the faster inputs on performance critical nets. Commercial CAD tools generally

perform such optimization [101] when possible and, as a result, the faster LUT inputs

appear more frequently on the critical path. This usage of some LUT inputs more than

other inputs has potentially important optimization implications because area can be

conserved on less frequently used paths through the LUT. As the LUT uses

a significant portion of the FPGA area, such area savings can impact the overall area

and performance of the FPGA.

To address this, the usage of the LUT inputs was examined. Unfortunately, the

CAD tools used in this work do not recognize the timing differences between the LUT

inputs and, therefore, the LUT input usage is certainly not optimized. Instead, to gain a

sense of the relative importance of the different LUT inputs, the LUT usage for designs

Appendix B. Representative Delay Weighting 155

Table B.2: Usage of LUT Inputs

FPGA Family   Logic Element   LUT Input Usage (slowest → fastest; inputs A–F)
Stratix       4-LUT           0.215   0.251   0.197   0.336
Cyclone       4-LUT           0.243   0.251   0.187   0.319
Cyclone II    4-LUT           0.214   0.261   0.153   0.372
Stratix II    ALM (6-LUT)     0.099   0.103   0.202   0.117   0.041   0.439

implemented on commercial CAD tools was examined. For the set of benchmark circuits

in Table A.6, the critical path of each circuit was examined and the LUT input that

was used for each LUT on the critical path was tracked2. The results are summarized in

Table B.2 for all the benchmarks implemented on different FPGA families. The specific

FPGA family is listed in the first column of the table. The remaining columns indicate

the normalized usage of each input on the critical path from the slowest input to fastest

input. Clearly, the fastest input is used most frequently while the remaining inputs are not

used as much. In general, the remaining inputs are all used with approximately equal

frequency.

These results, however, only provide statistics for the two commercially used LUT

sizes of 4 and 6. Since more LUT sizes will be examined in this work, it is necessary

to make some assumptions about the LUT usage. For simplicity, the fastest input will

be assumed to be used 50 % of the time and the remaining LUT usage will be divided

equally amongst the remaining LUT inputs. These relative LUT usage proportions will

be used to create a weighted sum of the individual LUT input delays that reflects the

overall behaviour of the LUT.
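A small Python sketch of this weighted sum is shown below. The per-input delays used in the example are hypothetical placeholders that stand in for the HSPICE-measured delays of the individual LUT inputs.

def weighted_lut_delay(input_delays):
    # Combine the per-input delays of a K-input LUT into one value, with the
    # fastest input weighted at 50% and the rest sharing the remaining 50%.
    delays = sorted(input_delays)          # delays[0] is the fastest input
    k = len(delays)
    weights = [0.5] + [0.5 / (k - 1)] * (k - 1)
    return sum(w * d for w, d in zip(weights, delays))

# Example with hypothetical 4-LUT input delays in seconds (not measured values).
print(weighted_lut_delay([80e-12, 120e-12, 150e-12, 170e-12]))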

With the suitable weights now known for the LUTs and all the FPGA components, the

usage of these weights to create a representative delay will be examined in the following

section.

B.2 Representative Delay Weights

The representative delay measurement described in Section 4.3.2 attempts to capture the

performance of an FPGA with a single overall delay measurement. That overall measure-

ment is computed as the weighted combination of the delays of the FPGA components.

2 These commercial devices have additional features in the logic element that may require the usage of particular inputs of the LUT. This may have some impact on the LUT usage results.


The results from the previous section provided a measure of the relative usage of the

components within the FPGA and that is one possible weighting that can be applied to

the component delays. However, there are other possible weightings and, in this section,

a range of weightings will be examined. The full list of weightings that will be tested

is given in Table B.3. (Note that weighting number 1 approximately matches the av-

erage benchmark characteristics from Table B.1. It does not match precisely because a

different approach was used for calculating the average characteristics when this work

was performed.) Only a single routing weight is used as there was only a single type of

routing track in the test architecture. Similarly, the LUT weight is the weight for all

LUT inputs and the weight amongst the different input cases is split as described above.
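One plausible form of this weighted combination, normalized by the total weight, is sketched below in Python. The component delays are hypothetical placeholders, and the weights are those of weighting 1 from Table B.3; this is an illustrative sketch rather than the exact formulation of Section 4.3.2.

# Hypothetical component delays in seconds (placeholders, not measured values).
component_delays = {"lut": 4.0e-10, "routing": 3.0e-10,
                    "clb_in": 2.5e-10, "clb_out": 2.0e-10}

# Weighting 1 from Table B.3.
weighting_1 = {"lut": 20, "routing": 40, "clb_in": 17, "clb_out": 23}

def representative_delay(delays, weights):
    # Weighted combination of the component delays, normalized by the total weight.
    total = sum(weights.values())
    return sum(weights[c] * delays[c] for c in weights) / total

print(representative_delay(component_delays, weighting_1))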

These different weightings were used to create different representative path delays.

The optimization process described in Chapter 4 was then used to produce different

FPGA designs. For this optimization, an objective function of Area0Delay1 was used.

The area and delay of the design produced for each different weighting was then de-

termined using the standard experimental process with the full CAD flow as described

in Section 5.1. These area and delay results are plotted in Figure B.2. The Y-axis is

the geometric mean delay for the benchmark circuits and the X-axis refers to the area

required to implement all the benchmark designs.

The figure suggests that the final area and delay of the design do depend on the weighting function used, but the differences are, in fact, not that large. The slowest

design is only 12 % slower than the fastest design and the largest design is only 24 %

larger than the smallest design. These differences are relatively small despite the massive

changes in the weightings. For example, Weightings 22 and 23 yielded the smallest

and largest designs respectively, yet the specific weights were widely different. This

effectively demonstrates that the final delay and area are not extremely sensitive to

the specific weights used for the representative path. Based on this observation, the

weights determined from the analysis of the benchmark circuits were used in this work

for simplicity. Slight performance improvements could be obtained with the use of one

of the alternate weights but that new weighting would likely only be useful for this

particular architecture. For another architecture, a new set of weights would be required

because the usage of the components would have changed. It would not be feasible to

revisit this issue of weighting for every single architecture and, instead, the same weights

were used in all cases. This does indicate a potential avenue for future work that better

incorporates the eventual usage of the FPGA components into the optimization process.


Table B.3: Representative Path Weighting Test Weights

Weighting    LUT (wLUT, wBLE in)    Routing Segment (wrouting,i)    CLB Input (wCLB in)    CLB Output (wCLB out)
1            20      40      17      23
2            10      50      17      23
3            30      30      17      23
4            40      20      17      23
5            50      10      17      23
6            20      47      10      23
7            20      42      15      23
8            20      37      20      23
9            20      32      25      23
10           20      27      30      23
11           20      53      17      10
12           20      48      17      15
13           20      43      17      20
14           20      38      17      25
15           20      33      17      30
16           20      28      17      35
17           30      10      25.5    34.5
18           26.7    20      22.7    30.7
19           23.3    30      19.8    26.8
20           16.7    50      14.2    19.2
21           13.3    60      11.3    15.3
22           10      70      8.5     11.5
23           55      5       17      23
24           30      40      7       23
25           35      40      2       23
26           25      40      17      18
27           30      40      17      13
28           35      40      17      8
29           40      40      17      3
30           25      30      17      28
31           30      20      17      33
32           35      10      17      38

Figure B.2: Area and Delay with Varied Representative Path Weightings (X-axis: Effective Area (um2); Y-axis: Effective Delay (s), the geometric mean delay as measured by HSPICE)

Appendix C

Multiplexer Implementations

Multiplexers make up a large portion of an FPGA and, therefore, their design has a

significant effect on the overall performance and area of an FPGA. This appendix ex-

plores some of the issues surrounding the design of multiplexers in order to justify

the choices made in the thesis. This complements the work in Section 5.6.2 which exam-

ined one attribute of multiplexer design: the number of levels. That previous analysis

considered the design of the entire FPGA and found that two-level multiplexers were

best. The following section revisits this issue of the number of levels in a multiplexer

and, in addition to this, the implementation choices for the multi-level multiplexers will

be further examined. For simplicity, this analysis will only consider the design and sizing

of the multiplexer while the design of the remainder of the FPGA will be treated as

constant.

C.1 Multiplexer Designs

In the earlier investigation of multiplexers the only design choice examined was that of

the number of levels in a multiplexer. That is certainly an important factor as each level

adds another pass transistor through which signals must pass. However, for any given

number of levels (except for one-level designs), there are generally a number of different

implementations possible. For example, a 16-input multiplexer could be implemented in

at least three different ways as shown in Figure C.1. These different implementations

will be described in terms of the number of configuration bits at each level of the pass

transistor tree. This makes the design in Figure C.1(b) an 8:2 implementation since the

first level has 8 bits to control the 8 pass transistors in each branch of the tree at this level.


In the second and last stage of this multiplexer there are 2 bits. Some configurations allow

for more inputs than required such as the 6:3 design shown in Figure C.1(c) and, in that

case, the additional pass transistors could simply be eliminated. However, this creates

a non-symmetric multiplexer as some inputs will then be faster than others.

In some cases this is clearly unavoidable, such as for a 13-input multiplexer, but, in

general, we will avoid these asymmetries and restrict our analysis to completely balanced

multiplexers.

C.2 Evaluation of Multiplexer Designs

We will examine a range of possible designs for both 16-input and 32-input multiplexers.

These sizes are particularly interesting because a 16-input multiplexer is within the range

of sizes typically found in the programmable routing and a 32-input multiplexer is a

typical size seen for the input multiplexers to the BLEs in large clusters. For both the 16-

input and 32-input designs, the possibilities considered ranged from a one-level (one hot)

design to four-level designs (which are fully encoded in the case of the 16-input multiplexer).

To simplify this investigation, minimum width transistors were assumed and the area of

the multiplexer was measured simply by counting the number of transistors, including

the configuration memory bits, in the design. While this is not the preferred analysis

approach, it was the most appropriate method at the time this work was performed.

This analysis still provides an indication of the minimum size of a design and its typical

performance.
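The sketch below illustrates this style of transistor counting for a tree multiplexer described by its per-level configuration bit counts. It assumes that each level's configuration bits are shared across all branches at that level and that each configuration bit costs six transistors; these are illustrative assumptions and are not necessarily the exact accounting behind the figures in this appendix.

SRAM_TRANSISTORS_PER_BIT = 6  # assumed cost of one configuration bit (illustrative)

def mux_transistor_count(bits_per_level):
    # bits_per_level[0] is the level closest to the inputs, e.g. [8, 2] for
    # the 8:2 design of Figure C.1(b); each level's configuration bits are
    # assumed to be shared across all branches at that level.
    pass_transistors = 0
    branches = 1
    for bits in reversed(bits_per_level):   # walk from the output back to the inputs
        pass_transistors += bits * branches
        branches *= bits
    config_bits = sum(bits_per_level)
    return pass_transistors + SRAM_TRANSISTORS_PER_BIT * config_bits

for design in ([16], [8, 2], [4, 4], [2, 2, 2, 2]):
    print(design, mux_transistor_count(design))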

The different 16-input designs are compared in Figure C.2. Each design is labelled

according to the number of configuration bits used in each stage as follows:

(Number of Inputs) (Bits in Level 1) (Bits in Level 2) (Bits in Level 3) (Bits in Level 4),

where Level 1 refers to the transistors that are closest to the inputs. For example, the

label “16 8 2 0 0” describes the 2-level 8:2 multiplexer shown in Figure C.1(b). The area

(in number of transistors) of the various configurations is shown in Figure C.2(a). The

fully encoded design, “16 2 2 2 2,” requires the least area as expected and the one hot

encoding requires the most area. There is also significant variability in the areas for the

different 2-level and 3-level designs.

The delay results are shown in Figure C.2(b). The reported delay is for the multiplexer

and the following buffer.

Figure C.1: Two Level 16-input Multiplexer Implementations: (a) 4:4 Implementation; (b) 8:2 Implementation; (c) 6:3 Implementation

These results indicate clearly that the most significant factor is the number of multiplexer levels and, as expected, the performance degrades with an

increasing number of levels. The performance of the 2-level designs is certainly worse

than that of the 1-level design. The difference in performance is slightly larger than was observed

in Section 5.6.2 but this is likely due to the poor sizing used for the results in this section.

In Figure C.2(c), the different multiplexer configurations are compared in terms of their area-delay product. By this metric, the 2-level "16 4 4 0 0" and the 3-level "16 4 2 2 0"

designs are very similar. The lower delay for the 2-level 4:4 design clearly makes it the

preferred choice.

Similar trends can be seen in Figure C.3 which plots the results for the 32-input

multiplexer. Figure C.3(a) summarizes the area of the different designs. The 1-level

design requires the most area by far and the remainder of the designs with a few ex-

ceptions have relatively similar area requirements. The overall trend is unchanged from

the 16-input multiplexers as increasing the number of levels typically decreases the area.

The delay results are shown in Figure C.3(b). It is notable that the 1-level design no

longer offers the best performance and, instead, the best performance is obtained with

the “32 8 4 0 0” design. As was seen with the 16-input designs, the 3-level and 4-level

designs have longer delays.

Figure C.2: Area Delay Trade-offs with Varied 16-input Multiplexer Implementations: (a) Transistor Count for Different Topologies of 16-input Multiplexer; (b) Delay of Different Topologies of 16-input Multiplexer; (c) Area Delay for Different Topologies of 16-input Multiplexer

Finally, in Figure C.3(c), the area and delay measurements for each design are combined as the area-delay product. Again, some of the 2-level and

3-level designs achieve similar results but, with its lower delay, the 2-level design is a

more useful choice.

These results for the 16-input and the 32-input multiplexers confirm the observations

made in Section 5.6.2 that 2-level designs are the most effective choice. It is also clear

from these results that the number of levels could be useful for making area and delay

trade-offs as increasing the number of levels offers area savings but that comes at the

cost of degraded performance. However, the same potential opportunity for trade-offs

does not appear to exist when changing designs for any particular fixed number of levels

because one design tended to offer both the best area and performance. Therefore, only

the number of levels in a multiplexer was explored in Section 5.6.2. (However, the results

in Section 5.6.2 found that in practice the number of levels did not enable useful trade-

offs.)

These results do indicate that the specific design for any number of levels should be selected judiciously. For example, the "32 2 16 0 0" design is both slow and requires a lot of area despite being a 2-level design.

Figure C.3: Area Delay Trade-offs with Varied 32-input Multiplexer Implementations: (a) Transistor Count for Different Topologies of 32-input Multiplexer; (b) Delay of Different Topologies of 32-input Multiplexer; (c) Area Delay for Different Topologies of 32-input Multiplexer

In this work, 2-level designs were selected based on two factors. First, the number of configuration bits was minimized. The second factor

was that amongst the designs with the same number of configuration bits, the design

that puts the larger number of pass transistors closer to the input of the multiplexer

(Level 1) was used. This intuitively makes sense as it puts the larger capacitive load on

a lower resistance path to the driver.
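A minimal Python sketch of this two-factor selection rule is given below. It simply enumerates the possible two-level splits of a given multiplexer width (allowing slightly oversized, unbalanced splits such as the 6:3 design), first minimizes the total number of configuration bits and then prefers the split with the larger group at level 1.

def select_two_level_design(width):
    # Return (level1_bits, level2_bits) for a 2-level multiplexer of the given
    # width: minimize total configuration bits, then prefer the design that
    # places the larger group of pass transistors at level 1 (nearest the inputs).
    candidates = []
    for level2 in range(2, width + 1):
        level1 = -(-width // level2)                 # ceiling division
        candidates.append((level1 + level2,          # fewer configuration bits first
                           -level1,                  # then larger level 1
                           (level1, level2)))
    return min(candidates)[2]

print(select_two_level_design(16))   # prints (4, 4)
print(select_two_level_design(32))   # prints (8, 4)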

Appendix D

Architectures and Results from

Trade-off Investigation

This appendix summarizes the architectures that were used for the design space explo-

ration in Chapter 5. The specific parameters that were varied for this exploration are

summarized in Table D.1 and the specific architectures used are listed in Table D.2. The

headings in Table D.2 refer to the abbreviations described in Table D.1. In all cases, the

intra-cluster routing was fully populated.

Table D.2 also lists effective area and delay results for three different sizings of

each architecture. The optimization objectives used to create the different sizings were

Area1Delay1 optimization, Area10Delay1 optimization and Delay optimization. For each

of these designs, the area and delay was measured using the experimental procedure out-

lined in Chapter 5. Many more sizings were used to produce the full results examined in

Chapter 5 and those results along with the full per circuit results for each FPGA design

can be found at http://www.eecg.utoronto.ca/∼jayar/pubs/theses/theses.html.


Table D.1: Parameters Considered for Design Space Exploration

Parameter                                                                    Symbol
LUT Size                                                                     K
Cluster Size                                                                 N
Routing Track Length Type 1                                                  L1
Fraction of Tracks of Length Type 1                                          F1
Routing Track Length Type 2                                                  L2
Fraction of Tracks of Length Type 2                                          F2
Input Connection Block Flexibility (as a fraction of the channel width)     Fc,in
Output Connection Block Flexibility (as a fraction of the channel width)    Fc,out
Channel Width                                                                W
Number of Inputs to Logic Block                                              I
Number of Input/Output pins per row or column of logic blocks on each side of array    I/Os per row/col

Table D.2: Architectures and Partial Results from Design Space Exploration. Each row gives the architecture parameters N, K, L1, F1, L2, F2, Fc,in, Fc,out, W, I and I/Os per row/col (abbreviations as in Table D.1), followed by the effective area and delay of the Area10Delay1-optimized, Area1Delay1-optimized and Delay-optimized sizings of that architecture.

Appendix E

Logical Architecture to Transistor

Sizing Process

This appendix reviews the main steps in translating a logical architecture into an optimized transistor-level netlist. This will be done by way of example using the baseline architecture

from Chapter 5. The logical architecture parameters for this design are listed in Table E.1.

Starting from the architecture description, the widths (or fan-ins) of the multiplexers

in the design must first be determined. For the architectures considered in this work,

there are three multiplexers whose width must be determined. The Routing Mux is

the multiplexer used within the inter-block routing. Determining this width is rather

involved due to rounding issues and the possibility of tracks with multiple different seg-

ment lengths. However, for the baseline architecture, the width can be approximately

computed from the parameters in Table E.1 as follows,

WidthRouting Mux = [ (2W/L) · Fs + (2W − 2W/L) · (Fs − 1) + Fc,output · W · N ] / (2W/L) ≈ 12. (E.1)

The width of 12 would not be obtained if the numbers from Table E.1 are substituted

into the equation due to rounding steps that are omitted in the equation for simplicity.

The next multiplexer to be considered is the CLB Input Mux which is used within

the input connection block to connect the inter-block routing into the logic block. The

width of this multiplexer is

WidthCLB Input Mux = Fc,input ·W = 0.2 · 104 ≈ 22 (E.2)


Table E.1: Architecture Parameters

Parameter                      Value
LUT Size, k                    4
Cluster Size, N                10
Number of Cluster Inputs, I    22
Tracks per Channel, W          104
Track Length, L                4
Interconnect Style             Unidirectional
Driver Style                   Single Driver
Fc,input                       0.2
Fc,output                      0.1
Fs                             3
Pads per row/column            4

where again the rounding process has been omitted for simplicity.

Finally, the width of the multiplexers that connect the intra-cluster routing to the

BLEs is determined. These multiplexers are known as BLE Input Mux and their width is determined as follows

WidthBLE Input Mux = I + N = 22 + 10 = 32. (E.3)

There is an additional multiplexer inside the BLE but, for the architectures considered,

this multiplexer, the CLB Output Mux will always have 2 inputs, one from the LUT and

one from the flip-flop.
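The following Python sketch gathers these width calculations for the baseline parameters of Table E.1. As in the text, the rounding steps of Equations E.1 and E.2 are omitted, so the computed values land near, rather than exactly on, the quoted widths of 12 and 22.

# Baseline architecture parameters from Table E.1.
W, L, Fs = 104, 4, 3
Fc_input, Fc_output = 0.2, 0.1
N, I = 10, 22

# Equation E.1: routing multiplexer width (evaluates to about 11 here because
# the rounding steps are omitted; the text quotes approximately 12).
routing_mux_width = ((2 * W / L) * Fs
                     + (2 * W - 2 * W / L) * (Fs - 1)
                     + Fc_output * W * N) / (2 * W / L)

# Equation E.2: CLB input multiplexer width (20.8 before rounding; quoted as 22).
clb_input_mux_width = Fc_input * W

# Equation E.3: BLE input multiplexer width.
ble_input_mux_width = I + N

print(routing_mux_width, clb_input_mux_width, ble_input_mux_width)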

With the widths of the multiplexers known, appropriate implementations must be

determined. A number of implementation choices were examined in both Chapter 5 and

Appendix C. The specific implementation for each multiplexer will be selected based on

the input electrical parameters.

The transistor-level implementation of the remaining components of the FPGA is

straightforward. Buffers, with level-restorers, are necessary after all the multiplexers. If

desired, buffers are also added prior to the multiplexers; however, for this example, no

such buffers will be added. The LUT is implemented as a fully encoded multiplexer.

Buffers can be added inside the pass transistor tree as needed. For this particular design,

such buffers will not be added. Once these decisions have been made, the complete

structure of the FPGA is known.

The transistor sizes within this structure must then be optimized. This is done using

the optimizer described in Chapter 4. For this analysis, sizing will be performed with

the goal of minimizing the Area-Delay product. The resulting transistor sizes are listed

in Table E.2. In Figure E.1, the meaning of the different transistor size parameters is illustrated through labels in the figure.

Figure E.1: Terminology for Transistor Sizes (a logic cluster with its BLEs, intra-cluster tracks, input connection block and routing track, labelled with the parameters MUX_CLB_INPUT, BUFFER_CLB_INPUT_POST, MUX_ROUTING, BUFFER_ROUTING_POST, MUX_LE_INPUT, BUFFER_LE_INPUT_POST, BUFFER_LUT_POST, MUX_CLB_OUTPUT and BUFFER_CLB_OUTPUT_POST)

For the buffers in the parameter list, stage0 refers

to the inverter stage within the buffer that is closest to the input. Similarly, level0 for

the multiplexers refers to the pass transistor grouping that is closest to the input. The

multiplicity of each multiplexer stage refers to the number of pass transistors within each

group of transistors at each level of the multiplexer. Equivalently, the multiplicity is also

the number of configuration memory bits needed at each level.

Once these sizes have been determined, the transistor-level design of the FPGA is

complete. The effective area and delay for this design can then be assessed using the full

experimental process described in Chapter 5.


Table E.2: Transistor Sizes for Example Architecture

Parameter                                           Value
MUX CLB INPUT num levels                            2.00
MUX CLB INPUT level0 width                          0.24
MUX CLB INPUT level0 multiplicity                   6.00
MUX CLB INPUT level1 width                          0.24
MUX CLB INPUT level1 multiplicity                   4.00
BUFFER CLB INPUT POST num stages                    2.00
BUFFER CLB INPUT POST stage0 nmos width             0.84
BUFFER CLB INPUT POST stage0 pmos width             0.42
BUFFER CLB INPUT POST stage1 nmos width             0.84
BUFFER CLB INPUT POST stage1 pmos width             1.34
BUFFER CLB INPUT POST pullup width                  0.24
BUFFER CLB INPUT POST pullup length                 0.50
MUX ROUTING num levels                              2.00
MUX ROUTING level0 width                            1.64
MUX ROUTING level0 multiplicity                     4.00
MUX ROUTING level1 width                            1.84
MUX ROUTING level1 multiplicity                     3.00
BUFFER ROUTING POST num stages                      2.00
BUFFER ROUTING POST stage0 nmos width               5.34
BUFFER ROUTING POST stage0 pmos width               2.67
BUFFER ROUTING POST stage1 nmos width               5.34
BUFFER ROUTING POST stage1 pmos width               8.01
BUFFER ROUTING POST pullup width                    0.24
BUFFER ROUTING POST pullup length                   0.50
MUX LE INPUT num levels                             2.00
MUX LE INPUT level0 width                           0.24
MUX LE INPUT level0 multiplicity                    8.00
MUX LE INPUT level1 width                           0.24
MUX LE INPUT level1 multiplicity                    4.00
BUFFER LE INPUT POST num stages                     1.00
BUFFER LE INPUT POST stage0 nmos width              0.64
BUFFER LE INPUT POST stage0 pmos width              0.32
BUFFER LE INPUT POST pullup width                   0.24
BUFFER LE INPUT POST pullup length                  0.50
LUT LUT0 stage0 width                               0.24
LUT LUT0 stage1 width                               0.34
LUT LUT0 stage2 width                               0.34
LUT LUT0 stage3 width                               0.24
LUT LUT0 stage0 buffer nmos width                   0.34
LUT LUT0 stage0 buffer pmos width                   0.24
LUT LUT0 stage pullup length                        0.40
LUT LUT0 stage pullup width                         0.24
LUT LUT0 signal buffer stage0 nmos width            0.34
LUT LUT0 signal buffer stage0 pmos width            0.41
LUT LUT0 signal buffer stage1 nmos width            0.24
LUT LUT0 signal buffer stage1 pmos width            0.34
BUFFER LUT POST num stages                          2.00
BUFFER LUT POST stage0 nmos width                   0.54
BUFFER LUT POST stage0 pmos width                   0.38
BUFFER LUT POST stage1 nmos width                   1.08
BUFFER LUT POST stage1 pmos width                   1.30
BUFFER LUT POST pullup width                        0.24
BUFFER LUT POST pullup length                       0.50
MUX CLB OUTPUT num levels                           1.00
MUX CLB OUTPUT level0 width                         2.74
MUX CLB OUTPUT level0 multiplicity                  2.00
BUFFER CLB OUTPUT POST stage0 nmos widths           3.94
BUFFER CLB OUTPUT POST stage0 pmos widths           3.15
BUFFER CLB OUTPUT POST stage1 nmos widths           3.94
BUFFER CLB OUTPUT POST stage1 pmos widths           5.12
BUFFER CLB OUTPUT POST pullup widths                0.24
BUFFER CLB OUTPUT POST pullup lengths               0.50

Bibliography

[1] David Chinnery and Kurt Keutzer. Closing the Gap Between ASIC & Custom: Tools and Techniques for High-Performance ASIC Design. Kluwer Academic Publishers, 2002.

[2] Stephen D. Brown, Robert Francis, Jonathan Rose, and Zvonko Vranesic. Field-programmable gate arrays. Kluwer Academic Publishers, 1992.

[3] Paul S. Zuchowski, Christopher B. Reynolds, Richard J. Grupp, Shelly G. Davis,Brendan Cremen, and Bill Troxel. A hybrid ASIC and FPGA architecture. InICCAD ’02, pages 187–194, November 2002.

[4] Steven J.E. Wilton, Noha Kafafi, James C. H. Wu, Kimberly A. Bozman, Vic-tor Aken’Ova, and Resve Saleh. Design considerations for soft embedded pro-grammable logic cores. IEEE Journal of Solid-State Circuits, 40(2):485–497, Febru-ary 2005.

[5] Katherine Compton and Scott Hauck. Automatic design of area-efficient config-urable ASIC cores. IEEE Transactions on Computers, 56(5):662–672, May 2007.

[6] Altera Corporation. Stratix III device handbook, Nov 2007. SIII5V1-1.4 http:

//www.altera.com/literature/hb/stx3/stratix3 handbook.pdf.

[7] Xilinx. Virtex-5 user guide, March 2008. UG190 (v4.0) http://www.xilinx.com/support/documentation/user guides/ug190.pdf.

[8] Altera Corporation. Cyclone III device handbook, Sept 2007. ver. CIII5V1-1.2http://www.altera.com/literature/hb/cyc3/cyclone3 handbook.pdf.

[9] Xilinx. Spartan-3E, November 2006. Ver. 3.4 http://direct.xilinx.com/

bvdocs/publications/ds312.pdf.

[10] Vaughn Betz and Jonathan Rose. VPR: A new packing, placement and routing tool for FPGA research. In Seventh International Workshop on Field-Programmable Logic and Applications, pages 213–222, 1997.

[11] J. Rose, R.J. Francis, D. Lewis, and P. Chow. Architecture of field-programmablegate arrays: the effect of logic block functionality on area efficiency. IEEE Journalof Solid-State Circuits, 25(5):1217–1225, 1990.


[12] Altera Corporation. Stratix device family data sheet, volume 1, S5V1-3.4, January2006. http://www.altera.com/literature/hb/stx/stratix vol 1.pdf.

[13] Xilinx. Virtex-4 family overview, June 2005. http://www.xilinx.com/bvdocs/

publications/ds112.pdf.

[14] Vaughn Betz, Jonathan Rose, and Alexander Marquardt. Architecture and CADfor Deep-Submicron FPGAs. Kluwer Academic Publishers, 1999.

[15] David Lewis, Elias Ahmed, Gregg Baeckler, Vaughn Betz, Mark Bourgeault, DavidCashman, David Galloway, Mike Hutton, Chris Lane, Andy Lee, Paul Leventis,Sandy Marquardt, Cameron McClintock, Ketan Padalia, Bruce Pedersen, GilesPowell, Boris Ratchev, Srinivas Reddy, Jay Schleicher, Kevin Stevens, RichardYuan, Richard Cliff, and Jonathan Rose. The Stratix II logic and routing archi-tecture. In FPGA ’05: Proceedings of the 2005 ACM/SIGDA 13th internationalsymposium on Field-programmable gate arrays, pages 14–20, New York, NY, USA,2005. ACM Press.

[16] Altera Corporation. Stratix II Device Handbook SII5V1-4.3, May 2007. http:

//www.altera.com/literature/hb/stx2/stratix2 handbook.pdf.

[17] Altera Corporation. Stratix IV Device Handbook Volumes 1–4 SIV5V1-1.0,May 2008. http://www.altera.com/literature/hb/stratix-iv/stratix4

handbook.pdf.

[18] Elias Ahmed and Jonathon Rose. The effect of LUT and cluster size on deep-submicron FPGA performance and density. IEEE Transactions on Very LargeScale Integration (VLSI) Systems, 12(3):288–298, March 2004.

[19] Guy Lemieux and David Lewis. Using sparse crossbars within LUT clusters. InFPGA ’01: Proceedings of the 2001 ACM/SIGDA ninth international symposiumon Field programmable gate arrays, pages 59–68, New York, NY, USA, Feb. 2001.ACM.

[20] David Lewis, Vaughn Betz, David Jefferson, Andy Lee, Chris Lane, Paul Leventis,Sandy Marquardt, Cameron McClintock, Bruce Pedersen, Giles Powell, SrinivasReddy, Chris Wysocki, Richard Cliff, and Jonathan Rose. The StratixTM rout-ing and logic architecture. In FPGA ’03: Proceedings of the 2003 ACM/SIGDAeleventh international symposium on Field programmable gate arrays, pages 12–20.ACM Press, 2003.

[21] Actel Corporation. Act 1 series FPGAs, April 1996. http://www.actel.com/

documents/ACT1 DS.pdf.

[22] Altera Corporation. APEX 20K programmable logic device family data sheet, DS-APEX20K-5.1, March 2004. http://www.altera.com/literature/ds/apex.pdf.

[23] Altera Corporation. APEX II programmable logic device family, DS-APEXII-3.0,Aug 2002. http://www.altera.com/literature/ds/ds ap2.pdf.


[24] A. Aggarwal and D. Lewis. Routing architectures for hierarchical field-programmable gate arrays. In IEEE International Conference on Computer Design,pages 475–478, Oct 1994.

[25] S. Wilton. Architectures and Algorithms for Field-Programmable Gate Arrays withEmbedded Memories. PhD thesis, University of Toronto, 1997. Department ofElectrical and Computer Engineering.

[26] G. Lemieux and D. Lewis. Analytical framework for switch block design. In Inter-national Conference on Field Programmable Logic and Applications, pages 122–131,Sept. 2002.

[27] Steven P. Young, Trevor J. Bauer, Kamal Chaudhary, and Sridhar Krishnamurthy.FPGA repeatable interconnect structure with bidirectional and unidirectional in-terconnect lines, Aug 1999. US Patent 5,942,913.

[28] Guy Lemieux, Edmund Lee, Marvin Tom, and Anthony Yu. Directional and single-driver wires in FPGA interconnect. In IEEE International Conference on Field-Programmable Technology, pages 41–48, December 2004.

[29] Ian Kuon. Automated FPGA design, verification and layout. Master’s thesis,University of Toronto, 2004.

[30] Jan M. Rabaey. Digital Integrated Circuits A Design Perspective. Prentice Hall,1996.

[31] E. Lee, G. Lemieux, and S. Mirabbasi. Interconnect driver design for long wiresin field-programmable gate arrays. In Field Programmable Technology, 2006. FPT2006. IEEE International Conference on, pages 89–96, December 2006.

[32] Edmund Lee, Guy Lemieux, and Shahriar Mirabbasi. Interconnect driver design forlong wires in field-programmable gate arrays. Journal of Signal Processing Systems,51(1):57–76, April 2008.

[33] J. H. Anderson and F. N. Najm. Low-power programmable routing circuitry forFPGAs. In IEEE/ACM International Conference on Computer Aided Design 2004,pages 602–609, Washington, DC, USA, 2004. IEEE Computer Society.

[34] Steven P. Young. Six-input multiplexer with two gate levels and three memorycells, April 1998. US Patent 5,744,995.

[35] Vaughn Betz and Jonathan Rose. Circuit design, transistor sizing and wire layoutof FPGA interconnect. In Proceedings of the 1999 IEEE Custom Integrated CircuitsConference, pages 171–174, 1999.

[36] Vikas Chandra and Herman Schmit. Simultaneous optimization of driving bufferand routing switch sizes in an FPGA using an iso-area approach. In Proceedingsof the IEEE Computer Society Annual Symposium on VLSI (ISVLSI’02), pages28–33, 2002.


[37] Michael Hutton, Vinson Chan, Peter Kazarian, Victor Maruri, Tony Ngai, JimPark, Rakesh Patel, Bruce Pedersen, Jay Schleicher, and Sergey Shumarayev. In-terconnect enhancements for a high-speed PLD architecture. In Proceedings ofthe 2002 ACM/SIGDA tenth international symposium on Field-programmable gatearrays, pages 3–10, New York, NY, USA, 2002. ACM.

[38] J. Anderson and F. Najm. A novel low-power FPGA routing switch. In Proceedings of the IEEE 2004 Custom Integrated Circuits Conference, pages 719–722, Oct 2004.

[39] Fei Li, Y. Lin, Lei He, Deming Chen, and Jason Cong. Power modeling andcharacteristics of field programmable gate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(11):1712–1724, Nov. 2005.

[40] Kara K. W. Poon, Steven J. E. Wilton, and Andy Yan. A detailed power modelfor field-programmable gate arrays. ACM Transactions on Design Automation ofElectronic Systems (TODAES), 10(2):279–302, 2005.

[41] Julien Lamoureux. On the interaction between power-aware computer-aided designalgorithms for field-programmable gate arrays. Master’s thesis, University of BritishColumbia, 2003.

[42] Julien Lamoureux and Steven J. E. Wilton. On the interaction between power-aware FPGA CAD algorithms. In ICCAD ’03: Proceedings of the 2003 IEEE/ACMinternational conference on Computer-aided design, page 701, Washington, DC,USA, 2003. IEEE Computer Society.

[43] R.K. Brayton, G.D. Hachtel, and A.L. Sangiovanni-Vincentelli. Multilevel logicsynthesis. Proceedings of the IEEE, 78(2):264–300, 1990.

[44] A. Sangiovanni-Vincentelli, A. El Gamal, and J. Rose. Synthesis methods for fieldprogrammable gate arrays. Proceedings of the IEEE, 81(7):1057–1083, 1993.

[45] Jason Cong and Yuzheng Ding. FlowMap: An optimal technology mapping algo-rithm for delay optimization in lookup-table based FPGA designs. IEEE Trans-actions on Computer-Aided Design of Integrated Circuits and Systems, 13(1):1–12,Jan 1994.

[46] J. Cong and Y. Ding. On area/depth trade-off in LUT-based FPGA technologymapping. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,2(2):137–148, 1994.

[47] Peter Jamieson. Improving the Area Efficiency of Heterogeneous FPGAs withShadow Clusters. PhD thesis, University of Toronto, 2007.

[48] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha,H. Savoj, P. R. Stephan, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. Sis:A system for sequential circuit synthesis. Technical Report UCB/ERL M92/41,University of California, Berkeley, Electronics Research Lab, Univ. of California,Berkeley, CA, 94720, May 1992.


[49] Alexander R. Marquardt. Cluster-based architecture, timing-driven packing andtiming-driven placement for FPGAs. Master’s thesis, University of Toronto, 1999.

[50] S. Kirkpatrick, C. D. Gelatt Jr, and M. P. Vecchi. Optimization by simulatedannealing. Science, 220(4598):671–680, May 1983.

[51] A.E. Dunlop and B.W. Kernighan. A procedure for placement of standard-cell VLSIcircuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits andSystems, 4(1):92–98, 1985.

[52] J.M. Kleinhans, G. Sigl, F.M. Johannes, and K.J. Antreich. GORDIAN: VLSIplacement by quadratic programming and slicing optimization. IEEE Transactionson Computer-Aided Design of Integrated Circuits and Systems, 10(3):356–365, 1991.

[53] C.J. Alpert, T.F. Chan, A.B. Kahng, I.L. Markov, and P. Mulet. Faster minimiza-tion of linear wirelength for global placement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(1):3–13, 1998.

[54] C. Sechen and A. Sangiovanni-Vincentelli. The TimberWolf placement and routingpackage. IEEE Journal of Solid-State Circuits, 20(2):510–522, 1985.

[55] C. Ebeling, L. McMurchie, S.A. Hauck, and S. Burns. Placement and routingtools for the Triptych FPGA. IEEE Transactions on Very Large Scale Integration(VLSI) Systems, 3(4):473–482, 1995.

[56] S. Brown, J. Rose, and Z.G. Vranesic. A detailed router for field-programmablegate arrays. IEEE Transactions on Computer-Aided Design of Integrated Circuitsand Systems, 11(5):620–628, 1992.

[57] Takumi Okamoto and Jason Cong. Buffered Steiner tree construction with wiresizing for interconnect layout optimization. In ICCAD ’96: Proceedings of the1996 IEEE/ACM international conference on Computer-aided design, pages 44–49, Washington, DC, USA, 1996. IEEE Computer Society.

[58] W. C. Elmore. The transient response of damped linear networks with particularregard to wideband amplifiers. Journal of Applied Physics, 19:55–63, January 1948.

[59] Jorge Rubinstein, Paul Penfield, and Mark A. Horowitz. Signal delay in RC treenetworks. IEEE Transactions on Computer-Aided Design of Integrated Circuitsand Systems, 2(3):202–211, July 1983.

[60] John K. Ousterhout. Switch-level delay models for digital MOS VLSI. In DAC’84: Proceedings of the 21st conference on Design automation, pages 542–548, Pis-cataway, NJ, USA, 1984. IEEE Press.

[61] K. D. Boese, A. B. Kahng, B. A. McCoy, and G. Robins. Fidelity and near-optimality of Elmore-based routing constructions. In Proceedings of 1993 IEEEInternational Conference on Computer Design: VLSI in Computers and ProcessorsICCD’93, pages 81–84, 1993.


[62] Jason Cong and Lei He. Optimal wiresizing for interconnects with multiplesources. ACM Transactions on Design Automation of Electronic Systems (TO-DAES), 1(4):478–511, 1996.

[63] J. P. Fishburn and A.E. Dunlop. TILOS: A posynomial programming approach totransistor sizing. In International Conference on Computer Aided Design, pages326–328, November 1985.

[64] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge Uni-versity Press, 2003.

[65] Jyuo-Min Shyu and Alberto Sangiovanni-Vincentelli. ECSTASY: A new environ-ment for IC design optimization. In International Conference on Computer AidedDesign, pages 484–487, 1988.

[66] S. S. Sapatnekar, V. B. Rao, P.M. Vaidya, and Kang Sung-Mo. An exact solu-tion to the transistor sizing problem for CMOS circuits using convex optimization.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,12(11):1621–1634, November 1993.

[67] Chung-Pin Chen, Chris C. N. Chu, and D. F. Wong. Fast and exact simultaneousgate and wire sizing by langrangian relaxation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 18(7):1014–1025, July 1999.

[68] Vijay Sundararajan, Sachin S. Sapatnekar, and Keshab K. Parhi. Fast and exacttransistor sizing based on iterative relaxation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 21(5):568–581, 2002.

[69] Kishore Kasamsetty, Mahesh Ketkar, and Sachin Sapatnekar. A new class of convexfunctions for delay modeling and its application to the transistor sizing problem.IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,19(7):779–788, July 2000.

[70] Mahesh Ketkar, Kishore Kasamsetty, and Sachin Sapatnekar. Convex delay modelsfor transistor sizing. In DAC ’00: Proceedings of the 37th Design AutomationConference, pages 655–660, New York, NY, USA, 2000. ACM Press.

[71] Hiran Tennakoon and Carl Sechen. Efficient and accurate gate sizing with piecewiseconvex delay models. In DAC ’05: Proceedings of the 42nd annual conference onDesign automation, pages 807–812, New York, NY, USA, 2005. ACM Press.

[72] Weidong Liu, Xiaodong Jin, Xuemei Xi, James Chen, Min-Chie Jeng, Zhihong Liu,Yuhua Cheng, Kai Chen, Mansun Chan, Kelvin Hui, Jianhui Huang, Robert Tu,Ping K. Ko, and Chenming Hu. BSIM3V3.3 MOSFET Model, July 2005. http:

//www-device.eecs.berkeley.edu/∼bsim3/ftpv330/Mod doc/b3v33manu.tar.

[73] Mohan V. Dunga, Wenwei (Morgan) Yang, Xuemei (Jane) Xi, Jin He, WeidongLiu, Kanyu, M. Cao, Xiaodong Jin, Jeff J. Ou, Mansun Chan, Ali M. Niknejad,


and Chenming Hu. BSIM4.6.1 MOSFET Model, May 2007. http://www-device.eecs.berkeley.edu/∼bsim3/BSIM4/BSIM461/doc/BSIM461 Manual.pdf.

[74] William Nye, David C. Riley, Alberto Sangiovanni-Vincentelli, and Andre L. Tits.DELIGHT.SPICE: An optimization-based system for the design of integrated cir-cuits. IEEE Transactions on Computer-Aided Design of Integrated Circuits andSystems, 7(4):501–519, 1988.

[75] Andrew R. Conn, Paula K. Coulman, Rudd A. Haring, Gregory L. Morrill, ChanduVisweswariah, and Chai Wah Wu. JiffyTune: circuit optimization using time-domain sensitivities. IEEE Transactions on Computer-Aided Design of IntegratedCircuits and Systems, 17(12):1292–1309, Dec 1998.

[76] A. R. Conn, I. M. Elfadel, Jr. W. W. Molzen, P. R. O’Brien, P. N. Strenski,C. Visweswariah, and C. B. Whan. Gradient-based optimization of custom circuitsusing a static-timing formulation. In DAC ’99: Proceedings of the 36th ACM/IEEEconference on Design automation, pages 452–459, New York, NY, USA, 1999. ACMPress.

[77] Xiaoliang Bai, Chandu Visweswariah, and Philip N. Strenski. Uncertainty-awarecircuit optimization. In DAC ’02: Proceedings of the 39th conference on Designautomation, pages 58–63, New York, NY, USA, 2002. ACM Press.

[78] H. S. Jones Jr., P. R. Nagle, and H. T. Nguyen. A comparison of standard celland gate array implementions in a common CAD system. In IEEE 1986 CustomIntegrated Circuits Conference, pages 228–232, 1986.

[79] William J. Dally and Andrew Chang. The role of custom design in ASIC chips. InDAC ’00: Proceedings of the 37th Design Automation Conference, pages 643–647.ACM Press, 2000.

[80] Andrew Chang and William J. Dally. Explaining the gap between ASIC and cus-tom power: a custom perspective. In DAC ’05: Proceedings of the 42nd annualconference on Design automation, pages 281–284, New York, NY, USA, 2005. ACMPress.

[81] David G. Chinnery and Kurt Keutzer. Closing the power gap between ASIC andcustom: an ASIC perspective. In DAC ’05: Proceedings of the 42nd annual con-ference on Design automation, pages 275–280, New York, NY, USA, 2005. ACMPress.

[82] Michael John Sebastian Smith. Application-Specific Integrated Circuits. Addison-Wesley, 1997.

[83] NEC Electronics. ISSP (Structured ASIC), 2005. http://www.necel.com/issp/

english/.

[84] Altera Corporation. HardCopy ASICs: Technology for business, 2008. http:

//www.altera.com/products/devices/hardcopy/hrd-index.html.


[85] Richard Cliff. Altera Corporation. Private Communication.

[86] Altera Corporation. Partnership with TSMC yields first silicon successon Altera’s 90-nm, low-k products, June 2004. http://www.altera.

com/corporate/news room/releases/releases archive/2004/products/

nr-tsmc partnership.html.

[87] STMicroelectronics. 90nm CMOS090 Design Platform, 2005. http://www.st.

com/stonline/products/technologies/soc/90plat.htm.

[88] Altera Corporation. Altera demonstrates 90-nm leadership by ship-ping world’s highest-density, highest-performance FPGA, January 2005.http://www.altera.com/corporate/news room/releases/releases archive/

2005/products/nr-ep2s180 shipping.html.

[89] C.C. Wu, Y.K. Leung, C.S. Chang, M.H. Tsai, H.T. Huang, D.W. Lin, Y.M. Sheu,C.H. Hsieh, W.J. Liang, L.K. Han, et al. A 90-nm cmos device technology with high-speed, general-purpose, and low-leakage transistors for system on chip applications.In Electron Devices Meeting, 2002. IEDM’02. Digest. International, pages 65–68,2002.

[90] P. Roche and G. Gasiot. Impacts of front-end and middle-end process modifica-tions on terrestrial soft error rate. IEEE Transactions on Device and MaterialsReliability, 5(3):382–396, 2005.

[91] STMicroelectronics. MOTOROLA, PHILIPS and STMicroelectronics IntroducesDebut Industry’s First 90-NANOMETER CMOS Design Platform, August 2002.http://www.st.com/stonline/press/news/year2002/t1222h.htm.

[92] Taiwan Semiconductor Manufacturing Company Ltd. TSMC 90nm technology plat-form, April 2005. http://www.tsmc.com/download/english/a05 literature/

90nm Brochure.pdf.

[93] Dick James. 2004 – the year of 90-nm: A review of 90 nm devices. In 2005IEEE/SEMI Advanced Semiconductor Manufacturing Conference and Workshop,pages 72–76, 2005.

[94] Emanuele Capitanio, Matteo Nobile, and Didier Renard. Removing aluminum capin 90 nm copper technology, 2006. www.imec.be/efug/EFUG2006 Renard.pdf.

[95] Navid Azizi, Ian Kuon, Aaron Egier, Ahmad Darabiha, and Paul Chow. Recon-figurable molecular dynamics simulator. In FCCM ’04: Proceedings of the 12thAnnual IEEE Symposium on Field-Programmable Custom Computing Machines,pages 197–206, Washington, DC, USA, 2004. IEEE Computer Society.

[96] J. Fender and J. Rose. A high-speed ray tracing engine built on a field-programmable system. In Field-Programmable Technology (FPT), 2003. Proceed-ings. 2003 IEEE International Conference on, pages 188–195, 2003.


[97] A. Darabiha, J. Rose, and J.W. Maclean. Video-rate stereo depth measurementon programmable hardware. In Computer Vision and Pattern Recognition, 2003.Proceedings. 2003 IEEE Computer Society Conference on, volume 1, 2003.

[98] Peter Yiannacouras, J. Gregory Steffan, and Jonathan Rose. Application-specificcustomization of soft processor microarchitecture. In FPGA ’06: Proceedings ofthe 2006 ACM/SIGDA 14th international symposium on Field Programmable GateArrays, pages 201–210, New York, NY, USA, 2006. ACM.

[99] Peter Yiannacouras. The microarchitecture of FPGA-based soft processors. Mas-ter’s thesis, University of Toronto, 2005.

[100] P. Chow, D. Karchmer, R. White, T. Ngai, P. Hodgins, D. Yeh, J. Ranaweera, I. Widjaja, and A. Leon-Garcia. A 50,000 transistor packet-switching chip for the Starburst ATM switch. In Custom Integrated Circuits Conference, 1995., Proceedings of the IEEE 1995, pages 435–438, 1995.

[101] Altera Corporation. Quartus II Development Software Handbook, 5.0 edition, May2005.

[102] Synopsys. Design Compiler Reference Manual: Constraints and Timing, versionv-2004.06 edition, June 2004.

[103] Synopsys. Design Compiler User Guide, version v-2004.06 edition, June 2004.

[104] Neil H. E. Weste and David Harris. CMOS VLSI Design A Circuits and SystemsPerspective. Pearson Addison-Wesley, 2005.

[105] Cadence. Encounter Design Flow Guide and Tutorial, Product Version 3.3.1,February 2004.

[106] Dan Clein. CMOS IC layout : concepts, methodologies and tools. Elsevier Inc,2000.

[107] Xiaojian Yang, Bo-Kyung Choi, and M. Sarrafzadeh. Routability-driven whitespace allocation for fixed-die standard-cell placement. IEEE Transactions onComputer-Aided Design of Integrated Circuits and Systems, 22(4):410–419, April2003.

[108] Wenjie Jiang, Vivek Tiwari, Erik de la Iglesia, and Amit Sinha. Topological analysisfor leakage prediction of digital circuits. In ASP-DAC ’02: Proceedings of the2002 conference on Asia South Pacific design automation/VLSI Design, page 39,Washington, DC, USA, 2002. IEEE Computer Society.

[109] Paul Leventis, Mark Chan, Michael Chan, David Lewis, Behzad Nouban, Giles Powell, Brad Vest, Myron Wong, Renxin Xia, and John Costello. Cyclone: A low-cost, high-performance FPGA. In Proceedings of the IEEE 2003 Custom Integrated Circuits Conference, pages 49–52, 2003.


[110] Jordan S. Swartz, Vaughn Betz, and Jonathan Rose. A fast routability-drivenrouter for FPGAs. In FPGA ’98: Proceedings of the 1998 ACM/SIGDA sixthinternational symposium on Field programmable gate arrays, pages 140–149, NewYork, NY, USA, 1998. ACM.

[111] Jordan S. Swartz. A high-speed timing-aware router for FPGAs. Master's the-sis, University of Toronto, 1998. http://www.eecg.toronto.edu/∼jayar/pubs/

theses/Swartz/JordanSwartz.pdf.

[112] Wei Mark Fang and Jonathan Rose. Modeling routing demand for early-stage fpgaarchitecture development. In FPGA ’08: Proceedings of the 16th internationalACM/SIGDA symposium on Field programmable gate arrays, pages 139–148, NewYork, NY, USA, 2008. ACM.

[113] Wei Mark Fang. Modeling routing demand for early-stage FPGA architecturedevelopment. Master’s thesis, University of Toronto, 2008.

[114] C.H. Ho, P.H.W. Leong, W. Luk, S.J.E. Wilton, and S. Lopez-Buedo. Virtualembedded blocks: A methodology for evaluating embedded elements in FPGAs.In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines(FCCM), pages 45–44, 2006.

[115] Altera Corporation. Stratix II vs. Virtex-4 power comparison & estimation ac-curacy. White Paper, August 2005. http://www.altera.com/literature/wp/

wp s2v4 pwr acc.pdf.

[116] Li Shang, Alireza S. Kaviani, and Kusuma Bathala. Dynamic power consumptionin VirtexTM-II FPGA family. In Proceedings of the 2002 ACM/SIGDA tenth inter-national symposium on Field-programmable gate arrays, pages 157–164, New York,NY, USA, 2002. ACM Press.

[117] Vivek De and Shekhar Borkar. Technology and design challenges for low powerand high performance. In Proceedings of the 1999 international symposium on Lowpower electronics and design (ISLPED ’99), pages 163–168, New York, NY, USA,1999. ACM Press.

[118] Erich Goetting. Introducing the new Virtex-4 FPGA family. Xcell Journal, FirstQuarter 2006. http://www.xilinx.com/publications/xcellonline/xcell 52/

xc pdf/xc v4topview52.pdf.

[119] L. Cheng, F. Li, Y. Lin, P. Wong, and L. He. Device and Architecture Cooptimiza-tion for FPGA Power Reduction. IEEE Transactions on Computer-Aided Designof Integrated Circuits and Systems, 26(7):1211–1221, July 2007.

[120] Ian Kuon, Aaron Egier, and Jonathan Rose. Design, layout and verification of anFPGA using automated tools. In FPGA ’05: Proceedings of the 2005 ACM/SIGDA13th international symposium on Field-programmable gate arrays, pages 215–226,New York, NY, USA, 2005. ACM Press.


[121] Ketan Padalia, Ryan Fung, Mark Bourgeault, Aaron Egier, and Jonathan Rose.Automatic transistor and physical design of FPGA tiles from an architectural spec-ification. In FPGA ’03: Proceedings of the 2003 ACM/SIGDA eleventh interna-tional symposium on Field programmable gate arrays, pages 164–172, New York,NY, USA, 2003. ACM Press.

[122] Victor Aken’Ova, Guy Lemieux, and Resve Saleh. An improved ”soft” eFPGAdesign and implementation strategy. In Proceedings of the IEEE 2005 CustomIntegrated Circuits Conference, pages 178–181, September 2005.

[123] Shawn Phillips and Scott Hauck. Automatic layout of domain-specific reconfig-urable subsystems for system-on-a-chip. In Proceedings of the 2002 ACM/SIGDAtenth international symposium on Field-programmable gate arrays, pages 165–173.ACM Press, 2002.

[124] The MOSIS Service. MOSIS scalable CMOS (SCMOS) revision 8.00, October 2004.http://www.mosis.com/Technical/Designrules/scmos/scmos-main.html.

[125] Ivan Sutherland, Robert Sproule, and David Harris. Logical Effort : Designing fastCMOS circuits. Morgan Kaufmann Publishers, 1999.

[126] V. Eisele, B. Hoppe, and O. Kiehl. Transmission gate delay models for circuit optimization. In Proceedings of the European Design Automation Conference (EDAC), pages 558–562, 1990.

[127] Synopsys. HSPICE. http://www.synopsys.com/products/mixedsignal/hspice/hspice.html.

[128] Synopsys. HSIM. http://www.synopsys.com/products/mixedsignal/hsim/hsim.html.

[129] Synopsys. NanoSim. http://www.synopsys.com/products/mixedsignal/nanosim/nanosim.html.

[130] Elias Ahmed and Jonathan Rose. The effect of LUT and cluster size on deep-submicron FPGA performance and density. In FPGA ’00: Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field programmable gate arrays, pages 3–12, New York, NY, USA, 2000. ACM.

[131] Taiwan Semiconductor Manufacturing Company Ltd. TSMC 0.18 and 0.15-micron technology platform, April 2005. http://www.tsmc.com/download/english/a05_literature/0.15-0.18-micron_Brochure.pdf.

[132] Saeyang Yang. Logic synthesis and optimization benchmarks user guide version 3.0. Technical report, Microelectronics Center of North Carolina, Jan 1991.

[133] Taiwan Semiconductor Manufacturing Company Ltd. TSMC 0.35-micron technology platform, April 2005. http://www.tsmc.com/download/english/a05_literature/0.35-micron_Brochure.pdf.


[134] Elias Ahmed. The effect of logic block granularity on deep-submicron FPGA performance and density. Master’s thesis, University of Toronto, 2001. http://www.eecg.toronto.edu/~jayar/pubs/theses/Ahmed/EliasAhmed.pdf.

[135] K. Compton, A. Sharma, S. Phillips, and S. Hauck. Flexible routing architecture generation for domain-specific reconfigurable subsystems. In International Conference on Field Programmable Logic and Applications, pages 59–68, 2002.

[136] Lattice Semiconductor Corporation. LatticeSC/M Family Handbook, Version 02.1, DS1004, June 2008. http://www.latticesemi.com/dynamic/view_document.cfm?document_id=19028.

[137] Lattice Semiconductor Corporation. LatticeECP2/M Family Handbook, Version 02.9, HB1003, July 2007. http://www.latticesemi.com/dynamic/view_document.cfm?document_id=21733.

[138] Ken McElvain. LGSynth93 benchmark set: Version 4.0, May 1993. Formerly available at mcnc.org.

[139] Jason Cong, John Peck, and Yuzheng Ding. RASP: a general logic synthesis system for SRAM-based FPGAs. In FPGA ’96: Proceedings of the 1996 ACM fourth international symposium on Field-programmable gate arrays, pages 137–143, New York, NY, USA, 1996. ACM Press.

[140] A. Marquardt, V. Betz, and J. Rose. Using cluster-based logic blocks and timing-driven packing to improve FPGA speed and density. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 37–46, 1999.

[141] Jean E. Weber. Mathematical Analysis: Business and Economic Applications. Harper & Row, 3rd edition, 1976.

[142] Trevor Bauer. Xilinx. Private Communication.

[143] Altera Corporation. Cyclone II device handbook, Feb 2008. ver. CII5V1-3.3. http://www.altera.com/literature/hb/cyc2/cyc2_cii5v1.pdf.

[144] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In FPGA ’06: Proceedings of the 2006 ACM/SIGDA 14th international symposium on Field Programmable Gate Arrays, pages 21–30, New York, NY, USA, 2006. ACM Press.

[145] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203–215, Feb 2007.

[146] Ian Kuon and Jonathan Rose. Automated transistor sizing for FPGA architecture exploration. In DAC ’08: Proceedings of the 45th annual conference on Design automation, pages 792–795, New York, NY, USA, 2008. ACM.


[147] Ian Kuon and Jonathan Rose. Area and delay trade-offs in the circuit and architecture design of FPGAs. In FPGA ’08: Proceedings of the 16th international ACM/SIGDA symposium on Field programmable gate arrays, pages 149–158, New York, NY, USA, 2008. ACM.

[148] Altera Corporation. Stratix IV FPGA power management and advantages, WP-01057-1.0, May 2008. http://www.altera.com/literature/wp/wp-01059-stratix-iv-40nm-power-management.pdf.

[149] G. Nabaa, N. Azizi, and F.N. Najm. An adaptive FPGA architecture with process variation compensation and reduced leakage. In Proceedings of the 43rd annual conference on Design automation, pages 624–629, New York, NY, USA, 2006. ACM Press.

[150] Arifur Rahman and Vijay Polavarapuv. Evaluation of low-leakage design techniques for field programmable gate arrays. In FPGA ’04: Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field programmable gate arrays, pages 23–30, New York, NY, USA, 2004. ACM.

[151] Actel Corporation. ProASIC3 flash family FPGAs, Feb 2008. http://www.actel.com/documents/PA3_DS.pdf.

[152] Actel Corporation. SX-A Family FPGAs v5.3, Feb 2007. http://www.actel.com/documents/SXA_DS.pdf.

[153] Actel Corporation. Axcelerator family FPGAs, May 2005. http://www.actel.com/documents/AX_DS.pdf.

[154] Kilopass Technology Inc. Kilopass XPM embedded non-volatile memory solutions, 2007. http://www.kilopass.com/public/Killopass_Bro_CR1-01(Web).pdf.

[155] Sidense Corp. Sidense the future of logic NVM, 2008. http://www.sidense.com/index.php?option=com_content&task=view&id=130&Itemid=30.

[156] R.M. Warner. Applying a composite model to the IC yield problem. IEEE Journal of Solid-State Circuits, 9(3):86–95, 1974.

[157] Altera Corporation. The Industry’s Biggest FPGAs, 2005. http://www.altera.com/products/devices/stratix2/features/density/st2-density.html.

[158] Cameron McClintock, Andy L. Lee, and Richard G. Cliff. Redundancy circuitry for logic circuits, Mar 2000. US Patent 6,034,536.

[159] Michael Chan, Paul Leventis, David Lewis, Ketan Zaveri, Hyun Mo Yi, and Chris Lane. Redundancy structures and methods in a programmable logic device, Feb 2007. US Patent 7,180,324.

[160] P. Jamieson and J. Rose. Enhancing the area-efficiency of FPGAs with hard circuits using shadow clusters. In IEEE International Conference on Field-Programmable Technology, pages 1–8, 2006.


[161] Fei Li, Yan Lin, and Lei He. Vdd programmability to reduce FPGA interconnect power. In IEEE/ACM International Conference on Computer Aided Design, 2004.

[162] Fei Li, Yan Lin, and Lei He. FPGA power reduction using configurable dual-Vdd. In DAC ’04: Proceedings of the 41st annual conference on Design automation, pages 735–740, New York, NY, USA, 2004. ACM.

[163] A. Gayasen, K. Lee, N. Vijaykrishnan, M. Kandemir, M. J. Irwin, and T. Tuan. A dual-Vdd low power FPGA architecture. In Proceedings of the International Conference on Field-Programmable Logic and Applications, pages 145–157, August 2004.

[164] A. Marquardt, V. Betz, and J. Rose. Speed and area tradeoffs in cluster-based FPGA architectures. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 8(1):84–93, 2000.