The magazine for chip and silicon systems designers
Reflections from Uri Weiser, p. 126
www.computer.org/micro
Top Picks from the 2016 Computer Architecture Conferences
IEEE MICRO
May/June 2017 Volume 37, Number 3
IEEE Micro (ISSN 0272-1732) is published bimonthly by the IEEE Computer Society. IEEE Headquarters, Three Park Ave., 17th Floor, New York, NY 10016-5997; IEEE Computer Society Headquarters, 2001 L St., Ste. 700, Washington, DC 20036; IEEE Computer Society Publications Office, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720. Subscribe to IEEE Micro by visiting www.computer.org/micro.

Postmaster: Send address changes and undelivered copies to IEEE, Membership Processing Dept., 445 Hoes Ln., Piscataway, NJ 08855. Periodicals postage is paid at New York, NY, and at additional mailing offices. Canadian GST #125634188. Canada Post Corp. (Canadian distribution) Publications Mail Agreement #40013885. Return undeliverable Canadian addresses to 4960-2 Walker Road; Windsor, ON N9A 6J3. Printed in USA.

Reuse rights and reprint permissions: Educational or personal use of this material is permitted without fee, provided such use: 1) is not made for profit; 2) includes this notice and a full citation to the original work on the first page of the copy; and 3) does not imply IEEE endorsement of any third-party products or services. Authors and their companies are permitted to post the accepted version of IEEE-copyrighted material on their own webservers without permission, provided that the IEEE copyright notice and a full citation to the original work appear on the first screen of the posted copy. An accepted manuscript is a version which has been revised by the author to incorporate review suggestions, but not the published version with copy-editing, proofreading, and formatting added by IEEE.
For more information, please go to www.ieee.org/publications_standards/publications/rights/paperversionpolicy.html. Permission to reprint/republish this material for commercial, advertising, or promotional purposes or for creating new collective works for resale or redistribution must be obtained from IEEE by writing to the IEEE Intellectual Property Rights Office, 445 Hoes Lane, Piscataway, NJ 08854-4141 or [email protected]. Copyright 2017 by IEEE. All rights reserved.

Abstracting and library use: Abstracting is permitted with credit to the source. Libraries are permitted to photocopy for private use of patrons, provided the per-copy fee indicated in the code at the bottom of the first page is paid through the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923.

Editorial: Unless otherwise stated, bylined articles, as well as product and service descriptions, reflect the author’s or firm’s opinion. Inclusion in IEEE Micro does not necessarily constitute an endorsement by IEEE or the Computer Society. All submissions are subject to editing for style, clarity, and space. IEEE prohibits discrimination, harassment, and bullying. For more information, visit www.ieee.org/web/aboutus/whatis/policies/p9-26.html.
Features
6 Guest Editors’ Introduction: Top Picks from the 2016 Computer Architecture Conferences
Aamer Jaleel and Moinuddin Qureshi
12 Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators
Yu-Hsin Chen, Joel Emer, and Vivienne Sze
22 The Memristive Boltzmann Machines
Mahdi Nazm Bojnordi and Engin Ipek
30 Analog Computing in a Modern Context: A Linear Algebra Accelerator Case Study
Yipeng Huang, Ning Guo, Mingoo Seok, Yannis Tsividis, and Simha Sethumadhavan
40 Domain Specialization Is Generally Unnecessary for Accelerators
Tony Nowatzki, Vinay Gangadhar, Karthikeyan Sankaralingam, and Greg Wright
52 Configurable Clouds
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger
62 Specializing a Planet’s Computation: ASIC Clouds
Moein Khazraee, Luis Vega Gutierrez, Ikuo Magaki, and Michael Bedford Taylor
70 DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric
Mingyu Gao, Christina Delimitrou, Dimin Niu, Krishna T. Malladi, Hongzhong Zheng, Bob Brennan, and Christos Kozyrakis
80 Agile Paging for Efficient Memory Virtualization
Jayneel Gandhi, Mark D. Hill, and Michael M. Swift
88 Transistency Models: Memory Ordering at the Hardware–OS Interface
Daniel Lustig, Geet Sethi, Abhishek Bhattacharjee, and Margaret Martonosi
98 Toward a DNA-Based Archival Storage System
James Bornholt, Randolph Lopez, Douglas M. Carmean, Luis Ceze, Georg Seelig, and Karin Strauss
106 Ti-states: Power Management in Active Timing Margin Processors
Yazhou Zu, Wei Huang, Indrani Paul, and Vijay Janapa Reddi
116 An Energy-Aware Debugger for Intermittently Powered Systems
Alexei Colin, Graham Harvey, Alanson P. Sample, and Brandon Lucia
Departments
4 From the Editor in Chief: Thoughts on the Top Picks Selections
Lieven Eeckhout
126 Awards: Insights from the 2016 Eckert-Mauchly Award Recipient
Uri Weiser
130 Micro Economics: Two Sides to Scale
Shane Greenstein
Computer Society Information, p. 3
Advertising/Product Index, p. 61
Cover art: Oliver Burston, Debut Art
MAY/JUNE 2017 3
EDITOR IN CHIEF
Lieven Eeckhout, Ghent University, [email protected]
ADVISORY BOARD
David H. Albonesi, Erik R. Altman, Pradip Bose, Kemal Ebcioglu, Michael Flynn, Ruby B. Lee, Yale Patt, James E. Smith, and Marc Tremblay
EDITORIAL BOARD
David Brooks, Harvard University
Alper Buyuktosunoglu, IBM
Bronis de Supinski, Lawrence Livermore National Laboratory
Natalie Enright Jerger, University of Toronto
Babak Falsafi, EPFL
Shane Greenstein, Northwestern University
Lizy Kurian John, University of Texas at Austin
Hyesoon Kim, Georgia Tech
John Kim, KAIST
Hsien-Hsin (Sean) Lee, Taiwan Semiconductor Manufacturing Company
Richard Mateosian
Trevor Mudge, University of Michigan, Ann Arbor
Shubu Mukherjee, Cavium Networks
Onur Mutlu, ETH Zurich
Toshio Nakatani, IBM Research
Vojin G. Oklobdzija, University of California, Davis
Ronny Ronen, Intel
Kevin W. Rudd, Laboratory for Physical Sciences
Andre Seznec, INRIA
Per Stenstrom, Chalmers University of Technology
Richard H. Stern, George Washington University Law School
Lixin Zhang, Chinese Academy of Sciences
EDITORIAL STAFF
Editorial Product Lead
Cathy Martin
[email protected]
Management
Molly Gamborg
Publications Coordinator
[email protected]
Director, Products & Services
Evan Butterfield
Senior Manager, Editorial Services
Robin Baldwin
Manager, Editorial Services
Brian Brannon
Manager, Peer Review & PeriodicalAdministration
Hilda Carman
Digital Library Marketing Manager
Georgann Carter
Senior Business Development Manager
Sandra Brown
Director of Membership
Eric Berkowitz
Digital Marketing Manager
Marian Anderson
[email protected]
EDITORIAL OFFICE
PO Box 3014, Los Alamitos, CA 90720;
(714) 821-8380; [email protected]
Submissions:
https://mc.manuscriptcentral.com/micro-cs
Author guidelines:
http://www.computer.org/micro
IEEE CS PUBLICATIONS BOARD
Greg Byrd (VP for Publications), Alfredo Benso, Irena Bojanova, Robert Dupuis, David S. Ebert, Davide Falessi, Vladimir Getov, Jose Martinez, Forrest Shull, and George K. Thiruvathukal
IEEE CS MAGAZINE OPERATIONS COMMITTEE
George K. Thiruvathukal (Chair), Gul Agha, M. Brian Blake, Jim X. Chen, Maria Ebling, Lieven Eeckhout, Miguel Encarnacao, Nathan Ensmenger, Sumi Helal, San Murugesan, Yong Rui, Ahmad-Reza Sadeghi, Diomidis Spinellis, VS Subrahmanian, and Mazin Yousif
Thoughts on the Top Picks Selections
LIEVEN EECKHOUT, Ghent University
The May/June issue of IEEE Micro traditionally features a selection of articles called Top Picks that have the potential to influence the work of computer architects for the near future. A selection committee of experts selects these articles from the previous year’s computer architecture conferences; the selection criteria are novelty and potential for long-term impact. Any paper published in the top computer architecture conferences of 2016 was eligible, which makes the job of the selection committee both a challenge and a pleasure. Selections are based on the original conference paper and a three-page write-up that summarizes the paper’s key contributions and potential impact. We received a record number of 113 submissions this year. Aamer Jaleel and Moinuddin Qureshi chaired the selection committee, which comprised 33 experts. I wholeheartedly thank them and their committee for having done such a great job. As they note in the Guest Editors’ Introduction, Aamer and Moin introduced a novel two-phase review procedure. Four committee members reviewed each paper during the first round. A subset of the papers was selected to move to the second round based on the reviewers’ scores and online discussion of the first round. Six more committee members reviewed each paper during the second round; second-round papers thus received a total of 10 reviews! This formed the basic input for the in-person selection committee meeting.
The selection committee reached a consensus on 12 Top Picks and 12 Honorable Mentions. Top Pick selections were invited to prepare an article to be included in this special issue. Because these magazine articles are much shorter than the original conference papers, they tend to be more high-level and more qualitative than the original conference publications, providing an excellent introduction to these highly innovative contributions. The Honorable Mentions are top papers that the selection committee unfortunately could not recognize as Top Picks because of magazine space constraints; these are acknowledged in the Guest Editors’ Introduction. I encourage you to read these important contributions to our field and share your thoughts with students and colleagues.
Having participated in the selection committee myself, I was deeply impressed by the effectiveness of the new review process. In particular, I found it interesting to observe that the committee reached a consensus that very closely aligned with the ranking obtained by the 10 reviews for each of the second-round papers. This makes me wonder whether we still need an in-person selection committee meeting. Of course, the meeting itself has great value in terms of generating interesting discussions and providing the opportunity to meet colleagues from our community, but it undeniably also imposes a big cost in terms of time, effort, money, and carbon footprint (with many committee members flying in and out from all over the world).
Glancing over the set of papers selected for Top Picks and Honorable Mentions, one important trend has emerged just recently: the focus on accelerators and hardware specialization. A good number of papers are related to hardware acceleration in the broad sense. This does not come as a surprise given current application trends, along with the end of Dennard scaling, which pushes architects to improve system performance within stringent power and cost envelopes through hardware acceleration. We observe this trend throughout the entire computing landscape, from mobile devices to large-scale datacenters. There is a lot of exciting research and advanced development going on in this area by many research groups in industry and academia, and I expect many more important advances in the near future. Next to this emerging trend, there is (still) a good fraction of outstanding papers in more traditional areas, including microarchitecture, memory hierarchy, memory consistency, multicore, power management, security, and simulation methodology.
I want to share a couple more thoughts with you regarding the Top Picks procedure that arose from conversations I’ve had with various people in our community. I’d love to get the broader community’s feedback on this, so please don’t hesitate to contact me and share your thoughts.
Published by the IEEE Computer Society. 0272-1732/17/$33.00 © 2017 IEEE
One thought relates to the number of selected Top Picks being too restrictive. There is a hard cap of only 12 Top Picks. On one hand, we want the process to be selective and Top Picks recognition to be prestigious. On the other hand, our community is growing. Our top-tier conferences, such as ISCA, MICRO, HPCA, and ASPLOS, receive an ever-increasing number of papers to review, and the number of accepted papers is increasing as well. One could argue that in response we need to recognize more papers as Top Picks. The hard constraint that we are hitting here is the page limit we have for the magazine, because the number of pages is related to the production cost. One solution may be to have more Top Picks selections but fewer pages allocated per selected article, but this may compromise the comprehensiveness of the articles. Another solution may be to recognize more Honorable Mentions, because they don’t affect the page count. Or, we may want to electronically publish the three-page Top Picks submissions (paper summary and potential impact, as mentioned earlier) as they are, if the authors agree. This would not incur any production cost at all, yet the community would benefit from reading them. Yet another solution may be to select more than 12 Top Picks and publish them in different issues of the magazine. The counterargument here is that we have only six issues per year, which makes it difficult to argue for more than one issue devoted to Top Picks.
Another issue relates to the timing of the Top Picks selection. Our community has relatively few awards, and Top Picks is an important vehicle in our community to recognize top-quality research. However, one may argue whether selecting Top Picks one year after publication is too soon; it might make sense to wait a couple more years before recognizing the best research contributions of the year. We may not want to wait as long as ISCA’s Influential Paper Award (15 years after publication) and MICRO’s Test of Time Award (18 to 22 years after publication), but still, one could argue for waiting a few more years before understanding the true value of a novel research contribution and how it impacts our field. An important argument in this discussion is that awards are generally more important to young researchers than they are to senior researchers. Young researchers looking for a faculty or research position in a leading academic institute or industry lab need recognition fairly early in their careers, as they compete with researchers from other fields that have more awards. Senior researchers, on the other hand, do not need the recognition as much, or at least their time scale is (much) longer.
Please let me know your thoughts on these ideas or any other concerns you may have. I’m open to any suggestions. My only concern is to make sure Top Picks continues to recognize the best research in our field while serving the best interests of both the community and IEEE Micro.
Before wrapping up, I want to highlight that this issue also includes an award testimonial. Uri Weiser received the 2016 Eckert-Mauchly Award for his seminal contributions to the field of computer architecture over the course of his 40-year career in industry and academia. Uri Weiser single-handedly convinced Intel executives to continue designing CISC-based x86 processors by showing that, through adding new features such as superscalar execution, branch prediction, and split instruction and data caches, the x86 processors could be made competitive against the RISC family of processors initiated by IBM and Berkeley. This laid the foundation for the Intel Pentium processor. Uri Weiser made several other seminal contributions, including the design of instruction-set extensions (that is, Intel’s MMX) for supporting multimedia applications. The Eckert-Mauchly Award is considered the computer architecture community’s most prestigious award. I wholeheartedly congratulate Uri Weiser on the award and thank him for his insightful testimonial.
With that, I wish you happy reading, as always!
Lieven Eeckhout
Editor in Chief
IEEE Micro
Lieven Eeckhout is a professor in the Department of Electronics and Information Systems at Ghent University. Contact him at [email protected].
Guest Editors’ Introduction
TOP PICKS FROM THE 2016 COMPUTER ARCHITECTURE CONFERENCES

It is our pleasure to introduce this year’s Top Picks in Computer Architecture. This issue is the culmination of the hard work of the selection committee, which chose from 113 submissions that were published in computer architecture conferences in 2016. We followed the precedent set by last year’s co-chairs and encouraged the selection committee members to consider characteristics that make a paper worthy of being a “top pick.” Specifically, we asked them to consider whether a paper challenges conventional wisdom, establishes a new area of research, is the definitive “last word” in an established research area, has a high potential for industry impact, and/or is one they would recommend to others to read.
Since the number of papers that could be selected for this Top Picks special issue was limited to 12, we continued the precedent set over the past two years of having the selection committee recognize 12 additional high-quality papers for Honorable Mention. We strongly encourage you to read these papers (see the “Honorable Mentions” sidebar). Before we present the list of articles appearing in this special issue, we will first describe the new review process that we implemented to improve the paper selection process.
Review Process

A selection committee comprising 31 members reviewed all 113 papers (see the “Selection Committee” sidebar). This year, we tried a different selection process compared to previous years’ Top Picks, keeping in mind the constraints and objectives that are unique to Top Picks. The conventional approach to Top Picks selection has largely remained similar to that used in our conferences (for example, four to five reviews per paper and a four-to-six-point grading scale). For Top Picks, the number of papers that can be accepted is fixed (11 to 12), and the selection committee’s primary job is to identify the top 12 papers out of all the submitted papers, instead of providing a detailed critique of the technical work and how the paper can be improved. The papers submitted to Top Picks tend to be of much higher (average) quality than the typical paper submitted at our conferences, and in many cases the reviewers are already aware of the work (through prior reviewing, reading the papers, or attending the presentations). Therefore, the time and effort spent reviewing Top Picks papers tends to be less than that spent reviewing typical conference submissions.
We identified two key areas in which the Top Picks selection process could be improved. First, a small number of reviewers (approximately five) made the decisions for Top Picks. The confidence in selection could be improved significantly by having a larger number of reviews (approximately 10) per paper, especially for the papers that are likely to be discussed at the selection committee meeting. This also ensures that reviewers are more engaged at the meeting and make informed decisions. Second, the selection of Top Picks gets overly influenced by excessively
Aamer Jaleel, Nvidia
Moinuddin Qureshi, Georgia Tech
Honorable Mentions

“Exploiting Semantic Commutativity in Hardware Speculation” by Guowei Zhang, Virginia Chiu, and Daniel Sanchez (MICRO 2016). This paper introduces architectural support to exploit a broad class of commutative updates, enabling update-heavy applications to scale to thousands of cores.

“The Computational Sprinting Game” by Songchun Fan, Seyed Majid Zahedi, and Benjamin C. Lee (ASPLOS 2016). Computational sprinting is a mechanism that supplies extra power for short durations to enhance performance. This paper introduces game theory for allocating shared power between multiple cores.

“PoisonIvy: Safe Speculation for Secure Memory” by Tamara Silbergleit Lehman, Andrew D. Hilton, and Benjamin C. Lee (MICRO 2016). Integrity verification is a main cause of slowdown in secure memories. PoisonIvy provides a way to enable safe speculation on unverified data by tracking the instructions that consume the unverified data using poisoned bits.

“Data-Centric Execution of Speculative Parallel Programs” by Mark C. Jeffrey, Suvinay Subramanian, Maleen Abeydeera, Joel Emer, and Daniel Sanchez (MICRO 2016). The authors’ technique enables speculative parallelization (such as thread-level speculation and transactional memory) to scale to thousands of cores. It also makes speculative parallelization as easy to program as sequential programming.

“Efficiently Scaling Out-of-Order Cores for Simultaneous Multithreading” by Faissal M. Sleiman and Thomas F. Wenisch (ISCA 2016). This paper demonstrates that it is possible to unify in-order and out-of-order issue into a single, integrated, energy-efficient SMT microarchitecture.

“Racer: TSO Consistency via Race Detection” by Alberto Ros and Stefanos Kaxiras (MICRO 2016). The authors propose a scalable approach to enforce coherence and TSO consistency without directories, timestamps, or software intervention.

“The Anytime Automaton” by Joshua San Miguel and Natalie Enright Jerger (ISCA 2016). This paper provides a general, safe, and robust approximate computing paradigm that abstracts away the challenge of guaranteeing user acceptability from the system architect.

“Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units” by Siyang Wang, Xiangyu Zhang, Yuxuan Li, Ramin Bashizade, Song Yang, Chris Dwyer, and Alvin R. Lebeck (ISCA 2016). This paper proposes cross-layer support for probabilistic computing using novel technologies and specialized architectures.

“Stripes: Bit-Serial Deep Neural Network Computing” by Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, and Andreas Moshovos (MICRO 2016). The authors demonstrate that bit-serial computation can lead to high-performance and energy-efficient designs whose performance and accuracy adapt to precision at a fine granularity.

“Strober: Fast and Accurate Sample-Based Energy Simulation for Arbitrary RTL” by Donggyu Kim, Adam Izraelevitz, Christopher Celio, Hokeun Kim, Brian Zimmer, Yunsup Lee, Jonathan Bachrach, and Krste Asanović (ISCA 2016). This paper proposes a sample-based RTL energy modeling methodology for fast and accurate energy evaluation.

“Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement” by Akanksha Jain and Calvin Lin (ISCA 2016). The authors’ algorithm enhances cache replacement by learning replacement decisions made by Belady. The paper also presents a novel mechanism to efficiently simulate Belady behavior.

“ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars” by Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan, Miao Hu, R. Stanley Williams, and Vivek Srikumar (ISCA 2016). The authors advance the state of the art in deep network accelerators by an order of magnitude and overcome the challenges of analog-digital conversion with innovative encodings and pipelines suitable for precise and energy-efficient analog acceleration.
harsh or generous reviewers, who either give scores at extreme ends or advocate for too few or too many papers from their stack. We wanted to ensure that all reviewers play an equal role in the selection, regardless of their harshness or generosity. For example, we could give all reviewers an equal voice by requiring them to advocate for a fixed number of papers from their stack. We used the data from the past three years’ Top Picks meetings to analyze the process for Top Picks and used this data to drive the design of our process. For example, the typical acceptance rate of Top Picks is approximately 10 percent; therefore, if we assign 15 papers to each reviewer, then each reviewer can be expected to have only 1.5 Top Picks papers on average in their stack, and the likelihood of having 5 or more Top Picks papers in the stack would be extremely small.
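The expected-load argument above can be sanity-checked with a simple binomial model (an idealization that assumes paper assignments are independent and each paper has the same 10 percent chance of being a Top Pick; the function name below is ours, purely illustrative):

```python
from math import comb

def prob_at_least(n: int, p: float, k: int) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 15 papers per reviewer stack, ~10 percent Top Picks acceptance rate.
expected_top_picks = 15 * 0.10              # 1.5 papers on average
p_five_or_more = prob_at_least(15, 0.10, 5)
print(f"expected: {expected_top_picks}, P(5 or more): {p_five_or_more:.4f}")
# P(5 or more) comes out at roughly 1 percent, i.e., extremely small.
```

Under this idealized model, the chance of a reviewer seeing five or more eventual Top Picks in a 15-paper stack is on the order of one in eighty, consistent with the authors' claim.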
Based on the data and constraints of Top Picks, we formulated a ranking-based two-phase process. The objective of the first phase was to filter about 35 to 40 papers that would be discussed at the selection committee meeting. The objective of the second phase was to increase the number of reviews per paper to about 10 and ask each reviewer to provide a concrete decision for the assigned paper: whether it should be selected as a Top Pick or Honorable Mention, or neither. In the first phase, each reviewer was assigned exactly 14 papers and was asked to recommend exactly five papers (Top 5) to the second phase. Each paper received four ratings in this phase. If a paper got three or more ratings of Top 5, it automatically advanced to the second phase. If the paper had two ratings of Top 5, then both positive reviewers had to champion the paper for it to advance to the second phase. Papers with fewer than two ratings of Top 5 did not advance to the second phase. A total of 38 papers advanced to the second phase, and each such paper got a total of 9 to 10 reviews. In the second phase, each reviewer was assigned seven to eight papers in addition to the four to five papers that survived the first phase. Each reviewer had 12 papers and was asked to place exactly 4 of them into each category: Top Picks, Honorable Mention, and neither.
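The first-phase advancement rule described above can be restated as a small function (a literal transcription of the rule; the function and parameter names are ours, not part of the actual process tooling):

```python
def advances_to_phase2(top5_ratings: int, champions: int) -> bool:
    """First-phase rule: each paper receives four ratings.

    - Three or more 'Top 5' ratings: advances automatically.
    - Exactly two: advances only if both positive reviewers champion it.
    - Fewer than two: does not advance.
    """
    if top5_ratings >= 3:
        return True
    if top5_ratings == 2:
        return champions == 2
    return False

print(advances_to_phase2(3, 0))  # True: automatic advance
print(advances_to_phase2(2, 2))  # True: both positive reviewers champion it
print(advances_to_phase2(2, 1))  # False: only one champion
print(advances_to_phase2(1, 2))  # False: too few Top 5 ratings
```

The rule is deliberately conservative in the two-rating case: a split verdict advances only when both supporters are willing to defend the paper at the meeting.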
The selection committee meeting was held in person in Atlanta, Georgia, on 17 December 2016. At the selection committee meeting, the 38 papers were rank-ordered on the basis of the number of Top Picks votes and the average rating the paper received in the second phase. If, after the in-person discussion, 60 percent or more reviewers rated a paper as a Top Pick, then
Selection Committee

• Tor Aamodt, University of British Columbia
• Alaa Alameldeen, Intel
• Murali Annavaram, University of Southern California
• Todd Austin, University of Michigan
• Chris Batten, Cornell University
• Luis Ceze, University of Washington
• Sandhya Dwarkadas, University of Rochester
• Lieven Eeckhout, Ghent University
• Joel Emer, Nvidia and MIT
• Babak Falsafi, EPFL
• Hyesoon Kim, Georgia Tech
• Nam Sung Kim, University of Illinois at Urbana–Champaign
• Benjamin Lee, Duke University
• Hsien-Hsin Lee, Taiwan Semiconductor Manufacturing Company
• Gabriel Loh, AMD
• Debbie Marr, Intel
• Andreas Moshovos, University of Toronto
• Onur Mutlu, ETH Zurich
• Ravi Nair, IBM
• Milos Prvulovic, Georgia Tech
• Scott Rixner, Rice University
• Eric Rotenberg, North Carolina State University
• Karu Sankaralingam, University of Wisconsin
• Yanos Sazeides, University of Cyprus
• Simha Sethumadhavan, Columbia University
• Andre Seznec, INRIA
• Dan Sorin, Duke University
• Viji Srinivasan, IBM
• Karin Strauss, Microsoft
• Tom Wenisch, University of Michigan
• Antonia Zhai, University of Minnesota
the paper was selected as a Top Pick. Otherwise, the decision to select the paper as a Top Pick (or Honorable Mention, or neither) was made by a committee-wide vote using a simple majority. We observed that the top eight ranked papers all got accepted as Top Picks, and four more papers were selected as Top Picks from the next nine papers. Overall, out of the top 25 papers, all but one were selected as either a Top Pick or an Honorable Mention. Thus, having a large number of reviews per paper reduced the dependency on the in-person discussion. Coincidentally, the day before the selection committee meeting there was a hurricane, which caused many flights to be canceled, and 4 of the 31 selection committee members were unable to attend the meeting. However, having 9 to 10 reviewers per paper still ensured that there were at least eight reviewers present for each paper discussed at the selection committee meeting, resulting in a robust and high-confidence process, even with a relatively high rate of absentees. Given the unique constraints and objectives of Top Picks, we hope that such a process, with a larger number of reviews per paper and robustness to variation in the generosity levels of reviewers (for example, ranking papers into fixed-sized bins), will be useful for future Top Picks selection committees as well.
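The meeting's decision rule reduces to a threshold check followed by a fallback vote. A sketch with illustrative names (the fallback is simplified here to a yes/no majority on Top Pick status, whereas the actual vote also distinguished Honorable Mention from neither):

```python
def meeting_decision(top_pick_ratings: int, reviewers: int,
                     majority_votes_top_pick: bool) -> str:
    """60 percent or more reviewers rating the paper a Top Pick selects
    it outright; otherwise a committee-wide simple-majority vote decides."""
    if reviewers > 0 and top_pick_ratings / reviewers >= 0.60:
        return "Top Pick"
    return "Top Pick" if majority_votes_top_pick else "not selected as Top Pick"

print(meeting_decision(6, 10, False))  # 60 percent threshold met outright
print(meeting_decision(5, 10, True))   # selected via committee-wide vote
print(meeting_decision(5, 10, False))  # not selected as Top Pick
```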
Selected Papers

With the slowing down of conventional means for improving performance, the architecture community has been investigating accelerators to improve performance and energy efficiency. This was evident in the emergence of a large number of papers on accelerators appearing throughout the architecture conferences in 2016. Given the emphasis on accelerators, it is no surprise that more than half of the articles in this issue focus on architecting accelerators. Memory system and energy considerations are two other areas from which the Top Picks papers were selected.
Accelerators

Data movement is a primary factor that determines the energy efficiency and effectiveness of accelerators. "Using Dataflow to Optimize Energy Efficiency of Deep Neural Network Accelerators" by Yu-Hsin Chen and his colleagues describes a spatial architecture that optimizes the dataflow for energy efficiency. The article also presents an insightful framework for classifying different accelerators based on their access patterns.
"The Memristive Boltzmann Machines" by Mahdi Nazm Bojnordi and Engin Ipek proposes a memory-centric hardware accelerator for combinatorial optimization and deep learning. It leverages in-situ bit-line computation in memristive arrays to eliminate the need to exchange data between the memory arrays and the computational units.
The concept of using analog computing for efficient computation is also explored by Yipeng Huang and colleagues in "Analog Computing in a Modern Context: A Linear Algebra Accelerator Case Study." The authors address the typical challenges faced by analog computing, such as limited problem size, limited dynamic range, and limited precision.
In contrast to the first three articles, which use domain-specific acceleration, "Domain Specialization Is Generally Unnecessary for Accelerators" by Tony Nowatzki and his colleagues focuses on retaining the programmability of accelerators while maintaining their energy efficiency. The authors use an architecture with a large number of tiny cores that provide the key building blocks typically required by accelerators, and they configure these cores intelligently based on the domain's requirements.
Large-Scale Accelerators

The next three articles look at enhancing the scalability of accelerators so that they can handle larger problem sizes and cater to varying problem domains. "Configurable Clouds" by Adrian Caulfield and his colleagues describes a cloud-scale acceleration architecture that connects accelerator nodes within a datacenter using a high-speed FPGA fabric. The architecture lets the system accelerate a wide variety of applications and has been deployed in Microsoft datacenters.
In "Specializing a Planet's Computation: ASIC Clouds," Moein Khazraee and his colleagues target scale-out workloads comprising many independent but similar jobs, often on
behalf of many users. This architecture shows a way to make ASIC usage more economical, because different users can potentially share the cost of fabricating a given ASIC, rather than each design team incurring the cost of fabricating the ASIC alone.
"DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric" by Mingyu Gao and his colleagues describes a way to increase the size of FPGA fabrics at low cost by using DRAM instead of SRAM for the storage inside the FPGA, thereby enabling a high-density, low-power reconfigurable fabric.
Memory and Storage Systems

Memory systems continue to be important in determining the performance and efficiency of computer systems. This issue features three articles that focus on improving memory and storage systems. "Agile Paging for Efficient Memory Virtualization" by Jayneel Gandhi and his colleagues addresses the performance overhead of virtual memory in virtualized environments by getting the best of both worlds: nested paging and shadow paging.
Virtual address translation can sometimes affect the correctness of memory consistency models. Daniel Lustig and his colleagues address this problem in their article "Transistency Models: Memory Ordering at the Hardware–OS Interface." The authors propose to rigorously integrate memory consistency models and address translation at the microarchitecture and operating system levels.
Moving on to the storage domain, in "Toward a DNA-Based Archival Storage System," James Bornholt and his colleagues demonstrate DNA-based storage architected as a key-value store. Their design enables random access and is equipped with error correction to handle the imperfections of the read and write processes. As the demand for cheap storage continues to increase, such alternative technologies have the potential to provide a major breakthrough in storage capability.
Energy Considerations

The final two articles are related to optimizing energy or operating under low energy budgets. Modern processors are provisioned with a timing margin to protect against temperature inversion. In "Ti-states: Power Management in Active Timing Margin Processors," Yazhou Zu and his colleagues show how actively monitoring the on-chip temperature and dynamically reducing this timing margin can yield significant power savings.
Energy-harvesting systems represent an extreme end of energy-constrained computing, in which the system computes only when harvested energy is available. One challenge in such systems is providing debugging functionality for software, because a system failure can be caused either by a lack of energy or by incorrect code. "An Energy-Aware Debugger for Intermittently Powered Systems" by Alexei Colin and his colleagues describes a hardware–software debugger for intermittent energy-harvesting systems that allows software verification to proceed without interference from the energy-harvesting circuit.
We hope you enjoy reading these articles and that you will explore both the original conference versions and the Honorable Mention papers. We welcome your feedback on this special issue and any suggestions for next year's Top Picks issue. MICRO
Acknowledgments

We thank Lieven Eeckhout for providing support and direction as we tried out the new paper selection process. Lieven also handled the papers that were conflicted with both co-chairs. We also thank the selection committee co-chairs of the past three Top Picks issues (Gabe Loh, Babak Falsafi, Luis Ceze, Karin Strauss, Milo Martin, and Dan Sorin) for providing the review statistics from their editions of Top Picks and for answering our questions. We thank Vinson Young for handling the submission website, and Prashant Nair and Jian Huang for facilitating the process at the selection committee meeting. We owe a huge thanks to our fantastic selection committee, whose members not only diligently reviewed all the papers but were also supportive of the new review process. Furthermore, the selection committee members spent a day attending the in-person meeting in Atlanta, fairly close to the holiday season. Finally, we
thank all the authors who submitted their work for consideration to this Top Picks issue, and the authors of the selected papers for producing the final versions of their papers for this issue.
Aamer Jaleel is a principal research scientist at Nvidia. Contact him at [email protected].
Moinuddin Qureshi is an associate professor in the School of Electrical and Computer Engineering at Georgia Tech. Contact him at [email protected].
USING DATAFLOW TO OPTIMIZE ENERGY EFFICIENCY OF DEEP NEURAL NETWORK ACCELERATORS

The authors demonstrate the key role dataflows play in optimizing energy efficiency for deep neural network (DNN) accelerators. They introduce both a systematic approach to analyze the problem and a new dataflow, called Row-Stationary, that is up to 2.5 times more energy efficient than existing dataflows in processing a state-of-the-art DNN. This article provides guidelines for future DNN accelerator designs.
Recent breakthroughs in deep neural networks (DNNs) are leading to an industrial revolution based on AI. The superior accuracy of DNNs, however, comes at the cost of high computational complexity. General-purpose processors no longer deliver sufficient processing throughput and energy efficiency for DNNs. As a result, demand for dedicated DNN accelerators is increasing in order to support the rapidly growing use of AI.
The processing of a DNN mainly comprises multiply-and-accumulate (MAC) operations (see Figure 1). Most of these MACs are performed in the DNN's convolutional layers, in which multichannel filters are convolved with multichannel input feature maps (ifmaps, such as images). This generates partial sums (psums) that are further accumulated into multichannel output feature maps (ofmaps). Because the MAC operations have few data dependencies, DNN accelerators can use high parallelism to achieve high processing throughput. However, this processing also requires a significant amount of data movement: each MAC performs three reads and one write. Because moving data can consume more energy than the computation itself,1 optimizing data movement becomes the key to achieving high energy efficiency.
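To make the scale concrete, the MAC and data-access counts of a convolutional layer follow directly from its shape parameters. The sketch below is an illustrative calculation (not code from the article), using the Figure 1 notation: N ifmaps, M filters, C channels, R x R filters, and E x E ofmaps; the function names are hypothetical.

```python
def conv_layer_macs(N, M, C, R, E):
    """Total MACs in a convolutional layer: each of the N * M * E * E
    ofmap pixels is produced by accumulating over C channels and an
    R x R filter window."""
    return N * M * E * E * C * R * R

def data_accesses(macs):
    """Each MAC performs three reads (filter weight, ifmap pixel,
    psum) and one write (the updated psum)."""
    return 4 * macs

# Example: a layer shaped like AlexNet's first CONV layer
# (M=96 filters, C=3 channels, R=11, E=55) with a batch of N=16 ifmaps.
macs = conv_layer_macs(N=16, M=96, C=3, R=11, E=55)
accesses = data_accesses(macs)
```

Even this single layer generates billions of data accesses, which is why the storage hierarchy, rather than the ALUs, dominates the energy budget.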
Data movement can be optimized by exploiting data reuse in a multilevel storage hierarchy. By maximizing the reuse of data in the lower-energy-cost storage levels (such as local scratchpads), and thus reducing accesses to the higher-energy-cost levels (such as DRAM), the overall data movement energy consumption is minimized.
Yu-Hsin Chen, Massachusetts Institute of Technology
Joel Emer, Nvidia and Massachusetts Institute of Technology
Vivienne Sze, Massachusetts Institute of Technology

Published by the IEEE Computer Society, 0272-1732/17/$33.00 © 2017 IEEE

In fact, DNNs present many data reuse opportunities. First, there are three types of input data reuse: filter reuse, wherein each filter weight is reused across multiple ifmaps; ifmap reuse, wherein each ifmap pixel is reused across multiple filters; and convolutional reuse, wherein both ifmap pixels and filter weights are reused due to the sliding-window processing in convolutions. Second, the intermediate psums are reused through the accumulation into ofmaps. If not accumulated and reduced as soon as possible, the psums pose additional storage pressure.
A design can exploit these data reuse opportunities by finding the optimal MAC operation mapping, which determines both the temporal and spatial scheduling of the MACs on a highly parallel architecture. Ideally, data in the lower-cost storage levels is reused by as many MACs as possible before replacement. However, due to the limited amount of local storage, input data reuse (of ifmaps and filters) and psum reuse cannot be fully exploited simultaneously. For example, reusing the same input data for multiple MACs generates psums that cannot be accumulated together and, as a result, consume extra storage space. Therefore, system energy efficiency is maximized only when the mapping balances all types of data reuse in the multilevel storage hierarchy.
The search for the mapping that maximizes system energy efficiency thus becomes an optimization process. This optimization must consider the following factors: the data reuse opportunities available for a given DNN shape and size (for example, the number of filters, number of channels, filter size, and feature map size); the energy cost of data access at each level of the storage hierarchy; and the available processing parallelism and storage capacity. The first factor is a function of the workload, whereas the second and third are a function of the specific accelerator implementation.
Because of implementation tradeoffs, previous proposals for DNN accelerators have made choices about the subset of mappings that can be supported. Therefore, for a specific DNN accelerator design, the optimal mapping can be selected only from the subset of supported mappings instead of the entire mapping space. The subset of supported mappings is usually determined by a set of mapping rules, which also characterizes the hardware implementation. Such a set of mapping rules defines a dataflow.
Because state-of-the-art DNNs come in a wide range of shapes and sizes, the corresponding optimal mappings also vary. The question is, can we find a dataflow that accommodates the mappings that optimize data movement for various DNN shapes and sizes?
In this article, we explore different DNN dataflows to answer this question in the context of a spatial architecture.2 In particular, we present the following key contributions:3

• An analogy between DNN accelerators and general-purpose processors that clearly identifies the distinct aspects of a DNN accelerator's operation, providing insights into opportunities for innovation.

• A framework that quantitatively evaluates the energy consumption of different mappings for different DNN shapes and sizes, which is an essential tool for finding the optimal mapping.

• A taxonomy that classifies dataflows from previous DNN accelerator projects, which helps in understanding a large body of work despite differences in lower-level details.
• A new dataflow, called Row-Stationary (RS), which is the first dataflow to optimize data movement for superior system energy efficiency. It has also been verified in a fabricated DNN accelerator chip, Eyeriss.4

Figure 1. In the processing of a deep neural network (DNN), multichannel filters are convolved with the multichannel input feature maps, which then generate the output feature maps. The processing of a DNN comprises many multiply-and-accumulate (MAC) operations.
We evaluate the energy efficiency of the RS dataflow and compare it to the other dataflows in the taxonomy. The comparison uses a popular state-of-the-art DNN model, AlexNet,5 with a fixed amount of hardware resources. Simulation results show that the RS dataflow is 1.4 to 2.5 times more energy efficient than the other dataflows in the convolutional layers. It is also at least 1.3 times more energy efficient in the fully connected layers for batch sizes of at least 16. These results provide guidance for future DNN accelerator designs.
An Analogy to General-Purpose Processors

Figure 2 shows an analogy between the operation of DNN accelerators and that of general-purpose processors. In conventional computer systems, the compiler translates a program into machine-readable binary code for execution; in the processing of DNNs, the mapper translates the DNN shape and size into a hardware-compatible mapping for execution. Whereas the compiler usually optimizes for performance, the mapper optimizes especially for energy efficiency.

The dataflow is a key attribute of a DNN accelerator and is analogous to part of a general-purpose processor's architecture. Similar to the role of an ISA or memory consistency model, the dataflow defines the mapping rules that the mapper must follow in order to generate hardware-compatible mappings. Later in this article, we introduce several previously proposed dataflows.

Other attributes of a DNN accelerator, such as the storage organization, are also analogous to parts of the general-purpose processor architecture, such as scratchpads or virtual memory. We consider these attributes part of the architecture, rather than the microarchitecture, because they may largely remain invariant across implementations; although, as with GPUs, the distinction between architecture and microarchitecture is likely to blur for DNN accelerators.

Implementation details, such as those that determine the access energy cost at each level of the storage hierarchy and the latency between processing elements (PEs), are analogous to the microarchitecture of processors, because a mapping remains valid despite changes in these characteristics. However, they play a vital part in determining a mapping's energy efficiency.

The mapper's goal is to search the mapping space for the mapping that best optimizes data movement. The size of the entire mapping space is determined by the total number of MACs, which can be calculated from the DNN shape and size. However, only a subset of the space is valid given the mapping rules defined by a dataflow. For example, a dataflow can enforce the following mapping rule: all MACs that use the same filter weight must be mapped onto the same PE in the accelerator. It is then the mapper's job to find the exact ordering of these MACs on each PE by evaluating and comparing the energy efficiency of the valid ordering options.

As in conventional compilers, performing evaluation is an integral part of the mapper. The evaluation process takes a mapping as input and produces an energy consumption estimate based on the available hardware resources (microarchitecture) and the data reuse opportunities extracted from the DNN shape and size (program). In the next section, we introduce a framework that can perform this evaluation.
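The mapper's ordering search can be illustrated with a toy sketch (a hypothetical simplification, not the authors' mapper): given MACs already pinned to one PE by a mapping rule, enumerate their orderings and keep one that minimizes refills of a single-entry register file, a crude stand-in for the energy evaluation.

```python
from itertools import permutations

def rf_refills(order):
    """Count how often a one-entry RF must be refilled when the MACs
    touch ifmap pixels in the given order."""
    refills, resident = 0, None
    for pixel in order:
        if pixel != resident:  # pixel not in the RF: fetch it
            refills += 1
            resident = pixel
    return refills

# Ifmap pixel consumed by each MAC pinned to this PE (toy input).
macs = ["i0", "i1", "i0", "i1", "i0"]
best = min(permutations(macs), key=rf_refills)
# Any ordering that groups identical pixels needs only 2 refills.
```

A real mapper would of course use a cost model over all data types and storage levels rather than brute-force enumeration, but the structure of the search (valid orderings in, estimated energy out) is the same.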
Figure 2. An analogy between the operation of DNN accelerators and that of general-purpose processors. In the analogy, the DNN shape and size corresponds to the program, the mapper to the compiler, the dataflow to the architecture, the resulting mapping to the binary, the implementation details to the microarchitecture, and the DNN accelerator to the processor.
Evaluating Energy Consumption

Finding the optimal mapping requires evaluating the energy consumption of various mapping options. In this article, we evaluate energy consumption based on a spatial architecture,2 because many previous designs can be thought of as instances of such an architecture. The spatial architecture (see Figure 3) consists of an array of PEs and a multilevel storage hierarchy. The PE array provides high parallelism for high throughput, whereas the storage hierarchy can be used to exploit data reuse in a four-level setup (in decreasing energy-cost order): DRAM, global buffer, network-on-chip (NoC, for inter-PE communication), and register file (RF) in each PE as a local scratchpad.

In this architecture, we assume all data types can be stored and accessed at any level of the storage hierarchy. Input data for the MAC operations (that is, filter weights and ifmap pixels) are moved from the most expensive level (DRAM) to the lower-cost levels. Ultimately, they are usually delivered from the least expensive level (RF) to the arithmetic logic unit (ALU) for computation. The results from the ALU (that is, psums) generally move in the opposite direction. The orchestration of this movement is determined by the mappings for a specific DNN shape and size under the mapping rule constraints of a specific dataflow architecture.

Given a specific mapping, the system energy consumption is estimated by counting the number of times each data value of each data type (ifmaps, filters, psums) is reused at each level of the four-level memory hierarchy, and weighting those counts by the energy cost of accessing each storage level. Figure 4 shows the normalized energy cost of accessing data at each storage level relative to the computation of one MAC at the ALU. We extracted these numbers from a commercial 65-nm process and used them in our final experiments.

Figure 5 uses a toy example to show how a mapping determines the data reuse at each storage level, and thus the energy consumption, in a three-PE setup. In this example, we make the following assumptions: each ifmap pixel is used by 24 MACs, all ifmap pixels fit in the global buffer, and the RF of each PE can hold only one ifmap pixel at a time. The mapping first reads an ifmap pixel from DRAM into the global buffer, then from the global buffer into the RF of each PE through the NoC, and reuses it from the RF for four consecutive MACs in each PE. The mapping then switches to MACs that use other ifmap pixels, so the original pixel in the RF is replaced, due to the limited capacity. Therefore, the original ifmap pixel must be fetched from the global buffer again
Figure 3. The spatial array architecture comprises an array of processing elements (PEs) and a multilevel storage hierarchy, including the off-chip DRAM, global buffer, network-on-chip (NoC), and register file (RF) in each PE. The off-chip DRAM, global buffer, and PEs in the array can communicate with each other directly through the input and output FIFOs (iFIFO and oFIFO). Within each PE, the PE FIFO (pFIFO) controls the traffic going into and out of the arithmetic logic unit (ALU), including from the RF or other storage levels.
Figure 4. Normalized energy cost of data access at each storage level, relative to the computation of one MAC operation at the ALU (1x, reference). Numbers are extracted from a commercial 65-nm process: RF (0.5 to 1.0 Kbytes), 1x; NoC (1 to 2 mm), 2x; global buffer (>100 Kbytes), 6x; DRAM, 200x.
when the mapping switches back to the MACs that use it. In this case, the same ifmap pixel is reused at the DRAM, global buffer, NoC, and RF levels 1, 2, 6, and 24 times, respectively. The corresponding normalized energy consumption of moving this ifmap pixel is obtained by weighting these counts with the normalized energy numbers in Figure 4 and adding them together (that is, 1 × 200 + 2 × 6 + 6 × 2 + 24 × 1 = 248). The same approach can be applied to the other data types.
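The toy calculation above is just a weighted sum; the sketch below (an illustration using the Figure 4 costs, not the authors' evaluation framework) reproduces the 248 figure:

```python
# Normalized access-energy costs relative to one MAC at the ALU
# (from Figure 4, commercial 65-nm process).
ENERGY_COST = {"DRAM": 200, "buffer": 6, "NoC": 2, "RF": 1}

def movement_energy(accesses):
    """Weight the per-level access counts by each level's cost and sum."""
    return sum(ENERGY_COST[level] * n for level, n in accesses.items())

# One ifmap pixel accessed 1/2/6/24 times at DRAM/buffer/NoC/RF.
energy = movement_energy({"DRAM": 1, "buffer": 2, "NoC": 6, "RF": 24})
# energy == 248, matching 1*200 + 2*6 + 6*2 + 24*1
```

Summing this quantity over every data value of every data type yields the mapping's total data movement energy.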
This analysis framework can be used not only to find the optimal mapping for a specific dataflow, but also to evaluate and compare the energy consumption of different dataflows. In the next section, we describe several existing dataflows.
A Taxonomy of Existing DNN Dataflows

Numerous previous efforts have proposed solutions for DNN acceleration. These designs reflect a variety of tradeoffs between performance and implementation complexity. Despite their differences in low-level implementation details, we find that many of them can be described as embodying a set of rules (that is, a dataflow) that defines the valid mapping space based on how they handle data. As a result, we can classify them into a taxonomy.
• The Weight-Stationary (WS) dataflow keeps filter weights stationary in each PE's RF by enforcing the following mapping rule: all MACs that use the same filter weight must be mapped onto the same PE for serial processing. This maximizes the convolutional and filter reuse of weights in the RF, thus minimizing the energy consumption of accessing weights (for example, work by Srimat Chakradhar and colleagues6 and Vinayak Gokhale and colleagues7). Figure 6a shows the data movement of a common WS dataflow implementation. While each weight stays in the RF of its PE, the ifmap pixels are broadcast to all PEs, and the generated psums are then accumulated spatially across the PEs.
• The Output-Stationary (OS) dataflow keeps psums stationary by accumulating them locally in the RF. The mapping rule is that all MACs that generate psums for the same ofmap pixel must be mapped onto the same PE for serial processing. This maximizes psum reuse in the RF, thus minimizing the energy consumption of psum movement (for example, work by Zidong Du and colleagues,8 Suyog Gupta and colleagues,9 and Maurice Peemen and colleagues10). The data movement of a common OS dataflow implementation is to broadcast filter weights while passing ifmap pixels spatially across the PE array (see Figure 6b).
• Unlike the previous two dataflows, which keep a certain data type stationary, the No-Local-Reuse (NLR) dataflow keeps no data stationary locally so that it can trade the RF for a larger global buffer. The goal is to minimize DRAM access energy consumption by storing more data on-chip (for example, work by Tianshi Chen and colleagues11 and Chen Zhang and colleagues12). The corresponding
Figure 5. Example of how a mapping determines data reuse at each storage level. The example shows the data movement of one ifmap pixel through the memory, buffer, NoC, and RF levels of the storage hierarchy; each arrow represents moving data between specific levels (or to an ALU for computation).
mapping rule is that, in each processing cycle, all parallel MACs must come from a unique pair of filter and channel. The data movement of the NLR dataflow is to single-cast weights, multicast ifmap pixels, and spatially accumulate psums across the PE array (see Figure 6c).
The three dataflows show distinct data movement patterns, which imply different tradeoffs. First, as Figures 6a and 6b show, the cost of keeping a specific data type stationary is moving the other data types more. Second, the timing of data accesses also matters. For example, in the WS dataflow, each ifmap pixel read from the global buffer is broadcast to all PEs with properly mapped MACs on the PE array. This is more efficient than reading the same value multiple times from the global buffer and single-casting it to the PEs, which is the case for filter weights in the NLR dataflow (see Figure 6c). Other dataflows can make other tradeoffs. In the next section, we present a new dataflow that takes these factors into account to optimize energy efficiency.
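The defining rules of the WS and OS dataflows can be stated compactly as a partition of MACs onto PEs. The sketch below is an assumed simplification in which each MAC is identified by the weight and the ofmap pixel it touches; the function name is hypothetical.

```python
def assign_pe(weight_id, ofmap_id, dataflow):
    """Return the PE key a MAC is pinned to under each mapping rule."""
    if dataflow == "WS":   # all MACs sharing a weight -> same PE, serially
        return ("weight", weight_id)
    if dataflow == "OS":   # all MACs for one ofmap pixel -> same PE, serially
        return ("ofmap", ofmap_id)
    raise ValueError(f"no per-PE pinning rule sketched for {dataflow!r}")

# Two MACs that share weight 3 land on one PE under WS,
# but on different PEs under OS (they feed different ofmap pixels).
assert assign_pe(3, 7, "WS") == assign_pe(3, 9, "WS")
assert assign_pe(3, 7, "OS") != assign_pe(3, 9, "OS")
```

NLR has no such pinning rule; its constraint is on which MACs run in parallel in each cycle, not on where a data value stays.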
An Energy-Efficient Dataflow

Although the dataflows in the taxonomy describe the design of many DNN accelerators, they optimize data movement only for a specific data type (for example, weights in WS) or storage level (DRAM in NLR). In this section, we introduce a new dataflow, called Row-Stationary (RS), which aims to optimize data movement for all data types in all levels of the storage hierarchy of a spatial architecture.
The RS dataflow divides the MACs into mapping primitives, each of which comprises a subset of MACs that run on the same PE in a fixed order. Specifically, each mapping primitive performs a 1D row convolution, so we call it a row primitive; it intrinsically optimizes data reuse per MAC for all data types combined. Each row primitive is formed with the following rules:
• The MACs that apply a row of filter weights to a row of ifmap pixels, generating a row of psums, must be mapped onto the same PE.

• The ordering of these MACs enables the use of a sliding window for ifmaps, as shown in Figure 7.

Convolutional and psum reuse opportunities within a row primitive are fully exploited in the RF, given sufficient RF storage capacity.
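A row primitive can be sketched as follows (an illustrative implementation of the 1D row convolution in Figure 7, not the Eyeriss code); the filter row and the current ifmap window would both live in the PE's RF:

```python
def row_primitive(filter_row, ifmap_row):
    """1D row convolution run serially on one PE: slide the filter
    window across the ifmap row, producing one psum row."""
    R = len(filter_row)
    return [sum(w * ifmap_row[j + i] for i, w in enumerate(filter_row))
            for j in range(len(ifmap_row) - R + 1)]

row_primitive([1, 2, 3], [1, 0, 2, 1, 3])  # -> [7, 7, 13]
```

Each output psum reuses the same filter weights, and consecutive windows overlap in R - 1 ifmap pixels, which is exactly the convolutional reuse the RF captures.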
Even with the RS dataflow, as defined by the row primitives, there is still a large number of valid mapping choices. These choices arise in both the spatial and temporal assignment of primitives to PEs:
1. One spatial mapping option is to assign primitives with data rows from the same 2D plane onto the PE array, to lay out a 2D convolution (see Figure 8). This mapping fully exploits convolutional and psum reuse opportunities across primitives in the NoC: the same rows of filter weights and ifmap pixels are reused across PEs horizontally and diagonally, respectively, and psum rows are
Figure 6. Dataflow taxonomy. (a) Weight Stationary. (b) Output Stationary. (c) No Local Reuse.
further accumulated across PEs vertically.
2. Another spatial mapping option arises when the PE array is large: the pattern shown in Figure 8 can be spatially duplicated across the PE array for multiple 2D convolutions. This not only increases the utilization of PEs, but also further exploits filter, ifmap, and psum reuse opportunities in the NoC.
3. One temporal mapping option arises because row primitives from different 2D planes can be concatenated or interleaved on the same PE. As Figure 9 shows, primitives with different ifmaps, filters, and channels have filter reuse, ifmap reuse, and psum reuse opportunities, respectively. By concatenating or interleaving their computation together on one PE, the work virtually becomes a larger 1D row convolution, which exploits these cross-primitive data reuse opportunities in the RF.
4. Another temporal mapping choice arises when the PE array is too small and the originally spatially mapped row primitives must be temporally folded into multiple processing passes (that is, the computation is serialized). In this case, the data reuse opportunities originally exploited spatially in the NoC can be exploited temporally by the global buffer to avoid DRAM accesses, given sufficient storage capacity.
As the preceding list makes evident, the RS dataflow provides a high degree of mapping flexibility through concatenation, interleaving, duplication, and folding of the row primitives. The mapper searches for the exact amount of each technique to apply in the optimal mapping (for example, how many filters to interleave on the same PE to exploit ifmap reuse) in order to minimize overall system energy consumption.
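The 2D layout in Figure 8 can be summarized by a simple index rule. The sketch below is an assumed formalization with hypothetical names: the PE at array position (r, e) processes filter row r against ifmap row r + e, contributing to psum row e, which makes the horizontal, diagonal, and vertical reuse patterns explicit.

```python
def rs_pe_assignment(num_filter_rows, num_psum_rows):
    """Map each PE position (r, e) to the rows it processes in the
    RS dataflow's 2D layout: filter row r, ifmap row r + e, psum row e."""
    return {(r, e): {"filter_row": r, "ifmap_row": r + e, "psum_row": e}
            for r in range(num_filter_rows)
            for e in range(num_psum_rows)}

pes = rs_pe_assignment(num_filter_rows=3, num_psum_rows=3)
# Filter row 0 is reused across PEs (0, 0), (0, 1), (0, 2): horizontal reuse.
# Ifmap row 2 appears at PEs (2, 0), (1, 1), (0, 2): diagonal reuse.
# Psum row 0 is accumulated across PEs (0, 0), (1, 0), (2, 0): vertical reuse.
```

Duplication tiles this 3-by-3 pattern across a larger array; folding serializes it over multiple passes when the array is too small.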
Dataflow Comparison

In this section, we quantitatively compare the energy efficiency of different DNN dataflows in a spatial architecture, including those from the taxonomy and the proposed RS dataflow. We use AlexNet5 as the benchmark DNN because it is one of the most popular DNNs available, and it comprises five convolutional (CONV) layers and three fully connected (FC) layers with a wide variety of shapes and sizes, which can more thoroughly exercise the optimal mappings of each dataflow.
For a fair comparison, we apply two constraints to all dataflows. First, the size of the PE array is fixed at 256 for constant processing throughput across dataflows. Second, the total hardware area is also fixed. For example, because the NLR dataflow does not use an RF, it can allocate more area to the global buffer. The corresponding hardware resource parameters are based on the RS dataflow implementation in Eyeriss, a DNN accelerator chip fabricated in 65-nm CMOS.4 By applying these constraints, we fix the total cost of implementing the microarchitecture of each dataflow.
Figure 7. Each row primitive in the Row-Stationary (RS) dataflow runs a 1D row convolution on the same PE in a sliding-window processing order.
Figure 8. Patterns of how row primitives from the same 2D plane are mapped onto the PE array in the RS dataflow.
Therefore, the differences in energy efficiency are solely due to the dataflows.
Figures 10a and 10b show the comparison of energy efficiency between dataflows in the CONV layers of AlexNet with an ifmap batch size of 16. Figure 10a gives the breakdown in terms of storage levels and ALU, and Figure 10b gives the breakdown in terms of data types. First, the ALU energy consumption is only a small fraction of the total, which shows the importance of data-movement optimization. Second, even though NLR has the lowest DRAM energy consumption, its total energy consumption is still high, because most of its data accesses come from the global buffer, which are more expensive than those from the NoC or RF. Third, although the WS and OS dataflows clearly optimize the energy consumption of accessing weights and psums, respectively, they sacrifice the energy consumption of moving the other data types, and therefore do not achieve the lowest overall energy consumption. This shows that DRAM accesses alone do not dictate energy efficiency, and optimizing the energy consumption of only a certain data type does not lead to the best system energy efficiency. Overall, the RS dataflow is 1.4 to 2.5 times more energy efficient than the other dataflows in the CONV layers of AlexNet.
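The accounting behind this comparison can be sketched as a sum over hierarchy levels of access count times per-access energy. The specific numbers below are placeholders I chose only to reproduce the qualitative ordering the text argues for (RF cheapest, DRAM by far the most expensive); they are not the paper's measured costs.

```python
# Illustrative per-access energy costs, normalized to one MAC operation.
# The ordering (RF < NoC < buffer << DRAM) follows the article's argument;
# the exact values are placeholders, not measured numbers.
COST = {"ALU": 1, "RF": 1, "NoC": 2, "buffer": 6, "DRAM": 200}

def total_energy(access_counts):
    """Sum energy over storage levels: accesses[level] * cost[level]."""
    return sum(COST[level] * n for level, n in access_counts.items())

# A dataflow with no RF pays mostly at the global buffer, even if its
# ALU (MAC) count is identical and its DRAM traffic is low.
nlr_like = {"ALU": 1000, "RF": 0, "NoC": 200, "buffer": 5000, "DRAM": 50}
rs_like  = {"ALU": 1000, "RF": 6000, "NoC": 400, "buffer": 800, "DRAM": 40}

for name, counts in [("NLR-like", nlr_like), ("RS-like", rs_like)]:
    e = total_energy(counts)
    print(f"{name}: total={e}, ALU fraction={COST['ALU']*counts['ALU']/e:.2f}")
```

The output illustrates both observations: the ALU is a small fraction of total energy, and shifting accesses from the buffer down to the RF lowers the total even when DRAM traffic barely changes.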
Figure 11 shows the same experiment results as Figure 10b, except for the FC layers of AlexNet. Compared to the CONV layers, the FC layers have no convolutional reuse and use many more filter weights. Still, the RS dataflow is at least 1.3 times more energy efficient than the other dataflows, which shows that the capability to optimize data movement for all data types is the key to achieving the highest overall energy efficiency. Note that the FC layers account for less than 20 percent of the total energy consumption in AlexNet. In recent DNNs, the number of FC layers has also been greatly reduced, making their energy consumption even less significant.
Figure 9. Row primitives from different 2D planes can be combined by concatenating or interleaving their computation on the same PE to further exploit data reuse at the RF level. (a) Two row primitives reuse the same filter row for different ifmap rows. (b) Two row primitives reuse the same ifmap row for different filter rows. (c) Two row primitives from different channels further accumulate psum rows.
.............................................................
MAY/JUNE 2017 19
Research on architectures for DNN accelerators has become very popular because of their promising performance and wide applicability. This article has demonstrated the key role of dataflows in DNN accelerator design, and it shows how to systematically exploit all types of data reuse in a multilevel storage hierarchy to optimize energy efficiency with a new dataflow. It challenges conventional design approaches, which focus on optimizing parts of the problem, and shifts them toward a global optimization that considers all relevant metrics.
The taxonomy of dataflows lets us compare high-level design choices irrespective of low-level implementation details, and thus can be used to guide future designs. Although these dataflows are currently implemented on distinct architectures, it is also possible to come up with a union architecture that can support multiple dataflows simultaneously. The questions are how to choose a combination of dataflows that maximally benefits the search for optimal mappings, and how to support these dataflows with the minimum amount of hardware implementation overhead.
This article has also pointed out how the concept of DNN dataflows and the mapping of a DNN computation onto a dataflow can be viewed as analogous to a general-purpose processor's architecture and compiling onto that architecture. We hope this will open up space for computer architects to approach the design of DNN accelerators by applying the knowledge and techniques of a well-established research field in a more systematic manner, such as methodologies for design abstraction, modularization, and performance evaluation.
For instance, a recent research trend for DNNs is to exploit data statistics. Specifically, different proposals on quantization, pruning, and data representation have all shown promising results in improving the performance of DNNs. Therefore, it is important that new architectures also take advantage of these findings. Just as compilers for general-purpose processors can take the profile of targeted workloads to further improve the performance of the generated binary, the analogy between general-purpose processors and DNN accelerators suggests that the mapper for DNN accelerators might also take the profile of targeted DNN statistics to further optimize the
Figure 10. Comparison of energy efficiency between different dataflows in the convolutional (CONV) layers of AlexNet:5 (a) breakdown in terms of storage levels and ALU versus (b) data types (normalized energy per MAC). OSA, OSB, and OSC are three variants of the OS dataflow that are commonly seen in different implementations.3
Figure 11. Comparison of energy efficiency between different dataflows in the fully connected (FC) layers of AlexNet (normalized energy per MAC, broken down by data type).
generated mappings. This is an endeavor we will leave for future work. MICRO
References
1. M. Horowitz, "Computing's Energy Problem (And What We Can Do About It)," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 14), 2014, pp. 10–14.
2. A. Parashar et al., "Triggered Instructions: A Control Paradigm for Spatially-Programmed Architectures," Proc. 40th Ann. Int'l Symp. Computer Architecture (ISCA 13), 2013, pp. 142–153.
3. Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture (ISCA 16), 2016, pp. 367–379.
4. Y.-H. Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC 16), 2016, pp. 262–263.
5. A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Proc. 25th Int'l Conf. Neural Information Processing Systems (NIPS 12), 2012, pp. 1097–1105.
6. S. Chakradhar et al., "A Dynamically Configurable Coprocessor for Convolutional Neural Networks," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA 10), 2010, pp. 247–257.
7. V. Gokhale et al., "A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks," Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW 14), 2014, pp. 696–701.
8. Z. Du et al., "ShiDianNao: Shifting Vision Processing Closer to the Sensor," Proc. ACM/IEEE 42nd Ann. Int'l Symp. Computer Architecture (ISCA 15), 2015, pp. 92–104.
9. S. Gupta et al., "Deep Learning with Limited Numerical Precision," Proc. 32nd Int'l Conf. Machine Learning, vol. 37, 2015, pp. 1737–1746.
10. M. Peemen et al., "Memory-Centric Accelerator Design for Convolutional Neural Networks," Proc. IEEE 31st Int'l Conf. Computer Design (ICCD 13), 2013, pp. 13–19.
11. T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," Proc. 19th Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS 14), 2014, pp. 269–284.
12. C. Zhang et al., "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks," Proc. ACM/SIGDA Int'l Symp. Field-Programmable Gate Arrays (FPGA 15), 2015, pp. 161–170.
Yu-Hsin Chen is a PhD student in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. His research interests include energy-efficient multimedia systems, deep learning architectures, and computer vision. Chen received an MS in electrical engineering and computer science from the Massachusetts Institute of Technology. He is a student member of IEEE. Contact him at [email protected].
Joel Emer is a senior distinguished research scientist at Nvidia and a professor of electrical engineering and computer science at the Massachusetts Institute of Technology. His research interests include spatial and parallel architectures, performance modeling, reliability analysis, and memory hierarchies. Emer received a PhD in electrical engineering from the University of Illinois. He is a Fellow of IEEE. Contact him at [email protected].
Vivienne Sze is an assistant professor in the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology. Her research interests include energy-aware signal processing algorithms and low-power architecture and system design for multimedia applications, such as machine learning, computer vision, and video coding. Sze received a PhD in electrical engineering from the Massachusetts Institute of Technology. She is a senior member of IEEE. Contact her at [email protected].
THE MEMRISTIVE BOLTZMANN MACHINES
The proposed memristive Boltzmann machine is a massively parallel, memory-centric hardware accelerator based on recently developed resistive RAM (RRAM) technology. The proposed accelerator exploits the electrical properties of RRAM to realize in situ, fine-grained parallel computation within memory arrays, thereby eliminating the need for exchanging data between the memory cells and computational units.
Combinatorial optimization is a branch of discrete mathematics concerned with finding the optimum element of a finite or countably infinite set. An enormous number of critical problems in science and engineering can be cast within the combinatorial optimization framework, including classical problems such as traveling salesman, integer linear programming, knapsack, bin packing, and scheduling, as well as numerous optimization problems in machine learning and data mining. Because many of these problems are NP-hard, heuristic algorithms are commonly used to find approximate solutions for even moderately sized problem instances.
Simulated annealing is one of the most commonly used optimization algorithms. On many types of NP-hard problems, simulated annealing achieves better results than other heuristics; however, its convergence may be slow. This problem was first addressed by reformulating simulated annealing within the context of a massively parallel computational model called the Boltzmann machine.1 The Boltzmann machine is amenable to a massively parallel implementation in either software or hardware.2 With the growing interest in deep learning models that rely on Boltzmann machines for training (such as deep belief networks), the importance of high-performance Boltzmann machine implementations is increasing. Regrettably, the required all-to-all communication among the processing units limits these recent efforts' performance.
The memristive Boltzmann machine is a massively parallel, memory-centric hardware accelerator for the Boltzmann machine based on recently developed resistive RAM (RRAM) technology. RRAM is a memristive, nonvolatile memory technology that provides Flash-like density and DRAM-like read speed. The accelerator exploits the electrical properties of the bitlines and wordlines in a conventional single-level cell (SLC) RRAM array to realize in situ, fine-grained parallel computation, which eliminates the need for exchanging data among the memory arrays and computational units. The proposed hardware platform connects to a general-purpose system via the DDRx interface and can be selectively integrated with systems that run optimization workloads.
Mahdi Nazm Bojnordi
University of Utah

Engin Ipek
University of Rochester
22 Published by the IEEE Computer Society. 0272-1732/17/$33.00 © 2017 IEEE
Computation within Memristive Arrays
The key idea behind the proposed memory-centric accelerator is to exploit the electrical properties of the storage cells and the interconnections among those cells to compute the dot product, the fundamental building block of the Boltzmann machine, in situ within the memory arrays. This novel capability of the proposed memristive arrays eliminates unnecessary latency, bandwidth, and energy overheads associated with streaming the data out of the memory arrays during computation.
The Boltzmann Machine
The Boltzmann machine, proposed by Geoffrey Hinton and colleagues in 1983,2 is a well-known example of a stochastic neural network that can learn internal representations and solve combinatorial optimization problems. The Boltzmann machine is a fully connected network comprising two-state units. It employs simulated annealing to transition between the possible network states. The units flip their states on the basis of the current state of their neighbors and the corresponding edge weights to maximize a global consensus function, which is equivalent to minimizing the network energy.
Many combinatorial optimization problems, as well as machine learning tasks, can be mapped directly onto a Boltzmann machine by choosing the appropriate edge weights and the initial state of the units within the network. As a result of this mapping, each possible state of the network represents a candidate solution to the optimization problem, and minimizing the network energy becomes equivalent to solving the optimization problem. The energy minimization process is typically performed either by adjusting the edge weights (learning) or by recomputing the unit states (searching and classifying). This process is repeated until convergence is reached. The solution to an optimization problem can be found by reading, and appropriately interpreting, the network's final state. For example, Figure 1 depicts the mapping from an example graph with five vertices to a Boltzmann machine with five nodes, used to solve a Max-Cut problem. Given an undirected graph G with N nodes whose connection weights (d_ij) are represented by a symmetric weight matrix, the maximum cut problem is to find a subset S ⊆ {1, …, N} of the nodes that maximizes Σ_{i,j} d_ij, in which i ∈ S and j ∉ S. To solve the problem on a Boltzmann machine, a one-to-one mapping is established between the graph G and a Boltzmann machine with N processing units. The Boltzmann machine is configured as w_jj = Σ_i d_ji and w_ji = −2d_ji. When the machine reaches its lowest energy (E(x) = −19 in the example), the state variables represent the optimum solution, in which a value of 1 at unit i indicates that the corresponding graphical node belongs to S.
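The weight mapping above can be checked directly in code. The sketch below builds the Boltzmann weights from a small random symmetric graph using the stated rule (w_jj = Σ_i d_ji, w_ji = −2d_ji) and verifies by brute force that minimizing the network energy is exactly maximizing the cut; the energy expression is the standard negated consensus function, and the example graph is randomly generated rather than the five-vertex graph of Figure 1.

```python
import itertools
import numpy as np

def boltzmann_weights(d):
    """Map a symmetric Max-Cut weight matrix d onto Boltzmann-machine
    weights per the text: w[j][j] = sum_i d[j][i], w[j][i] = -2*d[j][i]."""
    w = -2.0 * d
    np.fill_diagonal(w, d.sum(axis=1))
    return w

def energy(w, x):
    """Network energy E(x) = -(sum_i w_ii x_i + sum_{i<j} w_ij x_i x_j)."""
    n = len(x)
    quad = sum(w[i, j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))
    lin = sum(w[i, i] * x[i] for i in range(n))
    return -(lin + quad)

def cut_value(d, x):
    """Weight of edges crossing the partition defined by x (1 = node in S)."""
    n = len(x)
    return sum(d[i, j] for i in range(n) for j in range(i + 1, n)
               if x[i] != x[j])

# A small random symmetric graph with a zero diagonal. By construction,
# E(x) = -cut(x) for every state, so min E equals -max cut.
rng = np.random.default_rng(0)
d = rng.integers(0, 10, size=(5, 5)).astype(float)
d = np.triu(d, 1)
d = d + d.T
w = boltzmann_weights(d)
states = list(itertools.product([0, 1], repeat=5))
best = min(states, key=lambda x: energy(w, x))
print("min energy:", energy(w, best), "= -max cut")
```

This also makes the sign convention explicit: reading out the state vector at the energy minimum directly yields the maximum-cut partition.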
In Situ Computation
The critical computation that the Boltzmann machine performs consists of multiplying a weight matrix W by a state vector x. Every entry of the symmetric matrix W (w_ji) records the weight between two units (j and i); every entry of the vector x (x_i) stores the state of a single unit (i). Figure 2 depicts the fundamental concept behind the design of the
Figure 1. Mapping a Max-Cut problem to the Boltzmann machine model. An example five-vertex undirected graph (cut cost = 19) is mapped and partitioned using a five-node Boltzmann machine (energy = –19).
Figure 2. The key concept of in situ computation within memristive arrays. Current summation within every bitline is used to compute the result of a dot product (I_ji = x_j x_i w_ji, so I_j = x_j Σ_i x_i w_ji).
memristive Boltzmann machine. The weights and the state variables are represented using memristors and transistors, respectively. A constant voltage supply (V_supply) is connected to parallel memristors through a shared vertical bitline. The total current pulled from the voltage source represents the result of the computation. This current (I_j) is set to zero when x_j is OFF; otherwise, the current is equal to the sum of the currents pulled by the individual cells connected to the bitline.
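The bitline behavior of Figure 2 can be modeled in a few lines. This is a behavioral sketch only (ideal currents, no sense-amplifier or wire effects): each cell contributes current proportional to its stored weight when both its state transistor and the bitline's own state bit are on, so each bitline reads out one gated row of the matrix-vector product.

```python
# Behavioral model of in-situ dot product on one bitline (Figure 2):
# cell current I_ji = x_j * x_i * w_ji, so the aggregate bitline current
# is I_j = x_j * sum_i(w_ji * x_i). Units and scaling are abstracted away.

def bitline_current(weights_j, x, j):
    """Aggregate current on bitline j: one row of W.x, gated by x_j."""
    if not x[j]:           # wordline transistor off: no current drawn
        return 0.0
    return sum(w * xi for w, xi in zip(weights_j, x))

# A tiny symmetric weight matrix and state vector.
W = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 3.0],
     [1.0, 3.0, 0.0]]
x = [1, 0, 1]
currents = [bitline_current(W[j], x, j) for j in range(3)]
print(currents)   # each entry is x_j * (W x)_j
```

Reading all bitlines in parallel therefore yields the whole W·x product, gated elementwise by x, without moving W out of the array.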
System Overview
Figure 3 shows an example of the proposed accelerator that resides on the memory bus and interfaces to a general-purpose computer system via the DDRx interface. This modular organization permits system designers to selectively integrate the accelerator in systems that execute combinatorial optimization and machine learning workloads. The memristive Boltzmann machine comprises a hierarchy of data arrays connected via a configurable interconnection network. A controller implements the interface between the accelerator and the processor. The data arrays can store the weights (w_ji) and the state variables (x_i); it is possible to compute the product of the weights and state variables in situ within the data arrays. The interconnection network permits the accelerator to retrieve and sum these partial products to compute the final result.
Fundamental Building Blocks
The fundamental building blocks of the proposed memristive Boltzmann machine are storage elements, a current summation circuit, a reduction unit, and a consensus unit. The design of these hardware primitives must strike a careful balance among multiple goals: high memory density, low energy consumption, and in situ, fine-grained parallel computation.
Storage Elements
As Figure 4 shows, the proposed accelerator employs the conventional one-transistor, one-memristor (1T-1R) array to store the connection weights (the matrix W). The relevant state variables (the vector x) are kept close to the data arrays holding the weights. The memristive 1T-1R array is used both for storing the weights and for computing the dot product between these weights and the state variables.
Current Summation Circuit
The result of a dot product computation is obtained by measuring the aggregate current pulled by the memory cells connected to a common bitline. Computing the sum of the bit products requires measuring the total amount of current per column and merging the partial results into a single sum of products. This is accomplished by local column sense amplifiers and a bit summation tree at the periphery of the data arrays.
Reduction Unit
To enable the processing of large matrices using multiple data arrays, an efficient data reduction unit is employed. The reduction units are used to build a reduction network, which sums the partial results as they are transferred from the data arrays to the controller. Large matrix columns are partitioned and stored in multiple data arrays, in which the partial sums are individually computed. The reduction network merges the partial
Figure 3. System overview. The proposed accelerator can be selectively integrated in general-purpose computer systems.
Figure 4. The proposed array structure. The conventional one-transistor, one-memristor (1T-1R) array structure is employed to build the proposed accelerator.
results into a single sum. Multiple such networks are used to process the weight columns in parallel. The reduction tree comprises a hierarchy of bit-serial adders to strike a balance between throughput and area efficiency.
Figure 5 shows the proposed reduction mechanism. The column is partitioned into four segments, each of which is processed separately to produce a total of four partial results. The partial results are collected by a reduction network comprising three bimodal reduction elements. Each element is configured using a local latch that operates in one of two modes: forwarding or reduction. Each reduction unit employs a full adder to compute the sum of its two inputs when operating in the reduction mode. In the forwarding mode, the unit transfers the content of one input upstream to the root.
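The bimodal element's behavior can be sketched functionally. This is a simplified behavioral model under two assumptions of mine: the elements are chained rather than arranged as a tree (to keep the sketch short), and the function names are invented; the real hardware uses latched mode bits and bit-serial full adders.

```python
# Behavioral sketch of the bimodal reduction network: each element either
# forwards input A upstream or emits A + B, per a locally latched mode bit.

def reduction_element(a, b, mode):
    """mode='reduction' -> full-adder sum; mode='forwarding' -> pass A."""
    return a + b if mode == "reduction" else a

def reduce_partials(partials, modes):
    """Fold partial sums through a linear chain of reduction elements.
    (The real design uses a tree; a chain keeps the sketch short.)"""
    acc = partials[0]
    for p, m in zip(partials[1:], modes):
        acc = reduction_element(acc, p, m)
    return acc

# Four column segments produce four partial dot products; with every
# element in reduction mode, the root observes their total.
partials = [5, 7, 1, 3]
print(reduce_partials(partials, ["reduction"] * 3))    # 16
print(reduce_partials(partials, ["forwarding"] * 3))   # 5
```

Setting a subset of elements to forwarding mode is what lets the bank tree merge results from only some sub-banks, as described for partitioned matrix columns below.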
Consensus Unit
The Boltzmann machine relies on a sigmoidal activation function, which plays a key role in both the model's optimization and machine learning applications. A precise implementation of the sigmoid function, however, would introduce unnecessary energy and performance overheads. The proposed memristive accelerator instead employs an approximation unit built from logic gates and lookup tables to implement the consensus function. As Figure 6 shows, the table contains 64 precomputed sample points of the sigmoid function f(x) = 1/(1 + e^(−x)), in which x varies between −4 and 4. The samples are evenly distributed on the x-axis. Six bits of a given fixed-point value are used to index the lookup table and retrieve a sample value. The most significant bits of the input data are ANDed and NORed to decide whether the input value is outside the domain [−4, 4]; if so, the sign bit is extended to implement f(x) = 0 or f(x) = 1; otherwise, the retrieved sample is chosen as the outcome.
System Architecture
The proposed architecture for the memristive Boltzmann machine comprises multiple banks and a controller (see Figure 7). The banks operate independently and serve memory and computation requests in parallel. For example, column 0 can be multiplied by the vector x at bank 0 while any location of bank 1 is being read. Within each bank, a set of sub-banks is connected to a shared interconnection tree. The bank interconnect is equipped with reduction units that contribute to the dot product computation. In the reduction mode, all sub-banks actively produce the partial results, while the reduction tree selectively merges the results from a subset of the sub-banks. This capability is useful for computing the large matrix columns partitioned across multiple sub-banks. Each sub-bank comprises multiple mats, each of which is composed of a controller and multiple data arrays. The sub-bank tree transfers the data bits between the mats
Figure 5. The proposed reduction element. The reduction element can operate in forwarding mode (output A) or reduction mode (output A + B).
Figure 6. The proposed unit for the activation function. A 64-entry lookup table of evenly sampled sigmoid points is used for approximating the sigmoid function.
Figure 7. Hierarchical organization of a chip. A chip controller is employed to manage the multiple independent banks.
and the bank tree in a bit-parallel fashion, thereby increasing the parallelism.
Data Organization
To amortize the cost of the peripheral circuitry, the data array's columns and rows are time-shared. Each sense amplifier is shared by four bitlines. The array is vertically partitioned along the bitlines into 16 stripes, multiples of which can be enabled per array computation. This allows the software to balance the accuracy of the computation against performance for a given application by quantizing more bit products into a fixed number of bits.
On-Chip Control
The proposed hardware can accelerate optimization and deep learning tasks by appropriately configuring the on-chip controller. The controller configures the reduction trees, maps the data to the internal resources, orchestrates the data movement among the banks, performs annealing or training tasks, and interfaces to the external bus.
DIMM Organization
To solve large-scale optimization and machine learning problems whose state spaces do not fit within a single chip, multiple accelerators can be interconnected on a DIMM. Each DIMM is equipped with control registers, data buffers, and a controller. This controller receives DDRx commands, data, and address bits from the external interface and orchestrates computation among all of the chips on the DIMM.
Software Support
To make the proposed accelerator visible to software, we memory-map its address range to a portion of the physical address space. A small fraction of the address space within every chip is mapped to an internal RAM array and is used to implement the data buffers and configuration parameters. Software configures the on-chip data layout and initiates the optimization by writing to a memory-mapped control register.
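The software side of this interface can be sketched as ordinary stores into a mapped aperture. Everything specific in the sketch is invented for illustration: the register offsets, the command encoding, and the use of a temporary file as a stand-in for the physical aperture (a real driver would map the device's address range, for example via /dev/mem or a UIO region).

```python
import mmap
import struct
import tempfile

# Sketch of the accelerator's software interface: the device aperture is
# memory-mapped and driven by plain loads/stores. Offsets and the command
# value are hypothetical; a file stands in for the physical aperture.
CTRL_REG_OFFSET = 0x0          # hypothetical control register
DATA_BUF_OFFSET = 0x100        # hypothetical on-chip data buffer
CMD_START_ANNEAL = 0x1         # hypothetical "start optimization" command

with tempfile.TemporaryFile() as f:
    f.truncate(4096)
    with mmap.mmap(f.fileno(), 4096) as aperture:
        # 1) write the edge weights into the mapped data buffer
        weights = struct.pack("<4h", 9, -8, -2, -10)
        aperture[DATA_BUF_OFFSET:DATA_BUF_OFFSET + len(weights)] = weights
        # 2) kick off the optimization via the control register
        struct.pack_into("<I", aperture, CTRL_REG_OFFSET, CMD_START_ANNEAL)
        assert struct.unpack_from("<I", aperture, CTRL_REG_OFFSET)[0] == 1
        print("control register written; annealing started (simulated)")
```

The point of the sketch is the programming model: no special instructions are needed, because configuration and launch are just writes to mapped addresses behind the DDRx interface.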
Evaluation Highlights
We modify the SESC simulator3 to model a baseline eight-core out-of-order processor. The memristive Boltzmann machine is interfaced to a single-core system via a single DDR3-1600 channel. We develop an RRAM-based processing-in-memory (PIM) baseline. The weights are stored within data arrays that are equipped with integer and binary multipliers to perform the dot products. The proposed consensus units, optimization and training controllers, and mapping algorithms are employed to accelerate the annealing and training processes. When compared to existing computer systems and GPU-based accelerators, the PIM baseline can achieve significantly higher performance and energy efficiency because it eliminates the unnecessary data movement on the memory bus, exploits data parallelism throughout the chip, and transfers the data across the chip using energy-efficient reduction trees. The PIM baseline is optimized so that it occupies the same area as the memristive accelerator.
Area, Delay, and Power Breakdown
We model the data array, sensing circuits, drivers, local array controller, and interconnect elements using Spice predictive technology models4 of n-channel and p-channel metal-oxide semiconductor transistors at 22 nm. The full adders, latches, and control logic are synthesized using FreePDK5 at 45 nm. We first scale the results to 22 nm using scaling parameters reported in prior work,6 and then scale them using the fan-out-of-4 (FO4) parameters for International Technology Roadmap for Semiconductors low-standby-power (LSTP) devices to model the impact of using a memory process on the peripheral and global circuitry.7,8 We use McPAT9 to estimate the processor power.
Figure 8 shows a breakdown of the computational energy, leakage power, computational latency, and die area among the different hardware components. The sense amplifiers and interconnects are the major contributors to the dynamic energy (41 and 36 percent, respectively). The leakage is caused mainly by the current summation circuits (40 percent) and other logic (59 percent), which includes the charge pumps, write drivers, and controllers. The computation latency, however, is due mainly to the interconnects (49 percent) and the wordlines and bitlines (32 percent).
Notably, only a fraction of the memory arrays must be active during a computational operation. A subset of the mats within each bank performs current sensing of the bitlines; the partial results are then serially streamed to the controller on the interconnect wires. The experiments indicate that a fully utilized accelerator integrated circuit (IC) consumes 1.3 W, which is below the peak power rating of a standard DDR3 chip (1.4 W).
Performance
Figure 9 shows the performance of the proposed accelerator, the PIM architecture, the multicore system running the multithreaded kernel, and the single-core system running the semidefinite programming (SDP) and MaxWalkSAT kernels. The results are normalized to the single-threaded kernel running on a single core. The results indicate that the single-threaded kernel (Boltzmann machine) is faster than the baselines (the SDP and MaxWalkSAT heuristics) by an average of 38 percent. The average performance gain for the multithreaded kernel is limited to 6 percent, owing to significant state-update overheads. PIM outperforms the single-threaded kernel by 9.31 times. The memristive accelerator outperforms all of the baselines (a 57.75 times speedup over the single-threaded kernel and 6.19 times over PIM). Moreover, the proposed accelerator performs the deep learning tasks 68.79 times faster than the single-threaded kernel and 6.89 times faster than PIM.
Energy
Figure 10 shows the energy savings as compared to PIM, the multithreaded kernel, SDP, and MaxWalkSAT. On average, energy is reduced by 25 times as compared to the single-threaded kernel implementation, which is 5.2 times better than PIM. For the deep learning tasks, the system energy is improved by 63 times, which is 5.3 times better than the energy consumption of PIM.
Sensitivity to Process Variations
Memristor parameters can deviate from their nominal values owing to process variations caused by line edge roughness, oxide thickness fluctuation, and random discrete doping. These parameter deviations result in cycle-to-cycle and device-to-device variabilities. We evaluate the impact of cycle-to-cycle variation on the computation's outcome by considering a bit error rate of 10⁻⁵ in all of the simulations, along the lines of the analysis provided in prior work.10 The proposed accelerator successfully tolerates such errors, with less than a 1 percent change in the outcome as compared to a perfect software implementation.
The resistance of RRAM cells can also fluctuate because of device-to-device variation, which can impact the outcome of a column summation (that is, a partial dot product).
Figure 8. Area, delay, and power breakdown. Peak energy (8.6 nJ), leakage power (405 mW), computational latency (6.59 ns), and die area (25.67 mm²) are estimated at the 22-nm technology node, broken down among data arrays, sense amplifiers, interconnects, and other logic.
Figure 9. Performance on optimization. Speedup of the various system configurations over the single-threaded kernel.
Figure 10. Energy savings on optimization. Energy savings of the baseline, the multithreaded kernel, PIM, and the memristive accelerator over the single-threaded kernel across the MC-1 through MC-10 and MS-1 through MS-10 benchmarks, with the geometric mean.
.............................................................
MAY/JUNE 2017 27
We use the geometric model of memristance variation proposed by Miao Hu and colleagues11 to conduct Monte Carlo simulations for 1 million columns, each comprising 32 cells. The experiment yields two distributions for low-resistance (RLO) and high-resistance (RHI) samples that are then approximated by normal distributions with respective standard deviations of 2.16 and 2.94 percent (similar to the prior work by Hu and colleagues). We then find a bit pattern that results in the largest summation error for each column. We observe up to 2.6 × 10⁻⁶ deviation in the column conductance, which can result in up to 1 bit error per summation. Subsequent simulation results indicate that the accelerator can tolerate this error, with less than a 2 percent change in the outcome quality.
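The device-to-device analysis can be reproduced in miniature. The sketch below runs a small Monte Carlo over a single 32-cell column; the 2.16 and 2.94 percent standard deviations are taken from the text, but the nominal RLO and RHI values and the conductance-sum readout are illustrative assumptions, not the article’s device parameters.

```python
import random

R_LO, R_HI = 1e4, 1e6                 # ohms; assumed nominal resistances
SIGMA_LO, SIGMA_HI = 0.0216, 0.0294   # 2.16% and 2.94% std devs (from the text)
CELLS = 32                            # cells per column (from the text)

def column_conductance(bits, rng):
    """Sum the conductances of one column's cells under device variation."""
    total = 0.0
    for b in bits:
        nominal = R_LO if b else R_HI
        sigma = SIGMA_LO if b else SIGMA_HI
        total += 1.0 / rng.gauss(nominal, sigma * nominal)
    return total

rng = random.Random(0)
bits = [1] * CELLS    # all low-resistance cells maximize the absolute deviation
ideal = CELLS / R_LO  # column conductance with no variation
samples = [column_conductance(bits, rng) for _ in range(10_000)]
worst_dev = max(abs(g - ideal) for g in samples)
```

Because each cell’s deviation is independent, the relative error of the 32-cell sum is far smaller than the per-cell spread, which is why the summation error stays within about one bit.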
Finite Switching Endurance
RRAM cells exhibit finite switching endurance ranging from 1e6 to 1e12 writes. We evaluate the impact of finite endurance on an accelerator module’s lifetime. Because wear is induced only by the updating of the weights stored in memristors, we track the number of times that each weight is written. The edge weights are written once in optimization problems and multiple times in deep learning workloads. (Updating the state variables, stored in static CMOS latches, does not induce wear on RRAM.) We track the total number of updates per second to estimate the lifetime of an eight-chip DIMM. Assuming endurance parameters of 1e6 and 1e8 writes, the respective module lifetimes are 3.7 and 376 years for optimization and 1.5 and 151 years for deep learning.
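The lifetime figures follow from simple arithmetic: lifetime is endurance divided by the sustained write rate to the most frequently written cell. In the sketch below, the write rate is a hypothetical stand-in; the article derives it from measured update counts that are not reproduced here.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600

def lifetime_years(endurance_writes, writes_per_second):
    """Years until the most frequently written cell exhausts its endurance."""
    return endurance_writes / (writes_per_second * SECONDS_PER_YEAR)

# Hypothetical sustained update rate, chosen only to illustrate the calculation.
rate = 0.01  # writes per second to the hottest cell
years_1e6 = lifetime_years(1e6, rate)
years_1e8 = lifetime_years(1e8, rate)
```

The model makes the scaling explicit: a 100-times improvement in endurance yields a 100-times longer module lifetime, matching the ratios quoted in the text.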
Data movement between memory cells and processor cores is the primary contributor to power dissipation in computer systems. A recent report by the US Department of Energy identifies the power consumed in moving data between the memory and the processor as one of the 10 most significant challenges in the exascale computing era.12 The same report indicates that by 2020, the energy cost of moving data across the memory hierarchy will be orders of magnitude higher than the cost of performing a double-precision floating-point operation.
Emerging large-scale applications such as combinatorial optimization and deep learning tasks are even more influenced by memory bandwidth and power problems. In these applications, massive datasets have to be iteratively accessed by the processor cores to achieve a desirable output quality, which consumes excessive memory bandwidth and system energy. To address this problem, numerous software and hardware optimizations using GPUs, clusters based on the message passing interface (MPI), field-programmable gate arrays, and application-specific integrated circuits have been proposed in the literature. These proposals focus on energy-efficient computing with reduced data movement among the processor cores and memory arrays. Their performance and energy efficiency are limited by the read accesses that are necessary to move the operands from the memory arrays to the processing units. A memory subsystem that allows for in situ computation within its data arrays could address these limitations by eliminating the need to move raw data between the memory arrays and the processor cores.
Designing a platform capable of performing in situ computation is a significant challenge. In addition to storage cells, extra circuits are required to perform analog computation within the memory cells, which decreases memory density and area efficiency. Moreover, the power dissipation and area consumption of the components required for signal conversion between the analog and digital domains could become serious limiting factors. Hence, it is critical to strike a careful balance between the accelerator’s performance and complexity.
The memristive Boltzmann machine is the first memory-centric accelerator that addresses these challenges. It provides a new framework for designing memory-centric accelerators. Large-scale combinatorial optimization problems and deep learning tasks are mapped onto a memory-centric, non-von Neumann computing substrate and solved in situ within the memory cells, with orders of magnitude greater performance and energy efficiency than contemporary supercomputers. Unlike PIM-based accelerators, the proposed accelerator enables computation within conventional data arrays to achieve the
..............................................................................................................................................................................................
TOP PICKS
............................................................
28 IEEE MICRO
energy-efficient and massively parallel processing required for the Boltzmann machine model.
We expect the proposed memory-centric accelerator to set off a new line of research on in situ approaches to accelerate large-scale problems such as combinatorial optimization and deep learning tasks and to significantly increase the performance and energy efficiency of future computer systems. MICRO
Acknowledgments
This work was supported in part by NSF grant CCF-1533762.
References
1. E. Aarts and J. Korst, Simulated Annealing and Boltzmann Machines: A Stochastic Approach to Combinatorial Optimization and Neural Computing, John Wiley & Sons, 1989.
2. S.E. Fahlman, G.E. Hinton, and T.J. Sejnowski, “Massively Parallel Architectures for AI: NETL, Thistle, and Boltzmann Machines,” Proc. Assoc. Advancement of AI (AAAI), 1983, pp. 109–113.
3. J. Renau et al., “SESC Simulator,” Jan. 2005; http://sesc.sourceforge.net.
4. W. Zhao and Y. Cao, “New Generation of Predictive Technology Model for Sub-45nm Design Exploration,” Proc. Int’l Symp. Quality Electronic Design, 2006, pp. 585–590.
5. “FreePDK 45nm Open-Access Based PDK for the 45nm Technology Node,” 29 May 2014; www.eda.ncsu.edu/wiki/FreePDK.
6. M.N. Bojnordi and E. Ipek, “PARDIS: A Programmable Memory Controller for the DDRx Interfacing Standards,” Proc. 39th Ann. Int’l Symp. Computer Architecture (ISCA), 2012, pp. 13–24.
7. N.K. Choudhary et al., “FabScalar: Composing Synthesizable RTL Designs of Arbitrary Cores Within a Canonical Superscalar Template,” Proc. 38th Ann. Int’l Symp. Computer Architecture, 2011, pp. 11–22.
8. S. Thoziyoor et al., “A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies,” Proc. 35th Int’l Symp. Computer Architecture (ISCA), 2008, pp. 51–62.
9. S. Li et al., “McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures,” Proc. 36th Int’l Symp. Computer Architecture (ISCA), 2009, pp. 468–480.
10. D. Niu et al., “Impact of Process Variations on Emerging Memristor,” Proc. 47th ACM/IEEE Design Automation Conf. (DAC), 2010, pp. 877–882.
11. M. Hu et al., “Geometry Variations Analysis of TiO2 Thin-Film and Spintronic Memristors,” Proc. 16th Asia and South Pacific Design Automation Conf., 2011, pp. 25–30.
12. Top Ten Exascale Research Challenges, tech. report, Advanced Scientific Computing Advisory Committee Subcommittee, US Dept. of Energy, 2014.
Mahdi Nazm Bojnordi is an assistant professor in the School of Computing at the University of Utah. His research focuses on computer architecture, with an emphasis on energy-efficient computing. Nazm Bojnordi received a PhD in electrical engineering from the University of Rochester. Contact him at [email protected].
Engin Ipek is an associate professor in the Department of Electrical and Computer Engineering and the Department of Computer Science at the University of Rochester. His research interests include energy-efficient architectures, high-performance memory systems, and the application of emerging technologies to computer systems. Ipek received a PhD in electrical and computer engineering from Cornell University. He has received the 2014 IEEE Computer Society TCCA Young Computer Architect Award, two IEEE Micro Top Picks awards, and an NSF CAREER award. Contact him at [email protected].
ANALOG COMPUTING IN A MODERN CONTEXT: A LINEAR ALGEBRA ACCELERATOR CASE STUDY

THIS ARTICLE PRESENTS A PROGRAMMABLE ANALOG ACCELERATOR FOR SOLVING SYSTEMS OF LINEAR EQUATIONS. THE AUTHORS COMPENSATE FOR COMMONLY PERCEIVED DOWNSIDES OF ANALOG COMPUTING. THEY COMPARE THE ANALOG SOLVER’S PERFORMANCE AND ENERGY CONSUMPTION AGAINST AN EFFICIENT DIGITAL ALGORITHM RUNNING ON A GENERAL-PURPOSE PROCESSOR. FINALLY, THEY CONCLUDE THAT PROBLEM CLASSES OUTSIDE OF SYSTEMS OF LINEAR EQUATIONS COULD HOLD MORE PROMISE FOR ANALOG ACCELERATION.
As we approach the limits of silicon scaling, it behooves us to reexamine fundamental assumptions of modern computing, even well-served ones, to see if they are hindering performance and efficiency. The analog accelerator discussed in this article breaks two fundamental assumptions in modern computing: in contrast to using digital binary numbers, an analog accelerator encodes numbers using the full range of circuit voltage and current. Furthermore, in contrast to operating step by step on clocked hardware, an analog accelerator updates its values continuously. These different hardware assumptions can provide substantial gains but would need different abstractions and cross-layer optimizations to support various modern workloads. We draw inspiration from an immense amount of prior work in analog electronic computing (see the sidebar, “Related Work in Analog Computing”).
To support modern workloads in the digital era, we observed that modern scientific computing and big data problems are converted to linear algebra problems. To maximize analog acceleration’s usefulness, we explored whether analog accelerators are effective at solving systems of linear equations, the single most important numerical primitive in continuous mathematics.
For readers not familiar with linear algebra, systems of linear equations are often solved using iterative numerical linear algebra methods, which start with an initial guess for the entire solution vector and update the solution vector over iterations of the algorithm, each step further minimizing the difference between the guess and the correct solution.1
Yipeng Huang
Ning Guo
Mingoo Seok
Yannis Tsividis
Simha Sethumadhavan
Columbia University
Published by the IEEE Computer Society. 0272-1732/17/$33.00 © 2017 IEEE
Efficient iterative methods such as the conjugate gradient method are increasingly important because intermediate guess vectors are a good approximation of the correct solution.
In discrete-time iterative linear algebra algorithms, the solution vector changes in steps, and each step is characterized by a step size. The step size affects the algorithm’s efficiency and requires many processor cycles to calculate. In the conjugate gradient method, for example, the step size is calculated from previous step sizes and the gradient magnitude, and this calculation takes up half of the multiplication operations in each conjugate gradient step.
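For reference, a textbook conjugate gradient loop (standard form, not the authors’ implementation) makes the step-size bookkeeping concrete: the extra dot products that produce the step size alpha each iteration are exactly the scalar work the continuous-time formulation sidesteps.

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=100):
    """Solve Ax = b for a symmetric positive-definite A (list of lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                       # residual r = b - Ax (x starts at zero)
    p = r[:]                       # search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        # Step-size computation: these dot products are the per-iteration
        # scalar overhead discussed in the text.
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# A 2x2 symmetric positive-definite system at the scale of the Figure 1 example.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
```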
In an analog accelerator, systems of linear equations can also be thought of as solved via an iterative algorithm, with an important distinction that the guess vector is updated using infinitesimally small steps, over infinitely many iterations. This continuous trajectory from the original guess vector to the correct solution is an ordinary differential equation (ODE), which states that the change in a set of variables is a function of the variables’ present value. We can naturally solve ODEs using an analog accelerator.
We give an example of an analog accelerator solving an ODE that in turn solves a system of linear equations. At the analog accelerator’s heart are integrators, which contain the present guess of the solution vector represented as an analog signal evolving as a function of time (see Figure 1). We perform operations on this solution vector by feeding the vector through multiplier and summation units. Digital-to-analog converters (DACs) provide constant coefficients and biases. Using these function units, we create a linear function of the solution vector, which is
Related Work in Analog Computing
Analog computers of the mid-20th century were widely used to solve scientific computing problems, described as ordinary differential equations (ODEs). Analog computers would solve those ODEs by setting up analog electronic circuits, whose time-dependent voltage and current were described by corresponding ODEs. The analog computers therefore were computational analogies of physical models.

Our group revisited this model of analog computing for solving nonlinear ODEs, which frequently appear in cyber-physical systems workloads, with higher performance and efficiency compared to digital systems.1,2 The analog, continuous-time output of analog computing is especially suited for embedded systems applications in which sensor inputs are analog and actuators can use such results directly. The question for this article is whether analog acceleration can help conventional workloads in which inputs and outputs are digital.

Modern scientific computation and big data workloads are phrased as linear algebra problems. In this article, our analog accelerator solves an ODE that does steepest descent, in turn solving a linear algebra problem. Such a solving method belongs to a broad class of ODEs that can solve other numerical problems, including nonlinear systems of equations.3,4 These ODEs point to other ways analog accelerators can support modern workloads.

We draw a distinction between our approach to analog acceleration and that of using analog circuits to build neural networks.5,6 Most importantly, we do not use training to get a network topology and weights that solve a given problem. No prior knowledge of the solution or training set of solutions is required. The analog acceleration technique presented in this article is a procedural approach to solving problems: there is a predefined way to convert a system of linear equations under study into an analog accelerator configuration.

References
1. G. Cowan, R. Melville, and Y. Tsividis, “A VLSI Analog Computer/Digital Computer Accelerator,” IEEE J. Solid-State Circuits, vol. 41, no. 1, 2006, pp. 42–53.
2. N. Guo et al., “Energy-Efficient Hybrid Analog/Digital Approximate Computation in Continuous Time,” IEEE J. Solid-State Circuits, vol. 51, no. 7, 2016, pp. 1514–1524.
3. M.T. Chu, “On the Continuous Realization of Iterative Processes,” SIAM Rev., vol. 30, no. 3, 1988, pp. 375–387.
4. O. Bournez and M.L. Campagnolo, A Survey on Continuous Time Computations, Springer, 2008, pp. 383–423.
5. R. LiKamWa et al., “RedEye: Analog ConvNet Image Sensor Architecture for Continuous Mobile Vision,” SIGARCH Computer Architecture News, vol. 44, no. 3, 2016, pp. 255–266.
6. A. Shafiee et al., “ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars,” SIGARCH Computer Architecture News, vol. 44, no. 3, 2016, pp. 14–26.
fed back to the inputs of the integrators. In this fully formed circuit, the solution vector’s time derivative is a linear function of the solution vector itself.
The integrators are charged to an initial condition representing the iterative method’s initial guess. The accelerator starts computation by releasing the integrators, allowing their outputs to deviate from their initial values. The variables contained in the integrators converge on the correct solution vector that satisfies the system of linear equations. When the analog variables are steady, we sample them using analog-to-digital converters (ADCs).
These techniques were used in early analog computers2–4 and have recently been explored in small-scale experiments with analog computation.5,6
Analog Linear Algebra Advantages
Solving linear algebra problems using ODEs on an analog accelerator has several potential advantages compared to using a discrete-time algorithm on a digital general-purpose or special-purpose system.
Explicit Data-Graph Execution Architecture
The analog accelerator uses an explicit dataflow graph in which the sequence of operations on data is realized by connecting functional units end to end. During computation, analog signals representing intermediate results flow from one unit to the next, so there are no overheads in fetching and decoding instructions, and there are no accesses to digital memory. The former is a benefit of digital accelerators, too, but the latter is a unique benefit of the analog computational model.
Continuous Time Speed
The analog accelerator hardware and algorithm both operate in continuous time. The values contained in the integrators are continuously being updated, and the update rate is not limited by a finite clock frequency, which is the limiting factor in discrete-time hardware. Furthermore, a continuous-time ODE solution has no concern about the correct step size to take to update the solution vector, in contrast to discrete-time iterative algorithms, in which computing the correct step size represents most operations needed per algorithm iteration. In these regards, the analog accelerator is potentially faster than discrete-time architectures. Finally, no power-hungry clock signal is needed to synchronize operations.
Continuous Value Efficiency
The analog accelerator solves the system of linear equations using real numbers encoded in voltage and current, so each wire can represent the full range of values in the analog accelerator. In contrast, changing the value of a digital binary number affects many bits: sweeping an 8-bit unsigned integer from 0 to 255 needs 502 binary inversions, whereas a more economical Gray encoding still needs 255 inversions. Furthermore, multiplication, addition, and integration are all comparatively straightforward on analog variables compared to digital ones. This contrasts with floating-point arithmetic, in which the logarithmically encoded exponent portion of digital floating-point variables makes it complicated to add and subtract variables. In these regards, analog encoding is potentially more efficient than digital, binary encodings.
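The bit-inversion counts quoted above can be checked directly. This stdlib-only snippet counts the bits that flip at each increment when sweeping an 8-bit counter in plain binary and in reflected Gray code:

```python
def gray(n):
    """Standard reflected binary Gray code of n."""
    return n ^ (n >> 1)

def transitions(encode, lo=0, hi=255):
    """Total bits flipped while sweeping an encoded counter from lo to hi."""
    return sum(bin(encode(i) ^ encode(i + 1)).count("1") for i in range(lo, hi))

binary_flips = transitions(lambda i: i)  # -> 502
gray_flips = transitions(gray)           # -> 255 (one flip per increment)
```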
Figure 1. Schematic of an analog accelerator for solving Ax = b, a linear system of two equations with two unknown variables. Matrix A is a known matrix of coefficients realized using multipliers; x is an unknown vector contained in integrators; b is a known vector of biases generated by digital-to-analog converters (DACs). Signals are encoded as analog current and are copied using current mirror fan-out blocks. The solver converges if matrix A is positive definite, which is usually true for the problems we discuss.
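The continuous-time behavior of the Figure 1 circuit can be emulated numerically as the gradient-flow ODE dx/dt = b − Ax. The forward-Euler sketch below is only a digital stand-in for the integrators; the step size and step count are arbitrary choices, not hardware parameters.

```python
def analog_solve(A, b, dt=1e-3, steps=20_000):
    """Emulate integrators whose inputs are fed -Ax + b, as in Figure 1."""
    n = len(b)
    x = [0.0] * n  # integrator initial conditions: an all-zero initial guess
    for _ in range(steps):
        # Each integrator input is the linear feedback function b - Ax.
        dxdt = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        x = [x[i] + dt * dxdt[i] for i in range(n)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]  # positive definite, so the flow converges
b = [1.0, 2.0]
x = analog_solve(A, b)
```

Because A is positive definite, the flow decays exponentially toward the fixed point Ax = b, which is the steady state the ADCs would sample.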
Analog Accelerator Architecture
The analog accelerator acts as a peripheral to a digital host processor. The analog accelerator interface accepts an accelerator configuration, which entails the connectivity between function units, multiplier gains, DAC constants, and integrator initial conditions. Additionally, the interface allows calibration, computation control, reading of output values, and reporting of exceptions. Table 1 summarizes the analog accelerator’s essential system calls and corresponding instructions.
Analog Accelerator Physical Prototype
We tested analog acceleration for linear algebra using a prototype reconfigurable analog accelerator silicon chip implemented in 65-nm CMOS technology (see Figures 2 and 3). The accelerator comprises four integrators, plus accompanying DACs, multipliers, and ADCs connected using crossbar switches. In our analog accelerator, electrical currents represent variables. Fan-out current mirrors allow the analog circuit to copy variables by replicating values onto different branches. To sum variables, currents are added together by joining branches. Eight multipliers allow variable-variable and constant-variable multiplication.
The physical prototype validates the analog circuits’ functionality and allows physical measurement of component area and energy. Additionally, the chip allows rapid prototyping of accelerator algorithms.
Using physical timing, power, and area measurements recorded by Ning Guo and colleagues7 and summarized in Table 2, we built a model that predicts the properties of larger-scale analog accelerators. In Table 2, “analog core power” and “analog core area” show the power and area of each block that forms the analog signal path. The noncore transistors and nets not involved in analog computation include calibration and testing circuits and registers. The core area and power
Table 1. Analog accelerator instruction set architecture.*

Instruction type | Instruction | Parameters | Instruction effect
Control | Initialize | — | Find input and output offset and gain calibration settings for all function units.
Configuration | Set connection | Source, destination | Set a crossbar switch to create an analog current connection between two analog interfaces.
Configuration | Set initial condition | Pointer to an integrator, initial condition value | Charge integrator capacitors to have the ODE initial condition value.
Configuration | Set multiplier gain | Pointer to a multiplier, gain value | Amplify values by a constant coefficient gain.
Configuration | Set constant offset | Pointer to a DAC, offset value | Add a constant bias to values.
Configuration | Set timeout time | Number of digital controller clock cycles | Stop analog computation after the specified time once started.
Configuration | Configuration commit | — | Finish configuration and write any new changes to chip registers.
Control | Execution start | — | Start analog computation by letting integrators deviate from initial conditions.
Control | Execution stop | — | Stop analog computation by holding integrators at their present value.
Data input | Enable analog input | Pointer to chip analog input channel | Open a chip analog input channel, allowing multiple chips to participate in computation.
Data output | Read analog value | Pointer to an ADC, memory location to store result | Read analog computation results from ADCs and store the values.
Exception | Read exceptions | Memory location to store result | Read the exception vector indicating whether analog units exceeded their valid range.

*ADC: analog-to-digital converter; DAC: digital-to-analog converter; ODE: ordinary differential equation.
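A host-side driver wrapping the Table 1 calls might look like the following sketch. The class, method names, and endpoint strings are hypothetical, invented here only to show the stage-configuration/commit/execute flow implied by the ISA, and do not correspond to any real driver API.

```python
# Hypothetical software mirror of the Table 1 ISA; no real hardware is driven.
class AnalogAccelerator:
    def __init__(self):
        self.pending = []    # configuration staged before a commit
        self.committed = []  # configuration written to "chip registers"
        self.running = False

    def set_connection(self, src, dst):
        self.pending.append(("connect", src, dst))

    def set_initial_condition(self, integrator, value):
        self.pending.append(("init_cond", integrator, value))

    def set_multiplier_gain(self, multiplier, gain):
        self.pending.append(("gain", multiplier, gain))

    def commit(self):
        """Configuration commit: write staged changes to chip registers."""
        self.committed.extend(self.pending)
        self.pending.clear()

    def start(self):
        """Execution start: integrators deviate from initial conditions."""
        assert not self.pending, "commit configuration before starting"
        self.running = True

    def stop(self):
        """Execution stop: hold integrators at their present value."""
        self.running = False

# Typical flow: stage configuration, commit, then run.
acc = AnalogAccelerator()
acc.set_multiplier_gain("mul0", -4.0)
acc.set_initial_condition("int0", 0.0)
acc.set_connection("mul0.out", "int0.in")
acc.commit()
acc.start()
```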
scale up and down for different analog bandwidth designs. We explore how different bandwidth choices influence analog accelerator performance and efficiency.
Mitigation of Analog Linear Algebra Disadvantages
We encountered several drawbacks of analog computing, including limited accuracy, precision, and scalability. We tackled each of these problems in the context of solving linear algebra, although the techniques we discuss apply to other styles of analog computer architecture.
Improve Accuracy Using Calibration and Analog Exceptions
Analog circuits provide limited accuracy compared to binary ones, in which values are unambiguously interpreted as 0 or 1. Analog hardware uses the full range of values. Subtle variations in analog hardware due to process and temperature variation lead to undesirable variations in the computation result.
We identify three main sources of inaccuracy in analog hardware: gain error, offset error, and nonlinearity. Gain and offset errors refer to inaccurate results in multiplication and summation, which can be calibrated away using additional DACs that adjust circuit parameters to shift signals and adjust gains. These DACs are controlled by registers, whose contents are set using binary search during calibration by the digital host. The settings vary across different copies of functional units and accelerator chips, but remain constant during accelerator operation.
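The register-driven calibration can be illustrated with a binary search over a trim-DAC code. The 8-bit code width and the toy offset model below are assumptions made for the sketch, not measured characteristics of the chip.

```python
def calibrate_offset(measure, bits=8):
    """Binary-search the trim code that drives the measured offset to zero.

    `measure(code)` returns the residual offset for a given trim code and is
    assumed to be monotonically decreasing in the code, as for a trim DAC.
    """
    lo, hi = 0, (1 << bits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if measure(mid) > 0:   # offset still positive: raise the trim code
            lo = mid + 1
        else:
            hi = mid
    return lo

# Toy device model: the offset is cancelled at trim code 173 (hypothetical).
measure = lambda code: 173 - code
best = calibrate_offset(measure)
```

An 8-bit search settles in at most eight measurements, which is why calibration can be repeated cheaply across every functional unit copy.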
Nonlinearity errors occur when changes in inputs result in disproportionate changes in outputs, and when analog values exceed the range in which the circuit’s behavior is mostly linear, resulting in clipping of the output, akin to overflow of digital number representations. At the same time, the host observes whether the dynamic range is not fully used, which could result in low precision. When either exception type occurs, the original problem is rescaled to fit in the dynamic range of the analog accelerator, and computation is reattempted.
The combination of widespread calibration and exception checking ensures that the
Figure 2. Analog accelerator architecture diagram, showing rows of analog, mixed-signal, and digital components (8 fan-out blocks, 4 integrators, 8 multiplier/VGAs, 4 analog inputs and outputs, continuous-time ADCs and DACs, SRAMs, and an SPI controller), along with the crossbar interconnect.7 “CT” refers to continuous time. Static RAMs (SRAMs) are used as lookup tables for nonlinear functions (not used for the purposes of this article).
Figure 3. Die photo of an analog accelerator chip fabricated in 65-nm CMOS technology, showing major components.7 “VGAs” are variable-gain amplifiers. The die area is 3.8 mm².
analog solution’s accuracy is within the sampling resolution of the ADCs.
Improve Sampling Precision by Focusing on Analog Steady State
High-frequency and high-precision analog-to-digital conversion is costly. So, instead of trying to capture the time-dependent analog waveform, we use the analog accelerator as a linear algebra solver by solving a convergent ODE. When the analog accelerator outputs are steady, we can sample the solutions once with higher-precision ADCs.
Even then, high-precision ADCs still fall short of the precision in floating-point numbers. Even though the analog variables are themselves highly precise, sampling the variables using ADCs can result in only 8 to 12 bits of precision. We get higher-precision results by running the analog accelerator multiple times. We use the digital host computer to find the residual error in the solution, and we set up the analog accelerator to solve a new problem, focusing on the residual. Each problem has smaller-magnitude variables than previous runs, which lets us scale up the variables to fit the dynamic range of the analog hardware. We can iterate between analog and digital hardware a few times to get a more precise result than using the analog hardware alone.
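The precision-recovery loop described above is essentially iterative refinement: solve, read the result out at ADC resolution, compute the residual digitally, rescale, and solve again. In the sketch below, the analog solve is modeled as an exact 2x2 solve followed by an n-bit quantization; the quantizer, bit width, and test system are illustrative stand-ins, not the chip’s measured behavior.

```python
def solve2x2(A, b):
    """Stand-in for the analog solver: exact 2x2 solve by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def adc(v, bits=8):
    """Quantize a value in [-1, 1] to an n-bit ADC reading (toy model)."""
    step = 2.0 / (1 << bits)
    return round(v / step) * step

def refine(A, b, rounds=3, bits=8):
    x = [0.0, 0.0]
    for _ in range(rounds):
        # The digital host computes the residual at full precision.
        r = [b[i] - sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]
        norm = max(abs(v) for v in r)
        if norm == 0.0:
            break
        # Scale the residual problem up to fill the dynamic range, solve it,
        # read the correction out through the ADC, and scale it back down.
        d_scaled = solve2x2(A, [v / norm for v in r])
        d = [adc(v, bits) * norm for v in d_scaled]
        x = [x[i] + d[i] for i in range(2)]
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = refine(A, b)
```

Each round shrinks the residual by roughly the ADC’s quantization factor, so a few analog-digital round trips recover precision well beyond a single 8-bit read-out.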
Tackle Larger Problems by Accelerating Sparse Linear Algebra Subproblems
Modern workloads routinely need thousands of variables, corresponding to as many analog integrators in the accelerator, exceeding the area constraints of realistic analog accelerators. Furthermore, the analog datapath is fixed during continuous-time operation, so there is no way to dynamically load variables from and store variables to main memory.
Analog accelerators can solve large-scale sparse linear algebra problems by accelerating the solving of smaller subproblems. This lets analog accelerators solve problems containing more variables than the number of integrators in the analog accelerator.
In such a scheme, the analog accelerator finds the correct solution for a subproblem. To get overall convergence across the entire problem, the set of subproblems would be solved several times, using an outer loop iterating across the subproblems. Typically, the outer iteration is an iterative method operating on vectors, which does not have convergence properties as strong as those of iterative methods on individual numbers. Therefore, it is still desirable to ensure that the block matrices captured in the analog accelerator are large, so that more of the problem is solved using the efficient lower-level solver.
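One way to sketch this decomposition is a block Gauss-Seidel outer loop: each pass solves one diagonal block exactly (standing in for the analog subproblem solve) while folding the off-block terms into the right-hand side. The 1D-Laplacian test matrix, block size, and sweep count are illustrative choices, not the article’s benchmark.

```python
def dense_solve(M, v):
    """Small dense solve by Gaussian elimination with partial pivoting."""
    n = len(v)
    M = [row[:] + [v[i]] for i, row in enumerate(M)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def block_gauss_seidel(A, b, block, sweeps=300):
    """Outer loop over subproblems sized to fit the accelerator's integrators."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for s in range(0, n, block):
            e = min(s + block, n)
            # Fold contributions from variables outside the block into the RHS.
            rhs = [b[i] - sum(A[i][j] * x[j] for j in range(n)
                              if not s <= j < e) for i in range(s, e)]
            # Stand-in for the analog solve of the block system.
            x[s:e] = dense_solve([row[s:e] for row in A[s:e]], rhs)
    return x

# 8-variable 1D Laplacian system, solved with 4-integrator-sized blocks.
n, blk = 8, 4
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x = block_gauss_seidel(A, b, blk)
```

Larger blocks couple fewer variables through the slow outer loop, which mirrors the text’s point that bigger captured block matrices push more of the work into the fast lower-level solver.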
Evaluation
We compare the analog accelerator and digital approaches in terms of performance, hardware area, and energy consumption, while varying the number of problem variables and the choice of analog accelerator component bandwidth, a measure of how quickly the analog circuit responds to changes.
Analog Bandwidth Model
The prototype chip has a relatively low analog bandwidth of 20 kHz, a design that ensures that the prototype chip accurately solves for time-dependent solutions in ODEs. However, the prototype’s small bandwidth makes it unrepresentative of an analog accelerator designed to solve time-independent algebraic equations, in which accuracy degradation in time-dependent behavior has no impact on the final steady-state output. We scale up the
Table 2. Summary of analog accelerator components.

Unit type | Analog core power | Total unit power | Analog core area | Total unit area
Integrator | 22 µW | 28 µW | 0.016 mm² | 0.040 mm²
Fan-out | 30 µW | 37 µW | 0.005 mm² | 0.015 mm²
Multiplier | 39 µW | 49 µW | 0.024 mm² | 0.050 mm²
ADC | 27 µW | 54 µW | 0.049 mm² | 0.054 mm²
DAC | 4.6 µW | 4.6 µW | 0.013 mm² | 0.022 mm²
model’s bandwidth, within reason, up to 1.3 MHz.
Increasing the bandwidth of the analog circuit design proportionally decreases the solution time, but also increases area and energy consumption. As Figures 4 and 5 show, we assume that an analog accelerator with bandwidth multiplied by a factor of a has higher power and area consumption in the core analog circuits, by a factor of a.
The projected analog power figures are significantly below the thermal design power of clocked digital designs of equal area. Even in the designs that fill a 600 mm² die size, the analog accelerator uses about 0.7 W in the base prototype design and about 1.0 W in the design with 320 kHz of bandwidth.
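The scaling assumption amounts to a simple first-order model: core power and core area grow linearly with the bandwidth factor a, while the non-core overheads stay fixed (our reading of the text, applied here to the Table 2 integrator figures for one unit).

```python
# Table 2 figures for one integrator (22 uW core of 28 uW total;
# 0.016 mm^2 core of 0.040 mm^2 total) at the 20-kHz prototype bandwidth.
CORE_POWER, TOTAL_POWER = 22e-6, 28e-6  # watts
CORE_AREA, TOTAL_AREA = 0.016, 0.040    # mm^2
BASE_BW = 20e3                          # Hz

def scaled_unit(bandwidth):
    """First-order model: core power and area scale linearly with bandwidth."""
    a = bandwidth / BASE_BW
    power = a * CORE_POWER + (TOTAL_POWER - CORE_POWER)  # non-core held fixed
    area = a * CORE_AREA + (TOTAL_AREA - CORE_AREA)
    return power, area

# The 320-kHz design point from the evaluation is a 16x bandwidth factor.
unit_power, unit_area = scaled_unit(320e3)
```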
Sparse Linear Algebra Case Study
We use as our test case a sparse system of linear equations derived from a multigrid elliptic partial differential equation (PDE) solver. In multigrid PDE solvers, the overall PDE is converted to several linear algebra problems with varying spatial resolution. Lower-resolution subproblems are quickly solved and fed to high-resolution subproblems, aiding the high-resolution problem to converge faster. The linear algebra subproblems can be solved approximately. Overall accuracy of the solution is guaranteed by iterating the multigrid algorithm. Because perfect convergence is not required, less stable, inaccurate, and low-precision techniques, such as analog acceleration, can support multigrid.
In our case, we compare the analog accelerator designs to a conjugate gradient algorithm running on a CPU, solving to equal (relatively low) solution precision, equivalent to the precision obtained from one run of the analog accelerator equipped with high-resolution ADCs. On the digital side, the numerical iteration stops short of the machine precision provided by high-precision digital floating-point numbers.
The conjugate gradient algorithm uses asustained 20 clock cycles per numerical itera-tion per row element. The comparisonassumes identical transfer cost of data frommain memory to the accelerator versus theCPU: the energy needed to transfer data toand from memory is not modeled, due to therelatively small problem sizes, allowing theprogram data to be entirely cache resident.
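The digital-side cost model above can be sketched directly. The clock frequency parameter is an illustrative assumption (not specified at this point in the text), and memory-transfer energy is deliberately omitted, as in the comparison itself.

```python
# Sketch of the digital cost model: conjugate gradient is charged a
# sustained 20 clock cycles per numerical iteration per row element,
# and memory-transfer costs are omitted (the working set is assumed
# cache resident). The default clock rate is an illustrative assumption.

CYCLES_PER_ITER_PER_ROW_ELEMENT = 20


def cg_runtime_us(row_elements, iterations, clock_hz=2.0e9):
    """Estimated CPU solve time in microseconds."""
    cycles = CYCLES_PER_ITER_PER_ROW_ELEMENT * row_elements * iterations
    return cycles / clock_hz * 1e6


# Example: 1,000 row elements, 100 iterations at an assumed 2 GHz.
t_us = cg_runtime_us(1_000, 100)
```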
As Figure 6 shows, we found that an optimal analog accelerator design that balances performance and the number of integrators
[Figure 4 plot: maximum activity power (W) versus total grid points, for 20-KHz, 80-KHz, 320-KHz, and 1.3-MHz designs.]

Figure 4. Power versus analog accelerator size for various bandwidth choices. We observe that analog circuits operate faster when the internal node voltages representing variables change more quickly. We hold the capacitance fixed to that of the prototype's design and use larger currents, which draw more power, to charge and discharge the node capacitances in the signal paths carrying variables.
[Figure 5 plot: area (mm²) versus total grid points, for 20-KHz, 80-KHz, 320-KHz, and 1.3-MHz designs.]

Figure 5. Area versus analog accelerator size for various bandwidth choices. We observe that the transistor aspect ratio W/L must increase to increase the current, and therefore the bandwidth, of the design. L is kept at the minimum dictated by the technology node, leaving bandwidth linearly dependent on W. Thus, we estimate that area increases linearly with bandwidth.
..............................................................................................................................................................................................
TOP PICKS
............................................................
36 IEEE MICRO
should have components with an analog bandwidth of approximately 320 KHz. Under our bandwidth model, high-bandwidth analog computers come with high area cost, quickly reaching the area of the largest CPU or GPU dies. On performance and energy metrics, we find that, with 400 integrators operating at 320 KHz of analog bandwidth, analog acceleration can potentially deliver a 10-times faster solution time; using our analog bandwidth model for power, this design corresponds to 33 percent lower energy consumption compared to a digital general-purpose processor.

We recognize that the performance increases and energy savings are not as drastic as one expects when using a domain-specific accelerator built on a fundamentally different computing model than digital, synchronous computing. The reason for this shortfall is twofold.

First, the high area cost of high-bandwidth analog components limits the problem sizes that can fit in the accelerator, and therefore limits the analog performance advantage.

Second, the extreme importance of linear algebra problems has led to intense research in optimal algorithms and hardware support. Although discrete-time operation has drawbacks, it permits algorithms to intelligently select a step size, which has advantages in solving systems of linear equations. Both the analog and digital solvers perform iterative numerical algorithms, but the digital program runs the conjugate gradient method, the most efficient and sophisticated of the classical iterative algorithms. In the conjugate gradient method, each step size is chosen by considering the gradient magnitude at the present point, along with the history of step sizes. With these additional calculations, the conjugate gradient method avoids taking redundant steps, accelerating toward the answer when the error is large and slowing when close to convergence.

In contrast, the analog accelerator has fewer iterative algorithms it can carry out. When the analog accelerator is used for linear algebra, the design's bandwidth limits the convergence rate, so the convergence within a time interval cannot be arbitrarily large. Therefore, the numerical iteration in the analog accelerator is akin to fixed-step-size relaxation or steepest descent. Although we can view the analog accelerator as performing continuous-time steepest descent, taking many infinitesimal steps in continuous time, doing many iterations of a poor algorithm is in this case no match for a better algorithm.
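The contrast can be seen in a few lines of code. Below, both solvers run on the same small symmetric positive-definite system (an illustrative sketch, not the article's benchmark): conjugate gradient picks each step from the current residual and its search-direction history, while the fixed-step loop mimics the accelerator's fixed-rate relaxation.

```python
# Illustrative comparison: conjugate gradient (adaptive steps) versus
# fixed-step descent (akin to the analog accelerator's fixed-rate
# relaxation) on a small symmetric positive-definite system.

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def axpy(a, x, y):
    """Return a*x + y elementwise."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def fixed_step_descent(A, b, step, iters):
    x = [0.0] * len(b)
    for _ in range(iters):
        r = axpy(-1.0, matvec(A, x), b)  # residual r = b - A x
        x = axpy(step, r, x)             # fixed step along the residual
    return x


def conjugate_gradient(A, b, iters):
    x = [0.0] * len(b)
    r = axpy(-1.0, matvec(A, x), b)
    p = r[:]
    for _ in range(iters):
        Ap = matvec(A, p)
        alpha = dot(r, r) / dot(p, Ap)   # step size from current residual
        x = axpy(alpha, p, x)
        r_new = axpy(-alpha, Ap, r)
        beta = dot(r_new, r_new) / dot(r, r)
        p = axpy(beta, p, r_new)         # new direction remembers history
        r = r_new
    return x
```

On a 2x2 system such as A = [[4, 1], [1, 3]], b = [1, 2], conjugate gradient reaches the exact solution in two iterations, while the fixed-step loop needs on the order of a hundred: the same gap in algorithmic sophistication that the article describes.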
Efficient discrete-time algorithms such as conjugate gradient and multigrid have been known to researchers since the 1950s. Analog computers remained in use in the 1960s for steepest-descent solves because of their better immediate performance relative to early digital computers.

Changing the basic abstractions in computer architecture could change what types of problems are solvable. Interesting physical phenomena are usually continuous-time,
[Figure 6 plot: convergence time (µs) versus total grid points, for the digital conjugate gradient solver, the 20-KHz analog design with a linear fit, and linear projections for 80-KHz, 320-KHz, and 1.3-MHz analog designs.]

Figure 6. Comparison of time taken to converge to equivalent precision, for high-bandwidth analog accelerators and a digital CPU. The time needed to converge is plotted against the linear algebra problem vector size. We give the projected solution time for 80-KHz, 320-KHz, and 1.3-MHz analog accelerator designs. The high-bandwidth designs have increasing area cost; in this plot, the 320-KHz and 1.3-MHz designs hit 600 mm², the size of the largest GPUs, so their projections are cut short. The convergence time for digital is the software runtime on a single CPU core.
analog, nonlinear, and often stochastic, so the computer architectures and mathematical abstractions for simulating these processes should also be continuous-time and analog. Although analog acceleration has limited benefits for solving linear algebra, it holds promise in problem classes such as nonlinear systems, in which digital algorithms and hardware architectures have been less successful. In this regard, this article could be the first in a line of work redefining what problems are tractable and should be pursued for analog computing. MICRO

Acknowledgments
This work is supported by NSF award CNS-1239134 and a fellowship from the Alfred P. Sloan Foundation. This article is based on our ISCA 2016 paper.8
References
1. W.H. Press et al., Numerical Recipes: The Art of Scientific Computing, 3rd ed., Cambridge Univ. Press, 2007.
2. W. Chen and L.P. McNamee, "Iterative Solution of Large-Scale Systems by Hybrid Techniques," IEEE Trans. Computers, vol. C-19, no. 10, 1970, pp. 879–889.
3. W.J. Karplus and R. Russell, "Increasing Digital Computer Efficiency with the Aid of Error-Correcting Analog Subroutines," IEEE Trans. Computers, vol. C-20, no. 8, 1971, pp. 831–837.
4. G. Korn and T. Korn, Electronic Analog and Hybrid Computers, McGraw-Hill, 1972.
5. C.C. Douglas, J. Mandel, and W.L. Miranker, "Fast Hybrid Solution of Algebraic Systems," SIAM J. Scientific and Statistical Computing, vol. 11, no. 6, 1990, pp. 1073–1086.
6. Y. Zhang and S.S. Ge, "Design and Analysis of a General Recurrent Neural Network Model for Time-Varying Matrix Inversion," IEEE Trans. Neural Networks, vol. 16, no. 6, 2005, pp. 1477–1490.
7. N. Guo et al., "Energy-Efficient Hybrid Analog/Digital Approximate Computation in Continuous Time," IEEE J. Solid-State Circuits, vol. 51, no. 7, 2016, pp. 1514–1524.
8. Y. Huang et al., "Evaluation of an Analog Accelerator for Linear Algebra," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture (ISCA), 2016, pp. 570–582.
Yipeng Huang is a PhD candidate in the Computer Architecture and Security Technologies Lab at Columbia University. His research interests include applications of analog computing and benchmarking of robotic systems. Huang has an MPhil in computer science from Columbia University. He is a member of the IEEE Computer Society and ACM SIGARCH. Contact him at [email protected].

Ning Guo is a hardware engineer at Cognescent. His research interests include continuous-time analog/hybrid computing and energy-efficient approximate computing. Guo received a PhD in electrical engineering from Columbia University, where he performed the work for this article. Contact him at [email protected].

Mingoo Seok is an assistant professor in the Department of Electrical Engineering at Columbia University. His research interests include low-power, adaptive, and cognitive VLSI systems design. Seok received a PhD in electrical engineering from the University of Michigan, Ann Arbor. He has received an NSF CAREER award and is a member of IEEE. Contact him at [email protected].

Yannis Tsividis is the Edwin Howard Armstrong Professor of Electrical Engineering at Columbia University. His research interests include analog and hybrid analog/digital integrated circuit design for computation and signal processing. Tsividis received a PhD in electrical engineering from the University of California, Berkeley. He is a Life Fellow of IEEE. Contact him at [email protected].

Simha Sethumadhavan is an associate professor in the Department of Computer Science at Columbia University. His research interests include computer architecture and computer security. Sethumadhavan received a PhD in computer science from the University of Texas at Austin. He has received an Alfred P. Sloan fellowship and an NSF CAREER award. Contact him at [email protected].
DOMAIN SPECIALIZATION IS GENERALLY UNNECESSARY FOR ACCELERATORS

Domain-specific accelerators (DSAs), which sacrifice programmability for efficiency, are a reaction to the waning benefits of device scaling. This article demonstrates that there are commonalities between DSAs that can be exploited with programmable mechanisms. The goals are to create a programmable architecture that can match the benefits of a DSA and to create a platform for future accelerator investigations.
Performance improvements from general-purpose processors have proved elusive in recent years, leading to a surge of interest in more narrowly applicable architectures in the hope of continuing system improvements in at least some significant application domains. A popular approach so far has been building domain-specific accelerators (DSAs): hardware engines capable of performing computations in a particular domain with high performance and energy efficiency. DSAs have been developed for many domains, including machine learning, cryptography, regular expression matching and parsing, video decoding, and databases. DSAs have been shown to achieve 10 to 1,000 times performance and energy benefits over high-performance, power-hungry general-purpose processors.

For all of their efficiency benefits, DSAs sacrifice programmability, which makes them prone to obsolescence: the domains we need to specialize, as well as the best algorithms to use, are constantly evolving with scientific progress and changing user needs. Moreover, the relevant domains change between device types (server, mobile, wearable), and creating fundamentally new designs for each costs both design and validation time. More subtly, most devices run several different important workloads (such as mobile systems on chip), so multiple DSAs will be required; this could mean that although each DSA is area-efficient, a combination of DSAs might not be.

Critically, the alternative to domain specialization is not necessarily standard general-purpose processors, but rather programmable and configurable architectures that employ similar microarchitectural mechanisms for specialization. The promises of such an architecture are high efficiency and flexibility across workloads. Figure 1a depicts the two specialization paradigms at a high level, leading to the central question of this article: How far can the efficiency of programmable architectures be pushed, and can they be competitive with domain-specific designs?

To this end, this article first observes that although DSAs differ greatly in their design
Tony Nowatzki, University of California, Los Angeles
Vinay Gangadhar and Karthikeyan Sankaralingam, University of Wisconsin–Madison
Greg Wright, Qualcomm Research

40 Published by the IEEE Computer Society 0272-1732/17/$33.00 © 2017 IEEE
choices, they all employ a similar set of specialization principles:

• Matching of the hardware concurrency to the enormous parallelism typically present in accelerated algorithms.
• Problem-specific functional units (FUs) for computation.
• Explicit communication of data, as opposed to implicit transfer through shared (register and memory) address spaces in a general-purpose instruction set architecture (ISA).
• Customized structures for caching.
• Coordination of the control of the other hardware units using simple, low-power hardware.

Our primary insight is that these shared principles can be exploited by composing known programmable and configurable microarchitectural mechanisms. In this article, we describe one such architecture, our proposed design, LSSD (see Figure 1b). LSSD stands for low-power core, spatial architecture, scratchpad, and DMA.

To exploit the concurrency present in specializable algorithms while retaining programmability, we employ many tiny, low-power cores. To improve the cores' handling of the commonly high degree of computation, we add a spatial fabric to each core; the spatial fabric's network specializes the operand communication, and its FUs can be specialized for algorithm-specific computation. Adding scratchpad memories enables the specialization of caching, and a DMA engine specializes the memory communication. The low-power core makes it possible to specialize the coordination.

This article has two primary goals. First, we aim to show that by generalizing common specialization principles, we can create a programmable architecture that is competitive with DSAs. Our evaluation of LSSD matches DSA performance with two to four times the power and area overhead. Second, our broader goal is to inspire the use of programmable fabrics like LSSD as vehicles for future accelerator research. These types of architectures are far better baselines than out-of-order (OoO) cores for distilling the benefits of specialization, and they can serve as a platform for generalizing domain-specific research, broadening and strengthening its impact.

The Five C's of Specialization
DSAs achieve efficiency through five common specialization principles, which we describe here in detail. We also discuss how four recent accelerator designs apply these principles.

Defining the Specialization Principles
Before we define the specialization principles, let's clarify that we are discussing them only for the workloads most commonly targeted with accelerators.
[Figure 1 diagram: (a) domain-specific acceleration (10-to-1,000-times performance and energy benefits, high overall area footprint, obsoletion prone) contrasted with programmable specialization and a traditional multicore, compared on performance/energy benefits, area footprint cost, and generality/flexibility; (b) the LSSD unit, pairing a low-power core with a spatial fabric, scratchpad, DMA engine, and data cache, with components shaded by the specialization they provide: concurrency, computation, communication, caching, or coordination.]

Figure 1. Specialization paradigms and tradeoffs. (a) Alternate specialization paradigms. (b) Our LSSD architecture for programmable specialization.
In particular, these workloads have significant parallelism, either at the data or thread level; perform some computational work; have coarse-grained units of work; and have mainly regular memory access.

Here, we define the five principles of architectural specialization and discuss the potential area, power, and performance tradeoffs of targeting each.

Concurrency specialization. A workload's concurrency is the degree to which its operations can be performed simultaneously. Specializing for a high degree of concurrency means organizing the hardware to perform work in parallel by favoring lower-overhead structures. Example strategies include employing many independent processing elements with their own controllers or using a wide vector model with a single controller. Applying hardware concurrency increases the performance and efficiency of parallel workloads at the cost of area and power.

Computation specialization. Computations are individual units of work in an algorithm performed by FUs. Specializing computation means creating problem-specific FUs (for instance, an FU that computes sine). Specializing computation improves performance and power by reducing the total work. Although computation specialization can be problem-specific, some commonality between domains at this level is expected.

Communication specialization. Communication is the transmission of values between and among storage and FUs. Specialized communication is simply the instantiation of communication channels and buffers between hardware units to facilitate faster operand throughput to the FUs. This reduces power by lessening access to intermediate storage, and potentially reduces area if the alternative is a general communication network. One example is a broadcast network for efficiently sending immediately consumable data to many computational units.

Caching specialization. Specialization for caching exploits inherent data reuse, an algorithmic property wherein intermediate values are consumed multiple times. The specialization of caching means using custom storage structures for these temporaries. In the context of accelerators, access patterns are often known a priori, meaning that low-ported, wide scratchpads (or small registers) are frequently more effective than classic caches.

Coordination specialization. Hardware coordination is the management of hardware units and their timing to perform work. Instruction sequencing, control flow, signal decoding, and address generation are all examples of coordination tasks. Specializing coordination usually involves creating small state machines to perform each task, rather than relying on a general-purpose processor and (for example) OoO instruction scheduling. More coordination specialization typically means less area and power than a more programmable alternative, at the price of generality.

Relating Specialization Principles to Accelerator Mechanisms
Figure 2 depicts the block diagrams of the four DSAs that we study; shading indicates the types of specialization of each component. We relate the specialization mechanisms to algorithmic properties below.

Neural Processing Unit (NPU) is a DSA for approximate computing using neural networks.1 It exploits the concurrency of each network level, using parallel processing entities (PEs) to pipeline the computations of eight neurons simultaneously. NPU specializes reuse with accumulation registers and per-PE weight buffers. For communication, NPU employs a broadcast network specializing the large network fan-outs, and it specializes computation with sigmoid FUs. A bus scheduler and PE controller specialize the hardware coordination.

Convolution Engine accelerates stencil-like computations.2 The host core coordinates control through custom instructions. It exploits concurrency through both vector and pipeline parallelism and uses custom scratchpads for caching pixels and coefficients. In addition, column and row interfaces provide shifted versions of intermediate values. These, along with other wide buses, provide communication specialization. It
also uses a specialized graph-fusion computation unit.

Q100 is a DSA for streaming database queries; it exploits the pipeline concurrency of database operators and intermediate outputs.3 To support a streaming model, it uses stream buffers to prefetch database columns. Q100 specializes communication by providing dynamically routed channels between FUs to prevent memory spills. It uses custom database FUs, such as Join, Sort, and Partition. It specializes data caching by storing constants and reused intermediates within these operators' implementations. The communication-network configuration and stream buffers are coordinated by an instruction sequencer.

DianNao is a DSA for deep neural networks.4 It achieves concurrency by applying a very wide vector computation model and uses wide memory structures (4,096-bit-wide static RAMs) for reuse specialization of neurons, accumulated values, and synapses. DianNao also relies on specialized sigmoid FUs. Point-to-point links between FUs, with little bypassing, specialize the communication. A specialized control processor is used for coordination.
An Architecture for Programmable Specialization
Our primary insight is that well-understood mechanisms can be composed to target the same specialization principles that DSAs use, but in a programmable fashion. In this section, we explain the architecture of LSSD, highlighting how it performs specialization using these principles while remaining programmable and parameterizable for different domains. This is not the only possible set of mechanisms, but it is a simple and effective one. The sidebar, "Related Programmable Specialization Architectures," discusses alternative designs.

LSSD Design
The most critical principle is exploiting concurrency, of which there is typically an abundance in specializable workloads. Requiring high concurrency pushes the design toward simplicity, and requiring programmability implies the use of some sort of programmable core. The natural way to satisfy both is to use an array of tiny low-power cores that communicate through memory. This is a sensible tradeoff because commonly specialized workloads exhibit little communication between the coarse-grained parallel units. The remainder of the design is a straightforward application of the remaining specialization principles.

Achieving communication specialization of intermediate values requires an efficient distribution mechanism for operands that avoids expensive intermediate storage such as multiported register files. Arguably, the best-known approach is an explicit routing network that is exposed to the ISA to
[Figure 2 diagrams: high-level organization and processing-unit structure of the NPU, Convolution Engine, Q100, and DianNao accelerators, with elements shaded by specialization role: concurrency, computation, communication, caching, or coordination.]

Figure 2. Application of specialization principles in four domain-specific accelerators (DSAs). The elements of each DSA's high-level organization and low-level processing-unit structure are labeled according to their primary role in performing specialization.
eliminate the hardware burden of dynamic routing. This property is what defines spatial architectures, and we add a spatial architecture as our first mechanism. It serves as an appropriate place to instantiate custom FUs (that is, computation specialization). It also enables specialization of caching for the constant values associated with specific computations.

To achieve communication specialization with the global memory, a natural solution is to add a DMA engine and a configurable scratchpad with a vector interface to the spatial architecture. The scratchpad, configured as a DMA buffer, enables efficient streaming of memory by decoupling memory access from the spatial architecture. Configured differently, the scratchpad can instead act as a programmable cache. A single-ported scratchpad suffices, because access patterns are usually simple and known ahead of time.

Finally, to coordinate the hardware units (for example, synchronizing DMA with the computation), we use the simple core, which is programmable and general. The overhead is low, provided the core is sufficiently low-power and the spatial architecture is large enough.

Thus, each unit of our proposed fabric contains a low-power core, a spatial architecture, a scratchpad, and DMA (LSSD), as shown in Figure 1b. It is programmable, achieves high efficiency through the application of the specialization principles, and is simply parameterizable.

Use of LSSD in Practice
Preparing the LSSD fabric for use occurs in two phases: design synthesis and programming.

For specialized architectures, design synthesis is the process of provisioning for given performance, area, and power goals. It involves examining one or more workload domains and choosing the appropriate FUs, the datapath size, the scratchpad sizes and widths, and the degree of concurrency exploited through multiple core units. Although many optimization strategies are possible, in this work we consider the primary constraint to be performance: there exists some throughput target that must be met, and power and area should be minimized while still retaining some degree of generality and programmability.
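The design-synthesis step amounts to a small constrained search. The sketch below is hypothetical (the candidate configurations and all their figures are invented for illustration, not taken from the paper): among configurations that meet the throughput target, pick the one minimizing area, with power as the tiebreaker.

```python
# Hypothetical sketch of LSSD design synthesis: performance is a hard
# constraint; among feasible configurations, minimize area, then power.
# All candidate figures below are invented for illustration.

def provision(candidates, throughput_target):
    feasible = [c for c in candidates if c["throughput"] >= throughput_target]
    if not feasible:
        raise ValueError("no configuration meets the performance target")
    # Lexicographic objective: smallest area first, then smallest power.
    return min(feasible, key=lambda c: (c["area_mm2"], c["power_w"]))


candidates = [
    {"cores": 1, "throughput": 10, "area_mm2": 1.0, "power_w": 0.2},
    {"cores": 4, "throughput": 38, "area_mm2": 3.8, "power_w": 0.7},
    {"cores": 8, "throughput": 74, "area_mm2": 7.5, "power_w": 1.3},
]
best = provision(candidates, throughput_target=30)  # picks the 4-core design
```

In the paper's terms, running such a search once per domain yields the per-domain designs, and running it against the union of all domain requirements yields the balanced design.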
Programming an LSSD has two major components: creation of the coordination code for the low-power core and generation of the configuration data for the spatial datapath to match available resources.
Related Programmable Specialization Architectures
One related architecture is Smart Memories,1 which when configured acts like either a streaming engine or a speculative multiprocessor. Its primary innovations include mechanisms that let static RAMs (SRAMs) act as either scratchpads or caches for reuse. Smart Memories is both more complex and more general than LSSD, although it is likely less efficient on the regular workloads we target.

Another example is Charm,2 the composable heterogeneous accelerator-rich microprocessor, which integrates coarse-grained configurable functional-unit blocks and scratchpads for reuse specialization. A fundamental difference is in the decoupling of the computation units, reuse structures, and host cores, which allows concurrent programs to share blocks in complex ways.

The Vector-Thread architecture supports unified vector-and-multithreading execution, providing flexibility across data-parallel and irregularly parallel workloads.3 The most similar design in terms of microarchitecture is MorphoSys.4 It also embeds a low-power TinyRisc core, integrated with a coarse-grained reconfigurable architecture (CGRA), a direct memory access engine, and a frame buffer. Here, the frame buffer is not used for data reuse, and the CGRA is more loosely coupled with the host core.
References
1. K. Mai et al., "Smart Memories: A Modular Reconfigurable Architecture," Proc. 27th Int'l Symp. Computer Architecture, 2000, pp. 161–171.
2. J. Cong et al., "Charm: A Composable Heterogeneous Accelerator-Rich Microprocessor," Proc. ACM/IEEE Int'l Symp. Low Power Electronics and Design, 2012, pp. 379–384.
3. R. Krashinsky et al., "The Vector-Thread Architecture," Proc. 31st Ann. Int'l Symp. Computer Architecture, 2004, pp. 52–63.
4. H. Singh et al., "MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications," IEEE Trans. Computers, vol. 49, no. 5, 2000, pp. 465–481.
Programming for LSSD in assembly mightbe reasonable because of the simplicity ofboth the control and data portions. In prac-tice, using either standard languages with#pragma annotations or languages likeOpenCL would likely be effective.
Design Provisioning and MethodologyIn this section, we describe the design pointsthat we study in this work, along with theprovisioning and evaluation methodology.More details are in our original paper for the2016 Symposium on High PerformanceComputer Architecture.5
Implementation Building BlocksWe use several existing components, bothfrom the literature and from availabledesigns, as building blocks for the LSSDarchitecture. The spatial architecture we lev-erage is the DySER coarse-grained reconfig-urable architecture (CGRA),6 which is alightweight, statically routed mesh of FUs.Note that we will use CGRA and “spatialarchitecture” interchangeably henceforth.
The processor we leverage is a TensilicaLX3, which is a simple, very long instructionword design featuring a low-frequency (1GHz) seven-stage pipeline. We chose thisbecause of its low area and power footprintand because it can run irregular code.
LSSD ProvisioningTo instantiate LSSD, we provision its param-eters to meet each domain’s performancerequirements by modifying FU composition,scratchpad size, and the number of cores (fordetails, see our original paper). By provision-ing for each domain separately, we createLSSDN, LSSDC, LSSDQ, and LSSDD, for
neural network approximation, convolution,databases, and deep neural networks work-loads, respectively. We also consider a bal-anced design (LSSDB), which contains asuperset of the capabilities of each of theabove and can execute all workloads withrequired performance.
Evaluation MethodologyAt a high level, our methodology attempts tofairly assess LSSD tradeoffs across workloadsfrom four accelerators through pulling datafrom past works and the original authors,applying performance modeling techniques,using sanity checking against real systems,and using standard area/power models.Where assumptions were necessary, we madethose that favored the benefits of the DSA.
LSSD performance estimation. Our strategyuses a combined trace-simulator and applica-tion-specific modeling framework to capturethe role of the compiler and the LSSD pro-gram. This framework is parameterizable fordifferent FU types, concurrency parameters(single-instruction, multiple-data [SIMD]width and LSSD unit counts), and reuse andcommunication structures.
LSSD power and area estimation. Integer FUestimates come from DianNao4 and floating-point FUs are from DySER.6 CGRA-net-work estimates come from synthesis, andstatic RAMs use CACTI estimates.
DSA and baseline characteristics. We obtaineach DSA’s performance, area, and power asshown in Table 1.
Comparison to OoO baseline. We estimate the properties of the OoO baseline (Intel's 3770K) from datasheets and die photos, and frequency is scaled to 2 GHz.

Table 1. Methodology for obtaining DSA baseline characteristics.*

DSA                           | Execution time   | Power/Area
Neural Processing Unit (NPU)  | Authors provided | MCPAT-based estimation
Convolution Engine            | Authors provided | MCPAT-based estimation
Q100                          | Optimistic model | In original paper3
DianNao                       | Optimistic model | In original paper4

*All area and power estimates are scaled to 28 nm.
Evaluation
We organize our evaluation around four main questions:

- Q1. What is the cost of general programmability in terms of area and power?
- Q2. If multiple workloads are required on-chip, can LSSD ever surpass the area or power efficiency of a multi-DSA design?
- Q3. What are the sources of LSSD's performance?
- Q4. How do LSSD's power overheads affect the overall energy efficiency?

We answer these questions through detailed analysis as follows.
LSSD Area/Power Overheads (Q1)
To elucidate the costs of more general programmability, Table 2 shows the power and area breakdowns for the LSSD designs.
LSSDD has the worst-case area and power overheads: 3.76 and 4.06 times, respectively, compared to DianNao. The CGRA network dominates area and power because it supports relatively tiny 16-bit FUs. The best case is LSSDQ, which has 0.48 times the area and 0.6 times the power of Q100. The primary reason is that LSSD does not embed the expensive Sort and Partition units. Even though not including these units leads to performance loss on several queries, we believe this to be a reasonable tradeoff overall.
The takeaway: with suitable engineering, we can reduce programmability overheads to small factors of 2 to 4 times, as opposed to the 100- to 1,000-times inefficiency of large OoO cores.
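The overhead factors are simply the ratios of the per-design totals in Table 2; a quick sketch (values transcribed from Table 2) recomputes them:

```python
# Total (area in mm^2, power in mW) transcribed from Table 2.
# DSA key: N = NPU, C = Convolution Engine, Q = Q100, D = DianNao.
lssd = {"N": (0.37, 149), "C": (0.15, 108), "Q": (1.78, 519), "D": (2.11, 867)}
dsa = {"N": (0.30, 74), "C": (0.08, 30), "Q": (3.69, 870), "D": (0.56, 213)}

for d in "NCQD":
    area_ratio = lssd[d][0] / dsa[d][0]
    power_ratio = lssd[d][1] / dsa[d][1]
    print(f"LSSD{d}: {area_ratio:.2f}x area, {power_ratio:.2f}x power")
```

Note that LSSDQ comes out below 1.0 on both metrics, because Q100's Sort and Partition units dominate its budget and LSSD omits them.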
Supporting Multiple Domains (Q2)
If multiple workload domains require specialization on the same chip, but do not need to be run simultaneously, it is possible that LSSD can be more area-efficient than a Multi-DSA design. Figure 3a shows the area and geomean power tradeoffs for two workload-domain sets, comparing the Multi-DSA chip to the balanced LSSDB design.

Table 2. Breakdown and comparison of LSSD (a) area and (b) power (normalized to 28 nm).

(a) Area (mm2)        | LSSDN | LSSDC | LSSDQ | LSSDD
Core and cache        | N/A   | N/A   | 0.09  | 0.09
Static RAM (SRAM)     | 0.04  | 0.02  | 0.04  | 0.04
Functional unit (FU)  | 0.24  | 0.02  | 0.09  | 0.02
CGRA* network         | 0.09  | 0.11  | 0.22  | 0.11
Unit total            | 0.37  | 0.15  | 0.44  | 0.26
LSSD total area       | 0.37  | 0.15  | 1.78  | 2.11
DSA total area        | 0.30  | 0.08  | 3.69  | 0.56
LSSD/DSA overhead     | 1.23  | 1.74  | 0.48  | 3.76

(b) Power (mW)        | LSSDN | LSSDC | LSSDQ | LSSDD
Core and cache        | 41    | 41    | 41    | 41
SRAM                  | 9     | 5     | 9     | 5
FU                    | 65    | 7     | 33    | 7
CGRA network          | 34    | 56    | 46    | 56
Unit total            | 149   | 108   | 130   | 108
LSSD total power      | 149   | 108   | 519   | 867
DSA total power       | 74    | 30    | 870   | 213
LSSD/DSA overhead     | 2.02  | 3.57  | 0.60  | 4.06

*CGRA: Coarse-grained reconfigurable architecture.
The first domain set (NPU, Convolution Engine, and DianNao) excludes our best result (Q100 workloads). In this case, LSSDB still has 2.7 times the area and 2.4 times the power overhead. However, with Q100 added, LSSDB is only 0.6 times the area.
The takeaway: if only one domain needs to be supported at a time, LSSD can become more area-efficient than using multiple DSAs.
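The crossover in Figure 3a follows from the fixed cost of LSSDB versus the additive cost of a Multi-DSA chip. A sketch using the per-accelerator DSA areas from Table 2 makes this concrete (the LSSDB area is not reported as a single number here, so it is inferred from the stated 2.7-times ratio; treat that value as an assumption):

```python
# Per-accelerator DSA areas (mm^2) from Table 2.
dsa_area = {"NPU": 0.30, "CE": 0.08, "Q100": 3.69, "DianNao": 0.56}

# A Multi-DSA chip grows with every domain added.
multi_no_q100 = dsa_area["NPU"] + dsa_area["CE"] + dsa_area["DianNao"]  # 0.94
multi_all = sum(dsa_area.values())                                      # 4.63

# LSSDB area inferred from the reported 2.7x figure for the three-domain
# set; it stays fixed regardless of how many domains are supported.
lssd_b = 2.7 * multi_no_q100   # ~2.5 mm^2 (assumption, not a reported value)

# Adding Q100's large Sort/Partition-heavy design flips the comparison.
print(f"{lssd_b / multi_all:.2f}x")   # ~0.55x, near the reported 0.6x
```

The point of the sketch is qualitative: once a single large accelerator (here Q100) joins the Multi-DSA sum, the fixed-area LSSDB drops below it.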
Performance Analysis (Q3)
Figure 3b shows the performance of the DSAs and domain-provisioned LSSD designs, normalized to the OoO core. Across workload domains, LSSD matches the performance of the DSAs, with speedups over a modern OoO core of between 10 and 150 times.
To elucidate the sources of benefits of each specialization principle in LSSD, we define five design points (which are not power or area normalized), wherein each builds on the capabilities of the previous design point:

- Core+SFU, the LX3 core with added problem-specific FUs (computation specialization);
- Multicore, an LX3 multicore system (plus concurrency specialization);
- SIMD, an LX3 with SIMD, its width corresponding to LSSD's memory interface (plus concurrency specialization);
- Spatial, an LX3 in which the spatial architecture replaces the SIMD units (plus communication specialization); and
- LSSD, the previous design plus scratchpad (plus caching specialization).
The largest factor is consistently concurrency (4 times for LSSDN, 31 times for LSSDC, 9 times for LSSDQ, and 115 times for LSSDD). This is intuitive, because these workloads have significant exploitable parallelism. The largest secondary factors for LSSDN and LSSDD come from caching neural weights in scratchpad memories, which enables increased data bandwidth to the core. LSSDC and LSSDQ benefit from CGRA-based execution when specializing for communication.
The takeaway is that LSSD designs have competitive performance with DSAs. The performance benefits come mostly from concurrency rather than from other specialization techniques.
[Figure 3. LSSD's power, area, and performance tradeoffs. (a) Area and power of Multi-DSA versus LSSDB (baseline: core plus L1 cache plus L2 cache from the i3770K processor). For the NPU/Convolution Engine/DianNao workload set, LSSDB has 2.7 times the area and 2.4 times the power of Multi-DSA; with all workloads included, it has 0.6 times the area and 2.5 times the power. (b) LSSD versus DSA performance (speedup over the OoO core) across the four domains, broken down by specialization principle.]

Energy-Efficiency Tradeoffs (Q4)
It is important to understand how much the power overheads affect the overall system-level energy benefits. Here, we apply simple analytical reasoning to bound the possible energy-efficiency improvement of a general-purpose system accelerated with a DSA versus an LSSD design, by considering a zero-power DSA.
We define the overall relative energy, E, for an accelerated system in terms of S, the accelerator's speedup; U, the accelerator utilization as a fraction of the original execution time; Pcore, the general-purpose core power; Psys, the system power; and Pacc, the accelerator power. The core power includes components that are not used while the accelerator is invoked, whereas the system power components are active while accelerating (such as higher-level caches and DRAM). The total energy then becomes

E = Pacc(U/S) + Psys(1 - U + U/S) + Pcore(1 - U)
We characterize system-level energy tradeoffs across accelerator utilizations and speedups in Figure 4. Figure 4a shows that the maximum benefits from a DSA are reduced both as the utilization goes down (stressing core power) and as accelerator speedup increases (stressing both core and system power). For a reasonable utilization of U = 0.5 and speedup of S = 10, the maximum energy-efficiency gain from a DSA is less than 0.5 percent. Figure 4b shows a similar graph, in which LSSD's power is varied while utilization is fixed at U = 0.5. Even considering an LSSD with power equivalent to the core, when LSSD has a speedup of 10 times, there is only 5 percent potential energy savings remaining for a DSA to optimize. The takeaway is that when an LSSD can match the performance of a DSA, the further potential energy benefits of a DSA are usually small, making LSSD's overheads largely inconsequential.
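To make the bound concrete, a small sketch of this energy model (using the Figure 4 parameters, Pcore = 5 W and Psys = 5 W) reproduces both numbers quoted above:

```python
def relative_energy(speedup, util, p_acc, p_core=5.0, p_sys=5.0):
    """Overall relative energy of an accelerated system:
    E = Pacc*(U/S) + Psys*(1 - U + U/S) + Pcore*(1 - U)."""
    return (p_acc * (util / speedup)
            + p_sys * (1 - util + util / speedup)
            + p_core * (1 - util))

# LSSD at Plssd = 0.5 W versus an idealized zero-power DSA, U = 0.5, S = 10.
e_lssd = relative_energy(10, 0.5, p_acc=0.5)
e_dsa = relative_energy(10, 0.5, p_acc=0.0)
print(f"{e_lssd / e_dsa - 1:.2%}")   # 0.48% -- the "<0.5 percent" case

# Even at Plssd = Pcore = 5 W, a zero-power DSA reclaims only ~5 percent.
e_lssd_hot = relative_energy(10, 0.5, p_acc=5.0)
print(f"{e_lssd_hot / e_dsa - 1:.2%}")   # 4.76%
```

The intuition: at U = 0.5, half the original runtime still burns full core power, and system power stays on throughout, so the accelerator's own power is a small slice of total energy.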
The broad intellectual thrust of this article is to propose an accelerator architecture that could be used to drive future investigations. As an analogy, the canonical five-stage pipelined processor was simple and effective enough to serve as a framework for almost three decades of big ideas, policies, and microarchitecture mechanisms that drove the general-purpose processor era. Up to now, architects have not focused on an equivalent framework for accelerators.
Most accelerator proposals are presented as a novel design with a unique composition of mechanisms for a particular domain, making comparisons meaningless. However, from an intellectual standpoint, our work shows that these accelerators are more similar than dissimilar; they exploit the same essential principles with differences in their implementation. This is why we believe an architecture designed around these principles can serve as a standard framework. Of course, the development of DSAs will continue to be critical for architecture research, both to enable the exploration of the limits of acceleration and as a means to extract new acceleration principles.
In the literature today, DSAs are proposed and compared to conventional high-performance processors, and typically yield several orders of magnitude better measurements on various metrics of interest. For the four DSAs we looked at, the benefit of specialization over LSSD is only two to four times in area and power when performance-normalized. Therefore, we argue that the overheads of OoO processors and GPUs make them poor targets to distill the true benefit of specialization.
[Figure 4. Energy benefits of a zero-power DSA (Pcore = 5 W, Psys = 5 W), plotting the maximum DSA energy-efficiency improvement against accelerator speedup. (a) Varying utilization U from 0.25 to 1.00, with Plssd = 0.5 W. (b) Varying Plssd from 0.25 W to 5.0 W, with U = 0.5.]

Using LSSD as a baseline will reveal more and deeper insights on what techniques are truly needed for a particular problem or domain, as opposed to merely removing the inefficiency of a general-purpose OoO processor using already-known techniques applied in a straightforward manner to a new domain.
Orthogonally, LSSD can serve as a guideline for discovering big ideas for specialization. Undoubtedly, there are additions necessary to the five principles, alternative formulations, and microarchitecture extensions. These include ideas that have been demonstrated in an accelerator's specific context, as well as principles not discovered or defined yet.
To see how LSSD can serve as a framework for generalizing accelerator-specific mechanisms, consider the recent Proteus7 and Cnvlutin8 works, which proposed mechanisms for extending the DianNao accelerator. The ideas of bit-serial multiplication (Proteus) and eliminating zero-computing (Cnvlutin) can apply to the database- and image-processing domains we considered, and evaluating with LSSD enables the study of these mechanisms' generalizability.
Our formulation of principles makes clear which workload behaviors are not yet covered and require the discovery of new principles to match existing accelerators. This direction leads to the more open questions of whether the principles will eventually become too numerous to be practical to put in a single substrate, whether efficient mechanisms can be discovered to target many principles with a single substrate, or whether they are sufficiently few in number that one can build a single universal framework.
Overall, the specialization principles and LSSD-style architectures can be used to decouple accelerator research from workload domains, which we believe can help foster more shared innovation in this space. MICRO
References
1. H. Esmaeilzadeh et al., "Neural Acceleration for General-Purpose Approximate Programs," Proc. 45th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2012, pp. 449–460.
2. W. Qadeer et al., "Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing," Proc. 40th Ann. Int'l Symp. Computer Architecture, 2013, pp. 24–35.
3. L. Wu et al., "Q100: The Architecture and Design of a Database Processing Unit," Proc. 19th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2014, pp. 255–268.
4. T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," Proc. 19th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2014, pp. 269–284.
5. T. Nowatzki et al., "Pushing the Limits of Accelerator Efficiency While Retaining Programmability," Proc. IEEE Int'l Symp. High Performance Computer Architecture, 2016, pp. 27–39.
6. V. Govindaraju et al., "DySER: Unifying Functionality and Parallelism Specialization for Energy Efficient Computing," IEEE Micro, vol. 32, no. 5, 2012, pp. 38–51.
7. P. Judd et al., "Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks," Proc. Int'l Conf. Supercomputing, 2016, article 23.
8. J. Albericio et al., "Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture, 2016, pp. 1–13.
Tony Nowatzki is an assistant professor in the Department of Computer Science at the University of California, Los Angeles. His research interests include architecture and compiler codesign and mathematical modeling. Nowatzki received a PhD in computer science from the University of Wisconsin–Madison. He is a member of IEEE. Contact him at [email protected].
Vinay Gangadhar is a PhD student in the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison. His research interests include hardware/software codesign of programmable accelerators, microarchitecture, and GPU computing. Gangadhar received an MS in electrical and computer engineering from the University of Wisconsin–Madison. He is a student member of IEEE. Contact him at [email protected].
Karthikeyan Sankaralingam is an associate professor in the Department of Computer Sciences and the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison, where he also leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very large-scale integration. Sankaralingam received a PhD in computer science from the University of Texas at Austin. He is a senior member of IEEE. Contact him at [email protected].

Greg Wright is the director of engineering at Qualcomm Research. His research interests include processor architecture, virtual machines, compilers, parallel algorithms, and memory models. Wright received a PhD in computer science from the University of Manchester. Contact him at [email protected].
CONFIGURABLE CLOUDS

The Configurable Cloud datacenter architecture introduces a layer of reconfigurable logic between the network switches and servers. The authors deploy the architecture over a production server bed and show how it can accelerate applications that were explicitly ported to field-programmable gate arrays, support hardware-first services, and accelerate applications without any application-specific FPGA code being written.
Hyperscale clouds (hundreds of thousands to millions of servers) are an attractive option to run a vast and increasing number of applications and workloads spanning web services, data processing, AI, and the Internet of Things. Modern hyperscale datacenters have made huge strides with improvements in networking, virtualization, energy efficiency, and infrastructure management, but they still have the same basic structure they've had for years: individual servers with multicore CPUs, DRAM, and local storage, connected by the network interface card (NIC) through Ethernet switches to other servers. However, the slowdown in CPU scaling and the end of Moore's law have resulted in a growing need for hardware specialization to increase performance and efficiency.
There are two basic ways to introduce hardware specialization into the datacenter: one is to form centralized pools of specialized machines, which we call "bolt-on" accelerators, and the other is to distribute the specialized hardware to each server. Introducing bolt-on accelerators into a hyperscale infrastructure reduces the highly desirable homogeneity and limits the scalability of the specialized hardware, but minimizes disruption to the core server infrastructure. Distributing the accelerators to each server in the datacenter retains homogeneity, allows more efficient scaling, allows services to run on all the servers, and simplifies management by reducing costs and configuration errors. The question of which method is best is mostly one of economics: is it more cost-effective to deploy an accelerator in every new server, to specialize a subset of an infrastructure's new servers and maintain an ever-growing number of configurations, or to do neither?
Any specialized accelerator must be compatible with the target workloads through its deployment lifetime (for example, six years: two years to design and deploy the accelerator and four years of server deployment lifetime). This requirement is a challenge given both the diversity of cloud workloads and the rapid rate at which they change (weekly or monthly). It is thus highly desirable that accelerators incorporated into hyperscale servers be programmable. The two most common examples are field-programmable gate arrays (FPGAs) and GPUs, which (in this regard) are preferable over ASICs.
Adrian M. Caulfield, Eric S. Chung, Andrew Putnam, Hari Angepat, Daniel Firestone, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, and Doug Burger
Microsoft

Published by the IEEE Computer Society. 0272-1732/17/$33.00 © 2017 IEEE

Both GPUs and FPGAs have been deployed in datacenter infrastructure at reasonable scale, but with limited direct connectivity between accelerators, such as intraserver connectivity or small-scale ring and torus networks. Our first deployment in a production hyperscale datacenter was 1,632 servers, each with an FPGA, to accelerate Bing web search ranking. The FPGAs were connected to each other in a 6×8 torus network in a half rack. Although effective at accelerating search ranking, our first architecture1 and similar small-scale connectivity architectures (see the sidebar, "Projects Related to the Configurable Cloud Architecture") have several significant limitations:
- The number of FPGAs that could communicate directly, without going through software, is limited to a single server or single rack (that is, 48 nodes).
- The secondary network requires expensive and complex cabling and requires awareness of the machines' physical location.
- Failure handling requires complex rerouting of traffic to neighboring nodes, causing both performance loss and isolation of nodes under certain failure patterns.
- These fabrics are limited-scale bolt-on accelerators, which can accelerate applications but offer few enhancements for the datacenter infrastructure, such as networking and storage flows.
- Programs must be aware of where their applications are running and how many specialized machines are available, not just the best way to accelerate a given application.
We propose the Configurable Cloud, a new FPGA-accelerated hyperscale datacenter architecture that addresses these limitations.2

Sidebar: Projects Related to the Configurable Cloud Architecture
For a complete analysis of related work and a taxonomy of various system design options, see our original paper.1 In this sidebar, we focus on projects with related system architectures.
Hyperscale accelerators commonly comprise three types of accelerators: custom ASICs, application-specific processors and GPUs, and field-programmable gate arrays (FPGAs). Custom ASIC solutions, such as DianNao,2 provide excellent performance and efficiency for their target workload. However, ASICs are inflexible, so they restrict rapidly evolving applications from continuing to evolve while still being able to use the accelerator.
Google's Tensor Processing Unit (TPU)3 is an application-specific processor that has been highly tuned to execute TensorFlow. GPUs have been commonly used in datacenters as bolt-on accelerators, even with small-scale interconnection networks such as NVLink.4 However, the size and power requirements for GPUs are still much higher than for FPGAs.
FPGA deployments, including Amazon's EC2 F1,5 Baidu's SDA,6 IBM's FPGA fabric,7 Novo-G#,8 and our first-generation architecture,1 all cluster FPGAs into a small subset of the architecture as bolt-on accelerators. None of these has the scalability or the ability to benefit the baseline datacenter server as the Configurable Cloud architecture does.

Sidebar References
1. A. Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," Proc. 41st Ann. Int'l Symp. Computer Architecture, 2014, pp. 13–24.
2. T. Chen et al., "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning," ACM SIGPLAN Notices, vol. 49, no. 4, 2014, pp. 269–284.
3. M. Abadi et al., "TensorFlow: A System for Large-Scale Machine Learning," Proc. 12th USENIX Symp. Operating Systems Design and Implementation, 2016, pp. 265–283.
4. Nvidia NVLink High-Speed Interconnect: Application Performance, white paper, Nvidia, Nov. 2014.
5. "Amazon EC2 F1 Instances (Preview)," 2017; http://aws.amazon.com/ec2/instance-types/f1.
6. J. Ouyang et al., "SDA: Software-Defined Accelerator for Large-Scale DNN Systems," Proc. Hot Chips 26 Symp., 2014; doi:10.1109/HOTCHIPS.2014.7478821.
7. J. Weerasinghe et al., "Enabling FPGAs in Hyperscale Data Centers," Proc. IEEE 12th Int'l Conf. Ubiquitous Intelligence and Computing, 12th Int'l Conf. Autonomic and Trusted Computing, and 15th Int'l Conf. Scalable Computing and Communications (UIC-ATC-ScalCom), 2015; doi:10.1109/UIC-ATC-ScalCom-CBDCom-IoP.2015.199.
8. A.G. Lawande, A.D. George, and H. Lam, "Novo-G#: A Multidimensional Torus-Based Reconfigurable Cluster for Molecular Dynamics," Concurrency and Computation: Practice and Experience, vol. 28, no. 8, 2016, pp. 2374–2393.
This architecture is sufficiently robust and performant that it has been, and is being, deployed in most new servers in Microsoft's production datacenters across more than 15 countries and 5 continents.

The Configurable Cloud
Our Configurable Cloud architecture is the first to add programmable acceleration to a core at-scale hyperscale cloud. All new Bing and Microsoft Azure cloud servers are now deployed as Configurable Cloud nodes. The key difference from our previous work1 is that this architecture replaces the dedicated FPGA network with a tight coupling between each FPGA and the datacenter network. Each FPGA is a "bump in the wire" between the servers' NICs and the Ethernet network switches (see Figure 1b). All network traffic is routed through the FPGA, which enables significant workload flexibility, while a local PCI Express (PCIe) connection maintains the local computation accelerator use case and provides local management functionality. Although this change to the network design might seem minor, the impact on the types of workloads that can be accelerated and on the scalability of the specialized accelerator fabric is profound.

Integration with the network lets every FPGA in the datacenter reach every other one (at a scale of hundreds of thousands) in under 10 microseconds on average, massively increasing the scale of FPGA resources available to applications. It also allows for acceleration of network processing, a common task for the vast majority of cloud workloads, without the development of any application-specific FPGA code. Hardware services can be shared by multiple hosts, improving the economics of accelerator deployment. Moreover, this design choice essentially turns the distributed FPGA resources into an independent plane of computation in the datacenter, at the same scale as the servers. Figure 1a shows a logical view of this plane of computation. This model offers significant flexibility, freeing services from having a fixed ratio of CPU cores to FPGAs, and instead allowing independent allocation of each type of resource.
The Configurable Cloud architecture's distributed nature lets accelerators be implemented anywhere in the hyperscale cloud, including at the edge. Services can also be easily reached by any other node in the cloud directly through the network, enabling services to be implemented at any location in the worldwide datacenter network.

[Figure 1. Enhanced datacenter architecture. (a) Decoupled programmable hardware plane: a datacenter hardware acceleration plane (running services such as deep neural networks, web search ranking, expensive compression, and bioinformatics) sits alongside the traditional software (CPU) server plane, connected through top-of-rack (TOR) and higher-level (L1, L2) network switches. (b) Server and field-programmable gate array (FPGA) schematic: in each two-socket server blade, the FPGA acceleration board sits between the network interface card (NIC) and the TOR switch over 40-Gbps QSFP links, and connects to the CPUs over PCIe Gen3.]
Microsoft has deployed this new architecture to most of its new datacenter servers. Although the actual production scale of this deployment is orders of magnitude larger, for this article we evaluate the Configurable Cloud architecture using a bed of 5,760 servers deployed in a production datacenter.
Usage Models
The Configurable Cloud is sufficiently flexible to cover three scenarios: local acceleration (through PCIe), network acceleration, and global application acceleration through pools of remotely accessible FPGAs. Local acceleration handles high-value scenarios such as web search ranking acceleration, in which every server can benefit from having its own FPGA. Network acceleration supports services such as software-defined networking, intrusion detection, deep packet inspection, and network encryption, which are critical to infrastructure as a service (for example, "rental" of cloud servers), and which have such a huge diversity of customers that it is difficult to justify local acceleration alone economically. Global acceleration permits acceleration hardware unused by its host servers to be made available for other hardware services, such as large-scale machine-learning applications. This decoupling from a 1:1 ratio of servers to FPGAs is essential for breaking the chicken-and-egg problem in which accelerators cannot be added until enough applications need them, but applications will not rely on the accelerators until they are present in the infrastructure. By decoupling the servers and FPGAs, software services that demand more FPGA capacity can harness spare FPGAs from other services that are slower to adopt (or do not require) the accelerator fabric.
We measure the system's performance characteristics using web search to represent local acceleration, network flow encryption and network flow offload to represent network acceleration, and machine learning to represent global acceleration.
Local Acceleration
We brought up a production Bing web search ranking service on the servers, with 3,081 of these machines using the FPGA for local acceleration, and the rest used for other functions associated with web search. Unlike in our previous work, we implemented only the most expensive feature calculations and omitted the less-expensive feature calculations, the post-processed synthetic features, and the machine-learning calculations.
Figure 2 shows the performance of search ranking running in production over a five-day period, with and without FPGA acceleration. The top two lines show the normalized tail query latencies at the 99.9th percentile (aggregated across all servers over a rolling time window), and the bottom two lines show the corresponding query loads received at each datacenter.
Because load varies throughout the day, the queries executed without FPGAs experienced a higher latency with more frequent latency spikes, whereas the FPGA-accelerated queries had much lower, tighter-bound latencies. This is particularly impressive given the much higher query loads experienced by the FPGA-accelerated machines. The higher query load, which was initially unexpected, was due to the top-level load balancers selecting FPGA-accelerated machines over those without FPGAs because of their lower and less-variable latency. The FPGA-accelerated machines were better at serving query traffic even at higher load, and hence were assigned additional queries, nearly twice the load that software alone could handle.

[Figure 2. Five-day query throughput and latency of ranking-service queries running in production, with and without FPGAs enabled. The plot shows normalized 99.9th-percentile latency and average query load for both the software-only and the FPGA-accelerated configurations.]
Infrastructure Acceleration
Although local acceleration such as Bing search was the primary motivation for deploying FPGAs into Microsoft's datacenters at hyperscale, the level of effort required to port the Bing software stack to FPGAs is not sustainable for the more than 200 first-party workloads currently deployed in Microsoft datacenters, not to mention the thousands of third-party applications running on Microsoft Azure. The architecture needs to provide benefit to workloads that have no specialized code for running on FPGAs by accelerating components common across many different workloads. We call this widely applicable acceleration infrastructure acceleration. Our first example is acceleration of network services.
Nearly all hyperscale workloads rely on a fast, reliable, and secure network between many machines. By accelerating network processing in ways such as protocol offload and host-to-host network crypto, the FPGAs can accelerate workloads that have no specific tuning for FPGA offload. This is increasingly important as growing network speeds put greater pressure on CPU cores trying to keep up with protocol processing and line-rate crypto.
In the case of network crypto, each packet is examined and encrypted or decrypted at line rate as necessary while passing from the NIC through the FPGA to the network switch. Thus, once an encrypted flow is set up, no CPU usage is required to encrypt or decrypt the packets. Encryption occurs transparently from the perspective of the software, which sees unencrypted packets at the endpoints. Network encryption/decryption offload yields significant CPU savings. Our AES-128 implementations support a full 40 gigabits per second (Gbps) of encryption/decryption with no load on the CPU beyond setup and teardown. Achieving the same performance in software requires more than four CPU cores running at full utilization.
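As a rough sanity check on this claim, the core count for software crypto follows from dividing line rate by per-core throughput. The sketch below assumes a per-core AES-128 rate of about 9 Gbps; that figure is our illustrative assumption, not a number from the article:

```python
import math

# Assumed, illustrative value: the article says only that sustaining
# 40 Gbps of AES-128 in software needs more than four cores.
LINE_RATE_GBPS = 40.0
AES_GBPS_PER_CORE = 9.0


def cores_needed(line_rate_gbps: float, per_core_gbps: float) -> int:
    """Cores required to encrypt/decrypt at the given line rate."""
    return math.ceil(line_rate_gbps / per_core_gbps)


print(cores_needed(LINE_RATE_GBPS, AES_GBPS_PER_CORE))  # 5 cores under these assumptions
```

Under these assumed numbers the arithmetic reproduces the article's "more than four cores" figure; offloading the work to the FPGA frees all of them.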
The same Configurable Cloud architecture has also been used to accelerate software-defined networking,3 in which bulk packet operations are offloaded to the FPGA under the software's policy control. The initial implementation gave Microsoft Azure the fastest public cloud network, with 25 μs latency and 25 Gbps of bandwidth. This service was offered free to third-party users, who can benefit from FPGA acceleration without writing any code for the FPGA.
Hardware as a Service: Shared, Remote Accelerators

Most workloads can benefit from local acceleration, infrastructure acceleration, or both. There are two key motivations behind enabling remote hardware services: hardware accelerators should be placed flexibly anywhere in the hyperscale datacenter network, and the resource requirements for the hardware fabric (FPGAs) should scale commensurately with software demand (CPU), not just one server to one FPGA.
To address the first requirement, remote accelerators are never more than a few network hops away from any other server across hundreds of thousands to millions of servers. Since each server has acceleration capabilities, any accelerator can be mapped to any location.
Similarly, some software services have underutilized FPGA resources, while others need more than one. This architecture lets all accelerators in the datacenter communicate directly, enabling harvesting of FPGAs from the deployment for services with greater needs. This also allows the allocation of thousands of FPGAs for a single job or service, independent of their CPU hosts and without impacting the CPU's performance, in effect creating a new kind of computer embedded in the datacenter. A demonstration of this pooled FPGA capability at Microsoft's Ignite conference in 2016 showed that four harvested FPGAs translated 5.2 million Wikipedia articles from English to Spanish five orders of magnitude faster than a 24-core server running highly tuned vectorized code.4
We developed a custom FPGA-to-FPGA network protocol called the Lightweight Transport Layer (LTL), which uses the User Datagram Protocol for frame encapsulation and the Internet Protocol for routing packets across the datacenter network. Low-latency communication demands infrequent packet drops and infrequent packet reorders. By using lossless traffic classes provided in datacenter switches and provisioned for traffic such as Remote Direct Memory Access and Fibre Channel over Ethernet, we avoid most packet drops and reorders. Separating out such traffic into its own classes also protects the datacenter's baseline TCP traffic. Because the FPGAs are so tightly coupled to the network, they can react quickly and efficiently to congestion notification and back off when needed to reduce packets dropped from incast patterns.
At the endpoints, the LTL protocol engine uses an ordered, reliable, connection-based interface with statically allocated, persistent connections, realized using send and receive connection tables. The static allocation and persistence (until they are deallocated) reduce latency for inter-FPGA and inter-service messaging, because once established, the connections can communicate with low latency. Reliable messaging also reduces protocol latency. Although datacenter networks are already fairly reliable, LTL provides a strong reliability guarantee via an ACK/NACK-based retransmission scheme. When packet reordering is detected, NACKs are used to request timely retransmission of particular packets without waiting for a time-out.
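The receive-side gap detection that drives those NACKs can be sketched as follows. This is a toy software model of behavior the article attributes to FPGA logic; the `LtlReceiver` name and interface are our own:

```python
class LtlReceiver:
    """Toy model of LTL's receive side: in-order frames are ACKed, and a
    sequence-number gap triggers an immediate NACK naming the missing
    frame, rather than waiting for a sender time-out."""

    def __init__(self):
        self.expected_seq = 0   # next in-order sequence number
        self.delivered = []     # payloads handed to the application

    def on_frame(self, seq: int, payload: bytes) -> str:
        if seq == self.expected_seq:
            # In-order arrival: deliver and acknowledge.
            self.delivered.append(payload)
            self.expected_seq += 1
            return f"ACK {seq}"
        # Gap detected: request timely retransmission of the missing frame.
        return f"NACK {self.expected_seq}"


rx = LtlReceiver()
print(rx.on_frame(0, b"a"))  # ACK 0
print(rx.on_frame(2, b"c"))  # NACK 1 -- frame 1 was dropped or reordered
```

The real engine keeps this state per connection in its receive connection table; the sketch tracks a single connection for clarity.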
To evaluate LTL and resource sharing, we implemented and deployed a latency-optimized Deep Neural Network (DNN) accelerator developed for natural language processing, and we used a synthetic stress test to simulate DNN request traffic at varying levels of oversubscription. By increasing the ratio of clients to accelerators (by removing FPGAs from the pool), we measure the impact on latency due to oversubscription. The synthetic stress test generated by a single software client is calibrated to generate at least twice the worst-case traffic expected in production (thus, even with 1:1 oversubscription, the offered load and latencies are highly conservative).
Figure 3 shows request latencies as oversubscription increases. In the 1:1 case (no oversubscription), remote access adds less than 4.7 percent additional latency to each request up to the 95th percentile, and 32 percent at the 99th percentile. As expected, contention and queuing delay increase as oversubscription increases. Eventually, the FPGA reaches its peak throughput and saturates. In this case study, each individual FPGA has sufficient throughput to comfortably support roughly two clients even at artificially high traffic levels, demonstrating the feasibility of sharing accelerators and freeing resources for other functions.
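The shape of that latency curve matches what a textbook single-server queuing model predicts. The sketch below uses an M/M/1 approximation with an assumed per-client utilization and service time (both invented for illustration), not the article's measured data:

```python
def mean_latency(clients: int, per_client_util: float, service_time_us: float) -> float:
    """M/M/1 approximation: mean latency rises sharply as offered load
    (clients x per-client utilization) approaches the FPGA's saturation
    point, and is unbounded past it."""
    rho = clients * per_client_util  # total utilization of the accelerator
    if rho >= 1.0:
        return float("inf")  # oversubscribed past peak throughput
    return service_time_us / (1.0 - rho)


# Assumed values: each client offers 30% of one FPGA's peak throughput,
# and an uncontended request takes 10 microseconds of service time.
base = mean_latency(1, 0.3, 10.0)
for n in (1, 2, 3):
    print(n, round(mean_latency(n, 0.3, 10.0) / base, 2))
```

With these assumptions, two clients per FPGA raise mean latency modestly while three clients approach saturation, qualitatively mirroring the "roughly two clients per FPGA" finding.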
Managing remote accelerators requires significant hardware and software support. A complete overview of the management of our hardware fabric, called Hardware as a Service (HaaS), is beyond the scope of this article, but we provide a short overview of the platform here. HaaS manages FPGAs in a manner similar to Yarn5 and other job schedulers. A logically centralized resource manager (RM) tracks FPGA resources throughout the datacenter. The RM provides simple APIs for higher-level service managers (SMs) to easily manage FPGA-based hardware components through a lease-based model. Each component is an instance of a hardware service made up of one or more FPGAs and a set of constraints (for example, locality or bandwidth). SMs manage service-level tasks, such as load balancing, intercomponent connectivity, and failure handling, by requesting and releasing component leases through the RM. An SM provides pointers to the hardware service to one or more users to take advantage of the hardware acceleration. An FPGA manager (FM) runs on each node to provide configuration support and status monitoring for the system.

Figure 3. Average, 95th, and 99th percentile latencies to a remote Deep Neural Network (DNN) accelerator, normalized to locally attached performance in each latency category, plotted against oversubscription (number of remote clients per FPGA).
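The lease-based RM interaction can be sketched as a small API. The class and method names below are hypothetical illustrations of the model just described, not the actual HaaS interfaces:

```python
import itertools


class ResourceManager:
    """Toy model of the HaaS resource manager (RM): it tracks free FPGAs
    and hands out leases on hardware components to service managers (SMs)."""

    def __init__(self, fpga_ids):
        self.free = set(fpga_ids)   # FPGAs not currently leased
        self.leases = {}            # lease id -> set of granted FPGAs
        self._ids = itertools.count()

    def acquire(self, num_fpgas: int, constraints=None):
        """Lease a component of one or more FPGAs. Constraints such as
        locality or bandwidth are ignored in this sketch."""
        if len(self.free) < num_fpgas:
            raise RuntimeError("insufficient free FPGAs")
        granted = {self.free.pop() for _ in range(num_fpgas)}
        lease_id = next(self._ids)
        self.leases[lease_id] = granted
        return lease_id, granted

    def release(self, lease_id: int):
        """SMs release component leases when rebalancing or handling failures."""
        self.free |= self.leases.pop(lease_id)


rm = ResourceManager(["fpga0", "fpga1", "fpga2"])
lease, fpgas = rm.acquire(2)   # an SM leases a two-FPGA component
rm.release(lease)              # and returns it to the pool
```

An SM layered above this would hand out pointers to the leased component and handle load balancing and failures, as the article describes.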
Hardware Details

In addition to the architectural requirement to provide sufficient flexibility to justify scale production deployment, there are also physical restrictions in current infrastructures that must be overcome. These restrictions include strict power limits, a small physical space in which to fit, resilience to hardware failures, and tolerance to high temperatures. For example, the accelerator architecture we describe targets the widely used OpenCompute server, which constrained power to 35 W, the physical size to roughly a half-height, half-length PCIe expansion card, and required tolerance to an inlet air temperature of 70°C at low airflow.
We designed the accelerator board as a stand-alone FPGA board that is added to the PCIe expansion slot in a production server configuration. Figure 4 shows a photograph of the board with major components labeled. The FPGA is an Altera Stratix V D5, with 172,600 adaptive logic modules of programmable logic, 4 Gbytes of DDR3-1600 DRAM, two independent PCIe Gen 3 x8 connections, and two independent 40 Gigabit Ethernet interfaces. The realistic power draw of the card under worst-case environmental conditions is 35 W.
The dual 40 Gigabit Ethernet interfaces on the board could allow for a private FPGA network, as was done in previous work, but this configuration also lets us wire the FPGA as a "bump in the wire," sitting between the NIC and the top-of-rack (ToR) switch. Rather than cabling the standard NIC directly to the ToR, the NIC is cabled to one port of the FPGA, and the other FPGA port is cabled to the ToR (see Figure 1b).
Maintaining the discrete NIC in the system lets us leverage all the existing network offload and packet transport functionality hardened into the NIC. This minimizes the code required to deploy FPGAs to simple bypass logic. In addition, both FPGA resources and PCIe bandwidth are preserved for acceleration functionality, rather than being spent on implementing the NIC in soft logic.
One potential drawback to the bump-in-the-wire architecture is that an FPGA failure, such as loading a buggy application, could cut off network traffic to the server, rendering the server unreachable. However, unlike in a ring or torus network, failures in the bump-in-the-wire architecture do not degrade any neighboring FPGAs, making the overall system more resilient to failures. In addition, most datacenter servers (including ours) have a side-channel management path that exists to power servers on and off. By policy, the known-good golden image that loads on power-up is rarely (if ever) overwritten, so the management network can always recover the FPGA with a known-good configuration, making the server reachable via the network once again.
In addition, FPGAs have proven to be reliable and resilient at hyperscale, with only 0.03 percent of boards failing across our deployment after one month of processing full-production traffic, and with all failures happening at the beginning of production. Bit flips in the configuration layer were measured at an average rate of one per 1,025 machine days. We used configuration-layer monitoring and correcting circuitry to minimize the impact of these single-event upsets. Given aggregate datacenter failure rates, we deemed the FPGA-related hardware failures to be acceptably low for production.
Figure 4. Photograph of the manufactured board, with the major components labeled: the Stratix V D5 FPGA, the two 40G QSFP ports (NIC and ToR), and the 4-Gbyte DDR3. The DDR channel is implemented using discrete components. PCI Express connectivity goes through a mezzanine connector on the bottom side of the board (not shown).

The Configurable Cloud architecture is a major advance in the way datacenters are being designed and utilized. The impact of this design goes far beyond just an improved network design.
The collective architecture and software protocols described in this article can be seen as a fundamental shift in the role of CPUs in the datacenter. Rather than the CPU controlling every task, the FPGA is now the gatekeeper between the server and the network, determining how incoming and outgoing data will be processed and handling common cases quickly and efficiently. In such a model, the FPGA calls the CPU to handle infrequent and/or complex work that the FPGA cannot handle itself. Such an architecture adds another mode of operation to a traditional computing platform, potentially removing the CPU as the machine's master. Such an organization can be viewed as relegating the CPU to a complexity offload engine for the FPGA.
Of course, there will be many applications in which the CPUs handle the bulk of the computation. In that case, the FPGAs attached to those CPUs can be used over the network, in a few tens of microseconds, by applications running on different servers elsewhere in the datacenter that need more FPGA resources.
Such a Configurable Cloud provides enormous flexibility in how computation is done and in the placement of computational tasks, enabling the right computational unit to be assigned a particular task at the right time. Thus, additional efficiency can be extracted from the hardware, allowing accelerators to be shared and underutilized resources to be reclaimed and repurposed independently of the other resources. In addition, by distributing heterogeneous accelerators throughout the network, this architecture avoids the network bottlenecks, cost, and complexities of bolt-on clusters of specialized accelerator nodes.
As a result, Configurable Clouds enable the performant implementation of wildly different functions on exactly the same hardware. In addition, specialized hardware is far more efficient than CPUs, making Configurable Clouds better for the environment.
This work has already had significant impact. Microsoft is building the vast majority of its next-generation datacenters across 15 countries and 5 continents using this architecture. Microsoft has publicly described how that architecture is being used to improve Bing search quality and performance and Azure networking capabilities and performance. Those deployments and the results of accelerating applications on them provide further confirmation of programmable accelerators' value for datacenter services.
FPGA architectures are being heavily influenced by this work. Investment into datacenter FPGA technology and ecosystems by Microsoft and other major companies is increasing, not least being Intel's acquisition of Altera for $16.7 billion, as well as the recent introduction of FPGAs at a limited scale by the majority of the other major cloud providers. MICRO
References
1. A. Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," Proc. 41st Ann. Int'l Symp. Computer Architecture, 2014, pp. 13–24.
2. A.M. Caulfield et al., "A Cloud-Scale Acceleration Architecture," Proc. 49th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2016; doi:10.1109/MICRO.2016.7783710.
3. Y. Khalidi et al., "Microsoft Azure Networking: New Network Services, Features and Scenarios," Microsoft Ignite, 2016; http://channel9.msdn.com/Events/Ignite/2016/BRK3237-TS.
4. S. Nadella, "Innovation Keynote," Microsoft Ignite, 2016; http://channel9.msdn.com/Events/Ignite/2016/KEY02.
5. V.K. Vavilapalli et al., "Apache Hadoop YARN: Yet Another Resource Negotiator," Proc. 4th Ann. Symp. Cloud Computing, 2013, pp. 5:1–5:16.
Adrian M. Caulfield is a principal research hardware development engineer at Microsoft Research. His research interests include computer architecture and reconfigurable computing. Caulfield received a PhD in computer engineering from the University of California, San Diego. Contact him at [email protected].

Eric S. Chung is a researcher at Microsoft Research NExT, where he leads an effort to accelerate large-scale machine learning using FPGAs. Chung received a PhD in electrical and computer engineering from Carnegie Mellon University. Contact him at [email protected].
Andrew Putnam is a principal research hardware development engineer at Microsoft Research NExT. His research interests include reconfigurable computing, future datacenter design, and computer architecture. Putnam received a PhD in computer science and engineering from the University of Washington. Contact him at [email protected].

Hari Angepat is a senior software engineer at Microsoft and a PhD candidate at the University of Texas at Austin. His research interests include novel FPGA accelerator architectures, hardware/software codesign techniques, and approaches for flexible hardware design. Angepat received an MS in computer engineering from the University of Texas at Austin. Contact him at [email protected].

Daniel Firestone is the tech lead and manager for the Azure Networking Host SDN team at Microsoft. His team builds the Azure virtual switch and SmartNIC. Contact him at [email protected].

Jeremy Fowers is a senior research hardware design engineer in the Catapult team at Microsoft Research NeXT. He specializes in the design and implementation of FPGA accelerators across a variety of application domains, and is currently focused on machine learning. Fowers received a PhD in electrical engineering from the University of Florida. Contact him at [email protected].

Michael Haselman is a senior software engineer at Microsoft ASG (Bing). His research interests include FPGAs, computer architecture, and distributed computing. Haselman received a PhD in electrical engineering from the University of Washington. Contact him at [email protected].

Stephen Heil is a principal program manager at Microsoft Research. His research interests include field-programmable gate arrays, application accelerators, and rack-scale system design. Heil received a BS in electrical engineering technology and computer science from the College of New Jersey (formerly Trenton State College). Contact him at [email protected].

Matt Humphrey is a principal engineer at Microsoft, where he works on Azure. His research interests include analog and digital electronics and the architecture of high-scale distributed software systems. Humphrey received an MS in electrical engineering from the Georgia Institute of Technology. Contact him at [email protected].

Puneet Kaur is a principal software engineer at Microsoft. Her research interests include distributed systems and scalability. Kaur received an MTech in computer science from the Indian Institute of Technology, Kanpur. Contact her at [email protected].

Joo-Young Kim is a senior research hardware development engineer at Microsoft Research. His research interests include high-performance, energy-efficient computer architectures for various datacenter workloads, such as data compression, video transcoding, and machine learning. Kim received a PhD in electrical engineering from Korea Advanced Institute of Science and Technology (KAIST). Contact him at [email protected].

Daniel Lo is a research hardware development engineer at Microsoft Research. His research interests include designing high-performance accelerators on FPGAs. Lo received a PhD in electrical and computer engineering from Cornell University. Contact him at [email protected].

Todd Massengill is a senior hardware design engineer at Microsoft Research. His research interests include hardware acceleration of artificial intelligence applications, biologically inspired computing, and tools to improve hardware engineering design, collaboration, and efficiency. Massengill received an MS in electrical engineering from the University of Washington. Contact him at [email protected].
Kalin Ovtcharov is a research hardware development engineer at Microsoft Research NeXT. His research interests include accelerating computationally intensive tasks on FPGAs in areas such as machine learning, image processing, and video compression. Ovtcharov received a BS in computer engineering from McMaster University. Contact him at [email protected].
Michael Papamichael is a researcher at Microsoft Research, where he's working on the Catapult project. His research interests include hardware acceleration, reconfigurable computing, on-chip interconnects, and methodologies to facilitate hardware specialization. Papamichael received a PhD in computer science from Carnegie Mellon University. Contact him at [email protected].

Lisa Woods is a principal program manager for the Catapult project at Microsoft Research, where she focuses on driving strategic alignment between the Catapult team and its many internal and external partners, as well as scalability and execution for the project. Woods received an MS in computer science. Contact her at [email protected].

Sitaram Lanka is a partner group engineering manager in Search Platform at Microsoft Bing. His research interests include web and enterprise search, large-scale distributed systems, machine learning, and reconfigurable hardware in datacenters. Lanka received a PhD in computer science from the University of Pennsylvania. Contact him at [email protected].

Derek Chiou is a partner hardware group engineering manager at Microsoft and a research scientist at the University of Texas at Austin. His research interests include accelerating datacenter applications and infrastructure, rapid system design, and fast, accurate simulation. Chiou received a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. Contact him at [email protected].

Doug Burger is a distinguished engineer at Microsoft, where he leads a Disruptive Systems team at Microsoft Research NExT and cofounded Project Catapult. Burger received a PhD in computer science from the University of Wisconsin. He is an IEEE and ACM Fellow. Contact him at [email protected].
SPECIALIZING A PLANET'S COMPUTATION: ASIC CLOUDS

ASIC Clouds, a natural evolution of CPU- and GPU-based clouds, are purpose-built datacenters filled with ASIC accelerators. ASIC Clouds may seem improbable due to high non-recurring engineering (NRE) costs and ASIC inflexibility, but large-scale Bitcoin ASIC Clouds already exist. This article distills lessons from these primordial ASIC Clouds and proposes new planet-scale YouTube-style video-transcoding and deep learning ASIC Clouds, showing superior total cost of ownership. ASIC Cloud NRE and economics are also examined.
In the past 10 years, two parallel phase changes in the computational landscape have emerged. The first change is the bifurcation of computation into two sectors: cloud and mobile. The second change is the rise of dark silicon and dark-silicon-aware design techniques, such as specialization and near-threshold computation.1 Recently, researchers and industry have started to examine the conjunction of these two phase changes. Baidu has developed GPU-based clouds for distributed neural network accelerators, and Microsoft has deployed clouds based on field-programmable gate arrays (FPGAs) for Bing.
At a single-node level, we know that application-specific integrated circuits (ASICs) can offer order-of-magnitude improvements in energy efficiency and cost performance over CPU, GPU, and FPGA by specializing silicon for a particular computation. Our research proposes ASIC Clouds,2 which are purpose-built datacenters comprising large arrays of ASIC accelerators. ASIC Clouds are not ASIC supercomputers that scale up problem sizes for a single tightly coupled computation; rather, they target workloads comprising many independent but similar jobs.
Moein Khazraee, Luis Vega Gutierrez, Ikuo Magaki, and Michael Bedford Taylor
University of California, San Diego

Published by the IEEE Computer Society. 0272-1732/17/$33.00 © 2017 IEEE

As more and more services are built around the Cloud model, we see the emergence of planet-scale workloads in which datacenters are performing the same computation across many users. For example, consider Facebook's face recognition of uploaded pictures, or Apple's Siri voice recognition, or the Internal Revenue Service performing tax audits with neural nets. Such scale-out workloads can easily leverage racks of ASIC servers containing arrays of chips that in turn connect arrays of replicated compute accelerators (RCAs) on an on-chip network. The large scale of these workloads creates the economic justification to pay the non-recurring engineering (NRE) costs of ASIC development and deployment. As a workload grows, the ASIC Cloud can be scaled in the datacenter by adding more ASIC servers, unlike accelerators in, say, a mobile phone population,3 in which the accelerator/processor mix is fixed at tape out.
Our research examines ASIC Clouds in the context of four key applications that show great potential for ASIC Clouds: YouTube-style video transcoding, Bitcoin and Litecoin mining, and deep learning. ASICs achieve large reductions in silicon area and energy consumption versus CPUs, GPUs, and FPGAs. We specialize the ASIC server to maximize efficiency, employing optimized ASICs, a customized printed circuit board (PCB), custom-designed cooling systems, specialized power delivery systems, and tailored DRAM and I/O subsystems. ASIC voltages are customized to tweak energy efficiency and minimize total cost of ownership (TCO). The datacenter itself can also be specialized, optimizing rack-level and datacenter-level thermals and power delivery to exploit knowledge of the computation. We developed tools that consider all aspects of ASIC Cloud design in a bottom-up way, and methodologies that reveal how the designers of these novel systems can optimize TCO in real-world ASIC Clouds. Finally, we propose a new rule that explains when it makes sense to design and deploy an ASIC Cloud, considering NRE.
ASIC Cloud Architecture

At the heart of any ASIC Cloud is an energy-efficient, high-performance, specialized RCA that is multiplied up by having multiple copies per ASIC, multiple ASICs per server, multiple servers per rack, and multiple racks per datacenter (see Figure 1). Work requests from outside the datacenter will be distributed across these RCAs in a scale-out fashion. All system components can be customized for the application to minimize TCO.
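The multiplicative scale-up through that hierarchy is easy to make concrete. All counts in this sketch are invented for illustration; the paper does not give these figures:

```python
def cloud_throughput(ops_per_rca, rcas_per_asic, asics_per_server,
                     servers_per_rack, racks):
    """Aggregate throughput of an ASIC Cloud: one RCA's rate multiplied
    up through the chip, server, rack, and datacenter levels."""
    return (ops_per_rca * rcas_per_asic * asics_per_server
            * servers_per_rack * racks)


# Hypothetical deployment: 1 Gops/s per RCA, 8 RCAs/ASIC, 10 ASICs/server,
# 40 servers/rack, 100 racks.
total = cloud_throughput(1e9, 8, 10, 40, 100)
print(f"{total:.1e} ops/s")  # 3.2e+14 ops/s for these assumed counts
```

The same multiplication also explains why a per-RCA efficiency gain compounds: any improvement at the accelerator level scales linearly through every level of the datacenter.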
Each ASIC interconnects its RCAs using a customized on-chip network. The ASIC's control plane unit also connects to this network and schedules incoming work from the ASIC's off-chip router onto the RCAs. Next, the packaged ASICs are arranged in lanes on a customized PCB and connected to a controller that bridges to the off-PCB interface (1 to 100 Gigabit Ethernet, Remote Direct Memory Access, and PCI Express). In some cases, DRAMs can connect directly to the ASICs. The controller can be implemented by an FPGA, a microcontroller, or a Xeon processor. It schedules remote procedure calls (RPCs) that come from the off-PCB interface onto the ASICs.
Figure 1. High-level abstract architecture of an ASIC Cloud. Specialized replicated compute accelerators (RCAs) are multiplied up by having multiple copies per application-specific integrated circuit (ASIC), multiple ASICs per server, multiple servers per rack, and multiple racks per datacenter. The server controller can be a field-programmable gate array (FPGA), microcontroller, or Xeon processor. The power delivery and cooling system are customized based on ASIC needs. If required, there would be DRAMs on the printed circuit board (PCB) as well. (PSU: power supply unit.)
Depending on the application, the controller can also implement the non-acceleratable part of the workload or perform UDP/TCP-IP offload.
Each lane is enclosed by a duct and has a dedicated fan blowing air through it across the ASIC heatsinks. Our simulations indicate that using ducts results in better cooling performance compared to a conventional or staggered layout. The PCB, fans, and power supply are enclosed in a 1U server, which is then assembled into racks in a datacenter. Based on ASIC needs, the power supply unit (PSU) and DC/DC converters are customized for each server.
The "Evaluating an ASIC Server Configuration" sidebar shows our automated methodology for designing a complete ASIC Cloud system.
Application Case Study

To explore ASIC Clouds across a range of accelerator properties, we examined four applications that span a diverse range of properties: Bitcoin mining, Litecoin mining, video transcoding, and deep learning (see Figure 2).
Perhaps the most critical of these applications is Bitcoin mining. Our inspiration for ASIC Clouds came from our intensive study of Bitcoin mining clouds,4 which are one of the first known instances of a real-life ASIC Cloud. Figure 3 shows the massive scale-out of the Bitcoin-mining workload, which is now operating at the performance of 3.2 billion GPUs. Bitcoin clouds have undergone a rapid ramp from CPU to GPU to FPGA to the most advanced ASIC technology available today. Bitcoin is a logic-intensive design that has high power density and no need for static RAM (SRAM) or external DRAM.
Litecoin is another popular cryptocurrency mining system that has been deployed into clouds. Unlike Bitcoin, it is an SRAM-intensive application with low power density.
Video transcoding, which converts from one video format to another, currently takes almost 30 high-end Xeon servers to do in real time. Because every cell phone and Internet of Things device can easily be a video source, it has the potential to be an unimaginably large planet-scale computation. Video transcoding is an external-memory-intensive application that needs DRAMs next to each ASIC. It also requires high off-PCB bandwidth.
Finally, deep learning is extremely computationally intensive and is likely to be used by every human on the planet. It is often latency sensitive, so our deep learning neural net accelerator has a tight low-latency service-level agreement.

Figure 2. Accelerator properties. We explore applications with diverse requirements. (The chart positions Bitcoin, Litecoin, video transcoding, and deep learning along four axes: on-chip logic intensity, on-chip RAM intensity, DRAM or I/O intensity, and latency sensitivity.)

Figure 3. Evolution of specialization: Bitcoin cryptocurrency mining clouds. Numbers are ASIC nodes, in nanometers, which annotate the first date of release of a miner on that technology. Difficulty is the ratio of the total Bitcoin hash throughput of the world, relative to the initial mining network throughput, which was 7.15 MH per second. In the six-year period preceding November 2015, the throughput increased by a factor of 50 billion, corresponding to a world hash rate of approximately 575 million GH per second.
For our Bitcoin and Litecoin studies, we developed the RCA and got the required parameters, such as gate count, from placed-and-routed designs in UMC 28 nm using Synopsys IC Compiler and analysis tools (such as PrimeTime). For deep learning and video transcoding, we extracted properties from accelerators in the research literature.
Evaluating an ASIC Server Configuration
Our ASIC Cloud server configuration evaluator, shown in Figure A1, starts with a Verilog implementation of an accelerator, or a
detailed evaluation of the accelerator’s properties from the
research literature. In the design of an ASIC server, we must
decide how many chips should be placed on the printed circuit
board (PCB) and how large, in mm2 of silicon, each chip should be.
The size of each chip determines how many replicated compute
accelerators (RCAs) will be on each chip. In each duct-enclosed
lane of ASIC chips, each chip receives around the same amount of
airflow from the intake fans, but the most downstream chip
receives the hottest air, which includes the waste heat from the
other chips. Therefore, the thermally bottlenecking ASIC is the
one in the back, shown in our detailed computational fluid
dynamics (CFD) simulations in Figure A2. Our simulations show
that breaking a fixed heat source into smaller ones with the same
total heat output improves the mixing of warm and cold areas,
resulting in lower temperatures. Using thermal optimization tech-
niques, we established a fundamental connection between an
RCA’s properties, the number of RCAs placed in an ASIC, and how
many ASICs go on a PCB in a server. Given these properties, our
heat sink solver determines the optimal heat sink configuration.
Results are validated with the CFD simulator. In the “Design Space Exploration” sidebar, we show how we apply this evaluation
flow across the design space to determine TCO and Pareto-
optimal points that trade off cost per operation per second (ops/s)
and watts per ops/s.
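The duct-lane thermal behavior described above can be approximated with a simple series air-heating model; this is an illustrative sketch under assumed constants (air properties, heat-sink resistance, function name), not the paper's CFD-based heat sink solver.

```python
def max_chips_per_lane(chip_power_w, airflow_m3s, t_inlet_c=30.0,
                       t_max_c=85.0, theta_sink_c_per_w=0.25):
    """Estimate how many equal-power chips fit in one duct-enclosed lane.

    Series air-heating model (an assumption): each chip dumps its waste
    heat into the airstream, so the most downstream chip sees the hottest
    air. Air density ~1.2 kg/m^3, specific heat ~1005 J/(kg*K).
    """
    air_w_per_k = 1.2 * 1005.0 * airflow_m3s  # heat capacity rate of the air
    n = 0
    while True:
        # Air temperature at chip n+1, warmed by the n upstream chips.
        t_air = t_inlet_c + n * chip_power_w / air_w_per_k
        # The junction temperature adds this chip's own heat-sink rise.
        if t_air + theta_sink_c_per_w * chip_power_w > t_max_c:
            return n
        n += 1

print(max_chips_per_lane(chip_power_w=40.0, airflow_m3s=0.01))
```

Increasing the airflow, or splitting the same total power across more, lower-power chips, raises the count, consistent with the observation that breaking a fixed heat source into smaller ones lowers temperatures.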
Figure A. ASIC server evaluation flow. (1) The server cost, per server hash rate, and the energy efficiency are evaluated
using replicated compute accelerator (RCA) properties and a flow that optimizes server heatsinks, die size, voltage, and
power density. (2) Thermal verification of an ASIC Cloud server using Computational Fluid Dynamics tools to validate
the flow results. The farthest ASIC from the fan has the highest temperature and is the bottleneck for power per ASIC
at a fixed voltage and energy efficiency.
.............................................................
MAY/JUNE 2017 65
Results
Table 1 gives details of optimal server configurations for energy-, TCO-, and cost-optimal designs for each application. The “Design Space Exploration” sidebar explains how these optimal configurations are determined.
For example, for video transcoding, the cost-optimal server packs the maximum number of DRAMs per lane, 36, which maximizes performance. However, increasing the number of DRAMs per ASIC requires higher logic voltage (1.34 V) and corresponding frequencies to attain performance within the maximum die area constraint, resulting in less energy-efficient designs. Hence, the energy-optimal design has fewer DRAMs per ASIC and per lane (24), but it gains back some performance by increasing ASICs per lane, which is possible due to lower power density at 0.54 V. The TCO-optimal design increases DRAMs per lane, to 30, to improve performance, but is still close to the optimal energy efficiency at 0.75 V, resulting in a die size and frequency between the other two optimal points.
Figure 4 compares the performance of CPU Clouds, GPU Clouds, and ASIC Clouds for the four applications that we presented. ASIC Clouds outperform CPU Clouds’ TCO per operations per second (ops/s) by 6,270, 704, and 8,695 times for Bitcoin, Litecoin, and video transcoding, respectively. ASIC Clouds outperform GPU Clouds’ TCO per ops/s by 1,057, 155, and 199 times for Bitcoin, Litecoin, and deep learning, respectively.
ASIC Cloud Feasibility: The Two-for-Two Rule
When does it make sense to design and deploy an ASIC Cloud? The key barrier is
Design Space Exploration
After all thermal constraints were in place, we optimized ASIC
server design targeting two conventional key metrics—namely,
cost per ops/s and power per ops/s—and then applied TCO analy-
sis. TCO analysis incorporates the datacenter-level constraints,
including the cost of power delivery inside the datacenter, land,
depreciation, interest, and the cost of energy itself. With these
tools, we can correctly weight these two metrics and find the over-
all optimal point (TCO-optimal) for the ASIC Cloud.
Design-space exploration is application dependent, and there are
frequently additional constraints. For example, for the video trans-
coding application, we model the PCB real estate occupied by these
DRAMs, which are placed on either side of the ASIC they connect to,
perpendicular to airflow. As the number of DRAMs increases, the
number of ASICs placed in a lane decreases for space reasons. We
model the more expensive PCBs required by DRAM, with more layers
and better signal/power integrity. We employ two 10-Gigabit Ether-
net ports as the off-PCB interface for network-intensive clouds, and
we model the area and power of the memory controllers.
Our ASIC Cloud infrastructure explores a comprehensive
design space, including DRAMs per ASIC, logic voltage, area per
ASIC, and number of chips. DRAM cost and power overhead are
significant, and so the Pareto-optimal video transcoding designs
ensure DRAM bandwidth is saturated, and link chip performance
to DRAM count. As voltage and frequency are lowered, area
increases to meet the performance requirement. Figure B shows
the video transcoding Pareto curve for five ASICs per lane and dif-
ferent numbers of DRAMs per ASIC. The tool comprises two tiers.
The top tier uses brute force to explore all possible configurations
to find the energy-optimal, cost-optimal, and TCO-optimal points
based on the Pareto results. The leaf tier comprises various expert
solvers that compute the optimal properties of the server compo-
nents—for example, CFD simulations for heat sinks, DC-DC con-
verter allocation, circuit area/delay/voltage/energy estimators,
and DRAM property simulation. In many cases, these solvers
export their data as large tables of memoized numbers for every
component.
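The two-tier structure can be sketched as a brute-force top tier searching over configurations scored by memoized leaf-solver outputs. All tables, constants, and names below are invented placeholders for the real solver data, not values from the paper:

```python
import itertools

# Hypothetical memoized leaf-solver tables (voltage -> normalized clock
# and power density). Real values would come from circuit/CFD solvers.
FREQ_AT_VOLTAGE = {0.5: 0.4, 0.7: 0.8, 0.9: 1.0}
POWER_AT_VOLTAGE = {0.5: 0.2, 0.7: 0.6, 0.9: 1.0}

def evaluate(voltage, dram_per_asic, asics_per_lane, die_area_mm2=500):
    """Return ($/ops, W/ops) for one server configuration (toy model)."""
    # DRAM-bound performance: throughput scales with DRAM count and clock.
    perf = FREQ_AT_VOLTAGE[voltage] * dram_per_asic * asics_per_lane
    power = POWER_AT_VOLTAGE[voltage] * die_area_mm2 * asics_per_lane
    cost = (10.0 * dram_per_asic + 0.05 * die_area_mm2) * asics_per_lane
    return cost / perf, power / perf

def tco_optimal(energy_price=2.0):
    """Brute-force top tier: weight $/ops and W/ops into a TCO-like metric."""
    space = itertools.product(FREQ_AT_VOLTAGE, [1, 2, 3, 4, 5, 6], [4, 5, 6])
    return min(space, key=lambda cfg: sum(w * m for w, m in
               zip((1.0, energy_price), evaluate(*cfg))))

print(tco_optimal())
```

The weighted sum collapses cost per ops/s and watts per ops/s into a single objective, mirroring how TCO analysis weighs capital cost against energy cost to find the overall optimal point.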
Figure B. Pareto curve example for video transcoding.
Exploring different numbers of DRAMs per ASIC and
logic voltage for optimal TCO per performance point.
Voltage increases from left to right. Diagonal lines show
equal TCO per performance values; the closer to the
origin, the lower the TCO per performance. This plot is
for five ASICs per lane.
Table 1. ASIC Cloud optimization results for four applications: (a) Bitcoin, (b) Litecoin, (c) video transcoding,
and (d) deep learning.
Property Energy optimal server TCO optimal server Cost optimal server
ASICs per server 120 72 24
Logic voltage (V) 0.400 0.459 0.594
Clock frequency (MHz) 71 149 435
Die area (mm2) 599 540 240
GH per second (GH/s) per server 7,292 8,223 3,451
W per server 2,645 3,736 2,513
Cost ($) per server 12,454 8,176 2,458
W per GH/s 0.363 0.454 0.728
Cost ($) per GH/s 1.708 0.994 0.712
Total cost of ownership (TCO) per GH/s 3.344 2.912 3.686
(a)
ASICs per server 120 120 72
Logic voltage (V) 0.459 0.656 0.866
Clock frequency (MHz) 152 576 823
Die area (mm2) 600 540 420
MH/s per server 405 1,384 916
W per server 783 3,662 3,766
$ per server 10,971 11,156 6,050
W per MH/s 1.934 2.645 4.113
$ per MH/s 27.09 8.059 6.607
TCO per MH/s 37.87 19.49 23.70
(b)
DRAMs per ASIC 3 6 9
ASICs per server 64 40 32
Logic voltage (V) 0.538 0.754 1.339
Clock frequency (MHz) 183 429 600
Die area (mm2) 564 498 543
Kilo frames per second (Kfps) per server 126 158 189
W per server 1,146 1,633 3,101
$ per server 7,289 5,300 5,591
W per Kfps 9.073 10.34 16.37
$ per Kfps 57.68 33.56 29.52
TCO per Kfps 100.3 78.46 97.91
(c)
Chip type 4×2 2×2 2×1
ASICs per server 32 64 96
Logic voltage (V) 0.900 0.900 0.900
Clock frequency (MHz) 606 606 606
Tera-operations per second (Tops/s) per server 470 470 353
W per server 3,278 3,493 2,971
$ per server 7,809 6,228 4,146
W per Tops/s per server 6.975 7.431 8.416
$ per Tops/s per server 16.62 13.25 11.74
TCO per Tops/s per server 46.22 44.28 46.51
(d)
*Energy-optimal server uses lower voltage to increase the energy efficiency. Cost-optimal server uses higher voltage to increase silicon efficiency. TCO-optimal server has a voltage between these two and balances energy versus silicon cost.
the cost of developing the ASIC server, which includes both the mask costs (about $1.5 million for the 28-nm node we consider here) and the ASIC design costs, which collectively comprise the NRE expense. To understand this tradeoff, we proposed the two-for-two rule: if the cost per year (that is, the TCO) for running the computation on an existing cloud exceeds the NRE by two times, and you can get at least a two-times TCO improvement per ops/s, then building an ASIC Cloud is likely to save money.
Figure 5 shows a wider range of break-even points. Essentially, as the TCO exceeds the NRE by more and more, the required speedup to break even declines. As a result, almost any accelerator proposed in the literature, no matter how modest the speedup, is a candidate for an ASIC Cloud, depending on the scale of the computation. Our research makes the key contribution of noting that, in the deployment of ASIC Clouds, NRE and scale can be more determinative than the absolute speedup of the accelerator. The main barrier for ASIC Clouds is to rein in NRE costs so they are appropriate for the scale of the computation. In many research accelerators, TCO improvements are extreme (such as in Figure 5), but authors often unnecessarily target expensive, latest-generation process nodes because they are more cutting-edge. This tendency raises the NRE exponentially, reducing economic feasibility. Our most recent work suggests that a better strategy is to lower NRE cost by targeting older nodes that still attain sufficient TCO per ops/s benefit.5
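As a sanity check on the rule, here is a minimal break-even model under the simplifying assumption that the one-time NRE must be recouped from one year of TCO savings; the function name and the numbers are illustrative, not the paper's exact accounting.

```python
def asic_cloud_breaks_even(annual_tco, nre, tco_improvement):
    """Two-for-two rule check (illustrative formulation).

    With a TCO-per-ops/s improvement factor s, moving to the ASIC Cloud
    saves annual_tco * (1 - 1/s) per year; break-even requires those
    savings to cover the NRE, i.e., TCO/NRE >= 1 / (1 - 1/s).
    """
    return annual_tco * (1.0 - 1.0 / tco_improvement) >= nre

# The knee of the curve: TCO = 2x NRE plus a 2x improvement breaks even.
print(asic_cloud_breaks_even(3.0e6, 1.5e6, 2.0))
print(asic_cloud_breaks_even(1.5e6, 1.5e6, 2.0))
```

Solving the inequality for the improvement factor gives s = TCO / (TCO − NRE), which reproduces the shape of Figure 5: at a TCO/NRE ratio of 2 the required improvement is exactly 2 (the knee), and it falls toward 1 as the ratio grows.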
Our research generalizes primordial Bitcoin ASIC Clouds into an architectural template that can apply across a range of planet-scale applications. Joint knowledge and control over datacenter and hardware design allows ASIC Cloud designers to select the design that balances energy and silicon cost to optimize TCO. Looking to the future, our work suggests that both cloud providers and silicon foundries would benefit by investing in technologies that reduce the NRE of ASIC design, including open source IP such as RISC-V, new labor-saving development methodologies for hardware, and open source back-end CAD tools. With time, mask costs fall by themselves, but older nodes such as 65 nm and 40 nm may provide suitable TCO per ops/s reduction, with one-third to half the mask cost and only a small difference in performance and energy efficiency from 28 nm. This is a major shift from the conventional wisdom in architecture research, which often chooses the best process even though it exponentially increases NRE. Foundries also should take interest in ASIC Cloud’s low-voltage scale-out design patterns, because they lead to greater silicon wafer consumption
Figure 4. CPU Cloud versus GPU Cloud versus ASIC Cloud death match.
ASIC servers greatly outperform the best non-ASIC alternative in terms of
TCO per operations per second (ops/s).
Figure 5. The two-for-two rule: minimum required TCO improvement versus TCO/NRE ratio. Moderate speedup with low non-recurring engineering (NRE) cost beats high speedup at high NRE. The points are break-even points for ASIC Clouds; the knee of the curve marks the two-for-two rule.
than CPUs within fixed environmental energy limits.
With the coming explosive growth of planet-scale computation, we must work to contain the exponentially growing environmental impact of datacenters across the world. ASIC Clouds promise to help address this problem. By specializing the datacenter, they can do greater amounts of computation under environmentally determined energy limits. The future is planet-scale, and specialized ASICs will be everywhere. MICRO
Acknowledgments
This work was partially supported by NSF awards 1228992, 1563767, and 1565446, and by STARnet’s Center for Future Architectures Research, an SRC program sponsored by MARCO and DARPA.
References
1. M.B. Taylor, “A Landscape of the Dark Silicon Design Regime,” IEEE Micro, vol. 33, no. 5, 2013, pp. 8–19.
2. I. Magaki et al., “ASIC Clouds: Specializing the Datacenter,” Proc. 43rd Int’l Symp. Computer Architecture, 2016, pp. 178–190.
3. N. Goulding-Hotta et al., “The GreenDroid Mobile Application Processor: An Architecture for Silicon’s Dark Future,” IEEE Micro, vol. 31, no. 2, 2011, pp. 86–95.
4. M.B. Taylor, “Bitcoin and the Age of Bespoke Silicon,” Proc. Int’l Conf. Compilers, Architectures and Synthesis for Embedded Systems, 2013, article 16.
5. M. Khazraee et al., “Moonwalk: NRE Optimization in ASIC Clouds,” Proc. 22nd Int’l Conf. Architectural Support for Programming Languages and Operating Systems, 2017, pp. 511–526.
Moein Khazraee is a PhD candidate in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include ASIC Clouds, NRE, and specialization. Khazraee received an MS in computer science from the University of California, San Diego. Contact him at [email protected].

Luis Vega Gutierrez is a staff research associate in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include ASIC Clouds, low-cost ASIC design, and systems. Vega received an MSc in electrical and computer engineering from the University of Kaiserslautern, Germany. Contact him at [email protected].

Ikuo Magaki is an engineer at Apple. He performed the work for this article as a Toshiba visiting scholar in the Department of Computer Science and Engineering at the University of California, San Diego. His research interests include ASIC design and ASIC Clouds. Magaki received an MSc in computer science from Keio University, Japan. Contact him at [email protected].

Michael Bedford Taylor advises his PhD students at various well-known west coast universities. He performed the work for this article while at the University of California, San Diego. His research interests include tiled multicore architecture, dark silicon, HLS accelerators for mobile, Bitcoin mining hardware, and ASIC Clouds. Taylor received a PhD in electrical engineering and computer science from the Massachusetts Institute of Technology. Contact him at [email protected].
DRAF: A LOW-POWER DRAM-BASED RECONFIGURABLE ACCELERATION FABRIC
THE DRAM-BASED RECONFIGURABLE ACCELERATION FABRIC (DRAF) USES COMMODITY
DRAM TECHNOLOGY TO IMPLEMENT A BIT-LEVEL, RECONFIGURABLE FABRIC THAT
IMPROVES AREA DENSITY BY 10 TIMES AND POWER CONSUMPTION BY MORE THAN
3 TIMES OVER CONVENTIONAL FIELD-PROGRAMMABLE GATE ARRAYS. LATENCY
OVERLAPPING AND MULTICONTEXT SUPPORT ALLOW DRAF TO MEET THE PERFORMANCE
AND DENSITY REQUIREMENTS OF DEMANDING APPLICATIONS IN DATACENTER AND
MOBILE ENVIRONMENTS.
The end of Dennard scaling has made it imperative to turn toward application- and domain-specific acceleration as an energy-efficient way to improve performance.1
Field-programmable gate arrays (FPGAs) have become a prominent acceleration platform, as they achieve a good balance between flexibility and efficiency.2 FPGAs have enabled accelerator designs for numerous domains, including datacenter computing,3 in which applications are much more complex and change frequently, and multitenancy sharing is a principal way to achieve resource efficiency.
For FPGA-based accelerators to become widely adopted, their cost must remain low. This is an issue both for large-scale datacenters that are optimized for total cost of ownership and for small mobile devices that have strict budgets for power and chip area. Unfortunately, conventional FPGAs realize arbitrary bit-level logic functions using static RAM (SRAM) based lookup tables and configurable interconnects, both of which incur significant area and power overheads. The poor logic density and high power consumption limit the functionality that one can implement within an FPGA. Previous research has used networks of medium-sized FPGAs3 or developed multicontext FPGAs4 to circumvent these limitations, but these approaches come with their own overheads. For details, see the sidebar, “FPGAs in Datacenters and Multicontext Reconfigurable Fabrics.”
We developed the DRAM-Based Reconfigurable Acceleration Fabric (DRAF), a reconfigurable fabric that improves logic density and reduces power consumption through the use of dense, commodity DRAM arrays. DRAF is bit-level reconfigurable and has similar flexibility to conventional FPGAs. DRAF
Mingyu Gao
Stanford University
Christina Delimitrou
Cornell University
Dimin Niu
Krishna T. Malladi
Hongzhong Zheng
Bob Brennan
Samsung Semiconductor
Christos Kozyrakis
Stanford University
.......................................................
70 Published by the IEEE Computer Society 0272-1732/17/$33.00�c 2017 IEEE
includes architectural optimizations, such as latency overlapping and multicontext support with fast context switching, that allow it to transform slow DRAM into a performant reconfigurable fabric suitable for both datacenters and mobile platforms.
Challenges for DRAM-Based FPGAs
Dense DRAM technology provides a new approach to realize the high-density, low-power reconfigurable fabrics necessary in constrained environments such as datacenters and mobile devices. However, simply replacing the SRAM-based lookup tables in FPGAs with DRAM-based cell arrays would lead to critical challenges in logic utilization, performance, and even operation correctness. First, DRAM arrays are heavily optimized for area and cost efficiency by using very wide inputs (address) and outputs (data). Such wide granularity does not match the relatively fine-grained logic functions in most real-world accelerator designs, resulting in underutilization of the DRAM-based lookup tables. Simply reducing the I/O width of DRAM arrays would forfeit the density benefit, because the peripheral logic would then dominate the lookup table area. Second, DRAM access speed is 30 times slower than that of SRAM arrays (2 to 10 ns versus 0.1 to 0.5 ns). Without careful optimization, a 30-times-slower FPGA would hardly provide any acceleration over programmable processors. Third, implementing large and complex logic functions often requires multiple lookup tables to be chained together, which is problematic with DRAM lookup tables, because
FPGAs in Datacenters and Multicontext Reconfigurable Fabrics
The advantages of spatial programmability and post-fabrication
reconfigurability have made field-programmable gate arrays (FPGAs)
the most successful and widely used reconfigurable fabric for accel-
erator designs in various domains. FPGAs provide bit-level reconfi-
gurability through lookup tables, which can implement arbitrary
combinational logic functions by storing the function outputs in small
static RAM arrays. The typical lookup table granularity at the
moment is 6-bit input and 1-bit output. FPGAs also have flip-flops for
data retiming and temporary storage. The lookup tables and flip-flops
are grouped into configurable logic blocks, which are organized into
a 2D grid layout with other dedicated DSP and block RAM blocks. A
bit-level, statically configurable interconnect supports communica-
tion between these blocks.
FPGAs have recently been used in datacenters as an accelera-
tion fabric for cloud applications.1–3 Datacenter servers often host
multiple complex applications. Hence, multiple large FPGA devices
are often necessary to provide sufficient resources for multiple
large accelerators. Unfortunately, the tight power budget and the
focus on total cost of ownership make it impractical to introduce
expensive, power-hungry devices. To counteract these issues,
Microsoft proposed the Catapult design, using medium-sized
FPGAs with custom-designed interconnects linked between
them.1 Although it improves performance, this approach increases
the system complexity and design integration cost, while still sup-
porting only a single application on the acceleration fabric.
Multicontext reconfigurable fabrics4 can support multitenancy
sharing by allowing rapid runtime switch between multiple designs
(contexts) that are all mapped onto a single fabric, similar to hard-
ware-supported thread switching in multithreaded processors. Such
fabrics store all context configurations on chip, either in specialized
lookup tables5 or in separate global backup memories.6 Both
approaches consume significant on-chip area for the additional stor-
age, greatly reducing the single-context logic capacity. In addition,
loading the configuration from the backup memories can result in
long context switch latency. Because of their large overheads, multi-
context FPGAs have not been widely adopted by industry.
References
1. A. Putnam et al., “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services,” Proc. 41st Ann.
Int’l Symp. Computer Architecture (ISCA), 2014, pp.
13–24.
2. J. Hauswald et al., “Sirius: An Open End-to-End Voice
and Vision Personal Assistant and Its Implications for
Future Warehouse Scale Computers,” Proc. 20th Int’l
Conf. Architectural Support for Programing Languages
and Operating Systems (ASPLOS), 2015, pp. 223–238.
3. R. Polig et al., “Giving Text Analytics a Boost,” IEEE
Micro, vol. 34, no. 4, 2014, pp. 6–14.
4. T.R. Halfhill, “Tabula’s Time Machine,” Microprocessor
Report, vol. 131, 2010.
5. E. Tau et al., “A First Generation DPGA Implementation,”
Proc. 3rd Canadian Workshop Field-Programmable Devi-
ces (FPD), 1995, pp. 138–143.
6. S. Trimberger et al., “A Time-Multiplexed FPGA,” Proc.
5th IEEE Symp. FPGA-Based Custom Computing
Machines (FCCM), 1997, p. 22.
the destructive nature of DRAM accesses requires explicit activation and precharge operations with precise timing. Without careful management and coordination between lookup tables, the lookup table contents would be destroyed if accessed with an unstable input. Finally, DRAM requires periodic refresh operations, which could negatively impact system power consumption and application performance.
DRAF Architecture
DRAF leverages DRAM technology to implement a reconfigurable fabric with higher logic capacity and lower power consumption than conventional FPGAs. Table 1 summarizes the key features of DRAF as compared to a conventional FPGA.
DRAF implements several key architectural optimizations to overcome the challenges discussed in the previous section. First, it uses a specialized DRAM lookup table design that achieves both high density and high utilization through a narrower output width and flexible column logic. Second, it uses a simple phase-based solution to specify the correct timing for each lookup table, and a three-way delay overlapping technique to significantly reduce the impact of DRAM operation latencies. Third, DRAF coordinates DRAM refresh in the device driver to reduce its power and latency impact. Finally, DRAF provides efficient multicontext support, which opens up the opportunity for sharing the acceleration fabric between multiple applications, greatly decreasing the overall system cost.
Overview
Similar to an FPGA, DRAF uses three types of logic blocks. The configurable logic block (CLB) contains lookup tables made with DRAM cell arrays and conventional flip-flops. The lookup table supports multiple on-chip configurations, each stored in one of the contexts. The digital signal processing (DSP) block is used for complex arithmetic operations, and the block RAM (BRAM) is for on-chip data storage. They are similar to those in FPGAs, but implemented in DRAM technology, which makes their latency and area worse than the corresponding implementation in a logic process. However, as we will show, the DRAM-based lookup table also has much higher latency than an SRAM-based lookup table; therefore, the increased latencies of DSP and BRAM are not critical and do not dominate the overall design critical path. In addition, the combinational DSP logic does not need to be replicated across contexts, thus offsetting its area overhead. The DRAM array in the BRAM block is similar to the lookup tables, but with larger capacity, and is used for data storage rather than design configurations.
The blocks in DRAF are organized in a 2D grid layout similar to that of conventional FPGAs (see Figure 1a). The DRAF interconnect uses a simple and static time-multiplexing scheme to support multiple contexts.5
Configurable Logic Block
Figure 1b shows the structure of the CLB in DRAF. The density advantage of DRAM technology allows a DRAF CLB to provide 10 times the logic capacity of an FPGA CLB within the same area. The CLB contains a few DRAM-based lookup tables and the associated flip-flops and auxiliary multiplexers. The inputs of the lookup table are
Table 1. Comparison of the DRAM-Based Reconfigurable Acceleration Fabric
(DRAF) and a conventional field-programmable gate array (FPGA).
Features Conventional FPGA DRAF
Lookup table technology Static RAM (SRAM) DRAM
Lookup table delay Short (0.1 to 1 ns) Long (1 to 10 ns)
Lookup table output width Single bit Multiple bits
Logic capacity Moderate Very high
No. of configurations Single Multiple (4 to 8)
Power consumption Moderate Low
split into two parts and connected to the row and column address ports of the DRAM array, respectively. To support multicontext operation, each lookup table is divided into four to eight contexts, leveraging the hierarchical structure in modern DRAM chips, in which the array is divided into DRAM MATs. Each context in DRAF is a modified DRAM MAT (see Figure 1c). The decoded row address activates a single local wordline, which connects the cells in that row to the bitlines. The data are then transferred to the sense-amplifiers and amplified to full-swing signals.
The typical MAT width and height in commodity DRAM devices are 512 to 1,024 cells. This implies a 9-bit-input, 512-bit-output lookup table, whereas a typical FPGA lookup table has a 6-bit input and a 1-bit output. To bring the DRAF lookup table granularity closer to the needs of real-world applications and increase logic utilization, we make each MAT narrower, reducing its width to 8 to 16 bits. This offers a good tradeoff between utilization and density. The aggregated row size of all contexts is still on the order of hundreds of bits, sufficiently amortizing the area overheads of the shared peripheral logic (such as the row decoder).
To further increase logic utilization and flexibility, we apply specialized column logic that allows each output bit to be independently selected from the corresponding set of bitlines. As Figure 1c shows, rather than sharing the same column address for all bits as in conventional DRAM, we organize the 16 bitlines into four groups, and provide each group a separate set of 2 bits to select one output from its 4 bits. This additional level of multiplexing further reduces the output width to 4 bits, while allowing each bit to have partially different input bits, increasing the flexibility of the lookup table functionality.
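The row/column split and per-group column select can be captured in a short behavioral sketch of the example geometry (6-bit row address, 16 bitlines in four groups, a 2-bit select per group); this is a simulation for illustration, not a hardware model.

```python
def draf_lut_read(mat, input_bits):
    """Read a DRAM-based DRAF lookup table (behavioral sketch).

    mat: 64 rows x 16 bitlines, each row stored as a 16-bit integer.
    input_bits: 14-bit integer; bits [13:8] are the row address, and each
    output bit g has its own 2-bit column select in bits [2g+1:2g],
    choosing one of the 4 bitlines in its group.
    """
    row = mat[(input_bits >> 8) & 0x3F]           # activate one local wordline
    out = 0
    for g in range(4):                            # specialized column logic
        sel = (input_bits >> (2 * g)) & 0x3       # per-group column select
        bit = (row >> (4 * g + sel)) & 0x1        # pick 1 of 4 bitlines
        out |= bit << g
    return out

# Example MAT: row 5 holds 0x8421, so group g's set bit sits at column g.
mat = [0] * 64
mat[5] = 0x8421
print(draf_lut_read(mat, (5 << 8) | (3 << 6) | (2 << 4) | (1 << 2)))  # prints 15
```

Note how each output bit effectively sees 8 input bits (the shared 6-bit row address plus its private 2-bit column select), which is what lets the four bits implement partially different functions.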
Multicontext Support
DRAF seamlessly supports multicontext operation by storing each design configuration in one MAT and allowing single-MAT access. Effectively, each MAT forms one context across all lookup tables. The multiple contexts in one device can be used for different independent logic designs, each of which accelerates one application running
Figure 1. The DRAM-Based Reconfigurable Acceleration Fabric (DRAF) architecture. (a) The block layout of DRAF, similar to
an FPGA. Block sizes and numbers can vary across devices. (b) The configurable logic block (CLB) in DRAF. As a typical
example, a CLB contains two DRAM-based lookup tables and associated flip-flops (FFs) organized into eight contexts. Each
lookup table has an 8-bit (6 bits for row and 2 bits for column) input and a 4-bit output. (c) Detailed view of one context in a
DRAF lookup table with the context enable gate and specialized column logic.
on the shared system. Alternatively, we can split a single large and complicated design, such as websearch in Microsoft Catapult,3 and map each part to one context in order to fit the entire design on a single device instead of using a multi-FPGA network.
We leverage the hierarchical wordline structure in DRAM to decouple the accesses to each MAT by adding an enable AND gate to each local wordline driver,6 as shown in Figure 1c. This lets us access only the single MAT corresponding to the currently selected context, while disabling the other MATs. A context output multiplexer selects the enabled context for the lookup table output port.
The multicontext support in DRAF is particularly efficient. First, the area overhead is negligible, because the peripheral logic (for example, the row decoder) is shared between contexts. Second, the idle contexts (MATs) are not accessed, introducing little dynamic power overhead, and they can be further power-gated to reduce static power. Third, because the design configurations are stored in place in each lookup table, a context switch is instant: simply updating a context index counter (CTX SEL in Figure 1b) makes the new context ready to use in the next cycle.
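The context-switch mechanism described above can be sketched behaviorally. This toy Python model is our own illustration (class and field names such as `DrafLut` and `ctx_sel` are invented, not from the paper); it shows why the switch is instant: every context's configuration stays resident in its MAT, so switching is just updating the counter.

```python
# Behavioral sketch (not RTL) of an 8-context DRAF lookup table.
# Names such as DrafLut and ctx_sel are illustrative assumptions.

class DrafLut:
    """Eight contexts, each a 256-entry (8-bit input) table of 4-bit outputs."""

    def __init__(self, num_contexts=8, input_bits=8, output_bits=4):
        self.tables = [[0] * (1 << input_bits) for _ in range(num_contexts)]
        self.ctx_sel = 0          # CTX SEL counter: selects the active context
        self.output_mask = (1 << output_bits) - 1

    def configure(self, ctx, truth_table):
        """Load one context's configuration (one MAT) independently."""
        self.tables[ctx] = [v & self.output_mask for v in truth_table]

    def switch_context(self, ctx):
        """Instant context switch: only the counter changes; the new
        context is usable on the next cycle (no reconfiguration)."""
        self.ctx_sel = ctx

    def lookup(self, addr):
        """Only the selected context's MAT is accessed; idle MATs stay
        untouched (modeled here by indexing a single table)."""
        return self.tables[self.ctx_sel][addr]

# Two contexts holding different 8-input, 4-bit functions:
lut = DrafLut()
lut.configure(0, [i & 0xF for i in range(256)])         # low nibble
lut.configure(1, [(i >> 4) & 0xF for i in range(256)])  # high nibble
lut.switch_context(0); print(lut.lookup(0xAB))  # 11 (0xB)
lut.switch_context(1); print(lut.lookup(0xAB))  # 10 (0xA)
```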
Timing Management and Optimization
DRAM access is destructive. Therefore, modern DRAM array organization introduces a two-step access protocol. First, an entire DRAM row is activated and copied into the sense amplifiers according to the row address (activation); next, a subset of the sense amplifiers is read or written on the basis of the column address. Because the original charge of the cells in the DRAM row is destroyed after the activation, the cells must be recharged or discharged to restore the original values (restoration).7 Finally, we must precharge the bitlines and sense amplifiers to prepare for the next activation (precharging). The explicit activation, restoration, and precharging create two major challenges for using DRAM in a reconfigurable fabric.
First, when multiple DRAM-based lookup tables are chained together for a large logic function, we must enforce a specific order for each lookup table access, along with the corresponding timing constraints, to avoid loss of configuration data. DRAF uses a phase-based timing solution. We divide the accelerator design (user) cycle into multiple phases and assign a specific phase to each lookup table in the design (see Figure 2). By requiring the phase of a lookup table to be greater than the phases of all lookup tables producing its input signals, we can guarantee the correct access order. We also delay the precharge operation into the next user cycle, ensuring that the lookup table output is valid across different phases (for example, from LUT-2 and LUT-3 to LUT-4 in Figure 2). The phase assignment can be implemented by a CAD tool using techniques similar to critical-path finding. The phase information is stored in a small local controller per lookup table. There is no need for global coordination at runtime.
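The phase-assignment step lends itself to a longest-path traversal of the lookup-table netlist. The sketch below is our illustrative Python rendering (the paper does not give the algorithm; the netlist encoding and function names are assumptions): a lookup table's phase is one more than the maximum phase of its producers, with tables fed only by registers or primary inputs at phase 0.

```python
# Illustrative sketch of phase assignment: each lookup table's phase must
# exceed the phases of all lookup tables feeding it, so minimal phases
# follow from a longest-path (critical-path-style) traversal of the
# netlist DAG. The netlist format here is an assumption.
from functools import lru_cache

def assign_phases(fanins):
    """fanins maps each LUT name to the list of LUTs producing its inputs.
    LUTs with no LUT fan-ins (register/primary inputs) get phase 0."""
    @lru_cache(maxsize=None)
    def phase(lut):
        preds = fanins.get(lut, [])
        return 0 if not preds else 1 + max(phase(p) for p in preds)
    return {lut: phase(lut) for lut in fanins}

# The four-LUT chain from Figure 2: LUT-2 consumes LUT-1;
# LUT-4 consumes LUT-2 and LUT-3 (LUT-1 and LUT-3 read registers).
netlist = {"LUT-1": [], "LUT-3": [], "LUT-2": ["LUT-1"],
           "LUT-4": ["LUT-2", "LUT-3"]}
print(assign_phases(netlist))
# {'LUT-1': 0, 'LUT-3': 0, 'LUT-2': 1, 'LUT-4': 2}
```

The result matches the phases named in the Figure 2 caption.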
Second, the restoration and precharging of DRAM arrays introduce high latency overheads7 that would limit the design frequency to no more than 20 MHz. To hide these overheads, DRAF applies a three-way latency overlapping without violating the timing constraints. As Figure 2 shows, we overlap the charge restoration of the source lookup table that produces a signal with the time for precharging the destination lookup table of this signal and the time for routing this signal between the two lookup tables. Because routing delay is typically the dominant latency component in FPGAs,8 this critical optimization lets DRAF be only two to three times slower than an FPGA and provides reasonable performance speedup over programmable cores.

Figure 2. The timing diagram and critical path for a chain of four DRAF lookup tables. Each clock cycle is decomposed into three phases. LUT-1 and LUT-3 are in phase 0, LUT-2 is in phase 1, and LUT-4 is in phase 2. Δ = max(tPRE, tRST, troute) represents the three-way overlapping of restoration, precharging, and routing delays.
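As a back-of-the-envelope illustration of why the overlap matters, the sketch below compares the per-phase delay with and without the three-way overlap. The timing values are invented for illustration only and are not the paper's measurements.

```python
# Sketch of the three-way overlap arithmetic. Timing values are assumed,
# not the paper's numbers: the point is that the per-phase delay drops from
# tPRE + tACT + tRST + tROUTE (fully serialized) to
# tACT + max(tPRE, tRST, tROUTE) when restoration of the source LUT,
# precharging of the destination LUT, and routing proceed in parallel.
T_PRE, T_ACT, T_RST, T_ROUTE = 14.0, 13.0, 20.0, 18.0  # ns, illustrative

serial  = T_PRE + T_ACT + T_RST + T_ROUTE       # no overlap
overlap = T_ACT + max(T_PRE, T_RST, T_ROUTE)    # Δ = max(tPRE, tRST, troute)

print(f"serial phase:     {serial:.0f} ns")   # 65 ns
print(f"overlapped phase: {overlap:.0f} ns")  # 33 ns
```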
DRAM Refresh
DRAM requires periodic refresh due to cell leakage. We refresh all lookup tables in a DRAF chip concurrently using a shared row address counter in each CLB and BRAM block. This is easier for DRAF than for commodity DRAM chips in terms of power consumption, because the arrays are much smaller in DRAF. All utilized contexts are refreshed simultaneously, and unused contexts are skipped. The DRAF device driver coordinates the refresh by holding on to new requests, ignoring output data, and pausing ongoing operations, similar to processor pipeline stalls. The internal state of the DRAF design is held in the flip-flops and is not affected. The pause period is less than 1 μs per 64 ms, which is negligible even for latency-critical applications in datacenters that require millisecond-level tail latency.
Design Flow
The success of a reconfigurable fabric relies heavily on CAD toolchain support. Because DRAF uses the same primitives (lookup tables, flip-flops, DSPs, and BRAMs) as modern FPGAs, its design flow is similar to that of FPGAs, with some mild tuning. First, the tool now needs to pack more logic per lookup table to utilize the larger lookup tables. Second, the primary optimization goal should be latency, because area is no longer a scarce resource. Third, the tool must enforce all timing requirements, including the phases and DRAM timing constraints. Finally, the tool should take advantage of the multicontext support.
Use of DRAF for Datacenter Accelerators
DRAF trades off some of the potential performance of FPGAs to achieve high logic density, multiple contexts, and low power consumption. These features make DRAF devices appropriate for both mobile and server applications in which one wants to introduce an FPGA device for acceleration without significantly impacting existing systems' power budget, airflow, and cooling constraints.
In datacenters that host public and private clouds, servers are routinely shared by multiple diverse applications to increase utilization. Different applications and different portions of each application (for example, RPC communication versus security versus the main algorithm) require different accelerators. The long reconfiguration latency of conventional FPGAs leads to nonnegligible application downtime,3 decreasing system availability and making it expensive to share the acceleration fabrics.
In contrast, DRAF provides a shared fabric that supports multiple accelerators by using different contexts, which can be viewed as multiple independent FPGA instances used in a time-multiplexed fashion. The high logic density ensures that each individual context has sufficient capacity for the different accelerator designs. The instantaneous context switch ensures that the desired accelerator becomes available immediately when needed, with negligible energy overhead and no application downtime. Sharing the acceleration fabric can greatly reduce the overall system cost while retaining the benefits of special-purpose acceleration.
Evaluation
We evaluate DRAF as a reconfigurable fabric for datacenter applications using a wide set of accelerator designs for representative computation kernels commonly used in large-scale production services, including both latency-critical online services and batch data analytics. We use seven-input, two-output, eight-context lookup tables in DRAF, because they achieve a good tradeoff between efficiency, logic utilization, and flexibility. We compare DRAF to an FPGA similar to a Xilinx Virtex-6 device and a programmable processor (Intel Xeon E5-2630 at 2.3 GHz). For a fair comparison, the accelerator designs are synthesized using the same open-source CAD tools for both the conventional FPGA and DRAF. The DRAF
results are conservative compared to the programmable core baselines, because highly optimized commercial tools are likely to generate more efficient mappings of accelerator designs on the DRAF fabric. Our full paper contains a complete description of our methodology.9
Area and Power
Figures 3a and 3b compare the area and peak power consumption of DRAF and FPGA devices with different logic capacities, measured in 6-bit-input lookup table equivalents for 45-nm technology. For a fixed logic capacity, an eight-context DRAF provides more than 10 times area improvement and roughly 50 times power consumption reduction. If we target a cost-effective device size of 75 mm2, an FPGA can pack roughly 200,000 lookup tables, whereas DRAF can have more than 1.5 million lookup tables, a logic capacity comparable to that of the state-of-the-art Virtex-UltraScale+ FPGAs that use a much more recent 16-nm technology. The power consumption advantage is also remarkable. Although the FPGA power can easily exceed 10 W, DRAF consumes only about 1 to 2 W.
Figures 3c and 3d further compare DRAF to the FPGA using real accelerator designs. We map each accelerator to one of the eight available contexts in DRAF. The other unused contexts still contribute to the area, consume leakage power, and introduce a slight access latency penalty in the DRAF lookup tables. On average, each accelerator design occupies 19 percent less area on DRAF than on the FPGA, roughly matching the 10-times area advantage if we consider the seven additional contexts available within the area occupied in DRAF. DRAF's area advantage stems primarily from using lookup tables with wider inputs and outputs; these lookup tables can realize larger functions and also reduce
Figure 3. Area and power comparison between DRAF and a conventional FPGA. (a, b) Device-level comparison. (c, d) Comparison after real application mapping.
pressure on the configurable interconnect. The gmm design uses more area in DRAF than on the FPGA, because it requires exponential and logarithmic functions that are not currently supported in our DSP blocks.

Regarding power, the FPGA power consumption is dominated by the routing fabric, especially for larger designs. DRAF provides a 3.2 times power improvement on average, resulting from both the more efficient DRAM-based lookup tables and the savings on routing due to denser packing.
Performance
Finally, we compare the performance of accelerator designs mapped onto DRAF and FPGA devices to that of optimized software running on general-purpose programmable cores. For the programmable cores, we optimistically assume ideal linear scaling to four cores, owing to the abundant request-level parallelism in datacenter services. The chip size for the FPGA and DRAF is fixed at 75 mm2.

Figure 4 shows that both the FPGA and DRAF outperform the single-core baseline, on average by 37 and 13 times, respectively. When compared to four cores with ideal speedup, DRAF still exhibits a significant speedup of 3.4 times while consuming just 0.63 W, compared to the 7 to 10 W of a single core in Xeon-class processors. Overall, these results establish DRAF as an attractive and flexible acceleration fabric for cost- (area-) and power-constrained environments.
DRAF is the first complete design to use dense, commodity DRAM technology to implement a reconfigurable fabric with significant logic density and power improvements over conventional FPGAs. DRAF provides a low-cost solution for multicontext acceleration fabrics, which are expected to become widely used in future multitenant cloud and mobile systems. Looking forward, it is important to tune CAD tools and runtime management systems to efficiently map accelerator designs on DRAF, taking full advantage of its high-density and multicontext features.

The techniques that allow DRAF to turn dense storage technology into cost-effective reconfigurable fabrics are also applicable to memory technologies beyond DRAM. The upcoming dense nonvolatile memory technologies, such as spin-transfer torque RAM, exhibit good density scaling and have better static power characteristics than DRAM. An exciting research direction is to extend DRAF to exploit the advantages and address the shortcomings of new memory technologies in order to produce acceleration fabrics with low area and power cost. MICRO
References
1. M. Horowitz, "Computing's Energy Problem (and What We Can Do About It)," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC), 2014, pp. 10–14.
2. R. Tessier, K. Pocek, and A. DeHon, "Reconfigurable Computing Architectures," Proc. IEEE, vol. 103, no. 3, 2015, pp. 332–354.
3. A. Putnam et al., "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," Proc. 41st Ann. Int'l Symp. Computer Architecture (ISCA), 2014, pp. 13–24.
4. S. Trimberger et al., "A Time-Multiplexed FPGA," Proc. 5th IEEE Symp. FPGA-Based Custom Computing Machines (FCCM), 1997, p. 22.
5. B. Van Essen et al., "Static versus Scheduled Interconnect in Coarse-Grained Reconfigurable Arrays," Proc. Int'l Conf. Field Programmable Logic and Applications (FPL), 2009, pp. 268–275.
Figure 4. Performance comparison between single-core, multicore, FPGA, and DRAF using representative datacenter application kernels. Ideal scaling from the single-core to the multicore platform is assumed.
6. A.N. Udipi et al., "Rethinking DRAM Design and Organization for Energy-Constrained Multi-cores," Proc. 37th Ann. Int'l Symp. Computer Architecture (ISCA), 2010, pp. 175–186.
7. Y.H. Son et al., "Reducing Memory Access Latency with Asymmetric DRAM Bank Organizations," Proc. 40th Ann. Int'l Symp. Computer Architecture (ISCA), 2013, pp. 380–391.
8. V. Betz et al., Architecture and CAD for Deep-Submicron FPGAs, Kluwer Academic Publishers, 1999.
9. M. Gao et al., "DRAF: A Low-Power DRAM-Based Reconfigurable Acceleration Fabric," Proc. ACM/IEEE 43rd Ann. Int'l Symp. Computer Architecture (ISCA), 2016, pp. 506–518.
Mingyu Gao is a PhD student in the Department of Electrical Engineering at Stanford University. His research interests include energy-efficient computing and memory systems, specifically practical and efficient near-data processing for data-intensive analytics applications, high-density and low-power reconfigurable architectures for datacenter services, and scalable accelerators for large-scale neural networks. Gao received an MS in electrical engineering from Stanford University. He is a student member of IEEE. Contact him at [email protected].

Christina Delimitrou is an assistant professor in the Departments of Electrical and Computer Engineering and Computer Science at Cornell University, where she works on computer architecture and distributed systems. Her research interests include resource-efficient datacenters, scheduling and resource management with quality-of-service guarantees, disaggregated cloud architectures, and cloud security. Delimitrou received a PhD in electrical engineering from Stanford University. She is a member of IEEE and ACM. Contact her at [email protected].

Dimin Niu is a senior memory architect in the Memory Solutions Lab in the US R&D center at Samsung Semiconductor. His research interests include computer architecture, emerging nonvolatile memory technologies, and processing near-/in-memory architecture. Niu received a PhD in computer science and engineering from Pennsylvania State University. Contact him at [email protected].

Krishna T. Malladi is a staff architect in the Memory Solutions Lab in the US R&D center at Samsung Semiconductor. His research interests include next-generation memory and storage systems for datacenter platforms. Malladi received a PhD in electrical engineering from Stanford University. Contact him at [email protected].

Hongzhong Zheng is a senior manager in the Memory Solutions Lab in the US R&D center at Samsung Semiconductor. His research interests include novel memory system architecture with DRAM and emerging memory technologies, processing in-memory architecture for machine learning applications, computer architecture and system performance modeling, and energy-efficient computing system designs. Zheng received a PhD in electrical and computer engineering from the University of Illinois at Chicago. He is a member of IEEE and ACM. Contact him at [email protected].

Bob Brennan is the senior vice president of the Memory Solutions Lab in the US R&D center at Samsung Semiconductor. He has led numerous research projects on memory system architecture, SoC architecture, CPU validation, and low-power system design. Brennan received an MS in electrical engineering from the University of Virginia. Contact him at [email protected].

Christos Kozyrakis is an associate professor in the Departments of Electrical Engineering and Computer Science at Stanford University, where he investigates hardware architectures, system software, and programming models for systems ranging from cell phones to warehouse-scale datacenters. His research interests include resource-efficient cloud computing, energy-efficient computing and memory systems for emerging workloads, and scalable operating systems. Kozyrakis has a PhD in computer science from the University of California, Berkeley. He is a fellow of IEEE and ACM. Contact him at [email protected].
AGILE PAGING FOR EFFICIENT MEMORY VIRTUALIZATION
VIRTUALIZATION PROVIDES BENEFITS FOR MANY WORKLOADS, BUT THE ASSOCIATED
OVERHEAD IS STILL HIGH. THE COST COMES FROM MANAGING TWO LEVELS OF ADDRESS
TRANSLATION WITH EITHER NESTED OR SHADOW PAGING. THIS ARTICLE INTRODUCES
AGILE PAGING, WHICH COMBINES THE BEST OF BOTH NESTED AND SHADOW PAGING
WITHIN A PAGE WALK TO EXCEED THE PERFORMANCE OF BOTH TECHNIQUES.
Two important trends in computing are evident. First, computing is becoming more data-centric, wherein low-latency access to a very large amount of data is critical. Second, virtual machines are playing an increasingly critical role in server consolidation, security, and fault tolerance as substantial amounts of computing migrate to shared resources in cloud services. Because software accesses data using virtual addresses, fast address translation is a prerequisite for efficient data-centric computation and for providing the benefits of virtualization to a wide range of applications. Unfortunately, the growth in physical memory sizes is exceeding the capabilities of the most widely used virtual memory abstraction, paging, which has worked for decades.
Translation look-aside buffer (TLB) sizes have not grown in proportion to growth in memory sizes, causing a problem of limited TLB reach: the fraction of physical memory that TLBs can map shrinks with each hardware generation. There are two key factors causing limited TLB reach: first, TLBs are on the critical path of accessing the L1 cache and thus have remained small in size, and second, memory sizes and workloads' memory demands have increased exponentially. This has introduced significant performance overhead due to TLB misses causing hardware page walks. Even the TLBs in the recent Intel Skylake processor architecture cover only 9 percent of a 256-Gbyte memory. This mismatch between TLB reach and memory size will keep growing with time.
In addition, our experiments show that virtualization increases page-walk latency by two to three times compared to unvirtualized execution. The overheads are due to two levels of page tables: one in the guest virtual machine (VM) and the other in the host virtual machine monitor (VMM). There are two techniques to manage these two levels of page tables: nested paging and shadow paging. In this article, we explain the tradeoffs between the two techniques that intrinsically lead to high overheads of virtualizing memory. With current hardware and software, the overheads of virtualizing memory are hard to minimize, because a VM exclusively uses one technique or the other. This effect, combined with limited TLB reach, is detrimental for many virtualized applications and makes virtualization unattractive for big-memory applications.1
This article addresses the challenge of high overheads of virtualizing memory in a
Jayneel Gandhi
Mark D. Hill
Michael M. Swift
University of Wisconsin–Madison

Published by the IEEE Computer Society, 0272-1732/17/$33.00 © 2017 IEEE
comprehensive manner. It proposes a hardware/software codesign called agile paging for fast virtualized address translation to address the needs of many different big-memory workloads. Our goal, originally set forth in our paper for the 43rd International Symposium on Computer Architecture,2 is to minimize memory virtualization overheads by combining the hardware (nested paging) and software (shadow paging) techniques, while exceeding the best performance of both individual techniques.
Techniques for Virtualizing Memory
A key component enabling virtualization is its support for virtualizing memory with two levels of page tables:

- gVA→gPA: guest virtual address (gVA) to guest physical address (gPA) translation via a per-process guest OS page table.
- gPA→hPA: guest physical address to host physical address (hPA) translation via a per-VM host page table.
Table 1 shows the tradeoffs between nested paging and shadow paging, the two techniques commonly used to virtualize memory, and compares them to our agile paging proposal.
Nested Paging
Nested paging is a widely used hardware technique to virtualize memory. The processor has two hardware page-table pointers to perform a complete translation: one points to the guest page table (gcr3 in x86-64), and the other points to the host page table (ncr3).
In the best case, the virtualized address translation hits in the TLB to directly translate from gVA to hPA with no overheads. In the worst case, a TLB miss needs to perform a nested page walk that multiplies overheads vis-à-vis native (that is, unvirtualized 4-Kbyte pages), because accesses to the guest page table also require translation by the host page table. Note that extra hardware is required for the nested page walk beyond the base native page walk. Figure 1a depicts virtualized address translation for x86-64. It shows how page table memory references grow from a native 4 to a virtualized 24 references. Although page-walk caches can elide some of these references,3 TLB misses remain substantially more expensive with virtualization.
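The jump from 4 to 24 references follows from simple arithmetic, sketched below for illustration: each of the n guest-page-table levels first needs a full m-level host walk to locate it (m + 1 references each, counting the guest-table access itself), and the final guest physical address needs one more m-level host walk.

```python
# Sketch of the nested (two-dimensional) page-walk cost arithmetic for an
# n-level guest table nested over an m-level host table. Function names
# are ours, for illustration.

def nested_walk_refs(guest_levels, host_levels):
    # Each guest level: host walk (host_levels refs) + the guest PTE
    # access itself (1 ref); then a final host walk for the leaf gPA.
    return guest_levels * (host_levels + 1) + host_levels

def native_walk_refs(levels):
    return levels

print(native_walk_refs(4))     # 4  (native x86-64)
print(nested_walk_refs(4, 4))  # 24 (virtualized x86-64)
```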
Despite the expense, nested paging allows fast, direct updates to both page tables without any VMM intervention.
Shadow Paging
Shadow paging is a lesser-used software technique to virtualize memory. With shadow paging, the VMM creates on demand a shadow page table that holds complete translations from gVA→hPA by merging entries from the guest and host tables.
In the best case, as in nested paging, the virtualized address translation hits in the TLB to directly translate from gVA to hPA
Table 1. Tradeoffs provided by memory virtualization techniques as compared to base native. Agile paging exceeds the best of both worlds.

| Property | Base native | Nested paging | Shadow paging | Agile paging |
| Translation look-aside buffer (TLB) hit | Fast (gVA→hPA*) | Fast (gVA→hPA) | Fast (gVA→hPA) | Fast (gVA→hPA) |
| Memory accesses per TLB miss | 4 | 24 | 4 | Approximately 4 to 5 on average |
| Page table updates | Fast: direct | Fast: direct | Slow: mediated by the virtual machine monitor (VMM) | Fast: direct |
| Hardware support | Native page walk | Nested + native page walk | Native page walk | Nested + native page walk with switching |

*gVA→hPA: guest virtual address to host physical address.
with no overheads. On a TLB miss, the hardware performs a native page walk on the shadow page table. The native page table pointer points to the shadow page table (scr3). Thus, the memory references required for a shadow page table walk are the same as for a base native walk. For example, x86-64 requires up to four memory references on a TLB miss with shadow paging (see Figure 1b). In addition, as a software technique, shadow paging needs no extra hardware support for page walks beyond the base native page walk.
Even though TLB misses cost the same as in native execution, this technique does not allow direct updates to the page tables, because the shadow page table must be kept consistent with the guest and host page tables.4 These updates occur because of various optimizations, such as page sharing, page migration, setting accessed and dirty bits, and copy-on-write. Every page table update requires a costly VMM intervention to fix the shadow page table by invalidating or updating its entries, which causes significant overheads in many applications.
Opportunity
Shadow paging reduces the overheads of virtualizing memory to those of native execution if the address space does not change. Our key observation is that, empirically, page tables are not modified uniformly: some regions of an address space see far more changes than others, and some levels of the page table, such as the leaves, are updated far more often than the upper-level nodes. For example, code regions might see little change over the life of a process, whereas regions that memory-map files might change frequently. Our experiments showed that generally less than 1 percent, and up to 5 percent, of the address space changes in a 2-second interval of guest application execution (see Figure 2).
Proposed Agile Paging Design
We propose agile paging as a lightweight solution to reduce the cost of virtualized address translation. We use the opportunity we just described to combine the best of shadow and nested paging by using

- shadow paging, with fast TLB misses, for the parts of the guest page table that remain static, and
- nested paging, with fast in-place updates, for the parts of the guest page table that dynamically change.
In the following subsections, we describe the mechanisms that enable us to use both
Figure 1. Nested paging has a longer page walk than shadow paging, but nested paging allows fast, in-place updates, whereas shadow paging requires slow, mediated updates (guest page tables are read-only). (a) Nested paging. (b) Shadow paging.
Figure 2. Opportunity that agile paging uses to improve performance. A fully static address space favors shadow paging; a fully dynamic address space favors nested paging; in practice, only a small fraction of the address space is dynamic, which is the opportunity for agile paging. Portions in white denote static portions, stripes denote dynamic portions, and solid gray denotes unallocated portions of the guest virtual address space.
constituent techniques at the same time for a guest process, and we discuss policies used by the VMM to select shadow or nested mode.
Mechanism: Hardware Support
Agile paging allows both techniques for the same guest process, even on a single address translation, using modest hardware support to switch between the two. Agile paging has three hardware architectural page table pointers: one each for the shadow, guest, and host page tables. If agile paging is enabled, virtualized page walks start in shadow paging and then switch, in the same page walk, to nested paging if required.
To allow fine-grained switching from shadow paging to nested paging on any address translation at any level of the guest page table, the shadow page table needs to logically support a new switching bit per page table entry. This bit notifies the hardware page table walker to switch from shadow to nested mode. When the switching bit is set in a shadow page table entry, the shadow page table holds the hPA (pointer) of the next guest page table level. Figure 3a depicts the use of the switching bit in the shadow page table for agile paging. Figure 3b shows a page walk that is possible with agile paging. Switching is allowed at any level of the page table.
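The walk described above can be sketched behaviorally. This toy Python model is our own illustration, not the paper's hardware: the entry encodings, dict-based tables, and the host_translate helper are invented. It shows a walk that starts in the partial shadow table and, after hitting a switching bit, finishes through the guest table in nested mode.

```python
# Behavioral sketch of an agile-paging page walk. Table nodes are dicts
# mapping a per-level index to ("table", next_node), ("switch",
# next_guest_node), or ("page", addr). Shadow leaves hold hPAs directly;
# guest leaves hold gPAs that still need the host table in nested mode.
# All encodings here are illustrative assumptions.

def agile_walk(indices, shadow_root, host_translate):
    node, mode = shadow_root, "shadow"
    for idx in indices:
        kind, val = node[idx]
        if kind == "page":
            return val if mode == "shadow" else host_translate(val)
        if kind == "switch":          # switching bit set in the shadow entry:
            mode = "nested"           # val is the next guest-table level
        node = val
    raise ValueError("walk did not reach a leaf")

# Toy four-level walk that switches to nested mode at level two:
idx = [3, 1, 2, 0]                                   # decoded gVA indices
host = lambda gpa: gpa + 0x1000                      # stand-in host table
guest_l3 = {2: ("table", {0: ("page", 0x8000)})}     # guest levels 3 and 4
shadow = {3: ("table", {1: ("switch", guest_l3)})}   # partial shadow table

print(hex(agile_walk(idx, shadow, host)))  # 0x9000
```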
Mechanism: VMM Support
As in shadow paging, the VMM for agile paging manages three page tables: guest, shadow, and host. Agile paging's page table management is closely related to that of shadow paging, but there are subtle differences.
Guest page table (gVA→gPA). With all approaches, the guest page table is created and modified by the guest OS for every guest process. The VMM in shadow paging, though, controls access to the guest page table by marking its pages read-only. With agile paging, we leverage the support for marking guest page tables read-only, with one subtle change: the VMM marks as read-only just the parts of the guest page table covered by the partial shadow page table. The rest of the guest page table (handled by nested mode) has full read-write access.
Shadow page table (gVA→hPA). For all guest processes with agile paging enabled, the VMM creates and maintains a shadow page table. However, with agile paging, the shadow page table is partial and cannot translate all gVAs fully. The shadow page table entry at each switching point holds the hPA of the next level of the guest page table, with the switching bit set. This enables the hardware to perform the page walk correctly with agile paging using both techniques.
Host page table (gPA→hPA). The VMM manages the host page table to map from gPA to hPA for each VM. As with shadow paging, the VMM merges this page table with the guest page table to create a shadow page table. The VMM must update the shadow page table on any changes to the host page table. The host page table is updated only by the VMM, and during that update,
Figure 3. Agile paging support. (a) Mechanism for agile paging: when the switching bit is set, the shadow page table points to the next level of the guest page table. (b) Example page walk possible with agile paging, wherein it switches to nested mode at level four of the guest page table.
the shadow page table is kept consistent by invalidating affected entries.
Policy: What Level to Switch?
Agile paging provides a mechanism for virtualized address translation that starts in shadow mode and switches at some level of the guest page table to nested mode. The purpose of a policy is to determine whether to switch from shadow to nested mode for a single virtualized address translation and at which level of the guest page table the switch should be performed.
The ideal policy would determine when page table entries are changing rapidly enough that the cost of the corresponding updates to the shadow page table outweighs the benefit of faster TLB misses in shadow mode, so that translation should use nested mode. The policy would quickly detect the dynamically changing parts of the guest page table and switch them to nested mode, while keeping the remaining static parts of the guest page table under shadow mode.
To achieve this goal, a policy will move some parts of the guest page table from shadow to nested mode and vice versa. We assume that the guest process starts in full shadow mode, and we propose a simple algorithm for when to change modes.
Shadow→Nested mode. We start a guest process in shadow mode to allow the VMM to track all updates to the guest page table (the guest page table is marked read only in shadow mode, requiring VMM interventions for updates). Our experiments showed that the updates to a single page of a guest page table are bimodal in a 2-second time interval: only one update or many updates (for example, 10, 50, 100). Thus, we use a two-update policy to move a page of the guest page table from shadow mode to nested mode: two successive updates to a page trigger a mode change. This allows all subsequent updates to frequently changing parts of the guest page table to proceed without VMM interventions.
Nested→Shadow mode. Once we move parts of the guest page table to nested mode, all updates to those parts happen without any VMM intervention. Thus, the VMM cannot track whether the parts under nested mode have stopped changing and thus can be moved back to shadow mode. So, we use dirty bits on the pages containing the guest page table as a proxy to find these static parts of the guest page table after every time interval, and we switch those parts back to shadow mode. Figure 4 depicts the policy used by agile paging.
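The two policies above amount to a small per-page state machine. The following is a sketch under our own naming (the paper specifies the policy; this code is only an illustration of it):

```python
class PageMode:
    """Mode-switch policy for one page of the guest page table:
    shadow -> nested after two updates in an interval (two-update policy);
    nested -> shadow when the page's dirty bit stays clear for an interval."""

    def __init__(self):
        self.mode = "shadow"
        self.updates = 0        # VMM-trapped writes seen this interval
        self.dirty = False      # dirty-bit proxy while in nested mode

    def on_write(self):
        if self.mode == "shadow":
            self.updates += 1
            if self.updates >= 2:      # two successive updates: switch modes
                self.mode = "nested"
        else:
            self.dirty = True          # no VMM trap; hardware sets dirty bit

    def on_interval_timeout(self):
        if self.mode == "nested" and not self.dirty:
            self.mode = "shadow"       # page went static: shadow it again
        self.updates = 0
        self.dirty = False

p = PageMode()
p.on_write(); p.on_write()
assert p.mode == "nested"      # two updates trigger shadow -> nested
p.on_write()                   # nested-mode write: only marks dirty
p.on_interval_timeout()
assert p.mode == "nested"      # still dirty this interval: stay nested
p.on_interval_timeout()
assert p.mode == "shadow"      # a quiet interval moves it back to shadow
```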
To summarize, the changes to the hardware and VMM to support agile paging are incremental, but they result in a powerful, efficient, and robust mechanism. This mechanism, when combined with our proposed policies, helps the VMM detect changes to the page tables and intelligently make a decision to switch modes and thus reduce overheads.
Our original paper has more details on the agile paging design to integrate page walk caches, perform guest context switches, set accessed/dirty bits, and handle small or short-lived processes. It also describes possible hardware optimizations.2
Methodology
To evaluate our proposal, we emulate our proposed hardware with Linux and prototype our software in Linux KVM.5 We selected workloads with poor TLB performance from SPEC 2006,6 BioBench,7 Parsec,8 and big-memory workloads.9 We report overheads using a combination of hardware performance counters from native and virtualized application executions, along with
Figure 4. Policy to move a page between nested mode and shadow mode in agile paging.
..............................................................................................................................................................................................
TOP PICKS
............................................................
84 IEEE MICRO
TLB performance emulation using a modified version of BadgerTrap10 with a linear performance model. Our original paper has more details on our methodology, results, and analysis.2
Evaluation
Figure 5 shows the execution time overheads associated with page walks and VMM interventions with 4-Kbyte pages and 2-Mbyte pages (where possible). For each workload, four bars show results for base native paging (B), nested paging (N), shadow paging (S), and agile paging (A). Each bar is split into two segments. The bottom represents the overheads associated with page walks, and the top dashed segment represents the overheads associated with VMM interventions.
Agile paging outperforms its constituent techniques for all workloads and improves performance by 12 percent over the best of nested and shadow paging on average, and performs less than 4 percent slower than unvirtualized native at worst. In our original paper,2 we show that more than 80 percent of TLB misses are covered under full shadow mode, thus having four memory accesses for TLB misses. Overall, the average number of memory accesses for a TLB miss comes down from 24 to between 4 and 5 for all workloads.
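The 24-reference worst case follows from the two-dimensional walk: each of the four guest page table levels must itself be translated by a four-level host walk plus one host-physical access, and the resulting gPA needs one final host walk. A quick model of this arithmetic (our back-of-the-envelope formula, consistent with the numbers in the text):

```python
def nested_walk_refs(guest_levels, host_levels):
    """Worst-case memory references for a 2D (nested) page walk:
    each guest level needs a host walk plus one access, and the final
    gPA needs one more host walk: n*(m+1) + m."""
    return guest_levels * (host_levels + 1) + host_levels

assert nested_walk_refs(4, 4) == 24   # today's worst case, as cited above
assert nested_walk_refs(5, 5) == 35   # with Intel's fifth page-table level
```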
We and others have found that the overheads of virtualizing memory can be high. This is true in part because guest processes currently must choose between nested paging, with slow nested page table walks, and shadow paging, in which page table updates cause costly VMM interventions. Ideally, one would want to use nested paging for addresses and page table levels that change and use shadow paging for addresses and page table levels that are relatively static.
Our proposal—agile paging—approaches this ideal. With agile paging, a virtualized address translation usually starts in shadow mode and then switches to nested mode only if required to avoid VMM interventions. Moreover, agile paging's benefits could be greater in the future, because Intel has recently added a fifth level to its page table,11 which takes a virtualized nested page walk up to 35 memory references, and emerging nonvolatile memory technology promises vast physical memories. MICRO
Acknowledgments
This work is supported in part by the US National Science Foundation (CCF-1218323, CNS-1302260, CCF-1438992, and CCF-1533885), Google, and the University of Wisconsin (John Morgridge chair and named professorship to Hill). Hill and Swift have significant financial interests in AMD and Microsoft, respectively.
References
1. J. Buell et al., "Methodology for Performance Analysis of VMware vSphere Under Tier-1 Applications," VMware Technical J., vol. 2, no. 1, 2013, pp. 19–28.
Figure 5. Execution time overheads for (a) 4-Kbyte pages and (b) 2-Mbyte pages (where possible) with base native (B), nested paging (N), shadow paging (S), and agile paging (A) for four representative workloads. All virtualized execution bars are in two parts: the bottom solid parts represent page walk overheads, and the top hashed parts represent VMM intervention overheads. The numbers on top of the bars represent the slowdown with respect to the base native case.
2. J. Gandhi, M.D. Hill, and M.M. Swift, "Agile Paging: Exceeding the Best of Nested and Shadow Paging," Proc. 43rd Int'l Symp. Computer Architecture, 2016, pp. 707–718.
3. R. Bhargava et al., "Accelerating Two-Dimensional Page Walks for Virtualized Systems," Proc. 13th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2008, pp. 26–35.
4. K. Adams and O. Agesen, "A Comparison of Software and Hardware Techniques for x86 Virtualization," Proc. 12th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2006, pp. 2–13.
5. A. Kivity et al., "KVM: The Linux Virtual Machine Monitor," Proc. Linux Symp., vol. 1, 2007, pp. 225–230.
6. J.L. Henning, "SPEC CPU2006 Benchmark Descriptions," SIGARCH Computer Architecture News, vol. 34, no. 4, 2006, pp. 1–17.
7. K. Albayraktaroglu et al., "BioBench: A Benchmark Suite of Bioinformatics Applications," Proc. IEEE Int'l Symp. Performance Analysis of Systems and Software, 2005, pp. 2–9.
8. C. Bienia et al., "The Parsec Benchmark Suite: Characterization and Architectural Implications," Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques, 2008, pp. 72–81.
9. A. Basu et al., "Efficient Virtual Memory for Big Memory Servers," Proc. 40th Ann. Int'l Symp. Computer Architecture, 2013, pp. 237–248.
10. J. Gandhi et al., "BadgerTrap: A Tool to Instrument x86-64 TLB Misses," SIGARCH Computer Architecture News, vol. 42, no. 2, 2014, pp. 20–23.
11. 5-Level Paging and 5-Level EPT, white paper, Intel, Dec. 2016.
Jayneel Gandhi is a research scientist at VMware Research. His research interests include computer architecture, operating systems, memory system design, virtual memory, and virtualization. Gandhi has a PhD in computer sciences from the University of Wisconsin–Madison, where he completed the work for this article. He is a member of ACM. Contact him at [email protected].
Mark D. Hill is the John P. Morgridge Professor, Gene M. Amdahl Professor of Computer Sciences, and Computer Sciences Department Chair at the University of Wisconsin–Madison, where he also has a courtesy appointment in the Department of Electrical and Computer Engineering. His research interests include parallel computer system design, memory system design, and computer simulation. Hill has a PhD in computer science from the University of California, Berkeley. He is a fellow of IEEE and ACM. He serves as vice chair of the Computing Community Consortium. Contact him at [email protected].
Michael M. Swift is an associate professor in the Computer Sciences Department at the University of Wisconsin–Madison. His research interests include operating system reliability, the interaction of architecture and operating systems, and device driver architecture. Swift has a PhD in computer science from the University of Washington. He is a member of ACM. Contact him at [email protected].
.................................................................................................................................................................................................................
TRANSISTENCY MODELS: MEMORY ORDERING AT THE HARDWARE–OS INTERFACE
.................................................................................................................................................................................................................
THIS ARTICLE INTRODUCES THE TRANSISTENCY MODEL, A SET OF MEMORY ORDERING
RULES AT THE INTERSECTION OF VIRTUAL-TO-PHYSICAL ADDRESS TRANSLATION AND
MEMORY CONSISTENCY MODELS. USING THEIR COATCHECK TOOL, THE AUTHORS SHOW
HOW TO RIGOROUSLY MODEL, ANALYZE, AND VERIFY THE CORRECTNESS OF A GIVEN
SYSTEM’S MICROARCHITECTURE AND SOFTWARE STACK WITH RESPECT TO ITS
TRANSISTENCY MODEL SPECIFICATION.
......Modern computer systems consist of heterogeneous processing elements (CPUs, GPUs, accelerators) running multiple distinct layers of software (user code, libraries, operating systems, hypervisors) on top of many distributed caches and memories. Fortunately, most of this complexity is hidden away underneath the virtual memory (VM) abstraction presented to the user code. However, one aspect of that complexity does pierce through: a typical memory subsystem will buffer, reorder, or coalesce memory requests in often unintuitive ways for the sake of performance. This results in essentially all real-world hardware today exposing a weak memory consistency model (MCM) to concurrent code that communicates through shared VM.
The responsibilities for maintaining the VM abstraction and for enforcing the memory consistency model are shared between the hardware and the operating system (OS) and require careful coordination between the two. Although MCMs at the instruction set architecture (ISA) and programming language levels are becoming increasingly well understood,1–5 a key verification challenge is that events within system layers can behave differently than the "normal" accesses described by the ISA or programming language MCM. For example, on the x86-64 architecture, which implements the relatively strong total store ordering (TSO) memory model,5 page table walks are automatically issued by hardware, can happen at any time, and often are not ordered even with respect to fences. Even worse is that while an ISA by design tends to remain stable across processor generations, microarchitectural phenomena often change dramatically from one generation to the next. For example, CPUs today are experimenting
Daniel Lustig
Princeton University
Geet Sethi
Abhishek Bhattacharjee
Rutgers University
Margaret Martonosi
Princeton University
.......................................................
88 Published by the IEEE Computer Society 0272-1732/17/$33.00 © 2017 IEEE
with features such as concurrent page table walkers and translation lookaside buffer (TLB) coalescing that improve performance at the cost of adding significant complexity.6 Consequently, VM and MCM specifications and implementations tend to be bug-prone and are only becoming more complex as systems become increasingly heterogeneous and distributed.
Bogdan Romanescu and colleagues were the first to distinguish between MCMs meant for virtual addresses (VAMC) and those for physical addresses (PAMC).7 They considered hardware to be responsible for the latter, and a combination of hardware and OS for the former. However, as we show in this article, not even VAMC and PAMC capture the full intersection of address translation and memory ordering. Even machines that implemented the strictest model they considered—virtual address sequential consistency (SC-for-VAMC)—may be prone to surprising ordering bugs related to the checking of metadata at a different virtual and physical address from the data being accessed. We therefore coin the term memory transistency model to refer to any set of memory ordering rules that explicitly account for these broader virtual-to-physical address translation issues.
To enable rigorous analysis of transistency models and their implementations, we developed a tool called COATCheck for verifying memory ordering enforcement in the context of virtual-to-physical address translation. (COAT stands for consistency ordering and address translation.) COATCheck lets users reason about the ordering implications of system calls, interrupts, microcode, and so on at the microarchitecture, architecture, and OS levels. System models are built in COATCheck using a domain-specific language (DSL) called μspec (pronounced "mu-spec"), within which each component in a system (for example, each pipeline stage, each cache, and each TLB) can independently specify its own contribution to memory ordering using the languages of first-order logic and microarchitecture-level happens-before (μhb) graphs.8,9 This allows COATCheck verification to be modular and flexible enough to adapt to the fast-changing world of modern heterogeneous systems.
Our contributions are as follows. First, we developed a comprehensive methodology for specifying and statically verifying memory ordering enforcement at the hardware–OS interface. Second, we built a fast and general-purpose constraint solver that automates the analysis of μspec microarchitecture specifications. Third, as a case study, we built a sophisticated model of an Intel Sandy-Bridge-like processor running a Linux-like OS, and using that model we analyzed various translation-related memory ordering scenarios of interest. Finally, we identified cases in which transistency goes beyond the traditional scope of consistency: where even SC-for-VAMC7 is insufficient. Overall, our work offers a rigorous yet practical framework for memory ordering verification, and it broadens the very scope of memory ordering as a field. The full toolset is open source.10
Enhanced Litmus Tests
Litmus tests are small stylized programs testing some aspect of a memory model. Each test proposes an outcome: the value returned by each load plus the final value at each memory location, or some relevant subset thereof. The rules of a memory model determine whether an outcome is permitted or forbidden. Consider Figure 1a: as written, x and y appear to be distinct addresses. Under that assumption, the proposed outcome is observable even under a model as strict as sequential consistency (SC),11 because the event interleaving shown in Figure 1b produces that outcome. If instead x and y are actually synonyms (both map to the same physical address), as in Figure 1c, the test is forbidden by SC, because then no interleaving of the threads produces the proposed outcome. While simple, this example highlights how memory ordering verification is fundamentally incomplete unless it explicitly accounts for address translation.
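The synonym argument can be checked mechanically. Below is a brute-force SC sketch (ours, not COATCheck) that enumerates all interleavings of the Figure 1a threads under both address-mapping assumptions:

```python
from itertools import permutations

def sc_outcomes(x_pa, y_pa):
    """All SC outcomes (r1, r2) of the Figure 1a litmus test, with virtual
    addresses x and y resolved to the given physical addresses."""
    t0 = [("st", x_pa, 1), ("ld", y_pa, "r1")]   # Thread 0 program order
    t1 = [("st", y_pa, 2), ("ld", x_pa, "r2")]   # Thread 1 program order
    outcomes = set()
    # An interleaving = which thread issues each of the four ops in turn.
    for pick in set(permutations([0, 0, 1, 1])):
        mem, regs, idx = {}, {}, [0, 0]
        for t in pick:
            op, addr, val = (t0 if t == 0 else t1)[idx[t]]
            idx[t] += 1
            if op == "st":
                mem[addr] = val
            else:
                regs[val] = mem.get(addr, 0)     # locations start at 0
        outcomes.add((regs["r1"], regs["r2"]))
    return outcomes

assert (2, 1) in sc_outcomes("PA1", "PA2")       # distinct pages: observable
assert (2, 1) not in sc_outcomes("PA1", "PA1")   # synonyms: forbidden under SC
```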
The basic unit of testing in COATCheck is the enhanced litmus test (ELT). ELTs extend traditional litmus tests by adding address translation, memory (re)mappings, interrupts, and other system-level operations relevant to memory ordering. In addition, just as a traditional litmus test outcome specifies the values returned by loads, ELTs also
consider the physical addresses used by each VM access to be part of the outcome. Finally, ELTs include "ghost instructions" that model lower-than-ISA operations (such as microcode and page table walks) executed by hardware, even if these instructions are not fetched, decoded, or issued as part of the normal ISA-level instruction stream. These features give ELTs sufficient expressive power to test all aspects of memory ordering enforcement as it relates to address translation.
The COATCheck toolflow provides automated methods to create ELTs from user-provided litmus tests plus other system-level annotation guidance. We describe this conversion process below.
OS Synopses
OS activities such as TLB shootdowns and memory (re)mappings are captured within ELTs as sequences of loads, stores, system calls, and/or interrupts. An OS synopsis specifies a mapping from each system call into a sequence of micro-ops that capture the effects of that system call on ordering and address translation. When the system call contains an interprocessor interrupt (IPI), the OS synopsis also instantiates predefined interrupt handler threads on interrupt-receiving cores.
For example, an OS synopsis might expand the mprotect call of Figure 2a into the shaded instructions of Figure 2b. The call itself expands into four instructions: one updates the page table entry, one invalidates the local TLB, one sends an IPI, and one waits for the IPI to be acknowledged. The OS synopsis also produces the interrupt handler (Thread 1b), which performs its own local TLB invalidation before responding to the initiating thread.
Microarchitecture Synopses
As with the OS synopses, microarchitecture synopses map each instruction onto a microcode sequence that includes ghost instructions such as page table walks. Not every instruction actually triggers a page table walk, so these ghost instructions are instantiated only as needed during the analysis.
For example, Figure 2b is transformed into the ELT of Figure 2c by the addition of the gray-shaded ghost instructions. In this example, Thread 0's store to [x] requires a page table walk, because the TLB entry for that virtual address would have been invalidated by the preceding invlpg instruction. Furthermore, because the page was originally clean, ghost instructions also model how hardware marks the page dirty at that point. Finally, the microarchitecture synopsis adds to Thread 1b a microcode preamble containing ghost instructions to receive the interrupt, save state, and disable nested interrupts. In this example, hardware is responsible for saving state, but software is responsible for restoring it. This again highlights the degree of collective responsibility between hardware and OS for ensuring ordering correctness.
μspec: A DSL for Specifying Memory Orderings
μspec is a domain-specific language for specifying memory ordering enforcement in the form of μhb graphs8,9 (see Figure 3). Nodes in a μhb graph represent events corresponding
[Figure 1 graphic. The litmus test: initially [x] = 0, [y] = 0. Thread 0: St [x] ← 1; Ld [y] → r1. Thread 1: St [y] ← 2; Ld [x] → r2. Proposed outcome: r1 = 2, r2 = 1.]
Figure 1. A litmus test showing how virtual memory interacts with memory ordering. (a) Litmus test code. (b) A possible execution showing how the proposed outcome is observable if x and y point to different physical addresses. (c) The outcome is forbidden if x and y point to the same physical address (only one possible interleaving among many is shown).
to a particular instruction (column) and some particular microarchitectural event (row). Edges represent happens-before orderings guaranteed by some property of the microarchitecture: an instruction flowing through a pipeline, a structure maintaining first-in, first-out (FIFO) ordering, the passage of a message, and so on. Although previous work derived μhb graphs using hard-coded notions of pipelines8 and caches,9 μspec models provide a completely general-purpose language for drawing μhb graphs tailored to any arbitrary system design. We provide a detailed example of the μspec syntax in the next section.
Hardware memory models today tend to be either axiomatic, where an outcome is permitted if and only if it simultaneously satisfies all of the axioms of the model, or operational, where an outcome is permitted only if it matches the outcome of some series of execution steps on an abstract "golden hardware model." μspec models are axiomatic: a μhb graph represents a legal test execution if and only if it is acyclic and satisfies all of the constraints in the model. Each hardware or software component designer provides an independent set of μhb graph axioms which that component guarantees to maintain. The conjunction of these axioms forms the overall μspec model. This modularity means that components can be added, changed, or removed as necessary without affecting any of the other components.
Although they are inherently axiomatic, μspec models capture the best of the operational approach as well. A total ordering of the nodes in an acyclic μhb graph is also analogous to the sequence of execution steps in an operational model. This analogy lets μhb graphs retain much of the intuitiveness of operational models while simultaneously retaining the scalability of axiomatic models. As such, μhb graphs are useful not only for transistency models but also more generally for software and hardware memory models.
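The acyclicity test at the heart of this approach is a standard topological sort; an acyclic happens-before graph yields exactly the kind of total order described above. A minimal sketch using Kahn's algorithm (our illustration, not the COATCheck solver):

```python
from collections import deque

def topo_order(nodes, edges):
    """Return a total order of nodes if the happens-before graph is acyclic,
    else None (a cycle means the candidate execution is ruled out)."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order if len(order) == len(nodes) else None

assert topo_order(["a", "b", "c"], [("a", "b"), ("b", "c")]) == ["a", "b", "c"]
assert topo_order(["a", "b"], [("a", "b"), ("b", "a")]) is None  # cycle
```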
The COATCheck constraint solver is inspired by SAT and SMT solvers. It searches to find any μhb graph that satisfies all of the constraints of a given μspec model applied to some ELT. If one can be found, the proposed ELT outcome is observable. If not, the proposed outcome is forbidden. This result is then checked against the architecture-level specification1 to ensure correctness.
System Model Case Study
In this section, we present an in-depth case study of how hardware and software designers can use COATCheck and μspec to model a high-performance out-of-order processor and OS. Our case study has three parts. The first is a μspec model called SandyBridge that describes an out-of-order processor based on public documentation of and patents relating to Intel's Sandy Bridge microarchitecture.
Figure 2. Traditional litmus tests are expanded into enhanced litmus tests (ELTs). (a) A traditional litmus test with an mprotect system call added. (b) The user+kernel version of the litmus test. On core 1, threads 1a and 1b will be interleaved dynamically. "R/W" indicates that the page table entry (PTE) R/W bit will be set. (c) The ELT. Page table accesses for [y], accessed bit updates, and so forth are not depicted but will be included in the analysis.
The second is the microarchitecture synopsis, which specifies how ghost instructions such as page table walks behave on SandyBridge. The third is an OS synopsis inspired by Linux's implementations of system calls and interrupt handlers. We offer in-depth model highlights in this article; see our full paper for additional detail.12
Memory Dependency Prediction and Disambiguation
SandyBridge uses a sophisticated, high-performance virtually and physically addressed store buffer (SB). This decision was intentional: a virtual-only SB would be unable to detect virtual address synonyms, whereas a physical-only SB would place the TLB onto the critical path for SB forwarding. The SandyBridge SB instead splits the forwarding process into two parts: a prediction stage tries to preemptively anticipate physical address matches, and a disambiguation stage later ensures that all predictions were correct. This pairing keeps the TLB off the critical path without giving up the ability to detect synonyms.
The mechanism works as follows. All stores write their virtual address and data into the SB in parallel with accessing the TLB.
Once the TLB provides it, the physical address is written into the SB as well. Each load, in parallel with accessing the TLB, writes the lower 12 bits (the "index bits") of its virtual address into a CAM-based load buffer storing uncommitted loads. The load then compares its index bits against those of all older stores in the SB. If an index match is found, the load then compares its virtual tag, and potentially its physical tag, against the stores. If there is a tag match, the youngest matching store forwards its data to the load. If no match is found, the load proceeds to the cache. If there is an empty slot because the load executed out of order before an earlier store, then the load predicts that there is no dependency. This prediction is later checked during disambiguation: before each store commits, it checks the load buffer to see if any younger loads matching the same physical address have speculatively executed before it. If so, it squashes and replays those mispredicted loads.
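A simplified sketch of the forwarding comparison (our field layout; the real design also involves the prediction and disambiguation machinery just described, which this sketch omits):

```python
PAGE_BITS = 12  # low 12 bits are the "index bits", unchanged by translation

def sb_forward(store_buffer, load_va, load_pa):
    """store_buffer: oldest-first list of (va, pa, data); pa may be None if
    the TLB has not yet supplied the store's physical address. Returns the
    forwarded data from the youngest matching store, or None (go to cache)."""
    index = load_va & ((1 << PAGE_BITS) - 1)
    for va, pa, data in reversed(store_buffer):          # youngest first
        if va & ((1 << PAGE_BITS) - 1) != index:
            continue                                     # index miss
        if va >> PAGE_BITS == load_va >> PAGE_BITS:
            return data                                  # virtual tag hit
        if pa is not None and pa >> PAGE_BITS == load_pa >> PAGE_BITS:
            return data                                  # synonym: ptag hit
    return None

# Two virtual synonyms of the same physical page share their index bits:
sb = [(0x1123, 0x40123, 0xAA)]
assert sb_forward(sb, 0x1123, 0x40123) == 0xAA   # same VA: virtual tag hit
assert sb_forward(sb, 0x2123, 0x40123) == 0xAA   # synonym: physical tag hit
assert sb_forward(sb, 0x2456, 0x40456) is None   # index mismatch: cache
```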
The following μspec snippet shows a portion of the SandyBridge μspec model capturing a case in which a load has an index match, a virtual tag miss, and a physical tag match with a previous store.
DefineMacro "StoreBufferForwardPTag":
exists microop "w", (
  SameCore w i /\ IsAnyWrite w /\ ProgramOrder w i /\
  SameIndex w i /\ ~(SameVirtualTag w i) /\
  SamePhysicalTag w i /\ SameData w i /\
  EdgesExist [
    ((w, SB-VTag/Index/Data), (i, LB-SB-IndexCompare), "SBEntryIndexPresent");
    ((w, SB-PTag), (i, LB-SB-PTagCompare), "SBEntryPTagPresent");
    ((i, SB-LB-DataForward), (w, (0, MemoryHierarchy)), "BeforeSBEntryLeaves");
    ((i, LB-SB-IndexCompare), (i, LB-SB-VTagCompare), "path");
    ((i, LB-SB-VTagCompare), (i, LB-SB-PTagCompare), "path");
    ((i, LB-PTag), (i, LB-SB-PTagCompare), "path");
    ((i, LB-SB-PTagCompare), (i, SB-LB-DataForward), "path");
    ((i, SB-LB-DataForward), (i, WriteBack), "path")
  ] /\
  ExpandMacro STBNoOtherMatchesBetweenSrcAndRead
).
..............................................................................................................................................................................................
TOP PICKS
............................................................
92 IEEE MICRO
The first set of predicates narrows the axiom down to apply to the scenario we described earlier. The edges listed in the EdgesExist predicate then describe the associated memory ordering constraints. The first three ensure that write w is still in the SB when load i searches for it, and the rest describe the path that i itself takes through the microarchitecture. Finally, the axiom also checks (using a macro defined elsewhere) that the store is in fact the youngest matching store in the SB.
Other Model Details
A second component of our SandyBridge model reflects the functionality of system calls and interrupts as they relate to memory mapping and remapping functions. Although x86 TLB lookups and page table walks are performed by the hardware, x86 TLB coherence is OS-managed. To support this, x86 provides the privileged invlpg instruction, which invalidates the local TLB entry at a given address, along with support for interprocessor interrupts (IPIs). As a serializing instruction, invlpg forces all previous instructions to commit and drains the SB before fetching the following instruction. invlpg also ensures that the next access to the virtual page invalidated will be a TLB miss, thus forcing the latest version of the corresponding page table entry to be brought into the TLB.
To capture IPIs and invlpg instructions, our Linux OS synopsis expands the system call mprotect into code snippets that update the page table, invalidate the now-stale TLB entry on the current core, and send TLB shootdowns to other cores via IPIs and interrupt handlers that execute invlpg operations on the remote cores. The SandyBridge microarchitecture synopsis captures interrupts by adding ghost instructions that represent the reception of the interrupt and the hardware disabling of nested interrupts before each interrupt handler. All possible interleavings of the interrupt handlers and the threads' code are considered. Figures 2b and 2c depict the effects of both of these synopses.
To model TLB occupancy, the SandyBridge μspec model adds two nodes to the μhb graph to represent TLB entry creation and invalidation, respectively. These are then constrained following the value-in-cache-line (ViCL) mechanism.9 All loads and stores (including ghost instructions) are constrained by the model to access the TLB within the lifetime of some matching TLB entry.
Page table walks are also instantiated by the microarchitecture synopsis as a set of ghost instruction loads of the page table entry. Because these are generated by dedicated hardware, the SandyBridge μspec model does not draw nodes such as Fetch and Dispatch for these instructions, because they do not pass through the pipeline. Furthermore, because the page table walk loads are not TSO-ordered, they do not search the load buffer. They are, however, ordered with respect to invlpg.
Our SandyBridge model also captures the accessed and dirty bits present in the page table and TLB. When an accessed or dirty bit needs to be updated, the pipeline waits until the triggering instruction reaches the head of the reorder buffer. At that point, the processor injects microcode (modeled via ghost LOCKed read-modify-write [RMW] instructions) implementing the update. The ghost instructions in a status bit update do traverse the Dispatch, Issue, and Commit stages, unlike the ghost page table walks, because the status bit updates do propagate through most of the pipeline and affect architectural state. The model also uses μhb edges to ensure that the update is ordered against all other instructions.
Figure 3. A μhb graph for the litmus test in Figure 1 (minus the mprotect), executing on a simple five-stage out-of-order pipeline. Because the graph is acyclic, the execution is observable.
.............................................................
MAY/JUNE 2017 93
Analysis and Verification Examples
In this section, we present three test cases for our SandyBridge model.
Store Buffer Forwarding
Test n5 (see Figure 4) checks the SB's ability to detect synonyms. If a synonym is misdetected, one of the loads (i1.0 or i3.0) might be allowed to bypass the store (i0.0 or i2.0) before it, leading to an illegal outcome. Also pictured are the TLB access ghost instructions associated with each ISA-level instruction. Figure 4a shows one of the μhb graphs COATCheck uses to rule out such a situation on SandyBridge. Figure 4b shows the code itself. If load (i3.0) executes out of order, it finds that the SB contains no previous entries with the same index; this is captured by a μhb edge between (i3.0, LB-SB-IndexCompare) and (i2.0, SB-VTag/Index/Data). However, when the store (i2.0) does eventually execute, it will squash (i3.0) unless the load buffer has no index matches, that is, unless (i3.0) has not yet entered the load buffer. The μhb edge from (i2.0, LBSearch) back to (i3.0, LB-Index) completes the cycle, which rules out the execution.
Page Remappings
Figure 5 reproduces and extends the key example studied by Bogdan Romanescu and colleagues:7 thread 0 changes the mapping for x (i0.0), triggers a TLB shootdown (i2.0), and sends a message to thread 1 (i4.0). Thread 1 receives the message (i7.0) and is hypothesized to write to x (i8.0) using the old, stale mapping (a situation COATCheck should be expected to rule out). Thread 1 (i9.0) sends a message back to thread 0 (i5.0), which checks (i6.0) that the value at x (according to the new mapping) was not overwritten by the thread 1 store (i8.0), which used the old mapping. The μhb graph generated for this scenario (Figure 5a) is also cyclic, showing how COATCheck does in fact rule out the execution of Figure 5b. The graph also simultaneously demonstrates many COATCheck features, such as IPIs, handlers, microcode, and fences, and it shows COATCheck's ability to scale up to large and highly nontrivial problems.
Transistency versus Consistency
Our third example focuses on status bits and synonyms. Status bits are tracked per virtual-to-physical mapping rather than per physical page, and so the OS is responsible for tracking the status of synonyms. In this example, suppose the OS intends to swap out to disk a clean page that is a synonym of some dirty page. If it fails to check the status bits for that synonym, it might think that the page is clean and hence that it can be safely swapped out without being first written back.

Notably, in this example, the bug may be observable even when there is no reordering
Initially: [x] = 0, [y] = 0
VA x → PA a (R/W, accessed, dirty)
VA y → PA a (R/W, accessed, dirty)

Core 0/Thread 0           Core 1/Thread 1
(i0.0) St [x/a] ← 1       (i2.0) St [y/a] ← 2
(i0.1) Ld PTE [x]         (i2.1) Ld PTE [y]
(i1.0) Ld [y/a] → r1      (i3.0) Ld [x/a] → r2
(i1.1) Ld PTE [y]         (i3.1) Ld PTE [x]

Outcome r1 = 2, r2 = 1 forbidden
Figure 4. Analyzing litmus test n5 with COATCheck. (a) The μhb graph, with the cycle shown with thicker edges. (b) The ELT code.
..............................................................................................................................................................................................
TOP PICKS
............................................................
94 IEEE MICRO
of any kind taking place, even under virtual- and/or physical-address sequential consistency.7 Because the checks of the two synonym page mappings are to different virtual and physical addresses, the necessary ordering cannot even be described by VAMC. This example highlights a key way in which transistency models are inherently broader in scope than consistency models.

We tested COATCheck on 118 litmus tests, many of which come from Intel and AMD manuals and from prior work,1 and others that are handwritten to stress the SandyBridge model (including the case studies discussed earlier). On an Intel Xeon E5-2667-v3 CPU, all 118 tests completed in fewer than 100 seconds, and many were even faster. Although these μhb graphs are often an order of magnitude larger than those studied by prior tools analyzing μhb graphs,8,9 the runtimes are similar. This demonstrates the benefits of combining the μspec DSL with an efficient dedicated solver. It also points to the feasibility of providing transistency verification fast enough to support interactive design and debugging.
Initially: [x] = 0, VA x → PA a (R/W, accessed, dirty) (other initial mappings not shown)

Core 0/Thread 0:
(i0.0) St [z/PTE (x)] ← (VA x → PA b)
(i0.1) Ld PTE [x]
(i1.0) invlpg [x]
(i2.0) St [w/APIC] ← mrf
(i2.1) Ld PTE [w] → TLB
(i3.0) Ld [v/d] → ack
(i4.0) St [y/c] ← 2
(i5.0) Ld [y/c] → 4
(i6.0) Ld [x/b] → 1
(i6.1) Ld PTE [x] → TLB

Core 1/Thread 1a:
(i7.0) Ld [y/c] → 2
(i8.0) St [x/a] ← 3
(i8.1) Ld PTE [x] → TLB
(i9.0) St [y/c] ← 4

Core 1/Thread 1b (interrupt handler):
(i10.0) Ld [w/APIC] → mrf
(i11.0) Ld EFLAGS → (IF)
(i12.0) St EFLAGS ← (!IF)
(i13.0) invlpg [x]
(i14.0) St [v/d] ← ack
(i15.0) iret

Depicted outcome forbidden
Figure 5. Litmus test ipi8.7 (a) Because the graph is cyclic (thick edges), the outcome is forbidden. In this case, the cycle was found before the PTEs for y were even enumerated. (b) The ELT code.
With COATCheck, we were able to successfully identify, model, and verify a number of interesting scenarios at the intersection of memory consistency models and address translation. However, many important challenges remain; COATCheck only scratches the surface of the complete set of phenomena that can arise at the OS and microarchitecture layers. For example, a natural next step might be to extend COATCheck to model virtual machines and hypervisors of arbitrary depth. Generally, we hope and expect that future work in the area will build on top of COATCheck to create more complete and more rigorous transistency models that can capture an ever-growing set of system-level behaviors and bugs.

We also envision COATCheck becoming more integrated with top-to-bottom memory ordering verification tools. We hope that one day verification tools will cohesively span the full computing stack, from programming languages all the way down to register transfer level, thereby giving programmers and architects much more confidence in the correctness of their code and systems. These goals will only become more challenging as systems grow more heterogeneous and more complex over time. However, COATCheck provides a rigorous and scalable foundation for understanding such systems, and as such we hope that future work finds COATCheck and its μspec modeling language to be useful building blocks for continued research in the area. MICRO
References
1. J. Alglave, L. Maranget, and M. Tautschnig, "Herding Cats: Modelling, Simulation, Testing, and Data Mining for Weak Memory," ACM Trans. Programming Languages and Systems, vol. 36, no. 2, 2014; doi:10.1145/2627752.
2. M. Batty et al., "Clarifying and Compiling C/C++ Concurrency: From C++11 to POWER," Proc. 39th Ann. ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages, 2012, pp. 509–520.
3. H.-J. Boehm and S.V. Adve, "Foundations of the C++ Concurrency Memory Model," Proc. 29th ACM SIGPLAN Conf. Programming Language Design and Implementation, 2008, pp. 68–78.
4. S. Sarkar et al., "Understanding POWER Multiprocessors," Proc. 32nd ACM SIGPLAN Conf. Programming Language Design and Implementation, 2011, pp. 175–186.
5. P. Sewell et al., "x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors," Comm. ACM, vol. 53, no. 7, 2010, pp. 89–97.
6. M. Clark, "A New, High Performance x86 Core Design from AMD," Hot Chips 28 Symp., 2016; www.hotchips.org/archives/2010s/hc28.
7. B. Romanescu, A. Lebeck, and D.J. Sorin, "Address Translation Aware Memory Consistency," IEEE Micro, vol. 31, no. 1, 2011, pp. 109–118.
8. D. Lustig, M. Pellauer, and M. Martonosi, "Verifying Correct Microarchitectural Enforcement of Memory Consistency Models," IEEE Micro, vol. 35, no. 3, 2015, pp. 72–82.
9. Y. Manerkar et al., "CCICheck: Using μhb Graphs to Verify the Coherence-Consistency Interface," Proc. 48th Int'l Symp. Microarchitecture, 2015, pp. 26–37.
10. Check Verification Tool Suite; http://check.cs.princeton.edu.
11. L. Lamport, "How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs," IEEE Trans. Computers, vol. 28, no. 9, 1979, pp. 690–691.
12. D. Lustig et al., "COATCheck: Verifying Memory Ordering at the Hardware-OS Interface," Proc. 21st Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2016, pp. 233–247.
Daniel Lustig is a research scientist at Nvidia. His research interests include computer architecture and memory consistency models. Lustig received a PhD in electrical engineering from Princeton University, where he performed the work for this article. He is a member of IEEE and ACM. Contact him at [email protected].
Geet Sethi is a PhD student in the Department of Computer Science at Stanford
University. His research interests include serverless computing, machine learning, and computer architecture. Sethi received a BS in computer science and mathematics from Rutgers University, where he performed the work for this article. He is a student member of IEEE and ACM. Contact him at [email protected].
Abhishek Bhattacharjee is an associate professor in the Department of Computer Science at Rutgers University. His research interests span the hardware–software interface. Bhattacharjee received a PhD in electrical engineering from Princeton University. He is a member of IEEE and ACM. Contact him at [email protected].
Margaret Martonosi is the Hugh Trumbull Adams '35 Professor of Computer Science at Princeton University. Her research interests include computer architecture and mobile computing, with an emphasis on power-efficient heterogeneous systems. Martonosi has a PhD in electrical engineering from Stanford University. She is a Fellow of IEEE and ACM. Contact her at [email protected].
TOWARD A DNA-BASED ARCHIVAL STORAGE SYSTEM
STORING DATA IN DNA MOLECULES OFFERS EXTREME DENSITY AND DURABILITY ADVANTAGES THAT CAN MITIGATE EXPONENTIAL GROWTH IN DATA STORAGE NEEDS. THIS ARTICLE PRESENTS A DNA-BASED ARCHIVAL STORAGE SYSTEM, PERFORMS WET LAB EXPERIMENTS TO SHOW ITS FEASIBILITY, AND IDENTIFIES TECHNOLOGY TRENDS THAT POINT TO INCREASING PRACTICALITY.
The "digital universe" (all digital data worldwide) is forecast to grow to more than 16 zettabytes in 2017.1 Alarmingly, this exponential growth rate easily exceeds our ability to store it, even when accounting for forecast improvements in storage technologies such as tape (185 terabytes2) and optical media (1 petabyte3). Although not all data requires long-term storage, a significant fraction does: Facebook recently built a datacenter dedicated to 1 exabyte of cold storage.4
Synthetic (manufactured) DNA sequences have long been considered a potential medium for digital data storage because of their density and durability.5–7 DNA molecules offer a theoretical density of 1 exabyte per cubic millimeter (eight orders of magnitude denser than tape) and half-life durability of more than 500 years.8 DNA-based storage also has the benefit of eternal relevance: as long as there is DNA-based life, there will be strong reasons to read and manipulate DNA.
Our paper for the 2016 Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) proposed an architecture for a DNA-based archival storage system.9 Both reading and writing a synthetic DNA storage medium involve established biotechnology practices. The write process encodes digital data into DNA nucleotide sequences (a nucleotide is the basic building block of DNA), synthesizes (manufactures) the corresponding DNA molecules, and stores them away. Reading the data involves sequencing (reading) the DNA molecules and decoding the information back to the original digital data (see Figure 1).
Progress in DNA storage has been rapid: in our ASPLOS paper, we successfully stored and recovered 42 Kbytes of data; since publication, our team has scaled our process to store and recover more than 200 Mbytes of data.10,11 Constant improvement in the scale of DNA storage (at least two times per year) is fueled by exponential reduction in synthesis and sequencing cost and latency; growth in sequencing productivity eclipses even Moore's law.12 Further growth in the biotechnology industry portends orders of magnitude cost reductions and efficiency improvements.
We think the time is ripe to seriously consider DNA-based storage and explore system designs and architectural implications. Our ASPLOS paper was the first to address two
James Bornholt
Randolph Lopez
University of Washington
Douglas M. Carmean
Microsoft
Luis Ceze
Georg Seelig
University of Washington
Karin Strauss
Microsoft Research
.......................................................
98 Published by the IEEE Computer Society 0272-1732/17/$33.00 © 2017 IEEE
fundamental challenges in building a viable DNA-based storage system. First, how should such a storage medium be organized? We demonstrate the tradeoffs between density, reliability, and performance by envisioning DNA storage as a key-value store. Multiple key-value pairs are stored in the same pool, and multiple such pools are physically arranged into a library. Second, how can data be recovered efficiently from a DNA storage system? We show for the first time that random access to DNA-based storage pools is feasible by using a polymerase chain reaction (PCR) to amplify selected molecules for sequencing. Our wet lab experiments validate our approach and point to the long-term viability of DNA as an archival storage medium.

System Design
A DNA storage system (see Figure 2) takes data as input, synthesizes DNA molecules to represent that data, and stores them in a library of pools. To read data back, the system selects molecules from the pool, amplifies them with PCR (a standard process from biotechnology), and sequences them back to digital data. We model the DNA storage system as a key-value store, in which input data is associated with a key, and read operations identify the key they wish to recover.
Writing to DNA storage involves encoding binary data as DNA nucleotides and synthesizing the corresponding molecules. This process involves two non-trivial steps. First, although there are four DNA nucleotides (A, C, G, T) and so a conversion from binary appears trivial, we instead convert binary data to base 3 and employ a rotating encoding from ternary digits to nucleotides.7 This encoding avoids homopolymers, repetitions of the same nucleotide that significantly increase the chance of errors.
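A minimal sketch of such a rotating ternary code (the per-byte digit grouping and starting nucleotide here are illustrative choices, not the paper's exact scheme): each ternary digit selects one of the three nucleotides that differ from the previous one, so no two adjacent nucleotides can ever be equal.

```python
# Goldman-style rotating code: trit t after nucleotide p maps to the
# nucleotide (index(p) + 1 + t) mod 4, which is never p itself.

NUCS = "ACGT"

def to_trits(data: bytes):
    """Big-endian base-3 digits of each byte (6 trits cover 0..255)."""
    for byte in data:
        for place in (243, 81, 27, 9, 3, 1):
            yield (byte // place) % 3

def encode(data: bytes, prev: str = "A") -> str:
    out = []
    for trit in to_trits(data):
        prev = NUCS[(NUCS.index(prev) + 1 + trit) % 4]
        out.append(prev)
    return "".join(out)

strand = encode(b"\x01\x02")
assert all(a != b for a, b in zip(strand, strand[1:]))  # no homopolymers
```

Decoding inverts the rotation: the trit is `(index(cur) - index(prev) - 1) mod 4`, which is always in {0, 1, 2} for a validly encoded strand.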
Second, DNA synthesis technology effectively manufactures molecules one nucleotide at a time, and cannot synthesize molecules of arbitrary length without error. A reasonably efficient strand length for DNA synthesis is 120 to 150 nucleotides, which gives a maximum of 237 bits of data in a single molecule using this ternary encoding. The write process therefore fragments input data into small blocks that correspond to separate DNA molecules. This blocking approach also enables added redundancy. Previous work overlapped multiple small blocks,7 but our experimental and simulation results show this approach to sacrifice too much density for little gain. Our ASPLOS experiments instead used an XOR encoding, in which each consecutive pair of blocks is XORed together to form a third redundancy block. Although this encoding is simple, we showed that it achieves similar redundancy properties to existing approaches with much less density overhead. Since publishing this paper, our team has been exploring more sophisticated encodings, such as Reed-Solomon codes.
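The pairwise XOR redundancy can be sketched as follows (block framing is simplified; real strands also carry addresses and primers, and block sizes come from the strand-length budget above):

```python
# For each consecutive pair of data blocks A, B, emit A, B, and A^B.
# Any one block of the triple is recoverable from the other two.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def add_redundancy(blocks):
    """Yield A, B, A^B for each consecutive pair of equal-sized blocks."""
    for a, b in zip(blocks[::2], blocks[1::2]):
        yield from (a, b, xor_blocks(a, b))

a, b = b"\x0f\xf0", b"\x33\x33"
a_out, b_out, r = list(add_redundancy([a, b]))
assert xor_blocks(a_out, r) == b    # lose B: recover it from A and A^B
assert xor_blocks(b_out, r) == a    # lose A: recover it from B and A^B
```

This costs one extra block per two data blocks (50 percent overhead), versus the fourfold overhead of fully overlapping each block as in the earlier scheme.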
Random Access
Reading from DNA storage involves sequencing molecules and decoding their data back to binary (using the inverse of the encoding discussed earlier). In existing work on DNA storage, recovering data meant sequencing all
Figure 1. Using DNA for digital data storage. Writes to DNA first encode digital data as nucleotide sequences and then synthesize (manufacture) molecules. Reads from DNA first sequence (read) the molecules and then decode back to digital data.
Figure 2. Overview of a DNA storage system. Stored molecules are arranged in a library of pools.
synthesized molecules and decoding all data at once. However, a realistic storage system must offer random access (the ability to select individual files for reading) if it is to be practical at large capacities.
Because DNA molecules do not offer spatial organization like traditional storage media, we must explicitly include addressing information in the synthesized molecules. Figure 3 shows the layout of an individual DNA strand in our system. Each strand contains a payload, which is a substring of the input data to encode. An address includes both a key identifier and an index into the input data (to allow data longer than one strand). At each end of the strand, special primer sequences, which correspond to the key identifier, allow for efficient sequencing during read operations. Finally, two sense nucleotides ("S") help determine the direction and complementarity of the strand during sequencing.
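The layout can be sketched as simple string assembly (the primer sequences, field widths, address encoding, and sense nucleotide below are invented for illustration and are not the paper's exact parameters):

```python
# Assemble a strand following Figure 3's layout:
#   5' primer | sense | payload | address (key id + index) | sense | 3' primer

NUCS = "ACGT"

def to_nucs(value: int, length: int) -> str:
    """Base-4 encode an integer as a fixed-length nucleotide string."""
    return "".join(NUCS[(value >> (2 * i)) & 3] for i in reversed(range(length)))

PRIMERS = {7: ("TCTACG", "CCAGTATCA")}   # hypothetical key id -> primer pair

def make_strand(key_id: int, index: int, payload: str) -> str:
    fwd, rev = PRIMERS[key_id]
    address = to_nucs(key_id, 4) + to_nucs(index, 8)   # key id + block index
    sense = "A"                                        # direction marker
    return fwd + sense + payload + address + sense + rev

s = make_strand(7, 3, "ATGCGTCGTACT")
assert s.startswith("TCTACG") and s.endswith("CCAGTATCA")
assert len(s) == 6 + 1 + 12 + 12 + 1 + 9
```

A real encoder would also apply the homopolymer-avoiding rotating code to the address and payload fields; this sketch keeps the fields in their raw form to make the layout visible.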
Our design allows for random access by using PCR, shown in Figure 4. The read process first determines the primers for the given key (analogous to a hash function) and synthesizes them into new DNA molecules. Then, rather than applying sequencing to the entire pool of stored molecules, we first apply PCR to the pool using these primers. PCR amplifies the strands in the pool whose primers match the given ones, creating many copies of those strands. To recover the file, we now take a sample of the product pool, which contains a large number of copies of all the relevant strands but only a few other irrelevant strands. Sequencing this sample therefore returns the data for the relevant key rather than all data in the system.
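Why sampling works after amplification can be seen in a toy simulation (the strand contents, cycle count, and sample size are invented; real PCR efficiency is below the ideal doubling modeled here):

```python
# Toy PCR random access: amplify strands whose primers match the key,
# then sample the product pool. Matching strands dominate the sample.

import random

def pcr_then_sample(pool, fwd, rev, cycles=10, sample_size=20):
    matches = [s for s in pool if s.startswith(fwd) and s.endswith(rev)]
    amplified = list(pool) + matches * (2 ** cycles - 1)  # ideal doubling
    return random.sample(amplified, sample_size)

pool = ["TCTACG" + "AAAC" + "CCAGTATCA",   # strand for the selected key
        "GGGTAC" + "TTTG" + "AAACCCGGG"]   # strand for some other key
sample = pcr_then_sample(pool, "TCTACG", "CCAGTATCA")
# only one non-matching strand exists in the whole amplified pool,
# so at least 19 of the 20 sampled strands belong to the selected key
assert sum(s.startswith("TCTACG") for s in sample) >= 19
```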
Although PCR-based random access is a viable implementation, we don't believe it is practical to put all data in a single pool. We instead envision a library of pools offering spatial isolation. We estimate each pool to contain about 100 Tbytes of data. An address then maps to both a pool location and a PCR primer. Figure 5 shows how the random access described earlier fits in a system with a library of DNA pools. This design is analogous to a magnetic-tape storage library, in which robotic arms are used to retrieve tapes. In our proposed DNA-based storage system, DNA pools could be manipulated and necessary reactions could be automated by fluidics systems.
Wet Lab Experiments
To demonstrate the feasibility of DNA storage with random access, we encoded and had DNA molecules synthesized for four image files totaling 151 Kbytes. We then selectively recovered 42 Kbytes of this image data using our random access scheme. We used both an existing encoding7 and our XOR encoding. We were able to recover files encoded with XOR with no errors. Using the previously existing encoding resulted in a 1-byte error. In total, the encoded files required 16,994 DNA strands, and sequencing produced a total of 20.8 million reads of those strands (with an average of 1,223 reads per DNA strand, or depth of 1,223).
To explore the impact of lower sequencing depth on our results, we performed an
Figure 3. Layout of individual DNA strands. Each strand must carry an explicit copy of its address, because DNA molecules do not offer the spatial organization of traditional storage media.
Figure 4. Polymerase chain reaction (PCR) amplifies selected strands to provide efficient random access. The resulting pool after sampling contains primarily the strands of interest.
experiment in which we discarded much of the sequencing data (see Figure 6). Lower depth per DNA sequence frees up additional sequencing bandwidth for other DNA sequences, but could omit some strands entirely if they are not sequenced at all. Despite such omissions, the results show that we can successfully recover all data using as few as 1 percent of the sequencing results, indicating we could have recovered 100 times more data with the same sequencing technology. Future sequencing technology is likely to continue increasing this amount.
To inform our coding-scheme design, we assessed errors in DNA synthesis and sequencing by comparing the sequencing output of two sets of DNA sequences with the original reference data. The first set includes the sequences we used to encode data, which were synthesized for our storage experiments by a supplier using an array method. Errors in these sequencing results could be caused either by sequencing or synthesis (or both). The second set includes DNA that was synthesized by a different supplier using a process that's much more accurate (virtually no errors), but also much costlier. Errors in these sequencing results are essentially caused only by the sequencing process. By comparing the two sets of results, we can determine the error rate of both sequencing (results from the second set) and array synthesis (the difference between the two sets). Our results indicate that overall errors per base are a little more than 1 percent and that sequencing accounts for most of the error (see Figure 7).
Technology Trends
With demand for storage growing faster than even optimistic projections of current technologies, it is important to develop new sustainable storage solutions. A significant
Figure 6. Decoding accuracy as a function of sequencing depth. We successfully recover all data using as little as 1 percent of the sequencing results, suggesting current sequencing technology can recover up to 100 times more data.
Figure 5. Putting it all together: random access with a pool library for physical isolation. The key data (here, foo.jpg) is used with a hash function to identify the relevant pool within the library.
fraction of the world's data can be stored in archival form. For archival purposes, as long as there is enough bandwidth to write and read data, latency can be high, as is the case for DNA data storage systems.
Archival storage should be dense to occupy as little space as possible, be very durable to avoid continuous rewriting operations, and have low power consumption at rest because it is meant to be kept for long periods of time. DNA fulfills all these criteria, because it is ultra-dense (1 exabyte per cubic inch for a practical system), is very durable (millennia scale), and has low power requirements (keep it dark, dry, and slightly cooler than room temperature). As we showed in our work, DNA can also support random access, allowing most data to remain at rest until needed.
Current DNA technologies do not yet offer the throughput necessary to support a practical system; in our experiments, throughput was on the order of kilobytes per week. But a key reason for choosing DNA as storage media, rather than some other biomolecule, is that there is already significant momentum behind improvements to DNA manipulation technology. The Carlson curves in Figure 8 compare progress in DNA manipulation technology (both sequencing and synthesis) to improvements in transistor density.12 Sequencing continues to keep up with, and sometimes outpace, Moore's law. New technologies such as nanopore sequencing promise to continue this rate of improvement in the future.13
Future Directions
Using DNA for data storage opens many research opportunities. In the short term, because DNA manipulation is relatively noisy, it requires coding-theoretic techniques to offer reliable behavior with unreliable components. Our team has been working on adopting more sophisticated encoding schemes and better calibrating them to the stochastic behavior of molecular storage. DNA storage also involves much higher access latency than digital storage media, suggesting new research opportunities in latency hiding and caching. Finally, the compactness of DNA-based storage, together with the necessity for wet access to molecules, could open new datacenter-level organizations and automation opportunities for biological manipulation.
In the long term, a last layer of the storage hierarchy with unprecedented density and durability opens up the possibility of storing all kinds of records for extended periods of time. Figure 9 illustrates a possible hierarchy with the properties of each layer. Data that could be preserved for a long time include both system records, such as search and security logs, as well as human records, such as health and historical data in textual, audio, and video formats. Besides its obvious uses in disaster recovery, this opportunity could one
Figure 8. Carlson curves compare trends in DNA synthesis and sequencing to Moore's law.12 Recent growth in sequencing technology outpaces Moore's law. (Data provided by Robert Carlson.)
Figure 7. Analysis of error from synthesis and sequencing. Overall errors per base are a little more than 1 percent and are mostly attributable to sequencing.
day be a great contributor to the field of digital archeology, the study of human history through "ancient" digital data.
The success of the initial project, published in our ASPLOS paper, motivated us to significantly expand our efforts to explore DNA-based data storage. We formed the Molecular Information Systems Lab (MISL), with members from the University of Washington and Microsoft Research. MISL has worked with Twist Bioscience to synthesize a 200-Mbyte DNA pool,11 more than three orders of magnitude larger than our ASPLOS results, and an order of magnitude larger than the prior state of the art.14 Some of its more recent efforts include new coding schemes, sequencing with nanopore-based techniques, and fluidics automation.
Given the impending limits of silicon technology, we believe that hybrid silicon and biochemical systems are worth serious consideration. Now is the time for architects to consider incorporating biomolecules as an integral part of computer design. DNA-based storage is one clear, practical example of this direction. Biotechnology has benefited tremendously from progress in silicon technology developed by the computer industry; perhaps now is the time for the computer industry to borrow back from the biotechnology industry to advance the state of the art in computer systems. MICRO
Acknowledgments
We thank the members of the Molecular Information Systems Laboratory for their continuing support of this work. We thank Brandon Holt, Emina Torlak, Xi Wang, the Sampa group at the University of Washington, and the anonymous reviewers for feedback on this work. This material is based on work supported by the National Science Foundation under grant numbers 1064497 and 1409831, by gifts from Microsoft Research, and by the David Notkin Endowed Graduate Fellowship.
....................................................................References1. “Where in the World Is Storage: Byte
Density Across the Globe,” IDC, 2013;
www.idc.com/downloads/where is storage
infographic 243338.pdf.
2. “Sony Develops Magnetic Tape Technology
with the World’s Highest Recording Density,”
press release, Sony, 30 Apr. 2014; www.sony
.net/SonyInfo/News/Press/201404/14-044E.
3. J. Plafke, “New Optical Laser Can Increase
DVD Storage Up to One Petabyte,” blog,
20 June 2013; www.extremetech.com
/computing/159245-new-optical-laser-can
-increase-dvd-storage-up-to-one-petabyte.
4. R. Miller, “Facebook Builds Exabyte Data
Centers for Cold Storage,” blog, 18 Jan. 2013;
www.datacenterknowledge.com/archives
/2013/01/18/facebook-builds-new-data-centers
-for-cold-storage.
5. G.M. Church, Y. Gao, and S. Kosuri, “Next-
Generation Digital Information Storage in
DNA,” Science, vol. 337, no. 6102, 2012,
pp. 1628–1629.
6. C.T. Clelland, V. Risca, and C. Bancroft,
“Hiding Messages in DNA Microdots,”
Nature, vol. 399, 1999, pp. 533–534.
7. N. Goldman et al., “Towards Practical, High-
Capacity, Low-Maintenance Information
Storage in Synthesized DNA,” Nature,
vol. 494, 2013, pp. 77–80.
8. M.E. Allentoft et al., “The Half-Life of DNA
in Bone: Measuring Decay Kinetics in 158
Flash
HDD
Tape
DNA storage
Access time Durability Capacity
µs–ms
10s of ms
Minutes
Hours
~5 yrs
~5 yrs
~15–30 yrs
Centuries
Tbytes
100s of Tbytes
Pbytes
Zbytes
Figure 9. A possible storage system hierarchy. DNA storage is a promising new bottom layer,
offering higher density and durability at the cost of latency.
MAY/JUNE 2017 103
James Bornholt is a PhD student in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. His research interests include programming languages and formal methods, focusing on program synthesis. Bornholt received an MS in computer science from the University of Washington. Contact him at [email protected].

Randolph Lopez is a graduate student in bioengineering at the University of Washington. His research interests include the intersection of synthetic biology, DNA nanotechnology, and molecular diagnostics. Lopez received a BS in bioengineering from the University of California, San Diego. Contact him at [email protected].

Douglas M. Carmean is a partner architect at Microsoft. His research interests include new architectures on future device technology. Carmean received a BS in electrical and electronics engineering from Oregon State University. Contact him at [email protected].

Luis Ceze is the Torode Family Associate Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. His research interests include the intersection between computer architecture, programming languages, and biology. Ceze received a PhD in computer science from the University of Illinois at Urbana-Champaign. Contact him at [email protected].

Georg Seelig is an associate professor in the Department of Electrical Engineering and the Paul G. Allen School of Computer Science and Engineering at the University of Washington. His research interests include understanding how biological organisms process information using complex biochemical networks and how such networks can be engineered to program cellular behavior. Seelig received a PhD in physics from the University of Geneva. Contact him at [email protected].

Karin Strauss is a senior researcher at Microsoft Research and an affiliate faculty member at the University of Washington. Her research interests include studying the application of biological mechanisms and other emerging technologies to storage and computation, and building systems that are efficient and reliable with them. Strauss received a PhD in computer science from the University of Illinois at Urbana-Champaign. Contact her at [email protected].
TOP PICKS
104 IEEE MICRO
TI-STATES: POWER MANAGEMENT IN ACTIVE TIMING MARGIN PROCESSORS
TEMPERATURE INVERSION IS A TRANSISTOR-LEVEL EFFECT THAT IMPROVES PERFORMANCE
WHEN TEMPERATURE INCREASES. THIS ARTICLE PRESENTS A COMPREHENSIVE
MEASUREMENT-BASED ANALYSIS OF ITS IMPLICATIONS FOR ARCHITECTURE DESIGN AND
POWER MANAGEMENT USING THE AMD A10-8700P PROCESSOR. THE AUTHORS
PROPOSE TEMPERATURE-INVERSION STATES (TI-STATES) TO HARNESS THE OPPORTUNITIES
PROMISED BY TEMPERATURE INVERSION. THEY EXPECT TI-STATES TO BE ABLE TO IMPROVE
THE POWER EFFICIENCY OF MANY PROCESSORS MANUFACTURED IN FUTURE CMOS
TECHNOLOGIES.
Temperature inversion refers to the phenomenon that transistors switch faster at a higher temperature when operating in certain regions. To harness temperature inversion's performance benefits, we introduce Ti-states, or temperature-inversion states, for active timing-margin management in emerging processors. Ti-states are frequency, temperature, and voltage triples that enable processor timing-margin adjustments through runtime supply-voltage changes. Similar to P-states' frequency-voltage table lookup mechanism, Ti-states operate by indexing into a temperature-voltage table that resembles a series of power states determined by transistors' temperature-inversion effect. Ti-states push greater efficiency out of the underlying processor, specifically in active-timing-margin-based processors.
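To make the lookup analogy concrete, here is a minimal Python sketch of such a table, indexed by temperature rather than by a performance level. The `ti_state_voltage` interface is our illustration, not the authors' firmware API; the bin voltages echo the selections reported later in Table 1 for the 0.7 V, 300 MHz operating point.

```python
# Hypothetical sketch: a Ti-state table maps die temperature to the lowest
# safe supply voltage at a fixed frequency, much as a P-state table maps a
# performance level to a frequency-voltage pair.
TI_STATES = [
    (0, 0.700),     # 0 C golden reference: no undervolting
    (20, 0.6875),
    (40, 0.68125),
    (60, 0.66875),
    (80, 0.6625),
    (100, 0.650),
]

def ti_state_voltage(die_temp_c):
    """Floor lookup: pick the entry for the hottest bin at or below the reading."""
    vdd = TI_STATES[0][1]
    for bin_temp, voltage in TI_STATES:
        if die_temp_c >= bin_temp:
            vdd = voltage   # hotter die -> temperature inversion -> lower VDD
    return vdd

print(ti_state_voltage(25))   # lands in the 20 C bin
print(ti_state_voltage(85))   # lands in the 80 C bin
```

Readings below the coldest bin fall back to the 0 °C reference voltage, mirroring the worst-case design point discussed later.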
Ti-states are a natural evolution of classical power-management mechanisms such as P-states and C-states. This evolution is enabled by the growing manifestation of the transistor's temperature-inversion effect as device feature size scales down.
When temperature increases, transistor performance is affected by two factors: a decrease in both carrier mobility and threshold voltage. Reduced carrier mobility causes devices to slow down, whereas reduced threshold voltage causes devices to speed up. When the supply voltage is low enough, transistor speed is sensitive to minute threshold-voltage changes, which makes the second factor (threshold-voltage reduction) dominate. In this situation, temperature inversion occurs.1
In the past, designers have safely discounted temperature inversion because it does not occur under a processor's normal operation. However, as transistor feature size scales down, today's processors are operating close to the temperature inversion's voltage
Yazhou Zu
University of Texas at Austin
Wei Huang
Indrani Paul
Advanced Micro Devices
Vijay Janapa Reddi
University of Texas at Austin
106 Published by the IEEE Computer Society 0272-1732/17/$33.00 © 2017 IEEE
region. Therefore, the speedup benefit of temperature inversion deserves more attention from architects and system operators.
Figure 1a provides a device simulation analysis based on predictive technology models.2,3 We use inflection voltage to denote the crossover point for temperature inversion to occur. Below the inflection voltage is the temperature-inversion region, in which circuits speed up at high temperature. Above the inflection voltage is the noninversion region, in which circuits slow down at high temperature. From 90 nm to 22 nm, the inflection voltage keeps increasing and approaches the processor's nominal voltage. This means temperature inversion is becoming more likely to occur in recent, smaller technologies.
Our silicon measurement corroborates and strengthens this projection. The measured 28-nm AMD A10-8700P processor's inflection voltage falls within the range of the processor's different P-states. Figure 1b further illustrates temperature inversion by contrasting circuit performance in the inversion and noninversion regions. At 1.1 V, the circuit is slightly slower at a higher temperature while safely meeting the specified frequency, as expected from conventional wisdom. At 0.7 V, however, this circuit becomes faster by more than 15 percent at 80°C as a result of temperature inversion.
Ti-states harness temperature inversion's speedup effect by actively undervolting to save power. Ti-states exploit the fact that the faster circuits offered by temperature inversion add extra margin to the processor's clock cycle. The mechanism then calculates the precise amount of voltage that can be safely reduced to reclaim the extra margin. The undervolting decision for each temperature is stored in a table for runtime lookup.
Ti-states are instrumental because they can apply to almost all processors manufactured with today's technologies that manifest a strong temperature-inversion effect, including bulk CMOS, fin field-effect transistor (FinFET), and fully depleted silicon-on-insulator (FD-SOI) processes. The comprehensive characterization we present in this article is based on rigorous hardware measurement, and it can spawn future work that exploits the temperature-inversion effect.
Measuring Temperature Inversion
We measure temperature inversion on a 28-nm AMD A10-8700P accelerated processing unit (APU).4 The APU integrates two CPU core pairs, eight GPU cores, and other system components. We conduct our study on both the CPU and GPU and present measurements at the GPU's lowest P-state of 0.7 V and 300 MHz, because it has strong temperature inversion. The temperature-inversion effect we study depends on supply voltage but not on the architecture. Thus, we expect the analysis on the AMD-integrated GPU to naturally extend to the CPU and other architectures as well, for all processor vendors.
We leverage the APU's power supply monitors (PSMs) to accurately measure circuit speed changes under different conditions.5
Figure 1. Temperature inversion is having more impact on processor performance as technology scales. (a) Temperature inversion was projected to become more common in smaller technologies (90 nm down to 22 nm) as its inflection voltage keeps increasing and approaches the nominal supply; our measurement at 28 nm falls within the processor's P-state range. (b) High temperature increases performance under low voltage (0.7 V) due to temperature inversion, slows circuits under high voltage (1.1 V) as conventional wisdom expects, and leaves them roughly unchanged near 0.9 V.
Figure 2 illustrates a PSM's structure. A PSM is a time-to-digital converter that reflects circuit time delay in numeric form. Its core component is a ring oscillator that counts the number of inverters an "edge" has traveled through in each clock cycle. When the circuit is faster, an edge can pass more inverters, and a PSM will produce a higher count output. We use a PSM as a means to characterize circuit performance under temperature variation. We normalize the PSM reading to a reference value measured under 0.7 V, 300 MHz, 0°C, and an idle chip condition.
To characterize the effect of temperature inversion on performance and power under different operating conditions, we carefully regulate the processor's on-die temperature using a temperature feedback control system (see Figure 3). The feedback control checks the die temperature measured via an on-chip thermal diode and adjusts the thermal head temperature every 10 ms to set the chip temperature to a user-specified value. Physically, the thermal head's temperature is controlled via a water pipe and a heater that regulate its surface temperature.
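The feedback loop above can be sketched as a simple proportional controller. Everything here is an assumption for illustration: the gain, the `FakeChip` plant standing in for the real thermal head and diode, and the function names are ours, not the authors' test bench.

```python
# Illustrative sketch of the temperature feedback control: every control
# period (10 ms on the real setup), compare the on-die diode reading with
# the setpoint and nudge the thermal head's temperature proportionally.

def regulate(setpoint_c, read_die_temp, set_thermal_head, gain=0.5, steps=200):
    head_c = read_die_temp()                  # start the head at the die temp
    for _ in range(steps):                    # one step ~= one 10-ms period
        error = setpoint_c - read_die_temp()
        head_c += gain * error                # proportional correction
        set_thermal_head(head_c)
    return read_die_temp()

class FakeChip:
    """Toy plant: the die slowly relaxes toward the thermal head's temperature."""
    def __init__(self, temp_c):
        self.die = self.head = temp_c
    def read(self):
        self.die += 0.2 * (self.head - self.die)   # thermal coupling per step
        return self.die
    def set_head(self, t):
        self.head = t

chip = FakeChip(25.0)
print(round(regulate(60.0, chip.read, chip.set_head), 1))   # settles near 60.0
```

With these toy constants the loop is stable and the die converges to the setpoint; the real controller must additionally bound the head temperature and handle sensor noise.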
The Temperature-Inversion Effect
Temperature inversion primarily affects circuit performance. We first explain temperature inversion's performance impact with respect to supply voltage and temperature. We then extrapolate the power-optimization potential offered by temperature inversion. Through our measurement, we make two observations: temperature inversion's speedup effects become stronger with lower voltage, and the speedup can be turned into more than 5 percent undervolting benefits.
Inversion versus Noninversion
We contrast temperature-inversion and noninversion effects by sweeping across a wide operating-voltage range. Figure 4 shows the circuit speed change under different supply voltages and die temperatures. Speed is reflected by the PSM's normalized output; a higher value implies a faster circuit. We keep the chip idle to avoid any workload disturbance, such as the di/dt effect.
Figure 4 illustrates the insight that temperature's impact on circuit performance depends on the supply voltage. In the high supply-voltage region around 1.1 V, the PSM's reading becomes progressively smaller as the temperature rises from 0°C to 100°C. The circuit operates slower at a higher temperature, which aligns with conventional belief. The reason for this circuit performance degradation is that the transistor's carrier mobility decreases at a higher temperature, leading to smaller switch-on current (Ion) and longer switch time.
Under a lower supply voltage, the PSM's reading increases with higher temperature, which means the circuit switches faster (that is, the temperature-inversion phenomenon). The reason is that the transistor's threshold voltage (Vth) decreases linearly as temperature increases. For the same supply voltage, a lower Vth provides more drive current (Ion), which makes the circuit switch faster. The speedup effect is more dominant when the supply voltage is low, because then the supply voltage is closer to Vth, and any minute change of Vth can affect transistor performance.

An "inflection voltage" exists that balances high temperature's speedup and slowdown effects. On the processor we tested,
Figure 2. A power supply monitor (PSM) is a ring of inverters inserted between two pipeline latches. It counts the number of inverters an "edge" travels through in one cycle to measure circuit speed.
the inflection voltage is between 0.9 V and 1 V. Around this point, temperature does not have a notable impact on circuit performance. Technology evolution has made more chip operating states fall below the inflection voltage (that is, in the temperature-inversion region). For the APU we tested, half of the GPU's P-states, ranging from 0.75 to 1.1 V, operate in the temperature-inversion region. Therefore, we must carefully inspect temperature inversion and take advantage of its speedup effect.
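The interplay of mobility and threshold voltage described above can be captured in a toy alpha-power-law delay model. All constants below (nominal Vth, its temperature coefficient, the mobility exponent, and alpha) are assumptions for illustration, not values fitted to the measured chip; the point is only that one equation produces both regimes.

```python
# Toy model (assumed constants): delay ~ VDD / (mobility(T) * (VDD - Vth(T))^alpha).
# Mobility falls with temperature (slowing circuits) while Vth falls with
# temperature (speeding them up); which effect wins depends on VDD.

def gate_delay(vdd, temp_c, vth0=0.35, k_vth=2e-3, mu_exp=1.2, alpha=1.3):
    vth = vth0 - k_vth * temp_c                         # Vth drops ~2 mV per deg C
    mobility = ((temp_c + 273.15) / 273.15) ** -mu_exp  # carrier mobility vs. T
    return vdd / (mobility * (vdd - vth) ** alpha)

for vdd in (0.7, 1.1):
    change = gate_delay(vdd, 80) / gate_delay(vdd, 0) - 1
    print(f"{vdd:.1f} V: delay change from 0 C to 80 C = {change:+.1%}")
```

With these particular constants the model is faster when hot at 0.7 V and slower when hot at 1.1 V, qualitatively reproducing the inversion and noninversion regions around an inflection point near 0.9 V.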
Active Timing Margin's Undervolting Potential
We propose to harness temperature inversion's speedup effect by reclaiming the extra pipeline timing margin provided by the faster circuitry. Specifically, we propose to actively undervolt to shrink the extra timing margin, an approach similar in spirit to prior active-timing-margin management schemes.6 To explore the optimization space, we first estimate the undervolting potential using PSM measurement.
Figure 5 illustrates the estimation process. The x-axis zooms into the low-voltage region between 0.6 and 0.86 V in Figure 4 to give a closer look at the margin-reduction opportunities.
Temperature inversion's performance benefit becomes stronger at lower voltages, as reflected by the widening gap between 100°C and 0°C. At 0.7 V, the PSM difference between 100°C and 0°C represents the extra timing margin in units of inverter delays. In other words, it reflects how much faster the circuits run at a higher temperature by counting how many more inverters the faster circuit can switch successively in one cycle. To bring the faster circuit back to its original speed, the supply voltage needs to be reduced such that, under a higher temperature, the PSM reads the same value. We estimate the voltage-reduction potential with linear extrapolation. At 0.7 V, the extra margin translates to a 46-mV voltage reduction, equivalent to 5 percent undervolting potential. See our original paper for more complete extrapolation results.7
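The extrapolation step amounts to dividing the extra margin (in PSM counts) by the local slope of the PSM-versus-VDD curve. The inputs below are illustrative stand-ins chosen to reproduce the 46-mV figure, not the paper's raw measurements:

```python
# Linear extrapolation of undervolt potential: the PSM gap between hot and
# cold readings, divided by how many PSM counts one volt buys, estimates how
# much VDD can be shaved while keeping the 0 C timing margin.

def undervolt_potential(psm_hot, psm_cold, psm_per_volt):
    return (psm_hot - psm_cold) / psm_per_volt

# Illustrative numbers around the 0.7 V operating point.
delta_v = undervolt_potential(psm_hot=3.95, psm_cold=3.30, psm_per_volt=14.1)
print(f"~{delta_v * 1e3:.0f} mV of undervolt potential")
```

This is only a first-order estimate; as the article describes next, the actual Ti-state voltage is validated by stepping through candidate voltages on real hardware.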
Temperature-Inversion States
Based on our temperature-inversion characterization, we propose Ti-states to construct a safe undervolting control loop that reclaims the extra timing margin provided by temperature
Figure 4. Circuit speed changes under different supply voltages (0.6 to 1.1 V) and die temperatures (0°C to 100°C). Temperature inversion happens below the inflection voltage of about 0.9 V and is progressively stronger when voltage scales down.
Figure 3. Temperature control setup. The thermal head's temperature is controlled via a water pipe and a heater. The water pipe is connected to an external chiller to offer low temperatures, while the heater increases temperature to reach the desired temperature setting.
Figure 5. Temperature inversion happens below 0.9 V. It speeds up circuits, as reflected by larger PSM values under higher temperatures, and becomes stronger when voltage scales down. At 0.7 V, the extra margin between 0°C and 100°C translates into a 46-mV voltage-reduction potential.
inversion. In doing this, we must not introduce additional pipeline timing threats for reliability purposes, such as overly reducing timing margins or enlarging voltage droops caused by workload di/dt effects.
To guarantee timing safety, we use the timing margin measured at 0°C as the "golden" reference. We choose 0°C as the reference because it represents the worst-case operating condition under temperature inversion. Workloads that run safely at 0°C are guaranteed to pass under higher temperatures, because temperature inversion can make circuits run faster. Although 0°C rarely occurs in desktop, mobile, and datacenter applications, during the early design stage, timing margins should be set to tolerate these worst-case conditions. In industry, 0°C or below is used as a standard circuit-design guideline.8 In critical scenarios, an even more conservative reference of –25°C is adopted.
Ti-states' undervolting goal is to maintain the same timing margin as at 0°C when a chip is operating at a higher temperature. In other words, the voltage a Ti-state sets should always make the timing margins measured by the PSM match those at 0°C. Under this constraint, Ti-states undervolt to maximize power savings.
Algorithm 1 summarizes the methodology to construct Ti-states:
The algorithm repeatedly stress-tests the processor under different temperature-voltage environments with a set of workloads, and produces a temperature-voltage table that can be stored in system firmware.9 At runtime, the system can index into this table to actively set the supply voltage according to runtime temperature measurement.
Algorithm 1 uses a set of workloads as the training set to first get a tentative temperature-voltage mapping. We then validate this mapping with another set of test workloads.
During the training stage, Algorithm 1 first measures each workload's golden reference timing margin at 0°C using PSMs. The timing margin is recorded as the worst-case margin during the entire program run. Then, at each target temperature, Algorithm 1 selects four candidate voltages around the extrapolated voltage value, as in Figure 5. The four candidate voltages are stepped through, and each workload's timing margin is recorded using PSMs. Finally, the timing margins at the different candidate voltages are compared against the 0°C reference, and the voltage with the minimum PSM difference is taken as the target temperature's Ti-state voltage.
Table 1 shows the PSM difference compared with the 0°C reference across different candidate voltages for 20°C, 40°C, 60°C, 80°C, and 100°C. The selected Ti-state voltages, those with the smallest difference, are marked in the table. For instance, at 80°C, 0.6625 V is the Ti-state, which provides around 5 percent voltage-reduction benefit.
We observed from executing Algorithm 1 that a Ti-state's undervolting decision is independent of the workloads. It achieves the same margin-reduction effects across all programs. This makes sense, because temperature inversion is a transistor-level effect and does not depend on other architecture or program behavior. This observation is good for Ti-states, because it justifies applying the undervolting decision made from a small set of test programs to a wide range of future, unknown workloads.
Figure 6 illustrates our observation. Going from 0°C to 80°C, temperature inversion offers more than 15 percent extra timing margin. Ti-states safely reclaim the extra margin by reducing voltage to 0.66 V. After voltage reduction, workload timing margins
1:  procedure GET REFERENCE MARGIN
2:    set voltage and temperature to reference
3:    for each training workload do
4:      workloadMargin ← PSM measurement
5:      push RefMarginArr, workloadMargin
      return RefMarginArr

6:  procedure EXPLORE UNDERVOLT
7:    initVDD ← idle PSM extrapolation
8:    candidateVDDArr ← voltages around initVDD
9:    minErr ← MaxInt
10:   set exploration temperature
11:   for each VDD in candidateVDDArr do
12:     set voltage to VDD
13:     for each training workload do
14:       workloadMargin ← PSM measurement
15:       push TrainMarginArr, workloadMargin
16:     err ← diff(RefMarginArr, TrainMarginArr)
17:     if err < minErr then
18:       minErr ← err
19:       exploreVDD ← VDD
      return exploreVDD
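Under assumed interfaces, the training procedure can be rendered in runnable form. The `run_with_psm(workload, vdd, temp)` hook stands in for the real firmware/test-bench PSM measurement; `fake_psm` below is a toy linear model whose constants are chosen for illustration only.

```python
# Runnable sketch of Algorithm 1 (assumed interfaces, toy PSM model).

def get_reference_margin(workloads, run_with_psm, ref_vdd=0.700, ref_temp=0):
    """Worst-case PSM margin per training workload at the 0 C golden reference."""
    return [run_with_psm(w, ref_vdd, ref_temp) for w in workloads]

def explore_undervolt(workloads, run_with_psm, ref_margins, temp, candidates):
    """Pick the candidate VDD whose margins best match the 0 C reference."""
    best_vdd, min_err = None, float("inf")
    for vdd in candidates:
        margins = [run_with_psm(w, vdd, temp) for w in workloads]
        err = sum(abs(m - r) for m, r in zip(margins, ref_margins))
        if err < min_err:
            min_err, best_vdd = err, vdd
    return best_vdd

def fake_psm(workload, vdd, temp):
    base = {"fft": 3.0, "mmul": 3.2, "stream": 2.8}[workload]
    return base + 14.0 * vdd + 0.0068 * temp   # faster when hotter or at higher VDD

workloads = ["fft", "mmul", "stream"]
ref = get_reference_margin(workloads, fake_psm)
print(explore_undervolt(workloads, fake_psm, ref, temp=80,
                        candidates=[0.6625, 0.65625, 0.650]))
```

Because the toy model's workload offsets cancel in the per-workload differences, the selected voltage is the same regardless of which workloads are used, mirroring the workload-independence observation reported below.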
closely track the baseline for all workloads. Overall, Ti-states can achieve 6 to 12 percent power savings on our measured chip across different temperatures.
Long-Term Impact
As CMOS technology scales toward its end, it is important to extract as much efficiency-improvement opportunity as possible from the underlying transistors. Ti-states achieve this goal with active timing-margin management. Exploiting slack in timing margins to improve processor efficiency will become ubiquitous, just as P-states and C-states have helped reduce redundant power in the past. We believe the simplicity of Ti-states and the insights behind them render a wide range of applicability. Our work brings temperature inversion's value from the device level to architects and system managers, and opens doors for other ideas to improve processor efficiency.
Wide Range of Applicability
Ti-states are purely based on the transistor's temperature-inversion effect and are independent of other factors. Temperature inversion is an opportunity offered by technology scaling, which makes it a free lunch for computer architects. Therefore, Ti-states are applicable to chips made with today's advanced technologies, including bulk CMOS, FD-SOI, and FinFET (as we show in our original paper7). Many, if not all, processor architectures can benefit, whether they are CPUs, GPUs, FPGAs, or other accelerators.
Ti-states' design is succinct. The main components are on-chip timing-margin sensors, temperature sensors, and system firmware that stores the Ti-state tables. A Ti-state's runtime overhead is a table lookup and a voltage regulator module's set command, which are minimal. Because chip temperature changes over the course of several seconds, a Ti-state's feedback loop has no strict latency requirement, which makes it easy to design, implement, and test.
Implications at the Circuit, Architecture, and System Levels
Our study conducted on an AMD A10-8700P processor focuses on a single chip made in planar CMOS technology. Going beyond current technology and across system scales, Ti-states will have a bigger impact in the future.
Table 1. PSM error compared to the reference setting for different <temperature, voltage> configurations.*

Candidate voltage (mV)   20°C    40°C    60°C    80°C    100°C
693.75                   3.7%    —       —       —       —
687.50                   2.2%*   —       —       —       —
681.25                   8.4%    2.3%*   —       —       —
675.00                   13.9%   5.3%    4.9%    —       —
668.75                   —       9.5%    2.5%*   —       —
662.50                   —       13.5%   6.5%    1.9%*   —
656.25                   —       —       12.2%   5.6%    9.9%
650.00                   —       —       —       9.3%    5.1%*

*Marks the voltage with the smallest PSM difference (the selected Ti-state) in each column.
Figure 6. Temperature inversion's speedup effect offers extra timing margin at 80°C, as reflected by the elevated workload worst-case PSM. A Ti-state precisely reduces voltage (from 0.7 V to 0.6625 V at 80°C) to attain the same timing margin as under 0°C and nominal voltage, which achieves better efficiency and guarantees reliability.
Significance for FinFET and FD-SOI. FinFET and FD-SOI are projected to have stronger and more common temperature-inversion effects.10,11 In these technologies, Ti-states have broader applicability and more benefits. Furthermore, the low-leakage characteristics of these technologies promise other opportunities for a tradeoff between temperature and power.
In our original paper, we provide a detailed FinFET and FD-SOI projection analysis based on measurements taken at 28-nm bulk CMOS. We find that their roughly 10-times leakage-reduction capabilities let these technologies enjoy a higher operating temperature, because Ti-states reduce more VDD under higher temperatures. The optimal temperature for power is usually between 40°C and 60°C, depending on workloads and device type. Thus, Ti-states not only reduce chip power itself for FinFET and FD-SOI but also relieve the burden on the cooling system.
System-level thermal management. Datacenters and supercomputers strive to keep room temperature low at the cost of very high cooling power and expense. A tradeoff between cooling power and chip leakage power exists in this setting.12 Ti-states add a new perspective to this problem. First, we find that high temperature does not worsen timing margins, but actually preserves processor timing reliability because of temperature inversion. Second, Ti-states reduce power under higher temperatures, mitigating processor power cost. For FinFET and FD-SOI, the processor might prefer high temperatures around 60°C to save power, which further provides room for cooling-power reduction.
Figure 7 shows a control mechanism that we conceived to synergistically reduce chip and cooling power. The test-time procedure and loop 1 are what Ti-states achieve. In addition, loop 2 takes cooling-system power into consideration and jointly optimizes fan and chip power together. Overall, temperature inversion and Ti-states enable an optimization space involving cooling power, chip power, and chip reliability.
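Loop 2's tradeoff can be sketched with a toy power model. Every constant below is an assumption for illustration (the real optimum depends on workload and device type, as the article notes): Ti-states make dynamic power fall slightly with temperature by lowering VDD, leakage grows exponentially with temperature, and a colder setpoint costs more cooling power.

```python
import math

# Toy model of loop 2 (all constants assumed): sweep the temperature setpoint
# and pick the joint minimum of chip power plus cooling power.

def ti_vdd(temp_c):
    return 0.700 - 0.0005 * temp_c             # ~0.5 mV less VDD per deg C

def chip_power(temp_c, c_eff_nf=2.0, f_hz=300e6, i_leak0=0.5, t_coef=0.02):
    vdd = ti_vdd(temp_c)
    dynamic = c_eff_nf * 1e-9 * vdd ** 2 * f_hz        # C * V^2 * f
    leakage = vdd * i_leak0 * math.exp(t_coef * temp_c)
    return dynamic + leakage

def cooling_power(temp_c, uncooled_c=90.0, k=0.02):
    return k * max(0.0, uncooled_c - temp_c)   # colder setpoint costs more

best_t = min(range(30, 91), key=lambda t: chip_power(t) + cooling_power(t))
print(f"optimal die temperature under this model: ~{best_t} C")
```

With these particular constants the minimum lands in the 40°C to 70°C band, consistent with the qualitative claim that an interior optimum exists rather than "as cold as possible."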
Opportunity for near-threshold computing. Our measurement on a real chip shows that temperature inversion is stronger at lower voltages, reaching up to 10 percent VDD-reduction potential for a Ti-state at 0.6 V on our 28-nm chip. In near-threshold conditions as low as 0.4 V, temperature inversion will have a much stronger effect and will offer much larger benefits. In addition to power reduction, a Ti-state can be employed to boost the performance of near-threshold chips by overclocking directly to exploit the extra margin. Extrapolation similar to Figure 4 shows the overclocking potential is between 20 and 50 percent with the help of techniques that mitigate di/dt effects.5
Temperature inversion offers a new avenue for improving processor efficiency.
Figure 7. Ti-state temperature and voltage control: two loops work in synergy to minimize power. Loop 1 is a fast control loop that uses a Ti-state table to keep adjusting voltage in response to silicon temperature variation. Loop 2 is a slow control loop that sets the optimal temperature based on the workload's steady-state dynamic power profile. At test time, per-part PSM characterization at different (V, T) points and undervolt validation at different temperatures produce a (V, F, T) table that is fused into firmware or the OS.
On the basis of detailed measurements, our article presents a comprehensive analysis of how temperature inversion can alter the way we do power management today. Through the introduction of Ti-states, we show that active timing-margin management can be successfully applied to exploit temperature inversion. Applying such optimizations will likely become even more important as technology scaling continues. We envision future work that draws on Ti-states to enhance computing systems across the stack and at a larger scale. MICRO
References
1. C. Park et al., "Reversal of Temperature Dependence of Integrated Circuits Operating at Very Low Voltages," Proc. Int'l Electron Devices Meeting (IEDM), 1995, pp. 71–74.
2. D. Wolpert and P. Ampadu, "Temperature Effects in Semiconductors," Managing Temperature Effects in Nanoscale Adaptive Systems, Springer, 2012, pp. 15–33.
3. W. Zhao and Y. Cao, "New Generation of Predictive Technology Model for Sub-45 nm Early Design Exploration," IEEE Trans. Electron Devices, vol. 53, no. 11, 2006, pp. 2816–2823.
4. B. Munger et al., "Carrizo: A High Performance, Energy Efficient 28 nm APU," J. Solid-State Circuits (JSSC), vol. 51, no. 1, 2016, pp. 1–12.
5. A. Grenat et al., "Adaptive Clocking System for Improved Power Efficiency in a 28nm x86-64 Microprocessor," Proc. IEEE Int'l Solid-State Circuits Conf. (ISSCC), 2014, pp. 106–107.
6. C.R. Lefurgy et al., "Active Management of Timing Guardband to Save Energy in POWER7," Proc. 44th IEEE/ACM Int'l Symp. Microarchitecture (MICRO), 2011, pp. 1–11.
7. Y. Zu, "Ti-states: Processor Power Management in the Temperature Inversion Region," Proc. 49th Ann. IEEE/ACM Int'l Symp. Microarchitecture (MICRO), 2016; doi:10.1109/MICRO.2016.7783758.
8. Guaranteeing Silicon Performance with FPGA Timing Models, white paper WP-01139-1.0, Altera, Aug. 2010.
9. S. Sundaram et al., "Adaptive Voltage Frequency Scaling Using Critical Path Accumulator Implemented in 28nm CPU," Proc. 29th Int'l Conf. VLSI Design and 15th Int'l Conf. Embedded Systems (VLSID), 2016, pp. 565–566.
10. W. Lee et al., "Dynamic Thermal Management for FinFET-Based Circuits Exploiting the Temperature Effect Inversion Phenomenon," Proc. Int'l Symp. Low Power Electronics and Design (ISLPED), 2014, pp. 105–110.
11. E. Cai and D. Marculescu, "TEI-Turbo: Temperature Effect Inversion-Aware Turbo Boost for FinFET-Based Multi-core Systems," Proc. IEEE/ACM Int'l Conf. Computer-Aided Design (ICCAD), 2015, pp. 500–507.
12. W. Huang et al., "TAPO: Thermal-Aware Power Optimization Techniques for Servers and Data Centers," Proc. Int'l Green Computing Conf. and Workshops (IGCC), 2011; doi:10.1109/IGCC.2011.6008610.
Yazhou Zu is a PhD student in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research interests include resilient and energy-efficient processor design and management. Zu received a BS in microelectronics from Shanghai Jiao Tong University in China. Contact him at [email protected].

Wei Huang is a staff researcher at Advanced Micro Devices Research, where he works on energy-efficient processors and systems. Huang received a PhD in electrical and computer engineering from the University of Virginia. He is a member of IEEE. Contact him at [email protected].

Indrani Paul is a principal member of the technical staff at Advanced Micro Devices, where she leads the Advanced Power Management group, which focuses on innovating future power- and thermal-management approaches, system-level power modeling, and APIs. Paul received a PhD in electrical and computer engineering from the Georgia Institute of Technology. Contact her at [email protected].
MAY/JUNE 2017 113
Vijay Janapa Reddi is an assistant professor in the Department of Electrical and Computer Engineering at the University of Texas at Austin. His research interests span computer architecture, including software design and optimization, to enhance mobile quality of experience and improve the energy efficiency of high-performance computing systems. Janapa Reddi received a PhD in computer science from Harvard University. Contact him at [email protected].
TOP PICKS
114 IEEE MICRO
AN ENERGY-AWARE DEBUGGER FOR INTERMITTENTLY POWERED SYSTEMS
DEVELOPMENT AND DEBUGGING SUPPORT IS A PREREQUISITE FOR THE ADOPTION OF INTERMITTENTLY OPERATING ENERGY-HARVESTING COMPUTERS. THIS WORK IDENTIFIES AND CHARACTERIZES INTERMITTENCE-SPECIFIC DEBUGGING CHALLENGES THAT ARE UNADDRESSED BY EXISTING DEBUGGING SOLUTIONS. IT ADDRESSES THESE CHALLENGES WITH THE ENERGY-INTERFERENCE-FREE DEBUGGER (EDB), THE FIRST DEBUGGING SOLUTION FOR INTERMITTENT SYSTEMS. THIS ARTICLE DESCRIBES EDB'S CODESIGNED HARDWARE AND SOFTWARE IMPLEMENTATION AND SHOWS ITS VALUE IN SEVERAL DEBUGGING TASKS ON A REAL RF-POWERED ENERGY-HARVESTING DEVICE.
Energy-harvesting devices are embedded computing systems that eschew tethered power and batteries by harvesting energy from radio waves,1,2 motion,3 temperature gradients, or light in the environment. Small form factors, resilience to harsh environments, and low-maintenance operation make energy-harvesting computers well-suited for next-generation medical, industrial, and scientific sensing and computing applications.4
The power system of an energy-harvesting computer collects energy into a storage element (that is, a capacitor) until the buffered energy is sufficient to power the device. Once powered, the device can operate until energy is depleted and power fails. After the failure, the cycle of charging begins again. These charge–discharge cycles power the system intermittently, and consequently, software that runs on an energy-harvesting device executes intermittently.5 In the intermittent execution model, programs are frequently, repeatedly interrupted by power failures, in contrast to the traditional continuously powered execution model, in which programs are assumed to run to completion. Every reboot induced by a power failure clears volatile state (such as registers and memory), retains nonvolatile state (such as ferroelectric RAM), and transfers control to some earlier point in the program.
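The execution model just described can be sketched in plain C, simulating a power failure with setjmp/longjmp: volatile state is lost on each "reboot," nonvolatile state survives, and control returns to the checkpoint. All names here are illustrative stand-ins, not part of any real intermittent runtime.

```c
#include <assert.h>
#include <setjmp.h>
#include <string.h>

static int nv_progress;     /* "nonvolatile": survives simulated reboots */
static jmp_buf checkpoint;  /* stands in for a saved execution context */

static void do_work(int *buf) {
    buf[nv_progress % 16] = nv_progress;  /* one unit of work in volatile memory */
    nv_progress++;
}

/* Run total_work units with only budget_per_cycle units of "energy" per
 * charge cycle; returns how many power failures (reboots) occurred. */
int run_intermittently(int total_work, int budget_per_cycle) {
    int buf[16];
    volatile int reboots = 0;
    if (setjmp(checkpoint) != 0) {
        /* Reboot: volatile state is cleared, nonvolatile state survives,
         * and control transfers back to this earlier point (the checkpoint). */
        memset(buf, 0, sizeof buf);
        reboots++;
    }
    int energy = budget_per_cycle;  /* freshly charged capacitor */
    while (nv_progress < total_work) {
        if (energy-- == 0)
            longjmp(checkpoint, 1); /* power fails mid-execution */
        do_work(buf);
    }
    return reboots;
}
```

With a budget of three work units per charge cycle, ten units of work complete only after three power failures, each of which discards the volatile buffer and resumes from the checkpoint.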
Intermittence makes software difficult to write and understand. Unlike traditional systems, the power supply of an energy-harvesting computer changes high-level software behavior, such as control flow and memory consistency.5,6 Reboots complicate a program's possible behavior, because they are implicit discontinuities in the program's control flow that are not expressed anywhere in the code. A reboot can happen at any point in a program and cause control to flow unintuitively back to a previous point in the execution.
Alexei Colin
Carnegie Mellon University
Graham Harvey
Alanson P. Sample
Disney Research Pittsburgh
Brandon Lucia
Carnegie Mellon University
116 Published by the IEEE Computer Society 0272-1732/17/$33.00 © 2017 IEEE
The previous point could be the beginning of the program, a previous checkpoint,5,7 or a task boundary.6 Today, devices that execute intermittently are a mixture of conventional, volatile microcontroller architectures and nonvolatile structures. In the future, alternative architectures based on nonvolatile structures may simplify some aspects of the execution model, albeit with lower performance and energy efficiency.8
Intermittence can cause correct software to misbehave. Intermittence-induced jumps back to a prior point in an execution inhibit forward progress and could repeatedly execute code that should not be repeated. Intermittence can also leave memory in an inconsistent state that is impossible in a continuously powered execution.5 These intermittence-related failure modes are avoidable with carefully written code or specialized system support.5–7,9,10 Unaddressed, these failure modes represent a new class of intermittence bugs that manifest only when executing on an intermittent power source.
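As a concrete illustration of the repeated-execution failure mode (a hypothetical example, not one taken from the article), the sketch below updates a nonvolatile total non-idempotently: if power fails between the two nonvolatile writes, the resumed execution repeats the addition and double-counts the sample.

```c
#include <assert.h>
#include <stdbool.h>

static int  nv_total;  /* nonvolatile accumulator: survives reboots */
static bool nv_done;   /* nonvolatile completion flag */

/* One checkpointed task. fail_midway simulates a power failure between the
 * two nonvolatile writes; returns false if power failed before completion. */
bool add_sample(int sample, bool fail_midway) {
    if (!nv_done) {
        nv_total += sample;  /* first NV write: not idempotent */
        if (fail_midway)
            return false;    /* reboot; re-execution restarts the task */
        nv_done = true;      /* second NV write marks completion */
    }
    return true;
}
```

A continuously powered run leaves nv_total == 5, but a power failure between the two writes followed by re-execution leaves nv_total == 10, a bug that no continuously powered test can reproduce.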
To debug an intermittently operating program, a programmer needs the ability to monitor system behavior, observe failures, and examine internal program state. With the goal of supporting this methodology, prior work on debugging for continuously powered devices has recognized the need to minimize resources required for tracing11 and reduce perturbation to the program under test.12 A key difference on energy-harvesting platforms is that interference with a device's energy level could perturb its intermittent execution behavior. Unfortunately, existing tools, such as Joint Test Action Group (JTAG) debuggers, require a device to be powered, which hides intermittence bugs. Programmers face an unsatisfying dilemma: use a debugger to monitor the system and never observe a failure, or run without a debugger and observe the failure, but without the means to probe the system to understand the bug.
This article identifies the key debugging functionality necessary to debug intermittent programs on energy-harvesting platforms and presents the Energy-Interference-Free Debugger (EDB), a hardware–software platform that provides that functionality (see Figure 1). First, we observe that debuggers designed for continuously powered devices are not effective for energy-harvesting devices, because they interfere with the target's power supply. Our first contribution is a hardware device that connects to a target energy-harvesting device with the ability to monitor and manipulate its energy level, but without permitting any significant current to flow between the debugger and the target.
Second, we observe that basic debugging techniques, such as assertions, printf tracing, and LED toggling, are not usable on intermittently powered devices without system support. Our second contribution is the
Figure 1. The Energy-Interference-Free Debugger (EDB) is an energy-interference-free system for monitoring and debugging energy-harvesting devices. (a) Photo. (b) Architecture diagram. (c) The charge–discharge cycle makes computation intermittent.
EDB software system, which was codesigned with EDB's hardware to make debugging primitives that are useful for intermittently powered devices, including energy breakpoints and keep-alive assertions. EDB addresses debugging needs unique to energy-harvesting devices, with novel primitives for selectively powering spans of code and for tracing the device's energy level, code events, and fully decoded I/O events. The whole of EDB's capabilities is greater than the sum of the capabilities of existing tools, such as a JTAG debugger and an oscilloscope. Moreover, EDB is simpler to use and far less expensive. We apply EDB's capabilities to diagnose problems on real energy-harvesting hardware in a series of case studies in our evaluation.
Intermittence Bugs and Energy Interference

An intermittent power source complicates understanding and debugging of a system, because the behavior of software on an intermittent system is closely linked to its power supply. Figure 2 illustrates the undesirable consequences of disregarding this link between the software and the power system. The code has an intermittence bug that leads to memory corruption only when the device runs on harvested energy.
Debugging intermittence bugs using existing tools is virtually impossible due to energy interference from these tools. JTAG debuggers supply power to the device under test (DUT), which precludes observation of a realistically intermittent execution, such as the execution on the left in Figure 2. Even JTAG with a power-rail isolator completely masks intermittent behavior, because the protocol requires the target to be powered throughout the debugging session. An oscilloscope can directly observe and trace a DUT's power system and lines, but a scope cannot
__NV list_t list
main() {
  init_list(list)
  while (true) {
    __NV elem e
    select(e)
    remove(list, e)
    update(e)
    append(list, e)
  }
}

append(list, e) {
  e->next = NULL
  e->prev = list->tail
  list->tail->next = e
  list->tail = e
}

remove(list, e) {
  e->prev->next = e->next
  if (e == list->tail) {
    tail = e->prev
  } else {
    e->next->prev = e->prev
  }
}

The example program above illustrates how intermittence perturbs a program's execution. The code manipulates a linked list in nonvolatile memory using append and remove functions. The continuous execution completes the code sequentially. The intermittent execution, however, is not sequential. Instead, the code captures a checkpoint at the top of the while loop, then proceeds until power fails at the indicated point. After the reboot, execution resumes from the checkpoint. In some cases, an execution resumed from the checkpoint mimics a continuously powered execution. However, intermittence can also cause misbehavior stemming from an intermittence bug in the code.

The illustrated intermittent execution of the example code exhibits incorrect behavior that is impossible in a continuous execution. The execution violates the precondition assumed by remove that only the tail's next should be NULL. The reboot interrupts append before it can make node e the list's new tail, but after its next pointer is set to NULL. When execution resumes at the checkpoint, it attempts to remove node e again. The conditional in remove confirms that e is not the tail, then dereferences its next pointer (which is NULL). The NULL next pointer makes e->next->prev a wild pointer that, when written, leads to undefined behavior.

Figure 2. An intermittence bug. The software executes correctly with continuous power, but incorrectly in an intermittent execution.
observe internal software state, which limits its value for debugging.
Debugging code added to trace and react to certain program events—such as toggling LEDs, streaming events to a universal asynchronous receiver/transmitter (UART), or in-memory logging—has a high energy cost, and such instrumentation can change program behavior. For example, activating an LED to indicate when the Wireless Identification and Sensing Platform (WISP)2 is actively executing increases its total current draw by five times, from around 1 mA to more than 5 mA. Furthermore, in-code instrumentation is limited by the scarcity of resources, such as nonvolatile memory to store the log and an analog-to-digital converter (ADC) channel for measurements of the device's energy level.
Energy interference and the lack of visibility into intermittent executions make prior approaches to debugging inadequate for intermittently powered devices.
Energy-Interference-Free Debugging

EDB is an energy-interference-free debugging platform for intermittent devices that addresses the shortcomings of existing debugging approaches. In this section, we describe EDB's functionality and its implementation in codesigned hardware and software.
Figure 3 provides an overview of EDB. The capabilities of EDB's hardware and software (Figure 3a) support EDB's debugging primitives (Figure 3b). The hardware electrically isolates the debugger from the target. EDB has two modes of operation: passive mode and active mode. In passive mode, the target's energy level, program events, and I/O can be monitored unobtrusively. In active mode, the target's energy level and internal program state (such as memory) can be manipulated. We combine passive- and active-mode operation to implement energy-interference-free debugging primitives, including energy and event tracing, intermittence-aware breakpoints, energy guards for instrumentation, and interactive debugging.
Passive-Mode Operation

EDB's passive mode of operation lets developers stream debugging information to a host workstation continuously in real time, relying on the three rightmost components in Figure 3a. Important debugging streams that
libEDB API:
  assert(expr)
  break|watch_point(id)
  energy_guard(begin|end)
  printf(fmt, ...)

Debug console commands:
  charge|discharge energy_level
  break|watch en|dis id [energy_level]
  trace {energy, iobus, rfid, watch_points}
  read|write address [value]

Figure 3. EDB's features support debugging tasks and developer interfaces. (a) Hardware and software capabilities. (b) Debugging primitives. (c) API and debug console commands.
are available through EDB are the device's energy level, I/O events on wired buses, decoded messages sent via RFID, and program events marked in application code. A major advantage of EDB is its ability to gather many debugging streams concurrently, allowing the developer to correlate streams (for example, relating changes in I/O or program behavior with energy changes). Correlating debugging streams is essential, but doing so is difficult or impossible using existing techniques. Another key advantage of EDB is that data is collected externally without burdening the target or perturbing its intermittent behavior.
Monitoring signals in the target's circuit requires electrical connections between the debugger and the target, and EDB ensures that these connections do not allow significant current exchange, which could interfere with the target's behavior. To measure the target's energy level, EDB samples the analog voltage from the target's capacitor through an operational amplifier buffer. To monitor digital communication and program events without energy interference, EDB connects to wired buses—including Inter-Integrated Circuit (I2C), Serial Peripheral Interface (SPI), RF front-end general-purpose I/Os (GPIOs), and watch point GPIOs—through a digital level-shifter. As an external monitor, EDB can collect and decode raw I/O events, even if the target violates the I/O protocol due to an intermittence bug.
Active-Mode Operation

EDB's active mode frees debugging tasks from the constraint of the target device's small energy store by compensating for energy consumed during debugging. In active mode, the programmer can perform debugging tasks that require more energy than a target could ever harvest and store—for example, complex invariant checks or user interactions. EDB has an energy compensation mechanism that measures and records the energy level on the target device before entering active mode. While the debugging task executes, EDB supplies power to the target. After performing the task, EDB restores the energy level to the level recorded earlier. Energy compensation permits costly, arbitrary instrumentation, while ensuring that the target has the behavior of an unaltered, intermittent execution.
EDB's energy compensation mechanism is implemented using two GPIO pins connected to the target capacitor, an ADC, and a software control loop. To prevent energy interference by these components during passive mode, the circuit includes a low-leakage keeper diode and sets the GPIO pins to high-impedance mode. To charge the target to a desired voltage level, EDB keeps the source pin high until EDB's ADC indicates that the target's capacitor voltage is at the desired level. To discharge, the drain pin is kept low to open a path to ground through a resistor, until the target's capacitor voltage reaches the desired level. Several of the debugging primitives presented in the next section are built using this energy-compensation mechanism.
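A minimal sketch of that control loop, with the ADC and GPIO hardware replaced by a simulated capacitor. All function names here are hypothetical stand-ins, not EDB's actual firmware interface.

```c
#include <assert.h>

/* Simulated target capacitor, standing in for EDB's ADC and GPIO pins. */
static double sim_vcap = 1.8;  /* volts */

static double adc_read_target_voltage(void) { return sim_vcap; }
static void   gpio_source_high(void) { sim_vcap += 0.01; }  /* one charge step */
static void   gpio_drain_low(void)   { sim_vcap -= 0.01; }  /* one discharge step */

/* Drive the target's capacitor to target_v, as EDB's software control loop
 * does: hold the source pin high until the ADC reads the desired level, or
 * hold the drain pin low until the voltage falls to it. */
void restore_energy_level(double target_v) {
    while (adc_read_target_voltage() < target_v - 0.005)
        gpio_source_high();
    while (adc_read_target_voltage() > target_v + 0.005)
        gpio_drain_low();
}
```

The finite step size and tolerance mirror the limited precision of a software loop polling an ADC, which is the source of the small restore error reported in the evaluation.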
EDB Primitives

Using the capabilities described so far, EDB creates a toolbox of energy-interference-free debugging primitives. EDB brings to intermittent platforms familiar debugging techniques that are currently confined to continuously powered platforms, such as assertions and printf tracing. New intermittence-aware primitives, such as energy guards, energy breakpoints, and watch points, are introduced to handle debugging tasks that arise only on intermittently powered platforms. Each primitive is accessible to the end user through two complementary interfaces: the API linked into the application and the console commands on a workstation computer (see Figure 3c).
Code and energy breakpoints. EDB implements three types of breakpoints. A code breakpoint is a conventional breakpoint that triggers at a certain code point. An energy breakpoint triggers when the target's energy level is at or below a specified threshold. A combined breakpoint triggers when a certain code point executes and the target device's energy level is at or below a specified threshold. Breakpoints conditioned on energy level can initiate an interactive debugging session precisely when code is likely to misbehave due to energy conditions—for example, just as the device is about to brown out.
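The trigger conditions for the three breakpoint types can be captured in a few lines. This is only a behavioral sketch of the semantics just described, not EDB's implementation.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { BP_CODE, BP_ENERGY, BP_COMBINED } bp_type;

typedef struct {
    bp_type type;
    int     code_id;      /* marker id, for code and combined breakpoints */
    double  threshold_v;  /* voltage threshold, for energy and combined */
} breakpoint;

/* Does this breakpoint fire, given the code marker just executed
 * (-1 if none) and the target's current capacitor voltage? */
bool bp_triggers(const breakpoint *bp, int marker_id, double vcap) {
    bool at_code    = (marker_id == bp->code_id);
    bool low_energy = (vcap <= bp->threshold_v);
    switch (bp->type) {
    case BP_CODE:     return at_code;
    case BP_ENERGY:   return low_energy;
    case BP_COMBINED: return at_code && low_energy;
    }
    return false;
}
```

A combined breakpoint is the conjunction of the other two, which is what lets a session open exactly when a particular code point runs under low-energy conditions.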
Keep-alive assertions. EDB provides support for assertions on intermittent platforms. When an assertion fails, EDB immediately
tethers the target to a continuous power supply to prevent it from losing state by exhausting its energy supply. This keep-alive feature turns what would have to be a post-mortem reconstruction of events into an investigation on a live device. The ensuing interactive debugging session for a failed assert includes the entire live target address space and I/O buses to peripherals. In contrast to EDB's keep-alive assertions, traditional assertions are ineffective in intermittent executions. After a traditional assertion fails, the device would pause briefly, until energy was exhausted, then restart, losing the valuable debugging information in the live device's state.
Energy guards. Using its energy compensation mechanism, EDB can hide the energy cost of arbitrary code enclosed within an energy guard. Code within an energy guard executes on tethered power. Code before and after an energy-guarded region executes as though no energy was consumed by the energy-guarded region. Without energy cost, instrumentation code becomes nondisruptive and therefore useful on intermittent platforms. Two especially valuable forms of instrumentation that are impossible without EDB are complex data structure invariant checks and event tracing. EDB's energy guards allow code to check data invariants or report application events via I/O (such as printf), the high energy cost of which would normally deplete the target's energy supply and prevent forward progress.
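The compensation semantics of an energy guard can be modeled in a few lines, with the debugger's tethered supply stubbed out. All names here are hypothetical; the actual libEDB calls are the energy_guard(begin|end) API shown in Figure 3c.

```c
#include <assert.h>
#include <stdbool.h>

static double vcap = 2.2;        /* simulated target capacitor voltage */
static double saved_vcap;
static bool   on_tether = false; /* true while the debugger powers the target */

/* Drain v volts from the capacitor, unless the debugger is supplying power. */
static void spend(double v) { if (!on_tether) vcap -= v; }

/* Record the energy level and switch to tethered power... */
void energy_guard_begin(void) { saved_vcap = vcap; on_tether = true; }
/* ...then restore the recorded level and hand back the harvester. */
void energy_guard_end(void)   { vcap = saved_vcap; on_tether = false; }

/* An invariant check far too expensive for the target's tiny energy store. */
void check_invariants(void) { spend(5.0); }
```

Run bare, check_invariants() would drive the capacitor far below brownout; wrapped in the guard, the surrounding code sees exactly the energy level it had before the check ran.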
Interactive debugging. An interactive debugging session with EDB can be initiated by a breakpoint, an assertion, or a user interrupt, and allows observation and manipulation of the target's memory state and energy level. Using charge–discharge commands, the developer can intermittently execute any part of a program starting from any energy level, assessing the behavior of each charge–discharge cycle. During passive-mode debugging, the EDB console delivers traces of energy state, watch points, I/O events, and printf output.
Evaluation

We built a prototype of EDB, including the circuit board in Figure 1 and software that implements EDB's functionality. A release of our prototype is available (http://intermittent.systems). The purpose of our evaluation is twofold. First, we characterize potential sources of energy interference and show that EDB is free of energy interference. Second, we use a series of case studies conducted on a real energy-harvesting system to show that EDB supports monitoring and debugging tasks that are difficult or impossible without EDB.
Our target device is a WISP2 powered by radio waves from an RFID reader. The WISP has a 47 μF energy-storage capacitor and an active current of approximately 0.5 mA at 4 MHz. We evaluated EDB using several test applications, including the official WISP 5 RFID tag firmware and a machine-learning-based activity-recognition application used in prior work.5,6
Energy Interference

EDB's edge over existing debugging tools is its ability to remain isolated from an intermittently operating target in passive mode and its ability to create an illusion of an untouched target energy reservoir in active mode. Our first experiment concretely demonstrates the energy interference of a traditional debugging technique when applied to an intermittently operating system. The measurements in Table 1 demonstrate the impact on program behavior of execution tracing using printf over UART without EDB. Without EDB, the energy cost of the print statement significantly changes the iteration success rate—that is, the fraction of iterations that complete without a power failure. Next, we show with data that EDB is effectively free of energy interference in both passive- and active-mode operation.
In passive mode, current flow between EDB and the target through the connections in Figure 1 can inadvertently charge or discharge the target's capacitor. We measured the maximum possible current flow over each connection by driving it with a source meter and found that the aggregate current cannot exceed 0.85 μA in the worst case, representing just 0.2 percent of the target microcontroller's typical active-mode current.
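That 0.2 percent figure follows directly from the measured worst-case leakage and the WISP's roughly 0.5 mA active current; a quick check of the arithmetic:

```c
#include <assert.h>

/* Worst-case aggregate leakage across EDB's passive-mode connections,
 * expressed as a fraction of the target's active-mode supply current. */
double leakage_fraction(double leak_amps, double active_amps) {
    return leak_amps / active_amps;
}
```

0.85 μA against 0.5 mA is a ratio of 0.0017, which rounds to the quoted 0.2 percent.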
In active mode, energy compensation requires EDB to save and restore the voltage of the target's storage capacitor, and any discrepancy between the saved and restored voltage
represents energy interference. Using an oscilloscope, we measured the discrepancy between the target capacitor voltage saved and restored by EDB. Over 50 trials, the average voltage discrepancy was just 4 percent of the target's energy-storage capacity, with most error stemming from our limited-precision software control loop.
Debugging Capabilities

We now illustrate the new capabilities that EDB brings to the development of intermittent software by applying EDB in case studies to debugging tasks that are difficult to resolve without EDB.
Detecting memory corruption early. We evaluated how well EDB's keep-alive assertions help diagnose memory corruption that is not reproducible in a conventional debugger.
• Application. The code in Figure 4a maintains a doubly linked list in nonvolatile memory. On each iteration of the main loop, a node is appended to the list if the list is empty; otherwise, a node is removed from the list. The node is initialized with a pointer to a buffer in volatile memory that is later overwritten.
• Symptoms. After running on harvested energy for some time, the GPIO pin indicating main loop progress stops toggling. After the main loop stops, normal behavior never resumes, even after a reboot; thus, the device must be decommissioned, reprogrammed, and redeployed.
• Diagnosis. To debug the list, we assert that the list's tail pointer must point to the list's last element, as shown in Figure 4a. A conventional assertion is unhelpful: after the assertion fails, the target drains its energy supply and the program restarts, losing the context of the failure. In contrast, EDB's intermittence-aware, keep-alive assert halts the program immediately when the list invariant is violated, powers the target, and opens an interactive debugging session.
Interactive inspection of target memory using EDB's commands reveals that the tail pointer points to the penultimate element, not the actual tail. The inconsistency arose when a power failure interrupted append. In the absence of the keep-alive assert, the program would proceed to read this inconsistent state, dereference a null pointer, and write to a wild pointer.
Instrumenting code with consistency checks. On intermittently powered platforms, the energy overhead of instrumentation code can render an application nonfunctional by preventing it from making any forward progress. In this case study, we demonstrate how an application can be instrumented with an invariant check of arbitrary energy cost using EDB's energy guards.
• Application. The code in Figure 4b generates the Fibonacci sequence numbers and appends each to a nonvolatile, doubly linked list. Each iteration of the main loop toggles a GPIO pin to track progress. The program begins with a consistency check that traverses the list and asserts that the pointers and the Fibonacci value in each node are consistent.
Table 1. Cost of debug output and its impact on the activity-recognition application's behavior.

Instrumentation   Iteration success   Iteration cost
method            rate (%)            Energy (%*)   Time (ms)
No print          87                  3.0           1.1
UART printf       74                  5.3           2.1
EDB printf        82                  3.4           4.7

*Energy cost as percentage of the 47 μF storage capacity.
• Symptoms. Without the invariant check, the application silently produces an inconsistent list. With the invariant check, the main loop stops executing after the list grows large. The oscilloscope trace in Figure 4c shows an early charge cycle when the main loop executes (left) and a later one when it does not (right).
• Diagnosis. The main loop stops executing because once the list is too long, the consistency check consumes all of the target's available energy. Once reached, this hung state persists indefinitely. An EDB energy guard allows the inclusion of the consistency check without breaking the application's functionality (see Figure 4b). The effect of the energy guard on target energy state is shown in Figure 4d. The energy guard provides tethered power for the consistency check, and the main loop gets the same amount of energy in early charge–discharge cycles when the list is short (left) and in later ones when the list is long (right). On intermittent power, we observed invariant violations in several experimental trials.
Instrumentation and consistency checking are an essential part of building a reliable application. These techniques are inaccessible to today's intermittent systems because the cost of runtime checking and analysis is arbitrary and often high. EDB brings instrumentation and consistency checking to intermittent devices.
Tracing program events and RFID messages. Extracting intermediate results and events from the executing program using JTAG or UART is valuable, but it often interferes with a target's energy level and changes application behavior. Moreover, communication stacks on energy-harvesting devices are difficult to debug without simultaneous visibility into the device's sent and received packet stream and energy state.
In Table 1, we traced the activity-recognition application using EDB's energy-interference-free printf and watch points. In this section, we trace messages in an RFID communication stack using EDB's I/O tracer. We
Application code (Figure 4a):

 1: init_list(list)
 2: while (1)
 3:   node = list->head
 4:   while (node->next != NULL)
 5:     node = node->next
 6:   assert(list->tail == node)
 7:   if (node == list->head)
 8:     init_node(new_node)
 9:     append(list, new_node)
10:   else
11:     remove(list, node, &bufptr)
12:     memset(bufptr, 0x42, BUFSZ)

Debug console (Figure 4a):

> run
Interrupted: ASSERT line: 8, Vcap = 1.9749
*> print node
0xAA10: 0x00BB
*> print list->tail
0xAA20: 0x00AA
*> print list->tail.next
0xAA30: 0x00BB

Instrumented code (Figure 4b):

1: main()
2:   energy_guard_begin()
3:   for (node in list)
4:     assert(node->prev->next == node == node->next->prev)
5:     assert(node->prev->fib + node->fib == node->next->fib)
6:   assert(list->tail == node)
7:   energy_guard_end()
8:   while (1)
9:     append_fibonacci_node(list)

Figure 4. Debugging intermittence bugs with EDB. (a) An application with a memory-corrupting intermittence bug, diagnosed using EDB's intermittence-aware assert (left) and interactive console (right). (b) An application instrumented with a consistency check of arbitrary energy cost using EDB's energy guards. Oscilloscope trace of execution (c) without the energy guard and (d) with the energy guard. Without the energy guard, the check and main loop both execute at first, but only the check executes in later discharge cycles. With an energy guard, the check executes on tethered power from instant 1 to 2 and 3 to 4, and the main loop always executes.
used EDB to collect RFID message identifiers from the WISP RFID tag firmware, along with target energy readings. From the collected trace, we found that in our lab setup the application responded 86 percent of the time, for an average of 13 replies per second. To produce such a mixed trace of I/O and energy using existing equipment, the target would have to be burdened with logging duties that exceed its computational resources, given the already high cost of message decoding and response.
Energy-harvesting technology extends the reach of embedded devices beyond traditional sensor network nodes by eliminating the constraints imposed by batteries and wires. However, developing software for energy-harvesting devices is more difficult than traditional embedded development, because of surprising behavior that arises when software executes intermittently. Debugging intermittently executing software is particularly challenging because of a new class of intermittence bugs that are immune to existing debugging approaches. Without effective debugging tools, energy-harvesting devices are accessible only to a small community of systems experts instead of a wide community of application-domain experts.
We identified energy interference as the fundamental shortcoming of available debugging tools. We designed EDB, the first energy-interference-free debugging system, which supports debugging primitives for energy-harvesting devices such as energy guards, keep-alive assertions, energy watchpoints, and energy breakpoints. Students in our lab and at a growing list of other academic institutions have successfully used EDB to debug and profile applications in scenarios similar to the case studies we evaluated.
EDB's low-cost, compact hardware design makes it suitable for incorporation into next-generation debugging tools and for field deployment with a target device. In the field, a future automatic diagnostic system could leverage EDB to catch rare bugs and automatically log memory states from the target device. In the lab, EDB can serve research projects that require data on energy consumption and program execution on an energy-harvesting platform, such as an intermittence-aware compiler analysis.
We created EDB because we found energy-harvesting devices to be among the least accessible platforms for research, requiring each researcher to reinvent ad hoc techniques for troubleshooting each device. EDB makes intermittently powered platforms accessible to a wider research audience and helps establish a new research area surrounding intermittent computing. MICRO
References

1. S. Gollakota et al., "The Emergence of RF-Powered Computing," Computer, vol. 47, no. 1, 2014, pp. 32–39.

2. A.P. Sample et al., "Design of an RFID-Based Battery-Free Programmable Sensing Platform," IEEE Trans. Instrumentation and Measurement, vol. 57, no. 11, 2008, pp. 2608–2615.

3. P. Mitcheson et al., "Energy Harvesting From Human and Machine Motion for Wireless Electronic Devices," Proc. IEEE, vol. 96, no. 9, 2008, pp. 1457–1486.

4. J.A. Paradiso and T. Starner, "Energy Scavenging for Mobile and Wireless Electronics," IEEE Pervasive Computing, vol. 4, no. 1, 2005, pp. 18–27.

5. B. Lucia and B. Ransford, "A Simpler, Safer Programming and Execution Model for Intermittent Systems," Proc. 36th ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI), 2015, pp. 575–585.

6. A. Colin and B. Lucia, "Chain: Tasks and Channels for Reliable Intermittent Programs," Proc. ACM SIGPLAN Int'l Conf. Object-Oriented Programming, Systems, Languages, and Applications, 2016, pp. 514–530.

7. B. Ransford, J. Sorber, and K. Fu, "Mementos: System Support for Long-Running Computation on RFID-Scale Devices," Proc. 16th Int'l Conf. Architectural Support for Programming Languages and Operating Systems, 2011, pp. 159–170.

8. K. Ma et al., "Architecture Exploration for Ambient Energy Harvesting Nonvolatile Processors," Proc. IEEE 21st Int'l Symp. High Performance Computer Architecture (HPCA), 2015, pp. 526–537.
9. D. Balsamo et al., "Hibernus: Sustaining Computation During Intermittent Supply for Energy-Harvesting Systems," IEEE Embedded Systems Letters, vol. 7, no. 1, 2015, pp. 15–18.

10. M. Buettner, B. Greenstein, and D. Wetherall, "Dewdrop: An Energy-Aware Task Scheduler for Computational RFID," Proc. 8th USENIX Conf. Networked Systems Design and Implementation (NSDI), 2011, pp. 197–210.

11. V. Sundaram et al., "Diagnostic Tracing for Wireless Sensor Networks," ACM Trans. Sensor Networks, vol. 9, no. 4, 2013, pp. 38:1–38:41.

12. J. Yang et al., "Clairvoyant: A Comprehensive Source-Level Debugger for Wireless Sensor Networks," Proc. 5th Int'l Conf. Embedded Networked Sensor Systems (SenSys 07), 2007, pp. 189–203.
Alexei Colin is a graduate student in the Department of Electrical and Computer Engineering at Carnegie Mellon University. His research interests include reliability, programmability, and efficiency of software on energy-harvesting devices. Colin received an MSc in electrical and computer engineering from Carnegie Mellon University. He is a student member of ACM. Contact him at [email protected].
Graham Harvey is an associate show electronic engineer at Walt Disney Imagineering. His research interests include real-world applications of wireless technologies to enhance guest experiences in themed environments. Harvey received a BS in electrical and computer engineering from Carnegie Mellon University. He completed the work for this article while interning at Disney Research Pittsburgh. Contact him at [email protected].
Alanson P. Sample is an associate lab director and principal research scientist at Disney Research Pittsburgh, where he leads the Wireless Systems group. His research interests include enabling new guest experiences and sensing and computing devices by applying novel approaches to electromagnetics, RF and analog circuits, and embedded systems. Sample received a PhD in electrical engineering from the University of Washington. He is a member of IEEE and ACM. Contact him at [email protected].
Brandon Lucia is an assistant professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. His research interests include the boundaries between computer architecture, compilers, system software, and programming languages, applied to emerging, intermittently powered systems and efficient parallel systems. Lucia received a PhD in computer science and engineering from the University of Washington. He is a member of IEEE and ACM. Contact him at [email protected] or http://brandonlucia.com.
Awards

Insights from the 2016 Eckert-Mauchly Award Recipient

URI WEISER
Technion–Israel Institute of Technology
I appreciate the opportunity to
share with you the insights I presented in
my Eckert-Mauchly Award acceptance
speech at the 43rd International Sympo-
sium on Computer Architecture (ISCA)
held in Seoul, South Korea, in June 2016.
I would like to thank the Editor in Chief of
IEEE Micro, Lieven Eeckhout, for this
opportunity.
I am humbled and honored to have
received the ACM-IEEE Computer Soci-
ety Eckert-Mauchly Award. During my
nearly 40 years in the field of computer
architecture, I have had the privilege of
working with many architects, profes-
sors, and students at the University of
Utah, National Semiconductor, Intel, and
the Technion–Israel Institute of Technol-
ogy and to collaborate with many col-
leagues in industry and academia around
the world. I see this award as recognition
of the computer architecture researchers
I have worked with in Israel and abroad.
I was fortunate to work on several
state-of-the-art concepts in research and
development that impacted the industry
and academia alike. Computerization is
one of the most rapidly developing trends
in human history, influencing almost every
aspect of our lives, as it will continue to do
for a long time to come. Thus, in this field,
it will always be the right time to innovate.
In these exciting times, I was lucky to
be involved in developing new computer
architecture concepts and products that
have changed the way we use com-
puters in our daily lives. To be at just the
right time and place may not be pure luck.
If it happens again and again, it means
that you keep trying to make a difference.
The Passion Path

I was born in Tel Aviv, Israel. My parents
were German Jews who fled Germany in 1933, before the Holocaust. The culture
I was exposed to during my childhood
was heavily influenced by the necessity
of constantly being in survival mode. The
main theme was to do your utmost to
excel, move forward, and survive.
The values I was nourished on were
to take the road less traveled by, to look
for new directions, and to challenge the
status quo even when the target seems
unobtainable: the obligation to innovate
in order to find new paths, to debate con-
structively on any solution, to crystallize
the proposed solution, and to be passion-
ate about whatever you do.
Education

Soon after graduating with a BSc in electrical engineering from the Technion and
completing my MSc degree while working
at the Israeli DoD, I made an audacious
decision to pursue my PhD studies in com-
puter science abroad. I had a few options
and ultimately chose the University of
Utah. At Utah, with Professor Al Davis as
my advisor (and I may also say my friend), I
was exposed to computer architecture
and helped pave the way (together with
others) toward new systolic array graphics
and analytical approaches. This exposure
to the industry and academia outside of
Israel set me on my technical path.
Industry

Thereafter, at National Semiconductor (in
the mid-1980s), I was lucky to lead the
design of the CISC NS32532 processor,
the best microprocessor at that time. I
learned there how a small team of excited
professionals could achieve the impossible (an OS running on first silicon). The product
was a huge technical success, but
unfortunately, the market had already
shifted to the “other” processors
(68000, MIPS, PowerPC, and X86).
NS32532 insight: Technology is important;
having the market behind you is a must.
With this strong insight, I moved to
Intel in the late 1980s. Intel’s market for
the X86 was huge compared to the market
for any other microprocessor. As Nick Tre-
dennick said in his 1988 talk, “More 386s
are produced at Intel between coffee
break and lunch than the number of RISC
chips produced all year by RISC vendors.”
However, Intel management’s belief in the
X86 product path was not strong enough.
Intel’s processors (the X86 family)
were based on the “old” complex-
instruction-set computer (CISC) architec-
ture, while a few years before IBM (with
its 801 processor) and Berkeley initiated a
Published by the IEEE Computer Society. 0272-1732/17/$33.00 © 2017 IEEE
new direction—the reduced-instruction-
set computer (RISC) processor. A debate
emerged within the computing commun-
ity as to whether the RISC design would
eclipse the old CISC design. Intel was
contemplating whether to design a new
X86 processor using the CISC concept or
abandon the program and shift from the
company’s central product toward a new
RISC architecture–based microprocessor
(the i860 family). Moving to a new architecture meant losing SW compatibility—
that is, writing new software for the entire
application base.
At that time, together with a few other
architects, I passionately tried to convince
Intel executives to continue developing a
new generation of CISC-based X86 pro-
cessors. We did this by showing how,
with the addition of new microarchitecture
features such as superscalar execution,
branch prediction, and split instruction and data caches, the X86 processors could be
made to perform competitively against
the RISC-based processors. Part of this
process included a superb one-page tech-
nical document titled “Do Not Kill the
Golden Goose,” which was sent to Intel’s
then-CEO, Dr. Andy Grove, and his staff.
The debate within Intel took several
months, and finally the decision was
made to design the next-generation micro-
processor based on the old X86 CISC fam-
ily. The architecture enhancements we
proposed laid the foundation for Intel’s
first Pentium processor.
Pentium insight: Understand the environ-
ment; do not follow the trend; be innova-
tive, passionate, and involved. Do not give
up; prove that your way is the right way.
Thereafter, I was lucky to be invited
to lead Intel’s Platform Architecture Cen-
ter in Santa Clara, California. There, I led
a group of researchers and strategists
who formulated the first PCI definition,
defined Intel’s CPU Roadmap, and pro-
posed a systems solution for the Pen-
tium processor. This group formed the
foundation of the Intel Research Labora-
tories, established a few years later.
Shortly after enhancing Intel’s line of
CISC-based processors in the early 1990s,
I co-invented and led the development of
the MMX architecture. This was the first
time (after i386) that Intel added a full set
of instructions to its X86 architecture. The
set of 57 instructions was based on a 64-
bit single-instruction, multiple-data (SIMD)
instruction set that improved performance
of digital signal processing, media and
graphics processing, speech recognition,
and video encoding/decoding. The new
MMX-based processor (P55C designed in
Israel) was a huge success in the market.
Reading List

• U. Weiser and A.L. Davis, "Wavefront Notation Tool for VLSI Array Design," VLSI System and Computation, H.T. Kung, R.F. Sproull, and G.T. Steele, eds., Computer Science Press, 1981, pp. 226–234.

• L. Johnson et al., "Towards a Formal Treatment of VLSI Arrays," Proc. Caltech Conf. VLSI, 1981; http://authors.library.caltech.edu/27041/1/4191 TR 81.pdf.

• U. Weiser et al., "Design of the NS32532 MicroProcessor," Proc. IEEE Int'l Conf. Computer Design: VLSI in Computers and Processors, 1987, pp. 177–180.

• A. Peleg and U. Weiser, Dynamic Flow Instruction Cache Memory Organized Around Trace Segments Independent of Virtual Address Line, US patent 5,381,533, to Intel, Patent and Trademark Office, 1995.

• A. Peleg, S. Wilkie, and U. Weiser, "Intel MMX for Multimedia PCs," Comm. ACM, vol. 40, no. 1, 1997, pp. 25–38.

• A. Peleg et al., The Complete Guide to MMX, McGraw-Hill, 1997.

• T.Y. Morad, U.C. Weiser, and A. Kolodny, ACCMP—Asymmetric Cluster Chip Multiprocessing, tech. report 488, Dept. Electrical Eng., Technion, 2004.

• T. Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip MultiProcessors," IEEE Computer Architecture Letters, vol. 5, no. 1, 2006, pp. 14–17.

• T. Morad, A. Kolodny, and U.C. Weiser, Multiple Multithreaded Applications on Asymmetric and Symmetric Chip MultiProcessors, tech. report 701, Dept. Electrical Eng., Technion, 2008.

• Z. Guz et al., "Utilizing Shared Data in Chip Multiprocessors with the Nahalal Architecture," Proc. 20th Ann. Symp. Parallelism in Algorithms and Architectures, 2008; doi:10.1145/1378533.1378535.

• Z. Guz et al., "Multi-core vs. Multi-thread Machines: Stay Away from the Valley," IEEE Computer Architecture Letters, vol. 8, no. 1, 2009, pp. 25–28.

• T. Zidenberg, I. Keslassy, and U. Weiser, "Optimal Resource Allocation with MultiAmdahl," Computer, vol. 46, no. 7, 2013, pp. 70–77.

• L. Peled et al., "Semantic Locality and Context-Based Prefetching Using Reinforcement Learning," Proc. ACM/IEEE 42nd Ann. Int'l Symp. Computer Architecture (ISCA), 2015; doi:10.1145/2749469.2749473.

• T. Morad et al., "Optimizing Read-Once Data Flow in Big-Data Applications," IEEE Computer Architecture Letters, 2016; doi:10.1109/LCA.2016.2520927.
MMX insight: Marketing has a tremen-
dous impact on your success.
Later, Intel invited me to co-lead the
foundations of a new design center in
Austin, Texas, the Texas Development
Center (TDC). At Intel, management usu-
ally provides the vision, whereas the
strategy is defined bottom up. We had to
define our product path and convince
Intel to adopt our strategy. Establishing a
new Intel design center is a challenging
task: hiring and building a team, defining a
local culture, defining a new product path,
and striving for recognition inside Intel.
Establishing a new design center insight:
This challenging task required me to cover
the varied domains of architecture, estab-
lishing a local culture, hiring, and building
a leading team.
After returning to Israel, I realized that
processors were reaching the perform-
ance wall when operating under a limited
power envelope environment. By nar-
rowing the application range, accelera-
tors can achieve better performance and
better performance/power than general-
purpose processors. I realized that an
on-die accelerator can provide a better,
more comprehensive solution. I formu-
lated a new concept called a Streaming
Processor (StP). We defined the concept
(an X86-based media processor), archi-
tecture, SW model, and application range.
The main purpose was an on-die X86-
based media (streaming) coprocessor.
Intel had to choose between two
options: an X86 graphics processor or
an X86 media processor. Intel management chose the graphics path (Larrabee).
Streaming Processor insight: When you
dare, sometimes you fail.
Academia

Along with my industrial work, I kept my
ties with the academic world. I continued
publishing papers, taught and advised
graduate students, and participated in pro-
fessional conferences. In 1990, together
with one of my students at the Technion,
Alex Peleg, I invented the Trace Cache, a
microarchitecture concept that increases
performance and reduces power con-
sumption by storing in-cache the dynamic
trace-flow of instructions that have
already been fetched and decoded. This
innovation brought about a fundamental
change in the design principles of high-
performance microprocessors. A trace
cache concept was incorporated into
more than 500 million Intel Pentium 4 pro-
cessors that Intel sold. Digital’s EV8 used
this architecture enhancement, too.
Trace Cache insight: Always continue to
look for new research avenues. Not all
will be successful, but some may be.
The limitation of general-purpose pro-
cessor performance under a limited power
envelope became a major performance
hurdle. This drove me to strive for better
performance/power architecture and led
me to pursue the concept of heterogene-
ous computing in general and asymmetric
computing in particular. Initially conceived
(as mentioned) for speeding up high-
throughput media applications, the con-
cept of heterogeneous computing later
served as a means to improve perform-
ance and efficiency by using “big cores”
for serial code and “small cores” for paral-
lel code and low-energy consumption.
Together with a colleague and a stu-
dent, I investigated the fundamentals of
heterogeneous computing. This research
led to new insights such as the MultiAmdahl concept, an analytically based
optimal resource division in heteroge-
neous systems, and the Multi-Core vs.
Multithread concept, which avoids the
performance valley in multiple core envi-
ronments. Additional research activities
included new architecture paradigms
such as Nahalal, a specialized cache
architecture for multicore systems, and
Continuous Flow Multithreading, which
uses memristors to allow fine-grained
switch-on-event multithreading.
Heterogeneous insight: Technology
changes over time. Be ready to take
advantage of the changes that may lead
to new avenues.
The introduction of the new Big Data
environment calls for re-evaluation of our
existing solutions. We started to direct
our research toward a more effective sol-
ution for the new environment and came
up with two new concepts: the Funnel, a
computing element whose input band-
width is much wider than its output band-
width, and the Non-Temporal Locality
Memory Access, exhibited by some Big
Data applications.
Big Data insight: Watch for a change in
the environment and the validity of cur-
rent computing solutions. We often
need to tune, change, and/or adapt our
computing structure to accept the new
requirements.
My professional path from the indus-
try to academia was, in a way, a
calculated decision. Its purpose was to pro-
long my technical career. Academia pro-
vides a limitless professional trail (as long
as you are productive) not always available
in the industry. In addition, academia keeps
you in young, vibrant surroundings in which
the research targets are to look forward,
innovate, and open new technological ave-
nues for the industry to follow.
I believe that the current slowdown in
the process technology trend, combined
with technological limitations on energy
and power, will place the burden of revital-
izing computing technology on research-
ers in the computer architecture field.
Thus, I believe that we are on the verge of
new architectural findings. The perform-
ance/application/capability baton is being
handed to you, the architects. Take it, and
go do wonderful things!
I have enjoyed being part of a group
of architects that made big changes in
computer architecture, and I continue to
enjoy the interactions, the passion, and
the unforgettable ride. MICRO
Uri Weiser is an emeritus professor in
the Electrical Engineering Department at
the Technion–Israel Institute of Technology (IIT). Contact him at [email protected].
Micro Economics

Two Sides to Scale

SHANE GREENSTEIN
Harvard Business School
It used to be that only AT&T, oil
companies, and Soviet enterprises could
aspire to monstrous size. Technology
firms entered that club only in rare cir-
cumstances, and when IBM, Intel, and
Microsoft did so, they each found their
own path to headlines.
We live in a different era today. The
largest organizations on the planet are
leading technology firms. These firms
aspire to sell tens of billions of dollars of
products and services, employ hundreds
of thousands of workers, and lure invest-
ors to value their organizations at hun-
dreds of billions of dollars. They deploy
worldwide distribution, complemented
by worldwide supply of inputs, growing
brand names recognized in Canada, the
Kalahari Desert, and the Caribbean.
Each of the big four—Alphabet, Amazon, Apple, and Facebook—has achieved
this unprecedented scale. Large and val-
uable? Check. Hundreds of thousands of
employees? At last count. Global in aspira-
tions and operations? You bet. Endless
opportunities in front of them? So it
seems. A few others—such as Microsoft,
Intel, IBM, SAP, and Oracle—could round
out a top 10 list. A few more young firms—
such as Uber, Airbnb, and Alibaba—aspire
to be the tenth tomorrow.
Mainstream economics regards this
scale with either praise or alarm. One
view marvels at the spread of such star-
tling efficiency, low prices, and wide vari-
ety. A contrasting view worries about
the distortions from concentrating deci-
sion making in a single large firm.
Let’s compare and contrast those
views.
Scale Is a Moving Target

Scale cannot be achieved without operations that produce and deliver many serv-
ices or products whose price exceeds
their cost. Accordingly, one advantage of
scale is replication of operations. Take
Alphabet’s search engine, Google. Their
engineers learned to parse the web in
one language and extended the approach
to other languages. Software algorithms
that worked in one language can work in
others. User-oriented processes that
helped build loyalty in one language build
it in another—for example, “Did you
mean to misspell that word, or would
you prefer an alternative?”
That does not happen by itself. Proc-
esses must be well documented, and
the knowledge about them must pass
between employees. Replication then
yields gains the second and third and
eighteenth time. To say it another way,
Google faces a lower cost supporting
search in another language than anyone
supporting just one.
At one level, this is not new. Others
have benefited from the economics of
replication in the past. Technology firms
brought it to new heights owing to the rise
of the worldwide Internet and the ease of
moving software to new locations.
Consider Amazon, which started its
life as an electronic retailer and never let
up on relentless expansion. It started in
books, and now has expanded to every
conceivable product category. In the
process, it developed an operation to
support the worldwide sale and distribu-
tion of its products, achieving a scale
never before seen in any retailer other
than Walmart.
Here is the remarkable part. Walmart
does not rent out its warehouses and
trucks and computing and order fulfill-
ment staff to anyone else. It does not rent
its staff’s insights about how to secure
its IT, nor its management’s insights
about how to fulfill customer demand.
It regards these as foundational trade
secrets.
Amazon’s management, in contrast,
took these services and developed,
refined, and standardized their use for
others. In their retail operations, they both
resell for others and perform order fulfill-
ment for others. They also developed a
range of back-end services to sell to
others, layered on top of additional options
for software applications and a range of
needs. It is called Amazon Web Services
(AWS). Both of these are available to other
firms, even some of Amazon’s ostensible
competitors in retail services.
I cannot think of any other compara-
ble historical example where a large firm
has developed such scale, and also
grown by making its processes available
to others. (If you can think of one, feel
free to suggest it. I would love to hear
from you.)
To be sure, there is more than eco-
nomics behind this achievement. After all,
the malleability and mobility of software
also contributes to the scale seen in these
two examples. So too does the legal sys-
tem, as writers such as Tom Friedman
have noted. This worldwide scale takes
advantage of all the efforts to coordinate
global markets over the last half century.
Diplomats went to great effort decades
ago to standardize processes for imports
and exports and remove frictions in the
settlement of accounts across interna-
tional boundaries.
It is still funny when you think about it.
These frictions were reduced to benefit
the prior generation’s global companies,
such as McDonald’s and IBM, not to
mention Coca-Cola, Boeing, Caterpillar,
and Goldman Sachs. Today these same
rules benefit several firms selling services
that Ray Kroc and Thomas Watson Jr.
never could have imagined.
Decision Making

A less appealing attribute accompanies
scale: concentration of decision making.
To begin, let’s recognize that popular
discussion often gets this one wrong.
Hollywood likes dystopian conspiracy
theories in which quasi-evil executives
manipulate society for selfish reasons.
However, the problems are usually less
sinister than that. Even with the best-
intentioned executives, the biggest firms
make decisions that can have enormous
consequences, many unintended.
Facebook’s recent travails are a good
illustration of one type of problem. Recall
that, despite multiple complaints about
the manipulation of its algorithm and
advertising program during the election,
Facebook refused to intervene in policing
fake news stories, many of which were
invented out of whole cloth for the pur-
poses of making some ad revenue from a
hyped electorate. After the election, Mark
Zuckerberg cloaked his firm’s behavior in
the language of free speech and user
choice.
What a tin ear from Zuck. Irrespective
of your short-term political outlook,
invented news is plainly not good for
democratic societies. And, more nar-
rowly, Facebook’s long-term fortunes
depended on the credibility of the mat-
erial being shared. A platform polluted
with lies does not attract participation.
The point is this: all firms mess up.
When large firms mess up, more of soci-
ety pays the cost.
More to the point, scale makes com-
petitive discipline more difficult to apply
when firms mess up. For example, many
years ago Apple had a series of policies
for its new smartphone that prevented
developers from spreading porn and
gambling apps, and that made a lot of
sense. But Apple kept expanding its
requirements, eventually angering many
programmers with rules about owning
data. That gave an opening to an alterna-
tive platform with less restrictive rules,
and Android took advantage of that
opportunity. In short, that is competitive
discipline incarnate: when a big firm mes-
ses up, competitors gain.
Ah, therein lies the problem. Scale
can sometimes provide almost impene-
trable insulation from competitive disci-
pline. As noted earlier, for example, in
many languages nobody can challenge
Google, so, effectively, nobody can disci-
pline them when they mess up. And in
the earlier example, who stepped in
when Facebook messed up? What were
the alternatives? In other words, the
absence of competitive discipline arises
occasionally, almost by definition, when-
ever large-scale firms are involved.
Perhaps more awkwardly, today’s plat-
forms support large ecosystems, in which
the leading firms coordinate many actions
in that ecosystem. Occasionally, a leading
firm bullies a smaller one in the ecosys-
tem, but the more common issue might
be called “uncomfortable dependence.”
Let’s pick on Apple again for an illus-
tration of dependence. Years ago, and for
a number of reasons, Steve Jobs refused
to let Flash deploy on Apple products.
Whether you think those motives were
justified or selfish (which typically gets
the attention in a conversation about this
topic), let’s focus on the more mundane
fact that nobody questions: one person
held enough authority to kill a part of the
ecosystem, which until then had been a
thriving software language. It does not
matter why he did it. Jobs’ decision
devalued a set of skills held by many
programmers, reducing the return on
investments those programmers made
in refining and developing those skills.
That is not the only form dependence
takes. Once again, Alphabet provides a
good illustration in its Google News serv-
ice. Whether you like it or not, Google
News has consistently interpreted copy-
right law online in a way that permits
them to show parts of another party’s
content. For understandable reasons, and
like almost all news organizations world-
wide, Spanish newspapers were among
the complainers. But they took one more
step, and had their country’s legislature
pass a law requiring Google to pay for the
content, even small slices. Long story
short, Google refused to pay, and it shut
down Google News in Spain. In no time,
those newspapers saw their traffic drop,
and they were begging to get it back.
Now that is dependence for you.
I am not going to argue about who
was right or wrong. Rather, my point is
this: such dependence tends to arise in
virtually every setting in which scale
concentrates authority. And so the self-
interested strategic decisions of one set
of executives have consequences for so
many others. We have already seen that
when firms mess up, many pay a cost.
Even when leading firms don’t mess up,
their intended decisions can impose
worry on others.
T he spread of efficiency is breathtak-
ing. The potential dangers from con-
centrating managerial decision making
are worrying.
After looking more closely, these do
not seem like two different perspec-
tives. These are more like yin and yang:
it is not possible to have one without
the other—and they are an unavoidable
feature of our times. MICRO
Shane Greenstein is a professor at the
Harvard Business School.