Reliability: Fallacy or reality?




As chip architects and manufacturers plumb ever-smaller process technologies, new species of faults are compromising device reliability. Following an introduction by Antonio Gonzalez, Scott Mahlke and Shubu Mukherjee debate whether reliability is a legitimate concern for the microarchitect. Topics include the costs of adding reliability versus those of ignoring it, how to measure it, techniques for improving it, and whether consumers really want it.

Antonio Gonzalez, Intel
Scott Mahlke, University of Michigan
Shubu Mukherjee, Intel
Resit Sendag, University of Rhode Island
Derek Chiou, University of Texas at Austin
Joshua J. Yi, Freescale Semiconductor

About this article: Derek Chiou, Resit Sendag, and Josh Yi conceived and organized the 2007 CARD Workshop and transcribed, organized, and edited this article based on the panel discussion. Video and audio of this and the other panels can be found at http://www.ele.uri.edu/CARD/.

Moderator's introduction: Antonio Gonzalez

Technology projections suggest that Moore's law will continue to be effective for at least the next 10 years. Basically, as Figure 1 shows, in each new generation devices will continue to get smaller, become faster, and consume less energy. However, the new technology also brings along some new cotravelers. Among them are variations, which manifest in multiple ways. First, there are variations caused by the characteristics of the materials and the way chips are manufactured; these are called process variations. There are multiple types of process variations: spatial and temporal, within die and between dies. Random dopant fluctuations are one type of process variation. Second, there are voltage variations, such as voltage droops. Third, there are variations caused by temperature. Temperature affects many key parameters, such as delay and energy consumption. Finally, there are variations due to inputs. A given functional unit behaves differently (in terms of delay, energy, and other parameters) depending on the data input sets.

Faults are another group of cotravelers that accompany new technology. Faults have multiple potential sources. One of these is radiation particles; it is expected that future devices will be more vulnerable to particle strikes. Another source of faults is wear-out effects, such as electromigration. Finally, faults are also caused by variations. When variations are high, we might want to target designs to the common case rather than the worst case. When the worst case occurs, the system would need to take some corrective action to continue operating correctly.

Basically, the topic of the panel is these faults. We typically classify faults into three main categories:

• Transient faults appear for a very short period of time and then disappear by themselves. Particle strikes are the most common type of transient fault.

• Intermittent faults appear and disappear by themselves, but the duration can be very long (that is, undetermined). Voltage droops are an example of this type of fault.

• Permanent faults remain in the system until a corrective action is taken. Electromigration is an example of this type of fault.

Some of these faults have a changing probability of occurrence over the lifetime of a device. Failure rates during device lifetimes exhibit a bathtub curve behavior. At the beginning, during the period called infant mortality (1 to 20 weeks), the probability of a fault occurring is relatively high. Then, during the normal lifetime, the probability is far lower. Finally, during the wear-out period, the probability starts to increase again.
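To make the bathtub-curve behavior concrete, here is a small illustrative sketch (an editorial addition, not part of the panel discussion). It uses the Weibull hazard rate, a common textbook model for failure rates: a shape parameter below 1 gives the falling infant-mortality rate, a shape of 1 gives the flat useful-life rate, and a shape above 1 gives the rising wear-out rate. All parameter values below are made up for illustration.

```python
# Illustrative sketch of the bathtub curve using the Weibull hazard rate
# h(t) = (beta/eta) * (t/eta)**(beta - 1). Shape beta < 1 models infant
# mortality (falling rate), beta = 1 the useful-life region (constant rate),
# and beta > 1 wear-out (rising rate). Parameter values are illustrative only.

def weibull_hazard(t_years, beta, eta_years):
    """Instantaneous failure rate at time t for a Weibull(beta, eta) lifetime."""
    return (beta / eta_years) * (t_years / eta_years) ** (beta - 1)

for t in (0.1, 1.0, 5.0, 10.0):
    print(f"t = {t:4.1f} y:"
          f"  infant (beta=0.5): {weibull_hazard(t, 0.5, 8.0):.3f}"
          f"  useful life (beta=1): {weibull_hazard(t, 1.0, 8.0):.3f}"
          f"  wear-out (beta=4): {weibull_hazard(t, 4.0, 8.0):.3f}")
```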

Many questions remain open in reliability research. First, what will be the magnitude of these faults, and what opportunities will arise from exploiting these variations? What will the impact of particle strikes be in the future? What is the degree of wear-out in the typical lifetime of a processor? Will reliability be critical to keep high yields? How much does the processor contribute to the total faults in a system? Is it really an important part of the problem, or can architects just ignore it and take care of the other parts?

Is the microarchitecture the right level at which to address these issues? Are system-, software-, or circuit-level solutions preferable? Which types of solution are the most adequate, feasible, and cost-effective?

Is ignoring reliability an option? Do we need schemes and methods just to anticipate and detect faults? Or do we need mechanisms to detect and correct these faults? For instance, some parts that fail (or are likely to fail) might be transparently corrected, just as when you go to a mechanic to replace weak or failing parts of your car. What will be the cost of such solutions? Will all users be willing to pay for the cost of reliability, or would only certain classes of users (for example, users of large servers) be willing to pay? Does reliability depend on the application? Are certain applications more sensitive? Is reliability going to be a mainstream architecture consideration, or is it going to be limited to a niche of systems and applications? All of these remain to be answered.

Until recently, reliability has been addressed in the manufacturing process (through burn-in and testing) and through circuit techniques, whereas microarchitecture techniques have focused primarily on mission-critical systems. However, over the past 5 to 10 years, reliability has moved more into the mainstream of computer architecture research. On the one hand, transient and permanent faults due to CMOS scaling are a looming problem that must be solved. In a recent keynote address, Shekhar Borkar summed up the emerging design space as follows: "Future designs will consist of 100 billion transistors, 20 billion of which are unusable due to manufacturing defects; 10 billion will fail over time due to wear-out, and regular intermittent errors will be observed." [1] This vision clearly suggests that fault tolerance must become a first-class design feature.

[Figure 1. Technology scaling trends: more, faster, less-energy transistors.]

On the other hand, some people in the computer architecture community believe that reliability will provide little added value for most of the computer systems that will be sold in the near future. They claim that researchers have artificially enhanced the magnitude of the problems to increase the perceived value of their work. In their opinion, unreliable operation has been accepted by consumers as commonplace, and significant increases in hardware failure rates will have little effect on the end-user experience. From this perspective, reliability is simply a tax that the doomsayers want to levy on your computer system.

The goal of this panel is to debate the relevance of reliability research for computer architects. Highlights of the discussion follow the panelists' position statements.

The need for reliability is a fallacy: Scott Mahlke

I believe there is a need for highly reliable microprocessors in mission-critical systems, such as airplanes and the Space Shuttle. In these systems, the cost of the computer systems is not a dominant factor; the more important reality is that people's lives are at stake. However, for the mainstream computer systems used in consumer and business electronic devices, the need for reliability is a fallacy. Starting with the most obvious and working toward the least obvious, the following are the top five reasons why computer architecture research in reliability is a fallacy.

Reason 1: It's the software, stupid!

The unreliability of software has long dominated that of hardware. Figure 2 shows the failure rates for several system components. The failures per billion hours of operation (also called failures in time, or FITs) in Microsoft Windows are an order of magnitude higher than corresponding values for all hardware components. Bill Gates has stated that the average Windows machine will fail, on average, twice a month. In fact, when operating systems start off, they fail very frequently. Mature operating systems can have a mean time to failure (MTTF) measured in months, whereas a newer operating system might crash every few days.

[Figure 2. Failures in billions of hours of operation (references 2-5).]

This is not intended as a bash of Microsoft or other software companies. Software is inherently much more complex than hardware, and software verification is an open research question. The bottom line is that to improve the reliability of current systems for the user, the focus should be on the software, not the hardware.
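As a quick reference for the FIT unit above (an editorial aside, not the panelist's): one FIT is one failure per billion device-hours, so a FIT rate converts directly to a mean time to failure. A minimal sketch, with made-up example values:

```python
# Converting between FIT (failures per 10^9 device-hours) and MTTF.
# The example values below are made up for illustration.
HOURS_PER_BILLION = 1e9
HOURS_PER_YEAR = 24 * 365

def fit_to_mttf_years(fit_rate):
    """Mean time to failure, in years, for a part with the given FIT rate."""
    return HOURS_PER_BILLION / fit_rate / HOURS_PER_YEAR

def mttf_years_to_fit(mttf_years):
    """FIT rate for a part whose MTTF is the given number of years."""
    return HOURS_PER_BILLION / (mttf_years * HOURS_PER_YEAR)

print(fit_to_mttf_years(1000))   # 1,000 FIT    -> ~114 years MTTF
print(mttf_years_to_fit(7))      # 7-year MTTF  -> ~16,300 FIT
```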

Reason 2: Electronics have become disposable

One of the big issues that researchers are examining today is transistor wear-out and how to build wear-out-tolerant computer systems. But the majority of consumers care little about the reliable operation of electronic devices, and their concerns are decreasing as these devices become more disposable. In 2006, the average lifetime of a business cell phone was nine months. The average lifetimes of a desktop and a laptop computer were about two years and one year, respectively. Most electronic devices are replaced before wear-out defects can manifest. Therefore, building devices whose hardware functions flawlessly for 20 years is simply unnecessary. Furthermore, from the economic perspective, reliability can be quite expensive in terms of chip area, power consumption, and design complexity. Thus, it's often not worth the cost.

Reason 3: A transient fault is about as likely as my winning the lottery

Data from the IBM z900 server system shows that three transient faults occurred in 198 million hours of operation, or about one fault every 2.7 million days. One of my favorite comments about this was that it was much more likely that somebody would walk by and kick the power cord than it was that an actual transient fault would occur.
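As a quick check on the quoted figure (my arithmetic, assuming the 198 million hours are aggregate machine-hours across the installed base):

```python
# Back-of-the-envelope check of the quoted z900 numbers, assuming the
# 198 million hours are aggregate machine-hours: 3 transient faults observed.
hours = 198e6
faults = 3
hours_per_fault = hours / faults          # 66 million hours per fault
days_per_fault = hours_per_fault / 24     # ~2.75 million days, matching the text
fit_rate = faults / hours * 1e9           # ~15 failures per billion hours (FIT)
print(days_per_fault, fit_rate)
```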

Let's compare the rate of transient faults to some other things that we don't think about in our everyday lives. My chance of winning the lottery is about equal: 1 in 3 million. The chance of getting struck by lightning in a thunderstorm is about twice that of a transient fault: 1 in 1.4 million. How about getting murdered? The chance is about 1 in 10,000 in the United States. The chance of being involved in a fatal car crash is about 1 in 6,000, and the chance of a plane crashing is 1 in 10 million. The point is that we don't constantly worry about these things happening to us, so do we really need to be concerned about a transient fault happening? The chances are so unlikely that the best thing to do about them may be to ignore them.

Reason 4: Does anyone care?

In many situations, 100 percent reliable operation of hardware is not important or worth the extra cost. For instance, no one can tell the difference if a few pixels are incorrect in a picture displayed on their laptop. Imagine a streaming video coming through; no one would care if occasionally a pixel on a frame had the wrong color. How often do you lose a call on your cell phone? How often is a word garbled and you have to ask the person on the other end to repeat something? The majority of consumers today either do not notice or readily accept imperfect operation of electronic devices. There is also a lot of redundancy in applications such as streaming video, so maybe the answer is that software, rather than hardware, should be made more resilient.

Reason 5: This problem is better solved closer to the circuit level

Even if you accept reliability as a problem for microprocessor designers, one important question is at what level of design the problem should be solved: architectural or circuit? An example of the circuit-level approach is Razor, which introduced a monitoring latch [6]. Essentially, Razor is an in situ sensor that measures the latency of a particular circuit. It enables timing speculation and the detection of transient pulses caused by energetic particle strikes. The point is that architects may not need to worry about solving these problems. Rather, techniques like Razor can handle reliability problems beneath the surface, where the area and power overhead is lower. Another important factor is that many designs can benefit from circuit-level techniques, so point solutions need not be constructed. Finally, in situ solutions naturally handle process variation.

The need for reliability is a reality: Shubu Mukherjee

Captain Jean-Luc Picard of the starship USS Enterprise once said that there are three versions of the truth: your truth, his truth, and the truth. The God-given truth is that circuits are becoming more unreliable due to increasing soft errors, cell instability, time-dependent device degradation, and extreme device variations. We have to figure out how to deal with them.

The user's truth

Users care deeply about the reliability of their systems. If they get hundreds of errors a day, they will be unhappy. But if the number of hardware errors is in the noise relative to other errors, hardware errors may not matter as much to them. That is exactly the direction in which the entire industry is moving. The goal of hardware vendors is to maintain a low enough hardware error rate that those errors continue to be obfuscated by software crashes and bugs. However, there have always been point risks that make certain individual corruptions or crashes critical (for example, a Windows 98 crash during a Bill Gates demo), even if such errors occur rarely.

The IT manager or vendor's truth

The truth, however, is very different from the perspective of an IT manager who has to deal with thousands of users. The greater the number of user complaints per day, the greater is her company's total cost of ownership for those machines. It is like the classic light bulb phenomenon: the more you have, the sooner at least one of them will fail. And that's what we see in many houses. In a house with 48 light bulbs, each with a 4-year MTTF, we replace a light bulb every month. These failures negatively impact business, because billions of dollars are involved. User-visible errors, even when few surface, have an enormous impact on the industry. They increase cost because companies start getting product returns. Companies can face product returns even for soft errors, because users sometimes demand replacement of parts. In addition, there is the issue of loss of data or availability.
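The light-bulb arithmetic generalizes: with a roughly constant per-part failure rate, a fleet of N parts sees about N divided by the per-part MTTF failures per unit time. A minimal sketch; the 48-bulb case follows the example above, and the 10,000-machine fleet is a hypothetical illustration:

```python
# Illustrative "light bulb phenomenon": with roughly constant failure rates,
# a population of num_parts parts, each with the given MTTF, fails about
# num_parts / MTTF times per unit time, so large fleets see frequent failures
# even when each part individually looks very reliable.
def expected_failures_per_month(num_parts, mttf_years):
    per_part_rate = 1.0 / (mttf_years * 12)      # failures per part-month
    return num_parts * per_part_rate

print(expected_failures_per_month(48, 4))        # ~1 bulb a month, as in the text
print(expected_failures_per_month(10_000, 7))    # hypothetical fleet: ~119 failures/month
```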

The designer's awakening

The designer's awakening is an experience similar to going through the four stages of grief. First, you have the shock: "Soft errors (SERs) are the crabgrass in the lawn of computer design." This is followed by denial: "We will do the SER work two months before tape out." Then comes anger: "Our reliability target is too ambitious." Finally, there is acceptance: "You can deny physics only so long." These are all real comments by my colleagues. The truth is, designers have accepted silicon reliability as a challenge they will have to deal with.

The designer's challenge

The industry is addressing the reliability problem with the help of the research community. We need solutions at every level. Protection comes at many levels: at the process level through improved process technology; at the materials level through shielding for alpha particles; at the circuit level through radiation-hardened cells; at the architecture level through error-correcting code (ECC), parity, hardened gates, and redundant execution; and at the software level through higher-level detection and recovery. Companies are doing a lot. They are constantly making trade-offs between the cost of protection (in terms of performance and die size) and chip reliability, without sacrificing the minimum performance, reliability, and power thresholds that they must achieve.
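For readers unfamiliar with the ECC mentioned above, here is a minimal, illustrative Hamming(7,4) sketch (an editorial addition; production memories typically use wider SEC-DED codes rather than this toy version). It shows how a nonzero syndrome both detects and locates a single flipped bit, which can then be corrected:

```python
# Minimal Hamming(7,4) sketch: encode 4 data bits with 3 parity bits so that
# any single bit flip can be located (via the syndrome) and corrected.
def hamming74_encode(d):                     # d: list of 4 data bits
    p1 = d[0] ^ d[1] ^ d[3]                  # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]                  # covers codeword positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]                  # covers codeword positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]   # codeword positions 1..7

def hamming74_correct(c):                    # c: 7-bit codeword, at most one flip
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3          # nonzero = 1-based position of the error
    if syndrome:
        c[syndrome - 1] ^= 1                 # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]]          # recover the 4 data bits

word = [1, 0, 1, 1]
code = hamming74_encode(word)
code[5] ^= 1                                 # inject a single-bit upset
assert hamming74_correct(code) == word       # detected, located, and corrected
```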

Industry needs universities' help with research

Industry needs help from academia, but academia has some misconceptions about reliability research. One of the misconceptions concerns mean time between failures (MTBF), which is only a rough estimate of an individual part's life. Using MTBF to predict the time to failure (TTF) of a single part is fundamentally flawed, because MTBF does not apply to a specific part. Thus, we cannot start optimizing lifetime reliability on the basis of MTBF.

Another common misconception is the notion that a system hang doesn't cause data corruption. However, if you cannot prove otherwise, you should assume it does cause data corruption, because your data might already have been written to the disk before the system hangs. Finally, one other misconception is that adding protection without correction reduces the overall error rate, but in reality it does not.

Many questions remain unanswered in different areas of silicon reliability, and industry needs help from the universities. How do we predict and/or measure error rate from radiation, wear-out, and variability? How do we detect soft errors, wear-out, and variability on individual parts? Many traditional solutions exist, but how do we make them cheaper?

Cost of reliability: Are users willing to pay?

Mahlke: The big question is how much are people willing to pay? This is very market dependent. For example, a credit card company trying to compute bills would be willing to pay a fair amount of money. But what about the average laptop user: how much extra are they willing to pay? My theory is that end users are not willing to pay. Either they are used to errors and accept them, or they don't care about infrequent errors.

Gonzalez: I want to add that cost could also be reflected in some performance penalty. Reliability can sometimes be provided at the expense of some decrease in performance (for example, lower frequency) or some increase in power due to the extra hardware.

Mukherjee: If you look back, people pay for ECC. We do. We pay for parity, we pay for RAID systems (redundant arrays of independent disks), we pay a lot for fault-tolerant file servers from EMC. So, people pay. The main thing that we need to do is to measure and show them what they are paying for. It turns out that every company has some applications that they need to run with much more reliability than others. E-mail, surprisingly, is one of them. Financial applications are another. So, yes, they are willing to pay if we show them what they are paying for.

Gonzalez: Following up on that, do we have any quantification of reliability in terms of area or any other metric? How much is already in the chip today to guarantee certain levels of reliability?

Mukherjee: The amount of area that we are putting in for error correction logic is going up exponentially in order to keep a constant MTTF. And it will continue to grow.

Mahlke: Yes, but most of that error protection logic is in the memory, right? There is not much in the actual processor.

Mukherjee: Not necessarily. I cannot publicly reveal the details.

Mean time to failure: What does it tell us?

Audience member: Why do we make products with an MTTF of seven years, when most of the users are going to throw them away in one year? It's just a matter of doing the mathematics. It all depends on the distribution of the failures versus time. You can have 90 percent of your population failing in one year, and still have an MTTF of seven years with the right distribution. So, just because my MTTF is seven years doesn't mean that unacceptably large fractions of the people are not going to see failures in one year.

Mahlke: I think you are right. Just because the MTTF is seven years, it doesn't mean that all fail at seven years. Many of them will fail before that. But, I think if you look at it, people are keeping these things 11 months and then throwing them away. If you look at the data for how many phones actually fail after 11 months, I believe that it is a very small number, even when there are hundreds of millions of phones sold each year. And, if we just replace those phones, and each phone is $100 or something, the cost is relatively small.

The counter-argument is this: Let's say I am going to add $10 worth of electronics for reliability to each phone. Does the average phone customer want to pay that money to get that little bit of extra reliability? I think there is a big distinction between servers that are doing important computation and disposable electronic devices that people use. Maybe we need two different reliability strategies for these two domains. Because, for disposable electronics, where we try to reduce the cost, it may be too much overhead if we blindly incorporate reliability mechanisms such as parity bits or dual modular redundancy (DMR) into the hardware.

Mukherjee: We don't specify lifetime reliability based on the mean, but rather on a very high percentile of the chips surviving whatever number of years we internally think the chips should survive. So, it is not based on the mean.

Audience member: I would like to add one sentence to that. I have seen graphs from Intel that are publicly available. They show that this number of years for this 99.99+ percent of chips is going down. The reason is that it is harder to give the same guarantee.

Mukherjee: Good observation.
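To put rough numbers on this exchange (an editorial illustration, not the panel's data): the same 7-year MTTF is compatible with very different one-year failure fractions, depending entirely on the assumed failure distribution.

```python
# Illustrative sketch: a 7-year MTTF implies very different 1-year failure
# fractions under different lifetime distributions. Numbers are illustrative.
import math

MTTF_YEARS = 7.0

# Exponential lifetimes (constant failure rate): F(t) = 1 - exp(-t / MTTF).
frac_exponential = 1 - math.exp(-1.0 / MTTF_YEARS)        # ~13% fail in year 1

# Weibull lifetimes with steep wear-out (shape beta = 4): failures bunch up
# late in life. MTTF = eta * Gamma(1 + 1/beta), so solve for the scale eta.
beta = 4.0
eta = MTTF_YEARS / math.gamma(1 + 1 / beta)
frac_wearout = 1 - math.exp(-((1.0 / eta) ** beta))       # ~0.03% fail in year 1

print(f"exponential: {frac_exponential:.1%}, steep wear-out: {frac_wearout:.3%}")
```

A heavy-tailed distribution can push the early-failure fraction the other way, which is the questioner's point, and it is also why Mukherjee's reply specifies a high survival percentile rather than the mean.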

Error detection and correction

Audience member: One of the things I find disturbing is silent failure. If the hardware failure is a silent failure, I don't know if my data is corrupted; I have no indication. So, in these systems, if I have error detection, then I can track the data being corrupted and follow on. But, if we are having bit-flipping hardware, I have neither detection nor correction. Detection is not correction, I agree, but detection is one of the reasons we tolerate failures in software and supplement systems.

Mukherjee: That's a very good point. I was in the same camp for a while, but after interacting with some of the customers, I am beginning to think otherwise, because fault detection raises your overall error rate. Silent data corruption falls into two classifications: the one that you care about, and the one that you don't care about. When you put in fault detection to prevent silent data corruption, you end up flagging both types of these errors, and the customer annoyance factor goes up. The bottom line is that detection alone is not enough. You have to go for full-blown correction.

Gonzalez: So, there are errors that matter and errors that don't, but are we not using the same kind of systems for both? For instance, you may be checking your bank account with the same computer that you're using to run your media entertainment.

Mahlke: If you are accessing something and you knew there was an error in something small, maybe your address book, you can download a copy of it, right? But, if there was something larger than that, and maybe you didn't have a copy of the data (or maybe, as Antonio [Gonzalez] said, you were doing something critical like transferring money from your bank account), then I kind of agree with Shubu [Mukherjee] that detection alone may not be good enough. If you want to go down this reliability path, you may need to detect and correct, because detection just throws many red flags and you start worrying about what you lost versus fixing it behind the scenes. If a fault actually leads to a system hanging or crashing, then you will know about it. This may reduce the number of things we need to worry about to the things that lead to silent data corruption. Because, if that is a relatively small number, and I can figure out the other ones when they occur, maybe I don't need to worry about the small subset of faults.

Mukherjee: A system hang is not necessarily a detected error; it can cause silent data corruptions.

Audience member: In Scott's [Mahlke] presentation, he used the z series from IBM as an example, but these are systems that are all about reliability, and they are enormously internally redundant and fault tolerant. So, when you talked about the low error rate, that is the low error rate after all the expense devoted to reliability and adding everything for reliability. It would be very interesting to understand the non-z-series experience that people have.

Mahlke: The errors that I mentioned did not cause corruption, but were actually parity errors that were caught and corrected in their system. These weren't the errors that got through all the armor that they have put up.

Mukherjee: I have an example of that, from a recently published paper from Los Alamos National Lab [7]. There is a system of 2,048 HP AlphaServer ES45s, where the MTTF is quite small. It has been shown that cosmic-ray-induced neutrons are the primary cause of the BTAG (board-level cache tag) parity errors that are causing the machines to fail.

Will classical solutions be enough?

Audience member: I think one thing that both of you agreed on is that reliability has a cost, either in area or in performance. You may not see the errors, because the system is over-dimensioned. Perhaps we are not doing our jobs as microarchitects to actually look at the trade-offs between performance and the errors that we are able to tolerate, or to look at dimensioning the system to tolerate more errors, and perhaps to make the system cheaper using cheaper materials or architectures. Perhaps the tragedy here is that a lot of these trade-offs at this point are done at the semiconductor level rather than at the microarchitecture level. But, in the future it may be done at the microarchitecture level, where you can compute the trade-offs between performance versus reliability versus cost.

Mahlke: One of the problems is that as you go towards more multipurpose systems, these systems tend to do both critical and noncritical things. You might have different trade-offs for reliability versus performance for different tasks. For instance, for a video encoder, the system requires maximum performance and lower reliability. Therefore, as we go towards less-programmable systems, the trade-off is more obvious. On the other hand, as we go towards more-programmable systems doing both critical and noncritical tasks, the trade-off becomes a little bit harder, or a little bit foggy, with these multipurpose systems.

Gonzalez: Do you have a good example of a potential area where you believe that current approaches (for example, ECC) won't be enough? Do you have a good example to motivate what can be done at the microarchitecture level, Shubu?

Mukherjee: If you look at logic gates today, their contribution to the error rate is hidden in the noise of all the factors that cause the errors, such as soft errors and cell instability. But if you look five to 10 years ahead, once timing problems start to show up (maybe due to variations or wear-out), logic is going to become a problem. So, in that case, classical ECC is not going to buy you anything. We can start looking at residue-checking or parity-prediction circuits to detect logic errors.

It also comes down to this fundamental point that you have a full stack, starting from the software all the way down to the process. As you go up from one layer to another, the definition of errors makes it clearer whether it is an error or not. That's what you need to track and what you need to expose. That's what Joel Emer and I have worked on for a long time. We are convinced that if you look at different levels of a system, the resilience is very different. In some cases, if you hit the bit, you immediately see the error, but sometimes you don't see it at all. There is a wide degree of variability.
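For readers unfamiliar with the residue checking Mukherjee mentions, here is a minimal illustrative sketch (an editorial addition, not a description of any real design): the result of an arithmetic operation is checked against a cheap modular copy of the same computation, and a mismatch flags a logic error.

```python
# Illustrative residue check for an adder: addition is preserved modulo M,
# so a small mod-M "shadow" computation can flag many logic errors in the
# full-width result. Real designs typically use low-cost mod-(2^k - 1) logic;
# M = 3 here is purely for illustration (errors that are multiples of 3 slip by).
M = 3

def checked_add(a, b, inject_fault=False):
    result = a + b
    if inject_fault:
        result ^= 1 << 4                     # model a logic/timing error: flip bit 4
    ok = (result % M) == ((a % M) + (b % M)) % M
    return result, ok

print(checked_add(1234, 5678))                     # (6912, True)  -> result checks out
print(checked_add(1234, 5678, inject_fault=True))  # (6928, False) -> error detected
```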

Measuring reliability

Audience member: Can you give us an idea of how much reliability we gain by what you are putting in the processor, compared to when we have nothing?

Mukherjee: I have some data on a cancelled processor project. I was the lead architect for reliability of that processor. Our data showed that if you didn't have any protection, that chip would be failing in months due to all kinds of reliability issues.


Audience member: How much can we put in the commodity processors that customers are willing to pay extra pennies for?

Mukherjee: That goes to the fundamental problem of how to let customers know what they are getting for the extra price. The problem is that if you look at soft errors, we cannot tell them what extra benefit they are going to get. We can measure the performance by clock time, but we don't have a good measurement of a system's reliability.

Gonzalez: Any idea on how we can measure reliability?

Mukherjee: We fundamentally need a mechanism to measure these things. For hard errors, the problem may be tractable. For soft errors, induced by radiation, this is still a hard problem. For gradual errors, such as wear-out, we still don't know how to measure the reliability of an individual part. So, the answer is that, in many cases, we don't know how to measure reliability.

Mahlke: There may be a different angle of looking at how reliability can be measured. Instead of thinking of reliability as a tax that you have to pay, and trying to justify this tax, maybe the right way to go about this is thinking of what else we get in addition to reliability.

Let's take Razor as an example. Razor can identify events like transient faults, and it also allows you to drive voltage down and essentially operate at the lowest voltage margin possible. It identifies when the voltage goes too low and self-corrects the circuit. If we talk about adaptive systems and how to make systems more adaptable for reliability or power consumption, then it may be about justifying the cost of some feature that the user really wants, and reliability just kind of happens magically behind your back.

How much extra hardware is needed for reliability?

Audience member: How much can Intel afford to put in the chip for reliability?

Mukherjee: We will put in as much as we need to hide under the software errors.

Mahlke: I guess you are saying that you are putting in too much, since the software errors are two orders of magnitude greater than the hardware errors.

Mukherjee: Microsoft has actually shown that Windows causes very few of the problems. It is the device drivers that cause many of the problems. Memory is also a big problem, since more than 90 percent of memories out there don't have any fault detection or error correction in them. Stratus is a company that actually builds fault-tolerant systems using Windows boxes running on Pentium 3s. How did they do that? They tested all the device drivers. And they don't let anyone install any device driver arbitrarily on those systems. So, they have a highly reliable, fault-tolerant Windows system running on Pentium 3s. Believe it or not, that exists. So, blaming Windows is not the right way. Microsoft has done a phenomenal job showing that it is not Windows itself that causes most of the reliability problems in today's computers.

Acknowledgments

All views expressed by Antonio Gonzalez in this article are his alone, and all views expressed by Shubu Mukherjee are his alone. Neither author represents the position of Intel Corporation in any shape or form. Although Mahlke argues the fallacy viewpoint in this article, his research group actively works in the areas of designing reliable and adaptive computer systems.


References
1. S. Borkar, "Microarchitecture and Design Challenges for Gigascale Integration," keynote address, 37th Ann. IEEE/ACM Int'l Symp. Microarchitecture, 2004.
2. National Software Testing Labs, http://www.nstl.com.
3. R. Mariani, G. Boschi, and A. Ricca, "A System-Level Approach for Memory Robustness," Proc. Int'l Conf. Memory Technology and Design (ICMTD 05), 2005; http://www.icmtd.com/proceedings.htm.
4. J. Srinivasan et al., "Lifetime Reliability: Toward an Architectural Solution," IEEE Micro, vol. 25, no. 3, May-June 2005, pp. 70-80.
5. Center for Advanced Life Cycle Engineering, Univ. of Maryland, http://www.calce.umd.edu.
6. D. Ernst et al., "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," Proc. 36th Ann. Int'l Symp. Microarchitecture (MICRO 03), IEEE CS Press, 2003, pp. 7-18.
7. S.E. Michalak et al., "Predicting the Number of Fatal Soft Errors in Los Alamos National Laboratory's ASC Q Supercomputer," IEEE Trans. Device and Materials Reliability, vol. 5, no. 3, Sept. 2005, pp. 329-335.

Antonio Gonzalez is the founding director of the Intel-UPC Barcelona Research Center and a professor of computer architecture at Universitat Politecnica de Catalunya. His research focuses on computer architecture, with particular emphasis on processor microarchitecture and code generation techniques. Gonzalez is an associate editor of IEEE Transactions on Computers, IEEE Transactions on Parallel and Distributed Systems, ACM Transactions on Architecture and Code Optimization, and Journal of Embedded Computing.

Scott Mahlke is an associate professor in the Electrical Engineering and Computer Science Department at the University of Michigan, where he directs the Compilers Creating Custom Processors research group. His research interests include application-specific processors, high-level synthesis, compiler optimization, and computer architecture. Mahlke has a PhD from the University of Illinois, Urbana-Champaign. He is a member of the IEEE and ACM.

Shubu Mukherjee is a principal engineer and director of SPEARS (Simulation and Pathfinding of Efficient and Reliable Systems) at Intel. The SPEARS Group spearheads architectural innovation in the delivery of microprocessors and chipsets by building and supporting simulation and analytical models of performance, power, and reliability. Mukherjee has a PhD in computer science from the University of Wisconsin-Madison.

The biographies of Resit Sendag, Derek Chiou, and Joshua J. Yi appear on p. 24.

Direct questions and comments about this article to Antonio Gonzalez, Intel and UPC, c/Jordi Girona 29, Edifici Nexus II, 3a. planta, 08034 Barcelona, Spain; [email protected].
