
Reliability issues in deep deep sub-micron technologies: time-dependent variability and its impact on embedded system design

Antonis Papanikolaou, Miguel Miranda, Hua Wang, Francky Catthoor, Munaga Satyakiran, Pol Marchal, Ben Kaczer, Christophe Bruynseraede, Zsolt Tokei

IMEC v.z.w., Kapeldreef 75, 3001 Leuven, Belgium
[email protected]

Abstract

Technology scaling has traditionally offered advantages to embedded system design in terms of reduced energy consumption and cost and increased performance. Scaling past the 45 nm technology node, however, brings a host of problems, whose impact on system-level design has not been evaluated. Random intra-die process variability, reliability and their combined impact on the system level parametric quality metrics are effects that are gaining prominence and that will need to be tackled in the next few years. Dealing with these new challenges will require a paradigm shift in the system level design phase.

I. Introduction

Embedded system design is especially demanding and challenging in terms of the constraints that need to be satisfied, e.g. real-time processing, low cost and low energy operation. These constraints have to be properly balanced until an economically viable solution is found. Novel mobile multimedia applications, e.g. interactive gaming combining advanced 3D and video codecs with leading edge wireless connectivity protocols, like software defined radio front-ends for cognitive radio, pose extremely large requirements on the amount of storage, processing and functionality capabilities of the system. Meanwhile, battery capacity is only increasing by about 7% per year and users demand longer times between battery recharges. Optimizing one of these requirements by compromising on another is a straightforward design task. However, in embedded system design the solution must obey the constraints in all three requirement axes.

Nevertheless, products containing some sort of embedded system implementation are today's predominant design choice in the low and high end consumer market, including safety critical applications, like the advanced braking systems of modern cars, or health-care applications, such as medical devices that measure vital body functions and react accordingly. These types of products pose additional constraints on the design of embedded systems. The main one is reliability and fail-safe operation during the guaranteed product lifetime, which translates into very high yield related constraints (near 100%), both in functionality and in correct parametric features. Indeed, either because these embedded systems are deployed in a very large market, like the low end consumer electronics market, or because they are part of a safety critical functionality, the failure of one or a few of the components they contain can have a negative financial impact or lead to catastrophic results. For instance, products like mobile phones and PDAs, which are not safety critical, still have a very high requirement on yield, due to the large number of devices that are deployed. Withdrawing products from the field due to malfunctions has very big consequences for companies, both in terms of cost and consumer loyalty/company image. Even if a small percentage of the products fail, the number of malfunctioning products and unsatisfied consumers will be large. For all these reasons yield is becoming a very important optimization target for embedded system design.

Technology scaling has traditionally enabled improvements in all three design criteria: increased processing power, lower energy consumption and lower die cost, while embedded system design has focused on bringing more functionality into products at lower cost. Unfortunately, this "happy scaling" scenario is coming to an end. New technologies are far less mature than earlier ones, e.g. the nanometer range feature sizes require the introduction of new materials and process steps that are not fully characterized, leading to potentially less reliable products. Indeed, Deep-Deep Sub-Micron (DDSM) technologies come with a host of problems that need solutions. Among others, they include new variability and reliability effects that are intrinsic to technology scaling, hence almost impossible to handle at manufacturing time. For instance, variability, both inter-die but especially intra-die, has an unpredictable impact right after manufacturing on the main performance and energy metrics of devices and interconnects. Also, progressive degradation of the electrical characteristics of devices and wires due to reliability and aging effects is an intrinsic consequence of the smaller feature sizes and interfaces, increasing electric fields and operating temperatures. Soft breakdowns (SBD) in the gate oxide of devices (especially dramatic in high-k oxides) [4], Negative Bias Temperature Instability (NBTI) issues in the threshold voltage of PMOS transistors [22], Electro-Migration (EM) problems in copper interconnects [19], breakdown of dielectrics in porous low-k materials [20], etc., which were considered second-order effects in the past, become a clear threat to the parametric and functional operation of circuits and systems in near-future technologies. Moreover, the combined impact of variability and reliability results in time-dependent variability, where the electrical characteristics of the devices and the wires vary statistically in both a spatial and a temporal manner, directly translating into design uncertainty during and even after fabrication.

Conventional design techniques for dealing with uncertainty at the technology level have been developed with design margins in mind. Most of them rely on the introduction of worst-case design slacks at the process technology, circuit and system level in order to absorb the unpredictability of the devices and interconnects and to provide implementations with predictable parametric features. However, trade-offs are always involved in these decisions, which result in excessive energy consumption and/or cost, leading to infeasible design choices. From the designer's perspective, reliability manifests itself as time-dependent uncertainties in the parametric performance metrics of the devices. In the future sub-45nm era, these uncertainties will be way too high to be handled with existing worst-case design techniques without incurring significant penalties in terms of area/delay/energy. As a result, reliability becomes a great threat to the design of reliable complex digital systems-on-chip (SoCs). We believe this will require the embedded system design community to unavoidably come up with new design paradigms to build reliable systems out of components which are quite unpredictable in nature. This problem cannot be solved only at the technology and circuit level anymore.

II. State of the art overview

At the technology level, the communities working on processing and reliability aspects are different. Literature exists in the technology processing community about the impact and sources of variability in devices and interconnects. However, the random intra-die process variability issues are mostly dealt with by the design community. The reliability community, on the other hand, generally focuses on the impact of the physical breakdown and degradation mechanisms on individual transistors and interconnects in typically small circuits and test structures which are not fully representative of the design reality. The main assumption there is the classical way of reliability lifetime prediction, which is based on extensive accelerated testing.
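As an illustration of what such lifetime extrapolation involves (a standard Arrhenius-type acceleration model, stated here as general background rather than anything specific to this paper), the time-to-failure measured under an elevated stress temperature is scaled to use conditions by an acceleration factor

AF = \exp\left[\frac{E_a}{k}\left(\frac{1}{T_{\mathrm{use}}} - \frac{1}{T_{\mathrm{stress}}}\right)\right],

where E_a is the activation energy of the failure mechanism, k is Boltzmann's constant, and T_use and T_stress are the absolute operating and stress temperatures.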

Research on the parametric performance degradation of larger circuits has been mainly limited to studying the impact of process variations via corner-point analysis and compensating for this during circuit design. However, the available solutions at this level are restricted to adapting the transistor sizing in the logic gates so as to minimize this degradation. In the case of embedded memory design, things are more complex. Transistor sizing is not an option for the SRAM cells due to the huge area overhead it would incur. Thus, memory designers typically insert worst-case margins in the timing interfaces between the various memory components. Some work also exists on the impact of reliability on the Static Noise Margin (SNM) degradation which leads to functional failures [8], [9]. The impact of parametric reliability effects at this level has not been fully explored. Only very recently has some work been carried out to evaluate the combined impact of random variability and reliability at the circuit level [10], [14]. To our knowledge, these are the only existing contributions that take both effects into account.

In the past few years, Statistical Static Timing Analysis (SSTA) techniques have gained momentum as an alternative to corner-point analysis [3]. Their goal is mainly to reduce the margins that need to be embedded during the design of the circuits. By analyzing the impact of variability on the electrical parameters of the design, the parts of the parameter space that are extremely unlikely are identified and removed from the space where the circuits must guarantee meeting the constraints. Even though these techniques still rely on circuit tuning, they manage to reduce the required margins. Furthermore, they can handle the impact of spatially uncorrelated variability, by characterizing its impact on the electrical characteristics of the logic gates. They are still worst-case oriented, however, and thus not very robust to large variations at the process technology level.
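A minimal sketch of the difference between the corner-point and the statistical view on a toy chain of gates, in Python; the gate count, nominal delay and sigma values are illustrative assumptions, not data from the paper or from [3]:

import random

N_GATES = 20          # illustrative chain of identical gates
NOMINAL = 1.0         # nominal gate delay (arbitrary units)
SIGMA = 0.08          # assumed per-gate random (uncorrelated) variation

# Corner-point view: every gate simultaneously at its 3-sigma worst case.
corner_delay = N_GATES * (NOMINAL + 3 * SIGMA)

# Statistical view: sample uncorrelated per-gate delays and take a
# high percentile of the resulting path-delay distribution.
samples = sorted(sum(random.gauss(NOMINAL, SIGMA) for _ in range(N_GATES))
                 for _ in range(100_000))
stat_delay = samples[int(0.9987 * len(samples))]   # ~3-sigma yield point

print(f"corner-point bound: {corner_delay:.2f}")
print(f"statistical 99.87% bound: {stat_delay:.2f}")
# Uncorrelated variations partially average out along the path, so the
# statistical bound sits well below the corner-point bound, which is the
# margin reduction referred to in the text.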

Solutions at the architecture level dealing with either of the two problems have also been proposed in the system design community. Razor [18] is a run-time approach that targets the minimization of the supply voltage for reduction of energy consumption. It applies an error detection and correction technique, thus it can adapt to the parametric degradation of the delay of the circuits. To achieve correct functionality, however, it trades off performance predictability at the system level. The time it will take to execute a given application is not predictable, which makes it unattractive for use in embedded systems with hard real-time constraints. A design-time solution for the impact of NBTI on the memory characteristics has been proposed in [13]. The authors motivate flipping the memory contents to relieve the stress on the PMOS transistors of the cell.

It becomes clear that even though partial solutions for intra-die process variability and reliability are being worked out, solutions that can deal with the combined impact of time-dependent variability have not yet gained attention from the research community. On the other hand, both effects manifest themselves as parametric drifts in the timing and the energy consumption of the devices. Their combined impact can also be described as time-dependent variability. For any solution to be adequate, it will have to deal with the run-time temporal shifts in the performance metrics of the devices and circuits.

III. The impact of reliability effects in the run-time parametric variations of the system

NBTI effects [21], [22] in PMOS transistors and (soft) gate oxide breakdowns (SBD) in NMOS transistors [11] are becoming two of the most important sources of progressive degradation of the electrical properties of devices in DDSM technologies. Thinner equivalent gate oxides and deficient supply voltage scaling lead to higher electric fields at the oxide interfaces, hence to larger tunneling currents that degrade the electrical properties of the oxide, resulting in electric traps at the interfaces. These traps translate into both NBTI and SBD effects. NBTI appears as a progressive drift of the threshold voltage of the P-transistor over time, which can partially recover once the negative voltage stress between the gate and the drain/source becomes zero or positive. SBD appears when enough traps align in the gate dielectric: a conducting path is created, resulting in "micro" tunneling currents through the gate. After some time the path created will "burn out", leading to a short or Hard Break-Down (HBD) resulting in a catastrophic failure. The transition from the initial conducting path to the HBD is not abrupt; the gate current will start progressively increasing long before the HBD occurs (Figure 1). Moreover, changes of the stress conditions due to the application's influence on the platform, such as activity, and the way these are translated into operating conditions of the devices and wires will also have a major impact on the actual dynamics of the degradation phenomena.
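To give a feeling for the kind of progressive drift involved, NBTI-induced threshold voltage shift is often described in the literature by an empirical power law in stress time; the sketch below uses assumed coefficient and exponent values purely for illustration and is not a model fitted by the authors:

def nbti_delta_vth(t_stress_s, A=3e-3, n=0.2):
    """Empirical NBTI drift: delta_Vth = A * t^n (volts).

    A and n are technology- and stress-dependent fitting parameters;
    the values used here are illustrative assumptions only.
    """
    return A * (t_stress_s ** n)

# Drift after one hour, one month and one year of continuous stress.
for t in (3600, 3600 * 24 * 30, 3600 * 24 * 365):
    print(f"t = {t:>10d} s  delta_Vth ~ {1000 * nbti_delta_vth(t):.1f} mV")
# Because n < 1 the drift grows quickly at first and then slows down,
# and it partially recovers when the stress is removed (not modeled here).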

Fig. 1. Wear-out and breakdown model for normal (SiON) and high-k (HfO2) gate oxides [24]

Similar effects are predicted for wires from the 45nm technology node on. Both electro-migration and reliability problems in the barriers between metal lines are serious concerns for guaranteeing correct operation during the product lifetime. The smaller widths of the wires bring an increase in current densities across technology nodes, which is accelerating electro-migration problems not only in aluminum but also in the more robust copper interconnects [19]. The problem is not alleviated by assuming a decreasing fan-out condition throughout technology nodes. For relatively long local and intermediate interconnects, current densities will even increase. For instance, in the intra-die communication architecture much of the load is determined by the wire capacitance itself, which is scaling poorly when compared to the devices [25], leading to increased current densities. Similarly to SBD effects in devices, electro-migration also translates into a progressive degradation of the associated resistance of the wire. The thinner the wire is, the earlier the degradation will start [19] (see Figure 2). This is aggravated by asymmetry aspects in the printed interconnect features. For instance, irregularities in the critical dimension of the interconnects introduced due to Line Edge Roughness [26], in combination with enhanced reticle techniques such as Optical Proximity Correction (OPC), will make the whole metal structure far more vulnerable to electro-migration issues, leading to "uncontrollable" (location- and impact-wise) random hot-spots.
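For reference, electro-migration lifetime under a given current density and temperature is commonly estimated with Black's equation; the sketch below is the generic textbook form with illustrative parameter values, not data from the referenced interconnect studies:

import math

K_BOLTZMANN = 8.617e-5   # eV/K

def em_mttf(j_A_per_cm2, temp_K, A=1e10, n=2.0, Ea=0.9):
    """Black's equation: MTTF = A * J^-n * exp(Ea / (k*T)).

    A (scaling constant), n (current-density exponent) and Ea (activation
    energy in eV) are fitted per metallization process; the values here
    are illustrative assumptions only.
    """
    return A * (j_A_per_cm2 ** -n) * math.exp(Ea / (K_BOLTZMANN * temp_K))

# Doubling the current density at a fixed 378 K junction temperature cuts
# the predicted lifetime by a factor of four for n = 2, which is why
# shrinking wire cross-sections (higher J for the same current) hurt
# electro-migration lifetime so strongly.
print(em_mttf(1e6, 378) / em_mttf(2e6, 378))   # -> 4.0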

Fig. 2. EM signature in narrow lines (<120 nm line width) [19]

A similar scenario can also be drawn for breakdowns in the dielectrics, where the smaller pitches between wires lead to reduced dielectric barrier distances. The main figure of merit for reliability (Mean Time To Failure, or MTTF) drastically shrinks as technology scales down [20] (see Figure 3). The reason is the combination of the increasing electric fields in active wires (due to insufficient voltage scaling) with the introduction of low-k dielectric materials (for improving the RC delay of wires) based on less (electrically) robust porous materials. Even when these failure phenomena manifest themselves as catastrophic, without an explicit progressive degradation phase, the number of dielectric breaks over time and the time to first break become less predictable than before. Imperfections of the low-k dielectric material (granularity of the material grains and/or air gaps) dramatically increase the uncertainty on the actual useful lifetime of the product.

Fig. 3. Reliability targets and projected MTTF in advanced Cu-low-k materials [20]

For a proper evaluation of the impact that these reliability problems have on circuit and system design, it is not sufficient to have models representing the situation of a particular reliability effect in a single device or interconnect, not even when considering possible interactions with other reliability phenomena, e.g. studying the combined impact of NBTI and SBD effects on the behavior of an SRAM cell [14]. The real problems need to be evaluated in the context of the particular circuit where the device/interconnect subject to reliability issues is situated. The fact that a progressive degradation effect may manifest mildly when looking at each single device/wire separately does not mean that the impact on the circuit is necessarily negligible. For instance, oxide breaks manifest themselves as a slight increase in the total gate leakage [5] that may not have a strong impact on the transistor current-voltage characteristics [24], since the drain current does not change significantly at the moment the soft oxide breakdown occurs. However, when looking at the interaction that the gate current increase may have with the circuit operation, although small, it can affect the parametric figures of the circuit by affecting the current of another device whose drain is connected to that gate.

The gate leakage current of the FETs can impede/favor the charging/discharging process of the output node of a gate, leading to longer/shorter delays. Furthermore, the extra leakage contributes directly to the increase of the total energy consumption. A lower than nominal voltage swing can be observed at the output node, due to the soft oxide breakdown induced gate leakage. Such a voltage swing then slows down the downstream logic driven by the defective gate [6]. Delay degradation induced by such a defect has already been observed in simple NOR/NAND logic gates and small data-paths (full adder) [6], [7].

It has recently been shown that SBDs in the transistors of SRAM components can also bring shifts in their performance. The energy and delay of both sense amplifiers (SA) and SRAM cells are dramatically affected by having a single SBD in one of their transistors. A variation of 36% in energy and 22% in delay is reported for the SA, and a similar variation is also reported in the SRAM cell parametric (energy/delay) operation [10]. Similarly to combinational logic, soft gate oxide breakdown can sometimes even reduce the delay of the feedback sensing loop of the SA and cell. Such drifts come from the second-order interactions of the gate leakage increase enabled via the circuit topology, and an even more significant variation in the circuit parametric figures is expected once the soft gate oxide breakdown starts affecting the first-order characteristics of the device.

These complex interactions exhibit a multiplicative effect when considered in combination with variability. The time-dependent nature of the degradation effects and the uncertainty in the initial parametric figures due to variability lead to a time-dependent variability that is very difficult to predict and control at design time. Given that the breakdown resistance value and location are random in nature [11], it is reasonable to expect an even more dramatic impact of this combined effect on the energy and delay of the SRAM in the DDSM era. Figure 4 illustrates the increase in the uncertainty ranges when a single soft breakdown is considered in one transistor of the SA. The delay and energy consumption ranges increase by more than a factor of 2. This expansion of the range clearly makes it more difficult for designers to predict the circuit behavior at design time and to take decisions, e.g. timing slacks, device sizing, that are robust to both effects combined.

Fig. 4. Impact of variability and gate oxide breakdown on the energy and delay of an SRAM sense amplifier (relative energy versus relative delay; uncertainty regions around the nominal point for Vt variation only and for Vt variation plus soft gate oxide breakdown (aging))
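A minimal Monte Carlo sketch of how a random breakdown on top of threshold-voltage variation widens the delay/energy cloud of a circuit; the distributions, sensitivities and breakdown model below are illustrative assumptions, not the transistor-level analysis behind Figure 4:

import random

def sample_sa(with_sbd):
    """Return (relative delay, relative energy) for one sense-amplifier instance.

    Illustrative assumptions: Vt variation shifts delay/energy by a few percent
    around nominal; a soft breakdown adds an extra random shift whose sign and
    magnitude depend on the (random) breakdown location and resistance.
    """
    delay = random.gauss(1.0, 0.04)
    energy = random.gauss(1.0, 0.05)
    if with_sbd:
        delay += random.uniform(-0.05, 0.15)   # a breakdown can speed up or slow down
        energy += random.uniform(0.00, 0.25)   # extra gate leakage mostly adds energy
    return delay, energy

def spread(values):
    return max(values) - min(values)

random.seed(0)
var_only = [sample_sa(False) for _ in range(10_000)]
var_sbd = [sample_sa(True) for _ in range(10_000)]

for name, data in (("variability only", var_only), ("variability + SBD", var_sbd)):
    d, e = zip(*data)
    print(f"{name:>18}: delay range {spread(d):.2f}, energy range {spread(e):.2f}")
# The second cloud is noticeably wider in both axes, mirroring the
# more-than-2x range expansion reported for the sense amplifier.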

Finally, the effect of the application and its dynamic nature on the operating conditions of the devices and interconnect is essential to fully characterize the actual impact that the reliability effects will have on the time-dependent parametric variations of the system. Trying to characterize this impact at design time becomes extremely difficult, if not impossible, in sub-45 nm technologies. First of all, we have seen that the way reliability problems appear within the circuit is a rather random process and it depends on the actual operating conditions: time, temperature and stress voltages [11]. This is especially true for large circuits and systems featuring many transistors which can undergo significantly different stress conditions when executing dynamic applications. For them, the actual location of the "progressive" defect and its degree of severity are hard to estimate at design time. Moreover, due to the varying nature of the stress induced by the application, the defect generation rate also becomes very difficult to capture unless this is done at operation time (run-time). These facts simply indicate that innovation in circuit and system level design and analysis has to take place to counteract the impact that progressive parametric degradation will have on the actual useful lifetime of embedded systems.

IV. Impact of time-dependent variability and progressive degradation in system design

Variations have always existed in critical parameters during the design and operation of electronic systems. The most obvious parameters are temperature, activity and other operating conditions. The circuits must always operate within the specified performance constraints for a range of temperatures. In recent years, such variations have also been observed in the electrical parameters, like capacitance, drive current etc., of the devices themselves due to tolerances during the processing of the wafers. The conventional solution for dealing with these variations is to incorporate worst-case margins (also called tuning) so that the circuit will always meet the target constraints under all possible conditions. The minimum and maximum value of each varying parameter is characterized, and the combinations of these values for all the parameters form the corners of the parameter space which defines the working conditions of the design. Designers typically tune their designs to meet the performance constraints for all the corner points; this technique is called corner-point analysis.

This technique is still widely used in industry, but it increasingly suffers from a number of disadvantages. The corner points are usually pessimistic; it is extremely unlikely that all the parameters will have their maximum or minimum values simultaneously. Thus, the margins required to make the circuits operational under all corner conditions are excessive. Furthermore, the number of parameters affected by time-dependent variability becomes very large. This means that circuit designers will have to deal with parameter spaces of many dimensions and extremely large numbers of corner points. Finally, corner-point analysis techniques cannot handle the impact of intra-die time-dependent variability, which is spatially uncorrelated in nature [15].
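To make the corner explosion concrete: with only a minimum and a maximum value per parameter the number of corners already grows as 2^N in the number N of independent parameters, and it gets worse if degradation forces extra levels per parameter. A small back-of-the-envelope sketch (the parameter counts are illustrative assumptions):

def corner_count(n_params, levels=2):
    """Corners to simulate when each of n_params independent parameters is
    checked at `levels` extreme values (classically min and max, levels=2)."""
    return levels ** n_params

for n in (4, 8, 16, 32):
    print(f"{n:>2} parameters -> {corner_count(n):,} corners")
# With 32 independently varying parameters the count already exceeds four
# billion corners, which is why exhaustive corner-point analysis stops
# scaling once time-dependent variability multiplies the parameter list.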

Mani et al. [12] have quantified the impact of corner-point analysis and statistical analysis on the power consumption, performance and yield of small logic circuits comprising a few hundred gates for the 130nm technology node. They assumed a limited impact of variability on the performance characteristics of the gates, a 25% delay variation in terms of 3σ/mean, which was reasonable for the 130nm technology node. In their paper they demonstrate that, in order to achieve a yield of 3σ (99.73%) using statistical timing analysis, squeezing the last 5.5% out of the circuit delay to meet the performance constraint incurs a power overhead of about 65%, even for a small circuit. This illustrates one of the walls that circuit designers face due to the increased variability. Even though the mean delay of the circuit might be improved, the increased spread due to the variations forces them to invest excessive power overheads to achieve reasonable parametric yield figures. The overheads that corner-point analysis incurs, on the other hand, are about 30% larger on average.

Variability in the electrical characteristics of devices and wires, and hence of the circuits, is also growing in magnitude as technology scales. Moreover, it is becoming (randomly) time-dependent, as illustrated in the previous section, due to the progressive degradation of the key electrical parameters of devices and wires. Indeed, the uncertainty region collecting the actual electrical properties of the devices/wires will move (randomly) in space as time progresses. This results in a new (global) region of uncertainty, formed by the collection of the local variability "clouds" (see Figure 5), which becomes far bigger than the corresponding one right after manufacturing.

Fig. 5. Evolution of the uncertainty region of the energy consumption and delay of components with time.

This increase in uncertainty has a very significant impact on system design. Conventional design optimization techniques include trading off energy consumption for performance at design time, where most options are available at the component level. For instance, if a system has to meet a given clock frequency target, memories from a high-speed memory library might be used instead of slower low-power memories to guarantee sufficient timing slack. Typically, components that are significantly faster than the given requirement are used in order to guarantee that the parametric system target is met with reasonable yield. This is a worst-case margin that is usually added by system designers on top of the circuit tuning already performed by circuit designers. However, the large performance and energy consumption uncertainty at the component level, combined with the requirement for very high yield, forces designers at all levels to take increasingly larger safety margins. Stacking all these margins leads to systems that are nominally much faster than required and hence much more energy hungry and potentially costly as well.

It becomes clear that using margins is an acceptable solution only if we can give up on one of the major embedded system requirements (real-time performance, low energy consumption, low cost, high yield). Design margins trade off energy consumption for performance, redundancy trades off cost for yield, parallelism trades off cost for performance and so on. No solution exists, however, that can optimize all these cost metrics simultaneously.

Furthermore, it is not yet known whether the corner conditions for each of the varying parameters will be fully characterizable, because they will depend on the detailed operating conditions of each device, like activity and stress conditions on the transistors and wires. These operating conditions in turn heavily depend on the applications that are running on the system and the way they use the system resources. This means that the corner points and the distributions of each parameter, which guide the corner-point and the SSTA techniques respectively, will no longer be available at design time. Putting the aforementioned results of the SSTA technique in the perspective of this unpredictability of the magnitude of the growing time-dependent variability, we conclude that design-time tuning of the circuit will be impossible for the target constraints of real-time performance, low energy and low cost.

The number of available alternative solutions, however, is very limited. One solution would be to add redundancy at a coarse-grain level, such as additional memories and processing elements, and assign tasks to resources at run-time based on the actual performance of the system components. This solution would incur a very large area overhead and will not provide adequate solutions: the interconnect becomes the bottleneck in this case. Time-dependent variability influences the performance characteristics of processing elements and memories as well as those of the communication networks. Existing redundancy solutions rely on perfect communication between the various degrading blocks in order to find an optimal assignment of tasks to system resources. This is not a realistic assumption, because the performance of the communication will also be degraded, and adding and controlling redundancy in the communication is more difficult. Furthermore, existing testing fault models are not appropriate for dealing with the parametric degradation, because they have been developed for catastrophic defects that impact one or a few of the redundant layers [27]. In the case of parametric time-dependent variability all the layers will be affected, thus conventional redundancy solutions cannot be applied.

Other existing solutions focus on run-time configuration of a given parameter based on the actual circuit performance; Razor [18] is a good example. This solution can guarantee correct I/O functional behavior of the processor pipeline. It works on the principle of error detection and correction, however, so the timing at the application level cannot be guaranteed because the number of faulty cycles cannot be predicted.

One of the main reasons why the existing solutions break down in the case of time-dependent variability is that they try to tackle both the functional and the parametric issues at the circuit level, with clear performance constraints on meeting the target clock period. This means that all the system components are designed so as to be functional and satisfy the frequency performance constraints with maximum yield. This forces designers to design for the worst case, since all the components should meet the constraint. In reality, the performance of each system component will follow a statistical distribution if margins are not embedded in its design; see [1] for a case study on on-chip memories. Some components will be faster than the mean performance and some will be slower. This variation is not exploited in state-of-the-art techniques dealing with variability issues. Instead, all the components are designed to have a predictable performance, even though this brings an energy overhead. Meeting the constraints of low energy, low cost and real-time performance for maximum yield is impossible with the conventional techniques if the magnitude of intra-die variability increases. A paradigm shift will be required both in the design of the circuits and in system level design to overcome these limitations.

V. A paradigm shift in system design solutions

As indicated above, one of the necessary steps required to deal with time-dependent variability is to separate the functional issues from the parametric issues, like performance and energy consumption. Circuit designers should deal with making circuits that remain functionally correct independently of the degree of time-dependent variability impact, because it may be impossible anyhow to fully characterize that impact at design time. An example of a circuit level technique to design robust SRAM cells under variability can be found in [16]. The parametric constraints can be ignored in favor of finding a functional solution for a larger uncertainty range due to time-dependent variability. This approach relieves circuit designers of the pressure to meet performance requirements; the target is to design functional circuits under extreme variability with minimal overhead in energy consumption and delay.

The system itself will have to adapt to the unpredictability of the component performance and guarantee the parametric constraints. This can be enabled by moving the timing constraints from the clock cycle to the level of the application deadline. Given that the components are designed without additional margins, their average performance will be faster than that of components with margins, but much more unpredictable. Some components will be faster than the clock frequency requires and some will be slower. Moreover, their real performance will depend on the actual impact of variability on each of them during fabrication and operation. Even though some components will violate the nominal frequency target, the average performance of the components can still meet it. Thus, over a number of cycles the application deadline can still be met. This solution does not require system designers to resort to asynchronous logic. The conventional synchronization boundaries can be preserved as long as the clock frequency can be slowly adapted to the speed of the slowest component that is used at each moment in time. This can be achieved via dynamic frequency scaling or fine-grain frequency islands, similar to the GALS principle. The actual performance of the components will have to be regularly monitored to fine-tune the operating clock frequency.
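A minimal sketch of the kind of monitor-driven frequency adaptation described above; the monitor and clock-controller hooks are placeholders standing in for platform-specific hardware, and the numbers are illustrative assumptions rather than the authors' implementation:

import random
import time

GUARD_BAND = 0.05           # assumed 5% safety factor on the measured critical delay
CALIBRATION_PERIOD_S = 1.0  # re-check the monitors about once per second

# Placeholder for an on-chip delay monitor: in a real platform this would read
# a critical-path replica or canary circuit for the given component.
def read_delay_monitor(component_id):
    return 1e-9 * random.uniform(0.9, 1.3)   # simulated, slowly drifting delay (s)

# Placeholder for the clock / frequency-island controller.
def set_clock_frequency(freq_hz):
    print(f"clock set to {freq_hz / 1e9:.2f} GHz")

def calibrate_clock(components):
    """Set the clock just fast enough for the slowest component currently in use."""
    worst_delay_s = max(read_delay_monitor(c) for c in components)
    freq_hz = 1.0 / (worst_delay_s * (1.0 + GUARD_BAND))
    set_clock_frequency(freq_hz)
    return freq_hz

# Infrequent re-calibration tracks slow degradation without per-cycle overhead.
for _ in range(3):
    calibrate_clock(["mem0", "mem1", "datapath"])
    time.sleep(CALIBRATION_PERIOD_S)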

Energy consumption is equally important to meeting the real-time constraints. It is influenced by two factors: time-dependent variability and design margins. Variability introduces side effects like unnecessary switching overhead and additional standby energy, and it can only be partially solved at the technology level, so the system will have to live with these overheads. The second source of additional energy consumption is the design margins themselves. Designers control the magnitude of the margins; separating the functional from the parametric issues will allow the use of smaller margins, which will result in more energy efficient system implementations.

A solution method based on the above principles has been outlined in [2] and an implementation in [17]. It is based on the assumption that the performance unpredictability is not tackled at the component level. The system is exposed to it in order to minimize the circuit energy and delay overhead due to the margins. The individual component performance and energy consumption are measured after fabrication by on-chip monitors, and relevant component level configuration options are adapted by the system in order to meet the real-time performance requirements of the application with minimal energy and area overhead. In [2] the solution is only activated after processing to increase the initial processing yield. But once it is in place, the same approach can be used infrequently, e.g. once every few seconds, to check whether the energy or timing is lower for the alternative path. This is still a reactive approach though, so not that difficult to implement in existing design flows.
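One way to read the calibration step is as a small selection problem: after the monitors report each component's measured energy and delay for its available configuration knobs, pick the cheapest combination that still meets the cycle budget. The greedy sketch below is a simplified illustration under assumed numbers, not the algorithm of [2] or [17]:

# Measured (delay_ns, energy_pJ) per configuration knob of each component,
# as on-chip monitors might report them after fabrication (assumed values).
measured = {
    "mem0":     [(2.0, 12.0), (2.6, 8.0), (3.4, 6.0)],
    "mem1":     [(1.8, 10.0), (2.4, 7.0), (3.0, 5.5)],
    "datapath": [(2.2, 20.0), (2.9, 14.0)],
}
CYCLE_BUDGET_NS = 3.0   # assumed timing constraint per access/operation

def pick_configurations(measured, budget_ns):
    """Greedy choice: for each component, the lowest-energy configuration that
    still fits the cycle budget, falling back to its fastest one otherwise."""
    choice = {}
    for name, configs in measured.items():
        feasible = [c for c in configs if c[0] <= budget_ns]
        if feasible:
            choice[name] = min(feasible, key=lambda c: c[1])
        else:
            choice[name] = min(configs, key=lambda c: c[0])
    return choice

for name, (d, e) in pick_configurations(measured, CYCLE_BUDGET_NS).items():
    print(f"{name}: {d} ns, {e} pJ")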

VI. Conclusions

Scaling to sub-45 nm technology nodes changes the nature of reliability effects from abrupt functional problems to progressive degradation of the performance characteristics of devices and system components. This necessitates a paradigm shift for embedded system design in order to meet the power, timing and cost constraints with acceptable yield and lifetime guarantees.

VII. Acknowledgments

The authors would like to acknowledge the fruitful discussions with Guido Groeseneken, Robin Degraeve, Michele Stucchi and Philippe Roussel.

References

[1] H. Wang et al., "Impact of deep submicron (DSM) process variation effects in SRAM design", DATE 2005.
[2] A. Papanikolaou et al., "A System-Level Methodology for Fully Compensating Process Variability Impact of Memory Organizations in Periodic Applications", CODES+ISSS 2005.
[3] C. Visweswariah, "Statistical Timing of Digital Integrated Circuits", Microprocessor Circuit Design Forum at ISSCC 2004.
[4] G. Groeseneken et al., "Recent trends in reliability assessment of advanced CMOS technologies", ICMTS 2005.
[5] B. Kaczer et al., "Understanding nMOSFET characteristics after soft breakdown and their dependence on the breakdown location", ESSDERC 2002.
[6] J.R. Carter et al., "Circuit-Level Modeling for Concurrent Testing of Operational Defects due to Gate Oxide Breakdown", DATE 2005.
[7] A. Avellan et al., "Impact of soft and hard breakdown on analog and digital circuits", IEEE Trans. Device and Materials Reliability, 2004.
[8] R. Rodriguez et al., "Oxide breakdown model and its impact on SRAM cell functionality", SISPAD 2003.
[9] B. Kaczer et al., "Experimental verification of SRAM cell functionality after hard and soft gate oxide breakdowns", ESSDERC 2003.
[10] H. Wang et al., "On the Combined Impact of Soft and Medium Gate Oxide Breakdown and Process Variability on the Parametric Figures of SRAM Components", to appear, MTDT 2006.
[11] J. Stathis, "Physical and Predictive Models of Ultrathin Oxide Reliability in CMOS Devices and Circuits", IEEE Trans. Device and Materials Reliability, 2001.
[12] M. Mani et al., "A new statistical optimization algorithm for gate sizing", ICCD 2004.
[13] S. Kumar et al., "Impact of NBTI on SRAM Read Stability and Design for Reliability", ISQED 2006.
[14] V. Ramadurai et al., "SRAM operational voltage shifts in the presence of gate oxide defects in 90 nm SOI", IRPS 2006.
[15] F. Najm, "On the need for statistical timing analysis", DAC 2005.
[16] E. Grossar et al., "Statistically aware SRAM memory array design", ISQED 2006.
[17] A. Papanikolaou et al., "A system architecture case study for efficient calibration of memory organizations under process variability", WASP 2005.
[18] T. Austin et al., "Making typical silicon matter with Razor", IEEE Computer, 2004.
[19] C. Bruynseraede et al., "The impact of scaling on interconnect reliability", IRPS 2005.
[20] Z. Tokei et al., "Reliability challenges for copper low-k dielectrics and copper diffusion barriers", J. of Microelectronics Reliability, 2005.
[21] D. Schroder and J. Babcock, "Negative bias temperature instability: road to cross in deep submicron silicon semiconductor manufacturing", J. Appl. Phys., vol. 94, 2003.
[22] V. Reddy et al., "Impact of Negative Bias Temperature Instability on Digital Circuit Reliability", IRPS 2002.
[23] J.H. Stathis et al., "Impact of ultra thin oxide breakdown on circuits", ICICDT 2005.
[24] B. Kaczer et al., "Implications of progressive wear-out for lifetime extrapolation of ultra-thin SiON films", IEDM 2004.
[25] ITRS, www.itrs.net
[26] J.A. Croon et al., "Line Edge Roughness: Characterisation, Modelling and Impact on Device Behaviour", IEDM 2002.
[27] D. Brahme et al., "Functional testing of microprocessors", IEEE Trans. Computers, 1984.