
The 65-nm 16-MB Shared On-Die L3 Cache for the Dual-Core Intel Xeon Processor 7100 Series

Jonathan Chang, Senior Member, IEEE, Ming Huang, Jonathan Shoemaker, John Benoit, Member, IEEE, Szu-Liang Chen, Wei Chen, Siufu Chiu, Raghuraman Ganesan, Gloria Leong, Venkata Lukka,

Stefan Rusu, Fellow, IEEE, and Durgesh Srivastava

Abstract—The 16-way set associative, single-ported 16-MB cache for the Dual-Core Intel Xeon Processor 7100 Series uses a 0.624-μm² cell in a 65-nm 8-metal technology. Low power techniques are implemented in the L3 cache to minimize both leakage and dynamic power. Sleep transistors are used in the SRAM array and peripherals, reducing the cache leakage by more than 2X. Only 0.8% of the cache is powered up for a cache access. Dynamic cache line disable (Intel Cache Safe Technology) with a history buffer protects the cache from latent defects and infant mortality failures.

Index Terms—Circuit design, computer architecture, manufacturability, microprocessor, on-die cache, power reduction, reliability, test.

I. INTRODUCTION

THE Dual-Core Intel Xeon Processor 7100 with up to 16-MB unified L3 cache is implemented in a 65-nm process technology with eight copper interconnect layers [1], [2]. Fig. 1 shows the die photo of the processor. It consists of two cores, each with a 1-MB L2 cache. The processor has a total of 1.3 billion transistors, while each core has over 100 million transistors. The processor runs at 3.5 GHz at 1.25 V. It supports 150-W and 95-W thermal design power options. The L3 cache and the associated logic have a separate power supply from the cores, PLL, and I/O. Fig. 2 shows the four voltage domains of the processor. The front side bus can run at 800 or 667 MT/s in a 3-load configuration. Both L3 and L2 use the same 0.624-μm² bit cell. Sleep transistors were designed into the SRAM arrays and their peripherals to achieve 0.75 W/MB average power while maintaining the cache content at all times [3]. The overall leakage power reduction is more than 2X, as confirmed by silicon measurements. Long channel length devices were used wherever possible to further reduce the leakage power consumption. A shutdown option is implemented in the SRAM arrays to minimize the leakage power of the inactive sub-arrays. Aggressive clock gating, fine-grained sleep resolution, and wake-up counters were implemented to minimize the dynamic power. Column redundancy is available in the data and tag arrays. Block redundancy is available through cache sizing. Intel Cache Safe Technology, formerly known as Pellston technology, is used to keep track of the random ECC events of each cache line and disable the cache lines susceptible to latent defects and infant mortality [4]. Extensive test solutions are available to ensure manufacturability.

Manuscript received August 25, 2006; revised December 19, 2006.
The authors are with Intel Corporation, Santa Clara, CA 95052 USA (e-mail: [email protected]).
Digital Object Identifier 10.1109/JSSC.2007.892185

Fig. 1. Die photo.

Fig. 2. Voltage domains.


II. CACHE ORGANIZATION AND FLOORPLAN

The logical cache size is 16 MB; it is 19 MB including ECC and redundancy. The L3 cache is a 16 K-set, 16-way set-associative cache, organized as shown in Fig. 3. The cache line size is 64 bytes, which is sent in two chunks on the data buses. Each chunk has 256 data bits, 32 ECC bits, and 2 redundancy bits. Each physical address is 40 bits. Cache sizing was done through set reduction: the number of sets can be configured to 16 K, 8 K, or 4 K. Table I summarizes the cache organization of the three major configurations: 16 M, 8 M, and 4 M. The set associativity stays at 16 for all three configurations, and set reduction is used to achieve the target cache size. The tag array and the associated datapath and control logic are built to support the largest tag width, which comes from the 4 M configuration.
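As a rough illustration of why the 4 M configuration sets the tag width, the sketch below (our own, not from the paper) decomposes the 40-bit physical address into offset, set, and tag fields for the three set counts; the field widths follow directly from the 64-byte line size.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative arithmetic behind Table I: field widths for a 40-bit
 * physical address and a 64-byte cache line (6 offset bits), with a
 * configurable set count. Not the paper's RTL.
 */
#define PA_BITS     40
#define OFFSET_BITS 6   /* log2(64-byte line) */

static void show_fields(unsigned set_bits) {
    unsigned tag_bits = PA_BITS - OFFSET_BITS - set_bits;
    printf("%2u K sets: set index = %u bits, tag = %u bits\n",
           1u << (set_bits - 10), set_bits, tag_bits);
}

int main(void) {
    show_fields(14); /* 16 K sets -> 16-MB cache: tag = 20 bits */
    show_fields(13); /*  8 K sets ->  8-MB cache: tag = 21 bits */
    show_fields(12); /*  4 K sets ->  4-MB cache: tag = 22 bits */
    /* The tag array must support the widest case: 22 bits, from the
     * 4 M configuration, as the text above notes. */
    return 0;
}
```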

The floorplan is built in a wrap-around style. Fig. 4 shows the floorplan and an example of the data grouping. The data cache is constructed with 256 regular sub-arrays and 32 redundancy sub-arrays. A regular sub-array is a 64 KB sub-array, storing 32 bits; a redundancy sub-array is a 68 KB sub-array, storing 34 bits.



Fig. 3. Data cache organization.

TABLE I. CACHE SIZING OPTIONS

Fig. 4. L3 cache floorplan and subarray grouping.

Two types of sub-arrays are used to ensure no wasted data bits, which would have cost area and power. Inactive sub-arrays drive 1's to their local data buses. An AND structure is implemented as the multiplexer to minimize the gate delay.

Fig. 5. Data sub-array and sleep design resolution.

The L3 cache is constructed with 18 data groups, with each group having 16 sub-arrays. Fig. 5 shows the construction of a 64 K/68 K sub-array and the power-up resolution. Each sub-array has 8 blocks, and the power-up resolution is designed at the block level. For each cache access, only one of the 16 sub-arrays in each group is accessed, and only one of the 8 blocks in the accessed sub-array is powered up. As a result, only 0.8% of all array blocks power up for each cache access. Fig. 6 shows the sleep resolution of the L3 cache for each cache access. The goal of the fine-grained sleep resolution is to improve the sleep design effectiveness and to minimize the di/dt noise.
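The 0.8% figure follows directly from the access granularity stated above: one of 16 sub-arrays per group and one of 8 blocks per sub-array, i.e.,

\[
\frac{1}{16} \times \frac{1}{8} = \frac{1}{128} \approx 0.8\%.
\]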

III. LEAKAGE POWER REDUCTION

Low power is one of the main objectives, and many low-power techniques were implemented to achieve the design target. N-sleep design was implemented in the SRAM arrays; the sleep bias is configurable between 50 and 300 mV. P-sleep design was implemented in the cache peripherals, including decoders, WL drivers, and cache I/O.


Fig. 6. L3 cache sleep resolution.

Fig. 7. N-sleep design for SRAM arrays.

A floating bitline was implemented at the local bitlines. A shut-off mechanism is available for the disabled sub-arrays in the 4 M and 8 M configurations. Wake-up counters were used to improve the effectiveness of the sleep design. Long channel length devices were used extensively to minimize the leakage power.

Fig. 7 shows the n-sleep design in the SRAM arrays. Biasing circuitry controls the temperature variation of the virtual ground (SRAMVSS in Fig. 7). The sleep bias is programmable, and it was characterized such that the variation of the virtual ground across process corners and temperature is within specification. An nMOS diode and a pMOS pulldown were added to control the temperature variation of the virtual ground. Using a diode structure as the sleep bias has been described in [5] and [6]; in contrast, the virtual ground in this design is mainly controlled by the programmable bias, and the diode structure is used to minimize the variations. When the virtual VSS rises above the transistor threshold voltage (Vt), MND turns on because its Vgs is larger than Vt, and the MPB pulldown also becomes more effective because its Vds is larger than Vt. As a result, the virtual VSS is limited to a Vt or less. As the temperature rises, the Vt of MND and MPB decreases, which makes the transistors stronger and prevents the virtual ground from rising drastically due to the higher leakage in the bit cells. Fig. 8 shows the virtual ground variation across temperature and device skews. During shutoff, the sleep bias is cut off from the ground and the pMOS diode is turned off to let the virtual ground float up to the maximum value.
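Read as a first-order summary (our simplification, not a formula from the paper), the clamp keeps the virtual ground at the programmed bias point unless bit-cell leakage would push it past roughly one threshold voltage:

\[
V_{\mathrm{SRAMVSS}} \approx \min\left(V_{\mathrm{bias}},\ V_t\right), \qquad 50\ \mathrm{mV} \le V_{\mathrm{bias}} \le 300\ \mathrm{mV}.
\]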

PMOS sleep is implemented in the decoders and cache I/O, which include column multiplexers (MUXes), write drivers, and sense amplifiers. Fig. 9 shows the p-sleep design in the decoders and WL drivers. Fig. 10 shows the p-sleep design in the cache I/O. The bitlines are floated when the cache I/O is in sleep mode to reduce the bitline leakage. This also avoids DC current across the read/write column MUXes when they are driven by virtual Vcc.

Fig. 8. Virtual ground variations.

Fig. 9. P-sleep design for decoders and WL drivers.

Fig. 10. P-sleep design for cache I/O.

The virtual Vcc of the decoders and cache I/O is a fixed setting and is limited to stay within a predetermined drop from the nominal Vcc so that proper logic values are available at the wordlines during the sleep mode. The predetermined drop from the nominal Vcc is chosen such that meaningful power saving can be achieved without incurring significant noise. An nMOS diode (MND) is implemented to limit the voltage drop of the virtual Vcc. The bitline prechargers are excluded from the sleep transistor design to avoid the long precharge time and to meet the stringent bitline equalization requirements of the differential sense amplifiers.


Fig. 11. Wake-up counters.

Fig. 12. Leakage shut-off infrared images.

The timer is intentionally left out of the sleep design because of its timing criticality. Long-Le transistors are used wherever possible in the timer without compromising the accuracy of the signal edges. With the p-sleep design in the cache peripherals, the incremental power saving is approximately 6 W.

All 16 ways are included in a block, primarily to hide the latency penalty of turning on the sleep transistors and to minimize the number of blocks/sub-arrays to be powered up. Each block has one wake signal, which is used for both the SRAM sleep and the wordline sleep. To minimize the dynamic power consumption due to the switching of the sleep transistors, programmable wake-up counters are implemented to detect whether there is another cache access to the same sub-array block within a predetermined number of cycles. Because most L3 accesses are localized, wake-up counters can minimize the dynamic power and the di/dt-induced power transients due to the switching of the sleep transistors. Each of the eight blocks within a sub-array has its own counter. Sleep occurs if a sub-array is in a set-disabled region or when the counter reaches the programmed limit. The counter limit is set to the break-even point, in cycles, between the leakage power saved and the switching power spent. Fig. 11 demonstrates how the wake-up counter makes a difference using a simplified example. Assume three consecutive accesses: to BLK7, to BLK6, and back to BLK7. Without the wake-up counter, Vgnd (the virtual ground) starts floating up right after the first BLK7 access and is then clamped back down to real ground because of the second BLK7 access. With the wake-up counter, Vgnd stays at the real ground between the two BLK7 accesses, provided no other access intervenes and the counter has not yet reached its limit.
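A behavioral sketch of the per-block wake-up counter policy is given below. This is our own illustration, not the paper's circuit: names such as wake_counter_t are hypothetical, and BREAK_EVEN_CYCLES stands in for the programmed break-even limit described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical behavioral model of one block's wake-up counter.
 * The real design is a circuit; this sketch only captures the policy:
 * stay awake after an access, and go back to sleep once the idle time
 * exceeds the programmed break-even count.
 */
#define BREAK_EVEN_CYCLES 32  /* assumed value; programmable in the design */

typedef struct {
    uint32_t idle_cycles;  /* cycles since the last access to this block */
    bool     awake;        /* wake signal: gates SRAM and wordline sleep */
} wake_counter_t;

/* Called once per clock cycle for each block. */
static void tick(wake_counter_t *c, bool accessed_this_cycle,
                 bool in_set_disabled_region) {
    if (in_set_disabled_region) {
        c->awake = false;          /* disabled sub-arrays always sleep */
        return;
    }
    if (accessed_this_cycle) {
        c->awake = true;           /* power up the block for the access */
        c->idle_cycles = 0;        /* restart the break-even countdown */
    } else if (c->awake && ++c->idle_cycles >= BREAK_EVEN_CYCLES) {
        c->awake = false;          /* leakage saved now outweighs the
                                      switching cost of re-waking */
    }
}

int main(void) {
    wake_counter_t blk7 = {0, false};
    tick(&blk7, true, false);            /* access: block wakes */
    for (int i = 0; i < BREAK_EVEN_CYCLES; i++)
        tick(&blk7, false, false);       /* idle: counter runs out */
    return blk7.awake ? 1 : 0;           /* 0: block went back to sleep */
}
```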

To reduce the dynamic power consumption, about 40% of the clock loading is gated in each sub-array. The overall dynamic power consumption of the entire L3 cache is 1.7 W for average applications. The switching power of the n-sleep and p-sleep transistors is less than 100 mW because of the low activity factor of L3 accesses and the small number of powered-up sub-arrays per access. Meanwhile, it takes about 50 ns for the virtual ground to reach the target value, based on a pessimistic analysis. The efficiency of the leakage power saving is about 95%.

Fig. 12 shows the infrared emission (IREM) images of the 16 M, 8 M, and 4 M caches. For the 16 M unit, the cache is operated in sleep mode. For the 8 M and 4 M units, the inactive sub-arrays can clearly be identified by their dark color. Overall, 3 W are saved for the 8 M cache and 5 W for the 4 M cache from the implementation of the shut-off feature.


Fig. 13. Power saving from N and P sleep.

Fig. 14. Redundancy: data arrays.

Fig. 15. Redundancy: tag arrays.

Fig. 13 shows the percentage of power saving due to the n-sleep and p-sleep designs. The x-axis indicates how fast the material (i.e., the process skew) is. As the material becomes faster and leakier, the p-sleep design in the cache peripherals contributes more to the power saving, while the percentage contribution of the n-sleep in the SRAM arrays becomes lower. This is because the saving from the n-sleep in the SRAM arrays is relatively independent of the material, while the saving from the p-sleep design in the cache peripherals scales with the material. The overall saving reduces the cache power by slightly more than 2X.

IV. REDUNDANCY

Fig. 14 shows the redundancy bit allocation for each chunk of the cache line in the data cache. Each chunk has two redundancy columns that can repair up to two random defects in each cache line. This redundancy scheme also addresses localized row-oriented bit cell defects. Up to eight possible bit-location pairs can be repaired. Three of the physical address bits are used to steer the redundancy MUX. The redundancy MUX select bits are chosen so that the total number of independent repairs remains the same for the 4 M and 8 M configurations. The granularity is one half of a sub-array. The redundancy MUX is implemented in the global data path.
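To make the steering concrete, here is a heavily hedged sketch (ours, not the paper's netlist) of how three physical-address bits could select which of eight bit-location pairs a chunk's two redundancy columns stand in for; the actual MUX wiring and fuse format are not disclosed in the paper.

```c
#include <stdint.h>

/* Hypothetical model of redundancy-column steering for one 256-bit
 * chunk, stored one bit per byte for clarity. Two redundancy bits
 * replace one fuse-programmed defective bit pair; 3 address bits pick
 * which of the 8 repairable pairs is in play.
 */
enum { CHUNK_BITS = 256, PAIRS = 8 };

typedef struct {
    int      enabled;            /* fuse: repair programmed for this chunk */
    uint16_t pair_pos[PAIRS][2]; /* fuse-mapped defective bit positions    */
} repair_fuses_t;

static void apply_repair(uint8_t data_bits[CHUNK_BITS],
                         const uint8_t red_bits[2],
                         const repair_fuses_t *f, unsigned steer) {
    if (!f->enabled)
        return;
    steer &= PAIRS - 1;                   /* 3 address bits -> 0..7 */
    data_bits[f->pair_pos[steer][0]] = red_bits[0];
    data_bits[f->pair_pos[steer][1]] = red_bits[1];
}

int main(void) {
    uint8_t bits[CHUNK_BITS] = {0}, red[2] = {1, 0};
    repair_fuses_t f = { .enabled = 1 };
    f.pair_pos[3][0] = 10; f.pair_pos[3][1] = 200;
    apply_repair(bits, red, &f, 3);       /* bits[10]=1, bits[200]=0 */
    return bits[10] == 1 ? 0 : 1;
}
```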

One redundancy column is implemented in each tag array; it can repair one random defect in each entry. Fig. 15 shows the redundancy bit allocation for each tag. Overall, four possible locations can be repaired independently.

V. CACHE RELIABILITY AND TESTABILITY

Because the SRAM cells are mostly in the sleep mode, where the effective voltage bias is reduced, ECC and Intel Cache Safe Technology are implemented to ensure the reliability of the cache. Thorough post-silicon characterization and statistical analysis were performed to ensure the product meets the reliability specifications. For the data arrays, each 8 bytes of data are protected by SECDED ECC. The maximum tag size is also protected by SECDED ECC. Furthermore, the data arrays are also protected by Intel Cache Safe Technology. Up to 32 cache lines, or up to 2 ways per set, can be disabled on detection of post-production ECC corrections. This has negligible performance impact. The cache line is disabled until the next reset.
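As a reference for what SECDED over an 8-byte word involves, below is a minimal, self-contained Hamming-style sketch (our own illustration; the paper does not give its code construction). It uses 7 Hamming check bits plus one overall parity bit over 64 data bits, i.e., a (72,64) code.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal SECDED(72,64) sketch: 64 data bits, 7 Hamming check bits at
 * codeword positions 1,2,4,...,64, plus one overall parity bit at
 * position 0. cw[i] holds bit i, one bit per byte for clarity.
 * Illustrative only; not the paper's actual code construction.
 */
static void secded_encode(uint64_t data, uint8_t cw[72]) {
    for (int i = 0; i < 72; i++) cw[i] = 0;
    for (int pos = 1, d = 0; pos < 72; pos++) {
        if ((pos & (pos - 1)) == 0) continue;     /* skip check positions */
        cw[pos] = (uint8_t)((data >> d++) & 1);
    }
    for (int p = 1; p < 72; p <<= 1) {            /* Hamming check bits */
        uint8_t parity = 0;
        for (int pos = 1; pos < 72; pos++)
            if (pos & p) parity ^= cw[pos];
        cw[p] = parity;                           /* makes each group even */
    }
    uint8_t all = 0;                              /* overall parity bit */
    for (int i = 1; i < 72; i++) all ^= cw[i];
    cw[0] = all;
}

/* Returns 0 if clean, 1 if a single-bit error was corrected,
 * 2 if a double-bit error was detected (uncorrectable).
 */
static int secded_decode(uint8_t cw[72]) {
    int syndrome = 0;
    uint8_t all = 0;
    for (int pos = 1; pos < 72; pos++)
        if (cw[pos]) syndrome ^= pos;             /* XOR of set positions */
    for (int i = 0; i < 72; i++) all ^= cw[i];
    if (syndrome == 0 && all == 0) return 0;      /* no error */
    if (all == 1) {                               /* odd flips: single error */
        if (syndrome != 0) cw[syndrome] ^= 1;     /* correct data/check bit */
        else cw[0] ^= 1;                          /* overall parity bit */
        return 1;
    }
    return 2;                                     /* even flips: DED */
}

int main(void) {
    uint8_t cw[72];
    secded_encode(0x0123456789ABCDEFull, cw);
    cw[37] ^= 1;                                  /* inject single-bit error */
    printf("decode status: %d\n", secded_decode(cw)); /* prints 1 */
    return 0;
}
```

A hardware implementation would of course compute the check bits and syndrome with parallel XOR trees; the bit-serial loops here are only for readability.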

Dynamic cache line disable was implemented to resolve Vccmin sensitivity from latent defects. An on-die history table (the Pellston Engine Queue) keeps track of random ECC events for each cache line. Unlike the previous implementation [4], this dynamic cache line disabling can isolate random cache errors.

The first time an ECC error occurs on a cache line, it could be a soft error. A second occurrence of an ECC error at the same location means that it is less likely to be a soft error and more likely a physical issue, such as a latent defect or Vccmin sensitivity. Cache Safe Technology is enabled during both the power-on self-test and normal operation. Whenever a one-bit ECC error is detected, the ECC logic signals the error information to the Pellston Engine, including the set and way information from the Global Bus Queue (GBSQ). If the error is the first for the particular set and way, an entry is logged in the Pellston Engine Queue; the cache way is not invalidated or scrubbed on the first error. If the error is the second at the same location, the Pellston Engine kicks in and disables the corresponding cache line.
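A behavioral sketch of this two-strike policy is shown below (our own illustration; the queue format and field names are hypothetical, with 32 entries mirroring the 32-line disable limit mentioned in the previous section).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two-strike cache-line-disable policy.
 * QUEUE_ENTRIES = 32 mirrors the "up to 32 cache lines" disable limit;
 * the real Pellston Engine Queue format is not described in the paper.
 */
#define QUEUE_ENTRIES 32

typedef struct {
    uint16_t set;
    uint8_t  way;
    bool     valid;
} history_entry_t;

static history_entry_t queue[QUEUE_ENTRIES];

/* Returns true if the line at (set, way) should be disabled.
 * First single-bit ECC error at a location: log it (could be a soft
 * error). Second error at the same location: likely a latent defect
 * or Vccmin sensitivity, so disable the line until the next reset.
 */
static bool on_single_bit_ecc_error(uint16_t set, uint8_t way) {
    int free_slot = -1;
    for (int i = 0; i < QUEUE_ENTRIES; i++) {
        if (queue[i].valid && queue[i].set == set && queue[i].way == way)
            return true;                  /* second strike: disable line */
        if (!queue[i].valid && free_slot < 0)
            free_slot = i;
    }
    if (free_slot >= 0) {                 /* first strike: just log it */
        queue[free_slot].set = set;
        queue[free_slot].way = way;
        queue[free_slot].valid = true;
    }
    return false;
}

int main(void) {
    on_single_bit_ecc_error(0x123, 5);    /* first strike: logged */
    return on_single_bit_ecc_error(0x123, 5) ? 0 : 1; /* second: disable */
}
```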

Many test features were implemented to ensure that the 16-MB L3 cache is testable and manufacturable. Programmable built-in self-test (PBIST) was used extensively for both silicon debug and test coverage. Low yield analysis, pWWTM, and a stability test mode are implemented. An in-die variation chain is available to monitor the in-die variations, and there are multiple clock regions for speed-path debug.

VI. SUMMARY

The reported processor has the largest on-die L3 cache and transistor count among x86 processors. The low power design enables more than a 2X power saving in the cache. Extensive repair, reliability, and test features ensure the manufacturability and reliability of the processor. The processor was characterized to meet the power envelope and frequency target.

ACKNOWLEDGMENT

The authors would like to thank K. Zhang and his team for providing technical feedback on the p-sleep design and the baseline sub-array design, S. Vora and B. Cherkauer for their support on power data measurements, Q. Zhang and R. Jhutti for their post-silicon support, and S. Tam for his input on the paper write-up.

REFERENCES

[1] S. Rusu et al., "A dual-core multi-threaded Xeon processor with 16-MB L3 cache," in IEEE ISSCC Dig. Tech. Papers, 2006, pp. 315–324.

[2] J. Chang et al., "The 65-nm 16-MB on-die L3 cache for a dual-core multi-threaded Xeon processor," in Symp. VLSI Circuits Dig. Tech. Papers, 2006, pp. 158–159.

[3] K. Zhang et al., "A SRAM design on 65-nm CMOS technology with integrated leakage reduction scheme," in Symp. VLSI Circuits Dig. Tech. Papers, 2006, pp. 294–295.


[4] J. Wuu et al., "The asynchronous 24-MB on-chip level-3 cache for a dual-core Itanium®-family processor," in IEEE ISSCC Dig. Tech. Papers, 2005, pp. 488–489.

[5] K. Nii et al., "A 90-nm low power 32 K-byte embedded SRAM with gate leakage suppression circuit for mobile applications," in Symp. VLSI Circuits Dig. Tech. Papers, 2003, pp. 247–250.

[6] A. J. Bhavnagarwala et al., "A pico-Joule class, 1 GHz, 32 KByte × 64 b DSP SRAM with self reverse bias," in Symp. VLSI Circuits Dig. Tech. Papers, 2003, pp. 251–252.

Jonathan Chang (M'93–SM'06) received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, R.O.C., in 1990, and the M.S. and Ph.D. degrees in electrical engineering from Stanford University, Stanford, CA, in 1994 and 1998, respectively.

He joined Intel Corporation, Santa Clara, CA, in 1998 and since then has been engaged in the design of several high-performance microprocessors with emphasis on large, high-speed, low-power cache design. He is currently a technical lead in the area of cache design in the Enterprise Microprocessor Group, Santa Clara, working on the next-generation Xeon server processors.

Ming Huang received the B.S. degree in physics from National Taiwan University, Taipei, Taiwan, R.O.C., in 1990, and the Ph.D. degree from the University of Washington, Seattle, in 1999.

He joined Intel Corporation in 1999 and has worked on a number of desktop and server microprocessors. He is currently a Design Engineer in the Enterprise Processor Division, Santa Clara, CA, working on the next generation of Xeon processors, focusing on L3 cache design.

Jonathan Shoemaker received the B.S. degree in electrical engineering and the M.Eng. degree in electrical engineering and computer science from the Massachusetts Institute of Technology, Cambridge, in 1996 and 1997, respectively.

He joined Intel Corporation in 1997 and has contributed to the logic design and validation of several microprocessor projects, including products for desktop, mobile, and server platforms. He currently works in the Digital Enterprise Group, Santa Clara, CA, on a next-generation Xeon processor. He specializes in large, low-power cache design, testability feature design, and post-silicon validation.

John Benoit (M'90) received the B.S.E.E. degree from the University of Rochester, Rochester, NY, in 1989.

He is a Principal Engineer with the Digital Enterprise Group, Intel Corporation, Santa Clara, CA. He joined Intel in 1989 and has worked in CAD, design automation, and full-chip integration, and is currently focused on logic design, DFT, and silicon debug.

Szu-Liang Chen received the B.Tech. degree from Huafan University, Taipei, Taiwan, R.O.C., and the M.S. degree in electrical engineering from the University of Southern California, Los Angeles.

He has been with Intel Corporation for almost seven years, involved in developing register files, custom function blocks, and last-level caches. His expertise includes floor planning, cache tools and flows, and timing convergence in high-speed design. He has been involved in the design of several generations of microprocessors, including the Itanium, Itanium 2, and Xeon processors. He is currently working on the next-generation Xeon processor as a member of the Xeon cache design team at Intel Corporation.

Wei Chen received the M.S. degree in electrical engineering from the University of Toledo, Toledo, OH.

He has been with Intel Corporation for 10 years and has developed expertise in CPU register file and cache design, especially in timing and special circuits. He has worked on CPU circuit design for the Pentium 2, Pentium 3, Itanium, Itanium 2, and Xeon processors, and is currently working on the next-generation Xeon processor.

Siufu Chiu received the B.S. degree in electrical engineering from HuaQiao University, China, in 1988, and the M.S.E.E. degree from the University of Southern California, Los Angeles, in 1996.

Since 1996, he has been with Intel Corporation and has worked on a number of desktop and server microprocessors. He is currently a Design Engineer in the Enterprise Microprocessor Group, Santa Clara, CA, working on the next-generation processors. He specializes in high-speed circuit design.

Raghuraman Ganesan received the B.S. degree in physics from the University of Madras, India, in 1993, and the B.S. degree in instrumentation and electronics engineering from Anna University, India, in 1996. He received the Masters degree in electrical engineering with specialization in VLSI from the University of Toledo, Toledo, OH, in 2000.

He joined Intel as a Circuit Design Engineer in 2000. He has expertise in cache design, register file design, and special circuits. He has worked on CPU design for two generations of Itanium and Xeon processors.

Gloria Leong received the B.S. degree in electrical engineering from the Massachusetts Institute of Technology, Cambridge.

She joined Intel Corporation in 1991 and has since worked on various microprocessor projects, including desktop, mobile, and server designs. Her technical interests are mainly in cache design and circuit methodologies. She is currently working on next-generation Xeon server processors.

Venkata Lukka received the B.S. degree in electronics and communications engineering from Andhra University, India, in 1998, and the M.S. degree in electrical engineering from the University of Toledo, Toledo, OH, in 1999.

He joined Intel Corporation, Santa Clara, CA, in 2000 and has been engaged in the design of several high-performance microprocessors. He is currently a Circuit Designer in the Digital Enterprise Group, Santa Clara, working on the next-generation Xeon server processors.


Stefan Rusu (M'85–SM'01–F'07) received the M.S.E.E. degree from the Polytechnic Institute, Bucharest, Romania.

He first joined Intel Corporation in 1984, working on data communications integrated circuits. In 1988, he joined Sun Microsystems, working on microprocessor design with a focus on clock and power distribution, packaging, standard cell libraries, CAD, and circuit design methodology. He rejoined Intel Corporation in 1996, working on the clock and power distribution, cell library, I/O buffers, and package for the first Itanium processor. He is presently a Senior Principal Engineer with Intel's Enterprise Microprocessor Group, leading the technology and special circuits design team for the Xeon Processor Family. His technical interests are high-speed clocking, power distribution, I/O buffers, power and leakage reduction, and high-speed circuit design techniques. He has authored or co-authored more than 75 papers on VLSI design methodology and microprocessor circuit technology. He holds 30 U.S. patents, with several more pending.

Dr. Rusu is a member of the Technical Program Committees of the ISSCC, ESSCIRC, and A-SSCC conferences, and an Associate Editor of the IEEE JOURNAL OF SOLID-STATE CIRCUITS.

Durgesh Srivastava received the B.Tech. degree from the Indian Institute of Technology, Kanpur, in 1992, and the M.Phil. degree from the Hong Kong University of Science and Technology, Hong Kong, in 1995.

He is a Senior Staff Architect at Intel Corporation, Santa Clara, CA. He has been working on microprocessor and platform architecture, focusing on power-saving schemes and memory technology.