ECE4740: Digital VLSI Design

35
6/8/2018 1 ECE4740: Digital VLSI Design Lecture 28: Memories 1024 Semiconductor memories Contribute significantly to area and power of VLSI circuits 1025

Transcript of ECE4740: Digital VLSI Design

6/8/2018

1

ECE4740: Digital VLSI Design

Lecture 28: Memories

1024

Semiconductor memoriesContribute significantly to area and power of VLSI circuits

1025

6/8/2018

2

Memory hierarchy in a processor

• Exploit locality: – large amount of cheap memory – sufficient amount of fast, expensive memory

1026

second

level

cache

(SRAM)

datapathsecondary

memory

(Disk)

on-chip components

Re

gF

ile

main

memory

(DRAM)

da

ta.

cach

ein

str.

cach

e

eDRAM

Speed (ns): .1�s 1�s 10�s 100�s 1,000�s

Size (bytes): 100�s k�s 10k�s M�s T�s

Cost: highest lowest

Memories in an ASIC

• Critical data stored in flip-flops within datapath

• Small memories, typically build from standard-cell based static cells (latches or flip-flops)

• Larger amounts of data stored in SRAM macrocells

• Off-chip DRAMs store GBs of data 1027

Datapath

Application-specific integrated circuit

flip-flops

latch

arrays

SRAM

macrocells

ROM

external

DRAM

can have specialized memory

interfaces

fastest data access

fast but regularmemory access

LUTbuilt with logic cells

larger LUTs

big but slow

6/8/2018

3

Semiconductor memories

• 50% (growing) of silicon area is memory in most designs• Often limits throughput or energy efficiency

1028

RWM NVRWM ROM

Random Access Non-Random Access

EPROM Mask-programmed

SRAM (cache, register file)

FIFO/LIFO E2PROM

DRAM Shift register

CAM

FLASH Electrically-programmed

(PROM)

read/write memories non-volatile

Also this: write-only memories

1029

6/8/2018

4

DRAM chip capacity growth

1030

64

256

1,000

4,000

16,000

64,000

256,000

10

100

1000

10000

100000

1000000

1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000

Year of introduction

Kb

it c

ap

acit

y

Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

Bandwidth growth in DRAMs

1031http://www.samsung.com/global/business/semiconductor/file/media/DDR4_Brochure-0.pdf

6/8/2018

5

Memory architecture overview

1032

Word 0

Word 1

Word 2

Word n-1

Word n-2

Storage

Cell

m bit

S0

S1

S2

S3

Sn-2

Sn-1

Input/Output

n words n select signals

Word 0

Word 1

Word 2

Word n-1

Word n-2

Storage

Cell

m bit

S0

S1

S2

S3

Sn-2

Sn-1

Input/Output

A0

A1

Ak-1

decoder reduces # of

Inputs k = log2(n)

Memory architecture details

1033

A0

A1

Aj-1

sense amplifiers

bit line

word line

storage

(RAM) cell

Aj

Aj+1

Ak-1

read/write circuits

column decoder

2k-j

m2j

input/output (m bit)

amplifies bit line swing

selects appropriate word

from memory row

6/8/2018

6

Split large memories in blocks/banks

• Shorter word and bit lines faster

• BlockAddr signal activates only 1 block power reduction

1034

input/output (m bits)

Read-only memories (ROMs)Useful for large look-up tables (LUTs)

1035

6/8/2018

7

Read-only memory cells

1036

WL

BL

WL

BL

1WL

BL

WL

BL

WL

BL

0

VDD

WL

BL

GND

Diode ROM

1: BL=VDD

0: BL=GND

MOS OR ROM

1: BL=VDD-Vt

0: BL=GND

MOS NOR ROM

1: BL=GND

0: BL=VDD

GND

GND

GND

GND

VDD

VDD

does not isolate bit line from word line

4x4 MOS OR ROM cell array

1037

WL [0]

V DD

BL [0]

WL [1]

WL [2]

WL [3]

V bias

BL [1]

pull-down loads

BL [2] BL [3]

V DD

supply line shared to

reduce area

if one WL high, then

output high OR

Image Adapted From: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

6/8/2018

8

4x4 MOS NOR ROM cell array

1038

WL [0]

GND

BL [0]

WL [1]

WL [2]

WL [3]

V DD

BL [1]

pull-up devices

BL [2] BL [3]

GND

pseudo-NMOS NOR gate

if one WL low, then output low NOR

MOS NOR ROM layout

1039

Polysilicon

Metal1

Diffusion (GND)

Metal1 on diffusion

bit lines on Metal 1

1 ROM cell

GND

connected to GND

WL[0]

WL[1]

WL[2]

WL[3]

GND

GND

6/8/2018

9

4x4 MOS NAND ROM

1040

WL [0]

WL [1]

WL [2]

WL [3]

V DD

pull-up devices

BL [3]BL [2]BL [1]BL [0]

word lines high per default

active WL ties BL to

GND

needs no supply lines! smaller only single GND

connection required!

MOS NAND ROM layout

1041

Polysilicon

Diffusion

Metal1 on diffusion

bit lines on Metal 1

1 ROM cell

WL[0]

WL[1]

WL[2]

WL[3]

forms transistor

About 15% smaller than NOR ROM

6/8/2018

10

(Model for NOR ROM)

• Word-line parasitics

– Wire capacitance and gate capacitance

– Wire resistance (polysilicon)

1042

V DD

C bit

rword

cword

WL

BL

• Bit-line parasitics

– Resistance not dominant (metal)

– Drain and gate-drain capacitance

NOR model

Precharged MOS NOR ROM

1043

WL [0]

GND

BL [0]

WL [1]

WL [2]

WL [3]

V DD

BL [1]

pull-up devices

BL [2] BL [3]

GND

pseudo-NMOS NOR gate

precharge

0 1

0 1

1 1 1 10 1 1 0

avoids static power

Image adapted from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

6/8/2018

11

(Model for NAND ROM)

• Word-line parasitics

– Wire capacitance and gate capacitance

– Wire resistance (polysilicon)

1044

• Bit-line parasitics

– Resistance of cascaded transistors dominates

– Drain/source and complete gate capacitance

NAND model

V DD

C L

rword

cword

cbit

rbit

WL

BL

NAND ROM is smaller but also much slower!!!!

Remember: reducing word-line delay

• Use bypasses!

– Drive word line from both sides

– Use metal bypass (better materials)

1045

Driver

Polysilicon word line

Metal word line

WL

Driver

Polysilicon word line

Metal bypass

WL K cells

e.g., use silicides

6/8/2018

12

ROMs for look-up tables (LUTs)

• LUTs used in ASIC designs and some processors– Store fixed program (hard-coded software)– Fast division and square root units– Approximating arbitrary arithmetic functions

• Just set input value to word address and output value is ROM content

• ROM-based approach only useful for large LUTs

1046

N-bit to M-bit LUT

inputvalue

outputvalue

Read-write memories (RAMs)RWM is hard to pronounce

1047

6/8/2018

13

Read-write memories (RAMs)• Static: SRAM

– Data is stored as long as supply is applied– Large cells (6T) fewer bits per area– Fast (used where speed is important)– Differential outputs (BL and !BL)– Use sense amps for better performance– Compatible with CMOS technology

• Dynamic: DRAM– Requires periodic refresh– Small cells (1T to 3T) more bits per area– Slower (used for large main memories)– Single-ended output (BL only)– Need sense amps for correct operation– Not typically compatible with regular CMOS technology1048

Memory timing

1049

read

read cycle

read access read access

write

write cycle

data

write

setup

data valid

write

hold

data written

6/8/2018

14

SRAMs are used everywhere!

• Turbo-decoder ASIC for LTE

• 65nm CMOS

• 129kb SRAM

• Single-port only

• Runs at 300MHz

1050S, Benkeser, Belfanti, & Huang, 2011

4x4 static RAM (SRAM)

1051

A0

!BLWL[0]

A1

A2

column decoder

sense amplifiers

write circuitry

BL

WL[1]

WL[2]

WL[3]

bit line precharge2 bit words

clocking and

control

enable

BL[i] BL[I+1]

read

precharge

6/8/2018

15

2D memory configuration

1052

sense ampssense amps

halves the length of the word and bit lines increased speed

Example: large SRAM

• UltraSparc 512KB cache SRAM

• Split into 4 subarrays

– 4x 128KB subarrays

– Each subarray consists of 16 8KB banks

– Also includes control and cache circuitry

1053

Image taken from: CMOS VLSI Design: A Circuits and Systems Perspective by Weste, Harris

6/8/2018

16

6T SRAM cell

• Cross-coupled inverters sizing critical1054

!BL BL

WL

M1

M2

M3

M4

M5

M6Q

!Q

6T SRAM cell: read

• Read disturb (read upset):

– Bit-line capacitance Cbit can be in pF range

– Limit allowed voltage rise on !Q to not change SRAM cell state

1055

!BL=1 BL=1

WL=1

M1

M4

M5M6

Q=1!Q=0

CbitCbit

6/8/2018

17

(6T SRAM cell: read)

• Cell ratio: CR = (W1/L1)/(W5/L5)

• ∆V = maximum allowed voltage ripple

1056

WL=1

M1

M4

M5M6

Q=1!Q=0

CbitCbit

M5 = saturationM1 = linear

!BL=1 BL=1

(6T SRAM cell: read)

• Common cell ratios are 1.25 to 21057

0

0

0.2

0.4

0.6

0.8

1

1.2

0.5 1 1.2 1.5 2

cell ratio (CR)

2.5 3

vo

lta

ge

ris

e i

n !

Q (

V)

M1 usually stronger than M2

Vdd = 2.5V, VTn = 0.5V

6/8/2018

18

6T SRAM cell: write

• Q=1 stored in cell, trying to write a 0

• M6 must be more conductive than M4 to pull node Q low enough for M1 & M2

1058

!BL=1 BL=0

WL=1

M1

M4

M5M6

Q=1!Q=0

M2

change value

(6T SRAM cell: write)

• Pullup ratio: PR= (W4/L4)/(W6/L6)

1059

!BL=1 BL=0

WL=1

M1

M4

M5M6

Q=1!Q=0

M2

M4 = saturationM6 = linear

6/8/2018

19

(6T SRAM cell: write)

• Node Q must be pulled below VTn

1060

0

0.2

0.4

0.6

0.8

1

0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4

Pullup Ratio (PR)

Wri

te V

olt

ag

e (

VQ

)

Vdd = 2.5V, |VTp| = 0.5V, p/n = 0.5

Cell sizing

• Keep cell size minimized max. density

• Min-sized pull-down FETs (M1 and M3)

– Requires min-width and longer-than-min-length PTs (M5 and M6) to ensure proper CR

– Sizing of PTs increases load on bit lines

• Min-sized PTs can be used

– Increase width of pull-downs (M1 and M3)

– Reduces load on word lines but increases size

1061

6/8/2018

20

Cell layout of 6T SRAM

1062

VDD

GND

QQ

WL

BLBL

M1 M3

M4M2

M5 M6

!BL BL

WL

M1

M2

M3

M4

M5M6

Q!Q

VDD

WL

GND GND

BL !BL

Image taken from: CMOS VLSI Design: A Circuits and Systems Perspective by Weste, Harris

Memories with multiple ports

• So far: single-port memories: – either read or write

• Dual (or two) port memories: – read and write, 2x read, or 2x write

1063

careful with sizing of PTs

and also access contentions

WL2

BL

2

!BL2 !BL1 BL

1

WL1

Q!Q

6/8/2018

21

Reducing bit-line delay

• Reduce voltage swing

– Needs sense amplifier to restore signal

• Isolate memory cells from bit lines after

sensing pulsed word line

• Isolate sense amplifiers from bit lines after

sensing bit line isolation

1064

Bit line isolation

1065

sense

read sense

amplifier

cross-

coupled latch

bit lines

isolate

sense amplifier outputs

V = 0.1Vdd

V = Vdd

Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

6/8/2018

22

Dynamic RAM (DRAM)Higher density than static RW memories (SRAMs)

1066

DRAM properties

• Requires periodic refresh

• Small cells (1T to 3T) more bits/area

• Slower (used for large main memories)

• Single-ended output (bit-line only)

• Need sense amps for correct operation

• Typically not compatible with regular CMOS technology

1067

6/8/2018

23

4x4 dynamic RAM (DRAM)

1068

A0

WL[0]

A1

A2

column decoder

sense amplifiers

write circuitry

BL

WL[1]

WL[2]

WL[3]

bit line precharge2 bit words

clocking,

control, and

refresh

enable

BL[0] BL[1]

read

precharge

BL[2] BL[2]

required to restore full

swing

WL

3T DRAM cell

• No constraint on device ratios

• Reads are non-destructive

• Value stored at node X: “1”=VWWL-VTn1069

WWL

BL 1

M1X

M3

M 2

C S

BL 2

RWL

VDD

VDD - VT

DVVDD - VTBL 2

BL 1

X

RWL

WWL write

read

reads are inverting!

leakage requires refresh

Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

6/8/2018

24

Layout of 3T DRAM cell

1070

BL2 BL1 GND

RWL

WWL

M3

M2

M1

WWL

BL 1

M1X

M 3

M 2

C S

BL 2

RWL

GND

• About 2x smaller than SRAM cell

Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

1T DRAM cell

1071

M1 X

BL

WL

CsCBL

X Vdd-Vt

WLwrite

�1�

BL Vdd

read

�1�

Vdd/2 sensing

• Write: CS is charged (or discharged) by asserting WL and asserting (or lowering) BL

• Read: Charge distribution between CBL and CS

• Read is destructive refresh after read

requires sense amp for proper functionality

Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

6/8/2018

25

Sense amplifier operation

• Sense amplifier needed for functionality

1072

DV (1)

V (1)

V (0)

t

VPRE

V BL

sense amp activated

word line activated

Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

1T DRAM cell properties

• Output is single-ended (requires special sense amps)• Requires extra capacitor to store state• Not compatible with conventional CMOS processes

1073

M1 word

line

Diffused

bit line

polysilicon

gate

polysilicon

plate

capacitor

Cross-section Layout

metal word line

poly

SiO2

field oxiden+ n+

inversion layer

induced by

plate bias

Poly

6/8/2018

26

Peripheral memory circuitsDecoders, sense amps, etc.

1074

Row decoders (remember Lab 2)

• M to 2M decoder consists of 2M logic gates

• Two main options:

– NAND decoder

– NOR decoder

1075

6/8/2018

27

Hierarchical row decoding

• Multi-stage decoding improves performance1076

• • •

• • •

A 2A 2

A 2A 3

WL 0

A 2A 3A 2A 3A 2A 3

A 3 A 3A 0A 0

A 0A 1A 0A 1A 0A 1A 0A 1

A 1 A 1

WL 1

NAND decoder using

2-input pre-decoders

align decoder output with rows

of memory!

Dynamic NOR row decoder

• Signal goes through at most one FET

• Almost constant propagation delay

1077

Precharge devices

VDD f

GND

WL 3

WL 2

WL 1

WL 0

A 0A 0

GND

A 1A 1

• Precharge all outputs high

• Then, GND inactive outputs

• Active “high” signals

6/8/2018

28

Dynamic NAND row decoder

• All inputs must be low during precharge (prevent VDD and GND short)

• Slower than NOR implementation (3 FETs in series)

1078

• Precharge all outputs high

• Then, discharge active output

• Active “low” signals

WL 3

A 0A 0 A 1A 1

WL 2

WL 1

WL0

V DD

V DD

V DD

V DD

Precharge devices

PT-based column decoder

• Advantage: speed (only 1 extra T in path)

• Disadvantage: large transistor count1079

2-i

np

ut N

OR

dec

oder

A 0

S 0

BL0 BL 1 BL 2 BL 3

A 1

S 1

S 2

S 3

D

• One transistor per bit line

• k*2k transistors in total

• k=10 10240T

SRAMs require 2x the transistors for

BL and !BL

6/8/2018

29

Tree-based column decoder

• Advantage: number of T reduced significantly• Disadvantage: delay increases quadratically with stages

– prohibitive for large decoders– Add buffers, progressive sizing, combination of tree and PTs

1080

• One transistor per bit line

• 2*(2k-1) transistors in total

• k=10 2046T

BL 0 BL 1 BL 2 BL 3

D

A 0

A 0

A 1

A 1

Bit-line precharging: SRAM

1081

• Static: no precharge clock required, but consumes static power (fight against bit-line discharge)

• Clocked: can use large precharge devices and bit line equalization much faster, but large clock load

static pull-up precharge

BL !BL

clock

clocked precharge

!BLBL

VDDVDD VDD VDD

equalization transistor:

equalizes levels on BL and !BL

6/8/2018

30

Sense amplifiers

• Amplifies small swing on bit lines to full rail-to-rail swing needed at memory output

1082

SA

input output

large

small

make as small as possible

Differential sense amplifier

• Directly applicable to SRAMs1083

M4

M1

M5

M3

M2

VDD

BB

SE

Outy

6/8/2018

31

Differential sensing (cont’d)

1084

VDD

VDD

VDD

VDD

BL

EQ

Diff.SenseAmp

(a) SRAM sensing scheme (b) two stage differential amplifier

SRAM cell i

WL i

2xx

VDD

Output

BL

PC

M3

M 1

M 5

M2

M 4

x

SE

SE

SE

Output

SE

x2x 2x

y

y

2y

precharge two stages to be even faster

column decoder would be here

Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

Latch-based sense amp

• EQ used to initialize latch in its meta-stable point

• Once adequate voltage

established SE=1 enables

sense amplifier

• Positive feedback quickly forces output to stable operating point

1085

EQ

VDD

BL BL

SE

SE

6/8/2018

32

Read/write circuitry

1086

D: data (write) bus

R: read bus

W: write signal

CS: column select

(column decoder)

Local W (write):

BL = D, !BL = !D

enabled by W & CS

Local R (read):

R = BL, !R = !BL

enabled by !W & CS

!BL BL

sense

amp

D

W

!R

R

pre

cha

rge

local R/WCS

Reliability and yieldTrade-off noise for density and performance

1087

6/8/2018

33

Reliability and yield

• Semiconductor memories trade-off noise margin for bit density and performance

– Sensitive to noise (crosstalk, supply noise, etc.)

• High-density and large die size causes yield problems:

• Increase yield using error correction and redundancy

1088

A = chip area D = defect density

Yield vs. chip area and process

• Yield curves at different stages of process maturity [Veendricks92]

1089

6/8/2018

34

Soft errors

• Ionizing radiation can cause non-recurrent and nonpermanent errors in memories

1090

WL

BL

VDD

n 1

α -particle

SiO2

+

+

++

++

-

-

--

-

-

Keep critical chargelarger than minimum!

1 particle can cause 1M carriers!

Redundancy in memory structure

• Replace bad row or column with “spare” set by fuse bank

• Solves problem at manufacturing time, but not soft errors1091

memoryarray

column decoder

row

dec

oder

redundantrows

redundantcolumns

rowaddress

columnaddress

fusebank

:

6/8/2018

35

Redundancy and error correction

• Suitable for preventing soft errors

1092Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic

Example: (7,4,3) Hamming code

• Consider 4-bit number [B3,B5,B6,B7]

• Add 3 parity check bits [P1,P2,P4]

• Chose parity bits as follows:

1093

• Error in bit B3 would cause 1st and 2nd parity check to fail

• Implies that bit 3 is in error can be corrected (flipped)

1

1

0

011 = 3