ECE4740: Digital VLSI Design
-
Upload
khangminh22 -
Category
Documents
-
view
0 -
download
0
Transcript of ECE4740: Digital VLSI Design
6/8/2018
1
ECE4740: Digital VLSI Design
Lecture 28: Memories
1024
Semiconductor memoriesContribute significantly to area and power of VLSI circuits
1025
6/8/2018
2
Memory hierarchy in a processor
• Exploit locality: – large amount of cheap memory – sufficient amount of fast, expensive memory
1026
second
level
cache
(SRAM)
datapathsecondary
memory
(Disk)
on-chip components
Re
gF
ile
main
memory
(DRAM)
da
ta.
cach
ein
str.
cach
e
eDRAM
Speed (ns): .1�s 1�s 10�s 100�s 1,000�s
Size (bytes): 100�s k�s 10k�s M�s T�s
Cost: highest lowest
Memories in an ASIC
• Critical data stored in flip-flops within datapath
• Small memories, typically build from standard-cell based static cells (latches or flip-flops)
• Larger amounts of data stored in SRAM macrocells
• Off-chip DRAMs store GBs of data 1027
Datapath
Application-specific integrated circuit
flip-flops
latch
arrays
SRAM
macrocells
ROM
external
DRAM
can have specialized memory
interfaces
fastest data access
fast but regularmemory access
LUTbuilt with logic cells
larger LUTs
big but slow
6/8/2018
3
Semiconductor memories
• 50% (growing) of silicon area is memory in most designs• Often limits throughput or energy efficiency
1028
RWM NVRWM ROM
Random Access Non-Random Access
EPROM Mask-programmed
SRAM (cache, register file)
FIFO/LIFO E2PROM
DRAM Shift register
CAM
FLASH Electrically-programmed
(PROM)
read/write memories non-volatile
Also this: write-only memories
1029
6/8/2018
4
DRAM chip capacity growth
1030
64
256
1,000
4,000
16,000
64,000
256,000
10
100
1000
10000
100000
1000000
1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000
Year of introduction
Kb
it c
ap
acit
y
Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
Bandwidth growth in DRAMs
1031http://www.samsung.com/global/business/semiconductor/file/media/DDR4_Brochure-0.pdf
6/8/2018
5
Memory architecture overview
1032
Word 0
Word 1
Word 2
Word n-1
Word n-2
Storage
Cell
m bit
S0
S1
S2
S3
Sn-2
Sn-1
Input/Output
n words n select signals
Word 0
Word 1
Word 2
Word n-1
Word n-2
Storage
Cell
m bit
S0
S1
S2
S3
Sn-2
Sn-1
Input/Output
A0
A1
Ak-1
decoder reduces # of
Inputs k = log2(n)
Memory architecture details
1033
A0
A1
Aj-1
sense amplifiers
bit line
word line
storage
(RAM) cell
Aj
Aj+1
Ak-1
read/write circuits
column decoder
2k-j
m2j
input/output (m bit)
amplifies bit line swing
selects appropriate word
from memory row
6/8/2018
6
Split large memories in blocks/banks
• Shorter word and bit lines faster
• BlockAddr signal activates only 1 block power reduction
1034
input/output (m bits)
Read-only memories (ROMs)Useful for large look-up tables (LUTs)
1035
6/8/2018
7
Read-only memory cells
1036
WL
BL
WL
BL
1WL
BL
WL
BL
WL
BL
0
VDD
WL
BL
GND
Diode ROM
1: BL=VDD
0: BL=GND
MOS OR ROM
1: BL=VDD-Vt
0: BL=GND
MOS NOR ROM
1: BL=GND
0: BL=VDD
GND
GND
GND
GND
VDD
VDD
does not isolate bit line from word line
4x4 MOS OR ROM cell array
1037
WL [0]
V DD
BL [0]
WL [1]
WL [2]
WL [3]
V bias
BL [1]
pull-down loads
BL [2] BL [3]
V DD
supply line shared to
reduce area
if one WL high, then
output high OR
Image Adapted From: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
6/8/2018
8
4x4 MOS NOR ROM cell array
1038
WL [0]
GND
BL [0]
WL [1]
WL [2]
WL [3]
V DD
BL [1]
pull-up devices
BL [2] BL [3]
GND
pseudo-NMOS NOR gate
if one WL low, then output low NOR
MOS NOR ROM layout
1039
Polysilicon
Metal1
Diffusion (GND)
Metal1 on diffusion
bit lines on Metal 1
1 ROM cell
GND
connected to GND
WL[0]
WL[1]
WL[2]
WL[3]
GND
GND
6/8/2018
9
4x4 MOS NAND ROM
1040
WL [0]
WL [1]
WL [2]
WL [3]
V DD
pull-up devices
BL [3]BL [2]BL [1]BL [0]
word lines high per default
active WL ties BL to
GND
needs no supply lines! smaller only single GND
connection required!
MOS NAND ROM layout
1041
Polysilicon
Diffusion
Metal1 on diffusion
bit lines on Metal 1
1 ROM cell
WL[0]
WL[1]
WL[2]
WL[3]
forms transistor
About 15% smaller than NOR ROM
6/8/2018
10
(Model for NOR ROM)
• Word-line parasitics
– Wire capacitance and gate capacitance
– Wire resistance (polysilicon)
1042
V DD
C bit
rword
cword
WL
BL
• Bit-line parasitics
– Resistance not dominant (metal)
– Drain and gate-drain capacitance
NOR model
Precharged MOS NOR ROM
1043
WL [0]
GND
BL [0]
WL [1]
WL [2]
WL [3]
V DD
BL [1]
pull-up devices
BL [2] BL [3]
GND
pseudo-NMOS NOR gate
precharge
0 1
0 1
1 1 1 10 1 1 0
avoids static power
Image adapted from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
6/8/2018
11
(Model for NAND ROM)
• Word-line parasitics
– Wire capacitance and gate capacitance
– Wire resistance (polysilicon)
1044
• Bit-line parasitics
– Resistance of cascaded transistors dominates
– Drain/source and complete gate capacitance
NAND model
V DD
C L
rword
cword
cbit
rbit
WL
BL
NAND ROM is smaller but also much slower!!!!
Remember: reducing word-line delay
• Use bypasses!
– Drive word line from both sides
– Use metal bypass (better materials)
1045
Driver
Polysilicon word line
Metal word line
WL
Driver
Polysilicon word line
Metal bypass
WL K cells
e.g., use silicides
6/8/2018
12
ROMs for look-up tables (LUTs)
• LUTs used in ASIC designs and some processors– Store fixed program (hard-coded software)– Fast division and square root units– Approximating arbitrary arithmetic functions
• Just set input value to word address and output value is ROM content
• ROM-based approach only useful for large LUTs
1046
N-bit to M-bit LUT
inputvalue
outputvalue
Read-write memories (RAMs)RWM is hard to pronounce
1047
6/8/2018
13
Read-write memories (RAMs)• Static: SRAM
– Data is stored as long as supply is applied– Large cells (6T) fewer bits per area– Fast (used where speed is important)– Differential outputs (BL and !BL)– Use sense amps for better performance– Compatible with CMOS technology
• Dynamic: DRAM– Requires periodic refresh– Small cells (1T to 3T) more bits per area– Slower (used for large main memories)– Single-ended output (BL only)– Need sense amps for correct operation– Not typically compatible with regular CMOS technology1048
Memory timing
1049
read
read cycle
read access read access
write
write cycle
data
write
setup
data valid
write
hold
data written
6/8/2018
14
SRAMs are used everywhere!
• Turbo-decoder ASIC for LTE
• 65nm CMOS
• 129kb SRAM
• Single-port only
• Runs at 300MHz
1050S, Benkeser, Belfanti, & Huang, 2011
4x4 static RAM (SRAM)
1051
A0
!BLWL[0]
A1
A2
column decoder
sense amplifiers
write circuitry
BL
WL[1]
WL[2]
WL[3]
bit line precharge2 bit words
clocking and
control
enable
BL[i] BL[I+1]
read
precharge
6/8/2018
15
2D memory configuration
1052
sense ampssense amps
halves the length of the word and bit lines increased speed
Example: large SRAM
• UltraSparc 512KB cache SRAM
• Split into 4 subarrays
– 4x 128KB subarrays
– Each subarray consists of 16 8KB banks
– Also includes control and cache circuitry
1053
Image taken from: CMOS VLSI Design: A Circuits and Systems Perspective by Weste, Harris
6/8/2018
16
6T SRAM cell
• Cross-coupled inverters sizing critical1054
!BL BL
WL
M1
M2
M3
M4
M5
M6Q
!Q
6T SRAM cell: read
• Read disturb (read upset):
– Bit-line capacitance Cbit can be in pF range
– Limit allowed voltage rise on !Q to not change SRAM cell state
1055
!BL=1 BL=1
WL=1
M1
M4
M5M6
Q=1!Q=0
CbitCbit
6/8/2018
17
(6T SRAM cell: read)
• Cell ratio: CR = (W1/L1)/(W5/L5)
• ∆V = maximum allowed voltage ripple
1056
WL=1
M1
M4
M5M6
Q=1!Q=0
CbitCbit
M5 = saturationM1 = linear
!BL=1 BL=1
(6T SRAM cell: read)
• Common cell ratios are 1.25 to 21057
0
0
0.2
0.4
0.6
0.8
1
1.2
0.5 1 1.2 1.5 2
cell ratio (CR)
2.5 3
vo
lta
ge
ris
e i
n !
Q (
V)
M1 usually stronger than M2
Vdd = 2.5V, VTn = 0.5V
6/8/2018
18
6T SRAM cell: write
• Q=1 stored in cell, trying to write a 0
• M6 must be more conductive than M4 to pull node Q low enough for M1 & M2
1058
!BL=1 BL=0
WL=1
M1
M4
M5M6
Q=1!Q=0
M2
change value
(6T SRAM cell: write)
• Pullup ratio: PR= (W4/L4)/(W6/L6)
1059
!BL=1 BL=0
WL=1
M1
M4
M5M6
Q=1!Q=0
M2
M4 = saturationM6 = linear
6/8/2018
19
(6T SRAM cell: write)
• Node Q must be pulled below VTn
1060
0
0.2
0.4
0.6
0.8
1
0.3 0.6 0.9 1.2 1.5 1.8 2.1 2.4
Pullup Ratio (PR)
Wri
te V
olt
ag
e (
VQ
)
Vdd = 2.5V, |VTp| = 0.5V, p/n = 0.5
Cell sizing
• Keep cell size minimized max. density
• Min-sized pull-down FETs (M1 and M3)
– Requires min-width and longer-than-min-length PTs (M5 and M6) to ensure proper CR
– Sizing of PTs increases load on bit lines
• Min-sized PTs can be used
– Increase width of pull-downs (M1 and M3)
– Reduces load on word lines but increases size
1061
6/8/2018
20
Cell layout of 6T SRAM
1062
VDD
GND
WL
BLBL
M1 M3
M4M2
M5 M6
!BL BL
WL
M1
M2
M3
M4
M5M6
Q!Q
VDD
WL
GND GND
BL !BL
Image taken from: CMOS VLSI Design: A Circuits and Systems Perspective by Weste, Harris
Memories with multiple ports
• So far: single-port memories: – either read or write
• Dual (or two) port memories: – read and write, 2x read, or 2x write
1063
careful with sizing of PTs
and also access contentions
WL2
BL
2
!BL2 !BL1 BL
1
WL1
Q!Q
6/8/2018
21
Reducing bit-line delay
• Reduce voltage swing
– Needs sense amplifier to restore signal
• Isolate memory cells from bit lines after
sensing pulsed word line
• Isolate sense amplifiers from bit lines after
sensing bit line isolation
1064
Bit line isolation
1065
sense
read sense
amplifier
cross-
coupled latch
bit lines
isolate
sense amplifier outputs
V = 0.1Vdd
V = Vdd
Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
6/8/2018
22
Dynamic RAM (DRAM)Higher density than static RW memories (SRAMs)
1066
DRAM properties
• Requires periodic refresh
• Small cells (1T to 3T) more bits/area
• Slower (used for large main memories)
• Single-ended output (bit-line only)
• Need sense amps for correct operation
• Typically not compatible with regular CMOS technology
1067
6/8/2018
23
4x4 dynamic RAM (DRAM)
1068
A0
WL[0]
A1
A2
column decoder
sense amplifiers
write circuitry
BL
WL[1]
WL[2]
WL[3]
bit line precharge2 bit words
clocking,
control, and
refresh
enable
BL[0] BL[1]
read
precharge
BL[2] BL[2]
required to restore full
swing
WL
3T DRAM cell
• No constraint on device ratios
• Reads are non-destructive
• Value stored at node X: “1”=VWWL-VTn1069
WWL
BL 1
M1X
M3
M 2
C S
BL 2
RWL
VDD
VDD - VT
DVVDD - VTBL 2
BL 1
X
RWL
WWL write
read
reads are inverting!
leakage requires refresh
Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
6/8/2018
24
Layout of 3T DRAM cell
1070
BL2 BL1 GND
RWL
WWL
M3
M2
M1
WWL
BL 1
M1X
M 3
M 2
C S
BL 2
RWL
GND
• About 2x smaller than SRAM cell
Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
1T DRAM cell
1071
M1 X
BL
WL
CsCBL
X Vdd-Vt
WLwrite
�1�
BL Vdd
read
�1�
Vdd/2 sensing
• Write: CS is charged (or discharged) by asserting WL and asserting (or lowering) BL
• Read: Charge distribution between CBL and CS
• Read is destructive refresh after read
requires sense amp for proper functionality
Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
6/8/2018
25
Sense amplifier operation
• Sense amplifier needed for functionality
1072
DV (1)
V (1)
V (0)
t
VPRE
V BL
sense amp activated
word line activated
Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
1T DRAM cell properties
• Output is single-ended (requires special sense amps)• Requires extra capacitor to store state• Not compatible with conventional CMOS processes
1073
M1 word
line
Diffused
bit line
polysilicon
gate
polysilicon
plate
capacitor
Cross-section Layout
metal word line
poly
SiO2
field oxiden+ n+
inversion layer
induced by
plate bias
Poly
6/8/2018
26
Peripheral memory circuitsDecoders, sense amps, etc.
1074
Row decoders (remember Lab 2)
• M to 2M decoder consists of 2M logic gates
• Two main options:
– NAND decoder
– NOR decoder
1075
…
…
6/8/2018
27
Hierarchical row decoding
• Multi-stage decoding improves performance1076
• • •
• • •
A 2A 2
A 2A 3
WL 0
A 2A 3A 2A 3A 2A 3
A 3 A 3A 0A 0
A 0A 1A 0A 1A 0A 1A 0A 1
A 1 A 1
WL 1
NAND decoder using
2-input pre-decoders
align decoder output with rows
of memory!
Dynamic NOR row decoder
• Signal goes through at most one FET
• Almost constant propagation delay
1077
Precharge devices
VDD f
GND
WL 3
WL 2
WL 1
WL 0
A 0A 0
GND
A 1A 1
• Precharge all outputs high
• Then, GND inactive outputs
• Active “high” signals
6/8/2018
28
Dynamic NAND row decoder
• All inputs must be low during precharge (prevent VDD and GND short)
• Slower than NOR implementation (3 FETs in series)
1078
• Precharge all outputs high
• Then, discharge active output
• Active “low” signals
WL 3
A 0A 0 A 1A 1
WL 2
WL 1
WL0
V DD
V DD
V DD
V DD
Precharge devices
PT-based column decoder
• Advantage: speed (only 1 extra T in path)
• Disadvantage: large transistor count1079
2-i
np
ut N
OR
dec
oder
A 0
S 0
BL0 BL 1 BL 2 BL 3
A 1
S 1
S 2
S 3
D
• One transistor per bit line
• k*2k transistors in total
• k=10 10240T
SRAMs require 2x the transistors for
BL and !BL
6/8/2018
29
Tree-based column decoder
• Advantage: number of T reduced significantly• Disadvantage: delay increases quadratically with stages
– prohibitive for large decoders– Add buffers, progressive sizing, combination of tree and PTs
1080
• One transistor per bit line
• 2*(2k-1) transistors in total
• k=10 2046T
BL 0 BL 1 BL 2 BL 3
D
A 0
A 0
A 1
A 1
Bit-line precharging: SRAM
1081
• Static: no precharge clock required, but consumes static power (fight against bit-line discharge)
• Clocked: can use large precharge devices and bit line equalization much faster, but large clock load
static pull-up precharge
BL !BL
clock
clocked precharge
!BLBL
VDDVDD VDD VDD
equalization transistor:
equalizes levels on BL and !BL
6/8/2018
30
Sense amplifiers
• Amplifies small swing on bit lines to full rail-to-rail swing needed at memory output
1082
SA
input output
large
small
make as small as possible
Differential sense amplifier
• Directly applicable to SRAMs1083
M4
M1
M5
M3
M2
VDD
BB
SE
Outy
6/8/2018
31
Differential sensing (cont’d)
1084
VDD
VDD
VDD
VDD
BL
EQ
Diff.SenseAmp
(a) SRAM sensing scheme (b) two stage differential amplifier
SRAM cell i
WL i
2xx
VDD
Output
BL
PC
M3
M 1
M 5
M2
M 4
x
SE
SE
SE
Output
SE
x2x 2x
y
y
2y
precharge two stages to be even faster
column decoder would be here
Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
Latch-based sense amp
• EQ used to initialize latch in its meta-stable point
• Once adequate voltage
established SE=1 enables
sense amplifier
• Positive feedback quickly forces output to stable operating point
1085
EQ
VDD
BL BL
SE
SE
6/8/2018
32
Read/write circuitry
1086
D: data (write) bus
R: read bus
W: write signal
CS: column select
(column decoder)
Local W (write):
BL = D, !BL = !D
enabled by W & CS
Local R (read):
R = BL, !R = !BL
enabled by !W & CS
!BL BL
sense
amp
D
W
!R
R
pre
cha
rge
local R/WCS
Reliability and yieldTrade-off noise for density and performance
1087
6/8/2018
33
Reliability and yield
• Semiconductor memories trade-off noise margin for bit density and performance
– Sensitive to noise (crosstalk, supply noise, etc.)
• High-density and large die size causes yield problems:
• Increase yield using error correction and redundancy
1088
A = chip area D = defect density
Yield vs. chip area and process
• Yield curves at different stages of process maturity [Veendricks92]
1089
6/8/2018
34
Soft errors
• Ionizing radiation can cause non-recurrent and nonpermanent errors in memories
1090
WL
BL
VDD
n 1
α -particle
SiO2
+
+
++
++
-
-
--
-
-
Keep critical chargelarger than minimum!
1 particle can cause 1M carriers!
Redundancy in memory structure
• Replace bad row or column with “spare” set by fuse bank
• Solves problem at manufacturing time, but not soft errors1091
memoryarray
column decoder
row
dec
oder
redundantrows
redundantcolumns
rowaddress
columnaddress
fusebank
:
6/8/2018
35
Redundancy and error correction
• Suitable for preventing soft errors
1092Image taken from: Digital Integrated Circuits (2nd Edition) by Rabaey, Chandrakasan, Nikolic
Example: (7,4,3) Hamming code
• Consider 4-bit number [B3,B5,B6,B7]
• Add 3 parity check bits [P1,P2,P4]
• Chose parity bits as follows:
1093
• Error in bit B3 would cause 1st and 2nd parity check to fail
• Implies that bit 3 is in error can be corrected (flipped)
1
1
0
011 = 3