FORMAL VERIFICATION OF A 32-BIT PIPELINED RISC ...

105
FORMAL VERIFICATION OF A 32-BIT PIPELINED RISC PROCESSOR By MOHAMMAD MOSTAFA DARWISH B. Sc. (Computer Engineering) University of Petroleum & Minerals, 1991 A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF APPLIED SCIENCE in THE FACULTY OF GRADUATE STUDIES DEPARTMENT OF ELECTRICAL ENGINEERING We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA July 1994 © MOHAMMAD MOSTAFA DARWISH, 1994

Transcript of FORMAL VERIFICATION OF A 32-BIT PIPELINED RISC ...

FORMAL VERIFICATION OF A 32-BIT PIPELINED RISC PROCESSOR

By

MOHAMMAD MOSTAFA DARWISH

B. Sc. (Computer Engineering) University of Petroleum & Minerals, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF

THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF APPLIED SCIENCE

in

THE FACULTY OF GRADUATE STUDIES

DEPARTMENT OF ELECTRICAL ENGINEERING

We accept this thesis as conforming

to the required standard

E.

THE UNIVERSITY OF BRITISH COLUMBIA

July 1994

© MOHAMMAD MOSTAFA DARWISH, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced

degree at the University of British Columbia, I agree that the Library shall make it

freely available for reference and study. I further agree that permission for extensive

copying of this thesis for scholarly purposes may be granted by the head of my

department or by his or her representatives. ,t is understood that copying or

publication of this thesis for financial gain shall not be allowed without my written

permission.

(Signature)

Department of e’cc4, a.—/

The University of British ColumbiaVancouver, Canada

Date ‘, 19911

DE-6 (2)88)

Abstract

Designing a microprocessor is a significant undertaking. Modern RISC processors are

no exception. Although RISC architectures originally were intended to be simpler than

CISC processors, modern RISC processors are often very complex, partially due to the

prominent use of pipelining. As a result, verifying the correctness of a RISC design is an

extremely difficult task. Thus, it has become of great importance to find more efficient

design verification techniques than traditional simulation. The objective of this thesis is

to show that symbolic trajectory evaluation is such a technique.

For demonstration purposes, we designed and implemented a 32-bit pipelined RISC

processor, called Oryx. It is a fairly generic first-generation RISC processor with a five-

stage pipeline and is similar to the MIPS-X and DLX processors. The Oryx processor

is designed down to a detailed gate-level and is described in VHDL. Altogether, the

processor consists of approximately 30,000 gates.

We also developed an abstract, non-pipelined, specification of the processor. This

specification is precise and is intended for a programmer. We made a special effort not

to over-specify the processor, so that a family of processors, ranging in complexity and

speed, could theoretically be implemented all satisfying the same specification.

We finally demonstrated how to use an implementation mapping and the Voss veri

fication system to verify important properties of the processor. For example, we verified

that in every sequence of valid instructions, the ALU instructions are implemented cor

rectly. This included both the actual operation performed as well as the control for

fetching the operands and storing the result back into the register file. Carrying out this

verification task required less than 30 minutes on an IBM RS6000 based workstation.

11

Table of Contents

Abstract ii

TABLE OF CONTENTS iii

LIST OF TABLES VI.

LIST OF FIGURES Vi

ACKNOWLEDGEMENTS ix

DEDICATION Xi

1 Introduction 1

1.1 Symbolic Trajectory Evaluation 2

1.2 Research Objectives 7

1.3 Related Work 7

1.4 Thesis Overview 11

2 Architecture 13

2.1 Memory Organization 13

2.2 Registers 16

2.2.1 General Registers 16

2.2.2 Multiplication and Division Register 16

2.2.3 Status and Control Registers 16

2.3 Instruction Formats 18

2.4 Addressing Modes 20

2.5 Operand Selection 20

2.6 Co-processor Interface 20

ill

2.7 Instructions.

2.7.1 Arithmetic Instructions .

2.7.2 Logical Instructions

2.7.3 Control Transfer Instructions

2.7.4 Miscellaneous Instructions

2.8 Interrupts

3 Formal Specification

3.1 Abstract Model

3.2 Formal Specification

3.2.1 Example I: Register File Specification

3.2.2 Example II: PC Specification . .

4 Implementation

4.1 Design Overview

4.2 The Control Path

4.3 The Data Path

5 Formal Verification

5.1 Verification Difficulty

5.2 Property Verification

5.3 Implementation Mapping

5.3.1 Cycle Mapping

5.3.2 Instant Abstract State Mapping . .

5.3.3 Combined Mapping

5.4 Verification Condition

5.5 Practical Verification

22

24

27

28

30

30

32

32

35

38

41

42

42

45

57

67

6768697070737474

iv

5.6 Design Errors.76

6 Concluding Remarks and Future Work 77

Bibliography 79

A The FL Verification Script 81

V

List of Tables

2.1 PSW bits definition 17

2.2 The Oryx processor instruction set 21

3.3 Function abbreviations 37

3.4 Operation definitions 37

4.5 The Oryx processor pipeline action during exceptions 48

4.6 The Oryx processor pipeline action during instruction squashing 50

4.7 Pipeline action during instruction processing phases 52

5.8 Some experimental results 76

vi

List of Figures

1.1 The state space partial order 3

1.2 The Voss verification system 6

1.3 A simple latch 9

2.4 Little Endian byte ordering 13

2.5 A system block diagram of the Oryx processor 14

2.6 The Oryx processor instruction formats 19

3.7 The abstract model of the Oryx processor 36

4.8 The Oryx processor pipeline 42

4.9 i&2 logic diagram 44

4.10 Program counter chain logic diagram 45

4.11 ‘b2FC logic diagram 46

4.12 The control path logic diagram 47

4.13 The exception FSM logic diagram 48

4.14 The squash FSM logic diagram 49

4.15 The PC chain FSM logic diagram 51

4.16 A typical instruction fetch timing diagram 51

4.17 ICache page fault timing diagram 53

4.18 The PC unit logic diagram 54

4.19 The TakeBranch signal logic diagram 55

4.20 A typical memory read timing 56

vii

4.21 The Oryx processor data path logic diagram 58

4.22 The Oryx processor register file logic diagram 59

4.23 A register file memory bit logic diagram 59

4.24 The NoDest signal logic diagram 60

4.25 The ALU logic diagram 61

4.26 The AU logic diagram 61

4.27 The LU logic diagram 62

4.28 The shifter logic diagram 63

4.29 A typical MD bit logic diagram 64

4.30 PSWCurrent-PSWOther pair 65

5.31 Time mapping 70

5.32 State mapping 71

5.33 Register file mapping 72

5.34 Combined mapping 73

viii

Acknowledgments

There are several people who were major forces in the completion of this work. First

and foremost, warm and very grateful thanks go to Carl Seger who introduced me to

formal verification and has nurtured my knowledge with amazing generosity and patience.

Rabab Ward believed in me and encouraged me to carry on with my work despite all

the difficulties. Carl and Rabab have provided support in times of crisis and criticism in

times of confidence.

I am heavily obliged to André lvanov, Scott Hazelhurst, and Alimuddin Mohammad

for donating their time, the most precious of all gifts, to help me improve the presentation.

Many ideas in this work come from fascinating discussions with Mark Greenstreet,

Andrew Martin, Nancy Day, John Tanlimco, and Gary Hovey.

Warm thanks go to my friends who made my life more enjoyable. In particular, I would

like to mention Ammar Muhiyaddin who has offered me his friendship over the years,

Nasir Aljameh who has always remembered his youth friend despite the distance, and

Rafeh Hulays who has been a good friend since the very first day I arrived in Vancouver.

I also would like to mention Mohammad Estilaei, Catherine Leung, Hélène Wong, Neil

Fazel, Mike Donat, Thomas Ligocki, Kevin Chan, Barry Tsuji, Andrew Bishop, Matthew

Mendelsohn, William Low, Kendra Cooper, Peter Meisl, Chris Cobbold, Robert Link,

and Jeff Chow. To those I have surely forgotten, I must apologize.

Special thanks go to the Original Beanery Coffee House where I wrote most of this

work. Gordon Schmidt has been kind enough to offer me a quiet place where I could sit

down and do all the thinking. I will let you discover the secrets of that place on your

own.

ix

I have been fortunate in obtaining support for my studies from the Semiconductor

Research Corporation as well as the Department of Electrical Engineering and the De

partment of Computer Science.

Finally, and above all, I owe a great debt of gratitude to my parents, Bedour and

Mostafa who inspired my life and pushed me to the limit. This work is my way of saying

thank you mother and father.

x

To my mother and father

xi

Chapter 1

Introduction

The arrival of very large scale integration (VLSI) technology has paved the way for

innovative digital system designs, like reduced instruction set computer (RISC) processors.

However, the design of RISC processors is very complex and demands considerable effort,

partially due to the problems that designers are faced with in specifying and verifying

this type of system.

From a marketing point of view, there is no doubt that any enterprise should release

new products on time in order to keep its market share. Unfortunately, the design time

of RISC processors, like other digital systems, is strongly influenced by the number of

design errors. The MIPS 4000, for example, was estimated to cost $3-$8 million for

each month of additional design time and 27% of the design time was spent verifying

the design [8]. This example clearly indicates the importance of finding better ways to

uncover design errors as early as possible to avoid any delay in releasing the product.

Today, design verification is done using (traditional) simulation [10]. A good design

of a RISC processor is composed of several smaller modules. Each module is simulated

using switch-level and gate-level simulators by trying relatively few test cases, one at a

time, and checking whether the results are correct. Towards the end of the design phase,

all modules are put together and the final design is simulated for an extended period of

time. This simulation is usually done using behavioral models of the components rather

than at the switch or gate level. A common approach is to run some reasonably large

programs (e.g. boot UNIX) on the simulated design. It is very common to spend months

1

Chapter 1. Introduction 2

simulating the final design on large computers [12].

In spite of this, traditional simulation is still inadequate for verifying RISC processors

as many design errors can be left undetected [1]. The deficiency of traditional simulation

is partially due to the introduction of aggressive pipelining and concurrently-operating

sub-systems, which make predicting the interactions between logically unrelated system

activities very difficult. For example, it is often impossible to simulate all possible instruc

tion sequences that might lead to a trap, because instructions, which could be logically

unrelated, all of a sudden are being processed in parallel and may even be processed out

of order.

In addition to the above, even the best design team can make mistakes. For example,

an early revision of the DLX processor contained a flaw in the pipeline despite that the

design was reviewed by more than 100 people and was taught at 18 universities. The

flaw was detected when someone actually tried to implement the processor. [11]

From the above discussion, we have established the need to find a more effective

technique for verifying RISC processors. When we say effective, we mean three things:

fast, automated (i.e. requires minimum human interveiltion), and thorough (i.e. uncover

all those design errors that cause the system to behave in a way other than according to

the system specification). Ideally, the method should allow the design team to be able

to go home at the end of the day with good confidence that the design is going to work

rather than having to come the next day to try yet another simulation run to uncover

more design errors.

1.1 Symbolic Trajectory Evaluation

A way to improve traditional simulation is to consider symbols rather than actual values

for the design under simulation, hence, the term symbolic simulation is used to distinguish

Chapter 1. Introduction 3

TZN

0 1

NZx

Figure 1.1: The state space partial order

it from traditional simulation.

Using symbolic simulation for circuit verification was first proposed by researchers at

IBM Yorktown Heights in the late 1970’s. However, the power of symbolic simulation

was not fully utilized until an efficient method of manipulating symbols emerged when

ordered binary decision diagrams (OBDD’s) [3] were developed for representing Boolean

functions.

Bryant and Seger [15] developed a new generation of symbolic simulators based on a

technique they called symbolic trajectory evaluation (STE). In this technique, the state

space of a system is embedded in a lattice {0,1,X,T}. The elements of the lattice are

partially ordered by their information content as shown in Figure 1.1, where X represents

no information about the node value, 0 and 1 represent fully defined information, and

T represents over-constrained values. As a result, the behavior of the system can be

expressed as a sequence of points in the lattice determined by the initial state and the

functionality of the system. The partial order between sequences of states is defined by

extending the partial order on the state space in the natural way.

The model structure of the system is a tuple M = [(S, ), Y], where (5, ) is

a complete lattice (S being the state space and IE a partial order on 5) and Y is

a monotone successor function Y: S —÷ S. A sequence is a trajectory if and only if

Y(u) 1 0.i+1 for i 0.

Chapter 1. Introduction 4

The key to the efficiency of trajectory evaluation is the restricted language that can be

used to phrase questions about the model structure. The basic specification language is

very simple, but expressive enough to capture the properties to be checked for in verifying

a RISC processor.

A predicate over $ is a mapping from $ to the lattice {false, true} (where false true).

Informally, a predicate describes a potential state of the system (e.g. a predicate might be

(A is x) which says that node A has the value x). A predicate is simple if it is monotone

and there is a unique weakest s E $ for which p(s) = true. A trajectory formula is defined

recursively as:

1. Simple predicates: Every simple predicate over $ is a trajectory formula.

2. Conjunction: (F1AF2) is a trajectory formula if F1 and F2 are trajectory formulas.

3. Domain restriction: (e —* F) is a trajectory formula if F is a trajectory formula

and e is a Boolean expression over some set of Boolean variables.

4. Next time: (NF) is a trajectory formula if F is a trajectory formula.

The truth semantics of a trajectory formula is defined relative to a model structure and

a trajectory. Whether a trajectory 5 satisfies a formula F (written as &M

F) is given

by the following rules:

1. g0g I=MP iffp(a°) = true.

2. o 1M (F1 A F2) iff o M F1 andM F2

3. ° M (e —* F) Hf (e) (u M F), for all mapping Boolean expressions to

{ true,false}.

4. u°&MNFiff&MF.

Chapter 1. Introduction 5

Given a formula F, there is a unique defining sequence, 5F, which is the weakest se

quence which satisfies the formula . The defining sequence can usually be computed very

efficiently. From 6F a unique defining trajectory, TF, can be computed (often efficiently).

This is the weakest trajectory which satisfies the formula — all trajectories which satisfy

the formula must be greater than it in terms of the partial order.

If the main verification task can be phrased in terms of “for every trajectory u that

satisfies the trajectory formula A, verify that the trajectory also satisfies the formula C”,

verification can be carried out by computing the defining trajectory for the formula A

and checking that the formula C holds for this trajectory. Such a result is written as

=M [A=>>C]. The fundamental result of STE is given below.

Theorem 1.1.1 Assume A and C are two trajectory formulas. Let TA be the defin

ing trajectory for formula A and let 6c be the defining sequence for formula C. Then

=M[A=>>C] iff6cTA.

A key reason why STE is an efficient verification method is that the cost of performing

STE is more dependent on the size of the formula being checked than the size of the

system model.

The Voss system [13], a formal verification system developed at the University of

British Columbia by Carl Seger, supports STE. Conceptually, the Voss system consists

of two major components, as shown in Figure 1.2.

The front-end of the Voss system is an interpreter for the FL language, which is the

command language of the Voss system. The FL language is a a strongly typed, polymor

phic, and fully lazy language. Moreover, the FL language has an efficient implementation

of OBDD’s built in. In other words, every object of type Boolean in the system is in

ternally represented as an OBDD. A specification is written as a program, which when1 “Weakest” is defined in terms of the partial order.

Chapter 1. Introduction 6

Figure 1.2: The Voss verification system

executed builds up the simulation sequence necessary for verifying the circuit.

The back-end of the Voss system is an extended symbolic simulator with very com

prehensive delay and race analysis capabilities, which uses an externally generated finite

state description to compute the needed trajectories. As shown in Figure 1.2, this de

scription can be generated from a variety of hardware description languages (HDL’s).

Verification script

Verification libraries

SymbolicOK/Counter example

Chapter 1. Introduction 7

1.2 Research Objectives

The goal of this research was to demonstrate that STE is an effective technique for

verifying RISC processors. To demonstrate the technique, we applied it to the Oryx

processor, which was designed and implemented for verification purposes. The Oryx

processor is a fairly generic model for RISC processors. The model is intended to capture

the RISC fundamental characteristics and it is very similar to other RISC processors, such

as the DLX [9] and the MIPS-X [5]. The Oryx processor is described in more details in

later chapters.

1.3 Related Work

In the formal methods community, there have been several efforts to verify RISC pro

cessors. The verification process of a RISC processor involves first writing a formal

specification of the processor. A hardware model of the RISC processor is then com

pared against the formal specification to prove the correctness of the design. So far, most

researchers have used extremely simple architectures as hardware models.

Tahar and Kumar [18] proposed a methodology for verifying RISC processors based

on a hierarchical model of interpreters. Their model is based on abstraction levels, which

are introduced between the formal specification and the circuit implementation. The

abstraction levels are called stage level, phase level, and class level. A stage instruction is

a set of transfers which occur during the corresponding pipeline segment, while a phase

instruction is the set of sub-transfers which occur during that clock phase. The top

level of the model is the class instruction, which corresponds to the set of instructions

with similar semantics. The model can be used to successively prove the correctness of

all phase, stage, and class instructions by instantiating the proofs for each particular

instruction. The specification of the DLX processor was described in HOL and parts of

Chapter 1. Introduction 8

the proof tasks were carried out.

Srivas and Bickford [16] used Clio, a functional language based verification system,

to prove the correctness of the Mini Cayuga processor. The Mini Cayuga is a three-stage

instruction pipelined RISC processor with a single interrupt. The correctness properties

of the Mini Cayuga were expressed in the Clio assertion language. The Clio system

supports an interactive theorem prover, which can be used to prove the correctness of

the properties.

The Mini Cayuga was specified at two levels, abstract specification and realization

specification. The abstract specification described the instruction level behavior from a

programmer’s point of view, while the realization specification described the processor’s

design by defining the activities that occur during a single clock cycle.

The abstract model of the Mini Cayuga is specified as a function that generates an

infinite trace of states over time as the processor executes instructions. The abstract

state of the Mini Cayuga is represented as a tuple with four components: the state of

the memory, the state of the register file, the next program counter, and the current

instruction. The progression of state as the processor executes instructions is done with

the help of the Step function, which takes as parameters a state and an interrupt list. This

function defines the effect of executing a single instruction on the state of the processor,

including the effect of the possible arrival of one or more interrupts during execution.

The correctness criterion of the Mini Cayuga requires proving that a realization state

and an abstract state of the processor are equivalent under some conditions.

Burch and Dill [4] described an automated technique for verifying the control path

of RISC processors assuming the data path computations are correct. They showed an

approach for extracting the correspondence between the specification and the implemen

tation. The statement of correspondence is written in a decidable logic of propositional

connectives, equality, and uninterpreted functions. Uninterpreted functions are used to

Chapter 1. Introduction 9

L

D

Figure 1.3: A simple latch

represent combinational logic, while propositional connectives and equality are used in

describing control in the specification and the implementation, and in comparing them.

This technique compares a pipelined implementation to an architectural description.

The description is written in a simple HDL based on a small subset of Common LISP and

is compiled through a kind of symbolic simulator to a transition function, which takes as

arguments a state and the current inputs, and returns the next state.

The Ph.D. dissertation of Beatty [1] addressed the processor correctness problem and

provided a step toward a solution. Because of the similarities between Beatty’s work and

our work, we will explain his method in some detail. To best illustrate the method, we

will consider the same example Beatty used in his dissertation.

The example starts by describing an extremely simple nMOS implementation of the

latch, which is considered to be the simplest sequential circuit. As shown in Figure 1.3,

the latch has a control input, a data input, and an output, which are labeled L, D, and

Q, respectively. The D input provides the data to be loaded when L is pulsed to load

the latch. The capacitance on node S retains the latched value when the latch is holding

the data. In other words, the latch performs two operations: load and hold.

Beatty’s approach to verification is through symbolic simulation accompanied with

the reliability of mathematics to establish correctness results. He started by examining

S

Q

Chapter 1. Introduction 10

the nature of circuits and his simulation model of them to understand their properties,

which yielded requirements on his verification method.

By considering the latch and its timing diagram more closely, Beatty observed that

the basic timing of the latch is defined by the operations the latch performs and not by

clock signals. In other words, operations dictate their own timing and there is no need

to distinguish clock signals from other input signals. Therefore, the beginning and the

ending of each operation must be marked. This can be done by identifying the nominal

start and the nominal end of each operation.

Conceptually, the start marker of each operation is aligned with the end marker of

its predecessor to perform a sequence of operations. Moreover, these markers provide

a way to define the timing of state storage. For example, the state of the latch can be

considered at a particular time, such as at the nominal beginning of an operation.

The specification model Beatty used is an abstract Moore machine, structured as a

set of symbolic assertions, where each assertion corresponds to an operation. By using

symbols, a single assertion can concisely represent the operation for all possible data

values. The latch can be specified with two assertions:

((operation = load) A (D = b) 4 (Q = b)

((operation = hold) A (Q = c) z. (Q = c)

The left-hand side of the relation denotes the conditions on the system in the

current state, while the right-hand side denotes the conditions in its successor state.

Beatty showed that specifications expressed in such a language define an abstract machine

and the checks he made between specifications and circuits entail a formal relationship

between machines. The specifications are made sufficiently abstract, hence, they include

no particular details of a specific circuit implementation. Consequently, the same high

level specification can be used to verify several circuit implementations by providing some

Chapter 1. Introduction 11

formal relation between the specification and the circuit implementation.

Beatty designed a formal language to support such a style of specifications. The

language consists of writing assertions. Each assertion consists of two parts, an antecedent

and a consequent, which are written in a restricted logic. There are two kind of variables,

state and case. Case variables appear in the specifications, but are not components of

the state space. They simply keep the specifications concise. The abstract specification

of the latch is given bellow:

SMALversion 0 specification latchtypes

value word 1;operation enumeration load, hold.

state

op : operation;D value;

Q value.assertions

op = load /\ D = b:value > Q = b;op = hold /\ Q = c:value > Q = c.

Beatty used his methodology to verify the Hector, which is a 16-bit processor. Though

the Hector is a non-pipelined processor, the methodology can be applied to pipelined

systems as well. The physical layout of the Hector was extracted and used as a model

in the symbolic simulation. This can be a tedious and time consuming process, which

explains why Beatty did not use a more complex system to apply his methodology.

1.4 Thesis Overview

This thesis consists of six chapters with one appendix. In Chapter 2, we give an informal

description of the Oryx processor. This description is what usually can be found in

any programmer’s manual. Of course, this description is vague and cannot be used for

verifying the Oryx processor as it is just a natural language description of what the

Oryx processor does. In Chapter 3, we provide a more precise description of the Oryx

Chapter 1. Introduction 12

processor based on an abstract model. The abstract model is used to reason about the

behavior of the Oryx processor and to facilitate the verification process. In Chapter

4, we describe in details the implementation of the Oryx processor. In Chapter 5, we

explain the verification process and provides some experimental results. In Chapter 6,

we conclude the thesis by discussing some future work. The FL verification script of the

Oryx processor can be found in Appendix A.

Chapter 2

Architecture

This chapter describes the architectural features of the Oryx processor from a program

mer’s point of view. The chapter contains a section for each feature of the architecture

normally visible to the machine language programmer and the compiler writer.

2.1 Memory Organization

The physical memory of the Oryx processor is organized as a sequence of words. A

word is four bytes occupying four consecutive addresses. Each word is assigned a unique

physical address, which ranges from 0 to 232— 1 (4 Gigawords).

The byte ordering within a word is Little Eridian as shown in Figure 2.4. A word

contains 32 bits. The bits of a word are numbered from 0 through 31. The byte containing

bit 0 of a word is the least significant byte and the byte containing bit 31 of a word is

the most significant byte.

The Oryx is a cache-oriented processor. A system block diagram of the Oryx pro

cessor is shown in Figure 2.5. In order to achieve high performance, the Oryx processor

Byte 3 Byte 2 Byte 1 Byte 0

31 0

Figure 2.4: Little Endian byte ordering

13

Chapter 2. Architecture 14

_____

RESULT BUS

H H

__

rDATA

REGISTERINSTRUION

UNIT

RLE 1 r UNIT

MEMORY

CACHE MANAGEMENT

UNIT

BUS CONTROL

MEMORY

CACHE MANAGEMENT

UNIT

BUS CONTROL

IT MEMORY BUS

I ORYX

INTEGER UNIT

I I SOURCE 1 BUS

SOURCE 2 BUS

BUS CONTROL BUS CONTROL

UFigure 2.5: A system block diagram of the Oryx processor

Chapter 2. Architecture 15

is provided with two high speed cache memory modules. One cache is dedicated to in

structions and the other to data. It is left to the system designer to decide what cache

memory sizes should be used. This is usually a performance-cost tradeoff.

The Oryx processor is designed with the assumption that the processor does not

directly address main memory. When the Oryx processor requires instructions or data,

it generates a request for that information and sends it to a memory management unit

(MMU). If the MMU does not find the necessary information in the cache memory, then

the Oryx processor waits while the MMU collects the information (i.e. instructions or

data) from main memory and moves it into the cache memory.

The Oryx processor is designed to support virtual memory. In other words, the Oryx

processor operates under the illusion that the system’s disk storage is actually part of

the system’s main memory.

The physical address space is broken into fixed blocks called pages. An address

generated by the Oryx processor is called a logical address. The logical address space is

mapped into the physical address space which is, in turn, mapped into some number of

pages.

A page may be either in main memory or on disk. When the Oryx processor issues

a logical address, it is translated into an address for a page in memory or an interrupt

is generated. The interrupt gives the operating system a chance to read the page from

disk, update the page mapping, and restart the instruction.

To provide the protection needed to implement an operating system, the physical

address space is divided into system space and user space. The Oryx processor reserves

the first 2 Gigawords of the physical memory for executing system programs and the

other 2 Gigawords for executing user programs.

Chapter 2. Architecture 16

2.2 Registers

The Oryx processor has thirty six registers, which are visible to the programmer. These

registers may be grouped as:

• General registers.

• Multiplication and division register.

• Status and control registers.

2.2.1 General Registers

The Oryx processor has thirty two 32-bit general registers R0,R1,. . . , R31. For architec

tural reasons, register R0 is permanently hardwired to a zero value. This register can be

read only. A write attempt into register R0 has no effect. Registers R1 to R31 are used

to hold operands for arithmetic and logical operations. They may also be used to hold

operands for address calculations.

2.2.2 Multiplication and Division Register

The multiplication and division (MD) register is used to facilitate the implementation of

shift-and-add integer multiplication and division algorithms. The MD register stores the

32-bit value of the multiplier for multiplication operations and the dividend for division

operations.

2.2.3 Status and Control Registers

The Oryx processor has one status and two control registers, which report and allow

modification of the state of the processor.

Chapter 2. Architecture 17

S CRA queue shift enableE Non-maskable interrupt flagI Maskable interrupt flagM Interrupt maskV Overflow mask0 Overflow flagU User/System modeIPF Instruction page fault flagDPF Data page fault flag

Table 2.1: PSW bits definition

Program Counter Register

The program counter (PC) register holds a 32-bit logical address of the next instruction

to be executed. The PC is not directly available to the programmer. It is controlled

implicitly by control-transfer instructions and interrupts.

Control Return Address Registers

The Oryx processor has four control return address (CRA) registers CRAO,CRA1,CRA2,

and CRA3. These registers form a queue with the CRA3 register being the head of the

queue and the CRA0 register being the tail. Every clock cycle, the value of the PC

register is shifted in the queue. When an interrupt occurs, the shifting of the queue is

disabled to give the programmer a chance to save the CRA registers. The actual content

of the CRA registers is implementation dependent.

Program Status Word Register

The program status word (PSW) register contains nine status bits, which control certain

operations and indicate the current state of the processor. Table 2.1 shows the definition

of these bits within the P5W register. The programmer (in supervisor mode) can set the

Chapter 2. Architecture 18

PSW register bits, which are also set by the hardware on an interrupt to a pre-determined

value.

2.3 Instruction Formats

The Oryx processor distinguishes four instruction formats: memory, branch, compute,

and compute immediate. An instruction format has various fixed size fields as shown

in Figure 2.6. The type and opcode fields are always present. The other fields may or

may not be present depending on the operation. The fields of an instruction format are

described below:

• Type: is a 2-bit field used to identify the instruction format.

• Opcode: is a 3-bit field used to specify the operation performed by the instruction.

Several instructions may have the same opcode. In this case, some bits of other

fields are used to distinguish between instructions.

• Source Register Specifiers: are 5-bit fields used to specify source register(s).

Instructions may specify one or two source registers. In memory instruction format,

one source register is used for address calculations.

• Destination Register Specifier: is also a 5-bit field used to specify a destination

register into which the result of an operation will be moved.

• Sq (Squash) Bit: is a 1-bit field used in branch instruction format only. When

set, it indicates that the branch is statically predicted to be taken.

• Offset: is a 17-bit field used in memory instruction format. This field holds a

signed value which is used to generate the logical address of a word in memory.

Chapter 2. Architecture 19

Memory

iol Op Srcl Dest Offset (17)

3130 27 22 17 0Branch

00 Cond Srcl Src2 sq Disp (16)

3130 27 22 17 16 0Compute

01 Op Src1 Src2 Dest Func(12) I3130 27 22 17 12 0

Compute Immediate

11 Op Src 1 Dest Immed (17)

3130 27 22 17 0

Figure 2.6: The Oryx processor instruction formats

• Disp (Displacement): is a 16-bit field used in branch instruction format. This

field is similar to the offset field except that it is used to determine the branch

target address.

• Func (Function): is a 12-bit field used in compute instruction format. When

compute instructions share the same opcode, this field is used to further define the

functionality of the instructions.

• Immed (Immediate): is a 17-bit field used in compute immediate instruction

format. This field holds a signed two’s complement value of an immediate operand.

The value is sign-extended to form a 32-bit value.

Chapter 2. Architecture 20

2.4 Addressing Modes

The Oryx processor supports only one addressing mode. The logical address is obtained

by adding the content of a base register to the address part of the instruction. The base

register is assumed to hold a base address and the address field, also known as the offset,

gives a displacement relative to this base register.

2.5 Operand Selection

An Oryx processor instruction may take zero or more operands. Of the instructions which

have operands, some specify operands explicitly and others specify operands implicitly.

An operand is held either in the instruction itself, as in compute immediate instructions,

or in a general register. Since no memory operands are allowed, access to operands is

very fast as they are available on-chip.

2.6 Co-processor Interface

The Oryx processor’s memory instructions can communicate with other types of hardware

besides the memory system. This is used to implement a simple co-processor interface.

Co-processor instructions use the memory instruction format. These instructions are

used to transfer data between the processor registers and the co-processor registers. It

is also possible to load and store the co-processor registers without using the processor

registers as an intermediate step.

The instruction bits for the co-processor are placed in the offset field of the memory

instruction. The co-processor can read these bits when they appear on the address lines

of the processor.

Chapter 2. Architecture 21

Instruction Operands OperationMOV Srcl,Dest Dest:=SrclMOVFRS SpecReg,Dest Dest: =SpecRegMOVTOS Srcl ,SpecReg SpecReg: =SrclMOVFRC Coplnstr,Dest Dest : =CopRegMOVTOC Coplnstr,Src2 CopReg: =Src2ALUC Coplnstr Cop executes CoplnstrLD X[Srcl],Dest Dest:=M[X+Srcl]ST X[SrclJ,Src2 M[X+Srcl]:= Src2LDT X[Srcl],Dest Dest:=M[X+SrcljSTT X[Srcl],Src2 M[X+Srcl]:=Src2LDF X[Srcl] ,fDest fDest : =M [X+SrclJSTF X[Srcl] ,fSrc2 M[X+Srcl] :=fSrc2ADD Srcl ,Src2,Dest Dest:=Srcl+Src2ADDI Srcl,#N,Dest Dest:=N+SrclSUB Srcl,Src2,Dest Dest:=Srcl-Src2SUBNC Srcl ,Src2,Dest Dest:=Srcl+Src2DSTEP Srcl,Src2,DestMSTART Src2,DestMSTEP Srcl,Src2,DestAND Srcl,Src2,Dest Dest:=Srcl bitwise and Src2BIC Srcl,Src2,Dest Dest:=J bitwise and Src2NOT Srcl,Dest Dest:=SrclOR Srcl,Src2,Dest Dest:= Srcl bitwise or Src2XOR Srcl,Src2,Dest Dest:= Srcl bitwise exclusive or Src2SRA Srcl,Dest Dest:=Srcl shifted right 1 positionSLL Srcl,Dest Dest:=Srcl shifted left 1. positionSRL Srcl,Dest Dest:=Srcl shifted right 1 positionIRET PC:= CRA3JUMP Srcl ,Dest PC:=Srcl ,Dest:=PC+1BEQ,BEQSQ Srcl,Src2,Label PC:=PC+Label if Srcl=Src2BGT,BCTSQ Srcl,Src2,Label PC:=PC+Label if Srcl>Src2BLT,BLTSQ Srcl,Src2,Label PC:=PC+Label if Srcl<Src2BNE,BNESQ Srcl,Src2,Label PC:=PC+Label if Src1LSrc2TRAP VectorNOP RO:=RO+RO

Table 2.2: The Oryx processor instruction set

Chapter 2. Architecture 22

2.7 Instructions

The Oryx processor provides programmers with a simple instruction set, which can be

used to write application programs for the processor. The Oryx processor instructions

are listed in Table 2.2. They are grouped by categories of related functionality. This

section describes the operation of each instruction and mentions the operands required.

Data Movement Instructions

These instructions provide convenient ways for transferring word quantities. They come

in two types:

1. Register-Register instructions.

2. Register-Memory instructions.

Register-Register Instructions

These instructions are used for transferring data along any of the following paths:

• Between general registers.

• Between general registers and special registers.

• Between general registers and co-processor registers.

The MOV (Move) instruction transfers a word between general registers. This in

struction, like the other instructions of its type, takes as operands a source register

address R and a destination register address Ri,, where 0 x, y <31.

The MOVFRS (Move From Special) instruction transfers a word from a special

register (e.g. MD register) to a general register. This instruction is useful for reading

special registers.

Chapter 2. Architecture 23

The PSW register is a special case as it has oniy nine bits. When this instruction is

executed, the high-order bits (i.e. bits 9 to 31) of the destination register are filled with

zeros.

The MOVTOS (Move To Special) instruction transfers a word from a general register

to a special register.

In the special case of the PSW register, only the low-order bits (i.e. bits 0 to 8) of the

general register are transferred. In user mode, the PSW is read-only where in supervisor

mode it is both readable and writable. The high-order bits (i.e. bits 9 to 31) are ignored.

The MOVTOC (Move To Co-processor) instruction transfers a word from a general

register to a co-processor register. This instruction is useful for loading the co-processor

registers.

The MOVFRC (Move From Co-processor) instruction transfers a word from a

processor register to a general register. This instruction is used to move the result of a co

processor operation to the processor when the co-processor has finished the computation.

The ALUC (ALU Co-processor) instruction provides the co-processor with some

instruction bits. This instruction results in the co-processor performing an operation as

specified by the instruction bits.

Register-Memory Instructions

In a way similar to previous instruction group, these instructions are useful for transfer

ring data along any of these paths:

• Between general registers and cache memory.

• Between general registers and main memory.

• Between co-processor registers and main memory.

Chapter 2. Architecture 25

Addition and Subtraction Instructions

These instructions take two source operands and a destination operand. The source

operands are two general registers. The ADDI (Add Immediate) instruction is an excep

tion. The second operand of this instruction is an immediate value. The result of the

computation is moved into the destination operand which can be any general register.

The PSW 0 status bit is updated according to the result of the operation.

The ADD (Add) replaces the destination operand with the sum of the two source

operands.

The ADDI (Add Immediate) computes the sum of a source operand and a 17-bit

sign extended immediate value.

The SUB (Subtract) subtracts the second source operand from the first source

operand and replaces the destination operand with the result.

The SUBNC (Subtract No Carry In) subtracts the second source operand from the

first source operand and replaces the destination operand with the result minus one.

Multiplication and Division Instructions

The Oryx processor supports signed multiplication. However, only unsigned division is

supported as signed division requires complex hardware.

Multiplication is done with the simple 1-bit shift-and-add algorithm except that the

computation is started from the most significant bit instead of the least significant bit of

the multiplier.

The MSTART (Multiplication Start) initializes the signed multiplication algorithm.

This instruction requires one source operand and a destination operand which can be

general registers only. The source register holds the multiplicand value and the multipli

cation result will be in the destination register.

Chapter 2. Architecture 24

The LD (Load) instruction trallsfers a word from a location in cache memory to

a general register. The instruction, like the other instructions of its type, takes three

operands: a base address register, a 17-bit offset, and a destination register address. The

base address register can be any of the processor general registers.

There is a delay of one instruction between a load instruction and the availability of

its result. The programmer must take this load delay into consideration.

The ST (Store) instruction transfers a word from a general register to a location in

cache memory.

The LDT (Load Through) instruction transfers a word from a location in main

memory to a general register. It requests the MMU to access main memory directly

without first looking in the cache memory.

This instruction is needed for memory mapped I/O. It also helps improve the perfor

mance of the system if the programmer knows in advance that the word is not available

in the cache memory. This way, the programmer can avoid the penalty of a cache miss.

The STT (Store Through) instruction transfers a word from a general register to a

location in main memory without writing it in the cache memory.

The LDF (Load Float) instruction transfers a word from a location in main memory

to a co-processor register.

The STF (Store Float) instruction transfers a word from a co-processor register to a

location in main memory.

2.7.1 Arithmetic Instructions

The arithmetic instructions of the Oryx processor operate on numeric data encoded in

binary. The Oryx processor supports both signed and unsigned binary integers. Two’s

complement representation is used to encode negative numbers.

Chapter 2. Architecture 26

For signed multiplication, this instruction tests the most significant bit (MSB) of

the MD register which holds the multiplier value. If this bit is equal to one, then the

multiplicand is subtracted from zero and the result is moved into the destination register.

Otherwise, the destination register is initialized to zero. The MD register is shifted 1-bit

to the left.

The MSTEP (Multiplication Step) implements one step of both the signed and un

signed multiplication algorithms. Contrary to the MSTART instruction, this instruction

requires one extra source operand. The destination register is also used as the first source

register. The second source register holds the multiplicand value.

This instruction also tests the MSB of the MD register. If this bit is equal to one

then the destination register is shifted 1-bit to the left and added to the multiplicand.

The partial result is moved back into the destination register. If the MSB of the MD

register is equal to zero, then the destination register is shifted 1-bit to the left. The MD

register is always shifted 1-bit to the left.

Division is done with a restoring division algorithm. The dividend is loaded into

the MD register and the register that will contain the remainder is initialized to zero.

The divisor is loaded into another register. The result of the division will be in the MD

register.

The DSTEP (Division Step) instruction implements a one step of the division al

gorithm. This instruction is similar to the MSTEP instruction. It requires two source

operands and a destination operand. The destination register holds the value of the re

mainder and serves as the first source register. The second source register holds the value

of the divisor. Each time this instruction is executed, it shifts the remainder register 1-bit

to the left and adds to it the MSB of the MD register. It also subtracts from the partial

result of shifting and addition the value of the divisor to generate a final result.

If the MSB of the final result is equal to one, then the remainder register gets the

Chapter 2. Architecture 27

value of the partial result. The MD register is shifted 1-bit to the left. On the other

hand, if the MSB of the final result is equal to zero, then then remainder register gets

the value of the partial result minus the value of the divisor. The MD register is shifted

1-bit to the left and one is added to it.

2.7.2 Logical Instructions

The logical instructions may have either two or three operands which can only be general

registers. These instructions come in two types:

1. Boolean operation instructions.

2. Shift instructions.

Boolean Operation Instructions

The Oryx processor provides several instructions which can be used to perform basic

booleall operations.

The NOT (Not) instruction forms the one’s complement of the source operand and

replaces the destinatioll operand with the result.

The AND (And) instruction performs the standard boolean bitwise “and” operation

on the two source operands and moves the result into the destination register.

The BIC (Bit Clear) instruction is similar to the previous instruction except that

it uses the one’s complement of the first source operand. This instruction is useful for

manipulating byte quantities.

The OR (Or) and the XOR (Exclusive Or) instructions perform the standard boolean

bitwise “or” and “exclusive or” operations, respectively.

Chapter 2. Architecture 28

Shift Instructions

The shift instructions can be used to rearrange the bits within an operand. They apply

either arithmetic or logical 1-bit shift to words. These instruction do not affect the PSW

0 status bit.

The SRA (Shift Right Arithmetic) instruction copies the sign bit into an empty bit

position on the upper end of the operand. This instruction is useful for performing simple

arithmetic such as integer divide by 2.

The SRL (Shift Right Logical) instruction is similar to the previous instruction except

that it fills the high-order bit positions with zeros.

The SLL (Shift Left Logical) instruction fills the low-order bit positions of an operand

with zeros.

2.7.3 Control Transfer Instructions

The Oryx processor provides both conditional and unconditional control transfer instruc

tions to direct the flow of execution.

Unconditional Transfer Instructions

The JUMP (Jump) instruction unconditionally transfers execution to the destination.

This instruction takes one source operand and one destination operand. Both operands

are general registers. The first operand specifies the address of the jump destination.

The return address is moved into the second operand. The same instruction can be used

to return from a subroutine by swapping the two operands.

The IRET (Return From Interrupt) instruction returns control to an interrupted

procedure. It also restores the state of the P5W processor which was saved when the

interrupt occurred.

Chapter 2. Architecture 29

The IRET instruction is followed by three slots. Instructions in these slots are not

executed. This is a mechanism to facilitate the return from interrupt.

The TRAP (Trap) instruction allows the programmer to specify a transfer of execu

tion to an interrupt service routine (TSR). This instruction specifies an 8-bit vector which

is used to lookup the address of the TSR in the interrupt vector table.

Conditional Transfer Instructions

Conditional transfer instructions are called branch instructions. There are several of

them. Each branch instruction specifies a condition, two source operands, and a 16-bit

displacement. A source operand can be any general register.

A branch instruction compares the values of the two source registers and takes some

action depending on the comparison result. If the result of the comparison is positive,

then a 16-bit sign extended displacement is added to the value of the PC register to

generate the branch target address. Otherwise, the PC register is incremented to point

at the next instruction.

The Oryx processor features a two-slot delayed branch scheme. This means that

branch instructions require two clock cycles from the time they were issued to evaluate

the branch condition. The Oryx processor issues an instruction every clock cycles. By

the time the comparison of the two source registers has been done, two instructions have

been issued. These instructions enter a ready queue and wait for execution.

The Oryx processor provides two versions of the branch instructions. One version

makes it possible to statically predict that the branch will be taken. This provides a

mechanism for squashing (or canceling) the instructions in the ready queue if the branch

is not taken. Consequently, the compiler should try to schedule useful instructions in the

delay slots to increase the performance of the system.

The BEQ, BEQSQ (Branch If Equal) instructions test for equality of the two source

Chapter 2. Architecture 30

registers. If the result of the test is positive, then the branch is taken.

The BGT, BGTSQ (Branch If Greater Than) instructions transfer control to the

destination if the first source register has a value which is greater than that of the second

source register.

The BLT, BLTSQ (Branch If Less Than) instructions are the complements of the

BGT and BGTSQ instructions, respectively. Control is transferred if the first source

register has a value which is less than that of the second source register.

The BNE, BNESQ (Branch If Not Equal) instructions transfer control to the des

tination if the source registers are not equal.

2.7.4 Miscellaneous Instructions

The NOP (No Operation) instruction increments the PC register to point at the next

instruction. This instruction does not change the status of the Oryx processor.

2.8 Interrupts

The Oryx processor has a mechanism to provide precise interrupts which means that

interrupts will be served in the order of their occurrence. Interrupts are either external or

internal. The Oryx processor supports four external interrupts: non-maskable, maskable,

data page fault, and instruction page fault. As for internal interrupts, the Oryx processor

supports two types: overflow and software interrupts.

When an interrupt occurs, the processor goes into supervisor mode, the PSW register

is saved, and control is transferred to execute the ISR, which is located at memory address

zero. While the Oryx processor is in supervisor mode, no other interrupts are accepted.

When the processor has finished executing the TSR, control is transferred back to the

interrupted procedure.

Chapter 2. Architecture 31

A typical ISR is given below:

MOVFRS PSW,Ra ; Read PSWMOVFRS MD,Rb ; Save MD registerMOVFRS CRA3,RC ; Save CRA3 registerMOVFRS CRA3,Rd ; Save CRA2 registerMOVFRS CRA3,Re ; Save CRA1 registerMOVFRS CRA3,Rf ; Save CRAO register

Body of the ISR.

MOVTOS Rb,MD ; Restore MD registerMOVTOS RC,CRAO ; Restore CRA3 registerMOVTOS Rd,CRAO ; Restore CRA2 registerMOVTOS Re,CRA0 ; Restore CRA1 registerMOVTOS Rf,CRAO Restore CRA0 registerIRET ; Return

The TSR reads the PSW register to determine what caused the interrupt. Then it

saves some registers which include the MD register and the CRA registers. The CRA

registers are saved by reading the CRA3 register four times while shifting the CRA queue.

The ISR does some work and when it is done it restores the registers it saved earlier.

The CRA registers are restored by writing to the CRA0 register four times. Finally, the

TSR returns control to the interrupted procedure by executing the IRET instructions.

Chapter 3

Formal Specification

An abstract model of the Oryx processor, as described in this chapter, provides a general

framework for writing the formal specification of the Oryx processor. This formal specifi

cation describes the architecture of the Oryx processor in a way similar to the description

that is given in the previous chapter. However, a formal specification is more precise,

which makes it suitable for reasoning about the behavior of the Oryx processor.

This chapter contains two sections. Section One describes the abstract model of the

Oryx processor from a programmer’s point of view and Section Two shows how this

abstract model is used for writing the formal specification of the Oryx processor.

3.1 Abstract Model

The circuit of the Oryx processor consists of several parts (i.e. control logic, registers,

input pins, output pins, etc...) some of which need to be visible to the programmer

and some need not. Therefore, abstraction is used to hide unnecessary details which the

programmer does not need to know.

Within our model, we have two abstraction types, structural and temporal. Structural

abstraction hides the implementation details such as pipelining. On the other hand,

temporal abstraction hides the actual timing of signal values in the implementation.

The natural way to model the Oryx processor is to view it as a state machine that

operates on a stream of inputs. The abstract state of the Oryx processor is modeled as

a record. The record contains a field for each part of the Oryx processor which is visible

32

Chapter 3. Formal Specification 33

to the programmer, as shown below:

record S is

VRF : array (0 to 31) of bit vector (0 to 31);

PC : bit vector (0 to 31);

MD : bit vector (0 to 31);

CRA : array (0 to 3) of bit vector (0 to 31);

P5W : bit vector (0 to 8);

ICacheData : bit vector (0 to 31);

DCacheData : bit vector (0 to 31);

ICacheAddr : bit vector (0 to 31);

DCacheAddr : bit vector (0 to 31);

FPReg : bit vector (0 to 3);

Reset : bit;

JAck : bit;

DAck : bit;

NMlnterrupt : bit;

Interrupt : bit;

IPFault : bit;

DPFault : bit;

IFetch : bit;

Read Writeb : bit;

MemCycle : bit;

CopCycle : bit;

BypassCache : bit;

end;

Chapter 3. Formal Specification 34

In other words, an abstract state is a collection of memory storing elemeilts which

can be either combined to form a vector (e.g. PC register) or kept apart as in the case

of an input signal (e.g. reset).

The abstract model of the Oryx processor describes the run-time abstract states which

can be reached from an initial abstract state by means of a transition function over the

set of run-time abstract states. These run-time abstract states form a sequence.

Definition 3.1 The set Sw = {< ,5k,. . . ‘Sn >1 . e {O, 1}m} is the set of run-time

abstract state sequences, where m is the size of S.

Traditional transition functions map one state to another. However, this definition

is not sufficient for modeling the Oryx processor as we may need information from more

than one abstract state (i.e. a sequence of abstract states) in order to determine the next

abstract state.

For example, if the Oryx processor is currently executing an instruction in the sec

ond delay slot of a branch instruction, then the transition function will require some

information from two previous abstract states to determine the new value of the PC

register.

Therefore, we generalize the definition of a transition function so that it may take a

sequence of abstract states instead of a single abstract state.

Definition 3.2 A transition function 1’ : S’ —* S maps a sequence of abstract states to

an abstract state.

It is important to note that a specification T is a partial function, hence, a mapping

may not exist for some sequences of abstract states. These sequences represent a class of

programs which do not comply with the specifications of the Oryx processor. The behav

ior of the Oryx processor is simply not defined for such programs. It is the programmer’s

responsibility to write valid programs if a correct result is desired.

Chapter 3. Formal Specification 35

To identify the sequences of abstract states which can be obtained by executing valid

programs only, we define what we call an abstract history.

Definition 3.3 (Abstract History) An abstract history A is a sequence of abstract

states such that each prefix of A is consistent with the transition function. More formally,

an abstract history A= {< > E $W Vj 0 <j < 1. T(< 8O,S1,”.,Sj >) =

A specification, as defined below, describes how the abstract state of the Oryx pro

cessor is affected by executing an instruction every cycle.

Definition 3.4 (Specification) A specification F : A —* A maps an abstract history to

a new abstract history.

Keeping the above discussion in mind, we now present the abstract model of the Oryx

processor as depicted in Figure 3.7. Assuming that the Oryx processor is in a state So,

the specification determines what the next state 51 is going to be based on the current

(recent) history of the Oryx processor.

3.2 Formal Specification

This section shows how to write the formal specification of the Oryx processor by giving

an example. The full formal specification can be driven in a similar way. Before giving

the example, however, we introduce some notations and conventions, which will be used

throughout this section.

We write S1- to refer to the abstract state at time r. S0 refers to the current abstract

state. Negative indices are used to refer to previous abstract states. For example, S_

is the abstract state one instruction cycle ago, 8—2 is the abstract state two instruction

cycles ago, and so on. Similarly, positive indices are used to refer to next abstract states.

Chapter 3. Formal Specification 36

ification

Si -

Inputs -

S0 S S2

Figure 3.7: The abstract model of the Oryx processor

Chapter 3. Formal Specification 37

RF7(i) S7.RF(i)PC7 ST.PCMD7 ST.MDSrclT ST.ICacheData[5..9]Src27 ST.ICacheData[1O..14]Dest7 S,-.ICacheData[15..19]ImmedT ST.ICacheData[15. .31]Offset7 ST.ICacheData[15. .31]DCacheData7 S- .DCacheDataSpecRegT ST.MDCRAT (i) S. CRA (i)NMInterruptT S-. NMlnterruptInterrupt7 ST .InterruptIPFaultT S-. IFFaultDFFau1tT ST. DPFaultReset1- ST.Reset

Table 3.3: Function abbreviations

Writing formal specifications can be a very tedious task since it usually involves a lot

of formulation. Therefore, a formal specification syntax should be made as simple as pos

sible. Table 3.3 lists some abbreviations which are used to make the formal specification

of the Oryx processor more readable and Table 3.4 defines some symbols.

Throughout the remaining of this chapter, we will show by example how to write

the formal specification of the Oryx processor. We will provide a hierarchical definition

+ Binary signed addition— Binary signed subtractionA Bitwise logical ANDV Bitwise logical OR

Bitwise logical XOR: Bit concatenation

Table 3.4: Operation definitions

Chapter 3. Formal Specification 38

of two transition functions, namely the register file and the PC register, under normal

conditions (i.e. no interrupts and no reset). We then go down in the hierarchy and define

all functions which compose the top-level definition of these transition functions.

3.2.1 Example I: Register File Specification

I NewRF(S) if Dest0=i=

RF0(’i) Otherwise

ArithmRes(S) if Arithmlnstr(S)

LogicRes(S) else if Logiclnstr(S)

TransferRes(S) else if Transferlnstr(S)NewRF(S)=

MDRes(’S) else if MDlnstr(S)

ShifiRes(S) else if Shiftlnstr(S)

JumpRes(S) else if Jumplnstr(S)

Arithmlnstr(S) = InstrIsADD(Opcodeo) V

InstvIsADDI(Opcodeo) V

InstrIsSUB(Opcodeo) V

InstrIsS UBNC(Opcodeo)

Logiclnstr(S)= InstrIsAND(Opcodeo) V

InstrIsBIC(Opcodeo) V

InstrIsOR (Opcodeo) V

InstrIsXOR (Opcodeo) V

InstrlsNO T(Opcodeo)

Chapier 3. Formal Specification 39

Transferlnstr(S)= InstrIsMO V(Opcodeo) V

InstrlsLD(Opcodeo) V

InstrlsLD T(Opcodeo) V

InstrIsMO VFRS(Opcodeo) V

InstrIsMO VFR C(Opcodeo)

MDlnstr(S)= InstrIsMSTA RT(Opcodeo) V

InstrfsMSTEP(Opcodeo) V

InstrIsDSTEP(Opcodeo)

Shiftlnstr(S) InstrIsSRA (Opcodeo) V

InstrIsSLL (Opcodeo) V

InstrIsSRL (Opcodeo)

Jumplnstr(S)= InstrlsJUMP(Opcodeo)

RF0(Srclo) +RF0(Src2o) if InstrIsADD(Opcodeo)

ArithmRes(S) = RF0(Srclo) —RF0(Src2o) else if InstrIsSUB(Opcodeo)

RF0(Srclo) — RF0(Src2o) —1 else if InstrIsSUBNC(Opcodeo)

RF0(Srclo) /\RF0(Src2o) if InstrIsAND(Opcodeo)

(RF0(Srclo)) ARE’0(Src2o) else if InstrIsBIT(Opcodeo)

LogicRes(S) = RF0(Srclo) VRF0(Src2o) else if InstrlsOR(Opcodeo)

RF0(Srclo) EDRF0(Src2o) else if InstrIsXOR(Opcodeo)

(RF0(Srclo)) else if InstrIsNOT(Opcodeo)

Chapter 3. Formal Specification 40

RF0(Srclo) if IristrlsMOV(Opcodeo)

DCacheData0 else if InstrlsLD(Opcodeo)

TransferRes(S) = DCacheDatao else if InstrIsLDT(Opcodeo)

DCacheData0 else if InstrIsMOVFRC(Opcodeo)

SpecRego else if InstrIsMOVFRC(Opcodeo)

FirstMultStep(S) if InstrIsMSTA RT(Opcodeo)

MDRes(S) MultStep(S) else if InstrIsMSTEP(Opcodeo)

DivStep (S) else if InstrIsDSTEP(Opcodeo)

1 0 — RF0(Src2o) if (MSB(MD0)=1)FisrtMultStep (S) =

1 0 Otherwise

1 2 xRF0(Srclo) +RF0(Src2o) if (MSB(MD0)=1)FisrtMultStep(S) =

( 2 xRF0(Srclo) Otherwise

I DivStepResA(S) if (MSB(DivStepResB(S))=1)DivStep(S) =

( DivStepResB(S) Otherwise

DivStepResA(S) = 2 xRFo(Srclo) +MSB(MD0)

DivStepResB(S) = DivStepResA(S) —RF0(’Src2o)

RF0(’Srclo)[Sl]:RFo(Srclo)[Sl..l] if InstrIsSRA(Opcodeo)

ShifiRes(S) = RF0(Srclo)[30. . 0]:0 else if InstrIsSLL(Opcodeo)

O:RF0(Srclo)[31. .1] else if InstrIsSRL (Opcodeo)

JumpRest(S)= PC0 +í

Chapter 3. Formal Specification 41

3.2.2 Example II: PC Specification

if Exception(S)

else if ReturnFromException(S)

else if TakeBranch(S)

else if DoJurnp(S)

Otherwise

(InstrIsBEQ(Opcode_2)A

(RF_2(Srcl_2)= RF_2(Src2_2)))

(InstrIsBNE(Opcode_2)A

(RF_2(Srcl_2) RF_2(’Src2_2)))

(InstrIsBGT(Opcode_2)A

(RF_2(SrcL2)> RF_2(Src2_2)))

(InstrIsBLT(Opcode_2)A

(RF_2(Srcl_2)< RF_2(Src_2)))

ReturnFromException (S) = InstrIsJRET(Opcode_3)

DoJump (S) = InstrIsJUMP(Opcode_2)

Exception (S) = NMlnterrupto V

Interrupt0 V

IPFault0 V

DPFault0 V

Reset0

0

CRA0(8)

Tpc(S) = PCO +Disp_2

RF_2(SrcL2)

TakeBranch (S) =

V

V

V

Chapter 4

Implementation

This chapter describes the implementation of the Oryx processor in the VHSIC hardware

description language (VHDL). The VHDL code is a detailed gate-level logic implemen

tation of the Oryx processor [6].

The logical way to describe the Oryx processor implementation is to divide it into

two parts: control path and data path. This chapter contains a section for each path,

but first we start by describing the Oryx processor from a designers’ point of view.

4.1 Design Overview

The Oryx is a pipelined processor. The pipeline consists of five stages as shown in

Figure 4.8. The pipeline stages are typically called the instruction fetch (IF) stage, the

instruction decode (ID) stage, the execute (EXE) stage, the memory (MEM) stage, and

the write-back (WB) stage, respectively.

IF ID EXE MEM WB

Clock

Figure 4.8: The Oryx processor pipeline

42

Chapter 4. Implementation 43

Each stage of the pipeline is a pure combinational circuit. High-speed interface latches

separate the pipeline stages. The latches are fast registers, which hold intermediate

results between the stages. A common clock is simultaneously applied to all the latches

to control information flow between adjacent stages.

The functionality of the pipeline stages is explained by describing what each stage

of the pipeline does from the time an instruction is issued to the time the instruction is

processed completely.

The IF stage reads the instruction from the instruction cache (ICache) memory using

an address in the PC register. The ID stage identifies the instruction and prepares its

operands.

The computation performed by the EXE stage depends on the instruction being exe

cuted. For compute or compute immediate instructions, this stage performs the obvious

computation. During the execution of memory instructions, the EXE stage computes

the address of the desired memory location. For branch instructions, the EXE stage

compares two registers to evaluate the branch condition.

The MEM stage reads from and writes into the data cache (DCache) memory. The

WB stage is the final stage in the pipeline where the result of the instruction is written

into the register file.

Pipelining is a form of temporal parallelism. Several instructions occupy different

pipeline stages at the same time. These instructions could very likely depend on each

other. Therefore, internal bypassing is provided so that an instruction does not have to

wait for previous instructions to write their results into the register file before being able

to access their results. With bypassing, two instructions can be issued in sequence even

if the second one uses the result of the first one. The only exception to this is the load

instructions.

The Oryx processor uses a two-phase non-overlapping clocking scheme. The two clock

Chapter 4. Implementation 44

phases are called Phil (ql) and Phi2 (q2), respectively. There are also three derived

clocks called Psil (‘l), Psi2 (2), and Psi2PC (b2PC).

The logic which generates the b2 clock is shown in Figure 4.9. The b2 clock is

controlled by both the instruction acknowledgment (JAck) and data acknowledgment

(DAck) external signals.

The memory elements used in the implementation are all negative-edge triggered

flipflops. If either JAck or DAck is not asserted due to an ICache miss or a DCache miss,

respectively, the Oryx processor is easily stalled by preventing the ‘tt’2 clock from falling.

The L’l clock is allowed to go high only if the b2 clock is low.

The ‘ib2FC is a special clock which is used in the PC chain (PCChain). The PCChain

is used to store the addresses of the instructions which need to be restarted after an

exception. The PCChain is implemented as a four-stage shift register. As shown in

Figure 4.10, the four stages are called PCm1 (PC minus 1), PCm2, PCm3, and PCm4,

respectively.

On an exception, the 2FC clock is disabled to give the programmer a chance to

save the contents of the PCChain registers. While the 2PC clock is disabled, it can

go high only for a “MOVFRS PCm4” instruction. Once the programmer has saved the

PCChain registers, the /2FC clock is re-activated. The ,1J2FC clock logic diagram is

Reset

JAck

DAck Psi2

Phi2

Figure 4.9: çb2 logic diagram

Chapter 4. Implementation 45

PCm4out[0-31]

PCm4 Register

PCm3 Register

I PCm2 Register

_J.____Psi2PC PCm1 Register

PCToPCrnI

________

SrclToPCml

PCBus[0-3 1] Srcl [0-3 1]

Figure 4.10: Program counter chain logic diagram

shown in Figure 4.11.

Instructions can change the Oryx processor state only when they reach their last

pipeline stage. When an instruction causes an interrupt (e.g. an instruction page fault),

an interrupt flag is carried along the pipeline stages and is detected in the WB stage. The

interrupt flag is used to prevent the instruction from doing anything as it goes through

the pipeline stages. In other words, the instruction is turned into a “NOP” instruction.

4.2 The Control Path

The control path is 32 bit wide. The logic diagram of the control path is shown in

Figure 4.12. The design of the Oryx processor is kept as simple as possible and it does

not require a very complex control. The control is primarily done with logic which is

local to each stage of the pipeline.

Chapter 4. Implementation 46

Psi2

PSWSBit

Psi2PC

Psi2PSWSBit

PCm4DrvResultBus

Figure 4.11: 2FC logic diagram

There are few global control signals used in most stages of the pipeline. They are the

reset signal, the exception signal, the squash signal, and the PCChainToPC signal.

The reset signal puts the Oryx processor in a known state and forces an exception to

take place. This is an active-high signal which is controlled by an external pin.

The exception is an active-high signal which originates in the exception FSM. The

exception FSM is a simple six-state machine implemented as a six-stage shift register.

The logic diagram of the exception FSM is shown in Figure 4.13.

The exception FSM is activated when an interrupt is detected in the WB stage. The

exception signal stays active for five clock cycles. This is the time needed to flush the

pipeline. The pipeline is flushed by feeding in five “NOP” instructions and preventing

the instructions which are currently in the pipeline from writing their results back into

the register file. The Oryx processor starts fetching instructions at the beginning of the

TSR when the exception signal goes low.

Table 4.5 shows the pipeline action during exceptions. Instruction T is assumed to

have caused an exception. Consequently, instructions Jo to 14 are canceled. The Oryx

processor starts fetching instruction T which is the first instruction of the TSR.

Chapter 4. Implementation

IPFauIt

CompFunBusImmedBusTBEqChkTBGrChk

—‘ TBWhichMovefrsMovetosBranch

• Compute• Memory• Computelinmed• MDlnstr• IRET• ShifterDrvResultBus

• RegFiIeToALUB• RegFiIeToALUA

MemToALUBMemToALUA

• ExeToALUB

47

NMlnterrupt

Interrupt

Exception

• Rd

NoDeat

Overflow

P5W Flags

ID

Takeflranch

Exception

PC

< Reset£ Phil

Phi2

SquashJh

FSM

Ra

Rb

ICache

Figure 4.12: The Control path logic diagram

Chapter 4. Implementation 48

ExceptionFSMln

Figure 4.13: The exception FSM logic diagram

IF ID EXE MEM WB‘0 X X X XIi jo X X X12 ‘1 ‘0 X X13 ‘2 ‘1 ‘0 X14 13 ‘2 Ii To

NOP 14 13 ‘2 IiNOP NOP 14 13 ‘2

NOP NOP NOP 14 13NOP NOP NOP NOP 14

15 NOP NOP NOP NOP

Psil

Table 4.5: The Oryx processor pipeline action during exceptions

Chapter 4. Implementation 49

SquashFSMln

Figure 4.14: The squash FSM logic diagram

The squash is an active-high signal which originates in the squash FSM. The squash

FSM is similar to the exception FSM. It is a three-state machine implemented as a

two-stage shift register. The logic diagram of the squash FSM is shown in Figure 4.14.

The squash FSM is activated when the instruction being executed is a squashing

branch instruction and the branch is not taken. The squash signal stays active for two

clock cycles. This is the time needed to cancel the two instructions which are in the delay

slots.

Table 4.6 shows the pipeline action during instruction squashing. Instruction 14 is

assumed to be a squashing branch instruction. The two instructions following 14 are

turned into “NOP” instructions. The Oryx processor starts fetching instruction 15 which

follows the two delay slots.

The PCChainToPC is an active-high signal which originates in the PCChain FSM.

The PCChain FSM is a five-state machine implemented as a four-stage shift register.

The logic diagram of the PCChain FSM is shown in Figure 4.15. The PCChain FSM is

triggered by an IRET instruction. The PCChainToPC signal stays active for four clock

Chapter 4. Implementation 50

IF ID EXE MEM WB‘0 X X X XIi jo X X X12 ‘1 ‘0 X X13 ‘2 ‘1 ‘0 X14 13 ‘2 ‘1 ‘0

NOP 14 13 ‘2 ‘1

NOP NOP 14 13 ‘2

I NOP NOP 14 13

Table 4.6: The Oryx processor pipeline action during instruction squashing

cycles. During this time, the PC register gets its new value directly from the PCm4

register.

In order to understand how the control works in the Oryx processor, it is best to

examine what the hardware does in each processing phase of an instruction (Table 4.7).

The instruction fetch starts during the IF phase when the PC register is loaded with

the address of the next instruction to be fetched. A typical instruction fetch timing

diagram is shown in Figure 4.16. On 1 of the IF phase, the ICache address pads are

driven with the instruction address and the fetch signal is asserted by the Oryx processor.

The ICache is accessed on q2 of the IF phase. In the ideal situation (i.e. a cache hit), the

MMU responds by asserting the JAck signal to tell the Oryx processor that an instruction

is now available on the ICache pads. Consequently, the value on the ICache pads is loaded

into the instruction register (IR).

Since the Oryx processor supports virtual memory, the design has a mechanism to

detect page faults. A typical ICache page fault timing diagram is shown in Figure 4.17

The same diagram can be used to show a DCache page fault timing by relabeling the

waveforms so they correspond to a DCache memory operation. An ICache page fault

is similar to an ICache miss except that the MMU waits for one clock cycle and then

Chapter 4. Implementation

lCache[O-3l]

51

ICacheAddr[O-31]

lAck

Fetch

Psi2

Phi2

Phil

Figure 4.15: The PC chain FSM logic diagram

Figure 4.16: A typical instruction fetch timing diagram

Chapter 4. Implementation 52

Processing Phase Clock Phase Pipeline ActionIF q1 ICache address pads PC

42 Do ICache accessInstruction register ICache

ID q1 Calculate new PCDo bypass register comparisonsDetect squashing branch instructionsSrcl bus Srcl register or bypass registerSrc2 bus 4= Src2 register, bypass register oroffset field in memory instructions

42 No actionEXE qfl Do ALU operation or shift operation

Evaluate TakeBranch signal for branch instructions42 EXEout register 4= Result bus

SMDR Src2 busPCm1 PC

MEM No action

q2 Do DCache accessTEMP register 4= EXEout registerLMDR register 4= DCache input pads

WB Detect interrupts

q2 WBout register TEMP register or LMDRRegister file 4= WBout register

Table 4.7: Pipeline action during instruction processing phases

Chapter 4. Implementation 53

ICache[O-3 1]

ICacheAddr[O-3 1]

IPFault

JAck

Fetch

Psi2

Psil

Phi2

Phil

asserts the instruction page fault (IPFault) signal.

On ql of the ID phase, squashing branch instructions are detected. In the PC unit

(Figure 4.18), the address of the next instruction to be fetched is calculated. At the same

time in the bypass control unit, the source register addresses of the current instruction are

compared with the destination register addresses of the previous two instructions using

four identical comparators. As a result of the comparison, either the source register or

the pipeline intermediate values are driven on the source 1 (Srcl) and source 2 (Src2)

buses.

On q1 of the EXE phase, either an ALU operation or a shift operation is performed.

On q52, the result of the operation is moved into the EXEout register and the value of

the Src2 bus is loaded into the SMDR register. For memory instructions, the DCache

Figure 4.17: ICache page fault timing diagram

Chapter 4. Implementation

PC[O-31]

“4”

Pc[O-31J

Offset

PCm4DrvPCBus

PCm4[O-31J

Figure 4.18: The PC unit logic diagram

Pc[o-31]

54

SrclDrvPcBus

Srcl[O-31]

ZeroDrvPcBus

“0”

TrapVectorOrvPcBus

Trap Vector

Reset

Chapter 4. Implementation 55

TakeBranch

Figure 4.19: The TakeBranch signal logic diagram

address pads are driven with the content of the EXEout register and the DCache output

pads are driven with the content of the SMDR register. For branch instructions, the

TakeBranch signal indicates whether fetching instructions should continue sequentially

or at the branch target. The TakeBranch signal logic diagram is shown in Figure 4.19.

A typical memory read timing diagram is shown in Figure 4.20. On q2 of the MEM

phase, the content of the DCache input pads is loaded into the LDMR register and the

content of the EXEout register is moved into the TEMP register. The Oryx processor

asserts the read signal. The DCache is accessed on q2 of the next clock cycle giving the

MMU enough time to find the required information.

On ql of the WB phase, the interrupt flags are scanned according to their priority.

When an interrupt is pending, the exception FSM is activated before the content of the

WBout register is written into the register file as to prevent the contents of the registers

from being changed.

On 2, the content of the TEMP register is moved into the WBout register except

for memory load instructions where the WBout register gets the content of the LMDR

register instead.

TBEqChk

z

TBWhich

Branch

Chapter 4. Implementation 56

DCacheIn[O3)X(

DAck

Read)WñtC

Psi2

Phi2

Phil

Figure 4.20: A typical memory read timing

Chapter 4. Implementation 57

4.3 The Data Path

The width of data path is parameterized so it can take on a value which is equal to N,

where N=21, i {2, 3,4, 5}. The data path logic diagram is shown in Figure 4.21. It

mainly consists of two parts: the register file and the execute unit.

The register file provides the operands which are needed to perform an operation. It

is an M-word memory, where M=2t, i {1, 2,3,4, 5}, which can be accessed through

three ports, one of which is used for writing and the other two ports are used for reading.

As shown in Figure 4.22, the register file can be logically divided into three components:

the memory array, the destination address decoder, and the output multiplexors.

Each memory bit is a D-fiipfiop with an input multiplexor as shown in Figure 4.23.

The content of the memory bit is changed when the load signal becomes active. The N

load signals are grouped together to form a word address line which is connected to one

output of the address decoder.

The NoDest signal originates in the ID stage and is carried along the pipeline stages.

It is used to to prevent instructions which do not write into the register file from changing

the contents of the registers. The logic which generates the NoDest signal is shown in

Figure 4.24.

An M-to-1 word multiplexor is used for each of the register file output ports. Reading

from the register file is done asynchronously. When the source register address becomes

stable at the multiplexor select lines, the required memory word is made available at the

output port after some propagation delay.

The design of the register file is somewhat naive. The main reason being that the tools

which we used, specifically the VHDL compiler, do not support switch-level component

(i.e. transistors) which are necessary for a realistic implementation. In practice, the

register file should be implemented as a Dual-Ported RAM with pre-charge lines and

PC

ExeT0ALUAExeToALUBMemT0ALUAMemT0ALUB a’

ID

Ra

Rb

TakeBranchOverflow

CompFuncMovetosMovefrsBranchTBWhichTBGrChkTBEqChk

Reset- Compute* Memory

Immedflus

ComputelmmedMDlnstrWET

IF

Chapter 4. Implementation 58

Rd Register File

WB

MEM

EXE

RegFiIeT0ALUARegFiIeT0ALUB

PSW Flags a

PCm4

DCacheOut

3.

a

ShifterDrvResultBus

Register File

Figure 4.21: The Oryx processor data path logic diagram

Chapter 4. Implementation 59

Ra[O-4]WBout[O-3 1]

Register 0Srcl[0-31]

Register 1

Destination

Address Rd[0-4]

Decoder

Src2[0-311 Register 31 I 31

Rb[0-4]

NoDest Exception

Figure 4.22: The Oryx processor register file logic diagram

out(i)

LoadPsi2

Figure 4.23: A register file memory bit logic diagram

Chapter 4. Implementation 60

TypO

Typi

Op1

NoDest

Rd[O-4]

Figure 4.24: The NoDest signal logic diagram

differential sense amplifiers, although designs like ours have been reported [16].

The execute unit (Figure 4.21) consists of five components: the ALU, the shifter, the

MD register, the PSW register, and the PCChain.

The ALU consists of two parts: the arithmetic unit (AU) and the logic unit (LU). The

logic diagram of the ALU is shown in Figure 4.25. Data arrive at the ALU input ports

on either the source buses or the immediate bus. A multiplexor is used to select between

the Src2 bus or the immediate bus. Similarly, another multiplexor selects between Srcl

or Srcl shifter by one bit (used to support the multiplication/division instructions). A

third multiplexor selects between the output of the AU and the output of the LU.

The AU is an N-bit carry propagate adder composed of - carry look-ahead adders.

The logic diagram of the AU is shown in Figure 4.26. The AU supports both two’s

complement addition and subtraction. The AL UComplement signal is used to generate

the two’s complement of the second operand for subtract instructions.

The AU is also used to perform register comparison for branch instructions. The

Z signal takes on a “0” value when the contents of the two source register of a branch

instruction are equal.

TypO

Typl

Chapter 4. Implementation 61

ALUout[O-31]

AUDrvResultflus

LUDrvResultBus —7”

Logic Unit

_

II—

_

Src1ToALU

SrclShiftedToALU

SrclBusShiftedByOne[O-311 srcl[o-311 Src2[O-31]

Figure 4.25: The ALU logic diagram

1 Arithmetic Unit

I I IISrc2ToALU

edToALU

I

AUout[O-31]

ALUCarryIn

ALUCarryOut

z

OperandA[O-31] OperandB[O-31]

Figure 4.26: The AU logic diagram

Chapter 4. Implementation 62

LUout[O-31]

AndDrvResultBus XorDrvResultBus

OrDrvResultBus NotDrvResultBus

OperandA[O-311 OperandB[O-31]

Figure 4.27: The LU

The logic diagram of the LU is shown in Figure 4.27. The LU supports bitwise logical

“and”, “or”, “exclusive or”, “not”, and “bit clear” operations.

The shifter is a combinational circuit shifter which performs standard shifts such

as logical shifts and arithmetic shifts. The logic diagram of the shifter is shown in

Figure 4.28.

The shifter has two control lines ShiftLeft and ShiftRight for selecting the type of

operation. The diagram shows the first stage, the last stage, and a typical stage. The

shifter consists of N such identical stages.

The shifting to the right or left is for one bit position. The serial input to the least

significant stage is always “0”. The ShifterMSBln signal serves as a serial input for the

most significant stage. For arithmetic shift operations, ShiftMSBIn takes on the value of

the sign bit and for shift right logical operations it becomes “0”.

logic diagram

Chapter 4. Implementation 63

ShiftLeft ShiftRight

Figui’e 4.28: The shifter logic diagram

Shifterout(3 1) Shifterout(i) Shifterout(O)

ShifterMSBln “O”

Chapter 4. Implementation 64

RestoreMDReg

SrcJ ()

LdMDReg

MDRegout(i-1)

MDRegShift

Figure 4.29: A typical MD bit logic diagram

The MD register is an N-bit special register with built-in hardware to facilitate the

implementation of the shift-and-add multiplication and division instructions.

A typical MD register bit consists of an input multiplexor and two D-fiipflops as

shown in Figure 4.29. On ql, the output value of the input multiplexor is loaded into

the first D-fiipfiop. The second D-fiipfiop which is known as the shadow fiipflop saves the

content of the first D-fiipfiop on so that the value of the MD register can be restored

after an exception.

The LdMDReg signal loads the MD register with the value on the Srcl bus. This

signal becomes high only for a “MOVTOS MD” instruction which is not squashed.

The MDRegShift signal causes the MD register to shift left by one bit for multiplica

tion/division instructions which are not squashed.

On an exception, both the LdMDReg and the MDRegShift signals stay low. Before

starting the ISR, the Restore OldMDReg rises to restore the value of the MD register which

was held immediately before the exception signal became active. The MDRegLSBIn is

shifted into the low order bit of the MD register when the MDRegShift signal is high.

This bit is simply the sign bit of the ALU output.

Chapter 4. Implementation 65

PSWout(i+I)

PSWout(i)

Psil

Figure 4.30: PSWCurrent-PSWOther pair

The PSW register consists of nine pairs of bits. The two bits of each pair are called

PSWCurrent and PSWOther, respectively. The logic diagram for each PSWCurrent

PSWOther pair is shown in Figure 4.30. The top D-flipflop makes up the PSWCurrent

bit and the bottom D-flipflop makes up the PSWOther bit. Each PSWcurrent bit can

be saved in and restored from its PSWOther bit.

On an exception, the PSWCurrent bits are saved in the PSWOther bits and a partic

ular value is forced into most of the PSWCurrent bits except the E, I, 0, IPF, and DPF

bits. These bits are forced to values which depend on whether or not the exception was

caused by a non-maskable interrupt, a maskable interrupt, an overflow, an ICache page

fault, or a DCache page fault, respectively.

The LdFSW signal loads the PSW register with the value on the Srcl bus. This

signal becomes high only for a “MOVTOS PSW” instruction executed when the Oryx

Srcl(i+1)

1-dPsw

ForcePSW

PSWOtherToPSW

ForcePSW

Srcl(i)

LdPSW

Chapter 4. Implementation 66

processor is in the system mode. The LdPSW signal stays low in case of an exception.

On an exception, the ForcePSW signal rises to save the PSWCurrent bits in their

respective PSWOther bits and force the PSWCurrent bits to predetermined values.

The PSWOtherToPSW signal is triggered by an IRET instruction as part of the

TSR return sequence. This signal restores the PSWCurrent bits with the values of their

PSWOther bits.

Chapter 5

Formal Verification

In this chapter, we demonstrate the application of symbolic trajectory evaluation for

verifying the Oryx processor. The chapter contains six sections. First, we briefly discuss

the inherent difficulty in carrying out the verification of the Oryx processor. Second,

we explain what it means to verify the processor. Third, we show how implementation

mappings can be used to overcome some of the difficulty in verifying the Oryx processor.

Fourth, we state the verification condition of the processor which allows us to establish

the correctness of the processor. Fifth, we show how the verification is actually done

and we then present some experimental results. Finally, we list some of the design errors

which were detected during the verification.

5.1 Verification Difficulty

In practice, the Voss system is used to carry out the verification task of the Oryx pro

cessor. Generally speaking, the formal specification of the Oryx processor is expressed in

the FL language and the simulation model needed by the Voss system is extracted from

the VHDL implementation of the processor.

One of the difficulties in verifying the correctness of the Oryx processor is that the

formal specification of the processor is very abstract so that it does not constrain the

implementation of the processor. On the other hand, our implementation of the Oryx

processor is fairly complex. Therefore, a major effort in the verification effort was to find

the relationship between the formal specification of the Oryx processor and its pipelined

67

Chapter 5. Formal Verification 68

implementation. In particular, we had to deal with both temporal abstraction and struc

tural abstraction.

In the abstract domain, time is measured in terms of instruction cycles. The processor

executes one instruction every cycle. In the implementation domain, however, time is

measured in terms of clock cycles. It takes an instruction several clock cycles to emerge

from the pipeline of the Oryx processor. For example, an ADD instruction takes 5 clock

cycles to complete in the implementation. Furthermore, each clock cycle consists of many

basic time units. Assuming that each clock cycle is equal to 1000 time units, we thus

need to map 1 abstract instruction cycle to 5000 implementation time units.

The problem imposed by structural abstraction is due to the fact that a data value in

the abstract domain could have several images in the implementation domain and which

image should be used at one point of time depends of the state of the Oryx processor. To

further illustrate the problem, consider the following example. In the abstract domain,

the value of the operands can be obtained from one source called the virtual register file.

In the implementation domain, however, the value of operands could be in the register

file or anywhere along the stages of the pipeline. We will return to this shortly.

5.2 Property Verification

If the programmer wants to use the Oryx processor to run a program, the implementation

of the processor should perform the “correct” function, as specified by the program. In

very general terms, answering this is the essential verification task of the Oryx processor.

However, carrying out the verification for any arbitrary program is impossible due

to the complexity of the problem. Instead, we will show how to verify properties of the

Oryx processor using bounded-length, arbitrary, but valid (i.e. written according to the

formal specification), instruction sequences. A property of the Oryx processor to verify

Chapter 5. Formal Verification 69

might be that the PC register is updated properly or that the ALU computes the sum

of two operands correctly.

At this point, we will consider the abstract model of the Oryx processor to show how

to check for properties in the abstract domain. Later in this chapter, we will show how to

check for the same properties in the implementation domain. We express the properties

we want to check for in the form A = NO, where A and 0 are instantaneous formulas

that relate to the abstract model of the Oryx processor and IC/ is a next-time operator. If

the property we are checking for holds in the abstract domain, we write = A NC.

Referring back to Chapter 1, one can realize that STE is a very suitable technique

for checking properties of the form given above. The formula A, also known as the

antecedent, describes the abstract state of the Oryx processor before we perform some

instruction. Intuitively, the antecedent selects out of the arbitrary instruction sequences,

those sequences that are relevant for the property we wish to verify. Similarly, the formula

O, also known as the consequent, describes what the state of the Oryx processor should

be if the particular property holds and we allow the Oryx processor to run for one more

cycle.

Since we are verifying properties of the Oryx processor only, it may not be clear to

the reader how that results in verifying the Oryx processor as proposed in the beginning

of this section. Nevertheless, it is possible, after each single property has been verified,

to compose the individual results so that the correctness of the abstract model can be

established [7). However, describing those techniques is beyond the scope of this thesis.

5.3 Implementation Mapping

Implementation mapping is a technique to close the gap between the abstract formal spec

ification of the Oryx processor and its pipelined implementation. The technique involves

Chapter 5. Formal Verification 70

I I

I IiiiiiiiiiiiiiiI IFigure 5.31: Time mapping

finding a well-defined relationship between the abstract domain and the implementation

domain. Due to the size of the gap, the mapping is relatively complex.

Since we have two types of abstraction, we will first show how to do the mapping to

handle each type of abstraction. We will then show how to combine the two mappings

to obtain one implementation mapping.

5.3.1 Cycle Mapping

The first entity we need to map is time. As shown in Figure 5.31, one instruction cycle

in the abstract domain corresponds to many basic time units in the implementation

domain. To relate time in both domains, we define a linear function, (i), which takes

an instruction cycle i and maps it to some clock cycles. The definition of (i) depends on

the delays in the components of the implementation. In our case (i) is simply 100000 x i.

5.3.2 Instant Abstract State Mapping

The second entity we need to map is the instant abstract state. The term instant is

used here to emphasize that the abstract state does not involve time explicitly. Thus,

this mapping deals primarily with the structural abstraction. As shown in Figure 5.32,

i+1Abstract Time

Implementation Time

Chapter 5. Formal Verification 71

1 i+ I

I I Abstract State

Implementation State

Figure 5.32: State mapping

we define a function, q(i), which takes an abstract state i and maps it to a sequence

of implementation states. We impose some constraints on (i) to make sure that all

implementation states within a sequence are consistent with respect to each other.

This mapping is best illustrated by an example. Recall from Chapter 4 that bypassing

is a technique used to resolve data conflicts within the pipeline. This means operands

can be obtained from the register file or anywhere along the pipeline stages. To hide this

fact in the abstract domain, we refer to a virtual register file whenever we need to obtain

an operand.

To illustrate the basic idea behind the mapping, we present two examples as shown

in Figure 5.33. The upper part of Figure 5.33 shows the abstract model of the Oryx

processor with some fields filled in, while the lower part shows a simplified version of its

implementation.

The first example shown in Figure 5.33(a) assumes that the current instruction is an

ADD 2,3,4 instruction which is supposed to add the contents of the virtual register 2 and

the virtual register 3, and put the result in the virtual register 4. Since the address of

the instruction operands are not equal to any of the addresses of the destination registers

of the previous two instructions, bypassing is not done. Thus, the virtual register file is

Chapter 5. Formal Verification 72

/

Abstract Domain

(N

(a)

/

Abstract Domain

N

(b)

2 Src1

Src2

4 6 7 Dest-

- VRF[21

JL VRF[31

0

Src1Src2

4 3 7 Dest-

- -- VRF[2]

--u VRF[31

, S2

/ EXEout’

MEMout

I2 10 11

3 15 EXEout

MEMout

Figure 5.33: Register file mapping

Chapter 5. Formal Verification 73

i+1I

‘ Abstract State

Implementation State

Figure 5.34: Combined mapping

mapped to the physical register file as shown by the arrows.

The second example shown in Figure 5.33(b) assumes that the current instruction is

the same. However, in this case one of the operand addresses is equal to the address of

the previous destination register. Thus, the virtual register is mapped to the EXEout

register.

5.3.3 Combined Mapping

From the previous discussion, we showed how to deal with temporal and structural ab

straction separately. Here we show how to combine both mappings to define a single

implementation mapping function that maps both time and state. We call this function

t. The way the function works is depicted in Figure 5.34. The function takes an abstract

state i and maps it to a sequence of implementation states as explained earlier. The func

tion then takes the next abstract state i + 1, maps it to a sequence of implementation

states, adds time elements to the result and combines it to the previously generated

sequences. Again, some constraints are imposed on the mapping to make sure that the

resulting combination of sequences is conflict-free.

Chapter 5. Formal Verification 74

5.4 Verification Condition

To establish the correctness of the Oryx processor, we need to show the following is true:

VA,O if A NO then j=M 1u(A= IrO)

This basically says that if some properties hold in the abstract domain, then we can

prove that the same properties hold in the implementation domain by using implemen

tation mapping.

5.5 Practical Verification

In this section, we demonstrate how to write implementation mapping functions in the

FL language. We also show how to use STE to prove properties of the Oryx processor.

To illustrate implementation mappings, consider the following simple example of an

output signal:

let DCacheAddris cyc dv = DCacheAddr isv dvfrom ((Phi2F (cyc+2))+LatchDelay)to ((Phi2F (cyc+4))+LatchDelay);

The DCacheAddris function takes two arguments, an abstract cycle number (cyc) and

a data vector (dv). This function basically says that the DCache address output pads

of the Oryx processor take on the value dv for a period of time. The abstract time is

mapped to implementation time with the help of the Phi2F function. The DCacheAddris

function also takes into account set-up time and hold time of the data value.

A more sophisticated example of implementation mapping is given below:

let VRegFile cyc addr dv instrlist =

let BypassFromEze = BypassFromExe addr instrlist inlet BypassFroniMem = BypassFromNem addr instrlist inlet NoBypass = (NOT BypassFromExe) AND (NOT BypassFromMem) in(if (BypassFromExe) (EXEoutis (cyc+1) dv))(if (BypassFromNem) (MEMoutis (cyc+1) dv)) @(if (NoBypass) (RegFileis (cyc+1) addr dv))

Chapter 5. Formal Verification 75

The VRegFile function takes four arguments: an abstract cycle number, a register

address, a data vector, and an instruction list. The function compares the register address

with the destination register address of the previous two instructions and determines the

bypassing conditions. Given the bypassing conditions, the function then asserts or checks

the appropriate values at the appropriate time.

At this point, we are ready to show how to verify a property of the Oryx processor.

Recall that the instructions of the Oryx processor fall into four classes of which the third

class is the ALU and shift instructions. Thus, we write a function called VerifyClass3lnstr

as shown below:

let VerifyClass3lnstr =

let Newlnstr = (hd NewlnstrList) inlet Ra = RaAddr Newlnstr inlet Rb = RbAddr Newlnstr inlet Rd = RdAddr Newlnstr inlet Ant =

(GenOryxClock Clocks) ®

(Reset is F from (Cycle 0) to (Cycle Clocks)) @(Exception is F from (Cycle 0) to (Cycle Clocks)) @(Squash is F from (Cycle 0) to (Cycle Clocks)) 0(AssignState 1 ControlList InstrList) 0(if (GlobalConstraint) (

(if (TakesTwoSources U V NewlnstrList OldlnstrList)((VRegFile (Target) Ra U OldlnstrList) 0(VRegFile (Target) Rb V OldlnstrList))) 0

(if (TakesOneSource U V NewlnstrList OldlnstrList)(VRegFile (Target) Ra U OldlnstrList))

)) inlet Cons =

(if (GlobalConstraint)

((if (TakesTwoSources U V NewlnstrList OldlnstrList)(EvaluateClass3lnstrFirst Target Rd Newlnstr OldlnstrList U V)) 0

(if (TakesOneSource U V NewlnstrList OldlnstrList)(EvaluateClass3lnstrSecond Target Rd Newlnstr OldlnstrList U V))

)) in(nverify Options Oryx VarList Ant Cons TraceList);

When we evaluate this function, it will check if the shifter and ALU perform correctly,

the control logic and bypass logic are correct, and reading from and writing to the register

file are done properly. The time it takes the Voss system to evaluate the function for

Chapter 5. Formal Verification 76

various data path widths is given in Table 5.8.

Data Path Width Verification Engine Time(Bits) (Minutes)

4 SPARC 10 with 64 MB RAM 532 RS6000 with 375 MB RAM 30

Table 5.8: Some experimental results

5.6 Design Errors

The following is a non-exhaustive list of errors which where uncovered during the veri

fication of the Oryx processor. The purpose of this list is to show that the verification

can detect various types of mistakes that the designer usually makes.

1. Pipeline bypasses when source register is 0. (forgotten boundary conditions)

2. Both Srcl and ALU drive the result bus. (logic error)

3. Trap instructions take effect before reaching their WB stage. (logic error)

4. Bypass logic uses Rdm2 and Rdm3 instead of Rdml and Rdm2. (typographical

error)

5. Register file writes data at the wrong edge of the clock. (timing error)

6. No way to distinguish between exceptions due to reset and an exception die to a

MNlnterrupt. (misunderstanding error)

7. Shifter uses some buffered control signals but the buffers do not exist. (VHDL

compiler bug)

Chapter 6

Concluding Remarks and Future Work

In this thesis, we established that symbolic trajectory evaluation is an effective and

practical technique for verifying even very complex digital systems. For demonstration

purposes, we designed and implemented a 32-bit RISC processor which is very similar to

many commercial RISC processors. We called it the Oryx processor.

Moreover, we showed how to describe the Oryx processor in an abstract domain.

The motivation for this is to avoid over-specifying the architecture and thus constrain

the implementation of the processor. In other words, our implementation of the Oryx

processor can be replaced with another implementation at any point of time without

having to change the specification. Only the circuit description and the implementation

mapping would have to be revised. We also showed how to relate the abstract model

of the Oryx processor to its pipelined implementation using implementation mappings.

Finally, we used the Voss system to carry out the practical verification of the Oryx

processor.

Although we have not completely verified the processor, the properties verified were

sufficiently complex that we do not foresee any major difficulties in completing the veri

fication. However, to complete the verification we estimate would require 3-6 additional

man months. Furthermore, the work has been a very good learning experience. We now

have a much better understanding of the verification problem which will be an asset

towards any future work.

77

Chapter 6. Concluding Remarks and Future Work 78

Like other related verification work, we realize the need to develop a general method

ology for verifying this type of systems. We have already began the development of such

methodology, but more research is still needed. First, due to the complexity of the sys

tems we are dealing with, it would be ideal if we can break down the major verification

task into smaller tasks and then use result composition to establish the correctness of the

system. Result composition can be used in the abstract domain or in the implementation

domain. However, showing that composing a sequence of results in the abstract domain

would give the same result if we compose the corresponding sequence in the implementa

tion domain is subject to future work. Second, finding the relation between the abstract

domain and the implementation domain is a very tedious job. A better way is needed to

analyze the mapping conditions to ensure correct mappings.

The application of symbolic trajectory evaluation for the verification of RISC proces

sors is only one example of what type of systems we can verify using this technique. The

general framework we presented can be adapted to verify other systems as well. This

work is by no means a complete solution to the verification problem. However, we think

that it is a good start and a very practical way to establish the correctness of this type

of systems.

Bibliography

[1] Derek Beatty. A Methodology for Formal Hardware Verification with Application toMicroprocessors. Ph.D. Thesis, CMU-CS-93-190, Department of Computer Science,Carnegie Mellon University, 1993.

[2] Graham Birtwistle and P.A. Subrahmanyam. VLSI Specification, Verification andSynthesis. Kiuwer Academic Publishers, Boston, 1988.

[3] Randal Bryant. “Graph-Based Algorithms for Boolean Function Manipulation”.IEEETC, Vol. C-35, No. 8, August 1986, pp. 677-691.

[4] Jerry R. Burch and David L. Dill. “Automatic Verification of Pipelined Microprocessor Control”. CAV ‘9j: Proceedings of the Sixth International Conference onComputer-Aided Verification. Lecture Notes in Computer Science 818, Springer-Verlag, June 1994, pp. 68-80.

[5] Paul Chow. The MIPS-X RISC Microprocessor. Kiuwer Academic Publishers,Boston, 1989.

[6] Mohammad Darwish. Design of a 32-bit Pipelined RISC Processor. Technical Reportin preparation.

[7] Scott Hazelhurst and Carl-Johan Seger. “Composing Symbolic Trajectory Evaluation Results”. CAV ‘94: Proceedings of the Sixth International Conference onComputer-Aided Verification. Lecture Notes in Computer Science 818, Springer-Verlag, June 1994, pp. 273-285.

[8] J. L. Hennessy. “Designing a computer as a microprocessor: Experience and lessonsfrom the MIPS 4000”. A lecture at the Symposium on Integrated Systems, Seattle,Washington, 1993.

[9] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufman Publishers, San Mateo, 1990.

[10] Lee Howard. “The design of the SPARC processor”. A lecture at the University ofBritish Columbia, Vancouver, British Columbia, 1994.

[11] D. A. Patterson and 3. L. Hennessy. Computer Organization and Design: The Hardware/Software Interface. Morgan Kaufman Publishers, San Mateo, 1994.

79

Bibliography 80

[12] S. Mirapuri, M. Woodacre, and N. Vasseghi. “The MIPS R4000 Processor”. IEEEMicro, April 1992, pp. 10-22.

[13] Carl-Johan Seger. Voss - A Formal Hardware Verification System: User’s Guide.Technical Report UBC-CS-93-45, Department of Computer Science, University ofBritish Columbia, Vancouver, British Columbia, 1993.

[14] Carl-Johan Seger. An Introduction to Formal Hardware Verification. TechnicalReport UBC-CS-92-13, Department of Computer Science, University of BritishColumbia, Vancouver, British Columbia, 1992.

[15] Carl-Johan Seger and Randal Bryant. Formal Verification by Symbolic Evaluationof Partially-Ordered Trajectories. Technical Report UBC-CS-93-8, Department ofComputer Science, University of British Columbia, Vancouver, British Columbia,1993.

[16] Mandayam Srivas and Mark Bickford. “Formal Verification of a Pipelined Microprocessor”. IEEE Software, September 1990, pp. 52-64.

[17] Daniel Tabak. RISC Systems. Research Studies Press Ltd., England, 1990.

[18] Sofiene Tahar and Ramayya Kumar. “Implementing a Methodology for FormallyVerifying RISC Processors in HOL”. HUG ‘93: Proceedings of the Sixth InternationalWorkshop on Higher Order Logic Theorem Proving and its Applications. LectureNotes in Computer Science 780, Springer-Verlag, August 1993, pp. 281-294.

Appendix A

The FL Verification Script

II include some libraries

load “verification.fl”;load “arithm.f1”;load “HighLowEx f1”;load “OryxLib.f1”;

II define timing parameters

let ClockPeriod = 2000000;let ScaleFactor = 20;let CycleLength = ClockPeriod/ScaleFactor;let LatchDelay = 100000/ScaleFactor;let Setup = 100000/ScaleFactor;let Iold = lO/ScaleFactor;

// define clock phases

let PhilRof± = 10*CycleLength/100;

let PhilFoff = 36*CycleLength/100;

let Phi2Roff = 60*CycleLength/100;let Phi2Foff = 85*CycleLength/100;let PhilF i = (Cycle (i—1))+PhilFoff;let PhilR i = (Cycle (i—1))+Philftoff;let Phi2F i = (Cycle (i—1))+Phi2Foff;let Phi2R i = (Cycle (i—1))+Phi2Roff;let Cycle k = (k)*CycleLength;

II define implementation parameters

let DataWidth = 32;let InstrWidth = 32;let DataAddrWidth = 32;let InstrAddrWidth = 32;let FPRegAddrWidth = 5;let RxAddrWidth = 5;

81

Appendix A. The FL Verification Script 82

let RegFileSize = 32;let CompFiincWidth = 12;let OpcodeWidth = 5;

II short—hands for clock signals

let Phil = ‘1Phil”;let Phi2 = ‘Phi2’;

7/ define two-phase non—overlapping clock signals

letrec GenPhil n = (n0) > (UNC for 0) I(((((Phil is F for PhilRoff) then

(Phil is T for (PhilFoff — PhilRoff))) then(Phil is F for (Phi2Roff — PhilFoff))) then(Phil is F for (Phi2Foff — Phi2Roff))) then(Phil is F for (CycleLength — Phi2Foff))) then(GenPhil (n—l));

letrec GenPhi2 n = (n0) > (UNC for 0) I(((((Phi2 is F for PhilRoff) then

(Phi2 is F for (PhilFoff — PhilRoff))) then(Phi2 is F for (Phi2Roff — PhilFoff))) then(Phi2 is T for (Phi2Foff — Phi2Roff))) then(Phi2 is F for (CycleLength — Phi2Foff))) then(GenPhi2 (n—l));

let GenOryxClock n (GenPhil n) @ (GenPhi2 n)

II define histories of length (n)

letrec GenControl n 1 = (n0) > 1 Ilet new = [F,F,F,F,TT] inlet ni = new:l inGenControl (n—l) nl;

letrec GenData n 1 = (n0) > 1 Ilet new = (variable_vector (“DCacheln.”(int2str (n—l))”.”)

DataWidth) inlet nl = new:l inGenData (n—l) ni;

letrec Genlnstr n 1 = (n0) > 1 Ilet new = (variable_vector (“ICache.”(int2str (n—l))”.)

InstrWidth) inlet nl = new:l in

Appendix A. The FL Verification Script 83

Genlnstr (n—i) nl;

II short—hands for some signals

let Reset = “Reset”;let Exception = “ExceptionSignal”;let Squash = “SquashSignal”;let IPFault = “IPFault”;let DPFault = “DPFault”;let NMlnterrupt = “NMlnterrupt”;let Interrupt = “Interrupt”;let DAck = “DAck”;let lAck = “lAck”;let DCacheln = nvector “Dcacheln” DataWidth;let ICache = nvector “ICache” InstrWidth;let BypassCache = “BypassCache”;let MemCycle = “MemCycle”;let CopCycle = “CopCycle”;let ReadWriteb = “ReadWriteb”;let IFetch = “IFetch”;let FPReg = nvector “FPReg” FPRegAddrWidth;let DCacheAddr = nvector “DCacheAddr” DataAddrWidth;let OCacheOut = nvector “DCacheOut” DataWidth;let ICacheAddr = nvector “ICacheAddr” InstrAddrWidth;

let EXEout =

letrec row x = (x0) > E] I([“ExecutionStage/DataRegisterCellO/DffrsCellO_”(int2str (x—i)Y”/m4”]) Q (row (x—i)) inrow DataWidth;

let TEMPout =

letrec row x = (x0) > C] I(E”MemoryStage/DataRegisterCellO/DffrsCellO...”(int2str (x—i)Y”/m4”]) C (row (x—1)) inrow DataWidth

let LMDRout =

letrec row x = (x0) > C] I(C”MemoryStage/DataRegisterCelli/DffrsCellO_”(int2str (x—i)Y”/m4”]) C (row (x—i)) inrow DataWidth

let RegFileliodes =

letrec row x y = (x0) > C] I([“RegisterFileStage/PLDataRegisterCell”(int2str yY“/DffrsCellO...”(int2str (x—i)”/m4”]) C (row (x—i) y) inletrec column y = (y0) > C] I

Appendix A. The FL Verification Script 84

([row DataWidth y]) C (column (y—i)) incolumn (RegFileSize—1)

II define opcodes

let L.D_OPCODE = [T,F,F,F,F];let ST_OPCODE = [T,F,F,T,F];let LDF_OPCODE = [T,F,T,F,F];let STF_OPCODE = ET,FT,T,F];let LDT_OPCODE = [T,F,F,F,T];let STT_OPCODE — [T,F,FIT,T],let MOVFRC_OPCODE = [T,F,T,F,T];let MOVTOC_OPCODE = [T,F,T,T,T];let ALUC_OPCODE = [T,F,T,T,T];let BEQ_OPCODE = [F,F,F,F,T];let BGE_OPCODE — [F,F,T,T,T],let BLT_OPCODE — [F,F,F,TIT],let BNE_OPCODE = [F,F,T,F,T];let ADD_OPCODE = [F,T,T,F,F];let DSTEP_OPCODE = EF,T,F,F,F];let MSTART_OPCODE = [F,T,F,F,F];let MSTEP_OPCODE = EF,T,F,F,F];let SUB_OPCODE = [F,T,T,F,F];let SUBNC_OPCODE — [F,T,T,F,F],let AND_OPCODE — [F,T,T,F,F],let BIC_OPCODE — [F,T,T,F,F],let NOT_OPCODE — [F,T,T,F,F],let OR_OPCODE = [F,T,T,F,F];let XOR_OPCODE = [F,T,T,F,F];let NOV_OPCODE = [F,T,T,F,F];let SLL_OPCODE = EF,T,F)F,T];let SRL_OPCODE = [F,T,F,F,T];let SRA_OPCODE = [F,T,F,F,T];let NOP_OPCODE

— [F,T,T,FIF],let ADDI_OPCODE — [T,T,T,F,F],let J_OPCODE = [T,T,F,F,F];let IRET_OPCODE = [T,T,T,T,T];let MOVFRS_OPCODE = [T,T,F,T,T];let MOVTOS_OPCODE = [T,T,F,T,F];let TRAP_OPCODE = ET,T,T,T,FJ;

let ADD_COMPFUNC = [F,F,F,F,F,F,F,T,F,F,F,F];let DSTEP_COMPFUNC — [F,F,F)T,F,T,F,T)F,F,F,F],let MSTART_COMPFUNC - [F,F,F,F,T,T,F,T,F,F,F,F],let NSTEP_COMPFUNC = [F,F,F,F,T,F,F,T,F,F,F,F];let SUB...COMPFUNC = [F,F,F,F,F,T,T,T,F,F,F,F];let SUBNC_COMPFUNC

— [F,F,F,F,F,F,T,T,F,F,F,F],let AND_COMPFUNC = [F,F,F,F,F,F,F,F,F,T,F,F];

Appendix A. The FL Verification Script 85

let BIC_COMPFUNC = [F,F,F,F,F,F,T,F,F,T,F,F];let NOT_COMPFUNC = [F,F,F,F,F,F,F,F,F,F,F,T];let OR_COMPFUNC = EF,F,F,F,F,F,F,F,T,F,F,F];let XOR_COMPFUNC = [F,F,F,F,F,F,F,F,F,F,T,F];let MOV_COMPFUNC = [F,F,F,F,F,F,F,T,F,F,F,FJ;let SLL_COMPFUNC = EF,F,T,F,F,F,F,F,F,F,F,F],let SRLCOMPFUNC - EF,F,F,T,F,F,F,F,F,F,F,F],let SRA_COMPFUNC — CF,F,F,F,T,F,F,F,F,F,F,F],let NOP_COMPFUNC — [F,F,F,F,F,F,F,T,F,F,F,F],

II Opcode tables

let OpcodesTablel = CLD_OPCODE,ST...OPCODE,LDF_OPCODE,STF_OP CODE,LDT_OPCODE,STT_OPCODE,MOVFRC_OP CODE,MOVTOC_OPCODE,ALUC_OPCODE,BEQ_OPCODE,BGE_OPCODE,BLT_OPCODE,BNE_OPCODE,ADDI_OPCODE,J_OPCODE,IRET_OPCODE,MOVFRS.OPCODE,MOVTOS_OPCODE,TRAP_OPCODE];

let OpcodesTable2 = [(ADD_OPCODE, ADLLCOMPFUNC),(DSTEP_OPCODE, DSTEP_COMPFUNC),(MsTART_OPCODE, MSTART_COMPFUNC),(MsTEP_OPCODE, MSTEP_COZ1PFUNC),(SUB_OPCODE, SUB_COMPFUNC),(SUBNC_OPCODE, SUBNC_COMPFUNC),(AND_OPCODE, AND_COMPFUNC),(BIC_OPCODE, BIC_CONPFUNC),(NOT_OPCODE, NOT_COMPFUNC),(OR_OPCODE, OR_COMPFUNC),(XOR_OPCODE, XO&..COMPFUNC),(MOV_OPCODE, MOV_COMPFUNC),(SLL_OPCODE, SLL_CONPFUNC),(sRL_OPCODE, SRL_COMPFUNC),(SRA_OPCODE, SRA_COMPFUNC),(NOP_OPCODE, NOP.COMPFUNC)];

Appendix A. The FL Verification Script 86

II these instructions generate results

let OpGeneratesTablel = ELD_OPCODE,LDT_OPCODE,MOVFRC_OPCODE,

ADDI_OPCODE,

MOVFRS_OPCODE];

let OpGeneratesTable2 = C(ADD_OPCODE, ADD_COMPFUNC),

(sUB_0Pc0DE, SUB_COMPFUNC),(SUBNC_OPCODE, SUBNC_COMPFUNC),(AND_OPCODE, AND_COMPFUNC),(BIc_0Pc0DE, BIC_CONPFUNC),

(NOT_OPCODE, NOT_COMPFUNC),(0R_0Pc0DE, OR_COMPFUNC),

(XOR_OPCODE, XOR_COMPFUNC),

(MOV_OPCODE, MOV_COMPFUNC),(SLL_OPCODE, SLL_COMPFUNC),(SRL_OPCODE, SRL_COMPFUNC),(SRA_OPCODE, SRA_COMPFUNC)];

II instructions to be verified

let SelectedlnstrTable2 = E(SRA_OPCODE, SRA_COMPFUNC),(SRL_OPCODE, SRL.COMPFUNC),(SLL_OPCODE, SLL...COMPFUJC),(BIc_oPcoDE, BIC_COMPFUJC),(N0T_oPcooE, NOT_COMPFUNC),(x0R_oPcoDE, XOR_COMPFUNC),

(0R_oPcoDE, OR....COMPFUNC),(AND_OPCODE, AND_CONPFUNC),(SUBNC_OPCODE, SUBNC_COMPFUNC),(SUB_OPCODE, SUB_COMPFUNC),(ADD_OPCODE, ADD_COMPFUNC)];

II instructions which take two source operands

let Needs2SrcTable = E(BIC_OPCODE, BIC_COMPFUNC),

(xoR_0Pc0DE, XOR_COMPFUNC),

(oR_oPcoDE, OR_COMPFUNC),

(AND_OPCODE, AND_COMPFUNC),(SUBNC_OPCODE, SUBNC_COMPFUNC),

(SUB_OPCODE, SUB_COMPFUNC),

(ADD_OPCODE, ADD_COMPFUNC)J;

Appendix A. The FL Verification Script 87

II some implementation mapping functions

let Squashis v cyc = (Squash is v)from ((PhilF cyc)+LatchDelay)to (Cycle cyc);

let ICacheis cyc i src =

(((Opcode ICache) isv (Opcode i))((RdAddr ICache) isv (RdAddr i))

(((RaAddr ICache) isv (RaAddr i)) when src)(((RbAddr ICache) isv (RbAddr i)) when src) ®

((CompFunc ICache) isv (CompFunc i)))from ((PhilF cyc)+LatchDelay)to (Cycle cyc);

let ICacheAddris cyc a = ICacheAddr isv afrom ((PhilF cyc)+LatchDelay)to ((Phi2F cyc)+LatchDelay);

let DCacheAddris cyc a = DCacheAddr isv afrom ((Phi2F (cyc+2))+LatchDelay)to ((Phi2F (cyc+4))+LatchDelay);

let DCacheOutis cyc dv = DCacheOut isv dvfrom ((Phi2F (cyc+2))+LatchDelay)to ((Phi2F (cyc+4))+LatchDelay);

let DCachelnis cyc dv = DCacheln isv dvfrom ((Phi2F (cyc)H-LatchDelay)to ((Phi2F (cyc))+LatchDelay);

let lAckis cyc v = lAck is vfrom ((Cycle (cyc—1)))to ((Phi2F cyc)+LatchDelay);

let DAckis cyc v = DAck is vfrom ((Cycle (cyc—1)))to ((Phi2F (cyc))+L.atchDelay);

let IPFaultis cyc v = IPFault is vfrom ((PhilF cyc))to ((Phi2F cyc)+LatchDelay);

let DPFaultis cyc v = DPFault is vfrom ((Phi2R (cyc))+LatchDelay)to ((Phi2F (cyc))+LatchDelay);

let NMlnterruptis cyc v = NMlnterrupt is v

Appendix A. The FL Verification Script 88

from ((Phi2R cyc)+LatchDelay)to ((Phi2F cyc)+LatchDelay);

let Interruptis cyc v = Interrupt is V

from ((Phi2R cyc)+LatchDelay)to ((Phi2F cyc)+LatchDelay);

let Resetis cyc v = Reset is vfrom (Cycle (cyc—1))to (Cycle cyc);

let BypassCacheis cyc v BypassCache is vfrom ((Phi2F (cyc+2))+LatchDelay)to ((Phi2F (cyc+4))+LatchDelay);

let MemCycleis cyc v = NemCycle is vfrom ((Phi2F (cyc+2))+Latchoelay)to ((Phi2F (cyc+4))+LatchDelay);

let CopCycleis cyc v = CopCycle is vfrom ((Phi2F (cyc+2))+LatchDelay)to ((Phi2F (cyc+4))+Latchflelay);

let ReadWritebis cyc v = ReadWriteb is vfrom ((Phi2F (cyc+2))+LatchDelay)to ((Phi2F (cyc+4))+LatchDelay);

let FPRegis cyc dv = FPReg isv dvfrom ((Phi2F (cyc+2))+LatchDelay)to ((Phi2F (cyc+4))+LatchDelay);

let IFetchis cyc v = IFetch is vfrom ((PhilR cyc)+LatchDelay)to ((Phi2F cyc)+LatchDelay);

let EXEoutis cyc dv = EXEout isv dvfrom ((Phi2F cyc)+LatchDelay)to (Cycle cyc);

let MEMoutis cyc dv = (TEMPout isv dvfrom ((Phi2F cyc)+LatchDelay)to (Cycle cyc))

(LMDRout isv dvfrom ((Phi2F cyc)+LatchDelay)to (Cycle cyc))

let RegFileis cyc addr dv =

let rRegFileNodes = rev RegFileNodes inlet sRegFileis i dv =

Appendix A. The FL Verification Script 89

(el (i) rRegFileNodes) isv dvfrom ((Phi2R cyc)—Setup)to ((Phi2R cyc)+LatchDelay) in

letrec iterate i =

i = 0 => UNC Ilet iv = num2bv RxAddrWidth i in

(if (iv = addr) (sRegFileis i dv)) (iterate (i—i)) initerate (RegFileSize—1);

// check if an instruction generates a destination

let GeneratesDestination instr =

let Operation = Opcode instr inlet Function = CompFunc instr inlet Combi opcode res = (Operation = opcode) OR res inlet Comb2 (opcode,fncd) res =

((Operation = opcode) AND (Function fncd)) OR res in(itlist Combi OpGeneratesTablel F) OR (itlist Comb2 OpGeneratesTable2 F);

II check if an instruction is allowed in the instruction sequence

let Selectedlnstr instr =

let operation = Opcode instr inlet function = CompFunc instr inlet comb (opcode,fncd) res =

((operation = opcode) AND (function = fncd)) OR res in(itlist comb SelectedlnstrTable2 F)

II check if an instruction takes two source operands

let Needs2Src instr =

let operation = Opcode instr inlet function = CompFunc instr inlet comb (opcode,fncd) res =

((operation = opcode) AND (function = fncd)) OR res in(itlist comb Needs2SrcTable F)

II check if an instruction is valid

let Validlnstr instr =

let Operation = Opcode instr inlet Function = CompFunc instr inlet Combi opcode res = (Operation = opcode) OR res inlet Comb2 (opcode,fncd) res =

((Operation = opcode) AND (Function = fncd)) OR res in

Appendix A. The FL Verification Script 90

((itlist Combi OpcodesTablel F) OR(itlist Comb2 OpcodesTable2 F)) AND (Selectedlnstr instr);

letrec ValidlnstrList 1 = 1 = C] > T IValidlnstr (hd 1) AND ValidlnstrList (tl 1);

II determine bypass conditions

let BypassFromExe src instrlist =

let Prevlnstr = hd instrlist inlet Generate = GeneratesDestination Prevlnstr inGenerate AND (src = (RdAddr Prevlnstr)) AND (src !t ZeroAddr)

let BypassFromMem src instrlist =

let Prevlnstr = hd instrlist inlet PrevPrevlnstr = hd (tl instrlist) inlet Generate = GeneratesDestination PrevPrevlnstr inNOT (BypassFromExe src instrlist) ANDGenerate AND (src = (RdAddr PrevPrevlnstr)) AND (src ZeroAddr)

II virtual register file implementation mapping function

let VRegFile cyc addr dv instrlist =

let BypassFromExe = BypassFromExe adds instrlist inlet BypassFromMem = BypassFromMem addr instrlist inlet NoBypass = (NOT BypassFromExe) AND (NOT BypassFromMem) in(if (BypassFromExe) (EXEoutis (cyc+1) dv)) @(if (BypassFromNem) (MEMoutis (cyc+1) dv)) ®(if (NoBypass) (RegFileis (cyc+1) addr dv))

/1 load the circuit

let Oryx = load_exe “. .1. ./bin/Oryx.exe”;

II length of the instruction sequence

let Clocks = 5;

II which instruction in the sequence we will check for

let Target = 3;

II generate histories

let ControlList = (GenControl Clocks []);

Appendix A. The FL Verification Script 91

let DataList = (GenData Clocks []);

let InstrList = (Genlnstr Clocks C]);

II assign values to input signals using generated histories

let CurrentState cyc c d src omit =

(NMlnterruptis cyc (el 1 c)) ®(Interruptis cyc (el 2 c)) 0(IPFaultis cyc (el 3 c)) 0

(DPFaultis cyc (el 4 c)) 0

(lAckis cyc (el 5 c)) 0

(DAckis cyc (el 6 c)) 0

(omit > UNC I (if (Validlnstr d) (ICacheis cyc d src)))

letrec AssignState cyc cl il = cyc > Clocks > UNC I(CurrentState cyc (hd cl) (hd il) (cyc=Target) (cyc>Target)) 0(AssignState (cyc+1) (tl cl) (tl il))

// define data values as Boolean vectors

let U = variable_vector “U.” DataWidth;

let V = variable_vector “V.” DataWidth;

II specify global constraint here

let GlobalConstraint = ValidlnstrList (prefix Target InstrList)

II evaluate the result of an instruction

let EvaluateClass3lnstrFirst Target Rd Newlnstr OldlnstrList U V =

((if (InstrIsADD Newlnstr)(VRegFile (Target+1) Rd (U add V) (Newlnstr:OldlnstrList))) 0

(if (InstrIsSUB Newlnstr)(VRegFile (Target+1) Rd (U subtract V) (Newlnstr:OldlnstrList))) 0

(if (InstrIsSUBNC Nevlnstr)(VRegFile (Target+1) Rd (U subtractnc V) (Newlnstr:OldlnstrList))) 0

(if (InstrIsAND Newlnstr)(VRegFile (Target+1) Rd (U bvAND V) (Newlnstr:OldlnstrList))) 0

(if (InstrlsOR Newlnstr)(VRegFile (Target+1) Rd (U bvOR V) (Newlnstr:OldlnstrList))) 0

(if (InstrIsXOR Newlnstr)(VRegFile (Target+1) Rd (U bvXOR V) (Newlnstr:OldlnstrList))) 0

(if (InstrIsNOT Newlnstr)

Appendix A. The FL Verification Script 92

(VRegFile (Target+i) Rd (bvNOT U) (Newlnstr:OldlnstrList)))

(if (InstrIsBIC Newlnstr)(VRegFile (Target+i) Rd (U bvBIC V) (Newlnstr:OldlnstrList))))

let EvaluateClass3lnstrSecond Target Rd Newlnstr OldlnstrList U V =(if (InstrIsSLL Newlnstr)

(VRegFile (Target+1) Rd (sll U) (Newlnstr:OldlnstrList)))

(if (InstrIsSRL Newlnstr)(VRegFile (Target+i) Rd (srl U) (Newlnstr:O].dlnstrList)))

(if (InstrIsSRA Newlnstr)(VRegFile (Target+i) Rd (sra U) (Newlnstr:OldlnstrList)))

II split the instruction sequence into two parts

// the instructions that the processor has executed so far

/1 and the instructions that the processor will execute in the future

let OldlnstrList = (rev (prefix (Target—i) InstrList));

let NewlnstrList = (suffix (Clocks—Target+i) InstrList);

II check how many source operands the instruction takes

let TakesTwoSources U V NewlnstrList OldlnstrList =let Newlnstr = (hd NewlnstrList) inlet Prevlnstr = (lid OldlnstrList) inlet PrevPrevlnstr = (hd (tl OldlnstrList)) inlet Ra = (RaAddr Newlnstr) inlet Rb = (RbAddr Newlnstr) in(Needs2Src Newlnstr) AND

((Ra Rb) OR (U = V)) AND((Ra = ZeroAddr) OR (U = ZeroData)) AND((Rb != Zerokddr) OR (V = ZeroData)) AND(NOT(InstrIsSLL Prevlnstr) OR ((last U) = F)) AND

(NOT(InstrIsSLL PrevPrevlnstr) OR ((last U) = F)) AND(NOT(InstrIsSLL Prevlnstr) OR ((last V) = F)) AND(NOT(InstrIsSLL PrevPrevlnstr) OR ((last V) = F)) AND(NOT(InstrIsSRL Prevlnstr) OR ((hd U) = F)) AND(NOT(InstrIsSRL Prevprevlnstr) OR ((hd U) = F)) AND(NOT(InstrIsSRL Prevlnstr) OR ((hd V) = F)) AND(NOT(InstrIsSRL Prevprevlnstr) OR ((hd V) = F)) AND(NOT(InstrIsSRA Prevlnstr) OR ((hd U) = (hd (tl U)))) AND(NOT(InstrIsSRA PrevPrevlnstr) OR ((lid U) = (hd (tl U)))) AND(NOT(InstrIsSRA Prevlnstr) OR ((hd V) = (hd (tl V)))) AND(NOT(InstrIsSRA PrevPrevlnstr) OR ((hd V) = (hd (tl V))))

let TakesOneSource U V NewlnstrList OldlnstrList =let Newlnstr = (hd NewlnstrList) in

Appendix A. The FL Verification Script 93

let Prevlnstr = (hd OldlnstrList) inlet PrevPrevlnstr = (hd (tl OldlnstrList)) inlet Ra = (RaAddr Newlnstr) in

let Rb = (RbAddr Newlnstr) in(NOT(Needs2Src Newlnstr)) AND

((Ra ZeroAddr) OR (U = ZeroData)) AND

(NOT(InstrIsSLL Prevlnstr) OR ((last U) = F)) AND

(NOT(InstrIsSLL PrevPrevlnstr) OR ((last U) = F)) AND(NOT(InstrIsSRL Prevlnstr) OR ((hd U) = F)) AND(NOT(InstrIsSRL PrevPrevlnstr) OR ((hd U) = F)) AND(NOT(InstrIsSRA Prevlnstr) OR ((hd U) = (hd (tl U)))) AND(NOT(InstrIsSRA PrevPrevlnstr) OR ((hd U) = (hd (tl U))))

II verify ALU and shift instructions

let VerifyClass3lnstr =

let Newlnstr = (hd NewlnstrList) inlet Ra = RaAddr Newlnstr inlet Rb = RbAddr Newlnstr in

let Rd = RdAddr Newlnstr inlet Ant =

(GenOryxClock Clocks) 0

(Reset is F from (Cycle 0) to (Cycle Clocks)) 0(Exception is F from (Cycle 0) to (Cycle Clocks)) 0(Squash is F from (Cycle 0) to (Cycle Clocks)) 0(AssignState 1 ControlList InstrList) 0(if (GlobalConstraint) (

(if (TakesTwoSources U V NewlnstrList OldlnstrList)((VRegFile (Target) Ra U OldlnstrList) (0(VRegFile (Target) Rb V OldlnstrList))) (0

(if (TakesOneSource U V NewlnstrList OldlnstrList)(VRegFile (Target) Ra U OldlnstrList))

)) in

let Cons =

(if (GlobalConstraint)

((if (TakesTwoSources U V NewlnstrList OldlnstrList)(EvaluateClass3lnstrFirst Target Rd Newlnstr OldlnstrList U V)) 0

(if (TakesOneSource U V NewlnstrList OldlnstrList)(EvaluateClass3lnstrSecond Target Rd Newlnstr OldlnstrList U V))

)) in

(nverify Options Oryx VarList Ant Cons TraceList);

II find verification time

time(VerifyClass3lnstr);