
Analysis and optimization of a debug

post-silicon hardware architecture

Joel Sanchez Moreno

SUPERVISED BY Miquel Izquierdo Ustrell

Universitat Politècnica de Catalunya

Master in Innovation and Research in Informatics

January 2022

Abstract

Post-silicon validation is a crucial step in the System on-Chip (SoC) design flow, as it is used to identify functional errors and manufacturing faults that escaped from previous testing stages. The complexity of SoC designs has been increasing over the last decades and, along with shrinking time-to-market constraints, it is not possible to detect all design flaws during pre-silicon validation. Nowadays there exist thousands of systems in the world that have been designed for very different applications, so the functionality to be tested differs between them, but all of them require a post-silicon validation infrastructure able to detect misbehaviors.

The objective of post-silicon validation is to ensure that the silicon design works as expected under real operating conditions while executing software applications, and to identify errors that may have been missed during pre-silicon validation. Post-silicon validation is performed using specialized hardware that is added to the system during the design and implementation stages.

The complexity of post-silicon validation arises from the fact that it is much harder to observe and debug the execution of software in a silicon device than in simulations. In addition, it is important to note that the post-silicon validation stage is critical and performed under a highly aggressive schedule to ensure the product reaches the market as soon as possible.

The goal of this thesis is to analyze the post-silicon validation hardware infrastructure implemented on multicore systems, taking as an example the Esperanto Technologies SoC, which has thousands of RISC-V processors and targets specific software applications. Then, based on the conclusions of the analysis, the project proposes a new post-silicon debug architecture that can fit on any System on-Chip, regardless of its target application or complexity, and that optimizes the options available on the market for multicore systems.


Acknowledgements

I would like to express my gratitude to all those who have helped me in some way or another to write this thesis. Firstly, I would like to thank my supervisor Miquel Izquierdo Ustrell, who has been my manager within Esperanto Technologies, for their support and guidance since I started at the company and during the last months while working on the master thesis. I would like to especially thank Esperanto Technologies for allowing me to develop this thesis based on a product that entered the market just a few months ago, and also for providing me access to all the required tools to perform power and area analysis. Finally, I would like to thank all the engineers I have worked with during all these months, who helped me improve my knowledge of the semiconductor field.


Contents

1. Introduction 10

1.1 Context 10

1.2 Motivation 11

1.3 Methodology 12

1.3.1 Work development 12

2. State of the Art 13

2.1 System on-Chip debug requirements 13

2.1.1 Run Control 13

2.1.2 Monitoring 13

2.1.3 Intrusive debugging 13

2.1.4 Performance debug 14

2.1.5 External debug support 14

2.2 Market analysis 15

2.2.1 FPGA based solutions 15

2.2.1.1 Xilinx 15

2.2.1.1.1 Integrated Logic Analyzer (ILA) 15

2.2.1.1.2 Virtual Input/Output (VIO) 16

2.2.1.1.3 JTAG-to-AXI Master 16

2.2.1.2 Synopsys 16

2.2.1.2.1 High-performance ASIC Prototyping Systems (HAPS) 17

2.2.1.2.1.1 HAPS Deep Trace Debug (DTD) 17

2.2.1.2.1.2 HAPS Global Signal Visibility (GSV) 17

2.2.1.2.2 ZeBu 17

2.2.1.3 Cadence 18

2.2.1.4 Siemens - Mentor 18

2.2.2 Embedded IP solutions 19

2.2.2.1 ARM 19

2.2.2.1.1 Embedded Trace Macrocell 20

2.2.2.1.2 System Trace Macrocell 20

2.2.2.1.3 Trace Memory Controller 20

2.2.2.2 SiFive 21

2.2.2.2.1 Debug Module 21


2.2.2.2.2 Nexus Encoder 21

2.2.2.3 UltraSoC 22

2.2.2.3.1 Communicator modules 23

2.2.2.3.1.1 JTAG Communicator 23

2.2.2.3.1.2 USB Communicator 23

2.2.2.3.1.3 Universal Streaming Communicator 23

2.2.2.3.1.4 Aurora communicator 23

2.2.2.3.1.5 Bus Communicator 23

2.2.2.3.2 Message infrastructure modules 24

2.2.2.3.2.1 Message Engine 24

2.2.2.3.2.2 Message Slice 24

2.2.2.3.2.3 Message Lock 24

2.2.2.3.2.4 System Memory Buffer 24

2.2.2.3.3 Analytic modules 25

2.2.2.3.3.1 Direct Memory Access (DMA) 25

2.2.2.3.3.2 Processor Analytic Module 25

2.2.2.3.3.3 Trace Encoder 25

2.2.2.3.3.4 Trace Receiver 25

2.2.2.3.3.5 Bus Monitor 25

2.2.2.3.3.6 Status Monitor 26

2.2.2.3.3.7 Static Instrumentation 26

2.2.2.3.3.8 Network On-Chip Monitor 26

2.2.3 Design For Testability (DFT) 27

2.3 Debugging multicore systems 29

2.3.1 I/O peripherals 29

2.3.1.1 Debug features 29

2.3.2 Memory subsystem 29

2.3.2.1 Debug features 29

2.3.3 Master core 30

2.3.4 Compute cores 30

2.3.5 Interconnection network 30

2.4 Debug infrastructure on multicore systems 32

3. Many-Core SoC design 33

3.1 I/O Shire 34

3.1.1 Architecture 34


3.1.2 Debug features and UltraSoC integration 35

3.1.2.1 Clock and reset unit 35

3.1.2.2 Registers and memory access 35

3.1.2.3 Service Processor 36

3.1.2.4 Maxion validation 36

3.1.2.5 Debugger hardware interface 37

3.1.2.5.1 External interface with the host 37

3.1.2.5.2 Cores communication with the host 37

3.1.2.5.3 Debug through Service Processor 37

3.1.2.6 Security 38

3.2 Minion Shire 39

3.2.1 Architecture 39

3.2.2 Debug features and UltraSoC integration 40

3.2.2.1 Software application debugging 40

3.2.2.2 Resources visibility 40

3.3 Memory Shire 42

3.3.1 Architecture 42

3.3.2 Debug features and UltraSoC integration 42

3.3.2.1 Register visibility 42

3.3.2.2 Memory visibility 42

3.3.2.3 Debug trace buffering 42

3.4 PCIe Shire 43

3.4.1 Architecture 43

3.4.2 Debug features and UltraSoC integration 44

3.5 Debug Network On-Chip 44

3.6 Conclusions after ET-SoC-1 45

3.6.1 Architecture, RTL and physical design evaluation 45

3.6.2 Post-silicon validation evaluation 46

4. Proposed multicore debugging architecture 48

4.1 Run control support 49

4.1.1 Execution 49

4.1.2 Registers 50

4.1.3 Core status 50

4.1.4 Multicore execution support 51

4.1.5 Multicore status report 52


4.2 Debug Network On-Chip 53

4.2.1 Message Types 54

4.2.1.1 Configuration messages 56

4.2.1.1.1 Read configuration messages 57

4.2.1.1.2 Write configuration messages 58

4.2.1.2 Event messages 60

4.2.1.3 Trace messages 61

4.2.2 Interface and message packaging 62

4.2.3 Timestamps 64

4.2.3.1 Synchronization 65

4.2.3.2 Timestamp computation 65

4.2.4 Interleaving 66

4.2.5 Streaming 66

4.2.5.1 Requirements 67

4.2.5.2 Operational mode 67

4.2.5.3 Timestamp computation 68

4.2.5.4 Bottleneck analysis 68

4.2.6 System resources access 69

4.2.6.1 Event support 70

4.2.6.2 Polling & Compare 70

4.2.7 Routing 71

4.2.8 Network on-Chip interconnection 71

4.3 Monitoring 72

4.3.1 Logic Analyzer 72

4.3.1.1 Interface 74

4.3.1.2 Filtering 74

4.3.1.3 Counters 76

4.3.1.4 Actions 76

4.3.2 Bus Tracker 77

4.3.2.1 Interface 78

4.3.2.2 Filtering 79

4.3.2.3 Counters 80

4.3.2.4 Actions 80

4.3.3 Trace Storage Handler 81

4.3.3.1 Trace storage configuration 82


4.3.3.2 Trace extraction operational modes 83

4.3.3.3 Trace extraction through active debugger 83

4.3.4 Compute core execution analysis 87

4.3.5 Multicore system workload balance 88

4.3.6 Network On-Chip 89

4.3.7 Hardware assertions 89

4.4 Debug modules messaging infrastructure 90

4.4.1 Interface 90

4.4.2 Compression 92

4.4.2.1 XOR and RLE compression 93

4.5 Access to the debug infrastructure 94

4.5.1 Debug address map 94

4.5.2 Debug Transaction Manager (DTM) 95

4.5.2.1 Communication with Debug NoC 95

4.5.2.2 USB and JTAG support 95

4.5.2.3 Master core and PCIe support 97

4.5.3 Read operations to SoC resources 98

4.5.3.1 JTAG/USB 98

4.5.3.2 Master core and PCIe 98

4.5.4 Write operations to SoC resources 100

4.5.4.1 JTAG/USB 100

4.5.4.2 Master core and PCIe 101

4.5.5 Security 103

4.5.6 JTAG integration 104

5. Proposed multicore design integration 106

5.1 I/O Shire 106

5.1.1 Architecture 106

5.1.2 Integration of the new proposed architecture 106

5.1.2.1 Clock and reset unit 106

5.1.2.2 Registers and memory access 107

5.1.2.3 Service Processor 107

5.1.2.4 Maxion validation 107

5.1.2.5 Debugger hardware interface 108

5.2 Minion Shire 109

5.2.1 Architecture 109


5.2.2 Integration of the new proposed architecture 109

5.2.2.1 Software application debugging 110

5.2.2.2 Resources visibility 110

5.3 Memory Shire 111

5.3.1 Architecture 111

5.3.2 Integration of the new proposed architecture 111

5.3.2.1 Register visibility 111

5.3.2.2 Memory visibility 112

5.3.2.3 Debug trace buffering 112

5.4 PCIe Shire 113

5.4.1 Architecture 113

5.4.2 Integration of the new proposed architecture 113

5.5 Debug Network On-Chip 114

5.6 Security 114

6. Evaluation 115

6.1 Run control 115

6.1.1 Broadcast support 115

6.1.2 Status report 116

6.1.2.1 Cores run control status 116

6.1.2.2 Cores register dump 117

6.2 Debug NoC 118

6.2.1 Debugger support 118

6.2.2 Efficient messaging protocol 119

6.2.2.1 Performance improvements 119

6.2.2.2 Compression savings 121

6.2.3 Support for high-bandwidth monitoring 121

6.3 Monitoring 122

6.3.1 Filtering optimization 122

6.3.2 Multiple protocol support 122

6.3.3 Trace storage improvements 123

7. Conclusion 125

8. Future work 127

9. References 128


1. Introduction

Every day we interact with a wide variety of computing systems that are used not only to make our lives easier but also more secure. From the moment we wake up every morning and pick up the phone, we are using a piece of technology that integrates a System on-Chip (SoC) that is constantly performing computation for us. We live surrounded by embedded devices that perform anything from very simple operations to complex computation. From taking an elevator to flying on an airplane, many devices work together to ensure we have not only convenience but also safety. In the same way, when we use a device for financial purposes or to share personal information, embedded computing devices ensure the security and privacy of the data.

Those devices must work following very well-defined specifications in order to not only meet their performance targets and provide the user with certain functionalities, but also to ensure that there is no data corruption or security holes. Systems on-Chip are integrated in our bedroom television as well as in the most advanced military systems in the world, and because of that there is a need to ensure that once a system has been designed and fabricated, it meets all the needed requirements.

Even with huge efforts during pre-silicon validation, it is not always possible to detect all the errors. Post-silicon validation allows us to detect design flaws that range from escaped functional errors to manufacturing problems.

Nowadays, post-silicon validation has become the major bottleneck for complex SoC designs, as time-to-market constraints shrink every year in order to be more competitive, and pre-silicon stages are now highly time-limited. Note that one of the limitations during pre-silicon validation is the number of cycles that designers can simulate or emulate in a reasonable time.

Post-silicon validation is a challenge not only because SoC designs are becoming more complex every year but also because the observability and controllability of the actual silicon are highly limited.

The validation effort once silicon is available can be classified into three major steps: (i) a preparation phase for post-silicon validation and debug, (ii) a testing phase to detect problems, and (iii) a phase focused on finding the root cause of the problem and, in case it is critical, applying workarounds so that the testing effort can continue.

1.1 Context

Esperanto Technologies has developed a System on-Chip (SoC) for large-scale machine learning applications that is composed of more than 1000 RISC-V cores; RISC-V is an open-source and royalty-free instruction set architecture (ISA).

This chip has been named ET-SoC-1 and features two kinds of general-purpose 64-bit RISC-V cores: the ET-Maxion, a superscalar out-of-order core that delivers high performance for modern operating systems and applications, and the ET-Minion, an energy-efficient in-order multithreaded core with a vector/tensor accelerator that provides a massively parallel compute array.


The goal of this SoC is to deliver better performance per watt than legacy CPU and GPU solutions without compromising general-purpose flexibility. In addition, the integration of this SoC in data centers could generate a huge decrease in energy costs, which translates to an important cost reduction for third-party companies. On the other hand, the ET-SoC-1 supports LPDDR4x DRAM with up to 32 GB of DRAM, 137 GB/s of memory bandwidth, and a 256-bit wide interface [1].

Post-silicon validation is an important stage in all System on-Chip designs, but it is critical for the first chips developed by semiconductor companies, as they have to get to market very fast to be competitive and must be able to detect errors and solve them as quickly as possible. In the case of Esperanto, the complexity grows because not only is it the first design developed by the company, but integrating thousands of cores together with on-chip memory also increases complexity.

The goal of this project is to analyze the available options in the market, using Esperanto's post-silicon validation hardware infrastructure for ET-SoC-1 as an example. Then, based on the conclusions, we propose optimizations that not only target functionality but also focus on providing a more energy-efficient debug infrastructure while increasing observability.

1.2 Motivation

Esperanto Technologies is interested in exploring a new post-silicon hardware debug infrastructure that solves the bottlenecks in the system that cause a lack of observability during certain debugging phases. In addition, the company also wants to optimize the usage of power and area while keeping good visibility and monitoring features in the system. The current debug infrastructure is based on the integration of a third party IP1 from UltraSoC Technologies, which is part of Siemens Digital Industries Software, within the Mentor Tessent group.

The use of third party IPs allows companies such as Esperanto Technologies to develop the first version of their chips faster, given that the IP not only adds the required functionalities to the system but also ensures that the logic has been verified and is safe to use as long as the company follows the integration guides. Nevertheless, the use of third party IPs also means that there can be area and power overheads, given that those components were designed to fit in different chips and were not customized for the specific project, so there may be features that are not required and cause an inefficient use of resources.

This project is a first step towards the implementation of a new post-silicon hardware debug infrastructure that fits Esperanto's requirements while ensuring that the area and power overheads added by these components are as small as possible. It is important to note that the debug infrastructure is used during the bring-up effort and is rarely used by clients.

1 Intellectual Property


1.3 Methodology

1.3.1 Work development

This project follows a waterfall methodology in which the activities are split into different phases, where each phase depends on the previous one. The project consists of the following phases: analysis, design and evaluation.

The goal of this project, as described in previous sections, is to develop a new post-silicon hardware infrastructure. In order to do that, we must first determine the requirements for debugging silicon, so that later on we can analyze the state of the art of post-silicon validation by doing an overview of the options in the market. Then, taking into account the basic debug functionalities and the available options in the market, we focus on multicore system debugging.

After that, we take Esperanto’s SoC debug infrastructure as an example of a multicore system debug

infrastructure to analyze the advantages and drawbacks of the current implementation. The

evaluation of Esperanto’s SoC implementation takes into account not only the offered debug

features but also power and area measurements together with an overview of the missing features

that could be added to support software debugging.

Once we have all the information regarding the state of the market and the multicore system requirements, we propose a new post-silicon debug infrastructure which focuses not only on improving observability but also on analyzing the optimal ways to implement those features.

Finally, we perform an evaluation of the proposed debug capabilities versus Esperanto’s SoC debug

infrastructure, which has served as a baseline model.

The intellectual property of this project belongs to Esperanto Technologies and all the information

contained in this thesis is confidential.

Note that the evaluation of Esperanto's SoC hardware post-silicon infrastructure is performed using the tools provided by Esperanto Technologies. The power analysis is performed with the Synopsys PrimePower tool, while the area analysis is performed with the Synopsys PrimeTime tool. The presented results correspond to the latest implementation stage of ET-SoC-1.


2. State of the Art

2.1 System on-Chip debug requirements

The first step in analyzing the available options in the market is to be aware of the basic requirements for an efficient post-silicon debug infrastructure. Note that access to the debug infrastructure potentially involves security problems, as it provides not only visibility of confidential system architecture details but also direct access to system resources.

2.1.1 Run Control

Systems on-Chip (SoCs) usually combine peripherals and/or memories with one or multiple cores. Those cores can be used either to run general-purpose software or to run specific high-performance software applications. Both options require the developers to be able to debug possible hangs or misbehaviors in the instruction execution, which involves run control support.

Among the different run control operations, the most important ones are: halt, resume, single step and breakpoints. Depending on the system core architecture, it may be required to follow specific implementations. For example, the Esperanto Technologies SoC cores are RISC-V compliant, which enforces them to follow the RISC-V debug specification [2].

Access to core’s run control operations during post-silicon is provided by a dedicated interface

between the host and the SoC cores. This interface at its minimum must consist of a set of I/O pins

that allow the host system to connect to the SoC and send commands to the core. Generally, the

debug infrastructure I/O interface is based on JTAG protocol since it's a well-defined standard that is

easy to use and integrate on systems.

2.1.2 Monitoring

Following the same trend as in the previous section, the user must be able to observe system behavior and to gather information about software application execution. Monitoring features are required on several resources within the system, as the user needs to have visibility not only of the instructions retired by the core(s) but also of in-flight transactions or performance counters, for example.

This functionality is not part of the core architecture, which means that there are no defined standards, and there are different options in the market, which are analyzed later in this document. Extracting information from the system generally requires a sustainable throughput or on-chip buffering.

2.1.3 Intrusive debugging

Depending on the complexity of the design, it is sometimes recommended to give the host the option to take control of the system's behavior, that is, the host can not only read but also write memories or registers within the system that have a direct impact on its behavior.


This feature is generally implemented to provide a workaround for critical pieces such as the master core in a multicore system, as it has to distribute the workload among the rest of the system cores. In this case, the user may want to ensure that, in case the master core does not work as expected, the post-silicon validation effort can continue by making use of intrusive debugging features that allow emulating the master core operations.

2.1.4 Performance debug

When analyzing software execution it is required not only to provide visibility to detect design flaws

that escaped from pre-silicon validation but also to provide tools to the users that allow them to

analyze system performance. Performance monitoring can be provided either through monitor

components that enable the users to perform filtering of system signals and retrieve the number of

times a specific event happened, or by providing read access to system registers that automatically

gather performance metrics.
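As a minimal illustration of the second approach (a sketch assuming an RV64 target built with GCC, not part of the original infrastructure), software can read the standard RISC-V cycle and instret counters around a region of interest:

    #include <stdint.h>

    /* Read the standard RISC-V performance counters. Assumes an RV64 target
       where the counters are accessible from the current privilege mode. */
    static inline uint64_t read_cycle(void)
    {
        uint64_t c;
        __asm__ volatile ("rdcycle %0" : "=r"(c));
        return c;
    }

    static inline uint64_t read_instret(void)
    {
        uint64_t i;
        __asm__ volatile ("rdinstret %0" : "=r"(i));
        return i;
    }

    void profile_region(void (*work)(void))
    {
        uint64_t c0 = read_cycle(), i0 = read_instret();
        work();                                    /* region under analysis */
        uint64_t cycles = read_cycle() - c0;
        uint64_t instrs = read_instret() - i0;
        (void)cycles; (void)instrs;                /* e.g. report over a debug channel */
    }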

2.1.5 External debug support

As has been mentioned in previous sections, the user must be able to access the debug features from the host in order to apply run control operations or monitor the system.

The most widespread debugging software tool is the GNU Debugger, also known as GDB [3],

which allows the user, for example, to run software applications on a given core up to a certain

instruction, then stop and print out registers or variable values, or to step through the program one

instruction or function at a time and print out the values of each variable after executing those. GDB

has a wide range of supported operations [4] and, depending on the core architecture, there may

be specific implementations. For example, RISC-V provides its own support for GDB [5].

GDB is the high-level layer that the user sees and makes use of for run control operations, but it is

important to note that external debug support consists of several layers, and one of those is

OpenOCD [6] which aims to provide debugging, in-system programming and boundary scan testing

for embedded target devices.

OpenOCD is able to connect to the SoC debug infrastructure with the assistance of a debug adapter, which is a small hardware module integrated into the system. There can be one or multiple hardware modules that allow OpenOCD to connect, such as USB or JTAG. Note that OpenOCD is open source and can be modified to adapt to custom debug architectures and enhance debugging performance.
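As a minimal sketch (not taken from the thesis), a typical GDB session through OpenOCD for a RISC-V core could look as follows, assuming OpenOCD is already running and exposing its default GDB server on port 3333; the trailing comments are annotations only:

    (gdb) target extended-remote localhost:3333    # attach to the OpenOCD GDB server
    (gdb) monitor reset halt                       # reset and halt the core via the adapter
    (gdb) load                                     # download the application image
    (gdb) break main                               # set a breakpoint
    (gdb) continue                                 # run until the breakpoint is hit
    (gdb) info registers                           # dump the core registers
    (gdb) stepi                                    # single step one instruction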

On the other hand, depending on the debug infrastructure implemented on the system, there may

be additional software tools that allow users to perform monitoring and to visualize the information

gathered from the system.


2.2 Market analysis

Before starting the effort of developing a custom debug infrastructure, it is necessary to evaluate the available options in the market. There are several companies that provide debug components that can be integrated into Systems on-Chip, allowing companies such as Esperanto Technologies to avoid spending human resources on developing their own debug components, which would require design, implementation and validation effort.

In this section we analyze not only the main options in the market for acquiring debug components from a third party company, but also the debug platforms available in the market, such as emulators based on FPGA clusters, and the debug features they provide, so that this information can be taken into account for developing a more efficient debug infrastructure.

2.2.1 FPGA based solutions

Field Programmable Gate Arrays (FPGAs) were left aside by the market for many years due to their programming complexity, even though they provide good performance and low power consumption. In the last decade this has changed, as multicore systems have become more and more complex, increasing the probability of bugs during post-silicon validation. Since time-to-market became crucial, designs need to be validated before they are sent to the manufacturer, and this validation effort relies on FPGAs in most cases. Some of the FPGA validation solutions include debug features that are interesting to take into consideration in this project.

2.2.1.1 Xilinx

Xilinx [7] is known for inventing the Field Programmable Gate Array (FPGA) and leads the FPGA market due to its extensive number of FPGA products, which enable customers to test and develop designs without requiring manufacturing, allowing them to iterate on design and verification as many times as needed.

Xilinx provides debug components that can be integrated into FPGA designs to allow users to extract information or drive control signals. In addition, Xilinx relies on Vivado [8] as the software tool used to integrate the debug IPs and to visualize the gathered information.

2.2.1.1.1 Integrated Logic Analyzer (ILA)

The ILA IP core [9] allows the user to monitor internal signals of the design and can be customized so that the user can decide the number of input data signal ports, the trigger function, and the data width and depth to be traced. The Integrated Logic Analyzer can easily be attached to the design under test (DUT) given that it works following the clock constraints that the user applied to their design.

The captured signals are sampled at the design clock frequency and stored in a buffer using on-chip block RAM (BRAM). Once the trigger occurs, the buffer is filled and the software can then use this data to generate a waveform.


The trigger configuration is based on multiple comparators, such that each input port has its own. The comparators allow the user to match level patterns where some of the bits can be masked out and not taken into account; the masked input data is then compared with the specified match value.
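As a generic illustration of this kind of masked comparison (a sketch of the principle, not Xilinx's implementation), a trigger hit can be computed as follows, where bits set in mask are ignored:

    #include <stdint.h>
    #include <stdbool.h>

    /* Generic mask/match trigger check: bits set in 'mask' are ignored and the
       remaining bits of 'sample' must equal 'match' for the trigger to fire. */
    static bool trigger_hit(uint32_t sample, uint32_t match, uint32_t mask)
    {
        return ((sample ^ match) & ~mask) == 0;
    }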

2.2.1.1.2 Virtual Input/Output (VIO)

The VIO IP core [10] can be used to monitor and drive FPGA signals in real time but it is important to

note that it does not apply filtering and does not make use of buffering through RAMs as the

Integrated Logic Analyzer does.

As has been mentioned before, the VIO captures data without filtering, which means that the input ports are designed to detect and capture value transitions, but the sample frequency may not match the design frequency, causing the monitored signals to transition many times between consecutive samples.

On the other hand, the VIO can also be used to drive data into a user's design through a JTAG interface, allowing it to manage system control signals.

The Virtual Input/Output core allows the user to replace status indicators such as LEDs in order to

save I/O ports and also to manage design control signals by driving different values.

2.2.1.1.3 JTAG-to-AXI Master

The JTAG-to-AXI Master IP core [11] allows the user to drive AMBA AXI signals on the FPGA at run time. This IP supports AXI4 and AXI4-Lite interfaces and data interface widths of 32 or 64 bits; the AMBA AXI data interface width is customizable so that it can be adapted to the user's design.

This IP can be used to drive one or multiple design AMBA AXI slave interfaces to emulate other

components or host connections during post-silicon.

2.2.1.2 Synopsys

Synopsys [12] is a company that has based its products on electronic design automation, focusing on silicon design and verification, silicon IP and software. It is widely known in the industry as it provides a wide range of products that go from design components, such as DDR memories that can be integrated into an SoC, to software that is used to analyze the behavior of a design before it is fabricated or handles the physical design routing of the Register-Transfer Level code.

Some Synopsys IPs can be used for debug purposes [13] in order to facilitate the integration of debug components and ease the validation and bring-up effort. Synopsys offers two approaches that can be used for debugging a user's design: prototyping and emulation.

The prototyping solution is based on a board with multiple FPGAs, usually Xilinx FPGAs, mounted on it together with pre-implemented debug hardware. This approach provides better performance than emulation but requires designers to spend a significant amount of time partitioning the model


across FPGAs, managing clock domains and/or completing FPGA physical designs to handle timing

closure.

On the other hand, emulation allows designers to reduce the implementation time and resources

typically associated with prototyping but is not as efficient as prototyping in terms of performance.

Emulation is highly recommended for fast model bring-up which has an impact on schedule and

resources as it does not require designers to handle timing closure for example.

In summary, when designers' RTL models are less stable, it is highly recommended to go with emulation-like solutions, as they provide a speedup over software simulation and high trace visibility while making it easier for designers to integrate their models on the platform.

Then, as the RTL model stabilizes, designers can transition to prototyping as it provides an efficient

path to achieving performance suitable for firmware/software development and debugging.

2.2.1.2.1 High-performance ASIC Prototyping Systems (HAPS)

The Synopsys prototyping solution is based on HAPS [14][15] and makes use of the Synopsys ProtoCompiler software tool to automatically build debug trace visualizations from the extracted information.

2.2.1.2.1.1 HAPS Deep Trace Debug (DTD)

The HAPS DTD [15] can be classified as an embedded trace extraction technique that allows the

user to get signal visibility within each of the FPGAs on the HAPS system. In addition, it also enables

the user to extract sample depths under complex triggering scenarios that can be defined at

runtime.

HAPS DTD trigger scenarios can be based on simple signal transitions or on more advanced filtering

taking into account several control signal sequences. Once the trigger is fired, the HAPS

ProtoCompiler Debugger [15] extracts the information and generates a text file or a waveform that

can be used for debugging purposes.

The Deep Trace Debug is automatically implemented through the HAPS ProtoCompiler tool, which is

also provided by Synopsys, such that HAPS DTD can be considered as a default implementation on

HAPS systems.

2.2.1.2.1.2 HAPS Global Signal Visibility (GSV)

The HAPS GSV [15] enables users to stop the system clock and perform a snapshot of the system

allowing it to dump the information from all signals. Then, once the system signals are captured, the

clocks are released to normal operation on the platform, or they can be stepped one by one or in

small chunks to capture more information and this process can be repeated any number of times.

2.2.1.2.2 ZeBu

On the other hand, the Synopsys ZeBu [16] solution is the emulation approach used by many companies to allow engineers to run long tests that would take a large amount of time in simulation without


requiring a huge FPGA prototyping effort. In addition, emulation provides more visibility than prototyping, as it does not optimize parts of the designs that act as black boxes.

ZeBu stands out because of its fast runs, as it can emulate systems faster than its competitors while also providing large capacity for big designs. On the other hand, its compilation time is much longer than that of its competitors, which is a drawback when designers have to compile frequently.

ZeBu supports SystemVerilog Assertions (SVA) [17], but there is no support for Universal Verification

Methodology (UVM) [18] nor functional coverage. In addition, the implementation of ZeBu tracing, used to extract information from the system, is based on the Xilinx FPGAs' built-in scan-chain support, which allows gathering the information from all signals within the system, but at the cost of running at a few tens of hertz. The gathered information is not stored in on-board memories; instead, it is

sent directly to the host server. Note that performing tracing on an emulation platform is very slow

and the generated output can only be handled by Synopsys tools.

2.2.1.3 Cadence

Cadence [19] is an American multinational company that develops software, hardware and IP used

to design chips, systems and printed circuit boards (PCB). In addition, it also provides IP covering

interfaces, memory, verification and SoC peripherals among others.

Palladium [20] is Cadence's emulator solution, which stands out for its compilation speed, as it is faster than its competitors in terms of millions of gates compiled per hour. That said, it requires more FPGAs, resulting in larger dimensions and higher power consumption.

When it comes to debug features, Palladium supports SystemVerilog Assertions (SVA), Universal

Verification Methodology (UVM) and functional coverage. In addition, it also implements FullVision, which provides at-speed full visibility of nets for a few million samples at runtime. Note that in case designers need to generate samples for many millions of cycles, Palladium provides partial visibility of a subset of signals in the system that must be pre-selected at compile time.

Furthermore, Palladium offers the InfiniTrace feature, which allows the user to capture the status of the system at a given point, revert back to any checkpoint and restart emulation.

2.2.1.4 Siemens - Mentor

Mentor [21] is a company focused on electronic design automation (EDA), which was acquired a few

years ago by Siemens. Mentor provides not only EDA solutions but also simulation tools, VPN solutions and heat transfer tools, among others.

Among Mentor’s wide range of products, Veloce [22] is its emulator platform solution which

differentiates from previous solutions in the fact that it is easier and less expensive to install and

does not consume as much power as its competitors. Unfortunately, it does not reach the

compilation speed from Palladium nor the high performance from ZeBu.


Veloce also supports SystemVerilog Assertions (SVA), Universal Verification Methodology (UVM) and

functional coverage. In addition, Mentor’s emulator allows on-demand waveform streaming of a

few selected signals without requiring to re-compile and reduces the amount of data sent to the

host which improves time-to-visibility.

2.2.2 Embedded IP solutions

Nowadays, most Systems on-Chip integrate embedded debug components to facilitate validation efforts during post-silicon. Those components are used not only to monitor particular functionalities at runtime but also to measure performance, observe the internal states of control signals and allow applying workarounds when critical bugs appear.

Several companies in the world develop embedded components providing not only hardware to be

integrated on the systems but also software tools that ease the management and configuration of

those. Embedded debug IPs can provide a wide range of features such as ways to capture data

based on advanced filtering, synchronize monitors, generate timestamps to know when certain

events occurred, among others.

It is important to note that some of the embedded components are designed to be integrated on

specific core architectures and that there is not a defined standard that is followed by all

companies, that is, each company provides its own solution. In this section, we do an overview of

the most interesting options in the market that provide embedded IP to address observability in

post-silicon debugging.

Figure 1. Debug embedded IP integration on a system

2.2.2.1 ARM

ARM [23] is a company well known for its wide range of products, which go from complex designs

such as complete Systems On-Chip to small designs such as memories or interfaces. ARM develops

the architecture and licenses it to other companies which design their own systems and integrate

those architectures.

CoreSight [24] technology provides not only an exhaustive range of debug components that can be integrated on customers' SoCs but also a set of tools for debugging software execution on ARM-based


SoCs. This technology allows gathering information from the system and modifying the state of parts of the design.

The main CoreSight components are:

● CoreSight Embedded Trace Macrocell (ETM)

● System Trace Macrocell (STM)

● Trace Memory Controller (TMC)

2.2.2.1.1 Embedded Trace Macrocell

The Embedded Trace Macrocell (ETM) [25] component provides the capability to gather information

regarding instruction execution so that the user can analyze the behavior of the processor by

looking at which instructions have been retired or if there have been exceptions during the

execution. In addition, this component allows adding timestamp information so that the user can

see how many cycles passed between each instruction. ETM generated traces are used for profiling,

code coverage, and to diagnose problems that are difficult to detect such as race conditions.

The debugger can configure the ETM through a JTAG interface. On the other hand, this information

can be either exported through a trace port and analyzed in the host, or can be stored to an on-chip

buffer, known as the Embedded Trace Buffer (ETB) [26], which later can be read through the

external interfaces.

It is important to note that the ETM implements a compression algorithm in order to reduce the amount of information generated, which reduces both the number of additional pins required on the ASIC and the amount of memory required by the Embedded Trace Buffer.

2.2.2.1.2 System Trace Macrocell

The System Trace Macrocell (STM) [27] is a software driven trace source, that is, it focuses on

enabling real-time instrumentation of the software such that while the core is executing it provides

information with no impact on system behavior or performance.

The main purpose of STM is to allow developers to embed printf-like functionality within their code.

This can be required given that printf goes through a UART, which is expensive not only in terms of speed but also because the developer must embed a big library and exception handlers within their embedded application, causing overhead. On the other hand, STM trace allows the developer to add a single store to the address of an STM channel so that the message is transferred and, at some point, captured by the host with no impact on core behavior.
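As a hedged sketch of this mechanism (the stimulus-port base address and channel stride below are hypothetical and depend on the SoC memory map), emitting a trace value is reduced to a single store:

    #include <stdint.h>

    /* Hypothetical addresses: the real STM stimulus-port base and channel
       stride are defined by the SoC's memory map, not by this sketch. */
    #define STM_STIMULUS_BASE  0x20000000u
    #define STM_CHANNEL_STRIDE 0x100u

    /* Emit one 32-bit value on an STM channel: a single store, so the cost on
       the traced core is negligible compared to a printf over UART. */
    static inline void stm_trace(unsigned channel, uint32_t value)
    {
        volatile uint32_t *port =
            (volatile uint32_t *)(STM_STIMULUS_BASE + channel * STM_CHANNEL_STRIDE);
        *port = value;
    }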

2.2.2.1.3 Trace Memory Controller

The Trace Memory Controller (TMC) [28] is an infrastructure component which can be configured to

support three different behaviors in order to store or route trace messages generated through the

Embedded Trace Macrocell (ETM). The TMC can serve as a trace buffer, as a trace FIFO or as a trace

data router over AMBA AXI to memory or off-chip to interface controllers. Those three

configurations are explained below:


● Embedded Trace Buffer (ETB):

Provides a circular buffer to store trace messages and supports up to 4GB of SRAM.

It is important to note that traces cannot be extracted until trace capture has finished,

which can be done after receiving a trigger signal. In case the traced data is bigger than the buffer size, the oldest samples are overwritten.

● Embedded Trace FIFO (ETF):

Provides a FIFO to store trace messages into dedicated SRAM. It differs from the ETB in that trace messages are not lost or overwritten in case the external debugger does not read fast enough; instead, back pressure is applied to the source component that generates those trace messages when the FIFO is full.

● Embedded Trace Router (ETR):

Provides a mechanism in which trace messages are converted to AMBA AXI protocol so that

they can be routed to either system memory or any other AMBA AXI slave.

2.2.2.2 SiFive

SiFive [29] is a young semiconductor company that has focused on developing the potential of the open-source RISC-V ISA by designing and selling a wide range of RISC-V cores, which can be used for general-purpose applications or to accelerate artificial intelligence and machine learning applications, as well as SoCs where the customer can choose the memory interface and peripherals, IPs and development boards.

Among all the SiFive products there’s the SiFive Insight [30], which is presented as the first advanced

trace and debug solution for RISC-V allowing designers access to debug capabilities in order to

bring-up first silicon, support hardware and software integration, and debug software applications.

SiFive Insight includes a wide range of options that go from run-control debug, where designers can

control how the core operates, to advanced multicore trace solutions. Usually customers have these

features already integrated when they acquire a SiFive RISC-V core so that its integration is already

verified.

2.2.2.2.1 Debug Module

The Debug Module is intended to be integrated as the interface between the debug infrastructure

and the external interface, which is usually based on JTAG protocol, and follows the RISC-V debug

specification [2]. The run control operations are received by the Debug Module which then connects

with the CPU and sends the required RISC-V debug commands to enable operations such as

breakpoints or single stepping.

In addition, the Debug Module also offers the option to include a System Bus Access (SBA) in order to access memory or registers without interrupting the core.

2.2.2.2.2 Nexus Encoder

The SiFive cores also provide the option to be connected with a Nexus 5001-compliant trace

encoder [31][32] similar to ARM’s Embedded Trace Macrocell, which is a highly configurable


component that provides multicore trace capability, allowing the designer to know which instructions are being retired, which branches have been taken or whether exceptions have been raised.

2.2.2.3 UltraSoC

UltraSoC, which has recently been acquired by Siemens and renamed Tessent Embedded Analytics [33], is one of the most important companies for post-silicon debugging due to its wide range of embedded monitoring hardware products [34] for complex SoCs, which enable accelerating silicon bring-up and optimizing product performance. Its products are used in several fields such as the automotive, high-performance computing, storage and semiconductor industries. In this section we do an overview of the most interesting embedded products that UltraSoC can offer for embedded monitoring.

UltraSoC components can be separated into three types:

➢ Communicator modules: Interface between the UltraSoC message infrastructure and the

debugger.

➢ Message infrastructure: Connection of the UltraSoC modules together into a hierarchy.

➢ Analytic modules: Monitoring and control of system resources.

The UltraSoC components are connected following a tree topology in which the components at the top are the Communicator modules, which allow the user to manage the debug subsystem. Then, those are connected with the message infrastructure components that interconnect the whole system, allowing the analytic modules to be configured to gather data from the system, as can be seen in the image below.

Figure 2. Example of UltraSoC embedded components integration

Next, we can see the different embedded components available.


2.2.2.3.1 Communicator modules

The communicator components are used to act as an interface between the debugger and the

analytic components in the system. The debugger can be either an external debugger that

communicates through I/O pins or a core in the system that is able to configure the UltraSoC

components and can use the gathered information to manage the system.

2.2.2.3.1.1 JTAG Communicator

The JTAG Communicator provides low-bandwidth connectivity to the UltraSoC infrastructure via an

IEEE 1149.1 scan-test interface. It enables a basic flow-control mechanism for burst communications (input data) to supplement the host-initiated symmetric data transfer (data scanned in and out simultaneously).

It is widely recommended in the semiconductor field to integrate a JTAG interface that allows

accessing the debug post-silicon infrastructure as it does not require a complex configuration and

follows a standard and easy-to-use protocol. On the other hand, it is important to note that due to

its low bandwidth it does not allow extracting large amounts of data in parallel, which can be a problem for big SoCs.

2.2.2.3.1.2 USB Communicator

The USB Communicator provides medium-speed debug communications. It is based on an UltraSoC-designed cut-down USB MAC core that implements a fixed configuration for a pair of bulk data

end-points. It is intended for direct connection to a USB PHY interface to enable a dedicated debug

channel or to the optional UltraSoC USB debug hub core.

The USB communicator is autonomous, requiring no host processor or software intervention.

UltraSoC has two versions of the USB Communicator: one that acts as a device and another that

acts as a hub, allowing the USB PHY interface to be shared with other USB devices apart from the UltraSoC one.

2.2.2.3.1.3 Universal Streaming Communicator

The Universal Streaming Communicator (USC) provides a proprietary low overhead communications

protocol suitable for low, medium or high bandwidth communications using existing established

physical signaling.

2.2.2.3.1.4 Aurora communicator

The Aurora Communicator utilizes Xilinx’s Aurora Protocol [35] over a PIPE interface to provide a

high bandwidth output from the UltraSoC infrastructure to an external Aurora based probe. The

Aurora Communicator is unidirectional so it only allows data to be transmitted out of the UltraSoC

infrastructure.

2.2.2.3.1.5 Bus Communicator

The Bus Communicator enables software running inside the chip to drive the UltraSoC system as

though it is a debugger through a bus slave like AMBA AXI4 or similar. The purpose of this


functionality is to allow the software to monitor itself such as during pre-deployment soak-testing

and post-deployment / early-adoption. This enables critical software and on-chip activity to be monitored using either background software on the applications processor or a smaller processor core dedicated to housekeeping activities.
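As an entirely hypothetical sketch of how on-chip software could act as the debugger through such a bus slave (the register layout below is invented for illustration and does not reflect the actual UltraSoC programming model), the flow reduces to a plain memory-mapped command/poll sequence:

    #include <stdint.h>

    /* Hypothetical memory-mapped layout of the debug bus slave. */
    #define DEBUG_SLAVE_BASE 0x30000000u
    #define DBG_DATA_REG (*(volatile uint32_t *)(DEBUG_SLAVE_BASE + 0x0))
    #define DBG_CMD_REG  (*(volatile uint32_t *)(DEBUG_SLAVE_BASE + 0x4))
    #define DBG_STAT_REG (*(volatile uint32_t *)(DEBUG_SLAVE_BASE + 0x8))

    /* On-chip software acting as a debugger: write a payload and a command,
       then poll until the debug infrastructure reports completion. */
    static void debug_issue(uint32_t cmd, uint32_t payload)
    {
        DBG_DATA_REG = payload;
        DBG_CMD_REG  = cmd;
        while ((DBG_STAT_REG & 0x1u) == 0)
            ;   /* busy-wait for the done flag */
    }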

2.2.2.3.2 Message infrastructure modules

The message infrastructure components are used to connect the communicator components with the analytic components, allowing messages to be routed in a tree-based topology. In addition, some of the message infrastructure components allow broadcast operations to inject events that are propagated to all the analytic modules, which can enable or disable tracing.

2.2.2.3.2.1 Message Engine

The Message Engine connects analytic modules together, and also communicators. Message

Engines route messages and events between their interfaces and are arranged in a tree topology.

The Message Engine is a universal block that can be connected hierarchically to form a complex

debug fabric based on the UltraSoC message interface. It employs multiple logical flows so that

independent message streams can be routed and buffered with different priorities.

2.2.2.3.2.2 Message Slice

The Message Slice component can be used to break the combinatorial path between one message

interface and another by registering the signals in each direction. It is intended to aid timing closure

by breaking a long message link into two shorter links. The Message Slice component does not add

functionality and has no run-time configuration. It has parameterized interface data path widths

and is intended to aid integration.

2.2.2.3.2.3 Message Lock

The Message Lock component can be used to block a specific message interface. It is intended to be

used in systems where security mandates fine granularity of access permissions to UltraSoC

components. This component can be placed on any message interface in an UltraSoC system. It is

typically used on the interface between the communicator components and the Message Engine

which, along with the analytic components attached to the Message Engine, may need to have

access blocked for security purposes.

When the system’s security function determines that access is not allowed on a particular interface,

it sets the lock request on the lock control interface. The Message Lock provides indication that the

interface is locked.

2.2.2.3.2.4 System Memory Buffer

The System Memory Buffer (SMB) provides network-level storage and output to system memory via

a system bus interface. This component is used when the amount of gathered data is so big that it cannot be absorbed by the communicators due to their restricted bandwidth; the data is then forwarded to the SMB, which is connected to an SoC memory and uses it to store the data. This

component can be configured in two different modes:


- In double-ended operation the SMB will autonomously read messages back from system

memory and direct them to a suitably configured communicator, such as USB or JTAG.

- In single-ended operation the SMB will store the messages in the system but the software is

responsible for extracting this information by accessing the target memory.

2.2.2.3.3 Analytic modules

The analytic components are used to monitor and control the behavior of the system and allow not only gathering information from the system but also injecting read and write requests that can change the configuration of the SoC. These modules can be considered the leaves of UltraSoC's

tree-based topology as they have two interfaces which are connected to the Message Engine and

the system resources, respectively.

2.2.2.3.3.1 Direct Memory Access (DMA)

The DMA analytic module provides direct memory access. It enables the debugger to issue

transactions through an AMBA AXI4 interface in order to read or write a block of memory or access

registers. This module is intended to enable program loading in an UltraSoC system as well as

directed memory inspection via read accesses and manipulation via write accesses.

2.2.2.3.3.2 Processor Analytic Module

The Processor Analytic Modules are intended to be used to control the processor cores, extract

processor related status information and performance monitor values as well as integrate with the

processor’s existing debug support which may additionally provide instruction and or data trace.

There are two main variants, the BPAM (Bus Processor Analytic Module) and the JPAM (JTAG

Processor Analytic Module). The former makes use of an AMBA APB3 interface to connect with the

system resources while the latter allows the same but through a JTAG interface.

2.2.2.3.3.3 Trace Encoder

The Trace Encoder provides a mechanism to monitor the program execution of a CPU in real time. It

encodes instruction execution and optionally load/store activity and outputs data in a highly

compressed format. Filtering can be applied to determine what and when to trace. In addition,

optional statistics gathered through counters are also available. The Trace Encoder only supports CPUs that are RISC-V compliant.

Software running externally can take this data and use it to reconstruct the program execution flow.

2.2.2.3.3.4 Trace Receiver

The Trace Receiver module accepts trace data, typically generated by a CPU, and encapsulates the information into UltraSoC messages.

2.2.2.3.3.5 Bus Monitor

The Bus Monitor analytic module can be used to monitor a master or slave interface on a bus,

point-to-point fabric or on-chip network. The component can be parameterised to have between

one and sixteen filters, and up to 64 metric counters enabling it to collect several independent,


although potentially overlapping, data sets. Each filter is equipped with a match unit which can be

used to filter the observable transactions to only those of interest.

The Bus Monitor can passively monitor up to four master or slave bus interfaces compliant to one of

the following standards: AMBA AXI3, AMBA AXI4, AMBA ACE-Lite, AMBA ACE, OCP 2.1 or OCP 3.0.

Monitoring configuration can be based on any bus field such as: address range, transaction

identifier range, length, size, burst type, privilege, priority and any other control fields.

2.2.2.3.3.6 Status Monitor

The Status Monitor provides a wide variety of monitoring functions that can be used for debug,

diagnostics or performance profiling. It is delivered as a parameterized soft core which allows

capabilities to be balanced against gate count according to the needs of the application on a per

instance basis.

The Status Monitor offers a wide variety of monitoring functions, each of which can be included or

excluded at instantiation through parameterization. In this way, gate count can be adjusted to match

the functional requirements of the application. Monitoring functions are based on the use of

advanced filters which consist of one or more comparators that can be configured to select the

subset of bits to be analyzed, the expected value and the operation to be performed. For example,

the user can connect the Status Monitor to the core pipeline to analyze internal signals from core

execution and, through the comparators, select the subset of bits they are interested in and

the expected value to trigger a hit. Then, depending on the user's configuration the comparator hit

can either trace data at that same cycle, capture the data in previous cycles or in the next cycles. In

addition, the comparator hit can also be used to generate an event to enable or disable other filters

or to increase or decrease internal counters.

The Status Monitor filters can be attached to one or more comparators allowing the user to look for

specific values or to match under a range of values. In addition, the user can also configure Status

Monitor counters to gather information about how many times a filter has performed a hit, which

can then be used for performance analysis.

2.2.2.3.3.7 Static Instrumentation

The Static Instrumentation analytic module enables data delivered over a bus fabric such as AMBA

AXI to be automatically converted to UltraSoC instrumentation trace messages and transferred to

the debugger along with the trace messages from other modules in the UltraSoC systems. The

module presents blocks of independently enabled channels in order to facilitate acceptance

(capture) and filtering of many trace data streams from the software.

2.2.2.3.3.8 Network On-Chip Monitor

The NoC Monitor analytic module can be used to monitor a set of master or slave NoC channels. It

is parameterised to have between one and sixteen filters, and up to 64 metric counters enabling it

to collect several independent, although potentially overlapping, data sets. Each filter is equipped


with a match unit which can be used to filter the observable transactions to only those of interest. It passively monitors up to 4 sets of master or slave NoC channels compliant with the AMBA CHI standard.

2.2.3 Design For Testability (DFT)

The first step during the bring-up stage consists of confirming that the system has been built correctly, and in order to do that, design for testability (DFT) is used to achieve high fault coverage during manufacturing tests. That is, designers must be able to detect failures such as shorts, incorrectly inserted components or faulty components so that they can be fixed. This verification effort is typically controlled by automatic test equipment (ATE), a powerful test instrument that not only verifies the design as a whole but also focuses on parts of the SoC, increasing coverage on the most critical regions.

The design is identified as a fault-free circuit if the responses to the input patterns introduced during the ATE effort match the expected correct responses. In case the circuit test fails, a diagnosis is required to identify the root cause of the failure.

Applying those test patterns and capturing the expected responses relies on a hardware implementation known as Design For Testability [36], which consists of one or multiple scan chains that connect the system flip-flops, providing observability and a testing mode.

The scan chain is a hardware structure that facilitates observability of all the system flip-flops and is

the most common method to drive signals within the system and also to extract information. This

method requires registers (flip-flops or latches) in the system to be connected in one or more chains

allowing the user to introduce patterns or read the information within the registers. Those chains

are then connected to I/O system pins such that the user can shift data into the system and also

extract data from the system. Note that, apart from the I/O system pins used to introduce and extract information, there are two extra input pins: one to enable scan mode (scan_enable) and one to control the clock of all flip-flops during scan mode (scan_clock).

Figure 3. System design scan-chains integration schematic


In order to drive data into the system, the user must enable scan mode and shift data in via the scan chain's SoC input physical pins, such that at each clock pulse, controlled via the scan_clock input pin, a new bit is shifted into the chain. This is done as many times as there are flip-flops or latches in the scan chain in order to fill all of them. Then, the user can disable scan mode so that on the next clock cycle the system flip-flops do not shift the information to the next flip-flop on the chain; instead, the system logic executes in functional mode, generating a response.

Next, once the user has shifted in all flip-flop values, disabled scan mode and performed a clock cycle, the system design response can be extracted. Data is extracted in the same way: the user must first re-enable scan mode and then, at each clock pulse controlled through scan_clock, the target scan chain value is shifted out to the system output pins and captured by the user.

Note that at each clock pulse the first flip-flop in the chain receives the user input data through the system input pins; at the same time, each connected flip-flop receives the value of the previous flip-flop in the chain, and the last flip-flop in the chain outputs its value through the system output pins.
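The shift-in, functional-cycle and shift-out sequence can be illustrated with a small behavioural model. The sketch below assumes a single chain and an arbitrary toy combinational function between flip-flops; it is a conceptual illustration, not a description of any particular DFT tool flow.

# Behavioural sketch of a single scan chain. Flip-flops are modelled as a
# list of bits; scan_enable selects between shift mode and functional mode.
# The "functional logic" here is a toy XOR with the neighbouring flip-flop,
# chosen only to show how the captured response differs from the pattern.

def shift_in(chain, pattern):
    """Shift one bit per clock pulse; needs len(chain) pulses to fill it."""
    for bit in pattern:
        chain = [bit] + chain[:-1]   # new bit enters the first flip-flop
    return chain

def functional_cycle(chain):
    """One clock with scan_enable deasserted: flip-flops capture the
    response of the combinational logic."""
    return [chain[i] ^ chain[(i + 1) % len(chain)] for i in range(len(chain))]

def shift_out(chain):
    """Shift the captured response out through the last flip-flop."""
    observed = []
    for _ in range(len(chain)):
        observed.append(chain[-1])   # last flip-flop drives the output pin
        chain = [0] + chain[:-1]
    return observed

chain = [0, 0, 0, 0]
chain = shift_in(chain, [1, 0, 1, 1])   # scan_enable = 1
chain = functional_cycle(chain)         # scan_enable = 0 for one clock
print(shift_out(chain))                 # response captured by the tester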

This scan-in (SI) and scan-out (SO) methodology allows designers to introduce test patterns, enable

execution and then extract registered internal values to see if the system performs the operations

as expected. As we can see this feature allows access to all the flip-flops in the system.

Unfortunately, it requires the system to be stopped, is very slow and does not allow advanced

filtering. In addition, once signals have been shifted out to extract the information, if the user wants to continue running, those signals must be inserted once again through the scan-in system pins, which has a severe impact on time.

SoC designs can contain millions of flip-flops, making the shift-in/shift-out of the whole design extremely slow. Due to that, SoC DFT implementations nowadays make use of several scan chains, which can be selected by programming an internal module named the Design Control Unit (DCU), allowing the user to select the blocks within the SoC to be verified.

On the other hand, it is important to note that there are many synthesis tools on the market that take care of concatenating the system's internal flip-flops to create the scan chains. In addition, those synthesis tools also allow creating sub-chains and automatically inserting hubs and routers to target one or multiple chains.

There are other manufacturing testing methods, such as built-in self-test (BIST), which are used as a low-cost testing approach. In this thesis we focus on post-silicon validation and assume that manufacturing tests have been passed successfully, so that no design or functional bugs have been detected at that stage.


2.3 Debugging multicore systems

In this section we give an overview of the most important resources that compose multicore systems in order to analyze the basic debug features that must be integrated to provide observability and allow workarounds in case of critical failures.

2.3.1 I/O peripherals

Systems On-Chip must be able to connect with one or multiple systems, which can be either a computer acting as a host or a cluster with thousands of other SoCs. This connection is based on the use of peripheral components that not only allow access to the SoC from an external system for computation but also allow the SoC to access resources such as external memories.

Nowadays, multicore systems typically integrate interfaces such as PCIe in order to provide a high-bandwidth connection between the SoC and the external world. In addition, it is common to have external interfaces such as UART or SPI that provide a low-bandwidth but highly reliable connection for basic testing.

2.3.1.1 Debug features

It is highly recommended that, in case PCIe has been integrated within the system, users have a debug infrastructure that allows them to monitor the behavior of this resource to ensure that it is properly configured and that transactions are performed as expected. This is critical during the bring up effort because users must be able to check that data is properly loaded into and read from the system.

On the other hand, it is also important to note that the SoC must have a dedicated I/O interface connected to the system debug infrastructure, allowing users to access debug features within the system. This is critical as it is the entry point for debugging the whole design. JTAG is the preferred option due to its simplicity, but note that it is a low-bandwidth interface. In case the user expects to gather large amounts of information from the system in real time, a faster interface such as USB or PCIe can be integrated and connected to the debug infrastructure as an optional entry point.

2.3.2 Memory subsystem

Multicore systems make use of caches and complex memory hierarchies in order to ensure that the

cores can access data as fast as possible, reducing latency and increasing performance.

2.3.2.1 Debug features

Given that the memory subsystem receives a large number of requests from the cores, either through direct connection to those memories or through the Network On-Chip, it is very important to ensure that requests are received and responded to properly and that there are no hangs that could end up stopping the cores, which would have critical consequences on functionality.


Due to this, it is highly recommended that the debug infrastructure allows the user to monitor the transactions received by the memories and ensure that they are responded to with the correct data. In addition, it is also recommended to allow the debug infrastructure access to those memories so that the user can check the stored data and the state of the cache lines at runtime.

2.3.3 Master core

One or multiple master cores are used not only to balance workloads among compute cores but also to handle critical tasks such as the initial hardware boot, maintenance and managerial tasks in the SoC, and the configuration of the peripherals associated with those tasks.

2.3.3.1 Debug features

It is essential that the user can perform run control operations on the master core(s) so that they can apply operations such as breakpoints, single stepping or reading the core's internal registers. In addition, it is highly recommended that the debug run control integration supports connection through software tools such as GDB, which has already been introduced in this document.

On the other hand, it is also recommended to add monitoring features that allow the user to gather information about which instructions are executed by the core, gain visibility of the requests and responses exchanged with other system resources, and observe control signals that can help debug possible hangs or misbehaviours.

2.3.4 Compute cores

As the name already specifies, a multicore system can integrate from tens to thousands of processors, potentially creating large groups of threads that can handle the computation of big workloads. In addition, as noted before, there are multiple cache hierarchy levels split around the system that are used to improve performance.

2.3.4.1 Debug features

In the same way as for the master core(s), it is also essential to allow the user to perform run control operations such as breakpoints, single stepping or reading the cores' internal registers, and to support software tools such as GDB.

On the other hand, it is also recommended to provide visibility of the operations performed by the compute cores and the requests that have been sent. This allows the user to debug cases such as a core waiting for a response that is never received, a core taking a wrong branch, or a hardware bug when performing a specific instruction.

2.3.5 Interconnection network

Systems On-Chip used to integrate bus and crossbar-based architectures to interconnect the system resources. That is, in the early days bus architectures provided dedicated point-to-point connections between resources. Nowadays, systems have increased in size and complexity, and point-to-point connections are less cost-effective and often become the bottleneck of the system. At the same time, systems must be able to scale, be reusable and manage power consumption.

Networks on-Chip (NoC) appeared to handle all those challenges. The main objective is to ensure

that a request or response goes from one resource to another efficiently, for example, from a core

to on-chip memory. The message can be divided into flits and injected into the network. Then, flits are routed through the interconnection network, making use of the links and routers, until they

arrive at the destination.

We refer to a flit as the smallest unit that composes a packet, whose size equals the link width. The flit is the link-level unit of transport, and packets are composed of one or more flits.
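As a small illustration of the flit concept, the sketch below splits a packet payload into link-width flits; the 64-bit link width and the head/body/tail classification are assumptions chosen for the example, not a description of any specific NoC protocol.

# Illustrative packetization of a payload into flits. The link width and
# the head/body/tail flit labels are assumptions made for this sketch.

LINK_WIDTH_BITS = 64  # assumed link width: one flit carries 64 bits

def packetize(payload: bytes):
    """Split a payload into flits of LINK_WIDTH_BITS each."""
    flit_bytes = LINK_WIDTH_BITS // 8
    chunks = [payload[i:i + flit_bytes]
              for i in range(0, len(payload), flit_bytes)]
    flits = []
    for i, chunk in enumerate(chunks):
        kind = "head" if i == 0 else ("tail" if i == len(chunks) - 1 else "body")
        flits.append((kind, chunk))
    return flits

# A 20-byte packet becomes one head, one body and one tail flit.
for kind, data in packetize(b"request to L3 cache!"):
    print(kind, data)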

NoCs are becoming more and more complex as they include features such as routing protocols, arbitration or flow control, among others, while they must remain reusable to reduce costs, design risk and the impact on the time-to-market schedule.

Figure 4. Network On-Chip mesh topology

One of the Network On-Chip challenges is handling multiple endpoint interfaces that may use different protocols and voltages or run at different clock speeds. Moreover, because the NoC handles most of the SoC traffic, it can raise serious security concerns, as master interfaces can corrupt data, degrade performance or even leak sensitive data. Due to this, it is essential to ensure that transactions are always forwarded through the correct channels so that they do not end up at the wrong endpoints, opening a security hole.

Multiple studies [37][38] have shown that a large percentage of post-silicon errors are detected on the NoC, so it is crucial to add support for monitoring request and response behavior.

2.3.5.1 Debug features


First, there must be visibility features on the connection between the Network On-Chip and the SoC blocks so that the user can detect whether a request has been lost within the NoC or whether the problem is within the SoC block itself. This monitoring feature must be able to handle the multiple interface protocols used in the different SoC blocks. This idea was introduced a long time ago [39], together with the option of adding configurable monitors to observe router signals and generate timestamped events [40]. In addition, the Design For Testability (DFT) approach allows the use of scan chains to apply tests to specific components, such that test data can be shifted in and applied to the component being tested. Then, results and traces can be read from the SoC scan output pins and analyzed on the host.

It is important to note that adding those features has a major drawback: the area impact as well as the performance overhead. The area impact can be reduced by making use of small buffers and constraining the number of signals being analyzed, which sacrifices observability. Performance overhead arises when the traced information is sent through the same Network On-Chip that is being analyzed and shared with the SoC functional resources; in this case it is highly recommended to use a simplified messaging infrastructure that separates the non-debug traffic from the debug traffic.

2.4 Debug infrastructure on multicore systems

From the previous analysis of the different options available on the market we can conclude that UltraSoC is the company that provides the widest range of options, allowing faster integration of a debug infrastructure on Systems On-Chip. On the other hand, we must also note that some of the analyzed options do not fit all multicore system requirements, given that ARM's debug components must be integrated with ARM cores and not all SoCs rely on ARM cores. For example, Esperanto Technologies has developed its own processors based on the RISC-V architecture. Alternatives such as SiFive provide RISC-V compatible debug modules, but depending on the processor design the Debug Module may already be integrated in the system. Finally, Synopsys provides fast integration and verification, but its debug proposals cannot be implemented on silicon as they are based on prototyping.

The outcome of this analysis matches previous research done in 2018, when Esperanto Technologies was designing its first System On-Chip, known as the ET-SoC-1. At that point in time, Esperanto Technologies decided to make use of UltraSoC components as they allowed fast integration and verification with a small number of engineers, which was a key factor for Esperanto's first product in order to meet the time-to-market constraint.


3. Many-Core SoC design

In this section we analyze Esperanto's System On-Chip architecture as an example of a multicore system. The analysis focuses on its debug infrastructure integration and evaluates the design and use cases.

The Esperanto’s System On-Chip, which has been named as ET-SoC-1, has two kinds of

general-purpose 64-bit RISC-V cores. The first one is the ET-Maxion, which implements advanced

features such as quad-issue out-of-order execution, branch prediction, and sophisticated

prefetching algorithms to deliver high single-thread performance. On the other hand, the ET-Minion

is an in-order core with 2 hardware threads and a software configurable L1 data cache, the

ET-Minion has been designed for energy efficiency, delivering TeraFlops and TeraOps of computing.

The ET-Minion cores are integrated in groups of 32, forming "Minion Shires". Within each Minion Shire the ET-Minion cores are split into groups of 8 cores, so there are 4 groups named "Neighborhoods". The Minion Shire also has a crossbar between the Neighborhoods and a private L2 memory cache of 4MB, as shown below.

Figure 5. Esperanto Technologies ET-SoC-1 Minion Shire architecture

The ET-SoC-1 has 34 Minion Shires, which share an L3 memory that is split between them, 4 ET-Maxion processors and 32GB of memory with a 137GB/sec memory bandwidth. The Network On-Chip follows a mesh topology that connects all the Minion Shires, the ET-Maxion cores, the memories and peripherals such as the PCIe.

The ET-SoC-1 debug infrastructure is based on the integration of the UltraSoC IP components

introduced in previous sections. Those components are integrated in the different SoC blocks, which

are known as shires, in order to provide visibility and, in some cases, workarounds for

non-functional components in the system during the initial stages of bring up.


3.1 I/O Shire

3.1.1 Architecture

The IOShire is an essential block in the system as it contains the Service Processor, which is the SoC

master core, and its ancillary components which are responsible for booting the whole system. Note

that there is a single instance of the IOShire block on Esperanto’s system.

The Service Processor (SP) is a dedicated standalone ET-Minion core which has access to all the

resources on the SoC that any other agent can access, and it also has exclusive access to some

resources that no other agent can access. It is responsible for booting the SoC and providing

runtime services for other SoC components.

On the other hand, the IOShire also contains the ET-Maxion cores and I/O devices. Additionally, the

IOShire contains the USB and JTAG interfaces used to access the debug infrastructure from an

external host.

In effect, the IOShire can be compared with a smaller SoC given its implementation and resources. The primary roles of the IOShire within the ET-SoC-1 are as follows:

● Integrates the Service Processor, which performs maintenance and managerial tasks in the

SoC, and associated peripherals for performing those tasks.

● Handles initial hardware boot of the SoC from the deassertion of the primary reset until the

Service Processor begins executing firmware code.

● Provides security/access control mechanisms for the SoC including secure debug with

various levels of authentication to gain debug access to various aspects of the SoC.

● Manages and supplies major SoC clocks and resets to other SoC components.

● Integrates general purpose peripherals for use by other agents in the SoC such as the

ET-Minion Cores and ET-Maxion, excluding PCIe and DDR which have their own shires.

● Serves as the primary entry point for the UltraSoC components, including housing

peripherals which provide access to the debug infrastructure.

● Integrates the Maxion shire, which includes the 4 high-performance out-of-order ET-Maxion

cores, a shared L2 cache, and supporting components.

The IOShire is connected with the rest of the shires in the SoC through the Network On-Chip (NoC),

which connects all the shires in the SoC together in a mesh topology.


3.1.2 Debug features and UltraSoC integration

3.1.2.1 Clock and reset unit

The System On-Chip clock and reset unit (CRU) is responsible for controlling the reset signals of all

system resources, which include not only the Phase-Locked Loops (PLL), Network On-Chip and the

cores, but also all the system peripherals and caches.

The default behavior of the system is such that the Service Processor is responsible for performing

the boot initialization sequence in which it configures the PLLs and accesses the CRU to release the

reset of several resources.

As can be seen, it is critical that those boot sequence steps can be performed. Otherwise, the rest of

the SoC cannot be accessed meaning that if there is a bug in the Service Processor, or in its

connection with the Network On-Chip, the user would not be able to use any of the resources in the

system.

In order to solve this possible issue, an UltraSoC BPAM component is connected to the CRU and the I/O Shire PLLs through an AMBA APB3 interface, such that the user can control the resets and clocks of all the system resources without relying on the Service Processor or the NoC, allowing the rest of the SoC to be verified in case of a critical failure.

3.1.2.2 Registers and memory access

During the post-silicon validation effort the user may be interested in reading not only the shire registers but also the values being read or written in the memories by the cores. Due to that, the I/O Shire integrates the UltraSoC DMA, which is connected to the Network On-Chip through an AMBA AXI4 interface and provides full visibility of the system resources, allowing the user, for example, to load code or dump the register values from all SoC blocks.

Because the connection to the main Network On-Chip on the IOShire is split into two different NoCs for security, there are two UltraSoC DMA modules which are connected to those two NoCs for redundancy. The first IOShire NoC instance is known as the SPIO NoC and is integrated on the sub-block that contains the Service Processor, providing full visibility of all the resources in the SoC. The second NoC instance, known as the Peripheral Unit (PU) NoC, is integrated on a different sub-block that contains several peripherals such as UART or SPI and has visibility of all resources in the SoC except the Service Processor and the security components.

In case of a bug in the Service Processor, the user can make use of the UltraSoC DMA component that is connected to the SPIO NoC in order to configure all the resources in the SoC, including the SPIO peripherals. That said, if the SPIO NoC has a bug, the system would not be configurable from the Service Processor nor from the SPIO UltraSoC DMA, so no further debugging of the rest of the shires could be done. In order to avoid that, Esperanto's design team decided to add the PU UltraSoC DMA component, which has visibility of the resources in the rest of the SoC shires.


3.1.2.3 Service Processor

The Service Processor is a RISC-V core for which Esperanto has implemented the RISC-V External

Debug Support Version 0.13 specification [2]. The Service Processor Debug Module, which is

accessible through an AMBA APB3 interface, allows the user to perform run control operations such

as halt, resume, single stepping or setting breakpoints, and supports external software tools such as

GDB. The Debug Module is accessible through the UltraSoC BPAM that is also connected to the CRU

and the PLLs, so that there is a multiplexer that the user can control to target those resources.

On the other hand, given that the Service Processor is a custom core developed by Esperanto and this is the first time it has been taped out, the design team integrated other UltraSoC components to increase observability and to allow the user to detect possible bugs either in the hardware integration or in the firmware code.

An UltraSoC Trace Encoder component has been attached to the Service Processor allowing the user

to trace all the instructions and exceptions during code execution. The Trace Encoder captures the

taken branches and, based on the code being executed, the user can reproduce the sequence of

instructions and branches through UltraSoC software tools on the host.

In addition, an UltraSoC Status Monitor component has also been connected to the Service Processor, allowing the user to capture the values of finite-state machines and control signals and to track transactions. This module not only enables the user to debug the pipeline, the internal cache or other resources within the core, but is also connected to the interface between the Service Processor and the Network On-Chip, so that it can be used to monitor the requests and responses exchanged with the rest of the system resources.

3.1.2.4 Maxion validation

ET-Maxion cores are based on the RISC-V Berkeley Out-of-Order Machine (BOOM), because of that

there is a Debug Module that allows the user to perform run control operations following the

RISC-V External Debug Support Version 0.13 specification [2]. The Maxion Debug Module interface

is accessible through a JTAG interface and due to that the Esperanto design team has integrated an

UltraSoC JPAM component that allows the debugger to configure the ET-Maxion cores to perform

run control operations such as the ones commented for the Service Processor.

On the other hand, given that the ET-Maxion cores have been customized by Esperanto there are

also other UltraSoC components integrated in the I/O Shire whose goal is to increase observability

and allow the user to detect bugs.

The Maxion sub-block has two UltraSoC Status Monitor components which provide visibility of both

the ET-Maxion core control signals and the Maxion cache hierarchy. In addition, there is also an

UltraSoC Bus Monitor component that allows the user to capture the requests and responses on the

master and slave AMBA AXI interfaces that ET-Maxion cores use to communicate with the rest of

the system resources through the Network On-Chip.


Finally, an UltraSoC BPAM component has also been integrated on the Maxion shire, allowing users to access the Maxion configuration registers either for visibility or to apply different configurations.

3.1.2.5 Debugger hardware interface

The debug subsystem can be accessed through JTAG and USB interfaces, which have been

integrated in the I/O Shire. From the point of view of the user this is transparent as the JTAG and

USB encoding protocols are handled by a software layer in the host.

3.1.2.5.1 External interface with the host

The JTAG interface provides a simple but slow connection, which is used to access the SoC during the first bring up stages to check that UltraSoC components can be accessed, to perform basic debug operations and to configure the USB interface if needed.

The USB interface is faster, but due to its complexity it is not ready to be used when the SoC external reset is released; that is, this interface can be used when the user needs to gather as much information as possible from the system, but it requires prior configuration.

3.1.2.5.2 Cores communication with the host

The Service Processor is a critical piece in Esperanto's SoC and, due to that, during the bring up effort it is important that the user has a way for the core to communicate with the external debugger. Esperanto's design integrates an UltraSoC Static Instrumentation component in the I/O Shire, allowing the Service Processor to write data as if it were a mailbox. Once the Service Processor writes data into the UltraSoC Static Instrumentation, the external host receives a message that can either contain the written data or a notification that new data is available, depending on the configuration applied by the user.

This allows data to be passed from the software on the debug target to the debugger. A number of

software APIs are available and with an appropriate driver this module can support most popular

tools, such as LTTng or Ftrace.

3.1.2.5.3 Debug through Service Processor

The debug components are also configurable through the Service Processor via the UltraSoC Bus Communicator module, which allows the master core to configure and manage the SoC debug infrastructure. That is, the Service Processor can send requests to the UltraSoC modules as if it were an external debugger; in order to do that, it communicates through an AMBA AXI4 interface targeting the Bus Communicator internal registers.

The UltraSoC Bus Communicator internal registers are used to read the status of the requests and/or their responses, and to write the requests to be performed. The complexity of the communication between the UltraSoC Bus Communicator and the Service Processor lies in the fact that the Service Processor has to encode the debug configuration messages and also decode the responses. The UltraSoC Bus Communicator does not encode the requests, so it just serves as a mailbox between the debug components and the Service Processor.

3.1.2.6 Security

As we have seen in previous sections, the debug infrastructure can be accessed either through JTAG or USB; due to that, the system is exposed to attacks trying to access the internal resources and cause malfunctions. In order to avoid this, the System On-Chip has a component that is in charge of enabling/disabling access to the rest of the debug components on the SoC.

The security component can only be accessed through the UltraSoC JPAM and requires the user to go through a challenge based on public and private key authentication. The only debug components accessible without performing the security authentication are the JTAG and USB interfaces and the UltraSoC JPAM; that is, those modules are not locked so that the user can perform the handshake challenge, ensuring that the rest of the system debug components are only accessed by verified users.

In order to block the access to the UltraSoC analytic components, there is a small module named

Message Lock that receives the output signals of the security component and enables or disables

the interface between the debug infrastructure and the UltraSoC components.

Below we can see the distribution of the debug components within the IOShire, which are placed in

the three sections that have been previously introduced: Peripheral Unit (PU), Maxion and SPIO

blocks. For each of the analytic modules there is an UltraSoC Message Lock component that ensures

that no requests can be sent nor information can be extracted if the SoC is not configured at the

correct security level.

Figure 6. IOShire UltraSoC component integration


3.2 Minion Shire

3.2.1 Architecture

As has been introduced in previous sections, the Minion Shires integrate the system cores that

handle the computation of the different workloads and contain multiple cache hierarchy levels to

increase performance.

The ET-Minion cores within a Minion Shire have direct access to the L1 cache memory but also rely on the Minion Shire L2 cache. The former is split into two cache memories: one for instructions and one for data.

Due to architectural decisions that are out of the scope of this thesis, the Minion Shire is split into different voltage domains, such that the ET-Neighborhoods are in a different voltage domain than the rest of the Minion Shire resources. Due to that, the L1 cache is integrated within the ET-Neighborhood to avoid delays in the access to instructions and data, so that the ET-Minion cores can fetch new instructions and perform load and store operations as fast as possible.

The ET-Minion L1 instruction cache memory is shared between the eight cores in the neighborhood, as parallelism is the key factor for the system. On the other hand, the L1 data cache is private to each ET-Minion and is not shared, so the cores do not rely on others to access data.

On the other hand, the Minion Shire also integrates an L2 cache memory that is used to store both data and instructions and is shared by all the cores within the Minion Shire, that is, by 32 ET-Minion cores. The L3 cache memory is also part of the Minion Shire, but it is shared by all the Minion Shires in the SoC, such that the whole cache is split across the system. The integration of those two cache memories, L2 and L3, within the Minion Shire is separated from the ET-Neighborhoods in a block named Shire Cache, which runs at a different voltage.

In the image below we can observe different Minion Shire blocks and the connection of the

ET-Neighborhoods with the Shire Cache through a crossbar, which allows the ET-Minion cores to

access L2 cache memory. In case of a miss in L2 cache memory, the request is forwarded to L3

cache, whose requested region can be in the same Minion Shire or in a different one.


Figure 7. Minion Shire voltage and clock internal architecture diagram

3.2.2 Debug features and UltraSoC integration

The Minion Shire is replicated several times in the SoC, which means that adding debug components to it has a large impact on area and power. At the same time, the ET-Minion cores within the Minion Shire handle all the computation and their performance is critical, so we need to provide as much visibility as possible.

On the other hand, due to security concerns it is important to ensure that access to the UltraSoC components is restricted, and because of that Esperanto's design makes use of the UltraSoC Message Lock component. Two security levels have been defined: full access to the debug infrastructure, allowing access to all monitor modules such as the UltraSoC Status Monitor, and a second level which allows the user to perform run control operations on the compute cores, intended to be used when running debugging software such as GDB.

3.2.2.1 Software application debugging

The most important feature for software debugging is the ability to perform run control operations on the ET-Minion cores so that the user can set breakpoints, enable single stepping or read core internal registers. The user must also be able to access performance counters and other Minion Shire registers. Therefore, the Minion Shire integrates an UltraSoC BPAM component that provides an AMBA APB3 interface connected to all the Minion Shire registers and that also provides the debug run control interface for the ET-Minion cores.

3.2.2.2 Resources visibility

In addition, the user can also monitor ET-Minion core execution by making use of the UltraSoC Status Monitor component. As mentioned above, the Minion Shire is replicated multiple times, which means the area and power cost of debug components escalates quickly; due to that, there is one UltraSoC Status Monitor per neighborhood so that the resources within this sub-block can be monitored. This includes the 8 ET-Minion cores, the L1 caches and the interconnection between those resources and the rest of the Minion Shire.


Furthermore, there is an UltraSoC Status Monitor that provides visibility of the crossbar between the four Minion Shire neighborhoods and the L2 and L3 cache levels, known as the Shire Cache, such that the user can see the requests generated by the compute cores and the responses sent back from the Shire Cache. Note that the UltraSoC Status Monitor does not allow tracking requests based on unique identifiers, which is a problem if the filtering to be applied requires matching requests with responses.

An additional UltraSoC Status Monitor is connected to the Shire Cache control signals allowing

analysis of cache control signals and finite-state machines.

The compute cores can generate huge amounts of transactions that target not only their own Minion Shire resources but also other Minion Shires and SoC resources such as memory. Therefore, it is essential to be able to monitor the traffic on the Minion Shire AMBA AXI interfaces that connect with the Network On-Chip. This monitoring feature is provided by the UltraSoC Bus Monitor, which allows tracking the requests and responses exchanged with the rest of the system.

Figure 8. Minion Shire UltraSoC component integration


3.3 Memory Shire

3.3.1 Architecture

The Memory Shire integrates the 4GB LPDDR4 memory parts and connects those memories with the rest of the SoC through the Network On-Chip. There are eight Memory Shires in the system, providing 32 GB of memory; each of them has two memory controllers that allow access to two 16-bit LPDDR4 channels, resulting in a system with a 137GB/sec memory bandwidth.

The Memory Shire has multiple voltage domains, which allow it to connect with the Network On-Chip, internal third-party components and the DDR memories. In addition, each of those blocks runs at a different frequency. Due to this integration, the Memory Shire has multiple voltage and clock crossing FIFOs that handle the different requests and responses.

3.3.2 Debug features and UltraSoC integration

3.3.2.1 Register visibility

The Memory Shire has several registers that can be configured to define different operational modes to handle the memories. Those registers are accessed through the Network On-Chip by the Service Processor, but they can also be accessed through an UltraSoC BPAM component integrated within the Memory Shire, so that the user can perform read and write requests to configure the memories without interfering with the rest of the system, as the transactions go through the debug infrastructure, which is detached from the main Network On-Chip.

3.3.2.2 Memory visibility

Given that the Memory Shires integrate the DDRs, which serve as main memory for the cores, they receive a large number of requests through the Network On-Chip, and it is very important to ensure that requests are received and responded to properly so that there are no hangs, as a hang would end up stopping the compute cores and would have critical consequences on functionality.

For this purpose, each Memory Shire integrates an UltraSoC Bus Monitor component that allows the user to analyze the traffic on the interfaces that connect the Network On-Chip, through which the compute core requests are received, with the DDR memories.

In addition, the Memory Shire also integrates an UltraSoC Status Monitor component so that the

user can have visibility of control signals and finite-state machine values.

3.3.2.3 Debug trace buffering

The UltraSoC Status Monitor and UltraSoC Bus Monitor components can be used to capture data in real time and forward it to the host for further debugging. Because the debug infrastructure interface with the host is based on the JTAG and USB protocols, the bandwidth is limited and those interfaces can become a bottleneck, causing the user to lose information gathered from the system.

In order to avoid losing debug information, which would mean that engineers could miss the chunk of data needed to detect the bug they are looking for, each of the Memory Shire blocks integrates an UltraSoC System Memory Buffer component, which allows the debug messages to be stored in the DDR memories. It is important to note that the region of memory used to store those debug messages acts as a circular buffer and is specified by the user, which means that the user must ensure that the compute cores do not overwrite this region, as this would cause data corruption.
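The circular-buffer behaviour described above can be modelled in a few lines; the sketch below is purely illustrative, and the base address, buffer size, entry size and wrap policy are assumptions for the example rather than the UltraSoC System Memory Buffer register interface.

# Conceptual model of a circular debug-trace buffer placed in a DDR region
# reserved by the user. Addresses, sizes and the wrap policy are assumptions.

class CircularTraceBuffer:
    def __init__(self, base_addr, size_bytes, entry_bytes=16):
        self.base = base_addr
        self.entries = size_bytes // entry_bytes
        self.wr = 0                        # write index, wraps around
        self.wrapped = False
        self.mem = [None] * self.entries   # stands in for the DDR region

    def push(self, message):
        self.mem[self.wr] = message
        self.wr = (self.wr + 1) % self.entries
        if self.wr == 0:
            self.wrapped = True            # oldest messages get overwritten

    def drain(self):
        """Read messages in chronological order, as the host would."""
        if not self.wrapped:
            return self.mem[:self.wr]
        return self.mem[self.wr:] + self.mem[:self.wr]

buf = CircularTraceBuffer(base_addr=0x8000_0000, size_bytes=64, entry_bytes=16)
for i in range(6):                         # 6 messages into a 4-entry buffer
    buf.push(f"trace message {i}")
print(buf.drain())                         # messages 2..5 survive, 0 and 1 are lost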

Figure 9. Memory Shire UltraSoC component integration

3.4 PCIe Shire

3.4.1 Architecture

The PCIe Shire integrates the PCIe Subsystem allowing Esperanto’s SoC to connect with an external

host through a PCIe interface. The PCIe Subsystem contains two PCIe Controllers, two PCIe PHYs and

the subsystem custom logic.


In the same way as the rest of Esperanto's SoC shires, the PCIe Shire integrates access to the system Network On-Chip, allowing not only the host to access memory-mapped locations but also the compute cores to use host memory as SoC extended memory.

3.4.2 Debug features and UltraSoC integration

The PCIe serves as an interconnection between the host and the System On-Chip; it is widely used in the industry and is a critical feature for Esperanto's SoC performance. Due to its complexity, the PCIe Shire debug components have been integrated with a focus on providing visibility of the PCI Express functionality and its verification.

Given that the PCIe is connected to the system Network On-Chip through multiple AMBA AXI4 interfaces, the debug infrastructure on the shire integrates two UltraSoC Bus Monitor components that allow analyzing the in-flight transactions of the two PCIe lanes simultaneously.

In addition, the PCIe Shire also integrates two UltraSoC Status Monitor components that allow the

user to capture PCIe control signals and monitor its interconnection with system resources.

Figure 10. PCIe Shire UltraSoC component integration

3.5 Debug Network On-Chip

Access to the system debug infrastructure is done through the JTAG and USB interfaces, which are integrated in the IOShire. Given that there are several UltraSoC components integrated in other shires, Esperanto's debug subsystem must have a way to route all the debug messages up to the IOShire. By debug messages we mean not only those that configure UltraSoC components but also the real-time data captured through the UltraSoC Status Monitor and UltraSoC Bus Monitor components.

Apart from the system Network On-Chip, through which the Service Processor and compute cores

can access the memory mapped resources such as registers, caches or the DDRs, there is an

independent debug Network On-Chip that handles the routing of debug messages.

The debug messages sent from the debugger to the UltraSoC components always follow the fastest path based on route optimization algorithms. On the other hand, the responses to those configuration messages and the captured real-time data can either be sent to the debugger interface or be sent to the Memory Shire DDR memories, in which they are stored using the UltraSoC System Memory Buffer.

The destination of the configuration message responses and the real-time captured data messages is defined by the user, who can decide whether to send those debug messages directly to the JTAG or USB interfaces or to store them in the DDR memories and then read them upon request to avoid overflowing the external interfaces with the host.

This document does not include any further details about the debug Network On-Chip architecture, as it has been developed under an agreement between NetSpeed and UltraSoC and its architectural details are confidential.

3.6 Conclusions after ET-SoC-1

Now, after Esperanto’s ET-SoC-1 has been fabricated and the UltraSoC components have gone

through RTL integration, verification, physical design implementation and post-silicon validation

stages, Esperanto’s engineering team has found some critical aspects that must be improved for

future SoCs.

3.6.1 Architecture, RTL and physical design evaluation

During RTL integration the Esperanto engineering team found that UltraSoC proposal was

architected to be used on small SoCs due to how their messaging infrastructure was built and

because of the lack of voltage and clock crossing support, which is typical on multicore systems that

integrate several cores, memories and peripherals.

The lack of an efficient messaging infrastructure impacted throughput, which limited the amount of information that could be gathered from the system; this is because the UltraSoC messaging infrastructure is based on a tree architecture. In this tree architecture the information flows from several analytic components to an UltraSoC Message Engine, which then sends the information to the next UltraSoC Message Engine in the hierarchy, which in turn has other UltraSoC Message Engines or analytic components connected. This is repeated until the top UltraSoC Message Engine in the tree is reached, and the information is then sent through JTAG or USB.


In addition, a tree-based architecture does not allow building a system by stamping or tiling shires. This forced Esperanto to involve a third company (NetSpeed) to develop a debug Network On-Chip (NoC) that met the UltraSoC messaging requirements, causing not only an unpredicted impact on area and power and an unexpected cost, but also a loss of debug performance as the NoC was not customizable. Moreover, the internal debug NoC protocol did not support tracing capabilities such as targeting different memory subsystems, and it was based on the NetSpeed protocol, in which many fields were unused, causing extra area overhead.

In addition, some of the debug components, such as the UltraSoC Status Monitor or the UltraSoC Bus Monitor, have been found to be inefficient for high tracing requirements, given that the signals to be analyzed are stored in internal buffers before checking whether they match the user's filter configuration, causing the buffers to overflow and the user to lose visibility.

It is also important to mention that UltraSoC did not provide area and power estimation numbers for their debug components, which required the RTL team to perform several iterations and simulations to find the appropriate configuration that fit the area and power requirements. This also forced the design team to drop some key configurations, such as wider buffers to store traces, while other features could not be removed as they were non-optional parts of the UltraSoC components.

Power and area analyses have been performed with the Synopsys PrimePower and Synopsys PrimeTime tools, respectively. The analysis applied clock gating to the UltraSoC modules in order to discard dynamic power, given that the debug infrastructure is intended to be inactive after the post-silicon validation stage. The numbers showed that the static power of the ET-SoC-1 debug infrastructure consumes around 2% of the SoC power, and that the infrastructure occupies 4% of the SoC area.

3.6.2 Post-silicon validation evaluation

Finally, during post-silicon validation the engineering team was able to stress the debug components and make use of the provided features. The debug infrastructure ended up being critical for bring up development and had a huge impact on meeting time-to-market requirements. In addition, the experience during this stage showed that some deprecated features, such as access to some SoC registers, turned out to be essential, while some features that were supposed to be essential, such as providing the Service Processor with a mailbox through the UltraSoC Static Instrumentation component, were not really needed. Further details of the post-silicon validation conclusions are given below.

The UltraSoC Bus Communicator module, which was integrated to allow the Service Processor access to other UltraSoC components without requiring a JTAG/USB interface connection, was not used due to its complexity and the engineering team's lack of time to develop new software, as no support was provided by UltraSoC.

In addition, the UltraSoC Static Instrumentation, which was supposed to act as a way to allow the master core to send information to the debugger, was not used either, as the Service Processor was using the UART interface for this purpose.


It is also important to note that the debug infrastructure showed problems when monitoring multiple ET-Minion cores, as the UltraSoC Status Monitor filtering implementation was not efficient and caused the internal buffers to fill too fast, before the gathered data could be sent, causing a loss of information that was critical for debugging.

On the other hand, due to the way the RISC-V GDB software is implemented by default, operations such as single stepping had large latencies, as all General Purpose Registers (GPRs) of the selected core were read automatically. In addition, the Esperanto Technologies SoC had no implementation that could speed up reading the GPRs from the cores.

Moreover, the management team reported problems with license acquisition: in order to use the debug infrastructure on Esperanto's ET-SoC-1 silicon, it was required to buy licenses from UltraSoC, which forced Esperanto to assume around 15% of unexpected extra cost and also to work out the best approach for customer support, as customers would need UltraSoC licenses to be able to debug instead of having a fully functional SoC with debug features already available.

Note that the UltraSoC software also showed problems and had to be modified several times because it did not support CentOS, the operating system used on the servers to which the Esperanto System Engineering team connected the Bring up Boards (BuB). This caused the Esperanto team to work with a slower version of the UltraSoC software for weeks until the issue was fixed, even though this operating system was supposed to be fully supported.


4. Proposed multicore debugging architecture

After analyzing the required features for debugging multicore systems, the options available on the market and the conclusions drawn from Esperanto Technologies' ET-SoC-1 implementation, we focus on developing a new debug architecture that intends to solve all the problems identified in the previous sections.

As we have observed in previous sections, an efficient multicore debug architecture requires designs that are customizable, for example allowing the number of comparators or the buffer sizes to be chosen, but that do not add features which cannot be disabled, causing a direct impact on area. In addition, when debug is not used we must ensure that it wastes as little power as possible.

Users must be able to monitor not only the compute cores' execution but also have visibility of the other logic in the system that interconnects the cores with each other, with the memories and with other custom modules placed on the SoC. This means that the proposed debug monitor components must be able to analyze different resources and support the multiple protocols used by those modules. For example, if the interconnection between the compute core SoC block and the non-debug Network On-Chip is based on the AMBA AXI4 protocol, then the debug monitor components must be able to analyze transactions based on that protocol.

Furthermore, with the aim of providing more visibility, the debug monitor components must allow the user to gather information from the system based on advanced filtering. Note that when debugging multicore systems it is generally required to gather information over a large number of cycles, which means that the debug infrastructure must be able to sustain high bandwidth over long periods and to provide in-system storage in case the information cannot be extracted due to a slow I/O interface. We assume that intensive data gathering is performed from one source at a time and that, in case the user wants to capture traces from multiple sources, advanced filtering is applied to reduce the amount of information gathered, avoiding loss of information and bottlenecks.

Multicore systems also require a run control infrastructure that allows managing the execution of multiple cores at the same time, propagating events such as halt or resume efficiently to stop the cores with as little latency as possible, and extracting core information (e.g. register file values) in a very short time so that the user can manage thousands of cores without a huge delay.

Note that the debug infrastructures analyzed previously in this document did not focus on the time it takes to execute run control operations, because human interaction is slow and, consequently, there was no requirement to operate within a certain latency. However, on a multicore system with hundreds or thousands of compute cores, the number of transactions required for run control operations escalates quickly, having a direct impact on latency and causing large delays.

On the other hand, we also noted that the messaging infrastructure is essential, and that is why we not only focus on developing monitoring solutions but also on designing a debug Network On-Chip that can support sustained high-bandwidth traffic, scale with different SoC sizes without aggregating information on each router as in a tree-based architecture, utilize all NoC resources by making use of an efficient messaging protocol, and support multiple leader interfaces acting as debuggers.

Note that, based on the ET-SoC-1 experience and on previous research, we assume that debug can be managed from multiple sources, which can be either external interfaces such as JTAG, USB or PCIe, or a master interface within the System On-Chip itself, such as a master core. In addition, we also assume that only one debugger interface manages the system at any given time; it is known as the active debugger.

4.1 Run control support

Run control features are essential for developers to be able to debug hangs or misbehaviours in the instruction execution of systems with one or multiple cores. As introduced in previous sections, multicore systems can potentially combine peripherals, memories and hundreds or thousands of cores, and while debugging a problem the user must be able not only to have visibility of system resources such as peripherals or memories, but also to manage core execution.

Multicore systems can be used either to run general-purpose software or to run very specific software applications that require high performance. Depending on the core architecture, the debug run control support may follow specific implementations to be architecture compliant. Our proposal takes the RISC-V debug specification [2] as a baseline, from which we expand to support multicore systems by focusing on the indispensable features that must be integrated into the SoC debug infrastructure, regardless of the core architecture, to allow users to manage core execution. RISC-V has been selected as the baseline option due to its open Instruction Set Architecture (ISA), which allows developing royalty-free designs, its well-defined debug specification and the wide range of available tools that are supported and improved by the community.

4.1.1 Execution

First, the user must be able to stop the execution of the core so that no instructions are retired and the pipeline is stopped. This operation is known as halting the core. Once the user wants the core to start executing again, they can perform a resume operation.

These two operations allow stopping and restarting the execution of code on the cores. Then, given that the user may want to start debugging from a specific point of the execution, they must be able to configure the core so that it automatically halts when a certain instruction is about to be performed or when a specific memory address is accessed. This feature is known as breakpoints and allows the user to ensure that the core will stop its execution when a certain condition is met.

On the other hand, the user may want to execute one instruction and halt the core automatically so that they can see how this instruction has been performed and how the core behaved. This operation is supported by what is known as single stepping, which allows the user to configure the core such that once a resume is received one instruction is performed and the core automatically halts. This allows the user to debug the execution of a program step by step without much latency, as it does not require the user to set a new breakpoint each time an instruction has been performed.

An extra feature that is highly recommended is to allow the user to inject operations into the core pipeline without having to modify the software application being executed. That is, the user can stop the system at a given point and force the core to execute one or multiple instructions to see how it behaves and to iterate faster, as modifying software applications and loading them on the system can require a huge amount of time.
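The run control operations described in this subsection (halt, resume, breakpoints, single stepping and instruction injection) can be summarized with a small software-side model. The sketch below is a hypothetical illustration: the class and method names are assumptions and do not map one-to-one onto the RISC-V Debug Module registers.

# Hypothetical software-side model of the run control operations described
# above. Names are illustrative assumptions, not a real debug API.

class CoreRunControl:
    def __init__(self):
        self.halted = False
        self.step_mode = False
        self.breakpoints = set()
        self.pc = 0

    def halt(self):
        self.halted = True          # pipeline stops retiring instructions

    def resume(self):
        self.halted = False
        if self.step_mode:          # single stepping: run exactly one instruction
            self.execute_one()
            self.halted = True

    def set_breakpoint(self, addr):
        self.breakpoints.add(addr)  # core halts before executing addr

    def single_step(self, enable=True):
        self.step_mode = enable

    def execute_one(self):
        self.pc += 4                # placeholder for one retired instruction

    def inject(self, instruction):
        """Execute an instruction from the debugger while halted, without
        modifying the software application loaded on the core."""
        assert self.halted, "injection is only allowed while halted"
        return f"executed injected instruction: {instruction}"

core = CoreRunControl()
core.halt()
print(core.inject("csrr a0, mhartid"))
core.single_step()
core.resume()                       # runs one instruction, then halts again
print("pc =", hex(core.pc), "halted =", core.halted)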

4.1.2 Registers

The user may also be interested in having visibility of the core register contents so that they can see whether an instruction has modified a register as expected. For example, this allows the user to check whether data has been properly loaded into a register, or whether an arithmetic operation has been performed and produced the correct value.

It is also recommended, on complex core architectures, to allow the user to modify the value of the internal core registers so that they can load custom values, perform a given operation and check whether there is any misbehavior. In addition, some cores have a set of Control and Status Registers (CSRs) that enable special features or change instruction behavior, so allowing the user to read and configure those registers at runtime provides not only good visibility but also a flexible test environment.

4.1.3 Core status

It is also required to allow the user to check the status of the core, not only when running software applications but also to check whether a set of injected instructions has reported problems or finished successfully. The proposed architecture recommends adding at least the following flags so that the user can check core status:

● halted: This bit is asserted when the core has performed the halt request and stopped its pipeline. It is cleared when the core receives a reset or resume request.

● running: This bit is asserted when the core is executing instructions. It is cleared when the

core performs a reset or halt request.

● busy: This bit is set to 1 while instructions injected by the user are being executed, and is set to 0 when no user-injected instruction is being executed.

● exception: This bit is set to 1 if the user's injected instruction generated an exception. It is

manually cleared by the user to ensure it is not lost.

● error: This bit is set to 1 if the user’s injected instruction cannot complete but no exception

has been generated. It is manually cleared by the user to ensure it is not lost.


Note that the halted flag is useful because there is a delay between the moment a halt is requested and the moment the core stops fetching instructions. On the other hand, running allows the user to check that the core is fetching instructions after it has been enabled or after performing a resume operation. Finally, busy, exception and error allow the user to check the status of the instructions that have been injected into the core pipeline while it was halted.
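As an illustration, the sketch below shows how a host-side tool might decode such a status register; the bit positions and the register layout are hypothetical assumptions, since the proposed architecture only defines which flags must exist, not where they are placed.

```python
# Hypothetical bit layout of the per-core debug status register; the proposed
# architecture fixes the set of flags, not their positions.
STATUS_FLAGS = {
    "halted":    1 << 0,  # core stopped its pipeline after a halt request
    "running":   1 << 1,  # core is fetching and executing instructions
    "busy":      1 << 2,  # user-injected instruction(s) still executing
    "exception": 1 << 3,  # injected instruction raised an exception (sticky)
    "error":     1 << 4,  # injected instruction failed without an exception (sticky)
}

def decode_core_status(status_reg: int) -> dict:
    """Return the value of each documented flag from a raw register read."""
    return {name: bool(status_reg & mask) for name, mask in STATUS_FLAGS.items()}

# Example: a halted core whose injected instruction raised an exception.
print(decode_core_status(0b01001))
```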

4.1.4 Multicore execution support

Because multicore systems can have a huge number of cores and the user may want to apply a run control operation to only a subset of them, it is important to have one or multiple registers within the SoC that act as a mask. That is, the user must be able to configure the mask registers to select which cores are affected by run control operations such as halt or resume.

The number of cores in a multicore system can grow quickly, and it is essential to allow users to manage multiple cores at the same time; that is why core selection is a key feature of the proposed debug architecture.

On the other hand, it is also important to take into account that run control requests can take several cycles to propagate across the SoC and reach all the target cores, so an efficient communication method is required.

The proposed architecture allows the user to configure the mask registers once and then send as many run control requests as desired. The supported run control operations include single-stepping and breakpoint configuration, so the user does not have to go core by core to apply this configuration, which would require back-and-forth communication between the host and the SoC and cause a huge time impact. A brief sketch of this configure-once, broadcast-many flow is shown below.
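The following minimal sketch illustrates this usage pattern; the register names (CORE_SELECT_MASK, RUN_CONTROL), their addresses and the write_reg helper are hypothetical, as the document does not define a concrete register map.

```python
# Hypothetical debug register addresses; the real map is implementation-defined.
CORE_SELECT_MASK = 0x10   # one bit per core: 1 = affected by run control requests
RUN_CONTROL      = 0x14   # writing an operation code broadcasts it to selected cores

HALT, RESUME, STEP = 0x1, 0x2, 0x3  # illustrative operation encodings

def write_reg(addr: int, value: int) -> None:
    """Placeholder for a debug-interface write (JTAG, USB, PCIe...)."""
    print(f"write 0x{addr:02x} <- 0x{value:x}")

# Select cores 0-3 and core 7 once...
write_reg(CORE_SELECT_MASK, 0b1000_1111)
# ...then issue as many run control requests as desired without reconfiguring.
write_reg(RUN_CONTROL, HALT)
write_reg(RUN_CONTROL, STEP)
write_reg(RUN_CONTROL, RESUME)
```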

In the proposed architecture, there is a component named Debug Multicore Handler (DMH) which

serves as the interface between the host and the Debug Module (DM) within the SoC block(s) that

contain the compute cores. The DMH handles the propagation of the run control request to the

selected cores so that the operation is broadcasted through hardware to the DMs. Then, the Debug

Module within the SoC block forwards the request to the selected cores within the block.

In addition, the SoC block Debug Module integrates a feature which allows the user to halt a group of cores within the block in case one of them halts. That is, the DM can be configured to handle up to N groups, with group 0 reserved to mean "no group". Each core can belong to only one group, and when one of the cores within a group halts, the Debug Module sends a halt request to the rest. This allows the user to set a breakpoint on a specific core and ensure that the rest of the cores also halt and do not perform any further instructions, avoiding communication latency with the host.


4.1.5 Multicore status report

As mentioned in previous sections, the user must be able to know the status of each core by having

access to a subset of flags that specify if it is halted, running or executing injected instructions.

On multicore systems the user can apply run control operations to several cores, and because of that they may be interested in having a status report that gathers the information from all selected cores while maintaining the ability to access the status of a single core. The former can be achieved by making use of a special register that is introduced later in this section, while the latter can be achieved by selecting a single core and requesting the status information, as it is reported on the same status register.

When the user wants to gather the status information, the request is received by the Debug Multicore Handler (DMH), which collects the information from the status register of the compute cores in the SoC block(s). Then, this information is merged and the user receives the following flags:

● anyhalted: This bit is asserted if any of the selected cores is halted.

● allhalted: This bit is asserted when all selected cores are halted.

● anyrunning: This bit is asserted if any of the selected cores is running.

● allrunning: This bit is asserted if all the selected cores are running.

● anyunavailable: This bit is set if any of the selected cores is neither halted nor running.

● allunavailable: This bit is set if all selected cores are neither halted nor running.

For user interaction it is recommended to implement the Debug Multicore Handler (DMH) status register such that it is updated, based on the status of the SoC block compute cores, every time the status of the selected cores changes. Then, the user can retrieve the SoC compute core information by accessing a single register, as sketched below.
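A minimal sketch of how the DMH could merge the per-core flags into the aggregated flags listed above is shown below; the CoreStatus representation is an illustrative assumption.

```python
from dataclasses import dataclass

@dataclass
class CoreStatus:
    halted: bool
    running: bool

def aggregate_status(selected: list[CoreStatus]) -> dict:
    """Merge the per-core flags of the selected cores into the DMH summary flags."""
    return {
        "anyhalted":      any(c.halted for c in selected),
        "allhalted":      all(c.halted for c in selected),
        "anyrunning":     any(c.running for c in selected),
        "allrunning":     all(c.running for c in selected),
        "anyunavailable": any(not c.halted and not c.running for c in selected),
        "allunavailable": all(not c.halted and not c.running for c in selected),
    }

# Example: two halted cores and one core still running.
cores = [CoreStatus(True, False), CoreStatus(True, False), CoreStatus(False, True)]
print(aggregate_status(cores))
```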

On the other hand, it is also important to note that one of the biggest potential problems for the user is the fact that the RISC-V GDB performs a read of all the core General Purpose Registers (GPR) each time a single step is performed. This becomes a problem on multicore systems as it introduces a huge latency, not only because gathering the value of a single register requires multiple accesses to the core program buffer, but also because there can be hundreds or thousands of cores in the SoC.

In this project we recommend implementing a special register within the Debug Multicore Handler (DMH) that captures the GDB request and gathers the information from all core GPRs, avoiding the need for the software to send hundreds of requests. Note that the SoC block Debug Module (DM) can integrate a finite-state machine that automates the read of the General Purpose Registers from each of the SoC block cores so that the information can be gathered as a trace of values that is later forwarded to the DMH. Then, this trace of values can be read by the host software to show the user a list with the registers and the value of each one.

The OpenOCD software layer, which acts as an interface between GDB and the SoC debug infrastructure and is open source, can be extended to make use of this DMH special register after performing a single step, rather than following the default implementation, which sends multiple requests for each register.

4.2 Debug Network On-Chip

The proposed debug infrastructure relies on the use of a Network On-Chip to interconnect all the debug components with each other, with the SoC resources and with the active debuggers. The debug components can be configured either through external interfaces such as JTAG, USB or PCIe, or through a master core in the SoC.

The debug Network On-Chip must support high-bandwidth traffic and handle the different message types that are sent between the active debugger and the debug components within the SoC blocks. The first requirement for a competent message infrastructure is to keep the overhead to a minimum while maintaining good debug performance; because of that, it is important to implement efficient message packaging and compression schemes that reduce the amount of information in flight, causing less traffic and avoiding bottlenecks within the debug NoC as much as possible.

The architecture implements a protocol with multiple message types in order to support all the features explained later in this specification. It is important to note that the debug NoC interface that connects to the SoC blocks has a fixed size which, depending on the user's debug component configuration, may cause some of the messages to be split into flits. Note that, as introduced previously in this document, a flit is the smallest unit that composes a packet and equals the link width: the flit is the link-level unit of transport and packets are composed of one or more flits. Further information about how flits are handled by the proposed architecture can be found in each of the subsections below.

We assume, based on the ET-SoC-1 experience and the research done in previous sections, that configuring debug components and exercising run control capabilities require sending a large number of small messages that go from the active debugger (e.g. JTAG or USB) to any point of the SoC and that may require responses. In addition, high bandwidth is needed at given times when gathering data, so the debug NoC must support a configuration that allows extracting as much information as possible, even if that means sacrificing access to other debug resources or slowing them down.

Moreover, as commented previously, the debug NoC supports multiple active debuggers, from the standard and easy-to-use ones (e.g. JTAG) to the ones that provide high throughput but are complex to configure (e.g. PCIe). The proposed architecture, which is introduced below, supports both options but focuses on not oversizing the debug NoC, as adding buffering or wider buses would increase the bandwidth but would have a huge impact on area and power. It is important to keep in mind that the debug infrastructure is intended to be used during the bring-up effort, so we need to focus on reducing the leakage consumed by this piece of the SoC as it is disabled during normal operation.


4.2.1 Message Types

The message types introduced in this section support different functionalities but their encoding

follows a similar pattern.

As mentioned above, we must ensure that any lead debugger, which can be on the host or within the SoC, can communicate with all the SoC debug resources. In addition, the proposed debug architecture must be able to scale from small SoCs to big SoCs. Because of that, it is important to have a set of flexible but carefully defined message types. As part of this flexibility, the debug resources are targeted making use of a coordinate system that identifies the SoC block plus an identifier of the debug component to be targeted within that SoC block.

On the other hand, reducing the amount of static power consumed is achieved by sharing the

messaging interface between the different message types. Furthermore, we also focus on reducing

the number of wires by sending the least amount of information required.

Finally, given that the messaging protocol is shared between different lead debuggers and that some of them follow protocols that require matching in-flight requests with their corresponding responses, the identifier of the requester and unique transaction identifiers are attached as part of the message.

There are different message types that make the debug infrastructure flexible, allowing the user to apply different configurations and gather information from the system. This section describes the four message types and how they are encoded.

Message type          Encoding
Configuration         00
Event                 01
Trace non-streaming   10
Trace streaming       11

The messages share the same interface and make use of different fields introduced below.

● Destination:

○ The most significant bits correspond to the X and Y coordinates of the system, allowing the debugger to target any location of the SoC.

○ The least significant bits are used to encode the destination debug component within the target SoC block.

■ For example, the X and Y coordinates can be used to target SoC Block 0, which has compute cores, while the least significant bits of the Destination field are used to target a monitor within that SoC block to analyze core retired instructions.


○ The coordinates are used by the debug NoC routers to select among all the possible destinations and route the message to the endpoint. Once the message arrives at the SoC block NoC interface, the least significant bits of the Destination field are used to select the destination debug component and forward the rest of the packet information.

● Source:

○ Configuration requests and trace messages require the destination endpoint to

know which debug component generated the message. Due to that, the SoC block

debug component identifier is attached.

○ Configuration messages (both request and response) also require the use of a

transaction identifier so that the destination endpoint can generate a response

attaching the same identifier value. This is needed to support the cases in which the

active debugger makes use of protocols such as AMBA AXI4 which allow sending

multiple requests in parallel to be able to match each request with its

corresponding response.

● Data:

○ The Data field encoding is different for each message type, and the proposed architecture ensures that the number of NoC wires dedicated to the Data field is the minimum required to support the different message types so that no wires are wasted.

○ Configuration response messages make use of a dedicated wire that indicates that the received packet is the last one for the given message. That is, in case the message has been split into multiple packets, this wire is used to indicate the last one. Otherwise, the value of this wire is not taken into account.

○ Note that the debug NoC does not use the Data field and just forwards the value to the corresponding destination. Further information can be found later in this document.

The previous fields are exposed as input and output buses on the debug NoC. That is, the communication between the debug NoC and the SoC blocks is based on an interface of five output buses, which allow the debug NoC to send requests to the SoC blocks, and five input buses, which allow the SoC blocks to send response messages and traces through the debug NoC. The set of five buses corresponds to the Destination, Source and Data fields together with a valid signal, used to mark that the master interface wants to send a new message, and a ready signal, used by the slave interface to indicate that it can accept new messages.

In addition, this interface is parameterizable such that, depending on the user's design, the debug NoC interface can be replicated, allowing different sets of buses to be used for debug component responses and for forwarding gathered data from the SoC blocks. From now on, for simplicity, this document assumes a debug NoC with a single interface that is shared by all message types.
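To make the field layout concrete, the sketch below models a debug NoC message as plain data; the dataclass, the coordinate field names and the field widths are illustrative assumptions, since the document only fixes the set of fields and the message type encodings, not their exact sizes.

```python
from dataclasses import dataclass
from enum import IntEnum

class MsgType(IntEnum):
    CONFIGURATION       = 0b00
    EVENT               = 0b01
    TRACE_NON_STREAMING = 0b10
    TRACE_STREAMING     = 0b11

@dataclass
class DebugNocMessage:
    # Destination: SoC block coordinates (most significant bits) plus the
    # debug component identifier (least significant bits).
    dest_x: int
    dest_y: int
    dest_component: int
    # Source: requester block/component plus an optional transaction id for
    # protocols such as AMBA AXI4 that match requests with responses.
    src_x: int
    src_y: int
    src_component: int
    trans_id: int
    # Message type and type-specific Data payload (opcode/address/value,
    # event identifier, trace data...).
    msg_type: MsgType
    data: int

# Example: a Configuration request from the JTAG bridge at block (0, 0) targeting
# debug component 2 inside the SoC block at coordinates (1, 2).
msg = DebugNocMessage(dest_x=1, dest_y=2, dest_component=2,
                      src_x=0, src_y=0, src_component=0, trans_id=7,
                      msg_type=MsgType.CONFIGURATION, data=0x0)
print(msg.msg_type.name, hex(msg.data))
```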

Below we can see an example of how the debug NoC is integrated on an SoC composed of six

different blocks. Each SoC block debug component connects to the debug NoC making use of the

interface introduced above. In addition, lead debuggers such as JTAG, PCIe or the SoC master core

are also attached to the debug NoC, allowing access to all SoC debug resources. Note that each block has a unique identifier composed of (X,Y) coordinates, and each SoC block debug component has a unique identifier within the SoC block where it is integrated.

Figure 11. Debug NoC integration on an SoC composed of six blocks

4.2.1.1 Configuration messages

This type of message is used to configure the debug components of the system. A few examples of Configuration messages are enabling a filter, configuring a counter, or performing read and write requests to registers and memories through the master interfaces exposed by the debug NoC.

The Configuration messages can be categorized into two types of operations: reads and writes. The reads are used to gather information from the system, either the configuration applied to a debug component or a value from a register or memory. The write operations, on the other hand, are used to apply configurations to the debug components and to write data into registers or memory.

As mentioned above, the interface between the debug components and the debug NoC has a fixed size, which means that some requests or responses may be split into multiple packets. Those packets are handled by the NoC, and the splitting is transparent to master interfaces such as the master core or PCIe. Note that since the debug NoC handles the concatenation of the packets, this improves the efficiency of the PCIe, USB or JTAG interfaces, allowing them to maximize their throughput as they no longer have to send multiple messages when the data can be handled more efficiently.


4.2.1.1.1 Read configuration messages

Reads are used to gather information from the system by targeting a debug component or a

memory mapped resource such as a register or memory.

Request

The read request does not require sending data, and we assume that all the information regarding the target endpoint and the operation size can be encoded in one debug NoC message.

SoC block and debug component target

When a read request is issued, the endpoint is specified in the Destination field following the pattern explained in previous sections, where the most significant bits of this field correspond to the X and Y coordinates and the least significant bits are used to encode the destination debug component.

Targeting register/memory

On the other hand, the user needs to specify the address offset within that SoC block and/or debug component that has to be read, together with the size of the request and the operation type (opcode), which in this case is a normal read. All this information is encoded in the Data field such that, when the request arrives at the debug component, the packet is decoded and the read is performed.

Request source

When a read is issued, the source expects a response with the requested value. It is important to note that the response is generated by the endpoint, as the debug NoC is stateless, that is, it does not store the request identifiers waiting for a response. Because of that, the debug component that received the request stores the requestor identifier so that it can send the response back.

The debug component source information is encoded by the requestor under the Source field, so

that the message includes the source debug component that generated the request. In addition,

when the debug NoC receives the request, it appends the SoC block source coordinates such that

once the destination debug component receives the request it has all the information to generate a

response.

On the other hand, given that there can be multiple master interfaces that may require using unique transaction identifiers to match multiple in-flight requests and responses, the requestor debug component can encode a unique transaction identifier (trans_id) in the request Source field so that the destination debug component can attach this information to the response. In case no unique transaction identifier is used, this subset of bits is left unused.

Response

The read response does require sending the gathered value together with all the information regarding the destination endpoint. This means that, depending on the read request size, responses may require more than one packet and the information is split into flits. The debug NoC is responsible for handling all the flits that compose a message before the response is returned to the requestor. In case the debug module has to send multiple packets for a single read response, the encoding is always the same and the only change is the value appended to the Data field.

SoC block and debug component target

As has been commented in previous sections, the debug component must generate a response to

the requestor returning the read value.

Response messages also target the endpoint through the Destination field and follow the same pattern as read requests; the most significant bits of the Destination field correspond to the X and Y coordinates of the SoC block while the least significant bits are used to encode the destination within that block. In this case the destination endpoint corresponds to the information extracted from the read request Source field.

Transaction

Given that the requestor may follow a protocol that requires the use of unique identifiers such as

AMBA AXI4, the debug module encodes the transaction identifier (trans_id) information extracted

from the read request, under the response Source field.

Data

The read value is returned in the Data field together with information that indicates whether the transaction has completed successfully or an error, such as targeting a non-existing destination, has occurred. Note that the reply code, which indicates if the request was successful or not, makes use of a subset of dedicated wires within the Data field. In addition, there is also a dedicated wire that indicates if the sent packet is the last one for a given message so that the endpoint can concatenate the packets successfully.

4.2.1.1.2 Write configuration messages

Request

Write requests do require sending the value to be written together with all the information regarding the target endpoint and the operation size. Depending on the size to be written, it may be necessary to send multiple messages through the debug NoC.

SoC block and debug component target

Following the same pattern as read requests, the endpoint is specified in the Destination field, making use of the most significant bits as the X and Y coordinates to select the target SoC block and the least significant bits of the Destination field to encode the destination component within that block.

Targeting register/memory

On the other hand, the user needs to specify the address offset within that SoC block and/or debug component that must be written, together with the size of the request and the operation type (opcode), which in this case is a write request. All this information is encoded in the Data field such that, when the request arrives at the debug component, the packet is decoded and the write is performed.

Request source

As introduced in previous sections, depending on the active debugger protocol it may or may not be required to keep the ordering of the transactions, or to observe whether an operation has reported an error, as this may not be observable through other means such as error interrupts. Write requests support the use of unique transaction identifiers and allow specifying whether the user wants the destination debug component to generate a write confirmation response. This allows the user to save around 50% of the bandwidth and sustain more outstanding writes in case write confirmation is not required.

Due to the fact that debug components may need to generate a response, the information about

the requestor is sent as part of the write request so that it is stored and used later to set the

destination of the response.

This source information is encoded the same way as in read requests such that under the Source

field the active debugger encodes the identifier of the debug component that generated the

request. In addition, when the debug NoC receives the request, it appends the SoC block source

coordinates such that once the destination debug component receives the request it has all the

information to generate a response. Moreover, this information is also used by the debug NoC to

reassemble the message in case it has been split into multiple flits.

On the other hand, given that the master that generated the request may require using unique

transaction identifiers to match multiple in-flight requests and responses, the requestor debug component can encode a unique transaction identifier (trans_id) in the request Source field so that the destination debug component can attach this information to the response.

Then, the unique transaction identifier (trans_id) together with the SoC block coordinates and the

identifier of the debug component that performed the request generate a system transaction

identifier that can uniquely identify the transaction even if there are multiple master interfaces

using the same trans_id.
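A small sketch of this composition is shown below; the 4/4/8/8-bit field widths are arbitrary assumptions, the point being only that concatenating the block coordinates, the component identifier and the per-master trans_id yields a system-wide unique key.

```python
def system_transaction_id(x: int, y: int, component_id: int, trans_id: int) -> int:
    """Concatenate the source coordinates, component id and per-master trans_id.

    The field widths (4/4/8/8 bits) are illustrative; any widths large enough
    for each field keep the combined identifier unique across masters.
    """
    assert x < 16 and y < 16 and component_id < 256 and trans_id < 256
    return (x << 20) | (y << 16) | (component_id << 8) | trans_id

# Two masters reusing trans_id = 3 still produce distinct system identifiers.
a = system_transaction_id(x=0, y=1, component_id=2, trans_id=3)
b = system_transaction_id(x=2, y=0, component_id=5, trans_id=3)
print(hex(a), hex(b), a != b)
```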

Data

Debug components support multiple configurations that enable the different debug features such

as advanced filtering or defining a counter configuration. These configurations are applied by the

user by attaching the register/memory address in the Data field together with the request size and

the value to be written.

Note that the write opcode is extended with an extra bit that is used as a write acknowledgement enable. This acknowledgement bit can be set to 0 when using interfaces such as USB or JTAG, as those do not need responses when configuring a debug component. On the other hand, this bit can be set to 1 for PCIe and the master core, given that they make use of protocols that may require write confirmation responses. When a debug component receives a write request with an opcode that has the acknowledgement bit set to one, it generates a response that is sent back to the requestor. Otherwise, the operation is performed without notification.

In case the value to be written fits in the message Data field without needing extra messages, only one message is sent to the target endpoint. Otherwise, the write request must be split into multiple packets. The Destination and Source fields of those messages carry the same values, while the Data field of the first message carries the size and opcode information and the Data fields of the remaining messages are filled with the value to be written.

Response

As explained above, write responses depend on the value of the request opcode acknowledgement

bit.

SoC block and debug component target

The response endpoint is also specified through the Destination field and follows the same pattern as write and read requests; the most significant bits of the Destination field correspond to the X and Y coordinates of the SoC block while the least significant bits are used to encode the destination within that block.

Transaction

Given that the requestor may follow a protocol with unique identifiers such as AMBA AXI4, the

debug module encodes the transaction identifier (trans_id) information previously stored, under

the response Source field.

Data

In case a write response is generated, the Data field makes use of a subset of dedicated wires which carry information regarding the transaction completion or failure. For example, an error can be returned if the user has targeted a non-existing destination.

4.2.1.2 Event messages

This type of message is used to propagate an identifier to all the debug components in the system. The identifier can be used to enable or disable debug capabilities such as filtering, to reset counters, or to trigger read requests to registers and/or memories, among others. This allows triggering pre-configured debug features without having a huge impact on traffic, as the amount of information sent is small. In addition, events are by default propagated through the system so that the user can configure debug components in multiple SoC blocks to perform actions based on the same event identifier.


The event messages are generated by the debug components or injected through the active

debugger and never generate a response.

Event messages are also composed following the same pattern as Configuration messages such that

there are Destination, Source and Data fields.

The Destination field is unused as the debug NoC decodes information regarding the message type

and once it identifies it as an Event message it automatically propagates the message to all the

possible endpoints.

On the other hand, given that no response is needed, no source information regarding the SoC block identifier or the debug component identifier is required. Finally, the Data field contains the identifier to be propagated, which is received by all the debug components in the system.

4.2.1.3 Trace messages

This message type is used to send gathered values from the debug components to the active

debugger without requiring the debugger to perform a request. This is useful when the debugger

has configured a filter and wants to extract information from the system when the configured

condition is met.

The interface is shared with Configuration and Event messages, which means that the interface between the debug components and the debug NoC is the same. Depending on the amount of information that the user wants to retrieve from the system, some Trace messages may need to be split into multiple packets.

Following the same pattern introduced in previous sections, the Destination field is filled with the X

and Y coordinates encoding the SoC block. The proposed architecture assumes that Trace messages

are always sent to the memory subsystem, which can be configured to either store the message in

memory or forward it to the debugger. Further information about Trace message routing can be

found later in this document.

On the other hand, even if no response is needed for Trace messages, the Source field is filled with

the identifier of the debug component that extracted the information from the system. Once the

debug NoC receives the Trace message from the debug component, it concatenates the X and Y

coordinates of the source SoC block such that the destination endpoint is aware of the debug

component that generated the message.

Finally, the Data field contains the value that has been gathered from the system. In addition, the

Data field may also be filled with information regarding the timestamp, which is introduced later in

this document.


4.2.1.3.1 Streaming mode

The Trace message type supports a streaming mode, which allows removing some operational fields from the Trace messages and using the freed space to carry additional information extracted from the system.

This special trace mode is enabled by the user when configuring the monitor debug components.

Both the debug infrastructure and the host are aware of when a Trace message has streaming

encoding based on the message type encoding which is part of the Trace message. When the Trace

message is encoded as non-streaming, it is fully decoded by the active debugger as it contains

information about the source, the timestamp value and the gathered information from the system.

On the other hand, in case a Trace message is encoded as streaming, then it does not contain

information about the source module, source SoC block nor timestamp values as all the message

fields are filled with the information gathered from the system. Further information about this

feature can be found later in this specification.

4.2.2 Interface and message packaging

The interface between the SoC block debug infrastructure and the debug NoC follows the encoding introduced in the message type sections. The debug NoC has a set of five input buses (Destination, Source, Data, valid and ready) and five output buses (Destination, Source, Data, valid and ready) that communicate with the SoC blocks.

● Destination

○ Encodes the destination SoC block and debug component for the message.

○ Note that in case of an Event message this field is unused as the debug NoC

automatically performs a broadcast and propagates the event to all SoC debug

components.

Figure 12. Message Destination field encoding

● Source

○ Configuration and Trace messages encode the source SoC block and debug

component identifiers from the requestor that initiated the transaction.

■ Note that Configuration messages also attach the transaction identifier

value when the requester that initiated the transaction makes use of a

protocol that requires matching in-flight requests and responses based on

unique identifiers. Otherwise, the transaction identifier field is unused.

■ Note also that in the case of Trace messages, if streaming mode is enabled this field contains the least significant bits of the information gathered from the system, replacing the Transaction ID, SoC block debug component ID and SoC block ID sub-fields.


○ Event messages do not use this field.

Figure 13. Message Source field encoding

● Data

○ Configuration request messages encode the target endpoint address, for example the offset of an SoC block or debug component register. In addition, this field also encodes the operation type and size, together with the value to be written in the case of write requests.

Figure 14. Configuration request message Data field encoding

The available operation codes (opcode) are:

Operation code              Encoding
Read                        00
Write without acknowledge   01
Write with acknowledge      10
Reserved                    11

○ Configuration response messages contain the reply code that indicates whether an operation has succeeded or failed. In addition, in the case of read requests, the Configuration response also attaches the returned read value. Note that Configuration response messages make use of a dedicated wire indicating the last packet of the current message, to handle the case when messages have to be split. In addition, dedicated wires are also used to encode the reply code, which indicates if the operation has been performed successfully.

Figure 15. Configuration response message Data field encoding for read operations (raw data in the upper bits, reply code in dedicated lower bits)


Figure 16. Configuration response message Data field encoding for write operations (only the reply code sub-field carries information; the remaining bits are zero)

○ Event messages encode the event identifier.

Figure 17. Event message Data field encoding

○ Trace messages encode the information gathered from the system. In addition, depending on the user configuration, the field also encodes the timestamp value in the least significant bits of the payload. Note that streaming Trace messages do not attach the timestamp information.

Figure 18. Trace non-streaming message Data field encoding (raw data plus the timestamp sub-field in the least significant bits)

Figure 19. Trace streaming message Data field encoding (the entire field carries raw data)

4.2.3 Timestamps

Timestamps are used to keep a record of when information has been captured so that it can not only be ordered but the user can also see the elapsed time between different Trace messages. As we have seen in previous sections, the timestamp value is attached as part of the debug message when the user has enabled this feature; this means that it takes up part of the Data field, so more information has to be exchanged.

Given that multicore systems can run one test for millions of cycles, we need to ensure we can capture a wide range of timestamp values, which requires placing counters at the message source, that is, in each SoC block that contains debug monitor components. With an eye on reducing the amount of area required, we defined a design that optimizes the area used for the timestamp registers.


4.2.3.1 Synchronization

All timestamp counters in the system must be synchronized so that if monitor messages are

captured in different blocks we can order them at the host. This requires a way to start all

timestamp counters at the same time.

In the proposed design there is a timestamp counter within each block, used by all the monitor debug modules within that block, which is enabled upon reception of a synchronization signal from a common point in the SoC. In our design this common point is the I/O peripherals block, as it contains the debug interface with the host.

Note also that, since the timing path from the I/O peripherals block to each of the other blocks is different, there is a need to ensure that all counters start at the same time, or with the same value. For this purpose, each SoC block that requires a timestamp counter has a different timestamp counter reset value, which depends on the propagation latency between the I/O peripherals block and that SoC block. That is, depending on how many cycles it takes for the synchronization signal to reach the SoC block, the timestamp counter starting value will differ, ensuring that all counters hold almost the same value once they are enabled. Note that we cannot guarantee all the counters in the system have the exact same value, as there can be differences of ±1 cycle due to the fact that this signal needs to be synchronized to the local clock and the Clock Domain Crossing (CDC) logic on the SoC block has certain variability depending on the arrival time of the signal in relation to the SoC block clock.

On the other hand, due to the fact that each SoC block runs at a different frequency we need to

ensure that timestamp counters are increased at the same rate, so the timestamp increment relies

on using the Network On-Chip clock as a reference.
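A behavioral sketch of this reset-value compensation is shown below; the per-block latencies are made-up numbers, the point being that a counter preloaded with its own synchronization latency reports the same value as every other counter from the moment it is enabled.

```python
# Illustrative propagation latencies (in reference clock cycles) of the
# synchronization signal from the I/O peripherals block to each SoC block;
# real values come from the physical design.
SYNC_LATENCY = {"block_A": 3, "block_B": 7, "block_C": 12}

def counter_value(block: str, global_cycle: int) -> int:
    """Timestamp counter value of a block at a given global cycle.

    The counter starts when the sync signal arrives (SYNC_LATENCY cycles after
    it is issued) but is preloaded with that same latency, so all blocks agree.
    """
    latency = SYNC_LATENCY[block]
    if global_cycle < latency:
        return 0  # sync signal not received yet, counter still disabled
    return latency + (global_cycle - latency)

# All enabled counters report the same value at any later global cycle.
print([counter_value(b, 20) for b in SYNC_LATENCY])  # [20, 20, 20]
```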

4.2.3.2 Timestamp computation

Another problem we face with timestamps is that, given that their value can reach thousands or millions, they can take a big chunk of the debug message payload, causing debug messages to be split into multiple messages and impacting throughput, which can end up causing bottlenecks and loss of information.

The proposed solution takes advantage of the fact that we can determine the maximum latency between the debug NoC routers attached to each SoC block and the memory subsystem blocks. This way we can compute the maximum number of timestamp bits required to reconstruct the full timestamp value, ensuring we transport the minimum number of bits required. This means that, using the Source field information and the deterministic latency, the full timestamp can be reconstructed, because all timestamp counters in the system are synchronized. This allows the SoC block monitor debug components to send only the TS least significant bits of the timestamp counter value.

The original full timestamp value can be reconstructed at the I/O block or the memory subsystems by taking into account the N least significant bits sent by the debug component as part of the payload, the latency between the debug NoC router and the destination, the message source identifier, and the destination's current timestamp value when the message is received. In case the destination is the memory subsystem, the full timestamp value can either be stored in memory, in case buffering is enabled, or sent to the debug interface with the host.
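A minimal sketch of this reconstruction is given below, assuming the truncation width TS is chosen so that 2^TS cycles exceed the worst-case source-to-destination latency; under that assumption there is exactly one counter value not greater than the receiver's current time whose low bits match the transmitted ones.

```python
def reconstruct_timestamp(lsbs: int, ts_bits: int, dest_now: int) -> int:
    """Recover the full timestamp from its TS least significant bits.

    Assumes all counters are synchronized and the in-flight latency is strictly
    smaller than 2**ts_bits, so exactly one candidate lies in the window
    (dest_now - 2**ts_bits, dest_now].
    """
    window = 1 << ts_bits
    candidate = (dest_now & ~(window - 1)) | lsbs
    if candidate > dest_now:   # the low bits wrapped after the message was sent
        candidate -= window
    return candidate

# A message stamped at cycle 1000 carries only 6 bits (1000 % 64 == 40) and
# arrives when the destination counter reads 1023.
print(reconstruct_timestamp(lsbs=1000 % 64, ts_bits=6, dest_now=1023))  # 1000
```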

4.2.4 Interleaving

As has been introduced in previous sections, the debug monitor component allows the user to

generate debug Trace messages that can be either stored on memory or sent to the active

debugger.

We must ensure that, in case the debug NoC and memory controller throughput are lower than the tracing capabilities, the debug NoC buffers do not overflow when the SoC block monitor components generate a huge number of Trace messages. In addition, note that post-silicon debugging of multicore systems also requires gathering information from the system over long periods of time, which means that the debug NoC, and the debug infrastructure as a whole, must be able to maintain high throughput over long periods of time.

The interleaving feature is available when the multicore system has multiple memory subsystem blocks and alleviates throughput bottlenecks by distributing Trace messages across the different system memories in an interspersed manner.

Due to the fact that the debug Trace messages from a given monitor component can end up in

multiple memory subsystems, it is recommended that the user enables timestamps so that

messages can be ordered at the host to reconstruct the state of the system.

On the other hand, it is important to note that interleaving does not require having just one debug monitor component enabled. That is, the user is allowed to enable tracing on multiple debug components, given that the generated Trace messages encode the destination and source information, which means that the debug messages can be routed and decoded without problems. Nevertheless, it is not recommended to configure tracing capabilities on multiple debug components targeting the same memory subsystem when interleaving is enabled, as this not only stresses the debug NoC but may cause overflows that end up in messages being lost.
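The sketch below illustrates the interspersed distribution of Trace messages across M memory subsystems; a simple round-robin policy is assumed here only to make the idea concrete, since the document does not mandate a specific distribution function.

```python
def interleave(trace_messages: list, num_mem_subsystems: int) -> dict[int, list]:
    """Distribute consecutive Trace messages across the memory subsystems.

    Round-robin by generation order: with M targets, subsystem k receives the
    messages produced on cycles k, k + M, k + 2M, ...
    """
    buffers: dict[int, list] = {k: [] for k in range(num_mem_subsystems)}
    for i, msg in enumerate(trace_messages):
        buffers[i % num_mem_subsystems].append(msg)
    return buffers

# Eight messages over two memory subsystems: one buffer ends up with the even
# cycles and the other with the odd cycles, as described in the streaming section.
print(interleave([f"t{i}" for i in range(8)], 2))
```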

4.2.5 Streaming

The streaming feature allows the user to capture information from the system every cycle through a specific debug monitor component when the amount of real-time data to be sent does not fit in a single debug message, so that interleaving alone is not enough.

This feature is intended to be used together with interleaving in order to ensure that all generated

Trace messages can be handled and sent to the debugger, or stored in memory. As mentioned in the


Interleaving section, the user must enable timestamps to allow the host to order the Trace

messages.

The streaming feature focuses on optimizing the amount of information sent to the debugger, which impacts how the timestamp value is computed and also modifies the payload. That is, streaming mode removes some operational fields from the debug Trace messages and uses the freed space to carry the information extracted by the debug monitor components, in order to utilize the full width of the debug NoC buses to send system information.

4.2.5.1 Requirements

When streaming mode is enabled there must not be multiple debug monitor components on the

SoC generating Trace messages targeting the same memory subsystems. Otherwise, the Trace

messages from multiple debug components are merged and wrong information is either stored in

memory or forwarded to the active debugger.

Note also that memory subsystems only store debug Trace messages and not debug Configuration

messages. Due to that, the rest of the SoC blocks can still be used to retrieve information based on

Configuration response messages such as read requests to registers.

4.2.5.2 Operational mode

Once streaming mode is enabled, the first Trace message sent by the debug monitor module contains the information regarding the source component identifier, the SoC block identifier, the target destination of the Trace message, the full timestamp value and a chunk of the signal values extracted from the system. Note that this Trace message is marked as non-streaming. Then, the subsequent debug Trace messages, which contain only gathered information from the system, are marked as streaming by changing the Source field Message Type sub-field value accordingly. Information such as the source component identifier, the SoC block identifier and the timestamp value is computed by the host. In order to do that, the host decodes all the debug Trace messages by always looking at the Message Type sub-field value to identify streaming and non-streaming Trace messages.

● In case Message Type is encoded so that streaming mode is disabled, then the message is

fully decoded and fields such as source component identifier and SoC block identifier are

retrieved from the message.

● In case Message Type is encoded so that streaming mode is enabled, then the host makes

use of the source component identifier, SoC block identifier, and the timestamp value of the

last Trace message that had Message Type set as non-streaming in order to fill the missing

information.

The first debug Trace message works as a header that separates the non-streaming extracted data from the streaming information that follows. In case messages are stored in memory, the Message Type information is also part of the stored message. Given that the source component identifier and the SoC block identifier do not change while streaming mode is enabled, there is no need to store those fields. On the other hand, the timestamp is deterministic and can be regenerated by the host, as is explained later in this document.
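A simplified host-side decoder following these rules is sketched below; the TraceMsg representation and the one-streaming-message-per-cycle timestamp increment are illustrative assumptions based on the description above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TraceMsg:
    streaming: bool                   # value of the Message Type sub-field
    source: Optional[tuple] = None    # (block_x, block_y, component_id), header only
    timestamp: Optional[int] = None   # full timestamp value, header only
    data: int = 0                     # information gathered from the system

def decode_stream(messages: list[TraceMsg]) -> list[dict]:
    """Fill in source and timestamp of streaming messages from the last header."""
    decoded, header_src, header_ts, offset = [], None, 0, 0
    for msg in messages:
        if not msg.streaming:         # non-streaming header: take its fields as-is
            header_src, header_ts, offset = msg.source, msg.timestamp or 0, 0
        else:                         # streaming: regenerate the missing fields
            offset += 1               # one streaming Trace message per cycle
        decoded.append({"source": header_src,
                        "timestamp": header_ts + offset,
                        "data": msg.data})
    return decoded

msgs = [TraceMsg(False, (1, 2, 0), 1000, 0xA), TraceMsg(True, data=0xB), TraceMsg(True, data=0xC)]
for entry in decode_stream(msgs):
    print(entry)   # timestamps 1000, 1001, 1002 with the same source
```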

4.2.5.3 Timestamp computation

As mentioned in previous sections, the timestamp value is only captured on the first message, before streaming mode is enabled, and it is the responsibility of the host to compute the timestamps corresponding to the rest of the debug Trace messages while streaming mode is enabled. Due to the fact that streaming mode generates a Trace message per cycle, and that interleaving is enabled, each memory subsystem is expected to forward to the debugger or store in memory one Trace message every M cycles, M being the number of memory subsystems enabled and targeted by the debug component when using the streaming feature.

On the other hand, in case Trace messages can no longer be stored or forwarded to the debugger due to lack of resources, backpressure is applied to the debug monitor component and no new streaming Trace messages are generated until the backpressure is resolved. Once the debug monitor can send new Trace messages, the first generated Trace message is encoded with the full payload, including the source component identifier, the SoC block identifier and the full timestamp value, and the Message Type field is configured to indicate that it is a non-streaming message. The next debug Trace messages are sent containing only information gathered from the system, with the Message Type sub-field set to mark that streaming is enabled.

This architecture allows the host to detect when streaming mode has stopped and started again, so

that it can be aware of when tracing has stopped and compute the Trace messages timestamps

based on the new full timestamp value.

Finally, due to interleaving being enabled, the host is responsible for merging the memory

subsystem buffer contents taking into account the order in which debug Trace messages have been

generated. For example, in case of using two memory subsystems then one of the buffers contains

debug Trace messages from odd cycles while the other has the even cycles as messages were sent

interspersed.

4.2.5.4 Bottleneck analysis

Multicore systems are likely to face situations in which, due to the amount of traffic generated, there are bottlenecks. Below is an analysis of the two most likely places where a bottleneck can appear and how each is handled by the proposed architecture.

Memory Controller

First, in case the debug infrastructure is configured to store Trace messages in memory, this feature relies on the readiness of the memory controller, since the debug subsystem requires it to read the debug messages received through the NoC and store them as buffering.

There are two options that can be followed by the memory subsystem debug logic:


1. Send an acknowledgement to the debug NoC as if the Trace messages were being stored properly. This option does not stall the system, but the user is not aware that messages are being lost.

2. Do not send the acknowledgement to the debug NoC so that backpressure is propagated up

to the source. The debug monitor component ready signal goes low and, depending on the

configuration, tracing can be stopped and a notification is sent to the user.

The proposed architecture selects the second option, in which backpressure is propagated to the debug monitor components so that the user is always aware when gathered data from the system is being lost. This allows the user to modify the advanced filtering setup in order to reduce the amount of information gathered from the system, or to post-process the information on the host knowing that there are cycles where information is missing.

Analytic component

The debug monitor component may not be able to send Trace messages because the interface with

the debug NoC is not ready, which can happen either because there are other monitors making use

of the debug NoC channel or because the memory subsystem is not ready and cannot store

messages so that backpressure has been propagated to the source.

The monitor components behave such that they stop sending debug Trace messages even if

streaming mode is enabled. Then, once the interface with the debug NoC is ready, the monitor

component sends a Trace message with streaming mode disabled and encodes all the required

fields and the timestamp value. Next, the rest of the Trace messages are sent with streaming mode

enabled as we have seen in previous sections.

This behavior ensures that the analytic component is aware that messages have been lost, so that it can restart streaming mode and guarantee that, once Trace messages can be stored again, the timestamp is correct and the memory subsystem is able to store new messages. Furthermore, this behavior also allows the active debugger to detect that messages have been lost, since there are at least two debug Trace messages whose timestamps are not consecutive.

4.2.6 System resources access

Access to system resources is done through the debug Network On-Chip, as it exposes AMBA interfaces such as AMBA AXI4 and AMBA APB3, which are then connected to caches, memories and registers, allowing the user to read and write without injecting traffic into the non-debug NoC and interfering with the compute cores. Note that AMBA is an open standard widely adopted for the connection and management of System on-Chip functional blocks, as it reduces the risk and cost of developing new protocols and adapts different resources, such as third-party IPs, to them.


4.2.6.1 Event support

The user can configure the debug NoC master interfaces to trigger read or write requests when an

event with a certain identifier is received. The request target address, size and the value to be

written, in case of a write request, are pre-configured by the user and stored in debug NoC internal

registers such that it is transparent to the multicore system blocks.

For example, the user can configure the APB master interface to perform a write request to a run

control register based on a specific event, such that all cores would receive a halt request. This

event could have been generated based on a Logic Analyzer or Bus Tracker filter.

4.2.6.2 Polling & Compare

A nice-to-have feature for multicore systems is the ability to monitor a register or memory location and wait for it to reach a certain value. This is very useful, for example, if the user wants to know when all cores have reached a certain point in the execution by monitoring a credit counter, or to wait for all cores to hit a breakpoint and halt, such that the debugger can poll the core status register waiting for all of them to be halted.

In addition to the event support, the debug NoC also allows the user to configure polling behavior

for a given register or memory so that the master interface sends a read request and checks the

returned value until the condition is met.

The supported operations are:

● Check if the read register value is {equal, not equal, greater than, less than, greater or equal

to, less or equal to} a given value provided by the user.

● Check if the read value is within a range provided by the user.

● Check if the read value changes.

The user can set a maximum number of cycles or let the polling run indefinitely until the condition is met or polling is manually disabled; a brief sketch of the comparison step is shown below.
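The sketch below models one polling iteration; the read value is taken as an argument, and representing the condition with a small set of named operations mirrors the list above, though the real comparator is implemented in hardware.

```python
import operator

# Comparison operations supported by the polling logic, per the list above.
COMPARE_OPS = {
    "eq": operator.eq, "ne": operator.ne,
    "gt": operator.gt, "lt": operator.lt,
    "ge": operator.ge, "le": operator.le,
}

def poll_condition_met(read_value: int, op: str, ref=None, prev_value=None) -> bool:
    """Evaluate one polling iteration against the configured condition."""
    if op == "in_range":                  # ref is an inclusive (low, high) pair
        low, high = ref
        return low <= read_value <= high
    if op == "changed":                   # condition: the read value has changed
        return prev_value is not None and read_value != prev_value
    return COMPARE_OPS[op](read_value, ref)

# Example: wait until a credit counter reaches at least 64.
print(poll_condition_met(70, "ge", 64))            # True
print(poll_condition_met(10, "in_range", (0, 5)))  # False
```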

Actions

Once the register or memory location being read holds the expected value, so that the comparator condition is met, there are three available options:

● Generate an event, which is propagated through the system debug infrastructure allowing

to enable or disable filters, or perform run control operations.

● Generate a Trace message with the read value and forward it to the debugger.

● Generate a notification message and forward it to the debugger.

The user can also configure the debug infrastructure to generate a Trace message with the read

value for each iteration of the polling so that the user can have a sequence of values. Note that this

can generate huge amounts of debug traffic.


4.2.7 Routing

Debug NoC message routing is based on coordinates, such that the bridges between the SoC blocks and the debug NoC use the (X,Y) values to route messages through the routers' north, south, east or west interfaces.

The coordinate routing method allows optimizing area, as the NoC does not need a static table containing all the SoC block identifiers and the router interface to be targeted for each one. In addition, in case some path becomes unavailable, coordinates make it easier to re-route messages without complex algorithms. A sketch of a basic coordinate-based routing decision is shown below.
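The following minimal sketch shows a dimension-ordered (X then Y) routing decision, a common choice for coordinate-based NoCs; the document does not commit to this exact policy, so it is only an illustrative assumption.

```python
def route_port(router_x: int, router_y: int, dest_x: int, dest_y: int) -> str:
    """Pick the output port for a message sitting at router (router_x, router_y).

    Dimension-ordered routing: travel along X first, then along Y, and deliver
    locally once both coordinates match the destination.
    """
    if dest_x > router_x:
        return "east"
    if dest_x < router_x:
        return "west"
    if dest_y > router_y:
        return "north"
    if dest_y < router_y:
        return "south"
    return "local"  # this router's SoC block is the endpoint

# A message injected at (0, 0) and heading to block (2, 1) leaves eastwards first.
print(route_port(0, 0, 2, 1))  # east
print(route_port(2, 0, 2, 1))  # north
print(route_port(2, 1, 2, 1))  # local
```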

Moreover, when comparing the coordinate methodology with enumeration, we take advantage of the fact that there is no need to perform a full discovery of the system when it is initialized, nor to reassign the SoC block identifiers when there is a change in the topology. The software management is less complex, causing less trouble for designers not only during integration but also during post-silicon validation in case re-configuration is required.

The debug NoC design guarantees that it is lossless, that is, debug messages are not lost once they

are injected from the SoC blocks. In order to ensure this behavior, the debug NoC takes into account

the possible backpressure at the destination of the message such that buffering is added at the

endpoint. In case the endpoint cannot be accessed for a long period of time and the debug NoC buffers fill up, backpressure is propagated up to the source, indicating that the debug NoC does not accept more messages.

4.2.8 Network on-Chip interconnection

As explained in previous sections the debug NoC is integrated in the System on-Chip in such a way

that it exposes a set of input and output ports allowing SoC block debug components to send and

receive debug messages.

Due to the fact that the SoC blocks can have multiple debug components and in order to avoid

creating interfaces that require too many I/O pins on the debug NoC, a new debug component is

introduced.

This component handles the connection between the SoC block debug resources and the debug NoC, and it is integrated as part of the debug NoC in such a way that its interface with the debug NoC router consists of a set of ten fields, which correspond to the Destination, Source and Data fields together with the valid and ready signals, in each direction, for the connection with the SoC block debug resources. On the other hand, the debug NoC component has N interfaces with the SoC block, N being the number of debug components within the SoC block, each interface consisting of the ten fields introduced previously in this document.


When a message from the debug NoC is received, the component makes use of the debug component identifier subfield of the Destination field to route the message to the corresponding SoC block debug component interface. Moreover, access to the debug NoC interface that allows the SoC block debug components to send Configuration responses, Trace messages or Event messages is arbitrated among all the SoC block debug components following a Round Robin policy, which gives all debug components the ability to send messages and ensures that there is no starvation. Note that the arbitration policy can be based on priorities, for example giving higher priority to Trace messages, making the architecture focus on gathering information while sacrificing system access and making it less responsive, or the other way around, ensuring that the system debug resources can be accessed at any time even if Trace messages are being generated. That said, in this document we assume that while tracing is enabled the user has reduced the activity on the rest of the debug components, so that multiple debug components do not make use of the debug NoC interface at the same time.

Figure 20. Debug NoC interconnection diagram

4.3 Monitoring

Visibility is essential when debugging execution failures on multicore systems. We need to provide the user with the ability to select which signals have to be monitored, allow filtering to look for specific cases, and gather information.

4.3.1 Logic Analyzer

This component provides the user with a wide variety of monitoring functions and can be used for debugging and performance profiling. The Logic Analyzer allows observability of one or multiple signals which designers connect to the filter input port. Then, based on advanced filtering, it can either generate an event to be propagated through the SoC, gather information from the system, notify the debugger or perform special operations such as increasing or decreasing internal counters. Note that the Logic Analyzer is also connected to the SoC debug infrastructure, allowing it to be configured and/or to send data or events.

This debug component can be parameterized, allowing each instance in the system to be unique and completely adapted to the designers' requirements. The Logic Analyzer follows a two-step methodology:

1. Reads the input filter bus to identify the condition of interest each cycle.

2. Takes action when the condition is met:

a. Generates an event that can be propagated through the whole SoC and enable or disable other debug features. Events are sent on the next cycle.

b. Traces data by reading the whole filter bus value or a subset of consecutive

bits in the bus and forwards it up to the hierarchy. In case trace messages

cannot be sent on the next cycle, then data is captured, stored in internal

buffers and sent as soon as possible.

c. Gathers performance metrics using internal counters.

As we can see, the Logic Analyzer always performs filtering before storing captured data in the internal buffers, in order to avoid filling the buffers with data that does not meet the filter requirements and to avoid losing visibility due to bottlenecks.

In addition, the Logic Analyzer supports timestamps so that generated messages can have a

timestamp value attached to its information. On the other hand, the Logic Analyzer timer can also

be used to generate messages every N cycles, being N a value that can be configured by the user in

real-time, in order to report internal counter values or perform snapshots of the filter input bus.

Figure 21. Logic Analyzer block diagram


4.3.1.1 Interface

The Logic Analyzer is connected to the debug NoC through an I/O interface that is used for configuration and tracing purposes; more details about this interface and the message protocols can be found later in this document.

On the other hand, there is a filter interface consisting of an input bus, whose width is

parameterizable, where designers connect the signals they want to analyze in real-time. There is

also a general purpose output bus (gpio), whose value can also be specified by the user through the

external debugger, that can be used to control multiplexers to select among different groups of

signals to be monitored. There is no valid signal for the filter interface since the Logic Analyzer

assumes that the data to be filtered is correct and that the user makes use of a bit within the signals

being filtered to validate the data when needed.

There is no need to have multiple filter input ports since the comparators, as explained in the next section, can be configured to target different subsets of bits on the same filter input bus. That is, designers can attach several signals from different blocks to the same input bus and then make use of multiple comparators to select different subsets and apply a particular filtering to each of them.

In addition, there is an output port that indicates whether the Logic Analyzer is enabled, allowing designers to apply clock gating to the debug logic around the debug module when it is disabled. This has a direct impact on dynamic power.

4.3.1.2 Filtering

In order to identify conditions of interest the Logic Analyzer makes use of one or more filters that can be managed by the user at runtime; note that the number of filters is parameterizable by the user. Each filter can be configured with one or multiple comparators, which means that the user can specify multiple conditions that must be met to perform an action.

Each comparator has three fields that must be configured:

● A mask that specifies the bits of the filter bus that must be compared.

● A value to compare.

● The comparison operation to be performed.

The comparator hits if the bits selected by the mask (those set to 1) comply with the operation against the specified value. The supported comparator operations are:

● Less than (lt)

● Bigger than (bt)

● Equal (e)

● Not equal (ne)

For example, suppose the user configures a comparator with a mask of 0xF0, the less than (lt) operator and a value of 0xA0. In this case the comparator hits when bits [7:4] of the signals connected to the Logic Analyzer input filter port are less than 0xA. Note that the bits selected by the mask can be spread and do not need to be consecutive.
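The comparator behavior described above can be illustrated with a small behavioral sketch in Python (illustrative names and semantics assumed for this sketch, not the RTL implementation):

# Behavioral model of a Logic Analyzer comparator (illustrative, not RTL).
# The mask selects the filter bus bits to compare; the selected bits are then
# compared against the configured value using the configured operation.

def comparator_hit(filter_bus, mask, value, op):
    selected = filter_bus & mask     # keep only the bits enabled by the mask
    reference = value & mask         # the configured value, within the same bits
    if op == "lt":                   # less than
        return selected < reference
    if op == "bt":                   # bigger than
        return selected > reference
    if op == "e":                    # equal
        return selected == reference
    if op == "ne":                   # not equal
        return selected != reference
    raise ValueError("unknown operation")

# Example from the text: mask 0xF0, operation lt, value 0xA0.
print(comparator_hit(0x9C, 0xF0, 0xA0, "lt"))  # bits [7:4] = 0x9 < 0xA -> True
print(comparator_hit(0xB3, 0xF0, 0xA0, "lt"))  # bits [7:4] = 0xB       -> False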

Due to the ability of the Logic Analyzer comparators to use masks to select a subset of bits to be filtered, there is no need to make the Logic Analyzer registers that hold the comparator masks and values as wide as the filter input bus. That is, the internal registers that store the comparator configurations can be smaller than the filter input bus. This allows designers to attach big groups of signals to the filter input bus without causing a huge increase in Logic Analyzer area, since the internal registers do not need to scale with the bus width.

Following the description above, the Logic Analyzer is parameterized so that designers can specify whether they want one or multiple comparators to be smaller than the filter input bus, and set the comparator mask and value width. This reduces the amount of area needed without sacrificing visibility and debuggability. Note that in the case of smaller comparators, the subset of bits to be monitored must always be consecutive, so during runtime configuration the user specifies the position in the filter input bus of the least significant bit of the value to be compared.

In summary, there are five parameters that the user must configure when instantiating the module:

● Total number of filters.

● Filter input port width.

● Total number of comparators.

● Number of comparators that can be used for ranges.

● Size of the mask and value fields of the comparators that are used for ranges.

For example, let's suppose that designers want to analyze the core pipeline, so they have connected 128 bits from different control signals. Then, they want to define a subgroup to analyze the retired instructions, in which case they are interested in the program counter value, which we assume is 49 bits wide, plus a valid signal. The designer can parameterize the Logic Analyzer so that it has two comparators of 50 bits, allowing it to look for the upper and lower bounds of a specific range. Then, the rest of the Logic Analyzer comparators, which would be as wide as the filter input bus (128 bits), could be used to look for specific bits within the filter input bus.

A possible configuration for this integration could be as follows:

● Total number of filters set to 1.

● Filter input port width set to 128.

● Total number of comparators set to 4.

● Number of comparators that can be used for ranges set to 2.

● Size of the comparators that are used for ranges set to 50.

This configuration allows the user to use the two range comparators to look for a subset of Program Counter values, and use the remaining comparators (two in this case) for other filtering purposes that require extra matching conditions based on valid signals, exceptions or other subsets of bits of interest. The first two comparators' mask and value registers would be limited to 50 bits, while the other two comparators would have the same width as the filter input bus.


It is important to note that a filter trigger condition depends on one or multiple comparators, and all comparators can be used in a single filter.

Note that parameterization allows the user to adjust the capabilities of the debug component while having an important impact on area. If we take the example above, adjusting the size of a couple of comparators allows the designer to save around 30% of the flops dedicated to the Logic Analyzer comparators, since the comparator masks and values to be compared require fewer flops to be stored.
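Assuming that each comparator stores one mask register and one value register of its configured width, a back-of-the-envelope calculation reproduces the roughly 30% figure mentioned above:

# Rough flop count for the comparator configuration registers in the example above,
# assuming each comparator stores one mask and one value register of its configured
# width (illustrative back-of-the-envelope estimate, not a synthesis result).

bus_width = 128                                    # filter input port width
full = 4 * 2 * bus_width                           # 4 full-width comparators: 1024 flops
reduced = 2 * 2 * 50 + 2 * 2 * bus_width           # 2 range comparators of 50 bits + 2 full ones: 712 flops
savings_pct = 100 * (full - reduced) / full

print(full, reduced, round(savings_pct, 1))        # -> 1024 712 30.5, roughly the 30% mentioned above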

4.3.1.3 Counters

A wide range of performance metrics can be analyzed using the Logic Analyzer internal counters. Each counter is assigned to a filter such that when the filter condition is met, the counter is incremented or decremented. Since the counter action is based on filters, the user can decide either to select a subset of bits and look for a range of values or to check whether a single bit in the bus is high or low.

In addition, the user can also configure a threshold such that an event or message is generated when the counter reaches a certain value, or make use of the Logic Analyzer timestamp feature to retrieve the counter value every N cycles, where N is a value configurable by the user at runtime.

4.3.1.4 Actions

When a filter condition is met the Logic Analyzer module can perform different actions depending on the user configuration.

Events

Events are messages that are broadcast throughout the whole debug infrastructure by default. Events can be used to enable or disable filters from monitor modules or to apply run control operations such as halt or resume. The user can enable event generation and set the event identifier to be propagated.

Trace

When this action is selected, the filter input bus value is captured and sent through the debug message infrastructure up the hierarchy to the debugger. By default the Logic Analyzer sends the whole bus value, but the user can configure it to select a subset of consecutive bits to be traced in order to reduce the amount of data to be sent, lowering the traffic on the debug message infrastructure and potentially avoiding congestion that can end up causing back pressure and loss of information.

In addition, the trace data buffer size is parameterizable, meaning that the user can specify the size of the buffer that stores the trace values to be sent, reducing the amount of required area.


In case the user sets a trace data width smaller than the filter input bus width, by default the trace data buffer stores the least significant bits of the filter bus. As explained in previous sections, the user can configure the Logic Analyzer at runtime to specify the subset of bits from the filter input bus that must be captured, but in case the subset of bits is wider than the trace data buffer, only the least significant bits of that subset are captured and forwarded.
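The capture rule described above can be sketched as follows (Python, with illustrative names; the subset position and width are the runtime configuration values mentioned above):

# Behavioral sketch of the trace capture rule described above (illustrative names).
# 'subset_lsb' and 'subset_width' are the runtime-configured subset of the filter
# bus; if the subset is wider than the trace data buffer, only its least
# significant bits are kept.

def capture_trace(filter_bus, subset_lsb, subset_width, trace_buffer_width):
    subset = (filter_bus >> subset_lsb) & ((1 << subset_width) - 1)
    if subset_width > trace_buffer_width:            # subset does not fit:
        subset &= (1 << trace_buffer_width) - 1      # keep least significant bits only
    return subset

# Example: capture 16 bits starting at bit 8, with an 8-bit trace data buffer.
print(hex(capture_trace(0xDEADBEEF, 8, 16, 8)))      # -> 0xbe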

4.3.2 Bus Tracker

This component provides the user with monitoring capabilities over transactions that follow a protocol, so that it can perform performance analysis while providing observability and debuggability features. The Bus Tracker allows observability of bus transactions: designers connect the critical request and response signals together with some control signals to allow advanced filtering. Then, based on the comparators configured by the user, the debug component can either generate an event to be propagated through the SoC, gather information from the system, notify the debugger or perform special operations such as increasing or decreasing internal counters. Note that the Bus Tracker is also connected to the SoC debug infrastructure, allowing it to be configured and/or to send data or events.

The Bus Tracker analyzes bus transactions based on the user's configuration to track the requests in-flight. It can be parameterized, allowing each instance in the system to be unique and to monitor buses that use different protocols. Similar to the Logic Analyzer, it follows a two-step process:

1. Reads the input bus values to identify the condition of interest each cycle.

2. Takes action when the condition is met:

a. Generates an event that can be propagated through the whole SoC and enable or disable other debug features. Events are sent the next cycle.

b. Traces information from the transaction and forwards it up the hierarchy. In case trace messages cannot be sent on the next cycle, the data is captured, stored in internal buffers and sent as soon as possible.

c. Gathers performance metrics using internal counters.

In addition, the Bus Tracker also supports timestamps so that generated messages can have a timestamp value attached to their information. On the other hand, the timestamp feature can also be used to generate messages every N cycles, where N is a value that can be configured by the user at runtime.

Note that the Bus Tracker is integrated on the SoC blocks, providing visibility of the requests and responses received and sent. Different options, which focus on adding debug hardware on the non-debug NoC [41][42][43], have been considered in order to give the user visibility of the transactions in-flight and to detect possible issues. Unfortunately, some of those proposals used the compute cores to analyze the transactions in-flight [41], causing interference with the software application being executed on the system, while other approaches [42] required huge amounts of buffering, causing a big impact on the area overhead (up to 24%). There were also more exotic solutions such as using wireless communication [43], but these increased the complexity of the SoC debug infrastructure. As a solution, it has been decided to use simple but powerful monitors integrated on the SoC blocks.


Figure 22. Bus Tracker block diagram

4.3.2.1 Interface

The Bus Tracker is connected to the debug NoC through an I/O interface that is used for configuration and tracing purposes; more details about this interface and the message protocols can be found later in this document.

On the other hand, the filter interface consists of a set of input buses, whose widths are parameterizable, that are used to identify the requests and responses on a given interface. The set of input buses consists of:

● Request valid and ready signals.

● Request unique identifier value.

● Response valid and ready signals.

● Response unique identifier value.

● Control signals.

Control signals can be used by designers to attach fields such as the request address, the request size or other signals that serve filtering purposes, which means that custom protocols are supported. In addition, they can also be used to attach response fields such as the error response or the returned data.

There is also a general purpose output bus (gpio), whose value can also be specified by the user through the external debugger, that can be used to control multiplexers to select different groups of signals if needed. Furthermore, there is an output port that indicates whether the Bus Tracker is enabled, allowing designers to apply clock gating to the debug logic around the debug module when it is disabled, reducing dynamic power.

4.3.2.2 Filtering

In order to identify the condition of interest the Bus Tracker makes use of one or more filters that can be managed by the user at runtime. Each filter is associated with one or multiple comparators, which means that the user can specify multiple conditions that must be met to perform an action.

Each comparator is configured by the user at runtime to perform a given operation looking either at the request or response identifiers, or at a subset of bits within the control signals attached to the Bus Tracker input ports. Note that in the case of comparing against the control signals, the user must configure the subset of bits they are interested in and the expected value.

The comparator hits if the interface input port values comply with the operation and the configured

values and mask. The supported comparator operations are:

● Less than (lt)

● Bigger than (bt)

● Equal (e)

● Not equal (ne)

Bus Tracker parameterization allows the user to define the buffering size so that request information is stored for the transactions in-flight that comply with the advanced filtering configuration. There are two buffers that allow tracking request and response information.

The first stores the request transaction identifiers, which allow the Bus Tracker to match in-flight requests with their responses. The buffer width must be set to the width of the request transaction identifier field, and its depth corresponds to the maximum number of concurrent requests that can be in-flight, so that the user has full visibility of all requests in-flight and no information is lost. In addition, each buffer position has an index identifier that points to the second buffer position where further information about the transaction request and response signals is stored, so that this information can be returned to the user for further analysis if needed.

On the other hand, the second buffer, named the trace buffer, allows the Bus Tracker to store extra request and response information, which lets the user trace the request address, the response reply code and/or other control signal values. The fields stored in each buffer position are configured by the user at runtime. Note that the buffer width must be set by the designer to allow tracking all or a subset of this information, while the buffer depth depends on how much transaction information must be saved for tracing purposes.
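A minimal behavioral model of the two buffers described above is sketched below in Python; the names, data structures and trace buffer allocation policy are illustrative assumptions, not the actual implementation.

# Behavioral sketch of the two Bus Tracker buffers described above (illustrative).
# The identifier buffer keeps the IDs of matching in-flight requests; each entry
# points to a trace buffer position holding the extra request/response fields.

class BusTrackerModel:
    def __init__(self, max_in_flight, trace_depth):
        self.id_buffer = {}                 # transaction id -> trace buffer index
        self.trace_buffer = [None] * trace_depth
        self.max_in_flight = max_in_flight
        self.next_slot = 0

    def on_request(self, txn_id, fields):
        if len(self.id_buffer) >= self.max_in_flight:
            return                          # should not happen if depth == max in-flight
        slot = self.next_slot
        self.trace_buffer[slot] = {"request": fields}
        self.id_buffer[txn_id] = slot
        self.next_slot = (slot + 1) % len(self.trace_buffer)

    def on_response(self, txn_id, fields):
        slot = self.id_buffer.pop(txn_id, None)   # match the response with its request
        if slot is not None:
            self.trace_buffer[slot]["response"] = fields
            return self.trace_buffer[slot]        # ready to be traced/forwarded

bt = BusTrackerModel(max_in_flight=4, trace_depth=4)
bt.on_request(0x3, {"addr": 0x8000_0000})
print(bt.on_response(0x3, {"reply": "OK"}))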

In case the user makes a configuration mistake on the Bus Tracker component and intends to trace more information than fits in each trace buffer position, the trace buffer is filled with a subset of this information, as explained in the next sections.


Moreover, in case the trace buffer is full and new trace information cannot be stored, the Bus Tracker can either send a notification to the debugger, overwrite buffer positions acting as a circular buffer, or discard new messages, depending on the user runtime configuration.

In addition, the user can also configure tracing to be applied in advance and post-process the information on the host software. That is, in case the user has configured a filter to track transactions until a response is received (e.g. looking for a specific response signal value), the transactions in-flight are always traced. Then, if the user wants to capture transaction request signals such as the address but the trace buffer capabilities may not be enough, the Bus Tracker can be configured to forward this information to the debugger so that it does not need to be stored for long periods of time, avoiding possible overflows of the trace buffering and allowing the designer to reduce the trace buffer width. Note that the transaction unique identifier is always captured so that the host software can use it during post-processing.

On the other hand, when a transaction response is received, filtering is applied and, in case the filter configuration is met, the information is captured and sent to the debugger. Note that the response also always captures the transaction unique identifier. Making use of the captured transaction unique identifier values, the host software can discard those trace messages that did not meet the filter conditions for the transaction response signals.

For example, the user can configure the Bus Tracker to check whether an address is in a certain range with a filter with multiple comparators, or to check whether a single bit is high or low with just one comparator assigned to that filter.

4.3.2.3 Counters

A wide range of performance metrics can be analyzed using the counters. Each counter is assigned to a filter such that when there is a hit, the counter is incremented or decremented. The supported performance metrics that can be gathered are:

● Count the number of transactions.

● Duration of transactions: minimum, maximum or average.

○ Tracked from the time the request is sent until a full response is received. In the case of a read request, the full response is achieved when the last beat of data is received.

In addition, the user can also configure a threshold such that an event or message is generated when the counter reaches a certain value, or make use of the Bus Tracker timestamp feature to retrieve the counter value every N cycles, where N is a value configurable by the user at runtime.
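As an illustration, the duration metric described above can be modeled as follows (Python sketch with illustrative names; cycle values would come from the Bus Tracker timestamp logic):

# Illustrative model of the transaction duration metric described above: the
# request timestamp is recorded when the filter hits on a request, and the
# duration statistics are updated when the matching full response arrives.

class DurationCounter:
    def __init__(self):
        self.start_cycle = {}       # transaction id -> cycle when request was seen
        self.count = 0
        self.min = None
        self.max = None
        self.total = 0

    def request_hit(self, txn_id, cycle):
        self.start_cycle[txn_id] = cycle

    def full_response_hit(self, txn_id, cycle):
        start = self.start_cycle.pop(txn_id, None)
        if start is None:
            return
        duration = cycle - start
        self.count += 1
        self.total += duration
        self.min = duration if self.min is None else min(self.min, duration)
        self.max = duration if self.max is None else max(self.max, duration)

    def average(self):
        return self.total / self.count if self.count else 0

c = DurationCounter()
c.request_hit(0x1, cycle=10)
c.full_response_hit(0x1, cycle=42)      # e.g. last beat of read data at cycle 42
print(c.min, c.max, c.average())        # -> 32 32 32.0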

4.3.2.4 Actions

When a filter condition is met the Bus Tracker module can perform different actions depending on

the user configuration.

Events

The user can enable event generation and set the event identifier to be propagated in order to

enable or disable filters from monitor modules, or to apply run control operations such as halt or

resume.


Trace

When this action is selected the user can also specify which of the input ports' data must be captured and forwarded. That is, the user can for example select the request identifier field so that each time the filter condition is met the identifier is returned to the debugger. In the case of selecting the input control signals bus, the user can also specify the subset of bits making use of a mask. By default the Bus Tracker sends the value of all input ports.

In addition, the data buffer size, which captures the trace values to be sent, is parameterizable such that the total trace data buffer size can be smaller than the sum of the widths of all input ports.

4.3.3 Trace Storage Handler

The debug monitor components can generate huge amounts of Trace messages that can saturate external interfaces such as USB or JTAG, which then become the debug system bottleneck because of their low bandwidth. In order to avoid losing information and to increase visibility, the debug infrastructure can be configured to store those Trace messages in one or multiple memories. Note that although trace buffering could be implemented using dedicated memories within the SoC, this is not realistic in terms of cost as it would consume huge amounts of area and impact static power. Due to this, the proposed debug architecture relies on the use of the SoC memory subsystem, commonly composed of DDR or High Bandwidth Memory (HBM) technology.

The routing of the debug traces is performed by the debug Network On-Chip based on the user configuration, which is encapsulated in the Trace message Destination field. Note that in the proposed infrastructure the Trace messages are always sent to the memory subsystem block, which then forwards the messages either to the active debugger or stores them in memory; this simplifies the debug NoC routing algorithm and the integration in multicore systems.

The Trace Storage Handler (TSH) is placed on the SoC block that is connected to the SoC memory

subsystem, and is responsible for applying the user configuration to the received Trace messages

such that Trace messages are stored in memory or forwarded to the active debugger either through

USB, JTAG or PCIe interfaces.

Due to the huge number of trace messages that the memory subsystem can receive, it may not be

possible to store those in memory in case the memory controllers are busy performing requests for

the compute cores. In order to avoid discarding messages and applying backpressure, it is

recommended to integrate a small SRAM that is used to store all received Trace messages before

they are moved to memory when it becomes available.

On the other hand, it is also a key factor to ensure that the region of memory used to store the

Trace messages is not overwritten by the compute core as this would cause data corruption.


Figure 23. Trace Storage Handler block diagram

4.3.3.1 Trace storage configuration

Multicore systems can have one or multiple blocks that contain the SoC memory, that is, there can be several memory subsystem blocks. Due to that, the user must be able to define which memory subsystem is to be targeted. In addition, the user must also be allowed to define a region of memory to be targeted, to ensure that debug traces are not overwritten by the compute cores and vice versa.

Each memory subsystem block instantiates a Trace Storage Handler component which has a set of internal registers that define tuples consisting of a start address and the size of the range. This allows the user to define multiple memory ranges within that memory subsystem to be reserved for debug traces, each one having a unique identifier that can be targeted by the monitor debug components. Those regions of memory act as circular buffers, and the user can configure at runtime whether buffer positions are overwritten when the buffer is full or whether newly arriving debug traces are discarded.

The Trace message routing from the SoC blocks to the memory subsystem requires the user to configure the debug monitor components to set the destination of the message. The destination information is built by setting the memory subsystem block unique identifier on the most significant bits, while the least significant bits are used to specify the memory range identifier within that memory subsystem. That is, as mentioned above, the Trace Storage Handler can be configured to handle multiple memory ranges to store Trace messages; each one has a unique identifier within the SoC block and is seen as a different debug component, such that the SoC block monitor debug components can set a specific memory region as a destination so that Traces can be separated.

Once the debug NoC receives the Trace message from the SoC block debug component, it decodes the Destination field and uses the most significant bits to target the appropriate memory subsystem. Then, when the Trace message arrives at the memory subsystem and is routed to the Trace Storage Handler, the least significant bits of the Destination field are used to target the memory range to be written.
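The Destination field composition described above can be illustrated with the following sketch (Python; the field widths are assumptions of this example, since the actual widths depend on the SoC integration):

# Illustrative encoding/decoding of the Trace message Destination field described
# above. The field widths are assumptions for the example only.

RANGE_ID_BITS = 4     # memory range identifier within a memory subsystem (LSBs)
SUBSYS_ID_BITS = 4    # memory subsystem block identifier (MSBs)

def encode_destination(mem_subsystem_id, mem_range_id):
    return (mem_subsystem_id << RANGE_ID_BITS) | mem_range_id

def decode_destination(destination):
    mem_range_id = destination & ((1 << RANGE_ID_BITS) - 1)
    mem_subsystem_id = destination >> RANGE_ID_BITS
    return mem_subsystem_id, mem_range_id

dest = encode_destination(mem_subsystem_id=2, mem_range_id=5)
print(hex(dest), decode_destination(dest))   # -> 0x25 (2, 5)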

4.3.3.2 Trace extraction operational modes

As we have seen, the debug traces can be stored in memory, which means that this information must be read and returned to the debugger so that it can be analyzed. The user can configure the Trace Storage Handler (TSH) to operate in two different modes to extract the information stored in the memories:

1. Memory is read automatically every N cycles and T traces are extracted, where N and T are values configurable by the user.

a. The debug traces are always sent to the same destination, which is also configurable by the user when performing the initial TSH configuration.

b. Note that the user can specify how many Trace messages are extracted from memory periodically, allowing the configuration to be optimized to avoid bottlenecks on the I/O interfaces.

2. Memory is read upon user requests.

a. The debug traces are sent to the requestor as a response with data.

b. The user can gather multiple trace messages by performing read requests of different sizes.

The Trace Storage Handler has read and write pointers which are updated automatically when traces are read or written, so the user does not need to know which positions of memory contain valid traces or hold the oldest traced values.

On the other hand, when the user requests the oldest data, that data is sent to the requestor as a response. This means that the user can make use of master interfaces such as PCIe to extract the traces from the memories and, given that the read pointers are automatically updated, no further work is needed by the host. Further information about PCIe support can be found later in this specification.
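A behavioral sketch of one memory range handled by the TSH as a circular buffer, including the automatic pointer updates, is shown below (Python, illustrative names; the overwrite-when-full behavior is the runtime option mentioned above):

# Behavioral sketch of the Trace Storage Handler read/write pointers over one
# reserved memory range acting as a circular buffer (illustrative names).

class TraceRegion:
    def __init__(self, depth, overwrite_when_full=True):
        self.mem = [None] * depth
        self.rd = 0
        self.wr = 0
        self.count = 0
        self.overwrite = overwrite_when_full

    def write_trace(self, trace):
        if self.count == len(self.mem):
            if not self.overwrite:
                return False                          # discard newly arriving traces
            self.rd = (self.rd + 1) % len(self.mem)   # drop the oldest one
            self.count -= 1
        self.mem[self.wr] = trace
        self.wr = (self.wr + 1) % len(self.mem)
        self.count += 1
        return True

    def read_traces(self, how_many):
        out = []
        while how_many and self.count:
            out.append(self.mem[self.rd])             # oldest traces are returned first
            self.rd = (self.rd + 1) % len(self.mem)
            self.count -= 1
            how_many -= 1
        return out

r = TraceRegion(depth=4)
for t in range(6):
    r.write_trace(t)
print(r.read_traces(4))                               # -> [2, 3, 4, 5]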

4.3.3.3 Trace extraction through active debugger

The debug infrastructure supports multiple debuggers, which can be controlled by the host through an external interface such as USB, JTAG or PCIe, or can be integrated within the multicore system, such as the master core.

Due to the complexity of trace analysis, the traces are decoded and visualized on the host system, which means that the master core can be used to retrieve information but it is recommended not to use it for the analysis itself.

On the other hand, it is important to note that the PCIe connection with the host allows data transfers to be initiated either from the host, through requests, or from the multicore system, since PCIe can map host memory as system memory. This differs from the USB or JTAG interfaces, which neither follow a request-response protocol nor make use of transaction identifiers. Due to that, the PCIe support for reading debug traces is different from the one used for the USB or JTAG interfaces.

PCIe is the preferred option when the System On-Chip debug infrastructure is generating huge

amounts of debug traces given that it provides higher bandwidth than USB or JTAG interfaces.

When debug traces are gathered through the USB or JTAG interfaces we make use of the

operational modes previously described in this specification, but for PCIe the system requires a

different methodology.

Debug trace extraction through PCIe

The standard PCIe features allow both the host and the device to initiate a transaction as long as

properly configured. Consequently, there are two possible mechanisms to move data from the SoC

memory to the host memory. First, the host can read the SoC memory, which is mapped in the host

system address map. Second, the SoC can map, within its PCIe address space, a region of the host

system memory and use write operations to send data to the host.

The PCIe access to the debug infrastructure requires the system to be configured in the appropriate

security level as is explained later in this specification.

PCIe as trace destination

The debug infrastructure has been designed to support using host memory as the storage endpoint rather than using the SoC memory. To achieve this, the debug infrastructure acts as an initiator and writes the traces through the PCIe system into the host memory. This feature can be used in case the user does not want to modify the code executed on the compute cores so that it avoids a SoC memory region reserved to store debug traces.

The debug traces generated by the monitor debug components are always routed to the memory subsystem. However, the memory subsystem can be configured to forward the debug traces to the PCIe, such that the traces are stored in host memory rather than in the SoC memory. It is important to note that interleaving is still supported; in this case the host must allocate two buffers/memory regions and configure each corresponding memory subsystem to target host memory.

The memory subsystem must be configured by the user before the monitor components are

enabled and start to generate trace messages. The configuration applied is used by the memory

subsystem to handle the received trace messages and route those messages to the PCIe interface.


On the other hand, the host is responsible for allocating one or more buffers and advertising these buffers to the SoC through descriptors that are stored in a FIFO in the SoC memory subsystems. Each descriptor, which corresponds to a host buffer, is a data tuple that contains the PCIe start address and the size of the host memory allocated as a buffer. The SoC memory subsystem targets each of those descriptors and fills the host memory space with trace messages; once the current descriptor is filled, the SoC memory subsystem pops the next one from the FIFO, and this process keeps going until no more descriptors are available in the FIFO. Optionally, the last descriptor can be kept and its contents overwritten depending on the Trace Storage Handler configuration. Otherwise, new trace messages are discarded by the TSH until a new descriptor is added to the FIFO.
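The descriptor consumption flow described above can be modeled with the following sketch (Python, illustrative names; buffer sizes are expressed in trace messages for simplicity):

# Illustrative model of the push-to-host descriptor handling described above.
# Each descriptor is a (pcie_start_address, size) tuple advertised by the host;
# the memory subsystem fills the current buffer and pops the next descriptor
# when it is full.

from collections import deque

class PushToHostModel:
    def __init__(self, keep_last_descriptor=False):
        self.descriptors = deque()            # descriptor FIFO written by the host
        self.current = None                   # [address, size, messages_written]
        self.keep_last = keep_last_descriptor

    def add_descriptor(self, address, size):
        self.descriptors.append((address, size))

    def write_trace(self, trace):
        if self.current is None or self.current[2] == self.current[1]:
            if self.descriptors:
                addr, size = self.descriptors.popleft()
                self.current = [addr, size, 0]
            elif self.keep_last and self.current is not None:
                self.current[2] = 0           # reuse and overwrite the last buffer
            else:
                return None                   # no descriptor left: trace discarded
        addr = self.current[0] + self.current[2]
        self.current[2] += 1
        return addr                           # host address the trace is written to

m = PushToHostModel()
m.add_descriptor(address=0x1000, size=2)
print([m.write_trace(t) for t in range(3)])   # -> [4096, 4097, None]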

The configuration to be applied to the Trace Storage Handler is as follows:

● The TSH is set to the push-to-host mode.

● The host stores descriptors in the TSH descriptor FIFO to specify the address ranges that can be used in host memory.

● The TSH is configured to overwrite, or not, the last available descriptor.

In this case we can see that the SoC memory subsystem uses host memory as a buffer and the software running on the host is responsible for allocating one or more buffers that are large enough, so that, if desired, it has time to move the data from those buffers to a final location.

Note that the host must send a dump request through PCIe, which enables the Trace Storage Handler to start forwarding traces to the host memory. In addition, the SoC memory subsystem must have a master interface with the non-debug NoC that allows it to send requests, as it is writing system memory mapped addresses. It is recommended that the host initializes the reserved memory region to 0s, so that it knows how many Trace messages have been written and are valid based on the number of memory positions within that region that have been filled with data different from 0s.

On the other hand, note also that when using the streaming mode we reuse some of the message fields to carry information gathered from the system, but both the Destination field and the Source Message Type subfield values are maintained to identify the destination and the type of the message so that it can be successfully decoded. This means that the information written in the PCIe host memory space will never be all 0s, even if the Destination field information is removed and the data gathered from the system is 0s, due to the fact that the Source Message Type subfield value is always different from 0 for Trace messages. This allows the user to know how many Trace messages have been gathered from the system and must be post-processed.

In addition, it is important to note that it is expected that the host analyzes the traces only when

the allocated buffers have been filled or when the trace has been disabled. This ensures that there

is no need for the host to manage the SoC TSH read and write pointers for a given memory region.


PCIe as requestor

As has been explained in previous sections, the debug traces can also be stored in the SoC memory

and read through PCIe at any given time. In this case the user can make use of either the debug NoC

or the non-debug NoC to retrieve those trace messages from the SoC memories.

Note that using one NoC or the other has different advantages and drawbacks, as shown below:

1. PCIe read traces through the non-debug NoC

a. Provides higher bandwidth than the debug NoC.

b. Can have collisions with real traffic generated by the compute cores.

2. PCIe read traces through the debug NoC

a. Provides lower bandwidth than the non-debug NoC.

b. Ensures there are no collisions with the traffic generated by the compute cores.

When using the debug NoC to retrieve Trace messages from SoC memory, the PCIe request targets the Debug Transaction Manager virtual register that corresponds to the Trace Storage Handler; that is, the read is captured by the Debug Transaction Manager, which then forwards the read request through the debug NoC. This request is answered by the TSH on the SoC memory subsystem and routed through the debug NoC up to the Debug Transaction Manager, where the PCIe response is built and forwarded. Further details about the Debug Transaction Manager and its PCIe support can be found later in this specification.

Note that it is up to the designer to define how the Debug Transaction Manager virtual registers are configured. On the one hand, the designer can set a specific virtual register for each Trace Storage Handler instance in the SoC so that the user can target this specific address at runtime with no previous configuration required. On the other hand, the designer can make the Debug Transaction Manager virtual registers configurable such that the user first has to set up the virtual register to target a certain SoC block and debug component and then perform as many reads as required.

On the other hand, when using the non-debug NoC approach to retrieve Trace messages from SoC

memory, the PCIe request targets the SoC memory subsystem Trace Storage Handler through the

SoC non-debug memory map. The TSH captures the unique transaction identifier and generates a

response attaching the Trace messages read from the SoC memory. Note that this means that the

Trace Storage Handler must be visible and accessible through the non-debug NoC.

Both options require the user to perform a read request from the host through the PCIe interface

targeting the SoC memory subsystem to initiate the read of debug traces from SoC memory to PCIe.

The Trace Storage Handler uses the PCIe request transaction unique identifier to route the debug

traces as a response. The generated response by the TSH can contain one or multiple traces. The

amount of trace messages returned is based on the read request size.

Following the same pattern as in the previous subsection, the Trace Storage Handler must be configured by the user before the debug monitor components are enabled and start to generate Trace messages. The SoC must reserve a specific address within its memory map to allow the user to access the Trace Storage Handler through PCIe and enable the dump of Trace messages.

Note that interleaving is supported and the host is responsible for merging messages from one or

multiple SoC memory subsystems. On the other hand, there is no need for the host system to

reserve a memory region as traces are returned as responses from the first PCIe request.

4.3.4 Compute core execution analysis

On a multicore system it is essential to have visibility of the work that is done by the compute cores. It is common practice to make use of dedicated monitors to track information from the cores such as the retired instructions, exceptions and other control signals, making use of debug components like the ARM Embedded Trace Macrocell or the UltraSoC Trace Encoder. Unfortunately, on a multicore system the number of compute cores is so big that we would require several of those dedicated debug components, which would cause a huge impact on area and power.

The proposed architecture allows the user to make use of the Logic Analyzer and connect it to the compute cores' pipeline to track retired instructions, exceptions and/or other control signals.

In addition, we also propose adding a special debug Control and Status Register (CSR) to each core so that software developers can use it to monitor the execution of a software application on multiple cores at the same time. This option is faster than using memory to store debug information and is also less intrusive, as it does not inject traffic on the non-debug resources, but it requires adding extra hardware to the system.

Software designers could modify their code slightly by adding a CSR write instruction with a specific value. Then, designers can connect all or a subset of the CSR bits to the Logic Analyzer, allowing them to trace the execution status of one or multiple cores at the same time. The visibility is limited by the Logic Analyzer filter input width and the subset of CSR bits connected from each compute core to the Logic Analyzer.

Note that in case the user wants to monitor all instructions retired by the compute cores, they would have to connect the full Program Counter value of each of those cores to the Logic Analyzer, requiring a wide Logic Analyzer filter input bus. On the other hand, with the proposed CSR the user can connect N CSR bits from each compute core to the Logic Analyzer and keep track of the software application execution stage. For example, making use of the 10 least significant bits of the core CSR, the user could encode up to 1024 different execution stages, so that monitoring 4 cores would take just 40 bits rather than four times the Program Counter width, which is much larger.
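As an illustration, the CSR-based encoding described above can be sketched as follows (Python; the 10-bit field per core follows the example above, while the packing order is an assumption of this sketch):

# Illustrative sketch of the debug CSR encoding described above: each compute
# core exposes the 10 least significant bits of its debug CSR, so four cores fit
# in a 40-bit slice of the Logic Analyzer filter input bus.

CSR_BITS_PER_CORE = 10            # up to 1024 software-defined execution stages

def pack_stage_bus(core_stages):
    """Concatenate each core's execution stage into one filter bus value."""
    bus = 0
    for core_index, stage in enumerate(core_stages):
        assert stage < (1 << CSR_BITS_PER_CORE)
        bus |= stage << (core_index * CSR_BITS_PER_CORE)
    return bus

def unpack_stage(bus, core_index):
    return (bus >> (core_index * CSR_BITS_PER_CORE)) & ((1 << CSR_BITS_PER_CORE) - 1)

bus = pack_stage_bus([3, 1023, 0, 42])              # stages reported by 4 cores -> 40 bits
print(unpack_stage(bus, 1), unpack_stage(bus, 3))   # -> 1023 42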

As mentioned above, another feasible option is to modify the software such that the core writes debug information to memory, which is later read through the debug infrastructure, but this option is slower, more intrusive and requires reserving memory space.


4.3.5 Multicore system workload balance

On multicore systems it is fundamental to allow the master core to manage and distribute the work among the compute cores to achieve better performance. The proposed design allows the master core to make use of the debug infrastructure to gather compute core performance metrics and use them to balance the workload.

It is standard that the SoC blocks that contain the compute cores have one or multiple memory

mapped registers that are used as counters so that the software can be programmed to increase the

register value each time a given phase has started or finished. It is highly recommended that in case

there are multiple SoC blocks with several compute cores, the most significant bits of the register

are used to specify a unique identifier that can be used to differentiate between those blocks when

data is retrieved.

Then, based on features previously mentioned in this specification, the debug NoC master interfaces (e.g. AMBA AXI4 or AMBA APB3) can be configured to perform polling and compare operations targeting those performance registers. This configuration is applied such that the polling comparator mask selects the least significant bits, which correspond to the counter value, and masks out the SoC block identifier.

The comparator is then configured to match under a certain threshold value and to perform a write

request to a known memory address with the full register value. In addition, the comparator is also

configured to generate an event that is captured by the master core so that it is aware that memory

has been written with an updated counter value.

Once the master core has received the debug event, it performs a read request to the known memory address to extract the updated counter value. Finally, the master core can use this retrieved data to balance the workload if needed. Note that when configuring multiple comparators to monitor multiple SoC blocks, it is recommended to use a unique event identifier and memory address for each SoC block, so that the master core knows which SoC block counter value has been updated and the specific memory address where this value has been written. Alternatively, thanks to its flexibility, the debug configuration could make use of a single event identifier and a counter, so that when it reaches a threshold the master core reads the corresponding memory region to gather the information.
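The polling & compare flow described above can be summarized with the following behavioral sketch (Python; the register layout, widths, addresses and event-to-mailbox mapping are assumptions of this example):

# Illustrative model of the polling & compare flow described above. The register
# layout (block identifier in the MSBs, counter in the LSBs) and widths are
# assumptions for the example.

COUNTER_BITS = 24
COUNTER_MASK = (1 << COUNTER_BITS) - 1

memory = {}                                   # known addresses written by the debug NoC
pending_events = []                           # events captured by the interrupt controller

def poll_and_compare(reg_value, threshold, event_id, mailbox_addr):
    counter = reg_value & COUNTER_MASK        # mask out the SoC block identifier
    if counter < threshold:                   # comparator condition
        memory[mailbox_addr] = reg_value      # write the full register value to memory
        pending_events.append(event_id)       # notify the master core

def master_core_service():
    while pending_events:
        event_id = pending_events.pop(0)
        # hypothetical event-to-address mapping: one mailbox per SoC block
        value = memory[0x9000_0000 + event_id]
        block_id, counter = value >> COUNTER_BITS, value & COUNTER_MASK
        print(f"block {block_id}: counter {counter} -> rebalance if needed")

# SoC block 5 reports a counter value of 7, below a threshold of 16.
poll_and_compare((5 << COUNTER_BITS) | 7, threshold=16, event_id=0, mailbox_addr=0x9000_0000)
master_core_service()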

The master core can receive debug events through an interrupt controller component that is integrated in the SoC and receives Event messages like the rest of the SoC debug components, allowing the debug infrastructure to generate interrupts that are later captured by the master core. In addition, the interrupt controller contains a set of registers which allow it to store the outstanding events so that no events are lost.

It is highly recommended not to enable other monitors to trace or generate events, so that the traces in the memory subsystem and the events in-flight only correspond to the polling & compare configuration on the compute core SoC blocks.


Note that the polling & compare configuration must ensure that traces are always stored on the

same memory subsystem such that the master core gathers data from only one location no matter

which compute core SoC block generated the trace. That is, both interleaving and streaming debug

NoC features must be disabled.

4.3.6 Network On-Chip

As introduced before in this document, nowadays most of the Systems on-Chip rely on the use of

NoCs rather than point-to-point connections which do not scale well, are not reusable and can

become a bottleneck.

The proposed architecture allows the user to integrate both the Logic Analyzer and the Bus Tracker

on the non-debug Network on-Chip allowing advanced filtering and information gathering from the

system. Note that due to the fact that the Bus Tracker does not follow a specific standard protocol,

it can be attached to the NoC routers even if they are designed to follow custom protocols.

On the other hand, the user can also expose the non-debug NoC routers' performance counters, either by filtering them through the Logic Analyzer or by accessing them through the master interfaces exposed by the debug NoC.

In addition, the user can also inject requests through the different leader interfaces in the system, which can be either JTAG/USB or the master core, to test different routings. Note that the Design For Testability (DFT) infrastructure connects all the flip-flops in the system in one or multiple scan chains, allowing further testing.

Other solutions have been considered [41], such as periodically taking snapshots of the network, storing the data in the core's cache memory and analyzing it using software algorithms, but they have been discarded as they change the SoC functionality and impact the way cores behave in the system.

4.3.7 Hardware assertions

In the last sections we have focused on providing monitoring features that allow the user to perform filtering and gather information from the system, but assuming that the monitor modules were running fast enough to capture the changes in all signals.

This may not happen when running on silicon, as there may be bits that flip too fast to be detected by the monitors but that cause wrong behaviors on the hardware. There are two error scenarios, known as silent errors and masked errors. The former relates to errors that propagate far enough to be observed but are missed due to insufficient checking, while the latter relates to errors that are masked out before they can be observed. Detecting these types of errors is critical, so we need to ensure that error latency does not become a problem.


It is important to note that bit-flips take place depending on the logic state of the circuit, the

workload and electrical state such as voltage supplies or possible drops. That is, they do not keep a

constant value that can be observed or produce obvious hangs in the system. Due to that, it is

recommended to add further visibility on items that cannot be tested in simulation such as voltage

crossings, Phase-locked loops (PLL) stability or regulators.

During the pre-silicon verification stage, assertions allow the user to ensure that one or several signals comply with a specific functionality and follow temporal and regular expressions. Those assertions are added as part of the source code and can be simulated, allowing functional errors to be detected. The most common languages for writing assertions are the Property Specification Language (PSL) and SystemVerilog Assertions (SVA).

The proposed architecture recommends the user to make use of assertions during post-silicon

validation in order to detect bit-flips. New techniques [44][45] have been introduced allowing not

only to map pre-silicon assertions to post-silicon hardware [44] but also to automate this effort and

introduce nonobvious assertions that support multiple time cycles and different modules

automatically [45].

Note that bit-flips are caused by problems in the design netlist rather than by functional errors, which means that the assertions to be placed must cover items that can be affected during manufacturing and cannot be detected making use of monitors such as the Logic Analyzer or the Bus Tracker. In addition, due to the fact that during the pre-silicon phase designers may have added a huge number of assertions, it is necessary to select those that must be mapped in order not to have a huge impact on area.

4.4 Debug modules messaging infrastructure

4.4.1 Interface

The debug components are connected to the debug NoC without intermediate message routing components such as the UltraSoC Message Engine. This means that the interface between the debug components and the debug message infrastructure is based on the protocol shown earlier in this document.

Multicore systems generally make use of different voltage regions and due to that the messaging

protocol must support voltage crossings, queues and other logic structures that require

synchronization between the leader and the follower placed at different voltages or running at

different frequencies. The proposed debug protocol, which is based on handshaking, supports

both voltage and clock crossing.

Note that the handshaking protocol consists of valid, ready and data signals. The leader sends a message by placing the information on the Data field and validating it by asserting the valid signal. The follower uses the ready signal to acknowledge to the leader that it can capture the information. In case the follower is not ready, the leader keeps the valid signal asserted until the follower asserts the ready signal.
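A cycle-level sketch of this handshake is shown below (Python, illustrative; it only models the valid/ready semantics, not the voltage or clock crossing logic):

# Cycle-level sketch of the valid/ready handshake described above (illustrative).
# The leader keeps valid asserted (with stable data) until the follower asserts
# ready; the transfer happens on the cycle where both are high.

def run_handshake(leader_messages, follower_ready_per_cycle):
    delivered, pending = [], list(leader_messages)
    for cycle, ready in enumerate(follower_ready_per_cycle):
        valid = bool(pending)                 # leader drives valid while it has data
        if valid and ready:                   # handshake: capture on valid & ready
            delivered.append((cycle, pending.pop(0)))
    return delivered

# The follower is not ready on cycles 0-1, so the first message waits until cycle 2.
print(run_handshake(["msgA", "msgB"], [False, False, True, True]))
# -> [(2, 'msgA'), (3, 'msgB')]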

On the other hand, it is the user's responsibility to implement voltage and/or clock crossing logic between the debug NoC and the debug components if required. Below we show how the user can implement voltage crossing for a handshaking protocol such as AMBA AXI4 between a leader and a follower running at different voltages.

Figure 24. Voltage crossing implementation for handshaking protocol

On the other hand, note that protocols such as AMBA APB3 require more complex voltage crossing support due to the fact that there are multiple signals used to start a transaction. Below we can see an example of the recommended voltage crossing support for a transaction between an APB leader and follower:


Figure 25. Voltage crossing implementation for AMBA APB3 protocol

4.4.2 Compression

Reducing the amount of information that is sent between the active debugger and the SoC debug

components is essential in order to reduce the probability of losing data due to bottlenecks.

In previous sections we have seen that the debug NoC implements an interface and protocol that are lossless; that is, once a package has been injected into the NoC it always reaches its destination and, if needed, is stored in a buffer. Given that backpressure is propagated up to the debug components, visibility can be lost for anywhere from a few cycles to hundreds of cycles.

It is therefore important to reduce the amount of information in-flight on the debug infrastructure, and because of that we now focus on reducing the amount of information that debug components send and receive through the debug NoC Data and Source fields, given that the Destination field already contains the minimum possible amount of information and is required by the debug NoC routers to forward the information along the correct path.

Neither the Source nor the Data field is used by the debug NoC for routing, so this information can be compressed and decompressed on the SoC blocks, reducing the amount of information sent. The proposed method focuses on Trace messages, given that Configuration and Event messages do not stress the debug messaging infrastructure and are not generated every cycle, while Trace messages can stress the debug NoC when the interleaving or streaming features are enabled. Due to that, the Source Message Type information is not compressed; this way the destination debug component decodes only the Trace messages.

There are several research articles [46][47][48] that discuss compression algorithms for memories

and caches, so in this section we take some of their main ideas and adapt them to the proposed

debug architecture with the goal of building a more efficient messaging infrastructure.

First, decompression and compression methods must be extremely fast so that the debug

component can send information every cycle and the compression/decompression method does

not become a blocker.

Second, the added hardware area overhead must be small compared to the alternative of increasing the number of wires in the debug NoC to provide more bandwidth, which would also impact power consumption.

Third, the algorithm should ensure that no information is lost when packages are compressed.

In our case the data obtained from the system does not follow a clear pattern, since it belongs to control signals whose values change depending on many factors, but some of the bits may not change for several cycles as they can belong to finite state machines or logic that does not toggle often.

4.4.2.1 XOR and RLE compression

The XOR technique itself does not compress the information and requires a very small amount of time to compute the value to be sent, which ensures that latency constraints are met and compression does not become a problem.

The idea is to apply an XOR operation between the data that was sent in the previous package and the data to be sent in the current one. This operation is easy to invert both on the host and within a debug component, and ensures that the parts of the package that have not changed become zeros. In other words, using the XOR operation we maximize the number of consecutive zeros for a better use of the Run Length Encoding (RLE) compression algorithm.

If not many bits have changed, the data to be transmitted contains several 0s. Then, we can make use of the Run Length Encoding (RLE) compression algorithm [49], which allows us to compress data when a number is repeated several times by specifying the number of repetitions and the number itself. For example, a data package whose data is "0000aadd000" would be translated to "402a2d30".
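A minimal sketch of the XOR plus RLE scheme, operating on hex-digit strings for readability, is shown below (Python; encoding the run length as a single hex digit, and therefore splitting runs at 15 repetitions, is an assumption of this sketch rather than of the proposed design):

# Minimal sketch of the XOR + RLE scheme described above, operating on hex-digit
# strings for readability.

def xor_delta(prev_hex, curr_hex):
    """XOR the previous and current Data payloads so unchanged bits become 0."""
    return format(int(prev_hex, 16) ^ int(curr_hex, 16), f"0{len(curr_hex)}x")

def rle_encode(hex_string):
    out, i = [], 0
    while i < len(hex_string):
        run = 1
        while i + run < len(hex_string) and hex_string[i + run] == hex_string[i] and run < 15:
            run += 1
        out.append(format(run, "x") + hex_string[i])   # <repetitions><digit>
        i += run
    return "".join(out)

print(rle_encode("0000aadd000"))                              # -> "402a2d30", as in the example above
print(rle_encode(xor_delta("0000aadd000", "0000aadd0f0")))    # only one nibble changed -> "901f10"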

Once the XOR is applied, other compression algorithms can also be used, such as LZ77 [50], which encodes data making use of tables that map message patterns to identifiers and change dynamically based on the number of pattern appearances. That said, this adds complexity and requires extra logic on the debug infrastructure.

Nevertheless, if the Run Length Encoding compression algorithm is used when the data captured by the debug monitor components changes often, the amount of information to be sent can increase and become bigger than its original size. Due to this, the RLE compression algorithm is an optional feature that can be enabled and disabled by the user at runtime depending on the resource being monitored, taking into account whether enough bits of the captured information remain unchanged to make RLE compression worthwhile. Then, given that all Trace messages are post-processed on the host, decompression is applied there if required.

4.5 Access to the debug infrastructure

The proposed debug infrastructure can be managed by multiple master interfaces, as has been explained in previous sections. In order to ensure that each master does not require a different protocol, which would increase the complexity of the infrastructure, it is important to define a messaging infrastructure that can be used by multiple protocols to access all the SoC debug features.

4.5.1 Debug address map

Each master connected to the debug infrastructure has its own interface that is based on different

protocols such as JTAG, USB, PCIe or AMBA AXI4. In order to develop an infrastructure that is

accessible from all those interfaces and to maintain a common implementation for coherency, it is

necessary to manage the same debug protocol and rules.

The debug resources are all exposed using the same debug address map, which allows all master interfaces to refer to a debug feature using the same address. Then, there is the need for a component that is able to translate from the master protocol to the debug protocol, which is the Debug Transaction Manager (DTM). This component connects the master interfaces with the debug NoC, allowing the active debugger to access all the SoC resources.

For the master core and PCIe to access the debug SoC infrastructure, the request address is used to

target the Debug Transaction Manager component that will handle the connection. Note that the

DTM has an address range in the system memory map that allows the master core and PCIe to

access this component. Further information on how this is managed can be found later in this

document.

Once the request has been forwarded within the debug NoC, the implemented memory map is composed of the SoC block identifier on the most significant bits and the debug component identifier on the least significant bits. Then, the configuration to be applied to a given debug component is attached to the message Data field.
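The address composition described above can be illustrated as follows (Python sketch; the field widths are assumptions of this example):

# Illustrative composition of the debug memory map address described above:
# SoC block identifier in the most significant bits, debug component identifier
# in the least significant bits. The field widths are assumptions of this sketch.

COMPONENT_ID_BITS = 8

def debug_address(soc_block_id, debug_component_id):
    return (soc_block_id << COMPONENT_ID_BITS) | debug_component_id

def split_debug_address(address):
    return address >> COMPONENT_ID_BITS, address & ((1 << COMPONENT_ID_BITS) - 1)

addr = debug_address(soc_block_id=0x12, debug_component_id=0x03)
print(hex(addr), split_debug_address(addr))    # -> 0x1203 (18, 3)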

4.5.2 Debug Transaction Manager (DTM)

The Debug Transaction Manager handles the requests and responses between the master interfaces and the debug NoC. On a system with multiple master interfaces there is an instance of the DTM in up to four different locations:

● Between the JTAG external interface and the debug NoC,

● Between the USB external interface and the debug NoC,

● Between the master core interface and the debug NoC, and

● Between the PCIe interface and the debug NoC.

4.5.2.1 Communication with Debug NoC

The communication between the Debug Transaction Manager and the debug NoC is done through an interface consisting of ten fields: five debug NoC input ports that allow the DTM to send requests to the debug SoC resources, and five debug NoC output ports that allow the debug SoC resources to respond to those requests and to send the data gathered from the system execution.

As has been explained in previous sections, those five input and output ports are grouped as the Destination, Source and Data fields together with a valid and ready signal, and the information encoded in each of those fields allows the message to be routed and specifies the operation to be performed and the source of the request in case a response is required.

In addition, note that the debug monitor components also follow this protocol, such that they encode the messages in the same way as the DTM does for the USB, JTAG, master core and PCIe support. Further information about how the DTM translates to the debug NoC protocol can be found in the next subsections.

4.5.2.2 USB and JTAG support

The USB and JTAG interfaces do not follow protocols where the message size information is part of the package, given that both interfaces send messages as a stream of data.

Because the Debug Transaction Manager needs to know how many bytes are received to perform successful decoding and the JTAG/USB interfaces do not have a field that specifies this information, the first subset of bits of the streamed data contains the size of the message, the destination SoC block and debug component, the message type and the transaction identifier. Then, the rest of the streamed information carries the Data field. The DTM makes use of the specified size to concatenate the rest of the stream packages and forward them through the debug NoC targeting the destination specified in the first package.

Note that the only Source field information required to be sent from the host is the Message Type

and the Transaction ID. The former identifies Configuration requests, Trace messages or Event

messages, while the latter allows the software to generate multiple requests and match the


corresponding responses. That is, the source SoC block identifier and the SoC debug component

identifier, which are used in case the target debug component has to generate a response, are

concatenated by the Debug Transaction Manager before the transaction is sent through the debug

NoC.

The first subset of JTAG/USB streaming data looks as shown below:

[Bit layout omitted: from least to most significant bits, the request header packs the payload length, the SoC block debug component ID and SoC block ID (Destination field), and the message type and transaction ID (Source field).]

Figure 26. First JTAG/USB streaming data package encoding for requests
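The following Python sketch shows how such a header could be packed least-significant-field first; the field order follows Figure 26, but the bit widths are illustrative assumptions rather than the actual design parameters.

    def pack_request_header(payload_len, component_id, block_id, msg_type, txn_id,
                            widths=(8, 8, 8, 2, 4)):
        # Pack the first JTAG/USB streaming package of a request. Field order
        # (LSB first) follows Figure 26: payload length, SoC block debug
        # component ID, SoC block ID, message type, transaction ID. The widths
        # given here are assumptions, not the real design parameters.
        fields = (payload_len, component_id, block_id, msg_type, txn_id)
        header, shift = 0, 0
        for value, width in zip(fields, widths):
            assert value < (1 << width), "field value does not fit its width"
            header |= value << shift
            shift += width
        return header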

The rest of the data stream is used to encode the Data field depending on the message type, as shown below:

● Configuration request message:

○ Encodes the endpoint offset address to be accessed, the operation type and the

size together with the data to be written if needed.

Figure 27. Configuration request message encoding from JTAG/USB

● Event message:

○ Encodes the event identifier.

Figure 28. Event message encoding from JTAG/USB

On the other hand, the Configuration responses or Trace messages, which are sent from the SoC to

the host, are encoded in a similar way such that the first subset of bits contains the size of the

message, the source SoC block and debug component that is sending the message, and the

message type together with the transaction identifier. Then, the rest of the data is filled with

information about the Data field.

Note that the DTM discards the Destination field information before forwarding this message to the

JTAG or USB interface.

The first subset of JTAG/USB streaming data looks as shown below:

[Bit layout omitted: from least to most significant bits, the response header packs the payload length, the message type, the transaction ID, and the source SoC block debug component ID and SoC block ID (Source field).]

Figure 29. First JTAG/USB streaming data package encoding for responses


The rest of the data stream is used to encode the Data field depending on the message type, as shown below:

● Configuration response message:

○ Encodes the data returned from the SoC block debug components.

[Bit layout omitted: a reply code on the least significant bits followed by the raw read data.]

Figure 30. Configuration response message encoding to JTAG/USB

● Trace message:

○ Encodes the gathered data returned from the SoC block debug components. In

addition, depending on the user configuration, it also encodes the timestamp value as part of the raw data on its least significant bits.

[Bit layout omitted: Trace non-streaming messages place an optional timestamp on the least significant bits followed by the raw trace data.]

Figure 31. Trace non-streaming message encoding to JTAG/USB

[Bit layout omitted: Trace streaming messages carry raw trace data only.]

Figure 32. Trace streaming message encoding to JTAG/USB

4.5.2.3 Master core and PCIe support

The master core and the PCIe interfaces with the system may make use of protocols that generate a single package for each request with variable size. For example, AMBA AXI4 allows generating messages of different sizes, which means that data may have to be split into multiple debug NoC packages.

The Debug Transaction Manager receives requests from the master core and PCIe and based on the

destination address and operation type (opcode) information, which is encoded in the Data field, it

translates the request to the debug NoC protocol and forwards it to the endpoint through the

debug NoC. Further information about how messages are encoded can be seen later in this

document.

In addition, the Debug Transaction Manager also takes care of concatenating all the response

packages that are returned from the requests generated by the master core and PCIe, and sends a single response to the requestor following the corresponding protocol, for example, AMBA AXI4.


Note that the Debug Transaction Manager makes use of the dedicated wire in Configuration response messages that marks the last package of a given message in order to concatenate the returned value successfully. This is required because Configuration response messages do not carry size information, while the DTM needs to know how many bytes of data must be concatenated for read requests.

4.5.3 Read operations to SoC resources

The debug infrastructure allows access to a collection of SoC resources comprising block registers, memories and debug component internal registers, which are used to apply the different debug feature configurations. Both read and write requests performed through the debug infrastructure must encode the full endpoint target address, which grants access to all the SoC memory-mapped resources: not only the destination block and debug component must be encoded but also the address of the memory or register to be accessed.

4.5.3.1 JTAG/USB

As has been mentioned in previous sections, the JTAG and USB protocols are based on data

streaming so that the user can send multiple data packages using the Data fields to encode the

target resource address. Because of that, the read operation requires just one request when

performed through the JTAG and USB interfaces.

The external debugger performs a read request whose address targets the destination block and

debug component and, within the message Data field, it encodes a tuple that contains the

register/memory target address, the operation type (opcode) and the requested read size.

As commented in previous sections, the Debug Transaction Manager captures the stream of data

sent by the host via USB and JTAG interfaces and builds a request by concatenating those packages

together with the Source SoC block and debug components identifiers. Then, once the message has

been built, it is forwarded through the debug NoC.

The Debug Transaction Manager does not need to perform special operations for the response

messages as they are forwarded to the USB/JTAG interfaces and handled by the external debugger

software. The maximum read size is limited by the resources in the system as data streaming allows

sending as much information as requested.

4.5.3.2 Master core and PCIe

The access to system resources through debug modules from the master core and PCIe is based on

read and write requests performed through their standard interfaces with the rest of the system.

As has been explained in the Debug Network On-Chip message types section, the Configuration

messages are used to perform read and write requests to access SoC block registers, memories and

debug components internal registers. The register/memory target address is encoded on the

Configuration messages Data field given that the master core/PCIe protocol request address field is


used to target the Debug Transaction Manager, the destination SoC block and the debug component

that performs the read operation.

The register/memory address is encoded as part of the Data field because, on multicore systems, the debug memory map can require a large number of bits. Encoding it in the debug protocol Destination field would make that field take a large chunk of the payload, which would not be used by Trace messages nor Event messages, as those only need to know the destination SoC block and debug component.

In case the user wants to perform a read to an SoC resource from the master core or PCIe through a

debug component, it is always necessary to send first a write request which specifies the target

register/memory address on the request Data field. This is needed because the master core and

PCIe read request interfaces may not have a data bus that can be used to encode the target

register/memory address. That is, the Debug Transaction Manager receives two requests, first a

write specifying the target resource and then the read request trigger, and generates a read request

through the debug NoC following the debug NoC protocol. The sequence can be observed in the

diagram attached below.

Figure 33. Master core/PCIe read sequence diagram
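A minimal host-side sketch of this two-step sequence is shown below, assuming a hypothetical memory-mapped bus driver with write() and read() primitives; the descriptor packing, opcode value and register layout are illustrative, not Esperanto's actual memory map.

    READ_OPCODE = 0x1  # illustrative opcode value

    def encode_read_descriptor(block_id, component_id, target_addr):
        # Illustrative packing: component ID and block ID on the least
        # significant bits, then the register/memory address, then the opcode
        # (field order inspired by Figure 34; widths are assumptions).
        return (component_id | (block_id << 8)
                | ((target_addr & 0xFFFFFFFF) << 16) | (READ_OPCODE << 48))

    def debug_read(bus, dtm_virtual_reg, block_id, component_id, target_addr, size):
        # Step 1: write the read descriptor into a DTM virtual register; the DTM
        # latches the target address and opcode and returns a write acknowledgement.
        bus.write(dtm_virtual_reg, encode_read_descriptor(block_id, component_id, target_addr))
        # Step 2: read the same virtual register with the desired size; the DTM
        # issues the debug NoC read and returns the concatenated response data.
        return bus.read(dtm_virtual_reg, size)

In a real setup the virtual register address, the field widths and the opcodes would come from the SoC debug memory map.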

The Debug Transaction Manager (DTM) is accessed by the master core/PCIe through read and write

requests as there is a range reserved on the system memory map to target this resource. Then, the

reserved region is seen as a set of virtual registers that allow the master core/PCIe to have multiple

requests in-flight with different transaction identifiers and targeting different or the same SoC block

debug component.

The first request is a write whose address targets the Debug Transaction Manager. Then the

destination SoC block and debug component that will perform the read (e.g. the debug NoC APB

master interface attached to a given SoC block) are encoded on the Data field. That is, the debug

NoC Destination field information is encoded on the master core/PCIe request data field. Then, the

write request Data field of this first request is also filled with a tuple that contains the


register/memory target address and the operation opcode of the upcoming request, which is a read in this case, as shown below:

[Bit layout omitted: from least to most significant bits, the SoC block debug component ID, the SoC block ID, the register/memory address and the read opcode; the remaining bits are zero.]

Figure 34. Read operation encoding for Master core/PCIe

The write request is captured by the Debug Transaction Manager which, looking at the operation type, sees that the master core/PCIe wants to perform a read operation, so it stores the target address and the operation type and waits for the second request, which is the read. Depending on the write opcode, the DTM responds to this first request with a write acknowledgement.

Once the master core or PCIe receives the write response confirmation, it performs a read request

targeting the same address, that is, the same DTM virtual register, with the size field set to the

desired read size.

This second request is then captured by the Debug Transaction Manager which, taking into account the information stored from the previous write request, generates a read request that is sent through the debug NoC targeting the register or memory endpoint. Note that the DTM encodes the debug NoC Source field information based on the requestor and the SoC block such that the response from the endpoint can be routed back.

The read response is always sent by the target endpoint and once the response data is received by

the DTM, it concatenates the multiple response packages and builds the response based on the

master core or PCIe protocols.

4.5.4 Write operations to SoC resources

In the previous sections we have introduced the necessity of having the target endpoint address

encoded as part of the Data field due to the fact that the memory space of multicore systems is usually large and requires many bits to target the SoC resources.

4.5.4.1 JTAG/USB

The write operations through the JTAG and USB interfaces follow the same pattern as the read

operations such that the external debugger performs a single request whose first package encodes

the destination SoC block and debug component and a tuple that contains the register/memory

target address, the operation type for the destination component and the requested size. Then, the

rest of the streaming packages contain the data to be written.

The Debug Transaction Manager captures the stream of data sent through the USB or JTAG

interfaces and by concatenating those packages it builds a request that later on is sent through the

debug NoC.


4.5.4.2 Master core and PCIe

The behavior of the master core and PCIe with the Debug Transaction Manager depends on the

desired write size, given that the request information (e.g. the target register/memory address) must be part of the Data field, which means that the request information together with the data to be written

can exceed the master core/PCIe interface data width.

Write requests that do not exceed standard interface data width

In case the user wants to perform a write request whose total size, taking into account the request information and the data to be written, is less than or equal to the master core/PCIe interface data width, only one write request is needed. This write request encodes the request information in the Data field as a tuple placed on the least significant bits of the interface data bus, containing the destination SoC block and debug component that will perform the operation, the register/memory target address, the operation type for the destination debug component and the requested write size. Then, the rest of the bus is filled with the data to be written on the target endpoint, as can be seen below:

[Bit layout omitted: from least to most significant bits, the SoC block debug component ID, the SoC block ID, the register/memory address, the write opcode and the write size, followed by the raw data to be written.]

Figure 35. Single-write operation encoding for Master core/PCIe

The Debug Transaction Manager captures this write request and looking at the operation type and

the requested size, it translates the request to the debug NoC protocol and sends it to the

destination target. Note that the DTM encodes the debug NoC Source field information by taking

into account the requestor and the SoC block such that, in case an acknowledgement

has been requested, the write confirmation response from the endpoint can be routed back.

Figure 36. Master core/PCIe single-write sequence diagram

Write requests that exceed standard interface data width

In case the user wants to perform a write request whose total size is bigger than the master

core/PCIe data interface width, then two write requests are required.


The first write request encodes in the Data field a tuple that contains the destination SoC block and

debug component that will perform the operation, the register/memory target address, the

operation type for the destination debug component and the requested write size the same way as

for read requests.

[Bit layout omitted: from least to most significant bits, the SoC block debug component ID, the SoC block ID, the register/memory address, the write opcode and the write size; the remaining bits are zero.]

Figure 37. Write operation encoding for Master core/PCIe

Note that the desired write size is required such that the DTM can know if the data to be written

has been attached on the most significant bits of the Data field, or if a second write request with

the raw data is required.

On the other hand, the Data field of the second write request, which targets the same address and specifies the desired write data size, is filled with the data to be written on the target resource.

Figure 38. Write operation data encoding for Master core/PCIe

The Debug Transaction Manager captures the first request and looking at the operation type and

the requested size value, which is part of the tuple within the Data field, it stores the SoC block

target resource address and the requested size, and sends a write acknowledgement, if required,

back to the requestor, which is the master core or PCIe. Then, once the second write request to the

same destination SoC block and debug component is received, the DTM translates the request to

the debug NoC protocol and sends it to the destination target.

Note that the DTM encodes the debug NoC Source field information by taking into account the requestor and the SoC block such that, in case an acknowledgement has been requested, the

write confirmation response from the endpoint can be routed back.


Figure 39. Master core/PCIe write sequence diagram
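The write handling can be summarized by the hedged Python sketch below, which decides between the single-write and double-write sequences of Figures 36 and 39; the bus driver, descriptor packing and bit widths are illustrative assumptions.

    def debug_write(bus, dtm_virtual_reg, descriptor, data, data_bits,
                    bus_width_bits=512, descriptor_bits=64):
        # 'descriptor' is assumed to already pack the destination, target
        # address, write opcode and size on the least significant bits.
        if descriptor_bits + data_bits <= bus_width_bits:
            # Request information and write data fit in one interface beat:
            # issue a single write with the data placed above the descriptor.
            bus.write(dtm_virtual_reg, descriptor | (data << descriptor_bits))
        else:
            # Otherwise the DTM expects two writes: first the descriptor (it is
            # stored and acknowledged), then a second write with the raw data.
            bus.write(dtm_virtual_reg, descriptor)
            bus.write(dtm_virtual_reg, data)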

4.5.5 Security

The post-silicon validation debug infrastructure allows detecting design flaws that range from functional errors that were not caught during pre-silicon validation to manufacturing defects that were not detected using automatic test equipment (ATE). Because the proposed post-silicon validation architecture can be integrated in any SoC and can end up with any company or user, running either general purpose software or very specific high-performance applications, security and privacy of the system cannot be taken for granted.

On one hand, the post-silicon verification engineering team would like to have as much visibility of

the system's behavior as possible while the security experts would like to reduce visibility or access

to the minimum in order to ensure that the system is not accessed by malicious users.

Systems on-Chip security can be implemented in multiple ways. Some SoC designs make use of a specific module known as the Root of Trust (RoT), in which the user has to complete a challenge to enable or disable access to the debug capabilities, while others make use of efuses, such that there is a physical pin on the SoC that enables or disables debug access; that is, during post-silicon validation the full debug features are open and, for production mode, the efuse is blown at manufacturing, disabling the access.

On the proposed architecture there are several lead debuggers that can access the system making

use of external interfaces such as JTAG, USB or PCIe. That said, all those interfaces are connected to

the Debug Transaction Manager (DTM) which acts as the interconnection between the debugger

and the debug NoC providing the user access to all SoC debug resources. It is recommended that

the debug NoC routing is by default configured to redirect messages to the master interface


attached to the Root of Trust so that no other debug components can be accessed until the security

challenge has been validated.

Note that Systems on-Chip can have other critical security holes such as Design for Testability (DFT)

hardware due to the fact that all the flip-flops in the system are connected to one or multiple scan

chains and the scan input and output pins are accessible by the user. Extensive research has been done on DFT security issues that make systems vulnerable [51]. For example, one kind of attack consists

of introducing different patterns within the system and utilizing the difference between the

generated test output vectors to discover how the system behaves. In this thesis we do not focus on

DFT and assume that this security concern has already been addressed.

4.5.6 JTAG integration

As has been explained in previous sections, the JTAG protocol is a well-defined standard that is easy to use and integrate on systems. The proposed architecture integrates a JTAG interface on the system that allows users to perform post-silicon validation, as it does not require a complex configuration and is ready to use once the system is powered up. On the other hand, it is important to note that, due to its low bandwidth, JTAG does not allow extracting large amounts of information, which can become a problem on multicore systems; because of that, the proposed architecture also integrates support for high-bandwidth interfaces such as USB and PCIe.

The software connection with the JTAG SoC interface is usually performed through J-Link debug probes such as those from Segger [52]. The J-Link debug probes are the most popular choice in the semiconductor field for debugging and for flashing new firmware into systems. In addition, J-Link is supported by almost all development tools, from open-source ones such as GDB to commercial integrated development environments (IDE).

One of the most important choices when integrating a JTAG interface on a system is to decide the

size of the Test Data Register (TDR), which is used to load data from the host to the SoC and also to

retrieve data from the SoC to the host.

We must note that accessing the JTAG TDR, either for loading or extracting information, implies a software overhead for managing the J-Link connection. In addition, the TDR access speed varies depending on the J-Link connection speed. That is, configuring a J-Link connection at 4 MHz implies a period of 250 nanoseconds, which means that the software can load or extract a single bit every 250 nanoseconds. Then, increasing the TDR size also increases the time for loading or extracting the information.

On the other hand, depending on the software configuration, the software overhead for accessing the TDR may be large enough that the time for shifting the TDR bits does not have a big impact.
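The resulting access cost can be written as a small helper, sketched below; the 4 MHz figure matches the J-Link speed quoted above, while the software overhead is left as a parameter because it is platform dependent.

    def tdr_access_time_ns(tdr_bits, jlink_hz=4e6, software_overhead_ns=0.0):
        # Time to load or extract one JTAG Test Data Register, assuming one bit
        # is shifted per TCK period; the per-access software overhead is passed
        # in by the caller because it depends on the host setup.
        period_ns = 1e9 / jlink_hz  # 4 MHz -> 250 ns per bit
        return software_overhead_ns + tdr_bits * period_ns

    # Example: shifting a 264-bit TDR at 4 MHz costs 66000 ns per access,
    # before any software overhead is added.
    print(tdr_access_time_ns(264))  # 66000.0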

Based on the previous analysis developed in this project, we have seen that one of the most important features from the users' perspective is loading and storing information to and from system memories. Due to that, with the goal of optimizing this feature, this project recommends setting the TDR size to a

Configuration read responses by loading the TDR just once. Note that we do not take into account


Configuration write responses nor Configuration read requests, as those do not require sending or retrieving values to or from the system. In addition, since only one JTAG Test Data Register needs to be integrated in the system for debug architecture access, the area overhead is low.


5. Proposed multicore design integration

In this section we explain how to integrate the new proposed debug architecture on Esperanto's System On-Chip. Later in this document an evaluation is done comparing the advantages and drawbacks of using the new proposed architecture.

Esperanto’s SoC architecture has been described before in this document but as a summary it is a

system composed of one master core, known as Service Processor, four RISC-V out-of-order cores,

known as ET-Maxions, and 1024 RISC-V in-order cores, known as ET-Minions.

In addition, it has multiple memory subsystems, known as Memory Shires, that implement access to

DDR memories and an SoC block that provides a PCIe interface, known as PCIe Shire.

The Service Processor is allocated on the I/O Shire together with several peripherals while the

ET-Maxion cores are placed on an I/O Shire sub-block with an independent cache hierarchy. On the

other hand, the ET-Minion cores are put together in groups of 32 on SoC blocks known as Minion

Shires with multiple cache hierarchies.

5.1 I/O Shire

5.1.1 Architecture

The IOShire block contains several sub-blocks that not only allow access through peripherals such as JTAG, USB, UART or SPI but also integrate the Service Processor, which is the SoC master core in charge of maintenance and management tasks in the SoC and of the associated peripherals configuration.

In addition, the IOShire also integrates the clock and reset unit (CRU) in charge of releasing or

asserting SoC blocks resets, the security subsystem that blocks non-allowed access to the debug

infrastructure and the ET-Maxion cores.

5.1.2 Integration of the new proposed architecture

5.1.2.1 Clock and reset unit

This critical SoC component is required to perform the master core boot sequence successfully and to enable the use of the rest of the SoC blocks. Due to that, it is necessary to maintain a workaround in case the master core cannot access it, such that the user can still perform the required operations on the clock and reset unit (CRU) component and enable the rest of the SoC blocks to run.

In order to solve this potential issue, the debug NoC meshtop on the IOShire provides an AMBA

APB3 interface that is connected to the CRU and the I/O Shire PLLs such that the user can manage

the system reset and clock signals without relying on the Service Processor. Note that it is highly recommended to ensure that both the debug NoC and the debug peripherals resets are deasserted based on the external reset so that access to the clock and reset unit is available if something fails.


5.1.2.2 Registers and memory access

Access to System On-Chip registers and memory is granted by the debug NoC as read and write

requests can be performed through the different debuggers which can be either the host through

JTAG, USB or PCIe, or through the Service Processor.

Note that the SoC sub-blocks, such as the Maxion shire or peripheral components, whose registers and/or memories must be visible through debug are required to integrate the debug NoC meshtop with an interface that allows access to the SoC block resources.

5.1.2.3 Service Processor

Due to the fact that the Service Processor is a RISC-V core, it follows the RISC-V External Debug Support Version 0.13 run control specification [2], which ensures that the user can perform operations

such as halt, resume, single stepping or setting breakpoints, and supports external software tools

such as GDB. The Service Processor Debug Module is accessible through the debug NoC AMBA APB3

interface that is also connected to the CRU and the PLLs. Note that there is a user-controllable

multiplexer that allows targeting those resources.

Visibility of the master core internal pipeline is still required in order to fulfill users' needs. The proposed integration makes use of a Logic Analyzer that allows gathering information from the pipeline finite-state machines and control signals. The new and improved filtering, tracing and compression features of the proposed debug infrastructure allow the Logic Analyzer to replace both the UltraSoC Status Monitor and the UltraSoC Trace Encoder components that were attached to the Service Processor to provide visibility.

On the other hand, a Bus Tracker module monitors the interface between the Service Processor and the non-debug NoC, tracking the transactions in-flight. This not only detects whether a request has hung but also allows the user to count how many requests have been sent or to obtain performance numbers based on latency. The integration of the Bus Tracker on this interface allows replacing the UltraSoC Status Monitor, which could not track requests and responses in-flight as it did not support managing transaction identifiers.

5.1.2.4 Maxion validation

The ET-Maxion cores architecture is based on the RISC-V Berkeley Out-of-Order Machine (BOOM),

which means that it follows the RISC-V External Debug Support Version 0.13 specification [2]

allowing users to perform run control operations. The ET-Maxion cores Debug Module is accessible

through a debug NoC AMBA APB3 interface.

In addition, the Maxion sub-block has two Logic Analyzer components in order to allow the user to monitor not only the ET-Maxion cores pipeline but also the Maxion cache hierarchy.


On the other hand, due to the fact that ET-Maxion cores access the SoC memory subsystem and

share computation with the ET-Minion cores we need to have visibility of the transactions that are

sent and received on the Maxion block. Due to that, there is a Bus Tracker that monitors the Maxion

AMBA AXI4 interfaces that connect the Maxion cache with the rest of the system.

5.1.2.5 Debugger hardware interface

The debug subsystem is accessible through external interfaces such as JTAG but also making use of

Esperanto’s SoC master core, which in this case is the Service Processor.

External interface with the host

Access through JTAG and USB interfaces is still supported, and it is no longer required to integrate UltraSoC's third-party JTAG and USB modules, which expose the I/O interface but have limited features and

throughput. On the proposed architecture the Debug Transaction Manager (DTM) handles the

connection of the external interfaces with the debug infrastructure so that the JTAG and USB I/O

interfaces can be directly attached to the debug infrastructure. The DTM also impacts data

streaming behavior improving performance as it supports different Maximum Transaction Unit

(MTU) values.

In addition, there is also now support for allowing debugging through other external interfaces such

as PCIe which provides a high-bandwidth interface as the DTM also handles the connection.

Cores communication with the host

The Service Processor is a critical piece for which a way to communicate with the host is needed so that

the master core can send information even if there has not been a previous request from the

debugger.

In order to keep this feature while saving area overhead and power consumption, the proposed architecture delegates the communication to the UART and SPI interfaces, through which the host can communicate with the Service Processor and exchange data. This allows removing the UltraSoC Static Instrumentation component, which has been unused during Esperanto's SoC bring up, and saves both area and power consumption.

It is also important to note that the Service Processor can target host memory through the PCIe

interface as part of Esperanto’s architecture features.

Debug through Service Processor

Access to the debug infrastructure from a master core is not only maintained but also improved as

now there is no need to target the UltraSoC Bus Communicator internal registers and handle the

UltraSoC messaging protocol encoding and decoding. On the proposed architecture, the Service

Processor is attached to the Debug Transaction Manager (DTM) which handles the connection

between the master core and the debug NoC allowing the Service Processor not only to use its own


protocol and still have full access to the debug features but also to avoid having to encode or

decode complex messages.

Note that configuring UltraSoC debug components such as the Status Monitor required performing multiple writes to the UltraSoC Bus Communicator internal registers and checking the status of a finite state machine at each step, which added complexity and a huge overhead. The proposed

architecture allows the Service Processor to directly target the debug monitor components without

adding overhead.

Figure 40. IOShire proposed debug architecture component integration

5.2 Minion Shire

5.2.1 Architecture

The Minion Shire integrates the compute cores and their cache hierarchy that handle the

computation of the different workloads in the system. Note that, as explained before in this document, there are thousands of compute cores distributed in groups of 32, known as Minion Shires, within which the cores are arranged in groups of 8, known as Neighborhoods.

5.2.2 Integration of the new proposed architecture

Due to the fact that the Minion Shire block is replicated several times in the system, adding several

debug components on those has a huge impact on area and power consumption. On the other

hand, it is essential to have visibility of the Minion Shire resources as they handle most of the work

and determine the SoC performance.


5.2.2.1 Software application debugging

First, it is required to allow debugging software applications through run control operations such as

breakpoints, single stepping or the ability to read core internal registers, together with having

access to the Minion Shire configuration registers. Therefore, the debug NoC meshtop attached to

each Minion Shire integrates an AMBA APB3 interface that provides access to all the Minion Shire

registers and ET-Minion cores run control interfaces through a sub-block named Debug Module.

Note that no new compute core run control operations are introduced, as Esperanto's SoC previous debug infrastructure already contained the most essential operations. That said, due to how software applications are designed, a new feature is integrated that allows run control operations such as halt or resume to be applied automatically to a group of compute cores when one of the cores within that same group halts, as sketched below. This enables fine-grained run control debugging, allowing software developers to debug different groups of cores at the same time while ensuring that they are halted within a small range of cycles when another core halts. Note that the grouping feature reduces latency when applying the halt run control operation, as the requests do not need synchronization with the host.
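A minimal Python sketch of this group-halt behaviour is shown below; the controller class, the callback interface and the group encoding are illustrative assumptions, not the actual Debug Module implementation.

    class HaltGroupController:
        # When any core in a configured group reports a halt, broadcast a halt
        # request to the remaining cores of that group without host involvement.
        def __init__(self, groups):
            # groups: mapping from group ID to the set of core IDs in that group
            self.groups = groups

        def on_core_halted(self, halted_core, send_halt):
            # send_halt(core_id) issues a halt request to a single core.
            for members in self.groups.values():
                if halted_core in members:
                    for core in members - {halted_core}:
                        send_halt(core)
                    break

    # Example: cores 0-3 form one debug group; when core 2 halts, the rest follow.
    ctrl = HaltGroupController({0: {0, 1, 2, 3}})
    ctrl.on_core_halted(2, send_halt=lambda core: print(f"halt core {core}"))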

5.2.2.2 Resources visibility

In addition, users still need to monitor the ET-Minion cores execution and, because of that, a Logic Analyzer is placed within each Neighborhood, allowing the user to connect with the 8 ET-Minion cores and other resources within that same sub-block.

The proposed integration connects the Logic Analyzer to the ET-Minion cores pipeline to track signals of interest, which can include the Program Counter or control signal values. On the other hand, there is a new ET-Minion core Control Status Register (CSR) that allows software developers to track multiple cores' execution stages. In addition, the filtering optimization allows focusing on gathering useful data from the system, reducing the probability of buffer overflows that result in lost information.

Furthermore, there is a Bus Tracker which provides visibility of the crossbar between the four

Minion Shire Neighborhoods and the L2 and L3 caches, allowing the user to track the requests sent

from the compute cores. Note that this is an essential debug feature that was not available on

Esperanto’s debug infrastructure as it was using an UltraSoC Status Monitor module which does not

handle transaction identifiers to match requests and responses.

On the other hand, a Logic Analyzer is also connected to the Shire Cache control signals to allow

analyzing critical control signals and finite-state machines.

Finally, there is a second Bus Tracker component connected to the interface between the Minion

Shire L2 and L3 caches and the non-debug Network On-Chip. This allows monitoring the traffic

generated by the compute cores. The Bus Tracker replaces the UltraSoC Bus Monitor which caused a


huge area overhead because, even though Esperanto's architecture does not use some of the AMBA AXI4 interface fields, UltraSoC's component buffering for those fields could not be removed.

Figure 41. Minion Shire proposed debug architecture component integration

5.3 Memory Shire

5.3.1 Architecture

The Memory Shire provides access to the SoC memory by integrating the LPDDR4 memory parts

and connecting those with the rest of the SoC through the non-debug Network On-Chip. There are

eight Memory Shires in the system which means that, the same way that happened with the Minion

Shires, adding several debug components has a direct impact on SoC area and power consumption.

5.3.2 Integration of the new proposed architecture

5.3.2.1 Register visibility

The first feature to be maintained from Esperanto’s debug infrastructure is the access to the block

registers used to define different operational modes to handle the memories. The access to those

registers is now performed through an APB master interface that is exposed by the debug NoC

meshtop that connects the Memory Shire with the rest of the debug infrastructure.


5.3.2.2 Memory visibility

The same way as for registers, memory access is also supported, not only for reading DDR content but also for allowing the user to load data through the debug infrastructure in case it is needed. The

access to this resource is performed through an AMBA AXI4 interface exposed by the debug NoC

meshtop attached to the Memory Shire.

5.3.2.3 Debug trace buffering

On the other hand, this document introduced optimizations not only for increasing the amount of information that can be gathered using monitor components such as the Logic Analyzer or the Bus Tracker, but also for improving the routing of trace data and adding the Trace Storage Handler, which allows setting PCIe as a destination, providing a high-bandwidth output interface.

In addition, among the new features we have to highlight debug NoC interleaving and streaming, which allow routing the gathered data to multiple memory subsystems and utilizing all the debug NoC wires to transmit the information.

All these features not only provide better visibility but also reduce the risk of losing information or

overflowing the reserved region of memory for the gathered data.

Figure 42. Memory Shire proposed debug architecture component integration


5.4 PCIe Shire

5.4.1 Architecture

Esperanto’s SoC PCIe is integrated on this block allowing access to memory mapped locations from

an external host and also allowing compute cores to use host memory as SoC extended memory.

5.4.2 Integration of the new proposed architecture

The PCIe integration on Esperanto’s SoC makes use of two PCIe controllers and two PCIe PHYs so in

order to have visibility of those resources, a Logic Analyzer is attached to each of them.

On the other hand, due to the fact that PCIe is connected to the SoC Network On-Chip through

multiple AMBA AXI4 interfaces, the debug infrastructure uses two Bus Tracker components to

analyze the transactions in-flight on the two PCIe lanes simultaneously.

Furthermore, the new debug infrastructure integrates a Debug Transaction Manager (DTM) allowing

the host to have access to the SoC debug components through the PCIe interface. Note that even if

both PCIe and Service Processor have the same DTM implementation and the interface with the

non-debug NoC follows the same AMBA protocol, a separate instance of the DTM is required just in

case the IOShire SPIO NoC is not available so that the host can still use the PCIe as an active

debugger.

Figure 43. PCIe Shire proposed debug architecture component integration


5.5 Debug Network On-Chip

System On-Chip debug components, which include both the debugger interfaces such as JTAG and

the debug components such as the Logic Analyzer, are connected through the debug NoC which

allows configuring and gathering information from the system.

In this document we propose to replace Esperanto’s SoC debug NoC with a new version that allows

gathering more information from the system and also provides further support for accessing the

debug infrastructure from different master interfaces through the Debug Transaction Manager

(DTM) components. This change not only affects visibility with new features such as interleaving or

streaming, but also has a huge impact on area and power as it maximizes the utilization of the

wires. Note that Esperanto’s first design used a debug NoC that was generic and non-intended to be

used for debug purposes so it unused many of the internal wires and did not provide support for

debugging using different master interfaces.

In addition, the multiple bridge modules between the debug NoC and the SoC blocks are no longer

needed as the proposed architecture simplifies the message routing based on coordinates and

provides further support through the DTM. Note that the UltraSoC Message Engine modules have

also been removed due to the fact that the SoC block debug components are directly attached to

the debug NoC.

5.6 Security

In order to ensure that Esperanto’s Technology SoC post-silicon validation debug architecture is

secure and data is not exposed we rely on the Root of Trust component integrated on the I/O Shire

and an efuse which provides access to the debug infrastructure.

The efuse provides access to the debug JTAG/USB/PCIe interface that connects the host software to

the debug infrastructure. Then, the user must target the I/O Shire Root of Trust component and

successfully perform the security challenge to enable the debug capabilities. The rest of the Debug

Transaction Manager instances on the system receive the output signal from the Root of Trust which

serves as an enable to lock or unlock the access to the debug NoC and the rest of the debug

components within the SoC.

Note that the I/O Shire Root of Trust is accessed through the debug NoC, but the rest of the SoC debug components cannot be reached because the NoC blocks accesses to all of them, making the debug NoC master JTAG interface connected to the RoT the only resource available to be accessed.


6. Evaluation

In this section we evaluate the proposed debug infrastructure and analyze its advantages and drawbacks compared to the options available on the market and to Esperanto's SoC implementation. This is done by considering the different elements that compose the debug infrastructure and taking into account not only the features but also the expected impact on area and power consumption.

6.1 Run control

User debugging of software applications relies on the debug access to multicore systems compute

cores as they perform the workloads. In addition, it is also important to have access to the master

core as it is in charge of balancing workloads across the compute cores.

The proposed architecture does not introduce new run control features that are not available in the

market, but it ensures that there is support for multicore systems that can scale up to thousands of cores. Note that it is necessary not only to manage multiple cores at a time but also to ensure that applying certain operations to them, such as halting or setting breakpoints, does not take an excessive amount of time.

6.1.1 Broadcast support

The proposed architecture allows broadcasting operations such as halt, resume or single stepping to all selected cores in the system instead of sending those operations core-by-core, which causes a huge latency as a synchronization between the host and the system is needed for each operation.

Taking Esperanto’s Technology SoC as an example, there are 32 Minion Shires, which are SoC blocks

that contain 32 ET-Minion compute cores. Then, the average write latency on RTL simulation

without taking into account the latency from the JTAG/USB/PCIe transport to access ET-Minion

cores registers is of 196 nanoseconds and a total of 344 nanoseconds if acknowledgement is

required. In an architecture which does not empower the multicore system run control features

introduced in this document, sending a run control request to all system compute cores such as halt

would require to target individually each of those cores meaning 1024 write requests without

acknowledgement to be performed from the active debugger. That is, it would take 200704

nanoseconds to perform a halt or resume request to all the SoC compute cores.

On the other hand, in case Esperanto’s Technology system integrated the proposed mask register

within each SoC block together with the run control operations broadcast feature performed by the

Debug Multicore Handler (DMH), it is required to perform just one write per SoC block to select the

ET-Minion cores, in this case acknowledgement is required to ensure that run control operation is

performed after cores have been successfully selected, and one write to perform a given run control

operation, which does not require acknowledgment. That is, it would require 33 write operations to


perform a halt request on all the SoC cores, which means 11204 nanoseconds, a reduction of 94,42%.

In addition, further run control operations on the selected cores can be performed with a single write request to the DMH, as the cores have already been selected and the SoC blocks' mask register value is kept. For example, it would take the user 196 nanoseconds (one write request without acknowledgement, as the user can check the cores' status later if required) to perform a resume on the selected cores, in comparison to 200704 nanoseconds on an SoC debug architecture that does not implement the proposed multicore debug run control integration, as it would have to access all SoC cores one-by-one. This means a percentage decrease of 99,9%.
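The comparison above can be reproduced with the short calculation below, using the latencies quoted in the text (transport latency excluded).

    # Broadcast halt latency comparison (all times in nanoseconds).
    WRITE_NO_ACK = 196
    WRITE_WITH_ACK = 344
    CORES = 1024
    MINION_SHIRES = 32

    core_by_core = CORES * WRITE_NO_ACK                       # 200704 ns
    with_dmh = MINION_SHIRES * WRITE_WITH_ACK + WRITE_NO_ACK  # 32 mask writes + 1 broadcast = 11204 ns
    saving = 100 * (core_by_core - with_dmh) / core_by_core   # ~94.4 %
    print(core_by_core, with_dmh, round(saving, 2))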

Note that the Debug Multicore Handler (DMH) run control registers are recommended to be placed

near the debugger interfaces such as JTAG/USB to reduce latency, but in the example above we

assume that it has a similar latency as if the user was accessing SoC block compute cores registers.

Furthermore, the numbers shown above do not include the JTAG/USB latency, so it is important to take into account that the impact of the savings is even bigger: as the number of transactions decreases, the number of traversals through JTAG/USB/PCIe also decreases.

Moreover, using the broadcast feature also allows all the SoC compute cores to receive run control

operations in a small window of time while following the traditional approach of going core-by-core

introduces huge latency from the first core that received the run control operation to the last in the

list.

In addition, there is also support to create execution groups such that when one core halts then the

rest of the cores in the group receive a halt request, which allows control of the execution of

multiple cores in the system based on the configuration of a specific one.

6.1.2 Status report

There is also support for gathering the status from all the selected cores in the system by

reading a subset of registers instead of having to gather the status information of each core

independently.

6.1.2.1 Cores run control status

The proposed debug architecture allows the user to retrieve selected cores' run control status by

targeting a register placed on the Debug Multicore Handler (DMH).

In terms of latency savings we can take Esperanto Technologies' SoC as an example, as we did in the previous section. Since there are 32 Minion Shires, each containing 32 ET-Minion compute cores, and the average latency to perform a read of ET-Minion core registers is 344 nanoseconds, without support for multicore run control operations we need to target the 1024 cores one-by-one, which means that it would take 352256 nanoseconds to know whether all cores in the system are running, are halted or have performed an injected instruction as expected.


On the other hand, if Minion Shires integrate the proposed mask register together with the SoC

status report infrastructure introduced before in this document, the user would require 32 writes

with acknowledgment to configure each Minion Shire mask register and one read to the DMH in

order to retrieve the global status. Then, following Esperanto Technologies' SoC latencies, we would require 32 writes with acknowledgment and one read, each with a latency of 344 nanoseconds, so that the user could retrieve all cores' status in 11352 nanoseconds, a reduction of 96,77%.

In addition, in case the user applies new run control operations or wants to retrieve the status value once again, a single read request to the DMH is required, as the cores have already been selected and the SoC blocks' mask register value is kept. For example, it would take the user 344 nanoseconds (one read request) on the proposed architecture, in comparison to 352256 nanoseconds on an SoC debug architecture that does not implement the proposed multicore debug run control integration, as it would have to access all SoC cores one-by-one. This means a percentage decrease of 99,9%.
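As before, the status report numbers follow directly from the per-access latency quoted in the text, as the short calculation below shows.

    # Status report latency with and without the Debug Multicore Handler
    # (344 ns per read or acknowledged write, transport latency excluded).
    READ = 344
    WRITE_WITH_ACK = 344
    CORES = 1024
    MINION_SHIRES = 32

    core_by_core = CORES * READ                        # 352256 ns
    with_dmh = MINION_SHIRES * WRITE_WITH_ACK + READ   # 32 mask writes + 1 status read = 11352 ns
    repeat_poll = READ                                 # mask already configured: one read suffices
    print(core_by_core, with_dmh, repeat_poll)         # 352256 11352 344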

Note that, as explained in the last section, we recommend placing the Debug Multicore Handler

(DMH) near the debugger interfaces such as JTAG/USB to reduce latency. In the example above we

assume that it has a similar latency as if the user was accessing SoC block compute cores registers.

In addition, note that we assume that the Debug Multicore Handler status register value is always

updated so no latency on updating this register has been taken into account.

6.1.2.2 Cores register dump

As we have seen before in this document, the RISC-V GDB software does not provide proper support for multicore systems. One of the biggest concerns is that certain operations, such as single stepping, by default force the software to read all of the thread's General Purpose Registers (GPR).

On Esperanto Technologies' SoC, reading a core GPR requires the debugger to perform 3 serialized AMBA APB3 requests, which load instructions into the program buffer and read its output. Given that ET-Minion cores have 32 GPR per hardware thread, that there are 2 hardware threads per core and 1024 ET-Minion cores in the SoC, and that up to 2048 threads can potentially be selected at the same time, the latency becomes so large that it is noticeable by the user and causes problems during debugging.

Taking into account that the AMBA APB3 transactions are serialized and each require 344

nanoseconds, reading all GPR from a single thread takes 33024 nanoseconds, while reading all

ET-Minion core GPR from one Minion Shire, which contains 32 ET-Minion cores, that is, 64 hardware

threads, takes 2113536 nanoseconds. It is important to note that the 344 nanoseconds do not take

into account the delay of the JTAG/USB/PCIe interface, so the real latency seen by the user is even

bigger.

Following this project's recommendation of adding a register within the Debug Multicore Handler (DMH) to capture the GDB request to gather General Purpose Register information from a given thread, and of adding a finite-state machine on the SoC block Debug Module that automates the GPR read, we avoid hundreds of requests being sent from the host software.


Then, if the DMH and the SoC block DMs are integrated following this thesis' recommendation, and the OpenOCD software layer is extended to make use of this DMH special register after operations such as single step, rather than following the default implementation which sends multiple requests for each register, we can perform 33 operations from the software to read the General Purpose Register values of a given thread. That is, the first request would be used to access the DMH register that triggers the ET-Minion thread GPR read, while the other 32 requests would be used to gather the register values once they are extracted from the thread. This means that the latency for reading all thread registers, taking the same 344 nanoseconds per access, would be 11352 nanoseconds. On the other hand, reading all ET-Minion core registers from a given Minion Shire would take 726528 nanoseconds, a reduction of 65,62%.

Note that we assume that the latency between the first request to access the DMH register and the

second request to read the first GPR value is enough to extract this value. Otherwise, it is required either to add a delay in the software or to make use of a DMH register to check the status of the GPR read process.
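The per-thread and per-Shire figures above can be checked with the small calculation below (344 ns per serialized AMBA APB3 access, transport latency excluded).

    # GPR dump latency per hardware thread and per Minion Shire.
    APB_ACCESS = 344
    GPRS = 32
    THREADS_PER_SHIRE = 64

    default_gdb = GPRS * 3 * APB_ACCESS     # 3 accesses per GPR -> 33024 ns per thread
    with_dmh = (1 + GPRS) * APB_ACCESS      # 1 trigger + 32 result reads -> 11352 ns per thread
    print(default_gdb * THREADS_PER_SHIRE)  # 2113536 ns per Minion Shire
    print(with_dmh * THREADS_PER_SHIRE)     # 726528 ns per Minion Shire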

6.2 Debug NoC

The debug Network On-Chip is the most important change in the proposed debug infrastructure, as it is the key component that connects all the debug components with each other and with the active debuggers.

6.2.1 Debugger support

The first important feature to notice is the support for both the PCIe and master core interfaces, allowing the user to control the debug infrastructure through those master interfaces. The proposed architecture is designed in such a way that the master interfaces can follow protocols such as JTAG or AMBA and are easily translated to the debug NoC protocol through the Debug Transaction Manager (DTM).

From a performance point of view, the addition of PCIe not only as a master interface but also as a trace destination allows gathering more information from the system, providing more visibility for the user, which is essential when debugging complex software applications. In addition, having

the PCIe as a master interface allows debugging from the server without needing to connect JTAG or

USB interfaces. That said, JTAG and/or USB interfaces are still highly recommended as PCIe requires

complex configuration and this may not be available during early bring up stages.

Then, due to the ability of the Debug NoC to split packages, performance is improved on the

different interfaces with the host, given that JTAG, USB and PCIe work with different Maximum Transaction Unit (MTU) values. This improvement is especially useful when large amounts of data, such as traces or memory loads/dumps, have to be transported.


6.2.2 Efficient messaging protocol

On the other hand, as a result of having a messaging infrastructure that is based on a custom

protocol designed to handle configuration and trace messages efficiently and making use of all the

protocol fields and capabilities, performance is improved. Note that the Esperanto Technologies debug NoC was based on the NetSpeed protocol, where many fields were unused, causing not only a waste of area and static power consumption but also a loss of performance.

In addition, the debug NoC also takes advantage of the proposed compression algorithms that

reduce the amount of information that is sent through the network layer. This allows designers to

reduce the number of needed wires even more, directly impacting area and static power

consumption, while keeping the throughput requirements.

Furthermore, due to how bottlenecks are handled and the decision to propagate back pressure up to the debug monitor components, the debug NoC is not only lossless but also does not require large buffers, and it is a user configuration decision whether messages should be discarded when configuring the monitoring system.

Note also that, depending on the active debugger protocol, write confirmation responses may not be required, which means that at least 50% of the bandwidth is saved. In addition, since there is no round-trip latency, that is, no extra time is spent waiting for acknowledgement of each request, more sustained outstanding writes can be performed without requiring more transaction identifiers.

6.2.2.1 Performance improvements

Taking Esperanto Technologies' SoC as an example, we can see that the UltraSoC JTAG Communicator integrates a Test Data Register (TDR) of 33 bytes (264 bits), which is fully loaded each time a message is sent from the host to the SoC or retrieved from the SoC to the host. Note that the TDR is always fully loaded or read even if the message is smaller than 33 bytes. That is, loading or extracting data from the SoC requires shifting all 33 bytes and also incurs a software overhead each time this is performed. Based on post-silicon experience when running the J-Link connection at 4 MHz, we have concluded that the software overhead is 1640000 nanoseconds each time the TDR is accessed. Note that this software overhead number is believed to be reasonable, as the Esperanto engineering team went through several iterations of UltraSoC software improvements to optimize its utilization and adapt it to the host operating system. In addition, since the J-Link connection runs at 4 MHz, loading each bit of the JTAG TDR takes 250 nanoseconds. Then, the amount of time invested in using the TDR is computed as the software overhead plus the time for shifting each of the TDR bits.

If we now focus on the operations performed during the ET-SoC-1 bring up stage, we have seen that writing to and reading from memory have been critical, as the infrastructure has been used to load big binaries and dump all memory content, and those operations were the most time consuming.


Due to that, we can measure performance improvements based on how much time it takes for the

debug architectures to perform writes of 512 bits, which is the SoC cache line size.

The UltraSoC DMA, which is the only debug component on ET-SoC-1 that allows access to system memory, supports read and write operations of up to 8 bytes. This means that writing 512 bits, which are 64 bytes, requires performing 8 write operations. In addition, note that the way the UltraSoC software is implemented, the UltraSoC DMA operations must be sequential, meaning that for each request we must wait for its corresponding response. Furthermore, the UltraSoC DMA write requests are encoded using 16 bytes while write responses require 5 bytes; that is, requests and responses do not exceed the JTAG TDR size, which means that for each of them we just need to load the TDR once.

Taking this information into account, we can see that we need to perform 8 UltraSoC DMA requests and extract 8 UltraSoC DMA responses, which means that the amount of time to write 512 bits is 16 times (8 for requests and 8 for responses) the software overhead (1640000 nanoseconds) plus the time for loading the TDR, which is 250 nanoseconds per bit multiplied by 264 bits. That is, writing 512 bits to memory on ET-SoC-1 through UltraSoC takes 27296000 nanoseconds. Note that we do not take

from memory on ET-SoC-1 through UltraSoc takes 27296000 nanoseconds. Note that we do not take

into account the amount of time for the request and response to be performed within the SoC as it

depends on factors that are outside of the debug infrastructure such as availability of the memory

controllers which are not discussed in this thesis. Hence, we assume that request and response

times do not change.

On the other hand, in case Esperanto’s Technology system integrated the proposed debug

architecture, which follows a new debug messaging protocol and implements a JTAG TDR of 73

bytes, the amount of time required for writing 512 bits from memory is reduced.

In the proposed architecture, Configuration write request messages require up to 73 bytes while Configuration write response messages need 3 bytes. Then, because the proposed architecture integrates an AMBA AXI4 interface with SoC memory and supports write requests of up to 512 bits, performing a write requires only one request. In addition, the Configuration write request messages can encode up to 512 bits of write data. That is, performing a 512-bit write on ET-SoC-1 would require only one request and one response. Note that in previous sections we recommended integrating a JTAG TDR whose size matches the Configuration write request messages. We now assume that Configuration write request messages perform writes of up to 512 bits, matching the system cache line size, which together with the rest of the message information requires 73 bytes to be transmitted.

As we have seen, Configuration write requests, which are encoded in messages of up to 73 bytes, fit in the proposed JTAG TDR, so the software overhead and the time to load the JTAG TDR are paid only once per request. Configuration write responses, which are encoded in 3 bytes, likewise require loading the JTAG TDR only once, and, as noted above, the active debugger protocol may not require write confirmation responses at all. Assuming that the software is not improved (overhead of 1,640,000 nanoseconds), that the JLink connection runs at 4 MHz (250 nanoseconds per bit) and that no write confirmation response is needed, writing 512 bits on the proposed architecture takes 1,786,000 nanoseconds, a reduction of 93.45%.
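The following Python sketch simply reproduces the arithmetic above using the figures stated in this section (software overhead, JLink bit time, TDR sizes and number of TDR accesses); it is an illustration of the calculation, not a measurement.

# Worked example of the 512-bit memory write timings discussed above.
# All figures are the ones stated in this section; the script only redoes the arithmetic.

SW_OVERHEAD_NS = 1_640_000   # software overhead per TDR access observed on ET-SoC-1
BIT_TIME_NS = 250            # JLink at 4 MHz -> 250 ns per shifted bit

def tdr_access_ns(tdr_bytes):
    """Time to shift a full TDR once, including the software overhead."""
    return SW_OVERHEAD_NS + tdr_bytes * 8 * BIT_TIME_NS

# UltraSoC: 33-byte TDR, 8-byte DMA operations -> 8 requests + 8 responses,
# each needing one full TDR access.
ultrasoc_ns = 16 * tdr_access_ns(33)

# Proposed architecture: 73-byte TDR, one Configuration write request carries the
# whole 512-bit cache line; no write confirmation response is assumed (see above).
proposed_ns = 1 * tdr_access_ns(73)

reduction = 100 * (1 - proposed_ns / ultrasoc_ns)
print(f"UltraSoC:  {ultrasoc_ns:,} ns")   # 27,296,000 ns
print(f"Proposed:  {proposed_ns:,} ns")   # 1,786,000 ns
print(f"Reduction: {reduction:.2f} %")    # 93.46 %, the ~93.45 % reduction quoted above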


6.2.2.2 Compression savings

As explained previously in this document, an XOR operation and the Run-Length Encoding (RLE) compression algorithm are applied to Trace messages in order to reduce the amount of information sent through the debug NoC. Note that, depending on the designers' parameterization, Trace messages may need to be split into multiple flits if the information does not fit on the debug NoC interface.

Taking Esperanto’s Technology SoC as an example, the UltraSoc Status Monitor modules are used to

trace information from the system. During post-silicon validation effort one of the most important

features required gathering information from ET-Minion cores execution by tracing their Program

Counter value and a valid signal, which was 49 bits plus a 1 bit valid signal, for long periods of time.

Those ET-Minion core signals are attached on a bus that is 128 bit wide and due to how the UltraSoc

Status Monitor trace capabilities are implemented, there is no support for tracing a subset of the

filtered signals nor a compression algorithm that allows reducing the amount of information that is

sent. That is, each time a Trace message is generated, all the bus (128 bits) plus the message

header, which is 4 bytes, is traced. This means that each Trace message requires sending 20 bytes

through the debug NoC. In addition, we detected that out of the 128 bits the 64 most significant

bits were not changing during large periods of time.

On the other hand, the proposed architecture introduces the Logic Analyzer, which allows the user to apply advanced filtering and select a subset of bits to be traced. That is, with the same filter signal connections as in Esperanto Technologies' SoC, the user could configure the debug component to trace only the 50 bits that correspond to the Program Counter value and the valid signal. Then, together with the Trace message header for non-streaming mode, which is less than 4 bytes, each Trace message would require 10 bytes. This means there is a reduction of 50% even if no compression can be applied.

Furthermore, assuming the user does want to trace all 128 filter input signals, and following the same case as above where the 64 most significant bits do not change during large periods of time, the reduction in the traffic injected into the debug NoC is also noticeable if the debug monitor components are configured to apply compression. Since the XOR operation is applied between consecutive gathered values, the 64 most significant bits become 0, meaning that, together with the RLE compression algorithm, the information to be sent to the destination endpoint consists of the 64 least significant bits plus 8 bits (encoded as 0x80), which encode the fact that there are 8 bytes (0x8) with value 0. Then, together with the Trace message header for non-streaming mode, each Trace message would require less than 13 bytes, a reduction of roughly 30%.
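To illustrate the compression step, the sketch below applies the XOR-with-previous-sample operation followed by a byte-oriented run-length encoding of zero bytes on two hypothetical 128-bit captures whose 64 most significant bits do not change. The run-byte format (run length in the upper nibble, so eight zero bytes become 0x80) and the absence of an escape mechanism for literal bytes are simplifying assumptions; the exact message encoding of the proposed architecture is not reproduced here.

# Minimal illustration of the XOR + RLE trace compression described above.
# The run-byte format and the lack of an escape code for literal bytes are
# simplifying assumptions made only for this example.

def xor_with_previous(samples):
    """XOR each captured value with the previous one (first sample passes through)."""
    prev = 0
    for value in samples:
        yield value ^ prev
        prev = value

def rle_zero_bytes(value, width_bits=128):
    """Encode one XORed sample: runs of zero bytes collapse to a single run byte."""
    data = value.to_bytes(width_bits // 8, "big")
    out, run = [], 0
    for byte in data:
        if byte == 0:
            run += 1
            if run == 15:              # flush a maximal run
                out.append(run << 4)
                run = 0
        else:
            if run:
                out.append(run << 4)   # e.g. 8 zero bytes -> 0x80
                run = 0
            out.append(byte)
    if run:
        out.append(run << 4)
    return bytes(out)

# Two consecutive 128-bit captures whose 64 most significant bits do not change:
samples = [0x0123456789ABCDEF_1111111111111111,
           0x0123456789ABCDEF_FEDCBA9876543210]
for xored in xor_with_previous(samples):
    encoded = rle_zero_bytes(xored)
    print(len(encoded), encoded.hex())
# The second message shrinks to 9 data bytes: one 0x80 run byte plus the 8 LSB bytes.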

6.2.3 Support for high-bandwidth monitoring

Another important feature that has been added is support for timestamps, which are not only synchronized but also integrated into the system in a way that reduces the amount of information sent, allowing long traces with large timestamp values without a significant impact on the amount of information sent from the debug monitor components.

The two most important monitoring support features, interleaving and streaming, allow the user to gather information using the SoC block debug monitor components and store it in memory in a way that not only lets the user extract a large amount of information but also supports multiple SoC memory subsystems and allows host memory as a possible destination.

It is also important to note that standard interfaces such as AMBA APB3 or AMBA AXI4 have been integrated as part of the debug NoC, allowing the user to avoid adding dedicated debug components within the SoC blocks and therefore avoiding extra points of failure.

6.3 Monitoring

Visibility is essential when debugging multicore systems because compute cores interact not only with each other but also with several resources such as peripherals or memories. The monitoring features proposed in this document are standard in the processor design field, but the way they have been architected reduces the area and power overhead dedicated to the debug infrastructure, which is extremely important for systems that must not exceed a certain power threshold and require several instances of the monitor components.

6.3.1 Filtering optimization

One of the most important changes refers to how the signals connected by the designers to the monitor components are filtered in order to generate notification messages, events or traces. The proposed architecture filters signals before they are stored in the debug components' interface buffers, which makes it possible to discard useless data and avoid filling the buffers. This mechanism reduces the probability of losing important information when the destination endpoint is busy and back pressure is applied.

In addition, the proposed filter mechanism can be used to reduce buffer size since, depending on the filter conditions, most of the analyzed data is discarded. Note that this has a direct impact on area usage.
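As a toy illustration of why filtering before the interface buffer matters, the following Python sketch feeds the same capture stream into a small bounded buffer with a slower consumer, once filtering before enqueueing and once buffering everything and filtering afterwards. The buffer depth, drain rate and match condition are arbitrary illustration values, not parameters of the proposed components.

# Toy comparison of filter-before-buffer vs. filter-after-buffer for a monitor
# with a bounded capture buffer and a slower consumer (e.g. a debug NoC that
# applies back pressure). Returns (useful captures delivered, useful captures lost).
import random
from collections import deque

def run(samples, depth, drain_every, filter_first):
    buf, lost, delivered = deque(), 0, 0
    for cycle, value in enumerate(samples):
        interesting = (value & 0xF) == 0          # arbitrary filter condition
        if interesting or not filter_first:       # filter-first discards before buffering
            if len(buf) < depth:
                buf.append(interesting)
            elif interesting:
                lost += 1                         # a useful capture is dropped
        if cycle % drain_every == 0 and buf:      # slow consumer
            delivered += buf.popleft()
    return delivered, lost

random.seed(0)
samples = [random.getrandbits(32) for _ in range(10_000)]
print(run(samples, depth=4, drain_every=4, filter_first=True))    # almost no useful data lost
print(run(samples, depth=4, drain_every=4, filter_first=False))   # most useful data lost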

The Bus Tracker implementation is optimized for monitoring buses whose protocol makes use of unique transaction identifiers, providing further visibility without a large area and power cost. Moreover, the ability to send gathered transaction information in advance, for cases where the response must be filtered, reduces the pressure on the Bus Tracker trace buffer and allows designers to reduce its width.

6.3.2 Multiple protocol support

On the other hand, the Bus Tracker debug component allows the user to track transactions without requiring the analyzed interface to follow a specific protocol such as AMBA AXI4. The Bus Tracker filtering interface consists of the essential signals needed to track requests by identifier, plus a bus that can be filled by the user with protocol-specific control signals.


The proposed architecture has a significant impact in terms of area, given that it reduces not only the debug monitor I/O interface but also all the buffering for those signals. As an example, Esperanto's SoC used the UltraSoC Bus Monitor module to track requests on an AMBA AXI4 interface. The AMBA AXI protocol required several I/O buses and internal buffers, with a large impact on area even though most of those AMBA AXI fields were hardwired or unused in Esperanto's architecture.

6.3.3 Trace storage improvements

Another important feature of the proposed debug architecture is the capability to store data gathered from the system in the memory subsystem(s) through the Trace Storage Handler. This debug component allows the system to forward stored data upon requests from the host, or automatically every certain number of cycles, decreasing the pressure on the memory region reserved for gathered data.

In addition, the proposed trace storage system supports PCIe either as the requestor for extracting data, which provides a high-bandwidth interface for outputting gathered data, or as the trace destination endpoint, since the host memory space can be used for buffering. It is important to note that if the host can analyze the gathered information at the same speed as, or faster than, the rate at which the monitor components write traces, it can virtually keep capturing traces during the whole execution.
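The sustained-capture condition can be illustrated with a small occupancy model of the reserved trace region: as long as the drain rate matches or exceeds the production rate, occupancy stays bounded and capture can continue indefinitely. The rates and region size below are made-up illustration values, not ET-SoC-1 parameters.

# Minimal occupancy model for the reserved trace region managed by the
# Trace Storage Handler. All numbers are hypothetical illustration values.

def peak_occupancy(produce_bytes_per_us, drain_bytes_per_us, region_bytes, duration_us):
    """Peak occupancy of the reserved region over the run (clamped at its size)."""
    occupancy = peak = 0
    for _ in range(duration_us):
        occupancy = min(region_bytes, occupancy + produce_bytes_per_us)
        peak = max(peak, occupancy)
        occupancy = max(0, occupancy - drain_bytes_per_us)
    return peak

REGION = 64 * 1024  # hypothetical 64 KiB reserved region

# Host (e.g. over PCIe) drains as fast as the monitors produce:
# occupancy stays bounded, so capture can continue for the whole execution.
print(peak_occupancy(800, 800, REGION, 10_000))   # stays at one burst (800 bytes)

# Host drains slower than the monitors produce: the region eventually fills,
# so traces would be lost or back pressure would have to be applied.
print(peak_occupancy(800, 600, REGION, 10_000))   # reaches the full 65,536 bytes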

Taking Esperanto’s Technology SoC as an example, each Minion Shire runs at 1GHz and is connected

to the debug NoC which runs at 500Mhz. The monitor components such as the UltraSoC Status

Monitor are connected to the Minion Shire clock. Note that the UltraSoC Status Monitor stores

captured data before applying filtering and requires 2 cycles to check if comparator conditions are

met and generate the trace message to be forwarded through the debug NoC. Then, the UltraSoC

Status Monitor is parameterized to capture 128 bits, which together with the message header

requires 161 bits to be sent for each traced message. Note that we are not taking into account

timestamp information.

This means that each Minion Shire UltraSoC Status Monitor can trace 128 bits every two cycles running at 1 GHz, so the throughput from the debug monitor component is 64 trace bits per nanosecond, but 161 bits must be sent. In addition, the debug NoC works at 500 MHz, half the frequency, which creates further pressure and possible bottlenecks on the debug infrastructure. Moreover, there are multiple UltraSoC Status Monitor components within the Minion Shire whose traffic is aggregated in a Message Engine that has a 128-bit interface with the debug NoC. Because 161 bits must be sent and the interface between the Message Engine and the debug NoC is 128 bits, the message needs to be split in two, and due to the frequency difference the debug NoC accepts one flit every 2 Message Engine cycles. This means that, on the ET-SoC-1 debug infrastructure, 4 cycles are required for each 128 bits of traced data, a throughput of 32 bits per nanosecond. In conclusion, we can potentially have overflows on the UltraSoC Status Monitor buffers, as it needs 2 cycles per trace and stores traces before applying filtering; the interface with the debug NoC neither scales accordingly nor adapts to running at different frequencies; and tracing from multiple debug components, even if targeting different memory subsystem blocks, causes problems because the traffic is aggregated in the Message Engine.

On the other hand, the proposed architecture introduces the Logic Analyzer, which performs filtering before captured data is stored, in order to avoid filling the buffer with useless data, and generates the trace message in just one cycle. In addition, the debug NoC interleaving feature allows sending one trace per cycle even though the debug NoC runs at half the Minion Shire frequency, which means one trace per nanosecond.

Furthermore, we also need to take into account the debug NoC streaming feature, which allows the user to capture more data by reusing some of the message fields. Assuming the Logic Analyzer is parameterized to capture 64 bits of data, one non-streaming trace sends 64 bits of data that, together with the message header, require 92 bits. Note that we are not taking timestamp information into account. Once streaming is enabled, the Transaction ID, source SoC block and source debug component ID sub-fields of the message Source field carry captured data from the system. Each trace message is then composed of the header together with the captured data, allowing the user to capture 81 bits of traced data per nanosecond while still using the 92 bus bits. The debug NoC interleaving and streaming features together therefore allow the user to trace 81 bits of raw data per nanosecond, since messages are sent to different memory subsystems each cycle, letting the debug NoC work at half the frequency, and more data can be traced each cycle. This means that the proposed architecture requires 28% fewer wires (from 128 bits on the Message Engine interface to 92 bits) and provides roughly 2.5 times the trace throughput (from 32 bits per nanosecond to 81 bits per nanosecond).
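The comparison can be summarized with the short worked example below, which only redoes the arithmetic of this section using the stated figures.

# Worked example of the trace throughput comparison above.

# ET-SoC-1 (UltraSoC Status Monitor): each 128-bit capture becomes a 161-bit
# message, split into two 128-bit flits towards the debug NoC; with the NoC at
# half the Minion Shire frequency, one flit is accepted every two 1 GHz cycles,
# so one complete trace takes 4 ns.
etsoc1_throughput = 128 / 4          # 32 bits of traced data per nanosecond

# Proposed architecture (Logic Analyzer with interleaving + streaming): one
# 92-bit message per 1 GHz cycle, carrying 81 bits of captured data.
proposed_throughput = 81 / 1         # 81 bits of traced data per nanosecond

wire_reduction = 1 - 92 / 128        # Message Engine interface: 128 -> 92 wires
speedup = proposed_throughput / etsoc1_throughput

print(f"{etsoc1_throughput:.0f} bits/ns vs {proposed_throughput:.0f} bits/ns")
print(f"{wire_reduction:.0%} fewer interface wires, {speedup:.2f}x throughput")  # 28 %, 2.53x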


7. Conclusion

This project explores the post-silicon validation field, which is a crucial step in System on-Chip design and, during the last decade, has been becoming a bottleneck for companies' development. In a world that evolves technologically very fast, and where globalization has connected companies to countries all over the world and their needs, time-to-market has become the most important constraint together with good SoC performance. Because of that, companies are shrinking the time invested in pre-silicon validation, such that it is not possible to detect all design flaws.

With the goal of designing an efficient post-silicon validation debug hardware architecture, this project introduces an infrastructure that can fit into any SoC regardless of its size, complexity or target software applications.

This study analyzes the options available on the market, ranging from FPGA emulator clusters to embedded debug components provided by third-party companies, in order to better understand the debug features required in Systems on-Chip and to develop an efficient debug architecture. The study shows that the options on the market either require specific core architectures such as ARM or RISC-V, or are so generic that they integrate features that may not be needed, causing area and power overhead.

After a study of multicore system requirements and an overview of the Esperanto Technologies ET-SoC-1 design, several aspects were identified that could be improved in order to provide further observability while maintaining a low area overhead.

The proposed debug architecture focuses on an efficient debug Network on-Chip and messaging infrastructure able to provide sustained high bandwidth, allowing data to be gathered from the system over long periods of time. In addition, this project also centered its efforts on developing a debug architecture that is accessible not only through external standard debug interfaces such as JTAG or USB, but also from high-bandwidth external interfaces such as PCIe or from a master core within the system.

On the other hand, this project introduced new debug components that can be integrated into the SoC blocks, providing further visibility based on advanced filtering and allowing the user to gather information from the system. The gathered information is compressed following an efficient and easy-to-implement algorithm. The compressed data can then be sent either to the host for post-processing analysis or stored in the SoC memory subsystem, allowing the user to capture data from the system even when using low-speed external debug interfaces such as JTAG, which could otherwise become a bottleneck.

Furthermore, this thesis introduces recommendations on how to integrate the debug components in multicore systems, including debug features for the compute cores, management of the system workload from a master core through the debug infrastructure, and the integration of hardware assertions for capturing manufacturing faults that are hard to detect, among others. A proposal for a new Esperanto Technologies ET-SoC-1 debug architecture is presented and, based on the evaluation developed in the previous sections, with less area and power overhead we obtain up to 93.45% improvement in bandwidth for JTAG connectivity and up to 99.99% improvement in latency for run control operations. All of this reduces the impact of the debug infrastructure on the use of SoC resources while providing further visibility.


8. Future work

During this project we highlighted the importance of validating System on-Chip designs from both functional and non-functional perspectives. The validation effort for SoCs is performed in two stages: (i) pre-silicon validation and (ii) post-silicon validation. Earlier in this document we have seen how to increase the observability and controllability of the design by making use of scan chains, monitor components and efficient network communication in order to detect errors introduced either during the design stage or the manufacturing process.

First, we must implement the proposed architecture, including the monitor components and the debug Network on-Chip, in order to obtain area and power numbers from which further evaluation and improvements can be made.

Then, as part of the future work for improving the proposed architecture, we must focus on trace compression techniques that target both the depth and the width of the trace buffering. We understand depth compression as targeting the number of cycles during which erroneous data must be stored for further analysis; width compression, on the other hand, would allow us to store more information each cycle.

Another important aspect to take into account is the trade-off between observability and security. Verification engineers want to maximize the observability of the system in order to reduce debug time, whereas the security engineering team prefers to sacrifice visibility of system resources in order to enhance the security and privacy of the system's behavior. Because of that, it is necessary to investigate encryption methods that can protect the data transmitted through the debug architecture in order to prevent information from being leaked.

