Analysis and optimization of a debug
post-silicon hardware architecture
Joel Sanchez Moreno
Supervised by Miquel Izquierdo Ustrell
Universitat Politècnica de Catalunya
Master in Innovation and Research in Informatics
January 2022
Abstract
Post-silicon validation is a crucial step in the System on-Chip (SoC) design flow, as it is used to identify
functional errors and manufacturing faults that escaped previous testing stages. The complexity
of SoC designs has been increasing over the last decades and, combined with shrinking
time-to-market constraints, makes it impossible to detect all design flaws during pre-silicon validation.
Nowadays there exist thousands of systems in the world designed for very different
applications, so the functionality to be tested differs between them, but all of them require a
post-silicon validation infrastructure able to detect misbehaviors.
The objective of post-silicon validation is to ensure that the silicon design works as expected under
real operating conditions while executing software applications, and to identify errors that may
have been missed during pre-silicon validation. Post-silicon validation is performed using
specialized hardware that is added to the system during the design and implementation stages.
The complexity of post-silicon validation arises from the fact that it is much harder to observe and
debug the execution of software on a silicon device than in simulation. In addition, it is important to
note that the post-silicon validation stage is critical and is performed under a highly aggressive
schedule to ensure the product reaches the market as soon as possible.
The goal of this thesis is to analyze the post-silicon validation hardware infrastructure implemented
on multicore systems, taking as an example the Esperanto Technologies SoC, which has thousands of
RISC-V processors and targets specific software applications. Then, based on the conclusions of the
analysis, the project proposes a new post-silicon debug architecture that can fit any System
on-Chip regardless of its target application or complexity, and that improves on the options
available on the market for multicore systems.
Acknowledgements
I would like to express my gratitude to all those who have helped me in one way or another to
write this thesis. Firstly, I would like to thank my supervisor Miquel Izquierdo Ustrell, who has been
my manager within Esperanto Technologies, for their support and guidance since I started at the
company and during the last months while working on this master thesis. I would especially like to
thank Esperanto Technologies for allowing me to develop this thesis based on a product that
entered the market just a few months ago, and for providing me access to all the tools required
to perform power and area analysis. Finally, I would like to thank all the engineers I have worked
with during all these months, who helped me improve my knowledge of the semiconductor field.
Contents
1. Introduction 10
1.1 Context 10
1.2 Motivation 11
1.3 Methodology 12
1.3.1 Work development 12
2. State of the Art 13
2.1 System on-Chip debug requirements 13
2.1.1 Run Control 13
2.1.2 Monitoring 13
2.1.3 Intrusive debugging 13
2.1.4 Performance debug 14
2.1.5 External debug support 14
2.2 Market analysis 15
2.2.1 FPGA based solutions 15
2.2.1.1 Xilinx 15
2.2.1.1.1 Integrated Logic Analyzer (ILA) 15
2.2.1.1.2 Virtual Input/Output (VIO) 16
2.2.1.1.3 JTAG-to-AXI Master 16
2.2.1.2 Synopsys 16
2.2.1.2.1 High-performance ASIC Prototyping Systems (HAPS) 17
2.2.1.2.1.1 HAPS Deep Trace Debug (DTD) 17
2.2.1.2.1.2 HAPS Global Signal Visibility (GSV) 17
2.2.1.2.2 ZeBu 17
2.2.1.3 Cadence 18
2.2.1.4 Siemens - Mentor 18
2.2.2 Embedded IP solutions 19
2.2.2.1 ARM 19
2.2.2.1.1 Embedded Trace Macrocell 20
2.2.2.1.2 System Trace Macrocell 20
2.2.2.1.3 Trace Memory Controller 20
2.2.2.2 SiFive 21
2.2.2.2.1 Debug Module 21
2.2.2.2.2 Nexus Encoder 21
2.2.2.3 UltraSoC 22
2.2.2.3.1 Communicator modules 23
2.2.2.3.1.1 JTAG Communicator 23
2.2.2.3.1.2 USB Communicator 23
2.2.2.3.1.3 Universal Streaming Communicator 23
2.2.2.3.1.4 Aurora communicator 23
2.2.2.3.1.5 Bus Communicator 23
2.2.2.3.2 Message infrastructure modules 24
2.2.2.3.2.1 Message Engine 24
2.2.2.3.2.2 Message Slice 24
2.2.2.3.2.3 Message Lock 24
2.2.2.3.2.4 System Memory Buffer 24
2.2.2.3.3 Analytic modules 25
2.2.2.3.3.1 Direct Memory Access (DMA) 25
2.2.2.3.3.2 Processor Analytic Module 25
2.2.2.3.3.3 Trace Encoder 25
2.2.2.3.3.4 Trace Receiver 25
2.2.2.3.3.5 Bus Monitor 25
2.2.2.3.3.6 Status Monitor 26
2.2.2.3.3.7 Static Instrumentation 26
2.2.2.3.3.8 Network On-Chip Monitor 26
2.2.3 Design For Testability (DFT) 27
2.3 Debugging multicore systems 29
2.3.1 I/O peripherals 29
2.3.1.1 Debug features 29
2.3.2 Memory subsystem 29
2.3.2.1 Debug features 29
2.3.3 Master core 30
2.3.4 Compute cores 30
2.3.5 Interconnection network 30
2.4 Debug infrastructure on multicore systems 32
3. Many-Core SoC design 33
3.1 I/O Shire 34
3.1.1 Architecture 34
3.1.2 Debug features and UltraSoC integration 35
3.1.2.1 Clock and reset unit 35
3.1.2.2 Registers and memory access 35
3.1.2.3 Service Processor 36
3.1.2.4 Maxion validation 36
3.1.2.5 Debugger hardware interface 37
3.1.2.5.1 External interface with the host 37
3.1.2.5.2 Cores communication with the host 37
3.1.2.5.3 Debug through Service Processor 37
3.1.2.6 Security 38
3.2 Minion Shire 39
3.2.1 Architecture 39
3.2.2 Debug features and UltraSoC integration 40
3.2.2.1 Software application debugging 40
3.2.2.2 Resources visibility 40
3.3 Memory Shire 42
3.3.1 Architecture 42
3.3.2 Debug features and UltraSoC integration 42
3.3.2.1 Register visibility 42
3.3.2.2 Memory visibility 42
3.3.2.3 Debug trace buffering 42
3.4 PCIe Shire 43
3.4.1 Architecture 43
3.4.2 Debug features and UltraSoC integration 44
3.5 Debug Network On-Chip 44
3.6 Conclusions after ET-SoC-1 45
3.6.1 Architecture, RTL and physical design evaluation 45
3.6.2 Post-silicon validation evaluation 46
4. Proposed multicore debugging architecture 48
4.1 Run control support 49
4.1.1 Execution 49
4.1.2 Registers 50
4.1.3 Core status 50
4.1.4 Multicore execution support 51
4.1.5 Multicore status report 52
4.2 Debug Network On-Chip 53
4.2.1 Message Types 54
4.2.1.1 Configuration messages 56
4.2.1.1.1 Read configuration messages 57
4.2.1.1.2 Writes configuration messages 58
4.2.1.2 Event messages 60
4.2.1.3 Trace messages 61
4.2.2 Interface and message packaging 62
4.2.3 Timestamps 64
4.2.3.1 Synchronization 65
4.2.3.2 Timestamp computation 65
4.2.4 Interleaving 66
4.2.5 Streaming 66
4.2.5.1 Requirements 67
4.2.5.2 Operational mode 67
4.2.5.3 Timestamp computation 68
4.2.5.4 Bottleneck analysis 68
4.2.6 System resources access 69
4.2.6.1 Event support 70
4.2.6.2 Polling & Compare 70
4.2.7 Routing 71
4.2.8 Network on-Chip interconnection 71
4.3 Monitoring 72
4.3.1 Logic Analyzer 72
4.3.1.1 Interface 74
4.3.1.2 Filtering 74
4.3.1.3 Counters 76
4.3.1.4 Actions 76
4.3.2 Bus Tracker 77
4.3.2.1 Interface 78
4.3.2.2 Filtering 79
4.3.2.3 Counters 80
4.3.2.4 Actions 80
4.3.3 Trace Storage Handler 81
4.3.3.1 Trace storage configuration 82
4.3.3.2 Trace extraction operational modes 83
4.3.3.3 Trace extraction through active debugger 83
4.3.4 Compute core execution analysis 87
4.3.5 Multicore system workload balance 88
4.3.6 Network On-Chip 89
4.3.7 Hardware assertions 89
4.4 Debug modules messaging infrastructure 90
4.4.1 Interface 90
4.4.2 Compression 92
4.4.2.1 XOR and RLE compression 93
4.5 Access to the debug infrastructure 94
4.5.1 Debug address map 94
4.5.2 Debug Transaction Manager (DTM) 95
4.5.2.1 Communication with Debug NoC 95
4.5.2.2 USB and JTAG support 95
4.5.2.3 Master core and PCIe support 97
4.5.3 Read operations to SoC resources 98
4.5.3.1 JTAG/USB 98
4.5.3.2 Master core and PCIe 98
4.5.4 Write operations to SoC resources 100
4.5.4.1 JTAG/USB 100
4.5.4.2 Master core and PCIe 101
4.5.5 Security 103
4.5.6 JTAG integration 104
5. Proposed multicore design integration 106
5.1 I/O Shire 106
5.1.1 Architecture 106
5.1.2 Integration of the new proposed architecture 106
5.1.2.1 Clock and reset unit 106
5.1.2.2 Registers and memory access 107
5.1.2.3 Service Processor 107
5.1.2.4 Maxion validation 107
5.1.2.5 Debugger hardware interface 108
5.2 Minion Shire 109
5.2.1 Architecture 109
5.2.2 Integration of the new proposed architecture 109
5.2.2.1 Software application debugging 110
5.2.2.2 Resources visibility 110
5.3 Memory Shire 111
5.3.1 Architecture 111
5.3.2 Integration of the new proposed architecture 111
5.3.2.1 Register visibility 111
5.3.2.2 Memory visibility 112
5.3.2.3 Debug trace buffering 112
5.4 PCIe Shire 113
5.4.1 Architecture 113
5.4.2 Integration of the new proposed architecture 113
5.5 Debug Network On-Chip 114
5.6 Security 114
6. Evaluation 115
6.1 Run control 115
6.1.1 Broadcast support 115
6.1.2 Status report 116
6.1.2.1 Cores run control status 116
6.1.2.2 Cores register dump 117
6.2 Debug NoC 118
6.2.1 Debugger support 118
6.2.2 Efficient messaging protocol 119
6.2.2.1 Performance improvements 119
6.2.2.2 Compression savings 121
6.2.3 Support for high-bandwidth monitoring 121
6.3 Monitoring 122
6.3.1 Filtering optimization 122
6.3.2 Multiple protocol support 122
6.3.3 Trace storage improvements 123
7. Conclusion 125
8. Future work 127
9. References 128
1. Introduction
Every day we interact with a wide variety of computing systems that are used not only to make our
lives easier but also more secure. From the moment we wake up every morning and pick up the phone, we are
using a piece of technology that integrates a System on-Chip (SoC) that is constantly performing
computation for us. We live surrounded by embedded devices that perform everything from very simple
operations to complex computation. From taking an elevator to flying on an airplane, many devices
work together to ensure we have not only convenience but also safety. In the same way, when we use
a device for financial purposes or to share personal information, embedded computing devices
ensure the security and privacy of the data.
Those devices must work following very well-defined specifications, not only to meet their
performance targets and provide the user with certain functionalities, but also to ensure that there is no
data corruption or security hole. Systems on-Chip are integrated in our bedroom television as well
as in the most advanced military systems in the world, so there is a need to ensure
that once a system has been designed and fabricated, it meets all of its requirements.
Even with huge efforts during pre-silicon validation, it is not always possible to detect
all the errors. Post-silicon validation allows us to detect design flaws ranging from escaped
functional errors to manufacturing problems.
Nowadays, post-silicon validation has become the major bottleneck for complex SoC designs, as
time-to-market constraints shrink every year in order to stay competitive, and pre-silicon
stages are now highly time-limited. Note that one of the limitations during pre-silicon validation is
the number of cycles that designers can simulate or emulate in a reasonable time.
Post-silicon validation is a challenge not only because SoC designs are becoming more complex
every year, but also because observability and controllability of the actual silicon are highly limited.
The validation effort once silicon is available can be classified into three major steps: (i) a
preparation phase for post-silicon validation and debug, (ii) a testing phase to detect problems, and (iii)
a phase focused on finding the root cause of each problem and, if it is critical, applying
workarounds so the testing effort can continue.
1.1 Context
Esperanto Technologies has developed a System on-Chip for large-scale machine learning
applications that is composed of more than 1000 RISC-V cores; the RISC-V architecture provides an
open-source and royalty-free instruction set architecture (ISA).
This chip has been named ET-SoC-1 and features two kinds of general-purpose 64-bit RISC-V cores:
the ET-Maxion, a superscalar out-of-order core that delivers high performance for modern
operating systems and applications, and the ET-Minion, an energy-efficient in-order
multithreaded core with a vector/tensor accelerator that provides a massively parallel compute array.
The goal of this SoC is to deliver better performance per watt than legacy CPU and GPU solutions
without compromising general-purpose flexibility. In addition, the integration of this SoC in data
centers could generate a huge decrease in energy costs, which translates into an important cost
reduction for third-party companies. The ET-SoC-1 also supports up to 32 GB of LPDDR4x DRAM,
137 GB/s of memory bandwidth, and a 256-bit wide interface [1].
Post-silicon validation is an important stage in all System on-Chip designs, but it is critical for the first
chips developed by semiconductor companies, as they have to get to market very fast to be
competitive and must be able to detect and solve errors as quickly as possible. In the case
of Esperanto the challenge grows: not only is this the first design developed by the company,
but integrating thousands of cores together with on-chip memory also increases complexity.
The goal of this project is to analyze the options available in the market, focusing on Esperanto's
post-silicon validation hardware infrastructure for the ET-SoC-1 as an example. Then, based on the
conclusions, we propose optimizations that target not only functionality but also a more
energy-efficient debug infrastructure with increased observability.
1.2 Motivation
Esperanto Technologies is interested in exploring a new post-silicon hardware debug infrastructure
that solves the bottlenecks in the system that cause a lack of observability during certain debugging
phases. In addition, the company also wants to optimize the usage of power and area while keeping
good visibility and monitoring features in the system. The current debug infrastructure is based on
the integration of a third-party IP¹ from UltraSoC Technologies, which is now part of Siemens Digital
Industries Software, within the Mentor Tessent group.
The use of third-party IPs allows companies such as Esperanto Technologies to develop the first
version of their chips faster, given that the IP not only adds the required functionalities to the system
but also ensures that the logic has been verified and is safe to use as long as the company
follows the integration guides. Nevertheless, using third-party IPs can also mean area and power
overheads, given that those components were designed to fit different chips and
were not customized for the specific project, so some features may not be required and
lead to a wasteful use of resources.
This project is a first step toward the implementation of a new post-silicon hardware debug
infrastructure that fits Esperanto's requirements while ensuring that the area and power overheads
added by these components are as low as possible. It is important to note that the debug
infrastructure is used during the bring-up effort and is rarely used by clients.
¹ Intellectual Property
1.3 Methodology
1.3.1 Work development
This project follows a waterfall methodology in which the activities are split into different phases,
where each phase depends on the previous one. The project consists of the following phases: analysis,
design and evaluation.
The goal of this project, as described in previous sections, is to develop a new post-silicon
hardware infrastructure. To do so, we must first determine the requirements for debugging silicon,
so that we can then analyze the state of the art of post-silicon validation with an overview of the
options in the market. Then, taking into account the basic debug functionalities and the available
options in the market, we focus on debugging multicore systems.
After that, we take Esperanto's SoC debug infrastructure as an example of a multicore system debug
infrastructure and analyze the advantages and drawbacks of the current implementation. The
evaluation of Esperanto's SoC implementation takes into account not only the offered debug
features but also power and area measurements, together with an overview of the missing features
that could be added to support software debugging.
Once we have all the information regarding the state of the market and the requirements of
multicore systems, we propose a new post-silicon debug infrastructure that focuses not only on
improving observability but also on analyzing the most efficient ways to implement those features.
Finally, we evaluate the proposed debug capabilities against Esperanto's SoC debug
infrastructure, which serves as the baseline model.
The intellectual property of this project belongs to Esperanto Technologies and all the information
contained in this thesis is confidential.
Note that the evaluation of Esperanto's SoC post-silicon hardware infrastructure is performed
using the tools provided by Esperanto Technologies. The power analysis is performed with the
Synopsys PrimePower tool, while the area analysis is performed with the Synopsys PrimeTime tool.
The presented results correspond to the latest implementation stage of the ET-SoC-1.
2. State of the Art
2.1 System on-Chip debug requirements
The first step in analyzing the available options in the market is to understand the
basic requirements for an efficient post-silicon debug infrastructure. Note that access to the debug
infrastructure potentially involves security problems, as it provides not only visibility of confidential
system architecture details but also direct access to system resources.
2.1.1 Run Control
Systems on-Chip (SoCs) usually combine peripherals and/or memories with one or multiple cores.
Those cores can be used either to run general-purpose software or to run specific
high-performance software applications. Both options require developers to be able to debug
possible hangs or misbehaviors in instruction execution, which requires run-control support.
The most important run-control operations are halting, resuming, single-stepping and using
breakpoints. Depending on the system core architecture, specific implementations may be
required. For example, the cores of the Esperanto Technologies SoC are RISC-V compliant,
which requires them to follow the RISC-V debug specification [2].
Access to a core's run-control operations during post-silicon validation is provided by a dedicated
interface between the host and the SoC cores. At a minimum, this interface must consist of a set of
I/O pins that allow the host system to connect to the SoC and send commands to the core. Generally,
the debug infrastructure I/O interface is based on the JTAG protocol, since it is a well-defined
standard that is easy to use and to integrate into systems.
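The run-control semantics described above can be sketched as a toy software model. This is purely illustrative (the class and method names are our own invention, not the Debug Module interface that the RISC-V debug specification actually defines):

```python
class CoreRunControl:
    """Toy model of per-core run control: halt, resume, single step
    and breakpoints. Illustrative only; real RISC-V cores implement
    the handshake defined in the RISC-V debug specification."""

    def __init__(self):
        self.halted = True          # assume the core comes up halted
        self.pc = 0
        self.breakpoints = set()

    def step(self):
        # Single-stepping is only legal while the core is halted.
        assert self.halted, "single step requires a halted core"
        self.pc += 4                # one 32-bit instruction

    def resume(self, budget=1_000_000):
        # Run until a breakpoint address is reached (or the budget ends).
        self.halted = False
        for _ in range(budget):
            self.pc += 4
            if self.pc in self.breakpoints:
                self.halted = True
                return

core = CoreRunControl()
core.breakpoints.add(0x40)
core.resume()                       # halts when pc reaches 0x40
core.step()                         # execute exactly one more instruction
```

In a real system these operations would be issued by the host over the dedicated debug interface (typically JTAG) rather than called as methods.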
2.1.2 Monitoring
Following the same trend as in the previous section, the user must be able to observe
system behavior and to gather information about software application execution. Monitoring
features are required on several resources within the system, as the user needs visibility not
only of the instructions retired by the core(s) but also, for example, of in-flight transactions or
performance counters.
This functionality is not part of the core architecture, which means that there are no defined standards,
and there are different options in the market, which are analyzed later in this document. Extracting
information from the system generally requires sustained throughput or on-chip buffering.
2.1.3 Intrusive debugging
Depending on the complexity of the design, it is sometimes recommended to give the host the
option to control system behavior; that is, the host can not only read but also write
memories or registers within the system, which has a direct impact on its behavior.
This feature is generally implemented to provide a workaround for critical pieces such as the master
core in a multicore system, since it has to distribute the workload among the rest of the system cores.
In this case, the user may want to ensure that, if the master core does not work as expected, the
post-silicon validation effort can continue by using intrusive debugging features that
emulate the master core's operations.
2.1.4 Performance debug
When analyzing software execution, it is necessary not only to provide visibility to detect design flaws
that escaped pre-silicon validation, but also to provide users with tools that allow them to
analyze system performance. Performance monitoring can be provided either through monitor
components that let users filter system signals and retrieve the number of
times a specific event happened, or by providing read access to system registers that automatically
gather performance metrics.
2.1.5 External debug support
As mentioned in previous sections, the user must be able to access debug features from
the host in order to apply run-control operations or monitor the system.
The most widespread debugging software tool is the GNU Debugger, also known as GDB [3],
which allows the user, for example, to run a software application on a given core up to a certain
instruction, then stop and print out register or variable values, or to step through the program one
instruction or function at a time and print out the values of each variable after executing it. GDB
has a wide range of supported operations [4] and, depending on the core architecture, there may
be specific implementations. For example, RISC-V provides its own support for GDB [5].
GDB is the high-level layer that the user sees and uses for run-control operations, but it is
important to note that external debug support consists of several layers. One of those layers is
OpenOCD [6], which aims to provide debugging, in-system programming and boundary-scan testing
for embedded target devices.
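As an illustration of the operations listed above, a hypothetical GDB session against a remote target could look as follows (the port number and symbol name are placeholders, not a real Esperanto setup):

```
(gdb) target remote localhost:3333   # attach to the debug server
(gdb) break main                     # set a breakpoint
(gdb) continue                       # resume until the breakpoint hits
(gdb) info registers                 # dump the core's registers
(gdb) stepi                          # execute a single instruction
```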
OpenOCD is able to connect to the SoC debug infrastructure with the assistance of a debug
adapter, a small hardware module integrated in the system. There can be one or multiple
hardware modules that allow OpenOCD to connect, such as USB or JTAG. Note that OpenOCD is
open source and can be modified to adapt it to custom debug architectures and enhance debugging
performance.
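As a sketch of how OpenOCD is typically pointed at such an adapter, a minimal hypothetical configuration for a single RISC-V hart behind a JTAG tap might look like this (the driver, tap ID and names are placeholders, not Esperanto's actual configuration):

```tcl
adapter driver ftdi                  ;# USB-to-JTAG adapter driver
transport select jtag
adapter speed 4000                   ;# JTAG clock in kHz

jtag newtap soc cpu -irlen 5 -expected-id 0x1234567b
target create soc.cpu riscv -chain-position soc.cpu

init
halt                                 ;# halt the hart once connected
```

Because the configuration is plain Tcl, this is also the layer where support for a custom debug architecture would usually be added.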
On the other hand, depending on the debug infrastructure implemented on the system, there may
be additional software tools that allow users to perform monitoring and to visualize the information
gathered from the system.
2.2 Market analysis
Before starting the effort of developing a custom debug infrastructure, we need to evaluate the
available options in the market. Several companies provide debug components that can
be integrated into Systems on-Chip, allowing companies such as Esperanto Technologies to
avoid spending human resources on developing their own debug components, which would require
design, implementation and validation effort.
In this section we analyze not only the main options in the market for acquiring debug components
from a third-party company, but also the debug platforms available in the market, such as emulators
based on FPGA clusters, and the debug features they provide, so that this information can be taken
into account when developing a more efficient debug infrastructure.
2.2.1 FPGA based solutions
Field Programmable Gate Arrays (FPGAs) were for many years a niche option due to
their programming complexity, even though they provide good performance and low power consumption. In
the last decade this has changed, as multicore systems have become more and more complex,
increasing the probability of bugs reaching post-silicon validation. Since time-to-market has become
crucial, designs need to be validated before they are sent to the manufacturer, and this
validation effort relies on FPGAs in most cases. Some FPGA validation solutions
include debug features that are interesting to take into consideration in this project.
2.2.1.1 Xilinx
Xilinx [7] is known for inventing the Field Programmable Gate Array (FPGA) and leads the FPGA
market thanks to its extensive range of FPGA products, which enable customers to develop and test
designs without requiring manufacturing, allowing them to iterate on design and verification
as many times as needed.
Xilinx provides debug components that can be integrated into FPGA designs to allow users to extract
information or drive control signals. In addition, Xilinx relies on Vivado [8] as the software
tool used to integrate the debug IP and to visualize the gathered information.
2.2.1.1.1 Integrated Logic Analyzer (ILA)
The ILA IP core [9] allows the user to monitor internal signals of the design and can be customized
so that the user can decide the number of input data signal ports, the trigger function, and the
width and depth of the data to be traced. The Integrated Logic Analyzer can easily be attached to the
design under test (DUT), given that it follows the clock constraints that the user applied to the
design.
The captured signals are sampled at the design clock frequency and stored in a buffer implemented
with on-chip block RAM (BRAM). Once the trigger fires, the buffer is filled, and the software can then
use this data to generate a waveform.
The trigger configuration is based on multiple comparators, one per input port. The comparators
allow the user to perform pattern matching in which some bits can be masked out and ignored;
the masked input data is then compared against the specified match value.
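The mask-and-match comparison described above can be sketched in a few lines of Python (illustrative only; the function and variable names are ours, not Xilinx's):

```python
def trigger_match(sample: int, match_value: int, mask: int) -> bool:
    """ILA-style trigger comparator: only bits set in `mask` take part
    in the comparison; masked-out bits are ignored."""
    return (sample & mask) == (match_value & mask)

# Fire when the upper byte of a 16-bit probe equals 0xA5,
# whatever the lower byte holds.
MASK, MATCH = 0xFF00, 0xA500
samples = [0x1234, 0xA5FF, 0x00A5, 0xA500]
hits = [s for s in samples if trigger_match(s, MATCH, MASK)]
```

In hardware each input port gets its own instance of this comparator, and the trigger fires when the configured combination of per-port matches is true.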
2.2.1.1.2 Virtual Input/Output (VIO)
The VIO IP core [10] can be used to monitor and drive FPGA signals in real time, but it is important to
note that it does not apply filtering and does not make use of RAM buffering as the
Integrated Logic Analyzer does.
As mentioned before, the VIO captures data without filtering: the input
ports are designed to detect and capture value transitions, but the sample frequency may not
match the design frequency, so the monitored signals may transition many times between
consecutive samples.
On the other hand, the VIO can also be used to drive data into a user's design through a JTAG
interface, allowing it to manage system control signals.
The Virtual Input/Output core allows the user to replace status indicators such as LEDs in order to
save I/O ports, and also to manage design control signals by driving different values.
2.2.1.1.3 JTAG-to-AXI Master
The JTAG-to-AXI Master IP core [11] allows the user to drive AMBA AXI signals on the FPGA at run
time. The IP supports AXI4 and AXI4-Lite interfaces with a customizable data width of 32 or 64 bits,
so it can be adapted to the user's design.
This IP can be used to drive one or multiple AMBA AXI slave interfaces in the design in order to
emulate other components or host connections during post-silicon validation.
2.2.1.2 Synopsys
Synopsys [12] is a company whose products are based on electronic design automation, focusing
on silicon design and verification, silicon IP and software. It is widely known in the industry, as it
provides a wide range of products, from design components such as DDR memory IP that can
be integrated into an SoC, to software that analyzes the behavior of a design before it is
fabricated or handles the physical design routing of Register-Transfer Level code.
Some Synopsys IP can be used for debug purposes [13] to facilitate the integration of
debug components and ease the validation and bring-up effort. That said, Synopsys offers two
approaches that can be used for debugging a user's design: prototyping and emulation.
The prototyping solution is based on a board with multiple FPGAs, usually Xilinx FPGAs,
mounted on it, together with pre-implemented debug hardware. This approach provides better
performance than emulation but requires designers to spend a significant amount of time
partitioning the model across FPGAs, managing clock domains and/or completing FPGA physical
design to handle timing closure.
On the other hand, emulation allows designers to reduce the implementation time and resources
typically associated with prototyping, but it is not as efficient as prototyping in terms of performance.
Emulation is highly recommended for fast model bring-up, which has a positive impact on schedule
and resources, as it does not require designers to handle timing closure, for example.
In summary, while a designer's RTL model is less stable, it is highly recommended to use
emulation-like solutions, as they provide a speedup over software simulation and high trace
visibility while making it easier for designers to integrate their models on the platform.
Then, as the RTL model stabilizes, designers can transition to prototyping, as it provides an efficient
path to achieving performance suitable for firmware/software development and debugging.
2.2.1.2.1 High-performance ASIC Prototyping Systems (HAPS)
Synopsys' prototyping solution is based on HAPS [14][15] and makes use of the Synopsys
ProtoCompiler software tool to automatically build debug trace visualizations from the extracted
information.
2.2.1.2.1.1 HAPS Deep Trace Debug (DTD)
The HAPS DTD [15] can be classified as an embedded trace extraction technique that allows the
user to get signal visibility within each of the FPGAs in the HAPS system. In addition, it also enables
the user to extract deep trace samples under complex triggering scenarios that can be defined at
runtime.
HAPS DTD trigger scenarios can be based on simple signal transitions or on more advanced filtering
that takes several control signal sequences into account. Once the trigger fires, the HAPS
ProtoCompiler Debugger [15] extracts the information and generates a text file or a waveform that
can be used for debugging purposes.
Deep Trace Debug is automatically implemented through the HAPS ProtoCompiler tool, also
provided by Synopsys, so HAPS DTD can be considered a default implementation on HAPS
systems.
2.2.1.2.1.2 HAPS Global Signal Visibility (GSV)
The HAPS GSV [15] enables users to stop the system clock and take a snapshot of the system,
dumping the information from all signals. Then, once the system signals are captured, the
clocks are released to normal operation on the platform, or they can be stepped one by one or in
small chunks to capture more information; this process can be repeated any number of times.
2.2.1.2.2 ZeBu
On the other hand, Synopsys' ZeBu [16] solution is the emulation approach used by many companies
to allow engineers to run long tests that would take a large amount of time in simulation, without
requiring a huge FPGA prototyping effort. In addition, emulation provides more visibility
than prototyping, as it does not optimize away parts of the design that act as black boxes.
ZeBu stands out because of its fast runs, as it can emulate systems faster than its competitors while
also providing large capacity for big designs. On the other hand, its compilation time is much longer
than that of its competitors, which makes it a worse option when designers have to compile frequently.
ZeBu supports SystemVerilog Assertions (SVA) [17], but there is no support for the Universal
Verification Methodology (UVM) [18] or functional coverage. In addition, ZeBu tracing, used to
extract information from the system, is based on the built-in scan-chain support of Xilinx FPGAs,
which allows gathering the information from all signals within the system, but at the cost of running
at a few tens of hertz. The gathered information is not stored in on-board memories; instead, it is
sent directly to the host server. Note that tracing on an emulation platform is very slow
and the generated output can only be handled by Synopsys tools.
2.2.1.3 Cadence
Cadence [19] is an American multinational company that develops software, hardware and IP used
to design chips, systems and printed circuit boards (PCBs). In addition, it also provides IP
covering interfaces, memory, verification and SoC peripherals, among others.
Palladium [20] is Cadence's emulator solution, which stands out for its compilation speed, as it
compiles more gates per hour than its competitors; that said, it requires more FPGAs, which
results in larger dimensions and higher power consumption.
When it comes to debug features, Palladium supports SystemVerilog Assertions (SVA), the Universal
Verification Methodology (UVM) and functional coverage. In addition, it implements FullVision,
which provides at-speed full visibility of all nets for a few million samples during runtime.
Note that if designers need to capture samples over many millions of cycles, Palladium instead
provides a partial view of a subset of signals in the system that must be pre-selected at
compile time.
Furthermore, Palladium offers the InfiniTrace feature, which allows the user to capture the status
of the system at a given point, revert back to any checkpoint and restart emulation from there.
2.2.1.4 Siemens - Mentor
Mentor [21] is a company focused on electronic design automation (EDA) that was acquired a few
years ago by Siemens. Mentor does not only provide EDA solutions but also simulation tools, VPN
solutions and heat-transfer tools, among others.
Among Mentor’s wide range of products, Veloce [22] is its emulator platform, which differs from
the previous solutions in that it is easier and less expensive to install and does not consume as
much power as its competitors. Unfortunately, it matches neither the compilation speed of
Palladium nor the high performance of ZeBu.
Veloce also supports SystemVerilog Assertions (SVA), the Universal Verification Methodology (UVM)
and functional coverage. In addition, Mentor’s emulator allows on-demand waveform streaming of a
few selected signals without requiring recompilation, and it reduces the amount of data sent to
the host, which improves time-to-visibility.
2.2.2 Embedded IP solutions
Nowadays, most Systems On-Chip integrate embedded debug components to facilitate validation
efforts during post-silicon. Those components are used not only to monitor particular
functionalities at runtime but also to measure performance, observe the internal state of
control signals and allow applying workarounds when critical bugs appear.
Several companies develop embedded debug components, providing not only hardware to be integrated
into the systems but also software tools that ease their management and configuration. Embedded
debug IPs can provide a wide range of features, such as capturing data based on advanced
filtering, synchronizing monitors and generating timestamps to know when certain events
occurred, among others.
It is important to note that some embedded components are designed to be integrated into specific
core architectures and that there is no defined standard followed by all companies; that is, each
company provides its own solution. In this section, we give an overview of the most interesting
options in the market that provide embedded IP to address observability in post-silicon
debugging.
Figure 1. Debug embedded IP integration on a system
2.2.2.1 ARM
ARM [23] is a company well known for its wide range of products, which go from complex designs
such as complete Systems On-Chip to small designs such as memories or interfaces. ARM develops
the architecture and licenses it to other companies, which design their own systems integrating
those architectures.
CoreSight [24] technology provides not only an exhaustive range of debug components that can be
integrated into customers’ SoCs but also a set of tools to debug software execution on ARM-based
SoCs. This technology allows gathering information from the system and modifying the state of
parts of the design.
The main CoreSight components are:
● CoreSight Embedded Trace Macrocell (ETM)
● System Trace Macrocell (STM)
● Trace Memory Controller (TMC)
2.2.2.1.1 Embedded Trace Macrocell
The Embedded Trace Macrocell (ETM) [25] component provides the capability to gather information
regarding instruction execution, so that the user can analyze the behavior of the processor by
looking at which instructions have been retired or whether there have been exceptions during the
execution. In addition, this component allows adding timestamp information so that the user can
see how many cycles passed between instructions. ETM-generated traces are used for profiling,
code coverage, and diagnosing problems that are difficult to detect, such as race conditions.
The debugger can configure the ETM through a JTAG interface. The generated trace information can
be either exported through a trace port and analyzed on the host, or stored in an on-chip
buffer, known as the Embedded Trace Buffer (ETB) [26], which can later be read through the
external interfaces.
It is important to note that the ETM implements a compression algorithm to reduce the amount of
information generated, which reduces both the number of additional pins required on the ASIC
and the amount of memory required by the Embedded Trace Buffer.
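The bandwidth saving behind such compression can be illustrated with a toy encoding (a hypothetical simplification for illustration, not ARM's actual ETM protocol): instead of emitting a full program counter for every retired instruction, the encoder emits one taken/not-taken bit per instruction and a full address only at periodic synchronization points.

```python
# Sketch: trace compression in the style of an instruction-trace encoder.
# SYNC packets carry a full address (rare); BRANCH packets carry a single
# bit (common), so the trace shrinks dramatically versus raw addresses.

def compress_trace(pc_trace, sync_period=8):
    """Encode a sequence of PCs as sync addresses plus branch-taken bits."""
    packets = []
    for i, pc in enumerate(pc_trace):
        if i % sync_period == 0:
            packets.append(("SYNC", pc))            # full address, rare
        else:
            taken = pc != pc_trace[i - 1] + 4       # 4-byte sequential step
            packets.append(("BRANCH", int(taken)))  # single bit, common
    return packets
```

The decoder on the host can reconstruct the full address stream from the program binary plus these branch bits, which is why the trace port needs far fewer pins than a raw-address trace would.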
2.2.2.1.2 System Trace Macrocell
The System Trace Macrocell (STM) [27] is a software-driven trace source; that is, it focuses on
enabling real-time instrumentation of the software, providing information while the core is
executing with no impact on system behavior or performance.
The main purpose of the STM is to allow developers to embed printf-like functionality within
their code. This is valuable because printf typically goes through a UART, which is expensive not
only in terms of speed but also because the developer must embed a large library and exception
handlers within the embedded application, causing overhead. With STM trace, the developer instead
adds a single store to the address of an STM channel so that the message is transferred and, at
some point, the host can capture it with no impact on core behavior.
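The single-store mechanism can be sketched as follows; the base address, the per-channel stride and the trace_port list are hypothetical stand-ins used for illustration, not the real CoreSight STM register map.

```python
# Sketch: software instrumentation via an STM-like channel (hypothetical model).
# A printf over UART formats and transmits the string on the core; an STM-style
# trace is a single store to a channel address, leaving timestamping and
# transport to the debug hardware.

STM_BASE = 0x20000000          # hypothetical base address of the stimulus ports
CHANNEL_STRIDE = 0x100         # hypothetical per-channel address stride

def stm_channel_address(channel):
    """Each channel is a distinct memory-mapped address."""
    return STM_BASE + channel * CHANNEL_STRIDE

trace_port = []                # stands in for the hardware trace stream

def stm_write(channel, value):
    """One store: the core only issues a write; the STM adds the metadata."""
    trace_port.append({"addr": stm_channel_address(channel), "data": value})
```

Because the core's only cost is the store instruction itself, the instrumentation is effectively non-intrusive compared with a UART printf.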
2.2.2.1.3 Trace Memory Controller
The Trace Memory Controller (TMC) [28] is an infrastructure component that can be configured to
support three different behaviors for storing or routing the trace messages generated by the
Embedded Trace Macrocell (ETM). The TMC can serve as a trace buffer, as a trace FIFO, or as a
trace data router over AMBA AXI to memory or to off-chip interface controllers. These three
configurations are explained below:
● Embedded Trace Buffer (ETB):
Provides a circular buffer to store trace messages and supports up to 4GB of SRAM.
It is important to note that traces cannot be extracted until trace capture has finished,
which can happen after receiving a trigger signal. If the traced data exceeds the buffer
size, the oldest samples are overwritten.
● Embedded Trace FIFO (ETF):
Provides a FIFO to store trace messages in dedicated SRAM. It differs from the ETB in
that trace messages are not lost or overwritten when the external debugger does not
read fast enough; instead, back pressure is applied to the source component that
generates the trace messages whenever the FIFO is full.
● Embedded Trace Router (ETR):
Provides a mechanism in which trace messages are converted to AMBA AXI protocol so that
they can be routed to either system memory or any other AMBA AXI slave.
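The difference between the first two configurations — overwrite-oldest versus back pressure — can be modeled with two small classes (behavioral sketches, not the actual CoreSight implementation):

```python
# Sketch contrasting the ETB and ETF behaviors described above.
from collections import deque

class ETB:
    """Circular buffer: when full, the oldest trace samples are overwritten."""
    def __init__(self, depth):
        self.buf = deque(maxlen=depth)

    def push(self, sample):
        self.buf.append(sample)          # silently drops the oldest when full

class ETF:
    """FIFO: when full, back pressure stalls the trace source instead."""
    def __init__(self, depth):
        self.buf = deque()
        self.depth = depth

    def push(self, sample):
        if len(self.buf) >= self.depth:
            return False                 # back pressure: source must retry
        self.buf.append(sample)
        return True

    def pop(self):
        return self.buf.popleft()
```

The ETB trades completeness for non-intrusiveness (the source never stalls), while the ETF trades the opposite way: nothing is lost, but the trace source may be throttled.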
2.2.2.2 SiFive
SiFive [29] is a young semiconductor company focused on developing the potential of the
open-source RISC-V ISA. It designs and sells a wide range of RISC-V cores, which can be used for
general-purpose applications or to accelerate artificial intelligence and machine learning
workloads, as well as SoCs where the customer can choose the memory interface and peripherals,
IPs, and development boards.
Among all the SiFive products there is SiFive Insight [30], presented as the first advanced trace
and debug solution for RISC-V, giving designers access to debug capabilities to bring up first
silicon, support hardware and software integration, and debug software applications. SiFive
Insight includes a wide range of options that go from run-control debug, where designers can
control how the core operates, to advanced multicore trace solutions. Usually, customers acquire
a SiFive RISC-V core with these features already integrated, so their integration is already
verified.
2.2.2.2.1 Debug Module
The Debug Module is intended to act as the interface between the debug infrastructure and the
external interface, which is usually based on the JTAG protocol, and it follows the RISC-V debug
specification [2]. Run-control operations are received by the Debug Module, which then connects
to the CPU and sends the required RISC-V debug commands to enable operations such as breakpoints
or single stepping.
In addition, the Debug Module can optionally include a System Bus Access (SBA) in order to access
memory or registers without interrupting the core.
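As an example of the kind of run-control command such a Debug Module processes, the sketch below halts a hart using the DMI register layout of the RISC-V debug specification v0.13 (dmcontrol at 0x10 with its haltreq and dmactive bits, dmstatus at 0x11 with allhalted); the dmi_write/dmi_read callables are hypothetical stand-ins for the JTAG transport.

```python
# Sketch: halting a RISC-V hart through the Debug Module's DMI registers.
# Register addresses and bit positions follow the RISC-V debug spec v0.13;
# the transport functions are stand-ins for a real JTAG DTM.

DMCONTROL = 0x10      # debug module control register
DMSTATUS  = 0x11      # debug module status register
HALTREQ   = 1 << 31   # request the selected hart(s) to halt
DMACTIVE  = 1 << 0    # debug module enable bit
ALLHALTED = 1 << 9    # all selected harts are halted

def halt_hart(dmi_write, dmi_read):
    dmi_write(DMCONTROL, DMACTIVE)             # activate the debug module
    dmi_write(DMCONTROL, DMACTIVE | HALTREQ)   # assert the halt request
    while not dmi_read(DMSTATUS) & ALLHALTED:  # poll until the hart halts
        pass
    dmi_write(DMCONTROL, DMACTIVE)             # deassert haltreq, stay active
```

Once the hart is halted, further debug commands (reading registers, single stepping) proceed through the same DMI interface.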
2.2.2.2.2 Nexus Encoder
The SiFive cores also provide the option to be connected to a Nexus 5001-compliant trace encoder
[31][32], similar to ARM’s Embedded Trace Macrocell. It is a highly configurable component that
enables multicore trace, allowing the designer to know which instructions are being retired,
which branches have been taken, or whether exceptions have been raised.
2.2.2.3 UltraSoC
UltraSoC, recently acquired by Siemens and renamed Tessent Embedded Analytics [33], is one of the
most important companies in post-silicon debugging due to its wide range of embedded monitoring
hardware products [34] for complex SoCs, which accelerate silicon bring-up and help optimize
product performance. Its products are used in several fields, such as the automotive,
high-performance computing, storage and semiconductor industries. In this section we give an
overview of the most interesting embedded products that UltraSoC offers for embedded monitoring.
UltraSoC components can be separated into three types:
➢ Communicator modules: Interface between the UltraSoC message infrastructure and the
debugger.
➢ Message infrastructure: Connection of the UltraSoC modules together into a hierarchy.
➢ Analytic modules: Monitoring and control of system resources.
The UltraSoC components are connected in a tree topology in which the components at the top are
the Communicator modules, which allow the user to manage the debug subsystem. These are connected
to the message infrastructure components, which interconnect the whole system and allow
configuring the analytic modules to gather data from the system, as can be seen in the image
below.
Figure 2. Example of UltraSoC embedded components integration
Next, we can see the different embedded components available.
2.2.2.3.1 Communicator modules
The communicator components act as an interface between the debugger and the analytic components
in the system. The debugger can be either an external debugger that communicates through I/O
pins, or a core in the system that is able to configure the UltraSoC components and use the
gathered information to manage the system.
2.2.2.3.1.1 JTAG Communicator
The JTAG Communicator provides low-bandwidth connectivity to the UltraSoC infrastructure via an
IEEE 1149.1 scan-test interface. It enables a basic flow-control mechanism for burst
communications (input data) in addition to the host-initiated symmetric data transfer (data
scanned in and out simultaneously).
It is widely recommended in the semiconductor field to integrate a JTAG interface that allows
accessing the post-silicon debug infrastructure, as it does not require a complex configuration
and follows a standard, easy-to-use protocol. On the other hand, it is important to note that,
due to its low bandwidth, it does not allow extracting huge amounts of data in parallel, which
can be a problem for big SoCs.
2.2.2.3.1.2 USB Communicator
The USB Communicator provides medium-speed debug communications. It implements an
UltraSoC-designed, cut-down USB MAC core with a fixed configuration for a pair of bulk data
endpoints. It is intended for direct connection to a USB PHY interface, to enable a dedicated
debug channel, or to the optional UltraSoC USB debug hub core.
The USB Communicator is autonomous, requiring no host-processor or software intervention.
UltraSoC has two versions of the USB Communicator: one that acts as a device, and another that
acts as a hub, allowing the USB PHY interface to be shared with USB devices other than the
UltraSoC one.
2.2.2.3.1.3 Universal Streaming Communicator
The Universal Streaming Communicator (USC) provides a proprietary low overhead communications
protocol suitable for low, medium or high bandwidth communications using existing established
physical signaling.
2.2.2.3.1.4 Aurora communicator
The Aurora Communicator utilizes Xilinx’s Aurora Protocol [35] over a PIPE interface to provide a
high bandwidth output from the UltraSoC infrastructure to an external Aurora based probe. The
Aurora Communicator is unidirectional so it only allows data to be transmitted out of the UltraSoC
infrastructure.
2.2.2.3.1.5 Bus Communicator
The Bus Communicator enables software running inside the chip to drive the UltraSoC system as
though it were a debugger, through a bus slave interface such as AMBA AXI4 or similar. The
purpose of this functionality is to allow the software to monitor itself, for example during
pre-deployment soak-testing and post-deployment/early-adoption phases. This enables critical
software and on-chip activity to be monitored using either background software on the
application processor or a smaller processor core dedicated to housekeeping activities.
2.2.2.3.2 Message infrastructure modules
The message infrastructure components are used to connect the communicator components with
the analytic components allowing to route messages in a tree-based topology. In addition, some of
the message infrastructure components allow broadcast operations to inject events that are
propagated to all the analytic modules which can enable or disable tracing.
2.2.2.3.2.1 Message Engine
The Message Engine connects analytic modules and communicators together. Message Engines route
messages and events between their interfaces and are arranged in a tree topology.
The Message Engine is a universal block that can be connected hierarchically to form a complex
debug fabric based on the UltraSoC message interface. It employs multiple logical flows so that
independent message streams can be routed and buffered with different priorities.
2.2.2.3.2.2 Message Slice
The Message Slice component can be used to break the combinatorial path between one message
interface and another by registering the signals in each direction. It is intended to aid timing closure
by breaking a long message link into two shorter links. The Message Slice component does not add
functionality and has no run-time configuration. It has parameterized interface data path widths
and is intended to aid integration.
2.2.2.3.2.3 Message Lock
The Message Lock component can be used to block a specific message interface. It is intended to be
used in systems where security mandates fine granularity of access permissions to UltraSoC
components. This component can be placed on any message interface in an UltraSoC system. It is
typically used on the interface between the communicator components and the Message Engine
which, along with the analytic components attached to the Message Engine, may need to have
access blocked for security purposes.
When the system’s security function determines that access is not allowed on a particular interface,
it sets the lock request on the lock control interface. The Message Lock provides indication that the
interface is locked.
2.2.2.3.2.4 System Memory Buffer
The System Memory Buffer (SMB) provides network-level storage and output to system memory via a
system bus interface. This component is used when the amount of gathered data is so large that it
cannot be absorbed by the communicators due to their restricted bandwidth; the data is then
forwarded to the SMB, which is connected to an SoC memory and uses it to store the data. This
component can be configured in two different modes:
- In double-ended operation, the SMB autonomously reads messages back from system memory and
directs them to a suitably configured communicator, such as USB or JTAG.
- In single-ended operation, the SMB stores the messages in system memory, and the software is
responsible for extracting this information by accessing the target memory.
2.2.2.3.3 Analytic modules
The analytic components are used to monitor and control the behavior of the system, allowing not
only gathering information from the system but also injecting read and write requests that can
change the configuration of the SoC. These modules can be considered the leaves of UltraSoC’s
tree-based topology, as they have two interfaces, connected to the Message Engine and to the
system resources, respectively.
2.2.2.3.3.1 Direct Memory Access (DMA)
The DMA analytic module provides direct memory access. It enables the debugger to issue
transactions through an AMBA AXI4 interface in order to read or write a block of memory or access
registers. This module is intended to enable program loading in an UltraSoC system as well as
directed memory inspection via read accesses and manipulation via write accesses.
2.2.2.3.3.2 Processor Analytic Module
The Processor Analytic Modules are intended to control the processor cores, extract
processor-related status information and performance-monitor values, and integrate with the
processor’s existing debug support, which may additionally provide instruction and/or data trace.
There are two main variants: the BPAM (Bus Processor Analytic Module) and the JPAM (JTAG
Processor Analytic Module). The former uses an AMBA APB3 interface to connect with the system
resources, while the latter does the same through a JTAG interface.
2.2.2.3.3.3 Trace Encoder
The Trace Encoder provides a mechanism to monitor the program execution of a CPU in real time. It
encodes instruction execution and, optionally, load/store activity, and outputs data in a highly
compressed format. Filtering can be applied to determine what and when to trace. In addition,
optional statistics gathered through counters are also available. The Trace Encoder only supports
RISC-V-compliant CPUs.
Software running externally can take this data and use it to reconstruct the program execution flow.
2.2.2.3.3.4 Trace Receiver
The Trace Receiver module accepts trace data, typically generated by a CPU, and encapsulates the
information into UltraSoC messages.
2.2.2.3.3.5 Bus Monitor
The Bus Monitor analytic module can be used to monitor a master or slave interface on a bus,
point-to-point fabric or on-chip network. The component can be parameterized to have between one
and sixteen filters, and up to 64 metric counters, enabling it to collect several independent,
although potentially overlapping, data sets. Each filter is equipped with a match unit which can
be used to restrict the observable transactions to only those of interest.
The Bus Monitor can passively monitor up to four master or slave bus interfaces compliant with
one of the following standards: AMBA AXI3, AMBA AXI4, AMBA ACE-Lite, AMBA ACE, OCP 2.1 or OCP
3.0. Monitoring configuration can be based on any bus field, such as address range, transaction
identifier range, length, size, burst type, privilege, priority or any other control field.
2.2.2.3.3.6 Status Monitor
The Status Monitor provides a wide variety of monitoring functions that can be used for debug,
diagnostics or performance profiling. It is delivered as a parameterized soft core which allows
capabilities to be balanced against gate count according to the needs of the application on a per
instance basis.
The Status Monitor offers a wide variety of monitoring functions, each of which can be included or
excluded at instantiation through parameterization. In this way, gate count can be adjusted to match
the functional requirements of the application. Monitoring functions are based on the use of
advanced filters which consist of one or more comparators that can be configured to select the
subset of bits to be analyzed, the expected value and the operation to be performed. For example,
the user can connect the Status Monitor to the core pipeline to analyze internal signals from the
core execution; through the comparators, they can select the subset of bits of interest and the
expected value that triggers a hit. Then, depending on the user's configuration, a comparator hit
can either trace data in that same cycle or capture data from previous or following cycles. In
addition, a comparator hit can also be used to generate an event that enables or disables other
filters, or that increases or decreases internal counters.
The Status Monitor filters can be attached to one or more comparators allowing the user to look for
specific values or to match under a range of values. In addition, the user can also configure Status
Monitor counters to gather information about how many times a filter has performed a hit, which
can then be used for performance analysis.
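A minimal model of this comparator-and-filter scheme might look as follows (an illustrative sketch; the field positions, operation names and class interfaces are assumptions, not UltraSoC's actual configuration interface):

```python
# Sketch: comparator-based filtering with hit counters, as described above.

class Comparator:
    """Selects a bit subset of a signal and matches it against a value or range."""
    def __init__(self, lsb, width, op, expected, high=None):
        self.lsb, self.mask = lsb, (1 << width) - 1
        self.op, self.expected, self.high = op, expected, high

    def hit(self, signal):
        field = (signal >> self.lsb) & self.mask   # bit-subset selection
        if self.op == "EQ":
            return field == self.expected
        if self.op == "IN_RANGE":
            return self.expected <= field <= self.high
        return False

class Filter:
    """A filter ANDs its comparators and counts hits for performance analysis."""
    def __init__(self, comparators):
        self.comparators, self.count = comparators, 0

    def observe(self, signal):
        if all(c.hit(signal) for c in self.comparators):
            self.count += 1
            return True
        return False
```

For instance, a filter combining an equality comparator on an opcode field with a range comparator on an address field would count only the transactions where both conditions hold in the same cycle.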
2.2.2.3.3.7 Static Instrumentation
The Static Instrumentation analytic module enables data delivered over a bus fabric such as AMBA
AXI to be automatically converted into UltraSoC instrumentation trace messages and transferred to
the debugger along with the trace messages from other modules in the UltraSoC system. The module
presents blocks of independently enabled channels in order to facilitate acceptance (capture)
and filtering of many trace data streams from the software.
2.2.2.3.3.8 Network On-Chip Monitor
The NoC Monitor analytic module can be used to monitor a set of master or slave NoC channels. It
is parameterized to have between one and sixteen filters, and up to 64 metric counters, enabling
it to collect several independent, although potentially overlapping, data sets. Each filter is
equipped with a match unit which can be used to restrict the observable transactions to only
those of interest. It passively monitors up to four sets of master or slave NoC channels
compliant with the AMBA CHI standard.
2.2.3 Design For Testability (DFT)
The first step during the bring-up stage consists of confirming that the system has been built
correctly, and in order to do that, design for testability (DFT) is used to achieve high fault
coverage during manufacturing tests. That is, designers must be able to detect failures such as
shorts, incorrectly inserted components or faulty components so that they can be fixed. This
verification effort is typically controlled by automatic test equipment (ATE), a powerful test
instrument that not only verifies the design as a whole but also focuses on parts of the SoC,
increasing coverage of the most critical regions.
The design is identified as a fault-free circuit if the responses to the input patterns
introduced during the ATE effort match the expected correct responses. If the circuit fails the
test, a diagnosis is required to identify the root cause of the failure.
Applying those test patterns and checking the expected responses relies on a hardware
implementation known as Design For Testability [36], which consists of one or multiple scan
chains that connect the system flip-flops, providing observability and a testing mode.
The scan chain is a hardware structure that facilitates observability of all the system flip-flops and is
the most common method to drive signals within the system and also to extract information. This
method requires registers (flip-flops or latches) in the system to be connected in one or more chains
allowing the user to introduce patterns or read the information within the registers. Those chains
are then connected to I/O system pins such that the user can shift data into the system and also
extract data from the system. Note that, apart from the I/O system pins used to introduce and
extract information, there are two extra input pins: one to enable scan mode (scan_enable) and
one to control all flip-flop clocks during scan mode (scan_clock).
Figure 3. System design scan-chains integration schematic
In order to drive data into the system, the user must enable scan mode and shift data in via the
scan chain's physical SoC input pins, such that at each clock pulse, controlled via the
scan_clock input pin, a new bit is shifted into the chain. This is done as many times as there
are flip-flops or latches in the scan chain, in order to fill all of them. Then, the user can
disable scan mode so that on the next clock cycle the system flip-flops do not shift their
information to the next flip-flop in the chain; instead, the system logic executes in functional
mode, generating a response.
Next, once the user has shifted in all flip-flop values, disabled scan mode and applied a clock
cycle, the system design response can be extracted. Note that data is extracted the same way: the
user must first enable scan mode, and then at each clock pulse, controlled through scan_clock,
the target scan chain value is shifted out to the system output pins and captured by the user.
Note that at each clock pulse the first flip-flop in the chain receives the user input data
through the system input pins; at the same time, the connected flip-flops receive the value of
the previous flip-flop in the chain, and the last flip-flop in the chain outputs its value
through the system output pins.
This scan-in (SI) and scan-out (SO) methodology allows designers to introduce test patterns,
enable execution and then extract the registered internal values to see if the system performs
the operations as expected. As we can see, this feature allows access to all the flip-flops in
the system. Unfortunately, it requires the system to be stopped, is very slow and does not allow
advanced filtering. In addition, once signals have been shifted out to extract the information,
if the user wants to continue running, those signals must be inserted once again through the
scan-in system pins, which has a severe impact on time.
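The shift-in / capture / shift-out flow described above can be condensed into a small behavioral model (a sketch that ignores the scan muxes and clock gating of a real DFT implementation; the class and function names are illustrative):

```python
# Sketch: a behavioral model of a scan chain.
# scan_enable=1 -> shift mode (one bit per clock); scan_enable=0 -> one
# functional cycle in which the flops capture the combinational response.

class ScanChain:
    def __init__(self, length, logic):
        self.flops = [0] * length     # flip-flop states, index 0 = chain head
        self.logic = logic            # combinational next-state function

    def clock(self, scan_enable, scan_in=0):
        if scan_enable:
            # shift mode: every flop takes the previous flop's value,
            # the tail drives the scan-out pin
            scan_out = self.flops[-1]
            self.flops = [scan_in] + self.flops[:-1]
            return scan_out
        # functional mode: flops capture the logic response
        self.flops = self.logic(self.flops)
        return None

def load_pattern(chain, pattern):
    for bit in reversed(pattern):     # first bit shifted ends up at the tail
        chain.clock(scan_enable=1, scan_in=bit)

def unload_response(chain, length):
    return [chain.clock(scan_enable=1) for _ in range(length)]
```

With a toy "invert every flop" logic function, loading a pattern, clocking once in functional mode and unloading returns the inverted pattern (tail first), mirroring the test-pattern/response flow the ATE performs.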
SoC designs can contain millions of flip-flops, which makes shifting the whole design in and out
extremely slow. Because of that, SoC DFT implementations nowadays use several scan chains, which
can be selected by programming an internal module named the Design Control Unit (DCU), allowing
the user to select the blocks within the SoC to be verified.
On the other hand, it is important to note that there are many synthesis tools on the market that
take care of concatenating the system's internal flip-flops to create the scan chains. In
addition, those synthesis tools also allow creating sub-chains and automatically inserting hubs
and routers to target one or multiple chains.
There are other manufacturing testing methods, such as built-in self-test (BIST), which are used
as a low-cost testing approach. In this thesis we focus on post-silicon validation and assume
that manufacturing tests have been passed successfully, so that no manufacturing faults have
been detected.
2.3 Debugging multicore systems
In this section we give an overview of the most important resources that compose multicore
systems, in order to analyze the basic debug features that must be integrated to provide
observability and allow workarounds in case of critical failures.
2.3.1 I/O peripherals
Systems On-Chip must be able to connect with one or multiple external systems, which can be
either a computer acting as a host or a cluster with thousands of other SoCs. This connection is
based on peripheral components that not only allow an external system to access the SoC for
computation but can also allow the SoC to access resources such as external memories.
Nowadays, multicore systems tend to integrate interfaces such as PCIe in order to provide a
high-bandwidth connection between the SoC and the external world. In addition, it is common to
have external interfaces such as UART or SPI that provide a low-bandwidth, highly reliable
connection for basic testing.
2.3.1.1 Debug features
If PCIe has been integrated in the system, it is highly recommended that users have a debug
infrastructure that allows them to monitor the behavior of this resource, in order to ensure
that it is properly configured and that transactions are performed as expected. Note that this
is critical during the bring-up effort, since users must be able to check that data is properly
loaded into and read from the system.
On the other hand, it is also important that the SoC has a dedicated I/O interface connected to
the system debug infrastructure, allowing users to access the debug features within the system.
This is critical, as it is the entry point for debugging the whole design. JTAG is the preferred
option due to its simplicity, but note that it is a low-bandwidth interface. If the user expects
to gather big chunks of information from the system in real time, then a faster interface such
as USB or PCIe can be integrated and connected to the debug infrastructure as an optional entry
point.
2.3.2 Memory subsystem
Multicore systems make use of caches and complex memory hierarchies in order to ensure that the
cores can access data as fast as possible, reducing latency and increasing performance.
2.3.2.1 Debug features
Given that the memory subsystem receives a large number of requests from the cores, either
through a direct connection to those memories or through the Network On-Chip, it is very
important to ensure that requests are received and responded to properly, and that there are no
hangs that could end up stopping the cores, with critical consequences for functionality.
Due to this, it is highly recommended that the debug infrastructure allows users to monitor the
transactions received by the memories and to ensure that they are answered with the correct
data. In addition, it is also recommended to give the debug infrastructure access to those
memories, so that the user can check the stored data and the state of the cache lines at
runtime.
2.3.3 Master core
One or multiple master cores are used not only to balance workloads among the compute cores but
also to handle critical tasks such as the initial hardware boot, maintenance and managerial
tasks in the SoC, and the configuration of the peripherals associated with those tasks.
2.3.3.1 Debug features
It is essential that the user can perform run-control operations on the master core(s), such as
breakpoints, single stepping or reading the core's internal registers. In addition, it is highly
recommended that the debug run-control integration supports connection through software tools
such as GDB, which has already been introduced in this document.
On the other hand, it is also recommended to add monitoring features that allow the user to
gather information about which instructions are executed by the core, gain visibility of the
requests and responses exchanged with system resources, and observe control signals that can
help debug possible hangs or misbehaviors.
2.3.4 Compute cores
As the name already indicates, a multicore system can integrate from tens to thousands of
processors, potentially creating large groups of threads that can handle the computation of big
workloads. In addition, as we have noted before, there are multiple cache hierarchies
distributed around the system that are used to improve performance.
2.3.4.1 Debug features
In the same way as for the master core(s), it is essential to allow the user to perform run control operations such as setting breakpoints, single stepping or reading the core’s internal registers, and to support software tools such as GDB.
On the other hand, it is also recommended to provide visibility of the operations performed by the compute cores and of the requests they have sent. This allows the user to debug cases such as a core waiting for a response that never arrives, a core taking a wrong branch, or a hardware bug triggered by a specific instruction.
2.3.5 Interconnection network
Systems On-Chip used to integrate bus- and crossbar-based architectures to interconnect the system resources, with crossbars providing dedicated point-to-point connections between them. Nowadays, systems have increased in size and complexity, and these point-to-point connections are less cost-effective and often become the bottleneck of the system. At the same time, systems must be able to scale, be reusable and manage power consumption.
Networks On-Chip (NoC) appeared to handle all these challenges. The main objective is to ensure that a request or response travels efficiently from one resource to another, for example from a core to on-chip memory. A message is divided into flits and injected into the network; the flits are then routed through the interconnection network over links and routers until they arrive at the destination.
A flit is the smallest unit that composes a packet and matches the link width: the flit is the link-level unit of transport, and packets are composed of one or more flits.
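The flit/packet relation can be sketched as follows. The sizes are hypothetical (a 16-byte flit corresponding to a 128-bit link); actual link widths are design-specific.

```python
# Sketch of segmenting a NoC packet payload into link-width flits.
FLIT_BYTES = 16  # hypothetical 128-bit link

def packetize(payload: bytes) -> list:
    """Split a packet payload into flits, zero-padding the tail flit."""
    flits = [payload[i:i + FLIT_BYTES]
             for i in range(0, len(payload), FLIT_BYTES)]
    if flits and len(flits[-1]) < FLIT_BYTES:
        flits[-1] = flits[-1].ljust(FLIT_BYTES, b"\x00")  # pad last flit
    return flits

flits = packetize(b"A" * 40)  # a 40-byte packet needs 3 flits on this link
```

The receiving endpoint reassembles the packet by concatenating the flits and discarding the padding, which is why a real header flit also carries the packet length.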
NoCs are becoming increasingly complex as they include features such as routing protocols, arbitration and flow control, while they must remain reusable to reduce cost, design risk and time-to-market impact.
Figure 4. Network On-Chip mesh topology
One of the Network On-Chip challenges is handling multiple endpoint interfaces that may use different protocols or voltages, or run at different clock speeds. On the other hand, because the NoC carries most of the SoC traffic, it can raise serious security concerns: master interfaces can corrupt data, degrade performance or even leak sensitive data. There is therefore a special need to ensure that transactions are always forwarded through the correct channels, so that they do not end up at the wrong endpoints and create a security hole.
Multiple studies [37][38] have shown that a large percentage of post-silicon errors are detected in the NoC, so it is crucial to add support for monitoring request and response behavior.
2.3.5.1 Debug features
First, there must be visibility features on the connection between the Network on-Chip and the SoC
blocks so that the user can detect if a request has been lost within the NoC or if it is a problem
within the SoC block itself. This monitoring feature must be able to handle the multiple interface
protocols used in the different SoC blocks. This idea was introduced a long time ago [39] together
with the options of adding configurable monitors to observe router signals and generate
timestamped events [40]. In addition, the Design For Testability (DFT) approach allows the use of scan chains to apply tests to specific components: test data can be shifted in and applied to the component under test, and the results and traces can then be read from the SoC scan output pins and analyzed on the host.
It is important to note that adding those features has two major drawbacks: area impact and performance overhead. The area impact can be reduced by using small buffers and constraining the number of signals being analyzed, which sacrifices observability. The performance overhead arises when the traced information is sent through the same Network On-Chip that is being analyzed and shared by the SoC functional resources; in that case it is highly recommended to use a simplified messaging infrastructure that separates the debug traffic from the non-debug traffic.
2.4 Debug infrastructure on multicore systems
From the previous analysis of the options available in the market, we can conclude that UltraSoC is the company that provides the widest range of options, allowing faster integration of a debug infrastructure in a System On-Chip. On the other hand, some of the analyzed options do not fit all multicore system requirements: ARM’s debug components must be integrated with ARM cores, and not all SoCs rely on ARM cores; for example, Esperanto Technologies has developed its own processors based on the RISC-V architecture. Alternatives such as SiFive provide RISC-V-compatible debug modules, but depending on the processor design the Debug Module may already be integrated in the system. Finally, Synopsys provides fast integration and verification, but its debug proposals cannot be implemented in silicon as they are based on prototyping.
The outcome of this analysis matches the conclusions of previous research done in 2018, when Esperanto Technologies was designing its first System On-Chip, known as the ET-SoC-1. At that point in time Esperanto Technologies decided to use UltraSoC components, as they allowed fast integration and verification with a small number of engineers, which was a key factor for Esperanto’s first product in order to meet the time-to-market constraint.
3. Many-Core SoC design
In this section we analyze Esperanto’s System On-Chip architecture as an example of a multicore system. The analysis focuses on its debug infrastructure integration and evaluates the design and use cases.
Esperanto’s System On-Chip, named ET-SoC-1, has two kinds of general-purpose 64-bit RISC-V cores. The first one is the ET-Maxion, which implements advanced features such as quad-issue out-of-order execution, branch prediction, and sophisticated prefetching algorithms to deliver high single-thread performance. The other one, the ET-Minion, is an in-order core with 2 hardware threads and a software-configurable L1 data cache; it has been designed for energy efficiency, delivering TeraFlops and TeraOps of computing.
The ET-Minion cores are integrated in groups of 32, forming “Minion Shires”. Within each “Minion Shire” the ET-Minion cores are split into 4 groups of 8 cores, named “Neighborhoods”. The “Minion Shire” then has a crossbar between the Neighborhoods and a private 4MB L2 cache, as shown below.
Figure 5. Esperanto Technologies ET-SoC-1 Minion Shire architecture
The ET-SoC-1 has 34 Minion Shires, which share an L3 memory that is split between them, 4 ET-Maxion processors, and 32GB of memory with 137GB/sec of memory bandwidth. The Network On-Chip follows a mesh topology that connects all the Minion Shires, the ET-Maxion cores, the memories and peripherals such as the PCIe.
The ET-SoC-1 debug infrastructure is based on the integration of the UltraSoC IP components
introduced in previous sections. Those components are integrated in the different SoC blocks, which
are known as shires, in order to provide visibility and, in some cases, workarounds for
non-functional components in the system during the initial stages of bring up.
3.1 I/O Shire
3.1.1 Architecture
The IOShire is an essential block in the system as it contains the Service Processor, which is the SoC
master core, and its ancillary components which are responsible for booting the whole system. Note
that there is a single instance of the IOShire block on Esperanto’s system.
The Service Processor (SP) is a dedicated standalone ET-Minion core which has access to all the
resources on the SoC that any other agent can access, and it also has exclusive access to some
resources that no other agent can access. It is responsible for booting the SoC and providing
runtime services for other SoC components.
On the other hand, the IOShire also contains the ET-Maxion cores and I/O devices. Additionally, the
IOShire contains the USB and JTAG interfaces used to access the debug infrastructure from an
external host.
In effect, the IOShire can be compared to a smaller SoC given its implementation and resources. The primary roles of the IOShire within the ET-SoC-1 are as follows:
● Integrates the Service Processor, which performs maintenance and managerial tasks in the
SoC, and associated peripherals for performing those tasks.
● Handles initial hardware boot of the SoC from the deassertion of the primary reset until the
Service Processor begins executing firmware code.
● Provides security/access control mechanisms for the SoC including secure debug with
various levels of authentication to gain debug access to various aspects of the SoC.
● Manages and supplies major SoC clocks and resets to other SoC components.
● Integrates general purpose peripherals for use by other agents in the SoC such as the
ET-Minion Cores and ET-Maxion, excluding PCIe and DDR which have their own shires.
● Serves as the primary entry point for the UltraSoC components, including housing
peripherals which provide access to the debug infrastructure.
● Integrates the Maxion shire, which includes the 4 high-performance out-of-order ET-Maxion
cores, a shared L2 cache, and supporting components.
The IOShire is connected with the rest of the shires in the SoC through the Network On-Chip (NoC),
which connects all the shires in the SoC together in a mesh topology.
3.1.2 Debug features and UltraSoC integration
3.1.2.1 Clock and reset unit
The System On-Chip clock and reset unit (CRU) is responsible for controlling the reset signals of all
system resources, which include not only the Phase-Locked Loops (PLL), Network On-Chip and the
cores, but also all the system peripherals and caches.
The default behavior of the system is such that the Service Processor is responsible for performing
the boot initialization sequence in which it configures the PLLs and accesses the CRU to release the
reset of several resources.
As can be seen, it is critical that those boot sequence steps can be performed; otherwise, the rest of the SoC cannot be accessed, meaning that a bug in the Service Processor, or in its connection with the Network On-Chip, would prevent the user from using any of the resources in the system.
In order to address this possible issue, an UltraSoC BPAM component is connected to the CRU and the I/O Shire PLLs through an AMBA APB3 interface, so that the user can control the resets and clocks of all the system resources without relying on the Service Processor or the NoC, allowing the rest of the SoC to be verified in case of a critical failure.
3.1.2.2 Registers and memory access
During the post-silicon validation effort the user may be interested in reading not only the shire registers but also the values read or written in the memories by the cores. For this reason, the I/O Shire integrates the UltraSoC DMA, which is connected to the Network On-Chip through an AMBA AXI4 interface and provides full visibility of the system resources, allowing the user, for example, to load code or dump the register values of all SoC blocks.
Because the connection to the main Network On-Chip in the IOShire is split into two different NoCs for security, there are two UltraSoC DMA modules, one connected to each NoC for redundancy. The first IOShire NoC instance, known as the SPIO NoC, is integrated in the sub-block that contains the Service Processor and provides full visibility of all the resources in the SoC. The second NoC instance, known as the Peripheral Unit (PU) NoC, is integrated in a different sub-block that contains several peripherals, such as the UART or SPI, and has visibility of all resources in the SoC except the Service Processor and the security components.
In case of a bug in the Service Processor, the user can use the UltraSoC DMA component connected to the SPIO NoC to configure all the resources in the SoC, including the SPIO peripherals. However, if the SPIO NoC itself had a bug, the system could not be configured from the Service Processor or from the SPIO UltraSoC DMA, and no further debugging of the rest of the shires could be done. To avoid this, Esperanto’s design team decided to add the PU UltraSoC DMA component, which has visibility of the resources of the remaining SoC shires.
3.1.2.3 Service Processor
The Service Processor is a RISC-V core for which Esperanto has implemented the RISC-V External
Debug Support Version 0.13 specification [2]. The Service Processor Debug Module, which is
accessible through an AMBA APB3 interface, allows the user to perform run control operations such
as halt, resume, single stepping or setting breakpoints, and supports external software tools such as
GDB. The Debug Module is accessible through the UltraSoC BPAM that is also connected to the CRU
and the PLLs, so that there is a multiplexer that the user can control to target those resources.
On the other hand, given that the Service Processor is a custom core developed by Esperanto and this is the first time it has been taped out, the design team integrated other UltraSoC components to increase observability and to allow the user to detect possible bugs either in the hardware integration or in the firmware code.
An UltraSoC Trace Encoder component has been attached to the Service Processor allowing the user
to trace all the instructions and exceptions during code execution. The Trace Encoder captures the
taken branches and, based on the code being executed, the user can reproduce the sequence of
instructions and branches through UltraSoC software tools on the host.
In addition, an UltraSoC Status Monitor component has also been connected to the Service Processor, allowing the user to capture finite-state machine values and control signals, and to track transactions. This module not only enables the user to debug the pipeline, internal cache or other resources within the core, but is also connected to the interface between the Service Processor and the Network On-Chip, so it can be used to monitor the requests and responses exchanged with the rest of the system resources.
3.1.2.4 Maxion validation
The ET-Maxion cores are based on the RISC-V Berkeley Out-of-Order Machine (BOOM); as such, there is a Debug Module that allows the user to perform run control operations following the RISC-V External Debug Support Version 0.13 specification [2]. The Maxion Debug Module is accessible through a JTAG interface, so the Esperanto design team has integrated an UltraSoC JPAM component that allows the debugger to configure the ET-Maxion cores to perform run control operations such as the ones described for the Service Processor.
On the other hand, given that the ET-Maxion cores have been customized by Esperanto there are
also other UltraSoC components integrated in the I/O Shire whose goal is to increase observability
and allow the user to detect bugs.
The Maxion sub-block has two UltraSoC Status Monitor components which provide visibility of both
the ET-Maxion core control signals and the Maxion cache hierarchy. In addition, there is also an
UltraSoC Bus Monitor component that allows the user to capture the requests and responses on the
master and slave AMBA AXI interfaces that ET-Maxion cores use to communicate with the rest of
the system resources through the Network On-Chip.
Finally, an UltraSoC BPAM component has also been integrated in the Maxion shire, allowing users to access the Maxion configuration registers either for visibility or to apply different configurations.
3.1.2.5 Debugger hardware interface
The debug subsystem can be accessed through JTAG and USB interfaces, which have been
integrated in the I/O Shire. From the point of view of the user this is transparent as the JTAG and
USB encoding protocols are handled by a software layer in the host.
3.1.2.5.1 External interface with the host
The JTAG interface is simple but slower; it is used to access the SoC during the first bring-up stages to check that the UltraSoC components can be accessed, to perform basic debug operations and to configure the USB interface if needed.
The USB interface is faster but, due to its complexity, it is not ready to be used as soon as the SoC external reset is released; it is used when the user needs to gather as much information as possible from the system, and it requires prior configuration.
3.1.2.5.2 Cores communication with the host
The Service Processor is a critical piece of Esperanto’s SoC, so during the bring-up effort it is important that the user has a way for the core to communicate with the external debugger. Esperanto’s design integrates an UltraSoC Static Instrumentation component in the I/O Shire that the Service Processor can write data to as if it were a mailbox. Once the Service Processor writes data to the UltraSoC Static Instrumentation, the external host receives a message that can contain either the written data or a notification that new data is available, depending on the configuration applied by the user.
This allows data to be passed from the software on the debug target to the debugger. A number of
software APIs are available and with an appropriate driver this module can support most popular
tools, such as LTTng or Ftrace.
3.1.2.5.3 Debug through Service Processor
The debug components are also configurable through the Service Processor via the UltraSoC Bus Communicator module, which allows the master core to configure and manage the SoC debug infrastructure. That is, the Service Processor can send requests to the UltraSoC modules as if it were an external debugger; to do so, it communicates through an AMBA AXI4 interface targeting the Bus Communicator internal registers.
The UltraSoC Bus Communicator internal registers are used to write the requests to be performed and to read the status of those requests and/or their responses. The complexity of the communication between the UltraSoC Bus Communicator and the Service Processor lies in the fact that the Service Processor has to encode the debug configuration messages and also decode the responses: the UltraSoC Bus Communicator does not encode the requests itself, it just serves as a mailbox between the debug components and the Service Processor.
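The mailbox-style exchange can be sketched as follows. All register offsets, the status bit and the word-level flow are hypothetical: the point is that the Service Processor firmware must push pre-encoded request words and poll for response words itself, since the Bus Communicator performs no encoding.

```python
# Sketch of the Service Processor <-> Bus Communicator mailbox exchange.
# Offsets and the RX_VALID bit are hypothetical illustration values.
STATUS, TX_DATA, RX_DATA = 0x00, 0x04, 0x08
RX_VALID = 1 << 0  # hypothetical "response word available" status bit

def send_debug_request(read_reg, write_reg, message_words):
    """Write an encoded debug message, then poll for one response word."""
    for word in message_words:              # push the encoded request
        write_reg(TX_DATA, word)
    while not read_reg(STATUS) & RX_VALID:  # wait for a response to arrive
        pass
    return read_reg(RX_DATA)                # pop a response word to decode
```

The firmware layer above this routine is what actually knows the UltraSoC message format, both for building message_words and for decoding the returned words.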
3.1.2.6 Security
As we have seen in previous sections, the debug infrastructure can be accessed through either JTAG or USB; as a result, the system is exposed to attacks that try to access internal resources and cause malfunctions. To prevent this, the System On-Chip has a component that is in charge of enabling or disabling access to the rest of the debug components in the SoC.
The security component can only be accessed through the UltraSoC JPAM and requires the user to complete a challenge based on public- and private-key authentication. The only debug components accessible without performing the security authentication are the JTAG and USB interfaces and the UltraSoC JPAM; these modules are left unlocked so that the user can perform the handshaking challenge, ensuring that the rest of the system debug components are only accessed by verified users.
In order to block access to the UltraSoC analytic components, there is a small module named Message Lock that receives the output signals of the security component and enables or disables the interface between the debug infrastructure and the UltraSoC components.
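The shape of such a challenge/response unlock can be sketched as below. This is not Esperanto's actual scheme: the real design uses public/private-key authentication, and an HMAC over a random nonce is used here purely as a self-contained stand-in for the asymmetric signature.

```python
# Sketch of a challenge/response unlock gating the Message Lock modules.
# HMAC-SHA256 stands in for the real public/private-key authentication.
import hashlib
import hmac
import os

def issue_challenge() -> bytes:
    """SoC side: generate a fresh nonce for the debugger to sign."""
    return os.urandom(16)

def answer(nonce: bytes, key: bytes) -> bytes:
    """Host side: authenticate the nonce with the (private) key."""
    return hmac.new(key, nonce, hashlib.sha256).digest()

def unlock(nonce: bytes, response: bytes, key: bytes) -> bool:
    """SoC side: verify the response; on success, enable Message Locks."""
    expected = hmac.new(key, nonce, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A fresh nonce per attempt is what prevents a captured response from being replayed to unlock the debug infrastructure later.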
Below we can see the distribution of the debug components within the IOShire, which are placed in
the three sections that have been previously introduced: Peripheral Unit (PU), Maxion and SPIO
blocks. For each of the analytic modules there is an UltraSoC Message Lock component that ensures
that no requests can be sent nor information can be extracted if the SoC is not configured at the
correct security level.
Figure 6. IOShire UltraSoC component integration
3.2 Minion Shire
3.2.1 Architecture
As has been introduced in previous sections, the Minion Shires integrate the system cores that
handle the computation of the different workloads and contain multiple cache hierarchy levels to
increase performance.
The ET-Minion cores within a Minion Shire have direct access to the L1 cache but also rely on the Minion Shire L2 cache. The former is split into two cache memories: one for instructions and one for data.
Due to architectural decisions that are out of the scope of this thesis, the Minion Shire is split into different voltage domains, such that the ET-Neighborhoods operate at a different voltage from the rest of the Minion Shire resources. For this reason, the L1 cache is integrated within the ET-Neighborhood to avoid delays in access to instructions and data, so that the ET-Minion cores can fetch new instructions and perform load and store operations as fast as possible.
The L1 instruction cache is shared between the eight ET-Minion cores in the neighborhood, since parallelism is a key factor for the system. The L1 data cache, on the other hand, is private to each ET-Minion and is not shared, so that the cores do not depend on others to access data.
The Minion Shire also integrates an L2 cache memory that stores both data and instructions and is shared by all the cores within the Minion Shire, that is, by 32 ET-Minion cores. The L3 cache memory is also part of the Minion Shire but is shared by all the Minion Shires in the SoC, such that the whole cache is split across the system. These two cache memories, L2 and L3, are integrated within the Minion Shire in a block named the Shire Cache, separated from the ET-Neighborhoods and at a different voltage.
In the image below we can observe the different Minion Shire blocks and the connection of the ET-Neighborhoods with the Shire Cache through a crossbar, which allows the ET-Minion cores to access the L2 cache. In case of a miss in the L2 cache, the request is forwarded to the L3 cache, and the requested region can reside in the same Minion Shire or in a different one.
Figure 7. Minion Shire voltage and clock internal architecture diagram
3.2.2 Debug features and UltraSoC integration
The Minion Shire is replicated many times in the SoC, which means that any debug component added to it has a large aggregate impact on area and power. At the same time, the ET-Minion cores within the Minion Shire handle all the computation and their performance is critical, so as much visibility as possible must be provided.
On the other hand, due to security concerns it is important to ensure that access to the UltraSoC components is restricted; because of that, Esperanto’s design makes use of the UltraSoC Message Lock component. Two security levels have been defined: full access to the debug infrastructure, allowing access to all monitor modules such as the UltraSoC Status Monitor, and a second level that only allows the user to perform run control operations on the compute cores, which is intended for running debugging software such as GDB.
3.2.2.1 Software application debugging
The most important feature for software debugging is the ability to perform run control operations on the ET-Minion cores, so that the user can set breakpoints, enable single stepping or read the core internal registers. The user must also be able to access performance counters and other Minion Shire registers. Therefore, the Minion Shire integrates an UltraSoC BPAM component that provides an AMBA APB3 interface connected to all the Minion Shire registers and also provides the debug run control interface for the ET-Minion cores.
3.2.2.2 Resources visibility
In addition, the user can also monitor the execution of the ET-Minion cores using the UltraSoC Status Monitor component. As mentioned above, the Minion Shire is replicated multiple times, which means the cost of debug components escalates quickly in area and power; for this reason there is one UltraSoC Status Monitor per neighborhood, monitoring the resources within this sub-block: the 8 ET-Minion cores, the L1 caches and the interconnection between those resources and the rest of the Minion Shire.
Furthermore, there is an UltraSoC Status Monitor that provides visibility of the crossbar between the four Minion Shire neighborhoods and the L2 and L3 cache levels, known as the Shire Cache, so that the user can see the requests generated by the compute cores and the responses sent back by the Shire Cache. Note that the UltraSoC Status Monitor does not allow tracking requests based on unique identifiers, which is a problem when the filtering to be applied requires matching requests with responses.
An additional UltraSoC Status Monitor is connected to the Shire Cache control signals allowing
analysis of cache control signals and finite-state machines.
The compute cores can generate huge amounts of transactions that target not only their own Minion Shire resources but also other Minion Shires and SoC resources such as memory. It is therefore essential to be able to monitor the traffic on the Minion Shire AMBA AXI interfaces that connect to the Network On-Chip. This monitoring feature is provided by the UltraSoC Bus Monitor, which allows the user to track the requests and responses exchanged with the rest of the system.
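Host-side tooling can perform the request/response matching that the Status Monitor cannot do on its own, using the AXI transaction ID carried by each captured transfer. The trace-record format below, simple (timestamp, kind, axi_id) tuples, is a hypothetical decoding of the monitor output, but the matching logic is the point:

```python
# Sketch of matching captured AXI requests with their responses by
# transaction ID, to spot requests that were never answered.
def find_unanswered(trace):
    """Return the IDs of requests that never received a response."""
    pending = {}
    for ts, kind, axi_id in trace:
        if kind == "req":
            pending[axi_id] = ts      # remember when the request left
        elif kind == "rsp":
            pending.pop(axi_id, None)  # its response came back
    return set(pending)                # still-outstanding transaction IDs
```

A non-empty result is exactly the "core waiting for a response that is never received" hang scenario discussed earlier.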
Figure 8. Minion Shire UltraSoC component integration
3.3 Memory Shire
3.3.1 Architecture
The Memory Shire integrates the 4GB LPDDR4 memory parts and connects those memories with
the rest of the SoC through the Network On-Chip. There are eight Memory Shires in the system
providing 32 GB of memory, each of them has two memory controllers which allow access to two
16-bit LPDDR4 channels resulting in a system with a 137GB/sec memory bandwidth.
The Memory Shire has multiple voltage domains, which allow it to connect with the Network On-Chip, internal third-party components and the DDR memories. In addition, each of those blocks runs at a different frequency. Due to this integration, the Memory Shire has multiple voltage- and clock-crossing FIFOs that handle the different requests and responses.
3.3.2 Debug features and UltraSoC integration
3.3.2.1 Register visibility
The Memory Shire has several registers that can be configured to define the different operational modes of the memories. Those registers are accessed through the Network On-Chip by the Service Processor, but they can also be accessed through an UltraSoC BPAM component integrated within the Memory Shire, so that the user can perform read and write requests to configure the memories without interfering with the rest of the system, as the transactions go through the debug infrastructure, which is detached from the main Network On-Chip.
3.3.2.2 Memory visibility
Given that the Memory Shires integrate the DDRs, which serve as the main memory for the cores, they receive a large number of requests through the Network On-Chip, and it is very important to ensure that requests are received and responded to properly, so that there are no hangs that would end up stalling the compute cores with critical consequences on functionality.
For this purpose, each Memory Shire integrates an UltraSoC Bus Monitor component that allows the user to analyze the traffic on the interfaces connecting the Network On-Chip, through which the compute core requests are received, with the DDR memories.
In addition, the Memory Shire also integrates an UltraSoC Status Monitor component so that the
user can have visibility of control signals and finite-state machine values.
3.3.2.3 Debug trace buffering
UltraSoC Status Monitor and UltraSoC Bus Monitor components can be used to capture data in
real-time and forward it to the host for further debugging. Due to the fact that the debug
infrastructure interface with the host is based on JTAG and USB protocols, the bandwidth is limited
and those interfaces can become a bottleneck causing the user to lose information gathered from
the system.
In order to avoid losing debug information, which would mean engineers could miss the chunk of data needed to detect the bug they are looking for, each Memory Shire integrates an UltraSoC System Memory Buffer component that allows the debug messages to be stored in the DDR memories. It is important to note that the region of memory used to store those debug messages acts as a circular buffer and is specified by the user, which means the user must ensure that the compute cores do not overwrite this region, as that would cause data corruption.
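Draining such a circular trace region from the host can be sketched as follows. The base/size values and the hardware-maintained write pointer are hypothetical; the point is the wrap-around handling at the end of the user-defined region.

```python
# Sketch of reading a circular debug-trace region out of DDR.
# rd/wr are byte offsets into the region; wr is advanced by hardware.
def read_trace(mem: bytearray, base: int, size: int, rd: int, wr: int):
    """Return the bytes between the read and write pointers, and the
    updated read pointer, wrapping at the end of the region."""
    out = bytearray()
    while rd != wr:
        out.append(mem[base + rd])
        rd = (rd + 1) % size   # wrap around at the region boundary
    return bytes(out), rd
```

If the hardware write pointer laps the host's read pointer before a drain, the oldest messages are silently overwritten, which is why the buffer must be sized for the expected trace bandwidth.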
Figure 9. Memory Shire UltraSoC component integration
3.4 PCIe Shire
3.4.1 Architecture
The PCIe Shire integrates the PCIe Subsystem allowing Esperanto’s SoC to connect with an external
host through a PCIe interface. The PCIe Subsystem contains two PCIe Controllers, two PCIe PHYs and
the subsystem custom logic.
Like the rest of Esperanto’s SoC shires, the PCIe Shire has access to the system Network On-Chip, allowing not only the host to access memory-mapped locations but also the compute cores to use host memory as SoC extended memory.
3.4.2 Debug features and UltraSoC integration
PCIe serves as the interconnection between the host and the System On-Chip; it is widely used in the industry and is a critical feature for the performance of Esperanto’s SoC. Given its complexity, the PCIe Shire debug components have been integrated with a focus on providing visibility of the PCI Express functionality and on its verification.
Given that the PCIe is connected to the system Network On-Chip through multiple AMBA AXI4 interfaces, the debug infrastructure in the shire integrates two UltraSoC Bus Monitor components that allow the user to analyze the in-flight transactions of the two PCIe controllers simultaneously.
In addition, the PCIe Shire also integrates two UltraSoC Status Monitor components that allow the user to capture PCIe control signals and monitor its interconnection with the system resources.
Figure 10. PCIe Shire UltraSoC component integration
3.5 Debug Network On-Chip
Access to the system debug infrastructure is done through the JTAG and USB interfaces, which are integrated in the IOShire. Given that there are several UltraSoC components integrated in other shires, Esperanto’s debug subsystem must have a way to route all the debug messages up to the IOShire. By debug messages we mean not only those that configure UltraSoC components, but also the real-time data captured through the UltraSoC Status Monitor and UltraSoC Bus Monitor components.
Apart from the system Network On-Chip, through which the Service Processor and compute cores
can access the memory mapped resources such as registers, caches or the DDRs, there is an
independent debug Network On-Chip that handles the routing of debug messages.
The debug messages sent from the debugger to the UltraSoC components always follow the fastest path based on route-optimization algorithms. The responses to those configuration messages, as well as the captured real-time data, can be sent either to the debugger interface or to the Memory Shire DDR memories, where they are stored using the UltraSoC System Memory Buffer.
The destination of the configuration message responses and of the real-time captured data is defined by the user, who can decide whether to send those debug messages directly to the JTAG or USB interfaces, or to store them in the DDR memories and read them on request to avoid overflowing the external interfaces with the host.
This document does not include any further details about the debug Network On-Chip architecture
as it has been developed in an agreement between Netspeed and UltraSoC and its architectural
details are confidential.
3.6 Conclusions after ET-SoC-1
Now, after Esperanto’s ET-SoC-1 has been fabricated and the UltraSoC components have gone
through RTL integration, verification, physical design implementation and post-silicon validation
stages, Esperanto’s engineering team has found some critical aspects that must be improved for
future SoCs.
3.6.1 Architecture, RTL and physical design evaluation
During RTL integration the Esperanto engineering team found that the UltraSoC proposal was
architected for small SoCs, both because of how its messaging infrastructure was built and because
it lacked support for voltage and clock domain crossings, which are typical in multicore systems that
integrate several cores, memories and peripherals.
The lack of an efficient message infrastructure limited throughput and therefore the amount of
information that could be gathered from the system, because the UltraSoC message infrastructure
is based on a tree architecture. In this tree architecture, information flows from several analytic
components to an UltraSoC Message Engine, which sends it to the next UltraSoC Message Engine in
the hierarchy, itself connected to other UltraSoC Message Engines or analytic components, until the
top UltraSoC Message Engine of the tree is reached and the information is sent out through JTAG
or USB.
In addition, a tree-based architecture does not allow building a system by stamping or tiling
shires. This forced Esperanto to involve a third party (NetSpeed) to develop a debug Network
On-Chip (NoC) that met UltraSoC messaging requirements, causing not only an unpredicted impact
on area and power but also a loss of debug performance, as the NoC was not customizable, and an
unexpected cost. Moreover, the internal debug NoC protocol did not support tracing capabilities
such as targeting different memory subsystems, and it was based on the NetSpeed protocol, in
which many fields were unused, causing extra area overhead.
In addition, some of the debug components, such as the UltraSoC Status Monitor or the UltraSoC
Bus Monitor, proved inefficient under high tracing requirements: the signals to be analyzed are
stored in internal buffers before checking whether they match the user filter configuration, causing
the buffers to overflow and the user to lose visibility.
It is also important to mention that UltraSoC did not provide area and power estimates for their
debug components, which forced the RTL team to perform several iterations and simulations to
find a configuration that fit the area and power requirements. This also forced the design team to
drop some key configurations, such as wider trace buffers, while other features could not be
removed as they were non-optional parts of the UltraSoC components.
Power and area analyses were performed with the Synopsys PrimePower and Synopsys PrimeTime
tools, respectively. The analysis applied clock gating to the UltraSoC modules in order to discard
dynamic power, since the debug infrastructure is intended to be inactive after the post-silicon
validation stage. The numbers showed that the static power of the ET-SoC-1 debug infrastructure
amounts to around 2% of SoC power, and that the infrastructure occupies 4% of the SoC area.
3.6.2 Post-silicon validation evaluation
Finally, during post-silicon validation the engineering team was able to stress the debug
components and make use of the provided features. The debug infrastructure ended up being
critical for bring-up development and had a huge impact on meeting time-to-market requirements.
In addition, the experience during this stage showed that some deprecated features, such as access
to some SoC registers, were essential, while some features that were supposed to be essential, such
as providing the Service Processor with a mailbox through the UltraSoC Static Instrumentation
component, were not really needed. Further details of the post-silicon validation conclusions follow.
The UltraSoC Bus Communicator module, which was integrated to give the Service Processor
access to other UltraSoC components without requiring a JTAG/USB interface connection, was not
used due to its complexity and the engineering team's lack of time to develop new software, as no
support was provided by UltraSoC.
In addition, the UltraSoC Static Instrumentation, which was supposed to act as a way for the
master core to send information to the debugger, was not used either, as the Service Processor was
using the UART interface for this purpose.
It is also important to note that the debug infrastructure showed problems when monitoring
multiple ET-Minion cores: the UltraSoC Status Monitor filtering implementation was not efficient
and caused the internal buffers to fill up before the gathered data could be sent, causing a loss of
information that was critical for debugging.
On the other hand, due to the default implementation of the RISC-V GDB software, operations such
as single stepping incurred large latencies, as all General Purpose Registers (GPR) of the selected
core were read automatically. In addition, note that the Esperanto Technologies SoC had no
mechanism that could speed up reading the GPRs from the cores.
Moreover, the management team reported license-acquisition problems: to use the debug
infrastructure on Esperanto's ET-SoC-1 silicon it was necessary to buy licenses from UltraSoC,
which forced Esperanto to assume around 15% of unexpected extra cost and to work out the best
approach for customer support, since customers would need UltraSoC licenses in order to debug
instead of receiving a fully functional SoC with debug features already available.
Note that the UltraSoC software also showed problems and had to be modified several times
because it did not support CentOS, the operating system used on the servers to which the
Esperanto System Engineering team connected the Bring-up Boards (BuB). This caused the
Esperanto team to work with a slower version of the UltraSoC software for weeks until the issue
was fixed, even though this operating system was supposed to be fully supported.
4. Proposed multicore debugging architecture
After analyzing the required features for debugging multicore systems, the options available in the
market and the conclusions drawn from the Esperanto Technologies ET-SoC-1 implementation, we
focus on developing a new debug architecture that intends to solve all the problems found in the
previous sections.
As observed in previous sections, an efficient multicore debug architecture requires designs that
are customizable, for example by allowing the user to decide the number of comparators or the
buffering sizes, and that do not add features which cannot be disabled, as those cause a direct
impact on area. In addition, when debug is not in use we must ensure it wastes as little power as
possible.
Users must be able to monitor not only the execution of the compute cores but also have visibility
of the other logic in the system that interconnects the cores with each other, with the memories
and with other custom modules placed on the SoC. This means that the proposed debug monitor
components must be able to analyze different resources and support the multiple protocols used
by those modules. For example, if the interconnection between the compute cores SoC block and
the non-debug Network On-Chip is based on the AMBA AXI4 protocol, then the debug monitor
components must be able to analyze transactions of that protocol.
Furthermore, with the aim of providing more visibility, monitor debug components must allow the
user to gather information from the system based on advanced filtering. Note that when debugging
multicore systems it is generally necessary to gather information over a large number of cycles,
which means the debug infrastructure must be able to sustain high bandwidth for long periods and
provide in-system storage in case the information cannot be extracted through a slow I/O
interface. We assume that intensive data gathering is performed from one source at a time, and
that when the user wants to capture traces from multiple sources, advanced filtering is applied to
reduce the amount of information gathered, avoiding loss of information and bottlenecks.
Multicore systems also require a run control infrastructure that allows managing the execution of
multiple cores at the same time, propagating events such as halt or resume efficiently to stop the
cores with as little latency as possible, and extracting core information (e.g. register file values) in a
very short time, so that the user can manage thousands of cores without a huge delay.
Note that the debug infrastructures analyzed previously in this document did not focus on the time
it takes to execute run control operations, given the slowness of human interaction, and
consequently there was no requirement to operate within a certain latency. However, on a
multicore system with hundreds or thousands of compute cores, the number of transactions
required for run control operations escalates quickly, having a direct impact on latency and causing
large delays.
On the other hand, we also noted that the messaging infrastructure is essential, which is why we
not only focus on developing monitoring solutions but also on designing a debug Network On-Chip
that can sustain high-bandwidth traffic, scale to different SoC sizes without aggregating
information at each router as in a tree-based architecture, utilize all NoC resources through an
efficient messaging protocol, and support multiple leader interfaces acting as debuggers.
Note that, based on ET-SoC-1 experience and research previously done, we assume that debug can
be managed from multiple sources which can be either external interfaces such as JTAG, USB or
PCIe or a master interface within the System On-Chip itself such as a master core. In addition, we
also assume that only one debugger interface manages the system at any given time which is known
as the active debugger.
4.1 Run control support
Run control features are essential for developers to be able to debug hangs or misbehaviours on the
instruction execution of systems with one or multiple cores. As has been introduced in previous
sections, multicore systems can potentially combine peripherals, memories and hundreds or
thousands of cores and while debugging a problem the user must be able not only to have visibility
of the system resources such as peripherals or memories, but also to manage core execution.
Multicore systems can be used either to run general purpose software or to run very specific
software applications that require high-performance. Depending on the core architecture, the
debug run control support may follow specific implementations to be architecture compliant. Our
proposal takes the RISC-V debug specification [2] as a baseline, which we expand to support
multicore systems by focusing on the indispensable features that must be integrated into the SoC
debug infrastructure, regardless of the core architecture, to allow users to manage core execution.
RISC-V has been selected as the baseline due to its open Instruction Set Architecture (ISA), which
allows developing royalty-free designs, its well-defined debug specification, and the wide range of
available tools supported and improved by the community.
4.1.1 Execution
First, the user must be able to stop the execution of the core so that no instructions are retired and
the pipeline is stopped. This operation is known as applying a halt to the cores. On the other hand,
once the user wants the core to start executing again it can perform a resume operation.
These two operations allow the user to stop and restart the execution of code in the cores. Then,
given that the user may want to debug starting from a specific point of the execution, they must be
able to configure the core so that it automatically halts when a certain instruction is about to be
executed or when a specific memory address is accessed. This feature is known as breakpoints and
allows the user to ensure that the core will stop its execution when a certain condition is met.
On the other hand, the user may want to execute one instruction and halt the core automatically so
that it can see how this instruction has been performed and how the core behaved. This operation
is supported by what is known as single-stepping, which allows the user to configure the core such
that once a resume is received one instruction is performed and the core automatically halts. This
49
allows the user to debug the execution of a program step-by-step without much latency as it does
not require the user to set a new breakpoint each time an instruction has been performed.
An extra feature that is highly recommended is to allow the user to inject operations on the core
pipeline without having to modify the software application being executed. That is, the user can
stop the system at a given point and force the core to execute one or multiple instructions to see
how it behaves and to iterate faster as modifying software applications and loading them on the
system can require a huge amount of time.
4.1.2 Registers
The user may also be interested in having visibility of core registers contents so that it can see if an
instruction has modified the register as expected. For example, this allows the user to check if data
has been properly loaded onto a register, or if an arithmetical operation has been performed and
reported the correct value.
On complex core architectures it is also recommended to allow the user to modify the values of the
internal core registers, so that they can load custom values, perform a given operation and check
for any misbehavior. In addition, some cores have a set of Control and Status Registers (CSR) that
enable special features or change instruction behavior, so allowing the user to read and configure
those registers at runtime provides not only good visibility but also a flexible test environment.
4.1.3 Core status
It is also necessary to allow the user to check the status of the core, not only when running
software applications but also to check whether a set of injected instructions has reported
problems or finished successfully. The proposed architecture recommends adding at least the
following flags so that the user can check the core status:
● halted: This bit is asserted when the core performed the halt request and stopped its
pipeline. It is cleared when the core receives a reset or resume request.
● running: This bit is asserted when the core is executing instructions. It is cleared when the
core performs a reset or halt request.
● busy: This bit is set to 1 while instructions injected by the user are being executed, and
set to 0 when no user-injected instruction is executing.
● exception: This bit is set to 1 if the user's injected instruction generated an exception. It is
manually cleared by the user to ensure it is not lost.
● error: This bit is set to 1 if the user’s injected instruction cannot complete but no exception
has been generated. It is manually cleared by the user to ensure it is not lost.
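The five flags above can be modeled as a small status bitfield. The sketch below is illustrative only; the bit positions and helper names are assumptions, not part of the proposal:

```python
# Illustrative model of the per-core status flags described above.
# Bit positions are hypothetical; a real design would fix them in its spec.
HALTED    = 1 << 0  # core stopped its pipeline after a halt request
RUNNING   = 1 << 1  # core is executing instructions
BUSY      = 1 << 2  # a user-injected instruction is still executing
EXCEPTION = 1 << 3  # injected instruction raised an exception (sticky)
ERROR     = 1 << 4  # injected instruction failed with no exception (sticky)

def on_halt_complete(status: int) -> int:
    """Core finished a halt request: assert halted, clear running."""
    return (status | HALTED) & ~RUNNING

def on_resume(status: int) -> int:
    """Core resumed: clear halted, assert running."""
    return (status | RUNNING) & ~HALTED

def clear_sticky(status: int) -> int:
    """User manually clears exception/error so events are not lost."""
    return status & ~(EXCEPTION | ERROR)
```

Note how exception and error are only cleared explicitly, matching the requirement that these events are not lost before the user observes them.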
Note that the halted flag is useful because time elapses between the moment a halt is requested
and the moment the core stops fetching instructions. On the other hand, running allows the user to
check that the core is fetching instructions after it has been enabled or after a resume operation.
Finally, busy, exception and error allow the user to check the status of the instructions that have
been injected on the core pipeline while it was halted.
4.1.4 Multicore execution support
Due to the fact that multicore systems can have a huge number of cores and because the user may
want to apply a run control operation on a subset of them, it is important to have one or multiple
registers within the SoC that act as a mask. That is, the user must be able to configure the mask
registers so that it can select which cores must be affected by the run control operations such as
halt or resume.
Multicore systems can scale quickly in the number of cores, and it is essential to allow users to
manage multiple cores at the same time; that is why core selection is a key feature of the proposed
debug architecture.
On the other hand, it is also important to take into account that the propagation of run control
requests can take several cycles on the SoC to reach all the target cores so an efficient
communication method is required.
The proposed architecture allows the user to configure the mask registers once and then send as
many run control requests as desired. The supported run control operations include single stepping
and breakpoint configuration, so that the user does not have to go core by core applying the
configuration, which would require back-and-forth communication between the host and the SoC
with a huge time impact.
In the proposed architecture, there is a component named Debug Multicore Handler (DMH) which
serves as the interface between the host and the Debug Module (DM) within the SoC block(s) that
contain the compute cores. The DMH handles the propagation of the run control request to the
selected cores so that the operation is broadcasted through hardware to the DMs. Then, the Debug
Module within the SoC block forwards the request to the selected cores within the block.
In addition, the SoC block Debug Module integrates a feature that allows the user to halt a group of
cores within the block when one of them halts. That is, the DM can be configured to handle up to N
groups, with group 0 reserved to mean "no group". Each core can belong to only one group, and
when one of the cores within a group halts, the Debug Module sends a halt request to the rest. This
allows the user to set a breakpoint on a specific core and ensure that the rest of the cores also halt
and do not execute any further instructions, avoiding communication latency with the host.
4.1.5 Multicore status report
As mentioned in previous sections, the user must be able to know the status of each core by having
access to a subset of flags that specify if it is halted, running or executing injected instructions.
On multicore systems the user can apply run control operations to several cores, and may therefore
be interested in a status report that gathers the information from all selected cores while
maintaining the ability to access the status of a single core. The former can be achieved by making
use of a special register that is introduced later in this section, while the latter can be achieved by
selecting a single core and requesting its status information, which is reported in the same status
register.
When the user wants to gather the status information this request is received by the Debug
Multicore Handler (DMH) which gathers the information from the status register of the compute
cores in the SoC Block(s). Then, this information is merged and the user receives the following flags:
● anyhalted: This bit is asserted if any of the selected cores is halted.
● allhalted: This bit is asserted when all selected cores are halted.
● anyrunning: This bit is asserted if any of the selected cores is running.
● allrunning: This bit is asserted if all the selected cores are running.
● anyunavailable: This bit is set if any of the selected cores is neither halted nor running.
● allunavailable: This bit is set if all selected cores are neither halted nor running.
For user interaction it is recommended to implement the Debug Multicore Handler (DMH) status
register such that it is updated based on the SoC block compute cores status every time selected
cores status change. Then, the user can retrieve SoC compute cores information by accessing a
single register.
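The aggregation performed by the DMH status register can be modeled directly from the flag definitions above (the per-core state encoding used here is an assumption):

```python
# Sketch of how the DMH could merge per-core flags into the aggregate
# status bits listed above. Flag names follow the text; representing each
# core's state as a string is purely for illustration.
def merge_status(cores):
    """cores: list of per-core states, each 'halted', 'running' or
    'unavailable'. Returns the merged DMH status flags."""
    halted = [c == "halted" for c in cores]
    running = [c == "running" for c in cores]
    unavail = [c == "unavailable" for c in cores]
    return {
        "anyhalted": any(halted),
        "allhalted": all(halted),
        "anyrunning": any(running),
        "allrunning": all(running),
        "anyunavailable": any(unavail),
        "allunavailable": all(unavail),
    }
```

In hardware this would be a reduction (OR for the any* bits, AND for the all* bits) over the selected cores' status wires, re-evaluated whenever any selected core's status changes.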
On the other hand, it is also important to note that one of the biggest potential problems for the
user is that RISC-V GDB reads all the core's General Purpose Registers (GPR) each time a single step
is performed. This becomes a problem on multicore systems, introducing huge latency, not only
because gathering information from a single register requires multiple accesses to the core
program buffer, but also because there can be hundreds or thousands of cores in the SoC.
In this project we recommend implementing a special register within the Debug Multicore Handler
(DMH) that captures the GDB request and gathers the information from all cores' GPRs, avoiding
the need for the software to send hundreds of requests. Note that the SoC block Debug Module
(DM) can integrate a finite state machine that automates the reading of the General Purpose
Registers from each of the SoC block cores, so that the information can be gathered as a trace of
values later forwarded to the DMH. This trace of values can then be read by the host software to
show the user a list of the registers and their values.
The OpenOCD software layer, which acts as an open-source interface between GDB and the SoC
debug infrastructure, can easily be extended to make use of this DMH special register after
performing a single step, rather than following the default implementation, which sends one
request per register.
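The bulk-GPR read can be sketched as below; `read_gpr` stands in for the hardware access automated by the DM's finite state machine, and the 32-register count assumes an RV32I/RV64I core:

```python
# Sketch of the DMH bulk-GPR read: instead of GDB issuing one request per
# register per core, a single request triggers a hardware FSM in each SoC
# block's Debug Module that streams all GPR values back as a trace.
NUM_GPRS = 32  # RV32I/RV64I general purpose register count (assumption)

def bulk_read_gprs(read_gpr, core_ids):
    """read_gpr(core, reg) models the per-register access that the DM's
    FSM automates. The host receives one trace per core instead of
    issuing NUM_GPRS individual requests per core."""
    trace = {}
    for core in core_ids:
        trace[core] = [read_gpr(core, r) for r in range(NUM_GPRS)]
    return trace
```

The latency win comes from the loop running in hardware next to the cores: the host issues one request and reads back one trace, rather than paying a host-to-SoC round trip per register.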
4.2 Debug Network On-Chip
The proposed debug infrastructure relies on a Network On-Chip to interconnect all the debug
components with each other, with the SoC resources and with the active debuggers. The debug
components can be configured either through external interfaces such as JTAG, USB or PCIe, or
through a master core in the SoC.
The debug Network On-Chip must support high-bandwidth traffic and handle different message
types that are sent between the active debugger and the debug components within the SoC blocks.
The first requirement for a competent message infrastructure is to keep the overhead to a
minimum while maintaining good debug performance. For that reason, it is important to
implement efficient message packaging and compression algorithms that reduce the amount of
information in flight, causing less traffic and avoiding bottlenecks within the debug NoC as much as
possible.
The architecture implements a protocol with multiple message types in order to support all the
features explained later in this specification. It is important to note that the debug NoC interface
that connects to the SoC blocks has a fixed size which, depending on the user's debug component
configuration, may cause some messages to be split into flits. As introduced previously in this
document, a flit is the smallest unit that composes a packet and equals the link width: it is the
link-level unit of transport, and packets are composed of one or more flits. Further information
about how flits are handled by the proposed architecture can be found in each of the subsections
below.
We assume, from the ET-SoC-1 experience and the research done in previous sections, that debug
component configuration and run control require sending a large number of small messages,
which go from the active debugger (e.g. JTAG or USB) to every part of the system and may require
responses. In addition, high bandwidth must be supported at given times when gathering data,
such that the debug NoC must support configurations that allow extracting as much information as
possible, even at the cost of slowing down or sacrificing access to other debug resources.
Moreover, as commented previously, the debug NoC supports multiple active debuggers, ranging
from standard, easy-to-use interfaces (e.g. JTAG) to those that provide high throughput but are
complex to configure (e.g. PCIe). The proposed architecture, introduced below, supports both
options but avoids oversizing the debug NoC, as adding buffering or wires would increase the
bandwidth but have a huge impact on area and power. It is important to keep in mind that the
debug infrastructure is intended to be used during the bring-up effort, so we focus on reducing the
leakage consumed by this piece of the SoC, as it is disabled during normal operation.
4.2.1 Message Types
The message types introduced in this section support different functionalities but their encoding
follows a similar pattern.
As mentioned above, we must ensure that any lead debugger, which can be on the host or within
the SoC, can communicate with all the SoC debug resources. In addition, the proposed debug
architecture must be able to scale from small SoCs to big ones, so it is important to have a set of
flexible but carefully defined message types. As part of this flexibility, debug resources are targeted
using a coordinate system that identifies the SoC block, plus an identifier of the debug component
to be targeted within that block.
On the other hand, reducing the amount of static power consumed is achieved by sharing the
messaging interface between the different message types. Furthermore, we also focus on reducing
the number of wires by sending the least amount of information required.
Finally, given that the messaging protocol is shared between different lead debuggers, and that
some of them follow protocols that require matching in-flight requests with their corresponding
responses, the request identifier and unique transaction identifiers are attached as part of the
message.
There are different message types that make the debug infrastructure flexible, allowing the user to
apply different configurations and gather information from the system. This section describes the
four message types and how they are encoded.
Message type          Encoding
Configuration         00
Event                 01
Trace non-streaming   10
Trace streaming       11
The messages share the same interface and make use of different fields introduced below.
● Destination:
○ The most significant bits correspond to the X and Y coordinates of the system
allowing the debugger to target any location of the SoC.
○ The least significant bits are used to encode the destination debug component
within the target SoC block.
■ For example, the X and Y coordinates can be used to target SoC Block 0,
which contains compute cores, while the least significant bits of the
Destination field are used to target a monitor within the SoC block that
analyzes core retired instructions.
○ The coordinates are used by the debug NoC routers to select among all the possible
destinations and route the message to the endpoint. Once the message arrives at
the SoC block NoC interface, the least significant bits of the Destination field are
used to select the destination debug component, to which the rest of the package
information is forwarded.
● Source:
○ Configuration requests and trace messages require the destination endpoint to
know which debug component generated the message. Due to that, the SoC block
debug component identifier is attached.
○ Configuration messages (both request and response) also require the use of a
transaction identifier so that the destination endpoint can generate a response
attaching the same identifier value. This is needed to support the cases in which the
active debugger makes use of protocols such as AMBA AXI4 which allow sending
multiple requests in parallel to be able to match each request with its
corresponding response.
● Data:
○ Data field encoding is different for each message type and the proposed
architecture ensures that the number of NoC dedicated Data field wires are the
minimum mandatory to support the different message types so that no wires are
wasted.
○ Configuration response messages make use of a dedicated wire that indicates that
the received package is the last one for the given message. That is, in case the
message has been split into multiple packages, this wire is used to mark the last
one. Otherwise, the value of this wire is not taken into account.
○ Note that the debug NoC does not use the Data field and it just forwards the value
to the correspondent destination. Further information can be found later in this
document.
The previous fields are exposed as input and output buses on the debug NoC. That is, the
communication between the debug NoC and the SoC blocks is based on an interface of five output
buses, which allow the debug NoC to send requests to the SoC blocks, and five input buses, which
allow the SoC blocks to send response messages and traces through the debug NoC. Each set of
five buses comprises the Destination, Source and Data fields together with a valid signal, which
marks that the master interface wants to send a new message, and a ready signal, used by the
slave interface to indicate that it can accept new messages.
In addition, this interface is parameterizable, such that depending on the user's design the debug
NoC interface can be replicated, allowing different sets of buses to be used for debug component
responses and for forwarding gathered data from the SoC blocks. From now on, for simplicity, this
document assumes a debug NoC with a single interface shared by all message types.
Below we can see an example of how the debug NoC is integrated on an SoC composed of six
different blocks. Each SoC block debug component connects to the debug NoC using the interface
introduced above. In addition, lead debuggers such as JTAG, PCIe or the SoC master core are also
attached to the debug NoC, giving them access to all SoC debug resources. Note that each block
has a unique identifier composed of (X, Y) coordinates, and each SoC block debug component has a
unique identifier within the SoC block where it is integrated.
Figure 11. Debug NoC integration on an SoC composed of six blocks
4.2.1.1 Configuration messages
This type of message is used to configure the debug components in the system. A few examples of
Configuration messages are filter enabling, counter configuration, and read and write requests to
registers and memories through the master interfaces exposed by the debug NoC.
The Configuration messages can be categorized into two types of operations: reads and writes.
Reads are used to gather information from the system, either a configuration applied to a debug
component or a value from a register or memory. Write operations, on the other hand, are used to
apply configurations to the debug components and to write data to registers or memory.
As mentioned above, the interface between the debug components and the debug NoC has a fixed
size, which means that some requests or responses may be split into multiple packages. Those
packages are handled by the NoC transparently to master interfaces such as the master core or
PCIe. Note that since the debug NoC handles the concatenation of the packages, it improves the
efficiency of the PCIe, USB or JTAG interfaces, allowing them to maximize their throughput as they
no longer send multiple messages when the data can be handled more efficiently.
4.2.1.1.1 Read configuration messages
Reads are used to gather information from the system by targeting a debug component or a
memory mapped resource such as a register or memory.
Request
The read request does not require sending data, and we assume that all the information regarding
the target endpoint and the operation size can be encoded in one debug NoC message.
SoC block and debug component target
When a read request is issued, the endpoint is specified in the Destination field following the
pattern explained in previous sections, where the most significant bits of the field correspond to
the X and Y coordinates and the least significant bits encode the destination debug component.
Targeting register/memory
On the other hand, the user needs to specify the address offset within that SoC block and/or debug component that has to be read, together with the size of the request and the operation type (opcode), which in this case is a normal read.
All this information is encoded in the Data field such that, when the request arrives at the debug component, the packet is decoded and the read is performed.
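As an illustration, the packing of a read request Data field might be sketched as follows. The field widths used here (a 2-bit opcode, a 4-bit size, a 16-bit address offset) are assumptions for the example, not the actual debug NoC dimensioning.

```python
# Hypothetical sketch of a read request Data field: 2-bit opcode in the
# lowest bits, then a 4-bit size, then a 16-bit address offset. The real
# widths depend on the debug NoC bus dimensioning.
OPCODE_READ = 0b00

def pack_read_request(addr_offset: int, size: int) -> int:
    """Pack the opcode, request size and address offset into one Data word."""
    assert 0 <= addr_offset < (1 << 16) and 0 <= size < (1 << 4)
    return (addr_offset << 6) | (size << 2) | OPCODE_READ

def unpack_request(data: int) -> tuple:
    """Recover (opcode, size, addr_offset) at the destination debug component."""
    return data & 0b11, (data >> 2) & 0b1111, data >> 6
```

The destination component simply reverses the packing to obtain the opcode, size and offset before performing the read.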
Request source
When a read is issued, the source expects a response with the requested value. It is important to note that the response is generated by the endpoint, as the debug NoC is stateless: it does not store request identifiers while waiting for a response. For this reason, the debug component that received the request stores the requestor identifier so it can send the response back.
The source information is encoded by the requestor in the Source field, so that the message includes the source debug component that generated the request. In addition, when the debug NoC receives the request, it appends the SoC block source coordinates, such that once the destination debug component receives the request it has all the information needed to generate a response.
On the other hand, given that multiple master interfaces may require unique transaction identifiers to match multiple in-flight requests and responses, the requestor debug component can encode a unique transaction identifier (trans_id) in the request Source field so that the destination debug component can attach this information to the response. If no unique transaction identifier is used, this subset of bits is unused.
Response
The read response does require sending the gathered value together with all the information regarding the destination endpoint. This means that, depending on the read request size, responses may require more than one packet and the information is split into flits. The debug NoC is responsible for handling all the flits that compose a message before the response is returned to the requestor.
In case the debug module has to send multiple packets for a single read response, the encoding is always the same and the only change is the value appended in the Data field.
SoC block and debug component target
As has been commented in previous sections, the debug component must generate a response to
the requestor returning the read value.
Response messages also target the endpoint based on the use of the Destination field and follow the same pattern as read requests: the most significant bits of the Destination field correspond to the X and Y coordinates of the SoC block, while the least significant bits encode the destination within that block. In this case the destination endpoint corresponds to the information extracted from the read request Source field.
Transaction
Given that the requestor may follow a protocol that requires the use of unique identifiers such as
AMBA AXI4, the debug module encodes the transaction identifier (trans_id) information extracted
from the read request, under the response Source field.
Data
The read value is returned in the Data field together with information that notifies whether the transaction has been successfully completed or whether an error, such as targeting a non-existing destination, has occurred. Note that the reply code, which indicates whether the request was successful or not, makes use of a subset of dedicated wires within the Data field. In addition, there is also a dedicated wire that indicates whether the sent packet is the last one for a given message, so that the endpoint can concatenate the messages successfully.
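The requestor-side decoding described above can be sketched as follows. The bit positions are hypothetical (bit 0 as the last-packet wire, bits 2:1 as the reply code, the remaining upper bits as raw data); the real layout follows the Data field encoding presented later.

```python
# Sketch of requestor-side decoding of read response flits, assuming
# hypothetical bit positions: bit 0 = last-packet wire, bits 2:1 = reply
# code, remaining upper bits = raw read data (29 bits per flit here).
REPLY_OK = 0b00

def decode_flit(data: int) -> tuple:
    """Return (last, reply_code, raw_data) for one response flit."""
    return data & 0b1, (data >> 1) & 0b11, data >> 3

def reassemble(flits, chunk_bits: int = 29) -> int:
    """Concatenate raw-data chunks until the last-packet wire is asserted."""
    value, shift = 0, 0
    for flit in flits:
        last, reply, raw = decode_flit(flit)
        if reply != REPLY_OK:
            raise RuntimeError(f"read failed with reply code {reply}")
        value |= raw << shift
        shift += chunk_bits
        if last:
            return value
    raise RuntimeError("truncated message: last-packet wire never asserted")
```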
4.2.1.1.2 Write configuration messages
Request
Write requests do require sending the value to be written together with all the information regarding the target endpoint and operation size. Depending on the size to be written, multiple messages may need to be sent through the debug NoC.
SoC block and debug component target
Following the same pattern as read requests, the endpoint is specified in the Destination field, using the most significant bits as the X and Y coordinates to select the target SoC block and the least significant bits of the Destination field to encode the destination component within that block.
Targeting register/memory
On the other hand, the user needs to specify the address offset within that SoC block and/or debug component that must be written, together with the size of the request and the operation type (opcode), which in this case is a write.
All this information is encoded on the Data field such that when the request arrives at the debug
component, the package is decoded and the write is performed.
Request source
As introduced in previous sections, depending on the active debugger protocol it may or may not be required to keep transaction ordering, or to observe whether an operation has reported an error, as this may not be observable through other means such as error interrupts. Write requests support the use of unique transaction identifiers and let the user specify whether the destination debug component should generate a write confirmation response. This allows the user to save around 50% of the bandwidth and sustain more outstanding writes when write confirmation is not required.
Due to the fact that debug components may need to generate a response, the information about
the requestor is sent as part of the write request so that it is stored and used later to set the
destination of the response.
This source information is encoded the same way as in read requests such that under the Source
field the active debugger encodes the identifier of the debug component that generated the
request. In addition, when the debug NoC receives the request, it appends the SoC block source
coordinates such that once the destination debug component receives the request it has all the
information to generate a response. Moreover, this information is also used by the debug NoC to reassemble the message in case it has been split into multiple flits.
On the other hand, given that the master that generated the request may require using unique
transaction identifiers to match multiple in-flight requests and responses, the requestor debug
component can encode a transaction unique identifier (trans_id) under the request Source field so
that the destination debug component can attach this information in the response.
Then, the unique transaction identifier (trans_id), together with the SoC block coordinates and the identifier of the debug component that performed the request, forms a system transaction identifier that can uniquely identify the transaction even if multiple master interfaces use the same trans_id.
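The composition of this system transaction identifier might be sketched as a simple concatenation of the sub-fields; the field widths (4, 4, 6 and 8 bits) are illustrative assumptions.

```python
# Sketch of the system transaction identifier: concatenating the SoC block
# coordinates, the debug component identifier and the per-master trans_id.
# The field widths (4, 4, 6 and 8 bits) are illustrative assumptions.
def system_trans_id(x: int, y: int, component_id: int, trans_id: int) -> int:
    """Build a system-wide unique identifier for an in-flight transaction."""
    return (x << 18) | (y << 14) | (component_id << 8) | trans_id
```

Two masters reusing the same trans_id still produce distinct system identifiers, because the coordinates and component identifier differ.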
Data
Debug components support multiple configurations that enable the different debug features such
as advanced filtering or defining a counter configuration. These configurations are applied by the
user by attaching the register/memory address in the Data field together with the request size and
the value to be written.
Note that the write opcode is extended with an extra bit that is used as a write acknowledgement enable. This acknowledgement bit can be set to 0 when using interfaces such as USB or JTAG, as those do not need responses when configuring a debug component. On the other hand, this bit can be set to 1 for PCIe and the master core, given that they make use of protocols that may require write confirmation responses.
When a debug component receives a write request with an opcode that has the acknowledgement bit set to one, it generates a response that is sent back to the requestor. Otherwise, the operation is performed without notification.
In case the value to be written fits in the message Data field without needing extra messages, only one message is sent to the requested endpoint.
Otherwise, the write request must be split into multiple packets. The Destination and Source fields of those messages have the same values, while the Data field of the first message carries the size and opcode information and the Data fields of the remaining messages are filled with the value to be written.
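The splitting rule above can be sketched as follows, assuming a hypothetical 32-bit Data field and an illustrative header layout (address above size above opcode).

```python
# Sketch of write-request splitting, assuming a hypothetical 32-bit Data
# field. The first packet carries the address, size and opcode; the
# follow-up packets carry 32-bit chunks of the value to be written.
DATA_BITS = 32
OPCODE_WRITE_ACK = 0b10

def split_write(dest: int, src: int, addr: int, value: int, size_bits: int):
    """Return the list of (Destination, Source, Data) packets to send."""
    header = (addr << 9) | ((size_bits & 0x7F) << 2) | OPCODE_WRITE_ACK
    packets = [(dest, src, header)]
    for shift in range(0, size_bits, DATA_BITS):
        packets.append((dest, src, (value >> shift) & ((1 << DATA_BITS) - 1)))
    return packets
```

Note how the Destination and Source fields repeat in every packet, as described above, while only the Data field changes.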
Response
As explained above, write responses depend on the value of the request opcode acknowledgement
bit.
SoC block and debug component target
The response endpoint is also based on the use of the Destination field and follows the same pattern as write and read requests: the most significant bits of the Destination field correspond to the X and Y coordinates of the SoC block, while the least significant bits encode the destination within that block.
Transaction
Given that the requestor may follow a protocol with unique identifiers such as AMBA AXI4, the
debug module encodes the transaction identifier (trans_id) information previously stored, under
the response Source field.
Data
In case a write response is generated, the Data field makes use of a subset of dedicated wires that contain information regarding the transaction completion or failure. For example, an error can be returned if the user has targeted a non-existing destination.
4.2.1.2 Event messages
This type of message is used to propagate an identifier to all the debug components on the system. The identifier can be used to enable or disable debug capabilities such as filtering, to reset counters, or to trigger read requests to registers and/or memories, among others. This allows triggering pre-configured debug features without a large impact on traffic, as the amount of information sent is small. In addition, events are by default propagated through the system, so the user can configure the debug components of multiple SoC blocks to perform actions based on the same event identifier.
The event messages are generated by the debug components or injected through the active
debugger and never generate a response.
Event messages are also composed following the same pattern as Configuration messages such that
there are Destination, Source and Data fields.
The Destination field is unused as the debug NoC decodes information regarding the message type
and once it identifies it as an Event message it automatically propagates the message to all the
possible endpoints.
On the other hand, given that no response is needed, no source information regarding the SoC block identifier or the debug component identifier is required. Finally, the Data field contains the identifier to be propagated, which is received by all the debug components in the system.
4.2.1.3 Trace messages
This message type is used to send gathered values from the debug components to the active
debugger without requiring the debugger to perform a request. This is useful when the debugger
has configured a filter and wants to extract information from the system when the configured
condition is met.
The interface is shared with Configuration and Event messages, which means that the interface between the debug components and the debug NoC is the same. Depending on the amount of information that the user wants to retrieve from the system, some Trace messages may need to be split into multiple packets.
Following the same pattern introduced in previous sections, the Destination field is filled with the X
and Y coordinates encoding the SoC block. The proposed architecture assumes that Trace messages
are always sent to the memory subsystem, which can be configured to either store the message in
memory or forward it to the debugger. Further information about Trace message routing can be
found later in this document.
On the other hand, even if no response is needed for Trace messages, the Source field is filled with the identifier of the debug component that extracted the information from the system. Once the debug NoC receives the Trace message from the debug component, it appends the X and Y coordinates of the source SoC block such that the destination endpoint is aware of the debug component that generated the message.
Finally, the Data field contains the value that has been gathered from the system. In addition, the
Data field may also be filled with information regarding the timestamp, which is introduced later in
this document.
4.2.1.3.1 Streaming mode
The Trace message type supports a streaming mode, which removes some operational fields from the Trace messages and uses the freed space to carry additional information extracted from the system.
This special trace mode is enabled by the user when configuring the monitor debug components.
This special trace mode is enabled by the user when configuring the monitor debug components.
Both the debug infrastructure and the host know when a Trace message uses the streaming encoding based on the message type encoding, which is part of the Trace message. When a Trace message is encoded as non-streaming, it is fully decoded by the active debugger, as it contains information about the source, the timestamp value and the information gathered from the system. On the other hand, in case a Trace message is encoded as streaming, it does not contain information about the source module, source SoC block or timestamp, as all the message fields are filled with the information gathered from the system. Further information about this feature can be found later in this specification.
4.2.2 Interface and message packaging
The interface between the SoC blocks debug infrastructure and the debug NoC follows the encoding introduced in the message type sections. The debug NoC has a set of five input signals (Destination, Source, Data, valid and ready) and five output signals (Destination, Source, Data, valid and ready) that communicate with the SoC blocks.
● Destination
○ Encodes the destination SoC block and debug component for the message.
○ Note that in case of an Event message this field is unused as the debug NoC
automatically performs a broadcast and propagates the event to all SoC debug
components.
Figure 12. Message Destination field encoding
● Source
○ Configuration and Trace messages encode the source SoC block and debug
component identifiers from the requestor that initiated the transaction.
■ Note that Configuration messages also attach the transaction identifier
value when the requester that initiated the transaction makes use of a
protocol that requires matching in-flight requests and responses based on
unique identifiers. Otherwise, the transaction identifier field is unused.
■ Note also that, in the case of Trace messages, if streaming mode is enabled this field contains the least significant bits of the information gathered from the system, replacing the Transaction ID, SoC block debug component ID and SoC block ID sub-fields.
○ Event messages do not use this field.
Figure 13. Message Source field encoding
● Data
○ Configuration message requests encode the target endpoint address, for example the offset of a SoC block or debug component register. In addition, this field also encodes the operation type and size together with, in the case of write requests, the value to be written.
Figure 14. Configuration request message Data field encoding
The available operation codes (opcode) are:
Operation code              Encoding
Read                        00
Write without acknowledge   01
Write with acknowledge      10
Reserved                    11
○ Configuration message responses contain the reply code that indicates whether an operation has succeeded or failed. In addition, in the case of read requests, the Configuration message response also attaches the returned read value. Note that Configuration message responses make use of a dedicated wire indicating the last packet of the current message, to handle messages that have to be split. In addition, dedicated wires are also used to encode the reply code, which indicates whether the operation has been successfully performed.
[Bit-field diagram: the upper Data bits (D down to RC+2) carry the raw read data; bits RC+1..RC carry the reply code.]
Figure 15. Configuration response message Data field encoding for read operations
[Bit-field diagram: the Data bits above the reply code are zero; bits RC+1..RC carry the reply code.]
Figure 16. Configuration response message Data field encoding for write operations
○ Event messages encode the event identifier.
Figure 17. Event message Data field encoding
○ Trace messages encode the information gathered from the system. In addition, depending on the user configuration, they also encode the timestamp value as part of the payload, in its least significant bits. Note that streaming Trace messages do not attach the timestamp information.
[Bit-field diagram: the upper Data bits carry the raw data; the least significant bits, up to bit TS+1, carry the timestamp.]
Figure 18. Trace non-streaming message Data field encoding
[Bit-field diagram: the entire Data field (D down to 0) carries raw data.]
Figure 19. Trace streaming message Data field encoding
4.2.3 Timestamps
Timestamps are used to keep a record of when information has been captured, so that it can not only be ordered but the user can also see the elapsed time between different Trace messages. As we have seen in previous sections, the timestamp value is attached as part of the debug message in case the user has enabled this feature; this means that it occupies part of the Data field, so more information has to be exchanged.
Given that multicore systems can run one test for millions of cycles, we need to ensure we can capture a wide range of timestamp values, and this requires counters at the message source, that is, in each SoC block that contains debug monitor components. With an eye on reducing the area required, we defined a design that optimizes the area used for timestamp registers.
4.2.3.1 Synchronization
All timestamp counters in the system must be synchronized so that if monitor messages are
captured in different blocks we can order them at the host. This requires a way to start all
timestamp counters at the same time.
In the proposed design there is a timestamp counter within each block, used by all the monitor debug modules within that block, which is enabled upon the reception of a synchronization signal from a common point in the SoC. In our design this common point is the I/O peripherals block, as it contains the debug interface with the host.
Note also that, since the timing path from the I/O peripherals block to the rest of the blocks is different for each one, there is a need to ensure that all of them start at the same time, or with the same value. For this purpose, each SoC block that requires a timestamp counter has a different timestamp counter reset value, which depends on the propagation latency between the I/O peripherals block and the SoC block. That is, depending on how many cycles it takes for the synchronization signal to reach the SoC block, the timestamp counter starting value will be different, ensuring that all of them have almost the same value once they are enabled. Note that we cannot guarantee all the counters in the system have the exact same value, as there can be differences of ±1 cycle due to the fact that this signal needs to be synchronized to the local clock, and the clock-domain-crossing (CDC) logic on the SoC block has a certain variability depending on the arrival time of the signal in relation to the SoC block clock.
On the other hand, due to the fact that each SoC block runs at a different frequency, we need to ensure that timestamp counters increment at the same rate, so the timestamp increment relies on the Network On-Chip clock as a reference.
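The reset-value compensation can be sketched as follows: each block's counter is preloaded with its propagation latency, so that once running all counters read the same value (ignoring the ±1 cycle CDC variability). The per-block latencies used here are illustrative.

```python
# Sketch of the reset-value compensation: each block's timestamp counter is
# preloaded with its propagation latency from the I/O peripherals block, so
# all counters read the same value once running. CDC jitter is ignored here.
def counter_after(latency: int, cycles: int) -> int:
    """Counter value 'cycles' NoC clock ticks after the sync signal is issued.
    The counter is preloaded with 'latency' and enabled when the signal
    arrives, i.e. 'latency' cycles later."""
    count = latency
    for t in range(cycles):
        if t >= latency:          # enabled once the sync signal has arrived
            count += 1
    return count
```

For example, blocks with latencies of 3, 7 and 12 cycles all read the same counter value 50 cycles after the signal is issued.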
4.2.3.2 Timestamp computation
Another problem we face with timestamps is that, given that their value can reach thousands or millions, they can take a big chunk of the debug message payload, causing debug messages to be split into multiple messages and impacting throughput, which can end up causing bottlenecks and loss of information.
The proposed solution takes advantage of the fact that we can determine the maximum latency between the debug NoC routers attached to each SoC block and the memory subsystem blocks. This way we can compute the number of timestamp bits required to reconstruct the full timestamp value, ensuring we transport the minimum number of bits. This means that, using the Source field information and the deterministic latency, the full timestamp can be reconstructed because all timestamp counters in the system are synchronized. This allows the SoC block monitor debug components to send only the TS least significant bits of the timestamp counter value.
The original full timestamp value can be reconstructed at the I/O block or memory subsystems by taking into account the TS least significant bits sent by the debug component as part of the payload, the latency between the debug NoC router and the destination, the message source identifier, and the destination's current timestamp value when the message is received.
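The reconstruction can be sketched as follows: as long as the deterministic latency for a given source is below 2^TS cycles, the destination can rebuild the full value from its own synchronized counter and the received low bits, correcting for wraparound of the truncated counter.

```python
# Sketch of full-timestamp reconstruction at the destination. The source
# sends only the TS least significant bits; the (deterministic) NoC latency
# for that source is assumed to be below 2**ts_bits cycles.
def reconstruct(ts_lsbs: int, dest_now: int, ts_bits: int) -> int:
    """Replace the low bits of the arrival time with the received LSBs,
    correcting for wraparound of the truncated counter."""
    mask = (1 << ts_bits) - 1
    candidate = (dest_now & ~mask) | ts_lsbs
    if candidate > dest_now:          # capture precedes arrival, so the
        candidate -= 1 << ts_bits     # truncated counter wrapped in between
    return candidate
```

For instance, with TS = 4 bits, a message captured at cycle 110 and received at cycle 120 carries only the value 14 (110 mod 16), from which the full capture time 110 is recovered.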
In case the destination is the memory subsystem, the full timestamp value can either be stored in memory, if buffering is enabled, or sent to the debug interface with the host.
4.2.4 Interleaving
As has been introduced in previous sections, the debug monitor component allows the user to
generate debug Trace messages that can be either stored on memory or sent to the active
debugger.
We must ensure that, in case the debug NoC and memory controller throughput are lower than the tracing capabilities, the debug NoC buffers do not overflow when the SoC block monitor components generate a huge number of Trace messages. In addition, note that post-silicon debugging of multicore systems also requires gathering information from the system over long periods of time, which means that the debug NoC, and the debug infrastructure, must be able to sustain high throughput over long periods of time.
The interleaving feature is available when the multicore system has multiple memory subsystem blocks, and alleviates throughput bottlenecks by distributing Trace messages across the different system memories in an interspersed manner.
Due to the fact that the debug Trace messages from a given monitor component can end up in
multiple memory subsystems, it is recommended that the user enables timestamps so that
messages can be ordered at the host to reconstruct the state of the system.
On the other hand, it is important to note that interleaving is not restricted to a single enabled debug monitor component. That is, the user is allowed to enable tracing on multiple debug components, given that the generated Trace messages encode the destination and source information, which means that debug messages can be routed and decoded without problems. Nevertheless, it is not recommended to configure tracing capabilities on multiple debug components targeting the same memory subsystem when interleaving is enabled, as this not only stresses the debug NoC but may cause overflows that end up in messages being lost.
4.2.5 Streaming
The streaming feature allows the user to capture information from the system every cycle through a specific debug monitor component, when the amount of real-time data to be sent does not fit in a single debug message and the interleaving feature alone is not enough.
This feature is intended to be used together with interleaving in order to ensure that all generated
Trace messages can be handled and sent to the debugger, or stored in memory. As mentioned in the
Interleaving section, the user must enable timestamps to allow the host to order the Trace
messages.
The streaming feature focuses on optimizing the amount of information sent to the debugger, which affects how the timestamp value is computed and also modifies the payload. That is, streaming mode removes some operational fields from the debug Trace messages and uses the freed space to carry the information extracted by the debug monitor components, utilizing the full width of the debug NoC buses to send system information.
4.2.5.1 Requirements
When streaming mode is enabled, there must not be multiple debug monitor components on the SoC generating Trace messages targeting the same memory subsystem. Otherwise, the Trace messages from multiple debug components are merged and wrong information is either stored in memory or forwarded to the active debugger.
Note also that memory subsystems only store debug Trace messages and not debug Configuration
messages. Due to that, the rest of the SoC blocks can still be used to retrieve information based on
Configuration response messages such as read requests to registers.
4.2.5.2 Operational mode
Once streaming mode is enabled, the first Trace message sent by the debug monitor module
contains the information regarding the source component identifier, the SoC block identifier, the
target destination of the Trace message, the full timestamp value and a chunk with the extracted
signals values from the system. Note that this Trace message type is marked as non-streaming.
Then, the subsequent debug Trace messages, which contain only gathered information from the
system, are marked as streaming by changing the Source field Message Type sub-field value
accordingly. The information such as the source component identifier, the SoC block identifier and
the timestamp value are computed by the host. In order to do that, the host decodes all the debug
Trace messages by always looking at the Message Type sub-field value to identify streaming and
non-streaming Trace messages.
● In case Message Type is encoded so that streaming mode is disabled, then the message is
fully decoded and fields such as source component identifier and SoC block identifier are
retrieved from the message.
● In case Message Type is encoded so that streaming mode is enabled, then the host makes
use of the source component identifier, SoC block identifier, and the timestamp value of the
last Trace message that had Message Type set as non-streaming in order to fill the missing
information.
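The host-side decode rule above can be sketched as follows, using simple dictionaries to stand in for decoded messages and assuming, as in streaming mode, one message per cycle.

```python
# Sketch of the host-side decode loop: streaming Trace messages inherit the
# source identifiers from the last non-streaming header, and their
# timestamps follow from it, one message per cycle in streaming mode.
def decode_stream(messages):
    """messages: dicts with a 'streaming' flag; non-streaming ones also
    carry 'src', 'block' and 'timestamp'. Returns fully-filled records."""
    out, header, offset = [], None, 0
    for msg in messages:
        if not msg["streaming"]:
            header, offset = msg, 0   # new anchor for subsequent messages
            out.append(dict(msg))
        else:
            offset += 1               # streaming emits one message per cycle
            out.append({"streaming": True, "data": msg["data"],
                        "src": header["src"], "block": header["block"],
                        "timestamp": header["timestamp"] + offset})
    return out
```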
The first debug Trace message works as a header that separates the non-streaming extracted data from the subsequent streaming information. In case messages are stored in memory, the Message Type information is also part of the stored message. Given that the source component identifier and the SoC block identifier do not change while streaming mode is enabled, there is no need to store those fields. On the other hand, the timestamp is deterministic and can be regenerated by the host, as explained later in this document.
4.2.5.3 Timestamp computation
As mentioned in previous sections, the timestamp value is only captured in the first message, before streaming mode is enabled, and it is the responsibility of the host to compute the timestamps of the rest of the debug Trace messages while streaming mode is enabled.
Due to the fact that streaming mode generates a Trace message per cycle, and that interleaving is enabled, each memory subsystem is expected to forward a Trace message to the debugger or store it in memory every M cycles, where M is the number of memory subsystems enabled and targeted by the debug component when using the streaming feature.
On the other hand, in case Trace messages can no longer be stored or forwarded to the debugger
due to lack of resources, back pressure is applied to the debug monitor component and no new
streaming Trace messages are generated until backpressure is resolved. Once the debug monitor
can send new Trace messages, the first generated Trace message is encoded with full payload
including the source component identifier, the SoC block identifier and the full timestamp value,
and the Message Type field is configured to denote a non-streaming message. The subsequent debug Trace messages contain only gathered information from the system, with the Message Type sub-field set to mark that streaming is enabled.
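The monitor-side rule can be sketched as follows, modeling the debug NoC handshake with a per-cycle ready flag: after any stall the first message emitted again is a full non-streaming header, so the host can re-anchor the timestamps.

```python
# Sketch of the monitor-side behavior on backpressure: after any stall the
# first message emitted again is a full non-streaming header; 'ready'
# models the per-cycle handshake with the debug NoC.
def stream_trace(samples, ready):
    """samples[i] is captured at cycle i; ready[i] tells whether the debug
    NoC accepts a message that cycle. Returns the emitted messages."""
    out, need_header = [], True
    for cycle, (value, ok) in enumerate(zip(samples, ready)):
        if not ok:
            need_header = True        # data lost; must resynchronize
            continue
        if need_header:
            out.append({"streaming": False, "timestamp": cycle, "data": value})
            need_header = False
        else:
            out.append({"streaming": True, "data": value})
    return out
```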
This architecture allows the host to detect when streaming mode has stopped and started again, so that it can be aware of when tracing stopped and can compute the Trace message timestamps based on the new full timestamp value.
Finally, with interleaving enabled, the host is responsible for merging the memory subsystem buffer contents taking into account the order in which the debug Trace messages were generated. For example, when using two memory subsystems, one of the buffers contains the debug Trace messages from odd cycles while the other has those from even cycles, as the messages were sent interspersed.
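With timestamps enabled, as the Interleaving section recommends, this merge reduces to ordering all stored messages by capture time, which can be sketched as:

```python
# Sketch of the host-side merge of interleaved buffers: with timestamps
# enabled, ordering all stored Trace messages by capture time restores
# the original generation order.
def merge_by_timestamp(buffers):
    """buffers: one message list per memory subsystem; returns one ordered list."""
    return sorted((m for buf in buffers for m in buf),
                  key=lambda m: m["timestamp"])
```

In the two-memory example, one buffer holds the odd-cycle messages and the other the even-cycle ones; the merge interleaves them back into a single ordered stream.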
4.2.5.4 Bottleneck analysis
Multicore systems are likely to face situations in which, due to the amount of traffic generated, there are bottlenecks. Below is an analysis of the two most likely places where a bottleneck can appear and how it is handled by the proposed architecture.
Memory Controller
First, in case the debug infrastructure is configured to store Trace messages in memory, this feature relies on the readiness of the memory controller, since the debug subsystem needs it to read the debug messages received through the NoC and store them into the buffering memory.
There are two options that can be followed by the memory subsystem debug logic:
1. Send an acknowledgement to the debug NoC as if Trace messages were being stored properly. This option does not stall the system, but the user is not aware that messages are lost.
2. Do not send the acknowledgement to the debug NoC so that backpressure is propagated up
to the source. The debug monitor component ready signal goes low and, depending on the
configuration, tracing can be stopped and a notification is sent to the user.
The proposed architecture selects the second option in which backpressure is propagated to the
debug monitor components so that the user is always aware if gathered data from the system is
being lost. This allows the user to modify advanced filtering setup in order to reduce the amount of
information gathered from the system or to post-process the information on the host being aware
that there are cycles where information is missing.
Analytic component
The debug monitor component may not be able to send Trace messages because the interface with
the debug NoC is not ready, which can happen either because there are other monitors making use
of the debug NoC channel or because the memory subsystem is not ready and cannot store
messages so that backpressure has been propagated to the source.
The monitor components behave such that they stop sending debug Trace messages even if
streaming mode is enabled. Then, once the interface with the debug NoC is ready, the monitor
component sends a Trace message with streaming mode disabled and encodes all the required
fields and the timestamp value. Next, the rest of the Trace messages are sent with streaming mode
enabled as we have seen in previous sections.
This behavior ensures that the analytic component is aware that messages have been lost, so it can restart streaming mode, ensuring that once Trace messages can be stored again the timestamp is correct and the memory subsystem is able to store new messages. Furthermore, this behavior also allows the active debugger to detect that messages have been lost, since there are at least two debug Trace messages whose timestamps are not consecutive.
4.2.6 System resources access
The access to system resources is done through the debug Network On-Chip as it exposes AMBA
interfaces such as AMBA AXI4 and AMBA APB3 interfaces, which are then connected to caches,
memories and registers allowing the user to read and write without injecting traffic on the
non-debug NoC and interfering with the compute cores.
Note that AMBA is an open standard, widely adopted for the connection and management of functional blocks in Systems on-Chip, as it reduces the risk and cost of developing new protocols and of adapting different resources, such as third-party IPs, to them.
4.2.6.1 Event support
The user can configure the debug NoC master interfaces to trigger read or write requests when an
event with a certain identifier is received. The request target address, size and the value to be
written, in case of a write request, are pre-configured by the user and stored in debug NoC internal
registers such that it is transparent to the multicore system blocks.
For example, the user can configure the APB master interface to perform a write request to a run
control register based on a specific event, such that all cores would receive a halt request. This
event could have been generated based on a Logic Analyzer or Bus Tracker filter.
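As an illustration, the pre-configured request table could be sketched behaviorally as follows; this is a minimal Python model, and all names (`DebugNocMaster`, `EventRequestConfig`) are hypothetical, not taken from the actual design.

```python
# Behavioral sketch of the event-triggered request table: the user pre-configures
# a request per event identifier; when that event arrives, the debug NoC master
# issues the request transparently to the functional blocks.

from dataclasses import dataclass
from typing import Optional

@dataclass
class EventRequestConfig:
    address: int      # pre-configured target address
    size: int         # request size in bytes
    is_write: bool    # read or write request
    value: int = 0    # value to write (ignored for reads)

class DebugNocMaster:
    def __init__(self):
        # event identifier -> pre-configured request, set up by the user
        self.event_table: dict[int, EventRequestConfig] = {}
        self.issued = []  # requests sent on the AMBA master interface

    def configure(self, event_id: int, cfg: EventRequestConfig):
        self.event_table[event_id] = cfg

    def on_event(self, event_id: int) -> Optional[EventRequestConfig]:
        """Trigger the pre-configured request when a matching event arrives."""
        cfg = self.event_table.get(event_id)
        if cfg is not None:
            self.issued.append(cfg)
        return cfg

# Example: halt all cores when event 7 (e.g. a Logic Analyzer hit) is received.
master = DebugNocMaster()
master.configure(7, EventRequestConfig(address=0x4000_0010, size=4,
                                       is_write=True, value=0x1))
master.on_event(7)   # issues the pre-configured halt write
master.on_event(3)   # unconfigured event: nothing happens
```

The register address used here is of course an arbitrary placeholder for the run control register of the example.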
4.2.6.2 Polling & Compare
A nice-to-have feature for multicore systems is the ability to monitor a register or memory position,
waiting for it to reach a certain value. This is very useful, for example, if the user wants to know
when all cores have reached a certain point in the execution by monitoring a credit counter, or to
wait for all cores to hit a breakpoint and halt, in which case the debugger can poll the cores' status
register waiting for all of them to be halted.
In addition to the event support, the debug NoC also allows the user to configure polling behavior
for a given register or memory so that the master interface sends a read request and checks the
returned value until the condition is met.
The supported operations are:
● Check if the read register value is {equal, not equal, greater than, less than, greater or equal
to, less or equal to} a given value provided by the user.
● Check if the read value is within a range provided by the user.
● Check if the read value changes.
The user can set a maximum number of cycles or run indefinitely until the condition is met or
polling is manually disabled.
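The polling loop and its comparison operations can be sketched behaviorally as below; the operation set follows the list above, while the function names and operation mnemonics are illustrative.

```python
# Behavioral sketch of the Polling & Compare feature: re-read a register until
# the configured condition is met or the cycle budget expires.

def poll_condition_met(read_value, op, ref=None, lo=None, hi=None, last=None):
    """Return True when the polled register value meets the condition."""
    if op == "eq":        return read_value == ref
    if op == "ne":        return read_value != ref
    if op == "gt":        return read_value > ref
    if op == "lt":        return read_value < ref
    if op == "ge":        return read_value >= ref
    if op == "le":        return read_value <= ref
    if op == "in_range":  return lo <= read_value <= hi
    if op == "changed":   return last is not None and read_value != last
    raise ValueError(f"unknown operation {op!r}")

def poll(read_fn, op, max_cycles=None, **kw):
    """Send read requests until the condition is met; None = budget exhausted."""
    last = None
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        value = read_fn()
        if poll_condition_met(value, op, last=last, **kw):
            return cycle, value
        last = value
        cycle += 1
    return None

# Example: wait until all four cores report halted in a status register.
status_values = iter([0b0011, 0b0111, 0b1111])
result = poll(lambda: next(status_values), "eq", ref=0b1111, max_cycles=10)
# result == (2, 0b1111): the condition is met on the third read
```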
Actions
Once the register or memory location being read has the expected value and the comparator
condition is met, there are three available options:
● Generate an event, which is propagated through the system debug infrastructure allowing
to enable or disable filters, or perform run control operations.
● Generate a Trace message with the read value and forward it to the debugger.
● Generate a notification message and forward it to the debugger.
The user can also configure the debug infrastructure to generate a Trace message with the read
value for each iteration of the polling so that the user can have a sequence of values. Note that this
can generate huge amounts of debug traffic.
4.2.7 Routing
Debug NoC message routing is based on coordinates such that the bridges between the SoC blocks
and the debug NoC use the (X,Y) values to route messages through the routers north, south, east or
west interfaces.
Coordinate-based routing optimizes area, as the NoC does not need a static table containing every
SoC block identifier and the router interface to be targeted for each one. In addition, if a path
becomes unavailable, coordinates make it easier to re-route messages without complex algorithms.
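A minimal behavioral sketch of coordinate routing is shown below, assuming dimension-order (X-first) routing, which is one common policy and not necessarily the exact one used by the debug NoC.

```python
# Sketch of the (X,Y) routing decision at a single debug NoC router: compare
# the destination coordinates with the router's own and pick an output port.

def route(router_xy, dest_xy):
    """Pick the output port for a message given router and destination coords."""
    rx, ry = router_xy
    dx, dy = dest_xy
    if dx > rx:
        return "east"
    if dx < rx:
        return "west"
    if dy > ry:
        return "north"
    if dy < ry:
        return "south"
    return "local"   # message has arrived: deliver to the attached block

route((2, 2), (5, 2))   # -> "east"
route((1, 1), (1, 1))   # -> "local"
```

No per-block lookup table is needed: the decision uses only two coordinate comparisons, which is the area advantage mentioned above.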
Moreover, when comparing the coordinate methodology with enumeration, we take advantage of
the fact that there is no need to perform a full discovery of the system when it is initialized, nor to
reassign the SoC block identifiers when the topology changes. The software management is less
complex, causing less trouble for designers not only during integration but also during post-silicon
validation in case re-configuration is required.
The debug NoC design guarantees that it is lossless, that is, debug messages are not lost once they
are injected from the SoC blocks. In order to ensure this behavior, the debug NoC takes into account
the possible backpressure at the destination of the message such that buffering is added at the
endpoint. In case the endpoint cannot be accessed during a large period of time and the debug NoC
buffers are filled, then backpressure is propagated up to the source indicating that the debug NoC
does not accept more messages.
4.2.8 Network on-Chip interconnection
As explained in previous sections the debug NoC is integrated in the System on-Chip in such a way
that it exposes a set of input and output ports allowing SoC block debug components to send and
receive debug messages.
Because SoC blocks can have multiple debug components, and in order to avoid creating interfaces
that require too many I/O pins on the debug NoC, a new debug component is introduced.
This component handles the connection between the SoC block debug resources and the debug
NoC and is integrated as part of the debug NoC in such a way that its interface with the debug NoC
router consists of a set of ten fields, which relate to the Destination, Source and Data fields together
with the valid and ready signals for the connection with the SoC block debug resources.
On the other hand, the debug NoC component has N interfaces with the SoC block, where N is the
number of debug components within the SoC block, each interface consisting of the ten fields
introduced previously in this document.
When a message from the debug NoC is received, the component uses the debug component
identifier subfield of the Destination field to route the message to the corresponding SoC block
debug component interface. Moreover, access to the debug NoC interface that allows the SoC block
debug components to send Configuration responses, Trace messages or Event messages is
arbitrated between all the SoC block debug components following a Round Robin policy, which
gives all debug components the ability to send messages while ensuring that there is no starvation.
Note that the arbitration policy can be based on priorities, giving higher priority to Trace messages
so that the architecture focuses on gathering information at the expense of system access and
responsiveness, or the other way around, ensuring that the system debug resources can be accessed
at any time even while Trace messages are generated. That said, in this document we assume that
while tracing is enabled the user has reduced the activity on the rest of the debug components, so
that multiple debug components are not competing for the debug NoC interface.
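The Round Robin policy described above can be modeled with a rotating-priority scheme; this is a behavioral sketch and the class name is illustrative.

```python
# Sketch of round-robin arbitration between the SoC block debug components:
# the search for the next grant starts just after the previous winner, so no
# requesting component can be starved.

class RoundRobinArbiter:
    def __init__(self, n_components: int):
        self.n = n_components
        self.last = self.n - 1   # grant search starts after the last winner

    def grant(self, requests):
        """Grant the first requester after the previous winner, or None."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None  # no component is requesting this cycle

arb = RoundRobinArbiter(4)
arb.grant([True, True, False, False])   # grants component 0
arb.grant([True, True, False, False])   # grants component 1, not 0 again
```

A priority-based variant would simply replace the rotating search with a fixed ordering that favors the Trace interfaces.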
Figure 20. Debug NoC interconnection diagram
4.3 Monitoring
Visibility is essential when debugging execution failures on multicore systems. We need to provide
the user the ability to select which signals have to be monitored, allow filtering to look for specific
cases and gather information.
4.3.1 Logic Analyzer
This component provides the user with a wide variety of monitoring functions and can be used for
debugging and performance profiling. The Logic Analyzer allows observability of one or multiple
signals that designers connect to the filter input port. Then, based on advanced filtering, it can
either generate an event to be propagated through the SoC, gather information from the system,
notify the debugger or perform special operations such as increasing or decreasing internal
counters. Note that the Logic Analyzer is also connected to the SoC debug infrastructure allowing it
to be configured and/or send data or events.
This debug component can be parameterized, allowing each instance in the system to be unique
and completely adapted to the designers' requirements. The Logic Analyzer operates in two steps:
1. Reads the input filter bus to identify the condition of interest each cycle.
2. Takes action when the condition is met:
a. Generates an event that can be propagated through the whole SoC to enable
or disable other debug features. Events are sent the next cycle.
b. Traces data by reading the whole filter bus value or a subset of consecutive
bits in the bus and forwards it up to the hierarchy. In case trace messages
cannot be sent on the next cycle, then data is captured, stored in internal
buffers and sent as soon as possible.
c. Gathers performance metrics using internal counters.
As we can see, the Logic Analyzer always performs filtering before storing captured data in the
internal buffers, in order to avoid filling the buffers with data that does not meet the filter
requirements and losing visibility due to bottlenecks.
In addition, the Logic Analyzer supports timestamps so that generated messages can have a
timestamp value attached to their information. The Logic Analyzer timer can also be used to
generate messages every N cycles, where N is a value the user can configure at runtime, in order to
report internal counter values or take snapshots of the filter input bus.
Figure 21. Logic Analyzer block diagram
4.3.1.1 Interface
The Logic Analyzer is connected to the debug NoC through an I/O interface that is used for
configuration and tracing purposes; more details about this interface and its message protocols are
given later in this document.
On the other hand, there is a filter interface consisting of an input bus, whose width is
parameterizable, where designers connect the signals they want to analyze in real-time. There is
also a general purpose output bus (gpio), whose value can also be specified by the user through the
external debugger, that can be used to control multiplexers to select among different groups of
signals to be monitored. There is no valid signal for the filter interface since the Logic Analyzer
assumes that the data to be filtered is correct and that the user makes use of a bit within the signals
being filtered to validate the data when needed.
There is no need to have multiple filter input ports since the comparators, as is explained in the next
section, can be configured to target different subsets of bits on the same filter input bus, that is,
designers can attach several signals from different blocks to the same input bus and then make use
of multiple comparators to select different subsets and apply a particular filtering in each of them.
In addition, there is an output port that shows if the Logic Analyzer is enabled allowing designers to
apply clock gating to the debug logic around the debug module if it is disabled. This has a direct
impact on dynamic power.
4.3.1.2 Filtering
In order to identify conditions of interest the Logic Analyzer makes use of one or more filters that
can be managed by the user at runtime; note that the number of filters is parameterizable. Each
filter can be configured with one or multiple comparators, which means that the user can specify
multiple conditions that must be met to perform an action.
Each comparator has three fields that must be configured:
● A mask that specifies the bits of the filter bus that must be compared.
● A value to compare.
● The comparison operation to be performed.
The comparator hits if the bits of the bus selected by the mask comply with the operation and the
specified value. The supported comparator operations are:
● Less than (lt)
● Bigger than (bt)
● Equal (e)
● Not equal (ne)
For example, if the user configures a comparator with a mask of 0xF0, the operator less than (lt)
and the value 0xA0, the comparator hits when bits [7:4] of the signals connected to the Logic
Analyzer input filter port are less than 0xA. Note that the bits selected by the mask can be spread
out and do not need to be consecutive.
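The masked comparison in the example can be modeled as follows; the operation mnemonics follow the list above, while the function name is hypothetical.

```python
# Sketch of the masked comparator: the comparison is applied only to the bits
# of the filter input bus selected by the mask.

def comparator_hit(bus_value: int, mask: int, op: str, ref: int) -> bool:
    masked = bus_value & mask
    ref_masked = ref & mask
    if op == "lt":
        return masked < ref_masked
    if op == "bt":                      # "bigger than"
        return masked > ref_masked
    if op == "e":
        return masked == ref_masked
    if op == "ne":
        return masked != ref_masked
    raise ValueError(f"unknown operation {op!r}")

# mask 0xF0, lt, 0xA0: hits when bits [7:4] of the bus are less than 0xA
comparator_hit(0x93, 0xF0, "lt", 0xA0)   # bits [7:4] = 0x9 -> hit
comparator_hit(0xB2, 0xF0, "lt", 0xA0)   # bits [7:4] = 0xB -> no hit
```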
Because the Logic Analyzer comparators use masks to select a subset of bits to be filtered, the
registers that hold the comparators' mask and value configurations do not need to be as wide as the
filter input bus. That is, the Logic Analyzer internal registers that store comparator configurations
can be smaller than the filter input bus. This allows designers to attach large groups of signals to
the filter input bus without causing a huge increase in the Logic Analyzer area, as the internal
registers would otherwise scale with the bus width.
Following the description above, the Logic Analyzer is parameterized so that designers specify if
they want one or multiple comparators to be smaller than the filter input bus and set the
comparator mask and value width. This reduces the amount of area needed without sacrificing
visibility and debuggability. Note that when using smaller comparators, the subset of bits to be
monitored must always be consecutive, and the user specifies during runtime configuration the
position of the least significant bit of the value to be compared within the filter input bus.
In summary there are five parameters that the user must configure when instantiating the module:
● Total number of filters.
● Filter input port width.
● Total number of comparators.
● Number of comparators that can be used for ranges.
● Size of comparators mask and value fields that are used for ranges.
For example, let's suppose that designers want to analyze the core pipeline, so they have connected
128 bits from different control signals. Then, they want to define a subgroup to analyze the retired
instructions, in which case they are interested in the program counter value, which we assume is 49
bits wide, plus a valid signal. The designers can parameterize the Logic Analyzer so that it has two
comparators of 50 bits, allowing it to look for the upper and lower bounds of a specific range. The
rest of the Logic Analyzer comparators, which would be as wide as the filter input bus (128 bits),
could be used to look for specific bits within the filter input bus.
A possible configuration for this integration could be as follows:
● Total number of filters set to 1.
● Filter input port width set to 128.
● Total number of comparators set to 4.
● Number of comparators that can be used for ranges set to 2.
● Size of the comparators that are used for ranges set to 50.
This configuration allows the user to use the 2 comparators to look for a subset of Program Counter
values, and use the rest of the comparators (2 in this case) for other filtering purposes that require
extra matching conditions based on valid, exceptions or other subset of bits of interest. The first two
comparators' mask and value registers would be limited to 50 bits, while the other two comparators
would have the same width as the filter input bus.
It is important to note that filter trigger condition depends on one or multiple comparators, and all
comparators can be used in a single filter.
Note that parameterization allows the designer to adjust the capabilities of the debug component
while having an important impact on area. Taking the example above, reducing the size of two
comparators allows the designer to save about 30% of the flops dedicated to the Logic Analyzer
comparators, since the comparator mask and value registers require fewer flops.
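As a sanity check of the 30% figure, we can count only the flops that hold the comparators' mask and value registers in the example configuration:

```python
# Flop count for the comparators' mask and value registers:
# full-width configuration vs. the mixed configuration of the example.

bus_width = 128
full = 4 * 2 * bus_width                  # 4 comparators x (mask + value)
mixed = 2 * 2 * bus_width + 2 * 2 * 50    # 2 full-width + 2 range (50-bit)
saving = 1 - mixed / full                 # 312 / 1024, roughly 0.30
```

This counts 1024 flops for four full-width comparators against 712 for the mixed configuration, a saving of about 30%, consistent with the figure quoted above.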
4.3.1.3 Counters
A wide range of performance metrics can be analyzed using Logic Analyzer internal counters. Each
counter is assigned to a filter such that when the filter condition is met, the counter is incremented
or decremented. Since counter action is based on filters, the user can decide to either select a
subset of bits and look for a range of values or look if a single bit in the bus is high or low.
In addition, the user can also configure a threshold such that an event or message is generated
when the counter reaches a certain value, or make use of the Logic Analyzer timestamp feature to
retrieve the counter value every N cycles, where N is a value configurable by the user at runtime.
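A filter-driven counter with a threshold event can be sketched behaviorally as below; the class name and event encoding are illustrative.

```python
# Sketch of a Logic Analyzer internal counter: it is tied to a filter, steps on
# every filter hit, and emits an event when the configured threshold is reached.

class FilterCounter:
    def __init__(self, threshold: int, increment: bool = True):
        self.value = 0
        self.threshold = threshold
        self.step = 1 if increment else -1
        self.events = []   # events emitted when the threshold is reached

    def on_cycle(self, filter_hit: bool):
        if filter_hit:
            self.value += self.step
            if self.value == self.threshold:
                self.events.append(("threshold_reached", self.value))

counter = FilterCounter(threshold=3)
for hit in [True, False, True, True, True]:
    counter.on_cycle(hit)
# counter.value is now 4 and one threshold event was generated
```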
4.3.1.4 Actions
When filter condition is met the Logic Analyzer module can perform different actions depending on
the user configuration.
Events
Events are messages that are broadcast through the whole debug infrastructure by default.
Events can be used to enable or disable filters from monitor modules or to apply run control
operations such as halt or resume. The user can enable event generation and set the event
identifier to be propagated.
Trace
When this action is selected, the filter input bus value is captured and sent through the debug
message infrastructure up the hierarchy to the debugger. By default the Logic Analyzer sends the
whole bus value, but the user can configure it to select a subset of consecutive bits to be traced in
order to reduce the amount of data sent, lowering traffic on the debug message infrastructure and
potentially avoiding congestion that could end up causing backpressure and loss of information.
In addition, the trace data buffer, which stores the trace values to be sent, has a parameterizable
size, allowing the required area to be reduced.
If the user sets a trace data width smaller than the filter input bus width, then by default the trace
data buffer stores the least significant bits of the filter bus. As explained in previous sections, the
user can configure the Logic Analyzer at runtime to specify the subset of bits from the filter input
bus that must be captured, but if that subset is wider than the trace data buffer, only its least
significant bits are captured and forwarded.
4.3.2 Bus Tracker
This component provides the user monitoring capabilities over transactions that follow a protocol,
so that it can do performance analysis while providing observability and debuggability features. The
Bus Tracker allows observability of bus transactions so that designers connect the request and
response critical signals together with some control signals to allow advanced filtering. Then, based
on the comparators configured by the user, the debug component can either generate an event to
be propagated through the SoC, gather information from the system, notify the debugger or
perform special operations such as increasing or decreasing internal counters. Note that the Bus
Tracker is also connected to the SoC debug infrastructure allowing it to be configured and/or send
data or events.
The Bus Tracker analyzes bus transactions based on the user's configuration to track the requests
in-flight. It can be parameterized allowing each instance in the system to be unique and monitor
buses that use different protocols. Similar to the Logic Analyzer, it operates in two steps:
1. Reads the input bus values to identify the condition of interest each cycle.
2. Takes action when the condition is met:
a. Generates an event that can be propagated through the whole SoC to enable or
disable other debug features. Events are sent the next cycle.
b. Traces information from the transaction and forwards it up to the hierarchy. In case
trace messages cannot be sent on the next cycle, then data is captured, stored in
internal buffers and sent as soon as possible.
c. Gathers performance metrics using internal counters.
In addition, the Bus Tracker also supports timestamps so that generated messages can have a
timestamp value attached to their information. The timestamp can also be used to generate
messages every N cycles, where N is a value configurable by the user at runtime.
Note that the Bus Tracker is integrated on the SoC blocks, providing visibility of the requests and
responses received and sent. Different options, which focus on adding debug hardware on the
non-debug NoC [41][42][43], have been considered in order to give the user visibility of the
transactions in-flight and detect possible issues. Unfortunately, some of those proposals use the
compute cores to analyze the transactions in-flight [41], interfering with the software application
being executed in the system, while other approaches [42] require huge amounts of buffering,
causing a big area overhead (up to 24%). There are also more exotic solutions, such as using
wireless communication [43], but these increase the complexity of the SoC debug infrastructure. As
a solution, it was decided to use simple but powerful monitors integrated on the SoC blocks.
Figure 22. Bus Tracker block diagram
4.3.2.1 Interface
The Bus Tracker is connected to the debug NoC through an I/O interface that is used for
configuration and tracing purposes; more details about this interface and its message protocols are
given later in this document.
On the other hand, the filter interface consists of a set of input buses whose width is
parameterizable that are used to identify the requests and responses on a given interface. The set
of input buses consist of:
● Request valid and ready signals.
● Request unique identifier value.
● Response valid and ready signals.
● Response unique identifier value.
● Control signals.
Control signals can be used by designers to specify fields such as the request address, request size
or other signals used for filtering purposes, meaning that custom protocols are supported. In
addition, they can also be used to attach response fields such as the error response or returned data.
There is also a general purpose output bus (gpio), whose value can also be specified by the user
through the external debugger, that can be used to control multiplexers to select different groups of
signals if needed. Furthermore, there is an output port that shows if the Bus Tracker is enabled,
allowing designers to apply clock gating to the debug logic around the debug module when it is
disabled, reducing dynamic power.
4.3.2.2 Filtering
In order to identify the condition of interest the Bus Tracker makes use of one or more filters that
can be managed by the user at runtime. Each filter is associated with one or multiple comparators,
which means that the user can specify multiple conditions that must be met to perform an action.
Each comparator is configured by the user at runtime to perform a given operation looking either at
the request or response identifiers, or a subset of bits within the control signals attached to the Bus
Tracker input ports. Note that when comparing against the control signals, the user must configure
the subset of bits of interest and the expected value.
The comparator hits if the interface input port values comply with the operation and the configured
values and mask. The supported comparator operations are:
● Less than (lt)
● Bigger than (bt)
● Equal (e)
● Not equal (ne)
Bus Tracker parameterization allows the user to define buffering size so that request information is
stored for the transactions in-flight that comply with the advanced filtering configuration. There are
two buffers that allow tracking requests and responses information.
The first stores the request transaction identifiers which allow the Bus Tracker to match in-flight
requests and responses. The buffer width must be set to the request transaction identifier field
width and its depth corresponds to the maximum number of concurrent requests that can be
in-flight so that the user has full visibility of all requests in-flight and no information is lost. In
addition, each buffer position has an index identifier that points to the second buffer position
where further information about the transaction request and response signals is stored, so that this
information can be returned to the user for further analysis if needed.
On the other hand, the second buffer, named the trace buffer, allows the Bus Tracker to store extra
request and response information, letting the user trace the request address, response reply code
and/or other control signal values. The fields stored in each buffer position are configured by the
user at runtime. Note that the buffer width must be set by the designer to allow tracking all or a
subset of this information, while the buffer depth depends on how many transactions' information
must be saved for tracing purposes.
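The interaction between the two buffers can be sketched behaviorally as below; the class and method names are hypothetical, and a dictionary stands in for the hardware ID buffer.

```python
# Sketch of the Bus Tracker buffering: an ID buffer matches in-flight requests
# with their responses, and each entry points into a trace buffer position
# holding the captured request fields.

class BusTracker:
    def __init__(self, depth: int):
        self.id_buffer = {}              # transaction id -> trace buffer index
        self.trace_buffer = [None] * depth
        self.free = list(range(depth))   # free trace buffer positions

    def on_request(self, txn_id: int, fields: dict) -> bool:
        """Capture a request that passed the filters, if space remains."""
        if not self.free:
            return False                 # trace buffer full
        idx = self.free.pop(0)
        self.trace_buffer[idx] = fields
        self.id_buffer[txn_id] = idx
        return True

    def on_response(self, txn_id: int):
        """Match a response to its request and release the entry."""
        idx = self.id_buffer.pop(txn_id, None)
        if idx is None:
            return None                  # response was not being tracked
        fields = self.trace_buffer[idx]
        self.trace_buffer[idx] = None
        self.free.append(idx)
        return fields                    # request info returned for analysis

bt = BusTracker(depth=2)
bt.on_request(0x11, {"addr": 0x1000, "size": 8})
bt.on_request(0x22, {"addr": 0x2000, "size": 4})
req = bt.on_response(0x11)   # returns the captured request fields
```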
In case the user makes a configuration mistake on the Bus Tracker component and tries to trace
more information than fits in each trace buffer position, the trace buffer is filled with a subset of
this information, as explained in the next sections.
Moreover, in case the trace buffer is full and new trace information cannot be stored the Bus
Tracker can either send a notification to the debugger, overwrite buffer positions acting as a circular
buffer, or discard new messages depending on user runtime configuration.
In addition, the user can also configure tracing to be applied in advance and post-process the
information in the host software. That is, if the user has configured a filter to track transactions
until a response is received (e.g. looking for a specific response signal value), the transactions
in-flight are always traced. Then, if the user wants to capture transaction request signals such as
the address but the trace buffer capacity may not be enough, the Bus Tracker can be configured to
forward this information to the debugger so that it does not need to be stored for long periods,
avoiding possible trace buffer overflows and allowing the designer to reduce the trace buffer width.
Note that the transaction unique identifier is always captured so that the host software can use it
during post-processing.
On the other hand, when a transaction response is received, filtering is applied and, in case the
filter configuration is met, the information is captured and sent to the debugger. The response
capture also always includes the transaction unique identifier. Using the captured transaction
unique identifier values, the host software can discard those trace messages whose transaction
response signals did not meet the filter conditions.
For example, the user can configure the Bus Tracker to look if an address is in a certain range with a
filter with multiple comparators or it can look if a single bit is high or low with just one comparator
assigned to that filter.
4.3.2.3 Counters
A wide range of performance metrics can be analyzed using the counters. Each counter is assigned
to a filter such that when there is a hit, the counter is incremented or decremented. The supported
performance metrics that can be gathered are:
● Count the number of transactions.
● Duration of transactions: minimum, maximum or average.
○ Tracks from the time the request is sent until a full response is received. In case of a
read request, the full response is achieved when the last beat of data is received.
In addition, the user can also configure a threshold such that an event or message is generated
when the counter reaches a certain value, or make use of the Bus Tracker timestamp feature to
retrieve the counter value every N cycles, where N is a value configurable by the user at runtime.
4.3.2.4 Actions
When a filter condition is met the Bus Tracker module can perform different actions depending on
the user configuration.
Events
The user can enable event generation and set the event identifier to be propagated in order to
enable or disable filters from monitor modules, or to apply run control operations such as halt or
resume.
Trace
When this action is selected, the user can also specify which input port data must be captured and
forwarded. For example, the user can select the request identifier field so that each time the filter
condition is met the identifier is returned to the debugger. In case of selecting the input control
signals bus, the user can also specify a subset of bits by means of a mask. By default the Bus
Tracker sends the value of all input ports.
In addition, the data buffer that captures trace values to be sent is parameterizable, such that the
total trace data buffer size can be smaller than the sum of the widths of all input ports.
4.3.3 Trace Storage Handler
The debug monitor components can generate huge amounts of Trace messages that can overwhelm
external interfaces such as USB or JTAG, making them the debug system bottleneck because of
their low bandwidth. In order to avoid losing information and to increase visibility, the debug
infrastructure can be configured to store those Trace messages in one or multiple memories. Note
that although trace buffering could be implemented with dedicated memories within the SoC, the
cost would not be realistic, as it would consume huge amounts of area and impact static power. For
this reason, the proposed debug architecture relies on the SoC memory subsystem, commonly
composed of DDR or High Bandwidth Memory (HBM) technology.
The routing of the debug traces is performed by the debug Network On-Chip based on the user
configuration, which is encapsulated in the Trace message Destination field. Note that in the
proposed infrastructure the Trace messages are always sent to the memory subsystem block, which
then forwards them either to the active debugger or stores them in memory; this simplifies the
debug NoC routing algorithm and the integration in multicore systems.
The Trace Storage Handler (TSH) is placed on the SoC block that is connected to the SoC memory
subsystem, and is responsible for applying the user configuration to the received Trace messages
such that Trace messages are stored in memory or forwarded to the active debugger either through
USB, JTAG or PCIe interfaces.
Due to the huge number of Trace messages that the memory subsystem can receive, it may not be
possible to store them in memory while the memory controllers are busy serving requests from the
compute cores. In order to avoid discarding messages and applying backpressure, it is
recommended to integrate a small SRAM that stores all received Trace messages until they can be
moved to memory when it becomes available.
On the other hand, it is also a key factor to ensure that the region of memory used to store the
Trace messages is not overwritten by the compute core as this would cause data corruption.
Figure 23. Trace Storage Handler block diagram
4.3.3.1 Trace storage configuration
Multicore systems can have one or multiple blocks containing the SoC memory, that is, there can
be several memory subsystem blocks. Because of that, the user must be able to define which
memory subsystem is targeted. In addition, the user must also be allowed to define a region of
memory to be targeted, ensuring that debug traces are not overwritten by the compute cores and
vice versa.
Each memory subsystem block instantiates a Trace Storage Handler component, which has a set of
internal registers defining (start address, size) tuples. This allows the user to define multiple
memory ranges within that memory subsystem to be reserved for debug traces, each one having a
unique identifier that can be targeted by the monitor debug components. Those regions of memory
act as circular buffers, and the user can configure at runtime whether buffer positions are
overwritten when the buffer is full or newly arriving debug traces are discarded.
The Trace message routing from the SoC blocks to the memory subsystem requires the user to
configure the debug monitor components to set the destination of the message. The destination
information is built by setting the memory subsystem block unique identifier on the most
significant bits, while the least significant bits specify the memory range identifier within that
memory subsystem. That is, as mentioned above, the Trace Storage Handler can be configured to
handle multiple memory ranges to store Trace messages; each one has a unique identifier within
the SoC block and is seen as a different debug component, such that the SoC block monitor debug
components can set a specific memory region as a destination so that Traces can be separated.
Once the debug NoC receives the Trace message from the SoC block debug component, it decodes
the Destination field and using the most significant bits targets the appropriate memory subsystem.
Then, when the Trace message arrives at the memory subsystem and is routed to the Trace Storage
Handler, the less significant bits of the Destination field are used to target the memory range to be
written.
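The Destination field encoding can be illustrated as follows; the 4-bit memory range field is an assumed example width, as the document does not fix the exact split.

```python
# Sketch of the Trace message Destination field: memory subsystem block id in
# the most significant bits, memory range id in the least significant bits.

RANGE_BITS = 4   # hypothetical width of the memory range identifier field

def encode_destination(mem_block_id: int, range_id: int) -> int:
    return (mem_block_id << RANGE_BITS) | range_id

def decode_destination(dest: int):
    return dest >> RANGE_BITS, dest & ((1 << RANGE_BITS) - 1)

dest = encode_destination(mem_block_id=2, range_id=5)
# The debug NoC routes on the MSBs; the TSH selects the range with the LSBs.
decode_destination(dest)   # -> (2, 5)
```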
4.3.3.2 Trace extraction operational modes
As we have seen, the debug traces can be stored in memory, which means that this information
must be read and returned to the debugger so that it can be analyzed. The user can configure the
Trace Storage Handler (TSH) to operate in two different modes to extract the information stored in
memories:
1. Memory is read automatically every N cycles and T traces are extracted, where N and T are
values configurable by the user
a. The debug traces are always sent to the same destination, which is also
configurable by the user when performing the initial TSH configuration.
b. Note that the user can specify how many Trace messages are extracted from
memory periodically, allowing the configuration to be tuned to avoid bottlenecks on
I/O interfaces.
2. Memory is read upon user requests
a. The debug traces are sent to the requestor as a response with data.
b. The user can gather multiple trace messages by performing read requests of
different sizes.
The Trace Storage Handler has read and write pointers which are updated automatically when
traces are read or written so that the user does not need to know which positions of memory have
valid traces or contain the oldest traced values.
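The automatic pointer management can be sketched as a simplified software model of the on-request extraction mode; all names are illustrative:

```python
class TraceStorageHandler:
    """Sketch of TSH read/write pointer management (mode 2: read on request)."""
    def __init__(self, size):
        self.mem = [None] * size
        self.size = size
        self.rd = 0      # points at the oldest valid trace
        self.wr = 0      # points at the next free position
        self.count = 0   # number of valid traces currently stored

    def store(self, trace):
        """Write a trace; the write pointer advances automatically."""
        if self.count < self.size:
            self.mem[self.wr] = trace
            self.wr = (self.wr + 1) % self.size
            self.count += 1

    def read_request(self, n):
        """Return up to n oldest traces; the read pointer advances automatically,
        so the requester never needs to track which positions are valid."""
        out = []
        while n > 0 and self.count > 0:
            out.append(self.mem[self.rd])
            self.rd = (self.rd + 1) % self.size
            self.count -= 1
            n -= 1
        return out
```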
On the other hand, when the user requests the oldest data, that data is sent to the requestor as a
response, which means that the user can make use of the master interfaces such as PCIe to extract
the traces from the memories and given that the read pointers are automatically updated, no
further work is needed by the host. Further information about PCIe support can be found later in
this specification.
4.3.3.3 Trace extraction through active debugger
The debug infrastructure supports multiple debuggers, which can not only be controlled by the host
through an external interface such as USB, JTAG or PCIe, but can also be integrated within the
multicore system, for example in the master core.
Due to the complexity of trace analysis, traces are decoded and visualized on the host system;
the master core can be used to retrieve trace information, but it is not recommended to use it for
analysis.
On the other hand, it is important to note that the PCIe connection with the host allows data transfers
to be initiated either from the host, through requests, or from the multicore system, as PCIe can
map host memory as system memory. This differs from the USB and JTAG interfaces, which neither
follow a request-response protocol nor make use of transaction identifiers. Because of
that, the PCIe support for reading debug traces differs from the one used for the USB and JTAG
interfaces.
PCIe is the preferred option when the System On-Chip debug infrastructure is generating huge
amounts of debug traces given that it provides higher bandwidth than USB or JTAG interfaces.
When debug traces are gathered through the USB or JTAG interfaces we make use of the
operational modes previously described in this specification, but for PCIe the system requires a
different methodology.
Debug trace extraction through PCIe
The standard PCIe features allow both the host and the device to initiate a transaction, provided
they are properly configured. Consequently, there are two possible mechanisms to move data from the SoC
memory to the host memory. First, the host can read the SoC memory, which is mapped in the host
system address map. Second, the SoC can map, within its PCIe address space, a region of the host
system memory and use write operations to send data to the host.
The PCIe access to the debug infrastructure requires the system to be configured in the appropriate
security level as is explained later in this specification.
PCIe as trace destination
The debug infrastructure has been designed to support using host memory as the storage endpoint
rather than the SoC memory. To achieve this, the debug infrastructure acts as an initiator that
writes the traces through the PCIe system into the host memory. This feature can be used when the
user does not want to modify the code executed on the compute cores so that it avoids a
SoC memory region reserved for storing debug traces.
The debug traces generated by monitor debug components are always routed to the memory
subsystem. However, the memory subsystem can be configured to forward the debug traces to the
PCIe, so that the traces are stored in host memory rather than in the SoC memory. It is
important to note that interleaving is still supported; in this case the host must allocate two
buffers/memory regions and configure each corresponding memory subsystem to target host
memory.
The memory subsystem must be configured by the user before the monitor components are
enabled and start to generate trace messages. The configuration applied is used by the memory
subsystem to handle the received trace messages and route those messages to the PCIe interface.
On the other hand, the host is responsible for allocating one or more buffers and advertising these
buffers to the SoC through descriptors that are stored in a FIFO in the SoC memory subsystems.
Each descriptor corresponds to a host buffer and is a tuple containing the PCIe start
address and the size of the host memory allocated for that buffer. The SoC memory
subsystem targets each of those descriptors and fills the host memory space with trace messages.
Once the current descriptor is filled, the SoC memory subsystem pops the FIFO to get the next
one, and this process continues until no more descriptors are available in the FIFO. Optionally, the
last descriptor can be kept and its contents overwritten, depending on the Trace Storage Handler
configuration; otherwise, new trace messages are discarded by the TSH until a new descriptor is
added to the FIFO.
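The descriptor-FIFO behavior described above can be modeled with a simplified sketch; the class and method names are hypothetical, and one "word" per trace is assumed for illustration:

```python
from collections import deque

class DescriptorFifo:
    """Sketch of host-buffer descriptor handling in push-to-host mode."""
    def __init__(self, keep_last_descriptor):
        self.fifo = deque()
        self.keep_last = keep_last_descriptor  # TSH overwrite option
        self.current = None                    # (pcie_start_addr, size)
        self.filled = 0                        # positions written in current buffer

    def push(self, pcie_addr, size):
        """Host advertises a buffer by pushing its descriptor."""
        self.fifo.append((pcie_addr, size))

    def write_trace(self):
        """Return the host address the next trace goes to, or None if discarded."""
        if self.current is None or self.filled == self.current[1]:
            if self.fifo:
                self.current, self.filled = self.fifo.popleft(), 0
            elif self.keep_last and self.current is not None:
                self.filled = 0            # keep last descriptor: wrap and overwrite
            else:
                return None                # no descriptor available: trace discarded
        addr = self.current[0] + self.filled
        self.filled += 1
        return addr
```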
The configuration to be applied to the Trace Storage Handler is as follows:
● The TSH is set to the push-to-host mode.
● The host stores descriptors in the TSH descriptor FIFO to specify the address range that can
be used in host memory.
● The TSH is configured to overwrite or not the last available descriptor.
In this case the SoC memory subsystem uses host memory as a buffer, and the software running on
the host is responsible for allocating one or more buffers large enough that, if desired, it has time
to move the data from those buffers to their final location.
Note that the host must send a dump request through PCIe, which enables the Trace Storage
Handler to start forwarding traces to the host memory. In addition, the SoC memory subsystem
must have a master interface with the non-debug NoC that allows it to send requests, as it is writing
system memory mapped addresses. It is recommended that the host initializes the reserved
memory region to 0s, so that it can determine how many Trace messages have been written and are
valid based on the number of memory positions within that region that have been filled with data
different from 0s.
Note also that, when using streaming mode, some of the message fields are reused to carry
information gathered from the system, but both the Destination field and the Source Message
Type subfield values are maintained to identify the destination and the type of the message
so that it can be successfully decoded. This means that the information written to the PCIe host
memory space is never all 0s: even if the Destination field information is removed and the
data gathered from the system is 0s, the Source Message Type subfield value is
always different from 0 for Trace messages. This allows the user to know how many Trace messages
have been gathered from the system and must be post-processed.
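Under the guarantees above, the host-side accounting reduces to scanning for the first zero entry. A sketch, assuming one trace message per buffer word (the function name and word granularity are assumptions):

```python
def count_valid_traces(buffer_words):
    """Host-side scan of a zero-initialized trace buffer. Every written Trace
    message is nonzero (the Source Message Type subfield is never 0), so the
    first zero word marks the end of the valid traces."""
    count = 0
    for word in buffer_words:
        if word == 0:
            break
        count += 1
    return count
```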
In addition, it is important to note that it is expected that the host analyzes the traces only when
the allocated buffers have been filled or when the trace has been disabled. This ensures that there
is no need for the host to manage the SoC TSH read and write pointers for a given memory region.
PCIe as requestor
As has been explained in previous sections, the debug traces can also be stored in the SoC memory
and read through PCIe at any given time. In this case the user can make use of either the debug NoC
or the non-debug NoC to retrieve those trace messages from the SoC memories.
Note that using one NoC or another has different advantages and inconveniences as shown below:
1. PCIe read traces through the non-debug NoC
a. Provides higher bandwidth than the debug NoC.
b. Can have collisions with real traffic generated by the compute cores.
2. PCIe read traces through the debug NoC
a. Provides lower bandwidth than the non-debug NoC.
b. Ensures there are no collisions with the traffic generated by the compute cores.
When using the debug NoC to retrieve Trace messages from SoC memory, the PCIe request targets the
Debug Transaction Manager virtual register that corresponds to the Trace Storage Handler; that is,
the read is captured by the Debug Transaction Manager, which then forwards the read request
through the debug NoC. This request is answered by the TSH on the SoC memory subsystem, and the
response is routed through the debug NoC up to the Debug Transaction Manager, where the PCIe
response is computed and forwarded. The Debug Transaction Manager and its PCIe support are
described later in this specification.
Note that it is up to the designer to define how the Debug Transaction Manager virtual registers are
configured. On the one hand, the designer can assign a specific virtual register to each Trace Storage
Handler instance in the SoC, so that the user at runtime can target this specific address with no
previous configuration required. On the other hand, the designer can make the Debug Transaction
Manager virtual registers configurable, such that the user first has to set up the virtual register to
target a certain SoC block and debug component and then perform as many reads as required.
On the other hand, when using the non-debug NoC approach to retrieve Trace messages from SoC
memory, the PCIe request targets the SoC memory subsystem Trace Storage Handler through the
SoC non-debug memory map. The TSH captures the unique transaction identifier and generates a
response attaching the Trace messages read from the SoC memory. Note that this means that the
Trace Storage Handler must be visible and accessible through the non-debug NoC.
Both options require the user to perform a read request from the host through the PCIe interface
targeting the SoC memory subsystem to initiate the read of debug traces from SoC memory to PCIe.
The Trace Storage Handler uses the unique transaction identifier of the PCIe request to route the
debug traces as a response. The response generated by the TSH can contain one or multiple traces;
the number of Trace messages returned is based on the read request size.
Following the same pattern as in the previous subsection, the Trace Storage Handler must be
configured by the user before the debug monitor components are enabled and start to generate
Trace messages. The SoC must reserve a specific address within its memory map to allow the user
to access the Trace Storage Handler through PCIe and enable the dump of Trace messages.
Note that interleaving is supported and the host is responsible for merging messages from one or
multiple SoC memory subsystems. On the other hand, there is no need for the host system to
reserve a memory region as traces are returned as responses from the first PCIe request.
4.3.4 Compute core execution analysis
On a multicore system it is essential to have visibility of the work done by the compute cores.
It is common practice to use dedicated monitors to track information from the cores, such as
the retired instructions, exceptions and other control signals, making use of debug components like
the ARM Embedded Trace Macrocell or the UltraSoC Trace Encoder. Unfortunately, on a multicore
system the number of compute cores is so large that we would require many of those dedicated
debug components, which would have a huge impact on area and power.
The proposed architecture allows the user to make use of the Logic Analyzer and connect it to the
compute core pipelines to track signals such as retired instructions, exceptions and other control
signals.
In addition, we also propose to add a special debug Control and Status Register (CSR) on each core so
that software developers can use it to monitor the execution of a software application on multiple
cores at the same time. This option is faster than using memory to store debug information and is
also less intrusive as it does not inject traffic on the non-debug resources but requires adding extra
hardware on the system.
Software designers could modify their code slightly by adding a CSR write instruction with a specific
value. Then, designers can connect all or a subset of the CSR bits to the Logic Analyzer, allowing the
execution status of one or multiple cores to be traced at the same time. Visibility is limited by the
Logic Analyzer filter input width and the subset of CSR bits connected from each
compute core to the Logic Analyzer.
Note that if the user wants to monitor all instructions retired by the compute cores, they would have
to connect the full Program Counter value of each of those cores to the Logic Analyzer, requiring a
very wide Logic Analyzer filter input bus. With the proposed CSR, in contrast, the user can connect
N CSR bits from each compute core to the Logic Analyzer and keep track of the software application
execution stage. For example, using the 10 least significant bits of the core CSR, the user can
encode up to 1024 different execution stages, so that monitoring 4 cores takes just 40 bits
rather than four times the Program Counter width, which is far larger.
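The bit-count argument can be illustrated with a small packing sketch; `STAGE_BITS` and the function names are illustrative, not part of the proposed hardware:

```python
STAGE_BITS = 10  # 10 LSBs of each core's debug CSR encode up to 1024 stages

def pack_stages(stages):
    """Concatenate the low STAGE_BITS of each core's debug CSR into the
    Logic Analyzer filter input; 4 cores need only 4 * 10 = 40 bits."""
    word = 0
    for i, stage in enumerate(stages):
        word |= (stage & ((1 << STAGE_BITS) - 1)) << (i * STAGE_BITS)
    return word

def unpack_stages(word, n_cores):
    """Recover the per-core execution stage values from the packed word."""
    mask = (1 << STAGE_BITS) - 1
    return [(word >> (i * STAGE_BITS)) & mask for i in range(n_cores)]
```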
As mentioned above, another feasible option is to modify the software so that the core writes
debug information to memory, which is later read through the debug infrastructure, but this
option is slower, more intrusive and requires reserving memory space.
4.3.5 Multicore system workload balance
On multicore systems it is fundamental to allow the master core to manage and distribute the work
on the compute cores to achieve better performance. The proposed design allows the master core
to make use of the debug infrastructure to gather compute cores performance metrics and use it to
balance workload.
It is standard for the SoC blocks that contain the compute cores to have one or multiple memory
mapped registers that are used as counters, so that the software can be programmed to increment the
register value each time a given phase has started or finished. In case there are multiple SoC blocks
with several compute cores, it is highly recommended that the most significant bits of the register
are used to specify a unique identifier that differentiates between those blocks when
data is retrieved.
Then, based on features previously mentioned in this specification, the debug NoC master
interfaces (e.g. AMBA AXI4 or AMBA APB3) can be configured to perform polling and compare
operations targeting those performance registers. This configuration is applied such that the polling
comparator mask selects the least significant bits, which correspond to the counter value, and masks
out the SoC block identifier.
The comparator is then configured to match under a certain threshold value and to perform a write
request to a known memory address with the full register value. In addition, the comparator is also
configured to generate an event that is captured by the master core so that it is aware that memory
has been written with an updated counter value.
Once the master core has received the debug event, it performs a read request to the known
memory address to extract the updated counter value. Finally, the master core can use this
retrieved data to balance workload if needed. Note that when configuring multiple comparators to
monitor multiple SoC blocks, it is recommended to use a unique event identifier and memory
address for each SoC block, so that the master core knows which SoC block counter value has
been updated and the specific memory address where this value has been written. Thanks to its
flexibility, the debug configuration could instead make use of a single event identifier and a counter,
so that when the counter reached a threshold the master core would read the corresponding memory
region to gather the information.
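The polling and compare step can be sketched as follows; the counter width is an assumed example value, and the return structure is illustrative:

```python
COUNTER_BITS = 24  # assumed: low bits hold the counter, high bits the SoC block ID

def poll_and_compare(reg_value, threshold):
    """Sketch of the debug NoC comparator: mask out the SoC block ID,
    compare the counter against the threshold, and report whether a write
    of the full register value plus an event should be generated."""
    counter = reg_value & ((1 << COUNTER_BITS) - 1)
    block_id = reg_value >> COUNTER_BITS
    if counter < threshold:
        return {"event": True, "block_id": block_id, "value": reg_value}
    return {"event": False}
```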
The master core can receive debug events through an interrupt controller component that is
integrated in the SoC. This component receives Event messages like the rest of the SoC debug
components, allowing the debug infrastructure to generate interrupts that are later captured by the
master core. In addition, the interrupt controller contains a set of registers that store the
outstanding events so that no events are lost.
It is highly recommended not to enable other monitors to trace or generate events such that the
traces on memory subsystem and the events in-flight only correspond to polling & compare
configuration on the compute cores SoC blocks.
Note that the polling & compare configuration must ensure that traces are always stored on the
same memory subsystem such that the master core gathers data from only one location no matter
which compute core SoC block generated the trace. That is, both interleaving and streaming debug
NoC features must be disabled.
4.3.6 Network On-Chip
As introduced earlier in this document, nowadays most Systems on-Chip rely on NoCs rather than
point-to-point connections, which do not scale well, are not reusable and can become a
bottleneck.
The proposed architecture allows the user to integrate both the Logic Analyzer and the Bus Tracker
on the non-debug Network on-Chip allowing advanced filtering and information gathering from the
system. Note that due to the fact that the Bus Tracker does not follow a specific standard protocol,
it can be attached to the NoC routers even if they are designed to follow custom protocols.
On the other hand, the user can also integrate access to the non-debug NoC routers performance
counters either filtering through the Logic Analyzer or accessing through the debug NoC exposed
master interfaces.
In addition, the user can also inject requests through the different leader interfaces in the system,
which can be either JTAG/USB or the master core, to test different routings. Note that the Design
For Testability (DFT) infrastructure connects all the flip-flops in the system in one or multiple scan
chains, allowing further testing.
Other solutions have been considered [41], such as periodically taking snapshots of the
network, storing the data in the cores' cache memory and analyzing it using software algorithms, but
they have been discarded as they change SoC functionality and impact the way cores behave in the
system.
4.3.7 Hardware assertions
In the last sections we have focused on providing monitoring features that allow the user to
perform filtering and gather information from the system, assuming that the monitor modules
were running fast enough to capture the changes on all signals.
This may not happen when running on silicon, as there may be bits that flip too fast to be detected
by the monitors but that still cause wrong behavior in the hardware. There are two known error
scenarios: silent errors and masked errors. The former are errors that propagate far enough to be
observable but are missed due to insufficient checking, while the latter are errors that are masked
out before they can be observed. Recording these types of errors is critical, so we need to ensure
that error latency does not become a problem.
It is important to note that bit-flips take place depending on the logic state of the circuit, the
workload and electrical state such as voltage supplies or possible drops. That is, they do not keep a
constant value that can be observed or produce obvious hangs in the system. Due to that, it is
recommended to add further visibility on items that cannot be tested in simulation such as voltage
crossings, Phase-locked loops (PLL) stability or regulators.
During the pre-silicon verification stage, assertions allow the user to ensure that one or several
signals comply with a specific functionality and follow temporal and regular expressions. Those
assertions are added as part of the source code and can be simulated, allowing functional errors to
be detected. The most common languages for writing assertions are the Property Specification
Language (PSL) and SystemVerilog Assertions (SVA).
The proposed architecture recommends making use of assertions during post-silicon validation in
order to detect bit-flips. New techniques [44][45] have been introduced that allow not only mapping
pre-silicon assertions to post-silicon hardware [44] but also automating this effort and introducing
non-obvious assertions that span multiple clock cycles and different modules automatically [45].
Note that bit-flips are caused by problems in the design netlist rather than functional errors,
meaning that the assertions to be placed must cover items that can be affected during
manufacturing and cannot be detected using monitors such as the Logic Analyzer or the Bus
Tracker. In addition, because designers may have added a huge number of assertions during the
pre-silicon phase, it is necessary to select which ones must be mapped so as not to have a huge
impact on area.
4.4 Debug modules messaging infrastructure
4.4.1 Interface
The debug components are connected to the debug NoC without intermediate message-routing
components such as the UltraSoC Message Engine. This means that the interface between the
debug components and the debug message infrastructure is based on the protocol shown earlier in
this document.
Multicore systems generally make use of different voltage regions and due to that the messaging
protocol must support voltage crossings, queues and other logic structures that require
synchronization between the leader and the follower placed at different voltages or running at
different frequencies. The proposed debug protocol, which is based on handshaking, supports
both voltage and clock crossing.
Note that the handshaking protocol consists of valid, ready and data signals. The leader sends a
message by placing the information on the Data field and validating it by asserting the valid signal.
The follower makes use of the ready signal to acknowledge to the leader that it can capture the
information. In case the follower is not ready, the leader keeps valid asserted until the follower
asserts ready.
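A minimal cycle-level model of the valid/ready handshake (an illustration, not RTL; function names are ours):

```python
def handshake(leader_valid, follower_ready):
    """One cycle of valid/ready handshaking: a beat transfers only on a cycle
    where both valid and ready are asserted."""
    return leader_valid and follower_ready

def run(valid_trace, ready_trace, data_trace):
    """Simulate a sequence of cycles and collect the transferred beats.
    The leader is assumed to hold valid and data stable until ready is seen."""
    received = []
    for v, r, d in zip(valid_trace, ready_trace, data_trace):
        if handshake(v, r):
            received.append(d)
    return received
```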
On the other hand, it is the user's responsibility to implement voltage and/or clock crossing logic
between the debug NoC and the debug components if required. Below we introduce how the user
can implement voltage crossing for handshaking protocol such as AMBA AXI4 between a leader and
a follower running at different voltages.
Figure 24. Voltage crossing implementation for handshaking protocol
On the other hand, note that protocols such as AMBA APB3 require more complex voltage crossing
support, due to the fact that multiple signals are used to start a transaction. Below is an
example of the recommended voltage crossing support for an AMBA APB3 leader
transaction:
Figure 25. Voltage crossing implementation for AMBA APB3 protocol
4.4.2 Compression
Reducing the amount of information that is sent between the active debugger and the SoC debug
components is essential in order to reduce the probability of losing data due to bottlenecks.
In previous sections we have seen that the debug NoC implements an interface and protocol that
are lossless; that is, once a package has been injected into the NoC it always reaches its
destination and, if needed, is stored in a buffer. Given that backpressure is propagated up to the
debug components, visibility can be lost for anywhere from a few cycles to hundreds of cycles.
It is therefore important to try to reduce the amount of information in flight on the debug
infrastructure. Because of that, we now focus on reducing the amount of information that debug
components send and receive through the debug NoC Data and Source fields, given that the
Destination field already contains the minimum possible amount of information and is required by
the debug NoC routers to forward the information along the correct path.
Neither the Source nor the Data field is used by the debug NoC routers, so this information can be
compressed on the source SoC block and decompressed at the destination, reducing the amount of
information sent. The proposed method focuses on Trace messages, given that Configuration and
Event messages do not stress the debug messaging infrastructure and are not generated every cycle,
while Trace messages can stress the debug NoC when the interleaving or streaming features are
enabled. For this reason the Source Message Type information is not compressed; this way the
destination debug component decodes only the Trace messages.
There are several research articles [46][47][48] that discuss compression algorithms for memories
and caches, so in this section we take some of their main ideas and adapt them to the proposed
debug architecture with the goal of building a more efficient messaging infrastructure.
First, decompression and compression methods must be extremely fast so that the debug
component can send information every cycle and the compression/decompression method does
not become a blocker.
Second, the added hardware area overhead must be small compared to the alternative of increasing
the number of wires in the debug NoC to provide more bandwidth, which would also impact power
consumption.
Third, the algorithm should ensure that no information is lost when packages are compressed.
In our case the data obtained from the system does not follow a clear pattern, since it belongs to
control signals whose values change depending on many factors, but some of the bits may not
change for several cycles as they can belong to finite state machines or logic that does not
toggle often.
4.4.2.1 XOR and RLE compression
The XOR technique does not compress the information by itself and requires very little time to
compute the value to be sent, which ensures that latency constraints are met and compression does
not become a problem.
The idea is to apply an XOR operation between the data sent in the previous package and the data
to be sent in the current one. This operation is easy to invert, both on the host and within a debug
component, and ensures that any part of the package that has not changed becomes 0s. The XOR
operation therefore maximizes the number of consecutive zeros, enabling better use of the Run
Length Encoding (RLE) compression algorithm.
If not many bits have changed, the data to be transmitted contains several 0s. Then we can make
use of the Run Length Encoding (RLE) compression algorithm [49], which allows us to compress data
in which a number is repeated several times by specifying the number of repetitions and the number
itself. For example, a data package whose data is "0000aadd000" would be translated to "402a2d30".
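The XOR-plus-RLE scheme can be sketched on byte-granular data as follows (the run-length cap of 255 and the (count, value) pair representation are assumptions for the example):

```python
def xor_delta(prev, curr):
    """XOR the previous package with the current one: unchanged bits become 0."""
    return bytes(p ^ c for p, c in zip(prev, curr))

def rle_compress(data):
    """Run Length Encoding: emit (count, value) pairs, capping runs at 255."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i] and j - i < 255:
            j += 1
        out.append((j - i, data[i]))
        i = j
    return out

def rle_decompress(pairs):
    """Invert rle_compress: expand each (count, value) pair."""
    return bytes(b for count, value in pairs for b in [value] * count)
```

Because XOR is its own inverse, the receiver recovers the current package by XOR-ing the decompressed delta with the previously received package.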
Once XOR is applied, other compression algorithms can also be used, such as LZ77 [50], which
encodes data using tables that map message patterns to identifiers and change dynamically based
on the number of pattern appearances. That said, this adds complexity and requires extra logic in
the debug infrastructure.
Nevertheless, if the Run Length Encoding compression algorithm is used when the data captured by
the debug monitor components changes often, the amount of information to be sent can increase
and become bigger than its original size. For this reason, RLE compression is an optional feature
that can be enabled and disabled by the user at runtime, depending on whether enough bits of the
captured information remain unchanged for RLE compression to be worthwhile for the monitored
resource. Then, since all Trace messages are post-processed at the host, decompression is applied
there when required.
4.5 Access to the debug infrastructure
The proposed debug infrastructure can be managed by multiple master interfaces as has been
explained in previous sections. In order to ensure that each master does not require a different
protocol, which increases complexity of the infrastructure, it is important to define a messaging
infrastructure that can be used by multiple protocols to access all the SoC debug features.
4.5.1 Debug address map
Each master connected to the debug infrastructure has its own interface that is based on different
protocols such as JTAG, USB, PCIe or AMBA AXI4. In order to develop an infrastructure that is
accessible from all those interfaces and to maintain a common implementation for coherency, it is
necessary to use the same debug protocol and rules.
The debug resources are all exposed using the same debug address map, which allows all master
interfaces to refer to a debug feature using the same address. A component is then needed that can
translate from each master protocol to the debug protocol: the Debug Transaction Manager (DTM).
This component connects the master interfaces with the debug NoC, allowing the active debugger
to access all the SoC resources.
For the master core and PCIe to access the debug SoC infrastructure, the request address is used to
target the Debug Transaction Manager component that will handle the connection. Note that the
DTM has an address range in the system memory map that allows the master core and PCIe to
access this component. Further information on how this is managed can be found later in this
document.
Once the request has been forwarded within the debug NoC, the implemented memory map is
composed of the SoC block identifier in the most significant bits and the debug component
identifier in the least significant bits. Then, the configuration to be applied to a given debug
component is attached to the message Data field.
4.5.2 Debug Transaction Manager (DTM)
The Debug Transaction Manager handles the requests and responses between the master interfaces
and the debug NoC. On a system with multiple master interfaces there is an instance of the DTM in
up to four different locations:
● Between the JTAG external interface and the debug NoC,
● Between the USB external interface and the debug NoC,
● Between the master core interface and debug NoC and,
● Between the PCIe interface and the debug NoC.
4.5.2.1 Communication with Debug NoC
The communication between the Debug Transaction Manager and the debug NoC is done through
an interface consisting of ten signals: five debug NoC input ports that allow the DTM to send
requests to debug SoC resources, and five debug NoC output ports that allow debug SoC resources
to respond to those requests and to send the data gathered from the system execution.
As explained in previous sections, those five input and output ports are grouped as the Destination,
Source and Data fields together with a valid and a ready signal. The information encoded in those
fields allows the message to be routed and specifies the operation to be performed and the source
of the request in case a response is required.
In addition, note that the debug monitor components also follow this protocol, encoding messages
the same way as the DTM does for USB, JTAG, master core and PCIe support.
Further information about how the DTM translates to the debug NoC protocol can be seen in the
next subsections.
4.5.2.2 USB and JTAG support
The USB and JTAG interfaces do not follow protocols where message size information is part of the
package given that both interfaces send messages as a stream of data.
Because the Debug Transaction Manager needs to know how many bytes are received to perform
successful decoding, and the JTAG/USB interfaces do not have a field that specifies this
information, the first subset of bits of the streamed data contains the size of the message, the
destination SoC block and debug component, the message type and the transaction identifier. The
rest of the streamed information carries the Data field. The DTM makes use of the specified size to
concatenate the rest of the stream packages and forwards them through the debug NoC targeting
the destination specified in the first package.
Note that the only Source field information required to be sent from the host is the Message Type
and the Transaction ID. The former identifies Configuration requests, Trace messages or Events
messages, while the latter allows the software to generate multiple requests and match the
corresponding responses. That is, the source SoC block identifier and the SoC debug component
identifier, which are used in case the target debug component has to generate a response, are
concatenated by the Debug Transaction Manager before the transaction is sent through the debug
NoC.
The first subset of JTAG/USB streaming data looks as shown below:
[Bit-field layout: Source field (Transaction ID, Message type), Destination field (SoC block ID, SoC block debug component ID), Header (Payload length)]
Figure 26. First JTAG/USB streaming data package encoding for requests
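The header packing described above can be sketched as follows; the field widths and ordering are illustrative assumptions, since the exact bit positions depend on the configured NoC parameters:

```python
# Field widths and ordering are illustrative assumptions; the actual bit
# positions in Figure 26 depend on the configured NoC parameters.
HEADER_FIELDS = [          # (name, width), least significant field first
    ("payload_len", 8),    # size of the streamed message
    ("component_id", 6),   # destination SoC block debug component ID
    ("block_id", 6),       # destination SoC block ID
    ("msg_type", 2),       # Configuration / Trace / Event
    ("txn_id", 4),         # transaction identifier chosen by the host
]

def pack_header(**values):
    """Concatenate the fields of the first streamed JTAG/USB package."""
    word, shift = 0, 0
    for name, width in HEADER_FIELDS:
        value = values[name]
        assert 0 <= value < (1 << width), f"{name} out of range"
        word |= value << shift
        shift += width
    return word

def unpack_header(word):
    """Inverse of pack_header, as the DTM would decode the stream."""
    fields, shift = {}, 0
    for name, width in HEADER_FIELDS:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields
```

A symmetric pair like this also illustrates why the size must come first: the DTM can only concatenate the remaining stream packages once it has decoded the payload length.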
The rest of the data stream message is used to encode the data depending on the message type, as
shown below:
● Configuration request message:
○ Encodes the endpoint offset address to be accessed, the operation type and the
size together with the data to be written if needed.
Figure 27. Configuration request message encoding from JTAG/USB
● Event message:
○ Encodes the event identifier.
Figure 28. Event message encoding from JTAG/USB
On the other hand, the Configuration responses and Trace messages, which are sent from the SoC to
the host, are encoded in a similar way: the first subset of bits contains the size of the
message, the source SoC block and debug component that is sending the message, and the
message type together with the transaction identifier. Then, the rest of the data is filled with
the Data field information.
Note that the DTM discards the Destination field information before forwarding this message to the
JTAG or USB interface.
The first subset of JTAG/USB streaming data looks as shown below:
[Bit-field layout: Source field (SoC block ID, SoC block debug component ID, Transaction ID, Message type), Header (Payload length)]
Figure 29. First JTAG/USB streaming data package encoding for responses
The rest of the data stream message is used to encode the data depending on the message type, as
shown below:
● Configuration response message:
○ Encodes the data returned from the SoC block debug components.
[Bit-field layout: raw data, Reply code]
Figure 30. Configuration response message encoding to JTAG/USB
● Trace message:
○ Encodes the gathered data returned from the SoC block debug components. In
addition, depending on user configuration, it also encodes the timestamp value as
part of the raw data in its least significant bits.
Trace non-streaming message
[Bit-field layout: raw data, Timestamp]
Figure 31. Trace non-streaming message encoding to JTAG/USB
Trace streaming message
[Bit-field layout: raw data]
Figure 32. Trace streaming message encoding to JTAG/USB
4.5.2.3 Master core and PCIe support
The master core and the PCIe interfaces with the system may use protocols that generate a
single package of variable size for each request. For example, AMBA AXI4 allows generating
messages of different sizes, which means that data may have to be split into multiple debug NoC
packages.
The Debug Transaction Manager receives requests from the master core and PCIe and based on the
destination address and operation type (opcode) information, which is encoded in the Data field, it
translates the request to the debug NoC protocol and forwards it to the endpoint through the
debug NoC. Further information about how messages are encoded can be seen later in this
document.
In addition, the Debug Transaction Manager also takes care of concatenating all the response
packages returned from the requests generated by the master core and PCIe, and sends a
unique response to the requestor following the corresponding protocol, for example, AMBA AXI4.
Note that the Debug Transaction Manager makes use of the Configuration response message
dedicated wire, which indicates the last package of a given message, in order to concatenate the
returned value successfully. This is required because Configuration response messages do not
contain size information, yet the DTM needs to know how many bytes of data must be
concatenated for read requests.
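This last-package mechanism can be sketched as follows; the `last` flag models the dedicated wire and the data width is an illustrative assumption:

```python
def concatenate_response(packages, data_width=32):
    """Reassemble a Configuration read response from NoC packages.

    Each package is a (data, last) pair; `last` models the dedicated
    wire marking the final package of a message, since Configuration
    responses carry no size field. The data width is an assumption.
    """
    value, shift = 0, 0
    for data, last in packages:
        value |= data << shift       # append this package's data chunk
        shift += data_width
        if last:
            return value, shift // 8  # concatenated value, total bytes
    raise ValueError("stream ended without a last-package marker")
```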
4.5.3 Read operations to SoC resources
The debug infrastructure allows access to a collection of SoC resources comprising block
registers, memories and debug components' internal registers, which are used to apply different
debug feature configurations. Both read and write requests performed through the debug
infrastructure must encode the full endpoint target address, allowing access to all the SoC
memory-mapped resources; that is, not only the destination block and debug component must be
encoded but also the address of the memory or register to be accessed.
4.5.3.1 JTAG/USB
As has been mentioned in previous sections, the JTAG and USB protocols are based on data
streaming so that the user can send multiple data packages using the Data fields to encode the
target resource address. Because of that, the read operation requires just one request when
performed through the JTAG and USB interfaces.
The external debugger performs a read request whose address targets the destination block and
debug component and, within the message Data field, it encodes a tuple that contains the
register/memory target address, the operation type (opcode) and the requested read size.
As commented in previous sections, the Debug Transaction Manager captures the stream of data
sent by the host via USB and JTAG interfaces and builds a request by concatenating those packages
together with the Source SoC block and debug components identifiers. Then, once the message has
been built, it is forwarded through the debug NoC.
The Debug Transaction Manager does not need to perform special operations for the response
messages as they are forwarded to the USB/JTAG interfaces and handled by the external debugger
software. The maximum read size is limited by the resources in the system as data streaming allows
sending as much information as requested.
4.5.3.2 Master core and PCIe
The access to system resources through debug modules from the master core and PCIe is based on
read and write requests performed through their standard interfaces with the rest of the system.
As has been explained in the Debug Network On-Chip message types section, the Configuration
messages are used to perform read and write requests to access SoC block registers, memories and
debug components internal registers. The register/memory target address is encoded on the
Configuration messages Data field given that the master core/PCIe protocol request address field is
used to target the Debug Transaction Manager, the destination SoC block and the debug component
that performs the read operation.
The register/memory address is encoded as part of the Data field because, on multicore systems,
the debug memory map can require a large number of bits, which would cause the debug protocol
Destination field to take a large fraction of the payload that would not be used by Trace messages
or Event messages, which only need to identify the destination SoC block and debug component.
If the user wants to perform a read of an SoC resource from the master core or PCIe through a
debug component, it is always necessary to first send a write request that specifies the target
register/memory address in the request Data field. This is needed because the master core and
PCIe read request interfaces may not have a data bus that can be used to encode the target
register/memory address. That is, the Debug Transaction Manager receives two requests, first a
write specifying the target resource and then the read request trigger, and generates a read request
through the debug NoC following the debug NoC protocol. The sequence can be observed in the
diagram attached below.
Figure 33. Master core/PCIe read sequence diagram
The Debug Transaction Manager (DTM) is accessed by the master core/PCIe through read and write
requests as there is a range reserved on the system memory map to target this resource. Then, the
reserved region is seen as a set of virtual registers that allow the master core/PCIe to have multiple
requests in-flight with different transaction identifiers and targeting different or the same SoC block
debug component.
The first request is a write whose address targets the Debug Transaction Manager. The destination
SoC block and debug component that will perform the read (e.g. the debug NoC APB master
interface attached to a given SoC block) are encoded in the Data field; that is, the debug NoC
Destination field information is encoded in the master core/PCIe request data field. The Data field
of this first request is also filled with a tuple that contains the register/memory target address
and the operation opcode of the future request, which is a read in this case, as shown below:
[Bit-field layout: zero padding, read opcode, register/memory address, SoC block ID, SoC block debug component ID]
Figure 34. Read operation encoding for Master core/PCIe
The write request is captured by the Debug Transaction Manager, which, looking at the operation
type, sees that the master core/PCIe wants to perform a read operation, so it stores the target
address and the operation type and waits for the second request, which is the read. Depending on
the write opcode, the DTM responds to the first request with a write acknowledgement.
Once the master core or PCIe receives the write response confirmation, it performs a read request
targeting the same address, that is, the same DTM virtual register, with the size field set to the
desired read size.
This second request is then captured by the Debug Transaction Manager, which takes into account
the information stored from the previous write request and generates a read request that is sent
through the debug NoC targeting the register or memory endpoint. Note that the DTM encodes the
debug NoC Source field information based on the requestor and the SoC block such that the
response from the endpoint can be routed back.
The read response is always sent by the target endpoint and once the response data is received by
the DTM, it concatenates the multiple response packages and builds the response based on the
master core or PCIe protocols.
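The write-then-read sequence above can be sketched as a behavioral model; the class, method names and opcode value are hypothetical, introduced only for illustration:

```python
class DtmModel:
    """Behavioral sketch of the DTM read path for the master core/PCIe.

    All names and the opcode value are hypothetical; the real DTM is a
    hardware block whose encodings are defined by the debug NoC protocol.
    """
    READ_OPCODE = 0x1  # assumed encoding of the read operation type

    def __init__(self, noc_read):
        self.noc_read = noc_read  # callback modelling a debug NoC read
        self.pending = {}         # virtual register -> stored request info

    def write(self, vreg, block_id, component_id, address, opcode):
        # First request: store the target tuple for the upcoming read.
        self.pending[vreg] = (block_id, component_id, address, opcode)
        return "write_ack"        # write acknowledgement, if requested

    def read(self, vreg, size):
        # Second request: trigger the debug NoC read using the stored state.
        block_id, component_id, address, opcode = self.pending.pop(vreg)
        assert opcode == self.READ_OPCODE, "first request must announce a read"
        return self.noc_read(block_id, component_id, address, size)
```

Because the state is keyed by virtual register, several requests with different transaction identifiers can be in flight at once, as the text describes.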
4.5.4 Write operations to SoC resources
In the previous sections we introduced the necessity of encoding the target endpoint address as
part of the Data field, given that the memory space of multicore systems is usually large and
requires many bits to address the SoC resources.
4.5.4.1 JTAG/USB
The write operations through the JTAG and USB interfaces follow the same pattern as the read
operations such that the external debugger performs a single request whose first package encodes
the destination SoC block and debug component and a tuple that contains the register/memory
target address, the operation type for the destination component and the requested size. Then, the
rest of the streaming packages contain the data to be written.
The Debug Transaction Manager captures the stream of data sent through the USB or JTAG
interfaces and by concatenating those packages it builds a request that later on is sent through the
debug NoC.
4.5.4.2 Master core and PCIe
The behavior of the master core and PCIe with the Debug Transaction Manager depends on the
desired write size, given that the request information (e.g. target register/memory address) must be
part of the Data field, which means that the request information together with the data to be
written can exceed the master core/PCIe interface data width.
Write requests that do not exceed standard interface data width
If the user wants to perform a write request whose total size, taking into account the request
information and the data to be written, is less than or equal to the master core/PCIe interface data
width, only one write request is needed. This write request encodes the request information in a
tuple placed in the least significant bits of the interface data bus; the tuple contains the
destination SoC block and debug component that will perform the operation, the register/memory
target address, the operation type for the destination debug component and the requested write
size. The rest of the bus is filled with the data to be written to the target endpoint, as can be
seen below:
[Bit-field layout: raw data, write size, write opcode, register/memory address, SoC block ID, SoC block debug component ID]
Figure 35. Single-write operation encoding for Master core/PCIe
The Debug Transaction Manager captures this write request and, looking at the operation type and
the requested size, translates the request to the debug NoC protocol and sends it to the
destination target. Note that the DTM encodes the debug NoC Source field information based on
the requestor and the SoC block such that, in case an acknowledgement has been requested, the
write confirmation response from the endpoint can be routed back.
Figure 36. Master core/PCIe single-write sequence diagram
Write requests that exceed standard interface data width
In case the user wants to perform a write request whose total size is bigger than the master
core/PCIe data interface width, then two write requests are required.
The first write request encodes in the Data field a tuple that contains the destination SoC block and
debug component that will perform the operation, the register/memory target address, the
operation type for the destination debug component and the requested write size the same way as
for read requests.
[Bit-field layout: zero padding, write size, write opcode, register/memory address, SoC block ID, SoC block debug component ID]
Figure 37. Write operation encoding for Master core/PCIe
Note that the desired write size is required so that the DTM can know whether the data to be
written has been attached in the most significant bits of the Data field, or whether a second write
request with the raw data is required.
On the other hand, the second write request, which targets the same address and specifies the
desired write data size, has its Data field filled with the data to be written to the target resource.
Figure 38. Write operation data encoding for Master core/PCIe
The Debug Transaction Manager captures the first request and, looking at the operation type and
the requested size value, which is part of the tuple within the Data field, stores the SoC block
target resource address and the requested size, and sends a write acknowledgement, if required,
back to the requestor, which is the master core or PCIe. Then, once the second write request to the
same destination SoC block and debug component is received, the DTM translates the request to
the debug NoC protocol and sends it to the destination target.
Note that the DTM encodes the debug NoC Source field information by taking into account who has
been the requestor and the SoC block such that, in case acknowledgement has been requested, the
write confirmation response from the endpoint can be routed back.
Figure 39. Master core/PCIe write sequence diagram
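The two-request write flow can be sketched in the same behavioral style; `two_step_write`, its state dictionary and the payload shapes are hypothetical names introduced for illustration:

```python
def two_step_write(dtm_state, vreg, payload, noc_write):
    """One call per master core/PCIe write request (hypothetical sketch).

    dtm_state maps a DTM virtual register to the tuple stored by the
    first write; noc_write models forwarding onto the debug NoC.
    """
    if vreg not in dtm_state:
        # First request: payload is the (target, write_size) tuple.
        target, size = payload
        dtm_state[vreg] = (target, size)
        return "write_ack"                 # acknowledge, if requested
    # Second request: payload is the raw data for the stored target.
    target, size = dtm_state.pop(vreg)
    return noc_write(target, payload, size)
```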
4.5.5 Security
The post-silicon validation debug infrastructure allows detecting design flaws ranging from
functional errors missed during pre-silicon validation to manufacturing defects that were not
caught using automatic test equipment (ATE).
Because the proposed post-silicon validation architecture can be integrated in any SoC and can
end up with any company or user running either general-purpose software or very specific
high-performance applications, the security and privacy of the system cannot be taken for
granted.
On the one hand, the post-silicon verification engineering team would like as much visibility of
the system's behavior as possible; on the other, security experts would like to reduce visibility
and access to the minimum in order to ensure that the system is not accessed by malicious users.
System-on-Chip security can be implemented in multiple ways. Some SoC designs use a dedicated
module known as the Root of Trust (RoT), in which the user has to complete a challenge to enable
or disable access to the debug capabilities. Others use eFuses, such that a physical fuse on the
SoC enables or disables debug access: during post-silicon validation the full debug features are
open, and before production mode the fuse is blown at manufacturing, disabling the access.
On the proposed architecture there are several lead debuggers that can access the system making
use of external interfaces such as JTAG, USB or PCIe. That said, all those interfaces are connected to
the Debug Transaction Manager (DTM) which acts as the interconnection between the debugger
and the debug NoC providing the user access to all SoC debug resources. It is recommended that
the debug NoC routing is by default configured to redirect messages to the master interface
attached to the Root of Trust so that no other debug components can be accessed until the security
challenge has been validated.
Note that Systems-on-Chip can have other critical security holes, such as Design for Testability
(DFT) hardware, because all the flip-flops in the system are connected to one or multiple scan
chains whose input and output pins are accessible by the user. Extensive research has been done
on DFT security issues that make systems vulnerable [51]. For example, one kind of attack consists
of introducing different patterns into the system and using the differences between the
generated test output vectors to discover how the system behaves. In this thesis we do not focus on
DFT and assume that this security concern has already been addressed.
4.5.6 JTAG integration
As has been explained in previous sections, the JTAG protocol is a well-defined standard that is
easy to use and integrate on systems. The proposed architecture integrates a JTAG interface on the
system that allows users to perform post-silicon validation, as it does not require a complex
configuration and is ready to use once the system is powered up. On the other hand, it is important
to note that its low bandwidth does not allow extracting large amounts of information, which can
become a problem on multicore systems; because of that, the proposed architecture integrates
support for high-bandwidth interfaces such as USB and PCIe.
The software connection with the JTAG SoC interface is usually performed through J-Link debug
probes from Segger [52]. The J-Link debug probes are the most popular choice in the
semiconductor field for performing debugging and for flashing new firmware into systems.
In addition, J-Link is supported by almost all development tools, from open-source debuggers
such as GDB to commercial integrated development environments (IDEs).
One of the most important choices when integrating a JTAG interface on a system is to decide the
size of the Test Data Register (TDR), which is used to load data from the host to the SoC and also to
retrieve data from the SoC to the host.
We must note that accessing the JTAG TDR, either for loading or extracting information, implies a
software overhead for managing the J-Link connection. In addition, the TDR access speed depends
on the J-Link connection speed: configuring a J-Link connection at 4 MHz implies a period of 250
nanoseconds, which means that the software can load and extract a single bit every 250
nanoseconds. Increasing the TDR size therefore also increases the time for loading or extracting
the information.
On the other hand, depending on the software configuration, the software overhead for accessing
the TDR may be large enough that the shift time itself does not have a significant impact.
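The timing trade-off above can be captured in a small helper; the fixed software-overhead term is an assumption used only for illustration:

```python
def tdr_access_time_us(tdr_bits, tck_hz=4_000_000, sw_overhead_us=0.0):
    """Microseconds needed to shift the Test Data Register once.

    One bit moves per TCK cycle, so a 4 MHz J-Link connection shifts a
    bit every 250 ns. sw_overhead_us models the fixed per-access J-Link
    software cost; its value is an assumption, not a measured number.
    """
    return tdr_bits * 1e6 / tck_hz + sw_overhead_us
```

For example, a 64-bit TDR at 4 MHz needs 16 µs of shift time per access; if the fixed software overhead per access is much larger than that, growing the TDR costs comparatively little.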
Based on the analysis developed earlier in this project, we have seen that one of the most
important features from the users' perspective is loading and storing information from system
memories. Therefore, with the goal of optimizing this feature, this project recommends setting the
TDR size to a value that allows both the host and the SoC to perform Configuration write requests
and Configuration read responses by loading the TDR just once. Note that we do not take into
account Configuration write responses or Configuration read requests, as those do not require
sending/retrieving values to/from the system. In addition, since debug architecture access only
requires one JTAG Test Data Register to be integrated in the system, the area overhead is low.
5. Proposed multicore design integration
In this section we explain how to integrate the new proposed debug architecture on Esperanto's
System On-Chip. Later in this document an evaluation is done comparing the advantages and
drawbacks of using the new proposed architecture.
Esperanto’s SoC architecture has been described before in this document but as a summary it is a
system composed of one master core, known as Service Processor, four RISC-V out-of-order cores,
known as ET-Maxions, and 1,024 RISC-V in-order cores, known as ET-Minions.
In addition, it has multiple memory subsystems, known as Memory Shires, that implement access to
DDR memories and an SoC block that provides a PCIe interface, known as PCIe Shire.
The Service Processor is allocated on the I/O Shire together with several peripherals, while the
ET-Maxion cores are placed on an I/O Shire sub-block with an independent cache hierarchy. On the
other hand, the ET-Minion cores are grouped in sets of 32 on SoC blocks known as Minion
Shires with multiple cache hierarchies.
5.1 I/O Shire
5.1.1 Architecture
The IOShire block contains several sub-blocks that not only allow access through peripherals such
as JTAG, USB, UART or SPI, but also integrate the Service Processor, the SoC master core in
charge of maintenance and managerial tasks in the SoC and of the associated peripherals'
configuration. In addition, the IOShire also integrates the clock and reset unit (CRU), in charge of
releasing or asserting SoC block resets, the security subsystem that blocks non-allowed access to
the debug infrastructure, and the ET-Maxion cores.
5.1.2 Integration of the new proposed architecture
5.1.2.1 Clock and reset unit
This critical SoC component is required to perform the master core boot sequence successfully
and to enable the use of the rest of the SoC blocks. Because of that, it is required to maintain a
workaround in case the master core cannot access it, so that the user can still perform the
required operations on the clock and reset unit (CRU) and enable the rest of the SoC blocks to run.
In order to solve this potential issue, the debug NoC meshtop on the IOShire provides an AMBA
APB3 interface that is connected to the CRU and the I/O Shire PLLs such that the user can manage
the system reset and clock signals without relying on the Service Processor. Note it is highly
recommended to ensure that both the debug NoC and the debug peripherals reset are deasserted
based on external reset so that access to the clock and reset unit is available if something fails.
5.1.2.2 Registers and memory access
Access to System On-Chip registers and memory is granted by the debug NoC, as read and write
requests can be performed through the different debuggers, which can be either the host, through
JTAG, USB or PCIe, or the Service Processor.
Note that SoC sub-blocks such as the Maxion shire, or peripheral components whose registers
and/or memories must be visible through debug, need to integrate the debug NoC meshtop with
an interface that allows access to the SoC block resources.
5.1.2.3 Service Processor
Because the Service Processor is a RISC-V core, it follows the RISC-V External Debug Support
Version 0.13 specification [2] for run control, which ensures that the user can perform operations
such as halt, resume, single stepping or setting breakpoints, and supports external software tools
such as GDB. The Service Processor Debug Module is accessible through the debug NoC AMBA APB3
interface that is also connected to the CRU and the PLLs. Note that there is a user-controllable
multiplexer that allows targeting those resources.
Visibility of the master core internal pipeline is still required in order to fulfill users' needs. The
proposed integration makes use of a Logic Analyzer that gathers information from the pipeline
finite-state machines and control signals. The new and improved filtering, tracing and compression
features of the proposed debug infrastructure allow the Logic Analyzer to replace both the
UltraSoC Status Monitor and the UltraSoC Trace Encoder components that were attached to the
Service Processor to provide visibility.
On the other hand, a Bus Tracker module monitors the interface between the Service Processor and
the non-debug NoC, allowing the user to track in-flight transactions, detect whether a request is
hung, count how many requests have been sent, and obtain performance numbers based on
latency. The integration of the Bus Tracker for monitoring this interface allows replacing the
UltraSoC Status Monitor, which could not track in-flight requests and responses as it did not
support managing transaction identifiers.
5.1.2.4 Maxion validation
The ET-Maxion cores architecture is based on the RISC-V Berkeley Out-of-Order Machine (BOOM),
which means that it follows the RISC-V External Debug Support Version 0.13 specification [2]
allowing users to perform run control operations. The ET-Maxion cores Debug Module is accessible
through a debug NoC AMBA APB3 interface.
In addition, the Maxion sub-block has two Logic Analyzer components to allow the user to
monitor not only the ET-Maxion cores' pipeline but also the Maxion cache hierarchy.
On the other hand, because the ET-Maxion cores access the SoC memory subsystem and share
computation with the ET-Minion cores, we need visibility of the transactions sent and received by
the Maxion block. For this purpose, a Bus Tracker monitors the Maxion AMBA AXI4 interfaces that
connect the Maxion cache with the rest of the system.
5.1.2.5 Debugger hardware interface
The debug subsystem is accessible through external interfaces such as JTAG but also making use of
Esperanto’s SoC master core, which in this case is the Service Processor.
External interface with the host
Access through the JTAG and USB interfaces is still supported, and it is no longer required to
integrate UltraSoC's third-party JTAG and USB blocks, which expose the I/O interface but have
limited features and throughput. In the proposed architecture the Debug Transaction Manager
(DTM) handles the connection of the external interfaces with the debug infrastructure, so that the
JTAG and USB I/O interfaces can be directly attached to it. The DTM also improves data streaming
performance, as it supports different Maximum Transmission Unit (MTU) values.
In addition, there is also now support for allowing debugging through other external interfaces such
as PCIe which provides a high-bandwidth interface as the DTM also handles the connection.
Cores communication with the host
The Service Processor is a critical piece that needs a way to communicate with the host, so that
the master core can send information even when there has been no previous request from the
debugger.
In order to keep this feature while saving area and power, the proposed architecture delegates
the communication to the UART and SPI interfaces, through which the host can communicate with
the Service Processor and exchange data. This allows removing the UltraSoC Static
Instrumentation component, which went unused during Esperanto's SoC bring-up, saving both
area and power consumption.
It is also important to note that the Service Processor can target host memory through the PCIe
interface as part of Esperanto’s architecture features.
Debug through Service Processor
Access to the debug infrastructure from a master core is not only maintained but also improved,
as there is no longer a need to target the UltraSoC Bus Communicator internal registers and
handle the UltraSoC messaging protocol encoding and decoding. In the proposed architecture,
the Service Processor is attached to the Debug Transaction Manager (DTM), which handles the
connection between the master core and the debug NoC, allowing the Service Processor to use its
own protocol while keeping full access to the debug features, and avoiding the need to encode or
decode complex messages.
Note that configuring UltraSoC debug components such as the Status Monitor required multiple
writes to the UltraSoC Bus Communicator internal registers and a check of a finite-state machine
status at each step, which added complexity and a large overhead. The proposed architecture
allows the Service Processor to directly target the debug monitor components without adding
overhead.
Figure 40. IOShire proposed debug architecture component integration
5.2 Minion Shire
5.2.1 Architecture
The Minion Shire integrates the compute cores and their cache hierarchy, which handle the
computation of the different workloads in the system. Note that, as explained before in this
document, there are thousands of compute cores distributed in groups of 32, known as Minion
Shires, within which they are arranged in groups of 8, known as Neighborhoods.
5.2.2 Integration of the new proposed architecture
Due to the fact that the Minion Shire block is replicated several times in the system, adding several
debug components on those has a huge impact on area and power consumption. On the other
hand, it is essential to have visibility of the Minion Shire resources as they handle most of the work
and determine the SoC performance.
5.2.2.1 Software application debugging
First, it is required to allow debugging software applications through run control operations such as
breakpoints, single stepping or the ability to read core internal registers, together with having
access to the Minion Shire configuration registers. Therefore, the debug NoC meshtop attached to
each Minion Shire integrates an AMBA APB3 interface that provides access to all the Minion Shire
registers and ET-Minion cores run control interfaces through a sub-block named Debug Module.
Note that no new compute core run control operations are introduced, as Esperanto's SoC
previous debug infrastructure already contained the most essential operations. That said, due to
how software applications are designed, a new feature is integrated that allows run control
operations such as halt or resume to be applied automatically to a group of compute cores when
one of the cores within that group halts. This enables fine-grained run control debugging,
allowing software developers to debug different groups of cores at the same time while ensuring
that the cores in a group are halted within a small range of cycles of one another. Note that the
grouping feature reduces the latency of the halt run control operation, as the requests do not need
synchronization with the host.
5.2.2.2 Resources visibility
In addition, users still need to monitor the ET-Minion cores' execution, so a Logic Analyzer is
placed within each Neighborhood, allowing the user to connect to the 8 ET-Minion cores and other
resources within that same sub-block.

The proposed integration connects the Logic Analyzer to the ET-Minion cores' pipeline to track
control signals, which can include Program Counter or control signal values. In addition, there is a
new ET-Minion core Control Status Register (CSR) that allows software developers to track the
execution stages of multiple cores. Finally, the filtering optimization allows focusing on gathering
useful data from the system, reducing the probability of buffer overflows that result in lost
information.
Furthermore, there is a Bus Tracker which provides visibility of the crossbar between the four
Minion Shire Neighborhoods and the L2 and L3 caches, allowing the user to track the requests sent
from the compute cores. Note that this is an essential debug feature that was not available in
Esperanto's debug infrastructure, since it used an UltraSoC Status Monitor module which does not
handle transaction identifiers to match requests with responses.
On the other hand, a Logic Analyzer is also connected to the Shire Cache control signals to allow
analyzing critical control signals and finite-state machines.
Finally, there is a second Bus Tracker component connected to the interface between the Minion
Shire L2 and L3 caches and the non-debug Network On-Chip. This allows monitoring the traffic
generated by the compute cores. The Bus Tracker replaces the UltraSoC Bus Monitor, which caused
a huge area overhead: even though Esperanto's architecture does not use some of the AMBA AXI4
interface fields, UltraSoC's buffering for those fields could not be removed.
Figure 41. Minion Shire proposed debug architecture component integration
5.3 Memory Shire
5.3.1 Architecture
The Memory Shire provides access to the SoC memory by integrating the LPDDR4 memory parts
and connecting them with the rest of the SoC through the non-debug Network On-Chip. There are
eight Memory Shires in the system, which means that, just as with the Minion Shires, adding
several debug components has a direct impact on SoC area and power consumption.
5.3.2 Integration of the new proposed architecture
5.3.2.1 Register visibility
The first feature maintained from Esperanto's debug infrastructure is access to the block registers
used to define the different operational modes that handle the memories. Access to those registers
is now performed through an APB master interface exposed by the debug NoC meshtop that
connects the Memory Shire with the rest of the debug infrastructure.
5.3.2.2 Memory visibility
In the same way as for registers, memory access is also supported, not only for reading DDR
content but also for letting the user load data through the debug infrastructure when needed.
Access to this resource is performed through an AMBA AXI4 interface exposed by the debug NoC
meshtop attached to the Memory Shire.
5.3.2.3 Debug trace buffering
On the other hand, this document introduced optimizations not only to increase the amount of
information that can be gathered using monitor components such as the Logic Analyzer or the Bus
Tracker, but also to improve the routing of trace data, adding the Trace Storage Handler which
allows setting PCIe as a destination and thus provides a high-bandwidth output interface.

In addition, among the new features we must highlight debug NoC interleaving and streaming,
which allow routing gathered data to multiple memory subsystems and utilizing the full width of
the debug NoC wires to transmit the information.

All these features not only provide better visibility but also reduce the risk of losing information or
overflowing the region of memory reserved for the gathered data.
Figure 42. Memory Shire proposed debug architecture component integration
5.4 PCIe Shire
5.4.1 Architecture
Esperanto's SoC PCIe is integrated in this block, allowing access to memory-mapped locations from
an external host and also allowing the compute cores to use host memory as extended SoC memory.
5.4.2 Integration of the new proposed architecture
The PCIe integration on Esperanto's SoC makes use of two PCIe controllers and two PCIe PHYs, so
in order to have visibility of those resources, a Logic Analyzer is attached to each of them.

On the other hand, since PCIe is connected to the SoC Network On-Chip through multiple AMBA
AXI4 interfaces, the debug infrastructure uses two Bus Tracker components to analyze the in-flight
transactions on the two PCIe lanes simultaneously.

Furthermore, the new debug infrastructure integrates a Debug Transaction Manager (DTM),
allowing the host to access the SoC debug components through the PCIe interface. Note that even
though the PCIe and the Service Processor share the same DTM implementation and the same
AMBA interface with the non-debug NoC, a separate instance of the DTM is required so that, if the
IOShire SPIO NoC is not available, the host can still use PCIe as an active debugger.
Figure 43. PCIe Shire proposed debug architecture component integration
5.5 Debug Network On-Chip
System On-Chip debug components, which include both the debugger interfaces such as JTAG and
the debug components such as the Logic Analyzer, are connected through the debug NoC, which
allows configuring and gathering information from the system.

In this document we propose to replace Esperanto's SoC debug NoC with a new version that
gathers more information from the system and provides further support for accessing the debug
infrastructure from different master interfaces through the Debug Transaction Manager (DTM)
components. This change not only affects visibility, with new features such as interleaving or
streaming, but also has a huge impact on area and power, as it maximizes the utilization of the
wires. Note that Esperanto's first design used a generic debug NoC that was not intended for debug
purposes, so it left many of its internal wires unused and did not support debugging through
different master interfaces.
In addition, the multiple bridge modules between the debug NoC and the SoC blocks are no longer
needed, as the proposed architecture simplifies message routing based on coordinates and
provides further support through the DTM. Note that the UltraSoC Message Engine modules have
also been removed, since the SoC block debug components are now directly attached to the debug
NoC.
5.6 Security
In order to ensure that the Esperanto Technologies SoC post-silicon validation debug architecture
is secure and data is not exposed, we rely on the Root of Trust component integrated in the I/O
Shire and an eFuse that gates access to the debug infrastructure.

The eFuse enables the debug JTAG/USB/PCIe interface that connects the host software to the
debug infrastructure. Then, the user must target the I/O Shire Root of Trust component and
successfully perform the security challenge to enable the debug capabilities. The rest of the Debug
Transaction Manager instances in the system receive the output signal from the Root of Trust,
which serves as an enable to lock or unlock access to the debug NoC and the rest of the debug
components within the SoC.

Note that the I/O Shire Root of Trust is reached through the debug NoC, but until the challenge
succeeds the rest of the SoC debug components cannot be accessed: the NoC blocks accesses to all
of them, making the debug NoC master JTAG interface connected to the RoT the only accessible
resource.
6. Evaluation
In this section we evaluate the proposed debug infrastructure and analyze its advantages and
drawbacks compared to the options available on the market and to Esperanto's SoC
implementation. This is done by considering the different elements that compose the debug
infrastructure and taking into account not only the features but also the expected impact on area
and power consumption.
6.1 Run control
User debugging of software applications relies on debug access to the multicore system's compute
cores, as they perform the workloads. In addition, it is also important to have access to the master
core, as it is in charge of balancing workloads across the compute cores.

The proposed architecture does not introduce run control features that are unavailable in the
market, but it ensures support for multicore systems that can scale up to thousands of cores. Note
that it is necessary not only to manage multiple cores at a time but also to ensure that applying
certain operations to them, such as halting or setting breakpoints, does not take a huge amount of
time.
6.1.1 Broadcast support
The proposed architecture allows broadcasting operations such as halt, resume, or single stepping
to all selected cores in the system instead of sending those operations core by core, which causes
huge latency since each operation requires a synchronization between the host and the system.
Taking the Esperanto Technologies SoC as an example, there are 32 Minion Shires, which are SoC
blocks containing 32 ET-Minion compute cores each. The average write latency in RTL simulation,
without counting the latency of the JTAG/USB/PCIe transport used to access the ET-Minion core
registers, is 196 nanoseconds, or 344 nanoseconds if an acknowledgement is required. In an
architecture without the multicore run control features introduced in this document, sending a run
control request such as halt to all system compute cores would require targeting each core
individually, meaning 1024 unacknowledged write requests performed by the active debugger.
That is, it would take 200704 nanoseconds to halt or resume all the SoC compute cores.
On the other hand, if the Esperanto Technologies system integrated the proposed mask register
within each SoC block, together with the run control broadcast feature performed by the Debug
Multicore Handler (DMH), only one write per SoC block would be needed to select the ET-Minion
cores (this write requires acknowledgement to ensure the run control operation is performed after
the cores have been successfully selected), plus one write to issue the run control operation itself,
which does not require acknowledgement. That is, it would require 33 write operations to perform
a halt request to all the SoC cores, which means 11204 nanoseconds, a reduction of 94,42%.
In addition, further run control operations on the selected cores can be performed with a single
write request to the DMH, since the cores have already been selected and the SoC blocks' mask
register values are kept. For example, resuming the selected cores would take the user 196
nanoseconds (one write request without acknowledgement, as the user can check the cores' status
later if required), compared with 200704 nanoseconds in an SoC debug architecture without the
proposed multicore run control integration, which would have to access all SoC cores one by one.
This means a percentage decrease of 99,9%.
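The arithmetic above can be reproduced with a short Python sketch. The constants are the RTL
simulation figures quoted in the text; the variable names are illustrative, not part of the design.

```python
# Latency model for broadcast run control on the proposed architecture.
WRITE_NS = 196          # average unacknowledged write latency (RTL simulation)
WRITE_ACK_NS = 344      # write latency when acknowledgement is required
SHIRES = 32             # Minion Shires in the SoC
CORES = SHIRES * 32     # 32 ET-Minion cores per Shire -> 1024 cores

# Core-by-core halt: one unacknowledged write per core.
core_by_core_ns = CORES * WRITE_NS                      # 200704 ns

# Proposed: one acknowledged mask write per Shire to select the cores,
# plus a single unacknowledged broadcast write to the DMH.
first_halt_ns = SHIRES * WRITE_ACK_NS + WRITE_NS        # 11204 ns

# Subsequent operations reuse the kept mask: a single write to the DMH.
next_op_ns = WRITE_NS                                   # 196 ns

saving_pct = (1 - first_halt_ns / core_by_core_ns) * 100  # ~94,42 %
print(core_by_core_ns, first_halt_ns, next_op_ns, round(saving_pct, 2))
```

As the sketch shows, once the mask registers are configured, every further broadcast operation
costs a single write regardless of how many cores are selected.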
Note that the Debug Multicore Handler (DMH) run control registers should be placed near the
debugger interfaces such as JTAG/USB to reduce latency, but in the example above we assume a
latency similar to that of accessing the SoC block compute cores' registers. Furthermore, the
numbers shown above do not include the JTAG/USB latency, so the actual savings are even bigger:
as the number of transactions decreases, the number of traversals through JTAG/USB/PCIe also
decreases.

Moreover, the broadcast feature also allows all the SoC compute cores to receive run control
operations within a small window of time, while the traditional core-by-core approach introduces a
huge latency between the first core that receives the run control operation and the last one in the
list.
In addition, there is also support for creating execution groups such that when one core halts, the
rest of the cores in the group receive a halt request, which allows controlling the execution of
multiple cores in the system based on the state of a specific one.
6.1.2 Status report
There is also support for gathering the status of all the selected cores in the system by reading a
subset of registers instead of having to gather the status information of each core independently.
6.1.2.1 Cores run control status
The proposed debug architecture allows the user to retrieve selected cores' run control status by
targeting a register placed on the Debug Multicore Handler (DMH).
In terms of latency savings, we can take the Esperanto Technologies SoC as an example, as in the
previous section. Since there are 32 Minion Shires, each containing 32 ET-Minion compute cores,
and the average latency of a read to the ET-Minion cores' registers is 344 nanoseconds, without
support for multicore run control operations we need to target the 1024 cores one by one. This
means it would take 352256 nanoseconds to know whether all cores in the system are running,
halted, or have performed an injected instruction as expected.
On the other hand, if the Minion Shires integrate the proposed mask register together with the SoC
status report infrastructure introduced earlier in this document, the user would require 32
acknowledged writes to configure each Minion Shire mask register and one read to the DMH to
retrieve the global status. Following the Esperanto Technologies SoC latencies, this means 32
acknowledged writes and one read of 344 nanoseconds each, so the user could retrieve the status
of all cores in 11352 nanoseconds, a reduction of 96,77%.
In addition, if the user applies new run control operations or wants to retrieve the status value
again, a single read request to the DMH suffices, as the cores have already been selected and the
SoC blocks' mask register values are kept. For example, it would take the user 344 nanoseconds
(one read request) on the proposed architecture, compared with 352256 nanoseconds on an SoC
debug architecture without the proposed multicore run control integration, which would have to
access all SoC cores one by one. This means a percentage decrease of 99,9%.
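The status-report savings follow the same model as the broadcast case. The sketch below uses the
latency figures quoted in the text; the names are illustrative only.

```python
# Latency model for retrieving the run control status of all cores.
ACCESS_NS = 344      # acknowledged write / read latency to core registers
SHIRES = 32
CORES = SHIRES * 32  # 1024 ET-Minion cores

# Without multicore support: one read per core.
per_core_ns = CORES * ACCESS_NS                    # 352256 ns

# Proposed: 32 acknowledged mask writes + 1 read of the DMH status register.
first_status_ns = SHIRES * ACCESS_NS + ACCESS_NS   # 11352 ns

# Once the cores are selected, later status queries are a single DMH read.
next_status_ns = ACCESS_NS                         # 344 ns

saving_pct = (1 - first_status_ns / per_core_ns) * 100  # ~96,77 %
print(per_core_ns, first_status_ns, round(saving_pct, 2))
```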
Note that, as explained in the previous section, we recommend placing the Debug Multicore
Handler (DMH) near the debugger interfaces such as JTAG/USB to reduce latency; in the example
above we assume a latency similar to that of accessing the SoC block compute cores' registers. In
addition, we assume that the Debug Multicore Handler status register value is always up to date, so
no latency for updating this register has been taken into account.
6.1.2.2 Cores register dump
As we have seen earlier in this document, the RISC-V GDB software does not provide proper
support for multicore systems. One of the biggest concerns is that certain operations, such as
single stepping, by default force the software to read all the thread's General Purpose Registers
(GPR).
On the Esperanto Technologies SoC, reading a core GPR requires the debugger to perform 3
serialized AMBA APB3 requests, which load instructions into the program buffer and read its
output. Since ET-Minion cores have 32 GPR per hardware thread, there are 2 hardware threads per
core, and there are 1024 ET-Minion cores in the SoC, up to 2048 threads can potentially be selected
at the same time. The resulting latency becomes large enough to be noticeable by the user and
causes problems during debugging.
Taking into account that the AMBA APB3 transactions are serialized and each requires 344
nanoseconds, reading all GPR from a single thread takes 33024 nanoseconds, while reading all
ET-Minion core GPR from one Minion Shire, which contains 32 ET-Minion cores, that is, 64
hardware threads, takes 2113536 nanoseconds. It is important to note that the 344 nanoseconds
do not include the delay of the JTAG/USB/PCIe interface, so the real latency seen by the user is
even bigger.
Following this project's recommendation of adding a register within the Debug Multicore Handler
(DMH) to capture the GDB request for the General Purpose Registers of a given thread, and adding
a finite-state machine in the SoC block Debug Module that automates the GPR read, we avoid
hundreds of requests being sent from the host software.
Then, if the DMH and the SoC blocks' DMs have been integrated following this thesis'
recommendation, and the OpenOCD software layer is extended to use this special DMH register
after operations such as single stepping, rather than following the default implementation that
sends multiple requests per register, reading the General Purpose Register values of a given thread
takes 33 operations from the software. The first request accesses the DMH register that triggers
the ET-Minion thread GPR read, while the other 32 requests gather the register values once they
have been extracted from the thread. This means that the latency for reading all thread registers,
again at 344 nanoseconds per request, would be 11352 nanoseconds. In turn, reading the registers
of all ET-Minion cores in a given Minion Shire would take 726528 nanoseconds, a reduction of
65,62%.
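The GPR dump comparison can be sketched in the same way. The constants come from the figures
in the text; the names are illustrative.

```python
# Latency model for dumping General Purpose Registers over AMBA APB3.
APB_NS = 344             # one serialized AMBA APB3 request
GPRS = 32                # general purpose registers per hardware thread
THREADS_PER_SHIRE = 64   # 32 ET-Minion cores x 2 hardware threads

# Default GDB/OpenOCD flow: 3 APB requests per GPR read.
thread_default_ns = GPRS * 3 * APB_NS                     # 33024 ns
shire_default_ns = THREADS_PER_SHIRE * thread_default_ns  # 2113536 ns

# Proposed: 1 request triggers the DMH/Debug Module FSM, then 32 reads
# collect the extracted register values.
thread_dmh_ns = (1 + GPRS) * APB_NS                       # 11352 ns
shire_dmh_ns = THREADS_PER_SHIRE * thread_dmh_ns          # 726528 ns

saving_pct = (1 - shire_dmh_ns / shire_default_ns) * 100  # ~65,62 %
print(thread_default_ns, thread_dmh_ns, round(saving_pct, 2))
```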
Note that we assume that the latency between the first request, which accesses the DMH register,
and the second request, which reads the first GPR value, is enough to extract that value. Otherwise,
it is necessary either to add a delay in the software or to use a DMH register to check the status of
the GPR read process.
6.2 Debug NoC
The debug Network On-Chip is the most important change in the proposed debug infrastructure, as
it is the key component that interconnects all the debug components with each other and with the
active debuggers.
6.2.1 Debugger support
The first important feature to note is the support for both PCIe and master core interfaces,
allowing the user to control the debug infrastructure through those master interfaces. The
proposed architecture is designed so that the master interfaces can follow protocols such as JTAG
or AMBA, which are easy to translate to the debug NoC protocol through the Debug Transaction
Manager (DTM).
From a performance point of view, adding PCIe both as a master interface and as a trace
destination allows gathering more information from the system, providing more visibility for the
user, which is essential when debugging complex software applications. In addition, having PCIe as
a master interface allows debugging from the server without connecting JTAG or USB interfaces.
That said, JTAG and/or USB interfaces are still highly recommended, as PCIe requires a complex
configuration that may not be available during early bring-up stages.

Then, thanks to the ability of the debug NoC to split packets, performance is improved on the
different interfaces with the host, given that JTAG, USB, and PCIe work with different Maximum
Transmission Unit (MTU) values. This improvement is especially useful when large amounts of data,
such as traces or memory loads/dumps, have to be transported.
6.2.2 Efficient messaging protocol
On the other hand, performance is improved as a result of having a messaging infrastructure based
on a custom protocol designed to handle configuration and trace messages efficiently, making use
of all the protocol fields and capabilities. Note that the Esperanto Technologies debug NoC was
based on the NetSpeed protocol, where many fields were unused, causing not only wasted area
and static power consumption but also degraded performance.
In addition, the debug NoC also takes advantage of the proposed compression algorithms that
reduce the amount of information that is sent through the network layer. This allows designers to
reduce the number of needed wires even more, directly impacting area and static power
consumption, while keeping the throughput requirements.
Furthermore, due to how bottlenecks are handled and the decision to propagate back pressure up
to the debug monitor components, the debug NoC is not only lossless but also does not require
large buffers; whether messages should be discarded is a user decision made while configuring the
monitoring system.
Note also that, depending on the active debugger protocol, write confirmation responses may not
be required, which saves at least 50% of the bandwidth. In addition, since there is no round-trip
latency, that is, no extra time spent waiting for the acknowledgement of a request, more sustained
outstanding writes can be performed without requiring more transaction identifiers.
6.2.2.1 Performance improvements
Taking the Esperanto Technologies SoC as an example, the UltraSoC JTAG Communicator integrates
a Test Data Register (TDR) of 33 bytes (264 bits), which is fully loaded each time a message is sent
from the host to the SoC or retrieved from the SoC to the host. Note that the TDR is always fully
loaded or read even if the message is smaller than 33 bytes. That is, loading or extracting data
from the SoC requires shifting all 33 bytes and also incurs a software overhead each time this is
performed. Based on post-silicon experience running the JLink connection at 4MHz, we have
concluded that the software overhead is 1640000 nanoseconds per TDR access. This software
overhead figure is believed to be reasonable, as Esperanto's engineering team went through
several iterations of UltraSoC software improvements to optimize its utilization and adapt it to the
host operating system. In addition, since the JLink connection runs at 4MHz, loading each bit of the
JTAG TDR takes 250 nanoseconds. The total time spent using the TDR is therefore the software
overhead plus the time to shift each of the TDR bits.
If we now focus on the operations performed during the ET-SoC-1 bring-up stage, writing to and
reading from memory has been critical, as the infrastructure has been used to load big binaries and
dump all memory content, and those operations were the most time consuming.
Because of that, we can measure the performance improvement by how much time each debug
architecture takes to perform writes of 512 bits, which is the SoC cache line size.
The UltraSoC DMA, which is the only debug component on ET-SoC-1 that allows access to system
memory, supports read and write operations of up to 8 bytes. This means that writing 512 bits,
which are 64 bytes, requires 8 write operations. In addition, due to the way the UltraSoC software
is implemented, UltraSoC DMA operations must be sequential, meaning that for each request we
must wait for its corresponding response. Furthermore, UltraSoC DMA write requests are encoded
in 16 bytes and write responses in 5 bytes; that is, requests and responses do not exceed the JTAG
TDR size, so each one requires loading the TDR only once.
Taking this information into account, we need to perform 8 UltraSoC DMA requests and extract 8
UltraSoC DMA responses, which means that the time to write 512 bits is 16 times (8 requests plus
8 responses) the software overhead (1640000 nanoseconds) plus the TDR load time (250
nanoseconds per bit multiplied by 264 bits). That is, writing 512 bits to memory on ET-SoC-1
through UltraSoC takes 27296000 nanoseconds. Note that we do not account for the time the
request and response take within the SoC, as it depends on factors outside the debug
infrastructure, such as the availability of the memory controllers, which are not discussed in this
thesis. Hence, we assume that request and response times do not change.
On the other hand, if the Esperanto Technologies system integrated the proposed debug
architecture, which follows a new debug messaging protocol and implements a JTAG TDR of 73
bytes, the time required for writing 512 bits to memory is reduced.
In the proposed architecture, Configuration message write requests require up to 73 bytes, while
Configuration message write responses need 3 bytes. Then, since the proposed architecture
integrates an AMBA AXI4 interface with SoC memory and supports write requests of up to 512 bits,
performing a write requires only one request. In addition, Configuration write request messages
can encode up to 512 bits of write data. That is, performing a 512-bit write on ET-SoC-1 would
require only one request and one response. Note that in previous sections we recommended
integrating a JTAG TDR whose size matches Configuration write request messages. We now
assume that Configuration write request messages perform writes of up to 512 bits, matching the
system cache line size, which together with the rest of the message information requires
transmitting 73 bytes.
As we have seen, Configuration write requests, which are encoded in messages of up to 73 bytes,
fit in the proposed JTAG TDR, so the software overhead and the TDR load time apply only once.
Likewise, Configuration write responses, encoded in 3 bytes, also require loading the JTAG TDR
only once. Assuming the software is not improved (overhead of 1640000 nanoseconds) and the
JLink connection runs at 4MHz (250 nanoseconds per bit), writing 512 bits on the proposed
architecture takes 1786000 nanoseconds, a reduction of 93,45%.
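The TDR cost model above can be sketched in a few lines of Python. The constants are the figures
quoted in the text; note that the 1786000 ns result corresponds to one TDR access for the 73-byte
request (the assumption made here to match the thesis figure).

```python
# Cost model for JTAG Test Data Register (TDR) accesses.
SW_OVERHEAD_NS = 1_640_000  # measured software overhead per TDR access
BIT_NS = 250                # JLink at 4MHz -> 250 ns per shifted TDR bit

def tdr_access_ns(tdr_bytes: int) -> int:
    """Cost of one full TDR load/read: software overhead + bit shifting."""
    return SW_OVERHEAD_NS + tdr_bytes * 8 * BIT_NS

# UltraSoC: 33-byte TDR; writing 512 bits needs 8 sequential 8-byte DMA
# writes, each costing one TDR access for the request and one for the
# response -> 16 TDR accesses in total.
ultrasoc_ns = 16 * tdr_access_ns(33)      # 27296000 ns

# Proposed: 73-byte TDR; one Configuration write request carries the
# full 512-bit cache line (request-side TDR access only, as assumed above).
proposed_ns = tdr_access_ns(73)           # 1786000 ns

saving_pct = (1 - proposed_ns / ultrasoc_ns) * 100  # ~93,45 %
print(ultrasoc_ns, proposed_ns, round(saving_pct, 2))
```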
6.2.2.2 Compression savings
As explained previously in this document, the XOR operation and the Run Length Encoding (RLE)
compression algorithm are applied to Trace messages in order to reduce the amount of information
sent through the debug NoC. Note that, depending on the designers' parameterization, Trace
messages may need to be split into multiple flits if the information does not fit in the debug NoC
interface.
Taking the Esperanto Technologies SoC as an example, the UltraSoC Status Monitor modules are
used to trace information from the system. During the post-silicon validation effort, one of the
most important features required gathering information from the ET-Minion cores' execution by
tracing their Program Counter value (49 bits) plus a 1-bit valid signal for long periods of time.
Those ET-Minion core signals are attached to a 128-bit-wide bus, and due to how the UltraSoC
Status Monitor trace capabilities are implemented, there is no support for tracing a subset of the
filtered signals, nor a compression algorithm to reduce the amount of information sent. That is,
each time a Trace message is generated, the whole bus (128 bits) plus the 4-byte message header
is traced. This means that each Trace message requires sending 20 bytes through the debug NoC.
In addition, we detected that the 64 most significant bits of the 128 did not change during long
periods of time.
On the other hand, the proposed architecture introduces the Logic Analyzer, which allows the user
to apply advanced filtering and select a subset of bits to be traced. That is, with the same filter
signal connections as in the Esperanto Technologies SoC, the user could configure the debug
component to trace only the 50 bits corresponding to the Program Counter value and the valid
signal. Then, together with the Trace message header for non-streaming mode, which is less than 4
bytes, each Trace message would require 10 bytes. This means a reduction of 50% even if no
compression can be applied.
Furthermore, assuming the user does want to trace all 128 filter input signals, and following the
same case as above where the 64 most significant bits do not change during long periods of time,
the reduction in the traffic injected into the debug NoC is also noticeable if the debug monitor
components are configured to apply compression. Since the XOR operation is applied between
consecutive gathered values, the 64 most significant bits would become 0, meaning that, together
with the RLE compression algorithm, the information sent to the destination endpoint would
consist of the 64 least significant bits plus 8 bits (encoded as 0x80), which encode the fact that
there are 8 bytes (0x8) with value 0. Then, together with the Trace message header for
non-streaming mode, each Trace message would require less than 13 bytes. This means a
reduction of 30%.
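The XOR-plus-RLE scheme above can be illustrated with a small Python sketch. The exact wire
encoding of the RLE byte is an assumption for illustration, chosen to reproduce the 0x80 example
in the text (run length in the high nibble); the sample values are invented.

```python
def xor_delta(prev: bytes, curr: bytes) -> bytes:
    """XOR consecutive samples so unchanged bytes become 0x00."""
    return bytes(p ^ c for p, c in zip(prev, curr))

def rle_zeros(data: bytes) -> bytes:
    """Toy RLE: a run of 1..15 zero bytes is packed into a single byte
    with the run length in the high nibble (8 zero bytes -> 0x80, as in
    the example in the text); non-zero bytes pass through literally."""
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == 0:
            run = 0
            while i < len(data) and data[i] == 0 and run < 15:
                run += 1
                i += 1
            out.append(run << 4)
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

# Two consecutive 128-bit samples: the 64 most significant bits (left
# 8 bytes) are identical, while the low bits (Program Counter) change.
prev = bytes.fromhex("0123456789abcdef" "1111111111111111")
curr = bytes.fromhex("0123456789abcdef" "2233445566778899")

payload = rle_zeros(xor_delta(prev, curr))
# 1 RLE byte (0x80) + 8 changed bytes = 9 bytes instead of 16; with a
# sub-4-byte non-streaming Trace header the message stays under the
# 13 bytes quoted in the text.
print(len(payload), payload.hex())
```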
6.2.3 Support for high-bandwidth monitoring
Another important feature that has been added is support for timestamps, which are not only
synchronized but also integrated into the system in a way that reduces the amount of information
sent, allowing long traces with large timestamp values without a huge impact on the amount of
information sent from the debug monitor components.

The two most important monitoring support features, interleaving and streaming, allow the user to
gather information through the SoC block debug monitor components and store it in memory in a
way that not only lets the user extract a huge amount of information but also supports multiple
SoC memory subsystems and host memory as a possible destination.
It is also important to note that standard interfaces such as AMBA APB3 or AMBA AXI4 have been
integrated as part of the debug NoC, sparing the user from adding dedicated debug components
within the SoC blocks and avoiding extra points of failure.
6.3 Monitoring
Visibility is essential when debugging multicore systems because the compute cores interact not
only with each other but also with several resources such as peripherals or memories. The
monitoring features proposed in this document are standard in the processor design field, but the
way they have been architected reduces the area and power overhead dedicated to the debug
infrastructure, which is extremely important for systems that must not exceed a certain power
threshold and require several instances of the monitor components.
6.3.1 Filtering optimization
One of the most important changes concerns how the signals connected by the designers to the
monitor components are filtered in order to generate notification messages, events, or traces. The
proposed architecture filters signals before they are stored in the debug components' interface
buffers, which discards useless data and avoids filling the buffers. This mechanism reduces the
probability of losing important information when the destination endpoint is busy and
backpressure is applied.

In addition, the proposed filter mechanism can be used to reduce buffer size since, depending on
the filter conditions, most of the analyzed data is discarded. Note that this has a direct impact on
area usage.
The Bus Tracker implementation is optimized for monitoring buses whose protocol uses unique
transaction identifiers, providing further visibility without a huge cost in area and power
consumption. Moreover, the ability to send the gathered transaction information in advance, for
cases where the response must be filtered, reduces the pressure on the Bus Tracker trace buffer
and allows designers to reduce its width.
6.3.2 Multiple protocol support
On the other hand, the Bus Tracker debug component allows the user to track transactions without
requiring the analyzed interface to follow a specific protocol such as AMBA AXI4. The Bus Tracker
filtering interface consists of the essential signals to track requests by identifier, plus a bus that the
user can fill with protocol-specific control signals.
The proposed architecture has a huge impact in terms of area, given that it reduces not only the
debug monitor I/O interface but also all the buffering for those signals.
As an example, Esperanto's SoC used the UltraSoC Bus Monitor module to track requests on an
AMBA AXI4 interface. The AMBA AXI protocol required several I/O buses and internal buffers,
having a huge impact on area even though most of the AMBA AXI fields were hardwired or unused
in Esperanto's architecture.
6.3.3 Trace storage improvements
Another important feature of the proposed debug architecture is the capability to store data
gathered from the system in the memory subsystem(s) through the Trace Storage Handler. This
debug component allows the system to forward stored data upon request from the host, or
automatically every certain number of cycles, decreasing the pressure on the region of memory
reserved for gathered data.
In addition, the proposed trace storage system supports PCIe either as the requestor for extracting
data, which provides a high-bandwidth interface to output gathered data, or as the trace
destination endpoint, since the host memory space can be used for buffering. It is important to
note that if the host can analyze the gathered information as fast as, or faster than, the monitor
components write traces, it can virtually keep capturing traces during the whole execution.
Taking Esperanto Technologies' SoC as an example, each Minion Shire runs at 1 GHz and is
connected to the debug NoC, which runs at 500 MHz. Monitor components such as the UltraSoC
Status Monitor are connected to the Minion Shire clock. Note that the UltraSoC Status Monitor
stores captured data before applying filtering and requires 2 cycles to check whether the
comparator conditions are met and to generate the trace message to be forwarded through the
debug NoC. The UltraSoC Status Monitor is parameterized to capture 128 bits, which together with
the message header requires 161 bits to be sent for each traced message. Note that we are not
taking timestamp information into account.
This means that each Minion Shire UltraSoC Status Monitor can trace 128 bits every two cycles at
1 GHz, so the throughput from the debug monitor component is 64 trace bits per nanosecond, yet
161 bits must be sent. In addition, the debug NoC still works at 500 MHz, half the frequency, which
creates further pressure and possible bottlenecks on the debug infrastructure. Moreover, there are
multiple UltraSoC Status Monitor components within the Minion Shire whose traffic is aggregated
on a Message Engine that has a 128-bit interface with the debug NoC. Because 161 bits must be
sent over this 128-bit interface, each message needs to be split in two, and due to the frequency
difference the debug NoC accepts one message every 2 Message Engine cycles. On the ET-SoC-1
debug infrastructure, therefore, 4 cycles are required for each 128 bits of traced data, meaning a
throughput of 32 bits per nanosecond. In conclusion, the UltraSoC Status Monitor buffers can
potentially overflow, since the monitor needs 2 cycles per trace and stores traces before applying
filtering; the interface with the debug NoC neither scales accordingly nor adapts to running at
different frequencies; and tracing from multiple debug components, even when targeting different
memory subsystem blocks, causes problems as traffic is aggregated on the Message Engine.
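The throughput figures above can be reproduced with a short calculation. Only parameters stated in the text are used; the ceiling division for the number of message beats is the one assumption made explicit here:

```python
# Reproduces the ET-SoC-1 trace throughput arithmetic from the text.
SHIRE_FREQ_GHZ = 1.0   # Minion Shire clock
NOC_FREQ_GHZ = 0.5     # debug NoC clock
CAPTURE_BITS = 128     # UltraSoC Status Monitor capture width
MESSAGE_BITS = 161     # capture + header (timestamp excluded)
ME_IF_BITS = 128       # Message Engine <-> debug NoC interface width

# The monitor produces one 128-bit trace every 2 Shire cycles at 1 GHz.
source_bits_per_ns = CAPTURE_BITS / 2 * SHIRE_FREQ_GHZ

# A 161-bit message over a 128-bit interface takes 2 beats; the NoC at half
# the frequency accepts one beat every 2 Message Engine cycles -> 4 cycles.
beats = -(-MESSAGE_BITS // ME_IF_BITS)  # ceil(161 / 128) = 2
cycles_per_trace = beats * int(SHIRE_FREQ_GHZ / NOC_FREQ_GHZ)
drained_bits_per_ns = CAPTURE_BITS / cycles_per_trace * SHIRE_FREQ_GHZ

print(source_bits_per_ns)   # 64.0 trace bits/ns produced by the monitor
print(drained_bits_per_ns)  # 32.0 trace bits/ns drained through the NoC
```

The mismatch between the 64 bits/ns produced and the 32 bits/ns drained is precisely what makes the UltraSoC Status Monitor buffers prone to overflow.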
On the other hand, the proposed architecture introduces the Logic Analyzer, which performs
filtering before the captured data is stored, in order to avoid filling the buffer with useless data,
and generates the trace message in just one cycle. In addition, the debug NoC interleaving feature
allows sending 1 trace per cycle even if the debug NoC runs at half the frequency of the Minion
Shire, which means 1 trace per nanosecond.
Furthermore, we also need to take into account the debug NoC streaming feature, which allows
the user to capture more data by reusing some of the message fields. Assuming the Logic Analyzer
is parameterized to capture 64 bits of data, one non-streaming trace sends 64 bits of data which,
together with the message header, requires 92 bits. Note that we are not taking timestamp
information into account. Once streaming is enabled, the Transaction ID, source SoC block and
source debug component ID sub-fields of the message Source field carry captured data from the
system. Each trace message is then composed of the header together with the captured data,
allowing the user to capture 81 bits of traced data per nanosecond while still using the same 92
bus bits.
Together, the debug NoC interleaving and streaming features allow the user to trace 81 bits of raw
data per nanosecond, since the messages are sent to different memory subsystems each cycle,
allowing the debug NoC to work at half the frequency, and more data can be traced each cycle.
This means that the proposed architecture requires 28% fewer wires (from the 128-bit Message
Engine interface down to 92 bits) and provides roughly 2.5 times the trace throughput (from 32 bits
per nanosecond to 81 bits per nanosecond).
7. Conclusion
This project explores the post-silicon validation field, which is a crucial step in System on-Chip
design and which, during the last decade, has been becoming a bottleneck for companies'
development. In a world that grows technologically very fast, and where globalization has
connected companies to all countries and their needs, time-to-market has become the most
important constraint together with good SoC performance. Because of that, companies are
shrinking the time invested in pre-silicon validation, such that it is not possible to detect all design
flaws.
With the goal of designing an efficient post-silicon validation debug hardware architecture, this
project introduces an infrastructure that can fit in any SoC regardless of its size, complexity or
target software applications.
This study analyzes the available options in the market, which range from FPGA emulator clusters
to embedded debug components provided by third-party companies, to get a better understanding
of the debug features required in Systems on-Chip and to develop an efficient debug architecture.
The study shows that the options in the market either require specific core architectures such as
ARM or RISC-V, or are so generic that they integrate features that may not be needed, causing area
and power overhead.
After a study of multicore system requirements and an overview of the Esperanto Technologies
ET-SoC-1 design, various aspects were identified that could be improved in order to provide further
observability while maintaining a low area overhead.
The proposed debug architecture focuses its efforts on an efficient debug Network on-Chip and
messaging infrastructure able to provide sustained high bandwidth, allowing data to be gathered
from the system during long periods of time. In addition, this project also centered its efforts on
developing a debug architecture that is accessible not only through external standard debug
interfaces such as JTAG or USB, but also from high-bandwidth external interfaces such as PCIe or
from a master core within the system.
On the other hand, this project introduces new debug components which can be integrated in the
SoC blocks, providing further visibility based on advanced filtering and allowing the user to gather
information from the system. The gathered information is compressed following an efficient and
easy-to-implement algorithm. The compressed data can then be sent to the host for
post-processing analysis, or stored in the SoC memory subsystem, allowing the user to capture
data from the system even when using slow external debug interfaces such as JTAG, which could
otherwise become a bottleneck.
Furthermore, this thesis introduces recommendations on how to integrate the debug components
in multicore systems: introducing debug features for the compute cores, managing the system
workload from a master core through the debug infrastructure, and integrating hardware
assertions for capturing manufacturing faults that are hard to detect, among others. A new debug
architecture for the Esperanto Technologies ET-SoC-1 is proposed and, based on the evaluation
analysis developed in previous sections, with less area and power overhead we obtain up to
93.45% improvement in bandwidth for JTAG connectivity and up to 99.99% improvement in
latency for run-control operations. All of this reduces the impact of the debug infrastructure on the
use of SoC resources while providing further visibility.
8. Future work
During this project we highlighted the importance of validating System on-Chip designs from both
functional and non-functional perspectives. The validation effort for SoCs is performed in two
stages: (i) pre-silicon validation and (ii) post-silicon validation.
Earlier in this document we saw how to increase the observability and controllability of the design
by making use of scan chains, monitor components and efficient network communication, in order
to detect errors introduced either during the design stage or the manufacturing process.
First, we must implement the proposed architecture, including the monitor components and the
debug Network On-Chip, in order to obtain area and power numbers from which we can perform
further evaluation and improvements.
Then, as part of the future work for improving the proposed architecture, we must focus on trace
compression techniques that target both the depth and the width of the trace buffering. We
understand depth compression as reducing the number of cycles during which the data is
erroneous and must be stored for further analysis. Width compression, on the other hand, would
allow us to store more information each cycle.
Another important aspect to take into account is the tradeoff between observability and security.
Verification engineers would like to maximize the observability of the system in order to reduce
debug time, whereas the security engineering team prefers to sacrifice visibility of system
resources in order to enhance the security and privacy of the system's behavior. Because of that, it
is necessary to investigate encryption methods that can protect the data transmitted through the
debug architecture, in order to prevent information from being leaked.