EFFICIENT LARGE-SCALE COMPUTER AND NETWORK MODELS USING OPTIMISTIC PARALLEL SIMULATION
By
Garrett R. Yaun
A Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Computer Science
Approved by the Examining Committee:
Dr. Christopher D. Carothers, Thesis Adviser
Dr. Shivkumar Kalyanaraman, Member
Dr. Sibel Adalı, Member
Dr. Boleslaw K. Szymanski, Member
Dr. Biplab Sikdar, Member
Rensselaer Polytechnic Institute
Troy, New York
June 2005
EFFICIENT LARGE-SCALE COMPUTER AND NETWORK MODELS USING OPTIMISTIC PARALLEL SIMULATION
By
Garrett R. Yaun
An Abstract of a Thesis Submitted to the Graduate
Faculty of Rensselaer Polytechnic Institute
in Partial Fulfillment of the
Requirements for the Degree of
DOCTOR OF PHILOSOPHY
Major Subject: Computer Science
The original of the complete thesis is on file in the Rensselaer Polytechnic Institute Library
Examining Committee:
Dr. Christopher D. Carothers, Thesis Adviser
Dr. Shivkumar Kalyanaraman, Member
Dr. Sibel Adalı, Member
Dr. Boleslaw K. Szymanski, Member
Rensselaer Polytechnic Institute
Troy, New York
June 2005
CONTENTS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
ABSTRACT
1. Introduction
1.1 Motivation
1.2 List of Terms
1.3 Scope of the Thesis
1.4 Contributions
1.5 Thesis Outline
2. Related Work
2.1 Introduction to Simulation
2.2 Parallel Discrete-Event Simulation
2.2.1 Conservative Synchronization
2.2.2 Optimistic Synchronization
2.2.3 Comparison between Optimistic and Conservative Synchronization
2.3 Reverse Computation
2.4 Other Applications of Reverse Computation and Optimistic Execution
2.5 Chapter Summary
3. Configurable Application View Storage System: CAVES
3.1 ROSS’ Data Structures
3.2 CAVES Model
3.2.1 CAVES Model Overview
3.2.2 CAVES Server
3.2.3 CAVES Hierarchy
3.2.4 CAVES Flow and Statistics
3.2.5 CAVES Implementation
3.3 Reverse Computation
3.3.1 Methodology for Reverse Computation
3.3.2 CAVES Reverse Code
3.3.3 CAVES: A One-to-Many Delete
3.3.4 Variable Dependencies
3.4 CAVES Model Performance Study
3.4.1 CAVES Model Parameters
3.4.2 Performance Metrics and Platforms
3.4.3 Overall Speedup Results
3.4.4 Experiment Changes
3.5 Related Work
3.6 Conclusions
4. TCP Model
4.1 TCP Model Motivation and Introduction
4.2 TCP Overview
4.3 TCP Model Implementation
4.3.1 TCP Model Data Structures
4.3.2 TCP Model Compressing Router State
4.3.3 TCP Model Reverse Code
4.3.4 TCP Model Validation
4.4 TCP Model Performance Study
4.4.1 Hyper-Threaded Computing Platform
4.4.2 Quad and Dual Pentium-3 Platform
4.4.3 TCP Model’s Configuration
4.4.4 Synthetic Topology Experiments
4.4.5 Hyper-Threaded vs. Multiprocessor System
4.4.6 AT&T Topology
4.4.7 Campus Network
4.5 Related Work
4.6 Conclusions
5. Reverse Memory Subsystem
5.1 Introduction
5.2 Design Decisions
5.3 Reverse Computing Memory
5.3.1 Internals of the Reverse Memory Subsystem
5.3.2 Reverse Computing Memory Initialization
5.3.3 Reverse Computing Memory Allocations
5.3.4 Reverse Computing Memory De-allocations
5.3.5 Attaching Memory Buffers to Events
5.4 Memory Buffers for State Saving
5.5 Performance Study
5.5.1 Benchmark Model
5.5.2 Benchmark Model Results
5.5.3 TCP Results
5.5.4 Reduction in CAVES Model Complexity
5.6 Conclusions
6. Sharing Event Data
6.1 Introduction
6.2 Multicast Background
6.3 Implementation
6.3.1 Sequential & Conservative Simulation
6.3.2 Optimistic Simulation
6.4 Performance Study
6.4.1 Itanium Architecture
6.4.2 Benchmark Multicast Model
6.4.3 Model Parameters and Results
6.5 Related Work
6.6 Conclusions
7. Summary
BIBLIOGRAPHY
LIST OF TABLES
2.1 Summary of treatment of various statement types.
3.1 Summary of treatment of various statement types.
4.1 Summary of treatment of various statement types.
4.2 Performance results measured in speedup (SU) for N = 4, 8, 16, 32 synthetic topology networks for low (500 Kb), medium (1.5 Mb), and high (45 Mb) bandwidth scenarios on 1, 2, and 4 instruction streams (IS) on a dual Hyper-Threaded 2.8 GHz Pentium-4 Xeon. Efficiency is the net events processed (i.e., excluding rolled-back events) divided by the total number of events. Remote is the percentage of the total events processed that were sent between LPs mapped to different threads/instruction streams.
4.3 Memory requirements for N = 4, 8, 16, 32 synthetic topology networks for low (500 Kb), medium (1.5 Mb), and high (45 Mb) bandwidth scenarios on 1, 2, and 4 instruction streams on a dual Hyper-Threaded 2.8 GHz Pentium-4 Xeon. Optimistic processing only required 7,000 more event buffers (140 bytes each) on average, which is less than 1 MB.
4.4 Performance results measured in speedup (SU) for the N = 8 synthetic topology network at medium bandwidth on 1, 2, and 4 instruction streams (dual Hyper-Threaded 2.8 GHz Pentium-4 Xeon) vs. 1, 2, and 4 processors (quad 500 MHz Pentium-III).
4.5 Performance results measured in speedup (SU) for the AT&T network topology for medium (96,500 LPs) and large (266,160 LPs) configurations on 1, 2, and 4 instruction streams (IS) on the dual hyper-threaded system.
4.6 Performance results measured for ROSS and PDNS for a ring of 256 campus networks. Only one processor was used on each computing node.
5.1 Reverse Memory Subsystem memory usage and event rate. LPs is the number of LPs in the model. BSz is the size of the memory buffers. St-Sv is the state-saving model. Swap is the swap-with-statically-allocated-list model. RMS is the Reverse Memory Subsystem.
5.2 Reverse Memory Subsystem data cache misses. NumLPs is the number of LPs in the model. BufSz is the size of the memory buffers. St-Sv misses are the state-saving model cache misses. Swap misses are the swap-with-statically-allocated-list model cache misses. RMS misses are the Reverse Memory Subsystem cache misses.
5.3 Reverse Memory Subsystem speedup. NumLPs is the number of LPs in the model. BufSz is the size of the memory buffers. 2-4 Spdup is the speedup for 2 to 4 processors.
5.4 Memory requirements for N = 4, 8, 16, 32 synthetic topology networks for low (500 Kb), medium (1.5 Mb), and high (45 Mb) bandwidth scenarios on 1, 2, and 4 instruction streams on a dual Hyper-Threaded 2.8 GHz Pentium-4 Xeon.
6.1 Sequential performance with and without shared data. T is the number of trees in the multicast graph. L denotes the number of levels within each tree. Estart is the number of initial events each LP schedules at the start of the simulation. S is the size of the data in the messages. Mtraditional and Mshared are the required memory for the traditional and shared event data models respectively. ERtraditional and ERshared are the event rates for the traditional and shared event data models respectively.
6.2 Data cache misses per memory reference. T is the number of trees in the multicast graph. L denotes the number of levels within each tree. Estart is the number of initial events each LP schedules at the start of the simulation. MRshared and MRtraditional are the data cache miss rates for the shared event data and traditional models respectively. Finally, % Reduction is the amount the miss rate is reduced by the event-sharing scheme.
6.3 Parallel results for shared event data. T is the number of trees in the multicast graph. L denotes the number of levels within each tree. Estart is the number of initial events each LP schedules at the start of the simulation. 2-4 PEs is performance measured in speedup (i.e., sequential execution time divided by parallel execution time) for 2 to 4 processors.
LIST OF FIGURES
2.1 Discrete-event simulation event processing loop.
2.2 Causality error.
2.3 Deadlock caused by a waiting cycle.
2.4 Straggler event arrives.
2.5 Straggler event arrives causing rollback.
2.6 Anti-message arrives annihilating an unprocessed event.
2.7 Anti-message arrives causing secondary rollback.
2.8 Transient message problem.
2.9 Simultaneous message problem.
2.10 LP state to message data swap example.
3.1 The topology of the model.
3.2 Flow chart for request arrival and neighbors request.
3.3 Flow chart for neighbor response and database request.
3.4 Flow chart for client response.
3.5 Flow chart for add view.
3.6 Forward and reverse CAVES request.
3.7 Forward and reverse CAVES response.
3.8 Reverse computation code without considering variable dependencies.
3.9 Reverse computation code using variable dependencies.
4.1 Forward and reverse of TCP correct ack.
4.2 Forward and reverse of TCP updating cwnd.
4.3 Forward and reverse of TCP handling a duplicate ack.
4.4 Forward and reverse of TCP process sequence number.
4.5 Comparison of SSFNet’s and ROSS’ TCP models based on sequence number for TCP Tahoe retransmission timeout behavior. Top panel is ROSS and bottom panel is SSFNet.
4.6 Comparison of SSFNet’s and ROSS’ TCP models based on congestion window for the TCP Tahoe retransmission timeout behavior test. Top panel is ROSS and bottom panel is SSFNet.
4.7 Comparison of SSFNet’s and ROSS’ TCP models based on sequence number for TCP Tahoe fast retransmission behavior. Top panel is ROSS and bottom panel is SSFNet.
4.8 Comparison of SSFNet’s and ROSS’ TCP models based on congestion window for the TCP Tahoe fast retransmission behavior test. Top panel is ROSS and bottom panel is SSFNet.
4.9 AT&T Network Topology (AS 7118) from the Rocketfuel data bank for the continental U.S.
4.10 Campus Network [59].
4.11 Ring of 10 Campus Networks [35].
4.12 Packet rate as a function of the number of processors.
5.1 New data structures added to ROSS for the Reverse Memory Subsystem.
5.2 Kernel Process Memory Buffers Fossil Collection Function.
5.3 Freeing of an event after annihilation.
5.4 Initialization Functions and Variables for the Reverse Memory Subsystem.
5.5 The allocation function for the Reverse Memory Subsystem.
5.6 The reverse allocation function for the Reverse Memory Subsystem.
5.7 The free and reverse free functions for the Reverse Memory Subsystem.
5.8 The event memory set function for the Reverse Memory Subsystem.
5.9 The event memory get function for the Reverse Memory Subsystem.
5.10 The reverse event memory get function for the Reverse Memory Subsystem.
5.11 CAVES cache data structure.
5.12 CAVES removal of a view.
5.13 CAVES reverse computation code for a removal of a view.
5.14 CAVES allocation of a view.
5.15 CAVES reverse computation of a view allocation.
6.1 Multicast graphic from Cisco [21].
6.2 Memory Set Routine. This routine is performed when setting a memory buffer to an event. B is the memory buffer. E is the event.
6.3 Memory free routine. This routine is performed on all event allocations. B is the memory buffer. E is the event.
6.4 Additional Memory Required for Increases in Levels.
6.5 Memory required with respect to shared data segment size. One case is 8 multicast trees, 10 levels, and 8 start events; the other case is 16 trees, 10 levels, and 4 start events.
6.6 Event rate with respect to shared data segment size. One case is 8 multicast trees, 10 levels, and 8 start events; the other case is 16 trees, 10 levels, and 4 start events.
ACKNOWLEDGEMENTS
I would like to thank my adviser and mentor Christopher Carothers for the support and guidance that carried me through my graduate studies, and for his belief and confidence in my abilities. I am immensely grateful for the knowledge I gained from his counsel. I feel fortunate to have Chris as my adviser, teacher, and friend.

I want to express my gratitude to Shivkumar Kalyanaraman, Sibel Adali, Boleslaw Szymanski, and Biplab Sikdar for serving on my committee and helping me with my dissertation. In addition, Boleslaw Szymanski was kind enough to allow me the use of his cluster for performance experiments.

I would like to thank David Bauer for his help throughout college, whether on a class project or on research. I enjoyed our lengthy discussions, where his enthusiasm helped me spawn many new ideas and insights in both our research and my life.

I appreciate the faculty and staff in the Computer Science Department at Rensselaer Polytechnic Institute, who offered me a great study and research environment. I would also like to thank my fellow students for their help in preparing for the qualifiers and for their comments on my rehearsals for my candidacy presentation and dissertation defense.

A special thanks to my parents for their belief in and support of me in whatever I chose to pursue. I am grateful for their patience and understanding.
ABSTRACT
Modeling and simulation is a valuable tool in the analysis of large-scale networks and computer systems. To tackle the complexity of such systems, conservative parallel simulation is often employed to reduce runtime. Optimistic simulation has previously been viewed as outside the performance envelope for such models. However, with the advent of a new technique called reverse computation, the memory requirements for benchmark models have been dramatically reduced.

In this thesis, we demonstrate the use of reverse computation to allow large-scale simulation models to achieve greater scalability and performance. The models developed for this thesis consist of network protocols and distributed computer system applications.

Within these models, reverse computation was important in achieving performance gains and dispelling the view that optimistic techniques operate outside the performance envelope. These are the first real-world models to leverage reverse computation and demonstrate its efficiency. Our TCP model executed at 5.5 million packets per second, which is 5.14 times greater than PDNS’ packet rate of 1.07 million for the same large-scale network scenario. This experiment was performed across a distributed cluster of 32 nodes and executed on one processor per node.

Observations made during the creation of these models led to the development of the reverse memory subsystem and the idea of shared event data. The contribution of this subsystem is that it allows for easier implementation of models and allows models to use dynamic memory. This subsystem permits an overall reduction of memory in simulation models as compared to models that work in statically “pre-allocated” memory. Shared event data works by decreasing the amount of duplicate information in events. Our experiments with shared event data show significant memory reductions when there is a high degree of redundant data.

These contributions, taken together, enable real-world large-scale models to be efficiently developed and executed in an optimistic parallel simulation framework.
CHAPTER 1
Introduction
This chapter presents the motivation for the research, illustrates the scope of this thesis, summarizes the main contributions, and offers an outline of the thesis.
1.1 Motivation
Large-scale systems are difficult to understand due to their size and complexity.
Analytical models give quick solutions but often are constrained or use assumptions
that may not reflect realistic operating conditions. These simplifications in the
models can cause important results to be overlooked. Simulation, on the other
hand, can lead to new insights that analytical models might have missed.
The Internet is one such large-scale system, and it will continue to grow. In [22] it is reported that Internet data traffic is doubling each year. During 1997, Internet traffic was between 2,500 and 4,000 TB/month [22]. By year-end 2002, the authors estimated that the amount of Internet traffic was between 80,000 and 140,000 TB/month. RHK, Inc., a market research and consulting firm, has estimates that also fall into this range [75, 91].
Pervasive devices such as VOIP-capable mobile phones [54], wireless PDAs, and portable consumer electronics (e.g., the iPod [49] and PSP [86]) are likely to increase wired and wireless network data growth rates. Microsoft has recently announced “MSN Video Downloads, which will provide daily television programming, including video content from MSNBC.com, Food Network, FOX Sports and IFILM Corp., for download to Windows Mobile (TM)-based devices such as Portable Media Centers and select Smartphones and Pocket PCs.” [67] With the introduction of these new multimedia subscription services, the growth rates could increase further.
To address bandwidth allocation and congestion problems that arise as network data transfer rates increase, researchers are proposing new overlay networks
that provide a high quality of service and a near-lossless guarantee. However, the central question raised by these new services is: what impact will they have in the large? To address these and other network engineering research questions, high-performance simulation tools are required.
The predominant technique for analyzing network behavior is discrete-event simulation. One such simulator is NS [74], which offers flexibility and ease of use but is not adequate for simulating large-scale networks. Consider the following example of a simple network with 512 source nodes connected by a duplex link to 512 sink nodes, with UDP packets traversing the link configured for drop-tail queueing. NS will allocate almost 180 MB of RAM for this network. This simulation processes events at a rate of approximately 1,000 per second. For large-scale networks, we require almost 1,000 to 100,000 times greater performance. These computational requirements are immense and are unobtainable on a single processor.
One might believe that with faster processors any sequential simulator’s performance will adequately scale; however, the problem grows along with processor performance. For example, mobile phones were once only capable of sending pictures or browsing the web; soon they will be generating VOIP traffic and viewing streaming video [54, 67]. To address these large-scale network engineering research questions in a scalable, deterministic modeling framework, we believe parallel and distributed simulation techniques are an enabling technology.
Current research in network simulation using parallel techniques is largely based on conservative algorithms. For example, PDNS [94] is a parallel/distributed network simulator that creates a federation of NS simulators. With conservative algorithms, the simulation waits until it can guarantee that no event will arrive in its simulated past before processing an event (i.e., that it is “safe” to execute the event). This approach limits the parallelism that can be leveraged within a model. In addition, conservative simulations are limited by the topology, because simulated time delays between network elements are used to compute the safe times. Changes in the topology, while challenging for parallel simulation algorithms, are after all at the heart of network engineering research questions. Optimistic simulation does not explicitly use the network topology as part of its synchronization mechanism, and it relaxes the need for precise processing of events in strict timestamp order. Optimistic simulation uses a detection-and-recovery scheme to identify out-of-order event processing and corrects these errors with rollbacks. In order to perform a rollback, the simulation system requires additional memory (i.e., “optimistic” memory). This mechanism for synchronization automatically uncovers the available parallelism in a model. These attributes make a good case for using an optimistic synchronization approach to parallel discrete-event simulation of large-scale, real-world network and computer systems.
1.2 List of Terms
Throughout this thesis many different terms will be used. The core terms are
reverse computation, entropy, efficiency, event rate, and speedup, defined as follows:
• Reverse computation is realized by performing the inverse of the individual operations executed in the forward computation. The system guarantees that the inverse operations restore the application’s state to the same value it had before the computation. Reverse computation exploits operations that modify the state variables constructively. Undoing constructive operations like ++, --, and += requires no history, and they are easily reversed. However, destructive operations like a = b cannot be reversed without additional history.
• Entropy in a physical system is a measure of the amount of thermal energy not available to do work. In a computational system, the information in a message is proportional to the amount of free energy required to reset the message to zero. The act of resetting the message to zero uses energy, and that energy is turned into heat (i.e., entropy). For our research here, we refer to entropy as the amount of information lost within a system or a model that is part of some sequential or multiprocessor execution. Destructive operations such as an assignment or a copy lead to a great deal of entropy. Reverse computation on constructive operations, however, is entropy free because the previous value can always be recreated, and thus no data is lost. We also use entropy with respect to caching. A cache with a higher miss rate is undergoing a higher amount of entropy, because each miss destroys a previous entry in the cache. A better-performing cache has fewer entries being replaced and thus less entropy. Throughout this thesis we will look at ways of minimizing the amount of entropy within models.
• Efficiency is defined as acting effectively with a minimum amount of expense or unnecessary effort. We refer to the net events processed (i.e., excluding rolled-back events) divided by the total number of events as efficiency, because this value represents the percentage of effective execution that our model performed. We call our models efficient because they minimize memory usage and execute effectively.
• Event rate is defined as the total number of events processed, less any rolled-back events, divided by the execution time.
• Speedup is defined as the event rate of the parallel case divided by the event rate of the sequential case. Because the total number of events is the same between sequential and parallel runs of the same model configuration, this definition is equivalent to using execution time. Speedup shows how effectively our parallel model executed. Event rate and speedup are important because they give us a way to compare our models and systems against others.
1.3 Scope of the Thesis
The scope of this thesis is the investigation of large-scale optimistic parallel discrete-event simulations on large-scale computing platforms and the development of an understanding of the fundamental limits of our discrete-event simulation engine in the domain of large-scale network models and computer systems. In particular, the study focuses on the simulation of protocols and applications across the Internet. The research can be broken down into three parts. The first part discusses the design of the models. The second part presents the optimizations and methods dealing with reverse computation. The third discusses the implementation of additional functionality in the simulation system framework.
The major research questions addressed by this thesis are:
• Model design is critical to the creation of efficient large-scale optimistic par-
allel simulation. What aspects of large-scale network models, as a motivating
example, can be exploited to improve performance without violating the need
for high-fidelity protocol dynamics?
• The models in large-scale optimistic simulations require large amounts of
memory. State saving is a central overhead in event processing. How can we apply
reverse computation efficiently? What operations can be reversed to prevent
state saving?
• New functionalities are needed in the simulation system. What functionalities
can be added to increase ease of use along with minimizing memory usage?
What is the performance cost or benefit of such functionalities? Can these
new functionalities decrease the amount of memory needed?
This thesis poses and strives to answer these questions. Each area receives equal focus and investigation.
1.4 Contributions
There are four major contributions of this thesis. The first contribution (Chapters 3 and 4) is the development of efficient large-scale models in an optimistic simulation framework. These models include a transport layer protocol as well as a database view storage system called CAVES. Our transport layer protocol model simulates hosts sending and receiving files of given lengths across a realistic network of routers using TCP, described later in Chapter 4. Our CAVES model simulates a hierarchy of storage servers and databases which fulfill applications’ queries. Together they push the limits of the simulation engine and achieve performance levels that were thought to be out of the scope of optimistic simulation [111, 112, 113, 114]. In the design of these models, memory as well as processor efficiency were the top priorities because they directly correlate with the model’s execution time [17].
The second contribution (Chapters 3 and 4) is the application of the methods of reverse computation employed in the construction of the parallel models. The performance study demonstrates that applying reverse computation to optimistic parallel simulation models decreases both space and time complexity, thus enabling multi-dimensional scalability.
The third contribution (Chapter 5) is a new memory management approach
that is both easy to use and reduces model memory consumption. This is
accomplished by adding a reverse memory subsystem to Rensselaer’s Optimistic
Simulation System (ROSS) API. Previously, all model memory management was
done “by hand”. This reverse memory subsystem decreases memory usage and
increases performance by decreasing the amount of “optimistic” memory required
by models.
The fourth contribution (Chapter 6) is the development of shared event
data in an optimistic simulation. This extension to the current framework
reduces memory consumption, which leads to greater scalability in multicast
network models.
Taken together, these contributions enable real-world large-scale models to
be efficiently developed and executed in an optimistic parallel simulation frame-
work.
1.5 Thesis Outline
The thesis is structured as follows:
Chapter 2 surveys the history of simulation research, which in turn intro-
duces the background for this thesis. An overview of both conservative and opti-
mistic parallel simulation synchronization is given. This is followed by an assess-
ment of current research, and the chapter concludes with applications of reverse
computation.
Chapter 3 presents the design and implementation of a configurable appli-
cation view storage system (CAVES). The CAVES model is comprised of a hierarchy
of view storage servers. The term view refers to the output or result of a query
made by an application that is executing on a client machine. These queries
can be arbitrarily complex and formulated using SQL. The goal of this system is
to reduce the turnaround time of queries by exploiting locality both at the local
disk level as well as between clients and servers prior to making the request to the
highest level database server. We show the instrumentation of reverse computation
within the model, analyze the experimental data, and present the results.
Chapter 4 presents our TCP model which is comprised of hosts sending and
receiving files of given lengths across a realistic network of routers. The routers are
equipped with drop tail queues. This TCP model dispels the view that optimistic
simulation techniques operate outside the performance envelope for Internet protocols
and demonstrates that they are able to efficiently simulate large-scale TCP scenarios
for realistic networks.
Chapter 5 begins with a description of the reverse memory subsystem. The
reasons that led to the subsystem design are discussed. This is followed by an
explanation of its use along with example models. In addition, there is a discussion
on how this subsystem eases model development. The chapter concludes with a
performance study that illustrates the subsystem’s benefits.
Chapter 6 explains shared event data. The chapter begins with an example
model that exhibits the properties for which this new functionality is most
useful. This is followed by a discussion of the challenges that arise from imple-
mentation in an optimistic simulation system. The chapter ends with a performance
analysis.
Chapter 7 summarizes the current work and gives conclusions for this re-
search.
CHAPTER 2
Related Work
This chapter illustrates the history of simulation research which in turn introduces
the background for this thesis. An overview of both conservative and optimistic
parallel simulation synchronization is given, followed by a comparison of the two
synchronization techniques. The chapter then discusses reverse computation
with respect to simulation and concludes with other uses of reverse computation.
2.1 Introduction to Simulation
Simulation is defined as “the imitation of the operation of a real-world process
or system over time.” [7] Computer simulation is a computation that models a
process or system and is beneficial because it is repeatable, controllable, and some-
times faster and less costly than the real system. Simulations allow for the study of com-
plex large-scale models that are intractable to closed-form mathematical methods
or analytic solutions.
The time flow mechanism allows a simulation model’s state to change while
time advances. There are two main classifications of the time flow mechanism,
continuous and discrete. Continuous simulation has the state changing continuously
over time. Some examples of continuous simulations are motion of vehicles and
global climatic models. For discrete simulation, the simulation model views time as
a set of discrete points in which state changes occur. There is a hybrid model which
combines the two time flow mechanisms [60].
Two of the most common types of discrete simulations are time-step and event-
driven. The time-step approach advances simulation time in small constant time
intervals. This approach gives the impression of continuous time and therefore might
be suitable for the simulation of some continuous systems.
In an event-driven simulation, time is advanced when “something interesting” occurs.
This “something interesting” is the event and is the key idea behind discrete-event
while( simulation executing )
{
remove event with the smallest time stamp from event-list
sim_time = time stamp of event
pass event to model's event handler
}
Figure 2.1: Discrete-event simulation event processing loop.
simulations [33]. Some examples of discrete events are airplanes taking off, landing,
and loading.
Discrete-event simulation contains algorithms for event management and an
executable software description of the model being simulated. Here, the model is
decomposed into “atomic” events. Each event has a time stamp of when it is to
occur and the events are stored in an event-list. The event with the smallest time
stamp is removed from the event-list and the simulation or “virtual” time is set to
that event’s time stamp. The event is then processed by the model, since it is the
smallest unprocessed event
in the system. Figure 2.1 shows the discrete-event processing loop.
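To make the loop of Figure 2.1 concrete, it can be sketched in C. The event structure, its single payload field, and the sorted linked-list event-list below are illustrative choices only, not those of any particular simulator; a production engine would use a priority queue such as a heap or calendar queue.

```c
#include <stdlib.h>

/* Illustrative event record: a time stamp plus a payload. */
typedef struct event {
    double ts;              /* time stamp at which the event occurs */
    int data;               /* model-specific payload (placeholder) */
    struct event *next;
} event;

static event *pending = NULL;  /* event-list kept sorted by time stamp */
static double sim_time = 0.0;  /* current simulation ("virtual") time  */

/* Insert an event into the event-list in nondecreasing time stamp order. */
void schedule(double ts, int data)
{
    event *e = malloc(sizeof *e);
    event **p = &pending;
    e->ts = ts;
    e->data = data;
    while (*p && (*p)->ts <= ts)
        p = &(*p)->next;
    e->next = *p;
    *p = e;
}

/* The loop of Figure 2.1: remove the smallest time stamped event,
 * advance simulation time, and pass the event to the model's handler. */
void run(void (*handler)(event *))
{
    while (pending) {
        event *e = pending;
        pending = e->next;
        sim_time = e->ts;       /* virtual time only moves forward */
        if (handler)
            handler(e);
        free(e);
    }
}
```

Because the list is kept sorted at insertion, removing the smallest time stamped event is O(1); real engines trade this layout for a heap or calendar queue to make insertion cheaper.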
The modeler can design a simulation with one of three world-views: event-
oriented, process-oriented, and activity-scanning. With event-oriented, the modeler
constructs event handlers, and the handler for the specific event type is called as
each event is processed. Constructing models this way is sometimes more difficult,
but it is the most efficient approach.
In the process-oriented world-view, the modeler can specify the model as a
collection of processes [33]. However, there is more overhead due to the thread man-
agement. Moreover, network protocols such as TCP actually behave in an event-
driven manner. Thus, for the research domain here, the event-oriented world-view is the
most appropriate.
The activity-scanning world-view is a variation on the time-step flow mecha-
nism. The simulation is a collection of procedures with a predicate associated with
each one. At each time-step, the predicates are evaluated and, if a predicate is
true, its associated procedure is executed [33]. This world-view is not
as efficient because of the predicate evaluation process.
2.2 Parallel Discrete-Event Simulation
Parallel discrete-event simulation is a well-established field that has been ap-
plied to a diverse set of problems, from networking to air traffic control [109]. With
parallel discrete-event simulation, a single model is divided or distributed across
multiple processors. These processors could be tightly coupled in a single system
or distributed across a number of host systems and communicate over a high-speed
network. The end goal is typically the same: reduce the model’s overall execution
time. Time-critical applications and on-the-fly simulations have become realities.
In parallel discrete-event simulation individual physical processes are modeled
by logical processes (LPs). These processes communicate with each other via
time stamped events or messages.
The challenge for parallel discrete-event simulation is making sure events are
processed in correct time stamp order. An event processed at an earlier time could
affect the state of the simulation that is used for the processing of events at later
times. The causality constraint asserts that events are processed in time stamp
order and a causality error occurs when an event is processed out of order. If a
causality error transpires, the accuracy of the simulation will be questioned. The
local causality constraint states that all events within an LP must be processed in
nondecreasing time stamp order. All LPs adhering to the local causality constraint
guarantee the causality constraint for the entire simulation.
The correct local time stamp ordering of the events is difficult in parallel
simulation because there is no notion of a global simulation time clock. An example
of a causality error in a simple parallel simulation of an airport model is shown in
Figure 2.2. Here, two airports are modeled as LPs, each mapped to a different
processor. The first airport has a plane departing at time 3. The second airport
has a plane loading at time 7. Both processors execute their events concurrently.
The departing plane creates an arrival event at the second airport at time 6. The
Figure 2.2: Causality error.
processing of the arrival at time 6 would result in the causality constraint being
violated.
In order to prevent causality errors there is a need for a synchronization method
between processors or computers. There are two categories of synchronization meth-
ods: conservative and optimistic. Conservative synchronization makes sure it is safe
to execute an event. Here, the local causality constraint is strictly enforced. Op-
timistic synchronization, on the other hand, relaxes the local causality constraint.
This allows causality errors to occur as long as they are later corrected. The rolling
back of incorrectly processed events and other recovery methods must be incorpo-
rated into the optimistic simulation system.
2.2.1 Conservative Synchronization
Conservative synchronization makes sure it is safe to execute an event. The
first conservative algorithm was Chandy/Misra/Bryant (CMB), named after its
creators. The algorithm makes sure it is safe to execute an event and uses null
messages to avoid deadlock [11, 18].
The CMB algorithm has logical processes connected by directional links. The
LPs communicate across these links by sending messages. In each LP there is a
queue for each incoming link. The three assumptions that must hold true for the
Figure 2.3: Deadlock caused by a waiting cycle.
Chandy/Misra/Bryant algorithm are:
• Messages sent over a link must be sent in nondecreasing time stamp order.
• The communication network, the link, must guarantee that messages are re-
ceived in the same order they are sent.
• The transfer of messages must be reliable (i.e., no losses).
Based on these constraints, a message with a time stamp smaller than any pre-
viously received cannot arrive on an incoming link. Each link’s queue has a clock
value associated with it. The clock value is the time stamp of the smallest message
on the queue. If the queue is empty, the clock value is the time stamp of the last
message processed from that queue. The queue with the smallest clock value is checked. If
there are messages in that queue, the smallest time stamped message is processed.
If the queue is empty the LP waits for a message to arrive on that link. The “wait”
property of the algorithm makes a deadlock condition possible. Figure 2.3
illustrates a deadlock case.
The CMB algorithm accounts for this possibility of deadlock by having null
messages. A null message is sent after the processing of each message. The null
message contains no information needed by the model; it carries only a time stamp
guaranteeing that no message will arrive from the sender LP with a time stamp less
than the sent time stamp. The LP receiving the null message can advance the clock
value based on the null message which in turn allows for safe event processing and
makes time advancement possible. The lookahead is included in the null message’s
time stamp. “Lookahead refers to the ability to predict what will happen, or
more importantly, what will not happen, in the simulated future.” [32] Thus the
lookahead included in the time stamp is the amount of simulation time that an
LP can predict into the simulated future. The lookahead has to be
derived from the model based on limitations on how quickly physical processes can
interact. It is believed that poor lookahead in a model “results in more frequent
synchronizations, therefore, higher overheads”. [60] In models with zero lookahead
deadlock is still possible.
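These two CMB ingredients, channel clocks and null messages carrying lookahead, can be sketched in C; the link structure and function names below are illustrative only, not taken from any particular implementation.

```c
/* Illustrative per-link bookkeeping for CMB: each incoming link has a
 * clock value (smallest queued time stamp, or the time stamp of the
 * last message processed if the queue is empty) and an empty flag. */
typedef struct {
    double clock;
    int empty;              /* nonzero if no message is queued */
} link_t;

/* Select the link with the smallest clock value. A message may be
 * safely processed only if that link is nonempty; otherwise the LP
 * must block, which is where deadlock can arise. */
int safe_link(const link_t links[], int n)
{
    int min = 0;
    for (int i = 1; i < n; i++)
        if (links[i].clock < links[min].clock)
            min = i;
    return links[min].empty ? -1 : min;   /* -1 means: block and wait */
}

/* A null message promises that nothing earlier than now + lookahead
 * will be sent, letting the receiver advance that link's clock. */
double null_message_ts(double now, double lookahead)
{
    return now + lookahead;
}
```

With zero lookahead, `null_message_ts` returns the current time and the receiving clock cannot advance, which is exactly why zero-lookahead models can still deadlock.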
Demand-driven is an alternative to the null message approach. Here, the
blocked LP requests the next time stamp from that link. Once the
response is returned, the LP can continue execution. This method reduces the
message overhead created by the null messages but increases delay because of the
two transmissions.
The CMB algorithm is a deadlock avoidance algorithm. There are other con-
servative algorithms that focus on deadlock detection and recovery [19, 68]. Deadlock
can be overcome by exploiting the knowledge that the smallest time stamped event
in the system can be processed. Deadlock detection algorithms can allow for zero
lookahead cycles in simulations; however, “they tend to be overly conservative” and
parallelism is limited [60].
The Critical Channel Traversing (CCT) algorithm extends the CMB algorithm
with the addition of policies that determine when an LP should be scheduled to
Figure 2.4: Straggler event arrives.
execute events. The CCT algorithm schedules the LPs with the largest number of
events that are ready to execute by identifying critical channels in the model [110].
The most recent conservative synchronization algorithm is Nicol’s composite
synchronization [72]. This scheme utilizes a barrier synchronization approach for
those LPs that are “far apart” in virtual time and Critical Channel Traversing
for LPs that are more closely related in virtual time. The algorithm divides LPs
into these two categories optimally. Composite synchronization avoids the channel
scanning limitations associated with CCT while at the same time reducing the
frequency of applying global barriers. Thus, it minimizes
the overheads of both CCT and barrier synchronization mechanisms.
2.2.2 Optimistic Synchronization
Optimistic synchronization allows for execution to happen as fast as possible
(i.e., synchronization or wait-free) assuming there are no causality errors. A wait-
free implementation guarantees that any process can complete any operation in a
finite number of steps, regardless of the execution speeds of the other processes [44].
If an error occurs, the simulator provides mechanisms for detecting and correcting
it. The first optimistic synchronization algorithm was Time Warp, developed by
Jefferson [51]. The Time Warp algorithm is composed of two pieces: the local control
mechanism and the global control mechanism.
The local control mechanism acts in each LP and is largely independent from
other LPs. After processing an event, an LP inserts it into a processed event queue.
The processed event queue is used for reprocessing that is caused by a rollback. A
rollback happens when a straggler message, an event with a time stamp smaller
than that of an event already processed, arrives at the LP. This can be seen in
Figure 2.4. In order to maintain the local causality constraint, changes to the LP’s
state caused by out of order processing must be undone or rolled back. Once the
LP state is corrected the straggler message can be processed and the other events
then can be re-executed, thus maintaining the local causality constraint.
The undo process requires ways of reversing the state changes and canceling
events that were incorrectly sent. Methods of undoing state changes include: copy
state saving, infrequent state saving [9, 56, 57], incremental state saving [10, 41, 95,
100, 108], and reverse computation [15, 16, 80].
Copy state saving copies all changeable variables in the state before each
event gets processed. When a straggler message arrives, the LP knows the exact
construction of the previous state. This method requires substantial memory since
a full copy of the LP’s state is required.
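As a sketch, copy state saving amounts to a wholesale memcpy of the LP state before each event; the state struct below is illustrative, and the cost of the approach is visible in that the copy grows with the size of the whole struct.

```c
#include <string.h>

/* Illustrative LP state; any fixed-size state struct is handled the
 * same way, which is why the cost grows with the size of the state. */
typedef struct {
    long seqno;
    double value;
} lp_state;

/* Snapshot the entire state before the event modifies it. */
void save_state(lp_state *saved, const lp_state *live)
{
    memcpy(saved, live, sizeof *saved);
}

/* On rollback, restore the snapshot wholesale. */
void restore_state(lp_state *live, const lp_state *saved)
{
    memcpy(live, saved, sizeof *live);
}
```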
Infrequent state saving is similar to copy state saving but only copies the
state at intervals. When a straggler message arrives, the LP rolls back to a state
before the straggler. The LP then re-executes all events up to the desired state.
The re-executing of events is called “coasting forward” [31]. During this “coasting
forward” phase the re-executed events abstain from resending or canceling events since it
is unnecessary. This method reduces the memory required but adds the overhead
of the coasting forward phase.
Incremental state saving only saves the variables that were modified by the
current event. One form keeps a log of which variables were modified. When the
LP rolls back, the simulator is required to browse the log in decreasing time stamp
order to correctly revert the state changes. This form of incremental state saving
can be implemented by the modeler. Other forms try to automate the process. One
such form has been implemented by overloading assignment operators in C++. The
overloading of operators makes the state saving transparent to the user [95]. In
[108] incremental state saving was implemented by automatically editing compiled
executable code. Incremental state saving is useful when few state variables are
modified per event processed.
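The log-based form can be sketched as follows; the fixed-size log, the long-only values, and the function names are simplifications for illustration. Each destructive write records the address and old value, and rollback replays the log in reverse order.

```c
/* Minimal incremental state-saving log (illustrative). Before an
 * event overwrites a variable, its address and old value are pushed;
 * on rollback the log is unwound most-recent-first. */
#define LOG_MAX 64

static struct { long *addr; long old; } logbuf[LOG_MAX];
static int logtop = 0;

/* Record the old value, then perform the destructive write. */
void logged_store(long *addr, long val)
{
    logbuf[logtop].addr = addr;
    logbuf[logtop].old  = *addr;
    logtop++;
    *addr = val;
}

/* Undo every logged write in decreasing (reverse) order. */
void rollback_log(void)
{
    while (logtop > 0) {
        logtop--;
        *logbuf[logtop].addr = logbuf[logtop].old;
    }
}
```

The reverse-order unwind matters: if the same variable is written twice in one event, replaying the log forward would restore the wrong value.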
Finally, reverse computation is realized by performing the inverse of the individ-
Figure 2.5: Straggler event arrives causing rollback.
Figure 2.6: Anti-message arrives annihilating an unprocessed event.
ual operations that are executed in the event computation. The system guarantees
that the inverse operations recreate the application’s state to the same value as
before the computation. Reverse computation exploits operations that modify the
state variables constructively. Constructive operations like ++, −−, and + =
require no history to undo and are easily reversed. However, destructive opera-
tions like a = b cannot be reversed. To solve this problem, the modeler can simply
swap the data in the event with the changed state. Section 2.3 has a more formal
discussion of reverse computation with respect to simulation.
The next step in undoing an event is cancelling the events that it created.
The events that are created are stored on an output queue and anti-messages are
sent for those events upon rollback. If the event has not been processed, the anti-
Figure 2.7: Anti-message arrives causing secondary rollback.
message will annihilate that event. However, if the event has been processed by
the destination processor the anti-message will cause that event to be rolled back.
This is known as a secondary rollback. Figure 2.5 shows LP 1 receiving a straggler
message and sending the appropriate anti-messages. In Figure 2.6, the anti-message
is received before the event is processed and the event is annihilated. However, in
Figure 2.7 the anti-message is received after the event is processed and causes a
secondary rollback.
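The two outcomes can be sketched in C. Matching events to anti-messages by a single integer id and using a singly linked input list are simplifications for illustration; real simulators match on more (for example, sender and time stamp).

```c
/* Illustrative input-queue entry: events are matched to anti-messages
 * here by an id alone. */
typedef struct ev {
    int id;
    int processed;          /* nonzero once the event has been executed */
    struct ev *next;
} ev;

/* Handle an arriving anti-message. Returns 0 if the matching event was
 * still unprocessed and is annihilated (unlinked), 1 if it was already
 * processed and a secondary rollback is required, and -1 if no match
 * was found (the positive message is still in transit). */
int recv_anti_message(ev **list, int id)
{
    for (ev **p = list; *p; p = &(*p)->next) {
        if ((*p)->id == id) {
            if (!(*p)->processed) {
                *p = (*p)->next;    /* annihilation */
                return 0;
            }
            return 1;               /* secondary rollback needed */
        }
    }
    return -1;
}
```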
Another form of cancellation is Lazy Cancellation. This technique avoids
cancelling messages that will be later recreated. Lazy Cancellation only sends anti-
messages when the original message will not be recreated by event reprocessing [33].
The local control mechanism ensures that events are processed in time stamp
order which maintains the causality constraint. However, the memory required for
keeping processed events and state history is tremendous if they are never reclaimed.
Thus there is a need for memory garbage collection. The global control mechanism
takes care of these issues by determining a minimum time stamp for any future
rollback. This minimum time stamp is referred to as Global Virtual Time (GVT).
Since GVT is the lowest time that the simulator can rollback to, all processed events
and I/O with a time less than GVT can be reclaimed and committed. The reclaiming
of memory used by processed events and state history is called fossil collection.
GVT requires the minimum time stamp over all unprocessed events or par-
Figure 2.8: Transient message problem.
Figure 2.9: Simultaneous message problem.
tially processed events. There are two problems in obtaining this value. The first problem comes from
the fact that a message might be in transit while the processors are reporting their
minimum time stamp. This is called the transient message problem and is illustrated
in Figure 2.8. One solution to this problem is to have the receiver acknowledge the
message. Under this scheme, the sender is responsible for reporting the message in
a GVT computation until it receives the acknowledgment. This ensures no transient
messages “fall between the cracks” [33]. However, if the receiver already performs
a GVT computation before the message arrives and the sender performs the GVT
computation after it is sent, both processors assume the message is accounted for by
the other. This problem is shown in Figure 2.9 and is referred to as the simultaneous
reporting problem.
Samadi’s algorithm fixes these problems by tagging acknowledgments sent
during the period between the reporting of a local minimum and the receipt of a
new GVT value [96]. These tagged acknowledgments identify the messages that
might “fall between the cracks” [33], so the sender knows to account for them.
Mattern’s
GVT algorithm does not require message acknowledgments and uses the distributed
system concept of consistent cuts to calculate GVT [65]. Fujimoto’s GVT algorithm
exploits shared memory and greatly simplifies GVT computation by generating a
cut through the setting of a global flag [34].
Once GVT is calculated the fossil collection can occur. This typically is done in
a batch mode. All processed events with a time stamp less than GVT are reclaimed
and placed into a free list from which new events are allocated.
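A sketch of this batch reclamation in C; the event record and function name are illustrative. The processed list is assumed sorted by time stamp, so reclamation stops at the first event at or beyond GVT.

```c
/* Illustrative processed-event queue entry. */
typedef struct pev {
    double ts;
    struct pev *next;
} pev;

/* Fossil collection: move every processed event with a time stamp
 * below GVT from the processed list (sorted by time stamp) onto a
 * free list for reuse; returns the number of events reclaimed. */
int fossil_collect(pev **processed, pev **freelist, double gvt)
{
    int n = 0;
    while (*processed && (*processed)->ts < gvt) {
        pev *e = *processed;
        *processed = e->next;
        e->next = *freelist;    /* future events are allocated from here */
        *freelist = e;
        n++;
    }
    return n;
}
```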
2.2.3 Comparison between Optimistic and Conservative Synchronization
The debate over which synchronization approach is superior has continued
for years. There is no overall winner: conservative synchronization has advantages
over optimistic and vice-versa. This subsection discusses those advantages and dis-
advantages.
Optimistic synchronization can fully exploit parallelism and is limited only by
the actual dependencies, whereas the conservative approach is limited by potential
dependencies. Designing and constructing models can be viewed as more difficult
under optimistic synchronization. The modeler has to be concerned with state saving or the
reverse computation code. A solution to this problem is to employ code translation
and generation techniques such as those used by Perumalla [80, 81]. However,
dynamic memory usage has been viewed as difficult and is often avoided.
With the implementation of the reverse memory subsystem, these problems are
eased.
Conservative models do not have to handle inconsistencies due to lagging roll-
backs and stale state that can occur in optimistic synchronization [73]. However,
conservative models must be designed with explicit parallelism. With optimistic syn-
chronization, parallelism is automatically exploited. In addition, conservative syn-
chronization has a limited model space due to lack of lookahead [58]. Conservative
models with good lookahead have an advantage over optimistic ones because they do
not have to perform rollbacks and they have a smaller memory footprint. However,
conservative synchronization has difficulties in dealing with dynamic topologies.
There is clearly no distinct winner. The best synchronization approach is
mainly model dependent. However, optimistic protocols might have the upper-hand
considering their ability to handle models with little lookaheads. Some of the more
interesting Internet models have small lookaheads such as ad-hoc wireless networks,
networks with low delays and dynamically changing topologies (i.e., link and route
changes) [2, 79, 103].
2.3 Reverse Computation
In optimistic simulation systems [51], the most common technique for realizing
rollback is state saving. In this technique, the original value of the state is saved
before it is modified by the event computation. Upon rollback, the state is restored
by copying back the saved value. An alternative technique for realizing rollback is
reverse computation [15, 16, 80]. In this technique, rollback is realized by performing
the inverse of the individual operations that are executed in the event computation.
The system guarantees that the inverse operations recreate the application’s state
to the same value as before the computation.
The key property that reverse computation exploits is that a majority of the
operations that modify the state variables are “constructive” in nature. That is,
the undo operation for such operations requires no history. Only the most current
values of the variables are required to undo the operation. For example, operators
such as ++, −−, + =, − =, ∗ = and / = belong to this category. Note that
the ∗ = and / = operators require special treatment in the case of multiplying or
dividing by zero, and overflow/underflow conditions. More complex operations such
as circular shift (swap being a special case), and certain classes of random number
generation also belong here.
Operations of the form a = b, modulo and bitwise computations that result in
the loss of data are termed destructive. Typically these operations can only
be undone using conventional state saving techniques. Table 2.1 shows the rules
T0  simple choice
    Original:      if() s1; else s2;
    Instrumented:  if() {s1; b=1;} else {s2; b=0;}
    Reverse:       if(b==1) {inv(s1);} else {inv(s2);}
    Bits: self 1; child x1, x2; total 1 + max(x1, x2)

T1  compound choice (n-way)
    Original:      if() s1; elsif() s2; elsif() s3; ... else sn;
    Instrumented:  if() {s1; b=1;} elsif() {s2; b=2;} elsif() {s3; b=3;} ... else {sn; b=n;}
    Reverse:       if(b==1) {inv(s1);} elsif(b==2) {inv(s2);} elsif(b==3) {inv(s3);} ... else {inv(sn);}
    Bits: self lg(n); child x1, x2, ..., xn; total lg(n) + max(x1, ..., xn)

T2  fixed iterations (n)
    Original:      for(n) s;
    Instrumented:  for(n) s;
    Reverse:       for(n) inv(s);
    Bits: self 0; child x; total n * x

T3  variable iterations (maximum n)
    Original:      while() s;
    Instrumented:  b=0; while() {s; b++;}
    Reverse:       for(b) inv(s);
    Bits: self lg(n); child x; total lg(n) + n * x

T4  function call
    Original:      foo();
    Instrumented:  foo();
    Reverse:       inv(foo)();
    Bits: self 0; child x; total x

T5  constructive assignment
    Original:      v @= w;
    Instrumented:  v @= w;
    Reverse:       v =@ w;
    Bits: self 0; child 0; total 0

T6  k-byte destructive assignment
    Original:      v = w;
    Instrumented:  {b = v; v = w;}
    Reverse:       v = b;
    Bits: self 8k; child 0; total 8k

T7  sequence
    Original:      s1; s2; ... sn;
    Instrumented:  s1; s2; ... sn;
    Reverse:       inv(sn); ... inv(s2); inv(s1);
    Bits: self 0; child x1 + ... + xn; total x1 + ... + xn

T8  jump (label lbl as target of n goto's)
    Original:      goto lbl; ... s1; goto lbl; ... sn; lbl: s;
    Instrumented:  b=1; goto lbl; ... s1; b=n; goto lbl; ... sn; b=0; lbl: s;
    Reverse:       inv(s); switch(b) {case 1: goto label1; ... case n: goto labeln;} inv(sn); labeln: ... inv(s1); label1:
    Bits: self lg(n+1); child 0; total lg(n+1)

T9  nestings of T0-T8
    Apply the above rules recursively.

Table 2.1: Summary of treatment of various statement types.
Generation rules and upper bounds on state size requirements for supporting reverse com-
putation. s, or s1..sn, are any of the statements of types T0..T7; inv(s) is the corresponding
reverse code of the statement s; b is the corresponding state saved bits “belonging” to
the given statement. The operator =@ is the inverse operator of a constructive operator
@=, (e.g., + = for − =) [16].
that can be recursively applied to forward computation code that will generate the
reverse code. The significant parts of these rules are their state bit size requirements,
and the reuse of the state bits for mutually exclusive code segments. We explain
each of the rules in detail next.
• T0: The if statement can be reversed by keeping track of which branch is
executed in the forward computation. This is done using a single bit variable,
which is set to 1 or 0 depending on whether the predicate evaluated to true
or false in the forward computation. The reverse code can then use the value
of the bit to decide whether to reverse the if part or the else part when trying
to reverse the if statement.
Since the bodies of the if part and the else part are executed mutually exclusively,
the state bits used for one part can also be used for the other part. Thus, the
state bit size required for the if statement is one plus the larger of the state
bit sizes, x1, of the if part and x2 of the else part, i.e., 1 + max(x1, x2).
• T1: Similar to the simple if statement (T0), an n-way if statement can be
handled using a variable b of size lg(n) bits. The state size of the entire if
statement is lg(n) for b, plus the largest of the state bit sizes, x1 . . . xn, of the
component bodies, i.e., lg(n) + max(x1 . . . xn) (since the component bodies
are all mutually exclusive).
• T2: Consider an n-iteration loop, such as a for statement, whose body requires
x state bits for reversibility. Then n instances of the x bits can be used to keep track
of the n instances of the body, giving a total of n ∗ x bit requirement for the
loop statement. The inverse of the body is invoked n times in order to reverse
the loop.
• T3: Consider a loop with a variable number of iterations, such as a while state-
ment. This statement can be treated the same as a fixed iteration loop, but
the actual number of iterations executed can be noted at runtime in a variable
b. The state bits for the body can be allocated based on an upper bound n on
the number of iterations. Thus, the total state size added for the statement is
lg(n) + n ∗ x.
• T4: For a function call there is no instrumentation added. For reversing it,
the inverse is just invoked. The inverse is easily generated using rule T7,
which is described later. The state bit size is the same as for T7.
• T5: Constructive assignments, such as ++, −−, + =, − =, and so on, do not
need any instrumentation. The reverse code is the inverse operator, such as
−−, ++, − = and + = respectively. These constructive assignments do not
require any state bits for reversibility.
• T6: A destructive assignment, such as =, % = and so on, can be instrumented
by saving its left hand side into a variable b before the assignment takes place.
The size of b is k bytes for an assignment to a k-byte left hand side variable
(lvalue).
• T7: For a sequence of statements, each statement is instrumented depending
on its type, using the previous rules. For the reverse code, the sequence of
statements is reversed, and each statement is replaced by its inverse. The
inverses are generated by applying the corresponding rules from the preceding
list. The state bit size for the entire sequence is the sum of the bit sizes of
each statement.
• T8: Jump instructions (such as goto, break and continue) require more com-
plex treatment, especially with inter-dependent jumps. The rules here are for
a simple example where no goto label in the model is reached more than once
during an event computation. Such uses of jump instructions occur to jump
out of a deeply nested if statement, or as convenient error handling code at the
end of a function. The reverse is as follows: for every label that is a target of
one or more goto statements, its goto statements are indexed. The instrumen-
tation of the forward code is to record the index of a goto statement whenever
that statement is executed. In the reverse code, each of the goto statements
is replaced by a goto label. The original goto label is replaced with a switch
statement that uses the recorded indexes in forward computation to jump back
to the corresponding new (reverse) goto label. The bit size requirement of this
24
Forward
1. temp = SV->b;
2. SV->b = M->b + 5;
3. M->b = temp;
Reverse
1: temp = M->b;
2. M->ack = SV->b - 5;
3. SV->b = temp;
Figure 2.10: LP state to message data swap example.
scheme is lg(n) where n is the number of goto statements that are the sources
of that single target label.
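A sketch of the T8 scheme, with a single label targeted by two goto statements (all names and the example computation are illustrative):

```c
#include <assert.h>

static int t8_index; /* lg(n) bits: which of the n gotos was taken */

/* Forward code: two goto statements share the target label `done`. */
int t8_forward(int x)
{
    if (x < 0)   { t8_index = 1; goto done; }
    if (x > 100) { t8_index = 2; goto done; }
    x = x * 2;
    t8_index = 0; /* fell through; no goto taken */
done:
    return x;
}

/* Reverse code: the original label becomes a switch on the recorded
 * index, jumping to a reverse label for each forward goto. */
int t8_reverse(int x)
{
    switch (t8_index) {
    case 1: goto r1;
    case 2: goto r2;
    }
    x = x / 2; /* inverse of x = x * 2 */
r2:
r1:
    return x;
}
```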
• T9: Any legal nesting of the previous types of statements can be treated by
recursively applying the generation rules [16].
Sections 3.3.2 and 4.3.3 show the application of these rules to pseudocode
for the CAVES and TCP models.
The rules in Table 2.1 merely give upper bounds on the bit requirements of
particular programmatic constructs. We can, however, break the rules without loss
of correctness or model accuracy and achieve greater efficiency. First, let us consider
a simple optimization involving the swap operation. We observe that many destructive
operations are a consequence of the arrival of data contained within the event being
processed; for example, a message changes the state of a model. With conventional
state saving, we would need an additional variable to save the old variable's value.
Instead, one can simply swap the data in the event with the changed state of the
logical process (LP). This swap is shown in Figure 2.10. In that figure, and throughout
this chapter, the state is represented by the variable SV and the message by the
variable M. When the event is rolled back, the event data and LP data are simply
re-swapped.
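The swap of Figure 2.10 can be sketched as a pair of C event handlers. The structure and function names below are illustrative, not taken from any particular simulator API:

```c
#include <assert.h>

typedef struct { int b; } lp_state; /* SV */
typedef struct { int b; } message;  /* M  */

/* Forward: install the message data into the LP state, parking the
 * old state value in the message instead of a state-saving buffer. */
void swap_forward(lp_state *sv, message *m)
{
    int temp = sv->b;
    sv->b = m->b + 5;
    m->b = temp;
}

/* Reverse: recover the original message data arithmetically, then
 * re-swap; no extra state bits are consumed. */
void swap_reverse(lp_state *sv, message *m)
{
    int temp = m->b;  /* the old SV->b parked in the message */
    m->b = sv->b - 5; /* recover the original M->b */
    sv->b = temp;
}
```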
The above solution works well as long as there is a one-to-one mapping between
the message data and the amount of LP state being modified. However, as we observed
in our CAVES model, this is not always the case: a message may cause the deletion
or destruction of data that is too large to be swapped into the message. An example
of this, and a possible solution, is shown in Section 3.3.3.
Kalyan Perumalla and Richard Fujimoto have implemented a reverse C compiler
called RCC, which takes C code and creates reversible code [81]. RCC does not
support swaps and has other limitations; however, it is a good start toward easing
the generation of reverse code. To generate code that performs close to hand-written
reverse code, the user code must be riddled with pragmas. RCC gives users the
ability to define their own reverse functions; however, those functions must have
the same number of parameters as the forward functions and a return type of void.
This is not always the correct format for reverse functions. For example, list
operations such as push and pop are inverses, yet they have different parameters
and return types and therefore could not be used as reverse functions for each other.
In addition, there is no functionality for handling dynamic memory operations,
which models often require. An excellent addition to RCC would be the integration
of our reverse memory subsystem.
2.4 Other Applications of Reverse Computation and Opti-
mistic Execution
There has been a significant amount of other research in the field of reverse
computation, in areas such as debugging, database transactions, reversible
architectures, and error detection and recovery. Each of these areas differs from
ours but provides insight into reverse computation.
There has been much research on debugging applications using reverse computation.
The main motivation is to give the user a more intuitive way of debugging. The
common approach is to put a breakpoint where the error might have occurred and
then re-execute the program until the breakpoint is reached. If the error happens
before the breakpoint, the user must re-execute the program again with an earlier
breakpoint. If the program is quite lengthy, this process can take a significant
amount of time; it has been observed that programmers sometimes spend up to 50%
of their time debugging [4]. With reverse computation, upon encountering a bug the
user can backtrack in the program without re-executing it, thereby saving time and
making debugging more intuitive.
PROVIDE [69] uses a process history database to store process state changes. This
process history can take a large amount of memory and is often viewed as unrealistic
for certain applications. IGOR [30] uses checkpointing and an interpreter that
executes forward until the specific program point is reached; this is similar to the
infrequent state saving used in optimistic simulation, discussed in [9, 56, 57]. The
interpreter's forward execution can be around 100 times slower, so the checkpoint
interval is an important parameter. EXDAMS was an interactive debugger with a
replay facility; it can only replay a program's execution, and changes cannot be
made during the course of the playback [6]. Agrawal, DeMillo and Spafford [3] use
structured backtracking to reduce the memory required. With this method, state
is only saved at the beginning and at the end of the structures. The user cannot
reverse-compute to the middle of a loop; they must go to the beginning of the loop
and step through it until they arrive at the middle.
Database transaction processing can use reverse computation, or compensating
operations, for concurrency control. Two transactions can both access the same
abstract item as long as the transactions' operations are backward-commutable.
With this property, the compensating operation can be performed and the transaction
schedule will appear as if the aborted transaction never happened; such a schedule
is said to be reducible and hence recoverable. An example of two compensating
operations is a withdrawal and a deposit. Note that compensating operations are
applied in the reverse of the order in which they first appeared [55], similar to
how inverse operations are applied in our models.
Reversible computing has also been an interesting area of research. Within this
field of study, reversible logic and energy-efficient computation are investigated.
It is known that when changing from one state to another in irreversible computing,
heat is dissipated. This entropy "burns up your lap, runs up your electric bill, and
limits your computer's performance" [38]. A way to overcome this limit is reversible
computing, which uncomputes bits rather than overwriting them. Uncomputing allows
energy to be recovered and recycled for later use. "Unfortunately, present-day
oscillator technologies do not yet provide high enough quality factors to allow
reversible computing to be practical today for general-purpose digital logic, given
its overheads." [38]
Pendulum is an implementation of a reversible architecture developed by a group at
MIT [105, 106]. R is a reversible programming language developed for the Pendulum
architecture; it provides the ability to reverse function calls using the rcall
method [36]. The language is in its early stages and does not support destructive
operations.
Another reversible language is Janus which was constructed for the DEC
SYSTEM-20. This language provides a similar reverse method named UNCALL.
Janus is considered to be a throw-away piece of code [63].
Reverse computation can also be used to detect errors within computations. The
program executes forward until completion and then executes in reverse. If the
final state of the reverse execution differs from the start state, an error occurred
and the result is untrustworthy. Reverse computation can also allow recovery from
malicious attacks: the system can be reversed to a state before the attack [37].
The idea of optimistic processing for database concurrency control has been
researched [42, 52, 53, 55]. Concurrency control prevents conflicts among transac-
tions such that their serializability can be guaranteed. “A system of concurrent
transactions is said to be serializable or has the property of serial equivalence if
there exists at least one serial schedule of execution leading to the same results for
every transaction and to the same final state of the database.” [42]
The optimistic concurrency control discussed in [53] has three phases. The first
is the read phase, where only read operations are performed on the database; writes
are performed on local copies of the data and are subject to validation. The second
phase is the validation phase, where conflicts are discovered. Upon discovery of a
conflict, the transaction is aborted and restarted. With a successful validation,
the local copies are made global in the write phase.
With optimistic concurrency control, deadlock is not possible. However, there is
a chance of starvation; when a starving process is observed, it can be guaranteed
execution by restarting the process without releasing the critical-section
semaphore. This is equivalent to write-locking the entire database [53].
Due to its storage overheads, optimistic concurrency control restricts itself
to rather short write transactions [42]. However, for query-dominant transaction
systems, optimistic concurrency control appears ideal because the validation is
often trivial and parallelism can be exploited.
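The validation phase can be sketched as a backward check of the validating transaction's read set against the write sets of transactions that committed during its read phase. This is a simplification of the algorithm in [53], with item identifiers reduced to integers:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the validating transaction's read set is disjoint
 * from the write set of a transaction that committed during its
 * read phase; 0 indicates a conflict, forcing abort and restart. */
int occ_validate(const int *read_set, size_t nreads,
                 const int *committed_writes, size_t nwrites)
{
    for (size_t i = 0; i < nreads; i++)
        for (size_t j = 0; j < nwrites; j++)
            if (read_set[i] == committed_writes[j])
                return 0; /* item we read was overwritten */
    return 1; /* validation succeeds; write phase may proceed */
}
```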
Another form of optimistic concurrency control has been proposed using the Time
Warp mechanism [52], the mechanism we use in our simulation system. Within this
type of concurrency control, transactions and data are represented as objects or
LPs, and communication is facilitated through message passing. Each transaction is
given a unique time stamp, and accesses to the data objects must be performed in
time-stamp order. When out-of-order processing is observed, a rollback is performed
and forward execution continues. With this method, errors cause only partial
reprocessing of transactions instead of complete reprocessing.
Lock-based concurrency control is the more common type. It is similar to
conservative simulation in that it avoids all possible conflicts, but it sacrifices
some degree of parallelism.
2.5 Chapter Summary
In this chapter we gave an introduction to different types of simulation. We
focused on parallel discrete-event simulation and discussed the two synchronization
approaches, conservative and optimistic. Conservative synchronization does not need
to handle causality errors and can make models simpler to develop, but it is limited
by the lookahead that can be exploited from the model, whereas optimistic
synchronization exploits all available parallelism. In addition, we discussed
reverse computation with respect to simulation. We presented rules for generating
reverse code for models, and discussed the significance of applying these rules
cautiously. We concluded the chapter with a brief summary of other applications of
reverse computation.
CHAPTER 3
Configurable Application View Storage System:
CAVES
The CAVES model is a hierarchy of view storage servers. The term view refers to
the output or result of a query made on the part of an application that is executing
on a client machine. These queries can be arbitrarily complex and formulated using
SQL. The goal of this system is to reduce the turnaround time of queries by
exploiting locality both at the local-disk level and between clients and servers,
before making a request to the highest-level database server. One interesting
implementation problem is how to simulate the caches without using dynamic memory
allocation.
This model has been designed for execution with an optimistic parallel simu-
lation engine. As previously discussed, one of the primary drawbacks of this parallel
synchronization mechanism has been high overheads due to state saving. We ad-
dressed this problem by implementing the models using reverse computation when
possible.
The chapter begins with a discussion of Rensselaer's Optimistic Simulation
System's (ROSS') data structures; a CAVES model overview is then presented in
Section 3.2, followed by the event flow diagrams and a discussion in Section 3.2.4.
The implementation of CAVES atop the ROSS engine is presented in Section 3.2.5, and
the reverse computation code for the CAVES model is discussed in Section 3.3.2. The
chapter concludes with a performance study of the model.
3.1 ROSS’ Data Structures
ROSS' data structures are organized in a bottom-up hierarchy. Here, the core
data structure is the tw_event. Inside every tw_event is a pointer to its source
and destination LP structure, tw_lp. Observe that a pointer, and not an index, is
used. Thus, during the processing of an event, only the following accesses are
required to reach its source and destination LP data:

my_source_lp = event->src_lp;
my_destination_lp = event->dest_lp;

Additionally, inside every tw_lp is a pointer to the owning processor structure,
tw_pe. So, to access processor-specific data from an event, the following operation
is performed:

my_owning_processor = event->dest_lp->pe;

This bottom-up approach reduces access overheads and may improve locality
and processor cache performance [12, 13].
ROSS also uses a memory-based approach to throttle execution and safeguard
against over-optimism. Each processor allocates a single free-list of memory buffers.
When a processor’s free-list is empty, the currently processed event is aborted and
a GVT calculation is immediately initiated. Unlike the Georgia Tech Time Warp
(GTW) simulation system [34], ROSS fossil collects buffers from each LP’s processed
event-list after each GVT computation and places those buffers back in the owning
processor’s free-list.
A Kernel Process (KP) is a shared data structure among a collection of LPs that
manages the processed event-list for those LPs as a single, continuous list. The
net effect of this approach is that the tw_scheduler function executes forward on
an LP-by-LP basis, but rollbacks and, more importantly, fossil collection occur on
a KP-by-KP basis. Because KPs are much fewer in number than LPs, fossil collection
overheads are dramatically reduced.

The consequence of this design modification is that all rollback and fossil
collection functionality is shifted from LPs to KPs. To effect this change, a new
data structure was created, called tw_kp. This data structure contains the
following items: (i) an identification field, (ii) a pointer to the owning
processor structure, tw_pe, (iii) head and tail pointers to the shared processed
event-list, and (iv) KP-specific rollback and event-processing statistics.
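A sketch of such a structure is given below; the field names are hypothetical and do not reproduce the actual ROSS declarations:

```c
#include <assert.h>

struct tw_event; /* forward declarations; definitions elsewhere */
struct tw_pe;

/* Hypothetical sketch of the KP structure described above. */
struct tw_kp {
    unsigned long    id;          /* (i)  identification field         */
    struct tw_pe    *pe;          /* (ii) owning processor structure   */
    struct tw_event *pevent_head; /* (iii) shared processed event-list */
    struct tw_event *pevent_tail;
    long             s_rollbacks; /* (iv) rollback statistics          */
    long             s_events;    /*      event-processing statistics  */
};
```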
When an event is processed, it is threaded into the processed event-list for a
shared KP. Because the LPs for any one KP are all mapped to the same processor,
mutual exclusion to a KP’s data can be guaranteed without locks or semaphores.
In addition to decreasing fossil collection overheads, this approach reduces memory
utilization by sharing the above data items across a group of LPs. For a large con-
figuration of LPs (i.e., millions), this reduction in memory can be quite significant.
Experimental analysis suggests a typical KP will service between 16 and 256 LPs,
depending on the number of LPs in the system. Mapping of LPs to KPs is
accomplished by creating sub-partitions within the collection of LPs mapped to a
particular processor.
While this approach appears to have a number of advantages over either “on-
the-fly” fossil collection [34] or standard LP-based fossil collection, a potential draw-
back with this approach is that “false rollbacks” would degrade performance. A
“false rollback” occurs when an LP or group of LPs is “falsely” rolled back because
another LP that shares the same KP is being rolled back. This phenomenon was
not observed in the PCS model [12].
3.2 CAVES Model
3.2.1 CAVES Model Overview
CAVES is a configurable application view storage system. In practice, this system
is designed to work as middleware, connecting many possible distributed servers to
many possible distributed clients. The main purpose of any storage system is to
reduce the overall turnaround time between an application and a data provider. To
achieve this, storage systems exploit the possible locality between different data
requests and try to optimize overall performance by keeping data that is most
likely to be re-requested in the near future in a fast storage medium. In this
storage system, we consider the local disk of a storage management system a fast
medium compared to networked and possibly distant servers that process complex
database queries. We use the term view to refer to the output of a query in any
application.
Examples of applications that involve costly queries include data warehousing
and data mining applications that process complex queries over large data sets,
applications that use spatial aggregations and correlations over complex vector data,
biological databases that process highly complex sequence comparison algorithms,
and large data sets obtained from scientific experiments. In such applications, the
time to process and transmit a single query may be rather high even in the presence
of indices and the size of the output may be very large. The cost of producing
such a data set might be much larger than the cost of writing and reading back the
data set from the local disk. In regular storage replacement algorithms, the goal
of the system is to optimize the total number of hits, which assumes that the cost
of producing each item in the storage is uniform. However, this is not the case for
complex queries transmitted over a network, where the goal is instead to optimize
the overall savings in time. As a result, the replacement policies for an
application may vary greatly based on its workload.
3.2.2 CAVES Server
A CAVES Server comprises a view storage and corresponding statistics.
The view storage is sorted by the priorities of the views. The view with the lowest
priority is at the head of the storage and will be the first to be removed when space
is needed to insert a new view. A view’s priority is calculated by its properties
and other external information, such as network speed and disk speed. The view
priorities are a very important part of the real CAVES system because they affect
which views are stored, which impacts the overall savings that the CAVES system
can offer.
The CAVES server receives requests for views. If the view is in its storage,
the CAVES server will return the view. Otherwise it will request the view from
elsewhere. The CAVES server also receives returns of views. A returned view is
inserted into the storage if the entry criteria are met.
Figure 3.1: The topology of the model.
3.2.3 CAVES Hierarchy
The CAVES model is built around the idea of a CAVES hierarchy. The hierarchy
offers a larger global storage of views than could be achieved with a single CAVES
server. This larger global storage allows increased time savings for the users of
the real system.
The CAVES Hierarchy, as shown in Figure 3.1, is comprised of three types of
servers: client storage, middle storage and database. The client and middle storage
servers are CAVES Servers. The database servers contain data needed to create
views. For views that the database servers cannot create, they have indexes which
point to other database servers that can create the views.
In the hierarchy, all the database servers are connected together, and each
database server is connected to a subnet of middle storage servers. The middle
storage servers in a subnet are all connected, and each middle storage server has
a subnet of client storage servers. Each client storage server is connected to its
neighbors on the subnet. These connections allow requests to propagate through the
system.
When a client storage server requests a view from its neighbor the request is
propagated through the neighbors on the subnet in a ring pattern. The ring pattern
was chosen because of its small memory requirements. If the neighbors do not have
the view, a request is sent up to the client’s middle storage server. The middle
storage servers act similar to the clients. However, one difference is if its neighbors
Figure 3.2: Flow chart for request arrival and neighbor request.
do not have the view, the middle storage server sends a request to its database server.
The database server will either be able to create the requested view and return it,
or will forward the view request to the correct database. The database architecture
is similar to a peering architecture. The architecture was chosen because it requires
less memory at the lower levels of the hierarchy.
3.2.4 CAVES Flow and Statistics
The client storage server is attached to the client applications and processes
requests from those applications as shown in Figure 3.2. When a request occurs, the
client storage server increments its Requests variable. Then it searches its storage
for the view. If the view is found, the view’s priority and position in the storage
Figure 3.3: Flow chart for neighbor response and database request.
get updated. The storage is sorted from the lowest to the highest priority. When
the priority of the view changes the view needs to be moved to the right location in
the storage. Also the client increments the Hits variable and updates the Time and
TimeWithOut variables. The Time variable is the total time it takes for a request
to get answered with the storage system. The TimeWithOut is the time it would
have taken without the storage system.
If the view is not in the storage, the client sends a request for the view to its
neighbor shown in Figure 3.2. The neighbor searches its storage and if the view
is in its storage, it sends a response back and increments the NclientHits variable.
The NclientHits variable is used to show how many neighbor hits occur during the
Figure 3.4: Flow chart for client response.
course of a simulation run. The client that receives the response updates the Time
and TimeWithOut variables (Figure 3.3) and then it frees up room in its storage if
needed. Once there is enough free space the view is inserted (Figure 3.5). When
the client frees up room in its storage it removes views and sends those views to the
middle server storage (Figure 3.5). The middle server checks if the view is in its
storage and if so, it increments the Already variable. If not then the view is inserted
into its storage.
If the neighbor does not have the view, it sends a request to its neighbor. The
request continues to go from neighbor to neighbor until a response gets sent back or
the request propagates back to the original initiator of the request. If the request
gets back to the initiating client, the client then sends a request up to the middle
server.
The middle server receives the request from the client storage server (Fig-
ure 3.2). The middle server increments its Request variable and searches its storage
for the requested view. If it has the view, it will respond to the client and update
Figure 3.5: Flow chart for add view.
the view’s priority and position in the storage. The middle server also updates its
hits variable.
If the middle server does not have the view, it will request the view from
its neighbor. If the neighbor has the view, it will respond and it will update the
NmidHits variables. The middle server that receives the response will then send the
view back to the initiating client. If the neighbor does not have the view, it will
send a request to its neighbor. The requests from neighbor to neighbor will continue
like they did for the client. If the request gets back around to the initiating Middle
Server, it will send a request up to the Database Server.
The database server receives a request from the middle server and updates
its Request variable (Figure 3.3). It then searches for the view in its data set. If
the search finds the information needed to create the view, it responds to the client
with the view and updates the Hits variable. If the search fails, the database server
requests the view from the database server with the information needed for the
view. That database server updates its Request variable and its Hits variable. It
then responds to the client with the view.
When the client gets the response it updates the Time and TimeWithOut
variables (Figure 3.4) and inserts the view into the storage if the view is not currently
stored (Figure 3.5).
3.2.5 CAVES Implementation
A natural way to map CAVES is to make each storage server, both clients and
middles, along with each database server, into an LP. All requests and responses
are mapped to time-stamped events that are passed from LP to LP. Realizing each
storage server as a single LP allows caching and storage management to happen
sequentially. This leads to some challenges for reverse computation because the
computational complexity of these events is high.
Each database server and all the storage servers under it are mapped to a single
KP, for performance reasons. With all the neighbor requests from both the client
storage servers and the middle storage servers, system performance was much better
when these requests were kept within one KP. We experimented with mapping KPs to
the middle storage servers; the results show that performance was poor, because
the middle storage servers were sending remote neighbor requests, which led to a
greater number of rollbacks on average.
When the simulation starts, a global pool of views is created. Each view is
given an id, a size, a computational complexity, and a filter factor, based on a
normal distribution. In addition, a neighbor and server table is created. These
tables are used by an LP to calculate where to schedule events. In the storage
server's initialization, it is allocated a lookup table for views in the storage,
along with RAM and disk storages. Each storage consists of a size-available
variable and two linked lists. The nodes of the linked lists are storage blocks
that contain a view number, a Hits variable and a priority. There is one storage
block for each view in the storage, and a storage block can hold a view of any
size. The sum of the view sizes in the used storage-block linked list cannot be
larger than the total storage size. All the allocated storage blocks are initially
placed on the free list and are later moved to the used list upon a view's entrance
into the storage. A database is initialized with a range of views that it can
create.
Upon the client storage servers' initialization, a given number of view requests
are generated uniformly from the pool of views. These view-request events are
scheduled into the future and are processed by the client storage server's request
method. In this method, the requested view is looked up in the table to see if it
is in either storage (RAM or disk). If so, the storage block's Hits variable is
incremented and its priority is recalculated. The view is then removed from the
used list and reinserted based on its new priority; the used linked list is kept
with the lowest priority at the head. If it is a miss, a request is issued to the
client's neighbor. At the end of this method, a new view request is self-generated
and scheduled into the future for the current client LP.
When the response is received, time is increased by the amount of time it took
to receive the response. At each storage server and database that processed the
request, time is incremented. The time will be at least the view size divided by the
network speed plus the view size divided by the disk speed.
Then the view in the response is put into the storage. A write-through policy is
implemented, so the view is inserted into both storages. Before the view is put
into the storage, the size-available parameter is checked to see if the view fits.
If so, a storage block is removed from the free list, its view number is set, its
priority is calculated, and the view is inserted into the storage. If there is not
enough room, views are removed from the storage's used list and put on the free
list until there is enough free space; each view that is removed generates an Add
event that is sent to the middle server. A storage block is then removed from the
free list, the view number is set, the priority is calculated, and the view is
inserted into the storage.
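The priority-ordered used list can be sketched as follows. The storage-block structure is simplified (the real model also tracks a Hits counter and separate RAM and disk storages):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified storage block; illustrative field names. */
struct block {
    int           view;
    double        priority;
    struct block *next;
};

/* Insert a block into the used list, which is kept sorted with the
 * lowest priority at the head so eviction can pop from the front. */
struct block *used_insert(struct block *head, struct block *blk)
{
    if (head == NULL || blk->priority < head->priority) {
        blk->next = head;
        return blk;
    }
    struct block *cur = head;
    while (cur->next != NULL && cur->next->priority <= blk->priority)
        cur = cur->next;
    blk->next = cur->next;
    cur->next = blk;
    return head;
}
```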
The middle storage server acts just like the client storage server in the way
its storage is configured and managed. However, the middle storage server never
initiates a view request or inserts a neighbor’s response in its storage. The middle
storage only adds views that are sent up from the clients.
Upon receiving a request, the database looks to see if it can create the view
and either sends a response back to the client or redirects the request to the correct
database server. The database server uses the view size, filter factor, and compu-
tational complexity, along with the disk speed and network speed to calculate the
time it will take to compute and transfer the view. A response event is scheduled
for that time into the future to the requesting client.
3.3 Reverse Computation
3.3.1 Methodology for Reverse Computation
As noted in Michael Frank's research in reversible computing [37, 38], one should
try to reduce the amount of entropy, or information loss, thereby reducing energy
loss (heat). This heat, in a virtual sense, leads to lower performance, assuming
that entropy-free computation does not dramatically increase the complexity of the
system.
In this subsection, we take this notion of entropy from the hardware viewpoint
and apply it to the modeling design process. One question is: by making a model
entropy-free (reverse computable), or as close to it as possible, how much
additional overhead do we introduce? The methodology needs to strike a balance
between the time and space saved by reduced state saving and the increased expense
of additional computation. It has been shown in [16] that a reverse computation
model had fewer L2 cache misses, due to the reduction of entropy, than its
corresponding state-saving model.
The first step in our methodology is to perform a model analysis. In this step
the modeler should identify what functionalities really need to be simulated. The
modeler then decides which data structures are required. The event flow should be
studied and the event structure should be formed.
In the second step, an analysis of entropy minimization should be performed
on the functions and data within the model. The designer should analyze the chosen
data structures and functions and determine whether any are, or could be made, entropy free.
Certain functions are perfectly reversible, and their inverses can be easily developed.
Next, the costs of state saving parts of the model should be taken into consideration.
If these costs and other issues are too high, the modeler should either go back to
the first step and redesign the implementation, or conclude that the model is not suited for
reverse computation. This step can prove quite time consuming, and often the
modeler is not able to perform such an analysis. In that case the modeler can make
use of our reverse memory subsystem for quicker development. The reverse memory
subsystem might also allow the modeler to gain an idea of the actual requirements
of the model when there are uncertainties.
The third step deals with data compression and reuse. Here, we first look at
the message for any unused space or variables that are no longer in use. For example, in
a networking model, once a packet arrives at its destination the destination variable
is no longer needed and thus can be reused. We should then look for values in the
message and in the model that will be used in the state of the model and can be
recovered. This tells the modeler what space in the message can be
used for destructive operations and what state can be recreated.
Finally, iterate through the functions of the model and generate the reverse
code using the rules described in Section 2.3 and shown in Table 3.1. While generating
the reverse code, apply the variable dependencies and perfectly reversible
functions found in steps two and three. In addition, use the available message space,
identified in step three, for destructive operations.
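The steps above can be illustrated on a small self-contained example. The sketch below is not CAVES code; the state, message, and handler names are invented. It applies rule T5 to an increment, T0 to a branch recorded in a message bit field, and T6 to a destructive assignment whose old value is swapped into otherwise-dead message space, as step three suggests.

```c
#include <assert.h>

/* Illustrative application of rules T0, T5 and T6 from Table 3.1.
 * All names are invented for this example. */
typedef struct { int hits; double level; } state_t;
typedef struct {
    unsigned c1 : 1;   /* T0: records which branch was taken          */
    double   spare;    /* reused message space (step three)           */
} msg_t;

void forward(state_t *s, msg_t *m, double w)
{
    s->hits++;                   /* T5: constructive, needs no saved bits */
    if (w > s->level) {          /* T0: record the branch taken           */
        m->c1 = 1;
        m->spare = s->level;     /* T6: swap old value into the message   */
        s->level = w;
    } else {
        m->c1 = 0;
    }
}

void reverse(state_t *s, msg_t *m)
{
    if (m->c1)                   /* T0: replay the recorded branch        */
        s->level = m->spare;     /* T6: restore from the message          */
    s->hits--;                   /* T5: invert the increment              */
}
```

Note that, per rule T7, the reverse handler undoes the statements in the opposite order from the forward handler.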
The next subsection shows the application of the rules and explains where
the rules were deliberately broken to minimize entropy.
3.3.2 CAVES Reverse Code
In this subsection we show the application of the rules in Table 3.1 on pseudocode
for the CAVES model. The table is reproduced here for ease of reference.
In the CAVES model, a cache server receives a request. Figure 3.6 shows the
forward and reverse code for this process. The first operation that is performed is
Type, description, application code (original / instrumented / reverse), and bit requirements (self / child / total):

T0: simple choice
    Original:     if() s1; else s2;
    Instrumented: if() {s1; b=1;} else {s2; b=0;}
    Reverse:      if(b==1) {inv(s1);} else {inv(s2);}
    Bits: 1 / x1, x2 / 1 + max(x1, x2)

T1: compound choice (n-way)
    Original:     if() s1; elsif() s2; elsif() s3; ... else sn;
    Instrumented: if() {s1; b=1;} elsif() {s2; b=2;} elsif() {s3; b=3;} ... else {sn; b=n;}
    Reverse:      if(b==1) {inv(s1);} elsif(b==2) {inv(s2);} elsif(b==3) {inv(s3);} ... else {inv(sn);}
    Bits: lg(n) / x1, x2, ..., xn / lg(n) + max(x1, ..., xn)

T2: fixed iterations (n)
    Original:     for(n) s;
    Instrumented: for(n) s;
    Reverse:      for(n) inv(s);
    Bits: 0 / x / n * x

T3: variable iterations (maximum n)
    Original:     while() s;
    Instrumented: b=0; while() {s; b++;}
    Reverse:      for(b) inv(s);
    Bits: lg(n) / x / lg(n) + n * x

T4: function call
    Original:     foo();
    Instrumented: foo();
    Reverse:      inv(foo)();
    Bits: 0 / x / x

T5: constructive assignment
    Original:     v @= w;
    Instrumented: v @= w;
    Reverse:      v =@ w;
    Bits: 0 / 0 / 0

T6: k-byte destructive assignment
    Original:     v = w;
    Instrumented: {b = v; v = w;}
    Reverse:      v = b;
    Bits: 8k / 0 / 8k

T7: sequence
    Original:     s1; s2; ... sn;
    Instrumented: s1; s2; ... sn;
    Reverse:      inv(sn); ... inv(s2); inv(s1);
    Bits: 0 / x1 + ... + xn / x1 + ... + xn

T8: jump (label lbl as target of n goto's)
    Original:     goto lbl; s1; ... goto lbl; sn; lbl: s;
    Instrumented: b=1; goto lbl; s1; ... b=n; goto lbl; sn; b=0; lbl: s;
    Reverse:      inv(s); switch(b) {case 1: goto label1; ... case n: goto labeln;} inv(sn); labeln: ... inv(s1); label1:
    Bits: lg(n+1) / 0 / lg(n+1)

T9: nestings of T0-T8
    Apply the above recursively.

Table 3.1: Summary of the treatment of various statement types.
Generation rules and upper bounds on state size requirements for supporting reverse computation.
s and s1..sn are any statements of types T0..T7; inv(s) is the corresponding reverse code of
statement s; b is the corresponding state-saved bits "belonging" to the given statement. The
operator =@ is the inverse of a constructive operator @= (e.g., the inverse of += is -=) [16].
Forward:

 1. SV->Requests++;
 2. if((CV->c1 = caves_cache_search((int) SV->views[M->View]))) {
 3.   call caves_cache_hits() {
 4.     SV->Hits++;
 5.     View->Hits++;
 6.     if((CV->c2 = caves_cache_search((int) (*loc)->state.b.ram))) {
 7.       caves_cache_del_any(*loc, cache, &M->RC);
 8.       (*loc)->priority = caves_policy_calculate_priority(SV, *loc);
 9.       caves_cache_enqueue(*loc, cache);
10.     }
11.     else /* disk cache */
12.   }
13. }
14. else {
15.   tw_rand_exponential();
16.   send neighbor request
17. }

Reverse:

 1. SV->Requests--;
 2. if(CV->c1) {
 3.   call caves_cache_hits_rc() {
 4.     SV->Hits--;
 5.     View->Hits--;
 6.     if(CV->c2) {
 7.       caves_cache_del_any(*loc, cache, &M->RC);
 8.       (*loc)->priority = caves_policy_calculate_priority(SV, *loc);
 9.       caves_cache_del_any_rc(*loc, cache);
10.     }
11.     else /* disk cache */
12.   }
13. }
14. else {
15.   tw_rand_reverse_unif();
16. }
Figure 3.6: Forward and reverse CAVES request.
the request variable is incremented, SV→Requests++. This increment is classified as
a constructive assignment, T5. The reverse of this operation is just the decrement,
SV→Requests--.
Next in the code is a simple check to see whether the requested view is in the
cache, if((CV→c1 = caves_cache_search((int) SV→views[M→View]))). CV is a
bit field and CV→c1 is used to store whether the branch was taken. This is an
example of a simple choice, T0. The reverse is to check if the branch was taken by
checking the bit value, if(CV→c1). Then the appropriate reverse code should be
executed.
If the branch was taken, the view is in the cache and the caves_cache_hits()
function is called. A function call is classified as a type T4 and the reversal is
just the calling of the reverse of that function. For this function the reverse is
caves_cache_hits_rc().
Within the caves_cache_hits() function there is an increment of the LP's
hits variable, SV→Hits++. Increments can be seen as a constructive assignment,
T5. The reversal of the increment is simply the decrement, SV→Hits--.
The following operation in the CAVES request code increments the view’s hit
variable, View→Hits++. Once again an increment is a constructive assignment and
the reversal of this increment is the decrement, View→Hits--.
Next the cache server checks whether the view is in the RAM cache or the disk cache,
if((CV→c2 = caves_cache_search((int) (*loc)→state.b.ram))). The branch
direction is stored in the second location in the bit field, CV→c2. This is an example
of a T0, simple choice. The reverse is to check the bit from the bit field and execute
the appropriate reverse code.
In lines 7 through 9 of the forward code of Figure 3.6, the view's priority is
recalculated because of the hit, and the view is repositioned accordingly in the cache. In
the reverse code we decrement the hits counter first, so that the priority can be reversed
to its original value; this allows the view to be repositioned correctly in the priority
queue. The changing of the priority, (*loc)→priority = caves_policy_calculate_priority(SV,
*loc), is a T6, a destructive assignment. However, state saving is not required since
the correct priority can be reconstructed with the appropriate hits value. This is an
example of variable dependence.
These lines construct a sequence, T7. The reverse of this sequence is the inverse
of line 9, followed by the inverses of lines 8 and 7. Line 9 is caves_cache_enqueue(*loc,
cache), and its inverse is caves_cache_del_any(*loc, cache, &M→RC). Line 8's
priority assignment is inverted by recomputing the priority with the restored hits
value. Line 7's inverse is caves_cache_del_any_rc(*loc, cache), which simply
reinserts the view in its original position. We were unable to use caves_cache_enqueue()
to reverse caves_cache_del_any() because priority ties can occur; state saving is
required to recover the view's previous position.
If the view is not in the RAM cache, it must be on the disk. The forward
and reverse code is exactly the same as lines 7-9, but using the disk cache.
If the view is in neither the RAM nor the disk cache, an N_REQUEST is sent
to the LP's neighbor. The tw_rand_exponential() function is called to generate the
time stamp of the N_REQUEST. The reverse of this T4 is tw_rand_reverse_unif().
Reversing the sending of an event is handled by ROSS, so no user-level reverse code is
needed.
In the CAVES model, a cache server receives a response. Figure 3.7 shows
the forward and reverse code of this process. The first operation that is performed
when a response is received is that the misses variable is incremented, SV→misses++.
This increment is classified as a constructive assignment, T5. The reverse of this
operation is just the decrement, SV→misses--.
Next in the code is a simple check to see whether the returned view is absent from the
cache, if(!(CV→c1 = caves_cache_search((int) SV→views[M→View]))). CV is
a bit field and CV→c1 is used to store whether the branch was taken. This is an
example of a simple choice, T0. The reverse is to check whether the branch was taken by
checking the bit value, if(!(CV→c1)). This is followed by executing the appropriate
reverse code.
If the branch was taken, the view is not in the cache and the caves_cache_miss()
function is called. A function call is classified as a type T4. The reversal is
just the calling of the reverse of that function. For this function the reverse is
caves_cache_miss_rc().
Next there is a check to see whether the view will be allowed to enter the cache,
if((CV→c3 = caves_policy_entry(SV, v_num))). The branch direction is stored
in the third location in the bit field, CV→c3. This is an example of a T0, simple
choice. The reverse is to check the bit from the bit field and execute the appropriate
reverse code.
If the view meets the entry criteria, it is added to the cache with the
caves_cache_write() function, T4. The reversal of this function is performed by calling
Forward:

 1. SV->misses++;
 2. if(!(CV->c1 = caves_cache_search((int) SV->views[M->View]))) {
 3.   call caves_cache_miss(SV, CV, M, lp, M->View) {
 4.     if((CV->c3 = caves_policy_entry(SV, v_num))) {
 5.       call caves_cache_write(lp->id, v_num, SV, CV, &M->RC) {
 6.         if((CV->c5 = (cache->free < v->sz))) {
 7.           caves_cache_freeup_ram(SV, lp, v->sz, cache_ram, &SV->Time, RC);
 8.         }
 9.         b = cache_ram->c_free_h;
10.         M->b = b;
11.         b = M->view;
12.         b->priority = caves_policy_calculate_priority();
13.         caves_cache_enqueue(b, cache);
14.       }
15.     }
16.   }
17. }
18. else {
19.   SV->Allready++;
20. }

Reverse:

 1. SV->misses--;
 2. if(!(CV->c1)) {
 3.   call caves_cache_miss_rc(SV, CV, M, lp, M->View) {
 4.     if(CV->c3) {
 5.       call caves_cache_write_rc(lp->id, v_num, SV, CV, &M->RC) {
 6.         caves_cache_del_any(*loc, SV->cache);
 7.         cache_ram->c_free_h = b;
 8.         b = M->b;
 9.         if(CV->c5) {
10.           caves_cache_freeup_ram_rc(SV, lp, v->sz, cache_ram, &SV->Time, RC);
11.         }
12.       }
13.     }
14.   }
15. }
16. else {
17.   SV->Allready--;
18. }
Figure 3.7: Forward and Reverse CAVES response.
caves_cache_write_rc().
Lines 6 through 13 construct a sequence, T7. If needed, the code frees up space
for the new view, then allocates that freed space and inserts the view. The reverse
of this is to remove the inserted view, free the allocated space, and reallocate the
space that was freed, if needed.
There is a check to see whether free space is needed to fit the view in the cache,
if((CV→c5 = (cache→free < v→sz))). This is a simple choice, T0. The result
is stored in the fifth location in the bit field, CV→c5. The reverse is to check
the bit from the bit field and execute the appropriate reverse code.
When the view does not fit in the free space of the cache, views have to be
removed until it fits. The caves_cache_freeup_ram() function, a T4, frees up the
needed space. The reversal of this function is performed by calling
caves_cache_freeup_ram_rc(). Note that the methods for reversing this function are discussed
in Section 3.3.3 due to the difficulties of reversing dynamic memory.
A storage block is then removed from the free list and the new view is stored in
its location. Storing the new view causes a destructive assignment, T6. The
view being overwritten is therefore state saved, M→b = b. The reverse is to use the
state-saved information to restore the freed view, b = M→b.
The view is then inserted into the cache with caves_cache_enqueue(). This is
a function call, T4, that inserts a view into the cache. The reverse of this is
caves_cache_del_any(). Once again, since this is a sequence, the inverses of the
forward code are executed in reverse order.
Finally, if the view was already in the cache, the SV→Allready value is incremented.
This is a constructive assignment and therefore can be easily reversed with
the decrement operation.
3.3.3 CAVES: A One-to-Many Delete
To enable parallel execution, reversible execution of the model must be supported.
In particular, freeing the necessary space in the view storage in a reversible
way is a problem that had to be solved. The problem stemmed from the fact that
the system could remove multiple views from the storage to free up enough space
for an incoming view that was significantly larger than any stored single view.
Our solution is to utilize a free-list of views. This list structure allows the view
storage to be made reversible for single inserts and multiple deletes. To illustrate
this functionality, consider the following example. Suppose an event is scheduled
which adds view number 8 to the storage. In order to fit this view, 5 other views
(1,2,3,4,5) must be removed from the storage. These 5 storage blocks are put on
the free list (5,4,3,2,1). The head of the free list (5) is removed to put view 8 in.
Prior to modifying the view number of the storage block, the state of that storage
block is stored in the current event (i.e., swapped). We then insert 8 into the
storage. Another event, which adds view 9, comes in and there is enough room in
the storage to fit it. A storage block is removed for the free list (4) and the storage
block’s previous view information is swapped with the event data. View 9 is then
put in the storage block and the storage block is inserted into the storage. The next
event that comes in causes a rollback of the last two events. First we remove view 9
from the storage and set the storage block back to its old state (4) by re-swapping
the data. The storage block is then put back on the free list. We then remove view
8 from the storage and again we set the storage block back to its old state, along
with putting it on the free list (5). Since five views were removed from the storage to fit
view 8, these views are then removed from the head of the free list and put back at the
head of the storage (1,2,3,4,5).
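The free-list mechanism above can be sketched in simplified form. The code below is illustrative only: each storage block holds exactly one fixed-size view (real CAVES views vary in size), and all names are invented. The per-event rollback record plays the role of the swapped state stored in the current event.

```c
#include <assert.h>
#include <string.h>

/* Simplified sketch of the reversible free-list from Section 3.3.3. */
#define NBLK 8

typedef struct {
    int in_cache[NBLK];   /* 1 if block i currently holds a cached view */
    int view[NBLK];       /* view id stored in block i                  */
    int free_list[NBLK];  /* LIFO stack of free block indices           */
    int nfree;
} storage_t;

typedef struct {          /* rollback state carried in the event        */
    int blk;              /* block the new view went into               */
    int old_view;         /* view id swapped out of that block          */
    int evicted[NBLK];    /* blocks evicted to make room, in order      */
    int nevicted;
} event_rc_t;

/* Forward: evict `nev` blocks onto the free list, then take the head
 * of the free list, swap its old view id into the event, and insert. */
void insert_view(storage_t *st, event_rc_t *rc, int new_view,
                 const int *evict, int nev)
{
    rc->nevicted = nev;
    for (int i = 0; i < nev; i++) {
        rc->evicted[i] = evict[i];
        st->in_cache[evict[i]] = 0;
        st->free_list[st->nfree++] = evict[i];
    }
    rc->blk = st->free_list[--st->nfree];   /* remove free-list head     */
    rc->old_view = st->view[rc->blk];       /* swap old state into event */
    st->view[rc->blk] = new_view;
    st->in_cache[rc->blk] = 1;
}

/* Reverse: re-swap the block's old state, return it to the free list,
 * then pop the evicted blocks back into the storage in reverse order. */
void insert_view_rc(storage_t *st, event_rc_t *rc)
{
    st->in_cache[rc->blk] = 0;
    st->view[rc->blk] = rc->old_view;
    st->free_list[st->nfree++] = rc->blk;
    for (int i = rc->nevicted - 1; i >= 0; i--) {
        st->nfree--;
        st->in_cache[rc->evicted[i]] = 1;
    }
}
```

A single insert with no evictions (the view-9 case above) is just the `nev == 0` path, and a rollback is one call to `insert_view_rc()` per rolled-back event, in reverse event order.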
3.3.4 Variable Dependencies
An optimization that was observed while calculating the priority of a value is
to use variable dependencies to avoid state saving. With such an optimization, the
rules in Table 3.1 are broken but the correctness of the model is preserved. This is
illustrated in Figures 3.8 and 3.9. The modeler can observe that SV→value
is derived from SV→x. This implies that if SV→x is reversed to its previous state,
SV→value can then be re-derived to its original value. This avoids the state
saving required by applying the rules. In order for this to work, SV→value must
always be derivable from SV→x.
Forward
1. SV->x++;
2. SV->RC.value = SV->value;
3. SV->value = SV->x * 5.5;
Reverse
1. SV->value = SV->RC.value;
2. SV->x--;
Figure 3.8: Reverse computation code without considering variable de-pendencies.
Forward
1. SV->x++;
2. SV->value = SV->x * 5.5;
Reverse
1. SV->x--;

2. SV->value = SV->x * 5.5;
Figure 3.9: Reverse computation code using variable dependencies.
3.4 CAVES Model Performance Study
3.4.1 CAVES Model parameters
The CAVES model has the following application parameters with the respec-
tive configuration values used in the experiments described below: (i) number of
global views (400 and 2400), (ii) size of the clients RAM storage (2 MB and 4 MB)
and disk storage (4 MB and 16 MB), (iii) size of the middle server RAM storage
(4 MB and 16 MB) and disk storage (32 MB and 128 MB), (iv) request arrival time
(10 seconds and 40 seconds), (v) mean view size and standard deviation (250 KB,
stddev 100 KB), (vi) mean view computational complexity and standard deviation
(4 with stddev of 2), (vii) mean view filter factor and standard deviation (150 with
stddev of 60), (viii) network speed (100 KB/second), (ix) weights of the priority
factors (permutations of 0.2 and 0.6 for w1, w2 and w3), (x) number of clients per
middle server (4, 8 and 16), (xi) number of middle servers per database (4 and 8),
(xii) number of databases (4 and 8). Each of these parameters is discussed below.
Number of global views is the number of different views that can be requested.
The RAM and disk size parameters determine the amount of available space for
storing the views. The RAM and disk parameters are correlated: when the RAM
size is small, the disk size is small. For the rest of the chapter, when RAM size is
used it refers to the disk size as well.
The request arrival time sets how frequently view requests occur
at the client. The view size, computational complexity and filter factor parameters
are used to create the global view pool. The filter factor is the ratio of
the amount of data to search through to find the view, over the view’s size. The
computational complexity is the difficulty level of computing the view. All of these
factors go into the amount of time to receive a requested view. These factors along
with the disk speed are used in the calculations of priority.
By changing the weights of priorities, we hope to find the optimum combination
of weights, for a given workload, which would lead to the maximum savings for
CAVES users. This, however, is not in the scope of this thesis.
This set of parameter configurations resulted in 2,304 experiments, 576 for
each processor grouping. The results below focus on the four-processor
results as compared with the single-processor results.
3.4.2 Performance Metrics and Platforms
A number of performance metrics are used to compare and contrast the cause
and effect relationships of various model parameters to ROSS performance metrics.
Event rate is defined as the total number of events processed, less any rolled-back
events, divided by the execution time. Speedup is defined as the event rate of
the parallel case divided by the event rate of the sequential case. Because the total
number of events is the same between sequential and parallel runs of the same
model configuration, this definition is equivalent to using execution time.
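Written out as code, the two definitions are as follows; the numbers in the test are made up for illustration.

```c
#include <assert.h>

/* Event rate: committed (non-rolled-back) events per second. */
double event_rate(long total_events, long rolled_back, double exec_secs)
{
    return (double)(total_events - rolled_back) / exec_secs;
}

/* Speedup: parallel event rate relative to the sequential event rate. */
double speedup(double parallel_rate, double sequential_rate)
{
    return parallel_rate / sequential_rate;
}
```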
Our computing testbed consists of a single quad-processor Dell personal
computer. Each processor is a 500 MHz Pentium III with 512 KB of level-2 cache. The
total amount of available RAM is 1 GB. Four processors are used in every experi-
ment. All memory is accessed via the PCI bus, which runs at 100 MHz. The caches
are kept consistent using a snoopy, bus-based protocol.
The memory subsystem for the PC server is implemented using the Intel
NX450 PCI chipset. This chipset has the potential to deliver up to 800 MB of
data per second. However, early experimentation determined that the maximum obtainable
bandwidth is limited to 300 MB per second. This performance degradation
is attributed to the memory configuration itself. The 1 GB of RAM consists of
four 256 MB DIMMs. With four DIMMs, only one bank of memory is available. Thus,
"address-bit-permuting" (ABP) and bank-interleaving techniques are not available.
The net result is that a single 500 MHz Pentium III processor can saturate the
memory bus. This aspect will play an important role in our performance results.
3.4.3 Overall Speedup Results
Our experimental data, in the aggregate, showed a number of interesting
trends. In particular, many of the runs showed super-linear speedup, while others
showed much weaker speedups. Varying the CAVES system parameters
had an interesting effect on the Time Warp system, as detailed below. The effect
of the parameters on CAVES performance is beyond the scope of this thesis because
it deals with the complexities of changing storage policies, which vary in response to
changes in workload statistics.
Of the 576 four-processor runs, 117 resulted in super-linear speedup. Nine
had a speedup above 5.0; the highest was 5.30. These
super-linear speedups are attributed to the system's high workload and minimal
remote-message traffic, combined with four times the level-1 and level-2 cache space. The
simulator's memory requirements ranged from 23 megabytes to 45 megabytes.
The small number of remote messages can be attributed to the fact that
the middle storage servers had larger storages and the global view pool was 400. Of
the 117 runs, 111 had large middle storages and 107 had a view pool of 400. With
the larger middle storage and smaller view pool, the 117 runs averaged about three times
more middle storage hits and 1.5 times more client storage hits than the system's
average. The 117 runs had 2.6 times fewer rollbacks and sent about half as many
remote messages as the average.
The memory subsystem of this quad processor machine is limited to only 300
MB/second. Consequently, the uniprocessor runs are executing outside of level 1 /
level-2 cache and exhaust the available memory bandwidth. The four-processor cases
have four times the available cache, allowing more of the simulation's working dataset
to fit within cache memory and thus greatly increasing the per-processor performance
relative to the sequential case.
The average speedup of the system was 3.56 on four processors, and of the 576
runs, 382 had speedups between 3 and 4. The lowest speedup, however, was 1.57.
There were 77 runs with a speedup under three. This lack of performance is directly
related to the fact that 50 of those runs had a large view pool and 63 had small
middle storages. With the small middle storage and the large view pool, the 77 runs
had four times fewer middle hits and 1.26 times fewer client hits than the average of
all the runs. The 77 runs had twice as many rollbacks (25% of events were rolled
back) and 1.61 times more remote events than the average.
3.4.4 Experiment Changes
In order to observe the significance of the model's parameters, we constructed
an additional series of experiments. The number of processors (PEs) was held
constant at four. With the number of PEs fixed, the effects of the other
parameters can be seen. In addition, a wider range of values was added to the
set of experiments. Intermediate view pool sizes of 800 and 1200 were added. The
client RAM size had intermediate values of 2750 and 3500 inserted. The mean request
time values were changed to ten, twenty, thirty, and forty. The middle server
RAM sizes also had additional values of 8000 and 12000. The total number of different
experiments was 15,360, and each experiment was run 4 times. These wider ranges of
experiments resulted in a model that more accurately predicts event rate based on
the CAVES system input parameters.
The most significant factors were the RAM size, the number of secondary
rollbacks, the number of clients per middle server, and the number of LPs in the system.
other parameters in the model did have an effect but it was not nearly as noticeable.
The client RAM and middle-server RAM sizes were negatively correlated
with the event rate: as the size went down, the event rate went up. This is
explained by the fact that smaller storage leads to smaller event granularity, which
allows more events to be processed per second.
The number of secondary rollbacks was also negatively correlated: as
secondary rollbacks went down, the event rate went up. The reason is that fewer
secondary rollbacks lead to less reverse computation, which allows more events
to be processed; in addition, fewer events had to be reprocessed.
The number of LPs in the system was responsible for another negative correlation.
With a large number of LPs, the system can become too heavily loaded,
causing a significant number of secondary rollbacks.
3.5 Related Work
A number of web-caching architectures and strategies have been proposed,
including Abrams et al. 1996 [1]; Arlitt and Williamson 1997 [5]; Glassman 1994 [40];
Pitkow and Recker 1994 [83]; Shi, Watson and Chen 1997 [97]; and Wessels 1995
[107]. The fundamental difference between these works and our caching approach
is the complexity of the queries. Our view storage system assumes that views are the
results of arbitrarily complex SQL queries, not web pages. For web caches, it is
generally sufficient to measure the effectiveness of the caching strategy in terms
of hit rate. That measure is insufficient for a view storage system because views
are typically not of the same size or the same computational complexity, and they
have varying network transmission times depending on the location of the source
database system.
3.6 Conclusions
Overall for CAVES, we find that our model performs well with an average
speedup of 3.6 on 4 processors over all configurations. Many cases yield super-linear
speedup, which is attributed to a slow memory subsystem on the multiprocessor PC.
We find that a number of parameters affect key Time Warp performance metrics.
In particular, when the view storage size decreases, the event rate increases.
CHAPTER 4
TCP Model
The TCP network model simulates concurrent file transfers. The goal of the system
is to simulate a large-scale network optimistically. The issues faced by this model
are as follows. First, network simulations have long been viewed as outside the scope
of optimistic approaches due to state-saving overheads. Second, it is not easy to
simulate a real-world topology due to its complexity.
This chapter begins with the motivation for the TCP model in Section 4.1. This is
followed by the TCP model's implementation in Section 4.3 and a discussion of
the instrumentation of the reverse computation code in Section 4.3.3. The
validation results are in Section 4.3.4. The chapter concludes with a performance
study of the TCP model.
4.1 TCP Model Motivation and Introduction
To address bandwidth allocation and congestion problems, researchers
are proposing new overlay networks that provide a high quality of service and a
near-lossless guarantee. However, the central question raised by these new services
is what impact they will have in the large. To address these and other network
engineering research questions, high-performance simulation tools are required.
The predominant technique used to analyze Internet protocol behavior is packet-level,
discrete-event simulation. Here, networking researchers are interested in
examining the effects routing protocols like OSPF and BGP have on quality-of-service
guarantees and measures [43], as well as other large-scale network operation and
engineering problems. Because the computational requirements of this problem are
immense, network designers require tools that can efficiently model a network with
potentially millions of nodes and data streams. Such tools will enable better network
configurations and more efficient, accurate management of capacity.
To date, optimistic techniques, such as Time Warp [51], have been viewed
as operating outside the performance range for Internet protocol models such as
TCP, OSPF and BGP. The reason most often cited is that state-saving overheads are
too large [84, 85]. These overheads not only impede the performance of the model
but also limit its scale because of increased memory consumption.
Other critiques of optimistic methods concern inconsistent states due to
the inherent risk involved in optimistic processing [73].
We demonstrate that optimistic protocols are able to efficiently simulate TCP
scenarios of over a million nodes for realistic network topologies. In addition to
efficient execution, we observed that Time Warp executes with increased stability as
model size increases, and that it handles short delays on high-bandwidth links exceptionally
well. Last, from the developer's point of view, the issue of inconsistent states as a
consequence of fully optimistic processing was not observed; special error-handling
considerations for the forward code path were not required.
The innovations for achieving this level of scalability are twofold. First, for the
undo operation in optimistic processing, we employ reverse computation [16].
Here, the event computations are developed in such a way that they can be reversed,
as opposed to state saving. This approach has been shown to
have negligible impact on forward execution and to substantially reduce the memory
requirements of optimistic parallel models [16, 112].
The second innovation is our compact implementation modeling approach.
Here we demonstrate a TCP model compactly implemented atop a parallel discrete-event
platform. The object hierarchy for a TCP connection is kept extremely lean
and compressed into a single contiguous logical process (LP) state vector. Similarly,
the event data is compressed to the minimum for the feature set of the protocol model.
This approach enables a single TCP connection state to occupy only 320 bytes total
(both sender and receiver), with 64 bytes per packet-event.
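A hypothetical sketch of what such a flat state vector might look like is shown below. The field choices are invented for illustration; the thesis reports only the byte totals, not the actual layout.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the compact-implementation idea: one flat, contiguous LP
 * state per connection and a small fixed-size packet-event.  All field
 * names are assumptions; only the size budgets (320 bytes per
 * connection, 64 bytes per packet-event) come from the text. */
typedef struct {
    uint32_t seq_sent, seq_acked;   /* sender progress            */
    uint32_t cwnd, ssthresh;        /* congestion-control state   */
    uint32_t rtt_est, rtt_dev;      /* Jacobson RTT estimator     */
    uint32_t dup_acks;              /* fast-retransmit counter    */
    uint32_t recv_next;             /* receiver progress          */
} tcp_conn_state;                   /* flat: no pointers to chase */

typedef struct {
    uint32_t src_lp, dst_lp;        /* endpoints                  */
    uint32_t seq, ack;              /* sequence/ack numbers       */
    uint16_t flags, len;            /* segment metadata           */
} tcp_packet_event;
```

Keeping the state flat and pointer-free also makes it cheap to copy or roll back, which matters for an optimistic engine.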
The end result of these innovations is that we are able to simulate million-node
network topologies using commercial off-the-shelf multiprocessor systems costing less
than $7,000 USD. On a more costly distributed cluster of 32 nodes, our TCP model
executed at 5.5 million packets per second which is 5.14 times greater than PDNS’
packet rate of 1.07 million for the same large-scale network scenario.
Later in this chapter, we describe our TCP model and its implementation on
an optimistic parallel simulation engine called ROSS.
4.2 TCP Overview
The Internet relies on the TCP/IP protocol suite, combined with router mechanisms,
to perform the necessary traffic management functions. TCP provides reliable
transport using an end-to-end window-based control strategy [50]. TCP design is
guided by the "end-to-end" principle, which suggests that "functions placed at the
lower levels may be redundant or of little value when compared to the cost of
providing them at the lower level." As a consequence, TCP provides several critical
functions (reliability, congestion control, session/connection management) at
layer four, because that is where these functions can be completely and correctly implemented.
While TCP provides multiplexing/de-multiplexing and error detection using
means similar to UDP (e.g., port numbers, checksums), one fundamental difference
between them lies in the fact that TCP is connection-oriented and reliable. The
connection-oriented nature of TCP implies that before a host can start sending data
to another host, it must first set up a connection using a 3-way reliable handshaking
mechanism.
The functions of reliability and congestion control are coupled in TCP. The
reliability process in TCP works as follows.
When TCP sends a segment, it maintains a timer and waits for the receiver
to send an acknowledgment of receipt. If an acknowledgment is
not received at the sender before its timer expires (i.e., a timeout event), the segment
is retransmitted. Another way in which TCP can detect losses during transmission
is through duplicate acknowledgments. Duplicate acknowledgments arise from the
cumulative acknowledgment mechanism of TCP: if segments are received
out of order, TCP sends an acknowledgment for the next byte of data that it is
expecting. Duplicate acknowledgments are those segments that re-acknowledge
a segment for which the sender has already received an earlier acknowledgment.
If the TCP sender receives three duplicate acknowledgments for the same data, it
assumes that a packet loss has occurred. In this case the sender retransmits the
missing segment without waiting for its timer to expire. This mode of loss recovery
is called "fast retransmit".
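The duplicate-ACK rule above can be sketched as follows; the field and function names are illustrative, not taken from the model's sources.

```c
#include <assert.h>

/* Sketch of duplicate-ACK loss detection: returns 1 when the third
 * duplicate acknowledgment triggers a fast retransmit. */
typedef struct {
    unsigned last_ack;   /* highest cumulative ACK seen so far */
    int dup_acks;        /* consecutive duplicates of last_ack */
} tcp_sender;

int on_ack(tcp_sender *s, unsigned ack)
{
    if (ack > s->last_ack) {      /* new data acknowledged: reset     */
        s->last_ack = ack;
        s->dup_acks = 0;
        return 0;
    }
    if (++s->dup_acks == 3)       /* third duplicate: fast retransmit */
        return 1;
    return 0;
}
```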
TCP's flow and congestion control mechanisms work as follows. TCP uses a
window that limits the number of packets in flight (i.e., unacknowledged). TCP's
congestion control works by modulating this window as a function of the congestion
that it estimates. TCP starts with a window size of one segment. As the source
receives acknowledgments, it increases the window size by one segment per acknowl-
edgment received (“slow start”), until a packet is lost, or the receiver window (flow
control) limit is hit. After this event, it decreases its window by a multiplicative
factor (one half) and uses the variable ssthresh to denote its current estimate of
the network bandwidth-delay product. Beyond ssthresh the window size follows a
linear increase. This procedure of additive increase/multiplicative decrease (AIMD)
allows TCP to operate in an efficient and fair manner [20].
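The AIMD dynamics described above can be sketched in a few lines. The fragment below is an illustrative sketch, not the thesis model code; the names cc, on_ack and on_loss are invented, and cwnd and ssthresh are measured in segments rather than bytes.

```c
#include <assert.h>

/* Illustrative AIMD sketch (invented names, not the ROSS TCP model).
 * cwnd and ssthresh are in units of segments. */
struct cc {
    double cwnd;     /* congestion window */
    double ssthresh; /* slow-start threshold */
};

static void on_ack(struct cc *c)
{
    if (c->cwnd < c->ssthresh)
        c->cwnd += 1.0;           /* slow start: +1 segment per ACK */
    else
        c->cwnd += 1.0 / c->cwnd; /* congestion avoidance: ~+1 segment per RTT */
}

static void on_loss(struct cc *c)
{
    c->ssthresh = c->cwnd / 2.0;  /* multiplicative decrease (one half) */
    c->cwnd = 1.0;                /* Tahoe-style return to slow start */
}
```

Starting from cwnd = 1 with ssthresh = 4, three ACKs bring the window to 4 segments; the next ACK adds only 1/4 of a segment, reflecting the switch to linear growth.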
The various types of TCP (TCP Tahoe, Reno, SACK) differ primarily in the
details of the congestion control algorithms, though TCP SACK also proposes an
efficient selective retransmit procedure for reliability. In TCP Tahoe, when a packet
is lost, it is detected through the fast retransmit procedure, but the window is set
to a value of one and TCP initiates slow start after this. TCP Reno attempts to
use the stream of duplicate acknowledgments to infer the correct delivery of future
segments, especially for the case of occasional packet loss. It is designed to offer
1/2 round-trip time (RTT) of quiet time, followed by transmission of new packets
until the acknowledgment for the original lost packet arrives. Unfortunately Reno
often times out when a burst of packets in a window are lost. TCP NewReno fixes
this problem by limiting TCP’s window reduction during a single congestion epoch.
TCP SACK enhances NewReno by adding a selective retransmit procedure where
the source can pinpoint blocks of missing data at receivers and can optimize its
retransmission. All versions of TCP will time out if the window size is small
(e.g., for small files) and the transfer encounters a packet loss. All versions of TCP
implement Jacobson's RTT estimation algorithm (which sets the timeout to the mean
RTT plus four times the mean deviation of the RTT, rounded up to the nearest multiple
of the timer granularity (e.g., 500 ms)). A comparative simulation analysis of these
versions of TCP was done by Fall and Floyd [28].
4.3 TCP Model Implementation
Our implementation follows the TCP Tahoe specification. Below are the
specific capabilities of the TCP session on a single host.
• Logs: The system has the ability to log sequence numbers, and congestion
control window information. This information was used in our validation
study. For performance runs, logging was disabled.
• Receiver side: Data is acknowledged when received. If the received packet's
sequence number is greater than the expected sequence number, the packet is
stored in the receive buffer and an acknowledgment is sent for the expected
packet (a duplicate acknowledgment). When a packet with the expected
sequence number is received, the next appropriate acknowledgment is sent
according to the receive buffer's contents.
• Sender side: In practice, the sender will be in slow-start until the congestion
window is greater than the slow-start threshold. After that, congestion avoid-
ance is started. If 3 duplicate acknowledgments are observed by the sender,
then fast retransmission is performed (see below). If the acknowledgment se-
quence number is greater than the lowest unacknowledged sequence number,
the sender assumes that a gap was filled and sends the appropriate packet.
• Fast retransmission: When 3 duplicate acknowledgments are observed, fast
retransmission is started. Here, the slow-start threshold is set to half the
minimum of the congestion window size and the receive window. If this value
is less than two times the maximum segment size, the slow-start threshold is
set to two times the maximum segment size instead. The congestion window
is set to one maximum segment size.
• Slow start: In slow start, two packets are sent for every acknowledgment.
Here, the congestion window grows by one maximum segment size with every
acknowledgment.
• Congestion avoidance: The window grows by one maximum segment size
for every window's worth of acknowledgments. Here, one packet per acknowledgment
is normally sent, and two packets are sent for every congestion window's
worth of acknowledgments.
• Round trip time (RTT): The RTT is measured one segment at a time.
When a packet is sent and the RTT is not being measured, a new measurement
is initiated. When retransmitting, the current RTT measurement, if ongoing,
is canceled. The RTT measurement process is complete upon receiving the first
acknowledgment that covers the packet being measured.
• Round trip timeout (RTO): We approximate the RTO using Jacobson's
tick-based algorithm for computing round trip time, which provides a more
dampened RTO computation by including the deviation it measures [50].
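A floating-point sketch of this estimator style is given below. The real implementation is tick-based and operates on scaled integers; the names rtt_est and rto_update, the classic gains of 1/8 and 1/4 from Jacobson's algorithm, and the 500 ms granularity are illustrative assumptions here, not the exact ROSS code.

```c
#include <assert.h>

#define TICK_MS 500.0 /* assumed timer granularity of 500 ms */

/* Illustrative smoothed RTT/RTO estimator in the style of Jacobson:
 * RTO = smoothed RTT + 4 * mean deviation, rounded up to a whole tick.
 * Names are invented for this sketch. */
struct rtt_est {
    double srtt;   /* smoothed RTT estimate (ms) */
    double rttvar; /* smoothed mean deviation (ms) */
};

/* Fold in one RTT measurement; return the new RTO in milliseconds. */
static double rto_update(struct rtt_est *e, double measured_ms)
{
    double err = measured_ms - e->srtt;
    e->srtt += err / 8.0;                 /* classic gain of 1/8 */
    if (err < 0.0)
        err = -err;
    e->rttvar += (err - e->rttvar) / 4.0; /* deviation gain of 1/4 */

    double rto = e->srtt + 4.0 * e->rttvar;
    /* round up to the nearest multiple of the tick size, minimum one tick */
    double ticks = (double)(int)((rto + TICK_MS - 1.0) / TICK_MS);
    if (ticks < 1.0)
        ticks = 1.0;
    return ticks * TICK_MS;
}
```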
4.3.1 TCP Model Data Structures
In the implementation of the TCP model there are three main data structures.
The message, which is the data packet, is sent from host to host via the
forwarding plane. The router LP's state maintains the queuing information along
with the dropped-packet statistics. Finally, the host LP's data structure keeps track
of the transfer of data.
A message contains the source and destination address. These addresses are
used for forwarding. The message also has the length of the data being transferred
which is used to calculate the transfer times at the routers. The acknowledgment
number is also included for the sender to observe which packets have been received.
The sequence number is another field that indicates which chunk of data is
being transferred.
Now, in our model the actual data transferred is irrelevant and therefore
was not modeled. However, when an application such as the Border Gateway
Protocol (BGP) [88, 89] runs on top of TCP, such data is required
for the correctness of the simulation. We are currently examining solutions to this
issue.
Now, the router model’s state is kept small by exploiting the fact that most
of the information is read-only and does not change for the static routing scenarios
described in this chapter. Inside each router, only queuing information is kept along
with a dropped-packet count statistic.
There is a global adjacency list which contains link information. This informa-
tion is used by the All-Pairs-Shortest-Path algorithm to generate the set of global
routing tables (one for each router). Each table is initialized during simulation setup
and consists only of the next hop/link number for all routers in the network.
Given the link number, the router can directly look up the next hop's IP
address in its entry of the adjacency list. The adjacency list has an entry for each
router and each entry contains all the adjacencies for that router. Along with the
router neighbor’s address, it contains the speed, buffer size, and link delay for that
neighbor.
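A possible in-memory layout for these structures is sketched below. The struct and field names, and the NUM_ROUTERS constant, are assumptions for illustration; only the one-byte-per-entry routing table and the per-neighbor speed, buffer-size, and delay fields come from the description above.

```c
#include <assert.h>

#define NUM_ROUTERS 4   /* illustrative; the real topologies are far larger */
#define MAX_LINKS   67  /* maximum links per router, as in the AT&T topology */

/* Illustrative adjacency entry: one per neighbor of a router. */
struct adjacency {
    unsigned int next_hop_addr; /* neighbor router's IP address */
    double       bandwidth;     /* link speed (bits/sec) */
    double       delay;         /* link delay (sec) */
    int          buffer_sz;     /* drop-tail buffer size (bytes) */
};

struct router_adj {
    int              n_links;
    struct adjacency link[MAX_LINKS];
};

/* One byte per routing-table entry suffices since link numbers never exceed 67. */
static unsigned char     route_table[NUM_ROUTERS][NUM_ROUTERS];
static struct router_adj adj[NUM_ROUTERS];

/* Resolve the next hop's IP address toward a destination router. */
static unsigned int next_hop_addr(int router, int dest_router)
{
    unsigned char link = route_table[router][dest_router];
    return adj[router].link[link].next_hop_addr;
}
```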
The host has the same data structures for both the sender and receiver sides
of the TCP connection. There is also a global adjacency list for the hosts; however,
there is only one adjacency per host. In our model, a host is not multi-homed and
can only be connected to one router. There is also a read-only global array which
contains the sender or receiver host status and the size of the network transfer (which is
usually a file of infinite size). The maximum segment size and the advertised window
size were also implemented as global variables to cut down on memory requirements.
The receiver contains the “next expected sequence” variable and a buffer for
out of order sequence numbers. On the sender side of a connection the following
variables are used to complete our TCP model implementation: the round trip time-
out (RTO), the measured round trip time (RTT), the sequence number that is being
used to measure the RTT, the next sequence number, the unacknowledged packet
sequence number, the congestion control window (cnwd), the slow-start threshold,
and the duplicate acknowledgment count.
For all experiments reported here, the RTO is initialized to 3 seconds at the
beginning of a transfer, along with the slow start threshold being initialized to
65,536. The maximum congestion window size is set to 32 packets, however, this
value is easily modified. The host, in addition to the variables needed for TCP, has
variables for statistic collection. Each host keeps track of the number of packets
sent and received, the number of timeouts that occur and its measurement of the
transfer’s throughput.
4.3.2 TCP Model Compressing Router State
As previously indicated, our router design at this point is assumed to be fixed
and have static routes. By leveraging this assumption, we set out to reduce the
routing table state.
Now, a problem encountered with real Internet topologies, such as the AT&T
network, is that they tend not to have a well defined structure for the purpose
of imposing a space-efficient address mapping scheme. Ideally, one would like to
impose some hierarchical address mapping scheme on the topology for the purposes
of compressing the routing tables. Such a compression will not lead to an incorrect
simulation of the network so long as flow paths remain the same from the real
network to the simulated network.
Our implementation of the routing table just contains the next hop’s link
number. Here, the maximum number of links per router is 67. Therefore the
routing table can be represented with one byte per entry instead of a full integer-sized
address. In our simulation we have an entry in the routing table for each router. If
we had to have an entry for each host, the routing tables would be extremely large.
The hosts were addressed in such a way that the router they are connected to can
be inferred and therefore a routing table of only routers is acceptable. In the case
that it cannot be inferred, we could have a global table of hosts and the routers that
they are connected to. This one table is significantly smaller than having a routing
table in each router with every host. We note that some topologies are such that a
routing table is not needed, such as a hypercube. In these topologies, the next hop
can be inferred based on the current router and the destination.
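For a hypercube, for instance, the next hop can be computed by flipping the lowest-order address bit in which the current router and the destination differ. The function below is an illustrative sketch; the name and bit-ordering choice are assumptions, not taken from the thesis.

```c
#include <assert.h>

/* Next hop in a hypercube: flip the lowest-order differing bit between
 * the current router's address and the destination's.  Illustrative. */
static int hypercube_next_hop(int cur, int dest)
{
    int diff = cur ^ dest;
    if (diff == 0)
        return cur;         /* already at the destination */
    int bit = diff & -diff; /* isolate the lowest differing dimension */
    return cur ^ bit;
}
```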
Next, we assume that routers implement a drop-tail queuing policy. Because
of this, routers need not keep a queue of packets to be sent. Instead, the routers
schedule packets based on the service rate (in bytes per second) and the time
stamp of the last sent packet. Here, sending a packet across a single link is ac-
complished by scheduling a single event in contrast to the scheduling of two events
needed by many other networking simulators [74]. The packet rate is defined as the
number of packets sent across a single link divided by the wall clock time. By han-
dling the router queues in this manner for our system, the event rate of the system
is approximately the packet rate.
As an example of how our queue works, let us assume we have a buffer size of
2 packets, a service time of 2.0 time units per packet, and 4 packets arriving at the
following times: 1.0, 2.0, 3.0 and 3.0. Clearly, the last packet will be dropped, but
let’s see how we can implement this without queuing them. If we keep track of the
last send time, we see that the packet arriving at 1.0 will be scheduled at 3.0, followed
by 5.0 and 7.0. Thus, when the last packet arrives, the last sent time is 7.0. If we subtract
the arrival time of the last packet, 3.0 from the last sent time of 7.0, this indicates
there are 4.0 time units worth of data to be sent, which divided by the service time,
yields that there are currently 2 packets in the queue. Thus, this packet will be
dropped.
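The example above can be coded directly. The sketch below is a hedged reconstruction with invented names (struct router, router_arrive), not the ROSS router implementation; it keeps only the last departure time and derives the queue occupancy on each arrival.

```c
#include <assert.h>

/* Queue-free drop-tail router: the only queue state is the departure
 * time of the last accepted packet.  Names are illustrative. */
struct router {
    double last_send;   /* departure time of the last accepted packet */
    double service;     /* service time per packet (time units) */
    int    buffer_pkts; /* drop-tail buffer capacity, in packets */
    int    dropped;     /* dropped-packet statistic */
};

/* Returns the packet's departure time, or a negative value on a drop. */
static double router_arrive(struct router *r, double now)
{
    double backlog = r->last_send - now; /* time units of data still queued */
    if (backlog < 0.0)
        backlog = 0.0;
    int queued = (int)(backlog / r->service); /* packets currently waiting */
    if (queued >= r->buffer_pkts) {
        r->dropped++;          /* drop-tail: buffer full */
        return -1.0;
    }
    double depart = (backlog > 0.0 ? r->last_send : now) + r->service;
    r->last_send = depart;
    return depart;
}
```

Replaying the example (buffer of 2 packets, service time 2.0, arrivals at 1.0, 2.0, 3.0 and 3.0) yields departures at 3.0, 5.0 and 7.0, with the fourth packet dropped.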
4.3.3 TCP Model Reverse Code
As discussed in Section 3.3.1, one should try to reduce the amount of entropy or
information loss. With entropy-free computation, there should be some performance
gains. Within our methodology, we try to reduce information losses. The first step
in our methodology is to perform a model analysis. The second step performs an
analysis on the functions and data within the model with a focus on the minimization
of entropy. The third step deals with data compression and reuse. Finally, we iterate
through the functions of the model and generate the reverse code using the rules.
In this subsection, we show the application of the rules in Table 4.1 on pseudo
code for the TCP model. The table is reproduced here for ease of reference.
The TCP host model, upon receiving a correct acknowledgment, sends the next
data packets with the appropriate sequence numbers. In Figure 4.1 the forward and
reverse code is shown for this procedure. Lines 1 through 13 can be viewed as a
T0  simple choice
    original:      if() s1; else s2;
    instrumented:  if() {s1; b=1;} else {s2; b=0;}
    reverse:       if(b==1) {inv(s1);} else {inv(s2);}
    bits:          self 1; children x1, x2; total 1 + max(x1, x2)

T1  compound choice (n-way)
    original:      if() s1; elsif() s2; elsif() s3; ... else sn;
    instrumented:  if() {s1; b=1;} elsif() {s2; b=2;} elsif() {s3; b=3;} ... else {sn; b=n;}
    reverse:       if(b==1) {inv(s1);} elsif(b==2) {inv(s2);} elsif(b==3) {inv(s3);} ... else {inv(sn);}
    bits:          self lg(n); children x1, x2, ..., xn; total lg(n) + max(x1, ..., xn)

T2  fixed iterations (n)
    original:      for(n) s;
    instrumented:  for(n) s;
    reverse:       for(n) inv(s);
    bits:          self 0; child x; total n * x

T3  variable iterations (maximum n)
    original:      while() s;
    instrumented:  b=0; while() {s; b++;}
    reverse:       for(b) inv(s);
    bits:          self lg(n); child x; total lg(n) + n * x

T4  function call
    original:      foo();
    instrumented:  foo();
    reverse:       inv(foo)();
    bits:          self 0; child x; total x

T5  constructive assignment
    original:      v @= w;
    instrumented:  v @= w;
    reverse:       v =@ w;
    bits:          self 0; child 0; total 0

T6  k-byte destructive assignment
    original:      v = w;
    instrumented:  {b = v; v = w;}
    reverse:       v = b;
    bits:          self 8k; child 0; total 8k

T7  sequence
    original:      s1; s2; ... sn;
    instrumented:  s1; s2; ... sn;
    reverse:       inv(sn); ... inv(s2); inv(s1);
    bits:          self 0; children x1 + ... + xn; total x1 + ... + xn

T8  jump (label lbl as target of n goto's)
    original:      goto lbl; s1; ... goto lbl; sn; lbl: s;
    instrumented:  b=1; goto lbl; s1; ... b=n; goto lbl; sn; b=0; lbl: s;
    reverse:       inv(s); switch(b) {case 1: goto label1; ... case n: goto labeln;}
                   inv(sn); labeln: ... inv(s1); label1:
    bits:          self lg(n+1); child 0; total lg(n+1)

T9  nestings of T0-T8
    Apply the above rules recursively.

Table 4.1: Summary of treatment of various statement types. Generation rules and
upper bounds on state size requirements for supporting reverse computation. s, or
s1..sn, are any of the statements of types T0..T7. inv(s) is the corresponding reverse
code of the statement s. b is the corresponding state-saved bits “belonging” to the
given statement. The operator =@ is the inverse of a constructive operator @=
(e.g., -= for +=) [16].
sequence of operations. The reversal of a sequence is the reverse ordering of the
lines' inverses. This can be observed in the figure.
Forward:
1 M->RC.dup_count = SV->dup_count;
2 SV->dup_count = 0;
3 ack = SV->unack;
4 SV->unack = M->ack + g_mss;
5 tcp_host_update_cwnd(SV,CV,M,lp);
6 tcp_host_update_rtt(SV,CV,M,lp);
7 while(send seq nums) {
8 M->seq_num++;
9 tcp_util_event();
10 SV->seq_num += g_mss;
11 SV->sent_packets++;
12 }
13 M->ack = ack;
Reverse:
1. while(M->seq_num) {
2. M->seq_num--;
3. SV->seq_num -= g_mss;
4. SV->sent_packets--;
5. }
6. tcp_host_update_rtt_rc(SV,CV,M,lp);
7. tcp_host_update_cwnd_rc(SV,CV,M,lp);
8. ack = SV->unack;
9. SV->unack = M->ack;
10. M->ack = ack - g_mss;
11. SV->dup_count = M->RC.dup_count;
Figure 4.1: Forward and reverse of TCP correct ack
The first operation that is performed is the resetting of the duplicate count
value, SV->dup_count = 0. This is a destructive assignment, T6. The dup_count
has a maximum value of 3, which implies that only two bits are required for the state
saving. The reverse of this assignment is the restoring of the previous state, SV->dup_count
= M->RC.dup_count.
Next in the code is another destructive assignment, SV->unack = M->ack +
g_mss. Observe that the destructive assignment is setting the value to a value in
the message plus some constant. That means the value in the message can be
derived from the result of the destructive assignment. This enables the message
value, M->ack, to be used for state saving: M->ack stores the value which
the destructive assignment destroyed. This technique is called a swap.
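The swap can be demonstrated in isolation. The struct definitions below are minimal stand-ins for the model's message and state (only the fields involved are shown); the forward code mirrors lines 3, 4 and 13 of Figure 4.1.

```c
#include <assert.h>

#define G_MSS 960 /* maximum segment size used in the experiments */

struct msg   { int ack;   }; /* stand-in for the event message */
struct state { int unack; }; /* stand-in for the host LP state */

/* Forward: destructive assignment to SV->unack, reversed at zero bit
 * cost by parking the destroyed value in the message ("swap"). */
static void fwd(struct state *sv, struct msg *m)
{
    int ack = sv->unack;        /* value about to be destroyed */
    sv->unack = m->ack + G_MSS; /* destructive assignment */
    m->ack = ack;               /* old value now lives in the message */
}

static void rev(struct state *sv, struct msg *m)
{
    int ack = sv->unack;   /* holds the original m->ack plus G_MSS */
    sv->unack = m->ack;    /* restore the destroyed state */
    m->ack = ack - G_MSS;  /* recover the original message field */
}
```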
The following two lines are tcp_host_update_cwnd() and tcp_host_update_rtt(),
which are function calls. A function call is classified as a type T4 operation.
The reversal is just the calling of the reverse of that function. For the above functions
the reversals are tcp_host_update_rtt_rc() and tcp_host_update_cwnd_rc(). Note
that if the code is fairly trivial it could be inlined instead of undergoing the
cost of a function call.
Next, there is a loop to see if more data can be sent. This is classified as a T3,
variable iterations. M->seq_num is used to store the number of iterations that the
while loop performs. The reversal is to iterate the reverse code M->seq_num times,
while(M->seq_num).
Now the sequence number is incremented by a constant. This is a constructive
assignment, T5, and the reverse is merely the decrement by that constant. The new
packet is sent with tcp_util_event(). This function just sends an event, whose
reversal is taken care of by the simulator, ROSS.
The statistic of the number of packets sent is incremented, SV->sent_packets++.
This is a constructive assignment and therefore can be easily reversed with the
decrement assignment, SV->sent_packets--.
In Figure 4.2 the congestion control window is being updated. Here the congestion
control window is destructively assigned because of its floating-point type. Its
previous value is state-saved, and in the reverse the value is restored, SV->cwnd =
SV->RC.cwnd.
In the TCP host model there is handling of duplicate acknowledgments, shown
in Figure 4.3. First there is a simple choice, T0, to see if the acknowledgment
is a duplicate. The reverse is to check whether the branch was taken by testing the bit
value, if(CV->c3), and then executing the appropriate reverse code.
The following operation that is performed is the incrementing of the duplicate
count variable, SV->dup_count++. This increment is classified as a constructive as-
Forward:

1. SV->RC.cwnd = SV->cwnd;
2. if((SV->cwnd * g_mss < SV->ssthresh)
       && (SV->cwnd * g_mss < TCP_SND_WND)) {
3.   SV->cwnd += 1;
4. }
5. else if (SV->cwnd * g_mss < TCP_SND_WND) {
6.   SV->cwnd += 1/SV->cwnd;
7. }

Reverse:

1. SV->cwnd = SV->RC.cwnd;

Figure 4.2: Forward and reverse of TCP updating cwnd
Forward:
1. else if((CV->c3 = ((SV->unack - g_mss) == M->ack))) {
2. SV->dup_count++;
3. if((CV->c4 = (SV->dup_count == 3))) {
4. M->dest = SV->ssthresh;
5. SV->ssthresh = (min(((int) SV->cwnd + 1),g_recv_wnd)/2)*g_mss;
6. M->RC.cwnd = SV->cwnd;
7. SV->cwnd = 1;
8. tcp_util_event(SEQ);
9. }
10 }
Reverse:
1. else if(CV->c3) {
2. SV->dup_count--;
3. if(CV->c4) {
4. SV->ssthresh = M->dest;
5. SV->cwnd = M->RC.cwnd;
6. }
7. }
Figure 4.3: Forward and reverse of TCP handling a duplicate ack
signment, T5. The reverse of this operation is just the decrement, SV->dup_count--.
Next there is a check for whether the TCP host received its third duplicate acknowledgment,
if((CV->c4 = (SV->dup_count == 3))). The branch direction is stored
in the fourth location in the bit field, CV->c4. This is an example of a T0, simple
choice. The reverse is to check the bit from the bit field and execute the appropriate
reverse code, if(CV->c4).
The SV->ssthresh is then destructively assigned its new value. Its previous
value is state-saved in the destination field of the message. Since the message is already
at its destination, that value is no longer needed and the space can be used for
storage. The reverse is to restore the value of ssthresh, SV->ssthresh = M->dest.
Finally the congestion control window is updated. This is another destructive
assignment, T6. Its previous value is state saved in M→RC.cwnd and the reversal is
trivial.
Figure 4.4 shows the code for the TCP host model's sequence number handling.
First there is a simple choice, T0, to see if the sequence number is the expected
value. The direction of the branch is stored in CV->c2, which is later checked in
the reversal to select the appropriate reverse code to execute.
The next operation that is performed is the increment of the received-packets
statistic, SV->received_packets++. This increment is classified as a constructive assignment.
The reverse of this operation is the decrement, SV->received_packets--.
Now the sequence number is incremented by a constant. This is a constructive
assignment, T5, and the reverse is merely the decrement by that constant,
SV->seq_num -= g_mss.
There is a loop to determine the sequence number that should be acknowledged.
This loop is classified as a variable iteration, T3. M->RC.dup_count is used
to save the number of iterations. The reversal is to iterate the reverse code that many
times, while(M->RC.dup_count).
The out-of-order buffer entry for the next sequence number is assigned zero;
since its prior value is known to be one, the reverse is to assign it the value of one. The
sequence number to be acknowledged is incremented by a constant. The reverse is to
decrement the sequence number by that constant, SV->seq_num -= g_mss.
Forward:

1.  if((CV->c2 = (M->seq_num == SV->seq_num))) {
2.    SV->received_packets++;
3.    SV->seq_num += g_mss;
4.    while(SV->out_of_order[(SV->seq_num / (int) g_mss) % g_recv_wnd]) {
5.      M->RC.dup_count++;
6.      SV->out_of_order[(SV->seq_num / (int) g_mss) % g_recv_wnd] = 0;
7.      SV->seq_num += g_mss;
8.    }
9.    tcp_util_event(ACK);
10. }
11. else if((CV->c3 = (M->seq_num > SV->seq_num))) {
12.   SV->out_of_order[(M->seq_num / g_mss) % g_recv_wnd] = 1;
13.   tcp_util_event(ACK);
14. }

Reverse:

1.  if(CV->c2) {
2.    SV->received_packets--;
3.    SV->seq_num -= g_mss;
4.    while(M->RC.dup_count) {
5.      SV->out_of_order[(SV->seq_num / (int) g_mss) % g_recv_wnd] = 1;
6.      SV->seq_num -= g_mss;
7.      M->RC.dup_count--;
8.    }
9.  }
10. else if(CV->c3) {
11.   SV->out_of_order[(M->seq_num / (int) g_mss) % g_recv_wnd] = 0;
12. }

Figure 4.4: Forward and reverse of TCP process sequence number
The simple choice of whether the packet's sequence number is out of order is stored
in the third bit field location. For the reversal of this simple choice the bit field is
checked. If the packet was out of order, its value in the out_of_order array was set
to one; the reversal is to set that value back to zero.
4.3.4 TCP Model Validation
The Scalable Simulation Framework Network models (SSFNet) [99] has a set of
validation tests which show the basic behavior of TCP. In this subsection we show
[Plot omitted: time (seconds) vs. number (bytes mod 14000); traces for ACKnoPiggy and Packet SEQno, from serv_tcpdump_0.out.]

Figure 4.5: Comparison of SSFNet's and ROSS' TCP models based on sequence
number for TCP Tahoe retransmission timeout behavior. Top panel is ROSS and
bottom panel is SSFNet.
TCP Tahoe’s behavior with respect to congestion avoidance and retransmissions.
The sequence number and congestion window plots are shown in Figures 4.5
and 4.6 respectively. This test is configured with a server and a client TCP session
with a router in between. The bandwidth is 8 Mb/sec from the server to the router
with a 50 ms delay. The link from the client to the router had a bandwidth of 800
kilobits per second with a 300 ms delay and a buffer of 6000 bytes. The server was
transferring a file of 13,000 bytes. In the test, packets 11, 12 and 13 were dropped.
[Plot omitted: time (seconds) vs. cwnd, rwnd & ssthresh (bytes), from serv_cwnd_0.out.]

Figure 4.6: Comparison of SSFNet and ROSS' TCP models based on congestion
window for TCP Tahoe retransmission timeout behavior test. Top panel is ROSS
and bottom panel is SSFNet.
Figure 4.5 shows the sequence number plots for both SSFNet and ROSS’ TCP
models covering the Tahoe retransmission test. Observe that since we did not implement
handshaking, the first sequence number and acknowledgment are different;
however, after the handshaking period, both graphs are in alignment with respect
to sequence numbers. The acknowledgments were for the sequence number that
was acknowledged, which matches the implementation in the NS model. However,
SSFNet implements acknowledgments for the next expected
[Plots omitted: time (seconds) vs. number (bytes mod 90000); traces for ACKnoPiggy and Packet SEQno, from serv_tcpdump_0.out and f1.tcpdump.out.]

Figure 4.7: Comparison of SSFNet's and ROSS' TCP models based on sequence
number for TCP Tahoe fast retransmission behavior. Top panel is ROSS and
bottom panel is SSFNet.
sequence number.
The advertised window is global; it is known at the time of transfer and there-
fore does not start at zero. Other than that discrepancy, both models are in align-
ment with respect to congestion window behavior, as shown in Figure 4.6.
Next, we consider a more sophisticated validation test. Here, TCP Tahoe’s
behavior with congestion avoidance and fast retransmission is examined. The topol-
ogy was the same except that the delay between the server and the router was 5 ms
and the delay at the client was 100 ms.
[Plots omitted: time (seconds) vs. cwnd, rwnd & ssthresh (bytes), from serv_cwnd_0.out and f1.wnd_6_100.out.]

Figure 4.8: Comparison of SSFNet's and ROSS' TCP models based on congestion
window for TCP Tahoe fast retransmission behavior test. Top panel is ROSS and
bottom panel is SSFNet.
As can be seen from Figures 4.7 and 4.8, our implementation behaves very
similarly to SSFNet with respect to sequence number and congestion window behavior.
The packet drop happens at similar times, as does the fast retransmission.
4.4 TCP Model Performance Study
4.4.1 Hyper-Threaded Computing Platform
Some of our experiments were conducted on a dual Hyper-Threaded Pentium-4
Xeon processor system running at 2.8 GHz. Hyper-Threading is Intel's name for a
simultaneous multithreaded (SMT) architecture [62]. SMT supports the co-scheduling
of many threads or processes to fill-up unused instruction slots in the pipeline caused
by control or data hazards. Because the system knows that there can be no con-
trol or data hazards between threads, all threads or processes that are ready to
execute can be simultaneously scheduled. In the case of threads that share data,
mutual exclusion is guarded by locks. Consequently, the underlying architecture
need not know about shared variables or how they are used at the program level.
Additionally, because the threads assigned to the same physical processor share the
same cache, there is no additional hardware needed to support a cache-coherency
mechanism.
Intel's Hyper-Threaded architecture supports two instruction streams per processor
core [47]. From the OS scheduling point-of-view, each physical processor appears
as if there are two distinct processors. Under this mode of operation, an application
must be threaded to take advantage of the additional instruction streams. The dual-processor
configuration behaves as if it were a quad-processor system. Because of
the multiple instruction streams per processor, we report instruction stream (IS) counts
instead of processor counts in our performance study, to avoid confusion
between physical processor counts and virtual processors or separate instruction
streams.
The total amount of physical RAM is 6 GB. The operating system is Linux,
version 2.4.18 configured with the 64 GB RAM patch. Here, each process or group
of threads (globally sharing data) is limited to a 32 bit address space, where the
upper 1 GB is reserved for the Linux kernel. Thus, an application is limited to
3 GB for all code and data (both heap and stack space and thread control data
structures).
4.4.2 Quad and Dual Pentium-3 Platform
For the quad-processor platform we used the platform described in Section 3.4.2.
For the dual-processor platform we used a cluster of 40 dual-processor machines
running at 0.9 GHz, connected by 100 Mbit Ethernet. Each dual-processor machine
had 512 MB of main memory and 256 KB of L2 cache.
4.4.3 TCP Model’s Configuration
For all experiments on the AT&T and the synthetic topologies, each TCP connection
maintained a consistent configuration. The transfer size was infinite, leading
to the transfers running for the duration of the simulation. The maximum segment
size was set to 960 bytes and the total size of all headers was 40 bytes. The resulting
packet size of 1,000 bytes was consistent with the size used in SSFNet's validation
tests. The initial sequence number was initialized to zero and the slow-start
threshold was 65,536.
We did, however, use a window size of 16 for the one-million-host case, which
differed from the default window size of 32. We set the window to 16 hoping to cut
down on the number of packets in the system. Later we found that changing
the window size did not have an impact, due to the limits on the bandwidth in the
system. The congested links force the TCP window to shrink, so that the slow-start
threshold becomes small and the upper bound on the window size is never
encountered. A similar behavior has been recently noted in [71].
All clients and servers in the AT&T and the synthetic topologies were con-
nected by having the first half of hosts randomly connect to the second half of hosts.
There was a distinct client-server pair for each TCP connection in the simulation.
Because of the random nature of connections, there was a high percentage of “long-
haul” links that resulted in a large number of remote events scheduled between
threads.
4.4.4 Synthetic Topology Experiments
The synthetic topology was fully connected at the top and had 4 levels. A
router at one level had N lower-level routers or hosts connected to it. The number of
nodes was equal to N^4 + N^3 + N^2 + N. N was varied over 4, 8, 16, and 32.
The nodes were numbered in such a way that the next hop could be calculated on
the fly at each hop.
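One numbering scheme that permits this (offered as an illustration, not necessarily the scheme actually used) is level-order numbering of the N-ary tree, in which node i's parent is (i - 1)/N. The next hop toward a destination is then computable from the two node numbers alone. Note that in the thesis topology the top level is fully connected, so the climb toward the root would stop at a core router; the sketch below ignores that detail.

```c
#include <assert.h>

/* Illustrative next-hop computation for a level-order-numbered N-ary
 * tree: node 0 is the root, node i's children are N*i+1 .. N*i+N. */
static int tree_next_hop(int n, int cur, int dest)
{
    if (cur == dest)
        return cur;
    /* climb from dest toward the root, looking for cur */
    int node = dest;
    while (node > 0) {
        int parent = (node - 1) / n;
        if (parent == cur)
            return node;  /* dest lies in this child's subtree: descend */
        node = parent;
    }
    return (cur - 1) / n; /* dest is not below us: forward to our parent */
}
```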
The bandwidth, delay and buffer size parameters for the synthetic topology are as follows:
• 2.48 Gb/sec, a delay of 30 ms, and 3 MB buffer,
• 620 Mb/sec, a delay between 10 ms to 30 ms, and 750 KB buffer,
• 155 Mb/sec, a delay of 5 ms, 10ms and 30ms, and 200 KB buffer,
• 45 Mb/sec, a delay of 5 ms, and 60 KB buffer,
• 1.5 Mb/sec, a delay of 5 ms, and 20 KB buffer,
• 500 Kb/sec, a delay of 5 ms, and 15 KB buffer.
Here, we considered 3 bandwidth scenarios: (i) high, which has 2.48 Gb/sec
for the top-level router link bandwidths, and each lower level in the network topology
uses the next lower bandwidth shown above yielding a host bandwidth of 45 Mb/sec,
(ii) medium, which starts with 620 Mb/sec and goes down to 1.5 Mb/sec at the
end host, and (iii) low, which starts with 155 Mb/sec and goes down to 500 Kb/sec
at the end host. We note that these bandwidths and link delays are realistic relative
to networks in practice.
Our tests were run on 1, 2 and 4 instruction streams (IS). The synthetic
topology was mapped with each core router and all of its children assigned to the same
processor.
Table 4.2 shows the performance results for all synthetic topology scenarios
across varying numbers of available instruction streams on the Hyper-Threaded
system. For all configurations, we report an extremely high degree of efficiency.
The lowest efficiency is 97.4%, and to our surprise we observe a large number of
zero-rollback cases for 2 and 4 instruction streams, resulting in 100% simulator
efficiency. We observe that the amount of available work per instruction stream
retards the rate of forward progress of the simulation, particularly as N grows and
the bandwidth increases. Thus, remote messages arrive ahead of when they need
to be processed resulting in almost perfect simulator efficiency. This result holds
despite an inherently small lookahead, which is a consequence of link delay, and a
relatively large amount of remotely scheduled work, ranging from 7% to 15%.
Recall that our link delays range from as small as 5 ms at the low network levels to only
about 30 ms at the top router level.
The observed speedup ranges between 1.2 and 1.6 on the dual-hyper-threaded
processor system. These speedups are very much in line with what one would expect,
particularly given the memory size of the models at hand relative to the small level-2
N   Bandwidth  IS  EvRate  Effic   % Remote  SU
4   500 Kb     1   441692  NA      NA        NA
4   500 Kb     2   535093  99.388  7.273     1.211
4   500 Kb     4   660693  97.411  14.308    1.495
4   1.5 Mb     1   386416  NA      NA        NA
4   1.5 Mb     2   440591  99.972  7.125     1.140
4   1.5 Mb     4   585270  99.408  14.195    1.516
4   45 Mb      1   402734  NA      NA        NA
4   45 Mb      2   440802  99.445  7.087     1.094
4   45 Mb      4   586010  99.508  14.312    1.612
8   500 Kb     1   210338  NA      NA        NA
8   500 Kb     2   270249  100     7.273     1.284
8   500 Kb     4   331451  99.793  10.746    1.575
8   1.5 Mb     1   177311  NA      NA        NA
8   1.5 Mb     2   237496  100     7.313     1.339
8   1.5 Mb     4   287240  99.993  10.823    1.619
8   45 Mb      1   176405  NA      NA        NA
8   45 Mb      2   221182  99.999  7.259     1.253
8   45 Mb      4   257677  99.996  10.758    1.460
16  500 Kb     1   128509  NA      NA        NA
16  500 Kb     2   172542  100     7.091     1.342
16  500 Kb     4   199282  99.987  10.600    1.550
16  1.5 Mb     1   100980  NA      NA        NA
16  1.5 Mb     2   137493  100     7.092     1.361
16  1.5 Mb     4   153454  99.998  10.626    1.519
16  45 Mb      1   99162   NA      NA        NA
16  45 Mb      2   117312  100     7.102     1.183
16  45 Mb      4   145628  99.999  10.648    1.468
32  500 Kb     1   80210   NA      NA        NA
32  500 Kb     2   108592  100     7.058     1.353
32  500 Kb     4   126284  100     10.586    1.57
32  1.5 Mb     1   75733   NA      NA        NA
32  1.5 Mb     2   90526   100     7.052     1.20
Table 4.2: Performance results measured in speedup (SU) for N = 4, 8, 16, 32 synthetic topology networks for low (500 Kb), medium (1.5 Mb) and high (45 Mb) bandwidth scenarios on 1, 2 and 4 instruction streams (IS) on a dual Hyper-Threaded 2.8 GHz Pentium-4 Xeon. Efficiency is the net events processed (i.e., excludes rolled events) divided by the total number of events. Remote is the percentage of the total events processed sent between LPs mapped to different threads/instruction streams.
cache. We note that we were unable to execute the N = 32, 45 Mb bandwidth case.
This aspect and memory overheads are discussed in the paragraphs below.
N   Bandwidth  Max Ev-list Size  Mem Size
4   500 Kb     4,792             3 MB
4   1.5 Mb     5,376             3 MB
4   45 Mb      5,376             3 MB
8   500 Kb     45,759            11 MB
8   1.5 Mb     85,685            17 MB
8   45 Mb      86,016            17 MB
16  500 Kb     522,335           102 MB
16  1.5 Mb     1,217,929         202 MB
16  45 Mb      1,380,021         226 MB
32  500 Kb     5,273,847         1,132 MB
32  1.5 Mb     6,876,362         1,364 MB
Table 4.3: Memory requirements for N = 4, 8, 16, 32 synthetic topology networks for low (500 Kb), medium (1.5 Mb) and high (45 Mb) bandwidth scenarios on 1, 2 and 4 instruction streams on a dual Hyper-Threaded 2.8 GHz Pentium-4 Xeon. Optimistic processing only required 7,000 more event buffers (140 bytes each) on average, which is less than 1 MB.
The memory footprint of each model is shown as a function of nodes and
bandwidth in Table 4.3. We report a steady increase in the memory requirements
and the event-list sizes as the bandwidths and the number of nodes in the net-
work increase. The peak memory usage is almost 1.4 GB of RAM for the N = 32,
1.5 Mb bandwidth scenario. The amount of additional memory allocated for opti-
mistic processing is 7,000 event buffers which is less than 1 MB. Thus, for 524,288
TCP connections, this model only consumes 2.6 KB per connection including event
data. By comparison, Nicol [71] reports that NS consumes 93 KB per connection,
SSFNet (Java version) consumes 53 KB, JavaSim consumes 22 KB per connection
and SSFNet (C++ version) consumes 18 KB for the "dumbbell" model, which contains
only two routers. Our topology is obviously different from the dumbbell model,
and therefore the memory usages are not directly comparable.
Last, we find that there is an interplay in how the event population is affected
by the network size, topology, bandwidth and buffer space. In examining the memory
utilization results, we find that the maximum observed event population differs
by only a moderate amount for the 1.5 Mb versus the 45 Mb case when N = 16
despite a rather significant change in the network buffer capacity. However, we were
unable to execute the 45 Mb scenario when N = 32 because it requires more than
17,000,000 events, which is the maximum we can allocate for that scenario without
exceeding operating system limits. This is because there are many more hosts at
a high bandwidth, resulting in much more of the available buffer capacity being
occupied with packets waiting for service. This case results in a 2.5-times increase
in the amount of required memory. This suggests that model designers will have to
perform some capacity analysis, since a network's memory requirements may explode
after passing some size, bandwidth or buffer capacity threshold, as observed here.
4.4.5 Hyper-Threaded vs. Multiprocessor System
SysConfig             EvRate  % Effic  % Remote  SU
1 IS, Hyper-Threaded  220098  NA       NA        NA
2 IS, Hyper-Threaded  313167  100      0.05      1.42
4 IS, Hyper-Threaded  375850  100      0.05      1.71
1 PE, Pentium-III     101333  NA       NA        NA
2 PE, Pentium-III     183778  100      0.05      1.81
4 PE, Pentium-III     324434  100      0.05      3.20
Table 4.4: Performance results measured in speedup (SU) for the N = 8 synthetic topology network with medium bandwidth on 1, 2 and 4 instruction streams (dual Hyper-Threaded 2.8 GHz Pentium-4 Xeon) vs. 1, 2 and 4 processors (quad 500 MHz Pentium-III).
In this series of experiments we compare a standard quad processor system
to our dual, hyper-threaded system in order to better quantify our performance
results relative to past processor technology. The network topology is the same as
previously described with N = 8; thus there are 4,680 LPs in this simulation. We
did, however, modify the TCP connections such that they are more locally centered:
in total, 87% of all TCP connections were within the same kernel process (KP),
which reduced the amount of remote messages.
We observe that the dual processor with hyper-threading outperforms the
quad processor system without hyper-threading by 16%. The respective speedups
relative to their own sequential performance are 3.2 for the quad processor and 1.7
for the dual hyper-threaded system.
Additionally, we observe 100% simulator efficiency for all parallel runs. We
attribute this phenomenon to the low number of remote messages and the large
amount of work (event population) per unit of simulation time.
4.4.6 AT&T Topology
Figure 4.9: AT&T Network Topology (AS 7118) from the Rocketfuel databank for the continental U.S.
The core US AT&T network topology contains 13,173 router nodes and 38,164
links. What makes Internet topologies like the AT&T network both interesting and
challenging from a modeling perspective is the sparseness and power-law
structure [98] that they exhibit. In the case of AT&T, there are fewer than 3 links per
router on average. However, at the super core there is a high degree of connectivity.
Typically, an Internet service provider's super core will be configured as a fully
connected mesh. Consequently, backbone routers will have up to 67 connections to
other routers, some of which are other backbone or super core routers, with the
remaining links going to region core routers. Once at the region core level, the number of links per
router reduces and thus the connectivity between other region cores is sparse.
In performing a breadth-first search of the AT&T topology, there are eight
distinct levels. At the backbone, there are 414 routers. Each successive level has
the following router count: 4861, 5021, 1117, 118, 58, 6 and 5 nodes. There were
a number of routers not directly reachable from within this network. Those routers
are most likely transit routers going strictly between autonomous systems (AS).
With the transit routers removed, our AT&T network scenario has 11,670 routers.
Link weights are derived based on the relative bandwidth of the link in comparison
to other available links. In this configuration, routing is kept static; however,
we have dynamic routing currently working in a light-weight OSPF model, which we
plan to integrate with our TCP model in the near future.
The bandwidth, delay, and buffer size for the AT&T topology are as follows:
Level 0 router: 9.92 Gb/sec with a delay randomly selected between 10 ms and 30
ms, and a 12.4 MB buffer; Level 1 router: 2.48 Gb/sec, a delay randomly selected
between 10 ms and 30 ms, and a 3 MB buffer; Level 2 router: 620 Mb/sec with a delay
randomly selected between 10 ms and 30 ms, and a 750 KB buffer; Level 3 router:
155 Mb/sec with a delay of 5 ms, and a 200 KB buffer; Level 4 router: 45
Mb/sec, a delay of 5 ms, and a 60 KB buffer; Level 5 router: 1.5 Mb/sec, a
delay of 5 ms, and a 20 KB buffer; Level 6 router: 1.5 Mb/sec, a delay of 5
ms, and a 20 KB buffer; Level 7 router: 500 Kb/sec, a delay of 5 ms, and a 5
KB buffer; link to all hosts: 70 Kb/sec, a delay of 5 ms, and a 5 KB buffer.
Hosts are connected to the network at PoP-level routers. These routers only
have one link to another higher-level router. In the medium-size configuration there
were 10 hosts per PoP-level router, totaling 96,500 nodes (hosts plus routers).
In the large configuration there were 30 hosts per PoP, totaling 266,160 LPs. In each
configuration, half the hosts establish a TCP session to a randomly selected receiving
host. We observe this configuration is almost pathological for a parallel network
simulation because the amount of remote network traffic will be much greater than
is typical in practice. The amount of remote message traffic is much greater than in
the synthetic network topology because of the network's sparse structure. Our goal is to
SysConfig     EvRate  % Effic  % Remote  SU
medium, 1 IS  138546  NA       NA        NA
medium, 2 IS  154989  99.947   52.030    1.12
medium, 4 IS  174400  99.005   78.205    1.25
large, 1 IS   127772  NA       NA        NA
large, 2 IS   143417  99.956   51.976    1.12
large, 4 IS   165197  99.697   78.008    1.29
Table 4.5: Performance results measured in speedup (SU) for the AT&T network topology for medium (96,500 LPs) and large (266,160 LPs) configurations on 1, 2 and 4 instruction streams (IS) on the dual-hyper-threaded system.
demonstrate simulator efficiency under high-stress workloads for realistic topologies.
We observe over 99% efficiency for the 2 and 4 IS runs as shown in Table 4.5,
yet there is a substantial reduction in the overall obtained speedup. Here, we report
speedups for the 4 IS cases of 1.25 for the medium size network and 1.29 for the
large. We attribute this reduction to the enormous amount of remote messages sent
between instruction streams/processors. With a round-robin mapping, the AT&T
network topology results in 50% to almost 80% of all processed events being remotely
scheduled. We hypothesize that this behavior on the part of the model reduces the
memory locality and results in much higher cache miss rates. Consequently, all
instruction streams are spending more time stalled waiting for memory requests to
be satisfied. However, we note that more investigation is required to fully understand
this behavior.
The memory requirements for the AT&T scenario were 269 MB for the medium
size network and 328 MB for the large size network. This yields a per-TCP-connection
overhead of 2.8 KB and 1.3 KB, respectively. The reason for the reduction per
connection is that the amount of network buffer space, which affects the peak event
population, did not change while the number of connections increased.
4.4.7 Campus Network
The campus network has been used to benchmark many network simula-
tors [35, 61, 92, 93, 101] and is an interesting topology for network experimen-
tation [29]. The campus network is shown in Figure 4.10 and is comprised of 4
Figure 4.10: Campus Network [59].
servers, 30 routers, and 504 clients for a total of 538 nodes [59]. Limitations in our
TCP model caused us to have 504 servers, but our model is still comparable. We
will therefore refer to our campus network model as having 538 nodes for ease of
comparison.
The campus network is comprised of 4 different networks. Network 0 consists
of 3 routers, where node zero is the gateway router for the campus network. Network
1 is composed of 2 routers and the servers. Network 2 is comprised of 7 routers, 7
LAN routers, and 294 clients. Network 3 contains 4 routers, 5 LAN routers, and 210
clients [35]. All router links have a bandwidth of 2 Gb/s and a propagation
delay of 5 ms, with the exception of the network 0 to network 1 links, which have a
delay of 1 ms. The clients are connected to their LAN routers with links of bandwidth
Figure 4.11: Ring of 10 Campus Networks [35].
100 Mb/s and 1 ms delay.
For our experiments we connected multiple campus networks together at their
gateway routers to form a ring. The links connecting the campuses together were
2 Gb/s with delays of 200 ms. Figure 4.11 shows a ring of 10 of these campuses
connected. The traffic was comprised of clients in one domain connecting to the
server in the next domain in the ring. The server would transfer 500,000 bytes back
to the client application.
We connected 1,008 campus networks together to create a network of size
542,304 nodes. Figure 4.12 shows a super-linear speedup for the one-processor-per-node
case but not for the two-processor case. This can be attributed to the decrease in
context switching relative to the two-processor-per-node case. This is a similar result to
[Plot: packet rate (0 to 1e+07 packets per second) versus number of processors (4 to 72) for the Distributed ROSS TCP-Tahoe model, with curves for 1 CPU per node, 2 CPUs per node, and linear performance.]
Figure 4.12: Packet rate as a function of the number of processors.
what was observed in the Network Atomic Operations paper [8]. Our fastest packet
rate of 8.7 million packets per second was observed on 72 processors.
        2 nodes     4 nodes     8 nodes      16 nodes     32 nodes
ROSS    341853 p/s  730720 p/s  1493702 p/s  2817954 p/s  5508404 p/s
PDNS    X           X           X            in swap      1069905 p/s
Table 4.6: Performance results measured for ROSS and PDNS for a ring of 256 campus networks. Only one processor was used on each computing node.
We tried to run a model of similar size in PDNS on our cluster but the memory
requirements were too high. The largest PDNS model that we could get to run was
137,728 network nodes, and that was on 32 computing nodes. We were able to execute this model on two
nodes in ROSS and the results and performance numbers can be seen in Table 4.6.
ROSS was able to achieve 5.5 million packets per second whereas PDNS processed
1.07 million packets per second. This shows that our ROSS implementation has
a significantly smaller memory footprint and achieves a 5.14-times speedup over
PDNS.
4.5 Related Work
For networking models, much of the current research in parallel simulation
is largely based on conservative algorithms. For example, PDNS [94] is a paral-
lel/distributed network simulator that leverages HLA-like technology to create a
federation of NS [74] simulators. SSFNet [26, 99] and TasKit [110] use Critical
Channel Traversing (CCT) [110] as the primary synchronization mechanism. The
Dartmouth implementation of SSF (DaSSF) employs a hybrid technique called Composite
Synchronization [72].
Recent optimistic simulation systems for network models include TeD [82],
which is a process-oriented framework for constructing high-fidelity telecommuni-
cation system models. Premore and Nicol [85] implemented a TCP model in TeD,
however, no performance results are given. USSF [87] is an optimistic simulation
system that dramatically reduces model runtime state by LP aggregation and by
swapping LPs out of core. Additionally, USSF proposes executing the simulation
unsynchronized using its NOTIME approach. Based on the results here, a NOTIME
synchronization could prove beneficial for large-scale TCP models. Unger et al. simulate a
large-scale ATM network using an optimistic approach [104]. They report speedups
ranging from 2 to 7 on 16 processors and indicate that the optimistic approach
outperforms a conservative protocol on 5 of the 7 tested ATM network scenarios.
Finally, Genesis (the General Network Simulation Integration System) is a
novel approach to scalably and efficiently simulate networks. The Genesis system
divides the networks into domains and simulates each with a separate processor
for a given time interval. Packet-level simulation is done for each domain but syn-
chronizations of the flows are done over the time interval. Upon finishing each
simulation, the domains exchange information about the flows and starts a new it-
88
eration over the same time interval. This continues until all simulations converge,
and then the system moves to a new time interval. This method has been proven
successful in many difficult tasks, such as accurately simulating TCP (Transmission
Control Protocol), which constantly adjusts to network conditions [90, 102].
4.6 Conclusions
The parallel TCP model proved to be extremely efficient, with very few
rollbacks observed. Parallel simulator efficiency ranged from 97% to 100% (i.e., zero
rollbacks). The model was implemented to be as memory efficient as possible, which
allowed for the million node topology to be executed. We observed model mem-
ory requirements between 1.3 KB to 2.8 KB per TCP connection depending on
the network configuration (size, topology, bandwidth and buffer capacity). Our ex-
periments on the campus network showed excellent performance and super-linear
speedup. We also showed our simulator was 5.14 times faster than PDNS and that
its memory footprint was significantly smaller.
Last, the hyper-threaded system was able to provide a low cost-to-performance
ratio. What is even more interesting is that these systems blur the lines in terms
of sequential versus parallel processing. Here, to obtain higher rates of performance
from a single processor, one has to resort to executing the model in parallel.
CHAPTER 5
Reverse Memory Subsystem
5.1 Introduction
One of the more difficult problems simulation developers face is designing ef-
ficient models. There are two major domains of efficiency to be concerned with:
memory and computation. The performance of large-scale simulation relies primarily
on the memory usage of the system. Traditional state saving has been shown
to be unacceptable, often requiring several times the memory of a sequential
execution. An alternative to state saving is reverse computation.
The tradeoff is that modelers must not only create models of complex problems,
but also their reverse computations. One difficulty with constructing an
efficient reversible model is managing memory in both the forward and the reverse
computations. To date, reverse computation lacks a simple mechanism for dynamic
operations over static memory.
While reverse computation has been shown to consistently outperform both
conservative and optimistic (with state saving) methods of simulation, it does not
overcome state saving entirely. Destructive operations must still be state saved in
some form. Floating point operations are typical examples because of the loss of
precision given operations such as divide and multiply. The usual solution is to save
the variable by swapping the destroyed value with a variable in the event. In the
forward execution of an event, the value of the logical process (LP) state variable
is swapped with the event variable. During a rollback, the value of the LP state
variable is restored not by reverse computation, but rather by swapping back the
event variable. Both the state and the event variables are then restored. This is a
form of limited state saving.
Another similar example is memory. LP states are typically more complex
than simple variables; LPs typically contain data structures and user defined objects.
These objects cannot be reverse computed without a large amount of effort on the
part of the model designer. Further, any solution implemented within the model
would likely take the form of state saving. As a generic example, consider a simple
linked list referenced within an LP state. During the execution of an event, an item
in the list must be deleted. Because this event may be rolled back, this list element
cannot simply be freed. Instead, a reference to the element must be saved for
reverse computation. This requires state saving because the list element cannot be
immediately reclaimed.
A reverse memory subsystem is required to address the problem of reverse
computing memory in an efficient way. Our Time Warp system maximizes cache
locality for events, and we apply this same technique in the reverse memory sub-
system. We will show in this thesis that reverse memory is plausible, and enhances
performance greatly. This new subsystem for reverse computing memory completes
the reverse computation solution.
5.2 Design Decisions
Certain assumptions were made in the design of our reverse memory subsys-
tem. The first challenge was how to statically allocate the memory that the model
would use throughout the lifetime of the simulation. For this purpose we chose
to allow the model to configure the amount and size of memory buffers. We con-
strained each memory initialization to be an array of fixed width memory buffers.
This simplifies the reverse memory subsystem greatly because when memory buffers
are needed or reclaimed, they are pulled from a single location. Offering models the
ability to allocate dynamic width memory buffers creates an overly complex memory
subsystem which would introduce fragmentation and would require a replacement
algorithm. Because we want our models to be as efficient as possible, we eliminated
this requirement by constraining the memory buffer sizes to a fixed width. This
is not too tight of a restriction on the model because each pool defines a different
width memory buffer, and the number of pools allocatable is only limited by the
physical memory available. Our intention was not to redesign or build a full memory
management system, but instead to provide the ability to allocate memory on the
fly in a simple and efficient manner.
Actual dynamic memory calls within the model can lead to severe performance
penalties and it is a well-accepted best practice to avoid them. Calls to malloc and
free carry a large penalty in a CPU-intensive system. The average time to perform
1,000 sequential calls to malloc is ~7 µs, and the cost of the same number of sequential
frees is ~1-2 µs, for a 1 KB allocation. One solution is to statically allocate the memory
in the init phase of a simulation and realize memory operations through this pool
of memory. This pool can be used to fulfill the model's memory needs; however, the
worst-case memory usage must be assumed. In order to use a statically allocated
memory pool, all of the memory required by the model must be provided in order
for the model to execute.
Models typically use memory in one of two forms: within the LP or model
global state, and within the event buffers themselves. Our CAVES model primarily
uses memory to create a linked list of views within the LP state. Another example
is a simple router network model. Routers send update messages; these updates can
be variable in size. We could have the message be the size of the maximum update,
but that is an inefficient solution because it assumes a worst-case allocation. We
could have messages of variable sizes. Certain message sizes are not going to be
used as frequently and the modeler would have to estimate the number of messages
needed for each size to perform the static allocation. In addition, there is the
hidden cost of copying the message data to the receiver's state. Instead, if we store
the updates in a memory buffer, and attach the updates to the events, the receivers
can simply unlink the updates from the event. This eliminates the copy, and allows
for a greater amount of memory reuse. For these types of memory usages it is clear
that a memory subsystem could be helpful. These building blocks can be linked
together and attached to events, and then used in the state of the LPs through a
pointer-based implementation. The pointers can be used in the state in all manners
of data structures, including: lists, trees, hashes and queues.
These building blocks seem to easily fulfill the needs of simple models. The
more difficult needs can still be met. For example, with an Internet routing protocol
model, the routers send a dynamic list data structure so that an arbitrary number of
hops can appear in the routes. These routes are comprised of lists of integers. If
the building block's data size was defined as an integer, the overhead of the block's
other components would make up the majority of the size. A better but more
difficult way to use the building blocks would be to have the building block’s data
component be the average or median size of the routes. This way, the overhead of
the building block's other components is not the dominating factor. There will
be internal fragmentation of the memory in the building blocks for routes with
small hop counts or for ones with hop counts larger than the average. This is still
better than having the messages be allocated with the worst-case route size.
5.3 Reverse Computing Memory
In the forward execution of an event, if an LP state variable is incremented,
then during a rollback of the same event, the LP state variable should be decre-
mented. This is the definition of reverse computation. When this technique is
applied to memory, the questions which arise are: how do we reverse compute a
memory allocation? Or worse, how do we reverse a memory de-allocation? What
does it mean to undo a free operation?
To further complicate matters, memory may be used in one of two ways.
First, memory may be allocated, attached to an event, and the event sent to another
LP. Second, memory may be allocated and maintained within the model state for
accounting purposes. Of course some mix of the two is also possible. The location
and use of the memory buffer within the system determines how it should be handled
during event rollback.
In the following sections we will examine the functions and variables required
for the reverse memory subsystem.
5.3.1 Internals of the Reverse Memory Subsystem
In the design decisions we discussed why we chose a memory buffer building
block. For the rest of the chapter the memory buffer building block will be referred
to as a memory buffer. In this subsection we will discuss the implementation of the
reverse memory subsystem.
struct tw_memoryq
{
    int size;
    int start_size;

    // the data size to be allocated for each memory buffer
    int data_size;
    int grow;

    // head and tail of the memory queue
    tw_memory *head;
    tw_memory *tail;
};

struct tw_memory
{
    // next and prev pointers so the memory buffer can be easily
    // linked into a list
    tw_memory *next;
    tw_memory *prev;

    // time at which this event can be collected
    tw_stime ts;

    // data segment of the memory buffer: application defined
    void *data;
};

struct tw_event
{
    // added memory buffer pointer
    tw_memory *memory;
};

struct tw_kp
{
    // added processed memory queues
    tw_memoryq *queues;
};

struct tw_pe
{
    // added free memory queues
    tw_memoryq *free_memq;
};
Figure 5.1: New data structures added to ROSS for the Reverse Memory Subsystem.
tw_kp_fossil_memory()
{
    // Nothing to free
    if (queue is NULL)
        return;

    // There are no memory buffers to free
    if (All memory buffers are greater than GVT)
        return;

    // All memory buffers can be freed to the PE
    if (All memory buffers are less than GVT)
    {
        Insert all memory buffers into the PE free queue
        return;
    }

    Find memory buffers that can be freed.
    Insert memory buffers into the PE free queue
    return;
}
Figure 5.2: Kernel Process Memory Buffers Fossil Collection Function.
The reverse memory subsystem is a list of free memory buffers. The memory
buffer (tw_memory) and memory list/queue (tw_memoryq) data structures are shown
in Figure 5.1. The memory buffer contains next and prev pointers for list
operations. The memory buffer also has a data pointer for the user-defined data and a
time stamp (ts) for fossil collection.
The user defines how many different sizes of memory buffers they wish to
have, along with how many memory buffers of each size are required. The memory
queue data structure (tw_memoryq) has a data_size variable, which is the data size
of the memory buffers in the queue. The initial number of memory buffers in the queue
is specified with the start_size variable. The user also sets the grow factor grow,
which indicates how many memory buffers should be allocated if a shortage occurs.
All the memory buffers are linked onto free lists of the correct type. The user can
easily allocate from each list. The free lists are attached to the processing element
(PE) just like the events are attached in ROSS. The variable for the lists is free_memq
void tw_event_free()
{
    // Code for freeing the event
    ...

    // New code for the memory buffer:
    // remove all memory buffers from the freed event
    m = event->memory;
    while (m)
    {
        event->memory = m->next;
        free m to the correct PE's free queue
        m = event->memory;
    }
}
Figure 5.3: Freeing of an event after annihilation.
and can be seen in Figure 5.1. The processed memory buffer queues are attached to
the Kernel Process (KP). This is again similar to how ROSS [12, 13] deals with
the processed event lists. KPs were added in ROSS to aggregate the processed event
lists of a collection of LPs. They dramatically reduce fossil collection overheads,
which justifies our placement of the processed memory buffer lists. More details on
ROSS's data structures can be found in Section 3.1.
The placement of the processed memory buffer list guarantees that the or-
der of the list is fully reversible. The memory buffer fossil collection function
(tw_kp_fossil_memory) is called immediately after the event fossil collection
function. The functions follow the same logic, just with different data types; the memory
fossil collection function can be seen in Figure 5.2. This function checks whether memory
buffers can be freed and frees them to the correct PE’s free memory queue.
Functionality was added to the simulation system to correctly free the memory
buffers upon event cancellation due to rollbacks. As stated in Chapter 2, the Time
Warp system allows for optimistic execution, which can lead to the rolling back of events.
One step in undoing an event is cancelling the events which it created. The events
that are created are stored on an output queue and anti-messages are sent for those
events upon rollback. If the event has not been processed the anti-message will
annihilate that event. However, if the event has been processed by the destination
// Set the number of different memory buffer types
g_tw_memory_nqueue;

// Allocate number_elements memory buffers of size
// size_of_data with a grow factor of grow_factor
tw_memory_init(int number_elements, int size_of_data,
               int grow_factor)
{
    tw_memoryq *free_q;

    for all PEs
    {
        free_q = get PE's queue;

        // initialize the queue
        free_q->start_size = number_elements;
        free_q->data_size = size_of_data;
        free_q->grow = grow_factor;

        tw_memory_allocate(free_q);
    }
}

tw_memory_allocate(tw_memoryq *q)
{
    int mem_len, mem_sz;

    // calculate the size of a memory buffer
    mem_len = sizeof(tw_memory) + q->data_size;

    // calculate the size of memory needed for the total number of
    // memory buffers
    mem_sz = mem_len * q->start_size;

    allocate mem_sz of memory and link the memory buffers together;

    return;
}
Figure 5.4: Initialization Functions and Variables for the Reverse Memory Subsystem.
processor the anti-message will cause that event to be rolled back. This leads to a
difficulty in the passing of memory buffers with events. The event with the memory
buffer attached to it could be in multiple states. There is no way the user could
reverse the passing of the memory buffer due to a lack of mutual exclusion. The only
appropriate time to free the memory buffer would be on the freeing of the event after
it is annihilated. Therefore, a memory buffer pointer (memory) was added to the
event shown in Figure 5.1. Upon annihilating an event the simulation system checks
the memory buffer pointer and frees the attached memory buffers to the correct lists.
This takes place in the tw event free function and is shown in Figure 5.3.
5.3.2 Reverse Computing Memory Initialization
In order for the users to access the reverse memory subsystem they need to
be aware of the function calls and the correct way of using them. The list of the
functions and variables for initialization is contained in Figure 5.4.
These function calls must be in the main of the simulation model. The
g_tw_memory_nqueue variable should be set before the tw_init function call. The function
tw_memory_init must be called for each of the memory buffer types, and should be
called after tw_init but before the call to tw_run.
5.3.3 Reverse Computing Memory Allocations
Returning to the linked list example, when an event causes an element to
be allocated and added to the list, the reverse of this operation is to remove the
element from the list and free the memory. However, if the rollback is not too long,
this event will soon be re-executed, and the memory likely re-allocated. Dynamic
memory allocation is too expensive to perform on every event in an optimistic
system. Static memory allocation must be used in order to maintain system
performance, but it
may not be possible to ascertain the appropriate amount of memory to allocate.
In addition, statically allocated memory must consider the worst case in order to
ensure model completion.
Our reverse memory subsystem significantly reduces the complexity of simulation
models by providing a statically allocated pool of memory for model allocations.
The pool has the ability to grow in the event that it is exhausted; this occasional
dynamic memory allocation is amortized across all allocations and has no net effect
on model performance. The call for allocating a memory buffer is shown in Figure 5.5.
The tw memory alloc function returns a tw memory pointer that is simply a
pointer to a memory location, safe to use by the model. The memory buffer pool
tw_memory * tw_memory_alloc(tw_lp * lp, tw_fd fd)
{
queue = get PE’s queue ;
if(queue is empty)
tw_memory_allocate(q);
return (remove/pop memory buffer from queue);
}
Figure 5.5: The allocation function for the Reverse Memory Subsystem.
is attached to the LP's (tw lp) processing element (PE) within ROSS. The tw fd file
descriptor is an index to the correct memory pool to use. Memory pools may be
defined to have different size memory buffers.
During the reverse computation of the event, the call shown in Figure 5.6 is
made to reverse compute the memory allocation. The function expects a pointer to
the memory buffer m that will be reverse allocated, along with the file descriptor fd
for the correct memory pool. This function should not be used when the newly
allocated memory buffer is sent in an event. The simulation system will automatically
free the buffer in that case.
void tw_memory_alloc_rc(tw_lp * lp, tw_memory * m, tw_fd fd)
{
push memory buffer onto the PE’s free queue;
}
Figure 5.6: The reverse allocation function for the Reverse Memory Subsystem.
Upon reversing the allocation, the memory buffer is simply returned to the
proper pool.
5.3.4 Reverse Computing Memory De-allocations
Reverse computing a memory de-allocation is more difficult because an
event which causes a de-allocation may later be rolled back. During a rollback, the
memory buffer which was de-allocated in the forward execution must be restored.
This requires that de-allocated memory be maintained somewhere within the system.
For this purpose a second pool of memory buffers is maintained on the LP's KP.
Within the tw memory free function call the freed memory buffer is inserted into
this queue and the time stamp for the buffer is set. During a rollback, each call
to un-free memory buffers removes the next element in this second pool. Because
causality is enforced at the level of the KP, it is ensured that the memory buffers
will match with the events being rolled back.
The call for freeing memory buffers, and the matching reverse computation
call are shown in Figure 5.7.
void tw_memory_free(tw_lp * lp, tw_memory * m, tw_fd fd)
{
queue = KPs processed memory buffer queue
set time stamp for the m
// Now we need to link this memory buffer into the KP’s
// processed memory queue.
push memory buffer on to the KP’s queue
}
tw_memory *tw_memory_free_rc(tw_lp * lp, tw_fd fd)
{
return (remove/pop memory buffer from the KP’s queue);
}
Figure 5.7: The free and reverse free functions for the Reverse Memory Subsystem.
5.3.5 Attaching Memory Buffers to Events
Memory buffers may be allocated and attached to events, which are then sent
to other LPs. This memory buffer must now be considered in the context of the
event, rather than the simple context of the LP. Depending on the state of the
event, the memory buffer must be handled in a special way. Memory buffers can be
attached to events using the routine shown in Figure 5.8. The function expects a
pointer to the event e.
void tw_event_memory_set(tw_event * e, tw_memory * m, tw_fd fd){
if(e has been ABORTED)
{
// frees the memory buffer to the correct PE
tw_memory_alloc_rc( m );
return;
}
attach a memory buffer to the event
return;
}
Figure 5.8: The event memory set function for the Reverse Memory Subsystem.
When this call is made, the memory buffer is placed into a linked list on
the event. However, if the currently allocated event is the abort event (an event
which occurs when event memory has been exhausted), the memory buffer is simply
reclaimed as this event will never be sent or further processed. No reverse function
call is needed to unset a memory buffer. As shown in Figure 5.3 the simulation
system will automatically return the memory buffer to the appropriate pool as the
event is freed.
The call for retrieving memory buffers from a newly received event is shown
in Figure 5.9.
This call returns the memory buffers linked onto the event in reverse order.
But we must also provide a reverse computation call to reverse this process in the
tw_memory * tw_event_memory_get(tw_lp * lp)
{
tw_memory *m;
m = event->memory;
if (m)
event->memory = event->memory->next;
return m;
}
Figure 5.9: The event memory get function for the Reverse Memory Subsystem.
event of a rollback, because taking memory buffers off of an event is a destructive
operation that must be undone. To facilitate this, the call shown in Figure 5.10
is provided.
void tw_event_memory_get_rc(tw_lp * lp, tw_memory * m, tw_fd fd)
{
m->next = event->memory;
event->memory = m;
}
Figure 5.10: The reverse event memory get function for the Reverse Memory Subsystem.
During the forward execution, memory buffers are removed from an event; the
buffers can either be used in the state of the LP or freed using tw memory free.
The reverse computation procedure is to replace the memory buffers on the event.
5.4 Memory Buffers for State Saving
Memory buffers may be allocated for state saving. This can be accomplished
with the above function calls and it will be discussed below.
Events often carry extra fields for state saving. This memory is not used all the
time, which leads to an over-allocation of memory in the system. One example is
our TCP model, which has 20 bytes for state saving in each event; most of the time
only 4 bytes are needed. With the use of memory buffers this over-allocation of
memory can be reclaimed.
First, the model no longer uses the events for state saving; the TCP model uses
memory buffers instead. When an event needs memory for state saving, a call to
tw memory alloc is executed. The memory buffer is filled in with the appropriate
values and then the model calls tw memory free on the memory buffer. This puts
the memory buffer on the processed memory list, where it will be reclaimed by the
fossil collection routine. However, if a rollback occurs, the reverse computation
calls tw memory free rc, which returns the buffer holding the values of the
destroyed variables. The model can restore these values and then reverse the
allocation of the memory buffer with a call to tw memory alloc rc. The benefit of
this technique will be discussed in the performance study.
5.5 Performance Study
For these experiments we ran our tests on the Hyper-Threaded Pentium-4
architecture described in Section 4.4.1 and the Pentium-III architecture described
in Section 3.4.2.
5.5.1 Benchmark Model
We designed the benchmark model to test all functionalities of the reverse
memory subsystem. The benchmark model mimics phold in its message passing.
Each LP starts off with a linked list of 20 memory buffers and its size is limited to
50 memory buffers. The lists are incorporated into the model to demonstrate how
the reverse memory subsystem can be used in an LP’s state. The LP’s initialization
routine sends messages with up to 3 memory buffers to other LPs. This illustrates
the subsystem's capability of passing memory. Upon receiving the message, the
memory buffers are added to the LP's list. This is followed by the LP adding up to 3 or
removing up to 5 memory buffers from its internal list. The LP then creates a new
message with up to 3 memory buffers attached to it.
We have created three benchmark models for comparison; the only variant among
them is how memory is managed. The state saving model makes explicit calls to
malloc and free; the pre-allocated model allocates all the memory required during
the initialization phase and uses it statically; and the reverse computation model
implements the reverse memory subsystem for all memory management. The
models are otherwise identical.
• State Saving: This model variant allows the operating system memory library
to manage memory dynamically throughout the simulation. The model explicitly
calls malloc and free, relying on traditional state saving to support rollback. This
approach would not be chosen in practice because of the high cost of calling malloc
and free, and it also illustrates the memory overheads of poorly designed state
saving models. We created a simple test to measure the average time to perform
1,000 sequential calls to malloc. The measured average was ~7 µs for memory
allocations, and the cost of the same number of sequential frees was ~1-2 µs, for a
1 KB allocation. This model requires 3 memory buffers in every event, plus another
5 memory buffers in every event for removed memory.
• Swaps and Pre-allocated Lists: The next approach is to design a model
which allocates a pool of static memory to be used throughout the simulation.
This allocation occurs during the initialization phase of the simulation and so
saves on the cost of calling malloc and free during model execution. However,
this approach requires a worst-case memory allocation in order to support
all possible cases which may arise for a given model. In addition, the model
must implement all of the memory management routines required for memory
allocation/deallocation. The worst case list for this model is initialized with
50 memory buffers. For handling the dynamics in the system there are two
approaches. The first is when a free happens, state save the removed values.
The other approach is, free the memory buffer without state saving the value.
On an allocation, state save the value currently in the freed memory buffer
104
and than use that buffer. For this model let’s perform an analysis. Worst case,
5 memory buffers are removed and no swaps can be performed because the
message values were not installed. In this case 5 memory buffers are required
in the message for state saving. In the allocation case, 6 memory buffers might
be allocated but 3 of the memory buffers can be swapped with the incoming
message values. So the total state saving required is 3 memory buffers. With
this design the free list is used as temporary storage for the removed memory
buffers.
• Reverse Memory Subsystem: The final approach was to use the reverse
memory subsystem. The goal of the subsystem is to statically allocate an
average case amount of memory during the initialization phase, and to provide
highly optimized memory management beyond what is possible within the
model. Because we have access to the internals of the simulator executive,
certain functions can be performed in the reverse memory subsystem which
cannot be handled easily within the model. For example, many operations
within a simulator with speculative execution cannot be committed until after
the GVT time has passed the event time (i.e., I/O operations) because they
cannot be rolled back. The same is true with the reclamation of memory
buffers. The fossil collection routine does not return memory buffers one at
a time to the free memory list. Instead, it “stitches” the subset of memory
buffers from the processed memory queue back into the free memory list. This
detail of memory management is not possible from within the model alone.
Without the reverse memory subsystem, memory buffers attached to events
can only be reclaimed upon event allocation. The reverse memory subsystem
is a highly optimized memory management library with access to all areas
of the simulation executive. The subsystem takes advantage of the fact that
an average case amount of memory may be allocated for the model. If the
simulation causes a worst-case allocation to occur, the simulation executive
may be forced to perform a GVT computation in order to reclaim event and
memory buffers.
LPs     B Sz     St-Sv        Swap         RMS          St-Sv e/s    Swap e/s     RMS e/s
1024      16       5.4 MB       5.1 MB       5.0 MB     306595.31    341213.80    304561.17
1024      64      14.0 MB      12.1 MB       8.9 MB     201933.67    284425.78    289531.83
1024     256      47.6 MB      40.0 MB      24.7 MB     138472.08    202924.73    240812.96
1024    1024     189.4 MB     151.4 MB      87.7 MB      74840.24    102144.14    175813.18
16384     16      78.4 MB      73.4 MB      70.7 MB     173536.32    192662.80    178621.87
16384     64     204.5 MB     184.4 MB     133.7 MB     136529.00    155620.24    169462.50
16384    256     716.4 MB     628.6 MB     385.7 MB     104683.99    121728.00    154630.81
16384   1024    2764.3 MB    2405.0 MB    1393.7 MB      67558.10     79469.52    128234.58
65536     16     307.9 MB     291.9 MB     280.9 MB     144821.02    156086.62    147415.17
65536     64     815.3 MB     735.9 MB     532.9 MB     117703.90    133909.80    143798.67

Table 5.1: Reverse Memory Subsystem memory usage and event rate. LPs: the
number of LPs in the model. B Sz: the size of the memory buffers in bytes. St-Sv:
the state saving model. Swap: the swap with statically allocated list model. RMS:
the Reverse Memory Subsystem. e/s: events per second.
5.5.2 Benchmark Model Results
We ran the experiments on three different sized models and four different sized
memory buffers. The models were composed of 1024, 16384, and 65536 LPs, and the
memory buffers contained 16, 64, 256, or 1024 bytes. Table 5.1 shows the
memory usage and event rate for the different models. It can be observed that the
model using the reverse memory subsystem had a much smaller memory footprint.
The model using state saving with calls to malloc and free performed the worst.
The reverse memory subsystem, at best, reduced memory usage by a factor of 2.17
relative to the state saving model. In addition, it achieved a speedup of 2.35 over
the state saving model in the 1024 LP, 1024-byte memory buffer case. The
interesting result was that in the smaller memory buffer cases, the model with the
statically allocated list outperformed the reverse memory subsystem model. We
attribute this to the statically allocated list's better cache performance, as
demonstrated in Table 5.2.
As the memory buffer size increased, performance suffered. This can be explained
by the fact that the models had to handle more data, which led to more cache
misses. This can be seen in Table 5.2, which contains the profiling results of the
models: the data cache misses on the Pentium-4. These were obtained by having
Oprofile [76] monitor the BSQ CACHE REFERENCE counter with a mask
N LPs   Buf Sz   St-Sv misses   Swap misses   RMS misses
1024        16          39896         34124        39359
1024        64          45108         35799        39680
1024       256          44612         37772        41757
1024      1024          53906         44706        44495
16384       16          42761         35695        41082
16384       64          42724         36777        42279
16384      256          47574         39488        43364
16384     1024          52583         43856        44891
65536       16          34953         30854        35251
65536       64          35750         32382        36218

Table 5.2: Reverse Memory Subsystem data cache misses. N LPs: the number of
LPs in the model. Buf Sz: the size of the memory buffers in bytes. St-Sv misses:
the state saving model cache misses. Swap misses: the swap with statically
allocated list model cache misses. RMS misses: the Reverse Memory Subsystem
cache misses.
N LPs   Buf Sz   2 Spdup   3 Spdup   4 Spdup
1024        16      1.75      2.39      3.05
1024        64      1.73      2.36      2.75
1024       256      1.59      2.00      2.04
16384       16      1.63      1.88      1.99
16384       64      1.59      2.01      2.05
16384      256      1.53      1.84      1.82

Table 5.3: Reverse Memory Subsystem speedup. N LPs: the number of LPs in the
model. Buf Sz: the size of the memory buffers in bytes. 2-4 Spdup: speedup on
2 to 4 processors.
of 0x100 [24].
The speedup obtained from parallelism is shown in Table 5.3. These runs were
performed on the quad Pentium-III machine, and our maximum speedup due to
parallelism was 3.05. As the model size increased, the speedup from parallelism
decreased.
5.5.3 TCP Results
For the performance study we used the synthetic topology which is fully con-
nected at the top and had 4 levels. A router at one level had N lower level routers
or hosts connected. The number of nodes was equal to N^4 + N^3 + N^2 + N. N was
N    Bandwidth   Mem Size       Mem Size with buffers
4    500 Kb         2.7 MB         2.5 MB
4    1.5 Mb         2.8 MB         2.6 MB
4    45 Mb          3.5 MB         3.36 MB
8    500 Kb        10.8 MB         9.4 MB
8    1.5 Mb        17.3 MB        14.9 MB
8    45 Mb         17.9 MB        16.3 MB
16   500 Kb       101.6 MB        87.8 MB
16   1.5 Mb       198.6 MB       165.1 MB
16   45 Mb        216.2 MB       179.1 MB
32   500 Kb     1,114.3 MB       980.8 MB
32   1.5 Mb     1,310.3 MB     1,114.3 MB

Table 5.4: Memory requirements for the N = 4, 8, 16, 32 synthetic topology
network for low (500 Kb), medium (1.5 Mb) and high (45 Mb) bandwidth scenarios
on 1, 2 and 4 instruction streams on a dual Hyper-Threaded 2.8 GHz Pentium-4
Xeon.
1. DEF(struct, cache) {
2. cache_blk *c_list_h;
3. cache_blk *c_free_h;
4.
5. int free;
6. int used;
7. };
Figure 5.11: CAVES cache data structure.
varied between 4, 8, 16, and 32. The nodes were numbered in such a way that the
next hop can be calculated on the fly at each hop.
As mentioned above the TCP model had an extra 20 bytes for state saving in
each event. In this test, we removed the state saving from the event and used the
reverse memory subsystem. The result can be viewed in Table 5.4. The performance
between the different models stayed approximately the same while the memory usage
decreased. On average there was an 11.7 percent memory reduction, and a best case
reduction of 17.16 percent.
Original CAVES:
1. b = cache->c_list_h;
2. temp = b;
3. b = b->next;
4. if(b)
5. b->prev = NULL;
6.
7. if(cache->c_free_h)
8. cache->c_free_h->prev = temp;
9.
10. temp->next = cache_ram->c_free_h;
11. cache->c_free_h = temp;
12. cache->c_list_h = b;
CAVES with the Reverse Memory Subsystem:
1. temp = queue_pop(cache);
2.
3. tw_memory_free(lp,temp, 0);
Figure 5.12: CAVES removal of a view.
5.5.4 Reduction in CAVES Model Complexity
As mentioned before, developers often overlook certain optimizations. CAVES is
one such example: it has a one-to-many delete, and each LP maintains its own free
list. Had the reverse memory subsystem existed, CAVES would have been
implemented with it and the free lists would have been moved to the KPs. Time
would not have been spent devising ways of minimizing memory, and the subsystem
would have made the design process much faster.
In order to support the comparisons being made in this thesis, the CAVES
model was changed to support the reverse memory subsystem. These changes are
discussed below.
With the reverse memory subsystem we also developed a memory queue data
structure which is used in the system for managing the memory buffers. This queue
data structure can be used in models as well and was used in the porting of CAVES
to the reverse memory subsystem.
Original CAVES:
1. loc = SV->cache ->c_free_h;
2. SV->cache->c_free_h = loc->next;
3.
4. if ( SV->cache->c_free_h )
5. SV->cache ->c_free_h->prev = NULL;
6.
7. // updates the cache
8. loc->next = SV->cache ->c_list_h;
9. loc->prev = NULL;
10.
11. if(SV->cache ->c_list_h)
12. SV->cache ->c_list_h->prev = loc;
13. SV->cache ->c_list_h = loc;
CAVES with the Reverse Memory Subsystem:
1. loc = tw_memory_free_rc(lp,0);
2. queue_push(&SV->cache, loc);
Figure 5.13: CAVES reverse computation code for a removal of a view.
Original CAVES:
1. b = cache->c_free_h;
2. cache->c_free_h = b->next;
3.
4. if ( cache->c_free_h )
5. cache->c_free_h->prev = NULL;
6.
7. copy_view(M->rc_v, b);
8. init_view(b, M->v);
CAVES with the Reverse Memory Subsystem:
1. b = tw_memory_alloc(tw_getlp(lp),0);
2. init_view(b, M->v);
Figure 5.14: CAVES allocation of a view.
Original CAVES:
1. copy_view(b, M->rc_v)
2.
3. if(SV->cache->c_free_h)
4. SV->cache->c_free_h->prev = loc;
5.
6. loc->next = SV->cache->c_free_h;
7. SV->cache->c_free_h = loc;
8. loc->prev=NULL;
CAVES with the Reverse Memory Subsystem:
1. tw_memory_alloc_rc(lp,o_loc,0);
Figure 5.15: CAVES reverse computation of a view allocation
The first thing that can be removed is the data structure for the cache lists
which can be seen in Figure 5.11. This can simply be replaced with a memory queue
for the used cache. Previously the CAVES cache blk structures were linked together
to create the cache. Now the memory buffers form the linked list, so the cache blk
next and prev pointers can be removed.
When removing a view from the cache the original CAVES model performed
the code shown in Figure 5.12. The model using the reverse memory subsystem
only needs two lines. Figure 5.13 shows the reverse code for both models.
In an allocation of a cache block, the CAVES original model has to state
save parts of the allocated block. The reverse memory subsystem acts as the log
for state saving and is transparent to the modeler. The freed memory will remain
available until it is fossil collected. Figures 5.14 and 5.15 show the allocations and
the reversals for the two models.
It can be seen that the code is significantly less complex. One thing to note is that
memory was not reduced as significantly as previously thought: the over-allocation
of the cache was fairly small compared to the event population. The main savings
came from the reduction of state saving in the event; that state saving is now
moved into the LP in the form of memory buffers. This was a finding we were not
expecting. The run-time performance of the two models is comparable, which
suggests the overheads introduced by this subsystem are not significant.
5.6 Conclusions
As can be seen, the reverse memory subsystem provides better results than
first-pass model designs, which indicates that the model design process has been
made simpler. Modelers no longer have to focus on creating their own memory
pools; the simulation system takes care of the overheads and provides close to
optimal results. For models with different sized events, the subsystem offers a way
around the over-allocation problem, and it gives modelers the tools to create more
complex models quickly and without much difficulty. The memory reductions and
speedup gains of more than a factor of 2 over the worst case show that models can
be built better, faster, and with far less experience in optimistic simulation. The
subsystem can also be applied to existing models to gain optimizations that
modelers may have missed or that were previously too difficult to implement, as
shown with CAVES. Overall, this subsystem lowers the development cost for
modelers to leverage reverse computation in their optimistic models.
CHAPTER 6
Sharing Event Data
6.1 Introduction
One of the more challenging issues in model design is keeping the model size
small. A good way to address this problem is to reduce the amount of duplicate
information. As mentioned in [111], Logical Processes (LPs) that share common
data can have a global data structure which will reduce the total state of the model.
From this, the question of “why must this just be limited to the LPs?” arises. Could
a sharing approach be employed in event data as well? Our experimentation shows
that the answer is yes and one good example is a multicast network model [27]
because of the duplicate nature of the events. However, this approach could be used
in several, more generic model scenarios. In fact, at any point in a model where
an event is to be broadcast to two or more LPs, there can be a significant memory
savings attributed to this approach.
In the multicast protocol, data transmission is minimized by sending messages
through a multicast tree before being broadcast to each subscriber. The goal is to
minimize individual transmissions sent separately to each subscriber. This proto-
col model has duplicate information being sent to each subscriber where there are
branches in the multicast tree. In a simulation with shared memory we have the
ability to have a global view of the system. Typically, a multicast model generates
messages that result in multiple, identical messages being sent to each subscriber
LP in the system. Each LP would then read those messages, update its state, and
possibly generate more events in the system.
Rather than identical events being sent to each subscriber with the same data
attached, we keep a pointer to the data in the event header and send each subscriber
LP this pointer. Each LP is required not to overwrite the data, as it is understood
in the model that this event data is being shared globally throughout the system.
Our second requirement is that the attached data can be reclaimed only once every
subscriber has received the multicast event.
This optimization is important because it drastically reduces the most memory
exhaustive component of simulation: the event population. This optimization can
be applied to all types of simulation: sequential, parallel and even distributed.
In this chapter, we will discuss the sequential and parallel implementation of this
approach. There are two types of parallel synchronization, conservative and
optimistic. With sequential and conservative synchronization the implementation
of this idea is trivial because attached data can immediately be reclaimed.
Optimistic simulators pose a greater design and implementation challenge be-
cause processed events are maintained for possible future rollback scenarios. These
scenarios are much more difficult to address because events must have been read by
all receivers prior to reclamation. In fact, there are several cases outlined in
this chapter where this optimization must be managed properly by the optimistic
simulation executive.
We chose the multicast protocol model as our primary example for this ex-
perimental investigation. We follow with possible implementations of the method
assuming different synchronization mechanisms. There will be a detailed discussion
of the implementation of the method in a discrete event simulation which employs
optimistic synchronization. For the performance results a benchmark multicast-like
model will be used in the evaluation.
6.2 Multicast Background
The multicast protocol is a bandwidth-conserving technology that aims to
reduce packets in a network by transmitting a single stream of data to thousands
of receivers on the network. Many applications take advantage of this protocol,
including: video conferencing, corporate communications, distance learning, and
distribution of just about any type of information.
Multicast has been in use on large scale networks since the introduction of
the Mbone [64] in 1992. Today, Microsoft China has one of the largest multicast
networks, and plans to begin providing multicast television to viewers in 2005 [48].
Figure 6.1: Multicast graphic from Cisco [21].
Multicast will become widely used on the Internet as a solution to the higher
bandwidth costs caused by ever-increasing numbers of Internet subscribers.
The multicast model works by delivering traffic from a single source to mul-
tiple receivers through the multicast tree. Multicast receivers subscribe to a given
source, and the information is then disseminated through the multicast tree back
to the multicast group of receivers. Internet routers replicate packets at branches
in the multicast tree so that all subscribers will receive the same packet data. This
low-cost solution not only reduces the amount of bandwidth required to transmit
a large stream of data to multiple receivers, it also reduces the number of requests
serviced by the source. The multicast protocol is efficient in its design compared to
other protocols which commonly require the source to send individual copies of the
same information to multiple receivers. When the amount of data being transmitted
is large, it quickly becomes difficult for the source to send multiple copies across a
network, as in the case of MPEG video. A large amount of network resources are
consumed providing an individual stream for each receiver. The multicast protocol
can also provide a substantial savings when the data transmitted from the source is
small because there may be thousands of receivers to be serviced. Figure 6.1 demon-
strates how data from a single source is delivered to multiple recipients through a
multicast tree.
Our model of the multicast protocol clearly illustrates the performance impact
of our shared event data optimization. It is easy to see that if we are able to move X
bytes into a shared message and there are 1000 multicast subscribers, the savings
will be roughly 1000X bytes. Conversely, if X is very large, and there are only a few
receivers, the memory savings will also yield a significant performance improvement
as the large event memory segment would only need to be written into memory
once. Furthermore, as these large packets traverse the network the number of copy
operations is zero.
6.3 Implementation
In order to successfully deploy this new idea in a current simulator executive,
two restrictions must be met. Once the shared event data has been identified, it can
be allocated and written only once, and then dereferenced for each subscriber event
created. The first restriction is that each subscriber may not destroy or modify the
shared event data. The second restriction is that the simulation executive may not
prematurely reclaim the shared event data. Only once each subscriber has received
the event can the shared event data be freed.
6.3.1 Sequential & Conservative Simulation
In a sequential simulation this optimization can easily be implemented entirely
within the model. The modeler can keep a pointer to the shared event data. For
each newly created event that will forward the shared data, the pointer to the shared
data is set. Next, there needs to be a way to free the shared data once the event has
been processed by all subscribers. If the final subscriber is known, the shared event
data can be freed once that subscriber has processed the event. Another approach
is to maintain a counter that indicates the number of sends and receives. When the
last subscriber receives the event and does not forward it to any other subscriber,
the counter will be zero and the shared data segment is safe to reclaim. Since the
execution is sequential, only one LP will be accessing the counter at a time and
there is no need for mutual exclusion. Finally, the last subscriber to receive the
event will be the correct one to free it.
With conservative simulation systems such as DaSSF [72], this optimization
can also be implemented by the model. Once again the modeler creates a pointer
in the event message for the shared data segment. Each newly created event that is
going to be forwarding the data can point to the same shared data segment. The
freeing of the data can be done the same way as in the sequential case because in
conservative simulation there are no roll backs and all events executed are processed
in the correct causal order. However, the execution is parallel and so would require
mutual exclusion for the counter if that is the method used to determine the second
restriction.
6.3.2 Optimistic Simulation
Within optimistic simulation there are additional issues to address. In par-
ticular, speculative execution complicates the allocation/deallocation process when
rollbacks occur.
We chose to implement this novel idea within ROSS [12, 13, 14]. ROSS han-
dles causal errors through reverse computation. When a rollback occurs, an event’s
reverse computation event handler is called, which has the inverse effects on the
LP’s state compared to the forward execution of the event. ROSS includes a reverse
memory subsystem which allows for the dynamic allocation of statically allocated
memory. This subsystem was designed to overcome the problem of reverse com-
puting memory operations such as malloc and free during event execution. The
reverse memory subsystem greatly reduces the complexity of many models by al-
lowing them to create memory buffers and either maintain them in their LP state
or to send them as part of the event data. When we discuss reclaiming the shared
event data segments, it is these memory buffers to which we are referring.
Our implementation used a counter within the event header to track subscriber
sends. The easiest implementation of this idea is not to try to reclaim the shared
if (e has been ABORTED)
{
    if (*b)
    {
        lock(&((*b)->mem_lck));
        if ((*b)->counter != 0)
        {
            unlock(&((*b)->mem_lck));
            return;
        }
        unlock(&((*b)->mem_lck));
        free the memory pointed to by event e (i.e., e->memory);
        *b = NULL;
    }
    return;
}
lock(&((*b)->mem_lck));
(*b)->counter++;
unlock(&((*b)->mem_lck));
e->memory = *b;
return;
Figure 6.2: Memory set routine. This routine is performed when setting a memory buffer to an event. b is the memory buffer; e is the event.
data when it reaches the end points. An apparently final end point might not
actually be final, because other end points might still be rolled back. In an
optimistic solution, the event data segments are reclaimed only once
the possibility of a rollback is eliminated by the passing of the global virtual time
(GVT) [51]. Only those events with a time stamp less than the current GVT value
may be reclaimed by the system [33]. Typically, for caching purposes, those events
are made readily available for the next event creation. This improves the cache
hit rates because we know that the newly reclaimed event is in our cache, and so it
should be the next event to be allocated. On the reallocation of the event the shared
event data can be reclaimed. This method requires additional memory because the
if (e->memory)
{
    lock(&e->memory->mem_lck);
    e->memory->counter--;
    if (e->memory->counter == 0)
    {
        unlock(&e->memory->mem_lck);
        free the memory pointed to by event e (i.e., e->memory);
    }
    else
        unlock(&e->memory->mem_lck);
    e->memory = NULL;
}
Figure 6.3: Memory free routine. This routine is performed on all event allocations. b is the memory buffer; e is the event.
shared event data is being reclaimed later in the simulation, but still dramatically
less than the amount of memory needed for a non-shared data approach. Within
the shared data there is a counter and a mutual exclusion lock which the simulation
executive manages. This optimization is entirely transparent to the model.
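As a concrete sketch of this bookkeeping, the fragment below pairs a counter and lock in the shared segment header with a reclamation check driven by GVT. The struct and function names (shared_buf, fossil_collect_event) are hypothetical and the actual ROSS interface differs; the point is that a segment is released only after GVT passes the owning event's timestamp and its counter reaches zero.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical sketch of the executive's bookkeeping; the actual */
/* ROSS interface differs. The shared segment header carries the  */
/* counter and lock that the simulation executive manages.        */
typedef struct shared_buf {
    pthread_mutex_t mem_lck;  /* guards the counter           */
    int             counter;  /* events still referencing us  */
    /* ... shared payload would follow here ...               */
} shared_buf;

typedef struct event {
    double      ts;      /* virtual-time stamp of the event   */
    shared_buf *memory;  /* attached shared segment, or NULL  */
} event;

/* Run when an event buffer is reallocated. The shared segment may */
/* be released only once GVT has passed the event's timestamp, so  */
/* no rollback can reach it, and only by the last referencing      */
/* event; exactly one processor sees the counter hit zero.         */
int fossil_collect_event(event *e, double gvt)
{
    if (e->ts >= gvt || e->memory == NULL)
        return 0;                        /* not yet safe to drop */
    pthread_mutex_lock(&e->memory->mem_lck);
    int last = (--e->memory->counter == 0);
    pthread_mutex_unlock(&e->memory->mem_lck);
    if (last)
        free(e->memory);                 /* last reference frees */
    e->memory = NULL;
    return last;
}
```

Decrementing under the lock guarantees that only one processor observes the counter reach zero, so the segment is freed exactly once even when several events drop their references concurrently.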
A second issue is that the execution of an event might not create the desired
new event. For example, in ROSS, once all event memory has been allocated, a
special event called the abort event is returned instead of a null pointer. This
enables regular optimistic processing to continue until the scheduler reaches a
point at which it can correctly and safely reclaim memory. This approach is
similar to the one taken in the Georgia Tech Time Warp (GTW) system [34]. If
the "event-send" routine finds that an abort event has been scheduled, it
continues processing but does not send that event. Additionally, when the
current event execution completes, it is rolled back and any events it created
are cancelled.
The steps for the memory set routine are illustrated in Figure 6.2. Here, the
newly allocated memory buffer, denoted by b, has its access-control counter
incremented provided the owning event, e, has not been aborted. If an aborted
event is encountered and the counter is zero, the shared memory segment is freed.
Figure 6.4: Additional Memory Required for Increases in Levels.
Otherwise, this routine returns. Next, Figure 6.3 shows how a shared event segment
is freed. In this routine, the memory segment’s access counter must be zero prior to
the actual release of the memory segment. Please note, critical sections are denoted
by the lock and unlock routines. Both increment and decrement operations of
the access counter variables are “locked”. Additionally, any tests for zero are placed
within the lock since one and only one processor should free a shared memory buffer.
We observe that this interface only affects forward event processing. When a
rollback occurs, the reverse event handling code is not affected, and no new
function calls or code modifications are required to support shared event
segments.
Figure 6.5: Memory required with respect to shared data segment size. One case is 8 multicast trees, 10 levels and 8 start events, and the other case is 16 trees, 10 levels and 4 start events.
6.4 Performance Study
6.4.1 Itanium Architecture
The Itanium-II processor [46] is a 64-bit architecture based on Explicitly
Parallel Instruction Computing (EPIC), which bundles together instructions that
are free of data, branch or control hazards. This approach enables up to 48
instructions to be in flight at any point in time. Current implementations
employ a 6-wide, 8-stage-deep pipeline. A single system can physically address
up to 2^50 bytes and has full 64-bit virtual addressing capability. The L3
cache comes in a 3 MB configuration and can be accessed at 48 GB/second, which
is the core bus speed.
In contrast to other processors, this processor clearly has the largest “core
speed” cache of any available on the market. For example, the Apple G5 64-bit
Figure 6.6: Event rate with respect to shared data segment size. One case is 8 multicast trees, 10 levels and 8 start events, and the other case is 16 trees, 10 levels and 4 start events.
processor provides only a 512 KB level-2 cache, or 14.3% of the cache available
on the Itanium-II processor. However, the core bus speed on the G5 is 64
GB/second.
6.4.2 Benchmark Multicast Model
For the performance study we implemented a benchmark multicast model. We
constructed binary trees to describe the network topology of sources, routers
and subscribers. All of the trees were disjoint. The leaf nodes off the routers
were the subscribers, and the source root node was responsible for generating
the packets. Once a packet was received by the left-most subscriber in the
tree, the root generated the next packet.
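The size of this topology is easy to quantify: a full binary tree with L levels has 2^L - 1 nodes, of which the 2^(L-1) leaves are subscribers, and with T disjoint trees the LP count scales linearly in T. A small helper sketch (our own, not benchmark code):

```c
#include <assert.h>

/* Sizing of the benchmark topology (our own helper, not benchmark */
/* code). A full binary tree with `levels` levels has 2^levels - 1 */
/* nodes; the 2^(levels-1) leaves are the subscribers; the rest    */
/* are the source root and the interior routers.                   */
long lps_per_tree(int levels)
{
    return (1L << levels) - 1;
}

long subscribers_per_tree(int levels)
{
    return 1L << (levels - 1);
}

/* The trees are disjoint, so the LP count scales linearly in T. */
long total_lps(int trees, int levels)
{
    return trees * lps_per_tree(levels);
}
```

This exponential growth in levels is what drives the memory figures reported in the next section.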
T   L   Estart  S     Mtraditional  Mshared   ERtraditional  ERshared
8   10  8       4     14.4 MB       14.8 MB   377986.673     370801.924
8   10  8       16    17.4 MB       14.8 MB   350491.922     378053.970
8   10  8       64    29.4 MB       14.8 MB   336866.703     377541.440
8   10  8       256   77.4 MB       14.9 MB   330823.230     370219.797
8   10  8       1024  269.3 MB      15.4 MB   324083.951     367463.726
16  10  4       4     19.1 MB       19.4 MB   342921.408     335885.514
16  10  4       16    22.1 MB       19.4 MB   322796.458     341562.543
16  10  4       64    34.1 MB       19.4 MB   311617.293     342541.110
16  10  4       256   82.0 MB       19.6 MB   306166.442     336635.595
16  10  4       1024  273.9 MB      20.0 MB   300169.391     333708.054
16  15  8       256   4936.3 MB     842.0 MB  137519.194     146734.510
16  15  8       1024  17224.1 MB    842.9 MB  57867.255      146681.164
Table 6.1: Sequential performance with and without shared data. T is the number of trees in the multicast graph. L denotes the number of levels within each tree. Estart is the number of initial events each LP schedules at the start of the simulation. S is the size of the data in the messages. Mtraditional and Mshared are the required memory for the traditional and shared event data models respectively. ERtraditional and ERshared are the event rates for the traditional and shared event data models respectively.
6.4.3 Model Parameters and Results
We experimented with many parameters in the multicast benchmark model.
The most significant model parameters were the number of LPs and the size of the
shared data segments. We varied the number of trees from 2 to 16 and the number
of levels from 5 to 15. A power of two was not chosen because 15 was the largest
number of levels that would still fit into memory. The shared data size ranged from
4 integers to 1024 integers and was modeled using individual memory buffers of the
respective sizes. In addition we varied the number of start events from 2 to 8.
For the first set of experiments we ran ROSS sequentially with and without
a shared data segment in the events. As expected, memory usage increased with
the number of trees and start events. When the number of levels in the trees
grew, an exponential increase in memory usage was observed. This can be seen
in Figure 6.4 and is explained by the exponential nature of the data structure.
It can be observed that there was a smaller benefit for the small shared data
T   L   Estart  S     MRshared  MRtraditional  % Reduction
8   10  8       4     0.0221    0.0221          0.00 %
8   10  8       16    0.0221    0.0211         -4.73 %
8   10  8       64    0.0221    0.0231          4.33 %
8   10  8       256   0.0224    0.0259         13.51 %
8   10  8       1024  0.0224    0.0290         22.75 %
16  10  4       4     0.0224    0.0224          0.00 %
16  10  4       16    0.0225    0.0213         -5.63 %
16  10  4       64    0.0228    0.0234          2.56 %
16  10  4       256   0.0228    0.0261         12.64 %
16  10  4       1024  0.0229    0.0293         21.84 %
Table 6.2: Data cache misses per memory reference. T is the number of trees in the multicast graph. L denotes the number of levels within each tree. Estart is the number of initial events each LP schedules at the start of the simulation. MRshared and MRtraditional are the data cache miss rates for the shared event data and traditional models respectively. Finally, % Reduction is the amount the miss rate is reduced by the event sharing scheme.
segments. However, for the larger data segments the benefits are quite
noticeable. This can be seen in Figures 6.5, 6.6 and Table 6.1. In the larger
cases the shared data segments led to significant decreases in memory and
increases in performance. The decrease in memory is explained by the fact that
the overhead of the shared data segment is far outweighed by the duplicate
information it eliminates. In some cases the shared data models used 1/20th of
the memory required by traditional sequential simulations (i.e., not sharing
event segments). One observed result showed a speedup of 2.5. This performance
is explained by the fact that the traditional model was swapping and thrashing.
For the other data points, the speedup is attributed to a smaller memory
footprint, which enables more events to fit in the cache. Another part of the
speedup is that the model only had to assign pointer values instead of copying
data between events.
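The memory argument can be made concrete with a simple back-of-the-envelope model, consistent with the trend in Table 6.1: traditionally every in-flight event carries its own copy of the payload, whereas with sharing each packet's payload is stored once and each event carries only a pointer. The functions below are our own simplification and ignore event headers and pool overheads.

```c
#include <assert.h>
#include <stddef.h>

/* Our own simplified memory model (ignores event headers and pool */
/* overheads). `packets` concurrently live multicast packets, each */
/* delivered via `events` in-flight events, payload of `s` bytes.  */

/* Traditional scheme: every event carries its own payload copy. */
long payload_bytes_traditional(long packets, long events, long s)
{
    return packets * events * s;
}

/* Shared scheme: one payload copy per packet plus one pointer per */
/* event that references it.                                       */
long payload_bytes_shared(long packets, long events, long s)
{
    return packets * (s + events * (long)sizeof(void *));
}
```

With large payloads fanned out across a deep tree, the per-event pointer cost is dwarfed by the eliminated copies, which is consistent with the roughly 20x reduction in the largest configuration of Table 6.1.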
Table 6.2 presents the profiling results for the models. It shows the data
cache misses per memory reference for the shared event data and traditional
models. The ratio was obtained by dividing DCU LINES IN by DATA MEM REFS, which
are counters
T   L   Estart  S     2 PEs  3 PEs  4 PEs
8   10  4       256   1.36   1.95   2.34
8   10  4       1024  1.35   1.94   2.32
8   10  8       256   1.49   2.22   2.73
8   10  8       1024  1.48   2.21   2.74
8   15  4       256   1.56   2.43   3.21
8   15  4       1024  1.55   2.43   3.22
8   15  8       256   1.54   2.41   3.19
8   15  8       1024  1.54   2.41   3.19
16  10  4       256   1.51   2.25   2.80
16  10  4       1024  1.51   2.23   2.79
16  10  8       256   1.59   2.40   2.94
16  10  8       1024  1.59   2.43   2.64
16  15  4       256   1.54   2.42   3.20
16  15  4       1024  1.53   2.42   3.20
16  15  8       256   1.53   2.38   3.06
16  15  8       1024  1.54   2.37   3.04
Table 6.3: Parallel results for shared event data. T is the number of trees in the multicast graph. L denotes the number of levels within each tree. Estart is the number of initial events each LP schedules at the start of the simulation. 2-4 PEs is performance measured in speedup (i.e., sequential execution time divided by parallel execution time) for 2 to 4 processors.
that Oprofile monitored on a Pentium III [23]. The table shows that as the event
data size increases the shared event data model has fewer data cache misses than
the traditional model. These fewer misses can also explain the speedup which is
shown in Table 6.1.
One outlier was the 16-integer case, in which the traditional model had a
better ratio. Certain models are more sensitive than others to how they fit
into the L2 cache, yielding better performance in some cases, and the
16-integer case appears to be one of these situations. More investigation is
needed to determine the precise effects of caching on performance.
Table 6.3 shows the results of the tests on the 1.5 GHz quad-processor
Itanium-II systems. The maximum speedup attained was 3.22 on four processors.
The low speedup values can be explained by the fact that the system does not have enough work
and can be remedied by increasing the number of start events in the system. This
can also be observed in the table.
6.5 Related Work
Much of the research on shared data in parallel simulation was based on
modifying and reading multiple LPs' states. For example, Sharks World breaks a
model down into sectors, and each sector needs to be able to read or modify
entities in its neighbors' states [25]. One method is the push method, in which
messages carrying the entities' information are passed to the correct
neighbors. Another approach is proposed in the space-time memory paper [39].
This concept attaches a time log to shared objects and allows for easier model
development than the push method. A distributed method for sharing variables in
systems without physically shared memory was discussed by Mehl and Hammes [66].
The main difference between our work and these papers is that our shared memory
is not allowed to be modified, which eliminates the issue of whether the memory
is safe to read.
In the context of shared-memory performance optimization, Panesar and Fujimoto
have two key results. In [77], they present an event buffer management scheme
that reduces memory overheads on a cache-coherent shared-memory multiprocessor
(KSR systems). To avoid over-optimistic execution, as well as to ensure that
event memory is equally distributed among all processors, they devised a flow
control technique that treated event memory like a window of network packets
and applied a congestion control approach to throttle Time Warp event
processing rates [78].
Multicast is also used in the High-Level Architecture [33, 45] (also known as
IEEE 1516). This is a general purpose architecture for simulation reuse and interop-
erability. Here, simulators communicate through a publish and subscribe interface.
One of the key challenges is how to correctly disseminate update information. To
address this problem, multicast groups are employed as a means to allow simulators
to subscribe to regions of interest. Each “region” is assigned a multicast group iden-
tifier. This approach enables the efficient dissemination of update information about
simulation entities of interest. The key difference here is that our shared-memory
approach reduces memory consumption whereas the HLA’s implementation reduces
network bandwidth, but overall memory consumption remains the same.
Finally, we note that sharing event data has some linkages to multi-resolution
modeling (MRM) [70]. MRM is primarily concerned with the correct temporal and
spatial aggregation and disaggregation of simulation objects. The key
difference is that our approach is concerned only with the spatial aggregation
of event data scheduled to a number of simulation objects at or about the same
point in virtual time. Additionally, we are unaware of any MRM approach for an
optimistic synchronization environment. In particular, how one would roll back
either an aggregation or a disaggregation operation is still an open question.
6.6 Conclusions
This chapter transforms the idea of shared data within an LP into the idea of
shared data within an event. We show that shared event data is feasible and
yields memory savings of a factor of 2 to 20. There are also speedup gains from
eliminating hop-to-hop copying. We report a 22% reduction in the data cache
miss rate and parallel speedups of 3.22 on a quad-processor system.
CHAPTER 7
Summary
In this thesis we present two large-scale simulation models with reverse computation.
We discussed the benefits from a reverse memory subsystem and shared event data.
The first large-scale model was for a configurable view storage system (CAVES).
This model is suitable for execution on an optimistic simulation engine. Overall, we
find that our model performs well with an average speedup of 3.6 on 4 processors
over all configurations. Many cases yield super-linear speedup, which is attributed
to a slow memory subsystem on the multiprocessor PC. We find that a number of
parameters affect key Time Warp performance metrics. In particular, increasing
the view storage size decreases both rollbacks and the total number of events.
In the second large-scale model, we dispel the view that optimistic simulation
techniques operate outside the performance envelope for Internet protocols and
demonstrate that they can efficiently simulate large-scale TCP scenarios over
realistic network topologies using a single hyper-threaded computing system
costing less than $7,000 USD. The model was implemented to be as memory
efficient as possible, which allowed the million-node topology to be executed.
We observed model memory requirements between 1.3 KB and 2.8 KB per TCP
connection depending on the network configuration (size, topology, bandwidth
and buffer capacity). For our real-world topology, we used the core AT&T US
network. Our optimistic simulator yields extremely high efficiency, and many of
our performance runs produce zero rollbacks. However, the number of remote
messages limited the speedup for the AT&T topologies. Our experiments on the
campus network showed excellent performance and super-linear speedup. We also
showed that our simulator was 5.14 times faster than PDNS and that its memory
footprint was significantly smaller.
As can be seen, the reverse memory subsystem provides better results than the
first passes at model design, which indicates that the model design process has been
made simpler. Modelers no longer have to create their own memory pools; the
simulation system handles the overheads and provides close to optimal results.
For different types of models, it offers ways around the issue of differently
sized events. The reverse memory subsystem provides the tools to create more
complex models quickly and without the usual difficulties. The memory
reductions and speedup gains of more than a factor of 2 over the worst case
show that models can be built better, faster, and with far less experience in
optimistic simulation. The subsystem can also be applied to existing models to
capture optimizations that modelers may have missed or previously found too
difficult to implement, as shown in the new CAVES section. Overall, this
subsystem lowers the development cost for modelers to leverage reverse
computation in their optimistically synchronized models.
With the addition of this subsystem the modeler is given the flexibility to
develop more complex models in less time. This subsystem will allow optimistic
parallel simulation to become more commonplace in the network simulation field.
From the idea of shared data in the LP, we transformed it into the idea of
shared data in the event. Our work shows that shared event data is feasible and
yields memory savings of a factor of 2 to 20. There are also speedup gains from
eliminating hop-to-hop copying. We report a 22% reduction in the data cache
miss rate and parallel speedups of 3.22 on a quad-processor system.
BIBLIOGRAPHY
[1] M. Abrams, C. R. Standridge, G. Abdulla, E. A. Fox and S. Williams. “Removal Policies in Network Caches for World-Wide Web Documents.” In Proceedings on Applications, Technologies, Architectures, and Protocols for Computer Communications, 293–305, 1996.
[2] R. Agarwal, C. N. Chuah, S. Bhattacharyya, and C. Diot. “The Impact of BGP Dynamics on Intra-Domain Traffic.” In Proceedings of SIGMETRICS, 2004.
[3] H. Agrawal, R. A. DeMillo, and E. H. Spafford. “An Execution Backtracking Approach to Program Debugging.” IEEE Software, pages 21-26, 1991.
[4] H. Agrawal, R. A. DeMillo, and E. H. Spafford. “Efficient Debugging with Slicing and Backtracking.” Software Practice & Experience, June 1993, 23(6), pp. 589-616.
[5] M. F. Arlitt and C. L. Williamson. “Trace-Driven Simulation of Document Caching Strategies of Internet Web Servers.” Simulation 68(1): 23-33, 1997.
[6] R. M. Balzer. “EXDAMS: EXtendible Debugging and Monitoring System.” In AFIPS Proceedings, Spring Joint Computer Conference, volume 34, pages 567-580, Montvale, New Jersey, 1969. AFIPS Press.
[7] J. Banks, J. S. Carson, II, and B. L. Nelson. “Discrete-Event System Simulation.” 2nd Edition. Prentice Hall, Upper Saddle River, New Jersey, 1996.
[8] D. Bauer, G. Yaun, C. D. Carothers, M. Yuksel, and S. Kalyanaraman. “Seven-O’Clock: A New Distributed GVT Algorithm Using Network Atomic Operations.” In Proceedings of the Workshop on Parallel and Distributed Simulation (PADS ’05), 2005.
[9] S. Bellenot. “State Skipping Performance with the Time Warp Operating System.” In Proceedings of the Workshop on Parallel and Distributed Simulation (PADS ’92), pages 53-64, January 1992.
[10] D. Bruce. “The Treatment of State in Optimistic Systems.” Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS ’95), 40-49, June 1995.
[11] R. E. Bryant. “Simulation of Packet Communication Architecture Computer Systems.” Computer Science Laboratory, Massachusetts Institute of Technology, 1977.
[12] C. D. Carothers, D. Bauer and S. Pearce. “ROSS: A High-Performance, Low Memory, Modular Time Warp System.” In Proceedings of the 14th Workshop on Parallel and Distributed Simulation (PADS 2000), pages 53-60, May 2000.
[13] C. D. Carothers, D. Bauer, and S. Pearce. “ROSS: A High-Performance, Low Memory, Modular Time Warp System.” Journal of Parallel and Distributed Computing, 2002.
[14] C. D. Carothers, D. Bauer and S. Pearce. “ROSS: Rensselaer’s Optimistic Simulation System User’s Guide.” Technical Report #02-12, Department of Computer Science, Rensselaer Polytechnic Institute, 2002, http://www.cs.rpi.edu/tr/02-12.pdf
[15] C. D. Carothers, K. S. Perumalla, and R. M. Fujimoto. “Efficient Optimistic Parallel Simulations using Reverse Computation.” In Proceedings of the 13th Workshop on Parallel and Distributed Simulation (PADS ’99), 126-135, 1999.
[16] C. D. Carothers, K. S. Perumalla, and R. M. Fujimoto. “Efficient Optimistic Parallel Simulations using Reverse Computation.” (journal version). ACM Transactions on Modeling and Computer Simulation (TOMACS), 9(3): 224–253, 1999.
[17] C. D. Carothers, K. S. Perumalla, and R. M. Fujimoto. “The Effect of State-saving in Optimistic Simulation on a Cache-Coherent Non-Uniform Memory Access Architecture.” In Proceedings of the 1999 Winter Simulation Conference, 1999.
[18] K. M. Chandy and J. Misra. “Distributed Simulation: A Case Study in the Design and Verification of Distributed Programs.” IEEE Transactions on Software Engineering 5(3): 440-452, 1979.
[19] K. M. Chandy and J. Misra. “Asynchronous Distributed Simulation via a Sequence of Parallel Computations.” Communications of the ACM 24(4): 198-205, April 1981.
[20] D. Chiu and R. Jain. “Analysis of the Increase/Decrease Algorithms for Congestion Avoidance in Computer Networks.” Journal of Computer Networks and ISDN Systems, Volume 17, Number 1, pages 1-14, June 1989.
[21] Cisco. “Internet Protocol Multicast.” http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/ipmulti.htm
[22] K. G. Coffman and A. M. Odlyzko. “Internet Growth: Is there a Moore’s Law for Data Traffic?” Handbook of Massive Data Sets, J. Abello, P. M. Pardalos, and M. G. C. Resende, eds., Kluwer, 2002, pp. 47-93.
[23] W. Cohen. “Characterization of GCC 2.96 and GCC 3.1 Generated Code with Oprofile.” www.redhat.com/support/wpapers/redhat/OProfile/oprofile.pdf
[24] W. Cohen. “Tuning Programs with Oprofile.” http://people.redhat.com/wcohen/Oprofile.pdf
[25] D. Conklin, J. Cleary, and B. Unger. “The Sharks World: A Study in Distributed Simulation Design.” In Distributed Simulation (1990), SCS Simulation Series, pp. 157-160.
[26] J. Cowie, H. Liu, J. Liu, D. Nicol and A. Ogielski. “Towards Realistic Million-Node Internet Simulations.” In Proceedings of the 1999 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA ’99), June 1999.
[27] S. Deering. “Host Extensions for IP Multicasting.” RFC 1112, ftp://ftp.rfc-editor.org/in-notes/rfc1112.txt, August 1989.
[28] K. Fall and S. Floyd. “Simulation-based Comparisons of Tahoe, Reno, and SACK TCP.” Computer Communication Review, Volume 26, Number 3, pages 5–21, July 1996.
[29] J. J. Farris and D. M. Nicol. “Evaluation of Secure Peer-to-Peer Overlay Routing for Survivable SCADA Systems.” Proceedings of the 2004 Winter Simulation Conference.
[30] S. Feldman and C. Brown. “IGOR: A System for Program Debugging via Reversible Execution.” ACM SIGPLAN Notices, Workshop on Parallel and Distributed Debugging, 24(1): 112-123, January 1989.
[31] R. M. Fujimoto. “Optimistic Approaches to Parallel Discrete Event Simulation.” Transactions of the Society for Computer Simulation, 7(2): 153-191, June 1990.
[32] R. M. Fujimoto. “Parallel Discrete Event Simulation.” Communications of the ACM, 33(10): 30–53, October 1990.
[33] R. M. Fujimoto. “Parallel and Distributed Simulation Systems.” John Wiley & Sons, New York, 2000.
[34] R. M. Fujimoto and M. Hybinette. “Computing Global Virtual Time in Shared-Memory Multiprocessors.” ACM Transactions on Modeling and Computer Simulation, Volume 7, Issue 4, pages 425–446, October 1997.
[35] R. M. Fujimoto, K. S. Perumalla, A. Park, H. Wu, M. H. Ammar, G. F. Riley. “Large-Scale Network Simulation: How Big? How Fast?” MASCOTS 2003.
[36] M. Frank. “The R Programming Language and Compiler.” Memo M8, MIT AI Lab, 1997.
[37] M. Frank. “Reversibility for Efficient Computing.” Ph.D. thesis, Dept. of CISE, University of Florida, 1999.
[38] M. Frank. “Reversible Computing.” Developer 2.0 magazine, Jasubhai Digital Media, January 2004.
[39] K. Ghosh and R. M. Fujimoto. “Parallel Discrete Event Simulation Using Space-Time Memory.” In 20th International Conference on Parallel Processing (ICPP), August 1991.
[40] S. Glassman. “A Caching Relay for the World Wide Web.” In Proceedings of the First International Conference on the World-Wide Web, 1994.
[41] F. Gomes. “Optimizing Incremental State Saving and Restoration.” Ph.D. thesis, Dept. of Computer Science, University of Calgary, 1996.
[42] T. Haerder. “Observations on Optimistic Concurrency Control Schemes.” Information Systems, 9(2): 111-120, 1984.
[43] D. Harrison. “Edge-to-edge Control: A Congestion Control and Service Differentiation Architecture for the Internet.” Ph.D. dissertation, Computer Science Department, Rensselaer Polytechnic Institute, May 2002.
[44] M. Herlihy. “Wait-Free Synchronization.” ACM Transactions on Programming Languages and Systems, 13(1): 124-149, 1991.
[45] HLA Department of Modeling and Simulation Website. https://www.dmso.mil/public/transition/hla/, last accessed April 13, 2005.
[46] Intel. “Intel Itanium-II Reference Manuals.” Available via the web at: http://www.intel.com/design/itanium/documentation.htm
[47] Intel. “Pentium 4 and Xeon Processor Optimization Reference Manual.” http://developer.intel.com/design/pentium4/manuals/248966.htm
[48] IPMSI. “Microsoft to Co-operate with World Multicast China.” http://www.ipmulticast.com/, September 2004.
[49] iPod. Apple - iPod. http://www.apple.com/ipod/
[50] V. Jacobson. “Congestion Avoidance and Control.” Proceedings of the ACM SIGCOMM, August 1988, pages 314-329.
[51] D. R. Jefferson. “Virtual Time.” ACM Transactions on Programming Languages and Systems, 7(3): 404–425, July 1985.
[52] D. R. Jefferson and A. Motro. “The Time Warp Mechanism for Database Concurrency Control.” Proceedings of the IEEE 2nd International Conference on Data Engineering, pages 141-150, February 1986.
[53] H. T. Kung and J. T. Robinson. “On Optimistic Methods for Concurrency Control.” ACM Transactions on Database Systems, 6(2): 213–226, June 1981.
[54] S. Lawson. “Skype Sets Its Sights on Cell Phones.” PC World, http://www.pcworld.com/news/article/0,aid,119652,00.asp
[55] P. M. Lewis, A. Bernstein, and M. Kifer. “Database and Transaction Processing.” Addison-Wesley, 2002.
[56] Y-B. Lin and E. D. Lazowska. “Reducing the State Saving Overhead For Time Warp Parallel Simulation.” Technical Report 90-02-03, Department of Computer Science and Engineering, University of Washington, 1990.
[57] Y-B. Lin, B. R. Press, W. M. Loucks, and E. D. Lazowska. “Selecting the Checkpoint Interval in Time Warp Simulation.” In Proceedings of the Workshop on Parallel and Distributed Simulation (PADS ’93), pages 3-10, May 1993.
[58] R. J. Lipton and D. W. Mizell. “Time Warp vs. Chandy-Misra: A Worst-case Comparison.” Proceedings of the SCS Multiconference on Distributed Simulation, 22, pages 137-143, 1990.
[59] J. Liu. NMS (Network Modeling and Simulation DARPA Program) baseline model. See web site at http://www.crhc.uiuc.edu/∼jasonliu/projects/ssfnet/dmlintro/baseline-dml.html
[60] J. Liu. “Improvements in Conservative Parallel Simulation of Large-scale Models.” Ph.D. thesis, Dept. of Computer Science, Dartmouth College, 2003.
[61] Y. Liu and B. K. Szymanski. “Distributed Packet-Level Simulation for BGP Networks under Genesis.” Proc. Summer Computer Simulation Conference, SCS Press, San Diego, CA, July 2004, pp. 271-278.
[62] J. L. Lo, S. J. Eggers, J. S. Emer, H. M. Levy, R. L. Stamm and D. M. Tullsen. “Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading.” ACM Transactions on Computer Systems, 15(3), pages 322-354, August 1997.
[63] C. Lutz and H. Derby. “JANUS: A Time-Reversible Language.” http://www.cise.ufl.edu/∼mpf/rc/janus.html
[64] M. R. Macedonia and D. P. Brutzman. “Mbone Provides Audio and Video Across the Internet.” IEEE Computer, Volume 27, Issue 4, pages 30–35, April 1994.
[65] F. Mattern. “Efficient Algorithms for Distributed Snapshots and Global Virtual Time Approximation.” Journal of Parallel and Distributed Computing, 18(4), pages 423-434, August 1993.
[66] H. Mehl and S. Hammes. “Shared Variables in Distributed Simulation.” In Proceedings of the 7th Workshop on Parallel and Distributed Simulation (PADS ’93), 1993, vol. 23, no. 1, pp. 68-76.
[67] Microsoft. “Microsoft Launches Online Video Service for Windows Mobile-Based Devices.” http://www.microsoft.com/presspass/press/2005/mar05/03-30MSNVideoDownloadsPR.asp, March 30, 2005.
[68] J. Misra. “Distributed Discrete-Event Simulation.” ACM Computing Surveys, 18(1): 39–65, March 1986.
[69] T. G. Moher. “PROVIDE: A Process Visualization and Debugging Environment.” IEEE Transactions on Software Engineering, 14(6): 849–857, 1988.
[70] A. Natrajan, P. F. Reynolds and S. Srinivasan. “MRE: A Flexible Approach to Multi-Resolution Modeling.” In Proceedings of the Eleventh Workshop on Parallel and Distributed Simulation, pages 156–163, 1997.
[71] D. Nicol. “Scalability of Network Simulators Revisited.” In Proceedings of the 2003 Communication Networks and Distributed Systems Modeling and Simulation Conference (CNDS ’03), January 2003.
[72] D. Nicol and J. Liu. “Composite Synchronization in Parallel Discrete-Event Simulation.” IEEE Transactions on Parallel and Distributed Systems, Volume 13, Number 5, May 2002.
[73] D. Nicol and X. Liu. “The Dark Side of Risk – What Your Mother Never Told You about Time Warp.” In Proceedings of the 11th Workshop on Parallel and Distributed Simulation (PADS ’97), pages 188-195, 1997.
[74] NS2: The Network Simulator – Home Page. http://www.isi.edu/nsnam/ns/
[75] A. M. Odlyzko. “Internet Traffic Growth: Sources and Implications.” Optical Transmission Systems and Equipment for WDM Networking II, B. B. Dingel, W. Weiershausen, A. K. Dutta, and K.-I. Sato, eds., Proc. SPIE, vol. 5247, 2003, pp. 1-15.
[76] OProfile - A System Profiler for Linux. http://oprofile.sourceforge.net/
[77] K. S. Panesar and R. M. Fujimoto. “Buffer Management in Shared-Memory Time Warp Systems.” In Proceedings of the 9th Workshop on Parallel and Distributed Simulation (PADS ’95), pages 149–156, June 1995.

[78] K. S. Panesar and R. M. Fujimoto. “Adaptive Flow Control in Time Warp.” In Proceedings of the 11th Workshop on Parallel and Distributed Simulation (PADS ’97), pages 108–131, June 1997.

[79] V. Paxson and S. Floyd. “Why we don’t know how to simulate the Internet.” In Proceedings of the 1997 Winter Simulation Conference, pages 1037–1044, 1997.

[80] K. Perumalla. “Techniques for Efficient Parallel Simulation and their Application to Large-scale Telecommunication Network Models.” Ph.D. Thesis, College of Computing, Georgia Institute of Technology, December 1999.

[81] K. Perumalla and R. M. Fujimoto. “Source-code Transformations for Efficient Reversibility.” Technical Report GIT-CC-99-21, College of Computing, Georgia Institute of Technology, September 1999.

[82] K. Perumalla, A. Ogielski, and R. Fujimoto. “TeD — A Language for Modeling Telecommunication Networks.” ACM SIGMETRICS Performance Evaluation Review, Vol. 25, No. 4, March 1998.

[83] J. E. Pitkow and M. M. Recker. “A Simple Yet Robust Caching Algorithm Based on Dynamic Access Patterns.” In Proceedings of the First International Conference on the World-Wide Web, 1994.

[84] A. Poplawski and D. M. Nicol. “Nops: A Conservative Parallel Simulation Engine for TeD.” In Proceedings of the 12th Workshop on Parallel and Distributed Simulation (PADS ’98), volume 23, pages 180–187, May 1998.

[85] B. J. Premore and D. M. Nicol. “Parallel Simulation of TCP/IP Using TeD.” In Proceedings of the 1997 Winter Simulation Conference, pages 437–443, Atlanta, December 1997.
[86] PSP: PlayStation.com - PSP, http://www.us.playstation.com/psp.aspx
[87] D. M. Rao and P. A. Wilsey. “An Ultra-large Scale Simulation Framework.” Journal of Parallel and Distributed Computing, 62:1670–1693, 2002.

[88] Y. Rekhter and P. Gross. “Application of the Border Gateway Protocol in the Internet.” RFC 1772, ftp://ftp.rfc-editor.org/in-notes/rfc1772.txt, March 1995.

[89] Y. Rekhter and T. Li. “A Border Gateway Protocol 4 (BGP-4).” RFC 1771, ftp://ftp.rfc-editor.org/in-notes/rfc1771.txt, March 1995.
[90] Rensselaer Center for Pervasive Computing and Networking. “Network Modeling, Simulation, and Management.” http://www.rpi.edu/cpcn/Networking.htm
[91] RHK, Inc.: RHK – Home page http://www.rhk.com/
[92] G. F. Riley. “Large-scale Network Simulations with GTNetS.” In Proceedings of the 2003 Winter Simulation Conference, pages 676–684, 2003.

[93] G. F. Riley, M. Ammar, R. M. Fujimoto, A. Park, K. Perumalla and D. Xu. “Federated Approach to Distributed Network Simulation.” ACM Transactions on Modeling and Computer Simulation (TOMACS), Vol. 14, No. 2, April 2004.

[94] G. F. Riley, R. M. Fujimoto and M. H. Ammar. “A Generic Framework for Parallelization of Network Simulations.” In Proceedings of the 7th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, pages 128–135, October 1999.

[95] R. Ronngren, M. Liljenstam, R. Ayani, and J. Montagnat. “Transparent Incremental State Saving in Time Warp Parallel Discrete Event Simulation.” In Proceedings of the Tenth Workshop on Parallel and Distributed Simulation, pages 70–77, May 22–24, 1996.

[96] B. Samadi. “Distributed Simulation Algorithms and Performance Analysis.” Ph.D. thesis, Department of Computer Science, UCLA, 1985.

[97] Y. Shi, E. Watson, and Y-S. Chen. “Model-driven simulation of world-wide-web cache policies.” In Proceedings of the 1997 Winter Simulation Conference, pages 1045–1052, 1997.

[98] N. Spring, R. Mahajan, and D. Wetherall. “Measuring ISP Topologies with Rocketfuel.” In Proceedings of ACM SIGCOMM, August 2002.
[99] SSFNet. Available online via http://www.ssfnet.org/homePage.html [accessed March 30, 2005].
[100] J. S. Steinman. “Incremental State-Saving in SPEEDES using C++.” In Proceedings of the 1993 Winter Simulation Conference, pages 687–696, December 1993.

[101] B. Szymanski, Y. Liu and R. Gupta. “Parallel network simulation under distributed Genesis.” In Proceedings of the 17th Workshop on Parallel and Distributed Simulation, June 2003.

[102] B. Szymanski, A. Saifee, A. Sastry, Y. Liu, and K. Madnani. “Genesis: A system for large-scale parallel network simulation.” In Proceedings of the Workshop on Parallel and Distributed Simulation (PADS ’02), pages 89–96, 2002.
[103] R. Teixeira, A. Shaikh, T. Griffin, and G. M. Voelker. “Network Sensitivity to Hot-Potato Disruptions.” In Proceedings of SIGCOMM, 2004.

[104] B. Unger, Z. Xiao, J. Cleary, J-J Tsai and C. Williamson. “Parallel Shared-Memory Simulator Performance for Large ATM Networks.” ACM Transactions on Modeling and Computer Simulation, volume 10, number 4, pages 358–391, October 2000.

[105] C. Vieri. “Pendulum: A Reversible Computer Architecture.” Master’s thesis, MIT Artificial Intelligence Laboratory, 1995.

[106] C. Vieri. “Reversible Computer Engineering and Architecture.” Ph.D. thesis, MIT, 1999.

[107] D. Wessels. “Intelligent Caching for World-Wide Web Objects.” Master’s thesis, University of Colorado, 1995.

[108] D. West and K. S. Panesar. “Automatic incremental state saving.” In Proceedings of the Tenth Workshop on Parallel and Distributed Simulation, pages 78–85, 1996.

[109] F. Wieland. “Practical parallel simulation applied to aviation modeling.” In Proceedings of the Fifteenth Workshop on Parallel and Distributed Simulation, pages 109–116, May 2001.

[110] Z. Xiao, B. Unger, R. Simmonds and J. Cleary. “Scheduling Critical Channels in Conservative Parallel Discrete Event Simulation.” In Proceedings of the 13th Workshop on Parallel and Distributed Simulation (PADS ’99), pages 20–28, 1999.

[111] G. Yaun, D. Bauer, H. Bhutada, C. D. Carothers, M. Yuksel, and S. Kalyanaraman. “Large-scale network simulation techniques: examples of TCP and OSPF models.” ACM SIGCOMM Computer Communication Review, Volume 33, Issue 3, July 2003.

[112] G. Yaun, C. D. Carothers, S. Adali and D. L. Spooner. “Optimistic Parallel Simulation of a Large-Scale View Storage System.” In Proceedings of the 2001 Winter Simulation Conference (WSC ’01), pages 1363–1371, December 2001.

[113] G. Yaun, C. D. Carothers, S. Adali, and D. L. Spooner. “Optimistic parallel simulation of a large-scale view storage system.” Future Generation Computer Systems, 19(4):479–492, 2003.

[114] G. Yaun, C. D. Carothers, and S. Kalyanaraman. “Large-Scale TCP Models Using Optimistic Parallel Simulation.” In Proceedings of the 17th Workshop on Parallel and Distributed Simulation (PADS ’03), pages 153–162, 2003.