The FTT-CAN Protocol for Flexibility in Safety-Critical Systems

Joaquim Ferreira
Instituto Politécnico de Castelo Branco

Paulo Pedreiras
Luís Almeida
José Alberto Fonseca
Universidade de Aveiro

A NEW COMMUNICATION PROTOCOL FOR DISTRIBUTED EMBEDDED SYSTEMS ATTEMPTS TO FIND A COMPROMISE BETWEEN THE OFTEN-OPPOSING GOALS OF SYSTEM FLEXIBILITY AND SAFETY.

0272-1732/02/$17.00 © 2002 IEEE

Flexibility and safety are often considered conflicting concepts¹ because flexibility implies dealing with changing requirements that can, in turn, produce unpredictable and possibly unsafe operating scenarios. Therefore, some in the automotive and avionic system design industry believe² that a safety-critical system implies a fully static system in which all operating conditions are completely defined at pre-runtime. However, flexibility supports evolving requirements, simplifies maintenance and repair, and improves efficiency in system resources. The issue, then, becomes how to find a compromise achieving flexibility without jeopardizing system safety.

Achieving this compromise is particularly important in safety-critical systems that demand resource efficiency. For example, heavy pressure exists to reduce cost in automotive distributed computer control systems. Here, the communication infrastructure deserves particular attention because of the current trend toward encapsulating single functions in separate nodes. This fully distributed scenario has several advantages:

• dependability, with easy replication of nodes and definition of error-containment regions;

• composability, because the system is built by integrating nodes that constitute independent subsystems;

• scalability, through easy addition of new nodes to support new functionality; and

• maintainability, because of the architecture’s modularity and easy node replacement.

However, a high level of distribution also leads to increased traffic on the network, which calls for bandwidth efficiency to meet temporal and throughput requirements.

Time vs. event

When reasoning about the efficient use of the network bandwidth, it is necessary to consider time- and event-triggered communication paradigms. In the automotive industry, the move toward the time-triggered paradigm seems clear with such system standard candidates as TTP/C³ (time-triggered protocol for class C safety requirements), Time-Triggered Controller Area Network (TT-CAN),⁴ and FlexRay.⁵

According to the time-triggered paradigm, all communication occurs at predetermined times. Hence, transmission schedules are predetermined, leading to more foreseeable temporal behavior. This is particularly suited to the transmission of periodic message streams. Moreover, by setting an adequate relative phasing between message streams, designers can control the collisions at medium-access level, eliminating, or bounding, network-induced mutual interference. This feature leads to composability with respect to the temporal behavior because elimination of mutual interference allows the temporal behavior of a subsystem, when isolated, to remain unaffected upon integration in the global distributed system. Relative phasing control also provides improved temporal behavior under heavy traffic conditions. The feature enables better packing of transmissions in the communication system. This means shorter worst-case response times to communication requests and more efficient bandwidth utilization. Finally, the time-triggered paradigm supports detection of connectivity failures at the receiver because message-arrival times are known across the system.

Despite these positive properties, the time-triggered paradigm also has a downside: It is inefficient concerning average network utilization when one or more nodes generate messages in a sporadic way and require fast response. For example, with alarms, event-triggered communication is more efficient because senders transmit only in response to significant state changes, that is, events.

Of course, the event-triggered paradigm also has a downside. It does not support composability with respect to the temporal behavior because the asynchrony of node transmissions leads to mutual interference whenever nodes are integrated in the system. In addition, the requirements for worst-case communication can become higher when one considers the situation in which all nodes attempt to communicate simultaneously. Thus, either we must use more bandwidth or the response times to communication requests will be longer. Finally, the remaining nodes will not immediately detect a fail-silent failure, for example, caused by a system shutdown in one node. To allow remote detection of these failures, event-triggered systems typically use a heartbeat, a timed mechanism using coarse synchronization.

Therefore, from the point of view of efficient bandwidth use, the time-triggered paradigm works better for periodic communication; the event-triggered paradigm for sporadic communication. However, many practical applications of distributed embedded control, such as automotive systems, require the exchange of information of both a periodic and sporadic nature. The former is typically associated with control loops and the latter with alarms and management. Although these two types of traffic can be conveyed over fully event-triggered systems (such as typical CAN-based systems) or fully time-triggered systems (such as TTP/C), network efficiency suggests using a combination of the two paradigms. Such a combination must enforce temporal isolation between both types of traffic. Otherwise, the asynchrony of the event-triggered traffic will spoil the properties of the time-triggered traffic because of mutual interference. A typical way to achieve temporal isolation is to assign nonoverlapping time windows exclusively to the transmission of each type of traffic. Both TT-CAN and FlexRay support such a combination.

Finally, as the name suggests, time drives time-triggered communication. Thus, a coherent notion of time must be enforced across the system. Since nodes are asynchronous by nature because of drift and limited accuracy of local clocks, they require specific systemwide synchronization mechanisms. These, in turn, consume further bandwidth, although normally just a small amount. Conversely, event-triggered communication does not explicitly require a global notion of time or systemwide synchronization mechanisms.

Flexibility and safety

Generically, the term flexibility means the ability to change form. Since the term can apply to many different areas, we need to define exactly to which form we are referring. For example, concerning embedded communication systems, we can refer to such aspects as physical media, topology, bit encoding, live insertion support, mode changes support and rapid download of new application software, and dynamic communication requirements support. The number of options the system supports determines its degree of flexibility.

Among these aspects, flexibility in dynamic communication requirements brings up higher concerns regarding safety. This is because a change in communication requirements could lead to a network overload and consequent timing failures. Nevertheless, limiting the extent of changes to guarantee the timely behavior of the network is possible.

JULY–AUGUST 2002

Currently, most distributed control systems used in automobiles are not safety-critical because they merely enhance the performance of conventional mechanical systems (for example, antilock braking systems and assisted steering). However, in the near future, fault-tolerant computer-control systems that use active redundancy and can operate in the presence of faults will replace the conventional mechanical link between driver and actuators. These systems, typically called X-by-wire, will indeed be safety-critical.

Despite eventual safety-critical requirements, network-traffic-parameter flexibility remains important for greater network resource efficiency. In this context, we consider event-triggered communication systems flexible, that is, they react promptly to communication requests issued at any instant and to instantaneous (variable) communication requirements. Conversely, time-triggered systems, such as TTP/C, are not so flexible; communication occurs at predefined instants exclusively. These systems don’t take into account runtime variations in application communication requirements. However, some degree of flexibility can be achieved with TTP/C, since it allows dynamic changes within a set of predefined operational modes. Nevertheless, all modes must be based on the same time-division multiple access (TDMA) round and be preconfigured in all nodes. TTP/C also supports the dynamic addition of nodes and messages if designers have reserved the required space in the TDMA round and cluster cycle. This, however, keeps extra bandwidth allocated, even when not immediately needed, making it bandwidth inefficient. Therefore, the flexibility of TTP/C in terms of support to dynamic communication requirements is still limited.

The systems that combine both event- and time-triggered paradigms present an intermediate level of flexibility due to the event-triggered part (this holds true for both TT-CAN and FlexRay). FlexRay makes the claim of flexibility for this very reason. Notice, however, that in both systems, the time-triggered traffic requires pre-runtime static definition and is thus inflexible. This inflexibility can be reduced using the same technique as applied in TTP/C (reserving extra bandwidth at design time for runtime use in accommodating new messages).

A new protocol

We propose the flexible time-triggered communication on CAN (FTT-CAN) protocol⁶ to overcome inflexible handling of time-triggered traffic in distributed computer-control systems. FTT-CAN combines both time- and event-triggered paradigms in an efficient way, taking advantage of the underlying medium-access control of CAN. The main feature of the protocol supports online changes to time-triggered communication requirements, such as adding new messages, removing existing ones, or adapting their parameters such as period. The protocol includes a dynamic traffic scheduler that takes into account current communication requirements. An online admission control rejects any change request that might compromise the continued timely behavior of the system.

The main aim of the protocol is to support efficient network utilization when subsystems undergo changes in operating modes. It does this by sharing network bandwidth. Each subsystem only uses the bandwidth that it currently needs. The higher efficiency of this approach is obvious when several subsystems do not operate simultaneously, for example, with cruise control and antilock braking in a car.

The dynamic traffic scheduler supports extra flexibility in two ways: The scheduling policy can change online, for example as a response to an overload, and omitted messages, due to node failures or transmission errors, can be rescheduled within the respective deadlines using backward error-recovery techniques. TTP/C, TT-CAN, or FlexRay do not support any of these FTT-CAN features because of their static approach to managing time-triggered traffic. Table 1 compares efficiency for both types of traffic and for dynamic time-triggered communication requirements between FTT-CAN and the other protocols.

TTP/C receives the lowest score for efficient handling of event-triggered traffic because it handles this traffic type in the same way as time-triggered traffic, using dedicated time windows called event channels. If there is no message to convey, the system wastes that window of bus time. FlexRay uses defined bus windows shared by all nodes to send the event-triggered traffic, making it more efficient than TTP/C. FlexRay, however, uses a collision-free medium-access mechanism, which is equivalent to assigning different wait times to asynchronous messages according to their priority. This mechanism can result in substantial bus idle time when there are ready-to-send, yet low-priority messages. TT-CAN and FTT-CAN also allow event-triggered traffic to share defined bus windows. Moreover, they both take advantage of the deterministic collision resolution mechanism of CAN, which assigns priorities embedded in message identifiers and lets nodes try to immediately send messages within the allowed windows. This mechanism handles event-triggered traffic more efficiently and flexibly than those used in TTP/C and FlexRay.

IEEE MICRO

For efficiency in handling time-triggered traffic, the figures become inverted. TTP/C receives the highest score because it uses a highly packed format with several messages piggybacked in a single frame, and it doesn’t use explicit message identifiers or addresses. Messages are identified by the time at which they are transmitted. On average, TTP/C supports data efficiencies (the ratio between time to transmit data bits and bus time) between 60 and 80 percent, with a theoretical limit of around 95 percent. In the current version of FlexRay,⁵ a new maximum data payload of 246 bytes per frame leads to data efficiency figures close to those of TTP/C. As for CAN-based systems, because of their maximum of 8 bytes per frame, average efficiency figures are between 30 and 40 percent, with a theoretical limit of around 50 percent.
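The CAN limit quoted above can be checked with a quick calculation. The sketch below is not from the article: it assumes a standard-format (11-bit identifier) CAN data frame with 47 overhead bits, including the 3-bit interframe space, and the commonly used worst-case bit-stuffing bound over the 34 stuffable overhead bits plus the payload. The function names are illustrative.

```python
def can_frame_bits(data_bytes: int, worst_case_stuffing: bool = True) -> int:
    """Bits on the bus for one standard-format CAN data frame (assumed layout)."""
    if not 0 <= data_bytes <= 8:
        raise ValueError("classic CAN carries 0 to 8 data bytes")
    payload = 8 * data_bytes
    overhead = 47             # SOF, arbitration, control, CRC, ACK, EOF, interframe space
    stuffable = 34 + payload  # only the fields from SOF to CRC are bit-stuffed
    stuff_bits = (stuffable - 1) // 4 if worst_case_stuffing else 0
    return overhead + payload + stuff_bits


def data_efficiency(data_bytes: int, worst_case_stuffing: bool = True) -> float:
    """Ratio of data bits to total bits occupied on the bus."""
    return 8 * data_bytes / can_frame_bits(data_bytes, worst_case_stuffing)
```

Under these assumptions a frame with 8 data bytes occupies 135 bits with worst-case stuffing (about 47 percent efficiency) and 111 bits with no stuff bits (about 58 percent), which brackets the figure of around 50 percent.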

FTT-CAN

We first presented the original idea underneath the FTT-CAN protocol at the 19th IEEE Real-Time Systems Symposium.⁶ The main features of the protocol are

• flexible handling of time-triggered traffic;

• support for on-the-fly changes both on the message set and scheduling policy;

• online admission control of change requests for the time-triggered traffic;

• indication of temporal accuracy of time-triggered messages;

• support for different types of traffic (time triggered, event triggered, hard real-time, soft real-time, and non-real-time);

• temporal isolation, that is, time- and event-triggered traffic do not disturb each other; and

• efficient use of network bandwidth.

As the name suggests, we built the FTT-CAN protocol on top of CAN. We realized, however, that we could abstract away the underlying network, leading to the so-called network-independent FTT paradigm. The application of this paradigm to different networks results in a set of protocols of which FTT-CAN is an example. Recently, we applied the same paradigm to Ethernet, leading to FTT-Ethernet.⁷

Any protocol following the FTT paradigm must fulfill the properties referred to previously in the list. We achieve this compliance by enforcing a synchronized cyclic framework based on four main features: centralized synchronization, centralized traffic scheduling, master/multislave transmission control, and a producer-distributor-consumer cooperation model.

Centralized synchronization builds, in a simple way, a coherent notion of time across the system. The bus time is organized as an infinite sequence of fixed duration windows called elementary cycles (ECs); the duration is defined at pre-runtime. A special trigger message (TM) indicates the start of each EC. A particular master node broadcasts the TM, which synchronizes all nodes in the system. The duration of the EC sets the temporal resolution of the time-triggered part of the communication system. Consequently, the transmission periods of the traffic are integer multiples of the EC duration.

Table 1. Traffic and communication handling efficiency in different protocols.

                     Relative efficiency
Protocol   Event-triggered   Time-triggered   Dynamic time-triggered communication
           traffic           traffic          requirements support
TTP/C      Lower             Higher           Lower
FlexRay    Medium            Higher           Lower
TT-CAN     Higher            Lower            Lower
FTT-CAN    Higher            Lower            Higher

Centralized traffic scheduling allows locating both the communication requirements and the message scheduling policy in one master node, facilitating online changes to both and achieving a high level of flexibility. Such centralization also facilitates implementation of online admission control in the master node, guaranteeing traffic timeliness for changing communication requirements.

Master/multislave transmission control allows enforcing the traffic timeliness in the bus and achieving a high bandwidth efficiency. In the first aspect, typical of master-slave transmission control, the master explicitly tells each slave when to transmit, thus enforcing traffic timeliness. The second aspect results from the protocol using a master-slave transmission control on a per (elementary) cycle rather than a per message basis. A single master message triggers several slave messages in a single EC, thus reducing the number of control messages and consequently increasing bandwidth efficiency. The TM data field conveys the so-called EC schedule, that is, the messages that must be transmitted within the respective EC, as Figure 1 shows. These messages are transmitted synchronously within the EC’s framework.
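The EC schedule carried in the TM data field can be pictured as a bitmask in which bit x stands for synchronous message x, as in Figure 1. The sketch below is a minimal illustration under that assumption; the function names are hypothetical and the exact byte layout of the real TM is not covered here.

```python
def encode_ec_schedule(message_ids):
    """Return an integer bitmask with bit x set for every scheduled message x."""
    mask = 0
    for x in message_ids:
        if x < 0:
            raise ValueError("message index must be non-negative")
        mask |= 1 << x
    return mask


def scheduled_messages(mask, my_messages):
    """Messages from my_messages that the trigger message asks this slave to produce."""
    return [x for x in sorted(my_messages) if (mask >> x) & 1]


# Master side: schedule the messages named in Figure 1 (SM1, SM2, SM4, SM13).
tm_data = encode_ec_schedule([1, 2, 4, 13])

# Slave side: a node that produces messages 2, 3 and 13 checks which are due.
due = scheduled_messages(tm_data, {2, 3, 13})
```

A single trigger message therefore tells every slave at once what to send in this EC; in the example, the slave sends messages 2 and 13 and stays silent for message 3.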

As opposed to typical time-triggered systems, a master/multislave system does not explicitly need a global distributed time base. The master node tells the nodes when to transmit time-triggered messages. The system uses a centralized time base within the master only, to trigger transmissions via the EC schedule broadcast in the TMs. Furthermore, the centralized scheduling and master/multislave features facilitate the implementation of complex or costly scheduling policies, with no impact on any node except the master. All other nodes promptly follow the EC schedules. For example, we have efficiently implemented earliest-deadline-first scheduling of messages using the FTT paradigm.⁸

Finally, the time-triggered communication in FTT-CAN follows the producer-distributor-consumer cooperation model,⁹ like the periodic traffic in the world factory instrumentation protocol (WorldFIP). The nodes that generate data relevant for other nodes transmit (produce) the data when the master (that is, the distributor) tells them. The nodes that need a particular datum (for example, the temperature of sensor X, or the pressure of sensor Y) scan the network traffic for the information and receive it, that is, consume it, once detected. This model inherently supports one-to-many communication and requires source addressing, which is the native addressing mechanism in CAN. With respect to the combination of time- and event-triggered traffic, the EC is divided into two distinct phases, one for each type of traffic, called synchronous and asynchronous windows. The event-triggered traffic is asynchronous because the application can request the respective transmission at any time regardless of the EC framework.

These FTT-CAN features are independent of the underlying network and, thus, are present in any FTT protocol. There are, however, several network-specific features concerning the traffic organization and timeliness enforcement within each EC and the support for event-triggered traffic. In this particular case, the FTT-CAN protocol takes advantage of the CAN-prioritized, distributed media access control (MAC) with its nondestructive bitwise arbitration mechanism. For example, the synchronous messages scheduled for a given EC can be triggered simultaneously within the respective synchronous window. It is up to the distributed MAC of the CAN to resolve collisions and serialize message transmissions.

[Figure 1 diagram: an elementary cycle (EC) opens with the EC trigger message (TM), followed by synchronous messages SM1, SM2, SM4, and SM13. Data bits of the trigger message (EC schedule), marked bit 16, bit 8, and bit 0: 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 1 0.]

Figure 1. Master/multislave access control. Slaves produce synchronous messages according to an elementary-cycle schedule conveyed by the trigger message. If the x data bit is 1, then message x is produced in this EC; if it is 0, then message x is not produced.

For event-triggered (asynchronous) traffic, the situation is similar. Slaves with pending requests might try to transmit immediately, but only within the asynchronous window of each EC, shown in Figure 2. This scheme, similar to the arbitration windows in TT-CAN, allows a very efficient combination of time- and event-triggered traffic, resulting in low communication overhead and shorter response times.

As seen in Figure 2, the synchronous window is processed after the asynchronous one, close to the end of the EC. As a result, the start of the synchronous window varies depending on the length of the messages scheduled for that EC. The TM also conveys the instant in which the synchronous window of each EC must start, relative to the reception of the TM. Nevertheless, the protocol establishes a maximum duration for the synchronous windows and, correspondingly, a maximum bandwidth for that type of traffic. This reserves minimum bandwidth for the asynchronous traffic.

To enforce a strict temporal isolation between synchronous and asynchronous traffic, FTT-CAN prevents the start of transmissions that won’t complete within the respective window. We achieve this by removing from the network controller transmission buffer any pending request that cannot complete within that interval, keeping it in the transmission queue. As a consequence, a short amount of idle time might appear at the end of the asynchronous window (α in Figure 2). Variation in the stuff bits used in the physical encoding of CAN messages might make a short amount of idle time appear at the end of the synchronous window.
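The window-fit rule can be sketched as a small simulation. This is a simplified model, not the protocol's implementation: times are in microseconds from the start of the window, each message is assumed to take exactly its worst-case transmission time, and the function name and message timings are hypothetical.

```python
def run_window(queue, start_us, end_us):
    """Send queued messages in order while each one can finish inside the window.

    queue: list of (name, worst_case_tx_us) pairs in priority order.
    Returns (names sent, messages kept queued, idle time left at the window end).
    """
    now = start_us
    sent = []
    pending = list(queue)
    # A transmission may only start if it is guaranteed to end before end_us.
    while pending and now + pending[0][1] <= end_us:
        name, tx_us = pending.pop(0)
        sent.append(name)
        now += tx_us
    return sent, pending, end_us - now


sent, kept, idle_us = run_window(
    [("AM9", 270), ("AM5", 270), ("AM4", 270)], start_us=0, end_us=700
)
```

In this run two messages fit, the third stays queued for a later EC, and the 160 µs left over plays the role of the inserted idle time α.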

Master node

A synchronous requirement table (SRT) residing in the master node holds the communication requirements of time-triggered traffic. Each SRT entry contains the description of one periodic (synchronous) message stream. These parameters include identification; data byte length; maximum transmission time; period length; deadline; relative phasing; message stream value to the application; attributes indicating acceptable operations over the message stream; and a list, or range, of specified parameter values. These attributes or ranges limit flexibility on a per stream basis. They can indicate that no online changes are accepted for a given stream (static stream), or that the stream properties might be changed online (dynamic stream), such as adapting its period or relative phasing. In this case, the adapted values must always be within the list or range specified.
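One way to picture an SRT entry is as a record holding the parameters listed above. The sketch below is only an illustration: the field names, the choice of expressing times in ECs, and the single period range are assumptions, not the actual FTT-CAN data layout.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SrtEntry:
    """One synchronous message stream; periods and deadlines are counted in ECs."""
    ident: int
    data_bytes: int
    max_tx_time_us: int
    period_ecs: int
    deadline_ecs: int
    phase_ecs: int = 0
    value: int = 0                                  # importance to the application
    dynamic: bool = False                           # False means a static stream
    period_range: Optional[Tuple[int, int]] = None  # allowed (min, max) period

    def can_set_period(self, new_period_ecs: int) -> bool:
        """True if an online period change respects this stream's attributes."""
        if not self.dynamic or self.period_range is None:
            return False
        low, high = self.period_range
        return low <= new_period_ecs <= high


stream = SrtEntry(ident=4, data_bytes=8, max_tx_time_us=270,
                  period_ecs=2, deadline_ecs=2,
                  dynamic=True, period_range=(1, 4))
```

Here `stream.can_set_period(3)` is accepted, while a request for a period of 5 ECs, or any change to a static stream, is refused.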

The central component within the master internal architecture, depicted in Figure 3, is the SRT. The traffic scheduler executes online, recurrently scanning the SRT to build the EC schedules broadcast on the network. Therefore, the scheduler takes into account any change in the SRT when next invoked. This makes the system highly flexible for dynamic time-triggered communication requirements. All change requests, however, go through the online admission control block before acceptance in the SRT. This enforces timely behavior of the communication system and assures that changes don’t go beyond the flexibility constraints described by the attributes.
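A single scheduler pass and a simple admission test might look like the sketch below. Several things here are assumptions made for illustration: periods and phases are counted in ECs, ready streams are served shortest period first, deferred streams are only reported rather than carried over, and the admission test is a necessary average-load check, not a full schedulability analysis.

```python
def build_ec_schedule(srt, ec_number, max_sync_window_us):
    """Pick the streams to trigger in EC number ec_number.

    srt: list of dicts with keys id, period_ecs, phase_ecs, tx_us.
    Returns (ids scheduled in this EC, ids that were ready but did not fit).
    """
    ready = [m for m in srt
             if ec_number >= m["phase_ecs"]
             and (ec_number - m["phase_ecs"]) % m["period_ecs"] == 0]
    ready.sort(key=lambda m: (m["period_ecs"], m["id"]))  # assumed policy
    scheduled, deferred, used = [], [], 0
    for m in ready:
        if used + m["tx_us"] <= max_sync_window_us:
            scheduled.append(m["id"])
            used += m["tx_us"]
        else:
            deferred.append(m["id"])
    return scheduled, deferred


def admit(srt, candidate, max_sync_window_us):
    """Necessary (not sufficient) load test for adding candidate to the SRT."""
    load_per_ec = sum(m["tx_us"] / m["period_ecs"] for m in srt + [candidate])
    return load_per_ec <= max_sync_window_us


srt = [
    {"id": 1, "period_ecs": 1, "phase_ecs": 0, "tx_us": 200},
    {"id": 2, "period_ecs": 2, "phase_ecs": 0, "tx_us": 200},
    {"id": 4, "period_ecs": 4, "phase_ecs": 1, "tx_us": 200},
]
```

Because the table is rescanned for every EC, a stream added to or removed from `srt` takes effect the next time `build_ec_schedule` runs, and `admit` is the point where a change request can be turned down.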

As stated previously, the scheduler executes online, with runtime computing costs directly related to the scheduling policy. The scheduler can execute in low-processing-power microcontrollers (for example, a demonstrator was built using Philips 80C592 controllers). When the scheduling incurs high computing costs, we use either a more powerful CPU or a special-purpose scheduling coprocessor. The respective cost penalty is confined, however, to the master node.

[Figure 2 diagram: an elementary cycle (EC) opens with the EC trigger message (TM) and contains an asynchronous window (messages AM4, AM5, AM9, and inserted idle time α) and a synchronous window (messages SM1, SM2, SM4, SM13). The start of the synchronous window is encoded within the trigger message.]

Figure 2. The double phase elementary cycle with synchronous and asynchronous windows.

Two subsystems, the synchronous messaging system (SMS) and the asynchronous messaging system (AMS), manage the communication services delivered to the application by FTT-CAN. The SMS offers services based on the producer-distributor-consumer model and handles the synchronous traffic with autonomous control, that is, the network interface exclusively carries out transmission and reception of messages without intervention from the application software. Shared buffers pass data to and from the network. The AMS offers send-and-receive basic services only. The AMS follows an external control approach according to which the application directly initiates the transmission. The SMS delivers services to manage the SRT, making use of special-purpose asynchronous messages.

Fault hypothesis

We need to thoroughly determine the impact of network errors and node failures to use FTT-CAN in safety-critical applications. We use fault tolerance techniques to limit that impact, which, however, increases system cost. The particular configuration of the FTT-CAN design lets the designer determine the degree of complexity or fault tolerance.

The fault hypothesis that we consider in the protocol takes into account physical faults, such as those related to electromagnetic interference, defects, or damage. Moreover, as is common in all bus-based systems, the hypothesis does not tolerate partitioning unless the bus is replicated. In addition, we consider every node in the system as fail-silent, that is, it transmits either correctly or not at all. A node can be fail-silent in the time domain where transmissions occur at the right instants only, or in the value domain where messages contain correct values only. FTT-CAN enforces fail silence in the time domain in all nodes by using bus guardians: autonomous devices with respect to the node host processor that enforce adequate timing in the node transmissions. Upon reception of a TM, FTT-CAN bus guardians synchronize their internal timers, decode the trigger-message contents (duration of synchronous and asynchronous windows and EC schedule), and block transmissions with the wrong timing. In the value domain, the protocol considers fail silence in the master node only. In this case, the scheduler and the SRT are replicated internally. Whenever the EC schedule built by the scheduler replica does not match the one built by the primary scheduler, no TM is generated. The remaining nodes are application specific: if this fail-silent behavior in the value domain is desired, it must also be implemented within the node.
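The bus guardian behavior described above can be sketched as a small gatekeeper object. This is a loose illustration: the class, its method names, and the way window boundaries are passed as offsets from TM reception are all assumptions, and the time taken by the TM itself is ignored.

```python
class BusGuardian:
    """Blocks node transmissions that fall outside the window they belong to."""

    def __init__(self):
        self.tm_time_us = None  # no trigger message seen yet

    def on_trigger_message(self, now_us, async_end_us, sync_start_us,
                           sync_end_us, schedule_mask):
        # Resynchronize the local timer and store the decoded TM contents.
        # Window boundaries are offsets relative to the reception of the TM.
        self.tm_time_us = now_us
        self.async_end_us = async_end_us
        self.sync_start_us = sync_start_us
        self.sync_end_us = sync_end_us
        self.schedule_mask = schedule_mask

    def allows(self, now_us, tx_us, message_id=None):
        """message_id is given for synchronous messages, None for asynchronous."""
        if self.tm_time_us is None:
            return False  # not yet synchronized: stay silent
        start = now_us - self.tm_time_us
        end = start + tx_us
        if message_id is None:
            return start >= 0 and end <= self.async_end_us
        in_schedule = bool((self.schedule_mask >> message_id) & 1)
        return in_schedule and self.sync_start_us <= start and end <= self.sync_end_us


guardian = BusGuardian()
guardian.on_trigger_message(now_us=1000, async_end_us=600, sync_start_us=600,
                            sync_end_us=1000, schedule_mask=0x2016)
```

With this state, an asynchronous frame that would overrun the asynchronous window is blocked, and a synchronous frame passes only if its bit is set in the EC schedule and it fits inside the synchronous window.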

[Figure 3 diagram: blocks labeled application software, application interface, admission control, synchronous requirements table (parameters of synchronous messages: ID, period, deadline, transmission time, …), traffic scheduler, EC schedule register, network interface, and CAN.]

Figure 3. Master node’s internal architecture.

In a replicated bus scenario, each node attaches to both redundant buses using independent network controllers. Every transmission is carried out on both buses, with the automatic retransmission upon network errors disabled. The system considers a transmission successful if at least one of the transmissions in both controllers was successful. On the reception side, all nodes receive messages from all redundant media and discard the replicas by comparing the identifiers.
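The two rules for replicated buses can be sketched as follows. The class and function names are hypothetical, and the filter assumes that a given identifier is delivered at most once per EC, which is a simplification introduced here.

```python
def transmission_successful(ok_bus_a: bool, ok_bus_b: bool) -> bool:
    """A send counts as successful if at least one bus carried it correctly."""
    return ok_bus_a or ok_bus_b


class ReplicaFilter:
    """Delivers each identifier once per EC even if it arrives on both buses."""

    def __init__(self):
        self.seen = set()

    def new_cycle(self):
        # Forget the identifiers of the previous EC when a new TM arrives.
        self.seen.clear()

    def accept(self, message_id: int) -> bool:
        if message_id in self.seen:
            return False  # replica from the other bus: discard
        self.seen.add(message_id)
        return True


rx = ReplicaFilter()
delivered = [rx.accept(13), rx.accept(13), rx.accept(4)]
```

The first copy of message 13 is delivered, the second copy is dropped as a replica, and message 4 is delivered normally.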

Transmission errors

Electromagnetic interference, bad connectors, or cabling short circuits can cause transmission errors. When this happens, the original CAN protocol triggers an automatic retransmission. This feature provides reliable communication. Timeliness, however, suffers because the retransmission takes place independently of the temporal validity of the respective message.

In FTT-CAN, unless replicated buses are used, there is no need to disable the automatic retransmission since its duration is limited by the duration of the window where the retransmission takes place. All transmission activity is suspended at the end of each window in every EC, including retransmissions. This leads to error confinement within both subsystems: any error in the SMS does not affect the AMS and vice versa. Within each subsystem, the designer can allocate extra time to cope with errors as forecast by an appropriate error model¹⁰ (this is fault-tolerant scheduling). Errors beyond the error model might, however, still occur. An active mechanism based on the master node handles these errors according to a never-give-up strategy. A bus monitoring mechanism looks for missing synchronous messages. If the respective producer node did not send a particular message instance, the master tries to reschedule the message for future ECs¹¹ within the message deadline, in a best-effort approach. The TM is always transmitted in single-shot mode. If any error occurs during the TM’s transmission, it is not effectively broadcast, resulting in an omission handled with master replication.
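The bus monitoring and best-effort rescheduling step can be sketched as a comparison between what was scheduled and what was seen. The function name and the representation of deadlines as absolute EC numbers are assumptions made for this example.

```python
def reschedule_omissions(expected, observed, current_ec, deadline_ec):
    """Best-effort recovery list for the next EC.

    expected: message ids scheduled in the EC that just ended.
    observed: set of message ids actually seen on the bus in that EC.
    deadline_ec: id -> last EC number in which the instance is still useful.
    Returns (ids to retry in the next EC, ids whose deadline has passed).
    """
    retry, lost = [], []
    for x in expected:
        if x in observed:
            continue  # produced normally, nothing to recover
        if current_ec + 1 <= deadline_ec[x]:
            retry.append(x)
        else:
            lost.append(x)
    return retry, lost


retry, lost = reschedule_omissions(
    expected=[1, 2, 4, 13],
    observed={1, 4},
    current_ec=10,
    deadline_ec={1: 10, 2: 12, 4: 11, 13: 10},
)
```

Message 2 is missing but can still meet its deadline, so it is retried; message 13 is missing and already too late, so it is reported as lost.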

Replication and synchronization

The reception of the TM synchronizes the entire FTT-CAN-based distributed system. When no such message occurs, a temporary loss of connectivity occurs. One or more backup masters prevent such a situation. These masters monitor the network looking for TMs. Whenever the next TM is delayed more than a given tolerance, the backup masters try to transmit the missing TM. The first backup master that succeeds becomes the primary master. All others receive the recovered TM and become, or remain, backup masters. In cases of transmission of an error-corrupted TM, the system considers it an omission and both active and backup masters will retry transmission.
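The takeover rule for backup masters can be sketched in two small functions. The names, the microsecond time base, and the idea of passing the competing backups in the order in which their TMs reach the bus are assumptions for illustration only.

```python
def backup_master_step(now_us, last_tm_us, ec_duration_us, tolerance_us):
    """Decide what a backup master does at time now_us.

    Returns "wait" while the next TM is not overdue, or "send_tm" once it is
    delayed by more than tolerance_us.
    """
    expected_tm_us = last_tm_us + ec_duration_us
    if now_us > expected_tm_us + tolerance_us:
        return "send_tm"
    return "wait"


def resolve_takeover(candidates_in_bus_order):
    """The first backup whose TM gets through becomes primary; the rest stay backups."""
    roles = {name: "backup" for name in candidates_in_bus_order}
    if candidates_in_bus_order:
        roles[candidates_in_bus_order[0]] = "primary"
    return roles
```

For example, with a 10 ms EC and a 100 µs tolerance, a backup keeps waiting at 10,050 µs after the last TM and tries to send the missing TM at 10,101 µs.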

Fundamental to this design is the synchronization between primary and backup masters. Since the schedulers running on the masters are dynamic, it must be guaranteed that in each EC they generate similar schedules. Thus, in every EC, all backup masters compare their schedules with the one conveyed by the TM. Moreover, they also compare a short cyclic sequence number encoded in the TM. If a backup detects an inconsistency, it issues a synchronization request to the primary. This causes the current primary master to download the SRT and the relative phasing information that the backup needs to resume scheduling synchronously. This process might take a few ECs, depending on the size of the SRT and current network utilization. During this time, the resynchronizing backup remains inactive.
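A backup master's per-EC consistency check might look like the following Python sketch. The representation of a schedule as a tuple of message IDs, the sequence-number modulus of 16, and the returned action strings are hypothetical choices for this example; the article does not specify how the TM encodes them.

```python
SEQ_MODULUS = 16  # assumed width of the short cyclic sequence number


def next_seq(seq, modulus=SEQ_MODULUS):
    """Advance the cyclic sequence number by one EC."""
    return (seq + 1) % modulus


def check_against_tm(local_schedule, local_seq, tm_schedule, tm_seq):
    """Compare the backup's own view with the one conveyed by the TM."""
    if local_schedule == tm_schedule and local_seq == tm_seq:
        return "in_sync"
    # Any mismatch triggers a synchronization request to the primary.
    return "request_sync"
```

A backup that computed the schedule `(1, 4, 7)` but receives a TM carrying `(1, 4)` would return `"request_sync"` and stay inactive until the transfer from the primary completes.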

At startup, a relatively simple synchronization procedure assures that only one master becomes the primary one. Any starting (or restarting) master waits for a preconfigured time, that is, an initial period lasting for a few ECs, during which it listens for a TM. If one is detected, this indicates that the system is running and that a primary master is working. Thus, the starting master issues a synchronization request and enters backup mode as soon as the synchronization process completes. If no TM occurs during the initial listen period, the starting master assumes that it is the first master activated, starts the scheduler, and tries to enter primary mode by transmitting a TM. However, other starting masters might also compete simultaneously to become primary. Therefore, all of the starting masters check for successful transmission. The master that succeeds becomes the primary. The others receive the transmitted TM, issue a synchronization request, and enter backup mode.

Finally, a membership service enables the system to know which backup masters are present as well as their operational status. This service uses a regular query, broadcast by the currently active primary master, to which all operational backup masters respond with information about their status. This service uses special-purpose asynchronous messages whose transmission frequency is set up at pre-runtime.
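The startup procedure for masters described above can be read as a small state machine. The Python sketch below is one possible rendering under stated assumptions: the `Mode` names and the two boolean inputs are hypothetical, and the listening period, the scheduler, and the actual bus access are abstracted away.

```python
from enum import Enum


class Mode(Enum):
    LISTENING = "listening"
    SYNCHRONIZING = "synchronizing"
    BACKUP = "backup"
    PRIMARY = "primary"


def startup_path(tm_heard, own_tm_sent_ok=False):
    """Return the modes a starting (or restarting) master passes through."""
    path = [Mode.LISTENING]
    if tm_heard:
        # A primary is already running: synchronize, then act as backup.
        path += [Mode.SYNCHRONIZING, Mode.BACKUP]
    elif own_tm_sent_ok:
        # Silent bus and our own TM was transmitted successfully.
        path.append(Mode.PRIMARY)
    else:
        # Silent bus, but a competing master's TM succeeded instead.
        path += [Mode.SYNCHRONIZING, Mode.BACKUP]
    return path
```

Only the branch in which the bus stayed silent and the master's own TM went out ends in primary mode, which reflects the requirement that exactly one master becomes primary.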

The adoption of CAN to support the FTT paradigm has several advantages. It greatly simplifies efficient handling of event-triggered traffic because of the deterministic collision resolution mechanism embedded in the respective MAC. Moreover, CAN network controllers and cabling are relatively inexpensive, an important aspect in the automotive industry. Last but not least, the relatively robust physical layer, with respect to error detection and tolerance of physical faults, lets FTT-CAN systems operate in harsh environments.

Nevertheless, the maximum standardized transmission speed on CAN is 1 megabit per second (Mbps), which limits available bandwidth and puts more emphasis on efficient bandwidth utilization. TT-CAN suffers from the same limitation, but both FlexRay and TTP/C use considerably higher transmission speeds. FlexRay’s transmission speed is 10 Mbps; TTP/C’s is 2 to 25 Mbps. Of course, these larger bandwidths allow a more relaxed approach toward efficient bandwidth utilization. Although the topic is controversial within the automotive systems design industry, the current trend of increasing the number of subsystems in a car means that even those bandwidths will probably become scarce in the near future.
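To put these bit rates in perspective, the following Python sketch computes how long a fixed number of bits occupies the bus at each nominal speed quoted above. The 1,000-bit figure is an arbitrary illustrative load, and the calculation ignores framing overhead, which differs between protocols.

```python
# Nominal bit rates quoted in the text, in bits per second.
RATES_BPS = {
    "CAN / TT-CAN": 1_000_000,
    "TTP/C (low end)": 2_000_000,
    "FlexRay": 10_000_000,
    "TTP/C (high end)": 25_000_000,
}


def transfer_time_us(bits, rate_bps):
    """Time in microseconds needed to put `bits` on the wire at `rate_bps`."""
    return bits * 1_000_000 / rate_bps


def compare(bits=1000):
    """Map each protocol label to the raw transfer time for `bits` bits."""
    return {name: transfer_time_us(bits, rate) for name, rate in RATES_BPS.items()}
```

Under these assumptions, 1,000 bits take 1,000 µs on CAN but 100 µs on FlexRay, which is why bandwidth efficiency matters more on the slower bus.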

Therefore, to cope with an ever-growing bandwidth demand, we have recently proposed a new FTT-based protocol that is supported on Ethernet (FTT-Ethernet⁷). Like FTT-CAN, this protocol shares the properties inherent to the FTT paradigm. It enforces a collision-free MAC on Ethernet and allows taking advantage of scalable transmission speeds, low-cost network adapters, and hierarchical topologies based on hubs and switches. Safety issues are currently being addressed. Once completed, the FTT-Ethernet protocol will support more bandwidth-demanding applications such as multimedia systems, computer vision, navigation systems, and broadband Internet access, all integrated in a flexible and safe communication framework.

References

1. H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications, Kluwer Academic, Boston, 1997.

2. Minimal Operational Performance Standards for Avionics Computer Resources, ARINC RTCA SC-182/EUROCAE WG-48, RTCA, Washington, D.C., 1999.

3. TTP/C Protocol, version 0.5, TTTech Computertechnik, Vienna, 1999.

4. ISO/CD11898-4, Road Vehicles—Controller Area Network (CAN)—Part 4: Time-Triggered Communication, Int’l Organization for Standardization, Geneva, 2000.

5. F. Bogenberger, B. Müller, and T. Führer, “Protocol Overview,” Proc. FlexRay Int’l Workshop, 2002; http://www.flexray.com/htm/infws402.htm (current July 2002).

6. L. Almeida, J. Fonseca, and P. Fonseca, “Flexible Time-Triggered Communication on a Controller Area Network,” Proc. Work-In-Progress Session, 19th IEEE Real-Time Systems Symp. (RTSS), 1998; http://www.cse.unl.edu/rtss98wip/proceedings/ (current July 2002).

7. P. Pedreiras, L. Almeida, and P. Gai, “The FTT-Ethernet Protocol: Merging Flexibility, Timeliness and Efficiency,” Proc. 14th Euromicro Conf. Real-Time Systems, IEEE CS Press, Los Alamitos, Calif., 2002, pp. 152-160.

8. P. Pedreiras and L. Almeida, “A Practical Approach to EDF Scheduling on CAN,” Proc. IEEE Workshop Real-Time Distributed Embedded Systems (WRTDES), 2001; http://rtds.cs.tamu.edu (current July 2002).

9. J.P. Thomesse and M. Leon-Chavez, “Main Paradigms as a Basis for Current Fieldbus Concepts,” Proc. Int’l Conf. Fieldbus Technology (FeT 99), Springer-Verlag, Vienna, 1999, pp. 2-15.

10. N. Navet and Y.-Q. Song, “Design of Reliable Real-Time Applications Distributed over CAN (Controller Area Network),” Proc. IFAC Symp. Information Control in Manufacturing (INCOM), Elsevier Science, Oxford, England, 1998, pp. 391-396.

11. J. Ferreira et al., “FTT-CAN Error Confinement,” Proc. 4th IFAC Conf. Fieldbus Technology (FeT), Elsevier Science, Oxford, England, 2001, pp. 8-15.

Joaquim Ferreira is currently a PhD student at the Universidade de Aveiro, Portugal. He is also professor adjunto at the Instituto Politécnico de Castelo Branco, Portugal. His research interests include real-time communication systems, fieldbuses, fault tolerance, and digital design. He has a licenciatura degree in electronics and telecommunications engineering and an MSc in electrical engineering from the Universidade de Aveiro, Portugal.

Paulo Pedreiras is a PhD student at the Universidade de Aveiro, Portugal. His research interests include real-time communication systems, scheduling, fieldbuses, and real-time operating systems. He has a licenciatura degree in electronics and telecommunications engineering from the Universidade de Aveiro, Portugal.

Luís Almeida is assistant professor in the Department of Electronics and Telecommunications at the Universidade de Aveiro, Portugal. His research interests include real-time networks for distributed industrial/embedded systems and navigation control for mobile robots. He has a licenciatura degree in electronics and telecommunications engineering and a PhD in electrical engineering from the Universidade de Aveiro, Portugal. He is an IEEE Computer Society member.

José Alberto Fonseca is an associate professor in the Department of Electronics and Telecommunications at the University of Aveiro, Portugal. His research interests include embedded systems, distributed systems, and industrial communications. He has a licenciatura degree in electronics and telecommunications engineering and a PhD in electrical engineering from the University of Aveiro, Portugal. He is a member of the IEEE Industrial Electronics Society.

Direct questions and comments about this article to Luís Almeida, Departamento de Electrónica e Telecomunicações, Universidade de Aveiro, Campo Universitário de Santiago, 3810-193, Aveiro, Portugal; [email protected].

JULY–AUGUST 2002
