Journal of Intelligent Manufacturing (1992) 3, 1-10

Models for fault tolerance in manufacturing systems

ANDERS ADLEMO¹ and SVEN-ARNE ANDRÉASSON²

¹Department of Computer Engineering and ²Department of Computer Science, Chalmers University of Technology, S-412 96 Göteborg, Sweden

Received June 1990 and accepted March 1991

The field of fault tolerance in computer science and engineering has been thoroughly investigated over a long period of time. A great number of different approaches have been presented on means for improving fault tolerance under certain error conditions in computerized systems. One important area that has introduced computers in order to enhance productivity, flexibility and economy is manufacturing systems, with the goal of achieving computer-integrated manufacturing (CIM). Using computers in a manufacturing system introduces new sources of difficulties, as well as providing new possibilities for overcoming erroneous situations that might disturb production. The aim of this paper is to describe how the use of different configurations for a manufacturing system can improve fault tolerance. One specific erroneous situation which may occur in CIM is the partitioning of a network. This situation can be handled satisfactorily by using the suggested manufacturing system configurations. Additional improvements to fault tolerance can be achieved through the introduction of data buffers and material buffers. This approach is described in this paper.

Keywords: Caches, computer-integrated manufacturing, distributed systems, dynamic configuration, fault tolerance, network partitions

1. Introduction

The study of computer-integrated manufacturing (CIM) is an area of investigation that has had great impact on manufacturing systems. All the different kinds of computers within a factory can be incorporated into a single distributed system, forming a totally automated fabrication (Kang et al., 1985; Kusiak and Heragu, 1988; McGuffin et al., 1988). Fault tolerance is of great importance for achieving very high production reliability between cyclic maintenance occasions (Messina and Tricomi, 1987; Chintamaneni et al., 1988). The increasing complexity of interconnected systems increases demands for maintenance and fault handling, and the probability that components may fail increases with the number of components in a system. Failures, e.g. stops in operations, can be very expensive, and situations in which the failure of a single component disables the entire system are unacceptable. It is therefore necessary to replicate sensitive components. Consequently, more than one way must exist to configure a manufacturing system.

0956-5515 © 1992 Chapman & Hall

This paper describes how improved fault tolerance can be achieved by using three different methods of dynamic configuration, namely:

(1) Dynamic configuration of the work distribution, in which faulty units are avoided during work distribution and work is transferred to redundant units when a unit fails.

(2) Dynamic configuration of the manufacturing system, in which units are moved to replace faulty units.

(3) Dynamic configuration of the missions, in which alternative production is prepared in the CAD/CAM process.

One of many failures that can occur in a distributed system is network partitioning, resulting in a division of the network nodes into sub-groups or partitions. In a modern manufacturing system, there usually exist two different kinds of networks: the data network and the material network. A partition in either of these networks can result in a very costly production stop. As partitions of a network cannot be totally avoided, the goal should be to reduce their


negative effects as much as possible. This paper presents a way to improve fault tolerance for this type of occurrence.

Other examples of aspects that should be considered when dealing with fault tolerance in manufacturing systems are:

(1) Detection of faulty production units or communication links (Andréasson, 1990).

(2) Redundant information flow (Adlemo and Andréasson, 1991b).

(3) Redundant production units.

One common approach is to view a manufacturing system as having a hierarchical structure (Karmakar, 1988; Sintonen and Virvalo, 1988). Figure 1 illustrates the hierarchical structure of the data network in a simple manufacturing system. The same hierarchical structure can be applied for both the data and the material network. The data network deals with the distribution of data between the production units. The distributed data can be of various types, e.g. control programs for robots or statistics on the production flow. The material network takes care of the transportation of material between different production units.

Section 2 briefly describes a tool called the general recursive system, GRS (Adlemo et al., 1990a). This tool is built on a hierarchical, recursive structure that provides a manufacturing system designer with the possibility of describing a system in a simple, uniform way. When a data or material network is partitioned, the caches enable the production to continue for a limited period of time. During this time, the system must try to perform some kind of reconfiguration in order to maintain production (Adlemo et al., 1990a). The aspects of system configuration, the problem of an optimal production allocation and a system reconfiguration in case of a production disturbance are discussed in Sections 3, 4 and 5, respectively. The caches that are introduced are of two kinds, data and material caches (Adlemo et al., 1990c). The data network allows the distribution of data to the data caches, while the material network distributes material to the material caches. The caches, as considered in this paper, are presented in Section 6. In case of a production disturbance, it might be necessary to perform a system reconfiguration. An algorithm is described in Section 7 that

[Figure: assembly lines connected by a conveyor belt, the material network.]

Fig. 1. A part of a simple manufacturing system.

deals with this problem. Finally, Section 8 illustrates how the use of data and material caches can improve availability in manufacturing systems.

2. General recursive system

The general recursive system (GRS) is a useful tool for constructing abstract models of distributed systems such as CIM, in which different types of fault tolerance may be introduced.

A GRS consists of a set of entities {e1, e2, ..., en} that are interconnected via a GRS internal data network, ND, and a GRS internal material network, NM. The GRS is connected to the exterior through a data port, PD, and a material port, PM. When the GRS model is used to describe a real system, some of the ports may be absent. Consequently, the GRS has the ability to model both the data and the material transportation. A similar approach, which models material transportation and material buffers for assembly systems, is given in Barbier and Delmas (1989). An entity or GRS may be looked upon as a server, providing an offered service, using the terminology of Cristian (1989). An entity is either a GRS in itself or an atomic entity. An entity which is a GRS can be further expanded into more elementary entities connected to their own data and material network. In contrast, an atomic entity cannot be further expanded. An entity is connected to each of the internal GRS networks via one data and one material port. An atomic entity, AE, is associated with a set of offered operations, OAE (i.e. operations that can be performed by the atomic entity).

Formally, the concept of the GRS is defined as:

GRS = (E, ND, NM, PD, PM)

where E = {e1, e2, ..., en}; e = GRS | AE; AE = (black_box, PD, PM, OAE).

A general recursive system could be described graphically as in Fig. 2. The diagram to the left shows an example of a GRS, while the diagram on the right demonstrates the recursiveness of the model. Figure 3 illustrates an example of a manufacturing system, using the GRS.
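The recursive definition above lends itself to a small data model. The following sketch is ours, not the paper's: entities expand recursively down to atomic entities, while the networks and ports are reduced to simple boolean markers.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class AtomicEntity:
    name: str
    offered_operations: frozenset[str]   # O_AE
    has_data_port: bool = True           # P_D
    has_material_port: bool = True       # P_M

@dataclass
class GRS:
    name: str
    entities: list[GRS | AtomicEntity] = field(default_factory=list)  # E
    has_data_port: bool = True
    has_material_port: bool = True

def leaves(e: GRS | AtomicEntity) -> list[AtomicEntity]:
    # Recursively expand an entity down to its atomic entities.
    if isinstance(e, AtomicEntity):
        return [e]
    return [ae for child in e.entities for ae in leaves(child)]

factory = GRS("factory", [
    AtomicEntity("robot", frozenset({"weld"})),
    GRS("cell", [AtomicEntity("drill", frozenset({"drill"}))]),
])
print([ae.name for ae in leaves(factory)])  # ['robot', 'drill']
```

The recursion in `leaves` mirrors the e = GRS | AE alternative: a GRS expands, an atomic entity does not.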

2.1. Missions, required services, offered services and mission pools

A detailed description of the work to be performed by the GRS is given in missions (Andréasson et al., 1989).

A mission, M, is a (partially ordered) set of entity missions, EM:

M = {EM1, EM2, ..., EMn-1, EMn}

An entity mission, EM, could either be a mission in itself, which can be further expanded, or an atomic mission, AM:

EM = M | AM


Fig. 2. A graphical illustration of the general recursive system.

[Figure: entities connected via data ports and material ports.]

Fig. 3. Using the GRS to describe a manufacturing system.

An atomic mission is a mission that cannot be further expanded. This can be seen as a program that describes what will be performed by a corresponding atomic entity. Each atomic mission, AM, is associated with a set of required operations, RAM. An entity mission can be seen as a program that describes what will be performed by a corresponding entity. An entity mission also implicitly includes a partially ordered structure that gives information about how the entity mission should be distributed between entities on lower levels. Each entity mission, EM, is associated with a set of required services, Req_servEM (composed from the required services of the entities at the level below):

Req_servM = {Req_servEM1, Req_servEM2, ..., Req_servEMn}

Req_servEM = Req_servM | RAM

The required services, Req_servEM, for each entity mission within the mission must be obtained. These are calculated on the basis of the required operations, RAM, that are included in the atomic missions.
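As a rough illustration of this bottom-up calculation, the sketch below computes the flat set of required operations for a mission. The encoding is ours: an atomic mission is a frozenset of required operations (RAM), and an inner mission is a tuple of sub-missions.

```python
def required_operations(mission):
    # Atomic mission: its required operations R_AM.
    if isinstance(mission, frozenset):
        return set(mission)
    # Inner mission: union over the entity missions below it.
    ops = set()
    for em in mission:
        ops |= required_operations(em)
    return ops

# A mission with two entity missions; the first one expands further.
M = ((frozenset({"a"}), frozenset({"b"})), (frozenset({"c"}),))
print(sorted(required_operations(M)))  # ['a', 'b', 'c']
```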

Before the missions can be distributed in the GRS, it is necessary to retrieve information on the different tasks each entity is capable of performing. To be able to collect the offered operations at the atomic entity level, there is a data structure which is a set of offered services, Off_serv (Adlemo, 1990b). Each entity has its own offered service. An offered service consists of a set of offered services, Off_serve (from lower-level entities), or a set of offered operations, OAE (i.e. operations that can be performed by the atomic entities):

Off_servGRS = {Off_serve1, Off_serve2, ..., Off_serven}

Off_serve = Off_servGRS | OAE

The correct allocation of entity missions to corresponding entities is achieved by comparing the required services with the offered services at each node. It might also be necessary, at times, to consider the structure of the offered services (Adlemo et al., 1990a). Regardless of the structure of an offered service, an atomic mission, AM, can be executed by an atomic entity, AE, iff RAM ⊆ OAE.

The missions are stored in a mission pool, MP (Andréasson et al., 1989). A mission pool can be considered as an abstract data structure in which the missions are inserted in a First-In-First-Out order.

MP ≡ queue of missions

Mi ≡ set of partially ordered entity missions EMij

A mission pool can be centralized or distributed among a set of entities. A mission pool is used in such a way that missions, M, are entered into it and entity missions, EM, are then chosen from it following the partial order among the entity missions.
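A minimal sketch of such a mission pool, assuming the partial order is given as per-entity-mission prerequisite sets. This encoding and all names are ours, not the paper's.

```python
from collections import deque

class MissionPool:
    def __init__(self):
        self._missions = deque()            # MP: FIFO queue of missions

    def enter(self, mission):
        # mission: dict mapping entity-mission name -> set of prerequisites
        self._missions.append(mission)

    def next_entity_mission(self, done):
        # Choose an entity mission whose prerequisites are all completed,
        # scanning the oldest mission first (First-In-First-Out).
        for mission in self._missions:
            for em, prereqs in mission.items():
                if em not in done and prereqs <= done:
                    return em
        return None

pool = MissionPool()
pool.enter({"EM1": set(), "EM2": {"EM1"}})  # EM2 must follow EM1
done, order = set(), []
while (em := pool.next_entity_mission(done)) is not None:
    order.append(em)
    done.add(em)
print(order)  # ['EM1', 'EM2']
```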


Fig. 4. Hardware configuration, mission configuration and work configuration trees.

3. Configuration of a hierarchical manufacturing system

To achieve a system with fault-tolerant behavior, the system must be able to perform a reconfiguration in case of a failure within the system. To perform such a reconfiguration, the system must be able to react upon spontaneous activities, such as the partitioning of a network, which implies an event-driven system in contrast to a periodic system such as the Mars system (Kopetz et al., 1989). While it may sometimes be necessary to consider a periodic system for the type of actions dealt with in this paper, e.g. reconfigurations, an event-driven system, with its more flexible way of functioning, is sufficient.

Three different types of configuration are defined for the GRS structure: the hardware configuration, the mission configuration and the work configuration (Adlemo et al., 1990a):

(1) The hardware configuration can be represented as a tree showing the recursive levels of a GRS. The leaves in the hardware configuration tree represent atomic entities. The hardware configuration can be changed by adding, moving or deleting any node.

(2) The mission configuration can be represented as a tree which indicates how the missions must be divided into entity missions at each hierarchical level within the GRS. Each mission is associated with a unique tree; it is not possible to remove a sub-tree (i.e. an entity mission) and place it in another sub-tree.

(3) The work configuration can be viewed as 'how to program' a GRS. The program is achieved by associating each node in the mission configuration tree with a node in the hardware configuration tree, i.e. the mission configuration tree is mapped on the hardware configuration tree.

The three different configurations are illustrated in Fig. 4. The gray squares in the work configuration illustrate the resulting work configuration tree.

4. Production allocation requirements

When performing a work configuration, the requirements at each node in the mission configuration must be matched by the offered operations at the corresponding nodes in the system configuration. This can be done according to two


different principles, determined by which type of manufacturing system the model is aimed at. Between these two principles, intermediate variants can also be found.

The first principle does not take into account the tree structure and consequently only matches operations on the leaf level. The approach does not take replicated entities into account. Thus, redundant entities are not chosen when performing a reconfiguration. The approach might lead to the selection of one entity for more than one task, i.e. time sharing of the entities might occur. In some types of manufacturing systems, this is the desired approach, especially in systems dealing with many different missions in a short period of time.

The second principle is based on knowledge of the structure of a complete sub-tree at each level in the work configuration. This allows the possibility of optimizing the configuration, an important consideration when the configuration is to be used over a long period of time.

4.1. First principle

Using the first principle for matching nodes when performing a work configuration provides a simple configuration algorithm. To determine whether a given hardware configuration sub-tree is capable of performing the production according to a given mission configuration sub-tree, it is sufficient to check that the required operations of the mission can be satisfied by the hardware configuration sub-tree.

We define the offered services of a GRS, OGRS, and the required services of a mission, RM, in the following way:

GRS ≡ {e1, e2, ..., en}

M ≡ {m1, m2, ..., mn}

OGRS = ∪ Oei, ei ∈ GRS

OAE = {offered operations}

RM = ∪ Rmi, mi ∈ M

RAM = {required operations}

A mission can be fulfilled by a GRS if

RM ⊆ OGRS

A GRS, G, that fulfills a mission M's required services is, for example:

RM = {a, b, c, d}

OG = {a, b, c, d, e}
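Under the first principle the check reduces to a flat set inclusion. A sketch reusing the example sets above; the helper name is ours.

```python
def fulfills_flat(required_ops, offered_ops):
    # First principle: a mission can be fulfilled iff R_M ⊆ O_GRS,
    # ignoring the tree structure entirely.
    return required_ops <= offered_ops

RM = {"a", "b", "c", "d"}
OG = {"a", "b", "c", "d", "e"}
print(fulfills_flat(RM, OG))          # True
print(fulfills_flat({"a", "f"}, OG))  # False: 'f' is not offered
```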

4.2. Second principle

The second principle takes into account the structure of the offered services. To do so, a new operator for comparing

required services with offered services must be defined. The offered services of a GRS, OGRS, and the required services of a mission, RM, are defined in the following way:

GRS ≡ {e1, e2, ..., en}

M ≡ {m1, m2, ..., mn}

OGRS = {Oei | ei ∈ GRS}

The symbol ⊑ is used for the new operator, defined in the following way:

RAM ⊑ OAE iff RAM ⊆ OAE

RM ⊑ OGRS iff ∃ i1, i2, ..., in:

Rm1 ⊑ Oei1 ∧ Rm2 ⊑ Oei2 ∧ ... ∧ Rmn ⊑ Oein

i1 ≠ i2 ≠ ... ≠ in ∧ eij ∈ GRS, j = 1..n

A GRS, G, that fulfills a mission M's requirements is, for example:

RM = {{{a}, {b}}, {{a}}, {{c}, {d}}}

OG = {{{a}, {b}, {b}}, {{a}, {b}}, {{c}, {d}}}

In this example RM ⊑ OG. An example for which it is not true that a GRS fulfills the required services, owing to the structure, is

RM = {{{a}, {b}}, {{a}}, {{c}, {d}}}

OG = {{{a}, {b}, {b}}, {{a}, {b}, {c}}, {{d}}}
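A possible reading of the ⊑ operator in code: at the leaf level it is subset inclusion, and at inner levels each required sub-service must be matched to a distinct offered sub-service. The encoding (frozensets for leaves, tuples for inner nodes) and the brute-force injective search are ours.

```python
from itertools import permutations

def matches(required, offered):
    # Leaf level: R_AM ⊑ O_AE iff R_AM ⊆ O_AE.
    if isinstance(required, frozenset):
        return isinstance(offered, frozenset) and required <= offered
    if isinstance(offered, frozenset) or len(required) > len(offered):
        return False
    # Inner level: try every injective assignment of required children
    # to distinct offered children (the i1 ≠ i2 ≠ ... ≠ in condition).
    return any(
        all(matches(r, offered[i]) for r, i in zip(required, perm))
        for perm in permutations(range(len(offered)), len(required))
    )

fs = frozenset
# The two examples from the text: the first structure can be matched,
# the second cannot ({c} and {d} are split across different sub-services).
RM = ((fs("a"), fs("b")), (fs("a"),), (fs("c"), fs("d")))
OG_good = ((fs("a"), fs("b"), fs("b")), (fs("a"), fs("b")), (fs("c"), fs("d")))
OG_bad = ((fs("a"), fs("b"), fs("b")), (fs("a"), fs("b"), fs("c")), (fs("d"),))
print(matches(RM, OG_good))  # True
print(matches(RM, OG_bad))   # False
```

The exhaustive permutation search is exponential; it only serves to make the definition concrete, in line with the remark below about the combinatorial explosion of computing an optimal configuration.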

In Andréasson et al. (1989) a basic algorithm is given for obtaining a work configuration. This algorithm uses the data structure called the mission pool. Before the missions can be distributed, the structure of the system configuration must be collected. It must be known what each sub-tree (entity) can perform in order for missions to be assigned to it (Adlemo, 1990b).

Because of the combinatorial explosion in computing an optimal configuration, such a computation cannot be done completely automatically. Instead, the possibility must exist for an operator to interact with the system. This means that an interface between the human operator and the system must be included, e.g. by using a scheduling editor. Two examples of such editors are the editor developed at Imperial College in London (Hatzikonstantis et al., 1989) and the editor developed at the Mitsubishi Electric Corporation in Japan (Fukuda et al., 1989).

5. Reconfiguration of a hierarchical manufacturing system

In the case that an entity fails, the system must undergo a reconfiguration in order to activate the redundant entities. We have identified three different reconfigurations that correspond to the three different configurations mentioned


Fig. 5. System failure.

Fig. 6. Hardware reconfiguration.

Fig. 7. Work reconfiguration.

Fig. 8. Mission reconfiguration.

earlier. The examples of the reconfigurations illustrated in the figures in this section are related to the second principle concerning the matching of nodes.

(1) Hardware reconfiguration is carried out in such a way that the current work configuration is preserved. This can be achieved in several ways, e.g.

(i) by repairing the faulty entities and returning to the original hardware configuration;

(ii) by moving entities within the GRS to maintain production.

Note that an erroneous entity automatically generates a spontaneous, non-voluntary hardware reconfiguration. This is illustrated in Fig. 5, in which one of the leaf entities has ceased to work. Figure 6 illustrates the case in which an entity is moved to maintain production.

(2) Work reconfiguration can be used when some redundancy exists among the entities in the GRS. When one entity fails, a redundant entity can be appointed to do its work. The work reconfiguration of the system must take into consideration the hierarchical structures of the hardware configuration and of the mission configuration (Fig. 7). A work reconfiguration should be attempted at a level as close as possible to the source of the failure and then tried hierarchically on higher levels if necessary. This procedure is known as hierarchical scheduling (Karmakar, 1988).

(3) Mission reconfiguration is performed by structuring the mission in a different way than the original structure. This means that leaves defined in the original mission configuration tree may be moved from one sub-tree to another (Fig. 8). The mission is the output from a CAD/CAM process on the highest GRS level, and a mission reconfiguration entails a recompilation of the CAD/CAM process. However, it should be possible to integrate, at the moment of the initial compilation of the CAD/CAM process, various different mission configuration trees for the mission. This is known as off-line scheduling (Kopetz et al., 1989). The different alternatives can be chosen according to the possibilities of the current hardware configuration.

When an entity failure or a partitioning has been detected, the system attempts to perform one, or possibly several, of the reconfigurations mentioned.

6. A general model of network partitioning and caching

A distributed system is always faced with the problem of network partitioning. In a manufacturing system, partitionings may occur in both the data and the material network. If no actions are taken, the partitioning of a network could lead to an uncontrolled production stop.

Data and material caches are introduced to reduce the


effects of network partitions. The advantages obtained with such caches are shown later in this paper, but first some relevant concepts are defined. Each non-atomic entity has an instance of each of these concepts, while each atomic entity only has an instance of the concepts dealing with the entity data cache and the entity material cache. Note that the explanations contained within the following definitions are phrased as though an entity would actually deal with the management of material itself. This, of course, is not so. As stated in the beginning of the paper, the GRS, and the following definitions, are introduced to facilitate discussions about different aspects in manufacturing systems, e.g. fault tolerance for CIM.

The concepts defined are described below. An abbreviation set in italics, such as DC, denotes the concept in general, while the same abbreviation in roman type denotes a specific instance.

(1) Data network (DN) provides the means for sending and receiving data between the entities in a GRS.

(2) Data cache (DC) for an entity can be viewed as a buffer containing pre-loaded data which is consumed by the entity as the production within the entity continues. In the case of a data network partition, the data cache is used as a back-up. The data found in the cache are one or several entity missions, EM. Each EM includes information about what will be produced by the entity and how long the EM is valid. The purpose of the DC is to obtain fault-tolerant behavior in case of data network partitions. When a partitioning isolates the entity, the production in the entity can continue owing to the pre-loaded entity missions in the DC. This gives the system an opportunity to reconfigure before production stops. The pre-loaded entity missions in the DC also make it possible to switch to new production as soon as the production in progress terminates. A DC is said to be valid if it contains one or several adequate entity missions. An entity mission is adequate if it can be executed by the entity and if it has not passed its time-out limit. A DC that contains no adequate entity mission is said to be invalid. This means that a valid DC can be defined as below, where we have a specific data cache, DC, for a specific entity, e, and a specific entity mission, EM:

value_failure(EM) ≡ EM cannot be performed by e

timing_failure(EM) ≡ EM has passed its time-out limit

adequate(EM) ≡ ¬value_failure(EM) ∧ ¬timing_failure(EM)

valid(DC) ⟺ (∃ EM ∈ DC : adequate(EM))

The definitions of value failure and timing failure are taken from Ezhilchelvan and Shrivastava (1986) and Laprie (1990).
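The validity test for a data cache could be sketched as follows, with entity missions modelled as dictionaries carrying their required operations and a time-out limit; all field names and numbers are invented for illustration.

```python
def adequate(em, entity_ops, now):
    # Value failure: the entity cannot perform the entity mission.
    value_failure = not (em["required_ops"] <= entity_ops)
    # Timing failure: the entity mission has passed its time-out limit.
    timing_failure = now > em["timeout"]
    return not value_failure and not timing_failure

def valid(dc, entity_ops, now):
    # valid(DC) holds iff some entity mission in the cache is adequate.
    return any(adequate(em, entity_ops, now) for em in dc)

dc = [
    {"required_ops": {"weld"}, "timeout": 100.0},
    {"required_ops": {"paint"}, "timeout": 200.0},
]
print(valid(dc, {"weld", "move"}, now=50.0))   # True: first EM adequate
print(valid(dc, {"weld", "move"}, now=150.0))  # False: timed out / cannot paint
```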

(3) Entity data cache (EDC) for an entity contains the entity missions that have been assigned to the entity but that have not yet been executed and the mission that the entity is processing. The data that is contained in the entity

mission, and that is currently being processed by the entity, is denoted EMwork:

EDC = DC ∪ EMwork

An EDC is said to be valid or invalid based on the same conditions as for the data cache.

(4) Material network (MN) provides the means for sending and receiving material between entities in the GRS.

(5) Material cache (MC) for an entity is a buffer with material that has been received, but not consumed, by the entity. The MC is normally the production line itself. An entity might also have a special buffer with pre-loaded material. When this is the case, both the buffer and the production line together form the MC. The MC is filled with material by preceding entities in the production line, and the material in the MC is normally used constantly by the entity. An MC is said to be valid if it contains adequate material. Material is adequate if the entity is able to use the material in accordance with its entity mission. An MC that contains no adequate material is said to be invalid. This means that a valid MC can be defined as below where we

have a specific material cache, MC, for a specific entity, e. EM denotes an entity mission and Mati denotes a piece of material.

Mat(EM) ≡ {Mat1, Mat2, ..., Matn}

mat_value_failure(EM) ≡ ¬(Mat(EM) ⊆ MC)

mat_timing_failure(EM) ≡ Mat(EM) in MC is not up-to-date as specified by EM

mat_adequate(EM) ≡ ¬mat_value_failure(EM) ∧ ¬mat_timing_failure(EM)

valid(MC) ⟺ (∃ EM ∈ DC : adequate(EM) ∧ mat_adequate(EM))

(6) Entity material cache (EMC) for an entity contains the material in the MC and the material that the entity is processing. The material that is defined in the entity mission, and that is currently being processed by the entity, is denoted Matwork:

EMC = MC ∪ Matwork

An EMC is said to be valid or invalid based on the same conditions as for the material cache.

In the case of partitioning in a data network or a material network, or perhaps both, an entity can continue its production as long as its entity data cache (EDC) and its entity material cache (EMC) are still valid. When either the EDC or the EMC becomes invalid, the entity stops its production.
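A toy illustration of this stop rule: during a partition nothing refills the caches, so production continues only while the pre-loaded entity missions (EDC) and the buffered material (EMC) both last. All numbers and names here are invented.

```python
def steps_until_stop(preloaded_missions, buffered_parts, parts_per_mission):
    """Production steps possible after a partition isolates the entity."""
    steps = 0
    while preloaded_missions > 0 and buffered_parts >= parts_per_mission:
        preloaded_missions -= 1               # EDC consumed
        buffered_parts -= parts_per_mission   # EMC consumed
        steps += 1
    return steps

# 5 missions pre-loaded, 12 parts buffered, 3 parts per mission:
# the material cache runs out first, after 4 missions.
print(steps_until_stop(5, 12, 3))  # 4
```

The returned count is the window the system has for finding an alternative configuration before production stops, which is exactly what the caches are meant to buy.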

7. Fault-tolerant production under network partitions

To obtain a system with fault-tolerant behavior in terms of network partitions, the system must support some kind of


failure detection, caches, redundant entities and reconfiguration strategies.

In the following discussions, we will deal only with systems that have both data caches (DCs) and material caches (MCs). Furthermore, it is assumed that a failure detection algorithm, redundant entities and an algorithm for reconfiguration exist. When a data network and/or a material network partitioning occurs, the individual network partitions continue to function as long as the entity data caches (EDCs) and the entity material caches (EMCs) are still valid.

Difficulties in communicating with an entity are discovered through a failure detection algorithm (Andréasson, 1990). The communication problems can be caused either by a network partitioning or an entity failure. In our discussions we consider the failures caused by network partitionings. A way to distinguish between a failure caused by a network partitioning and a failure caused by an entity break-down in a manufacturing environment is described in Adlemo and Andréasson (1991a). After the detection of network partitions, the system attempts to reconfigure before the invalidation of any of the EDCs or EMCs. This means that every level in the GRS structure, from the level at which the partitions were detected upwards, tries to find one or more alternative configurations to maintain production in progress. When an alternative configuration must be activated, the system tries to use the configuration that affects the GRS structure as little as possible, i.e. involves a minimum of GRS levels and entities. If no alternative configuration is found within the level at which the partitioning occurred, the level above is involved. This procedure is repeated recursively until an alternative configuration is found or until the highest GRS level is reached. If no alternative configuration is found on any level, production will eventually terminate unless the malfunctioning network is repaired.
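The escalation strategy just described can be sketched as a search from the detection level up to the GRS root; `find_candidates` stands in for a hypothetical per-level search function, and all names are ours.

```python
def escalate(levels, find_candidates):
    """levels: list of level names from the detection level up to the root."""
    for level in levels:
        candidates = find_candidates(level)
        if candidates:
            # Use the lowest level with candidates: the least disruptive
            # reconfiguration, involving a minimum of GRS levels.
            return level, candidates
    return None, []  # no level has an alternative: production will stop

# Example: only the top level has a spare configuration available.
alternatives = {"cell": [], "shop": [], "factory": ["use_backup_line"]}
level, cands = escalate(["cell", "shop", "factory"], alternatives.get)
print(level, cands)  # factory ['use_backup_line']
```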

When the system finds alternative configurations, these must be kept as candidate configurations, as the partitioning may be temporary and repaired before the caches become invalid. If a partitioning is temporary, none of the candidate configurations will be activated. Otherwise, one of the candidate configurations is activated. The way in which the choice between the candidate configurations should be made, in order to find the most effective configuration, is beyond the scope of this paper.

The following algorithm explains how the system acts upon the partitioning of a network. The algorithm consists of four processes:

(1) The first process, entity_reconfiguration, attempts to find candidate configurations when a lifeless entity has been detected within a GRS level. The process also initiates the search for candidate configurations on higher GRS levels. When candidate configurations have been found, one of these is activated when one of the entity's caches becomes invalid. If no candidate configuration has been found, the GRS level above is requested to activate one of its candidates.

(2) The second process, father_reconfiguration, tries to find candidate configurations when a lifeless entity has been detected by a working entity that is placed on a lower level in the GRS structure. If the lower levels have not found any candidates, the process is informed that it must activate one of its candidates when one entity cache becomes invalid. If no candidate configuration has been found, the level above is requested to activate one of its candidates. The difference between the first and the second process is that the first process is placed on the entity that detects a lifeless entity, while the second process is placed on the entity that is informed about the detection of a lifeless entity.

(3) The third process, lifeless_entity_check, checks whether or not a lifeless entity becomes alive before the invalidation of the caches. If the entity recovers, the candidate configurations on all GRS levels are aborted.

(4) The last process, system_reconfiguration, tries to discover as many candidate configurations as possible. The detailed process by which this is done is not described in this paper.

process entity_reconfiguration;
  while true do
    if detection of a lifeless entity then
      lifeless_entity_check.check_if_lifeless_entity_is_reactivated;
      if ∃ a father to this level then
        father_reconfiguration.father_should_find_candidate_configurations;
      end if;
      system_reconfiguration.find_candidate_system_configurations;
      wait until EDC_entity ∨ EMC_entity is invalidated;
      if ∃ candidate system configuration(s) then
        activate one of the candidate configurations;
      else if ∃ a father to this level then
        father_reconfiguration.activate_fathers_candidate_configurations;
        halt entity production;
      end if;
    end if;
  end while;
end entity_reconfiguration;

process father_reconfiguration;
  while true do
    if ∃ children to this level then
      accept father_should_find_candidate_configurations;
      if ∃ a father to this level then
        father_reconfiguration.father_should_find_candidate_configurations;
      end if;
      system_reconfiguration.find_candidate_system_configurations;
      wait until EDC_entity ∨ EMC_entity is invalidated;
      accept activate_fathers_candidate_configurations;
      if ∃ candidate system configuration(s) then
        activate one of the candidate configurations;
      else if ∃ a father to this level then
        father_reconfiguration.activate_fathers_candidate_configurations;
        halt entity production;
      end if;
    end if;
  end while;
end father_reconfiguration;

process lifeless_entity_check;
  while true do
    accept check_if_lifeless_entity_is_reactivated;
    while EDC_entity ∧ EMC_entity are valid do
      if lifeless entity back again then
        delete the candidate configurations;
        reset father_reconfiguration process;
        reset entity_reconfiguration process;
        reset lifeless_entity_check process;
      end if;
    end while;
  end while;
end lifeless_entity_check;

process system_reconfiguration;
  while true do
    accept find_candidate_system_configurations;
    find candidate Hardware, Work, and Mission Configurations;
  end while;
end system_reconfiguration;
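As a rough, sequential Python rendering of the decision that entity_reconfiguration takes once one of its caches has been invalidated (the real processes run concurrently and communicate by rendezvous; all names here are our assumptions):

```python
def on_cache_invalidated(candidate_configs, father):
    """Decision taken once the EDC or EMC of an entity is invalidated
    (illustrative sketch, not the paper's concurrent implementation)."""
    if candidate_configs:
        # A candidate was found on this level: activate one of them.
        return ("activate", candidate_configs[0])
    if father is not None:
        # No local candidate: request the level above to activate one
        # of its candidates, halting this entity's production meanwhile.
        return ("delegate_to_father", father)
    # No candidates and no higher level: production terminates.
    return ("halt_production", None)

print(on_cache_invalidated(["alternative work configuration"], None))
print(on_cache_invalidated([], "cell controller"))
print(on_cache_invalidated([], None))
```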

8. Caches and system availability

As observed in earlier sections, a manufacturing system may greatly benefit from the use of caches to increase the reliability and availability of the system. However, the introduction of data and material caches in a manufacturing system is not without cost, especially with regard to material caches. A material cache can be regarded as a kind of material store, which can be expensive for a company. The goal should thus be to minimize the use and size of the caches. Parallel to the concern over the cost of the material store is the desire for a manufacturing system capable of surviving different types of failures without a production stop.

We have not investigated in this paper the relative improvements in fault tolerance obtained with different approaches, e.g. different cache sizes, production throughputs or sizes of the manufacturing system. We have instead stressed the importance of having caches in order to deal with network partitionings in manufacturing systems. Nevertheless, it is clear that a larger cache improves the possibility for a system to survive specific types of failures, e.g. network partitionings, as the system is given more time to compute a new work configuration.
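The trade-off can be made concrete with a back-of-the-envelope sketch. Under the simplifying assumption (ours, not the paper's) that a material cache of a given size sustains production for size/consumption-rate time units, a partitioning is survived whenever reconfiguration or repair completes within that window:

```python
def survives_partition(cache_size, consumption_rate, reconfiguration_time):
    """True if the cache keeps production alive long enough for a new
    configuration to be found (simplified model; assumed parameters)."""
    time_until_invalid = cache_size / consumption_rate  # cache lifetime
    return reconfiguration_time <= time_until_invalid

# 120 parts consumed at 2 parts/min give a 60 min window: 45 min suffices.
print(survives_partition(cache_size=120, consumption_rate=2.0,
                         reconfiguration_time=45))   # True
# A smaller cache gives only a 30 min window: the partition is fatal.
print(survives_partition(cache_size=60, consumption_rate=2.0,
                         reconfiguration_time=45))   # False
```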

Figure 9 shows a situation with network partitionings in a manufacturing system, in which the use of caches improves the availability of the system. The figure and terminology are partly taken from Siewiorek (1984). It demonstrates what happens to an entity and a GRS in the case of a failure, e.g. a data or material network that partitions. No exact cache sizes are given. The possibilities for an entity and a GRS to survive are depicted. The following abbreviations are used in the figure: MTTF, mean time to failure (= reliability); MTTD, mean time to detect (= error latency); MTTR, mean time to repair (= maintainability). The availability of a GRS can be written as:

Availability_GRS = MTTF_GRS / (MTTF_GRS + MTTR_GRS)
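For example, with hypothetical figures of 990 hours between failures and 10 hours to repair, the steady-state availability follows directly from the formula:

```python
def availability(mttf, mttr):
    # Steady-state availability: MTTF / (MTTF + MTTR)
    return mttf / (mttf + mttr)

# Hypothetical GRS: 990 h between failures, 10 h to repair.
print(availability(990.0, 10.0))  # 0.99
```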

[Fig. 9. A scenario for on-line detection and on-line reconfiguration.]

The erroneous situation begins with Failure 1, which is caused by a data or a material network partition for an entity. The system detects the network partition at Detection and start of repair, where the possible repair of the network also starts. At the same time the entity and higher levels within the system try to find new configurations which can be used if and when the caches become invalid (see Section 7). The action of Reintegration means that the repaired network link is connected to the system again. If a repair of a network link should be successful before any new configuration is found, the system will not use the new configurations but will instead continue as though there had not been any partitioning at all. In the situation that begins with Failure 2, the system does not find any new configurations before the invalidation of a data cache or a material cache, which happens at Cache invalid. This should be avoided whenever possible, possibly by increasing the size of some of the caches in the system, and thus perhaps removing the erroneous situation. However, as long as the data and the material caches are valid, production in the manufacturing system can continue, even if temporary network partitionings occur.
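The two scenarios of Fig. 9 can be summarized by comparing three durations, measured from the moment of detection. This is only a sketch under our own simplifying assumptions (a single fixed cache lifetime and deterministic repair and search times):

```python
def partition_outcome(repair_time, config_search_time, cache_lifetime):
    """Outcome of a network partitioning (simplified Fig. 9 model)."""
    if repair_time <= cache_lifetime and repair_time <= config_search_time:
        # Failure 1: the link is repaired while the caches are still
        # valid, so the system reintegrates and ignores any candidates.
        return "reintegrate"
    if config_search_time <= cache_lifetime:
        # A candidate configuration is found before the caches expire.
        return "reconfigure"
    # Failure 2: the caches become invalid with no configuration found.
    return "production stop"

print(partition_outcome(repair_time=5,  config_search_time=8,  cache_lifetime=10))
print(partition_outcome(repair_time=20, config_search_time=8,  cache_lifetime=10))
print(partition_outcome(repair_time=20, config_search_time=15, cache_lifetime=10))
```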

9. Summary

This article has stressed the importance of fault tolerance in manufacturing systems in order to maintain and increase productivity and profits. To achieve the required fault tolerance, the possibilities for manufacturing system configuration and reconfiguration were discussed. The configurations can be performed in three different ways, as described here. One specific fault that might occur in a CIM system is the partitioning of a network. This fault can be dealt with by using the system reconfiguration approach. To enhance fault tolerance even further, the use of data and material caches was shown to improve the possibility for a manufacturing system to survive network partitionings.

References

Adlemo, A., Andréasson, S.-A., Andréasson, T. and Carlsson, C. (1990a) Achieving fault tolerance in factory automation systems by dynamic configuration, in Proceedings of the 1st International Conference on Systems Integration, New Jersey, USA.

Adlemo, A. (1990b) Distribution of production unit capabilities in automated manufacturing systems, in Proceedings of the 13th International Conference on Fault Tolerant Systems and Diagnostics, Bulgaria.

Adlemo, A., Andréasson, S.-A. and Carlsson, C. (1990c) A model for network partitionings with caching in manufacturing systems, in Proceedings of the 10th SCCC International Conference on Computer Science, Chile.

Adlemo, A. and Andréasson, S.-A. (1991a) Analyzing network partitionings in manufacturing systems. Internal report, Chalmers University of Technology, Sweden.

Adlemo, A. and Andréasson, S.-A. (1991b) Fault tolerant information distribution in partitioned manufacturing networks, in Proceedings of the 1991 IEEE International Conference on Robotics and Automation, California.

Andréasson, S.-A., Andréasson, T. and Carlsson, C. (1989) An abstract data type for fault tolerant control algorithms in manufacturing systems, in Preprints of the 6th IFAC/IFIP/IFORS/IMACS Symposium on Information Control Problems in Manufacturing Technology, Madrid.

Andréasson, T. (1990) Towards a hierarchical operating system for supporting fault tolerance in flexible manufacturing systems, in Proceedings of the 13th International Conference on Fault Tolerant Systems and Diagnostics, Bulgaria.

Barbier, M. and Delmas, J. (1989) Modelling, control and simulation of flexible assembly systems, in Preprints of the 6th IFAC/IFIP/IFORS/IMACS Symposium on Information Control Problems in Manufacturing Technology, Madrid.

Chintamaneni, P. R., Jalote, P., Shieh, Y.-B. and Tripathi, S. K. (1988) On fault tolerance in manufacturing systems. IEEE Network, 2, pp. 32-39.

Cristian, F. (1989) Questions to ask when designing or attempting to understand a fault tolerant distributed system, in Proceedings of the 3rd Brazilian Conference on Fault Tolerant Computing, Rio de Janeiro.

Ezhilchelvan, P. D. and Shrivastava, S. K. (1986) A characterization of faults in systems, in Proceedings of the 5th Symposium on Reliability in Distributed Software and Database Systems.

Fukuda, T., Tsukiyama, M. and Mori, K. (1989) Scheduling editor for production management with human-computer cooperative systems, in Preprints of the 6th IFAC/IFIP/IFORS/IMACS Symposium on Information Control Problems in Manufacturing Technology, Madrid.

Hatzikonstantis, L., Sahirad, M., Ristic, M. and Besant, C. B. (1989) Interactive scheduling for a human-operated flexible machining cell, in Preprints of the 6th IFAC/IFIP/IFORS/IMACS Symposium on Information Control Problems in Manufacturing Technology, Madrid.

Kang, Y. I., Lovegrove, W. and Spragins, J. D. (1985) An architecture for factory automation and information management systems, in Proceedings of the Pacific Computer Communications Symposium, Seoul, Korea.

Karmarkar, U. S. (1988) Recent Developments in Production Research, Elsevier Science Publishers B.V., Holland.

Kopetz, H. et al. (1989) Distributed fault tolerant real-time systems: the Mars approach. IEEE Micro, February, pp. 14-22.

Kusiak, A. and Heragu, S. S. (1988) Computer integrated manufacturing: a structural perspective. IEEE Network, 2.

Laprie, J. C. (1990) Dependability: concepts and terminology, in Proceedings of the Workshop on Fundamental Concepts and Terminology, France.

McGuffin, L. J., Reid, L. O. and Sparks, S. R. (1988) MAP/TOP in CIM distributed computing. IEEE Network, 2, pp. 23-31.

Messina, G. and Tricomi, G. (1987) Fault and safety management in intelligent robot networks, in Proceedings of TENCON 87, Seoul.

Siewiorek, D. P. (1984) Architecture of fault tolerant computers. Computer, 17, pp. 9-18.

Sintonen, L. and Virvalo, T. (1988) The hierarchy of communication networks in the programmable assembly cell: an experimental framework. IEEE Network, 2, pp. 48-54.