
Safety Science 62 (2014) 348–365

Contents lists available at ScienceDirect

Safety Science

journal homepage: www.elsevier.com/locate/ssci

On the reliability and validity of ship–ship collision risk analysis in light of different perspectives on risk

0925-7535/$ - see front matter © 2013 Elsevier Ltd. All rights reserved.
http://dx.doi.org/10.1016/j.ssci.2013.09.010

* Corresponding author. Tel.: +358 9 470 23476; fax: +358 9 470 23493. E-mail address: floris.goerlandt@aalto.fi (F. Goerlandt).

Floris Goerlandt *, Pentti Kujala
Aalto University, School of Engineering, Department of Applied Mechanics, Marine Technology, Research Group on Maritime Risk and Safety, P.O. Box 15300, FI-00076 AALTO, Espoo, Finland


Article history:
Received 2 July 2013
Received in revised form 6 September 2013
Accepted 12 September 2013

Keywords: Quantitative risk analysis; Reliability; Validity; Risk perspective; Ship–ship collision

A number of authors have discussed reliability and validity of quantitative risk analysis (QRA). These concepts address respectively whether a QRA provides the same risk picture when the analysis is repeated and whether the analysis addresses the right concept. While it has been argued that QRA is not in general reliable, there is little evidence supporting this claim available in the scientific literature. In light of this, this paper studies the reliability of QRA through a case study of ship–ship collision risk. It is found that probability- and indicator-based risk perspectives do not necessarily provide a reliable risk picture, neither in terms of numerical accuracy of the risk metrics, nor in terms of rank order of risk metrics in various parts of the system. The results of the case study indicate a low inter-methodological reliability for the selected methods, raising concerns about their validity. This is discussed applying criteria concerning validity of risk analysis and in terms of the validity of the proposed encounter detection mechanisms. Significant uncertainty is found regarding this encounter definition in the selected methods, implying a need for more focus on this important aspect of maritime traffic risk analysis.

© 2013 Elsevier Ltd. All rights reserved.

1. Introduction

A number of authors have discussed the reliability and validity of quantitative risk analysis (QRA) (Aven and Heide, 2009; Rae et al., 2012; Suokas, 1985). Reliability and validity are fundamentally different concepts. Reliability relates to the extent to which a risk analysis leads to the same risk picture when the analysis is repeated, while validity can be understood as the degree to which an analysis describes the specific concept one intends to describe (Aven and Heide, 2009). Such an understanding is in line with common usage in e.g. the social sciences (Carmines and Zeller, 1979; Drost, 2011; Trochim and Donnely, 2008).

Aven and Heide (2009) provide an analysis of the reliability and validity of a number of risk perspectives, focusing on perspectives where probabilities are applied to quantify uncertainties. They distinguish between traditional statistical approaches, the probability of frequency approach and Bayesian approaches predicting observables. They conclude that QRA as a tool to measure risk is only reliable in the sense described above for traditional statistical approaches if a large amount of relevant data is available. In other cases, a risk analysis is not in general reliable.

We consider the analysis of Aven and Heide (2009) an important contribution to the foundational literature on risk analysis.

However, their discussion is mainly theoretical and no empirical evidence is provided or referenced to substantiate the claims. The lack of evidence for or against QRA meeting the scientific requirements of reliability and validity is also noted by Rae et al. (2012). One of their concerns is the accuracy of the calculated risk metric: do quantitative risk analyses succeed in precisely estimating the total system risk? These authors propose that evidence should be sought to fill evidential gaps in relation to the reliability and validity of quantitative risk analysis. Some evidence concerning the reliability of QRA exists in the application areas of the nuclear and chemical process industries (Suokas and Kakko, 1989; Suokas, 1985), but no empirical studies or discussion is found in the maritime application area. Recently, there has been increased focus on such foundational issues related to risk analysis (Aven, 2012a), and there have been calls for "[...] a continuous discussion in the scientific environments and application areas on how to best measure/describe risk" (Aven, 2012b, p. 42).

In light of the above, this paper has two aims. First, some evidence is sought regarding the reliability of QRA through a comparative case study concerning ship–ship collision accident risk in the maritime transportation system. Three methods presented in the scientific literature are applied, for which the reliability requirements proposed by Aven and Heide (2009) are tested. These reliability criteria are furthermore refined and extended. A distinction is made between reliability in terms of the accuracy of the risk metric and reliability in terms of how well risk metrics


determined for various parts of the system have the same rank order, which is a significantly lower demand in terms of measurement theory (Stevens, 1946). In addition, the reliability of the quantitative risk perspective applying risk indicators rather than frequency estimates or uncertainty descriptions is investigated.

Second, the results of the reliability case study are taken as a departure point for a discussion on the validity of the applied definitions of the ship–ship encounter in connection with collision risk. Both the applicable validity criteria for QRA proposed by Aven and Heide (2009) are discussed, as well as the construct validity of the definition of the encounter, which constitutes the exposure to collision.

In sum, this paper aims to contribute to the foundational literature on risk analysis by providing evidence that QRA is not necessarily a reliable tool to measure risk, and to answer the call for more focus on foundational issues in the maritime application area.

The paper is organized as follows. Section 2 provides a short overview of a number of risk perspectives and of the reliability and validity criteria. Section 3 introduces the applied methods for determining ship–ship collision risk in a sea area and outlines the study area and applied data. In Section 4, the methodology for testing the reliability is explained and in Section 5, the results of the reliability criteria for the investigated methods are shown. A discussion on the reliability of risk perspectives is provided in Section 6, as well as a discussion on the validity of the applied quantitative ship–ship collision risk analysis methods in the case study.

2. Risk perspectives and reliability and validity of risk analysis

2.1. Risk perspectives for QRA

While often not clearly distinguished, there is a fundamental difference between the concept of risk and ways to describe and measure risk (Aven, 2012a). We adopt the following terminology: the risk concept concerns what risk means in itself, what risk "is". A risk perspective is a way to describe risk, a systematic manner to analyze and make statements about risk. A risk metric is the numerical value assigned to an aspect of risk according to a certain standard or rule.

In the technical application areas, the risk concept is most commonly defined through and described with frequencies or probabilities (Aven, 2011), e.g.:

(i) Risk is the combination of the frequency and the severity of the consequence (IMO, 2007).

(ii) Risk is equal to the triplet <si, pi, ci>, where si is the ith scenario, pi the probability of that scenario and ci the consequence of the ith scenario (Kaplan and Garrick, 1981).

Corresponding risk perspectives describe risk through events A, consequences C and probabilities P. Uncertainties are expressed through probabilities. Formally:

Risk ∼ (A, C, P)    (1)

Several variations of this perspective exist. A key factor is the interpretation of the probability. In risk analysis, frequentist and subjective interpretations are the most prevalent (Aven and Reniers, 2013). In the probability of frequency approach by Kaplan and Garrick (1981), subjective probabilities are assigned to express the uncertainty about frequentist probabilities. In a classical statistical approach, a confidence interval provides a measure of uncertainty about the frequentist probability estimate. It is beyond the scope of this paper to consider these differences in detail. More reflection on these issues is given in e.g. Aven (2009).

1 We follow the definition of measurement by Stevens (1946) that "measurement, in the broadest sense, is defined as the assignment of numerals to objects or events according to rules". Numerals can have a nominal, ordinal, interval and ratio scale. Stevens (1946) considers interval and ratio scales to be quantitative. Thus, if the indicators Ik apply either of these two scales, we consider this also a risk perspective for quantitative risk analysis.

Other definitions of the risk concept and corresponding perspectives have also been proposed:

(i) Risk is an uncertain consequence of an event or an activity with respect to something that humans value (International Risk Governance Council, 2009).

(ii) Risk is uncertainty about and severity of the consequences (or outcomes) of an activity with respect to something that humans value (Aven and Renn, 2009).

These definitions do not consider probability as part of the risk concept, but uncertainty. Probabilities are in this setting only used as tools to express uncertainty, but these are not perfect tools and this should be highlighted, leading to a risk perspective (Aven, 2010):

Risk ∼ (A, C, U, Ps | K)    (2)

where Ps is a subjective probability (i.e. a degree of belief) expressing an assessor's uncertainty about the occurrence of event A and the consequences C based on the background knowledge K. Uncertainties U in the background knowledge are also systematically addressed, see e.g. (Aven, 2013; Flage and Aven, 2009; Montewka et al., 2013).

Some authors apply a risk perspective based on risk indicators (Kukic et al., 2013; USCG, 2012; Vinnem, 2010). These describe risk using a set of indicators I:

Risk ∼ {Ik}, k = 1, ..., N    (3)

While indicator-based risk perspectives are not commonly understood as 'quantitative risk analysis', we consider this a matter of interpretation.1

In the probability-based perspectives Eqs. (1) and (2), a risk metric is e.g. the probability of an event or the consequence severity in case that event happens. Derived metrics such as expected values or quantile exceedance probabilities may also be defined. In the indicator-based perspective (Eq. (3)), the indicators Ik act as direct metrics of risk and stand in a causal or correlational relation with the likelihood of occurrence of A or the severity of the consequences C.

2.2. Reliability and validity of QRA

Aven and Heide (2009) discuss the reliability and validity of the perspectives of Eqs. (1) and (2). Their first concern is whether or not QRA as a measurement tool provides the same results if the analysis is repeated, i.e. if QRA is reliable for different risk perspectives. Their second concern is to what extent a QRA describes the risk concept, i.e. how valid QRA is under differing perspectives on risk.

The following reliability criteria are proposed:

R1. The degree to which the risk analysis methods produce the same results at reruns of these methods.

R2. The degree to which the risk analysis produces identical results when conducted by different analysis teams, but using the same methods and data.

R3. The degree to which the risk analysis produces identical results when conducted by different analysis teams with the same scope and objective, but no restrictions on methods and data.

Fig. 1. Overall methodology for ship–ship collision risk analysis, adapted from (Fowler and Sørgård, 2000).

In our analysis, we extend these criteria to the perspective according to Eq. (3), and make a distinction between reliability in terms of accuracy of the risk metric and reliability in terms of rank order across different parts of the system.

The following validity criteria for risk analysis are proposed by Aven and Heide (2009):

V1. The degree to which the produced risk numbers are accurate compared to the underlying true risk.

V2. The degree to which the assigned subjective probabilities adequately describe the assessor's uncertainties of the unknown quantities considered.

V3. The degree to which the epistemic uncertainty assessments are complete.

V4. The degree to which the analysis addresses the right quantities (model parameters or observable events).

Note that the accuracy of the risk number as understood under R1 only concerns the accuracy in terms of a repeated measurement, whereas the accuracy as understood under V1 concerns the accuracy when compared to the true risk. V2 is not relevant for the risk perspectives according to Eqs. (1) and (3), as no subjective probabilities are assigned.

In addition to the above criteria addressing the validity of the risk perspectives, we apply two criteria prevalent in the social sciences concerning the validity of the translation of a construct (what something is) into an operationalization (how it is described). This construct concerns the object about which risk is expressed. In our case, this is ship–ship collision; as a result of the reliability case study, we will focus on the construct related to collision exposure, i.e. the encounter conditions in a maritime traffic setting.

Trochim and Donnely (2008) and Drost (2011) distinguish face validity and content validity as criteria of the validity of a construct. Face validity is a subjective, heuristic appreciation of how well the operationalization captures the meaning of the construct and is a weak validity test. Content validity is a more detailed comparison of the operationalization to the relevant content domain.

Note that face and content validity in terms of the translation of the construct to the operationalization are not the same as criterion V4. Aven and Heide (2009) define "addressing the right quantities" as whether or not the QRA focuses on fictional model parameters such as relative frequencies in the risk perspective according to Eq. (1), or on observable events as in the risk perspective of Eq. (2). Face and content validity rather address whether this event has been properly defined in the measurement.

2 A TSS area is an area where ship traffic is regulated, such that vessels are required to follow certain sea lanes.

3. Case study: applied methods and data

The overall methodology for ship–ship collision risk analysis as commonly applied under the risk perspective according to Eq. (1) is depicted in Fig. 1. It is an established approach to evaluate maritime risk, see e.g. (Fowler and Sørgård, 2000; Li et al., 2012; Montewka et al., 2011; van Dorp and Merrick, 2011). The approach consists of finding a number of vessel conflicts in nautical traffic data and assigning a probability of collision to each of these conflicts to find a collision frequency. In addition, the consequences are evaluated for each conflict, e.g. in terms of oil spill size.

In this work, for reasons of brevity, focus is restricted to the evaluation of the likelihood (however expressed) of ship–ship collision. Even if the consequence dimension is not evaluated, the

reliability of the collision likelihood estimate provides an indication of the reliability of the risk analysis, as evident from Fig. 1; see also Section 5.

The scope of the analysis thus focuses on the question of how likely collisions are in various locations of a given sea area. Depending on the risk perspective, this is considered both in terms of accuracy of the accident frequency and/or in terms of a ranking of various sea areas in terms of collision likelihood. Such knowledge is of practical interest, e.g. for planning of oil spill combating resources (COWI, 2011).

3.1. Method 1: fuzzy quaternion ship domain in ship traffic data

3.1.1. General rationale

The model by Qu et al. (2011) uses three risk indices to assess the risk of collision: a speed dispersion index, an acceleration/deceleration index and a vessel conflict index. The indices provide quantitative information regarding the vessel traffic in a Traffic Separation Scheme (TSS) area.2

The speed dispersion is a macroscopic index, providing information regarding the relative speeds of encounters between vessels. This is used as a proxy for collision risk as it is related to the available reaction time of navigators. The degree of acceleration and deceleration is a microscopic index. As evasive maneuvers in meeting, crossing and overtaking encounters involve acceleration and deceleration, the index provides indirect information regarding encounter scenarios with collision potential. The number of fuzzy quaternion ship domain (FQSD) overlaps is also a microscopic index, providing information regarding the number of vessel conflicts in different sea areas. Higher numbers for these three indices, especially when simultaneously occurring in the same leg of the TSS, are considered to indicate high risk. Qu et al. (2011) give equal importance to each of the indices, i.e. no weighting or mathematical aggregation rule is proposed.

In the present case study, we present results only for the vessel conflict index. This is both for reasons of brevity and because our focus regards the impact of the different definitions of vessel conflicts. Moreover, the results of Section 5 will show that the reliability of the entire risk analysis according to Method 1 hinges on the reliability of this vessel conflict index.

3.1.2. Fuzzy quaternion ship domain

In maritime traffic engineering, the concept of ship domain has been suggested by Fujii and Tanaka (1971). Goodwin (1975) provides the following definition: "the surrounding effective waters



which the navigators of a ship want to keep clear of other ships or fixed objects." A considerable number of ship domains have been proposed; see e.g. Wang et al. (2009).

Method 1 by Qu et al. (2011) applies the fuzzy quaternion ship domain (FQSD) proposed by Wang (2010), given by:

FQSDk(r) = {(x, y) | fk(x, y; Q(r)) ≤ 1, k ≥ 1}    (4)

where:

fk(x, y; Q(r)) = [2x / ((1 + sgn x) Rfore(r) + (1 − sgn x) Raft(r))]^(2k) + [2y / ((1 + sgn y) Rstarb(r) + (1 − sgn y) Rport(r))]^(2k)    (5)

Q(r) = {Rfore(r), Raft(r), Rstarb(r), Rport(r)}, 0 < r < 1    (6)

Ri(r) = [ln(1/r) / ln 2]^(1/k) · Ri, i ∈ {fore, aft, starb, port}, 0 < r < 1    (7)

The possibility value r ∈ ]0, 1[ determines the fuzzy boundaries and k is the shape index of the FQSD. Rfore, Raft, Rstarb, Rport represent the forward, aft, starboard and port radii of the quaternion, and are determined based on a model proposed by Kijima and Furukawa (2003):

Rfore  = (1 + 1.34 √(1.55 v^0.72 + 0.17 v^1.09)) L
Raft   = (1 + 0.67 √(1.55 v^0.72 + 0.17 v^1.09)) L
Rstarb = (0.2 + 0.83 v^0.5441) L
Rport  = (0.2 + 0.62 v^0.5441) L    (8)
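As a hedged illustration only (not code from the paper), Eqs. (4), (5), (7) and (8) can be sketched in Python. The function names are my own, v is assumed to be the vessel speed in knots and L the ship length in metres, and the even exponent 2k in the membership test is my reading of the shape-index formula:

```python
import math

def fqsd_radii(v, L):
    """Crisp quaternion radii of Eq. (8); v assumed in knots, L in metres."""
    root = math.sqrt(1.55 * v**0.72 + 0.17 * v**1.09)
    return {
        "fore":  (1 + 1.34 * root) * L,
        "aft":   (1 + 0.67 * root) * L,
        "starb": (0.2 + 0.83 * v**0.5441) * L,
        "port":  (0.2 + 0.62 * v**0.5441) * L,
    }

def fuzzy_scale(radii, r, k=1.0):
    """Fuzzy radii Ri(r) of Eq. (7): smaller possibility values r
    give larger domain boundaries."""
    s = (math.log(1.0 / r) / math.log(2.0)) ** (1.0 / k)
    return {side: s * R for side, R in radii.items()}

def inside_fqsd(x, y, radii, k=1.0):
    """Membership test of Eqs. (4)-(5) in ship-fixed coordinates
    (x positive ahead, y positive to starboard)."""
    sx = 1.0 if x >= 0 else -1.0
    sy = 1.0 if y >= 0 else -1.0
    fk = (abs(2 * x) / ((1 + sx) * radii["fore"] + (1 - sx) * radii["aft"])) ** (2 * k) \
       + (abs(2 * y) / ((1 + sy) * radii["starb"] + (1 - sy) * radii["port"])) ** (2 * k)
    return fk <= 1.0
```

For (r, k) = (0.4, 1) as chosen by Qu et al. (2011), fuzzy_scale enlarges the crisp domain, since ln(1/0.4)/ln 2 > 1.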

The variables r and k have a direct and significant influence on the size and shape of the FQSD, see Fig. 2. According to Wang (2010), the fuzzy possibility value r is related to the size of the domain and the shape index k is related to the state of the navigator. Lower values for r lead to larger domains and can be interpreted as the navigator finding less need to keep the area contained in FQSD(r1) clear than the area contained in FQSD(r2), if r1 < r2. The shape index is derived from a fuzzy inference based on the physical and mental state of the navigator and his skill and ability. Qu et al. (2011) make direct choices for r and k.

Fig. 2. Definition of the Fuzzy Quaternion Ship Domain (FQSD).

3.1.3. FQSD vessel conflict index determination procedure

The number of FQSD overlaps in a given time period on a given leg of the TSS is determined as follows:

(i) The position, speed and course over ground of each vessel in the traffic area is determined for a set of examining times Tj, based on interpolations of the original AIS trajectories in the selected data set. Qu et al. (2011) apply a time interval ΔT = Tj − Tj−1 = 30 min.

(ii) For each traffic image from step i., the information for each vessel is used to draw the FQSD according to Eqs. (4)–(8), for a selected value of r and k. Qu et al. (2011) set (r, k) = (0.4, 1).

(iii) The number of overlaps of the FQSDs of all ships in the area is counted for each traffic image at examining times Tj from step i., and the positions of these overlaps are stored.

(iv) The detected FQSD overlaps are grouped per leg of the TSS.
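The counting loop of steps i.–iv. can be sketched as follows. This is an illustrative simplification, not the authors' code: the exact intersection test of two fuzzy quaternion domains is replaced by a conservative circle test on each ship's largest FQSD radius, and `snapshots` and `max_radius` are hypothetical inputs:

```python
import itertools
import math

def count_fqsd_conflicts(snapshots, max_radius):
    """Count pairwise domain overlaps per traffic image (steps ii.-iii.).

    snapshots: one traffic image per examining time Tj; each image is a
               list of (ship_id, x, y) positions in metres.
    max_radius: function ship_id -> largest FQSD radius of that ship; two
               domains are (conservatively) taken to overlap when the centre
               distance is below the sum of the largest radii.
    Returns (time_index, id1, id2, midpoint) records, which can then be
    grouped per TSS leg via the midpoint (step iv.).
    """
    overlaps = []
    for t, image in enumerate(snapshots):
        for (id1, x1, y1), (id2, x2, y2) in itertools.combinations(image, 2):
            if math.hypot(x2 - x1, y2 - y1) < max_radius(id1) + max_radius(id2):
                overlaps.append((t, id1, id2, ((x1 + x2) / 2, (y1 + y2) / 2)))
    return overlaps
```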

In order to get insight into the proposed method and into the applied data, the above dynamic procedure is illustrated in Video 1 for a choice of time interval ΔT = 10 min and FQSD parameters (r, k) = (0.2, 1) as in Eqs. (4)–(8). Examining times Tj are shown in the top left corner. Overlapping FQSDs in the TSS area are colored red, whereas non-overlapping FQSDs in the TSS and overlapping FQSDs outside the TSS are colored green. Relevant information related to the applied data is provided in Section 3.5.

In the study of the reliability criteria R1–R3 given in Section 2, the effect of varying choices of the FQSD parameters r and k is investigated. Furthermore, the effect of different choices of time interval ΔT between traffic images and the application of different data sets is assessed.



3.2. Method 2: Blind navigation collision candidates in ship traffic simulation

3.2.1. General rationale

The model by Goerlandt and Kujala (2011) is based on concepts of Fujii and Shiobara (1971) and Friis-Hansen and Simonsen (2002). The ship–ship collision frequency is estimated by:

F = NA · pc    (9)

where NA is the number of collision candidates, i.e. the number of pairwise vessel contacts in a given time period determined under the assumption that no evasive action is taken. This is equivalent to the blind navigation assumption in (Friis-Hansen and Simonsen, 2002) and (COWI, 2011), as illustrated in Fig. 3. pc is a so-called causation probability, defined as the probability of failing to avoid a collision when on a collision course. Its value depends on the encounter type, see Table 1.

The method is based on a simulation of the maritime traffic in which each vessel proceeds from a certain departure time along a predetermined trajectory between an origin and destination harbor or sea area. Vessels are probabilistically assigned certain attributes based on analysis of trade patterns in the studied area. These include the vessel type, main dimensions and speed. The main dimensions and speed are vessel type and route-dependent. The traffic simulation provides a basis to find cases in which vessels come in contact as in Fig. 3. Assignment of a causation probability according to Table 1 leads to an estimate of the ship–ship collision frequency, see Eq. (9).

3.2.2. Collision frequency estimation procedure

The procedure to estimate the ship–ship collision frequency in each TSS leg is as follows:

(i) The individual AIS ship trajectories for the given data set are determined and grouped per route, i.e. per combination of origin and destination harbor or sea area.

(ii) For each route, the number of ship departures, departure time (DT) distribution and ship type distributions are empirically determined. Per vessel type, derived conditional distributions for ship length, width and speed (SD) are empirically determined.

(iii) A traffic simulation is performed. This stage consists of constructing, for each route, a set of ship voyages based on the empirically determined statistical route-specific information. Each such ship voyage is characterized by a voyage departure time, a trajectory, a ship type, length, width and voyage speed. In the algorithm, various choices for the departure time generation procedure and vessel speed distributions are possible, see Table 2.

Fig. 3. Blind navigation collision candidate detection.

(iv) A collision candidate detection algorithm is run, dynamically detecting ship contour overlaps, assuming blind navigation. This collision candidate definition is thus the dynamical equivalent of the collision diameter as analytically derived by Pedersen (1995) and applied in a static risk methodology discussed by Friis-Hansen and Simonsen (2002).

(v) The encounter type (head-on, overtaking or crossing) for all collision candidates is determined and an appropriate causation factor pc is assigned, according to the properties listed in Table 1. The collision frequency is determined according to Eq. (9).

(vi) Collision frequencies are gathered per considered leg of the TSS.

(vii) As the procedure is stochastic in nature, steps iii.–vi. are repeated in a Monte Carlo loop to determine the statistical significance of the estimated mean collision frequency through computation of the confidence interval.
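A minimal sketch of the Monte Carlo loop of steps v.–vii. (my own illustration; the hypothetical `simulate_candidates` stub stands in for the traffic simulation and blind-navigation detection of steps iii.–iv.):

```python
import math
import statistics

# Causation probabilities per encounter type, values from Table 1
P_C = {"overtaking": 4.9e-5, "head-on": 4.9e-5, "crossing": 1.3e-4}

def collision_frequency(candidate_types):
    """Eq. (9): F = NA * pc, applied per collision candidate so that each
    candidate contributes the causation probability of its encounter type."""
    return sum(P_C[t] for t in candidate_types)

def monte_carlo_frequency(simulate_candidates, runs=200):
    """Step (vii): repeat the stochastic simulation, estimate the mean
    collision frequency and a normal-approximation 95% confidence interval."""
    freqs = [collision_frequency(simulate_candidates()) for _ in range(runs)]
    mean = statistics.mean(freqs)
    half = 1.96 * statistics.stdev(freqs) / math.sqrt(runs)
    return mean, (mean - half, mean + half)
```

With a stub returning, say, one head-on and two crossing candidates, collision_frequency gives 4.9 × 10^-5 + 2 × 1.3 × 10^-4 per run.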

In this study, the vessel trajectories are sampled directly from the AIS ship trajectory database. As the focus is on the ship–ship collision frequency in the TSS area, simulated vessel trajectories not crossing the TSS are not withheld in the simulation. This is for reasons of computational efficiency, because the simulation codes are rather slow and a large number of test runs is performed in the test series.

Steps iii. and iv. of the above dynamic procedure are illustrated in Video 2. The simulated traffic follows predetermined trajectories at constant speeds. Ships are indicated with dots. Two vessels coming in contact are indicated with a pentagon. The vessel departure time generation procedure in this video is according to procedure DT3 and the vessel speed distribution according to SD2, as explained in Table 2. Relevant information related to the applied data is provided in Section 3.5.

In the study of the reliability criteria R1–R3 given in Section 2, the effect of various choices of the vessel departure time generation procedure and the vessel speed distribution is tested, as well as the application of different data sets.

3.3. Method 3: projected domain violation in ship traffic data

3.3.1. General rationale

The model proposed by Weng et al. (2012) estimates the frequency of ship–ship collisions based on concepts by Fujii and Shiobara (1971) and MacDuff (1974):

f = Nc · pc    (10)

where Nc is the number of vessel conflicts and pc represents the causation probability. pc is defined as the probability of failing to avoid a collision for a given vessel conflict. A vessel conflict is defined as a critical situation where a vessel is expected to enter another vessel's ship domain in the next time interval, as illustrated in Fig. 4. The value for pc depends on the encounter type, see Table 1.

3.3.2. Collision frequency estimation procedure

The procedure to estimate the ship–ship collision frequency in each TSS leg is as follows:

(i) The position, speed and course over ground of each vessel in the traffic area is determined for a set of examining times Tj, based on interpolations of the original AIS trajectories. Weng et al. (2012) apply a time interval ΔT = Tj − Tj−1 = 3 min.

Table 1
Classification of encounter or conflict type and corresponding causation factors, see Figs. 3 and 4.

Encounter or     Goerlandt and Kujala (2011)         Weng et al. (2012)
conflict type    Course difference α   pc            Course difference α   pc
Overtaking       <67.5°                4.9 × 10^-5   <10°                  4.9 × 10^-5
Head-On          175°–185°             4.9 × 10^-5   170°–190°             4.9 × 10^-5
Crossing         Otherwise             1.3 × 10^-4   Otherwise             1.3 × 10^-4
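The classification in Table 1 reduces to threshold tests on the course difference between two vessels. A sketch using the Weng et al. (2012) thresholds (function names are illustrative, and the folding of angles to [0°, 180°] is an assumption about how the course difference is measured):

```python
def causation_factor(course_diff_deg):
    """Causation probability pc by conflict type, using the
    Weng et al. (2012) course-difference thresholds of Table 1."""
    d = abs(course_diff_deg) % 360.0
    if d > 180.0:             # fold to the smallest angle between courses
        d = 360.0 - d
    if d < 10.0:              # overtaking
        return 4.9e-5
    if d >= 170.0:            # head-on (170-190 deg before folding)
        return 4.9e-5
    return 1.3e-4             # crossing

def collision_frequency(conflicts_by_course_diff):
    """Eq. (10), f = Nc * pc, summed over conflicts grouped here by a
    representative course difference (a simplification for illustration)."""
    return sum(n * causation_factor(d)
               for d, n in conflicts_by_course_diff.items())
```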

Table 2
Algorithmic options for traffic generation in model by Goerlandt and Kujala (2011).

Departure time (DT) generation procedure
Identifier   Explanation
DT1          Non-stationary Poisson process for all routes
DT2          Stationary Poisson process for all routes
DT3          Non-stationary Poisson process for all routes with, on average, more than 2 vessel departures per day; otherwise stationary Poisson process

Vessel speed distribution (SD)
Identifier   Explanation
SD1          Time-mean vessel speed per ship type class, all ships in whole studied area
SD2          Time-mean vessel speed per ship type class, conditional to specific route
SD3          Time-mean vessel speed per ship type class, conditional to specific route, with data restricted to the TSS area
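Departure time procedures DT1 and DT3 require sampling from a non-stationary Poisson process; one standard way to do this is Lewis–Shedler thinning. A sketch, assuming an hourly intensity function bounded above by lam_max (the paper does not prescribe a particular sampling algorithm):

```python
import random

def departure_times(intensity, lam_max, horizon, rng=None):
    """Lewis-Shedler thinning: sample departure times on (0, horizon]
    (hours) from a non-stationary Poisson process with rate intensity(t),
    where intensity(t) <= lam_max for all t."""
    rng = rng or random.Random()
    t, times = 0.0, []
    while True:
        t += rng.expovariate(lam_max)        # candidate from bounding process
        if t > horizon:
            return times
        if rng.random() <= intensity(t) / lam_max:
            times.append(t)                  # accept with prob intensity(t)/lam_max
```

With a constant intensity the thinning step always accepts, recovering the stationary Poisson process of DT2.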

Fig. 4. Vessel conflict detection based on circular ship domain.

3 AIS is a system where navigational parameters are transmitted from ships to one another and to shore stations, allowing for improved situational awareness. It provides a rich data source for studies in maritime transportation, containing detailed information about vessel movements.

F. Goerlandt, P. Kujala / Safety Science 62 (2014) 348–365 353

(ii) Using the information of time step Tj, the projected ship positions at time Tj+1 are calculated.

(iii) The vessel domains at time Tj+1 are determined. Weng et al. (2012) apply a circular ship domain with radius R = 3L, with L the vessel length.

(iv) The projected ship positions are compared to the vessel domains and conflicts are stored for further analysis.

(v) Depending on the type of vessel conflict (head-on, overtaking or crossing), a causation factor is applied according to Eq. (10). The vessel conflict types and causation factors are summarized in Table 1.

(vi) Collision frequencies are gathered per considered leg of the TSS.

Where a given vessel pair has conflicts in consecutive time steps of the above procedure i.–v., only the first of these conflicts is retained in the present study. The above dynamic procedure is illustrated in Video 3 for a choice of time interval ΔT = 3 min and a circular domain with R = 3L. Examining times Tj are shown in the top left corner. The current and projected vessel positions are connected with a line and the domain is drawn around the projected position. Violations are indicated with pentagons. Relevant information related to applied data is provided in Section 3.5.
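Steps ii.–iv. of the procedure can be sketched as a dead-reckoning projection followed by a distance test against the circular domain. The planar coordinate treatment and dictionary-based ship records are simplifying assumptions:

```python
import math

def project(x, y, sog, cog_deg, dt):
    """Dead-reckoning projection of a ship position over dt seconds.
    Planar coordinates in metres; COG in degrees clockwise from north."""
    return (x + sog * dt * math.sin(math.radians(cog_deg)),
            y + sog * dt * math.cos(math.radians(cog_deg)))

def circular_domain_violation(own, other, dt, k=3.0):
    """True if the projected position of 'other' lies inside the circular
    domain (radius k*L) around the projected position of 'own'.
    own/other: dicts with keys x, y (m), sog (m/s), cog (deg), length (m)."""
    ox, oy = project(own["x"], own["y"], own["sog"], own["cog"], dt)
    px, py = project(other["x"], other["y"], other["sog"], other["cog"], dt)
    return math.hypot(px - ox, py - oy) <= k * own["length"]
```

Two head-on vessels closing on each other, for instance, trigger a violation only once the projection interval brings the projected positions within the domain radius.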

In the study of the reliability criteria R1–R3 given in Section 2, the effect of varying choices in the above procedure is investigated: different choices of the time interval ΔT between traffic images, different ship domains and the application of different data sets.

As evident from the overview of ship domains by Wang et al. (2009), many domains have been proposed without current agreement on which to use. The current study applies the circular domain assumed by Weng et al. (2012), the elliptical domain by Fujii and Tanaka (1971) and the circle sector domain by Goodwin (1975), as illustrated in Fig. 5. While Weng et al. (2012) provide analytical formulae for projected violations of the circle domain, for investigating the various domain shapes in our study, vessel conflicts are detected numerically.
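For non-circular domains, the numerical conflict test reduces to a point-in-domain check in a ship-fixed frame. A sketch for a heading-aligned ellipse such as the Fujii and Tanaka (1971) domain (the caller supplies the semi-axes; the relative-position convention is an assumption):

```python
import math

def in_elliptical_domain(dx, dy, cog_deg, semi_major, semi_minor):
    """Numerical point-in-domain test for a heading-aligned ellipse.
    (dx, dy): position of the other ship relative to the domain centre (m);
    the major axis lies along the ship's course over ground."""
    hx, hy = math.sin(math.radians(cog_deg)), math.cos(math.radians(cog_deg))
    along = dx * hx + dy * hy        # component along the heading
    across = -dx * hy + dy * hx      # component abeam
    return (along / semi_major) ** 2 + (across / semi_minor) ** 2 <= 1.0
```

The same rotation into the ship-fixed frame works for the circle sector domain, with the angular test replacing the ellipse inequality.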

3.4. Notes on the applied methods and connection to holistic risk analysis

The presented methods apply risk perspectives according to Eqs. (1) and (3). From an overview of maritime risk analysis methods (Li et al., 2012), it is found that no methods are available applying the perspective in Eq. (2). The reliability of this risk perspective can thus not be tested in our analysis.

In Method 2 and Method 3, each vessel conflict is assigned a probability of actually resulting in a collision. This leads to a risk perspective in line with Eq. (1). In Method 1, vessel conflicts are simply counted as a risk index, leading to a risk perspective in line with Eq. (3). The main differences between the methods consist in how the vessel traffic is considered and how conflicts are defined and treated. In Method 1 and Method 3, historic vessel movement data obtained from the Automatic Identification System3 (AIS) is applied directly. In Method 2, the traffic in the area is constructed using probabilistic sampling based on analysis of historic vessel traffic.

All three methods are intended only to provide a risk picture in a maritime sea area, focusing on the question where accidents are most likely. For the methods applying the probability-based perspective of Eq. (1), the causality of the accidents in terms of e.g. human and organizational factors or machinery failures is considered indirectly through a causation probability. This is the probability that vessels on a collision course or in a vessel conflict situation will actually collide. It is derived from Bayesian networks (Friis-Hansen and Simonsen, 2002), fault tree analysis (Fowler and Sørgård, 2000) and/or incident/accident statistics (Montewka et al., 2012c), see also Fig. 1. While this clearly is an important simplification, it is noteworthy that a method using a similar rationale is recommended by relevant maritime authorities (IALA, 2009; IMO, 2010) and has been used in many applications, e.g. COWI (2011). The approximation is commonly taken to be accurate enough to estimate the areas with highest accident likelihood in open sea areas (Almaz, 2012; Li et al., 2012).

Fig. 5. Definition of the various domains applied in the analysis.

While the methods only address the likelihood of the collision occurring in a sea area, a complete description of collision risk should include the consequence dimension, see Fig. 1. Such holistic collision risk analysis models have been proposed e.g. for collisions involving a passenger vessel by Vanem et al. (2007), Konovessis and Vassalos (2007) and Montewka et al. (2012a,b). Important variables in such a context are the evaluation of the damage extent (Pedersen, 2010), flooding time (Ruponen, 2007), time to capsize given a damage (Hogström, 2012) and the impact scenario conditional to the encounter conditions (Ståhlberg et al., 2013). As evident from Fig. 1, the reliability of the collision likelihood dimension directly affects the reliability of the holistic collision risk analysis.

It is furthermore acknowledged that only three models to evaluate ship–ship collision likelihood are selected in the presented case study, for reasons of brevity. More models are available, e.g. Fowler and Sørgård (2000), Friis-Hansen and Simonsen (2002) and Montewka et al. (2010); see Li et al. (2012) for a recent review. While it would be interesting to compare the risk picture provided by all proposed methods, this task is left for further research.

3.5. Outline of the study area and data

The likelihood of ship–ship collision is investigated in the TSS in the Gulf of Finland, which is chosen to suit the restrictions of Method 1 and to facilitate comparisons between methods. The selected sea area is among the busiest in the Baltic Sea, characterized by intensive tanker traffic to and from Russian oil terminals and significant passenger traffic, in particular between Helsinki and Tallinn. The maritime traffic is regulated by means of a TSS and is supervised by a Vessel Traffic Service (VTS). Another measure enhancing safety is the mandatory ship reporting system GOFREP. The likelihood of ship–ship collision in the area has been of practical interest for oil spill response planning, see COWI (2011).

The area of the TSS is pragmatically divided into 16 legs, shown in Fig. 6. E and W signify the traffic lanes for eastbound and westbound traffic, respectively.

In the case study, data from the AIS system is used from the ice-free period 07.2010–11.2010, containing over 8 million data points related to ship movements. Only merchant ships larger than 500 GRT are retained for analysis, so data of e.g. pilot vessels, search and rescue units, patrol boats and recreational craft are not considered. This is because such smaller ships are not required to carry an AIS transponder, making the limited data available for such classes unreliable, and also because the methods applied in the case study are proposed for merchant traffic. In this study, the static data used includes the Maritime Mobile Service Identity (MMSI) number (a unique identifier for a particular ship), the ship type, length and width. Applied dynamic data includes timestamp, position (longitude and latitude), speed over ground (SOG) and course over ground (COG). Data with zero vessel speed (stationary vessels) has been excluded from the database. AIS data furthermore often contains a number of errors (Graveson, 2004), which need to be addressed. Faulty or missing SOG and/or COG data fields have been corrected by making use of subsequent timestamps and ship position data. If for a given MMSI number no ship dimensions could be retrieved from the database, the average length and width for the given vessel type was applied.

4. Methodology for testing analysis reliability

This section describes the conditions for the various test runs of the methods outlined in Section 3, in line with the reliability criteria R1–R3 of Aven and Heide (2009) given in Section 2. For each of these criteria, a number of test matrices are constructed in Sections 4.1–4.3. These summarize the conditions for which the algorithms are run with settings appropriate to the specific reliability criterion. The risk model output is subsequently compared and a number of statistical tests are performed as described in Section 4.4.

4.1. R1: Rerun of method

Method 1 and Method 3 are deterministic algorithms based on historic AIS data. Hence, a rerun of these methods with identical parameter settings leads to identical results. In contrast, Method 2 is stochastic in nature. A rerun of this method hence does not necessarily lead to identical results. To test criterion R1, Method 2 is performed 3 times. In each of these three applications of the method, 5 Monte Carlo simulations are made and the mean values of the calculated frequency estimate are compared, as well as the confidence interval. The limited number of Monte Carlo simulations is due to the computational intensity of the method: one run takes several hours to compute. The applied AIS data is from July 2010; the selected departure time generation procedure and vessel speed distribution are DT3 and SD2 respectively, as explained in Table 2. This test is referred to as Test R1M2.

4.2. R2: Same method and data, but different analysis team

The criterion R2 is in our case study interpreted such that "the same method" indicates that the general philosophy of the method is followed. For the methods described in Section 3, this means that FQSD overlaps are counted in historic AIS data (Method 1), that a blind navigator collision candidate detection is performed in a probabilistic traffic simulation (Method 2) or that violations of projected ship positions and ship domains are detected (Method 3). In each of the methods, a number of algorithmic choices can be made at the discretion of the analysis team. Such algorithmic choices may have an influence on the resulting risk metrics.

Fig. 6. Definition of the studied areas of the Traffic Separation Scheme (TSS) in the Gulf of Finland.

In all tests concerning reliability criterion R2, the applied AIS data is of the period July–November 2010 for Method 1 and Method 3. For Method 2, the applied AIS data is of the month July 2010. The R2-test cases for each method are summarized in Table 3.

For Method 1, test R2M1A investigates the influence of the time interval ΔT between investigated traffic images with FQSD parameters (r,k) = (0.4,1), for six cases. In test R2M1B, the influence of FQSD size factor r and shape index k is investigated with a time interval ΔT = 30 min for nine chosen combinations, see Table 3.

For Method 2, test R2M2A investigates the influence of the departure time generation procedure (DT) on the mean collision frequency for three cases. Test R2M2B investigates the influence of the chosen vessel speed distributions (SD) on the mean collision frequency, also for three cases. Clearly, the exact construction of the empirical distributions of step ii. of the procedure in Section 3.2.2 is to a large extent at the modeler's discretion. The R2 test is for our purposes limited to the cases of Table 3. Furthermore, for reasons of visual clarity, only the mean frequency estimate is given for each traffic leg, i.e. the confidence interval is not considered.

For Method 3, test R2M3A investigates six different choices regarding the time interval ΔT for which the vessel and ship domain positions are extrapolated based on conditions at a given examining time, see Table 3. A circular domain with R = 3L is applied as in Weng et al. (2012). Test R2M3B investigates the influence of the choice of ship domain for five cases. Three circular domains with varying radii are chosen, as well as the elliptical domain by Fujii and Tanaka (1971) and the circle sector domain by Goodwin (1975). A time interval ΔT = 3 min is applied.

Table 3
Definition of R2-test cases for Method 1, 2 and 3.

Test R2M1A – influence of time interval ΔT
(r,k) = (0.4,1) in Eqs. (4)–(8); AIS data = 07–11.2010
Case:  i       ii      iii     iv      v       vi
       600 s   1200 s  1800 s  2700 s  3600 s  5400 s

Test R2M1B – influence of FQSD size and shape (r,k) in Eqs. (4)–(8)
ΔT = 1800 s; AIS data = 07–11.2010
Case:  i        ii       iii      iv       v        vi       vii      viii     ix
       (0.2,1)  (0.4,1)  (0.6,1)  (0.2,2)  (0.4,2)  (0.6,2)  (0.2,3)  (0.4,3)  (0.6,3)

Test R2M2A – influence of departure time generation procedure, see Table 2
5 Monte Carlo runs; vessel speed distribution SD2; data = 07–11.2010
Case:  i     ii    iii
       DT1   DT2   DT3

Test R2M2B – influence of vessel speed distribution, see Table 2
5 Monte Carlo runs; departure time generation procedure DT3; data = 07–11.2010
Case:  i     ii    iii
       SD1   SD2   SD3

Test R2M3A – influence of time interval ΔT
Domain = circle with radius R = 3L; data = 07–11.2010
Case:  i       ii      iii     iv      v       vi
       300 s   240 s   210 s   180 s   150 s   120 s

Test R2M3B – influence of domain size and shape, domains see Fig. 5
ΔT = 180 s; data = 07–11.2010
Case:  i              ii               iii              iv                v
       Circle R = 3L  Circle R = 2.5L  Circle R = 3.5L  Fujii and Tanaka  Goodwin

4.3. R3: Same scope and objective, no restrictions on methods and data

The reliability criterion R3 covers both the possibility that analysis teams use different data and different methods to analyze the risk. In the present study, the data-related reliability is investigated by using different subsets of the available AIS vessel movement data. For each of the applied methods, one case is studied with AIS data for one month, while another accounts for all five months of AIS data, see Table 4. Algorithmic settings for the various methods are shown as well. These correspond to the choices originally proposed in the methods as discussed in Sections 3.1–3.3.

4.4. Visual comparison of model results and summary statistics

The analysis results for the selected tests outlined in Sections 4.1–4.3 are compared in two ways. The first is a pairwise visual inspection of risk metric results for each traffic leg for varying test conditions using a series of scatter plots. If the methods under the various tests are reliable, higher values under the first test condition should correspond to higher values under a second test condition.

In recognition of the fact that reliability is a matter of degree, correlation coefficients between the risk metrics for varying test conditions offer a suitable mathematical appreciation of the strength of the relationship between the two metrics. Three coefficients are determined: the Spearman rank correlation coefficient ρ, the Kendall rank correlation coefficient τ and the Pearson product-moment correlation coefficient r (Sheskin, 1997).

The Spearman rank ρ measures how well the ordinal ranking between the metrics of the compared methodologies can be described by a monotonic (not necessarily linear) function. It indicates in how far the methods under varying test conditions rank the likelihood of collision in various TSS legs in the same order. Higher values for ρ are for our purposes desirable. It is a relatively weak test as it only requires the methods to show reliability of rank order, without requiring the reliability of numerical accuracy required by an interval or ratio scale (Stevens, 1946). Kendall's τ can be interpreted as the difference between the probability of the metrics being in the same order and the probability of the metrics being in a different order. It thus provides similar information as the Spearman rank ρ. The Pearson's r is a measure of the linear dependence between the complete set of risk metrics. As the Pearson's r does not retain information on the order of the risk metrics for the respective TSS legs, it is only meaningful if ρ and τ have a high value, indicating that the rank order of the metrics is retained. Where r is not meaningful, it is not reported.
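For a small number of TSS legs, the three coefficients can be computed directly; a self-contained sketch (assuming no tied risk metric values, so no tie correction is needed):

```python
def _ranks(v):
    """Ordinal ranks 1..n of the values in v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rho: Pearson correlation of the rank-transformed data."""
    return pearson_r(_ranks(x), _ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    n, c, d = len(x), 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (n * (n - 1) / 2)
```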

5. Results

5.1. Reliability criterion R1

As stated in Section 4.1, Method 1 and Method 3 are deterministic and driven by historical data. Hence, a risk analysis applying these models is reliable according to reliability criterion R1.

Table 4
Definition of R3-test cases.

Test R3 – cross-validation of methods and data

Case   Method   Settings
i      1        ΔT = 1800 s; (r,k) = (0.4,1) in Eqs. (4)–(8); AIS data 2010.07
ii     1        ΔT = 1800 s; (r,k) = (0.4,1) in Eqs. (4)–(8); AIS data 2010.07–11
iii    3        ΔT = 180 s; domain = circle R = 3L; AIS data 2010.07
iv     3        ΔT = 180 s; domain = circle R = 3L; AIS data 2010.07–11
v      2        5 MC-runs; speed distribution SD3; departure time generation DT3, see Table 2; AIS data 2010.07
vi     2        5 MC-runs; speed distribution SD3; departure time generation DT3, see Table 2; AIS data 2010.07–11

Fig. 7 shows the results for test R1M2, investigating the reliability of the method for three reruns as introduced in Section 4.1. The diagonal contains information regarding the test case and regarding the axis labels.

The scatterplots in the figure show pairwise comparisons of the risk metrics for the different TSS legs under various test conditions. The metric for each leg is given a specific color and shape code as indicated. The scatterplots are best read for a given combination of test conditions, i.e. for a specific subplot of the figure. The metric value for a specific TSS leg according to the first test condition is read on the horizontal axis. The corresponding metric value according to the second test condition is read on the vertical axis. If the metrics of all TSS legs have an identical value for all test conditions, the metrics will all appear on a line through the origin bisecting the graph. In this case, the rank order of the risk metrics will naturally also be retained. The extent to which the risk metrics differ from the bisecting line for varying test conditions and the extent to which metrics are ranked in a different order are indicative of the reliability of the method under the specified test conditions. For instance, for test R1M2.i, Leg 2E has a mean collision frequency of 0.0014 per month, the 4th highest rank of all TSS legs. For test R1M2.iii, Leg 2E has a mean collision frequency of 0.0018 per month, the 2nd highest rank of all TSS legs. The correlation coefficients on the other side of the diagonal indicate the strength of the relation between the respective risk metrics under the various test conditions.

The results of Fig. 7 indicate that for a rerun of Method 2, a high reliability is obtained in terms of the mean overall ship–ship collision frequency. As comparing the mean values of a probabilistic sample provides no information about the statistical uncertainty of that estimate, it is instructive to consider the width of the 95%-confidence interval for the three separate test runs, as shown in Table 5. To the extent that these intervals overlap for corresponding legs of the TSS, the R1 criterion is fulfilled.
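The overlap test for two such intervals is elementary; a sketch (interval representation as (lo, hi) pairs is an assumption):

```python
def intervals_overlap(ci_a, ci_b):
    """True if two confidence intervals, given as (lo, hi) tuples,
    overlap -- the per-leg agreement check used for criterion R1."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```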

In an in principle infinite number of repetitions of the stochastic sampling, the frequency estimates will asymptotically converge to the same numbers and criterion R1 is asymptotically fulfilled. In actual applications, a lower number of repetitions is used because of practical limitations. Nonetheless, the tests R1M2.i to R1M2.iii show high R1-reliability for the collision frequency estimate even for the relatively small number of Monte Carlo runs as applied in the tests.

5.2. Reliability criterion R2

Fig. 7. Results of test R1M2 as described in Section 4.1, definition of TSS areas see Fig. 6.

5.2.1. Analysis according to Method 1

Fig. 8 shows the results for test R2M1A as explained in Section 4.2. As seen in Table 3, the test investigates the effect of the chosen time interval ΔT. The number of FQSD domain overlaps for the considered TSS legs is shown below the diagonal as a pairwise comparison between two test conditions. Summary statistics concerning the rank order are shown above the diagonal. The diagonal provides information regarding the test cases according to Table 3 and shows the axis labels. As expected, the shorter the interval ΔT, the more FQSD overlaps are detected. The results show a very high R2-reliability in terms of rank order of the collision likelihood in the various legs. This is evident from the high values for ρ and τ for all investigated cases. ΔT hence does not affect the rank order of the likelihood of ship–ship collision in the various legs of the TSS. Moreover, the high values for the Pearson's r indicate that the dependency is highly linear for the tested cases. A visual inspection of the results in Fig. 8 also shows the high R2-reliability of the method in relation to ΔT, but reveals that the absolute value of the number of FQSD overlaps increases with lower values of ΔT. As the vessel conflict risk index only serves as an indication of the TSS legs where more vessel conflicts occur, the absolute value of the risk metric is not of special interest.

Fig. 8. Results of test R2M1A according to settings in Table 3, definition of TSS areas see Fig. 6.

Fig. 9 shows the results for test R2M1B as explained in Section 4.2 and Table 3. It is seen that varying choices of FQSD size factor r and shape factor k lead to important differences in determining in which TSS legs ship–ship collisions are most likely. This is readily concluded from the low to medium rank correlation of the results according to e.g. tests R2M1B.i and R2M1B.iii. A Spearman rank correlation coefficient ρ of 0.52 and a Kendall τ of 0.43 can be considered low for our purposes. Such results indicate that the method concludes that the likelihood of ship–ship collision varies significantly for the different sea areas based solely on the settings of FQSD parameters r and k. For some test conditions, the risk metrics are highest in Legs 2E and 2W, e.g. test R2M1B.i. In e.g. test R2M1B.iii, these legs have significantly lower values and Leg 6W stands out. Closer inspection of the obtained number of FQSD overlaps for the different test cases shown in Fig. 9 clearly shows that these examples are by no means exhaustive.

Fig. 9. Results of test R2M1B according to settings in Table 3, definition of TSS areas see Fig. 6.

Qu et al. (2011) choose the values (r,k) = (0.4,1), as in test R2M1B.ii. There is, however, no 'true' or 'correct' value for these parameters; these are analyst's choices in a quantitative model and different analysis teams may select different values. The results thus indicate that a risk analysis according to Method 1 has low reliability according to criterion R2.

5.2.2. Analysis according to Method 2

Fig. 10 shows the results for tests R2M2A and R2M2B as explained in Section 4.2. As seen in Table 3, the test addresses how reliable the estimates of the ship–ship collision frequencies are when analyst teams make different choices regarding the vessel speed distributions and the vessel departure time generation procedures.

It is seen that these choices have a measurable effect on the reliability of the collision frequency estimate per TSS leg. Nonetheless, the overall reliability is rather high. All test cases find ship–ship collisions most likely in crossing C1, see Fig. 6, and the estimates for the other legs are also quite reliable. The stochastic nature of Method 2 necessarily leads to some variation between runs, see also Section 5.1.

Based on these tests, it can be concluded that Method 2 shows high R2-reliability in terms of the mean collision frequency. However, it is expected that this conclusion is too strong in general. The model requires many choices by the analyst in terms of the detailed construction of the empirical distributions to generate the ship traffic, e.g. in terms of departure time, ship sizes and speeds, see Table 2. If the method is applied to also consider the collision consequences, the R2-reliability is expected to be lower. This is because the vessels detected as collision candidates according to Fig. 3 will have different characteristics in terms of main dimensions, ship type and speed due to the stochastic sampling from different distributions. As these parameters are important to assess the possible consequences, e.g. in terms of the probability of hull breach occurrence (Goerlandt et al., 2012; Ståhlberg et al., 2012), this will result in diverging ranges of e.g. estimated hull breach probabilities and further consequences, lowering the reliability in terms of risk.

In conclusion, the test indicates that Method 2 has high R2-reliability in terms of the mean ship–ship collision frequency for the TSS legs. However, there are reasons to believe that a risk analysis also covering the consequence dimension would show a lower R2-reliability.

Fig. 10. Results of test R2M2A and R2M2B according to settings in Table 3, definition of TSS areas see Fig. 6.

5.2.3. Analysis according to Method 3

Fig. 11 shows the results for test R2M3A as explained in Section 4.2. As seen in Table 3, the test investigates the effect of the chosen time interval ΔT. The frequency of collision for the considered TSS legs is shown below the diagonal as a pairwise comparison between two test settings. Summary statistics concerning the rank order are shown above the diagonal. The diagonal provides information regarding the test cases according to Table 3 and shows the axis labels. It is evident that a choice of time interval ΔT within the chosen bounds does not significantly affect the frequency estimates in the TSS legs and that the method shows high R2-reliability for this choice. It is clear that the accuracy of the estimate across test settings also maintains the rank order of the risk level in the various legs. This is evident from the very high values of the Spearman ρ and Kendall τ for all investigated cases. However, the absolute value of the frequency estimates is to some extent conditional on the chosen time interval ΔT. Comparing e.g. the collision frequency in TSS leg 1W according to test R2M3A.i with R2M3A.vi, the former condition leads to a frequency estimate of ca. 0.03 per 5 months, the latter to ca. 0.04 per 5 months. Overall, shorter time intervals ΔT lead to higher frequency estimates. This can be explained by the interaction counting process of Fig. 4: if ΔT is small, the domain violation is evaluated more frequently with extrapolated vessel speeds and courses which are reasonable estimates compared to the real values. If ΔT becomes larger, the extrapolations of speeds and courses are cruder, and extrapolations of the projected position of vessel B can overshoot the domain of vessel A. The frequency estimates are somewhat sensitive to the choice of ΔT.

Fig. 11. Results of test R2M3A according to settings in Table 3, definition of TSS areas see Fig. 6.

Fig. 12. Results of test R2M3B according to settings in Table 3, definition of TSS areas see Fig. 6.

Fig. 12 shows the results for test R2M3B as explained in Section 4.2 and Table 3. From cases R2M3B.i and R2M3B.ii, it is seen that varying choices for the size of the circular ship domain have no significant effect on the rank order of the collision frequency estimates. The Spearman ρ and Kendall τ values for these test cases show that the collision frequency in the TSS legs is ranked very consistently. However, when considering the absolute value of the frequency, it is seen that relatively small changes in the ship domain diameter have a quite significant effect on the obtained number. E.g. case R2M3B.ii with domain radius R = 2.5L has a monthly collision frequency of 0.018 in Leg 6W. Changing the domain radius to R = 3.5L as in case R2M3B.iii results for the same leg in a collision frequency of 0.055, i.e. about three times as high. In Weng et al. (2012), a domain size R = 3L is assumed based on information in Mou et al. (2010). This choice is stated as an average for all ships, but it is clear that even slight variations in this assumption have quite important effects.

Changing the ship domain shape has even more significant effects. Application of the elliptical domain by Fujii and Tanaka (1971) as in case R2M3B.iv, which is longer but narrower than the circular domains as shown in Fig. 5, leads to much lower collision frequencies and mediocre values for the Spearman ρ and Kendall τ. A choice of the circle sector domain by Goodwin (1975) leads to much higher collision frequencies and very low values for the correlation coefficients between test cases.

The lack of retained rank order and lack of numerical accuracy of the risk estimate is due to two reasons. First, simple application of the same causation probability pc for all domain shapes in Eq. (10) is not justified. This causation factor needs to be seen as a calibration factor between the number of domain violations and the actual collision frequency and is not simply transferable between models, see e.g. Montewka et al. (2012c). Second, for different domain shapes, violations in overtaking, crossing and head-on encounters occur in different ratios. Fig. 13 shows the number of and ratios between overtaking and crossing encounters found in the various sea areas for the cases R2M3B.i, R2M3B.iv and R2M3B.v as in Table 3. No head-on encounters are detected. It is seen that the narrower elliptical domain by Fujii and Tanaka (1971) results in fewer overtaking encounters than the circular domain by Weng et al. (2012) or the circle sector domain by Goodwin (1975). The circle sector domain of Goodwin (1975) generally finds a larger share of crossing encounters than the circular domain. The ratio between overtaking and crossing encounters depends on the considered leg of the TSS and comparisons between domains show large variations.

As is clear from the overview given by Wang et al. (2009), there is no current agreement on which domain is 'correct'. The selected domain is clearly an analyst's choice. From the results in Figs. 12 and 13, it is evident that the accuracy of the collision frequency depends on this choice. Moreover, the relative ranking of collision likelihood in various sea areas is also affected. Nonetheless, it is evident that regardless of the chosen domain, the collision frequency is always found to be highest in legs 6W and 6E, with collisions significantly less likely in other sea areas.

In light of the above, it is concluded that Method 3 has a low R2-reliability.

5.3. Reliability criterion R3

Fig. 14 shows the results of test R3 as explained in Section 4.3. As seen in Table 4, the test investigates the effect of the applied data and method. The structure of the figure is as before: the various metrics indicating the likelihood of collision are shown below the diagonal in a pairwise comparison between the test cases of Table 4. Above the diagonal, summary statistics show the reliability of the methods in terms of preservation of rank order of the likelihood of collision between test cases. On the diagonal, the case identification is shown according to Table 4. The axis labels are shown as well.

Fig. 13. Number and relative shares of overtaking (OT, light gray) and crossing (CR, dark gray) projected domain violations for three domain shapes in the various TSS areas of Fig. 6.

Fig. 14. Results of test R3 according to the settings in Table 4; for the definition of the TSS areas, see Fig. 6.

F. Goerlandt, P. Kujala / Safety Science 62 (2014) 348–365 361

A comparison of cases R3.i and R3.ii, R3.iii and R3.iv, and R3.v and R3.vi shows that the applied data has a certain influence on the results. The rank order of collision likelihood for the various TSS legs is relatively well retained, and no major fluctuations in the numerical values of the risk metrics are found either. All three investigated methods perform comparably well. The R3-reliability regarding data for the considered cases can be considered high. This indicates that the global traffic patterns show a large consistency over longer periods of time.

In contrast, a comparison across methods shows a very different picture. Consider for example a comparison between Method 1 and Method 3, as in cases R3.ii and R3.iv. A direct comparison of the accuracy of the risk metrics is not meaningful due to the different measurement units (FQSD overlaps vs. collision frequency). However, the ranking of the likelihood of collision for the various TSS areas is meaningful. A Spearman ρ of 0.71 and a Kendall τ of 0.50 can be considered low for these purposes. This implies that the methods consider very different sea areas more likely for collisions to occur. Comparison of Method 1 and Method 2, i.e. cases R3.ii and R3.vi, leads to somewhat more consistent results for the ranking. However, whereas the crossing C1 is considered a high-risk area according to Method 2, Method 1 ranks this area much lower. In a comparison of Methods 2 and 3, i.e. cases R3.iv and R3.vi, the numerical accuracy of the frequency estimates is meaningful. However, it is evident that collision frequency estimates vary significantly between the methods. Moreover, the rank order of the collision frequency across the various sea areas is almost entirely different. A Spearman ρ of 0.40 and a Kendall τ of 0.22 can be considered very low for our purposes.
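The Spearman ρ and Kendall τ values quoted throughout can be reproduced from two vectors of per-area risk metrics. A self-contained sketch, using made-up frequency estimates rather than the paper's results:

```python
def ranks(values):
    """Rank data from 1 (smallest) upward; ties are not handled here
    (the illustrative data below contains none)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman_rho(x, y):
    """Spearman rank correlation via the squared rank differences."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

def kendall_tau(x, y):
    """Kendall tau from concordant and discordant pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical collision-frequency estimates for six sea areas from two
# methods (illustrative numbers only):
m_a = [0.018, 0.055, 0.002, 0.031, 0.007, 0.044]
m_b = [0.020, 0.012, 0.003, 0.058, 0.009, 0.035]
print(round(spearman_rho(m_a, m_b), 2), round(kendall_tau(m_a, m_b), 2))  # 0.6 0.47
```

Values of this magnitude would, as in the text, indicate that the two methods rank the sea areas rather differently.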

In conclusion, it is found that the analysis according to the three investigated methods shows very low R3-reliability.

A qualification is in order with regard to the applied AIS data. It is well known that AIS data contains errors, see e.g. Graveson (2004), although the data reliability has increased over the years (Felski and Jaskolski, 2013). In particular, the applied data contains a number of short periods of a few hours where no data is present. No attempts have been made to compensate for these gaps. The investigated methods do not respond to these gaps in the same manner, which can be a contributing reason for the low R3-reliability. For Method 1 and Method 3, the data gaps lead to a number of examining times Tj and corresponding traffic images not being constructed. For these missing examining times, no FQSD overlaps or vessel conflict detections are made. In Method 2, these short data gaps may affect the number of generated vessel voyages and the constructed empirical distributions. It is practically unfeasible to assess how much effect these data gaps will have on the conclusions for the R3-criterion, and no attempts are made to quantify this effect. However, as Method 1 and Method 3 deal with these gaps in a comparable manner, while also leading to widely varying risk metrics, the overall conclusion of low R3-reliability remains.
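As an illustration of the preprocessing involved, gaps in an AIS message stream can be located by scanning consecutive timestamps. The one-hour threshold below is an illustrative assumption, not a value used in the study.

```python
from datetime import datetime, timedelta

def find_data_gaps(timestamps, max_gap=timedelta(hours=1)):
    """Return (start, end) pairs where consecutive AIS timestamps are
    further apart than max_gap (illustrative threshold)."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_gap]

# Toy message stream with an almost four-hour hole in it:
msgs = [datetime(2013, 7, 1, 0, 0), datetime(2013, 7, 1, 0, 5),
        datetime(2013, 7, 1, 4, 0), datetime(2013, 7, 1, 4, 6)]
for start, end in find_data_gaps(msgs):
    print(f"gap of {end - start} between {start} and {end}")
```

Whether such a gap suppresses examining times (Methods 1 and 3) or distorts generated voyages (Method 2) depends on the method, as described above.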

6. Discussion

6.1. Reliability of risk perspectives

Recalling the first aim of the presented study, we have provided some evidence for the claims by Aven and Heide (2009) that quantitative risk analysis according to the risk perspective of Eq. (1) is not in the general case reliable. The reliability of the uncertainty-based risk perspective of Eq. (2) has not been investigated, as no methods are at present available for the chosen aim and scope of the risk analysis case study. In comparison with Aven and Heide (2009), we have extended the conclusion that risk analysis is not in general reliable to indicator-based perspectives as in Eq. (3). This is evident from the analysis in Sections 5.2.1 and 5.3.

We can summarize our findings as in Table 6. The reliability assessment distinguishes between the numerical accuracy of the risk estimate and the rank order of risk metrics in various parts of the system.

In relation to the accuracy of risk analysis, one of the concerns of Rae et al. (2012), it is concluded that claims concerning numerical accuracy of the risk metrics should in general be moderated. Relatively small changes in model choices can have a quite significant effect on the value of the risk metric, as in test R2M3A for the time interval ΔT and in test R2M3B for the size of the circular domain. For the indicator-based perspective, the numerical accuracy is not relevant.

Based on our findings, we extend this concern regarding the numerical accuracy of the calculated metrics. Claims concerning the rank order between risk metrics, which is a significantly lower demand in terms of measurement theory (Stevens, 1946), should in light of the presented case study also be moderated. This means that a method designed to estimate risk for comparative purposes may not give an appropriate ranking of the parts of a system posing a higher risk.

The results from our analysis run largely parallel to the claims by Aven and Heide (2009). Reliability criterion R1 is easiest to achieve, as it is only related to the model as such. Criterion R2 is significantly more demanding, as different analysis teams may interpret and manipulate the data in different ways and make different methodological choices within the applied risk analysis method. Criterion R3 is in general not satisfied: differences in applied data and in the rationale, assumptions and simplifications of different methods can lead to varying risk characterizations.

From Section 2.2, it should be clear that a high reliability is desirable, but that a reliable method may not adequately measure what it should, i.e. that it may not be valid. This is addressed in the next section.

6.2. Implications for validity of quantitative ship–ship collision risk analysis

One of the main findings of the reliability case study presented in Section 5 is that significantly different sea areas are found to represent a high risk of collision. This lack of inter-methodological reliability should, in the authors' view, raise concern about the validity of the investigated analysis methods.

In this section, the validity of the three methods is discussed both in terms of the applicable criteria of the applied risk perspectives and in terms of the criteria concerning the construct "ship–ship collision given an encounter". Additionally, the suggestion of one of the reviewers to apply ensemble studies to improve the estimate of the areas posing high collision risk is briefly addressed in the context of model uncertainty of the selected methods.

6.2.1. Validity in terms of applied risk perspective

Considering validity criterion V1 of Section 2.2, Method 2 and Method 3 apply the probability-based risk perspective according to Eq. (1) and aim at describing the true risk. Method 2 allows for a statistical evaluation of a confidence bound about this estimated true probability, whereas Method 3 does not allow this. Method 1, applying the risk perspective according to Eq. (3), does not make claims about the true probability of collision, but applies a quantitative ranking about which no uncertainty is expressed. So in that sense, Method 1 also aims to provide a true ranking of the sea areas in terms of collision likelihood. However, the lack of inter-methodological reliability according to R3 indicates that at least some, and possibly all, of the methods do not accurately measure or rank the true collision risk.

In light of validity criterion V3, the completeness of the uncertainty assessment, it is noted that in Methods 1 and 3, uncertainty is not addressed, whereas in the probabilistic approach of Method 2, the confidence interval about the mean frequency as in Table 5 can be seen as a measure of statistical uncertainty. However, none of the methods address epistemic uncertainty in terms of the strength of the knowledge on which the assessment is based. This issue of evaluation of uncertainty in the background knowledge has received some recent attention, and a growing number of researchers acknowledge its importance (Flage and Aven, 2009; Montewka et al., 2013). For the definition of the ship–ship encounter on which the collision probability or rank order is conditioned, this background knowledge can be evaluated using the concepts of

Table 5
Confidence intervals for collision frequency estimates; tests R1M2.i to R1M2.iii.

Leg   CI R1M2.i (# coll/5 months)   CI R1M2.ii (# coll/5 months)   CI R1M2.iii (# coll/5 months)
1W    6.2 × 10−4 – 12.2 × 10−4      5.2 × 10−4 – 11.7 × 10−4       7.5 × 10−4 – 10.1 × 10−4
1E    4.7 × 10−4 – 7.5 × 10−4       4.8 × 10−4 – 5.9 × 10−4        5.2 × 10−4 – 8.7 × 10−4
2W    15.7 × 10−4 – 21.9 × 10−4     15.2 × 10−4 – 20.9 × 10−4      13.9 × 10−4 – 18.0 × 10−4
2E    11.4 × 10−4 – 17.9 × 10−4     13.6 × 10−4 – 21.1 × 10−4      14.2 × 10−4 – 20.8 × 10−4
3W    14.0 × 10−4 – 18.4 × 10−4     14.9 × 10−4 – 16.9 × 10−4      13.7 × 10−4 – 18.2 × 10−4
3E    10.1 × 10−4 – 14.9 × 10−4     12.4 × 10−4 – 15.7 × 10−4      9.9 × 10−4 – 14.2 × 10−4
4W    8.8 × 10−4 – 12.9 × 10−4      10.4 × 10−4 – 13.3 × 10−4      9.3 × 10−4 – 11.1 × 10−4
4E    9.0 × 10−4 – 12.4 × 10−4      7.7 × 10−4 – 10.6 × 10−4       8.6 × 10−4 – 14.8 × 10−4
5W    6.7 × 10−4 – 7.8 × 10−4       4.8 × 10−4 – 8.7 × 10−4        6.2 × 10−4 – 9.1 × 10−4
5E    4.8 × 10−4 – 8.9 × 10−4       3.7 × 10−4 – 6.9 × 10−4        3.7 × 10−4 – 6.5 × 10−4
6W    11.9 × 10−4 – 15.6 × 10−4     9.0 × 10−4 – 17.9 × 10−4       7.7 × 10−4 – 12.2 × 10−4
6E    6.3 × 10−4 – 7.5 × 10−4       4.3 × 10−4 – 9.2 × 10−4        5.3 × 10−4 – 8.5 × 10−4
7W    2.2 × 10−5 – 1.6 × 10−4       6.9 × 10−5 – 2.1 × 10−4        9.2 × 10−5 – 2.1 × 10−4
7E    5.9 × 10−5 – 9.8 × 10−5       1.9 × 10−5 – 1.8 × 10−4        3.0 × 10−5 – 2.2 × 10−4
C1    29.5 × 10−4 – 39.7 × 10−4     32.9 × 10−4 – 40.9 × 10−4      3.1 × 10−4 – 3.6 × 10−4
C2    1.4 × 10−4 – 2.6 × 10−4       3.8 × 10−4 – 4.6 × 10−4        1.3 × 10−4 – 4.9 × 10−4

Table 6
Summary of the results of the reliability tests; implications for risk perspectives. Reliability scores are given for the accuracy of the risk estimate and for the rank order.

Method / risk perspective       Reliability criterion   Test     Accuracy of risk estimate   Rank order
Method 1                        R1                      N/A      –                           Y
Qu et al. (2011)                R2                      R2M1A    –                           H
R ~ {Ik}                        R2                      R2M1B    –                           L
                                R3                      R3       –                           L
Method 2                        R1                      R1M2     H                           H
Goerlandt and Kujala (2011)     R2                      R2M2A    H-M                         H-M
R ~ (A, C, P)                   R2                      R2M2B    H-M                         H-M
                                R3                      R3       L                           L
Method 3                        R1                      N/A      Y                           Y
Weng et al. (2012)              R2                      R2M3A    M                           H
R ~ (A, C, P)                   R2                      R2M3B    L                           L
                                R3                      R3       L                           L

Y: Yes | H: High | M: Medium | L: Low.

face and content validity, see Section 6.2.2. For all three methods, this definition involves high uncertainty.

Concerning validity criterion V4, all three methods focus on fictional parameters rather than observable events. In Methods 2 and 3, the aim of the analysis is to provide an estimate of a probability, which is a model parameter in the risk perspective according to Eq. (1). Focus is not on the observable event "ship–ship collision" as it is known or expected to occur in the real world, but on the probability assigned to a model construct (blind navigation based collision candidates or projected circular domain overlaps). Likewise, Method 1 aims to provide a count of the overlaps of the FQSD, which is also a model construct, conditional to an analyst's parameter choices r and k, see Eqs. (4)–(8). The quantity of interest is not an observable event but a mathematically constructed domain around two encountering ships, the overlap of which is taken to be correlated with the likelihood of collision. The number of FQSD overlaps is thus a fictional parameter, for which it is furthermore not evident how exactly to interpret the domain size and shape in relation to the parameters r and k. We can question whether risk analysis should focus on such fictional parameters (Aven and Heide, 2009).

6.2.2. Construct validity of the applied definition of ship–ship collision given an encounter

In this section we reflect on the validity of the object about which risk is expressed, i.e. on how well the construct "ship–ship collision conditional to an encounter" is translated into an operationalization. The evaluation of face and content validity, see Section 2.2, provides a reason why the epistemic uncertainty according to V3 is considered to be high.

In the authors' view, the operationalization of what constitutes a ship–ship encounter in terms of a blind navigation description as in Method 2 has low face validity. This is simply not how actual ship navigation happens, and accident reports contain no example of accidents occurring in such conditions: at least one of the vessels performs evasive maneuvering prior to collision (Buzek and Holdert, 1990; Cahill, 2002). While this blind navigation assumption is used also in other models (COWI, 2011; Friis-Hansen and Simonsen, 2002; Pedersen, 1995), it lacks justification and leads to uncertain results. The domain based vessel conflict detection of Method 1 and Method 3 has higher face validity, as it is reasonable to assume that a violation of an area which navigators would normally like to keep clear does indeed constitute an exposure to collision. Nonetheless, there is no obvious reason why the domain ought to be circular as in Method 3. The more complex FQSD model applied in Method 1 seems to have higher face validity, as it accounts not only for the vessel size but also for the speed.

Considering more closely how the encounter is defined in the three methods in terms of content validity, it is evident that the models lack some key features of the encounter process. The blind navigator model used in Method 2 is a strong simplification of the collision evasion process, for which there is no empirical evidence. The projected circular domain violation accounts for the vessel length to define the domain size, but fails to account for e.g. the encounter speed and the direction of approach, which other researchers find relevant in defining a domain, see Wang et al. (2009). Moreover, the circular domain shape is not supported by empirical evidence: work by Gucma and Marcjan (2012) and van Iperen (2012) suggests that the domain shape is not circular but rather depends on the navigation conditions. Similarly, for Method 1 applying the FQSD, the available empirical evidence regarding domain shape and size does not support the general applicability of this model to evaluate the severity of an encounter. Importantly, it is furthermore not clear which parameter choices for r and k would be valid in a given context, while different choices result in significantly different collision risk evaluations, see Section 5.2.3.

Returning to validity criterion V3, it is apparent that there is a high uncertainty in the background knowledge related to the definition of the encounter. For Method 2, this implies that the "true" collision probability may be well outside the statistical confidence interval, which fails to account for the poor knowledge base on which this encounter detection method and collision probability estimate are conditioned. Similarly, for Method 3, the "true" collision probability may deviate far from the calculated number, as the epistemic uncertainty about the domain definition is high. This is also true for Method 1: the definition of the FQSD is mathematically advanced, but evidence suggests that this domain shape is not generally applicable. Furthermore, the uncertainty regarding which parameter choices to make for r and k in a given setting leads to a high uncertainty regarding the attainment of a valid risk picture in a given sea area.

This illustrates that a risk perspective according to Eqs. (1) and (3) in general provides an incomplete risk description. Important uncertainties may be hidden in the background knowledge, and these should be made explicit, see Section 2.2.

6.2.3. Use of ensemble studies to account for model uncertainty

One of the reviewers suggested considering the use of ensemble studies to improve the estimate of the areas posing high collision risk. Such an approach uses estimates from various studies to make a weighted statement concerning the collision risk. We consider the applicability of this proposal for the presented case study.

This approach was, to the best of the authors' knowledge, first suggested by Apostolakis (1990) as a structured approach to account for model uncertainty, see also Zio and Apostolakis (1996). The rationale can be summarized as follows. Consider M2 and M3 as notation for Methods 2 and 3 as introduced in Section 3. Conditional to these models Mi, we have a probability of collision in each TSS leg, Pi(A|Ki) for i = 2 or 3. This probability Pi is based on a specific background knowledge Ki on which each model Mi is conditioned. In the alternative hypothesis approach by Apostolakis (1990), a subjective probability pi is assigned to each model Mi being true, with p2 + p3 = 1. Unconditionally, we obtain:

P(A|K) = P2(A|K2) p2 + P3(A|K3) p3        (11)

Theoretically, such an alternative hypothesis approach is a sound way to incorporate the results of rival theories. However, it is the authors' view that for the given case study, this approach does not improve the estimate of the collision probability for each TSS leg.

The key to understanding this is the poor knowledge Ki on which each of the models Mi is conditioned in terms of the encounter definition. The unconditional results can only be as good as the estimate based on the best background knowledge Ki. Aggregating results from models based on uncertain assumptions such as the blind navigator encounter or the circular domain does not improve the unconditional knowledge K. Rather, it obfuscates the unconditional collision probability estimate, adding complexity without adding validity.
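For completeness, the alternative hypothesis aggregation discussed above amounts to a probability-weighted mean of the conditional estimates. A minimal sketch with illustrative numbers, not estimates from Methods 2 and 3:

```python
def unconditional_probability(conditional_probs, model_weights):
    """Alternative-hypothesis aggregation: weight each model's conditional
    collision probability P_i(A|K_i) by a subjective probability p_i of
    that model being true, with the p_i summing to one."""
    if abs(sum(model_weights) - 1.0) > 1e-9:
        raise ValueError("model weights must sum to 1")
    return sum(p * w for p, w in zip(conditional_probs, model_weights))

# Illustrative values only: P2(A|K2), P3(A|K3) and weights p2, p3.
p_leg = unconditional_probability([0.0012, 0.0055], [0.4, 0.6])
print(round(p_leg, 6))  # 0.00378
```

As argued above, the weighting cannot repair the shared weakness in the background knowledge Ki; it only blends the conditional estimates.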

In a review paper regarding collision analysis procedures, Pedersen (2010, p. 259) has called for "a concerted effort to identify gaps in our knowledge and then to integrate this knowledge into risk based procedures for ship operation and ship design". From the results of the present study, it should be clear that the encounter process is such a knowledge gap and that more valid methods for defining the encounter severity are needed to increase the validity of ship–ship collision risk analysis.

7. Conclusion

In this paper we have analyzed the reliability of quantitative risk analysis through a case study of ship–ship collision risk analysis for a given sea area. We have provided some evidence that probability- and indicator-based risk perspectives do not necessarily provide the same risk picture when the analysis is repeated.

Reliability was addressed both in terms of the accuracy of the risk picture and in terms of the rank order of the risk in various parts of the system when the analysis is repeated. For the case study, it is found that only modest claims can be made regarding the reliability of the risk estimate in terms of accuracy, and that the rank order of the determined collision risk for the various sea areas is also not well retained across the investigated methods. While the case study focuses on the likelihood of accident occurrence, the lack of reliability in terms of risk will only be exacerbated if the consequence dimension is also determined for the detected events.

Given the lack of reliability of the investigated ship–ship collision risk analysis methods, the question regarding their validity has been addressed. This has been discussed in terms of the applied risk perspective and using the concepts of face and content validity prevalent in the social sciences. It is found that the methods lack validity, in particular with respect to the completeness of the epistemic uncertainty assessment. One significant finding thus is the need for more valid methods to link the ship–ship encounter to collision risk, to reduce uncertainty in this crucial aspect of collision risk analysis.

Acknowledgements

This research is carried out within the RescOp project in association with the Kotka Maritime Research Centre. This project is co-funded by the European Union, the Russian Federation and the Republic of Finland. The financial support is acknowledged. The authors are grateful to two anonymous reviewers whose comments have contributed to improving an earlier version of this paper.

Appendix A. Supplementary material

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.ssci.2013.09.010.

References

Almaz, O.A., 2012. Risk and performance analysis of ports and waterways: the case of Delaware river and bay. PhD Thesis, Department of Industrial and Systems Engineering, Rutgers University, New Brunswick, New Jersey.
Apostolakis, G.E., 1990. The concept of probability in safety assessments of technological systems. Science 250, 1359–1364.
Aven, T., 2009. Perspectives on risk in a decision-making context – review and discussion. Saf. Sci. 47, 798–806.
Aven, T., 2010. On how to define, understand and describe risk. Reliab. Eng. Syst. Saf. 95, 623–631.
Aven, T., 2011. A risk concept applicable for both probabilistic and non-probabilistic perspectives. Saf. Sci. 49, 1080–1086.
Aven, T., 2012a. Foundational issues in risk assessment and risk management. Risk Anal. 32, 1647–1656.
Aven, T., 2012b. The risk concept – historical and recent development trends. Reliab. Eng. Syst. Saf. 99, 33–44.
Aven, T., 2013. Practical implications of the new risk perspectives. Reliab. Eng. Syst. Saf. 115, 136–145.
Aven, T., Heide, B., 2009. Reliability and validity of risk analysis. Reliab. Eng. Syst. Saf. 94, 1862–1868.
Aven, T., Reniers, G., 2013. How to define and interpret a probability in a risk and safety setting. Saf. Sci. 51, 223–231.
Aven, T., Renn, O., 2009. On risk defined as an event where the outcome is uncertain. J. Risk Res. 12, 1–11.
Buzek, F.J., Holdert, H.M.C., 1990. Collision Cases Judgments and Diagrams, 2nd ed. Lloyd's of London Press Ltd., London.
Cahill, R.A., 2002. Collisions and their Causes, 3rd ed. The Nautical Institute, London.
Carmines, E.G., Zeller, R.A., 1979. Reliability and Validity Assessment. Quantitative Applications in the Social Sciences. Sage Publications Inc., Thousand Oaks, California.
COWI, 2011. BRISK – Sub-regional risk of spill of oil and hazardous substances in the Baltic Sea.
Drost, E.A., 2011. Validity and reliability in social science research. Educ. Res. Perspect. 38, 105–123.
Felski, A., Jaskolski, K., 2013. The integrity of information received by means of AIS during anti-collision manoeuvring. Trans. Nav. Int. J. Mar. Navig. Saf. Sea Transp. 7, 95–100.

Flage, R., Aven, T., 2009. Expressing and communicating uncertainty in relation to quantitative risk analysis (QRA). Reliab. Risk Anal. Theory Appl. 2, 9–18.
Fowler, T.G., Sørgård, E., 2000. Modeling ship transportation risk. Risk Anal. 20, 225–244.
Friis-Hansen, P., Simonsen, B.C., 2002. GRACAT: software for grounding and collision risk analysis. Mar. Struct. 15, 383–401.
Fujii, Y., Shiobara, R., 1971. The analysis of traffic accidents – studies in marine traffic accidents. J. Navig. 24, 534–543.
Fujii, Y., Tanaka, K., 1971. Traffic capacity. J. Navig. 24, 543–552.
Goerlandt, F., Kujala, P., 2011. Traffic simulation based ship collision probability modeling. Reliab. Eng. Syst. Saf. 96, 91–107.
Goerlandt, F., Ståhlberg, K., Kujala, P., 2012. Influence of impact scenario models on collision risk analysis. Ocean Eng. 47, 74–87.
Goodwin, E.M., 1975. A statistical study of ship domains. J. Navig. 28, 329–341.
Graveson, A., 2004. AIS – an inexact science. J. Navig. 57, 339–343.
Gucma, L., Marcjan, K., 2012. Examination of ships passing distances distribution in the coastal waters in order to build a ship probabilistic domain. Sci. J. Marit. Univ. Szczec. 32, 34–40.
Hogström, P., 2012. RoPax ship collision – a methodology for survivability analysis. Dissertations of the Chalmers University of Technology, Chalmers University of Technology, Gothenburg, Sweden.
IALA, 2009. IALA Recommendation O-134 on the IALA risk management tool for ports and restricted waterways.
IMO, 2007. Formal safety assessment – consolidated text of the guidelines for formal safety assessment (FSA) for use in the IMO rule-making process (MSC/Circ. 1023-MEPC/Circ. 392).
IMO, 2010. Degree of risk evaluation. SN.1/Circ. 296.
International Risk Governance Council, 2009. Risk governance deficits – an analysis and illustration of the most common deficits in risk governance.
Kaplan, S., Garrick, J.B., 1981. On the quantitative definition of risk. Risk Anal. 1, 11–27.
Kijima, K., Furukawa, Y., 2003. Automatic collision avoidance system using the concept of blocking area. In: Proceedings of the IFAC Conference on Manoeuvring and Control of Marine Craft, Girona, Spain.
Konovessis, D., Vassalos, D., 2007. Risk-based design for damage survivability of passenger ro-ro vessels. Int. Shipbuild. Prog. 54, 129–144.
Kukic, D., Lipovac, K., Pešic, D., Vujanic, M., 2013. Selection of a relevant indicator – road casualty risk based on final outcomes. Saf. Sci. 51, 165–177.
Li, S., Meng, Q., Qu, X., 2012. An overview of maritime waterway quantitative risk assessment models. Risk Anal. 32, 496–512.
MacDuff, T., 1974. The probability of vessel collisions. Ocean Ind., 144–148.
Montewka, J., Hinz, T., Kujala, P., Matusiak, J., 2010. Probability modelling of vessel collisions. Reliab. Eng. Syst. Saf. 95, 573–589.
Montewka, J., Krata, P., Goerlandt, F., Mazaheri, A., Kujala, P., 2011. Marine traffic risk modelling – an innovative approach and a case study. Proc. Inst. Mech. Eng. Part O J. Risk Reliab. 225, 307–322.
Montewka, J., Ehlers, S., Goerlandt, F., Hinz, T., Kujala, P., 2012a. A model for risk analysis of RoPax ships – the Gulf of Finland case. In: 11th International Probabilistic Safety Assessment and Management Conference and the Annual European Safety and Reliability Conference. Curran Associates Inc., New York, pp. 5544–5553.
Montewka, J., Goerlandt, F., Ehlers, S., Kujala, P., Erceg, S., Polic, D., Klanac, A., Hinz, T., Tabri, K., 2012b. A model for consequence evaluation of ship–ship collision based on Bayesian belief network. In: Sustainable Maritime Transportation and Exploitation of Sea Resources. Taylor & Francis Group, London.
Montewka, J., Goerlandt, F., Kujala, P., 2012c. Determination of collision criteria and causation factors appropriate to a model for estimating the probability of maritime accidents. Ocean Eng. 40, 50–61.
Montewka, J., Goerlandt, F., Kujala, P., 2013. On a risk perspective for maritime domain. J. Pol. Saf. Reliab. Assoc. 4, 101–108.
Mou, J.M., van der Tak, C., Ligteringen, H., 2010. Study on collision avoidance in busy waterways by using AIS data. Ocean Eng. 37, 483–490.
Pedersen, P.T., 1995. Collision and grounding mechanics. In: Proceedings of the Danish Society of Naval Architects and Marine Engineers, pp. 125–157.
Pedersen, P.T., 2010. Review and application of ship collision and grounding analysis procedures. Mar. Struct. 23, 241–262.
Qu, X., Meng, Q., Li, S., 2011. Ship collision risk assessment for the Singapore Strait. Accid. Anal. Prev. 43, 2030–2036.
Rae, A.J., Alexander, R., McDermid, J.A., 2012. The science and superstition of quantitative risk assessment. In: Proceedings of PSAM 11 & ESREL 2012. International Association of Probabilistic Safety Assessment and Management, IAPSAM, Helsinki, Finland.
Ruponen, P., 2007. Progressive flooding of a damaged passenger ship. TKK Dissertations, Helsinki University of Technology, Espoo, Finland.
Sheskin, D., 1997. Handbook of Parametric and Nonparametric Statistical Procedures. CRC Press, New York.
Ståhlberg, K., Goerlandt, F., Montewka, J., Kujala, P., 2012. Uncertainty in analytical collision dynamics model due to assumptions in dynamic parameters. Trans. Nav. Int. J. Mar. Navig. Saf. Sea Transp. 6, 47–54.
Ståhlberg, K., Goerlandt, F., Ehlers, S., Kujala, P., 2013. Impact scenario models for probabilistic risk-based design for ship–ship collision. Mar. Struct. 33, 238–264.
Stevens, S.S., 1946. On the theory of scales of measurement. Science 103, 677–680.
Suokas, J., 1985. On the Reliability and Validity of Safety Analysis. Technical Research Centre of Finland, Espoo, Finland.
Suokas, J., Kakko, R., 1989. On the problems and future of safety and risk analysis. J. Hazard. Mater. 21, 105–124.
Trochim, W., Donnely, J.P., 2008. The Research Methods Knowledge Base, 3rd ed. Atomic Dog Publishing.
USCG, 2012. Ports and waterways safety assessment methodology.
Van Dorp, J.R., Merrick, J.R., 2011. On a risk management analysis of oil spill risk using maritime transportation system simulation. Ann. Oper. Res., 249–277.
Van Iperen, E., 2012. Detection of hazardous encounters at the North Sea from AIS data. In: Proceedings of the International Workshop on Next Generation Nautical Traffic Models, Shanghai, China, pp. 1–12.
Vanem, E., Rusås, S., Skjong, R., Olufsen, O., 2007. Collision damage stability of passenger ships: holistic and risk-based approach. Int. Shipbuild. Prog. 54, 323–337.
Vinnem, J.E., 2010. Risk indicators for major hazards on offshore installations. Saf. Sci. 48, 770–787.
Wang, N., 2010. An intelligent spatial collision risk based on the quaternion ship domain. J. Navig. 63, 733–749.
Wang, N., Meng, X., Xu, Q., Wang, Z., 2009. A unified analytical framework for ship domains. J. Navig. 62, 643–655.
Weng, J., Meng, Q., Qu, X., 2012. Vessel collision frequency estimation in the Singapore Strait. J. Navig. 65, 207–221.
Zio, E., Apostolakis, G.E., 1996. Two methods for the structured assessment of model uncertainty by experts in performance assessments of radioactive waste repositories. Reliab. Eng. Syst. Saf. 54, 225–241.