An optimization method for learning statistical classifiers in structural reliability



Probabilistic Engineering Mechanics 25 (2010) 26–34


Jorge E. Hurtado ∗, Diego A. Alvarez
Universidad Nacional de Colombia, Apartado 127, Manizales, Colombia

Article info

Article history: Received 12 September 2008; Received in revised form 23 May 2009; Accepted 29 May 2009; Available online 9 June 2009

Keywords: Structural reliability; Monte Carlo simulation; Statistical classification; Particle swarm optimization; Support vector machines; Neural networks; Quasi-random numbers; Entropy

Abstract

Monte Carlo simulation is a general and robust method for structural reliability analysis, but it is affected by a serious efficiency problem: the limit state function must be computed a very large number of times. In order to reduce this computational effort, the use of several kinds of solver surrogates has been proposed in the recent past. Proposals include the Response Surface Method (RSM), Neural Networks (NN), Support Vector Machines (SVM) and several other methods developed in the burgeoning field of Statistical Learning (SL). Many of these techniques can be employed either for function approximation (regression approach) or for pattern recognition (classification approach). This paper concerns the use of these devices for discriminating samples into safe and failure classes using the classification approach, because this constitutes the core of Monte Carlo simulation as applied to reliability analysis. Due to the flexibility of most SL methods, a critical step in their use is the generation of the learning population, as it affects the generalization capacity of the surrogate. To this end it is first demonstrated that the optimal population from the information viewpoint lies in the vicinity of the limit state function. Next, an optimization method assuring a small as well as highly informative learning population is proposed on this basis. It consists in generating a small initial quasi-random population using the Sobol sequence for triggering a Particle Swarm Optimization (PSO) performed over an iteration-dependent cost function defined in terms of the limit state function. The method is evaluated using SVM classifiers, but it can be readily applied also to other statistical classification techniques, because the distinctive feature of the SVM, i.e. the margin band, is not actively used in the algorithm. The results show that the method yields estimates of the probability of failure that are in very close agreement with Monte Carlo simulation performed on the original limit state function, while requiring a small number of learning samples.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The basic problem of structural reliability can be defined as the estimation of the probability mass of a failure domain F defined by a limit state function g(x) of the set of basic random variables x of dimensionality d:

P_f = \int_F f_x(x) \, dx    (1)

where f_x(x) is the joint probability density function of the basic variables. The methods that have been proposed in the last decades to solve this problem can be grouped into two classes, namely those based on first- and second-order approximations of g(x) (FORM and SORM), on the one hand, and Monte Carlo simulation methods, on the other [1].

∗ Corresponding address: Civil Engineering, Universidad Nacional de Colombia, Apartado 127, 001 Manizales, Caldas, Colombia. Tel.: +57 6 8879300; fax: +57 6 8879334. E-mail address: [email protected] (J.E. Hurtado).

0266-8920/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.probengmech.2009.05.006

These latter methods require the sequential or parallel computation of the limit state function for a set of samples {x_1, x_2, \ldots, x_N} and computing the estimate of the failure probability as

\hat{P}_f = \frac{1}{N} \sum_{i=1}^{N} I[g(x_i) \le 0]    (2)

where I[·] is the indicator function of the event in the argument, equal to one if the event occurs and zero otherwise. In turn, Monte Carlo methods can be grouped into two categories. The first corresponds to methods that make use of the actual limit state function g(x), such as importance sampling [2], directional simulation [3], line sampling [4,5], subset simulation [6], and others. The second group corresponds to methods that employ a surrogate function, i.e. a simple function ĝ(x) that approximates g(x) in the region that is important for the reliability computation. The first proposals in this direction made use of the Response Surface Methods (RSM) developed in the field of Design of Experiments [7–12]. The surrogate function has the form

\hat{g}(x) = \sum_{i=1}^{m} w_i h_i(x)    (3)

in which h_i(x) are m power functions of the coordinates. This method is characterized by the following facts: On the one


hand, the functions h_i(x) are non-adaptive, in the sense that they are not indexed by the given samples of the experimental plan. As a consequence the approximation errors manifest themselves exclusively in the weights w_i, thus making the model very sensitive to the sample selection. On the other hand, the functions h_i(x) are non-flexible, as they are normally power functions of the coordinates x_j. Such functions have an infinite active support (meaning that the range of the variable x yielding values of the function h_i(x) greater than a small threshold is infinite), so that they propagate the computation errors over a wide range. These features explain the drawbacks of the application of the RSM in structural reliability reported in [13].

As an alternative, the building of surrogates drawing inspiration from

a different field, namely Statistical Learning [14–16], was proposed in the previous decade. The first proposals made use of Neural Networks (NN) [17–20], of which the most widely used are the Radial Basis Function Networks (RBFN), that fit functions of the form

\hat{g}(x) = \sum_{i=1}^{m} w_i h(x, x_i)    (4)

and the Multi-Layer Perceptrons (MLP), whose simplest form is

\hat{g}(x) = h_2\left( \sum_{k=0}^{M} w_k \, h_1\left( \sum_{i=0}^{d} w_{ki} x_i \right) \right)    (5)

where h_k(·), k = 1, 2 are linear or nonlinear functions and the w's are weights. Notice that in the RBFN the basis functions are indexed by the learning samples x_i, thus making the surrogate adaptive to the data. In addition, the functions normally have a small active support. In the MLP case, the functions h_k(·) are non-adaptive, but they do have small active support and the flexibility is greatly increased by adding new layers to the perceptron, which implies enlarging Eq. (5) with new functions h_k(·), k > 2.

The approximation of the limit state function performed by

these methods can be regarded as a regression approach. However, for estimating probabilities in a space divided by the limit state function g(x), an accurate approximation of it is not necessary, as only its sign matters for using Eq. (2). In fact, this equation can be put in the form

\hat{P}_f = \frac{1}{N} \sum_{i=1}^{N} c(x_i)    (6)

where c(x) is a binary classification code defined as

c(x) = \frac{1}{2}\left[ 1 − \mathrm{sgn}(g(x)) \right]    (7)

in which sgn(g(x)) is the sign of the limit state function:

s(x) = \mathrm{sgn}(g(x)) = \begin{cases} -1 & \text{if } g(x) \le 0 \\ +1 & \text{if } g(x) > 0. \end{cases}    (8)
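As a minimal illustration of Eqs. (6)–(8), the following Python sketch estimates the failure probability with the classification code c(x); the hyperplane limit state of Section 3.6 (for which P_f = Φ(−3) ≈ 0.00135) stands in for an expensive structural solver, and the names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    # Stand-in limit state function: hyperplane of Section 3.6,
    # g(x) = 3*sqrt(d) - sum(x_i); failure corresponds to g(x) <= 0.
    return 3.0 * np.sqrt(x.shape[1]) - x.sum(axis=1)

# Crude Monte Carlo with the classification code of Eqs. (6)-(8):
N, d = 200_000, 2
x = rng.standard_normal((N, d))     # samples of the basic variables
s = np.sign(g(x))                   # Eq. (8): +1 safe, -1 failure
s[s == 0] = -1                      # count g(x) = 0 as failure
c = 0.5 * (1.0 - s)                 # Eq. (7): c = 1 for failure samples
print(c.mean())                     # Eq. (6): estimate of P_f, about 1.35e-3
```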

This observation motivated the introduction of the classification (or pattern recognition) approach in the field of structural reliability using Neural Networks [19] and also another SL technique known as Support Vector Machines (SVM) [21–23], developed mainly for text classification and image analysis [24,25]. When used for classification, the main equation of this SL device is

s(x) = \mathrm{sgn}\left( \hat{g}(x) \right) = \mathrm{sgn}\left( \sum_{i=1}^{S} w_i K(x, x_i) − b \right)    (9)

where K(x, x_i) is a kernel function, w_i are weights, b is a constant playing the role of the bias in linear regression and S is the number of support vectors, i.e. the samples lying closest to the interclass boundary on both sides of it, which are obtained by solving a quadratic optimization problem and defining two ancillary functions known as margins. Thus, the boundary between the two classes corresponds to s(x) = 0, the margin in the safe class to s(x) = +1 and that in the failure class to s(x) = −1. Notice that, similarly to the RBFN approximation, the basis functions are indexed by the essential samples, i.e. the support vectors, thus giving adaptivity to the model. On the other hand, the flexibility of the SVM depends on the type of the kernel and its parameters. See [26] for the details.
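The structure of the decision function in Eq. (9) can be sketched directly; the support vectors, weights and bias below are made-up numbers for illustration, and the cubic polynomial kernel of Eq. (33) in Section 3 is used as K(x, y).

```python
import numpy as np

def svm_sign(x, support_vectors, weights, b, kernel):
    # Eq. (9): s(x) = sgn( sum_i w_i K(x, x_i) - b )
    k = np.array([kernel(x, xi) for xi in support_vectors])
    return np.sign(k @ weights - b)

kernel = lambda x, y: (np.dot(x, y) + 1.0) ** 3          # Eq. (33)
sv = np.array([[0.5, 1.2], [-0.3, 0.8], [1.1, -0.4]])    # hypothetical x_i
w = np.array([0.7, -1.0, 0.3])                           # hypothetical w_i
b = 0.1                                                  # hypothetical bias
print(svm_sign(np.array([0.0, 0.0]), sv, w, b, kernel))
```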

Neural Networks and Support Vector Machines can be used either under the regression approach or the classification one. However, since the regression equation is only used for estimating the sign of the limit state function, the present paper is limited to the classification approach exclusively.

These methods are but two among the many SL techniques developed recently for solving challenging new technological problems, such as those of text classification, image analysis, internet search, etc. Other methods worth mentioning are the Relevance Vector Machines [27], Bayes Point Machines [28] and Least Squares SVM [29]. Considering the intensity of the research in this field and the burgeoning publication of new SL proposals, fostered by the development of those technologies, it has been considered timely to investigate the problem of generating the learning samples vis-à-vis the use of SL techniques in a rather different application, namely structural reliability analysis, in a way that may be applicable in connection with any of these techniques. This is an issue on which little research has been reported.

However, it is a subject whose importance stems from the specific features of the application of SL methods to the structural reliability problem, which contrast with those characterizing its application to empirical observations of natural phenomena, for which most if not all SL methods have been developed. These features are the following [26]:

1. Unlike actual populations drawn from empirical observations, the probabilistic distribution of the Monte Carlo samples in structural reliability and, therefore, their entropy, are known beforehand.

2. The computation of each sample in structural reliability sometimes implies a high computational cost, so that it is important to reduce the number of training samples as much as possible.

3. The samples are artificial, so that they can be generated by virtually any method whatever. However, for the sake of computational cost minimization, this should ideally be done under the guidance of an optimization procedure intended to minimize their number without affecting the generalization capacity of the SL device.

4. The safe and failure classes are perfectly separable. This is in sharp contrast with most applications of Statistical Learning dealing with classes of natural samples, characterized by very noisy boundaries.

5. The dimensionality of the variable space is also known beforehand, which is not always the case in statistical applications for natural samples, which frequently exhibit an intrinsic dimensionality lower than the apparent one.

These important differences should be exploited for the sake of accuracy and efficiency when applying SL techniques within Monte Carlo simulation to the problem at hand. In this paper, with due regard to these features, a method for generating the population for learning statistical classifiers is proposed. To the authors' knowledge, the generation of the optimal learning population for the specific case of the classification approach to solving the structural reliability problem with random samples, using any of the large number of available methods for classification [30,14], has not been the object of specific research. Most methods use learning samples generated at random, samples derived with DoE (Design of Experiments) techniques, or simple random variates.


Fig. 1. Target learning population for statistical classifiers in standard Gaussian space. The dashed lines correspond to one-, two- and three-sigma levels of the Gaussian density function.

The proposed approach is based on the optimization of the information conveyed by the population, as defined by Information Theory in terms of entropy [31]. To produce such an optimal population, the maximization of the entropy is induced by solving an unconstrained optimization problem formulated in terms of the limit state function. It is demonstrated that the solution of this problem is entirely equivalent to the entropy maximization. In order to assure that the entropy is high from the initial step, a comparison among several proposals is carried out and it is found that the maximum entropy corresponds to Sobol quasi-random numbers. They are therefore used for triggering a sequential minimization program using Particle Swarm Optimization [32]. The application of the proposed methodology is illustrated with several examples.

2. Proposed approach

2.1. Optimization procedure

From the point of view of numerical methodologies for generating an optimal learning population, the third and fourth points of the above list of specific features of the reliability problem are the most important. In fact, if the classes were not perfectly separable and, consequently, the boundary were noisy, having a large number of samples close to the boundary of the two classes would result in an increase of the complexity of the boundary estimate in order to avoid learning classification errors, but the variance of the estimates would be large. If, on the contrary, a less complex function were required, some errors would have to be admitted when the boundary estimate is used with samples not employed in the learning phase; consequently, the variance of the estimates would be low but their bias error would be large. This is the typical bias-variance trade-off arising in SL applications with natural data [14]. In contrast, for artificial learning data and perfectly separable classes, this problem is of much less importance and it is only necessary to generate a small but tight population on both sides of the boundary, relying on the fact that there is necessarily a noise-free boundary separating them, and then to fit a function with a complexity as small as possible (see Fig. 1). The demonstration that this kind of population is the optimal one from the information viewpoint is given in Section 2.3.

Since in general the limit state function is only implicitly known in most structural reliability problems, it is necessary to develop an iterative procedure to reach the situation presented in Fig. 1. To this end, it is proposed to solve the following unconstrained optimization problem:

find: z
minimizing: C(z) = g^2(z)    (10)

where the vector variable z, defined in the same space as x, has been introduced for denoting the learning samples. The meaning of the preceding equation is that, since the squared limit state function has infinitely many local minima, a population such as that shown in Fig. 1 could be obtained by minimizing such a cost function with no restrictions whatever. In practice, this implies the application of gradient-free optimization methods, such as genetic algorithms, evolutionary strategies or artificial life techniques, in order to avoid gradient computations. These methods should be applied to an initial set of samples that after some iterations would densely populate the boundary on both sides.

Notice, however, that other alternatives for the cost function are

possible. For instance, a function defined in terms of the absolute value of the limit state function:

find: z
minimizing: C(z) = |g(z)|^{2\nu+1}    (11)

where ν ∈ N.
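For concreteness, the two cost functions of Eqs. (10) and (11) can be sketched as follows; this is a minimal illustration assuming a generic, user-supplied limit state function g, and the function names are hypothetical.

```python
import numpy as np

def cost_squared(z, g):
    # Eq. (10): C(z) = g(z)^2
    return g(z) ** 2

def cost_abs_power(z, g, nu=1):
    # Eq. (11): C(z) = |g(z)|^(2*nu + 1), with nu a natural number
    return np.abs(g(z)) ** (2 * nu + 1)

# Both costs are non-negative and vanish exactly on the limit state surface
# g(z) = 0, so every point of that surface is a global (hence local) minimum.
```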

2.2. Information maximization

As said in the Introduction, it is demonstrated in this paper that the population clustering shown in Fig. 1 is optimal from the point of view of its information content. The consideration of this aspect of random samples is common in applications of Statistics in Communication Technology, but it is rarely considered in applications of Monte Carlo simulation as a whole. For this reason some of the essential concepts of Information Theory (IT) [31,33] are briefly summarized next.

The information content of random samples depends on

the degree of surprise their occurrence provokes. Thus, an information function I(P) can be built with the simple rule that non-surprising events, i.e. those having a probability of occurrence P = 1, afford no information, while those of little probability are very informative. Consider two independent events A and B with probabilities P_A and P_B. Since the events are independent, their joint probability is the product P_{AB} = P_A P_B and the total information is I(P_A P_B). With respect to this term the following equality holds:

I(P_A P_B) − I(P_A) = I(P_B)    (12)

This means that the residual information on the joint occurrence of the events, once it is known that A has occurred, is simply the information that B has occurred, due to the independence of the events. (Notice the difference with the definition of conditional probability.) From this reasoning it can be inferred that

I(P^2) = 2 I(P)    (13)

and hence

I(P^q) = q I(P).    (14)

Denoting r = −ln(P), it results that P = (1/e)^r and, applying the previous equation, one obtains

I(P) = I\left[ (1/e)^r \right] = r \, I(1/e) = −I(1/e) \ln(P).    (15)

Defining the scale I(1/e) ≡ 1 the result is

I(P) = − ln(P). (16)


In other words, the informative content of a certain region of the space (the so-called self-information in IT) is equal to the negative of the logarithm of its probability of occurrence. Since I(P) → ∞ as P → 0 and I(P) → 0 as P → 1, this means that there is an infinite hierarchy amongst the regions of the random variable space from the information point of view.

On the other hand, for samples generated from the distribution

function of a random variable x, grouped in the bins defined by a partition X = {x_1, x_2, \ldots, x_N}, the bin probabilities P_j = n_j/N, j = 1, 2, \ldots, N, where n_j is the number of samples in each bin, are associated. The expected value of the information on the variable yielded by the population is the weighted average of the self-information values:

H(x, X) = −\sum_{j=1}^{N} P_j \ln P_j.    (17)

This average, called entropy, has been used as a measure of both the disorder of the samples and of their informative content. For our present purposes the second interpretation is more relevant, as shown in what follows.

A distinguishing feature of SL classifiers is their high adaptivity

to the samples used for the learning. For this reason it is important that the informative content of the training population, as given by its entropy, be as large as possible in order to ensure a good generalization capacity. This is due to the fact that a low entropy learning set corresponds to samples that in some regions are close to each other, thus representing small surprises and increasing the computational cost. On the contrary, large entropy populations are more spread in the learning space and, therefore, are more informative. This is mathematically established by a theorem expressing that, for a given partition, the maximum entropy corresponds to the case when the samples are equally likely [33]. From the point of view of probability models, this theorem could be interpreted as meaning that the ideal initial population for triggering the optimization algorithm described above should be obtained by sampling the uniform distribution in the d-dimensional space. However, this task is affected by the so-called curse of dimensionality. Nevertheless, since the probability law of such an initial population is of no relevance, it is advisable to produce it in a model-free fashion, having entropy maximization as the leading criterion. Taking also into account that this population should be small for computational cost reasons, these conditions are met by the so-called low discrepancy quasi-random numbers. In essence, these are numbers featuring less randomness than usual ones, but having a tendency to sample the space more uniformly [34–36]. For this reason they may produce populations with a larger entropy than that obtained by sampling a uniform distribution.

In order to test this hypothesis, the algorithms known under the

names of their inventors, namely Faure, Halton, Niederreiter and Sobol, were used to generate populations of N = 100 samples in the range [0, 1] in each dimension, varying the dimensionality in the range from 2 to 20. Then the corresponding entropy was computed as

H(x_i) = −\sum_{j=1}^{100} \frac{n_j}{N} \ln\left( \frac{n_j}{N} \right)    (18)

where n_j is the number of samples located in the jth bin of the projection onto the x_i axis. In addition, twenty populations of 100 uniformly distributed samples were generated for each dimensionality and the maximum entropy among them was computed. The results are displayed in Fig. 2. It can be seen that both Sobol and Niederreiter outperform the uniform distribution for all dimensions.

Fig. 2. Entropy of quasi-random numbers compared to the maximum entropy among 20 uniform populations in several dimensions.

The entropy of Halton numbers is higher than that of the Niederreiter ones for two and three dimensions, but it continuously degrades as the dimensionality increases. In all dimensions the Sobol numbers exhibited the highest entropy. For this reason they were selected for generating the initial population in the optimization approach exposed below.
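This comparison can be reproduced in outline with the following sketch. It assumes SciPy's quasi-Monte Carlo module for the Sobol sequence (the paper does not state the software used), and the binning follows Eq. (18).

```python
import numpy as np
from scipy.stats import qmc

def marginal_entropy(samples_1d, n_bins=100):
    # Eq. (18): entropy of one coordinate, binned on [0, 1]
    n_j, _ = np.histogram(samples_1d, bins=n_bins, range=(0.0, 1.0))
    p = n_j[n_j > 0] / samples_1d.size
    return -np.sum(p * np.log(p))

d, N = 10, 100
sobol = qmc.Sobol(d=d, scramble=False).random(N)
uniform = np.random.default_rng(0).random((N, d))

print("Sobol  :", np.mean([marginal_entropy(sobol[:, j]) for j in range(d)]))
print("Uniform:", np.mean([marginal_entropy(uniform[:, j]) for j in range(d)]))
# The Sobol marginals typically spread the 100 samples over more bins and
# therefore yield a larger entropy than the pseudo-random uniform sample.
```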

2.3. Demonstration of optimality

The justification for the proposal of generating the learning population by solving the optimization program (10) or (11) lies in the very nature of the reliability problem when regarded as a classification task. In fact, the probabilistic analysis of this task using Bayes' theorem concludes in the definition of a discrimination function whose structure illustrates why the solution of the unconstrained problem (10) or (11) is optimal in the information sense. To do this it is necessary to recall some fundamental concepts of statistical classification [14–16].

Consider two classes C_1 and C_2, defining a partition Z. Given

some samples z of a random variable z distributed randomly among the classes, it is possible to estimate prior probabilities P(C_1) and P(C_2) describing such distributions. If no other information is available, the decision rule leading to the minimum classification error is

r(z) = \begin{cases} 0 & \text{if } P(C_1) < P(C_2) \\ 1 & \text{otherwise.} \end{cases}    (19)

However, when a new sample z is added to the population, new information is available and then it is more rational to shift the decision rule to

r(z) = \begin{cases} 0 & \text{if } P(C_1|z) < P(C_2|z) \\ 1 & \text{otherwise.} \end{cases}    (20)

In words, the decision rule is now expressed in terms of the posterior probabilities. Bayes' theorem states that

P(C_k|z) = \frac{P(z|C_k)\, P(C_k)}{P(z)}    (21)

for k = 1, 2, where P(z|C_k) is the probability distribution function of z conditional on its location in class C_k and P(z) its unconditional value. This yields

r(z) = \begin{cases} 0 & \text{if } \dfrac{P(z|C_1)}{P(z|C_2)} > \dfrac{P(C_2)}{P(C_1)} \\ 1 & \text{otherwise} \end{cases}    (22)


or equivalently

r(z) = \begin{cases} 0 & \text{if } P(z|C_1)P(C_1) − P(z|C_2)P(C_2) > 0 \\ 1 & \text{otherwise.} \end{cases}    (23)

This result can be presented as

r(z) = \begin{cases} 0 & \text{if } h(z) > 0 \\ 1 & \text{otherwise.} \end{cases}    (24)

with

h(z) = P(z|C_1)P(C_1) − P(z|C_2)P(C_2).    (25)

Eq. (22) indicates that the actual values of the class probabilities

are of no importance and that only their relative values matter. The function h(z) represents a map from the d-dimensional space to the one-dimensional real line, constituting a so-called sufficient statistic for performing the classification. Notice that the function so defined behaves closely like the limit state function g(z), in that it is positive for most values of z and negative for the rest. In fact, the rule (24) can be equally applied with the limit state function g(z) itself or with any monotonic transformation thereof without affecting the decision.

However, let us preserve the definition of h(z) given by Eq. (25)

and set it in the form

h(z) = P(C_1|z) − P(C_2|z).    (26)

Fig. 3 illustrates adequate models, in the one-dimensional case, for the two terms of this function when there is either (a) a noisy boundary between non-perfectly separable classes, typically occurring in pattern recognition applications with data taken from observations and experiments, or (b) a discrimination function for perfectly separable classes whose trace is uncertain because it is only implicitly known. The latter is the case in structural reliability applications. In the figure the value z_0 defines the interclass boundary.

in such a way that the samples located far from the boundary havea constant value P(Ck|z) equal to zero or to one, as is obvious. Bydoing this, the really informative samples will be those lying in thevicinity of the decision threshold at both sides. In fact, the entropyof this simple Bernoulli trial case equalsH(z,Z) = −Q (z) lnQ (z)− (1− Q (z)) ln(1− Q (z)) > 0 (27)where Q (z) = P(C1|z) and, obviously, P(C2|z) = 1− Q (z). Fig. 4shows a schematic representation of this entropy as a function of zin the one-dimensional case. It is evident that for a sample lyingfar from the boundary, its entropy is H(z,Z) ≈ −1 ln 1 = 0,while at the intersection of the two curves in Fig. 3 the entropyreaches its maximum, meaning that the interclass boundary isthe most informative region. Besides, the symmetrical structureof the functions in Figs. 3 and 4 suggests that there should alsobe a symmetry in the distribution of the samples in both classes,which is a goal that is approximately achieved with the proposedunconstrained optimization strategy (Eq. (10) or (11)), as is evidentin Fig. 1.Further insight into this problem can be obtained by calculating

the sensitivity of h(z) with respect to z. It equals

\frac{dh(z)}{dz} = p(z|C_1)P(C_1) − p(z|C_2)P(C_2)    (28)

where p(z|C_k) is the probability density function of z conditional on its location in class C_k. The shape of this sensitivity is also illustrated in Fig. 4. This means that the discrimination function is insensitive to samples located far from the boundary.

The preceding analysis sufficiently shows that for approximating implicit limit state functions with classification approaches, the learning population should be concentrated in the vicinity of the limit state function, as this maximizes both the entropy (i.e. the information worth of the population) and the sensitivity of the discrimination function h(z).
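The entropy argument of Eq. (27) is easy to verify numerically. The following sketch assumes a smooth logistic model for Q(z) = P(C_1|z) around a hypothetical boundary z_0; the model and the numbers are illustrative, not taken from the paper.

```python
import numpy as np

z0 = 0.0                                      # hypothetical interclass boundary
z = np.linspace(-4.0, 4.0, 9)
Q = 1.0 / (1.0 + np.exp(-4.0 * (z - z0)))     # assumed model of P(C1|z)

# Eq. (27): Bernoulli entropy of the class membership at each sample position
eps = 1e-12
H = -(Q * np.log(Q + eps) + (1.0 - Q) * np.log(1.0 - Q + eps))

for zi, Hi in zip(z, H):
    print(f"z = {zi:+.1f}   H = {Hi:.4f}")
# H is close to zero far from z0 and maximal (ln 2) at z = z0, i.e. the samples
# lying near the boundary are the most informative ones.
```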

Fig. 3. Class-conditional probability models used in statistical classification.

Fig. 4. Entropy and discrimination sensitivity as a function of sample position.

2.4. Implementation

The direct implementation of the optimization equations (10) or (11) is hindered by the fact that the gradient of C(z) may present large variations in the z-space. As a consequence, the application of any of the above mentioned optimization procedures would lead the samples to regions of low gradient, because the values of C(z) for the samples in them present smaller differences than the corresponding ones in regions of large gradient. This could be remedied by using an iteration-dependent cost function. However, since the cost function thus becomes more or less arbitrary, the application of this approach is more efficient if the specific features of the optimization algorithm are taken into account. In this respect, reference is made to a benchmark comparison among five gradient-free optimization algorithms with biological inspiration (Genetic Algorithms, Evolutionary Algorithms, Memetic Algorithms, Ant-Colony Optimization and Particle Swarm Optimization) reported in [37]. The authors conclude that Particle Swarm Optimization (PSO) [32] performs better than the other methods in terms of success rate and solution quality. It also features easy implementation, good memory, feedback capabilities, a small number of parameters and robustness. For this reason it has been selected for this research.

In essence, unlike genetic and

evolutionary algorithms, which are inspired by the Darwinian theory of the evolution of species, the PSO technique imitates the associative motion of animal communities (such as flocks of birds, colonies of ants, etc.), which solve their needs upon the basis of a social communication of the individual findings, in such a way that the cohesion of the group is maintained.


Fig. 5. Components of Particle Swarm Optimization.

The algorithm makes the individuals move in a random direction that takes into account the best position obtained in the history of the communal survey, while preserving some autonomy of the individual as represented by the memory of its own best findings.

Consider several structural models identified as particles in the

z-design space. The motion of the ith individual particle in iteration k + 1 to a new position z_i^{[k+1]} is [32]

z_i^{[k+1]} = z_i^{[k]} + V_i^{[k+1]}    (29)

where V_i^{[k+1]} is the so-called velocity, composed of three vectors (see Fig. 5):

V_i^{[k+1]} = N_i^{[k+1]} + G_i^{[k+1]} + P_i^{[k+1]}.    (30)

The vectors are respectively associated with the following behavioral observations:

• N_i^{[k+1]} = w V_i^{[k]}: the inertia of the particle's motion. It is recommended to initiate the optimization process with a large value of the inertia factor w and to decrease it as the solution advances.
• G_i^{[k+1]} = U_1 c_1 (B^{[k]} − z_i^{[k]}): the experience of the group. Here B^{[k]} is the best historical position among all the particles up to the kth iteration, i.e. the position showing the minimum value of the cost function. On the other hand, c_1 is a constant known as the social parameter and U_1 is a uniform random number in the range [0, 1].
• P_i^{[k+1]} = U_2 c_2 (b_i^{[k]} − z_i^{[k]}): the experience of each individual. Here b_i^{[k]} is the best historical position of the ith particle up to the kth iteration, c_2 is a constant known as the cognitive parameter and U_2 is a uniform random number in the range [0, 1].
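A compact sketch of this update (Eqs. (29) and (30)) is given below; the parameter values are illustrative assumptions rather than the settings used by the authors, and the random factors are drawn per component.

```python
import numpy as np

rng = np.random.default_rng(2)

def pso_step(z, v, best_own, best_global, w=0.7, c1=1.0, c2=2.0):
    # One PSO update per Eqs. (29)-(30); z and v hold one row per particle.
    n = w * v                                               # inertia term N
    g = rng.uniform(size=z.shape) * c1 * (best_global - z)  # group term   G
    p = rng.uniform(size=z.shape) * c2 * (best_own - z)     # own term     P
    v_new = n + g + p                                       # Eq. (30)
    return z + v_new, v_new                                 # Eq. (29)
```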

Upon the basis of this brief summary of the PSO technique, the following comments concerning its application to the problem at hand are in order. First, notice that the goal is finding many local minima instead of a single one, which could be the purpose, for instance, if the algorithm were applied for determining the design point, as in [38]. For this situation, the published experience on PSO applications indicates that a large inertia factor w gives impulse to find global solutions while a small factor facilitates finding local minima. It is therefore recommended to decrease it gradually as the iteration progresses. Second, to this very goal it is also advisable to make the social parameter lower than the cognitive one, i.e. c_1 < c_2, in order to favor the participation of the particle's best position in the sum given by Eq. (30), because otherwise the algorithm tends to find a global optimum, which in this case is of little value.

Fig. 6. Iteration-dependent cost function and particle evolution.

Taking into account the variations of the gradient of the function g^2(z) in the z-space, it is proposed to minimize the following cost function with the PSO algorithm:

find: z
minimizing: C(z) = g^{\nu}(z)    (31)

where ν equals 10, 8, 6, 4 for iterations k = 1, 2, 3, 4 and ν = 2 for k > 4. In words, as the optimization advances, a decreasing even power is used until reaching the second power.

For the sake of clarity, the purpose of this device is illustrated

in Fig. 6 for the cases ν = 10, 6, 2. Notice that for large powers the gradient of the cost function is very high in regions far from the boundary, but a flat, wide valley forms around it. For low powers the situation is the opposite. For this reason, the introduction of a high power in the first iterations induces good values of the best positions b_i^{[k]} in the particle's memory and makes it rapidly forget positions that are very far from the boundary. As the optimization advances, the power of the limit state function is decreased in order to concentrate the samples in the close vicinity of the function, which is the advantage of low powers over high ones, as shown by the figure.

Summarizing, the algorithm for applying the proposed approach comprises the following steps:

1. Generate an initial population with Sobol numbers as follows:

x_{ij}^{[0]} = a_{ij} + (b_{ij} − a_{ij}) s_{ij}, \quad i = 1, 2, \ldots, M; \; j = 1, 2, \ldots, d.    (32)

Here M is the number of particles, d the number of dimensions, s_{ij} is a Sobol number defined in the unit hyper-cube and a_{ij}, b_{ij} are respectively the left and right coordinates in the jth dimension of the extremes of the hyper-box in which the optimization algorithm is applied. In this respect it is important to take into account that the high flexibility and adaptivity of SL classifiers may induce fake twists and loops if trained with a narrow population, thus inducing misclassification of some samples when the classifier is used as a solver surrogate. For this reason it is recommended to define a box with extremes beyond the maximum likely values of each variable. For instance, in the standard Gaussian space a suitable box is composed with extremes a_{ij} ≤ −7, b_{ij} ≥ 7.

2. Apply the PSO optimization technique for generating a new population according to the cost function given by Eq. (31). At each iteration of this procedure, fit an SL classifier and estimate the failure probability with such a surrogate.

3. Stop when the failure probability estimate reaches a stationary value.
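A condensed sketch of these three steps is given below. It is an illustration under stated assumptions rather than the authors' implementation: SciPy's Sobol generator stands in for Eq. (32), the PSO update of Eqs. (29)-(30) is applied in a simplified per-component form, the iteration-dependent cost follows Eq. (31), and a scikit-learn SVC with the cubic polynomial kernel of Eq. (33) plays the role of the SL classifier. The limit state function (the hyperplane of Section 3.6 with d = 2) and all parameter values are illustrative choices.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.svm import SVC

rng = np.random.default_rng(3)

def g(x):
    # Cheap stand-in for the structural solver: hyperplane of Section 3.6,
    # for which P_f = Phi(-3) ~ 1.35e-3 regardless of d.
    return 3.0 * np.sqrt(x.shape[1]) - x.sum(axis=1)

# Step 1: initial population from Sobol numbers in a wide box, Eq. (32).
M, d, a, b = 60, 2, -7.0, 7.0
z = a + (b - a) * qmc.Sobol(d=d, scramble=False).random(M)
v = np.zeros_like(z)
best_own, best_own_cost = z.copy(), np.full(M, np.inf)
best_glob, best_glob_cost = z[0].copy(), np.inf

Z_hist, g_hist = [], []                    # every evaluated sample is reused for training
x_mc = rng.standard_normal((100_000, d))   # Monte Carlo population for Eq. (6)
pf_old = None

for k in range(1, 21):
    g_val = g(z)
    Z_hist.append(z.copy())
    g_hist.append(g_val)

    # Iteration-dependent cost, Eq. (31): nu = 10, 8, 6, 4 and then 2.
    nu = {1: 10, 2: 8, 3: 6, 4: 4}.get(k, 2)
    cost = g_val ** nu

    improved = cost < best_own_cost
    best_own[improved], best_own_cost[improved] = z[improved], cost[improved]
    i = int(np.argmin(best_own_cost))
    if best_own_cost[i] < best_glob_cost:
        best_glob, best_glob_cost = best_own[i].copy(), best_own_cost[i]

    # Step 2a: PSO update, Eqs. (29)-(30), with decreasing inertia and c1 < c2.
    w, c1, c2 = max(0.9 - 0.05 * k, 0.4), 1.0, 2.0
    v = (w * v + rng.uniform(size=z.shape) * c1 * (best_glob - z)
               + rng.uniform(size=z.shape) * c2 * (best_own - z))
    z = z + v

    # Step 2b: fit the SVM classifier (cubic polynomial kernel, Eq. (33)) on all
    # samples evaluated so far and estimate the failure probability, Eq. (6).
    Z_train = np.vstack(Z_hist)
    labels = np.where(np.concatenate(g_hist) > 0, 1, -1)
    svm = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1e6).fit(Z_train, labels)
    pf = np.mean(svm.predict(x_mc) == -1)

    # Step 3: stop when the estimate becomes stationary.
    if k >= 5 and pf_old is not None and abs(pf - pf_old) <= 0.05 * pf:
        break
    pf_old = pf

print(f"iterations: {k}, function calls: {M * k}, Pf estimate: {pf:.5f}")
```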


Fig. 7. Example 3.1: Initial population based on Sobol numbers.

3. Application examples

In this section the application of the proposed method is illustrated with some selected examples. In all cases use was made of the SVM classification method (see Eq. (9)) with a polynomial kernel of the third degree, whose equation is [24]:

K(x, y) = (⟨x, y⟩ + 1)^3    (33)

where ⟨·, ·⟩ stands for the inner product of the vectors in the argument. In previous research on this classifier, use is made of its distinguishing feature, namely the margin equations

\sum_{i=1}^{S} w_i K(x, x_i) − b = ±1    (34)

for generating new learning samples [21,23]. Since the aim of the present paper is to propose a general method for training SL classifiers, resort to these equations is entirely avoided and the SVM method is only used as one possibility of SL classification among many others, for which the reader is referred to the key references [24,27,25,30,14,29,28,39]. A benchmark comparison amongst these techniques is beyond the aims of the present paper.
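For reference, the kernel of Eq. (33) corresponds to the following configuration in a generic SVM library; scikit-learn is used here only as an illustration, since the paper does not state the software employed.

```python
from sklearn.svm import SVC

# (gamma * <x, y> + coef0) ** degree  ->  (<x, y> + 1) ** 3, i.e. Eq. (33)
classifier = SVC(kernel="poly", degree=3, gamma=1.0, coef0=1.0, C=1e6)
# The large penalty C is an illustrative choice exploiting the fact that the
# safe and failure classes are perfectly separable, so hardly any slack is needed.
```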

3.1. A two-dimensional function with multiple design points

This function has been selected in order to illustrate in detail the features of the proposed method. It is defined as [26]

g(x_1, x_2) = −3.8 + \exp(x_1 − 1.7) − x_2 = 0

where x_1, x_2 are independent standard Gaussian variables. This is a function showing several candidate design points, which makes difficult the application of reliability techniques based on that concept. Fig. 7 shows the initial M = 60 training samples calculated with Sobol numbers by means of Eq. (32). A box with well separated extreme points (a_{ij} = −7, b_{ij} = 7, i = 1, 2, \ldots, M; j = 1, 2) was used in order to show the possibility of the loops and twists mentioned above. These appear in Fig. 8, corresponding to iteration No. 5. Notice that the goal of obtaining a tight population surrounding the limit state function, as illustrated by Fig. 1, is progressively reached. Such a clustering can also be noted in Fig. 9, which displays the values of the limit state function for all the samples. Notice how the samples rapidly converge to locations for which the limit state function oscillates about zero. On the other hand, Fig. 10 shows the rapid decrease of the cost function given by Eq. (31) corresponding to the initial Sobol population and to iterations No. 5 and 15 (notice the logarithmic scale).

Fig. 8. Example 3.1: Samples and classifier at iteration No. 5.

Fig. 9. Example 3.1: Convergence of the samples to the limit state function.

In Fig. 11 the evolution of the failure probability estimates along the PSO iterations is shown. It can be seen that the values stabilize after iteration No. 4. Since the number of g(x)-function calls is given by MK, where K is the number of iterations needed for such a stabilization, it is clear that the computational effort required to solve this case is low. For the classification shown in Fig. 12, the SVM calculated after 6 iterations was used, implying a total of 360 function calls. The failure probability estimate is 0.00235, whereas the value obtained with the actual function is 0.0024. Finally, notice in this figure the appearance of fake twists, which justify the use of a wide box for starting the optimization.

3.2. A highly concave function

The limit state function is given by [40]

g(x_1, x_2) = 3 − x_2 + (4x_1)^4.

This function presents a high concavity towards the origin, making failure events very rare. The algorithm was triggered with 60 Sobol points. At 12 iterations the stationary value of the probability of failure was equal to 1.77 × 10^{−4}. It is in fairly good agreement with the exact solution (1.8 × 10^{−4}). The number of function calls required to reach stationarity was 720.
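The reference value 1.8 × 10^{−4} can be checked by a brute-force Monte Carlo run on the actual function, as in the following sketch; the sample size is an arbitrary choice, large enough to resolve a probability of this order.

```python
import numpy as np

rng = np.random.default_rng(4)
n_batches, batch = 20, 1_000_000   # 2e7 samples in total, processed in batches
failures = 0
for _ in range(n_batches):
    x = rng.standard_normal((batch, 2))
    g = 3.0 - x[:, 1] + (4.0 * x[:, 0]) ** 4
    failures += np.count_nonzero(g <= 0)
print(failures / (n_batches * batch))   # ~1.8e-4
```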

3.3. A two-dimensional series system with multiple failure points

This function is given by [40]

g_1(x_1, x_2) = 2 − x_2 + \exp(−0.1 x_1^2) + (0.2 x_1)^4


Fig. 10. Example 3.1: Snapshots of the variation of the cost function in three iterations.

Fig. 11. Example 3.1: Evolution of the failure probability.

Fig. 12. Example 3.1: Classification of 10^5 samples with the classifier obtained in iteration No. 6.

g_2(x_1, x_2) = 4.5 − x_1 x_2
g(x_1, x_2) = \min(g_1, g_2)

in which the variables are standard independent Gaussian. For reaching a stationary value P_f = 0.0036 with a c.o.v. = 0.074, the method required 556 function calls.

Table 1
Example 3.6 — Probability estimates.

d     Learning samples     P_f
10    450                  0.0012
15    450                  0.0014
20    600                  0.0014
25    750                  0.0012
30    750                  0.0014

3.4. A three-dimensional series system

The function is [41]

g_1(x) = −x_1 − x_2 − x_3 + 3\sqrt{3}
g_2(x) = 3 − x_3
g(x) = \min(g_1, g_2)

where the variables are standard independent Gaussian. Being a parallel system, the algorithm was initiated with a large number of samples, namely 200. The stationary value of the probability of failure, equal to 0.000138, was obtained at iteration No. 5, i.e. implying 1000 function calls. To this end use was made of 10^6 random numbers. The estimate compares well with the exact value (0.000140) obtained with the same number of samples.

3.5. A five-dimensional parallel system

This function is given by [42]

g_1(x) = 2.677 − x_1 − x_2
g_2(x) = 2.5 − x_2 − x_3
g_3(x) = 2.323 − x_3 − x_4
g_4(x) = 2.25 − x_4 − x_5
g(x) = \max(g_1, g_2, g_3, g_4)

in which the variables are again independent, standard Normal. The stationary failure probability 1.9 × 10^{−4} was obtained with 10 iterations comprising 400 samples each, for a total of 4000 function calls, which is a very low number. The exact value of the probability is 2 × 10^{−4}.

3.6. A function with failure probability independent of dimensionality

The function in this case is given by [13]

g(x) = 3\sqrt{d} − \sum_{i=1}^{d} x_i

where the variables are again independent, standard Normal. To this function corresponds a failure probability equal to 0.00135, regardless of the dimensionality. Therefore, it is useful for testing the robustness of the proposed approach with respect to the dimensionality.

The problem was solved for dimensions in the range [10, 30].

The results are shown in Table 1. It can be seen that the probability of failure is accurately estimated and that the rate of increase of the number of samples needed for training is slower than the corresponding rate of the dimensionality.

4. Conclusions

In recent years a significant development in the field of Statistical Learning has occurred, fostered by the research needs in fields such as image analysis, text classification and pattern recognition. These methods have found application in the field of structural reliability, because of the natural interpretation of the reliability problem as a two-class problem. It has therefore been judged timely to investigate the optimal training population


for using statistical learning classification methods in the specific field of structural reliability analysis and, consequently, to propose a general method for producing it. To accomplish this goal, by means of an analysis of the Bayesian interpretation of the classification task from the vantage point of Information Theory, it has been demonstrated that the optimal learning population for any classification method (from the point of view of its informative worth) lies in the vicinity of the limit state function. On this basis a numerical algorithm to accomplish such a generation has been presented. In this proposal, the learning samples are produced as the trial solutions of an unconstrained minimization problem formulated on nonlinear transformations of the limit state function. An iteration-dependent cost function is minimized using the Particle Swarm Optimization method, which, besides being easy to apply, has been found superior to other gradient-free optimization techniques. The examples carried out with a Statistical Learning method known as Support Vector Machines demonstrate the excellent features of the proposed approach in terms of accuracy and computational labor.

Having demonstrated that the optimal learning population for

statistical classifiers lies around the limit state function, use has been made in the present paper of a particular numerical strategy, namely solving an unconstrained optimization problem with quadratic cost functions. Other alternatives are, however, possible, such as the use of different cost functions, optimization techniques and even other numerical approaches, such as perturbation techniques, for instance. Notice also that the efficiency of the method, in terms of the number of samples needed for estimating the failure probability, depends on the numerical approach and the classification device used, but not on the classification approach as such, a consideration that naturally leads to the proposed learning population structure. For this reason it is important to investigate alternative ways of generating such an optimal population. Research in this direction is presently being done by the authors.

Acknowledgement

Financial support for the realization of the present research has been received from the Universidad Nacional de Colombia. The support is gratefully acknowledged.

References

[1] Melchers RE. Structural reliability: Analysis and prediction. Chichester: John Wiley and Sons; 1999.
[2] Engelund S, Rackwitz R. A benchmark study on importance sampling techniques in structural reliability. Structural Safety 1993;12:255–76.
[3] Ditlevsen O, Madsen HO. Structural reliability methods. Chichester: John Wiley and Sons; 1996.
[4] Koutsourelakis PS, Pradlwarter HJ, Schuëller GI. Reliability of structures in high dimensions, Part I: Algorithms and applications. Probabilistic Engineering Mechanics 2004;19:409–17.
[5] Pradlwarter HJ, Schuëller GI, Koutsourelakis PS, Charmpis DC. Application of line sampling simulation method to reliability benchmark problems. Structural Safety 2007;29:208–21.
[6] Au SK, Beck JL. Estimation of small failure probabilities in high dimensions by subset simulation. Probabilistic Engineering Mechanics 2001;16:263–77.
[7] Schuëller GI, Bucher CG, Bourgund U, Ouypornprasert W. On efficient computational schemes to calculate failure probabilities. Probabilistic Engineering Mechanics 1989;4:10–8.
[8] Faravelli L. Response-surface approach for reliability analysis. Journal of Engineering Mechanics 1989;115:2763–81.
[9] Bucher C. A fast and efficient response surface approach for structural reliability problems. Structural Safety 1990;7:57–66.
[10] Rajashekhar MR, Ellingwood BR. A new look at the response surface approach for reliability analysis. Structural Safety 1993;12:205–20.
[11] Gupta S, Manohar CS. An improved response surface method for the determination of failure probability and importance measures. Structural Safety 2004;26:123–39.
[12] Breitung K, Faravelli L. Response surface methods and asymptotic approximations. In: Casciati F, Roberts B, editors. Mathematical models for structural reliability analysis. Boca Ratón: CRC Press; 1996. p. 227–86.
[13] Guan XL, Melchers RE. Effect of response surface parameter variation on structural reliability estimates. Structural Safety 2001;23:429–44.
[14] Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. New York: Springer Verlag; 2001.
[15] Bishop CM. Neural networks for pattern recognition. Oxford: Oxford University Press; 1995.
[16] Cherkassky V, Mulier F. Learning from data. New York: John Wiley and Sons; 1998.
[17] Papadrakakis M, Papadopoulos V, Lagaros ND. Structural reliability analysis of elastic–plastic structures using neural networks and Monte Carlo simulation. Computer Methods in Applied Mechanics and Engineering 1996;136:145–63.
[18] Chapman OJ, Crossland AD. Neural networks in probabilistic structural mechanics. In: Sundararajan C, editor. Probabilistic structural mechanics handbook. New York: Chapman & Hall; 1995. p. 317–30.
[19] Hurtado JE, Alvarez DA. Neural network-based reliability analysis: A comparative study. Computer Methods in Applied Mechanics and Engineering 2001;191:113–32.
[20] Hurtado JE. Analysis of one-dimensional stochastic finite elements using neural networks. Probabilistic Engineering Mechanics 2002;17:35–44.
[21] Hurtado JE, Alvarez DA. A classification approach for reliability analysis with stochastic finite element modeling. Journal of Structural Engineering 2003;129:1141–9.
[22] Hurtado JE. An examination of methods for approximating implicit limit state functions from the viewpoint of statistical learning theory. Structural Safety 2004;26:271–93.
[23] Hurtado JE. Filtered importance sampling with support vector margin: A powerful method for structural reliability analysis. Structural Safety 2007;29:2–15.
[24] Vapnik VN. Statistical learning theory. New York: John Wiley and Sons; 1998.
[25] Schölkopf B, Smola A. Learning with kernels. Cambridge: The MIT Press; 2002.
[26] Hurtado JE. Structural reliability. Statistical learning perspectives. Heidelberg: Springer; 2004.
[27] Tipping ME. The relevance vector machine. In: Solla SA, Leen TK, Müller KR, editors. Advances in neural information processing systems. Cambridge: The MIT Press; 2000. p. 652–8.
[28] Herbrich R. Learning kernel classifiers. Cambridge: The MIT Press; 2002.
[29] Suykens JAK, Van Gestel T, De Brabanter J, De Moor B, Vandewalle J. Least squares support vector machines. Singapore: World Scientific; 2002.
[30] Duda RO, Hart PE, Stork DG. Pattern classification. New York: John Wiley and Sons; 2001.
[31] Reza FM. An introduction to information theory. New York: Dover Publications; 1994.
[32] Kennedy J, Eberhart RC. Swarm intelligence. San Francisco: Morgan Kaufmann; 2001.
[33] Papoulis A. Probability, random variables and stochastic processes. New York: McGraw-Hill; 1991.
[34] Ripley BD. Stochastic simulation. New York: John Wiley and Sons; 1987.
[35] Bratley P, Fox BL, Schrage LE. A guide to simulation. New York: Springer Verlag; 1987.
[36] Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical recipes in FORTRAN. Cambridge: Cambridge University Press; 1992.
[37] Elbeltagi E, Hegazy T, Grierson D. Comparison among five evolutionary-based optimization algorithms. Advanced Engineering Informatics 2005;19:43–53.
[38] Elegbede C. Structural reliability assessment based on particle swarm optimization. Structural Safety 2005;27:171–86.
[39] Vidyasagar M. Learning and generalization. London: Springer Verlag; 2003.
[40] Au SK, Beck JL. A new adaptive importance sampling scheme for reliability calculations. Structural Safety 1999;21:135–58.
[41] Nie J, Ellingwood BR. Directional methods for structural reliability analysis. Structural Safety 2000;22:233–49.
[42] Grooteman F. Adaptive radial-based importance sampling method for structural reliability. Structural Safety 2008;30:533–42.