ORIGINAL ARTICLE
Structure optimization of neural network for dynamic system modeling using multi-objective genetic algorithm
Sayed Mohammad Reza Loghmanian •
Hishamuddin Jamaluddin • Robiah Ahmad •
Rubiyah Yusof • Marzuki Khalid
Received: 9 September 2010 / Accepted: 7 February 2011 / Published online: 1 March 2011
© Springer-Verlag London Limited 2011
Abstract The problem of constructing an adequate and
parsimonious neural network topology for modeling non-
linear dynamic systems is investigated. Neural
networks have been shown to perform function approxima-
tion and represent dynamic systems. The network structures
are usually guessed or selected in accordance with the
designer’s prior knowledge. However, the multiplicity of the
model parameters makes it difficult to obtain an optimum
structure. In this paper, an alternative algorithm based on a
multi-objective optimization algorithm is proposed. The
developed neural network model should fulfil two criteria or
objectives, namely good predictive accuracy and minimal
model structure. The result shows that the proposed algo-
rithm is able to identify simulated examples correctly, and
identifies the adequate model for real process data based on a
set of solutions called the Pareto optimal set, from which the
best network can be selected.
Keywords Artificial neural network · Multi-objective
genetic algorithm · NSGA-II · System identification · Model structure selection
1 Introduction
The field of non-linear system identification has been studied
for many years and is still an active research area. Many
techniques have been proposed for non-linear system iden-
tification, which are mostly based on parameterized
non-linear models such as artificial neural networks (ANNs)
[1–3], Volterra series, Wiener and Hammerstein models [4]
and wavelet networks. An artificial neural network requires
establishing the structure in terms of the number of layers,
the number of nodes in each layer and the connections between them.
A network with a more complicated structure than necessary
overfits the training data [5], i.e., it performs well on data
included in the training set but may perform poorly on the testing
set. On the other hand, a network with a simpler structure
than necessary will not give good performance even on the
training set; thus, structural optimization is important. Trial
and error is one method of artificial neural
network structural optimization [6], but this approach is
laborious and may not arrive at an optimum structure. In
addition, if a large number of initial values such as learning
rate, weight, and threshold and so forth are to be defined, then
these methods become impractical for ANNs. Network
growing techniques such as cascade-correlation learning and
network pruning have also been successfully used for
structural optimization [7]. However, all these methods still
suffer from slow convergence. In addition, they are based on
gradient techniques and can easily become stuck in a local minimum.
Genetic algorithms (GAs) have been proposed to optimize
the structure and identification of parameters in non-linear
S. M. R. Loghmanian (corresponding author) · R. Yusof · M. Khalid
Centre for Artificial Intelligence and Robotics (CAIRO),
Universiti Teknologi Malaysia, 54100 Jalan Semarak, Kuala
Lumpur, Malaysia
e-mail: [email protected]
R. Yusof
e-mail: [email protected]
M. Khalid
e-mail: [email protected]
H. Jamaluddin · R. Ahmad
Department of Applied Mechanics, Faculty of Mechanical
Engineering, Universiti Teknologi Malaysia, 81310 Johor,
Malaysia
e-mail: [email protected]
R. Ahmad
e-mail: [email protected]
Neural Comput & Applic (2012) 21:1281–1295
DOI 10.1007/s00521-011-0560-3
system identification [8–10]. In the field of neural net-
works, GAs have been employed to optimize weights, layers,
the number of input–output nodes and neurons, and to derive optimal
ANN structures [9, 11].
A reliable automated technique for structure optimiza-
tion is needed. Since the aims of modeling
are good predictive accuracy and an
optimum model structure, two objective functions, namely
the minimization of the mean square error of the model pre-
diction and of the model complexity, are proposed. This leads to
the proposed multi-objective function optimization. To
obtain an optimal structure, the two objectives must be
minimized simultaneously. In contrast with classical
methods such as the weighted sum, ε-constraint, weighted
metric and Benson’s method, which use a single initial
point and require some prior knowledge about the
problem, there exist evolutionary algorithms with multi-
objective optimization methods [12]. Since classical search
and optimization methods use a point-by-point approach,
where one solution in each iteration is modified into a dif-
ferent solution, the outcome of using a classical optimi-
zation method is a single optimized solution. The field of
search and optimization has been changed by the introduction of
a number of non-classical, unconventional and stochastic
search and optimization algorithms. The main difference
between classical search methods and evolutionary algorithms
(EAs) is the use of a population of solutions in each iteration.
Schaffer [13] proposed the first multi-objective GA, the
vector evaluated genetic algorithm (VEGA), to find a set of
non-dominated solutions. This algorithm is based on
randomly dividing the GA population into as many subpopulations
as there are objective functions (M). Each subpopulation is
assigned a fitness based on a different objective function. In
this approach, no solution is tested against the other (M − 1) objec-
tive functions. Another method for multi-objective evolu-
tionary algorithms is the multiple objective genetic
algorithms (MOGAs). To maintain diversity, the user must
define a sharing parameter, so the algorithm needs prior
knowledge about the problem [14].
The non-dominated sorting genetic algorithm (NSGA) [15]
implements a sharing operation to maintain population
diversity, but it is too sensitive to the selection of the sharing
parameters. Besides, the lack of elitism was also a motivation
for the modification of NSGA into NSGA-II [12]. The elitist non-
dominated sorting genetic algorithm (NSGA-II) is proposed
here to optimize the neural network as a multi-objective
optimization algorithm. Unlike other methods, NSGA-II
uses an elite-preservation strategy and an explicit diversity-
preserving mechanism simultaneously. These mechanisms
do not allow an already found Pareto optimal solution to be
deleted [12]. Some studies on multi-objective optimization
of ANN design have been presented. For example, Field-
send and Singh [16] optimized the structure of a four-
hidden-layer ANN for stock data prediction and used risk
and profit as objectives. Abbass [17] implemented a me-
metic, or hybrid, method to minimize the number of hid-
den units and the approximation error of an ANN and showed that
it is faster than traditional backpropagation. Sexton et al.
[18] investigated the simultaneous optimization of struc-
ture and effectiveness of multi-layer perceptron (MLP),
with an ANN simultaneous optimization algorithm. However,
this method needs the assignment of a penalty value. Gonzalez
et al. [19] applied multiple objective genetic algorithms
(MOGAs) to optimize error and number of basis functions
for radial basis function neural network. Palmes [20] and
Koza and Rice [11] employed multi-objective optimization
for structure and initial weights for ANN and Mandal et al.
[21] adopted NSGA-II to get the best structure of ANN for
optimizing two parameters of electrical discharge
machining. However, the methods that have been proposed
for modeling dynamic systems using neural networks have
considered only one objective function, based on error, to
optimize hidden nodes and layers [22].
In this study, two objective functions are proposed: the
first is complexity, based on the assignment of input nodes
and variables as well as hidden nodes, while the second
objective function is based on the mean square error.
This study starts by using simulated data generated from
dynamic models represented by known, predefined neural
network structures. Since their structures are known, the
final models can be validated in terms of the model
structure identified by the algorithm and the respective
values of the weights and thresholds as well. A multi-
objective optimization method was employed. After prov-
ing the effectiveness of the algorithm using simulated
models, a real process dataset available from the literature was
used for further study. Model validity tests were performed
to test the adequacy of the developed model.
2 Dynamic system modeling
System identification is a general process of developing a
model of a system based on measured or given input–
output data. The general construction of the model [23]
involves acquiring the process input and output data sets,
defining a class of the model to be used, parameter esti-
mation, and model validation.
2.1 Neural network for dynamic system modeling
Neural networks have been successfully applied in
many research areas such as speech processing, pattern
recognition, and non-linear system identification. Extensive
works on non-linear identification using the neural network
have been reported [2, 24] covering various applications
and have received significant interest. They offer many advantages,
such as adaptation without prior knowledge, speed
and efficiency in providing solutions, the ability to
handle non-numeric data, and generalization. For a given
set of data, a multi-layer perceptron network can provide a
good non-linear relationship. Theoretical works have pro-
ven that a feedforward multi-layer perceptron even with
only one hidden layer can uniformly approximate any
continuous function [25]. Thus, a feedforward MLP is an
attractive approach for researchers [26].
Non-linear counterparts to the linear model structures
are given by

$$ y(t) = G[\varphi(t, \theta), \theta] + e(t), \qquad (1) $$

or in predictor form

$$ \hat{y}(t|\theta) = G[\varphi(t, \theta), \theta], \qquad (2) $$

where $\varphi(t, \theta)$ is the regression vector and $\theta$ is the vector
containing the adjustable parameters of the neural network,
known as weights. The function G is realized by the neural
network and is assumed to have a feedforward structure.
Depending on the regression vector, different non-linear
model structures can be obtained. If the regression vector is
selected as in the ARX model, the model structure is called
NNARX, the acronym for neural network ARX. Likewise,
there exist NNFIR, NNARMAX, NNOE, and NNSSIF [26].
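As a concrete illustration of the NNARX structure, the following minimal Python sketch (the function name and array interface are illustrative assumptions, not from the paper) assembles the regression vector from past outputs and inputs:

import numpy as np

def nnarx_regressor(u, y, t, nu, ny):
    # Build the NNARX regression vector phi(t) = [y(t-1)..y(t-ny), u(t-1)..u(t-nu)].
    # u, y: 1-D arrays of measured input and output; t must satisfy t >= max(nu, ny).
    past_outputs = [y[t - k] for k in range(1, ny + 1)]  # y(t-1), ..., y(t-ny)
    past_inputs = [u[t - k] for k in range(1, nu + 1)]   # u(t-1), ..., u(t-nu)
    return np.array(past_outputs + past_inputs)

A feedforward network G mapping this vector to the predicted output then realizes the NNARX model of Eq. (2).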
2.2 Multi-objective evolutionary algorithms
A number of stochastic optimization techniques such as
simulated annealing, tabu search, and ant colony optimi-
zation could be used to generate the Pareto set [22]. Due to the
working procedure of these algorithms, the solutions
attempt to obtain a good approximation, but they are not
guaranteed to identify optimal trade-offs [27]. Evolutionary
algorithms are characterized by a population of solution
candidates, and the reproduction process enables the
combination of existing solutions to generate new solu-
tions. This enables finding several members of the Pareto
optimal set in a single run instead of performing a series of
separate runs, which is the case for some of the conven-
tional stochastic processes. Finally, natural selection
determines which individuals of the current population
participate in the new population.
Other advantages of evolutionary
algorithms are that they require very little knowledge about
the problem being solved, are less susceptible to the shape or
continuity of the Pareto front, are easy to implement, robust,
and can be implemented in a parallel environment [12].
Srinivas and Deb [15] presented the non-dominated
sorting GA (NSGA) as a Pareto-based approach. The main
advantage of the algorithm is the assignment of fitness
according to non-dominated sets. Nevertheless, the
performance is sensitive to the sharing parameter. To
overcome the disadvantage, the elitist non-dominated
sorting genetic algorithm (NSGA-II) has been proposed
[12]. In this paper, NSGA-II is applied to the modeling of
dynamic systems. The two objective functions in this study can
be formulated mathematically as follows:

$$ \text{Predictive error:} \quad \min \; \mathrm{MSE} = \frac{1}{N_s} \sum_{i=1}^{N_s} \left( y_i(t) - \hat{y}_i(t) \right)^2 $$
$$ \text{Complexity:} \quad \min \; CX = n_u + n_y + n_d \qquad (3) $$

where $N_s$ is the number of samples, $y(t)$ and $\hat{y}(t)$ are the
desired and predicted system outputs, respectively, and $n_u$, $n_y$
and $n_d$ are the number of input lags, output lags, and hidden
nodes, respectively.
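A minimal sketch of the two objectives of Eq. (3), assuming the predicted outputs have already been produced by a trained candidate network (all names are illustrative):

import numpy as np

def objectives(y_true, y_pred, nu, ny, nd):
    # First objective: mean square error over the Ns test samples.
    mse = float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
    # Second objective: structural complexity CX = nu + ny + nd.
    cx = nu + ny + nd
    return mse, cx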
2.3 Elitist non-dominated sorting genetic algorithm
Deb [12] presented the elitist non-dominated sorting
genetic algorithm (NSGA-II). This method uses an explicit
diversity-preserving mechanism. For a multi-objective
optimization problem, any two solutions a and b can have
one of the two possibilities: one dominates the other or
none dominates the other. In a minimization problem,
without loss of generality, a solution a dominates b if the
following two conditions are satisfied:
$$ \forall m: \; f_m(a) \le f_m(b), \quad m = 1, 2, \ldots, M, \qquad a, b \in \mathbb{R}^n $$
$$ \exists m: \; f_m(a) < f_m(b), \quad m = 1, 2, \ldots, M, \qquad a, b \in \mathbb{R}^n $$

If both conditions are satisfied, solution
a dominates solution b. If there is no solution a that
dominates a solution b, then b is called a non-dominated
solution. The solutions that are non-dominated within the
entire search space are denoted as Pareto optimal and con-
stitute the Pareto optimal set or Pareto optimal front.
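The dominance relation above translates directly into code; a short sketch for minimization problems:

def dominates(fa, fb):
    # fa and fb are objective tuples, e.g. (MSE, CX).
    # fa dominates fb if it is no worse in every objective
    # and strictly better in at least one.
    no_worse = all(a <= b for a, b in zip(fa, fb))
    strictly_better = any(a < b for a, b in zip(fa, fb))
    return no_worse and strictly_better

For example, dominates((0.01, 5), (0.02, 5)) is True, while two solutions that trade MSE against complexity do not dominate each other.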
After creation of a new population Q using a parent population P of
size N, the two populations are combined and a new
population R of size 2N is constructed. Then the process of
finding the non-dominated solutions sorts the population
R. After non-dominated sorting, the new population is
filled front by front. It starts with the
best front (rank one, or first front) and continues with the
second non-dominated front and so forth. Since the size of
R is 2N and the new population needs only N individuals, solutions
with equal front level are selected
according to crowding distance. The crowding distance is
used by NSGA-II to maintain the diversity among solutions
in a front. The crowding distance for an individual is cal-
culated using a hypercube [12].
For selection between the two solutions with the same
rank, the method chooses the one with a larger distance.
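A sketch of the crowding-distance computation for one front, following the hypercube idea of [12] (the list-of-tuples interface is an illustrative assumption):

import numpy as np

def crowding_distance(front):
    # front: list of objective tuples belonging to one non-dominated front.
    n = len(front)
    dist = np.zeros(n)
    for m in range(len(front[0])):                     # loop over objectives
        order = sorted(range(n), key=lambda i: front[i][m])
        fmin, fmax = front[order[0]][m], front[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = np.inf      # keep boundary solutions
        if fmax == fmin:
            continue
        for j in range(1, n - 1):                      # interior solutions
            dist[order[j]] += (front[order[j + 1]][m] -
                               front[order[j - 1]][m]) / (fmax - fmin)
    return dist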
The advantages of NSGA-II are that an elite-preservation
strategy and an explicit diversity-preserving mechanism
are used simultaneously. These mechanisms do not allow
an already found Pareto optimal solution to be deleted and
keep the diversity of solutions. The implementation steps of
NSGA-II can be summarized as follows (a code sketch of the
survivor selection follows the steps).
Step 1. Generate binary initial population P0 of size
N according to the number and length of the decision
variables.
Step 2. Calculate the objective functions for each
chromosome (individual).
Step 3. Assign number of ranks as the fitness function to
each chromosome by the non-dominated sorting proce-
dure and classify them into distinct Pareto fronts.
Step 4. Create offspring population Pt (start with t = 1)
from initial population P0 by GA operators namely
selection, crossover, and mutation.
Step 5. Compute the objective functions for each
chromosome in Pt.
Step 6. Create a combined population of size 2N consist-
ing of the parent and offspring populations, $R_t = P_t \cup P_{t-1}$.
Step 7. Perform non-dominated sorting and compute
crowding distance for all chromosomes.
Step 8. Select lower-ranked solutions and put them in the
new population, $P_{t+1}$, of size N.
Step 9. In the case of equality of the front number among
the solutions, choose the chromosomes with a higher
crowding distance.
Step 10. A new population, $P_{t+1}$, is ready; go to step 4
and repeat until the Pareto optimal front is obtained.
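The survivor-selection core of these steps (Steps 7 to 9) can be sketched compactly in Python; the naive non-dominated sort below is quadratic, unlike the fast sort of [12], and all names are illustrative:

def dominates(fa, fb):
    return (all(a <= b for a, b in zip(fa, fb)) and
            any(a < b for a, b in zip(fa, fb)))

def non_dominated_sort(objs):
    # Split indices into fronts; front 0 holds the rank-one solutions.
    remaining = set(range(len(objs)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objs[j], objs[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining -= set(front)
    return fronts

def crowding(objs, front):
    # Crowding distance restricted to one front (cf. the earlier sketch).
    dist = {i: 0.0 for i in front}
    for m in range(len(objs[0])):
        order = sorted(front, key=lambda i: objs[i][m])
        fmin, fmax = objs[order[0]][m], objs[order[-1]][m]
        dist[order[0]] = dist[order[-1]] = float("inf")
        if fmax > fmin:
            for a, b, c in zip(order, order[1:], order[2:]):
                dist[b] += (objs[c][m] - objs[a][m]) / (fmax - fmin)
    return dist

def select_survivors(combined, objs, N):
    # Steps 7-9: keep N of the 2N combined individuals.
    survivors = []
    for front in non_dominated_sort(objs):
        if len(survivors) + len(front) <= N:
            survivors += front                      # whole front fits (Step 8)
        else:
            d = crowding(objs, front)               # tie-break (Step 9)
            front.sort(key=lambda i: d[i], reverse=True)
            survivors += front[:N - len(survivors)]
            break
    return [combined[i] for i in survivors]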
It is important to note again that the Pareto
front is a set of solutions that are non-dominated with respect to
each other but better than the rest of the solutions.
Consequently, there is no single best
solution; there is always a group of possible solutions in
the first front (rank one) in the last generation.
3 Model validation
In most reported works in modeling non-linear systems
using neural networks, one-step-ahead prediction has been
used to verify the models. However, this is not a sufficient
indicator of model performance because at each step the
past input, outputs, and residuals are available and used to
predict just one increment forward. The one-step-ahead
prediction is given by

$$ \hat{y}_{OSA}(t) = \hat{F}[y(t-1), \ldots, y(t-n_y), u(t-1), \ldots, u(t-n_u)] \qquad (4) $$

where $\hat{F}$ is an estimate of the non-linear function.
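In code, the one-step-ahead predictor of Eq. (4) simply feeds measured past values, never its own predictions, into the estimated function (a sketch; model stands for any trained network returning the predicted output):

def one_step_ahead(model, u, y, nu, ny):
    # Predict y(t) from measured y(t-1)..y(t-ny) and u(t-1)..u(t-nu).
    start = max(nu, ny)
    preds = []
    for t in range(start, len(y)):
        phi = ([y[t - k] for k in range(1, ny + 1)] +
               [u[t - k] for k in range(1, nu + 1)])
        preds.append(model(phi))
    return preds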
The correlation test is another
validation technique; it can detect model deficiencies using the
prediction errors or residuals, which would give biased results in
parameter estimation. If a model of a system is adequate,
then the residual or predictive error e(t) should be
unpredictable from all linear and non-linear combinations
of past inputs and outputs. The derivation of simple tests
that can detect these conditions is complex, but it
can be shown [28] that the following conditions should
hold:
$$ \Phi_{ee}(\tau) = \frac{E[e(t)e(t-\tau)]}{E[e^2(t)]} = \delta(\tau) $$
$$ \Phi_{ue}(\tau) = \frac{E[u(t)e(t-\tau)]}{\sqrt{E[u^2(t)]\,E[e^2(t)]}} = 0 \quad \forall \tau $$
$$ \Phi_{e(eu)}(\tau) = \frac{E[e(t)e(t-1-\tau)u(t-1-\tau)]}{\sqrt{E[e^2(t)]\,E[e^2(t)u^2(t)]}} = 0, \quad \tau \ge 0 $$
$$ \Phi_{u^2 e}(\tau) = \frac{E[(u^2(t)-\bar{u}^2)e(t-\tau)]}{\sqrt{E[(u^2(t)-\bar{u}^2)^2]\,E[e^2(t)]}} = 0 \quad \forall \tau $$
$$ \Phi_{u^2 e^2}(\tau) = \frac{E[(u^2(t)-\bar{u}^2)e^2(t-\tau)]}{\sqrt{E[(u^2(t)-\bar{u}^2)^2]\,E[e^4(t)]}} = 0 \quad \forall \tau \qquad (5) $$
where $\Phi$ represents the standard correlation function, $E[\cdot]$
is the expectation operator, $\delta(\tau)$ is an impulse function and
e(t) represents the prediction errors or residuals,

$$ e = y - \hat{y} \qquad (6) $$

where $\hat{y}$ is the predicted output, and

$$ \bar{u}^2 = \frac{1}{N_p} \sum_{t=1}^{N_p} u^2(t). \qquad (7) $$

These tests are able to indicate the adequacy of the fitted
model. Generally, if the correlation functions are within the
95% confidence interval, i.e., $\pm 1.96/\sqrt{N_p}$, the model is
regarded as adequate, where $N_p$ is the number of data
points.
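The first two tests of Eq. (5) and the confidence band can be computed as in the following sketch (the mean removal inside the correlation is a common practical choice, stated here as an assumption):

import numpy as np

def xcorr(a, b, max_lag):
    # Normalized cross-correlation sum_t a(t+tau) b(t) for tau = 0..max_lag.
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))
    return np.array([np.sum(a[tau:] * b[:len(b) - tau])
                     for tau in range(max_lag + 1)]) / denom

def validity_tests(u, e, max_lag=20):
    # Phi_ee should be an impulse; Phi_ue should stay inside the band.
    Np = len(e)
    band = 1.96 / np.sqrt(Np)
    phi_ee = xcorr(np.asarray(e, float), np.asarray(e, float), max_lag)
    phi_ue = xcorr(np.asarray(u, float), np.asarray(e, float), max_lag)
    adequate = (np.all(np.abs(phi_ee[1:]) < band) and
                np.all(np.abs(phi_ue) < band))
    return phi_ee, phi_ue, band, adequate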
4 Results
4.1 Case studies
To show the effectiveness of the proposed algorithm,
three simulated systems, referred to as S1, S2, and S3, were
first used. They were chosen because they can be repre-
sented exactly by one-hidden-layer perceptron networks.
$$ S1: \quad y(t) = \frac{0.6}{1 + e^{-(0.1u(t-2) - 0.5u(t-3) - 0.2y(t-1) - 0.7y(t-5) + 0.3)}} $$
$$ S2: \quad y(t) = \frac{0.6}{1 + e^{-(0.1u(t-2) - 0.5u(t-3) - 0.2y(t-1) - 0.7y(t-5) + 0.3)}} + \frac{-0.2}{1 + e^{-(-0.44u(t-2) - 0.67u(t-3) + 0.23y(t-1) - 0.17y(t-5) - 0.1)}} $$

$$ S3: \quad y(t) = \frac{0.6}{1 + e^{-(0.34u(t-1) + 0.1u(t-2) - 0.5u(t-3) + 0.45u(t-5) - 0.2y(t-1) + 0.26y(t-2) - 0.56y(t-4) - 0.7y(t-5) + 0.3)}} $$
$$ \qquad + \frac{-0.2}{1 + e^{-(-0.54u(t-1) + 0.16u(t-2) + 0.23u(t-3) + 0.67u(t-5) - 0.9y(t-1) + 0.59y(t-2) + 0.2y(t-4) - 0.4y(t-5) - 0.1)}} $$
$$ \qquad + \frac{0.92}{1 + e^{-(0.2u(t-1) - 0.5u(t-2) + 0.58u(t-3) + 0.1u(t-5) + 0.7y(t-1) + 0.86y(t-2) - 0.3y(t-4) - 0.1y(t-5) + 0.7)}} $$
$$ \qquad + \frac{0.2}{1 + e^{-(0.25u(t-1) - 0.3u(t-2) + 0.9u(t-3) - 0.19u(t-5) - 0.37y(t-1) + 0.6y(t-2) - 0.43y(t-4) + 0.6y(t-5) - 0.3)}} $$
System S1 can be represented by a network with one hidden node
and four input nodes with variables u(t − 2), u(t − 3),
y(t − 1), and y(t − 5). System S2 represents a two-hidden-
node network with four input nodes u(t − 2), u(t − 3),
y(t − 1), and y(t − 5), and S3 represents a four-hidden-node
neural network with eight input nodes u(t − 1), u(t − 2),
u(t − 3), u(t − 5), y(t − 1), y(t − 2), y(t − 4), and y(t − 5).
All corresponding weights and thresholds are tabulated in
Table 1. Thus, the expected result is that at least one of the
final solutions in the Pareto optimal front is equivalent to the
neural network structure of S1, S2, or S3 as shown in
Table 1.
To show the application of the proposed algorithm on a
real process dataset, the Box–Jenkins gas furnace data (S4)
available in the literature [29] are used.
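For reproducibility, data from a simulated system such as S1 can be generated with a few lines; the excitation signal below (uniform random input) is an assumption, since the section does not restate it:

import numpy as np

def simulate_s1(n=1000, seed=0):
    # Iterate the S1 difference equation from zero initial conditions.
    rng = np.random.default_rng(seed)
    u = rng.uniform(-1.0, 1.0, n)
    y = np.zeros(n)
    for t in range(5, n):
        z = (0.1 * u[t - 2] - 0.5 * u[t - 3]
             - 0.2 * y[t - 1] - 0.7 * y[t - 5] + 0.3)
        y[t] = 0.6 / (1.0 + np.exp(-z))
    return u, y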
4.2 Implementation procedure
To implement multi-objective optimization of neural net-
work structure, some initial settings need to be defined. The
initial parameters and values are divided into two sets. The
first set is associated with the neural network and system
identification. These are the number of epochs, activation
function, weights and thresholds, learning rate and number
of training, and testing data. The second set deals with the
multi-objective genetic algorithm, such as population size,
number of generations and crossover and mutation proba-
bilities. After setting these parameters, the algorithm gen-
erates the initial population. The length of the chromosome is
set according to the maximum number of input lags, $n_u$,
output lags, $n_y$, and hidden nodes, $n_d$, whose limits
are

$$ 1 \le n_u \le 5, \qquad 1 \le n_y \le 5, \qquad 1 \le n_d \le 8. \qquad (8) $$
In this case, a chromosome is expressed with 13 genes and
is shown schematically in Fig. 1. The first five and the second five
genes represent the input and output lags. Each gene can be 0 or
1. For example, a 1 in the second bit means that one of the
input nodes is u(t − 2). The same interpretation applies to the
output lags. The last three genes in a chromosome give the
number of hidden nodes and are converted into an integer
number. Figure 2 illustrates a sample chromosome, its
interpretation as a regression vector $\varphi(t)$, and the associated
network. After an initial population of size N has been
created, the network encoded by each chromosome is trained on the
training dataset using the Levenberg–Marquardt algorithm,
while the MSE on the test dataset is
calculated to be used as the first objective. Note that the
second objective is complexity and is calculated using
Eq. 3. Crossover and mutation operators create a new
population from the previous population, and both popu-
lations are placed in the mating pool of size
2N. Based on the NSGA-II procedure, non-dominated sorting
and crowding distance are computed for all individuals in
the mating pool to be used as the fitness functions. Selec-
tion chooses N chromosomes among the 2N for the next
generation according to lower rank and, if necessary, higher
crowding distance.
The above procedure is repeated until the last
iteration, and the Pareto optimal front is obtained based on
convergence and diversity metrics.
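A sketch of the decoding step from a 13-gene chromosome to a model structure, under the assumption (consistent with Eq. (8)) that the 3-bit tail is offset by one so that $1 \le n_d \le 8$:

def decode(chromosome):
    # chromosome: list of 13 bits as in Fig. 1.
    # Bits 0-4 flag input lags u(t-1)..u(t-5); bits 5-9 flag output
    # lags y(t-1)..y(t-5); bits 10-12 encode the hidden-node count.
    input_lags = [k + 1 for k in range(5) if chromosome[k]]
    output_lags = [k + 1 for k in range(5) if chromosome[5 + k]]
    nd = int("".join(str(b) for b in chromosome[10:13]), 2) + 1
    return input_lags, output_lags, nd

# Example: [0,1,1,0,0, 1,0,0,0,1, 0,1,1] selects u(t-2), u(t-3),
# y(t-1), y(t-5) and nd = 4.

Each decoded structure is then trained with Levenberg–Marquardt and scored by the two objectives.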
4.3 Multi-objective optimization control parameters
The control parameters of GA include population size,
crossover and mutation probabilities, and crossover and selec-
tion strategies. The choice of these parameters can affect the
behavior and performance of GA, whether in single- or
multi-objective cases. The purpose of conducting simu-
lations to study the effects of the control parameters in system
identification applications is to capture the behavioral
characteristics of the algorithm. The use of crossover in GA
is to maintain the fittest genes that correspond to the correct
terms of the model structure while eliminating the bad ones
that should not be included in the model. The purpose of
mutation is to ensure the diversity of the population,
especially after several generations. Varying the crossover
probability rate (Pc) and mutation probability rate (Pm), the
convergence of the algorithm is observed with respect to
the survival rate of the population and the interaction
between Pc and Pm in finding the Pareto optimal front. To
have an effective search, the variability of the population is
important. In order to find solutions that contribute to
higher fitness values, the chromosomes in the population
have to be tested using many different approaches.
Therefore, different crossover strategies were explored
to relate how it can preserve good solutions that correspond
to correct combination of terms for the model, as well as
recombine them with good ones. The control parameters
used in the study were:
1. population size: population size of 20, 50 and 100
2. crossover mechanism: single-point and double-point
crossover
Table 1 The expected results for S1, S2 and S3

S1: input lags u(t−2), u(t−3); output lags y(t−1), y(t−5); hidden neurons: 1; complexity: 5
    Weights and thresholds:
    $w^1_{11} = 0.1,\; w^1_{12} = -0.5,\; w^1_{13} = -0.2,\; w^1_{14} = -0.7$
    $w^2_{11} = 0.6,\; b^1_1 = 0.3,\; b^2_1 = 0$

S2: input lags u(t−2), u(t−3); output lags y(t−1), y(t−5); hidden neurons: 2; complexity: 6
    Weights and thresholds:
    $w^1_{11} = 0.1,\; w^1_{12} = -0.5,\; w^1_{13} = -0.2,\; w^1_{14} = -0.7,\; w^2_{11} = 0.6,\; b^1_1 = 0.3$
    $w^1_{21} = -0.44,\; w^1_{22} = -0.67,\; w^1_{23} = 0.23,\; w^1_{24} = -0.17,\; w^2_{12} = -0.2,\; b^1_2 = -0.1,\; b^2_1 = 0$

S3: input lags u(t−1), u(t−2), u(t−3), u(t−5); output lags y(t−1), y(t−2), y(t−4), y(t−5); hidden neurons: 4; complexity: 12
    Weights and thresholds:
    $w^1_{11} = 0.34,\; w^1_{12} = 0.1,\; w^1_{13} = -0.5,\; w^1_{14} = 0.45,\; w^1_{15} = -0.2,\; w^1_{16} = 0.26,\; w^1_{17} = -0.56,\; w^1_{18} = -0.7,\; w^2_{11} = 0.6,\; b^1_1 = 0.3$
    $w^1_{21} = -0.54,\; w^1_{22} = 0.16,\; w^1_{23} = 0.23,\; w^1_{24} = 0.67,\; w^1_{25} = -0.9,\; w^1_{26} = 0.59,\; w^1_{27} = 0.2,\; w^1_{28} = -0.4,\; w^2_{12} = -0.2,\; b^1_2 = -0.1$
    $w^1_{31} = 0.2,\; w^1_{32} = -0.5,\; w^1_{33} = 0.58,\; w^1_{34} = 0.1,\; w^1_{35} = 0.7,\; w^1_{36} = 0.86,\; w^1_{37} = -0.3,\; w^1_{38} = -0.1,\; w^2_{13} = 0.92,\; b^1_3 = 0.7$
    $w^1_{41} = 0.25,\; w^1_{42} = -0.3,\; w^1_{43} = 0.9,\; w^1_{44} = -0.19,\; w^1_{45} = -0.37,\; w^1_{46} = 0.6,\; w^1_{47} = -0.43,\; w^1_{48} = 0.6,\; w^2_{14} = 0.2,\; b^1_4 = -0.3$
Fig. 1 Schematic representation of a chromosome with three vari-
ables, nu, ny, and nd
Fig. 2 An example of a chromosome, its regression vector and
network structure
3. crossover probabilities: probabilities of 0.05, 0.3, 0.6,
0.9
4. mutation probabilities: probabilities of 0.001, 0.01,
0.1.
4.3.1 Population size
To study the effect of varying population size, the other
control parameters were fixed: the number of genera-
tions was 50, Pc was 0.6, and Pm was 0.01. System S3 was
studied (because it is the most complex), and after 50 itera-
tions, the metrics of convergence and diversity for population sizes of 20, 50
and 100 are shown in Figs. 3 and 4. The
convergence for a population size of 20 is faster than for the others,
but the diversity metrics converged for all population sizes. For
a small population, a few highly superior
individuals dominate the population towards the later
generations, giving them a high chance of being selected for the
next generation. The disadvantage is that better solutions
elsewhere in the population may never be selected,
resulting in poor exploration. For further
investigation, the algorithm was applied 50 times for all
three population sizes and the solutions in the last gener-
ations were investigated. The results show that, in some
runs with a population size of 20, the expected solution
was not among the non-dominated solutions, while for
sizes of 50 and 100 the expected solutions
(listed in Table 1) existed in all runs. Thus, the
population size of 50 is the best choice for this study, and
increasing it to 100 is not necessary.
4.3.2 Crossover and mutation
By varying the crossover probability rate, Pc, and mutation
probability rate, Pm, the performance of the algorithm was
observed. The population size was set to 50 and the number
of generations was set to 50. The effect of different
crossover strategies was also investigated. The metrics to
evaluate the results are the convergence and diversity
preservation. Figure 5 shows the metrics for various Pm
with crossover probability of 0.05. The convergence occurs
around generation 30 for all Pm, but the diversity for
Pm = 0.01 is more stable than in the other cases.
Increasing Pc to 0.3 yields better results in both
metrics (Fig. 6). The convergence to zero is faster and the
diversity metrics are more stable. Figure 7 shows the
results for Pc = 0.6. The figure clearly shows the reduction
in diversities of Pareto optimal front for different mutation
probabilities.
Finally, after increasing Pc to 0.9, the convergence curves
for the different mutation probabilities are similar, as
shown in Fig. 8a. The second metric, in Fig. 8b, indicates
that convergence occurred in all cases. However,
none of them converged to the ideal value of one as in
the previous cases.
The above results indicate that there are no specific rules
for the values of Pc and Pm, and these values have to be chosen by
trial and error. However, Pm = 0.1 produced the best results.
Although the crossover probabilities of
0.3 and 0.9 both showed acceptable performance,
Fig. 8b shows more reliable convergence for all mutation
probability values than Fig. 6b.
To investigate the effect of varying the crossover strategy,
single- and double-point crossovers were considered. Using
the above results, Pc and Pm were chosen as 0.9 and 0.1,
respectively. Figure 9 shows the convergence (a) and
diversity metrics (b) for the multi-objective optimization of
the neural network structure using NSGA-II with these two
crossover strategies. The result shows that the double-point cross-
over strategy gives slightly better diversity preservation.
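The two crossover strategies compared in Fig. 9 can be sketched as follows (bit-list chromosomes are assumed):

import random

def single_point_crossover(p1, p2):
    # Exchange the tails after one random cut point.
    c = random.randint(1, len(p1) - 1)
    return p1[:c] + p2[c:], p2[:c] + p1[c:]

def double_point_crossover(p1, p2):
    # Exchange the segment between two random cut points.
    i, j = sorted(random.sample(range(1, len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]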
The above tests were applied to all three simulated
systems and to the real process data of the case studies; the
chosen parameter values are summarized in Table 2.
4.4 Case studies of S1, S2, and S3
To model the data generated by S1 to S3 and to optimize
the model structure, the initial parameters
summarized in Table 2 were used.
Fig. 3 The effect of varied population size on convergence, S3
Fig. 4 The effect of varied population size on diversity, S3
The two criteria, namely the convergence and
diversity metrics, need to be monitored during the
process of multi-objective optimization; these two met-
rics are shown in Fig. 10. It is important to recall that
the ideal values of the convergence and diversity metrics
are zero and one, respectively. Figure 10a shows that
the convergence metrics quickly moved to zero, thereby
implying that NSGA-II solutions starting from a random
set of solutions quickly approach the Pareto optimal
front. The value of zero of the convergence metrics
implies that all non-dominated solutions match the
Pareto optimal points. After about 20 generations, the
NSGA-II population converges to the Pareto optimal
front. Similarly, the diversity metric, Fig. 10b, shows
that the diversity increases exponentially until about generation 30,
after which it remains the same.
Since in each generation only solutions with rank
one are chosen for the next population, the summation of
the ranks of all solutions in each generation is plotted
against the iteration number in Fig. 11, to investigate
whether all points in the last population are members of
the Pareto optimal front.
Fig. 5 The effect of varying mutation rate for Pc = 0.05 for System S3: a convergence metric, b diversity metric
Fig. 6 The effect of varying mutation rate for Pc = 0.3 for System S3: a convergence metric, b diversity metric
Fig. 7 The effect of varying mutation rate for Pc = 0.6 for System S3: a convergence metric, b diversity metric
The graph shows that by around generation 16 all solutions
have been assigned rank one, because the summation of their
ranks equals the population size, 50.
The collection of all solutions in the Pareto optimal front
is shown in Fig. 12. There are only four distinct solutions in
the Pareto optimal front; the remaining
solutions are repetitions. To inspect
these four solutions further and to determine whether the Pareto
optimal front contains the expected solution for S1, the
details of the four solutions obtained are summarized in
Table 3.
Two metrics of multi-objective optimization for S2 and
S3 are shown in Figs. 13 and 14. These figures indicate that
the multi-objective optimization algorithm converged as
shown by the convergence and diversity metrics.
Fig. 8 The effect of varying mutation rate for Pc = 0.9 for System S3: a convergence metric, b diversity metric
Fig. 9 The effect of varying crossover point for System S3
Table 2 The initial parameters and values used in the algorithm

Parameter                          S1 to S3              S4
Population size                    50                    50
No. of generations                 50                    50
Probability of crossover (Pc)      0.9                   0.8
Probability of mutation (Pm)       0.1                   0.05
No. of epochs                      200                   400
Initial weights and thresholds     Rand (−1, 1)          Rand (−1, 1)
Max no. of input lags              5                     5
Max no. of output lags             5                     5
Max no. of hidden neurons          8                     8
Activation functions               Sigmoid–linear        Sigmoid–linear
Training method                    Levenberg–Marquardt   Levenberg–Marquardt
Error function                     MSE                   MSE
Learning rate                      0.1                   0.1
Investigation of the Pareto optimal front (Figs. 15, 16)
in detail shows that there are only four solutions for S2 and
ten solutions for S3 in their respective Pareto optimal
fronts; the rest of the solutions are repetitions.
Tables 4 and 5 give the details of the possible solu-
tions for S2 and S3, respectively, and Table 6 presents the
weights and thresholds for D2 and J3.
As illustrated in Tables 4, 5 and 6, the algorithm could
find the expected results given by the true models defined
by S2 and S3. Solution D2 in Fig. 15 is the exact solution of
the simulated model S2. All four proposed solutions in
Pareto optimal front of Fig. 15 have at least two true lags.
The algorithm has successfully found the true model rep-
resented by S3. Solution J3 in Fig. 16 and Table 5 indicates
the simulated model S3 and is one of the ten candidates in
Pareto optimal front.
4.5 Case study of S4, gas furnace process
The above results for the simulated examples show the effec-
tiveness of the proposed algorithm. In this section, the
application of the algorithm to a real process is discussed.
In the gas furnace dataset [29], there are 296 pairs of input–
output data, as shown in Fig. 17. The input u(t) of the plant
is the methane gas flow rate into the furnace, and the output
y(t) is the CO2 concentration (%) in the outlet gas, with a sampling
interval of 9 s.
The process of optimization and convergence for the
given input–output data is shown in Fig. 18. The graphs of
the convergence metric tend to zero as desired, and the
diversity metric increases with increasing generation,
producing a uniform Pareto optimal front.
The Pareto optimal front of S4 is presented in Fig. 19.
Since the population size is 50 and the number of distinct points in the
last generation is less than 50, the rest of the solutions are
repetitions. The number of repetitions for this system is
shown in Table 7, together with the details of the solutions,
their lags, and their hidden nodes.
Fig. 10 Two metrics for S1: a convergence, b diversity
Fig. 11 Summation of ranks convergence for S1
Fig. 12 Pareto optimal front (last generation) for S1
Table 3 Details of the solutions in the Pareto optimal front of Fig. 12 for S1

Solution  Input lags                Output lags      Hidden units  Cx  MSE (test set)  Repetitions
A1        u(t−3)                    y(t−5)           1             3   6.86 × 10^-5    12
B1        u(t−2), u(t−3)            y(t−5)           1             4   1.53 × 10^-6    15
C1        u(t−2), u(t−3)            y(t−1), y(t−5)   1             5   3.37 × 10^-33   13
D1        u(t−2), u(t−3), u(t−4)    y(t−1), y(t−3)   1             6   3.35 × 10^-33   10
4.5.1 Trade-off
It is important to note that A4, B4, C4, D4, and E4 are not
superior or inferior to one another. For instance, B4 is better
than A4 in the first objective, MSE, but is worse in the second
objective, complexity. One trade-off among the solutions
A4, B4, C4, D4, and E4 in Table 7 is that points A4 and B4 have
large errors compared with the others. Although E4 has the
smallest error among C4, D4, and E4, the differences are
small (0.002 between D4 and E4 and 0.005 between C4
and E4), while it has higher complexity than D4 and
C4. Thus, the selection should be made between C4 and D4.
Model validity tests such as the correlation tests may help us
find the better solution between C4 and D4. Figures 20 and
21 show the correlation tests for solutions C4 and D4,
respectively. Figure 20 indicates that all correlation func-
tions fall within the confidence bands except $\Phi_{ee}$ for C4, while
all the correlation tests were satisfied for D4, as shown in
Fig. 21. Thus, it seems that D4 could be an adequate model
for the gas furnace data.
4.6 Summary and discussion
The proposed algorithm has been applied to the data
generated by three simulated systems. There were several
points in the Pareto optimal front of each system. Since
these simulated systems represent multi-layer percep-
tron networks exactly, at least one of the non-dominated
solutions in each Pareto optimal front is exactly the same as
the simulated model, with the same weights, thresholds,
input and output lags, and hidden nodes. Thus, for S1, S2
and S3 the results showed the effectiveness of the proposed
algorithm.
Fig. 13 Two metrics for S2: a convergence, b diversity
Fig. 14 Two metrics for S3: a convergence, b diversity
Fig. 15 Pareto optimal front (last generation) for S2
Fig. 16 Pareto optimal front (last generation) for S3
The gas furnace dataset was chosen from the literature to show
the application of the algorithm. There was no prior
knowledge about the number of lags to help designers
make a final decision. The trade-off between MSE and com-
plexity, together with the correlation tests, can be the basis for
choosing the final solution.
In this study, only one hidden layer has been considered.
A more complex neural network structure with more than
one hidden layer would require a redefinition of the chro-
mosomes in the algorithm, and more extensive computation
would be involved. The procedure remains the same and
similar modeling results are expected, as the optimization
Table 4 Details of the solutions in the Pareto optimal front of Fig. 15, S2

Solution  Input lags         Output lags      Hidden units  Cx  MSE (test set)  Repetitions
A2        u(t−2)             y(t−1)           1             3   4.00 × 10^-4    16
B2        u(t−2), u(t−3)     y(t−5)           1             4   1.30 × 10^-6    16
C2        u(t−2), u(t−3)     y(t−1), y(t−5)   1             5   2.61 × 10^-7    16
D2        u(t−2), u(t−3)     y(t−1), y(t−5)   2             6   7.48 × 10^-27   2
Table 5 Details of the solutions in the Pareto optimal front of Fig. 16, S3

Solution  Input lags                          Output lags                        Hidden units  Cx  MSE (test set)  Repetitions
A3        u(t−1)                              y(t−2)                             1             3   1.46 × 10^-3    10
B3        u(t−1), u(t−2)                      y(t−2)                             1             4   3.57 × 10^-4    11
C3        u(t−1), u(t−2), u(t−3)              y(t−2)                             1             5   1.50 × 10^-4    14
D3        u(t−1), u(t−2), u(t−3)              y(t−1)                             2             6   1.00 × 10^-4    2
E3        u(t−1), u(t−2), u(t−3), u(t−5)      y(t−2)                             2             7   4.82 × 10^-5    2
F3        u(t−1), u(t−2), u(t−3), u(t−5)      y(t−2)                             3             8   2.99 × 10^-5    3
G3        u(t−1), u(t−2), u(t−3), u(t−5)      y(t−2), y(t−4)                     3             9   2.41 × 10^-5    3
H3        u(t−1), u(t−2), u(t−3), u(t−5)      y(t−2), y(t−4), y(t−5)             3             10  5.00 × 10^-6    2
I3        u(t−1), u(t−2), u(t−3), u(t−5)      y(t−1), y(t−2), y(t−4), y(t−5)     3             11  1.44 × 10^-6    1
J3        u(t−1), u(t−2), u(t−3), u(t−5)      y(t−1), y(t−2), y(t−4), y(t−5)     4             12  6.91 × 10^-25   2
Table 6 Connection values for D2 and J3

D2 (weights):
    $w^1_{11} = 0.1,\; w^1_{12} = -0.5,\; w^1_{13} = -0.2,\; w^1_{14} = -0.7,\; w^2_{11} = 0.6$
    $w^1_{21} = -0.44,\; w^1_{22} = -0.67,\; w^1_{23} = 0.23,\; w^1_{24} = -0.17,\; w^2_{12} = -0.2$
D2 (thresholds):
    $b^1_1 = 0.3,\; b^1_2 = -0.1,\; b^2_1 = 0$

J3 (weights):
    $w^1_{11} = 0.34,\; w^1_{12} = 0.1,\; w^1_{13} = -0.5,\; w^1_{14} = 0.45,\; w^1_{15} = -0.2,\; w^1_{16} = 0.26,\; w^1_{17} = -0.56,\; w^1_{18} = -0.7,\; w^2_{11} = 0.6$
    $w^1_{21} = -0.54,\; w^1_{22} = 0.16,\; w^1_{23} = 0.23,\; w^1_{24} = 0.67,\; w^1_{25} = -0.9,\; w^1_{26} = 0.59,\; w^1_{27} = 0.2,\; w^1_{28} = -0.4,\; w^2_{12} = -0.2$
    $w^1_{31} = 0.2,\; w^1_{32} = -0.5,\; w^1_{33} = 0.58,\; w^1_{34} = 0.1,\; w^1_{35} = 0.7,\; w^1_{36} = 0.86,\; w^1_{37} = -0.3,\; w^1_{38} = -0.1,\; w^2_{13} = 0.92$
    $w^1_{41} = -0.25,\; w^1_{42} = 0.3,\; w^1_{43} = -0.9,\; w^1_{44} = 0.19,\; w^1_{45} = 0.37,\; w^1_{46} = -0.6,\; w^1_{47} = 0.43,\; w^1_{48} = -0.6,\; w^2_{14} = -0.2$
J3 (thresholds):
    $b^1_1 = 0.3,\; b^1_2 = -0.1,\; b^1_3 = 0.7,\; b^1_4 = 0.3,\; b^2_1 = 0.2$

Note that the fourth hidden node of J3 has all signs reversed relative to S3; since $\sigma(-z) = 1 - \sigma(z)$, this parameterization is mathematically equivalent, with the output threshold $b^2_1 = 0.2$ compensating for the reversed output weight $w^2_{14} = -0.2$.
procedures are still the same, but performance such as
convergence rate might be affected.
5 Conclusion
A multi-objective genetic algorithm method, the elitist non-
dominated sorting genetic algorithm (NSGA-II), has been
proposed and investigated to optimize the neural network
structure for modeling dynamic systems. The objective
functions used are the complexity of the neural network
Fig. 17 Input and output data for S4
Fig. 18 Two metrics for S4: a convergence, b diversity
Fig. 19 Pareto optimal front (last generation) for S4, gas furnace data
Table 7 Details of the solutions in the Pareto optimal front of Fig. 19, S4

Solution  Input lags  Output lags                  Hidden units  Cx  MSE (test set)  Repetitions
A4        u(t−3)      y(t−1)                       1             3   4.12 × 10^-1    8
B4        u(t−2)      y(t−1), y(t−2)               1             4   1.63 × 10^-1    12
C4        u(t−2)      y(t−1), y(t−2), y(t−4)       1             5   1.41 × 10^-1    9
D4        u(t−2)      y(t−1), y(t−2), y(t−4)       2             6   1.38 × 10^-1    11
E4        u(t−2)      y(t−1), y(t−2), y(t−4)       3             7   1.36 × 10^-1    10
architecture and the mean square error on the test set, which are
minimized simultaneously. The proposed method has been
shown to be effective in identifying the correct structure of
the system using two objective functions for three neural
network simulated systems. The algorithm was then applied
to real process data. There is more than one possible
solution in multi-objective optimization problems; the trade-off
among the possible solutions in the Pareto optimal set and the correla-
tion tests can help designers select the final solution.
References
1. Hagan MT, Demuth HB, Beale MH (1996) Neural network
design. PWS Publishing, Boston
2. Billings SA, Jamaluddin H, Chen S (1992) Properties of neural
networks with applications to modeling non-linear dynamical
systems. Int J Control 55(1):193–224
3. Chen S, Billings SA, Grant PM (1990) Non-linear system iden-
tification using neural networks. Int J Control 51(6):1191–1214
4. Xu KJ, Zhang J, Wang XF, Teng Q, Tan J (2008) Improvements
of nonlinear dynamic modeling of hot-film MAF sensor. Sens
Actuators A 147:34–40
5. Caruana R, Lawrence S, Giles CL (2001) Overfitting in neural
networks: backpropagation, conjugate gradient, and early stop-
ping. Adv Neural Inf Process Syst 13:402–408
6. Bebis G, Georgiopoulos M (1994) Feed-forward neural networks:
why network size is so important. IEEE Potentials 13(4):27–31
7. Sietsma J, Dow RJF (1988) Neural net pruning—why and how.
In: Proceedings of the IEEE international conference on neural
networks, San Diego, pp 325–333
8. Ahmad R, Jamaluddin H, Hussain MA (2004) Model structure
selection for discrete-time nonlinear systems using genetic
algorithm. J Syst Control Eng 218(12):85–98
Fig. 20 Correlation test for solution C4 of S4
Fig. 21 Correlation test for solution D4 of S4
9. Oh SK, Pedrycz W (2006) Genetic optimization driven multi
layer hybrid fuzzy neural networks. Simul Model Pract Theory
14:597–613
10. Park KJ, Pedrycz W, Oh SK (2007) A genetic approach to
modeling fuzzy systems based on information granulation and
successive generation-based evolution method. Simul Model
Pract Theory 15:1128–1145
11. Koza YJR, Rice JP (1991) Genetic generation of both the weights
and architecture for a neural network. In: IEEE international joint
conference on neural networks, vol 2. IEEE Press, Seattle,
pp 397–404
12. Deb K (2001) Multi-objective optimization using evolutionary
algorithms. Wiley, Chichester
13. Schaffer JD (1984) Some experiments in machine learning using
vector evaluated genetic algorithms. PhD thesis, Vanderbilt
University, Nashville, TN
14. Fonseca CM, Fleming PJ (1993) Genetic algorithms for multi-
objective optimization: formulation, discussion and generalisa-
tion. In: Proceedings of the fifth international conference on
genetic algorithms, Morgan Kaufman, San Mateo, pp 416–423
15. Srinivas N, Deb K (1994) Multi-objective optimization using
non-dominated sorting in genetic algorithms. Evol Comput
2(3):221–248
16. Fieldsend JE, Singh S (2005) Pareto evolutionary neural net-
works. IEEE Trans Neural Netw 16(2):338–354
17. Abbass HA (2003) Speeding up backpropagation using multi-
objective evolutionary algorithms. Neural Comput 15(11):2705–
2726
18. Sexton RS, Dorsey RE, Sikander NA (2004) Simultaneous opti-
mization of neural network function and architecture algorithm.
Decis Support Syst 36:283–296
19. Gonzalez J, Rojas I, Ortega J, Pomares H, Fernandez J, Diaz A
(2003) Multi-objective evolutionary optimization of the size,
shape, and position parameters of radial basis function networks
for function approximation. IEEE Trans Neural Netw
14(1):1478–1495
20. Palmes PP, Usui S (2005) Robustness, evolvability and optimality in
evolutionary neural networks. Biosystems 82(2):168–188
21. Mandal D, Pal SK, Sah P (2007) Modeling of electrical discharge
machining process using back propagation neural network and
multi-objective optimization using non-dominating sorting
genetic algorithm-II. J Mater Process Technol 186:154–162
22. Sexton RS, Dorsey RE, Johnson JD (1999) Optimization of
neural networks: a comparative analysis of the genetic algorithm
and simulated annealing. Eur J Oper Res 114:589–601
23. Ljung L (1999) System identification, theory for the user, 2nd
edn. Prentice-Hall, Englewood Cliffs
24. Ibnkahla M (2003) Nonlinear system identification using neural
networks trained with natural gradient descent. EURASIP J Appl
Signal Process 12:1229–1237
25. Funahashi K (1989) On the approximate realization of continuous
mappings by neural networks. Neural Netw 2:183–192
26. Norgaard MRO, Poulsen NK, Hansen LK (2000) Neural net-
works for modeling and control of dynamic systems. A practi-
tioner’s handbook. Springer, London
27. Zitzler E, Deb K, Thiele L (2000) Comparison of multi-objective
evolutionary algorithm: empirical results. Evol Comput
8:173–195
28. Billings SA, Voon WSF (1986) Correlation based model validity
tests for non-linear models. Int J Control 44(1):235–244
29. Box GEP, Jenkins GM, Reinsel GC (1994) Time series analysis
forecasting and control. Prentice-Hall Inc, Englewood Cliffs