Process Development Based on Model Mining and Data Reconciliation Techniques

J. Abonyi, B. Farsang
University of Pannonia; Department of Process Engineering

Egyetem Street, 10; H-8200 Veszprém; Hungary

[email protected]

Abstract

During the last decade a major shift has begun in the process and chemical industries, driven by an urgent need for new tools that support the optimization of process operations and the development of new production technologies. The approaches taken differ from company to company, but one common feature is the requirement of intensive communication between those who conduct research, design, manufacturing, marketing and management. Such communication should be centered on modeling and simulation, because the whole product and process development chain must be integrated across all time and scale levels of the company. To explore and transfer all the knowledge needed to operate and optimize technologies and business processes, we aimed to develop a novel methodology for integrating heterogeneous information sources and heterogeneous models. The proposed methodology can be referred to as model mining, since it is based on the extraction and transformation of information not only from historical process data but also from different types of process models. The introduction of this novel concept required the development of new algorithms and tools for model analysis, reduction and information integration. In this paper we report our newest results on the integration of multivariate statistical methods, data reconciliation and a priori models.

Keywords

process development, data mining, principal component analysis, data reconciliation, a priori model, fault detection

Background

It is essential to develop new methods that increase industrial competitiveness in order to speed up the transformation of the European economy. Hence, there is an urgent need for new tools which are able to support product and process design and the optimization of production technologies. These tools require intensive communication between design, manufacturing, marketing and management, which could be centred on modelling and simulation. Engineers and directors of leading companies such as DuPont and Dow Chemical have already recognized this goal; they state that the "model integrates the whole organization" [1]. They believe that the extensive use of models is the way that data, information, and knowledge should be conveyed from research to engineering, to manufacturing, and on to the business team. Accordingly, modeling and simulation will have a much greater role in bio-, chemical, and process engineering; it is predicted to be a key feature of modern process maintenance in the future.

Figure 1. Knowledge management in “life-cycle modeling” and integrated modeling.

Officials of AspenTech and other companies dealing with simulation technologies talk about “life-cycle modeling” and integrated modeling technology, i.e. a model that is applied at every level of a technology (see Fig. 1 for the details of this concept).

However, the concept of life-cycle modeling is so far only a vision of how companies should operate in the 21st century; for the time being there are only "model islands". Isolated models are used for different and limited purposes on different levels of the technology (if they exist at all). These models are heterogeneous not only because they serve different purposes, but also because they process data or information taken from different sources and apply different methods [2], which are often not compatible with each other.

Information for the modeling and identification of processes can be obtained from different sources: mechanistic knowledge obtained from first principles (physics and chemistry) [1], empirical or expert knowledge expressed as linguistic rules [2], and measurement data obtained during normal operation or from an experimental process [3]. Different modeling paradigms should be used for an efficient utilization of these different sources of information [3].

The aim of our research is to develop an approach, a methodology to integrate heterogeneous information sources and heterogeneous models so that they cover the whole process, technology and company. This is an extremely complex task considering the time scales of a production system, because different scales define different engineering problems [4, 5]. The information gained on one scale can be useful on other scales; hence the research focuses on mining and transferring information from one scale to another (see also Fig. 2). The key idea is the combination of data mining and data driven techniques with a priori and expert knowledge (see Fig. 3).

Figure 2. Time and scale levels in process engineering. [4]

Figure 3. The proposed scheme combines expert and mechanistic knowledge and measured data in the form of rule-based systems, model structure, parameter constraints and local models. [3]

Model mining – A novel methodology

It is clear that such a global model cannot be developed merely by improving and refining existing models. The key of the proposed approach is to integrate the existing models and information sources in order to explore useful knowledge. The whole process can be called model mining. This type of knowledge discovery is relatively new [6]; its importance has so far been recognized mainly in the automated detection and mining of relationships among atmospheric phenomena by the Global Modelling and Assimilation Office at NASA. Hence, our research can be considered the first concentrated attempt to develop a general methodology for model mining. According to our current ideas, the proposed approach, depicted in Fig. 4, consists of the following steps.

1. Developing an understanding of the application domain and the relevant prior knowledge, and identifying the problem.

2. Creating and pre-processing the target set. This phase starts with activities that help the analyst get familiar with the models and data, identify quality problems, gain first insights into the data and models, detect interesting subsets, and form hypotheses about hidden information. To support this step, the utilized models should be as transparent as possible, since model transparency allows the user to effectively combine different types of information, namely linguistic knowledge, first-principle knowledge and information extracted from data. For that purpose, a model warehouse should be developed in which the different information sources and the developed models are stored in an easily retrievable way. To achieve this goal, there is a need for a language and clear definitions of metadata that can be used to describe the models and the different information sources. Such metadata can be represented by an XML-based scheme, similar to the Predictive Model Markup Language (PMML, www.dmg.org) used to store and transfer data mining models; a minimal sketch of such an entry follows.
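The following is a minimal sketch, not a PMML implementation, of how a model-warehouse entry could be described with metadata and serialized to XML; all field names, variable names and values (model name, validity bounds, etc.) are hypothetical.

# Minimal sketch (hypothetical field names) of describing a model-warehouse entry
# with PMML-like metadata and serializing it to XML.
import xml.etree.ElementTree as ET

def model_metadata_to_xml(name, model_type, scale, inputs, outputs, valid_range):
    """Serialize descriptive metadata of a stored model to an XML string."""
    root = ET.Element("Model", attrib={"name": name, "type": model_type, "scale": scale})
    io = ET.SubElement(root, "Interface")
    for var in inputs:
        ET.SubElement(io, "Input", attrib={"name": var})
    for var in outputs:
        ET.SubElement(io, "Output", attrib={"name": var})
    rng = ET.SubElement(root, "ValidityRange")
    for var, (lo, hi) in valid_range.items():
        ET.SubElement(rng, "Bound", attrib={"variable": var, "min": str(lo), "max": str(hi)})
    return ET.tostring(root, encoding="unicode")

# Example entry: a local PCA monitoring model of a reactor section (illustrative values).
print(model_metadata_to_xml(
    name="reactor_PCA", model_type="PCA", scale="process-unit",
    inputs=["T_reactor", "P_reactor", "F_A_feed"], outputs=["Q_statistic"],
    valid_range={"T_reactor": (90.0, 130.0)}))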

Figure 4. Scheme of model mining. The concept is similar to data mining, but in this scheme not only historical process data but also different types of models are analyzed.

3. Model Mining. The ultimate goal of the whole process is to extract potentially useful knowledge that could not be discovered from a single model or database, only from the integrated information sources and models of the model warehouse. The goals of model mining are achieved via the following tasks:

3.a Model Transformation and Reduction: the different kinds of information represented by the models can be used to transform or reduce other models in order to simplify them, to make them more precise and/or robust, or to expand their operational range (e.g. to improve the extrapolation capability of a black-box model using a priori knowledge). Some computational intelligence models lend themselves to being transformed into other model structures, which allows information transfer between different models (e.g. decision trees can be mapped into feedforward neural networks, and radial basis function networks are functionally equivalent to fuzzy inference systems); a small illustration of such a transformation follows.
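As an illustration of such a transformation (using scikit-learn and made-up data and variable names), a fitted decision tree can be exported into an equivalent rule-based, if-then representation:

# Illustrative sketch: transforming a fitted decision tree into textual if-then rules,
# one direction of the model transformations mentioned above. Data and feature names are made up.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                      # two fictitious process variables
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)      # fictitious fault label

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
rules = export_text(tree, feature_names=["T_reactor", "F_feed"])
print(rules)                                       # rule-based representation equivalent to the tree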

3.b Model Fusion: the integration of the information content of different models is the key issue of the research. During this process it should be kept in mind in what range the particular models are valid and what procedure should be applied or developed to obtain the global result (e.g. voting systems for models on the same level of the technology); a minimal sketch of such a validity-weighted combination is given below. Information can be stored as nominal, ordinal, ratio or fuzzy data; images and pictures, multivariate time series or documents can also be information sources [7]. Besides that, several types of models have to be treated, e.g. graphs, decision trees, neural networks, rule based systems, etc., which need different types of reduction methods (e.g. spanning trees for graphs, pruning for decision trees, rule base reduction for fuzzy models, etc.).
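A minimal sketch of such a fusion, assuming hypothetical local models with triangular validity functions (not a method prescribed by the paper), is the following validity-weighted average:

# Minimal sketch of fusing local models that are valid in different operating ranges:
# each prediction is weighted by a (here triangular) validity degree.
import numpy as np

def validity(x, lo, hi):
    """Triangular validity degree: 1 at the range centre, 0 outside [lo, hi]."""
    centre, half = 0.5 * (lo + hi), 0.5 * (hi - lo)
    return max(0.0, 1.0 - abs(x - centre) / half)

def fuse(x, local_models):
    """Weighted average of local model outputs according to their validity."""
    weights = np.array([validity(x, lo, hi) for lo, hi, _ in local_models])
    outputs = np.array([f(x) for _, _, f in local_models])
    if weights.sum() == 0.0:
        raise ValueError("operating point outside the validity of all models")
    return float(weights @ outputs / weights.sum())

# Two fictitious local models of the same output, valid at low and at high load.
local_models = [(0.0, 60.0, lambda x: 2.0 * x + 1.0),
                (40.0, 100.0, lambda x: 1.5 * x + 20.0)]
print(fuse(50.0, local_models))   # fused prediction in the overlapping region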

3.c Visualization is a very promising field because it makes it possible to merge the best of computer (calculation) capabilities with the best of human perception (cognitive) abilities [6]. There are several methods to visualize multidimensional data [8]. However, the visualization of the results of the knowledge discovery process is much more complex than the visualization of multivariate data. World-leading companies such as AT&T and Microsoft are dealing with these problems. The visualization of clustering results has also been dealt with by our research group; these methods can be improved, modified and extended, and new ones can be developed for model mining purposes. Visualization methods are based on similarity measures (for multivariate data this is a particular distance measure; for fuzzy sets it is based on fuzzy set similarity measures, etc.). For model mining purposes, there is a clear need to define several types of similarity measures according to the heterogeneous information sources (similarity of models, images, graphs, etc.). The existing methods are commonly used to project data or other types of information into two dimensions, as in the sketch below.
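A minimal sketch of such a projection, using a made-up dissimilarity matrix of three models and classical multidimensional scaling (one possible technique, not the one used by the authors), is the following:

# Sketch of projecting heterogeneous objects into two dimensions from a similarity/distance
# matrix only (classical multidimensional scaling); the distance matrix is a made-up example
# of pairwise model dissimilarities.
import numpy as np

def classical_mds(D, dim=2):
    """2-D coordinates whose Euclidean distances approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                    # double-centred squared distances
    eigval, eigvec = np.linalg.eigh(B)
    idx = np.argsort(eigval)[::-1][:dim]           # largest eigenvalues first
    return eigvec[:, idx] * np.sqrt(np.maximum(eigval[idx], 0.0))

D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 3.5],
              [4.0, 3.5, 0.0]])                    # pairwise dissimilarities of three models
print(classical_mds(D))                            # 2-D map suitable for visualization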

3.d Choosing and applying the model mining algorithm(s): selecting algorithms for searching for patterns. It is essential to involve the user or the experts in this key step, because it is often very difficult to formalize the complex and frequently contradictory goals. One possible solution is to increase the degree of interactivity; for that purpose the visualization of partial results is needed [9].

4. Interpreting and applying the mined patterns: whereas the "knowledge worker" judges the success of the application of the modeling and discovery techniques more technically, she or he later contacts domain experts in order to discuss the results and their applicability. Based on the results it may be necessary to return to any of steps 1-3 for further iteration.

Figure 5. Integrated information system for process analysis

Development of a sensitive data based model for process development and fault detection

The aim of the presented research is to propose a methodology that supports the development and validation of complex process models and simulators by increasing the quality of the measurement data. The developed method is suitable for fault diagnosis using process data and model warehouses that integrate plant-wide information. The process data warehouse contains non-volatile, consistent and preprocessed historical process data and works independently from the databases operating at the level of the control system. To extract information from the historical process data stored in the process data warehouse, tools for data mining and exploratory data analysis have been developed (see Fig. 5). The proposed analytical level consists of the models of the process and control systems. The stored data can be used to re-simulate past critical situations of the operation, and the measured and calculated data can be analyzed by statistical and data mining tools.

In the case of complex production processes it is often not sufficient to analyze only input-output data for process monitoring purposes. One reason is that historical process data alone do not have enough information content: they can be incomplete, or not measured frequently or at regular intervals. Another reason is that most of the used information consists of measured values, which are affected by errors that influence the quality of the models built on them (see Fig. 1). In these cases it is important to obtain reliable information about the state variables; therefore a (nonlinear) state estimation algorithm is needed.

According to the type of information used, model driven and data driven sensors can be distinguished. Model driven sensors are based on white-box (also referred to as mechanistic or a priori) models built on balance equations (mass, component, energy) and contain detailed physical-chemical information about the system. This approach performs well when there is a clear and good understanding of the mechanisms of the process. Unfortunately, in many practical applications processes are so complex and uncertain that these mechanisms are not sufficiently well understood for the successful application of this approach [10]. Furthermore, the development of first-principle models of complex industrial processes is a very difficult and time-consuming procedure. In particular, it is difficult to build precise first-principle models that can explain why defects appear in products. This is a critical issue, since product life cycles are getting shorter, and improving product quality and yield in the available time requires fast and adaptive solutions.

Data driven (black-box or a posteriori) models are built when no detailed knowledge is available about the process. In this case historical process data are used to build statistical models that determine the relationship between inputs and outputs. In the last decade these data-based approaches have been widely accepted. Many companies have built integrated databases to store historical process data from all plants, process units and product properties. Statistical regression methods have also become increasingly popular techniques for process modeling, and they are used for fault detection and quality estimation. Kalman filters are often used for model updating because of their good performance characteristics and simple implementation. Multivariate statistical models are powerful tools for fault detection. One of the most common methods is principal component analysis (PCA), a data-based variable reduction procedure that uses an orthogonal transformation to convert correlated variables into linearly uncorrelated variables (called principal components). It is often used in fault detection and isolation systems. Data driven methods are capable of including all relevant process measurements in a model without the problem of "overfitting" that is present in ordinary regression methods. In this way all the process information can be included in the model, leading to more precise predictions [10]. A minimal sketch of PCA-based fault detection follows.
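The following minimal sketch, using synthetic data rather than plant measurements, shows how a PCA model and the Q (squared prediction error) statistic used later in the case study can be computed:

# Sketch of PCA-based monitoring with the Q (squared prediction error) statistic;
# the training data here are synthetic and purely illustrative.
import numpy as np

def fit_pca(X, n_components):
    """Return the mean, standard deviation and loading matrix P (columns = retained PCs)."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mean) / std
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return mean, std, Vt[:n_components].T

def q_statistic(X, mean, std, P):
    """Squared prediction error (Q) of each sample after projection onto the PC subspace."""
    Xs = (X - mean) / std
    residual = Xs - Xs @ P @ P.T
    return np.sum(residual ** 2, axis=1)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))      # correlated "normal operation" data
mean, std, P = fit_pca(X_train, n_components=2)
threshold = np.percentile(q_statistic(X_train, mean, std, P), 99)  # simple empirical alarm limit

x_new = X_train[:1] + np.array([[0.0, 0.0, 0.0, 0.0, 8.0]])        # sample breaking the correlation structure
print(q_statistic(x_new, mean, std, P), threshold)                 # an alarm is raised when Q exceeds the limit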

The aim of our research is to develop a method that can be applied in an industrial environment and combines the advantages of a priori knowledge based models and data-driven multivariate statistical process monitoring tools. Data and a priori model driven approaches can be combined in several synergistic ways. The structure of these hybrid models is shown in Figure 6.

Figure 6. Soft sensor structures according to how property variables are estimated by the combination of white- and black-box models [11].

In the first case the output of the statistical model is the input of the physical model, which takes the form of differential or algebraic equations or a complex flowsheeting simulator. Here the statistical model is used to estimate parameters and phenomena that are difficult to model.

Combination 2 covers the case when the outputs of the physical model are transformed by a statistical model.

In the third option the differences between the measured and calculated variables are the inputs of a statistical model used for correction. Model based data reconciliation techniques are similar to this approach; a sketch of this combination is given below.
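A minimal sketch of combination 3, with purely illustrative models (not those of the paper), could look as follows: a simple regression learns the residual between measured values and a white-box calculation, and this correction is added to the physical model output.

# Sketch of combination 3: a black-box correction of the residual of a white-box model.
import numpy as np

def white_box(u):
    """Fictitious first-principles estimate of a product property from input u."""
    return 3.0 * u + 2.0

rng = np.random.default_rng(2)
u = rng.uniform(0.0, 10.0, size=200)
measured = 3.0 * u + 2.0 + 0.4 * np.sqrt(u) + rng.normal(scale=0.1, size=u.size)  # unmodelled effect + noise

residual = measured - white_box(u)
a, b = np.polyfit(u, residual, deg=1)          # simple black-box correction model

def hybrid(u_new):
    """Physical estimate plus the statistically learned correction."""
    return white_box(u_new) + a * u_new + b

print(hybrid(5.0), white_box(5.0))             # hybrid output vs. the uncorrected white-box one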

Data reconciliation aims at the reduction of random errors to enhance the quality of the data used for model development, resulting in more reliable process simulators. During data reconciliation, information comes from both the measurements and the process model, so the measured data can be made more accurate. The technique applies the minimal correction to the measured variables that makes them satisfy a set of model constraints (balance equations). It is a recommended step before fine-tuning model parameters, especially in the framework of real-time optimal control, where model fidelity is of paramount importance: there is no incentive in optimizing a model when it does not match the actual behavior of the real plant [12]. Typical steps of on-line data correction can be seen in Figure 7, and the underlying estimation problem is formalized below.

Figure 7. General methodology for on-line data correction [13]
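In general terms (using our own notation rather than the formulation of [12] or [13]), the "minimal correction" idea can be stated as a weighted least-squares problem:

$$\hat{x} = \arg\min_{x}\,(y - x)^{T}\Sigma^{-1}(y - x) \quad \text{subject to} \quad g(x) = 0,$$

where $y$ is the vector of measured values, $\Sigma$ is the covariance matrix of the measurement errors and $g(x) = 0$ collects the balance constraints. For linear constraints $Ax = 0$ the reconciled estimate has the closed form $\hat{x} = y - \Sigma A^{T}(A\Sigma A^{T})^{-1}Ay$.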

We present a novel method that effectively utilizes first principle models for the development, maintenance and validation of multivariate statistical models. We demonstrate that the performance of Principal Component Analysis (PCA) models used for process monitoring can be improved by model based data reconciliation. Figure 8 shows the main steps of the proposed approach.

Figure 8. Proposed algorithm for fault detection and process development
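As an illustration of the approach of Figure 8 (our reading of it, not code taken from the paper), the following sketch reconciles each new measurement vector against a linear balance constraint and then evaluates the Q statistic of a PCA model; all matrices, thresholds and data are illustrative placeholders.

# Compact sketch of the proposed monitoring loop: reconcile, then project with PCA.
import numpy as np

def reconcile(y, A, Sigma):
    """Minimal weighted correction so that the reconciled vector satisfies A x = 0."""
    K = Sigma @ A.T @ np.linalg.inv(A @ Sigma @ A.T)
    return y - K @ (A @ y)

def q_stat(x, mean, std, P):
    """Squared prediction error of one sample with respect to a PCA model (mean, std, loadings P)."""
    xs = (x - mean) / std
    r = xs - P @ (P.T @ xs)
    return float(r @ r)

def monitor(y, A, Sigma, mean, std, P, threshold):
    """Return the reconciled sample and an alarm flag for one measurement vector."""
    x_hat = reconcile(y, A, Sigma)
    return x_hat, q_stat(x_hat, mean, std, P) > threshold

# Toy usage with a three-variable balance y1 + y2 - y3 = 0 and a one-component PCA model.
A = np.array([[1.0, 1.0, -1.0]])
Sigma = np.diag([0.1, 0.1, 0.2])
mean, std = np.zeros(3), np.ones(3)
P = np.array([[1.0], [1.0], [2.0]]) / np.sqrt(6.0)   # single, normalized loading vector
print(monitor(np.array([1.0, 2.0, 3.2]), A, Sigma, mean, std, P, threshold=0.5))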

Results and discussion

The proposed concept has been applied to the well-known Tennessee Eastman process. The system includes five major unit operations: a reactor, a condenser, a gas-liquid separator, a compressor and a stripper. The gaseous reactants are fed to the reactor, where four exothermic and irreversible reactions take place. The gas-phase reactor product stream passes through a cooler that condenses the products, and the two phases (products and reactants) are separated in the separator. The reactants are recirculated; a purge stream removes inert components and byproducts from the system. The liquid stream of the separator contains dissolved reactants, which are removed in the stripper. The bottom stream of the stripper is the product; the overhead stream is recycled back to the reactor feed.

In the benchmark problem 41 measured variables, 12 manipulated variables and 20 disturbances are defined [14]. Disturbances and set-point changes appear in the technology, and we analyzed how different data-based methods can identify them. In the figures, the area between dashed lines shows when a disturbance affects the system, while dotted lines indicate set-point changes. Table 1 summarizes the time intervals and disturbances.

Table 1. Disturbances in the system

Time interval (h)    Type of disturbance
75-100               Change in A/C feed ratio (IDV(1))
400-470              Unknown disturbance (IDV(16))
850-920              Unknown disturbance (IDV(18))
530-1000             Purge valve set point changed from 0 to 10 %
700-1000             Stripper level set point changed from 50 to 45 %

Data reconciliation is applied to ensure that the input and output mass flows satisfy the mass balance of the technology:

$$F_{A,\mathrm{feed}} + F_{D,\mathrm{feed}} + F_{E,\mathrm{feed}} + F_{C,\mathrm{feed}} - F_{\mathrm{purge}} - F_{\mathrm{product}} = 0 \qquad (1)$$

The measured data can be reconciled using the model equation (Eq. (1)) and the covariance matrix of the measurement errors. The result is demonstrated in Figure 10, which shows the outcome of data reconciliation for the E feed. As can be seen, the difference between the measured and reconciled values is small, but the reconciled values satisfy Eq. (1).

[Plot: mass flow of E feed (kg/h) versus time (s); measured and reconciled values.]

Figure 10. Comparison of measured and reconciled data
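As a numerical illustration of this reconciliation step (the flow values and the 1 % measurement error assumption below are made up, not taken from the simulation), a measured flow vector can be corrected against Eq. (1) in closed form:

# Toy illustration of reconciling measured flows against Eq. (1).
# The constraint vector corresponds to the A, D, E and C feeds (+) and the purge and
# product streams (-); the flow values and variances are illustrative only.
import numpy as np

a = np.array([1.0, 1.0, 1.0, 1.0, -1.0, -1.0])                     # Eq. (1) written as a . x = 0
y = np.array([11900.0, 3660.0, 6180.0, 9380.0, 1430.0, 29600.0])   # "measured" flows (kg/h)
Sigma = np.diag((0.01 * y) ** 2)                                    # assumed 1 % measurement standard deviation

# Weighted least-squares reconciliation with a single linear constraint.
x_hat = y - Sigma @ a * (a @ y) / (a @ Sigma @ a)
print(a @ y, a @ x_hat)                                             # balance residual before / after reconciliation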

Figure 11. Effect of data reconciliation on the sensitivity of PCA: a) measured data are projected; b) reconciled data are projected; c) results of the original PCA method.

As a further step of life-cycle modeling we utilized the flowsheeting simulator to determine the projection matrix of the PCA model. With the simulator we generated an informative dataset that describes the behavior of the system. In this case external disturbances do not affect the process and there is no measurement noise, so the data satisfy the balance equations. After determining the optimal number of principal components, the alarm threshold was set. Then we designed the process monitoring system based on the projection of the measured raw process values (see Fig. 11a). As can be seen, the projection error (Q) is always higher than the alarm threshold. Real disturbances cannot be separated and not all faults and set-point changes can be identified, so this method alone is not sensitive enough for fault detection.

We also studied a combined approach in which the reconciled process variables were projected (Fig. 11b). If the measured values are pre-processed by data reconciliation, we get perfect alarming: signals related to faults are generated only when real disturbances affect the process. The reason is that data reconciliation filters the random noise, so the remaining systematic errors can be determined using PCA.

To demonstrate the benefit of the proposed approach we compare the result with that obtained by the classical data-driven PCA model. In this case the projection matrix is identified based on the measured - erroneous - data. If this matrix is used to project the measured values (Figure 11c), alarms are raised almost all the time, so we are not able to distinguish the different faults. The comparison of cases "a" and "c" shows that the model driven projection has significant benefits; it is worth extracting the PCA model from flowsheeting simulators. The drawback is that this approach assumes an available and accurate process simulator. Data-driven PCA also has drawbacks: the quality of the data used for the identification of the PCA model significantly affects the performance of the projection, and the PCA model is defined in a narrow operating regime, so it is difficult to select accurate yet informative data for multivariate statistical modeling, contrary to the simulation based approach [15].

Conclusions

The main motivation of our research is to create an integrated framework into which data mining, modeling and simulation, and advanced process development tools can be incorporated. To achieve model integrity, existing models should be reused, missing models should be created, and all the models should be connected in an appropriate way.

Simulators are often applied to improve industrial processes, optimize operation and identify bottlenecks of a technology. Historical process data can be used for the identification and verification of the models utilized by these tools. Usually, measured data do not satisfy the balance equations, because all measurements are to a certain extent incorrect. This is a disadvantage of data driven models, because erroneous measured data influence the accuracy of the models. Thus it is necessary to develop a method which can simultaneously and iteratively improve data quality and model performance. A method based on the data reconciliation technique has been developed for this purpose.

The applicability of the proposed algorithm is analysed in a detailed case study. We show that PCA and data reconciliation techniques are independently suitable for fault detection. However, when the effect of a fault on the monitored process variables is comparable to the normal variation of the process, data-driven PCA models are not able to isolate these faults, and data reconciliation based process monitoring can identify only faults that affect the balance equations. The results show that the combined method improves the application of data reconciliation and PCA in fault detection, and the algorithm gives a more sensitive model for detecting faults from the changes of the process.

When the models of the process and control systems are integrated into a process data warehouse, the resulting structure supports engineering tasks related to the analysis of system performance, process optimization, operator training (OTS), reverse engineering, and decision support (DSS) systems.

Authors' contributions

Janos Abonyi formalized the concept of model mining and summarized the steps of the method. Barbara Farsang summarized the combined method of data reconciliation, principal component analysis and flowsheeting simulator and drafted the case study. All authors drafted and approved the final manuscript.

Authors' information

AJ is a full professor of computer science and chemical engineering at the Department of Process Engineering of the University of Pannonia. He received his MEng and PhD degrees in chemical engineering in 1997 and 2000 from the University of Veszprém, Hungary. In 2008 he earned his Habilitation in the field of process engineering, and he received the DSc degree from the Hungarian Academy of Sciences in 2011. In the period 1999-2000 he was employed at the Control Laboratory of the Delft University of Technology (the Netherlands). He has co-authored more than 200 journal papers and book chapters and has published three research monographs and one Hungarian textbook on data mining. His research interests include process engineering, quality engineering, data mining and business process redesign.

Barbara Farsang is a PhD student at the Department of Process Engineering of the University of Pannonia (Hungary). She is a chemical engineer and received her MSc degree in 2013. Her research deals with the role of data reconciliation and principal component analysis techniques in the validation of process systems and the development of simulators.

Acknowledgements

The research was realized in the frame of TÁMOP 4.2.4.A/2-11-1-2012-0001 "National Excellence Program – Elaborating and operating an inland student and researcher personal support system convergence program". The infrastructure of the research was supported by the TÁMOP 4.2.4.A/11/1-KONV-2012-0071 project. The projects were subsidized by the European Union and co-financed by the European Social Fund.

References

1. Krieger JH (1995) Process Simulation Seen As Pivotal In Corporate Information Flow. Chemical & Engineering News March 27

2. Gershenfeld N (1999) The nature of mathematical modeling. Cambridge University Press

3. Abonyi J (2003) Fuzzy Model Identification for Control. Birkhauser Boston

4. Charpentier JC (2005) Four main objectives for the future of chemical and process engineering mainly concerned by the science and technologies of new materials production. Chemical Engineering Journal 107:3–17. doi: 10.1016/j.cej.2004.12.004

5. Grossmann IE, Westerberg AW (2000) Research challenges in process systems engineering. AIChE Journal 46(9):1700–1703. doi: 10.1002/aic.690460902

6. Valdes J, Barton A (2006) Virtual Reality Spaces: Visual Data Mining with a Hybrid Computational Intelligence Tool. NRC/ERB-1137 (NRC 48501)

7. Abonyi J, Feil B (2007) Cluster Analysis for Data Mining and System Identification. Birkhauser, 300 pages

8. Tenenbaum JB, Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323. doi: 10.1126/science.290.5500.2319

9. Yan L, Bogle IDL (2007) A visualization method for operating optimization. Computers and Chemical Engineering 31:808–814. doi: 10.1016/j.compchemeng.2006.08.014

10. Kadlec P, Gabrys B, Strandt S (2009) Data-driven soft sensors in the process industry. Computers and Chemical Engineering 33(4):795-814. doi: 10.1016/j.compchemeng.2008.12.012

11. Nakaya M, Xinchun L (2013) On-line tracking simulator with a hybrid of physical and just-in-time models. Journal of Process Control 23(2):171-178. doi: 10.1016/j.jprocont.2012.06.007

12. Narasimhan S, Jordache C (2000) Data Reconciliation & Gross Error Detection: An Intelligent Use of Process Data. Gulf Publishing Company, Houston, Texas, ISBN 0-88415-255-3

13. Bellec S, Jiang T, Kerr B, Diamond M, Stuart P (2005) On-line processing and steady state reconciliation of pulp and paper mill process data. 91st Annual Meeting Preprints-Book B pp. 105–110

14. Downs JJ, Vogel EF (1993) A plant-wide industrial process control problem. Computers & Chemical Engineering 17(3):245–255. doi: 10.1016/0098-1354(93)80018-I

15. Farsang B, Németh S, Abonyi J (2014) Life-cycle Modelling for Fault Detection – Extraction of PCA Models from Flowsheeting Simulators. 24th European Symposium on Computer Aided Process Engineering, Computer Aided Chemical Engineering, ISBN 978-0-444-63234-0, accepted