Searching for optimal variables in real multivariate stochastic data

arX

iv:1

111.

2008

v1 [

phys

ics.

data

-an]

8 N

ov 2

011

Optimal variables for describing the evolution of NO2 concentration

F. Raischela, A. Russob, M. Haasec, D. Kleinhansd,e, P.G. Linda, f

aCenter for Theoretical and Computational Physics, University of Lisbon, Av. Prof. Gama Pinto 2, 1649-003 Lisbon, Portugal

bCenter for Natural Resources and the Environment, Instituto Superior Tecnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal

cInstitute for High Performance Computing, University of Stuttgart, Nobelstr. 19, D-70569 Stuttgart, Germany

dInstitute for Marine Ecology, University of Gothenburg, Box 461, SE-405 30 Goteborg, Sweden

eInstitute of Theoretical Physics, University of Munster,D-48149 Munster, Germany

f Departamento de Fısica, Faculdade de Ciencias da Universidade de Lisboa, 1649-003 Lisboa, Portugal

Abstract

By implementing a recent technique for determination of stochastic eigendirections of two coupled stochastic variables, we inves-tigate the evolution of fluctuations of NO2 concentrations at two monitoring stations in the city of Lisbon, Portugal. We analyzethe stochastic part of the measurements recorded at the monitoring stations by means of a method where the two concentrationsare considered as stochastic variables evolving accordingto a system of coupled stochastic differential equations. Analysis of theirstructure allows for transforming the set of measured variables to a set of derived variables, one of them with reduced stochasticity.For the specific case of NO2 concentration measures, the set of derived variables are related to the total amount of NO2 concentra-tion observed in the two stations and the relative concentration between them. Having one variable with reduced stochasticity, webriefly discuss how to increase the predictive power of the measurement series.

Keywords: Stochastic Systems, Environmental Research, Pollutants,Langevin Equation, PredictabilityPACS:[2010] 02.50.Ga, 02.50.Ey, 92.60.Sz,

\rcsInfo $Id: no2_rrhkl.tex,v 1.35 2011/11/08 18:17:31 stud Exp $

Preprint submitted to arxiv November 9, 2011

http://arxiv.org/abs/1111.2008v1

Optimal variables for describing the evolution of NO2 concentration

F. Raischela, A. Russob, M. Haasec, D. Kleinhansd,e, P.G. Linda, f

aCenter for Theoretical and Computational Physics, University of Lisbon, Av. Prof. Gama Pinto 2, 1649-003 Lisbon, Portugal

bCenter for Natural Resources and the Environment, Instituto Superior Tecnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal

cInstitute for High Performance Computing, University of Stuttgart, Nobelstr. 19, D-70569 Stuttgart, Germany

dInstitute for Marine Ecology, University of Gothenburg, Box 461, SE-405 30 Goteborg, Sweden

eInstitute of Theoretical Physics, University of Munster,D-48149 Munster, Germany

f Departamento de Fısica, Faculdade de Ciencias da Universidade de Lisboa, 1649-003 Lisboa, Portugal

Contents

1 Introduction 2

2 NO2 measurements in Lisbon’s stations 3

3 Modeling stochasticity in series of NO2 concentra-tions: Langevin processes and Markov properties 4

4 Deriving optimal variables: eigensystem for NO2

measurements at different stations 6

5 Transform of NO2 concentrations to the stochasticeigendirections 7

6 Comparison with standard methodologies 9

7 Discussion and Conclusion 10

1. Introduction

The industrial and urban development during the last decadeshas led to a general decrease of air quality, drastically affectingurban environmental and human life quality. Although, accord-ing to the European Environment Agency report[1], air qualityhas improved in general during the last years, this enhancementwas not significant enough to ensure good air quality in all ur-ban areas. One of the pollutants with negative impact on healthand environment is NO2. Anthropogenic NO2 is mainly emittedby vehicles and industrial processes. NO2 has not only severeeffects on health causing e.g. respiratory and cardiovasculardis-eases, it also affects the environment[2] as nitrogen depositionleads to eutrophication[3]. A better understanding of the mech-anisms that influence production, transport, and decompositionof NO2 is therefore important. Previous studies revealed thattemperature, wind speed and direction, relative humidity,cloudcover, dew point temperature, sea level pressure, precipitation,and mixing layer height are relevant meteorological variablesto model the concentrations of air pollutants[2, 4, 5, 6, 7].Inparticular, approaches that deal with the evolution of the NO2

Figure 1: NO2 measurement stations in the region of Lisbon (Portugal) at theSouthwestern coast of Europe. In this paper we focus on the set of measure-ments taken at the stations of Chelas and Avenida da Liberdade with approx-imately 105 data points each extracted in the period between 1995 and 2006.For the statistical features of these data sets see Tab. 4, first column.

concentration at individual city spots are important for forecast-ing the air quality of urban regions.

Recently a framework[9, 10] for analyzing measurements oncomplex systems was introduced, aiming for a quantitative es-timation of drift and diffusion functions from measured data.These functions can be identified with the deterministic andstochastic contributions to the dynamics, respectively, and givea considerable insight into the underlying systems. The frame-work was already successfully applied for instance to describeturbulent flows[9] and the evolution of climate indices[11,29],stock market indices[12], and oil prices[13]. At the same time,the basic method has been refined in particular with respect tothe impact of finite sampling effects [14, 15], the impact of mea-surement noise[16, 17, 18], and the role of local eigendirectionsof the diffusion matrices [19].

In this paper, we adapt this framework for analyzing mea-surements of NO2 concentrations in the metropolitan region of

Preprint submitted to arxiv November 9, 2011

Lisbon, Portugal (see Fig. 1), taken over several years. We ar-gue that the temporal fluctuations of these concentrations re-sult from two independent contributions: one periodic and onestochastic. The periodic part describes daily, weekly, seasonaland yearly variations of the concentration, which is an acceptedand well-studied result[8].

When addressing stochastic higher-dimensional systems itistypically difficult to identify the variables most relevant for aproper description of the system’s evolution. In geophysicalapplications, the reduction of the full set of variables to only afew variables often is achieved by means of the so-called Princi-pal Component Analysis (PCA)[21] or other standard reductionmethods, such as stepwise regression or ARIMA[22]. However,the inherent fluctuations are not so commonly investigated.

In this paper we will apply a recent method for reconstruct-ing the phase space of two stochastic variables, which evolveaccording to a set of two coupled stochastic equations definedthrough drift vectors and diffusion matrices [19]. The method isbased on the eigenvalues of the diffusion matrices, from whichit is possible to derive a path in phase space through which thedeterministic contribution is enhanced. This technique, hence,allows for the investigation of the minimal number of indepen-dent sources of stochastic forcing in the system. In particular, arather small eigenvalue of the diffusion matrix, compared to theaverage value of all the other, corresponds to one eigendirec-tion in which stochastic fluctuations may be neglected, reduc-ing the number of stochastic variables taken for describingthesystem’s evolution. Therefore, we argue that the diagonaliza-tion of the diffusion matrices gives insight into the system andthat the measurements in the transformed coordinates show anincreased degree of predictability.

We start in Sec. 2 by describing the properties and the prepa-ration of the data set. Consecutively, in Sec. 3, the modelingof the time series as a Langevin process is carried out and itstransformation to a new coordinate system given by the small-est eigenvalues are described in Secs. 4 and 5 respectively.InSec. 6 we discuss the performance of the transformation of thecoordinates obtained by our approach compared to other tech-niques commonly used for statistical analysis of measured data.Section 7 closes this Letter with a general summary and ideason the interpretation of the transformed time series with respectto the underlying environmental processes.

2. NO2 measurements in Lisbon’s stations

In this section we briefly describe the sets of data analyzed inthis paper as well as its preparation for analyzing the stochasticcomponents of the measurements.

The data set covers hourly measurements of NO2 concentra-tion, c(NO2) taken at 22 stations in the urban center of Lisbonrecorded from of 1995 to 2006. For this study we choose thedata from 1995 to 2005 for the monitoring stations at Chelasand at Avenida da Liberdade. These stations are located at a dis-tance of∼ 4 km from each other, see Fig. 1. In the following,the NO2 concentrations at the stations of Chelas and Avenidada Liberdade will be designated asy1(t) andy2(t), respectively,omitting the temporal dependency when not necessary.

Figure 2: (a) Time series of the NO2 concentration at the station of Chelas,before detrending according to Eq. (2), and(b) a zoom-in of these “raw” timeseriesy1 compared to the detrended seriesx1, which takes averages at each dayfrom a succession of “13-weeks” period. Vertical offset of same plots are donefor clarity. For the station at Avenida da Liberdade similarfeatures are found.

Each of the data sets contains 105 measurement points ap-proximately, including some periods of incomplete or erro-neous measurements that are disregarded for our analysis. Inthe case of the chosen stations, the series of measurementsy1

andy2 contain 13046 and 8553 instances of measurement er-rors, respectively. These erroneous values have been excludedfrom the following discussion. The statistical characterizationof y1 andy2 is summarized in Tab. 4.

The concentration of NO2 is strongly driven by daily andweekly, monthly and yearly periodic routines and also periodicatmospheric processes. For instance the rush hours on workingdays have an almost immediate impact on the NO2 concentra-tion, and thus, on air quality. The 24 hours and one week cyclesare both traffic related and mirror daily and weekly cycles. Themeasurements of NO2 are therefore influenced by different pe-riodic forcings and since we are interested in the fluctuationsof NO2 concentrations, the periodic behavior must be first de-trended.

The NO2 concentration data sets are detrended by groupingN successive data points in one group, and dividing the entiredata set intoM groups of the same lengthN. For convenience,the measurementsyi(t), with t = 1, . . . ,T andi = 1, 2, are now

3

labeled asyi((m−1)N+n), with n = 1, . . . ,N andm= 1, . . . ,M,such thatT = MN. For eachn we then consider the average

〈yi((m− 1)N + n)〉(N,n) =1M

M∑

m=1

yi((m− 1)N + n), (1)

and subtract from it each data point accordingly, yielding twodetrended variablesyi((m− 1)N + n) for i = 1, 2 as

yi(t) → yi(m, n) ≡ yi((m− 1)N + n)xi(t) = yi((m− 1)N + n) − 〈yi((m− 1)N + n)〉(N,n). (2)

To take into account all known periodicities mentioned abovewe choseN = 13 weeks. Increments in time are always of 1hour.

Figure 2a shows the original datay1 for the station of Chelas.A zoom-in of a small time interval is plot in Fig. 2b togetherwith the corresponding detrended datax1 for N = 13 weeks.

From now on, we will only consider the detrended time seriesx1 and x2 unless otherwise noted, and we next describe theircharacteristics by means of a stochastic process.

3. Modeling stochasticity in series of NO2 concentrations:Langevin processes and Markov properties

The detrended seriesxi in Eq. (2) reflect the remainingstochastic components of the measurements at the respectivestations of Chelas and Avenida da Liberdade. In this sectionwe assume that, with two variables, the stochastic process ismodeled by a system of two coupled Langevin equations, con-taining one deterministic and stochastic part, described throughone drift vector and one diffusion matrix, respectively.

For the general case of aK-dimensional state vectorX =(x1, ..., xK), the Ito-Langevin equations describing the evolutionof a particular trajectory in time read [23, 24]:

dXdt= h(X) + g(X)Γ(t), (3)

whereΓ = (Γ1, . . . , ΓK) is a set ofK independent stochasticforces with Gaussian distribution fulfilling

〈Γi(t)〉 = 0 (4a)

〈Γi(t)Γ j(t′)〉 = 2δi jδ(t − t′). (4b)

The two terms on the right hand side of Eq. (3) include both thedeterministic contribution,h = {hi}, and the stochastic contri-bution,g = {gi j }. The deterministic contribution describes thephysical forces which drive the system, while functionsg ac-count for the amplitudes of the different sources of fluctuationsΓ[10].

The coefficientsh andg are directly related to the drift vec-tors and diffusion matrices[23]

D(1)i (X) = hi(X) (5)

D(2)i j (X) =

K∑

k=1

gik(X)g jk(X) (6)

for i, j = 1, . . . ,K, describing the evolution of the joint proba-bility density function (PDF)f (X, t) by means of the Fokker-Planck equation [23, 24]:

∂

∂tf (X, t) = −

K∑

k=1

∂

∂xkD(1)

k (X) f (X, t)

+

K∑

k=1

K∑

m=1

∂2

∂xk∂xmD(2)

km(X) f (X, t). (7)

As done previously in other contexts[10, 11, 12, 13, 14, 16,17], the drift vector and the diffusion matrix can be derived di-rectly from the data.

Statistically, the drift and diffusion coefficients coefficientsD(1)

i andD(2)i j are defined as

D(k)(X) = limτ→0

1τ

M (k)(X, τ)k!

, (8)

with the first and second conditional moments, given by

M(1)i (X, τ) = 〈Yi(t + τ) − Yi(t)|Y(t) = X〉 (9a)

M(2)i j (X, τ) = 〈(Yi(t + τ) − Yi(t))

·(Yj(t + τ) − Yj(t))|Y(t) = X⟩

, (9b)

Here 〈·|Y(t) = X〉 symbolizes conditional averaging over allevents that fulfill the conditionY(t) = X.

To determine the underlying Langevin equations, defined inEq. (3), one additionally needs to solve Eqs. (4) and (6). In par-ticular, the calculation of matricesg from the diffusion matricesrequires to solveD(2) = ggT , e.g. by means of diagonalization.The matricesD(2) are symmetric and positive semi-definite withall their eigenvalues real and non-negative (see Eq. (9b)).Thus,we defineg as the back transformed diagonal matrix with ele-ments

√Dii multiplied by an arbitrary orthogonal transforma-

tion O, obeyingOOT = 1, which for simplicity is taken as theidentity matrix[19]. For any other choice ofg our analysis doesnot change.

The computation of the conditional moments is based ontheir statisticalτ-dependence for smallτ[10, 17]. Previousworks showed that Eqs. (9a) and (9b) are an operational defini-tion of the conditional moments that can easily be implementedfor the direct estimation of the drift and diffusion coefficientsfrom the data[10, 17]. In some practical situations, the limitin Eq. (8) can be approximated by the slope of a linear fit ofthe corresponding conditional moments at smallτ. When suchlinear fit is not possible, an alternative estimate is to considerthe first value ofM(τ)/τ at the lowest value ofτ[14]. We willuse this latter estimate for deriving the drift and diffusion coef-ficients, underlying the evolution of NO2 concentration in Lis-bon.

With such framework, we consider the two-dimensional sys-tem of NO2 concentrationsX = (x1, x2) describing the fluctu-ations at the stations of Chelas and Avenida da Liberdade, seeFig. 1. In order to comply with a Langevin process, as definedin Eq. (3), we first verify that both data sets exhibit Markovianproperties, which we show next for componentx1 only, for sakeof clarity. Forx2 the results are similar.

4

−15 0 15x

(1)1

0.0

0.1

0.2

0.3

p

x(2)1 =−0.00

c)

−15 0 15x

(1)1

0.0

0.1

0.2

0.3

p

x(2)1 =5.82

d)

−15

0

15

x(2)

1

a)

0.200

0.200

0.10

00.1

00

0.05

0

0.05

0

0.010

0.01

0

0.005

0.00

5

0.001−15

0

15

x(2)

1

b)

0.200

0.200

0.100

0.100

0.050

0.05

0

0.010

0.01

0

0.005

0.00

5

p(x(1)1 ,t|x (2)

1 ,t−τ2 )

p(x(1)1 ,t|x (2)

1 ,t−τ2 ;x(3)1 ,t−τ3 )

Figure 3: Contour plots of conditional probabilities (solid curves) and con-ditional two-point probabilities (dashed curves) computed from the detrendedtime seriesx1 and x2 with τ2 = 1 hour for(a) τ3 = 2 hours and(b) τ3 = 10hours. The corresponding cuts through contour planes, indicated by the hori-zontal dashed lines, are shown in(c) and(d) with a good matching between therespective one-point and two-point conditional probabilities. The distributionswere computed with 13 bins for each variable using a sample of105 data points.

The Markovian nature of the variablex2 can be verified byconsidering the differences between the conditional one-pointprobability p(x(1)

1 , t|x(2)1 , t − τ2) and the conditional two-point

probability p(x(1)1 , t|x

(2)1 , t − τ2; x(3)

1 , t − τ3). If the process isMarkovian on time scales larger thanτ2, then these probabilitydistributions should not differ significantly[10] for any choiceof τ3. Indeed, as can be seen from Fig. 3, the Markovian prop-erties seem to be fulfilled forτ2 = 1h and bothτ3 = 2h andτ3 = 10h . We therefore assume that the process is Markovianalready at the sampling rate of the data points of 1h and for timelags longer than 1h.

Further, it is also necessary to check the Gaussian nature ofthe stochastic forceΓ and ascertain it indeed obeys Eqs. (4). Us-ing the measured time series and the estimated KM-coefficients,the noiseΓ(t) can be reconstructed from a numerical discretiza-tion of Eq. (3) solved with respect toΓ[25], namely solving

Γ = g−1(X)(

X − h(X))

, (10)

whereX = X(t + 1)− X(t) andh(X) andg(X) are evaluated atX = X(t).

The resulting noise is analyzed with respect to its autocorre-lation, shown in Fig. 4: the autocorrelation decays to zero forthe very first values ofτ, which strongly supports to treatΓ1 andΓ2 as a white,δ-correlated noise source.

For ascertaining the Gaussian nature of the stochastic sourceswe plot in the inset of Fig. 4 the probability density function ofthe reconstructed noise time series (Γ1, Γ2)(t) against a Gaussiandistribution.

As one sees from the inset, in the range comprising over95% of the Gaussian noise, the distributions for the stochas-tic sources are well approximated by a Gaussian distribution.

0 10 20 30 40 50τ

0.0

0.5

1.0

ac

−10 −3 0 3 10noise

10-6

10-4

10-2

100

pdf

Gauss

Figure 4: Autocorrelation of the reconstructed dynamical noises Γ1,Γ2(stochastic forcing), indicating that they areδ-correlated. The inset shows thepdf of the reconstructed noise normalized to variance 1 (lines) and a normaldistribution for comparison (dashed line).

However, the pdfs for both measurement stations also showexponential tails resulting in deviations from the normal dis-tribution, significant in particular for large contributions. Thisdiscrepancy indicates that the estimation procedure or thesta-tionary Langevin model might not apply to the data at any pre-cision, an observation not unusual in the analysis of long-termfield measurements. We find it reasonable to assume, however,that the data series can be approximated sufficiently well by aFokker-Planck equation, enabling us to obtain at least a quali-tative picture of the principal directions of stochasticity in thefollowing.

From the tests above one may satisfactorily take the seriesx1

andx2 as a set described by two coupled Langevin Equations,Eq. (3) with K = 2. Next we derive these equations from thesets ofx1 andx2.

As Fig. 5 indicates, the conditional moments of the time se-ries are well behaved, both before and after apply detrendingmethods, suggesting that measurement noise can be neglected[16]. For largeτ > τc the quotientM(n)/τ decays to zero. Forsmall τ < τc some oscillations are observed and the estimatewe chose is at the smallest value, namely atτ = 1.

h1 h2 D11 D12 D22

mean 3.89e-01 4.69e-01 4.97e+01 2.14e+01 8.81e+01var 1.10e+00 3.53e+00 8.89e+02 1.24e+02 1.37e+03a1 -1.2e-03 -1.8e-03 1.4e-02 2.2e-03 1.3e-02a2 2.5e-05 -4.3e-04 3.2e-03 4.7e-04 7.0e-03a3 5.1e-04 8.3e-04 -7.1e-03 3.5e-03 -2.1e-03a4 4.5e-03 1.2e-01 1.1e+00 1.9e-01 2.2e-01a5 -1.8e-02 -3.8e-02 -3.6e-01 -2.1e-01 2.2e-01a6 1.9e+00 1.6e+00 1.2e+01 1.5e+01 1.7e+01

Table 1: Statistical features of the series for NO2 concentration series: Mean,variance, and fitting coefficients defining drift and diffusion functions accordingto Eq. (11).

As can be seen in Fig. 6 and the accompanying Tables 1 and

5

τ

−2

0

2

M(1)/τ

Raw Data

τ

Detrended

0 2 4 6 8

τ

0

100

200

M(2)/τ τc =5

0 2 4 6 8

τ

(a) (b)

(c) (d)

(−18.4,−27.5)(−9.2,−13.8)(−0.0,−0.0)(13.8,20.6)

Figure 5: First conditional momentsM(1) for (a) the original series and(b) thedetrended series, with different NO2 concentrations (x1, x2) at each one of thetwo stations (see legend). The corresponding second conditional momentsM(2)

are shown in(c) and(d), respectively. These moments are computed accordingto Eqs. (9a) and (9b). While the original data presents oscillations beyond agiven time intervalτc ∼ 5, the detrended time series does not (see text). Thevalue of the corresponding Kramers-Moyal coefficient at the value of (x1, x2)chosen is given by Eq. (8) for the lowest value ofτ, i.e. one.

h1 h2 D11 D12 D22

mean 3.22e-01 3.18e-01 3.84e+01 1.04e+01 6.93e+01var 9.57e-01 2.70e+00 3.63e+02 3.86e+01 8.99e+02a1 -9.4e-04 -1.8e-03 6.6e-03 1.4e-03 7.7e-03a2 -4.1e-04 -7.5e-04 2.2e-03 -7.6e-04 1.1e-02a3 6.3e-04 1.3e-03 -4.6e-04 5.7e-03 3.4e-03a4 -4.7e-02 2.5e-02 1.2e+00 2.6e-01 8.0e-01a5 1.8e-02 -5.7e-02 9.6e-02 6.4e-02 1.1e+00a6 7.1e-01 1.1e+00 3.6e+01 1.0e+01 6.2e+01

Table 2: Statistical features of the detrended series obtained through Eq. (2)from the original data addressed in Tab. 1.

2, the resulting components of the drift and diffusion coeffi-cients are adequately fitted by a quadratic polynomial

p(z1, z2) = a1z21 + a2z2

2 + a3z1z2 + a4z1 + a5z2 + a6 , (11)

wherep denotes the drift and diffusion components,D(1)i and

D(2)i j respectively, and the coefficientsai are computed from a

least-square procedure on the drift and diffusion components asfunctions of the respective raw (y1, y2), detrended (x1, x2), andtransformed coordinates (rg, θg) to be introduced in Section 5.

Table 1 gives the coefficients in Eq. (11) for the originaltime-series, while Table 2 lists the coefficient values for the de-trended time-series. Further,D(1) andD(2) for detrended seriesare plotted in Fig. 6 with the corresponding fit surface.

4. Deriving optimal variables: eigensystem for NO2 mea-surements at different stations

Having successfully determined the drift and diffusionconstants describing the respective deterministic forcing and

stochastic fluctuations of the system of NO2 concentration mea-surements, we now determine the eigensystem of the diffusionmatrices and investigate its principal directions. This procedurewas described in detail in [19] and was previously applied toatwo-dimensional sub-critical bifurcation[26] and to the analysisof human movement[27]. It will be briefly outlined here, forKvariables.

Diffusion matrices are numerically estimated on a mesh ofpoints in phase space, as shown for example in Fig. 6c-e. Thenat each mesh point theK eigenvalues and corresponding eigen-vectors of the estimated matrices are calculated. The diffusionmatrices contain information about the stochastic forcingactingon the system and we use the local eigensystems of the matrixfor a further characterization of these forces. In particular, avanishing eigenvalue indicates that the corresponding stochas-tic force may be neglected.

We are looking for a transform of the original coordinatesX = {xi} into new onesX = {xi}, such that the new coordinatesare aligned in the directions of the eigenvectors of the diffusionmatrix in each mesh point, i. e. the principal direction in whichthe diffusion matrix is diagonal. Diagonalizing the diffusionmatrix decouples the stochastic contribution in the set of vari-ables, and if the eigenvalues in the transformed coordinates aresignificantly different, we are able to restrict our investigationto the coordinates with lower stochasticity, which are the oneshaving higher predictive power.

We therefore look for a two-times continuously differentiablefunctionF with

X = F(X, t), (12)

for which[23, 24] the deterministic and stochastic parts intheLangevin systems of equations, transform respectively as[19]

h(1)i (X) =

N∑

k=1

(

h(1)k (X)

∂Fi

∂xk

+

N∑

l=1

N∑

j=1

gl j (X)gk j(X)∂2Fi

∂xk∂xl

)

(13)

gi j (X) =

N∑

k=1

gk j(X)∂Fi

∂xk(14)

where the second equation readsg(X) = J(X)g(X), with J(X)the Jacobian of our transformationF. For reasons of clarity wein the following do not explicitly notate the dependence onXandX.

The eigenvectorsuk of matrices g with coordinates inlocal bases ei , can be incorporated in matricesU =

[u1 u2 . . . uK ]. Defining U asU = JTU one then obtains(see Eq. (14))

UT ggTU = UTgTgU. (15)

By definition the inverse transformF−1(X) is chosen suchthat the normalized eigenvectors are given by

uk =1sk

∂F−1

∂xk(16)

6

x1

−20−10

010

20x2

−30 −20 −10 0 10 20 30

h1

−3

0

3

a)

x1

−20−10

010

20x2

−30 −20 −10 0 10 20 30

h2

−3

0

3

b)

x1

−20−10

010

20

x2

−30−20

−10010

2030

D11

0

50

c)

x1

−20−10

010

20

x2

−30−20

−10010

2030

D12

0

50

d)

x1

−20−10

010

20

x2

−30−20

−10010

2030

D22

0

50

e)

Figure 6: For the detrended series we plot(a-b) both components of the drift vectorh = (h1, h2) and(c-e) the components of diffusion matrixD(2) = {D(2)i j }. The

corresponding fitted surfaces (black) are vertically offset for clarity. SinceD(2) is symmetric (see text) one hasD(2)12 = D(2)

21 .

with

sk =

∣

∣

∣

∣

∣

∣

∂F−1

∂xk

∣

∣

∣

∣

∣

∣

. (17)

i.e. the respective square sum of the columns in the Jacobianofthe inverse transform. Taking into account this scaling factorthe eigenvalues in the new coordinate system can be calculated[19],

D(2) = diag

λ1

s21

λ2

s22

. . .λK

s2K

. (18)

In general, the eigenvalues of the diffusion matrix indicatethe amplitude of the stochastic force and the correspondingeigenvector indicates the direction towards which such forceacts. These directions can be regarded as principal axes of theunderlying stochastic dynamics[19].

In particular the vector field yielding at each mesh point theeigenvector associated to the smallest eigenvalue of matrix gdefines the paths in phase space towards which the fluctuationsare minimal. If this eigenvalue is very small compared to allthe other, the corresponding stochastic force can be neglectedand the system can be assumed to have onlyK − 1 independentstochastic forces, reducing the number of stochastic variablesin the system.

Notice however that, whereas in a Cartesian coordinate sys-tem the eigenvalues are strictly related to the amplitude ofdif-fusion in the corresponding eigenvectors, a nonlinear transfor-mation usually changes the metric[19, 28]. In the transformedsystem the direction of the maximal eigenvalue is not neces-sarily the direction with the highest diffusion. This disparity isaccounted for by the factorsi above.

hr hθ Drr Drθ Dθθmean 5.10e-01 2.26e-03 9.50e+01 2.33e-01 1.71e-02var 3.15e+00 1.43e-03 3.11e+03 3.50e-02 8.22e-04a1 -2.8e-05 -1.1e-06 2.3e-02 5.8e-05 1.3e-05a2 -8.0e+00 3.3e-02 -1.5e+01 -8.6e-01 1.5e-01a3 2.3e-02 -3.4e-04 -1.4e+00 1.0e-02 -1.5e-03a4 -1.1e-01 1.1e-03 2.6e+00 -3.6e-02 1.7e-03a5 4.2e+01 -2.3e-01 1.5e+02 4.0e+00 -6.7e-01a6 -5.2e+01 3.7e-01 -2.2e+02 -3.8e+00 8.3e-01

Table 3: The optimal variablerg andθg, defined in Eq. (19) withα = 0.786,β = 27.735,γ = 1.125,δ = 59.964, andǫ = 1.646: mean, variance, and fittingcoefficients defining the drift and diffusion functions according to Eq. (11), with(z1, z2) = (rg, θg).

5. Transform of NO2 concentrations to the stochasticeigendirections

A plot of the eigenvectors ofD(2) in Fig. 7 suggests that acontinuous and smooth description of the corresponding sortedeigenvalues exists. We will see that, since the two eigenvaluesare different, the eigenvector corresponding to the lower eigen-value describes the direction of minimum stochasticity. Theeigenvectors and eigenvalues of the diffusion matrix give thelocal principal directions of diffusion. From Fig. 7a one iden-tifies two eigendirections, one approximately along the radialdirection and another perpendicular to it, along the angular di-rection. In Fig 7c, the angle of the smaller eigenvectors withthe coordinate axes is shown. The large fluctuations in Fig. 7care due to lack of statistics as shown in Fig. 7d, where the prob-ability density functionP(x1, x2) is plotted.

Both Figs. 7a and 7c call for a bijective transform from(x1, x2) to (rg, θg) similar to polar coordinates, apart from a

7

(a) (c)

(b) (d)

Figure 7: (a) Eigenvectors of the diffusion matrix for the detrended seriesand(b) the eigenvectors for the transformation of variables (x1, x2) → (rg, θg)shown in Eq. (19). In(c) the measured angles for the smaller eigenvector areplotted. The deviations from the fitted surface in (c) are dueto the lack of statis-tics, as shown in(d) the plot of the corresponding probability density functionP(x1, x2).

rescalingx1 → c1 = αx1 + β and x2 → c2 = γx2 + δ. Ouransatz for such generalized polar coordinates is:

(

x1

x2

)

=

(

rg

θg

)

=

√

(αx1 + β)2 + (γx2 + δ)2)arctan

(

γx2+δ

αx1+β

)

+ ǫ

(19)

where the radial and angle variables,rg andθg, respectively arenow functions of the detrended NO2 concentrations,x1 andx2.

This approach has the advantage that the inverse transformF−1 is given by the simple form

x1

x2

=

1α[rg cos(θg − ǫ) − β]

1γ[rg sin(θg − ǫ) − δ]

. (20)

Fitting the parameters to the field of eigenvectors, yields forEq. (19),α = 0.786,β = 27.735,γ = 1.125,δ = 59.964, andǫ = 1.646: Using these parameter values the eigenvectors ofthe diffusion matrix in new coordinates appear to be perfectlyaligned along the new radialrg and angularθg coordinate axes,cf. Fig. 7b, and fit the angles of the eigenvectors in Fig. 7c

Figure 7 shows that the principal directions for the systemof two NO2 concentrations are roughly the radial and the az-imuthal direction. From Eq. (19), one also sees that the ra-dial direction is related to the absolute concentration or theamplitude of the individual concentrations,

√

c21 + c2

2, whilethe azimuthal direction is associated with the relative strength,arctan (c2/c1). Figure 8 shows the eigenvalues as a function ofboth concentrationsx1 andx2.

Figure 8a and 8b show the pairs of detrended and polar-liketime series, each shifted vertically in order to increase visibility.It is clearly visible that the transformed time series have highlydifferent variances, indicating that the separation with respectto higher and lower stochasticity was successful. This is alsocorroborated by Table 3 which shows a smaller variance forhθthan forhr .

According to Eq. (18) it must be noted, however, that theamplitude of the diffusion component is not preserved by thetransformation. This becomes also apparent from Fig. 8a-b,where the variance of both transformed time series togetherissmaller than the sum of the variances of the detrended seriesx1 and x2. Instead, the transformed diffusion constants mustbe scaled with the metrics of the transform, which in our casereads (check Eq. (17))

s21 =

[

1γ2+ cos2(θg − ǫ)

(

1α2− 1γ2

)]

,

s22 = r2

gs21. (21)

The sorting of the eigenvalues in the original time series re-sults in a remarkable difference of the two eigenvalues in eachmesh point. This can be seen by comparing Fig. 8c and 8d.The transform results in a clear separation of two regimes ofeigenvalues which are three orders of magnitude apart. Further,we confirmed that multiplying the transformed eigenvalues bythe metric factors yields the detrended eigenvalues, meaningthat the transformF, its inverse transform and calculation of thecorresponding metric factors have been performed consistently(not shown).

It is important to remember that the metric factors need tobe taken into consideration only when comparing the relativestrength of the components of the diffusion matrix. This is un-derpinned by Table 4, which illustrates another important pointregarding relative signal strengths by using averaged ratios. Ascan be seen, the ratio of the principal components of the diffu-sion matrix,〈|D11| / |D22|〉, is almost unity in both the originaland detrended series. Remarkably, the transformation increasesthe corresponding ratio,〈|Drr | / |Dθθ|〉, for the new coordinates,thus decoupling different modes of stochastic behavior, evenwhen taking the metric factors into account. In the transformedsystem,Dθθ is on average smaller thanDrr , which means thatthe transformed coordinateθ = x2, the angular coordinate de-scribing the relative concentration between the two stations,shows reduced stochasticity as compared to the radial coordi-nater = x1. Likewise, the relative strength of the off-diagonalterm, 〈|D12| / |D11|〉, is reduced after the coordinate transformto 〈|Drθ| / |Drr |〉, meaning that an effective diagonalization hasbeen reached. Most importantly, though, is the ratio of the de-terministic to the weaker stochastic components,〈|h2| / |D22|〉and〈|hθ| / |Dθθ|〉, which increases by a factor of 100, after ap-plying the transform in Eq. (14). Conversely, the ratio of thestronger stochastic part,〈|h1| / |D11|〉, , decreases when goingto 〈|hr | / |Drr |〉. From these observations one can conclude thatin the new, low-stochastic coordinateθ not only the absolutevalue of the stochastic noise has been reduced, but also the rel-ative strength with respect to the deterministic drift term, which

8

a) c) e)

b) d) f)

Figure 8:(a) The two original NO2 concentration series,x1 andx2, plotted offset to increase visibility.(b) The time series under the transformed coordinates (seeEq. (19)), plotted also vertically offset. Notice the much higher ratio between the variances of both signalsrg andθg, compared to the ration between the variancesof the original seriesx1 andx2 (see text).(c) The two eigenvalues of the diffusion matrix for the original signalsx1 andx2. These eigenvalues are much closer toeach other than(d) the eigenvalues derived for the diffusion matrix for the transformed (optimal) variables.(e-f) Drift vector (hr , hθ) as a function of the optimalvariables, namely the total NO2 amountrg and its relative concentrationθg (see text for discussion). Fitted surfaces are obtained forthe coefficient values given inTab. 3.

means better predictability. Division by the metrics has noef-fect on the relative strength of the stochastic parts since it af-fects both the drift and the diffusion coefficients similarly. Weconsider this finding the main result of this letter. Figure 8(e-f)shows the drift vector for the transformed coordinates, whichis well defined and apparently has a different functional depen-dence on its coordinates than the untransformed drift showninFig. 6.

6. Comparison with standard methodologies

In this Section we first address the question of how good thecoordinate transform derived above is compared to other, pos-sibly simpler transforms.

For example, we may consider a transform to coordinateswhich describe the mean value and difference between the twomeasured time series, e.g.

(

x1

x2

)

= 12

(

x1 + x2

x1 − x2

)

. (22)

This choice is the simpliest one for two variables, one describ-ing the total amountx1 + x2 and another describing the relativeamountx1 − x2.

We have analyzed the time series of variables ˆx1, x2, andadded the results to Tab. 4 in the columns ”Simple Trans”and ”S. Tfd/metrics” for the transformed and transformed andcorrected for metrics characteristics, respectively. Thevari-ances of such new variables is of the same order of magnitudeas (x1, x2). Further, the ratio between principal componentsof diffusion matrix,〈

∣

∣

∣D11/D22

∣

∣

∣〉, is much smaller then in ourmethod, and consequently the separation of the principal eigen-values is much weaker. Therefore, both the ratios between thedeterministic and the stochastic coefficients,

⟨∣

∣

∣h1

∣

∣

∣ /∣

∣

∣D11

∣

∣

∣

⟩

and⟨∣

∣

∣h2

∣

∣

∣ /∣

∣

∣D22

∣

∣

∣

⟩

, remain very low under this transform, quite con-trary to the results we have achieved with our transformationto general polar coordinated based on the analysis of principaldirection of the diffusion matrices.

Another question addressed within this section is the com-parison between methods applied to choose the most appro-priate inputs. Theoretically, any set of input data can be fedinto a model for training and evaluation. However, the num-ber of possible variables to be used and the number of waysthey can be presented is too diverse to test all possible com-binations. A number of statistical methods can be applied inorder to choose the most appropriate set of predictors or inputs.

9

Unfiltered Detrended Transformed Tfd / metrics Simple Transf. S. Tfd/ metrics〈y1〉 , 〈x1〉 36.05 -0.00 〈r〉 68.76 -/- 〈x1〉 -0.10 -/-〈y2〉〈x2〉 64.68 -0.00 〈θ〉 2.66 -/- 〈x2〉 0.23 -/-std(y1), std(x1) 26.85 25.31 std(r) 42.33 -/- std(x1) 26.87 -/-std(y2), std(x2) 40.78 37.85 std(θ) 0.47 -/- std(x2) 35.06 -/-

〈|D11| / |D22|〉 6.27e-01 6.09e-01 〈|Drr | / |Dθθ |〉 1.11e+04 1.46e+00⟨∣

∣

∣D11

∣

∣

∣ /∣

∣

∣D22

∣

∣

∣

⟩

3.79e-01 1.52e+00

〈|D12| / |D11|〉 4.90e-01 3.04e-01 〈|Drθ | / |Drr |〉 3.80e-03 2.40e-01⟨∣

∣

∣D12

∣

∣

∣ /∣

∣

∣D11

∣

∣

∣

⟩

5.58e-01 2.79e-01

〈|h1| / |D11|〉 3.23e-02 3.13e-02 〈|hr | / |Drr |〉 2.24e-02 2.24e-02⟨∣

∣

∣h1

∣

∣

∣ /∣

∣

∣D11

∣

∣

∣

⟩

3.03e-02 3.03e-02

〈|h2| / |D22|〉 2.21e-02 2.38e-02 〈|hθ | / |Dθθ |〉 1.78e+00 1.78e+00⟨∣

∣

∣h2

∣

∣

∣ /∣

∣

∣D22

∣

∣

∣

⟩

2.02e-02 2.02e-02

Table 4: Mean ratios of the absolute values of drift and diffusion components for the measured (x1, x2) and transformed time-series (rg, θg). On the right hand side,a comparison with a simpler, yet ineffective, transform is given, cf. Eq. (22).

Examples are, among other, stepwise regression, PCA, clusteranalysis and ARIMA. For details, see Ref. [22] and referencestherein. Such pre-processing procedures, reduce the numberof input variables into the models, thus eliminating the redun-dant information. In these standard procedures, the selectionof variables is usually made independently for each monitoringstation.

Another possible way to tackle redundancy is the pre-processing of data consisting of the computation of backwardstepwise regressions (BSR) conducted between a target variableand all the other data sets. Based on the available common pe-riod data sets, one constructs a collection of records, composingthe input vector, which includes the meteorological variables,air pollutant concentrations, etc, and together with it assumesthe corresponding target, which in our case is the atmosphericconcentration of a certain pollutant. Subsequently, one retainsthe smallest subset of statistically significant variablesto pre-dict a certain pollutant concentration automatically at a givenmonitoring station. In addition, BSR allows the determinationof the best time lags for each input variable, typically daily andweekly cycles.

The referred techniques also allow the comparison betweenthe original data sets and surrogate data sets including only thestochastic component. The stochastic component may be deter-mined through a rough approximation of a mathematical func-tion (e.g., sin x), or, for example, by the presented framework.After the selection of variables and the determination of cyclicand stochastic behaviors on each time series, linear and non-linear models can be applied in order to model air pollution ineach monitoring station. The forecasting capabilities of the dif-ferent approaches can then be compared. Such models are alsoapplied to each decoupled time-series in order to predict nextdays air quality at each monitoring station. The applicationsof this framework, however, allows to determine the stochasticcomponent on a efficient manner, enhancing air quality predic-tions.

7. Discussion and Conclusion

In this paper, we have investigated the stochastic propertiesof two simultaneous series of NO2 measurements.

We assumed, that the time series after detrending were prop-erly modeled by a system of Langevin equations. The validityof this assumption is discussed in section 3, showing that there

are deviations of the estimated dynamical noise from normal-ity. These discrepancies, which could be caused e.g. by thelow sampling frequency of 1/h or by instationarities intrinsic tothe system, however, are not expected to have an effect on thequalitative results of our study. For further evaluation oftheseeffects it would be interesting to apply the procedure outlinedinthis paper to data sampled at higher frequency (and in a shortermeasurement interval) which is planned for a future study.

In the detrended time series, a very low ratio between de-terministic and the stochastic terms, and a strong couplingbe-tween the stochastic terms between the two measurements wasfound. Calculating the eigenvalues of the diffusion matrix, wefound a transform that leads to a description in which the dif-fusion matrix is diagonal and increases the ratio between deter-ministic and stochastic terms.

The transform maps the detrended variables into two polar-like coordinates: The radial direction associated with thelargereigenvalue describes the total amount of NO2 observed in thetwo stations taken together, while the angular component isas-sociated to the lower eigenvalue and describes the relativecon-centration at one station compared with the overall amount ofNO2.

This procedure worked out very fine for the NO2 data, sincethe transformation of variables resulted in a decoupling ofthediffusion components in the new coordinates. In particular, weidentified the ratio of NO2 concentrations at the measurementstations as being least affected by stochastic fluctuations. Thisresult suggests that the strongest fluctuations in NO2 concentra-tion affect both measurements stations at equal proportions andthat they dominate with respect to minor, local stochastic fluc-tuations between the stations. Due to the rather simple formoftransforms these features can be directly studied from measureddata.

It could been shown that the ratio of deterministic to stochas-tic components is optimal in the transformed variables, andthatthe overall amplitude of the stochastic noise in one componentof the transformed time series is reduced.

One question that should be addressed in a forthcoming studyis to compare the predictability of both sets of variables andinterpret the physical meaning of the transformed coordinates intheir evolution equation, as well as the transformed drift vectorand diffusion matrix.

10

Acknowledgments

The authors thank DAAD and FCT for financial supportthrough the bilateral cooperation DREBM/DAAD /03/2009.FR (SFRH/BPD/65427/2009) and PGL (Ciencia 2007) thankFundacao para a Ciencia e a Tecnologia for financial support,also with the support Ref. PEst-OE/FIS/UI0618/2011.

References

[1] EEA European Environment Agency,The European environment. Stateand outlook 2010 : synthesis, DOI: 10.2800/45773 (2010).

[2] N. de Nevers,Air Pollution Control Engineering(McGraw-Hill Compa-nies, 2000).

[3] ReportHealth aspects of air pollution with particulate matter, ozone andnitrogen dioxide, (World Health Organization, Denmark, 2004).

[4] M. Demuzere, R. Trigo, V. Arellano, N. van Lipzig, Atmospheric Chem-istry and Physics9, 26952714 (2009).

[5] M. Gardner, S. Dorling, Atmospheric Environment33, 709719 (1999).[6] J. Hooyberghs, C. Mensink, G. Dumont, F. Fierens, O. Brasseur, Atmo-

spheric Environment39, 32793289 (2005).[7] J. Kukkonen, L. Partanen, A. Karppinen, J. Ruuskanen, H.Junninen,

M. Kolehmainen, H. Niska, S. Dorling, T. Chatterton, R. Foxall, G. Caw-ley, Atmospheric Environment37, 45394550 (2003).

[8] M. Kolehmainen, H. Martikainen, J. Ruuskanen, Atmospheric Environ-ment35, 815-825 (2001).

[9] R. Friedrich and J. Peinke, Phys. Rev. Lett.78, 863 (1997).[10] R. Friedrich, J. Peinke, M. Sahimi, and M.R.R. Tabar, Phys. Rep.50687

(2011).[11] P.G. Lind, A. Mora, J.A.C. Gallas and M. Haase, Phys. Rev. E 72, 056706

(2005).[12] R. Friedrich, J. Peinke and Ch. Renner, Phys. Rev. Lett.84, 5224 (2000).[13] F. Ghasemi, M. Sahimi, J. Peinke, R. Friedrich, G.R. Jafari and

M.M.R. Tabar, Phys. Rev. E75, 060102 (2007).[14] D. Kleinhans, R. Friedrich, A. Nawroth, J. Peinke, Phys. Lett. A34642-

46 (2005).[15] S.J. Lade, Phys. Lett. A3733705-3709 (2009).[16] F. Boettcher, J. Peinke, D. Kleinhans, R. Friedrich, P.G. Lind, M. Haase,

Phys. Rev. Lett.97 090603 (2006).[17] P.G. Lind, M. Haase, F. Boettcher, J. Peinke, D. Kleinhans and

R. Friedrich, Phys. Rev. E81 041125 (2010).[18] J. Carvalho, F. Raischel, M. Haase and P.G. Lind, J. Physics 285012007

(2011).[19] V. Vasconcelos, F. Raischel, M. Haase, J. Peinke, M. Wachter, P.G. Lind

and D. Kleinhans, Phys. Rev. E84 031103 (2011).[20] R. Friedrich, S. Siegert, J. Peinke, St. Luck, M. Siefert, M. Lindemann,

J. Raethjen, G. Deuschl, G. Pfister, Physics Letters A271, 217 (2000).[21] K. Pearson, Philosophical Magazine2, 559 (1901).[22] D. Wilks, Statistical Methods in the Atmospheric Sciences(2nd Ed.)

No. 59 in International Geophysics (Academic Press, 2006).[23] H. Risken,The Fokker-Planck Equation, (Springer, Heidelberg, 1984).[24] C. W. Gardiner,Handbook of stochastic Methods, (Springer, Germany,

1997).[25] C. Micheletti, G. Bussi, and A. Laio, J. Chem. Phys.129, 074105 (2008).[26] J. Gradisek, R. Friedrich, E. Govekar and I. Grabec, Meccanica,38, 33

(2003).[27] A. M. van Mourik, A. Daffertshofer, and P. J. Beek, Biological cybernetics

94, 233 (2006).[28] S.J. Lade, Phys. Rev. E80 031137 (2009).[29] P.G. Lind, A. Mora, M. Haase and J.A.C. Gallas, Int. J. Bif. Chaos17(10)

3461-3466 (2007).

11

Searching for optimal variables in real multivariate stochastic data

Documents

Transcript of Searching for optimal variables in real multivariate stochastic data