
UNIVERSITY OF GENOVA

Doctoral School in

Science and Technology For Information and Knowledge

Ph.D Course in

ELECTRONIC AND COMPUTER ENGINEERING, ROBOTICS AND TELECOMMUNICATIONS

CYCLE XXVII

DOCTOR OF PHILOSOPHY THESIS

COGNITIVE DYNAMIC SYSTEM: AN

INNOVATIVE INFORMATION ACQUISITION

AND MODELLING FRAMEWORK FOR

CROWD MONITORING

SIMONE CHIAPPINO

Ph.D. Candidate

PROFESSOR CARLO S. REGAZZONI

Advisor

PROFESSOR MARIO MARCHESE

Chairperson

April 17, 2015


Abstract

Cognitive algorithms, integrated into intelligent systems, represent an important innovation in the design of interactive smart environments. In particular, cognitive systems have important applications in anomaly detection and management for advanced video surveillance. These algorithms mainly address the problem of modelling interactions and behaviours among entities in a scene. A bio-inspired architecture is proposed here, able to encode and synthesize signals not only for describing the behaviour of single entities, but also for modelling cause-effect relationships between user actions and changes in environment configurations. Neuroscience studies describe how this information is learned and stored within a specific brain area, the Autobiographical Memory (AM), which consists of a collection of episodic elements based on events.

This thesis proposes an artificial counterpart of the aforementioned AM, embedded in a cognitive architecture, and shows the effectiveness of the proposed bio-inspired structure for automatic crowd monitoring applications. The system acquires knowledge directly by observing the actions of an operator controlling a crowd in order to prevent anomalous situations, such as overcrowding.

The proposed framework performs an effective knowledge transfer from a human operator towards an automatic system called the Cognitive Surveillance Node (CSN), which is part of a complex cognitive JDL-based and bio-inspired architecture. After such a knowledge-transfer phase, the learned representations can be used, at different levels, either to support human decisions, by detecting anomalous interaction models and thus compensating for human shortcomings, or, in an automatic decision scenario, to identify anomalous patterns and choose the best strategy to preserve the stability of the entire system. In large-scale video surveillance networks, where the amount of information increases (e.g., due to a larger number of monitored areas), attention-focusing techniques are needed to highlight the most relevant parts of the overall acquired data.

This thesis presents a bio-inspired algorithm called Selective Attention Automatic Focus (S2AF) as part of the CSN for crowd monitoring. The main objective of the proposed method is to extract the relevant information needed for crowd monitoring directly from environmental observations. Such data can be considered a sub-part (i.e. sparse and limited information) of the whole information. By means of an automatic relevance-identification algorithm, the proposed system can help the operator focus attention on the relevant parts of the scene only.

When wide-area surveillance systems are considered, one of the major problems in event detection is reconstructing the scene as a whole from spatially limited observations. In this thesis, a novel relevant-information representation technique for sparse and limited observations, based on information theory, is presented. An innovative hierarchical structure for knowledge representation based on artificial neural networks, the Self Organizing Maps (SOMs), is used for classifying and correlating sparse data time series acquired by cameras. By means of Information Bottleneck theory, a cost-function approach is proposed to determine the optimal data representation in the SOM space as a trade-off between sparse-signal classification capability and the preservation of the statistical similarities (i.e. correlations) of the original data. Based on this cost function, an algorithm for knowledge representation, called information bottleneck-based SOM selection (IB-SOM), is described.

The advanced mechanisms for representing relevant information described in this work can improve video surveillance systems. In particular, a novel adaptive inductive-reasoning (top-down logic) mechanism for saliency extraction and information recovery from a distributed camera sensor network is presented. The proposed system, by means of Self Organizing Maps, first learns the correlations of the observed data and then selects the optimal subset of sensors for recovering the whole information with minimum distortion.

The thesis also addresses the increasing demand to acquire and model, by observing user activities, a large number of strategic and operative crowd management solutions. An innovative visual 3D simulator, in which crowd behaviour is modelled by means of Social Forces, has been developed and is described. The configuration of the synthetic environment provides a set of customizable elements (e.g. doors, walls and rooms), so that a wide range of scenarios can be set up. More specifically, a crowd enters a room of the simulator and, driven by a given motivation, moves toward the exit of the environment (e.g. a building).

The CSN observes two interacting entities: a simulated crowd and a human operator. The user can interact with the simulator through actions (i.e. the opening and closing of doors in the environment) aimed at avoiding anomalous situations (i.e. overcrowding). Results demonstrate how the proposed system can automatically detect and manage anomalous crowd configurations. Experimental tests, on synthetic as well as real video sequences, demonstrate the applicability of the CSN in both the user-support and automatic modes. The proposed attention-focus method has been tested by means of the 3D crowd simulator; the results show how, through the proposed S2AF, the CSN is able to detect densely populated areas. The so-called IB-SOM algorithm for knowledge representation can be applied within the CSN structure in order to adapt the crowd density representation to changes in the surrounding environment. IB-SOM has been evaluated on synthetic and real video sequences and compared to other neural networks for event detection in crowds. Finally, the proposed SOM-based inductive-reasoning mechanism, embedded in the CSN architecture, has been tested on synthetic and real video sequences. The results show how the CSN can select relevant data from observed parts of the environment and how, by means of inductive reasoning, it can reconstruct the information in the non-observed parts.


Dedication

Dedico la mia Tesi di Dottorato di Ricerca ai miei genitori

I would like to dedicate my Ph.D Thesis to my parents

Simone Chiappino


This doctoral thesis has been examined by the Evaluation Committee, composed as follows:

Professor Fulvio Mastrogiovanni, . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Università degli Studi di Genova

Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi

Professor Claudio Sacchi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Università degli Studi di Trento

Dipartimento di Informatica e Scienze dell’Informazione

Professor Nicola Zingirian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Università degli Studi di Padova

Dipartimento di Ingegneria dell’Informazione


Contents

Abstract

1 Introduction
1.1 Motivation and context
1.1.1 The JDL Architecture
1.1.2 Formulation of the problem
1.2 Specific contributions to progress in scientific and technological state-of-the-art
1.3 Structure of the thesis

2 State of the art
2.1 From neuroscience to cognitive architectures
2.2 Overview of cognitive architectures
2.2.1 Symbolic architectures
2.2.2 Connectionist architectures
2.2.2.1 Self Organizing Map (SOM)
2.2.2.2 Neural Gas (NG)
2.2.2.3 Growing Neural Gas (GNG)
2.2.2.4 Instantaneous Topological Map (ITM)
2.2.2.5 Growing Hierarchical-SOM (GH-SOM)
2.2.3 Hybrid architectures
2.3 Haykin's architecture and dynamics of cognitive systems
2.3.1 Haykin's architecture
2.3.2 Dynamics of cognitive systems
2.4 Final remarks about cognitive architectures

3 Crowd modeling and proposed architecture for cognitive video surveillance system
3.1 Introduction
3.2 Scale issue
3.2.1 Crowd monitoring
3.2.2 Crowd simulation
3.3 A bio-inspired cognitive model for Cognitive Surveillance Systems
3.3.1 Cognitive Cycle for single and multiple entities representation
3.3.2 The Cognitive Node

4 Information extraction and probabilistic model for knowledge representation
4.1 Introduction
4.1.1 Data fusion
4.1.2 Event detection
4.1.3 Autobiographical memory
4.2 Autobiographical Memory domain applications: Surveillance and Crowd Management scenarios
4.2.1 Learning phase: interaction representations
4.2.2 Detection phase: surveillance scenarios
4.2.3 Detection phase: Crowd management scenarios
4.3 Proposed approach for abnormal interaction detection in crowd monitoring domain
4.3.1 Dimensional reduction and preservation of the information
4.3.2 Self Organizing Map: classification evaluation
4.4 Results
4.4.1 Example of application on real video sequences

5 A bio-inspired algorithm for designing a human attention mechanism
5.1 Introduction
5.2 Selective attention for automatic crowd monitoring
5.3 Information flow process for cognitive crowd monitoring
5.4 Selective Attention Automatic Focus (S2AF) algorithm
5.5 Results

6 Information bottleneck-based relevant knowledge representation in large-scale video surveillance systems
6.1 Introduction
6.2 Information bottleneck-based relevant information representation
6.3 Experimental results
6.3.1 Training of SOMs and cost function evaluation on synthetic data
6.3.2 Crowd density reconstruction on real video sequences

7 A bio-inspired logical process for information recovery in cognitive crowd monitoring
7.1 Introduction
7.2 Data correlation representation for bio-inspired inductive reasoning algorithm
7.3 Experimental results

8 Conclusions and future developments
8.1 Analysis block
8.2 Attention Focusing Module
8.3 Information Refinement for relevant knowledge representation
8.4 Adaptive reasoning algorithm for information reconstruction
8.5 Final remarks and future developments


List of Figures

1-1 JDL's structure for data fusion [116].

1-2 Operator-machine-environment loop; the machine is represented by the JDL-based system.

1-3 Self-aware oriented system proposed by Endsley [34]. The model of situation attention in dynamical decision making is formed as follows: perception (level 1), comprehension of the situation (level 2), prediction (level 3); after that, decision and action close the loop with the environment.

1-4 Proposed Cognitive Node structure.

2-1 The scheme of LIDA's cognitive cycle.

2-2 The Perception Action Cycle formalized and Reinforcement Learning.

2-3 Functional block diagram showing how the Cognitive Dynamic System perceives the environment in which it operates.

2-4 Some of the more important cognitive architectures.

2-5 Example of a self organizing map structure: the SOM layer is formed by a 2-D fixed grid of 𝑁 neurons, while the input vector is defined as 𝑥 = [𝑥1, 𝑥2, · · · , 𝑥𝑛].

2-6 Example of GNG adaptation to input stimuli. The initial situation (left) and the final topology (right) are shown.

2-7 Node creation procedure in the ITM algorithm.

2-8 Column insertion and layer expansion in GH-SOM. Step 1: when MQE𝑙 exceeds a given threshold, a column is inserted between the node 𝑒 with highest mqe𝑒 and its most dissimilar neighbour 𝑑. Step 2: layer expansion by adding a new map in order to represent the input data in more detail.

2-9 Haykin's Cognitive Dynamic System. Cognitive Perceptron (CP); Cognitive Control (CC); Probabilistic Reasoning Machine (PRM).

2-10 The Markov blanket of node 𝐴.

2-11 Global state representation of the system. The internal states 𝑟 are connected with the action states 𝑎, while the environment 𝜓 is influenced by the actions. The external states are observed by sensors 𝑠.

2-12 Characteristics of cognitive structures.

3-1 The Autobiographical Memory stores the cause-effect relationships between entities such as user actions and changes in environment configurations (i.e. the crowd).

3-2 Cognitive Cycle (single object representation).

3-3 Cognitive Node-based object representation: bottom-up analysis and top-down decision chain.

3-4 Embodied Cognitive Cycle, Interactive Virtual Cognitive Cycles and Cognitive Node matching representation.

3-5 Cognitive Node Architecture.

4-1 Examples of temporal proximity trajectories among fired neurons in a 2-D SOM map (5×5) for different core state vector sequences. The trajectories are non-linear and discontinuous.

4-2 Coupled Event-based Dynamic Bayesian Network model representing interactions within an AM.

4-3 Graphical representation of the mapping into AM 3-D space of the passive triplet 𝜖−𝑃, 𝜖𝐶, 𝜖+𝑃. The symbols 𝑙𝑥𝑃/𝐶 represent the contextual SOM label associated to each cluster. In this example the proto or core events are represented by 𝑙𝑥𝑃/𝐶 ⇢ 𝑙𝑦𝑃/𝐶, where 𝑥 ≠ 𝑦. The transitions in the Proto and Core Maps are dashed to represent the non-linearity and discontinuity of the trajectories.

4-4 Example of CE-DBNs for a passive triplet, e.g. 𝜖−𝑃, 𝜖𝐶, 𝜖+𝑃, with a parameter 𝜃 tied across proto-core-proto transitions.

4-5 Model learning problem: triplet recovery from the model. 𝑇𝑟𝑗 represents the 𝑗-th generic triplet, 𝜃𝑖 the 𝑖-th interaction model.

4-6 Example of graphical comparison between VAR(2) models and simulated data representing the number of people within zone 1. The average matches between VAR(2) model outputs and 140 simulated vectors (expressed as percentages) are the following: SOM 4×4 fit: 67.14%; SOM 5×5 fit: 53.6%; SOM 7×7 fit: 40.18%.

4-7 The simulated monitored environment.

4-8 Vectorial sum of forces 𝐹𝑇𝑂𝑇 and influence range of characters.

4-9 Classification examples of interaction behaviour using evaluation window W = 10.

4-10 Detection of anomalous operator-crowd interactions. The system detects an anomalous interaction when the operator does not open two doors and the number of people increases. This incorrect crowd-management situation is shown in the figure and compared with the correct situation.

4-11 Example of the real environment (a) and the simulated scenario (b) used for the test; the virtual rooms correspond to the zones. The red dashed line corresponds to the people flow direction from zone 1 to zone 3; the blue dashed line describes people movement from zone 3 to zone 1. Dashed circular areas qualitatively correspond to the parts of the rooms monitored by cameras equipped with a people counter module. 𝑑1–𝑑4 are the doors.

4-12 Sample frames for four different crowd-environment interactions. Different people flows are presented: two opposite directions of movement (a), (b); people splitting (c); people merging (d). (a) and (b) represent normal behaviours, while (c) and (d) represent two abnormal behaviours.

4-13 Qualitative results of normal and anomalous operator-crowd interaction detection during the operator support phase. The ground truth bar represents the operator actions in correspondence with the video frames. The prediction and action bar represents the cognitive system actions.

5-1 Attention module, formed by the attention mechanism based on Selective Attention Automatic Focus (S2AF) and the Environmental Memory, which stores the crowd density local models.

5-2 The switch T activates or deactivates the attention mechanism. Attention open loop: the operator receives the information directly from the camera network. Attention closed loop: the operator receives the information through the attention module, which selects only the relevant information.

5-3 Schematic information flow process for selective attention automatic focus and relevant information extraction. The total information is associated to the macroscopic level (a). 𝑋 represents a set of trajectories defined as the microscopic level (b). By an encoding function 𝑍(𝑋), the more informative environment parts, associated to the sufficient information 𝑌, are identified as the mesoscopic level (c). Such a representation is used to extract the relevant information 𝑇(𝑌) according to the task (d).

5-4 Crowd simulator: simulated crowding scene for trajectory generation.

5-5 Simulated people flows: qualitative sketch of trajectories; the arrow colours indicate five different people flows 𝑑1, ..., 𝑑5.

5-6 ITM: topological map, formed by the different zones created after 50 trajectories using the ITM algorithm.

5-7 Histogram: a learned people histogram from which it is possible to evaluate the probability that an observed number of people occurs in zone 𝑧𝑗 (in this case 𝑗 = 42).

5-8 Gaussian p.d.f.: Gaussian histogram-normalization p.d.f. (in this case the average is 𝑚42 = 3.4 and the variance is 𝜎²42 = 2.8).

5-9 Seed attentional resources: initial conditions (i.e. three high-entropy 𝑧𝑠𝑗 for each room).

5-10 Examples of transition probabilities for ITM zones. A conditional entropy value is computed between a given zone and its neighbours. On the left, zone 𝑧1 is connected with only one neighbour, 𝑧2 (i.e. low uncertainty and conditional entropy), while on the right zone 𝑧4 has more potential neighbours, 𝑧5, 𝑧6, 𝑧7 (i.e. high uncertainty and conditional entropy).

5-11 An example of normalized cost function average trends for room 𝑅1. The experiments have been conducted on 150 sequences of synthetic data provided by the simulator, with 𝑚 = 2 and people flow corresponding to 𝑑1. The total time of each simulation is 1000 s. Three attentional resource choices (high entropy, low entropy and random choice) have been considered. Each point on the curves corresponds to the average value of 𝑓(𝑡) for the different initial people distributions.

5-12 Groups of zones. S2AF provides groups of zones.

5-13 Asynchronous event detection. Direction change events in 100 simulated situations have been evaluated. The figure shows average trends for the probabilities 𝑝𝑘, considering time alignment with the events at 𝑡 = 0. At 𝑡 = 30 the two curves reach the same probability. The system is able to detect a direction change of the people from 𝑐1 to other zones.

6-1 Relevant information extraction and crowd density map estimation. a. The coloured circles specify important subparts of the scene; optical flow features represent sparse relevant information. b. Crowd density map estimation by Lucas-Kanade [72] optical flow features.

6-2 Information Refinement block for automatic relevant data representation selection.

6-3 Relevant information extraction and crowding density map projections into the SOM space. The grid cells, laid over the image plane, can be seen as the set of controlled areas. Each cell is associated to a number of features extracted by Lucas-Kanade optical flow and used for estimating the crowd density map. In this example 𝑋 ∈ R^20, ∈ R^5 and the sparse vector 𝑋 ∈ R^20. 𝑋 and 𝑋 are mapped into the same unit 16. The percentage of controlled area corresponds to 40% (i.e. 2|5).

6-4 Kullback-Leibler divergences 𝐷𝐾𝐿 for two Gaussian probability density functions 𝑝(𝑌,𝑊) and 𝑝(𝑌,𝑋). The information lost is represented by the distance metric between 𝑝(𝑌,𝑊) and 𝑝(𝑌,𝑋) (see Table 6.1).

6-5 Normalized cost function average trends for different SOMs. The experiments have been conducted on 100 sequences of synthetic data provided by the simulator. The total time of each simulation is 1000 s. The validity regions are defined by intervals of 𝜆.

6-6 Qualitative and quantitative results of event recognition for PETS sequence S1 L2 Time 14:06 using 40% of the controlled area. The figure shows the comparison between the proposed method and other approaches: the upper part presents crowding density map reconstructions; the lower part shows distortion curves. The whole video is available at https://www.youtube.com/watch?v=KN2aYZ64TTw.

7-1 Reconstruction block for automatic recovery of the whole information.

7-2 Information extraction and crowd density map estimation.

7-3 Example of correlation analysis between two pairs of locations. The data are represented by the number of features extracted by Lucas-Kanade optical flow associated to each cell.

7-4 Salient information extraction and crowding density map reconstruction. The sparse vector is projected into the SOMs space (i.e. the "SOMs stack"). The grid cells, laid over the image plane, can be seen as the set of controlled areas. Each cell is associated to a number of features extracted by Lucas-Kanade optical flow and used for estimating the crowd density map. The percentage of controlled area corresponds to 40%. In this example 𝑋 ∈ R^20 and the vector of the reduced observations 𝑋 ∈ R^20. 𝑋 is mapped into different units of the SOMs. By comparing the cost functions (see equation (7.4)) it is possible to identify the optimal SOM (in this case SOM-9).

7-5 An example of qualitative and quantitative results for PETS sequence S1 L2 Time 14:06 (frames 56 and 140) using 40% of the controlled area. Up to the time of the first people detection (i.e. the initialization phase, frame 44) the system acquires the data from all the zones.

7-6 Reconstruction comparison for different saliency extraction methods: the saliency detection presented in [74]; HOG descriptors [27] combined with a learning-based SVM classifier for person (pedestrian) and head detection. The average mean square error normalized to 1 and the percentage of the crowd density map average acquired by the system (i.e. Extracted Relevance, ER) are shown.

List of Tables

4.1 SOM and PCA comparison.

4.2 Classification evaluation for different SOM layers.

4.3 Different crowd behaviours in simulated scenarios.

4.4 People flow detections using different people counters and SOMs. Accuracy of the precision: 𝑎𝑃𝐶1 = 80%, 𝑎𝑃𝐶2 = 60%. Success = 1, Failure = 0.

5.1 Groups of zones: S2AF output.

6.1 Cost function parameters. For evaluating 𝐷𝐾𝐿, two divergence-normalized density functions 𝑝(𝑌,𝑋) = 𝑁(0, 0.9996) and 𝑝(𝑌,𝑊𝐾) are considered.

6.2 Comparisons between the proposed IB-SOM selection and other neural networks. Results are presented for different percentages of controlled areas. Normalized reconstruction errors are shown in the table. For IB-SOM, reconstruction 𝑑𝑟𝑊𝐾 and prediction 𝑑𝑝𝑊𝐾 errors have been computed. The last two rows show the averages and the variances of the errors.

7.1 Quantitative results for PETS video sequences. It is important to note that the minimum value of the cost function proposed in this thesis corresponds to the minimum value of MSE.

Chapter 1

Introduction

Recently, new pervasive technologies influence our life by increasing the possibilities of

using computing devices all the time and everywhere. In this context software engineering

and computer science are rapidly leading to development tools which are able to support

and substitute humans during their daily activities. The concept of Ambient Intelligence

(AmI) [47] describes the idea of what pervasive computing paradigm can be applied in

designing interactive smart environments (i.e. smart spaces). The ambient, by intelligent

sensors, acquires the capabilities of recognizing behaviours of individuals in the controlled

area, furthermore the environment, by intuitive interface, can interact directly with them.

These systems, having the capabilities to assist the people, have found applications in dif-

ferent domains: smart domotic [115], surveillance [36], smart vehicles [9] and smart traffic

lights [112].

Nowadays it is commonly accepted that the connection between the traditional video analy-

sis tools and context aware methodologies are the fundamental elements in the designing of

smart devices for security systems. This technological union augments the functionalities

of common surveillance systems by modules of: situation understanding, person and crowd

behaviour analysis, classification and detection of potential threads [106], [126], [83],[67],

[127]. However, often the main focus of surveillance applications in the past years has

been addressed to recognize anomalous events, such as panic or creation of group of peo-

25

ple, when they exactly happen [118] [80]. This means that the systems can not anticipate

an anomalous behaviour in order to draw possible control strategies. Therefore, large-scale

surveillance systems represent now one of the most important application fields in which

new generation security and safety systems are involved. Moreover, they can constitute an

additional advantage within the context of smart cities and buildings design, by providing

information that can be used not only to dynamically maintain a high security and safety

level, but also to help humans to feel more comfortable and be better served by the envi-

ronments they build. However, a major research effort is required to reach this ambitious

objective of making the environments we live is not only capable of internally representing

the state in which they are (i.e. of showing a kind of be self-consciousness/self-awareness

[28]), but also of using the so-obtained internal representations to help humans with ef-

ficiently addressing actions aimed at dynamically maintaining a high safety level in such

environments.

The development of cognitive theories and techniques aimed at addressing safety and security tasks is often bio-inspired and represents a potentially relevant added value, one that can also gain impulse from cross-fertilization with related fields such as cognitive radio and cognitive radar [55].

Cognitive sciences explore the principles of cognition, covering aspects from different fields such as psychology, philosophy, neuroscience, linguistics, anthropology and artificial intelligence [86]. In science, cognition studies how human beings perceive, process and understand the information acquired [117] through their senses.

Several recent studies have proposed algorithms deployed in distributed camera and sensor networks in order to move from the object recognition paradigm to event/situation identification, and to the consequent distributed decisions/actions at multiple spatial/temporal levels [35].

Dynamic models enhance the capability not only of detecting events (e.g. the presence of an intruder in a forbidden area, abnormal situations in the crowd, the trajectory of an object in an urban scenario), but also of predicting the behaviours of observed entities [33]. Such prediction allows reactions to be activated in order to recover and maintain homeostasis, namely the stability of safe conditions, by automatically taking decisions and performing actions on the environment. Cognitive sciences represent one of the most promising research fields in terms of their capability of driving improvements over the state of the art.

In addition, to efficiently exploit cognitive capabilities in an intelligent sensor network, the role of data fusion algorithms is crucial [39], [24]. In the literature, several works deal with the data fusion problem applied to heterogeneous sensors, both for security [113], [102] and for safety tasks [13], [129].

In this work, the features of a cognitive-based framework, inspired by the previously cited concepts, are described, and the application of the proposed architecture to the video surveillance domain for crowd monitoring is demonstrated.

1.1 Motivation and context

Many neuroscience studies define human cognitive mechanisms as a set of different mental processes, involving perception, learning, memorization, reasoning and decision making, in order to understand and find solutions to problems. According to Fuster's paradigm, the perceptual and executive components are completely defined by the Perception Action Cycle (PAC) [44]. From an engineering perspective, the cognition process has been redefined by means of a Cognitive Cycle that includes the following phases: sensing, analysis, decision and action.

A Cognitive System is an artificial system that implements data acquisition, information storage, representation and processing according to bio-inspired structures [90]. Flexibility is an important prerequisite for a cognitive process: it represents the human ability to form and apply general task rules. Cognitive Control enables Cognitive Systems to flexibly switch between different strategy sets in order to find the best actions to perform according to the task [51]. Cognitive Control allows a dynamic reconfiguration of the system by taking its previous experiences into consideration. This capability is a

strong component of the Cognitive Dynamic System (CDS) [54]. Cognitive control permits the implementation of an adaptive mechanism that changes the cognitive system behaviour depending on new tasks or different circumstances in the external environment. The CDS actuates an adaptation mechanism in order to control and adjust the parameters used for processing the acquired information, and to update them according to feedback signals coming from the external environment. More in detail, the CDS performs parameter changes, by minimizing the information gap, in order to select (within the cognitive dynamic system memory) and memorize (from the real or virtual sensors) the information associated with the maximum information content. Furthermore, a CDS ought to adapt its action-making process in order to maximize the reward received as feedback from the environment, which is linked to the success of actions connected to a specific task. Cognitive surveillance systems proposed in the literature

so far mainly share the idea that it is time to extend functionalities beyond simple video analytics. This can be done in different ways: analytics capabilities can be structured over more abstraction levels, allowing the system to autonomously maintain improved context awareness (e.g. situation assessment, threat prediction). In other cases, the concept of embodied cognition comes in and the goal is to make the system self-aware of its actuation resources and active monitoring strategies. Video analytics for surveillance in critical areas is becoming more significant for public security. Several works have been devoted in the last decade to linking traditional computer vision tasks to high-level context-aware functionalities such as scene understanding, behaviour analysis, interaction classification or recognition of possible threats or dangerous situations. This work presents the development and design of an innovative data processing architecture for a Cognitive Dynamic System based on the JDL (Joint Directors of Laboratories) model [1] [69].

1.1.1 The JDL Architecture

In this section the main concepts of the JDL model are explained. Basically, the problem of collecting and analyzing data coming from different sources is known as Data Fusion. The JDL model describes an information processing flow organized into six levels (Level 0 to Level 5), see figure 1-1. The elements of the model are described in the following:

∙ Level 0 Source Preprocessing: this level is devoted to feature extraction from the images of video sequences.

∙ Level 1 Object Refinement: this block is mainly oriented to data alignment, data association, tracking, and identification of the observed entity state.

∙ Level 2 Situation Refinement: this step is focused on the assessment of the relationships among different observed entities with respect to the environmental conditions.

∙ Level 3 Impact Assessment: at this stage the prediction of possible future events, with respect to the previous level, is performed.

∙ Level 4 Process Refinement: this level is mainly devoted to balancing the system capabilities for recognizing events against optimizing its resources.

∙ Level 5 Cognitive Refinement: the output of the whole system can be affected (i.e. refined) by how humans perceive the output and by their intervention and control. This process permits a direct link with the human interpretation of the scene.

∙ Database Management System: it is formed by the Fusion Data Base (i.e. the memory of the architecture) and the Support Data Base (support for managing the stored data). It has the objective of monitoring, evaluating, adding, updating, and providing information for the fusion processes.

∙ Human/Computer Interface: it is the part that provides an interface for human input and for communicating fusion results to operators and users. This part is extremely important in order to create a closed operator-machine-environment loop. Such a cycle describes the way the data fusion architecture acquires and processes data and provides feedback in order to support the user during his work, figure 1-2.


Figure 1-1: JDL’s structure for data fusion [116].

Figure 1-2: Operator-machine-environment loop; the machine is represented by a JDL-based system.


1.1.2 Formulation of the problem

Certainly, in many respects such an artificial mechanism reflects natural information processing, but it cannot be considered a cognitive system. To be defined as a cognitive architecture, a framework has to specify parts that match human brain capabilities (such as perception, learning, reasoning and attention). However, there are still open problems: natural systems encode information and store it in order to drive their behaviours, whereas a JDL system usually stores a large amount of data in order to accomplish its tasks. JDL can be seen as a framework that produces results mainly oriented to helping the human operator. In order to accomplish this objective, situation awareness capabilities (such as perceiving, memorizing, learning, reasoning, understanding, focusing attention and predicting) must be included within the structure.

Figure 1-3: Self-aware oriented system proposed by Endsley [34]. The model of situation awareness in dynamic decision making is formed as follows: perception (level 1), comprehension of the situation (level 2), and prediction (level 3); after that, decision and action close the loop with the environment.

The JDL model does not clearly represent any self-aware mechanisms. However, such a structure can be mapped into a self-aware oriented architecture, see figure 1-3. Perception, comprehension and prediction are strongly correlated, and these interconnections cannot be ignored when building plausible artificial cognitive systems. Additionally, for human beings, perception and action are the result of previously recorded experiences [89]. This is the main reason why the comprehension block is extremely important in the decision making process. In this scenario, attention is particularly decisive in determining which of the data acquired from sensors are relevant for guiding the cognitive processes. Attention mechanisms raise the difficult question of how information is represented by populations with different numbers of neurons. This is an important issue for cognitive science in evaluating the computational properties of neurons in decoding and interpreting external stimuli [108]. For example, the system could choose an optimal information representation model in accordance with the relevant environmental information.

1.2 Specific contributions to progress in scientific and technological state-of-the-art

The innovative contribution of this thesis consists in studying, designing and validating a complete Cognitive Dynamic System architecture which combines cognition theories and video analytics methods. Such a structure realizes a bio-inspired system for video surveillance oriented to crowd monitoring. The proposed framework is characterized by the following components:

∙ Cognitive Surveillance Node: in figure 1-4 the proposed structure, based on the cognitive cycle, is shown; it presents the relations among the different development blocks. The Event Detection, Situation Assessment and Prediction blocks reflect the characteristics of the self-aware module shown in figure 1-3. In addition, Memory, Attention, Information Refinement and Reconstruction modules have been included in the system. These functional blocks and algorithms have been published in [22] [14] [18] [21] [20] [19] [15] [16].

Figure 1-4: Proposed Cognitive Node structure.

– Analysis block: the bio-inspired structure, figure 1-4, is formed by Event Detection, Situation Assessment and Prediction. Basically, the analysis function is able to encode and synthesize signals, not only for the description of single entity behaviours, but also for modelling cause-effect relationships between user actions (e.g. opening gates in order to control people flows) and changes in crowd configurations (e.g. the number of people in controlled areas).

Such models are stored, during a learning phase, within a sub-part of the memory block, namely the Autobiographical Memory (AM), which is derived from the studies of the neuroscientist Antonio Damasio [28]. The system, by observing cause-effect relationships, can acquire the human (i.e. operator) capability of analyzing and reacting to different situations. Such a procedure can be considered an effective knowledge transfer from an operator to the system. This mechanism relies on an automatic system called the Cognitive Surveillance Node (CSN), which is part of a complex cognitive JDL-based and bio-inspired architecture. After such a knowledge-transfer phase, the learned representations can be used at different levels: firstly, they can support human decisions by detecting anomalous interaction models and thus compensating for human shortcomings; secondly, considering an automatic decision scenario, the learned representations can identify anomalous patterns and choose the best strategy in order to preserve the stability of the entire system.

– Attention Focusing Module: it is a bio-inspired algorithm called Selective Attention Automatic Focus (S2AF), which is part of a more complex Cognitive Dynamic System (CDS) for crowd monitoring. The proposed algorithm can gather information related to different parts of the environment, after which it stores the knowledge within the so-called Environmental Memory, which is involved in the recognition of external stimuli. The main objectives of the proposed method are: first, to extract the relevant information needed for crowd monitoring directly from the environmental observations; second, to enhance the cognitive system capabilities not only for recognizing a specific situation but also for changing its behaviour in accordance with new circumstances, in order to localize the important parts of a controlled area.

– Information Refinement for relevant knowledge representation: when the amount of information increases (i.e., due to a larger number of monitored areas), attention focusing techniques are needed to highlight the most relevant parts of the overall acquired data. When wide-area surveillance systems are considered, one of the major problems in event detection is the representation of the scene as a whole, given a set of spatially limited, but considered relevant, observations. The main aim of this part of the work is an innovative representation method for sparse data, based on information theory. By means of Information Bottleneck theory [125], it is possible to determine the optimal data representation in the memory space.

– Adaptive reasoning algorithm for information reconstruction: advanced mechanisms for selecting and extracting salient information can improve the performance of autonomous video surveillance systems and support the operator. Such methods keep the extracted environmental information limited. However, recovery methods are needed in order to maintain full control of the whole area under surveillance. The proposed framework is an adaptive inductive reasoning (top-down logic) mechanism. The method can learn the correlations of the observed data, storing them in memory; after that, it can recover the whole information with minimum distortion.

∙ Crowd simulator: the following contributions have been published in [17].

– Simulator of virtual crowds: the proposed tool can simulate the movement of a large number of entities or characters. The simulator presents a 3D visualization of virtual people; moreover, it offers the possibility of configuring specific environments for the simulation of different crowding situations. Such a tool is extremely important in order to gather enough data for training the CDS (i.e. acquisition of stability models) and for validation, as video sequences of the desired kind are often not available for training. The environment within the simulator can be described by means of topological representations, providing logical configurations of dynamic entity-environment (e.g., crowd-environment) and entity-entity (e.g., operator-crowd) interactions.

1.3 Structure of the thesis

The thesis is organized as follows.

In the present introduction the topics of this work have been presented, with specific emphasis on the contributions of this thesis.

Chapter 2 explores the human cognition mechanisms, with specific reference to the Perception-Action Cycle, which introduces the fundamental basis for designing artificial computational processes that act like cognitive systems. Then an overview of the evolution of cognitive architectures is proposed.

In chapter 3 a bio-inspired structure, namely the Cognitive Surveillance Node, is proposed.

The presented framework is able to encode and synthesize signals such as entity behaviours. In particular, the chapter proposes an innovative algorithm for modelling a biological structure: the Autobiographical Memory. Such a structure can describe interaction relationships between multiple entity behaviours, such as the operator's (i.e. user's) actions and changes in the environment (i.e. the crowd configurations).

Chapter 4 is devoted to the analysis of the main features that allow the design of a probabilistic model able to learn the aforementioned interactions between operator and crowd. In the chapter the visual 3D simulator, where crowd behaviour is modelled by means of Social Forces, is explained. Then, the fundamental application of the proposed CSN to automatic crowd monitoring and anomaly detection is discussed and demonstrated by experiments on simulated and real video sequences.
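For reference, the Social Force model used to drive the simulated crowd is commonly written in the Helbing form below; the symbols are the standard ones from the literature, not necessarily the notation adopted later in this thesis:

```latex
m_i \frac{d\mathbf{v}_i}{dt} =
  m_i \frac{v_i^0 \mathbf{e}_i - \mathbf{v}_i}{\tau_i}
  + \sum_{j \neq i} \mathbf{f}_{ij}
  + \sum_{W} \mathbf{f}_{iW}
```

where the first term drives pedestrian $i$ towards its desired speed $v_i^0$ along direction $\mathbf{e}_i$ with relaxation time $\tau_i$, while $\mathbf{f}_{ij}$ and $\mathbf{f}_{iW}$ are the repulsive interaction forces exerted by other pedestrians $j$ and by walls or obstacles $W$.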

In chapter 5 an algorithm, based on the cognitive perception process, for designing a human attention mechanism (i.e. the S2AF) is proposed. Such an algorithm enhances the CSN capabilities for detecting and extracting relevant information within a surveillance network.

Chapter 6 presents a novel relevant-information representation for sparse and limited observations. An innovative hierarchical structure for knowledge representation based on artificial neural networks, the Self Organizing Maps (SOMs), has been used for classifying and correlating observed sparse data time series acquired by cameras. In the chapter, through Information Bottleneck theory, a cost function approach is proposed in order to determine the optimal data representation in the SOM space as a trade-off between the capability of sparse signal classification and the preservation of the original data's statistical similarities (i.e. correlations). By means of the proposed cost function, an algorithm for knowledge representation, called information bottleneck-based SOM selection (IB-SOM), is described.
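The Information Bottleneck trade-off underlying this cost function can be written, in its standard form (standard symbols, not necessarily the thesis notation), as the minimization over the stochastic mapping $p(t \mid x)$ of the functional

```latex
\mathcal{L}\big[p(t \mid x)\big] = I(X;T) - \beta \, I(T;Y)
```

where $X$ is the observed signal, $T$ its compressed representation, $Y$ the relevance variable whose information must be preserved, $I(\cdot;\cdot)$ denotes mutual information, and $\beta$ balances compression against preservation of relevant information.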

In chapter 7 a bio-inspired logical process for information recovery in cognitive crowd monitoring is described. In the proposed method the relevant information, which is identified directly from environmental observations, is defined through learning the relationships among sensory data. Unlike other methods for relevance detection, the proposed algorithm shows how, by using SOMs, it is possible to learn the data correlations. Moreover, it is shown how, through such a procedure, it is possible to reduce the number of environment parts to observe; by using such limited observations the proposed system can reconstruct the whole information.

Finally, the conclusions and future work are drawn in chapter 8.


Chapter 2

State of the art

Cognitive science and neuroscience aim at understanding and explicating human cognition [46]. In recent decades, engineering has been learning these principles from neuroscience. Of particular interest is the human capability of adapting to situations. This characteristic can be very valuable, especially in non-stationary stochastic environments [57]. The human ability to change behaviour in accordance with new circumstances can be seen as the result of the Perception Action Cycle (PAC), formalized by Joaquin Fuster [44]. This mechanism underlies how the actions of a cognitive entity are driven by the perception of the external environment. Fuster explains the central role of working memory in planning and decision-making processes. Moreover, such cognitive functions are involved in the organization of behaviour, language, and reasoning [25]. In the literature, other models are devoted to describing human cognition: among the most important is the Learning Intelligent Distribution Agent (LIDA) cognitive cycle [41]. LIDA's cognitive cycle, figure 2-1, consists of multiple modules, which can be described by four main stages: perception, memory retrieval, conscious broadcast and action.

∙ Perception: describes the human brain process of making sense of the current scene.

∙ Memory Retrieval: different memories are involved in the working of higher-level minds, figure 2-1. Episodic memory is the memory of autobiographical events. It comes in two forms: the transient form, which lasts only hours or a day, and the declarative form, which can last a lifetime.

∙ Conscious Broadcast: LIDA's workspace implements the preconscious buffers of working memory. The conscious broadcast from the workspace recruits the appropriate resources with which to deal with the current situation. The content of consciousness includes a compact representation of that situation.

∙ Action: a sensory-motor automatism is instantiated in response to the action chosen on the basis of sensory information. Such sensory information comes directly from sensory memory without the aid of the consciousness mechanism.

Figure 2-1: The scheme of LIDA’s cognitive cycle.

The main modules of the LIDA's cognitive cycle are:

∙ Sensory Memory: it holds incoming sensory stimuli. Such stimuli are external (i.e. generated by the environment) or internal (i.e. generated by internal processes).

∙ Perceptual Associative Memory: perception involves the recognition of incoming sensory stimuli, categorization, identification and situation awareness.

∙ Workspace: LIDA's workspace can be seen as a “container” or buffer of the current percept and earlier percepts, i.e. long-term working memory. The workspace can contain local associations retrieved from transient episodic memory or from declarative memory.

∙ Transient Episodic Memory: the transient episodic memory holds events, which decay after a few hours or a day.

∙ Declarative Memory: it is often called long-term episodic memory. The declarative memory can be subdivided into the Autobiographical Memories of events and the Semantic Memory. In the declarative memory, events can last for years or a whole lifetime.

∙ Attentional Codelets Memory: this module gathers all the attention codelets. The codelets bring specific information to consciousness. In other words, the attention codelets help in deciding what to do.

∙ Global Workspace: it contains the most relevant, important, urgent and insistent information, that of most concern to the agent at a specific moment.

∙ Procedural Memory: it is where the entity stores the various procedures that it may choose to perform.

∙ Action Selection: the mechanism devoted to choosing the appropriate action to take at a given instant in response to the current situation.

∙ Sensory-Motor Memory: it holds incoming internal or external stimuli.

Many cognitive architectures are based on the previous cognition theories. In this chapter, an overview of cognitive science aspects for the design of intelligent systems and a discussion of fundamental cognitive systems are presented.

2.1 From neuroscience to cognitive architectures

Cognitive systems are designed by implementing human characteristics (such as perception, learning, attention) within an artificial information processing framework. Modern studies of such systems are more focused on information processing that must reflect human reasoning processes. Many works in the last decades have been devoted to designing methodical procedures (or algorithms) that underlie the acquisition, representation, processing, storage, communication of and access to information. Among these, it is possible to mention the works by Alan Turing on computing machinery and intelligence, and by John McCarthy, Nathaniel Rochester, Marvin Lee Minsky and Claude Elwood Shannon, all founders of the discipline of artificial intelligence [82]. The progress in artificial intelligence and cognitive science has led to artificial architectures that act like certain biological systems. The term “architecture” implies an approach that attempts to model

not only behaviour, but also structural properties of the modelled system. More generally, cognitive systems are a sub-part of agent architectures. In this context, machine learning techniques, such as reinforcement learning [124], can be examined from the neuroscience perspective. A reinforcement learning agent interacts with the environment. At each discrete time 𝑡, the agent acquires an observation 𝑜𝑡 and a reward 𝑟𝑡. It then chooses an action 𝑎𝑡 from the set of possible actions, as shown in figure 2-2. The environment, as a consequence, changes its state: the new state is 𝑠𝑡+1 and the new reward is a function 𝑟𝑡+1 = 𝑓(𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1), which combines the performed action 𝑎𝑡 with the state transition 𝑠𝑡 ↦→ 𝑠𝑡+1. The goal of a reinforcement learning agent is to maximize the cumulative reward it receives. The so-called goal-directed action mechanism is realized through reward and is driven by perception. Basically, the entity receives rewards from the environment so that it can assess whether its goal is accomplished.
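The agent-environment loop just described can be sketched as follows. The two-state toy environment, the reward function and the short exploration phase are illustrative assumptions; only the observation/action/reward structure mirrors the text.

```python
# Illustrative sketch of the reinforcement learning loop described above.
# The two-state toy environment and its reward function are invented for
# the example; only the observation/action/reward structure follows the text.

def env_step(s_t, a_t):
    # State transition: action 1 moves the environment to the goal state 1.
    s_next = 1 if a_t == 1 else 0
    # Reward r_{t+1} = f(s_t, a_t, s_{t+1}): reaching state 1 pays 1.0.
    r_next = 1.0 if s_next == 1 else 0.0
    return s_next, r_next

def run_agent(steps=50):
    q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}  # action-value table
    s, total_reward = 0, 0.0
    for t in range(steps):
        if t < 10:
            a = t % 2  # short exploration phase: alternate the two actions
        else:
            a = max((0, 1), key=lambda b: q[(s, b)])  # greedy exploitation
        s_next, r = env_step(s, a)
        # Q-learning update towards reward plus discounted future value.
        q[(s, a)] += 0.1 * (r + 0.9 * max(q[(s_next, b)] for b in (0, 1))
                            - q[(s, a)])
        total_reward += r
        s = s_next
    return total_reward

print(run_agent())  # the agent accumulates reward by reaching the goal state
```

Here the maximization of cumulative reward emerges simply because, after a few rewarded transitions, the greedy policy keeps selecting the action that leads back to the goal state.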

Figure 2-2: The Perception Action Cycle and Reinforcement Learning.

The PAC implies that a cognitive architecture is divided into two parts: perceptron and actuator. The perceptron perceives the environment, and the cognitive system, through internal mechanisms, can tune its internal state in accordance with the external stimuli. The actuator acts directly on the environment in response to the perception. These concepts imply that a CDS must have an internal signal, realized by means of feedback, which connects the perceptron to the actuator part, figure 2-3. It can be noted that in a cognitive architecture there is a remarkable separation between internal signals (i.e. feedback) and external signals (i.e. observations), which can be considered an essential element in distinguishing cognitive systems from other artificial intelligence frameworks. This concept is

the key of Haykin's cognitive architecture [53], inspired by the PAC paradigm. Furthermore, the concept of self-organization is central to the description of biological systems. The definition of such a process comes from studies on physical, chemical and mathematical systems; there are also other examples from the natural, social and economic sciences. Basically, the “principle of the self-organizing dynamic system” was formalized in 1947 by Ashby [6]; it explains how a dynamic system evolves from non-equilibrium points to a state of equilibrium (i.e. an attractor). This evolution implies a mutual dependency among all the sub-parts (i.e. sub-systems or components) of the dynamic system. Each part must adapt to the other components. A link from the self-organizing dynamics of ergodic systems to the inferential nature of human perception has been mathematically formalized by Friston et al. in [42].

Figure 2-3: Functional block diagram that shows how the Cognitive Dynamic System perceives the environment in which it operates.

In the next sections, an overview and discussion of some well-known cognitive architectures is presented (see section 2.2). Then, an insight into Haykin's system and the dynamics of cognitive systems proposed by Karl Friston, Biswa Sengupta and Gennaro Auletta is described (see section 2.3).

2.2 Overview of cognitive architectures

The cognitive architectures, as aforementioned, propose (artificial) computational mechanisms that act like certain cognitive systems. It is possible to summarize some properties, which characterize and distinguish the cognitive frameworks, as follows:

∙ Implementation: an architecture can describe aspects of cognitive behaviour (e.g. learning, problem solving, reasoning, attention etc.) separately, or the cognition properties can be viewed as wholes, not as collections of parts (i.e. Holism) [7]. The Unified Theory of Cognition (UTC) by Allen Newell [93] is based on Holism concepts.

∙ Learning: not all cognitive architectures implement a learning phase, which is devoted to synthesizing information in order to build or reinforce the knowledge of the system. One of the most important learning processes derives from Hebb's rule, introduced by Donald Hebb (1949) [58]. Hebbian theory describes the adaptation of neurons in the brain during the learning process.

∙ Parameters: some frameworks implement self-tuning capabilities, which derive from control theory. These mechanisms can optimize internal system parameters in order to maximize or minimize specific objective functions.

∙ Architectural levels: some architectures are composed of sub-parts (or sub-architectures), often called layers (or levels). Each level implements a function, a mechanism, or a kind of information representation.

∙ Internal vs. Perception Action: some architectures model only the internal information process, which can include reasoning, planning, problem solving and concept learning. Other cognitive frameworks expand the description of cognition by encompassing perception and action in the system.

∙ Task-based selection: this characteristic implements a “switching mechanism”: the system is able to process information by selecting specific modules or components according to the current task.

∙ Fixed structure: the cognitive architecture is fixed, e.g. the links among the sub-parts cannot be modified and only the information stored in the levels can change.

∙ Growing structure: these frameworks adapt their dimension by increasing the number of units and the links among them in order to improve the information representation.
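Hebb's rule, cited under the Learning property above, can be stated compactly as Δw = η · x · y: a weight grows in proportion to the correlated activity of its pre-synaptic input x and post-synaptic output y. The sketch below is this textbook formulation with invented example values, not code from any particular architecture.

```python
# Textbook Hebbian update (illustrative sketch): delta_w = eta * pre * post.

def hebbian_update(w, pre, post, eta=0.1):
    # Strengthen each weight in proportion to the correlation between
    # pre-synaptic activity and post-synaptic activity.
    return [wi + eta * xi * post for wi, xi in zip(w, pre)]

weights = [0.0, 0.0, 0.0]
pre = [1.0, 0.0, 1.0]          # pre-synaptic activity pattern
for _ in range(3):
    # Post-synaptic output driven by the current weights plus a stimulus.
    post = sum(wi * xi for wi, xi in zip(weights, pre)) + 1.0
    weights = hebbian_update(weights, pre, post)
print(weights)  # weights grow only where the pre-synaptic input is active
```

Repeated co-activation strengthens exactly the connections carrying active input, which is the "reinforcing the knowledge of the system" effect mentioned in the Learning property.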

Cognitive architectures can be divided into three main categories: symbolic, connectionist and hybrid systems. In figure 2-4 a timeline of the evolution of cognitive architectures is shown.

2.2.1 Symbolic architectures

Symbolic systems are generically formed by “signs” which are used to communicate various messages. They are production systems: the behaviours of the cognitive entity (i.e. the system) are typically described by rules. Variables and procedural

mechanisms do not refer to the semantic content of the data representations.

Figure 2-4: Some of the more important cognitive architectures, on a timeline from 1980 to 2015: Adaptive Control of Thought-Rational (ACT-R) by J. R. Anderson [4]; Soar (1983) by J. Laird, A. Newell, and P. Rosenbloom [65]; Self Organizing Maps (SOMs) by T. Kohonen [62]; Neural Gas (NG) by T. Martinetz and K. J. Schulten [79]; Growing Neural Gas (GNG) by B. Fritzke [43]; Instantaneous Topological Map (ITM) by J. Jockhusch and H. Ritter [59]; Growing Hierarchical SOM (GH-SOM) by A. Rauber, D. Merkl, and M. Dittenbach [105]; Extended Soar by J. E. Laird (2008) [64]; Cognitive Dynamic System (CDS) by S. Haykin and J. Fuster [53].

Among the most important architectures belonging to this category is Soar, based on the UTC. This technique was developed by John Laird, Allen Newell, and Paul Rosenbloom. The first studies date back to 1983, and the official presentation of Soar came in a 1987 paper [65]. Soar

is based on a set of “if ... then” rules which provide a comprehensive description of how a system searches for the states which bring it gradually closer to its goal (i.e. the stability or goal state). In many cases it may be impossible to define contextual rules, e.g. when the system interacts with an unknown environment. In order to avoid this problem, reasoning techniques, such as reinforcement learning, have been used. A recent extended version of the Soar theory, presented by John E. Laird (2008), has included non-symbolic representations [64]. Another fundamental framework is Adaptive Control of Thought-Rational (ACT-R).

It has been inspired by the work of Allen Newell and created by John R. Anderson in 1983

[4]. Basically, it is divided in two modules: Perceptual-motor and Memory. The system

through the perceptual motor interacts with the environment. The memories are formed

by Declarative which is a set of rules and Procedural that basically is represented by a

set of procedures about “how to do something”. The buffers represent the front-end of the

modules which communicate by them.

2.2.2 Connectionist architectures

Connectionism represents an approach to designing cognitive architectures based on the interconnection of simple units (i.e. neurons) in a network. One of the most popular frameworks relying on such a mechanism is the Artificial Neural Network (ANN). ANNs are based on Hebb’s rules: each external stimulus fires (i.e. activates) a neuron so that it can be interpreted and understood. They are basically composed of a matrix of neurons (𝑁 = 𝑛 × 𝑚), and the knowledge representation depends on the weights and on the connections among the units, acquired by learning. The ANNs can be classified by:

a. Interpretation of neuronal activations: the meaning of a stimulus can depend on single or group neuronal activations.

b. Definition of activation: the neuronal activities produce consequences. Such results associate a definition to each activation pattern, e.g. in the Boltzmann machine [2] (1985) the activation is connected with a probability of generating an action.

c. Learning algorithm definition: the ANNs can be distinguished by their different learning methods. Competitive learning is based on a distance computed between the input 𝑥 and the neurons 𝑤𝑖, where 𝑖 ∈ {1, . . . , 𝑁}. Input and weight vectors must have the same dimension. The neuron whose weight is most similar to the input, 𝑢 = arg min𝑖 ‖𝑥 − 𝑤𝑖‖, is called the best matching unit (BMU) 𝑤𝑢. During the learning phase, the neurons are updated by a coefficient 𝛼^(𝑠) which decreases at each step 𝑠. The general update formula is: 𝑤𝑖^(𝑠+1) = 𝑤𝑖^(𝑠) + 𝜃^(𝑠)(𝑖, 𝑢) 𝛼^(𝑠) (𝑥 − 𝑤𝑖^(𝑠)), where 𝜃^(𝑠)(𝑖, 𝑢) is the neighbourhood function, which weights the distance between the neuron 𝑖 and the BMU 𝑢 at step 𝑠. Each ANN uses different choices for 𝛼^(𝑠) and 𝜃^(𝑠)(𝑖, 𝑢).
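To fix ideas, the generic competitive-learning step above can be sketched as follows (a minimal NumPy sketch: the function name, the Gaussian neighbourhood on the unit index and the exponential decay schedules for 𝛼 and the neighbourhood width are illustrative choices, not prescribed by the text):

```python
import numpy as np

def competitive_update(W, x, s, alpha0=0.5, sigma0=2.0, tau=100.0):
    """One competitive-learning step: find the BMU u and pull every
    neuron towards the input x, weighted by the neighbourhood function
    theta^(s)(i, u) and the decreasing coefficient alpha^(s)."""
    u = int(np.argmin(np.linalg.norm(W - x, axis=1)))   # best matching unit
    alpha = alpha0 * np.exp(-s / tau)                   # decreasing step size
    sigma = sigma0 * np.exp(-s / tau)                   # shrinking neighbourhood
    i = np.arange(len(W))
    theta = np.exp(-(i - u) ** 2 / (2.0 * sigma ** 2))  # theta^(s)(i, u)
    W = W + (theta[:, None] * alpha) * (x - W)          # general update formula
    return W, u
```

Calling the function repeatedly with increasing 𝑠 shrinks both the step size and the neighbourhood, so the map first organizes globally and then fine-tunes locally.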

Some very famous ANNs are: the Self Organizing Maps (SOMs) [62] created by Kohonen (1990), the Neural Gas (NG) [79] introduced by Thomas Martinetz and Klaus J. Schulten (1991), and one of the most important adaptive ANNs, the Growing Neural Gas (GNG) algorithm [43], created by Bernd Fritzke (1995). Inspired by the GNG, a growing neural network, the Instantaneous Topological Map (ITM) [59], has been developed by J. Jockhusch and H. Ritter (1999). The Growing Hierarchical SOM (GH-SOM) [105], created by A. Rauber, D. Merkl and M. Dittenbach (2002), combines the capability to increase both the neurons and the layers.

2.2.2.1 Self Organizing Map (SOM)

The Self Organizing Map (SOM) [62] is an unsupervised classifier. During the learning phase, the neurons of the network are tuned through the input stimuli. SOMs have been widely used in different pattern recognition applications, such as face recognition [63]. The SOM can produce a spatial organization of the neuronal units, according to the inputs, onto a SOM-layer (or SOM-map), as in figure 2-5. A SOM network is defined by a fixed low-dimensional (usually 2-D) structure of neuronal units 𝑖 = 1, · · · , 𝑁 (i.e. nodes) and connections 𝑗 (i.e. connecting edges). Each neuron is characterized by a weight vector 𝑤𝑖 = [𝑤𝑖,1, 𝑤𝑖,2, · · · , 𝑤𝑖,𝑀]. For each input vector 𝑥 = [𝑥1, 𝑥2, · · · , 𝑥𝑀] the adaptation of the SOM network (i.e. the learning phase) is performed in two steps:

∙ Node Matching: the nearest node 𝑢 (i.e. the best matching unit) to the input vector 𝑥 and its neighbourhood 𝑁𝑒𝑖𝑔ℎ(𝑢) are selected.

∙ Node Adaptation: the BMU and its topological neighbours 𝑁𝑒𝑖𝑔ℎ(𝑢) are modified according to a monotonically decreasing neighbourhood function 𝜃^(𝑠)(𝑖, 𝑢).

The accuracy of the SOM mapping basically depends on two parameters: the number of learning steps (a common choice is 500 times the number of nodes in the network) and the decreasing rule of 𝜃^(𝑠)(𝑖, 𝑢).

Figure 2-5: Example of a self organizing map structure: the SOM-layer is formed by a fixed 2-D grid of 𝑁 neurons, while the input vector is defined as 𝑥 = [𝑥1, 𝑥2, · · · , 𝑥𝑀].

2.2.2.2 Neural Gas (NG)

The NG is an ANN inspired by the Kohonen’SOM and introduced in 1991 by Thomas Mar-

tinetz and Klaus Schulten [79]. The particular name of this algorithm “neural gas" derives

49

from the assumption that, during the adaptation process, the dynamics of feature vectors

are similar to the distribution of gas molecules within the space. Given a probability distri-

bution 𝑃 (𝑥) of input vectors 𝑥 and considering a set of nodes 𝑖 = 1, · · · , 𝑁 represented by

weights 𝑤𝑖 the algorithm consists of 2 steps:

∙ Node Matching: the nearest node 𝑢 (i.e. the best match unit) to the input vector 𝑥

and its neighbourhood 𝑁𝑒𝑖𝑔ℎ(𝑢) are selected.

∙ Node Adaptation: the BMU and its topological neighbours 𝑁𝑒𝑖𝑔ℎ(𝑢) are modi-

fied according to a monotonically decreasing neighbourhood function 𝜃(𝑠)(𝑖, 𝑢), 𝜀

an adaptation step size ans 𝜆 the neighbourhood range. Both the parameters 𝜀 and 𝜆

are increased at each time step.

The NG performances depend from: the choices of parameters and the decreasing rule of

neighbourhood function.

2.2.2.3 Growing Neural Gas (GNG)

The Growing Neural Gas [43] is an incremental network that is able to learn the topological relations in a given set of input data by means of Hebb’s rules [78]. The advantage of growing networks with respect to a fixed layer of neurons is the self-adaptation mechanism. Such an algorithm can be seen as a modified Kohonen learning which combines the updating of neuron positions with the possibility of introducing new units into the layer. A GNG starts with a gas of nodes and builds appropriate connections between them during a training phase, inserting new nodes in the regions where the approximation error is largest. The principal elements of the algorithm are:

∙ A set of nodes 𝑖 represented by weights 𝑤𝑖 and error measures 𝑒𝑖

∙ A set of edges or connections 𝑗 whose age values are 𝑎𝑗

For each input vector 𝑥 the adaptation of the GNG network is performed by the following 4 steps:

Figure 2-6: Example of GNG adaptation to input stimuli. On the left the initial situation and on the right the final topology are shown.

∙ Node Matching: given an input vector 𝑥, the nearest node 𝑢 and the second nearest node 𝑠 are selected.

∙ Node Adaptation: the nearest node 𝑢 and its topological neighbours 𝑁𝑒𝑖𝑔ℎ(𝑢) are modified according to a specific adaptation strategy.

∙ Edge update: an edge connecting 𝑢 and 𝑠 is created if it does not already exist. The age of this edge is initialized to zero and the age of all the other edges emanating from 𝑢 is increased. All the edges whose age is greater than a threshold 𝑎𝑚𝑎𝑥 are removed.

∙ Node update: the error measure 𝑒𝑢 of the nearest node 𝑢 is increased. A new node is created every 𝜆 updating steps in the zone where a node 𝑞 and its neighbours have the greatest error values in the network. In order to avoid an undefined growth of the number of neurons, the error values of all nodes are multiplied by a decay factor.

The threshold 𝑎𝑚𝑎𝑥 is the most critical parameter of the GNG algorithm: an incorrect choice of this threshold can provoke the wrong deletion of useful connections.
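A single GNG adaptation step can be sketched as follows (an illustrative Python fragment with hypothetical names and arbitrary parameter values; the periodic node insertion every 𝜆 steps and the error decay are omitted for brevity):

```python
import numpy as np

def gng_step(W, E, edges, ages, x, eps_b=0.05, eps_n=0.006, a_max=50):
    """One GNG adaptation step. W: (N, d) node weights, E: (N,) accumulated
    errors, edges: set of frozenset({i, j}), ages: dict edge -> age."""
    d = np.linalg.norm(W - x, axis=1)
    order = np.argsort(d)
    u, s = int(order[0]), int(order[1])      # nearest and second-nearest node
    E[u] += d[u] ** 2                        # accumulate error on the winner
    W[u] += eps_b * (x - W[u])               # move the winner towards x
    for e in list(edges):
        if u in e:
            ages[e] += 1                     # age the edges emanating from u
            n = next(iter(e - {u}))
            W[n] += eps_n * (x - W[n])       # move the topological neighbours
    e_us = frozenset({u, s})
    edges.add(e_us)
    ages[e_us] = 0                           # create or refresh the edge u-s
    for e in list(edges):
        if ages[e] > a_max:                  # drop edges older than a_max
            edges.discard(e)
            del ages[e]
    return u, s
```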

51

2.2.2.4 Instantaneous Topological Map (ITM)

The algorithms NG, GNG and SOM make the strong assumption that the training data are uncorrelated. The Instantaneous Topological Map (ITM) algorithm [59] overcomes this limitation: it can create maps by using strongly correlated input sequences. ITM does not rely upon ageing or accumulated error parameters and it does not require node adaptation (unlike other neural networks such as SOM and GNG). Unlike SOM, which has a rigid topology and adaptable nodes, ITM has rigid node positions and maps the topology of the input data through adaptation rules. The main elements of an ITM are:

∙ A set of nodes 𝑖 represented by weights 𝑤𝑖.

∙ A set of undirected edges 𝑗 between the neurons.

For each input vector 𝑥 the adaptation of the ITM network is performed in 3 steps:

∙ Node Matching: according to the input vector 𝑥, the nearest node 𝑢 and the second nearest node 𝑠 are selected.

∙ Edge update: an edge connecting 𝑢 and 𝑠 is built if it does not already exist. For each node 𝑚 ∈ 𝑁𝑒𝑖𝑔ℎ(𝑢), if 𝑤𝑠 lies inside the Thales sphere between 𝑤𝑢 and 𝑤𝑚, the link between 𝑚 and 𝑢 is removed.

∙ Node update: if the input vector 𝑥 lies outside the Thales sphere through 𝑤𝑢 and 𝑤𝑠, a new node 𝑦 is created with 𝑤𝑦 = 𝑥, as in Fig. 2-7.

ITM outperforms the SOM and GNG algorithms in convergence speed. Each training step can produce and remove nodes or edges with no dependency on the network’s past history. The Instantaneous Topological Map is a particularly suitable approach for creating environmental maps, i.e. topologies which are able to maintain the main shape characteristics of the environment, by using people’s trajectories.
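The Thales-sphere test used in both the edge-update and node-update steps has a compact geometric form: a point lies inside the sphere having the segment between two weights as its diameter exactly when the angle subtended at that point is obtuse. A minimal sketch (hypothetical function names):

```python
import numpy as np

def inside_thales_sphere(p, a, b):
    """True if p lies inside the sphere having the segment a-b as its
    diameter (Thales sphere): the angle a-p-b is then obtuse, i.e.
    (a - p) . (b - p) < 0."""
    return bool(np.dot(a - p, b - p) < 0)

def itm_node_update(W, x, u, s):
    """ITM node-creation rule: if the input x falls outside the Thales
    sphere through w_u and w_s, a new node y with w_y = x is added.
    Returns the (possibly grown) weight matrix and the new node index."""
    if not inside_thales_sphere(x, W[u], W[s]):
        return np.vstack([W, x]), len(W)
    return W, None
```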

52

Figure 2-7: Node creation procedure in ITM algorithm.

2.2.2.5 Growing Hierarchical-SOM (GH-SOM)

The very popular SOM has two main limitations, which are related to the static architecture of the neuronal layer and to its limited capabilities for the representation of hierarchical relations in the data. The Growing Hierarchical Self-Organizing Map [105] is addressed at overcoming these limitations of the SOM. The elements of the network consist of layers 𝑙 and sets of nodes 𝑖, belonging to a specific layer, represented by weights 𝑤𝑖^𝑙. The network training starts with one node, which forms the layer 0 with weight 𝑤1^0. The weights of this neuron are initialized as the average of all the input data 𝑋. During this phase, the deviation of the input data, i.e. the mean quantization error mqe0 of this single unit, is computed as follows:

mqe0 = (1/𝑑) ∑_{𝑥∈𝑋} ‖𝑥 − 𝑤1^0‖, (2.1)

where 𝑥 ∈ 𝑋 is an input vector, 𝑋 is the set of input data and 𝑑 = card(𝑋) represents the total number of inputs. After the computation of mqe0, the training of the GH-SOM starts as a first-layer SOM. This first-layer map initially consists of a small number of units, e.g. a grid of 𝑁1 = 2 × 2 units, where 𝑁1 represents the number of neurons contained within the layer 𝑙 = 1. The learning process is equal to the training of a SOM. Then, for each unit 𝑖, mqe𝑖 is computed as in equation (2.2):

mqe𝑖 = (1/𝑑𝑖) ∑_𝑥 ‖𝑥 − 𝑤𝑖^1‖, (2.2)

where 𝑤𝑖^1 is the weight vector and 𝑑𝑖 is the number of input patterns 𝑥 mapped onto unit 𝑖 within the layer 𝑙 = 1 (the sum runs over those patterns). The Mean Quantization Error MQE (capital letters) of a map is defined as:

MQE𝑙 = (1/𝑁𝑙) ∑_{𝑖=1}^{𝑁𝑙} mqe𝑖. (2.3)

Figure 2-8: Column insertion and layer expansion in GH-SOM. Step 1: when MQE𝑙 exceeds a given threshold, a column is inserted between the node 𝑒 with the highest mqe𝑒 and its most dissimilar neighbour 𝑑. Step 2: layer expansion by adding a new map in order to represent the input data in more detail.

When during the training MQE𝑙 ≥ 𝜏𝑙 · mqe0, where 𝜏𝑙 is a fixed percentage (i.e. a predefined threshold), a column is inserted between the unit 𝑒 with the highest mean quantization error, mqe𝑒, and its most dissimilar neighbour 𝑑, see step 1 in figure 2-8. If a unit 𝑖 ∈ 𝑙 satisfies the condition mqe𝑖 ≥ 𝜏𝑢 · mqe0, where 𝜏𝑢 indicates the desired quality of the input data representation of the training process, a new layer is added in order to explain the input data in more detail, see step 2 in figure 2-8.
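The quantities driving the growth process, equations (2.2) and (2.3), can be computed directly (an illustrative NumPy sketch; `assignment` is assumed to hold, for each input, the index of its best matching unit in the layer):

```python
import numpy as np

def mqe(weight, mapped_inputs):
    """Mean quantization error of one unit over the d_i inputs mapped
    onto it, as in equation (2.2)."""
    X = np.asarray(mapped_inputs, dtype=float)
    return float(np.mean(np.linalg.norm(X - weight, axis=1)))

def layer_MQE(units, assignment, X):
    """Mean quantization error of a whole map, as in equation (2.3):
    the average of mqe_i over the N_l units of the layer."""
    return float(np.mean([mqe(w, X[assignment == i])
                          for i, w in enumerate(units)]))

def needs_column(MQE_l, mqe0, tau_l):
    """Horizontal growth check: a column is inserted when
    MQE_l >= tau_l * mqe0."""
    return MQE_l >= tau_l * mqe0
```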

2.2.3 Hybrid architectures

The hybrid systems use a combination of methods and techniques from artificial intelligence, such as artificial neural networks, reinforcement learning and hybrid connectionist-symbolic architectures. In this category it is possible to mention the Connectionist Learning with Adaptive Rule Induction On-line (CLARION), created by Ron Sun’s research group. It was proposed in its early form by Sun [120] (1997), [121] (1999), [122] (2000) and by Sun et al. [123] (2001). CLARION is a system which consists of distinct sub-parts. The fundamental idea is to design an architecture which emphasizes bottom-up learning. Such a framework includes and explores the interaction between explicit and implicit cognition processes. The explicit process is the more intuitive concept: it refers to an explicit goal, e.g. a system must activate explicit mechanisms of cognition in order to reach a given point (i.e. the explicit goal). Each movement consists of a set of decisions which bring the system closer to its goal. These intermediate states are the results of implicit processes.

2.3 Haykin’s architecture and dynamics of cognitive systems

This section initially reviews the model proposed by Haykin, called Cognitive Dynamic System, which can be considered a hybrid structure. Secondly, according to a recent study by Karl Friston, Biswa Sengupta and Gennaro Auletta, the causality between perception and action will be discussed and formalized.


2.3.1 Haykin’s architecture

Haykin in [53] introduces a representation of the cognitive dynamic system intended to mimic brain functionalities. This representation is based on two main principles of cognition, namely perception and memory. There are three main blocks in a cognitive dynamic system: the Cognitive Perception unit, the Cognitive Control unit and the Probabilistic Reasoning Machine unit, figure 2-9. These units form a hierarchical closed-loop feedback system (namely the global perception-action cycle), where the environment reward plays a critical role in how the system perceives the world.

Figure 2-9: Haykin’s Cognitive Dynamic System. Cognitive Perceptron (CP); Cognitive Control (CC); Probabilistic Reasoning Machine (PRM).

In the first part of the perception-action cycle, the Cognitive Perception unit processes the measurements coming from the sensors. By observing the environment, the system can form an internal representation of the external world. In this sense, the CDS is able to compute the reward based on the perception error. The feedback connects the Cognitive Perception unit with the Cognitive Control unit. This second block is devoted to choosing the optimum action that the system performs according to its goal. Such an action must maximize the next reward. According to Haykin and Fuster, the biological processes of adapting the behaviours of an entity to the environmental dynamics are reproduced by the afore-mentioned mechanism. In figure 2-9 a multi-layer structure is shown, where each layer represents a perception-action module. By exploring one layer of the PAC it is possible to see that, after a measure is gathered, the representation of the external world in the Cognitive Perception unit is updated. This representation is affected by an intrinsic error (i.e. the perception error) due to the noisy (i.e. imperfect) observation. The uncertainty of the acquired data can be measured by the perception error. Such a measure can be associated to a random variable 𝑋, where 𝑥 denotes a sample value of 𝑋 and 𝑝(𝑥) is the probability density function of the random variable 𝑋. It is possible to define a perception entropic state for a continuous variable as follows:

𝐻𝑝^𝑘 = ∫_{−∞}^{+∞} 𝑝(𝑥) log(1/𝑝(𝑥)) 𝑑𝑥, (2.4)

while, for a discrete variable, the perception entropic state is:

𝐻𝑝^𝑘 = ∑_{𝑘=−∞}^{+∞} 𝑝(𝑥𝑘) log(1/𝑝(𝑥𝑘)). (2.5)

Through the entropic state it is possible to compute an incremental difference as follows:

∆𝐻𝑝 = 𝐻𝑝^𝑘 − 𝐻𝑝^{𝑘−1}, (2.6)

where ∆𝐻𝑝 represents the difference between two consecutive entropic states (i.e. at the time instants 𝑘−1 and 𝑘). The Cognitive Control unit, through the value ∆𝐻𝑝, can select the optimum future action. It is important to underline that the system must take into consideration the uncertainty of the action selection process. The error of the action choices can be computed by the control entropic state 𝐻𝑐^𝑘. The Probabilistic Reasoning unit is the block that connects the perception to the control part. It is in charge of generating a signal called Perceptual Attention, which can be considered as the result of balancing between the perception and the cognitive control errors. Such a signal is essential in order to maintain or restore the stability of the system.
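For a discrete distribution, the entropic state of equation (2.5) and the increment of equation (2.6) can be computed directly (a small illustrative sketch, with the usual convention that zero-probability values contribute nothing):

```python
import numpy as np

def entropic_state(p):
    """Discrete perception entropic state, equation (2.5):
    H = sum_k p(x_k) * log(1 / p(x_k)), with 0 * log(1/0) taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(np.sum(nz * np.log(1.0 / nz)))

def delta_entropy(H_prev, H_curr):
    """Incremental difference between two consecutive entropic states,
    equation (2.6): dH_p = H^k - H^{k-1}."""
    return H_curr - H_prev
```

A negative increment means that the uncertainty about the external world has decreased between two consecutive cycles.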


2.3.2 Dynamics of cognitive systems

Friston et al. in [42] formalize the neuronal self-organization processes in order to provide an explanation of the internal dynamics of cognitive systems. The CDS together with its environment forms an ergodic Random Dynamical System, which can be described by means of a Markov blanket [98]. A Markov blanket for a node 𝐴 in a Bayesian network is described as the set of states formed by: the parent nodes of 𝐴, the children nodes of 𝐴 and the parents of all the children, see figure 2-10. Based on this concept, the global state of the system (called agent) is divided into: External Ψ (i.e. the environment), Sensory 𝑆 (i.e. sensors), Active 𝐴 (i.e. the actions) and Internal 𝑅 (i.e. representation of information) states. Figure 2-11 shows the partition into internal and external states, which are separated by a Markov blanket including the sensory and active states. More in detail: the external states 𝜓 ∈ Ψ directly influence the sensory states 𝑠 ∈ 𝑆, and they are influenced by the actions described by means of the active states 𝑎 ∈ 𝐴. The agent’s internal dynamics are denoted by the representational states 𝑟 ∈ 𝑅. Such internal conditions cause actions and depend on the sensory states.

Figure 2-10: The Markov blanket of node 𝐴.

The information flow, which is motivated by cognitive theories, can be more formally represented by the Markov blanket in terms of the functions 𝑓𝜓(𝜓, 𝑠, 𝑎), 𝑓𝑠(𝜓, 𝑠, 𝑎), 𝑓𝑟(𝑠, 𝑎, 𝑟) and 𝑓𝑎(𝑠, 𝑎, 𝑟). Such functions identify the dependencies among the variables (internal, active, sensory and external) shown by the arrows in figure 2-11. A cognitive system can be represented in terms of equations of motion and then, by means of the Markov blanket, it is possible to identify the external and internal states and to specify their dynamics. By using Bayesian inference, Friston et al. show how to predict the internal dynamics of the states, such as the biological mechanisms of self-organization, which have many of the hallmarks of human cognition processes.

Figure 2-11: Global state representation of the system. The internal states 𝑟 are connected with the active states 𝑎, while the environment 𝜓 is influenced by the actions. The external states are observed by the sensors 𝑠.
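These dependencies can be made concrete with a generic Euler integration, where the four update functions are placeholders to be supplied by a specific model; the point of the sketch is that the internal and active states never read 𝜓 directly, but only the sensory states:

```python
def simulate(f_psi, f_s, f_r, f_a, psi, s, r, a, dt=0.01, steps=100):
    """Euler integration of the Markov-blanket dynamics. External states
    psi and sensory states s evolve as functions of (psi, s, a); internal
    states r and active states a evolve as functions of (s, a, r) only,
    i.e. they are shielded from psi by the blanket."""
    for _ in range(steps):
        dpsi, ds = f_psi(psi, s, a), f_s(psi, s, a)
        dr, da = f_r(s, a, r), f_a(s, a, r)
        psi, s = psi + dt * dpsi, s + dt * ds
        r, a = r + dt * dr, a + dt * da
    return psi, s, r, a
```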

2.4 Final remarks about cognitive architectures

The previous sections have described the main classes of cognitive architectures. In figure 2-12 the differences among the previously discussed cognitive frameworks are summarized. Soar and ACT-R are symbolic structures: they implement human brain processes by means of sets of rules describing the behaviour. ACT-R also uses both a declarative memory (i.e. a set of statements which define the behaviour) and a procedural memory (a set of procedures that can involve parameters and functions). The Extended Soar can be seen as both symbolic and non-symbolic: indeed, it adds machine learning techniques to the system in order to introduce a degree of freedom.

Figure 2-12: Characteristics of cognitive structures.

The artificial neural networks emulate the human learning and understanding process by reproducing the creation of neuronal connections. The main difference with respect to the symbolic frameworks is that in the connection-oriented structures there is a complete lack of behavioural rules. Moreover, some ANNs can increase their knowledge by augmenting the number of neurons. On the other hand, both symbolic and connectionist architectures do not make any distinction between the internal and external state representations of the cognitive entity. The hybrid structures, such as CLARION and Haykin’s CDS, introduce the PAC, which implies a separation between internal (implicit) and external (explicit) configurations. The mechanisms relating internal and external states have been defined by Friston et al.: they explain that the internal configurations of a cognitive entity depend on the sensory perception of the environment. The added value of hybrid structures is represented by their applicability to both control tasks (which are typically internal issues of the system) and procedural tasks (which consist in the execution of physical actions).


Chapter 3

Crowd modeling and proposed architecture for cognitive video surveillance system

3.1 Introduction

Human behaviour analysis has important applications in the field of anomaly management, such as Intelligent Video Surveillance (IVS). As the number of individuals in a scene increases, however, new macroscopic complex behaviours emerge from the underlying interaction network among the multiple agents. This phenomenon has lately been investigated by modelling such interactions through Social Forces [84]. In the most recent Intelligent Video Surveillance systems, mechanisms to support human decisions are integrated into cognitive artificial processes. These algorithms mainly address the problem of modelling the behaviours of the monitored entities (i.e. people).

In this chapter a bio-inspired structure is proposed, which is able to encode and synthesize signals, not only for the description of entity behaviours, but also for modelling cause-effect relationships between user actions and changes in the environment configurations (i.e. the crowd). During a learning phase the aforementioned interaction models are stored within a memory: the Autobiographical Memory, figure 3-1.

Figure 3-1: The Autobiographical Memory stores the cause-effect relationships between entities, such as user actions and changes in environment configurations (i.e. the crowd).

Through this step, the system operates an effective knowledge transfer from a human operator towards an automatic system called the Cognitive Surveillance Node (CSN), which is part of a complex cognitive JDL-based and bio-inspired architecture. After such a knowledge-transfer phase, the learned representations can be used, at different levels, to support human decisions by detecting anomalous interaction models, thus compensating for human shortcomings; in an automatic decision scenario, instead, the system can identify anomalous patterns and choose the best strategy to preserve the stability of the entire system.

The first part of this chapter discusses the strong relations among the crowd monitoring, simulation and modelling fields. Just to give some examples: the tasks of simulating and monitoring a crowd raise the issue of modelling its behaviour, since crowds obviously need to be given a dynamic evolution model in order to be simulated; also, a dynamic model is often needed to improve the performance of crowd monitoring applications through Bayesian filtering; then again, simulations are often necessary in order to test crowd monitoring algorithms; eventually, crowd monitoring can provide valuable hints on how to effectively model and describe crowds. A comprehensive treatment of such interconnected fields is given in section 3.2. Section 3.3 presents the bio-inspired cognitive model for cognitive surveillance systems.

3.2 Scale issue

Sure enough, one should ask oneself what a crowd is before starting to discuss it. The way people define a crowd obviously depends on the area in which the crowd itself is investigated, and thus many different definitions can be found in the literature. However, any definition one could try to give can hardly avoid describing the crowd in terms of its components, namely the people by which it is formed. This remark may sound trivial, but it has deep implications in the way a crowd is depicted. In particular, it raises the issue of choosing between a local description of it and a global one. A local description of a crowd relies on the features associated to each member, such as positions, speeds, orientations, motivations, destinations, etc. A global (holistic) description, on the other hand, relies on features that can be associated to the crowd as a single entity, such as the average density, the entropy, the average shift in some direction, the displacement, etc. Global features can in general be derived from local ones, by averaging or integrating local quantities. The opposite, on the contrary, never happens. However, it is not only a matter of the scale at which the crowd is analysed, but rather of the additional amount of information stored in local quantities compared to global ones. A nice parallel example comes from the well-known field of thermodynamics, where global quantities, such as the energy, pressure and temperature of a gas, can in principle be derived from the average kinetic energy of its molecules: by knowing the exact behaviour of each single molecule in the gas one can derive the temperature, while the opposite calculation is not possible, as information is lost by averaging over all molecules. However, in both the cases of crowds and thermodynamics, it is not always possible to access local information entirely, while global quantities can be easily gathered. For example,

in a video surveillance framework it is unrealistic to track every single person in a high-density crowded scene, especially if a single camera is available: the visual information gathered by the camera sensor is simply not enough to accomplish such a task. This kind of consideration has led to approaches such as the one proposed in [87], in which a very subtle analysis is performed, taking into account a global macroscopic scale, a middle mesoscopic scale and eventually a local microscopic scale in a hydrodynamics-inspired framework (here again physics is of great help). A perfectly specular approach is, on the contrary, often adopted in simulating, and also modelling, crowds. Here an underlying model can be designed to capture the fine-scale behaviour of each crowd member, in order to reproduce (simulate) some desired macroscopic behaviour. This approach can on the one hand be really helpful in fine-tuning the macroscopic simulation outputs by correcting microscopic local parameters in the model. On the other hand, it can be a very effective way to validate the accuracy of models, as it gives a way to check their accuracy in reproducing global crowd behaviours.
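The one-way reduction from local to global descriptions can be illustrated with a toy sketch (hypothetical feature names; the bounding-box density is only a crude proxy): the averages are easy to compute from the individual tracks, while the tracks cannot be recovered from the averages:

```python
import numpy as np

def global_features(positions, velocities):
    """Derive holistic (global) crowd features from per-member (local)
    ones. The reduction is one-way: the individual tracks cannot be
    recovered from these averages, mirroring the thermodynamics analogy."""
    positions = np.asarray(positions, dtype=float)
    velocities = np.asarray(velocities, dtype=float)
    area = np.ptp(positions, axis=0).prod()          # bounding-box area
    return {
        "mean_position": positions.mean(axis=0),
        "mean_velocity": velocities.mean(axis=0),    # average shift direction
        "density": len(positions) / area if area > 0 else float("inf"),
    }
```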

3.2.1 Crowd monitoring

The crowd phenomenon has increasingly attracted the attention of worldwide researchers in video surveillance and video analysis [131] over the last decades. Nowadays an extremely prolific literature is growing on the subject. Different implications related to crowd behaviour analysis can be considered, since both technical and social aspects are currently under researchers’ investigation. On the one hand, researchers focusing on the psychology and sociology domains consider and model crowd behaviour as a social phenomenon. Several examples can be found in the literature dealing with the role and the relevance of human interaction in characterizing the behaviour of a crowd. In [70], a simulation-based approach for the creation of a population of pedestrians is proposed. The authors aim at modelling the behaviour of up to 10,000 pedestrians in order to analyse several traffic patterns and people reactions typical of an urban environment. The impact of the emotions of individual agents in a crowded area has been investigated also by Liu et al. [68] in order to simulate and model the behaviour of groups of people. Likewise, Handford and Rogers [49] have recently proposed a framework for modelling drivers’ behaviour during an evacuation in a post-disaster scenario, taking into account several social factors which can affect their behaviour in following a path to reach a safe spot.

On the other hand, the technical aspects of crowd behaviour analysis applications mainly focus on the detection of events or the extraction of particular features exploiting computer vision based algorithms. An estimation of the number of people in a crowd can be performed by counting the foreground and edge pixels. Davies et al. propose a system using the Fourier transform for estimating the motion of the crowd [29]. Many researchers have tried to use segmentation and shape recognition techniques for detecting and tracking individuals and thus estimating the crowd. However, this kind of approach can hardly be applied to overcrowded situations, where people are typically severely occluded [128], [50]. Neural networks are used in [76] for estimating the crowd density from texture analysis, but in this case an extensive training phase is needed for getting good performance. A Bayesian model-based segmentation algorithm was proposed in [132]; this method uses shape models for segmenting the individuals in the scene and is thus able to estimate the number of people in the crowd. The algorithm is based on Markov chain Monte Carlo sampling and is extremely slow for large crowds. An optical flow based technique is used in [5], [8], while Rahmalan et al. [103] proposed a computer vision based approach relying on three different methods to estimate the crowd density for outdoor surveillance applications. The combination of technical and social aspects can represent an added value with respect to the already presented works. A first example can be found in [26], where the authors exploit a joint visual tracking-Bayesian reasoning approach to understand people and crowd behaviour in a metro station scenario. More recently [84], [99], [71], [87], a social force model describing the interactions among the individual members of a group of people has been proposed to detect abnormal events in crowd videos. Here people are treated as interacting particles subject to internal and external physical forces which determine their motion. At the same time, social and psychological aspects are taken into account in modelling such indeed “social” forces, showing the effectiveness of a synergic multidisciplinary approach to the problem. This represents the current trend in crowd modelling research.


3.2.2 Crowd simulation

Graphical or symbolical simulation of moving crowds is a continuously evolving field

which involves research groups all around the world in many different areas, such as en-

tertainment industry (video-games and motion-picture), police and military force train-

ing (manifestations and riots simulations), architecture (buildings and cities design), traffic

control (crossovers and walking paths), security sciences (evacuation of crowded environ-

ments) and sociology (behaviour studies). Simulation of crowds meets the needs for crowd

observation data that are often hard or even impossible to gather directly and is also often

necessary in the design stage of security and surveillance systems. Here again different

application areas obviously show different approaches to the problem. Basically, these

approaches can be divided into two main categories. The first is mostly focused on be-

havioural aspects of the crowd, while neglects visual output quality. Crowd members can

be schematically represented as dots or stylized shapes or even melt together in a rougher

framework, wherever only a global point of view is needed. Here, only realism of dynam-

ics is stressed. The second approach, on the contrary, is centred on visual effects and it is

not really concerned with an appropriate modelling of the real behaviour. A well balanced

integration of realism in the behaviour of the crowd and in the visualization of it is also

often needed, at least to some extent, as in the case here presented. This will be discusses

in details in the following. As mentioned at the beginning of this section, crowds need to

be given an underlying dynamical model in order to be simulated. Actually, such a model

is inherently in charge of depicting the evolution of some crowd features only. This raises

again the issue of how to describe crowds. Such a description includes a selection not only

of the features one is interested in simulating, but also of the scale at which the model has

to lie, in order to effectively describe those features. Namely, a microscopic model could

be given the task of simulating features at a more global level, while the opposite way is

hardly practicable as it can be easily guessed.


3.3 A bio-inspired cognitive model for Cognitive Surveillance Systems

The proposed approach to Intelligent Video Surveillance (IVS) has been implemented ac-

cording to a bio-inspired model of human reasoning and consciousness grounded on the

work of the neuro-physiologist A. Damasio [28].

As already mentioned, Damasio’s theories describe cognitive entities as complex systems

capable of incremental learning based on the experience of relationships between them-

selves and the external world. Two specific brain devices can be defined to formalize the

aforementioned concept: Damasio names them proto-self and core-self. Such devices are

specifically devoted to monitoring and managing the internal status of an entity (proto-self) and

the external world (core-self). Thus, crucial aspects in modelling a cognitive entity fol-

lowing Damasio’s model are first of all the capability of accessing entities’ internal status

and secondly the analysis of the surrounding environment. Relevant information comes

from the relationships between the two. This approach can be mapped into a sensing

framework by dividing the sensors into endo-sensors (or proto-sensors) and eso-sensors

(or core-sensors) as they monitor, respectively, the internal or external state of the interact-

ing entities.

Applying these concepts to the video analysis domain, interacting entities can be repre-

sented either by a guard monitoring a smart environment or by a subject driving an intelligent vehicle, as well as by a guard and an intruder interacting in some monitored area. In a crowd management scenario, eso-sensors can monitor the crowd, while endo-sensors can provide information about system parameters, as will become clearer in the

following.

Referring to the proposed sample framework, four main blocks have been identified as representative of a cognitive-based sensing architecture: the control centre, the CN, the (intelligent) sensing nodes and the mobile terminals and/or actuators. The tasks which can


be accomplished by each block are shown in Fig. 3-2, establishing a preliminary bridge

between the concepts introduced by Damasio and the effective implementation of the sys-

tem.

The core of the proposed architecture is the already mentioned CN, which can be described

as a module that is able to receive data from sensors of all kinds and to process them, defining different configurations as interactions between proto and core states. Such a bio-inspired knowledge representation makes it possible to assess potentially dangerous or anomalous events and situations, and possibly to interact with the environment itself.

3.3.1 Cognitive Cycle for single and multiple entities representation

Within the proposed scheme, the representation of each entity has to be structured in a

multi-level hierarchical way. As a whole, the closed processing loop realized by the CN in

case of a given interaction between an observed object and the system can be represented

by means of the so-called Cognitive Cycle (CC - see Fig. 3-2) which is based on four

fundamental logical blocks:

∙ Sensing: the system has to continuously acquire knowledge about interacting objects

and their own internal status.

∙ Analysis: the collected raw knowledge is processed in order to obtain a precise and

concise representation of occurring events and causal interactions.

∙ Decision: the precise information provided by the analysis phase is processed and a

decision strategy is selected according to the goal of the system.

∙ Action: the system fulfils the configuration computed during the decision phase as a

direct action over the environment or as a message provided to some actuator.
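These four logical blocks can be sketched as a minimal closed processing loop. The following skeleton is purely illustrative (all class, function and label names, and the toy crowd-density example, are hypothetical, not part of the thesis implementation):

```python
# Illustrative skeleton of the four-stage Cognitive Cycle (hypothetical names).
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class CognitiveCycle:
    sense: Callable[[], Dict]            # Sensing: acquire raw observations
    analyse: Callable[[Dict], str]       # Analysis: raw data -> concise situation label
    decide: Callable[[str], str]         # Decision: situation -> strategy
    act: Callable[[str], None]           # Action: apply the strategy to the environment
    log: List[Tuple[str, str]] = field(default_factory=list)

    def step(self) -> Tuple[str, str]:
        raw = self.sense()
        situation = self.analyse(raw)
        strategy = self.decide(situation)
        self.act(strategy)
        self.log.append((situation, strategy))
        return situation, strategy

# Toy usage: a crowd-density reading drives an alert decision.
cc = CognitiveCycle(
    sense=lambda: {"density": 0.9},
    analyse=lambda raw: "crowded" if raw["density"] > 0.7 else "normal",
    decide=lambda s: "alert_operator" if s == "crowded" else "idle",
    act=lambda strategy: None,
)
print(cc.step())  # -> ('crowded', 'alert_operator')
```

Each call to `step()` realizes one pass of the Sensing-Analysis-Decision-Action loop.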

The proposed model for cognition has many analogies with the one adopted by Haykin

in his formalization of Cognitive Dynamic Systems [56] and referred to as the Fuster's

Paradigm: Joaquin Fuster proposes in fact the concept of cognit and an abstract model for cognition, based on five fundamental building blocks, namely perception, memory, attention, intelligence and language [45].

Figure 3-2: Cognitive Cycle (single object representation)

Perception represents the information gain block, and

corresponds to the sensing block of the CC; similarly, intelligence matches the analysis

logical block and also, according to the Fuster’s paradigm, includes the decision-making

stage. Memory is associated, within the CC, with a learning phase which is continuous and basically involves all the stages of the cognitive cycle: this will be explained in more detail in sections 4.1.3 and 4.2. The attention block is meant to optimize the information

flow within the Cognitive Dynamic System: this aspect goes beyond the purposes of this

work. Finally, the language block is intended to provide efficient communication on a person-to-person basis, but it is not considered here (and not even by Haykin in his works).

The CC, by experiencing interactions between the CN and the external object, provides

different configurations, also called cause-effect relationships. Starting from these relations it is possible to define object representations based on their dispositional capabilities, i.e.

the objects can be disposed (or not) to change in some way. More formally, an observed

object x is disposed to D in different C-cases (i.e. situations), where D defines the dispositional property of x by a set of C configurations (i.e. cause-effect relationships), called dispositional statements [10].


A set of dispositional properties gives a dispositional embodied description of an object,

and it includes reactions it generates in the cognitive system, i.e. possible actions that

the system can plan and perform when a situation involving that object is observed or

predicted. According to this statement, it is possible to refer to the representation model

depicted in Fig. 3-2 as an Embodied Cognitive Cycle (ECC). The cognitive cycle can be

seen as a way of representing generic observed objects within the CN by means of a multi-

level representation involving both the bottom-up analysis chain and the top-down decision

chain (see Fig. 3-3). With respect to security and safety domains, in which the ECC

is applied here, the above-mentioned dispositional properties are associated with a precise

objective: to maintain stability of the equilibrium between the object and the environment

(i.e. maintenance of the proper level of security and/or safety). An anomaly is a deviation from normality and can be considered as a violation of a certain dispositional property.

Figure 3-3: Cognitive Node-based object representation: Bottom-up analysis and top-down decision chain.

As a consequence, each entity is provided with a “security/safety oriented ECC (S/S-ECC)”

which is representative of the entity itself within the CN. Moreover, the mapping of the

S/S-ECC onto the CN chain shown in Fig. 3-3 can be viewed as the result of the interaction

between two entities, each one described as a cognitive cycle too. In particular, if the external object (eso) and the internal autonomous system (endo) are represented as a couple

of Interacting Virtual Cognitive Cycles (IVCC), the IVCCs can be matched with the CN

structure (i.e. the bottom-up and the top-down chains) by associating parts of the knowl-

edge related with the different ECC phases to the multilevel structure processing parts of

the CN (Fig. 3-3).

More in detail, the representation model of the ECC (top left corner of Fig. 3-4) is centered

on the cognitive system that can be considered by itself as a cognitive entity. Therefore, it

is possible to map the proposed representation as in the top right corner of Fig. 3-4, where

two IVCCs, the one representing the entity (or object - 𝐼𝑉 𝐶𝐶𝑂) and the other representing

the cognitive system (𝐼𝑉 𝐶𝐶𝑆), interact in a given environment. In this model, the sensing

and action blocks of the 𝐼𝑉 𝐶𝐶𝑆 correspond to the sensing and action blocks of the ECC

(see bottom right corner of the figure). However, in the 𝐼𝑉 𝐶𝐶𝑆 , such blocks assume a

parallel virtual representation of the physical sensing and action, corresponding respectively to the Intelligent Sensing Node and the Actuator blocks in the general framework.

The analysis phase of the 𝐼𝑉 𝐶𝐶𝑆 (𝐴𝑛𝑎𝑙𝑦𝑠𝑖𝑠 − 𝐼𝑉 𝐶𝐶𝑆) can be considered as including

a virtual representation of the four stages characterizing the state of the interacting ob-

ject. The sensing phase can be mapped onto the event detection sub-block of the 𝐴𝑛 − 𝐼𝑉 𝐶𝐶𝑆

(𝐸𝐷𝑆𝑦𝑠𝑡𝑒𝑚) as well as the object event detection (𝐸𝐷𝑂𝑏𝑗𝑒𝑐𝑡). Similarly, the system situation

assessment sub-block (𝑆𝐴𝑆𝑦𝑠𝑡𝑒𝑚) includes a virtual representation of the object situation as-

sessment (𝑆𝐴𝑂𝑏𝑗𝑒𝑐𝑡). Finally, as shown in the bottom left corner of Fig. 3-4, the prediction,

decision and action parts of the object can be considered as knowledge that allows the cog-

nitive system to predict the future behaviour of the interacting object itself (the interacting

objects are here the system and the one observed external object). Prediction, decision and

action can be included in the prediction sub-block of the system (𝑃𝑆𝑦𝑠𝑡𝑒𝑚).

Figure 3-4: Embodied Cognitive Cycle, Interactive Virtual Cognitive Cycles and Cognitive Node matching representation

The proposed interpretation of the matching among the embodied cognitive model, the interactive virtual cycles representing the entities acting in the environment (including the system) and the CN allows considering the CN as a universal machine for processing

ECCs with respect to a large variety of application domains. In general, each ECC starts

with ISN (Intelligent Sensing Node) data, including an interacting entity (eso-sensor) and

a system reflexive observation (endo-sensor). The observed data are considered from two

different perspectives (the object’s and the system’s) by creating a description of the cur-

rent state of the entities using knowledge learned in previous experiences. Such a process happens at the event detection and situation assessment sub-blocks. Then, a prediction of the future actions taken by the 𝐼𝑉 𝐶𝐶𝑂, contextualized with the self-prediction of future planned actions of the system, occurs at the prediction sub-block. The use of the knowledge of the

𝐼𝑉 𝐶𝐶𝑂 ends at this stage. Finally, the 𝐼𝑉 𝐶𝐶𝑆 is completed by adjusting plans of the

system in the representation of its decision and action phases that are, as stated above, a

parallel virtualization of the ECC. In addition, it is relevant to briefly point out that a sim-

ilar decomposition can be adopted in the case when two interactive entities are observed.

The description of the interacting subjects can be modelled observing that the two entities

can form a single meta-entity to which is associated a meta cognitive cycle interacting with

the autonomous system. The meta-entity (ME) can simply be considered as a composition of the two cognitive cycles associated with the initial entity couple. The advantage of

the proposed representation, involving the description of an Embodied Cognitive Cycle by

means of an IVCC couple is that the same mechanism used to represent the interaction of a

ME with the autonomous system can be also used to represent the interaction between two

observed entities forming an observed meta-entity, as proposed in [32]. Dynamic Bayesian

Networks (DBNs) can be used to represent cognitive cycles and IVCCs based on a structure

called Autobiographical Memory (AM) [30]. DBNs provide a tool for describing embodied objects within the CN in a way that can allow incremental learning from experience

[91]. Section 4.1 is devoted to the demonstration of such a claim.

3.3.2 The Cognitive Node

The general architecture of the Cognitive Node is depicted in Fig. 3-5. Intelligent sensors

are able to acquire raw data from physical sensors and to generate feature vectors corresponding to the entities to be observed by the CN. Acquired feature vectors must be fused

spatially and temporally in the first stages of the node, if they are coming from different

intelligent sensors.

Figure 3-5: Cognitive Node Architecture

As briefly introduced in the previous section, the CN is internally subdivided into two main

parts, namely the analysis and the decision blocks. These two stages are linked together by

the cognitive refinement block, whose role will be explained shortly. Analysis blocks are responsible for organizing sensor information and finding interesting or notable configurations of the observed entities at different levels. Those levels can communicate directly

with the human operator through network interfaces in the upper part of Fig. 3-5. This

is basically what can be done by a standard signal processing system being able to alert

a supervisor whenever a specific anomalous interaction behaviour is detected. A predic-

tion module is able to use the stored experience of the node through the internal AM for

estimating a possible evolution of the observed environment. All the processed data and

predictions generated by the analysis steps are used as input of the cognitive refinement

block. This module can be seen as a surrogate of the human operator: during the con-

74

figuration of the system it is able to learn how to distinguish between different levels of

potentially dangerous situations. This process can be done by manually labelling different

zones of the observed data or by implementing a specific algorithm for the particular cog-

nitive application. In the on-line phase, the CN works in two different ways: for operator

support and in automatic mode. In both cases the cognitive refinement module is able to de-

tect if a predicted condition starts to drift away from the standard observed interaction, thus

getting the overall system closer to a warning situation. Specifically, in the human support

case, the switch, depicted in Fig. 3-5, is opened. The CN, by means of the cognitive refine-

ment block, can detect anomalies as possible discrepancies from standard operator-crowd

interactions. During the automatic mode, the switch is closed and the information contained in the cognitive refinement block is employed to identify specific crowd-environment

situations. The communication link towards the operator permits a direct warning about

anomalous situations with respect to normal crowd behaviours. Decision modules of the CN

are responsible for selecting the best actions to be automatically performed by the system

for avoiding a dangerous situation. Those actions can be performed on the fully cooperative

parts of the observed system; all the decisions taken by the CN are made with the precise

intent of maintaining the environment in a controllable, alarm-free state. If all the actions

of the node are unable to keep the system in a standard state and the measured warning

level continues to increase, the node itself can decide to stop the cognitive cycle and to give

command of the controllable parts of the system back to the human operator, who is always

given the possibility to decide on his own and completely bypass the automatic system or

to be informed of each single action that the CN is processing (Interfaces, Fig. 3-5).
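The two operating modes and the hand-back mechanism just described can be sketched as follows; the threshold, the function names and all values are hypothetical placeholders, not the actual CN logic:

```python
# Sketch of the Cognitive Node's two operating modes (hypothetical names).
# Switch open -> operator-support mode: the node only warns the operator.
# Switch closed -> automatic mode: the node also selects countermeasures.
# If the warning level keeps rising despite its actions, control is handed
# back to the human operator.

WARNING_LIMIT = 3  # illustrative threshold

def cognitive_refinement(predicted, normal_range):
    """True if the predicted condition drifts outside normal interactions."""
    lo, hi = normal_range
    return not (lo <= predicted <= hi)

def node_step(predicted, normal_range, switch_closed, warning_level):
    anomalous = cognitive_refinement(predicted, normal_range)
    warning_level = warning_level + 1 if anomalous else 0
    if warning_level >= WARNING_LIMIT:
        return "handback_to_operator", warning_level
    if anomalous:
        return ("countermeasure" if switch_closed else "warn_operator"), warning_level
    return "idle", warning_level

# Automatic mode: repeated anomalies escalate to a hand-back.
level, actions = 0, []
for predicted in [0.5, 1.4, 1.5, 1.6]:
    action, level = node_step(predicted, (0.0, 1.0),
                              switch_closed=True, warning_level=level)
    actions.append(action)
print(actions)  # -> ['idle', 'countermeasure', 'countermeasure', 'handback_to_operator']
```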

As a final remark, it is worth pointing out that, like the proposed perception-action cycle for crowd monitoring, robot control mechanisms are also often motivated by

biology. However, there are some conceptual differences between the two approaches.

Robot control strategies, such as Reinforcement Learning, allow for optimizing actions

by evaluating their rewards. The presented mechanism, based on Damasio’s concept of

Autobiographical Self, during an off-line phase, acquires and mathematically models interaction information from observations of two entities, the operator and the crowd (i.e. proto and


core). During the on-line phase, the cognitive system uses the previously stored knowledge

for actively interacting with the external world. In the case of operator-crowd, a predic-

tion mechanism drives the system actions, selecting the possible countermeasures according to rules learned during training. The proposed algorithm is a general framework for acquiring and building up the rule sets in different contexts.


Chapter 4

Information extraction and probabilistic

model for knowledge representation

4.1 Introduction

Interactions between two entities can be described in terms of mathematical relationships.

Such a mathematical description must obviously rest on a feature extraction phase, which

is aimed at obtaining relevant information about the entities.

This chapter is devoted to the analysis of the main features that allow the design of a probabilistic model able to learn interactions. After information is extracted, Dynamic Bayesian

Networks (DBNs) can be used to represent cognitive cycles and IVCCs [30], as already

mentioned in section 3.3.1, based on the AM algorithm, thus providing a tool for describ-

ing embodied objects within the CN in a way that can allow incremental learning from

experience. It was already pointed out that also interactions between the operator and the

system can be represented as an IVCC. In that case, the operator-system interaction can be

used differently, as an internal reference for the CN: the operator can be seen as a teaching entity indicating the most effective actions towards the goal of maintaining security/safety

levels during the learning phase. This learning phase represents an effective knowledge


transfer from human operator towards an automatic system.

A proposed framework for information extraction is composed of two main blocks: Data

Fusion (DF) and Event Detection (ED). DF involves source separation and feature extraction: these two phases permit recognizing the same entities monitored by different heterogeneous sensors. The ED block extracts information related to changes in the signals acquired by sensors. Events will eventually be defined in order to develop specific probabilistic models able to describe different kinds of relationships, permitting the detection of anomalous interactions. The remainder of this chapter is organised as follows. The

applications of such models to crowd monitoring are presented in section 4.2. Section 4.3

describes the proposed approach for anomaly detections, while results are given in section

7.3.

4.1.1 Data fusion

Many different approaches can be used for designing architectures embedded in systems, which are able to collect heterogeneous environmental information. According to the functionalities provided by the systems, data fusion mechanisms should be considered as logical tasks which can be subdivided within a multi-modal architecture. An interesting example of a data fusion model is the JDL model [48].

The JDL model includes five levels of processing, which represent descriptions at increasing levels of abstraction [31]. In our description, information on two distinct entities is fused and aligned at different levels.

The data fusion module is able to receive data from intelligent sensors on the field, and to

fuse them from a temporal and spatial point of view. If one considers a set of 𝑆 intelligent

sensors, each sensor 𝑘 = 1, 2, . . . , 𝑆 sends to the CN a vector of features 𝑋(𝑘, 𝑡𝑘) = [𝑥1, 𝑥2, . . . , 𝑥𝑁𝑘] at time instant 𝑡𝑘. Intelligent sensors send feature vectors asynchronously to the CN, which must be able to register them temporally and spatially before

sending data to upper level processing modules.


From a temporal point of view, the data fusion module collects and stores into an internal buffer the newest measurement 𝑋𝑘,𝑡*𝑘 from each intelligent sensor 𝑘 = 1, 2, . . . , 𝑆. The time instant 𝑡*𝑘 represents the last time at which a feature vector was received from sensor 𝑘. Data acquisition time can vary from sensor to sensor.

As soon as a new feature vector is acquired from sensor 𝑘, the data fusion module can

compute an extended feature vector by combining all measurements from all considered

intelligent sensors: 𝜙(𝑡) = 𝑓(𝑋1,𝑡*1, 𝑋2,𝑡*2, . . . , 𝑋𝑆,𝑡*𝑆), where 𝑡 ≥ 𝑡*1, 𝑡*2, . . . , 𝑡*𝑆.

Thus the generation rate of the data fusion module can be estimated by considering the minimum time interval between two sequential measurements of the highest-frequency sensor. If ∆𝑡𝑛𝑘 = (𝑡𝑛𝑘 − 𝑡𝑛−1𝑘) is the time interval between the arrival times of feature vectors 𝑋(𝑘, 𝑡𝑛) and 𝑋(𝑘, 𝑡𝑛−1) for sensor 𝑘, the actual data rate of the fusion block can be estimated by computing min𝑘(∆𝑡𝑛𝑘).
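As a minimal sketch of this temporal registration (illustrative names; the real fusion function 𝜙 depends on the physical sensors), a buffer can keep only the newest vector per sensor and track the minimum inter-arrival interval:

```python
# Illustrative sketch of the CN's temporal data fusion buffer (hypothetical
# names): keeps the newest feature vector X_{k,t*_k} per sensor k and a
# running estimate of min_k(Δt^n_k) as the achievable output rate.

class FusionBuffer:
    def __init__(self):
        self.latest = {}               # k -> (t*_k, feature vector)
        self.min_dt = float("inf")     # running estimate of min_k(Δt^n_k)

    def push(self, k, t, features):
        if k in self.latest:
            self.min_dt = min(self.min_dt, t - self.latest[k][0])
        self.latest[k] = (t, features)

    def fused(self):
        # Extended feature vector phi(t): concatenation of the newest vectors.
        return [x for k in sorted(self.latest) for x in self.latest[k][1]]

buf = FusionBuffer()
buf.push(1, t=0.0, features=[0.3, 0.7])   # e.g. a camera-based descriptor
buf.push(2, t=0.1, features=[12])         # e.g. a people counter
buf.push(1, t=0.5, features=[0.4, 0.6])   # newer data replaces the old
print(buf.fused())   # -> [0.4, 0.6, 12]
print(buf.min_dt)    # -> 0.5
```

Here simple concatenation stands in for 𝑓, which in general depends on the physical relationship between the measured quantities.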

The analytic expression of the fusion function 𝜙(𝑡), depends on the physical relationship

between measured quantities and cannot be studied with a generic approach. In the fol-

lowing scenarios, feature vectors are mainly generated by (real, but possibly simulated)

video analytics algorithms that are able to process images acquired from video-surveillance

cameras and extract scene descriptors (i.e., trajectories of moving objects, crowd densities

within a certain environment, human activity related features, etc.).

In any case one can suppose that the fused feature vectors produced as output of this module

have the following form:

𝜙(𝑡) = [𝐶, 𝑃 ] = [𝐶1, 𝐶2, . . . , 𝐶𝑛, 𝑃1, 𝑃2, . . . , 𝑃𝑚], (4.1)

where 𝑛 and 𝑚 represent the number of features of the core and proto state vectors, 𝐶 and 𝑃

respectively (i.e. the dimensionality of the vectors). Equation 4.1 expresses a general

form for the global feature vector that is the result of the data fusion module. Vector 𝐶

identifies features related to so-called core objects, i.e., entities that are detected within

the considered environment but that are not part of the internal state of the system itself.


Vector 𝑃 identifies proto object features that are specific for entities that can be completely

controlled by the CN.
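As a toy illustration of the structure in Equation 4.1 (the values are invented), the fused vector concatenates the core and proto parts while remembering where the core part ends:

```python
# Toy illustration of the core/proto split of Equation 4.1 (invented values).

def make_phi(core, proto):
    """phi(t) = [C_1..C_n, P_1..P_m]; keep n to split it back later."""
    return core + proto, len(core)

def split_phi(phi, n):
    return phi[:n], phi[n:]

C = [0.8, 0.1, 0.3]      # e.g. crowd densities in three rooms (core)
P = [5, 1]               # e.g. system parameters (proto)
phi, n = make_phi(C, P)
assert split_phi(phi, n) == (C, P)
```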

4.1.2 Event detection

The data fusion phase permits obtaining a high-dimensional core and proto state space, where each point represents a state vector of features at a specific time instant: 𝑃 (𝑡)

and 𝐶(𝑡). Using this representation it is possible to interpret the changes of state vectors

as movements, trajectories, in a multi-dimensional space. Furthermore, as the dynamic

evolution of one entity depends on the other entity, relationships between such trajectories

describe interactions.

A Self Organizing Map [62] (SOM) unsupervised classifier is employed in this work at two

different logic levels: first, to detect events in terms of relevant state changes; secondly,

to represent complex relationships between the entities in a low-dimensional space. The

latter logic level will be discussed in detail through the next sections. The SOM operates

a dimensionality reduction, by mapping the multidimensional proto or core state vectors

(𝑃 (𝑡) and 𝐶(𝑡)) onto a lower 𝑀 -dimensional space, where 𝑀 is the dimension of the

Kohonen’s neuron layer (from here on we consider 𝑀 = 2 without loss of generality).

Input vectors are clustered according to their similarities, after the neural network is trained.

The choice of SOMs to perform feature reduction and clustering is motivated by their

capabilities to reproduce in a plausible mathematical way the global behaviour of the

winner-takes-all and lateral inhibition mechanism shown by distributed bio-inspired de-

cision mechanisms. Besides, a SOM allows for the establishment of a representation of

reality based on analogies: similar (though not necessarily identical!) input vectors are

likely to be mapped by the Kohonen’s map to the same neuron (in a non-injective way).

Similarity is, in our case, measured by the simple Euclidean distance between vectors; however, more complicated metrics can be employed to this end.
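A minimal numpy sketch of this mapping is given below: a toy 2-D SOM is trained with a winner-takes-all rule and a Gaussian neighbourhood (a simple stand-in for lateral inhibition), and each input vector is mapped to the grid coordinates of its best-matching unit. This is an illustrative toy under simplified assumptions, not the classifier used in the thesis:

```python
# Toy 2-D SOM: dimensionality reduction of state vectors to grid coordinates.
import numpy as np

rng = np.random.default_rng(0)

def train_som(data, rows=5, cols=5, epochs=50, lr0=0.5, sigma0=2.0):
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    t, t_max = 0, epochs * len(data)
    for _ in range(epochs):
        for x in data:
            lr = lr0 * (1 - t / t_max)
            sigma = sigma0 * (1 - t / t_max) + 1e-3
            # Best-matching unit: Euclidean distance (winner-takes-all).
            d = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighbourhood plays the role of lateral interaction.
            h = np.exp(-np.sum((grid - np.array(bmu)) ** 2, axis=2)
                       / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            t += 1
    return weights

def grid_position(weights, x):
    d = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)  # 2-D grid coordinates

# Toy data: two well-separated clusters of 4-D state vectors.
data = np.vstack([rng.normal(0.2, 0.02, (50, 4)),
                  rng.normal(0.8, 0.02, (50, 4))])
w = train_som(data)
```

After training, vectors from the two clusters map to different neurons, while similar (though not necessarily identical) vectors map to the same neuron.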

More in details, neural networks such as SOM, Neural Gas (NG) [79] and Growing Neural


Gas (GNG) algorithms [43] are inspired by Hebbian theory and permit the adaptation

of neurons during the learning process. The Neural Gas represents a very interesting and

powerful tool for vector quantization and data compression techniques. NG derives from

SOM and it improves the input data topology preservation through an adaptive method

based on learning neighbourhood relationships between the weight vectors (associated with neuronal units) and the external stimuli (associated with input vectors). In this context we have

supposed that the global environment is divided into different rooms, each one monitored by cameras. A camera-embedded people counter is able to provide an estimate of the number of people. The considered state vectors 𝐶 are multidimensional and we are interested in reducing them to a 2-D space. However, in other applications, where it is highly desirable

to conserve the topology, we have explored the possibility of automatically determining

the set of regions to monitor according to environmental topology. In this case the input

information can be the people trajectories, and we use the Instantaneous Topological Map (ITM) for learning the structured input data manifold [14]. SOMs present a fixed number of

neuronal units, while for GNG the number of neurons is automatically decided during the

training phase. The study of the dimension of the reduced space is very important for us,

because it is correlated to the definition of the events. By fixing the dimension of the SOM layer it is possible to keep the total number of possible events limited. A common learning

problem, in designing models, is to acquire all possible configurations, i.e. all possible

events. To this end, in this stage of our study, a fixed number of neurons is better than a

self-adaptable topology. The Growing Hierarchical SOMs (GH-SOMs) represent another

interesting tool [105]. They can increase the number of neurons and layers by means

of distance measurements between neuronal weights and input data. These mechanisms

of adapting layer sizes permit accurate reconstructions of the original data. On the other

hand, we are interested in studying the optimum number of units for balancing the learning

efficiency, the knowledge representation and the prediction capabilities of the AM. These

facts will become clearer in section 4.2.1. A technique for the definition of contextual

knowledge was proposed in [77]. By using a single 2-D SOM, an event classification was obtained by fusing the heterogeneous vectors shown in Equation 4.1. But in this case

the relationships between the entities are “fused" in the neurons. According to Damasio’s


theory, by means of different SOMs, for separately mapping core and proto vectors, it is

possible to detect relevant transitions between SOM neurons, i.e. the events. Such distinct

core and proto events are basic units of the AM, which represents a bio-inspired fusing

method for modelling the dependencies between two entities [19].

The clustering process, applied to internal and external data, allows one to obtain a mapping

of proto and core vectors 𝑃 (𝑡) and 𝐶(𝑡) into 2-D vectors, corresponding to the positions

of the neurons in the SOM-map, that we call respectively proto Super-states 𝑆𝑥𝑃 and core

Super-states 𝑆𝑥𝐶 . Each cluster of Super-states, deriving from the SOM classifiers, is then

associated with a semantic label related to the contextual situation:

𝑆𝑥𝑖𝑃 ↦→ 𝑙𝑖𝑃 , 𝑖 = 1, . . . , 𝑁𝑃

𝑆𝑥𝑗𝐶 ↦→ 𝑙𝑗𝐶 , 𝑗 = 1, . . . , 𝑁𝐶

(4.2)

where the notation 𝑆𝑥𝑖𝑃 and 𝑆𝑥𝑗𝐶 indicates that the Super-state belongs, respectively, to

the 𝑖-th proto label and to the 𝑗-th core label; 𝑁𝑃 and 𝑁𝐶 are, respectively, the maximum

number of proto and core Super-state labels. The result of this process is then the building of a 2-D map divided into regions labelled with a meaningful identifier related to the

ongoing situation. It must be noted that changes of state vectors 𝑃 (𝑡) and 𝐶(𝑡) do not

necessarily imply a change of Super-state labels 𝑆𝑥𝑖𝑃 ↦→ 𝑙𝑖𝑃 and 𝑆𝑥𝑗𝐶 ↦→ 𝑙𝑗𝐶 . This means

that there are some modifications which are irrelevant from the point of view of the chosen

semantic representation of the situation. On the other hand, when the Super-state labels

𝑆𝑥𝑖𝑃 and 𝑆𝑥𝑗𝐶 change in subsequent time instants, a contextual situation modification, i.e. an event, occurs. It is then possible to define proto (𝜖𝑃 ) and core (𝜖𝐶) events. Actually, even null events (i.e. null changes in Super-states) can be defined. In fact, these could

be relevant while considering very slowly changing systems and dynamics or whenever

lengthy immobility could become relevant.
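This event definition reduces to comparing consecutive Super-state labels; a minimal sketch with hypothetical labels:

```python
# Events as label transitions: an event occurs only when consecutive
# Super-state labels differ; identical labels correspond to null events.
# The label names are invented for illustration.

def detect_events(labels):
    """Return (previous_label, new_label) pairs for each relevant transition."""
    return [(a, b) for a, b in zip(labels, labels[1:]) if a != b]

core_labels = ["empty", "empty", "sparse", "sparse", "crowded", "crowded"]
print(detect_events(core_labels))
# -> [('empty', 'sparse'), ('sparse', 'crowded')]
```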

A Kohonen layer consists of a 2-D matrix of neurons, each identified by a hexagonal location. The network is constructed based on competitive learning: the output neurons that win

the competition are subsequently activated by input state vectors.

Figure 4-1: Examples of temporal proximity trajectories among fired neurons in a 2-D SOM-map (5 × 5) for different core state vector sequences. The trajectories are non-linear and discontinuous.

Two SOM-nodes are considered as near if they are consecutively active at two successive time instants. It is

possible to connect all fired neurons describing a temporal proximity trajectory among

neurons. Different input state sequences do not necessarily describe different trajectories in

the Super-state space. By sequentially analysing the dynamic evolution of Super-states,

proto and core events can be detected and visualized as trajectories in the 2-D SOM-map.

The output of the SOM algorithm is in fact a list of clusters (or zones) within the Kohonen’s

layer, that describe a trajectory. Two trajectories for two different core state sequences are

presented in Fig. 4-1. The ED module also considers dynamical aspects of the evolution

of clustered features: transition probabilities between different Super-states (i.e. zones) are

computed, in such a way that the outcome of the training process can be ideally compared

to a DBN. Instead of considering sequences of Super-states to describe the evolution of

each entity, it is possible to consider proto and core event series, which can be modelled by

two Event based DBNs (E-DBNs) [97] as explained in the next section.
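The transition probabilities mentioned above can be estimated by counting consecutive Super-state pairs in an observed sequence; a minimal sketch with hypothetical labels:

```python
# Transition probabilities between Super-states, estimated by counting
# consecutive pairs in an observed sequence (a discrete first-order model,
# comparable to one slice of a DBN). Labels are invented for illustration.
from collections import Counter, defaultdict

def transition_probs(sequence):
    counts = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

seq = ["A", "A", "B", "A", "B", "B"]
probs = transition_probs(seq)
print(probs["A"])  # -> {'A': 0.3333333333333333, 'B': 0.6666666666666666}
```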

4.1.3 Autobiographical memory

According to Damasio’s theory, the sequences of proto (internal) and core (external) events

can be organized into two kinds of triplets in order to account for interactions between entities: (𝜖−𝑃 , 𝜖𝐶 , 𝜖+𝑃 ) (passive interaction) and (𝜖−𝐶 , 𝜖𝑃 , 𝜖+𝐶 ) (active interaction), to represent

the causal relationships, in terms of initial situation (first event), cause (second event) and

consequent effect on the examined entity (third event)¹.

The resulting information becomes an approximation of what Damasio himself calls the

Autobiographical Memory, where these triplets, which encode possible interactions between entities, are memorized. The basic idea behind the algorithm is to estimate the

frequency of occurrence of the effects caused by a certain external event in order to derive

two probability distributions:

𝑝(𝜖+𝑃 |𝜖𝐶 , 𝜖−𝑃 ), (4.3)

𝑝(𝜖+𝐶 |𝜖𝑃 , 𝜖−𝐶), (4.4)

representing the causality of observed events in the interaction. The sequence of events is represented by a statistical graphical model in order to introduce a mathematical description of the proposed interaction model. This choice is motivated by the fact that the interaction pattern is composed of a temporal sequence of interdependent events and can therefore be seen as a stochastic process. An approach for modelling sequences of events that relies on a probabilistic model is thus particularly appropriate.
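The frequency-of-occurrence estimation behind Equations 4.3 and 4.4 can be sketched as a simple counting scheme. The following minimal illustration uses hypothetical event labels and an `estimate_cpd` helper that are not part of the thesis implementation:

```python
from collections import Counter

def estimate_cpd(triplets):
    """Estimate p(effect | cause, initial) from observed (initial, cause, effect)
    triplets by relative frequency of occurrence."""
    joint = Counter(triplets)                       # counts of full triplets
    cond = Counter((i, c) for i, c, _ in triplets)  # counts of (initial, cause) pairs
    return {(i, c, e): n / cond[(i, c)] for (i, c, e), n in joint.items()}

# Toy passive triplets (eps_P-, eps_C, eps_P+): e.g. the core event "inflow"
# observed in proto state "doors_closed" is usually followed by "open_door".
obs = [("doors_closed", "inflow", "open_door")] * 8 + \
      [("doors_closed", "inflow", "no_action")] * 2
cpd = estimate_cpd(obs)
print(cpd[("doors_closed", "inflow", "open_door")])  # 0.8
```

With 8 out of 10 matching observations, the estimated conditional probability of the consequent proto event is 0.8.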

The interaction patterns are modelled by a Coupled Event based DBN (CE-DBN) in order to have a representation able to statistically encode the relevant data variability. The proposed CE-DBN graph, shown in Fig. 4-2, aims at describing interactions taking into account the neuro-physiologically motivated model of the Autobiographical Memory. The conditional probability densities (CPDs) $p(\epsilon_t^P \mid \epsilon_{t-1}^P)$ and $p(\epsilon_t^C \mid \epsilon_{t-1}^C)$ encode the motion pattern of the objects in the environment regardless of the presence of other objects. Note that each triplet can be seen as one dispositional statement (configuration) with an associated conditional probability (Equations 4.3 and 4.4). The AM provides a dispositional description, a set of dispositional properties, for proto and core entities.

¹An active interaction (represented by a triplet) is defined when an internal modification (proto event) is the cause of an external environmental change, i.e. the third event in the triplet is a core event. Vice versa, a passive triplet is defined when an external environmental change (core event) provokes an internal modification, i.e. the third event in the triplet is a proto event.


Figure 4-2: Coupled Event based Dynamic Bayesian Network model representing interactions within an AM.

The dispositional properties describe a precise objective: maintaining the stability of the equilibrium between the object and the environment (i.e. maintenance of the proper level of security and/or safety). An anomaly can be seen as a deviation from normality and can be considered a violation of a certain dispositional property. The interactions between the two objects are described by the CPDs:

$p(\epsilon_t^P \mid \epsilon_{t-\Delta t_C}^C), \quad (4.5)$

$p(\epsilon_t^C \mid \epsilon_{t-\Delta t_P}^P). \quad (4.6)$

In particular, Equation 4.5 describes the probability that the event $\epsilon^C$, which occurred at time $t - \Delta t_C$ and was performed by the object associated with the core context, has caused the event $\epsilon^P$ in the proto context. The reversed interpretation in terms of causal events applies to $p(\epsilon_t^C \mid \epsilon_{t-\Delta t_P}^P)$.

Considering the definition of the core consciousness, the causal relationships between the two entities are encoded in two conditional probability densities (CPDs):

$p(\epsilon_t^P \mid \epsilon_{t-\Delta t_C}^C, \epsilon_{t-\Delta t_P}^P), \quad (4.7)$

$p(\epsilon_t^C \mid \epsilon_{t-\Delta t_P}^P, \epsilon_{t-\Delta t_C}^C). \quad (4.8)$

As a matter of fact, the probability densities in Equations 4.7-4.8 consider both the interaction (i.e. Eq. 4.5 or Eq. 4.6) and the initial situation (i.e. $\epsilon_{t-\Delta t_P}^P$ or $\epsilon_{t-\Delta t_C}^C$).


Figure 4-3: Graphical representation of the mapping into the AM 3-D space of the passive triplet $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$. The symbols $l^x_{P/C}$ represent the contextual SOM label associated with each cluster. In this example the proto or core events are represented by $l^x_{P/C} \dashrightarrow l^y_{P/C}$, where $x \neq y$. The transitions in the Proto and Core maps are dashed to represent the non-linearity and discontinuity of the trajectories.

4.2 Autobiographical Memory domain applications: Surveillance and Crowd Management scenarios

In the previous section a probabilistic model based on the CE-DBN was sketched in order to describe multiple entity interactions. The knowledge thus represented inside the proposed CN can be employed in many different domains: surveillance scenarios and crowd analysis and management are just two examples. Generally, in surveillance scenarios the goal of the system is the analysis of interactions and the recognition of specific behaviours between two or more people (external entities). On the other hand, in the crowd analysis domain, the focus of the system can be shifted toward the analysis and classification of the interactions that occur between the crowd and a human operator who is in charge of maintaining a proper security level within the monitored area (for this purpose, the crowd can be seen as a unique macro-entity). The two entities can be represented as a couple of $IVCC$s, as proposed in section 3.3.2, namely an $IVCC_O$ and an $IVCC_S$ respectively.


In this section two aspects will be discussed, namely the probabilistic model learning phase

and the detection phase for surveillance and crowd scenarios. During the (off-line) learning

phase the CN observes both entities, i.e. the human operator and the crowd, storing their

interactions within the AM. As for the (on-line) detection phase, it will be shown how

different definitions of the probabilistic model are needed.

The system is designed to support a human operator in crowd management during the on-line phase. This task is accomplished by recognizing specific abnormal operator-crowd interactions. Typically, in people flow redirection problems, an abnormal interaction can be detected whenever the user applies wrong countermeasures to avoid panic or overcrowding situations. In this case the CN ought to drive the operator to choose correct actions, either by simply signalling the anomaly or by suggesting actions to be performed based on its learned experience.

4.2.1 Learning phase: interaction representations

During an off-line phase, the CN is able to learn and store into the AM a set of triplets (i.e. interactions) for different situations: $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$ (passive triplet) and $(\epsilon_C^-, \epsilon_P, \epsilon_C^+)$ (active triplet). The crowd configurations are captured by core sensors, while the operator actions are mapped into proto sensors. Each triplet represents a point in a 3-D space. In Fig. 4-3 an example of the 3-D space mapping of a passive triplet is depicted. This representation allows sketching the set of triplets stored into an AM. We point out that the ordering of the events along the $\mathcal{E}_P^-$, $\mathcal{E}_C$ and $\mathcal{E}_P^+$ axes is not relevant, as what is really significant is only the number of occurrences of a certain triplet. However, each generic triplet of events can be associated with an influence model, i.e. a specific AM can model the dynamic evolution of interactions for a specific context. It is possible to define a switching variable $\theta$ as an influence parameter [96].

Each triplet is associated with a probability, derived from an estimate of two conditional probability densities, $p(\epsilon_P^+ \mid \epsilon_C, \epsilon_P^-, \theta)$ and $p(\epsilon_C^+ \mid \epsilon_P, \epsilon_C^-, \theta)$ (cf. Equations 4.7 and 4.8), which are proportional to the number of votes that the particular triplet received, i.e. the number of occurrences observed during the training phase of the AM that represents a specific interaction (i.e. an influence model). Fig. 4-4 shows an example of conditional relationship for a passive triplet: $\epsilon_P^+$ given the two previous events $\epsilon_C$, $\epsilon_P^-$ and the interaction model $\theta$.

A temporal histogram is associated with each AM element (i.e. with each triplet), in order to store the temporal information related to the events of the triplet. For example, considering a passive triplet $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$ with given events, the histogram permits evaluating the probability that a specific proto event $\epsilon_P^+$ takes place $\tau_{CP^+}$ seconds after the core event $\epsilon_C$. The histogram bin size must be selected by performing a trade-off between the precision of the temporal prediction required by the application and the number of training samples available.
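Such a temporal histogram can be sketched as follows; the delay values, bin width and the `delay_histogram` helper are illustrative assumptions, not taken from the thesis implementation:

```python
import numpy as np

def delay_histogram(delays, bin_width, t_max):
    """Histogram of the delays (seconds) between the core event and the
    consequent proto event of a triplet, normalized to a probability per bin."""
    edges = np.arange(0.0, t_max + bin_width, bin_width)
    counts, _ = np.histogram(delays, bins=edges)
    return counts / counts.sum(), edges

# Hypothetical delays tau observed for one AM triplet during training.
tau = [3.1, 3.8, 4.2, 4.5, 12.0, 4.9, 3.3, 4.1]
p, edges = delay_histogram(tau, bin_width=2.0, t_max=16.0)
# p[k] estimates Pr{tau falls in [2k, 2(k+1)) seconds}
print(p[2])  # 0.5: half of the observed delays fall in the 4-6 s bin
```

A wider `bin_width` pools more samples per bin (more robust estimates) at the cost of coarser temporal prediction, which is exactly the trade-off described above.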

Figure 4-4: Example of CE-DBN for a passive triplet, e.g. $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$, with a parameter $\theta$ tied across proto-core-proto transitions.

4.2.2 Detection phase: surveillance scenarios

After the learning phase, the CN, by using the AM, has the capability of recognizing interactions on-line, while they take place. In [30] the exploitation of an AM for the detection of different kinds of interactions between two people was proposed. To this end, a cumulative measure is computed exploiting the information encoded in the proposed Coupled E-DBN model. To accomplish this task, for each interaction


$i: i = 1, \ldots, N_I$, where $N_I$ is the number of considered interactions, a set of couples of trajectories (core and proto) is used to train the model ($\theta_i$), originating a trajectory in a 3-D space (as shown in Fig. 4-3). To detect the type of cause-effect relationship between entities, for each triplet $(\epsilon_t^{P,C}, \epsilon_{t-\Delta t_{C,P}}^{C,P}, \epsilon_{t-\Delta t_{P,C}}^{P,C})$ the following measure is computed:

$l_t^i = l_{t-\Delta t_{C,P}}^i + p(\epsilon_t^{P,C}, \epsilon_{t-\Delta t_{C,P}}^{C,P}, \epsilon_{t-\Delta t_{P,C}}^{P,C} \mid \theta_i), \quad (4.9)$

where $l_{t-\Delta t_{C,P}}^i$ is the measure computed at the time at which the previous event was observed, and $p(\epsilon_t^{P,C}, \epsilon_{t-\Delta t_{C,P}}^{C,P}, \epsilon_{t-\Delta t_{P,C}}^{P,C} \mid \theta_i)$ denotes the probability that the observed triplet belongs to the $i$-th interaction model. For each triplet of events, the best matching influence model is chosen as $i^* = \arg\max_i l^i$ with $i = 1, \ldots, N_I$. The high criticality of the detection phase entails that, if a mismatch between the observed data and the learned knowledge is detected, the system can call the attention of the operator. In this case the learning phase starts up again.
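The cumulative measure of Equation 4.9 and the arg-max model selection can be sketched as below, assuming each learned influence model $\theta_i$ is available as a lookup table from triplets to probabilities; the tables and event labels are toy values:

```python
import numpy as np

def classify_interaction(triplets, models):
    """Accumulate Eq. 4.9, l_i(t) = l_i(t - dt) + p(triplet | theta_i), for
    each learned influence model and return the best matching index i*."""
    scores = np.zeros(len(models))
    for tr in triplets:
        for i, model in enumerate(models):
            scores[i] += model.get(tr, 0.0)  # p(triplet | theta_i); 0 if unseen
    return int(np.argmax(scores)), scores

# Two hypothetical influence models as triplet -> probability tables.
theta = [
    {("a", "b", "c"): 0.7, ("a", "b", "d"): 0.3},  # model 0
    {("a", "b", "c"): 0.1, ("a", "b", "d"): 0.9},  # model 1
]
i_star, l = classify_interaction([("a", "b", "c")] * 3 + [("a", "b", "d")], theta)
print(i_star)  # 0  (cumulative scores: 0.7*3 + 0.3 = 2.4 vs 0.1*3 + 0.9 = 1.2)
```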

4.2.3 Detection phase: Crowd management scenarios

In human-to-human interactions, each state change of one entity typically corresponds to a state change of the other. In this case it is possible to affirm that the entities have the same (or at least a similar) dynamic. On the contrary, in crowd scenarios the dynamics of the entities are extremely different, namely the crowd changes its status more frequently than the operator. Generally, the number of people in a room or in a zone can change without any operator action. In all the cases in which the dynamics of the entities show significant differences, the AM can be considered a sparse collection of triplets. In order to design a robust classification algorithm for abnormal interaction recognition, an approach to encode a statistical sparse model using the Self Organizing Map is needed. The following section is dedicated to this purpose.


4.3 Proposed approach for abnormal interaction detection in crowd monitoring domain

The proposed cognitive video surveillance system has two main purposes. The first and most important one is to detect interaction anomalies between the operator and the crowd. The second is to substitute or to help the user during crowd management, by recognizing anomalous interactions with the crowd. The presented cognitive system accomplishes both these goals by learning a specific behavioural model for operator-crowd interactions in which the crowd is correctly controlled by a user. This model describes normal conditions of crowd management. The CN can detect anomalous operator-crowd interactions as deviations from the normality situation. In automatic operating mode, the system substitutes the operator and interacts directly with the crowd. When crowd reaction patterns do not conform to the expected behaviour, an anomalous configuration (i.e. interaction) is detected. The method used for interaction modelling and for the above mentioned anomaly detection is presented here. An interaction behaviour cannot be completely represented by a triplet alone: a set of triplets must be analysed in order to identify a model. A common learning problem can be formalized as follows: the generic sequence of triplets $Tr_j = (\epsilon_{P,C}^-, \epsilon_{C,P}, \epsilon_{P,C}^+)$, $j = 1, \ldots, N_T$, where $N_T$ is the number of triplets in that specific sequence, can belong to different observed models $\theta_i$, $i = 1, \ldots, N_I$ ($N_I$ is the number of operator-crowd interaction models). Fig. 4-5 shows the triplet encoding by means of a mapping function $f(.)$. For sparsely collected data, i.e. sparse triplets, the mapping function defined as $f(\epsilon_{P,C}^-, \epsilon_{C,P}, \epsilon_{P,C}^+) = p(\epsilon_{P,C}^+ \mid \epsilon_{C,P}, \epsilon_{P,C}^-, \theta_i)$ is potentially not useful to distinguish triplet associations with specific kinds of operator-crowd interaction models. A different transform function, $f(\epsilon_{P,C}^-, \epsilon_{C,P}, \epsilon_{P,C}^+)$, is defined for mapping triplets into a 2-D space so as to decrease misclassification errors. A specific dimensionality reduction method can be employed to encode the AM. In this way, it is possible to obtain a probabilistic model for rare-interaction detection, describing high-complexity relationships between entities by means of simpler formulas [107]. The mapping function $f(.)$ must meet the


Figure 4-5: Model learning problem: triplet recovery from model. $Tr_j$ represents the $j^{th}$ generic triplet, $\theta_i$ is the $i^{th}$ interaction model.

following requirements: maximum information preservation of the operator-crowd interactions, and correct reconstruction of the original data while optimizing the classification accuracy.

4.3.1 Dimensional reduction and preservation of the information

A large number of methods have been proposed for dimensionality reduction: they are typically classified into linear and non-linear methods. This section addresses a fundamental issue arising in reduction problems: the interaction information contained in the primary data must be preserved. Two well-known feature reduction techniques, namely Principal Component Analysis (PCA) (a linear method) [111] and the Self Organizing Map (a non-linear method), are compared.

In Table 4.1, a comparison between PCA and SOM is presented, where the binary formats used for output vector encoding are varied. The binary format is expressed by $[wl, fl]$, where $wl$ represents the word-length and $fl$ the fraction-length. In particular, Table 4.1 presents the error measures, calculated as the average Euclidean distance between the original data $D$ and the reconstructed data $\hat{D}$. It is possible to note that, by increasing the number of bits, the SOM behaves better than PCA.


Table 4.1: SOM and PCA comparison

Binary Format   SOM-map   PCA err   SOM err
[3, 1]          8 × 8     0.1857    1.4430
[5, 1]          32 × 32   0.1954    0.0938
[5, 2]          32 × 32   0.0846    0.0938
[6, 2]          64 × 64   0.0803    2.8175 · 10⁻⁷
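The binary-format comparison can be reproduced in outline: the sketch below quantizes a reconstruction to a signed fixed-point format $[wl, fl]$ and measures the average Euclidean error, in the spirit of Table 4.1. The rounding/saturation convention and the random data are assumptions of this illustration:

```python
import numpy as np

def quantize(x, wl, fl):
    """Quantize to a signed fixed-point format [wl, fl]: wl total bits,
    fl fractional bits (round to nearest, saturate at the range limits)."""
    step = 2.0 ** -fl
    lo, hi = -2.0 ** (wl - fl - 1), 2.0 ** (wl - fl - 1) - step
    return np.clip(np.round(x / step) * step, lo, hi)

def avg_reconstruction_error(D, D_hat, wl, fl):
    """Average Euclidean distance between the original data and the
    quantized reconstruction, as in the PCA/SOM comparison."""
    return float(np.mean(np.linalg.norm(D - quantize(D_hat, wl, fl), axis=1)))

rng = np.random.default_rng(0)
D = rng.uniform(0, 3, size=(100, 6))               # e.g. people counts per room
err_coarse = avg_reconstruction_error(D, D, 3, 1)  # [3,1]: step 0.5, range [-2, 1.5]
err_fine = avg_reconstruction_error(D, D, 6, 2)    # [6,2]: step 0.25, range [-8, 7.75]
print(err_coarse > err_fine)  # True: more bits, lower quantization error
```

With only 3 bits the format saturates part of the data range, so the coarse error dominates, mirroring the trend of the table.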

4.3.2 Self Organizing Map: classification evaluation

Taking into account a SOM layer formed by $K$ neurons, its dimensions are adapted in order to find the best matching couple $(l, w)$ such that $l \times w = K$. The number of core (or proto) Super-states is then $K$ and the total number of possible core (or proto) events is $K^2$, taking null-events as relevant, as explained in [21]. The parameter $K$ must be tuned: in fact, by decreasing the SOM-map size, many different input state vectors can fall into the same cluster; this generates a rougher classification but ensures that only relevant events are likely to be selected. On the other hand, by employing high-dimensional Kohonen layers the classification is improved, whereas the total number of irrelevant events increases.

The dimension of the layer is a relevant parameter in our system. A small layer allows

the system to summarize its knowledge in a few concepts, which is positive, although

classification of situations may result too rough in some cases. On the other hand, very

large layers result in a very sparsely populated Superstate space, meaning that the system

would need massive training in order to observe, and later recognize, any possible situation.

At present, this parameter is tuned empirically.

We define a data set 𝐷 as follows:

$D = \{D(t) : t \in \{0, \ldots, T\}\}, \quad (4.10)$

where $D(t) \in \mathbb{R}^N$ is a vector $D(t) = [d_1(t), \ldots, d_N(t)]'$ in which each component $d_i(t)$ represents, in our application (section 7.3), the number of people in the $i^{th}$ room at


instant $t$. The clustering process performed by the SOM is defined by means of a transformation function $f_n(D): D \to S$, where $S = \{S_k(t) : t \in \{0, \ldots, T\}\}$, $k = 1, \ldots, K$ is the index of the neuron and $T$ the maximum training time. The vectors $S_k(t) \in \mathbb{R}^M$, with $M < N$ ($M = 2$ in our case), represent the coordinates in the SOM map of the neurons fired at time $t$. Each element of the data set can be modelled as $D(t) = C_k(t) + n_k(t)$, where $C_k(t) \in \mathbb{R}^N$ is the weight vector of the $k^{th}$ neuron, which is associated with $S_k(t)$, and $n_k(t)$ can be considered Gaussian noise $\mathcal{N}_k(0, \Sigma_k)$. The covariance matrix $\Sigma_k$ is computed for each $k^{th}$ SOM node considering all the training vectors which have activated the $k^{th}$ neuron. It is possible to define a conditional probability density function $p(D|C_k)$ as follows:

$p(D|C_k) = [p(C_k)\, w_k]^{-1} \exp\Big(-\sum_{t=0}^{T} \Omega(t)' \Sigma_k^{-1} \Omega(t)\Big), \quad (4.11)$

where $p(C_k)$ is the neuron activation probability, computed as the number of samples in the node over the total number of training samples, $\Omega(t) = D(t) - C_k$, and $w_k = (2\pi)^{N/2} (\det(\Sigma_k))^{0.5}$.
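The per-node statistics required by Equation 4.11, namely the activation probability $p(C_k)$ and the covariance $\Sigma_k$ of the training vectors that fired neuron $k$, can be sketched as follows. The codebook here is hand-made rather than produced by an actual Kohonen training loop, and all numbers are illustrative:

```python
import numpy as np

def bmu(codebook, x):
    """Index of the best matching unit (nearest codebook vector)."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def node_statistics(codebook, data):
    """Per-node activation probability p(C_k) and noise covariance Sigma_k,
    estimated from the training vectors that activated each neuron."""
    K, N = codebook.shape
    hits = [[] for _ in range(K)]
    for x in data:
        hits[bmu(codebook, x)].append(x)
    p_ck = np.array([len(h) / len(data) for h in hits])
    sigmas = [np.cov(np.array(h).T) if len(h) > 1 else np.eye(N) for h in hits]
    return p_ck, sigmas

# Toy 'trained' codebook of K = 2 neurons in N = 2 dimensions.
C = np.array([[0.0, 0.0], [5.0, 5.0]])
rng = np.random.default_rng(1)
D = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (10, 2))])
p_ck, sigmas = node_statistics(C, D)
print(p_ck)  # [0.75, 0.25]: 30 of 40 samples activate the first neuron
```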

A possible criterion to evaluate a SOM, given a data set $D$, relies on the Average Mutual Information (AMI) $I_M(D, C)$ [38], defined by Equation 4.12:

$I_M(D, C) = H(D) - H(D|C), \quad (4.12)$

where $H(D)$ is the data set entropy, while the conditional entropy of the normal multivariate distribution $p(D|C) = \sum_{k=1}^{K} p(D|C_k)$ is defined as

$H(D|C) := 0.5 \ln[(2\pi e)^N |\Sigma|], \quad (4.13)$

where $\Sigma$ is the covariance matrix of the normal multivariate p.d.f. $p(D|C)$. To investigate the capabilities of the Self Organizing Map we set up a test: an artificial data set $D$ for training was constructed, consisting of 143 vectors with a sampling time of $4\,s$, provided by our crowding simulator. Each vector is formed by $N = 6$ components and contains the number of people in each room.
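The Gaussian entropy of Equation 4.13 and the resulting AMI of Equation 4.12 can be computed directly from covariance matrices. In the sketch below both $H(D)$ and $H(D|C)$ are approximated as Gaussian entropies, which is an assumption of this illustration, and the covariances are toy values:

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy of a multivariate normal, Eq. 4.13:
    H = 0.5 * ln((2*pi*e)^N * |Sigma|)."""
    N = cov.shape[0]
    return 0.5 * np.log((2 * np.pi * np.e) ** N * np.linalg.det(cov))

def average_mutual_information(cov_data, cov_residual):
    """AMI of Eq. 4.12, I_M(D, C) = H(D) - H(D|C), with both entropies
    approximated as Gaussian entropies."""
    return gaussian_entropy(cov_data) - gaussian_entropy(cov_residual)

# Illustrative 2-D case: the clustering explains most of the data variance.
cov_D = np.diag([4.0, 4.0])            # raw data spread
cov_D_given_C = np.diag([0.25, 0.25])  # residual spread around the codebooks
ami = average_mutual_information(cov_D, cov_D_given_C)
print(round(ami, 4))  # 2.7726, i.e. 0.5 * ln(det ratio) = 4 ln 2
```

Note that the constant $(2\pi e)^N$ cancels in the difference, so the AMI depends only on the ratio of the determinants: the more residual variance the clustering removes, the larger $I_M(D, C)$.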


Table 4.2: Classification evaluation for different SOM-layer

SOM-map   H(D)     H(D|C)   I_M(D,C)
2 × 2     6.3750   0.7249   5.6501
4 × 4     6.3750   0.1291   6.2459
5 × 5     6.3750   0.0384   6.3366
7 × 7     6.3750   0        6.3750

Table 4.2 lists the entropies for different sizes of the SOM layer. Beyond 7 × 7 the quality of the classification shows no significant improvement from the AMI point of view. However, we are not concerned with an extremely precise description of the core state space (we do not want to maximize $I_M(D, C)$ at all costs). We certainly need a sufficient amount of information to be preserved but, at the same time, as explained in this subsection, we need our system to be capable of synthesizing knowledge by establishing analogies.

A representation of reality based on analogies is necessary in order to deal with situations

never seen during the training phase.

The SOM can be used for dividing a set of training data $D$ into different multivariate time series $\{D_k\}_{k=1}^{K}$, where $D_k = \{D_{1,k}, \cdots, D_{n,k}\}$ is associated with the $k$-th neuron, such that $D_k \cap D_j = \emptyset$ for $k \neq j$ and $\bigcup_{k=1}^{K} D_k = D$. These sub-sequences of vectors can be modelled by local Vector Auto Regressive (VAR) models [101]. The number of generated VAR models corresponds to the number of neurons of the SOM. The local matching measurement between the sequence of input data and the output of the local VAR models specifies how much of the output variation has been represented by the SOM. Considering a multivariate time series $D_k$, an auto regressive model of order $p$, denoted as VAR(p), describes the $i$-th vector $D_{i,k}$ as a linear combination of the previous state vectors:

$D_{i,k} = \Phi_0 + \Phi_1 D_{i-1,k} + \Phi_2 D_{i-2,k} + \cdots + \Phi_p D_{i-p,k} + \epsilon_i, \quad (4.14)$

where $\Phi_0, \cdots, \Phi_p$ are $(N \times N)$ parameter matrices and $\epsilon_i$ represents an $(N \times 1)$ white noise vector. From the multivariate time series $D_k$ we have modelled a VAR(2) as $D_{i,k} = \Phi_0 + \Phi_1 D_{i-1,k} + \Phi_2 D_{i-2,k} + \epsilon_i$, where $\Phi_0$, $\Phi_1$ and $\Phi_2$ are estimated coefficient matrices which have been stored in each SOM node. Each VAR model has been used as a linear predictor filter. A dataset $D^c$, different from $D$, has been used for the classification phase. Also in

this case the SOM divides the data into different multivariate time series $\{D_k^c\}_{k=1}^{K}$, where $D_k^c = \{D_{1,k}^c, \cdots, D_{n,k}^c\}$ is associated with the $k$-th neuron, such that $D_k^c \cap D_j^c = \emptyset$ for $k \neq j$ and $\bigcup_{k=1}^{K} D_k^c = D^c$. We have compared the one-period-ahead forecast sequences $\hat{D}_k$ obtained

by the VAR(2) models built over different SOM layer sizes with $D_k^c$. Fig. 4-6 shows an example of curve trends for the vector sequences predicted by the VAR(2) models built over different SOM layer sizes. A comparison between the simulated data and the predictor filter outputs is provided by the FIT measurement:

$FIT = \frac{\sum_{i=1}^{n} \| D_{i,k}^c - \hat{D}_{i,k} \|}{\sum_{i=1}^{n} \| D_{i,k}^c - \bar{D}_k^c \|}, \quad (4.15)$

where $D_{i,k}^c \in D_k^c$, $\hat{D}_{i,k}$ is the output of the $k$-th VAR(2) model and $\bar{D}_k^c = E\{D_k^c\}$.

The averages of the FIT between the one-period-ahead forecasts obtained by the VAR(2) models and the simulated data (formed by 140 vectors) show that a SOM with small layers is able to build analogies between the data stored in the same neurons during the training phase and the classification phase.
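The local VAR(2) predictor and the FIT measurement of Equation 4.15 can be sketched with an ordinary least-squares fit. The synthetic sinusoidal series below stands in for the per-room people counts of one SOM node; all numeric choices are illustrative assumptions:

```python
import numpy as np

def fit_var2(series):
    """Least-squares estimate of a VAR(2) model
    D_i = Phi0 + Phi1 D_{i-1} + Phi2 D_{i-2} + eps_i."""
    Y = series[2:]
    X = np.hstack([np.ones((len(Y), 1)), series[1:-1], series[:-2]])
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B  # stacked coefficients: [Phi0; Phi1'; Phi2']

def forecast_var2(B, d_prev, d_prev2):
    """One-period-ahead forecast from the two most recent vectors."""
    return np.concatenate([[1.0], d_prev, d_prev2]) @ B

def fit_measure(actual, predicted):
    """FIT of Eq. 4.15: residual norms relative to the deviation from the
    mean (smaller means the local predictor explains the sequence better)."""
    num = np.sum(np.linalg.norm(actual - predicted, axis=1))
    den = np.sum(np.linalg.norm(actual - actual.mean(axis=0), axis=1))
    return num / den

# Synthetic two-room counts: slow oscillations plus small observation noise.
rng = np.random.default_rng(2)
t = np.arange(60)
d = np.stack([10 + 5 * np.sin(0.3 * t), 6 + 3 * np.cos(0.3 * t)], axis=1)
d += rng.normal(0, 0.1, d.shape)
B = fit_var2(d)
pred = np.array([forecast_var2(B, d[i - 1], d[i - 2]) for i in range(2, 60)])
print(fit_measure(d[2:], pred) < 0.3)  # True: the local model tracks the series
```

A sinusoid satisfies a second-order linear recurrence exactly, so a VAR(2) with intercept captures it up to the noise floor, which is why the FIT value is small here.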

After the training phase of the chosen SOM, a mapping function $f(Tr_j)$ projecting each triplet into the 2-D SOM map can be defined. The output of this function is a list of zones, i.e. trajectories (which can actually be discontinuous), within the SOM map. Dynamic aspects of the evolution of the clustered triplets are also considered: transition probabilities between different zones are computed, in such a way that the outcome of the training process can be ideally compared to a Hidden Markov Model (HMM) [95].


Figure 4-6: Example of graphical comparison between the VAR(2) models and the simulated data representing the number of people within zone 1. The averages of the matching between the VAR(2) model outputs and 140 simulated vectors (expressed in percentages) are the following: SOM 4 × 4 fit: 67.14%; SOM 5 × 5 fit: 53.6%; SOM 7 × 7 fit: 40.18%.

4.4 Results

The simulated monitored environment is shown in Fig. 4-7. The configuration of doors,

walls and rooms is however customizable and a wide range of scenarios can be set for

tests. A crowd enters a room of the simulator and is given the motivation of moving toward the exit of the building. Births of new characters occur during the simulation, modelled by a Poisson distribution (we hypothesize a fixed average incoming rate, so that data coming from different simulations are comparable). The human operator must act on the door configuration in order to direct crowd flows, thus preventing overcrowding.

The use of a graphical engine (freely available at http://www.horde3d.org/) has been introduced in order to make the simulation realistic in the AM (section 4.2) training phase. Here a human operator acts on the door configuration in order to prevent room overcrowding, based on the visual output, which needs to be as realistic as possible. Namely, the


Figure 4-7: The simulated monitored environment.

simulator has to output realistic data both from the behavioural point of view, in order to

effectively interact with the human operator, and from the visual point of view, in order to

grant an effective interface by truly depicting reality. Reactions of an operator faced with

an unrealistic visual output could be extremely different and strongly depend on render-

ing quality. For this reason, characters are also animated to simulate walk motion (at first

glance a crowded environment with still people could look less populated than it really is).

Crowd behaviour within the simulator is modelled based on Social Forces, which were mentioned in section 3.1. This model assimilates each character in the scene to a particle subject to 2D forces, and treats it consequently from a strictly physical point of view. Its motion equations are derived from Newton's law $F = ma$. The forces driving a character are substantially of three kinds [71]. An attractive motivational force $F_{mot}$ pulls characters toward some scheduled destination, while repulsive physical forces $F_{phy}$ and interaction forces $F_{int}$ prevent collisions with physical objects and take into account interactions among characters. An additional linear drag (viscous resistance) $F_{res}$ takes into account the fact that no character actually persists in a state of constant speed but tends to stop its motion as motivation runs out. This force is in fact accounted for and included in $F_{mot}$. Such forces are sketched in Fig. 4-8. Chaotic fluctuations are included,

according to the modified social force model proposed in [114]. These fluctuations account for random individual variations in behaviour and lead to crowd motion self-organization.

The three forces are estimated at each time instant for each character, whose position is then

updated according to the motion equation and normalized according to the current fps rate


Figure 4-8: Vectorial sum of forces 𝐹𝑇𝑂𝑇 and influence range of characters.

supported by the graphical engine (which strongly depends on the number of characters

to be handled). As already mentioned, the people incoming rate is modelled as a Poisson distribution. Characters die as they reach their final scheduled destination. A human

operator interacts with the crowd by opening doors to let it flow, while trying to minimize

the time a door remains open. Although somehow simplified with respect to more complex

works, such as [71] (where additional assumptions on trajectories’ regularity are made),

the developed model results in a good overall output, where people behave correctly and

realistically.
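A minimal sketch of such a social-force update, with one character, one neighbour and Euler integration. The force coefficients, the 2 m influence range and the near-field clipping guard are assumptions of this illustration, not the simulator's actual parameters:

```python
import numpy as np

def social_force_step(pos, vel, goal, others, dt=0.05, mass=1.0):
    """One Euler step of a social force model: motivational attraction toward
    the goal, distance-decaying repulsion from nearby characters, and a
    linear viscous drag. All coefficients are illustrative."""
    to_goal = goal - pos
    f_mot = 2.0 * to_goal / (np.linalg.norm(to_goal) + 1e-9)  # F_mot: unit pull
    f_int = np.zeros(2)
    for q in others:                        # F_int: repulsion from neighbours
        d = pos - q
        r = np.linalg.norm(d)
        if r < 2.0:                         # 2 m influence range
            f_int += d / max(r, 0.2) ** 2   # clipped near-field for stability
    f_res = -0.5 * vel                      # F_res: linear drag
    acc = (f_mot + f_int + f_res) / mass    # Newton's law F = ma
    vel = vel + acc * dt
    return pos + vel * dt, vel

pos, vel = np.array([0.0, 0.0]), np.array([0.0, 0.0])
goal = np.array([10.0, 0.0])
crowd = [np.array([1.0, 0.2])]              # one nearby character to steer around
for _ in range(200):
    pos, vel = social_force_step(pos, vel, goal, crowd)
print(pos[0] > 1.0)  # True: the character moved toward its destination
```

The sideways component of the repulsion deflects the character around its neighbour rather than stopping it, which is the qualitative behaviour the social force model is chosen for.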

The simulator also includes (simulated) sensors. These try to reproduce (processed) sensor

data coming from different cameras looking at different subsets (rooms) of the monitored

scene. A virtual people estimation algorithm outputs the number of people by simply

adding some noise to the mere number of people framed by the virtual camera, thus trying

to mimic the output of real image processing algorithms, such as [61]. The state vector of the system (which corresponds to the external object, eso) is (cf. Equation 4.10)

$\mathbf{X}_{Cr}(t) = \{x_{Cr_1}(t), x_{Cr_2}(t), \ldots, x_{Cr_N}(t)\}, \quad (4.16)$

with 𝑁 = 6 in our case (six cameras, one for each room). 𝑥𝐶𝑟𝑛(𝑡) is the number of people

in room $n$. The people flow starts in a hall room, which corresponds to $x_{Cr_1}$. A 7 × 7 2D SOM is then trained in order to cluster the state-vector space. The SOM Super-states (or better, their variations) define events. The internal (endo) state of the system (namely, the doors'


configuration) is simply modelled by a binary vector storing the state of each door (true if

open, false if closed). Variations of such a vector define proto events.

An AM is then trained by a human operator who opens virtual gates in order to let the

crowd stream outside in case high occupancy states are reached and, at the same time, to

minimize the time gates remain open.

In our case, four kinds of simulated scenarios for different crowd behaviours (see Table 4.3) have been taken into consideration, in order to memorize the interactions between a human operator (proto-self) and the crowd (core-self), as formalized in section 3.3. For instance, the first crowd behaviour, identified by $1d$, has $\mu = \sigma^2 = 1$ for the Poisson probability mass function, a weak interaction force, and a relatively short interaction range.

After mapping the AM into a 2-D space by means of a SOM, the operator's reactions to different crowd fluctuations, stored and encoded by $f$, can be used on-line to choose an optimal strategy, i.e. to emulate the actions of a human operator by predicting not only his behaviour but also the crowd's reaction to it.

Table 4.3: Different crowd behaviours in simulated scenarios

ID   Num. persons/second   Interaction range   Interaction forces
1d   1                     1 m                 Low
2d   2                     2 m                 Low
1p   1                     1 m                 High
2p   2                     2 m                 High

A reference model 𝜃𝑖 for operator-crowd interactions is then designed (refer to Fig. 4-5).

We define a sequence of passive triplets (related to the $i = 1d$ crowd behaviour, Table 4.3) as:

$\overline{Tr} = (Tr_1, Tr_2, \ldots, Tr_k, \ldots, Tr_K), \quad (4.17)$

where $Tr = (\epsilon_P^-, \epsilon_C, \epsilon_P^+)$. The mapping function $f(Tr_k) = S_k$ defines a corresponding sequence of Super-states in the SOM map: $\overline{S} = (S_1, S_2, \ldots, S_k, \ldots, S_K)$. In the simulation, the maximum time between two subsequent Super-states $(\ldots, S_k, S_{k+1}, \ldots)$ is taken as $8\,s$. After such a time lapse, a new interaction (Super-state) is considered. The $k^{th}$ Super-state probability is an element of a vector $P$, defined as $P_k = P(S_k)$; it corresponds to the number of occurrences of $S_k$ over $\overline{S}$ with $k = 1, \ldots, K$. The elements of the transition probability matrix $M$ are defined as $M(S_k, S_{k+1}) = P_{k+1|k} = P(S_{k+1}|S_k)$.

A test sequence of passive triplets $\overline{Tr}^{ID}$ (one for each crowd behaviour listed in Table 4.3) is simulated and processed by $S_k^{ID} = f(Tr_k^{ID})$ in order to generate $\overline{S}^{ID}$ with $k = 1, \ldots, K$. A weighted average of the transition probabilities between subsequent Super-states $(\ldots, S_k^{ID}, S_{k+1}^{ID}, \ldots)$ is computed as follows:

$P_i^{ID}(t) = \frac{1}{W} \sum_{k=1}^{W} P_k^{ID}\, P_{k+1|k}^{ID}, \quad (4.18)$

where $P_k^{ID} = P(S_k^{ID}|\theta_i)$ and $P_{k+1|k}^{ID} = M(S_k^{ID}, S_{k+1}^{ID}|\theta_i)$, while $W$, called the moving evaluation window, defines the number of test sequence triplets considered at each step $t$. We define the probability of reaching the $(k+1)^{th}$ Super-state starting from the $k^{th}$ as $P_{k \mapsto k+1}^{ID} = P_k^{ID} P_{k+1|k}^{ID}$. The recognition of the interaction model is performed by taking the maximum probability: $(i^*, t) = \arg\max_i P_i^{ID}(t)$ with $i = 1, \ldots, N_I$ and $t = 1, \ldots, T$. The couple $(i^*, t)$ defines the kind of recognized operator-crowd interaction $\theta_i$ and also the maximum time $W \cdot 8 + t \cdot 8\,s$ in which the detection is performed.
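The Super-state prior $P$, the transition matrix $M$ and the windowed score of Equation 4.18 can be sketched on toy index sequences; the training pattern and $K = 3$ below are hypothetical:

```python
import numpy as np

def transition_model(superstate_seq, K):
    """Estimate P(S_k) and the transition matrix M(S_k, S_{k+1}) from a
    training sequence of Super-state indices."""
    P = np.bincount(superstate_seq, minlength=K) / len(superstate_seq)
    M = np.zeros((K, K))
    for a, b in zip(superstate_seq[:-1], superstate_seq[1:]):
        M[a, b] += 1
    M = M / np.maximum(M.sum(axis=1, keepdims=True), 1)  # row-normalize
    return P, M

def windowed_score(P, M, test_seq, W):
    """Eq. 4.18: mean of P(S_k) * P(S_{k+1} | S_k) over a window of W steps."""
    s = [P[a] * M[a, b] for a, b in zip(test_seq[:W], test_seq[1:W + 1])]
    return float(np.mean(s))

train = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]  # hypothetical 'normal' pattern
P, M = transition_model(train, K=3)
normal = windowed_score(P, M, [0, 1, 2, 0, 1], W=4)
odd = windowed_score(P, M, [2, 1, 0, 2, 1], W=4)
print(normal > odd)  # True: the learned model scores its own pattern higher
```

Comparing this score across the models $\theta_i$ and taking the arg-max reproduces the recognition rule described above.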

Different averaged transition probability curves, with $W = 2, 5, 10, 15$ and $T = 10$ steps, are evaluated. An example with $W = 10$ is shown in Fig. 4-9. The four interaction behaviours (red curve) are compared with the reference model (blue curve). Using only a few triplets for each time step (i.e. lower $W$ values, e.g. $W = 2$ and $W = 5$), some behaviour models appear confused. The separation distance between the curves increases as the evaluation window grows, e.g. with $W = 10$ and $W = 15$.

The Mean Square Error (MSE) is computed in order to evaluate the distances between the observed interaction behaviour curves and the reference behaviour model. The minimum MSE provides a similarity measure between interaction behaviours. At each time step $t = 1, \ldots, T$ it is computed as follows:

$MSE(t) = \frac{1}{W} \sum_{k=1}^{W} \left(P^{*}_{k \mapsto k+1} - P_{k \mapsto k+1}\right)^2,$

where $P_{k \mapsto k+1}$ and $P^{*}_{k \mapsto k+1}$ correspond to the probability values over $\overline{S}$ and $\overline{S}^{*}$, i.e. the reference and observed sequences.

Figure 4-9: Classification examples of interaction behaviour using evaluation window W=10.

The anomalous interactions between an operator and the monitored crowd could

emerge after normal behaviour, e.g. when a careless user does not open some doors. In this situation the CN, working in its on-line modality, is able to recognize anomalous crowd management. Fig. 4-10 shows the normal behaviour, in the specific case of $ID = 1d$ (blue curve), and compares it with the observed operator-crowd interactions (red curve). Using an evaluation window $W = 10$, the two processes start to drift apart at $t = 6.4\,s$, and the MSE correspondingly starts to grow. The detection rule is $\nabla MSE(t) > 0$ for $t \in [t_{min}, t_{max}]$: the system detects an anomalous operator-crowd interaction when the curve gradient is positive for an evaluation period $t_{ep} = t_{max} - t_{min}$. In the on-line modality, when an anomalous interaction is recognized, the system alerts the operator by sending a message. Such a message can contain the last detected abnormal passive triplet, e.g. first user action (proto event), evolution of the crowd (core event) and consequent operator action (proto event). In the case shown in Fig. 4-10, the anomalous situation is due to a wrong consequent user action, i.e. the operator does not open some doors and the number of people increases.
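The MSE curve and the gradient rule $\nabla MSE(t) > 0$ can be sketched as follows; the drifting probability sequence and the window sizes are illustrative assumptions:

```python
import numpy as np

def mse_curve(p_ref, p_obs, W):
    """MSE(t) = (1/W) * sum over the window of (P* - P)^2 between the
    reference and observed transition probability sequences."""
    T = len(p_ref) - W + 1
    return np.array([np.mean((p_ref[t:t + W] - p_obs[t:t + W]) ** 2)
                     for t in range(T)])

def detect_anomaly(mse, t_ep):
    """Flag an anomaly when the MSE gradient stays positive for t_ep
    consecutive evaluation steps (the rule grad MSE(t) > 0)."""
    grad = np.diff(mse)
    run = 0
    for t, g in enumerate(grad):
        run = run + 1 if g > 0 else 0
        if run >= t_ep:
            return t + 1  # step at which the positive run is completed
    return -1             # no anomaly detected

# Reference transition probabilities vs. an observation that drifts away.
p_ref = np.full(20, 0.30)
p_obs = np.concatenate([np.full(10, 0.30), 0.30 - 0.02 * np.arange(1, 11)])
mse = mse_curve(p_ref, p_obs, W=5)
print(detect_anomaly(mse, t_ep=3) >= 0)  # True: the drift is detected
```

While the observed process matches the reference, the MSE is flat at zero; once the drift enters the moving window, the MSE grows monotonically and the positive-gradient run triggers the alarm.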


Figure 4-10: Detection of anomalous operator-crowd interactions. The system detects an anomalous interaction when the operator does not open two doors and the number of people increases. This incorrect crowd management situation is shown in the figure and compared with the correct situation.

4.4.1 Example of application on real video sequences

In order to give consistency to the proposed cognitive video surveillance system, an experiment has been conducted on an available video sequence from the "Performance Evaluation of Tracking and Surveillance" (PETS) workshop dataset [37] for a single camera (S3 High level, Time 14-16, View_0001; the sequence length is 223 frames and the frame rate is ∼ 7 fps).

The main target of this experiment is to demonstrate how the system is able to recognize

interactions between an operator and the crowd in a video sequence. For this

purpose the real environment has been partitioned into three zones, which are supposed to

be monitored by cameras, as shown in Fig. 4-11 (a). In the simulated environment, the

zones correspond to three rooms, Fig. 4-11 (b). In the sequence, two crowd behaviours

corresponding to different people flows have been identified. The first flow occurs when the people cross the scene from zone 1 to zone 3 (i.e. from frame_0000 to frame_0107), while the second flow occurs when the people move from zone 3 to zone 1 (i.e. from


frame_0108 to frame_0222). By using the simulator these two different people behaviours

have been reproduced: for the first flow the people enter the scene in zone 1 and head out in zone 3, while for the second flow the people enter in zone 3 and leave in zone 1.

In the simulator a human operator can manage the crowd flow, from one room to another, by acting on the doors 𝑑1, 𝑑2, 𝑑3, 𝑑4. The user opens a door when the people are near it. In order to reproduce the same interaction on the real video sequences, the same door configuration has been assumed. A people counting algorithm [88]

provides an estimate of the total number of people in the zones present in video sequences,

Fig. 4-11 (a). In this virtual environment a people counting module is simulated in order

to count the people within a sub-area of the room (dashed circular areas, Fig. 4-11 (b)).

(a)

(b)

Figure 4-11: Example of real environment (a) and simulated scenario (b) used for the test; the virtual rooms correspond to the zones. The red dashed line corresponds to the people flow direction from zone 1 to zone 3; the blue dashed line describes people movement from zone 3 to zone 1. Dashed circular areas qualitatively correspond to the parts of the rooms monitored by cameras equipped with the people counter module. 𝑑1 − 𝑑4 are the doors.

The test is composed of two parts: learning and detection (on-line). During the learning

phase, the cognitive system has learned two probabilistic models from the simulation, i.e.


two AMs, in order to describe the two crowd behaviours. The rules used to memorize these two models are specified as follows: if the operator sees the people moving from zone 1 to zone 3, he must open only 𝑑1 and 𝑑2, according to the people flow; the user has to act on 𝑑3 and 𝑑4 if the people flow is in the opposite direction. Four scenes for the two people flows have been simulated,

each scene is 60[𝑠] long. The simulated people counters provide the number of people in each zone per second. During the second part the system works on real video sequences. The CN recognizes the observed situations according to the memorized knowledge. During the autonomous phase, the CN, in order to interact directly with the crowd, must discriminate between different crowd-environment configurations. Fig. 4-12 presents four sample frames of

different crowd behaviours: in case (a) the people flow moves along the red arrow (i.e. from zone 1 to zone 3); in case (b) the opposite movement direction is presented (i.e. from zone 3 to zone 1). In cases (c) and (d) the groups of people have different movement

directions, namely people splitting and merging. In these last two cases, the system does

not find any correspondence between observed scene and stored interaction models. In

particular, the CN can consider the scene (c) as anomalous crowd-environment interaction

compared with (a) situation. The same consideration can be done for (d) and (b) situations.

When anomalous crowd-environment interactions are detected, the cognitive system sends

an alarm message in order to inform the human operator. After this phase, the CN is able

to predict the most likely future action and when it will be performed. During the operator support phase, the cognitive system identifies anomalies in terms of differences between

predicted actions and user actions.

The SOM-map dimensions produce different results in terms of knowledge representation

for crowd-environment interactions. In particular, small Kohonen layers increase the SOM's capability of creating analogies between different input data. This effect becomes particularly relevant when the input data are corrupted by noise. A test has been conducted employing

two people counters, namely 𝑃𝐶1 and 𝑃𝐶2, characterised by different accuracies, i.e.

𝑎𝑃𝐶1 = 80%, 𝑎𝑃𝐶2 = 60%. The experiment can be divided into two parts. Firstly, we have

manually built the ground truth for the video sequence. We use this information in order

to generate the sequences of the super-states. Through three different SOMs, i.e. SOM 16,


(a) (b)

(c) (d)

Figure 4-12: Sample frames for four different crowd-environment interactions. Different people flows are presented: two opposite directions of movement (a) (b), people splitting (c), people merging (d). (a) and (b) represent normal behaviours, while (c) and (d) represent two abnormal behaviours.

SOM 25 and SOM 36, the original data have been mapped into the SOM spaces. Secondly,

by using the people counter [28], it is possible to obtain the number of people (𝑃𝐶1) with an estimated accuracy of 80% (𝑎𝑃𝐶1 = 80%). By tuning a people counter parameter, another set of people counts (𝑃𝐶2), with lower accuracy, has been acquired (𝑎𝑃𝐶2 = 60%).

We have manually corrupted the parameter in order to decrease the accuracy of the people number estimation. The data provided by 𝑃𝐶1 and 𝑃𝐶2 are classified by using the three different

SOMs, so that six different sequences of fired neurons are obtained. The events (i.e. super-state transitions), which correspond to passages across the zones (i.e. Zone 1 ↦→ Zone

2, Zone 2 ↦→ Zone 3 and Zone 3 ↦→ Zone 2, Zone 2 ↦→ Zone 1), are compared with the

events generated from the ground truth. When the system recognizes the same events, it is possible to affirm that a specific SOM is able to provide an efficient crowd-environment interaction representation; vice versa, a failure is detected. Failures are due to the

poor capability of larger SOM layers of finding analogies between input data: similar inputs

may be mapped to different neurons (see Subsection 4.3.2). In table 4.4, the performances

(in people flow detections) of three SOMs (16, 25 and 36 neurons respectively) are shown.


The interesting result is that a 16-neuron SOM is able to detect three zone passages even in the presence of corrupted input data.

Table 4.4: People flow detections using different people counters and SOMs. People counter accuracies: 𝑎𝑃𝐶1 = 80%, 𝑎𝑃𝐶2 = 60%. Success = 1, Failure = 0.

Direction           SOM 16      SOM 25      SOM 36
                    PC1  PC2    PC1  PC2    PC1  PC2
Zone 1 ↦→ Zone 2     1    1      1    1      1    0
Zone 2 ↦→ Zone 3     1    0      1    0      0    0
Zone 3 ↦→ Zone 2     1    1      1    1      1    0
Zone 2 ↦→ Zone 1     1    1      1    0      1    0
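The analogy-forming effect of smaller Kohonen layers can be illustrated with a minimal best-matching-unit (BMU) lookup. The 1-D prototype grids and the noise value below are toy assumptions, not the trained maps used in the experiment.

```python
import numpy as np

def bmu(weights, x):
    """Index of the best matching unit (smallest Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))

# Toy 1-D prototype grids covering the same input range [0, 30]:
som16 = np.linspace(0, 30, 16).reshape(-1, 1)  # coarse layer
som36 = np.linspace(0, 30, 36).reshape(-1, 1)  # fine layer

clean = np.array([12.0])  # e.g. the true people count in a zone
noisy = clean + 0.5       # the same count, corrupted by counter noise

print(bmu(som16, clean) == bmu(som16, noisy))  # True: same neuron fires
print(bmu(som36, clean) == bmu(som36, noisy))  # False: different neurons
```

The coarser quantization absorbs the measurement error, so both readings fire the same neuron; this is the behaviour exploited by the 16-neuron SOM in Table 4.4.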

For the test, a SOM 25 and PC1 have been employed. In Fig. 4-13 the cognitive system

predictions and detections of normal and abnormal interactions between an operator and

the crowd are shown. In the figure, the operator actions and corresponding video frames

are represented (blue operator action rectangle) in the ground truth bar. The prediction

(yellow prediction rectangle) and action bar represents the cognitive system actions. The

anomaly is represented by a red anomaly rectangle. Considering case (a), the system predicts, for the first zone crossing (i.e. from zone 1 to zone 2), the action of opening 𝑑1 (specified by the blue

triangle). In this case, the operator action finds a correspondence with the predicted action

(i.e. 𝑑1). During the second zone crossing (i.e. from zone 2 to zone 3) the system detects

an anomalous operator-crowd interaction behaviour: the user opens an incorrect door

(i.e. 𝑑3 indicated by blue circle). The case (b) presents the same analysis for a different

people flow direction.


(a)

(b)

Figure 4-13: Qualitative results of normal and anomalous operator-crowd interaction detection during the operator support phase. The ground truth bar represents the operator actions in correspondence with the video frames. The prediction and action bar represents the cognitive system actions.


Chapter 5

A bio-inspired algorithm for designing a

human attention mechanism

5.1 Introduction

In the previous two chapters, 3 and 4, an intelligent video surveillance system called the Cognitive Surveillance Node has been presented. Such a framework can predict possible dangerous situations. At the same time, the system can actively support human decisions in order to prevent possible wrong user actions. The main issue in large-scale video surveillance networks is the analysis of a huge amount of information in order to detect dangerous situations. For example, overcrowding was recently the cause of several tragic events: in January 2013 the

panic due to an accidental fire in a nightclub in Brazil caused at least 241 casualties; three

years before, 21 people died from suffocation during a famous pop festival in Duisburg

(Germany). In these and other similar unfortunate situations, automatic crowd monitoring systems could have played an important role in increasing the safety level of the environment and, in the end, could have saved human lives.

Typically, in wide-area surveillance systems a huge amount of information is acquired by

camera networks. This can lead to an unpredictable increase in the required computational


power and to the possibility of losing important data. In particular, it can be seen that

not all the data coming from the guarded environment are indeed relevant for a specific

high-level security task. In order to avoid the problem of information overload, innovative methods based on the human attentional process for the selection of relevant features have been proposed. Human attention is a biological mechanism which directs the focus towards some stimuli (i.e. visual features) while neglecting others; it is basically divided into two main approaches: bottom-up and top-down. Bottom-up attention takes into consideration sensory data only, while the top-down approach also considers a priori knowledge, acquired from the scene, to improve saliency extraction.

To this end, an algorithm based on the cognitive perception process, named Selective Attention Automatic Focus (S2AF), for relevant information extraction within a surveillance network, is

proposed. S2AF can be seen as a novel processing method for analysing raw data acquired

by multiple sensors of a Cognitive Surveillance Node (CSN), presented in chapter 4. The

application of cognitive science, related to the selective attention focus [66], in order to

design the next generation of safety and security systems represents a relevant added value

with respect to the state of the art. Such processes enhance the cognitive system's capabilities not

only for recognizing a specific situation but also for changing its behaviour in accordance

with new circumstances, localizing the important parts of a controlled area.

Other methods have been proposed for saliency detection; e.g. Mancas et al. in [75] have proposed a bottom-up attention algorithm for crowd abnormality detection, based on features extracted by optical flow. Each frame has been manually divided into different cells (i.e. parts) which are compared in order to detect contrasts.

This chapter describes the application of cognitive perception mechanisms for designing

a top-down algorithm for automatic focus of attention as part of the CSN for crowd monitoring. In particular, this work shows how detection performance in the more cluttered areas within the guarded environment depends on the allocation of processing resources. In the

proposed framework a completely automatic mechanism for partitioning the environment,

forming sub-areas or zones, and allocating the resources is presented.


The attention module workflow is divided into two main steps: the training and active phases.

During the training phase, the algorithm can acquire and model local information (i.e. local models), such as the crowd density in a zone (i.e. the number of people). This part is extremely important in order to build the so-called Environmental Memory, figure 5-1.

Figure 5-1: Attention module, formed by the attention mechanism based on Selective Attention Automatic Focus (S2AF) and the Environmental Memory, which stores the crowd density local models.

During the active phase, the attention module, by using the knowledge contained within the environmental memory, can analyse the situation and highlight the most relevant environmental parts. This task permits determining the zones of the area under control where the operator has to focus his attention. Secondly, the system, by using the local models,

can identify possible deviations from an expected situation, such as anomalous density of

people (i.e. overcrowding).

Figure 5-2 shows two operative situations in which the attention module, which is part of the cognitive node, is either deactivated (i.e. attention open loop) or activated (i.e. attention closed loop).


∙ Attention open loop: through the switch T, the operator can bypass the attention module and receive the complete information directly from the whole camera network.

∙ Attention closed loop: through the switch, the operator can activate the attention module and receive only the relevant information (i.e. the video images coming from a subset of sensors).

Figure 5-2: The switch T activates or deactivates the attention mechanism. Attention open loop: the operator receives the information directly from the camera network. Attention closed loop: the operator receives the information through the attention module, which selects only the relevant information.

According to biological reasoning mechanisms, the stimuli are first perceived by sensory receptors and then understood with respect to previous experiences. In the proposed

framework, the environmental memory is used for storing and holding the sensory infor-

mation.

Results are carried out through a crowd simulator similar to the one described in [17]. The

use of a simulated environment is necessary in order to gather a substantial set of testing

video sequences. The issue of modelling and simulating crowds will be discussed and

motivated, as it represents a central matter in applying the proposed cognitive theory.


The remainder of this work is organized as follows: section 5.2 gives a quick insight into the selective attention cognitive process for crowd monitoring, modelling and analysis. Section

5.3 discusses the information flow process for cognitive crowd monitoring, which leads to

the important attention focus concept on relevant information. The proposed Selective

Attention Automatic Focus algorithm is presented in section 5.4. Results are drawn in

section 5.5.

Figure 5-3: The schematic information flow process for selective attention automatic focus and relevant information extraction. The total information is associated with the macroscopic level (a). 𝑋 represents a set of trajectories defined as the microscopic level (b). By an encoding function 𝑍(𝑋), the more informative environment parts, associated with the sufficient information 𝑌, are identified as the mesoscopic level (c). Such a representation is used to extract the relevant information 𝑇(𝑌) according to the task (d).


5.2 Selective attention for automatic crowd monitoring

Focus of attention is a human cognitive process. This biological mechanism is involved in the activity of an operator who monitors several security cameras at the same time in order to perform a guarding task. This can happen, for example, to ensure that there are no ongoing dangerous situations (e.g. overcrowding) or specific behaviours (e.g. suspicious

movements) in the environment under control. If a large amount of data is being considered, the human capability of focusing attention on important events while ignoring other situations can be compromised. For this reason an automatic system for crowd analysis and monitoring, which is able to support and help human selective attention and focus mechanisms,

can be of fundamental importance as an additional tool for the user. A crowd can be seen as a dense group of moving objects, whose analysis thus focuses on its temporal evolution and dynamics. However, if an extremely cluttered crowd is considered, it is nearly impossible for an

automatic system or a human operator to individually track each single particle. It is clear

that one can decide to tackle this problem at different scales of interpretation, considering

the crowd as a global entity or as a summation of different parts. The attention focus pro-

cess depends on the focus size, defined by the way in which the attentional resources are

distributed over the monitored area [12]. It is possible to consider an inversely proportional relationship between the efficiency of the focus of attention process and the focus size (i.e. attentional

resource distribution). Different approaches to crowd analysis depend on different attentional resource distributions. Typically, a crowd can be described in terms of its components, namely the people it is formed by. This remark may sound trivial, but it has deep

implications in the way a crowd is depicted. In particular, it raises the issue of choosing between local and global descriptors. A local crowd descriptor relies on the features associated with each member, such as positions, speeds, orientations, motivations, destinations, etc. A

global (holistic) descriptor, on the other hand, is based on features of the crowd as a single

entity, such as average density, entropy, average shift in some direction, global displacement

etc. Global features can in general be derived from local ones, by averaging or integrating

local quantities. However, it is not only a matter of scale at which the crowd is analysed,


but rather of the additional amount of information stored in local quantities compared to

global ones. A nice parallel comes from thermodynamics, where global quantities, such as the energy, pressure and temperature of a gas, can in principle be derived from the average kinetic energy of its molecules: by knowing the exact behaviour

of each single molecule in the gas one can derive the temperature, while the opposite cal-

culation is not possible, as information is lost by averaging over all molecules. However,

in both the cases of crowd and thermodynamics, it is not always possible to access local in-

formation entirely, while global quantities can be easily gathered. For example, in a video

surveillance framework, it is unrealistic to analyse the behaviour of every single person

(i.e. microscopic level analysis) in a high density crowded scene, in order to understand

crowd global behaviour (i.e. macroscopic level analysis). The proposed S2AF algorithm

represents a method to extract the most informative data for solving the considered surveillance task. Through the proposed approach, it is possible to select a subset of the guarded

environment (i.e., mesoscopic level analysis), thus maximizing the efficiency of the focus of attention process.
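The thermodynamics parallel above can be made concrete with a toy computation: averaging is many-to-one, so a global quantity cannot be inverted back to the local configuration that produced it (the numbers are purely illustrative).

```python
# Two different microscopic configurations (e.g. per-person speeds) ...
config_a = [1.0, 1.0, 4.0]
config_b = [2.0, 2.0, 2.0]

def global_descriptor(values):
    """A holistic feature obtained by averaging local quantities."""
    return sum(values) / len(values)

# ... collapse onto the same macroscopic value: the local detail is lost.
print(global_descriptor(config_a) == global_descriptor(config_b))  # True
```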

5.3 Information flow process for cognitive crowd monitoring

According to the previous considerations, the proposed approach extracts the most relevant

information from acquired data; this minimal set of information can be considered as a

sufficient statistic that can be obtained through a specific transform of the sensory data ac-

quired from the considered surveillance network (see Figure 5-3). The relevant information

can be determined starting from the observed environment. It is possible to define 𝑋𝑛(𝑡) as the position of the object (or particle) 𝑂𝑛 at time 𝑡, where 𝑋𝑛(𝑡) ∈ ℝ². A trajectory is a sequence X𝑛 = {𝑋𝑛(𝑡1), ..., 𝑋𝑛(𝑡𝑘)}, where 𝑡1 is the first instant at which an object is observed, while 𝑡𝑘 represents the moment at which the particle disappears. The crowd simulator is able to generate the trajectories {X𝑛}, 𝑛 = 1, ..., 𝑁, where


𝑁 is the total number of simulated entities. Based on these trajectories an Instantaneous

Topological Map (ITM) algorithm [59] can be used for partitioning the whole considered

area on the map into smaller zones. If Ω ⊆ ℝ² is the map of the guarded area, the ITM method produces a set of zones {𝑧𝑗}, 𝑗 = 1, ..., 𝐽, such that 𝑧𝑗 ∩ 𝑧𝑖 = ∅ for 𝑗 ≠ 𝑖 and ⋃𝑗 𝑧𝑗 = 𝑍 ⊆ Ω. The main aim of the proposed algorithm is to select a subset {𝑐𝑘}, 𝑘 = 1, ..., 𝐾, of {𝑧𝑗} as the best possible group of zones for maximizing the information according to a specific task of the system, with 𝑐ℎ ∩ 𝑐𝑘 = ∅ for ℎ ≠ 𝑘 and ⋃𝑘 𝑐𝑘 = 𝐶 ⊂ 𝑍. Moreover, a zone 𝑐𝑘 can be obtained by merging different zones 𝑧𝑗, i.e. ∀𝑘 ∃𝑗 : 𝑐𝑘 ⊇ 𝑧𝑗. In the case of crowding, the number of people in each zone will be considered. The number of people estimated in each zone 𝑧𝑗 at time 𝑡 is given by Y(𝑡) = {𝑌1(𝑡), ..., 𝑌𝐽(𝑡)}. It is

possible to define 𝑇 as a tessellation procedure that is able to provide the set of zone groups

𝑐𝑘 starting from 𝑧𝑗. The objective of the proposed work is to find the specific 𝑇 which is able to maximize the efficiency of the attention focus process. According to Haykin [52], two mutual information measures are defined as follows: 𝐼(𝑌, 𝑇), related to an informativeness measurement, and 𝐼(𝑋, 𝑇), associated with a complexity value, where 𝑋 ≡ {X𝑛} and 𝑌 ≡ Y(𝑡). According to the cognitive control paradigm, 𝐼(𝑌, 𝑇) must be maximized (i.e. 𝐻(𝑌|𝑇) minimized), while 𝐼(𝑋, 𝑇) must be minimized (i.e. 𝐻(𝑋|𝑇) maximized). It is possible to define an evaluation measure for the focus process as follows:

𝑃 = min𝑇 [𝐻(𝑌|𝑇) − 𝜆𝐻(𝑋|𝑇)]    (5.1)

where the parameter 𝜆 is inversely proportional to the area of 𝐶: 𝜆 ∝ 1/𝐴(𝐶), with 𝐴(𝐶) = ∑𝑘=1..𝐾 𝐴(𝑐𝑘). According to Equation (5.1), it is possible to identify an attentional resource distribution 𝐶 in order to focus the attention on the relevant information only.

5.4 Selective Attention Automatic Focus (S2AF) algorithm

This section describes an application in the crowd monitoring domain to demonstrate the proposed Selective Attention Automatic Focus method. The crowd simulator has been used in


Figure 5-4: Crowd simulator: simulated crowding scene for trajectory generation.

Figure 5-5: Simulated people flows: qualitative sketch of trajectories; the arrow colors indicate five different people flows 𝑑1, ..., 𝑑5.


Figure 5-6: ITM: topological map, formed by different zones created after 50 trajectories using the ITM algorithm.

Figure 5-7: Histogram: a learned people histogram from which it is possible to evaluate the probability that an observed number of people occurs in the zone 𝑧𝑗 (in this case 𝑗 = 42).


Figure 5-8: Gaussian p.d.f.: Gaussian p.d.f. obtained by histogram normalization (in this case the average is 𝑚42 = 3.4 and the variance is 𝜎²42 = 2.8).

Figure 5-9: Seed attentional resources: initial conditions (i.e. three high entropy 𝑧𝑠𝑗 for each room).


order to generate a significant set of realistic trajectories. The first step is to create a topological encoding of the people trajectories through the ITM algorithm. Figure 5-4 shows an image of the simulator used for simulating five different people flows (figure 5-5), while figure 5-6 presents an example of a topological map. The map is formed by 𝑍 = 147 zones generated through the encoding of 𝑁 = 40 trajectories. The second step is to acquire for each

zone, through a large number of training simulations, the transition models and the probability density functions of the people number (figures 5-7 and 5-8). The transition models represent the passages from each zone 𝑧𝑗 to its neighbours 𝑧𝑖, where 𝑧𝑗, 𝑧𝑖 ∈ 𝑍 with 𝑖 = 1, ..., 𝑀 and 𝑖 ≠ 𝑗. The transition probabilities for each zone 𝑧𝑗 are defined as 𝑝𝑗,𝑖 = 𝑝(𝑧𝑖|𝑧𝑗). A conditional entropy value 𝐻𝑖,𝑗 for each 𝑧𝑗 is computed as follows:

𝐻𝑖,𝑗 = −∑𝑖=1..𝑀 𝑝𝑗,𝑖 log 𝑝𝑗,𝑖    (5.2)

Figure 5-10 shows an example of two situations: low (on the left) and high (on the right)

uncertainties of transition between a zone and its neighbours.

Figure 5-10: Examples of transition probabilities for ITM zones. A conditional entropy value is computed between a given zone and its neighbours. On the left, a zone 𝑧1 is connected with one neighbour only, 𝑧2 (i.e. low uncertainty and conditional entropy), while on the right a zone 𝑧4 has more potential neighbours 𝑧5, 𝑧6, 𝑧7 (i.e. high uncertainty and conditional entropy).

The proposed method is able to provide configurations 𝐶 of zone groups 𝑐𝑘 (obtained by attentional resources) according to Equation (5.1). Through the maximization of


𝐻(𝑋|𝑇) it is possible to define a projection of the people trajectories 𝑋 over 𝐶 by means of the transformation function 𝑇. Furthermore, since the transition model gives a representation of 𝑋 over 𝑍 and 𝐶 ⊂ 𝑍, to maximize the term 𝐻(𝑋|𝑇) it is sufficient to determine which zones 𝑧𝑗 satisfy the condition:

𝑍𝑠 = {𝑧𝑠𝑗 : 𝐻𝑖,𝑗 ≥ 𝐻𝑡𝑟}    (5.3)

where 𝐻𝑡𝑟 is a threshold used to select and rank the zones associated with high entropy. From Equation (5.3) it is possible to identify a seed zone (i.e. attentional resource) set 𝑍𝑠 = {𝑧𝑠𝑗} with 𝑍𝑠 ⊂ 𝑍. 𝑍𝑠 represents the initial condition (i.e. the initial attentional resources, see figure 5-9) of the S2AF algorithm, Algorithm 2. Section 5.5 will show how the efficiency of the proposed cognitive process for selective automatic attention focus depends on this initial set.
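The seed selection of Equations (5.2) and (5.3) can be sketched as follows; the transition tables and the threshold value are illustrative assumptions (natural logarithms are used here).

```python
import math

def conditional_entropy(p_neighbours):
    """H_{i,j} = -sum_i p_{j,i} log p_{j,i} over the neighbours of z_j."""
    return -sum(p * math.log(p) for p in p_neighbours if p > 0)

def seed_zones(transitions, H_tr):
    """Zones whose transition entropy reaches the threshold H_tr,
    ordered from the highest entropy down (Equation 5.3)."""
    H = {z: conditional_entropy(p) for z, p in transitions.items()}
    return sorted((z for z, h in H.items() if h >= H_tr), key=lambda z: -H[z])

# Toy transition models mirroring Figure 5-10: z1 has a single certain
# neighbour (low uncertainty), z4 has three equally likely neighbours.
transitions = {
    "z1": [1.0],
    "z4": [1/3, 1/3, 1/3],
    "z9": [0.8, 0.2],
}
print(seed_zones(transitions, H_tr=1.0))  # → ['z4']
```

Only the high-uncertainty zone survives the threshold and becomes a candidate attentional resource.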

Data: 𝑧𝑗 ∈ 𝑍
Result: 𝑐𝑘 ∈ 𝐶
Initialization: given the seed zone set 𝑍𝑠 ⊂ 𝑍;
for 𝑘 = 1 : 𝐾 do
    𝑧𝑠𝑗 ↦→ 𝑐𝑘, with 𝑧𝑠𝑗 ∈ 𝑍𝑠;
end
while ∃ 𝑧𝑗 ∉ 𝐶 do
    if the number of people observed in 𝑧𝑗 is 𝑁𝑗 ≠ 0 then
        for 𝑘 = 1 : 𝐾 do
            if |𝑂𝑗 − 𝑂𝑠𝑘| < 𝐷 then
                𝑧𝑗 ↦→ 𝑐𝑘;
                𝐶ℎ = {𝑐𝑘};
                ∀𝑐𝑘 estimate 𝑁̂𝑘;
                ∀𝑐𝑘 compute 𝑝𝑘 = 𝑝(𝑁̂𝑘);
                ∀𝑐𝑘 normalize 𝑃𝑟𝑜𝑏ℎ = {𝑝̂𝑘};
                compute 𝑓ℎ = −∑𝑘=1..𝐾 (1/𝜆𝑘) 𝑝̂𝑘 log 𝑝̂𝑘;
            end
        end
        𝑓(𝑡) = minℎ 𝑓ℎ;
    end
    if 𝑓(𝑡) < 𝑓(𝑡 − 1) then
        𝑧𝑗 ↦→ 𝑐𝑘, where 𝑐𝑘 ∈ 𝐶ℎ;
    end
end
Algorithm 1: S2AF algorithm


Suppose the number of clusters 𝑐𝑘 is fixed (i.e. 𝐾 is fixed). During the initialization phase, the proposed algorithm associates one 𝑧𝑠𝑗 ∈ 𝑍𝑠 with each 𝑐𝑘. The crowd simulator generates synthetic data regarding the people positions in the environment with a frame rate of 30[𝑓𝑝𝑠]. The proposed algorithm, with sample time 𝑡𝑠 = 1[𝑠], takes into consideration a zone 𝑧𝑗 ∉ 𝐶 in which, by use of a simulated people counter, the presence and number of people 𝑁𝑗 have been detected and estimated. The S2AF algorithm tries to assign 𝑧𝑗 to 𝑐𝑘 if |𝑂𝑗 − 𝑂𝑠𝑘| < 𝐷, where 𝑂𝑗 and 𝑂𝑠𝑘 are the centers of the 𝑗-th zone and of 𝑧𝑠𝑗 ∈ 𝑐𝑘, while 𝐷 is the maximum distance for grouping the zone with the cluster. Now a possible configuration 𝐶ℎ is formed. The total number of people in each 𝑐𝑘 = {𝑧𝑗}, with 𝑐𝑘 ∈ 𝐶ℎ and 𝑗 = 1, ..., 𝑆, is 𝑁̂𝑘 = ∑𝑗=1..𝑆 𝑁𝑗. The people number distribution in each zone is supposed to be Gaussian and independent of the others. The probability 𝑝𝑘 = 𝑝(𝑁̂𝑘) is given by a Gaussian p.d.f. 𝒩𝑘(𝑚𝑘, 𝜎²𝑘), where the average is 𝑚𝑘 = ∑𝑗=1..𝑆 𝑚𝑗 and the variance is 𝜎²𝑘 = ∑𝑗=1..𝑆 𝜎²𝑗. Once the probabilities for all 𝑐𝑘 ∈ 𝐶ℎ have been estimated, S2AF provides a set of normalized probabilities 𝑃𝑟𝑜𝑏ℎ = {𝑝̂𝑘}, with 𝑝̂𝑘 = 𝑝𝑘 / ∑𝑘=1..𝐾 𝑝𝑘, for the configuration 𝐶ℎ. For the current configuration 𝐶ℎ, S2AF computes the following cost function:

𝑓ℎ(𝑃𝑟𝑜𝑏ℎ, 𝐴(𝐶ℎ)) = −∑𝑘=1..𝐾 (1/𝜆𝑘) 𝑝̂𝑘 log 𝑝̂𝑘    (5.4)

where 𝜆𝑘 ∝ 1/𝐴(𝑐𝑘). The algorithm then tries to assign 𝑧𝑗 to 𝑐𝑙 ∈ 𝐶, with 𝑙 = 1, ..., 𝐾 and 𝑙 ≠ 𝑘,

obtaining another possible configuration. At each time step 𝑡, S2AF provides a configuration 𝐶ℎ which satisfies the following condition:

𝐶ℎ = {𝑐𝑘} : minℎ 𝑓ℎ(𝑃𝑟𝑜𝑏ℎ, 𝐴(𝐶ℎ))    (5.5)

𝑓(𝑡) = 𝑓ℎ is defined as the instantaneous cost function. This measure can be considered in order to evaluate the efficiency of the automatic focus process. According to Equation (5.1), for minimizing 𝐻(𝑌|𝑇), S2AF assigns 𝑧𝑗 to 𝑐𝑘 if 𝑓(𝑡) < 𝑓(𝑡 − 1), where 𝑓(𝑡 − 1) is the cost function related to the previous time step. As can be seen in the results section, Algorithm 2 decreases the cost function. However, 𝑓(𝑡) is never 0; this can depend on many factors, e.g. the initial configuration (i.e. the 𝑧𝑠𝑗 positions), the initial distribution of virtual


Table 5.1: Groups of zones: S2AF output.

𝑧𝑠𝑗          𝑐𝑘
𝑧𝑠1 = 11     𝑐1 = {11, 1, 2, 6, 7, 10, 12, 14}
𝑧𝑠2 = 13     𝑐2 = {13, 48}
𝑧𝑠3 = 43     𝑐3 = {43, 42, 44}

people, different people flows and speeds, etc.
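The cost of Equation (5.4) for one candidate configuration 𝐶ℎ can be sketched as follows, using the Gaussian-merging rule described above. The per-zone statistics, areas and counts are toy values, and 𝜆𝑘 is taken as exactly 1/𝐴(𝑐𝑘).

```python
import math

def config_cost(clusters):
    """f_h(Prob_h, A(C_h)) = -sum_k (1/lambda_k) p_k log p_k (Eq. 5.4).

    Each cluster is a list of zones; the cluster mean and variance are
    the sums of the zone means and variances, and with lambda_k = 1/A(c_k)
    the factor 1/lambda_k is simply the cluster area.
    """
    probs, areas = [], []
    for zones in clusters:
        m = sum(z["mean"] for z in zones)    # m_k
        v = sum(z["var"] for z in zones)     # sigma^2_k
        n = sum(z["count"] for z in zones)   # observed people count
        # Gaussian p.d.f. N_k(m_k, sigma^2_k) evaluated at the observation
        p = math.exp(-(n - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
        probs.append(p)
        areas.append(sum(z["area"] for z in zones))
    total = sum(probs)
    probs = [p / total for p in probs]       # normalized set Prob_h
    return -sum(a * p * math.log(p) for a, p in zip(areas, probs) if p > 0)

def zone(mean, var, area, count):
    return {"mean": mean, "var": var, "area": area, "count": count}

c1 = [zone(3.4, 2.8, 1.0, 3), zone(2.0, 1.5, 1.0, 2)]  # merged group
c2 = [zone(5.0, 2.0, 1.5, 9)]                          # single zone
print(config_cost([c1, c2]))  # lower is better
```

S2AF accepts an assignment of 𝑧𝑗 to a cluster only when the resulting instantaneous cost 𝑓(𝑡) is lower than at the previous time step.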

5.5 Results

The experimental results have been provided using the crowd 3D simulator, Figure 5-4.

The configuration of doors, walls and rooms is however customizable and a wide range

of scenarios can be set for tests. A crowd enters a room of the simulator and is given

the motivation of moving toward the exit of the building. Births, i.e. entry rate 𝑚, of

new characters occur during the simulation, modelled by a Poisson distribution. The proposed algorithm's performance has been studied in the two following cases: the first for different seed zone choices and different initial people distributions; the second, given a fixed initial configuration, in the presence of an asynchronous event, i.e. a change in the people's direction. In Figure

5-11 the average trends of the cost function 𝑓(𝑡) (normalized to 1) are shown. Three seed zones 𝑧𝑠𝑗 have been fixed, with different entropy values: high (i.e. over 𝐻𝑡𝑟 = 1.5), low (i.e. under 𝐻𝑡𝑟 = 1.0) and randomly chosen. It is possible to note that the cost function associated with the high entropy configuration (figure 5-11) tends to decrease compared with the other curves. This configuration is able to satisfy Equation 5.3. In Table 5.1 the best

groups of zones 𝐶 = {𝑐1, 𝑐2, 𝑐3} provided by S2AF are shown. Figure 5-12 presents a visual representation of the zone grouping mechanism.

Given the configuration presented in Table 5.1, in the second part of the experiments it has been analysed how S2AF is able to detect an asynchronous event. Direction change events in 100 simulated situations have been generated. The people are moving according


Figure 5-11: An example of normalized cost function average trends for room 𝑅1. The experiments have been conducted on 150 sequences of synthetic data provided by the simulator, with 𝑚 = 2 and the people flow corresponding to 𝑑1. The total time of each simulation is 1000[𝑠]. Three attentional resource choices (high entropy, low entropy and random choice) have been considered. Each point on the curves corresponds to the average value of 𝑓(𝑡) over the different initial people distributions.

Figure 5-12: Groups of zones. S2AF provides groups of zones.


to a predefined direction (𝑑1); at a certain time instant they change their way (𝑑3). Considering that the event happens at 𝑡 = 0, Figure 5-13 presents the average trend of the probability curves 𝑝𝑘. When the people drift from 𝑑1 (green curve) the corresponding probability 𝑝1 decreases, while the red curve 𝑝2 grows. This means that the people are moving from 𝑐1 to other zones. At 𝑡 = 30 the two curves (i.e. 𝑝1 and 𝑝2) reach the same probability and, after an oscillatory behaviour, they tend to stabilize: the configuration 𝐶 is no longer able to describe the observed situations. An anomalous situation of environmental occupation in the group of zones corresponding to 𝑐1 is detected.

Figure 5-13: Asynchronous event detection. Direction change events in 100 simulated situations have been evaluated. The figure shows average trends for the probabilities 𝑝𝑘, with time aligned to the events at 𝑡 = 0. At 𝑡 = 30 the two curves reach the same probability. The system is able to detect the direction change of the people from 𝑐1 to other zones.
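The crossing behaviour of the averaged probability curves can be sketched with a simple test; the synthetic curves below stand in for the 𝑝1 and 𝑝2 of Figure 5-13 and are not the measured data.

```python
def first_crossing(p1, p2):
    """First time index at which p2 overtakes p1: the point where the
    current configuration C stops describing the observed situation."""
    for t in range(1, min(len(p1), len(p2))):
        if p1[t - 1] > p2[t - 1] and p1[t] <= p2[t]:
            return t
    return None  # curves never cross: no direction-change event

# Synthetic curves: after the event at t = 0, p1 decays while p2 grows.
T = 60
p1 = [max(0.1, 0.9 - 0.015 * t) for t in range(T)]
p2 = [min(0.9, 0.1 + 0.015 * t) for t in range(T)]
print(first_crossing(p1, p2))  # → 27
```

Once the crossing is found, the system can raise the anomaly for the group of zones whose probability has collapsed.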


Chapter 6

Information bottleneck-based relevant

knowledge representation in large-scale

video surveillance systems

6.1 Introduction

Automatic representation, analysis and detection of abnormal events, as discussed in chap-

ter 4, are central issues for last generation video surveillance systems. In this context, an

interactive and intelligent system embedded in physical environments has been proposed in chapter 3. It can represent a breakthrough in the design of people-oriented services

applied to crowd monitoring in large-scale environments. Several works have addressed linking traditional computer vision tasks to context-aware capabilities such as scene

understanding, interaction analysis and recognition of possible threats or dangerous situa-

tions. Different features have been considered for automatic crowd analysis: local features

(e.g., features from accelerated segment test - FAST) can be used for people detection,

while optical flow efficiently estimates human motion [100]. Considering such features it

is possible to evaluate the density of a crowd ([40, 104]).


State-of-the-art video crowd analysis frameworks typically do not address two major problems that arise when a higher number of sensors is used: rigorous methods are needed for obtaining 1) an optimal information representation able to maintain the informativeness of the acquired low-level features, as well as 2) a compact description that reduces the processing complexity due to an ever increasing amount of information.

The problem of information overload can be avoided through an automatic method for

selecting subparts of the guarded environment and focusing operators’ attention on most

informative regions, such as the one proposed in chapter 5. In Figure 6-1.a an example

of relevant information extraction, which is defined as sparse information, is shown. The

main problem, for event detection and classification mechanisms, is related to the recon-

struction accuracy of original data from incomplete and limited observations. Typical tasks

in crowd monitoring applications consist in recognizing particular events within the crowd

itself, such as presence of crush in forbidden areas or suspicious movements. For instance,

in Figure 6-1.a two events of interest can be defined when the crowd flow crosses red and

brown lines, respectively. An approach based on Self Organizing Maps (SOMs) for re-

constructing observed signals is presented in [23]; more recently, a SOM-based algorithm

for defective image restoration is proposed [73]. In particular, it is highlighted how the

SOM-based method performances depend on Kohonen-layer size. Other artificial neural

networks derived from SOMs, such as Growing Neural Gas (GNG) (see [11]) and Growing

Hierarchical SOM (GH-SOM) (see [105]), can automatically adapt the dimension of the

Kohonen-layer. In more detail, GNG computes an accumulated local error, which represents a distance measurement between two neuronal weights, and increases the number of

neurons if this is considered too large. Similarly, in GH-SOM the increase of the number

of neurons and layers is based on distance measurements between neuronal weights and

input data. Another type of neural network (Neural Gas (NG)) can improve the input data

topology preservation through an adaptive method based on learning of neighbourhood re-

lationships between the weight vector (associated with a neuronal unit) and each external stimulus (associated with an input vector).

These mechanisms of adapting layer sizes and topology preservation are mainly addressed


towards original data reconstruction. The problem of recovering the signal from sparse

data requires more than just reconstruction accuracy: it is indeed necessary to preserve the

similarities between relevant information and original data (i.e., input signals). This SOM-

layer size optimization problem can be represented by means of a specific cost function,

which relies on Information Bottleneck theory [125].

(a) Relevant events. (b) Crowd density map.

Figure 6-1: Relevant information extraction and crowd density map estimation. a. The coloured circles specify important subparts of the scene; optical flow features represent sparse relevant information. b. Crowd density map estimation by Lucas-Kanade [72] optical flow features.

The contributions of this chapter are as follows. A novel approach is presented, based on

information bottleneck, for designing a cost function able to quantify the SOM trade-off be-

tween the capability to recover original signals and preserve statistical similarities between

sparse relevant information and original data. It will be shown how SOMs’ correlation abil-

ities can be measured through a mixture of local linear regressive models associated with each

neuron. Such models can be used for predicting the future values based on previous states.

Finally, by means of the proposed cost function, an algorithm is described for information

bottleneck-based SOM selection (IB-SOM). By means of the selection algorithm the Infor-

mation Refinement block can determine the optimal data representation in the SOM-space (Figure 6-2).

The proposed framework has been applied to the crowd monitoring domain for people density estimation and event recognition on real video sequences extracted from the public PETS database [37]. Moreover, the proposed approach is compared to other neural networks, such as


Figure 6-2: Information Refinement block for automatic relevant data representation selection.

NG, GNG and GH-SOM. The remainder of the chapter is organized as follows: in Section

6.2 the information bottleneck-based SOM selection for relevant knowledge representation

is presented; experimental results are described in Section 6.3.

6.2 Information bottleneck-based relevant information rep-

resentation

This section describes the proposed relevant information representation method applied to

video-surveillance. In communication system theory, encoding a time-varying multi-

dimensional signal 𝑋(𝑡) is a common approach for extracting relevant information from

it. The available information acquired from video-surveillance network can be defined as

the vector 𝑋 = [𝑋1, 𝑋2, · · · , 𝑋𝑁 ] where 𝑋 ∈ X with 𝑋 ∈ R𝑁 and 𝑋 is a sample of

𝑋(𝑡), which has been acquired at sampling time 𝑡. For crowd monitoring applications, 𝑋𝑖

represents people density in each 𝑖 − 𝑡ℎ monitored area with 𝑖 = 1, · · · , 𝑁 and 𝑁 is the


maximum number of guarded zones. Such a vector describes the crowd density map. It

is possible to define the relevant information X̃, extracted at time instant 𝑡 by using the attention focusing algorithm of chapter 5, as a subset of 𝑋: X̃ ⊆ 𝑋, where X̃ ∈ R𝑀 and 𝑀 ≤ 𝑁. Figure 6-3 shows how the relevant sparse information X̄ (i.e., the sparse crowd density map) can be reconstructed from X̃. The percentage of controlled area can be computed as the ratio between the number of significant values in X̄ and the total available information contained in 𝑋. The quantity X̂ ∈ R𝑁 is a sample of the reconstructed signal X̂ ≈ 𝑋. The SOM

projects input data (i.e., X) into reduced dimensionality space. The neural network has

the ability to semantically represent input vectors by associating similar but not necessarily identical crowd density maps with the same neuronal unit. Each neuron represents a

codeword associated to a prototype vector, i.e., a weight 𝑊𝑘 ∈ W with 𝑊𝑘 ∈ R𝑁 , where

𝑘 ∈ {1, · · · , 𝐾} and 𝐾 is the maximum number of prototype vectors within the SOM-layer.

The problem can be formalized as that of representing with the same best match unit (BMU)

𝑘 the vector 𝑋 and the corresponding sparse vector X̄, as shown in Figure 6-3. The cen-

tral task is here to establish the firing properties of neuronal patterns by balancing recon-

struction capabilities and correlation properties, between sparse information and neurons.

To this end, a new variable Y has been defined in order to quantify the differences between

sparse and original signals. Such a variable is as informative as possible of X. Reconstruc-

tion and correlation attributes lead to the information bottleneck concept, which is defined

as a trade-off between two average mutual information (AMI) quantities, 𝐼(𝑊,𝑋) and 𝐼(𝑊,𝑌 ).

The first term 𝐼(𝑊,𝑋) denotes the reconstruction measurement; according to the Rate

Distortion theory, it should be minimized depending on the allowed distortion 𝑑𝑊 intro-

duced by the mapping process. Such a distortion is measured by the conditional entropy

𝐻(𝑋|𝑊 ): more details are given in [21]. The second term 𝐼(𝑊,𝑌 ) represents the cor-

relation measurement between sparse and original data, which should be maximized. A

practical measure of the correlation is proposed as the difference between statistical rela-

tionships of data, described by 𝑝(𝑌,𝑋), and neuronal unit correlation capabilities which

are defined by 𝑝(𝑌,𝑊 ), where 𝑝(𝑌,𝑋) and 𝑝(𝑌,𝑊 ) are two joint probability distributions.


Figure 6-3: Relevant information extraction and crowd density map projections into the SOM-space. The grid cells, laid over the image plane, can be seen as the set of controlled areas. Each cell is associated with a number of features extracted by Lucas-Kanade optical flow and used for estimating the crowd density map. In this example 𝑋 ∈ R20, X̃ ∈ R5 and the sparse vector X̄ ∈ R20. 𝑋 and X̄ are mapped into the same unit 16. The percentage of controlled area corresponds to 40% (i.e. 2/5).


The quantities 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋) can be estimated by using the SOMs for dividing the

set of input data (i.e., X) into 𝐾 different multivariate time series X1, · · · , X𝐾, where X𝑘 = {𝑋1,𝑘, · · · , 𝑋𝑛,𝑘} is associated with the 𝑘-th neuron, such that X𝑘 ∩ X𝑗 = ∅ for 𝑘 ≠ 𝑗 and ⋃𝐾𝑘=1 X𝑘 = X [94]. These sub-sequences of vectors can be modelled by local Vector

Auto Regressive (VAR) models [101]. The number of generated VAR models corresponds

to the number of neurons of the SOM. Considering a multivariate time series X𝑘, an auto regressive model of order 𝑚, denoted as 𝑉𝐴𝑅(𝑚), describes the 𝑖-th vector 𝑋𝑖,𝑘 as a linear combination of the previous state vectors:

𝑋𝑖,𝑘 = Φ0 + Φ1𝑋𝑖−1,𝑘 + Φ2𝑋𝑖−2,𝑘 + · · · + Φ𝑚𝑋𝑖−𝑚,𝑘 + 𝜖𝑖,𝑘, (6.1)

where Φ0, · · · ,Φ𝑚 are (𝑁 × 𝑁) parameter matrices and 𝜖𝑖,𝑘 represents an (𝑁 × 1) Gaussian

noise. From each multivariate time series X𝑘 a 𝑉𝐴𝑅(2) model has been estimated, 𝑋𝑖,𝑘 = Φ0 +

Φ1𝑋𝑖−1,𝑘 + Φ2𝑋𝑖−2,𝑘 + 𝜖𝑖,𝑘, where Φ0, Φ1 and Φ2 are estimated coefficient matrices which

are stored in each SOM node. In order to determine the fitting of the data to the 𝑉 𝐴𝑅(2)

models, error terms are estimated as follows: 𝜖𝑖,𝑘 = 𝑋𝑖,𝑘 − [Φ0 + Φ1𝑋𝑖−1,𝑘 + Φ2𝑋𝑖−2,𝑘].

The error vector associated with each neuron is denoted by 𝜖𝑘. The average of 𝜖𝑘 is denoted by 𝑌𝑘 = (1/𝑁) ∑𝑁𝑐=1 𝜖𝑘,𝑐, where 𝜖𝑘,𝑐 is the 𝑐-th component of 𝜖𝑘.
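The per-neuron VAR(2) fit and its residuals can be sketched as follows. This is a minimal least-squares estimate in Python; the function name, the toy series and the seed are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def fit_var2(X):
    """Least-squares fit of a VAR(2) model X_i = Phi0 + Phi1 X_{i-1} + Phi2 X_{i-2} + eps_i.

    X: (n, N) array of N-dimensional state vectors; returns (Phi0, Phi1, Phi2, residuals).
    """
    n, N = X.shape
    # Regressor matrix [1, X_{i-1}, X_{i-2}] for each predicted sample X_i, i = 2..n-1
    Z = np.hstack([np.ones((n - 2, 1)), X[1:-1], X[:-2]])
    B, *_ = np.linalg.lstsq(Z, X[2:], rcond=None)   # B has shape (1 + 2N, N)
    Phi0, Phi1, Phi2 = B[0], B[1:N + 1].T, B[N + 1:].T
    eps = X[2:] - Z @ B                             # fitting residuals eps_{i,k}
    return Phi0, Phi1, Phi2, eps

# Toy multivariate series standing in for one sub-sequence X_k of a neuron
rng = np.random.default_rng(0)
X = np.cumsum(rng.normal(size=(200, 4)), axis=0)
Phi0, Phi1, Phi2, eps = fit_var2(X)
Y = eps.mean(axis=1)    # samples of the residual average Y_k defined above
```

The coefficient matrices would be stored in the corresponding SOM node, and `eps` plays the role of the error vector 𝜖𝑘.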

It is supposed that 𝑝(𝑌,𝑊 ) = 𝒩 (0, 𝜎𝑌,𝑊 ) is the joint pdf between Y and W, where 𝜎𝑌,𝑊 = 𝐸[𝑌𝑘²] − 𝐸[𝑌𝑘]². For expressing the quantity 𝑝(𝑌,𝑋), it is sufficient to define a single 𝑉𝐴𝑅(2) model over all the input data X; the same Gaussian approach is then used for estimating 𝑝(𝑌,𝑋).

The optimal SOM-layer size can be obtained by minimizing the modified cost function

based on information bottleneck, as follows:

ℱ = min_{𝑝(𝑊,𝑋)} {𝐻(𝑋|𝑊 ) + 𝜆 𝐷𝐾𝐿[𝑝(𝑌,𝑊 )‖𝑝(𝑌,𝑋)]}; (6.2)

where Kullback-Leibler divergence 𝐷𝐾𝐿[𝑝(𝑌,𝑊 )‖𝑝(𝑌,𝑋)] is a measure of the difference

between the two joint probability distributions 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋). Figure 6-4 shows


Figure 6-4: Kullback-Leibler divergences 𝐷𝐾𝐿 for two Gaussian probability density functions 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋). The information lost is represented by a distance metric between 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋) (see Table 6.1).

that a larger SOM-layer (e.g., 𝐾 = 100 neurons) presents higher 𝐷𝐾𝐿 values (i.e., poorer correlation quality) than a smaller one (e.g., 𝐾 = 4 neurons).

It can be noticed that 𝐷𝐾𝐿 describes an effective distortion measurement. In Equation 6.2

the 𝜆 parameter was introduced, which can balance the information bottleneck. In particu-

lar, when 𝜆 → 0, the cost function privileges the reconstruction capabilities of the SOMs (i.e., larger SOM-layers will be selected). Vice versa, when 𝜆 → ∞, ℱ privileges the correlation properties of the SOMs (i.e., smaller SOM-layers will be selected).

6.3 Experimental results

This section is divided into two subparts: in the first part, SOM training and information bottleneck-based cost function evaluation are carried out on synthetic data. Then, the performance of the proposed IB-SOM selection algorithm is compared with that of other neural

networks (GH-SOM, GNG and NG), for crowding density reconstruction on real video

sequences.


Table 6.1: Cost function parameters. For evaluating 𝐷𝐾𝐿, two divergence-normalized density functions 𝑝(𝑌,𝑋) = 𝑁(0, 0.9996) and 𝑝(𝑌,𝑊𝐾) have been considered.

SOM       𝐻(𝑊|𝑋)   𝐷𝐾𝐿     𝑝(𝑌,𝑊)        𝑑𝑟𝑊     𝑑𝑝𝑊     𝜆 interval
2 × 2     2,38     0,005   𝑁(0, 0.94)    0,842   0,3201  𝜆 ∈ (1.28 ÷ ∞)
5 × 5     2,04     0,27    𝑁(0, 1.88)    0,623   0,4018  𝜆 ∈ (0.48 ÷ 1.28]
7 × 7     1,89     0,60    𝑁(0, 3)       0,417   0,53    𝜆 ∈ (0.32 ÷ 0.48]
10 × 10   1,78     0,96    𝑁(0, 4)       0,209   0,78    𝜆 ∈ (0 ÷ 0.32]

6.3.1 Training of SOMs and cost function evaluation on synthetic data

A common training set is generated by using a simulator where crowd behaviours are gener-

ated based on Social Forces model [18]. The simulator has the capability to add virtual sen-

sors able to acquire data coming from different subparts of the monitored scene. A virtual

image processing algorithm has been implemented for obtaining a plausible crowd density

map for each frame. The generated map is a 32 × 32 matrix, which corresponds to a vector

X with 𝑁 = 1024 components. Four different SOMs were used, with 𝐾 = 100, 49, 25, 4

neurons, respectively, and the following layer topologies: 10 × 10, 7 × 7, 5 × 5

and 2 × 2.

By using the common training set, SOMs and other neural networks have been trained.

Finally, the parameters for evaluating the normalized information bottleneck-based cost

function curves ℱ are determined (see Figure 6-5). In Table 6.1, 𝑑𝑟𝑊 and 𝑑𝑝𝑊 repre-

sent average reconstruction and prediction errors obtained for different value intervals of the 𝜆 parameter. 𝑑𝑟𝑊 is the average error (i.e., Euclidean distance) between the input data

X and its representation W. Each 𝑉 𝐴𝑅 model can be used as linear predictor filter.

The 𝑑𝑝𝑊 has been defined as an average measurement of the fit between the one-period-ahead forecast sequences X̂𝑘 (obtained by the local 𝑉𝐴𝑅(2) models) and the training data 𝑋𝑘:

𝑑𝑝𝑊 = ∑𝑛𝑖=1 ‖𝑋𝑖,𝑘 − X̂𝑖,𝑘‖ / ∑𝑛𝑖=1 ‖𝑋𝑖,𝑘 − 𝐸[𝑋𝑘]‖.
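As a sanity check on this definition, 𝑑𝑝𝑊 can be computed directly from a forecast sequence. The sketch below (the helper name is ours) confirms the two limiting cases: a perfect predictor yields 0 and the mean predictor yields 1:

```python
import numpy as np

def d_pw(X, X_hat):
    """Normalized one-step-ahead prediction error: ratio of the forecast error
    to the error of a trivial mean predictor (cf. the definition of d_pW)."""
    num = np.sum(np.linalg.norm(X - X_hat, axis=1))
    den = np.sum(np.linalg.norm(X - X.mean(axis=0), axis=1))
    return num / den

X = np.arange(12.0).reshape(4, 3)          # toy training sub-sequence X_k
perfect = d_pw(X, X.copy())                # forecast equals the data -> 0
trivial = d_pw(X, np.tile(X.mean(axis=0), (4, 1)))  # mean predictor -> 1
```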


Figure 6-5: Normalized cost function average trends for different SOMs. The experiments have been conducted on 100 sequences of synthetic data provided by the simulator. The total time of each simulation is 1000 [s]. The validity regions are defined by intervals of 𝜆.

For small 𝜆 values, such as (0 ÷ 0.32], the minimum of ℱ is given by the 10 × 10 SOM (i.e., the reconstruction capabilities will be preserved and 𝑑𝑝𝑊 > 𝑑𝑟𝑊 ). Vice versa, for higher 𝜆 values, such as those greater than 1.28, the minimum of ℱ is given by the 2 × 2 SOM (i.e., the correlation properties will be maintained and 𝑑𝑟𝑊 > 𝑑𝑝𝑊 ).

6.3.2 Crowd density reconstruction on real video sequences

An experiment has been conducted on three available video sequences from the PETS dataset for a single camera (S1 L2 Time 14 : 06 and 14 : 31, S3 High Level Time 14 : 33 View_0001; the sequence lengths are 200, 130 and 377 frames respectively, and the frame rate is ∼ 7 [fps]). The

information bottleneck theory is adopted as a practical strategy for optimal SOM selec-

tion; by using this approach it is possible to limit the reconstruction error by varying the

percentages of controlled areas.

Under the hypothesis that the data are acquired and processed at the same PETS sequence


Table 6.2: Comparisons between the proposed IB-SOM selection and other neural networks. Results are presented for different percentages of controlled areas. In the table normalized reconstruction errors are shown. For IB-SOM both reconstruction 𝑑𝑟𝑊𝐾 and prediction 𝑑𝑝𝑊𝐾 errors have been computed. The last two rows show the averages and the variances of the errors.

           IB-SOM              GH-SOM   GNG      NG100    NG49     NG4
           𝑑𝑟𝑊 | 𝑑𝑝𝑊           𝑑𝑟𝑊      𝑑𝑟𝑊      𝑑𝑟𝑊      𝑑𝑟𝑊      𝑑𝑟𝑊
100%       0,331 | 0,841       0,403    0,396    0,234    0,409    0,754
80%        0,343 | 0,5743      0,409    0,408    0,335    0,451    0,754
60%        0,355 | 0,5006      0,423    0,412    0,391    0,511    0,811
40%        0,431 | 0,4089      0,685    0,639    0,588    0,533    0,853
Average    0,365 | 0,5412      0,480    0,463    0,387    0,476    0,793
Variance   0,0015 | 0,0099     0,0140   0,0102   0,0166   0,0023   0,0017

frame rates, the 𝜆 parameter (see Equation 6.2) can be defined as follows: 𝜆 ∝ 𝑑(𝑋,𝑊𝑘)|𝑝(𝑊,𝑋).

Such a value can automatically balance the bottleneck through a distortion 𝑑(·) (i.e., Eu-

clidean distance), which is due to mapping process 𝑝(𝑊,𝑋), between observed vector 𝑋

and its representation 𝑊𝑘. In particular, when 𝑑(𝑋,𝑊𝑘) is low, the reconstruction capabili-

ties will be preserved (i.e., larger SOM-layers will be selected). Vice versa, when 𝑑(𝑋,𝑊𝑘)

is high, the correlation properties of the SOMs will be maintained (i.e., smaller SOM-layers

will be selected).

Table 6.2 shows how NG100 has the minimum reconstruction errors at the 100% and 80% controlled area percentages. Vice versa, when the controlled area percentages decrease (i.e., 60% and 40%), the distortions of this neural network increase. In these situations, the proposed method can find the optimal SOM size. IB-SOM selection limits the reduction in reconstruction accuracy, i.e., it is able to maintain a low average reconstruction error. Moreover, the proposed approach limits the error variations due to the different percentages of controlled areas (i.e., the error variance).


Figure 6-6: Qualitative and quantitative results for event recognition on PETS sequence S1 L2 Time 14 : 06 using 40% of the controlled area. The figure shows the comparison between the proposed method and other approaches: the upper part presents crowd density map reconstructions; the lower part shows distortion curves. The whole video is available at https://www.youtube.com/watch?v=KN2aYZ64TTw.


On the other hand, when the SOM-map sizes are reduced the prediction errors decrease as

well. Finally, in Figure 6-6 quantitative and qualitative comparison measurements for event recognition are presented. In particular, even when using limited observations, the proposed IB-SOM presents smaller distortion errors. The density map reconstructions show how all

the neural networks can identify the first event, while only through the proposed approach

it is possible to recognize the second event.


Chapter 7

A bio-inspired logical process for

information recovery in cognitive crowd

monitoring

7.1 Introduction

There is currently a great interest in methodologies which combine computer vision tasks

with high-level situation awareness functionalities for environment monitoring. Techniques

such as behaviour analysis, event detection, or recognition of possibly dangerous situations

can have a high impact in different sectors, ranging from social (e.g. security-related) to

technological ones (e.g. resources management). In fact, several works in the literature try

to link physical and social aspects to video surveillance tasks [85]. The main and common

problem is that they require monitoring of events in wide video surveillance networks for

long periods of time [92]. In order to detect people, local features can be used (e.g., fea-

tures from accelerated segment test - FAST [109]), while optical flow efficiently estimates

human motion [100]. Considering such features, in [81], a people-trajectories-based social

force model has been proposed for describing interactions among the individual members


of a group of people. These techniques often require a large number of networked sen-

sors for acquiring data. Adaptable intelligent systems can be exploited in order to balance

the amount of employed resources and sensory relevant information. Human perception

mechanisms for data processing, integrated in artificial systems, represent a breakthrough

for designing a new generation of algorithms inspired by neuroscience. Such frameworks

can optimize resource management and improve camera identification by acquiring (i.e.

perceiving) relevant information to be combined with inductive reasoning, and machine

learning techniques. The main concept in cognitive neuroscience, discussed in chapter 2,

is the Perception Action Cycle (PAC) also referred to as the Fuster’s Paradigm, which is

based on five fundamental building blocks, namely perception, memory, attention, intelli-

gence and language [45]. The PAC describes how an entity can perceive, learn and change

itself through continuous interaction experiences with the external world. In particular the

perception and attention blocks are meant to optimize the information flow within the sys-

tem. In such a context, Haykin et al. in [3] have presented an algorithmic implementation of

the human perceptual attention process in order to separate relevant from irrelevant information. The challenge of this scheme is to bridge the gap between cognitive science and engineering in order to design intelligent systems called Cognitive Dynamic Systems (CDS), originally formalized by Haykin in the cognitive radio domain [56] and later generalized in [53]. Only recently has the scheme been applied in the crowd monitoring field [22]. In

the video surveillance domain many frameworks deal with relevance or saliency extraction from video sequences. In [74], by using a saliency detector based on a center-surround

discriminant descriptor (i.e. colour, intensity, orientation, scale), an anomaly detection and

localization method for crowded environments has been proposed. Other frameworks for saliency ex-

traction employ the Histogram of Oriented Gradient (HOG) combined with Support Vector

Machine (SVM) classifier for pedestrian detection [27] and for people counting [130]. In

this chapter an adaptive inductive reasoning mechanism for saliency extraction from a dis-

tributed camera sensors network is proposed. The common goal of the saliency detection

frameworks, for video surveillance systems, is limited to extract the relevance from all the

environment parts where, for example the presence of the people has been detected. In our

method the relevant information, which is identified directly from environmental observa-


tions, is defined through the learning of the relationships among sensory data. Unlike other

methods for relevance detection, the proposed algorithm shows how, by using a neural network, it is possible to learn the data correlations. Moreover, it is shown how, through such a procedure, it is possible to reduce the number of environment parts to observe; using such limited observations, the proposed system can reconstruct the whole information. The module devoted to this aim is shown in Figure 7-1.

Figure 7-1: Reconstruction block for automatic recovering of the whole information.

The presented system works at two different logic levels: first, by means of a Self Orga-

nizing Map [62] (SOM) (a well known neural network), it is able to learn the correlation

of the salient observed data. Secondly, the framework, relying on an adaptive SOM-based

inductive reasoning process, can select the optimal subset of sensors for recovering the

whole information (i.e. acquired relevant and induced redundant information) while allowing for a maximum distortion. The remainder of this chapter is organised as follows. Section 7.2

presents the proposed bio-inspired inductive reasoning model based on SOMs, for extract-

ing relevant data and recovering complete information. In order to evaluate the information

recovering capabilities of the proposed scheme, such a model has been evaluated by using real video sequences from the PETS database [37]. Experimental results are shown in


section 7.3, while conclusions are drawn in chapter 8 section 8.4.

7.2 Data correlation representation for bio-inspired induc-

tive reasoning algorithm

Inductive reasoning is a mental process which permits drawing a conclusion according to pre-defined models, which, in this application, describe the relations derived from the acquired camera sensor information.

For the sake of simplicity, the problem can be formulated as follows: given an incomplete set of observations, this chapter proposes a methodology that can estimate the missing data.

The problem of finding the most probable spatial configuration of the crowd (i.e. number

of people in different areas) has a strong analogy with the statistical physics problem

of determining the equilibrium allocations of a number of particles in a set of energy lev-

els under the constraint of total energy conservation. The analogy between Physics and

other application fields has been already explored for describing the dynamic movements

of workers among various economy sectors [110]. Such a situation is as close as possible to the movements of people across the different zones of a monitored environment. Scalas et al. propose a solution for the optimal occupation of economy sectors that relies on a generalization of the Bose-Einstein distribution. The crowd configurations are dif-

ficult to model in accordance to a given distribution function because they generally have a

high variability, which depends on the topology of the environments. For an autonomous video surveillance system it is more suitable to learn in advance the crowding levels of the specific monitored areas. This chapter presents an adaptive approach relying on artificial neural networks, which permits learning a set of possible configurations directly from environmental observations (training of the neural networks). Furthermore, by the proposed bio-inspired mechanism called inductive reasoning it is possible to recover missing data.

This section proposes a SOM-based data correlation modelling approach for an inductive reasoning


algorithm for relevant information representation. In particular we suppose to apply it to

the crowd monitoring domain. In Figure 7-2.a an example of information extraction from

a crowded environment is shown. The controlled area can be divided into sub-parts and by

employing tools such as Lucas-Kanade optical flow [72] it is possible to quantify the density of the people within each cell. In this way, a crowd density map (Figure 7-2.c) can be

sketched. Basically, this map can describe the people concentration (i.e. number of people)

within the scene.

(a) Information extraction. (b) Controlled areas. (c) Crowd density map.

Figure 7-2: Information extraction and crowd density map estimation.

Let us define the available information X, acquired from a video surveillance network, as

the set of crowding state vectors

𝑋 = [𝑋1, 𝑋2, · · · , 𝑋𝑁 ] where 𝑋 ∈ R𝑁 . (7.1)

𝑋 is a sample of the signal 𝑋(𝑡) ∈ X, which has been acquired at sampling time instant

𝑡. 𝑋𝑖 represents the people number estimation, provided for instance by a people counter

embedded on cameras, in each 𝑖𝑡ℎ monitored area with 𝑖 = 1, · · · , 𝑁 . 𝑁 also represents

the maximum number of sensors, as we assume one camera only for each controlled area.

During a training phase, a SOM provides an adaptive mechanism for acquiring different

correlation models, according to similar dynamical evolution of people captured by neu-

rons. The SOM maps the data (i.e. 𝑋) into a dimensionally reduced space. It provides

a semantic representation of the input vectors, by associating similar (though not neces-

sary identical) crowding states to the same neuronal unit. Each neuron represents a codeword, i.e. a prototype (or weight) vector 𝑊𝑘 = [(𝑊𝑘)1, (𝑊𝑘)2, · · · , (𝑊𝑘)𝑁 ] with 𝑊𝑘 ∈ R𝑁,


where 𝑘 ∈ {1, ..., 𝐾} and 𝐾 is the maximum number of prototype vectors within the neu-

ronal layer (i.e. the number of neurons in the SOM). The external stimuli (i.e. observed

state vectors) activate specific neuronal units in the SOM-map. Therefore, the SOM can

be used for dividing the training data X into 𝐾 different multivariate sets X1, · · · , X𝐾, where X𝑘 = {𝑋1,𝑘, · · · , 𝑋𝑛,𝑘} is associated with the 𝑘-th neuron, and these sets form a partition of X, i.e. X𝑘 ∩ X𝑧 = ∅ ∀𝑘 ≠ 𝑧 and ⋃𝐾𝑘=1 X𝑘 = X. By examining the possible relationships

among these sub-sequences of training samples, the system can build correlation structures

𝐶𝑘 (i.e. correlation matrices) embedded in each SOM-neuron. The relative effects of the

number of people in the 𝑖-th monitored area on the people in the 𝑗-th zone is stored in the correlation coefficient 𝑐𝑘(𝑖, 𝑗).
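The BMU-based partition and the per-neuron correlation structures 𝐶𝑘 can be sketched as follows; this is a minimal Python illustration, where the function names and the toy data are our assumptions:

```python
import numpy as np

def partition_by_bmu(X, W):
    """Split training vectors X (n, N) among K prototypes W (K, N) by the
    nearest-prototype (BMU) rule; returns the K sub-sets X_k."""
    bmu = np.argmin(np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2), axis=1)
    return [X[bmu == k] for k in range(len(W))]

def correlation_structures(X, W):
    """Correlation matrix C_k (coefficients c_k(i, j)) for each neuron's sub-set."""
    return [np.corrcoef(Xk, rowvar=False) if len(Xk) > 1 else None
            for Xk in partition_by_bmu(X, W)]

# Toy data: two clusters of crowding states around two hypothetical prototypes
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(10.0, 1.0, (50, 2))])
W = np.array([[0.0, 0.0], [10.0, 10.0]])
C = correlation_structures(X, W)
```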

By means of a saliency masking vector 𝑀, whose entries are zeroes or ones, it is possible to hide (i.e. mask) some components of 𝑋, defining a new vector X̄ ∈ R𝑁. The 𝑖-th component of X̄ is X̄𝑖 = 𝑀𝑖 · 𝑋𝑖, 𝑖 = 1, ..., 𝑁. Defined this way, X̄ retains the components of 𝑋 where 𝑀𝑖 = 1 and is set to zero elsewhere; it is extracted at the same time instant 𝑡 and highlights the monitored areas. It is to be noted that the distance used for the matching process (or mapping process) is not the Euclidean distance in this case, but its “masked” version:

𝑑(X̄,𝑊𝑘,𝑀) = √( ∑𝑁𝑖=1 [X̄𝑖 − (𝑊𝑘)𝑖]² · 𝑀𝑖 ). (7.2)
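The masked matching of Equation 7.2 can be sketched as follows; only observed components (𝑀𝑖 = 1) contribute to the BMU search. The prototypes and data below are hypothetical:

```python
import numpy as np

def masked_distance(X_bar, W_k, M):
    """Masked Euclidean distance of Equation 7.2: unobserved components
    (M_i = 0) are excluded from the sum."""
    return np.sqrt(np.sum(((X_bar - W_k) ** 2) * M))

def masked_bmu(X, M, W):
    """BMU index for the masked observation X_bar = M * X over prototypes W (K, N)."""
    X_bar = M * X
    return int(np.argmin([masked_distance(X_bar, Wk, M) for Wk in W]))

# Two hypothetical 3-dimensional prototypes; the third zone is not observed
W = np.array([[1.0, 1.0, 1.0], [5.0, 5.0, 5.0]])
X = np.array([5.0, 5.0, 0.0])
M = np.array([1.0, 1.0, 0.0])
```

With a full mask (all ones) the measure reduces to the ordinary Euclidean distance, so the standard SOM mapping is recovered as a special case.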

The role of the proposed inductive reasoning algorithm is to infer the number of people inside the non-observed zones. This is accomplished by minimizing the following cost function:

𝐹(X̂) = [X̂ − X̄]² + [(X̂ − 𝑊𝑘) − 𝜇𝑘 · (X̄ − 𝑊𝑘)]², (7.3)

where X̂ is the state vector to be estimated and 𝑊𝑘 is the SOM prototype given by the SOM mapping process: X̄ ↦→ 𝑊𝑘.

For each couple of observed and non-observed areas (i.e. the 𝑖-th and 𝑗-th areas, with observed number of people X̄𝑖 and estimated number X̂𝑗), it is possible to define an influence parameter 𝜇𝑘(𝑖, 𝑗) = 𝑐𝑘(𝑖, 𝑗) · 𝜎𝑗/𝜎𝑖. Such a parameter takes into consideration the cross


correlation 𝑐𝑘(𝑖, 𝑗) between the zones and their standard deviations (SDs) 𝜎𝑖 and 𝜎𝑗 which

are calculated during training over the set X𝑘, where 𝑘 is the index of the Best Matching

Unit (BMU) of the SOM (i.e. activated neuron), which has weight 𝑊𝑘.

Data: X̄, 𝑊𝑘 and 𝑀
Result: local minimum X̂
Initialization: ∀𝑖 X̂_𝑖^old = X̄𝑖, ∀𝑗 X̂_𝑗^old = (𝑊𝑘)𝑗, 𝑑𝑚𝑖𝑛, 𝑚𝑎𝑥𝑖𝑛𝑡, 𝛼;
Definition: 𝑓′2(X̂𝑗) = 2{[X̂𝑗 − (𝑊𝑘)𝑗] − 𝜇𝑘(𝑖, 𝑗) · [X̄𝑖 − (𝑊𝑘)𝑖]};
for 𝑗 = 1 : 𝑁 do
    if 𝑀𝑗 = 0, with 𝑀𝑗 ∈ 𝑀, then
        while 𝑑𝑥 > 𝑑𝑚𝑖𝑛 and 𝑛𝑖𝑛𝑡 ≤ 𝑚𝑎𝑥𝑖𝑛𝑡 do
            X̂_𝑗^new = X̂_𝑗^old − 𝛼 · 𝑓′2(X̂_𝑗^old);
            𝑑𝑥 = ‖X̂_𝑗^new − X̂_𝑗^old‖;
            X̂_𝑗^old = X̂_𝑗^new;
            𝑛𝑖𝑛𝑡++;
        end
    end
end
Algorithm 2: SOM-based inductive reasoning algorithm. In the algorithm the SOM is represented by the weight 𝑊𝑘 of the BMU.

The cost function 𝐹 presents two terms: the first, 𝑓1(X̂) = [X̂ − X̄]², is clearly minimized by X̂𝑖 = X̄𝑖, i.e. each observed component is set equal to the observation. The second term, 𝑓2(X̂) = [(X̂ − 𝑊𝑘) − 𝜇𝑘 · (X̄ − 𝑊𝑘)]², represents the objective function introduced for estimating the number of people within each non-observed component (i.e. the 𝑗-th) from the data acquired in the observed areas (i.e. the 𝑖-th). The minimization of the second term is an optimization problem. The fundamental idea behind the proposed method is to use a gradient descent approach in order to find a local minimum of the function 𝑓2(X̂𝑗). The slope of 𝑓2(X̂𝑗) is equal to zero if 𝑓′2(X̂𝑗) = 𝑑𝑓2(X̂𝑗)/𝑑X̂𝑗 = 2{[X̂𝑗 − (𝑊𝑘)𝑗] − 𝜇𝑘(𝑖, 𝑗) · [X̄𝑖 − (𝑊𝑘)𝑖]} = 0. The solution is X̂𝑗 = 𝜇𝑘(𝑖, 𝑗) · [X̄𝑖 − (𝑊𝑘)𝑖] + (𝑊𝑘)𝑗, where [X̄𝑖 − (𝑊𝑘)𝑖] is the variation of the number of people in the observed area. Such a variability influences the estimated value of the state vector through the weight parameter 𝜇𝑘(𝑖, 𝑗), which introduces the correlation with neighbouring locations. In Algorithm 2 the

pseudo code of the proposed method is presented. The algorithm parameters are: the precision value 𝑑𝑚𝑖𝑛, which defines a stop criterion based on a minimum update threshold; 𝑚𝑎𝑥𝑖𝑛𝑡, which defines the maximum number of allowed iterations; and 𝛼, which specifies the step size weighting the gradient of the cost function.
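Under stated assumptions (a single observed index 𝑖 driving each masked component, and the influence matrix 𝜇𝑘 precomputed as 𝑐𝑘(𝑖, 𝑗) · 𝜎𝑗/𝜎𝑖), the gradient descent of Algorithm 2 can be sketched in Python. This is our illustrative reading of the pseudo code, not the thesis implementation:

```python
import numpy as np

def induce_missing(X_bar, W_k, M, mu_k, i_obs, d_min=0.01, max_int=60, alpha=0.1):
    """Gradient-descent sketch of Algorithm 2: estimate the masked components.

    mu_k[i, j] is the influence of observed zone i on zone j; i_obs is the
    single observed index assumed to drive each masked component.
    """
    # Observed components keep X_bar; masked ones start from the BMU prototype
    X_hat = np.where(M == 1, X_bar, W_k).astype(float)
    for j in np.flatnonzero(M == 0):
        dx, n_int = np.inf, 0
        while dx > d_min and n_int <= max_int:
            # f2'(x_j) = 2 * {[x_j - (W_k)_j] - mu_k(i,j) * [X_bar_i - (W_k)_i]}
            grad = 2.0 * ((X_hat[j] - W_k[j])
                          - mu_k[i_obs, j] * (X_bar[i_obs] - W_k[i_obs]))
            new = X_hat[j] - alpha * grad
            dx, X_hat[j], n_int = abs(new - X_hat[j]), new, n_int + 1
    return X_hat

# Toy case: zone 0 observed (4 people), zones 1-2 masked, uniform influence 0.5
W_k = np.array([2.0, 2.0, 2.0])
X_bar = np.array([4.0, 0.0, 0.0])
M = np.array([1, 0, 0])
mu = np.full((3, 3), 0.5)
X_hat = induce_missing(X_bar, W_k, M, mu, i_obs=0)
```

Each masked component converges toward the closed-form solution derived above, 𝜇𝑘(𝑖, 𝑗) · [X̄𝑖 − (𝑊𝑘)𝑖] + (𝑊𝑘)𝑗 (here 0.5 · 2 + 2 = 3), up to the stopping tolerance.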

7.3 Experimental results

In this section the performance of the proposed inductive reasoning algorithm has been

analysed for data recovery. In order to give consistency to the proposed approach, an

experiment has been conducted on three available video sequences from the “Performance

Evaluation of Tracking and Surveillance" (PETS) workshop dataset for single camera [37].

According to the main purpose of this work, performances of the method are evaluated

for selecting the optimal subset of sensors in real crowding scenes, in order to optimally

recover the whole information (i.e. acquired relevant and induced redundant information).

In both cases, five different SOMs were employed, with 𝐾 = 25, 16, 9, 4, 1 neuronal units

respectively, and the following layer topologies: 5×5, 4×4, 3×3 and 2×2. The parameters

presented in the algorithm 2 are set as follows: 𝑑𝑚𝑖𝑛 = 0.01, 𝑚𝑎𝑥𝑖𝑛𝑡 = 60, 𝛼 = 0.1.

Finally, the proposed SOM-based inductive reasoning algorithm has been compared with the standard SOM capability to recover defective values, where restoration is based on a rough representation of the lost data by the value provided by the prototype or by the “temporal influence" of the BMU, as defined in [60]. That method has been applied to the reconstruction of areas covered by clouds in a time sequence of optical satellite images. We will refer to it as the SOM-BMU method (table 7.1).

The main purpose of this experiment is to demonstrate how the system is able to dynamically recover the whole information (i.e. the crowd density map) using a limited set of observations. In order to select the groups of zones to observe, we use the correlation measures among neighbouring locations. The correlations among the zones have been computed over a training set, and regression analysis quantifies the relationships between the variables. Figure 7-3 shows the correlations between two pairs of locations: for distant zones the correlation is lower than for close cells. These facts can be analysed through the fitting error of the data on the fit-line (i.e. $p(x) = p_1 x + p_2$, where $p_1$ and $p_2$ are the two coefficients of the first-order polynomial curve $p(x)$).
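The fit-line and correlation analysis can be sketched as below. The people counts are synthetic, illustrative data (in the thesis they come from Lucas-Kanade feature counts per cell); the zone names are hypothetical.

```python
import numpy as np

# Synthetic per-frame people counts for a reference zone over a training set.
rng = np.random.default_rng(0)
ref_zone = rng.poisson(10, 200).astype(float)
close_zone = ref_zone + rng.normal(0.0, 1.0, 200)      # strongly correlated cell
distant_zone = rng.poisson(10, 200).astype(float)       # weakly correlated cell

corrs = {}
for name, z in [("close", close_zone), ("distant", distant_zone)]:
    p1, p2 = np.polyfit(ref_zone, z, 1)      # fit-line p(x) = p1*x + p2
    residual = z - (p1 * ref_zone + p2)      # fitting error on the fit-line
    corrs[name] = np.corrcoef(ref_zone, z)[0, 1]
    print(f"{name}: correlation={corrs[name]:.2f}, "
          f"RMS fit error={residual.std():.2f}")
```

As in Figure 7-3, the distant cell shows a markedly lower correlation (and larger fitting error) than the close one, which is what drives the ordering of the zones.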


S1 L2, Time 14-06, 40% of data loss
                 SOM-1   SOM-4   SOM-9   SOM-16  SOM-25
  MSE            0.911   0.845   0.727   0.651   0.455
  Cost Function  0.850   0.731   0.698   0.602   0.530
  SOM-BMU        0.945   0.596   0.581   0.546   0.559

S1 L2, Time 14-31, 40% of data loss
                 SOM-1   SOM-4   SOM-9   SOM-16  SOM-25
  MSE            0.908   0.826   0.802   0.688   0.501
  Cost Function  0.899   0.831   0.788   0.762   0.677
  SOM-BMU        0.901   0.663   0.645   0.639   0.630

S3, Time 14-33, 40% of data loss
                 SOM-1   SOM-4   SOM-9   SOM-16  SOM-25
  MSE            0.853   0.783   0.713   0.694   0.655
  Cost Function  0.885   0.811   0.806   0.787   0.706
  SOM-BMU        0.891   0.799   0.735   0.714   0.701

Table 7.1: Quantitative results for PETS video sequences. It is important to note that the minimum value of the Cost Function, proposed in this work, corresponds to the minimum value of the MSE.

Figure 7-3: Example of correlation analysis between two pairs of locations. The data are represented by the number of features extracted by Lucas-Kanade optical flow associated with each cell.

The areas have been ordered on the basis of their correlations. The three most correlated zones (i.e. attentional resources) have been chosen in proximity to the entry and exit points (the red circles in figure 7-3). Based on a predefined data loss percentage of 40%, we have employed the focusing attention method [14]. Such a method, based on three attentional resources, is able to generate groups of zones which represent the most informative regions for scene reconstruction. Figure 7-4 shows an example of how the method works: the set of sparse data is acquired in proximity to the entry and exit points. This test has been divided into two sub-steps: training and testing. A common training set is provided by using the simulator where crowd behaviours are generated. Furthermore, a virtual image processing algorithm has been implemented to obtain a plausible crowd density map for each frame [20]. The generated map is a 16×16 matrix, which corresponds to a vector $X$ with $N = 256$ components. The same five SOMs as before were used. In the test phase the vector of the reduced observations $\tilde{X}$ is the input of the “SOMs stack". Through the neuronal activation process, each SOM-$K$ provides a reconstructed crowding density vector $\hat{X}_K$ with a corresponding value of the cost function $F_K(\hat{X}_K)$ (please refer to equation (7.3) and Algorithm 2). The main objective is to minimize the difference, for instance the Minimum Square Error (MSE), between the recovered data and the original signal, i.e. $MSE(X, \hat{X})$, where $\hat{X}$ is the reconstructed vector. In table 7.1 quantitative results of the reconstruction are presented. In this case SOM-25 provides the lowest average cost function value, which corresponds to the minimum (normalized) reconstruction error (MSE). The optimum recovered vector can be dynamically determined by comparing at each frame $t$ the five values of the objective functions (i.e. cost functions) as follows:
$$\hat{X} = \hat{X}_{\hat{K}}, \qquad \hat{K} = \arg\min_{K} F_K(\hat{X}_K). \qquad (7.4)$$
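The per-frame layer selection of equation (7.4) amounts to an arg-min over the five cost values. A minimal sketch, with an illustrative interface in which each SOM layer is a callable returning its reconstruction and its cost:

```python
import numpy as np

def select_optimal_som(x_reduced, soms):
    """Pick the SOM layer whose reconstruction minimises the cost
    function F_K, as in equation (7.4). `soms` is an assumed interface:
    a list of callables returning (x_hat_K, F_K(x_hat_K))."""
    results = [som(x_reduced) for som in soms]
    k_star = int(np.argmin([cost for _, cost in results]))  # K_hat
    return results[k_star][0], k_star                       # X_hat, K_hat

# Dummy stand-ins for the five SOM layers; real layers would perform the
# BMU lookup and the gradient-descent refinement of Algorithm 2.
soms = [lambda x, c=c: (np.full_like(x, c), c)
        for c in (0.85, 0.73, 0.70, 0.60, 0.53)]
x_hat, k_star = select_optimal_som(np.zeros(4), soms)
# k_star indexes the layer with the minimum cost (here the last one)
```

Because the selection is repeated at every frame, different layers can win at different times, which is exactly the dynamical behaviour shown in Figure 7-5.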

Figure 7-4: Salient information extraction and crowding density map reconstruction. The sparse vector is projected into the SOMs space (i.e. the “SOMs stack"). The grid cells, laid over the image plane, can be seen as the set of controlled areas. Each cell is associated with a number of features extracted by Lucas-Kanade optical flow and used for estimating the crowd density map. The percentage of controlled area corresponds to 40%. In this example $X \in \mathbb{R}^{20}$ and the vector of the reduced observations $\tilde{X} \in \mathbb{R}^{20}$. $\tilde{X}$ is mapped into different units of the SOMs. By comparing the cost functions (see equation (7.4)) it is possible to identify the optimal SOM (in this case SOM-9).

In Figure 7-5 qualitative results of the SOM selection process for crowd density map reconstruction are presented. In the proposed dynamical layer selection process, different SOMs can be chosen in order to provide the optimal information recovery. During the initialization phase, when the flow of people is outside the camera fields of view, the system acquires the data from all the zones. In this case the proposed framework evaluates the optimum SOM-$K$ based on the minimum MSE between the prototype $W_k$ of the BMU $k$ and the observed data $X$.

Figure 7-5: An example of qualitative and quantitative results for PETS sequence S1 L2, Time 14:06 (frames 56 and 140), using 40% of the controlled area. Up to the time of first people detection (i.e. the initialization phase, frame 44) the system acquires the data from all the zones.

Finally, Figure 7-6 shows the comparison of the proposed strategy (i.e. our method) for relevant information extraction with other approaches for detecting saliency in video sequences. In particular, the recovery algorithm has been tested using the saliency detected by the method proposed in [74], and using HOG descriptors combined with an SVM for pedestrian [27] and head detection. The results underline how the presented relevance extraction, based on the correlations among the parts of the environment, is able to drastically reduce the total amount of acquired information (34.11%) compared with saliency detection (74.95%) and pedestrian detection by HOG descriptors (70.18%). We point out that a higher percentage of crowd density saliency extraction is not always synonymous with better reconstruction: considering the extracted relevance provided by the head detection method (40.54%), the proposed system is able to better recover the whole density map (MSE 0.4015 vs 0.4168).
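The error figure used throughout this comparison is a mean square error normalized to 1. A minimal sketch of one plausible normalisation (by signal power; the exact normalisation used in the thesis is not spelled out, so this choice is an assumption):

```python
import numpy as np

def normalized_mse(x_true, x_rec):
    """MSE(X, X_hat) normalised by the power of the original signal,
    so that a perfect reconstruction scores 0 (illustrative choice)."""
    return float(np.mean((x_true - x_rec) ** 2) / np.mean(x_true ** 2))

x = np.ones(256)                      # original density vector (toy data)
err_perfect = normalized_mse(x, x)            # 0.0 for exact recovery
err_half = normalized_mse(x, 0.5 * x)         # 0.25 for a uniform 50% bias
```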


Figure 7-6: Reconstruction comparison for different saliency extraction methods: the saliency detection presented in [74]; HOG descriptors [27] combined with a learning-based SVM classifier for person (pedestrian) and head detection. The average mean square error normalized to 1 and the average percentage of the crowd density map acquired by the system (i.e. Extracted Relevance, ER) are shown.


Chapter 8

Conclusions and future developments

The Cognitive Dynamic System can be seen as a complex structure of multiple cooperating bio-inspired and modular elements, such as memory, attention, reasoning, planning and problem solving. These features represent the main innovation keys of such frameworks. The integration of different elements makes it possible to create systems which are able not only to take local decisions, for example for each element separately, but also to describe how such outcomes can influence other system processes. The main characteristic of biological entities is their decision-making mechanisms (i.e. planning and problem solving), which involve complicated cooperation among recursive and incremental cognitive processes. Cognitive Control is an executive function that manages these very complex mechanisms; specifically, it regulates the above-mentioned mental faculties (i.e. reasoning, attention, memory, planning, etc.). This process is the core and distinctive component, in relation to other mechanisms, in leading the development of innovative artificial cognitive systems. For this reason, both the design of the different modules (i.e. reasoning, attention, memory) and the definition of the information flows among the system elements are extremely important. The thesis explores a Cognitive Dynamic System addressed to crowd monitoring applications. The proposed system is a complex framework characterized by different parts, which process the information at different levels. In particular, each chapter of this work describes in depth the capabilities and applicability of the proposed innovative

techniques: the Analysis block, the Attention Focusing Module, and the Information Refinement and Reconstruction Modules. The combination and cooperation of such algorithms realize an automatic system called the Cognitive Surveillance Node (CSN), which is part of a complex cognitive JDL-based and bio-inspired architecture. By means of the Autobiographical Memory, the proposed CSN models the relationships between external (i.e. core-self) and internal (i.e. proto-self) information flows. The presented system also shows how to describe and model the relationships between the internal sub-systems. In particular, the Attention Module communicates with the memory. Through the Information Refinement it is possible to select the internal knowledge in order to obtain the optimum representation, according to the system task (e.g. event detection), of the relevant observed data. The thesis also proposes a mechanism, based on internal information flows, that emulates a specific human adaptive mental mechanism: inductive reasoning. Through a continuous exchange of information between the memory and the Reconstruction Module, the proposed algorithm can recover the whole data given a limited set of observations.

The scientific contributions of each technique proposed in the thesis are summarized below.

8.1 Analysis block

A bio-inspired structure was proposed for encoding and synthesizing signals to model cause-effect relationships between entities. In particular, the case where one of such entities is a human operator was studied. Interaction models are stored within an AM during a learning phase; knowledge is thus transferred from a human operator towards the CSN. Learned representations can be used, at different levels, either to support human decisions by detecting anomalous interaction models and thus compensating for human shortcomings, or, in an automatic decision scenario, to identify anomalous patterns and choose the best strategy to preserve the stability of the entire system. Results are shown in a simulated visual 3D environment in the context of crowd monitoring, where the simulated crowd is modeled according to the Social Forces Model. The results show two possible applications of the CSN for crowd surveillance: first, the system can support the operator in crowd management and people flow redirection by detecting drift from learned interaction models; secondly, it can work in automatic mode and thus autonomously detect anomalies in crowd behaviour. Furthermore, it has been shown that user-crowd interaction knowledge, learned from the simulator and modelled as proposed, is useful for detecting anomalies on real video sequences.

8.2 Attention Focusing Module

This thesis has presented a bio-inspired algorithm called Selective Attention Automatic Focus, which is part of a more complex cognitive architecture for crowd monitoring. The main objective of the proposed method is to extract the relevant information needed for crowd monitoring directly from the environmental observations. Experimental results are shown in a simulated visual 3D environment. The results show the possible applications of the proposed algorithm for focusing attention on densely populated areas and detecting anomalies in crowds, e.g. overcrowding. Future developments of this work will include investigation of more numerous and more complex simulated scenes. In addition, thanks to the scalability of the model based on topological maps, many other applications on real scenes can be explored, giving more substance to the abstract cognitive perception mechanism integrated in an intelligent system.

8.3 Information Refinement for relevant knowledge representation

This thesis presented a novel approach for information representation, applied to sparse data within a crowd monitoring application. The proposed algorithm encompasses different steps, involving the application of information theory and of neural networks such as SOMs. First of all, by means of the information bottleneck paradigm, a cost function has been designed to balance the data reconstruction and correlation capabilities of different SOMs. An information-bottleneck-based strategy for SOM selection was then proposed.
Finally, the IB-SOM selection method has been tested on public datasets. The results show that the proposed approach outperforms other neural networks (such as NG, GNG and GH-SOM) in crowd density reconstruction from very sparse observations.
Furthermore, it has been shown how such a knowledge representation can recover original crowding density maps in order to recognize particular events on real video sequences.

8.4 Adaptive reasoning algorithm for information reconstruction

This thesis has presented an adaptive inductive reasoning mechanism as an inductive logic framework for saliency (i.e. relevant information) extraction. Such salient information has been defined as the optimal subset of observations (e.g. camera sensors for crowd monitoring, or pixel values for images) which is able to reconstruct the lost information, which is therefore “redundant" for that specific model. The main goal of the proposed framework is to adapt the inductive reasoning mechanism and therefore to emphasize the saliency. Accordingly, the reconstruction of the non-observed data (i.e. lost values) is performed by means of different correlation models acquired by different SOMs. By minimizing the proposed cost function, the method is able to regenerate the missing data and establish the optimal salient information. In order to evaluate the reconstruction capability, we have tested the algorithm for crowd state recovery by considering a limited subset of observations (i.e. excluding a part of the available camera sensors). The proposed approach optimizes the estimation of the original values by considering not only the prototypes (the weights of the BMU) but also the influence of the variations of neighbouring values. Moreover, by minimizing the proposed cost function while varying the SOM-layer size, it is possible to identify the optimum filter (i.e. SOM), which corresponds to the minimum MSE calculated a posteriori (i.e. the minimum introduced distortion).

8.5 Final remarks and future developments

CDSs are complex frameworks, as discussed in this thesis (see Chapter 2, Section 2.2). Such architectural complexity leads to various systemic and conceptual problems, which can be summarized in the following two points: integration and validation.

∙ Integration: a systemic problem involving the cooperation mechanisms among the different parts of the system. The control and management of the flows of internal and external information is a central key in this kind of system. Specifically, this work has proposed an architecture where the sub-parts of the system are connected: the different elements work together through communication mechanisms based on cognitive control of the information flows.

∙ Validation: another fundamental problem, which involves, for example, the comparison of the proposed CDS with alternative crowd management systems. In general, crowd management systems are based on video analysis methods only. Typically, by means of optical flow or other visual features, it is possible to detect a specific situation (event detection) at a specific time instant (frame). Considering these methodologies as complete systems, comparable to a CDS, would be a conceptual error; such methods are better defined as crowd behaviour analysis tools. For example, Hang Su et al. [119] presented an optical-flow-based fluid field which, compared with the optical-flow-based social force model presented by Mehran et al. [84], provides better results in terms of identification of events in time. The proposed method is different, as it implements a crowd management system: it can predict possible future crowd behaviours (as consequences of the system's actions) and it is able to detect possible interaction anomalies. Referring to the results we presented on video sequences, we showed how the system, by predicting crowd events many frames before the exact time instant, suggests to the operator the best possible action to take; any different action will be detected as an interaction anomaly. The added value of our work consists in the automatic learning of crowd behavioural rules, which are in fact interaction rules. In addition, the system has the capability of predicting future events and of detecting interaction anomalies at the exact moment they happen. Furthermore, through the Attention Module it is possible to help the user focus his attention on the relevant environmental parts only. The Information Refinement can model such acquired relevant information by selecting the optimum knowledge baseline, while, through the Reconstruction algorithm, it is possible to recover the missing data from a limited set of observations acquired from video cameras. These facts make it really hard to compare the performance of our system with those of other state-of-the-art systems such as [84] and [119]. However, this thesis presents comparisons of sub-parts of the entire system (i.e. the Information Refinement and Reconstruction algorithms) with specific state-of-the-art methods.

In the proposed CDS, a still open issue can be identified in developing a method for relevant information localization. Specifically, in this work relevant information has been located in fixed observed zones. In dynamic situations, such as a moving group of people, it is important to follow the evolution of the scene. For this reason, future developments of this work will include a study of the impact of dynamic contextual relevant knowledge localization on the inductive reasoning algorithm. Furthermore, we want to include the prediction of future crowding state evolutions based on historical data modelling. Future developments will also include a detailed study of the impact of the information bottleneck on the GH-SOM, which could improve the GH-SOM strategy for selecting the knowledge representation among different hierarchical layers.


Bibliography

[1] Handbook of Multisensor Data Fusion: Theory and Practice, Second Edition (Electrical Engineering & Applied Signal Processing Series). CRC Press, 2nd edition, September 2008.

[2] David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.

[3] Ashkan Amiri and Simon Haykin. Improved Sparse Coding Under the Influence of Perceptual Attention. Neural Computation, pages 1–44, November 2013.

[4] John R. Anderson. The architecture of cognition, 1983.

[5] E.L. Andrade, S. Blunsden, and R.B. Fisher. Hidden Markov models for optical flow analysis in crowds. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 1, pages 460–463, September 2006.

[6] W.R. Ashby. Principles of the self-organizing dynamic system. The Journal of General Psychology, 37(2):125–128, 1947.

[7] S. Y. Auyang. Foundations of Complex-system Theories in Economics, Evolutionary Biology, and Statistical Physics. Cambridge University Press, Cambridge, UK, 1998.

[8] Y. Benabbas, N. Ihaddadene, and C. Djeraba. Motion pattern extraction and event detection for automatic visual surveillance. EURASIP Journal on Image and Video Processing, 2011:15, 2011.

[9] Massimo Bertozzi, Alberto Broggi, L. Bombini, C. Caraffi, S. Cattani, Pietro Cerri, Ra Fascioli, M. Felisa, R. I. Fedriga, S. Ghidoni, Paolo Grisleri, P. Medici, M. Paterlini, P. P. Porta, M. Posterli, and P. Zani. Vision technologies for intelligent vehicles. In KES'07/WIRN'07 Proceedings of the 11th international conference, KES 2007 and XVII Italian workshop on neural networks conference on Knowledge-based intelligent information and engineering systems: Part I, 2010.

[10] Alexander Bird. Routledge Companion to Philosophy of Language. Routledge Philosophy Companions. Routledge, 2012.


[11] Fernando Canales and Max Chacón. Modification of the growing neural gas algorithm for cluster analysis. In Luis Rueda, Domingo Mery, and Josef Kittler, editors, CIARP, volume 4756 of Lecture Notes in Computer Science, pages 684–693. Springer, 2007.

[12] U. Castiello and C. Umiltà. Size of the attentional focus and efficiency of processing. Acta Psychologica, 73(3):195–209, 1990.

[13] Bao Rong Chang, Hsiu Fen Tsai, and Chung-Ping Young. Intelligent data fusion system for predicting vehicle collision warning using vision/GPS sensing. Expert Systems with Applications, 37(3):2439–2450, 2010.

[14] S. Chiappino, L. Marcenaro, and C. Regazzoni. Selective attention automatic focus for cognitive crowd monitoring. In Proc. of IEEE Int. Conference on Advanced Video and Signal based Surveillance (AVSS), Krakow, Poland, 27–30 August 2013.

[15] S. Chiappino, L. Marcenaro, and C.S. Regazzoni. Information bottleneck-based relevant knowledge representation in large-scale video surveillance systems. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 4364–4368, May 2014.

[16] S. Chiappino, A. Mazzù, L. Marcenaro, and C.S. Regazzoni. A bio-inspired logical process for saliency detections in cognitive crowd monitoring. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, April 2015.

[17] S. Chiappino, P. Morerio, L. Marcenaro, E. Fuiano, G. Repetto, and C. Regazzoni. A multi-sensor cognitive approach for active security monitoring of abnormal overcrowding situations in critical infrastructure. 15th International Conference on Information Fusion, July 2012.

[18] Simone Chiappino, Lucio Marcenaro, Pietro Morerio, and Carlo Regazzoni. Event based switched dynamic Bayesian networks for autonomous cognitive crowd monitoring. In Augmented Vision and Reality, pages 1–30. Springer Berlin Heidelberg, 2013.

[19] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo Regazzoni. Run length encoded dynamic Bayesian networks for probabilistic interaction modeling. In 21st European Signal Processing Conference (EUSIPCO 2013), 2013.

[20] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. A bio-inspired knowledge representation method for anomaly detection in cognitive video surveillance systems. In Information Fusion (FUSION), 2013 16th International Conference on, pages 242–249, July 2013.

[21] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. Event definition for stability preservation in bio-inspired cognitive crowd monitoring. In Digital Signal Processing (DSP), 2013 18th International Conference on, pages 1–6, July 2013.


[22] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. Bio-inspired relevant interaction modelling in cognitive crowd management. Journal of Ambient Intelligence and Humanized Computing, pages 1–22, 2014.

[23] Jeongho Cho, António R. C. Paiva, Sung-Phil Kim, Justin C. Sanchez, and José C. Príncipe. Self-organizing maps with dynamic learning for signal reconstruction. Neural Netw., 20(2):274–284, March 2007.

[24] R.T. Collins, A.J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(10):1456–1477, October 2001.

[25] Nelson Cowan. Chapter 20: What are the differences between long-term, short-term, and working memory? In Wayne S. Sossin, Jean-Claude Lacaille, Vincent F. Castellucci, and Sylvie Belleville, editors, Essence of Memory, volume 169 of Progress in Brain Research, pages 323–338. Elsevier, 2008.

[26] F. Cupillard, A. Avanzi, F. Bremond, and M. Thonnat. Video understanding for metro surveillance. In Networking, Sensing and Control, 2004 IEEE International Conference on, volume 1, pages 186–191, 2004.

[27] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893, 2005.

[28] Antonio Damasio. The Feeling of What Happens: Body and Emotion in the Making of Consciousness. Harvest Books, October 2000.

[29] Anthony C. Davies, Jia Hong Yin, and Sergio A. Velastin. Crowd monitoring using image processing. Electronics and Communication Engineering Journal, 7:37–47, 1995.

[30] A. Dore, A.F. Cattoni, and C.S. Regazzoni. Interaction modeling and prediction in smart spaces: a bio-inspired approach based on autobiographical memory. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 2010.

[31] A. Dore, M. Pinasco, and C.S. Regazzoni. Multi-modal data fusion techniques and applications. In H. Aghajan and A. Cavallaro, editors, Multi-camera networks: Concepts and Applications, pages 213–237. Elsevier, May 2009.

[32] A. Dore and C. S. Regazzoni. Bayesian bio-inspired model for learning interactive trajectories. In Proc. of the IEEE International Conference on Advanced Video and Signal based Surveillance, AVSS 2009, Genoa, Italy, September 2009.

[33] A. Dore and C.S. Regazzoni. Interaction analysis with a Bayesian trajectory model. Intelligent Systems, IEEE, 25(3):32–40, May-June 2010.


[34] Mica R. Endsley. Toward a theory of situation awareness in dynamic systems. Human Factors: The Journal of the Human Factors and Ergonomics Society, 37:32–64, March 1995.

[35] M. Valera Espina and S. A. Velastin. Intelligent distributed surveillance systems: A review. IEE Proceedings - Vision, Image and Signal Processing, 152(2):192–204, April 2005.

[36] C. Fernández, P. Baiget, F.X. Roca, and J. Gonzàlez. Augmenting video surveillance footage with virtual agents for incremental event evaluation. Pattern Recognition Letters, 32(6):878–889, 2011.

[37] James Ferryman and James L. Crowley, editors. Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2009, 2009.

[38] John T. Finn. Use of the average mutual information index in evaluating classification error and consistency. International Journal of Geographical Information Systems, 7(4):349–366, 1993.

[39] G. L. Foresti, C. S. Regazzoni, and P. K. Varshney. Multisensor Surveillance Systems: The Fusion Perspective. Kluwer Academic, Boston, 2003.

[40] Hajer Fradi, Volker Eiselein, Ivo Keller, Jean-Luc Dugelay, and Thomas Sikora. Crowd context-dependent privacy protection filters. In DSP 2013, 18th International Conference on Digital Signal Processing, Santorini, Greece, 1-3 July 2013.

[41] Stan Franklin, Steve Strain, Javier Snaider, Ryan McCall, and Usef Faghihi. Global workspace theory, its LIDA model and the underlying neuroscience. Biologically Inspired Cognitive Architectures, 1:32–43, 2012.

[42] K. Friston, B. SenGupta, and G. Auletta. Cognitive dynamics: From attractors to active inference. Proceedings of the IEEE, 102(4):427–445, April 2014.

[43] Bernd Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems 7, pages 625–632. MIT Press, 1995.

[44] J.M. Fuster. Cortex and Mind: Unifying Cognition. Oxford University Press, USA, 2005.

[45] J.M. Fuster. Cortex and Mind: Unifying Cognition. Oxford University Press, USA, 2005.

[46] Stephen Grossberg, editor. Cognition, Learning, Reinforcement, and Rhythm, volume 1 of The Adaptive Brain. Elsevier, 1988.

[47] IST Advisory Group. Scenarios for ambient intelligence. European Commission, 2010.


[48] D. L. Hall and J. Llinas. An introduction to multisensor data fusion. Proceedings of the IEEE, 85:6–23, 1997.

[49] D. Handford and A. Rogers. Modelling driver interdependent behaviour in agent-based traffic simulations for disaster management. In The Ninth International Conference on Practical Applications of Agents and Multi-Agent Systems, Salamanca, Spain, April 2011.

[50] Ismail Haritaoglu, Davis Harwood, and Larry S. Davis. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Mach. Intell., 22:809–830, August 2000.

[51] S. Haykin, M. Fatemi, P. Setoodeh, and Yanbo Xue. Cognitive control. Proceedings of the IEEE, 100(12):3156–3169, December 2012.

[52] S. Haykin, M. Fatemi, P. Setoodeh, and Yanbo Xue. Cognitive control. Proceedings of the IEEE, 100(12):3156–3169, December 2012.

[53] S. Haykin and J.M. Fuster. On cognitive dynamic systems: Cognitive neuroscience and engineering learning from each other. Proceedings of the IEEE, 102(4):608–628, April 2014.

[54] Simon Haykin. Cognitive dynamic systems: An integrative field that will be a hallmark of the 21st century. In Cognitive Informatics Cognitive Computing (ICCI*CC), 2011 10th IEEE International Conference on, pages 2–2, August 2011.

[55] Simon Haykin. Cognitive dynamic systems: Radar, control, and radio. Proceedings of the IEEE, 100(7):2095–2103, 2012.

[56] Simon Haykin. Cognitive dynamic systems: Radar, control, and radio. Proceedings of the IEEE, 100(7):2095–2103, 2012.

[57] Simon Haykin and D.J. Thomson. Signal detection in a nonstationary environment reformulated as an adaptive pattern classification problem. Proceedings of the IEEE, 86(11):2325–2344, November 1998.

[58] Donald O. Hebb. The Organization of Behavior. John Wiley, New York, 1949.

[59] J. Jockusch and H. Ritter. An instantaneous topological map for correlated stimuli. In Proceedings of the International Joint Conference on Neural Networks, volume 1, pages 529–534, Washington, USA, 1999.

[60] M. Jouini, S. Thiria, and M. Crépon. Images reconstruction using an iterative SOM based algorithm. In Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2012 European Symposium on, pages 25–27, 2012.

[61] Prahlad Kilambi, Evan Ribnick, Ajay J. Joshi, Osama Masoud, and Nikolaos Papanikolopoulos. Estimating pedestrian counts in groups. Computer Vision and Image Understanding, 110(1):43–59, 2008.


[62] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, September 1990.

[63] D. Kumar, C.S. Rai, and S. Kumar. Face recognition using self-organizing map and principal component analysis. In Neural Networks and Brain, 2005. ICNN&B '05. International Conference on, volume 3, pages 1469–1473, October 2005.

[64] John E. Laird. Extending the Soar cognitive architecture. In Proceedings of the 2008 Conference on Artificial General Intelligence: Proceedings of the First AGI Conference, pages 224–235, Amsterdam, The Netherlands, 2008. IOS Press.

[65] John E. Laird, Allen Newell, and Paul S. Rosenbloom. Soar: An architecture for general intelligence. Artif. Intell., 33(1):1–64, September 1987.

[66] Nilli Lavie, Aleksandra Hirst, Jan W. De Fockert, and Essi Viding. Load theory of selective attention and cognitive control. Journal of Experimental Psychology: General, 133(3):339–354, 2004.

[67] A.J. Lipton, C.H. Heartwell, N. Haering, and D. Madden. Automated video protection, monitoring & detection. IEEE Aerospace and Electronic Systems Magazine, 18(5):3–18, May 2003.

[68] Bangquan Liu, Zhen Liu, and Yuan Hong. A simulation based on emotions model for virtual human crowds. In Image and Graphics, 2009. ICIG '09. Fifth International Conference on, pages 836–840, September 2009.

[69] James Llinas, Christopher Bowman, Galina Rogova, Alan Steinberg, and Frank White. Revisiting the JDL data fusion model II. In P. Svensson and J. Schubert, editors, Proceedings of the Seventh International Conference on Information Fusion (FUSION 2004), pages 1218–1230, 2004.

[70] C. Loscos, D. Marchal, and A. Meyer. Intuitive crowd behavior in dense urban environments using local laws. In Theory and Practice of Computer Graphics, 2003. Proceedings, pages 122–129, 2003.

[71] Matthias Luber, Johannes Andreas Stork, Gian Diego Tipaldi, and Kai O. Arras. People tracking with human motion predictions from social forces. In Proc. of the Int. Conf. on Robotics & Automation (ICRA), Anchorage, USA, 2010.

[72] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679, 1981.

[73] Michiharu Maeda. Restoration model with inference capability of self-organizing maps. In Pablo A. Estévez, José C. Príncipe, and Pablo Zegers, editors, Advances in Self-Organizing Maps, volume 198 of Advances in Intelligent Systems and Computing, pages 153–162. Springer Berlin Heidelberg, 2013.


[74] V. Mahadevan, Weixin Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1975–1981, June 2010.

[75] M. Mancas, N. Riche, J. Leroy, and B. Gosselin. Abnormal motion selection in crowds using bottom-up saliency. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 229–232, 2011.

[76] A. N. Marana, S. A. Velastin, L. F. Costa, and R. A. Lotufo. Automatic estimation of crowd density using texture. Safety Science, pages 165–175, April 1998.

[77] L. Marchesotti, S. Piva, and C.S. Regazzoni. Structured context-analysis techniques in biologically inspired ambient-intelligence systems. IEEE Transactions on Systems, Man and Cybernetics, Part A, 35(1):106–120, 2005.

[78] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507–522, 1994.

[79] Thomas M. Martinetz and Klaus J. Schulten. A “neural gas” network learns topologies. In Teuvo Kohonen, Kai Mäkisara, Olli Simula, and Jari Kangas, editors, Proceedings of the International Conference on Artificial Neural Networks 1991 (Espoo, Finland), pages 397–402. Amsterdam; New York: North-Holland, 1991.

[80] R. Mazzon, F. Poiesi, and A. Cavallaro. Detection and tracking of groups in crowd. In Proc. of IEEE Int. Conference on Advanced Video and Signal based Surveillance (AVSS), Krakow, Poland, 27–30 August 2013.

[81] R. Mazzon, F. Poiesi, and A. Cavallaro. Detection and tracking of groups in crowd. In Proc. of IEEE Int. Conference on Advanced Video and Signal based Surveillance (AVSS), Krakow, Poland, 27–30 August 2013.

[82] J. McCarthy, M. L. Minsky, N. Rochester, and C.E. Shannon. A proposal for the Dartmouth summer research project on artificial intelligence. http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html, 1955.

[83] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 935–942, June 2009.

[84] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 935–942, June 2009.

[85] Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In CVPR. IEEE Computer Society, 2009.

[86] George A. Miller. The cognitive revolution: a historical perspective. Trends in Cognitive Sciences, 7(3):141–144, 2003.


[87] Brian E. Moore, Saad Ali, Ramin Mehran, and Mubarak Shah. Visual crowd surveillance through a hydrodynamics lens. Commun. ACM, 54(12):64–73, December 2011.

[88] Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. People count estimation in small crowds. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 476–480, September 2012.

[89] R.G.M. Morris. Further studies of the role of hippocampal synaptic plasticity in spatial learning: Is hippocampal LTP a mechanism for automatically recording attended experience? Journal of Physiology-Paris, 90(5–6):333–334, 1996.

[90] Richard Morris, Graham Hitch, Kim Graham, and Tim Bussey. Chapter 9 - Learning and memory. In Richard Morris, Lionel Tarassenko, and Michael Kenward, editors, Cognitive Systems - Information Processing Meets Brain Science, pages 193–XII. Academic Press, London, 2006.

[91] Kevin Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, Computer Science Division, July 2002.

[92] Yunyoung Nam, Seungmin Rho, and Jong Hyuk Park. Intelligent video surveillance system: 3-tier context-aware surveillance system with metadata. Multimedia Tools Appl., 57(2):315–334, March 2012.

[93] Allen Newell. Unified Theories of Cognition. Harvard University Press, Cambridge, MA, USA, 1990.

[94] He Ni and Hujun Yin. Self-organising mixture autoregressive model for non-stationary time series modelling. International Journal of Neural Systems, 18(06):469–480, 2008. PMID: 19145663.

[95] N. Oliver and A.P. Pentland. Graphical models for driver behavior recognition in a smartcar. In Intelligent Vehicles Symposium, 2000. IV 2000. Proceedings of the IEEE, pages 7–12, 2000.

[96] Wei Pan, Wen Dong, Manuel Cebrián, Taemie Kim, James H. Fowler, and Alex Pentland. Modeling dynamical influence in human interaction: Using data to make better inferences about influence within social systems. IEEE Signal Process. Mag., pages 77–86, 2012.

[97] Debprakash Patnaik, Srivatsan Laxman, and Naren Ramakrishnan. Discovering excitatory networks from discrete event streams with applications to neuronal spike train analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM '09, pages 407–416, Washington, DC, USA, 2009. IEEE Computer Society.

[98] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.


[99] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc van Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In International Conference on Computer Vision, 2009.

[100] Roland Perko, Thomas Schnabel, Gerald Fritz, Alexander Almer, and Lucas Paletta. Counting people from above: Airborne video based crowd analysis. CoRR, abs/1304.6213, 2013.

[101] Bernhard Pfaff. VAR, SVAR and SVEC models: Implementation within R package vars. Journal of Statistical Software, 27(4):1–32, July 2008.

[102] A. Prati, R. Vezzani, L. Benini, E. Farella, and P. Zappi. An integrated multi-modal sensor network for video surveillance. In Proc. of the Third ACM International Workshop on Video Surveillance & Sensor Networks, November 2005.

[103] H. Rahmalan, M.S. Nixon, and J.N. Carter. On crowd density estimation for surveillance. In Crime and Security, 2006. The Institution of Engineering and Technology Conference on, pages 540–545, June 2006.

[104] S. Aravinda Rao, Jayavardhana Gubbi, Slaven Marusic, P. Stanley, and Marimuthu Palaniswami. Crowd density estimation based on optical flow and hierarchical clustering. In 2013 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Mysore, India, 22–25 August 2013.

[105] Andreas Rauber, Dieter Merkl, and Michael Dittenbach. The growing hierarchical self-organizing map: Exploratory analysis of high-dimensional data. IEEE Transactions on Neural Networks, 13:1331–1341, 2002.

[106] Paolo Remagnino, Sergio A. Velastin, Gian Luca Foresti, and Mohan Trivedi. Novel concepts and challenges for the next generation of video surveillance systems. Machine Vision and Applications, 18(3):135–137, 2007.

[107] Irina Rish and Genady Grabarnik. Sparse signal recovery with exponential-family noise. In Proceedings of the 47th Annual Allerton Conference on Communication, Control, and Computing, Allerton '09, pages 60–66, Piscataway, NJ, USA, 2009. IEEE Press.

[108] Edmund T. Rolls and Gustavo Deco. Networks for memory, perception, and decision-making, and beyond to how the syntax for language might be implemented in the brain. Brain Research, September 2014.

[109] Edward Rosten and Tom Drummond. Fusing points and lines for high performance tracking. In International Conference on Computer Vision, pages 1508–1515, 2005.

[110] Enrico Scalas and Ubaldo Garibaldi. A dynamic probabilistic version of the Aoki–Yoshikawa sectoral productivity model. Economics: The Open-Access, Open-Assessment E-Journal, 3(2009-15), 2009.


[111] Jonathon Shlens. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.

[112] S. Singh and S.C. Badwaik. Design and implementation of FPGA-based adaptive dynamic traffic light controller. In Emerging Trends in Networks and Computer Communications (ETNCC), 2011 International Conference on, pages 324–330, April 2011.

[113] D. Smith and S. Singh. Approaches to multisensor data fusion in target tracking: A survey. IEEE Transactions on Knowledge and Data Engineering, 18(12):1696–1710, December 2006.

[114] C.S. Soh, P. Raveendran, and Z. Taha. Automatic generation of self-organized virtual crowd using chaotic perturbation. In TENCON 2004. 2004 IEEE Region 10 Conference, volume B, pages 124–127, November 2004.

[115] Vijay Srinivasan, John Stankovic, and Kamin Whitehouse. Using height sensors for biometric identification in multi-resident homes. In Proceedings of the 8th International Conference on Pervasive Computing, Pervasive '10, pages 337–354, Berlin, Heidelberg, 2010. Springer-Verlag.

[116] Alan N. Steinberg, Christopher L. Bowman, and Franklin E. White. Revisions to the JDL data fusion model. Volume 3719, pages 430–441. SPIE, 1999.

[117] R. J. Sternberg and K. Sternberg. Cognitive Psychology. Belmont, CA: Wadsworth, 6th edition, 2012.

[118] Hang Su, Hua Yang, Shibao Zheng, Yawen Fan, and Sha Wei. Crowd event perception based on spatio-temporal viscous fluid field. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 458–463, September 2012.

[119] Hang Su, Hua Yang, Shibao Zheng, Yawen Fan, and Sha Wei. Crowd event perception based on spatio-temporal viscous fluid field. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 458–463, September 2012.

[120] Ron Sun. Learning, action and consciousness: a hybrid approach toward modelling consciousness. Neural Networks, 10(7):1317–1331, 1997.

[121] Ron Sun. Accounting for the computational basis of consciousness: A connectionist approach. Consciousness and Cognition, 8(4):529–565, 1999.

[122] Ron Sun. Symbol grounding: A new look at an old idea. Philosophical Psychology, 13(2):149–172, 2000.

[123] Ron Sun, Edward Merrill, and Todd Peterson. From implicit skills to explicit knowledge: a bottom-up model of skill learning. Cognitive Science, 25(2):203–244, 2001.


[124] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[125] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.

[126] M. Trivedi, K. Huang, and I. Mikic. Intelligent environments and active camera networks. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pages 804–809, 2000.

[127] Mohan Manubhai Trivedi, Tarak Gandhi, and Joel McCall. Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety. Intelligent Transportation Systems, IEEE Transactions on, 8(1):108–120, March 2007.

[128] Christopher Richard Wren, Ali Azarbayejani, Trevor Darrell, and Alex Paul Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:780–785, 1997.

[129] Shunguang Wu, S. Decker, Peng Chang, T. Camus, and J. Eledath. Collision sensing by stereo vision and radar sensor fusion. Intelligent Transportation Systems, IEEE Transactions on, 10(4):606–614, 2009.

[130] Chengbin Zeng and Huadong Ma. Robust head-shoulder detection by PCA-based multilevel HOG-LBP detector for people counting. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 2069–2072, August 2010.

[131] Beibei Zhan, Dorothy Monekosso, Paolo Remagnino, Sergio Velastin, and Li-Qun Xu. Crowd analysis: a survey. Machine Vision and Applications, 19(5–6):345–357, 2008.

[132] Tao Zhao and Ram Nevatia. Bayesian human segmentation in crowded situations. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 2:459, 2003.
