UNIVERSITY OF GENOVA
Doctoral School in
Science and Technology For Information and Knowledge
Ph.D Course in
ELECTRONIC AND COMPUTER ENGINEERING, ROBOTICS AND TELECOMMUNICATIONS
CYCLE XXVII
DOCTOR OF PHILOSOPHY THESIS
COGNITIVE DYNAMIC SYSTEM: AN
INNOVATIVE INFORMATION ACQUISITION
AND MODELLING FRAMEWORK FOR
CROWD MONITORING
SIMONE CHIAPPINO
Ph.D. Candidate
PROFESSOR CARLO S. REGAZZONI
Advisor
PROFESSOR MARIO MARCHESE
Chairperson
April 17, 2015
Abstract
Cognitive algorithms, integrated in intelligent systems, represent an important innovation
in designing interactive smart environments. More specifically, Cognitive Systems have
important applications in anomaly detection and management in advanced video surveillance.
These algorithms mainly address the problem of modelling interactions and behaviours
among entities in a scene. A bio-inspired architecture is proposed here, which is able to
encode and synthesize signals, not only for describing the behaviours of single entities, but
also for modelling cause-effect relationships between user actions and changes in environment
configurations. Neuroscience studies describe how this information is learned
and stored within a specific brain area, namely the Autobiographical Memory (AM), which
consists of a collection of episodic elements based on events.
This thesis proposes an artificial structure of the aforementioned AM, embedded in a cognitive
architecture. This work shows the effectiveness of the proposed bio-inspired structure for
automatic crowd monitoring applications. Here the system acquires knowledge directly
by observing the operator's actions aimed at controlling a crowd in order to prevent
anomalous situations, such as overcrowding.
The proposed framework operates an effective knowledge transfer from a human operator
towards an automatic system called the Cognitive Surveillance Node (CSN), which is part of
a complex cognitive JDL-based and bio-inspired architecture. After such a knowledge-
transfer phase, the learned representations can be used, at different levels, either to support
human decisions, by detecting anomalous interaction models and thus compensating for
human shortcomings, or, in an automatic decision scenario, to identify anomalous patterns
and choose the best strategy to preserve the stability of the entire system. In large-scale
video surveillance networks, where the amount of information increases (e.g., due to a larger
number of monitored areas), attention-focusing techniques are needed to highlight the most
relevant parts of the overall acquired data.
This thesis presents a bio-inspired algorithm called Selective Attention Automatic Focus
(S2AF), as a part of the CSN for crowd monitoring. The main objective of the proposed
method is to extract the relevant information needed for crowd monitoring directly from the
environmental observations. Such important data can be considered a sub-part (i.e., sparse
and limited information) of the whole information. By means of an automatic relevance-identification
algorithm, the proposed system can support the operator in focusing attention on
the relevant parts of the scene only.
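One simple way to make relevance identification concrete (an illustrative assumption, not the S2AF algorithm itself, whose details appear in Chapter 5) is to rank monitored zones by the entropy of their observed people-count histograms, so that attention is focused on the most uncertain, and hence most informative, zones. All zone names and data below are toy examples.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of an empirical histogram."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_zones(observations, top_k=2):
    """observations: {zone: [people counts over time]} -> most informative zones."""
    scores = {z: entropy(Counter(series)) for z, series in observations.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy data: zone "c" fluctuates a lot, zones "a" and "b" are nearly static,
# so an entropy ranking would direct attention to zone "c" first.
obs = {
    "a": [3, 3, 3, 3, 3, 3, 3, 3],
    "b": [0, 0, 0, 1, 0, 0, 0, 0],
    "c": [1, 5, 9, 2, 7, 0, 4, 8],
}
focus = rank_zones(obs, top_k=1)
```

A constant count gives zero entropy (no uncertainty), so static zones are never selected; this mirrors the intuition that an operator's attention is best spent where the observations are hardest to predict.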
When wide-area surveillance systems are considered, one of the major problems in event
detection is the reconstruction of the scene as a whole from spatially limited observations.
In this thesis, a novel relevant-information representation technique for sparse and limited
observations, based on information theory, is presented. An innovative hierarchical structure
for knowledge representation based on artificial neural networks, the Self Organizing
Maps (SOMs), has been used for classifying and correlating observed sparse data time
series acquired by cameras. In this thesis, by means of Information Bottleneck theory, a
cost-function approach is proposed in order to determine the optimal data representation
in the SOM-space as a trade-off between sparse-signal classification capability and the
preservation of the statistical similarities (i.e., correlations) of the original data. By means of the
proposed cost function, an algorithm for knowledge representation, called information
bottleneck-based SOM selection (IB-SOM), is described.
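The trade-off just described can be illustrated with a small numerical sketch. This is not the thesis's actual IB-SOM formulation (developed in Chapter 6): the cost below simply combines an average quantization error (a proxy for classification fidelity) with a term penalizing the loss of pairwise correlations after quantization, weighted by a hypothetical trade-off parameter `lam`; the candidate with the lowest cost is selected. The codebooks here are random stand-ins for trained SOMs.

```python
import numpy as np

def som_cost(data, codebook, lam=0.5):
    """Toy IB-style cost: quantization error + lam * correlation loss.

    data:     (n, d) observed vectors
    codebook: (k, d) SOM unit weights (assumed already trained)
    lam:      hypothetical trade-off weight, for illustration only
    """
    # Assign each sample to its best-matching unit (BMU).
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    bmu = dists.argmin(axis=1)
    # Term 1: average quantization error (classification-fidelity proxy).
    qe = dists[np.arange(len(data)), bmu].mean()
    # Term 2: how badly the pairwise correlations of the original data are
    # preserved once each sample is replaced by its BMU prototype.
    quantized = codebook[bmu]
    corr_loss = np.abs(np.corrcoef(data.T) - np.corrcoef(quantized.T)).mean()
    return qe + lam * corr_loss

# Select among candidate "SOMs" of different sizes (random stand-ins here).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 4))
candidates = {k: rng.normal(size=(k, 4)) for k in (16, 25, 49)}
best = min(candidates, key=lambda k: som_cost(data, candidates[k]))
```

Sweeping `lam` shifts the optimum between fine quantization (many units) and faithful correlation structure, which is the role the Lagrange-style trade-off parameter plays in the Information Bottleneck framework.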
The advanced mechanisms for the representation of relevant information described
in this work can improve video surveillance systems. In particular, in this thesis a
novel adaptive inductive reasoning (top-down logic) mechanism for saliency extraction
and information recovery from a distributed camera sensor network is presented. The
proposed system, by means of Self Organizing Maps, can first learn the correlations of the
observed data and then select the optimal subset of sensors for recovering the whole
information with minimum distortion.
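The idea of selecting a sensor subset for minimum-distortion recovery can be sketched as follows. This is a generic greedy selection under a linear reconstruction model, not the SOM-based procedure developed in the thesis; the function names and the synthetic "sensor" data are illustrative assumptions.

```python
import numpy as np

def reconstruction_mse(train, test, subset):
    """Least-squares reconstruction of all sensors from an observed subset."""
    A = train[:, subset]                            # observed sensors (training)
    W, *_ = np.linalg.lstsq(A, train, rcond=None)   # map subset -> full vector
    return np.mean((test[:, subset] @ W - test) ** 2)

def greedy_select(train, test, budget):
    """Greedily pick `budget` sensors that minimize held-out distortion."""
    chosen, remaining = [], list(range(train.shape[1]))
    while len(chosen) < budget:
        best = min(remaining,
                   key=lambda s: reconstruction_mse(train, test, chosen + [s]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Correlated synthetic "sensor" data: 8 sensors driven by 3 latent factors,
# so 3 well-chosen sensors suffice to recover the rest up to the noise floor.
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 3))
data = latent @ rng.normal(size=(3, 8)) + 0.05 * rng.normal(size=(300, 8))
train, test = data[:200], data[200:]
subset = greedy_select(train, test, budget=3)
```

Because the eight signals share three latent factors, the held-out reconstruction error of the selected triple drops close to the noise level, which is the "minimum distortion" criterion in miniature.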
The thesis deals with the increasing demand to acquire and model, by observing user activities,
a large number of strategic and operative crowd management solutions. In the
thesis an innovative visual 3D simulator, where crowd behaviour is modelled by means of
Social Forces, has been developed and described. The configuration of the synthetic environment
provides a set of customizable elements (e.g., doors, walls and rooms), so
that a wide range of scenarios can be set up. More specifically, a crowd enters a room of the
simulator and, driven by a given motivation, moves toward the exit of the environment (e.g., a
building).
The CSN can observe two interacting entities consisting of a simulated crowd and a human
operator. The user can interact with the simulator through actions (i.e., the opening and
closing of doors in the environment) aimed at avoiding anomalous situations (i.e., overcrowding).
Results demonstrate how the proposed system can automatically detect and manage
anomalous crowd configurations. Experimental tests, on synthetic and also on real video
sequences, demonstrate the applicability of the CSN in both the user-support and automatic
modes. The proposed attention-focus method has been tested by means of a 3D crowd
simulator; the results show how, by the proposed S2AF, the CSN is able to detect densely
populated areas. The so-called IB-SOM algorithm for knowledge representation can be
applied in the CSN structure in order to adapt the crowd density representation to changes
in the surrounding environment. The IB-SOM has been evaluated on synthetic and real
video sequences and compared to other neural networks for event detection in crowds.
Finally, the proposed SOM-based inductive reasoning mechanism, which is embedded in the
CSN architecture, has been tested on synthetic and real video sequences. The results show
how the CSN can select relevant data from the observed parts of the environment and how,
by means of inductive reasoning, it can reconstruct the information in the non-observed
parts.
Dedication
Dedico la mia Tesi di Dottorato di Ricerca ai miei genitori
I would like to dedicate my Ph.D Thesis to my parents
Simone Chiappino
This doctoral thesis has been examined by the Evaluation Committee composed as follows:
Professor Fulvio Mastrogiovanni . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Università degli Studi di Genova
Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi
Professor Claudio Sacchi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Università degli Studi di Trento
Dipartimento di Informatica e Scienze dell’Informazione
Professor Nicola Zingirian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .Università degli Studi di Padova
Dipartimento di Ingegneria dell’Informazione
Contents
Abstract 3
1 Introduction 25
1.1 Motivation and context . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
1.1.1 The JDL Architecture . . . . . . . . . . . . . . . . . . . . . . . . 28
1.1.2 Formulation of the problem . . . . . . . . . . . . . . . . . . . . . 31
1.2 Specific contributions to progress in scientific and technological state-of-
the-art . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.3 Structure of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2 State of the art 39
2.1 From neuroscience to cognitive architectures . . . . . . . . . . . . . . . . 42
2.2 Overview of cognitive architectures . . . . . . . . . . . . . . . . . . . . . 44
2.2.1 Symbolic architectures . . . . . . . . . . . . . . . . . . . . . . . . 45
2.2.2 Connectionism architectures . . . . . . . . . . . . . . . . . . . . . 47
2.2.2.1 Self Organizing Map (SOM) . . . . . . . . . . . . . . . 48
2.2.2.2 Neural Gas (NG) . . . . . . . . . . . . . . . . . . . . . . 49
2.2.2.3 Growing Neural Gas (GNG) . . . . . . . . . . . . . . . . 50
2.2.2.4 Instantaneous Topological Map (ITM) . . . . . . . . . . 52
2.2.2.5 Growing Hierarchical-SOM (GH-SOM) . . . . . . . . . 53
2.2.3 Hybrid architectures . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.3 Haykin’s architecture and dynamics of cognitive systems . . . . . . . . . . 55
2.3.1 Haykin’s architecture . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.3.2 Dynamics of cognitive systems . . . . . . . . . . . . . . . . . . . . 58
2.4 Final remarks about cognitive architectures . . . . . . . . . . . . . . . . . 59
3 Crowd modeling and proposed architecture for cognitive video surveillance
system 61
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.2 Scale issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2.1 Crowd monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.2.2 Crowd simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3 A bio-inspired cognitive model for Cognitive Surveillance Systems . . . . . 67
3.3.1 Cognitive Cycle for single and multiple entities representation . . . 68
3.3.2 The Cognitive Node . . . . . . . . . . . . . . . . . . . . . . . . . 73
4 Information extraction and probabilistic model for knowledge representation 77
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.1.1 Data fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.1.2 Event detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1.3 Autobiographical memory . . . . . . . . . . . . . . . . . . . . . . 83
4.2 Autobiographical Memory domain applications: Surveillance and Crowd
Management scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.2.1 Learning phase: interaction representations . . . . . . . . . . . . . 87
4.2.2 Detection phase: surveillance scenarios . . . . . . . . . . . . . . . 88
4.2.3 Detection phase: Crowd management scenarios . . . . . . . . . . . 89
4.3 Proposed approach for abnormal interaction detection in crowd monitoring
domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.1 Dimensional reduction and preservation of the information . . . . . 91
4.3.2 Self Organizing Map: classification evaluation . . . . . . . . . . . 92
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.4.1 Example of application on real video sequences . . . . . . . . . . . 102
5 A bio-inspired algorithm for designing a human attention mechanism 109
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.2 Selective attention for automatic crowd monitoring . . . . . . . . . . . . . 114
5.3 Information flow process for cognitive crowd monitoring . . . . . . . . . . 115
5.4 Selective Attention Automatic Focus (S2AF) algorithm . . . . . . . . . . . 116
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6 Information bottleneck-based relevant knowledge representation in large-scale
video surveillance systems 127
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.2 Information bottleneck-based relevant information representation . . . . . . 130
6.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.1 Training of SOMs and cost function evaluation on synthetic data . . 135
6.3.2 Crowd density reconstruction on real video sequences . . . . . . . 136
7 A bio-inspired logical process for information recovery in cognitive crowd mon-
itoring 141
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
7.2 Data correlation representation for bio-inspired inductive reasoning algorithm144
7.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8 Conclusions and future developments 155
8.1 Analysis block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2 Attention Focusing Module . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.3 Information Refinement for relevant knowledge representation . . . . . . . 157
8.4 Adaptive reasoning algorithm for information reconstruction . . . . . . . . 158
8.5 Final remarks and future developments . . . . . . . . . . . . . . . . . . . . 159
List of Figures
1-1 JDL’s structure for data fusion [116]. . . . . . . . . . . . . . . . . . . . . 30
1-2 Operator-machine-environment loop; the machine is represented by a JDL-based
system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
1-3 Self-aware oriented system proposed by Endsley [34]. The model of situation
awareness in dynamic decision making is formed as follows: perception (level 1),
comprehension of the situation (level 2), prediction (level 3); after that,
decision and action close the loop with the environment . . . . . . . . . . 31
1-4 Proposed Cognitive Node structure. . . . . . . . . . . . . . . . . . . . . . 33
2-1 The scheme of LIDA’s cognitive cycle. . . . . . . . . . . . . . . . . . . . 40
2-2 The Perception Action Cycle formalized and Reinforcement Learning. . . 43
2-3 Functional block diagram that shows how the Cognitive Dynamic System
perceives the environment in which it operates. . . . . . . . . . . . . . . . 44
2-4 Some of the more important cognitive architectures. . . . . . . . . . . . . . 46
2-5 Example of a self organizing map structure: the SOM-layer is formed by a
2-D fixed grid of 𝑁 neurons, while the input vector is defined as 𝑥 =
[𝑥1, 𝑥2, · · · , 𝑥𝑛]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2-6 Example of GNG adaptation to input stimuli. On the left the initial situation
is shown; on the right, the final topology. . . . . . . . . . . . . . . . . . . . 51
2-7 Node creation procedure in ITM algorithm. . . . . . . . . . . . . . . . . . 53
2-8 Column insertion and layer expansion in GH-SOM. Step 1: when MQE𝑙
exceeds a given threshold, a column is inserted between the node 𝑒 with the
highest mqe𝑒 and its most dissimilar neighbour 𝑑. Step 2: layer expansion
by adding a new map in order to represent the input data in more detail. . . 54
2-9 Haykin’s Cognitive Dynamic System. Cognitive Perceptron (CP); Cogni-
tive Control (CC); Probabilistic Reasoning Machine (PRM). . . . . . . . . 56
2-10 The Markov blanket of node 𝐴. . . . . . . . . . . . . . . . . . . . . . . . 58
2-11 Global state representation of the system. The internal states 𝑟 are connected with
the action states 𝑎, while the environment 𝜓 is influenced by the actions.
The external states are observed by the sensors 𝑠. . . . . . . . . . . . . . . 59
2-12 Characteristics of cognitive structures. . . . . . . . . . . . . . . . . . . . . 60
3-1 Autobiographical Memory stores the cause-effect relationships between
entities such as user actions and changes in environment configurations
(i.e. the crowd). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3-2 Cognitive Cycle (single object representation) . . . . . . . . . . . . . . . . 69
3-3 Cognitive Node-based object representation: Bottom-up analysis and top-
down decision chain. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3-4 Embodied Cognitive Cycle, Interactive Virtual Cognitive Cycles and Cog-
nitive Node matching representation . . . . . . . . . . . . . . . . . . . . . 72
3-5 Cognitive Node Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 74
4-1 Examples of temporal proximity trajectories among fired neurons in 2-D
SOM-map (5×5) for different core state vector sequences. The trajectories
are non-linear and discontinuous. . . . . . . . . . . . . . . . . . . . . . . . 83
4-2 Coupled Event based Dynamic Bayesian Network model representing in-
teractions within an AM . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4-3 Graphical representation of the mapping into the AM 3-D space of the passive
triplet 𝜖−𝑃 , 𝜖𝐶 , 𝜖+𝑃. The symbols 𝑙𝑥𝑃/𝐶 represent the contextual SOM-labels
associated to each cluster. In this example the proto or core events are
represented by: 𝑙𝑥𝑃/𝐶 ⇢ 𝑙𝑦𝑃/𝐶 , where 𝑥 = 𝑦. The transitions into the Proto and
Core-Map are dashed to represent the non-linearity and discontinuity
of the trajectories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4-4 Example of CE-DBNs for passive triplet, e.g. 𝜖−𝑃 , 𝜖𝐶 , 𝜖+𝑃, with a parame-
ter 𝜃 tied across proto-core-proto transitions. . . . . . . . . . . . . . . . . . 88
4-5 Model learning problem: triplet recovery from the model. 𝑇𝑟𝑗 represents the 𝑗-th
generic triplet, 𝜃𝑖 is the 𝑖-th interaction model. . . . . . . . . . . . . . . . . 91
4-6 Example of graphical comparison between VAR(2) models and simulated
data representing the number of people within zone 1. The
averages of the matching between VAR(2) model outputs and 140 simulated
vectors (expressed in percentages) are the following: SOM 4 × 4 fit:
67.14%; SOM 5 × 5 fit: 53.6%; SOM 7 × 7 fit: 40.18%. . . . . . . . . . . 96
4-7 The simulated monitored environment. . . . . . . . . . . . . . . . . . . . . 97
4-8 Vectorial sum of forces 𝐹𝑇𝑂𝑇 and influence range of characters. . . . . . . 98
4-9 Classification examples of interaction behaviour using evaluation window
W=10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4-10 Detection of anomalous operator-crowd interactions. The system detects an
anomalous interaction when the operator does not open two doors and the
number of people increases. This incorrect crowd-management situation
is shown in the figure and compared with the correct situation. . . . . . . . 102
4-11 Example of the real environment (a) and the simulated scenario (b) used for the
test; the virtual rooms correspond to the zones. The red dashed line corresponds
to the people flow direction from zone 1 to zone 3; the blue dashed line
describes people movement from zone 3 to zone 1. Dashed circular areas
qualitatively correspond to the parts of the rooms monitored by cameras
equipped with a people-counter module. 𝑑1 − 𝑑4 are the doors. . . . . . . . 103
4-12 Sample frames for four different crowd-environment interactions. Differ-
ent people flows are presented: two opposite directions of movement (a)
(b), people splitting (c), people merging (d). (a) and (b) represent normal
behaviours, while (c) and (d) represent two abnormal behaviours. . . . . . . 105
4-13 The qualitative results of the normal and anomalous operator-crowd interaction
detection, during the operator-support phase. The ground-truth bar
represents the operator actions in correspondence with the video frames. The
prediction-and-action bar represents the cognitive system actions. . . . . . . 107
5-1 Attention module, which is formed by the attention mechanism based on
Selective Attention Automatic Focus (S2AF) and the Environmental Memory,
which memorizes the local crowd density models. . . . . . . . . . . . . 111
5-2 The switch T activates or deactivates the attention mechanism. Attention
open loop: the operator receives the information directly from the camera
network. Attention closed loop: the operator receives the information through the
attention module, which selects the relevant information only. . . . . . . . . 112
5-3 The schematic information flow process for selective attention automatic
focus and relevant information extraction. The total information is associated
with the macroscopic level (a). 𝑋 represents a set of trajectories defined as the
microscopic level (b). By an encoding function 𝑍(𝑋), the more informative
environment parts, associated to the sufficient information 𝑌, are identified
as the mesoscopic level (c). Such a representation is used to extract the relevant
information 𝑇(𝑌) according to the task (d). . . . . . . . . . . . . . . . . . . 113
5-4 Crowd simulator: simulated crowding scene for trajectories generation. . . 117
5-5 Simulated people flows: qualitative sketch of trajectories, the arrow colors
indicate five different people flows 𝑑1, ..., 𝑑5. . . . . . . . . . . . . . . . . 117
5-6 ITM: topological map, formed by the different zones created after 50 trajectories
using the ITM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5-7 Histogram: a learned people histogram from which it is possible to evaluate
the probability that an observed number of people occurs in zone
𝑧𝑗 (in this case 𝑗 = 42). . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5-8 Gaussian p.d.f.: p.d.f. obtained by Gaussian normalization of the histogram (in
this case the average is 𝑚42 = 3.4 and the variance is 𝜎²42 = 2.8). . . . . . 119
5-9 Seed attentional resources: initial conditions (i.e. three high entropy 𝑧𝑠𝑗 for
each room). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5-10 Examples of transition probabilities for ITM zones. A conditional entropy
value is computed between a given zone and its neighbours. On the left,
a zone 𝑧1 is connected with only one neighbour, 𝑧2 (i.e., low uncertainty
and conditional entropy), while on the right a zone 𝑧4 has more potential
neighbours 𝑧5, 𝑧6, 𝑧7 (i.e., high uncertainty and conditional entropy). . . . 120
5-11 An example of normalized cost-function average trends for room 𝑅1. The
experiments have been conducted on 150 sequences of synthetic data provided
by the simulator, with 𝑚 = 2 and a people flow corresponding to 𝑑1. The total
time of each simulation is 1000 s. Three attentional-resource choices
(high entropy, low entropy and random choice) have been considered. Each point
on the curves corresponds to the average value of 𝑓(𝑡) over the different
initial people distributions. . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5-12 Groups of zones provided by S2AF. . . . . . . . . . . . . . . . . . . . . . 124
5-13 Asynchronous event detection. Direction-change events in 100 simulated
situations have been evaluated. The figure shows average trends for the probabilities
𝑝𝑘, considering time alignment with the events at 𝑡 = 0. At 𝑡 = 30
the two curves reach the same probability. The system is able to detect
direction changes of the people from 𝑐1 to other zones. . . . . . . . . . . . 125
6-1 Relevant information extraction and crowd density map estimation. a. The
coloured circles specify important subparts of the scene; optical flow fea-
tures represent sparse relevant information. b. Crowd density map estima-
tion by Lucas-Kanade [72] optical flow features. . . . . . . . . . . . . . . . 129
6-2 Information Refinement block for automatic relevant data representation
selection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6-3 Relevant information extraction and crowding density map projections into the
SOM-space. The grid cells, laid over the image plane, can be seen as the set
of controlled areas. Each cell is associated with a number of features extracted
by Lucas-Kanade optical flow and used for estimating the crowd density
map. In this example 𝑋 ∈ ℝ²⁰ (with a reduced representation in ℝ⁵) and the
sparse vector 𝑋̃ ∈ ℝ²⁰. 𝑋 and 𝑋̃ are mapped into the same unit, 16. The
percentage of controlled area corresponds to 40% (i.e., 2/5). . . . . . . . . 132
6-4 Kullback-Leibler divergences 𝐷𝐾𝐿 for two Gaussian probability density
functions 𝑝(𝑌,𝑊) and 𝑝(𝑌,𝑋). The information loss is represented
by a distance metric between 𝑝(𝑌,𝑊) and 𝑝(𝑌,𝑋) (see Table 6.1). . . . . 134
6-5 Normalized cost-function average trends for different SOMs. The experiments
have been conducted on 100 sequences of synthetic data provided
by the simulator. The total time of each simulation is 1000 s. The validity
regions are defined by intervals of 𝜆. . . . . . . . . . . . . . . . . . . . . . 136
6-6 Qualitative and quantitative results for event recognition for PETS sequence
S1 L2 Time 14 : 06 using 40% of the controlled area. The figure shows
the comparison between the proposed method and other approaches: on
the upper part crowding density map reconstructions are presented; on the
lower part distortion curves are shown. The whole video is available on
https://www.youtube.com/watch?v=KN2aYZ64TTw. . . . . . . . . . . . . 138
7-1 Reconstruction block for the automatic recovery of the whole information. . 143
7-2 Information extraction and crowd density map estimation. . . . . . . . . . 145
7-3 Example of correlation analysis between two pairs of locations. The
data are represented by the number of features extracted by Lucas-Kanade
optical flow associated to each cell. . . . . . . . . . . . . . . . . . . . . . . 150
7-4 Salient information extraction and crowding density map reconstruction.
The sparse vector is projected into the SOMs space (i.e., the “SOMs stack”). The
grid cells, laid over the image plane, can be seen as the set of controlled
areas. Each cell is associated with a number of features extracted by Lucas-
Kanade optical flow and used for estimating the crowd density map. The
percentage of controlled area corresponds to 40%. In this example 𝑋 ∈ ℝ²⁰
and the vector of the reduced observations 𝑋̃ ∈ ℝ²⁰. 𝑋̃ is mapped into
different units of the SOMs. By comparing the cost functions (see equation
(7.4)) it is possible to identify the optimal SOM (in this case SOM-9). . . . 151
7-5 An example of qualitative and quantitative results for PETS sequence S1
L2 Time 14 : 06 (frames 56 and 140) using 40% of the controlled area. Up
to the time of first people detection (i.e. initialization phase, frame 44) the
system acquires the data from all the zones. . . . . . . . . . . . . . . . . . 152
7-6 Reconstruction comparison for different saliency extraction methods: saliency
detection presented in [74]; HOG descriptors [27] combined with a learning-
based SVM classifier for person (pedestrian) and head detections. The av-
erage mean square error normalized to 1 and the percentage of the crowd
density map average acquired from the system (i.e. Extracted Relevance-
ER) are shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
List of Tables
4.1 SOM and PCA comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2 Classification evaluation for different SOM-layer . . . . . . . . . . . . . . 94
4.3 Different crowd behaviours in simulated scenarios . . . . . . . . . . . . . . 99
4.4 People flow detection using different people counters and SOMs. People-counter
accuracies: 𝑎𝑃𝐶1 = 80%, 𝑎𝑃𝐶2 = 60%. Success = 1, Failure =
0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.1 Groups of zones: S2AF output. . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1 Cost function parameters. For evaluating 𝐷𝐾𝐿, two normalized density
functions 𝑝(𝑌,𝑋) = 𝑁(0, 0.9996) and 𝑝(𝑌,𝑊𝐾) have been considered. . . 135
6.2 Comparisons between the proposed IB-SOM selection and other neural
networks. Results are presented for different percentages of controlled areas.
In the table, normalized reconstruction errors are shown. For IB-SOM,
reconstruction 𝑑𝑟𝑊𝐾 and prediction 𝑑𝑝𝑊𝐾 errors have been computed. The
last two rows show the averages and the variances of the errors. . . . . . . . 137
7.1 Quantitative results for PETS video sequences. It is important to note that
the minimum value of the cost function proposed in this thesis corresponds
to the minimum value of the MSE. . . . . . . . . . . . . . . . . . . . . . . 149
Chapter 1
Introduction
Recently, new pervasive technologies have been influencing our lives by increasing the possibility of
using computing devices anytime and anywhere. In this context, software engineering
and computer science are rapidly leading to the development of tools which are able to support
and substitute humans in their daily activities. The concept of Ambient Intelligence
(AmI) [47] describes how the pervasive computing paradigm can be applied to the
design of interactive smart environments (i.e., smart spaces). Through intelligent
sensors, the environment acquires the capability of recognizing the behaviours of individuals in the controlled
area; furthermore, through intuitive interfaces, it can interact directly with them.
These systems, having the capability to assist people, have found applications in different
domains: smart domotics [115], surveillance [36], smart vehicles [9] and smart traffic
lights [112].
Nowadays it is commonly accepted that the connection between traditional video analysis
tools and context-aware methodologies is a fundamental element in the design of
smart devices for security systems. This technological union augments the functionalities
of common surveillance systems with modules for situation understanding, person and crowd
behaviour analysis, and classification and detection of potential threats [106], [126], [83], [67],
[127]. However, in past years the main focus of surveillance applications has often
been on recognizing anomalous events, such as panic or the formation of groups of people,
only when they actually happen [118], [80]. This means that such systems cannot anticipate
an anomalous behaviour in order to draw up possible control strategies. Therefore, large-scale
surveillance systems now represent one of the most important application fields for
new-generation security and safety systems. Moreover, they can constitute an
additional advantage within the context of smart city and building design, by providing
information that can be used not only to dynamically maintain a high security and safety
level, but also to help humans feel more comfortable and be better served by the environments
they build. However, a major research effort is required to reach the ambitious
objective of making the environments we live in not only capable of internally representing
the state in which they are (i.e., of showing a kind of self-consciousness/self-awareness
[28]), but also of using the so-obtained internal representations to help humans
efficiently address actions aimed at dynamically maintaining a high safety level in such
environments.
The development of cognitive theories and techniques aimed at addressing safety and security
tasks is often bio-inspired and represents a potentially relevant added value, which can also
gain impetus from cross-fertilization with related fields, such as cognitive radio and cognitive
radar [55].
Cognitive sciences explore the principles of cognition, drawing on many
fields: psychology, philosophy, neuroscience, linguistics, anthropology and
artificial intelligence [86]. In science, cognition is the study of how human
beings perceive, process and understand the information acquired [117] by their senses.
Several recent studies have proposed algorithms deployed in distributed camera and sensor
networks in order to move from the object-recognition paradigm to event/situation identification
and to consequent distributed decisions/actions at multiple spatial/temporal levels
[35].
The dynamic models enhance the capability not only of detecting events (e.g., the presence of an
intruder in a forbidden area, abnormal situations in the crowd, or the trajectory of
an object in an urban scenario), but also of predicting the behaviours of observed entities [33].
Such predictions allow reactions to be activated in order to recover and maintain homeostasis,
namely the stability of safe conditions, by automatically taking decisions and performing
actions on the environment. Cognitive sciences represent one of the most promising research
fields in terms of their capability of driving improvements over the state of the art.
In addition, to efficiently exploit cognitive capabilities in an intelligent sensor network, the
role of data fusion algorithms is crucial [39], [24]. In the literature, several works deal
with the data fusion problem applied to heterogeneous sensors, both for security [113], [102]
and safety tasks [13], [129].
In this work, the features of a cognitive-based framework, inspired by the previously cited
concepts, are described, and the application of the proposed architecture to the video surveillance
domain for crowd monitoring is demonstrated.
1.1 Motivation and context
Many neuroscience studies define human cognitive mechanisms as a set of different mental
processes, involving perception, learning, memorization, reasoning and decision making,
aimed at understanding and solving problems. According to Fuster's paradigm,
the perceptual and executive components are completely defined by the Perception Action
Cycle (PAC) [44]. From an engineering perspective, the cognition process has been redefined
as a Cognitive Cycle that includes the following phases: sensing, analysis,
decision and action.
A Cognitive System is an artificial system that implements data acquisition, information
storage, representation and processing according to bio-inspired structures [90].
Flexibility is an important prerequisite for a cognitive process: it represents the human
ability to form and apply general task rules. Cognitive Control enables Cognitive Systems
to flexibly switch between different strategy sets in order to find the best actions to
perform, according to the task [51]. Cognitive Control allows a dynamic reconfiguration
of the system by taking its previous experiences into consideration. This capability is a
strong component of a Cognitive Dynamic System (CDS) [54]. Cognitive control makes it
possible to implement an adaptive mechanism that changes the cognitive system's behaviour
depending on new tasks or different circumstances in the external environment. The CDS
actuates an adaptation mechanism in order to control and adjust the parameters for processing
the acquired information, and to update them according to feedback signals coming from the
external environment. More in detail, the CDS changes parameters by minimizing
the information gap, in order to select (from the cognitive dynamic system memory)
and memorize (from the real or virtual sensors) the information with maximum
information content. Furthermore, a CDS ought to adapt its decision-making process in
order to maximize the reward fed back from the environment, which is linked to the
success of the actions taken for a specific task. The cognitive surveillance systems proposed in
the literature so far mainly share the idea that it is time to extend functionalities beyond simple
video analytics. This can be done in different ways: analytics capabilities can be structured
over more abstraction levels, allowing the system to autonomously maintain improved context
awareness (e.g. situation assessment, threat prediction). In other cases, the concept of
embodied cognition comes in, and the goal is to make the system self-aware of its actuation
resources and active monitoring strategies. Video analytics for surveillance in critical areas
is becoming more and more significant for public security. Several works in the
last decade have been devoted to linking traditional computer vision tasks to high-level
context-aware functionalities such as scene understanding, behaviour analysis, interaction
classification, and the recognition of possible threats or dangerous situations. This work
presents the development and design of an innovative data processing architecture for a
Cognitive Dynamic System based on the JDL (Joint Directors of Laboratories) model [1] [69].
1.1.1 The JDL Architecture
In this section the main concepts of the JDL model are explained. The problem of
collecting and analyzing data coming from different sources is also known as Data Fusion.
The JDL model describes an information processing flow divided into
levels, see figure 1-1. The elements of the model are described in the following:
∙ Level 0 Source Preprocessing: this level performs feature extraction from the
images of video sequences.
∙ Level 1 Object Refinement: this block is mainly devoted to data alignment, data
association, tracking, and the identification of the observed entity's state.
∙ Level 2 Situation Refinement: this step is focused on the assessment of the relation-
ships among different observed entities with respect to the environmental conditions.
∙ Level 3 Impact Assessment: at this stage the prediction of possible future events,
with respect to the previous level, is performed.
∙ Level 4 Process Refinement: this level is mainly devoted to balancing the system's
event recognition capabilities and optimizing its resources.
∙ Level 5 Cognitive Refinement: the output of the whole system can be affected (i.e.
refined) by how humans perceive that output and by their intervention and control.
This process provides a direct link with the human interpretation of the
scene.
∙ Database Management System: it is formed by the Fusion Data Base (i.e. the memory
of the architecture) and the Support Data Base (which supports the management of the stored data).
It has the objective of monitoring, evaluating, adding, updating, and providing information
for the fusion processes.
∙ Human/Computer interface: it is the part that provides an interface for human input
and the communication of fusion results to operators and users. This part is extremely
important in order to create a closed operator-machine-environment loop. Such a cycle
describes the way the data fusion architecture acquires, processes and provides
feedback in order to support the user during his work, figure 1-2.
Figure 1-1: JDL’s structure for data fusion [116].
Figure 1-2: Operator-machine-environment loop; the machine is represented by the JDL-based system.
1.1.2 Formulation of the problem
Certainly, in many respects such an artificial mechanism reflects natural information
processing, but it cannot be considered a cognitive system. A framework, to be defined
as a cognitive architecture, has to specify parts that match human brain capabilities
(such as perception, learning, reasoning and attention). However, there are still open
problems: natural systems encode information and store it in order to drive their
behaviours, and a JDL system usually stores a large amount of data in order to accomplish its
tasks. JDL can be seen as a framework that produces results mainly oriented to helping the human
operator. In order to accomplish this objective, situation awareness capabilities (such as
perceiving, memorizing, learning, reasoning, understanding, focusing attention and predicting) must be
included within the structure.
Figure 1-3: Self-aware oriented system proposed by Endsley [34]. The model of situation awareness in dynamic decision making is formed as follows: perception (level 1), comprehension of the situation (level 2), prediction (level 3); after that, decision and action close the loop with the environment.
The JDL model does not clearly represent any self-aware mechanisms. However, such
a structure can be mapped onto a self-aware oriented architecture, see figure 1-3. Perception,
comprehension and prediction are strongly correlated, and these interconnections cannot
be ignored when building plausible artificial cognitive systems. Additionally, for
human beings, perception and action are the result of previously recorded experiences
[89]. This is the main reason why the comprehension block is extremely important in the
decision-making process. In this scenario, attention is particularly decisive in determining
which of the data acquired from the sensors are relevant for guiding the cognitive processes.
Attention mechanisms raise the difficult question of how information is represented by
neuronal populations of different sizes. This is an important issue for
cognitive science in evaluating the computational properties of neurons in decoding
and interpreting external stimuli [108]. For example, the system could choose an
optimal information representation model in accordance with the relevant environmental
information.
1.2 Specific contributions to progress in scientific and technological state-of-the-art
The innovative contribution of this thesis consists in studying, designing and validating
a complete Cognitive Dynamic System architecture which combines cognition theories
and video analytics methods. Such a structure realizes a bio-inspired system for video
surveillance oriented to crowd monitoring. The proposed framework is characterized by
the following components:
∙ Cognitive Surveillance Node: figure 1-4 shows the proposed structure scheme based on
the cognitive cycle and presents the relations among the different development
blocks. The Event Detection, Situation Assessment and Prediction blocks reflect the characteristics
of the self-aware module shown in figure 1-3. In addition, Memory, Attention,
Information Refinement and Reconstruction modules have been included in the system.
These functional blocks and algorithms have been published in [22] [14]
[18] [21] [20] [19] [15] [16].
Figure 1-4: Proposed Cognitive Node structure.
– Analysis block: the bio-inspired structure, figure 1-4, is formed by Event
Detection, Situation Assessment and Prediction. Basically, the analysis function is able to encode and synthesize signals, not only for the description of
single entities behaviours, but also for modelling cause-effect relationships be-
tween user actions (e.g. opening of gates in order to control people flows) and
changes in crowd configurations (e.g. the number of people in controlled areas).
Such models are stored, during a learning phase, within a sub-part of the memory
block, namely the Autobiographical Memory (AM), which is derived from the
neuroscientist Antonio Damasio's studies [28]. The system, by observing cause-
effect relationships, can acquire the human (i.e. operator) capability of analyz-
ing and reacting to different situations. Such a procedure can be considered
as an effective knowledge transfer from an operator to the system. This mech-
anism is relied on an automatic systems called Cognitive Surveillance Node
(CSN), which is part of a complex cognitive JDL-based and bio-inspired ar-
chitecture. After such a knowledge-transfer phase, learned representations can
be used, at different levels, firstly it can support human decisions, by detecting
anomalous interaction models and thus compensating for human shortcomings.
Secondly, considering an automatic decision scenario, the learned representa-
33
tions can identify anomalous patterns and choose the best strategy in order to
preserve stability of the entire system.
– Attention Focusing Module: this is a bio-inspired algorithm called Selective Attention
Automatic Focus (S2AF), which is part of a more complex Cognitive Dynamic
System (CDS) for crowd monitoring. The proposed algorithm gathers
information related to different parts of the environment and then stores
the knowledge within the so-called Environmental Memory, which is involved
in the recognition of external stimuli. The main objectives of the proposed
method are: first, to extract the relevant information needed for crowd monitoring
directly from the environmental observations; secondly, the S2AF can enhance
the cognitive system's capabilities not only for recognizing a specific situation
but also for changing its behaviour in accordance with new circumstances and
for localizing the important parts of a controlled area.
– Information Refinement for relevant knowledge representation: when the
amount of information increases (e.g., due to a larger number of monitored areas),
attention focusing techniques are needed to highlight the most relevant parts
of the overall acquired data. When wide-area surveillance systems are considered,
one of the major problems in event detection is the representation of
the scene as a whole, given a set of spatially limited, but relevant,
observations. The main aim of the algorithm is to propose an innovative
representation method for sparse data, based on information theory. By means of
Information Bottleneck theory [125], it is possible to determine the optimal data
representation in the memory space.
– Adaptive reasoning algorithm for information reconstruction: advanced
mechanisms for the selection and extraction of salient information can improve
the performance of autonomous video surveillance systems and support the
operator. Such a method keeps the extracted environmental information
limited. However, recovery methods are needed in order to maintain full
control of the whole monitored area. The proposed framework is an adaptive
inductive reasoning (top-down logic) mechanism. The method can learn
the correlations of the observed data, storing them in the memory; after that it
can recover the whole information with minimum distortion.
∙ Crowd simulator: the following contributions have been published in [17]
– Simulator of virtual crowds: the proposed tool can simulate the movement of
a large number of entities or characters. The simulator presents a 3D visualization
of virtual people; moreover, it offers the possibility to configure a specific
environment for the simulation of different crowding situations. Such a tool
is essential in order to gather enough data for the training of the CDS
(i.e. the acquisition of stability models) and for validation, as video sequences of the
desired kind are often not available for training. The environment in the simulator
can be described by means of topological representations, providing logical
configurations of dynamic entity-environment (e.g., crowd-environment)
and entity-entity (e.g., operator-crowd) interactions.
1.3 Structure of the thesis
The thesis is organized as follows.
In the present introduction the topics of this work have been presented, with specific emphasis
on the contributions of this thesis.
Chapter 2 explores human cognition mechanisms, with specific reference to the Perception-Action
Cycle, which provides the fundamental basis for designing artificial computational
processes that act like cognitive systems. Then an overview of the evolution of
cognitive architectures is given.
In chapter 3 a bio-inspired structure, namely the Cognitive Surveillance Node, is proposed.
The presented framework is able to encode and synthesize signals such as entity behaviours.
In particular, the chapter proposes an innovative algorithm for modelling a biological structure:
the Autobiographical Memory. Such a structure can describe interaction relationships
between multiple entity behaviours, such as the operator's (i.e. user actions) and changes in
the environment (i.e. the crowd configurations).
Chapter 4 is devoted to the analysis of the main features that allow the design of a probabilistic
model able to learn the aforementioned interactions between operator and crowd. The
chapter also explains the visual 3D simulator, where crowd behaviour is modelled by means of Social
Forces. Then, the fundamental application of the proposed CSN to automatic
crowd monitoring and anomaly detection is discussed and demonstrated by experiments
on simulated and real video sequences.
In chapter 5 an algorithm, based on the cognitive perception process, for designing a human
attention mechanism (i.e. the S2AF) is proposed. Such an algorithm enhances the CSN's capabilities
for relevant information detection and extraction within a surveillance network.
Chapter 6 presents a novel relevant-information representation for sparse and limited
observations. An innovative hierarchical structure for knowledge representation based on
artificial neural networks, the Self Organizing Maps (SOMs), has been used for classifying
and correlating the sparse time series acquired by cameras. In the chapter,
through Information Bottleneck theory, a cost-function approach is proposed in order
to determine the optimal data representation in the SOM space as a trade-off between
sparse-signal classification capability and the preservation of the original data's statistical
similarities (i.e. correlations). By means of the proposed cost function, an algorithm
for knowledge representation, called information bottleneck-based SOM selection
(IB-SOM), is described.
In chapter 7 a bio-inspired logical process for information recovery in cognitive crowd
monitoring is described. In the proposed method the relevant information, identified
directly from environmental observations, is defined through the learning of the relationships
among sensory data. Unlike other methods for relevance detection, the proposed
algorithm shows how, by using SOMs, it is possible to learn the data correlations. Moreover,
it is shown how, through such a procedure, it is possible to reduce the number of
environment parts to observe; using such limited observations the proposed system can
reconstruct the whole information.
Finally, the conclusions and future works are drawn in chapter 8.
Chapter 2
State of the art
Cognitive science and neuroscience aim at understanding and explaining human cognition
[46]. In recent decades, engineering has been learning these principles from neuroscience.
Of particular interest is the human capability of adapting to situations.
This characteristic can be very valuable, especially in non-stationary stochastic
environments [57]. The human ability to change behaviour in accordance with new
circumstances can be seen as the result of the Perception Action Cycle (PAC),
formalized by Joaquin Fuster [44]. This mechanism describes how the actions of a cognitive
entity are driven by the perception of the external environment. Fuster explains the central role
of working memory in planning and decision-making processes. Moreover, such cognitive
functions are involved in the organization of behaviour, language, and reasoning [25]. In the
literature, other models are devoted to describing human cognition: among the most important is
the Learning Intelligent Distribution Agent (LIDA) cognitive cycle [41]. LIDA's cognitive
cycle, figure 2-1, consists of multiple modules, which can be described by four main stages:
perception, memory retrieval, conscious broadcast and action.
∙ Perception: describes the human brain process of making sense of the current scene.
∙ Memory Retrieval: different memories are involved in the working of higher-level
minds, figure 2-1. Episodic memory is the memory of autobiographical events. It
comes in two forms: the transient, which lasts only hours or a day, and the
declarative, which can last a lifetime.
∙ Conscious Broadcast: LIDA's workspace implements the preconscious buffers of
working memory. The conscious broadcast from the workspace recruits the appropriate
resources with which to deal with the current situation. The content of consciousness
includes a compact representation of that situation.
∙ Action: a sensory-motor automatism is instantiated in response to the chosen action,
driven by sensory information. Such sensory information comes directly from
sensory memory without the aid of the consciousness mechanism.
Figure 2-1: The scheme of LIDA’s cognitive cycle.
The main modules of the LIDA’s cognitive cycle are:
∙ Sensory Memory: it holds incoming sensory stimuli. Such stimuli are external (i.e.
generated by the environment) or internal (i.e. generated by internal processes).
∙ Perceptual Associative Memory: perception involves the recognition of incoming
sensory stimuli, categorization, identification and situation awareness.
∙ Workspace: LIDA's workspace can be seen as a “container” or buffer of the current
percept and earlier percepts, i.e. long-term working memory. The workspace can
contain local associations retrieved from transient episodic memory or from declarative
memory.
∙ Transient Episodic Memory: the transient episodic memory holds events, which
decay after a few hours or a day.
∙ Declarative Memory: it is often called long-term episodic memory. The declarative
memory can be subdivided into the Autobiographical Memory of events and the
Semantic Memory. In declarative memory, events can last for years or a whole
lifetime.
∙ Attentional Codelets Memory: this module gathers all the attention codelets. The
codelets bring specific information to consciousness; in other words, an attention
codelet supports the decision to do something.
∙ Global Workspace: it contains the most relevant, important, urgent and insistent
information, i.e. that of most concern to the agent at a specific moment.
∙ Procedural Memory: it is where the entity stores the various procedures that it may
choose to perform.
∙ Action Selection: the mechanism that chooses the appropriate action to take at each
instant in response to the current situation.
∙ Sensory-Motor Memory: holds the sensory-motor automatisms used to execute the chosen actions.
Many cognitive architectures are based on the previous cognition theories. In this chapter, an
overview of cognitive science aspects for the design of intelligent systems and a discussion
of fundamental cognitive systems are presented.
2.1 From neuroscience to cognitive architectures
Cognitive systems are designed by implementing of human characteristics (such as per-
ception, learning, attention) within an artificial information processing framework. The
modern studies, on such systems, are more focused on information processing which must
reflect the human reasoning processes. A lot of works, in the last decades, have been
devoted to design methodical procedures (or algorithms) that underlie the acquisition, rep-
resentation, processing, storage, communication and access to information. Among these,
it is possible to mention works by Alan Turing on computing machinery and intelligence,
by John McCarthy, Nathaniel Rochester, Marvin Lee Minsky and Claude Elwood Shannon
all founders of the discipline of artificial intelligence [82]. The progresses in artificial in-
telligence, cognitive science have led to realize artificial architectures that act like certain
biological systems. The term “architecture” implies an approach that attempts to model
not only behaviour, but also structural properties of the modelled system. More in general,
the cognitive systems are a sub part of the agent architectures. In this context, machine
learning techniques, such as the reinforcement learning [124], can be examined from the
neuroscience perspective. A reinforcement learning agent can interact with the environ-
ment. At each discrete time t, the agent acquires an observation 𝑜𝑡 and a reward 𝑟𝑡. It
then chooses an action 𝑎𝑡 from the set of possible actions, as shown in figure 2-2. The
environment, as consequence, changes its state, the new state is 𝑠𝑡+1 and the new reward
is a function 𝑟𝑡+1 = 𝑓(𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1), which combines the performed action 𝑎𝑡 with the state
transition 𝑠𝑡 ↦→ 𝑠𝑡+1. The goal of reinforcement learning agent is to maximize as much as
possible the amount of rewards that it receives. The so called goal-directed action mecha-
nism is realized by reward and it is driven by the perception. Basically, the entity receives
rewards from the environment so that it can assess if its goal is accomplished.
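The agent-environment loop described above can be sketched as follows. The toy transition and reward functions below are illustrative assumptions, chosen only to make the mechanism 𝑟𝑡+1 = 𝑓(𝑠𝑡, 𝑎𝑡, 𝑠𝑡+1) concrete.

```python
# Minimal sketch of the reinforcement learning perception-action loop.
# The environment, its transition and its reward are toy assumptions.

def reward(s_t, a_t, s_next):
    """f(s_t, a_t, s_{t+1}): reward any action that moves the state toward 0."""
    return 1.0 if abs(s_next) < abs(s_t) else 0.0

def step(s_t, a_t):
    """Toy deterministic environment: the action shifts the state."""
    return s_t + a_t

def run_episode(s0, policy, horizon=10):
    """The agent observes, acts, and accumulates the reward it receives."""
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s)            # choose a_t from the current observation
        s_next = step(s, a)      # the environment changes its state
        total += reward(s, a, s_next)
        s = s_next
    return s, total

def greedy(s):
    """A goal-directed policy: always move toward state 0."""
    return -1 if s > 0 else (1 if s < 0 else 0)
```

Here the reward signal is exactly what lets the entity assess whether its goal (reaching state 0) is being accomplished: the accumulated total grows only while the policy makes progress.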
Figure 2-2: The Perception Action Cycle and Reinforcement Learning.
The PAC implies that a cognitive architecture is divided into two parts: a perceptor and an
actuator. The perceptor perceives the environment and, through internal mechanisms, the
cognitive system can tune its internal state in accordance with the external stimuli. The
actuator acts directly on the environment in response to the perception. These
concepts imply that a CDS must have an internal signal, realized by means of feedback,
which connects the perceptor to the actuator, figure 2-3. It can be noted that in a
cognitive architecture there is a remarkable separation between internal signals (i.e. feedback) and
external signals (i.e. observations), which can be considered an essential element
distinguishing cognitive systems from other artificial intelligence frameworks. This concept is
the key of Haykin's cognitive architecture [53], inspired by the PAC paradigm. Furthermore,
the concept of self-organization is central to the description of biological systems. The
definition of such a process comes from studies of physical, chemical and mathematical
systems; there are also other examples in the natural, social and economic sciences.
Basically, the “principle of the self-organizing dynamic system” was formalized in 1947
by Ashby [6]; it explains how a dynamic system evolves from non-equilibrium points to
a state of equilibrium (i.e. an attractor). This evolution implies a mutual dependency among
all the sub-parts (i.e. sub-systems or components) of the dynamic system: each part must
adapt to the other components. A link from the self-organizing dynamics of ergodic systems to
the inferential nature of human perception has been mathematically formalized by Friston
et al. in [42].
Figure 2-3: Functional block diagram that shows how the Cognitive Dynamic System perceives the environment in which it operates.
In the next sections, an overview and discussion of some well-known cognitive architectures
are presented (see section 2.2). Then, an insight into Haykin's system and into the dynamics
of cognitive systems proposed by Karl Friston, Biswa Sengupta and Gennaro Auletta is
given (see section 2.3).
2.2 Overview of cognitive architectures
Cognitive architectures, as mentioned above, propose (artificial) computational
mechanisms that act like certain cognitive systems. It is possible to summarize some
properties which characterize and distinguish cognitive frameworks as follows:
∙ Implementation: an architecture can describe aspects of cognitive behaviour
separately (e.g. learning, problem solving, reasoning, attention, etc.), or the cognitive
properties may be viewed as wholes, not as collections of parts (i.e. Holism) [7].
The Unified Theory of Cognition (UTC) by Allen Newell [93] is based on Holism
concepts.
∙ Learning: not all cognitive architectures implement a learning phase, which is
devoted to synthesizing information in order to build or reinforce the knowledge
of the system. One of the most important learning processes derives from Hebb's
rule, introduced by Donald Hebb (1949) [58]. Hebbian theory describes the
adaptation of neurons in the brain during the learning process.
∙ Parameters: some frameworks implement self-tuning capabilities, which derive
from control theory. These mechanisms can optimize internal system parameters
in order to maximize or minimize specific objective functions.
∙ Architectural levels: some architectures are composed of sub-parts (or sub-architectures),
often called layers (or levels). Each level implements a function, a mechanism,
or a kind of information representation.
∙ Internal vs. Perception Action: some architectures model only internal
information processes, which can include reasoning, planning, problem solving
and concept learning. Other cognitive frameworks expand the description of
cognition by encompassing perception and action in the system.
∙ Task-based selection: this characteristic implements a “switching mechanism”: the
system is able to process information by selecting specific modules or components
according to the current task.
∙ Fixed structure: the cognitive architecture is fixed, e.g. the links among the sub-parts
cannot be modified and only the information stored in the levels can change.
∙ Growing structure: these frameworks adapt their dimension by increasing the number
of units and the links among them in order to improve the information representation.
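Hebb's rule, mentioned in the Learning item above, can be sketched in its simplest form as Δ𝑤𝑖 = 𝜂 𝑥𝑖 𝑦; the learning rate and vector sizes below are illustrative choices, not part of Hebb's original formulation.

```python
# Minimal sketch of Hebb's rule: a synaptic weight grows when pre- and
# post-synaptic activities occur together ("fire together, wire together").
# The learning rate eta is an illustrative assumption.

def hebbian_update(w, x, eta=0.1):
    """One Hebbian step: y = w . x, then w_i <- w_i + eta * x_i * y."""
    y = sum(wi * xi for wi, xi in zip(w, x))   # post-synaptic activity
    return [wi + eta * xi * y for wi, xi in zip(w, x)]
```

Repeated presentation of the same input strengthens the weights along that input, which is the neuronal adaptation Hebbian theory describes; practical variants add a normalization or decay term to keep the weights bounded.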
Cognitive architectures can be divided into three main categories: symbolic, connectionist
and hybrid systems. In figure 2-4 a timeline of the evolution of
cognitive architectures is shown.
2.2.1 Symbolic architectures
Symbolic systems are generically formed by “signs”, which are used to communicate
various messages. They are production systems: the behaviours of the cognitive
entity (i.e. the system) are typically described by rules. The variables and procedural
mechanisms do not refer to the semantic content of the data representations. Among the most
Timeline (1980–2015):
1983 Soar, by J. Laird, A. Newell, and P. Rosenbloom [65]
1983 Adaptive Control of Thought-Rational (ACT-R), by J. R. Anderson [4]
1990 Self Organizing Maps (SOMs), by T. Kohonen [62]
1991 Neural Gas (NG), by T. Martinetz and K. J. Schulten [79]
1995 Growing Neural Gas (GNG), by B. Fritzke [43]
1999 Instantaneous Topological Map (ITM), by J. Jockhusch and H. Ritter [59]
2002 Growing Hierarchical SOM (GH-SOM), by A. Rauber, D. Merkl, and M. Dittenbach [105]
2008 Extended Soar, by J. E. Laird [64]
2014 Cognitive Dynamic System (CDS), by S. Haykin and J. Fuster [53]
Figure 2-4: Some of the more important cognitive architectures.
important architectures belonging to this category is Soar, based on the UTC. This technique
has been developed by John Laird, Allen Newell, and Paul Rosenbloom. The first
studies date back to 1983, and the official presentation of Soar came in a paper in 1987 [65]. Soar
is based on a set of “if ... then” rules which provide a comprehensive description of
how a system searches for the states which bring it gradually closer to its goal (i.e. stability,
or goal state). In many cases it may be impossible to define contextual rules, e.g. when
the system interacts with an unknown environment. In order to avoid this problem, reasoning
techniques, such as reinforcement learning, have been used. A recent extended version of
the Soar theory, presented by John E. Laird (2008), has included non-symbolic representations
[64]. Another fundamental framework is Adaptive Control of Thought-Rational (ACT-R).
It was inspired by the work of Allen Newell and created by John R. Anderson in 1983
[4]. Basically, it is divided into two modules: Perceptual-motor and Memory. The system
interacts with the environment through the perceptual-motor module. The memories comprise
the Declarative memory, a set of facts, and the Procedural memory, basically a
set of production rules about “how to do something”. The buffers represent the front-ends of the
modules, which communicate through them.
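The “if ... then” production-rule idea behind Soar can be illustrated with a minimal sketch; the integer state, the goal and the two rules below are toy assumptions and do not reflect Soar's actual rule syntax or matching machinery.

```python
# Illustrative sketch of a production system in the "if ... then" spirit of
# Soar: rules fire when their condition matches the current state, moving
# the system step by step toward its goal state. Rules and state are toy
# assumptions, not Soar's actual representation.

GOAL = 0

RULES = [
    (lambda s: s > GOAL, lambda s: s - 1),   # if state above goal then decrease
    (lambda s: s < GOAL, lambda s: s + 1),   # if state below goal then increase
]

def solve(state, max_steps=100):
    """Fire the first matching rule repeatedly until the goal state is reached."""
    for _ in range(max_steps):
        if state == GOAL:
            break
        for condition, action in RULES:
            if condition(state):
                state = action(state)
                break
    return state
```

Each fired rule brings the state gradually closer to the goal, which is the search behaviour the text attributes to Soar; when no rule can be written for an unknown environment, this is exactly where reasoning techniques such as reinforcement learning are substituted.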
2.2.2 Connectionism architectures
Connectionism represents an approach to designing cognitive architectures based
on the interconnection of simple units (i.e. neurons) in a network. One of the most popular
frameworks relying on such a mechanism is the Artificial Neural Network (ANN). The
ANNs are based on Hebb's rules: each external stimulus fires (i.e. activates) neurons so that
it can be interpreted and understood. They are basically composed of a matrix of neurons
(𝑁 = 𝑛 × 𝑚), and the knowledge representation depends on the weights and on the
connections among the units, acquired through learning. The ANNs can be classified by:
a. Interpretation of neuronal activations: the meaning of a stimulus can depend on the
activation of a single neuron or of a group of neurons.
b. Definition of activation: neuronal activities produce consequences. Such results
associate a definition to each activation pattern; e.g. in the Boltzmann machine [2] (1985)
the activation is connected with a probability of generating an action.
c. Learning algorithm definition: the ANNs can be distinguished by their different
learning methods. Competitive learning is based on a distance computed between
the input 𝑥 and the neuron weights 𝑤𝑖, where 𝑖 ∈ {1, · · · , 𝑁}. Input and weight
vectors must have the same dimensions. The neuron whose weight is most similar
to the input, 𝑢 = arg min𝑖 ‖𝑥 − 𝑤𝑖‖, is called the best matching unit (BMU) 𝑤𝑢. During
the learning phase, the neurons are updated by a coefficient 𝛼(𝑠) which decreases
at each step 𝑠. The general update formula is
𝑤𝑖(𝑠+1) = 𝑤𝑖(𝑠) + 𝜃(𝑠)(𝑖, 𝑢) 𝛼(𝑠) (𝑥 − 𝑤𝑖(𝑠)),
where 𝜃(𝑠)(𝑖, 𝑢) is the neighbourhood function which gives the distance between the
neuron 𝑢 and the neuron 𝑖 at step 𝑠. Each ANN uses different choices for 𝛼(𝑠) and
𝜃(𝑠)(𝑖, 𝑢).
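As an illustration, the competitive-learning update above can be sketched in a few lines of Python (a minimal sketch: the linear decay of 𝛼(𝑠) and the Gaussian index-distance form of 𝜃(𝑠)(𝑖, 𝑢) are assumed choices for this example, not prescribed by any particular network):

```python
import numpy as np

def competitive_update(weights, x, s, n_steps):
    """One competitive-learning step: find the BMU and pull weights toward x."""
    # Best matching unit: u = argmin_i ||x - w_i||
    dists = np.linalg.norm(weights - x, axis=1)
    u = int(np.argmin(dists))
    # Assumed schedules: alpha(s) decays linearly with the step s,
    # theta(s)(i, u) is a Gaussian of the index distance |i - u|.
    alpha = 1.0 - s / n_steps
    sigma = max(1.0, (1.0 - s / n_steps) * len(weights))
    idx = np.arange(len(weights))
    theta = np.exp(-((idx - u) ** 2) / (2.0 * sigma ** 2))
    # w_i(s+1) = w_i(s) + theta(s)(i, u) * alpha(s) * (x - w_i(s))
    new_weights = weights + theta[:, None] * alpha * (x - weights)
    return new_weights, u
```

Note that at 𝑠 = 0 the BMU is moved exactly onto the input, since 𝛼(0)·𝜃(𝑢, 𝑢) = 1; later steps make smaller and more local corrections.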
Some very famous ANNs are: the Self Organizing Map (SOM) [62], created by Kohonen
(1990); the Neural Gas (NG) [79], introduced by Thomas Martinetz and Klaus J. Schulten
(1991); and one of the most important adaptive ANNs, the Growing Neural Gas (GNG) algorithm
[43], created by Bernd Fritzke (1995). Inspired by the GNG, a growing neural network,
the Instantaneous Topological Map (ITM) [59], was developed by J. Jockusch and H.
Ritter (1999). The Growing Hierarchical SOM (GH-SOM) [105], created by A. Rauber,
D. Merkl, and M. Dittenbach (2002), combines the capability of increasing both the number
of neurons and the number of layers.
2.2.2.1 Self Organizing Map (SOM)
The Self Organizing Map (SOM) [62] is an unsupervised classifier. During the learning
phase, the neurons of the network are tuned through the input stimuli. SOMs have been
widely used in different pattern recognition applications, such as face recognition [63].
The SOM can produce a spatial organization of neuronal units, according to the inputs,
into a SOM-layer (or SOM-map), as in figure 2-5. A SOM network is defined by a fixed
low-dimensional (usually 2-D) structure of neuronal units 𝑖 = 1, · · · , 𝑁 (i.e. nodes) and
connections 𝑗 (i.e. connecting edges). Each neuron is characterized by a weight vector
𝑤𝑖 = [𝑤𝑖,1, 𝑤𝑖,2, · · · , 𝑤𝑖,𝑀 ]. For each input vector 𝑥 = [𝑥1, 𝑥2, · · · , 𝑥𝑀 ] the adaptation of
the SOM network (i.e. learning phase) is performed in 2 steps:
∙ Node Matching: the nearest node 𝑢 (i.e. the best match unit) to the input vector 𝑥
and its neighbourhood 𝑁𝑒𝑖𝑔ℎ(𝑢) are selected.
∙ Node Adaptation: the BMU and its topological neighbours 𝑁𝑒𝑖𝑔ℎ(𝑢) are modified
according to a monotonically decreasing neighbourhood function 𝜃(𝑠)(𝑖, 𝑢).
The accuracy of the SOM mapping basically depends on two parameters: the number of
learning steps (a common choice is 500 times the number of nodes in the network) and the
decreasing rule of 𝜃(𝑠)(𝑖, 𝑢).
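The two adaptation steps can be sketched as follows (a minimal sketch: the grid size, the Gaussian neighbourhood and the linear decay schedules are assumed choices; only the rule of 500 steps per node comes from the text above):

```python
import numpy as np

def train_som(data, grid_shape=(5, 5), n_steps=None):
    """Minimal SOM: nodes on a fixed 2-D grid, weights adapted toward inputs."""
    rows, cols = grid_shape
    n_nodes = rows * cols
    if n_steps is None:
        n_steps = 500 * n_nodes          # common choice cited in the text
    rng = np.random.default_rng(0)
    weights = rng.random((n_nodes, data.shape[1]))
    # Fixed grid coordinates define the topological neighbourhood Neigh(u).
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for s in range(n_steps):
        x = data[rng.integers(len(data))]
        # Node matching: select the best matching unit u.
        u = int(np.argmin(np.linalg.norm(weights - x, axis=1)))
        # Node adaptation: monotonically decreasing rate and radius.
        frac = 1.0 - s / n_steps
        alpha = 0.5 * frac
        sigma = max(0.5, frac * max(rows, cols) / 2.0)
        grid_d2 = np.sum((coords - coords[u]) ** 2, axis=1)
        theta = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        weights += (alpha * theta)[:, None] * (x - weights)
    return weights
```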
Figure 2-5: Example of a self organizing map structure: the SOM-layer is formed by a 2-D fixed grid of 𝑁 neurons, while the input vector is defined as 𝑥 = [𝑥1, 𝑥2, · · · , 𝑥𝑛].
2.2.2.2 Neural Gas (NG)
The NG is an ANN inspired by Kohonen’s SOM and introduced in 1991 by Thomas Martinetz
and Klaus Schulten [79]. The particular name of this algorithm, “neural gas”, derives
from the assumption that, during the adaptation process, the dynamics of the feature vectors
are similar to the distribution of gas molecules within a space. Given a probability distribution
𝑃 (𝑥) of input vectors 𝑥 and considering a set of nodes 𝑖 = 1, · · · , 𝑁 represented by
weights 𝑤𝑖, the algorithm consists of 2 steps:
∙ Node Matching: the nearest node 𝑢 (i.e. the best match unit) to the input vector 𝑥
and its neighbourhood 𝑁𝑒𝑖𝑔ℎ(𝑢) are selected.
∙ Node Adaptation: the BMU and its topological neighbours 𝑁𝑒𝑖𝑔ℎ(𝑢) are modified
according to a monotonically decreasing neighbourhood function 𝜃(𝑠)(𝑖, 𝑢), an
adaptation step size 𝜀 and a neighbourhood range 𝜆. Both the parameters 𝜀 and 𝜆
are decreased at each time step.
The NG performance depends on the choice of the parameters and on the decreasing rule of
the neighbourhood function.
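A compact sketch of the NG adaptation (the rank-based neighbourhood h = exp(−k/𝜆) and the exponential decay of 𝜀 and 𝜆 between initial and final values are common choices, assumed here for illustration):

```python
import numpy as np

def neural_gas_step(weights, x, eps, lam):
    """One Neural Gas step: every node adapts by its distance rank to x."""
    dists = np.linalg.norm(weights - x, axis=1)
    ranks = np.argsort(np.argsort(dists))        # rank 0 is the BMU
    h = np.exp(-ranks / lam)                     # rank-based neighbourhood
    return weights + eps * h[:, None] * (x - weights)

def train_neural_gas(data, n_nodes=10, n_steps=2000,
                     eps=(0.5, 0.05), lam=(10.0, 0.5)):
    """Train NG with eps and lam decaying from initial to final values."""
    rng = np.random.default_rng(0)
    weights = rng.random((n_nodes, data.shape[1]))
    for s in range(n_steps):
        t = s / n_steps
        eps_s = eps[0] * (eps[1] / eps[0]) ** t   # decreasing step size
        lam_s = lam[0] * (lam[1] / lam[0]) ** t   # shrinking neighbourhood
        x = data[rng.integers(len(data))]
        weights = neural_gas_step(weights, x, eps_s, lam_s)
    return weights
```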
2.2.2.3 Growing Neural Gas (GNG)
The Growing Neural Gas [43] is an incremental network that is able to learn topological relations
in a given set of input data by means of Hebb’s rules [78]. The advantage of growing
networks with respect to a fixed layer of neurons is the self-adaptation mechanism. Such an
algorithm can be seen as a modified Kohonen learning which combines the updating of neuron
positions with the possibility of introducing new units in the layer. A GNG starts with a small
set of nodes and builds appropriate connections between them during a training phase, inserting
new nodes in the regions where the approximation error is largest. The principal elements of
the algorithm are:
∙ Set of nodes 𝑖 represented by weights 𝑤𝑖 and error measures 𝑒𝑖
∙ Set of edges or connections 𝑗 whose age values are 𝑎𝑗
For each input vector 𝑥 the adaptation of the GNG network is performed by the following
4 steps:
Figure 2-6: Example of GNG adaptation to input stimuli. On the left the initial situation and on the right the final topology are shown.
∙ Node Matching: given an input vector 𝑥, the nearest node 𝑢 and the second nearest
node 𝑠 are selected.
∙ Node Adaptation: the nearest node 𝑢 and its topological neighbours 𝑁𝑒𝑖𝑔ℎ(𝑢) are
modified according to a specific adaptation strategy.
∙ Edge update: an edge connecting 𝑢 and 𝑠 is created if it does not already exist. The
age of this edge is initialized to zero and the age of all other edges is increased. All
the edges whose age is greater than a threshold 𝑎𝑚𝑎𝑥 are removed.
∙ Node update: the error measure 𝑒𝑢 of the nearest node 𝑢 is increased. A new node is
created every 𝜆 updating steps in the zone where a node 𝑞 and its neighbours have
the greatest error values in the network. In order to avoid an unbounded growth of the
number of neurons, the error measures of all nodes are multiplied by a decay factor.
The threshold 𝑎𝑚𝑎𝑥 is the most critical parameter of the GNG algorithm: an incorrect choice
can provoke the wrong deletion of useful connections.
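The four steps above can be sketched as follows (a simplified illustration: all parameter values are assumptions chosen for the sketch, and the removal of nodes left without edges after pruning is omitted):

```python
import numpy as np

def gng_step(W, E, edges, ages, x, step, eps_b=0.2, eps_n=0.006,
             a_max=50, lam=100, alpha=0.5, d=0.995):
    """One simplified GNG step. W: list of weight vectors; E: list of errors;
    edges: set of frozenset node pairs; ages: dict edge -> age."""
    # Node matching: nearest node u and second nearest node s.
    dists = [float(np.linalg.norm(w - x)) for w in W]
    u, s = [int(i) for i in np.argsort(dists)[:2]]
    # Node update (part 1): increase the error measure of the winner.
    E[u] += dists[u] ** 2
    # Node adaptation: move u and its topological neighbours toward x.
    W[u] += eps_b * (x - W[u])
    for e in list(edges):
        if u in e:
            (m,) = e - {u}
            W[m] += eps_n * (x - W[m])
            ages[e] += 1
    # Edge update: (re)create the u-s edge with age 0, prune old edges.
    e_us = frozenset((u, s))
    edges.add(e_us)
    ages[e_us] = 0
    for e in [e for e in edges if ages[e] > a_max]:
        edges.discard(e)
        del ages[e]
    # Node update (part 2): every lam steps, insert a node in the
    # highest-error region, between node q and its worst neighbour f.
    if (step + 1) % lam == 0:
        q = int(np.argmax(E))
        nbrs = [int(m) for e in edges if q in e for m in e - {q}]
        if nbrs:
            f = max(nbrs, key=lambda m: E[m])
            W.append((W[q] + W[f]) / 2.0)
            E[q] *= alpha
            E[f] *= alpha
            E.append(E[q])
            old = frozenset((q, f))
            edges.discard(old)
            ages.pop(old, None)
            new = len(W) - 1
            for pair in ((q, new), (f, new)):
                e = frozenset(pair)
                edges.add(e)
                ages[e] = 0
    # Decay all accumulated errors to bound the growth.
    for i in range(len(E)):
        E[i] *= d
```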
2.2.2.4 Instantaneous Topological Map (ITM)
The NG, GNG and SOM algorithms make the strong assumption that the training data are
uncorrelated. The Instantaneous Topological Map (ITM) algorithm [59] overcomes this limitation:
it can create maps by using strongly correlated input sequences. ITM does not
rely upon ageing or accumulated error parameters and it does not require node adaptation
(unlike other neural networks such as SOM and GNG). While SOM has a rigid topology and
adaptable nodes, ITM has rigid node positions and maps the topology of the input
data through adaptation rules. The main elements of an ITM are:
∙ Set of nodes 𝑖 represented by weights 𝑤𝑖.
∙ Set of undirected edges 𝑗 between the neurons.
For each input vector 𝑥 the adaptation of the ITM network is performed in 3 steps:
∙ Node Matching: according to the input vector 𝑥, the nearest node 𝑢 and the second
nearest node 𝑠 are selected.
∙ Edge update: an edge connecting 𝑢 and 𝑠 is built if it does not already exist. For
each node 𝑚 ∈ 𝑁𝑒𝑖𝑔ℎ(𝑢), if 𝑤𝑠 lies inside the Thales sphere between 𝑤𝑢 and 𝑤𝑚, the
link between 𝑚 and 𝑢 is removed.
∙ Node update: if the input vector 𝑥 lies outside the Thales sphere through 𝑤𝑢 and 𝑤𝑠, a
new node 𝑦 is created with 𝑤𝑦 = 𝑥, as in Fig. 2-7.
ITM outperforms the convergence speed of the SOM and GNG algorithms. Each training step
can produce and remove nodes or edges with no dependency on the network’s past history.
The Instantaneous Topological Map is a particularly suitable approach for creating
environmental maps, i.e. topologies which are able to maintain the main shape characteristics
of the environment, by using people’s trajectories.
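The ITM steps can be sketched as below; the Thales-sphere test uses the fact that a point p lies inside the sphere with diameter ab exactly when (a − p)·(b − p) < 0. The resolution threshold `e_max`, which stops nodes from being created arbitrarily close to an existing one, is an assumed simplification of the original algorithm's parameters:

```python
import numpy as np

def inside_thales_sphere(p, a, b):
    """True if p lies inside the sphere having the segment a-b as diameter."""
    return float(np.dot(a - p, b - p)) < 0.0

def itm_step(W, edges, x, e_max=0.2):
    """One simplified ITM step. W: list of weight vectors;
    edges: set of frozenset node pairs."""
    # Node matching: nearest node u and second nearest node s.
    d = [float(np.linalg.norm(w - x)) for w in W]
    u, s = [int(i) for i in np.argsort(d)[:2]]
    # Edge update: connect u and s, drop links made obsolete by s.
    edges.add(frozenset((u, s)))
    for e in [e for e in edges if u in e]:
        (m,) = e - {u}
        if m != s and inside_thales_sphere(W[s], W[u], W[m]):
            edges.discard(e)
    # Node update: create a node at x if x lies outside the u-s
    # Thales sphere and is not too close to the winner.
    if not inside_thales_sphere(x, W[u], W[s]) and d[u] > e_max:
        W.append(np.array(x, dtype=float))
        edges.add(frozenset((u, len(W) - 1)))
```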
Figure 2-7: Node creation procedure in ITM algorithm.
2.2.2.5 Growing Hierarchical-SOM (GH-SOM)
The very popular SOM has two main limitations, which are related to the static architecture
of the neuronal layer and to the limited capabilities for representing hierarchical
relations in the data. The Growing Hierarchical Self-Organizing Map [105] is designed
to overcome these limitations. The elements of the network consist of layers 𝑙
and sets of nodes 𝑖, belonging to a specific layer, represented by weights 𝑤𝑙𝑖. The network
training starts with one node, which forms layer 0 with weight 𝑤01. The weights of this
neuron are initialized as the average of all the input data 𝑋. During this phase, the deviation
of the input data, i.e. the mean quantization error mqe0 of this single unit, is computed as
follows:
\mathrm{mqe}_0 = \frac{1}{d} \sum_{x \in X} \| x - w_1^0 \|, \qquad (2.1)
where 𝑥 ∈ 𝑋 is an input vector, 𝑋 is the set of input data and 𝑑 = 𝑐𝑎𝑟𝑑(𝑋) represents
the total number of inputs. After the computation of mqe0, the training of the GHSOM starts
as a first-layer SOM. This first-layer map initially consists of a small number of units, e.g. a
grid of 𝑁1 = 2 × 2 units, where 𝑁1 represents the number of neurons contained within
layer 𝑙 = 1. The learning process is equal to the training of a SOM. The error mqe𝑖 of each
unit is then computed as follows:

\mathrm{mqe}_i = \frac{1}{d_i} \sum_{x \mapsto i} \| x - w_i^1 \|, \qquad (2.2)
where 𝑤1𝑖 is the weight vector and 𝑑𝑖 is the number of input patterns 𝑥 mapped onto unit
𝑖 within the layer 𝑙 = 1. The Mean Quantization Error MQE (capital letters) of a map is
defined as:
\mathrm{MQE}_l = \frac{1}{N_l} \sum_{i=1}^{N_l} \mathrm{mqe}_i. \qquad (2.3)
Figure 2-8: Column insertion and layer expansion in GH-SOM. Step 1: when MQE𝑙 overcomes a given threshold, a column is inserted between the node 𝑒 with the highest mqe𝑒 and its most dissimilar neighbour 𝑑. Step 2: layer expansion by adding a new map in order to represent the input data in more detail.
When, during the training, MQE𝑙 ≥ 𝜏𝑙 · mqe0, where 𝜏𝑙 is a fixed percentage (i.e. a predefined
threshold), a column is inserted between the unit 𝑒 with the highest mean quantization error, mqe𝑒,
and its most dissimilar neighbour 𝑑, see step 1 in figure 2-8. For each unit 𝑖 in layer 𝑙 which satisfies the
condition mqe𝑖 ≥ 𝜏𝑢 · mqe0, where 𝜏𝑢 indicates the desired quality of the input data
representation, a new layer is added in order to explain the input
data in more detail, see step 2 in figure 2-8.
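The two growth criteria can be made concrete as follows (a sketch following the conditions MQE𝑙 ≥ 𝜏𝑙·mqe0 and mqe𝑖 ≥ 𝜏𝑢·mqe0 as stated above; the 𝜏 values and the helper names are illustrative assumptions):

```python
import numpy as np

def mqe(unit_w, mapped_inputs):
    """Mean quantization error of one unit over the inputs mapped onto it."""
    if len(mapped_inputs) == 0:
        return 0.0
    return float(np.mean(np.linalg.norm(mapped_inputs - unit_w, axis=1)))

def ghsom_growth_decisions(X, layer_weights, tau_l=0.6, tau_u=0.1):
    """Evaluate the two GH-SOM growth criteria for one trained layer.

    Returns (grow_horizontally, units_to_expand):
    - grow_horizontally: True when MQE_l >= tau_l * mqe0 (insert a row/column);
    - units_to_expand: units i with mqe_i >= tau_u * mqe0 (add a child map).
    """
    # Layer-0 reference: a single unit whose weight is the mean of all inputs.
    w0 = X.mean(axis=0)
    mqe0 = mqe(w0, X)
    # Map every input onto its best matching unit of the layer.
    bmu = np.argmin(np.linalg.norm(X[:, None, :] - layer_weights[None], axis=2),
                    axis=1)
    unit_errors = [mqe(layer_weights[i], X[bmu == i])
                   for i in range(len(layer_weights))]
    MQE_l = float(np.mean(unit_errors))
    grow_horizontally = MQE_l >= tau_l * mqe0
    units_to_expand = [i for i, e in enumerate(unit_errors) if e >= tau_u * mqe0]
    return grow_horizontally, units_to_expand
```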
2.2.3 Hybrid architectures
Hybrid systems use a combination of methods and techniques from artificial intelligence
such as artificial neural networks, reinforcement learning and hybrid connectionist-symbolic
architectures. In this category it is possible to mention the Connectionist Learning
with Adaptive Rule Induction On-line (CLARION) framework, created by Ron Sun’s research
group. It was proposed in its early form by Sun [120] (1997), [121] (1999), [122] (2000)
and by Sun et al. [123] (2001). CLARION is a system which consists of distinct
sub-parts. The fundamental idea is to design an architecture which emphasizes bottom-up
learning. Such a framework includes and explores the interaction between explicit and implicit
cognition processes. An explicit process is the more intuitive concept: it refers to an
explicit goal, e.g. a system must activate explicit mechanisms of cognition in order to
reach a given point (i.e. an explicit goal). Each movement consists of a set of decisions which
bring the system closer to its goal. These intermediate states are the results of implicit
processes.
2.3 Haykin’s architecture and dynamics of cognitive systems
This section initially reviews the model proposed by Haykin, called Cognitive
Dynamic System, which can be considered as a hybrid structure. Secondly, according to
a recent study by Karl Friston, Biswa Sengupta and Gennaro Auletta, the causality between
perception and action will be discussed and formalized.
2.3.1 Haykin’s architecture
Haykin in [53] introduces a representation of the cognitive dynamic system that aims to
mimic brain functionalities. This representation is based on two main principles of
cognition: perception and memory. There are three main blocks in a cognitive
dynamic system: the Cognitive Perception unit, the Cognitive Control unit and the Probabilistic
Reasoning Machine unit, figure 2-9. These units form a hierarchical closed-loop feedback
system (namely the global perception-action cycle), where the environment reward plays a
critical role in how the system perceives the world.
Figure 2-9: Haykin’s Cognitive Dynamic System. Cognitive Perceptron (CP); Cognitive Control (CC); Probabilistic Reasoning Machine (PRM).
In the first part of the perception-action cycle, the Cognitive Perception unit processes the measurements
coming from the sensors. By observing the environment, the system can form an
internal representation of the external world. In this sense, the CDS is able to compute
the reward based on the perception error. The feedback connects the Cognitive Perception
unit with the Cognitive Control unit. This second block is devoted to choosing the optimum
action that the system performs according to its goal. Such an action must maximize the
next reward. According to Haykin and Fuster, the biological processes that adapt the behaviour
of an entity to the environmental dynamics are reproduced by the aforementioned
mechanism. In figure 2-9 a multi-layer structure is shown, where each layer represents a perception-action
module. By exploring one layer of the PAC it is possible to see that, after a
measurement is gathered, the representation of the external world in the Cognitive Perception
unit is updated. This representation is affected by an intrinsic error (i.e. the perception error)
due to the noisy (i.e. imperfect) observation. The uncertainty of the acquired data can be measured
by the perception error. Such a measure can be associated to a random variable 𝑋,
where 𝑥 denotes a sample value of 𝑋 and 𝑝(𝑥) is the probability density function of
the random variable 𝑋. It is possible to define a perception entropic state for a continuous
variable as follows:
H_p^k = \int_{-\infty}^{+\infty} p(x) \log \frac{1}{p(x)} \, dx, \qquad (2.4)
while, for a discrete variable the perception entropic state is:
H_p^k = \sum_{k=-\infty}^{+\infty} p(x_k) \log \frac{1}{p(x_k)}. \qquad (2.5)
Through the entropic state it is possible to compute an incremental difference as follows:
\Delta H_p = H_p^k - H_p^{k-1}, \qquad (2.6)
where ∆𝐻𝑝 represents the difference between two consecutive entropic states (i.e. at the
time instants 𝑘 − 1 and 𝑘). The Cognitive Control unit, by means of the value ∆𝐻𝑝, can select
the optimum future action. It is important to underline that the system must take into
consideration the uncertainty of the action selection process. The error of the action choices
can be computed through the control entropic state 𝐻𝑘𝑐. The Probabilistic Reasoning unit is the block
that connects the perception to the control part. It is devoted to generating a signal called
Perceptual Attention, which can be considered as the result of balancing between the perception
and the cognitive control errors. Such a signal is essential in order to maintain or restore
the stability of the system.
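Equations (2.5) and (2.6) can be computed directly; a minimal sketch, assuming the probability mass function is given as a finite vector (the natural logarithm is an assumed choice of base):

```python
import numpy as np

def perception_entropy(p):
    """Discrete entropic state: H = sum_k p(x_k) * log(1 / p(x_k))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                # terms with p(x_k) = 0 contribute nothing
    return float(np.sum(p * np.log(1.0 / p)))

def entropy_increment(p_prev, p_curr):
    """Incremental difference between two consecutive entropic states.

    A negative value means the perceptual representation has become
    less uncertain between the two time instants.
    """
    return perception_entropy(p_curr) - perception_entropy(p_prev)
```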
2.3.2 Dynamics of cognitive systems
Friston et al. in [42] formalize the neuronal self-organization processes in order to provide
an explanation of the internal dynamics of cognitive systems. The CDS with its environment
forms an ergodic Random Dynamical System, which can be described by means of a Markov
blanket [98]. A Markov blanket for a node 𝐴 in a Bayesian network is described as the set of
states formed by: the parent nodes of 𝐴, the children nodes of 𝐴 and the parents of all the children,
see figure 2-10. Based on this concept, the global state of the system (called agent) is divided
into: External Ψ (i.e. the environment), Sensory 𝑆 (i.e. sensors), Active 𝐴 (i.e. the actions)
and Internal 𝑅 (i.e. representation of information) states. Figure 2-11 shows the partition
of states into internal and external states, which are separated by a Markov blanket including
the sensory and active states. More in detail: external states 𝜓 ∈ Ψ directly influence the
sensory states 𝑠 ∈ 𝑆 and they are influenced by the actions described by means of the active
states 𝑎 ∈ 𝐴. The agent’s internal dynamics are denoted by the representational states 𝑟 ∈ 𝑅.
Such internal conditions cause actions and depend on the sensory states.
Figure 2-10: The Markov blanket of node 𝐴.
The information flow, which is motivated by cognitive theories, can be more formally
represented by the Markov blanket in terms of the functions 𝑓𝜓(𝜓, 𝑠, 𝑎), 𝑓𝑠(𝜓, 𝑠, 𝑎), 𝑓𝑟(𝑠, 𝑎, 𝑟) and
𝑓𝑎(𝑠, 𝑎, 𝑟). Such functions identify the dependencies among the variables (internal, active,
sensory and external) shown by the arrows in figure 2-11. A cognitive system can be
represented in terms of equations of motion and then, by means of the Markov blanket, it is
Figure 2-11: Global state representation of the system. The internal states 𝑟 are connected with the action states 𝑎, while the environment 𝜓 is influenced by the actions. The external states are observed by the sensors 𝑠.
possible to identify the external and internal states and to specify their dynamics. By using
Bayesian inference, Friston et al. show how to predict the internal dynamics of states, such
as the biological mechanisms of self-organization, which have many of the hallmarks of human
cognition processes.
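Using the dependencies listed above, the agent-environment dynamics can be written as coupled equations of motion (a sketch in the spirit of Friston et al.'s formulation; the random fluctuation terms 𝜔 are an assumption of this illustration):

```latex
\begin{aligned}
\dot{\psi} &= f_{\psi}(\psi, s, a) + \omega_{\psi} && \text{(external states)}\\
\dot{s}    &= f_{s}(\psi, s, a) + \omega_{s}       && \text{(sensory states)}\\
\dot{a}    &= f_{a}(s, a, r) + \omega_{a}          && \text{(active states)}\\
\dot{r}    &= f_{r}(s, a, r) + \omega_{r}          && \text{(internal states)}
\end{aligned}
```

The key structural point is visible in the arguments: the internal states 𝑟 never appear in 𝑓𝜓 or 𝑓𝑠, and the external states 𝜓 never appear in 𝑓𝑎 or 𝑓𝑟, so internal and external dynamics interact only through the sensory and active states of the Markov blanket.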
2.4 Final remarks about cognitive architectures
The previous sections have described the main classes of cognitive architectures. In
figure 2-12 the differences among the previously discussed cognitive frameworks are summarized.
Soar and ACT-R are symbolic structures: they implement human brain processes
by means of sets of rules in order to describe behaviour. ACT-R also uses both declarative
memory (i.e. a set of statements which define the behaviour) and procedural memory (a set of
procedures that can involve parameters and functions). The Extended Soar can be seen as
both symbolic and non-symbolic, since it adds machine learning techniques to the system in
order to introduce a degree of freedom.
The artificial neural networks emulate the human learning and understanding process
by reproducing the creation of neuronal connections. The main difference with respect to the
symbolic frameworks is that in the connection-oriented structures there is a complete lack
Figure 2-12: Characteristics of cognitive structures.
of behavioural rules. Moreover, some ANNs can increase their knowledge by augmenting
the number of neurons. On the other hand, both symbolic and connectionist architectures
do not make any distinction between the internal and external state representations of the
cognitive entity. The hybrid structures, such as CLARION and Haykin’s CDS, introduce
the PAC, which implies a separation between internal (implicit) and external (explicit) configurations.
The mechanisms relating internal and external states have been defined
by Friston et al., who explain that the internal configurations of a cognitive entity depend
on the sensory perception of the environment. The added value of hybrid structures is represented
by their applicability to both control tasks (which are typically internal issues of the
system) and procedural tasks (which consist in the execution of physical actions).
Chapter 3
Crowd modeling and proposed
architecture for cognitive video
surveillance system
3.1 Introduction
Human behaviour analysis has important applications in the field of anomaly management,
such as Intelligent Video Surveillance (IVS). As the number of individuals in a scene increases,
however, new macroscopic complex behaviours emerge from the underlying interaction
network among multiple agents. This phenomenon has lately been investigated by
modelling such interactions through Social Forces [84]. In the most recent Intelligent Video
Surveillance systems, mechanisms to support human decisions are integrated in cognitive
artificial processes. These algorithms mainly address the problem of modelling the
behaviours of the monitored entities (i.e. people).
In this chapter a bio-inspired structure is proposed, which is able to encode and synthesize
signals, not only for the description of entity behaviours, but also for modelling cause-
effect relationships between user actions and changes in environment configurations (i.e.
the crowd). During a learning phase the aforementioned interaction models are stored
within a memory: the Autobiographical Memory, figure 3-1.
Figure 3-1: Autobiographical Memory stores the cause-effect relationships between entities, such as user actions and changes in environment configurations (i.e. the crowd).
Through this step, the system operates an effective knowledge transfer from a human operator
towards an automatic system called the Cognitive Surveillance Node (CSN), which is part
of a complex cognitive JDL-based and bio-inspired architecture. After such a knowledge-transfer
phase, the learned representations can be used, at different levels, to support
human decisions by detecting anomalous interaction models and thus compensating for
human shortcomings. Alternatively, in an automatic decision scenario, the system can identify
anomalous patterns and choose the best strategy to preserve the stability of the entire system.
The first part of this chapter will discuss the strong relation among the crowd monitoring,
simulation and modelling fields. To give some examples: the tasks of simulating and
monitoring a crowd raise the issue of modelling its behaviour; crowds obviously need
to be given a dynamic evolution model in order to be simulated; a dynamic model is often
needed to improve the performance of crowd monitoring applications through Bayesian filtering;
simulations are often necessary in order to test crowd monitoring algorithms;
and, eventually, crowd monitoring can provide valuable hints on how to effectively model and
describe crowds. A comprehensive treatment of such interconnected fields is given in section
3.2. Section 3.3 presents the bio-inspired cognitive model for cognitive surveillance
systems.
3.2 Scale issue
Sure enough, one should ask oneself what a crowd is before starting to discuss it.
The way people define a crowd obviously depends on the area in which the crowd itself is
investigated, and thus many different definitions can be found in the literature. However, any
definition one could try to give can hardly avoid describing a crowd in terms of its components,
namely the people it is formed by. This remark may sound trivial, but it has
deep implications in the way a crowd is depicted. In particular, it raises the issue of choosing
between a local description and a global one. A local description of a crowd relies
on the features associated to each member, such as positions, speeds, orientations, motivations,
destinations, etc. A global (holistic) description, on the other hand, relies on features
that can be associated to the crowd as a single entity, such as the average density, the entropy,
the average shift in some direction, the displacement, etc. Global features can in general
be derived from local ones, by averaging or integrating local quantities. The opposite, on
the contrary, never happens. However, it is not only a matter of the scale at which the crowd
is analysed, but rather of the additional amount of information stored in local quantities
compared to global ones. A nice parallel example comes from thermodynamics,
where global quantities such as the energy, pressure and temperature of a gas can in
principle be derived from the average kinetic energy of its molecules: by knowing the exact
behaviour of each single molecule in the gas one can derive the temperature, while the
opposite calculation is not possible, as information is lost by averaging over all the molecules.
However, in both the crowd and the thermodynamics cases, it is not always possible to access
local information entirely, while global quantities can be easily gathered. For example,
in a video surveillance framework, it is unrealistic to track every single person in a high-density
crowded scene, especially if a single camera is available: the visual information
gathered by the camera sensor is simply not enough to accomplish such a task. This kind
of consideration has led to approaches such as the one proposed in [87], in which
a very subtle analysis is performed, taking into account a global macroscopic scale, a middle
mesoscopic scale and eventually a local microscopic scale in a hydrodynamics-inspired
framework (here again physics is of great help). A perfectly specular approach is, on the
contrary, often adopted in simulating and also in modelling crowds. Here an underlying model
can be designed to capture the fine-scale behaviour of each crowd member, in order
to reproduce (simulate) some desired macroscopic behaviour. This approach can on the
one hand be really helpful in the fine tuning of macroscopic simulation outputs by correcting
microscopic local parameters in the model. On the other hand, it can be a very effective way to
validate models, as it gives a way to check their accuracy in reproducing
global crowd behaviours.
3.2.1 Crowd monitoring
The crowd phenomenon has increasingly attracted the attention of worldwide researchers
in video surveillance and video analysis [131] in the last decades. Nowadays an extremely
prolific literature is growing on the subject. Different implications related to crowd behaviour
analysis can be considered, since both technical and social aspects are currently
under researchers’ investigation. On the one hand, researchers focusing on the psychology and
sociology domains consider and model crowd behaviour as a social phenomenon. Several
examples can be found in the literature dealing with the role and the relevance of human interaction
in characterizing the behaviour of a crowd. In [70], a simulation-based approach for
the creation of a population of pedestrians is proposed. The authors aim at modelling the
behaviour of up to 10,000 pedestrians in order to analyse several traffic patterns and people
reactions typical of an urban environment. The impact of the emotions of individual agents
in a crowded area has been investigated also by Liu et al. [68] in order to simulate and
model the behaviour of groups of people. As well, Handford and Rogers [49] have recently
proposed a framework for modelling drivers’ behaviour during an evacuation in a post-disaster
scenario, taking into account several social factors which can affect their behaviour in
following a path to reach a safe spot.
On the other hand, technical aspects in crowd behaviour analysis applications mainly focus
on the detection of events or the extraction of particular features exploiting computer-vision-based
algorithms. An estimation of the number of people in a crowd can be performed by
computing the number of foreground and edge pixels. Davies et al. propose a system using
the Fourier transform for estimating the motion of the crowd [29]. Many researchers have tried to
use segmentation and shape recognition techniques for detecting and tracking individuals
and thus estimating the crowd. However, this kind of approach can hardly be applied to
overcrowding situations where people are typically severely occluded [128], [50]. Neural
networks are used in [76] for estimating crowd density from texture analysis, but in this case
an extensive training phase is needed for getting good performance. A Bayesian model-based
segmentation algorithm was proposed in [132]; this method uses shape models for
segmenting individuals in the scene and is thus able to estimate the number of people in the
crowd. The algorithm is based on Markov chain Monte Carlo sampling and it is extremely
slow for large crowds. Optical-flow-based techniques are used in [5], [8], while Rahmalan et
al. [103] proposed a computer-vision-based approach relying on three different methods to
estimate crowd density for outdoor surveillance applications. The combination of technical
and social aspects can represent an added value with respect to the already presented
works. A first example can be found in [26], where the authors exploit a joint visual tracking
and Bayesian reasoning approach to understand people and crowd behaviour in a metro station
scenario. More recently [84], [99], [71], [87], a social force model describing the interactions
among the individual members of a group of people has been proposed to detect
abnormal events in crowd videos. Here people are treated as interacting particles subject
to internal and external physical forces which determine their motion. At the same time,
social and psychological aspects are taken into account in modelling such indeed “social”
forces, showing the effectiveness of a synergic multidisciplinary approach to the problem.
This represents the current trend in crowd modelling research.
3.2.2 Crowd simulation
Graphical or symbolic simulation of moving crowds is a continuously evolving field
which involves research groups all around the world in many different areas, such as the
entertainment industry (video games and motion pictures), police and military force training
(demonstration and riot simulations), architecture (building and city design), traffic
control (crossovers and walking paths), security sciences (evacuation of crowded environments)
and sociology (behaviour studies). Simulation of crowds meets the need for crowd
observation data that are often hard or even impossible to gather directly, and is also often
necessary in the design stage of security and surveillance systems. Here again, different
application areas obviously show different approaches to the problem. Basically, these
approaches can be divided into two main categories. The first is mostly focused on the behavioural
aspects of the crowd, while neglecting the visual output quality. Crowd members can
be schematically represented as dots or stylized shapes, or even melted together in a rougher
framework, wherever only a global point of view is needed. Here, only the realism of the dynamics
is stressed. The second approach, on the contrary, is centred on visual effects and it is
not really concerned with an appropriate modelling of the real behaviour. A well-balanced
integration of realism in the behaviour of the crowd and in its visualization is also
often needed, at least to some extent, as in the case presented here. This will be discussed
in detail in the following. As mentioned at the beginning of this section, crowds need to
be given an underlying dynamical model in order to be simulated. Actually, such a model
is inherently in charge of depicting the evolution of some crowd features only. This raises
again the issue of how to describe crowds. Such a description includes a selection not only
of the features one is interested in simulating, but also of the scale at which the model has
to lie, in order to effectively describe the former. Namely, a microscopic model could
be given the task of simulating features at a more global level, while the opposite way is
hardly practicable, as can be easily guessed.
3.3 A bio-inspired cognitive model for Cognitive Surveillance Systems
The proposed approach to Intelligent Video Surveillance (IVS) has been implemented according
to a bio-inspired model of human reasoning and consciousness grounded in the
work of the neurophysiologist A. Damasio [28].
As already mentioned, Damasio’s theories describe cognitive entities as complex systems
capable of incremental learning based on the experience of relationships between them-
selves and the external world. Two specific brain devices can be defined to formalize the
aforementioned concept: Damasio names them proto-self and core-self. Such devices are
specifically devoted to monitor and manage the internal status of an entity (proto-self) and
the external world (core-self). Thus, crucial aspects in modelling a cognitive entity fol-
lowing Damasio’s model are first of all the capability of accessing entities’ internal status
and secondly the analysis of the surrounding environment. Relevant information comes
from the relationships between the two. This approach can be mapped into a sensing
framework by dividing the sensors into endo-sensors (or proto-sensors) and eso-sensors
(or core-sensors) as they monitor, respectively, the internal or external state of the interact-
ing entities.
Applying these concepts to the video analysis domain, interacting entities can be repre-
sented either by a guard monitoring a smart environment, by a subject driving an in-
telligent vehicle, or by a guard and an intruder interacting in some monitored area. In
a crowd management scenario, eso-sensors can monitor the crowd, while endo-sensors
can provide information about system parameters, as will become clearer in the
following.
Referring to the proposed sample framework, four main blocks have been identified as
representative of a cognitive-based sensing architecture: the control centre, the CN, the
(intelligent) sensing nodes, and the mobile terminals and/or actuators. The tasks which can
be accomplished by each block are shown in Fig. 3-2, establishing a preliminary bridge
between the concepts introduced by Damasio and the effective implementation of the sys-
tem.
The core of the proposed architecture is the already mentioned CN, which can be described
as a module able to receive data from sensors of all kinds and to process them, defining
different configurations as interactions between proto and core states. Such a bio-inspired
knowledge representation makes it possible to assess potentially dangerous or anomalous
events and situations, and possibly to interact with the environment itself.
3.3.1 Cognitive Cycle for single and multiple entities representation
Within the proposed scheme, the representation of each entity has to be structured in a
multi-level hierarchical way. As a whole, the closed processing loop realized by the CN in
case of a given interaction between an observed object and the system can be represented
by means of the so-called Cognitive Cycle (CC - see Fig. 3-2) which is based on four
fundamental logical blocks:
∙ Sensing: the system has to continuously acquire knowledge about interacting objects
and their own internal status.
∙ Analysis: the collected raw knowledge is processed in order to obtain a precise and
concise representation of occurring events and causal interactions.
∙ Decision: the precise information provided by the analysis phase is processed and a
decision strategy is selected according to the goal of the system.
∙ Action: the system fulfils the configuration computed during the decision phase as a
direct action over the environment or as a message provided to some actuator.
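The four stages above can be sketched as a simple closed loop. The following Python sketch is purely illustrative: all function names, thresholds and the "open_extra_gate" countermeasure are hypothetical placeholders, not part of the proposed system.

```python
# Illustrative sketch of the four-stage Cognitive Cycle (sense -> analyse ->
# decide -> act). Names and thresholds are hypothetical.

def sensing(environment):
    """Acquire raw knowledge about interacting objects and internal status."""
    return {"core": environment["crowd_density"], "proto": environment["system_load"]}

def analysis(raw):
    """Condense raw measurements into a concise situation description."""
    return "overcrowding" if raw["core"] > 0.8 else "normal"

def decision(situation):
    """Select a strategy according to the goal of the system (keep the area safe)."""
    return "open_extra_gate" if situation == "overcrowding" else "no_action"

def action(strategy, environment):
    """Apply the chosen configuration to the environment (or to an actuator)."""
    if strategy == "open_extra_gate":
        environment["crowd_density"] *= 0.5  # hypothetical effect of the action
    return environment

def cognitive_cycle(environment):
    """One closed pass of the CC over the environment."""
    raw = sensing(environment)
    situation = analysis(raw)
    strategy = decision(situation)
    return action(strategy, environment), situation, strategy
```

The point of the sketch is only the closed sense-analyse-decide-act dependency chain; each real block would of course be far richer.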
The proposed model for cognition has many analogies with the one adopted by Haykin
in his formalization of Cognitive Dynamic Systems [56], referred to as Fuster's
Paradigm: Joaquin Fuster proposes in fact the concept of the cognit and an abstract model for
Figure 3-2: Cognitive Cycle (single object representation)
cognition, based on five fundamental building blocks, namely perception, memory, atten-
tion, intelligence and language [45]. Perception represents the information gain block and
corresponds to the sensing block of the CC; similarly, intelligence matches the analysis
logical block and, according to Fuster's paradigm, also includes the decision-making
stage. Memory is associated, within the CC, with a learning phase which is continuous and
basically involves all the stages of the cognitive cycle: this will be explained in more de-
tail in sections 4.1.3 and 4.2. The attention block is meant to optimize the information
flow within the Cognitive Dynamic System: this aspect goes beyond the purposes of this
work. Finally, the language block is intended to provide efficient communication on a
person-to-person basis, but it is not considered here (nor by Haykin in his works).
The CC, by experiencing interactions between the CN and the external object, provides
different configurations, also called cause-effect relationships. Starting from these relations
it is possible to define object representations based on their dispositional capabilities, i.e.
the objects can be disposed (or not) to change in some way. More formally, an observed
object x is disposed to D in different C-cases (i.e. situations), where D defines the dispo-
sitional property of x by a set of C configurations (i.e. cause-effect relationships), called
dispositional statements [10].
A set of dispositional properties gives a dispositional embodied description of an object,
and it includes the reactions it generates in the cognitive system, i.e. the possible actions that
the system can plan and perform when a situation involving that object is observed or
predicted. According to this statement, it is possible to refer to the representation model
depicted in Fig. 3-2 as an Embodied Cognitive Cycle (ECC). The cognitive cycle can be
seen as a way of representing generic observed objects within the CN by means of a multi-
level representation involving both the bottom-up analysis chain and the top-down decision
chain (see Fig. 3-3). With respect to the security and safety domains, in which the ECC
is here applied, the above mentioned dispositional properties are associated with a precise
objective: to maintain the stability of the equilibrium between the object and the environment
(i.e. maintenance of the proper level of security and/or safety). An anomaly is a deviation
from normality and can be considered as a violation of a certain dispositional property.
Figure 3-3: Cognitive Node-based object representation: Bottom-up analysis and top-downdecision chain.
As a consequence, each entity is provided with a 'security/safety oriented ECC' (S/S-ECC)
which is representative of the entity itself within the CN. Moreover, the mapping of the
S/S-ECC onto the CN chain shown in Fig. 3-3 can be viewed as the result of the interaction
between two entities, each one in turn described as a cognitive cycle. In particular, if the
external object (eso) and the internal autonomous system (endo) are represented as a couple
of Interacting Virtual Cognitive Cycles (IVCC), the IVCCs can be matched with the CN
structure (i.e. the bottom-up and the top-down chains) by associating parts of the knowl-
edge related with the different ECC phases to the multilevel structure processing parts of
the CN (Fig. 3-3).
More in detail, the representation model of the ECC (top left corner of Fig. 3-4) is centered
on the cognitive system that can be considered by itself as a cognitive entity. Therefore, it
is possible to map the proposed representation as in the top right corner of Fig. 3-4, where
two IVCCs, the one representing the entity (or object - 𝐼𝑉 𝐶𝐶𝑂) and the other representing
the cognitive system (𝐼𝑉 𝐶𝐶𝑆), interact in a given environment. In this model, the sensing
and action blocks of the 𝐼𝑉 𝐶𝐶𝑆 correspond to the sensing and action blocks of the ECC
(see bottom right corner of the figure). However, in the 𝐼𝑉 𝐶𝐶𝑆 , such blocks assume a
parallel virtual representation of the physical sensing and action observed corresponding
respectively to the Intelligent Sensing Node and the Actuator blocks in the general frame-
work.
The analysis phase of the 𝐼𝑉 𝐶𝐶𝑆 (𝐴𝑛𝑎𝑙𝑦𝑠𝑖𝑠 − 𝐼𝑉 𝐶𝐶𝑆) can be considered as including
a virtual representation of the four stages characterizing the state of the interacting ob-
ject. The sensing phase can be mapped onto the event detection sub-blocks of the 𝐴𝑛−𝐼𝑉𝐶𝐶𝑆, namely the system event detection
(𝐸𝐷𝑆𝑦𝑠𝑡𝑒𝑚) and the object event detection (𝐸𝐷𝑂𝑏𝑗𝑒𝑐𝑡). Similarly, the system situation
assessment sub-block (𝑆𝐴𝑆𝑦𝑠𝑡𝑒𝑚) includes a virtual representation of the object situation as-
sessment (𝑆𝐴𝑂𝑏𝑗𝑒𝑐𝑡). Finally, as shown in the bottom left corner of Fig. 3-4, the prediction,
decision and action parts of the object can be considered as knowledge that allows the cog-
nitive system to predict the future behaviour of the interacting object itself (the interacting
objects are here the system and the one observed external object). Prediction, decision and
action can be included in the prediction sub-block of the system (𝑃𝑆𝑦𝑠𝑡𝑒𝑚).
The proposed interpretation of the matching among the embodied cognitive model, the in-
teractive virtual cycles representing the entities acting in the environment (including the
Figure 3-4: Embodied Cognitive Cycle, Interactive Virtual Cognitive Cycles and CognitiveNode matching representation
system) and the CN, allows considering the CN as a universal machine for processing
ECCs with respect to a large variety of application domains. In general, each ECC starts
with ISN (Intelligent Sensing Node) data, including an interacting entity (eso-sensor) and
a system reflexive observation (endo-sensor). The observed data are considered from two
different perspectives (the object's and the system's) by creating a description of the cur-
rent state of the entities using knowledge learned in previous experiences. This process
happens in the event detection and situation assessment sub-blocks. Then, a prediction of
future actions taken by the 𝐼𝑉𝐶𝐶𝑂, contextualized with the self-prediction of future planned
actions of the system, occurs in the prediction sub-block. The use of the knowledge of the
𝐼𝑉𝐶𝐶𝑂 ends at this stage. Finally, the 𝐼𝑉𝐶𝐶𝑆 is completed by adjusting the plans of the
system in the representation of its decision and action phases, which are, as stated above, a
parallel virtualization of the ECC. In addition, it is worth briefly pointing out that a sim-
ilar decomposition can be adopted when two interacting entities are observed.
The description of the interacting subjects can be modelled by observing that the two entities
can form a single meta-entity, with which a meta cognitive cycle interacting with the
autonomous system is associated. The meta-entity (ME) can simply be considered as a
composition of the two cognitive cycles associated with the initial entity couple. The advantage
of the proposed representation, involving the description of an Embodied Cognitive Cycle by
means of an IVCC couple, is that the same mechanism used to represent the interaction of a
ME with the autonomous system can also be used to represent the interaction between two
observed entities forming an observed meta-entity, as proposed in [32]. Dynamic Bayesian
Networks (DBNs) can be used to represent cognitive cycles and IVCCs based on a structure
called Autobiographical Memory (AM) [30]. DBNs provide a tool for describing em-
bodied objects within the CN in a way that allows incremental learning from experience
[91]. Section 4.1 is devoted to the demonstration of such a claim.
3.3.2 The Cognitive Node
The general architecture of the Cognitive Node is depicted in Fig. 3-5. Intelligent sensors
are able to acquire raw data from physical sensors and to generate feature vectors corre-
sponding to the entities to be observed by the CN. If they come from different intelligent
sensors, the acquired feature vectors must be fused spatially and temporally in the first
stages of the node.
Figure 3-5: Cognitive Node Architecture
As briefly introduced in the previous section, the CN is internally subdivided into two main
parts, namely the analysis and the decision blocks. These two stages are linked together by
the cognitive refinement block, whose role will be explained shortly. The analysis blocks are
responsible for organizing sensor information and finding interesting or notable configu-
rations of the observed entities at different levels. Those levels can communicate directly
with the human operator through the network interfaces in the upper part of Fig. 3-5. This
is basically what can be done by a standard signal processing system able to alert
a supervisor whenever a specific anomalous interaction behaviour is detected. A predic-
tion module uses the stored experience of the node, through the internal AM, to
estimate a possible evolution of the observed environment. All the processed data and
predictions generated by the analysis steps are used as input to the cognitive refinement
block. This module can be seen as a surrogate of the human operator: during the
configuration of the system it is able to learn how to distinguish between different levels of
potentially dangerous situations. This process can be carried out by manually labelling different
zones of the observed data or by implementing a specific algorithm for the particular cog-
nitive application. In the on-line phase, the CN works in two different ways: in operator-
support mode and in automatic mode. In both cases the cognitive refinement module is able
to detect whether a predicted condition starts to drift away from the standard observed
interaction, thus bringing the overall system closer to a warning situation. Specifically, in
the operator-support case, the switch depicted in Fig. 3-5 is open. The CN, by means of the
cognitive refinement block, can detect anomalies as possible discrepancies from standard
operator-crowd interactions. In automatic mode, the switch is closed and the information
contained in the cognitive refinement block is employed to identify specific crowd-environment
situations. The communication link towards the operator permits direct warnings about
situations that are anomalous with respect to normal crowd behaviours. The decision modules
of the CN are responsible for selecting the best actions to be automatically performed by the
system to avoid a dangerous situation. Those actions can be performed on the fully cooperative
parts of the observed system; all the decisions taken by the CN are made with the precise
intent of maintaining the environment in a controllable, alarm-free state. If all the actions
of the node are unable to keep the system in a standard state and the measured warning
level continues to increase, the node itself can decide to stop the cognitive cycle and to give
command of the controllable parts of the system back to the human operator, who is always
given the possibility to decide on his own and completely bypass the automatic system, or
to be notified of each single action that the CN is processing (Interfaces, Fig. 3-5).
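The operating-mode logic just described can be condensed into a few lines. This is a hedged sketch, not the node's actual implementation: the function name, the scalar warning level and the threshold are all hypothetical simplifications of the cognitive refinement behaviour.

```python
# Hypothetical sketch of the CN operating modes: with the switch open
# (operator support) the node only reports anomalies; with it closed
# (automatic mode) it acts autonomously and hands control back to the
# operator when its actions fail to stop the warning level from rising.

def cn_step(warning_level, prev_warning, switch_closed, max_warning=0.8):
    """Return (mode, hand_back_to_operator) for one refinement step."""
    anomaly = warning_level > max_warning
    if not switch_closed:                        # operator-support mode
        return ("warn_operator" if anomaly else "monitor"), False
    # automatic mode: surrender control if the warning keeps increasing
    if anomaly and warning_level > prev_warning:
        return "automatic", True                 # stop the cycle, hand back
    return "automatic", False
```

The second return value models the node's decision to stop the cognitive cycle and return command of the controllable parts to the human operator.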
As a final remark, it should be pointed out that, like the proposed perception-action
cycle for crowd monitoring, robot control mechanisms are also often motivated by
biology. However, there are some conceptual differences between the two approaches.
Robot control strategies, such as Reinforcement Learning, optimize actions
by evaluating their rewards. The presented mechanism, based on Damasio's concept of
the Autobiographical Self, acquires and mathematically models, during an off-line phase,
interaction information by observing two entities, the operator and the crowd (i.e. proto and
core). During the on-line phase, the cognitive system uses the previously stored knowledge
to actively interact with the external world. In the operator-crowd case, a prediction
mechanism drives the system actions, selecting the possible countermeasures according
to the rules learned during training. The proposed algorithm is a general framework for
acquiring and building up rule sets in different contexts.
Chapter 4
Information extraction and probabilistic
model for knowledge representation
4.1 Introduction
Interactions between two entities can be described in terms of mathematical relationships.
Such a mathematical description must obviously rest on a feature extraction phase, which
is aimed at obtaining relevant information about the entities.
This chapter is devoted to the analysis of the main features that allow the design of a proba-
bilistic model able to learn interactions. After information is extracted, Dynamic Bayesian
Networks (DBNs) can be used to represent cognitive cycles and IVCCs [30], as already
mentioned in section 3.3.1, based on the AM algorithm, thus providing a tool for describ-
ing embodied objects within the CN in a way that allows incremental learning from
experience. It was already pointed out that interactions between the operator and the
system can also be represented as an IVCC. In that case, the operator-system interaction can
additionally be used as an internal reference for the CN, as the operator can be seen as a
teaching entity indicating the most effective actions towards the goal of maintaining
security/safety levels during the learning phase. This learning phase represents an effective knowledge
transfer from human operator towards an automatic system.
The proposed framework for information extraction is composed of two main blocks: Data
Fusion (DF) and Event Detection (ED). DF involves source separation and feature ex-
traction: these two phases make it possible to recognize the same entities monitored by
different heterogeneous sensors. The ED block extracts information related to changes in
the signals acquired by the sensors. Events will eventually be defined, in order to develop
specific probabilistic models able to describe different kinds of relationships and thus to
detect anomalous interactions. The remainder of this chapter is organised as follows. The
applications of such models to crowd monitoring are presented in section 4.2. Section 4.3
describes the proposed approach for anomaly detection, while results are given in section
7.3.
4.1.1 Data fusion
Many different approaches can be used for designing system-embedded architectures able
to collect heterogeneous environmental information. According to the functionalities
provided by the system, data fusion mechanisms should be considered as logical
tasks which can be subdivided within a multi-modal architecture. An interesting reference
for data fusion is the JDL model [48].
The JDL model includes five levels of processing, which represent descriptions at increasing
levels of abstraction [31]. In our description, information on two distinct entities is
fused and aligned at different levels.
The data fusion module is able to receive data from the intelligent sensors on the field and to
fuse them from a temporal and spatial point of view. If one considers a set of 𝑆 intelligent
sensors, each sensor 𝑘 = 1, 2, . . . , 𝑆 sends to the CN a feature vector 𝐱(𝑘, 𝑡𝑘) = [𝑥1, 𝑥2, . . . , 𝑥𝑁𝑘]
at time instant 𝑡𝑘. The intelligent sensors send feature vectors asynchronously
to the CN, which must be able to register them temporally and spatially before
sending the data to the upper-level processing modules.
From a temporal point of view, the data fusion module collects and stores in an internal
buffer the newest measurement 𝐱(𝑘, 𝑡*𝑘) from each intelligent sensor 𝑘 = 1, 2, . . . , 𝑆. The time
instant 𝑡*𝑘 represents the last time at which a feature vector was received from sensor 𝑘;
the data acquisition time can vary from sensor to sensor.
As soon as a new feature vector is acquired from sensor 𝑘, the data fusion module can
compute an extended feature vector by combining the latest measurements from all considered
intelligent sensors, 𝜙(𝑡) = 𝑓(𝐱(1, 𝑡*1), 𝐱(2, 𝑡*2), . . . , 𝐱(𝑆, 𝑡*𝑆)), where 𝑡 ≥ 𝑡*1, 𝑡*2, . . . , 𝑡*𝑆.
Thus the generation rate of the data fusion module can be estimated by considering the min-
imum time interval between two sequential measurements of the highest-frequency sensor.
If ∆𝑡𝑛𝑘 = (𝑡𝑛𝑘 − 𝑡𝑛−1𝑘) is the time interval between the arrival times of feature vectors 𝐱(𝑘, 𝑡𝑛)
and 𝐱(𝑘, 𝑡𝑛−1) for sensor 𝑘, the actual data rate of the fusion block can be estimated by
computing 𝑚𝑖𝑛𝑘(∆𝑡𝑛𝑘).
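The asynchronous buffering and rate estimate above can be sketched in a few lines of Python. The class and method names are illustrative assumptions; the fused vector here is a plain concatenation, whereas the actual fusion function 𝑓 depends on the physical quantities, as noted next.

```python
# Minimal sketch of the data fusion buffer: keep the newest feature vector
# per sensor, fuse on demand, and estimate the achievable fusion rate as
# min_k(Delta t_k). Names are hypothetical.

class FusionBuffer:
    def __init__(self, num_sensors):
        self.latest = {}        # k -> (t_k*, newest feature vector)
        self.intervals = {}     # k -> last inter-arrival time Delta t_k
        self.num_sensors = num_sensors

    def push(self, k, t, features):
        """Store the newest measurement from sensor k, arrived at time t."""
        if k in self.latest:
            self.intervals[k] = t - self.latest[k][0]
        self.latest[k] = (t, features)

    def fuse(self):
        """A trivial phi(t): concatenate the newest vectors of all sensors."""
        if len(self.latest) < self.num_sensors:
            return None         # not every sensor has reported yet
        fused = []
        for k in sorted(self.latest):
            fused.extend(self.latest[k][1])
        return fused

    def min_interval(self):
        """Rate bound of the fusion block: min over sensors of Delta t_k."""
        return min(self.intervals.values()) if self.intervals else None
```

A fused vector can be produced each time any sensor delivers a new measurement, so the output rate is bounded by the fastest sensor, as stated above.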
The analytic expression of the fusion function 𝜙(𝑡) depends on the physical relationships
between the measured quantities and cannot be studied with a generic approach. In the fol-
lowing scenarios, feature vectors are mainly generated by (real, but possibly simulated)
video analytics algorithms that are able to process images acquired by video-surveillance
cameras and to extract scene descriptors (i.e., trajectories of moving objects, crowd densities
within a certain environment, human-activity related features, etc.).
In any case one can suppose that the fused feature vectors produced as output of this module
have the following form:
𝜙(𝑡) = [𝐶, 𝑃] = [𝐶1, 𝐶2, . . . , 𝐶𝑛, 𝑃1, 𝑃2, . . . , 𝑃𝑚], (4.1)
where 𝑛 and 𝑚 represent the numbers of features of the core and proto state vectors, 𝐶 and 𝑃
respectively (i.e. the dimensionality of the vectors). Equation 4.1 expresses a general
form for the global feature vector resulting from the data fusion module. Vector 𝐶
identifies features related to the so-called core objects, i.e., entities that are detected within
the considered environment but are not part of the internal state of the system itself.
Vector 𝑃 identifies proto object features that are specific for entities that can be completely
controlled by the CN.
4.1.2 Event detection
The data fusion phase makes it possible to obtain high-dimensional proto and core state
spaces, where each point represents a vector of features at a specific time instant: 𝑃(𝑡)
and 𝐶(𝑡). Using this representation it is possible to interpret the changes of the state vectors
as movements, i.e. trajectories, in a multi-dimensional space. Furthermore, as the dynamic
evolution of one entity depends on the other entity, the relationships between such trajectories
describe interactions.
A Self-Organizing Map (SOM) [62] unsupervised classifier is employed in this work at two
different logical levels: first, to detect events in terms of relevant state changes; secondly,
to represent complex relationships between the entities in a low-dimensional space. The
latter level will be discussed in detail in the next sections. The SOM operates
a dimensionality reduction by mapping the multidimensional proto or core state vectors
(𝑃(𝑡) and 𝐶(𝑡)) onto a lower 𝑀-dimensional space, where 𝑀 is the dimension of the
Kohonen neuron layer (from here on we consider 𝑀 = 2 without loss of generality).
After the neural network is trained, input vectors are clustered according to their similarities.
The choice of SOMs to perform feature reduction and clustering is motivated by their
capability to reproduce, in a plausible mathematical way, the global behaviour of the
winner-takes-all and lateral inhibition mechanisms shown by distributed bio-inspired de-
cision mechanisms. Besides, a SOM allows for the establishment of a representation of
reality based on analogies: similar (though not necessarily identical!) input vectors are
likely to be mapped by the Kohonen map to the same neuron (in a non-injective way).
Similarity is, in our case, measured by the simple Euclidean distance between vectors; however
more complicated metrics can be employed to this end.
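The SOM mechanics just described can be sketched compactly. The following is a generic illustration of Kohonen training, not the configuration used in this work: grid size, learning rate and neighbourhood width are arbitrary, and for simplicity neither parameter is decayed over epochs.

```python
# Sketch of SOM training: the best-matching unit (winner-takes-all) is found
# by Euclidean distance, and the winner plus its grid neighbours are pulled
# toward the input. Parameters are illustrative.
import numpy as np

def train_som(data, rows=5, cols=5, epochs=30, lr=0.5, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    weights = rng.random((rows * cols, data.shape[1]))
    # 2-D coordinates of each neuron in the Kohonen layer
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for _ in range(epochs):
        for x in data:
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # winner
            d2 = np.sum((grid - grid[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))   # neighbourhood function
            weights += lr * h[:, None] * (x - weights)
    return weights, grid

def bmu_index(weights, x):
    """Map a state vector onto its winning neuron (its Super-state)."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))
```

After training, `bmu_index` realizes the non-injective mapping discussed above: similar inputs tend to fall on the same neuron, dissimilar ones on different neurons.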
More in detail, neural networks such as the SOM, Neural Gas (NG) [79] and Growing Neural
Gas (GNG) [43] algorithms are inspired by Hebbian theory and permit the adaptation
of neurons during the learning process. The Neural Gas represents a very interesting and
powerful tool for vector quantization and data compression. NG derives from the SOM
and improves the preservation of the input data topology through an adaptive method
based on learning the neighbourhood relationships between the weight vectors (associated with
the neuronal units) and each external stimulus (associated with an input vector). In this context
we have supposed that the global environment is divided into different rooms, each one
controlled by cameras. A camera-embedded people counter is able to provide an estimate of
the number of people. The considered state vectors 𝐶 are multidimensional and we are interested
in reducing them to a 2-D space. However, in other applications where it is highly desirable
to preserve the topology, we have explored the possibility of automatically determining
the set of regions to monitor according to the environmental topology. In this case the input
information can be the people trajectories, and we use the Instantaneous Topological Map
(ITM) for learning the structure of the input data manifold [14]. SOMs present a fixed number of
neuronal units, while for the GNG the number of neurons is automatically decided during the
training phase. The study of the dimension of the reduced space is very important for us,
because it is correlated to the definition of the events. By fixing the dimension of the SOM layer
it is possible to keep the total number of possible events limited. A common learning
problem, in designing models, is to acquire all possible configurations, i.e. all possible
events. To this end, at this stage of our study, a fixed number of neurons is better than a
self-adaptable topology. The Growing Hierarchical SOMs (GH-SOMs) represent another
interesting tool [105]. They can increase the number of neurons and layers by means
of distance measurements between neuronal weights and input data. These mechanisms
for adapting the layer sizes permit accurate reconstruction of the original data. On the other
hand, we are interested in studying the optimum number of units for balancing the learning
efficiency, the knowledge representation and the prediction capabilities of the AM. These
facts will become clearer in section 4.2.1. A technique for the definition of contextual
knowledge was proposed in [77]. By using a single 2-D SOM, an event classification was
obtained by fusing the heterogeneous vectors shown in Equation 4.1. In that case, however,
the relationships between the entities are “fused" in the neurons. According to Damasio's
theory, by using different SOMs to separately map the core and proto vectors, it is
possible to detect relevant transitions between SOM neurons, i.e. the events. Such distinct
core and proto events are the basic units of the AM, which represents a bio-inspired fusion
method for modelling the dependences between two entities [19].
The clustering process, applied to internal and external data, allows one to obtain a mapping
of the proto and core vectors 𝑃(𝑡) and 𝐶(𝑡) into 2-D vectors, corresponding to the positions
of the neurons in the SOM map, which we call respectively proto Super-states 𝑆𝑥𝑃 and core
Super-states 𝑆𝑥𝐶. Each cluster of Super-states, deriving from the SOM classifiers, is then
associated with a semantic label related to the contextual situation:
𝑆𝑥𝑖𝑃 ↦→ 𝑙𝑖𝑃 , 𝑖 = 1, . . . , 𝑁𝑃
𝑆𝑥𝑗𝐶 ↦→ 𝑙𝑗𝐶 , 𝑗 = 1, . . . , 𝑁𝐶
(4.2)
where the notation 𝑆𝑥𝑖𝑃 and 𝑆𝑥𝑗𝐶 indicates that the Super-state belongs, respectively, to
the 𝑖-th proto label and to the 𝑗-th core label; 𝑁𝑃 and 𝑁𝐶 are, respectively, the maximum
numbers of proto and core Super-state labels. The result of this process is thus the
construction of a 2-D map divided into regions labelled with a meaningful identifier related to
the ongoing situation. It must be noted that changes of the state vectors 𝑃(𝑡) and 𝐶(𝑡) do not
necessarily imply a change of the Super-state labels 𝑆𝑥𝑖𝑃 ↦→ 𝑙𝑖𝑃 and 𝑆𝑥𝑗𝐶 ↦→ 𝑙𝑗𝐶. This means
that some modifications are irrelevant from the point of view of the chosen
semantic representation of the situation. On the other hand, when the Super-state labels
𝑆𝑥𝑖𝑃 and 𝑆𝑥𝑗𝐶 change in subsequent time instants, a contextual situation modification, i.e.
an event, occurs. It is then possible to define proto (𝜖𝑃) and core (𝜖𝐶) events. Actually,
even null events (i.e. null changes in Super-states) can be defined. In fact, these could
be relevant when considering very slowly changing systems and dynamics, or whenever
lengthy immobility could become relevant.
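The event definition above reduces to a simple rule over the temporal sequence of Super-state labels. The sketch below is an assumption-laden illustration (labels are plain strings, events are label pairs), not the thesis implementation.

```python
# Sketch of event detection from Super-state labels: an event occurs only
# when consecutive time instants yield different labels; identical labels
# are null events, kept only on request.

def detect_events(labels, keep_null=False):
    """Turn a temporal sequence of Super-state labels into a list of events.

    Each event is a pair (previous_label, new_label). Null events (identical
    consecutive labels) are dropped unless keep_null is True, mirroring the
    optional null-event definition for slowly changing systems.
    """
    events = []
    for prev, cur in zip(labels, labels[1:]):
        if prev != cur or keep_null:
            events.append((prev, cur))
    return events
```

With `keep_null=True` the same function covers the slowly-changing case where lengthy immobility is itself informative.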
A Kohonen layer consists of a 2-D matrix of neurons, each identified by a hexagonal location.
The network is trained by competitive learning: the output neurons that win the
competition are the ones activated by the input state vectors. Two SOM nodes are
(a) (b)
Figure 4-1: Examples of temporal proximity trajectories among fired neurons in a 2-D SOM map (5 × 5) for different core state vector sequences. The trajectories are non-linear and discontinuous.
considered as near if they are consecutively active at two successive time instants. It is
possible to connect all the fired neurons, describing a temporal proximity trajectory among
the neurons. Different input state sequences do not necessarily describe different trajectories
in the Super-state space. By sequentially analysing the dynamic evolution of the Super-states,
proto and core events can be detected and visualized as trajectories in the 2-D SOM map.
The output of the SOM algorithm is in fact a list of clusters (or zones) within the Kohonen
layer that describe a trajectory. Two trajectories for two different core state sequences are
presented in Fig. 4-1. The ED module also considers dynamical aspects of the evolution
of the clustered features: transition probabilities between different Super-states (i.e. zones) are
computed, in such a way that the outcome of the training process can ideally be compared
to a DBN. Instead of considering sequences of Super-states to describe the evolution of
each entity, it is possible to consider proto and core event series, which can be modelled by
two Event-based DBNs (E-DBNs) [97], as explained in the next section.
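The transition statistics mentioned above can be estimated by normalised counting, which is what makes the trained map comparable to a DBN. The following is a generic frequency-counting sketch under that assumption, with labels again represented as strings.

```python
# Sketch of the Super-state transition model: estimate the transition
# probabilities between zones as normalised counts over an observed
# label sequence, giving a simple DBN-like description of the dynamics.
from collections import Counter, defaultdict

def transition_probabilities(labels):
    """Return {state: {next_state: p}} estimated from a label sequence."""
    counts = defaultdict(Counter)
    for prev, cur in zip(labels, labels[1:]):
        counts[prev][cur] += 1
    return {state: {nxt: n / sum(ctr.values()) for nxt, n in ctr.items()}
            for state, ctr in counts.items()}
```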
4.1.3 Autobiographical memory
According to Damasio's theory, the sequences of proto (internal) and core (external) events
can be organized into two kinds of triplets in order to account for interactions between en-
tities: (𝜖−𝑃, 𝜖𝐶, 𝜖+𝑃) (passive interaction) and (𝜖−𝐶, 𝜖𝑃, 𝜖+𝐶) (active interaction), representing
the causal relationships in terms of initial situation (first event), cause (second event) and
consequent effect on the examined entity (third event)1.
The resulting information becomes an approximation of what Damasio himself calls the
Autobiographical Memory, where these triplets, which encode the possible interactions be-
tween entities, are memorized. The basic idea behind the algorithm is to estimate the
frequency of occurrence of the effects caused by a certain external event, in order to derive
two probability distributions:
𝑝(𝜖+𝑃 |𝜖𝐶 , 𝜖−𝑃 ), (4.3)
𝑝(𝜖+𝐶 |𝜖𝑃 , 𝜖−𝐶), (4.4)
representing the causality of the observed events in the interaction. The sequence of events is
represented by a statistical graphical model in order to introduce a mathematical description
of the proposed interaction model. This choice is motivated by the fact that the interaction
pattern is composed of a temporal sequence of interdependent events and can therefore be
seen as a stochastic process. An approach to modelling sequences of events that relies
on a probabilistic model is thus particularly appropriate.
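The frequency-of-occurrence idea behind Equation 4.3 can be illustrated with a small estimator. The event labels and the function name below are hypothetical; the sketch only shows how normalised triplet counts yield the conditional distribution of the effect given the cause and the initial situation.

```python
# Illustrative estimate of p(eps_P+ | eps_C, eps_P-) (Eq. 4.3) from a set
# of observed passive triplets (initial proto event, core cause, proto
# effect), by normalised frequency of occurrence.
from collections import Counter, defaultdict

def estimate_am(passive_triplets):
    """passive_triplets: iterable of (eps_P_minus, eps_C, eps_P_plus)."""
    counts = defaultdict(Counter)
    for p_minus, c, p_plus in passive_triplets:
        counts[(p_minus, c)][p_plus] += 1
    return {cond: {eff: n / sum(ctr.values()) for eff, n in ctr.items()}
            for cond, ctr in counts.items()}
```

The symmetric estimator for the active triplets of Equation 4.4 is obtained by swapping the proto and core roles.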
The interaction patterns are modelled by a Coupled Event-based DBN (CE-DBN) in or-
der to have a representation able to statistically encode the relevant data variability. The
proposed CE-DBN graph, shown in Fig. 4-2, aims at describing interactions taking into
account the neuro-physiologically motivated model of the Autobiographical Memory. The
conditional probability densities (CPDs) 𝑝(𝜖𝑃𝑡 |𝜖𝑃𝑡−1) and 𝑝(𝜖𝐶𝑡 |𝜖𝐶𝑡−1) encode the motion pat-
terns of the objects in the environment regardless of the presence of other objects. Note that
each triplet can be seen as one dispositional statement (configuration) with an associated
conditional probability (Equations 4.3 and 4.4). The AM provides a dispositional descrip-
tion, i.e. a set of dispositional properties, for the proto and core entities.
1An active interaction (represented by a triplet) is defined when an internal modification (proto event) is the cause of an external environmental change, i.e. the third event in the triplet is a core event. Vice versa, a passive triplet is defined when an external environmental change (core event) provokes an internal modification, i.e. the third event in the triplet is a proto event.
Figure 4-2: Coupled Event based Dynamic Bayesian Network model representing interac-tions within an AM
The dispositional properties describe a precise objective: to maintain the stability of the equi-
librium between the object and the environment (i.e. the maintenance of the proper level of
security and/or safety). An anomaly can be seen as a deviation from normality and can be
considered as a violation of a certain dispositional property. The interactions between
the two objects are described by the CPDs:
𝑝(𝜖𝑃𝑡 |𝜖𝐶𝑡−Δ𝑡𝐶 ), (4.5)
𝑝(𝜖𝐶𝑡 |𝜖𝑃𝑡−Δ𝑡𝑃 ). (4.6)
In particular, Equation 4.5 describes the probability that the event 𝜖𝐶, which occurred at time
𝑡 − ∆𝑡𝐶 and was performed by the object associated with the core context, has caused the event
𝜖𝑃 in the proto context. The reversed interpretation in terms of causal events should be given
to 𝑝(𝜖𝐶𝑡 |𝜖𝑃𝑡−Δ𝑡𝑃 ).
Considering the definition of the core consciousness, the causal relationships between the
two entities are encoded in two conditional probability densities (CPDs):
$$p(\epsilon^P_t \mid \epsilon^C_{t-\Delta t_C}, \epsilon^P_{t-\Delta t_P}) \qquad (4.7)$$
$$p(\epsilon^C_t \mid \epsilon^P_{t-\Delta t_P}, \epsilon^C_{t-\Delta t_C}) \qquad (4.8)$$
As a matter of fact, the probability densities in Equations 4.7-4.8 consider both the interaction (i.e. Eq. 4.5 or Eq. 4.6) and the initial situation (i.e. $\epsilon^P_{t-\Delta t_P}$ or $\epsilon^C_{t-\Delta t_C}$).
Figure 4-3: Graphical representation of the mapping into the AM 3-D space of a passive triplet $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$. The symbols $l^x_{P/C}$ represent the contextual SOM-label associated with each cluster. In this example the proto or core events are represented by $l^x_{P/C} \dashrightarrow l^y_{P/C}$, where $x \neq y$. The transitions in the Proto and Core Maps are dashed to represent the non-linearity and discontinuity of the trajectories.
4.2 Autobiographical Memory domain applications: Surveillance and Crowd Management scenarios
In the previous section a probabilistic model based on CE-DBN was sketched in order to
describe multiple entity interactions. The knowledge thus represented inside the proposed
CN can be employed in many different domains: surveillance scenarios and crowd analysis-
management are just two limited examples. Generally, in surveillance scenarios the goal of
the system is the analysis of interactions and recognition of specific behaviour between two
or more people (external entities). On the other hand, in the crowd analysis domain, the
focus of the system can be shifted toward the analysis and classification of the interactions that
occur between the crowd and a human operator who is in charge of maintaining a proper
security level within the monitored area (for this purpose, the crowd can be seen as a single
macro-entity). The two entities can be represented as a couple of $IVCC$s, as proposed in
section 3.3.2, namely an $IVCC_O$ and an $IVCC_S$ respectively.
In this section two aspects will be discussed, namely the probabilistic model learning phase
and the detection phase for surveillance and crowd scenarios. During the (off-line) learning
phase the CN observes both entities, i.e. the human operator and the crowd, storing their
interactions within the AM. As for the (on-line) detection phase, it will be shown how
different definitions of the probabilistic model are needed.
The system is designed to support a human operator in crowd management during the on-
line phase. This task is accomplished by recognizing specific operator-crowd abnormal
interactions. Typically, in people-flow redirection problems, an abnormal interaction can
be detected whenever the user puts into action wrong countermeasures against panic or
overcrowding situations. In this case the CN ought to drive the operator to choose correct
actions by either simply signalling the anomaly or by suggesting actions to be performed
based on its learned experience.
4.2.1 Learning phase: interaction representations
During an off-line phase, the CN is able to learn and store into the AM a set of triplets (i.e.
interactions) for different situations: $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$ (passive triplet) and $(\epsilon_C^-, \epsilon_P, \epsilon_C^+)$ (active
triplet). The crowd configurations are captured by core sensors, while the operator actions
are mapped into proto sensors. Each triplet represents a point of a 3-D space. In Fig. 4-3 an
example of the 3-D space mapping of a passive triplet is depicted. This representation allows
one to sketch the set of triplets stored in an AM. We point out that the ordering of the events
along the $\mathcal{E}_P^-$, $\mathcal{E}_C$ and $\mathcal{E}_P^+$ axes is not relevant, as what is really significant is only the number
of occurrences of a certain triplet. However, each generic triplet of events can be associated
with an influence model, i.e. a specific AM can model the dynamic evolution of interactions
for a specific context. It is possible to define a switching variable $\theta$ as influence parameter
[96].
Each triplet is associated with a probability, derived from an estimate of two conditional
probability densities, $p(\epsilon_P^+ \mid \epsilon_C, \epsilon_P^-, \theta)$ and $p(\epsilon_C^+ \mid \epsilon_P, \epsilon_C^-, \theta)$ (cf. Equations 4.7 and 4.8), which are proportional to the number of votes that the particular triplet received, i.e. the number of
occurrences observed during the AM training phase that represent a specific interaction
(i.e. an influence model). Fig. 4-4 shows an example of conditional relationship for a
passive triplet: $\epsilon_P^+$ given the two previous events $\epsilon_C$, $\epsilon_P^-$ and the interaction model $\theta$.
A temporal histogram is associated with each AM element (i.e. with each triplet), in order
to store the temporal information related to the events of the triplets. For example, considering a passive triplet $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$ with given events, the histogram permits
evaluating the probability that a specific proto event $\epsilon_P^+$ takes place $\tau_{CP^+}$ seconds after the
core event $\epsilon_C$. The histogram bin size must be selected as a trade-off
between the precision of the temporal prediction required by the application and
the number of training samples available.
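This binning mechanism can be sketched as follows. The class name, bin size and delay values below are ours, for illustration only; the thesis does not specify an implementation:

```python
from collections import defaultdict

# Hypothetical sketch: per-triplet histogram of delays tau (seconds) between
# the core event and the following proto event. The bin size trades temporal
# precision against the number of training samples that fall in each bin.
class DelayHistogram:
    def __init__(self, bin_size_s=2.0):
        self.bin_size = bin_size_s
        self.counts = defaultdict(int)
        self.total = 0

    def add(self, tau_s):
        """Record one observed delay for this triplet."""
        self.counts[int(tau_s // self.bin_size)] += 1
        self.total += 1

    def prob(self, tau_s):
        """Empirical probability that the proto event follows after ~tau_s."""
        if self.total == 0:
            return 0.0
        return self.counts[int(tau_s // self.bin_size)] / self.total

h = DelayHistogram(bin_size_s=2.0)
for tau in [1.2, 1.8, 3.5, 2.1, 1.9]:
    h.add(tau)
print(h.prob(1.5))  # delays 1.2, 1.8, 1.9 fall in the first bin -> 3/5
```

With a larger bin size, more samples land in each bin (more robust estimates) at the cost of coarser temporal prediction, which is exactly the trade-off described above.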
Figure 4-4: Example of CE-DBN for a passive triplet, e.g. $(\epsilon_P^-, \epsilon_C, \epsilon_P^+)$, with a parameter $\theta$ tied across proto-core-proto transitions.
4.2.2 Detection phase: surveillance scenarios
After the learning phase, the CN, by using the AM, has the capability of recognizing
interactions while they take place, in an on-line fashion. In [30] the exploitation of an
AM for the detection of different kinds of interactions between two people was proposed.
To this end, a cumulative measure is computed, exploiting the information encoded
in the proposed Coupled E-DBN model. To accomplish this task, for each interaction
$i$, $i = 1, \ldots, N_I$, where $N_I$ is the number of considered interactions, a set of couples of
trajectories (core and proto) is used to train the model ($\theta_i$), originating a trajectory in a
3-D space (as shown in Fig. 4-3). To detect the type of cause-effect relationship between
entities, for each triplet $(\epsilon^{P,C}_t, \epsilon^{C,P}_{t-\Delta t_{C,P}}, \epsilon^{P,C}_{t-\Delta t_{P,C}})$ the following measure is computed:

$$l^i_t = l^i_{t-\Delta t_{C,P}} + p(\epsilon^{P,C}_t, \epsilon^{C,P}_{t-\Delta t_{C,P}}, \epsilon^{P,C}_{t-\Delta t_{P,C}} \mid \theta_i), \qquad (4.9)$$

where $l^i_{t-\Delta t_{C,P}}$ is the measure computed at the time at which the previous event was
observed, and $p(\epsilon^{P,C}_t, \epsilon^{C,P}_{t-\Delta t_{C,P}}, \epsilon^{P,C}_{t-\Delta t_{P,C}} \mid \theta_i)$ denotes the probability that the observed triplet belongs to the $i$-th interaction model. For each triplet of events, the best matching
influence model is chosen as $i^* = \arg\max_i l^i$ with $i = 1, \ldots, N_I$. The high criticality of the detection phase entails that, if a mismatch between the observed data and the
learned knowledge is detected, the system can call the attention of the operator. In this case
the learning phase starts up again.
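The accumulation of Eq. 4.9 and the argmax selection can be sketched as follows. The triplet labels, per-model probabilities and the function name are made-up illustrations, not values from the thesis:

```python
# Illustrative sketch of the cumulative measure of Eq. 4.9: for each
# interaction model theta_i, the triplet probabilities are accumulated over
# the observed sequence, and the best model is i* = argmax_i l_i.
def best_matching_model(observed_triplets, models):
    """models: dict model_id -> dict triplet -> p(triplet | theta_i)."""
    l = {i: 0.0 for i in models}          # l_i initialised at zero
    for triplet in observed_triplets:
        for i, p in models.items():
            l[i] += p.get(triplet, 0.0)   # Eq. 4.9 update
    return max(l, key=l.get), l

models = {
    1: {("eP-", "eC", "eP+"): 0.7, ("eP-", "eC2", "eP+"): 0.1},
    2: {("eP-", "eC", "eP+"): 0.2, ("eP-", "eC2", "eP+"): 0.6},
}
obs = [("eP-", "eC", "eP+"), ("eP-", "eC", "eP+")]
i_star, scores = best_matching_model(obs, models)
print(i_star)  # model 1 accumulates 1.4 vs 0.4 for model 2
```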
4.2.3 Detection phase: Crowd management scenarios
In human-to-human interactions, each state change of one entity typically corresponds to
a state change of the other. In this case it is possible to affirm that the entities have the
same (or at least a similar) dynamic. On the contrary, in crowd scenarios, the dynamics of
the entities are extremely different, namely the crowd changes its status more frequently
than the operator. Generally the number of people, in a room or in a zone, can change
without any operator actions. In all the cases in which the dynamics between entities show
significant differences, the AM can be considered as a sparse collection of triplets. In
order to design a robust classification algorithm for abnormal interaction recognitions, an
approach to encode a statistical sparse model using the Self Organizing Map (SOM) is needed. The
following section is dedicated to this purpose.
4.3 Proposed approach for abnormal interaction detection in crowd monitoring domain
The proposed cognitive video surveillance system has two main purposes. The first and
most important one is to detect the interaction anomalies between operator and crowd.
The second is to substitute or to help the user during crowd management, by recognizing anomalous interactions with the crowd. The presented cognitive system accomplishes both
these goals by learning a specific behavioural model for operator-crowd interactions in
which the crowd is correctly controlled by a user. This model describes normal conditions of crowd management. The CN can detect anomalous operator-crowd interactions
as deviations from the normality situation. In automatic operating mode, the system substitutes the operator and interacts directly with the crowd. When crowd reaction patterns
do not conform to the expected behaviour, an anomalous configuration (i.e. interaction) is
detected. The method used for interaction modelling and the above-mentioned anomaly
detection is presented here. An interaction behaviour cannot be completely represented
by a triplet alone: a set of triplets must be analysed in order to identify a model. A
common learning problem can be formalized as follows: the generic sequence of triplets
$Tr_j = (\epsilon^-_{P,C}, \epsilon_{C,P}, \epsilon^+_{P,C})$, $j = 1, \ldots, N_T$, where $N_T$ is the number of triplets in that specific
sequence, can belong to different observed models $\theta_i$, $i = 1, \ldots, N_I$ ($N_I$ is the number of
operator-crowd interaction models). Fig. 4-5 shows the triplet encoding by means of a mapping function $f(\cdot)$. For sparsely collected data, i.e. sparse triplets, the mapping function
defined as $f(\epsilon^-_{P,C}, \epsilon_{C,P}, \epsilon^+_{P,C}) = p(\epsilon^+_{P,C} \mid \epsilon_{C,P}, \epsilon^-_{P,C}, \theta_i)$ is potentially not useful for
distinguishing triplet associations with specific kinds of operator-crowd interaction models.
A different transform function, $f(\epsilon^-_{P,C}, \epsilon_{C,P}, \epsilon^+_{P,C})$, is defined for mapping triplets into a 2-D
space in order to decrease misclassification errors. A specific dimensionality reduction method
can be employed to encode the AM. In this way, it is possible to obtain a probabilistic model
for rare-interaction detection, describing high-complexity relationships between
entities by means of simpler formulas [107]. The mapping function $f(\cdot)$ must meet the
Figure 4-5: Model learning problem: triplet recovery from model. $Tr_j$ represents the $j$-th generic triplet, $\theta_i$ is the $i$-th interaction model.
following requirements: maximum preservation of the information on operator-crowd interactions and correct reconstruction of the original data, optimizing the classification accuracy.
4.3.1 Dimensional reduction and preservation of the information
A large number of methods have been proposed for dimensionality reduction: they are typically classified into linear and non-linear methods. This section addresses a fundamental
issue arising in reduction problems: the interaction information contained in the primary data must
be preserved. Two well-known feature reduction techniques, namely Principal Component
Analysis (PCA) (a linear method) [111] and the Self Organizing Map (a non-linear method), are
compared.
In Table 4.1, a comparison between PCA and SOM is presented, where the binary formats
used for output vector encoding are varied. The binary format is expressed by $[wl, fl]$, where
$wl$ represents the word-length and $fl$ the fraction-length. In particular, Table 4.1 presents
the error measures, calculated as average Euclidean distance, between the original data $D$ and
the reconstructed data $\hat{D}$. It is possible to note that, as the number of bits increases, the SOM
behaves better than PCA.
Table 4.1: SOM and PCA comparison
Binary Format   SOM-map   PCA err   SOM err
[3, 1]          8 × 8     0.1857    1.4430
[5, 1]          32 × 32   0.1954    0.0938
[5, 2]          32 × 32   0.0846    0.0938
[6, 2]          64 × 64   0.0803    2.8175 · 10⁻⁷
4.3.2 Self Organizing Map: classification evaluation
Taking into account a SOM layer formed by $K$ neurons, its dimensions are adapted in order
to find the best matching couple $(l, w)$ such that $l \times w = K$. The number of core (or proto)
Super-states is then $K$ and the total number of possible core (or proto) events is $K^2$, taking
null-events as relevant, as explained in [21]. The parameter $K$ must be tuned: by
decreasing the SOM-map size, many different input state vectors can fall into the same
cluster; this generates a rougher classification but ensures that only relevant events are
likely to be selected. On the other hand, by employing high-dimensional Kohonen layers,
the classification is improved, whereas the total number of irrelevant events increases.
The dimension of the layer is a relevant parameter in our system. A small layer allows
the system to summarize its knowledge in a few concepts, which is positive, although
classification of situations may result too rough in some cases. On the other hand, very
large layers result in a very sparsely populated Superstate space, meaning that the system
would need massive training in order to observe, and later recognize, any possible situation.
For the moment, this parameter has been tuned empirically.
We define a data set $D$ as follows:

$$D = \{D(t) : t = 0, \ldots, T\}, \qquad (4.10)$$

where $D(t) \in \mathbb{R}^N$ is a vector $D(t) = [d_1(t), \ldots, d_N(t)]'$ in which each component $d_i(t)$
will represent, in our application (section 7.3), the number of people in the $i$-th room at
instant $t$. The clustering process performed by the SOM is defined by means of a transformation function $f_n(D) : D \to S$ with $S = \{S_k(t) : t = 0, \ldots, T\}$, where $k = 1, \ldots, K$ is the
index of the neuron and $T$ the maximum training time. The vectors $S_k(t) \in \mathbb{R}^M$, with $M < N$
($M = 2$ in our case), represent the coordinates in the SOM map of the neurons fired
at time $t$. Each element of the data set can be expressed as $D(t) = C_k(t) + n_k(t)$,
where $C_k(t) \in \mathbb{R}^N$ is the vector of weights of the $k$-th neuron, which is associated with
$S_k(t)$; $n_k(t)$ can be considered Gaussian noise $\mathcal{N}_k(0, \Sigma_k)$. The covariance matrix $\Sigma_k$ is
computed in each $k$-th SOM node considering all the training vectors which have activated
the $k$-th neuron. It is possible to define a conditional probability density function $p(D|C_k)$
as follows:
$$p(D|C_k) = [p(C_k)\, w_k(t)]^{-1} \exp\Big\{-\sum_{t=0}^{T} \Omega(t)' \Sigma_k^{-1} \Omega(t)\Big\}, \qquad (4.11)$$

where $p(C_k)$ is the neuron activation probability, computed as the number of samples in the node over the total number of training samples, $\Omega(t) = D(t) - C_k$, and
$w_k = (2\pi)^{N/2} (\det(\Sigma_k))^{0.5}$.
A possible criterion to evaluate a SOM, given a data set $D$, relies on the Average Mutual
Information (AMI) $I_M(D, C)$ [38], defined by Equation 4.12:

$$I_M(D, C) = H(D) - H(D|C), \qquad (4.12)$$

where $H(D)$ is the data-set entropy, while the conditional entropy of the normal multivariate distribution $p(D|C) = \sum_{k=1}^{K} p(D|C_k)$ is defined as

$$H(D|C) := 0.5 \ln[(2\pi e)^N |\Sigma|], \qquad (4.13)$$
where $\Sigma$ is the covariance matrix of the normal multivariate p.d.f. $p(D|C)$. To investigate the
capabilities of the Self Organizing Map we set up a test: an artificial training data set $D$
was constructed, consisting of 143 vectors with a sampling time of 4 s, provided by
our crowding simulator. Each vector is formed by $N = 6$ components and contains the
number of people in each room.
Table 4.2: Classification evaluation for different SOM-layer
SOM-map 𝐻(𝐷) 𝐻(𝐷|𝐶) 𝐼𝑀(𝐷,𝐶)
2 × 2 6.3750 0.7249 5.65014 × 4 6.3750 0.1291 6.24595 × 5 6.3750 0.0384 6.33667 × 7 6.3750 0 6.3750
Table 4.2 lists entropies for different sizes of the SOM layer. Beyond 7 × 7 the quality of the
classification shows no significant improvement from the AMI point of view. However, we are
not concerned with an extremely precise description of the core state space (we do not want
to maximize $I_M(D,C)$ at all costs). We certainly need a sufficient amount of information
to be preserved but, at the same time, as explained in this subsection, we need our system
to be capable of synthesizing knowledge by establishing analogies.
A representation of reality based on analogies is necessary in order to deal with situations
never seen during the training phase.
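The AMI criterion of Equations 4.12-4.13 can be sketched numerically as follows. The entropies and covariance determinant below are toy numbers of our own choosing, not the values of Table 4.2:

```python
import math

# Sketch of the SOM evaluation criterion of Eqs. 4.12-4.13 under the
# Gaussian assumption: H(D|C) = 0.5 * ln[(2*pi*e)^N * det(Sigma)].
def gaussian_cond_entropy(det_sigma, n_dims):
    """Conditional entropy (nats) of an N-dim Gaussian, Eq. 4.13."""
    return 0.5 * math.log((2 * math.pi * math.e) ** n_dims * det_sigma)

def entropy(probs):
    """Empirical entropy (nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Toy example: 4 equiprobable clusters, tight per-cluster covariance.
H_D = entropy([0.25, 0.25, 0.25, 0.25])
H_D_given_C = gaussian_cond_entropy(det_sigma=1e-3, n_dims=2)
ami = H_D - H_D_given_C          # Eq. 4.12: I_M(D, C) = H(D) - H(D|C)
print(round(ami, 4))
```

As in Table 4.2, a smaller residual covariance (tighter clusters) lowers H(D|C) and therefore raises the AMI toward H(D).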
The SOM can be used to divide a set of training data $D$ into different multivariate time
series $\{D_k\}_{k=1}^{K}$, where $D_k = \{D_{1,k}, \cdots, D_{n,k}\}$ is associated with the $k$-th neuron, such that
$D_k \cap D_j = \emptyset$ for $k \neq j$ and $\bigcup_{k=1}^{K} D_k = D$. These sub-sequences of vectors can be modelled by local Vector Auto Regressive (VAR) models [101]. The number of generated VAR
models corresponds to the number of neurons of the SOM. The local matching measurement
between the sequence of input data and the output of the local VAR models specifies how
much of the output variation has been represented by the SOM. Considering a multivariate
time series $D_k$, an auto-regressive model of order $p$, denoted VAR(p), describes the $i$-th
vector $D_{i,k}$ as a linear combination of the previous state vectors:
𝐷𝑖,𝑘 = Φ0 + Φ1𝐷𝑖−1,𝑘 + Φ2𝐷𝑖−2,𝑘 + · · · + Φ𝑝𝐷𝑖−𝑝,𝑘 + 𝜖𝑖, (4.14)
where $\Phi_0, \cdots, \Phi_p$ are $(N \times N)$ parameter matrices and $\epsilon_i$ represents an $(N \times 1)$ white
noise. From the multivariate time series $D_k$ we have modelled a VAR(2) as $D_{i,k} = \Phi_0 + \Phi_1 D_{i-1,k} + \Phi_2 D_{i-2,k} + \epsilon_i$, where $\Phi_0$, $\Phi_1$ and $\Phi_2$ are estimated coefficient matrices which
are stored in each SOM node. Each VAR model has been used as a linear predictor filter. A dataset $D^c$, different from $D$, has been used for the classification phase. Also in
this case the SOM divides the data into different multivariate time series $\{D^c_k\}_{k=1}^{K}$, where
$D^c_k = \{D^c_{1,k}, \cdots, D^c_{n,k}\}$ is associated with the $k$-th neuron, such that $D^c_k \cap D^c_j = \emptyset$ for $k \neq j$
and $\bigcup_{k=1}^{K} D^c_k = D^c$. We have compared the one-period-ahead forecast sequences $\hat{D}^c_k$, obtained
by the VAR(2) models built over different SOM layer sizes, with $D^c_k$. Fig. 4-6 shows an example of curve trends for the vector sequences predicted by the VAR(2) models built over different
SOM layer sizes. A comparison between the simulated data and the predictor filter outputs is
provided by the FIT measurement:
$$FIT = \frac{\sum_{i=1}^{n} \| D^c_{i,k} - \hat{D}^c_{i,k} \|}{\sum_{i=1}^{n} \| D^c_{i,k} - \bar{D}^c_k \|}, \qquad (4.15)$$

where $D^c_{i,k} \in D^c_k$, $\hat{D}^c_{i,k}$ is the output of the $k$-th VAR(2) model and $\bar{D}^c_k = E[D^c_k]$.
The averages of the FIT between the one-period-ahead forecasts obtained by the VAR(2) models
and the simulated data (formed by 140 vectors) show that a SOM with small layers is able
to build analogies between the data stored in the same neurons during the training phase
and during the classification phase.
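As an illustration of the FIT measurement of Eq. 4.15, the following sketch computes the ratio between the prediction residuals and the deviation of the data from its mean, for one SOM node. The forecasts and data below are made-up values; in the thesis they come from the per-node VAR(2) models:

```python
# Sketch of the FIT measurement (Eq. 4.15) for the multivariate time series
# of one SOM node: ratio of prediction-error norms to mean-deviation norms.
def fit_measure(actual, predicted):
    """actual, predicted: lists of equal-length vectors (lists of floats)."""
    n_dims = len(actual[0])
    mean = [sum(x[j] for x in actual) / len(actual) for j in range(n_dims)]
    norm = lambda v: sum(c * c for c in v) ** 0.5
    # numerator: ||D_ik - D_ik_hat|| summed over the sequence
    num = sum(norm([a - p for a, p in zip(x, y)])
              for x, y in zip(actual, predicted))
    # denominator: ||D_ik - mean(D_k)|| summed over the sequence
    den = sum(norm([a - m for a, m in zip(x, mean)]) for x in actual)
    return num / den

# Toy 2-component sequence (e.g. people counts in two rooms) and forecasts.
actual    = [[10.0, 4.0], [12.0, 5.0], [14.0, 6.0]]
predicted = [[10.5, 4.2], [11.8, 5.1], [14.2, 5.9]]
print(round(fit_measure(actual, predicted), 3))
```

Under the form given in Eq. 4.15 this ratio is an error measure, so lower values mean the predictor tracks the held-out data more closely.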
After the training phase of the chosen SOM, a mapping function $f(Tr_j)$ projecting each
triplet into the 2-D SOM map can be defined. The output of this function is a list of zones, i.e.
trajectories (which can actually be discontinuous), within the SOM map. Dynamic aspects
of the evolution of clustered triplets are also considered: transition probabilities between
different zones are computed, in such a way that the outcome of the training process can be
ideally compared to a Hidden Markov Model (HMM) [95].
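The estimation of zone-occurrence probabilities and zone-transition probabilities from a training sequence can be sketched as follows (the zone labels and function name are ours, for illustration):

```python
from collections import Counter, defaultdict

# Sketch: estimating Super-state (zone) occurrence probabilities P_k and the
# transition matrix M(S_k, S_{k+1}) from a training sequence of SOM zones,
# analogous to fitting the transition model of an HMM by counting.
def estimate_model(sequence):
    occ = Counter(sequence)
    P = {s: c / len(sequence) for s, c in occ.items()}
    trans = defaultdict(Counter)
    for a, b in zip(sequence, sequence[1:]):
        trans[a][b] += 1                      # count observed transitions
    M = {a: {b: c / sum(row.values()) for b, c in row.items()}
         for a, row in trans.items()}         # row-normalise to probabilities
    return P, M

seq = ["A", "B", "A", "B", "B", "A"]
P, M = estimate_model(seq)
print(P["A"], M["A"]["B"])  # A occurs 3/6 times; A is always followed by B
```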
Figure 4-6: Example of graphical comparison between VAR(2) models and simulated data representing the number of people within zone 1. The averages of the matching between the VAR(2) model outputs and 140 simulated vectors (expressed as percentages) are the following: SOM 4 × 4 fit: 67.14%; SOM 5 × 5 fit: 53.6%; SOM 7 × 7 fit: 40.18%.
4.4 Results
The simulated monitored environment is shown in Fig. 4-7. The configuration of doors,
walls and rooms is however customizable and a wide range of scenarios can be set for
tests. A crowd enters a room of the simulator and is given the motivation of moving toward
the exit of the building. Births of new characters occur during the simulation, modelled
by a Poisson distribution (we hypothesize a fixed average incoming rate: data coming
from different simulations are thus comparable). The human operator must act on the door
configuration in order to direct crowd flows, thus preventing overcrowding.
The use of a graphical engine (freely available at http://www.horde3d.org/) has
been introduced in order to make the simulation realistic during the AM (section 4.2) training
phase. Here a human operator acts on the door configuration in order to prevent room overcrowding, based on the visual output, which needs to be as realistic as possible. Namely, the
Figure 4-7: The simulated monitored environment.
simulator has to output realistic data both from the behavioural point of view, in order to
effectively interact with the human operator, and from the visual point of view, in order to
grant an effective interface by truly depicting reality. Reactions of an operator faced with
an unrealistic visual output could be extremely different and strongly depend on render-
ing quality. For this reason, characters are also animated to simulate walk motion (at first
glance a crowded environment with still people could look less populated than it really is).
Crowd behaviour within the simulator is modeled based on Social Forces, which were
mentioned in section 3.1. This model assimilates each character on the scene to a particle
subject to 2D forces, and treats it consequently from a strictly physical point of view. Its
motion equations are derived from Newton's law $F = ma$. The forces a character is
driven by are substantially of three kinds [71]. An attractive motivational force 𝐹𝑚𝑜𝑡 pulls
characters toward some scheduled destination, while repulsive physical forces $F_{phy}$ and
interaction forces $F_{int}$ prevent collisions with physical objects and take into account
interactions among characters. An additional linear drag (viscous resistance) $F_{res}$ takes
into account the fact that no character actually persists in a state of constant speed but
tends to stop its motion as motivation runs out. This force is in fact accounted for and
included in $F_{mot}$. Such forces are sketched in Fig. 4-8. Chaotic fluctuations are included,
according to the modified social force model proposed in [114]. These fluctuations account
for random individual variations in behaviour and lead to crowd motion self-organization.
The three forces are estimated at each time instant for each character, whose position is then
updated according to the motion equation and normalized according to the current fps rate
Figure 4-8: Vectorial sum of forces 𝐹𝑇𝑂𝑇 and influence range of characters.
supported by the graphical engine (which strongly depends on the number of characters
to be handled). As already mentioned, people incoming rate is modelled as a Poisson
distribution. Their death occurs as they get to their final scheduled destination. A human
operator interacts with the crowd by opening doors to let it flow, while trying to minimize
the time a door remains open. Although somehow simplified with respect to more complex
works, such as [71] (where additional assumptions on trajectories’ regularity are made),
the developed model results in a good overall output, where people behave correctly and
realistically.
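A single step of such a social-force update can be sketched as follows. All constants, the force laws and the function name are illustrative choices of ours, not the simulator's actual parameters:

```python
import math

# Minimal sketch of a social-force step: a character is a particle driven by
# a motivational force toward a goal, repulsive forces from neighbours, and a
# linear viscous drag, integrated with explicit Euler (Newton: a = F/m).
def step(pos, vel, goal, neighbours, dt=0.1, mass=1.0,
         k_mot=1.0, k_rep=0.5, k_drag=0.8):
    # motivational force toward the scheduled destination (unit direction)
    dx, dy = goal[0] - pos[0], goal[1] - pos[1]
    d = math.hypot(dx, dy) or 1.0
    f = [k_mot * dx / d, k_mot * dy / d]
    # repulsive interaction forces from nearby characters
    for nx, ny in neighbours:
        rx, ry = pos[0] - nx, pos[1] - ny
        r = math.hypot(rx, ry) or 1e-6
        f[0] += k_rep * rx / r**3
        f[1] += k_rep * ry / r**3
    # viscous drag opposing the current velocity
    f[0] -= k_drag * vel[0]
    f[1] -= k_drag * vel[1]
    ax, ay = f[0] / mass, f[1] / mass
    vel = [vel[0] + ax * dt, vel[1] + ay * dt]
    pos = [pos[0] + vel[0] * dt, pos[1] + vel[1] * dt]
    return pos, vel

p, v = step([0.0, 0.0], [0.0, 0.0], goal=[10.0, 0.0], neighbours=[[0.0, 1.0]])
print(p)  # moves toward the goal, pushed slightly away from the neighbour
```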
The simulator also includes (simulated) sensors, which try to reproduce the (processed) sensor
data coming from different cameras looking at different subsets (rooms) of the monitored
scene. A virtual people-estimation algorithm outputs the number of people by simply
adding some noise to the actual number of people framed by the virtual camera, thus
mimicking the output of real image processing algorithms, such as [61]. The state vector of
the system (which corresponds to the external object, eso) is (cf. Equation 4.10)

$$\mathbf{X}_{Cr}(t) = \{x_{Cr_1}(t), x_{Cr_2}(t), \ldots, x_{Cr_N}(t)\}, \qquad (4.16)$$

with $N = 6$ in our case (six cameras, one for each room); $x_{Cr_n}(t)$ is the number of people
in room $n$. The people flow starts in a hall room, which corresponds to $x_{Cr_1}$. A 7 × 7 2D SOM
is then trained in order to cluster the state-vector space. The SOM Super-states (more precisely,
their variations) define events. The internal (endo) state of the system (namely, the doors'
configuration) is simply modelled by a binary vector storing the state of each door (true if
open, false if closed). Variations of such a vector define proto events.
An AM is then trained by a human operator who opens virtual gates in order to let the
crowd stream outside in case high occupancy states are reached and, at the same time, to
minimize the time gates remain open.
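The event extraction described above can be sketched as follows: a proto event is a change of the binary door vector, a core event a change of the crowd's SOM Super-state (the function name and sample log are illustrative, not from the thesis):

```python
# Sketch of proto/core event extraction from the simulator log.
def events(samples):
    """samples: list of (door-state tuple of bools, superstate id) per instant."""
    proto, core = [], []
    for (d0, s0), (d1, s1) in zip(samples, samples[1:]):
        if d1 != d0:
            proto.append((d0, d1))   # proto event: door configuration changed
        if s1 != s0:
            core.append((s0, s1))    # core event: crowd Super-state changed
    return proto, core

# Toy log: the operator opens door 0, then the crowd Super-state changes.
log = [((False, False), 3), ((True, False), 3), ((True, False), 5)]
proto, core = events(log)
print(len(proto), len(core))  # one proto event, one core event
```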
In our case, four kinds of simulated scenarios with different crowd behaviours (see Table
4.3) have been taken into consideration, in order to memorize the interactions between a
human operator (proto-self) and the crowd (core-self), as formalized in 3.3. For instance,
the first crowd behaviour, identified by 1d, has $\mu = \sigma^2 = 1$ for the Poisson probability
mass function, a weak interaction force, and a relatively short interaction range.
After mapping the AM into a 2-D space, by means of a SOM, the operator’s reactions to
different crowd fluctuations, stored and encoded by 𝑓 , can be used on-line to choose an
optimal strategy, i.e. to emulate the actions of a human operator, by predicting not only his
behaviour but also crowd’s reaction to it.
Table 4.3: Different crowd behaviours in simulated scenarios
ID   Num. persons/second   Interaction range   Interaction Forces
1d   1                     1 m                 Low
2d   2                     2 m                 Low
1p   1                     1 m                 High
2p   2                     2 m                 High
A reference model 𝜃𝑖 for operator-crowd interactions is then designed (refer to Fig. 4-5).
We define a sequence of passive triplets (related to the $i = 1d$ crowd behaviour, Table 4.3) as:

$$\mathbf{Tr}_k = (Tr_1, Tr_2, \ldots, Tr_k, \ldots, Tr_K), \qquad (4.17)$$

where $Tr = (\epsilon_P^-, \epsilon_C, \epsilon_P^+)$. The mapping function $f(Tr_k) = S_k$ defines a corresponding sequence of Super-states in the SOM map as follows: $\mathbf{S}_k = (S_1, S_2, \ldots, S_k, \ldots, S_K)$. In
the simulation, the maximum time between two subsequent Super states (. . . , 𝑆𝑘, 𝑆𝑘+1, . . . )
is taken as 8 s. After such a time lapse, a new interaction (Super-state) is considered. The
$k$-th Super-state probability is an element of a vector $P$ defined as $P_k = P(S_k)$; it corresponds to the number of occurrences of $S_k$ over $\mathbf{S}_k$, with $k = 1, \ldots, K$. The elements of
the transition probability matrix $M$ are defined as $M(S_k, S_{k+1}) = P_{k+1|k} = P(S_{k+1}|S_k)$.
A test sequence of passive triplets $\mathbf{Tr}^{ID}_k$ (one for each crowd behaviour listed in Table
4.3) is simulated and processed by $S^{ID}_k = f(Tr^{ID}_k)$ in order to generate $\mathbf{S}^{ID}_k$ with $k = 1, \ldots, K$. A weighted average of the transition probabilities between subsequent Super-states
$(\ldots, S^{ID}_k, S^{ID}_{k+1}, \ldots)$ is computed as follows:

$$P^{ID}_i(t) = \frac{1}{W} \sum_{k=1}^{W} P^{ID}_k P^{ID}_{k+1|k}, \qquad (4.18)$$

where $P^{ID}_k = P(S^{ID}_k|\theta_i)$ and $P^{ID}_{k+1|k} = M(S^{ID}_k, S^{ID}_{k+1}|\theta_i)$, while $W$, called the moving evaluation window, defines the number of test-sequence triplets considered at each step $t$.
We define the probability to reach the $(k+1)$-th Super-state starting from the $k$-th as
$P^{ID}_{k \mapsto k+1} = P^{ID}_k P^{ID}_{k+1|k}$. The recognition of the interaction model is performed by taking the
maximum probability: $(i^*, t) = \arg\max_i P^{ID}_i(t)$ with $i = 1, \ldots, N_I$ and $t = 1, \ldots, T$.
The couple $(i^*, t)$ defines the kind of recognized operator-crowd interaction $\theta_i$ and also the
maximum time $W \cdot 8 + t \cdot 8$ s in which the detection is performed.
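The recognition rule built around Eq. 4.18 can be sketched as follows. The Super-state labels and the per-model probabilities are made-up numbers for illustration:

```python
# Illustrative sketch of the recognition rule around Eq. 4.18: for each model
# theta_i, average P_k * P_{k+1|k} over a window of W Super-state transitions,
# then pick the model with the maximum average.
def recognize(seq, models, W):
    """seq: Super-state sequence; models: id -> (P dict, M nested dict)."""
    best, best_score = None, -1.0
    for i, (P, M) in models.items():
        score = sum(P.get(a, 0.0) * M.get(a, {}).get(b, 0.0)
                    for a, b in zip(seq[:W], seq[1:W + 1])) / W
        if score > best_score:
            best, best_score = i, score
    return best, best_score

models = {
    "1d": ({"A": 0.6, "B": 0.4}, {"A": {"B": 0.9}, "B": {"A": 0.8}}),
    "2d": ({"A": 0.2, "B": 0.8}, {"A": {"B": 0.1}, "B": {"A": 0.3}}),
}
i_star, score = recognize(["A", "B", "A", "B"], models, W=3)
print(i_star)  # the alternating sequence matches model "1d"
```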
Different averaged transition-probability curves, with $W = 2, 5, 10, 15$ and $T = 10$
steps, are evaluated. An example with $W = 10$ is shown in Fig. 4-9. The four interaction
behaviours (red curves) are compared with the reference model (blue curve). Using only
a few triplets at each time step (i.e. lower $W$ values, e.g. $W = 2$ and $W = 5$), some
behaviour models are confused. The separation distance between the curves increases
when the evaluation window increases, e.g. with $W = 10$ and $W = 15$.
The Mean Square Error (MSE) is computed in order to evaluate the distances between the
observed interaction-behaviour curves and the reference behaviour model; the minimum
MSE provides a similarity measure between interaction behaviours. At each time step
$t = 1, \ldots, T$ it is computed as follows:

$$MSE(t) = \frac{1}{W} \sum_{k=1}^{W} \left(P^*_{k \mapsto k+1} - P_{k \mapsto k+1}\right)^2,$$

where $P_{k \mapsto k+1}$ and $P^*_{k \mapsto k+1}$ correspond to the probability values over $S_k$ and $S^*_k$, i.e. the reference and observed sequences.

Figure 4-9: Classification examples of interaction behaviour using evaluation window W = 10.

The anomalous interactions between an operator and the monitored crowd could
emerge after a normal behaviour, e.g. a careless user does not open some doors. In this
situation the CN, working in its on-line modality, is able to recognize anomalous crowd
management. Fig. 4-10 shows the normal behaviour, in the specific case of $ID = 1d$
(blue curve), and compares it with the observed operator-crowd interactions (red curve).
Using an evaluation window $W = 10$, the two processes start to drift apart at $t = 6.4$ s.
Correspondingly, the MSE starts to grow. The detection rule is $\nabla MSE(t) > 0$
for $t \in [t_{min}, t_{max}]$: the system detects operator-crowd anomalous interactions when the
curve gradient is positive for an evaluation period $t_{ep} = t_{max} - t_{min}$. In the on-line modality,
when an anomalous interaction has been recognized, the CN alerts the operator by
sending a message. Such a message can contain the last detected abnormal passive triplet,
e.g. first user action (proto event), evolution of the crowd (core event) and consequent
operator action (proto event). In the case shown in Fig. 4-10, the anomalous situation is
due to a wrong consequent user action, i.e. the operator does not open some doors and the
number of people increases.
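The gradient-based detection rule can be sketched as follows. The curves and the evaluation period below are made-up values, chosen only to mimic an operator drifting away from the reference behaviour:

```python
# Sketch of the anomaly rule: track the MSE between the reference and observed
# transition-probability curves and raise an alarm when its gradient stays
# positive over an evaluation period of t_ep consecutive steps.
def detect_anomaly(reference, observed, t_ep=3):
    mse = [(r - o) ** 2 for r, o in zip(reference, observed)]
    run = 0
    for t in range(1, len(mse)):
        run = run + 1 if mse[t] > mse[t - 1] else 0  # grad(MSE) > 0 streak
        if run >= t_ep:
            return t                                 # alarm time step
    return None                                      # no anomaly detected

ref = [0.40, 0.41, 0.40, 0.41, 0.40, 0.41, 0.40]
obs = [0.40, 0.40, 0.35, 0.30, 0.22, 0.10, 0.01]     # operator drifts away
print(detect_anomaly(ref, obs, t_ep=3))
```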
Figure 4-10: Detection of anomalous operator-crowd interactions. The system detects an anomalous interaction when the operator does not open two doors and the number of people increases. This incorrect crowd-management situation is shown in the figure and compared with the correct situation.
4.4.1 Example of application on real video sequences
In order to give consistency to the proposed cognitive video surveillance system, an experiment has been conducted on an available video sequence from the "Performance Evaluation
of Tracking and Surveillance" (PETS) workshop dataset [37] for a single camera (S3 High
level, Time 14-16, View_0001; the sequence length is 223 frames, the frame rate is ∼7 fps).
The main target of this experiment is to demonstrate how the system is able to recognize
the interaction between an operator and the crowd behaviour in a video sequence. For this
purpose the real environment has been partitioned into three zones, which are supposed to
be monitored by cameras, as shown in Fig. 4-11 (a). In the simulated environment, the
zones correspond to three rooms, Fig. 4-11 (b). In the sequence, two crowd behaviours
corresponding to different people flows have been identified: the first flow occurs
when the people go across the scene from zone 1 to zone 3 (i.e. from frame_0000 to
frame_0107), while the second flow occurs when the people move from zone 3 to zone 1 (i.e. from
frame_0108 to frame_0222). By using the simulator, these two different people behaviours
have been reproduced: for the first flow the people enter the scene in zone 1 and head out in
zone 3, while for the second flow the people enter in zone 3 and leave in zone 1.
In the simulator a human operator can manage the crowd flow from one room to another
by acting on the doors $d_1$, $d_2$, $d_3$, $d_4$. The user opens a door when people are near
it. In order to reproduce the same interaction using the real video sequences, the same
configuration of the doors has been assumed. A people counting algorithm [88]
provides an estimate of the total number of people in the zones present in the video sequences,
Fig. 4-11 (a). In the virtual environment a people counting module is simulated in order
to count the people in a sub-area of each room (dashed circular areas, Fig. 4-11 (b)).
Figure 4-11: Example of real environment (a) and simulated scenario (b) used for the test; the virtual rooms correspond to the zones. The red dashed line corresponds to the people flow direction from zone 1 to zone 3; the blue dashed line describes the people movement from zone 3 to zone 1. Dashed circular areas qualitatively correspond to the parts of the rooms monitored by cameras equipped with a people counter module. $d_1$-$d_4$ are the doors.
The test is composed of two parts: learning and detection (on-line). During the learning phase, the cognitive system learns two probabilistic models, i.e. two AMs, from the simulation in order to describe the two crowd behaviours. The rules used to memorize these two models are as follows: if the operator sees people moving from zone 1 to zone 3, only 𝑑1 and 𝑑2 must be opened, according to the people flow; the user has to act on 𝑑3 and 𝑑4 if the flow is in the opposite direction. Four scenes for the two people flows have been simulated, each 60[𝑠] long. The simulated people counters provide the number of people in each zone once per second. During the second part the system works on real video sequences. The CN recognizes the observed situations according to the memorized knowledge. During the autonomous phase, in order to interact directly with the crowd, the CN must discriminate between different crowd-environment configurations. Fig. 4-12 presents four sample frames of
different crowd behaviours: in case (a) the people flow follows the red arrow (i.e. from zone 1 to zone 3), while case (b) presents the opposite movement direction (i.e. from zone 3 to zone 1). In cases (c) and (d) groups of people have different movement directions, namely people splitting and merging. In these last two cases, the system does not find any correspondence between the observed scene and the stored interaction models. In particular, the CN can consider scene (c) an anomalous crowd-environment interaction when compared with situation (a). The same consideration holds for situations (d) and (b). When anomalous crowd-environment interactions are detected, the cognitive system sends an alarm message in order to inform the human operator. After this phase, the CN is able to predict the most likely future actions and when they will be performed. During the operator support phase, the cognitive system identifies anomalies in terms of differences between predicted actions and user actions.
The SOM-map dimensions produce different results in terms of knowledge representation for crowd-environment interactions. In particular, small Kohonen layers increase the SOM's capability of creating analogies between different input data. This effect becomes more relevant when the input data are corrupted by noise. A test has been conducted employing two people counters, namely 𝑃𝐶1 and 𝑃𝐶2, characterised by different accuracies, i.e. 𝑎𝑃𝐶1 = 80%, 𝑎𝑃𝐶2 = 60%. The experiment can be divided into two parts. Firstly, we have manually built the ground truth for the video sequence. We use this information in order to generate the sequences of super-states. Through three different SOMs, i.e. SOM 16,
Figure 4-12: Sample frames for four different crowd-environment interactions. Different people flows are presented: two opposite directions of movement (a) (b), people splitting (c), people merging (d). (a) and (b) represent normal behaviours, while (c) and (d) represent two abnormal behaviours.
SOM 25 and SOM 36, the original data have been mapped into SOM-spaces. Secondly, by using the people counter [28], it is possible to obtain the number of people (𝑃𝐶1) with an estimated accuracy of 80% (𝑎𝑃𝐶1 = 80%). By tuning a people counter parameter, another set of people counts (𝑃𝐶2), with lower accuracy, has been acquired (𝑎𝑃𝐶2 = 60%): we have manually corrupted the parameter in order to decrease the accuracy of the people number estimation. The data provided by 𝑃𝐶1 and 𝑃𝐶2 are classified using the three different SOMs, so that six different sequences of fired neurons are obtained. The events (i.e. super-state transitions), which correspond to passages across the zones (i.e. Zone 1 → Zone 2, Zone 2 → Zone 3 and Zone 3 → Zone 2, Zone 2 → Zone 1), are compared with the events generated from the ground truth. When the system recognizes the same events, it is possible to affirm that a specific SOM provides an efficient crowd-environment interaction representation; otherwise a failure is detected. Failures are due to the poor capability of larger SOM layers to find analogies between input data: similar inputs may be mapped to different neurons (see Subsection 4.3.2). In Table 4.4, the performances (in people flow detections) of the three SOMs (16, 25 and 36 neurons respectively) are shown.
The interesting result is that the 16-neuron SOM is able to detect three zone passages even in the presence of corrupted input data.
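The size effect can be sketched with a toy nearest-neighbour quantizer standing in for a trained SOM; the codebook values and inputs below are illustrative, not taken from the experiments:

```python
def bmu(weights, x):
    """Index of the best-matching unit: the codebook vector closest to x."""
    return min(range(len(weights)),
               key=lambda i: sum((w - v) ** 2 for w, v in zip(weights[i], x)))

# Toy codebooks standing in for trained Kohonen layers of different sizes.
small_som = [(0.0,), (5.0,), (10.0,)]                          # coarse layer
large_som = [(0.0,), (2.0,), (4.0,), (6.0,), (8.0,), (10.0,)]  # finer layer

clean, noisy = (4.9,), (6.2,)  # same underlying people count, one corrupted

# The coarse layer fires the same neuron for both inputs (an "analogy"),
# so the derived super-state sequence matches the ground truth; the finer
# layer separates them, and the event sequences diverge.
assert bmu(small_som, clean) == bmu(small_som, noisy)
assert bmu(large_som, clean) != bmu(large_som, noisy)
```

A real SOM would learn the codebook from training data; the point here is only that coarser quantization absorbs noise that finer quantization turns into spurious state changes.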
Table 4.4: People flow detections using different people counters and SOMs. People counter accuracies: 𝑎𝑃𝐶1 = 80%, 𝑎𝑃𝐶2 = 60%. Success = 1, Failure = 0.

Direction         | SOM 16      | SOM 25      | SOM 36
                  | PC 1 | PC 2 | PC 1 | PC 2 | PC 1 | PC 2
Zone 1 → Zone 2   |  1   |  1   |  1   |  1   |  1   |  0
Zone 2 → Zone 3   |  1   |  0   |  1   |  0   |  0   |  0
Zone 3 → Zone 2   |  1   |  1   |  1   |  1   |  1   |  0
Zone 2 → Zone 1   |  1   |  1   |  1   |  0   |  1   |  0
For this test, SOM 25 and 𝑃𝐶1 have been employed. Fig. 4-13 shows the cognitive system's predictions and detections of normal and abnormal interactions between an operator and the crowd. In the figure, the operator actions and corresponding video frames are represented (blue operator action rectangle) in the ground truth bar. The prediction and action bar represents the cognitive system actions (yellow prediction rectangle). An anomaly is represented by a red anomaly rectangle. Considering case (a), the system predicts that the first zone crossing (i.e. from zone 1 to zone 2) requires opening 𝑑1 (specified by the blue triangle). In this case, the operator action corresponds to the predicted action (i.e. 𝑑1). During the second zone crossing (i.e. from zone 2 to zone 3) the system detects an anomalous operator-crowd interaction: the user opens an incorrect door (i.e. 𝑑3, indicated by the blue circle). Case (b) presents the same analysis for the opposite people flow direction.
Figure 4-13: Qualitative results of normal and anomalous operator-crowd interaction detection during the operator support phase. The ground truth bar represents the operator actions in correspondence with the video frames. The prediction and action bar represents the cognitive system actions.
Chapter 5
A bio-inspired algorithm for designing a
human attention mechanism
5.1 Introduction
In the previous two chapters (3 and 4) an intelligent video surveillance system called the Cognitive Surveillance Node has been presented. Such a framework can predict possibly dangerous situations. At the same time, the system can actively support human decisions in order to prevent possible wrong user actions. The main issue in large-scale video surveillance networks is the analysis of huge amounts of information in order to detect dangerous situations. For example, overcrowding has recently been the cause of several tragic events: in January 2013 the panic due to an accidental fire in a nightclub in Brazil caused at least 241 casualties; three years before, 21 people died from suffocation during a famous pop festival in Duisburg (Germany). In these and other similarly unfortunate situations, automatic crowd monitoring systems could have played an important role in increasing the safety level of the environment and, in the end, saved human lives.
Typically, in wide-area surveillance systems a huge amount of information is acquired by camera networks. This can lead to an unpredictable increase in the required computational power and to the possibility of losing some important data. In particular, not all the data coming from the guarded environment are relevant for a specific high-level security task. In order to avoid the problem of information overload, innovative methods based on the human attentional process have been proposed for the selection of relevant features. Human attention is a biological mechanism which permits focusing on some stimuli (i.e. visual features) while neglecting others; it is basically divided into two main approaches: bottom-up and top-down. Bottom-up attention takes into consideration sensory data only, while the top-down approach also considers a priori knowledge, acquired from the scene, in order to improve saliency extraction.
To this end, an algorithm based on a cognitive perception process, Selective Attention Automatic Focus (S2AF), is proposed for relevant information extraction within a surveillance network. S2AF can be seen as a novel processing method for analysing raw data acquired by the multiple sensors of a Cognitive Surveillance Node (CSN), presented in chapter 4. The application of cognitive science related to the selective attention focus [66] to the design of the next generation of safety and security systems represents a relevant added value with respect to the state of the art. Such processes enhance the cognitive system's capabilities not only for recognizing a specific situation but also for changing its behaviour in accordance with new circumstances, localizing the important parts of a controlled area.
Other methods have been proposed for saliency detection; e.g. Mancas et al. [75] proposed a bottom-up attention algorithm for crowd abnormality detection, based on features extracted by optical flow. Each frame is manually divided into different cells (i.e. parts) which are compared in order to detect contrasts.
This chapter describes the application of cognitive perception mechanisms to the design of a top-down algorithm for automatic focus of attention, as part of a CSN for crowd monitoring. In particular, this work shows how detection performance in the more cluttered areas of the guarded environment depends on the allocation of processing resources. In the proposed framework a completely automatic mechanism for partitioning the environment, forming sub-areas or zones, and allocating the resources is presented.
The attention module workflow is divided into two main steps: a training phase and an active phase. During the training phase, the algorithm acquires and models local information (i.e. local models), such as the crowd density in a zone (i.e. number of people). This part is extremely important in order to build the so-called Environmental Memory, figure 5-1.
Figure 5-1: Attention module, formed by the attention mechanism based on Selective Attention Automatic Focus (S2AF) and the Environmental Memory, which stores the crowd density local models.
During the active phase, the attention module, using the knowledge contained within the environmental memory, can analyse the situation and highlight the most relevant environmental parts. This task permits determining the zones of the area under control on which the operator has to focus attention. Secondly, the system, using the local models, can identify possible deviations from an expected situation, such as an anomalous density of people (i.e. overcrowding).
Figure 5-2 shows two operative situations in which the attention module, which is part of the cognitive node, is deactivated (i.e. attention open loop) or activated (i.e. attention closed loop).
∙ Attention open loop: through the switch T the operator can bypass the attention module and receive the complete information directly from all camera networks.
∙ Attention closed loop: through the switch the operator can activate the attention module and receive the relevant information only (i.e. the video images come from a subset of sensors).
Figure 5-2: The switch T activates or deactivates the attention mechanism. Attention open loop: the operator receives the information directly from the camera networks. Attention closed loop: the operator receives the information through the attention module, which selects the relevant information only.
According to biological reasoning mechanisms, stimuli are first perceived by sensory receptors and then understood with respect to previous experiences. In the proposed framework, the environmental memory is used for storing and holding the sensory information.
Results are obtained through a crowd simulator similar to the one described in [17]. The use of a simulated environment is necessary in order to gather sufficiently substantial testing video sequences. The issue of modelling and simulating crowds will be discussed and motivated, as it represents a central matter in applying the proposed cognitive theory.
The remainder of this work is organized as follows: section 5.2 gives a quick insight into the selective attention cognitive process for crowd monitoring, modelling and analysis. Section 5.3 discusses the information flow process for cognitive crowd monitoring, which leads to the important concept of focusing attention on relevant information. The proposed Selective Attention Automatic Focus algorithm is presented in section 5.4. Results are drawn in section 5.5.
Figure 5-3: The schematic information flow process for selective attention automatic focus and relevant information extraction. The total information is associated to the macroscopic level (a). 𝑋 represents a set of trajectories defined as the microscopic level (b). By an encoding function 𝑍(𝑋), the more informative environment parts, associated to the sufficient information 𝑌 , are identified as the mesoscopic level (c). Such a representation is used to extract the relevant information 𝑇 (𝑌 ) according to the task (d).
5.2 Selective attention for automatic crowd monitoring
Focus of attention is a human cognitive process. This biological mechanism is involved in the activity of an operator who monitors several security cameras at the same time in order to perform a guarding task. This can happen, for example, to ensure that there are no ongoing dangerous situations (e.g. overcrowding) or specific behaviours (e.g. suspicious movements) in the environment under control. If a large amount of data has to be considered, the human capability of focusing attention on important events while ignoring other situations can be compromised. For this reason an automatic system for crowd analysis and monitoring, able to support human selective attention and focus mechanisms, can be of fundamental importance as an additional tool for the user. A crowd can be seen as a dense group of moving objects, which suggests focusing on its temporal evolution and dynamics. However, if an extremely cluttered crowd is considered, it is nearly impossible for an automatic system or a human operator to individually track each single particle. It is clear that one can tackle this problem at different interpretation scales, considering the crowd as a global entity or as a sum of different parts. The attention focus process depends on the focus size, defined by the way in which the attentional resources are distributed over the monitored area [12]. It is possible to consider an inversely proportional relationship between the efficiency of the focus of attention process and the focus size (i.e. attentional resources distribution). Different approaches for crowd analysis depend on different attentional resource distributions. Typically, a crowd can be described in terms of its components, namely the people it is formed by. This remark may sound trivial, but it has deep implications in the way a crowd is depicted. In particular, it raises the issue of choosing between local and global descriptors. A local crowd descriptor relies on the features associated to each member, such as positions, speeds, orientations, motivations, destinations, etc. A global (holistic) descriptor, on the other hand, is based on features of the crowd as a single entity, such as average density, entropy, average shift in some direction, global displacement, etc. Global features can in general be derived from local ones, by averaging or integrating local quantities. However, it is not only a matter of the scale at which the crowd is analysed,
but rather of the additional amount of information stored in local quantities compared to global ones. A nice parallel example comes from thermodynamics, where global quantities such as the energy, pressure and temperature of a gas can in principle be derived from the average kinetic energy of its molecules: by knowing the exact behaviour of each single molecule in the gas one can derive the temperature, while the opposite calculation is not possible, as information is lost by averaging over all molecules. However, in both the crowd and the thermodynamics cases, it is not always possible to access local information entirely, while global quantities can be easily gathered. For example, in a video surveillance framework, it is unrealistic to analyse the behaviour of every single person (i.e. microscopic level analysis) in a high-density crowded scene in order to understand the crowd's global behaviour (i.e. macroscopic level analysis). The proposed S2AF algorithm represents a method to extract the most informative data for solving the considered surveillance task. Through the proposed approach, it is possible to select a subset of the guarded environment (i.e., mesoscopic level analysis), thus maximizing the efficiency of the focus of attention process.
5.3 Information flow process for cognitive crowd monitoring
According to the previous considerations, the proposed approach extracts the most relevant information from acquired data; this minimal set of information can be considered as a sufficient statistic that can be obtained through a specific transform of the sensory data acquired from the considered surveillance network (see Figure 5-3). The relevant information can be determined starting from the observed environment. It is possible to define 𝑋𝑛(𝑡) as the position of the object (or particle) 𝑂𝑛 at time 𝑡, where 𝑋𝑛(𝑡) ∈ R². A trajectory is a sequence X𝑛 : 𝑋𝑛(𝑡1), ..., 𝑋𝑛(𝑡𝑘), where 𝑡1 is the first instant in which the object is observed, while 𝑡𝑘 represents the moment in which the particle disappears. The crowd simulator is able to generate the trajectories of all the objects {X𝑛}, 𝑛 = 1, ..., 𝑁, where 𝑁 is the total number of simulated entities. Based on these trajectories, an Instantaneous Topological Map (ITM) algorithm [59] can be used to partition the whole considered area of the map into smaller zones. If Ω ⊆ R² is the map of the guarded area, the ITM method produces a set {𝑧𝑗}, 𝑗 = 1, ..., 𝐽, such that 𝑧𝑗 ∩ 𝑧𝑖 = ∅ for 𝑗 ≠ 𝑖 and 𝑧1 ∪ ... ∪ 𝑧𝐽 = 𝑍 ⊆ Ω. The main aim of the proposed algorithm is to select a subset {𝑐𝑘}, 𝑘 = 1, ..., 𝐾, of {𝑧𝑗} as the best possible group of zones for maximizing the information according to a specific task of the system, with 𝑐ℎ ∩ 𝑐𝑘 = ∅ for ℎ ≠ 𝑘 and 𝑐1 ∪ ... ∪ 𝑐𝐾 = 𝐶 ⊂ 𝑍. Moreover, a zone 𝑐𝑘 can be obtained by merging different zones 𝑧𝑗, i.e. ∀𝑘 = 1, ..., 𝐾 ∃𝑗 : 𝑐𝑘 ⊇ 𝑧𝑗. In the case of crowding, the number of people in each zone will be considered. The number of people estimated in the zones 𝑧𝑗 is given by Y(𝑡) = {𝑌1(𝑡), ..., 𝑌𝐽(𝑡)}. It is
possible to define 𝑇 as a tessellation procedure that provides the set of zone groups {𝑐𝑘} starting from {𝑧𝑗}. The objective of the proposed work is to find the specific 𝑇 which maximizes the efficiency of the attention focus process. According to Haykin [52], two mutual informations are defined as follows: 𝐼(𝑌, 𝑇 ), related to an informativeness measurement, and 𝐼(𝑋, 𝑇 ), associated to a complexity value, where 𝑋 ≡ {X𝑛} and 𝑌 ≡ Y(𝑡). According to the cognitive control paradigm, 𝐼(𝑌, 𝑇 ) must be maximized (i.e. 𝐻(𝑌 |𝑇 ) minimized), while 𝐼(𝑋, 𝑇 ) must be minimized (i.e. 𝐻(𝑋|𝑇 ) maximized). It is possible to define an evaluation measure for the focus process as follows:

𝑃 = min_𝑇 [𝐻(𝑌 |𝑇 ) − 𝜆𝐻(𝑋|𝑇 )]    (5.1)

where the parameter 𝜆 is inversely proportional to the area of 𝐶: 𝜆 ∝ 1/𝐴(𝐶), with 𝐴(𝐶) = Σ_{𝑘=1}^{𝐾} 𝐴(𝑐𝑘). According to Equation (5.1) it is possible to find an attentional resources distribution 𝐶 which focuses the attention on relevant information only.
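As a concrete illustration, the two conditional entropies in Equation (5.1) can be estimated from joint distributions; the distributions and the value of 𝜆 below are purely illustrative, not taken from the experiments:

```python
import math

def cond_entropy(joint):
    """H(A|B) in bits from a joint distribution p(a, b) given as joint[a][b]."""
    h = 0.0
    for j in range(len(joint[0])):
        p_b = sum(row[j] for row in joint)  # marginal p(b = j)
        if p_b > 0:
            h -= sum(row[j] * math.log2(row[j] / p_b)
                     for row in joint if row[j] > 0)
    return h

def focus_score(joint_yt, joint_xt, lam):
    """Evaluation measure of Eq. (5.1) for one candidate tessellation T:
    H(Y|T) - lambda * H(X|T); the best T minimises this over tessellations."""
    return cond_entropy(joint_yt) - lam * cond_entropy(joint_xt)

# T fully determines Y (informative grouping) but reveals little about X.
joint_yt = [[0.5, 0.0], [0.0, 0.5]]      # H(Y|T) = 0 bits
joint_xt = [[0.25, 0.25], [0.25, 0.25]]  # H(X|T) = 1 bit
print(focus_score(joint_yt, joint_xt, lam=0.5))  # -> -0.5
```

A tessellation that leaves 𝑌 uncertain given 𝑇, or that pins down the individual trajectories 𝑋, would score higher (worse) under this measure.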
5.4 Selective Attention Automatic Focus (S2AF) algorithm
This section describes an application in the crowd monitoring domain that demonstrates the proposed Selective Attention Automatic Focus method. The crowd simulator has been used in
Figure 5-4: Crowd simulator: simulated crowding scene for trajectories generation.
Figure 5-5: Simulated people flows: qualitative sketch of trajectories; the arrow colors indicate five different people flows 𝑑1, ..., 𝑑5.
Figure 5-6: ITM: topological map, formed by the different zones created after 50 trajectories using the ITM algorithm.
Figure 5-7: Histogram: a learned people histogram from which it is possible to evaluate the probability that an observed number of people occurs in the zone 𝑧𝑗 (in this case 𝑗 = 42).
Figure 5-8: Gaussian p.d.f.: the Gaussian p.d.f. obtained by normalizing the histogram (in this case the average is 𝑚42 = 3.4 and the variance is 𝜎²42 = 2.8).
Figure 5-9: Seed attentional resources: initial conditions (i.e. three high-entropy zones 𝑧ˢ𝑗 for each room).
order to generate a significant set of realistic trajectories. The first step is to create a topological map encoding the people trajectories through the ITM algorithm. Figure 5-4 shows an image of the simulator used for simulating five different people flows, figure 5-5, while figure 5-6 presents an example of a topological map. The map is formed by 𝑍 = 147 zones generated through the encoding of 𝑁 = 40 trajectories. The second step is to acquire for each zone, through a large number of training simulations, the transition models and the probability density functions of the people number, figures 5-7 and 5-8. The transition models represent the passages from each zone 𝑧𝑗 to its neighbours 𝑧𝑖, where 𝑧𝑗, 𝑧𝑖 ∈ 𝑍, with 𝑖 = 1, ..., 𝑀 and 𝑖 ≠ 𝑗. The transition probabilities for each zone 𝑧𝑗 are defined as 𝑝𝑗,𝑖 = 𝑝(𝑧𝑖|𝑧𝑗). A conditional entropy value 𝐻𝑖,𝑗 for each 𝑧𝑗 is computed as follows:
conditional entropy value 𝐻𝑖,𝑗 for each 𝑧𝑗 is computed as follows:
𝐻𝑖,𝑗 = −𝑀∑𝑖=1
𝑝𝑗,𝑖𝑙𝑜𝑔𝑝𝑗,𝑖 (5.2)
Figure 5-10 shows an example of two situations: low (on the left) and high (on the right) uncertainty of transition between a zone and its neighbours.
Figure 5-10: Examples of transition probabilities for ITM zones. A conditional entropy value is computed between a given zone and its neighbours. On the left, a zone 𝑧1 is connected with only one neighbour 𝑧2 (i.e. low uncertainty and conditional entropy), while on the right a zone 𝑧4 has more potential neighbours 𝑧5, 𝑧6, 𝑧7 (i.e. high uncertainty and conditional entropy).
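The entropy of Equation (5.2) can be computed directly from a zone's outgoing transition distribution; a minimal sketch of the two situations of Figure 5-10, with base-2 logarithms and uniform probabilities assumed for illustration:

```python
import math

def transition_entropy(probs):
    """Conditional entropy -sum_i p_{j,i} log2 p_{j,i} of a zone's outgoing
    transition distribution (Eq. 5.2); 0 * log 0 is taken as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Left of Fig. 5-10: z1 has a single neighbour z2 -> no uncertainty.
h_low = transition_entropy([1.0])             # 0.0 bits
# Right of Fig. 5-10: z4 splits evenly over z5, z6, z7 (illustrative values).
h_high = transition_entropy([1/3, 1/3, 1/3])  # log2(3) ≈ 1.585 bits
```

With this convention the three-neighbour zone clears a threshold of 𝐻𝑡𝑟 = 1.5 and becomes a seed candidate, while the single-neighbour zone does not.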
The proposed method provides configurations 𝐶 of zone groups 𝑐𝑘 (obtained from the attentional resources) according to Equation (5.1). Through the maximization of 𝐻(𝑋|𝑇 ) it is possible to define a projection of the people trajectories 𝑋 over 𝐶 by means of the transformation function 𝑇 . Furthermore, since the transition model gives a representation of 𝑋 over 𝑍 and 𝐶 ⊂ 𝑍, in order to maximize the term 𝐻(𝑋|𝑇 ) it is sufficient to determine which zones 𝑧𝑗 satisfy the condition:

𝑍ˢ = {𝑧ˢ𝑗 : 𝐻𝑖,𝑗 ≥ 𝐻𝑡𝑟}    (5.3)

where 𝐻𝑡𝑟 is a threshold used to select and order the zones associated to high entropy. From Equation (5.3) it is possible to identify a set of seed zones (i.e. attentional resources) 𝑍ˢ = {𝑧ˢ𝑗} with 𝑍ˢ ⊂ 𝑍. 𝑍ˢ represents the initial condition (i.e. the initial attentional resources, see figure 5-9) of the S2AF algorithm, Algorithm 1. Section 5.5 will show how the efficiency of the proposed cognitive process for selective automatic attention focus depends on this initial set.
Data: 𝑧𝑗 ∈ 𝑍
Result: 𝑐𝑘 ∈ 𝐶
Initialization: given the seed zones set 𝑍ˢ ⊂ 𝑍;
for 𝑘 = 1 : 𝐾 do
    𝑧ˢ𝑗 ↦ 𝑐𝑘 with 𝑧ˢ𝑗 ∈ 𝑍ˢ;
end
while ∃ 𝑧𝑗 ∉ 𝐶 do
    if the number of people observed in 𝑧𝑗 is 𝑁𝑗 ≠ 0 then
        for 𝑘 = 1 : 𝐾 do
            if |𝑂𝑗 − 𝑂ˢ𝑘| < 𝐷 then
                𝑧𝑗 ↦ 𝑐𝑘;
                𝐶ℎ = {𝑐𝑘};
                ∀𝑐𝑘 estimate 𝑁̂𝑘;
                ∀𝑐𝑘 compute 𝑝𝑘 = 𝑝(𝑁̂𝑘);
                ∀𝑐𝑘 normalize: 𝑃𝑟𝑜𝑏ℎ = {𝑝̂𝑘};
                compute 𝑓ℎ = − Σ_{𝑘=1}^{𝐾} (1/𝜆𝑘) 𝑝̂𝑘 log 𝑝̂𝑘;
            end
        end
        𝑓(𝑡) = min_ℎ 𝑓ℎ;
    end
    if 𝑓(𝑡) < 𝑓(𝑡 − 1) then
        𝑧𝑗 ↦ 𝑐𝑘 where 𝑐𝑘 ∈ 𝐶ℎ;
    end
end
Algorithm 1: S2AF algorithm
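A runnable sketch of the grouping loop follows, under simplifying assumptions: Manhattan distance between zone centers, assignment directly to the minimum-cost candidate of Eq. (5.5) rather than the full 𝑓(𝑡) < 𝑓(𝑡 − 1) bookkeeping, raw people counts in place of the Gaussian estimates, and entirely made-up zone data:

```python
import math

def s2af(zones, seeds, D):
    """Greedy sketch of the S2AF loop. `zones` maps a zone id to a tuple
    (center, people_count, area); `seeds` are high-entropy seed zone ids;
    D is the maximum grouping distance."""
    clusters = {k: [s] for k, s in enumerate(seeds)}

    def cost(cl):
        # Normalised people-count distribution over clusters (the p_k of the
        # text), weighted by cluster area since 1/lambda_k is proportional
        # to A(c_k) in Eq. (5.4).
        counts = [sum(zones[z][1] for z in zs) for zs in cl.values()]
        areas = [sum(zones[z][2] for z in zs) for zs in cl.values()]
        total = sum(counts) or 1
        probs = [c / total for c in counts]
        return -sum(a * p * math.log(p) for a, p in zip(areas, probs) if p > 0)

    for z, (center, n_people, _) in zones.items():
        if n_people == 0 or any(z in zs for zs in clusters.values()):
            continue  # empty or already-assigned zones are skipped
        best_k, best_f = None, math.inf
        for k, zs in clusters.items():
            seed_center = zones[zs[0]][0]
            dist = abs(center[0] - seed_center[0]) + abs(center[1] - seed_center[1])
            if dist < D:
                trial = {kk: list(v) for kk, v in clusters.items()}
                trial[k].append(z)  # tentative configuration C_h
                f = cost(trial)
                if f < best_f:
                    best_k, best_f = k, f
        if best_k is not None:
            clusters[best_k].append(z)
    return clusters

zones = {
    1: ((0, 0), 3, 1.0), 2: ((1, 0), 2, 1.0),    # near seed zone 1
    3: ((10, 0), 4, 1.0), 4: ((11, 0), 1, 1.0),  # near seed zone 3
    5: ((5, 5), 0, 1.0),                         # empty: skipped
}
print(s2af(zones, seeds=[1, 3], D=3))  # -> {0: [1, 2], 1: [3, 4]}
```

This reproduces only the grouping behaviour of Algorithm 1; the thesis additionally keeps the previous cost 𝑓(𝑡 − 1) across time steps and re-evaluates the configuration as new observations arrive.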
Suppose the number of clusters 𝑐𝑘 is fixed (i.e. 𝐾 is fixed). During the initialization phase, the proposed algorithm associates to each 𝑐𝑘 one 𝑧ˢ𝑗 ∈ 𝑍ˢ. The crowd simulator generates synthetic data regarding the people positions in the environment with a frame rate of 30[𝑓𝑝𝑠]. The proposed algorithm, with sample time 𝑡𝑠 = 1[𝑠], takes into consideration a zone 𝑧𝑗 ∉ 𝐶 in which, by use of a simulated people counter, the presence and the number of people 𝑁𝑗 have been detected and estimated. The S2AF algorithm tries to assign 𝑧𝑗 to 𝑐𝑘 if |𝑂𝑗 − 𝑂ˢ𝑘| < 𝐷, where 𝑂𝑗 and 𝑂ˢ𝑘 are the centers of the 𝑗-th zone and of 𝑧ˢ𝑗 ∈ 𝑐𝑘, while 𝐷 is the maximum distance for grouping the zone with the cluster. Now a possible configuration 𝐶ℎ is formed. The total number of people in each 𝑐𝑘 = {𝑧𝑗}, 𝑗 = 1, ..., 𝑆, with 𝑐𝑘 ∈ 𝐶ℎ, is 𝑁̂𝑘 = Σ_{𝑗=1}^{𝑆} 𝑁𝑗. Supposing that the people number distribution in each zone is Gaussian and independent from the others, the probability 𝑝𝑘 = 𝑝(𝑁̂𝑘) is given by a Gaussian p.d.f. 𝒩𝑘(𝑚𝑘, 𝜎²𝑘), where the average is 𝑚𝑘 = Σ_{𝑗=1}^{𝑆} 𝑚𝑗 and the variance is 𝜎²𝑘 = Σ_{𝑗=1}^{𝑆} 𝜎²𝑗.
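The cluster-level statistics can be sketched directly; apart from 𝑚42 = 3.4 and 𝜎²42 = 2.8, which appear in Figure 5-8, the zone values below are invented for illustration:

```python
import math

def merge_gaussians(zone_stats):
    """Combine independent per-zone Gaussian people counts N(m_j, var_j)
    into the cluster-level N(m_k, var_k): means and variances simply add."""
    m_k = sum(m for m, _ in zone_stats)
    var_k = sum(v for _, v in zone_stats)
    return m_k, var_k

def gaussian_pdf(x, m, var):
    """Density of N(m, var) at x, used to evaluate p_k = p(N_k)."""
    return math.exp(-((x - m) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# A hypothetical cluster c_k formed by three zones.
m_k, var_k = merge_gaussians([(3.4, 2.8), (1.2, 0.9), (2.0, 1.5)])
# m_k = 6.6, var_k = 5.2
p_k = gaussian_pdf(7, m_k, var_k)  # density of observing 7 people in c_k
```

The independence assumption is what allows the simple sums; correlated zone occupancies would require covariance terms.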
Once the probabilities for all 𝑐𝑘 ∈ 𝐶ℎ have been estimated, S2AF provides a set of normalized probabilities 𝑃𝑟𝑜𝑏ℎ = {𝑝̂𝑘}, with 𝑝̂𝑘 = 𝑝𝑘 / Σ_{𝑘=1}^{𝐾} 𝑝𝑘, for the current configuration 𝐶ℎ. S2AF then computes the following cost function:
𝑓ℎ(𝑃𝑟𝑜𝑏ℎ, 𝐴(𝐶ℎ)) = − Σ_{𝑘=1}^{𝐾} (1/𝜆𝑘) 𝑝̂𝑘 log 𝑝̂𝑘    (5.4)
where 𝜆𝑘 ∝ 1/𝐴(𝑐𝑘). The algorithm then tries to assign 𝑧𝑗 to 𝑐𝑙 ∈ 𝐶, with 𝑙 = 1, ..., 𝐾 and 𝑙 ≠ 𝑘, obtaining another possible configuration. At each time step 𝑡, S2AF provides the configuration 𝐶ℎ which satisfies the following condition:

𝐶ℎ = {𝑐𝑘} : min_ℎ 𝑓ℎ(𝑃𝑟𝑜𝑏ℎ, 𝐴(𝐶ℎ))    (5.5)

𝑓(𝑡) = 𝑓ℎ is defined as the instantaneous cost function. This measure can be used to evaluate the efficiency of the automatic focus process. According to Equation (5.1), in order to minimize 𝐻(𝑌 |𝑇 ), S2AF assigns 𝑧𝑗 to 𝑐𝑘 if 𝑓(𝑡) < 𝑓(𝑡 − 1), where 𝑓(𝑡 − 1) is the cost function related to the previous time step. As can be seen in the results section, Algorithm 1 decreases the cost function. However, 𝑓(𝑡) never reaches 0; this can depend on many factors: e.g. the initial configuration (i.e. the 𝑧ˢ𝑗 positions), the initial distribution of virtual
people, the different people flows and speeds, etc.

Table 5.1: Groups of zones: S2AF output.

𝑧ˢ𝑗          𝑐𝑗
𝑧ˢ1 = 11     𝑐1 = {11, 1, 2, 6, 7, 10, 12, 14}
𝑧ˢ2 = 13     𝑐2 = {13, 48}
𝑧ˢ3 = 43     𝑐3 = {43, 42, 44}
5.5 Results
The experimental results have been obtained using the 3D crowd simulator, Figure 5-4. The configuration of doors, walls and rooms is customizable, and a wide range of scenarios can be set up for tests. A crowd enters a room of the simulator and is given the motivation of moving toward the exit of the building. Births, i.e. entries at rate 𝑚, of new characters occur during the simulation, modelled by a Poisson distribution. The performance of the proposed algorithm has been studied in the two following cases: the first with different seed zone choices and different initial people distributions; the second, with a fixed initial configuration, in the presence of an asynchronous event, i.e. a different people direction. In Figure 5-11 the average trends of the cost function 𝑓(𝑡) (normalized to 1) are shown. Three seed zones 𝑧ˢ𝑗 with different entropy values have been fixed: high (i.e. above 𝐻𝑡𝑟 = 1.5), low (i.e. below 𝐻𝑡𝑟 = 1.0) and randomly chosen. It is possible to note that the cost function associated to the high-entropy configuration, figure 5-11, tends to decrease compared with the other curves. This configuration is able to satisfy Equation 5.3. Table 5.1 shows the best groups of zones 𝐶 = {𝑐1, 𝑐2, 𝑐3} provided by S2AF. Figure 5-12 presents a visual representation of the zone grouping mechanism.
Given the configuration presented in Table 5.1, the second part of the experiment analyses how S2AF is able to detect an asynchronous event. Direction change events in 100 simulated situations have been generated. The people move according
Figure 5-11: An example of normalized average cost function trends for room 𝑅1. The experiments have been conducted on 150 sequences of synthetic data provided by the simulator with 𝑚 = 2 and people flow corresponding to 𝑑1. The total time of each simulation is 1000[𝑠]. Three attentional resource choices (high entropy, low entropy and random choice) have been considered. Each point on the curves corresponds to the average value of 𝑓(𝑡) over the different initial people distributions.
Figure 5-12: Groups of zones. S2AF provides groups of zones.
to a predefined direction (𝑑1); at a certain time instant they change direction (𝑑3). Considering that the event happens at 𝑡 = 0, Figure 5-13 presents the average trend of the probability curves 𝑝𝑘. When the people drift away from 𝑑1 (green curve), the corresponding probability 𝑝1 decreases, while the red curve 𝑝2 grows. This means that the people are moving from 𝑐1 to other zones. At 𝑡 = 30 the two curves (i.e. 𝑝1 and 𝑝2) reach the same probability, and after an oscillatory behaviour they tend to stabilize: the configuration 𝐶 is no longer able to describe the observed situations. An anomalous situation of environmental occupation in the group of zones corresponding to 𝑐1 is detected.
Figure 5-13: Asynchronous event detection. Direction change events in 100 simulated situations have been evaluated. The figure shows the average trends of the probabilities 𝑝𝑘, considering time alignment with the events at 𝑡 = 0. At 𝑡 = 30 the two curves reach the same probability. The system is able to detect the direction change of the people from 𝑐1 to other zones.
Chapter 6
Information bottleneck-based relevant
knowledge representation in large-scale
video surveillance systems
6.1 Introduction
Automatic representation, analysis and detection of abnormal events, as discussed in chapter 4, are central issues for last-generation video surveillance systems. In this context, an interactive and intelligent system embedded in physical environments has been proposed in chapter 3. It can represent a breakthrough in the design of people-oriented services applied to crowd monitoring in large-scale environments. Several works have addressed linking traditional computer vision tasks to context-aware capabilities such as scene understanding, interaction analysis and recognition of possible threats or dangerous situations. Different features have been considered for automatic crowd analysis: local features (e.g., features from accelerated segment test - FAST) can be used for people detection, while optical flow efficiently estimates human motion [100]. Considering such features it is possible to evaluate the density of a crowd ([40, 104]).
Video crowd analysis frameworks in the state of the art typically do not address two major problems that arise when a higher number of sensors is used; rigorous methods are needed for obtaining 1) an optimal information representation able to maintain the informativeness of the acquired low-level features, as well as 2) a compact description to reduce the processing complexity due to an ever-increasing amount of information.
The problem of information overload can be avoided through an automatic method for selecting subparts of the guarded environment and focusing operators' attention on the most informative regions, such as the one proposed in chapter 5. In Figure 6-1.a an example of relevant information extraction, which is defined as sparse information, is shown. The main problem for event detection and classification mechanisms is related to the reconstruction accuracy of the original data from incomplete and limited observations. Typical tasks in crowd monitoring applications consist in recognizing particular events within the crowd itself, such as the presence of a crush in forbidden areas or suspicious movements. For instance, in Figure 6-1.a two events of interest can be defined when the crowd flow crosses the red and brown lines, respectively. An approach based on Self Organizing Maps (SOMs) for reconstructing observed signals is presented in [23]; more recently, a SOM-based algorithm for defective image restoration has been proposed [73]. In particular, it is highlighted how the performance of the SOM-based method depends on the Kohonen-layer size. Other artificial neural
SOM-based method performances depend on Kohonen-layer size. Other artificial neural
networks derived from SOMs, such as Growing Neural Gas (GNG) (see [11]) and Growing
Hierarchical SOM (GH-SOM) (see [105]), can automatically adapt the dimension of the
Kohonen-layer. More in details, GNG computes an accumulated local error, which repre-
sents a distance measurements between two neuronal weights, and increases the number of
neurons if this is considered too large. Similarly, in GH-SOM the increase of the number
of neurons and layers is based on distance measurements between neuronal weights and
input data. Another type of neural network (Neural Gas (NG)) can improve the input data
topology preservation through an adaptive method based on learning of neighbourhood re-
lationships between the weight vector (associated with neuronal unit) and each external
stimuli (associated with input vector).
These mechanisms of adapting layer sizes and topology preservation are mainly addressed
towards original data reconstruction. The problem of recovering the signal from sparse
data requires more than just reconstruction accuracy: it is also necessary to preserve the
similarities between the relevant information and the original data (i.e., the input signals). This
SOM-layer size optimization problem can be represented by means of a specific cost function,
which relies on Information Bottleneck theory [125].
Figure 6-1: Relevant information extraction and crowd density map estimation. (a) Relevant events: the coloured circles specify important subparts of the scene; optical flow features represent sparse relevant information. (b) Crowd density map estimation by Lucas-Kanade [72] optical flow features.
The contributions of this chapter are as follows. A novel approach, based on the
information bottleneck, is presented for designing a cost function able to quantify the SOM trade-off
between the capability to recover the original signals and the preservation of the statistical
similarities between sparse relevant information and original data. It will be shown how the
correlation abilities of SOMs can be measured through a mixture of local linear regressive models
associated with each neuron. Such models can be used for predicting future values based on previous states.
Finally, by means of the proposed cost function, an algorithm is described for information
bottleneck-based SOM selection (IB-SOM). By means of this selection algorithm, the Information
Refinement block can determine the optimal data representation in the SOM space
(Figure 6-2).
The proposed framework has been applied to the crowd monitoring domain for people density
estimation and event recognition on real video sequences extracted from the public PETS
database [37]. Moreover, the proposed approach is compared to other neural networks, such as
Figure 6-2: Information Refinement block for automatic relevant data representation selection.
NG, GNG and GH-SOM. The remainder of the chapter is organized as follows: in Section
6.2 the information bottleneck-based SOM selection for relevant knowledge representation
is presented; experimental results are described in Section 6.3.
6.2 Information bottleneck-based relevant information representation
This section describes the proposed relevant information representation method applied to
video surveillance. In communication system theory, encoding a time-varying multi-dimensional
signal 𝑋(𝑡) is a common approach for extracting relevant information from
it. The available information acquired from a video surveillance network can be defined as
the vector 𝑋 = [𝑋1, 𝑋2, · · · , 𝑋𝑁 ], where 𝑋 ∈ X with 𝑋 ∈ ℝ^𝑁, and 𝑋 is a sample of
𝑋(𝑡) acquired at sampling time 𝑡. For crowd monitoring applications, 𝑋𝑖
represents the people density in the 𝑖-th monitored area, with 𝑖 = 1, · · · , 𝑁, where 𝑁 is the
maximum number of guarded zones. Such a vector describes the crowd density map. It
is possible to define the relevant information X̄, extracted at time instant 𝑡 by using the
attention focusing algorithm of chapter 5, as a subset of 𝑋: X̄ ⊆ 𝑋, where X̄ ∈ ℝ^𝑀 and 𝑀 ≤ 𝑁.
Figure 6-3 shows how the relevant sparse information X̃ (i.e., the sparse crowd density map) can
be reconstructed from X̄. The percentage of controlled area can be computed as the ratio
between the number of significant values in X̃ and the total available information contained
in 𝑋. The quantity X̂ ∈ ℝ^𝑁 is a sample of the reconstructed signal, with X̂ ≈ 𝑋. The SOM
projects the input data (i.e., X) into a space of reduced dimensionality. The neural network has
the ability to semantically represent the input vectors by collecting similar, but not necessarily
identical, crowd density maps within the same neuronal unit. Each neuron represents a
codeword associated to a prototype vector, i.e., a weight 𝑊𝑘 ∈ W with 𝑊𝑘 ∈ ℝ^𝑁, where
𝑘 ∈ {1, · · · , 𝐾} and 𝐾 is the maximum number of prototype vectors within the SOM layer.
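The codeword assignment just described is an ordinary nearest-prototype lookup; a minimal NumPy sketch is given below (the weight matrix and sample values are hypothetical, not taken from the thesis experiments):

```python
import numpy as np

def best_matching_unit(X, W):
    """Return the index of the prototype W_k closest (Euclidean) to sample X.

    X : (N,) input crowd density map, flattened.
    W : (K, N) matrix of SOM prototype vectors (one codeword per neuron).
    """
    distances = np.linalg.norm(W - X, axis=1)  # d(X, W_k) for every neuron k
    return int(np.argmin(distances))

# Toy example: K = 3 codewords for N = 4 controlled zones.
W = np.array([[0.0, 0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
X = np.array([0.9, 1.1, 0.1, 0.0])  # similar, but not identical, to codeword 1
assert best_matching_unit(X, W) == 1
```

In this way similar, not necessarily identical, density maps fall into the same neuronal unit.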
The problem can be formalized as that of representing the vector 𝑋 and the corresponding
sparse vector X̃ with the same best matching unit (BMU) 𝑘, as shown in Figure 6-3. The central
task here is to establish the firing properties of the neuronal patterns by balancing reconstruction
capabilities and correlation properties between sparse information and neurons.
To this end, a new variable Y has been defined in order to quantify the differences between
sparse and original signals; such a variable is as informative as possible about X. Reconstruction
and correlation attributes lead to the information bottleneck concept, which is defined
as a trade-off between two average mutual information (AMI) terms, 𝐼(𝑊,𝑋) and 𝐼(𝑊,𝑌 ).
The first term 𝐼(𝑊,𝑋) denotes the reconstruction measurement; according to Rate
Distortion theory, it should be minimized depending on the allowed distortion 𝑑𝑊 introduced
by the mapping process. Such a distortion is measured by the conditional entropy
𝐻(𝑋|𝑊 ): more details are given in [21]. The second term 𝐼(𝑊,𝑌 ) represents the correlation
measurement between sparse and original data, which should be maximized. A
practical measure of the correlation is proposed as the difference between the statistical
relationships of the data, described by the joint probability 𝑝(𝑌,𝑋), and the neuronal unit
correlation capabilities, defined by the joint probability 𝑝(𝑌,𝑊 ).
Figure 6-3: Relevant information extraction and crowd density map projections into SOM-space. The grid cells, laid over the image plane, can be seen as the set of controlled areas. Each cell is associated to a number of features extracted by Lucas-Kanade optical flow and used for estimating the crowd density map. In this example 𝑋 ∈ ℝ²⁰, X̄ ∈ ℝ⁵ and the sparse vector X̃ ∈ ℝ²⁰. 𝑋 and X̃ are mapped into the same unit 16. The percentage of controlled area corresponds to 40% (i.e. 2/5).
The quantities 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋) can be estimated by using the SOM to divide the
set of input data (i.e., X) into different multivariate time series {X𝑘}, 𝑘 = 1, · · · , 𝐾, where
X𝑘 = {𝑋1,𝑘, · · · , 𝑋𝑛,𝑘} is associated with the 𝑘-th neuron, such that X𝑘 ∩ X𝑗 = ∅ for 𝑘 ≠ 𝑗
and ⋃_{𝑘=1}^{𝐾} X𝑘 = X [94]. These sub-sequences of vectors can be modelled by local Vector
Auto Regressive (VAR) models [101]. The number of generated VAR models corresponds
to the number of neurons of the SOM. Considering a multivariate time series X𝑘, an auto-regressive
model of order 𝑝, denoted as 𝑉𝐴𝑅(𝑝), describes the 𝑖-th vector 𝑋𝑖,𝑘 as a linear
combination of the previous state vectors:
𝑋𝑖,𝑘 = Φ0 + Φ1𝑋𝑖−1,𝑘 + Φ2𝑋𝑖−2,𝑘 + · · · + Φ𝑝𝑋𝑖−𝑝,𝑘 + 𝜖𝑖,𝑘, (6.1)
where Φ0, · · · ,Φ𝑝 are (𝑁 × 𝑁) parameter matrices and 𝜖𝑖,𝑘 represents an (𝑁 × 1) Gaussian
noise vector. From each multivariate time series X𝑘 we have modelled a 𝑉𝐴𝑅(2): 𝑋𝑖,𝑘 = Φ0 +
Φ1𝑋𝑖−1,𝑘 + Φ2𝑋𝑖−2,𝑘 + 𝜖𝑖,𝑘, where Φ0, Φ1 and Φ2 are estimated coefficient matrices which
are stored in each SOM node. In order to determine the fit of the data to the 𝑉𝐴𝑅(2)
models, error terms are estimated as follows: 𝜖𝑖,𝑘 = 𝑋𝑖,𝑘 − [Φ0 + Φ1𝑋𝑖−1,𝑘 + Φ2𝑋𝑖−2,𝑘].
The error vector associated with each neuron is denoted by 𝜖𝑘. The average of 𝜖𝑘 is denoted
by 𝑌𝑘 = (1/𝑁) ∑_{𝑐=1}^{𝑁} 𝜖𝑘,𝑐, where 𝜖𝑘,𝑐 is the 𝑐-th component of 𝜖𝑘.
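The local VAR(2) fit and its residuals can be sketched with an ordinary least-squares estimate (the thesis does not specify the estimator, and the synthetic series below is purely illustrative):

```python
import numpy as np

def fit_var2(X):
    """Least-squares fit of X_i = Phi0 + Phi1 X_{i-1} + Phi2 X_{i-2} + eps_i.

    X : (n, N) multivariate time series associated with one neuron.
    Returns (Phi0, Phi1, Phi2) and the (n-2, N) residual matrix.
    """
    n, N = X.shape
    # Regressor rows: [1, X_{i-1}, X_{i-2}] for i = 2, ..., n-1.
    Z = np.hstack([np.ones((n - 2, 1)), X[1:-1], X[:-2]])
    target = X[2:]
    coeffs, *_ = np.linalg.lstsq(Z, target, rcond=None)
    Phi0, Phi1, Phi2 = coeffs[0], coeffs[1:1 + N].T, coeffs[1 + N:].T
    residuals = target - Z @ coeffs
    return (Phi0, Phi1, Phi2), residuals

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)).cumsum(axis=0)  # synthetic 3-zone series
(Phi0, Phi1, Phi2), eps = fit_var2(X)
eps_k = eps.mean(axis=0)   # average error vector for this neuron
Y_k = eps_k.mean()         # Y_k: average of its N components, as in the text
assert eps.shape == (198, 3)
```

Each SOM node would store its own (Phi0, Phi1, Phi2) triple, fitted on the sub-sequence X_k mapped to that neuron.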
It is supposed that 𝑝(𝑌,𝑊 ) = 𝒩(0, 𝜎𝑌,𝑊 ) is the joint pdf between Y and W, where 𝜎𝑌,𝑊 =
𝐸[𝑌𝑘²] − 𝐸[𝑌𝑘]². For expressing the quantity 𝑝(𝑌,𝑋), it is sufficient to define a single
𝑉𝐴𝑅(2) model fitted to all input data X; the same approach is then used for estimating 𝜎𝑌,𝑋.
The optimal SOM-layer size can be obtained by minimizing the modified cost function
based on information bottleneck, as follows:
ℱ = min_{𝑝(𝑊,𝑋)} { 𝐻(𝑋|𝑊 ) + 𝜆 𝐷𝐾𝐿[𝑝(𝑌,𝑊 )‖𝑝(𝑌,𝑋)] }; (6.2)
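Since both joint pdfs are modelled as zero-mean Gaussians, 𝐷𝐾𝐿 has a closed form and the cost above can be evaluated per candidate SOM; a sketch follows, assuming the second argument of 𝒩(0, ·) denotes a variance, with placeholder values loosely inspired by Table 6.1 (not exact thesis numbers):

```python
import math

def kl_zero_mean_gaussians(var_p, var_q):
    """KL( N(0, var_p) || N(0, var_q) ) for 1-D zero-mean Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + var_p / var_q - 1.0)

def ib_cost(h_x_given_w, var_yw, var_yx, lam):
    """F = H(X|W) + lambda * D_KL[p(Y,W) || p(Y,X)]  (Equation 6.2)."""
    return h_x_given_w + lam * kl_zero_mean_gaussians(var_yw, var_yx)

# Candidate SOMs: layer -> (H(X|W), variance of p(Y,W)); p(Y,X) = N(0, 0.9996).
candidates = {"2x2": (2.38, 0.94), "10x10": (1.78, 4.0)}
var_yx = 0.9996
for lam in (0.1, 2.0):
    best = min(candidates, key=lambda s: ib_cost(*candidates[s], var_yx, lam))
    print(lam, "->", best)  # small lambda favours the larger SOM, and vice versa
```

The selection behaviour matches the trade-off discussed below: small 𝜆 picks the larger, reconstruction-oriented layer; large 𝜆 picks the smaller, correlation-oriented one.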
where the Kullback-Leibler divergence 𝐷𝐾𝐿[𝑝(𝑌,𝑊 )‖𝑝(𝑌,𝑋)] is a measure of the difference
between the two joint probability distributions 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋). Figure 6-4 shows
Figure 6-4: Kullback-Leibler divergences 𝐷𝐾𝐿 for two Gaussian probability density functions 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋). The information lost is represented by a distance metric between 𝑝(𝑌,𝑊 ) and 𝑝(𝑌,𝑋) (see Table 6.1).
that a larger SOM layer (e.g. 𝐾 = 100 neurons) presents higher 𝐷𝐾𝐿 values (i.e., poorer
correlation quality) than a smaller SOM (e.g., 𝐾 = 4 neurons).
It can be noticed that 𝐷𝐾𝐿 describes an effective distortion measurement. In Equation 6.2
the 𝜆 parameter was introduced, which balances the information bottleneck. In particular,
when 𝜆 → 0 the cost function privileges the reconstruction capabilities of the SOMs (i.e.,
larger SOM layers will be selected). Vice versa, when 𝜆 → ∞, ℱ privileges the correlation
properties of the SOMs (i.e., smaller SOM layers will be selected).
6.3 Experimental results
This section is divided into two parts: in the first, SOM training and the evaluation of the
information bottleneck-based cost function are carried out on synthetic data. Then, the
performance of the proposed IB-SOM selection algorithm is compared with other neural
networks (GH-SOM, GNG and NG) for crowd density reconstruction on real video
sequences.
Table 6.1: Cost function parameters. For evaluating 𝐷𝐾𝐿, two divergence-normalized density functions 𝑝(𝑌,𝑋) = 𝒩(0, 0.9996) and 𝑝(𝑌,𝑊𝐾) are considered.

SOM      | 𝐻(𝑊|𝑋) | 𝐷𝐾𝐿   | 𝑝(𝑌,𝑊)      | 𝑑𝑟𝑊   | 𝑑𝑝𝑊    | 𝜆 interval
2 × 2    | 2.38   | 0.005 | 𝒩(0, 0.94) | 0.842 | 0.3201 | 𝜆 ∈ (1.28, ∞)
5 × 5    | 2.04   | 0.27  | 𝒩(0, 1.88) | 0.623 | 0.4018 | 𝜆 ∈ (0.48, 1.28]
7 × 7    | 1.89   | 0.60  | 𝒩(0, 3)    | 0.417 | 0.53   | 𝜆 ∈ (0.32, 0.48]
10 × 10  | 1.78   | 0.96  | 𝒩(0, 4)    | 0.209 | 0.78   | 𝜆 ∈ (0, 0.32]
6.3.1 Training of SOMs and cost function evaluation on synthetic data
A common training set is generated by using a simulator where crowd behaviours are generated
based on the Social Force model [18]. The simulator has the capability to add virtual sensors
able to acquire data coming from different subparts of the monitored scene. A virtual
image processing algorithm has been implemented for obtaining a plausible crowd density
map for each frame. The generated map is a 32 × 32 matrix, which corresponds to a vector
X with 𝑁 = 1024 components. Four different SOMs were used, with 𝐾 = 100, 49, 25 and 4
neurons respectively and the following layer topologies: 10 × 10, 7 × 7, 5 × 5
and 2 × 2.
By using the common training set, SOMs and other neural networks have been trained.
Finally, the parameters for evaluating the normalized information bottleneck-based cost
function curves ℱ are determined (see Figure 6-5). In Table 6.1, 𝑑𝑟𝑊 and 𝑑𝑝𝑊 represent
the average reconstruction and prediction errors obtained for different value intervals of
the 𝜆 parameter. 𝑑𝑟𝑊 is the average error (i.e., Euclidean distance) between the input data
X and its representation W. Each 𝑉𝐴𝑅 model can be used as a linear predictor filter.
𝑑𝑝𝑊 has been defined as an average measure of the fit between the one-step-ahead
forecast sequences X̂𝑘 (obtained by the local 𝑉𝐴𝑅(2) models) and the training data 𝑋𝑘:

𝑑𝑝𝑊 = ( ∑_{𝑖=1}^{𝑛} ‖𝑋𝑖,𝑘 − X̂𝑖,𝑘‖ ) / ( ∑_{𝑖=1}^{𝑛} ‖𝑋𝑖,𝑘 − 𝐸[𝑋𝑘]‖ ).
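Given the one-step-ahead forecasts, this normalized prediction error is a short computation (a sketch; the array values below are illustrative, not thesis data):

```python
import numpy as np

def d_pW(X_k, X_hat_k):
    """Normalized one-step-ahead prediction error for one neuron's series.

    X_k     : (n, N) training sub-sequence mapped to neuron k.
    X_hat_k : (n, N) VAR(2) one-step-ahead forecasts of the same samples.
    """
    num = np.linalg.norm(X_k - X_hat_k, axis=1).sum()
    den = np.linalg.norm(X_k - X_k.mean(axis=0), axis=1).sum()
    return num / den

X_k = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
assert d_pW(X_k, X_k) == 0.0  # a perfect forecast gives zero error
```

The denominator normalizes by the spread of the data around its mean, so 𝑑𝑝𝑊 is comparable across neurons with different crowding levels.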
Figure 6-5: Normalized cost function average trends for different SOMs. The experiments have been conducted on 100 sequences of synthetic data provided by the simulator. The total time of each simulation is 1000 secs. The validity regions are defined by intervals of 𝜆.
For small 𝜆 values, such as (0, 0.32], the minimum of ℱ is given by the 10 × 10 SOM (i.e.,
the reconstruction capabilities will be preserved and 𝑑𝑟𝑊 > 𝑑𝑝𝑊 ). Vice versa, for higher
𝜆 values, such as those above 1.28, the minimum of ℱ is given by the 2 × 2 SOM (i.e., the
correlation properties will be maintained and 𝑑𝑝𝑊 > 𝑑𝑟𝑊 ).
6.3.2 Crowd density reconstruction on real video sequences
An experiment has been conducted on three available video sequences from the PETS dataset
for a single camera (S1 L2 Time 14:06 and 14:31, S3 High Level Time 14:33 View_0001;
the sequence lengths are 200, 130 and 377 frames respectively and the frame rate is ∼7 fps). The
information bottleneck theory is adopted as a practical strategy for optimal SOM selection;
by using this approach it is possible to limit the reconstruction error by varying the
percentage of controlled areas.
Under the hypothesis that the data are acquired and processed at the same PETS sequence
Table 6.2: Comparisons between the proposed IB-SOM selection and other neural networks. Results are presented for different percentages of controlled areas; the table shows normalized reconstruction errors. For IB-SOM, both reconstruction (𝑑𝑟𝑊𝐾) and prediction (𝑑𝑝𝑊𝐾) errors have been computed. The last two rows show the averages and the variances of the errors.

         | IB-SOM 𝑑𝑟𝑊 | 𝑑𝑝𝑊  | GH-SOM 𝑑𝑟𝑊 | GNG 𝑑𝑟𝑊 | NG100 𝑑𝑟𝑊 | NG49 𝑑𝑟𝑊 | NG4 𝑑𝑟𝑊
100%     | 0.331 | 0.841       | 0.403      | 0.396   | 0.234     | 0.409    | 0.754
80%      | 0.343 | 0.5743      | 0.409      | 0.408   | 0.335     | 0.451    | 0.754
60%      | 0.355 | 0.5006      | 0.423      | 0.412   | 0.391     | 0.511    | 0.811
40%      | 0.431 | 0.4089      | 0.685      | 0.639   | 0.588     | 0.533    | 0.853
Average  | 0.365 | 0.5412      | 0.480      | 0.463   | 0.387     | 0.476    | 0.793
Variance | 0.0015 | 0.0099     | 0.0140     | 0.0102  | 0.0166    | 0.0023   | 0.0017
frame rates, the 𝜆 parameter (see Equation 6.2) can be defined as 𝜆 ∝ 𝑑(𝑋,𝑊𝑘)|_{𝑝(𝑊,𝑋)}.
Such a value can automatically balance the bottleneck through a distortion 𝑑(·) (i.e., the
Euclidean distance between the observed vector 𝑋 and its representation 𝑊𝑘), which is due
to the mapping process 𝑝(𝑊,𝑋). In particular, when 𝑑(𝑋,𝑊𝑘) is low, the reconstruction
capabilities will be preserved (i.e., larger SOM layers will be selected). Vice versa, when
𝑑(𝑋,𝑊𝑘) is high, the correlation properties of the SOMs will be maintained (i.e., smaller
SOM layers will be selected).
Table 6.2 shows that NG100 achieves the minimum reconstruction errors for the 100% and
80% controlled-area percentages. Vice versa, when the controlled-area percentage decreases
(i.e., 60% and 40%), the distortions of this neural network increase. In these situations,
the proposed method can find the optimal SOM size. IB-SOM selection limits the loss of
reconstruction accuracy, i.e., it is able to maintain a minimum average reconstruction
error. Moreover, the proposed approach limits the error variations due to different
percentages of controlled areas (i.e., the error variance).
Figure 6-6: Qualitative and quantitative results for event recognition on PETS sequence S1 L2 Time 14:06 using 40% of the controlled area. The figure shows the comparison between the proposed method and other approaches: in the upper part, crowd density map reconstructions are presented; in the lower part, distortion curves are shown. The whole video is available on https://www.youtube.com/watch?v=KN2aYZ64TTw.
On the other hand, when the SOM-map sizes are reduced, the prediction errors decrease as
well. Finally, in Figure 6-6 quantitative and qualitative comparison measurements for event
recognition are presented. In particular, even using limited observations, the proposed
IB-SOM presents smaller distortion errors. The density map reconstructions show how all
the neural networks can identify the first event, while only through the proposed approach
is it possible to recognize the second event.
Chapter 7
A bio-inspired logical process for information recovery in cognitive crowd monitoring
7.1 Introduction
There is currently a great interest in methodologies which combine computer vision tasks
with high-level situation awareness functionalities for environment monitoring. Techniques
such as behaviour analysis, event detection, or recognition of possibly dangerous situations
can have a high impact in different sectors, ranging from social (e.g. security-related) to
technological ones (e.g. resources management). In fact, several works in the literature try
to link physical and social aspects to video surveillance tasks [85]. The main and common
problem is that they require monitoring of events over wide video surveillance networks for
long periods of time [92]. In order to detect people, local features can be used (e.g., Features
from Accelerated Segment Test, FAST [109]), while optical flow efficiently estimates
human motion [100]. Considering such features, in [81] a people-trajectories-based social
force model has been proposed for describing interactions among the individual members
of a group of people. These techniques often require a large number of networked sen-
sors for acquiring data. Adaptable intelligent systems can be exploited in order to balance
the amount of employed resources and sensory relevant information. Human perception
mechanisms for data processing, integrated in artificial systems, represent a breakthrough
for designing a new generation of algorithms inspired by neuroscience. Such frameworks
can optimize resource management and improve camera identification by acquiring (i.e.
perceiving) relevant information to be combined with inductive reasoning, and machine
learning techniques. The main concept in cognitive neuroscience, discussed in chapter 2,
is the Perception Action Cycle (PAC), also referred to as Fuster's paradigm, which is
based on five fundamental building blocks, namely perception, memory, attention, intelligence
and language [45]. The PAC describes how an entity can perceive, learn and change
itself through continuous interaction experiences with the external world. In particular the
perception and attention blocks are meant to optimize the information flow within the sys-
tem. In such a context, Haykin et al. in [3] have presented an algorithmic implementation of
the human perceptual attention process in order to separate the relevant from irrelevant in-
formation. The challenge of this scheme is to bridge the gap between cognitive science and
engineering in order to design intelligent systems called Cognitive Dynamic Systems (CDS),
originally formalized by Haykin in the cognitive radio domain [56] and later generalized
in [53]. Only recently has the scheme been applied in the crowd monitoring field [22]. In the
video surveillance domain, many frameworks deal with relevance or saliency extraction
from video sequences. In [74], by using a saliency detector based on center-surround
discriminant descriptors (i.e. colour, intensity, orientation, scale), anomaly detection and
localization in crowded environments has been proposed. Other frameworks for saliency
extraction employ the Histogram of Oriented Gradients (HOG) combined with a Support Vector
Machine (SVM) classifier for pedestrian detection [27] and for people counting [130]. In
this chapter an adaptive inductive reasoning mechanism for saliency extraction from a
distributed camera sensor network is proposed. The common goal of saliency detection
frameworks for video surveillance systems is limited to extracting relevance from all the
parts of the environment where, for example, the presence of people has been detected. In our
method the relevant information, which is identified directly from environmental
observations, is defined through the learning of the relationships among sensory data. Unlike other
methods for relevance detection, the proposed algorithm shows how the data correlations
can be learned by using a neural network. Moreover, it is shown how, through such a
procedure, it is possible to reduce the number of environment parts to observe; by using
such limited observations the proposed system can reconstruct the whole information.
The module devoted to this aim is shown in Figure 7-1.
Figure 7-1: Reconstruction block for automatic recovering of the whole information.
The presented system works at two different logic levels: first, by means of a Self Organizing
Map (SOM) [62], a well known neural network, it is able to learn the correlations
of the salient observed data. Secondly, the framework, relying on an adaptive SOM-based
inductive reasoning process, can select the optimal subset of sensors for recovering the
whole information (i.e. acquired relevant and induced redundant information), while allowing
a maximum distortion. The remainder of this work is organised as follows. Section 7.2
presents the proposed bio-inspired inductive reasoning model based on SOMs for extracting
relevant data and recovering complete information. In order to evaluate the information
recovery capabilities of the proposed scheme, the model has been evaluated by using real
video sequences from the PETS database [37]. Experimental results are shown in
section 7.3, while conclusions are drawn in chapter 8, section 8.4.
7.2 Data correlation representation for bio-inspired inductive reasoning algorithm
Inductive reasoning is a mental process which makes it possible to draw a conclusion according to
pre-defined models, which, in this application, describe the relations derived from acquired
camera sensor information.
For the sake of simplicity, the problem can be formulated as follows: given an incomplete
set of observations, this chapter proposes a methodology that can quantify the missing data.
The problem of finding the most probable spatial configuration of the crowd (i.e. the number
of people in different areas) has a strong analogy with the statistical physics problem
of determining the equilibrium allocation of a number of particles in a set of energy levels
under the constraint of total energy conservation. The analogy between physics and
other application fields has already been explored for describing the dynamic movements
of workers among various economy sectors [110]; such a situation is as close as possible
to the movements of people across different zones of a monitored environment.
Scalas et al. propose a solution for the optimal occupation of economy sectors which
relies on a generalization of the Bose-Einstein distribution. Crowd configurations are
difficult to model in accordance with a given distribution function because they generally have a
high variability, which depends on the topology of the environment. In an autonomous video
surveillance system it is more suitable to acquire in advance the crowding levels of
the specific monitored areas. This chapter presents an adaptive approach relying on artificial
neural networks, which permits learning a set of possible configurations directly from
environmental observations (training of the neural networks). Furthermore, the proposed
bio-inspired mechanism called inductive reasoning makes it possible to recover missing data.
This section proposes SOM-based data correlation modelling for an inductive reasoning
algorithm for relevant information representation. In particular, we suppose to apply it to
the crowd monitoring domain. In Figure 7-2.a an example of information extraction from
a crowded environment is shown. The controlled area can be divided into sub-parts and, by
employing tools such as Lucas-Kanade optical flow [72], it is possible to quantify the density
of people within each cell. In this way a crowd density map, Figure 7-2.c, can be
sketched. Basically, this map describes the people concentration (i.e. the number of people)
within the scene.
Figure 7-2: Information extraction and crowd density map estimation. (a) Information extraction. (b) Controlled areas. (c) Crowd density map.
Let us define the available information X, acquired from a video surveillance network, as
the set of crowding state vectors
𝑋 = [𝑋1, 𝑋2, · · · , 𝑋𝑁 ] where 𝑋 ∈ R𝑁 . (7.1)
𝑋 is a sample of the signal 𝑋(𝑡) ∈ X, which has been acquired at sampling time instant
𝑡. 𝑋𝑖 represents the people number estimation, provided for instance by a people counter
embedded in the cameras, for the 𝑖-th monitored area, with 𝑖 = 1, · · · , 𝑁. 𝑁 also represents
the maximum number of sensors, as we assume only one camera for each controlled area.
During a training phase, a SOM provides an adaptive mechanism for acquiring different
correlation models, according to the similar dynamical evolutions of people captured by the
neurons. The SOM maps the data (i.e. 𝑋) into a dimensionally reduced space. It provides
a semantic representation of the input vectors, by associating similar (though not necessarily
identical) crowding states to the same neuronal unit. Each neuron represents a codeword,
i.e. a prototype (or weight) vector 𝑊𝑘 = [(𝑊𝑘)1, (𝑊𝑘)2, · · · , (𝑊𝑘)𝑁 ] with 𝑊𝑘 ∈ ℝ^𝑁,
where 𝑘 ∈ {1, ..., 𝐾} and 𝐾 is the maximum number of prototype vectors within the
neuronal layer (i.e. the number of neurons in the SOM). The external stimuli (i.e. observed
state vectors) activate specific neuronal units in the SOM-map. Therefore, the SOM can
state vectors) activate specific neuronal units in the SOM-map. Therefore, the SOM can
be used for dividing the training data X into different multivariate sets X𝑘𝐾𝑘=1 where
X𝑘 = 𝑋1,𝑘, · · · , 𝑋𝑛,𝑘 is associated to the 𝑘 − 𝑡ℎ neuron, and X𝑘𝐾𝑘=1 is a partition of
X, i.e. X𝑘 ∪ X𝑧 = ∅, ∀𝑘 = 𝑧 and⋃𝐾𝑘=1 X𝑘 = X. By examining the possible relationships
among these sub-sequences of training samples, the system can build correlation structures
𝐶𝑘 (i.e. correlation matrices) embedded in each SOM-neuron. The relative effects of the
number of people in the 𝑖𝑡ℎ monitored area on people on 𝑗𝑡ℎ zone is stored in the correlation
coefficients 𝑐𝑘(𝑖, 𝑗).
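Computing one such correlation structure C_k from the samples mapped to a neuron can be sketched as follows (the random data below stand in for the crowd state vectors of one partition; they are illustrative only):

```python
import numpy as np

def correlation_structure(X_k):
    """Correlation matrix C_k for the samples mapped to neuron k.

    X_k : (n, N) crowd state vectors; c_k(i, j) measures how the people
    count in zone i co-varies with the count in zone j.
    """
    return np.corrcoef(X_k, rowvar=False)

rng = np.random.default_rng(1)
base = rng.normal(size=100)
# Zone 1 strongly follows zone 0; zone 2 is independent noise.
X_k = np.column_stack([base,
                       base + 0.1 * rng.normal(size=100),
                       rng.normal(size=100)])
C_k = correlation_structure(X_k)
assert C_k.shape == (3, 3)
assert C_k[0, 1] > 0.9  # coupled zones show a high correlation coefficient
```

Each neuron would store its own C_k, computed only over the training sub-sequence X_k it represents.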
By means of a saliency masking vector 𝑀, whose entries are zeroes or ones, it is possible to
hide (i.e. mask) some components of 𝑋, defining a new vector X̄ ∈ ℝ^𝑁. The 𝑖-th component
of X̄ is X̄𝑖 = 𝑀𝑖 · 𝑋𝑖, 𝑖 = 1, ..., 𝑁. Defined this way, X̄ retains the components of 𝑋
when 𝑀𝑖 = 1 and is set to zero elsewhere; it is extracted at the same time instant 𝑡
and highlights the monitored areas. It is to be noted that the distance used for the matching
process (or mapping process) is not the Euclidean distance in this case, but its "masked"
version:
𝑑(X̄, 𝑊𝑘, 𝑀) = √( ∑_{𝑖=1}^{𝑁} [X̄𝑖 − (𝑊𝑘)𝑖]² · 𝑀𝑖 ). (7.2)
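A sketch of this masked matching process follows (the weights, mask and observation are hypothetical, chosen only to illustrate the mechanism):

```python
import numpy as np

def masked_distance(X_bar, W_k, M):
    """Masked Euclidean distance of Equation 7.2: unobserved zones
    (M_i = 0) do not contribute to the matching score."""
    return np.sqrt(np.sum(((X_bar - W_k) ** 2) * M))

def masked_bmu(X_bar, W, M):
    """Index of the BMU under the masked distance."""
    return int(np.argmin([masked_distance(X_bar, W_k, M) for W_k in W]))

W = np.array([[5.0, 5.0, 5.0],
              [1.0, 9.0, 1.0]])
M = np.array([1.0, 0.0, 1.0])          # zone 1 is not observed
X_bar = np.array([1.2, 0.0, 0.8]) * M  # masked observation
assert masked_bmu(X_bar, W, M) == 1    # zone 1 is ignored in the match
```

Without the mask, the large mismatch on the hidden component would distort the BMU choice; the mask restricts the comparison to the observed zones.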
The role of the proposed inductive algorithm is to deduce the number of people inside the
non-observed zones. This is accomplished by minimizing the following cost function:

𝐹(X̂) = [X̂ − X̄]² + [(X̂ − 𝑊𝑘) − 𝜇𝑘 · (X̄ − 𝑊𝑘)]², (7.3)
where X̂ is the state vector to be estimated and 𝑊𝑘 is the SOM prototype given by the SOM
mapping process: X̄ ↦ 𝑊𝑘.
For each couple of observed and non-observed areas (i.e. the 𝑖-th and 𝑗-th areas, with
observed number of people X̄𝑖), it is possible to define an influence
parameter 𝜇𝑘(𝑖, 𝑗) = 𝑐𝑘(𝑖, 𝑗) · 𝜎𝑗/𝜎𝑖. Such a parameter takes into consideration the cross
correlation 𝑐𝑘(𝑖, 𝑗) between the zones and their standard deviations (SDs) 𝜎𝑖 and 𝜎𝑗, which
are calculated during training over the set X𝑘, where 𝑘 is the index of the Best Matching
Unit (BMU) of the SOM (i.e. the activated neuron), which has weight 𝑊𝑘.
Data: X̄, 𝑊𝑘 and 𝑀
Result: local minimum X̂
Initialization: ∀𝑖: X̂_i^old = X̄𝑖; ∀𝑗: X̂_j^old = (𝑊𝑘)𝑗; set 𝑑𝑚𝑖𝑛, 𝑚𝑎𝑥𝑖𝑛𝑡, 𝛼
Definition: 𝑓′₂(X̂_j) = 2[(X̂_j − (𝑊𝑘)𝑗) − 𝜇𝑘(𝑖, 𝑗) · (X̄𝑖 − (𝑊𝑘)𝑖)]
for 𝑗 = 1 : 𝑁 do
    if 𝑀𝑗 = 0 (with 𝑀𝑗 ∈ 𝑀) then
        while 𝑑𝑥 > 𝑑𝑚𝑖𝑛 and 𝑛𝑖𝑛𝑡 ≤ 𝑚𝑎𝑥𝑖𝑛𝑡 do
            X̂_j^old = X̂_j^new;
            X̂_j^new = X̂_j^old − 𝛼 · 𝑓′₂(X̂_j^old);
            𝑑𝑥 = ‖X̂_j^new − X̂_j^old‖; 𝑛𝑖𝑛𝑡++;
        end
    end
end
Algorithm 2: SOM-based inductive reasoning algorithm. In the algorithm the SOM is represented by the weights of the BMU 𝑊𝑘.
The cost function 𝐹 presents two terms: the first, 𝑓₁(X̂) = [X̂ − X̄]², is clearly minimized
by X̂𝑖 = X̄𝑖, i.e. the 𝑖-th component is equivalent to the observation. The second term,
𝑓₂(X̂) = [(X̂ − 𝑊𝑘) − 𝜇𝑘 · (X̄ − 𝑊𝑘)]², represents the objective function introduced for
estimating the number of people within each non-observed component (i.e. the 𝑗-th) from the
data acquired in the observed areas (i.e. the 𝑖-th). The minimization of the second term
is an optimization problem. The fundamental idea behind the proposed method is to use
a gradient descent approach in order to find a local minimum of the function 𝑓₂(X̂𝑗). The
slope of the function 𝑓₂(X̂𝑗) is equal to zero if 𝑓′₂(X̂𝑗) = d𝑓₂(X̂𝑗)/dX̂𝑗 = 2[(X̂𝑗 − (𝑊𝑘)𝑗) −
𝜇𝑘(𝑖, 𝑗) · (X̄𝑖 − (𝑊𝑘)𝑖)] = 0. The solution is X̂𝑗 = 𝜇𝑘(𝑖, 𝑗) · [X̄𝑖 − (𝑊𝑘)𝑖] + (𝑊𝑘)𝑗,
where [X̄𝑖 − (𝑊𝑘)𝑖] is the variation of the number of people in the observed area. Such
a variability can influence the estimated value of the state vector through the weight parameter
𝜇𝑘(𝑖, 𝑗), which introduces correlation with neighbouring locations. In Algorithm 2 the
pseudo code of the proposed method is presented. The algorithm parameters are: the precision
value 𝑑𝑚𝑖𝑛, which defines a stop criterion based on a minimum update threshold;
𝑚𝑎𝑥𝑖𝑛𝑡, which defines the maximum number of allowed iterations; and 𝛼, which specifies the
step size weighting the gradient of the cost function.
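Putting the pieces together, Algorithm 2 amounts to an independent scalar gradient descent on each masked component; a minimal sketch follows (the prototype, influence parameters and observation below are hypothetical, and a single observed reference zone i is used, as in the scalar derivation above):

```python
import numpy as np

def inductive_recovery(X_bar, W_k, M, mu, alpha=0.1, d_min=0.01, max_int=60):
    """Estimate hidden components of the crowd map by gradient descent.

    X_bar : (N,) masked observation; W_k : (N,) BMU prototype;
    M     : (N,) 0/1 mask; mu : (N, N) influence parameters mu_k(i, j).
    Observed components are copied; each hidden j descends f_2.
    """
    X_hat = np.where(M == 1, X_bar, W_k).astype(float)
    i = int(np.argmax(M))               # one observed reference zone
    for j in range(len(M)):
        if M[j] == 0:
            old, n_int, dx = X_hat[j], 0, np.inf
            while dx > d_min and n_int <= max_int:
                # f_2'(x) = 2[(x - (W_k)_j) - mu(i,j) * (X_bar_i - (W_k)_i)]
                grad = 2 * ((old - W_k[j]) - mu[i, j] * (X_bar[i] - W_k[i]))
                new = old - alpha * grad
                dx, old, n_int = abs(new - old), new, n_int + 1
            X_hat[j] = old
    return X_hat

W_k = np.array([4.0, 6.0])
mu = np.array([[1.0, 0.5], [0.5, 1.0]])
X_bar = np.array([6.0, 0.0])            # zone 1 unobserved
M = np.array([1, 0])
X_hat = inductive_recovery(X_bar, W_k, M, mu)
# Fixed point: X_hat_1 -> mu(0,1) * (6 - 4) + 6 = 7
assert abs(X_hat[1] - 7.0) < 0.1
```

The iterates converge geometrically towards the closed-form solution derived above, so the descent stops well within the iteration budget.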
7.3 Experimental results
In this section the performance of the proposed inductive reasoning algorithm is analysed
for data recovery. In order to give consistency to the proposed approach, an experiment
has been conducted on three available video sequences from the "Performance Evaluation
of Tracking and Surveillance" (PETS) workshop dataset for a single camera [37].
According to the main purpose of this work, the performance of the method is evaluated
for selecting the optimal subset of sensors in real crowding scenes, in order to optimally
recover the whole information (i.e. acquired relevant and induced redundant information).
Five different SOMs were employed, with 𝐾 = 25, 16, 9, 4 and 1 neuronal units
respectively, and the following layer topologies: 5×5, 4×4, 3×3, 2×2 and 1×1. The parameters
presented in Algorithm 2 are set as follows: 𝑑𝑚𝑖𝑛 = 0.01, 𝑚𝑎𝑥𝑖𝑛𝑡 = 60, 𝛼 = 0.1.
Finally, the proposed SOM-based inductive reasoning algorithm has been compared with
the capability of a standard SOM to recover defective values, where restoration is based on
the rough representation of the lost data by the value provided by the prototype, or by the
"temporal influence" of the BMU, as defined in [60]. Such a method has been applied to
the reconstruction of areas covered by clouds in a time sequence of optical satellite images.
We will refer to this method as the SOM-BMU method (Table 7.1).
The main purpose of this experiment is to demonstrate how the system is able to dynami-
cally recover the whole information (i.e. crowd density map) using a limited set of obser-
vations. In order to select the groups of zones to observe we use the correlation measures
among the neighbouring locations. The correlations among the zones have been computed
over a training set. Regression analysis can quantify the relationships between the variables.
Figure 7-3 shows the correlations between two couples of locations: for distant
zones the correlation is lower than for close cells. These facts can be analysed through the fitting
error of the data on the fit line (i.e. 𝑝(𝑥) = 𝑝₁𝑥 + 𝑝₂, where 𝑝₁ and 𝑝₂ are the two coefficients
of the first-order polynomial curve 𝑝(𝑥)).
                           SOM-1  SOM-4  SOM-9  SOM-16  SOM-25
S1 L2 Time 14-06 (40% of data loss)
  MSE                      0.911  0.845  0.727  0.651   0.455
  Cost Function            0.85   0.731  0.698  0.602   0.53
  SOM-BMU                  0.945  0.596  0.581  0.546   0.559
S1 L2 Time 14-31 (40% of data loss)
  MSE                      0.908  0.826  0.802  0.688   0.501
  Cost Function            0.899  0.831  0.788  0.762   0.677
  SOM-BMU                  0.901  0.663  0.645  0.639   0.630
S3 14-33 (40% of data loss)
  MSE                      0.853  0.783  0.713  0.694   0.655
  Cost Function            0.885  0.811  0.806  0.787   0.706
  SOM-BMU                  0.891  0.799  0.735  0.714   0.701

Table 7.1: Quantitative results for PETS video sequences. It is important to note that the minimum value of the Cost Function proposed in this chapter corresponds to the minimum value of the MSE.
The areas have been ordered on the basis of their correlations. The three most correlated zones
(i.e. attentional resources) have been chosen in proximity to the entry and exit points (the
red circles represented in Figure 7-3). Based on a predefined percentage of data loss equal to
40%, we have employed the focusing attention method [14]. Such a method, based on three
attentional resources, is able to generate groups of zones which represent the most informative
regions for the scene reconstruction. In Figure 7-4 an example of how the method
works is shown: the set of sparse data is acquired in proximity to the entry and exit points.
This test has been divided into two sub-steps: training and testing. A common training set
is provided by using the simulator where crowd behaviours are generated. Furthermore, a
virtual image processing algorithm has been implemented for obtaining a plausible crowd
density map for each frame [20]. The generated map is a 16 × 16 matrix, which corresponds to
a vector X with 𝑁 = 256 components. The same five SOMs as before were used.
In the test phase the vector of the reduced observations X̄ is the input of the "SOMs stack".
Through the neuronal activation process, each SOM-𝐾 provides a reconstructed crowd
density vector X̂𝐾 with a corresponding value of the cost function 𝐹𝐾(X̂𝐾) (please refer to
equation (7.3) and Algorithm 2). The main objective is to minimize the difference, for
Figure 7-3: Example of correlation analysis between two pairs of locations. The data are represented by the number of features extracted by the Lucas-Kanade optical flow associated to each cell.
instance the Mean Square Error (MSE), between the recovered data and the original signal, i.e. MSE(X, X̂), where X̂ is the reconstructed vector. In table 7.1 the quantitative results of the reconstruction are presented. In this case SOM-25 provides the lowest average cost function value, which corresponds to the minimum (normalized) reconstruction error (MSE). The optimum recovered vector can be dynamically determined by comparing at each frame t the five values of the objective functions (i.e. cost functions) as follows:

X̂ = X̂_K*,   K* = arg min_K F_K(X̂_K). (7.4)
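The selection rule of equation (7.4) amounts to an argmin over the stack of SOM outputs. A minimal sketch, where the per-SOM reconstructions and cost values are toy stand-ins (the real cost F_K is the one defined in equation (7.3), not reproduced here):

```python
import numpy as np

def select_best_reconstruction(reconstructions, costs):
    """Pick the reconstruction whose cost function F_K is minimal
    (the argmin of equation (7.4)); both dicts are indexed by SOM size K."""
    best_k = min(costs, key=costs.get)
    return best_k, reconstructions[best_k]

# Toy stand-ins for the five SOM layers' outputs on one frame.
x_true = np.linspace(0.0, 1.0, 16)          # true density vector (unknown at test time)
reconstructions = {
    1:  x_true + 0.30, 4: x_true + 0.20, 9: x_true + 0.05,
    16: x_true + 0.10, 25: x_true + 0.15,
}
# In the thesis the cost F_K tracks the true MSE; emulate that property here.
costs = {k: float(np.mean((v - x_true) ** 2)) for k, v in reconstructions.items()}

best_k, x_hat = select_best_reconstruction(reconstructions, costs)
print(best_k)  # prints 9: SOM-9 has the lowest cost in this toy example
```

The key point, mirrored by Table 7.1, is that minimising the cost function selects the same SOM as minimising the (unobservable) reconstruction MSE.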
In Figure 7-5 qualitative results of the SOM selection process for crowd density map reconstruction are presented. In the proposed dynamical layer selection process, different SOMs can be chosen in order to provide the optimal information recovery. During the initialization phase, when the people flow is outside the camera fields of view, the
Figure 7-4: Salient information extraction and crowding density map reconstruction. The sparse vector is projected into the SOMs space (i.e. “SOMs stack"). The grid cells, laid over the image plane, can be seen as the set of controlled areas. Each cell is associated to a number of features extracted by the Lucas-Kanade optical flow and used for estimating the crowd density map. The percentage of controlled area corresponds to 40%. In this example X ∈ R^20 and the vector of the reduced observations X̃ ∈ R^20. X̃ is mapped into different units of the SOMs. By comparing the cost functions (see equation (7.4)) it is possible to identify the optimal SOM (in this case SOM-9).
system acquires the data from all the zones. In this case the proposed framework evaluates the optimum SOM-K based on the minimum MSE between the prototype W_k of the BMU k and the observed data X.
Finally, Figure 7-6 shows the comparison of the proposed strategy (i.e. our method) for relevant information extraction with other approaches for detecting saliency in video sequences. In particular, the recovery algorithm has been tested using the saliency detected by the method proposed in [74], and using the HOG descriptors combined with an SVM for pedestrian [27] and head detections. The results underline how the presented relevance extraction, based on the correlations among the parts of the environment, is able to drastically reduce the total amount of acquired information (34.11%) compared with saliency detection
Figure 7-5: An example of qualitative and quantitative results for PETS sequence S1 L2 Time 14:06 (frames 56 and 140) using 40% of the controlled area. Up to the time of the first people detection (i.e. initialization phase, frame 44) the system acquires the data from all the zones.
(74.95%) and the pedestrian detection by HOG descriptors (70.18%). We point out that a higher percentage of crowd density saliency extraction is not always synonymous with better reconstruction. Considering the extracted relevance provided by the head detection method (40.54%), the proposed system is able to better recover the whole density map (MSE 0.4015 vs 0.4168).
Figure 7-6: Reconstruction comparison for different saliency extraction methods: saliency detection presented in [74]; HOG descriptors [27] combined with a learning-based SVM classifier for person (pedestrian) and head detections. The average mean square error normalized to 1 and the percentage of the crowd density map acquired on average by the system (i.e. Extracted Relevance, ER) are shown.
Chapter 8
Conclusions and future developments
The Cognitive Dynamic System can be seen as a complex structure of multiple cooperating bio-inspired and modular elements, such as memory, attention, reasoning, planning and problem solving. These features represent the main innovative keys of these frameworks. The integration of different elements makes it possible to create systems which are able not only to take local decisions, for example for each element separately, but also to describe how such outcomes can influence other system processes. The main characteristic of biological entities is their decision-making mechanisms (i.e. planning and problem solving), which involve a complicated cooperation among recursive and incremental cognitive processes. Cognitive Control is an executive function in charge of managing these very complex mechanisms; specifically, it regulates the above mentioned mental faculties (i.e. reasoning, attention, memory, planning, etc.). This process is the core and distinctive component, in relation to other mechanisms, in leading the development of innovative artificial cognitive systems. For this reason, both the design of the different modules (i.e. reasoning, attention, memory) and the definition of the information flows among the system elements are extremely important. This thesis explores a Cognitive Dynamic System addressed to crowd monitoring applications. The proposed system is a complex framework characterized by different parts, which process the information at different levels. In particular, each chapter of this work describes in depth the capabilities and applicability of the proposed innovative
techniques: the Analysis block, the Attention Focusing Module, and the Information Refinement and Reconstruction Modules. The combination and cooperation among such algorithms realize an automatic system called Cognitive Surveillance Node (CSN), which is part of a complex cognitive JDL-based and bio-inspired architecture. By means of the Autobiographical Memory, the proposed CSN models the relationships between external (i.e. core-self) and internal (i.e. proto-self) information flows. The presented system also shows how to describe and model the relationships between the internal sub-systems. In particular, the Attention Module communicates with the memory. Through the Information Refinement it is possible to select the internal knowledge in order to obtain the optimum representation of the relevant observed data, according to the system task (e.g. event detection). The thesis also proposes a mechanism, based on internal information flows, which emulates a specific human adaptive mental mechanism: inductive reasoning. Through a continuous exchange of information between the memory and the Reconstruction Module, the proposed algorithm can recover the whole data given a limited set of observations.
The scientific contributions of each technique proposed in the thesis are summarized in the following sections.
8.1 Analysis block
A bio-inspired structure was proposed for encoding and synthesizing signals and for modelling cause-effect relationships between entities. In particular, the case where one of such entities is a human operator was studied. Interaction models are stored within an AM during a learning phase. Knowledge is thus transferred from a human operator towards the CSN. Learned representations can be used, at different levels, either to support human decisions by detecting anomalous interaction models and thus compensating for human shortcomings, or, in an automatic decision scenario, to identify anomalous patterns and choose the best strategy to preserve the stability of the entire system. Results are shown in a simulated 3D visual environment in the context of crowd monitoring. The simulated crowd is modeled according to the Social Force Model. The results show two possible applications of
the CSN for crowd surveillance applications: first, the system can support the operator in crowd management and people flow redirection by detecting drifts from the learned interaction models; second, it can work in automatic mode, autonomously detecting anomalies in crowd behaviour. Furthermore, it has been shown how the user-crowd interaction knowledge, learned from the simulator and modelled as proposed, is useful for detecting anomalies in real video sequences.
8.2 Attention Focusing Module
This thesis has presented a bio-inspired algorithm called Selective Attention Automatic Focus, which is part of a more complex cognitive architecture for crowd monitoring. The main objective of the proposed method is to extract the relevant information needed for crowd monitoring directly from the environmental observations. Experimental results are shown in a simulated 3D visual environment. The results show the possible applications of the proposed algorithm for focusing attention on densely populated areas and for detecting anomalies in crowds, e.g. overcrowding. Future developments of this work will include an investigation on more numerous and more complex simulated scenes. In addition, thanks to the scalability of the model based on topological maps, many other applications on real scenes can be explored, giving more substance to the abstract cognitive perception mechanism integrated in an intelligent system.
8.3 Information Refinement for relevant knowledge representation
This thesis presented a novel approach for information representation, applied to sparse data within a crowd monitoring application. The proposed algorithm is a method encompassing different steps, which involve the application of information theory and of neural networks such as SOMs. First of all, by means of the information bottleneck paradigm, a cost function has been designed in order to balance the data reconstruction and correlation capabilities of different SOMs. An information bottleneck based strategy for SOM selection was thus proposed.
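As an illustration only, a cost of this family can be sketched as a distortion term traded off against a complexity term. The function name `ib_style_cost`, the `beta` weight and the log-based complexity term are assumptions of this sketch; the thesis's exact F_K is the one defined in equation (7.3) and is not reproduced here:

```python
import numpy as np

def ib_style_cost(x_obs, x_rec_obs, k_units, beta=0.1):
    """Illustrative information-bottleneck-style trade-off (NOT the thesis's
    exact cost): a distortion term on the observed components plus a
    beta-weighted complexity term that grows with the number of SOM units,
    penalising over-large maps that merely memorise the data."""
    distortion = float(np.mean((x_obs - x_rec_obs) ** 2))
    complexity = np.log(k_units)      # stand-in for the rate/compression term
    return distortion + beta * complexity

# Toy usage: a better reconstruction lowers the cost; a larger map raises it.
x = np.zeros(4)
good, bad = np.zeros(4), np.full(4, 0.5)
c_good = ib_style_cost(x, good, k_units=9)
c_bad = ib_style_cost(x, bad, k_units=9)
```

The design intent is that minimising such a cost selects the smallest SOM that still reconstructs the data well, which is the balance the thesis's cost function enforces.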
Finally, the IB-SOM selection method has been tested on public datasets. The results show that the proposed approach outperforms other neural networks (such as NG, GNG and GH-SOM) in crowd density reconstruction from very sparse observations. Furthermore, it has been shown how such a knowledge representation can recover the original crowding density maps in order to recognize particular events in real video sequences.
8.4 Adaptive reasoning algorithm for information reconstruction
This thesis has presented an adaptive inductive reasoning mechanism as an inductive logic framework for saliency (i.e. relevant information) extraction. Such salient information has been defined as the optimal subset of observations (e.g. camera sensors for crowd monitoring, or pixel values for images) which is able to reconstruct the lost information, which is therefore “redundant" for that specific model. The main goal of the proposed framework is to adapt the inductive reasoning mechanism and therefore to emphasize the saliency. Accordingly, the reconstruction of the non-observed data (i.e. lost values) is performed by means of different correlation models acquired by different SOMs. By minimizing the proposed cost function, the method is able to regenerate the missing data and establish the optimal salient information. In order to evaluate the reconstruction capability, we have tested the algorithm for crowd state recovery by considering a limited subset of observations (i.e. excluding a part of the available camera sensors). The proposed approach optimizes the estimation procedure of the original values by considering not only the prototypes (the weights of the BMUs) but also the influence of the variations of the neighbours' values. Moreover, by minimizing the proposed cost function while varying the SOM-layer size, it is
possible to identify the optimum filter (i.e. SOM) which corresponds to the minimum MSE
calculated a posteriori (i.e. the minimum introduced distortion).
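The recovery scheme described above can be sketched as follows, assuming a square SOM lattice and a Gaussian neighbourhood; the function name, the `sigma` parameter and the weighting details are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def reconstruct_missing(x_partial, mask, weights, sigma=1.0):
    """Sketch of SOM-based recovery of unobserved components: the BMU is
    found using only the observed components, then the missing entries are
    filled with a neighbourhood-weighted average of the unit prototypes, so
    that the BMU's neighbours also influence the estimate (not only the
    prototype of the single BMU)."""
    grid = weights.shape[0]                       # square map of grid x grid units
    coords = np.array([(i, j) for i in range(grid) for j in range(grid)])
    flat = weights.reshape(-1, weights.shape[-1])

    # BMU: minimal distance computed on the observed components only.
    d = np.sum((flat[:, mask] - x_partial[mask]) ** 2, axis=1)
    bmu = np.argmin(d)

    # Gaussian neighbourhood weights around the BMU on the map lattice.
    lat = np.sum((coords - coords[bmu]) ** 2, axis=1)
    h = np.exp(-lat / (2.0 * sigma ** 2))
    h /= h.sum()

    # Keep the observed entries; estimate the missing ones.
    x_rec = x_partial.copy()
    x_rec[~mask] = (h[:, None] * flat).sum(axis=0)[~mask]
    return x_rec

# Toy usage: a 3x3 map whose prototypes all equal [0, 1, 2, 3]; the two
# missing components are recovered from the prototypes.
weights = np.tile(np.arange(4.0), (3, 3, 1))
x = np.array([0.0, 1.0, 0.0, 0.0])
mask = np.array([True, True, False, False])
x_rec = reconstruct_missing(x, mask, weights)
```

Selecting among several such maps of different sizes by the cost function then yields the optimum filter described above.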
8.5 Final remarks and future developments
As discussed in this thesis (see Chapter 2, Section 2.2), CDSs are complex frameworks. Such architectural complexity leads to various systemic and conceptual problems, which can be summarized in the two following points: integration and validation.
∙ Integration: this is a systemic problem and it involves the cooperation mechanisms among the different parts of the system. The control and management of the internal and external information flows is a central key in this kind of systems. Specifically, this work has proposed an architecture where the sub-parts of the system are connected; the different elements work together through communication mechanisms based on the cognitive control of the information flows.
∙ Validation: this is another fundamental problem, which involves, for example, the comparison of the proposed CDS with alternative crowd management systems. In general, crowd management systems are based on video analysis methods only. Typically, by means of optical flow or other visual features, it is possible to detect a specific situation (i.e. event detection) at a specific time instant (frame). Considering these methodologies as complete systems, such as a CDS, would be a conceptual error; such methods are better defined as crowd behaviour analysis tools. For example, Hang Su et al. [119] presented an optical-flow-based fluid field which, compared with the optical-flow-based social force model presented by Mehran et al. [84], provides better results in terms of identification of events in time. The proposed method is different, as it implements a crowd management system: it can predict possible future crowd behaviours (as consequences of the system's actions) and it is able to detect possible interaction anomalies. Referring to the results we presented on video sequences, we showed how the system, by predicting crowd events many frames before the exact time instant, suggests to the operator the best possible action to take; any different action will be detected as an interaction anomaly. The added value of our work consists in the automatic learning of crowd behavioural rules, which are in fact interaction rules. In addition, the system has the capability of predicting future events and of detecting interaction anomalies at the exact moment they happen. Furthermore, through the Attention Module it is possible to help the user focus his attention on the relevant environmental parts only. The Information Refinement can model such acquired relevant information by selecting the optimum knowledge baseline, while, through the Reconstruction algorithm, it is possible to recover the missing data from a limited set of observations acquired from video cameras. These facts make it really hard to compare the performance of our system with that of other state-of-the-art systems such as [84] and [119]. However, this thesis presents comparisons of sub-parts of the entire system (i.e. the Information Refinement and Reconstruction algorithms) with specific state-of-the-art methods.
A still open issue in the proposed CDS is the development of a method for relevant information localization. Specifically, in this work relevant information has been defined over fixed observed zones. In dynamic situations, such as a moving group of people, it is important to follow the evolution of the scene. For this reason, future developments of this work will include a study on the impact of dynamic, contextual relevant-knowledge localization on the inductive reasoning algorithm. Furthermore, we want to include the prediction of future crowding state evolutions based on historical data modelling. Future developments will also include a detailed study on the impact of the information bottleneck on the GH-SOM, which can lead to an improved GH-SOM strategy for selecting the knowledge representation among different hierarchical layers.
Bibliography
[1] Handbook of Multisensor Data Fusion: Theory and Practice, Second Edition (Electrical Engineering & Applied Signal Processing Series). CRC Press, 2nd edition, September 2008.
[2] David H. Ackley, Geoffrey E. Hinton, and Terrence J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 9(1):147–169, 1985.
[3] Ashkan Amiri and Simon Haykin. Improved sparse coding under the influence of perceptual attention. Neural Computation, pages 1–44, November 2013.
[4] John R. Anderson. The Architecture of Cognition, 1983.
[5] E.L. Andrade, S. Blunsden, and R.B. Fisher. Hidden Markov models for optical flow analysis in crowds. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 1, pages 460–463, September 2006.
[6] W.R. Ashby. Principles of the self-organizing dynamic system. The Journal of General Psychology, 37(2):125–128, 1947.
[7] S. Y. Auyang. Foundations of Complex-system Theories in Economics, Evolutionary Biology, and Statistical Physics. Cambridge University Press, Cambridge, UK, 1998.
[8] Y. Benabbas, N. Ihaddadene, and C. Djeraba. Motion pattern extraction and event detection for automatic visual surveillance. EURASIP Journal on Image and Video Processing, 2011:15, 2011.
[9] Massimo Bertozzi, Alberto Broggi, L. Bombini, C. Caraffi, S. Cattani, Pietro Cerri, Ra Fascioli, M. Felisa, R. I. Fedriga, S. Ghidoni, Paolo Grisleri, P. Medici, M. Paterlini, P. P. Porta, M. Posterli, and P. Zani. Vision technologies for intelligent vehicles. In KES'07/WIRN'07 Proceedings of the 11th International Conference, KES 2007 and XVII Italian Workshop on Neural Networks Conference on Knowledge-based Intelligent Information and Engineering Systems: Part I, 2010.
[10] Alexander Bird. Routledge Companion to Philosophy of Language. Routledge Philosophy Companions. Routledge, 2012.
[11] Fernando Canales and Max Chacón. Modification of the growing neural gas algorithm for cluster analysis. In Luis Rueda, Domingo Mery, and Josef Kittler, editors, CIARP, volume 4756 of Lecture Notes in Computer Science, pages 684–693. Springer, 2007.
[12] U. Castiello and C. Umiltà. Size of the attentional focus and efficiency of processing. Acta Psychologica, 73(3):195–209, 1990.
[13] Bao Rong Chang, Hsiu Fen Tsai, and Chung-Ping Young. Intelligent data fusion system for predicting vehicle collision warning using vision/GPS sensing. Expert Systems with Applications, 37(3):2439–2450, 2010.
[14] S. Chiappino, L. Marcenaro, and C. Regazzoni. Selective attention automatic focus for cognitive crowd monitoring. In Proc. of IEEE Int. Conference on Advanced Video and Signal based Surveillance (AVSS), Krakow, Poland, 27–30 August 2013.
[15] S. Chiappino, L. Marcenaro, and C.S. Regazzoni. Information bottleneck-based relevant knowledge representation in large-scale video surveillance systems. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 4364–4368, May 2014.
[16] S. Chiappino, A. Mazzù, L. Marcenaro, and C.S. Regazzoni. A bio-inspired logical process for saliency detections in cognitive crowd monitoring. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, April 2015.
[17] S. Chiappino, P. Morerio, L. Marcenaro, E. Fuiano, G. Repetto, and C. Regazzoni. A multi-sensor cognitive approach for active security monitoring of abnormal overcrowding situations in critical infrastructure. 15th International Conference on Information Fusion, July 2012.
[18] Simone Chiappino, Lucio Marcenaro, Pietro Morerio, and Carlo Regazzoni. Event based switched dynamic Bayesian networks for autonomous cognitive crowd monitoring. In Augmented Vision and Reality, pages 1–30. Springer Berlin Heidelberg, 2013.
[19] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo Regazzoni. Run length encoded dynamic Bayesian networks for probabilistic interaction modeling. In 21st European Signal Processing Conference (EUSIPCO 2013), 2013.
[20] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. A bio-inspired knowledge representation method for anomaly detection in cognitive video surveillance systems. In Information Fusion (FUSION), 2013 16th International Conference on, pages 242–249, July 2013.
[21] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. Event definition for stability preservation in bio-inspired cognitive crowd monitoring. In Digital Signal Processing (DSP), 2013 18th International Conference on, pages 1–6, July 2013.
[22] Simone Chiappino, Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. Bio-inspired relevant interaction modelling in cognitive crowd management. Journal of Ambient Intelligence and Humanized Computing, pages 1–22, 2014.
[23] Jeongho Cho, António R. C. Paiva, Sung-Phil Kim, Justin C. Sanchez, and José C. Príncipe. Self-organizing maps with dynamic learning for signal reconstruction. Neural Netw., 20(2):274–284, March 2007.
[24] R.T. Collins, A.J. Lipton, H. Fujiyoshi, and T. Kanade. Algorithms for cooperative multisensor surveillance. Proceedings of the IEEE, 89(10):1456–1477, October 2001.
[25] Nelson Cowan. Chapter 20: What are the differences between long-term, short-term, and working memory? In Wayne S. Sossin, Jean-Claude Lacaille, Vincent F. Castellucci, and Sylvie Belleville, editors, Essence of Memory, volume 169 of Progress in Brain Research, pages 323–338. Elsevier, 2008.
[26] F. Cupillard, A. Avanzi, F. Bremond, and M. Thonnat. Video understanding for metro surveillance. In Networking, Sensing and Control, 2004 IEEE International Conference on, volume 1, pages 186–191, 2004.
[27] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893, 2005.
[28] Antonio Damasio. The Feeling of What Happens: Body and Emotion in the Making of Consciousness. Harvest Books, October 2000.
[29] Anthony C. Davies, Jia Hong Yin, and Sergio A. Velastin. Crowd monitoring using image processing. Electronics and Communication Engineering Journal, 7:37–47, 1995.
[30] A. Dore, A.F. Cattoni, and C.S. Regazzoni. Interaction modeling and prediction in smart spaces: a bio-inspired approach based on autobiographical memory. Systems, Man and Cybernetics, Part A: Systems and Humans, IEEE Transactions on, 2010.
[31] A. Dore, M. Pinasco, and C.S. Regazzoni. Multi-modal data fusion techniques and applications. In H. Aghajan and A. Cavallaro, editors, Multi-camera Networks: Concepts and Applications, pages 213–237. Elsevier, May 2009.
[32] A. Dore and C. S. Regazzoni. Bayesian bio-inspired model for learning interactive trajectories. In Proc. of the IEEE International Conference on Advanced Video and Signal based Surveillance, AVSS 2009, Genoa, Italy, September 2009.
[33] A. Dore and C.S. Regazzoni. Interaction analysis with a Bayesian trajectory model. Intelligent Systems, IEEE, 25(3):32–40, May-June 2010.
[34] Mica R. Endsley. Toward a theory of situation awareness in dynamic systems. Human Factors: The Journal of the Human Factors and Ergonomics Society, 37:32–64, March 1995.
[35] M. Valera Espina and S. A. Velastin. Intelligent distributed surveillance systems: A review. IEE Proceedings - Vision, Image and Signal Processing, 152(2):192–204, April 2005.
[36] C. Fernández, P. Baiget, F.X. Roca, and J. Gonzàlez. Augmenting video surveillance footage with virtual agents for incremental event evaluation. Pattern Recognition Letters, 32(6):878–889, 2011.
[37] James Ferryman and James L. Crowley, editors. Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, PETS 2009, 2009.
[38] John T. Finn. Use of the average mutual information index in evaluating classification error and consistency. International Journal of Geographical Information Systems, 7(4):349–366, 1993.
[39] G. L. Foresti, C. S. Regazzoni, and P. K. Varshney. Multisensor Surveillance Systems: The Fusion Perspective. Kluwer Academic, Boston, 2003.
[40] Hajer Fradi, Volker Eiselein, Ivo Keller, Jean-Luc Dugelay, and Thomas Sikora. Crowd context-dependent privacy protection filters. In DSP 2013, 18th International Conference on Digital Signal Processing, Santorini, Greece, 1–3 July 2013.
[41] Stan Franklin, Steve Strain, Javier Snaider, Ryan McCall, and Usef Faghihi. Global workspace theory, its LIDA model and the underlying neuroscience. Biologically Inspired Cognitive Architectures, 1:32–43, 2012.
[42] K. Friston, B. SenGupta, and G. Auletta. Cognitive dynamics: From attractors to active inference. Proceedings of the IEEE, 102(4):427–445, April 2014.
[43] Bernd Fritzke. A growing neural gas network learns topologies. In Advances in Neural Information Processing Systems 7, pages 625–632. MIT Press, 1995.
[44] J.M. Fuster. Cortex and Mind: Unifying Cognition. Oxford University Press, USA, 2005.
[45] J.M. Fuster. Cortex and Mind: Unifying Cognition. Oxford University Press, USA, 2005.
[46] Stephen Grossberg, editor. Cognition, Learning, Reinforcement, and Rhythm, volume 1 of The Adaptive Brain. Elsevier, 1988.
[47] IST Advisory Group. Scenarios for ambient intelligence. European Commission, 2010.
[48] D. L. Hall and J. Llinas. An introduction to multisensor data fusion. Proceedings of the IEEE, 85:6–23, 1997.
[49] D. Handford and A. Rogers. Modelling driver interdependent behaviour in agent-based traffic simulations for disaster management. In The Ninth International Conference on Practical Applications of Agents and Multi-Agent Systems, Salamanca, Spain, April 2011.
[50] Ismail Haritaoglu, David Harwood, and Larry S. Davis. W4: Real-time surveillance of people and their activities. IEEE Trans. Pattern Anal. Mach. Intell., 22:809–830, August 2000.
[51] S. Haykin, M. Fatemi, P. Setoodeh, and Yanbo Xue. Cognitive control. Proceedings of the IEEE, 100(12):3156–3169, December 2012.
[52] S. Haykin, M. Fatemi, P. Setoodeh, and Yanbo Xue. Cognitive control. Proceedings of the IEEE, 100(12):3156–3169, December 2012.
[53] S. Haykin and J.M. Fuster. On cognitive dynamic systems: Cognitive neuroscience and engineering learning from each other. Proceedings of the IEEE, 102(4):608–628, April 2014.
[54] Simon Haykin. Cognitive dynamic systems: An integrative field that will be a hallmark of the 21st century. In Cognitive Informatics & Cognitive Computing (ICCI*CC), 2011 10th IEEE International Conference on, pages 2–2, August 2011.
[55] Simon Haykin. Cognitive dynamic systems: Radar, control, and radio. Proceedings of the IEEE, 100(7):2095–2103, 2012.
[56] Simon Haykin. Cognitive dynamic systems: Radar, control, and radio. Proceedings of the IEEE, 100(7):2095–2103, 2012.
[57] Simon Haykin and D.J. Thomson. Signal detection in a nonstationary environment reformulated as an adaptive pattern classification problem. Proceedings of the IEEE, 86(11):2325–2344, November 1998.
[58] Donald O. Hebb. The Organization of Behavior. John Wiley, New York, 1949.
[59] J. Jockusch and H. Ritter. An instantaneous topological map for correlated stimuli. In Proceedings of the International Joint Conference on Neural Networks, volume 1, pages 529–534, Washington, USA, 1999.
[60] M. Jouini, S. Thiria, and M. Crépon. Images reconstruction using an iterative SOM based algorithm. In Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2012 European Symposium on, pages 25–27, 2012.
[61] Prahlad Kilambi, Evan Ribnick, Ajay J. Joshi, Osama Masoud, and Nikolaos Papanikolopoulos. Estimating pedestrian counts in groups. Computer Vision and Image Understanding, 110(1):43–59, 2008.
[62] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, September 1990.
[63] D. Kumar, C.S. Rai, and S. Kumar. Face recognition using self-organizing map and principal component analysis. In Neural Networks and Brain, 2005. ICNN&B '05. International Conference on, volume 3, pages 1469–1473, October 2005.
[64] John E. Laird. Extending the Soar cognitive architecture. In Proceedings of the 2008 Conference on Artificial General Intelligence 2008: Proceedings of the First AGI Conference, pages 224–235, Amsterdam, The Netherlands, 2008. IOS Press.
[65] John E. Laird, Allen Newell, and Paul S. Rosenbloom. Soar: An architecture for general intelligence. Artif. Intell., 33(1):1–64, September 1987.
[66] Nilli Lavie, Aleksandra Hirst, Jan W. De Fockert, and Essi Viding. Load theory of selective attention and cognitive control. Journal of Experimental Psychology: General, 133(3):339–354, 2004.
[67] A.J. Lipton, C.H. Heartwell, N. Haering, and D. Madden. Automated video protection, monitoring & detection. IEEE Aerospace and Electronic Systems Magazine, 18(5):3–18, May 2003.
[68] Bangquan Liu, Zhen Liu, and Yuan Hong. A simulation based on emotions model for virtual human crowds. In Image and Graphics, 2009. ICIG '09. Fifth International Conference on, pages 836–840, September 2009.
[69] James Llinas, Christopher Bowman, Galina Rogova, Alan Steinberg, and Frank White. Revisiting the JDL data fusion model II. In P. Svensson and J. Schubert, editors, Proceedings of the Seventh International Conference on Information Fusion (FUSION 2004), pages 1218–1230, 2004.
[70] C. Loscos, D. Marchal, and A. Meyer. Intuitive crowd behavior in dense urban environments using local laws. In Theory and Practice of Computer Graphics, 2003. Proceedings, pages 122–129, 2003.
[71] Matthias Luber, Johannes Andreas Stork, Gian Diego Tipaldi, and Kai O. Arras. People tracking with human motion predictions from social forces. In Proc. of the Int. Conf. on Robotics & Automation (ICRA), Anchorage, USA, 2010.
[72] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. pages 674–679, 1981.
[73] Michiharu Maeda. Restoration model with inference capability of self-organizing maps. In Pablo A. Estévez, José C. Príncipe, and Pablo Zegers, editors, Advances in Self-Organizing Maps, volume 198 of Advances in Intelligent Systems and Computing, pages 153–162. Springer Berlin Heidelberg, 2013.
[74] V. Mahadevan, Weixin Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1975–1981, June 2010.
[75] M. Mancas, N. Riche, J. Leroy, and B. Gosselin. Abnormal motion selection in crowds using bottom-up saliency. In Image Processing (ICIP), 2011 18th IEEE International Conference on, pages 229–232, 2011.
[76] A. N. Marana, S. A. Velastin, L. F. Costa, and R. A. Lotufo. Automatic estimation of crowd density using texture. Safety Science, pages 165–175, April 1998.
[77] L. Marchesotti, S. Piva, and C.S. Regazzoni. Structured context-analysis techniques in biologically inspired ambient-intelligence systems. IEEE Transactions on Systems, Man and Cybernetics, Part A, 35(1):106–120, 2005.
[78] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7(3):507–522, 1994.
[79] Thomas M. Martinetz and Klaus J. Schulten. A “neural gas" network learns topologies. In Teuvo Kohonen, Kai Mäkisara, Olli Simula, and Jari Kangas, editors, Proceedings of the International Conference on Artificial Neural Networks 1991 (Espoo, Finland), pages 397–402. Amsterdam; New York: North-Holland, 1991.
[80] R. Mazzon, F. Poiesi, and A. Cavallaro. Detection and tracking of groups in crowd. In Proc. of IEEE Int. Conference on Advanced Video and Signal based Surveillance (AVSS), Krakow, Poland, 27–30 August 2013.
[81] R. Mazzon, F. Poiesi, and A. Cavallaro. Detection and tracking of groups in crowd. In Proc. of IEEE Int. Conference on Advanced Video and Signal based Surveillance (AVSS), Krakow, Poland, 27–30 August 2013.
[82] J. McCarthy, M. L. Minsky, N. Rochester, and C.E. Shannon. A proposal for the Dartmouth Summer Research Project on Artificial Intelligence. http://www-formal.stanford.edu/jmc/history/dartmouth/dartmouth.html, 1955.
[83] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 935–942, June 2009.
[84] R. Mehran, A. Oyama, and M. Shah. Abnormal crowd behavior detection using social force model. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 935–942, June 2009.
[85] Ramin Mehran, Alexis Oyama, and Mubarak Shah. Abnormal crowd behavior detection using social force model. In CVPR. IEEE Computer Society, 2009.
[86] George A. Miller. The cognitive revolution: a historical perspective. Trends in Cognitive Sciences, 7(3):141–144, 2003.
[87] Brian E. Moore, Saad Ali, Ramin Mehran, and Mubarak Shah. Visual crowd surveillance through a hydrodynamics lens. Commun. ACM, 54(12):64–73, December 2011.
[88] Pietro Morerio, Lucio Marcenaro, and Carlo S. Regazzoni. People count estimation in small crowds. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 476–480, September 2012.
[89] R.G.M. Morris. Further studies of the role of hippocampal synaptic plasticity in spatial learning: Is hippocampal LTP a mechanism for automatically recording attended experience? Journal of Physiology-Paris, 90(5–6):333–334, 1996.
[90] Richard Morris, Graham Hitch, Kim Graham, and Tim Bussey. Chapter 9 - Learning and memory. In Richard Morris, Lionel Tarassenko, and Michael Kenward, editors, Cognitive Systems - Information Processing Meets Brain Science, pages 193–XII. Academic Press, London, 2006.
[91] Kevin Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, UC Berkeley, Computer Science Division, July 2002.
[92] Yunyoung Nam, Seungmin Rho, and Jong Hyuk Park. Intelligent video surveillance system: 3-tier context-aware surveillance system with metadata. Multimedia Tools Appl., 57(2):315–334, March 2012.
[93] Allen Newell. Unified Theories of Cognition. Harvard University Press, Cambridge, MA, USA, 1990.
[94] He Ni and Hujun Yin. Self-organising mixture autoregressive model for non-stationary time series modelling. International Journal of Neural Systems, 18(06):469–480, 2008. PMID: 19145663.
[95] N. Oliver and A.P. Pentland. Graphical models for driver behavior recognition in a smartcar. In Intelligent Vehicles Symposium, 2000. IV 2000. Proceedings of the IEEE, pages 7–12, 2000.
[96] Wei Pan, Wen Dong, Manuel Cebrián, Taemie Kim, James H. Fowler, and Alex Pentland. Modeling dynamical influence in human interaction: Using data to make better inferences about influence within social systems. IEEE Signal Process. Mag., pages 77–86, 2012.
[97] Debprakash Patnaik, Srivatsan Laxman, and Naren Ramakrishnan. Discovering excitatory networks from discrete event streams with applications to neuronal spike train analysis. In Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ICDM ’09, pages 407–416, Washington, DC, USA, 2009. IEEE Computer Society.
[98] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[99] Stefano Pellegrini, Andreas Ess, Konrad Schindler, and Luc van Gool. You’ll never walk alone: Modeling social behavior for multi-target tracking. In International Conference on Computer Vision, 2009.
[100] Roland Perko, Thomas Schnabel, Gerald Fritz, Alexander Almer, and Lucas Paletta. Counting people from above: Airborne video based crowd analysis. CoRR, abs/1304.6213, 2013.
[101] Bernhard Pfaff. VAR, SVAR and SVEC models: Implementation within R package vars. Journal of Statistical Software, 27(4):1–32, July 2008.
[102] A. Prati, R. Vezzani, L. Benini, E. Farella, and P. Zappi. An integrated multi-modal sensor network for video surveillance. In Proc. of the third ACM international workshop on Video surveillance & sensor networks, November 2005.
[103] H. Rahmalan, M.S. Nixon, and J.N. Carter. On crowd density estimation for surveillance. In Crime and Security, 2006. The Institution of Engineering and Technology Conference on, pages 540–545, June 2006.
[104] S. Aravinda Rao, Jayavardhana Gubbi, Slaven Marusic, P. Stanley, and Marimuthu Palaniswami. Crowd density estimation based on optical flow and hierarchical clustering. In ICACCI 2013 International Conference on Advances in Computing, Communications and Informatics, Mysore, India, 22–25 August 2013.
[105] Andreas Rauber, Dieter Merkl, and Michael Dittenbach. The growing hierarchical self-organizing map: Exploratory analysis of high-dimensional data. IEEE Transactions on Neural Networks, 13:1331–1341, 2002.
[106] Paolo Remagnino, Sergio A. Velastin, Gian Luca Foresti, and Mohan Trivedi. Novel concepts and challenges for the next generation of video surveillance systems. Mach. Vision Applications, 18(3):135–137, 2007.
[107] Irina Rish and Genady Grabarnik. Sparse signal recovery with exponential-family noise. In Proceedings of the 47th annual Allerton conference on Communication, control, and computing, Allerton’09, pages 60–66, Piscataway, NJ, USA, 2009. IEEE Press.
[108] Edmund T Rolls and Gustavo Deco. Networks for memory, perception, and decision-making, and beyond to how the syntax for language might be implemented in the brain. Brain Research, September 2014.
[109] Edward Rosten and Tom Drummond. Fusing points and lines for high performance tracking. In International Conference on Computer Vision, pages 1508–1515. Springer, 2005.
[110] Enrico Scalas and Ubaldo Garibaldi. A dynamic probabilistic version of the Aoki–Yoshikawa sectoral productivity model. Economics: The Open-Access, Open-Assessment E-Journal, 3(2009-15), 2009.
[111] Jonathon Shlens. A tutorial on principal component analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies, 2005.
[112] S. Singh and S.C. Badwaik. Design and implementation of FPGA-based adaptive dynamic traffic light controller. In Emerging Trends in Networks and Computer Communications (ETNCC), 2011 International Conference on, pages 324–330, April 2011.
[113] D. Smith and S. Singh. Approaches to multisensor data fusion in target tracking: A survey. IEEE Transactions on Knowledge and Data Engineering, 18(12):1696–1710, December 2006.
[114] C.S. Soh, P. Raveendran, and Z. Taha. Automatic generation of self-organized virtual crowd using chaotic perturbation. In TENCON 2004. 2004 IEEE Region 10 Conference, volume B, pages 124–127, November 2004.
[115] Vijay Srinivasan, John Stankovic, and Kamin Whitehouse. Using height sensors for biometric identification in multi-resident homes. In Proceedings of the 8th International Conference on Pervasive Computing, Pervasive’10, pages 337–354, Berlin, Heidelberg, 2010. Springer-Verlag.
[116] Alan N. Steinberg, Christopher L. Bowman, and Franklin E. White. Revisions to the JDL data fusion model. Volume 3719, pages 430–441. SPIE, 1999.
[117] R. J. Sternberg and K. Sternberg. Cognitive Psychology. Belmont, California: Wadsworth, 6th edition, 2012.
[118] Hang Su, Hua Yang, Shibao Zheng, Yawen Fan, and Sha Wei. Crowd event perception based on spatio-temporal viscous fluid field. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 458–463, September 2012.
[119] Hang Su, Hua Yang, Shibao Zheng, Yawen Fan, and Sha Wei. Crowd event perception based on spatio-temporal viscous fluid field. In Advanced Video and Signal-Based Surveillance (AVSS), 2012 IEEE Ninth International Conference on, pages 458–463, September 2012.
[120] Ron Sun. Learning, action and consciousness: a hybrid approach toward modelling consciousness. Neural Networks, 10(7):1317–1331, 1997.
[121] Ron Sun. Accounting for the computational basis of consciousness: A connectionist approach. Consciousness and Cognition, 8(4):529–565, 1999.
[122] Ron Sun. Symbol grounding: A new look at an old idea. Philosophical Psychology, 13(2):149–172, 2000.
[123] Ron Sun, Edward Merrill, and Todd Peterson. From implicit skills to explicit knowledge: a bottom-up model of skill learning. Cognitive Science, 25(2):203–244, 2001.
[124] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.
[125] Naftali Tishby, Fernando C. Pereira, and William Bialek. The information bottleneck method. pages 368–377, 1999.
[126] M. Trivedi, K. Huang, and I. Mikic. Intelligent environments and active camera networks. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, pages 804–809, 2000.
[127] Mohan Manubhai Trivedi, Tarak Gandhi, and Joel McCall. Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety. Intelligent Transportation Systems, IEEE Transactions on, 8(1):108–120, March 2007.
[128] Christopher Richard Wren, Ali Azarbayejani, Trevor Darrell, and Alex Paul Pentland. Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:780–785, 1997.
[129] Shunguang Wu, S. Decker, Peng Chang, T. Camus, and J. Eledath. Collision sensing by stereo vision and radar sensor fusion. Intelligent Transportation Systems, IEEE Transactions on, 10(4):606–614, 2009.
[130] Chengbin Zeng and Huadong Ma. Robust head-shoulder detection by PCA-based multilevel HOG-LBP detector for people counting. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 2069–2072, August 2010.
[131] Beibei Zhan, Dorothy Monekosso, Paolo Remagnino, Sergio Velastin, and Li-Qun Xu. Crowd analysis: a survey. Machine Vision and Applications.
[132] Tao Zhao and Ram Nevatia. Bayesian human segmentation in crowded situations. Computer Vision and Pattern Recognition, IEEE Computer Society Conference on, 2:459, 2003.