A Neural Network Approach for Video Object Segmentation in Traffic Surveillance

R.M. Luque, E. Domínguez, E.J. Palomo, and J. Muñoz

Department of Computer Science, E.T.S.I. Informatica, University of Malaga

Campus Teatinos s/n, 29071 – Malaga, Spain
{rmluque,enriqued,ejpalomo,munozp}@lcc.uma.es

Abstract. This paper presents a neural background modeling approach, based on background subtraction, for video object segmentation. A competitive neural network is proposed to form a background model for traffic surveillance. The unsupervised neural classifier handles the segmentation of natural traffic sequences with changes in illumination. The segmentation performance of the proposed neural network is qualitatively examined and compared to mixture of Gaussians models. The proposed algorithm is designed to enable efficient hardware implementation and to achieve real-time processing at high frame rates.

1 Introduction

Traffic surveillance systems are designed to detect, recognize and understand object (vehicle) movements in natural traffic sequences. Vehicle segmentation is a basic task in video processing and the foundation of understanding vehicle movements in traffic surveillance applications. Although the video is grabbed from a stationary camera, vehicle segmentation remains difficult due to moving objects and illumination changes.

The goal of video object segmentation is to separate pixels corresponding to the foreground from those corresponding to the background. The task is complicated by the increasing resolution of video sequences and continuing advances in video capture and transmission technology. As a result, research into more efficient algorithms for real-time object segmentation continues unabated.

The process of modeling the background by comparison with the frames of the sequence is often referred to as background subtraction. Probabilistic methods are the preferred approach for segmentation of sequences with complex background, but their main shortcoming is that they are computationally complex and only able to achieve real-time processing of comparatively small video formats at reduced frame rates. The development of a parallelized object segmentation approach, which would allow for object detection in real time for complex video sequences, is the focus of this paper.

Neural networks have been widely used for image segmentation. There are numerous works in the fields of handwriting recognition [1] and medical image segmentation [2,3]. In addition, neural networks have been used to adjust

A. Campilho and M. Kamel (Eds.): ICIAR 2008, LNCS 5112, pp. 151–158, 2008. © Springer-Verlag Berlin Heidelberg 2008


the topology of the segmented objects using self-organizing maps [4]. Nevertheless, these neural techniques have not been suitable for real-time applications, since they mainly require long computation times in the training process.

In this work an unsupervised competitive neural network is proposed for object segmentation in real time. The proposed approach is based on a competitive neural network that performs background subtraction. A new unsupervised competitive neural network is designed to serve both as an adaptive model of the background in a video sequence and as a classifier of pixels as background or foreground. Neural networks possess intrinsic parallelism, which can be exploited in a suitable hardware implementation to achieve fast segmentation of foreground objects.

2 Related Works

There exist two broad classes of background modeling methods: adaptive and non-adaptive. Most researchers have abandoned non-adaptive methods because of the need for manual initialization. Moreover, errors in the background accumulate over time, making these methods useful only in supervised applications without significant changes in the scene.

A standard method of adaptive background modeling is averaging the images over time [5,6], creating a background approximation. A Kalman filter-based adaptive background model is used by Koller et al. [7] to make the system more robust to lighting changes in the scene. While these methods are effective in situations where objects move continuously, they are not robust to scenes with many moving objects, particularly if they move slowly.
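The running-average idea can be sketched in a few lines. This is a minimal illustration, not code from the cited systems; the learning rate `alpha` and the difference `threshold` are hypothetical choices.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Exponentially weighted running average of the scene.
    Small `alpha` values make the background model adapt slowly."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25.0):
    """A pixel is foreground when it deviates strongly from the average."""
    return np.abs(frame - background) > threshold
```

Because the winning blob of a slow-moving object keeps contributing to the average, such a model gradually absorbs it into the background, which is exactly the weakness noted above.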

Wren et al. [8] used a multiclass statistical model, named Pfinder, based on Gaussian distributions, but the background model is a single Gaussian per pixel. A modified version modeling each pixel as a mixture of Gaussians is proposed by Stauffer and Grimson [9]. This approach is robust to scenes with many moving objects and lighting changes, but it is only able to achieve real-time processing of small video formats (120x160 pixels).
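As a rough sketch of the single-Gaussian-per-pixel test, a pixel can be checked against the per-pixel mean and variance; the threshold `k` is a hypothetical value, not taken from [8] or [9].

```python
import math

def is_background(pixel, mean, variance, k=2.5):
    """A pixel is modeled as background when its value lies within
    k standard deviations of the per-pixel Gaussian mean."""
    return abs(pixel - mean) <= k * math.sqrt(variance)
```

The mixture-of-Gaussians model of [9] generalizes this test to several weighted Gaussians per pixel, matching each observation against the components in turn, which is what makes it robust to multimodal backgrounds at a higher computational cost.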

3 Background Model

The goal of the proposed segmentation algorithm is to classify each pixel in a frame of the sequence as foreground or background, based on statistics learned by the neural network. These statistics are learned from the observed frames of the video sequence. The video sequence can be viewed as a set of pixel feature values varying over time. Different pixel features, such as intensity, hue, or red, green and blue (RGB) color components, can be used as the basis for segmentation. The values of these features are considered as a time series of pixel values. At any time t, the history of pixel (i, j) is defined as follows

{X1, X2, ..., Xt} = {I(i, j, k) : 1 ≤ k ≤ t}

where I is the image sequence and Xk is the pixel value at time k, which represents a measurement of the radiance. With a static background and lighting,


that value would be relatively constant. Unfortunately, the most interesting video sequences involve lighting changes and moving objects.

The recent history of each pixel (i, j) is modeled by an unsupervised neural network Mij. The input of the neural network is the pixel value, as shown in Figure 1. The output of the neural network is formed by at least 2 neurons (foreground and background), although more neurons can be added for multimodal backgrounds. Note that one output neuron represents the foreground and the rest of the output neurons represent the background.

Fig. 1. Neural architecture of the pixel history model

Let Xt be the pixel value at time (frame) t of the video sequence and Wi(t) be the synaptic weight vector of output neuron i at time t. The synaptic potential of each neuron is defined by expression (1), which measures the dissimilarity between the pixel value and the output neuron state. The winner neuron is the neuron with the lowest synaptic potential, that is, the most similar neuron.

hi(t) = ‖Xt − Wi(t)‖ (1)

During training, only the synaptic weights of the winner neuron are updated, by the following expressions

wij(t + 1) = wij(t) + Δwij(t)   if neuron i is the winner
wij(t + 1) = wij(t)             otherwise                   (2)

Δwij(t) = η(t)(xj(t) − wij(t)) (3)

where η(t) is the learning rate of the neuron at frame t, xj(t) is component j of the pixel feature vector Xt, and wij(t) is the synaptic weight between input neuron j and output neuron i. The change in the weights of the winner neuron is proportional to the dissimilarity between the pixel value and the winner neuron state. All pixel values are presented to the associated network, and the weights are adjusted at each frame. The pseudo-code of each neural network is presented in Figure 2. Note that these steps can be parallelized to reduce the computational cost.
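Expressions (1)–(3) amount to a standard winner-take-all update and can be sketched as follows; the constant learning rate value is a hypothetical choice, since the paper leaves η(t) unspecified.

```python
import numpy as np

def competitive_step(weights, x, eta=0.05):
    """One training step of the competitive network.

    weights: array of shape (n_neurons, n_features), one row per output neuron
    x:       pixel feature vector of shape (n_features,), e.g. RGB values
    """
    potentials = np.linalg.norm(weights - x, axis=1)  # Eq. (1): h_i = ||X - W_i||
    winner = int(np.argmin(potentials))               # most similar neuron
    weights[winner] += eta * (x - weights[winner])    # Eqs. (2)-(3): update winner only
    return winner, weights
```

Only the winner's row moves, and it moves toward the observed pixel value, so each neuron converges to one mode of the pixel's history.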

154 R.M. Luque et al.

Initialize the synaptic weights of each neuron
REPEAT
    Read the input pixel value
    Calculate the synaptic potentials
    Update the synaptic weights of the winner neuron
    IF the winner is a background neuron THEN
        Pixel classified as background
    ELSE
        Pixel classified as foreground
    ENDIF
    Go to the next frame
UNTIL the end of the sequence

Fig. 2. Pseudo-code of an unsupervised neural network
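The steps of Fig. 2 can be turned into a runnable per-pixel sketch. The convention that neuron 0 is the foreground neuron while the remaining neurons model the background is an assumption made here for illustration, as are the learning rate and the random initialization.

```python
import numpy as np

class PixelBackgroundNet:
    """One unsupervised competitive network per pixel, following Fig. 2."""

    def __init__(self, n_background=1, n_features=3, eta=0.05, seed=0):
        rng = np.random.default_rng(seed)
        # Row 0 is the (assumed) foreground neuron; the remaining rows
        # model the possibly multimodal background.
        self.weights = rng.uniform(0.0, 255.0, size=(1 + n_background, n_features))
        self.eta = eta

    def step(self, x):
        """Classify one pixel value and update the winner's weights.
        Returns True when the pixel is classified as foreground."""
        x = np.asarray(x, dtype=float)
        potentials = np.linalg.norm(self.weights - x, axis=1)  # synaptic potentials
        winner = int(np.argmin(potentials))
        self.weights[winner] += self.eta * (x - self.weights[winner])
        return winner == 0  # the foreground neuron won
```

One such network is kept per pixel, and the steps for different pixels are independent, which is what makes the parallel hardware implementation mentioned above possible.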

4 Experimental Results

In this section the performance of the proposed neural approach is evaluated. The results of the neural approach (NA) are compared against the mixture of Gaussians model (MoG) [9], which is the most widely cited approach in the research community.

The evaluated methods have been applied to a selected set of traffic surveillance sequences provided by a video surveillance online repository [10] and the Federal Highway Administration (FHWA) under the Next Generation Simulation (NGSIM) program¹. Results for two sequences, which are common examples of the monitoring systems on current highways, are shown in Figures 3 and 4. These sequences have different sizes: a low quality sequence (Highway II) of 160x120 and a high quality sequence (New U.S. Highway 101) of 640x240.

Morphological operations such as dilation and erosion, as well as the elimination of small objects, are used to enhance the segmentation results and to enable vehicle tracking, as shown in Figures 4(e) and 4(f). The object (vehicle) segmentation results are suitable for identifying vehicle movements and reliable enough for vehicle tracking.
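A pure-NumPy sketch of this post-processing is shown below. The paper does not specify structuring elements or size thresholds, so the 4-neighbour structuring element and the `min_size` value are assumptions.

```python
import numpy as np

def dilate(mask):
    """Binary dilation with a 4-neighbour (cross-shaped) structuring element."""
    out = mask.copy()
    out[1:, :] |= mask[:-1, :]
    out[:-1, :] |= mask[1:, :]
    out[:, 1:] |= mask[:, :-1]
    out[:, :-1] |= mask[:, 1:]
    return out

def erode(mask):
    """Binary erosion: a pixel survives only if all 4 neighbours are set."""
    out = mask.copy()
    out[1:, :] &= mask[:-1, :]
    out[:-1, :] &= mask[1:, :]
    out[:, 1:] &= mask[:, :-1]
    out[:, :-1] &= mask[:, 1:]
    return out

def clean_mask(mask, min_size=5):
    """Dilation followed by erosion (a closing), then removal of
    4-connected components smaller than min_size pixels."""
    closed = erode(dilate(mask.astype(bool)))
    out = np.zeros_like(closed)
    seen = np.zeros_like(closed)
    h, w = closed.shape
    for sy in range(h):
        for sx in range(w):
            if closed[sy, sx] and not seen[sy, sx]:
                seen[sy, sx] = True
                stack, comp = [(sy, sx)], [(sy, sx)]
                while stack:  # flood-fill one connected component
                    y, x = stack.pop()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and closed[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            stack.append((ny, nx))
                            comp.append((ny, nx))
                if len(comp) >= min_size:  # keep only vehicle-sized blobs
                    for y, x in comp:
                        out[y, x] = True
    return out
```

The closing fills small holes inside vehicle blobs (such as dark windows), while the component filter discards isolated noise pixels before tracking.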

A quality evaluation of the segmentation results is presented in Table 1. Similar results are obtained using MoG on the same sequences, so the accuracy of our system can be considered comparable to that of the referred algorithm. The major difference between the approaches lies in the false negative detections. The false negatives of the proposed NA are due to incorrect segmentation, mostly of some parts of the car windows. However, this is not a drawback, since morphological operators are applied before the tracking process. Moreover, an important advantage of our method is its scalability, which allows additional functionality to be integrated.

The efficiency of both MoG and NA is also compared, running on the same conventional computer, a Pentium IV at 3 GHz with 2 GB of RAM. The processing times of the video sequences shown in Figures 3 and 4 are provided in Table 2.

1 Datasets of NGSIM are available at http://ngsim.fhwa.dot.gov/



Fig. 3. Two different frames from U.S. Highway 101 and their processing with the proposed NA. (a) and (b) are the original frames in raw form; (c) and (d) are the segmentation results using the NA; (e) and (f) show the classification of vehicles for the tracking phase.

The first and second columns show the average processing time of the high quality sequence (frame size of 640x480 pixels) and of the low quality sequence (frame size 160x120 pixels), respectively. The maximum frame rates achieved by the studied approaches are presented in the third and fourth columns. While the MoG is only able to achieve real-time processing of low quality sequences at a normal frame rate (25 fps), the proposed neural approach achieves this for high quality sequences. Moreover, the NA is able to process low quality sequences at 100 fps. In addition, Table 2 shows the good scalability of the proposed NA, since the average processing time of the high quality sequences is only about twice that of the low quality sequences. Note that the processing time of the NA can be reduced using a parallel hardware implementation, since the

Table 1. Results of detection for the studied sequences

        Matching (%)   False Positives (%)   False Negatives (%)
MoG     98.54          1.41                  2.20
NA      98.34          1.37                  6.70



Fig. 4. Two different frames from Highway II. (a) and (b) are the original frames; (c) and (d) are the segmentation results using the proposed NA; (e) and (f) show the enhanced frames using morphological operations and the classification of vehicles.

speed of the neural segmentation does not depend on the size of the frame. The delay of the neural network (processing time) corresponds to the time required for the propagation and update of the signal. In a typical field programmable gate array implementation, this can be done in less than 2 ms. Thus, the neural networks themselves are capable of achieving a throughput of some 500 fps.


Table 2. Comparative analysis of the processing time

      Processing time (spf)        Maximum frame rate (fps)
      640x240        160x120       640x240        160x120
MoG   0.0411         0.0190        16             25
NA    0.0216         0.0106        25             33

5 Conclusions

This paper presents a neural approach for moving object detection, based on an unsupervised competitive neural network, in which a model is used for each image pixel. This method manages to solve some common problems in video surveillance: slow lighting changes, by adapting the neuron information; the effects of moving elements of the scene (swaying trees, sea waves, etc.), by using several neurons to represent a multimodal background; and sudden background changes and objects being introduced into or removed from the scene.

The proposed NA for video object segmentation obtains results similar to those of MoG with regard to segmentation quality, but a significant improvement in efficiency and a reduction by half in processing time are achieved using the NA. This system has been successfully applied to process real-time video sequences (more than 25 fps) using only a conventional PC and a low cost camera.

Moreover, the speed of the proposed segmentation, in a parallel hardware implementation, does not depend on the size of the frame. The delay of the neural network (segmentation time) corresponds to the time needed by the signal to propagate through the network and the time required to update it. This segmentation time can be improved in a typical field programmable gate array (FPGA) implementation running at a 10 MHz clock rate. Thus, the proposed neural approach is capable of achieving a throughput of some 500 fps.

In addition, scalability is one of the major advantages of neural networks; therefore, new neural models with new functionalities can be designed. In future work, a shadow detection method integrated into the neural network is planned, to avoid applying traditional shadow elimination algorithms.

Acknowledgements

This work is partially supported by Junta de Andalucía (Spain) under contract TIC-01615, project name Intelligent Remote Sensing Systems.

References

1. Zabavin, N., Kuznetsova, M., Luk'yanitsa, A., Torshin, A., Fedchenko, V.: Recognition of handwritten characters by means of artificial neural networks. Journal of Computer and Systems Sciences International 38(5), 831–834 (1999)


2. Hall, L., Bensaid, A., Clarke, L., Velthuizen, R., Silbiger, M., Bezdek, J.: A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain. IEEE Transactions on Neural Networks 3(5), 672–682 (1992)

3. Pham, D., Xu, C., Prince, J.: Current methods in medical image segmentation. Annual Review of Biomedical Engineering 2, 315 (2000)

4. Reyes-Aldasoro, C., Aldeco, A.: Image segmentation and compression using neural networks. Advances in Artificial Perception and Robotics, CIMAT (2000)

5. Lo, B., Velastin, S.: Automatic congestion detection system for underground platforms. In: Proceedings of the 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, pp. 158–161 (2001)

6. Cucchiara, R., Grana, C., Piccardi, M., Prati, A.: Detecting moving objects, ghosts, and shadows in video streams. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1337–1342 (2003)

7. Koller, D., Weber, J., Huang, T., Malik, J., Ogasawara, G., Rao, B., Russell, S.: Towards robust automatic traffic scene analysis in real-time. In: Proceedings of the International Conference on Pattern Recognition (1994)

8. Wren, C., Azarbayejani, A., Darrell, T., Pentland, A.: Pfinder: Real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence 19(7), 780–785 (1997)

9. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 246–252 (1999)

10. Vezzani, R., Cucchiara, R.: ViSOR: Video surveillance online repository. In: BMVA Symposium on Security and Surveillance: Performance Evaluation (2007)