Spatio-temporal Reasoning for Traffic Scene Understanding

Raluca Brehar∗, Carolina Fortuna†, Silviu Bota∗, Dunja Mladenic‡, Sergiu Nedevschi∗

∗Computer Science Department
Technical University of Cluj-Napoca
Email: [email protected], [email protected], [email protected]

†Department of Communication Systems
Josef Stefan Institute
Email: [email protected]

‡Department of Knowledge Technologies
Josef Stefan Institute
Email: [email protected]

Abstract—In this paper we introduce a system for semantic understanding of traffic scenes. The system detects objects in video images captured in real vehicular traffic situations, classifies them, maps them to the OpenCyc1 ontology and finally generates descriptions of the traffic scene in CycL or quasi-natural language. We employ meta-classification methods based on AdaBoost and Random Forest algorithms for identifying objects of interest such as cars, pedestrians and poles in traffic, and we derive a set of annotations for each traffic scene. These annotations are mapped to OpenCyc concepts and predicates; spatio-temporal rules for object classification and scene understanding are then asserted in the knowledge base. Finally, we show that the system performs well in understanding and summarizing traffic scene situations. The novelty of the approach resides in the combination of stereo-based object detection and recognition methods with logic-based spatio-temporal reasoning.

I. INTRODUCTION

The field of traffic scene understanding is closely related to image and video understanding. It has numerous applications like active intelligent surveillance, summarization and indexing of video data, unobtrusive home care for the elderly, and hands-free human-computer interaction.

Our method tries to improve the understanding of human actions and interactions in a traffic environment. The novelty of the approach resides in the combination of stereo-based object detection and recognition methods with logic-based spatio-temporal reasoning.

The objects appearing in a traffic scene are complex, and traffic scenes are usually highly cluttered environments, so separating objects from the background is extremely difficult. In our approach the foreground is segmented using stereo-vision object detection techniques. Even when segmentation provides good results, recognizing the objects is a more difficult task. We apply classification algorithms to identify three major classes of objects in a traffic scene: pedestrians, cars and poles. All other objects that may appear are grouped in the class labeled ‘other objects’ (also referred to as ‘unknown’).

1 http://www.cyc.com/opencyc

Due to the high inter- and intra-class variance of pedestrians, these are the most difficult to recognize. Cars also have complex particularities due to different illumination conditions, orientation angles with respect to the camera and different body styles (sedan, coupe, etc.). Poles may also have different shapes and dimensions, but they may be considered the easiest to recognize. Once objects are recognized, the relationships between them can be complex; therefore we use semantic technologies to model the objects and infer relationships between them.

The method we propose detects objects in video images captured in real vehicular traffic situations, classifies them, maps them to the OpenCyc2 ontology and finally generates descriptions of the traffic scene in CycL or quasi-natural language. We employ meta-classification methods based on AdaBoost and Random Forest algorithms for identifying objects of interest such as cars, pedestrians and poles in traffic, and we derive a set of annotations for each traffic scene. These annotations are mapped to OpenCyc concepts and predicates; spatio-temporal rules for object classification and scene understanding are then asserted in the knowledge base.

The paper is structured as follows. In Section II we summarize related work. Section III describes the architecture of the system. Section IV presents the experimental results, while Section V concludes the paper.

II. RELATED WORK

Semantic image understanding is an active field of research. Most state-of-the-art methods address semantic image understanding in general, and fewer focus specifically on traffic scene understanding. There are several directions in the field of semantic image understanding that are related to traffic scene understanding. A first direction focuses on understanding video events [1] - those high-level semantic concepts that humans perceive when observing a video sequence. A comprehensive survey by [1] offers a “Bird’s

2 http://www.cyc.com/opencyc

Eye View of the Video Event Understanding Domain”. The authors consider that the video event understanding process takes an image sequence as input and abstracts it into meaningful units. These units may refer to pixel-based features, object-based attributes or logic-based features. The result of the abstraction is used by an event model to determine if an event of interest has occurred. The event model may comprise state models (Bayesian networks, Hidden Markov Models), pattern recognition methods (Support Vector Machines, Nearest Neighbors, Neural Networks) or semantic models (Petri Nets, Grammars, Logic). The output of a video event understanding process may be a decision on whether a particular event has occurred or a summary of the events in the input sequence.

Most video event understanding methods are related to human activities and their interactions with the environment. For example, the authors in [2] propose a method for understanding continued and recursive human activities like fighting, greeting, approach, depart, point, shake-hands, hug, punch, kick, and push. The understanding is done using a context-free grammar representation scheme. A more elaborate method is employed by [3], which proposes a framework for recognizing human actions and interactions in video using three levels of abstraction: the poses of individual body parts recognized by Bayesian networks, the overall body pose, and the actions of a single person modeled using a dynamic Bayesian network and time juxtaposition for identifying interactions.

Object hypothesis generation (object detection) is usually performed by grouping together low-level features. In [4], clusters of 6D vectors with common motion are used to generate object hypotheses. In [5], a grid-based method for object localization is presented. Approaches to object recognition in two-dimensional images are based on invariant features like the Scale-Invariant Feature Transform (SIFT) [6], Speeded Up Robust Features (SURF) [7], Histograms of Oriented Gradients [8] or wavelets, fed into different classification schemes like neural networks, ensemble methods or support vector machines. Other approaches comprise contour matching or bag-of-words representations of objects [9], [10].

The method we propose has three layers of abstraction: (1) object detection, (2) object recognition and (3) spatio-temporal reasoning. It differs from the body of related work in its approach and components, especially in using a powerful reasoning engine and a comprehensive ontology for scene understanding and English language transliteration.

III. SYSTEM ARCHITECTURE

The general context in which the proposed method applies is the analysis of traffic scenes. A stereo-based driving assistance system mounted on a car captures information about the traffic scene. Figure 1 shows the general context. A vehicle equipped with two cameras captures information about the outside environment. Using complex processing methods for 3D stereo reconstruction, it generates object hypotheses. The object hypotheses are further processed using methods for:

Fig. 1. System architecture

• extracting objects’ parameters such as dimension, speed, depth and position in the scene;

• object recognition, which generates class hypotheses: car, pedestrian, pole or unknown.

Based on these methods, annotations are generated for each object in the scene. Next, the annotations are fed into a spatio-temporal reasoning module that improves the object detection results based on semantic reasoning and generates a better model of the traffic scene.

A. Object Detection and Object Attributes Generation

Our system uses stereo-vision in order to construct object hypotheses. Two cameras mounted on a vehicle in a stereo configuration are used for acquiring images. Figure 2 shows the position of the cameras, the orientation of the coordinate system and the cuboidal object model.

After acquiring the images, as a preprocessing stage we apply undistortion, scaling and rectification in order to obtain a canonical rectified image pair. Then, dense stereo reconstruction is performed using the TYZX [11] hardware. Optical flow is then computed using the pyramidal Lucas-Kanade corner-based optical flow algorithm described in [12].

The object detection stage generates cuboidal objects by grouping together individual 3D points. First, the ground's profile is detected using an elevation map [13]. Then, the points are classified according to their height above the ground.

Fig. 2. Coordinate system and object model

Points located above the ground are then projected on the xOz plane and grouped together. An oriented cuboid is then generated around the grouped points. A complete description of this algorithm can be found in [14]. An alternative algorithm based on density maps [15] is also used for detecting small objects (such as pedestrians). Speeds are then computed for the detected objects by averaging the optical flow vectors contained in each object.

After detection, objects are tracked using Kalman filters. Tracking improves detection by integrating information across multiple frames. It also maintains the identity of each object and improves the speed estimates computed using optical flow.
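To illustrate the tracking step, the following minimal sketch (our own illustration, not the authors' implementation) applies a constant-velocity Kalman filter to an object's position on the xOz plane; the frame interval and noise covariances are placeholder values.

import numpy as np

# State: [x, z, vx, vz]; constant-velocity model on the xOz (ground) plane.
dt = 0.1                                    # assumed frame interval in seconds
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)  # state transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only (x, z) are measured
Q = np.eye(4) * 0.01                        # process noise (placeholder)
R = np.eye(2) * 0.1                         # measurement noise (placeholder)

def kalman_step(x, P, z_meas):
    """One predict/update cycle: x is the state, P its covariance, z_meas the measured (x, z)."""
    # Predict
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update
    y = z_meas - H @ x_pred                 # innovation
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(4) - K @ H) @ P_pred
    return x_new, P_new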

Objects are modeled as cuboids, i.e. right prisms with rectangular bases. The bases of the prism are always parallel to the xOz plane. Therefore, each object can be fully described by the following parameters:

• The (x, z) coordinates of its base corners (the near left, near right, far left and far right corners). We call these parameters NearLeftX_m, NearLeftZ_m, NearRightX_m, NearRightZ_m, FarLeftX_m, FarLeftZ_m, FarRightX_m, FarRightZ_m. They are expressed in meters.

• The y coordinates of the top and bottom bases.

• Speed vector components along the x and z axes, expressed in km/h, SpeedX_kmh and SpeedZ_kmh. As objects move only on the ground surface, there is no need to consider speed along the y axis.

From these parameters we also derive the following values:

• The object's mass center, given by the CenterX_m, CenterY_m and CenterZ_m coordinates.

• The object's height, Height_m, obtained as the difference between the top and bottom y coordinates.

• The object's width and length, Width_m and Length_m.

• The object's z distance from the ego vehicle, Depth_meters.
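To make the descriptor concrete, the sketch below (a minimal Python illustration; the field names and derivation details are ours, not the authors' implementation) stores the measured parameters and derives the mass center, height, width, length and depth:

import math
from dataclasses import dataclass

@dataclass
class CuboidObject:
    # Base corner coordinates on the xOz plane, in meters
    near_left_x: float
    near_left_z: float
    near_right_x: float
    near_right_z: float
    far_left_x: float
    far_left_z: float
    far_right_x: float
    far_right_z: float
    # Top and bottom base y coordinates, in meters
    top_y: float
    bottom_y: float
    # Speed components along the x and z axes, in km/h
    speed_x_kmh: float
    speed_z_kmh: float

    def height_m(self) -> float:
        # Difference between the top and bottom y coordinates
        return self.top_y - self.bottom_y

    def center_m(self) -> tuple:
        # Mass center approximated from the base corners and the two bases
        cx = (self.near_left_x + self.near_right_x + self.far_left_x + self.far_right_x) / 4.0
        cz = (self.near_left_z + self.near_right_z + self.far_left_z + self.far_right_z) / 4.0
        cy = (self.top_y + self.bottom_y) / 2.0
        return (cx, cy, cz)

    def width_m(self) -> float:
        # Extent along the near edge of the base
        return math.hypot(self.near_right_x - self.near_left_x,
                          self.near_right_z - self.near_left_z)

    def length_m(self) -> float:
        # Extent from the near-left to the far-left corner
        return math.hypot(self.far_left_x - self.near_left_x,
                          self.far_left_z - self.near_left_z)

    def depth_m(self) -> float:
        # z distance of the object's center from the ego vehicle
        return self.center_m()[2]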

B. Object Recognition

An important feature that can be added to semantic annotations is the class to which the objects belong. In order to recognize the objects that appear in a traffic scene we have employed several machine learning methods for identifying relevant classes of objects in traffic:

• pedestrian
• pole
• car
• unknown - a class that contains objects other than pedestrians, poles and cars.

Our methodology comprises the following steps:

1) Construct a database of models for each class.
2) Extract relevant features for the models in each class.
3) Train level 1 classifiers. These learners are trained on histogram of oriented gradients features for the Pole and Pedestrian classes. Their recognition score will be used as an input feature (among other features) for a meta-classifier.
4) Train the meta-classifier on all features and on the responses of the level 1 classifiers.

1) The database of models: The database of models was obtained by manually labeling several traffic video sequences. When choosing the sequences we tried to cover a large variety of illumination conditions, backgrounds, weather conditions and intersection configurations. Each object in the three classes was precisely framed. Guided by the detection results, we introduced into the ’Unknown’ class 2D bounding boxes in which two objects were grouped together, or bounding boxes that contain only parts of an object. Samples from the database are shown in Figure 3.

Fig. 3. Models used for object recognition

2) Relevant feature extraction: For each sample in the database we have computed the following features:

a) Object dimensions - width and height - computed based on the cuboids generated by the segmentation of 3D points. The cuboids capture the position and the spatial extent of each obstacle.

b) Motion features, represented by: (i) speed (on the Ox and Oz axes), obtained from optical flow filtered by tracking, and (ii) motion signature - the horizontal variance of the 2D motion field.

c) The HoG Pedestrian score and the HoG Pole score, given by previously trained boosted classifiers on Histogram of Oriented Gradients features for the pole and pedestrian classes.

d) The pattern matching score, given by the degree of match between the contour of an obstacle and templates from a previously generated hierarchy of pedestrian contours.

e) The vertical texture dissimilarity, which measures the maximum vertical dissimilarity that can be found in the object's area.
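A small sketch of how the per-object feature vector could be assembled is given below; in particular, the motion signature is computed here as the horizontal variance of the 2D motion field inside the object, which is our reading of the description above. The feature order and helper names are illustrative, and the object descriptor reuses the CuboidObject sketch from Section III-A.

import numpy as np

def motion_signature(flow_u: np.ndarray) -> float:
    """Horizontal variance of the 2D motion field inside the object's image region.
    flow_u holds the horizontal optical-flow components of the object's pixels."""
    return float(np.var(flow_u))

def build_feature_vector(obj, flow_u, hog_pedestrian_score, hog_pole_score,
                         pattern_match_score, vertical_texture_dissimilarity):
    """Concatenate dimension, motion, level-1 score, contour and texture features."""
    return np.array([
        obj.width_m(), obj.height_m(),            # object dimensions
        obj.speed_x_kmh, obj.speed_z_kmh,         # speed from tracked optical flow
        motion_signature(flow_u),                 # motion signature
        hog_pedestrian_score, hog_pole_score,     # level 1 classifier scores
        pattern_match_score,                      # contour / template matching score
        vertical_texture_dissimilarity,           # texture feature
    ], dtype=float)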

3) Train level 1 classifiers: For the images in the pedestrian and pole classes we have extracted Histogram of Oriented Gradients (HoG) features [8], and on each class we have trained a cascaded boosted classifier.

The HoG features are obtained by dividing the 2D image corresponding to the 3D cuboid surrounding the object into non-overlapping cells of equal dimension. For each cell, the gradient magnitude and orientation are computed. Based on them, a weighted histogram of orientations is built within each cell. Next, cells are grouped into overlapping blocks and the values of the histograms contained in a block are normalized. Two different cascades of boosted classifiers are trained using the AdaBoost algorithm, one for poles and one for pedestrians (Figure 4).

Fig. 4. HoG based classifiers for poles and pedestrians
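For reference, a HoG descriptor of this cell/block form can be computed with scikit-image; the window, cell and block sizes below are common defaults, not the values used in the paper.

from skimage.feature import hog
from skimage.transform import resize

def hog_descriptor(gray_patch):
    """HoG features for the 2D image patch corresponding to a 3D cuboid."""
    patch = resize(gray_patch, (128, 64))          # assumed canonical 128x64 window
    return hog(patch,
               orientations=9,                     # bins of the orientation histogram
               pixels_per_cell=(8, 8),             # non-overlapping cells
               cells_per_block=(2, 2),             # overlapping blocks
               block_norm='L2-Hys')                # per-block normalization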

4) Meta-classification for object recognition in traffic scenes: A generic meta-classifier combines the output from the level 1 classifiers with the other features (motion, pattern matching, object dimension, texture), as shown in Figure 5.

Fig. 5. Meta-classifier for object recognition

We have experimented with two types of classification schemes: AdaBoost and Random Forest. The "Experimental Results" section describes the performance of the recognition task.
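A sketch of the meta-classification step with scikit-learn follows; the paper's implementation uses 20 random trees and AdaBoost over J48 weak learners in a Weka-style setup, so the estimators, file names and settings here are approximate stand-ins.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# X: one row per detected object, columns as in the feature list above
# (dimensions, motion, HoG scores, pattern matching, texture).
# y: labels in {"pedestrian", "pole", "car", "unknown"}.
X = np.load("features.npy")        # hypothetical feature matrix
y = np.load("labels.npy")          # hypothetical label vector

rf_meta = RandomForestClassifier(n_estimators=20)   # 20 random trees, as in the paper
ada_meta = AdaBoostClassifier(n_estimators=50)      # boosted trees; the paper combines J48 learners

rf_meta.fit(X, y)
ada_meta.fit(X, y)
print(rf_meta.predict(X[:5]))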

C. Ontological Modeling of the World

For the ontological modeling of the world - by world referring to the acquired images, the objects and the information about the objects produced by the previously described modules - we used OpenCyc. The OpenCyc knowledge base is composed of an ontology (i.e. an information model and relations between its elements), instances of concepts, relations and rules. The world model is mapped and modeled via OpenCyc-specific abstractions. For each scene (image), the Object recognition module provides a list of objects described by a set of attributes. These attributes are mapped to existing OpenCyc concepts or expressed using corresponding predicates. For attributes with no equivalent in the OpenCyc ontology, new concepts and predicates have been defined and linked. The basic mapping of the descriptors provided by the Object recognition module is presented in Table I. The detected object type is mapped to one of four concepts (Automobile, UtilityPole, Person, or Obstacle for the unclassified objects).

Descriptor: ObjectClass + ObjectNo
Value: N
Constant: Automobile, UtilityPole, Person, Obstacle
Instance: Car5A000287 isa: Automobile; Pole4A000282 isa: UtilityPole; Pedestrian2A000282 isa: Person; Unclassified1000282 isa: Obstacle

Descriptor: NearLeftX, NearLeftZ, NearRightX, NearRightZ, FarRightX, FarRightZ, FarLeftX, FarLeftZ
Value: X, Z
Constant: Point
Instance: Point00A000282 isa: Point; memberOfList: ListOfVerticesFP0A000282;
  (positionAlongXAxisInCoordinateSystem Point00A000282 (SomeExampleFn CartesianCoordinateSystem) ((Milli Meter) 3))
  (positionAlongZAxisInCoordinateSystem Point00A000282 (SomeExampleFn CartesianCoordinateSystem) ((Milli Meter) 24))

Descriptor: CenterX
Value: X
Constant: positionAlongXAxisInCoordinateSystem
Instance: (positionAlongXAxisInCoordinateSystem Pedestrian1A000285 (SomeExampleFn CartesianCoordinateSystem) ((Milli Meter) -2))

Descriptor: CenterY
Value: Y
Constant: positionAlongYAxisInCoordinateSystem
Instance: (positionAlongYAxisInCoordinateSystem Pedestrian1A000285 (SomeExampleFn CartesianCoordinateSystem) ((Milli Meter) -1))

Descriptor: CenterZ
Value: Z
Constant: positionAlongZAxisInCoordinateSystem
Instance: (positionAlongZAxisInCoordinateSystem Pedestrian1A000285 (SomeExampleFn CartesianCoordinateSystem) ((Milli Meter) 21))

Descriptor: Height, Width, Length
Value: H, W, L
Constant: heightOfObject, widthOfObject, lengthOfObject
Instance: heightOfObject: ((Milli Meter) 2); widthOfObject: ((Milli Meter) 9); lengthOfObject: ((Milli Meter) 10)

Descriptor: SpeedX
Value: X
Constant: speedAlongXAxisInCoordinateSystem
Instance: (speedAlongXAxisInCoordinateSystem Pedestrian1A000285 (KilometersPerHour -11) (SomeExampleFn CartesianCoordinateSystem))

Descriptor: SpeedZ
Value: Z
Constant: speedAlongZAxisInCoordinateSystem
Instance: (speedAlongZAxisInCoordinateSystem Pedestrian1A000285 (KilometersPerHour 15) (SomeExampleFn CartesianCoordinateSystem))

TABLE I
MAPPING OF OBJECT DESCRIPTORS TO OPENCYC CONCEPTS

The four points determining the projection of the object on the xOz plane are defined as points in a list of vertices. For some unclassified objects, the four points determining the projection are not available, so for identifying the object's position in space, the 3D coordinates of its center are used. In this case, we use the positionAlong...AxisInCoordinateSystem predicates on the object itself, as opposed to on the points delimiting the footprint in the xOz plane. The dimensions of objects are specified using the heightOfObject, lengthOfObject and widthOfObject predicates. For inserting knowledge about the 2D speed of objects, two predicates were created: speedAlongXAxisInCoordinateSystem and speedAlongZAxisInCoordinateSystem. These are similar to the positioning predicates, except that they denote speed.
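As an illustration of this mapping, the sketch below generates CycL-style assertions for one recognized object. The constant and predicate names follow Table I, but the generation code itself, the make_assertions helper and its argument layout are our own illustration.

def make_assertions(name, concept, center_mm, size_mm, speed_kmh):
    """Build CycL-style assertions (as strings) for one detected object.
    center_mm = (x, y, z) in millimeters, size_mm = (height, width, length),
    speed_kmh = (speed_x, speed_z)."""
    cs = "(#$SomeExampleFn #$CartesianCoordinateSystem)"
    return [
        f"(#$isa {name} #${concept})",
        f"(#$positionAlongXAxisInCoordinateSystem {name} {cs} ((#$Milli #$Meter) {center_mm[0]}))",
        f"(#$positionAlongYAxisInCoordinateSystem {name} {cs} ((#$Milli #$Meter) {center_mm[1]}))",
        f"(#$positionAlongZAxisInCoordinateSystem {name} {cs} ((#$Milli #$Meter) {center_mm[2]}))",
        f"(#$heightOfObject {name} ((#$Milli #$Meter) {size_mm[0]}))",
        f"(#$widthOfObject {name} ((#$Milli #$Meter) {size_mm[1]}))",
        f"(#$lengthOfObject {name} ((#$Milli #$Meter) {size_mm[2]}))",
        f"(#$speedAlongXAxisInCoordinateSystem {name} (#$KilometersPerHour {speed_kmh[0]}) {cs})",
        f"(#$speedAlongZAxisInCoordinateSystem {name} (#$KilometersPerHour {speed_kmh[1]}) {cs})",
    ]

# Example: the pedestrian instance used in Table I
for assertion in make_assertions("Pedestrian1A000285", "Person", (-2, -1, 21), (2, 9, 10), (-11, 15)):
    print(assertion)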

From an ontological point of view, the set of objects depicted by an image represents a PhysicalSeries - a series of spatially ordered objects. Each image depicts a corresponding PhysicalSeries and the objects which belong to this series. Images themselves are ordered in time and belong to a Series, an instance of which is, in this particular case, a VideoClip. For instance, the knowledge base description of a video clip composed of a time-ordered sequence of 15 images is presented in Figure 6, and the description of an image is presented in Figure 7.

Individual: VideoClip0
isa: Series
seriesLength: 15
seriesMembers:
  ImageA000288 ImageA000287 ImageA000286
  ImageA000285 ImageA000284 ImageA000283
  ImageA000282 ImageA000281 ImageA000280
  ImageA000279 ImageA000278 ImageA000277
  ImageA000276 ImageA000275 ImageA000274
seriesOrderedBy: after

Fig. 6. Representation of a time ordered sequence of images

Individual: ImageA000282
isa: Image TemporalThing
imageDepicts:
  Unclassified10A000282 Unclassified9A000282
  Unclassified8A000282 Unclassified7A000282
  Unclassified6A000282 Unclassified5A000282
  Pole4A000282 Unclassified3A000282
  Pedestrian2A000282 Unclassified1A000282
  Unclassified0A000282 PhySeriesA000282

(precedesInSeries ImageA000282 ImageA000283 VideoClip0)
(precedesInSeries ImageA000281 ImageA000282 VideoClip0)
(seriesMembers VideoClip0 ImageA000282)

Fig. 7. Representation of an image in a sequence

Starting from the descriptors listed in the first column of Table I, additional information can be extracted and inserted in the knowledge base. Based on the speed along the X and Z axes, a movement direction can be determined, as well as a speed in that direction. For instance, an object with SpeedZ > 0 and SpeedX = 0 moves in direction North with a speed equal to SpeedZ. The appropriate OpenCyc constants for representing this are listed in the first two rows of Table II.

Descriptor | Constant
Direction of motion | movesInDirection-Average, North-Generally, Northeast-Generally, West-Generally, Northwest-Generally
Speed of object | speedOfObject-Translation, velocityOfObject
Distance between objects | distanceBetween
Relative position between objects | near, farFrom, northOf, northeastOf, eastOf, southeastOf, spatiallyIntersects, overlappingSpaceRegions
Action execution | performedBy, doneBy
Events | Walking, Jogging, Running

TABLE II
OTHER KNOWLEDGE

Furthermore, the distance between two objects can be determined based on the distance between their footprints (note that objects are delimited by a rectangular cuboid with a rectangular footprint). Based on the distance between objects and the width of the scene, subjective estimations of the relative distances between objects can be made using predicates such as farFrom and near. Also, relative positioning predicates such as northOf and spatiallyIntersects can be inserted.
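A minimal sketch of how such spatial predicates could be derived from the numeric attributes is given below; the compass convention (positive z as North, positive x as East) and the near/far threshold are assumptions on our part, and the object descriptor is the CuboidObject sketch from Section III-A.

import math

def compass_direction(dx: float, dz: float) -> str:
    """Map a displacement or speed vector on the xOz plane to an 8-way compass direction.
    Assumed convention: +z points North, +x points East."""
    angle = math.degrees(math.atan2(dx, dz)) % 360     # 0 = North, increasing clockwise
    names = ["North", "Northeast", "East", "Southeast",
             "South", "Southwest", "West", "Northwest"]
    return names[int(((angle + 22.5) % 360) // 45)]

def relative_position(obj_a, obj_b, near_threshold_m: float = 10.0):
    """Derive a proximity predicate (near/farFrom) and a directional predicate
    (e.g. northOf) describing obj_a relative to obj_b."""
    ax, _, az = obj_a.center_m()
    bx, _, bz = obj_b.center_m()
    dist = math.hypot(bx - ax, bz - az)
    proximity = "near" if dist < near_threshold_m else "farFrom"
    direction = compass_direction(ax - bx, az - bz)
    return proximity, direction[0].lower() + direction[1:] + "Of"

# Movement direction example: an object with SpeedZ > 0 and SpeedX = 0 moves North.
print(compass_direction(0.0, 12.0))   # -> North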

Once the relevant concepts have been modeled and introduced in the knowledge base, generic or application-specific rules can be defined to achieve the desired functionality. Our scope in this paper is to define rules to improve the classification and to achieve better scene understanding.

1) Scene Understanding: Scene understanding can be achieved just by inserting appropriate instances and predicates in the knowledge base. With the knowledge already in the knowledge base (the OpenCyc knowledge base contains over 47,000 concepts and 306,000 facts), rich inferences can be performed and more abstract queries can be answered. Additional rules can be added to the knowledge base to allow more application-specific reasoning if this is not already covered by the existing (common sense) knowledge, for instance rules for inferring directions between objects based on relative positioning, rules for classification and rules for action recognition.

2) Action Recognition: Action recognition in the context of urban traffic scenes typically deals with detecting people, cars, motorbikes, trams, etc. and recognizing the actions they perform. For instance, a person moving on two legs is performing one of the following actions: walking, jogging or running. When the person is on a bike, the action is riding a bike. Cars are driven by persons who can drive straight, take a turn, drive at high speed or according to the rules, etc. Depending on their positioning in the scene and their movements, actors may approach each other. Events such as approaching should be detected for safety reasons, especially when a person is approaching a car or vice versa. We introduce rules for spatial reasoning over relative positions and motion (speed, direction) of actors to detect approaching events.
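A sketch of the kind of approaching test such a rule encodes is given below (our own procedural rendering, not the knowledge-base rule itself; the distance decrease threshold is illustrative).

import math

def distance_m(pos_a, pos_b):
    """Euclidean distance between two (x, z) footprint centers, in meters."""
    return math.hypot(pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])

def is_approaching(prev_a, prev_b, curr_a, curr_b, min_decrease_m: float = 0.5) -> bool:
    """Two actors are approaching if the distance between them decreases
    by at least min_decrease_m between two consecutive frames."""
    return distance_m(prev_a, prev_b) - distance_m(curr_a, curr_b) >= min_decrease_m

# Example: a pedestrian and a car get closer between frame t-1 and frame t
print(is_approaching((0.0, 20.0), (3.0, 5.0), (0.0, 18.5), (3.0, 6.0)))  # True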

D. English Transliteration

The semantic representation of each traffic scene is provided in the CycL language and stored in the OpenCyc knowledge base using its specific representation. For some types of relations, including the ‘isa’ predicate, transliterations can be obtained through the API. For more complex queries, the answers in CycL need to be parsed by custom transliteration modules and output in a specific manner. The complete transliteration of a traffic scene can be seen as a textual summary of the scene. In other words, our system is able to take images, detect objects, classify objects, infer relationships between objects, understand the complex situation in a scene and output this in a quasi-natural language manner using the transliteration.
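The sketch below illustrates the kind of template-based rendering such a transliteration module can perform for simple predicates; the templates and the handling of CycL answers are simplified on our part and do not reproduce the OpenCyc API.

# Hypothetical templates for turning simple assertions into English sentences.
TEMPLATES = {
    "isa": "{0} is a {1}.",
    "near": "{0} is near {1}.",
    "farFrom": "{0} is far from {1}.",
    "northOf": "{0} is north of {1}.",
    "imageDepicts": "Image {0} depicts {1}.",
}

def transliterate(predicate: str, *args) -> str:
    """Render one assertion using a template; fall back to a generic form."""
    fallback = predicate + "(" + ", ".join("{%d}" % i for i in range(len(args))) + ")"
    return TEMPLATES.get(predicate, fallback).format(*args)

print(transliterate("isa", "pedestrian2a000282", "person"))
print(transliterate("farFrom", "pedestrian2a000282", "unclassified9a000282"))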

IV. EXPERIMENTAL RESULTS

A. Object Detection and Recognition

The experimental results of object detection are better understood if presented via their correlation with object recognition.

In performing the recognition task, our database of models contained 27,000 cars, 26,000 pedestrians, 34,000 pole models and 100,000 samples in the unknown class. For all these models we have extracted the features mentioned in Section III-B2 and applied the meta-classification scheme.

For meta-classification we have experimented with two types of multi-class learners: Random Forest and adaptive boosting. The Random Forest is a meta-learner composed of many individual decision trees. It is designed to operate quickly over large datasets and to be diverse by using random samples to build each tree in the forest. It constructs random forests by bagging ensembles of random trees. We have used 20 random trees in our implementation. The adaptive boosted classifier is also a meta-learner and, in our implementation, it is formed by a linear combination of J48 weak learners.

The validation of such a classification system is very difficult. A manual validation is extremely time consuming and requires considerable effort from the user. We have therefore validated the system on the database of built models.

The results of the meta-classifier trained using Random Forest and AdaBoost are presented in what follows. The evaluation is done using stratified cross-validation on the database of models.
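For reference, the sketch below runs a stratified cross-validation with scikit-learn; it only approximates the Weka-style evaluation used in the paper (the fold count and estimator settings are assumptions).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(X, y, n_splits: int = 10):
    """Stratified k-fold accuracy of the Random Forest meta-classifier.
    X, y are the feature matrix and class labels built from the database of models."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    clf = RandomForestClassifier(n_estimators=20)   # 20 trees, as reported in the paper
    scores = cross_val_score(clf, X, y, cv=cv)
    return scores.mean(), scores.std()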

Classifier: Random Forest, evaluated using stratified cross-validation
Correctly Classified Instances: 104336 = 86.5643 %
Incorrectly Classified Instances: 16194 = 13.4357 %
Kappa statistic: 0.8201
Mean absolute error: 0.0986
Root mean squared error: 0.2226
Relative absolute error: 26.3851 %
Root relative squared error: 51.4945 %
Total Number of Instances: 120530

TABLE III
RANDOM FOREST META-CLASSIFIER EVALUATION

Classifier: AdaBoost with J48 decision stumps, evaluated using stratified cross-validation
Correctly Classified Instances: 93151 = 86.4759 %
Incorrectly Classified Instances: 14568 = 13.5241 %
Kappa statistic: 0.8188
Mean absolute error: 0.0826
Root mean squared error: 0.2271
Relative absolute error: 22.1109 %
Root relative squared error: 52.5714 %
Total Number of Instances: 107719

TABLE IV
ADABOOST META-CLASSIFIER EVALUATION

These results are promising but not perfect. The spatio-temporal reasoning that we have presented shows how they can be improved.

B. Ontological Modeling of the World

1) Scene Understanding: To improve the classification provided by the Object recognition module we defined several rules, one of which is presented in Figure 8. These rules use spatio-temporal reasoning to improve object classification. For instance, the rule in Figure 8 asserts that if there are two consecutive images in a video, in both images an object was detected in a location, in one of the images the object is classified by the Object recognition module as unknown (Obstacle in the knowledge base) and in the other image it is classified as a Pedestrian (Person in the knowledge base), and the two objects have similar dimensions and relative positions to the origin of the coordinate system, then the unknown object is probably a pedestrian. In the case of image

(#$implies
  (#$and
    (#$isa ?Video #$Series)
    (#$isa ?Img1 #$Image)
    (#$isa ?Img2 #$Image)
    (#$precedesInSeries ?Img1 ?Img2 ?Video)
    (#$imageDepicts ?Img1 ?Obj1)
    (#$isa ?Obj1 #$Obstacle)
    (#$imageDepicts ?Img2 ?Obj2)
    (#$isa ?Obj2 #$Person)
    (#$and
      (#$positionAlongXAxisInCoordinateSystem ?Obj1
        (#$SomeExampleFn #$CartesianCoordinateSystem) ?X1)
      (#$positionAlongZAxisInCoordinateSystem ?Obj1
        (#$SomeExampleFn #$CartesianCoordinateSystem) ?Z1)
      (#$positionAlongXAxisInCoordinateSystem ?Obj2
        (#$SomeExampleFn #$CartesianCoordinateSystem) ?X2)
      (#$positionAlongZAxisInCoordinateSystem ?Obj2
        (#$SomeExampleFn #$CartesianCoordinateSystem) ?Z2)
      (#$evaluate ?S1
        (#$AbsoluteValueFn (#$DifferenceFn ?X1 ?X2)))
      (#$evaluate ?S2
        (#$AbsoluteValueFn (#$DifferenceFn ?Z1 ?Z2)))
      (#$greaterThanOrEqualTo ((#$Milli #$Meter) 3) ?S1)
      (#$greaterThanOrEqualTo ((#$Milli #$Meter) 3) ?S2)
      (#$widthOfObject ?Obj1 ?W1)
      (#$widthOfObject ?Obj2 ?W2)
      (#$greaterThanOrEqualTo ((#$Milli #$Meter) 1)
        (#$AbsoluteValueFn (#$DifferenceFn ?W1 ?W2)))
      (#$heightOfObject ?Obj1 ?H1)
      (#$heightOfObject ?Obj2 ?H2)
      (#$greaterThanOrEqualTo ((#$Milli #$Meter) 1)
        (#$AbsoluteValueFn (#$DifferenceFn ?H1 ?H2)))))
  (#$isa ?Obj1 #$Person))

Fig. 8. Forward rule classifying an obstacle as a Person

A282, the classification algorithms in the Object recognition module detected 11 objects and classified two of them (a utility pole and a pedestrian), as can be seen at the top of Figure 9. The pole is surrounded by a green rectangle and the person by a yellow one. Some of the unclassified objects we, as humans, can distinguish: object 3 is a car, object 8 is a pole. Furthermore, we can tell that there is another car in the image which was not detected, there is a publicity panel supported by legs which look like poles, there is a fence with metal bars, and the publicity panel is behind the fence. Telling exactly which objects are the support pillars of the publicity panel is hard even for us. It can be seen that the Object recognition algorithm performs poorly here (and it is rather inconsistent with respect to the detected objects across several similar images). The spatio-temporal reasoning improves the classification of unknown objects by recognizing the car and the utility pole on the right side of the image (see Table V). However, it misclassifies many of the unknown objects (i.e. the metal fence, the publicity panel) as poles. In this particular case, the misclassification error is not expensive since all these objects are static and the shapes of a pole and of a support pillar of the fence are similar. However, the misclassification of a person as a pole would be more expensive. The rules we designed for spatio-

Image name | Classification algorithm [TP/FP/detected] | Spatio-temporal reasoning [TP/FP/detected]
A274 | 0/0/6  | 4/0/6
A275 | 1/0/6  | 3/1/6
A276 | 0/0/6  | 3/2/6
A277 | 1/0/9  | 4/4/9
A278 | 3/0/12 | 4/5/12
A279 | 2/0/11 | 5/2/11
A280 | 2/0/8  | 4/2/8
A281 | 2/0/9  | 4/2/9
A282 | 2/0/11 | 4/4/11
A283 | 4/0/11 | 5/3/11
A284 | 2/0/9  | 5/2/9
A285 | 2/0/8  | 3/1/8
A286 | 3/0/7  | 4/1/7
A287 | 3/0/8  | 3/0/8
A288 | 3/0/8  | 3/0/8

TABLE V
IMPROVING CLASSIFICATION BY MEANS OF SPATIO-TEMPORAL REASONING

temporal reasoning perform well on the small number of images we manually verified (25 images). However, we believe that they can be further optimized, especially by involving more complex spatial reasoning that takes into account surrounding objects, their classification and their relative positions. Complex rules for classification based on spatio-temporal reasoning may, though, underperform when applied to images where the object detection algorithm produces noisier results. The listing below the image in Figure 9 represents part of the description of the scene as modeled in the knowledge base. The complete description of a scene is rather lengthy, as the description of one object consists of over 100 assertions. Using appropriate queries, one can obtain a list of things depicted by the image, the conceptualization of things (i.e. unclassified3a000282 is a car), inferred conceptualizations based on the knowledge already existing in the knowledge base (i.e. unclassified3a000282 is a mechanical device, unclassified3a000282 is a powered device) as well as the relative positioning of objects in space (i.e. pedestrian2a000282 farFrom unclassified9a000282).

2) Action Recognition: We introduce rules for spatial reasoning over relative positions and motion (speed, direction) of

List of things:
Image depicts physeriesa000282
Image depicts unclassified0a000282
Image depicts unclassified1a000282
Image depicts pedestrian2a000282
Image depicts unclassified3a000282
Image depicts pole4a000282
Image depicts unclassified5a000282
Image depicts unclassified6a000282
Image depicts unclassified7a000282
Image depicts unclassified8a000282
Image depicts unclassified9a000282
Image depicts unclassified10a000282

Classification of object instances to semantic concepts:
pedestrian2a000282 is a person.
pole4a000282 is a utility pole.
unclassified6a000282 is a utility pole.
unclassified7a000282 is a utility pole.
unclassified8a000282 is a utility pole.
unclassified9a000282 is a utility pole.
unclassified10a000282 is a utility pole.
unclassified3a000282 is a car.

Inference of belonging to related semantic concepts:
unclassified3a000282 is a roadvehicle.
unclassified3a000282 is a mechanicaldevice.
unclassified3a000282 is a powereddevice.

Relative positioning:
unclassified0a000282 near unclassified1a000282.
unclassified0a000282 northOf unclassified1a000282.
unclassified0a000282 eastOf pedestrian2a000282.
unclassified1a000282 northeastOf pedestrian2a000282.
unclassified8a000282 farFrom pole4a000282.
unclassified8a000282 near unclassified9a000282.
unclassified0a000282 northOf unclassified9a000282.
unclassified1a000282 northOf unclassified9a000282.
pedestrian2a000282 farFrom unclassified9a000282.

Fig. 9. Scene with its description

Motion modeling by translation and direction:
pedestrian2a000282 moves in direction northwest-generally,
speed of object kilometersperhour 17.205.
unclassified3a000282 moves in direction northwest-generally,
speed of object kilometersperhour 17.205.

Action recognition:
actionpedestrian2a000282 performed by pedestrian2a000282,
actionpedestrian2a000282 isa running.
actionpedestrian2a000282 performed by pedestrian2a000282,
actionpedestrian2a000282 isa ambulation.
actionpedestrian2a000282 performed by pedestrian2a000282,
actionpedestrian2a000282 isa movingalllegs.
movement02a000282 done by pedestrian2a000282,
movement02a000282 isa approaching.
movement02a000282 done by pedestrian2a000282,
movement02a000282 isa translation-locationchange.
movement02a000282 done by pedestrian2a000282,
movement02a000282 isa movement-translationevent.

Fig. 10. Description of a scene

actors to detect approaching events. Figure 10 shows how the action recognition task is modeled in Cyc.

C. English Transliteration

The English language transliteration is generated based on the knowledge present in the knowledge base. This consists of transliterating the reasoning (i.e. the proof) for a query. For instance, when a query is fired asking for all SpatialThings in

an image, the reasoning goes up the hierarchy via the isa predicate and determines whether each object is an instance of the abstract SpatialThing concept. In Figure 11 it can be seen that the object instance denoted by the name unclassified0a000282 is a utility pole (as it was classified by the spatio-temporal reasoning), and in the OpenCyc ontology it is represented that utility pole is a specialization of spatial thing; therefore, the transliterated answer to the above question can be read in the figure. Such transliterations, or transliterations of the whole scene, can be used to generate a textual summary of the image.

Image ImageA000282 depicts SpatialThing?
unclassified0a000282 is a utility pole,
every utility pole is an open-air,
every open-air is a localized spatial thing,
every localized spatial thing is a spatial thing.

unclassified1a000282 is an obstacle,
every obstacle is a tangible thing,
every tangible thing is a three dimensional thing,
all three dimensional things are things with two or more dimensions,
every thing with two or more dimensions is a spatial thing with one or more dimensions,
every spatial thing with one or more dimensions is a spatial thing.

Image ImageA000282 depicts ObjectWithUse?
unclassified0a000282 is a utility pole,
every utility pole is a post,
every post is a shaft,
every shaft is a rod,
every rod is an implement,
every implement is a device,
every device is an object with uses.
unclassified3a000282 is a car,
every car is a device that is not a weapon,
every device that is not a weapon is a device,
every device is an object with uses.

Fig. 11. Transliteration of semantic representations

V. CONCLUSIONS

This paper has presented a novel method of spatio-temporal reasoning for understanding traffic scenes. Our method combines stereo-vision object detection approaches with meta-classification object recognition algorithms in order to generate a first-level semantic annotation of the traffic scene. This semantic annotation is further mapped into OpenCyc concepts and predicates; spatio-temporal rules for object classification and scene understanding are then asserted in the knowledge base. The method performs well in understanding and summarizing traffic scene situations.

VI. ACKNOWLEDGMENTS

This work was partially funded by the PlanetData, Intersafe2 and the Romanian-Slovenian bilateral cooperation projects.

REFERENCES

[1] G. Lavee, E. Rivlin, and M. Rudzsky, "Understanding video events: A survey of methods for automatic interpretation of semantic occurrences in video," IEEE Transactions on Systems, Man, and Cybernetics, vol. 39, pp. 489–504, 2009.

[2] M. S. Ryoo and J. K. Aggarwal, "Semantic understanding of continued and recursive human activities," in International Conference on Pattern Recognition, 2006, pp. 379–382.

[3] S. Park and J. K. Aggarwal, "Semantic-level understanding of human actions and interactions using event hierarchy," in Computer Vision and Pattern Recognition, 2004.

[4] A. Barth and U. Franke, "Tracking oncoming and turning vehicles at intersections," in Intelligent Transportation Systems, IEEE Conference on, Madeira Island, Portugal, 2010, pp. 861–868.

[5] T. Vu, J. Burlet, and O. Aycard, "Grid-based localization and local mapping with moving objects detection and tracking," International Journal on Information Fusion, Special Issue on Intelligent Transport Systems, July 2010.

[6] D. G. Lowe, "Object recognition from local scale-invariant features," in Proceedings of the International Conference on Computer Vision - Volume 2, ser. ICCV '99. Washington, DC, USA: IEEE Computer Society, 1999, pp. 1150–. [Online]. Available: http://portal.acm.org/citation.cfm?id=850924.851523

[7] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Underst., vol. 110, pp. 346–359, June 2008. [Online]. Available: http://portal.acm.org/citation.cfm?id=1370312.1370556

[8] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), June 2005, pp. 886–893.

[9] T. Serre, L. Wolf, and T. Poggio, "Object recognition with features inspired by visual cortex," in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) - Volume 2, ser. CVPR '05. Washington, DC, USA: IEEE Computer Society, 2005, pp. 994–1000. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2005.254

[10] F. fei Li and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Computer Vision and Pattern Recognition, 2005, pp. 524–531.

[11] J. I. Woodfill, G. Gordon, and R. Buck, "Tyzx DeepSea high speed stereo vision system," in Proceedings of the IEEE Computer Society Workshop on Real Time 3-D Sensors and Their Use, Conference on Computer Vision and Pattern Recognition, June 2004. [Online]. Available: http://www.tyzx.com/pubs/CVPR3DWorkshop2004.pdf

[12] J.-Y. Bouguet, "Pyramidal implementation of the Lucas-Kanade feature tracker," 2000. [Online]. Available: http://mrl.nyu.edu/~bregler/classes/vision spring06/bouget00.pdf

[13] F. Oniga, R. Danescu, and S. Nedevschi, "Mixed road surface model for driving assistance systems," in Proceedings of the IEEE International Conference on Intelligent Computer Communication and Processing, Cluj-Napoca, Romania, August 2010, pp. 185–190.

[14] C. Pantilie and S. Nedevschi, "Real-time obstacle detection in complex scenarios using dense stereo vision and optical flow," in Proceedings of the IEEE Intelligent Transportation Systems Conference (IEEE-ITSC), Madeira, Portugal, September 2010, pp. 439–444.

[15] S. Nedevschi, S. Bota, and C. Tomiuc, "Stereo-based pedestrian detection for collision-avoidance applications," IEEE Transactions on Intelligent Transportation Systems, vol. 10, no. 3, September 2009.