D4.2: Vehicles and VRUs Database


Transcript of D4.2: Vehicles and VRUs Database

Disclaimer

This report is part of a project that has received funding by the European Union’s Horizon 2020 research and

innovation programme under grant agreement number 723021. The content of this report reflects only the

authors’ view. The Innovation and Networks Executive Agency (INEA) is not responsible for any

use that may be made of the information it contains.

BRAVE

BRidging gaps for the adoption of Automated VEhicles

No 723021

D4.2 Vehicles and VRUs Database

Lead Author: Miguel Ángel Sotelo (UAH)

With contributions from: Ignacio Parra, David F. Llorca, Carlota Salinas, and Javier Alonso (UAH)

Reviewer: Jaime Medina (TREE)

Deliverable nature: Report (R)

Dissemination level (Confidentiality): Public (PU)

Contractual delivery date: 31 May 2018

Actual delivery date: 11 June 2018

Version: 1.2

Total number of pages: 33

Keywords: Dataset, predictions of intentions, vehicles, vulnerable road users


Abstract


This document describes the process followed to create a dataset containing data about vehicles and Vulnerable Road Users (VRUs), aimed at providing a new tool for performing accurate predictions of trajectories. In addition, a clear description of the dataset structure is provided in this deliverable. The dataset is divided into two main parts: Vehicles (cars and trucks) and VRUs (pedestrians and cyclists). Vehicle data have been logged using six different sensors: 1 long-range radar, 2 short-range radars, 2 cameras (1 forward-looking and 1 rear-looking) and a Velodyne HDL-32E lidar. The time synchronization of the six sensors, which proved to be one of the most complex technical issues to address, is described in the document. Similarly, the cross-calibration of the six sensors is described. Regarding VRUs, data from cameras are the sole source of information, given the high density of information contained in the high-resolution cameras used in this project. All the data have been gathered using the DRIVERTIVE vehicle.



Executive summary

BRAVE is looking at developing advanced methods for providing accurate predictions of vehicle and Vulnerable Road User (VRU) trajectories. This task is complex, since it requires a deep understanding of intentions based on the contextual situation, not only on kinematic data. For such purpose, BRAVE proposes to create a dataset containing representative data for vehicles and VRUs, obtained in naturalistic, real driving conditions. The only means to provide accurate predictions of intentions is to learn from real driving behaviour. Thus, a realistic dataset is needed. Such a dataset must contain a diverse set of behavioural situations, both for vehicles (cars and trucks) and for VRUs (pedestrians and cyclists). Regarding vehicles, different situations will be considered, including lane changes, cut-in and cut-out trajectories, overtaking, etc. Needless to say, it is compulsory to log the ego-vehicle's relative position with respect to all lanes and surrounding vehicles in a given range (at least 60-80 metres). For VRUs, activity recognition will be the key variable to identify the following behaviours: pedestrian standing, pedestrian walking, pedestrian sitting, cyclist bending left or right, cyclist pedalling, cyclist swerving, etc. Additional data such as relative velocities will also be needed. Time synchronization and calibration among the different sensors involved in the system is compulsory.

This document describes the process followed to create a dataset containing data about vehicles and VRUs, aimed at providing a new tool for performing accurate predictions of trajectories. In addition, a clear description of the dataset structure is provided in this deliverable. The dataset is divided into two main parts: Vehicles (cars and trucks) and VRUs (pedestrians and cyclists). Vehicle data have been logged using six different sensors: 1 long-range radar, 2 short-range radars, 2 cameras (1 forward-looking and 1 rear-looking) and a Velodyne HDL-32E lidar. The time synchronization of the six sensors, which proved to be one of the most complex technical issues to address, is described in the document. Similarly, the cross-calibration of the six sensors is described. Regarding VRUs, data from cameras are the sole source of information, given the high density of information contained in the high-resolution cameras used in this project. All the data have been gathered using the DRIVERTIVE vehicle owned by the University of Alcalá (UAH).


Document Information

IST Project Number: 723021

Acronym: BRAVE

Full Title: BRidging gaps for the adoption of Automated VEhicles

Project URL: http://www.brave-project.eu/

Deliverable: D4.2 - Vehicles and VRUs database

Work Package: WP4 - Vehicle-environment interaction concepts, enhancing ADAS

Date of Delivery: Contractual M12, Actual M12

Status: version 1.2

Nature: Report

Dissemination level: Public

Authors (Partner): University of Alcalá (UAH)

Responsible Author: Miguel Ángel Sotelo (UAH), E-mail: [email protected], Phone: +34 91 885 6573

Abstract (for dissemination):

This document describes the process followed to create a dataset containing data about vehicles and Vulnerable Road Users (VRUs), aimed at providing a new tool for performing accurate predictions of trajectories. In addition, a clear description of the dataset structure is provided in this deliverable. The dataset is divided into two main parts: Vehicles (cars and trucks) and VRUs (pedestrians and cyclists). Vehicle data have been logged using six different sensors: 1 long-range radar, 2 short-range radars, 2 cameras (1 forward-looking and 1 rear-looking) and a Velodyne HDL-32E lidar. The time synchronization of the six sensors, which proved to be one of the most complex technical issues to address, is described in the document. Similarly, the cross-calibration of the six sensors is described. Regarding VRUs, data from cameras are the sole source of information, given the high density of information contained in the high-resolution cameras used in this project. All the data have been gathered using the DRIVERTIVE vehicle.

Keywords Dataset, predictions of intentions, vehicles, vulnerable road users

Version Log

Issue Date | Rev. No. | Author | Change

14th May, 2018 | 0.1 | Miguel Ángel Sotelo | Initial version

15th May, 2018 | 0.2 | Miguel Ángel Sotelo, Ignacio Parra, David F. Llorca, J. Alonso, C. Salinas | Internal contributions at UAH

28th May, 2018 | 0.3 | Jaime Medina | EU disclaimer included

31st May, 2018 | 1.0 | Jaime Medina | Final revision

11th June, 2018 | 1.2 | Jaime Medina | PO information removed


Table of Contents

Executive summary
Document Information
Table of Contents
Abbreviations
1 Introduction
2 Vehicle Platform and Hardware Description
  2.1 Cameras and Lidar range and resolution
  2.2 Radar
  2.3 Sensors Synchronization
3 Calibration of sensors
  3.1 Lidar-cameras calibration
    3.1.1 Camera Calibration
    3.1.2 Lidar Calibration
    3.1.3 Extrinsic Calibration
    3.1.4 Results
  3.2 Radar calibration
    3.2.1 Calibration Environment Description
    3.2.2 Ego-Position Estimation
    3.2.3 Rotation Estimation
    3.2.4 Translation Estimation
    3.2.5 Results
4 Labelling and tracking
  4.1 Image Tracking
  4.2 Depth Estimation
5 Description of Dataset
  5.1 Vehicles Dataset
    5.1.1 Sensors and Interfaces
    5.1.2 Recorded Data and Formats
    5.1.3 Calibration processes
    5.1.4 Post-processed Data
  5.2 VRU Dataset
    5.2.1 Dataset structure
6 Final remarks


Abbreviations

ACC: Adaptive Cruise Control

ADAS: Advanced Driver Assistance Systems

AEB: Automated Emergency Braking

AI: Artificial Intelligence

CAN: Controller Area Network

CNN: Convolutional Neural Network

DGPS: Differential GPS

DOF: Degrees of Freedom

DRIVERTIVE: DRIVER-less cooperaTIVE vehicle

GPLV: Gaussian Process with Latent Variable

GPDM: Gaussian Process with Dynamical Model

GPS: Global Positioning System

IMU: Inertial Measurement Unit

RMSE: Root Mean Square Error

ROC: Receiver Operating Characteristic

RTK: Real Time Kinematic

TTA: Time-To-Action

TTC: Time-To-Curb

TTE: Time-To-Event

TTS: Time-To-STOP

V2V: Vehicle-to-Vehicle

V2VRU: Vehicle-to-Vulnerable Road User

VEH: Vehicle-related use case

VRU: Vulnerable Road User


1 Introduction

Current advances in autonomous vehicles and active safety systems have demonstrated autonomy and safety

in a wide set of driving scenarios. SAE Levels 3 and 4 have been achieved in some pre-defined areas under

certain restrictions. In order to improve the level of safety and autonomy, self-driving cars need to be endowed

with the capacity of anticipating potential hazards, which involves a deeper understanding of the complex

driving behaviours corresponding to Vulnerable Road Users (VRUs) and other human-driven vehicles, including

inter-vehicle interactions. The need for data from naturalistic driving scenarios can be considered one of the

most important requirements for revealing, modelling and understanding driver behaviours as well as for

accelerating the evaluation of automated vehicles. A considerable effort has been made during the last decade

to collect data from equipped vehicles driven under naturalistic conditions, covering different driving tasks,

such as car following, lane change, lane departure, cut-in manoeuvres, etc., and using different in-vehicle

sensors, such as cameras, 2D laser scanners, radars, CAN-Bus signals, GPS devices, etc. Among the different

variables that can be collected, the vehicle and VRU trajectories are the most relevant features when modelling

complex interactions of vehicles and VRU that share the same road section. Trajectories can be collected from

high-resolution cameras mounted on the infrastructure or on helicopters. However, due to the intrinsic

limitations of the cameras to obtain accurate distance measurements at far distances, these approaches are

somewhat limited to frameworks where accuracy is not critical (e.g., microscopic traffic flow models). On-

road vehicle collection is a more suitable approach to obtain accurate trajectories of adjacent vehicles and

VRU. For that purpose, radar sensors are by far the most common choice for data acquisition when it comes

to vehicles, while cameras are the most appropriate means to collect VRU information. Although radar

sensors are very robust at detecting vehicles and measuring their distance reliably, they suffer from a poor

lateral position accuracy. Although this handicap can be partially mitigated by using a strong temporal filtering,

it involves an increased latency which affects the reported lateral position accuracy. Thus, in most cases they

are used to get the relative distance and speed of vehicles located in the same lane for car following tasks.

When used for lane change studies, where lateral position accuracy is critical, they are combined with other

sensors such as cameras or 2D laser scanners. The use of 2D laser scanners for on-road trajectory vehicle

collection is the most suitable methodology to obtain accurate measurements of both longitudinal and lateral

positions of surrounding vehicles. However, their use involves other limitations. For example, in order to

provide 360 degrees coverage, several units have to be installed on-board the vehicle, which increases the costs

and the system complexity. Thus, some previous works use several side-facing radars and front- and rear facing

2D lidars or a combination of several 2D lidars. Other limitations arise from the limited angular resolution,

which compromises the effective range (40 – 50m), and the strong sensitivity of the horizontal detection plane

to pitch variations, which compromises both the effective range and the detection robustness. In the BRAVE

project we have developed a novel data acquisition methodology for on-road vehicle and VRU trajectory

collection that overcomes the limitations of current approaches in terms of complexity, accuracy and range. A

new sensor setup is proposed by using a 3D lidar (Velodyne HDL-32E) which directly provides 360 degrees

coverage, in combination with two high-resolution cameras (front and rear), and 3 radars (1 long range radar

and 2 short range radars). Once the extrinsic relationship between the different sensors is calibrated, the

combination of all of them enhances the final detection range. On the one hand, low angular resolution of the

3D lidar is compensated by using the cameras, allowing an extended detection range and a denser representation of the vehicles' shape. On the other hand, the depth estimation limitation of the cameras is avoided

by using the highly accurate 3D lidar measurements. A novel semi-automatic labelling tool has been developed

in order to collect 360 degrees trajectories up to 60m with an angular resolution of 0.027 degrees, which clearly

outperforms the current methodologies for on-road vehicle trajectory collection. In this regard, the proposed

setup for creating a dataset is unique in the state of the art.

As previously indicated, multi-modal complementary sensors are used (vision, Lidar, RADAR, IMUs, GPS,

etc.) for increasing the accuracy and robustness of measurements. Usually, the accuracy of these sensors

depends on extrinsic (position and alignment) and intrinsic parameters that are obtained through, ideally,

unsupervised calibration procedures. The quality and long-time stability of these calibrations will establish the

quality of the environmental representation and therefore of the autonomous driving decisions. Consequently,

there are numerous publications that address the problem of semi-supervised calibration of one or several

sensors on-board autonomous vehicles. Vision systems are probably the most common ones given the amount


of information they provide and their low cost. In recent years, laser scanners have also become very

popular, due to the rapid development of their technology and the release of new 360 degrees field of view

scanners. This has led to the development of vision-laser extrinsic semi-supervised calibration methods.

However, there are not that many works devoted to the extrinsic calibration of RADAR sensors. RADAR

sensors are commonly used in the automotive industry for Collision Warning/Avoidance (CW/A), Adaptive

Cruise Control (ACC) or Parking Assistance systems. Most of these applications are used to monitor traffic

surrounding the vehicle, only needing a conservative estimation of other road participants’ position and speed.

Automotive RADARs usually can adjust the elevation plane through semi-supervised calibration procedures

that involve the use of a known reflective target at a predefined position. More advanced systems allow for a

self-calibration of the azimuth angle or alignment using static targets tracked during a calibration sequence.

These self-calibration procedures are intended for maintenance/repair of ADAS such as Adaptive Cruise

Control (ACC). However, for autonomous driving applications which require integrating perception

information with a digital map to create a representation of the environment, the extrinsic calibration of the

sensors to the vehicle frame is a significant problem. An alignment error of just 1 degree can introduce a lateral

offset of about 2 m at 100 m for a RADAR system. This error will make it difficult to associate targets with their correct

lane. Thus, accuracy is an element of the utmost relevance when it comes to creating a dataset of this nature.
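As a back-of-the-envelope check (not taken from the deliverable itself), the lateral offset caused by a 1-degree misalignment at a range of 100 m is approximately

\Delta y \approx d \tan(\alpha) = 100\,\mathrm{m} \times \tan(1^{\circ}) \approx 1.75\,\mathrm{m},

i.e., on the order of the 2 m figure quoted above.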

In order to develop an accurate system for predicting the trajectories (based on intentions) of vehicles and

VRUs, the detailed behaviour of those must be studied and learnt. For this purpose, a rich dataset is mandatory.

Such dataset must contain all the different behaviours under consideration for vehicles and VRUs so that

contextual information can be properly used by an AI-based system. There are numerous datasets containing vehicle and VRU data for recognition purposes, but very few datasets intended for prediction purposes. The few datasets available contain limited amounts of data in a limited range (below 40 m). Thus, it becomes necessary

to create a new dataset exhibiting the following features:

- Centralized data for vehicles and VRUs.

- Robust data based on multi-sensor setup.

- Extended range up to 80 metres.

- Different vehicles and VRUs behaviours are covered. This includes data dealing with vehicle lane

position, lane changes, cut-in and cut-out manoeuvers, overtaking, VRU activity recognition

(pedestrian standing, pedestrian walking, cyclist swerving, etc.).

The dataset will be publicly available to the scientific community. By creating this unique dataset, BRAVE is

making a contribution to the state of the art of automated driving and advanced driving assistance systems.


2 Vehicle Platform and Hardware Description

The mobile platform that has been used for creating this dataset is the DRIVERTIVE vehicle owned by the

University of Alcala (UAH), a Citroën C4 equipped with a rotating lidar scanner, two high-speed colour

cameras and three radars. DRIVERTIVE is depicted in Figure 1. The lidar is a Velodyne HDL-32E located

0.55 metres over the roof, a strategic position where the lidar's layers are most useful. The HDL-32E provides approximately 32 vertical layers and 2011 horizontal scan positions per revolution at 10 Hz. The cameras are Grasshopper3 units with 12.5 mm fixed focal length lenses. They are located over the roof, pointing to the front and the back of the vehicle, in order to cover the maximum drivable area. The cameras work at up to 1920x1200@163Hz. For this experiment the frame rate has been set to 100 Hz. The radars are from Continental.

Further detailed description of those is provided in a following section. Figure 2 shows the sensors setup over

the mobile platform and a schematic top view of the sensors reference systems.

Figure 1. Vehicle prototype used for creating the dataset: DRIVERTIVE.

2.1 Cameras and Lidar range and resolution

One of the goals of the BRAVE project is to provide a tool able to generate a dataset for vehicle and VRU

trajectory prediction in highway and urban scenarios. When it comes to highway driving, long-range detection

sensors are needed. A combination of different types of sensors helps to improve the detection range. In this

case, a lidar, a couple of cameras, and three radars are used together exploiting the best of each sensor. One of

the most relevant characteristics of the camera is the horizontal Angle Of View (AOV), which represents the

area covered by the camera in the horizontal axis. This value for our camera-lens setup is αh = 48.12º. Another

important characteristic of the camera-lens setup is the horizontal Angular Resolution (AR), which basically

determines the lateral positioning resolution. In our camera-lens setup Δαh = 0.027º. The HDL-32E is a rotating

laser-scan turning approximately at 10 Hz. Its horizontal AOV is 360º and 2011 horizontal scans are performed

in each turn. The horizontal AR of the lidar can be computed dividing the horizontal AOV by the number of

horizontal scans. The result is Δαh = 0.18º, 6.5 times larger than in the case of the camera. This means that the

camera is 6.5 times more accurate and dense than the lidar for the lateral labelling task. Given Δαh, the lateral

positioning resolution can be defined as a function of the target distance. Table I shows the lateral positioning

resolution up to 100 m for the camera and the lidar.


Figure 2. Mobile platform and top view of sensors setup.

Table 1. Lateral positioning resolution
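As an illustration of how the values in Table 1 can be derived (a minimal sketch, not necessarily the exact computation used for the table), the lateral positioning resolution at a range d is approximately d tan(Δαh):

import math

def lateral_resolution(distance_m: float, angular_resolution_deg: float) -> float:
    """Approximate lateral positioning resolution (metres) at a given range."""
    return distance_m * math.tan(math.radians(angular_resolution_deg))

# Angular resolutions reported in the text
CAMERA_AR_DEG = 0.027   # Grasshopper3 + 12.5 mm lens
LIDAR_AR_DEG = 0.18     # HDL-32E: 360 deg / 2011 horizontal scan positions

for d in (10, 25, 50, 75, 100):
    cam = lateral_resolution(d, CAMERA_AR_DEG)
    lid = lateral_resolution(d, LIDAR_AR_DEG)
    print(f"{d:>3} m -> camera: {cam * 100:.1f} cm, lidar: {lid * 100:.1f} cm")

At 100 m this gives roughly 4.7 cm for the camera and 31 cm for the lidar, which reflects the ratio between the two angular resolutions discussed above.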

The main problem of using a monocular camera setup is that the position of an object cannot be determined without making certain assumptions; in other words, the pixel depth cannot be measured, unlike in stereo camera systems. A point (u, v) in the image plane reference system SI defines a 3D line in the camera reference system SC according to the camera pin-hole model. This means that multiple solutions are possible, but knowing one of the coordinates of the point (xC, yC, or zC), the others can be fixed. The HDL-32E is a high-precision laser range sensor with ±2 cm accuracy, which makes it one of the best options to solve the indeterminacy of the pixel 3D location. Assuming that the objects to be tracked are cars, and that their front and rear faces are mostly parallel to the cameras (constant zC values), the lidar measurements can be used to set this coordinate and solve the indeterminacy. The HDL-32E detection range is up to 100 m; however, due to the lidar layer distribution, the maximum detection range over the ground plane is reduced to 72 m


without pitch variations. Figure 3 shows the relevant lidar layer distribution regarding this limitation. The first

tilted-down layer has an elevation of -1.33º. This layer intersects the ground plane at approximately 85 m; however, vehicles are not in contact with the road at their front or rear bumper. Assuming a maximum bumper clearance of 0.3 m w.r.t. the ground plane, the maximum detection distance decreases to 72 m.

Figure 3. Lidar detection range without pitch variations.
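The 72 m figure can be reproduced with simple geometry. The sketch below assumes a total lidar height of roughly 2 m above the road surface (the exact mounting height above the ground is not stated here), together with the -1.33° elevation of the first downward layer and the 0.3 m bumper clearance:

import math

LAYER_ELEVATION_DEG = 1.33    # magnitude of the first tilted-down HDL-32E layer
LIDAR_HEIGHT_M = 2.0          # assumed total height of the lidar above the road surface
BUMPER_CLEARANCE_M = 0.3      # maximum clearance of a vehicle bumper w.r.t. the ground

def layer_intersection_range(height_drop_m: float) -> float:
    """Range at which the layer reaches a horizontal plane located height_drop_m below the lidar."""
    return height_drop_m / math.tan(math.radians(LAYER_ELEVATION_DEG))

ground_range = layer_intersection_range(LIDAR_HEIGHT_M)                      # about 86 m
bumper_range = layer_intersection_range(LIDAR_HEIGHT_M - BUMPER_CLEARANCE_M) # about 73 m
print(f"ground intersection: {ground_range:.0f} m, bumper detection: {bumper_range:.0f} m")

Under these assumed heights the layer meets the ground at roughly 85-86 m and meets a 0.3 m bumper at roughly 72-73 m, consistent with the figures given in the text.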

2.2 Radar

The radars, which have been calibrated with the method that will be described in a further section, are a long-

range radar ARS-308 and two short-range radars SRR-208 (Continental) located according to the distribution shown in Figure 4. Each radar model has a different error configuration (see Table 2). The ARS has a range of up to 200 m, which makes this radar highly sensitive to alignment errors; in addition, its angular accuracy is very high in the narrow beam. In contrast, the SRR has a range of up to 50 m, which makes it less sensitive to alignment errors, although its angular accuracy is lower.

2.3 Sensors Synchronization

Multi-sensor acquisition architectures need a time reference framework, especially with high frame rates.

Sometimes the sensors can be synchronized using hardware signals, but on other occasions it is not possible.

The HDL-32E spins freely and there is no control over the lidar scans. However, it is self-synchronized with

the GPS time reference system. Every group of 12 horizontal scans is referenced to the GPS time. On the other

hand, the cameras have an external trigger signal which is managed by a specific synchronization hardware

designed and built by the University of Alcalá. This hardware is synchronized with the GPS time by a Pulse

Per Second (PPS) signal and the recording computer running a Network Time Protocol (NTP) server.
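Once every stream is stamped against the common GPS/NTP time base, associating measurements across sensors reduces to a nearest-timestamp lookup. The following minimal sketch (with illustrative timestamps, not actual recordings) shows such an association between camera frames and lidar revolutions:

import bisect

def nearest_timestamp_index(timestamps: list[float], query: float) -> int:
    """Index of the timestamp closest to `query`; `timestamps` must be sorted."""
    i = bisect.bisect_left(timestamps, query)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    return i if timestamps[i] - query < query - timestamps[i - 1] else i - 1

# Example: match each camera frame to the closest lidar scan (GPS-referenced seconds)
lidar_times = [0.000, 0.100, 0.200, 0.300]      # ~10 Hz HDL-32E revolutions
camera_times = [0.005, 0.015, 0.105, 0.295]     # high-rate camera triggers
pairs = [(t, lidar_times[nearest_timestamp_index(lidar_times, t)]) for t in camera_times]
print(pairs)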


Table 2. Radar measuring performance

Figure 4. Configuration of radar sensors.


3 Calibration of sensors

The sensor calibration process has been divided into two stages. On the one hand, the two cameras and the lidar

are calibrated together, in order to provide an all-around detection system using the same frame of coordinates.

In a second step, the three radars are calibrated together, thus providing detection in the frontal and lateral

areas of the ego-vehicle. The lidar-radar calibration is taken directly from the geometrical configuration of the

car. This is a reasonable assumption given that the radars provide 2D information, thus errors committed by

this assumption are negligible. In the following sections the two separate calibration stages are described.

3.1 Lidar-cameras calibration

The lidar-cameras calibration process estimates a homogeneous transformation matrix CTL that allows

transforming points from the lidar reference system SL to SC, and vice versa. In order to avoid ambiguities regarding

the orientation of the calibration pattern planes, appropriate notation has been defined. Sub-index S represents

the sensor reference system, C or L for the camera and lidar respectively. The camera-lidar calibration process

has three steps. The first one calibrates the camera and generates πC, the second one finds πL and finally, the

last step finds the best extrinsic calibration matrix CTL in a Singular Value Decomposition (SVD) fashion and

a posterior non-linear optimization algorithm.

3.1.1 Camera Calibration

The camera calibration process consists of the estimation of the intrinsic matrix K and the lens distortion

coefficients. For this purpose, the MATLAB Computer Vision System Toolbox has been used. Knowing the parameters that model the sensor, the images are undistorted and the calibration pattern plane equation πC is directly

computed for each calibration image.
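For illustration, an equivalent intrinsic calibration can be sketched with OpenCV (the calibration actually used the MATLAB toolbox; the chessboard size, square size and image path below are hypothetical):

import glob
import cv2
import numpy as np

pattern_size = (9, 6)                     # inner corners of a hypothetical chessboard
square_size = 0.10                        # assumed pattern square size, metres
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

object_points, image_points = [], []
for fname in glob.glob("calibration_images/*.png"):   # hypothetical image folder
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        object_points.append(objp)
        image_points.append(corners)

# K is the intrinsic matrix, dist the lens distortion coefficients
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, gray.shape[::-1], None, None)
undistorted = cv2.undistort(cv2.imread(fname), K, dist)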

3.1.2 Lidar Calibration

The lidar calibration process, unlike the camera calibration process, needs manual inputs from the user. As an

advantage, there are no restrictions for this method to work, such as background constraints. It is necessary

to manually mark the corners of the calibration pattern for each image-cloud pair to generate πL. This process

consists of three stages. In the first one, the corners of the calibration pattern pm are manually selected and

used to compute the centroid of the calibration pattern cm and a preliminary plane πm by means of Least

Squares (LS). In the second step, a basic geometric segmentation is performed over the point cloud in order to

isolate the calibration pattern set of points P. Finally, a closed-form robust method has been used to fit the

plane πL using the segmented set of points P. Firstly, a tentative plane is computed in an LS fashion using P.

In every iteration, the set of points P is sorted based on the distance to the plane πL, then, the η percentage of

points with the longest distances are removed, and finally, a new plane is estimated with the remaining points

(see Algorithm 1). Figure 5 shows the input set of points P in blue, the final set of points used to fit the plane

in green and the fitted plane in black.

Algorithm 1. Plane estimation.
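A minimal sketch of the robust plane fit outlined in Algorithm 1 is shown below; it assumes the segmented point set P is an N x 3 array, uses an SVD-based least-squares fit, and treats η as the fraction of farthest points discarded per iteration (the synthetic cloud is illustrative):

import numpy as np

def fit_plane_lsq(points: np.ndarray):
    """Least-squares plane through `points` (N x 3): unit normal n and offset d with n.p + d = 0."""
    centroid = points.mean(axis=0)
    # the right singular vector of the smallest singular value is the plane normal
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]
    return normal, -normal.dot(centroid)

def robust_plane(points: np.ndarray, eta: float = 0.05, iterations: int = 10):
    """Iteratively refit the plane, dropping the eta fraction of farthest points each time."""
    pts = points.copy()
    for _ in range(iterations):
        normal, d = fit_plane_lsq(pts)
        dist = np.abs(pts @ normal + d)
        keep = dist.argsort()[: int(len(pts) * (1.0 - eta))]
        pts = pts[keep]
    return fit_plane_lsq(pts)

# Example with a noisy synthetic plane z = 0 plus a few outliers
rng = np.random.default_rng(0)
cloud = np.c_[rng.uniform(-1, 1, (200, 2)), rng.normal(0, 0.01, 200)]
cloud[:5, 2] += 0.5                       # outliers
n, d = robust_plane(cloud)
print(n, d)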


Figure 5. Calibration pattern plane estimation.

3.1.3 Extrinsic Calibration

The extrinsic calibration process finds a homogeneous transformation matrix CTL which transforms points pL

from SL to points pC in SC. The conventional extrinsic calibration process uses a set of point pairs as input to

estimate the homogeneous transformation matrix. However, the corners of the calibration pattern cannot be

detected by the lidar and consequently, the point pairs cannot be established. An alternative calibration process

is performed aligning the calibration plane pairs. The n plane equations are used to compute matrix N and

vector d which represent the normal plane vectors (a; b; c), and the distances to the origin of the reference

system d. The aligning process firstly estimates the translation vector t. The only point which is not affected

by the rotation is 0L = (0; 0; 0), applying the translation vector to 0L we have pC = t. Replacing pC by t and pL

by 0L, a new equation is formed and solved by Least Squares, thus yielding the translation vector t. The second

step computes the rotation matrix R in an SVD fashion. Basically, the rotation matrix must align the normal

vectors of the plane pairs. Finally, a non-linear optimization algorithm finds the minimum of an unconstrained

multivariable function using a derivative-free method. R is transformed into an unconstrained vector using a

quaternion transformation. The translation vector t is concatenated to the quaternion. Figure 6 shows the

projection of the colour image over the lidar scans.

Figure 6. Image colour reprojection over the lidar scan.
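A compact sketch of the plane-based alignment described above is given below. It assumes n calibration poses with the unit plane normals stacked row-wise in both frames and the plane convention n.p + d = 0; the sign convention and the synthetic check are illustrative, not taken from the deliverable:

import numpy as np

def align_rotation(normals_lidar: np.ndarray, normals_cam: np.ndarray) -> np.ndarray:
    """Rotation R (camera <- lidar) that best aligns the plane normals (Kabsch-style SVD)."""
    h = normals_lidar.T @ normals_cam
    u, _, vt = np.linalg.svd(h)
    r = vt.T @ u.T
    if np.linalg.det(r) < 0:              # enforce a proper rotation
        vt[-1] *= -1
        r = vt.T @ u.T
    return r

def align_translation(normals_cam: np.ndarray, d_lidar: np.ndarray, d_cam: np.ndarray) -> np.ndarray:
    """Least-squares translation t from N_C t = d_L - d_C (convention n.p + d = 0)."""
    t, *_ = np.linalg.lstsq(normals_cam, d_lidar - d_cam, rcond=None)
    return t

# Tiny synthetic check: 4 planes, a known transform, perfect observations
rng = np.random.default_rng(1)
R_true = np.linalg.qr(rng.normal(size=(3, 3)))[0]
if np.linalg.det(R_true) < 0:
    R_true[:, 0] *= -1
t_true = np.array([0.2, -0.1, 0.5])
n_L = rng.normal(size=(4, 3)); n_L /= np.linalg.norm(n_L, axis=1, keepdims=True)
d_L = rng.normal(size=4)
n_C = n_L @ R_true.T
d_C = d_L - n_C @ t_true
print(np.allclose(align_rotation(n_L, n_C), R_true),
      np.allclose(align_translation(n_C, d_L, d_C), t_true))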


3.1.4 Results

In this section, the final results are shown and commented. Figure 7 shows the tracking errors on the image

plane. In the left sub-trajectory, there is a tracking reset event marked with a red circle due to the tracking drift.

In the right sub-trajectory, the error rises since the point marked with a red square due to a partial occlusion

which displaces the tracker. The tracking Root Mean Square Error (RMSE) and the Mean Absolute Error

(MAE) for this trajectory are shown in table III.

Figure 7. Tracking errors on the image plane.

Fig. 8 shows the trajectory reconstruction. The green line represents the best possible result which will be

considered as ground truth for error metric computation. It is achieved using manual annotations in both the

image and the point cloud. The red line represents manual annotations using only the point cloud. The blue

line is the semi-automatic trajectory reconstruction. As can be seen in Figure 8 and Table 3, the presented methodology outperforms the results generated using only the lidar. This is proof of the sensor fusion

capability to increase the detection accuracy due to the higher angular resolution in the camera sensor w.r.t the

lidar. The trajectory generated by the semi-automatic method is quite similar to the manually annotated one.

The error metrics are shown in table III. There are small errors at the beginning of the left sub-trajectory

because of the tracking error mentioned previously. The lateral error is reduced to zero when the tracking is

reset. This instant is represented by a black circle in figure 8. On the other hand, there is a small difference at

the end of the right sub-trajectory. The trajectory starts to differ in the point represented by the black square

due to the visual occlusion of the tracked key point.

Table 3. Tracking and trajectory reconstruction error metrics


Figure 8. Trajectory reconstruction.

3.2 Radar calibration

The calibration procedure tries to estimate the transformation matrix VTR that transforms points from the radar

reference system SR to the vehicle reference system SV. In the case of more than one radar, the transformation

matrix allows points from the different radars to be transformed into a common frame, enabling the generation of

continuous trajectories. Any transformation matrix T can be split into two basic spatial operations, a rotation

R2x2 and a translation t2x1. The calibration procedure splits the estimation of the transformation matrix in two

steps, the first one estimates the rotation matrix and the second one the translation vector.

3.2.1 Calibration Environment Description

The calibration method exploits the detection of highly radar-sensitive structural elements which are static and whose location is fixed in a global reference system. The calibration environment consists of a two-way road with two lanes per direction and a roundabout at each end. There is a sidewalk with street lights and traffic signs

on both sides of the road. The street lights and the traffic signal positions have been used as ground truth in

the calibration procedure. Their positions have been measured with a Differential Global Navigation Satellite

System (DGNSS) with an error lower than 2 cm. The higher the quality of the elements in the environment the

better the calibration is. The recorded positions surround the vertical pole axis, so the pole axis

position is estimated as the centroid of the measures. The latitude-longitude coordinates are transformed into

an easting-northing reference system according to the transverse Mercator projection in order to get a flat and

orthonormal representation reference system. Figure 9 shows the position of some of the street lights and the

traffic signals in a high definition digital map representation of the environment.


Figure 9. Calibration targets in a digital map. Red spheres represent the GPS coordinates of the

calibration elements.

3.2.2 Ego-Position Estimation

An Extended Kalman Filter (EKF) has been used to fuse the information from the DGNSS, an MPU6050

Inertial Measurement Unit (IMU) and the Controller Area Network (CAN) bus of the vehicle. The state vector

x is estimated using the measure vector z, the non-linear process model f and the observation model h. In order

to transform the position of the calibration elements from the world reference system SW to the vehicle (local)

reference system SV a transformation T is performed according to the vehicle state vector.
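The exact process and observation models f and h are not detailed in this deliverable; the following generic EKF predict/update step is a minimal sketch (assuming additive Gaussian noise) of the filter structure into which the DGNSS, IMU and CAN measurements are fed:

import numpy as np

def ekf_step(x, P, u, z, f, h, F_jac, H_jac, Q, R):
    """One EKF iteration: predict with process model f, correct with observation model h.
    F_jac and H_jac return the Jacobians of f and h at the current state."""
    # Prediction
    x_pred = f(x, u)
    F = F_jac(x, u)
    P_pred = F @ P @ F.T + Q
    # Update
    H = H_jac(x_pred)
    y = z - h(x_pred)                       # innovation
    S = H @ P_pred @ H.T + R                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)     # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new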

3.2.3 Rotation Estimation

The rotation calibration process estimates the radar rotation angle θR that aligns the vehicle reference system

SV and the radar reference system SR. Matrix R is a two-dimensional rotation transformation. Since the correspondence between the radar detections and the targets is not known, the traditional calibration approaches based on pairs of points cannot be applied. An alternative methodology based on the relative movement properties

of static objects has been developed. The trajectory described by a static object is seen from the sensor

reference system as the opposite of its own trajectory. If the mobile reference system (vehicle) performs a

trajectory in a straight line, the trajectory of the static object is seen as a straight line in the opposite direction

in the sensor reference system. The orientation of the trajectory in the sensor reference system reveals the

rotation needed to virtually align both reference systems. The trajectories generated by the objects are a

sequence of points in the radar reference system SR. A single trajectory P is formed by a set of n points.

Assuming that the trajectory represents a straight line the orientation can be easily computed as the arctan

function of the extreme points. However, due to the limited radar detection accuracy, this is not the best approach: the endpoints of the trajectory commonly lie at the border of the detection area, which is where the detection error is higher. A set of tuples of points TP associated with the trajectory P is used to generate all the

possible combinations of points in the trajectory P in a forward time sense. The trajectory direction is computed

for each tuple of points in TP. The error associated to each computed direction depends on the individual point

errors. At this point, many rotations and their associated errors have been computed for each trajectory P. Now,

the trajectories described by many detections must be commonly evaluated in order to achieve the most reliable

value of θ. Four different functions are proposed to evaluate the set of rotations and errors in a scoring fashion

as they are graphically described in figure 10. The first score function s1 is a normal distribution. The second


score function s2 is a truncated version of s1. This function assigns the same score s2 to the estimated orientation

range in the uncertainty band. The third score function s3 implements a linear version of the normal distribution

function. The function is centred in θ and the score decreases with a constant value K. In order to generate a

cumulative score of 1, the maximum score A and the slope K are computed accordingly. The fourth score

function s4 implements a truncated version of s3. This function assigns the same score s4 to the estimated

orientation in the uncertainty band.

Figure 10. Rotation scoring functions. s1 and s2 on top and s3 and s4 on bottom.

The global score is computed as the sum of each individual score distribution. The total score distribution is

normalized in order to be treated as a probability density function and to provide a confidence interval of the

estimations. Finally, the radar rotation angle θR is minus the θ value with the highest cumulative score due to

the opposite direction of the relative object’s movement.
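A minimal sketch of the rotation estimation from static-object trajectories is given below; only the Gaussian scoring function (s1) is implemented, the per-tuple error weighting is omitted, and the synthetic example (a pole seen from a radar mounted with a 10-degree yaw) is illustrative:

import itertools
import math
import numpy as np

def line_orientations(trajectory):
    """Line orientation (modulo pi, wrapped to (-pi/2, pi/2]) of every forward-in-time
    pair of points of a static-object trajectory observed in the radar frame."""
    pts = np.asarray(trajectory, dtype=float)
    for i, j in itertools.combinations(range(len(pts)), 2):
        dx, dy = pts[j] - pts[i]
        theta = math.atan2(dy, dx)
        if theta > math.pi / 2:
            theta -= math.pi
        elif theta <= -math.pi / 2:
            theta += math.pi
        yield theta

def estimate_radar_yaw(trajectories, sigma=math.radians(2.0)):
    """Accumulate a Gaussian score (s1-like) over candidate orientations and return
    minus the best angle, i.e. the radar mounting yaw theta_R."""
    grid = np.linspace(-math.pi / 2, math.pi / 2, 1801)
    score = np.zeros_like(grid)
    for traj in trajectories:
        for theta in line_orientations(traj):
            score += np.exp(-0.5 * ((grid - theta) / sigma) ** 2)
    return -grid[int(np.argmax(score))]

# Synthetic check: a static pole observed while the ego-vehicle drives straight ahead,
# seen from a radar mounted with a +10 degree yaw w.r.t. the vehicle frame
yaw = math.radians(10.0)
ego_x = 0.5 * np.arange(20)                               # metres travelled per sample
pole_vehicle = np.column_stack([30.0 - ego_x, np.full(20, 2.0)])
c, s = math.cos(yaw), math.sin(yaw)
to_radar = np.array([[c, s], [-s, c]])                    # R(theta_R) transposed
pole_radar = pole_vehicle @ to_radar.T
print(math.degrees(estimate_radar_yaw([pole_radar])))     # approx. 10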

3.2.4 Translation Estimation

Once the rotation between the vehicle and the radar reference systems is known and they are virtually aligned,

the position difference between targets and detections is the translation which is being looked for. The

translation estimation has three stages, firstly the world points are transformed to the vehicle reference system,

secondly the radar detections are rotated to be aligned with the vehicles reference system, and finally, the

vehicle-radar translation is achieved by scoring the individual detection-target translations. In a first step, the

position of the calibration targets is transformed from the world reference system to the vehicle reference

system. In order to avoid time delays between the radar detection and the ego-estimation systems, static

sequences are used to calibrate the translation. In the second step, the radar detections and their errors are

rotated to be aligned with the vehicle reference system. A two-dimensional translation vector t is needed to

transform calibration targets to detections in the common vehicle reference system and vice-versa. The

parameters that need to be found are tx and ty which are the translation along the x and the y axis of the vehicle

respectively. The translation vector t has been limited by using basic information of the vehicle dimensions.

The translation vector limit is denoted as tL. The error associated to each translation is defined as the error

vector ti which results from rotating the detection error. Four bi-dimensional scoring functions, similar to those used to estimate the radar rotation angle, have been designed to find the translation. The

shape of the scoring functions is shown in figure 11. The first score function s5 is a normal distribution. The

second score function s6 is a truncated version of s5. The same score s6 is assigned to the translations inside the

uncertainty band. The third score function s7 implements a linear version of s5. The score decreases according

to K around t. The matrix K and the maximum value A are computed accordingly in order to achieve a

cumulative score of 1. The function max is the vector maximum. The fourth score function s8 implements a


truncated version of s7. This function assigns the same score s8 to the estimated translation in the uncertainty

band like the score function s6.

Figure 11. Translation scoring functions. On top s5 and s6, on bottom s7 and s8. Inner area of red

ellipse has a constant score of s6 in s6, inner area of the red rectangle has a constant score of s8 in s8.

Finally, the global translation score is computed as the sum of each single translation score. The estimated

radar translation vector t is obtained by finding the translation vector with the highest score S inside the translation

limits tL.
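Analogously, a minimal sketch of the translation search using a Gaussian score (in the spirit of s5) is shown below; it assumes the detections have already been rotated into the vehicle frame and scores every detection-to-target offset on a bounded grid, and the numbers in the synthetic check are illustrative:

import numpy as np

def estimate_translation(targets, detections, sigma=0.3, limit=2.5, step=0.02):
    """Score every detection-to-target offset with a 2-D Gaussian on a grid of candidate
    translations bounded by +/- limit metres, and return the highest-scoring one.
    targets and detections are M x 2 and N x 2 arrays in the (aligned) vehicle frame."""
    axis = np.arange(-limit, limit + step, step)
    tx, ty = np.meshgrid(axis, axis, indexing="ij")
    score = np.zeros_like(tx)
    for det in detections:
        for tgt in targets:
            offset = tgt - det                              # candidate translation
            score += np.exp(-0.5 * (((tx - offset[0]) ** 2 + (ty - offset[1]) ** 2) / sigma ** 2))
    best = np.unravel_index(np.argmax(score), score.shape)
    return np.array([tx[best], ty[best]])

# Synthetic check: radar mounted 1.9 m ahead and 0.5 m to the left of the vehicle origin
t_true = np.array([1.9, 0.5])
targets = np.array([[12.0, 3.5], [25.0, -4.0], [40.0, 3.2]])
detections = targets - t_true                               # perfect, already-rotated detections
print(estimate_translation(targets, detections))            # approx. [1.9, 0.5]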

3.2.5 Results

For the rotation estimation, two different sequences have been recorded, one travelling at 20 km/h and the other at

50 km/h. Table IV shows the angle estimation and the value that covers the 68.27% of the score for the three

radars (units expressed in radians). The recorded sequences contain 447 and 301 object trajectories for the ARS-308, 47 and 53 for the SRR208-L, and 56 and 65 for the SRR208-R, at 20 and 50 km/h respectively.

Table 4. Radar rotation results at 20 and 50 km/h

Analysing the rotation estimations, in general the non-truncated scoring functions (s1 and s3) generate a narrower confidence band, the narrowest being achieved with the scoring function s3. Analysing the effect of the speed, higher speeds produce trajectories with fewer points. This produces different effects depending on the radar. The effect on the ARS is beneficial because the points are farther apart but still inside the high-


accuracy detection area, so the angle accuracy rises and produces a narrower confidence band. The effect on the SRR is not beneficial: the points are farther apart but in low-accuracy detection areas, generating wider bands.

For the estimation of the translation, 22 static poses have been recorded. There is a total of 74 valid calibration

targets for the ARS, 35 for the SRR-L and 33 for the SRR-R. For our particular case, the translation limits tL have been set according to the vehicle dimensions W = 1.79 m, L = 4.33 m and Δx = 2Δy = 1.0 m. The Mean

Range Error (MRE) and the Mean Angular Error (MAE) are used as metrics to show the performance of the

overall calibration. The translation estimations have a strong dependence on the estimated rotation; therefore, pairs of rotation and translation scoring functions are evaluated independently, and the results are shown in Table 5, where the MRE is expressed as a percentage or in metres and the MAE is in degrees.

Table 5. Mean range error & Mean Angular error

The ARS achieves a MAE below 0.5°, which is greater than the accuracy of the narrow beam but lower than the accuracy of the wide beam. The SRR achieves a MAE close to 1°, which is lower than the accuracy of any beam. The truncated functions produce worse results for this kind of radar, which could be caused by the low angular accuracy of the radar. The alignment in conjunction with the translation yields MRE values which

are below the range uncertainty for both kinds of radars. Figure 12 shows an example of the radar detections

reconstruction in the vehicle frame.


Figure 12. Radar detections reconstruction. Black circles are calibration targets, blue x are ASR308

detections, red x are SRR-L detections and green x are SRR-R detections.


4 Labelling and tracking

It becomes absolutely necessary to develop a specific labelling tool to create the proposed dataset. For the case

of VRUs, as will be justified later on, a number of pre-existing labelled datasets will be used as the starting

point. Thus, the labelling tool will be used mostly for vehicles. The labelling tool must be able to assist human

operators when locating the element of interest in the image provided by the camera or in the point cloud

provided by the Lidar. Similarly, the tool must assist the operator by performing automatic tracking of labelled

data. For this purpose, UAH has developed its own labelling and tracking tool using deep learning techniques

that assist both the labelling and tracking tasks. A point in the image plane defines a 3D line in the camera

reference system. Labelling a target point in the image sets the target direction. Assuming that the vehicles are

on the ground and the ground is mostly flat on highway scenarios this direction can be interpreted as the angular

coordinate in a polar reference system. Finally, the depth, or radial coordinate, is estimated using the lidar

detections. The point used as the reference for the vehicle location is any horizontally centred point on

the front or rear bumper. The best key points for a correct vehicle location are the logo, the license plate or any

other feature symmetrically placed on the vehicle.

4.1 Image Tracking

A state of the art Median-Flow tracker algorithm has been used in order to semi-automate the labelling process.

Firstly, the tracker is set up with a Region Of Interest (ROI) containing the desired key points and consecutively

updated. This labelling tool implements a multi-object-multi-camera tracking system with key points reset

option. The tracked key points can be changed when the tracker loses them or when the user considers that the tracker accuracy is not sufficient. Figure 13 shows the tracking of a front and a rear vehicle key point over 3

seconds which represents 300 images and 30 image-cloud pairs. The green box represents the bounding box

of the tracked key point and the red mark is the manual annotated key point.
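A minimal sketch of how such a Median-Flow tracker can be driven frame by frame with OpenCV is given below; this is not the UAH labelling tool itself, the video path is hypothetical, and the constructor lives under cv2.legacy in recent OpenCV (contrib) versions:

import cv2

# Median-Flow ships with opencv-contrib; in OpenCV >= 4.5 it is exposed under cv2.legacy,
# in older versions directly as cv2.TrackerMedianFlow_create().
create_tracker = getattr(cv2, "legacy", cv2).TrackerMedianFlow_create

cap = cv2.VideoCapture("front_camera_sequence.avi")    # hypothetical recording
ok, frame = cap.read()
roi = cv2.selectROI("init", frame)                     # operator marks the key-point ROI
tracker = create_tracker()
tracker.init(frame, roi)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    found, box = tracker.update(frame)
    if not found:
        # tracking lost: in the labelling tool the operator would reset the key point here
        break
    x, y, w, h = map(int, box)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("tracking", frame)
    if cv2.waitKey(1) == 27:                           # Esc to stop
        break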

4.2 Depth Estimation

Once the vehicles or a part of them are labelled it is necessary to assign a depth value or z coordinate to this

region. A semantic segmentation FCN-8s Convolutional Neural Network (CNN) has been used in order to

segment the area of the vehicles in the image. For each annotation, the corresponding vehicle area is searched for in the segmented image, setting the shape of the annotated vehicle in the image. Figure 14 shows an example of

the vehicle semantic segmentation. The lidar points are re-projected over the image and the longitudinal

distance to the vehicle is estimated using the closest lidar detection over the vehicle shape in the image.
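A minimal sketch of this depth-assignment step is given below, assuming the lidar points have already been transformed into the camera frame with CTL, K is the intrinsic matrix, and mask is the binary segmentation of the annotated vehicle:

import numpy as np

def vehicle_depth(points_cam: np.ndarray, K: np.ndarray, mask: np.ndarray) -> float:
    """Longitudinal distance to a segmented vehicle: project the lidar points that are in
    front of the camera, keep those falling inside the segmentation mask, and return the
    closest depth (z coordinate in the camera frame)."""
    in_front = points_cam[points_cam[:, 2] > 0.0]
    uvw = in_front @ K.T                                # pin-hole projection
    u = (uvw[:, 0] / uvw[:, 2]).astype(int)
    v = (uvw[:, 1] / uvw[:, 2]).astype(int)
    h, w = mask.shape
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    on_vehicle = valid.copy()
    on_vehicle[valid] = mask[v[valid], u[valid]] > 0
    if not on_vehicle.any():
        return float("nan")
    return float(in_front[on_vehicle, 2].min())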


Figure 13. Vehicle key point tracking. From top to bottom, tracker initialization, and key points after

1, 2 and 3 seconds of tracking using 100 FPS. Left column was tracked in reverse recording order and

right column in the forward. 1, 2, 4 and 8 zoom factor have been applied in each row.

Figure 14. Vehicle semantic segmentation. From top to bottom, original image, vehicle class heat map

and segmented vehicles. Warm colours represent higher probabilities.


5 Description of Dataset

The dataset created by UAH in the framework of the BRAVE project is divided into two main parts: Vehicles and VRUs. The vehicles section, which includes cars and trucks, has been created entirely from scratch. However, the VRU dataset has not been created from scratch. Although the original idea was to create a totally brand-new dataset containing VRU information for prediction purposes, the evolution of the state of the art has made it advisable to take a slightly different approach in order to leverage the progress made by the

scientific community and, at the same time, contribute to the state of the art by harmonizing the formats and

content of the most valuable already existing datasets. In this regard, our efforts have been devoted to creating

a kind of meta-dataset that provides a link to other datasets. In addition, a deep study of those datasets has been

carried out in order to harmonize their formats and metrics, so that further learning can be done from a solid,

coherent ground. On top of that, some additional features have been added for prediction purpose, such as

those dealing with activity recognition. In the following section a detailed description of the dataset is provided.

5.1 Vehicles Dataset

As previously described, the Vehicles dataset has been created on the grounds of data provided by six sensors

on-board the DRIVERTIVE vehicle: 1 long-range radar, 2 short-range radars, 1 Velodyne-32, 2 cameras (1

forward-looking, 1 rear-looking). For such purpose, the six sensors have been properly cross-calibrated and

time synchronized as described in the previous sections. The DRIVERTIVE vehicle has been used to gather

data in a naturalistic fashion driving along highways in the provinces of Madrid and Guadalajara in Spain from

January to May 2018. Twelve hours of driving have been recorded and exhaustively labelled. The technical

details of the Vehicles Dataset are provided in the next sections.

5.1.1 Sensors and Interfaces

The complete set of sensors used to log data together with their corresponding interfaces and main features are

provided next:

● High-definition (1.5Mb) and high-speed cameras (one forward-looking, one rear-looking).

● Rotating laser scanner, Velodyne HDL-32E (1.4 M points/s). The Velodyne provides 10 full scans per second.

● D-GPS (high accuracy Differential GPS positioning based on RTK correction).

● Vehicle CAN bus reader. This interface provides access to proprioceptive vehicle variables, such as

vehicle speed, acceleration, yaw rate, etc.

● IMU 10 DoF (acceleration, angular speed, magnetometer, and altitude).

● 3 radars (1 ARS308 long-range and 2 SRR208 short-range).

5.1.2 Recorded Data and Formats

The following data streams are recorded while driving on highways in order to gather relevant information for prediction purposes:

● Front and Rear image. RGB 1920x800@11Hz x 2.

● 360º laser scan. XYZ 120K points @11Hz.

● Long-range radar detection. ARS308 40 targets @15Hz.

● Left and Right short-range radar detections. SRR208 25 targets @33Hz x 2.

● Raw GPS position. @20Hz.

● RAW CAN bus data. @100Hz.

● RAW IMU data.
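Since the streams above are logged at different rates, retrieving a time-consistent snapshot of the scene requires nearest-timestamp matching between streams. The following MATLAB sketch illustrates one possible way of doing this; the variable names, the rounded rates, and the in-memory layout are illustrative assumptions rather than the actual logging format.

% Minimal sketch (illustrative assumptions): align multi-rate streams to the
% camera frames by nearest-timestamp matching. Timestamps are assumed to be in
% seconds on a common clock (e.g. disciplined by the GPS PPS signal).
camTs   = 0 : 1/11  : 10;   % front camera timestamps, ~11 Hz (hypothetical)
radarTs = 0 : 1/15  : 10;   % ARS308 long-range radar, ~15 Hz
canTs   = 0 : 1/100 : 10;   % CAN bus, ~100 Hz

% For every camera frame, find the index of the closest radar and CAN sample.
radarIdx = interp1(radarTs, 1:numel(radarTs), camTs, 'nearest', 'extrap');
canIdx   = interp1(canTs,   1:numel(canTs),   camTs, 'nearest', 'extrap');

% A camera-centric record could then bundle the matched samples, e.g.:
frame.timestamp = camTs(1);
frame.radar     = radarIdx(1);  % index into the radar target list at that time
frame.can       = canIdx(1);    % index into the CAN log at that time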


5.1.3 Calibration processes

As a wrap-up of the information provided in previous sections, the following steps are taken in the calibration

of the sensor set-up:

● Lidar-camera (front and rear): the camera increases the angular accuracy of the lidar by a factor of 6.5. The calibration is based on plane estimations.

● Radar-GPS (left, front and right): it virtually extends the detection range up to 100 m. The radars are linked to the Velodyne by means of a geometric vector obtained from empirical measurements. Calibration is based on a digital reconstruction of the environment containing static objects with a strong radar response.

● Velodyne-camera time synchronization: it is based on GPS Pulse Per Second timing (wireless sync) or on communications (wired sync). All the triggering is managed by a microcontroller (ATmega 2560). One image is generated when the lidar sweep reaches the viewing direction of the camera in order to obtain the most consistent 3D representation of the scene; consequently, the two cameras are not triggered at the same time. A sketch of this trigger timing is given after this list.

● Vehicle CAN bus and IMU time synchronization: based on the recording computer, which is synchronized via the Network Time Protocol, itself based on the GPS Pulse Per Second signal.

● Radar time synchronization has not been carried out yet, under the assumption that the delay among the different data streams is negligible. However, this synchronization is expected to be implemented in the coming months, probably by writing PPS messages on the radar CAN bus.
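To make the camera-lidar triggering described above more concrete (the camera fires when the rotating lidar sweeps past the camera's viewing direction), the MATLAB sketch below computes the trigger delay after a PPS edge from the lidar spin rate and the camera azimuth. The numerical values and variable names are assumptions for illustration; they are not the actual parameters of the ATmega 2560 firmware.

% Minimal sketch (assumed values): delay after the PPS edge at which the camera
% should be triggered so that its exposure coincides with the lidar pointing
% along the camera's optical axis.
lidarRateHz   = 10;   % Velodyne HDL-32E spin rate (10 rev/s)
camAzimuthDeg = 0;    % forward-looking camera azimuth (assumed 0 deg)
azimuthAtPps  = 90;   % lidar azimuth at the PPS edge (assumed)

% Angle still to be swept until the lidar reaches the camera azimuth.
deltaDeg = mod(camAzimuthDeg - azimuthAtPps, 360);

% Convert the remaining angle into a trigger delay in milliseconds.
triggerDelayMs = 1000 * (deltaDeg / 360) / lidarRateHz;
fprintf('Trigger the camera %.1f ms after the PPS edge\n', triggerDelayMs);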

5.1.4 Post-processed Data

One of the most relevant features of the developed dataset is the post-processed data that is made available. The first step in the post-processing of the logged data is carried out using the semi-automatic labelling tool developed by UAH. The labelling tool assists the human operator in identifying and tracking the different vehicles that surround the ego-vehicle. The workflow is described next:

● Image vehicle segmentation is performed using a VGG-16 segmentation CNN. The output of this system provides a highly accurate segmentation of the vehicles in the input image.

● The result of the labelling process is a log file for each camera with a representative rectangle (not the bounding box, but a symmetric rectangle) for each vehicle, identified by a single identification number, along the sequence of frames. Figure 15 shows an example of vehicle labelling. The rectangle that describes the position of the vehicle is matched with the vehicles detected by the CNN. The contour of the vehicle is used to find the set of lidar impacts on the vehicle shape. The closest or farthest points are used to establish the distance to the vehicle (front or rear camera), and the centre of the labelled area is used to set the direction, as in polar coordinates (a sketch of this estimation is given after this list). With the radar detections, the detection range can be extended up to 100 metres, although an accurate camera-radar calibration is needed for that purpose.

● Automatic vehicle tracking is carried out based on a median-flow algorithm. The first click selects the region to be tracked (multiple targets can be tracked simultaneously). Zooming is possible for accurate labelling.

● Supervised forward or backward tracking with repositioning option is also available.

● Tracking of each target can be terminated individually.

● The result of the vehicle tracking is a set of tracklets. Each tracklet contains the entire sequence of vehicle images and positions that have been automatically computed during the tracking stage. Vehicle tracklets can be retrieved from the dataset so that a vehicle can be tracked from the time of its first appearance in the image until the time it disappears entirely from the image (either because the vehicle leaves the image field of view or because it becomes occluded by other elements in the scene).
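The MATLAB sketch below illustrates the range-and-bearing estimation mentioned in the list above: lidar points falling inside the labelled vehicle region are selected, the closest one gives the range, and the direction is derived in polar-coordinate fashion. The dummy data, the assumption that the lidar points are already projected into the image, and all variable names are simplifications; the actual tool relies on the full camera-lidar cross-calibration described in earlier sections.

% Minimal sketch (simplified assumptions): estimate range and bearing of a
% labelled vehicle from lidar points already projected into the image plane.
pts3d = [20 1 0.5; 21 1.2 0.6; 20.5 0.9 0.4; 60 -5 1];  % Nx3 lidar points (m),
                                                        % vehicle frame: x fwd, y left, z up
pts2d = [900 400; 910 405; 905 402; 300 380];           % Nx2 projected pixel coordinates
rect  = [880 380 60 50];                                % labelled rectangle [x y w h] (px)

% Keep only the lidar points whose projection falls inside the labelled rectangle.
inRect = pts2d(:,1) >= rect(1) & pts2d(:,1) <= rect(1)+rect(3) & ...
         pts2d(:,2) >= rect(2) & pts2d(:,2) <= rect(2)+rect(4);
vehPts = pts3d(inRect, :);

% Range from the closest lidar impact on the vehicle (front-camera case).
range = min(sqrt(sum(vehPts(:,1:2).^2, 2)));

% Direction in the lidar frame, polar-coordinate style. The labelling tool derives
% it from the centre of the labelled area; here it is approximated from the mean
% of the selected lidar points for simplicity.
bearing = atan2(mean(vehPts(:,2)), mean(vehPts(:,1)));  % radians
fprintf('Range: %.1f m, bearing: %.1f deg\n', range, rad2deg(bearing));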


Figure 15. Example of vehicle labelling using UAH's labelling tool. a) Automatic vehicle masking using the CNN. b) Example of precise labelling of a long-distance vehicle using the zoom tool.

After the vehicle tracklets are created, post-processing of all the gathered data is carried out in order to extract a number of context-related variables that account for different scenarios in which drivers may make substantially different decisions. The post-processed variables are described below.

- Traffic density: the average traffic flow on a given road can create a feeling of safety or discomfort in drivers. This feeling has the potential to affect their decisions when manoeuvring.

- Number of lanes: number of lanes in the direction of travel. This variable is computed based on the detected lane markings or on a global map, if available.

- Ego-lane position: it is computed based on the visible lane markers. The trajectories are referenced to the vehicle (Velodyne reference system); using the detected lanes or a digital map (built by ourselves), they can be transformed into the road reference system, in which the usual variables can be computed, e.g. distance to the lane centre, distance to the left and right lines, lateral speed, etc. A sketch of these computations is given after this list.

- Lane width: this is another important parameter that can affect drivers' decisions on whether or not to initiate an overtaking or lane-change manoeuvre.

- Road Type: different types of road can be considered, such as highway, rural road, urban road, etc.

Driver decisions for lane change may differ depending on the road type. This variable can be

determined only if a global map is available.

- Lane changes: lane change manoeuvres will be automatically identified and tagged in the dataset.

- Cut-in and cut-out manoeuvres: these manoeuvres will also be identified and tagged in the dataset.
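To make the road-referenced variables above concrete, the MATLAB sketch below shows how the distance to the lane centre, the distances to the lane markings, and the lateral speed could be derived once the lane centre has been fitted in the vehicle (Velodyne) frame. The linear lane model and all numerical values are assumptions for illustration; they do not reproduce the actual post-processing code.

% Minimal sketch (assumed lane model): road-frame variables from a lane fit.
% The lane centre is modelled as y = c0 + c1*x in the vehicle frame (x forward,
% y to the left), so c0 is its lateral position at the ego-vehicle.
laneCoeff = [0.35, 0.01];   % [c0 c1], assumed fit from the lane markings
laneWidth = 3.5;            % assumed lane width (m)

% Lateral position of the lane centre with respect to the ego-vehicle (x = 0).
distToCentre = laneCoeff(1);                  % m, positive = centre lies to the left
distToLeft   = distToCentre + laneWidth/2;    % m, distance to the left marking
distToRight  = laneWidth/2 - distToCentre;    % m, distance to the right marking

% Lateral speed from two consecutive offsets logged at the camera rate (~11 Hz).
prevOffset   = 0.32;                          % assumed offset of the previous frame (m)
dt           = 1/11;                          % frame period (s)
lateralSpeed = (distToCentre - prevOffset) / dt;   % m/s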


5.2 VRU Dataset

There are several VRU datasets in the state of the art containing images of pedestrians and cyclists together

with their corresponding annotations. These annotations consist of bounding boxes containing the entire shape

of the VRU or keypoints describing the VRU body joints and parts, such as head, shoulders, knees, feet, etc.

On some occasions, both the bounding box and the keypoints are available, but that is not always the case.

Although these datasets are exhaustive and contain thousands of VRU examples, they only contain information for recognition purposes and fail to provide information for prediction purposes. Given the availability of VRU datasets and the lack of prediction information, it was decided to leverage the information available in the currently existing datasets and to enrich them by including information for prediction purposes. In this regard, a first step was to harmonize the formats among the different datasets, in order to create a kind of gigantic meta-dataset by harmonizing and linking the available datasets. The task is huge given the large number of available datasets and their size and complexity. Thus, after a first analysis, seven datasets were preselected on the basis of the quality and quantity of the data that they contain. The preselected datasets are the following:

- Datasets oriented to automated driving:

o Citypersons (https://bitbucket.org/shanshanzhang/citypersons). It is a large-scale dataset that

contains a diverse set of stereo video sequences recorded in street scenes from 50 different

cities, with high quality pixel-level annotations of 5 000 frames in addition to a larger set of

20 000 weakly annotated frames. The focus is on pedestrians and cyclists.

o Kitti (http://www.cvlibs.net/datasets/kitti/index.php). Accurate ground truth is provided by a

Velodyne laser scanner and a GPS localization system. Kitti datasets are captured by driving

around the mid-size city of Karlsruhe, in rural areas and on highways. Up to 15 cars and 30

pedestrians are visible per image. Besides providing all data in raw format, benchmarks are

extracted for each task. For each of the benchmarks, an evaluation metric is provided as well

as an evaluation website.

o Inria (http://pascal.inrialpes.fr/data/human/). This dataset was collected as part of research

work on detection of upright people in images and video in 2005. The dataset is divided into

two formats: (a) original pedestrian images with corresponding annotation files, and (b)

positive images in normalized 64x128 pixel format with original negative images.

o Tsinghua (www.gavrila.net/Datasets/Daimler_Pedestrian_Benchmark_D/Tsinghua-

Daimler_Cyclist_Detec/tsinghua-daimler_cyclist_detec.html). The Tsinghua-Daimler

Cyclist Benchmark provides a benchmark dataset for cyclist detection. Bounding Box based

labels are provided for the classes: ("pedestrian", "cyclist", "motorcyclist", "tricyclist",

"wheelchairuser", "mopedrider").

o Caltech (http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/). The Caltech

Pedestrian Dataset consists of approximately 10 hours of 640x480 30Hz video taken from a

vehicle driving through regular traffic in an urban environment. About 250,000 frames (in 137 approximately minute-long segments) with a total of 350,000 bounding boxes and 2,300 unique

pedestrians were annotated. The annotation includes temporal correspondence between

bounding boxes and detailed occlusion labels.

- General purpose datasets: these datasets are much more oriented to pose and activity recognition.

o COCO (http://cocodataset.org/#home). COCO is a large-scale object detection, segmentation,

and captioning dataset. COCO has several features: object segmentation; recognition in

context; superpixel stuff segmentation; 330K images (>200K labelled); 1.5 million object

instances; 80 object categories; 91 stuff categories; 5 captions per image; 250,000 people with

keypoints.

o MPII (http://human-pose.mpi-inf.mpg.de/). The MPII Human Pose dataset is a state-of-the-art

benchmark for evaluation of articulated human pose estimation. The dataset includes around

25K images containing over 40K people with annotated body joints. The images were


systematically collected using an established taxonomy of everyday human activities. Overall, the dataset covers 410 human activities and each image is provided with an activity label. Each

image was extracted from a YouTube video and provided with preceding and following un-

annotated frames. In addition, for the test set richer annotations were obtained including body

part occlusions and 3D torso and head orientations.

Figure 16 shows an example of a labelled image containing pedestrians from the Tsinghua dataset. In the figure, both the pedestrian bounding box and its corresponding keypoints are depicted.

Figure 16. Example of a VRU in a labelled image (Tsinghua dataset).

Among the previous datasets, Citypersons is the most complete one, given the variety of the data in terms of

traffic situations and types of environments. A major difficulty when harmonizing data is the variability of

formats among the datasets. On some occasions, the authors of the different datasets have made available script

files that have eased the parsing and normalization of data to a common format. Unfortunately, this is not the

general case. Thus, scripts for data extraction and format normalization were created in most cases. The code

developed for this task has been written in MATLAB. Another important task is to extract the persons and

cyclists contained in the images using automatic tools. For that purpose, Deep Learning techniques (CNN-

based) have been used. Three main CNN-based systems have been selected, given the maturity and robustness

of their performance: OpenPose, Detectron (Mask R-CNN), and AlphaPose. All three networks exhibit excellent performance in VRU detection tasks. The workflow for data extraction and normalization is described next:

1. VRU keypoints are extracted from the input images using Detectron, OpenPose or Alphapose.

2. The output files, usually in JSON format, are converted to MAT format in order to use them in

the MATLAB environment.

3. The bounding boxes containing the entire shape of the selected VRUs are generated based on the

previously extracted keypoints. The coordinates of the upper left and bottom right corners are stored

for that purpose. Two different approaches have been followed in this task:

a. Generation of bounding boxes using a given aspect ratio. This approach requires an exhaustive

study of human proportions in different poses.

b. Use of the minimum and maximum keypoint coordinates in each dimension. This is a straightforward approach that relies entirely on the detected keypoints.

4. The extracted bounding boxes are converted and stored in JSON format.
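A minimal MATLAB sketch of step 3b (bounding box from the keypoint extrema) and step 4 (storage in JSON) is provided below. The flat [x, y, confidence] per-joint layout follows the keypoint description used in this work; the confidence threshold, pixel margin, and file name are assumptions for illustration only.

% Minimal sketch (assumed values): bounding box from detected keypoints using the
% min/max approach (step 3b), then export to JSON (step 4).
kp = [812 210 0.9, 815 230 0.8, 790 232 0.7, 786 270 0.6, ...
      784 300 0.5, 840 233 0.8, 843 271 0.7, 846 302 0.6];   % [x y conf] per joint
xy   = reshape(kp, 3, []).';          % one row per keypoint: [x y conf]
keep = xy(:,3) > 0.1;                 % keep only confidently detected joints (assumed threshold)
xy   = xy(keep, 1:2);

margin = 5;                           % assumed pixel margin around the joints
bbox.x_min = min(xy(:,1)) - margin;   % upper-left corner
bbox.y_min = min(xy(:,2)) - margin;
bbox.x_max = max(xy(:,1)) + margin;   % bottom-right corner
bbox.y_max = max(xy(:,2)) + margin;

% Store the bounding box in JSON format (step 4).
fid = fopen('bbox_example.json', 'w');
fprintf(fid, '%s', jsonencode(bbox));
fclose(fid);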

Figure 17 provides a close-up example of how the bounding box is extracted from the keypoints. As observed in the figure, the ground-truth bounding box (in red) only loosely contains the target pedestrian, leaving a


significant amount of spare room around it. However, the finally selected bounding box (in blue) provides a much better fit to the detected keypoints. This operation is needed for those extraction algorithms that provide keypoints but no bounding boxes (as is the case of the OpenPose detector).

Figure 17. Example of conversion from key-points to bounding box.

The coding used for the different body parts and joint links is provided below:

Keypoints order

1. Nose

2. Neck

3. RShoulder

4. RElbow

5. RWrist

6. LShoulder

7. LElbow

8. LWrist

9. RHip

10. RKnee

11. RAnkle

12. LHip

13. LKnee

14. LAnkle

15. REye


16. LEye

17. REar

18. LEar

Joints links

1. [0, 1] - 'neck-line'

2. [1, 2] - 'right-shoulder-line'

3. [1, 5] - 'left-shoulder-line'

4. [2, 3] - 'right-forearm-line'

5. [3, 4] - 'right-arm-line'

6. [5, 6] - 'left-forearm-line'

7. [6, 7] - 'left-arm-line'

8. [2, 8] - 'right-torso-line'

9. [5, 11] - 'left-torso-line'

10. [8, 11] - 'hips-line'

11. [8, 9] - 'right-thigh-line'

12. [11, 12] - 'left-thigh-line'

13. [9, 10] - 'right-leg-line'

14. [12, 13] - 'left-leg-line'

15. [16, 17] - 'ears-line'

16. [14, 16] - 'right-ear-eye-line'

17. [15, 17] - 'left-ear-eye-line'

18. [0, 14] - 'right-nose-line'

19. [0, 15] - 'left-nose-line'
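Note that the joint links above use 0-based indices into the 1-based keypoint list (e.g. [0, 1] connects Nose and Neck). For use in MATLAB, whose indexing starts at 1, the connectivity can be encoded as a plain matrix and shifted by one, as in the illustrative sketch below; this helper is not part of the released tools.

% Minimal sketch: skeleton connectivity from the joint-link list above.
% The published links are 0-based; add 1 to index the 1..18 keypoint list.
links = [0 1; 1 2; 1 5; 2 3; 3 4; 5 6; 6 7; 2 8; 5 11; 8 11; ...
         8 9; 11 12; 9 10; 12 13; 16 17; 14 16; 15 17; 0 14; 0 15] + 1;

names = {'Nose','Neck','RShoulder','RElbow','RWrist','LShoulder','LElbow', ...
         'LWrist','RHip','RKnee','RAnkle','LHip','LKnee','LAnkle', ...
         'REye','LEye','REar','LEar'};

% Example: print the two keypoints connected by each link.
for k = 1:size(links, 1)
    fprintf('%-10s -- %s\n', names{links(k,1)}, names{links(k,2)});
end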

A labelling tool has been used to assist the manual validation task. Figure 18 shows a snapshot of the labelling interface during operation. This tool allows the operator to categorize the different types of false positives. The benefit of this approach is two-fold. On the one hand, it largely reduces the effort needed for dataset creation (labelling from scratch is thus avoided). On the other hand, it enables the training of a validation network for rejecting false positive detections. After using the semi-automatic labelling tool, it is possible to harmonize the format of the data contained in the seven different datasets, as described. The harmonization task has been guided by a number of rules, described next, based on the lessons learnt from a thorough analysis and study of the seven datasets used in this work. These rules describe the desirable features of a VRU dataset for prediction purposes.

1. It is recommendable to use a data format similar to that of the Citypersons dataset, especially in terms

of labelling metrics.

2. Tracking information must be included. This leads to the inclusion of complete sequences (tracklets)

at a high frame rate.

3. Images must be high-resolution so that they can later be re-scaled to the appropriate size. Some of the existing datasets, such as Caltech, are of limited use for modern CNN-based systems given the low resolution of their images.

4. Exhaustive documentation must be provided regarding the structure of the data formats and the dataset.

5. Additional code and tools must be provided for data visualization and extraction. This is especially

useful to ease and foster further research work.

6. Activity recognition tags must be included in the labelled data in order to allow for prediction of

intentions and trajectories.


7. A large variety of data formats is a plus when it comes to exchanging data for research. The following formats are recommended: JSON, HDF5, TXT, CSV. Similarly, the code provided must be usable in the most widely-used programming languages, such as Python, MATLAB, C++, and Java.

Figure 18. Labelling a pedestrian with the labelling tool.

5.2.1 Dataset structure

All the information has been stored at www.github.com/invett/predset. The information stored in the dataset is divided into different sections or folders, containing the data described below.

- Data files organization: it contains the different files used in the extraction and data-format normalization process, as well as a detailed explanation of their content. An example of such a file is provided next (a minimal loading sketch is given at the end of this subsection):

kp_det.mat:

o kpoints{1xN_images}.imgname

o kpoints{1xN_images}.identity

o kpoints{1xN_images}.pedestrians (equivalent to the "people" field in .json files)

o kpoints{1xN_images}.pedestrians{ID_ped}.pose_keypoints{1x54}. It corresponds to the following fields:

[ x_joint_det (px), y_joint_det (px), conf_det ] x 18 joints

- Directories Organization: the structure of directories and files in the seven datasets is described in this section. An example for the KITTI dataset is provided below:

KITTI

training


o calib - camera calibration matrices of object data set (16 MB)

o image_2 - left color images of object data set (12 GB)

o image_3 - right color images, if you want to use stereo information (12 GB)

o label_2 - training labels of object data set (5 MB)

o prev_2 - 3 temporally preceding frames (left color) (36 GB)

testing

o calib - camera calibration matrices of object data set (16 MB)

o image_2 - left color images of object data set (12 GB)

o image_3 - right color images, if you want to use stereo information (12 GB)

o prev_2 - 3 temporally preceding frames (left color) (36 GB)

- Datasets Organization: it provides links to the seven original datasets.

- Evaluation method metrics: the detailed metrics used for evaluation are described in this section.

An example is provided for the COCO dataset.

COCO

Average Precision (AP):

o AP % AP at IoU=.50:.05:.95 (primary challenge metric). AP (averaged across all 10 IoU thresholds and all 80 categories) will determine the challenge winner. This should be considered the single most important metric when considering performance on COCO.

o APIoU=.50 % AP at IoU=.50 (PASCAL VOC metric)

o APIoU=.75 % AP at IoU=.75 (strict metric)

AP Across Scales:

o APsmall % AP for small objects: area < 32^2

o APmedium % AP for medium objects: 32^2 < area < 96^2

o APlarge % AP for large objects: area > 96^2

Average Recall (AR):

o ARmax=1 % AR given 1 detection per image

o ARmax=10 % AR given 10 detections per image

o ARmax=100 % AR given 100 detections per image

AR Across Scales:

o ARsmall % AR for small objects: area < 32^2

o ARmedium % AR for medium objects: 32^2 < area < 96^2

o ARlarge % AR for large objects: area > 96^2

- Tests tracking: this section is used to keep track of all the tests carried out, including the algorithm

and parameters used during the tests.
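As referenced under the "Data files organization" item, a minimal MATLAB sketch of how the kp_det.mat structure could be traversed is given below. It only assumes the fields listed in that example; the image and pedestrian indices are arbitrary, and the handling of the 1x54 keypoint vector is an illustrative guess at the storage type.

% Minimal sketch (assumed usage): read the keypoints of one pedestrian from the
% kp_det.mat structure described under "Data files organization".
load('kp_det.mat', 'kpoints');          % cell array with one entry per image

img = 1;                                % example image index (arbitrary)
ped = 1;                                % example pedestrian index (arbitrary)
entry = kpoints{img};
fprintf('Image: %s\n', entry.imgname);

% pose_keypoints is a flat 1x54 vector: [x y conf] for each of the 18 joints.
kp = entry.pedestrians{ped}.pose_keypoints;
if iscell(kp), kp = cell2mat(kp); end   % tolerate cell or numeric storage
kp = reshape(kp, 3, 18).';              % 18x3 matrix: [x_px  y_px  confidence]
disp(kp(1:3, :));                       % e.g. the Nose, Neck and RShoulder rows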


6 Final remarks

This document has presented the methodology followed to create a dataset containing Vehicles and VRU data. This dataset is intended to provide a new tool for performing accurate predictions of vehicle and VRU trajectories in an attempt to enhance the safety of current ADAS and Automated Driving systems, a main goal of the BRAVE project. In addition, a description of the dataset structure has been provided in this deliverable. The dataset is divided into two main parts: Vehicles (cars and trucks) and VRUs (pedestrians and cyclists). Vehicles' data have been logged using 6 different sensors: 1 long-range radar, 2 short-range radars, 2 cameras (1 forward-looking and 1 rear-looking) and a Velodyne-32. The time synchronization of the six sensors has been described in the document, as has their cross-calibration. Regarding VRUs, camera data is the sole source of information, given the high density of information contained in the high-resolution cameras used in this project. All the data have been gathered using the DRIVERTIVE vehicle, owned by UAH. The following remarks emphasize the main characteristics of the work done so far, as well as the future improvements and developments that are expected to be carried out in the remainder of the project.

- This dataset is unique in the state of the art in that it is oriented to providing data for prediction purposes rather than for recognition purposes, as is the case with most existing datasets. It contains information about all vehicles within a range of 80 metres all around the ego-vehicle. On top of that, the dataset contains a large number of post-processed variables that offer contextual information of great value for prediction purposes.

- The content of the dataset will continue to be expanded on a regular basis and is therefore expected to grow significantly by the end of the BRAVE project. An updated version of this deliverable is expected to be issued before the end of the project in order to reflect the changes and additions to the dataset.

- Some modifications and technical improvements are still possible. In particular, time synchronization among the three radars and radar-camera calibration are currently under consideration. Future versions of the data-gathering software might include these two features.